Improved Learning of AC^0 Functions

Merrick L. Furst, Jeffrey C. Jackson, Sean W. Smith
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected], [email protected], [email protected]

Abstract

Two extensions of the Linial, Mansour, Nisan AC^0 learning algorithm are presented. The LMN method works when input examples are drawn uniformly. The new algorithms improve on theirs by performing well when given inputs drawn from unknown, mutually independent distributions. A variant of one of the algorithms is conjectured to work in an even broader setting.

1 INTRODUCTION

Linial, Mansour, and Nisan [LMN89] introduced the use of the Fourier transform to accomplish Boolean function learning. They showed that AC^0 functions are well-characterized by their low frequency Fourier spectra and gave an algorithm which approximates such functions reasonably well from uniformly chosen examples. While the class AC^0 is provably weak in that it does not contain modular counting functions, from a learning theory point of view it is fairly rich. For example, AC^0 contains polynomial-size DNF and addition. Thus the LMN learning procedure is potentially powerful, but the restriction that their algorithm be given examples drawn according to a uniform distribution is particularly limiting.

A further limitation of the LMN algorithm is its running time. Valiant's [Val84] learning requirements are widely accepted as a baseline characterization of feasible learning. They include that a learning algorithm should be distribution independent and run in polynomial time. The LMN algorithm runs in quasi-polynomial ($O(2^{\mathrm{poly}\log n})$) time.

In this paper we develop two extensions of the LMN learning algorithm which produce good approximating functions when samples are drawn according to unknown distributions which assign values to the input variables independently. Call such a distribution mutually independent, since it is the joint probability distribution corresponding to a set of mutually independent random variables [Fel57]. The running times of the new algorithms are dependent on the distribution (the farther the distribution is from uniform, the higher the bound) and, as is the case for the LMN algorithm, the time bounds are quasi-polynomial in $n$.

A variant of one of our algorithms gives a more general learning method which we conjecture produces reasonable approximations of AC^0 functions for a broader class of input distributions. A brief outline of this general AC^0 learning algorithm is presented along with a conjecture about a class of distributions for which it might perform well.

Our two learning methods differ in several ways. The direct algorithm is very similar to the LMN algorithm and depends on a substantial generalization of their theory. This algorithm is straightforward, but our bound on its running time is very sensitive to the probability distribution on the inputs. The indirect algorithm, discovered independently by Umesh Vazirani [Vaz], is more complicated but also relatively simple to analyze. Its time bound is only mildly affected by changes in distribution, but for distributions not too far from uniform it is likely greater than the direct bound. We suggest a possible hybrid of the two methods which may have a better running time than either method alone for certain distributions.

The outline of the paper is as follows. We begin with definitions and proceed to discuss the key idea behind our direct extension: we use an appropriate change of basis for the space of $n$-bit functions. Next, we prove that under this change AC^0 functions continue to exhibit the low-order spectral property which the LMN result capitalizes on. After giving an overview of the direct algorithm and analyzing its running time, we discuss the indirect approach and compare it with the first. Finally, we indicate some directions for further research.

2 DEFINITIONS AND NOTATION

All sets are subsets of $\{1,\ldots,n\}$, where as usual $n$ represents the number of variables in the function to be learned. Capital letters denote set variables unless otherwise noted. The complement of a set $X$ is indicated by $\bar X$, although we abuse the concept of complementation somewhat: in cases where $X$ is specified to be a subset of some other set $S$, $\bar X = S - X$; otherwise $\bar X = \{1,\ldots,n\} - X$. Strings of 0/1 variables are referred to by barred lower case letters (e.g. $\bar x$) which may be superscripted to indicate one of a sequence of strings (e.g. $\bar x^j$); $\bar x_i$ refers to the $i$th variable in a string $\bar x$. Barred constants (e.g. $\bar 0$, $\bar *$) indicate strings of the given value with length implied by context.

Unless otherwise specified, all functions are assumed to have as domain the set of strings $\{0,1\}^n$. The range of Boolean functions will sometimes be $\{0,1\}$, particularly when we are dealing with circuit models, but will usually be $\{-1,1\}$ for reasons that should become clear subsequently. Frequently we will write sets as arguments where strings would be expected, e.g. $f(X)$ rather than $f(\bar x)$ for a function $f$ on $\{0,1\}^n$. In such cases $f(X)$ is a shorthand for $f(c(X))$, where $c(X)$ is a characteristic function defined by $c(X)_i = 0$ if $i \in X$ and 1 otherwise. Note that the sense of this function is opposite the natural one in which 0 represents set absence.

An AC^0 function is a Boolean function which can be computed by a family of acyclic circuits (one circuit for each number $n$ of inputs) consisting of AND and OR gates plus negations only on inputs and satisfying two properties:

- The number of gates in each circuit (its size) is bounded by a fixed polynomial in $n$.
- The maximum number of gates between an input and the output (the circuit depth) is a fixed constant.

A random restriction $\rho_{p,q}$ is a function which, given input $\bar x$, maps each $x_i$ to $*$ with fixed probability $p$ and assigns 0's and 1's to the other variables according to a probability distribution $q$. If $\bar x$ represents the input to a function $f$ then $\rho_{p,q}$ induces another function $f\lceil\rho$ which has variables corresponding to the stars and has the other variables of $f$ fixed to 0 or 1. This is a generalization of the original definition [FSS81], in which $q$ is the uniform distribution. The subscripts of $\rho$ are generally dropped and their values understood from context.
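The restriction operation just defined is easy to simulate. The sketch below (the function and helper names are ours, not the paper's) applies a fixed restriction to a small Boolean function and recovers the induced function on the starred variables:

```python
import random
random.seed(3)

def random_restriction(n, p, mu):
    """rho_{p,q}: each variable -> '*' with probability p, else a 0/1 value
    drawn from the product distribution q given by mu[i] = Pr[x_i = 1]."""
    return ['*' if random.random() < p else (1 if random.random() < mu[i] else 0)
            for i in range(n)]

def restrict(f, rho):
    """f restricted by rho: a function of the starred variables only."""
    stars = [i for i, v in enumerate(rho) if v == '*']
    def g(star_vals):
        x = list(rho)
        for i, v in zip(stars, star_vals):
            x[i] = v
        return f(x)
    return g, stars

f = lambda x: int((x[0] and x[1]) or x[2])
rho = ['*', 1, 0]          # a fixed restriction, for a deterministic demo
g, stars = restrict(f, rho)
print(stars)               # [0]
print(g([1]), g([0]))      # 1 0
```

Drawing `rho` from `random_restriction` instead of fixing it gives the random restriction of the definition.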

The function obtained by setting a certain subset $S$ of the variables of $f$ to the values indicated by the characteristic function of a subset $X \subseteq S$ is denoted by $f\lceil(S,X)$ or, when $S$ is implied by context, simply $f\lceil X$. For example, if $S = \{1,3\}$ and $X = \{3\}$ then $f\lceil X$ is the function $f$ with variable $x_1$ set to 1 and $x_3$ set to 0.

We will use several parameters of the probability distribution $q$ throughout the sequel. We define $\mu_i = \Pr[x_i = 1]$, where the probability is with respect to $q$. Another parameter which we will use frequently is
$$\beta = \max_i \max\left(1/\mu_i,\; 1/(1-\mu_i)\right).$$
We assume that this value is finite, since infinite $\beta$ implies some variable is actually a constant and can be ignored by the learning procedure.

It is convenient to define a set-based notation for probabilities also. For example, if $X = \{2\}$ and it has been specified that $X \subseteq \{1,2\}$ then we will write $q(X)$ for $q(x_1 = 1 \wedge x_2 = 0)$. In general, if $X$ is specified to be a subset of some set $S$ then $q(X)$ represents the marginal probability that the variables indicated by $S$ take on the values specified by $X$, and if $S$ is not specified then $q(X)$ is just $q(c(X))$. Thus for mutually independent $q$,
$$q(X) = \prod_{i \in X}(1-\mu_i)\,\prod_{i \in \bar X}\mu_i.$$
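As a concrete illustration of these parameters (the distribution below is a made-up example of ours; indices are 0-based in the code while the paper counts from 1):

```python
# Hypothetical mutually independent distribution on n = 4 variables,
# given by mu[i] = Pr[x_i = 1].
mu = [0.5, 0.25, 0.9, 0.5]

# beta = max_i max(1/mu_i, 1/(1 - mu_i)); it measures distance from uniform.
beta = max(max(1 / m, 1 / (1 - m)) for m in mu)

def q_of_set(X, mu):
    """Marginal probability q(X): variables in X are set to 0 (per c(X)),
    all remaining variables are set to 1."""
    p = 1.0
    for i, m in enumerate(mu):
        p *= (1 - m) if i in X else m
    return p

print(beta)               # 10.0  (dominated by mu = 0.9)
print(q_of_set({1}, mu))  # 0.16875 = 0.5 * (1 - 0.25) * 0.9 * 0.5
```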

3 AN ORTHONORMAL BASIS FOR BOOLEAN FUNCTIONS SAMPLED UNDER MUTUALLY INDEPENDENT DISTRIBUTIONS

3.1 RATIONALE

Given some element $v$ of a vector space and a basis for this space, a discrete Fourier transform expresses $v$ as the coefficients of the linear combination of basis vectors representing $v$. We are interested in learning Boolean functions on $n$ bits, which can be represented as Boolean vectors of length $2^n$. Linial et al. used as the basis for this space the characters of the group $Z_2^n$ in $R$. The characters are given by the $2^n$ functions
$$\chi_A(X) = (-1)^{|A \cap X|}.$$
Each $\chi_A$ is simply a polynomial of degree $|A|$, and $\{\chi_A \mid |A| \le k\}$ spans the space of polynomials of degree not more than $k$. With an inner product defined by
$$\langle f,g\rangle = 2^{-n}\sum_{\bar x \in \{0,1\}^n} f(\bar x)g(\bar x)$$
and norm $\|f\| = \sqrt{\langle f,f\rangle}$, the characters form an orthonormal basis for the space of real-valued functions on $n$ bits. The Fourier coefficients of a function $f$ with respect to this basis are simply $\hat f(A) = \langle f,\chi_A\rangle$, the projections of $f$ on the basis vectors. Given a sufficient number $m$ of uniformly selected examples $\langle\bar x^j, f(\bar x^j)\rangle$, $\sum_{j=1}^m f(\bar x^j)\chi_A(\bar x^j)/m$ is a good estimate of $\hat f(A)$ [LMN89].

What happens if the examples are chosen according to some nonuniform distribution $q$? If we blindly applied the LMN learning algorithm we would actually be calculating an approximation to $\sum_{\bar x \in \{0,1\}^n} f(\bar x)\chi_A(\bar x)q(\bar x)$ for each $A$. That is, we would be calculating the expected value of the product $f\chi_A$ with respect to $q$ rather than with respect to the uniform distribution. This leads to the following observation: if we modify the definition of the inner product to be with respect to the distribution $q$ on the inputs and modify the basis to be orthonormal under this new inner product, then the estimated coefficients should on average be close to the appropriate values in this new basis.

More precisely, define the inner product to be
$$\langle f,g\rangle_q = \sum_{\bar x \in \{0,1\}^n} f(\bar x)g(\bar x)q(\bar x)$$
and define the norm to be¹ $\|f\|_q = \sqrt{\langle f,f\rangle_q}$.

¹Typically, a subscript indicates that the norm is to be considered one of the Hölder p-norms. We are using the subscript to mean something different here.

Let the basis vectors $\phi_A$ be defined by orthonormalizing the $\chi_A$ with respect to this inner product (such a basis will be referred to as orthonormal in the $q$-norm or orthonormal with respect to $q$). Then we expect that if a large enough number of samples $m$ are drawn according to $q$,
$$\tilde f(A) = \sum_{j=1}^m f(\bar x^j)\phi_A(\bar x^j)/m$$
is a good approximation to $\hat f(A)$, the projection of $f$ onto $\phi_A$.

The main result of Linial et al. [LMN89] is that, for any AC^0 function $f$, the Fourier coefficients of $f$ with respect to the $\chi_A$'s become small for large $|A|$. Thus the LMN learning algorithm consists simply of estimating the "low-order" coefficients. We show that essentially the same property of AC^0 functions also holds for the coefficients of a particular basis orthonormal with respect to a mutually independent $q$. Thus, as with LMN learning, we can obtain a good approximation to an AC^0 function $f$ by estimating low-order coefficients of $f$ relative to the transformed basis. Our learning procedure differs in that the estimated coefficients are with respect to a basis which must also be estimated.

It will be convenient to have a name for a basis which is orthonormal with respect to a mutually independent distribution as opposed to an arbitrary distribution. We will refer to such a basis as a $\psi$ basis and reserve $\phi$ for bases orthonormal with respect to an arbitrary $q$. From now on a Fourier coefficient $\hat f(A)$ will be assumed to be the coefficient of the basis vector $\psi_A$ unless otherwise noted.
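For small $n$ the estimation step can be checked directly against the exact coefficients. The sketch below (our own toy target function and sample size) uses the standard form $\chi_A(\bar x) = (-1)^{\sum_{i\in A} x_i}$, which differs from the set-argument convention above only in sign bookkeeping:

```python
import itertools, random
random.seed(0)

n = 4

def f(x):  # example target: x0 AND (x1 OR x2), with +/-1 output
    return 1 if x[0] and (x[1] or x[2]) else -1

def chi(A, x):  # character chi_A in its standard form
    return (-1) ** sum(x[i] for i in A)

A = (0, 1)
# Exact coefficient: average of f * chi_A over all 2^n inputs.
exact = sum(f(x) * chi(A, x)
            for x in itertools.product((0, 1), repeat=n)) / 2 ** n
# Estimate from m uniform examples, as in LMN.
m = 20000
est = sum(f(tuple(x)) * chi(A, x)
          for x in ([random.randint(0, 1) for _ in range(n)]
                    for _ in range(m))) / m
print(abs(est - exact))  # small: the estimate concentrates around exact
```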

3.2 PROPERTIES OF THE $\psi$ BASIS

Let $\sigma_i$ be the standard deviation of the $i$th variable $x_i$ when samples are selected according to $q$ and note that $\mu_i$ as previously defined represents the mean. Let $z_i = (x_i - \mu_i)/\sigma_i$; that is, $z_i$ is the normalized variable corresponding to $x_i$. Then, due to the mutual independence of $q$, one possible $\psi$ basis is given by Bahadur [Bah61]:
$$\psi_A = \prod_{i \in A} z_i.$$
This basis will be referred to as the $\psi$ basis; it is the basis which would be obtained by a Gram-Schmidt orthonormalization (with respect to the $q$-norm) of the $\chi$ basis performed in order of increasing $|A|$.

The $\psi$ basis has a number of properties which make our generalization of the LMN result possible. First, due to the nature of Gram-Schmidt orthonormalization, $\{\psi_A \mid |A| \le k\}$ spans the same space as $\{\chi_A \mid |A| \le k\}$, so linear combinations of such $\psi$'s are simply polynomials of degree not more than $k$. Also, it follows immediately from the above representation of $\psi_A$ that for all $A$, $S$, $X \subseteq S$, and $Y \subseteq \bar S$,
$$\psi_A(X \cup Y) = \psi_{A\cap S}(X)\,\psi_{A\cap\bar S}(Y).$$
Likewise it follows that for any $S$ and $A, B \subseteq S$,
$$\sum_{X\subseteq S}\psi_A(X)\psi_B(X)\,q(X) = \begin{cases}1 & \text{if } A = B\\ 0 & \text{otherwise.}\end{cases}$$
Another useful property is that for all $A$ and $B$,
$$\psi_A(B)\sqrt{q(B)} = \psi_B(A)\sqrt{q(A)}.$$
This follows from the above representation of $\psi_A$ after noting that $\sigma_i = \sqrt{\mu_i(1-\mu_i)}$. Finally, Parseval's identity gives
$$\|f\|_q^2 = \sum_A \hat f^2(A).$$
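The Bahadur construction and the orthonormality claims are easy to verify numerically for small $n$ (a sketch with a made-up product distribution; names are ours):

```python
import itertools, math

# Hypothetical product distribution on n = 3 bits.
mu = [0.5, 0.25, 0.8]
sigma = [math.sqrt(m * (1 - m)) for m in mu]

def q(x):  # mutually independent distribution
    return math.prod(m if b else 1 - m for b, m in zip(x, mu))

def psi(A, x):  # psi_A = product of normalized variables z_i, i in A
    return math.prod((x[i] - mu[i]) / sigma[i] for i in A)

def inner(A, B):  # <psi_A, psi_B>_q, summed over all 2^n inputs
    return sum(psi(A, x) * psi(B, x) * q(x)
               for x in itertools.product((0, 1), repeat=3))

print(round(inner((0, 2), (0, 2)), 8))       # 1.0  (unit norm)
print(abs(round(inner((0, 2), (1,)), 8)))    # 0.0  (orthogonal)
```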

4 THE DROPOFF LEMMA

As noted above, Linial et al. have shown that the sum of squares of coefficients of high-degree terms (the high-order power spectrum) of AC^0 functions becomes exponentially small as order increases when the coefficients are relative to the $\chi$ basis. In this section we show that this also holds for coefficients relative to the $\psi$ basis. We do this by generalizing the series of lemmas used in their proof.

Essentially, we prove that the following facts hold for Fourier coefficients relative to the $\psi$ basis:

1. Random restrictions of AC^0 functions have small minterms and maxterms with high probability as long as the distribution function $q$ is mutually independent.
2. All the high-order Fourier coefficients of a function with small minterms and maxterms are zero.
3. The coefficients of an AC^0 function are closely related to the coefficients of its restrictions.
4. Probabilistic arguments can be used to tie the above facts together and show that the high-order coefficients of an AC^0 function must be small.

We present the proof of the Dropoff Lemma in this order.

4.1 RANDOM RESTRICTIONS

The linchpin of the Linial et al. result is Hastad's Switching Lemma [Has86]. This lemma states that when restriction $\rho_{p,q}$ of a suitable CNF function has uniform $q$ then with high probability the minterms of the restricted function are small; a similar statement about a DNF function and maxterms follows immediately. Thus such a randomly restricted CNF formula can with high probability be rewritten as a DNF formula having a small number of variables in every term. We generalize this to a lemma which holds for any mutually independent $q$. For uniform $q$, $\beta = 2$ and Lemma 1 reduces to a restatement of Hastad's Lemma.

Lemma 1 Let $f$ be a CNF formula with at most $t$ variables in any clause and let random restriction $\rho_{p,q}$ have mutually independent $q$ with parameters $\mu_i$ and $\beta$ as defined previously. Then
$$\Pr[f\lceil\rho \text{ has a minterm of size} > s] \le (\beta pt/\ln\phi_g)^s$$
where $\phi_g = (1+\sqrt5)/2$, the golden ratio.

Proof Sketch: Our proof involves a lengthy reworking of the Boppana and Sipser proof of Hastad's lemma [BS90]. We give here only the details of our extension to a key inequality in their proof; the remainder of our proof is straightforward.

Let $C$ be an OR of variables, none of which are negated, and let $Y$ be a subset of these variables. Let $q$ be a mutually independent distribution and let $\rho_{p,q}$ be a random restriction defined on the variables in $C$. Then we show that
$$\Pr\big[\rho(Y) = \bar *\;\big|\; C\lceil\rho \not\equiv 1\big] \le (\beta p)^{|Y|}.$$

Indeed, because $q$ is mutually independent,
$$\Pr\big[\rho(Y) = \bar *\;\big|\; C\lceil\rho \not\equiv 1\big] = \prod_{y_i \in Y}\frac{\Pr[\rho(y_i) = *]}{\Pr[\rho(y_i) \ne 1]}.$$
From the definition of $\beta$ it follows that for all $i$, $1 - \mu_i \ge 1/\beta$, so for every variable $y_i$ in $C$,
$$\Pr[\rho(y_i) \ne 1] = p + (1-p)(1-\mu_i) \ge 1 - \mu_i \ge 1/\beta.$$
Thus
$$\frac{\Pr[\rho(y_i) = *]}{\Pr[\rho(y_i) \ne 1]} = \frac{p}{p + (1-p)(1-\mu_i)} \le \frac{p}{1-\mu_i} \le \beta p,$$
and multiplying this bound over the $|Y|$ variables of $Y$ gives the inequality. $\Box$

Because it is also true that for all $i$, $\mu_i \ge 1/\beta$, an analog to Lemma 1 can easily be proved for the case where $C$ in the above proof is an AND. This gives:

Lemma 2 Let $f$ be a DNF formula with at most $t$ variables in any term and let random restriction $\rho_{p,q}$ have mutually independent $q$. Then
$$\Pr[f\lceil\rho \text{ has a maxterm of size} > s] \le (\beta pt/\ln\phi_g)^s.$$

Finally we can state the principal result of this section, which is obtained by successively applying Lemmas 1 and 2 to the lowest levels of a circuit (cf. [BS90, Theorem 3.6]):

Lemma 3 Let $f$ be a Boolean function computed by a circuit of size $M$ and depth $d$, and let $\rho_{p,q}$ have mutually independent probability distribution $q$ with parameter $\beta$ as defined previously. Then
$$\Pr[f\lceil\rho \text{ has a minterm or maxterm of size} > s] \le M2^{-s}$$
where $\Pr[*] = p = (\ln\phi_g/(2\beta s))^d$.

4.2 SMALL MINTERMS AND MAXTERMS MEAN A VANISHING HIGH-ORDER SPECTRUM

Here we begin to relate the above results to the Fourier spectrum of AC^0 functions. We show that if a function has only small minterms and maxterms then its high-order Fourier coefficients (even with respect to certain $\phi$ bases) vanish.

Lemma 4 Let $q$ be any (not necessarily mutually independent) probability distribution, let $\phi_A$ be a basis orthonormal in the $q$-norm such that $\{\phi_A \mid |A| \le k\}$ spans the same space as $\{\chi_A \mid |A| \le k\}$, and let $\hat f(A)$ be the Fourier coefficient of $\phi_A$. Then if all of the minterms and maxterms of a Boolean function $f$ have size at most $\sqrt k$, $\hat f(A) = 0$ for all $|A| > k$.

Proof: It is not hard to see that if the condition is met then $f$ can be computed by a decision tree of depth no more than $k$ [BI87]. Linial et al. [LMN89] further show that the Fourier coefficients of the $\chi$ basis for such an $f$ satisfy the lemma. This means that $f$ is in the space spanned by $\{\chi_A \mid |A| \le k\}$. Since by definition the low-order $\phi_A$ span the same space, our lemma follows immediately. $\Box$

Finally, note that the $\psi$ basis meets the criteria of the lemma due to the nature of the Gram-Schmidt orthonormalization process which defines it.

4.3 RELATING COEFFICIENTS OF A FUNCTION AND ITS RESTRICTIONS

Putting together the results thus far, we know that with high probability the random restrictions of an AC^0 function have zero-valued high-order Fourier coefficients. Now we show a key relationship between the coefficients of arbitrary functions and their restrictions when the coefficients are relative to a $\psi$ basis. We begin with a lemma which allows a rewriting of the definition of a Fourier coefficient and follow with the coefficient-relating result.

Lemma 5 Let $f$ be any real-valued $n$-bit function, $S$ any set, and $\psi_A$ orthonormal with respect to mutually independent $q$. Then for any subset $A$,
$$\hat f(A) = \sum_{X\subseteq S}\widehat{f\lceil X}(A\cap\bar S)\,\psi_{A\cap S}(X)\,q(X).$$

Proof:
$$\hat f(A) = \sum_Z f(Z)\psi_A(Z)q(Z) = \sum_{X\subseteq S}\sum_{Y\subseteq\bar S} f\lceil X(Y)\,\psi_{A\cap S}(X)\,\psi_{A\cap\bar S}(Y)\,q(X)q(Y)$$
$$= \sum_{X\subseteq S}\Big[\sum_{Y\subseteq\bar S} f\lceil X(Y)\,\psi_{A\cap\bar S}(Y)\,q(Y)\Big]\,\psi_{A\cap S}(X)\,q(X). \qquad\Box$$

Lemma 6 Let $f$, $S$, and $\psi_A$ be as above. Then for any $B \subseteq \bar S$,
$$\sum_{C\subseteq S}\hat f^2(B\cup C) = \sum_{X\subseteq S} q(X)\,\widehat{f\lceil X}^2(B).$$

Proof:
$$\sum_{C\subseteq S}\hat f^2(B\cup C) = \sum_{C\subseteq S}\Big[\sum_{X\subseteq S}\widehat{f\lceil X}(B)\,\psi_C(X)\,q(X)\Big]^2$$
$$= \sum_{C\subseteq S}\sum_{X,Y\subseteq S}\widehat{f\lceil X}(B)\,\widehat{f\lceil Y}(B)\,\psi_C(X)\psi_C(Y)\,q(X)q(Y)$$
$$= \sum_{X,Y\subseteq S}\widehat{f\lceil X}(B)\,\widehat{f\lceil Y}(B)\,\sqrt{q(X)q(Y)}\sum_{C\subseteq S}\psi_X(C)\psi_Y(C)\,q(C)$$
from which the Lemma follows by the orthonormality of $\psi$ on subsets. $\Box$

4.4 BOUNDING THE HIGH ORDER POWER SPECTRUM

We now use a series of probabilistic arguments to tie the above lemmas together into the desired result. Although the proofs are very similar to those in [LMN89], we include them for completeness. We begin with an easily proved bound on the high-order spectrum of any function.

Lemma 7 Let $f$ be any real-valued $n$-bit function, $p$ a real in $(0,1]$, $k$ an integer, and $p$ and $k$ chosen such that $pk > 8$. Let $\hat f(A)$ be a Fourier coefficient with respect to any orthonormal basis. Then
$$\sum_{|A|>k}\hat f^2(A) \le 2\,E_S\Big[\sum_{|A\cap S|>pk/2}\hat f^2(A)\Big]$$
where $S$ is a subset chosen at random such that each variable appears in it independently with probability $p$.

Proof: Clearly $E_S[|A\cap S|] \ge pk$ for every $|A| > k$. Thus Chernoff bounds tell us that
$$\Pr_S[|A\cap S| \le pk/2] \le e^{-pk/8} \le \tfrac12.$$
Thus for each $A$ with $|A| > k$ at least half of the $S$'s will satisfy $|A\cap S| > pk/2$, and the lemma follows. $\Box$

To this point we have not been much concerned with the form of the output of the function $f$; in fact, several of the previous lemmas hold for any real-valued function. For the sequel, we specify that $f$ is Boolean and, furthermore, maps to either $-1$ or $1$. It is then the case by Parseval's identity that
$$\sum_A \hat f^2(A) = 1$$
and thus that the sum of any subset of the squared coefficients of such a Boolean function is bounded by unity. With this fact in hand we can prove the following bound on the summation of the previous lemma:

Lemma 8 Let $f$ be a function from $\{0,1\}^n$ to $\{-1,1\}$, $S$ any set, $t$ an integer in $[0,n]$, and $q$ a mutually independent distribution defining the $\psi$ basis for $\hat f(A)$. Then
$$\sum_{|A\cap S|>t}\hat f^2(A) \le \Pr\big[f\lceil X \text{ has a minterm or maxterm of size} > \sqrt t\,\big]$$
where $X$ is an assignment to the variables in $\bar S$ chosen according to $q$.

Proof:
$$\sum_{|A\cap S|>t}\hat f^2(A) = \sum_{B\subseteq S,\,|B|>t}\;\sum_{C\subseteq\bar S}\hat f^2(B\cup C) = \sum_{B\subseteq S,\,|B|>t} E_X\big[\widehat{f\lceil X}^2(B)\big] = E_X\Big[\sum_{|B|>t}\widehat{f\lceil X}^2(B)\Big]$$
where the second equality follows from Lemma 6 and expectation is with respect to $q$. Now since the terms in the expectation are never larger than unity, it is clearly bounded above by the probability that $\sum_{|B|>t}\widehat{f\lceil X}^2(B)$ is nonzero. But then application of Lemma 4 completes the proof. $\Box$

We can now prove the main result.

Lemma 9 (Dropoff Lemma) Let $f$ be a function from $\{0,1\}^n$ to $\{-1,1\}$ computed by a circuit of depth $d$ and size $M$, and let $k$ be any integer. Then
$$\sum_{|A|>k}\hat f^2(A) \le 2M\,2^{-k^{1/(d+2)}/(5\beta)}.$$

Proof: From the previous two lemmas,
$$\sum_{|A|>k}\hat f^2(A) \le 2\,E_S\Big[\Pr\big[f\lceil X \text{ has a minterm or maxterm of size} > \sqrt{pk/2}\,\big]\Big].$$
But this latter value is just
$$2\Pr\big[f\lceil\rho \text{ has a minterm or maxterm of size} > \sqrt{pk/2}\,\big]$$
and by Lemma 3 is bounded above by $2M\,2^{-\sqrt{pk/2}}$ as long as $p = (\ln\phi_g/(2\beta\sqrt{pk/2}))^d$. Some simplifying arithmetic gives the result. $\Box$
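The dropoff is visible numerically even for toy circuits. The sketch below (our own choice of depth-2 function and distribution; at this tiny scale the asymptotic bound itself says little, but the qualitative decay shows) computes the squared $\psi$-coefficient mass at each order:

```python
import itertools, math

n = 6
mu = [0.5, 0.3, 0.7, 0.5, 0.4, 0.6]
sigma = [math.sqrt(m * (1 - m)) for m in mu]

def q(x): return math.prod(m if b else 1 - m for b, m in zip(x, mu))
def psi(A, x): return math.prod((x[i] - mu[i]) / sigma[i] for i in A)

# Depth-2 example: a small DNF, with output mapped to {-1, +1}.
def f(x): return 1 if (x[0] and x[1]) or (x[2] and x[3]) or (x[4] and x[5]) else -1

inputs = list(itertools.product((0, 1), repeat=n))
weights = [q(x) for x in inputs]
vals = [f(x) for x in inputs]

# Squared psi-coefficient mass at each order |A|.
mass = [0.0] * (n + 1)
for r in range(n + 1):
    for A in itertools.combinations(range(n), r):
        c = sum(v * psi(A, x) * w for v, x, w in zip(vals, inputs, weights))
        mass[r] += c * c

print(round(sum(mass), 6))                    # 1.0 by Parseval (f is +/-1-valued)
print(mass[5] + mass[6] < mass[1] + mass[2])  # True: high-order mass is tiny
```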


5 DIRECT LEARNING

As alluded to earlier, our direct learning algorithm, like that of Linial et al., depends on the spectral property of AC^0 functions proved above. That is, since the high-order Fourier coefficients relative to a $\psi$ basis are small, we need only estimate low-order coefficients in order to derive a close approximation to the desired function. As shown below, the linear combination of the low-order $\psi$ basis vectors defined by these coefficients is a function which is close to the true function in the sense that the norm of the difference between the functions is small. Furthermore, the sign of this approximating function will with high probability match the true function, where the probability is relative to the input distribution $q$.

Actually, since we assume that only input/output pairs are given, the distribution $q$ must also be estimated and hence the function is learned relative to an approximate basis. In spite of this we are able to prove a bound on the running time of our algorithm which is similar to the bound on LMN learning. More specifically, let $\tilde f\,\Delta f$ denote the probability that $\tilde f(\bar x) \ne f(\bar x)$ when the input $\bar x$'s are drawn according to a mutually independent probability distribution $q$. Our algorithm, given parameters $\epsilon$ and $\delta$, produces an $\tilde f$ such that $\Pr[\tilde f\,\Delta f > \epsilon] \le \delta$ when the algorithm is given access to a sufficient number of examples of the true function $f$ drawn according to $q$. The algorithm runs in time and number of examples quasi-polynomial in $n$ and $1/\epsilon$, exponential in the parameter $\beta$ of $q$, and polynomial in $\log(1/\delta)$.

In the sections that follow we first give the algorithm and then prove the bound on its running time.

5.1 THE DIRECT ALGORITHM

Algorithm 1 Given $m$ examples $\langle\bar x^j, f(\bar x^j)\rangle$ of a function $f: \{0,1\}^n \to \{-1,1\}$ and an integer $k \le n$, determine $\tilde f$ as follows:

1. Compute $\mu_i' = \sum_{j=1}^m \bar x_i^j/m$ for $1 \le i \le n$.
2. Define $z_i' = (x_i - \mu_i')/\sigma_i'$, where $\sigma_i' = \sqrt{\mu_i'(1-\mu_i')}$.
3. Define $\psi_A' = \prod_{i\in A} z_i'$.
4. Compute $\tilde f'(A) = \sum_{j=1}^m f(\bar x^j)\psi_A'(\bar x^j)/m$ for $|A| \le k$ and 0 otherwise. If $|\tilde f'(A)| > 1$ let $\tilde f'(A) = \mathrm{sign}(\tilde f'(A))$, where sign is the obvious function with range $\{-1,0,1\}$.
5. Define $\tilde g(\bar x) = \sum_A \tilde f'(A)\psi_A'(\bar x)$.
6. Define $\tilde f(\bar x) = \mathrm{sign}(\tilde g(\bar x))$.

We intend primes ($'$) to indicate values that are based on an estimated probability distribution rather than the true one. A twiddle (~) indicates that the value includes other estimates. When a Fourier coefficient is based on an estimated distribution it is written with the twiddle replacing the usual hat (^). Thus $\tilde f'(A)$ rather than $\hat f'(A)$ is used to represent an estimate of the $A$th Fourier coefficient of $f$ relative to the estimated basis $\psi'$.

Notice that the restriction on the magnitude of $\tilde f'(A)$ can only bring this estimated coefficient closer to the true coefficient in the $\psi'$ basis, since all of the coefficients of a Boolean function must be no larger than 1 in magnitude. The restriction also plays a helpful role in the lemmas to follow.
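A direct transcription of Algorithm 1 for small $n$ can be sketched as follows (the target function, parameters, and sample sizes are our own choices, and brute-force enumeration of low-order sets replaces any attempt at efficiency):

```python
import itertools, math, random
random.seed(1)

n, k, m = 6, 2, 20000

def f(x):  # "unknown" target (a small depth-2 circuit), +/-1-valued
    return 1 if (x[0] and x[1]) or (x[2] and not x[4]) else -1

mu_true = [0.5, 0.3, 0.7, 0.5, 0.4, 0.6]
sample = [tuple(1 if random.random() < p else 0 for p in mu_true)
          for _ in range(m)]

# Steps 1-3: estimate mu_i and build the estimated basis psi'_A.
mu = [sum(x[i] for x in sample) / m for i in range(n)]
sd = [math.sqrt(p * (1 - p)) for p in mu]
def psi(A, x): return math.prod((x[i] - mu[i]) / sd[i] for i in A)

# Step 4: estimate low-order coefficients, clipped to [-1, 1].
low = [A for r in range(k + 1) for A in itertools.combinations(range(n), r)]
coef = {A: max(-1.0, min(1.0, sum(f(x) * psi(A, x) for x in sample) / m))
        for A in low}

# Steps 5-6: the hypothesis is the sign of the low-order expansion.
def f_tilde(x):
    return 1 if sum(c * psi(A, x) for A, c in coef.items()) >= 0 else -1

# Empirical error under q on fresh samples.
test = [tuple(1 if random.random() < p else 0 for p in mu_true)
        for _ in range(5000)]
err = sum(f_tilde(x) != f(x) for x in test) / len(test)
print(err)  # low: most of this f's spectrum sits at order <= k
```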

5.2 BOUNDS FOR DIRECT LEARNING

Here we derive upper bounds on the values of $m$ and $k$ required for the above algorithm to achieve specified error bounds on an input distribution with a given $\beta$. Our first step is to generalize a lemma of Linial et al. [LMN89] to the case of arbitrary distributions $q$.

Lemma 10 Let $f$ be a function mapping $\{0,1\}^n$ to $\{-1,1\}$ and $g$ an arbitrary function on the same domain. Let $q$ be an arbitrary probability distribution on $\{0,1\}^n$, let the Fourier coefficients be relative to the basis $\phi_A$ defined by $q$, and let probabilities be with respect to $q$. Then
$$\Pr[f(\bar x) \ne \mathrm{sign}(g(\bar x))] \le \sum_A\big(\hat f(A) - \hat g(A)\big)^2.$$

Proof: $\Pr[f(\bar x)\ne\mathrm{sign}(g(\bar x))] \le \Pr[|f(\bar x)-g(\bar x)| \ge 1] \le E[(f(\bar x)-g(\bar x))^2] = \|f-g\|_q^2$, and the lemma follows from Parseval's identity and the linearity of the Fourier transform. $\Box$

Now for the remainder of this section let $q$ be a mutually independent distribution on the inputs to the algorithm and let unprimed Fourier coefficients be with respect to the $\psi$ basis defined by $q$. Then in the notation of our learning algorithm, Lemma 10 says that $\tilde f\,\Delta f \le \sum_A(\hat f(A) - \hat{\tilde g}(A))^2$. So our goal becomes finding an $m$ and $k$ such that with probability at least $1-\delta$ the algorithm produces a $\tilde g$ satisfying $\sum_A(\hat f(A) - \hat{\tilde g}(A))^2 \le \epsilon$. While the details of this calculation are a bit messy, the basic idea is not. Allocate half of the error $\epsilon$ to each of two jobs: taking care of the error in the coefficients corresponding to sets smaller than $k$ and larger than $k$. The Dropoff Lemma is used to bound the error in the latter and will also fix $k$. Chernoff bound arguments will give the value of $m$ needed to bound the error due to estimating the low-order coefficients and basis functions.

Because $\tilde f'(A) = 0$ for all $|A| > k$ and $\{\psi_A' \mid |A| \le k\}$ spans the same space as the corresponding set of $\psi_A$, $\hat{\tilde g}(A)$ must also vanish for all $|A| > k$. Thus the following lemma, which follows immediately from the Dropoff Lemma, gives the required bound on $\sum_{|A|>k}(\hat f(A) - \hat{\tilde g}(A))^2$.

Lemma 11 Let $f$ be a Boolean function with range $\{-1,1\}$ with corresponding $\{0,1\}$-valued function computed by a circuit of depth $d$ and size $M$. Then
$$\sum_{|A|>k}\hat f^2(A) \le \frac\epsilon2 \quad\text{whenever}\quad k \ge \Big(5\beta\log_2\frac{4M}{\epsilon}\Big)^{d+2}.$$

The bound on the error in the low-order coefficients is a bit more involved. There are really two sources of error: the estimate of the basis functions and the estimate of the coefficients relative to these functions. It seems simplest to consider these sources of error separately. First, define $\tilde f(A) = \sum_j f(\bar x^j)\psi_A(\bar x^j)/m$ and, as in the definition of $\tilde f'(A)$, restrict the magnitude of this value to 1. That is, $\tilde f(A)$ represents the coefficient which would be estimated if the true basis function was known. Since there are at most $n^k$ low-order coefficients for $n > 1$, an $m$ satisfying
$$\Pr\Big[\exists\,|A| \le k \text{ s.t. } |\hat f(A) - \tilde f(A)| > \sqrt{\epsilon/(8n^k)}\Big] \le \frac\delta2 \tag{1}$$
and
$$\Pr\Big[\exists\,|A| \le k \text{ s.t. } |\tilde f(A) - \hat{\tilde g}(A)| > \sqrt{\epsilon/(8n^k)}\Big] \le \frac\delta2 \tag{2}$$
guarantees that $\Pr[\sum_{|A|\le k}(\hat f(A) - \hat{\tilde g}(A))^2 > \epsilon/2] \le \delta$. Here $|\hat f(A) - \tilde f(A)|$ represents the error due to estimating the Fourier coefficients given perfect knowledge of the input distribution, $|\tilde f(A) - \hat{\tilde g}(A)|$ the error due to estimating the distribution.

An inequality of Hoeffding [Hoe63] will be very useful for finding the required $m$.

Lemma 12 (Hoeffding) Let $X_i$ be independent random variables all with mean $E[X]$ such that for all $i$, $a \le X_i \le b$. Then for any $\lambda > 0$,
$$\Pr\Big[\Big|\frac1m\sum_{i=1}^m X_i - E[X]\Big| \ge \lambda\Big] \le 2e^{-2\lambda^2 m/(b-a)^2}.$$

For the moment we remove the unity restriction on $\tilde f(A)$'s magnitude. Then $\hat f(A)$ is $E[\tilde f(A)]$, and using Lemma 12 to find an $m$ satisfying (1) requires only that the bounds on $\tilde f(A)$ be determined. By its definition, the magnitude of $\tilde f(A)$ is bounded by $\psi_{\max} = \max|\psi_A(\bar x)|$, where the maximum is over all possible $|A| \le k$ and $\bar x$. It is not hard to show that for all $i$, $|z_i| \le \sqrt{\beta-1}$, so $\psi_{\max} \le (\beta-1)^{k/2}$. Then by Lemma 12,
$$m \ge 16(\beta-1)^k n^k \epsilon^{-1}\ln(4n^k/\delta)$$
guarantees that for any given $|A| \le k$, $\Pr[|\hat f(A) - \tilde f(A)| > \sqrt{\epsilon/(8n^k)}] \le \delta/(2n^k)$. Hence such an $m$ satisfies (1). Note finally that restricting the magnitude of $\tilde f(A)$ can only improve the likelihood that $\tilde f(A)$ is sufficiently near $\hat f(A)$, since $\hat f(A) \in [-1,1]$.

Finding an $m$ satisfying (2) is more involved. First, rewrite $\hat{\tilde g}(A)$ as $\sum_{|B|\le k}\tilde f'(B)\langle\psi_B',\psi_A\rangle$ (all inner products in this section are with respect to $q$) and let
$$\Delta\psi = \max_{|A|\le k,\,\bar x}\,|\psi_A'(\bar x) - \psi_A(\bar x)|.$$
Then some algebra shows that for $|A|, |B| \le k$, $|\langle\psi_B',\psi_A\rangle - \langle\psi_B,\psi_A\rangle| \le \psi_{\max}\Delta\psi$. It follows that for all $|A| \le k$,
$$|\tilde f(A) - \hat{\tilde g}(A)| \le |\tilde f(A) - \tilde f'(A)| + |\tilde f'(A)|\,\big|\langle\psi_A',\psi_A\rangle - 1\big| + \sum_{|B|\le k,\,B\ne A}|\tilde f'(B)|\,|\langle\psi_B',\psi_A\rangle| \le (n^k+1)\,\psi_{\max}\Delta\psi.$$
Actually, careful consideration of the cases $A$ empty and nonempty shows that the bound can be tightened to simply $n^k\psi_{\max}\Delta\psi$.

Clearly the magnitude of $\Delta\psi$ depends on the error in the estimates $\mu_i'$. Intuitively, by driving the relative error in the $\mu_i'$ small we drive $\Delta\psi$ small. Thus define $c_\mu$ as the smallest value such that for all $i$, $|\mu_i - \mu_i'| \le c_\mu\min(\mu_i, 1-\mu_i)$; that is, $c_\mu$ is the least upper bound on the relative error. Then by considering the cases $x_i = 0$ and $x_i = 1$ it can be shown that for all $i$, both of the ratios $|z_i/z_i'|$ and $|z_i'/z_i|$ are bounded above by $2c_\mu + 1$. Hence the largest possible value of the ratio $\psi_A/\psi_A'$ for $|A| \le k$ is $(2c_\mu+1)^k$, and thus $\Delta\psi \le [(2c_\mu+1)^k - 1]\,\psi_{\max}$. Finally, use of the identity $(1+x)^j \le 1 + 2j|x|$ for $j$ a positive integer and $|x| \le 1/j$ shows that if $c_\mu \le 1/(2k)$ then $\Delta\psi \le 4c_\mu k\,\psi_{\max}$.

Thus
$$c_\mu \le \sqrt{\epsilon/(128\,k^2 n^{3k}(\beta-1)^{2k})}$$
implies that for all $|A| \le k$, $|\tilde f(A) - \hat{\tilde g}(A)| \le \sqrt{\epsilon/(8n^k)}$. Let $c_d$ represent the desired bound on $c_\mu$. Then since $1/\beta \le \min(\mu_i, 1-\mu_i)$ it follows that an $m$ such that $\Pr[\exists\,i \text{ s.t. } |\mu_i' - \mu_i| > c_d/\beta] \le \delta/2$ satisfies (2). Noting that $E[\mu_i'] = \mu_i$, that $0 \le \mu_i' \le 1$, and that there are $n$ different $\mu_i'$, application of Lemma 12 shows that the required $m$ is bounded by
$$64\,k^2 n^{3k}\beta^{2(k+1)}\epsilon^{-1}\ln(4n/\delta).$$

"

 have

m

X 2 2

1

 m=ba

2

X E [X ]    e :

Pr i 2

 

m Theorem 1 For any positive and and any mutually in- =

i 1 dependent distribution on the inputs, Algorithm 1 produces

~ ~

f f 4f>]  

a function with the property that Pr [ when-

~ f

For the moment we remove the unity restriction on A 's

ever

^ ~

f E [f ] A

magnitude. Then A is , and using Lemma 12 to ®nd

  +

d 2

~

M

f m 4

an satisfying (1) requires only that the bounds on A

 ;

k 5 log2

~

 f

be determined. By its de®nition, the magnitude of A is

 = j xj

bounded by max A , where the maximum is

max 1 4 n

k+

2 3 k 2 2

m  k n :

Aj k x

over all possible j and . It is not hard to show that 64 ln

p

 

k=2

i jz j    

for all , i 1, so max 1 .Thenby

k k

k 1

 n     n = 

Lemma 12, m 16 1 ln 4 guarantees

p



Thus for ®xed , ,and , the algorithm requires

k

^ ~

n

]  Aj k [jf f j > = n

j poly log 0 A

that for any given ,Pr A 8

  AC

O 2 examples to adequately approximate

k

= m

2 n . Hence such an satis®es (1). Note ®nally that functions.

~ f

restricting the magnitude of A can only improve the like-

The bounds for LMN learning on uniform distributions are

~ ^ ^

f f f 2 [ ; ]

A A

lihood that A is suf®ciently near ,since 1 1 .

M =

similar. The LMN k bound is polylogarithmicin and 1 +

d 3

O  M= m

Finding an m satisfying (2) is more involved. First, ( log ), and the bound is quasi-polynomial

P k

0 2

0 = n = O n =

~ in and 1 ( ) and logarithmic in 1 .

h ; f i g^ A

rewrite A as (all inner products

B

B

jB jk

 =

in this section are with respect to q )andlet

0

j x  xj

A

Ajk;x max j . Then some algebra shows

A 6 INDIRECT LEARNING

0

jAj; jB j k jh ; : i h ; ij  

A B

that for , A

B max

Aj k

It follows that for all j 6.1 OVERVIEW

0

0

~ ~ ~

jf g^ j  jf f h ; +

ij 0

A A A A

A A

Our approach to learning AC functions sampled according

X

0 0

~ to mutually independent distributions results in a straight-

j ij f h ;

A B

B forward deterministic algorithm, but the analysis is quite

B jk;B 6=A

j involved. We, and independently Umesh Vazirani [Vaz],

0 0

0

~ ~ ~

ij h ; ih ; j + jf  jf f

A A A

A have noticed a clever randomized approach which would be

A A A

somewhat more dif®cult to implement but admits a simpler Of course, even if q is known exactly it may not be desirable

  ;  C

Observe first that for any given value \(\rho\) in (0, 1) it is easy to construct a small fixed-depth circuit which, given inputs drawn uniformly, produces 1's with probability approximately \(\rho\). Thus for any given mutually independent probability distribution on n-bit strings, a set of disjoint circuits can be constructed which, given uniform inputs, will produce as output each n-bit string with approximately the desired probability. Conversely, a randomized inverse of each of these small circuits can be constructed such that mutually independent inputs to the inverses will produce a nearly uniform output.

With this background, the indirect uniform construction approach falls out naturally. We are given a set of input/output pairs (x, f(x)) where the x's are drawn according to a mutually independent distribution q. The unknown function f is computed by some AC^0 circuit F. Also, there exists a set of disjoint AC^0 circuits C_i which, given uniform y's, produce as output x's with distribution close to q. Call the randomized inverses of these circuits C_i^{-1}. Then there is another AC^0 circuit G, consisting of the obvious composition of the C_i's and F, such that if G computes function g then for all x, g(C^{-1}(x)) = f(x). Since the C^{-1}(x) are almost uniformly distributed, a variant of LMN learning can be used to obtain a good approximation to g and therefore indirectly to f.
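The two constructions just described can be simulated directly. The sketch below is only an illustration (Python standing in for the fixed-depth circuits, with illustrative function names): it models the input/output behavior of a circuit producing a bit with probability \(\rho = \sum_j a_j 2^{-j}\) from uniform bits, together with its randomized inverse, rather than any gate-level construction from the paper.

```python
import random

def biased_bit(uniform_bits, a):
    """Simulate a small circuit that, on l uniform input bits, outputs 1
    with probability sum_j a[j] * 2^-(j+1), where each a[j] is 0 or 1.
    (This models the circuit's input/output behavior, not its gates.)"""
    l = len(a)
    # Read the input bits as an integer u uniform on {0, ..., 2^l - 1}.
    u = sum(b << (l - 1 - i) for i, b in enumerate(uniform_bits))
    threshold = sum(a[j] << (l - 1 - j) for j in range(l))  # rho * 2^l
    return 1 if u < threshold else 0

def randomized_inverse(bit, a, rng=random):
    """The randomized inverse: map an output bit to an l-bit string chosen
    uniformly from the preimage of that bit under biased_bit."""
    l = len(a)
    threshold = sum(a[j] << (l - 1 - j) for j in range(l))
    u = rng.randrange(threshold) if bit else rng.randrange(threshold, 1 << l)
    return [(u >> (l - 1 - i)) & 1 for i in range(l)]

# With a = [1, 1, 0, 1], rho = 1/2 + 1/4 + 1/16 = 13/16.
a = [1, 1, 0, 1]
ones = sum(biased_bit([(u >> 3) & 1, (u >> 2) & 1, (u >> 1) & 1, u & 1], a)
           for u in range(16))
```

When the input bit is distributed Bernoulli(13/16), the inverse's output is uniform on 4-bit strings, which is the property the overview relies on for exactly representable probabilities.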

6.2 ANALYSIS OF UNIFORM CONSTRUCTION

Clearly if the circuits C_i produce exactly the desired probability distribution q on their outputs then the LMN theory applies immediately to uniform construction, since the C_i^{-1} will produce an exactly uniform distribution for learning g. Considering the forms of the bounds on k and m for LMN learning, the analysis for this case reduces to determining the size and depth of the circuit G and the length of its input. This in turn reduces to determining how long the string y generated by the C_i^{-1} is and ascertaining the size and depth of the C_i.

Although many possible forms of the C_i could be considered, we will assume that simple depth-2 DNF circuits are used in order to minimize the increase in depth of G over F. With such circuits any probability of the form \(\sum_{j=1}^{l} a_j 2^{-j}\), where \(a_j \in \{0, 1\}\), can be easily constructed using at most l variables and l + 1 gates. The idea is to create a circuit with one AND for each j such that \(a_j = 1\), to have that AND produce 1's with probability \(2^{-j}\), and to insure that at most one AND produces a 1 on each input. Such a circuit is easy to construct; for example, the circuit computing \(x_1 \lor (\bar{x}_1 \land x_2) \lor (\bar{x}_1 \land \bar{x}_2 \land x_3 \land x_4)\) has four variables, one OR, two AND's, and produces 1's with probability 13/16.
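As a quick sanity check of that example, the circuit \(x_1 \lor (\bar{x}_1 \land x_2) \lor (\bar{x}_1 \land \bar{x}_2 \land x_3 \land x_4)\) can be evaluated on all 16 inputs; because its three terms are pairwise disjoint, the acceptance probabilities add as 1/2 + 1/4 + 1/16 = 13/16. A brute-force sketch, with Python as a stand-in for the Boolean circuit:

```python
from itertools import product

def circuit(x1, x2, x3, x4):
    """The depth-2 example circuit: one OR over three pairwise disjoint terms."""
    return x1 or ((not x1) and x2) or ((not x1) and (not x2) and x3 and x4)

# Count accepting assignments over all 2^4 inputs; ones / 16 is the
# probability of a 1 under uniform inputs.
ones = sum(bool(circuit(*xs)) for xs in product([False, True], repeat=4))
```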

Thus in the case of exact representation of q by depth-2 C_i there must be some value l such that for each variable x_i, \(\Pr[x_i = 1] = \sum_{j=1}^{l} a_j 2^{-j}\). Therefore G has at most nl variables and n(l + 1) more gates than F and has depth d + 2.

Of course, even if q is known exactly it may not be desirable or even possible to represent it exactly with the C_i. In this case the LMN theory must be extended a bit to cover the case of nearly uniform distributions. Call a distribution r on n-bit strings \(\rho\)-uniform if for all x, \(|U(x) - r(x)| \le \rho/2^n\), where U is the uniform distribution \(U(x) = 2^{-n}\). Then the probabilities of the occurrence of some event with respect to these distributions can be related in a simple way. In particular, for any Boolean f and approximating g,

\[
\Pr_r[f(x) \ne \mathrm{sign}\, g(x)] \;\le\; (1 + \rho)\,\Pr_U[f(x) \ne \mathrm{sign}\, g(x)].
\]

Also, the expected value of a Fourier coefficient computed using examples drawn from a \(\rho\)-uniform rather than truly uniform distribution will differ from the true coefficient by no more than \(\rho\). Finally, as would be expected, the convergence of the C_i to a uniform distribution as variables are added is extremely rapid once each C_i has at least logarithmically many variables, that is, once the probability of a 1 for each C_i is in the vicinity of the appropriate value.
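The relation above between \(\rho\)-uniform and uniform probabilities can be checked exactly for small n by enumeration. The sketch below uses an arbitrary illustrative \(\rho\)-uniform distribution, not one from the paper; it also shows the bound is tight, since an event consisting of exactly the over-weighted strings achieves equality.

```python
from itertools import product

n, rho = 3, 0.25
strings = list(product([0, 1], repeat=n))
U = 1.0 / 2 ** n

# An arbitrary rho-uniform distribution r: shift each probability by
# exactly rho / 2^n, alternating sign so the total stays 1.
r = [U + rho * U * (1 if i % 2 == 0 else -1) for i in range(len(strings))]

# Event E = the over-weighted strings (the even indices).
E = [i for i in range(len(strings)) if i % 2 == 0]
pr_U = len(E) * U
pr_r = sum(r[i] for i in E)
# Here pr_r equals (1 + rho) * pr_U, so the inequality
# Pr_r[E] <= (1 + rho) * Pr_U[E] holds with equality.
```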

Putting these facts together with an analysis similar to that used in proving Theorem 1 shows that if each of the C_i has a polylogarithmic number of variables and a similar number of gates then the distribution r induced by the C_i^{-1} will be near enough uniform for an adequate g to be learned. Specifically, let \(l = 2\max[2\log_2 k,\, 2] + 2\) be the number of variables input to each C_i, where k satisfies the LMN bound modified to reflect the increase of 2 in depth d and the logarithmic dependence of circuit size M on l. Then the uniform construction method satisfies specified \(\varepsilon/\delta\) bounds as long as the number of examples is at least \(64 \cdot 2^{l(k+1)}\, n\, \ln(4n^{k}/\delta)\).

7 COMPARISON OF APPROACHES

The primary advantage of our direct approach to learning AC^0 functions is probably its potential application to non-independent distributions. While it is not at all clear how a technique like uniform construction could be used on an arbitrary distribution, our direct algorithm offers the hope of wider applicability, as discussed in the next section. Also, the \(\varphi\) basis and its properties have proved useful in extending another learning result from uniform to mutually independent distributions [Bel91].

Another significant area of difference between the approaches is the use of randomness. Uniform construction is a random algorithm in terms of both learning and the function learned, while our direct algorithm and the function learned are deterministic.

In terms of expected running times, both algorithms are quasi-polynomial. Uniform construction would seem to have a distinct advantage when \(1/\tau\) for the underlying distribution is large. On the other hand, for moderate \(\tau\) the direct approach should be faster due to the increase in circuit depth which uniform construction must contend with.

An interesting implementation possibility is a hybrid of the two approaches. Variables with means far from uniform would be handled via uniform construction methods, that is, expanded by an appropriate C_i^{-1}, and those closer to uniform would be unchanged. Then the direct learning algorithm rather than LMN would be applied to the resulting strings, which would now be nearly independent rather than nearly uniform. If only a few variables are far from uniform then increasing the depth of the circuit at these few points might not affect overall circuit depth. Thus the hybrid approach potentially avoids the primary sources of run time blowup in the individual methods.

8 OPEN QUESTIONS

An averaging argument added to a fundamental idea of Valiant and Vazirani [VV85] shows that for every AC^0 function f and every distribution q on the inputs there is a low-degree polynomial which is a close approximation to f with respect to q [BRS90, Tar91]. Unfortunately, this is only an existence proof which does not give rise immediately to a computationally feasible algorithm for finding such polynomials. The obvious question is to find such an algorithm.

Given an unknown distribution q and examples of a function f drawn according to q, we can use something like an approximate Gram-Schmidt process to orthogonalize, relative to q, a low-degree basis. We can then estimate the low-degree coefficients of the function f. We conjecture that for many natural distributions this will be a good approximation. For what distributions is this true? It is not true for all distributions; Smolensky [Smo] has produced a counterexample.
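The approximate Gram-Schmidt idea can be sketched concretely. Below, a small degree-1 basis is orthonormalized with respect to the empirical inner product \(\langle f, g \rangle = \mathrm{E}_q[f g]\) on a sample from a product distribution, and the low-degree coefficients of a target function are then read off by projection. The distribution, basis, and target here are illustrative choices, not from the paper.

```python
import random

random.seed(0)
n, m = 3, 4000
p = [0.8, 0.5, 0.3]                     # illustrative Pr[x_i = 1] under q
xs = [[1 if random.random() < pi else -1 for pi in p] for _ in range(m)]

def dot(u, v):
    """Empirical inner product <u, v> = E_q[u(x) v(x)] over the sample."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

# Degree <= 1 monomials, represented by their value vectors on the sample.
vecs = [[1.0] * m] + [[float(x[i]) for x in xs] for i in range(n)]

# Approximate Gram-Schmidt: orthonormalize relative to the sample from q.
ortho = []
for v in vecs:
    w = list(v)
    for o in ortho:
        c = dot(w, o)
        w = [wi - c * oi for wi, oi in zip(w, o)]
    norm = dot(w, w) ** 0.5
    ortho.append([wi / norm for wi in w])

# Estimate low-degree coefficients of a target f; here f(x) = x_1, which
# lies in the span, so projection reconstructs it exactly on the sample.
fvals = [float(x[0]) for x in xs]
coeffs = [dot(fvals, o) for o in ortho]
recon = [sum(c * o[j] for c, o in zip(coeffs, ortho)) for j in range(m)]
```

For a target outside the low-degree span, the same projection gives the best low-degree approximation relative to the sampled distribution, which is the quantity the conjecture concerns.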

It is natural to define AC^0 distributions to be those obtained in the following way. Transform uniformly drawn input variables y to new variables x via an AC^0 circuit C. The induced distribution on the x is called AC^0. Does the above variant work for AC^0 distributions?

Acknowledgements

Thanks to Alan Frieze for a helpful suggestion regarding the error analysis in Theorem 1. This research was supported in part by an unrestricted grant from AT&T. The third author received support from an ONR Graduate Fellowship and NSF Grant CCR-8858087.

References

[Bah61] R. R. Bahadur. A representation of the joint distribution of responses to n dichotomous items. In Herbert Solomon, editor, Studies in Item Analysis and Prediction, pages 158-168. Stanford University Press, Stanford, California, 1961.

[Bel91] Mihir Bellare. The spectral norm of finite functions. Unpublished manuscript, February 1991.

[BI87] M. Blum and R. Impagliazzo. Generic oracles and oracle classes. In 28th Annual Symposium on Foundations of Computer Science, pages 118-126, 1987.

[BRS90] Richard Beigel, Nick Reingold, and Daniel Spielman. The perceptron strikes back. Technical Report YALEU/DCS/TR-813, Yale University, 1990.

[BS90] Ravi B. Boppana and Michael Sipser. The complexity of finite functions. In Handbook of Theoretical Computer Science, volume A. MIT Press/Elsevier, 1990.

[Fel57] William Feller. An Introduction to Probability Theory and Its Applications, volume 1. John Wiley & Sons, second edition, 1957.

[FSS81] M. Furst, J. Saxe, and M. Sipser. Parity, circuits, and the polynomial time hierarchy. In 22nd Annual Symposium on Foundations of Computer Science, pages 260-270, 1981.

[Has86] J. Hastad. Computational Limitations for Small Depth Circuits. PhD thesis, MIT Press, 1986.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. American Statistical Association Journal, 58:13-30, 1963.

[LMN89] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. In 30th Annual Symposium on Foundations of Computer Science, pages 574-579, 1989.

[Smo] Roman Smolensky. Personal communication.

[Tar91] J. Tarui. Randomized polynomials, threshold circuits, and the polynomial hierarchy. In C. Choffrut and M. Jantzen, editors, 8th Annual Symposium on Theoretical Aspects of Computer Science Proceedings, pages 238-250, 1991.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984.

[Vaz] Umesh Vazirani. Personal communication.

[VV85] L. G. Valiant and V. V. Vazirani. NP is as easy as detecting unique solutions. In Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, pages 458-463, 1985.