Improved Learning of AC0 Functions
Merrick L. Furst        Jeffrey C. Jackson        Sean W. Smith
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected]    [email protected]    [email protected]

Abstract

Two extensions of the Linial, Mansour, Nisan AC0 learning algorithm are presented. The LMN method works when input examples are drawn uniformly. The new algorithms improve on theirs by performing well when given inputs drawn from unknown, mutually independent distributions. A variant of one of the algorithms is conjectured to work in an even broader setting.

1 INTRODUCTION

Linial, Mansour, and Nisan [LMN89] introduced the use of the Fourier transform to accomplish Boolean function learning. They showed that AC0 functions are well characterized by their low-frequency Fourier spectra and gave an algorithm which approximates such functions reasonably well from uniformly chosen examples. While the class AC0 is provably weak in that it does not contain modular counting functions, from a learning theory point of view it is fairly rich. For example, AC0 contains polynomial-size DNF and addition. Thus the LMN learning procedure is potentially powerful, but the restriction that their algorithm be given examples drawn according to a uniform distribution is particularly limiting.

A further limitation of the LMN algorithm is its running time. Valiant's [Val84] learning requirements are widely accepted as a baseline characterization of feasible learning. They include that a learning algorithm should be distribution independent and run in polynomial time. The LMN algorithm runs in quasi-polynomial (O(2^{polylog n})) time.

In this paper we develop two extensions of the LMN learning algorithm which produce good approximating functions when samples are drawn according to unknown distributions which assign values to the input variables independently. Call such a distribution mutually independent, since it is the joint probability distribution corresponding to a set of mutually independent random variables [Fel57]. The running times of the new algorithms depend on the distribution (the farther the distribution is from uniform, the higher the bound) and, as is the case for the LMN algorithm, the time bounds are quasi-polynomial in n.

A variant of one of our algorithms gives a more general learning method which we conjecture produces reasonable approximations of AC0 functions for a broader class of input distributions. A brief outline of the general algorithm is presented along with a conjecture about a class of distributions for which it might perform well.

Our two learning methods differ in several ways. The direct algorithm is very similar to the LMN algorithm and depends on a substantial generalization of their theory. This algorithm is straightforward, but our bound on its running time is very sensitive to the probability distribution on the inputs. The indirect algorithm, discovered independently by Umesh Vazirani [Vaz], is more complicated but also relatively simple to analyze. Its time bound is only mildly affected by changes in distribution, but for distributions not too far from uniform it is likely greater than the direct bound. We suggest a possible hybrid of the two methods which may have a better running time than either method alone for certain distributions.

The outline of the paper is as follows. We begin with definitions and proceed to discuss the key idea behind our direct extension: we use an appropriate change of basis for the space of n-bit functions. Next, we prove that under this change AC0 functions continue to exhibit the low-order spectral property which the LMN result capitalizes on. After giving an overview of the direct algorithm and analyzing its running time, we discuss the indirect approach and compare it with the first. Finally, we indicate some directions for further research.
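To make the sampling model concrete: drawing examples from a mutually independent distribution simply means that each input bit is set to 1 independently with its own (unknown) probability. The following Python sketch renders that setting; the target function and the bit probabilities are hypothetical placeholders chosen only for illustration.

```python
import random

def sample_example(q, f):
    """Draw one labeled example (x, f(x)) from the product distribution in
    which bit i is 1 with probability q[i], independently of the other bits."""
    x = [1 if random.random() < qi else 0 for qi in q]
    return x, f(x)

# Hypothetical target: a small DNF (an AC0-style function) on 4 variables.
target = lambda x: int((x[0] and x[1]) or (x[2] and not x[3]))

# A mutually independent but nonuniform distribution: each q_i may differ.
q = [0.5, 0.9, 0.3, 0.7]
samples = [sample_example(q, target) for _ in range(5)]
print(samples)
```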
2 DEFINITIONS AND NOTATION

All sets are subsets of {1, ..., n}, where as usual n represents the number of variables in the function to be learned. Capital letters denote set variables unless otherwise noted. The complement of a set X is indicated by X̄, although we abuse the concept of complementation somewhat: in cases where X is specified to be a subset of some other set S, X̄ = S − X; otherwise X̄ = {1, ..., n} − X. Strings of 0/1 variables are referred to by barred lower-case letters (e.g. x̄), which may be superscripted to indicate one of a sequence of strings (e.g. x̄^j); x̄_i refers to the ith variable in a string x̄. Barred constants (e.g. 0̄, 1̄) indicate strings of the given value with length implied by context.

Unless otherwise specified, all functions are assumed to have as domain the set of strings {0,1}^n. The range of Boolean functions will sometimes be {0,1}, particularly when we are dealing with circuit models, but will usually be {−1,1} for reasons that should become clear subsequently. Frequently we will write sets as arguments where strings would be expected, e.g. f(X) rather than f(x̄) for a function f on {0,1}^n. In such cases f(X) is a shorthand for f(c̄_X), where c̄_X is a characteristic function defined by (c̄_X)_i = 0 if i ∈ X and 1 otherwise. Note that the sense of this function is opposite the natural one, in which 0 represents set absence.

An AC0 function is a Boolean function which can be computed by a family of acyclic circuits (one circuit for each number n of inputs) consisting of AND and OR gates plus negations only on inputs and satisfying two properties:

- The number of gates in each circuit (its size) is bounded by a fixed polynomial in n.
- The maximum number of gates between an input and the output (the circuit depth) is a fixed constant.

A random restriction ρ_{p,q̄} is a function which, given input x̄, maps x̄_i to * with fixed probability p and assigns 0's and 1's to the other variables according to a probability distribution q̄. If x̄ represents the input to a function f, then ρ_{p,q̄} induces another function f⌈ρ which has variables corresponding to the stars and has the other variables of f fixed to 0 or 1. This is a generalization of the original definition [FSS81], in which q̄ is the uniform distribution. The subscripts of ρ are generally dropped and their values understood from context.

The function obtained by setting a certain subset S of the variables of f to the values indicated by the characteristic function of a subset X is denoted by f⌈_{S,X} or, when S is implied by context, simply f⌈_X. For example, if S = {1, 3} and X = {3}, then f⌈_X is the function f with variable x_1 set to 1 and x_3 set to 0.

We will use several parameters of the probability distribution throughout the sequel. We define q_i = Pr[x_i = 1], where the probability is with respect to q. Another parameter which we will use frequently is

    \beta = \max_i \max\{ 1/q_i, \; 1/(1 - q_i) \}.

We assume that this value is finite, since an infinite β implies some variable is actually a constant and can be ignored by the learning procedure.
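The two most procedural definitions above, the random restriction ρ_{p,q̄} and the distribution parameter β, can be rendered directly in code. The sketch below is a minimal illustration under assumed conventions (inputs are 0/1 lists, restricted positions are marked '*'); the helper names are illustrative and the target function is a hypothetical placeholder.

```python
import random

def random_restriction(n, p, q):
    """Sample a restriction rho on n variables: each position becomes '*' with
    probability p; otherwise it is fixed to 1 with probability q[i], else to 0."""
    return ['*' if random.random() < p else (1 if random.random() < q[i] else 0)
            for i in range(n)]

def restrict(f, rho):
    """Return the starred positions and the induced function on them: the
    remaining variables of f are fixed to the 0/1 values chosen by rho."""
    stars = [i for i, v in enumerate(rho) if v == '*']
    def f_rho(y):                     # y supplies values for the starred positions only
        x = list(rho)
        for i, bit in zip(stars, y):
            x[i] = bit
        return f(x)
    return stars, f_rho

def beta(q):
    """beta = max_i max(1/q_i, 1/(1 - q_i)); finite only if every 0 < q_i < 1."""
    return max(max(1.0 / qi, 1.0 / (1.0 - qi)) for qi in q)

f = lambda x: int((x[0] and x[1]) or x[2])   # hypothetical small AC0-style target
q = [0.5, 0.9, 0.3, 0.7]                     # hypothetical nonuniform product distribution
print(beta(q))                               # farther from uniform -> larger beta
rho = random_restriction(4, p=0.5, q=q)
stars, f_rho = restrict(f, rho)
print(rho, stars, [f_rho([0] * len(stars)), f_rho([1] * len(stars))])
```

Note how β grows as any q_i approaches 0 or 1, which is exactly why a finite β is assumed.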
3 AN ORTHONORMAL BASIS FOR BOOLEAN FUNCTIONS SAMPLED UNDER MUTUALLY INDEPENDENT DISTRIBUTIONS

3.1 RATIONALE

Given some element v of a vector space and a basis for this space, a discrete Fourier transform expresses v as the coefficients of the linear combination of basis vectors representing v. We are interested in learning Boolean functions on n bits, which can be represented as Boolean vectors of length 2^n. Linial et al. used as the basis for this space the characters of the group Z_2^n in R. The characters are given by the 2^n functions

    \chi_A(X) = (-1)^{|A \cap X|}.

Each χ_A is simply a polynomial of degree |A|, and {χ_A : |A| ≤ k} spans the space of polynomials of degree not more than k. With an inner product defined by

    \langle f, g \rangle = 2^{-n} \sum_{\bar{x} \in \{0,1\}^n} f(\bar{x}) g(\bar{x})

and norm \|f\| = \sqrt{\langle f, f \rangle}, the characters form an orthonormal basis for the space of real-valued functions on n bits.

The Fourier coefficients of a function f with respect to this basis are simply \hat{f}(A) = \langle f, \chi_A \rangle, the projections of f on the basis vectors. Given a sufficient number m of uniformly selected examples (x̄^j, f(x̄^j)), the quantity

    \frac{1}{m} \sum_{j=1}^{m} f(\bar{x}^j) \chi_A(\bar{x}^j)

is a good estimate of \hat{f}(A) [LMN89].

What happens if the examples are chosen according to some nonuniform distribution q? If we blindly applied the LMN learning algorithm we would actually be calculating an approximation to

    \sum_{\bar{x} \in \{0,1\}^n} f(\bar{x}) \chi_A(\bar{x}) q(\bar{x})

for each A. That is, we would be calculating the expected value of the product f χ_A with respect to q rather than with respect to the uniform distribution. This leads to the following observation: if we modify the definition of the inner product to be with respect to the distribution q on the inputs and modify the basis to be orthonormal under this new inner product, then the estimated coefficients should on average be close to the appropriate values in this new basis.

More precisely, define the inner product to be

    \langle f, g \rangle = \sum_{\bar{x}} f(\bar{x}) g(\bar{x}) q(\bar{x}).
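The observation above can be checked empirically: the LMN estimator (1/m) Σ_j f(x̄^j) χ_A(x̄^j) converges to the uniform-basis coefficient when the examples are uniform, but to E_q[f χ_A] when they are drawn from a nonuniform product distribution q. The Python sketch below uses the standard parity character over 0/1 inputs (which differs from the set-based convention χ_A(X) = (−1)^{|A∩X|} only by the fixed sign (−1)^{|A|}); the target function, probabilities, and helper names are hypothetical illustrations.

```python
import itertools, random

def chi(A, x):
    """Parity character over 0/1 inputs: (-1) raised to the number of i in A
    with x_i = 1. The paper's set-based chi_A differs only by the sign (-1)^|A|."""
    return (-1) ** sum(x[i] for i in A)

def exact_coefficient(f, A, n):
    """Exact coefficient under the uniform inner product:
    <f, chi_A> = 2^{-n} * sum over all x of f(x) chi_A(x)."""
    return sum(f(list(x)) * chi(A, x)
               for x in itertools.product([0, 1], repeat=n)) / 2 ** n

def estimated_coefficient(f, A, q, m):
    """LMN-style empirical estimate (1/m) sum_j f(x^j) chi_A(x^j), with examples
    drawn from the product distribution q. Under nonuniform q this converges to
    E_q[f * chi_A] rather than to the uniform-basis coefficient."""
    total = 0.0
    for _ in range(m):
        x = [1 if random.random() < qi else 0 for qi in q]
        total += f(x) * chi(A, x)
    return total / m

f = lambda x: 1 if (x[0] and x[1]) else -1          # hypothetical {-1,1}-valued target
A = {0, 1}
print(exact_coefficient(f, A, n=4))                  # 0.5
print(estimated_coefficient(f, A, q=[0.5] * 4, m=20000))            # close to 0.5
print(estimated_coefficient(f, A, q=[0.9, 0.9, 0.3, 0.7], m=20000)) # biased away from 0.5
```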