Improved Learning of AC^0 Functions

Merrick L. Furst, Jeffrey C. Jackson, Sean W. Smith
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected], [email protected], [email protected]
Abstract

Two extensions of the Linial, Mansour, Nisan AC^0 learning algorithm are presented. The LMN method works when input examples are drawn uniformly. The new algorithms improve on theirs by performing well when given inputs drawn from unknown, mutually independent distributions. A variant of one of the algorithms is conjectured to work in an even broader setting.

1 INTRODUCTION

Linial, Mansour, and Nisan [LMN89] introduced the use of the Fourier transform to accomplish Boolean function learning. They showed that AC^0 functions are well-characterized by their low-frequency Fourier spectra and gave an algorithm which approximates such functions reasonably well from uniformly chosen examples. While the class AC^0 is provably weak in that it does not contain modular counting functions, from a learning theory point of view it is fairly rich. For example, AC^0 contains polynomial-size DNF and addition. Thus the LMN learning procedure is potentially powerful, but the restriction that their algorithm be given examples drawn according to a uniform distribution is particularly limiting.

A further limitation of the LMN algorithm is its running time. Valiant's [Val84] learning requirements are widely accepted as a baseline characterization of feasible learning. They include that a learning algorithm should be distribution independent and run in polynomial time. The LMN algorithm runs in quasi-polynomial (O(2^{poly log n})) time.

In this paper we develop two extensions of the LMN learning algorithm which produce good approximating functions when samples are drawn according to unknown distributions which assign values to the input variables independently. Call such a distribution mutually independent, since it is the joint probability distribution corresponding to a set of mutually independent random variables [Fel57]. The running times of the new algorithms depend on the distribution: the farther the distribution is from uniform, the higher the bound. As is the case for the LMN algorithm, the time bounds are quasi-polynomial in n.

A variant of one of our algorithms gives a more general learning method which we conjecture produces reasonable approximations of AC^0 functions for a broader class of input distributions. A brief outline of the general algorithm is presented along with a conjecture about a class of distributions for which it might perform well.

Our two learning methods differ in several ways. The direct algorithm is very similar to the LMN algorithm and depends on a substantial generalization of their theory. This algorithm is straightforward, but our bound on its running time is very sensitive to the probability distribution on the inputs. The indirect algorithm, discovered independently by Umesh Vazirani [Vaz], is more complicated but also relatively simple to analyze. Its time bound is only mildly affected by changes in distribution, but for distributions not too far from uniform it is likely greater than the direct bound. We suggest a possible hybrid of the two methods which may have a better running time than either method alone for certain distributions.

The outline of the paper is as follows. We begin with definitions and proceed to discuss the key idea behind our direct extension: we use an appropriate change of basis for the space of n-bit functions. Next, we prove that under this change AC^0 functions continue to exhibit the low-order spectral property which the LMN result capitalizes on. After giving an overview of the direct algorithm and analyzing its running time, we discuss the indirect approach and compare it with the first. Finally, we indicate some directions for further research.

2 DEFINITIONS AND NOTATION

All sets are subsets of {1, ..., n}, where as usual n represents the number of variables in the function to be learned. Capital letters denote set variables unless otherwise noted. The complement of a set X is indicated by X̄, although we abuse the concept of complementation somewhat: in cases where X is specified to be a subset of some other set S, X̄ = S − X; otherwise X̄ = {1, ..., n} − X. Strings of 0/1 variables are referred to by barred lower case letters (e.g. x̄), which may be superscripted to indicate one of a sequence of strings (e.g. x̄^j); x̄_i refers to the i-th variable in a string x̄. Barred constants (e.g. 0̄) indicate strings of the given value with length implied by context.
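As a concrete illustration (ours, not the paper's), a mutually independent distribution over {0,1}^n is determined by n marginal probabilities p_i = Pr[x_i = 1], and drawing an example amounts to n independent biased coin flips. A minimal Python sketch, with hypothetical marginals:

```python
import random

def sample_mutually_independent(p, rng=random):
    """Draw one n-bit example from the product distribution in which
    bit i equals 1 with probability p[i], independently of the rest."""
    return [1 if rng.random() < pi else 0 for pi in p]

# The uniform distribution is the special case p_i = 1/2 for all i.
uniform_p = [0.5] * 4
biased_p = [0.9, 0.5, 0.1, 0.5]  # hypothetical marginals, far from uniform

example = sample_mutually_independent(biased_p)
```

Under this view, the uniform distribution assumed by LMN is just the single point p = (1/2, ..., 1/2) in the family of mutually independent distributions handled here.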
Unless otherwise specified, all functions are assumed to have as domain the set of strings {0,1}^n. The range of Boolean functions will sometimes be {0,1}, particularly when we are dealing with circuit models, but will usually be {−1, 1} for reasons that should become clear subsequently. Frequently we will write sets as arguments where strings would be expected, e.g. f(X) rather than f(x̄) for a function on {0,1}^n. In such cases f(X) is a shorthand for f(c(X)), where c(X) is a characteristic function defined by c_i(X) = 0 if i ∈ X and 1 otherwise. Note that the sense of this function is opposite the natural one, in which 0 represents set absence.

An AC^0 function is a Boolean function which can be computed by a family of acyclic circuits (one circuit for each number n of inputs) consisting of AND and OR gates, with negations only on inputs, and satisfying two properties:

- The number of gates in each circuit (its size) is bounded by a fixed polynomial in n.
- The maximum number of gates between an input and the output (the circuit depth) is a fixed constant.

3 AN ORTHONORMAL BASIS FOR BOOLEAN FUNCTIONS SAMPLED UNDER MUTUALLY INDEPENDENT DISTRIBUTIONS

3.1 RATIONALE

Given some element v of a vector space and a basis for this space, a discrete Fourier transform expresses v as the coefficients of the linear combination of basis vectors representing v. We are interested in learning Boolean functions on n bits, which can be represented as Boolean vectors of length 2^n. Linial et al. used as the basis for this space the characters of the group Z_2^n in R. The characters are given by the 2^n functions

    χ_A(X) = (−1)^{|A ∩ X|}.

Each χ_A is simply a polynomial of degree |A|, and {χ_A : |A| ≤ k} spans the space of polynomials of degree not more than k. With an inner product defined by

    ⟨f, g⟩ = 2^{−n} Σ_X f(X) g(X)
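To make the basis concrete, the following Python sketch (our illustration, using the set-as-argument convention above) evaluates the characters χ_A on subsets X of a small universe and verifies numerically that they are orthonormal under the uniform inner product ⟨f, g⟩ = 2^{−n} Σ_X f(X) g(X):

```python
from itertools import chain, combinations

n = 3
universe = range(1, n + 1)
# All 2^n subsets of {1, ..., n}, each standing for one n-bit input.
subsets = list(chain.from_iterable(
    combinations(universe, k) for k in range(n + 1)))

def chi(A, X):
    """Character chi_A evaluated at the set X: (-1)^|A intersect X|."""
    return (-1) ** len(set(A) & set(X))

def inner(f, g):
    """Uniform-distribution inner product: 2^-n times the sum over inputs."""
    return sum(f(X) * g(X) for X in subsets) / 2 ** n

# Orthonormality: <chi_A, chi_B> is 1 when A = B and 0 otherwise.
for A in subsets:
    for B in subsets:
        expected = 1 if A == B else 0
        assert inner(lambda X: chi(A, X), lambda X: chi(B, X)) == expected
```

The check passes exactly (each inner product is a sum of ±1 terms divided by 2^n), which is the orthonormality property the transform relies on; the paper's contribution is to replace this basis with one orthonormal under a mutually independent distribution instead of the uniform one.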