<<

930 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 39, NO. 3, MAY 1993

Universal Approximation Bounds for Superpositions of a Sigmoidal Function

Andrew R. Barron, Member, IEEE

Abstract— Approximation properties of a class of artificial neural networks are established. It is shown that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes. The function approximated is assumed to have a bound on the first moment of the magnitude distribution of its Fourier transform. The nonlinear parameters associated with the sigmoidal nodes, as well as the parameters of linear combination, are adjusted in the approximation. In contrast, it is shown that for series expansions with n terms, in which only the parameters of linear combination are adjusted, the integrated squared approximation error cannot be made smaller than order 1/n^(2/d) uniformly for functions satisfying the same smoothness assumption, where d is the dimension of the input to the function. For the class of functions examined here, the approximation rate and the parsimony of the parameterization of the networks are surprisingly advantageous in high-dimensional settings.

Index Terms— Artificial neural networks, approximation of functions, Fourier analysis, Kolmogorov n-widths.

I. INTRODUCTION

APPROXIMATION bounds for a class of artificial neural networks are derived. Continuous functions on compact subsets of R^d can be uniformly well approximated by linear combinations of sigmoidal functions, as independently shown by Cybenko [1] and Hornik, Stinchcombe, and White [2]. The purpose of this paper is to examine how the approximation error is related to the number of nodes in the network.

As in [1], we adopt the definition of a sigmoidal function φ(z) as a bounded measurable function on the real line for which φ(z) → 1 as z → ∞ and φ(z) → 0 as z → −∞. Feedforward neural network models with one layer of sigmoidal units implement functions on R^d of the form

    f_n(x) = Σ_{k=1}^n c_k φ(a_k · x + b_k) + c_0,    (1)

parameterized by a_k ∈ R^d and b_k, c_k ∈ R, where a · x denotes the inner product of vectors in R^d. The total number of parameters of the network is (d + 2)n + 1.

Manuscript received February 19, 1991. This work was supported by ONR under Contract N00014-89-J-1811. Material in this paper was presented at the IEEE International Symposium on Information Theory, Budapest, Hungary, June 1991. The author was with the Department of Statistics, the Department of Electrical and Computer Engineering, the Coordinated Science Laboratory, and the Beckman Institute, University of Illinois at Urbana-Champaign. He is now with the Department of Statistics, Yale University, Box 2179, Yale Station, New Haven, CT 06520. IEEE Log Number 9206966.

A property of the function to be approximated is expressed in terms of its Fourier representation. In particular, an average of the norm of the frequency vector weighted by the Fourier magnitude distribution is used to measure the extent to which the function oscillates. In this Introduction, the result is presented in the case that the Fourier distribution has a density that is integrable as well as having a finite first moment. Somewhat greater generality is permitted in the theorem stated and proven in Sections III and IV.

Consider the class of functions f on R^d for which there is a Fourier representation of the form

    f(x) = ∫_{R^d} e^{iω·x} f̃(ω) dω,    (2)

for some complex-valued function f̃(ω) for which ω f̃(ω) is integrable, and define

    C_f = ∫_{R^d} |ω| |f̃(ω)| dω,    (3)

where |ω| = (ω · ω)^{1/2}. For each C > 0, let Γ_C be the set of functions f such that C_f ≤ C.

Functions with C_f finite are continuously differentiable on R^d, and the gradient of f has the Fourier representation

    ∇f(x) = ∫ e^{iω·x} (iω) f̃(ω) dω.    (4)

Thus, condition (3) may be interpreted as the integrability of the Fourier transform of the gradient of the function f. In Section III, functions are permitted to be defined on domains (such as Boolean functions on {0, 1}^d) for which it does not make sense to refer to differentiability on that domain. Nevertheless, the conditions imposed imply that the function has an extension to R^d with a gradient that possesses an integrable Fourier representation.

The following approximation bound is representative of the results obtained in this paper for approximation by linear combinations of a sigmoidal function. The approximation error is measured by the integrated squared error with respect to an arbitrary probability measure μ on the ball B_r = {x: |x| ≤ r} of radius r > 0. The function φ(z) is an arbitrary fixed sigmoidal function.

Proposition 1: For every function f with C_f finite, and every n ≥ 1, there exists a linear combination of sigmoidal functions f_n(x) of the form (1), such that

    ∫_{B_r} (f(x) − f_n(x))² μ(dx) ≤ c'_f / n,    (5)
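As a concrete illustration of the network form (1), the following is a minimal Python sketch of a one-hidden-layer sigmoidal network and its parameter count (d + 2)n + 1. The logistic function is used as one admissible sigmoidal function, and the particular weights are hypothetical; nothing in this sketch is taken from the paper itself.

```python
import math

def sigmoid(z):
    # a bounded sigmoidal function: phi(z) -> 1 as z -> +inf, 0 as z -> -inf
    return 1.0 / (1.0 + math.exp(-z))

def make_network(a, b, c, c0):
    """Return f_n(x) = sum_k c_k * phi(a_k . x + b_k) + c0, the form in (1)."""
    def f_n(x):
        return c0 + sum(
            ck * sigmoid(sum(ai * xi for ai, xi in zip(ak, x)) + bk)
            for ak, bk, ck in zip(a, b, c)
        )
    return f_n

# hypothetical n = 2 node network in d = 3 dimensions
a = [[1.0, -0.5, 0.2], [0.3, 0.7, -1.0]]
b = [0.1, -0.2]
c = [0.5, -0.5]
f2 = make_network(a, b, c, c0=0.0)

d, n = 3, 2
num_params = (d + 2) * n + 1  # the parameter count stated in the text
```

Here only the c_k and c_0 enter linearly; the a_k and b_k are the nonlinear parameters that the approximation results allow to be adjusted.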

0018-9448/93$03.00 © 1993 IEEE

where c'_f = (2rC_f)². For functions in Γ_C, the coefficients of the linear combination in (1) may be restricted to satisfy Σ_{k=1}^n |c_k| ≤ 2rC, and c_0 = f(0).

Extensions of this result are also given to handle Fourier distributions that are not absolutely continuous, to bound the approximation error on arbitrary bounded sets, to restrict the parameters a_k and b_k to be bounded, to handle certain infinite-dimensional cases, and to treat iterative optimization of the network approximation. Examples of functions for which bounds can be obtained for C_f are given in Section IX.

A lower bound on the integrated squared error is given in Section X for approximations by linear combinations of fixed basis functions. For dimensions d ≥ 3, the bounds demonstrate a striking advantage of adjustable basis functions (such as used in sigmoidal networks) when compared to fixed basis functions for the approximation of functions in Γ_C.

II. DISCUSSION

The approximation bound shows that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes, uniformly for functions in the given smoothness class. A surprising aspect of this result is that the approximation bound of order O(1/n) is achieved using networks with a relatively small number of parameters compared to the exponential number of parameters required by traditional polynomial, spline, and trigonometric expansions. These traditional expansions take a linear combination of a set of fixed basis functions. It is shown in Section X that there is no choice of n fixed basis functions such that linear combinations of them achieve integrated squared approximation error of smaller order than (1/n)^(2/d) uniformly for functions in Γ_C, in agreement with the theory of Kolmogorov n-widths for other similar classes of functions (see, e.g., [3, pp. 232–233]). This vanishingly small approximation rate (2/d instead of 1 in the exponent of 1/n) is a "curse of dimensionality" that does not apply to the methods of approximation advocated here for functions in the given class.

Roughly, the idea behind the proof of the lower bound result is that there are exponentially many orthonormal functions with the same magnitude of the frequency ω. Unless all of these orthonormal functions are used in the fixed basis, there will remain functions in Γ_C that are not well approximated. This problem is avoided by tuning or adapting the parameters of the basis functions to fit the target function, as in the case of sigmoidal networks. The idea behind the proof of the upper bound result (Proposition 1) is that if the function has an integrable representation in terms of parameterized basis functions, then a random sample of the parameters of the basis functions from the right distribution leads to an accurate approximation.

Jones [4] has obtained similar approximation properties for linear combinations of sinusoidal functions, where the frequency variables are the nonlinear parameters. The class of functions he examines are those for which ∫ |f̃(ω)| dω is bounded, which places less of a restriction on the high-frequency components of the function (but more of a restriction on low-frequency components) than does the integrability of |ω||f̃(ω)|. In the course of our proof, it is seen that the integrability of |ω||f̃(ω)| is also sufficient for a linear combination of sinusoidal functions to achieve the 1/n approximation rate. Siu and Bruck [5] have obtained similar approximation results for neural networks in the case of Boolean functions on {0, 1}^d. Independently, they developed similar probabilistic arguments for the existence of accurate approximations in their setting.

It is not surprising that sinusoidal functions are at least as well suited for approximation as are sigmoidal functions, given that the smoothness properties of the function are formulated in terms of the Fourier transform. The sigmoidal functions are studied here not because of any unique qualifications in achieving the desired approximation properties, but rather to answer the question as to what bounds can be obtained for this commonly used class of neural network models.

There are moderately good approximation rate properties in high dimensions for other classes of functions that involve a high degree of smoothness. In particular, for functions with ∫ |f̃(ω)|² |ω|^{2s} dω bounded, the best approximation rate for the integrated squared error achievable by traditional basis function expansions using order m^d parameters is of order O(1/m)^{2s} for m = 1, 2, …; for instance, see [3] (for polynomial methods m is the degree, and for spline methods m is the number of knots per coordinate). If s = d/2 and n is of order m^d, then the approximation rates in the two settings match. However, the exponential number of parameters required for the series methods still prevents their direct use when d is large.

Unlike the condition ∫ |f̃(ω)|² |ω|^{2s} dω < ∞, which by Parseval's identity is equivalent to the square integrability of all partial derivatives of order s, the condition ∫ |f̃(ω)||ω| dω < ∞ is not directly related to a condition on derivatives of the function. It is necessary (but not sufficient) that all first-order partial derivatives be bounded. It is sufficient (but not necessary) that all partial derivatives of order less than or equal to s be square-integrable on R^d, where s is the least integer greater than 1 + d/2, as shown in Example 15 of Section IX. In the case of approximation on a ball of radius r, if the partial derivatives of order s are bounded on B_{r'} for some r' > r, then there is a smooth extension of f for which the partial derivatives of order s are square integrable on R^d, thereby permitting the approximation bounds to be applied to this case.

Another class of functions with good approximation properties in moderately high dimensions is the set of functions with a bound on ∫ (∂^{sd} f(x)/∂x_1^s ⋯ ∂x_d^s)² dx (or, equivalently, ∫ |ω_1|^{2s} ⋯ |ω_d|^{2s} |f̃(ω)|² dω). For this class, an approximation rate of order O(1/n)^{2s} is achieved using O(n(log n)^{d−1}) parameters, corresponding to a special subset of terms in a Fourier expansion (see Korobov [6] and Wahba [7, pp. 145–146]). Nevertheless, the (log n)^{d−1} factor still rules out practical use of these methods in dimensions of, say, 10 or more.

Thus far in the discussion, attention is focused on the comparison of the rate of convergence. In this respect, methods that adapt the basis functions (such as sigmoidal networks) are shown to be superior in dimensions d ≥ 3 for the class Γ_C for any value of C, no matter how large. Now it must be pointed out that the dimension d can also appear indirectly through the constant C_f. Dependence of the constant on d does not affect the convergence rate as an exponent of 1/n. Nevertheless, if C_f is exponentially large in d, then an exponentially large value of n would be required for C_f²/n to be small for approximation by sigmoidal networks. If C is exponentially large, then approximation by traditional expansions can be even worse. Indeed, since the lower bound developed in Section X is of the form C²/n^(2/d), a superexponentially large number of terms n would be necessary to obtain a small value of the integrated squared error for some functions in Γ_C. The constant C_f involves a d-dimensional integral, and it is not surprising that often it can be exponentially large in d.

In the case that the function is observed at sites X_1, X_2, …, X_N restricted to B_r, Proposition 1 provides a bound on the training error

    (1/N) Σ_{i=1}^N (f(X_i) − f̂_n(X_i))² ≤ c'_f / n,    (7)

where the estimate f̂_n = f̂_{n,N} of the form (1) is chosen to minimize the sum of squared errors (or to achieve a somewhat simpler iterative minimization given in Section VIII). In this case, the measure in Proposition 1 is taken to be with respect to the empirical distribution. The implications for the generalization capability of sigmoidal networks estimated from data are discussed briefly.
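To make the constant C_f concrete, the following sketch numerically checks the first moment (3) for a hypothetical one-dimensional example: the Gaussian f(x) = exp(−x²/2), whose Fourier density under the convention (2) is f̃(ω) = (2π)^{−1/2} exp(−ω²/2), and for which C_f = √(2/π) in closed form. The function names and the simple trapezoidal quadrature are illustrative assumptions, not part of the paper.

```python
import math

def fourier_density(w):
    # |f~(w)| for f(x) = exp(-x^2/2) under f(x) = integral e^{iwx} f~(w) dw
    return math.exp(-w * w / 2.0) / math.sqrt(2.0 * math.pi)

def first_moment(h=1e-3, W=12.0):
    # trapezoidal quadrature of |w| * |f~(w)| over [-W, W]; the tail beyond
    # W = 12 is negligible for the Gaussian
    n = int(round(2.0 * W / h))
    total = 0.0
    for i in range(n):
        w0 = -W + i * h
        w1 = w0 + h
        total += 0.5 * h * (abs(w0) * fourier_density(w0)
                            + abs(w1) * fourier_density(w1))
    return total

C_f_gaussian = first_moment()  # should be close to sqrt(2/pi) ~ 0.7979
```

For this example C_f is modest; the examples of Section IX address how such constants scale with the dimension d.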
Standard smoothness properties such as the existence of enough bounded derivatives guarantee that C_f is finite (as discussed above), but alone they are not enough to guarantee that C_f is not exponentially large. In Section IX, a large number of examples are provided for which C_f is only moderately large, e.g., O(d^{1/2}) or O(d), together with certain properties for translation, scaling, linear combination, and composition of functions. Since in engineering and scientific contexts it is not unusual for functions to be built up in this way, the results suggest that Γ_C may be a suitable class for treating many functions that arise in such contexts. Other classes of functions may ultimately provide better characterizations of the approximation capabilities of artificial neural networks. The class Γ_C is provided as a first step in the direction of identifying those classes of functions for which artificial neural networks provide accurate approximations.

Some improvements to the bound may be possible. Note that there can be more than one extension of a function outside of a bounded set B that possesses a gradient with an integrable transform. Each such extension provides an upper bound for the approximation error. An interesting open question is how to solve for the extension of a function outside of B_r that yields the smallest value for ∫ |ω||f̃(ω)| dω.

For small d, the bound (2rC)²/n on the integrated squared error in Proposition 1 is not the best possible. In particular, for d = 1, the best bound for approximation by step functions is (rC/n)² (which can be obtained by standard methods using the fact that, for functions in Γ_C, the absolute value of the derivative is bounded by C). For d > 1, it is recently shown in [20] that the rate for sigmoidal networks cannot be better than (1/n)^{1+(2/d)} in the worst case for functions in Γ_C. Note that the gap between the upper and lower bounds on the rates vanishes in the limit of large dimension. Determination of the exact rate for each dimension is an open problem.

The bound in the proposition assumes that μ is a probability measure. More generally, if μ is a measure for which μ(B_r) is finite, it follows from Proposition 1 that

    ∫_{B_r} (f(x) − f_n(x))² μ(dx) ≤ μ(B_r) c'_f / n.    (6)

In particular, with the choice of μ equal to Lebesgue measure, the bound is of order O(1/n), which is independent of d, but the constant μ(B_r) is equal to the volume of the ball in d dimensions, which grows exponentially in d for r > 1.

There are contributions to the total mean squared error ∫_{B_r} (f − f̂_n)² dμ from the mean squared error of approximation ∫_{B_r} (f − f_n)² dμ and the mean squared error of estimation ∫_{B_r} (f_n − f̂_n)² dμ. An index of resolvability provides a bound to the total mean squared error in terms of the approximation error and the model complexity according to a theorem in [8] and [9] (see also [10] for related results). In [11], the approximation result obtained here is used to evaluate this index of resolvability for neural network estimates of functions in Γ_C, assuming a smoothness condition for the sigmoid. There it is concluded that statistically estimated sigmoidal networks achieve mean squared error bounded by a constant multiple of C_f²/n + (nd/N) log N. In particular, with n ~ C_f (N/(d log N))^{1/2}, the bound on the mean squared error is a constant times C_f ((d/N) log N)^{1/2}. In the theory presented in [11], a bound of the same form is also obtained when the number of units n is not preset as a function of the sample size N, but rather it is optimized from the data by the use of a complexity regularization criterion.

Other relevant work on the statistical estimation of sigmoidal networks is in White [12] and Haussler [13], where metric entropy bounds play a key role in characterizing the estimation error. For these metric entropy calculations and for the complexity bounds in [11], it is assumed that domain bounds are imposed for the parameters of the sigmoidal network. In order that the approximation theory can be combined with such statistical results, the approximation bounds are refined in Section VI under constraints on the magnitudes of the parameter values. The size of the parameter domains for the sigmoids grows with n to preserve the same approximation rate as in the unbounded case.

For the practitioner, the theory provides the guidance to choose the number of variables d, the number of network nodes n, and the sample size N, such that 1/n and (nd/N) log N are small. But there are many other practical issues that must be addressed to successfully estimate network functions in high dimensions. Some of these issues include the iterative search for parameters, the selection of subsets of variables input to each node, the possible selection of higher order terms, and the automatic selection of the number of nodes on the basis of a suitable model selection criterion. See Barron and Barron [14] for an examination of some of these issues and the relationship between neural network methods and other methods developed in statistics for the approximation and estimation of functions.

After the initial manuscript was distributed to colleagues, the methods and results of this paper have found application to approximation by hinged hyperplanes (Breiman [15]), slide functions (Tibshirani [16]), projection pursuit regression (Zhao [17]), radial basis functions (Girosi and Anzellotti [18]), and the convergence rate for neural net classification error (Farago and Lugosi [19]). Moreover, the results have been refined to give approximation bounds for network approximation in L_p norms, 1 < p < ∞ (Darken et al. [31]), in the L_∞ norm (Barron [20], Yukich [32]), and in Sobolev norms (Hornik et al. [21]).

Approximation rates for the sigmoidal networks have recently been developed in McCaffrey and Gallant [22], Mhaskar and Micchelli [23], and Kurkova [33] in the settings of more traditional smoothness classes that are subject to the curse of dimensionality. Reference [22] also gives implications for statistical convergence rates of neural networks in these settings. Jones [24] gives convergence rates and a set of "good weights" to use in the estimation of almost periodic functions. Zhao [17] gives conditions such that uniformly distributed weight directions are sufficient for accurate approximation.

A challenging problem for network estimation is the optimization of the parameters in high-dimensional settings. In Section VIII, a key lemma due to Jones [4] is presented that permits the parameters of the network to be optimized one node at a time, while still achieving the approximation bound of Proposition 1. This result considerably reduces the computational task of the parameter search. Nevertheless, it is not known whether there is a computational algorithm that can be proven to produce accurate estimates in polynomial time as a function of the number of variables for the class of functions studied here. We have avoided the effects of the curse of dimensionality in terms of the accuracy of approximation but not in terms of computational complexity.

III. CONTEXT AND STATEMENT OF THE THEOREM

In this section, classes of functions are defined and then the main result is stated for the approximation of these functions by sigmoidal networks. The context of Fourier distributions permits both series and integral cases. A number of interesting examples make use of the Fourier distribution, as will be seen in Sections VII and IX.

The Fourier distribution of a function f(x) on R^d is a unique complex-valued measure F̃(dω) = e^{iθ(ω)} F(dω), where F(dω) denotes the magnitude distribution and θ(ω) denotes the phase at the frequency ω, such that

    f(x) = ∫ e^{iω·x} F̃(dω),    (8)

or, more generally,

    f(x) = f(0) + ∫ (e^{iω·x} − 1) F̃(dω),    (9)

for all x ∈ R^d. If ∫ F(dω) is finite, then both (8) and (9) are valid and (9) follows from (8). Assuming only that ∫ |ω| F(dω) is finite, (9) is used instead of (8), since then the required integrability follows from |e^{iω·x} − 1| ≤ 2|ω · x| ≤ 2|ω||x|. (See the Appendix for the characterization of the Fourier representation in this context.) The class of functions on R^d for which C_f = ∫ |ω| F(dω) is finite is denoted by Γ.

Functions are approximated on bounded measurable subsets of their domain in R^d. Let B be a bounded set in R^d that contains the point x = 0, and let Γ_B be the set of functions f on B for which the representation (9) holds for x ∈ B for some complex-valued measure F̃(dω) for which ∫ |ω| F(dω) is finite, where F is the magnitude distribution corresponding to F̃. (The right side of (9) then defines an extension of the function f from B to R^d that is contained in Γ, and F̃ may be interpreted as the Fourier distribution of a continuously differentiable extension of f from B to R^d. Each such extension provides a possible Fourier representation of f on B.)

Linear functions f(x) = a · x and, more generally, the class of infinitely differentiable functions on R^d are not contained in Γ, but they are contained in Γ_B when restricted to any bounded set B (because such functions can be modified outside of the bounded set to produce a function in Γ; see Section IX). The set of functions on R^d with this property of containment in Γ_B for every bounded set B is denoted for convenience by Γ* = ∩_B Γ_B.

For each C > 0, let Γ_{C,B} be the set of all functions f in Γ_B such that for some F representing f on B,

    ∫ |ω|_B F(dω) ≤ C,    (10)

where |ω|_B = sup_{x∈B} |ω · x|. In the case of the ball B_r = {x: |x| ≤ r}, this norm simplifies to |ω|_{B_r} = r|ω|. (See Section V for the form of the bound for certain other domains such as cubes.)

The main theorem follows; a bound is given for the integrated squared error for approximation by linear combinations of a sigmoidal function.

Theorem 1: For every function f in Γ_{C,B}, every sigmoidal function φ, every probability measure μ, and every n ≥ 1, there exists a linear combination of sigmoidal functions f_n(x) of the form (1), such that

    ∫_B (f(x) − f_n(x))² μ(dx) ≤ (2C)² / n.    (11)

The coefficients of the linear combination in (1) may be restricted to satisfy Σ_{k=1}^n |c_k| ≤ 2C, and c_0 = f(0).

IV. ANALYSIS

Denote the set of bounded multiples of a sigmoidal function composed with linear functions by

    G_φ = {γ φ(a · x + b): |γ| ≤ 2C, a ∈ R^d, b ∈ R}.    (12)

For functions f in Γ_{C,B}, Theorem 1 bounds the error in the approximation of the function f̄(x) = f(x) − f(0) by convex combinations of functions in the set G_φ.
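The law-of-large-numbers argument credited to Maurey, which underlies Lemma 1 of the proof, can be illustrated numerically. The sketch below is a toy example in R³ with hypothetical unit vectors and weights (not the paper's construction); it checks by Monte Carlo the identity E‖f_n − f*‖² = (1/n)(E‖g‖² − ‖f*‖²) for a sample average f_n of n i.i.d. draws from the convex combination defining f*.

```python
import random

random.seed(0)

# a hypothetical convex combination f* = sum_k gamma_k g_k in R^3
G = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
gamma = [0.5, 0.3, 0.2]
f_star = tuple(sum(wk * gk[j] for wk, gk in zip(gamma, G)) for j in range(3))

def sq_norm(v):
    return sum(x * x for x in v)

def mean_sq_error(n, trials=20000):
    # Monte Carlo estimate of E || (1/n) sum_i g_i - f* ||^2 with
    # g_1, ..., g_n drawn i.i.d. with P{g = g_k} = gamma_k
    total = 0.0
    for _ in range(trials):
        gs = random.choices(G, weights=gamma, k=n)
        f_n = tuple(sum(g[j] for g in gs) / n for j in range(3))
        total += sq_norm(tuple(u - v for u, v in zip(f_n, f_star)))
    return total / trials

# here E||g||^2 = 1 (each g_k is a unit vector), so the variance term is
variance_bound = 1.0 - sq_norm(f_star)
```

Since the expected squared error is (variance term)/n, some realization must do at least that well, which is the existence argument used in the proof below.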

Proof of Theorem 1: The proof of Theorem 1 is based on the following fact about convex combinations in a Hilbert space, which is attributed to Maurey in Pisier [25]. We denote the norm of the Hilbert space by ‖·‖.

Lemma 1: If f̄ is in the closure of the convex hull of a set G in a Hilbert space, with ‖g‖ ≤ b for each g ∈ G, then for every n ≥ 1 and every c' > b² − ‖f̄‖², there is an f_n in the convex hull of n points in G such that

    ‖f̄ − f_n‖² ≤ c'/n.    (13)

Proof: A proof of this lemma by use of an iterative approximation, in which the points of the convex combination are optimized one at a time, is due to Jones [4]. A slight refinement of his iterative Hilbert space approximation theorem is in Section VIII. The noniterative proof of Lemma 1 (credited to Maurey) is based on a law of large numbers bound, as follows. Given n ≥ 1 and δ > 0, let f* be a point in the convex hull of G with ‖f̄ − f*‖ ≤ δ/n. Thus, f* is of the form Σ_{k=1}^m γ_k g*_k with γ_k ≥ 0, Σ_{k=1}^m γ_k = 1, and g*_k ∈ G, for some sufficiently large m. Let g be randomly drawn from the set {g*_1, …, g*_m} with P{g = g*_k} = γ_k; let g_1, g_2, …, g_n be independently drawn from the same distribution as g; and let f_n = (1/n) Σ_{i=1}^n g_i be the sample average. Then E f_n = f*, and the expected value of the squared norm of the error is E‖f_n − f*‖² = (1/n) E‖g − f*‖², which equals (1/n)(E‖g‖² − ‖f*‖²) and is bounded by (1/n)(b² − ‖f*‖²). Since the expected value is bounded in this way, there must exist g_1, g_2, …, g_n for which ‖f_n − f*‖² ≤ (1/n)(b² − ‖f*‖²). Using the triangle inequality and ‖f̄ − f*‖ ≤ δ/n, the proof of Lemma 1 is completed by the choice of a sufficiently small δ. □

Fix a bounded measurable set B that contains the point x = 0 and a positive constant C. If it is shown that, for functions in the class Γ_{C,B}, the function f̄(x) = f(x) − f(0) is in the closure of the convex hull of G_φ in L₂(μ, B), then it will follow by Lemma 1 that there exists a convex combination of n sigmoidal functions such that the square of the L₂(μ, B) norm of the approximation error is bounded by a constant divided by n. Therefore, the main task is to demonstrate the following theorem.

Theorem 2: For every function f in Γ_{C,B} and every sigmoidal function φ, the function f(x) − f(0) is in the closure of the convex hull of G_φ, where the closure is taken in L₂(μ, B).

The method used here to prove Theorem 2 is motivated by the techniques used in Jones [4] to prove convergence rate results for projection pursuit approximation, and in Jones [26] to prove the denseness property of sigmoidal networks in the space of continuous functions.

Proof: Let F̃(dω) = e^{iθ(ω)} F(dω) denote the magnitude and phase decomposition in the Fourier representation of f. Then

    f̄(x) = f(x) − f(0)
         = Re ∫ (e^{iω·x} − 1) F̃(dω)
         = Re ∫ (e^{iω·x} − 1) e^{iθ(ω)} F(dω)
         = ∫ (cos(ω·x + θ(ω)) − cos(θ(ω))) F(dω)
         = ∫ (C_{f,B}/|ω|_B)(cos(ω·x + θ(ω)) − cos(θ(ω))) Λ(dω)
         = ∫ g(x, ω) Λ(dω),    (14)

for x ∈ B, where C_{f,B} = ∫ |ω|_B F(dω) ≤ C is assumed to be bounded; Λ(dω) = |ω|_B F(dω)/C_{f,B} is a probability distribution; |ω|_B = sup_{x∈B} |ω·x|; and

    g(x, ω) = (C_{f,B}/|ω|_B)(cos(ω·x + θ(ω)) − cos(θ(ω))).    (15)

Note that these functions are bounded by |g(x, ω)| ≤ C|ω·x|/|ω|_B ≤ C, for x in B and ω ≠ 0. The integral in (14) represents f̄ as an infinite convex combination of functions in the class

    G_cos = {(γ/|ω|_B)(cos(ω·x + b) − cos(b)): ω ≠ 0, |γ| ≤ C, b ∈ R}.    (16)

It follows that f̄ is in the closure of the convex hull of G_cos. This can be seen by Riemann–Stieltjes integration theory in the case that F has a continuous density function on R^d. More generally, it follows from a law of large numbers. Indeed, if ω_1, ω_2, …, ω_n is a random sample of n points, independently drawn from the distribution Λ, then by Fubini's theorem, the expected square of the L₂(μ, B) norm is

    E ∫_B (f̄(x) − (1/n) Σ_{i=1}^n g(x, ω_i))² μ(dx) = (1/n) ∫_B E(g(x, ω_1) − f̄(x))² μ(dx) ≤ C²/n.    (17)

Thus, the mean value of the squared L₂(μ, B) norm of a convex combination of n points in G_cos converges to zero as n → ∞. (Note that it converges at rate O(1/n), in accordance with Lemma 1.) Therefore, there exists a sequence of convex combinations of points in G_cos that converges to f̄ in L₂(μ, B). We have proven the following.

Lemma 2: For each f in Γ_{C,B}, the function f(x) − f(0) is in the closure of the convex hull of G_cos.

Next it is shown that functions in G_cos are in the closure of the convex hull of G_step. Consider the linear function z = α·x, where α = ω/|ω|_B for some ω ≠ 0. For x in B, the variable z = α·x takes values in a subset of [−1, 1]. Therefore, it suffices to examine the

approximation of the sinusoidal function g on [−1, 1]. Note that g has derivative bounded by |g'| ≤ C. Now, since g is uniformly continuous on [−1, 1], it follows that it is uniformly well approximated by piecewise constant functions, for any sequence of partitions of [−1, 1] into intervals of maximum width tending to zero. Such piecewise constant functions may be represented as linear combinations of unit step functions. Moreover, it can be arranged that the sum of the absolute values of the coefficients of the linear combination is bounded by 2C.

In particular, consider first the function g(z) restricted to 0 ≤ z ≤ 1, and note that g(0) = 0. For a partition 0 = t_0 < t_1 < ⋯ < t_k = 1, define

    g_{k,+}(z) = Σ_{i=1}^{k−1} (g(t_i) − g(t_{i−1})) 1_{{z ≥ t_i}}.    (18)

This piecewise constant function interpolates the function g at the points t_i for i ≤ k − 1. Note that g_{k,+} is a linear combination of step functions. Now, since the derivative of g is bounded by C on [0, 1], it follows that the sum of the absolute values of the coefficients Σ_i |g(t_i) − g(t_{i−1})| is bounded by C. In a similar way, define g_{k,−}(z) = Σ_{i=1}^{k−1} (g(−t_i) − g(−t_{i−1})) 1_{{z ≤ −t_i}}. Adding these components, g_{k,−}(z) + g_{k,+}(z) yields a sequence of piecewise constant functions on [−1, 1] that are uniformly close to g(z) (as the maximum interval width tends to zero), and each of these approximating functions is a linear combination of step functions with the sum of the absolute values of the coefficients bounded by 2C.

It follows that the functions g(z) are in the closure of the convex hull of the set of functions γ step(z − t) and γ step(−z − t) with |γ| ≤ 2C and |t| ≤ 1, where step(z) = 1_{{z≥0}} denotes the unit step function. Defining

    G_step = {γ step(α·x − t): |γ| ≤ 2C, |α|_B = 1, |t| ≤ 1},    (19)

the following lemma has been demonstrated, where the closure property holds with the supremum norm on B, and hence with respect to L₂(μ, B).

Lemma 3: G_cos is in the closure of the convex hull of G_step.

It can be seen that Lemma 3 continues to work if, for each α, the parameter t is restricted to a subset T_α that is dense in [−1, 1]. In particular, restrict t to the continuity points of the distribution of z = α·x induced by the measure μ on R^d. Let G'_step be the subset of step functions in G_step with locations t restricted in this way. Then the following result holds.

Lemma 3': G_cos is in the closure of the convex hull of G'_step.

Functions in G'_step are in the closure of the class of sigmoidal functions G_φ, taking the closure in L₂(μ, B). This follows by taking the sequence of sigmoidal functions φ(|a|(α·x − t)) with |a| → ∞. This sequence has pointwise limit equal to step(α·x − t) (except possibly for x in the set with α·x − t = 0, which has μ measure zero by the restriction imposed on t). Consequently, by the dominated convergence theorem, the limit also holds in L₂(μ, B). Thus, the desired closure property holds.

Lemma 4: G'_step is in the closure of G_φ.

Together, Lemmas 2, 3', and 4 show that, in L₂(μ, B),

    Γ⁰_C ⊂ c̄o G_cos ⊂ c̄o G'_step ⊂ c̄o G_φ,    (20)

where c̄o G denotes the closure in L₂(μ, B) of the convex hull of G, and Γ⁰_C is the set of functions in Γ_C with f(0) = 0. Here, the fact is used that the closure of a convex set is convex, so that it is not necessary to take the convex hull operation twice when combining the lemmas. This completes the proof of Theorem 2. □

The proof of Theorem 1 is completed by using Lemma 1 and Theorem 2. To see that the constant in Theorem 1 can be taken to equal (2C)² for functions f in Γ_{C,B}, proceed as follows. The approximation bound is trivially true if ‖f̄‖ = 0, where f̄ = f(x) − f(0), for then f(x) is equal to a constant μ-almost everywhere on B. So now suppose that ‖f̄‖ > 0. If the sigmoidal function is bounded by one, then the functions in G_φ are bounded by b = 2C. Consequently, any c' greater than (2C)² − ‖f̄‖² is a valid choice for the application of Lemma 1. The conclusion is that there is a convex combination of n functions in G_φ for which the square of the L₂(μ, B) norm of the approximation error is bounded by (2C)²/n.

If the sigmoidal function φ(x) is not bounded by one, first use Lemma 1 and the conclusion that Γ⁰_C ⊂ c̄o G'_step to obtain a convex combination of n functions in G'_step for which the squared L₂(μ, B) norm of the approximation error is bounded by (2C)² − (1/2)‖f̄‖² divided by n. Then, using Lemma 4, by a suitable choice of scale of the sigmoidal function, sufficiently accurate replacements to the step functions can be obtained such that the resulting convex combination of n functions in G_φ yields a squared L₂(μ, B) norm bounded by (2C)²/n. This completes the proof of Theorem 1. □

Note that the argument given above simplifies slightly in the case that the distribution of α·x is continuous for every α, for then G_step can be used in place of G'_step and there would be no need for Lemma 3'.

A variant of the theory just developed is to replace the function step(z) with the function step_φ(z), which is the same as the unit step function except at z = 0, where it is set to equal φ(0). By a modification of the proof of Lemma 3, it can be shown that G_cos is in the closure of the convex hull of G_{step_φ}. The advantage of this variant is that G_{step_φ} is in the closure of G_φ without any restriction on the location of the points t in [−1, 1]. But if |φ(0)| > 1, then an additional argument would still be needed (as above, where t is restricted to the continuity points of α·x) in order to show that the constant c' can be taken to be not larger than (2C)².

V. APPROXIMATION ON OTHER BOUNDED DOMAINS IN R^d

In this brief section, the form of the constant C_{f,B} = ∫ |ω|_B |f̃(ω)| dω in the approximation bound is determined for various choices of the bounded set B other than the Euclidean ball of radius r mentioned in the Introduction.

Recall that, by definition, |ω|_B = sup_{x∈B} |ω·x|. The interpretation is that |ω|_B bounds the domain of the trigonometric component e^{iω·x} that has frequency ω in the Fourier representation of the function restricted to x in B.

Clearly, if B is contained in a ball of radius r, that is, if |x| ≤ r for x in B, then, by the Cauchy–Schwarz inequality, |ω|_B ≤ r|ω|. Thus,

    C_{f,B} ≤ rC_f.    (21)

However, for some natural sets B, a fairly large radius ball would be required for application of the bound in that form. It is better to determine |ω|_B directly in some cases. If B is a multiple of a unit ball with respect to some norm on R^d, then |ω|_B is determined by the dual norm. In particular, if B = B_{∞,r} = {x: |x|_∞ ≤ r} is the l_∞ ball of radius r (the cube centered at 0 with sidelength 2r), then |ω|_B = r|ω|_1, where |ω|_1 is the l_1 norm, and

    C_{f,B} = r ∫ |ω|_1 |f̃(ω)| dω.    (22)

More generally, if B = B_{p,r} = {x: |x|_p ≤ r} is the l_p ball of radius r, then |ω|_B = r|ω|_q, where 1/p + 1/q = 1 (as can be seen by a standard application of Hölder's inequality).

VI. REFINEMENT

In the above analysis, the approximation results were proved by allowing the magnitude of the parameters a_k to be arbitrarily large. The absence of restrictions on |a_k| yields a difficult problem of searching an unbounded domain. Large values of |a_k| contribute to large derivatives of the sigmoidal function, which can also lead to difficulties of computation. In this section, we control the growth of |a_k| and bound the effect on the approximation error. Knowledge of the relationship between the magnitude of the parameters and the accuracy of the network makes it possible to bound the index of resolvability of sigmoidal networks as in [11]. Bounds on the parameters are also required in the metric entropy computations, as in White [12, Lemma 4.3] and Haussler [13].

Given τ > 0, C > 0, and a bounded set B, let

    G_{φ,τ} = {γ φ(τ(a·x + b)): |γ| ≤ 2C, |a|_B ≤ 1, |b| ≤ 1}.    (27)

This is the class of bounded multiples of a sigmoidal function, with the scale parameter of the sigmoid not larger than τ. We
Here the r desire to bound the approximation error achievable by convex norm is given by l for 1 ::; x, Ip Ixlp = CL:�=l IXiIP) /p p < combinations of functions in Gq".,.. and maxi I for p = x. Thus, n Ixloo = IXi Theorem 3: For every E re, every f B, r > 0, n 2: 1, (23) probability measure po, and every sigmoidal function 4> with ° S 4>(x) S 1, there is a function fn in the convex hull of n The approximation bound becomes functions in G¢,.,. such that 2 (f(x) - fn X)) MCdx) /2 + (28) .Lv. r ( 111- fnll ::; 2C ( 71,; D.,.) )2 (24) (2: (./lwlqIJ(w)1 dW) 2, where 11·11 denotes the L (M, B) norm, lCx) f(x) - f(O), ::; and 2 = for some network fn of the form (1). Note also that the center of the domain of integration may be taken to be any point Xo in B not necessarily equal to O. (This follows by a simple argument, since the magnitude of the Here, D.,. is a distance between the unit step function and the Fourier transform is unchanged by translation.) In particular, scaled sigmoidal function. Note that D.,. --;0 as r --; 00. If 4> is the unit step function, then 0 for all r > 0, and for any cube C of side length s, the result becomes b.,. = Theorem 3 reduces to Theorem 1. If 4>is the logistic sigmoidal function 1 for some network fn of the form (1). In like manner, a scaling 4>(z) = l+e-z' (30) argument shows that if Rect is any rectangle with side lengths then 14>(rz)+I{z>o}1 e-TE for Setting (lnr)/r SI, S ,"', Sd, then there is a network fn such that S Izl2: E. f = 2 yields (f(x) - fnCX»2MCdx) 1+21nr (}.,.c S (31) .Lect ---­r (26) (�S}IWiIIJ(w)1 dW 2 Therefore, if we set r n1/21n then from Theorem 3, for ::; � ) 2: 71" functions in r f c, B, In general, for a bounded set fl, the point Xo to take for the centering that would lead to the smallest approximation bound (32) is one such that C is minimized f, B, Xu = J IwIB, Xo IJ(w)ldw where In this context, the IwlB,xo = sUPxEB Iw· (x - xo)l. Similar conclusions hold for other sigmoidal functions. 
The representation (4) would become 1 size of rn required to preserve the order (l/n) /2 approxima­ tion depends on the rate at which the sigmoid approach the fCx) = f(xo) + (eiO!x - eiw.xO)J(w) w. ./ d limits of 0 and 1 as x --; ±oo. BARRON: UNIVERSAL APPROXIMATION BOUNDS FOR SUPERPOSITIONS OF A SIGMOIDAL FUNCTION 937
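As a small numerical aside (not part of the original analysis), the quantity δ_τ = inf_{0<ε≤1/2}(2ε + sup_{|z|≥ε}|φ(τz) − 1{z>0}|) is easy to evaluate for the logistic sigmoid, since the supremum there equals φ(−τε). The sketch below minimizes over a grid of ε and compares the result with the bound (1 + 2 ln τ)/τ; the grid size is an arbitrary choice.

```python
import math

def delta_tau(tau, grid=10000):
    """delta_tau = inf over 0 < eps <= 1/2 of [2*eps + sup_{|z|>=eps}|phi(tau z) - 1{z>0}|]
    for the logistic sigmoid, where the sup equals phi(-tau*eps) = 1/(1 + e^{tau*eps})."""
    best = float("inf")
    for i in range(1, grid + 1):
        eps = 0.5 * i / grid
        best = min(best, 2.0 * eps + 1.0 / (1.0 + math.exp(tau * eps)))
    return best

for tau in (10, 100, 1000):
    print(f"tau={tau:5d}  delta_tau ~ {delta_tau(tau):.4f}"
          f"  vs (1+2 ln tau)/tau = {(1 + 2 * math.log(tau)) / tau:.4f}")
```

The numeric minimum sits slightly below the (1 + 2 ln τ)/τ bound obtained from the particular choice ε = (ln τ)/τ, and both decay at the 1/τ-up-to-log rate the text describes.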

The proof of Theorem 3 is based on the following result for the univariate case. Let g be a function with a bounded derivative on [−1, 1]. Assume that 0 is in the range of the function g. Let g_τ denote a function in the convex hull of the univariate class {γ φ(τ(z − t)): |γ| ≤ 2C, |t| ≤ 1}.

Lemma 5: If g is a function on [−1, 1] with derivative bounded by a constant C, then for every τ > 0,

    inf_{g_τ} sup_{|z|≤1} |g(z) − g_τ(z)| ≤ 2C δ_τ.    (33)

Proof: The proof of Lemma 5 is as follows. Given 0 < ε ≤ 1/2, let k be an integer satisfying (1/ε) − 1 < k ≤ 1/ε. Partition [−1, 1] into k intervals of width 2/k. Then approximate the function g by a linear combination of k unit step functions as in the proof of Lemma 3. Since the derivative is bounded by C, the error of the approximation is bounded by 2C/k for all z in [−1, 1]. Replacing each step function 1{z ≥ t_i} (or 1{z ≤ t_i}) by the sigmoidal function φ(τ(z − t_i)) (or φ(−τ(z − t_i)), respectively), one obtains a function g_τ in the convex hull of k functions in G_{φ,τ} which has error bounded by

    |g(z) − g_τ(z)| ≤ 2C/k + 2C sup_{|y|≥1/k} |φ(τy) − 1{y>0}|    (34)

for every z in [−1, 1], where 2C/k bounds the contribution to the error from the replacement of the unit step function by the sigmoidal function centered at the point t_i closest to z, and 2C sup_{|y|≥1/k} |φ(τy) − 1{y>0}| bounds the cumulative errors from the other sigmoidal functions (each of which is centered at distance at least 1/k from the point z). Since ε ≤ 1/k ≤ ε/(1 − ε), it follows that

    |g(z) − g_τ(z)| ≤ 2C ε/(1 − ε) + 2C sup_{|y|≥ε} |φ(τy) − 1{y>0}|.    (35)

Using ε/(1 − ε) ≤ 2ε for 0 < ε ≤ 1/2, and taking the infimum over ε, completes the proof of Lemma 5. □

Proof of Theorem 3: The proof of Theorem 3 is as follows. From Lemma 2, f̃ is in the closure of the convex hull of functions in G_cos. The functions in G_cos are univariate functions g(z) evaluated at a linear combination z = α·x, with |g′(z)| ≤ C and |z| ≤ 1. Each such function g is approximated by a function g_τ as in Lemma 5, with supremum error bounded by a quantity arbitrarily close to 2C δ_τ. It follows that there is a function f_τ in the closure of the convex hull of G_{φ,τ} such that ||f̃ − f_τ|| ≤ 2C δ_τ. From Lemma 1, there is an f_n in the convex hull of n points in G_{φ,τ} such that ||f_τ − f_n|| ≤ 2C/n^{1/2}. By the triangle inequality, this completes the proof of Theorem 3. □

VII. EXTENSION

An extension of the theory is to replace R^d by a (possibly infinite-dimensional) Hilbert space H, where now w·x denotes the Hilbert space inner product, and |·| denotes the Hilbert space norm. For instance, H may be the space L² of square-integrable functions (x(t), 0 ≤ t ≤ 1) with inner product w·x = ∫_0^1 w(t)x(t) dt. A real-valued function f(x) of the variable x ∈ H is to be approximated. The Fourier representation we require is that there is a complex-valued measure F(dw) = e^{iθ(w)} |F|(dw) on H such that f(x) = ∫_H e^{iw·x} F(dw), or f(x) = f(0) + ∫_H (e^{iw·x} − 1) F(dw).

Theorem 4: Let f(x), x ∈ H, be a function on a Hilbert space H with C_f = ∫_H |w| |F|(dw) < ∞; then for every r > 0, every sigmoidal function φ on R¹, every probability measure μ on H, and every n ≥ 1, there is a linear combination of sigmoidal functions f_n(x) = Σ_{k=1}^n c_k φ(a_k·x + b_k) + c_0 such that

    ∫_{B_r} (f(x) − f_n(x))² μ(dx) ≤ (2rC_f)²/n,

where B_r = {x ∈ H: |x| ≤ r} is the Hilbert space ball of radius r. The parameters a_k take values in the Hilbert space, while the other parameters are real-valued. With modification to the approximation bound, the norms of the parameters may be restricted in the same way as in Theorem 3.

For an interesting class of examples in this Hilbert space setting, let R(t, s) be a positive definite function on [0, 1]² (a valid covariance function for a Gaussian process on [0, 1]), and suppose for simplicity that R(t, t) = 1. Let f be the function defined by

    f(x) = exp{−∫_0^1 ∫_0^1 x(s)x(t)R(s, t) ds dt / 2}    (36)

for x square-integrable on [0, 1]. Note that f(x) is the characteristic function of the Gaussian process (w(t), 0 ≤ t ≤ 1) with mean zero and covariance R(s, t) = E(w(s)w(t)); that is, if F is the Gaussian measure on H, then

    ∫ e^{iw·x} F(dw) = E[e^{iw·x}] = exp{−∫_0^1 ∫_0^1 x(s)x(t)R(s, t) ds dt / 2}.    (37)

Now, from the identity E(w²(t)) = R(t, t) = 1, it follows that E|w|² = ∫_0^1 E w²(t) dt = 1. Therefore, the constant C_f in the approximation bound satisfies

    C_f = ∫ |w| F(dw) = E|w| ≤ (E|w|²)^{1/2} = 1.    (38)

Thus, for any probability measure μ on H and for any sigmoidal function φ on R¹, it follows from the theorem that for this infinite-dimensional example there exists f_n(x) such that

    ∫_{|x|≤1} (f(x) − f_n(x))² μ(dx) ≤ 4/n.    (39)

An even more general context may be treated, in which the nonlinear functions are defined on a normed linear space. Let B be a bounded subset of a normed linear space X, and let w take values in a set of bounded linear operators on X (the dual space of X). Now w·x denotes the operator w applied to x, and |w|_B = sup_{x∈B} |w·x| denotes the norm of the operator restricted to B. If there is a measurable set of w's and some complex-valued measure F(dw) on this set, such that the function f has the representation f(x) = ∫ e^{iw·x} F(dw), or f(x) = f(0) + ∫ (e^{iw·x} − 1) F(dw), with C_{f,B} = ∫ |w|_B |F|(dw) finite, then for every n ≥ 1, every probability measure μ on X, and every sigmoidal function φ, there will exist f_n = Σ_{k=1}^n c_k φ(a_k·x + b_k) + c_0 such that

    ∫_B (f(x) − f_n(x))² μ(dx) ≤ (2C_{f,B})²/n,    (40)

where now the a_k's take values in the dual space of X.

One context for this more general result is the case that X is the set of bounded signals (x(t), 0 ≤ t ≤ 1), B = {x: sup_t |x(t)| ≤ r}, and w·x = ∫_0^1 w(t)x(t) dt, where the bounded linear operators w are identified with integrable functions on [0, 1]. Then |w|_B = r ∫_0^1 |w(t)| dt, the Fourier distribution F(dw) would be a measure supported on a set of w in L_1, and the a_k's would be integrable functions on [0, 1].

VIII. ITERATIVE APPROXIMATION

In this section, it is seen that the bounds in Theorems 1, 3, and 4 can be achieved by an iterative sequence of approximations taking the form

    f_n(x) = α_n f_{n−1}(x) + c_n φ(a_n·x + b_n).    (41)

The optimization is restricted to the parameters α_n, c_n, a_n, and b_n of the nth node, with the parameter values from earlier nodes held fixed. This iterative formulation considerably reduces the complexity of the surface to be optimized at each step.

This reduction in the complexity of the surface is particularly useful in the case that the function f is only observed at sites x_1, x_2, ..., x_N in a bounded set B. The iterative approximation theory shows that to find an estimate with average squared error (1/N) Σ_{i=1}^N (f(x_i) − f_n(x_i))² ≤ c′/n, it suffices to optimize the parameters of the network one node at a time. Avoiding global optimization has computational benefits. The error surface is still multimodal as a function of the parameters of the nth node, but there is a reduction in the dimensionality of the search problem by optimizing one node at a time.

A recent result of Jones [4] on iterative approximation in a Hilbert space is the key to the iterative approximation bound in the neural network case. As in the noniterative case, the applicability of Jones' theorem is based on our demonstration that functions in Γ_C are in the closure of the convex hull of G_φ.

To avoid cluttering the notation in this section, the notation f (instead of f̃) is used to denote the point to be approximated by elements of the convex hull. As before, for the application to the approximation by sigmoidal networks of functions in Γ_{C,B}, one subtracts off the value of the function at x = 0 to obtain the function in Γ̃_C which is approximated by functions in the convex hull of G_φ.

Let G be a subset of a Hilbert space. Let f_n be a sequence of approximations to an element f that take the form

    f_n = α_n f_{n−1} + ᾱ_n g_n,    (42)

where ᾱ_n = (1 − α_n), 0 ≤ α_n ≤ 1, and g_n ∈ G. Here α_n and g_n are chosen to achieve a nearly minimal value of ||α f_{n−1} + ᾱ g − f||. The iterations (42) are initialized with α_1 = 0, so that f_1 is a point g_1 in G that achieves a nearly minimal value of ||g_1 − f||. Note that f_n, as defined in (42), is in the convex hull of the points g_1, ..., g_n.

Jones [4] showed that if f is in the closure of the convex hull of G, then ||f_n − f||² ≤ O(1/n) for the sequence of approximations defined as in (42). Here, Jones' theorem and proof are presented with a minor refinement. The constant in the approximation bound is improved so as to agree with the constant in the noniterative version (Lemma 1). As noted by Jones, the error ||α f_{n−1} + ᾱ g − f|| need not be exactly minimized; here it is shown that it is enough to achieve a value within O(1/n²) of the infimum on each iteration.

Theorem 5: Suppose f is in the closure of the convex hull of a set G in a Hilbert space, with ||g|| ≤ b for each g ∈ G. Set b_f² = b² − ||f||². Suppose that f_1 is chosen to satisfy ||f_1 − f||² ≤ inf_{g∈G} ||g − f||² + ε_1 and, iteratively, f_n is chosen to satisfy

    ||f_n − f||² ≤ inf_{0≤α≤1} inf_{g∈G} ||α f_{n−1} + ᾱ g − f||² + ε_n,    (43)

where ᾱ = 1 − α, c′ ≥ b_f², p = c′/b_f² − 1, and

    ε_n ≤ p c′ / (n(n + p)).    (44)

Then for every n ≥ 1,

    ||f − f_n||² ≤ c′/n.    (45)

Proof: The proof of Theorem 5 is as follows. We show that if f is in the closure of the convex hull of G, then, for any given f_{n−1} and 0 ≤ α ≤ 1,

    inf_{g∈G} ||α f_{n−1} + ᾱ g − f||² = inf_{g∈G} ||α(f_{n−1} − f) + ᾱ(g − f)||²
                                      ≤ α² ||f_{n−1} − f||² + ᾱ² b_f².    (46)

The proof is then completed by setting α = b_f²/(b_f² + ||f_{n−1} − f||²) to get

    ||f_n − f||² ≤ b_f² ||f_{n−1} − f||² / (b_f² + ||f_{n−1} − f||²) + ε_n,    (47)

or, equivalently,

    1/(||f_n − f||² − ε_n) ≥ 1/||f_{n−1} − f||² + 1/b_f².    (48)

Equation (48) provides what is needed to verify that ||f_n − f||² ≤ c′/n by an induction argument. Indeed, (46) with α = 0 shows that the desired inequality is true for n = 1. Suppose ||f_{n−1} − f||² ≤ c′/(n − 1); then plugging this into (48) and using c′ = b_f²(1 + p) yields

    1/(||f_n − f||² − ε_n) ≥ (n − 1)/c′ + 1/b_f² = (n − 1)/c′ + (1 + p)/c′ = (n + p)/c′.    (49)
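The greedy scheme of Theorem 5 is easy to simulate in a finite-dimensional Hilbert space. The sketch below is an illustration only (the dimensions, the random finite set G, and the target f are arbitrary choices, not from the paper): it runs the update f_n = α f_{n−1} + (1 − α) g_n with exact minimization over G (so ε_n = 0) and checks the guarantee ||f_n − f||² ≤ b_f²/n, i.e., (45) with c′ = b_f².

```python
import numpy as np

# Greedy iteration f_n = alpha*f_{n-1} + (1-alpha)*g from Theorem 5, with
# exact minimization over a finite G (eps_n = 0), so c' = b_f^2 is admissible.
rng = np.random.default_rng(0)
m, N = 20, 40                                   # ambient dimension, |G|
G = rng.standard_normal((N, m))
G /= np.linalg.norm(G, axis=1, keepdims=True)   # every g in G has norm b = 1
w = rng.random(N)
w /= w.sum()
f = w @ G                                       # f lies in the convex hull of G
b2f = 1.0 - f @ f                               # b_f^2 = b^2 - ||f||^2

fn = np.zeros(m)
for n in range(1, 201):
    best_err, best = np.inf, None
    for g in G:
        # minimize ||alpha*fn + (1-alpha)*g - f||^2 = ||e + alpha*d||^2 over alpha
        d, e = fn - g, g - f
        dd = d @ d
        alpha = float(np.clip(-(d @ e) / dd, 0.0, 1.0)) if dd > 0 else 0.0
        cand = alpha * fn + (1.0 - alpha) * g
        err = float((cand - f) @ (cand - f))
        if err < best_err:
            best_err, best = err, cand
    fn = best
    assert best_err <= b2f / n + 1e-9           # the c'/n bound of (45)

print(f"after n=200 nodes: squared error {best_err:.2e} <= b_f^2/n = {b2f/200:.2e}")
```

Only one new element g_n is fit per step, while all earlier elements are frozen, which is exactly the node-at-a-time training strategy the section describes.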

Reciprocating and using the assumed bound on ε_n yields

    ||f_n − f||² ≤ c′/(n + p) + ε_n ≤ c′/(n + p) + p c′/(n(n + p)) = c′/n,    (50)

as desired.

Thus, it remains to verify (46). Given δ > 0, let f* be a point in the convex hull of G with ||f − f*|| ≤ δ. Thus f* is of the form Σ_{k=1}^m r_k g*_k with g*_k ∈ G, r_k ≥ 0, Σ_{k=1}^m r_k = 1, for some sufficiently large m. Then,

    ||α(f_{n−1} − f) + ᾱ(g − f)|| ≤ ||α(f_{n−1} − f) + ᾱ(g − f*)|| + δ.    (51)

Expanding the square yields

    ||α(f_{n−1} − f) + ᾱ(g − f*)||² = α²||f_{n−1} − f||² + ᾱ²||g − f*||² + 2αᾱ(f_{n−1} − f, g − f*),    (52)

where (·, ·) denotes the inner product. Now the average value of the last two terms is, for g ∈ {g*_1, ..., g*_m} chosen with probabilities r_1, ..., r_m,

    Σ_{k=1}^m r_k (ᾱ²||g*_k − f*||² + 2αᾱ(f_{n−1} − f, g*_k − f*))
        = ᾱ² Σ_{k=1}^m r_k ||g*_k − f*||² + 0
        = ᾱ² (Σ_{k=1}^m r_k ||g*_k||² − ||f*||²)
        ≤ ᾱ² (b² − ||f*||²).    (53)

Since the average value is bounded in this way, there must exist g ∈ {g*_1, ..., g*_m} such that

    ||α(f_{n−1} − f) + ᾱ(g − f*)||² ≤ α²||f_{n−1} − f||² + ᾱ²(b² − ||f*||²).    (54)

Now by the triangle inequality, ||f*|| ≥ ||f|| − δ. So using (51) and letting δ → 0, it follows that

    inf_{g∈G} ||α(f_{n−1} − f) + ᾱ(g − f)||² ≤ α²||f_{n−1} − f||² + ᾱ²(b² − ||f||²),    (55)

as desired. This completes the proof of Theorem 5. □

Inspection of the proof shows an alternative optimization that may provide further simplification in some cases. Instead of minimizing the sum of squares ||α f_{n−1} + ᾱ g − f||² at each iteration, one may instead choose g ∈ G to maximize the inner product (f − f_{n−1}, g). (In this case, one can derive the bound ||f − f_n||² ≤ (2b)²/n.) For sigmoids, the search task reduces to finding the parameters a_n and b_n such that the inner product of φ(a·x + b) and f − f_{n−1} is maximized. The function f_n depends linearly on the other parameters α_n and c_n in (41), so they may be determined by ordinary least squares.

IX. PROPERTIES AND EXAMPLES OF FUNCTIONS IN Γ

In this section, several properties and examples of functions f(x) are presented for which the Fourier integral C_f = ∫ |w| F(dw) is evaluated or bounded, where F(dw) is the magnitude distribution of the Fourier transform. Note that C_f = C_{f,B} in the case that B is the unit ball centered at zero. Examples are also given of classes of functions in Γ*, that is, functions on R^d that are contained in Γ_B when restricted to any bounded set B. The simpler facts are stated without proof.

1) Translation: If f(x) ∈ Γ_C, then f(x + b) ∈ Γ_C.

2) Scaling: If f(x) ∈ Γ_C, then f(ax) ∈ Γ_{|a|C}.

3) Combination: If f_i(x) ∈ Γ_{C_i}, then Σ β_i f_i ∈ Γ_{Σ |β_i| C_i}.

4) Gaussian: If f(x) = e^{−|x|²/2}, then C_f ≤ d^{1/2}. Indeed, f̂(w) = (2π)^{−d/2} e^{−|w|²/2} and C_f = ∫ |w| f̂(w) dw, which is bounded by (∫ |w|² f̂(w) dw)^{1/2} = d^{1/2}.

5) Positive Definite Functions: C_f ≤ (−f(0)∇²f(0))^{1/2}. A positive definite function f(x) is one such that Σ_{i,j} c_i c_j f(x_i − x_j) is nonnegative for all x_1, x_2, ..., x_k in R^d and all real c_1, ..., c_k. Positive definite functions arise as covariance functions for random fields and as characteristic functions for probability distributions on R^d. The essential property (due to Bochner) is that continuous positive definite functions are characterized as functions that have a Fourier representation f(x) = ∫ e^{iw·x} F(dw) in terms of a positive real-valued measure F. If f is a twice continuously differentiable positive definite function, then by the Cauchy–Schwarz inequality, ∫ |w| F(dw) ≤ (∫ F(dw) ∫ |w|² F(dw))^{1/2} = (−f(0)∇²f(0))^{1/2}, where ∇²f(x) = Σ_{i=1}^d ∂²f(x)/∂x_i². (Positive definite functions have a maximum at x = 0, so ∇²f(0) ≤ 0.) What is noteworthy about this class of functions for our approximation purposes is that C_f is bounded in terms of behavior at a single point x = 0. Moreover, since ∇²f(0) is a sum of d terms, it is plausible to model a moderate behavior of the constant C_f, such as order d^{1/2}, for positive definite functions of many variables.

6) Integral Representations: Suppose f(x) = ∫ K(a(x + b)) G(da, db) for some location and scale mixture of a function K in Γ, for a > 0 and b ∈ R^d. (For instance, K(x) may be a Gaussian density or other positive definite kernel on R^d.) Then C_f ≤ C_K ∫ |a| |G|(da, db). In the same way, if the function has a representation f(x) = ∫ K(a·x + b) G(da, db), for a ∈ R^d and b ∈ R¹, for some K(z) on R¹ and some signed measure G(da, db), then C_f ≤ C_K ∫ |a| |G|(da, db).

7) Ridge Functions: If f(x) = g(a·x) for some direction a ∈ R^d with |a| = 1 and some univariate function g(z) with integrable Fourier transform ĝ(t) on R¹, then f has a Fourier representation in terms of a singular distribution F(dw) concentrated on the set of w in R^d in the direction a; that is, f(x) = ∫ e^{ita·x} ĝ(t) dt. In this case, C_f = C_g = ∫_{R¹} |t| |ĝ(t)| dt. If f(x) = g(a·x) for some a ∈ R^d with |a| = 1 and the derivative of g is a continuous positive definite function on R¹, then C_f = C_g = g′(0). Note that C_f is independent of the dimension d.

8) Sigmoidal Functions on R^d: These are ridge functions of the form f(x) = φ(a·x + b) for a sigmoidal function φ(z) with integrable Fourier transform φ̂(t). In this case, f is in Γ with C_f = |a| ∫ |t| |φ̂(t)| dt. Using the closure properties for linear combinations and compositions, it is seen that certain multiple-layer sigmoidal networks are also in Γ.

9) Radial Functions: Suppose f(x) = g(|x|) is a function that depends on x only through the magnitude |x| (i.e., the angular components of f(x) are constant), and that f has a Fourier representation f(x) = ∫ e^{iw·x} f̂(w) dw. Then f̂(w) is also a radial function; that is, f̂(w) = g̃(|w|) for some function g̃ on R¹. Integrating |w||f̂(w)| using polar coordinates yields C_f = s_d ∫_0^∞ r^d |g̃(r)| dr, where s_d is the (d − 1)-dimensional volume of the unit sphere in R^d. The factor r^d in the integrand suggests that C_f is typically exponentially large in d; for an exception, the Gaussian in example 4) is a radial function with C_f ≤ d^{1/2}.

10) Sigmoidal Approximation with an Augmented Input Vector: For a d-dimensional input x, let x′ in R^{2d} consist of the coordinates of x and the squares of these coordinates. Then, with the unit step function for φ, the terms in the approximation f_n(x′) = Σ c_k step(a_k·x′ + b_k), with a_k ∈ R^{2d}, consist of indicators of certain ellipsoidal regions. Functions f(x) on R^d can have a significantly smaller value for C_f when represented as a function of x′ on R^{2d}. In particular, consider functions of the form f(x) = g(Σ a_i x_i²) on R^d with Σ a_i² = 1. These functions include the radial functions and may be interpreted as a ridge function in the squared components. In this case, if g has a Fourier transform ĝ on R¹, then there is a representation of the function f as a function on R^{2d}, with C_f given by the one-dimensional integral ∫_{R¹} |t| |ĝ(t)| dt. This potential improvement in the constant in the approximation bound helps justify the common practice of including the squares of the inputs in the sigmoidal network.

For the following examples, let Γ(a, c) ⊂ Γ_c be the class of functions f(x) on R^d for which there is a Fourier representation f(x) = ∫ e^{iw·x} F(dw) with magnitude distribution F(dw) satisfying ∫ F(dw) ≤ a and ∫ |w| F(dw) ≤ c.

11) Products of Functions in Γ: If f_1 ∈ Γ(a_1, c_1) and f_2 ∈ Γ(a_2, c_2), then the product f_1(x)f_2(x) is a function in Γ(a_1 a_2, a_1 c_2 + a_2 c_1). This follows by applying Young's inequality to the Fourier representations of the product f(x)g(x) and its gradient f(x)∇g(x) + g(x)∇f(x). Together with property 3), this shows that the class of functions in Γ for which both ∫ F(dw) and ∫ |w| F(dw) are finite forms an algebra of functions closed under linear combinations and products.

12) Composition with Polynomials: If g ∈ Γ(a, c), then (g(x))^k is in Γ(a^k, k a^{k−1} c). It follows that if P(z) is a polynomial function of one variable, then the composition P(g(x)) is in Γ; the same holds for compositions of polynomials of several variables with functions in Γ(a, c).

13) Composition with Analytic Functions: If g ∈ Γ(a, c) and f(z) is an analytic function represented by a power series f(z) = Σ_{k=0}^∞ a_k z^k with radius of absolute convergence r > a, then the composition f(g(x)) is in Γ(f_abs(a), c f′_abs(a)), where f_abs(z) = Σ_k |a_k| z^k.

The next examples concern functions in Γ*, that is, functions which can be modified outside of bounded sets B to produce functions in Γ. For f in Γ*, let C_{f,B} = inf_g C_{g,B}, where the infimum is over g in Γ that agree with f on the set B, C_{g,B} = ∫ |w|_B G(dw), and G is the magnitude distribution in the Fourier representation of g on R^d. For functions in Γ*, the approximation error ∫_B (f(x) − f_n(x))² μ(dx) is bounded by (2C_{f,B})²/n, for some sigmoidal network f_n of the form (1).

14) Linear Functions and Other Polynomials: If f(x) = a·x, then f is in Γ*.¹ Moreover, C_{f,B} ≤ |a|r for every set B contained in the ball {x: |x| ≤ r}. This is shown first for f(x) = x on R¹ by constructing extrapolations h(x). In particular, let h(x) = h_δ(x) have derivative h′(x) that is equal to 1 for |x| ≤ r, equal to 0 for |x| ≥ r + δ, and interpolates linearly between 1 and 0 for r ≤ |x| ≤ r + δ, for some δ > 0. Then h′(x) has a Fourier transform that can be calculated to be 2 sin(wδ/2) sin(w(r + δ/2))/(w²πδ). By a change of variables (t = wδ/2) and an application of the dominated convergence theorem, it is seen that, as δ tends to infinity, C_{h_δ} = ∫ |w| |ĥ_δ(w)| dw = ∫ |sin(t) sin(t(1 + 2r/δ))|/(t²π) dt converges to ∫ (sin t)²/(t²π) dt = 1. (This matches the intuition that, as δ tends to infinity, h_δ(x) approaches f(x) = x, which has a constant derivative equal to one, so the Fourier transform of h′_δ(x) ought to approximate a "delta function" and the integral of |ĥ′_δ| should be close to one.) Consequently, C_{f,[−r,r]} ≤ lim_{δ→∞} r C_{h_δ} = r. For f(x) = a·x on R^d, let g_δ(x) = |a| h_δ(α·x), where α = a/|a|. Then, for sets B in the ball B_r of radius r, C_{f,B} ≤ |a| inf_δ C_{h_δ,[−r,r]} ≤ |a|r. Other extrapolations of f(x) = x on [−r, r] can be constructed for which ĝ(w), as well as wĝ(w), are integrable. Together with property 12), this implies that polynomial functions (in one or several variables) are contained in Γ*. Manageable bounds on the constants C_{f,B} can be obtained in the case of sparse polynomials and in the case of multiple-layer polynomial networks, which are polynomials defined in terms of a restricted number of elementary compositions (sums and products).

¹ For an alternative treatment, in which the Fourier representation is generalized to allow for functions with a linear component on R^d, see the remarks in the Appendix.

15) Functions with Derivatives of Sufficiently High Order: If the partial derivatives of f(x) of order s = ⌊d/2⌋ + 2 are continuous on R^d, then f is in Γ*. Consider first the case that the partial derivatives of order less than or equal to s are square-integrable on R^d. In this case, f is in Γ. Indeed, write |f̂(w)||w| = a(w)b(w) with a(w) = (1 + |w|^{2k})^{−1/2} and b(w) = |f̂(w)||w|(1 + |w|^{2k})^{1/2}, where k = s − 1. By the Cauchy–Schwarz inequality, C_f = ∫ a(w)b(w) dw is bounded by the product of (∫ a²(w) dw)^{1/2} and (∫ b²(w) dw)^{1/2}. Now the integral ∫ a²(w) dw = ∫ (1 + |w|^{2k})^{−1} dw is finite for 2k > d and, by Parseval's identity, the integral ∫ b²(w) dw = ∫ |f̂(w)|²(|w|² + |w|^{2s}) dw is finite when the partial derivatives of order s and of order 1 are square-integrable on R^d. This demonstrates that f is in Γ.

Now suppose that the partial derivatives of order s are continuous, but not necessarily square-integrable, on R^d. Given r > 0, let p(x) be an s-times continuously differentiable function that is equal to 1 on B_r = {x: |x| ≤ r} and equal to 0 for |x| ≥ r′, for some r′ > r. (In particular, we can take p(x) = p_1(|x|), where p_1(z) equals 1 for z ≤ r, 0 for z ≥ r′, and interpolates by a (piecewise polynomial) spline of order s for r ≤ z ≤ r′.) Consider the function f_r(x) = f(x)p(x), which agrees with f(x) on B_r. It has continuous partial derivatives of order s which are equal to zero for |x| > r′, and hence are integrable on R^d. Consequently, for each r > 0, the function f_r(x) is in Γ. It follows that f is in Γ*. A disadvantage of characterizing approximation capabilities in terms of high-order differentiability properties is that the bound on the constant C_f can be quite large. Indeed, the integral ∫ |f̂(w)|²|w|^{2s} dw, as characterized by Parseval's identity, involves a sum of exponentially many terms, one for each partial derivative of order s.

The last three examples involve functions in Γ which have a discrete domain for either of the variables x or w.

16) Absolutely Convergent Fourier Series: Suppose f is a function on [0, 1]^d; let f_k = ∫_{[0,1]^d} e^{−i2πk·x} f(x) dx be the Fourier series coefficients of f, and suppose that f_k and k f_k are absolutely summable sequences for k ∈ Z^d, where Z^d is the set of vectors with integer coordinates. Then f is a function in Γ with the Fourier series representation f(x) = Σ_k e^{i2πk·x} f_k and C_f = 2π Σ_k |k| |f_k|. Thus, f has a Fourier distribution F restricted to the lattice of points w of the form 2πk, with F({2πk}) = f_k for k ∈ Z^d. In this case, for f_k and k f_k to be absolutely summable, it is necessary that the function f(x) possess a continuously differentiable periodic extension to R^d. It is a simple matter to periodically extend a function defined on [0, 1]^d such that f(x + k) = f(x) for all vectors of integers k; however, for functions that arise in practice, it is rare for this periodic extension to be continuous or to possess a continuous derivative on the boundary points of [0, 1]^d. (It is for this reason that in this paper the Fourier distribution has not been restricted exclusively to such a lattice; allowing the coordinates of w to take arbitrary values relaxes the boundary constraints.) In some cases, it is possible to take a function defined on a subset of [0, 1]^d and extend it in a continuously differentiable way to a function that is zero and has zero gradient on the boundary of [0, 1]^d, so that the requirement of a continuously differentiable periodic extension is satisfied. Examples of this are similar to those given in 14) and 15).

17) Functions on Z^d: Functions defined on Z^d are in Γ_B for any finite subset B of Z^d. Indeed, given such a set B, set f̂_B(w) to equal (1/2π)^d Σ_{x∈B} e^{−iw·x} f(x) for w in [−π, π]^d and to equal 0 otherwise.² Then the Fourier representation f(x) = ∫_{[−π,π]^d} e^{iw·x} f̂_B(w) dw holds for x in B. Now f̂_B(w) is a continuous function on the set [−π, π]^d, which implies that |f̂_B(w)| is bounded and C_{f,B} = ∫_{[−π,π]^d} |w|_B |f̂_B(w)| dw is finite. Consequently, f is in Γ_B. Moreover, C_{f,B} may be bounded in terms of the L_1 norm of the Fourier transform: that is, C_{f,B} ≤ πsd ∫_{[−π,π]^d} |f̂_B(w)| dw, where s = max_{x∈B} |x|_∞. As a practical matter, if d is large, then additional structure needs to be assumed of the function to have a moderate value of C_{f,B}.

² If it happens that f(x) is an absolutely summable function on Z^d, then the transform f̂(w) = (1/2π)^d Σ_x e^{−iw·x} f(x), w ∈ [−π, π]^d, may be used in place of f̂_B(w).

18) Boolean Functions: Here, we consider functions defined on B = {0, 1}^d and note that, for Boolean functions, Γ_{C,B} is related to a class of functions recently examined by Siu and Bruck [5] and Kushilevitz and Mansour [27].³ This leads to a number of additional interesting examples of functions with not excessively large values of C_{f,B}. For functions f(x) on {0, 1}^d, the Fourier distribution may be restricted to the set of w of the form πk with k ∈ {0, 1}^d (for then the functions e^{iπk·x}, k ∈ {0, 1}^d, are orthogonal functions that span the 2^d-dimensional linear space of real-valued functions on {0, 1}^d). Consequently, f(x) = Σ_{k∈{0,1}^d} e^{iπk·x} f̂_k, where f̂_k = 2^{−d} Σ_{x∈{0,1}^d} e^{−iπk·x} f(x). Here, C_{f,B} = π Σ_{k∈{0,1}^d} |k|_1 |f̂_k|, which is bounded above by πd L(f), where L(f) = Σ_{k∈{0,1}^d} |f̂_k| is the spectral norm. Now the class PL of Boolean functions, for which L(f) is bounded by a polynomial function of d (that is, L(f) ≤ d^c for some fixed c ≥ 1), is examined in [5] and [27]. In particular, Siu and Bruck [5] show, among other examples, that the Boolean function on {0, 1}^{2d} defined by the comparison of two d-bit integers, and the functions defined by each bit of the sum of two such integers, are functions in PL with L(f) = d + 1. It follows that C_{f,B} ≤ 2πd(d + 1) for the comparison and addition functions.

³ The cited references express the Fourier transform in terms of a polynomial basis that turns out to be identical to the Fourier basis used here. Indeed, for x restricted to {0, 1}^d, the Fourier basis functions e^{iπk·x} = Π_{j=1}^d (e^{iπx_j})^{k_j} may be expressed in the polynomial form Π_{j=1}^d x̃_j^{k_j}, where x̃_j = 1 − 2x_j equals 1, −1 for x_j equal to 0, 1, respectively (in agreement with the values assigned by e^{iπx_j}).
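The quantities in example 18) are directly computable for small d. The sketch below is illustrative only (the majority function and d = 5 are arbitrary choices): it computes the coefficients f̂_k = 2^{−d} Σ_x e^{−iπk·x} f(x), which are real here since e^{iπk·x} reduces to (−1)^{k·x} on {0, 1}^d, together with the spectral norm L(f) and the bound π Σ_k |k|_1 |f̂_k| on C_{f,B}.

```python
import itertools
import math

def walsh_fourier(f, d):
    """Coefficients f_k = 2^-d * sum_x e^{-i pi k.x} f(x), k in {0,1}^d.
    On {0,1}^d the characters e^{i pi k.x} equal (-1)^{k.x}, so these are real."""
    pts = list(itertools.product((0, 1), repeat=d))
    return {k: sum(f(x) * (-1) ** sum(ki * xi for ki, xi in zip(k, x))
                   for x in pts) / 2 ** d
            for k in pts}

def spectral_norm(coeffs):
    """L(f) = sum_k |f_k|."""
    return sum(abs(c) for c in coeffs.values())

def barron_const(coeffs):
    """The bound C_{f,B} <= pi * sum_k |k|_1 |f_k| for frequencies w = pi*k."""
    return math.pi * sum(sum(k) * abs(c) for k, c in coeffs.items())

d = 5
def maj(x):                       # majority of d bits
    return 1 if sum(x) > d / 2 else 0

c = walsh_fourier(maj, d)
print(f"L(majority) = {spectral_norm(c):.4f},  C_f bound = {barron_const(c):.4f}")
```

Since |f̂_k| appears with weight |k|_1 ≤ d, the printed constant is at most πd times the spectral norm, matching the C_{f,B} ≤ πd L(f) relation in the text.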

integers and the functions defined by (each bit ot) the Th eorem 6: For every choice of fixed basis functions of two such integers are functions in P [, with h , I , h2," ' hn, L(f) = d+ It follows that C/, .:; 27rd( d+ 1) for the 1. B comparison and addition functions. On the other hand, (59) they show that the majority function 1{L::=1 xj-d/2} (which has a simple network representation) is not where is a universal constant not smaller than 1/(81re7r-1). in the class PL. Kushilevitz and Mansour [27] show r;, Thus, the Kolmogorov n-width of the class of functions r c that a class of binary decision represent Boolean trees satisfies functions satisfying L(f) .:; where is the nnmber Tn, Tn l of nodes of the tree. It follows that C/, B .:; nndfor C l /d Wn > r;,- - (60) such decision trees. Bellare [28] generalizes the results - d ( ) of [27] by allowing decision trees with more general n P L functions implemented at the nodes of the tree. The proof of Theorem 6 is based on the following Lemma. He gives bounds on L(f) in terms of spectral norms Lemma 6: No linear subspace of dimension n can have

of the node functions, from which bounds on C/, B squared distance less than 1/2 from every basis function in an follow for the classes of decision trees he considers. orthonormal basis of a 2n-dimensional space. The implication of polynomial bounds on C/, as a B, Proof' For the proof of Lemma 6, it is to be shown consequence of the bound 2C/, B / yin fromTheorem 1, that if e ... , e n is an orthonormal basis and Gn = span is that a polynomial rather than an exponential number 1 , 2 {gt is a linear subspace of dimension n, then there is of nodes n is sufficient for accurate approximation by ..., g,, } an Cj, such that the squared distance d2(ej , Gn) � 1/2 Indeed, sigmoidal networks. . Ie! P denote projection onto Gn. Then, d2(ej, Gn) = Ilej­ Pej 112 = Ilej 112 - IIPej Il2 = 1 - 11 Pej 112. Thus it is equivalent X. LoWER BOUNDS FOR ApPROXIMATION to show that there is an ej such that the norm squared of the BY LI NEAR SUBSPACES projection satisfies IIPejl12 .:; 1/2. Without loss of generality, take g1 , to be an orthonormal basis of Gn. Then the The purpose of this section is to present and derive a lower . ..,g n ) bound on the best approximation error for linear combinations projection Pej takes the form L:�=1( ej, gi gi. So the norm squared of the projection satisfies of any fixed basis functions for functions in rc. These results, IIPejl12 L:�=I(ej, gi )2. = 1" ", = taken together with Theorem 1, show that fixed basis function Ta king the average for .i 2n, exchanging the order of summation, and using Ilg ll 1, yields expansion must have a worst-case performance that is much i = worse that that which is proven to be achievable by certain adaptable basis function methods (such as neural nets). Let /L be the uniform probability distribution on the unit cube B [O, I]d, and let d(f, g) = U[o. l]' (f(.x )­ = be the distance between functions in B). g(x))2 dx)I/2 L2(/L, For a function f and a set of functions G, let d(f, G) = infYEG d(f, g). 
For a collection of basis functions h1 , h2, "', hn (56) 2n -2 (61) denotes the error in the approximation of a function f by Since the average value of the norm squared of the projection the best linear combination of the functions h h1, h2, " " n, IIPej l12 is equal to 1/2, there must exist a choice of the basis where H" = span . The behavior of this (hI, h2, ' " hn). function ej for some .:; .i .:; 2n for which IIPej 112 � 1/2. approximation error for functions in r c = rc may be 1 , B This completes the proof of Lemma 6. D characterized (in the worse case) by Proof of Theorem 6: Let ht, h�, . . . be the functions (57) cos (w . x) for w = 27rk for k E {a, 1, ... }d ordered in terms of increasing h norm Ikl1 = L:�=l lh l· Let H2n denote the Here, a lower bound to this approximation error is determined span of h!, ... , h;n' We proceed as follows. First reduce the that holds uniformly over all choices of fixed basis functions. supremum over rc by restricting to functions in H2n , then In this formulation, the functions hi are not allowed to depend lower bound further by replacing the arbitrary basis functions on f (in contrast, sigmoidal basis functions have nonlinear hi, ... ,hn with their projections onto H2n, which we denote parameters that are allowed to be adjusted in the fit to f). Let by .111 , ... , gn ' Then gl, . . . ,gn span an n-dimensional linear subspace of H2n and a lower bound is obtained by taking Wn = infh" .. ,hn SUP/Ere d(f, span (hI, h2," ' , hn)). the infimum over all n-dimensional linear subspaces Gn. The (58) supremum is then restricted to multiples of the orthogonal This is the Kolmogorov n-width of the class of functions rc. functions hi that belong to rc, which permits application of BARRON: UNIVERSAL APPROXIMATION BOUNDS FOR SUPERPOSITIONS OF A SIGMOIDAL FUNCTION 943 the lemma. 
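As an aside, the averaging argument in the proof of Lemma 6 is easy to check numerically. The sketch below (assuming numpy; the setup is illustrative, not part of the proof) draws a random n-dimensional subspace of R^{2n} and verifies that the squared projections ||Pe_j||² of the standard basis vectors average to 1/2, so some e_j has squared distance at least 1/2 from the subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Orthonormal basis g_1, ..., g_n of a random n-dimensional subspace of R^{2n}
G, _ = np.linalg.qr(rng.standard_normal((2 * n, n)))

# P is the orthogonal projection onto span(g_1, ..., g_n); since P is
# symmetric and idempotent, ||P e_j||^2 = e_j' P e_j = P_jj.
P = G @ G.T
proj_sq = np.diag(P)

# trace(P) = n, so the 2n values ||P e_j||^2 average to exactly 1/2 ...
assert np.isclose(proj_sq.mean(), 0.5)
# ... hence some e_j has ||P e_j||^2 <= 1/2, i.e.,
# d^2(e_j, G_n) = 1 - ||P e_j||^2 >= 1/2, as Lemma 6 asserts.
assert proj_sq.min() <= 0.5 + 1e-12
```

Here the subspace lies inside R^{2n} itself, so the average is exactly 1/2; for a subspace of a larger space, Bessel's inequality makes the average at most 1/2.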
Thus, putting it all together,

    W_n = inf_{h_1,...,h_n} sup_{f∈Γ_C} d(f, span(h_1, h_2, ..., h_n))
        ≥ inf_{h_1,...,h_n} sup_{f∈H*_{2n}∩Γ_C} d(f, span(h_1, h_2, ..., h_n))
        ≥ inf_{h_1,...,h_n} sup_{f∈H*_{2n}∩Γ_C} d(f, span(g_1, g_2, ..., g_n))
        ≥ inf_{G_n} sup_{f∈H*_{2n}∩Γ_C} d(f, G_n)
        ≥ inf_{G_n} sup_{f∈{(C/(2π|k_j|_1)) cos(w_j·x), j=1,...,2n}} d(f, G_n)
        ≥ min_{j=1,...,2n} (C/(2π|k_j|_1)) (inf_{G_n} sup_{f∈{cos(w_j·x), j=1,...,2n}} d(f, G_n))
        ≥ C/(4πm),                                                         (62)

where the last two inequalities use |k_j|_1 ≤ m for j = 1, ..., 2n, together with Lemma 6 applied to the orthonormal system obtained by normalizing the functions cos(w_j·x), for m satisfying (m+d choose d) ≥ 2n (such that the number of multi-indices with norm |k|_1 ≤ m is at least 2n). A bound from Stirling's formula yields (m+d choose d) ≥ (m/(τd))^d for a universal constant τ no larger than e^{π−1}. Setting m = ⌈τ d n^{1/d}⌉ and adjusting the constant to account for the rounding of m to an integer, the desired bound is obtained, namely,

    W_n ≥ (C/(8πτd)) (1/n)^{1/d}.                                          (63)

This completes the proof of Theorem 6. □

XI. CONCLUSION

The error in the approximation of functions by artificial neural networks is bounded. For an artificial neural network with one layer of n sigmoidal nodes, the integrated squared error of approximation, integrating on a bounded subset of d variables, is bounded by c_f/n, where c_f depends on a norm of the Fourier transform of the function being approximated. This rate of approximation is achieved under growth constraints on the magnitudes of the parameters of the network. The optimization of a network to achieve these bounds may proceed one node at a time. Because of the economy of the number of parameters, order nd, these approximation rates permit satisfactory estimators of functions using artificial neural networks even in moderately high-dimensional problems.

APPENDIX

In this appendix, equivalent characterizations of the class of functions Γ are given in the context of general Fourier distributions on R^d. This appendix is not needed for the proofs of the theorems in the paper. It is intended to supplement the understanding of the class of functions for which the approximation bounds are obtained.

Recall that Γ is defined (in Section III) as the class of functions f on R^d such that f(x) = f(0) + ∫_{R^d} (e^{iw·x} − 1) F(dw) for some complex-valued measure F(dw) for which ∫_{R^d} |w| |F(dw)| < ∞. Complex-valued measures take the form F(dw) = e^{iθ(w)} |F(dw)|, for some real-valued measure |F(dw)| called the magnitude distribution and some function θ(w) called the phase (see, for instance, Rudin [29, theorem 6.12]). A complex vector-valued measure G(dw) on R^d is a vector of complex-valued measures (G_1(dw), ..., G_d(dw)). Let |G(dw)|_1 = Σ_{k=1}^d |G_k(dw)| denote the sum of the magnitude distributions of the coordinate measures.

Proposition: The following are equivalent for a function f on R^d.

a) The gradient of f has the Fourier representation ∇f(x) = ∫ e^{iw·x} G(dw) for some complex vector-valued measure G with ∫ |G(dw)|_1 < ∞ and G({0}) = 0 (in which case it follows that G(dw) = iwF(dw) for some complex scalar-valued measure F).

b) The function f has the representation f(x) = f(0) + ∫ (e^{iw·x} − 1) F(dw) for x ∈ R^d, for some complex-valued measure F with ∫ |w| |F(dw)| < ∞.

c) The increments of the function f of the form f_h(x) = f(x + h) − f(x) have a Fourier representation f_h(x) = ∫ e^{iw·x} (e^{iw·h} − 1) F(dw), for each x, h ∈ R^d, for some complex-valued measure F with ∫ |w| |F(dw)| < ∞.

If any one of a), b), or c) is satisfied for some F, then the other two representations hold with the same F.

Proof: The proof of this proposition is as follows. First, recall that |e^{iw·h} − 1| is bounded by |h| |w|, so ∫ |w| |F(dw)| < ∞ implies the absolute integrability of the representations in b) and c). Now, b) implies c) since the difference of the integrands at x + h and x is integrable, and c) implies b) by taking the specific choice x = 0 (so that f(h) − f(0) = ∫ (e^{iw·h} − 1) F(dw)); consequently, b) and c) are equivalent. Next, a) follows from c) by the dominated convergence theorem; c) follows from a) by plugging the Fourier representation of the gradient into f(x + h) − f(x) = ∫_0^1 h · ∇f(x + th) dt and applying Fubini's theorem.

It remains to show that in a), if the gradient of f has an absolutely integrable Fourier representation ∇f(x) = ∫ e^{iw·x} G(dw), and if G assigns no mass to the point w = 0, then G(dw) is proportional to w (that is, the measures (1/w_k) G_k(dw) are the same for k = 1, 2, ..., d). Now, if the gradient of f has an absolutely integrable Fourier representation, then so do the increments. Indeed, f_h(x) = ∫_0^1 h · ∇f(x + th) dt = ∫_0^1 ∫_{R^d} e^{iw·(x+th)} h · G(dw) dt, and integrating first with respect to t yields f_h(x) = ∫_{R^d} e^{iw·x} ((e^{iw·h} − 1)/(iw·h)) h · G(dw) (the exchange in the order of integration is valid by Fubini's theorem, since the integral over t of e^{itw·h} is (e^{iw·h} − 1)/(iw·h), which has magnitude bounded by 2). Thus, f_h has a Fourier distribution

    F_h(dw) = ((e^{iw·h} − 1)/(iw·h)) h · G(dw).                           (64)
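As a numerical aside, the t-integral and the magnitude bound invoked for Fubini's theorem just above can be spot-checked (a sketch assuming numpy; the sample value of the scalar w·h is arbitrary):

```python
import numpy as np

wh = 2.7                                   # an arbitrary sample value of w·h
N = 200000

# midpoint-rule approximation of the integral over t in [0,1] of e^{i t (w·h)}
t = (np.arange(N) + 0.5) / N
numeric = np.mean(np.exp(1j * wh * t))

# closed form used in the exchange of integrals
closed = (np.exp(1j * wh) - 1.0) / (1j * wh)

assert abs(numeric - closed) < 1e-8        # the integral matches the closed form
assert abs(closed) <= 2.0                  # the magnitude bound cited in the proof
```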

It is argued that the factor h · G(dw)/(h · w) determines a measure that does not depend on h (from which it follows that G(dw) is proportional to w). Now, the increments of f satisfy f_h(x + y) = f_{y+h}(x) − f_y(x), so it follows that their Fourier distributions satisfy

    e^{iw·y} F_h(dw) = F_{y+h}(dw) − F_y(dw)                               (65)

for all y, h ∈ R^d. Examination of this identity suggests that F_h(dw) must be of the form (e^{iw·h} − 1) F(dw) for some measure F which does not depend on h. Indeed, by (64), the measures F_h are dominated by |G|_1 for all h, so (64) and (65) may be reexpressed as identities involving the densities of these measures with respect to |G|_1. Consequently,

    e^{iw·y} (e^{iw·h} − 1) (h · g(w))/(h · w)
        = (e^{iw·(y+h)} − 1) ((y + h) · g(w))/((y + h) · w) − (e^{iw·y} − 1) (y · g(w))/(y · w),    (66)

where g(w) is a complex vector-valued function such that G(dw) = g(w) |G(dw)|_1. (For each y and h in R^d, (66) holds except possibly for a set of w of measure zero with respect to |G|_1, so if y and h are restricted to a countable dense set, then there is one |G|_1-null set outside of which (66) holds for all such y and h.) Now take a derivative in (66), replacing h with th, dividing both sides by t, and letting t → 0 (along a countable sequence of values with th restricted to the dense set). The identity that results from this derivative calculation, after a rearrangement of the terms, is

    ((e^{iw·y} − 1)/(w · y) − i e^{iw·y}) ((h · g(w))/(h · w) − (y · g(w))/(y · w)) = 0.    (67)

Therefore, h · g(w)/(h · w) = y · g(w)/(y · w) whenever h · w and y · w are not equal to zero (for y and h in the countable dense set and for almost every w). Let p(w) = y · g(w)/(y · w) denote the common value of this ratio for all such y (for w outside of the null set). Then, y · (g(w) − w p(w)) = 0; so taking points y which span R^d, it follows that g(w) = w p(w) for almost every w. Consequently, G(dw) = w p(w) |G(dw)|_1, which may be expressed in the form G(dw) = iwF(dw) for some complex-valued measure F on R^d. This completes the proof of the proposition. □

The usefulness of the above proposition is that it provides a Fourier characterization of F for functions in Γ in the case that ∫ |F(dw)| is not necessarily finite. It is the unique complex-valued measure such that G(dw) = iwF(dw), where G is the Fourier distribution of the gradient of the function. For several of the examples in Section IX of functions in Γ, including sigmoidal functions, the function f does not possess an integrable Fourier representation (in the traditional form f(x) = ∫ e^{iw·x} F(dw)), but the gradient of f does possess an integrable Fourier representation, and in this way F is determined for the modified Fourier representation f(x) = f(0) + ∫ (e^{iw·x} − 1) F(dw).

A Remark Concerning Functions with a Linear Component: If the Fourier distribution G of the gradient of the function f has G({0}) ≠ 0, that is, if the gradient has a nonzero constant component, then (strictly speaking) the function f is not in Γ. Nevertheless, it is possible to treat this more general situation by using the representation f(x) = f(0) + a · x + ∫ (e^{iw·x} − 1) F(dw), where a = G({0}), and F(dw) is characterized by G(dw) = iwF(dw) on R^d − {0}. The component a · x is approximated by linear combinations of sigmoidal functions in the same way as the sinusoidal components, as in the proof of Theorem 1. Now let C_{f,B} = ∫ |G(dw)|_B, where |G|_B is the measure that assigns mass |G({0})|_B = |a|_B at w = 0 and that equals |G(dw)|_B = |w|_B |F(dw)| when restricted to R^d − {0} (recall that, by definition, |a|_B = sup_{x∈B} |a · x|). It can be shown in this context that there is a linear combination of n sigmoidal functions f_n(x) of the form (1) such that the L²(μ, B) norm of the error f − f_n is bounded by 2C_{f,B}/√n. The same bound can also be obtained by the extrapolation method in example 4.

Additional Remarks: In the case that the distribution F has an integrable Fourier density f̂(w), there is a forward transform characterization in terms of Gaussian summability, that is,

    f̂(w) = lim_{ε→0} (2π)^{-d} ∫ e^{-iw·x} e^{-ε|x|²} f(x) dx             (68)

for almost every w (see, for instance, Stein and Weiss [30]). In the same way, iw f̂(w) is determined as the Gauss-Fourier transform of ∇f(x) for functions in Γ in the case that the Fourier distribution of the gradient is absolutely continuous. If f(x) or ∇f(x), respectively, is integrable on R^d, then f̂(w) is determined by an ordinary forward transform, that is, f̂(w) = (2π)^{-d} ∫ e^{-iw·x} f(x) dx or iw f̂(w) = (2π)^{-d} ∫ e^{-iw·x} ∇f(x) dx, for almost every w.

Note Added in Proof: A factor of two improvement in the constant in the approximation bound can be obtained. Indeed, if φ(z) is a given sigmoid with range in [0,1], then subtracting a constant of 1/2 yields a sigmoid with range in [−1/2, 1/2]. Allowing for a change in the additive constant c_0, the class of functions represented by superpositions of this new sigmoid is the same as represented by superpositions of the original sigmoid. Therefore, an approximation bound using the new sigmoid also is achievable by the original sigmoid. Now the new sigmoid has norm bounded by 1/2 instead of 1. Applying this fact in the proof of Theorem 1 yields the existence of a network function f_n(x) of the form (1) such that

    ∫_B (f(x) − f_n(x))² μ(dx) ≤ C²_{f,B}/n.                               (69)

Other scales for the magnitude of the sigmoid are also permitted, including popular choices for which φ(z) has limits ±1 as z → ±∞. In that case, the bound (69) holds with the constraint on the coefficients of f_n that Σ_{k=1}^n |c_k| ≤ C, provided the spectral norm satisfies C_{f,B} ≤ C.

ACKNOWLEDGMENT

The author is grateful for the collaboration of L. Jones. As he points out in [4], together we observed that his theorem applies to artificial neural networks, provided it could be shown that the function to be approximated is in the closure of the convex hull of bounded multiples of sigmoidal functions. This motivated the work presented here.
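Returning briefly to the Note Added in Proof, the recentering identity used there, namely that subtracting 1/2 from a [0,1]-valued sigmoid can be absorbed into the additive constant without changing the network function, is easy to verify numerically. The sketch below assumes numpy, and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # range (0, 1)
recentered = lambda z: sigmoid(z) - 0.5        # range (-1/2, 1/2)

n, d = 4, 3
a = rng.standard_normal((n, d))                # nonlinear (inner) parameters
b = rng.standard_normal(n)
c = rng.standard_normal(n)                     # outer coefficients
c0 = 0.7                                       # additive constant
x = rng.standard_normal((10, d))               # sample inputs

f_orig = c0 + (c * sigmoid(x @ a.T + b)).sum(axis=1)
# absorbing the shift into the constant reproduces the same network function
f_new = (c0 + c.sum() / 2.0) + (c * recentered(x @ a.T + b)).sum(axis=1)

assert np.allclose(f_orig, f_new)
```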

R. Dudley pointed out [25], where the result of Lemma 1 is attributed to Maurey. E. Sontag noted the necessity of the restriction to continuity points of the distribution function for the validity of Lemma 4. B. Hajek suggested the extension to functions on a Hilbert space. H. White encouraged the development of the lower bounds as in Theorem 6, and D. Haussler and J. Zinn contributed to the proof of Lemma 6.

REFERENCES

[1] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr., Signals, Syst., vol. 2, pp. 303-314, 1989.
[2] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[3] A. Pinkus, n-Widths in Approximation Theory. New York: Springer-Verlag, 1985.
[4] L. K. Jones, "A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training," Ann. Statist., vol. 20, pp. 608-613, Mar. 1992.
[5] K.-Y. Siu and J. Bruck, "On the power of threshold circuits with small weights," SIAM J. Discrete Math., 1991.
[6] N. M. Korobov, Number-Theoretic Methods of Approximate Analysis. Moscow: Fizmatgiz, 1963.
[7] G. Wahba, Spline Models for Observational Data. Philadelphia, PA: SIAM, 1990.
[8] A. R. Barron, "Statistical properties of artificial neural networks," in Proc. IEEE Int. Conf. Decision Contr., Tampa, FL, Dec. 13-15, 1989, pp. 280-285.
[9] ——, "Complexity regularization with applications to artificial neural networks," in Nonparametric Functional Estimation, G. Roussas, Ed. Dordrecht, The Netherlands: Kluwer Academic, 1991, pp. 561-576.
[10] A. R. Barron and T. M. Cover, "Minimum complexity density estimation," IEEE Trans. Inform. Theory, vol. 37, pp. 1034-1054, 1991.
[11] A. R. Barron, "Approximation and estimation bounds for artificial neural networks," in Proc. 4th Annu. Workshop Computational Learning Theory. San Mateo, CA: Morgan Kaufmann, Aug. 1991, pp. 243-249. (Also to appear in Machine Learning.)
[12] H. White, "Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings," Neural Networks, vol. 3, no. 5, pp. 535-550, 1990.
[13] D. Haussler, "Decision theoretic generalizations of the PAC model for neural net and other learning applications," Computer Res. Lab. Tech. Rep. UCSC-CRL-91-02, Univ. of California, Santa Cruz, 1991.
[14] A. R. Barron and R. L. Barron, "Statistical learning networks: A unifying view," in Computing Science and Statistics: Proc. 21st Symp. Interface. Alexandria, VA: American Statistical Assoc., 1988, pp. 192-203.
[15] L. Breiman, "Hinging hyperplanes for regression, classification, and function approximation," IEEE Trans. Inform. Theory, vol. 39, pp. 999-1013, May 1993.
[16] R. Tibshirani, "Slide functions for projection pursuit regression and neural networks," Dep. Statist. Tech. Rep. 9205, Univ. Toronto, 1992.
[17] Y. Zhao, "On projection pursuit learning," Ph.D. dissertation, Dept. Math., Art. Intell. Lab., M.I.T., 1992.
[18] F. Girosi and G. Anzellotti, "Convergence rates of approximation by translates," Art. Intell. Lab. Tech. Rep. 1288, Mass. Inst. Technol., 1992.
[19] A. Farago and G. Lugosi, "Strong universal consistency of neural network classifiers," Dep. Elect. Eng. Tech. Rep., Tech. Univ. Budapest, 1992. (To appear in IEEE Trans. Inform. Theory.)
[20] A. R. Barron, "Neural net approximation," in Proc. Yale Workshop Adaptive Learning Syst., K. Narendra, Ed., Yale Univ., May 1992.
[21] K. Hornik, M. Stinchcombe, H. White, and P. Auer, "Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives," preprint, 1992.
[22] D. F. McCaffrey and A. R. Gallant, "Convergence rates for single hidden layer feedforward networks," Tech. Rep., RAND Corp., Santa Monica, CA, and Dept. Statist., North Carolina State Univ., 1992.
[23] H. N. Mhaskar and C. A. Micchelli, "Approximation by superposition of a sigmoidal function," Advances in Appl. Math., vol. 13, pp. 350-373, 1992.
[24] L. K. Jones, "Good weights and hyperbolic kernels for neural networks, projection pursuit, and pattern classification: Fourier strategies for extracting information from high-dimensional data," Tech. Rep., Dept. Math. Sci., Univ. of Massachusetts, Lowell, 1991. (To appear in IEEE Trans. Inform. Theory.)
[25] G. Pisier, "Remarques sur un résultat non publié de B. Maurey," presented at the Séminaire d'analyse fonctionnelle 1980-1981, École Polytechnique, Centre de Mathématiques, Palaiseau.
[26] L. K. Jones, "Constructive approximations for neural networks by sigmoidal functions," Proc. IEEE: Special Issue on Neural Networks, vol. 78, pp. 1586-1589, 1990.
[27] E. Kushilevitz and Y. Mansour, "Learning decision trees using the Fourier spectrum," in Proc. 23rd ACM Symp. Theory Comput., 1991, pp. 455-464.
[28] M. Bellare, "The spectral norm of finite functions," MIT LCS Tech. Rep. TR-465, 1991.
[29] W. Rudin, Real and Complex Analysis. New York: McGraw-Hill, 1987.
[30] E. M. Stein and G. Weiss, Introduction to Fourier Analysis on Euclidean Spaces. Princeton, NJ: Princeton Univ. Press, 1971.
[31] C. Darken, M. Donahue, L. Gurvits, and E. Sontag, "Rate of approximation results motivated by robust neural network learning," to appear in Proc. Sixth ACM Workshop Computat. Learning Theory.
[32] J. Yukich, "Sup norm approximation bounds for networks via Vapnik-Chervonenkis classes," Notes, Dept. of Math., Lehigh Univ., Bethlehem, PA, Jan. 1993.
[33] V. Kůrková, "Kolmogorov's theorem and multilayer neural networks," Neural Networks, vol. 5, pp. 501-506, 1992.