
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 118–122, http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 V. Kůrková

Multivariable Approximation by Convolutional Kernel Networks

Věra Kůrková

Institute of Computer Science, Academy of Sciences of the Czech Republic, [email protected], WWW home page: http://www.cs.cas.cz/~vera

Abstract: Computational units induced by convolutional kernels together with biologically inspired perceptrons belong to the most widespread types of units used in neurocomputing. Radial convolutional kernels with varying widths form RBF (radial-basis-function) networks, and these kernels with fixed widths are used in the SVM (support vector machine) algorithm. We investigate the suitability of various convolutional kernel units for function approximation. We show that properties of Fourier transforms of convolutional kernels determine whether sets of input-output functions of networks with kernel units are large enough to be universal approximators. We compare these properties with conditions guaranteeing positive semidefiniteness of convolutional kernels.

1 Introduction

Computational units induced by radial and convolutional kernels together with perceptrons belong to the most widespread types of units used in neurocomputing. In contrast to biologically inspired perceptrons [15], localized radial units [1] were introduced merely due to their good mathematical properties. Radial-basis-function (RBF) units computing spherical waves were followed by kernel units [7]. Kernel units in the most general form include all types of computational units which are functions of two vector variables: an input vector and a parameter vector. However, often the term kernel unit is reserved merely for units computing symmetric positive semidefinite functions of two variables. Networks with these units have been widely used for classification with maximal margin by the support vector machine (SVM) algorithm [2] as well as for regression [21].

Other important kernel units are units induced by convolutional kernels in the form of translations of functions of one vector variable. Isotropic RBF units can be viewed as non-symmetric kernel units obtained from convolutional radial kernels by adding a width parameter. Variability of widths is a strong property. It makes it possible to apply arguments based on classical results on approximation of functions by sequences of their convolutions with scaled bump functions to prove universal approximation capabilities of many types of RBF networks [16, 17]. Moreover, some estimates of rates of approximation by RBF networks exploit variability of widths [9, 10, 13].

On the other hand, symmetric positive semidefinite kernels (which include some classes of RBFs with fixed width parameters) benefit from geometrical properties of reproducing kernel Hilbert spaces (RKHS) generated by these kernels. These properties allow an extension of maximal margin classification from finite-dimensional spaces also to sets of data which are not linearly separable, by embedding them into infinite-dimensional spaces [2]. Moreover, symmetric positive semidefinite kernels generate stabilizers in the form of norms on RKHSs suitable for modeling generalization in terms of regularization [6] and enable characterizations of theoretically optimal solutions of learning tasks [3, 19, 11].

Arguments proving the universal approximation property of RBF networks using sequences of scaled kernels might suggest that variability of widths is necessary for the universal approximation. However, for the special case of the Gaussian kernel, the universal approximation property holds even when the width is fixed and merely centers are varying [14, 12].

On the other hand, it is easy to find examples of positive semidefinite kernels such that sets of input-output functions of shallow networks with units generated by these kernels are too small to be universal approximators. For example, networks with product kernel units of the form $K(x,y) = k(x)\,k(y)$ generate as input-output functions only scalar multiples $c\,k(x)$ of the function $k$.
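In fact, this degeneracy is immediate from the definition of the input-output functions; the following one-line calculation merely spells out the remark above:

$$\sum_{i=1}^{n} w_i K(x, y_i) = \sum_{i=1}^{n} w_i\, k(x)\, k(y_i) = \Big( \sum_{i=1}^{n} w_i k(y_i) \Big)\, k(x) = c\, k(x), \qquad c = \sum_{i=1}^{n} w_i k(y_i),$$

so every network with product kernel units computes a scalar multiple of $k$, regardless of the number of units.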

In this paper, we investigate capabilities of networks with one hidden layer of convolutional kernel units to approximate multivariable functions. We show that a crucial property influencing whether sets of input-output functions of convolutional kernel networks are large enough to be universal approximators is the behavior of the Fourier transform of the one-variable function generating the convolutional kernel. We give a necessary and sufficient condition for universal approximation by kernel networks in terms of the Fourier transforms of kernels. We compare this condition with properties of kernels guaranteeing their positive definiteness. We illustrate our results by examples of some common kernels such as the Gaussian, Laplace, parabolic, rectangle, and triangle kernels.

The paper is organized as follows. In section 2, notations and basic concepts on one-hidden-layer networks and kernel units are introduced. In section 3, a necessary and sufficient condition on a convolutional kernel is derived that guarantees that networks with units induced by the kernel have the universal approximation property. In section 4, this condition is compared with a condition guaranteeing that a kernel is positive semidefinite, and some examples of kernels satisfying both or one of these conditions are given. Section 5 is a brief discussion.

2 Preliminaries

Radial-basis-function networks as well as kernel models belong to the class of one-hidden-layer networks with one linear output unit. Such networks compute input-output functions from sets of the form

$$\operatorname{span} G = \left\{ \sum_{i=1}^{n} w_i g_i \,\middle|\, w_i \in \mathbb{R},\ g_i \in G,\ n \in \mathbb{N}_+ \right\},$$

where the set $G$ is called a dictionary [8], and $\mathbb{R}$, $\mathbb{N}_+$ denote the sets of real numbers and positive integers, respectively. Typically, dictionaries are parameterized families of functions modeling computational units, i.e., they are of the form

$$G_K(X,Y) = \{ K(\cdot,y) : X \to \mathbb{R} \mid y \in Y \},$$

where $K : X \times Y \to \mathbb{R}$ is a function of two variables, an input vector $x \in X \subseteq \mathbb{R}^d$ and a parameter $y \in Y \subseteq \mathbb{R}^s$. Such functions of two variables are called kernels. This term, derived from the German term "kern", has been used since 1904 in the theory of integral operators [18, p. 291].

An important class of kernels are convolutional kernels, which are obtained by translations of one-variable functions $k : \mathbb{R}^d \to \mathbb{R}$ as

$$K(x,y) = k(x - y).$$

Radial convolutional kernels are convolutional kernels obtained as translations of radial functions, i.e., of functions of the form

$$k(x) = k_1(\|x\|),$$

where $k_1 : \mathbb{R}_+ \to \mathbb{R}$.

The convolution is an operation defined as

$$f * g(x) = \int_{\mathbb{R}^d} f(x - y)\, g(y)\, dy = \int_{\mathbb{R}^d} f(y)\, g(x - y)\, dy$$

[20, p. 170].

Recall that a kernel $K : X \times X \to \mathbb{R}$ is called positive semidefinite if for any positive integer $m$, any $x_1, \dots, x_m \in X$ and any $a_1, \dots, a_m \in \mathbb{R}$,

$$\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j K(x_i, x_j) \geq 0.$$

Similarly, a function of one variable $k : \mathbb{R}^d \to \mathbb{R}$ is called positive semidefinite if for any positive integer $m$, any $x_1, \dots, x_m \in \mathbb{R}^d$ and any $a_1, \dots, a_m \in \mathbb{R}$,

$$\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j k(x_i - x_j) \geq 0.$$

For symmetric positive semidefinite kernels $K$, the sets $\operatorname{span} G_K(X)$ of input-output functions of networks with units induced by the kernel $K$ are contained in Hilbert spaces defined by these kernels. These spaces are called reproducing kernel Hilbert spaces (RKHS) and denoted $\mathcal{H}_K(X)$. They are formed by functions from

$$\operatorname{span} G_K(X) = \operatorname{span}\{ K_x \mid x \in X \},$$

where

$$K_x(\cdot) = K(x,\cdot),$$

together with limits of their Cauchy sequences in the norm $\|\cdot\|_K$. The norm $\|\cdot\|_K$ is induced by the inner product $\langle \cdot,\cdot \rangle_K$, which is defined on

$$G_K(X) = \{ K_x \mid x \in X \}$$

as

$$\langle K_x, K_y \rangle_K = K(x,y).$$

So $\operatorname{span} G_K(X) \subset \mathcal{H}_K(X)$.

3 Universal approximation capability of convolutional kernel networks

In this section, we investigate conditions guaranteeing that sets of input-output functions of convolutional kernel networks are large enough to be universal approximators.

The universal approximation property is formally defined as density in a normed linear space. A class of one-hidden-layer networks with units from a dictionary $G$ is said to have the universal approximation property in a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ if it is dense in this space, i.e., $\operatorname{cl}_{\mathcal{X}} \operatorname{span} G = \mathcal{X}$, where $\operatorname{span} G$ denotes the linear span of $G$ and $\operatorname{cl}_{\mathcal{X}}$ denotes the closure with respect to the topology induced by the norm $\|\cdot\|_{\mathcal{X}}$. More precisely, for every $f \in \mathcal{X}$ and every $\varepsilon > 0$ there exist a positive integer $n$, $g_1, \dots, g_n \in G$, and $w_1, \dots, w_n \in \mathbb{R}$ such that

$$\left\| f - \sum_{i=1}^{n} w_i g_i \right\|_{\mathcal{X}} < \varepsilon.$$

Function spaces where the universal approximation property has been of interest are the spaces $(C(X), \|\cdot\|_{\sup})$ of continuous functions on subsets $X$ of $\mathbb{R}^d$ (typically compact) with the supremum norm

$$\|f\|_{\sup} = \sup_{x \in X} |f(x)|$$

and the spaces $(\mathcal{L}^p(\mathbb{R}^d), \|\cdot\|_{\mathcal{L}^p})$ of functions on $\mathbb{R}^d$ with finite $\int_{\mathbb{R}^d} |f(y)|^p\, dy$ and the norm

$$\|f\|_{\mathcal{L}^p} = \left( \int_{\mathbb{R}^d} |f(y)|^p\, dy \right)^{1/p}.$$
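For concreteness, the following minimal Python sketch (an illustration only; the Gaussian kernel and the randomly drawn centers, weights, and inputs are arbitrary choices) evaluates an element of span G_K(X) for a convolutional kernel K(x,y) = k(x-y), i.e., an input-output function of a one-hidden-layer kernel network:

import numpy as np

def kernel_network(x, centers, weights, k):
    """Evaluate f(x) = sum_i w_i k(x - y_i) for a batch of inputs x of shape (m, d)."""
    diffs = x[:, None, :] - centers[None, :, :]      # shape (m, n, d): differences x - y_i
    return np.sum(weights * k(diffs), axis=1)        # weighted sum over the n kernel units

# radial convolutional kernel k(t) = exp(-||t||^2), i.e., a Gaussian unit
gaussian = lambda t: np.exp(-np.sum(t**2, axis=-1))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 2))    # translation parameters y_1, ..., y_5 in R^2
weights = rng.normal(size=5)         # output weights w_1, ..., w_5
inputs = rng.normal(size=(3, 2))     # three input vectors x in R^2
print(kernel_network(inputs, centers, weights, gaussian))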

Recall that the $d$-dimensional Fourier transform is an isometry on $\mathcal{L}^2(\mathbb{R}^d)$, defined on $\mathcal{L}^2(\mathbb{R}^d) \cap \mathcal{L}^1(\mathbb{R}^d)$ as

$$\hat f(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i x \cdot s} f(x)\, dx$$

and extended to $\mathcal{L}^2(\mathbb{R}^d)$ [20, p. 183].

Note that the Fourier transform of an even function is real and the Fourier transform of a radial function is radial. If $k \in \mathcal{L}^1(\mathbb{R}^d)$, then $\hat k$ is uniformly continuous and converges to zero with increasing frequencies, i.e.,

$$\lim_{\|s\| \to \infty} \hat k(s) = 0.$$

The following theorem gives a necessary and sufficient condition on a convolutional kernel that guarantees that the class of input-output functions computable by networks with units induced by the kernel can approximate arbitrarily well all functions in $\mathcal{L}^2(\mathbb{R}^d)$. The condition is formulated in terms of the size of the set of frequencies for which the Fourier transform is equal to zero. By $\lambda$ is denoted the Lebesgue measure.

Theorem 1. Let $d$ be a positive integer, $k \in \mathcal{L}^1(\mathbb{R}^d) \cap \mathcal{L}^2(\mathbb{R}^d)$ be even, $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be defined as $K(x,y) = k(x-y)$, and $X \subseteq \mathbb{R}^d$ be Lebesgue measurable. Then $\operatorname{span} G_K(X)$ is dense in $(\mathcal{L}^2(X), \|\cdot\|_{\mathcal{L}^2})$ if and only if

$$\lambda(\{ s \in \mathbb{R}^d \mid \hat k(s) = 0 \}) = 0.$$

Proof. First, we prove the necessity. To prove it by contradiction, assume that $\lambda(S) \neq 0$, where $S = \{ s \in \mathbb{R}^d \mid \hat k(s) = 0 \}$. Take any function $f \in \mathcal{L}^2(\mathbb{R}^d) \cap \mathcal{L}^1(\mathbb{R}^d)$ with a positive Fourier transform (for example, $f$ can be the Gaussian). Let $\varepsilon > 0$ be such that

$$\varepsilon < \int_{S} \hat f(s)^2\, ds.$$

Assume that there exist $n$, $w_i \in \mathbb{R}$, and $y_i \in \mathbb{R}^d$ such that

$$\left\| f - \sum_{i=1}^{n} w_i k(\cdot - y_i) \right\|_{\mathcal{L}^2}^2 < \varepsilon.$$

Then by the Plancherel Theorem [20, p. 188],

$$\left\| f - \sum_{i=1}^{n} w_i k(\cdot - y_i) \right\|_{\mathcal{L}^2}^2 = \left\| \hat f - \sum_{i=1}^{n} w_i \widehat{k(\cdot - y_i)} \right\|_{\mathcal{L}^2}^2 = \left\| \hat f - \sum_{i=1}^{n} \bar w_i \hat k \right\|_{\mathcal{L}^2}^2,$$

where $\bar w_i = w_i e^{-i y_i \cdot s}$. Hence

$$\left\| \hat f - \sum_{i=1}^{n} \bar w_i \hat k \right\|_{\mathcal{L}^2}^2 = \int_{\mathbb{R}^d \setminus S} \left| \hat f(s) - \sum_{i=1}^{n} \bar w_i \hat k(s) \right|^2 ds + \int_{S} \hat f(s)^2\, ds > \varepsilon,$$

which is a contradiction.

To prove the sufficiency, we first assume that $X = \mathbb{R}^d$. We prove it by contradiction, so we suppose that

$$\operatorname{cl}_{\mathcal{L}^2} \operatorname{span} G_K(\mathbb{R}^d) = \operatorname{cl}_{\mathcal{L}^2} \operatorname{span}\{ K(\cdot,y) \mid y \in \mathbb{R}^d \} \neq \mathcal{L}^2(\mathbb{R}^d).$$

Then by the Hahn-Banach Theorem [20, p. 60] there exists a bounded linear functional $l$ on $\mathcal{L}^2(\mathbb{R}^d)$ such that $l(f) = 0$ for all $f \in \operatorname{cl}_{\mathcal{L}^2} \operatorname{span} G_K(\mathbb{R}^d)$ and $l(f_0) = 1$ for some $f_0 \in \mathcal{L}^2(\mathbb{R}^d) \setminus \operatorname{cl}_{\mathcal{L}^2} \operatorname{span} G_K(\mathbb{R}^d)$. By the Riesz Representation Theorem [5, p. 206], $l$ can be expressed as an inner product with some $h \in \mathcal{L}^2(\mathbb{R}^d)$. As $k$ is even, for all $y \in \mathbb{R}^d$,

$$\langle h, K(\cdot,y) \rangle = \int_{\mathbb{R}^d} h(x)\, k(x - y)\, dx = \int_{\mathbb{R}^d} h(x)\, k(y - x)\, dx = h * k(y) = 0.$$

By the Young Inequality for convolutions, $h * k \in \mathcal{L}^2(\mathbb{R}^d)$, and so by the Plancherel Theorem [20, p. 188],

$$\| \widehat{h * k} \|_{\mathcal{L}^2} = 0.$$

As

$$\widehat{h * k} = (2\pi)^{d/2}\, \hat h\, \hat k$$

[20, p. 183], we have $\| \hat h \hat k \|_{\mathcal{L}^2} = 0$ and so

$$\int_{\mathbb{R}^d} |\hat h(s)\, \hat k(s)|^2\, ds = 0.$$

As the set $S = \{ s \in \mathbb{R}^d \mid \hat k(s) = 0 \}$ has Lebesgue measure zero, we have

$$\int_{\mathbb{R}^d} |\hat h(s)|^2\, \hat k(s)^2\, ds = \int_{\mathbb{R}^d \setminus S} |\hat h(s)|^2\, \hat k(s)^2\, ds = 0.$$

As $\hat k(s)^2 > 0$ for all $s \in \mathbb{R}^d \setminus S$, we have $\int_{\mathbb{R}^d \setminus S} |\hat h|^2\, ds = 0$. So $\|h\|_{\mathcal{L}^2} = 0$ and hence by the Cauchy-Schwartz Inequality we get

$$1 = l(f_0) = \int_{\mathbb{R}^d} f_0(y)\, h(y)\, dy \leq \|f_0\|_{\mathcal{L}^2} \|h\|_{\mathcal{L}^2} = 0,$$

which is a contradiction.

Extending a function $f$ from $\mathcal{L}^2(X)$ to $\bar f$ from $\mathcal{L}^2(\mathbb{R}^d)$ by setting its values equal to zero outside of $X$ and restricting approximations of $\bar f$ by functions from $\operatorname{span} G_K(\mathbb{R}^d)$ to $X$, we get the statement for any Lebesgue measurable subset $X$ of $\mathbb{R}^d$. ✷

Theorem 1 shows that sets of input-output functions of convolutional kernel networks are large enough to approximate arbitrarily well all $\mathcal{L}^2$-functions if and only if the Fourier transform of the function $k$ is almost everywhere non-zero. In particular, Theorem 1 implies that when $\hat k(s)$ is equal to zero for all $s$ such that $\|s\| \geq r$ for some $r > 0$ (the Fourier transform is band-limited), then the set $\operatorname{span} G_K(\mathbb{R}^d)$ is too small to have the universal approximation capability. In the next section we show that some such kernels are positive semidefinite. So they can be used for classification by the SVM algorithm, but they are not suitable for function approximation.
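The contrast described above can be observed numerically. The sketch below (an illustration only; the target, the grid, the kernel widths, and the centers are arbitrary choices, and numpy's sinc is sin(pi t)/(pi t)) fits a fixed oscillatory target in L^2(R) by least squares using translates of two kernels: the Gaussian, whose Fourier transform vanishes nowhere, and sinc, whose Fourier transform is the rectangle function and hence vanishes outside a bounded band of frequencies. Since part of the target's spectrum lies outside that band, the residual for the sinc translates should stay large, while the residual for the Gaussian translates should be markedly smaller:

import numpy as np

x = np.linspace(-8.0, 8.0, 2001)              # discretization of an interval of R
target = np.exp(-x**2) * np.cos(4.0 * x)      # oscillatory target reaching beyond the band of sinc
centers = np.linspace(-6.0, 6.0, 61)          # translation parameters y_i

def relative_residual(k):
    # columns of the design matrix are the translates k(. - y_i); fit by least squares
    A = np.stack([k(x - y) for y in centers], axis=1)
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.linalg.norm(target - A @ w) / np.linalg.norm(target)

gauss = lambda t: np.exp(-t**2)               # Fourier transform positive everywhere
sinc = lambda t: np.sinc(t)                   # Fourier transform zero outside a band

print("Gaussian translates:", relative_residual(gauss))
print("sinc translates:    ", relative_residual(sinc))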

4 Positive semidefiniteness and universal approximation property

In this section, we compare a condition on positive semidefiniteness of a convolutional kernel with the condition on the universal approximation property derived in the previous section.

As the inverse Fourier transform of a convolutional kernel can be expressed as

$$K(x,y) = k(x - y) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \hat k(s)\, e^{i(x-y) \cdot s}\, ds,$$

it is easy to verify that when $\hat k$ is positive or nonnegative, then $K$ defined as $K(x,y) = k(x-y)$ is positive definite or positive semidefinite, respectively. Indeed, to verify that $\sum_{j,l=1}^{n} a_j a_l K(x_j, x_l) \geq 0$, we express $K$ in terms of the inverse Fourier transform. Thus we get

$$\sum_{j,l=1}^{n} a_j a_l K(x_j, x_l) = \frac{1}{(2\pi)^{d/2}} \sum_{j,l=1}^{n} a_j a_l \int_{\mathbb{R}^d} \hat k(s)\, e^{i(x_j - x_l) \cdot s}\, ds =$$

$$\frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \left( \sum_{j} a_j e^{i x_j \cdot s} \right) \left( \sum_{l} a_l e^{-i x_l \cdot s} \right) \hat k(s)\, ds = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \left| \sum_{j} a_j e^{i x_j \cdot s} \right|^2 \hat k(s)\, ds \geq 0.$$
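As a small numerical sanity check of this argument (an illustration only; the Gaussian k(t) = exp(-t^2), whose Fourier transform is positive, and the random points are arbitrary choices), one can form the Gram matrix [k(x_i - x_j)] and confirm that its smallest eigenvalue is nonnegative up to rounding error:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=50)        # arbitrary points x_1, ..., x_m in R
gauss = lambda t: np.exp(-t**2)

G = gauss(x[:, None] - x[None, :])         # Gram matrix [k(x_i - x_j)]
print(np.linalg.eigvalsh(G).min())         # nonnegative up to rounding error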

The following proposition is well known (see, e.g., [4]).

Proposition 2. Let $k \in \mathcal{L}^1(\mathbb{R}^d) \cap \mathcal{L}^2(\mathbb{R}^d)$ be an even function such that $\hat k(s) \geq 0$ for all $s \in \mathbb{R}^d$. Then $K(x,y) = k(x-y)$ is positive semidefinite.

A complete characterization of positive semidefinite kernels follows from the Bochner Theorem.

Theorem 3 (Bochner). A bounded continuous function $k : \mathbb{R}^d \to \mathbb{C}$ is positive semidefinite iff $k$ is the Fourier transform of a nonnegative finite Borel measure $\mu$, i.e.,

$$k(x) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i x \cdot s}\, \mu(ds).$$

The Bochner Theorem implies that when the Borel measure $\mu$ has a distribution function, then the condition in Proposition 2 is both sufficient and necessary.

Comparison of the characterization of kernels for which, by Theorem 1, one-hidden-layer kernel networks are universal approximators with the condition on positive semidefiniteness from Proposition 2 shows that there are positive semidefinite kernels which do not generate networks possessing the universal approximation capability, and there also are kernels which are not positive definite but induce networks with the universal approximation property. The first ones are suitable for SVM but not for regression, while the second ones can be used for regression but are not suitable for SVM. In the sequel, we give some examples of such kernels.

A paradigmatic example of a convolutional kernel is the Gaussian kernel $g_a : \mathbb{R}^d \to \mathbb{R}$ defined for a width $a > 0$ as

$$g_a = e^{-a^2 \|\cdot\|^2}.$$

For any fixed width $a$ and any dimension $d$,

$$\hat g_a = (\sqrt{2}\, a)^{-d}\, e^{-\|\cdot\|^2 / (4 a^2)}.$$

So the Gaussian kernel is positive definite and the class of Gaussian kernel networks has the universal approximation property.

The rectangle kernel is defined as $\mathrm{rect}(x) = 1$ for $x \in (-1/2, 1/2)$, otherwise $\mathrm{rect}(x) = 0$. Its Fourier transform is the sinc function

$$\widehat{\mathrm{rect}}(s) = \mathrm{sinc}(s) = \frac{\sin(\pi s)}{\pi s}.$$

So the Fourier transform of rect is not nonnegative, but its zeros form a discrete set of Lebesgue measure zero. Thus the rectangle kernel is not positive semidefinite but induces a class of networks with the universal approximation property. On the other hand, the Fourier transform of sinc is the rectangle kernel, and thus sinc is positive semidefinite but does not induce networks with the universal approximation property.

The Laplace kernel is defined for any $a > 0$ as

$$l(x) = e^{-a |x|}.$$

Its Fourier transform is positive, as

$$\hat l(s) = \frac{2a}{a^2 + (2\pi s)^2}.$$

The triangle kernel is defined as $\mathrm{tri}(x) = 2(x + 1/2)$ for $x \in (-1/2, 0)$, $\mathrm{tri}(x) = 2(-x + 1/2)$ for $x \in (0, 1/2)$, otherwise $\mathrm{tri}(x) = 0$. Its Fourier transform is positive, as

$$\widehat{\mathrm{tri}}(s) = \mathrm{sinc}(s)^2 = \left( \frac{\sin(\pi s)}{\pi s} \right)^2.$$

Thus both the Laplace and the triangle kernel are positive definite and induce networks having the universal approximation property.
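The rectangle versus triangle comparison above can also be checked directly from the definition of positive semidefiniteness (an illustration only; the three points below are chosen by hand). The smallest eigenvalue of the Gram matrix [k(x_i - x_j)] is negative for rect, which exhibits coefficients a_i making the quadratic form negative, and nonnegative for tri:

import numpy as np

def rect(t):
    return np.where(np.abs(t) < 0.5, 1.0, 0.0)

def tri(t):
    return np.where(np.abs(t) < 0.5, 1.0 - 2.0 * np.abs(t), 0.0)

x = np.array([0.0, 0.4, 0.8])                        # three points in R
for k in (rect, tri):
    G = k(x[:, None] - x[None, :])                   # Gram matrix [k(x_i - x_j)]
    print(k.__name__, np.linalg.eigvalsh(G).min())   # negative for rect, nonnegative for tri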

The parabolic (Epanechnikov) kernel is defined as $\mathrm{epi}(x) = \frac{3}{4}(1 - x^2)$ for $x \in (-1, 1)$, otherwise $\mathrm{epi}(x) = 0$. Its Fourier transform is

$$\widehat{\mathrm{epi}}(s) = \frac{3}{s^3} \left( \sin(s) - s \cos(s) \right) \text{ for } s \neq 0, \qquad \widehat{\mathrm{epi}}(0) = 1.$$

So the parabolic kernel is not positive semidefinite but induces networks with the universal approximation property.

5 Discussion

We investigated the effect of properties of the Fourier transform of a kernel function on the suitability of the convolutional kernel for function approximation (universal approximation property) and for the maximal margin classification algorithm (positive semidefiniteness). We showed that these properties depend on the way the Fourier transform behaves as the frequencies increase to infinity. For the universal approximation property, the Fourier transform can be negative, but it cannot be zero on any set of frequencies of non-zero Lebesgue measure. On the other hand, functions with non-negative Fourier transforms are positive semidefinite even if they are compactly supported.

We illustrated our results by the paradigmatic example of the multivariable Gaussian kernel and by some one-dimensional examples. The multivariable Gaussian is a product of one-variable functions, and thus its multivariable Fourier transform can be computed using transforms of one-variable Gaussians. Fourier transforms of other radial multivariable kernels are more complicated; their expressions include Bessel functions and the Hankel transform. Investigation of properties of Fourier transforms of multivariable radial convolutional kernels is a subject of our future work.

Acknowledgments. This work was partially supported by the grant GA ČR 15-18108S and institutional support of the Institute of Computer Science RVO 67985807.

References

[1] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
[2] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[3] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of AMS, 39:1–49, 2002.
[4] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
[5] A. Friedman. Foundations of Modern Analysis. Dover, New York, 1982.
[6] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10:1455–1480 (AI Memo 1606), 1998.
[7] F. Girosi and T. Poggio. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945):978–982, 1990.
[8] R. Gribonval and P. Vandergheynst. On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Transactions on Information Theory, 52:255–261, 2006.
[9] P. C. Kainen, V. Kůrková, and M. Sanguineti. Complexity of Gaussian radial basis networks approximating smooth functions. Journal of Complexity, 25:63–74, 2009.
[10] P. C. Kainen, V. Kůrková, and M. Sanguineti. Dependence of computational models on input dimension: Tractability of approximation and optimization tasks. IEEE Transactions on Information Theory, 58:1203–1214, 2012.
[11] V. Kůrková. Neural network learning as an inverse problem. Logic Journal of IGPL, 13:551–559, 2005.
[12] V. Kůrková and P. C. Kainen. Comparing fixed and variable-width Gaussian networks. Neural Networks, 57:23–28, 2014.
[13] V. Kůrková and M. Sanguineti. Model complexities of shallow networks representing highly varying functions. Neurocomputing, 171:598–604, 2016.
[14] H. N. Mhaskar. Versatile Gaussian networks. In Proceedings of the IEEE Workshop on Nonlinear Image Processing, pages 70–73, 1995.
[15] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
[16] J. Park and I. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257, 1991.
[17] J. Park and I. Sandberg. Approximation and radial basis function networks. Neural Computation, 5:305–316, 1993.
[18] A. Pietsch. Eigenvalues and s-Numbers. Cambridge University Press, Cambridge, 1987.
[19] T. Poggio and S. Smale. The mathematics of learning: dealing with data. Notices of the AMS, 50:537–544, 2003.
[20] W. Rudin. Functional Analysis. McGraw-Hill, 1991.
[21] B. Schölkopf and A. J. Smola. Learning with Kernels – Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, 2002.