Multivariable Approximation by Convolutional Kernel Networks
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 118–122, http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 V. Kůrková

Věra Kůrková
Institute of Computer Science, Academy of Sciences of the Czech Republic
[email protected], WWW home page: http://www.cs.cas.cz/~vera

Abstract: Computational units induced by convolutional kernels, together with biologically inspired perceptrons, belong to the most widespread types of units used in neurocomputing. Radial convolutional kernels with varying widths form RBF (radial-basis-function) networks, and these kernels with fixed widths are used in the SVM (support vector machine) algorithm. We investigate the suitability of various convolutional kernel units for function approximation. We show that properties of the Fourier transforms of convolutional kernels determine whether sets of input-output functions of networks with kernel units are large enough to be universal approximators. We compare these properties with conditions guaranteeing positive semidefiniteness of convolutional kernels.

1 Introduction

Computational units induced by radial and convolutional kernels, together with perceptrons, belong to the most widespread types of units used in neurocomputing. In contrast to biologically inspired perceptrons [15], localized radial units [1] were introduced merely due to their good mathematical properties. Radial-basis-function (RBF) units computing spherical waves were followed by kernel units [7]. Kernel units in the most general form include all types of computational units which are functions of two vector variables: an input vector and a parameter vector. However, the term kernel unit is often reserved merely for units computing symmetric positive semidefinite functions of two variables. Networks with these units have been widely used for classification with maximal margin by the support vector machine (SVM) algorithm [2] as well as for regression [21].

Other important kernel units are units induced by convolutional kernels, i.e., by translations of functions of one vector variable. Isotropic RBF units can be viewed as non-symmetric kernel units obtained from convolutional radial kernels by adding a width parameter. Variability of widths is a strong property. It allows one to apply arguments based on classical results on approximation of functions by sequences of their convolutions with scaled bump functions to prove universal approximation capabilities of many types of RBF networks [16, 17]. Moreover, some estimates of rates of approximation by RBF networks exploit variability of widths [9, 10, 13].

On the other hand, symmetric positive semidefinite kernels (which include some classes of RBFs with fixed width parameters) benefit from geometrical properties of the reproducing kernel Hilbert spaces (RKHS) generated by these kernels. These properties allow an extension of maximal margin classification from finite dimensional spaces to sets of data which are not linearly separable, by embedding them into infinite dimensional spaces [2]. Moreover, symmetric positive semidefinite kernels generate stabilizers in the form of norms on RKHSs suitable for modeling generalization in terms of regularization [6], and they enable characterizations of theoretically optimal solutions of learning tasks [3, 19, 11].

Arguments proving the universal approximation property of RBF networks using sequences of scaled kernels might suggest that variability of widths is necessary for universal approximation. However, for the special case of the Gaussian kernel, the universal approximation property holds even when the width is fixed and merely the centers vary [14, 12].

On the other hand, it is easy to find examples of positive semidefinite kernels such that the sets of input-output functions of shallow networks with units generated by these kernels are too small to be universal approximators. For example, networks with product kernel units of the form $K(x,y) = k(x)k(y)$ generate as input-output functions only scalar multiples $c\,k(x)$ of the function $k$ (see the sketch below).
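The degeneracy of product kernels noted above is easy to check numerically. The following minimal sketch (an illustration added here, not from the paper) takes a one-dimensional Gaussian as $k$ and verifies that every network $\sum_i w_i k(x)k(y_i)$ collapses to a scalar multiple of $k$:

```python
import numpy as np

# Product kernel K(x, y) = k(x) * k(y); k is an arbitrary illustrative choice
# (here a one-dimensional Gaussian).
k = lambda x: np.exp(-x**2)

# A "network" with three product-kernel units: f(x) = sum_i w_i * K(x, y_i).
w = np.array([0.5, -1.2, 2.0])    # output weights
y = np.array([0.0, 1.0, -0.7])    # unit parameters

x = np.linspace(-3.0, 3.0, 200)
f = sum(wi * k(x) * k(yi) for wi, yi in zip(w, y))

# The sum collapses to c * k(x) with c = sum_i w_i * k(y_i): no matter how
# many units are used, only scalar multiples of k are computable.
c = np.sum(w * k(y))
assert np.allclose(f, c * k(x))
```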
In this paper, we investigate capabilities of networks with one hidden layer of convolutional kernel units to approximate multivariable functions. We show that a crucial property influencing whether the sets of input-output functions of convolutional kernel networks are large enough to be universal approximators is the behavior of the Fourier transform of the one-variable function generating the convolutional kernel. We give a necessary and sufficient condition for universal approximation by kernel networks in terms of the Fourier transforms of kernels. We compare this condition with properties of kernels guaranteeing their positive definiteness. We illustrate our results by examples of some common kernels such as the Gaussian, Laplace, parabolic, rectangle, and triangle kernels.

The paper is organized as follows. In section 2, notations and basic concepts on one-hidden-layer networks and kernel units are introduced. In section 3, a necessary and sufficient condition on a convolutional kernel is given that guarantees that networks with units induced by the kernel have the universal approximation property. In section 4, this condition is compared with a condition guaranteeing that a kernel is positive semidefinite, and some examples of kernels satisfying both or only one of these conditions are given. Section 5 is a brief discussion.

2 Preliminaries

Radial-basis-function networks as well as kernel models belong to the class of one-hidden-layer networks with one linear output unit. Such networks compute input-output functions from sets of the form

$$\operatorname{span} G = \left\{ \sum_{i=1}^{n} w_i g_i \,\middle|\, w_i \in \mathbb{R},\; g_i \in G,\; n \in \mathbb{N}_+ \right\},$$

where the set $G$ is called a dictionary [8], and $\mathbb{R}$ and $\mathbb{N}_+$ denote the sets of real numbers and positive integers, respectively. Typically, dictionaries are parameterized families of functions modeling computational units, i.e., they are of the form

$$G_K(X,Y) = \{ K(\cdot,y) : X \to \mathbb{R} \mid y \in Y \},$$

where $K : X \times Y \to \mathbb{R}$ is a function of two variables, an input vector $x \in X \subseteq \mathbb{R}^d$ and a parameter $y \in Y \subseteq \mathbb{R}^s$. Such functions of two variables are called kernels. This term, derived from the German term "Kern", has been used since 1904 in the theory of integral operators [18, p. 291].

An important class of kernels are convolutional kernels, which are obtained by translations of one-variable functions $k : \mathbb{R}^d \to \mathbb{R}$ as

$$K(x,y) = k(x-y).$$

Radial convolutional kernels are convolutional kernels obtained as translations of radial functions, i.e., functions of the form

$$k(x) = k_1(\|x\|),$$

where $k_1 : \mathbb{R}_+ \to \mathbb{R}$.

The convolution is the operation defined as

$$f * g(x) = \int_{\mathbb{R}^d} f(x-y)\, g(y)\, dy = \int_{\mathbb{R}^d} f(y)\, g(x-y)\, dy$$

[20, p. 170].

Recall that a kernel $K : X \times X \to \mathbb{R}$ is called positive semidefinite if for any positive integer $m$, any $x_1, \ldots, x_m \in X$, and any $a_1, \ldots, a_m \in \mathbb{R}$,

$$\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j K(x_i, x_j) \geq 0.$$

Similarly, a function of one variable $k : \mathbb{R}^d \to \mathbb{R}$ is called positive semidefinite if for any positive integer $m$, any $x_1, \ldots, x_m \in \mathbb{R}^d$, and any $a_1, \ldots, a_m \in \mathbb{R}$,

$$\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j k(x_i - x_j) \geq 0.$$

For symmetric positive semidefinite kernels $K$, the sets $\operatorname{span} G_K(X)$ of input-output functions of networks with units induced by the kernel $K$ are contained in Hilbert spaces defined by these kernels. These spaces are called reproducing kernel Hilbert spaces (RKHS) and denoted $\mathcal{H}_K(X)$. They are formed by the functions from

$$\operatorname{span} G_K(X) = \operatorname{span}\{ K_x \mid x \in X \},$$

where $K_x(\cdot) = K(x,\cdot)$, together with the limits of their Cauchy sequences in the norm $\|\cdot\|_K$. The norm $\|\cdot\|_K$ is induced by the inner product $\langle \cdot,\cdot \rangle_K$, which is defined on $G_K(X) = \{ K_x \mid x \in X \}$ as

$$\langle K_x, K_y \rangle_K = K(x,y).$$

So $\operatorname{span} G_K(X) \subset \mathcal{H}_K(X)$.
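The two notions just defined can be illustrated with a short numerical sketch (added here for illustration; the kernel, dimensions, and sample sizes are arbitrary choices). It evaluates the input-output function $\sum_i w_i k(x - y_i)$ of a convolutional kernel network and probes positive semidefiniteness of the Gaussian kernel through the eigenvalues of a Gram matrix, which must be nonnegative up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_gauss(x):
    """Radial convolutional kernel k(x) = exp(-||x||^2) with a fixed width."""
    return np.exp(-np.sum(x**2, axis=-1))

def network(x, centers, w):
    """Input-output function of a convolutional kernel network:
    f(x) = sum_i w_i * k(x - y_i)."""
    return sum(wi * k_gauss(x - yi) for wi, yi in zip(w, centers))

d, m = 3, 50
centers = rng.normal(size=(m, d))   # parameters y_i (translations)
w = rng.normal(size=m)              # output weights w_i
print(network(rng.normal(size=d), centers, w))

# Positive semidefiniteness probe: the Gram matrix G_ij = k(x_i - x_j) must
# satisfy a^T G a >= 0 for every a, i.e., have only nonnegative eigenvalues.
pts = rng.normal(size=(m, d))
gram = k_gauss(pts[:, None, :] - pts[None, :, :])
print(np.linalg.eigvalsh(gram).min())   # nonnegative up to rounding error
```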
3 Universal approximation capability of convolutional kernel networks

In this section, we investigate conditions guaranteeing that the sets of input-output functions of convolutional kernel networks are large enough to be universal approximators.

The universal approximation property is formally defined as density in a normed linear space. A class of one-hidden-layer networks with units from a dictionary $G$ is said to have the universal approximation property in a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ if it is dense in this space, i.e., $\operatorname{cl}_{\mathcal{X}} \operatorname{span} G = \mathcal{X}$, where $\operatorname{span} G$ denotes the linear span of $G$ and $\operatorname{cl}_{\mathcal{X}}$ denotes the closure with respect to the topology induced by the norm $\|\cdot\|_{\mathcal{X}}$. More precisely, for every $f \in \mathcal{X}$ and every $\varepsilon > 0$ there exist a positive integer $n$, $g_1, \ldots, g_n \in G$, and $w_1, \ldots, w_n \in \mathbb{R}$ such that

$$\left\| f - \sum_{i=1}^{n} w_i g_i \right\|_{\mathcal{X}} < \varepsilon.$$

Function spaces where the universal approximation property has been of interest are the spaces $(C(X), \|\cdot\|_{\sup})$ of continuous functions on subsets $X$ of $\mathbb{R}^d$ (typically compact) with the supremum norm

$$\|f\|_{\sup} = \sup_{x \in X} |f(x)|$$

and the spaces $(\mathcal{L}^p(\mathbb{R}^d), \|\cdot\|_{\mathcal{L}^p})$ of functions on $\mathbb{R}^d$ with finite $\int_{\mathbb{R}^d} |f(y)|^p \, dy$ and the norm

$$\|f\|_{\mathcal{L}^p} = \left( \int_{\mathbb{R}^d} |f(y)|^p \, dy \right)^{1/p}.$$

Recall that the $d$-dimensional Fourier transform is an isometry on $\mathcal{L}^2(\mathbb{R}^d)$ defined on $\mathcal{L}^2(\mathbb{R}^d) \cap \mathcal{L}^1(\mathbb{R}^d)$ as

$$\hat{f}(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i x \cdot s} f(x) \, dx$$

and extended to $\mathcal{L}^2(\mathbb{R}^d)$ [20, p. 183]. Note that the Fourier transform of an even function is real and the Fourier transform of a radial function is radial.

… $f_0 \in \mathcal{L}^2(\mathbb{R}^d) \setminus \operatorname{cl}_{\mathcal{L}^2} \operatorname{span} G_K(\mathbb{R}^d)$, $l(f_0) = 1$. By the Riesz Representation Theorem [5, p. 206], $l$ can be expressed as an inner product with some $h \in \mathcal{L}^2(\mathbb{R}^d)$.
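To preview the role the Fourier transform plays in the condition announced in the abstract, here is a short numerical sketch (an added illustration; grids and kernels are arbitrary choices). It compares the one-dimensional transform of the Gaussian, which is everywhere positive, with that of the rectangle kernel, whose sinc-shaped transform changes sign:

```python
import numpy as np

# Numerical 1-d Fourier transform on a wide uniform grid:
#   f_hat(s) = (2*pi)**(-1/2) * integral of exp(-i*x*s) * f(x) dx,
# approximated here by a simple Riemann sum.
x = np.linspace(-20.0, 20.0, 8001)
dx = x[1] - x[0]
s = np.linspace(-4.0, 4.0, 161)

def fourier(f_vals):
    waves = np.exp(-1j * np.outer(s, x))   # one row per frequency s
    return (waves * f_vals).sum(axis=1) * dx / np.sqrt(2.0 * np.pi)

gauss = np.exp(-x**2 / 2.0)               # Gaussian kernel with a fixed width
rect = (np.abs(x) <= 1.0).astype(float)   # rectangle kernel on [-1, 1]

g_hat = fourier(gauss).real   # transforms of even real functions are real
r_hat = fourier(rect).real    # equals sqrt(2/pi) * sin(s)/s, a sinc

print(g_hat.min() > 0.0)   # True: the Gaussian's transform is positive everywhere
print(r_hat.min() < 0.0)   # True: the rectangle's transform changes sign
```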