
Approximation Error and

Federico Girosi

Santa Monica, CA

[email protected]

Plan of the class

• Learning and generalization error

• Approximation problem and rates of convergence

• N-widths

• “Dimension independent” convergence rates

Note

These slides cover more extensive material than what will be presented in class.

References

The background material on generalization error (first 8 slides) is explained at length in:

1. P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for Radial Basis Functions. Neural Computation, 8:819–842, 1996.

2. P. Niyogi and F. Girosi. Generalization bounds for approximation from scattered noisy data. Advances in Computational Mathematics, 10:51–80, 1999.

[1] has a longer explanation and introduction, while [2] is more mathematical and also contains a very simple probabilistic proof of a class of “dimension independent” bounds, like the ones discussed at the end of this class.

As far as I know it is A. Barron who first clearly spelled out the decomposition of the generalization error in two parts. Barron uses a different framework from the one we use, and he summarizes it nicely in:

3. A.R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.

The paper is quite technical, and uses a framework which is different from the one we use here, but it is important to read it if you plan to do research in this field.

The material on n-widths comes from:

4. A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, New York, 1980.

Although the book is very technical, the first 8 pages contain an excellent introduction to the subject. The other great thing about this book is that you do not need to understand every single proof to appreciate the beauty and significance of the results, and it is a mine of useful information.

5. H.N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8:164–177, 1996.

6. A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

7. F. Girosi and G. Anzellotti. Rates of convergence of approximation by translates. A.I. Memo 1288, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.

For a curious way to prove dimension independent bounds using VC theory see:

8. F. Girosi. Approximation error bounds that use VC-bounds. In Proc. International Conference on Artificial Neural Networks, F. Fogelman-Soulié and P. Gallinari, editors, Vol. 1, 295–302. Paris, France, October 1995.

Notations review

• $I[f] = \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy$

• $I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} V(f(x_i), y_i)$

• $f_0 = \arg\min_{f} I[f]$, $f_0 \in T$

• $f_H = \arg\min_{f \in H} I[f]$

• $\hat{f}_{H,l} = \arg\min_{f \in H} I_{emp}[f]$
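To make the notation concrete, here is a minimal numerical sketch, not taken from the slides: with the quadratic loss, $\hat{f}_{H,l}$ is whichever element of $H$ minimizes the empirical risk $I_{emp}$ on the $l$ data points. The target function, the noise level and the four candidate hypotheses below are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative setup: y = f0(x) + noise, quadratic loss V(f(x), y) = (f(x) - y)^2
rng = np.random.default_rng(0)
l = 50
x = rng.uniform(-np.pi, np.pi, size=l)
y = np.sin(x) + 0.1 * rng.normal(size=l)      # noisy samples of a target f0(x) = sin(x)

# A small, purely illustrative hypothesis space H (four fixed candidate functions)
hypotheses = {
    "sin(x)":     np.sin,
    "0.5 sin(x)": lambda t: 0.5 * np.sin(t),
    "cos(x)":     np.cos,
    "0":          lambda t: np.zeros_like(t),
}

def I_emp(h):
    """Empirical risk: (1/l) * sum_i V(h(x_i), y_i) with the quadratic loss."""
    return np.mean((h(x) - y) ** 2)

# f_hat_{H,l} is the empirical risk minimizer over H
risks = {name: I_emp(h) for name, h in hypotheses.items()}
for name, r in risks.items():
    print(f"I_emp[{name}] = {r:.4f}")
print("empirical risk minimizer:", min(risks, key=risks.get))
```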

More notation review

• $I[f_0]$ = how well we could possibly do

• $I[f_H]$ = how well we can do in space $H$

• $I[\hat{f}_{H,l}]$ = how well we do in space $H$ with $l$ data

• $|I[f] - I_{emp}[f]| \leq \Omega(H, l, \delta) \quad \forall f \in H$ (from VC theory)

• $I[\hat{f}_{H,l}]$ is called generalization error

• $I[\hat{f}_{H,l}] - I[f_0]$ is also called generalization error ...

A General Decomposition

$$I[\hat{f}_{H,l}] - I[f_0] = \left( I[\hat{f}_{H,l}] - I[f_H] \right) + \left( I[f_H] - I[f_0] \right)$$

generalization error = estimation error + approximation error

When the cost function V is quadratic:

$$I[f] = \|f_0 - f\|^2 + I[f_0]$$

and therefore

$$\|f_0 - \hat{f}_{H,l}\|^2 = \left( I[\hat{f}_{H,l}] - I[f_H] \right) + \|f_0 - f_H\|^2$$

generalization error = estimation error + approximation error
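As a reminder of where this identity comes from (a standard computation, not spelled out on the slide; it assumes $V(f(x), y) = (f(x) - y)^2$ and that $f_0$ is the regression function $f_0(x) = \int dy\, y\, p(y|x)$):

$$I[f] = \int_{X \times Y} (f(x) - y)^2\, p(x, y)\, dx\, dy = \int_X (f(x) - f_0(x))^2\, p(x)\, dx + 2 \int_{X \times Y} (f(x) - f_0(x))(f_0(x) - y)\, p(x, y)\, dx\, dy + I[f_0]$$

The cross term vanishes because $\int dy\, (f_0(x) - y)\, p(y|x) = 0$ for every $x$, which leaves $I[f] = \|f_0 - f\|^2 + I[f_0]$, with $\|\cdot\|$ the $L_2$ norm weighted by $p(x)$.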

A useful inequality

If, with probability $1 - \delta$,

$$|I[f] - I_{emp}[f]| \leq \Omega(H, l, \delta) \quad \forall f \in H$$

then

$$I[\hat{f}_{H,l}] - I[f_H] \leq 2\, \Omega(H, l, \delta)$$

You can prove it using the following observations:

• $I[f_H] \leq I[\hat{f}_{H,l}]$ (from the definition of $f_H$)

• $I_{emp}[\hat{f}_{H,l}] \leq I_{emp}[f_H]$ (from the definition of $\hat{f}_{H,l}$)
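Putting the two observations together with the VC bound (a one-line chain, added here for completeness), on the event where the bound holds:

$$I[\hat{f}_{H,l}] - I[f_H] = \left( I[\hat{f}_{H,l}] - I_{emp}[\hat{f}_{H,l}] \right) + \left( I_{emp}[\hat{f}_{H,l}] - I_{emp}[f_H] \right) + \left( I_{emp}[f_H] - I[f_H] \right) \leq \Omega + 0 + \Omega = 2\, \Omega(H, l, \delta)$$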

Bounding the Generalization Error

$$\|f_0 - \hat{f}_{H,l}\|^2 \leq 2\, \Omega(H, l, \delta) + \|f_0 - f_H\|^2$$

Notice that:

• $\Omega$ has nothing to do with the target space $T$; it is studied mostly in statistics;

• $\|f_0 - f_H\|$ has everything to do with the target space $T$; it is studied mostly in approximation theory.

Approximation Error

We consider a nested family of hypothesis spaces $H_n$:

$$H_0 \subset H_1 \subset \ldots \subset H_n \subset \ldots$$

and define the approximation error as:

$$\epsilon_T(f, H_n) \equiv \inf_{h \in H_n} \|f - h\|$$

$\epsilon_T(f, H_n)$ is the smallest error that we can make if we approximate $f \in T$ with an element of $H_n$ (here $\|\cdot\|$ is the norm in $T$).

Approximation error

For reasonable choices of hypothesis spaces $H_n$:

$$\lim_{n \to \infty} \epsilon_T(f, H_n) = 0$$

This means that we can approximate functions of $T$ arbitrarily well with elements of $\{H_n\}_{n=1}^{\infty}$.

Example: $T$ = continuous functions on compact sets, and $H_n$ = polynomials of degree at most $n$.
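As a quick numerical illustration of this density statement (added here; the target $f(x) = |x|$ and the least-squares fitting procedure are illustrative choices, not part of the slides), the best-fit polynomial error shrinks as the degree grows:

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Illustrative target: f(x) = |x| on [-1, 1], approximated by polynomials of degree n
x = np.linspace(-1.0, 1.0, 2001)
f = np.abs(x)

for n in [1, 2, 4, 8, 16, 32, 64]:
    p = Chebyshev.fit(x, f, deg=n)      # least-squares polynomial fit of degree n
    err = np.max(np.abs(f - p(x)))      # sup-norm error measured on the grid
    print(f"degree n = {n:3d}   max error = {err:.4f}")
```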

The rate of convergence

The interesting question is:

How fast does $\epsilon_T(f, H_n)$ go to zero?

• The rate of convergence is a measure of the relative complexity of T with respect to the approximation scheme H.

• The rate of convergence determines how many samples we need in order to obtain a given generalization error.

An Example

• In the next slides we compute explicitly the rate of convergence of approximation of a smooth function by trigonometric polynomials.

• We are interested in studying how fast the approximation error goes to zero when the number of parameters of our approximation scheme goes to infinity.

• The reason for this exercise is that the results are representative: more complex and interesting cases all share the basic features of this example.

Approximation by Trigonometric Polynomials

Consider the set of functions

$$C_2[-\pi, \pi] \equiv C[-\pi, \pi] \cap L_2[-\pi, \pi]$$

Functions in this set can be represented as a Fourier series:

$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}, \qquad c_k \propto \int_{-\pi}^{\pi} dx\, f(x)\, e^{-ikx}$$

The $L_2$ norm satisfies the equation:

$$\|f\|_{L_2}^2 = \sum_{k=0}^{\infty} c_k^2$$

We consider as target space the following Sobolev space of smooth functions:

$$W_{s,2} \equiv \left\{ f \in C_2[-\pi, \pi] \;\Big|\; \left\| \frac{d^s f}{dx^s} \right\|_{L_2}^2 < +\infty \right\}$$

The (semi)-norm in this Sobolev space is defined as:

$$\|f\|_{W_{s,2}}^2 \equiv \left\| \frac{d^s f}{dx^s} \right\|_{L_2}^2 = \sum_{k=1}^{\infty} k^{2s} c_k^2$$

If $f$ belongs to $W_{s,2}$ then the Fourier coefficients $c_k$ must go to zero at a rate which increases with $s$.

We choose as hypothesis space $H_n$ the set of trigonometric polynomials of degree $n$:

$$p(x) = \sum_{k=1}^{n} a_k e^{ikx}$$

Given a function of the form

$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}$$

the optimal hypothesis $f_n(x)$ is given by the first $n$ terms of its Fourier series:

$$f_n(x) = \sum_{k=1}^{n} c_k e^{ikx}$$
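Why the truncated series is the best element of $H_n$ (a short justification added here, using the orthogonality of the Fourier basis): for any $p(x) = \sum_{k=1}^{n} a_k e^{ikx}$ in $H_n$,

$$\|f - p\|_{L_2}^2 = \sum_{k} |c_k - a_k|^2 \qquad (\text{with } a_k \equiv 0 \text{ for } k > n),$$

which is minimized by choosing $a_k = c_k$ for $k \leq n$, i.e. by $p = f_n$, leaving exactly the tail $\sum_{k > n} c_k^2$.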

Key Question

For a given $f \in W_{s,2}$ we want to study the approximation error:

$$\epsilon_n[f] \equiv \|f - f_n\|_{L_2}^2$$

• Notice that $n$, the degree of the trigonometric polynomial, is also the number of parameters that we use in the approximation.

• Obviously $\epsilon_n$ goes to zero as $n \to +\infty$, but the key question is: how fast?

An easy estimate of $\epsilon_n$

$$\epsilon_n[f] \equiv \|f - f_n\|_{L_2}^2 = \sum_{k=n+1}^{\infty} c_k^2 = \sum_{k=n+1}^{\infty} c_k^2\, k^{2s}\, \frac{1}{k^{2s}} < \frac{1}{n^{2s}} \sum_{k=n+1}^{\infty} c_k^2\, k^{2s} < \frac{1}{n^{2s}} \sum_{k=1}^{\infty} c_k^2\, k^{2s} = \frac{\|f\|_{W_{s,2}}^2}{n^{2s}}$$

$$\epsilon_n[f] < \frac{\|f\|_{W_{s,2}}^2}{n^{2s}}$$

More smoothness ⇒ faster rate of convergence
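A small numerical sanity check of this bound (added here; the coefficient sequence $c_k = k^{-(s+1)}$ and the truncation at $K$ terms are illustrative stand-ins for a generic $f \in W_{s,2}$):

```python
import numpy as np

# Illustrative f in W_{s,2}: Fourier coefficients c_k = k^{-(s+1)}, truncated at K terms
s, K = 2, 100_000
k = np.arange(1, K + 1, dtype=float)
c = k ** (-(s + 1))

sobolev_sq = np.sum(k ** (2 * s) * c ** 2)    # ||f||^2_{W_{s,2}} = sum_k k^{2s} c_k^2

for n in [5, 10, 20, 40, 80]:
    eps_n = np.sum(c[n:] ** 2)                # tail sum_{k > n} c_k^2  (c[n:] starts at k = n+1)
    bound = sobolev_sq / n ** (2 * s)         # the bound ||f||^2_{W_{s,2}} / n^{2s}
    print(f"n = {n:3d}   eps_n = {eps_n:.3e}   bound = {bound:.3e}")
```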

But what happens in more than one dimension?

Rates of convergence in d dimensions

It is enough to study d = 2. We proceed in full analogy with the 1-d case:

$$f(x, y) = \sum_{k,m=1}^{\infty} c_{km}\, e^{i(kx+my)}$$

$$\|f\|_{W_{s,2}}^2 \equiv \left\| \frac{\partial^s f}{\partial x^s} \right\|_{L_2}^2 + \left\| \frac{\partial^s f}{\partial y^s} \right\|_{L_2}^2 = \sum_{k,m=1}^{\infty} (k^{2s} + m^{2s})\, c_{km}^2$$

Here $W_{s,2}$ is defined as the set of functions such that $\|f\|_{W_{s,2}}^2 < +\infty$.

We choose as hypothesis space $H_n$ the set of trigonometric polynomials of degree $l$:

$$p(x, y) = \sum_{k,m=1}^{l} a_{km}\, e^{i(kx+my)}$$

A trigonometric polynomial of degree $l$ in $d$ variables has a number of coefficients $n = l^d$. We are interested in the behavior of the approximation error as a function of $n$. The approximating function is:

$$f_n(x, y) = \sum_{k,m=1}^{l} c_{km}\, e^{i(kx+my)}$$

An easy estimate of $\epsilon_n$

$$\epsilon_n[f] \equiv \|f - f_n\|_{L_2}^2 = \sum_{k,m=l+1}^{\infty} c_{km}^2 = \sum_{k,m=l+1}^{\infty} c_{km}^2\, \frac{k^{2s} + m^{2s}}{k^{2s} + m^{2s}} <$$

$$< \frac{1}{2 l^{2s}} \sum_{k,m=l+1}^{\infty} c_{km}^2\, (k^{2s} + m^{2s}) < \frac{1}{2 l^{2s}} \sum_{k,m=1}^{\infty} c_{km}^2\, (k^{2s} + m^{2s}) = \frac{\|f\|_{W_{s,2}}^2}{2 l^{2s}}$$

Since $n = l^d$, then $l = n^{1/d}$ (with $d = 2$), and we obtain:

$$\epsilon_n < \frac{\|f\|_{W_{s,2}}^2}{2\, n^{2s/d}}$$

(Partial) Summary

The previous calculation generalizes easily to the d-dimensional case. Therefore we conclude that: if we approximate functions of $d$ variables with $s$ square integrable derivatives with a trigonometric polynomial with $n$ coefficients, the approximation error satisfies:

$$\epsilon_n < \frac{C}{n^{2s/d}}$$

More smoothness $s$ ⇒ faster rate of convergence

Higher dimension $d$ ⇒ slower rate of convergence

Another Example: Generalized Translation Networks

Consider networks of the form:

$$f(x) = \sum_{k=1}^{n} a_k\, \phi(A_k x + b_k)$$

where $x \in \mathbb{R}^d$, $b_k \in \mathbb{R}^m$, $1 \leq m \leq d$, the $A_k$ are $m \times d$ matrices, $a_k \in \mathbb{R}$ and $\phi$ is some given function.

For $m = 1$ this is a Multilayer Perceptron.

For $m = d$, $A_k$ diagonal and $\phi$ radial this is a Radial Basis Functions network.
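A minimal sketch of this family of networks, added for concreteness; the choices of $\phi$, the random weights and the dimensions below are arbitrary illustrations, not the ones constructed in the theorem that follows:

```python
import numpy as np

def gtn(x, A, b, a, phi):
    """Generalized translation network f(x) = sum_k a_k * phi(A_k x + b_k).

    x: (d,) input;  A: (n, m, d) matrices;  b: (n, m) biases;  a: (n,) coefficients.
    """
    return sum(a_k * phi(A_k @ x + b_k) for A_k, b_k, a_k in zip(A, b, a))

rng = np.random.default_rng(0)
d, n = 3, 5
x = rng.normal(size=d)

# m = 1 with a sigmoidal phi: a multilayer perceptron with one hidden layer
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u[0]))
A1 = rng.normal(size=(n, 1, d)); b1 = rng.normal(size=(n, 1)); a1 = rng.normal(size=n)
print("MLP-style output:", gtn(x, A1, b1, a1, sigmoid))

# m = d with A_k diagonal and a radial (Gaussian) phi: a radial basis functions network
gauss = lambda u: np.exp(-np.dot(u, u))
Ad = np.stack([np.diag(rng.uniform(0.5, 1.5, size=d)) for _ in range(n)])
bd = rng.normal(size=(n, d)); ad = rng.normal(size=n)
print("RBF-style output:", gtn(x, Ad, bd, ad, gauss))
```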

Theorem (Mhaskar, 1994)

Let $W_s^p(\mathbb{R}^d)$ be the space of functions whose derivatives up to order $s$ are $p$-integrable in $\mathbb{R}^d$. Under very general assumptions on $\phi$ one can prove that there exist $m \times d$ matrices $\{A_k\}_{k=1}^{n}$ such that, for any $f \in W_s^p(\mathbb{R}^d)$, one can find $b_k$ and $a_k$ such that:

$$\left\| f - \sum_{k=1}^{n} a_k\, \phi(A_k x + b_k) \right\|_{p} \leq c\, n^{-s/d}\, \|f\|_{W_s^p}$$

Moreover, the coefficients $a_k$ are linear functionals of $f$. This rate is optimal.

The curse of dimensionality

If the approximation error is

$$\epsilon_n \propto \left(\frac{1}{n}\right)^{s/d}$$

then the number of parameters needed to achieve an error smaller than $\epsilon$ is:

$$n \propto \left(\frac{1}{\epsilon}\right)^{d/s}$$

The curse of dimensionality is the $d$ factor; the blessing of smoothness is the $s$ factor.
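To get a feel for the numbers, an illustrative computation (added here) that ignores the constant in front of the estimate:

```python
# Parameters needed to reach error eps when n ~ (1/eps)^(d/s), ignoring constants
eps, s = 0.1, 2
for d in [1, 2, 4, 8, 16]:
    n = (1.0 / eps) ** (d / s)
    print(f"d = {d:2d}   n ~ {n:.3g}")
```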

Jackson type rates of convergence

It happens “very often” that rates of convergence for functions in $d$ dimensions with “smoothness” of order $s$ are of the Jackson type:

$$O\!\left(\left(\frac{1}{n}\right)^{s/d}\right)$$

Example: polynomial and spline approximation techniques, many non-linear techniques.

Can we do better than this? Can we defeat the curse of dimensionality? Have we tried hard enough to find “good” approximation techniques?

N-widths: definition (from Pinkus, 1980)

Let $X$ be a normed space of functions and $A$ a subset of $X$. We want to approximate elements of $X$ with linear superpositions of $n$ basis functions $\{\phi_i\}_{i=1}^{n}$. Some sets of basis functions are better than others: which are the best? What error do they achieve? To answer these questions we define the Kolmogorov n-width of $A$ in $X$:

$$d_n(A, X) = \inf_{\phi_1, \ldots, \phi_n}\; \sup_{f \in A}\; \inf_{c_1, \ldots, c_n} \left\| f - \sum_{i=1}^{n} c_i \phi_i \right\|_X$$

Example (Kolmogorov, 1936)

$$X = L_2[0, 2\pi]$$

$$\tilde{W}_2^s \equiv \{ f \mid f \in C^{s-1}[0, 2\pi],\ f^{(j)} \text{ periodic},\ j = 0, \ldots, s-1 \}$$

$$A = \tilde{B}_2^s \equiv \{ f \mid f \in \tilde{W}_2^s,\ \|f^{(s)}\|_2 \leq 1 \} \subset X$$

Then

$$d_{2n-1}(\tilde{B}_2^s, L_2) = d_{2n}(\tilde{B}_2^s, L_2) = \frac{1}{n^s}$$

and the following $X_n$ is optimal (in the sense that it achieves the rate above):

$$X_{2n-1} = \mathrm{span}\{1, \sin(x), \cos(x), \ldots, \sin((n-1)x), \cos((n-1)x)\}$$

Example: multivariate case

$$I_d \equiv [0, 1]^d$$

$$X = L_2[I_d]$$

$$W_2^s[I_d] \equiv \{ f \mid f \in C^{s-1}[I_d],\ f^{(s)} \in L_2[I_d] \}$$

$$B_2^s \equiv \{ f \mid f \in W_2^s[I_d],\ \|f^{(s)}\|_2 \leq 1 \}$$

Theorem (from Pinkus, 1980)

$$d_n(B_2^s, L_2) \approx \left(\frac{1}{n}\right)^{s/d}$$

Optimal basis functions are usually splines (or their relatives).

Dimensionality and smoothness

Classes of functions in $d$ dimensions with smoothness of order $s$ have an intrinsic complexity characterized by the ratio $s/d$:

• the curse of dimensionality is the d factor;

• the blessing of smoothness is the s factor;

We cannot expect to find an approximation technique that “beats the curse of dimensionality”, unless we let the smoothness $s$ increase with the dimension $d$.

Theorem (Barron, 1991)

Let $f$ be a function such that its Fourier transform satisfies

$$\int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$$

Let $\Omega$ be a bounded domain in $\mathbb{R}^d$. Then we can find a neural network with $n$ coefficients $c_i$, $n$ weights $w_i$ and $n$ biases $\theta_i$ such that

$$\left\| f - \sum_{i=1}^{n} c_i\, \sigma(x \cdot w_i + \theta_i) \right\|_{L_2(\Omega)}^2 < \frac{c}{n}$$

The rate of convergence is independent of the dimension $d$.

Here is the trick...

The space of functions such that

$$\int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$$

is the space of functions that can be written as

$$f = \frac{1}{\|x\|^{d-1}} \ast \lambda$$

where $\lambda$ is any function whose Fourier transform is integrable. Notice how the space becomes more constrained as the dimension increases.
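A sketch of why the two descriptions agree (added here; constants are omitted and $d \geq 2$ is assumed): in $\mathbb{R}^d$ the Fourier transform of $\|x\|^{-(d-1)}$ is, up to a constant, $\|\omega\|^{-1}$, so

$$f = \frac{1}{\|x\|^{d-1}} \ast \lambda \;\Longleftrightarrow\; \tilde{f}(\omega) \propto \frac{\tilde{\lambda}(\omega)}{\|\omega\|} \;\Longleftrightarrow\; \|\omega\|\, |\tilde{f}(\omega)| \propto |\tilde{\lambda}(\omega)|,$$

and the right-hand side is integrable exactly when the Fourier transform of $\lambda$ is.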

Theorem (Girosi and Anzellotti, 1992)

Let $f \in H^{s,1}(\mathbb{R}^d)$, where $H^{s,1}(\mathbb{R}^d)$ is the space of functions whose partial derivatives up to order $s$ are integrable, and let $K_s(x)$ be the Bessel-Macdonald kernel, that is the Fourier transform of

$$\tilde{K}_s(\omega) = \frac{1}{(1 + \|\omega\|^2)^{s/2}}, \qquad s > 0.$$

If $s > d$ and $s$ is even, we can find a Radial Basis Functions network with $n$ coefficients $c_\alpha$ and $n$ centers $t_\alpha$ such that

$$\left\| f - \sum_{\alpha=1}^{n} c_\alpha\, K_s(x - t_\alpha) \right\|_{L_\infty}^2 < \frac{c}{n}$$

Theorem (Girosi, 1992)

Let $f \in H^{s,1}(\mathbb{R}^d)$, where $H^{s,1}(\mathbb{R}^d)$ is the space of functions whose partial derivatives up to order $s$ are integrable. If $s > d$ and $s$ is even, we can find a Gaussian basis function network with $n$ coefficients $c_\alpha$, $n$ centers $t_\alpha$ and $n$ variances $\sigma_\alpha$ such that

$$\left\| f - \sum_{\alpha=1}^{n} c_\alpha\, e^{-\frac{(x - t_\alpha)^2}{2\sigma_\alpha^2}} \right\|_{L_\infty}^2 < \frac{c}{n}$$

Same rate of convergence: $O\!\left(\frac{1}{\sqrt{n}}\right)$

Function space | Norm | Approximation scheme
$\int_{\mathbb{R}^d} d\omega\, |\tilde{f}(\omega)| < +\infty$ | $L_2(\Omega)$ | $f(x) = \sum_{i=1}^{n} c_i \sin(x \cdot w_i + \theta_i)$   (Jones)
$\int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$ | $L_2(\Omega)$ | $f(x) = \sum_{i=1}^{n} c_i\, \sigma(x \cdot w_i + \theta_i)$   (Barron)
$\int_{\mathbb{R}^d} d\omega\, \|\omega\|^2\, |\tilde{f}(\omega)| < +\infty$ | $L_2(\Omega)$ | $f(x) = \sum_{i=1}^{n} c_i\, |x \cdot w_i + \theta_i|_+ + a \cdot x + b$   (Breiman)
$\tilde{f}(\omega) \in C_0^s$, $2s > d$ | $L_\infty(\mathbb{R}^d)$ | $f(x) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\|x - t_\alpha\|^2}$   (Girosi and Anzellotti)
$H^{2s,1}(\mathbb{R}^d)$, $2s > d$ | $L_\infty(\mathbb{R}^d)$ | $f(x) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\|x - t_\alpha\|^2 / (2\sigma_\alpha^2)}$   (Girosi)

Summary

• There is a trade off between the size of the sample (l) and the size of the hypothesis space n;

• For a given pair of hypothesis and target space the approximation error depends on the trade off between dimensionality and smoothness;

• The trade off has a “generic” form and sets bounds on what can and cannot be done, both in linear and non-linear approximation;

• Suitable spaces, which trade dimensionality versus smoothness, can be defined in such a way that the rate of convergence of the approximation error is independent of the dimensionality.