Approximation Error and Approximation Theory
Federico Girosi
Center for Basic Research in the Social Sciences, Harvard University
and
Center for Biological and Computational Learning, MIT
[email protected]

Plan of the class

• Learning and generalization error
• Approximation problem and rates of convergence
• N-widths
• "Dimension independent" convergence rates

Note

These slides cover more extensive material than what will be presented in class.

References

The background material on generalization error (first 8 slides) is explained at length in:

1. P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for Radial Basis Functions. Neural Computation, 8:819–842, 1996.
2. P. Niyogi and F. Girosi. Generalization bounds for function approximation from scattered noisy data. Advances in Computational Mathematics, 10:51–80, 1999.

[1] has a longer explanation and introduction, while [2] is more mathematical and also contains a very simple probabilistic proof of a class of "dimension independent" bounds, like the ones discussed at the end of this class. As far as I know, it is A. Barron who first clearly spelled out the decomposition of the generalization error into two parts. Barron uses a different framework from the one we use, and he summarizes it nicely in:

3. A.R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.

The paper is quite technical, and uses a framework which is different from the one used here, but it is important to read it if you plan to do research in this field. The material on n-widths comes from:

4. A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, New York, 1980.

Although the book is very technical, the first 8 pages contain an excellent introduction to the subject. The other great thing about this book is that you do not need to understand every single proof to appreciate the beauty and significance of the results, and it is a mine of useful information.

5. H.N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8:164–177, 1996.
6. A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
7. F. Girosi and G. Anzellotti. Rates of convergence of approximation by translates. A.I. Memo 1288, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.

For a curious way to prove dimension independent bounds using VC theory see:

8. F. Girosi. Approximation error bounds that use VC-bounds. In Proc. International Conference on Artificial Neural Networks, F. Fogelman-Soulié and P. Gallinari, editors, Vol. 1, 295–302. Paris, France, October 1995.

Notations review

• $I[f] = \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy$
• $I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} V(f(x_i), y_i)$
• $f_0 = \arg\min_{f} I[f]$, with $f_0 \in \mathcal{T}$
• $f_H = \arg\min_{f \in H} I[f]$
• $\hat{f}_{H,l} = \arg\min_{f \in H} I_{emp}[f]$

More notation review

• $I[f_0]$ = how well we could possibly do
• $I[f_H]$ = how well we can do in space $H$
• $I[\hat{f}_{H,l}]$ = how well we do in space $H$ with $l$ data
• $|I[f] - I_{emp}[f]| \leq \Omega(H, l, \delta) \quad \forall f \in H$ (from VC theory)
• $I[\hat{f}_{H,l}]$ is called generalization error
• $I[\hat{f}_{H,l}] - I[f_0]$ is also called generalization error ...
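To make the notation concrete, here is a minimal numerical sketch of empirical risk minimization with the square loss $V(f(x), y) = (f(x) - y)^2$. The hypothesis space (polynomials of a fixed degree), the target function, and all parameter values are assumptions chosen only for illustration; they are not part of the slides.

```python
# Hypothetical illustration of the notation above: empirical risk minimization
# in a small hypothesis space (polynomials of degree n) with the square loss.
# The target function, degree n, and sample size l are assumptions for the example.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # stands in for the "true" regression function f_0 (an arbitrary smooth choice)
    return np.sin(2 * np.pi * x)

def fit_erm(x, y, n):
    # f_hat_{H,l} = argmin over H_n of I_emp[f], where H_n = polynomials of degree n;
    # with the square loss this is ordinary least squares on monomial features.
    coeffs = np.polyfit(x, y, deg=n)
    return np.poly1d(coeffs)

l, n, noise = 30, 5, 0.1
x = rng.uniform(-1, 1, size=l)                  # inputs drawn from p(x)
y = target(x) + noise * rng.standard_normal(l)  # noisy outputs

f_hat = fit_erm(x, y, n)

# Empirical risk I_emp[f_hat]: average loss on the l training points.
I_emp = np.mean((f_hat(x) - y) ** 2)

# Expected risk I[f_hat]: approximated by Monte Carlo on a large fresh sample.
x_test = rng.uniform(-1, 1, size=100_000)
y_test = target(x_test) + noise * rng.standard_normal(x_test.size)
I_exp = np.mean((f_hat(x_test) - y_test) ** 2)

print(f"I_emp[f_hat] = {I_emp:.4f},  I[f_hat] ~ {I_exp:.4f}")
```

The gap between the two printed numbers is exactly the quantity that the VC-type term $\Omega(H, l, \delta)$ below is meant to control.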
A General Decomposition

$$I[\hat{f}_{H,l}] - I[f_0] = \left( I[\hat{f}_{H,l}] - I[f_H] \right) + \left( I[f_H] - I[f_0] \right)$$

generalization error = estimation error + approximation error

When the cost function $V$ is quadratic:

$$I[f] = \|f_0 - f\|^2 + I[f_0]$$

and therefore

$$\|f_0 - \hat{f}_{H,l}\|^2 = \left( I[\hat{f}_{H,l}] - I[f_H] \right) + \|f_0 - f_H\|^2$$

generalization error = estimation error + approximation error

A useful inequality

If, with probability $1 - \delta$,

$$|I[f] - I_{emp}[f]| \leq \Omega(H, l, \delta) \quad \forall f \in H$$

then

$$I[\hat{f}_{H,l}] - I[f_H] \leq 2\,\Omega(H, l, \delta)$$

You can prove it using the following observations:

• $I[f_H] \leq I[\hat{f}_{H,l}]$ (from the definition of $f_H$)
• $I_{emp}[\hat{f}_{H,l}] \leq I_{emp}[f_H]$ (from the definition of $\hat{f}_{H,l}$)

Bounding the Generalization Error

$$\|f_0 - \hat{f}_{H,l}\|^2 \leq 2\,\Omega(H, l, \delta) + \|f_0 - f_H\|^2$$

Notice that:

• $\Omega$ has nothing to do with the target space $\mathcal{T}$; it is studied mostly in statistics;
• $\|f_0 - f_H\|$ has everything to do with the target space $\mathcal{T}$; it is studied mostly in approximation theory.

Approximation Error

We consider a nested family of hypothesis spaces $H_n$:

$$H_0 \subset H_1 \subset \ldots \subset H_n \subset \ldots$$

and define the approximation error as:

$$\epsilon_{\mathcal{T}}(f, H_n) \equiv \inf_{h \in H_n} \|f - h\|$$

$\epsilon_{\mathcal{T}}(f, H_n)$ is the smallest error that we can make if we approximate $f \in \mathcal{T}$ with an element of $H_n$ (here $\|\cdot\|$ is the norm in $\mathcal{T}$).

Approximation error

For reasonable choices of hypothesis spaces $H_n$:

$$\lim_{n \to \infty} \epsilon_{\mathcal{T}}(f, H_n) = 0$$

This means that we can approximate functions of $\mathcal{T}$ arbitrarily well with elements of $\{H_n\}_{n=1}^{\infty}$.

Example: $\mathcal{T}$ = continuous functions on compact sets, and $H_n$ = polynomials of degree at most $n$.

The rate of convergence

The interesting question is: how fast does $\epsilon_{\mathcal{T}}(f, H_n)$ go to zero?

• The rate of convergence is a measure of the relative complexity of $\mathcal{T}$ with respect to the approximation scheme $H$.
• The rate of convergence determines how many samples we need in order to obtain a given generalization error.

An Example

• In the next slides we compute explicitly the rate of convergence of approximation of a smooth function by trigonometric polynomials.
• We are interested in studying how fast the approximation error goes to zero when the number of parameters of our approximation scheme goes to infinity.
• The reason for this exercise is that the results are representative: more complex and interesting cases all share the basic features of this example.

Approximation by Trigonometric Polynomials

Consider the set of functions

$$C_2[-\pi, \pi] \equiv C[-\pi, \pi] \cap L_2[-\pi, \pi]$$

Functions in this set can be represented as a Fourier series:

$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}, \qquad c_k \propto \int_{-\pi}^{\pi} f(x)\, e^{-ikx}\, dx$$

The $L_2$ norm satisfies the equation:

$$\|f\|_{L_2}^2 = \sum_{k=0}^{\infty} c_k^2$$

We consider as target space the following Sobolev space of smooth functions:

$$W_{s,2} \equiv \left\{ f \in C_2[-\pi, \pi] \;\Big|\; \left\| \frac{d^s f}{dx^s} \right\|_{L_2}^2 < +\infty \right\}$$

The (semi-)norm in this Sobolev space is defined as:

$$\|f\|_{W_{s,2}}^2 \equiv \left\| \frac{d^s f}{dx^s} \right\|_{L_2}^2 = \sum_{k=1}^{\infty} k^{2s} c_k^2$$

If $f$ belongs to $W_{s,2}$, then its Fourier coefficients $c_k$ must go to zero at a rate which increases with $s$.

We choose as hypothesis space $H_n$ the set of trigonometric polynomials of degree $n$:

$$p(x) = \sum_{k=1}^{n} a_k e^{ikx}$$

Given a function of the form

$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}$$

the optimal hypothesis $f_n(x)$ is given by the first $n$ terms of its Fourier series:

$$f_n(x) = \sum_{k=1}^{n} c_k e^{ikx}$$

Key Question

For a given $f \in W_{s,2}$ we want to study the approximation error:

$$\epsilon_n[f] \equiv \|f - f_n\|_{L_2}^2$$

• Notice that $n$, the degree of the polynomial, is also the number of parameters that we use in the approximation.
• Obviously $\epsilon_n[f]$ goes to zero as $n \to +\infty$, but the key question is: how fast?
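Before working through the estimate below, here is a minimal numerical sketch of this question: pick Fourier coefficients $c_k$ that make the Sobolev (semi-)norm finite, truncate the series after $n$ terms, and compare $\epsilon_n[f] = \sum_{k > n} c_k^2$ with the bound $\|f\|_{W_{s,2}}^2 / n^{2s}$ derived next. The coefficient choice $c_k = k^{-(s+1)}$ and the finite truncation of the infinite sums are assumptions made only so that the example runs.

```python
# Numerical check (on an example function, not a proof) of the decay of the
# approximation error for truncated Fourier series versus the bound derived below.
import numpy as np

s = 2                      # number of square-integrable derivatives
K = 10**6                  # truncation used to approximate the infinite sums
k = np.arange(1, K + 1, dtype=float)
c = k ** (-(s + 1))        # Fourier coefficients of the example function

sobolev_norm_sq = np.sum(k ** (2 * s) * c ** 2)   # ||f||^2_{W_{s,2}}

for n in [4, 16, 64, 256]:
    eps_n = np.sum(c[n:] ** 2)                    # ||f - f_n||^2_{L_2} = sum_{k>n} c_k^2
    bound = sobolev_norm_sq / n ** (2 * s)
    print(f"n={n:4d}  eps_n={eps_n:.3e}  bound={bound:.3e}")
```

The printed errors stay below the bound and shrink quickly as $n$ grows, which is the behavior quantified next.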
An easy estimate of $\epsilon_n$

$$\epsilon_n[f] \equiv \|f - f_n\|_{L_2}^2 = \sum_{k=n+1}^{\infty} c_k^2 = \sum_{k=n+1}^{\infty} c_k^2\, k^{2s}\, \frac{1}{k^{2s}} < \frac{1}{n^{2s}} \sum_{k=n+1}^{\infty} c_k^2 k^{2s} < \frac{1}{n^{2s}} \sum_{k=1}^{\infty} c_k^2 k^{2s} = \frac{\|f\|_{W_{s,2}}^2}{n^{2s}}$$

and therefore

$$\epsilon_n[f] < \frac{\|f\|_{W_{s,2}}^2}{n^{2s}}$$

More smoothness ⇒ faster rate of convergence.

But what happens in more than one dimension?

Rates of convergence in d dimensions

It is enough to study $d = 2$. We proceed in full analogy with the 1-d case:

$$f(x, y) = \sum_{k,m=1}^{\infty} c_{km}\, e^{i(kx+my)}$$

$$\|f\|_{W_{s,2}}^2 \equiv \left\| \frac{\partial^s f}{\partial x^s} \right\|_{L_2}^2 + \left\| \frac{\partial^s f}{\partial y^s} \right\|_{L_2}^2 = \sum_{k,m=1}^{\infty} (k^{2s} + m^{2s})\, c_{km}^2$$

Here $W_{s,2}$ is defined as the set of functions such that $\|f\|_{W_{s,2}}^2 < +\infty$.

We choose as hypothesis space $H_n$ the set of trigonometric polynomials of degree $l$:

$$p(x, y) = \sum_{k,m=1}^{l} a_{km}\, e^{i(kx+my)}$$

A trigonometric polynomial of degree $l$ in $d$ variables has a number of coefficients $n = l^d$. We are interested in the behavior of the approximation error as a function of $n$. The approximating function is:

$$f_n(x, y) = \sum_{k,m=1}^{l} c_{km}\, e^{i(kx+my)}$$

An easy estimate of $\epsilon_n$

$$\epsilon_n[f] \equiv \|f - f_n\|_{L_2}^2 = \sum_{k,m=l+1}^{\infty} c_{km}^2 = \sum_{k,m=l+1}^{\infty} c_{km}^2\, \frac{k^{2s} + m^{2s}}{k^{2s} + m^{2s}} < \frac{1}{2 l^{2s}} \sum_{k,m=l+1}^{\infty} c_{km}^2 (k^{2s} + m^{2s}) < \frac{1}{2 l^{2s}} \sum_{k,m=1}^{\infty} c_{km}^2 (k^{2s} + m^{2s}) = \frac{\|f\|_{W_{s,2}}^2}{2 l^{2s}}$$

Since $n = l^d$, then $l = n^{1/d}$ (with $d = 2$), and we obtain:

$$\epsilon_n[f] < \frac{\|f\|_{W_{s,2}}^2}{2\, n^{2s/d}}$$

(Partial) Summary

The previous calculation generalizes easily to the d-dimensional case. Therefore we conclude that, if we approximate functions of $d$ variables with $s$ square-integrable derivatives by a trigonometric polynomial with $n$ coefficients, the approximation error satisfies:

$$\epsilon_n < \frac{C}{n^{2s/d}}$$

More smoothness $s$ ⇒ faster rate of convergence.
Higher dimension $d$ ⇒ slower rate of convergence.

Another Example: Generalized Translation Networks

Consider networks of the form:

$$f(x) = \sum_{k=1}^{n} a_k\, \phi(A_k x + b_k)$$

where $x \in \mathbb{R}^d$, $b_k \in \mathbb{R}^m$, $1 \leq m \leq d$, the $A_k$ are $m \times d$ matrices, $a_k \in \mathbb{R}$, and $\phi$ is some given function.
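A minimal sketch of how such a network can be evaluated, directly implementing $f(x) = \sum_{k=1}^{n} a_k\, \phi(A_k x + b_k)$. The particular $\phi$ (a Gaussian of the squared norm of its argument), the random parameters, and the dimensions are assumptions made only to make the example runnable; the slides leave $\phi$ generic.

```python
# Sketch of a generalized translation network f(x) = sum_k a_k * phi(A_k x + b_k).
# phi and the parameter values are illustrative assumptions, not from the slides.
import numpy as np

def gtn(x, A, b, a, phi):
    """Evaluate f(x) = sum_k a_k * phi(A_k x + b_k).

    x   : (d,) input vector
    A   : (n, m, d) stack of the m x d matrices A_k
    b   : (n, m) translation vectors b_k
    a   : (n,) output weights a_k
    phi : map from R^m to R, applied to each A_k x + b_k
    """
    hidden = np.einsum('nmd,d->nm', A, x) + b      # row k holds A_k x + b_k
    return np.dot(a, np.array([phi(h) for h in hidden]))

d, m, n = 5, 3, 10
rng = np.random.default_rng(0)
A = rng.standard_normal((n, m, d))
b = rng.standard_normal((n, m))
a = rng.standard_normal(n)
phi = lambda u: np.exp(-np.sum(u ** 2))            # one possible choice of phi

x = rng.standard_normal(d)
print("f(x) =", gtn(x, A, b, a, phi))
```

Standard special cases fit this template: $m = 1$ with a sigmoidal $\phi$ gives a one-hidden-layer perceptron, while $m = d$ with fixed $A_k$ and a radial $\phi$ gives a radial basis function network.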