
Approximation theory in neural networks

Yanhui Su†

March 30, 2018

Outline

1 Approximation of functions by a sigmoidal function

2 Approximations of continuous functionals by a sigmoidal function

3 Universal approximation by neural networks with arbitrary activation functions

4 Universal approximation bounds for superpositions of a sigmoidal function

5 Optimal approximation with sparsely connected deep neural networks

1 Approximation of functions by a sigmoidal function

Basic Notations

1 $d$: dimension of the input layer;
2 $L$: number of layers;
3 $N_l$: number of neurons in the $l$th layer, $l = 1, \cdots, L$;
4 $\rho : \mathbb{R} \to \mathbb{R}$: activation function;
5 $W_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$, $x \mapsto A_l x + b_l$;
6 $(A_l)_{ij}$, $(b_l)_i$: the network weights.

Definition 1
A map $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
$$\Phi(x) = W_L\,\rho(W_{L-1}\,\rho(\cdots \rho(W_1(x)))), \qquad x \in \mathbb{R}^d,$$
is called a neural network.
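As a minimal illustration of Definition 1, the sketch below evaluates $\Phi$ for randomly drawn weights. The layer sizes and the choice of the logistic sigmoid for $\rho$ are illustrative assumptions, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """Logistic sigmoid, used here as the activation rho (an arbitrary choice)."""
    return 1.0 / (1.0 + np.exp(-x))

def make_network(layer_sizes):
    """Draw random affine maps W_l(x) = A_l x + b_l for the given layer sizes."""
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def phi(x, weights, rho=sigmoid):
    """Phi(x) = W_L rho(W_{L-1} rho(... rho(W_1 x))): no activation after the last layer."""
    for A, b in weights[:-1]:
        x = rho(A @ x + b)
    A, b = weights[-1]
    return A @ x + b

# Example: d = 3, two hidden layers with 5 neurons each, scalar output (N_L = 1).
weights = make_network([3, 5, 5, 1])
print(phi(np.array([0.2, -1.0, 0.7]), weights))
```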

A classical result of Cybenko

We say that $\sigma$ is sigmoidal if
$$\sigma(x) \to \begin{cases} 1, & x \to +\infty, \\ 0, & x \to -\infty. \end{cases}$$
A classical result on approximation by neural networks is:

Theorem 2 (Cybenko [6])
Let $\sigma$ be any continuous sigmoidal function. Then finite sums of the form
$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j \cdot x + \theta_j) \qquad (1)$$
are dense in $C(I_d)$.

In [5], T.P. Chen, H. Chen and R.W. Liu gave a constructive proof which only assumes that $\sigma$ is a bounded sigmoidal function.
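A small numerical sketch of Theorem 2 in one dimension: fix random inner parameters $y_j, \theta_j$ and solve a least-squares problem for the outer coefficients $\alpha_j$. The target function, the grid, the parameter ranges, and $N$ are arbitrary choices for illustration, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # a continuous sigmoidal function

f = lambda x: np.sin(2 * np.pi * x)          # target to approximate on [0, 1]
x = np.linspace(0.0, 1.0, 400)

N = 40                                        # number of sigmoidal terms
y = rng.uniform(-30, 30, N)                   # inner weights y_j
theta = rng.uniform(-30, 30, N)               # offsets theta_j

# Design matrix with columns sigma(y_j * x + theta_j); fit alpha_j by least squares.
Phi = sigma(np.outer(x, y) + theta)
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)

G = Phi @ alpha
print("max |f - G| on the grid:", np.max(np.abs(f(x) - G)))
```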

2 Approximations of continuous functionals by a sigmoidal function

Approximations of continuous functionals on $L^p$ space

Theorem 3 (Chen and Chen [3])
Suppose that $U$ is a compact set in $L^p[a, b]$ $(1 < p < \infty)$, $f$ is a continuous functional defined on $U$, and $\sigma(x)$ is a bounded sigmoidal function. Then for any $\varepsilon > 0$, there exist $h > 0$, a positive integer $m$, $m + 1$ points $a = x_0 < x_1 < \cdots < x_m = b$, $x_j = a + j(b - a)/m$, $j = 0, 1, \cdots, m$, a positive integer $N$, and constants $c_i, \theta_i, \xi_{i,j}$, $i = 1, \cdots, N$, $j = 0, 1, \cdots, m$, such that
$$\left| f(u) - \sum_{i=1}^{N} c_i\, \sigma\!\left( \sum_{j=0}^{m} \xi_{i,j}\, \frac{1}{2h} \int_{x_j - h}^{x_j + h} u(t)\, dt + \theta_i \right) \right| < \varepsilon$$
holds for all $u \in U$. Here it is assumed that $u(x) = 0$ if $x \notin [a, b]$.

Approximations of continuous functionals on $C[a, b]$

Theorem 4 (Chen and Chen [3])
Suppose that $U$ is a compact set in $C[a, b]$, $f$ is a continuous functional defined on $U$, and $\sigma(x)$ is a bounded sigmoidal function. Then for any $\varepsilon > 0$, there exist $m + 1$ points $a = x_0 < \cdots < x_m = b$, a positive integer $N$, and constants $c_i, \theta_i, \xi_{i,j}$, $i = 1, \cdots, N$, $j = 0, 1, \cdots, m$, such that for any $u \in U$,
$$\left| f(u) - \sum_{i=1}^{N} c_i\, \sigma\!\left( \sum_{j=0}^{m} \xi_{i,j}\, u(x_j) + \theta_i \right) \right| < \varepsilon.$$
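To make Theorem 4 concrete, the sketch below approximates the functional $f(u) = \int_0^1 u(t)^2\,dt$ from the samples $u(x_j)$ by an expression of exactly this form, with randomly fixed inner parameters $\xi_{i,j}, \theta_i$ and outer coefficients $c_i$ obtained by regularized least squares. The class of inputs, the values of $m$ and $N$, and the fitting procedure are illustrative assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # bounded sigmoidal function

m, N = 20, 200                               # number of sample points and sigmoidal terms
xj = np.linspace(0.0, 1.0, m + 1)            # the points x_0, ..., x_m

def random_inputs(n):
    """Random inputs u(t) = a0 + a1 sin(pi t) + a2 cos(pi t), sampled at the x_j."""
    a = rng.uniform(-1, 1, (n, 3))
    return a[:, [0]] + a[:, [1]] * np.sin(np.pi * xj) + a[:, [2]] * np.cos(np.pi * xj)

def functional(U):
    """f(u) = int_0^1 u(t)^2 dt, computed by the trapezoidal rule on the grid x_j."""
    V = U**2
    return 0.5 * (xj[1] - xj[0]) * (V[:, :-1] + V[:, 1:]).sum(axis=1)

xi = rng.standard_normal((N, m + 1))         # inner weights xi_{i,j}, fixed at random
theta = rng.standard_normal(N)               # offsets theta_i

def features(U):
    """sigma(sum_j xi_{i,j} u(x_j) + theta_i) for each input u and each i."""
    return sigma(U @ xi.T + theta)

U_train = random_inputs(2000)
H = features(U_train)
# Fit the outer coefficients c_i by ridge-regularized least squares.
c = np.linalg.solve(H.T @ H + 1e-6 * np.eye(N), H.T @ functional(U_train))

U_test = random_inputs(5)
print(np.c_[functional(U_test), features(U_test) @ c])   # true vs. approximated values
```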

An example in dynamical systems

Suppose that the input $u(x)$ and the output $s(x) = G(u(x))$ satisfy
$$\frac{ds(x)}{dx} = g(s(x), u(x), x), \qquad s(a) = s_0,$$
where $g$ satisfies a Lipschitz condition; then
$$(Gu)(x) = s_0 + \int_a^x g((Gu)(t), u(t), t)\, dt.$$
It can be shown that $G$ is a continuous functional on $C[a, b]$. If the input set $U \subset C[a, b]$ is compact, then the output at a specified time $d$ can be approximated by
$$\sum_{i=1}^{N} c_i\, \sigma\!\left( \sum_{j=1}^{m} \xi_{i,j}\, u(x_j) + \theta_i \right).$$

3 Universal approximation by neural networks with arbitrary activation functions

Approximation by Arbitrary Functions

Definition 5
If a function $g : \mathbb{R} \to \mathbb{R}$ is such that all linear combinations of the form
$$\sum_{i=1}^{N} c_i\, g(\lambda_i x + \theta_i), \qquad \lambda_i, \theta_i, c_i \in \mathbb{R},\ i = 1, \cdots, N,$$
are dense in every $C[a, b]$, then $g$ is called a Tauber-Wiener (TW) function.

Theorem 6 (Chen and Chen [4])
Suppose that $g(x) \in C(\mathbb{R}) \cap S'(\mathbb{R})$. Then $g \in (TW)$ if and only if $g$ is not a polynomial.

Theorem 7 (Chen and Chen [4])
Suppose that $K$ is a compact set in $\mathbb{R}^d$, $U$ is a compact set in $C(K)$, and $g \in (TW)$. Then for any $\varepsilon > 0$, there are a positive integer $N$, $\theta_i \in \mathbb{R}$, $\omega_i \in \mathbb{R}^d$, $i = 1, \cdots, N$, which are all independent of $f \in U$, and constants $c_i(f)$ depending on $f$, $i = 1, \cdots, N$, such that
$$\left| f(x) - \sum_{i=1}^{N} c_i(f)\, g(\omega_i \cdot x + \theta_i) \right| < \varepsilon$$
holds for all $x \in K$, $f \in U$. Moreover, every $c_i(f)$ is a continuous functional defined on $U$.

Approximation to functionals by Arbitrary Functions

The following theorem can be viewed as a generalization of Theorem 4 from the sigmoidal case to general Tauber-Wiener activation functions.

Theorem 8 (Chen and Chen [4])
Suppose that $g \in (TW)$, $X$ is a Banach space, $K \subset X$ is a compact set, $V$ is a compact set in $C(K)$, and $f$ is a continuous functional defined on $V$. Then for any $\varepsilon > 0$, there are a positive integer $N$, $m$ points $x_1, \cdots, x_m \in K$, and constants $c_i, \theta_i, \xi_{ij} \in \mathbb{R}$, $i = 1, \cdots, N$, $j = 1, \cdots, m$, such that
$$\left| f(u) - \sum_{i=1}^{N} c_i\, g\!\left( \sum_{j=1}^{m} \xi_{ij}\, u(x_j) + \theta_i \right) \right| < \varepsilon$$
holds for all $u \in V$.

Approximation to operators by Arbitrary Functions

Theorem 9 (Chen and Chen [4])
Suppose that $g \in (TW)$, $X$ is a Banach space, and $K_1 \subset X$, $K_2 \subset \mathbb{R}^d$ are two compact sets. Let $V$ be a compact set in $C(K_1)$ and $G$ a nonlinear continuous operator which maps $V$ to $C(K_2)$. Then for any $\varepsilon > 0$, there are positive integers $M, N, m$, constants $c_i^k, \zeta_k, \theta_i^k, \xi_{ij}^k \in \mathbb{R}$, and points $\omega_k \in \mathbb{R}^d$, $x_j \in K_1$, $i = 1, \cdots, M$, $k = 1, \cdots, N$, $j = 1, \cdots, m$, such that
$$\left| G(u)(y) - \sum_{k=1}^{N} \sum_{i=1}^{M} c_i^k\, g\!\left( \sum_{j=1}^{m} \xi_{ij}^k\, u(x_j) + \theta_i^k \right) g(\omega_k \cdot y + \zeta_k) \right| < \varepsilon$$
holds for all $u \in V$, $y \in K_2$.
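The approximant in Theorem 9 factors into terms that depend only on the sampled input $u(x_j)$ and terms that depend only on the evaluation point $y$, a structure reused by modern operator-learning architectures. The sketch below fits such a double sum for the antiderivative operator $G(u)(y) = \int_0^y u(t)\,dt$, with randomly fixed inner parameters and the coefficients obtained from regularized least squares; the operator, the input class, and all sizes are illustrative assumptions, not taken from [4].

```python
import numpy as np

rng = np.random.default_rng(3)
g = lambda t: np.tanh(t)              # a Tauber-Wiener activation (not a polynomial)

m, P, Q = 30, 40, 40                  # input samples u(x_j); inner / outer feature counts
xj = np.linspace(0.0, 1.0, m)
y = np.linspace(0.0, 1.0, 50)         # evaluation points in K_2 = [0, 1]

def random_u(n):
    """Random inputs u(t) = a1 sin(pi t) + a2 cos(2 pi t) + a3, sampled at the x_j."""
    a = rng.uniform(-1, 1, (n, 3))
    return a[:, [0]] * np.sin(np.pi * xj) + a[:, [1]] * np.cos(2 * np.pi * xj) + a[:, [2]]

def G(U):
    """Antiderivative operator: G(u)(y) = int_0^y u(t) dt (cumulative trapezoid rule)."""
    dz = xj[1] - xj[0]
    cum = np.concatenate([np.zeros((U.shape[0], 1)),
                          np.cumsum(0.5 * dz * (U[:, :-1] + U[:, 1:]), axis=1)], axis=1)
    return np.stack([np.interp(y, xj, row) for row in cum])

# Fixed random inner parameters: features of the sampled input u and features of y.
xi = rng.standard_normal((P, m)) / np.sqrt(m)     # scaled to avoid saturating tanh
theta = rng.standard_normal(P)
omega, zeta = rng.standard_normal(Q), rng.standard_normal(Q)
branch = lambda U: g(U @ xi.T + theta)            # g(sum_j xi_{ij} u(x_j) + theta_i)
trunk = g(np.outer(y, omega) + zeta)              # g(omega_k y + zeta_k)

U_train = random_u(500)
A, T = branch(U_train), G(U_train)
# Coefficients from the (ridge-regularized) separable normal equations
# A^T A C (trunk^T trunk) = A^T T trunk for min ||A C trunk^T - T||_F.
C = np.linalg.solve(A.T @ A + 1e-6 * np.eye(P),
                    A.T @ T @ trunk) @ np.linalg.inv(trunk.T @ trunk + 1e-6 * np.eye(Q))

U_test = random_u(3)
err = np.abs(branch(U_test) @ C @ trunk.T - G(U_test)).max()
print("max pointwise error on test inputs:", err)
```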

4 Universal approximation bounds for superpositions of a sigmoidal function

Basic notations

1 $\tilde{F}(d\omega) = e^{i\theta(\omega)} F(d\omega)$: the Fourier distribution (i.e., complex-valued measure) of a function $f(x)$ on $\mathbb{R}^d$, with
$$f(x) = \int e^{i\omega \cdot x}\, \tilde{F}(d\omega); \qquad (2)$$
2 $B$: a bounded set in $\mathbb{R}^d$ that contains $\{0\}$;
3 $\Gamma_B$: the set of functions $f$ on $B$ for which the representation (2) holds for $x \in B$ for some complex-valued measure $\tilde{F}(d\omega)$ such that $\int |\omega|\, F(d\omega)$ is finite;
4 $\Gamma_{C,B}$: the set of all functions $f$ in $\Gamma_B$ such that for some $\tilde{F}$ representing $f$ on $B$,
$$\int |\omega|_B\, F(d\omega) \le C,$$
where $|\omega|_B = \sup_{x \in B} |\omega \cdot x|$.

Universal approximation bounds

Theorem 10 (Barron [1])
For every function $f$ in $\Gamma_{C,B}$, every sigmoidal function $\sigma$, every probability measure $\mu$, and every $n \ge 1$, there exists a linear combination of sigmoidal functions $f_n(x)$ such that
$$\int_B \left( \bar{f}(x) - f_n(x) \right)^2 \mu(dx) \le \frac{(2C)^2}{n},$$
where $\bar{f}(x) = f(x) - f(0)$.

In Theorem 10, the approximation result is proved without restrictions on $|y_j|$, which leads to the difficult problem of searching over an unbounded domain.
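For a function given by finitely many Fourier atoms, the spectral norm in the definition of $\Gamma_{C,B}$ is a finite sum, so the constant $C$ and the resulting error bound of Theorem 10 can be evaluated directly. A minimal sketch, taking $B = [-1,1]^d$ so that $|\omega|_B = \|\omega\|_1$; the frequencies and amplitudes are arbitrary illustrative values.

```python
import numpy as np

# B = [-1, 1]^d, so |omega|_B = sup_{x in B} |omega . x| = ||omega||_1.
B_norm = lambda omega: np.abs(omega).sum(axis=-1)

# A function represented by finitely many Fourier atoms (illustrative values):
omegas = np.array([[1.0, 0.0], [2.0, -1.0], [0.5, 3.0]])   # frequencies omega_k
F = np.array([0.8, 0.3 + 0.1j, 0.05j])                     # complex amplitudes F_k

C = np.sum(B_norm(omegas) * np.abs(F))    # spectral norm: f lies in Gamma_{C,B}
print("C =", C)

# Theorem 10: squared L2(mu) error of some n-term sigmoidal combination is at most (2C)^2 / n.
for n in (10, 100, 1000):
    print(n, "terms:  squared error <=", (2 * C) ** 2 / n)
```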

Given $\tau > 0$, $C > 0$, and a bounded set $B$, let
$$G_{\sigma,\tau} = \{ \gamma\, \sigma(\tau(\alpha \cdot x + b)) : |\gamma| \le 2C,\ |\alpha|_B \le 1,\ |b| \le 1 \}.$$

Theorem 11 (Barron [1])
For every $f \in \Gamma_{C,B}$, $\tau > 0$, $n \ge 1$, every probability measure $\mu$, and every sigmoidal function $\sigma$ with $0 \le \sigma \le 1$, there is a function $f_n$ in the convex hull of $n$ functions in $G_{\sigma,\tau}$ such that
$$\left\| \bar{f} - f_n \right\| \le 2C \left( \frac{1}{n^{1/2}} + \delta_\tau \right),$$
where $\|\cdot\|$ denotes the $L^2(\mu, B)$ norm, $\bar{f} = f(x) - f(0)$, and
$$\delta_\tau = \inf_{0 < \varepsilon \le 1/2} \left\{ 2\varepsilon + \sup_{|z| \ge \varepsilon} \left| \sigma(\tau z) - 1_{\{z > 0\}} \right| \right\}.$$
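$\delta_\tau$ measures how far the scaled sigmoid $\sigma(\tau\,\cdot)$ is from the unit step away from the origin. For the logistic sigmoid the inner supremum equals $\sigma(-\tau\varepsilon)$ by monotonicity, so $\delta_\tau$ reduces to a one-dimensional minimization; the grid search below is a sketch under that assumption and shows $\delta_\tau \to 0$ as $\tau \to \infty$.

```python
import numpy as np

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

def delta(tau, grid=np.linspace(1e-4, 0.5, 5000)):
    """delta_tau = inf_{0 < eps <= 1/2} { 2 eps + sup_{|z| >= eps} |sigma(tau z) - 1_{z>0}| }.
    For the monotone logistic sigmoid the supremum is attained at |z| = eps and
    equals sigma(-tau * eps)."""
    return np.min(2 * grid + sigma(-tau * grid))

for tau in (1, 10, 100, 1000):
    print(f"tau = {tau:5d}   delta_tau ≈ {delta(tau):.4f}")
```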

5 Optimal approximation with sparsely connected deep neural networks

Basic Notations

1 $\Omega$: a domain in $\mathbb{R}^d$;
2 $\mathcal{C}$: a function class in $L^2(\Omega)$;
3 $M$: the network connectivity (i.e., the total number of nonzero edge weights). If $M$ is small relative to the number of possible connections, we say that the network is sparsely connected;
4 $\mathcal{NN}_{L,M,d,\rho}$: the class of networks $\Phi : \mathbb{R}^d \to \mathbb{R}$ with $L$ layers, connectivity no more than $M$, and activation function $\rho$. Moreover, we let
$$\mathcal{NN}_{\infty,M,d,\rho} := \bigcup_{L \in \mathbb{N}} \mathcal{NN}_{L,M,d,\rho}, \qquad \mathcal{NN}_{L,\infty,d,\rho} := \bigcup_{M \in \mathbb{N}} \mathcal{NN}_{L,M,d,\rho},$$
$$\mathcal{NN}_{\infty,\infty,d,\rho} := \bigcup_{L \in \mathbb{N}} \mathcal{NN}_{L,\infty,d,\rho}.$$
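For a concrete network, the connectivity $M$ is simply the count of nonzero entries of the matrices $A_l$. The sketch below computes it for a randomly generated toy network whose small weights are zeroed out; the layer sizes and the sparsity threshold are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)

# A toy network with layer sizes [3, 8, 8, 1]; small entries of A_l are zeroed out
# to mimic sparse connectivity (the threshold 0.7 is an arbitrary choice).
sizes = [3, 8, 8, 1]
weights = [(rng.standard_normal((o, i)), rng.standard_normal(o))
           for i, o in zip(sizes[:-1], sizes[1:])]
weights = [(np.where(np.abs(A) > 0.7, A, 0.0), b) for A, b in weights]

# Connectivity M = total number of nonzero edge weights, i.e. nonzero entries of the A_l.
M = sum(np.count_nonzero(A) for A, _ in weights)
full = sum(A.size for A, _ in weights)
print(f"connectivity M = {M} of {full} possible edge weights")
```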

Best M-term Approximation Error

Definition 12 (DeVore and Lorentz [7])
Given $\mathcal{C} \subset L^2(\Omega)$ and a representation system $\mathcal{D} = (\varphi_i)_{i \in \mathbb{N}} \subset L^2(\Omega)$, we define, for $f \in \mathcal{C}$ and $M \in \mathbb{N}$,
$$\Gamma_M^{\mathcal{D}}(f) := \inf_{\substack{I_M \subset \mathbb{N},\ \# I_M = M \\ (c_i)_{i \in I_M}}} \left\| f - \sum_{i \in I_M} c_i \varphi_i \right\|_{L^2(\Omega)}.$$
We call $\Gamma_M^{\mathcal{D}}(f)$ the best $M$-term approximation error of $f$ with respect to $\mathcal{D}$. The supremal $\gamma > 0$ such that there exists $C > 0$ with
$$\sup_{f \in \mathcal{C}} \Gamma_M^{\mathcal{D}}(f) \le C M^{-\gamma}, \qquad \forall M \in \mathbb{N},$$
will be referred to as $\gamma^*(\mathcal{C}, \mathcal{D})$.

1 It is conceivable that the optimal approximation rate for $\mathcal{C}$ in any representation system reflects specific properties of $\mathcal{C}$. However, a countable and dense representation system $\mathcal{D} \subset L^2(\mathbb{R}^d)$ results in $\gamma^*(\mathcal{C}, \mathcal{D}) = \infty$.
2 In numerical computation, we need efficient methods to approximate any $f \in \mathcal{C}$ by a linear combination of finitely many elements of $\mathcal{D}$. However, searching for the index set over all of $\mathbb{N}$ is computationally infeasible.
3 In [8], Donoho suggests restricting the search for the optimal coefficient set to the first $\pi(M)$ coefficients, where $\pi$ is some polynomial. This approach is known as polynomial-depth search.
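A minimal numerical illustration of Definition 12: expand a target in an orthonormal (cosine-type) system and keep the $M$ coefficients of largest magnitude. The basis, the target, and the values of $M$ are arbitrary choices; an effective (polynomial-depth) search as in the next definition would in addition restrict the candidate indices to $\{1, \cdots, \pi(M)\}$ and bound the coefficients.

```python
import numpy as np

n = 1024
x = np.linspace(0.0, 1.0, n, endpoint=False)
f = np.where(x < 0.5, 1.0, -1.0) + 0.25 * np.sin(6 * np.pi * x)   # a piecewise-smooth target

# Representation system: cosines on [0, 1], orthonormalized in the discrete L2 inner product.
K = 512
raw = np.cos(np.pi * np.outer(x, np.arange(K)))
Q, _ = np.linalg.qr(raw)                 # columns form an orthonormal system

c = Q.T @ f                              # expansion coefficients of f
for M in (4, 16, 64, 256):
    idx = np.argsort(np.abs(c))[-M:]     # indices of the M largest coefficients
    fM = Q[:, idx] @ c[idx]              # best M-term approximation in this system
    err = np.linalg.norm(f - fM) / np.sqrt(n)
    print(f"M = {M:4d}   discrete L2 error ≈ {err:.4f}")
```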

Effective Best M-term Approximation Error

To overcome these problems, Donoho [8] and Grohs [9] proposed the following.

Definition 13
Given $\mathcal{C} \subset L^2(\Omega)$ and a representation system $\mathcal{D} = (\varphi_i)_{i \in \mathbb{N}} \subset L^2(\Omega)$. For $\gamma > 0$, we say that $\mathcal{C}$ has effective best $M$-term approximation rate $M^{-\gamma}$ in $\mathcal{D}$ if there exist a univariate polynomial $\pi$ and constants $C, D > 0$ such that for all $M \in \mathbb{N}$ and $f \in \mathcal{C}$,
$$\left\| f - \sum_{i \in I_M} c_i \varphi_i \right\|_{L^2(\Omega)} \le C M^{-\gamma}$$
for some index set $I_M \subset \{1, \cdots, \pi(M)\}$ with $\# I_M = M$ and coefficients $(c_i)_{i \in I_M}$ satisfying $\max_{i \in I_M} |c_i| \le D$.
The supremal $\gamma > 0$ such that $\mathcal{C}$ has effective best $M$-term approximation rate $M^{-\gamma}$ in $\mathcal{D}$ will be referred to as $\gamma^{*,\mathrm{eff}}(\mathcal{C}, \mathcal{D})$.

Best M-edge Approximation Error

Definition 14 (Bölcskei et al. [2])
Given $\mathcal{C} \subset L^2(\Omega)$, we define, for $f \in \mathcal{C}$ and $M \in \mathbb{N}$,
$$\Gamma_M^{\mathcal{NN}}(f) := \inf_{\Phi \in \mathcal{NN}_{\infty,M,d,\rho}} \| f - \Phi \|_{L^2(\Omega)}.$$
We call $\Gamma_M^{\mathcal{NN}}(f)$ the best $M$-edge approximation error of $f$. The supremal $\gamma > 0$ such that there exists $C > 0$ with
$$\sup_{f \in \mathcal{C}} \Gamma_M^{\mathcal{NN}}(f) \le C M^{-\gamma}, \qquad \forall M \in \mathbb{N},$$
will be referred to as $\gamma^*_{\mathcal{NN}}(\mathcal{C}, \rho)$.

The following theorem from [10] shows that Definition 14 suffers from difficulties similar to those of Definition 12.

Theorem 15 (Maiorov and Pinkus [10])
There exists a function $\rho : \mathbb{R} \to \mathbb{R}$ that is $C^\infty$, strictly increasing, and satisfies $\lim_{x \to \infty} \rho(x) = 1$ and $\lim_{x \to -\infty} \rho(x) = 0$, such that for any $d \in \mathbb{N}$, any $f \in C([0, 1]^d)$, and any $\varepsilon > 0$ there exists a neural network $\Phi$ with activation function $\rho$ and three layers of dimensions $N_1 = 3d$, $N_2 = 6d + 3$, and $N_3 = 1$ satisfying
$$\sup_{x \in [0, 1]^d} |f(x) - \Phi(x)| \le \varepsilon.$$

Effective Best M-edge Approximation Error

Definition 16 (Bölcskei et al. [2])
For $\gamma > 0$, $\mathcal{C} \subset L^2(\Omega)$ is said to have effective best $M$-edge approximation rate $M^{-\gamma}$ by neural networks with activation function $\rho$ if there exist $L \in \mathbb{N}$, a univariate polynomial $\pi$, and a constant $C > 0$ such that for all $M \in \mathbb{N}$ and $f \in \mathcal{C}$,
$$\| f - \Phi \|_{L^2(\Omega)} \le C M^{-\gamma}$$
for some $\Phi \in \mathcal{NN}_{L,M,d,\rho}$ with the weights of $\Phi$ all bounded in absolute value by $\pi(M)$.
The supremal $\gamma > 0$ such that $\mathcal{C}$ has effective best $M$-edge approximation rate $M^{-\gamma}$ will henceforth be denoted by $\gamma^{*,\mathrm{eff}}_{\mathcal{NN}}(\mathcal{C}, \rho)$.

Min-Max Rate Distortion Theory in [8, 9]

Definition 17
Let $\mathcal{C} \subset L^2(\Omega)$. For each $l \in \mathbb{N}$, we denote by
$$\mathcal{E}^l := \{ E : \mathcal{C} \to \{0, 1\}^l \}$$
the set of binary encoders of $\mathcal{C}$ of length $l$, and we let
$$\mathcal{D}^l := \{ D : \{0, 1\}^l \to L^2(\Omega) \}$$
be the set of binary decoders of length $l$. An encoder-decoder pair $(E, D) \in \mathcal{E}^l \times \mathcal{D}^l$ is said to achieve distortion $\varepsilon > 0$ over the function class $\mathcal{C}$ if
$$\sup_{f \in \mathcal{C}} \| D(E(f)) - f \|_{L^2(\Omega)} \le \varepsilon.$$

Definition 18
Let $\mathcal{C} \subset L^2(\Omega)$. For $\varepsilon > 0$, the minimax code length $L(\varepsilon, \mathcal{C})$ is
$$L(\varepsilon, \mathcal{C}) := \min\left\{ l \in \mathbb{N} : \exists\, (E, D) \in \mathcal{E}^l \times \mathcal{D}^l \text{ such that } \sup_{f \in \mathcal{C}} \| D(E(f)) - f \|_{L^2(\Omega)} \le \varepsilon \right\}.$$
Moreover, the optimal exponent $\gamma^*(\mathcal{C})$ is defined by
$$\gamma^*(\mathcal{C}) := \inf\{ \gamma \in \mathbb{R} : L(\varepsilon, \mathcal{C}) = O(\varepsilon^{-\gamma}) \}.$$

Theorem 19
Let $\mathcal{C} \subset L^2(\Omega)$, and let the optimal effective best $M$-term approximation rate of $\mathcal{C}$ in $\mathcal{D} \subset L^2(\Omega)$ be $M^{-\gamma^{*,\mathrm{eff}}(\mathcal{C}, \mathcal{D})}$. Then
$$\gamma^{*,\mathrm{eff}}(\mathcal{C}, \mathcal{D}) \le \frac{1}{\gamma^*(\mathcal{C})}.$$
If the representation system $\mathcal{D}$ satisfies
$$\gamma^{*,\mathrm{eff}}(\mathcal{C}, \mathcal{D}) = \frac{1}{\gamma^*(\mathcal{C})},$$
then $\mathcal{D}$ is said to be optimal for the function class $\mathcal{C}$.

Fundamental Bound on Effective M-edge Approximation

Theorem 20 (Bölcskei et al. [2])
Let $\mathcal{C} \subset L^2(\Omega)$ and let
$$\mathrm{Learn} : (0, 1) \times \mathcal{C} \to \mathcal{NN}_{\infty,\infty,d,\rho}$$
be a map such that, for each pair $(\varepsilon, f) \in (0, 1) \times \mathcal{C}$, every weight of the neural network $\mathrm{Learn}(\varepsilon, f)$ can be encoded with no more than $c \log_2(\varepsilon^{-1})$ bits while guaranteeing that
$$\sup_{f \in \mathcal{C}} \| f - \mathrm{Learn}(\varepsilon, f) \|_{L^2(\Omega)} \le \varepsilon.$$
Then
$$\sup_{\varepsilon \in (0, \frac{1}{2})} \varepsilon^{\gamma} \cdot \sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\varepsilon, f)) = \infty, \qquad \forall\, \gamma > \frac{1}{\gamma^*(\mathcal{C})}.$$

The main idea of the proof of Theorem 20 is to encode the topology and weights of the network $\mathrm{Learn}(\varepsilon, f)$ by encoder-decoder pairs $(E, D) \in \mathcal{E}^{l(\varepsilon)} \times \mathcal{D}^{l(\varepsilon)}$ achieving distortion $\varepsilon$ over $\mathcal{C}$ with
$$l(\varepsilon) \le C_0 \cdot \sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\varepsilon, f)) \log_2\!\big( \mathcal{M}(\mathrm{Learn}(\varepsilon, f)) \big) \log_2\frac{1}{\varepsilon},$$
where $C_0 > 0$ is a constant.
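The encoding step above assigns each weight a bit budget that grows only like $\log_2(1/\varepsilon)$. The sketch below shows such a uniform quantizer for a vector of weights; the weight range, the constant $c$, and the test values of $\varepsilon$ are arbitrary choices for illustration, not the construction used in [2].

```python
import numpy as np

def quantize(w, eps, c=2, w_max=10.0):
    """Uniformly quantize each weight to b = ceil(c * log2(1/eps)) bits on [-w_max, w_max]."""
    b = int(np.ceil(c * np.log2(1.0 / eps)))
    levels = 2 ** b
    step = 2 * w_max / (levels - 1)
    codes = np.round((np.clip(w, -w_max, w_max) + w_max) / step).astype(np.int64)
    return codes * step - w_max, b

rng = np.random.default_rng(5)
w = rng.uniform(-10, 10, 1000)
for eps in (1e-1, 1e-2, 1e-3):
    w_hat, bits = quantize(w, eps)
    print(f"eps = {eps:.0e}   bits/weight = {bits:3d}   "
          f"max |w - w_hat| = {np.abs(w - w_hat).max():.2e}")
```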

Corollary 21 (Bölcskei et al. [2])
Let $\Omega \subset \mathbb{R}^d$ be bounded and $\mathcal{C} \subset L^2(\Omega)$. Then, for all $\rho : \mathbb{R} \to \mathbb{R}$ that are Lipschitz continuous or differentiable with polynomially bounded first derivative, we have
$$\gamma^{*,\mathrm{eff}}_{\mathcal{NN}}(\mathcal{C}, \rho) \le \frac{1}{\gamma^*(\mathcal{C})}.$$
We call a function class $\mathcal{C} \subset L^2(\Omega)$ optimally representable by neural networks with activation function $\rho : \mathbb{R} \to \mathbb{R}$ if
$$\gamma^{*,\mathrm{eff}}_{\mathcal{NN}}(\mathcal{C}, \rho) = \frac{1}{\gamma^*(\mathcal{C})}.$$

From Representation Systems to Neural Networks

Definition 22 (Bölcskei et al. [2])
Let $\mathcal{D} = (\varphi_i)_{i \in \mathbb{N}} \subset L^2(\Omega)$ be a representation system. Then $\mathcal{D}$ is said to be representable by neural networks (with activation function $\rho$) if there exist $L, R \in \mathbb{N}$ such that for all $\eta > 0$ and every $i \in \mathbb{N}$ there is a neural network $\Phi_{i,\eta} \in \mathcal{NN}_{L,R,d,\rho}$ with
$$\| \varphi_i - \Phi_{i,\eta} \|_{L^2(\Omega)} \le \eta.$$
If, in addition, the neural networks $\Phi_{i,\eta} \in \mathcal{NN}_{L,R,d,\rho}$ have weights that are uniformly polynomially bounded in $(i, \eta^{-1})$, and if $\rho$ is either Lipschitz continuous or differentiable with polynomially bounded derivative, we call the representation system $(\varphi_i)_{i \in \mathbb{N}}$ effectively representable by neural networks (with activation function $\rho$).

Theorem 23 (Bölcskei et al. [2])
Let $\Omega \subset \mathbb{R}^d$ be bounded, and suppose that $\mathcal{C} \subset L^2(\Omega)$ admits an effective best $M$-term approximation in the representation system $\mathcal{D} = (\varphi_i)_{i \in \mathbb{N}} \subset L^2(\Omega)$. Suppose that $\mathcal{D}$ is effectively representable by neural networks. Then, for all $\gamma < \gamma^{*,\mathrm{eff}}(\mathcal{C}, \mathcal{D})$ there exist constants $c, L > 0$ and a map
$$\mathrm{Learn} : (0, 1) \times L^2(\Omega) \to \mathcal{NN}_{L,\infty,d,\rho}$$
such that for every $f \in \mathcal{C}$ the following statements hold:
1 there exists $k \in \mathbb{N}$ such that each weight of the network $\mathrm{Learn}(\varepsilon, f)$ is bounded by $\varepsilon^{-k}$;
2 the error bound $\| f - \mathrm{Learn}(\varepsilon, f) \|_{L^2(\Omega)} \le \varepsilon$ holds true; and
3 the neural network $\mathrm{Learn}(\varepsilon, f)$ has at most $c\, \varepsilon^{-1/\gamma}$ edges.

Specifically, in [2] the authors show that all function classes that are optimally approximated by a general class of representation systems, the so-called affine systems, can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis, such as wavelets, ridgelets, curvelets, shearlets, α-shearlets, and, more generally, α-molecules.

References

[1] A.R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 930–945, 39(3), 1993.

[2] H. Bölcskei, P. Grohs, G. Kutyniok and P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714, 2017.

[3] T.P. Chen and H. Chen, Approximations of continuous functionals by neural networks with application to dynamic systems, IEEE Transactions on Neural Networks, 910–918, 4(6), 1993.

[4] T.P. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Transactions on Neural Networks, 911–917, 6(4), 1995.

[5] T.P. Chen, H. Chen and R.W. Liu, A constructive proof and an extension of Cybenko's approximation theorem, In Computing Science and Statistics, 163–168, Springer, 1992.

[6] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, 303–314, 2(4), 1989.

[7] R.A. DeVore and G.G. Lorentz, Constructive Approximation, Springer Science & Business Media, 1993.

[8] D. Donoho, Unconditional bases are optimal bases for data compression and for statistical estimation, Applied and Computational Harmonic Analysis, 100–115, 1(1), 1993.

[9] P. Grohs, Optimally sparse data representations, In Harmonic and Applied Analysis, 199–248, Springer, 2015.

[10] V. Maiorov and A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing, 81–91, 25(1–3), 1999.

Thank You for Your Attention!