DGD Workshop

Sandra Keiper,∗ Gitta Kutyniok,∗ Philipp Petersen∗ October 2017

Contents

1 Deep learning

2 Mathematical description of neural networks
  2.1 Activation functions
  2.2 Basic operations with networks
    2.2.1 Rescaling and shifting
    2.2.2 Concatenation
    2.2.3 Parallelization

3 Approximation theory of neural networks
  3.1 Universal approximation theorems
  3.2 Approximation of smooth functions with smooth rectifiers
  3.3 Constant size approximation with special rectifiers
  3.4 Lower bounds for approximation
    3.4.1 Continuous nonlinear approximation
    3.4.2 The VC dimension of piecewise linear networks
    3.4.3 Encodability of networks

4 Recent developments
  4.1 Neural networks and dictionaries
  4.2 ReLU approximation

5 Open questions for further research and discussion
  5.1 Prepared open questions
  5.2 Open questions from the discussion

1 Deep learning

The topic of this workshop is approximation theory of neural networks. Nonetheless, we shall start with a detour on a novel technique called "deep learning", since it can be considered the main motivation for neural networks. To be honest right from the start, we wish to emphasize that this workshop does not explain why deep learning works. What we will do, however, is to provide an overview of the basic architecture that deep learning relies on, and this is, as expected, given by neural networks.

Deep learning is based on the efficient, data-driven training of neural networks. We shall decipher this statement piece by piece, starting with neural networks. We will provide a precise mathematical definition of neural networks in the sequel, but at this moment, a neural network is a collection of very simple computational units, called neurons, which are arranged in (potentially) complex patterns, see Figure 1. Some of the neurons are special in the sense that they are input or output neurons.

∗Email: {keiper, kutyniok, petersen}@math.tu-berlin.de

Figure 1: A neural network with multiple neurons connected by edges. Observe that cycles are possible.

Certainly, when we take a look at the simple neural network of Figure 1 and also take its name into consideration, we are led to believe that there is some connection to actual biological neural networks. Indeed, the starting point for neural networks is the work of McCulloch and Pitts [18], which had the goal of establishing a mathematical model of the human brain. The underlying idea is that a biological neuron is activated and begins to send out its own signals if a specific threshold of inputs is exceeded. Mathematically, a neuron according to McCulloch and Pitts is a function $f_t$ such that

$$f_t(x) = \begin{cases} 1 & \text{if } x \geq t, \\ 0 & \text{else.} \end{cases}$$

Using compositions and sums of these functions, McCulloch and Pitts demonstrated that simple logical functions such as the AND, OR, and NOT functions can be realized as neural networks.

For computational purposes, and especially for deep learning, a general neural network model is not suitable (admittedly, there are exceptions). This is because cycles as in Figure 1 can make the network arbitrarily complex. A neural network whose graph has no cycles is called a feed-forward neural network, see Figure 2; all other networks are called recurrent neural networks. The feed-forward model can be formulated mathematically. Let

• $d \in \mathbb{N}$ be the input dimension,
• $L \in \mathbb{N}$ be the number of layers,
• $N_0, \dots, N_L \in \mathbb{N}$ be the number of neurons in each layer, where $N_0 = d$,
• $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$, $b_\ell \in \mathbb{R}^{N_\ell}$ be the weights of a neural network, where $\ell \in \{1, \dots, L\}$,
• $\varrho : \mathbb{R} \to \mathbb{R}$ be an activation function.

Then

$$\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}, \qquad x \mapsto W_L(\varrho(W_{L-1}(\varrho(\dots \varrho(W_1(x)))))),$$

where $W_\ell(x) = A_\ell x + b_\ell$ and $\varrho$ is applied componentwise, is a neural network. We will refine this model in the sequel, but in essence this is the mathematical model for a neural network that we shall use. Additionally, we mention convolutional networks, which are neural networks that use convolutions in place of general matrix multiplication.

Feed-forward neural networks are the backbone of modern learning because they can be efficiently trained. This leads us to the second aspect of deep learning. It was said before that deep learning is based on the efficient, data-driven training of neural networks. By now, we have a model for neural networks, but it is not yet clear how to train it. Training of a neural network refers to the act of adjusting its weights, i.e., the objects $A_\ell$, $b_\ell$, so as to minimize a certain loss function on a training set. Let us say we have a number of input-output pairs

Figure 2: A feed-forward neural network.

$(x_i, y_i)_{i=1,\dots,m} \subset \mathbb{R}^d \times \mathbb{R}^{d'}$, and a map $\mathcal{L} : \mathbb{R}^{d'} \times \mathbb{R}^{d'} \to \mathbb{R}^+$. Our goal is to solve the following minimization problem

$$\min_{\Phi} \sum_{i=1}^{m} \mathcal{L}(\Phi(x_i), y_i), \tag{1.1}$$

over all neural networks $\Phi$ with a $d$-dimensional input and an $N_L$-dimensional output (so that $N_L = d'$) under some restrictions on the number of neurons, the number of layers, or something similar.

In practice, (1.1) is solved by stochastic gradient descent. If $W = (\omega_k)_{k=1}^{M'} \in \mathbb{R}^{M'}$, with $M' = \sum_{\ell=1}^{L} N_\ell (N_{\ell-1} + 1)$, denotes all the weights of a neural network, i.e., $(A_\ell)_{i,j}$ and $(b_\ell)_i$ for $i = 1, \dots, N_\ell$, $j = 1, \dots, N_{\ell-1}$, $\ell = 1, \dots, L$, then a network can also be interpreted as the function

$$(x, \omega_1, \dots, \omega_{M'}) \mapsto \Phi_W(x).$$

In this interpretation, we can compute $\partial \sum_{i=1}^{m} \mathcal{L}(\Phi_W(x_i), y_i) / \partial \omega_k$ for all $k = 1, \dots, M'$ and replace $\omega_k$ by

$$\tilde{\omega}_k = \omega_k - \lambda \, \frac{\partial}{\partial \omega_k} \sum_{i=1}^{m} \mathcal{L}(\Phi_W(x_i), y_i),$$

where $\lambda$ is the step size. At first it seems like an especially complex task to compute the derivatives with respect to all weights, but there exists a convenient algorithm, called backpropagation [27], that allows for a very efficient computation. In essence, this algorithm is simply an iterative application of the chain rule. To spare ourselves tedious technicalities, we will not review the algorithm here but refer instead to [10], available at http://www.deeplearningbook.org/, or to one of the many good explanations online.

We have now understood the basic ingredients of deep learning. First of all, the architecture is provided by neural networks, and secondly, if we have sufficiently many training samples, we can manipulate this network in such a way that its associated function fits the training data. This allows us to pose three fundamental questions that form the basis for the understanding of deep learning:

• Expressibility: How powerful is the network architecture? Can it indeed represent the right functions for the learning procedures described above?
• Learning: Why does the stochastic gradient descent algorithm produce anything reasonable? For general activation functions, it is clear that the loss function is not convex in the weight variables. This leads to potentially plenty of local minima.
• Generalization: The learning procedure only has access to input-output pairs from a training set. To complete a machine learning task, we require that the resulting network is able to generalize, i.e., that it also performs correctly on a test set, which is not a subset of the training set. In applications, it appears that neural networks generalize unreasonably well.
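To make the training step behind the "Learning" question concrete, the following is a minimal sketch of the weight update for a toy network with four weights. Everything specific in it is an assumption made for illustration: the network $\Phi_W(x) = w_2\,\varrho(w_1 x + b_1) + b_2$, the smooth activation $\ln(1 + e^x)$, the squared-error loss, the data, and the step size. It uses the full sum over the training set; a stochastic variant would evaluate the gradient on a random subset of the samples in each step.

```python
import math

def softplus(z):
    """Smooth activation ln(1 + exp(z)), written so that exp never overflows."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Derivative of softplus (fine for the moderate values of z used here)."""
    return 1.0 / (1.0 + math.exp(-z))

def network(w, x):
    """Phi_W(x) = w2 * softplus(w1 * x + b1) + b2 with W = (w1, b1, w2, b2)."""
    w1, b1, w2, b2 = w
    return w2 * softplus(w1 * x + b1) + b2

def loss(w, xs, ys):
    """Sum of squared errors over the training set, an instance of (1.1)."""
    return sum((network(w, x) - y) ** 2 for x, y in zip(xs, ys))

def gradient(w, xs, ys):
    """Partial derivatives of the loss w.r.t. each weight, via the chain rule."""
    w1, b1, w2, b2 = w
    g = [0.0, 0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        z = w1 * x + b1
        r = 2.0 * (w2 * softplus(z) + b2 - y)  # derivative of the squared error
        g[0] += r * w2 * sigmoid(z) * x        # d/dw1
        g[1] += r * w2 * sigmoid(z)            # d/db1
        g[2] += r * softplus(z)                # d/dw2
        g[3] += r                              # d/db2
    return g

def train(w, xs, ys, step=0.001, iterations=500):
    """Repeat the update w_k <- w_k - step * dLoss/dw_k."""
    for _ in range(iterations):
        g = gradient(w, xs, ys)
        w = [wk - step * gk for wk, gk in zip(w, g)]
    return w

# Hypothetical training data: samples of x -> max(x, 0).
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 2.0]
w0 = [0.5, 0.0, 0.5, 0.0]
w_trained = train(w0, xs, ys)
print(loss(w0, xs, ys), loss(w_trained, xs, ys))
```

The hand-written `gradient` plays the role that backpropagation plays for larger networks; for a network this small, the chain rule can simply be written out.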

Figure 3: Result of a neural network trained to produce descriptive sentences for images. The image is taken from [14].

This workshop can only provide partial answers to the first of these questions. One question that was not posed above is this: Why should you keep reading or listening to this workshop? Indeed, at this point, the basic functionality of deep learning has been explained, but we lack a profound motivation to study it. Thus we will spend the rest of the chapter on a couple of remarkable applications of deep learning to whet the appetite for more theory.

One of the most prominent applications of deep learning is certainly image classification. In fact, one could even argue that the work of Krizhevsky et al. [15] from 2012 triggered the hype on deep learning. In this paper, a convolutional neural network with 60 million weights was trained on a database of 1.5 million images. After training, it can assign a label to each image describing the content of the image. In doing so, it shattered benchmark classification results obtained with methods other than deep learning. Another example of image classification even combines the extraction of a label with a descriptive sentence [14], see Figure 3.

In addition to image analysis, neural networks can also manipulate images. A very fun example is that of style transfer [9]. Here a network is able to extract certain features of an image. In fact, in [9] it was demonstrated that style and content of an image can be separated in a way that makes it possible to replace the style of an image by a different one. A particularly fascinating result is shown in Figure 4.

Another especially remarkable feat of deep learning is the 2016 victory of a machine against one of the best Go players in the world [7]. Such an achievement was believed to be decades away. During its second game against the human grandmaster Lee Sedol, in move 37 the machine, called AlphaGo, made a move that was at first considered to be a blunder by the commentators and only turned out to be remarkably brilliant multiple moves later. It is believed that such a move had never been played by a human being. This shows that the machine invented strategies on its own and did not just memorize human moves. AlphaGo's superior performance is based on a training method called reinforcement learning, where two machines, which are pretrained on human moves, start playing each other and improve during the process.

The list of remarkable applications continues. A good overview is contained in the survey article [16] and in the book [10], which is available for free at http://www.deeplearningbook.org/.

Figure 4: Top left: Vincent van Gogh's "Starry Night". Top right: A part of the Stanford campus. Bottom: The style of van Gogh applied to the Stanford campus. The image is taken from [9].

By now, we have hopefully motivated ourselves sufficiently to do some elementary math in order to understand the basic workhorse of deep learning: neural networks.

2 Mathematical description of neural networks

The somewhat intuitive definition of the previous section describes neural networks as functions given by the alternating composition of affine-linear maps and componentwise applications of activation functions. This definition is sufficient for almost all results we have in mind. Nonetheless, there is a minor imprecision, since the definition does not differentiate between a function $\Phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and its representation as a network. In particular, different weights or different network architectures can result in the same function. Thus, we introduce a minor refinement.

Definition 2.1. Let $d, L \in \mathbb{N}$. A neural network $\Phi$ with input dimension $d$ and $L$ layers is a sequence of matrix-vector tuples

$$\Phi = ((A_1, b_1), (A_2, b_2), \dots, (A_L, b_L)),$$

where $N_0 = d$ and $N_1, \dots, N_L \in \mathbb{N}$, and where each $A_\ell$ is an $N_\ell \times N_{\ell-1}$ matrix and $b_\ell \in \mathbb{R}^{N_\ell}$.

If $\Phi$ is a neural network as above, and if $\varrho : \mathbb{R} \to \mathbb{R}$ is arbitrary, then we define the associated realization of $\Phi$ with activation function $\varrho$ as the map $\mathrm{R}_\varrho(\Phi) : \mathbb{R}^d \to \mathbb{R}^{N_L}$ such that

$$\mathrm{R}_\varrho(\Phi)(x) = x_L,$$

where $x_L$ results from the following scheme:

$$x_0 := x, \qquad x_\ell := \varrho(A_\ell x_{\ell-1} + b_\ell) \ \text{ for } \ell = 1, \dots, L-1, \qquad x_L := A_L x_{L-1} + b_L,$$

where $\varrho$ acts componentwise, i.e., $\varrho(y) = (\varrho(y_1), \dots, \varrho(y_m))$ for $y = (y_1, \dots, y_m) \in \mathbb{R}^m$.

We call $N(\Phi) := d + \sum_{j=1}^{L} N_j$ the number of neurons of the network $\Phi$, $L =: L(\Phi)$ the number of layers, and, finally, $M(\Phi) := \sum_{j=1}^{L} (\|A_j\|_{\ell^0} + \|b_j\|_{\ell^0})$ denotes the number of non-zero entries of all $A_\ell$, $b_\ell$, which we call the number of weights of $\Phi$. Moreover, we refer to $N_L$ as the dimension of the output layer of $\Phi$.

Following this definition, we now have a clear separation between a network, as the collection of all its weights, and the associated realization. The definition already introduced the key parameters of a neural network $\Phi$. These are

• the number of layers $L(\Phi)$, or the depth of the network;
• the number of neurons $N(\Phi)$ of the network;
• the number of weights $M(\Phi)$ of the network.
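The separation between a network and its realization can be mirrored directly in code. The sketch below is one possible encoding, chosen here for illustration and not prescribed by the definition: a network is a list of `(A, b)` pairs with `A` a list of rows, and `realize`, `num_layers`, `num_neurons`, and `num_weights` are hypothetical helper names for the realization, $L(\Phi)$, $N(\Phi)$, and $M(\Phi)$.

```python
def affine(A, b, x):
    """Return A x + b, with the matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def realize(net, rho, x):
    """Realization of net at x: rho is applied after every layer but the last."""
    for A, b in net[:-1]:
        x = [rho(v) for v in affine(A, b, x)]
    A, b = net[-1]
    return affine(A, b, x)

def num_layers(net):
    """L(net): the number of (A, b) pairs."""
    return len(net)

def num_neurons(net):
    """N(net) = d + N_1 + ... + N_L, with d read off the first matrix."""
    d = len(net[0][0][0])
    return d + sum(len(b) for _, b in net)

def num_weights(net):
    """M(net): the number of non-zero entries of all matrices and biases."""
    total = 0
    for A, b in net:
        total += sum(1 for row in A for a in row if a != 0)
        total += sum(1 for v in b if v != 0)
    return total

def relu(v):
    return max(0.0, v)

# Example network with d = 2, N_1 = 3, N_2 = 1.
phi = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]], [0.0, -1.0, 0.0]),
    ([[1.0, 1.0, 0.0]], [0.5]),
]
print(realize(phi, relu, [1.0, 2.0]))                      # [2.5]
print(num_layers(phi), num_neurons(phi), num_weights(phi))  # 2 6 8
```

Keeping the weights as data and passing the activation function separately means that the same network `phi` can be realized with different activation functions.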

In some sense, these values describe the complexity, or the size, of a network. Thus, if we want to study approximation properties of neural networks, we would like to relate the approximation quality of a network to one or more of the parameters above. To simplify the notation in the sequel, we make the following additional definition.

Definition 2.2. For $d \in \mathbb{N}$ and $M, N, L \in \mathbb{N} \cup \{\infty\}$ we denote by $\mathcal{NN}_{d,M,N,L}$ the set of all neural networks $\Phi$ with a $d$-dimensional input and one-dimensional output such that $M(\Phi), N(\Phi), L(\Phi) < \infty$ and $M(\Phi) \leq M$, $N(\Phi) \leq N$, $L(\Phi) \leq L$.

2.1 Activation functions

By Definition 2.1, we require an activation function for the realization of a neural network. A couple of choices are frequently employed in the literature. One of the most basic activation functions is given by the Heaviside function

$$H : \mathbb{R} \to \{0, 1\}, \qquad H(x) = \chi_{\mathbb{R}^+}(x).$$

This is the activation function used in the original works of McCulloch and Pitts, and it models a biological neuron that "fires" whenever a certain threshold is exceeded. This activation function is entirely unsuitable for learning, since its derivative vanishes almost everywhere, which implies that the backpropagation algorithm cannot work.

A second choice, which is probably the most prominent and widely used activation function right now, is the rectified linear unit (ReLU). This is the function

$$\varrho : \mathbb{R} \to \mathbb{R}^+, \qquad x \mapsto \max\{0, x\}. \tag{ReLU}$$

Since the evaluations of this function and of its derivative $\varrho' = H$ (for $x \neq 0$) are very simple, this function leads to very fast learning. A smooth alternative to the ReLU is the softplus function

$$\varrho : \mathbb{R} \to \mathbb{R}^+, \qquad x \mapsto \ln(1 + \exp(x)). \tag{softplus}$$

For some applications, general classes of smooth functions that behave similarly to the ReLU can be studied. A prominent class is given by sigmoidal functions (German: Schwanenhalsfunktion). A sigmoidal function of order $k \in \mathbb{N}$ is one that is monotone, $k$-times continuously differentiable, and satisfies $\varrho(x) \to 0$ for $x \to -\infty$ and $\varrho(x)/x^k \to 1$ for $x \to \infty$.
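The activation functions of this subsection are easy to write down explicitly. The following sketch also rebuilds the logical AND, OR, and NOT functions from threshold neurons $f_t(x) = H(x - t)$, as mentioned in Section 1. The value $H(0) = 1$ and the particular thresholds are assumptions chosen here so that the gates work on inputs in $\{0, 1\}$; they are not taken from the original work of McCulloch and Pitts.

```python
import math

def heaviside(x):
    """H(x) = 1 for x >= 0 and 0 otherwise (value at 0 chosen to match f_t)."""
    return 1.0 if x >= 0 else 0.0

def relu(x):
    """Rectified linear unit max{0, x}."""
    return max(0.0, x)

def softplus(x):
    """ln(1 + exp(x)), rearranged so that exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def threshold_neuron(t, x):
    """McCulloch-Pitts style neuron f_t(x) = H(x - t)."""
    return heaviside(x - t)

# Logical functions on inputs in {0, 1}, built from sums and threshold neurons.
def logical_and(x, y):
    return threshold_neuron(2, x + y)

def logical_or(x, y):
    return threshold_neuron(1, x + y)

def logical_not(x):
    return threshold_neuron(0, -x)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, logical_and(x, y), logical_or(x, y), logical_not(x))

# The softplus function approaches the ReLU away from the origin.
print(softplus(-30.0), softplus(0.0), softplus(30.0) - relu(30.0))
```

The last line illustrates why the softplus function is called a smooth alternative to the ReLU: the two differ by $\ln 2$ at the origin and by exponentially small amounts far away from it.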

2.2 Basic operations with networks

We shall discuss some basic operations with neural networks that will play a role in the sequel, and thereby familiarize ourselves with this new concept.

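2.2.1 Rescaling and shifting

Let $d, d' \in \mathbb{N}$ and $f : \mathbb{R}^d \to \mathbb{R}^{d'}$. For $t \in \mathbb{R}^d$ we define the translation operator by

$$T_t(f)(x) := f(x - t) \quad \text{for all } x \in \mathbb{R}^d.$$

Now, if $\Phi = ((A_1, b_1), (A_2, b_2), \dots, (A_L, b_L))$, we see that

$$(T_t \mathrm{R}_\varrho(\Phi))(x) = \mathrm{R}_\varrho(\tilde{\Phi})(x),$$

where $\tilde{\Phi} = ((A_1, b_1 - A_1 t), (A_2, b_2), \dots, (A_L, b_L))$, since $A_1 (x - t) + b_1 = A_1 x + (b_1 - A_1 t)$.

Another important operator is the dilation operator. Let $a > 0$, then

$$D_a(f)(x) = a^{d/2} f(a x).$$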

Question: Which network $\tilde{\Phi}$ satisfies $D_a \mathrm{R}_\varrho(\Phi) = \mathrm{R}_\varrho(\tilde{\Phi})$?
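Returning to the translation operator, the identity $T_t \mathrm{R}_\varrho(\Phi) = \mathrm{R}_\varrho(\tilde{\Phi})$ can be checked numerically on an example. This is a sketch under the same illustrative encoding of networks as lists of `(A, b)` pairs; `translate` is a hypothetical helper name, and the weights are arbitrary. The dilation question is deliberately left open.

```python
def affine(A, b, x):
    """Return A x + b, with the matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def realize(net, rho, x):
    """Realization of net at x: rho is applied after every layer but the last."""
    for A, b in net[:-1]:
        x = [rho(v) for v in affine(A, b, x)]
    A, b = net[-1]
    return affine(A, b, x)

def relu(v):
    return max(0.0, v)

def translate(net, t):
    """Return ((A_1, b_1 - A_1 t), (A_2, b_2), ..., (A_L, b_L))."""
    A1, b1 = net[0]
    A1t = affine(A1, [0.0] * len(b1), t)
    shifted_b1 = [bi - ci for bi, ci in zip(b1, A1t)]
    return [(A1, shifted_b1)] + list(net[1:])

phi = [
    ([[1.0, 2.0], [-1.0, 0.5]], [0.0, 1.0]),
    ([[2.0, -1.0]], [0.25]),
]
t = [0.5, -1.0]
x = [1.0, 3.0]

# Left-hand side: evaluate the original realization at x - t.
lhs = realize(phi, relu, [xi - ti for xi, ti in zip(x, t)])
# Right-hand side: evaluate the realization of the modified network at x.
rhs = realize(translate(phi, t), relu, x)
print(lhs, rhs)
```

Only the first bias changes, so the translated network has the same number of layers and neurons as the original one.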

2.2.2 Concatenation

Given two functions $f : \mathbb{R}^{d'} \to \mathbb{R}^{d''}$ and $g : \mathbb{R}^d \to \mathbb{R}^{d'}$, where $d, d', d'' \in \mathbb{N}$, we often care about the concatenation $f \circ g$ of these functions. A similar concept is possible for neural networks.

Figure 5: Top: Two networks. Bottom: Concatenation of both networks according to Definition 2.3.

Definition 2.3. Let $L_1, L_2 \in \mathbb{N}$ and let $\Phi^1 = ((A^1_1, b^1_1), \dots, (A^1_{L_1}, b^1_{L_1}))$, $\Phi^2 = ((A^2_1, b^2_1), \dots, (A^2_{L_2}, b^2_{L_2}))$ be two neural networks such that the input layer of $\Phi^1$ has the same dimension as the output layer of $\Phi^2$. Then $\Phi^1 \bullet \Phi^2$ denotes the following $L_1 + L_2 - 1$ layer network:

$$\Phi^1 \bullet \Phi^2 := ((A^2_1, b^2_1), \dots, (A^2_{L_2-1}, b^2_{L_2-1}), (A^1_1 A^2_{L_2}, A^1_1 b^2_{L_2} + b^1_1), (A^1_2, b^1_2), \dots, (A^1_{L_1}, b^1_{L_1})).$$

We call $\Phi^1 \bullet \Phi^2$ the concatenation of $\Phi^1$ and $\Phi^2$.

Question: Is this a reasonable definition? In particular, do we have $\mathrm{R}_\varrho(\Phi^1 \bullet \Phi^2) = \mathrm{R}_\varrho(\Phi^1) \circ \mathrm{R}_\varrho(\Phi^2)$?

What is $M(\Phi^1 \bullet \Phi^2)$? Is there a more efficient definition of concatenation to limit the number of weights?
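As a starting point for the discussion of these questions, the concatenation of Definition 2.3 can be implemented and compared with the composition of the realizations on one example. This is a numerical check on arbitrarily chosen weights, not a proof, and it says nothing about the weight count. The list-of-pairs encoding and the names `matmul` and `concatenate` are assumptions of this sketch.

```python
def affine(A, b, x):
    """Return A x + b, with the matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def realize(net, rho, x):
    """Realization of net at x: rho is applied after every layer but the last."""
    for A, b in net[:-1]:
        x = [rho(v) for v in affine(A, b, x)]
    A, b = net[-1]
    return affine(A, b, x)

def relu(v):
    return max(0.0, v)

def matmul(A, B):
    """Matrix product of A (m x n) and B (n x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def concatenate(net1, net2):
    """Concatenation as in Definition 2.3: net2 acts first, then net1."""
    A1, b1 = net1[0]    # first layer of net1
    A2, b2 = net2[-1]   # last layer of net2
    merged = (matmul(A1, A2), affine(A1, b1, b2))  # (A1 A2, A1 b2 + b1)
    return list(net2[:-1]) + [merged] + list(net1[1:])

net2 = [
    ([[1.0, -1.0], [0.5, 1.0]], [0.0, -1.0]),
    ([[1.0, 0.0], [1.0, 1.0]], [0.5, 0.0]),
]
net1 = [
    ([[1.0, -2.0], [0.0, 1.0]], [0.0, 0.5]),
    ([[1.0, 1.0]], [-1.0]),
]
x = [2.0, 1.0]

composed = realize(net1, relu, realize(net2, relu, x))
concatenated = realize(concatenate(net1, net2), relu, x)
print(len(concatenate(net1, net2)), composed, concatenated)
```

The merged layer is an affine map, which is why the last layer of `net2` (which carries no activation) and the first layer of `net1` can be collapsed into one.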

Let us do something that at first glance seems completely unreasonable. Assume $\varrho$ to be the ReLU. Then we can build a two-layer network whose realization is the identity.

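Lemma 2.4. Let $\varrho$ be the ReLU, let $d \in \mathbb{N}$, and define

$$\Phi^{\mathrm{Id}} := ((A_1, b_1), (A_2, b_2))$$

with

$$A_1 := \begin{pmatrix} \mathrm{Id}_{\mathbb{R}^d} \\ -\mathrm{Id}_{\mathbb{R}^d} \end{pmatrix}, \quad b_1 := 0, \quad A_2 := \begin{pmatrix} \mathrm{Id}_{\mathbb{R}^d} & -\mathrm{Id}_{\mathbb{R}^d} \end{pmatrix}, \quad b_2 := 0.$$

Then $\mathrm{R}_\varrho(\Phi^{\mathrm{Id}}) = \mathrm{Id}_{\mathbb{R}^d}$. Indeed, $x = \varrho(x) - \varrho(-x)$ for all $x \in \mathbb{R}$.

Using concatenations, we can now directly construct a network with $L + 2$ layers whose ReLU realization equals the identity by setting

$$\Phi^{\mathrm{Id}}_{d, L+2} := \Phi^{\mathrm{Id}} \bullet \dots \bullet \Phi^{\mathrm{Id}},$$

where we take $L$ concatenations.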
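The two-layer identity network of Lemma 2.4 can be assembled explicitly. The sketch below uses the illustrative list-of-pairs encoding of networks; `identity_network` is a hypothetical helper name.

```python
def affine(A, b, x):
    """Return A x + b, with the matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def realize(net, rho, x):
    """Realization of net at x: rho is applied after every layer but the last."""
    for A, b in net[:-1]:
        x = [rho(v) for v in affine(A, b, x)]
    A, b = net[-1]
    return affine(A, b, x)

def relu(v):
    return max(0.0, v)

def identity_network(d):
    """Network of Lemma 2.4: A1 stacks Id on -Id, A2 places Id next to -Id."""
    eye = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    neg = [[-v for v in row] for row in eye]
    A1 = eye + neg                          # (2d) x d matrix
    A2 = [r + s for r, s in zip(eye, neg)]  # d x (2d) matrix
    return [(A1, [0.0] * (2 * d)), (A2, [0.0] * d)]

x = [1.5, -2.0, 0.0]
print(realize(identity_network(3), relu, x))
```

The first layer splits every coordinate into its positive and negative part, and the second layer subtracts the two again, which is exactly the identity $x = \varrho(x) - \varrho(-x)$.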

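2.2.3 Parallelization

One final operation that will become surprisingly convenient is that of parallelization. This also allows us to define an operation whose realization is the sum of the realizations of two networks.

Definition 2.5. Let $L \in \mathbb{N}$ and let $\Phi^1 = ((A^1_1, b^1_1), \dots, (A^1_L, b^1_L))$, $\Phi^2 = ((A^2_1, b^2_1), \dots, (A^2_L, b^2_L))$ be two neural networks with $L$ layers and with $d$-dimensional input. We define

$$P(\Phi^1, \Phi^2) := ((\tilde{A}_1, \tilde{b}_1), \dots, (\tilde{A}_L, \tilde{b}_L)),$$

where

$$\tilde{A}_1 := \begin{pmatrix} A^1_1 \\ A^2_1 \end{pmatrix}, \quad \tilde{b}_1 := \begin{pmatrix} b^1_1 \\ b^2_1 \end{pmatrix}, \quad \text{and} \quad \tilde{A}_\ell := \begin{pmatrix} A^1_\ell & 0 \\ 0 & A^2_\ell \end{pmatrix}, \quad \tilde{b}_\ell := \begin{pmatrix} b^1_\ell \\ b^2_\ell \end{pmatrix} \quad \text{for } 1 < \ell \leq L.$$

Then $P(\Phi^1, \Phi^2)$ is a neural network with $d$-dimensional input and $L$ layers, called the parallelization of $\Phi^1$ and $\Phi^2$.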

Figure 6: Top: Two networks. Bottom: Parallelization of both networks according to Definition 2.5.

One readily verifies that $M(P(\Phi^1, \Phi^2)) = M(\Phi^1) + M(\Phi^2)$, and

$$\mathrm{R}_\varrho(P(\Phi^1, \Phi^2))(x) = (\mathrm{R}_\varrho(\Phi^1)(x), \mathrm{R}_\varrho(\Phi^2)(x)) \quad \text{for all } x \in \mathbb{R}^d. \tag{2.1}$$

We depict the parallelization of two networks in Figure 6. If two networks have one-dimensional output and the same number of layers, then there is a neural network whose realization is the sum of the realizations of the two networks. This can easily be seen by using parallelization and concatenation with a linear function. With this argument, we obtain a meaningful definition of the sum of two networks.
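The block structure of Definition 2.5 and the sum construction can be sketched as follows. The encoding of networks as lists of `(A, b)` pairs and the names `parallelize` and `sum_network` are assumptions made for illustration; `sum_network` folds the linear map $(y_1, y_2) \mapsto y_1 + y_2$ into the last layer, which is what concatenation with a one-layer linear network amounts to.

```python
def affine(A, b, x):
    """Return A x + b, with the matrix A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def realize(net, rho, x):
    """Realization of net at x: rho is applied after every layer but the last."""
    for A, b in net[:-1]:
        x = [rho(v) for v in affine(A, b, x)]
    A, b = net[-1]
    return affine(A, b, x)

def relu(v):
    return max(0.0, v)

def num_weights(net):
    """M(net): the number of non-zero entries of all matrices and biases."""
    return sum(sum(1 for row in A for a in row if a != 0)
               + sum(1 for v in b if v != 0) for A, b in net)

def parallelize(net1, net2):
    """P(net1, net2) for two networks of equal depth and input dimension."""
    (A1, b1), (A2, b2) = net1[0], net2[0]
    layers = [(A1 + A2, b1 + b2)]                    # first layer: stack the rows
    for (A1, b1), (A2, b2) in zip(net1[1:], net2[1:]):
        n1, n2 = len(A1[0]), len(A2[0])
        top = [row + [0.0] * n2 for row in A1]       # block row (A1  0)
        bottom = [[0.0] * n1 + row for row in A2]    # block row (0  A2)
        layers.append((top + bottom, b1 + b2))
    return layers

def sum_network(net1, net2):
    """Network realizing the sum, for two networks with one-dimensional output."""
    par = parallelize(net1, net2)
    A, b = par[-1]
    summed_row = [sum(col) for col in zip(*A)]       # the row vector (1 1) A
    return par[:-1] + [([summed_row], [sum(b)])]

net1 = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]), ([[1.0, 2.0]], [0.5])]
net2 = [([[1.0, 1.0]], [1.0]), ([[-3.0]], [1.0])]
x = [1.0, 2.0]

print(realize(parallelize(net1, net2), relu, x))  # both outputs side by side
print(realize(sum_network(net1, net2), relu, x))  # their sum
```

Because the two networks only share the input, the parallelized network contains no weights beyond those of `net1` and `net2`, in line with the count stated above.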

Question: Is there also a reasonable way to define a parallelization or sum if two networks have a different number of layers?

3 Approximation theory of neural networks

In this section, we aim to provide an overview of established results in the literature. The most well-known and basic results are the universal approximation theorems, which shall also be the starting point of our review. Afterwards, we will try to understand a more quantitative notion of approximation properties. To this end, we pick function classes and try to understand the connection between the size of a neural network and the approximation quality of its realization.

3.1 Universal approximation theorems

The universal approximation theorems can be seen as the first approximation results for neural networks. In essence, they state that every continuous function on a compact domain can be approximated arbitrarily well by the realization of a neural network. Three papers containing such a result appeared within a very short period. These are the uniform approximation result by Cybenko [6], one by Hornik, Stinchcombe, and White [13], and shortly afterward a result by Hornik [12]. The results differ in the type of activation function (see Section 2.1) they allow for, as well as in their proof techniques. Cybenko's result holds for sigmoidal functions. He proved density of realizations of neural networks in $C([0, 1])$ by using the theorem of Hahn-Banach and the Riesz representation theorem. The argument will be presented in a special case in the proof of Theorem 3.2. Hornik, Stinchcombe, and White's proof is based on the Stone-Weierstrass theorem and requires special activation functions. Finally, the result of [12] improves the technique of Cybenko to obtain density of realizations of neural networks for general continuous, bounded, and non-constant activation functions. The result reads as follows:

Theorem 3.1 ([12]). If $\varrho$ is continuous, bounded, and non-constant, $d \in \mathbb{N}$, and $X \subset \mathbb{R}^d$ is compact, then

$$\overline{\{\mathrm{R}_\varrho(\Phi) : X \to \mathbb{R} \mid \Phi \in \mathcal{NN}_{d,\infty,\infty,2}\}} = C(X).$$

In words, the realizations of all two-layer neural networks are dense in $C(X)$. The proof is reasonably tedious due to the generality of the statement. We will prove a similar theorem for a very special case, which is based on the same idea as the proof of Theorem 3.1, but is much shorter.

Theorem 3.2. Let $\varrho$ be the ReLU, $d \in \mathbb{N}$, and $X \subset \mathbb{R}^d$ compact. Then

$$\overline{\{\mathrm{R}_\varrho(\Phi) : X \to \mathbb{R} \mid \Phi \in \mathcal{NN}_{d,\infty,\infty,3}\}} = C(X).$$

In words, the ReLU realizations of three-layer neural networks are dense in $C(X)$.

Proof. We have that

$$U_\varrho = \{\mathrm{R}_\varrho(\Phi) : X \to \mathbb{R} \mid \Phi \in \mathcal{NN}_{d,\infty,\infty,3}\}$$

is a linear subspace of $C(X)$ (see also Subsection 2.2.3 for the definition of the sum of realizations of neural networks). Assume towards a contradiction that $U_\varrho$ is not dense. Then we can conclude by the theorem of Hahn-Banach that there exists a non-zero functional $f'$ such that $f'(h) = 0$ for all $h \in U_\varrho$. By the representation theorem of Riesz, we conclude that there exists a finite signed measure $\sigma \neq 0$ such that

$$\int_X h(x) \, d\sigma(x) = 0$$

for all $h \in U_\varrho$. We shall continue by showing that for any cuboid $Q \subseteq X$ there exists a sequence $(h_n)_{n \in \mathbb{N}} \subset U_\varrho$ such that $h_n \to \chi_Q$ pointwise and $h_n \leq h_{n+1}$ pointwise. By the theorem of monotone convergence, this yields that $\sigma(Q) = 0$ for all cuboids $Q$, which is a contradiction.

Let $Q = \prod_{i=1}^{d} [d_i, e_i] \subset X$ be arbitrary. For $x \in \mathbb{R}$ set

$$t_i^n(x) := \varrho(nx - nd_i) - \varrho(nx - nd_i - 1) - \varrho(nx - ne_i + 1) + \varrho(nx - ne_i), \tag{3.1}$$

for $i = 1, \dots, d$. By construction, $t_i^n$ equals 1 on $[d_i + 1/n, e_i - 1/n]$. Moreover, $t_i^n(x) \leq t_i^{n+1}(x)$ can easily be checked. We now define, for $x = (x_1, \dots, x_d)$,

$$h_n(x) = \varrho\Big(\sum_{i=1}^{d} t_i^n(x_i) - (d - 1)\Big).$$

One readily verifies that $(h_n)_{n \in \mathbb{N}}$ satisfies all assumptions. In particular, it is pointwise monotone, since the $(t_i^n)_{n \in \mathbb{N}}$ are, and it converges pointwise to $\chi_Q$ since $(\sum_{i=1}^{d} t_i^n(x_i) - (d - 1))_{n \in \mathbb{N}}$ converges pointwise to 1 on $Q$ and is at most 0 on $Q^c$. This completes the proof.
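The functions $t_i^n$ and $h_n$ from the proof are simple enough to evaluate directly. The following sketch does so for a hypothetical cuboid in the plane and three test points, two strictly inside and one outside; the cuboid, the points, and the function names `t` and `h` are chosen here purely for illustration.

```python
def relu(v):
    return max(0.0, v)

def t(n, lo, hi, x):
    """Trapezoid t_i^n from (3.1) for the interval [lo, hi], evaluated at x."""
    return (relu(n * x - n * lo) - relu(n * x - n * lo - 1)
            - relu(n * x - n * hi + 1) + relu(n * x - n * hi))

def h(n, box, x):
    """h_n(x) = relu(sum_i t_i^n(x_i) - (d - 1)) for box = [(d_1, e_1), ...]."""
    d = len(box)
    return relu(sum(t(n, lo, hi, xi) for (lo, hi), xi in zip(box, x)) - (d - 1))

box = [(0.0, 1.0), (0.25, 0.75)]  # Q = [0, 1] x [0.25, 0.75]
inside = [0.5, 0.5]
near_edge = [0.0625, 0.5]         # inside Q, but close to its boundary
outside = [0.5, 0.875]

for n in (4, 16, 64):
    print(n, h(n, box, inside), h(n, box, near_edge), h(n, box, outside))
```

At the point close to the boundary, the value increases with $n$ until it reaches 1, while it stays 0 outside of the cuboid, which is the monotone behavior used in the proof.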

Figure 7: Left: A ReLU. Right: A sum of four ReLUs as in Equation (3.1).

The universal approximation theorem has historically been used as a justification of the power of neural networks. It should be emphasized, though, that mere density of a function system in $C(X)$ does not tell the whole story. For instance, polynomials are dense as well but are not used for deep learning. On the other hand, $\mathcal{S}'(X)$, i.e., the space of all tempered distributions on $X$, is a very powerful space which, in particular, contains $C(X)$, but this space is not at all useful for a learning task. This is because the Dirac distribution is an element of $\mathcal{S}'(X)$. Thus a learned function for a training set $(x_i, y_i)_{i=1}^{N}$ and a Euclidean distance loss function could, in theory, be $\sum_{i=1}^{N} y_i \delta_{x_i}$. This function has terrible generalization properties, meaning that it yields a large approximation error on a set of test points $(\hat{x}_i, \hat{y}_i)$ where $\hat{y}_i \neq 0$ and which are not in the training set.

3.2 Approximation of smooth functions with smooth rectifiers

In the previous section, only the approximation of continuous functions was analyzed. In a typical approximation task, we have much more structure and information about our function class. One particularly prominent and well-understood function class is that of smooth functions $C^n([0, 1]^d)$ or $W^{n,p}([0, 1]^d)$, where $n, d \in \mathbb{N}$ and $1 < p \leq \infty$. Here $C^n([0, 1]^d)$ denotes the space of $n$-times continuously differentiable functions and $W^{n,p}([0, 1]^d)$ is the space of functions whose partial derivatives up to order $n$ lie in $L^p$. We set

$$\|f\|_{C^n([0,1]^d)} := \max_{|\alpha| \leq n} \|D^\alpha f\|_\infty \quad \text{for } f \in C^n([0, 1]^d), \text{ and}$$

$$\|f\|_{W^{n,p}([0,1]^d)} := \sum_{|\alpha| \leq n} \|D^\alpha f\|_p \quad \text{for } f \in W^{n,p}([0, 1]^d).$$

We want to understand how well these function classes can be approximated by neural networks. Before we do so, let us try to understand what we can expect. It is well known that for functions in $C^n([0, 1]^d)$ there exists an approximation method based on $N$ parameters that yields an approximation error of $O(N^{-n/d})$. Such an estimate is usually called a Jackson estimate and, as we will see in Subsection 3.4, this rate is optimal if the approximant depends continuously on the function to be approximated. For example, for $n$-times differentiable functions this rate is achieved by free-knot spline approximation.

Because of the preceding considerations, an approximation of order $O(N^{-n/d})$ is expected to also hold for neural network approximation. Indeed, in the works [5, 19, 20, 21, 32] precisely this rate is given for various activation functions. We state one of the results.

Theorem 3.3 ([20]). Let $1 \leq d, N, n$ be integers, $1 \leq p \leq \infty$, and let $\varrho : \mathbb{R} \to \mathbb{R}$ be infinitely many times differentiable in $\mathbb{R}$. Further assume that there exists $\hat{x} \in \mathbb{R}$ such that

$$D^k \varrho(\hat{x}) \neq 0 \quad \text{for all } k \in \mathbb{N}_0.$$

Then there exist $(A_j)_{j=1,\dots,N} \subset \mathbb{R}^{1 \times d}$ and continuous linear functionals $(a_j)_{j=1}^{\infty}$ such that for any $f \in W^{n,p}([0, 1]^d)$ it holds that

$$\Big\| f - \sum_{j=1}^{N} a_j(f) \varrho(A_j(\cdot) + \hat{x}) \Big\|_{L^p} \leq c N^{-n/d} \|f\|_{W^{n,p}([0,1]^d)},$$

for a constant $c = c(d, p, \varrho) > 0$.

Let us understand intuitively why this result holds for the example of sigmoidal rectifier functions. Let $\varrho$ be a sigmoidal function of order $k$. Then $\varrho(x)/x^k \to 1$ for $x \to \infty$ and $\varrho(x) \to 0$ for $x \to -\infty$. As a consequence, we observe that

$$\frac{\varrho(Cx)}{C^k} \to x_+^k \quad \text{for all } x \in \mathbb{R}$$

for $C \to \infty$, where $x_+ = \mathrm{ReLU}(x) = \max\{x, 0\}$. In other words, we can reproduce $x \mapsto x_+^k$ in the limit by any sigmoidal function of order $k$, and by the definition of networks we can also build $(x - j)_+^k$ for all $j \in \mathbb{Z}$. For B-splines of order $k$ one has the divided difference formula:

$$M(x) := \frac{1}{k!} \sum_{j=0}^{k+1} (-1)^j \binom{k+1}{j} (x - j)_+^k.$$

This formula implies that we can approximate B-splines by realizations of a fixed-size neural network. Thus it is not surprising that the approximation rate of Theorem 3.3 equals that of the Jackson estimate, which is also achieved by spline approximation.

The construction of this subsection already reveals the overall philosophy of approximation theory with neural networks. As a first step, another function system is reproduced with neural networks. Thereby it is possible to transfer some established approximation rates to neural networks. This observation should be paired with some form of optimality showing that nothing was lost in the transfer argument. The notion of optimality is quite subtle, and it gets even more complicated when we take into consideration the approximation result of the next subsection.
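Both ingredients of this argument can be checked numerically. The sketch below uses $\varrho(x) = \ln(1 + e^x)^k$ as one example of a sigmoidal function of order $k$ (an assumption of this sketch, not a choice made in the text) and evaluates the truncated-power formula for $M$ with the sum running over $j = 0, \dots, k+1$. It then prints the rescaled values $\varrho(Cx)/C^k$ for growing $C$, and the sum of integer shifts of $M$, which should equal 1 for a B-spline.

```python
import math

def truncated_power(x, k):
    """x_+^k = max(x, 0)^k."""
    return max(x, 0.0) ** k

def sigmoidal(x, k):
    """softplus(x)^k, an assumed example of a sigmoidal function of order k."""
    return (max(x, 0.0) + math.log1p(math.exp(-abs(x)))) ** k

def bspline(x, k):
    """M(x) = 1/k! * sum_{j=0}^{k+1} (-1)^j binom(k+1, j) (x - j)_+^k."""
    total = sum((-1) ** j * math.comb(k + 1, j) * truncated_power(x - j, k)
                for j in range(k + 2))
    return total / math.factorial(k)

k = 2
for C in (1.0, 10.0, 100.0, 1000.0):
    # rho(C x) / C^k at x = 0.5 and x = -0.5; the limits are 0.25 and 0.
    print(C, sigmoidal(C * 0.5, k) / C ** k, sigmoidal(C * -0.5, k) / C ** k)

# M vanishes outside [0, k + 1], and its integer shifts sum to one.
print(bspline(-1.0, 3), bspline(6.0, 3))
print(sum(bspline(0.3 - j, 3) for j in range(-4, 5)))
```

Replacing each truncated power in `bspline` by the rescaled sigmoidal function gives a network of fixed size that approximates $M$, which is the transfer argument described above.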

3.3 Constant size approximation with special rectifiers

A lower bound on the degree to which a neural network with a single hidden layer can approximate a real-valued function is certainly given by the bound that a linear combination of ridge functions can attain. Ridge functions are multivariate functions of the form

$$f(x) = g(a \cdot x), \tag{3.2}$$

where $g : \mathbb{R} \to \mathbb{R}$, $a \in \mathbb{R}^d \setminus \{0\}$, and $x \in \mathbb{R}^d$. Such functions appear in the so-called single hidden layer MLP (multilayer perceptron) models, which can be mathematically expressed as a realization of the neural network

$$\Phi^{\mathrm{MLP}} = ((A, b), (c, 0)), \tag{3.3}$$

with $A \in \mathbb{R}^{N,d}$, $b \in \mathbb{R}^N$, and $c \in \mathbb{R}^{1,N}$ for $N \in \mathbb{N}$. The realization of $\Phi^{\mathrm{MLP}}$ with a continuous activation function $\varrho$ is given by

$$\sum_{i=1}^{N} c_i \varrho(a_i \cdot x - b_i), \tag{3.4}$$

where the bias enters as a threshold, i.e., with a negative sign; replacing $b$ by $-b$ recovers the convention of Definition 2.1.

Thus, each term $\varrho(a_i \cdot x - b_i)$ is a ridge function. As such, a lower bound on the extent to which this MLP model can approximate a function is given by the approximation order of the set

$$\mathcal{M}_N = \Big\{ \sum_{i=1}^{N} g_i(a_i \cdot x) : a_i \in \mathbb{R}^d, \ g_i \in C(\mathbb{R}), \ i = 1, \dots, N \Big\}. \tag{3.5}$$

Maiorov established lower and upper bounds on the degree of approximation of functions from some Sobolev-type space of $L^2$-functions by functions in $\mathcal{M}_N$ [17]. Roughly speaking, it is shown that Sobolev-type functions whose derivatives of all orders up to $r$ lie in $L^2$ can be approximated by a function from $\mathcal{M}_N$ within an $L^2$-error of

$$O(N^{-r/(d-1)}). \tag{3.6}$$

It is also shown that for each $N \in \mathbb{N}$ there is some function in the set which cannot be approximated from $\mathcal{M}_N$ with an error less than this rate. The main difference between $\mathcal{M}_N$ and an MLP model is that we allow for different continuous activation functions $g_i$, $i = 1, \dots, N$, in the case of $\mathcal{M}_N$, whereas we only have one activation function in the case of an MLP model. However, the next theorem states that every $f \in \mathcal{M}_N$ can be approximated arbitrarily well by an MLP model with a sigmoidal activation function and only $3N$ neurons.

Theorem 3.4 ([17]). There exists a function $\varrho$ which is real analytic, strictly increasing, and sigmoidal, satisfying the following: Given $f \in \mathcal{M}_N$ and $\varepsilon > 0$, there exist $c \in \mathbb{R}^{1,3N}$, $b \in \mathbb{R}^{3N}$, and $A = [a_1, \dots, a_{3N}]^T \in \mathbb{R}^{3N,d}$ with $a_i \in S^{d-1}$, $i = 1, \dots, 3N$, such that the realization of $\Phi := ((A, b), (c, 0))$ fulfills

$$|\mathrm{R}_\varrho(\Phi)(x) - f(x)| \leq \varepsilon \tag{3.7}$$

for all $x \in B^d = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$.

Proof. The first step is to prove that the same statement holds true for $C^\infty$-functions and then to generalize this to real analytic functions. We will briefly explain how to prove the statement for $C^\infty$-functions.

It is well known that there exists a countable dense subset of polynomials, say $\{u_k\}_{k=1}^{\infty}$, of $C^\infty([-1, 1])$. The main idea then is to define the activation function $\varrho$ on the disjoint intervals $[4k, 4k + 2]$, $k \in \mathbb{N}$, to be approximately equal to $u_k$ and then to fill in the intervals in between such that $\varrho$ fulfills the requirements. To be a little bit more precise, we set

$$\varrho(t + 4k + 1) := b_k + c_k t + d_k u_k(t) \tag{3.8}$$

for $t \in [-1, 1]$, for some constants $b_k, c_k, d_k$ to be determined such that the values of $\varrho$ at the boundaries of the intervals as well as its derivative can be controlled. One can then find reals $a_i^k$, $i = 1, \dots, 3$, for which a (shifted) linear combination of $\varrho$ is equal to $u_k$, i.e., such that

$$u_k(t) = a_1^k \varrho(t - 7) + a_2^k \varrho(t - 3) + a_3^k \varrho(t + 4k + 1) \tag{3.9}$$

for all $t \in [-1, 1]$. With that construction of $\varrho$ one can approximate $f \in \mathcal{M}_N$, with $f(x) = \sum_{j=1}^{N} g_j(a_j \cdot x)$, by approximating each $g_j$, $j = 1, \dots, N$, separately.

The number of neurons in the construction of the above network depends on the number of ridge functions. Using the Kolmogorov superposition theorem and allowing for two hidden layers, however, one can prove that every $f \in C([0, 1]^d)$ can be approximated by a neural network with two hidden layers and a constant number of neurons. Note that the number of neurons will depend on the dimension $d$ but not on the function $f$.

Theorem 3.5 ([17]). There exists an activation function $\varrho$ which is real analytic, strictly increasing, and sigmoidal, and has the following property. For any $f \in C([0, 1]^d)$ and $\varepsilon > 0$, there exist $A_1 \in \mathbb{R}^{18d^2+9d, d}$, $A_2 = \mathrm{diag}(a_1, \dots, a_{6d+3}) \in \mathbb{R}^{6d+3, 18d^2+9d}$ with $a_i \in \mathbb{R}^{1,3d}$, $i = 1, \dots, 6d + 3$, $A_3 \in \mathbb{R}^{1,6d+3}$, $b_1 \in \mathbb{R}^{18d^2+9d}$, and $b_2 \in \mathbb{R}^{6d+3}$ for which the realization of $\Phi = ((A_1, b_1), (A_2, b_2), (A_3, 0))$ fulfills

$$|f(x) - \mathrm{R}_\varrho(\Phi)(x)| < \varepsilon \tag{3.10}$$

for all $x \in [0, 1]^d$.

Proof. The main ingredient of the proof is the Kolmogorov superposition theorem, which basically states that we can represent every continuous function $f : [0, 1]^d \to \mathbb{R}$ as

$$f(x_1, \dots, x_d) = \sum_{i=1}^{2d+1} g\Big(\sum_{j=1}^{d} \lambda_j \phi_i(x_j)\Big), \tag{3.11}$$

where $\lambda_j > 0$, $j = 1, \dots, d$, with $\sum_{j=1}^{d} \lambda_j \leq 1$, $\phi_i : [0, 1] \to [0, 1]$, $i = 1, \dots, 2d + 1$, strictly increasing, and $g \in C([0, 1])$ depending on $f$. With the same idea as in the proof of Theorem 3.4, one then approximates the one-dimensional functions $g$ and $\phi_i$, $i = 1, \dots, 2d + 1$, with a single hidden layer network to prove the claim.

Remark 3.6. High-dimensional functions that appear as solutions of real-world problems often have a lower-dimensional structure. A well-known example is to assume that we can represent a function $f \in C([0, 1]^d)$ as $f = \tilde{f}(A \cdot)$ with $A = [a_1, \dots, a_m]^T \in \mathbb{R}^{m,d}$, $\tilde{f} \in C([0, 1]^m)$, and $m \ll d$. In that case we can represent $f$ as

$$f(x) = \sum_{i=1}^{2m+1} g\Big(\sum_{j=1}^{m} \lambda_j \phi_i(a_j \cdot x)\Big), \tag{3.12}$$

and thus find a neural network with two hidden layers and only $O(m^2)$ neurons instead of $O(d^2)$.

3.4 Lower bounds for approximation: In this section, we present multiple approaches to establish lower bounds on the complexity of neural networks whose realizations should achieve some approximation fidelity. In other words, for a given ε > 0 we want to establish a minimum size of networks in terms of the number of weights or neurons such that all functions from a function class C can be approximated by realizations of networks with the prescribed size and with an error less than ε. At first, we recall from Subsection 3.3 that in particular circumstances every function can be approximated up to arbitrary precision with O(1) weights and neurons, which seems to make the quest for lower bounds superfluous. Some things need to be observed about this theorem, though. First of all, the involved activation function is very complicated and does not allow an efficient implementation. Additionally, while the associated weights are only finitely many, they can be potentially gigantic and thus not at all storable on a computer. Therefore, we conclude that there are some reasonable restrictions on network architectures that disallow the construction of Theorem 3.5. We shall focus on three of such restrictions:

• continuous dependence of the weights on the data,
• special rectifiers,
• controllable/encodable weights.

In particular, we will see that each of these assumptions necessitates networks whose sizes grow with the required approximation quality over certain function classes.

3.4.1 Continuous nonlinear approximation

We shall introduce the notion of continuous nonlinear $N$-widths of function classes and then demonstrate how this concept can be applied to neural networks and yield lower bounds on their sizes. A well established measure of the best possible approximation with elements from a linear space, such as polynomials, is the Kolmogorov $N$-width. For a Banach space $X$ and a compact subset $C \subset X$ we define the Kolmogorov $N$-width of $C$ in $X$ by

$$d_N(C)_X := \inf\Big\{ \sup_{f \in C} \inf_{g \in L_N} \|f - g\|_X : L_N \text{ is an } N\text{-dimensional linear subspace of } X \Big\}. \tag{3.13}$$

By definition, $d_N(C)_X$ describes the best possible uniform approximation of elements of a set $C$ with elements from a linear $N$-dimensional space. However, quite frequently, and in particular for realizations of neural networks, we are interested not only in approximation from linear spaces, but also from nonlinear sets. This could be a manifold, an algebraic variety, or even a completely unstructured set. It can be observed easily that in general, for a given activation function, the set of realizations of networks with $M$ weights is not a linear space. However, this set can be perceived as the image of $\mathbb{R}^M$ under some potentially complicated map. To deal with situations like this and to describe lower bounds, DeVore, Howard, and Micchelli developed the notion of continuous nonlinear $N$-width [8].

At first, one is tempted to think that a reasonable generalization of (3.13) would be to replace linear spaces by $N$-dimensional manifolds. But since there are space-filling curves, we can come up with many examples where such an $N$-width would always be 0. It turns out that a very reasonable generalization is to allow only a continuous selection procedure in the approximation. Let $\mathcal{M}_N$ be the set of maps from $\mathbb{R}^N$ to $C$ and let $\mathcal{A}$ be the set of continuous maps from $X$ to $\mathbb{R}^N$. Then one defines

$$\delta_N(C)_X := \inf_{M \in \mathcal{M}_N} \inf_{a \in \mathcal{A}} \sup_{f \in C} \|f - M(a(f))\|_X.$$

What appears to be abstract nonsense at first can actually be bounded for many examples of $X$ and $C$ and gives meaningful lower bounds. For $X = L^2([0,1]^d)$ and $C = \{f \in W^{n,p}([0,1]^d) : \|f\|_{W^{n,p}([0,1]^d)} \le 1\}$ one obtains

$$\delta_N(C)_X = O(N^{-n/d}) \quad \text{for } N \to \infty.$$

In the example of a neural network, this means that, if one requires the weights of a neural network to depend continuously on the function to be approximated, then asymptotically $\varepsilon^{-d/n}$ weights will be required to achieve an approximation accuracy of $\varepsilon > 0$.
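The passage from the width estimate to the weight count can be made explicit. The following short computation is only a sketch: it assumes a matching lower bound $\delta_N(C)_X \ge c\,N^{-n/d}$ with some constant $c > 0$ (both the constant $c$ and this form of the lower bound are assumptions made for illustration, the text above only quotes the rate), and it identifies $N$ with the number of continuously selected weights.

```latex
\[
  \varepsilon \;\ge\; \delta_N(C)_X \;\ge\; c\,N^{-n/d}
  \quad\Longrightarrow\quad
  N^{n/d} \;\ge\; \frac{c}{\varepsilon}
  \quad\Longrightarrow\quad
  N \;\ge\; c^{d/n}\,\varepsilon^{-d/n}.
\]
```

Under these assumptions, halving the target error $\varepsilon$ multiplies the required number of weights by $2^{d/n}$, which makes the interplay of smoothness $n$ and dimension $d$ visible.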

3.4.2 The VC dimension of piecewise linear networks

The Vapnik–Chervonenkis dimension, VC dimension for short, is a tool for understanding the classification capabilities of a function class. We will see later that one can estimate the VC dimension of the set of functions that are realizations of neural networks with a certain number of weights. We start with a formal definition of the VC dimension. Let $X$ be a set, $S \subset X$, and let $H \subseteq \{h : X \to \{0,1\}\}$ be a set of binary valued maps on $X$. We define

$$H|_S := \{h|_S : h \in H\},$$

which, in words, is the restriction of the function class $H$ to $S$. The VC dimension of $H$ is now defined as

$$\mathrm{VCdim}(H) := \sup\Big\{ m \in \mathbb{N} : \sup_{|S| \le m} |H|_S| = 2^m \Big\}.$$

The VC dimension of $H$ is the largest integer $m$ such that there exists a set $S \subset X$ containing only $m$ points such that $H|_S$ has the maximum possible cardinality $2^m$. In fact, if one associates to each of $m$ points either a 0 or a 1, then there are $2^m$ unique combinations. In other words, there are $2^m$ possibilities to assign each point to one of two groups. If $|H|_S| = 2^m$, then all of these group allocations are realized within $H|_S$; in this case one says that $H$ shatters $S$.

Example 3.7. Let $X = \mathbb{R}^2$.

1. Let $H = \{0, \chi_\Omega\}$ for some fixed non-empty set $\Omega \subset \mathbb{R}^2$. Then $\mathrm{VCdim}(H) = 1$.

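2. Let $h = \chi_{\mathbb{R}^+}$ and

$$H = \Big\{ h_{\theta,t} := h\Big( \Big\langle \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}, \bullet \Big\rangle - t \Big) : \theta \in [-\pi, \pi],\ t \in \mathbb{R} \Big\}.$$

Then $H$ is the set of all linear classifiers. We can clearly see that if $S$ contains 3 points in general position, then $|H|_S| = 8$, see Figure 8. On the other hand, 4 points cannot be shattered by $H$, as Figure 9 reveals.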

Figure 8: Three points shattered by a set of linear classifiers.
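The shattering claims of Example 3.7.2 can be checked numerically. The following Python sketch enumerates the labelings produced by the classifiers $h_{\theta,t}$ on a finite grid of angles $\theta$; the function name `realized_labelings`, the grid size, and the two concrete point sets are our own choices for illustration and do not appear in the text. A grid search can only confirm labelings that it finds, so the count for four points is a numerical illustration rather than a proof.

```python
import math

def realized_labelings(points, n_theta=720):
    """Labelings of `points` produced by h_{theta,t}(x) = 1 if <(cos theta, sin theta), x> - t > 0, else 0."""
    found = set()
    for k in range(n_theta):
        theta = -math.pi + 2.0 * math.pi * k / n_theta
        proj = [math.cos(theta) * x + math.sin(theta) * y for x, y in points]
        cuts = sorted(set(proj))
        # For a fixed direction only the position of t relative to the projections matters,
        # so it suffices to test one threshold per gap and one beyond each end.
        thresholds = [cuts[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(cuts, cuts[1:])]
        thresholds.append(cuts[-1] + 1.0)
        for t in thresholds:
            found.add(tuple(int(p - t > 0) for p in proj))
    return found

# Hypothetical test sets: three points in general position and the corners of a square.
three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
labels_three = realized_labelings(three)
labels_four = realized_labelings(four)
print(len(labels_three), "of", 2 ** len(three), "labelings found for three points")
print(len(labels_four), "of", 2 ** len(four), "labelings found for four points")
```

For the square, the two labelings that separate the diagonals are never produced, which is the configuration shown in Figure 9.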

As it turns out the notion of VC dimension is very powerful, especially for the analysis of networks, since, at least for special activation functions, we can estimate the VC dimension of neural networks. We have the following theorem.

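Theorem 3.8 ([1]). Let $\varrho$ be piecewise polynomial with $p$ pieces of degree at most $\ell$, $h = \chi_{\mathbb{R}^+}$, and for $N, M, d \in \mathbb{N}$ we define

$$H_{d,M,N,\infty} := \{h \circ R_\varrho(\Phi) : \Phi \in \mathcal{NN}_{d,M,N,\infty}\}.$$

Then $\mathrm{VCdim}(H_{d,M,N,\infty}) = O(M(M + N\ell \log(p)))$.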

Figure 9: There is no way to shatter 4 points by a linear classifier. In the example above there is no function of the form given in Example 3.7.2 that takes the value 1 on the points in the red region and 0 on the points in the blue region.

In essence, the VC dimension of the realizations of neural networks scales as $M(\Phi)^2$ if we assume that there are fewer neurons than weights, which is almost always a reasonable assumption. If one, in addition to restricting the number of weights, also places a requirement on the number of layers, then the bound above can even be improved.

Theorem 3.9 ([1]). Let $\varrho$ be piecewise polynomial with $p$ pieces of degree at most $\ell$, $h = \chi_{\mathbb{R}^+}$, and for $N, M, d \in \mathbb{N}$ we define

$$H_{d,M,N,L} := \{h \circ R_\varrho(\Phi) : \Phi \in \mathcal{NN}_{d,M,N,L}\}.$$

Then

$$\mathrm{VCdim}(H_{d,M,N,L}) \le 2ML \log_2(4MLpN/\ln 2) + 2ML^2 \log_2(\ell + 1) + 2L.$$

For fixed $p$ and $\ell$ and $N \le M$ this leads to

$$\mathrm{VCdim}(H_{d,M,N,L}) = O(ML \log_2(M) + ML^2).$$

At this point we have an upper bound on the VC dimension of neural networks, but we have not related this rate to the size of a neural network. The following intriguing argument was made in [32] and demonstrates how to obtain a lower bound on the number of weights of a network necessary to achieve a certain approximation quality of smooth functions. The statement of the following theorem is similar to one in [32] but highly simplified.

Theorem 3.10 ([32]). Let $n, d, L \in \mathbb{N}$ and $H = \{h \in C^n([0,1]^d) : \|h\|_{C^n} \le 1\}$. Further let $\varrho$ be the ReLU. If for every $\varepsilon > 0$ and every $h \in H$ there exists a neural network $\Phi \in \mathcal{NN}_{d,M(\varepsilon),M(\varepsilon),L}$ such that

$$\|h - R_\varrho(\Phi)\|_\infty < \varepsilon,$$

then $M(\varepsilon) = \Omega(\varepsilon^{-d/(n(1+\delta))})$ for all $\delta > 0$.

Proof. We present an argument that, while not entirely rigorous, is reasonably intuitive so that the missing technical steps can be completed without considerable effort. Towards a contradiction, assume that for every $h \in H$ there exists a neural network $\Phi_h$ with $M$ neurons and weights and at most $L$ layers such that

$$\|h - R_\varrho(\Phi_h)\|_\infty \le \frac{1}{8} M^{-(1+\delta)n/d} \tag{3.14}$$

for some $\delta > 0$. We can find $m = M^{1+\delta}$ points $x_1, x_2, \dots, x_m \in [0,1]^d$ such that $|x_i - x_j| \ge M^{-(1+\delta)/d}$. Around each of these points we can now construct functions $\phi_i \in H$, $i = 1, \dots, m$, such that all $\phi_i$ have disjoint supports with diameter $M^{-(1+\delta)/d}$. Additionally, we can require $\phi_i(x_i) = M^{-(1+\delta)n/d}/2$ for all $i = 1, \dots, m$ and still be able to find such functions in $H$.

Due to the construction above, we have that for every map $g$ from $\{x_1, x_2, \dots, x_m\}$ to $\{0,1\}$ there exists a vector $d = (d_i)_{i=1}^m \in \{0,1\}^m$ such that

$$g(x) = 2M^{(1+\delta)n/d} \sum_{i=1}^{m} d_i \phi_i(x) \quad \text{for all } x \in \{x_1, x_2, \dots, x_m\}. \tag{3.15}$$

The right hand side of equation (3.15) is depicted in a special case in Figure 10.

By construction, we have that $\sum_{i=1}^m d_i \phi_i \in H$ and thus we conclude with equation (3.14) that there exists a neural network $\Phi \in \mathcal{NN}_{d,M,M,L}$ such that

$$\Big\| \sum_{i=1}^{m} d_i \phi_i - R_\varrho(\Phi) \Big\|_\infty \le \frac{1}{8} M^{-(1+\delta)n/d}. \tag{3.16}$$

Equations (3.15) and (3.16) imply for $i = 1, \dots, m$ that

$$2M^{(1+\delta)n/d} R_\varrho(\Phi)(x_i) - \frac{1}{2} > 0 \quad \text{if } g(x_i) = 1, \quad \text{and} \quad 2M^{(1+\delta)n/d} R_\varrho(\Phi)(x_i) - \frac{1}{2} < 0 \quad \text{if } g(x_i) = 0.$$

Figure 10: A special case of the function $\sum_{i=1}^m d_i \phi_i$ constructed in equation (3.15). The figure is taken from [32].

Thus, we have that

$$\chi_{\mathbb{R}^+}\Big( 2M^{(1+\delta)n/d} R_\varrho(\Phi)(\cdot) - \frac{1}{2} \Big) = g \quad \text{on } \{x_1, \dots, x_m\}.$$

Since $g$ is an arbitrary binary map on $\{x_1, \dots, x_m\}$, we conclude with the notation of Theorem 3.9 that $\mathrm{VCdim}(H_{d,M,M,L}) \ge m$. Theorem 3.9 now implies that $m = M^{1+\delta} \in O(M \log(M))$, which is a contradiction. This concludes the proof.
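For readers who want the last step of the argument spelled out, the contradiction can be sketched as follows. We assume that $L$, $p$ and $\ell$ are fixed and that $N = M$, so that the bound of Theorem 3.9 reduces to $C' M \log_2(M)$ for a constant $C' > 0$; the constant $C'$ is introduced here only for illustration.

```latex
\[
  M^{1+\delta} \;=\; m \;\le\; \mathrm{VCdim}(H_{d,M,M,L}) \;\le\; C' M \log_2(M)
  \quad\Longrightarrow\quad
  M^{\delta} \;\le\; C' \log_2(M),
\]
\[
  \text{but}\qquad \frac{M^{\delta}}{\log_2(M)} \;\longrightarrow\; \infty
  \quad \text{as } M \to \infty \text{ for every fixed } \delta > 0 .
\]
```

Hence the assumed error bound (3.14) cannot hold for all sufficiently large $M$, which is the contradiction used in the proof.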

3.4.3 Encodability of networks

Another approach to establish lower bounds on the sizes of neural networks is given by studying the description complexity of function classes. This connection was established in [2, 3] and is based on the following observation: If a neural network approximates a function $f$ well, then we can conclude that, up to a small error, $f$ can be encoded by storing all non-zero weights of the network, for example as a sparse matrix. In this way, we can encode $f$ as a bit-string and we can decode this string by reconstructing the weights and realizing the network. Information theory supplies us with multiple results that enforce a certain complexity of encoder and decoder pairs achieving a given reconstruction fidelity over function classes. We start by reviewing the required notation.

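Definition 3.11. Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$ be measurable, and let $C \subset L^2(\Omega)$ be an arbitrary function class. For each $\ell \in \mathbb{N}$, we denote by

$$\mathfrak{E}^\ell := \{E : C \to \{0,1\}^\ell\}$$

the set of binary encoders mapping elements of $C$ to bit-strings of length $\ell$, and we let

$$\mathfrak{D}^\ell := \{D : \{0,1\}^\ell \to L^2(\Omega)\}$$

be the set of binary decoders mapping bit-strings of length $\ell$ to elements of $L^2(\Omega)$. An encoder-decoder pair $(E^\ell, D^\ell) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell$ is said to achieve distortion $\varepsilon > 0$ over the function class $C$ if

$$\sup_{f \in C} \|D^\ell(E^\ell(f)) - f\|_{L^2(\Omega)} \le \varepsilon.$$

Finally, for $\varepsilon > 0$ the minimax code length $L(\varepsilon, C)$ is

$$L(\varepsilon, C) := \min\Big\{ \ell \in \mathbb{N} : \exists\, (E^\ell, D^\ell) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell : \sup_{f \in C} \|D^\ell(E^\ell(f)) - f\|_{L^2(\Omega)} \le \varepsilon \Big\},$$

with the interpretation $L(\varepsilon, C) = \infty$ if $\sup_{f \in C} \|D^\ell(E^\ell(f)) - f\|_{L^2(\Omega)} > \varepsilon$ for all $(E^\ell, D^\ell) \in \mathfrak{E}^\ell \times \mathfrak{D}^\ell$ and arbitrary $\ell \in \mathbb{N}$.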
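To make the notions of encoder, decoder and distortion concrete, the following Python sketch treats a deliberately simple toy class that is not taken from the text: the constant functions $f \equiv c$ with $c \in [0,1]$ on $\Omega = [0,1]$, for which the $L^2(\Omega)$ distance of two functions is just the distance of their constants. The names `encode`, `decode` and `distortion` are hypothetical, and the distortion is only evaluated on a finite grid of constants.

```python
def encode(c, ell):
    """Map the constant function f = c, with c in [0, 1], to a bit-string of length ell."""
    cells = 2 ** ell
    k = min(int(c * cells), cells - 1)  # index of the dyadic cell containing c
    return tuple((k >> (ell - 1 - i)) & 1 for i in range(ell))

def decode(bits):
    """Map a bit-string back to the constant given by the midpoint of its dyadic cell."""
    k = 0
    for b in bits:
        k = 2 * k + b
    return (k + 0.5) / 2 ** len(bits)

def distortion(ell, samples=1001):
    """Largest error |c - decode(encode(c))| over a grid of constants c in [0, 1]."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(c - decode(encode(c, ell))) for c in grid)

for ell in (1, 2, 4, 8):
    # Observed distortion next to the theoretical cell half-width 2^{-(ell + 1)}.
    print(ell, distortion(ell), 2.0 ** -(ell + 1))
```

In this toy setting each additional bit halves the achievable distortion, so the code length needed for distortion $\varepsilon$ grows only like $\log_2(1/\varepsilon)$; the function classes considered below are much richer and require code lengths that grow polynomially in $1/\varepsilon$.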

Sometimes one is only interested in the asymptotic behavior of $L(\varepsilon, C)$, which can be quantified by the optimal exponent.

Definition 3.12. Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$ and $C \subset L^2(\Omega)$. Then, the optimal exponent $\gamma^*(C)$ is defined by

$$\gamma^*(C) := \inf\big\{ \gamma \in \mathbb{R} : L(\varepsilon, C) = O(\varepsilon^{-\gamma}) \big\}.$$

The optimal exponent $\gamma^*(C)$ describes how fast $L(\varepsilon, C)$ tends to infinity as $\varepsilon$ decreases. For function classes $C_1$ and $C_2$, the relation $\gamma^*(C_1) < \gamma^*(C_2)$ indicates that asymptotically, i.e., for $\varepsilon \to 0$, the length of the encoding bit string for $C_2$ is larger than that for $C_1$. In other words, a smaller exponent indicates smaller description complexity.

Example 3.13. For many function classes the optimal exponent is well-known. Let $d, n \in \mathbb{N}$, $1 \le p, q \le \infty$, then

• $\gamma^*\big(\{f \in C^n([0,1]^d) : \|f\|_{C^n} \le 1\}\big) = d/n$,

• $\gamma^*\big(\{f \in B^n_{p,q}([0,1]^d) : \|f\|_{B^n_{p,q}} \le 1\}\big) = d/n$,

• Let

$$E^n([0,1]^d) := \{f_1 + \chi_B f_2 : f_1, f_2 \in C^n([0,1]^d),\ B \subset (0,1)^d,\ \partial B \in C^n \text{ and } \|g\|_{C^n} \le 1 \text{ for } g = f_1, f_2, \partial B\}.$$

Then, we have $\gamma^*(E^n([0,1]^d)) = 2(d-1)/n$.

We already announced at the beginning of this section that we aim to transfer the lower bounds on the description complexity of signal classes to lower bounds for the approximation capabilities of neural networks. We have discussed earlier that lower bounds are impossible in a completely general setting. Thus, also in the information theoretical bound, there needs to be an additional assumption. In this case, we restrict the complexity of the involved weights to only grow polynomially in the inverse approximation fidelity.

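Theorem 3.14 ([2]). Let $\Omega \subset \mathbb{R}^d$, $d \in \mathbb{N}$, $\varrho : \mathbb{R} \to \mathbb{R}$, $c > 0$, and $C \subset L^2(\Omega)$. Further, let

$$\mathrm{Learn} : \Big(0, \frac{1}{2}\Big) \times C \to \mathcal{NN}_{\infty,\infty,d,\varrho}$$

be a map such that, for each pair $(\varepsilon, f) \in (0, 1/2) \times C$, every weight of the neural network $\mathrm{Learn}(\varepsilon, f)$ can be encoded with no more than $-c \log_2(\varepsilon)$ bits while guaranteeing that

$$\sup_{f \in C} \|f - R_\varrho(\mathrm{Learn}(\varepsilon, f))\|_{L^2(\Omega)} \le \varepsilon. \tag{3.17}$$

Then,

$$\sup_{\varepsilon \in (0, \frac{1}{2})} \varepsilon^{\gamma} \cdot \sup_{f \in C} M(\mathrm{Learn}(\varepsilon, f)) = \infty \quad \text{for all } \gamma < \gamma^*(C). \tag{3.18}$$

The theorem reveals that the number of weights asymptotically needs to follow the scaling law provided by the optimal exponent if uniform approximation of the function class is required.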
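The mechanism behind Theorem 3.14 can be sketched by a rough bit count. The following heuristic is not the proof from [2]: it ignores the cost of storing the positions of the non-zero weights, and the symbol $\ell(\varepsilon)$ for the resulting code length is introduced here only for illustration.

```latex
\[
  \ell(\varepsilon) \;\lesssim\; \underbrace{M(\mathrm{Learn}(\varepsilon, f))}_{\text{number of weights}}
  \cdot \underbrace{c \log_2(1/\varepsilon)}_{\text{bits per weight}},
  \qquad
  \ell(\varepsilon) \;\ge\; L(\varepsilon, C),
\]
\[
  \text{so that}\qquad
  \sup_{f \in C} M(\mathrm{Learn}(\varepsilon, f)) \;\gtrsim\; \frac{L(\varepsilon, C)}{c \log_2(1/\varepsilon)} .
\]
```

Since $L(\varepsilon, C)$ is not $O(\varepsilon^{-\gamma})$ for any $\gamma < \gamma^*(C)$, and the logarithmic factor cannot compensate a polynomial gap, the number of weights cannot stay bounded by a multiple of $\varepsilon^{-\gamma}$ for such $\gamma$, which is the content of (3.18).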

4 Recent developments

4.1 Neural networks and dictionaries

One of the main ideas behind establishing approximation rates for neural networks is to demonstrate how other function systems can be emulated by neural networks. We saw this philosophy earlier, when we demonstrated how smooth functions or sparse sums of ridge functions can be efficiently approximated.

In [29] a similar approach was followed by demonstrating that realizations of neural networks can reproduce wavelet-like functions and thereby also sums of wavelets. This observation allows us to transfer $M$-term approximation rates with wavelets to $M$-weight approximation rates with neural networks. For a normed space $X$ and a dictionary $(\phi_i)_{i \in I} \subset X$ we define the error of best $M$-term approximation of a function $f \in X$ as

$$\sigma_M(f) := \inf_{\substack{I_M \subset I,\ |I_M| = M, \\ (c_i)_{i \in I_M}}} \Big\| \sum_{i \in I_M} c_i \phi_i - f \Big\|_X.$$

For $C \subset X$ we say that $(\phi_i)_{i \in I}$ yields an $M$-term approximation rate of $M^{-r}$ for $r \in \mathbb{R}^+$ if

$$\sup_{f \in C} \sigma_M(f) = O(M^{-r}) \quad \text{for } M \to \infty.$$

To transfer the $M$-term approximation rates of wavelets to $M$-weight approximation rates with neural networks, we briefly describe how one can construct a wavelet with a fixed size neural network. For simplicity we assume $\varrho$ to be the ReLU, but similar constructions are possible with smooth approximations to the ReLU or sigmoidal rectifiers. Define

$$t(x) := \varrho(x) - \varrho(x - 1) - \varrho(x - 2) + \varrho(x - 3) \quad \text{for all } x \in \mathbb{R}. \tag{4.1}$$

Then $t$ is a bump function (see Figure 11 (left)). Moreover, we define for $d \in \mathbb{N}$

$$\phi(x) := \varrho\Big( \sum_{i=1}^{d} t(x_i) - (d - 1) \Big) \quad \text{for all } x = (x_1, \dots, x_d) \in \mathbb{R}^d. \tag{4.2}$$

With this construction $\phi$ is a $d$-dimensional bump function (see Figure 11 (right)) and also the realization of a 3-layered neural network with $8d + 1$ weights.

Figure 11: Left: A one-dimensional bump function constructed from four ReLUs as in equation 4.1. Right: A two-dimensional bump function that is the realization of a three layer ReLU neural network as in equation 4.2.
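The two bump functions of equations (4.1) and (4.2) are simple enough to evaluate directly. The following Python sketch does so in plain floating point arithmetic; it reads the sum in (4.2) as running over the coordinates $t(x_i)$ of the input, which is our interpretation, and the function names `relu`, `t` and `phi` are chosen here for illustration.

```python
def relu(x):
    """The ReLU activation function max{0, x}."""
    return max(0.0, x)

def t(x):
    """One-dimensional bump from equation (4.1): a trapezoid supported on [0, 3]."""
    return relu(x) - relu(x - 1.0) - relu(x - 2.0) + relu(x - 3.0)

def phi(x):
    """d-dimensional bump from equation (4.2); x is a sequence of d coordinates."""
    d = len(x)
    return relu(sum(t(xi) for xi in x) - (d - 1))

# Evaluate the two-dimensional bump at a few sample points.
for point in [(1.5, 1.5), (0.5, 1.5), (0.5, 0.5), (-1.0, 1.5)]:
    print(point, phi(point))
```

The subtraction of $d - 1$ before the outer ReLU is what localizes the function: the sum of the one-dimensional bumps exceeds $d - 1$ only if every coordinate lies in the interior of the support of $t$.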

In the most general setting, a wavelet is a function $\psi \in L^2(\mathbb{R}^d)$ that has at least one vanishing moment, i.e.,

$$\int_{\mathbb{R}^d} \psi(x)\,dx = 0.$$

A wavelet system in one dimension is constructed from a wavelet $\psi$ by

$$\big\{ \psi_{j,m} := 2^{j/2} \psi(2^j \cdot - m) : j \in \mathbb{Z},\ m \in \mathbb{Z} \big\}.$$

In higher dimensions one can construct wavelet systems using multiple generators $\psi_\ell$, $\ell = 1, \dots, r$, $r \in \mathbb{N}$. The systems then have the form

$$\big\{ \psi_{j,m,\ell} := 2^{dj/2} \psi_\ell(2^j \cdot - m) : j \in \mathbb{Z},\ m \in \mathbb{Z}^d,\ \ell = 1, \dots, r \big\}.$$

If we now define

$$\psi(x) := \phi(x) - \phi(x - 1) \quad \text{for all } x \in \mathbb{R}^d,$$

then $\psi$ has one vanishing moment. By the definition of neural networks and the observations from Subsection 2.2, we can now find for each $j \in \mathbb{Z}$ and $m \in \mathbb{Z}^d$ a neural network with fewer than $C$ nonzero weights, for a constant $C > 0$, whose realization equals $\psi_{j,m}$. Using parallelizations, we conclude that if wavelet systems yield a certain $M$-term approximation rate for a function class $C$, then ReLU neural networks achieve at least that approximation rate in terms of the number of weights. Mind though, that this argument holds only for functions that are well approximated by wavelets of the form illustrated in Figure 11.

The argument above only focuses on wavelet approximation. However, the same argument could be made for much more general systems. Indeed, instead of rescaling by $2^j$ and shifting by $m$, every conceivable affine linear transformation could be applied to the function $\psi$ so that it would still be the realization of a fixed size neural network. The previous consideration establishes a connection between the approximation rates of affine systems and those of neural networks, as was studied in [2]. We start with a formal definition of affine systems.

Definition 4.1 ([2]). Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$ bounded, and $f \in L^2(\mathbb{R}^d)$ compactly supported. Let $\delta > 0$, $(c_i^s)_{i=1}^r \subset \mathbb{R}$ for $s = 1, \dots, S$, and $(d_i)_{i=1}^r \subset \mathbb{R}^d$. Consider the compactly supported functions

$$g_s := \sum_{i=1}^{r} c_i^s f(\cdot - d_i), \quad s = 1, \dots, S.$$

We define the corresponding affine system as

$$D := \big\{ g_s^{j,b} := \det(A_j)^{1/2}\, g_s(A_j \cdot - \delta \cdot b) \cdot \chi_\Omega : s = 1, \dots, S,\ b \in \mathbb{Z}^d,\ j \in \mathbb{N},\ \text{and } g_s^{j,b} \ne 0 \big\},$$

where $\chi_\Omega$ denotes the indicator function of $\Omega$.

Unsurprisingly, we have that the $M$-term approximation of dictionaries transfers to the $M$-weight approximation with realizations of neural networks if the function $f$ of Definition 4.1 can be well approximated. This is described in the following theorem.

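Theorem 4.2 ([2]). Suppose that $\Omega \subset \mathbb{R}^d$ is bounded and $D = (\varphi_i)_{i \in \mathbb{N}} \subset L^2(\Omega)$ is an affine system according to Definition 4.1. Suppose further that for the activation function $\varrho : \mathbb{R} \to \mathbb{R}$ there exists a constant $C$ such that for all $D, \varepsilon > 0$ there is $\Phi_{D,\varepsilon} \in \mathcal{NN}_{L,C,d,\varrho}$ with

$$\|f - R_\varrho(\Phi_{D,\varepsilon})\|_{L^2([-D,D]^d)} \le \varepsilon, \tag{4.3}$$

where $f$ is as in Definition 4.1. Then, if $\varepsilon > 0$, $M \in \mathbb{N}$, and $g \in L^2(\Omega)$ are such that there exist $(d_i)_{i=1}^M$ with

$$\Big\| g - \sum_{i=1}^{M} d_i \varphi_i \Big\|_{L^2} \le \varepsilon,$$

there exists a neural network $\Phi$ with $O(M)$ nonzero weights such that

$$\|g - R_\varrho(\Phi)\|_{L^2} \le 2\varepsilon.$$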

To state one application of Theorem 4.2, we mention the affine system of $\alpha$-shearlets [24, 11], which yields the optimal $M$-term approximation rate of $O(M^{-\alpha/2})$ for the function class $C$ of $\alpha$-cartoon-like functions

$$E^\alpha([0,1]^2) := \{f_1 + \chi_B f_2 : f_1, f_2 \in C^\alpha((0,1)^2),\ B \subset (0,1)^2 \text{ compact},\ \partial B \in C^\alpha\},$$

where $\|g\|_{C^\alpha} \le 1$ for $g = f_1, f_2, \partial B$. From Theorem 4.2 one can conclude that realizations of neural networks yield an $M$-weight approximation rate of $O(M^{-\alpha/2})$. In fact, in the convention of Definition 3.12 we have that $\gamma^*(E^\alpha([0,1]^2)) = 2/\alpha$, so that Theorem 3.14 demonstrates that $O(M^{-\alpha/2})$ is the optimal approximation rate by realizations of neural networks with $M$ nonzero weights for the function class $E^\alpha([0,1]^2)$.

4.2 ReLU approximation

As we have mentioned before, the most commonly used activation function is the ReLU. Thus it makes sense to give particular attention to ReLU realizations of neural networks. Additionally, the ReLU is often excluded from classical approximation results such as Theorem 3.3, since it is not differentiable. We can start by asking ourselves whether ReLU realizations of neural networks achieve the same optimal approximation rate of $M^{-n/d}$ for $C^n$ functions as networks with sigmoidal activation functions. At the core of all the approximation results with ReLUs of, e.g., [30, 31, 32] lies the fact that ReLU networks can implement an approximate multiplication operator. We have the following result.

Proposition 4.3 ([32]). For any $\varepsilon > 0$ there exists a neural network $\Phi$ with $M(\Phi), N(\Phi), L(\Phi) \in O(\ln(1/\varepsilon))$ such that for $f : [0,1] \to [0,1]$, $f(x) = x^2$, we have

$$\|f - R_\varrho(\Phi)\|_\infty < \varepsilon,$$

where $\varrho(x) = \max\{0, x\}$.

Being able to realize the function $f$ of Proposition 4.3, we can use

$$xy = \frac{1}{2}\big( (x + y)^2 - x^2 - y^2 \big) = \frac{1}{2}\big( f(x + y) - f(x) - f(y) \big)$$

to construct a neural network $\Phi$ whose realization approximates a multiplication operator and such that $M(\Phi), N(\Phi), L(\Phi) \in O(\ln(1/\varepsilon))$.

From realizing a multiplication, it is straightforward to observe how to realize polynomials. Moreover, using the function on the right of Figure 7 and the multiplication operator allows the construction of a network that constitutes a partition of unity. In combination, one can now construct sums of local Taylor polynomials by neural networks. For $n, d \in \mathbb{N}$ let

$$F_{n,d} := \{f \in C^n([0,1]^d) : \|f\|_{C^n([0,1]^d)} \le 1\}.$$

The overall approximation of functions in $F_{n,d}$ by ReLU networks is quantified in Theorem 4.4.
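The idea behind Proposition 4.3 and the multiplication network can be tried out numerically. The following Python sketch uses the sawtooth construction that we recall from [32]: the $s$-fold composition $g_s$ of a hat function $g$ is subtracted with weights $4^{-s}$ from the identity, which yields the piecewise linear interpolant of $x^2$ at $2^m + 1$ uniform nodes. The function names, the choice $m = 6$, and the restriction of the product to $x, y \in [0, 1/2]$ (so that $x + y$ stays in $[0,1]$) are assumptions made for this illustration; the code evaluates the functions directly rather than assembling an actual network.

```python
def relu(x):
    """The ReLU activation function max{0, x}."""
    return max(0.0, x)

def hat(x):
    """Hat function on [0, 1] written with three ReLUs."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def approx_square(x, m):
    """Piecewise linear interpolant of x**2 on [0, 1] with 2**m + 1 uniform nodes."""
    value, saw = x, x
    for s in range(1, m + 1):
        saw = hat(saw)           # s-fold composition of the hat function
        value -= saw / 4.0 ** s
    return value

def approx_product(x, y, m):
    """Approximate x * y for x, y in [0, 1/2] through the polarization identity."""
    return 0.5 * (approx_square(x + y, m) - approx_square(x, m) - approx_square(y, m))

m = 6
grid = [i / 200.0 for i in range(201)]
square_error = max(abs(approx_square(x, m) - x * x) for x in grid)
half = [i / 100.0 for i in range(51)]
product_error = max(abs(approx_product(x, y, m) - x * y) for x in half for y in half)
# Observed errors next to the interpolation error bound 4^{-(m + 1)}.
print(square_error, product_error, 4.0 ** -(m + 1))
```

Each additional composition reduces the error bound by a factor of 4 while adding only a constant number of ReLUs, which is consistent with the $O(\ln(1/\varepsilon))$ size and depth stated in Proposition 4.3.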

Theorem 4.4 ([32]). Let $n, d \in \mathbb{N}$, $\varepsilon \in (0,1)$, and $\varrho$ be the ReLU. Then for all $f \in F_{n,d}$ there exists a neural network $\Phi$ with $N(\Phi), M(\Phi) \le c\,\varepsilon^{-d/n}(\ln(1/\varepsilon) + 1)$ and $L(\Phi) \le c(\ln(1/\varepsilon) + 1)$ for a constant $c = c(n, d)$ such that

$$\|f - R_\varrho(\Phi)\|_\infty < \varepsilon.$$

We observe that, up to the $\ln(1/\varepsilon)$ factor, the approximation rate guaranteed by Theorem 4.4 is optimal with respect to the notion of nonlinear $N$-width and VC dimension as in Theorem 3.10. What is somewhat unsatisfying is the dependence of the depth on the approximation quality.

Recent results in [25] replace $\|\cdot\|_\infty$ by $\|\cdot\|_2$ approximation and thereby achieve a constant depth approximation of $C^n([0,1]^d)$ and piecewise $C^n([0,1]^d)$ functions, i.e., $E^n([0,1]^d)$, where the depth only depends on $n$ and $d$.

Theorem 4.5 ([25]). Let $d \in \mathbb{N}_{\ge 2}$, $n \in \mathbb{N}$ and $\varrho$ be the ReLU. Then there exist constants $c = c(d, n) > 0$, $s = s(d, n) \in \mathbb{N}$, and $c' > 0$, such that for all $\varepsilon \in (0, 1/2)$ and all $f \in E^n([0,1]^d)$ there is a neural network $\Phi_\varepsilon^f$ with at most $c' \cdot \log_2(2 + n) \cdot (1 + n/d)$ layers, and at most $c \cdot \varepsilon^{-2(d-1)/n}$ non-zero weights, which are all elements of $\varepsilon^s \mathbb{Z} \cap [-\varepsilon^{-s}, \varepsilon^{-s}]$, such that

$$\|R_\varrho(\Phi_\varepsilon^f) - f\|_{L^2} \le \varepsilon.$$

Remark 4.6.
• We observe that the approximation rate coincides with the lower bound of Theorem 3.14. Note that the requirement that the weights are elements of $\varepsilon^s \mathbb{Z} \cap [-\varepsilon^{-s}, \varepsilon^{-s}]$ allows us to encode each weight with no more than $-c'' \log_2(\varepsilon)$ bits for a constant $c'' = c''(s) = c''(d, n) > 0$.
• The depth of the produced networks is independent of $\varepsilon$ and is only influenced by $d$ and $n$. While this is just an upper bound, it appears to be very reasonable that the depth increases with a higher approximation rate. One interpretation is that, since the networks have to realize functions more efficiently, more levels of abstraction are necessary.
• The proof of this result is based on the fact that if only $L^2$-approximation is required, then two constructions can be made much more efficiently than in an $L^\infty$-approximation setting. First, a multiplication operator can be constructed with a fixed depth. Second, a partition of unity can be implemented very efficiently with a fixed depth neural network. The details can be found in [25].

We want to end this section by trying to understand the depth of the involved networks. At first, it might seem hard to quantify what deep networks can do and shallow networks cannot. Indeed, for general rectifiers, this question is wide open. For ReLUs, however, there is a chance. One obvious observation about the realizations of ReLU neural networks is that they are piecewise affine-linear. Thus, we can count the number of linear regions.
One can show that if one wants to approximate non-linear functions by piecewise linear ones, then the number of pieces fundamentally limits the possible approximation fidelity.

Picture a one-dimensional function $f$ which is a sum of rescaled and shifted ReLUs, i.e., $f(x) = \sum_{i=1}^{m} c_i \varrho(a_i x - b_i)$. It is easy to see that $f$ has at most $m + 1$ linear regions. On the other hand, we have the following result by Montúfar, Pascanu, Cho, and Bengio [22]:

Theorem 4.7 ([22]). For every $d, L \in \mathbb{N}$ a deep network $\Phi$ with $L$ layers, $N$ neurons, and $d$-dimensional input can be constructed such that the number of linear regions is at least $O((n/d)^L)$.

Intuitively, the much higher number of linear regions that deep networks can have compared to shallow networks of comparable size is due to the fact that each layer introduces a folding of the previous space. We depict this in Figure 12.

Given a smooth function whose second derivative is non-zero on an open set, the best possible approximation by piecewise linear functions can be quantified in terms of the number of linear regions. This, in combination with an estimate on the number of regions, yields lower bounds on the number of weights if a certain approximation quality is required. Precisely, we have the following result.

Theorem 4.8 ([25, 28]). Let $\Omega \subset \mathbb{R}^d$ be nonempty, open, bounded, and connected. Furthermore, let $f \in C^3(\Omega)$ be nonlinear. Then there is a constant $C_f > 0$ satisfying

$$\|f - R_\varrho(\Phi)\|_{L^p} \ge C_f \cdot (N(\Phi) + 1)^{-2L(\Phi)} \quad \text{and} \quad \|f - R_\varrho(\Phi)\|_{L^p} \ge C_f \cdot (M(\Phi) + d)^{-2L(\Phi)}$$

for all $1 \le p < \infty$ and each ReLU neural network $\Phi$ with input dimension $d$ and output dimension 1.

Remark 4.9. Given a function from $C^n([0,1]^d)$, $n \ge 3$, we know that, in order to guarantee an approximation error of $\varepsilon$, neural networks need to have $M(\Phi) \in \Omega(\varepsilon^{-d/n})$ many weights. Theorem 4.8, however, also implies that any construction which yields an error of $\varepsilon$ with only $M(\Phi) \in O(\varepsilon^{-d/n})$ weights needs to have at least $n/(2d)$ layers. In other words, if a network is required to have the optimal number of weights, then it needs to have a certain depth.
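The depth requirement of Remark 4.9 follows from a short computation, which we sketch here under the assumptions that the error $\varepsilon$ is measured in the $L^p$ norm of Theorem 4.8 and that $M(\Phi) + d \le K \varepsilon^{-d/n}$ for a constant $K > 0$; the constant $K$ is introduced only for this illustration.

```latex
\[
  \varepsilon \;\ge\; \|f - R_\varrho(\Phi)\|_{L^p}
  \;\ge\; C_f \,(M(\Phi) + d)^{-2L(\Phi)}
  \;\ge\; C_f \,\big(K \varepsilon^{-d/n}\big)^{-2L(\Phi)}
  \;=\; C_f\, K^{-2L(\Phi)}\, \varepsilon^{\,2 L(\Phi)\, d/n}.
\]
\[
  \text{For } \varepsilon \to 0 \text{ with bounded depth this is only possible if}\quad
  \frac{2 L(\Phi)\, d}{n} \;\ge\; 1
  \quad\Longleftrightarrow\quad
  L(\Phi) \;\ge\; \frac{n}{2d}.
\]
```

If the exponent $2L(\Phi)d/n$ were smaller than 1, the right hand side would decay more slowly than $\varepsilon$ and the inequality would fail for small $\varepsilon$.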

Figure 12: Top: Repeated folding of a two-dimensional plane introduces many linear regions. Bottom: In a complex classification problem, a high number of linear regions is necessary to finely resolve a separating curve. The figure is taken from [22].
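The folding effect can be observed in one dimension with a few lines of Python. The sketch below composes a hat function with itself, which is a standard example of our choosing and not the specific construction of [22], and counts the linear pieces of the result on a dyadic grid. It then prints, for comparison, the bound $m + 1$ for a one-hidden-layer network with the same number $m = 3 \cdot \text{depth}$ of ReLUs. All function names are chosen for illustration.

```python
def relu(x):
    """The ReLU activation function max{0, x}."""
    return max(0.0, x)

def hat(x):
    """One 'fold' of [0, 1]: a hat function built from three ReLUs."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def deep_sawtooth(x, depth):
    """Composition of `depth` hat layers, using 3 * depth ReLU neurons in total."""
    for _ in range(depth):
        x = hat(x)
    return x

def count_linear_pieces(f, n_cells):
    """Count maximal intervals of constant slope of f on [0, 1], sampled on a uniform grid."""
    values = [f(i / n_cells) for i in range(n_cells + 1)]
    slopes = [(b - a) * n_cells for a, b in zip(values, values[1:])]
    return 1 + sum(1 for s0, s1 in zip(slopes, slopes[1:]) if s0 != s1)

for depth in (1, 2, 3, 4, 5):
    pieces = count_linear_pieces(lambda x: deep_sawtooth(x, depth), 2 ** 10)
    # Columns: depth, neurons used, linear pieces found, shallow bound m + 1 for m = 3 * depth.
    print(depth, 3 * depth, pieces, 3 * depth + 1)
```

The dyadic grid with $2^{10}$ cells is chosen so that all breakpoints of the compositions up to depth 10 fall on grid points, which keeps the slope comparison exact. The number of pieces doubles with every layer, whereas the bound for a shallow network with the same number of neurons grows only linearly.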

5 Open questions for further research and discussion

5.1 Prepared open questions

We have prepared some open questions and topics for further research in approximation theory of neural networks:

1. Models in applications: It is not entirely clear which function classes are the most appropriate models for different applications. Is there a justification for studying, say, $C^{17}$ functions or piecewise $C^2$ functions? Are there applications where one can precisely identify the function class of the underlying problem? What about non-Euclidean input data, potentially from a manifold as considered in [4]?

2. High-dimensional problems: This topic is closely related to the previous one. In applications, one often has very high dimensional problems, but most presented approximation results are heavily dependent on the input dimension and do not explain the remarkable results in practice. In [26], compositional functions were studied and it was shown that neural networks approximate such functions well without dependence on the dimension. Are there other similar function classes?

3. Other network models: We have discussed plain vanilla networks, without any additional assumptions on the affine linear transformations or additional pooling steps. What kind of approximation results are still valid for convolutional networks? What if additional pooling steps such as max pooling or product pooling are allowed? Some results in this direction have been obtained in [23] by identifying convolutional networks with tensor representations.

4. Efficiency of depth: We have seen some advantages of deep networks when their ReLU realization is considered. For more general activation functions this advantage is not studied. Understanding the effect of depth is still one of the main topics of interest. Whether or not deep networks are better than shallow networks will most likely also depend on the function class.

5.2 Open questions from the discussion

More important than the prepared open questions are those that we can identify as a group and then try to solve together. You will find some space below to note down a couple of questions that we identify during the workshop.

1.

2.

3.

4.

References

[1] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1st edition, 2009.
[2] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. arXiv:1705.01714.
[3] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Memory-optimal neural network approximation. In Proc. of SPIE (Wavelets and Sparsity XVII), 2017.

[4] M.M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017.
[5] C.K. Chui, X. Li, and H.N. Mhaskar. Neural networks for localized approximation. Mathematics of Computation, 63(208):607–623, 1994.
[6] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signal, 2(4):303–314, 1989.
[7] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[8] R.A. DeVore, R. Howard, and Ch. Micchelli. Optimal nonlinear approximation. manuscripta mathematica, 63(4):469–478, Dec 1989.
[9] L.A. Gatys, A.S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv:1508.06576, Aug 2015.
[10] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[11] P. Grohs, S. Keiper, G. Kutyniok, and M. Schäfer. α-molecules. Appl. Comput. Harmon. Anal., 41(1):297–336, 2016.
[12] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[13] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Netw., 2(5):359–366, 1989.
[14] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664–676, April 2017.
[15] A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[17] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25:81–91, 1999.
[18] W. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous activity. Bull. Math. Biophys., 5:115–133, 1943.
[19] H.N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1):61–80, Feb 1993.
[20] H.N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput., 8(1):164–177, 1996.
[21] H.N. Mhaskar and C.A. Micchelli. Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13(3):350–373, 1992.
[22] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 2924–2932, Cambridge, MA, USA, 2014. MIT Press.

[23] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. 2016.
[24] A. Pein and F. Voigtlaender. Analysis sparsity versus synthesis sparsity for α-shearlets, 2017. arXiv:1702.03559.
[25] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. arXiv:1709.05289.
[26] T. Poggio, H.N. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. IJAC, 2017.
[27] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 318–362. MIT Press, 1986.
[28] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2979–2987, 2017.

[29] U. Shaham, A. Cloninger, and R.R. Coifman. Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal., to appear.
[30] M. Telgarsky. Benefits of depth in neural networks. arXiv:1602.04485.
[31] M. Telgarsky. Neural networks and rational functions. arXiv:1706.03301.

[32] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017.
