DGD Approximation Theory Workshop
Sandra Keiper, Gitta Kutyniok, Philipp Petersen

October 2017

Contents

1 Deep learning
2 Mathematical description of neural networks
  2.1 Activation functions
  2.2 Basic operations with networks
    2.2.1 Rescaling and shifting
    2.2.2 Concatenation
    2.2.3 Parallelization
3 Approximation theory of neural networks
  3.1 Universal approximation theorems
  3.2 Approximation of smooth functions with smooth rectifiers
  3.3 Constant size approximation with special rectifiers
  3.4 Lower bounds for approximation
    3.4.1 Continuous nonlinear approximation
    3.4.2 The VC dimension of piecewise linear networks
    3.4.3 Encodability of networks
4 Recent developments
  4.1 Neural networks and dictionaries
  4.2 ReLU approximation
5 Open questions for further research and discussion
  5.1 Prepared open questions
  5.2 Open questions from the discussion

1 Deep learning

The topic of this workshop is the approximation theory of neural networks. Nonetheless, we shall start with a detour on a novel technique called "Deep Learning", since it can be considered the main motivation for studying neural networks. To be honest right from the start, we wish to emphasize that this workshop does not explain why deep learning works. What we will do, however, is provide an overview of the basic architecture that deep learning is built on, and this is, as expected, given by neural networks.

Deep learning is based on the efficient, data-driven training of neural networks. We shall decipher this statement piece by piece, starting with neural networks. We will provide a precise mathematical definition of neural networks in the sequel; for the moment, a neural network is a collection of very simple computational units, called neurons, which are arranged in (potentially) complex patterns, see Figure 1. Some of the neurons are special in the sense that they are input or output neurons.

Figure 1: A neural network with multiple neurons connected by edges. Observe that cycles are possible.

Certainly, when we take a look at the simple neural network of Figure 1 and also take the name into consideration, we are led to believe that there is some connection to actual biological neural networks. Indeed, the starting point for neural networks is the work of McCulloch and Pitts [18], which had the goal of establishing a mathematical model of the human brain. The underlying idea is that a biological neuron is activated and begins to send out its own signals if a specific threshold of inputs is exceeded. Mathematically, a neuron according to McCulloch and Pitts is a function $f_t$ such that $f_t(x) = 1$ if $x \geq t$ and $f_t(x) = 0$ otherwise. Using compositions and sums of these functions, McCulloch and Pitts demonstrated that simple logical functions such as the AND, OR, and NOT functions can be realized as neural networks.
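To make the McCulloch-Pitts model concrete, the following minimal Python sketch (our own illustration, not code from the workshop) realizes the AND, OR, and NOT functions as threshold neurons. The weighted aggregation of the inputs is an assumption we add for convenience; in the scalar definition above, the neuron simply compares its input to the threshold $t$.

```python
# A minimal sketch of McCulloch-Pitts threshold neurons (our own illustration,
# not code from the workshop). A neuron fires (outputs 1) exactly when the
# aggregated input reaches the threshold t; the weights are an added convenience.

def neuron(t, weights):
    """Return a threshold neuron f_t acting on a weighted sum of its inputs."""
    def f(*inputs):
        return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0
    return f

# The elementary logical functions realized as single neurons:
AND = neuron(t=2, weights=[1, 1])   # fires only if both inputs are 1
OR  = neuron(t=1, weights=[1, 1])   # fires if at least one input is 1
NOT = neuron(t=0, weights=[-1])     # fires only if its input is 0

assert AND(1, 1) == 1 and AND(1, 0) == 0
assert OR(0, 1) == 1 and OR(0, 0) == 0
assert NOT(0) == 1 and NOT(1) == 0

# Compositions give more complex logic, e.g. XOR from a two-layer construction:
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))
assert [XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```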
For computational purposes, and especially for deep learning, a general neural network model is not suitable (admittedly, there are exceptions). This is because cycles as in Figure 1 can make the network arbitrarily complex. A neural network whose graph has no cycles is called a feed-forward neural network, see Figure 2; all other networks are called recurrent neural networks. This network model can be formulated mathematically.

Figure 2: A feed-forward neural network.

Let

• $d \in \mathbb{N}$ be the input dimension,
• $L \in \mathbb{N}$ be the number of layers,
• $N_0, \dots, N_L \in \mathbb{N}$ be the number of neurons in each layer, where $N_0 = d$,
• $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and $b_\ell \in \mathbb{R}^{N_\ell}$ be the weights of a neural network, for $\ell \in \{1, \dots, L\}$,
• $\varrho : \mathbb{R} \to \mathbb{R}$ be an activation function.

Then

$$\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}, \qquad x \mapsto W_L\bigl(\varrho\bigl(W_{L-1}\bigl(\varrho\bigl(\dots \varrho(W_1(x))\bigr)\bigr)\bigr)\bigr),$$

where $W_\ell(x) = A_\ell x + b_\ell$ and $\varrho$ is applied componentwise, is a neural network. We will refine this model in the sequel, but in essence this is the mathematical model of a neural network that we shall use. Additionally, we mention convolutional networks, which are neural networks that use convolutions in place of general matrix multiplications.

Feed-forward neural networks are the backbone of modern learning algorithms because they can be trained efficiently. This leads us to the second aspect of deep learning. It was said before that deep learning is based on the efficient, data-driven training of neural networks. By now, we have a model for neural networks, but it is not yet clear how to train it. Training a neural network refers to the act of adjusting its weights, i.e., the objects $A_\ell, b_\ell$, so as to minimize a certain loss function on a training set. Let us say we have a number of input-output pairs $(x_i, y_i)_{i=1,\dots,m} \subset \mathbb{R}^{d} \times \mathbb{R}^{d'}$ and a map $\mathcal{L} : \mathbb{R}^{d'} \times \mathbb{R}^{d'} \to \mathbb{R}^{+}$. Our goal is to solve the following minimization problem:

$$\min_{\Phi} \sum_{i=1}^{m} \mathcal{L}(\Phi(x_i), y_i), \qquad (1.1)$$

over all neural networks $\Phi$ with a $d$-dimensional input and $N_L$-dimensional output, under some restrictions on the number of neurons, the number of layers, or something similar.

In practice, (1.1) is solved by stochastic gradient descent. If $W = (\omega_k)_{k=1}^{M'} \in \mathbb{R}^{M'}$, with $M' = \sum_{\ell=1}^{L} N_\ell N_{\ell-1}$, denotes all the weights of a neural network, i.e., $(A_\ell)_{i,j}$ and $(b_\ell)_i$ for $i = 1, \dots, N_\ell$, $j = 1, \dots, N_{\ell-1}$, $\ell = 1, \dots, L$, then a network can also be interpreted as the function

$$(x, \omega_1, \dots, \omega_{M'}) \mapsto \Phi_W(x).$$

In this interpretation, we can compute $\partial \sum_{i=1}^{m} \mathcal{L}(\Phi_W(x_i), y_i) / \partial \omega_k$ for all $k = 1, \dots, M'$ and replace $\omega_k$ by

$$\tilde{\omega}_k = \omega_k - \lambda \, \partial \sum_{i=1}^{m} \mathcal{L}(\Phi_W(x_i), y_i) / \partial \omega_k,$$

where $\lambda$ is the step-size. At first it seems like an especially complex task to compute all these derivatives for all weights, but there exists a convenient algorithm, called backpropagation [27], which allows a very efficient computation. In essence, this algorithm is simply an iterative application of the chain rule. To spare ourselves from tedious technicalities, we will not review the algorithm here but refer instead to [10], http://www.deeplearningbook.org/, or one of the many good explanations online.
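To tie these pieces together, here is a minimal numpy sketch of the network $\Phi$ from the definition above together with a single gradient step on the loss (1.1). It is our own illustration, not code from the workshop: the choice of ReLU for the activation $\varrho$, the squared loss, the synthetic data, and the use of finite-difference gradients in place of backpropagation are all assumptions made to keep the example short.

```python
# A minimal numpy sketch (our own illustration, not code from the workshop) of the
# network Phi from the definition above and of one gradient step on the loss (1.1).
# ReLU as the activation rho, the squared loss, the synthetic data, and the
# finite-difference gradients (used in place of backpropagation) are assumptions.
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def forward(x, A, b):
    """Phi(x) = W_L(rho(W_{L-1}(... rho(W_1(x)) ...))) with W_l(x) = A_l x + b_l."""
    for A_l, b_l in zip(A[:-1], b[:-1]):
        x = relu(A_l @ x + b_l)
    return A[-1] @ x + b[-1]  # no activation after the last affine map W_L


# Layer widths N_0 = 3 (input dimension d), N_1 = 5, N_2 = 1 (output dimension N_L).
widths = [3, 5, 1]
A = [rng.standard_normal((n, m)) * 0.5 for m, n in zip(widths[:-1], widths[1:])]
b = [np.zeros(n) for n in widths[1:]]

# A tiny synthetic training set (x_i, y_i), i = 1, ..., m, with m = 20.
xs = rng.standard_normal((20, 3))
ys = np.sin(xs.sum(axis=1, keepdims=True))


def total_loss():
    """The objective of (1.1): sum_i L(Phi(x_i), y_i) with L(z, y) = |z - y|^2."""
    return sum(np.sum((forward(x, A, b) - y) ** 2) for x, y in zip(xs, ys))


def numerical_gradient(params, eps=1e-6):
    """Central finite differences; each entry of params is perturbed in place."""
    grads = [np.zeros_like(p) for p in params]
    for p, g in zip(params, grads):
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            up = total_loss()
            p[idx] = old - eps
            down = total_loss()
            p[idx] = old
            g[idx] = (up - down) / (2 * eps)
    return grads


lam = 1e-2  # step-size lambda
grad_A, grad_b = numerical_gradient(A), numerical_gradient(b)
print("loss before step:", total_loss())
A = [A_l - lam * g for A_l, g in zip(A, grad_A)]
b = [b_l - lam * g for b_l, g in zip(b, grad_b)]
print("loss after step: ", total_loss())
```

In practice, the derivatives are computed by backpropagation rather than by finite differences, and the sum in (1.1) is typically replaced by small random mini-batches; this is the "stochastic" part of stochastic gradient descent.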
We have now understood the basic ingredients of deep learning. First of all, the architecture is provided by neural networks, and secondly, if we have sufficiently many training samples, we can manipulate this network in such a way that its associated function fits the training data. This allows us to pose three fundamental questions that form the basis for the understanding of deep learning:

• Expressibility: How powerful is the network architecture? Can it indeed represent the right functions for the learning procedures described above?

• Learning: Why does the stochastic gradient descent algorithm produce anything reasonable? For general activation functions, it is clear that the loss function is not convex in the weight variables. This leads to potentially plenty of local minima.

• Generalization: The learning algorithm only has access to input-output pairs from a training set. To complete a machine learning task, we require that the resulting network is able to generalize, i.e., that it also performs correctly on a test set, which is not a subset of the training set. In applications, it appears that neural networks generalize unreasonably well.

This workshop can only provide partial answers to the first of these questions. One question which was not posed above is this: why should you keep reading or listening to this workshop? Indeed, at this point the basic functionality of deep learning has been explained, but we still lack a profound motivation to study it. Thus we will spend the rest of this chapter on a couple of remarkable applications of deep learning to stimulate the appetite for more theory.

One of the most prominent applications of deep learning is certainly image classification. In fact, one could even argue that the work of Krizhevsky et al. [15] from 2012 triggered the hype around deep learning. In this paper, a convolutional neural network with 60 million weights was trained on a database of 1.5 million images; the resulting network associates with each image a label describing its content. In doing so, it shattered benchmark classification results obtained with methods other than deep learning. Another example of image classification even combines the extraction of a label with a descriptive sentence [14], see Figure 3.

Figure 3: Result of a neural network trained to produce descriptive sentences for images. The image is taken from [14].

In addition to image analysis, neural networks can also manipulate images. A very entertaining example is that of style transfer [9]. Here a network is able to extract certain features of an image. In fact, in [9] it was demonstrated that the style and the content of an image can be separated in a way that makes it possible to replace the style of an image by a different one. A particularly fascinating result is shown in Figure 4.

Another especially remarkable feat of deep learning is the 2016 victory of a computer against one of the best Go players in the world [7]. Such an achievement was believed to be decades away. During its second game against the grandmaster Lee Sedol, the machine, called AlphaGo, made a move (move 37) that was at first considered a blunder by the commentators and only turned out to be remarkably brilliant multiple moves later.