Expressivity of Deep Neural Networks∗
Ingo Gühring†‡  Mones Raslan†‡  Gitta Kutyniok‡§
July 10, 2020
Abstract

In this review paper, we give a comprehensive overview of the large variety of approximation results for neural networks. Approximation rates for classical function spaces as well as benefits of deep neural networks over shallow ones for specifically structured function classes are discussed. While the main body of existing results is for general feedforward architectures, we also review approximation results for convolutional, residual and recurrent neural networks.

Keywords: approximation, expressivity, deep vs. shallow, function classes
1 Introduction
While many aspects of the success of deep learning still lack a comprehensive mathematical explanation, the approximation properties¹ of neural networks have been studied since around 1960 and are relatively well understood. Statistical learning theory formalizes the problem of approximating (in this context also called learning) a function from a finite set of samples. Next to statistical and algorithmic considerations, approximation theory plays a major role in the analysis of statistical learning problems. We will clarify this in the following by introducing some fundamental notions.²

Assume that $\mathcal{X}$ is an input space, $\mathcal{Y}$ a target space, $\mathcal{L}\colon \mathcal{Y} \times \mathcal{Y} \to [0, \infty]$ a loss function, and $P_{(\mathcal{X},\mathcal{Y})}$ a (usually unknown) probability distribution on some $\sigma$-algebra of $\mathcal{X} \times \mathcal{Y}$. We then aim at finding a minimizer of the risk functional³
\[
\mathcal{R}\colon \mathcal{Y}^{\mathcal{X}} \to [0, \infty], \qquad
f \mapsto \int_{\mathcal{X} \times \mathcal{Y}} \mathcal{L}\bigl(f(x), y\bigr)\, dP_{(\mathcal{X},\mathcal{Y})}(x, y),
\]
induced by $\mathcal{L}$ and $P_{(\mathcal{X},\mathcal{Y})}$ (where $\mathcal{Y}^{\mathcal{X}}$ denotes the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$). That means we are looking for a function $\hat{f}$ with
\[
\hat{f} = \operatorname{argmin}\bigl\{ \mathcal{R}(f) : f \in \mathcal{Y}^{\mathcal{X}} \bigr\}.
\]
A classical example of this minimizer, for the squared loss, is spelled out after the list below. In the overwhelming majority of practical applications, however, this optimization problem turns out to be infeasible for three reasons:

(i) The set $\mathcal{Y}^{\mathcal{X}}$ is simply too large, so that one usually fixes a priori some hypothesis class $\mathcal{H} \subset \mathcal{Y}^{\mathcal{X}}$ and instead searches for
\[
\hat{f}_{\mathcal{H}} = \operatorname{argmin}\bigl\{ \mathcal{R}(f) : f \in \mathcal{H} \bigr\}.
\]
In the context of deep learning, the set $\mathcal{H}$ consists of deep neural networks, which we will introduce later.

∗ This review paper will appear as a book chapter in the book "Theory of Deep Learning" by Cambridge University Press.
† These authors contributed equally.
‡ Institute of Mathematics, Technical University of Berlin, Straße des 17. Juni 136, 10623 Berlin, Germany; E-Mail: {guehring, raslan, kutyniok}@math.tu-berlin.de
§ Faculty of Electrical Engineering and Computer Science, Technical University of Berlin / Department of Physics and Technology, University of Tromsø
¹ Throughout the paper, we will use the terms approximation theory and expressivity theory interchangeably.
² [17] provides a concise introduction to statistical learning theory from the point of view of approximation theory.
³ With the convention that $\mathcal{R}(f) = \infty$ if the integral is not well-defined.
Figure 1: Decomposition of the overall error into the training error (optimization), the estimation error (generalization) and the approximation error (expressivity), relating $\hat{f} \in \mathcal{Y}^{\mathcal{X}}$, $\hat{f}_{\mathcal{H}} \in \mathcal{H}$, $\hat{f}_{\mathcal{H},S}$ and $\hat{f}^{*}_{\mathcal{H},S}$.
(ii) Since $P_{(\mathcal{X},\mathcal{Y})}$ is unknown, one cannot compute the risk of a given function $f$. Instead, we are given a training set $S = \bigl((x_i, y_i)\bigr)_{i=1}^{m}$, which consists of $m \in \mathbb{N}$ i.i.d. samples drawn from $\mathcal{X} \times \mathcal{Y}$ with respect to $P_{(\mathcal{X},\mathcal{Y})}$. Thus, we can only hope to find the minimizer of the empirical risk
\[
\mathcal{R}_S(f) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\bigl(f(x_i), y_i\bigr),
\]
given by
\[
\hat{f}_{\mathcal{H},S} = \operatorname{argmin}\bigl\{ \mathcal{R}_S(f) : f \in \mathcal{H} \bigr\}.
\]
(iii) In the case of deep learning, one needs to solve a complicated non-convex optimization problem to find $\hat{f}_{\mathcal{H},S}$; this process is called training and can only be carried out approximately (a minimal numerical sketch of this procedure follows below).
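As a classical illustration of the risk minimization problem, consider a real-valued target space together with the squared loss; the following identity is a standard fact from statistical learning theory and is included here only to make the abstract setting concrete. For $\mathcal{Y} \subset \mathbb{R}$ and $\mathcal{L}(y, y') = (y - y')^2$, the risk is minimized by the regression function, and the excess risk of any $f$ is exactly its squared $L^2$-distance to it:
\[
\hat{f}(x) = \mathbb{E}\bigl[\, y \mid x \,\bigr],
\qquad
\mathcal{R}(f) = \mathcal{R}(\hat{f}\,) + \int_{\mathcal{X}} \bigl(f(x) - \hat{f}(x)\bigr)^2 \, dP_{\mathcal{X}}(x),
\]
where $P_{\mathcal{X}}$ denotes the marginal of $P_{(\mathcal{X},\mathcal{Y})}$ on $\mathcal{X}$.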
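The following minimal Python/NumPy sketch illustrates items (i)-(iii): it fixes a toy hypothesis class $\mathcal{H}$ of one-hidden-layer ReLU networks and approximately minimizes the empirical risk $\mathcal{R}_S$ by plain gradient descent. The data distribution, architecture and hyperparameters (hidden width, learning rate, step count) are arbitrary illustrative choices and not taken from the results reviewed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set S = ((x_i, y_i))_{i=1}^m, drawn from a toy distribution:
# y = sin(2*pi*x) + Gaussian noise (illustrative choice).
m = 200
x = rng.uniform(-1.0, 1.0, size=(m, 1))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal((m, 1))

# Hypothesis class H: one-hidden-layer ReLU networks
# f(x) = relu(x W1 + b1) W2 + b2 with n hidden neurons.
n = 32
W1 = rng.standard_normal((1, n))
b1 = np.zeros(n)
W2 = rng.standard_normal((n, 1)) / np.sqrt(n)
b2 = np.zeros(1)

lr = 0.05  # gradient descent step size (arbitrary illustrative value)
for step in range(2000):
    # Forward pass through the network.
    z = x @ W1 + b1          # pre-activations, shape (m, n)
    a = np.maximum(z, 0.0)   # ReLU activations
    pred = a @ W2 + b2       # predictions, shape (m, 1)

    # Empirical risk R_S(f) = (1/m) sum_i (f(x_i) - y_i)^2 (squared loss).
    residual = pred - y
    risk = np.mean(residual ** 2)

    # Backpropagation: gradients of R_S with respect to the parameters.
    grad_pred = 2.0 * residual / m
    gW2 = a.T @ grad_pred
    gb2 = grad_pred.sum(axis=0)
    grad_z = (grad_pred @ W2.T) * (z > 0.0)
    gW1 = x.T @ grad_z
    gb1 = grad_z.sum(axis=0)

    # Gradient descent update: training only approximately minimizes R_S
    # over H, i.e., it yields f*_{H,S} rather than the exact minimizer f_{H,S}.
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2

print(f"empirical risk R_S after training: {risk:.4f}")
```

The gap between the approximate minimizer produced by this loop and the exact empirical risk minimizer $\hat{f}_{\mathcal{H},S}$ is precisely the training error in Figure 1.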
Denoting by $\hat{f}^{*}_{\mathcal{H},S} \in \mathcal{H}$ the approximate solution obtained by training, the overall error