arXiv:1811.05199v6 [math.FA] 17 Jun 2020

On Sharpness of Error Bounds for Single Hidden Layer Feedforward Neural Networks

Steffen Goebbels*

*Niederrhein University of Applied Sciences, Faculty of Electrical Engineering and Computer Science, Institute for Pattern Recognition, D-47805 Krefeld, Germany

Sixth uploaded version: This is a pre-print of an accepted journal paper that will appear in Results in Mathematics (RIMA) under the title "On Sharpness of Error Bounds for Univariate Approximation by Single Hidden Layer Feedforward Neural Networks", 17.06.2020

Abstract. A new non-linear variant of a quantitative extension of the uniform boundedness principle is used to show sharpness of error bounds for univariate approximation by sums of sigmoid and ReLU functions. Single hidden layer feedforward neural networks with one input node perform such operations. Errors of best approximation can be expressed using moduli of smoothness of the function to be approximated (i.e., to be learned). In this context, the quantitative extension of the uniform boundedness principle indeed allows to construct counterexamples that show approximation rates to be best possible: the approximation errors do not belong to the little-o class of the given bounds. By choosing piecewise linear activation functions, the discussed problem becomes free knot spline approximation. Results of the present paper also hold for non-polynomial (and not piecewise defined) activation functions like the inverse tangent. Based on Vapnik-Chervonenkis dimension, first results are shown for the logistic function.

Keywords: Neural Networks, Rates of Convergence, Sharpness of Error Bounds, Counter Examples, Uniform Boundedness Principle

AMS Subject Classification 2010: 41A25, 41A50, 62M45

1 Introduction

A feedforward neural network with an activation function σ, one input, one output node, and one hidden layer of n neurons as shown in Figure 1 implements an univariate real function g of type
$$g(x) = \sum_{k=1}^{n} a_k \sigma(b_k x + c_k)$$
with weights $a_k, b_k, c_k \in \mathbb{R}$.

Fig. 1: One hidden layer neural network realizing $\sum_{k=1}^{n} a_k\sigma(b_k x + c_k)$

The given paper does not deal with multivariate approximation. But some results can be extended to multiple input nodes, see Section 5. Often, activation functions are sigmoid. A sigmoid function $\sigma : \mathbb{R} \to \mathbb{R}$ is a measurable function with $\lim_{x\to-\infty} \sigma(x) = 0$ and $\lim_{x\to\infty} \sigma(x) = 1$. Sometimes also monotonicity, boundedness, continuity, or even differentiability may be prescribed. Deviant definitions are based on convexity and concavity. In case of differentiability, functions have a bell-shaped first derivative. Throughout this paper, approximation properties of the following sigmoid functions are discussed:

$$\sigma_h(x) := \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases} \quad \text{(Heaviside function)},$$
$$\sigma_c(x) := \begin{cases} 0, & x < -\tfrac{1}{2} \\ x + \tfrac{1}{2}, & -\tfrac{1}{2} \le x \le \tfrac{1}{2} \\ 1, & x > \tfrac{1}{2} \end{cases} \quad \text{(cut function)},$$
$$\sigma_a(x) := \frac{1}{2} + \frac{1}{\pi}\arctan(x) \quad \text{(inverse tangent)},$$
$$\sigma_l(x) := \frac{1}{1 + e^{-x}} = \frac{1}{2}\left(1 + \tanh\left(\frac{x}{2}\right)\right) \quad \text{(logistic function)}.$$
Although not a sigmoid function, the ReLU function (Rectified Linear Unit)

$$\sigma_r(x) := \max\{0, x\}$$
is often used as an activation function for deep neural networks due to its computational simplicity. The Exponential Linear Unit (ELU) activation function

$$\sigma_e(x) := \begin{cases} \alpha(e^x - 1), & x < 0 \\ x, & x \ge 0 \end{cases}$$
with parameter $\alpha \ne 0$ is a smoother variant of ReLU for $\alpha = 1$. Qualitative approximation properties of neural networks have been studied extensively. For example, it is possible to choose an infinitely often differentiable, almost monotonous, sigmoid activation function σ such that for each continuous function f, each compact interval and each bound ε > 0 weights $a_0, a_1, b_1, c_1 \in \mathbb{R}$ exist such that f can be approximated uniformly by $a_0 + a_1\sigma(b_1 x + c_1)$ on the interval within bound ε, see [27] and literature cited there.
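For readers who want to experiment, the activation functions defined above can be implemented in a few lines; the following Python sketch (an illustration, not part of the original paper) uses vectorized NumPy versions:

```python
import numpy as np

def sigma_h(x):  # Heaviside function
    return np.where(x >= 0, 1.0, 0.0)

def sigma_c(x):  # cut function: linear ramp between -1/2 and 1/2
    return np.clip(x + 0.5, 0.0, 1.0)

def sigma_a(x):  # inverse tangent squashed to (0, 1)
    return 0.5 + np.arctan(x) / np.pi

def sigma_l(x):  # logistic function
    return 1.0 / (1.0 + np.exp(-x))

def sigma_r(x):  # ReLU
    return np.maximum(0.0, x)

def sigma_e(x, alpha=1.0):  # ELU with parameter alpha != 0
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

x = np.linspace(-3, 3, 7)
print(sigma_l(x))  # values increase from near 0 to near 1
```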

In this sense, a neural network with only one hidden neuron is capable of approximating every continuous function. However, activation functions typically are chosen fixed when applying neural networks to solve application problems. They do not depend on the unknown function to be approximated. In the late 1980s it was already known that, by increasing the number of neurons, all continuous functions can be approximated arbitrarily well in the sup-norm on a compact set with each non-constant, bounded, monotonically increasing and continuous activation function (universal approximation or density property, see proof of Funahashi in [22]). For each continuous sigmoid activation function (that does not have to be monotone), the universal approximation property was proved by Cybenko in [14] on the unit cube. The result was extended to bounded sigmoid activation functions by Jones [31] without requiring continuity or monotonicity. For monotone sigmoid (and not necessarily continuous) activation functions, Hornik, Stinchcombe and White [29] extended the universal approximation property to the approximation of measurable functions. Hornik [28] proved density in $L^p$-spaces for any non-constant bounded and continuous activation functions. A rather general theorem is proved in [35]. Leshno et al. showed for any continuous activation function σ that the universal approximation property is equivalent to the fact that σ is not an algebraic polynomial. There are many other density results, e.g., cf. [7, 12]. Literature overviews can be found in [42] and [44]. To approximate or interpolate a given but unknown function f, constants $a_k$, $b_k$, and $c_k$ typically are obtained by learning based on sampled function values of f. The underlying optimization algorithm (like gradient descent with backpropagation) might get stuck in a local but not in a global minimum. Thus, it might not find optimal constants to approximate f best possible. This paper does not focus on learning but on general approximation properties of function spaces

$$\Phi_{n,\sigma} := \left\{ g : [0,1] \to \mathbb{R} \,:\, g(x) = \sum_{k=1}^{n} a_k \sigma(b_k x + c_k) \,:\, a_k, b_k, c_k \in \mathbb{R} \right\}.$$
Thus, we discuss functions on the interval [0, 1]. Without loss of generality, it is used instead of an arbitrary compact interval [a, b]. In some papers, an additional constant function $a_0$ is allowed as summand in the definition of $\Phi_{n,\sigma}$. Please note that $a_k\sigma(0\cdot x + b_k)$ already is a constant and that the definitions do not differ significantly. For $f \in B[0,1]$, the set of bounded real-valued functions on the interval [0, 1], let $\|f\|_{B[0,1]} := \sup\{|f(x)| : x \in [0,1]\}$. By C[0, 1] we denote the Banach space of continuous functions on [0, 1] equipped with norm $\|f\|_{C[0,1]} := \|f\|_{B[0,1]}$. For Banach spaces $L^p[0,1]$, $1 \le p < \infty$, of measurable functions we denote the norm by $\|f\|_{L^p[0,1]} := \left(\int_0^1 |f(x)|^p\,dx\right)^{1/p}$. To avoid case differentiation, we set $X^\infty[0,1] := C[0,1]$ with $\|\cdot\|_{X^\infty[0,1]} := \|\cdot\|_{B[0,1]}$, and $X^p[0,1] := L^p[0,1]$ with $\|f\|_{X^p[0,1]} := \|f\|_{L^p[0,1]}$, $1 \le p < \infty$. The error of best approximation $E(\Phi_{n,\sigma}, f)_p$, $1 \le p \le \infty$, is defined via

$$E(\Phi_{n,\sigma}, f)_p := \inf\{\|f - g\|_{X^p[0,1]} : g \in \Phi_{n,\sigma}\}.$$
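The elements of $\Phi_{n,\sigma}$ and a grid-based estimate of the sup-norm distance to a target f are easy to code. The following Python sketch is illustrative only; the example weights are chosen ad hoc, and the grid maximum is merely a lower bound for the true supremum:

```python
import numpy as np

def network(x, a, b, c, sigma):
    """Evaluate g(x) = sum_k a_k * sigma(b_k * x + c_k) for weight vectors a, b, c."""
    x = np.asarray(x, dtype=float)
    return np.sum(a[:, None] * sigma(b[:, None] * x[None, :] + c[:, None]), axis=0)

def sup_error(f, a, b, c, sigma, grid=np.linspace(0.0, 1.0, 1001)):
    """Grid approximation of ||f - g||_B[0,1]; a lower bound for the true sup-norm."""
    return np.max(np.abs(f(grid) - network(grid, a, b, c, sigma)))

# Example: three logistic neurons versus f(x) = x^2 (weights chosen ad hoc).
sigma_l = lambda x: 1.0 / (1.0 + np.exp(-x))
a = np.array([0.5, 0.8, -0.3]); b = np.array([4.0, 8.0, 2.0]); c = np.array([-2.0, -6.0, 0.0])
print(sup_error(lambda x: x**2, a, b, c, sigma_l))
```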

We use the abbreviation $E(\Phi_{n,\sigma}, f) := E(\Phi_{n,\sigma}, f)_\infty$ for $p = \infty$. A trained network cannot approximate a function better than the error of best approximation. Therefore, it is an important measure of what can and what cannot be done with such a network. The error of best approximation depends on the smoothness of f that is measured in terms of moduli of smoothness (or moduli of continuity). In contrast to using

derivatives, first and higher differences of f obviously always exist. By applying a norm to such differences, moduli of smoothness measure a “degree of continuity” of f. For a natural number $r \in \mathbb{N} = \{1, 2, \dots\}$, the rth difference of $f \in B[0,1]$ at $x \in [0, 1 - rh]$ with step size $h > 0$ is defined as

$$\Delta_h^1 f(x) := f(x + h) - f(x), \qquad \Delta_h^r f(x) := \Delta_h \Delta_h^{r-1} f(x),\ r > 1, \quad\text{or}$$

$$\Delta_h^r f(x) := \sum_{j=0}^{r} (-1)^{r-j} \binom{r}{j} f(x + jh).$$
The rth uniform modulus of smoothness is the smallest upper bound of the absolute values of rth differences:

$$\omega_r(f, \delta) := \omega_r(f, \delta)_\infty := \sup\left\{ |\Delta_h^r f(x)| : x \in [0, 1 - rh],\ 0 < h \le \delta \right\}.$$
With respect to $L^p$ spaces, $1 \le p < \infty$, let

$$\omega_r(f, \delta)_p := \sup\left\{ \left( \int_0^{1-rh} |\Delta_h^r f(x)|^p\,dx \right)^{1/p} : 0 < h \le \delta \right\}.$$
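The uniform modulus of smoothness can be approximated numerically by maximizing rth differences over a grid of points and step sizes; the following Python sketch is a rough illustration (the grid resolutions are arbitrary choices, not prescribed by the text):

```python
import numpy as np
from math import comb

def modulus_of_smoothness(f, delta, r, samples=2000):
    """Approximate the uniform modulus omega_r(f, delta) on [0, 1] via a grid."""
    best = 0.0
    for h in np.linspace(delta / 50, delta, 50):        # step sizes 0 < h <= delta
        x = np.linspace(0.0, 1.0 - r * h, samples)      # admissible points x in [0, 1 - r*h]
        diff = sum((-1) ** (r - j) * comb(r, j) * f(x + j * h) for j in range(r + 1))
        best = max(best, np.max(np.abs(diff)))
    return best

# Example: for f(x) = x^2, omega_2(f, delta) = 2 * delta^2 is attained exactly.
print(modulus_of_smoothness(lambda x: x * x, 0.1, 2))   # approx. 0.02
```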

Obviously, $\omega_r(f, \delta)_p \le 2^r \|f\|_{X^p[0,1]}$, and for r-times continuously differentiable functions f, there holds (cf. [16, p. 46])
$$\omega_r(f, \delta)_p \le \delta^r \|f^{(r)}\|_{X^p[0,1]}. \qquad (1.1)$$
Barron applied Fourier methods in [4], cf. [36], to establish rates of convergence in an $L^2$-norm, i.e., he estimated the error $E(\Phi_{n,\sigma}, f)_2$ with respect to n for $n \to \infty$. Makovoz [40] analyzed rates for uniform convergence. With respect to moduli of smoothness, Debao [15] proved a direct estimate that is here presented in a version of the textbook [9, p. 172ff]. This estimate is independent of the choice of a bounded, sigmoid function σ. Doctoral thesis [10], cf. [11], provides an overview of such direct estimates in Section 1.3. According to Debao,
$$E(\Phi_{n,\sigma}, f) \le \sup_{x\in\mathbb{R}} |\sigma(x)|\,\omega_1\!\left(f, \frac{1}{n}\right) \qquad (1.2)$$
holds for each $f \in C[0,1]$. This is the prototype estimate for which sharpness is discussed in this paper. In fact, the result of Debao for $E(\Phi_{n,\sigma}, f)$ allows to additionally restrict weights such that $b_k \in \mathbb{N}$ and $c_k \in \mathbb{Z}$. The estimate has to hold true even for σ being a discontinuous Heaviside function. That is the reason why one can only expect an estimate in terms of a first order modulus of smoothness. If the order of approximation of a continuous function f by such piecewise constant functions is o(1/n) then f itself is a constant, see [16, p. 366]. In fact, the idea behind Debao’s proof is that sigmoid functions can be asymptotically seen as Heaviside functions. One gets arbitrary step functions to approximate f by superposition of Heaviside functions. For quasi-interpolation operators based on the logistic activation function $\sigma_l$, Chen and Zhao proved similar estimates in [8] (cf. [2, 3] for hyperbolic tangent). However, they only reach a convergence order of $O(1/n^\alpha)$ for $\alpha < 1$. With respect to the error of best approximation, they prove

$$E(\Phi_{n,\sigma_l}, f) \le 80 \cdot \omega_1\!\left(f, \frac{\exp\left(\frac{3}{2}\right)}{n}\right)$$

by estimating with a polynomial of best approximation. Due to the different technique, constants are larger than in error bound (1.2). If one takes additional properties of σ into account, higher convergence rates are possible. Continuous sigmoid cut function σc and ReLU function σr lead to spaces

$\Phi_{n,\sigma_c}$ and $\Phi_{n,\sigma_r}$ of continuous, piecewise linear functions. They consist of free knot spline functions of polynomial degree at most one with at most 2n or n knots, cf. [16, Section 12.8]. Spaces $\Phi_{n,\sigma_c}$ and $\Phi_{n,\sigma_r}$ include all continuous spline functions g on [0, 1] with polynomial degree at most one that have at most n − 1 simple knots. We show $g \in \Phi_{n,\sigma_c}$ for such a spline g with equidistant knots $x_k = \frac{k}{n-1}$, $0 \le k \le n-1$, to obtain an error bound for $n \ge 2$:
$$g(x) = g(0)\,\sigma_c\!\left(x + \tfrac{1}{2}\right) + \sum_{k=0}^{n-2} \left[ g(x_{k+1}) - g(x_k) \right] \sigma_c\!\left( (n-1)\cdot\left( x - \frac{2k+1}{2(n-1)} \right) \right).$$
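The representation of an equidistant linear spline by cut-function neurons can be checked numerically. In the following Python sketch (illustrative only, with arbitrarily chosen node values), the formula above is compared with direct piecewise linear interpolation:

```python
import numpy as np

def sigma_c(x):  # cut function
    return np.clip(x + 0.5, 0.0, 1.0)

n = 6                                   # number of neurons, n >= 2
knots = np.linspace(0.0, 1.0, n)        # x_k = k / (n - 1)
values = np.sin(3.0 * knots)            # arbitrary node values g(x_k)

def g_network(x):
    """Spline g represented by n cut-function neurons as in the formula above."""
    result = values[0] * sigma_c(x + 0.5)
    for k in range(n - 1):
        shift = (2 * k + 1) / (2 * (n - 1))
        result += (values[k + 1] - values[k]) * sigma_c((n - 1) * (x - shift))
    return result

x = np.linspace(0.0, 1.0, 101)
g_spline = np.interp(x, knots, values)   # piecewise linear interpolation with the same knots
print(np.max(np.abs(g_network(x) - g_spline)))  # close to 0 up to rounding
```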

One can also represent g by ReLU functions, i.e., $g \in \Phi_{n,\sigma_r}$: With ($1 \le k \le n-1$)
$$m_0 := g(0), \qquad m_k := \frac{g(x_k) - g(x_{k-1})}{x_k - x_{k-1}} - \sum_{j=0}^{k-1} m_j,$$
we get
$$g(x) = m_0 \cdot \sigma_r(x + 1) + \sum_{k=1}^{n-1} m_k\,\sigma_r(x - x_{k-1}).$$
Thus, we can apply the larger error bound [16, p. 225, Theorem 7.3 for $\delta = (n-1)^{-1}$] for fixed simple knot spline approximation with functions g in terms of a second modulus to improve convergence rates up to $O(1/n^2)$: For $f \in X^p[0,1]$, $1 \le p \le \infty$, and $n \ge 2$,
$$E(\Phi_{n,\sigma_c}, f)_p,\ E(\Phi_{n,\sigma_r}, f)_p \le C\,\omega_2\!\left(f, \frac{1}{n-1}\right)_p \le C\,\omega_2\!\left(f, \frac{2}{n}\right)_p \le 4C\,\omega_2\!\left(f, \frac{1}{n}\right)_p. \qquad (1.3)$$
Section 2 deals with even higher order direct estimates. Similarly to (1.3), not only sup-norm bound (1.2) but also an $L^p$-bound, $1 \le p < \infty$, for approximation with Heaviside function $\sigma_h$ can be obtained from the corresponding larger bound of fixed simple knot spline approximation. Each $L^p[0,1]$-function that is constant between knots $x_k = \frac{k}{n}$, $0 \le k \le n$, can be written as a linear combination of n translated Heaviside functions. Thus, [16, p. 225, Theorem 7.3 for $\delta = 1/n$] yields for $n \in \mathbb{N}$

$$E(\Phi_{n,\sigma_h}, f)_p \le C\,\omega_1\!\left(f, \frac{1}{n}\right)_p. \qquad (1.4)$$
Lower error bounds are much harder to obtain than upper bounds, cf. [42] for some results with regard to multilayer feedforward perceptron networks. Often, lower bounds are given using a (non-linear) Kolmogorov n-width $W_n$ (cf. [41, 45]),
$$W_n := \inf_{b_1,\dots,b_n,\,c_1,\dots,c_n}\ \sup_{f\in X}\ \inf_{a_1,\dots,a_n}\ \left\| f(\cdot) - \sum_{k=1}^{n} a_k \sigma(b_k \cdot{} + c_k) \right\|$$
for a suitable function space X (of functions with certain smoothness) and norm $\|\cdot\|$. Thus, parameters $b_k$ and $c_k$ cannot be chosen individually for each function $f \in X$. Higher rates of convergence might occur if that becomes possible.

There are three somewhat different types of sharpness results that might be able to show that left sides of equations (1.2), (1.3), (1.4) or (2.2) and (2.9) in Section 2 do not converge faster to zero than the right sides. The most far reaching results would provide lower estimates of errors of best approximation in which the lower bound is a modulus of smoothness. In connection with direct upper bounds in terms of the same moduli, this would establish theorems similar to the equivalence between moduli of smoothness and K-functionals (cf. [16, theorem of Johnen, p. 177], [30]) in which the error of best approximation replaces the K-functional. Let σ be r-times continuously differentiable like $\sigma_a$ or $\sigma_l$. Then for $f \in C[0,1]$, a standard estimate based on (1.1) is
$$\omega_r\!\left(f, \frac{1}{n}\right) = \inf_{g\in\Phi_{n,\sigma}} \omega_r\!\left(f - g + g, \frac{1}{n}\right) \le \inf_{g\in\Phi_{n,\sigma}}\left[ \omega_r\!\left(f - g, \frac{1}{n}\right) + \omega_r\!\left(g, \frac{1}{n}\right) \right] \le \inf_{g\in\Phi_{n,\sigma}}\left[ 2^r \|f - g\|_{B[0,1]} + \frac{1}{n^r}\|g^{(r)}\|_{B[0,1]} \right].$$
It is unlikely that one can somehow bound $\|g^{(r)}\|_{B[0,1]}$ by $C\|f\|_{B[0,1]}$ to get
$$\omega_r\!\left(f, \frac{1}{n}\right) \le 2^r E(\Phi_{n,\sigma}, f) + \frac{C}{n^r}\|f\|_{B[0,1]}. \qquad (1.5)$$
However, there are different attempts to prove such theorems in [51, Remark 1, p. 620], [52, p. 101] and [53, p. 451]. But the proofs contain difficulties and results are wrong, see [24]. In fact, we prove with Lemma 2.1 in Section 2 that (1.5) is not valid even if constant C is allowed to depend on f. A second class of sharpness results consists of inverse and equivalence theorems. Inverse theorems provide upper bounds for moduli of smoothness in terms of weighted sums of approximation errors. For pseudo-interpolation operators based on piecewise linear activation functions and B-splines (but not for errors of best approximation), [37] deals with an inverse estimate based on Bernstein inequalities. An idea that does not work is to adapt the inverse theorem for best trigonometric approximation [16, p. 208]. Without considering effects related to interval endpoints in algebraic approximation, one gets a (wrong) candidate inequality

$$\omega_r\!\left(f, \frac{1}{n}\right) \le \frac{C_r}{n^r}\left[ \sum_{k=1}^{n} k^{r-1} E(\Phi_{k,\sigma}, f) + \|f\|_{B[0,1]} \right]. \qquad (1.6)$$
By choosing $f \equiv \sigma$ for a non-polynomial, r-times continuously differentiable activation function σ, the modulus on the left side of the estimate behaves like $n^{-r}$. But the errors of best approximation on the right side are zero. At least this can be cured by the additional expression $\frac{C_r}{n^r}\|f\|_{B[0,1]}$. Typically, the proof of an inverse theorem is based on a Bernstein-type inequality that is difficult to formulate for the function spaces discussed here. The Bernstein inequality provides a bound for derivatives. If $p_n$ is a trigonometric polynomial of degree at most n then $\|p_n'\|_{B[0,2\pi]} \le n\|p_n\|_{B[0,2\pi]}$, cf. [16, p. 97]. The problem here is that differentiating $a\sigma(bx + c)$ leads to a factor b that cannot be bounded easily. Indeed, we show for a large class of activation functions that (1.6) can’t hold, see (2.7). As noticed in [24], the inverse estimates of type (1.6) proposed in [51] and [53] are wrong. Similar to inverse theorems, equivalence theorems (like (1.7) below) describe equivalent behavior of expressions of moduli of smoothness and expressions of approximation

errors. Both inverse and equivalence theorems allow to determine smoothness properties, typically membership to Lipschitz classes or Besov spaces, from convergence rates. Such a property is proved in [13] for max-product neural network operators activated by sigmoidal functions. The relationship between order of convergence of best approximation and Besov spaces is well understood for approximation with free knot spline functions and rational functions, see [16, Section 12.8], cf. [34]. The Heaviside activation function leads to free knot splines of polynomial degree 0, i.e., less than r = 1, cut and ReLU function correspond with polynomial degree less than r = 2. For σ being one of these functions, and for $0 < \alpha < k$ and $0 < q < \infty$:
$$\int_0^{\infty} \frac{\omega_k(f, t)_p^q}{t^{1 + \alpha q}}\,dt < \infty \iff \sum_{n=1}^{\infty} \frac{(E(\Phi_{n,\sigma}, f)_p)^q}{n^{1 - \alpha q}} < \infty. \qquad (1.7)$$
However, such equivalence theorems might not be suited to obtain little-o results: Assume that $E(\Phi_{n,\sigma}, f)_p = \frac{1}{n^\beta (\ln(n+1))^{1/q}} = o\!\left(\frac{1}{n^\beta}\right)$, then the right side of (1.7) converges exactly for the same smoothness parameters $0 < \alpha < \beta$ as if $E(\Phi_{n,\sigma}, f)_p = \frac{1}{n^\beta} \ne o\!\left(\frac{1}{n^\beta}\right)$. The third type of sharpness results is based on counterexamples. The present paper follows this approach to deal with little-o effects. Without further restrictions, counterexamples show that convergence orders can not be faster than stated in terms of moduli of smoothness in (1.2), (1.3), (1.4) and the estimates in the following Section 2 for some activation functions. To obtain such counterexamples, a general theorem is introduced in Section 3. It is applied to neural network approximation in Section 4. Unlike the counterexamples in this paper, counterexamples that do not focus on moduli of smoothness were recently introduced in Almira et al. [1] for continuous piecewise polynomial activation functions σ with finitely many pieces (cf. Corollary 4.1 below) as well as for rational activation functions (that we also briefly discuss in Section 4): Given an arbitrary sequence of positive real numbers $(\varepsilon_n)_{n=1}^{\infty}$ with $\lim_{n\to\infty} \varepsilon_n = 0$, a continuous counterexample f is constructed such that $E(\Phi_{n,\sigma}, f) \ge \varepsilon_n$ for all $n \in \mathbb{N}$.

2 Higher order estimates

In this section, two upper bounds in terms of higher order moduli of smoothness are derived from known results. Proofs are given for the sake of completeness. If, e.g., σ is arbitrarily often differentiable on some open interval such that σ is no polynomial on that interval then it is known that $E(\Phi_{n,\sigma}, p_{n-1}) = 0$ for all polynomials $p_{n-1}$ of degree at most $n-1$, i.e., $p_{n-1} \in \Pi^n := \{d_{n-1}x^{n-1} + d_{n-2}x^{n-2} + \dots + d_0 : d_0, \dots, d_{n-1} \in \mathbb{R}\}$, see [42, p. 157] and (2.8) below. Thus, upper bounds for polynomial approximation can be used as upper bounds for neural network approximation in connection with certain activation functions. Due to a corollary of the classical theorem of Jackson, the best approximation to $f \in X^p[0,1]$, $1 \le p \le \infty$, by algebraic polynomials is bounded by the rth modulus of smoothness. For $n \ge r$, we use Theorem 6.3 in [16, p. 220] that is stated for the interval [−1, 1]. However, by applying an affine transformation of [0, 1] to [−1, 1], we see that there exists a constant C independently of f and n such that
$$\inf\{\|f - p_n\|_{X^p[0,1]} : p_n \in \Pi^{n+1}\} \le C\,\omega_r\!\left(f, \frac{1}{n}\right)_p. \qquad (2.1)$$
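The Jackson-type bound (2.1) can be observed experimentally. The following Python sketch is illustrative only; a least-squares Chebyshev fit is used as a crude stand-in for the polynomial of best approximation, and the modulus is computed on a grid. It compares the polynomial error for $f(x) = |x - 1/2|$ with $\omega_1(f, 1/n) = 1/n$:

```python
import numpy as np

def omega_1(f, delta, samples=2000):
    """Grid approximation of the first uniform modulus of smoothness on [0, 1]."""
    best = 0.0
    for h in np.linspace(delta / 20, delta, 20):
        x = np.linspace(0.0, 1.0 - h, samples)
        best = max(best, np.max(np.abs(f(x + h) - f(x))))
    return best

f = lambda x: np.abs(x - 0.5)
x = np.linspace(0.0, 1.0, 2001)
for n in (4, 8, 16, 32):
    # Least-squares Chebyshev fit of degree n as a stand-in for best approximation
    p = np.polynomial.chebyshev.Chebyshev.fit(x, f(x), deg=n)
    err = np.max(np.abs(f(x) - p(x)))
    print(n, err, omega_1(f, 1.0 / n))   # compare the decay of err with omega_1(f, 1/n) = 1/n
```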

Ritter proved an estimate in terms of a first order modulus of smoothness for approximation with nearly exponential activation functions in [43]. Due to (2.1), Ritter’s proof can be extended in a straightforward manner to higher order moduli. The special case of estimating by a second order modulus is discussed in [51]. According to [43], a function σ : R → R is called “nearly exponential” iff for each ε> 0 there exist real numbers a, b, c, and d such that for all x ∈ (−∞, 0]

$$|a\sigma(bx + c) + d - e^x| < \varepsilon.$$

The logistic function fulfills this condition with $a = 1/\sigma_l(c)$, $b = 1$, $d = 0$, and $c < \ln(\varepsilon)$ such that for $x \le 0$ there is
$$0 \le |a\sigma_l(bx + c) + d - e^x| = \left| \frac{\sigma_l(x + c)}{\sigma_l(c)} - e^x \right| = \left| \frac{1 + e^{-c}}{1 + e^{-c}e^{-x}} - e^x \right| = e^x\left| \frac{1 + e^{-c}}{e^x + e^{-c}} - 1 \right| = e^x\,\frac{1 - e^x}{e^x + e^{-c}} \le 1\cdot\frac{1 - 0}{0 + e^{-c}} = e^c < \varepsilon.$$
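This computation is easy to verify numerically; the following Python sketch (illustration only) confirms that the scaled logistic function stays within $e^c < \varepsilon$ of $e^x$ on a large negative range:

```python
import numpy as np

def sigma_l(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 1e-3
c = np.log(eps) - 1.0          # any c < ln(eps) works
a = 1.0 / sigma_l(c)           # scaling as in the text (b = 1, d = 0)

x = np.linspace(-20.0, 0.0, 2001)
error = np.abs(a * sigma_l(x + c) - np.exp(x))
print(error.max(), "<=", np.exp(c), "<", eps)   # maximal error stays below e^c < eps
```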

Theorem 2.1. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a nearly exponential function, $f \in X^p[0,1]$, $1 \le p \le \infty$, and $r \in \mathbb{N}$. Then, independently of $n \ge \max\{r, 2\}$ (or $n > r$) and f, a constant $C_r$ exists such that
$$E(\Phi_{n,\sigma}, f)_p \le C_r\,\omega_r\!\left(f, \frac{1}{n}\right)_p. \qquad (2.2)$$

Proof. Let ε > 0. Due to (2.1), there exists a polynomial $p_n \in \Pi^{n+1}$ of degree at most n such that
$$\|f - p_n\|_{X^p[0,1]} \le C\,\omega_r\!\left(f, \frac{1}{n}\right)_p + \varepsilon. \qquad (2.3)$$
The Jackson estimate can be used to extend the proof given for r = 1 in [43]: Auxiliary functions (α > 0)
$$h_\alpha(x) := \alpha\left(1 - \exp\left(-\frac{x}{\alpha}\right)\right)$$
converge to x pointwise for $\alpha \to \infty$ due to the theorem of L’Hospital. Since $\frac{d}{dx}(h_\alpha(x) - x) = 0 \iff e^{-x/\alpha} = 1 \iff x = 0$, the maximum of $|h_\alpha(x) - x|$ on [0, 1] is obtained at the endpoints 0 or 1, and convergence of $h_\alpha(x)$ to x is uniform on [0, 1] for $\alpha \to \infty$. Thus $\lim_{\alpha\to\infty} p_n(h_\alpha(x)) = p_n(x)$ uniformly on [0, 1], and for the given ε we can choose α large enough to get
$$\|p_n(\cdot) - p_n(h_\alpha(\cdot))\|_{B[0,1]} < \varepsilon. \qquad (2.4)$$
Therefore, function f is approximated by an exponential sum of type
$$p_n(h_\alpha(x)) = \gamma_0 + \sum_{k=1}^{n} \gamma_k \exp\left(-\frac{kx}{\alpha}\right)$$
within the bound $C\,\omega_r\!\left(f, \frac{1}{n}\right)_p + 2\varepsilon$. It remains to approximate the exponential sum by utilizing that σ is nearly exponential. For $1 \le k \le n$, $\gamma_k \ne 0$, there exist parameters $a_k, b_k, c_k, d_k$ such that
$$\left| a_k \sigma\!\left(-b_k\frac{kx}{\alpha} + c_k\right) + d_k - \exp\left(-\frac{kx}{\alpha}\right) \right| < \frac{\varepsilon}{n|\gamma_k|}$$


for all x ∈ [0, 1]. Also because σ is nearly exponential, there exists $c \in \mathbb{R}$ with $\sigma(c) \ne 0$. Thus, a constant γ can be expressed via $\frac{\gamma}{\sigma(c)}\sigma(0\cdot x + c)$. Therefore, there exists a function $g_{n,\varepsilon} \in \Phi_{n+1,\sigma}$,
$$g_{n,\varepsilon}(x) = \frac{\gamma_0}{\sigma(c)}\,\sigma(0\cdot x + c) + \sum_{k=1,\,\gamma_k\ne 0}^{n} \gamma_k\left[ a_k\sigma\!\left(-b_k\frac{kx}{\alpha} + c_k\right) + d_k \right] = \frac{\gamma_0 + \sum_{k=1,\,\gamma_k\ne 0}^{n}\gamma_k d_k}{\sigma(c)}\,\sigma(0\cdot x + c) + \sum_{k=1,\,\gamma_k\ne 0}^{n} \gamma_k a_k\sigma\!\left(-b_k\frac{kx}{\alpha} + c_k\right)$$
such that
$$\|p_n(h_\alpha(\cdot)) - g_{n,\varepsilon}\|_{B[0,1]} \le \sum_{k=1,\,\gamma_k\ne 0}^{n} |\gamma_k| \left\| a_k\sigma\!\left(-b_k\frac{k\cdot}{\alpha} + c_k\right) + d_k - \exp\left(-\frac{k\cdot}{\alpha}\right) \right\|_{B[0,1]} \le n\,\frac{\varepsilon}{n} = \varepsilon. \qquad (2.5)$$
By combining (2.3), (2.4) and (2.5), we get

$$E(\Phi_{n+1,\sigma}, f)_p \le \|f - p_n\|_{X^p[0,1]} + \|p_n - p_n(h_\alpha(\cdot))\|_{B[0,1]} + \|p_n(h_\alpha(\cdot)) - g_{n,\varepsilon}\|_{B[0,1]} \le C\,\omega_r\!\left(f, \frac{1}{n}\right)_p + 3\varepsilon.$$
Since ε can be chosen arbitrarily, we obtain for $n \ge 2$
$$E(\Phi_{n,\sigma}, f)_p \le C\,\omega_r\!\left(f, \frac{1}{n-1}\right)_p \le C\,\omega_r\!\left(f, \frac{2}{n}\right)_p \le C\,2^r\,\omega_r\!\left(f, \frac{1}{n}\right)_p. \qquad (2.6)$$

By choosing $a = 1/\alpha$, $b = 1$, $c = 0$, and $d = 1$, the ELU activation function $\sigma_e(x)$ obviously fulfills the condition to be nearly exponential. But its definition for $x \ge 0$ plays no role. Given a nearly exponential activation function, a lower bound (1.5) or an inverse estimate (1.6) with a constant $C_r$, independent of f, is not valid, see [24] for r = 2. Such inverse estimates were proposed in [51] and [53]. Functions $f_n(x) := \exp(-nx)$ fulfill $\|f_n\|_{B[0,1]} = 1$ and

$$(1 - e^{-1})^r = \left| \sum_{k=0}^{r} (-1)^{r-k} \binom{r}{k} f_n\!\left(\frac{k}{n}\right) \right| \le \omega_r\!\left(f_n, \frac{1}{n}\right).$$
For $x \le 0$, one can uniformly approximate $e^x$ arbitrarily well by assigning values to a, b, c and d in $a\sigma(bx + c) + d$. Thus, each function $f_n(x)$ can be approximated uniformly by $a\sigma(-nbx + c) + d$ on [0, 1] such that $E(\Phi_{k,\sigma}, f_n) = 0$, $k > 1$. Using $f_n$ with formula (1.6) implies
$$(1 - e^{-1})^r \le \frac{C_r}{n^r}\left[ E(\Phi_{1,\sigma}, f_n) + \|f_n\|_{B[0,1]} \right] \le \frac{2C_r}{n^r} \qquad (2.7)$$
which is obviously wrong for $n \to \infty$. The same problem occurs with (1.5).

The “nearly exponential” property only fits with certain activation functions but it does not require continuity. For example, let h(x) be the Dirichlet function that is one for rational and zero for irrational numbers x. Activation function $\exp(x)(1 + \exp(x)h(x))$ is nowhere continuous but nearly exponential: For ε > 0 let $c = \ln(\varepsilon)$ and $a = \exp(-c)$, $b = 1$, then for $x \le 0$
$$\left| e^x - a e^{x+c}\left(1 + e^{x+c}h(x+c)\right) \right| = e^{2x+c}h(x+c) \le e^c = \varepsilon.$$
But a bound can also be obtained from arbitrarily often differentiability. Let σ be arbitrarily often differentiable on some open interval such that σ is no polynomial on that interval. Then one can easily obtain an estimate in terms of the rth modulus from the Jackson estimate (2.1) by considering that polynomials of degree at most n − 1 can be approximated arbitrarily well by functions in $\Phi_{n,\sigma}$, see [42, Corollary 3.6, p. 157], cf. [33, Theorem 3.1]. The idea is to approximate monomials by differential quotients of σ. This is possible since the derivative
$$\frac{\partial^k}{\partial b^k}\,\sigma(bx + c) = x^k\sigma^{(k)}(bx + c) \qquad (2.8)$$
at b = 0 equals $\sigma^{(k)}(c)\,x^k$. Constants $\sigma^{(k)}(c) \ne 0$ can be chosen because polynomials are excluded. The following estimate extends [42, Theorem 6.8, p. 176] in the univariate case to moduli of smoothness.

Theorem 2.2. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be arbitrarily often differentiable on some open interval in $\mathbb{R}$ and let σ be no polynomial on that interval, $f \in X^p[0,1]$, $1 \le p \le \infty$, and $r \in \mathbb{N}$. Then, independently of $n \ge \max\{r, 2\}$ (or $n > r$) and f, a constant $C_r$ exists such that
$$E(\Phi_{n,\sigma}, f)_p \le C_r\,\omega_r\!\left(f, \frac{1}{n}\right)_p. \qquad (2.9)$$

This theorem can be applied to σl but also to σa and σe. The preliminaries are also fulfilled for σ(x) := sin(x), a function that is obviously not nearly exponential.
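The mechanism behind (2.8) can also be observed numerically: a kth central difference of $\sigma_l(bx + c)$ with respect to b at b = 0 approximates $\sigma_l^{(k)}(c)x^k$, so monomials lie in the closure of spans of logistic neurons. The following Python sketch (with an ad hoc step size, not the construction used in [42]) illustrates this for k = 2:

```python
import numpy as np
from math import comb

def sigma_l(x):
    return 1.0 / (1.0 + np.exp(-x))

def monomial_via_neurons(x, k=2, c=1.0, h=1e-2):
    """Approximate sigma_l^{(k)}(c) * x^k via a k-th central difference of
    sigma_l(b*x + c) with respect to b at b = 0 (a combination of k+1 neurons)."""
    total = np.zeros_like(x, dtype=float)
    for j in range(k + 1):
        b_j = (j - k / 2.0) * h                      # difference nodes around b = 0
        total += (-1) ** (k - j) * comb(k, j) * sigma_l(b_j * x + c)
    return total / h ** k

x = np.linspace(0.0, 1.0, 5)
c = 1.0
d2 = np.exp(-c) * (np.exp(-c) - 1.0) / (1.0 + np.exp(-c)) ** 3   # sigma_l''(c)
print(monomial_via_neurons(x, k=2, c=c))   # approximately d2 * x**2
print(d2 * x ** 2)
```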

Proof. Let ε > 0. As in the previous proof, there exists a polynomial $p_n$ of degree at most n such that (2.3) holds. Due to [42, p. 157] there exists a function $g_{n,\varepsilon} \in \Phi_{n+1,\sigma}$ such that $\|g_{n,\varepsilon} - p_n\|_{B[0,1]} < \varepsilon$. This gives
$$E(\Phi_{n+1,\sigma}, f)_p \le \|f - p_n\|_{X^p[0,1]} + \|p_n - g_{n,\varepsilon}\|_{B[0,1]} \le C\,\omega_r\!\left(f, \frac{1}{n}\right)_p + 2\varepsilon.$$
Since ε can be chosen arbitrarily, we get (2.9) via equation (2.6).

Polynomials in the closure of approximation spaces can be utilized to show that a direct lower bound in terms of a (uniform) modulus of smoothness is not possible.

Lemma 2.1 (Impossible Inverse Estimate). Let activation function σ be given as in the preceding Theorem 2.2, $r \in \mathbb{N}$. For each positive, monotonically decreasing sequence $(\alpha_n)_{n=1}^{\infty}$, $\alpha_n > 0$, and each $0 < \beta < 1$ a counterexample $f_\beta \in C[0,1]$ exists such that (for $n \to \infty$)
$$\omega_r\!\left(f_\beta, \frac{1}{n}\right) \le C\,\frac{1}{n^{r\beta}}, \qquad \omega_r\!\left(f_\beta, \frac{1}{n}\right) \ne o\!\left(\frac{1}{n^{r\beta}}\right), \qquad (2.10)$$
$$\limsup_{n\to\infty} \frac{\omega_r\!\left(f_\beta, \frac{1}{n}\right)}{E(\Phi_{n,\sigma}, f_\beta)}\,\alpha_n > 0, \quad\text{i.e.,}\quad \omega_r\!\left(f_\beta, \frac{1}{n}\right) \ne o\!\left(\frac{1}{\alpha_n}\,E(\Phi_{n,\sigma}, f_\beta)\right). \qquad (2.11)$$

Even if the constant C = Cf > 0 may depend on f (but not on n), estimate (1.5), as proposed in a similar context in [51] for r = 2, does not apply.

Proof. For best trigonometric approximation, comparable counterexamples are constructed in [21, Corollary 3.1] based on a general resonance theorem [21, Theorem 2.1]. We apply this theorem to show a similar result for best algebraic polynomial approximation, i.e., there exists $f_\beta \in C[0,1]$ such that (2.10) and
$$\limsup_{n\to\infty} \frac{\omega_r\!\left(f_\beta, \frac{1}{n}\right)}{E(\Pi^{n+1}, f_\beta)}\,\alpha_{n+1} > 0. \qquad (2.12)$$
We choose parameters of [21, Theorem 2.1] as follows: $X = C[0,1]$, $\Lambda = \{r\}$, $T_n(f) = S_{n,r}(f) := \omega_r\!\left(f, \frac{1}{n}\right)$ such that $C_2 = C_{5,n} = 2^r$, $\omega(\delta) = \delta^\beta$. Different to [21, Corollary 3.1] we use resonance functions $h_j(x) := x^j$, $j \in \mathbb{N}$, such that $C_1 = 1$. We further set $\varphi_n = \tau_n = \psi_n = 1/n^r$ such that with (1.1)
$$|T_n(h_j)| \le \frac{1}{n^r}\,\|h_j^{(r)}\|_{B[0,1]} \le \frac{\varphi_n}{\varphi_j}, \qquad S_{n,r}(h_j) \le \frac{\varphi_n}{\varphi_j} =: C_{6,j}\,\varphi_n = C_{6,j}\,\tau_n.$$
We compute an r-th difference of $x^n$ at the interval endpoint 1 to get the resonance condition
$$\liminf_{n\to\infty} |S_{n,r}(h_n)| = \liminf_{n\to\infty} \omega_r\!\left(x^n, \frac{1}{n}\right) \ge \lim_{n\to\infty} \left| \sum_{k=0}^{r} (-1)^{r-k}\binom{r}{k}\left(1 - \frac{k}{n}\right)^{\!n} \right| = \left| \sum_{k=0}^{r} (-1)^{r-k}\binom{r}{k}\exp(-k) \right| = \left| \left(\frac{1}{e} - 1\right)^{\!r} \right| = \left(1 - \frac{1}{e}\right)^{\!r} =: C_{7,r} > 0.$$
Further, let $R_n(f) := E(\Pi^{n+1}, f)$ such that $R_n(h_j) = 0$ for $j \le n$, i.e., $0 = R_n(h_j) \le C_4\frac{\rho_n}{\psi_n}$ for $\rho_n := \alpha_{n+1}\psi_n$. Then, due to [21, Theorem 2.1], $f_\beta \in C[0,1]$ exists such that (2.10) and (2.12) are fulfilled.
Since polynomials of $\Pi^{n+1}$ are in the closure of $\Phi_{n+1,\sigma}$ according to [42, p. 157], i.e., $E(\Phi_{n+1,\sigma}, f_\beta) \le E(\Pi^{n+1}, f_\beta)$, we obtain (2.11) from (2.12):
$$\limsup_{n\to\infty} \frac{\omega_r\!\left(f_\beta, \frac{1}{n+1}\right)}{E(\Phi_{n+1,\sigma}, f_\beta)}\,\alpha_{n+1} \ge \limsup_{n\to\infty} \frac{\frac{1}{2^r}\,\omega_r\!\left(f_\beta, \frac{1}{n}\right)}{E(\Phi_{n+1,\sigma}, f_\beta)}\,\alpha_{n+1} \ge \limsup_{n\to\infty} \frac{1}{2^r}\,\frac{\omega_r\!\left(f_\beta, \frac{1}{n}\right)}{E(\Pi^{n+1}, f_\beta)}\,\alpha_{n+1} > 0.$$

3 A Uniform Boundedness Principle with Rates

In this paper, sharpness results are proved with a quantitative extension of the classical uniform boundedness principle of Functional Analysis. Dickmeis, Nessel and van Wickern developed several versions of such theorems. We already used one of them in the proof of Lemma 2.1. An overview of applications in Numerical Analysis can be found in [23, Section 6]. The given paper is based on [20, p. 108]. This and most other versions require error functionals to be sub-additive. Let X be a normed space. A

functional T on X, i.e., T maps X into $\mathbb{R}$, is said to be (non-negative-valued) sub-linear and bounded, iff for all $f, g \in X$, $c \in \mathbb{R}$,
$$T(f) \ge 0,$$
$$T(f + g) \le T(f) + T(g) \quad \text{(sub-additivity)},$$
$$T(cf) = |c|\,T(f) \quad \text{(absolute homogeneity)},$$
$$\|T\|_{X^\sim} := \sup\{T(f) : \|f\|_X \le 1\} < \infty \quad \text{(bounded functional)}.$$
The set of non-negative-valued sub-linear bounded functionals T on X is denoted by $X^\sim$. Typically, errors of best approximation are (non-negative-valued) sub-linear bounded functionals. Let $U \subset X$ be a linear subspace. The best approximation of $f \in X$ by elements $u \in U \ne \emptyset$ is defined as $E(f) := \inf\{\|f - u\|_X : u \in U\}$. Then E is sub-linear: $E(f + g) \le E(f) + E(g)$, $E(cf) = |c|E(f)$ for all $c \in \mathbb{R}$. Also, E is bounded: $E(f) \le \|f - 0\|_X = \|f\|_X$. Unfortunately, function sets $\Phi_{n,\sigma}$ are not linear spaces, cf. [42, p. 151]. In general, from $f, g \in \Phi_{n,\sigma}$ one can only conclude $f + g \in \Phi_{2n,\sigma}$, whereas $cf \in \Phi_{n,\sigma}$, $c \in \mathbb{R}$. Functionals of best approximation fulfill $E(\Phi_{n,\sigma}, f)_p \le \|f - 0\|_{X^p[0,1]} = \|f\|_{X^p[0,1]}$. Absolute homogeneity $E(\Phi_{n,\sigma}, cf)_p = |c|E(\Phi_{n,\sigma}, f)_p$ is obvious for $c = 0$. If $c \ne 0$,

$$E(\Phi_{n,\sigma}, cf)_p = \inf\left\{ \left\| cf - \sum_{k=1}^{n} a_k\sigma(b_k x + c_k) \right\|_{X^p[0,1]} : a_k, b_k, c_k \in \mathbb{R} \right\} = |c| \inf\left\{ \left\| f - \sum_{k=1}^{n} \frac{a_k}{c}\,\sigma(b_k x + c_k) \right\|_{X^p[0,1]} : \frac{a_k}{c}, b_k, c_k \in \mathbb{R} \right\} = |c|\,E(\Phi_{n,\sigma}, f)_p.$$
But there is no sub-additivity. However, it is easy to prove a similar condition: For each ε > 0 there exist elements $u_{f,\varepsilon}, u_{g,\varepsilon} \in \Phi_{n,\sigma}$ that fulfill
$$\|f - u_{f,\varepsilon}\|_{X^p[0,1]} \le E(\Phi_{n,\sigma}, f)_p + \frac{\varepsilon}{2}, \qquad \|g - u_{g,\varepsilon}\|_{X^p[0,1]} \le E(\Phi_{n,\sigma}, g)_p + \frac{\varepsilon}{2}$$
and $u_{f,\varepsilon} + u_{g,\varepsilon} \in \Phi_{2n,\sigma}$ such that
$$E(\Phi_{2n,\sigma}, f+g)_p \le \|f - u_{f,\varepsilon} + g - u_{g,\varepsilon}\|_{X^p[0,1]} \le \|f - u_{f,\varepsilon}\|_{X^p[0,1]} + \|g - u_{g,\varepsilon}\|_{X^p[0,1]} \le E(\Phi_{n,\sigma}, f)_p + E(\Phi_{n,\sigma}, g)_p + \varepsilon,$$
i.e.,
$$E(\Phi_{2n,\sigma}, f+g)_p \le E(\Phi_{n,\sigma}, f)_p + E(\Phi_{n,\sigma}, g)_p. \qquad (3.1)$$

Obviously, also $E(\Phi_{n,\sigma}, f)_p \ge E(\Phi_{n+1,\sigma}, f)_p$ holds true. In what follows, a quantitative extension of the uniform boundedness principle based on these conditions is presented. The conditions replace sub-additivity. Another extension of the uniform boundedness principle to non-sub-linear functionals is proved in [19]. But this version of the theorem is stated for a family of error functionals with two parameters that has to fulfill a condition of quasi lower semi-continuity. Functionals $S_\delta$ measuring smoothness also do not need to be sub-additive but have to fulfill a condition $S_\delta(f + g) \le B(S_\delta(f) + S_\delta(g))$ for a constant $B \ge 1$. This theorem does not consider replacement (3.1) for sub-additivity. Both rate of convergence and size of moduli of smoothness can be expressed by abstract moduli of smoothness, see [49, p. 96ff]. Such an abstract modulus of smoothness is defined as a continuous, increasing function ω on $[0, \infty)$ that fulfills
$$0 = \omega(0) < \omega(\delta_1) \le \omega(\delta_1 + \delta_2) \le \omega(\delta_1) + \omega(\delta_2) \qquad (3.2)$$
for all $\delta_1, \delta_2 > 0$. It has similar properties as $\omega_r(f, \cdot)$, and it directly follows for $\lambda > 0$ that
$$\omega(\lambda\delta) \le \omega(\lceil\lambda\rceil\delta) \le \lceil\lambda\rceil\,\omega(\delta) \le (\lambda + 1)\,\omega(\delta). \qquad (3.3)$$
Due to continuity, $\lim_{\delta\to 0+} \omega(\delta) = 0$ holds. For all $0 < \delta_1 \le \delta_2$, equation (3.3) also implies
$$\frac{\omega(\delta_2)}{\delta_2} = \frac{\omega\!\left(\delta_1\,\frac{\delta_2}{\delta_1}\right)}{\delta_2} \le \frac{\left(\frac{\delta_2}{\delta_1} + 1\right)\omega(\delta_1)}{\delta_2} = \left(\frac{1}{\delta_1} + \frac{1}{\delta_2}\right)\omega(\delta_1) \le 2\,\frac{\omega(\delta_1)}{\delta_1}. \qquad (3.4)$$
Functions $\omega(\delta) := \delta^\alpha$, $0 < \alpha \le 1$, are examples for abstract moduli of smoothness. They are used to define Lipschitz classes. The aim is to discuss a sequence of remainders (that will be errors of best approximation) $(E_n)_{n=1}^{\infty}$, $E_n : X \to [0, \infty)$. These functionals do not have to be sub-linear but instead have to fulfill

$$E_{m\cdot n}\!\left(\sum_{k=1}^{m} f_k\right) \le \sum_{k=1}^{m} E_n(f_k) \quad (\text{cf. } (3.1)), \qquad (3.5)$$
$$E_n(cf) = |c|\,E_n(f), \qquad (3.6)$$
$$E_n(f) \le D_n\|f\|_X, \qquad (3.7)$$
$$E_n(f) \ge E_{n+1}(f) \qquad (3.8)$$
for all $m \in \mathbb{N}$, $f, f_1, f_2, \dots, f_m \in X$, and constants $c \in \mathbb{R}$. In the boundedness condition (3.7), $D_n$ is a constant only depending on $E_n$ but not on f.

Theorem 3.1 (Adapted Uniform Boundedness Principle). Let X be a (real) Banach space with norm $\|\cdot\|_X$. Also, a sequence $(E_n)_{n=1}^{\infty}$, $E_n : X \to [0, \infty)$, is given that fulfills conditions (3.5)–(3.8). To measure smoothness, sub-linear bounded functionals $S_\delta \in X^\sim$ are used for all δ > 0. Let $\mu(\delta) : (0, \infty) \to (0, \infty)$ be a positive function, and let $\varphi : [1, \infty) \to (0, \infty)$ be a strictly decreasing function with $\lim_{x\to\infty} \varphi(x) = 0$. An additional requirement is that for each $0 < \lambda < 1$ a point $X_0 = X_0(\lambda) \ge \lambda^{-1}$ and a constant $C_\lambda > 0$ exist such that

$$\varphi(\lambda x) \le C_\lambda\,\varphi(x) \qquad (3.9)$$
for all $x > X_0$. If there exist test elements $h_n \in X$ such that for all $n \in \mathbb{N}$ with $n \ge n_0 \in \mathbb{N}$ and δ > 0
$$\|h_n\|_X \le C_1, \qquad (3.10)$$
$$S_\delta(h_n) \le C_2 \min\left\{1, \frac{\mu(\delta)}{\varphi(n)}\right\}, \qquad (3.11)$$
$$E_{4n}(h_n) \ge c_3 > 0, \qquad (3.12)$$
then for each abstract modulus of smoothness ω satisfying
$$\lim_{\delta\to 0+} \frac{\omega(\delta)}{\delta} = \infty \qquad (3.13)$$
there exists a counterexample $f_\omega \in X$ such that ($\delta \to 0+$, $n \to \infty$)
$$S_\delta(f_\omega) = O\left(\omega(\mu(\delta))\right), \qquad (3.14)$$
$$E_n(f_\omega) \ne o(\omega(\varphi(n))), \quad\text{i.e.,}\quad \limsup_{n\to\infty} \frac{E_n(f_\omega)}{\omega(\varphi(n))} > 0. \qquad (3.15)$$

For example, (3.9) is fulfilled for a standard choice ϕ(x) = 1/xα. The prerequisites of the theorem differ from the Theorems of Dickmeis, Nessel, and ∼ van Wickern in conditions (3.5)–(3.8) that replace En ∈ X . It also requires additional constraint (3.9). For convenience, resonance condition (3.12) replaces En(hn) ≥ c3.

Without much effort, (3.12) can be weakened to lim supn→∞ E4n(hn) > 0. The proof is based on a gliding hump and follows the ideas of [20, Section 2.2] (cf. [18]) for sub-linear functionals and the literature cited there. For the sake of completeness, the whole proof is presented although changes were required only for estimates that are effected by missing sub-additivity.

Proof. The first part of the proof is not concerned with sub-additivity or its replace- ment. If a test element hj , j ≥ n0, exists that already fulfills E (h ) lim sup n j > 0, (3.16) n→∞ ω(ϕ(n)) then fω := hj fulfills (3.15). To show that this fω also fulfills (3.14), one needs inequality min{1, δ}≤ Aω(δ) (3.17) for all δ > 0. This inequality follows from (3.4): If 0 <δ< 1 then ω(1)/1 ≤ 2ω(δ)/δ, such that δ ≤ 2ω(δ)/ω(1). If δ > 1 then ω(1) ≤ ω(δ), i.e., 1 ≤ ω(δ)/ω(1). Thus, one can choose A = 2/ω(1). Smoothness (3.14) of test elements hj now follows from (3.17): (3.11) µ(δ) (3.17) µ(δ) S (h ) ≤ C min 1, ≤ AC ω δ j 2 ϕ(j) 2 ϕ(j)     (3.3) 1 ≤ AC 1+ ω(µ(δ)). 2 ϕ(j)   Under condition (3.16) function fω := hj indeed is a counter example. Thus, for the remaining proof one can assume that for all j ∈ N, j ≥ n0: E (h ) lim n j = 0. (3.18) n→∞ ω(ϕ(n)) The arguments of Dickmeis, Nessel and van Wickern have to be adjusted to missing sub-additivity in the next part of the proof. It has to be shown that for each fixed N m R m ∈ a finite sum inherits (3.18). Let (al)l=1 ⊂ and j1,...,jm ≥ n0 different indices. To prove m En alhj lim l=1 l = 0, (3.19) n→∞ ω(ϕ(n)) P  one can apply (3.8), (3.5), and (3.6) for n ≥ 2m:

m (3.8) m (3.5) m En alhj Em·⌊n/m⌋ alhj E (alhj ) 0 ≤ l=1 l ≤ l=1 l ≤ l=1 ⌊n/m⌋ l ω(ϕ(n)) ω(ϕ(n)) ω(ϕ(n)) P  P  P m (3.6) E (hj ) = |a | ⌊n/m⌋ l . (3.20) l ω(ϕ(n)) Xl=1 Since ϕ(x) is decreasing and ω(δ) is increasing, ω(ϕ(x)) is decreasing. Thus, (3.9) for −1 λ := (2m) and n> max{2m, X0(λ)} implies n − m n − n/2 (3.9) ω (ϕ (⌊n/m⌋)) ≤ ω ϕ ≤ ω ϕ ≤ ω C 1 ϕ(n) m m 2m         3 A Uniform Boundedness Principle with Rates 15

(3.3) ≤ ⌈C 1 ⌉ω (ϕ(n)) . 2m With this inequality, estimate (3.20) becomes

m m En l=1 alhjl E⌊n/m⌋ (hjl ) 0 ≤ ≤ ⌈C 1 ⌉ |al| . ω(ϕ(n)) 2m ω (ϕ (⌊n/m⌋)) P  l=1 X According to (3.18), this gives (3.19). ∞ N N Now one can select a sequence (nk)k=1 ⊂ , n0 ≤ nk < nk+1 for all k ∈ , to construct a suitable counterexample

fω := ω(ϕ(nk))hnk . (3.21) Xk=1 Let n1 := n0. If n1,...,nk have already be chosen then select nk+1 with nk+1 ≥ 2k and nk+1 > nk large enough to fulfill following conditions: 1 ω(ϕ(nk+1)) ≤ ω(ϕ(nk)) lim ω(ϕ(x))=0 (3.22) 2 x→∞   ω(ϕ(nk)) D2·n ω(ϕ(nk+1)) ≤ lim ω(ϕ(x))=0 (3.23) k k x→∞ k   ω(ϕ(n )) ω(ϕ(n )) j ≤ k+1 (cf. (3.13)) (3.24) ϕ(nj ) ϕ(nk+1) j=1 X E k ω(ϕ(n ))h nk+1 l=1 l nl 1 ≤ (cf. (3.19)). (3.25) ωP(ϕ(nk+1))  k + 1 Only condition (3.23) is adjusted to missing sub-additivity. The next part of the proof does not consider properties of En, see [20]. Function fω in (3.21) is well-defined: For j ≥ k, iterative application of (3.22) leads to −1 k−j ω(ϕ(nj )) ≤ 2 ω(ϕ(nj−1)) ≤···≤ 2 ω(ϕ(nk)). This implies ∞ ∞ k−j ω(ϕ(nj )) ≤ 2 ω(ϕ(nk))=2ω(ϕ(nk)). (3.26) Xj=k Xj=k With this estimate (and because of limk→∞ ω(ϕ(nk)) = 0), it is easy to see that ∞ m (gm)m=1, gm := j=1 ω(ϕ(ni))hni , is a Cauchy sequence that converges to fω in Banach space X: For a given ε > 0, there exists a number N0(ε) > n0 such that P ω(ϕ(nk)) < ε/(2C1) for all k > N0. Then, due to (3.10) and (3.26), for all k>i>N0:

k ∞

kgk − gikX ≤ ω(ϕ(nj ))khnj kX ≤ C1 ω(ϕ(nj )) ≤ 2C1ω(ϕ(ni+1)) < ε. j=i+1 j=i+1 X X

Thus, the Banach condition is fulfilled and counterexample fω ∈ X is well defined. Smoothness condition (3.11) is proved in two cases. The first case covers numbers δ > 0 for which µ(δ) ≤ ϕ(n1). Since limx→∞ ϕ(x) = 0, there exists k ∈ N such that in this case ϕ(nk+1) ≤ µ(δ) ≤ ϕ(nk). (3.27) 3 A Uniform Boundedness Principle with Rates 16

Using this index k in connection with the two bounds in (3.11), one obtains for sub- linear functional Sδ

k ∞

Sδ(fω) ≤ ω(ϕ(nj ))Sδ(hnj )+ ω(ϕ(nj ))Sδ(hnj ) (3.28) j=1 X j=Xk+1 k−1 ∞ (3.11) µ(δ) µ(δ) ≤ C2 ω(ϕ(nj )) + ω(ϕ(nk)) + ω(ϕ(nj ))  ϕ(nj ) ϕ(nk)  j=1 ! X j=Xk+1 (3.24), (3.26)   ω(ϕ(nk)) ≤ 2C2µ(δ) + 2C2ω(ϕ(nk+1)) ϕ(nk)

ω(ϕ(nk)) ≤ 2C2µ(δ) + 2C2ω(µ(δ)). (3.29) ϕ(nk)

The last estimate holds true because ϕ(nk+1) ≤ µ(δ) in (3.27). The first expression in (3.29) can be estimated by 4C2ω(µ(δ)) with (3.27): Because µ(δ) ≤ ϕ(nk), one can apply (3.4) to obtain ω(ϕ(n )) ω(µ(δ)) k ≤ 2 . ϕ(nk) µ(δ)

Thus, Sδ(fω) ≤ 6C2ω(µ(δ)). The second case is µ(δ) > ϕ(n1). In this situation, let k := 0. Then only the second sum in (3.28) has to be considered: Sδ(fω) ≤ 2C2ω(ϕ(n1)) ≤ 2C2ω(µ(δ)). The little-o condition remains to be proven without sub-additivity. From (3.5) one obtains E2n(f)= E2n(f + g − g) ≤ En(f + g)+ En(−g), i.e.,

(3.6) En(f + g) ≥ E2n(f) − En(−g) = E2n(f) − En(g). (3.30) The estimate can be used to show the desired lower bound based on resonance condition

(3.12) with the gliding hump method. Functional Enk is in resonance with summand

ω(ϕ(nk))hnk . This part of the sum forms the hump.

k−1 ∞

En (fω)= En ω(ϕ(n ))hn + ω(ϕ(nj ))hn + ω(ϕ(nj ))hn k k  k k j j  j=1 X j=Xk+1   (3.30) ∞ k−1 ≥ E2·n ω(ϕ(n ))hn + ω(ϕ(nj ))hn − En ω(ϕ(nj ))hn k  k k j  k j j=1 ! j=Xk+1 X   (3.30), (3.6) ∞ ≥ ω(ϕ(n ))E4·n (hn ) − E2·n ω(ϕ(nj ))hn k k k k  j  j=Xk+1 k−1  

− Enk ω(ϕ(nj ))hnj j=1 ! X (3.12), (3.7), (3.25) ∞ ω(ϕ(nk)) ≥ ω(ϕ(n ))c3 − D2n ω(ϕ(nj )) hn − k k  j X  k j=k+1 X (3.10), (3.26)  ω(ϕ( n ))  ≥ ω(ϕ(n ))c − D C 2ω(ϕ(n )) − k k 3 2nk 1 k+1 k (3.23) 2C + 1 ≥ c − 1 ω(ϕ(n )). 3 k k   4 Sharpness 17

Thus En(fω) 6= o(ω(ϕ(n))).

4 Sharpness

Free knot spline functions given by Heaviside, cut and ReLU functions are first examples for the application of Theorem 3.1. Let $S_n^r$ be the space of functions f for which n + 1 intervals $]x_k, x_{k+1}[$, $0 = x_0 < x_1 < \cdots < x_{n+1} = 1$, exist such that, restricted to each of these intervals, f coincides with an algebraic polynomial of degree less than r.

Corollary 4.1 (Free Knot Spline Approximation). For $r, \tilde{r} \in \mathbb{N}$, $1 \le p \le \infty$, and for each abstract modulus of smoothness ω satisfying (3.13), there exists a counterexample $f_\omega \in X^p[0,1]$ such that
$$\omega_r(f_\omega, \delta)_p = O\left(\omega(\delta^r)\right),$$
$$E(S_n^{\tilde{r}}, f_\omega)_p := \inf\{\|f_\omega - g\|_{X^p[0,1]} : g \in S_n^{\tilde{r}}\} \ne o\left(\omega\!\left(\frac{1}{n^r}\right)\right).$$
Note that r and $\tilde{r}$ can be chosen independently. This corresponds with the Marchaud inequality for moduli of smoothness. The following lemma helps in the proof of this and the next corollary. It is used to show the resonance condition of Theorem 3.1.

Lemma 4.1. Let $g : [0,1] \to \mathbb{R}$, and $0 = x_0 < x_1 < \cdots < x_{N+1} = 1$. Assume that for each interval $I_k := (x_k, x_{k+1})$, $0 \le k \le N$, either $g(x) \ge 0$ for all $x \in I_k$ or $g(x) \le 0$ for all $x \in I_k$ holds. Then g can change its sign only at points $x_k$. Let $h(x) := \sin(2N\cdot 2\pi\cdot x)$. Then there exists a constant $c > 0$ that is independent of g and N such that $\|h - g\|_{X^p[0,1]} \ge c > 0$.

The prerequisites on g are fulfilled if g is continuous with at most N zeroes.
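The resonance effect behind Lemma 4.1 can be observed numerically. In the following Python sketch (illustration only), g is the best piecewise constant approximation with N + 1 pieces, so it changes sign at most N times, and the grid-based sup-distance to $h(x) = \sin(2N\cdot 2\pi\cdot x)$ stays close to or above 1:

```python
import numpy as np

N = 8
x = np.linspace(0.0, 1.0, 20001)
h = np.sin(2 * N * 2 * np.pi * x)

# g: best piecewise constant approximation with N + 1 pieces (at most N sign changes)
edges = np.linspace(0.0, 1.0, N + 2)
g = np.zeros_like(x)
for a, b in zip(edges[:-1], edges[1:]):
    mask = (x >= a) & (x <= b)
    g[mask] = h[mask].mean()        # constant value on each piece

print(np.max(np.abs(h - g)))        # approx. 1 or larger, as guaranteed by Lemma 4.1
```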

Proof. We discuss 2N intervals $A_k := (k(2N)^{-1}, (k+1)(2N)^{-1})$, $0 \le k < 2N$. Function g can change its sign at most in N of these intervals. Let $J \subset \{0, 1, \dots, 2N-1\}$ be the set of indices k of the at least N intervals $A_k$ on which g is non-negative or non-positive. On each of these intervals, h maps to both its maximum 1 and its minimum −1. Thus $\|h - g\|_{B[0,1]} \ge 1$. This shows the Lemma for $p = \infty$. Functions h and g have different sign on (a, b) where $(a, b) = (k(2N)^{-1}, (k + 1/2)(2N)^{-1})$ or $(a, b) = ((k + 1/2)(2N)^{-1}, (k+1)(2N)^{-1})$, $k \in J$. For all $x \in (a, b)$ we see that $|h(x) - g(x)| \ge |h(x)| = \sin(2N\cdot 2\pi\cdot(x - a))$. Thus, for $1 \le p < \infty$,
$$\|h - g\|_{L^p[0,1]} \ge \left[ \sum_{k\in J} \int_{A_k} |h - g|^p \right]^{1/p} \ge \left[ N \int_0^{\frac{1}{4N}} \sin^p(2N\cdot 2\pi\cdot x)\,dx \right]^{1/p} \ge \left[ \frac{N}{N\,4\pi} \int_0^{\pi} \sin^p(u)\,du \right]^{1/p} =: c > 0.$$

Proof. (of Corollary 4.1) Theorem 3.1 can be applied with following parameters. Let Banach-space X = Xp[0, 1].

r˜ En(f) := E(Sn, f)p, Sδ(f) := ωr(f, δ)p.

Whereas Sδ is a sub-linear, bounded functional, errors of best approximation En fulfill r conditions (3.5), (3.6), (3.7), and (3.8), cf. (3.1), with Dn = 1. Let µ(δ) := δ and ϕ(x) = 1/xr such that condition (3.9) holds: ϕ(λx) = ϕ(x)/λr. Let N = N(n) := (4n + 1)˜r. Resonance elements

hn(x) := sin(2N · 2π · x) = sin((16n + 4)˜rπ · x) obviously satisfy condition (3.10): khn(x)kXp[0,1] ≤ 1 =: C1. One obtains (3.11) because of

r r Sδ(hn)= ωr(hn, δ)p ≤ 2 khnkXp[0,1] ≤ 2 and (see (1.1)) r (r) r r Sδ(hn)= ωr(hn, δ)p ≤ δ khn kXp[0,1] ≤ ((16n + 4)˜rπ) δ µ(δ) ≤ (˜rπ)rδr(20n)r = (˜r20π)r . ϕ(n)

r˜ Let g ∈ S4n, then g is composed from at most 4n + 1 polynomials on 4n + 1 intervals. On each of these intervals, g ≡ 0 or g at most hasr ˜ − 1 zeroes. Thus g can change sign at 4n interval borders and at zeroes of polynomials, and g fulfills the prerequisites of Lemma 4.1 with N = (4n + 1) · r˜ > 4n + (4n + 1) · (˜r − 1). Due to the lemma, khn − gkXp[0,1] ≥ c > 0 independent of n and g. Since this holds true for all g, (3.12) is shown for c3 = c. All preliminaries of Theorem 3.1 are fulfilled such that counterexamples exist as stated.

Corollary 4.1 can be applied with respect to all activation functions σ belonging to the class of splines with fixed polynomial degree less than r and a finite number of r knots k because Φn,σ ⊂ Snk. 1 Since Φn,σh ⊂ Sn, Corollary 4.1 directly shows sharpness of (1.2) and (1.4) for the Heaviside activation function if one chooses r =r ˜ = 1. Sharpness of (1.3) for 2 2 cut and ReLU function follows for r =r ˜ = 2 because Φn,σc ⊂ S2n, Φn,σr ⊂ Sn. However, the case ω(δ)= δ of maximum non-saturated convergence order is excluded by condition (3.13). We discuss this case for r =r ˜. Then a simple counterexample r is fω(x) := x . For each sequence of coefficients d0,...,dr−1 ∈ R we can apply the fundamental theorem of algebra to find complex zeroes a0,...,ar−1 ∈ C such that

r−1 r−1 r k x − dkx = |x − ak|.

k=0 k=0 X Y −1 −1 There exists an interval I := (j(r + 1) , ( j +1)(r + 1) ) ⊂ [0, 1] such that real parts −1 of complex numbers ak are not in I for all 0 ≤ k

r−1 1 r |x − a |≥ =: c > 0. k 4(r + 1) ∞ kY=0   This lower bound is independent of coefficients dk such that

r r inf{kx − q(x)kB[0,1] : q ∈ Π }≥ c∞ > 0. 4 Sharpness 19

We also see that p 1 r−1 1 r k 1 2(r+1) p x − dkx dx ≥ pr dx = pr =: cp > 0, 0 I0 [4(r + 1)] [4(r + 1)] Z k=0 Z X r r inf{k x − q(x)kLp[0,1] : q ∈ Π }≥ cp > 0. r Each function g ∈ Sn is a polynomial of degree less than r on at least n intervals (j(2n)−1, (j + 1)(2n)−1), j ∈ J ⊂ {0, 1,..., 2n − 1}. For j ∈ J:

r r x j c∞ j j+1 inf kx − q(x)kB , = inf + − q(x) ≥ r . q∈Πr [ 2n 2n ] q∈Πr 2n 2n (2n)   B[0,1]

r r 1 p Thus, E(Sn,x ) 6= o nr . In case of L -spaces, we similarly obtain with substitution u = 2nx − j  +1 j 1 r p 2n 1 u j inf |xr − q(x)|p dx = inf + − q(u) du r r q∈Π j q∈Π 2n 0 2n 2n Z 2n Z  

1 1 1 cp = inf |ur − q(u)|p du ≥ p . q∈Πr 2n (2n)pr 2n · (2n)pr Z0 Sharpness is demonstrated by combining lower estimates of all n subintervals:

p r r p cp r r cp 1 E(Sn,x )p ≥ n pr+1 ,E(Sn,x )p ≥ 1 6= o r . (2n) r+ r n 2 p n   Although our counterexample is arbitrarily often differentiable, the convergence or- der is limited to n−r. Reason is the definition of the activation function by piecewise polynomials. There is no such limitation for activation functions that are arbitrar- ily often differentiable on an interval without being a polynomial, see Theorem 2.2. Thus, neural networks based on smooth non-polynomial activation functions might approximate better if smooth functions have to be learned. Theorem 3 in [15] states for the Heaviside function that for each n ∈ N a function fn ∈ C[0, 1] exits such that the error of best uniform approximation exactly equals 1 ω1 fn, 2(n+1) . This is used to show optimality of the constant. Functions fn might be different for different n. One does not get the condensed sharpness result of Corol- lary 4.1. Another relevant example of a spline of fixed polynomial degree with a finite num- ber of knots is the square non-linearity (SQNL) activation function σ(x) := sign(x) for |x| > 2 and σ(x) := x − sign(x) · x2/4 for |x| ≤ 2. Because σ, restricted to each of the four sub-intervals of piecewise definition, is a polynomial of degree two, we can chooser ˜ = 3. The proof of Corollary 4.1 is based on Lemma 4.1. This argument can be also used to deal with rational activation functions σ(x) = q1(x)/q2(x) where q1, q2 6≡ 0 are polynomials of degree at most ρ. Then non-zero functions g ∈ Φn,σ do have at most ρn zeroes and ρn poles such that there is no change of sign on N + 1 intervals, N = 2ρn. Thus, Corollary 1 can be extended to neural network approximation with rational activation functions in a straight forward manner. Whereas the direct estimate (1.3) for cut and ReLU functions is based on lin- ear best approximation, the counterexamples hold for non-linear best approximation. Thus, error bounds in terms of moduli of smoothness may not be able to express the 4 Sharpness 20

advantages of non-linear free knot spline approximation in contrast to fixed knot spline approximation (cf. [45]). For an error measured in an Lp norm with an order like n−α, smoothness only is required in Lq , q := 1/(α + 1/p), see (1.7) and [16, p. 368].

Corollary 4.2 (Inverse Tangent). Let σ = σa be the sigmoid function based on the inverse tangent function, r ∈ N, and 1 ≤ p ≤ ∞. For each abstract modulus of p smoothness ω satisfying (3.13), there exists a counterexample fω ∈ X [0, 1] such that

$$\omega_r(f_\omega, \delta)_p = O\left(\omega(\delta^r)\right) \quad\text{and}\quad E(\Phi_{n,\sigma_a}, f_\omega)_p \ne o\left(\omega\!\left(\frac{1}{n^r}\right)\right).$$
The corollary shows sharpness of the error bound in Theorem 2.2 applied to the arbitrarily often differentiable function $\sigma_a$.

Proof. Similarly to the proof of Corollary 4.1, we apply Theorem 3.1 with parameters $X = X^p[0,1]$, $E_n(f) := E(\Phi_{n,\sigma_a}, f)_p$, $S_\delta(f) := \omega_r(f, \delta)_p$, $\mu(\delta) := \delta^r$, $\varphi(x) = 1/x^r$, and $h_n(x) := \sin(16n\cdot 2\pi x) = \sin(2N\cdot 2\pi x)$ with $N = N(n) := 8n$, such that condition (3.10) is obvious and (3.11) can be shown by estimating the modulus in terms of the rth derivative of $h_n$ with (1.1). Let $g \in \Phi_{4n,\sigma_a}$,
$$g(x) = \sum_{k=1}^{4n} a_k\left( \frac{1}{2} + \frac{1}{\pi}\arctan(b_k x + c_k) \right).$$
Then
$$g'(x) = \frac{1}{\pi}\sum_{k=1}^{4n} \frac{a_k b_k}{1 + (b_k x + c_k)^2} = \frac{s(x)}{q(x)}$$
where s(x) is a polynomial of degree 2(4n−1), and q(x) is a polynomial of degree 8n. If g is not constant then g′ at most has 8n−2 zeroes and g at most has 8n−1 zeroes due to the mean value theorem (Rolle’s theorem). In both cases, the requirements of Lemma 4.1 are fulfilled with $N(n) = 8n > 8n-1$ such that $\|h_n - g\|_{X^p[0,1]} \ge c > 0$ independent of n and g. Since g can be chosen arbitrarily, (3.12) is shown with $E_{4n}(h_n) \ge c > 0$.

Whereas lower estimates for sums of n inverse tangent functions are easily obtained by considering O(n) zeroes of their derivatives, sums of n logistic functions (or hyper- bolic tangent functions) might have an exponential number of zeroes. To illustrate the problem in the context of Theorem 3.1, let

$$g(x) := \sum_{k=1}^{4n} \frac{a_k}{1 + e^{-c_k}\left(e^{-b_k}\right)^x} \in \Phi_{4n,\sigma_l}. \qquad (4.1)$$
Using a common denominator, the numerator is a sum of type
$$\sum_{k=1}^{m} \alpha_k \kappa_k^x$$
for some $\kappa_k > 0$ and $m < 2^{4n}$. According to [50], such a function has at most $m - 1 < 16^n - 1$ zeroes, or it equals the zero function. Therefore, an interval $[k\cdot 16^{-n}, (k+1)\cdot 16^{-n}]$ exists on which g does not change its sign. By using a resonance sequence $h_n(x) := \sin(16^n\cdot 2\pi x)$, one gets $E(\Phi_{4n,\sigma_l}, h_n) \ge 1$. But factor $16^n$ is by far too large. One has to choose $\varphi(x) := 1/16^x$ and $\mu(\delta) := \delta$ to obtain a “counterexample” $f_\omega$ with
$$E(\Phi_{n,\sigma_l}, f_\omega) \in O\left(\omega\!\left(\frac{1}{n}\right)\right) \quad\text{and}\quad E(\Phi_{n,\sigma_l}, f_\omega) \ne o\left(\omega\!\left(\frac{1}{16^n}\right)\right). \qquad (4.2)$$
The gap between rates is obvious. The same difficulties do not only occur for the logistic function but also for other activation functions based on $\exp(x)$ like the softmax function $\sigma_m(x) := \log(\exp(x) + 1)$. Similar to (2.8),

$$\frac{\partial}{\partial c}\,\sigma_m(bx + c) = \frac{\exp(bx + c)}{\exp(bx + c) + 1} = \sigma_l(bx + c).$$
Thus, sums of n logistic functions can be approximated uniformly and arbitrarily well by sums of differential quotients that can be written by 2n softmax functions. A lower bound for approximation with $\sigma_m$ would also imply a similar bound for $\sigma_l$, and upper bounds for approximation with $\sigma_l$ imply upper bounds for $\sigma_m$. With respect to the logistic function, a better estimate than (4.2) is possible. It can be condensed from a sequence of counterexamples that is derived in [39]. However, we show that the Vapnik-Chervonenkis dimension (VC dimension) of related function spaces can also be used to prove sharpness. This is a rather general approach since many VC dimension estimates are known. Let X be a set and A a family of subsets of X. Throughout this paper, X can be assumed to be finite. One says that A shatters a set $S \subset X$ if and only if each subset $B \subset S$ can be written as $B = S \cap A$ for a set $A \in$ A. The VC dimension of A is defined via

VC-dim(A) := sup{k ∈ N : ∃S ⊂ X with cardinality |S| = k such that S is shattered by A}.

This general definition can be adapted to (non-linear) function spaces V that consist of functions g : X → R on a (finite) set X ⊂ R. By applying Heaviside-function σh, let

A := {A ⊂ X : ∃g ∈ V :(∀x ∈ A : σh(g(x)) = 1) ∧

(∀x ∈ X \ A : σh(g(x))= 0)}.

Then the VC dimension of function space V is defined as VC-dim(V ) := VC-dim(A). This is the largest number m ∈ N for which m points x1, ..., xm ∈ X exist such that for each sign sequence s1,...,sm ∈ {−1, 1} a function g ∈ V can be found that fulfills

$$s_i\cdot\left(\sigma_h(g(x_i)) - \frac{1}{2}\right) > 0, \quad 1 \le i \le m, \qquad (4.3)$$
cf. [6]. The VC dimension is an indicator for the number of degrees of freedom in the construction of V. Condition (4.3) is equivalent to

$$\sigma_h(g(x_i)) = \sigma_h(s_i), \quad 1 \le i \le m.$$
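Shattering in the sense of (4.3) can be checked by brute force on a small grid. The following Python sketch (an illustration with a crudely sampled parameter family, not an exact computation for the spaces used below) determines the largest shattered subset for dichotomies induced by single Heaviside neurons:

```python
import itertools
import numpy as np

def vc_dimension(num_points, dichotomies):
    """Largest m such that some m of the points are shattered by the 0/1 labelings."""
    labelings = {tuple(int(v) for v in d) for d in dichotomies}
    for m in range(num_points, 0, -1):
        for subset in itertools.combinations(range(num_points), m):
            patterns = {tuple(lab[i] for i in subset) for lab in labelings}
            if len(patterns) == 2 ** m:      # every sign pattern is realized: shattered
                return m
    return 0

# Dichotomies x -> sigma_h(g(x)) for g(x) = a * sigma_h(b*x + c) on a small grid,
# with parameters sampled crudely.
grid = np.linspace(0.0, 1.0, 5)
dichotomies = []
for a in (-1.0, 1.0):
    for b in (-1.0, 1.0):
        for c in np.linspace(-1.5, 1.5, 61):
            g = a * (b * grid + c >= 0)          # values in {0, a}
            dichotomies.append(g >= 0)           # sigma_h applied to g(x)
print(vc_dimension(len(grid), dichotomies))      # prints 2: half-lines shatter 2 points, not 3
```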

Corollary 4.3 (Sharpness due to VC Dimension). Let $(V_n)_{n=1}^{\infty}$ be a sequence of (non-linear) function spaces $V_n \subset B[0,1]$ such that
$$E_n(f) := \inf\{\|f - g\|_{B[0,1]} : g \in V_n\}$$
fulfills conditions (3.5)–(3.8) on Banach space C[0, 1]. Let $\tau : \mathbb{N} \to \mathbb{N}$ and
$$G_n := \left\{ \frac{j}{\tau(n)} : j \in \{0, 1, \dots, \tau(n)\} \right\}$$
be an equidistant grid on the interval [0, 1]. We restrict functions in $V_n$ to this grid:
$$V_{n,\tau(n)} := \{h : G_n \to \mathbb{R} : h(x) = g(x) \text{ for a function } g \in V_n\}.$$
Let function $\varphi(x)$ be defined as in Theorem 3.1 such that (3.9) holds true. If, for a constant $C > 0$, function τ fulfills
$$\text{VC-dim}(V_{n,\tau(n)}) < \tau(n), \qquad (4.4)$$
$$\tau(4n) \le \frac{C}{\varphi(n)}, \qquad (4.5)$$
for all $n \ge n_0 \in \mathbb{N}$, then for $r \in \mathbb{N}$ and each abstract modulus of smoothness ω satisfying (3.13), a counterexample $f_\omega \in C[0,1]$ exists such that
$$\omega_r(f_\omega, \delta) = O\left(\omega(\delta^r)\right) \quad\text{and}\quad E_n(f_\omega) \ne o\left(\omega\!\left([\varphi(n)]^r\right)\right).$$

Proof. Let n ≥ n0/4. Due to (4.4), a sign sequence s0,...,sτ(4n) ∈ {−1, 1} exists i such that for each function g ∈ V4n there is a point x0 = τ(4n) ∈ G4n such that σh(g(x0)) 6= σh(si). We utilize this sign sequence to construct resonance elements. It is well known, that auxiliary function

exp 1 − 1 for |x| < 1, h(x) := 1−x2 ( 0  for |x|≥ 1, is arbitrarily often differentiable on the real axis, h(0) = 1, khkB(R) = 1. This function becomes the building block for the resonance elements:

τ(4n) i h (x) := s · h 2τ(4n) x − . n i τ(4n) i=0 X   

The interior of the support of summands is non-overlapping, i.e., khnkB[0,1] ≤ 1, and (r) −r because of (4.5) norm khn kB[0,1] of the rth derivative is in O([ϕ(n)] ). We apply Theorem 3.1 with

r X = C[0, 1], Sδ(f) := ωr(f, δ), µ(δ) := δ ,

En(f) as defined in the theorem, and resonance elements hn(x) that represent the existing sign sequence. Function [ϕ(x)]r fulfills the requirements of function ϕ(x) in Theorem 3.1. Then conditions (3.10) and (3.11) are fulfilled due to the norms of hn and its derivatives, cf. (1.1). Due to the initial argument of the proof, for each g ∈ V4n at −1 least one point x0 = iτ(4n) , 0 ≤ i ≤ τ(4n), exists such that we observe σh(g(x0)) 6= σh(hn(x0)). Since |hn(x0)| = 1, we get khn − gkB[0,1] ≥ |hn(x0) − g(x0)| ≥ 1, and (3.12) because E4nhn ≥ 1. 4 Sharpness 23

Corollary 4.4 (Logistic Function). Let $\sigma = \sigma_l$ be the logistic function and $r \in \mathbb{N}$. For each abstract modulus of smoothness ω satisfying (3.13), a counterexample $f_\omega \in C[0,1]$ exists such that
$$\omega_r(f_\omega, \delta) = O\left(\omega(\delta^r)\right) \quad\text{and}\quad E(\Phi_{n,\sigma_l}, f_\omega) \ne o\left(\omega\!\left(\frac{1}{(n[1 + \log_2(n)])^r}\right)\right).$$
The corollary extends the Theorem of Maiorov and Meir for worst case approximation with sigmoid functions in the case $p = \infty$ to Lipschitz classes and one condensed counterexample (instead of a sequence), see [39, p. 99]. The sharpness estimate also holds in $L^p[0,1]$, $1 \le p < \infty$. For all these spaces, one can apply Theorem 3.1 directly with the sequence of counterexamples constructed in [39, Lemma 7, p. 99]. Even more generally, Theorem 1 in [38] utilizes pseudo-dimension, a generalization of VC dimension, to provide bounded sequences of counterexamples in Sobolev spaces.

Proof. We apply Corollary 4.3 in connection with a result concerning the VC dimen- sion of function space (D ∈ N) D R Φn,σl := g : {−D, −D + 1,...,D}→ : n  g(x)= a0 + akσl(bkx + ck) : a0,ak,bk, ck ∈ R , k=1 X see paper [6] that is based on [26], cf. [32]. Functions are defined on a discrete set with 2D + 1 elements, and in contrast to the definition of Φn,σ , a constant function with coefficient a0 is added. D According to Theorem 2 in [6], the VC dimension of Φn,σl is upper bounded by (n large enough)

2 · (3 · n + 1) · log2(24e(3 · n + 1)D), i.e., there exists n0 ∈ N, n0 ≥ 2, and a constant C > 0 such that for all n ≥ n0

D VC-dim(Φn,σl ) ≤ Cn[log2(n) + log2(D)].

1+log2(E) Since limE→∞ E = 0, we can choose a constant E > 1 such that 1 + log (E) 1 2 < . E 4C

With this constant, we define D = D(n) := ⌊En(1 + log2(n))⌋ such that the VC D dimension of Φn,σl is less than D for n ≥ n0:

D VC-dim(Φn,σl ) ≤ Cn[log2(n) + log2(En(1 + log2(n)))]

≤ Cn[log2(n) + log2(En(log2(n) + log2(n)))]

≤ Cn[2 log2(n) + log2(E) + log2(2 log2(n)))]

≤ Cn[3 log2(n) + log2(E) + 1)] ≤ 4Cn log2(n)[1 + log2(E)]

< En log2(n) ≤ ⌊En(1 + log2(n))⌋ = D. By applying an affine transform that maps interval [−D, D] to [0, 1] and by omit- ting constant function a0, we immediately see that for Vn := Φn,σl and τ(n) := 2D(n) τ(n) VC-dim(V ) < D(n)= <τ(n) n,τ(n) 2 4 Sharpness 24

such that (4.4) is fulfilled. We define strictly decreasing 1 ϕ(x) := . x[1 + log2(x)]

−2 Obviously, limx→∞ ϕ(x) = 0. Condition (3.9) holds: Let x > X0(λ) := λ . Then log2(λ) > − log2(x)/2 and 1 1 1 1 2 ϕ(λx)= < 1 < ϕ(x). λ x(1 + log2(x) + log2(λ)) λ x(1+ 2 log2(x)) λ Also, (4.5) is fulfilled:

τ(4n) = 2D(4n) = 2E · 4n(1 + log2(4n))=8E · n(3 + log2(n)) 24E < 24E · n(1 + log (n)) = . 2 ϕ(n) Thus, all prerequisites of Corollary 4.3 are shown.

The corollary improves (4.2): There exists a counterexample fω ∈ C[0, 1] such that (see (2.2), (2.9))

$$\omega_r(f_\omega, \delta) \in O\left(\omega(\delta^r)\right), \quad E(\Phi_{n,\sigma_l}, f_\omega) \in O\left(\omega\!\left(\frac{1}{n^r}\right)\right) \quad\text{and}\quad E(\Phi_{n,\sigma_l}, f_\omega) \ne o\left(\omega\!\left(\frac{1}{n^r[1 + \log_2(n)]^r}\right)\right). \qquad (4.6)$$

The proof is based on the O(n log2(n)) estimate of VC dimension in [6]. This requires functions to be defined on a finite grid. Without this prerequisite, the VC dimension is in Ω(n2), see [48, p. 235]. The referenced book also deals with the case that all weights are restricted to floating point numbers with a fixed number of bits. Then the VC dimension becomes bounded by O(n) without the need for the log-factor. However, direct upper error bounds (2.2) and (2.9) are proved for real-valued weights only. The preceding corollary is a prototype for proving sharpness based on known VC dimensions. Also at the price of a log-factor, the VC dimension estimate for radial basis functions in [6] or [46] can be used similarly in connection with Corollary 4.3 to construct counterexamples. The sharpness results for Heaviside, cut, ReLU and inverse tangent activation functions shown above for p = ∞ can also be obtained with Corollary 4.3 by proving that VC dimensions of corresponding function spaces Φn,σ are in O(n) (whereas the result of [5] only provides an O(n log(n)) bound). This in turn can be shown by estimating the maximum number of zeroes like in the proof of the next corollary and in the same manner as in the proofs of Corollaries 4.1 and 4.2. The problem of different rates in upper and lower bounds arises because different scaling coefficients bk are allowed. In the case of uniform scaling, i.e. all coefficients bk in (4.1) have the same value bk = B = B(n), the number of zeroes is bounded by 4n − 1 instead of 16n − 2. Let

$$\tilde{\Phi}_{n,\sigma_l} := \left\{ g : [0,1] \to \mathbb{R} :\ g(x) = \sum_{k=1}^{n} a_k \sigma_l(Bx + c_k),\ a_k, B, c_k \in \mathbb{R} \right\}$$

be the non-linear space generated by uniform scaling. Because the quasi-interpolation operators used in the proof of Debao's direct estimate (1.2), see [15], are defined using such uniform scaling, see [9, p. 172], the error bound
$$E(\tilde{\Phi}_{n,\sigma_l}, f) := \inf\{\|f-g\|_{B[0,1]} : g \in \tilde{\Phi}_{n,\sigma_l}\} \le \omega_1\!\left(f, \frac{1}{n}\right)$$
holds. This bound is sharp:

Corollary 4.5 (Logistic Function with Restriction). Let $\sigma = \sigma_l$ be the logistic function and $r \in \mathbb{N}$. For each abstract modulus of smoothness $\omega$ satisfying (3.13), there exists a counterexample $f_\omega \in C[0,1]$ such that
$$\omega_r(f_\omega, \delta) \in \mathcal{O}\left(\omega(\delta^r)\right) \quad\text{and}\quad E(\tilde{\Phi}_{n,\sigma_l}, f_\omega) \neq o\left(\omega\!\left(\frac{1}{n^r}\right)\right).$$
To prove the corollary, we apply the following lemma.

Lemma 4.2. $V_n \subset C[0,1]$, $2 \le \tau(n) \in \mathbb{N}$, $G_n := \left\{\frac{j}{\tau(n)} : j \in \{0,1,\dots,\tau(n)\}\right\}$, and $V_{n,\tau(n)} := \{h : G_n \to \mathbb{R} :\ h(x) = g(x) \text{ for a function } g \in V_n\}$ are given as in Corollary 4.3. If $\text{VC-dim}(V_{n,\tau(n)}) \ge \tau(n)$ then there exists a function $g \in V_n$, $g \not\equiv 0$, with a set of at least $\lfloor \tau(n)/2 \rfloor$ zero points in $[0,1]$ such that $g$ has non-zero function values between each two consecutive points of this set.

Proof. Because of the VC dimension, a subset $J \subset \{0,1,\dots,\tau(n)\}$ with $\tau(n)$ elements $j_1 < j_2 < \dots < j_{\tau(n)}$ and a function $g \in V_n$ exist such that

$$\sigma_h\!\left(g\!\left(\frac{j_k}{\tau(n)}\right)\right) = \frac{1+(-1)^{k+1}}{2}, \quad 1 \le k \le \tau(n).$$
Then we find a zero on each interval $[j_{2k-1}\tau(n)^{-1}, j_{2k}\tau(n)^{-1})$, $1 \le k \le \lfloor \tau(n)/2 \rfloor$: If $g(j_{2k-1}\tau(n)^{-1}) \neq 0$ then the continuous function $g$ has a zero on the interval

$$\left(j_{2k-1}\tau(n)^{-1},\, j_{2k}\tau(n)^{-1}\right).$$

Thus, $g$ has $\lfloor \tau(n)/2 \rfloor$ zeroes on different sub-intervals. Between the zeroes there are non-zero, negative function values $g(j_k \tau(n)^{-1})$ for even $k$ because $\sigma_h(g(j_k \tau(n)^{-1})) = 0$.

Proof (of Corollary 4.5). We apply Corollary 4.3 with $V_n = \tilde{\Phi}_{n,\sigma_l}$ and $E_n(f) = E(\tilde{\Phi}_{n,\sigma_l}, f)$ such that conditions (3.5)–(3.8) are fulfilled. Let $\tau(n) := 2n$ and $\varphi(n) = 1/n$ such that (4.5) holds true: $\tau(4n) = 8n = 8/\varphi(n)$. Assume that $\text{VC-dim}(V_{n,\tau(n)}) \ge \tau(n)$. Then, according to Lemma 4.2, there exists a function $f \in \tilde{\Phi}_{n,\sigma_l}$, $f \not\equiv 0$, with at least $\lfloor \tau(n)/2 \rfloor = n$ zeroes. However, we can write $f$ as

$$f(x) = \sum_{k=1}^{n} \frac{a_k}{1 + e^{-c_k}(e^{-B})^x} = \frac{s(x)}{q(x)}.$$
Using a common denominator $q(x)$, the numerator is a sum of type

$$s(x) = \sum_{k=0}^{n-1} \alpha_k \left(e^{-kB}\right)^x,$$
which has at most $n-1$ zeroes, see [50]. Because of this contradiction to $n$ zeroes, (4.4) is fulfilled.
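The zero-count bound behind this contradiction can be checked numerically. The following minimal Python sketch is only an illustration (the function name, the grid resolution, the choice $B = 10$ and the random coefficients are assumptions made here, not part of the proof): it samples a uniformly scaled sum of logistic functions on a fine grid and counts its sign changes; with one common scaling factor $B$, at most $n-1$ sign changes can occur.

```python
import numpy as np

def count_sign_changes(a, B, c, grid_size=100_000):
    """Count sign changes on [0, 1] of g(x) = sum_k a_k * sigma_l(B*x + c_k),
    i.e. of a member of the uniformly scaled space tilde-Phi_{n, sigma_l}.
    Each sign change indicates a zero of the continuous function g."""
    x = np.linspace(0.0, 1.0, grid_size)
    # logistic activations with one common scaling factor B
    g = (a[None, :] / (1.0 + np.exp(-(B * x[:, None] + c[None, :])))).sum(axis=1)
    signs = np.sign(g[np.abs(g) > 1e-12])  # ignore samples that are (nearly) zero
    return int(np.sum(signs[1:] != signs[:-1]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 6
    for trial in range(5):
        a = rng.standard_normal(n)      # arbitrary coefficients a_k
        c = rng.uniform(-10.0, 0.0, n)  # arbitrary shifts c_k
        changes = count_sign_changes(a, B=10.0, c=c)
        print(f"trial {trial}: {changes} sign changes, bound n - 1 = {n - 1}")
```

With random coefficients one typically observes far fewer sign changes than $n-1$; what matters for the proof is only that this bound cannot be exceeded.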

By applying Lemma 4.1 for $N(n) = n-1$ in connection with Theorem 3.1, one can also show Corollary 4.5 for $L^p$-spaces, $1 \le p < \infty$. Linear VC dimension bounds were proved in [47] for radial basis function networks with uniform width (scaling) or uniform centers. Such bounds can be used with Corollary 4.3 to prove results that are similar to Corollary 4.5. Also, such a corollary can be shown for the ELU function $\sigma_e$. However, without the restriction $b_k = B(n)$, piecewise superposition of exponential functions leads to $\mathcal{O}(n^2)$ zeroes of sums of ELU functions. Then, in combination with the direct estimates of Theorems 2.1 and 2.2, i.e., $E(\Phi_{n,\sigma_e}, f_\omega) \le C_r \omega_r\!\left(f_\omega, \frac{1}{n}\right)$, we directly obtain the following (improvable) result in a straightforward manner.

Corollary 4.6 (Coarse estimate for ELU activation). Let $\sigma = \sigma_e$ be the ELU function and $r \in \mathbb{N}$, $n \ge \max\{2, r\}$ (see Theorem 2.1). For each abstract modulus of smoothness $\omega$ satisfying (3.13), there exists a counterexample $f_\omega \in C[0,1]$ that fulfills

$$E(\Phi_{n,\sigma_e}, f_\omega) \le C_r \omega_r\!\left(f_\omega, \frac{1}{n}\right) \in \mathcal{O}\left(\omega\!\left(\frac{1}{n^r}\right)\right) \quad\text{and}\quad E(\Phi_{n,\sigma_e}, f_\omega) \neq o\left(\omega\!\left(\frac{1}{n^{2r}}\right)\right). \qquad (4.7)$$

Proof. To prove the existence of a function $f_\omega$ with $\omega_r(f_\omega, \delta) \in \mathcal{O}(\omega(\delta^r))$ and (4.7), we apply Corollary 4.3 with $V_n = \Phi_{n,\sigma_e}$ and $E_n(f) = E(\Phi_{n,\sigma_e}, f)$ such that conditions (3.5)–(3.8) are fulfilled. For each function $g \in V_n$ the interval $[0,1]$ can be divided into at most $n+1$ subintervals such that on the $l$th interval $g$ equals a function $g_l$ of type
$$g_l(x) = \gamma_l + \delta_l x + \sum_{k=1}^{n} \alpha_{l,k} \exp(\beta_{l,k} x).$$
The derivative
$$g_l'(x) = \delta_l \exp(0 \cdot x) + \sum_{k=1}^{n} \alpha_{l,k} \beta_{l,k} \exp(\beta_{l,k} x)$$
has at most $n$ zeroes or equals the zero function according to [50]. Thus, due to the mean value theorem (or Rolle's theorem), $g_l$ has at most $n+1$ zeroes or is the zero function. By concatenating the functions $g_l$ to $g$, one observes that $g$ has at most $(n+1)^2$ different zeroes such that $g$ does not vanish between such consecutive zero points. Let $\tau(n) := 8n^2$ and $\varphi(n) = 1/n^2$ such that (4.5) holds true: $\tau(4n) = 128n^2 = 128/\varphi(n)$. If $\text{VC-dim}(V_{n,\tau(n)}) \ge \tau(n)$ then, due to Lemma 4.2 and because $n \ge 2$, there exists a function in $\Phi_{n,\sigma_e}$ with at least $\lfloor \tau(n)/2 \rfloor = (2n)^2 > (n+1)^2$ zeroes such that between consecutive zeroes the function is not the zero function. This contradicts the previously determined number of zeroes and (4.4) is fulfilled.
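As a small worked instance (only to make the counting concrete): for $n = 2$, a sum of two ELU terms is composed piecewise of at most $n+1 = 3$ functions $g_l$, each with at most $n+1 = 3$ zeroes, hence at most $(n+1)^2 = 9$ zeroes between which it does not vanish, whereas $\lfloor \tau(2)/2 \rfloor = \lfloor 32/2 \rfloor = 16 = (2\cdot 2)^2 > 9$.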

Sums of $n$ softsign functions $x \mapsto x/(1+|x|)$ can be expressed piecewise by $n+1$ rational functions that each have at most $n$ zeroes. Thus, one also has to deal with $\mathcal{O}(n^2)$ zeroes.

5 Conclusions

Corollaries 4.1 and 4.2 can be seen in the context of Lipschitz classes. Let $r := 1$ for the Heaviside function, $r := 2$ for the cut and ReLU functions, and $r \in \mathbb{N}$ for the inverse tangent

activation function $\sigma = \sigma_a$. By choosing $\omega(\delta) := \delta^\alpha$, a counterexample $f_\alpha \in X^p[0,1]$ exists for each $\alpha \in (0, r)$ such that

$$\omega_r(f_\alpha, \delta)_p \in \mathcal{O}(\delta^\alpha), \quad E(\Phi_{n,\sigma}, f_\alpha)_p \in \mathcal{O}\!\left(\frac{1}{n^\alpha}\right), \quad\text{and}\quad E(\Phi_{n,\sigma}, f_\alpha)_p \neq o\left(\frac{1}{n^\alpha}\right).$$
With respect to Corollary 4.4 for the logistic function, in general no higher convergence order $\frac{1}{n^\beta}$, $\beta > \alpha$, can be expected for functions in the Lipschitz class that is defined via $\mathrm{Lip}^r(\alpha, C[0,1]) := \{f \in C[0,1] : \omega_r(f, \delta) \in \mathcal{O}(\delta^\alpha)\}$. (For instance, for $\alpha \in (0,1)$ the function $f(x) = x^\alpha$ belongs to this class because $\omega_1(f, \delta) = \delta^\alpha$; the counterexamples $f_\alpha$, however, are different, condensed functions.) In terms of (non-linear) Kolmogorov $n$-width, let $X := \mathrm{Lip}^r(\alpha, C[0,1])$. Then, for example, the condensed counterexamples $f_\alpha$ for piecewise linear or inverse tangent activation functions and $p = \infty$ imply

$$W_n := \inf_{b_1,\dots,b_n,\,c_1,\dots,c_n}\ \sup_{f \in \mathrm{Lip}^r(\alpha, C[0,1])}\ \inf_{a_1,\dots,a_n} \left\| f(\cdot) - \sum_{k=1}^{n} a_k \sigma(b_k \cdot + c_k) \right\|_{B[0,1]}$$
$$\ge \inf_{b_1,\dots,b_n,\,c_1,\dots,c_n}\ \inf_{a_1,\dots,a_n} \left\| f_\alpha(\cdot) - \sum_{k=1}^{n} a_k \sigma(b_k \cdot + c_k) \right\|_{B[0,1]} = E(\Phi_{n,\sigma}, f_\alpha) \neq o\left(\frac{1}{n^\alpha}\right).$$
The restriction to the univariate case of a single input node was chosen because of compatibility with most cited error bounds. However, the error of multivariate approximation with certain activation functions can be bounded by the error of best multivariate polynomial approximation, see the proof of Theorem 6.8 in [42, p. 176]. Thus, one can obtain estimates in terms of multivariate radial moduli of smoothness similar to Theorem 2.2 via [30, Corollary 4, p. 139]. Also, Theorem 3.1 can be applied in a multivariate context in connection with VC dimension bounds. First results are shown in the report [25]. Without additional restrictions, a lower estimate for approximation with the logistic function $\sigma_l$ could only be obtained with a log-factor in (4.6). Thus, either the direct bounds (1.2) and (2.2) or the sharpness result (4.6) can be improved slightly.

Acknowledgment

I would like to thank an anonymous reviewer, Michael Gref, Christian Neumann, Peer Ueberholz, and Lorens Imhof for their valuable comments.

References

[1] Almira, J.M., de Teruel, P.E.L., Romero-Lopez, D.J., Voigtlaender, F.: Negative results for approximation using single layer and multilayer feedforward neural networks. arXiv (2018). URL http://arxiv.org/abs/1810.10032
[2] Anastassiou, G.A.: Multivariate hyperbolic tangent neural network approximation. Computers & Mathematics with Applications 61(4), 809–821 (2011)
[3] Anastassiou, G.A.: Univariate hyperbolic tangent neural network quantitative approximation. In: Intelligent Systems: Approximation by Artificial Neural Networks, vol. 19, pp. 33–65. Springer, Berlin (2011)

[4] Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39(3), 930–945 (1993)
[5] Bartlett, P.L., Maiorov, V., Meir, R.: Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation 10(8), 2159–2173 (1998)
[6] Bartlett, P.L., Williamson, R.C.: The VC dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation 8(3), 625–628 (1996)
[7] Chen, T., Chen, H., Liu, R.: A constructive proof and an extension of Cybenko's approximation theorem. In: C. Page, R. LePage (eds.) Computing Science and Statistics, pp. 163–168. Springer, New York, NY (1992)
[8] Chen, Z., Cao, F.: The properties of logistic function and applications to neural network approximation. J. Computational Analysis and Applications 15(6), 1046–1056 (2013)
[9] Cheney, E.W., Light, W.A.: A Course in Approximation Theory. AMS, Providence, RI (2000)
[10] Costarelli, D.: Sigmoidal Functions Approximation and Applications. Doctoral thesis, Roma Tre University, Rome (2014)
[11] Costarelli, D., Spigler, R.: Approximation results for neural network operators activated by sigmoidal functions. Neural Networks 44, 101–106 (2013)
[12] Costarelli, D., Spigler, R.: Constructive approximation by superposition of sigmoidal functions. Analysis in Theory and Applications 29, 169–196 (2013)
[13] Costarelli, D., Vinti, G.: Saturation classes for max-product neural network operators activated by sigmoidal functions. Results in Mathematics 72, 1555–1569 (2017)
[14] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), 303–314 (1989)
[15] Debao, C.: Degree of approximation by superpositions of a sigmoidal function. Approximation Theory and its Applications 9(3), 17–28 (1993)
[16] DeVore, R.A., Lorentz, G.G.: Constructive Approximation. Springer, Berlin (1993)
[17] DeVore, R.A., Popov, V.A.: Interpolation spaces and non-linear approximation. In: M. Cwikel, J. Peetre, Y. Sagher, H. Wallin (eds.) Function Spaces and Applications, pp. 191–205. Springer, Berlin, Heidelberg (1988)
[18] Dickmeis, W., Nessel, R.J., van Wickeren, E.: A general approach to counterexamples in numerical analysis. Numerische Mathematik 43(2), 249–263 (1984)
[19] Dickmeis, W., Nessel, R.J., van Wickeren, E.: On nonlinear condensation principles with rates. manuscripta mathematica 52(1), 1–20 (1985)
[20] Dickmeis, W., Nessel, R.J., van Wickeren, E.: Quantitative extensions of the uniform boundedness principle. Jahresber. Deutsch. Math.-Verein. 89, 105–134 (1987)
[21] Dickmeis, W., Nessel, R.J., van Wickeren, E.: On the sharpness of estimates in terms of averages. Math. Nachr. 117, 263–272 (1984)
[22] Funahashi, K.I.: On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183–192 (1989)

[23] Goebbels, S.: A sharp error estimate for numerical Fourier transform of band-limited functions based on windowed samples. Z. Anal. Anwendungen 32, 371–387 (2013)
[24] Goebbels, S.: A counterexample regarding "New study on neural networks: the essential order of approximation". Neural Networks 123, 234–235 (2020)
[25] Goebbels, S.: On sharpness of error bounds for multivariate neural network approximation. arXiv (2020). URL http://arxiv.org/abs/2004.02203
[26] Goldberg, P.W., Jerrum, M.R.: Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning 18(2), 131–148 (1995)
[27] Guliyev, N.J., Ismailov, V.E.: A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Networks 28(7), 1289–1304 (2016)
[28] Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251–257 (1991)
[29] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359–366 (1989)
[30] Johnen, H., Scherer, K.: On the equivalence of the K-functional and moduli of continuity and some applications. In: W. Schempp, K. Zeller (eds.) Constructive Theory of Functions of Several Variables. Proc. Conf. Oberwolfach 1976, pp. 119–140 (1976)
[31] Jones, L.K.: Constructive approximations for neural networks by sigmoidal functions. Proceedings of the IEEE 78(10), 1586–1589 (1990). Correction and addition in Proc. IEEE 79 (1991), 243
[32] Karpinski, M., Macintyre, A.: Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences 54(1), 169–176 (1997)
[33] Kůrková, V.: Rates of approximation of multivariable functions by one-hidden-layer neural networks. In: M. Marinaro, R. Tagliaferri (eds.) Neural Nets WIRN VIETRI-97, Perspectives in Neural Computing, pp. 147–152. Springer, London (1998)
[34] Lei, Y., Ding, L.: Approximation and estimation bounds for free knot splines. Computers & Mathematics with Applications 65(7), 1006–1024 (2013)
[35] Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6(6), 861–867 (1993)
[36] Lewicki, G., Marino, G.: Approximation of functions of finite variation by superpositions of a sigmoidal function. Applied Mathematics Letters 17(10), 1147–1152 (2004)
[37] Li, F.J.: Constructive function approximation by neural networks with optimized activation functions and fixed weights. Neural Computing and Applications 31(9), 4613–4628 (2019)
[38] Maiorov, V., Ratsaby, J.: On the degree of approximation by manifolds of finite pseudo-dimension. Constructive Approximation 15, 291–300 (1999)

[39] Maiorov, V.E., Meir, R.: On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics 13, 79–103 (2000)
[40] Makovoz, Y.: Uniform approximation by neural networks. Journal of Approximation Theory 95(2), 215–228 (1998)
[41] Pinkus, A.: n-Widths in Approximation Theory. Springer, Berlin (1985)
[42] Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numerica 8, 143–195 (1999)
[43] Ritter, G.: Efficient estimation of neural weights by polynomial approximation. IEEE Transactions on Information Theory 45(5), 1541–1550 (1999)
[44] Sanguineti, M.: Universal approximation by ridge computational models and neural networks: A survey. The Open Applied Mathematics Journal 2, 31–58 (2008)
[45] Sanguineti, M., Hlaváčková-Schindler, K.: Some comparisons between linear approximation and approximation by neural networks. In: Artificial Neural Nets and Genetic Algorithms, pp. 172–177. Springer, Vienna (1999)
[46] Schmitt, M.: Radial basis function neural networks have superlinear VC dimension. In: Proceedings of the 14th Annual Conference on Computational Learning Theory COLT 2001 and 5th European Conference on Computational Learning Theory EuroCOLT 2001, Lecture Notes in Artificial Intelligence, vol. 2111, pp. 14–30. Springer, Berlin (2001)
[47] Schmitt, M.: RBF neural networks and Descartes' rule of signs. In: N. Cesa-Bianchi, M. Numao, R. Reischuk (eds.) Algorithmic Learning Theory ALT 2002, Lecture Notes in Computer Science, vol. 2533. Springer, Heidelberg (2002)
[48] Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY (2014)
[49] Timan, A.: Theory of Approximation of Functions of a Real Variable. Pergamon Press, New York, NY (1963)
[50] Tossavainen, T.: The lost cousin of the fundamental theorem of algebra. Mathematics Magazine 80(4), 290–294 (2007)
[51] Wang, J., Xu, Z.: New study on neural networks: the essential order of approximation. Neural Networks 23(5), 618–624 (2010)
[52] Xu, Z., Cao, F.: The essential order of approximation for neural networks. Science in China F: Information Sciences 47(1), 97–112 (2004)
[53] Xu, Z., Wang, J.J.: The essential order of approximation for neural networks. Science in China Series F: Information Sciences 49(4), 446–460 (2006)