
CS281B/Stat241B. Statistical Learning Theory. Lecture 7. Peter Bartlett

• Review: ERM and uniform laws of large numbers
  1. Rademacher complexity
  2. Tools for bounding Rademacher complexity
     − Growth function, VC-dimension, Sauer's Lemma
     − Structural results
• Neural network examples: linear threshold units
• Other nonlinearities?
• Geometric methods

1 ERM and uniform laws of large numbers

Empirical risk minimization:

Choose $f_n \in F$ to minimize $\hat R(f)$. How does $R(f_n)$ behave?

For $f^* = \arg\min_{f \in F} R(f)$,

$$R(f_n) - R(f^*) = \underbrace{R(f_n) - \hat R(f_n)}_{\text{ULLN for } F} + \underbrace{\hat R(f_n) - \hat R(f^*)}_{\le 0 \text{ for ERM}} + \underbrace{\hat R(f^*) - R(f^*)}_{\text{LLN for } f^*} \le \sup_{f \in F} |R(f) - \hat R(f)| + O(1/\sqrt{n}).$$

2 Uniform laws and Rademacher complexity

Definition: The Rademacher complexity of $F$ is $\mathbb{E}\|R_n\|_F$, where the empirical process $R_n$ is defined as

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(X_i),$$

and the $\epsilon_1, \dots, \epsilon_n$ are Rademacher random variables: i.i.d. uniform on $\{\pm 1\}$.
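Not on the slide: for a finite class evaluated on a fixed sample, $\mathbb{E}\|R_n\|_F$ (conditionally on the data) can be estimated by plain Monte Carlo over draws of $\epsilon$. A minimal sketch; the threshold class and all names here are illustrative:

```python
import numpy as np

def empirical_rademacher(values, num_rounds=2000, seed=None):
    """Monte Carlo estimate of E sup_{f in F} |(1/n) sum_i eps_i f(X_i)|,
    conditionally on the sample. `values` is a (|F|, n) array whose row j
    holds (f_j(X_1), ..., f_j(X_n)) for the j-th function in a finite class."""
    rng = np.random.default_rng(seed)
    num_funcs, n = values.shape
    total = 0.0
    for _ in range(num_rounds):
        eps = rng.choice([-1.0, 1.0], size=n)    # Rademacher signs
        total += np.abs(values @ eps).max() / n  # sup over the class
    return total / num_rounds

# Toy class: thresholds x -> 1[x >= t] evaluated on a uniform sample.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=50)
thresholds = np.linspace(0.0, 1.0, 32)
values = (X[None, :] >= thresholds[:, None]).astype(float)
print(empirical_rademacher(values, seed=1))  # shrinks as the sample grows
```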

3 Uniform laws and Rademacher complexity

Theorem: For any $F \subset [0,1]^{\mathcal{X}}$,

$$\frac{1}{2} \mathbb{E}\|R_n\|_F - O\left(\sqrt{1/n}\right) \le \mathbb{E}\|P - P_n\|_F \le 2\, \mathbb{E}\|R_n\|_F,$$

and, with probability at least $1 - 2\exp(-2\epsilon^2 n)$,

$$\mathbb{E}\|P - P_n\|_F - \epsilon \le \|P - P_n\|_F \le \mathbb{E}\|P - P_n\|_F + \epsilon.$$

Thus, $\|P - P_n\|_F \approx \mathbb{E}\|R_n\|_F$, and

$$R(f_n) - \inf_{f \in F} R(f) = O\left(\mathbb{E}\|R_n\|_F\right).$$

4 Tools for controlling Rademacher complexity

1. $|F(X_1^n)|$ small. ($\max |F(x_1^n)|$ is the growth function.)
2. For binary-valued functions: Vapnik-Chervonenkis dimension. Bounds the rate of growth of the growth function. Can be bounded for parameterized families.

3. Structural results on Rademacher complexity: Obtaining bounds for function classes constructed from other function classes.

5 Controlling Rademacher complexity: Growth function

Definition: For a class $F \subseteq \{0,1\}^{\mathcal{X}}$, the growth function is

$$\Pi_F(n) = \max\left\{ |F(x_1^n)| : x_1, \dots, x_n \in \mathcal{X} \right\}.$$

Lemma: For $f \in F$ satisfying $|f(x)| \le 1$,

$$\mathbb{E}\|R_n\|_F \le \sqrt{\frac{2 \log(2 \Pi_F(n))}{n}}.$$

6 Vapnik-Chervonenkis dimension

Definition: A class $F \subseteq \{0,1\}^{\mathcal{X}}$ shatters $\{x_1, \dots, x_d\} \subseteq \mathcal{X}$ means that $|F(x_1^d)| = 2^d$.

The Vapnik-Chervonenkis dimension of $F$ is

$$d_{VC}(F) = \max\left\{ d : \text{some } x_1, \dots, x_d \in \mathcal{X} \text{ is shattered by } F \right\} = \max\left\{ d : \Pi_F(d) = 2^d \right\}.$$
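Not part of the slides: for a finite class on a finite domain, the definition can be checked directly by brute force; the helper names below are mine. A sketch:

```python
import itertools

def is_shattered(functions, points):
    """True iff the class realizes all 2^d binary patterns on `points`
    (i.e., |F(x_1^d)| = 2^d in the notation of the slides)."""
    patterns = {tuple(f(x) for x in points) for f in functions}
    return len(patterns) == 2 ** len(points)

def vc_dimension(functions, domain):
    """Largest d such that some d-subset of `domain` is shattered."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(functions, subset)
               for subset in itertools.combinations(domain, size)):
            best = size
    return best

# Example: one-sided thresholds x -> 1[x >= t] on the line shatter any
# single point, but never a pair: the pattern (1, 0) with x1 < x2 is
# impossible. So the VC-dimension is 1.
thresholds = [lambda x, t=t: int(x >= t) for t in [-1.5, -0.5, 0.5, 1.5]]
print(vc_dimension(thresholds, domain=[-1.0, 0.0, 1.0]))  # prints 1
```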

7 Vapnik-Chervonenkis dimension: “Sauer’s Lemma”

Theorem: [Vapnik-Chervonenkis] $d_{VC}(F) \le d$ implies

$$\Pi_F(n) \le \sum_{i=0}^{d} \binom{n}{i}.$$

If $n \ge d$, the latter sum is no more than $(en/d)^d$.

So the VC-dimension is a single integer summary of the growth function: either it is finite, and $\Pi_F(n) = O(n^d)$, or $\Pi_F(n) = 2^n$. No other growth is possible:

$$\Pi_F(n) \begin{cases} = 2^n & \text{if } n \le d, \\ \le (e/d)^d n^d & \text{if } n > d. \end{cases}$$
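Not on the slide: the bound and its $(en/d)^d$ relaxation are easy to tabulate against $2^n$; a quick sketch with arbitrary numbers:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer's lemma: Pi_F(n) <= sum_{i=0}^{d} C(n, i) when d_VC(F) <= d."""
    return sum(comb(n, i) for i in range(d + 1))

d = 5
for n in (5, 10, 50, 200):
    relaxed = (e * n / d) ** d  # (en/d)^d, valid once n >= d
    print(f"n={n}: sum={sauer_bound(n, d)}  (en/d)^d={relaxed:.3g}  2^n={2**n:.3g}")
```

For $n = d = 5$ the sum is exactly $2^n = 32$; for larger $n$ it grows only polynomially while $2^n$ explodes.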

8 Vapnik-Chervonenkis dimension: “Sauer’s Lemma”

Thus, for $d_{VC}(F) \le d$,

$$\mathbb{E}\|R_n\|_F \le \sqrt{\frac{2 \log(2 \Pi_F(n))}{n}} = O\left( \sqrt{\frac{d \log n}{n}} \right).$$

9 Vapnik-Chervonenkis dimension: Example

Theorem: For the class of thresholded linear functions

$$F = \left\{ x \mapsto 1[g(x) \ge 0] : g \in G \right\},$$

where $G$ is a linear space, $d_{VC}(F) = \dim(G)$.

10 Rademacher complexity: structural results

Theorem:

1. $F \subseteq G$ implies $\|R_n\|_F \le \|R_n\|_G$.
2. $\|R_n\|_{cF} = |c| \, \|R_n\|_F$.
3. For $|g(X)| \le 1$, $\left| \mathbb{E}\|R_n\|_{F+g} - \mathbb{E}\|R_n\|_F \right| \le \sqrt{2 \log 2 / n}$.
4. $\|R_n\|_{\text{co}\, F} = \|R_n\|_F$, where $\text{co}\, F$ is the convex hull of $F$.
5. (Contraction inequality) Consider $\phi : \mathbb{R} \times \mathcal{Z} \to \mathbb{R}$. Suppose, for all $z$, $\alpha \mapsto \phi(\alpha, z)$ is 1-Lipschitz and $\phi(0, z) = 0$. Define $\phi(F) = \{ z \mapsto \phi(f(z), z) : f \in F \}$. Then $\mathbb{E}\|R_n\|_{\phi(F)} \le 2\, \mathbb{E}\|R_n\|_F$.

11 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
  1. Linear threshold units
  2. Scaled convex combinations of LTUs
  3. ... with bounded fan-in
  4. Networks of LTUs
• Other nonlinearities?
• Geometric methods

12 Neural network examples

Example: Consider the class of linear threshold functions:

$$F_d = \left\{ x \mapsto \text{sign}(\theta^\top x - \theta_0) : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \right\}, \qquad \text{sign}(\alpha) := \begin{cases} 1 & \text{if } \alpha \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$

This is a thresholded vector space of functions. $d_{VC}(F_d) = d + 1$.
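Not from the slides: one can probe the lower bound $d_{VC}(F_d) \ge d+1$ numerically by sampling random parameter vectors and collecting the induced sign patterns on $d+1$ generic points; random search can only certify shattering, never refute it. A sketch (names are mine):

```python
import numpy as np

def realized_patterns(points, num_samples=200_000, seed=0):
    """Sign patterns x -> 1[theta.x - theta0 >= 0] realized on the rows of
    `points` by randomly sampled linear threshold functions."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    theta = rng.standard_normal((num_samples, d))
    theta0 = rng.standard_normal(num_samples)
    signs = (points @ theta.T - theta0) >= 0     # shape (n, num_samples)
    return {tuple(col) for col in signs.T}

d = 2
points = np.random.default_rng(1).standard_normal((d + 1, d))  # generic points
patterns = realized_patterns(points)
print(len(patterns), "of", 2 ** (d + 1))  # expect all 8 patterns: shattered
```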

13 Neural network examples

Example: Consider two-layer networks with linear threshold functions in the first layer and bounded weights in the second layer:

$$F_{d,r} = \left\{ x \mapsto \sum_{i=1}^{k} \theta_i f_i : k \ge 1, \ f_i \in F_d, \ \|\theta\|_1 \le r \right\}.$$

Then $F_{d,r} = r \, \text{co}(F_d \cup -F_d)$, so

$$\mathbb{E}\|R_n\|_{F_{d,r}} \le r \, \mathbb{E}\|R_n\|_{F_d \cup -F_d} = O\left( r \sqrt{\frac{d \log n}{n}} \right).$$

14 Neural network examples

Example: Consider two-layer networks with bounded fan-in linear threshold functions in the first layer and bounded weights in the second layer:

$$F_{d,s} = \left\{ x \mapsto \text{sign}(\theta^\top x - \theta_0) : \theta \in \mathbb{R}^d, \ \|\theta\|_0 \le s, \ \theta_0 \in \mathbb{R} \right\},$$

$$F_{d,s,r} = \left\{ x \mapsto \sum_{i=1}^{k} \theta_i f_i : k \ge 1, \ f_i \in F_{d,s}, \ \|\theta\|_1 \le r \right\}.$$

Then

$$\Pi_{F_{d,s}}(n) \le \left( \frac{en}{s+1} \right)^{s+1} \binom{d}{s},$$

$$\mathbb{E}\|R_n\|_{F_{d,s,r}} = O\left( r \sqrt{\frac{\log \Pi_{F_{d,s}}(n)}{n}} \right) = O\left( r \sqrt{\frac{s \log(nd/s)}{n}} \right).$$

15 Networks of linear threshold functions

Theorem: Let $F_{p,k}$ be the class of functions computed by a feed-forward network of linear threshold functions, with $k$ computation units and $p$ parameters. Then for $n \ge p$,

$$\Pi_{F_{p,k}}(n) \le \left( \frac{enk}{p} \right)^p,$$

and hence $d_{VC}(F_{p,k}) < 2p \log_2(2k/\ln 2)$.

16 Networks of linear threshold functions

Proof Idea:

Fix a set $S$ of $n$ input vectors $x_1, \dots, x_n$. Consider a topological ordering of the computation units.

For computation unit $l$, let $p_l$ be the number of parameters, and let $D_l(S)$ be the number of distinct states (that is, parameter settings that compute distinct mappings $\{x_1, \dots, x_n\} \to \{\pm 1\}^l$ from input vectors to outputs of computation units up to the $l$th).

1. $D_1(S) \le (en/p_1)^{p_1}$.
2. $D_l(S) \le D_{l-1}(S) (en/p_l)^{p_l}$.
3. Hence, $D_k(S) \le \prod_{l=1}^{k} (en/p_l)^{p_l}$ and $\log \Pi_{F_{p,k}}(n) \le \sum_{l=1}^{k} p_l \log(en/p_l)$.
4. The bound is maximized by spreading the $p$ parameters uniformly over the $k$ units: $\log \Pi_{F_{p,k}}(n) \le p \log(enk/p)$, as checked numerically below.
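Not on the slide: step 4 is just the concavity of $x \mapsto x \log(en/x)$; a quick numeric check (with arbitrary $n$, $p$, $k$) that uneven splits of the $p$ parameters only decrease the bound:

```python
from math import e, log

def log_growth_bound(splits, n):
    """sum_l p_l * log(e*n / p_l), the bound from step 3."""
    return sum(p_l * log(e * n / p_l) for p_l in splits)

n, p, k = 1000, 60, 3
for splits in ([20, 20, 20], [10, 20, 30], [5, 5, 50]):
    print(splits, round(log_growth_bound(splits, n), 2))
print("p*log(e*n*k/p) =", round(p * log(e * n * k / p), 2))  # the even split
```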

17 Networks of linear threshold functions

Theorem: Let $F_{d,k}$ be the class of functions $f : \mathbb{R}^d \to \{\pm 1\}$ computed by a two-layer feed-forward network of linear threshold functions, with $k$ computation units (so $p = (d+2)k + 1$ parameters). Then

$$d_{VC}(F_{d,k}) \ge dk = \Omega(p).$$

(A more involved argument shows that $d_{VC}(F_{d,k}) = \Omega(p \log k)$.)

18 Networks of linear threshold functions

Idea:

1. Arrange $kd$ points in $k$ well-separated clusters on the surface of a sphere in $\mathbb{R}^d$.
2. Ensure that the $d$ points in each cluster are in general position.
3. For each cluster, fit the decision boundary (hyperplane) of a hidden unit to intersect all $d$ points, oriented so that the unit has output 1 at the center of the sphere.
4. Choose the parameters of the output unit so that it computes the conjunction of its $k$ inputs.
5. By perturbing the hidden unit parameters, it is clear that all $2^{kd}$ classifications can be computed.

19 Networks of linear threshold functions

Summary: For networks of linear threshold functions with p parameters and k computation units, the VC-dimension is Θ(p log k), independent of depth.

20 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
  1. Polynomials
  2. Sigmoids
  3. Sinusoids!
• Geometric methods

21 Piecewise polynomial nonlinearities

Definition: A ReLU (Rectified Linear Unit) computes a function from

$$F_+ = \left\{ x \mapsto (\theta^\top x - \theta_0)_+ : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \right\}, \qquad (\alpha)_+ := \begin{cases} \alpha & \text{if } \alpha \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

• VC-dimension of networks of ReLUs? These are piecewise-linear functions.

22 Piecewise polynomial nonlinearities

• VC-dimension of networks of units with piecewise-quadratic nonlinearity? (That is, where $(\cdot)_+$ is replaced with $(\cdot)_+^2$.) These are piecewise-polynomial functions.

• What if the nonlinearity is a fixed polynomial? Then the network computes a parameterized polynomial in the input variables, which is a linearly parameterized class (i.e., a vector space), so the VC-dimension is bounded. But the dimension of this vector space might be very large. (Since the network cannot compute arbitrary polynomials in this huge space, we might hope that the VC-dimension is less than the linear dimension.)

23 Sigmoidal nonlinearities

• VC-dimension of networks of units with a sigmoid function,

$$\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}},$$

or its symmetric version $(1 - e^{-2\alpha})/(1 + e^{-2\alpha}) = \tanh(\alpha)$, or its vector version (softmax function),

$$\sigma(\alpha)_i = \frac{e^{\alpha_i}}{\sum_j e^{\alpha_j}}?$$

• VC-dimension of networks of units with a soft-plus function (the smooth version of the ReLU),

$$\sigma(\alpha) = \ln(1 + e^{\alpha})?$$

It’s not clear why these should be finite.

24 Example: Sinusoidal nonlinearity

A smooth function, with few parameters, can have infinite VC-dimension.

Example: Consider the parameterized class

$$F_{\sin} = \left\{ x \mapsto \text{sign}(\sin(\theta x)) : \theta \in \mathbb{R} \right\}.$$

Then $d_{VC}(F_{\sin}) = \infty$.

Proof: Any sequence $2, 4, 8, 16, \dots, 2^n$ is shattered. To get labels $(y_1, \dots, y_n) \in \{\pm 1\}^n$, set $b_i = 1[y_i = -1]$ and $\theta = c\pi$, where $c$ has the binary representation $c = 0.b_1 b_2 \cdots b_n 1$. Then

$$\sin(2^i \theta) = \sin(2^i \pi \times 0.b_1 b_2 \cdots b_n 1) = \sin(\pi \times b_1 \cdots b_i . b_{i+1} \cdots b_n 1) = \sin(\pi \times b_i . b_{i+1} \cdots b_n 1)$$

(the last step because $\sin(\pi(m + f))$ depends only on the parity of the integer $m$, and the binary integer $b_1 \cdots b_i$ has parity $b_i$), so $\text{sign}(\sin(2^i \theta)) = y_i$.
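Not on the slide: the construction can be verified numerically for moderate $n$ (double precision limits the usable number of bits); names here are mine. A sketch:

```python
import numpy as np

def theta_for_labels(labels):
    """theta = c*pi, where c = 0.b_1 b_2 ... b_n 1 in binary, b_i = 1[y_i = -1]."""
    bits = [1 if y == -1 else 0 for y in labels] + [1]        # trailing 1
    c = sum(b * 2.0 ** -(j + 1) for j, b in enumerate(bits))  # 0.b_1...b_n1
    return c * np.pi

rng = np.random.default_rng(0)
n = 12
labels = rng.choice([-1, 1], size=n)
xs = 2.0 ** np.arange(1, n + 1)            # the shattered points 2, 4, ..., 2^n
predicted = np.sign(np.sin(theta_for_labels(labels) * xs))
print(np.array_equal(predicted, labels))   # True: this labeling is realized
```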

25 Example: Sinusoidal nonlinearity

And this implies that well-behaved nonlinearities (bounded, monotonic, convex to the left of zero, concave to the right) can make the VC-dimension infinite.

Example: Consider the parameterized class

$$F_{\sigma,k} = \left\{ x \mapsto \text{sign}\left( \alpha_0 + \sum_{i=1}^{k} \alpha_i \sigma(\theta_i^\top x) \right) : \alpha_i \in \mathbb{R}, \ \theta_i \in \mathbb{R}^d \right\}.$$

For

$$\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}} + c \alpha^3 e^{-\alpha^2} \sin\alpha,$$

if $c > 0$ is sufficiently small, $\sigma$ is analytic, monotonic, and convex/concave to the left/right of zero. And for $k \ge 2$, we can compute $\text{sign}(\sin(\theta x))$ using functions in $F_{\sigma,k}$, so its VC-dimension is infinite.

26 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
• Geometric methods
  1. Counting cells: Linear threshold functions
  2. Counting cells: Arithmetic complexity

27 Geometric methods

We want to bound the growth function for parameterized function classes of the form

$$F = \left\{ x \mapsto f(x, \theta) : \theta \in \Theta \right\},$$

where $\Theta \subset \mathbb{R}^p$. (For example, $f$ might be a neural network of a fixed architecture with weights $\theta$.)

Let's start with the special case of linear threshold functions:

$$F_d = \left\{ x \mapsto \text{sign}(\theta^\top x - \theta_0) : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \right\}, \qquad \text{sign}(\alpha) := \begin{cases} 1 & \text{if } \alpha \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$

(This class is a thresholded vector space of functions of dimension $d+1$, so $d_{VC}(F_d) = d + 1$, but we'll compute $\Pi_{F_d}$ directly, because the proof ideas are useful for general parameterized classes.)

28 Linear threshold functions

Theorem: For the class of linear threshold functions,

$$\Pi_{F_d}(n) = 2 \sum_{i=0}^{d} \binom{n-1}{i}.$$

Proof idea: Fix $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$. Divide the parameter space $\{(\theta, \theta_0)\} = \mathbb{R}^{d+1}$ into cells that give the same classification of the points, and count the number of these equivalence classes using a geometric argument (one that dates back to Schläfli, 1851).

29 Linear threshold functions

1. Assume the points in $S$ are in "general position," that is, all subsets of

$$\left\{ \begin{pmatrix} x_1 \\ 1 \end{pmatrix}, \begin{pmatrix} x_2 \\ 1 \end{pmatrix}, \dots, \begin{pmatrix} x_n \\ 1 \end{pmatrix} \right\}$$

of size up to $d+1$ are linearly independent. (This implies that no three points are on a line, no four are in a plane, etc.) Notice that this is generically true.

2. For each $x_i$, define the hyperplane

$$P_i = \left\{ (\theta, \theta_0) \in \mathbb{R}^{d+1} : \theta^\top x_i + \theta_0 = 0 \right\}.$$

30 Linear threshold functions

3. In order for $(\theta, \theta_0)$ and $(\theta', \theta_0')$ to label $x_i$ differently, they must lie on opposite sides of $P_i$ (assuming that neither is on $P_i$: for points in general position, this is wlog). Thus,

$$|F(x_1^n)| = CC\left( \mathbb{R}^{d+1} \setminus \cup_{i=1}^{n} P_i \right),$$

where $CC$ denotes the number of connected components.

4. We define $C(n, d+1) := CC\left( \mathbb{R}^{d+1} \setminus \cup_{i=1}^{n} P_i \right)$; the inductive argument below shows that it depends only on $n$ and $d$.
5. First, $C(1, d) = 2$. (One plane splits $\mathbb{R}^d$ into two cells.)

31 Linear threshold functions

6. Next, $C(n+1, d) = C(n, d) + C(n, d-1)$. Indeed, suppose we have $n$ planes in $\mathbb{R}^d$ and we add an $(n+1)$th. It splits some of the $C(n, d)$ cells in two, and leaves some of them intact. The number that are split by $P_{n+1}$ is equal to the number of connected components of $P_{n+1} \setminus \cup_{i=1}^{n} P_i$, which is $C(n, d-1)$.
7. Induction shows that

$$C(n, d) = 2 \sum_{k=0}^{d-1} \binom{n-1}{k}.$$
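Not on the slides: the recursion (step 6), the closed form (step 7), and the theorem itself can be cross-checked numerically; the Monte Carlo pattern count below is a lower bound on $\Pi_{F_d}(n)$ that typically attains it for generic points. A sketch, with all names mine:

```python
from functools import lru_cache
from math import comb

import numpy as np

@lru_cache(maxsize=None)
def cells(n, d):
    """C(n, d): connected components of R^d minus n central hyperplanes in
    general position.  C(1, d) = 2; C(n+1, d) = C(n, d) + C(n, d-1)."""
    if d == 0:
        return 0          # convention making the recursion match the closed form
    if n == 1:
        return 2
    return cells(n - 1, d) + cells(n - 1, d - 1)

def growth_formula(n, d):
    """Pi_{F_d}(n) = 2 * sum_{i=0}^{d} C(n-1, i), which equals C(n, d+1)."""
    return 2 * sum(comb(n - 1, i) for i in range(d + 1))

def sampled_patterns(points, num_samples=500_000, seed=0):
    """Distinct patterns x -> 1[theta.x - theta0 >= 0] found by random search."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((num_samples, points.shape[1]))
    theta0 = rng.standard_normal(num_samples)
    signs = (points @ theta.T - theta0) >= 0
    return len({tuple(col) for col in signs.T})

n, d = 6, 2
points = np.random.default_rng(2).standard_normal((n, d))  # generic a.s.
print(cells(n, d + 1), growth_formula(n, d), sampled_patterns(points))  # 32 32 32
```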

32 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
• Geometric methods
  1. Counting cells: Linear threshold functions
  2. Counting cells: Arithmetic complexity

33 VC-dimension bounds for parameterized families

Consider a parameterized class of binary-valued functions,

$$F_f = \left\{ x \mapsto f(x, \theta) : \theta \in \mathbb{R}^p \right\},$$

where $f : \mathbb{R}^m \times \mathbb{R}^p \to \{\pm 1\}$.

Suppose that, for each $x$, $f(x, \cdot)$ can be computed using no more than $t$ operations of the following kinds:

1. arithmetic ($+$, $-$, $\times$, $/$),
2. comparisons ($>$, $=$, $<$),
3. output $\pm 1$.

Theorem: $d_{VC}(F_f) \le 4p(t+2)$.

34 VC-dimension bounds for parameterized families

Suppose that, for each $x$, $f(x, \cdot)$ can be computed using no more than $t$ operations of the following kinds:

1. arithmetic ($+$, $-$, $\times$, $/$),
2. exponentiation ($x \mapsto e^x$),
3. comparisons ($>$, $=$, $<$),
4. output $\pm 1$.

Theorem: $d_{VC}(F_f) = O(p^2 t^2)$.

35 VC-dimension bounds for parameterized families

Proof idea: Any $f$ of this kind can be expressed as

$$f(x, \theta) = h\left( \text{sign}(g_1(x, \theta)), \dots, \text{sign}(g_k(x, \theta)) \right)$$

for functions $g_i$ that are polynomial in $\theta$, and some boolean function $h$. (Notice that $k \le 2^t$, and the degree of any polynomial $g_i$ is no more than $2^t$.)

Notice that a change of the value of $f$ must be due to a change of the sign of one of the $g_i$. Hence, $\Pi_{F_f}(n) \le$ the number of connected components of the parameter space $\mathbb{R}^p$ after the sets $\{ g_i(x_j, \theta) = 0 \}$ are removed. We can bound this number using ideas similar to the case of linear threshold functions (being careful to ensure that the analog of points being in general position holds without loss of generality).

36 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
• Geometric methods
  1. Counting cells: Linear threshold functions
  2. Counting cells: Arithmetic complexity
