
CS281B/Stat241B. Statistical Learning Theory. Lecture 7. Peter Bartlett

• Review: ERM and uniform laws of large numbers
  1. Rademacher complexity
  2. Tools for bounding Rademacher complexity
     − Growth function, VC-dimension, Sauer's Lemma
     − Structural results
• Neural network examples: linear threshold units
• Other nonlinearities?
• Geometric methods

1 ERM and uniform laws of large numbers

Empirical risk minimization:

Choose $f_n \in F$ to minimize $\hat R(f)$. How does $R(f_n)$ behave?

For $f^* = \arg\min_{f \in F} R(f)$,

$$R(f_n) - R(f^*) = \underbrace{R(f_n) - \hat R(f_n)}_{\text{ULLN for } F} + \underbrace{\hat R(f_n) - \hat R(f^*)}_{\le 0 \text{ for ERM}} + \underbrace{\hat R(f^*) - R(f^*)}_{\text{LLN for } f^*} \le \sup_{f \in F} |R(f) - \hat R(f)| + O(1/\sqrt{n}).$$

2 Uniform laws and Rademacher complexity

Definition: The Rademacher complexity of $F$ is $\mathbb{E}\|R_n\|_F$, where the empirical process $R_n$ is defined as

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(X_i),$$

and the $\epsilon_1, \dots, \epsilon_n$ are Rademacher random variables: i.i.d. uniform on $\{\pm 1\}$.
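Not on the slide: for a finite class evaluated on a fixed sample, $\mathbb{E}\|R_n\|_F$ (conditionally on the data) can be estimated by plain Monte Carlo over draws of $\epsilon$. A minimal sketch; the threshold class and all names here are illustrative:

```python
import numpy as np

def empirical_rademacher(values, num_rounds=2000, seed=None):
    """Monte Carlo estimate of E sup_{f in F} |(1/n) sum_i eps_i f(X_i)|,
    conditionally on the sample. `values` is a (|F|, n) array whose row j
    holds (f_j(X_1), ..., f_j(X_n)) for the j-th function in a finite class."""
    rng = np.random.default_rng(seed)
    num_funcs, n = values.shape
    total = 0.0
    for _ in range(num_rounds):
        eps = rng.choice([-1.0, 1.0], size=n)    # Rademacher signs
        total += np.abs(values @ eps).max() / n  # sup over the class
    return total / num_rounds

# Toy class: thresholds x -> 1[x >= t] evaluated on a uniform sample.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=50)
thresholds = np.linspace(0.0, 1.0, 32)
values = (X[None, :] >= thresholds[:, None]).astype(float)
print(empirical_rademacher(values, seed=1))  # shrinks as the sample grows
```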

3 Uniform laws and Rademacher complexity

Theorem: For any $F \subset [0,1]^{\mathcal{X}}$,

$$\frac{1}{2} \mathbb{E}\|R_n\|_F - O\left(\sqrt{1/n}\right) \le \mathbb{E}\|P - P_n\|_F \le 2\, \mathbb{E}\|R_n\|_F,$$

and, with probability at least $1 - 2\exp(-2\epsilon^2 n)$,

$$\mathbb{E}\|P - P_n\|_F - \epsilon \le \|P - P_n\|_F \le \mathbb{E}\|P - P_n\|_F + \epsilon.$$

Thus, $\|P - P_n\|_F \approx \mathbb{E}\|R_n\|_F$, and

$$R(f_n) - \inf_{f \in F} R(f) = O\left(\mathbb{E}\|R_n\|_F\right).$$

4 Tools for controlling Rademacher complexity

1. $|F(X_1^n)|$ small. ($\max |F(x_1^n)|$ is the growth function.)
2. For binary-valued functions: Vapnik-Chervonenkis dimension. Bounds the rate of growth of the growth function. Can be bounded for parameterized families.

3. Structural results on Rademacher complexity: Obtaining bounds for function classes constructed from other function classes.

5 Controlling Rademacher complexity: Growth function

Definition: For a class $F \subseteq \{0,1\}^{\mathcal{X}}$, the growth function is

$$\Pi_F(n) = \max\left\{ |F(x_1^n)| : x_1, \dots, x_n \in \mathcal{X} \right\}.$$

Lemma: For $f \in F$ satisfying $|f(x)| \le 1$,

$$\mathbb{E}\|R_n\|_F \le \sqrt{\frac{2 \log(2 \Pi_F(n))}{n}}.$$

6 Vapnik-Chervonenkis dimension

Definition: A class $F \subseteq \{0,1\}^{\mathcal{X}}$ shatters $\{x_1, \dots, x_d\} \subseteq \mathcal{X}$ means that $|F(x_1^d)| = 2^d$.

The Vapnik-Chervonenkis dimension of $F$ is

$$d_{VC}(F) = \max\left\{ d : \text{some } x_1, \dots, x_d \in \mathcal{X} \text{ is shattered by } F \right\} = \max\left\{ d : \Pi_F(d) = 2^d \right\}.$$
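Not part of the slides: for a finite class on a finite domain, the definition can be checked directly by brute force; the helper names below are mine. A sketch:

```python
import itertools

def is_shattered(functions, points):
    """True iff the class realizes all 2^d binary patterns on `points`
    (i.e., |F(x_1^d)| = 2^d in the notation of the slides)."""
    patterns = {tuple(f(x) for x in points) for f in functions}
    return len(patterns) == 2 ** len(points)

def vc_dimension(functions, domain):
    """Largest d such that some d-subset of `domain` is shattered."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(functions, subset)
               for subset in itertools.combinations(domain, size)):
            best = size
    return best

# Example: one-sided thresholds x -> 1[x >= t] on the line shatter any
# single point, but never a pair: the pattern (1, 0) with x1 < x2 is
# impossible. So the VC-dimension is 1.
thresholds = [lambda x, t=t: int(x >= t) for t in [-1.5, -0.5, 0.5, 1.5]]
print(vc_dimension(thresholds, domain=[-1.0, 0.0, 1.0]))  # prints 1
```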

7 Vapnik-Chervonenkis dimension: “Sauer’s Lemma”

Theorem: [Vapnik-Chervonenkis] $d_{VC}(F) \le d$ implies

$$\Pi_F(n) \le \sum_{i=0}^{d} \binom{n}{i}.$$

If $n \ge d$, the latter sum is no more than $(en/d)^d$.

So the VC-dimension is a single integer summary of the growth function: either it is finite, and $\Pi_F(n) = O(n^d)$, or $\Pi_F(n) = 2^n$. No other growth is possible:

$$\Pi_F(n) \begin{cases} = 2^n & \text{if } n \le d, \\ \le (e/d)^d n^d & \text{if } n > d. \end{cases}$$
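Not on the slide: the bound and its $(en/d)^d$ relaxation are easy to tabulate against $2^n$; a quick sketch with arbitrary numbers:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer's lemma: Pi_F(n) <= sum_{i=0}^{d} C(n, i) when d_VC(F) <= d."""
    return sum(comb(n, i) for i in range(d + 1))

d = 5
for n in (5, 10, 50, 200):
    relaxed = (e * n / d) ** d  # (en/d)^d, valid once n >= d
    print(f"n={n}: sum={sauer_bound(n, d)}  (en/d)^d={relaxed:.3g}  2^n={2**n:.3g}")
```

For $n = d = 5$ the sum is exactly $2^n = 32$; for larger $n$ it grows only polynomially while $2^n$ explodes.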

8 Vapnik-Chervonenkis dimension: “Sauer’s Lemma”

Thus, for $d_{VC}(F) \le d$,

$$\mathbb{E}\|R_n\|_F \le \sqrt{\frac{2 \log(2 \Pi_F(n))}{n}} = O\left( \sqrt{\frac{d \log n}{n}} \right).$$

9 Vapnik-Chervonenkis dimension: Example

Theorem: For the class of thresholded linear functions

$$F = \left\{ x \mapsto 1[g(x) \ge 0] : g \in G \right\},$$

where $G$ is a linear space, $d_{VC}(F) = \dim(G)$.

10 Rademacher complexity: structural results

Theorem:

1. $F \subseteq G$ implies $\|R_n\|_F \le \|R_n\|_G$.
2. $\|R_n\|_{cF} = |c| \, \|R_n\|_F$.
3. For $|g(X)| \le 1$, $\left| \mathbb{E}\|R_n\|_{F+g} - \mathbb{E}\|R_n\|_F \right| \le \sqrt{2 \log 2 / n}$.
4. $\|R_n\|_{\text{co}\, F} = \|R_n\|_F$, where $\text{co}\, F$ is the convex hull of $F$.
5. (Contraction inequality) Consider $\phi : \mathbb{R} \times \mathcal{Z} \to \mathbb{R}$. Suppose, for all $z$, $\alpha \mapsto \phi(\alpha, z)$ is 1-Lipschitz and $\phi(0, z) = 0$. Define $\phi(F) = \{ z \mapsto \phi(f(z), z) : f \in F \}$. Then $\mathbb{E}\|R_n\|_{\phi(F)} \le 2\, \mathbb{E}\|R_n\|_F$.

11 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
  1. Linear threshold units
  2. Scaled convex combinations of LTUs
  3. ... with bounded fan-in
  4. Networks of LTUs
• Other nonlinearities?
• Geometric methods

12 Neural network examples

Example: Consider the class of linear threshold functions:

$$F_d = \left\{ x \mapsto \text{sign}(\theta^\top x - \theta_0) : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \right\}, \qquad \text{sign}(\alpha) := \begin{cases} 1 & \text{if } \alpha \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$

This is a thresholded vector space of functions. $d_{VC}(F_d) = d + 1$.
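Not from the slides: one can probe the lower bound $d_{VC}(F_d) \ge d+1$ numerically by sampling random parameter vectors and collecting the induced sign patterns on $d+1$ generic points; random search can only certify shattering, never refute it. A sketch (names are mine):

```python
import numpy as np

def realized_patterns(points, num_samples=200_000, seed=0):
    """Sign patterns x -> 1[theta.x - theta0 >= 0] realized on the rows of
    `points` by randomly sampled linear threshold functions."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    theta = rng.standard_normal((num_samples, d))
    theta0 = rng.standard_normal(num_samples)
    signs = (points @ theta.T - theta0) >= 0     # shape (n, num_samples)
    return {tuple(col) for col in signs.T}

d = 2
points = np.random.default_rng(1).standard_normal((d + 1, d))  # generic points
patterns = realized_patterns(points)
print(len(patterns), "of", 2 ** (d + 1))  # expect all 8 patterns: shattered
```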

13 Neural network examples

Example: Consider two-layer networks with linear threshold functions in the first layer and bounded weights in the second layer:

$$F_{d,r} = \left\{ x \mapsto \sum_{i=1}^{k} \theta_i f_i : k \ge 1, \ f_i \in F_d, \ \|\theta\|_1 \le r \right\}.$$

Then $F_{d,r} = r \, \text{co}(F_d \cup -F_d)$, so

$$\mathbb{E}\|R_n\|_{F_{d,r}} \le r \, \mathbb{E}\|R_n\|_{F_d \cup -F_d} = O\left( r \sqrt{\frac{d \log n}{n}} \right).$$

14 Neural network examples

Example: Consider two-layer networks with bounded fan-in linear threshold functions in the first layer and bounded weights in the second layer:

$$F_{d,s} = \left\{ x \mapsto \text{sign}(\theta^\top x - \theta_0) : \theta \in \mathbb{R}^d, \ \|\theta\|_0 \le s, \ \theta_0 \in \mathbb{R} \right\},$$

$$F_{d,s,r} = \left\{ x \mapsto \sum_{i=1}^{k} \theta_i f_i : k \ge 1, \ f_i \in F_{d,s}, \ \|\theta\|_1 \le r \right\}.$$

Then

$$\Pi_{F_{d,s}}(n) \le \left( \frac{en}{s+1} \right)^{s+1} \binom{d}{s},$$

$$\mathbb{E}\|R_n\|_{F_{d,s,r}} = O\left( r \sqrt{\frac{\log \Pi_{F_{d,s}}(n)}{n}} \right) = O\left( r \sqrt{\frac{s \log(nd/s)}{n}} \right).$$

15 Networks of linear threshold functions

Theorem: Let $F_{p,k}$ be the class of functions computed by a feed-forward network of linear threshold functions, with $k$ computation units and $p$ parameters. Then for $n \ge p$,

$$\Pi_{F_{p,k}}(n) \le \left( \frac{enk}{p} \right)^p,$$

and hence $d_{VC}(F_{p,k}) < 2p \log_2(2k/\ln 2)$.

16 Networks of linear threshold functions

Proof Idea:

Fix a set $S$ of $n$ input vectors $x_1, \dots, x_n$. Consider a topological ordering of the computation units.

For computation unit $l$, let $p_l$ be the number of parameters, and let $D_l(S)$ be the number of distinct states (that is, parameter settings that compute distinct mappings $\{x_1, \dots, x_n\} \to \{\pm 1\}^l$ from input vectors to outputs of computation units up to the $l$th).

1. $D_1(S) \le (en/p_1)^{p_1}$.
2. $D_l(S) \le D_{l-1}(S) (en/p_l)^{p_l}$.
3. Hence, $D_k(S) \le \prod_{l=1}^{k} (en/p_l)^{p_l}$ and $\log \Pi_{F_{p,k}}(n) \le \sum_{l=1}^{k} p_l \log(en/p_l)$.
4. The bound is maximized by spreading the $p$ parameters uniformly over the $k$ units: $\log \Pi_{F_{p,k}}(n) \le p \log(enk/p)$, as checked numerically below.
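Not on the slide: step 4 is just the concavity of $x \mapsto x \log(en/x)$; a quick numeric check (with arbitrary $n$, $p$, $k$) that uneven splits of the $p$ parameters only decrease the bound:

```python
from math import e, log

def log_growth_bound(splits, n):
    """sum_l p_l * log(e*n / p_l), the bound from step 3."""
    return sum(p_l * log(e * n / p_l) for p_l in splits)

n, p, k = 1000, 60, 3
for splits in ([20, 20, 20], [10, 20, 30], [5, 5, 50]):
    print(splits, round(log_growth_bound(splits, n), 2))
print("p*log(e*n*k/p) =", round(p * log(e * n * k / p), 2))  # the even split
```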

17 Networks of linear threshold functions

Theorem: Let $F_{d,k}$ be the class of functions $f : \mathbb{R}^d \to \{\pm 1\}$ computed by a two-layer feed-forward network of linear threshold functions, with $k$ computation units (so $p = (d+2)k + 1$ parameters). Then

$$d_{VC}(F_{d,k}) \ge dk = \Omega(p).$$

(A more involved argument shows that $d_{VC}(F_{d,k}) = \Omega(p \log k)$.)

18 Networks of linear threshold functions

Idea:

1. Arrange $kd$ points in $k$ well-separated clusters on the surface of a sphere in $\mathbb{R}^d$.
2. Ensure that the $d$ points in each cluster are in general position.
3. For each cluster, fit the decision boundary (hyperplane) of a hidden unit to intersect all $d$ points, oriented so that the unit has output 1 at the center of the sphere.
4. Choose the parameters of the output unit so that it computes the conjunction of its $k$ inputs.
5. By perturbing the hidden unit parameters, it is clear that all $2^{kd}$ classifications can be computed.

19 Networks of linear threshold functions

Summary: For networks of linear threshold functions with p parameters and k computation units, the VC-dimension is Θ(p log k), independent of depth.

20 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
  1. Polynomials
  2. Sigmoids
  3. Sinusoids!
• Geometric methods

21 Piecewise polynomial nonlinearities

Definition: A ReLU (Rectified Linear Unit) computes a function from

$$F_+ = \left\{ x \mapsto (\theta^\top x - \theta_0)_+ : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \right\}, \qquad (\alpha)_+ := \begin{cases} \alpha & \text{if } \alpha \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

• VC-dimension of networks of ReLUs? These are piecewise-linear functions.

22 Piecewise polynomial nonlinearities

• VC-dimension of networks of units with piecewise-quadratic nonlinearity? (That is, where $(\cdot)_+$ is replaced with $(\cdot)_+^2$.) These are piecewise-polynomial functions.

• What if the nonlinearity is a fixed polynomial? Then the network computes a parameterized polynomial in the input variables, which is a linearly parameterized class (i.e., a vector space), so the VC-dimension is bounded. But the dimension of this vector space might be very large. (Since the network cannot compute arbitrary polynomials in this huge space, we might hope that the VC-dimension is less than the linear dimension.)

23 Sigmoidal nonlinearities

• VC-dimension of networks of units with a sigmoid function,

$$\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}},$$

or its symmetric version $(1 - e^{-2\alpha})/(1 + e^{-2\alpha}) = \tanh(\alpha)$, or its vector version (softmax function),

$$\sigma(\alpha)_i = \frac{e^{\alpha_i}}{\sum_j e^{\alpha_j}}?$$

• VC-dimension of networks of units with a soft-plus function (the smooth version of the ReLU),

$$\sigma(\alpha) = \ln(1 + e^{\alpha})?$$

It’s not clear why these should be finite.

24 Example: Sinusoidal nonlinearity

A smooth function, with few parameters, can have infinite VC-dimension.

Example: Consider the parameterized class

$$F_{\sin} = \left\{ x \mapsto \text{sign}(\sin(\theta x)) : \theta \in \mathbb{R} \right\}.$$

Then $d_{VC}(F_{\sin}) = \infty$.

Proof: Any sequence $2, 4, 8, 16, \dots, 2^n$ is shattered. To get labels $(y_1, \dots, y_n) \in \{\pm 1\}^n$, set $b_i = 1[y_i = -1]$ and $\theta = c\pi$, where $c$ has the binary representation $c = 0.b_1 b_2 \cdots b_n 1$. Then

$$\sin(2^i \theta) = \sin(2^i \pi \times 0.b_1 b_2 \cdots b_n 1) = \sin(\pi \times b_1 \cdots b_i . b_{i+1} \cdots b_n 1) = \sin(\pi \times b_i . b_{i+1} \cdots b_n 1)$$

(the last step because $\sin(\pi(m + f))$ depends only on the parity of the integer $m$, and the binary integer $b_1 \cdots b_i$ has parity $b_i$), so $\text{sign}(\sin(2^i \theta)) = y_i$.
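Not on the slide: the construction can be verified numerically for moderate $n$ (double precision limits the usable number of bits); names here are mine. A sketch:

```python
import numpy as np

def theta_for_labels(labels):
    """theta = c*pi, where c = 0.b_1 b_2 ... b_n 1 in binary, b_i = 1[y_i = -1]."""
    bits = [1 if y == -1 else 0 for y in labels] + [1]        # trailing 1
    c = sum(b * 2.0 ** -(j + 1) for j, b in enumerate(bits))  # 0.b_1...b_n1
    return c * np.pi

rng = np.random.default_rng(0)
n = 12
labels = rng.choice([-1, 1], size=n)
xs = 2.0 ** np.arange(1, n + 1)            # the shattered points 2, 4, ..., 2^n
predicted = np.sign(np.sin(theta_for_labels(labels) * xs))
print(np.array_equal(predicted, labels))   # True: this labeling is realized
```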

25 Example: Sinusoidal nonlinearity

And this implies that well-behaved nonlinearities (bounded, monotonic, convex to the left of zero, concave to the right) can make the VC-dimension infinite.

Example: Consider the parameterized class

$$F_{\sigma,k} = \left\{ x \mapsto \text{sign}\left( \alpha_0 + \sum_{i=1}^{k} \alpha_i \sigma(\theta_i^\top x) \right) : \alpha_i \in \mathbb{R}, \ \theta_i \in \mathbb{R}^d \right\}.$$

For

$$\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}} + c \alpha^3 e^{-\alpha^2} \sin\alpha,$$

if $c > 0$ is sufficiently small, $\sigma$ is analytic, monotonic, and convex/concave to the left/right of zero. And for $k \ge 2$, we can compute $\text{sign}(\sin(\theta x))$ using functions in $F_{\sigma,k}$, so its VC-dimension is infinite.

26 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
• Geometric methods
  1. Counting cells: Linear threshold functions
  2. Counting cells: Arithmetic complexity

27 Geometric methods

We want to bound the growth function for parameterized function classes of the form

$$F = \left\{ x \mapsto f(x, \theta) : \theta \in \Theta \right\},$$

where $\Theta \subset \mathbb{R}^p$. (For example, $f$ might be a neural network of a fixed architecture with weights $\theta$.)

Let's start with the special case of linear threshold functions:

$$F_d = \left\{ x \mapsto \text{sign}(\theta^\top x - \theta_0) : \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \right\}, \qquad \text{sign}(\alpha) := \begin{cases} 1 & \text{if } \alpha \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$

(This class is a thresholded vector space of functions of dimension $d+1$, so $d_{VC}(F_d) = d + 1$, but we'll compute $\Pi_{F_d}$ directly, because the proof ideas are useful for general parameterized classes.)

28 Linear threshold functions

Theorem: For the class of linear threshold functions,

$$\Pi_{F_d}(n) = 2 \sum_{i=0}^{d} \binom{n-1}{i}.$$

Proof idea: Fix $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$. Divide the parameter space $\{(\theta, \theta_0)\} = \mathbb{R}^{d+1}$ into cells that give the same classification of the points, and count the number of these equivalence classes using a geometric argument (one that dates back to Schläfli, 1851).

29 Linear threshold functions

1. Assume the points in $S$ are in "general position," that is, all subsets of

$$\left\{ \begin{pmatrix} x_1 \\ 1 \end{pmatrix}, \begin{pmatrix} x_2 \\ 1 \end{pmatrix}, \dots, \begin{pmatrix} x_n \\ 1 \end{pmatrix} \right\}$$

of size up to $d+1$ are linearly independent. (This implies that no three points are on a line, no four are in a plane, etc.) Notice that this is generically true.

2. For each $x_i$, define the hyperplane

$$P_i = \left\{ (\theta, \theta_0) \in \mathbb{R}^{d+1} : \theta^\top x_i + \theta_0 = 0 \right\}.$$

30 Linear threshold functions

3. In order for $(\theta, \theta_0)$ and $(\theta', \theta_0')$ to label $x_i$ differently, they must lie on opposite sides of $P_i$ (assuming that neither is on $P_i$: for points in general position, this is wlog). Thus,

$$|F(x_1^n)| = CC\left( \mathbb{R}^{d+1} \setminus \cup_{i=1}^{n} P_i \right),$$

where $CC$ denotes the number of connected components.

4. We define $C(n, d+1) := CC\left( \mathbb{R}^{d+1} \setminus \cup_{i=1}^{n} P_i \right)$; the inductive argument below shows that it depends only on $n$ and $d$.
5. First, $C(1, d) = 2$. (One plane splits $\mathbb{R}^d$ into two cells.)

31 Linear threshold functions

6. Next, $C(n+1, d) = C(n, d) + C(n, d-1)$. Indeed, suppose we have $n$ planes in $\mathbb{R}^d$ and we add an $(n+1)$th. It splits some of the $C(n, d)$ cells in two, and leaves some of them intact. The number that are split by $P_{n+1}$ is equal to the number of connected components of $P_{n+1} \setminus \cup_{i=1}^{n} P_i$, which is $C(n, d-1)$.
7. Induction shows that

$$C(n, d) = 2 \sum_{k=0}^{d-1} \binom{n-1}{k}.$$
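Not on the slides: the recursion (step 6), the closed form (step 7), and the theorem itself can be cross-checked numerically; the Monte Carlo pattern count below is a lower bound on $\Pi_{F_d}(n)$ that typically attains it for generic points. A sketch, with all names mine:

```python
from functools import lru_cache
from math import comb

import numpy as np

@lru_cache(maxsize=None)
def cells(n, d):
    """C(n, d): connected components of R^d minus n central hyperplanes in
    general position.  C(1, d) = 2; C(n+1, d) = C(n, d) + C(n, d-1)."""
    if d == 0:
        return 0          # convention making the recursion match the closed form
    if n == 1:
        return 2
    return cells(n - 1, d) + cells(n - 1, d - 1)

def growth_formula(n, d):
    """Pi_{F_d}(n) = 2 * sum_{i=0}^{d} C(n-1, i), which equals C(n, d+1)."""
    return 2 * sum(comb(n - 1, i) for i in range(d + 1))

def sampled_patterns(points, num_samples=500_000, seed=0):
    """Distinct patterns x -> 1[theta.x - theta0 >= 0] found by random search."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((num_samples, points.shape[1]))
    theta0 = rng.standard_normal(num_samples)
    signs = (points @ theta.T - theta0) >= 0
    return len({tuple(col) for col in signs.T})

n, d = 6, 2
points = np.random.default_rng(2).standard_normal((n, d))  # generic a.s.
print(cells(n, d + 1), growth_formula(n, d), sampled_patterns(points))  # 32 32 32
```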

32 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
• Geometric methods
  1. Counting cells: Linear threshold functions
  2. Counting cells: Arithmetic complexity

33 VC-dimension bounds for parameterized families

Consider a parameterized class of binary-valued functions,

$$F_f = \left\{ x \mapsto f(x, \theta) : \theta \in \mathbb{R}^p \right\},$$

where $f : \mathbb{R}^m \times \mathbb{R}^p \to \{\pm 1\}$.

Suppose that, for each $x$, $f(x, \cdot)$ can be computed using no more than $t$ operations of the following kinds:

1. arithmetic ($+$, $-$, $\times$, $/$),
2. comparisons ($>$, $=$, $<$),
3. output $\pm 1$.

Theorem: $d_{VC}(F_f) \le 4p(t+2)$.

34 VC-dimension bounds for parameterized families

Suppose that, for each $x$, $f(x, \cdot)$ can be computed using no more than $t$ operations of the following kinds:

1. arithmetic ($+$, $-$, $\times$, $/$),
2. exponentiation ($x \mapsto e^x$),
3. comparisons ($>$, $=$, $<$),
4. output $\pm 1$.

Theorem: $d_{VC}(F_f) = O(p^2 t^2)$.

35 VC-dimension bounds for parameterized families

Proof idea: Any $f$ of this kind can be expressed as

$$f(x, \theta) = h\left( \text{sign}(g_1(x, \theta)), \dots, \text{sign}(g_k(x, \theta)) \right)$$

for functions $g_i$ that are polynomial in $\theta$, and some boolean function $h$. (Notice that $k \le 2^t$, and the degree of any polynomial $g_i$ is no more than $2^t$.)

Notice that a change of the value of $f$ must be due to a change of the sign of one of the $g_i$. Hence, $\Pi_{F_f}(n) \le$ the number of connected components of the parameter space $\mathbb{R}^p$ after the sets $\{ g_i(x_j, \theta) = 0 \}$ are removed. We can bound this number using ideas similar to the case of linear threshold functions (being careful to ensure that the analog of points being in general position holds without loss of generality).

36 Overview

• Review: ERM and uniform laws of large numbers
• Neural network examples
• Other nonlinearities?
• Geometric methods
  1. Counting cells: Linear threshold functions
  2. Counting cells: Arithmetic complexity
