
Generalization in Deep Learning

Peter Bartlett Alexander Rakhlin

UC Berkeley

MIT

May 28, 2019

Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation

What is Generalization?

log(1 + 2 + 3) = log(1) + log(2) + log(3)

log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)

log(2 + 2) = log(2) + log(2)

log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)
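As a quick numerical check (my own illustration, not from the slides): each tuple above happens to satisfy product = sum, so the identity holds on these four "training" examples, yet the rule log(a + b + c) = log a + log b + log c does not generalize to new inputs.

```python
import math

# The four "training" examples from the slide: the pattern
# log(sum of terms) == sum of logs appears to hold on each.
train = [(1, 2, 3), (1, 1.5, 5), (2, 2), (1, 1.25, 9)]
for terms in train:
    lhs = math.log(sum(terms))
    rhs = sum(math.log(t) for t in terms)
    print(terms, round(lhs, 6), round(rhs, 6), abs(lhs - rhs) < 1e-9)

# ...but the "rule" fails on a new input (my own counterexample):
terms = (2, 3, 4)
print(terms, math.log(sum(terms)), sum(math.log(t) for t in terms))
```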

Statistical Learning Theory

Aim: Predict an outcome $y$ from some set $\mathcal{Y}$ of possible outcomes, on the basis of an observation $x$ from a feature space $\mathcal{X}$. Use a data set of n pairs

$$S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$

to choose a function $f : \mathcal{X} \to \mathcal{Y}$ so that, for subsequent $(x, y)$ pairs, $f(x)$ is a good prediction of $y$.

Formulations of Prediction Problems

To define the notion of a 'good prediction,' we can define a loss function

$$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}.$$

$\ell(\hat y, y)$ is the cost of predicting $\hat y$ when the outcome is $y$.

Aim: $\ell(f(x), y)$ small.

Formulations of Prediction Problems

Example: In pattern classification problems, the aim is to classify a pattern x into one of a finite number of classes (that is, the label space $\mathcal{Y}$ is finite). If all mistakes are equally bad, we could define
$$\ell(\hat y, y) = \mathbb{I}\{\hat y \ne y\} = \begin{cases} 1 & \text{if } \hat y \ne y, \\ 0 & \text{otherwise.} \end{cases}$$

Example: In a regression problem, with $\mathcal{Y} = \mathbb{R}$, we might choose the quadratic loss function $\ell(\hat y, y) = (\hat y - y)^2$.

Probabilistic Assumptions

Assume:
- There is a probability distribution P on $\mathcal{X} \times \mathcal{Y}$,
- The pairs $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are chosen independently according to P.

The aim is to choose f with small risk:
$$L(f) = \mathbb{E}\,\ell(f(X), Y).$$
For instance, in the pattern classification example, this is the misclassification probability:
$$L(f) = L_{01}(f) = \mathbb{E}\,\mathbb{I}\{f(X) \ne Y\} = \Pr(f(X) \ne Y).$$

Why not estimate the underlying distribution P (or $P_{Y|X}$) first? This is in general a harder problem than prediction.

Key difficulty: our goals are stated in terms of unknown quantities related to the unknown P. We have to use empirical data instead; this is the purview of statistical learning theory. For instance, we can calculate the empirical loss of $f : \mathcal{X} \to \mathcal{Y}$:
$$\hat L(f) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$$

The function $x \mapsto \hat f_n(x) = \hat f_n(x; X_1, Y_1, \ldots, X_n, Y_n)$ is random, since it depends on the random data $S = (X_1, Y_1, \ldots, X_n, Y_n)$. Thus, the risk
$$L(\hat f_n) = \mathbb{E}\left[\ell(\hat f_n(X), Y) \mid S\right] = \mathbb{E}\left[\ell(\hat f_n(X; X_1, Y_1, \ldots, X_n, Y_n), Y) \mid S\right]$$
is a random variable. We might aim for $\mathbb{E} L(\hat f_n)$ small, or for $L(\hat f_n)$ small with high probability (over the training data).

Quiz: what is random here?

1. $\hat L(f)$ for a given fixed f

2. $\hat f_n$

3. $\hat L(\hat f_n)$

4. $L(\hat f_n)$

5. $L(f)$ for a given fixed f

Theoretical analysis of performance is typically easier if $\hat f_n$ has a closed form (in terms of the training data).

E.g. ordinary least squares: $\hat f_n(x) = x^\top(X^\top X)^{-1} X^\top Y$.

Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as
- solutions to an optimization objective (e.g. logistic regression), or
- an iterative procedure without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)

The Gold Standard

Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function

$$f^* = \arg\min_f L(f),$$

where the minimization is over all (measurable) prediction rules $f : \mathcal{X} \to \mathcal{Y}$. The value of the lowest expected loss is called the Bayes error:

$$L(f^*) = \inf_f L(f).$$

Of course, we cannot calculate any of these quantities since P is unknown.

Bayes Optimal Function

The Bayes optimal function $f^*$ takes the following forms in two particular cases:

- Binary classification ($\mathcal{Y} = \{0, 1\}$) with the indicator loss: $f^*(x) = \mathbb{I}\{\eta(x) \ge 1/2\}$, where $\eta(x) = \mathbb{E}[Y \mid X = x]$

[Figure: the regression function $\eta(x)$, taking values between 0 and 1]

- Regression ($\mathcal{Y} = \mathbb{R}$) with squared loss: $f^*(x) = \eta(x)$, where $\eta(x) = \mathbb{E}[Y \mid X = x]$

The big question: is there a way to construct a learning algorithm with a guarantee that $L(\hat f_n) - L(f^*)$ is small for large enough sample size n?

Consistency

An algorithm that ensures

$$\lim_{n\to\infty} L(\hat f_n) = L(f^*) \quad \text{almost surely}$$

is called universally consistent. Consistency ensures that our algorithm approaches the best possible prediction performance as the sample size increases.

The good news: consistency is possible to achieve.

- easy if $\mathcal{X}$ is a finite or countable set
- not too hard if $\mathcal{X}$ is infinite and the underlying relationship between x and y is "continuous"

The bad news...

In general, we cannot prove anything quantitative about $L(\hat f_n) - L(f^*)$ unless we make further assumptions (incorporate prior knowledge).

"No Free Lunch" Theorems: unless we posit assumptions,

- For any algorithm $\hat f_n$, any n, and any $\epsilon > 0$, there exists a distribution P such that $L(f^*) = 0$ and
$$\mathbb{E} L(\hat f_n) \ge \frac{1}{2} - \epsilon$$

- For any algorithm $\hat f_n$ and any sequence $a_n$ that converges to 0, there exists a probability distribution P such that $L(f^*) = 0$ and, for all n,

$$\mathbb{E} L(\hat f_n) \ge a_n$$

Is this really "bad news"?

Not really. We always have some domain knowledge.

Two ways of incorporating prior knowledge:
- Direct way: assumptions on the distribution P (e.g. margin, smoothness of the regression function, etc.)
- Indirect way: redefine the goal to perform as well as a reference set $\mathcal{F}$ of predictors:
  $$L(\hat f_n) - \inf_{f \in \mathcal{F}} L(f)$$
  $\mathcal{F}$ encapsulates our inductive bias.

We often make both of these assumptions.

Perceptron

$(x_1, y_1), \ldots, (x_T, y_T) \in \mathcal{X} \times \{\pm 1\}$ (T may or may not be the same as n).

Maintain a hypothesis $w_t \in \mathbb{R}^d$ (initialize $w_1 = 0$). On round t,

- Consider $(x_t, y_t)$

- Form prediction $\hat y_t = \mathrm{sign}(\langle w_t, x_t \rangle)$

- If $\hat y_t \ne y_t$, update
  $$w_{t+1} = w_t + y_t x_t,$$
  else $w_{t+1} = w_t$.
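A minimal NumPy sketch of this update rule (my own illustration; the toy data, margin value, and function names are not from the slides):

```python
import numpy as np

def perceptron(X, y):
    """One pass of the Perceptron over (x_1, y_1), ..., (x_T, y_T).

    X: (T, d) array; y: (T,) array with entries in {-1, +1}.
    Returns the final weight vector and the number of mistakes.
    """
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        # A mistake (or zero-margin prediction) triggers an update,
        # matching the convention I{ y <w, x> <= 0 } used on later slides.
        if y_t * (w @ x_t) <= 0:
            w = w + y_t * x_t
            mistakes += 1
    return w, mistakes

# Toy linearly separable data in the unit ball with margin at least 0.1.
rng = np.random.default_rng(0)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)
X = rng.uniform(-1, 1, size=(500, 2)) / np.sqrt(2)   # ensures ||x|| <= 1
keep = np.abs(X @ w_star) >= 0.1
X, y = X[keep], np.sign(X[keep] @ w_star)
w, m = perceptron(X, y)
print("mistakes:", m, "<= 1/gamma^2 =", 1 / 0.1**2)
```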

Perceptron

[Figure: the Perceptron update, showing the current weight vector $w_t$ and the example $x_t$]

For simplicity, suppose all data are in a unit ball, $\|x_t\| \le 1$.

Margin with respect to $(x_1, y_1), \ldots, (x_T, y_T)$:

$$\gamma = \max_{\|w\|=1} \min_{i \in [T]} \left(y_i \langle w, x_i \rangle\right)_+,$$

where $(a)_+ = \max\{0, a\}$.

Theorem (Novikoff '62).

The Perceptron makes at most $1/\gamma^2$ mistakes (and corrections) on any sequence of examples with margin $\gamma$.

Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t,

$$\|w_{t+1}\|^2 = \|w_t + y_t x_t\|^2 \le \|w_t\|^2 + 2 y_t \langle w_t, x_t \rangle + 1 \le \|w_t\|^2 + 1.$$

Hence, $\|w_T\|^2 \le m$. For the optimal hyperplane $w^*$,

$$\gamma \le \langle w^*, y_t x_t \rangle = \langle w^*, w_{t+1} - w_t \rangle.$$

Hence (adding and canceling),

$$m\gamma \le \langle w^*, w_T \rangle \le \|w_T\| \le \sqrt{m}.$$

Recap

For any T and $(x_1, y_1), \ldots, (x_T, y_T)$,

$$\sum_{t=1}^T \mathbb{I}\{y_t \langle w_t, x_t \rangle \le 0\} \le \frac{D^2}{\gamma^2},$$

where $\gamma = \gamma(x_{1:T}, y_{1:T})$ is the margin and $D = D(x_{1:T}, y_{1:T}) = \max_t \|x_t\|$.

Let $w^*$ denote the max-margin hyperplane, $\|w^*\| = 1$.

Consequence for i.i.d. data (I)

Do one pass on the i.i.d. sequence $(X_1, Y_1), \ldots, (X_n, Y_n)$ (i.e. T = n).

Claim: the expected indicator loss of a randomly picked function $x \mapsto \langle w_\tau, x \rangle$ (with $\tau \sim \mathrm{unif}(1, \ldots, n)$) is at most
$$\frac{1}{n} \times \mathbb{E}\left[\frac{D^2}{\gamma^2}\right].$$

Proof: Take expectation on both sides:

$$\mathbb{E}\left[\frac{1}{n}\sum_{t=1}^n \mathbb{I}\{Y_t \langle w_t, X_t \rangle \le 0\}\right] \le \mathbb{E}\left[\frac{D^2}{n\gamma^2}\right]$$

The left-hand side can be written as (recall the notation $S = (X_i, Y_i)_{i=1}^n$)

$$\mathbb{E}_\tau \mathbb{E}_S\, \mathbb{I}\{Y_\tau \langle w_\tau, X_\tau \rangle \le 0\}.$$

Since $w_\tau$ is a function of $X_{1:\tau-1}, Y_{1:\tau-1}$, the above is

$$\mathbb{E}_S \mathbb{E}_\tau \mathbb{E}_{X,Y}\, \mathbb{I}\{Y \langle w_\tau, X \rangle \le 0\} = \mathbb{E}_S \mathbb{E}_\tau\, L_{01}(w_\tau).$$

The claim follows.

NB: To ensure that $\mathbb{E}[D^2/\gamma^2]$ is not infinite, we assume a margin $\gamma$ in P.

Consequence for i.i.d. data (II)

Now, rather than doing one pass, cycle through the data

$$(X_1, Y_1), \ldots, (X_n, Y_n)$$

repeatedly until there are no more mistakes (i.e. $T \le n \times (D^2/\gamma^2)$). Then the final hyperplane of the Perceptron separates the data perfectly, i.e.

it finds the minimum $\hat L_{01}(w_T) = 0$ of

$$\hat L_{01}(w) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{Y_i \langle w, X_i \rangle \le 0\}.$$

Empirical Risk Minimization with finite-time convergence. Can we say anything about the future performance of $w_T$?

Consequence for i.i.d. data (II)

Claim:
$$\mathbb{E} L_{01}(w_T) \le \frac{1}{n+1} \times \mathbb{E}\left[\frac{D^2}{\gamma^2}\right]$$

Proof: Shortcuts: $z = (x, y)$ and $\ell(w, z) = \mathbb{I}\{y \langle w, x \rangle \le 0\}$. Then

$$\mathbb{E}_S \mathbb{E}_Z\, \ell(w_T, Z) = \mathbb{E}_{S, Z_{n+1}}\left[\frac{1}{n+1}\sum_{t=1}^{n+1} \ell(w^{(-t)}, Z_t)\right],$$

where $w^{(-t)}$ is the Perceptron's final hyperplane after cycling through the data $Z_1, \ldots, Z_{t-1}, Z_{t+1}, \ldots, Z_{n+1}$. That is, leave-one-out is an unbiased estimate of the expected loss.

Now consider cycling the Perceptron on $Z_1, \ldots, Z_{n+1}$ until there are no more errors. Let $i_1, \ldots, i_m$ be the indices on which the Perceptron errs in any of the cycles. We know $m \le D^2/\gamma^2$. However, if index $t \notin \{i_1, \ldots, i_m\}$, then whether or not $Z_t$ was included does not matter, and $Z_t$ is correctly classified by $w^{(-t)}$. The claim follows.

A few remarks

- The Bayes error $L_{01}(f^*) = 0$, since we assume a margin in the distribution P (and, hence, in the data).
- Our assumption on P is not about its parametric or nonparametric form, but rather about what happens at the boundary.
- Curiously, we beat the CLT rate of $1/\sqrt{n}$.

Overparametrization and overfitting

The last lecture will discuss overparametrization and overfitting.

- The dimension d never appears in the mistake bound of the Perceptron. Hence, we can even take $d = \infty$.
- The Perceptron is a 1-layer neural network (no hidden layer) with d neurons.
- For $d > n$, any n points in general position can be separated (zero empirical error).

Conclusions:
- More parameters than data does not necessarily contradict good prediction performance ("generalization").
- Perfectly fitting the data does not necessarily contradict good generalization (particularly with no noise).
- Complexity is subtle (e.g. number of parameters vs. margin).
- ... however, margin is a strong assumption (more than just no noise).

Relaxing the assumptions

We proved

$$\mathbb{E} L_{01}(w_T) - L_{01}(f^*) \le O(1/n)$$

under the assumption that P is linearly separable with margin $\gamma$.

The assumption on the probability distribution implies that the Bayes classifier is a linear separator (the "realizable" setup).

We may relax the margin assumption, yet still hope to do well with linear separators. That is, we would aim to minimize

$$\mathbb{E} L_{01}(w_T) - L_{01}(w^*),$$

hoping that

$$L_{01}(w^*) - L_{01}(f^*)$$

is small, where $w^*$ is the best hyperplane classifier with respect to $L_{01}$.

More generally, we will work with some class $\mathcal{F}$ of functions that we hope captures well the relationship between X and Y. The choice of $\mathcal{F}$ gives rise to a bias-variance decomposition.

Let $f_{\mathcal{F}} \in \operatorname*{argmin}_{f \in \mathcal{F}} L(f)$ be the function in $\mathcal{F}$ with the smallest expected loss.

Bias-Variance Tradeoff

$$L(\hat f_n) - L(f^*) = \underbrace{L(\hat f_n) - \inf_{f \in \mathcal{F}} L(f)}_{\text{Estimation Error}} + \underbrace{\inf_{f \in \mathcal{F}} L(f) - L(f^*)}_{\text{Approximation Error}}$$

[Figure: the class $\mathcal{F}$, containing $\hat f_n$ and $f_{\mathcal{F}}$, with $f^*$ outside it]

Clearly, the two terms are at odds with each other:
- Making $\mathcal{F}$ larger means smaller approximation error but (as we will see) larger estimation error.
- Taking a larger sample n means smaller estimation error and has no effect on the approximation error.
- Thus, it makes sense to trade off the size of $\mathcal{F}$ and n (Structural Risk Minimization, or the Method of Sieves, or Model Selection).

Bias-Variance Tradeoff

In simple problems (e.g. linearly separable data) we might get away with having no bias-variance decomposition (e.g. as was done in the Perceptron case).

However, for more complex problems, our assumptions on P often imply that the class containing $f^*$ is too large and the estimation error cannot be controlled. In such cases, we need to produce a "biased" solution in hopes of reducing variance.

Finally, the bias-variance tradeoff need not be in the form we just presented. We will consider a different decomposition in the last lecture.

Consider the performance of empirical risk minimization:

Choose $\hat f_n \in \mathcal{F}$ to minimize $\hat L(f)$, where $\hat L$ is the empirical risk,

$$\hat L(f) = P_n\,\ell(f(X), Y) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i).$$

For pattern classification, this is the proportion of training examples misclassified.

How does the excess risk $L(\hat f_n) - L(f_{\mathcal{F}})$ behave? We can write

$$L(\hat f_n) - L(f_{\mathcal{F}}) = \left[L(\hat f_n) - \hat L(\hat f_n)\right] + \left[\hat L(\hat f_n) - \hat L(f_{\mathcal{F}})\right] + \left[\hat L(f_{\mathcal{F}}) - L(f_{\mathcal{F}})\right]$$

Therefore,

$$\mathbb{E} L(\hat f_n) - L(f_{\mathcal{F}}) \le \mathbb{E}\left[L(\hat f_n) - \hat L(\hat f_n)\right]$$

because the second term is nonpositive and the third is zero in expectation.

The term $L(\hat f_n) - \hat L(\hat f_n)$ is not necessarily zero in expectation (check!). An easy upper bound is

$$L(\hat f_n) - \hat L(\hat f_n) \le \sup_{f \in \mathcal{F}}\left(L(f) - \hat L(f)\right),$$

and this motivates the study of uniform laws of large numbers.

Roadmap: study the sup to
- understand when learning is possible,
- understand implications for sample complexity,
- design new algorithms (a regularizer arises from an upper bound on the sup).

Loss Class

In what follows, we will work with a function class $\mathcal{G}$ on $\mathcal{Z}$, and for the learning application, $\mathcal{G} = \ell \circ \mathcal{F}$:

$$\mathcal{G} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}$$

and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$.

Empirical Process Viewpoint

[Figure: for each function g (over all functions, with $\hat g_n$, $g^*$, and $g_{\mathcal{G}}$ marked), the deviation $\frac{1}{n}\sum_{i=1}^n g(z_i) - \mathbb{E}g$]

Uniform Laws of Large Numbers

For a class $\mathcal{G}$ of functions $g : \mathcal{Z} \to [0, 1]$, suppose that $Z_1, \ldots, Z_n, Z$ are i.i.d. on $\mathcal{Z}$, and consider

$$U = \sup_{g \in \mathcal{G}}\left(\mathbb{E}g(Z) - \frac{1}{n}\sum_{i=1}^n g(Z_i)\right) = \sup_{g \in \mathcal{G}} |Pg - P_n g| =: \|\underbrace{P - P_n}_{\text{emp.\ proc.}}\|_{\mathcal{G}}.$$

If U converges to 0, this is called a uniform law of large numbers.

Glivenko-Cantelli Classes

Definition.

$\mathcal{G}$ is a Glivenko-Cantelli class for P if $\|P_n - P\|_{\mathcal{G}} \xrightarrow{P} 0$.

- P is a distribution on $\mathcal{Z}$,
- $Z_1, \ldots, Z_n$ are drawn i.i.d. from P,
- $P_n$ is the empirical distribution (mass $1/n$ at each of $Z_1, \ldots, Z_n$),
- $\mathcal{G}$ is a set of measurable real-valued functions on $\mathcal{Z}$ with finite expectation under P,
- $P_n - P$ is an empirical process, that is, a stochastic process indexed by a class of functions $\mathcal{G}$, and
- $\|P_n - P\|_{\mathcal{G}} := \sup_{g \in \mathcal{G}} |P_n g - Pg|$.

Glivenko-Cantelli Classes

Why 'Glivenko-Cantelli'? An example of a uniform law of large numbers.

Theorem (Glivenko-Cantelli).

$$\|F_n - F\|_\infty \xrightarrow{as} 0.$$

Here, F is a cumulative distribution function, $F_n$ is the empirical cumulative distribution function,
$$F_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{Z_i \le t\},$$

where $Z_1, \ldots, Z_n$ are i.i.d. with distribution F, and $\|F - G\|_\infty = \sup_t |F(t) - G(t)|$.

Theorem (Glivenko-Cantelli).

$$\|P_n - P\|_{\mathcal{G}} \xrightarrow{as} 0, \quad \text{for } \mathcal{G} = \{z \mapsto \mathbb{I}\{z \le \theta\} : \theta \in \mathbb{R}\}.$$

Glivenko-Cantelli Classes

Not all $\mathcal{G}$ are Glivenko-Cantelli classes. For instance,
$$\mathcal{G} = \{\mathbb{I}\{z \in S\} : S \subset \mathbb{R}, |S| < \infty\}.$$
Then for a continuous distribution P, $Pg = 0$ for any $g \in \mathcal{G}$, but $\sup_{g \in \mathcal{G}} P_n g = 1$ for all n. So although $P_n g \xrightarrow{as} Pg$ for all $g \in \mathcal{G}$, this convergence is not uniform over $\mathcal{G}$: $\mathcal{G}$ is too large.

In general, how do we decide whether $\mathcal{G}$ is "too large"?

Rademacher Averages

Given a set $H \subseteq [-1, 1]^n$, measure its "size" by the average length of its projection onto a random direction.

The Rademacher process indexed by a vector $h \in H$:

$$\hat R_n(h) = \frac{1}{n}\langle \epsilon, h \rangle,$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ is a vector of i.i.d. Rademacher random variables (symmetric $\pm 1$). Rademacher averages: the expected maximum of the process,

$$\mathbb{E}\|\hat R_n\|_H = \mathbb{E}\sup_{h \in H}\frac{1}{n}\langle \epsilon, h \rangle.$$

Examples

If $H = \{h_0\}$, $\mathbb{E}\|\hat R_n\|_H = O(n^{-1/2})$. If $H = \{-1, 1\}^n$, $\mathbb{E}\|\hat R_n\|_H = 1$.
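A quick Monte Carlo check of these two examples (my own sketch; it uses the convention of taking the maximum of the absolute value of the projection, which is what makes the singleton example nontrivial):

```python
import numpy as np

def rademacher_average(H, n_draws=5000, seed=0):
    """Monte Carlo estimate of E max_{h in H} |<eps, h>| / n for a finite H.

    H: (m, n) array whose rows are the vectors h; eps has i.i.d. +/-1 entries.
    """
    rng = np.random.default_rng(seed)
    n = H.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return np.mean(np.max(np.abs(eps @ H.T), axis=1)) / n

for n in (25, 100, 400):
    h0 = np.ones((1, n))      # H = {h_0}: the average decays like n^{-1/2}
    print(n, rademacher_average(h0), 1 / np.sqrt(n))

# For H = {-1,+1}^n the maximizing h can match eps coordinate by coordinate,
# so E max_h <eps, h>/n = 1 for every n (no decay).
```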

If $v + c\{-1, 1\}^n \subseteq H$, then

$$\mathbb{E}\|\hat R_n\|_H \ge c - o(1).$$

How about $H = \{-\mathbf{1}, \mathbf{1}\}$, where $\mathbf{1}$ is a vector of 1's?

Examples

If H is finite,
$$\mathbb{E}\|\hat R_n\|_H \le c\sqrt{\frac{\log|H|}{n}}$$
for some constant c.

This bound can be loose, as it does not take into account "overlaps"/correlations between vectors.

Examples

Let $B_p^n$ be the unit ball in $\mathbb{R}^n$:

$$B_p^n = \{x \in \mathbb{R}^n : \|x\|_p \le 1\}, \qquad \|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}.$$

For $H = B_2^n$,
$$\mathbb{E}\max_{\|g\|_2 \le 1}\frac{1}{n}|\langle \epsilon, g \rangle| = \frac{1}{n}\mathbb{E}\|\epsilon\|_2 = \frac{1}{\sqrt{n}}.$$

On the other hand, the Rademacher averages for $B_1^n$ are $\frac{1}{n}$.

Rademacher Averages

What do these Rademacher averages have to do with our problem of bounding uniform deviations?

Suppose we take

$$H = \mathcal{G}(Z_1^n) = \{(g(Z_1), \ldots, g(Z_n)) : g \in \mathcal{G}\} \subset \mathbb{R}^n$$

[Figure: functions $g_1, g_2 \in \mathcal{G}$ evaluated at the data points $z_1, z_2, \ldots, z_n$]

Rademacher Averages

Then, conditionally on $Z_1^n$,

$$\mathbb{E}\|\hat R_n\|_H = \mathbb{E}\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \epsilon_i g(Z_i)$$

is a data-dependent quantity (hence the "hat") that measures the complexity of $\mathcal{G}$ projected onto the data.

Also define the Rademacher process $R_n$ (without the "hat"),

$$R_n(g) = \frac{1}{n}\sum_{i=1}^n \epsilon_i g(Z_i),$$

and the Rademacher complexity of $\mathcal{G}$ as

$$\mathbb{E}\|R_n\|_{\mathcal{G}} = \mathbb{E}_{Z_1^n}\mathbb{E}\|\hat R_n\|_{\mathcal{G}(Z_1^n)}.$$

Uniform Laws and Rademacher Complexity

We'll look at a proof of a uniform law of large numbers that involves two steps:

1. Symmetrization, which bounds $\mathbb{E}\|P - P_n\|_{\mathcal{G}}$ in terms of the Rademacher complexity $\mathbb{E}\|R_n\|_{\mathcal{G}}$.
2. Both $\|P - P_n\|_{\mathcal{G}}$ and $\|R_n\|_{\mathcal{G}}$ are concentrated about their expectations.

Uniform Laws and Rademacher Complexity

Theorem.

For any $\mathcal{G}$,
$$\mathbb{E}\|P - P_n\|_{\mathcal{G}} \le 2\mathbb{E}\|R_n\|_{\mathcal{G}}.$$

If $\mathcal{G} \subset [0, 1]^{\mathcal{Z}}$, then
$$\frac{1}{2}\mathbb{E}\|R_n\|_{\mathcal{G}} - \sqrt{\frac{\log 2}{2n}} \le \mathbb{E}\|P - P_n\|_{\mathcal{G}} \le 2\mathbb{E}\|R_n\|_{\mathcal{G}},$$
and, with probability at least $1 - 2\exp(-2\epsilon^2 n)$,
$$\mathbb{E}\|P - P_n\|_{\mathcal{G}} - \epsilon \le \|P - P_n\|_{\mathcal{G}} \le \mathbb{E}\|P - P_n\|_{\mathcal{G}} + \epsilon.$$
Thus, $\mathbb{E}\|R_n\|_{\mathcal{G}} \to 0$ iff $\|P - P_n\|_{\mathcal{G}} \xrightarrow{as} 0$.

That is, the supremum of the empirical process $P - P_n$ is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process $R_n$.

Uniform Laws and Rademacher Complexity

The first step is to symmetrize by replacing $Pg$ with $P_n' g = \frac{1}{n}\sum_{i=1}^n g(Z_i')$. In particular, we have

$$\begin{aligned}
\mathbb{E}\|P - P_n\|_{\mathcal{G}} &= \mathbb{E}\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[g(Z_i') - g(Z_i) \mid Z_1^n\right] \\
&\le \mathbb{E}\,\mathbb{E}\left[\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \left(g(Z_i') - g(Z_i)\right) \,\Big|\, Z_1^n\right] \\
&= \mathbb{E}\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \left(g(Z_i') - g(Z_i)\right) = \mathbb{E}\|P_n' - P_n\|_{\mathcal{G}}.
\end{aligned}$$

Uniform Laws and Rademacher Complexity

Another symmetrization: for any $\epsilon_i \in \{\pm 1\}$,

$$\mathbb{E}\|P_n' - P_n\|_{\mathcal{G}} = \mathbb{E}\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \left(g(Z_i') - g(Z_i)\right) = \mathbb{E}\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \epsilon_i\left(g(Z_i') - g(Z_i)\right).$$

This follows from the fact that $Z_i$ and $Z_i'$ are i.i.d., and so the distribution of the supremum is unchanged when we swap them. In particular, the expectation of the supremum is unchanged. Since this is true for any choice of the $\epsilon_i$, we can take the expectation over any random choice of the $\epsilon_i$; we'll pick them independently and uniformly.

Uniform Laws and Rademacher Complexity

$$\begin{aligned}
\mathbb{E}\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \epsilon_i\left(g(Z_i') - g(Z_i)\right)
&\le \mathbb{E}\left[\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \epsilon_i g(Z_i') + \sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n (-\epsilon_i) g(Z_i)\right] \\
&= 2\,\mathbb{E}\underbrace{\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \epsilon_i g(Z_i)}_{\text{Rademacher complexity}} = 2\mathbb{E}\|R_n\|_{\mathcal{G}},
\end{aligned}$$

where $R_n$ is the Rademacher process $R_n(g) = (1/n)\sum_{i=1}^n \epsilon_i g(Z_i)$.

Uniform Laws and Rademacher Complexity

The second inequality (desymmetrization) follows from:

$$\begin{aligned}
\mathbb{E}\|R_n\|_{\mathcal{G}}
&\le \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n \epsilon_i\left(g(Z_i) - \mathbb{E}g(Z_i)\right)\right\|_{\mathcal{G}} + \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n \epsilon_i\,\mathbb{E}g(Z_i)\right\|_{\mathcal{G}} \\
&\le \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n \epsilon_i\left(g(Z_i) - g(Z_i')\right)\right\|_{\mathcal{G}} + \|P\|_{\mathcal{G}}\,\mathbb{E}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i\right| \\
&= \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n \left(g(Z_i) - \mathbb{E}g(Z_i) + \mathbb{E}g(Z_i') - g(Z_i')\right)\right\|_{\mathcal{G}} + \|P\|_{\mathcal{G}}\,\mathbb{E}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i\right| \\
&\le 2\mathbb{E}\|P_n - P\|_{\mathcal{G}} + \sqrt{\frac{2\log 2}{n}}.
\end{aligned}$$

Back to the prediction problem

Recall: for the prediction problem, we have $\mathcal{G} = \ell \circ \mathcal{F}$ for some loss function $\ell$ and a set of functions $\mathcal{F}$.

We found that $\mathbb{E}\|P - P_n\|_{\ell\circ\mathcal{F}}$ can be related to $\mathbb{E}\|R_n\|_{\ell\circ\mathcal{F}}$. Question: can we get $\mathbb{E}\|R_n\|_{\mathcal{F}}$ instead?

For classification, this is easy. Suppose $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$, $Y \in \{\pm 1\}$, and $\ell(y, y') = \mathbb{I}\{y \ne y'\} = (1 - yy')/2$. Then

$$\mathbb{E}\|R_n\|_{\ell\circ\mathcal{F}} = \mathbb{E}\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \epsilon_i\,\ell(y_i, f(x_i)) = \mathbb{E}\sup_{f \in \mathcal{F}}\frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i + \frac{1}{n}\sum_{i=1}^n (-\epsilon_i y_i) f(x_i)\right) \le O\left(\frac{1}{\sqrt{n}}\right) + \frac{1}{2}\mathbb{E}\|R_n\|_{\mathcal{F}}.$$

Other loss functions?

Rademacher Complexity: Structural Results

Theorem.

- Subsets: $\mathcal{F} \subseteq \mathcal{G}$ implies $\|R_n\|_{\mathcal{F}} \le \|R_n\|_{\mathcal{G}}$.
- Scaling: $\|R_n\|_{c\mathcal{F}} = |c|\,\|R_n\|_{\mathcal{F}}$.
- Plus constant: for $|g(X)| \le 1$, $\left|\mathbb{E}\|R_n\|_{\mathcal{F}+g} - \mathbb{E}\|R_n\|_{\mathcal{F}}\right| \le \sqrt{2\log 2/n}$.
- Convex hull: $\|R_n\|_{\mathrm{co}\,\mathcal{F}} = \|R_n\|_{\mathcal{F}}$, where $\mathrm{co}\,\mathcal{F}$ is the convex hull of $\mathcal{F}$.
- Contraction: if $\ell : \mathbb{R} \times \mathbb{R} \to [-1, 1]$ has $\hat y \mapsto \ell(\hat y, y)$ 1-Lipschitz for all y, then for $\ell \circ \mathcal{F} = \{(x, y) \mapsto \ell(f(x), y)\}$,
  $$\mathbb{E}\|R_n\|_{\ell\circ\mathcal{F}} \le 2\mathbb{E}\|R_n\|_{\mathcal{F}} + c/\sqrt{n}.$$

Deep Networks

Deep compositions of nonlinear functions

$$h = h_m \circ h_{m-1} \circ \cdots \circ h_1,$$

e.g., $h_i : x \mapsto \sigma(W_i x)$ or $h_i : x \mapsto r(W_i x)$, with
$$\sigma(v)_i = \frac{1}{1 + \exp(-v_i)}, \qquad r(v)_i = \max\{0, v_i\}.$$
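A minimal sketch of such a composition in NumPy (my own illustration; the layer widths and random weights are arbitrary placeholders):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    return np.maximum(0.0, v)

def network(x, weights, nonlinearity):
    """Compute h = h_m o ... o h_1 with h_i : x -> nonlinearity(W_i x)."""
    h = x
    for W in weights:
        h = nonlinearity(W @ h)
    return h

rng = np.random.default_rng(0)
dims = [10, 32, 32, 1]   # input dimension 10, two hidden layers, scalar output
weights = [rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i]) for i in range(3)]
x = rng.normal(size=10)
print(network(x, weights, sigmoid), network(x, weights, relu))
```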

Rademacher Complexity of Sigmoid Networks

Theorem.

Consider the following class $\mathcal{F}_{B,d}$ of two-layer neural networks on $\mathbb{R}^d$:
$$\mathcal{F}_{B,d} = \left\{x \mapsto \sum_{i=1}^k w_i\,\sigma(v_i^\top x) : \|w\|_1 \le B, \|v_i\|_1 \le B, k \ge 1\right\},$$

where $B > 0$ and the nonlinear function $\sigma : \mathbb{R} \to \mathbb{R}$ satisfies the Lipschitz condition $|\sigma(a) - \sigma(b)| \le |a - b|$ and $\sigma(0) = 0$. Suppose that the distribution is such that $\|X\|_\infty \le 1$ a.s. Then
$$\mathbb{E}\|R_n\|_{\mathcal{F}_{B,d}} \le 2B^2\sqrt{\frac{2\log 2d}{n}},$$

where d is the dimension of the input space $\mathcal{X} = \mathbb{R}^d$.

Rademacher Complexity of Sigmoid Networks: Proof

Define
$$\mathcal{G} := \{(x_1, \ldots, x_d) \mapsto x_j : 1 \le j \le d\},$$
$$\mathcal{V}_B := \left\{x \mapsto v^\top x : \|v\|_1 = \sum_{i=1}^d |v_i| \le B\right\} = B\,\mathrm{co}(\mathcal{G} \cup -\mathcal{G}),$$
with
$$\mathrm{co}(\mathcal{F}) := \left\{\sum_{i=1}^k \alpha_i f_i : k \ge 1, \alpha_i \ge 0, \|\alpha\|_1 = 1, f_i \in \mathcal{F}\right\}.$$

Then
$$\mathcal{F}_{B,d} = \left\{x \mapsto \sum_{i=1}^k w_i\,\sigma(v_i(x)) : k \ge 1, \sum_{i=1}^k |w_i| \le B, v_i \in \mathcal{V}_B\right\} = B\,\mathrm{co}\left(\sigma\circ\mathcal{V}_B \cup -\sigma\circ\mathcal{V}_B\right).$$

Rademacher Complexity of Sigmoid Networks: Proof

$$\begin{aligned}
\mathbb{E}\|R_n\|_{\mathcal{F}_{B,d}}
&= \mathbb{E}\|R_n\|_{B\,\mathrm{co}(\sigma\circ\mathcal{V}_B \cup -\sigma\circ\mathcal{V}_B)} \\
&= B\,\mathbb{E}\|R_n\|_{\mathrm{co}(\sigma\circ\mathcal{V}_B \cup -\sigma\circ\mathcal{V}_B)} \\
&= B\,\mathbb{E}\|R_n\|_{\sigma\circ\mathcal{V}_B \cup -\sigma\circ\mathcal{V}_B} \\
&\le B\,\mathbb{E}\|R_n\|_{\sigma\circ\mathcal{V}_B} + B\,\mathbb{E}\|R_n\|_{-\sigma\circ\mathcal{V}_B} \\
&= 2B\,\mathbb{E}\|R_n\|_{\sigma\circ\mathcal{V}_B} \\
&\le 2B\,\mathbb{E}\|R_n\|_{\mathcal{V}_B} \\
&= 2B\,\mathbb{E}\|R_n\|_{B\,\mathrm{co}(\mathcal{G}\cup-\mathcal{G})} \\
&= 2B^2\,\mathbb{E}\|R_n\|_{\mathcal{G}\cup-\mathcal{G}} \\
&\le 2B^2\sqrt{\frac{2\log(2d)}{n}}.
\end{aligned}$$

Rademacher Complexity of Sigmoid Networks: Proof

The final step involved the following result.

Lemma.

For $f \in \mathcal{F}$ satisfying $|f(x)| \le 1$,
$$\mathbb{E}\|R_n\|_{\mathcal{F}} \le \mathbb{E}\sqrt{\frac{2\log\left(2|\mathcal{F}(X_1^n)|\right)}{n}}.$$

Rademacher Complexity of ReLU Networks

Using the positive homogeneity property of the ReLU nonlinearity (that is, for all $\alpha \ge 0$ and $x \in \mathbb{R}$, $\sigma(\alpha x) = \alpha\sigma(x)$) gives a simple argument to bound the Rademacher complexity.

Theorem.

For a training set $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \{\pm 1\}$ with $\|X_i\| \le 1$ a.s.,
$$\mathbb{E}\|R_n\|_{\mathcal{F}_{F,B}} \le \frac{(2B)^L}{\sqrt{n}},$$
where each $f \in \mathcal{F}_{F,B}$ is an L-layer network of the form

$$\mathcal{F}_{F,B} := \left\{x \mapsto W_L\,\sigma(W_{L-1}\cdots\sigma(W_1 x)\cdots)\right\},$$

$\sigma$ is 1-Lipschitz, positive homogeneous (that is, for all $\alpha \ge 0$ and $x \in \mathbb{R}$, $\sigma(\alpha x) = \alpha\sigma(x)$), applied componentwise, and $\|W_i\|_F \le B$. ($W_L$ is a row vector.)

ReLU Networks: Proof

(Write $\mathbb{E}$ for the conditional expectation given the data.)

Lemma.

$$\mathbb{E}\sup_{f \in \mathcal{F},\,\|W\|_F \le B}\left\|\frac{1}{n}\sum_{i=1}^n \epsilon_i\,\sigma(W f(X_i))\right\|_2 \le 2B\,\mathbb{E}\sup_{f \in \mathcal{F}}\left\|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(X_i)\right\|_2.$$

Iterating this and using Jensen's inequality proves the theorem.

ReLU Networks: Proof

For $W^\top = (w_1 \cdots w_k)$, we use positive homogeneity:

$$\left\|\sum_{i=1}^n \epsilon_i\,\sigma(W f(x_i))\right\|^2 = \sum_{j=1}^k\left(\sum_{i=1}^n \epsilon_i\,\sigma(w_j^\top f(x_i))\right)^2 = \sum_{j=1}^k \|w_j\|^2\left(\sum_{i=1}^n \epsilon_i\,\sigma\!\left(\frac{w_j^\top}{\|w_j\|} f(x_i)\right)\right)^2.$$

ReLU Networks: Proof

$$\begin{aligned}
\sup_{\|W\|_F \le B}\sum_{j=1}^k \|w_j\|^2\left(\sum_{i=1}^n \epsilon_i\,\sigma\!\left(\frac{w_j^\top}{\|w_j\|} f(x_i)\right)\right)^2
&= \sup_{\|w_j\|=1;\ \|\alpha\|_1 \le B^2}\sum_{j=1}^k \alpha_j\left(\sum_{i=1}^n \epsilon_i\,\sigma(w_j^\top f(x_i))\right)^2 \\
&= B^2\sup_{\|w\|=1}\left(\sum_{i=1}^n \epsilon_i\,\sigma(w^\top f(x_i))\right)^2,
\end{aligned}$$

then apply the Contraction and Cauchy-Schwarz inequalities.

Rademacher Complexity and Covering Numbers

Definition.

An $\epsilon$-cover of a subset T of a metric space $(S, d)$ is a set $\hat T \subset T$ such that for each $t \in T$ there is a $\hat t \in \hat T$ such that $d(t, \hat t) \le \epsilon$. The $\epsilon$-covering number of T is

$$\mathcal{N}(\epsilon, T, d) = \min\left\{|\hat T| : \hat T \text{ is an } \epsilon\text{-cover of } T\right\}.$$

Intuition: a k-dimensional set T has $\mathcal{N}(\epsilon, T, d) = \Theta(1/\epsilon^k)$. Example: $([0, 1]^k, \ell_\infty)$.
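A small numerical illustration of this intuition (my own sketch, using the standard grid construction: points at spacing $2\epsilon$ form an $\epsilon$-cover of $[0,1]^k$ in $\ell_\infty$):

```python
import math

def grid_cover_size(eps, k):
    """Size of a grid epsilon-cover of [0,1]^k in the l_infinity metric.

    Centers at eps, 3*eps, 5*eps, ... cover each coordinate, so
    ceil(1/(2*eps)) points per coordinate suffice.
    """
    per_coordinate = math.ceil(1.0 / (2.0 * eps))
    return per_coordinate ** k

for k in (1, 2, 5):
    for eps in (0.25, 0.1, 0.05):
        print(k, eps, grid_cover_size(eps, k), "~ (1/eps)^k =", round((1 / eps) ** k))
```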

Rademacher Complexity and Covering Numbers

Theorem.

For any $\mathcal{F}$ and any $X_1, \ldots, X_n$,

$$\mathbb{E}\|\hat R_n\|_{\mathcal{F}(X_1^n)} \le c\int_0^\infty \sqrt{\frac{\ln\mathcal{N}\left(\epsilon, \mathcal{F}(X_1^n), \ell_2\right)}{n}}\,d\epsilon$$

and

$$\mathbb{E}\|\hat R_n\|_{\mathcal{F}(X_1^n)} \ge \frac{1}{c\log n}\sup_{\alpha > 0}\,\alpha\sqrt{\frac{\ln\mathcal{N}\left(\alpha, \mathcal{F}(X_1^n), \ell_2\right)}{n}}.$$

Covering numbers and smoothly parameterized functions

Lemma.

Let $\mathcal{F}$ be a parameterized class of functions,

$$\mathcal{F} = \{f(\theta, \cdot) : \theta \in \Theta\}.$$

Let $\|\cdot\|_\Theta$ be a norm on $\Theta$ and let $\|\cdot\|_{\mathcal{F}}$ be a norm on $\mathcal{F}$. Suppose that the mapping $\theta \mapsto f(\theta, \cdot)$ is L-Lipschitz, that is,

$$\|f(\theta, \cdot) - f(\theta', \cdot)\|_{\mathcal{F}} \le L\|\theta - \theta'\|_\Theta.$$

Then $\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|_{\mathcal{F}}) \le \mathcal{N}(\epsilon/L, \Theta, \|\cdot\|_\Theta)$.

Covering numbers and smoothly parameterized functions

A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space.

Example: if $\mathcal{F}$ is smoothly parameterized by (a compact set of) k parameters, then $\mathcal{N}(\epsilon, \mathcal{F}) = O(1/\epsilon^k)$.


Recap

$$\mathbb{E}\|P - P_n\|_{\mathcal{F}} \asymp \mathbb{E}\|R_n\|_{\mathcal{F}}$$

Advantages of dealing with the Rademacher process:
- often easier to analyze
- can work conditionally on the data

Example: $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\| \le 1\}$, $\|x\| \le 1$. Then
$$\mathbb{E}\|\hat R_n\|_{\mathcal{F}} = \mathbb{E}\sup_{\|w\|\le 1}\frac{1}{n}\sum_{i=1}^n \epsilon_i\langle w, x_i \rangle = \mathbb{E}\sup_{\|w\|\le 1}\left\langle w, \frac{1}{n}\sum_{i=1}^n \epsilon_i x_i\right\rangle,$$
which is equal to
$$\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n \epsilon_i x_i\right\| \le \frac{\sqrt{\sum_{i=1}^n \|x_i\|^2}}{n} = \frac{\sqrt{\mathrm{tr}(XX^\top)}}{n}.$$
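This example is easy to check numerically; here is a Monte Carlo sketch (my own, not from the slides) comparing the empirical Rademacher average of the linear class to the $\sqrt{\mathrm{tr}(XX^\top)}/n$ bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x_i|| <= 1

# For {x -> <w, x> : ||w|| <= 1}, the sup over w is attained at w = s/||s||
# with s = (1/n) sum_i eps_i x_i, so the empirical Rademacher average is E||s||.
draws = 5000
eps = rng.choice([-1.0, 1.0], size=(draws, n))
s = eps @ X / n                      # each row is (1/n) sum_i eps_i x_i
estimate = np.mean(np.linalg.norm(s, axis=1))
bound = np.sqrt(np.trace(X.T @ X)) / n
print("Monte Carlo estimate:", estimate, " bound sqrt(tr(XX^T))/n:", bound)
```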

Classification and Rademacher Complexity

Recall:

For $\mathcal{F} \subseteq \{\pm 1\}^{\mathcal{X}}$, $Y \in \{\pm 1\}$, and $\ell(y, y') = \mathbb{I}\{y \ne y'\} = (1 - yy')/2$,
$$\mathbb{E}\|R_n\|_{\ell\circ\mathcal{F}} \le \frac{1}{2}\mathbb{E}\|R_n\|_{\mathcal{F}} + O\left(\frac{1}{\sqrt{n}}\right).$$

How do we get a handle on the Rademacher complexity of a set $\mathcal{F}$ of classifiers? Recall that we are looking at the "size" of $\mathcal{F}$ projected onto the data:
$$\mathcal{F}(x_1^n) = \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\} \subseteq \{-1, 1\}^n$$

Controlling Rademacher Complexity: Growth Function

Example: for the class of step functions, $\mathcal{F} = \{x \mapsto \mathbb{I}\{x \le \alpha\} : \alpha \in \mathbb{R}\}$, the set of restrictions,

$$\mathcal{F}(x_1^n) = \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\},$$
is always small: $|\mathcal{F}(x_1^n)| \le n + 1$. So $\mathbb{E}\|R_n\|_{\mathcal{F}} \le \sqrt{\frac{2\log 2(n+1)}{n}}$.

Example: if $\mathcal{F}$ is parameterized by k bits, $\mathcal{F} = \{x \mapsto f(x, \theta) : \theta \in \{0, 1\}^k\}$ for some $f : \mathcal{X} \times \{0, 1\}^k \to [0, 1]$, then
$$|\mathcal{F}(x_1^n)| \le 2^k, \qquad \mathbb{E}\|R_n\|_{\mathcal{F}} \le \sqrt{\frac{2(k + 1)\log 2}{n}}.$$

Growth Function

Definition.

For a class $\mathcal{F} \subseteq \{0, 1\}^{\mathcal{X}}$, the growth function is
$$\Pi_{\mathcal{F}}(n) = \max\left\{|\mathcal{F}(x_1^n)| : x_1, \ldots, x_n \in \mathcal{X}\right\}.$$

- $\mathbb{E}\|R_n\|_{\mathcal{F}} \le \sqrt{\frac{2\log(2\Pi_{\mathcal{F}}(n))}{n}}$; and $\mathbb{E}\|P - P_n\|_{\mathcal{F}} = O\left(\sqrt{\frac{\log\Pi_{\mathcal{F}}(n)}{n}}\right)$.
- $\Pi_{\mathcal{F}}(n) \le |\mathcal{F}|$, and $\lim_{n\to\infty}\Pi_{\mathcal{F}}(n) = |\mathcal{F}|$.
- $\Pi_{\mathcal{F}}(n) \le 2^n$. (But then this gives no useful bound on $\mathbb{E}\|R_n\|_{\mathcal{F}}$.)

Vapnik-Chervonenkis Dimension

Definition.

A class $\mathcal{F} \subseteq \{0, 1\}^{\mathcal{X}}$ shatters $\{x_1, \ldots, x_d\} \subseteq \mathcal{X}$ means that $|\mathcal{F}(x_1^d)| = 2^d$. The Vapnik-Chervonenkis dimension of $\mathcal{F}$ is

$$d_{VC}(\mathcal{F}) = \max\{d : \text{some } x_1, \ldots, x_d \in \mathcal{X} \text{ is shattered by } \mathcal{F}\} = \max\{d : \Pi_{\mathcal{F}}(d) = 2^d\}.$$

Vapnik-Chervonenkis Dimension: "Sauer's Lemma"

Theorem (Vapnik-Chervonenkis).

$d_{VC}(\mathcal{F}) \le d$ implies
$$\Pi_{\mathcal{F}}(n) \le \sum_{i=0}^d \binom{n}{i}.$$

If $n \ge d$, the latter sum is no more than $\left(\frac{en}{d}\right)^d$.

So the VC-dimension d is a single-integer summary of the growth function:
$$\Pi_{\mathcal{F}}(n) \begin{cases} = 2^n & \text{if } n \le d, \\ \le (e/d)^d n^d & \text{if } n > d. \end{cases}$$
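A quick numerical check of the two regimes of Sauer's lemma for small values (my own illustration):

```python
from math import comb, e

def sauer_bound(n, d):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

d = 3
for n in (2, 3, 5, 10, 50):
    if n <= d:
        # For n <= d the sum equals 2^n (the class can still shatter n points).
        print(n, sauer_bound(n, d), "= 2^n =", 2 ** n)
    else:
        print(n, sauer_bound(n, d), "<= (en/d)^d =", round((e * n / d) ** d, 1))
```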

Vapnik-Chervonenkis Dimension: "Sauer's Lemma"

Thus, for $d_{VC}(\mathcal{F}) \le d$ and $n \ge d$,
$$\sup_P \mathbb{E}\|P - P_n\|_{\mathcal{F}} = O\left(\sqrt{\frac{d\log(n/d)}{n}}\right).$$

And there is a matching lower bound.

Uniform convergence, uniformly over probability distributions, is equivalent to finiteness of the VC-dimension.

VC-dimension Bounds for Parameterized Families

Consider a parameterized class of binary-valued functions,

$$\mathcal{F} = \{x \mapsto f(x, \theta) : \theta \in \mathbb{R}^p\},$$
where $f : \mathcal{X} \times \mathbb{R}^p \to \{\pm 1\}$. Suppose that for each x, $\theta \mapsto f(x, \theta)$ can be computed using no more than t operations of the following kinds:
1. arithmetic ($+$, $-$, $\times$, $/$),
2. comparisons ($>$, $=$, $<$),
3. output $\pm 1$.

Theorem (Goldberg and Jerrum).

$$d_{VC}(\mathcal{F}) \le 4p(t + 2).$$

VC-Dimension of Neural Networks

Theorem.

Consider the class $\mathcal{F}$ of $\{-1, 1\}$-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:
1. Piecewise constant (linear threshold units): $\mathrm{VCdim}(\mathcal{F}) = \tilde\Theta(p)$. (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): $\mathrm{VCdim}(\mathcal{F}) = \tilde\Theta(pL)$. (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: $\mathrm{VCdim}(\mathcal{F}) = \tilde O(pL^2)$. (B., Maiorov, Meir, 1998)
4. Sigmoid: $\mathrm{VCdim}(\mathcal{F}) = \tilde O(p^2 k^2)$. (Karpinsky and MacIntyre, 1994)

Some Experimental Observations

(Schapire, Freund, B, Lee, 1997)

Some Experimental Observations

(Schapire, Freund, B, Lee, 1997)

Some Experimental Observations

(Lawrence, Giles, Tsoi, 1997)

Some Experimental Observations

(Neyshabur, Tomioka, Srebro, 2015)

Large-Margin Classifiers

- Consider a real-valued function $f : \mathcal{X} \to \mathbb{R}$ used for classification.
- The prediction on $x \in \mathcal{X}$ is $\mathrm{sign}(f(x)) \in \{-1, 1\}$.
- If $yf(x) > 0$, then f classifies x correctly.
- We call $yf(x)$ the margin of f on x (cf. the Perceptron).
- We can view a larger margin as a more confident correct classification.
- Minimizing a continuous loss, such as
  $$\sum_{i=1}^n (f(X_i) - Y_i)^2 \quad \text{or} \quad \sum_{i=1}^n \exp(-Y_i f(X_i)),$$
  encourages large margins.
- For large-margin classifiers, we should expect the fine-grained details of f to be less important.

Recall

Contraction property: if $\hat y \mapsto \ell(\hat y, y)$ is 1-Lipschitz for all y and $|\ell| \le 1$, then

$$\mathbb{E}\|R_n\|_{\ell\circ\mathcal{F}} \le 2\mathbb{E}\|R_n\|_{\mathcal{F}} + c/\sqrt{n}.$$

Unfortunately, the classification loss is not Lipschitz. By writing the classification loss as $(1 - yy')/2$, we did show that

$$\mathbb{E}\|R_n\|_{\ell\circ\mathrm{sign}(\mathcal{F})} \le 2\mathbb{E}\|R_n\|_{\mathrm{sign}(\mathcal{F})} + c/\sqrt{n},$$

but we want $\mathbb{E}\|R_n\|_{\mathcal{F}}$.

Rademacher Complexity for Lipschitz Loss

Consider the $1/\gamma$-Lipschitz surrogate loss
$$\phi(\alpha) = \begin{cases} 1 & \text{if } \alpha \le 0, \\ 1 - \alpha/\gamma & \text{if } 0 < \alpha < \gamma, \\ 0 & \text{if } \alpha \ge \gamma. \end{cases}$$

[Figure: the surrogate $\phi(yf(x))$ and the 0-1 loss $\mathbb{I}\{yf(x) \le 0\}$, plotted against the margin $yf(x)$]
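A direct translation of this piecewise definition, together with the indicator losses it sits between (my own sketch):

```python
import numpy as np

def ramp_loss(margin, gamma):
    """The 1/gamma-Lipschitz surrogate phi, applied to the margin y*f(x)."""
    return np.clip(1.0 - margin / gamma, 0.0, 1.0)

margins = np.array([-0.5, 0.0, 0.25, 0.5, 1.0])
gamma = 0.5
print(ramp_loss(margins, gamma))           # 1 for margin <= 0, 0 for margin >= gamma
print((margins <= 0).astype(float))        # lower bound: I{ y f(x) <= 0 }
print((margins <= gamma).astype(float))    # upper bound: I{ y f(x) <= gamma }
```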

Rademacher Complexity for Lipschitz Loss

The large-margin loss is an upper bound and the classification loss is a lower bound:

$$\mathbb{I}\{Yf(X) \le 0\} \le \phi(Yf(X)) \le \mathbb{I}\{Yf(X) \le \gamma\}.$$

So if we can relate the Lipschitz risk $P\phi(Yf(X))$ to the Lipschitz empirical risk $P_n\phi(Yf(X))$, we have a large-margin bound:

$$P\,\mathbb{I}\{Yf(X) \le 0\} \le P\phi(Yf(X)) \quad \text{cf.} \quad P_n\phi(Yf(X)) \le P_n\,\mathbb{I}\{Yf(X) \le \gamma\}.$$

Rademacher Complexity for Lipschitz Loss

With high probability, for all $f \in \mathcal{F}$,
$$\begin{aligned}
L_{01}(f) = P\,\mathbb{I}\{Yf(X) \le 0\} &\le P\phi(Yf(X)) \\
&\le P_n\phi(Yf(X)) + \frac{c}{\gamma}\mathbb{E}\|R_n\|_{\mathcal{F}} + O(1/\sqrt{n}) \\
&\le P_n\,\mathbb{I}\{Yf(X) \le \gamma\} + \frac{c}{\gamma}\mathbb{E}\|R_n\|_{\mathcal{F}} + O(1/\sqrt{n}).
\end{aligned}$$

Notice that we've turned a classification problem into a regression problem. The VC-dimension (which captures arbitrarily fine-grained properties of the function class) is no longer important.

Example: Linear Classifiers

Let $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\| \le 1\}$. Then, since $\mathbb{E}\|R_n\|_{\mathcal{F}} \lesssim n^{-1/2}$,
$$L_{01}(f) \lesssim \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{Y_i f(X_i) \le \gamma\} + \frac{1}{\gamma}\cdot\frac{1}{\sqrt{n}}.$$
Compare to the Perceptron.

Risk versus surrogate risk

In general, these margin bounds are upper bounds only. But for any surrogate loss $\phi$, we can define a transform $\psi$ that gives a tight relationship between the excess 0-1 risk (over the Bayes risk) and the excess $\phi$-risk:

Theorem.

1. For any P and f, $\psi(L_{01}(f) - L_{01}^*) \le L_\phi(f) - L_\phi^*$.
2. For $|\mathcal{X}| \ge 2$, $\epsilon > 0$ and $\theta \in [0, 1]$, there is a P and an f with
$$L_{01}(f) - L_{01}^* = \theta, \qquad \psi(\theta) \le L_\phi(f) - L_\phi^* \le \psi(\theta) + \epsilon.$$

Generalization: Margins and Size of Parameters

- A classification problem becomes a regression problem if we use a loss function that doesn't vary too quickly.
- For regression, the complexity of a neural network is controlled by the size of the parameters, and can be independent of the number of parameters.
- We have a tradeoff between the fit to the training data (margins) and the complexity (size of parameters):
  $$\Pr(\mathrm{sign}(f(X)) \ne Y) \le \frac{1}{n}\sum_{i=1}^n \phi(Y_i f(X_i)) + p_n(f)$$
- Even if the training set is classified correctly, it might be worthwhile to increase the complexity to improve this loss function.

In the next few sections, we show that
$$\mathbb{E}\left[L(\hat f_n) - \hat L(\hat f_n)\right]$$
can be small if the learning algorithm $\hat f_n$ has certain properties.

Keep in mind that this is an interesting quantity for two reasons:
- it upper bounds the excess expected loss of ERM, as we showed before;
- it can be written as an a-posteriori bound

$$L(\hat f_n) \le \hat L(\hat f_n) + \ldots$$

regardless of whether $\hat f_n$ minimized the empirical loss.

Compression Set

Let us use the shortened notation $S = \{Z_1, \ldots, Z_n\}$ for the data, and let us make the dependence of the algorithm $\hat f_n$ on the training set explicit: $\hat f_n = \hat f_n[S]$. As before, denote $\mathcal{G} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}$, and write $\hat g_n(\cdot) = \ell(\hat f_n(\cdot), \cdot)$. Let us write $\hat g_n[S](\cdot)$ to emphasize the dependence.

Suppose there exists a "compression function" $C_k$ which selects from any dataset $S$ of size n a subset of k examples $C_k(S) \subseteq S$ such that

$$\hat f_n[S] = \hat f_k[C_k(S)].$$

That is, the learning algorithm produces the same function when given S or its subset $C_k(S)$. One can keep in mind the example of support vectors in SVMs.

Then,

$$\begin{aligned}
L(\hat f_n) - \hat L(\hat f_n) &= \mathbb{E}\hat g_n - \frac{1}{n}\sum_{i=1}^n \hat g_n(Z_i) \\
&= \mathbb{E}\hat g_k[C_k(S)](Z) - \frac{1}{n}\sum_{i=1}^n \hat g_k[C_k(S)](Z_i) \\
&\le \max_{I \subseteq \{1,\ldots,n\},\,|I| \le k}\left\{\mathbb{E}\hat g_k[S_I](Z) - \frac{1}{n}\sum_{i=1}^n \hat g_k[S_I](Z_i)\right\},
\end{aligned}$$

where $S_I$ is the subset indexed by I.

Since $\hat g_k[S_I]$ only depends on k out of n points, the empirical average is "mostly out of sample". Adding and subtracting loss functions for an additional set of i.i.d. random variables $W = \{Z_1', \ldots, Z_k'\}$ results in the upper bound

$$\max_{I \subseteq \{1,\ldots,n\},\,|I| \le k}\left\{\mathbb{E}\hat g_k[S_I](Z) - \frac{1}{n}\sum_{Z' \in S'}\hat g_k[S_I](Z')\right\} + \frac{(b - a)k}{n},$$

where $[a, b]$ is the range of the functions in $\mathcal{G}$ and $S'$ is obtained from $S$ by replacing $S_I$ with the corresponding subset $W_I$.

For each fixed I, the random variable

$$\mathbb{E}\hat g_k[S_I](Z) - \frac{1}{n}\sum_{Z' \in S'}\hat g_k[S_I](Z')$$

is zero mean with standard deviation $O((b - a)/\sqrt{n})$. Hence, the expected maximum over I with respect to $S, W$ is at most
$$c(b - a)\sqrt{\frac{k\log(en/k)}{n}},$$
since $\log\binom{n}{k} \le k\log(en/k)$.

Conclusion: the compression-style argument limits the bias:
$$\mathbb{E}\left[L(\hat f_n) - \hat L(\hat f_n)\right] \le O\left(\sqrt{\frac{k\log n}{n}}\right),$$
which is non-vacuous if $k = o(n/\log n)$.

Recall that this term was the upper bound (up to log factors) on the expected excess loss of ERM if the class has VC dimension k. However, a possible equivalence between compression and VC dimension is still being investigated.

Example: Classification with Thresholds in 1D

- $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$
- $\mathcal{F} = \{f_\theta : f_\theta(x) = \mathbb{I}\{x \ge \theta\}, \theta \in [0, 1]\}$
- $\ell(f_\theta(x), y) = \mathbb{I}\{f_\theta(x) \ne y\}$

[Figure: the interval [0, 1] with the ERM threshold $\hat f_n$ marked]

For any set of data $(x_1, y_1), \ldots, (x_n, y_n)$, the ERM solution $\hat f_n$ has the property that the first occurrence $x_l$ to the left of the threshold has label $y_l = 0$, while the first occurrence $x_r$ to the right has label $y_r = 1$.

It is enough to take $k = 2$ and define $\hat f_n[S] = \hat f_2[(x_l, 0), (x_r, 1)]$.
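A sketch of this compression in code (my own illustration; the helper names erm_threshold and compression_set are hypothetical, and the data is noiseless so ERM attains zero training error):

```python
import numpy as np

def erm_threshold(x, y):
    """ERM for 1D threshold classifiers f_theta(x) = I{x >= theta} under 0-1 loss."""
    candidates = np.concatenate(([0.0], np.sort(x), [1.0]))
    errors = [np.mean((x >= theta).astype(int) != y) for theta in candidates]
    return candidates[int(np.argmin(errors))]

def compression_set(x, y, theta):
    """The two points ((x_l, 0), (x_r, 1)) straddling the learned threshold."""
    left = x[(x < theta) & (y == 0)]
    right = x[(x >= theta) & (y == 1)]
    return (left.max(), 0), (right.min(), 1)

rng = np.random.default_rng(0)
theta_star = 0.6
x = rng.uniform(0, 1, size=50)
y = (x >= theta_star).astype(int)
theta_hat = erm_threshold(x, y)
(xl, _), (xr, _) = compression_set(x, y, theta_hat)
# Re-running ERM on just the two compressed points recovers the same classifier:
# any threshold in (x_l, x_r] labels the full data set identically.
theta_from_two = erm_threshold(np.array([xl, xr]), np.array([0, 1]))
print(theta_hat, (xl, xr), theta_from_two)
```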

Further examples/observations:
- compression of size d for hyperplanes (realizable case)
- compression of size $1/\gamma^2$ for the margin case
- the Bernstein bound gives a $1/n$ rate rather than a $1/\sqrt{n}$ rate on realizable data (zero empirical error)

Algorithmic Stability

Recall that compression was a way to upper bound $\mathbb{E}\left[L(\hat f_n) - \hat L(\hat f_n)\right]$. Algorithmic stability is another path to the same goal.

Compare:

- Compression: $\hat f_n$ depends only on a subset of k datapoints.

- Stability: $\hat f_n$ does not depend on any of the datapoints too strongly.

As before, let's write the shorthand $g = \ell \circ f$ and $\hat g_n = \ell \circ \hat f_n$.

Algorithmic Stability

We now write

$$\mathbb{E}_S L(\hat f_n) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\left\{\hat g_n[Z_1, \ldots, Z_n](Z)\right\}$$

Again, the meaning of $\hat g_n[Z_1, \ldots, Z_n](Z)$: train on $Z_1, \ldots, Z_n$ and test on Z.

On the other hand,

$$\mathbb{E}_S \hat L(\hat f_n) = \mathbb{E}_{Z_1,\ldots,Z_n}\left\{\frac{1}{n}\sum_{i=1}^n \hat g_n[Z_1, \ldots, Z_n](Z_i)\right\} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z_1,\ldots,Z_n}\left\{\hat g_n[Z_1, \ldots, Z_n](Z_i)\right\} = \mathbb{E}_{Z_1,\ldots,Z_n}\left\{\hat g_n[Z_1, \ldots, Z_n](Z_1)\right\},$$

where the last step holds for symmetric algorithms (with respect to permutations of the training data). Of course, instead of $Z_1$ we can take any $Z_i$.

Now comes the renaming trick. It takes a minute to get used to, if you haven't seen it.

Note that $Z_1, \ldots, Z_n, Z$ are i.i.d. Hence,

$$\mathbb{E}_S L(\hat f_n) = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\left\{\hat g_n[Z_1, \ldots, Z_n](Z)\right\} = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\left\{\hat g_n[Z, Z_2, \ldots, Z_n](Z_1)\right\}.$$

Therefore,
$$\mathbb{E}\left\{L(\hat f_n) - \hat L(\hat f_n)\right\} = \mathbb{E}_{Z_1,\ldots,Z_n,Z}\left\{\hat g_n[Z, Z_2, \ldots, Z_n](Z_1) - \hat g_n[Z_1, \ldots, Z_n](Z_1)\right\}.$$

Of course, we haven't really done much except rewrite the expectation. But the difference

$$\hat g_n[Z, Z_2, \ldots, Z_n](Z_1) - \hat g_n[Z_1, \ldots, Z_n](Z_1)$$

has a "stability" interpretation. If the output of the algorithm "does not change much" when one datapoint is replaced with another, then the gap $\mathbb{E}\left\{L(\hat f_n) - \hat L(\hat f_n)\right\}$ is small.

Moreover, since everything we've written is an equality, this stability is equivalent to having a small gap $\mathbb{E}\left\{L(\hat f_n) - \hat L(\hat f_n)\right\}$.

NB: our aim of ensuring a small $\mathbb{E}\left\{L(\hat f_n) - \hat L(\hat f_n)\right\}$ only makes sense if $\hat L(\hat f_n)$ is small (e.g. on average). That is, the analysis only makes sense for methods that explicitly or implicitly minimize the empirical loss (or a regularized variant of it).

It's not enough to be stable. Consider a learning mechanism that ignores the data and outputs $\hat f_n = f_0$, a constant function. Then $\mathbb{E}\left\{L(\hat f_n) - \hat L(\hat f_n)\right\} = 0$ and the algorithm is very stable. However, it does not do anything interesting.

Uniform Stability

Rather than the average notion we just discussed, let's consider a much stronger notion:

We say that an algorithm is $\beta$-uniformly stable if

$$\forall i \in [n],\ \forall z_1, \ldots, z_n, z', z: \qquad \left|\hat g_n[S](z) - \hat g_n[S^{i,z'}](z)\right| \le \beta,$$

where $S^{i,z'} = \{z_1, \ldots, z_{i-1}, z', z_{i+1}, \ldots, z_n\}$.

Uniform Stability

Clearly, for any realization of $Z_1, \ldots, Z_n, Z$,

$$\hat g_n[Z, Z_2, \ldots, Z_n](Z_1) - \hat g_n[Z_1, \ldots, Z_n](Z_1) \le \beta,$$

and so the expected loss of a $\beta$-uniformly-stable ERM is $\beta$-close to its empirical error (in expectation). See recent work by Feldman and Vondrak for sharp high-probability statements.

Of course, it is unclear at this point whether a $\beta$-uniformly-stable ERM (or near-ERM) exists.

Kernel Ridge Regression

Consider
$$\hat f_n = \operatorname*{argmin}_{f \in \mathcal{H}}\ \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2 + \lambda\|f\|_K^2$$
in the RKHS $\mathcal{H}$ corresponding to kernel K. Assume $K(x, x) \le \kappa^2$ for any x.

Lemma.

Kernel Ridge Regression is $\beta$-uniformly stable with $\beta = O\left(\frac{1}{\lambda n}\right)$.

Proof (stability of Kernel Ridge Regression)

To prove this, first recall the definition of a $\sigma$-strongly convex function $\phi$ on a convex domain $\mathcal{W}$:
$$\forall u, v \in \mathcal{W}, \quad \phi(u) \ge \phi(v) + \langle\nabla\phi(v), u - v\rangle + \frac{\sigma}{2}\|u - v\|^2.$$

Suppose $\phi, \phi'$ are both $\sigma$-strongly convex, and suppose $w, w'$ satisfy $\nabla\phi(w) = \nabla\phi'(w') = 0$. Then
$$\phi(w') \ge \phi(w) + \frac{\sigma}{2}\|w - w'\|^2 \quad \text{and} \quad \phi'(w) \ge \phi'(w') + \frac{\sigma}{2}\|w - w'\|^2.$$
As a trivial consequence,

$$\sigma\|w - w'\|^2 \le [\phi(w') - \phi'(w')] + [\phi'(w) - \phi(w)].$$

Proof (stability of Kernel Ridge Regression)

Now take
$$\phi(f) = \frac{1}{n}\sum_{i \in S}(f(x_i) - y_i)^2 + \lambda\|f\|_K^2 \quad \text{and} \quad \phi'(f) = \frac{1}{n}\sum_{i \in S'}(f(x_i) - y_i)^2 + \lambda\|f\|_K^2,$$
where $S$ and $S'$ differ in one element: $(x_i, y_i)$ is replaced with $(x_i', y_i')$.

Let $\hat f_n, \hat f_n'$ be the minimizers of $\phi, \phi'$, respectively. Then

$$\phi(\hat f_n') - \phi'(\hat f_n') \le \frac{1}{n}\left[(\hat f_n'(x_i) - y_i)^2 - (\hat f_n'(x_i') - y_i')^2\right]$$
and
$$\phi'(\hat f_n) - \phi(\hat f_n) \le \frac{1}{n}\left[(\hat f_n(x_i') - y_i')^2 - (\hat f_n(x_i) - y_i)^2\right].$$

NB: we have been operating with f as vectors. To be precise, one needs to define the notion of strong convexity over . Let us sweep it under H the rug and say that φ, φ0 are2 λ-strongly convex with respect to . k·kK 126 / 174 Proof (stability of Kernel Ridge Regression) 2 Then fbn fbn0 is at most − K

1  2 2 2 2 (fb0(xi ) yi ) (fbn(xi ) yi ) + (fbn(x 0) y 0) (fb0(x 0) y 0) 2λn n − − − i − i − n i − i which is at most 1 C fbn fb0 2λn − n ∞

where C = 4(1 + c) if Yi 1 and fbn(xi ) c. | | ≤ | | ≤ On the other hand, for any x p p f (x) = f , Kx f Kx = f Kx , Kx = f K(x, x) κ f h i ≤ k kK k k k kK h i k kK ≤ k kK and so

‖f‖_∞ ≤ κ ‖f‖_K.

127 / 174 Proof (stability of Kernel Ridge Regression)

Putting everything together,

‖f̂_n − f̂_n′‖_K^2 ≤ (C/(2λn)) ‖f̂_n − f̂_n′‖_∞ ≤ (κC/(2λn)) ‖f̂_n − f̂_n′‖_K.

Hence,

‖f̂_n − f̂_n′‖_K ≤ κC/(2λn)   and   ‖f̂_n − f̂_n′‖_∞ ≤ κ ‖f̂_n − f̂_n′‖_K ≤ κ^2 C/(2λn).

To finish the claim,

| (f̂_n(x_i) − y_i)^2 − (f̂_n′(x_i) − y_i)^2 | ≤ C ‖f̂_n − f̂_n′‖_∞ ≤ κC ‖f̂_n − f̂_n′‖_K ≤ O(1/(λn)).

128 / 174 We consider “local” procedures such as k-Nearest-Neighbors or local smoothing. They have a different bias-variance decomposition (we do not fix a class F). The analysis will rely on local similarity (e.g. Lipschitzness) of the regression function f*. Idea: to predict y at a given x, look up in the dataset those Y_i for which X_i is “close” to x.

130 / 174 Bias-Variance It’s time to revisit the bias-variance picture. Recall that our goal was to ensure that

E L(f̂_n) − L(f*)

decreases with the data size n, where f* gives the smallest possible L.

For “simple problems” (that is, strong assumptions on P), one can ensure this without the bias-variance decomposition. Examples: Perceptron, linear regression in d < n regime, etc.

However, for more interesting problems, we cannot get this difference to be small without introducing bias, because otherwise the variance (fluctuation of the stochastic part) is too large.

Our approach so far was to split this term into an estimation error and an approximation error with respect to some class F:

[ E L(f̂_n) − L(f_F) ] + [ L(f_F) − L(f*) ],

where f_F is the best predictor in F.

131 / 174 Bias-Variance

Now we’ll study a different bias-variance decomposition, typically used in nonparametric statistics. We will only work with square loss.

Rather than fixing a class F that controls the estimation error, we fix an algorithm (procedure/estimator) f̂_n that has some tunable parameter.

By definition, E[Y | X = x] = f*(x). Then we write

E L(f̂_n) − L(f*) = E(f̂_n(X) − Y)^2 − E(f*(X) − Y)^2
               = E(f̂_n(X) − f*(X) + f*(X) − Y)^2 − E(f*(X) − Y)^2
               = E(f̂_n(X) − f*(X))^2,

because the cross term vanishes (check!)
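For completeness, a short check (not on the slides) that the cross term vanishes; it uses only E[Y | X] = f*(X) and the independence of (X, Y) from the sample S:

\begin{align*}
\mathbb{E}\big[(\hat f_n(X) - f^*(X))(f^*(X) - Y)\big]
&= \mathbb{E}\Big[(\hat f_n(X) - f^*(X))\,\mathbb{E}\big[f^*(X) - Y \mid X, \mathcal{S}\big]\Big] \\
&= \mathbb{E}\big[(\hat f_n(X) - f^*(X)) \cdot 0\big] = 0 .
\end{align*}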

132 / 174 Bias-Variance

Before proceeding, let us discuss the last expression.

E(f̂_n(X) − f*(X))^2 = E_S ∫ (f̂_n(x) − f*(x))^2 P(dx) = ∫ E_S (f̂_n(x) − f*(x))^2 P(dx)

We will often analyze E_S (f̂_n(x) − f*(x))^2 for fixed x and then integrate. The integral is a measure of distance between two functions:

‖f − g‖_{L_2(P)}^2 = ∫ (f(x) − g(x))^2 P(dx).

133 / 174 Bias-Variance

Let us drop L2(P) from notation for brevity. The bias-variance decomposition can be written as

E_S ‖ f̂_n − f* ‖_{L_2(P)}^2 = E_S ‖ f̂_n − E_{Y_{1:n}}[f̂_n] + E_{Y_{1:n}}[f̂_n] − f* ‖^2

= E_S ‖ f̂_n − E_{Y_{1:n}}[f̂_n] ‖^2 + E_{X_{1:n}} ‖ E_{Y_{1:n}}[f̂_n] − f* ‖^2,

because the cross term is zero in expectation.

The first term is variance, the second is squared bias. One typically increases with the parameter, the other decreases.

The parameter is chosen either (a) theoretically or (b) by cross-validation (the usual case in practice); see the sketch below.
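A minimal sketch of option (b), with hypothetical helper names and a single validation split standing in for full K-fold cross-validation; fit_predict is any estimator with a tunable parameter (e.g. k in k-NN or the bandwidth h further below).

import numpy as np

def select_parameter(X, Y, params, fit_predict, val_frac=0.2, seed=0):
    # Hold out a validation set and pick the parameter value with the
    # smallest validation squared error. fit_predict(Xtr, Ytr, Xval, p)
    # must return predictions on Xval for parameter value p.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_val = int(val_frac * X.shape[0])
    val, tr = perm[:n_val], perm[n_val:]
    errs = [np.mean((fit_predict(X[tr], Y[tr], X[val], p) - Y[val]) ** 2)
            for p in params]
    return params[int(np.argmin(errs))]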

134 / 174 k-Nearest Neighbors

As before, we are given (X_1, Y_1), …, (X_n, Y_n) i.i.d. from P. To make a prediction of Y at a given x, we sort the points according to the distance ‖X_i − x‖. Let

(X_(1), Y_(1)), …, (X_(n), Y_(n))

be the sorted list (remember this depends on x).

The k-NN estimate is defined as

f̂_n(x) = (1/k) Σ_{i=1}^k Y_(i).
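A minimal numpy sketch of this estimator (illustrative only):

import numpy as np

def knn_regress(x, X, Y, k):
    # Average the labels of the k training points closest to x
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Y[nearest].mean()

# usage sketch
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_regress(np.zeros(2), X, Y, k=10))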

135 / 174 Variance: Given x,

f̂_n(x) − E_{Y_{1:n}}[f̂_n(x)] = (1/k) Σ_{i=1}^k ( Y_(i) − f*(X_(i)) )

which is of order 1/√k. Hence the variance is of order 1/k.

Bias: a bit more complicated. For a given x,

E_{Y_{1:n}}[f̂_n(x)] − f*(x) = (1/k) Σ_{i=1}^k ( f*(X_(i)) − f*(x) ).

Suppose f* is 1-Lipschitz. Then the square of the above is

( (1/k) Σ_{i=1}^k ( f*(X_(i)) − f*(x) ) )^2 ≤ (1/k) Σ_{i=1}^k ‖X_(i) − x‖^2

So the bias is governed by how close the k nearest random points are to x.

136 / 174 Claim: enough to know the upper bound on the closest point to x among n points.

Argument: for simplicity, assume that n/k is an integer. Divide the original (unsorted) dataset into k blocks of size n/k each. Let X^i be the closest point to x in the i-th block. Then the collection X^1, …, X^k is a k-subset that is no closer to x than the set of k nearest neighbors. That is,

(1/k) Σ_{i=1}^k ‖X_(i) − x‖^2 ≤ (1/k) Σ_{i=1}^k ‖X^i − x‖^2

Taking expectation (with respect to dataset), the bias term is at most

E { (1/k) Σ_{i=1}^k ‖X^i − x‖^2 } = E ‖X^1 − x‖^2,

which is the expected squared distance from x to the closest point in a random set of n/k points.

137 / 174 If the support of X is bounded and d ≥ 3, then one can estimate

E ‖X − X_(1)‖^2 ≲ n^{−2/d}.

That is, we expect the nearest of n randomly drawn points to be no further than roughly n^{−1/d} from a random point X.

Thus, the bias (which is the expected squared distance from x to the closest point in a random set of n/k points) is at most

(n/k)^{−2/d}.

138 / 174 Putting everything together, the bias-variance decomposition yields

1/k + (k/n)^{2/d}

The optimal choice is k ∼ n^{2/(2+d)}, and the overall rate of estimation at a given point x is n^{−2/(2+d)}. Since the result holds for any x, the integrated risk is also

E ‖f̂_n − f*‖^2 ≲ n^{−2/(2+d)}.
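The choice of k comes from balancing the two terms; a one-line check (not on the slides):

\[
\frac{1}{k} \asymp \Big(\frac{k}{n}\Big)^{2/d}
\iff k^{(d+2)/d} \asymp n^{2/d}
\iff k \asymp n^{\frac{2}{2+d}},
\qquad \text{giving the rate } \frac{1}{k} \asymp n^{-\frac{2}{2+d}} .
\]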

139 / 174 Summary

I We sketched the proof that k-Nearest-Neighbors has sample complexity guarantees for prediction or estimation problems with square loss, if k is chosen appropriately.
I The analysis is very different from the “empirical process” approach for ERM.
I Truly nonparametric!
I No assumptions on the underlying density (in d ≥ 3) beyond compact support. Additional assumptions are needed for d ≤ 3.

140 / 174 Local Kernel Regression: Nadaraya-Watson

Fix a kernel K : R^d → R_{≥0}. Assume K is zero outside the unit Euclidean ball at the origin (not true for e^{−x^2}, but close enough).

(figure from Györfi et al.)

Let K_h(x) = K(x/h), so that K_h(x − x′) is zero if ‖x − x′‖ ≥ h. Here h is the “bandwidth”, a tunable parameter. Assume K(x) > c · I{‖x‖ ≤ 1} for some c > 0. This is important for the “averaging effect” to kick in.

141 / 174 Nadaraya-Watson estimator:

f̂_n(x) = Σ_{i=1}^n Y_i W_i(x)   with   W_i(x) = K_h(x − X_i) / Σ_{j=1}^n K_h(x − X_j)

(Note: Σ_i W_i(x) = 1.)
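A minimal numpy sketch of the estimator (illustrative; the boxcar kernel K(u) = I{‖u‖ ≤ 1} is one admissible choice satisfying the assumptions above):

import numpy as np

def nadaraya_watson(x, X, Y, h):
    # Weighted average of Y_i with weights K_h(x - X_i); with the boxcar
    # kernel, points farther than h from x get zero weight.
    w = (np.linalg.norm(X - x, axis=1) <= h).astype(float)
    if w.sum() == 0:
        return Y.mean()  # fallback when no training point falls in the window
    return (w * Y).sum() / w.sum()

# usage sketch
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(500)
print(nadaraya_watson(np.zeros(2), X, Y, h=0.3))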

142 / 174 Unlike the k-NN example, bias is easier to estimate.

Bias: for a given x,

" n # n X X EY1:n [fbn(x)] = EY1:n Yi Wi (x) = f ∗(Xi )Wi (x) i=1 i=1 and so n X EY1:n [fbn(x)] f ∗(x) = (f ∗(Xi ) f ∗(x))Wi (x) − − i=1

Suppose f* is 1-Lipschitz. Since K_h is zero outside the ball of radius h,

| E_{Y_{1:n}}[f̂_n(x)] − f*(x) |^2 ≤ h^2.

143 / 174 Variance: we have

f̂_n(x) − E_{Y_{1:n}}[f̂_n(x)] = Σ_{i=1}^n ( Y_i − f*(X_i) ) W_i(x)

The expectation of the square of this difference is at most

" n # X 2 2 E (Yi f ∗(Xi )) Wi (x) − i=1 since cross terms are zero (fix X ’s, take expectation with respect to the Y ’s).

We are left analyzing

n · E [ K_h(x − X_1)^2 / ( Σ_{i=1}^n K_h(x − X_i) )^2 ]

Under some assumptions on the density of X, the denominator is at least (nh^d)^2 with high probability, whereas E K_h(x − X_1)^2 = O(h^d), assuming ∫ K^2 < ∞. This gives an overall variance of O(1/(nh^d)). Many details are skipped here (e.g. problems at the boundary, assumptions, etc.).

144 / 174 Overall, balancing the bias and variance with h ∼ n^{−1/(2+d)} yields

h^2 + 1/(nh^d) = n^{−2/(2+d)}

145 / 174 Summary

I Analyzed smoothing methods with kernels. As with nearest neighbors, slow (nonparametric) rates in large d.
I Same bias-variance decomposition approach as for k-NN.

146 / 174 Overfitting in Deep Networks

I Deep networks can be trained to zero training error (for regression loss)
I ... with near state-of-the-art performance (Ruslan Salakhutdinov, Simons Boot Camp, January 2017)

I ... even for noisy problems.
I No tradeoff between fit to training data and complexity!
I Benign overfitting.

(Zhang, Bengio, Hardt, Recht, Vinyals, 2017)

148 / 174 Overfitting in Deep Networks

Not surprising:

I More parameters than sample size. (That’s called ’nonparametric.’)
I Good performance with zero classification loss on training data. (c.f. margins analysis: trade-off between (regression) fit and complexity.)
I Good performance with zero regression loss on training data when the Bayes risk is zero. (c.f. Perceptron)

Surprising:

I Good performance with zero regression loss on noisy training data.
I All the label noise is making its way into the parameters, but it isn’t hurting prediction accuracy!

149 / 174 Overfitting

“... interpolating fits... [are] unlikely to predict future data well at all.” “Elements of Statistical Learning,” Hastie, Tibshirani, Friedman, 2001

150 / 174 Overfitting

151 / 174 Benign Overfitting

A new statistical phenomenon: good prediction with zero training error for regression loss

I Statistical wisdom says a prediction rule should not fit too well.
I But deep networks are trained to fit noisy data perfectly, and they predict well.

152 / 174 Not an Isolated Phenomenon

Kernel Regression on MNIST

(figure: log test error vs. regularization λ for kernel regression on MNIST digit pairs [2,5], [2,9], [3,6], [3,8], [4,7])

λ = 0: the interpolated solution, perfect fit on training data.

153 / 174 Not an Isolated Phenomenon

154 / 174 Simplicial interpolation (Belkin-Hsu-Mitra)

Observe: 1-Nearest-Neighbor is interpolating. However, one cannot guarantee that E L(f̂_n) − L(f*) is small.

Cover-Hart ’67:  lim_{n→∞} E L(f̂_n) ≤ 2 L(f*)

156 / 174 Simplicial interpolation (Belkin-Hsu-Mitra)

Under regularity conditions, simplicial interpolation f̂_n satisfies

lim sup_{n→∞} E ‖f̂_n − f*‖^2 ≤ (2/(d+2)) · E(f*(X) − Y)^2

Blessing of high dimensionality!

157 / 174 Nadaraya-Watson estimator (Belkin-R.-Tsybakov)

Consider the Nadaraya-Watson estimator. Take a kernel that approaches a large value τ at 0, e.g.

K(x) = min{ 1/‖x‖^α , τ }

Note that large τ means f̂_n(X_i) ≈ Y_i, since the weight W_i(X_i) is large.

If τ = ∞, we get exact interpolation of all training data, f̂_n(X_i) = Y_i. Yet the earlier proof still goes through. Hence, “memorizing the data” (governed by the parameter τ) is completely decoupled from the bias-variance trade-off (as given by the parameter h).

Minimax rates are possible (with a suitable choice of h).
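A minimal sketch of this singular-kernel variant (illustrative only; note how τ controls memorization while h still sets the smoothing neighborhood):

import numpy as np

def singular_kernel_nw(x, X, Y, h, alpha=1.0, tau=1e12):
    # Nadaraya-Watson weights with K(u) = min(1/||u||^alpha, tau),
    # restricted to the window ||x - X_i|| < h
    d = np.linalg.norm(X - x, axis=1)
    k = np.where(d < h, np.minimum(1.0 / np.maximum(d, 1e-300) ** alpha, tau), 0.0)
    if k.sum() == 0:
        return Y.mean()
    return (k * Y).sum() / k.sum()

# at a training point the weight on its own label dominates, so the
# estimate (nearly) interpolates: compare the two printed values
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Y = X[:, 0] + 0.1 * rng.standard_normal(200)
print(singular_kernel_nw(X[0], X, Y, h=0.5), Y[0])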

159 / 174 Min-norm interpolation in RKHS

Min-norm interpolation:

f̂_n := argmin_{f ∈ H} ‖f‖_H   s.t.  f(x_i) = y_i  ∀ i ≤ n.

Closed-form solution:

f̂_n(x) = K(x, X) K(X, X)^{−1} Y

Note: GD/SGD started from 0 converges to this minimum-norm solution.
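A minimal numpy sketch of this closed form (illustrative; the Laplace kernel is an arbitrary choice here, and a tiny jitter replaces exact inversion for numerical stability):

import numpy as np

def laplace_kernel(A, B, s=0.5):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / s)

def min_norm_interpolant(X, Y, kernel=laplace_kernel, jitter=1e-10):
    # f_hat(x) = K(x, X) K(X, X)^{-1} Y : interpolates the training data
    K = kernel(X, X) + jitter * np.eye(X.shape[0])
    alpha = np.linalg.solve(K, Y)
    return lambda Xnew: kernel(Xnew, X) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
f_hat = min_norm_interpolant(X, Y)
print(np.abs(f_hat(X) - Y).max())  # (near-)zero training error: the fit interpolates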

Difficulty in analyzing the bias/variance of f̂_n: it is not clear how the random matrix K^{−1} “aggregates” the Y’s between data points.

161 / 174 Min-norm interpolation in R^d with d ≫ n

(Hastie, Montanari, Rosset, Tibshirani, 2019) Linear regression with p, n → ∞ and p/n → γ, where the empirical spectral distribution of Σ (the discrete measure on its set of eigenvalues) converges to a fixed measure. Apply random matrix theory to explore the asymptotics of the excess prediction error as γ, the noise variance, and the eigenvalue distribution vary. Also study the asymptotics of a model involving random nonlinear features.

162 / 174 Implicit regularization in the regime d ≫ n (Liang and R., 2018)

Informal Theorem: geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

Bias bounded in terms of effective dimension

dim(XXᵀ) = Σ_j  λ_j(XXᵀ/d) / [ γ + λ_j(XXᵀ/d) ]^2

where γ > 0 arises from implicit regularization due to the curvature of the kernel and high dimensionality. Compare with regularized least squares as a low-pass filter:

dim(XXᵀ) = Σ_j  λ_j(XXᵀ/d) / [ λ + λ_j(XXᵀ/d) ]

163 / 174 Implicit regularization in the regime d ≫ n

Test error (y-axis) as a function of how quickly the spectrum decays. Best performance if the spectrum decays, but not too quickly.

  j κ1/κ λj (Σ) = 1 − d

164 / 174 Lower Bounds

(R. and Zhai, 2018)

Min-norm solution for Laplace kernel

K_σ(x, x′) = σ^{−d} exp{ −‖x − x′‖/σ }

is not consistent if d is low:

Theorem.

For odd dimension d, with probability 1 − O(n^{−1/2}), for any choice of σ,

E ‖f̂_n − f*‖^2 ≥ Ω_d(1).

Conclusion: need high dimensionality of data.

166 / 174 Connection to neural networks

For wide enough randomly initialized neural networks, GD dynamics quickly converge to (approximately) the min-norm interpolating solution with respect to a certain kernel.

f(x) = (1/√m) Σ_{i=1}^m a_i σ(⟨w_i, x⟩)

For square loss and ReLU σ,

∂L̂/∂w_i = (1/√m) (1/n) Σ_{j=1}^n ( f(x_j) − y_j ) a_i σ′(⟨w_i, x_j⟩) x_j

GD in continuous time:

(d/dt) w_i(t) = −∂L̂/∂w_i

Thus

(d/dt) f(x) = Σ_{i=1}^m ⟨ ∂f(x)/∂w_i , (d/dt) w_i ⟩
           = −(1/n) Σ_{j=1}^n ( f(x_j) − y_j ) · (1/m) Σ_{i=1}^m a_i^2 σ′(⟨w_i, x⟩) σ′(⟨w_i, x_j⟩) ⟨x, x_j⟩
           → −(1/n) Σ_{j=1}^n ( f(x_j) − y_j ) K^∞(x, x_j),

where the average over i converges to K^∞(x, x_j) as m → ∞.

166 / 174 Connection to neural networks

(Xie, Liang, Song, ’16), (Jacot, Gabriel, Hongler ’18), (Du, Zhai, Poczos, Singh ’18), etc.

K^∞(x, x_j) = E[ ⟨x, x_j⟩ · I{ ⟨w, x⟩ ≥ 0, ⟨w, x_j⟩ ≥ 0 } ]

See Jason’s talk.
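A minimal Monte Carlo sketch of this limiting kernel (illustrative; it assumes w ~ N(0, I), a common but here unstated choice of initialization, and the printed comparison uses the standard arc-cosine identity for Gaussian w):

import numpy as np

def ntk_limit_kernel(x, xj, num_samples=200_000, seed=0):
    # Estimate K_inf(x, xj) = E[ <x, xj> 1{<w, x> >= 0, <w, xj> >= 0} ]
    # by sampling w ~ N(0, I)  (assumed Gaussian initialization)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_samples, x.shape[0]))
    p_both = ((W @ x >= 0) & (W @ xj >= 0)).mean()
    return np.dot(x, xj) * p_both

x = np.array([1.0, 0.0])
xj = np.array([np.cos(0.5), np.sin(0.5)])
# comparison: <x, xj> * (pi - angle) / (2 pi) for Gaussian w
print(ntk_limit_kernel(x, xj), np.dot(x, xj) * (np.pi - 0.5) / (2 * np.pi))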

167 / 174 Linear Regression (B., Long, Lugosi, Tsigler, 2019)

I (x, y) Gaussian, mean zero.
I Define:

Σ := E xxᵀ = Σ_i λ_i v_i v_iᵀ   (assume λ_1 ≥ λ_2 ≥ ···),

θ* := argmin_θ E(y − xᵀθ)^2,

σ^2 := E(y − xᵀθ*)^2.

I Estimator θ̂ = (XᵀX)† Xᵀ y, which solves

min_θ ‖θ‖_2   s.t.  ‖Xθ − y‖_2 = min_β ‖Xβ − y‖_2 = 0

(a numerical sketch follows after this list).

I When can the label noise be hidden in θˆ without hurting predictive accuracy?
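A minimal numpy sketch of the minimum-norm estimator from the second bullet (illustrative data; in the overparameterized regime p > n below, the fit drives the training residual to zero even though the labels are noisy):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                               # overparameterized: p > n
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:5] = 1.0
y = X @ theta_star + 0.5 * rng.standard_normal(n)   # noisy labels

theta_hat = np.linalg.pinv(X) @ y            # (X^T X)^dagger X^T y, min-norm solution
print(np.linalg.norm(X @ theta_hat - y))     # ~0: interpolates the noisy data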

168 / 174 Linear Regression

Excess prediction error:

L(θ̂) − L* := E_{(x,y)} [ (y − xᵀθ̂)^2 − (y − xᵀθ*)^2 ] = (θ̂ − θ*)ᵀ Σ (θ̂ − θ*).

Σ determines how estimation error in different parameter directions affects prediction accuracy.

169 / 174 Linear Regression

Theorem.

For universal constants b, c, and any θ*, σ^2, Σ, if λ_n > 0 then

(σ^2/c) · min{ k*/n + n/R_{k*}(Σ), 1 } ≤ E L(θ̂) − L* ≤ c ( ‖θ*‖^2 √(tr(Σ)/n) + σ^2 ( k*/n + n/R_{k*}(Σ) ) ),

where k* = min{ k ≥ 0 : r_k(Σ) ≥ bn }.

170 / 174 Notions of Effective Rank Definition.

Recall that λ_1 ≥ λ_2 ≥ ··· are the eigenvalues of Σ. Define the effective ranks

r_k(Σ) = ( Σ_{i>k} λ_i ) / λ_{k+1},    R_k(Σ) = ( Σ_{i>k} λ_i )^2 / Σ_{i>k} λ_i^2.
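A minimal numpy sketch of these quantities and of k* from the theorem above (illustrative; the universal constant b is unspecified, so b = 1 is used just to make the code run):

import numpy as np

def effective_ranks(eigs, k):
    # r_k = (sum_{i>k} lambda_i) / lambda_{k+1}
    # R_k = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    tail = eigs[k:]                      # lambda_{k+1}, lambda_{k+2}, ...
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

def k_star(eigs, n, b=1.0):
    # smallest k with r_k(Sigma) >= b * n (None if no such k exists)
    for k in range(len(eigs)):
        if effective_ranks(eigs, k)[0] >= b * n:
            return k
    return None

eigs = 1.0 / np.arange(1, 2001) ** 2     # example spectrum lambda_i = i^{-2}
print(effective_ranks(eigs, 0), k_star(eigs, n=100))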

Example If rank(Σ) = p, we can write

r_0(Σ) = rank(Σ) · s(Σ),   R_0(Σ) = rank(Σ) · S(Σ),

with  s(Σ) = ( (1/p) Σ_{i=1}^p λ_i ) / λ_1,    S(Σ) = ( (1/p) Σ_{i=1}^p λ_i )^2 / ( (1/p) Σ_{i=1}^p λ_i^2 ).

Both s and S lie between 1/p (λ_2 ≈ 0) and 1 (λ_i all equal).

171 / 174 Benign Overfitting in Linear Regression

Intuition

I To avoid harming prediction accuracy, the noise energy must be distributed across many unimportant directions:
I There should be many non-zero eigenvalues.
I They should have a small sum compared to n.
I There should be many eigenvalues no larger than λ_{k*} (the number depends on how asymmetric they are).

172 / 174 Overfitting?

I Fitting data “too” well, versus
I Bias too low, variance too high.

Key takeaway: we should not conflate these two.

173 / 174 Outline

Introduction

Uniform Laws of Large Numbers

Classification

Beyond ERM

Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression

174 / 174