Statistical (Supervised) Learning Theory
FoPPS Logic and Learning School

Lady Margaret Hall, University of Oxford

Varun Kanade
July 1, 2018

Previous Mini-Course

Introduction to Computational Learning Theory (PAC)

Learnability and the VC dimension

Sample Compression Schemes

Learning with Membership Queries

(Computational) Hardness of Learning

This Mini-Course

Statistical Learning Theory Framework

Capacity Measures: Rademacher Complexity

Uniform Convergence: Generalisation Bounds

Some Techniques

Algorithmic Stability to prove Generalisation

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

Statistical (Supervised) Learning Theory Framework

Input space: X (most often X ⊆ R^n)
Target values: Y
- Y = {−1, 1} : binary classification
- Y = R : regression

We consider data to be generated from a joint distribution D over X × Y

Sometimes convenient to factorise: D(x, y) = D(x)D(y|x)

Make no assumptions about a specific functional relationship between x and y, a.k.a. the agnostic setting [10,13,16]

Statistical (Supervised) Learning Theory Framework

Input space: X ; target values: Y
Arbitrary data distribution D over X × Y (agnostic setting)

How can we fit a function to the data?
- Classical approach to function approximation: polynomials, trigonometric functions, universality theorems
- These suffer from the curse of dimensionality

Finding any function that fits the observed data may perform arbitrarily badly on unseen points, leading to overfitting

We will focus on fitting functions from a class of functions whose "complexity" or "capacity" is bounded

Aside: Connections to classical Statistics/ML

Attempt to explicitly model the distributions D(x) and/or D(y|x)

Generative Models: Model the full joint distribution D(x, y)
- E.g. Gaussian Discriminant Analysis, Naïve Bayes

Discriminative Models: Model only the conditional distribution D(y|x)
- Linear Regression: y | w₀, w, x ∼ w₀ + w · x + N(0, σ²)
- Classification: y | w₀, w, x ∼ 2 · Bernoulli(sigmoid(w₀ + w · x)) − 1

The (basic) PAC model in CLT assumes a functional form, y = c(x), for some concept c in a class C, and the VC dimension of C controls learnability.

Statistical (Supervised) Learning Theory Framework

X instance space; Y target values
Distribution D over X × Y

Let F ⊆ Y^X be a class of functions. A learning algorithm will output some function from the class F.

A cost function γ : Y × Y → R⁺
- E.g. Y = {−1, 1}, γ(y′, y) = I(y′ ≠ y)
- E.g. Y = R, γ(y′, y) = |y′ − y|^p for p ≥ 1

The loss for f ∈ F at point (x, y) is given by

$$\ell(f; x, y) = \gamma(f(x), y)$$

The risk functional R : F → R⁺ is given by:

$$R(f) = \mathbb{E}_{(x,y)\sim D}\big[\ell(f; x, y)\big] = \mathbb{E}_{(x,y)\sim D}\big[\gamma(f(x), y)\big]$$

Statistical (Supervised) Learning Theory Framework

The risk functional R : F → R⁺ is given by:

$$R(f) = \mathbb{E}_{(x,y)\sim D}\big[\ell(f; x, y)\big] = \mathbb{E}_{(x,y)\sim D}\big[\gamma(f(x), y)\big]$$

Would like to find f ∈ F that "minimises" the risk R

Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest

Only have access to D through a sample of size m drawn from D, called the training data

Throughout the talk, S = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)} is a sample of size m drawn i.i.d. (independent and identically distributed) from D

A learning algorithm (possibly randomised) is a map A from 2^{X×Y} to F

Goal: To guarantee with high probability (over S) that if f̂ = A(S), then for some small ε > 0:

$$R(\hat{f}) \leq \inf_{f \in \mathcal{F}} R(f) + \epsilon$$

Empirical Risk Minimisation

Training sample S = {(x₁, y₁), ..., (x_m, y_m)}
Learning algorithm: A maps 2^{X×Y} to F

Define the empirical risk on a sample S as:

$$\hat{R}_S(f) = \frac{1}{m}\sum_{i=1}^{m} \gamma(f(x_i), y_i)$$

The ERM (Empirical Risk Minimisation) principle suggests that we find f ∈ F that minimises the empirical risk
- We focus mostly on statistical questions
- Computationally, ERM is intractable for most problems of interest

E.g. Find a linear separator that minimises the number of misclassifications
Tractable if there exists a separator with no error!

Empirical Risk Minimisation

ERM Principle: Learning algorithm should pick f ∈ F that minimises the empirical risk

$$\hat{R}_S(f) = \frac{1}{m}\sum_{i=1}^{m} \gamma(f(x_i), y_i)$$

- How do we guarantee that the (actual) risk is close to optimal?
- Focus on classification, i.e. F ⊆ {−1, 1}^X, and suppose VC(F) = d < ∞
- Cost function is γ(y′, y) = I(y′ ≠ y)

Theorem (Vapnik, Chervonenkis) [14,16]
Let F ⊆ {−1, 1}^X with VC(F) = d < ∞. Let S ∼ D^m for some distribution D over X × {−1, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F,

$$R(f) \leq \hat{R}_S(f) + \sqrt{\frac{2d \log(em/d)}{m}} + O\!\left(\sqrt{\frac{\log(1/\delta)}{2m}}\right)$$
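
To make the ERM principle concrete, here is a minimal sketch (not from the slides; the threshold class and the toy data are invented for illustration) that minimises the empirical risk under the 0-1 cost over a finite class of one-dimensional threshold classifiers:

```python
import numpy as np

# A minimal ERM sketch (illustrative only; the threshold class and toy data
# are assumptions, not from the lecture). F is a finite class of threshold
# classifiers x -> sign(x - t); we pick the one with smallest empirical risk.

def empirical_risk(f, S, cost):
    # \hat{R}_S(f) = (1/m) * sum_i cost(f(x_i), y_i)
    return sum(cost(f(x), y) for x, y in S) / len(S)

def zero_one(y_pred, y):
    # gamma(y', y) = I(y' != y)
    return float(y_pred != y)

def erm(F, S, cost):
    # ERM principle: argmin over the (finite) class of the empirical risk
    return min(F, key=lambda f: empirical_risk(f, S, cost))

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=200)
flip = rng.random(200) < 0.1                       # 10% label noise
ys = np.where((xs > 0.3) ^ flip, 1, -1)
S = list(zip(xs, ys))

F = [lambda x, t=t: 1 if x > t else -1 for t in np.linspace(-1, 1, 41)]
f_hat = erm(F, S, zero_one)
print("empirical risk of ERM solution:", empirical_risk(f_hat, S, zero_one))
```
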

Empirical Risk Minimisation

Theorem (Vapnik, Chervonenkis) [14,16]
Let F ⊆ {−1, 1}^X with VC(F) = d < ∞. Let S ∼ D^m for some distribution D over X × {−1, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F,

$$R(f) \leq \hat{R}_S(f) + \sqrt{\frac{2d \log(em/d)}{m}} + O\!\left(\sqrt{\frac{\log(1/\delta)}{2m}}\right)$$

Suppose f* is the "minimiser" of the true risk R and f̂ is the minimiser of the empirical risk R̂_S

Then, we have,

$$R(\hat{f}) \;\leq\; \hat{R}_S(\hat{f}) + \epsilon/2 \;\leq\; \hat{R}_S(f^*) + \epsilon/2 \;\leq\; R(f^*) + \epsilon$$

using the theorem for the first inequality, the fact that f̂ minimises R̂_S for the second, and the theorem "flipped" (bounding R̂_S(f*) by R(f*) + ε/2) for the third,

where ε is chosen to be a suitable function of δ and m
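
To get a feel for the rates, here is a small numeric sketch (the parameter values and the constant hidden in the O(·) term are assumptions for illustration) evaluating the VC generalisation gap:

```python
import numpy as np

# Evaluate the VC generalisation gap from the theorem above; the constant in
# the O(.) term is taken to be 1, which is an illustrative assumption.
def vc_gap(m, d, delta):
    return (np.sqrt(2 * d * np.log(np.e * m / d) / m)
            + np.sqrt(np.log(1 / delta) / (2 * m)))

for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(vc_gap(m, d=10, delta=0.01), 3))
# The gap decays roughly like sqrt((d log m) / m).
```
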

Structural Risk Minimisation

$$R(f) \leq \hat{R}_S(f) + \sqrt{\frac{2d \log(em/d)}{m}} + O\!\left(\sqrt{\frac{\log(1/\delta)}{2m}}\right)$$

How should we pick the class of functions F?
- More "complex" F can achieve smaller empirical risk
- The difference between the true risk and the empirical risk (the generalisation error) will be higher for more "complex" F

Choose an infinite family of classes {F_d : d = 1, 2, ...} and find the minimiser:

$$\hat{f} = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_d,\; d \in \mathbb{N}} \; \hat{R}_S(f) + \kappa(d, m)$$

where κ(d, m) is a penalty term that depends on the sample size and the "complexity" or "capacity" measure

Related to the more commonly used approach in practice:

$$\hat{f} = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \; \hat{R}_S(f) + \lambda \cdot \mathrm{regulariser}(f)$$
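
As a concrete illustration, here is a minimal SRM-style sketch (the polynomial classes and the penalty κ(d, m) = √(d/m) are assumptions for illustration, not the lecture's choices): fit each degree d by least squares and pick the degree minimising the penalised empirical risk:

```python
import numpy as np

# SRM-style model selection sketch: F_d = polynomials of degree d, with an
# illustrative penalty kappa(d, m) = sqrt(d / m).
rng = np.random.default_rng(0)
m = 60
x = rng.uniform(-1, 1, m)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=m)

def penalised_risk(d):
    coeffs = np.polyfit(x, y, deg=d)              # ERM within F_d (least squares)
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return emp_risk + np.sqrt(d / m)              # empirical risk + kappa(d, m)

d_hat = min(range(1, 16), key=penalised_risk)
print("degree chosen by penalised ERM:", d_hat)
```
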

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

Linear Regression

Let K ⊆ R^n. Consider the family of linear functions F = {x ↦ w · x | w ∈ K}

Consider the squared loss as a cost function: γ(y′, y) = (y′ − y)²

Let D be a distribution over X × Y, and let g(x) = E[y | x]

For any h : X → R:

$$R(h) = \mathbb{E}_{(x,y)\sim D}\big[(h(x) - y)^2\big] = \mathbb{E}_{(x,y)\sim D}\big[(h(x) - g(x) + g(x) - y)^2\big]$$

$$= \mathbb{E}\big[(h(x) - g(x))^2\big] + \mathbb{E}\big[(g(x) - y)^2\big] + 2\,\underbrace{\mathbb{E}\big[(h(x) - g(x))(g(x) - y)\big]}_{=0 \text{ as } \mathbb{E}[y \mid x] = g(x)}$$

$$= \mathbb{E}\big[(h(x) - g(x))^2\big] + R(g)$$

If g ∈ F, we are in the so-called realisable setting
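
A quick Monte Carlo sanity check of this decomposition (an illustrative sketch; the distribution below, with g(x) = E[y|x] = 2x by construction, is invented):

```python
import numpy as np

# Monte Carlo check of R(h) = E[(h(x) - g(x))^2] + R(g) for an invented
# distribution where g(x) = E[y|x] = 2x is known by construction.
rng = np.random.default_rng(0)
N = 1_000_000
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)         # g(x) = 2x

h = lambda t: 1.5 * t                              # an arbitrary h
R_h = np.mean((h(x) - y) ** 2)
decomposed = np.mean((h(x) - 2 * x) ** 2) + np.mean((2 * x - y) ** 2)
print(R_h, decomposed)                             # agree up to Monte Carlo error
```
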

Aside: Maximum Likelihood Principle

Discriminative Setting: Model y | w, x ∼ w · x + N(0, σ²)

We can define the likelihood of observing the data under this model

$$p(y_1, \ldots, y_m \mid w, x_1, \ldots, x_m) = \frac{1}{(2\pi\sigma^2)^{m/2}} \prod_{i=1}^{m} \exp\!\left(-\frac{(y_i - w \cdot x_i)^2}{2\sigma^2}\right)$$

Looking at the log likelihood is slightly simpler

$$LL(y_1, \ldots, y_m \mid w, x_1, \ldots, x_m) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - w \cdot x_i)^2$$

Finding parameters w that maximise the (log) likelihood is the same as finding w that minimises the empirical risk with the squared error cost

The method of least squares goes back at least 200 years, to Gauss and Laplace
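
A small numeric check of this equivalence (a sketch; the data-generating parameters are invented): the least-squares solution also maximises the Gaussian log-likelihood.

```python
import numpy as np

# Check: the least-squares / ERM solution maximises the Gaussian log-likelihood.
# The data-generating w and sigma below are illustrative assumptions.
rng = np.random.default_rng(1)
m, n = 500, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=m)

def log_likelihood(w, sigma2=0.09):
    resid = y - X @ w
    return -0.5 * m * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)        # minimises the squared error
w_alt = w_ls + 0.05                                  # any perturbation
print(log_likelihood(w_ls) > log_likelihood(w_alt))  # True: LS maximises LL
```
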

Linear Regression

Let K ⊆ R^n, e.g. K = {w : ‖w‖₂ ≤ W}. Consider the family of linear functions F = {x ↦ w · x | w ∈ K}

ERM for Linear Regression:

$$\hat{w} = \mathop{\mathrm{argmin}}_{w \in K} \frac{1}{m}\sum_{i=1}^{m} (w \cdot x_i - y_i)^2$$

How do we argue about the generalisation properties of this algorithm?

Use a different capacity measure
- Rademacher complexity, VC dimension, pseudo-dimension, covering numbers, fat-shattering dimension, ...

We will require some boundedness assumptions on the data and the linear functions

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

Empirical Rademacher Complexity

Let G be a class of functions from Z → [a, b] ⊂ R

Let S = {z₁, ..., z_m} ⊆ Z be a fixed sample of size m

Then the Empirical Rademacher Complexity of G with respect to S is defined as:

$$\hat{\mathfrak{R}}_S(\mathcal{G}) = \mathbb{E}_{\sigma \sim_u \{-1,1\}^m}\left[\sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i g(z_i)\right]$$

where σ = (σ₁, ..., σ_m) ∼_u {−1, 1}^m indicates that each σᵢ is a random variable taking the values {−1, 1} with equal probability. These are called Rademacher random variables
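
When G is finite, the definition can be estimated directly by Monte Carlo; a minimal sketch (the tiny function class and the sample are invented for illustration):

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite
# class G on a fixed sample S (both invented for illustration).
rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=20)                    # fixed sample z_1..z_m
G = [np.sin, np.cos, np.tanh, lambda z: z ** 2]    # finite class Z -> [a, b]

vals = np.array([[g(z) for z in S] for g in G])    # |G| x m matrix of g(z_i)
m = len(S)

def rademacher_estimate(n_draws=10_000):
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher vectors
    # For each sigma: sup over g of (1/m) sum_i sigma_i g(z_i); then average.
    return np.mean(np.max(sigma @ vals.T, axis=1)) / m

print("hat R_S(G) ~", rademacher_estimate())
```
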

Rademacher Complexity

Empirical Rademacher Complexity:

$$\hat{\mathfrak{R}}_S(\mathcal{G}) = \mathbb{E}_{\sigma \sim_u \{-1,1\}^m}\left[\sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i g(z_i)\right]$$

Rademacher Complexity
Let D be a distribution over the set Z. Let G be a class of functions from Z → [a, b] ⊂ R. For any m ≥ 1, the Rademacher complexity of G is the expectation of the empirical Rademacher complexity of G over a sample drawn from D^m:

$$\mathfrak{R}_m(\mathcal{G}) = \mathbb{E}_{S \sim D^m}\big[\hat{\mathfrak{R}}_S(\mathcal{G})\big]$$

Rademacher Complexity

Theorem [2,14]
Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from D^m. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G:

$$\mathbb{E}_{z \sim D}\big[g(z)\big] \leq \frac{1}{m}\sum_{i=1}^{m} g(z_i) + 2\mathfrak{R}_m(\mathcal{G}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right)$$

Henceforth, for S = {z₁, ..., z_m}, we will use the notation:

$$\hat{\mathbb{E}}_{z \sim_u S}\big[g(z)\big] = \frac{1}{m}\sum_{i=1}^{m} g(z_i)$$

We will see a full proof of this theorem. First, let's apply this to linear regression.

Generalisation Bounds for Linear Regression

Instance space X ⊆ R^n, with ‖x‖₂ ≤ X for all x ∈ X
Target values Y = [−M, M]

Let F = {x ↦ w · x | ‖w‖₂ ≤ W}

Let S = {x₁, ..., x_m}. Then we have:

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{\|w\|_2 \leq W} \frac{1}{m}\sum_{i=1}^{m} \sigma_i (w \cdot x_i)\right] = \mathbb{E}_{\sigma}\left[\sup_{\|w\|_2 \leq W} \frac{1}{m}\, w \cdot \sum_{i=1}^{m} \sigma_i x_i\right] = \frac{W}{m}\,\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{m} \sigma_i x_i\right\|_2\right]$$

The last step follows from (the equality condition of) the Cauchy-Schwarz Inequality

Continuing, by Jensen's inequality:

$$\frac{W}{m}\,\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{m} \sigma_i x_i\right\|_2\right] \leq \frac{W}{m}\left(\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{m} \sigma_i x_i\right\|_2^2\right]\right)^{1/2} = \frac{W}{m}\left(\mathbb{E}_{\sigma}\left[\sum_{i=1}^{m} \sigma_i^2 \|x_i\|_2^2 + 2\sum_{i<j} \sigma_i \sigma_j\, x_i \cdot x_j\right]\right)^{1/2}$$

$$= \frac{W}{m}\left(\sum_{i=1}^{m} \|x_i\|_2^2\right)^{1/2} \leq \frac{WX}{\sqrt{m}}$$

(using σᵢ² = 1 and E[σᵢσⱼ] = 0 for i ≠ j)
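
This bound is easy to sanity-check numerically; a sketch (the sample and the constants are invented) comparing a Monte Carlo estimate of (W/m) E‖Σᵢ σᵢxᵢ‖₂ against WX/√m:

```python
import numpy as np

# Monte Carlo check that (W/m) E||sum_i sigma_i x_i||_2 is at most W*X/sqrt(m).
# The sample below is an illustrative assumption.
rng = np.random.default_rng(0)
m, n, W = 200, 5, 2.0
x = rng.normal(size=(m, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # force ||x_i||_2 = X = 1

sigma = rng.choice([-1.0, 1.0], size=(20_000, m))  # Rademacher draws
estimate = W * np.mean(np.linalg.norm(sigma @ x, axis=1)) / m
print(estimate, "<=", W * 1.0 / np.sqrt(m))        # bound holds
```
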

Talagrand's Lemma

We computed the Rademacher complexity of linear functions, but we'd like to apply the "main theorem" to the true risk

For this we need to look at the composition of the linear function and the loss/cost function

Let G be a class of functions from Z → [a, b] and let φ : [a, b] → R be L-Lipschitz

Then Talagrand's Lemma tells us that:

$$\hat{\mathfrak{R}}_S(\varphi \circ \mathcal{G}) \leq L \cdot \hat{\mathfrak{R}}_S(\mathcal{G})$$

$$\mathfrak{R}_m(\varphi \circ \mathcal{G}) \leq L \cdot \mathfrak{R}_m(\mathcal{G})$$

Generalisation of Linear Regression

Instance space X ⊆ R^n, with ‖x‖₂ ≤ X for all x ∈ X
Target values Y = [−M, M]

Let F = {x ↦ w · x | ‖w‖₂ ≤ W}

Consider the following:
H = {(x, y) ↦ (f(x) − y)² | x ∈ X, y ∈ Y, f ∈ F}
φ : [−(M + WX), (M + WX)] → R, φ(z) = z²

φ is 2(M + WX)-Lipschitz on its domain

$$\hat{\mathfrak{R}}_S([-M, M]) \leq M/\sqrt{m}$$

Using $\hat{\mathfrak{R}}_S(\mathcal{F} + \mathcal{G}) \leq \hat{\mathfrak{R}}_S(\mathcal{F}) + \hat{\mathfrak{R}}_S(\mathcal{G})$ and Talagrand's Lemma, we get

$$\hat{\mathfrak{R}}_S(\mathcal{H}) \leq \frac{2(M + WX)^2}{\sqrt{m}}$$

Note that $\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{S \sim D^m}\big[\hat{\mathfrak{R}}_S(\mathcal{H})\big] \leq \sup_{S, |S|=m} \hat{\mathfrak{R}}_S(\mathcal{H})$
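
Plugging this into the main theorem makes the sample-size requirement concrete; a small sketch (the constants M, W, X are invented) of the Rademacher term 2R̂_S(H) ≤ 4(M + WX)²/√m:

```python
import numpy as np

# Size of the Rademacher term 2 * hat{R}_S(H) <= 4 (M + W X)^2 / sqrt(m) in the
# generalisation bound for squared loss; M, W, X below are illustrative.
M, W, X = 1.0, 2.0, 1.0

def rademacher_term(m):
    return 4 * (M + W * X) ** 2 / np.sqrt(m)

for m in [100, 10_000, 1_000_000]:
    print(m, rademacher_term(m))
# Driving this term below eps needs m >= (4 (M + W X)^2 / eps)^2 samples.
```
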

Aside: Algorithms for the Linear Regression Model

ERM for Linear Regression:

$$J(w) = \frac{1}{m}\sum_{i=1}^{m} (w \cdot x_i - y_i)^2; \qquad \hat{w} = \mathop{\mathrm{argmin}}_{w,\, \|w\|_2 \leq W} J(w)$$

How can we solve this optimisation problem? (Without the norm constraint there is a closed-form solution.)

This convex optimisation problem can be solved using projected (stochastic) gradient descent

Guaranteed to find a near-optimal solution in polynomial time

Aside: Gradient Descent

Algorithm 1 Projected Gradient Descent
  Inputs: η, T
  Pick w₁ ∈ K
  for t = 1, ..., T do
    w′_{t+1} = w_t − η∇J(w_t)
    w_{t+1} = Π_K(w′_{t+1})
  end for
  Output: w̄ = (1/T) Σ_{t=1}^{T} w_t

Recall in our case K = {w : ‖w‖₂ ≤ W}; Π_K(·) is the projection operator onto K

Informal Theorem [5]
Suppose sup_{w,w′∈K} ‖w − w′‖₂ ≤ R and sup_{w∈K} ‖∇J(w)‖₂ ≤ L. Then with η = R/(L√T),

$$J(\bar{w}) \leq \min_{w \in K} J(w) + \frac{RL}{\sqrt{T}}$$
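
The algorithm above translates directly into code; a minimal sketch for the constrained least-squares problem (the data and the fixed step size are illustrative assumptions):

```python
import numpy as np

# Projected gradient descent for J(w) = (1/m)||Xw - y||^2 over K = {||w||_2 <= W}.
def project_l2_ball(w, W):
    norm = np.linalg.norm(w)
    return w if norm <= W else w * (W / norm)

def projected_gd(X, y, W, eta, T):
    m, n = X.shape
    w = np.zeros(n)                                # w_1 in K
    avg = np.zeros(n)
    for _ in range(T):
        grad = 2 * X.T @ (X @ w - y) / m           # gradient of J at w_t
        w = project_l2_ball(w - eta * grad, W)     # gradient step, then project
        avg += w
    return avg / T                                 # averaged iterate w_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)
w_bar = projected_gd(X, y, W=3.0, eta=0.1, T=2000)
print("J(w_bar) =", np.mean((X @ w_bar - y) ** 2))
```
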

Aside: Generalised Linear Models

Can consider more general models called generalised linear models:

GLM = {x ↦ u(w · x) | u bounded, increasing & 1-Lipschitz, ‖w‖₂ ≤ W}

We can consider the ERM problem:

$$J(w) = \frac{1}{m}\sum_{i=1}^{m} (u(w \cdot x_i) - y_i)^2; \qquad \hat{w} = \mathop{\mathrm{argmin}}_{w,\, \|w\|_2 \leq W} J(w)$$

Can bound the Rademacher complexity easily using the boundedness and Lipschitz property of u

However, the optimisation problem is now non-convex!

Can consider a different cost/loss function:

$$\gamma(y', y) = \int_0^{u^{-1}(y')} (u(z) - y)\,dz; \qquad \ell(w; x, y) = \int_0^{w \cdot x} (u(z) - y)\,dz$$

The resulting objective function is convex in w:

$$\tilde{J}(w) = \frac{1}{m}\sum_{i=1}^{m} \ell(w; x_i, y_i)$$

In the realisable setting, i.e. E[y | x] = u(w · x), the global minimisers of J(w) (squared error) and J̃(w) coincide, yielding computationally and statistically efficient algorithms. [12]
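
Conveniently, ∇_w ℓ(w; x, y) = (u(w · x) − y)x by the fundamental theorem of calculus, so gradient descent on J̃ never evaluates the integral explicitly. A sketch for the sigmoid u (the data, step size, and omission of the norm constraint are illustrative assumptions):

```python
import numpy as np

# Gradient descent on the convex surrogate J~(w): since
# l(w; x, y) = integral_0^{w.x} (u(z) - y) dz, its gradient is (u(w.x) - y) x.
# u = sigmoid; data invented; norm constraint omitted for simplicity.
u = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
w_star = np.array([1.0, -1.0, 0.5, 0.0])
y = u(X @ w_star)                                  # realisable: E[y|x] = u(w*.x)

w = np.zeros(4)
for _ in range(3000):
    w -= 0.5 * X.T @ (u(X @ w) - y) / len(y)       # gradient of J~ at w
print("recovered w:", np.round(w, 2))              # close to w_star
```
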

Rademacher Complexity : Main Result

Theorem [2,14]. Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from D^m. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G:

E_{z∼D}[g(z)] ≤ \frac{1}{m} \sum_{i=1}^m g(z_i) + 2R_m(G) + O\left( \sqrt{\frac{\log(1/δ)}{m}} \right).

We will make use of a concentration of measure inequality, called McDiarmid's inequality.

McDiarmid's Inequality. Let Z be a set and let f : Z^m → R be a function such that, for all i, there exists c_i > 0 such that for all z_1, ..., z_m, z_i',

|f(z_1, ..., z_i, ..., z_m) − f(z_1, ..., z_i', ..., z_m)| ≤ c_i.

Let Z_1, ..., Z_m be i.i.d. random variables taking values in Z. Then for all ε > 0,

P\left[ f(Z_1, ..., Z_m) ≥ E[f(Z_1, ..., Z_m)] + ε \right] ≤ \exp\left( −\frac{2ε^2}{\sum_i c_i^2} \right)
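As a sanity check (my addition), a quick simulation of the simplest case: f is the sample mean of [0, 1]-valued variables, so changing one coordinate moves f by at most c_i = 1/m and the bound becomes exp(−2ε²m):

import numpy as np

rng = np.random.default_rng(0)
m, trials, eps = 100, 50_000, 0.1

# f(Z_1, ..., Z_m) is the sample mean of Uniform[0,1] variables, so c_i = 1/m
Z = rng.uniform(size=(trials, m))
deviations = Z.mean(axis=1) - 0.5            # E[f] = 1/2 exactly

empirical = np.mean(deviations >= eps)
bound = np.exp(-2 * eps**2 * m)              # exp(-2 eps^2 / sum_i c_i^2)
print(f"empirical tail {empirical:.2e} <= McDiarmid bound {bound:.2e}")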

Proof of Main Result

Let S = {z_1, ..., z_m}, S' = {z_1', ..., z_m'} ∼ D^m. For S ⊂ Z, define the function:

Φ(S) = \sup_{g ∈ G} \left( E_{z∼D}[g(z)] − \hat{E}_{z∼S}[g(z)] \right)

Let S^i = {z_1, ..., z_{i−1}, z_i', z_{i+1}, ..., z_m}, and consider:

Φ(S) − Φ(S^i) ≤ \frac{1}{m} |g(z_i) − g(z_i')| ≤ \frac{1}{m}

Applying McDiarmid's inequality with c_i = 1/m for all i,

P_{S∼D^m}\left[ Φ(S) ≥ E[Φ(S)] + ε \right] ≤ \exp(−2ε^2 m)

Alternatively, for any δ > 0, with probability at least 1 − δ,

Φ(S) ≤ E_{S∼D^m}[Φ(S)] + O\left( \sqrt{\frac{\log(1/δ)}{m}} \right)

Thus, for any δ > 0, with probability at least 1 − δ, for every g ∈ G,

E_{z∼D}[g(z)] ≤ \hat{E}_{z∼S}[g(z)] + E_{S∼D^m}[Φ(S)] + O\left( \sqrt{\frac{\log(1/δ)}{m}} \right)

Recall that \hat{E}_{z∼S}[g(z)] = \frac{1}{m} \sum_{i=1}^m g(z_i), and we want to show

E_{z∼D}[g(z)] ≤ \frac{1}{m} \sum_{i=1}^m g(z_i) + 2R_m(G) + O\left( \sqrt{\frac{\log(1/δ)}{m}} \right).

So all that remains to show is that E_{S∼D^m}[Φ(S)] ≤ 2R_m(G). Consider

E_{S∼D^m}[Φ(S)] = E_{S∼D^m}\left[ \sup_{g ∈ G} \left( E_{z∼D}[g(z)] − \hat{E}_{z∼S}[g(z)] \right) \right]

Introduce a fresh sample S' ∼ D^m:

E_{S∼D^m}[Φ(S)] = E_{S∼D^m}\left[ \sup_{g ∈ G} \left( E_{S'∼D^m}\left[ \hat{E}_{z∼S'}[g(z)] \right] − \hat{E}_{z∼S}[g(z)] \right) \right]

Pushing the sup inside the expectation,

E_{S∼D^m}[Φ(S)] ≤ E_{S∼D^m, S'∼D^m}\left[ \sup_{g ∈ G} \left( \hat{E}_{z∼S'}[g(z)] − \hat{E}_{z∼S}[g(z)] \right) \right]

S and S' are identically distributed, so their elements can be swapped by introducing Rademacher random variables σ_i ∈ {−1, 1}:

E_{S∼D^m}[Φ(S)] ≤ E_{S,S'∼D^m, σ}\left[ \sup_{g ∈ G} \frac{1}{m} \sum_i σ_i (g(z_i') − g(z_i)) \right] ≤ 2 E_{S∼D^m, σ}\left[ \sup_{g ∈ G} \frac{1}{m} \sum_i σ_i g(z_i) \right] = 2R_m(G)
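To make the quantity concrete (my addition, with illustrative names): for the linear class H = {x ↦ w · x : \|w\|_2 ≤ W}, the inner supremum has the closed form \sup_{\|w\|_2 ≤ W} \frac{1}{m}\sum_i σ_i (w · x_i) = \frac{W}{m}\|\sum_i σ_i x_i\|_2 by Cauchy-Schwarz, so the empirical Rademacher complexity can be estimated by Monte Carlo over σ:

import numpy as np

def empirical_rademacher_linear(X, W, num_draws=10_000, seed=0):
    # Monte Carlo estimate of R_hat(H) for H = {x -> w.x : ||w||_2 <= W}.
    # For fixed signs sigma, the sup equals (W/m) * ||sum_i sigma_i x_i||_2.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, m))
    sups = W / m * np.linalg.norm(sigma @ X, axis=1)
    return sups.mean()

X = np.random.default_rng(1).normal(size=(200, 5))
print(empirical_rademacher_linear(X, W=1.0))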

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

Support Vector Machines: Binary Classification

Goal: Find a linear separator

Data is linearly separable if there exists a linear separator that classifies all points correctly

Which separator should be picked?

Support Vector Machines: Maximum Margin Principle

Maximise the distance of the closest point from the decision boundary. The points that are closest to the decision boundary are the support vectors.

Support Vector Machines : Geometric View

Given a hyperplane H ≡ w · x + w_0 = 0 and a point x ∈ R^n, how far is x from H?

The distance of the point x from H is given by

\frac{|w · x + w_0|}{\|w\|_2}

All points on one side of the hyperplane (labelled y = +1) satisfy

w · x + w_0 ≥ 0

and points on the other side (labelled y = −1) satisfy

w · x + w_0 < 0
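A one-line numerical check of the distance formula (my addition; the helper name is illustrative):

import numpy as np

def distance_to_hyperplane(x, w, w0):
    # |w.x + w0| / ||w||_2
    return abs(w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, w0))  # |3 + 4 - 5| / 5 = 0.4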

SVM Formulation : Separable Case

minimise: \frac{1}{2}\|w\|_2^2
subject to: y_i(w · x_i + w_0) ≥ 1, for i = 1, ..., m

Here y_i ∈ {−1, 1}.

If the data is separable, then we find a classifier with no classification error on the training set. The margin of the classifier is 1/\|w^*\|_2, where w^* is the optimal solution. This is a convex quadratic program and hence can be solved efficiently.

SVM Formulation : The Dual

Primal:
minimise: \frac{1}{2}\|w\|_2^2
subject to: y_i(w · x_i + w_0) − 1 ≥ 0, for i = 1, ..., m

Dual:
maximise: \sum_{i=1}^m α_i − \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m α_i α_j y_i y_j \, x_i · x_j
subject to: \sum_{i=1}^m α_i y_i = 0 and 0 ≤ α_i, for i = 1, ..., m

Here y_i ∈ {−1, 1}.

Lagrange Function

Λ(w, w_0; α) = \frac{1}{2}\|w\|_2^2 − \sum_{i=1}^m α_i (y_i(w · x_i + w_0) − 1)

Complementary Slackness

α_i (y_i(w · x_i + w_0) − 1) = 0,  i = 1, ..., m

SVM Formulation : Non-Separable Case

minimise: \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m ζ_i
subject to: y_i(w · x_i + w_0) ≥ 1 − ζ_i and ζ_i ≥ 0, for i = 1, ..., m

Here y_i ∈ {−1, 1}.

SVM Formulation : Loss Function

minimise: \underbrace{\tfrac{1}{2}\|w\|_2^2}_{\text{Regulariser}} + \underbrace{C \sum_{i=1}^m ζ_i}_{\text{Loss Function}}
subject to: y_i(w · x_i + w_0) ≥ 1 − ζ_i and ζ_i ≥ 0, for i = 1, ..., m

Here y_i ∈ {−1, 1}.

[Plot: the hinge loss max{0, 1 − y(w · x + w_0)} as a function of the margin y(w · x + w_0).]

Note that at the optimal solution, ζ_i = max{0, 1 − y_i(w · x_i + w_0)}. Thus, the SVM can be viewed as minimising the hinge loss with regularisation.
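As an illustration of this view (my addition; a sketch with illustrative names, not the QP solver above), the regularised hinge loss can be minimised directly by subgradient descent:

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, epochs=200):
    # Minimise (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + w0))
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        active = margins < 1                      # points with non-zero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_w0 = -C * y[active].sum()
        w, w0 = w - lr * grad_w, w0 - lr * grad_w0
    return w, w0

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, w0 = svm_subgradient_descent(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + w0) == y))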

SVM : Deriving the Dual

minimise: \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m ζ_i
subject to: y_i(w · x_i + w_0) − (1 − ζ_i) ≥ 0 and ζ_i ≥ 0, for i = 1, ..., m

Here y_i ∈ {−1, 1}.

Lagrange Function

Λ(w, w_0, ζ; α, µ) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m ζ_i − \sum_{i=1}^m α_i (y_i(w · x_i + w_0) − (1 − ζ_i)) − \sum_{i=1}^m µ_i ζ_i

SVM : Deriving the Dual

We take derivatives of the Lagrange function with respect to w_0, ζ_i and w:

\frac{∂Λ}{∂w_0} = −\sum_{i=1}^m α_i y_i

\frac{∂Λ}{∂ζ_i} = C − α_i − µ_i

∇_w Λ = w − \sum_{i=1}^m α_i y_i x_i

For the (KKT) dual feasibility constraints, we require α_i ≥ 0, µ_i ≥ 0.

SVM : Deriving the Dual

Setting the derivatives to 0, substituting the resulting expressions into Λ (and simplifying), we get a function g(α) and some constraints:

g(α) = \sum_{i=1}^m α_i − \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m α_i α_j y_i y_j \, x_i · x_j

Constraints: 0 ≤ α_i ≤ C for i = 1, ..., m, and \sum_{i=1}^m α_i y_i = 0.

Finding critical points of Λ satisfying the KKT conditions corresponds to finding the maximum of g(α) subject to the above constraints.

SVM: Primal and Dual Formulations

Primal Form:
minimise: \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m ζ_i
subject to: y_i(w · x_i + w_0) ≥ 1 − ζ_i and ζ_i ≥ 0, for i = 1, ..., m

Dual Form:
maximise: \sum_{i=1}^m α_i − \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m α_i α_j y_i y_j \, x_i · x_j
subject to: \sum_{i=1}^m α_i y_i = 0 and 0 ≤ α_i ≤ C, for i = 1, ..., m

KKT Complementary Slackness Conditions

For all i,  α_i (y_i(w · x_i + w_0) − (1 − ζ_i)) = 0

If α_i > 0, then y_i(w · x_i + w_0) = 1 − ζ_i.

Recall the form of the solution: w = \sum_{i=1}^m α_i y_i x_i

Thus, only those datapoints x_i for which α_i > 0 determine the solution. This is why they are called support vectors.
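As a concrete illustration (my addition), scikit-learn's SVC solves the dual above; its dual_coef_ attribute stores α_i y_i for the support vectors only, so w = \sum_i α_i y_i x_i can be recovered directly:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Recover w = sum_i alpha_i y_i x_i from the dual solution:
w = clf.dual_coef_ @ clf.support_vectors_
print("w from dual:", w, " w from sklearn:", clf.coef_)
print("number of support vectors:", len(clf.support_))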

Support Vectors

[Figure: a maximum-margin separator with the support vectors highlighted on the margin.]

Generalisation Bounds Based on Margin

Suppose we solve the SVM objective by constraining w to be in the set {w : \|w\|_2 ≤ W}.

Consider the cost function γ_ρ : R × {−1, 1} → [0, 1] defined as γ_ρ(y', y) = ϕ_ρ(yy'), where ϕ_ρ : R → [0, 1] is defined as:

ϕ_ρ(z) = 0 if ρ ≤ z;  1 − z/ρ if 0 ≤ z ≤ ρ;  1 if z ≤ 0

[Plot: ϕ_ρ(z) as a function of z, equal to 1 for z ≤ 0 and ramping linearly down to 0 on [0, ρ].]

Let H = {x ↦ w · x | \|w\|_2 ≤ W} and let \|x\|_2 ≤ X for all x ∈ X. As ϕ_ρ is 1/ρ-Lipschitz, by Talagrand's Lemma we have

\hat{R}(ϕ_ρ ◦ H) ≤ \frac{WX}{ρ\sqrt{m}}

Let γ(y', y) = I(sign(y') ≠ y) (zero-one loss) and γ_ρ(y', y) = ϕ_ρ(y'y). Observe that γ(y', y) ≤ γ_ρ(y', y).

Let R(h_w) = E_{(x,y)∼D}[γ(sign(w · x), y)] and let R_ρ(h_w) = E_{(x,y)∼D}[γ_ρ(w · x, y)]. Let \hat{R} and \hat{R}_ρ denote the corresponding empirical risks.

Then, we have

R(h) ≤ R_ρ(h) ≤ \hat{R}_ρ(h) + 2\hat{R}(ϕ_ρ ◦ H) + O\left( \sqrt{\frac{\log(1/δ)}{m}} \right)

As \hat{R}(ϕ_ρ ◦ H) = O(XW/(ρ\sqrt{m})), a sample size of m = O(W^2 X^2/(ρε)^2) is sufficient to get ε excess risk (over \hat{R}_ρ(h)).

Note that solving the SVM objective is not guaranteed to give the h that has the smallest \hat{R}(h) (the problem of minimising disagreements with a linear separator is NP-hard).
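A quick numerical reading of this bound (my addition): solving WX/(ρ\sqrt{m}) ≤ ε for m gives the stated sample size.

import math

def svm_margin_sample_size(W, X, rho, eps):
    # Smallest m with W*X / (rho * sqrt(m)) <= eps, i.e. m >= (W*X / (rho*eps))^2
    return math.ceil((W * X / (rho * eps)) ** 2)

print(svm_margin_sample_size(W=1.0, X=1.0, rho=0.1, eps=0.05))  # 40000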

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

Gram Matrix

If we put the inputs in a matrix X, where the i-th row of X is x_i^T, then

K = XX^T = \begin{pmatrix} x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_m \\ x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_m \\ \vdots & \vdots & \ddots & \vdots \\ x_m^T x_1 & x_m^T x_2 & \cdots & x_m^T x_m \end{pmatrix}

The matrix K is positive semi-definite.

If we perform a basis expansion φ : R^n → R^N, then we replace the entries by φ(x_i)^T φ(x_j).

We only need the ability to compute inner products to use the (dual version of) SVM.
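A small sanity check (my addition): K = XX^T is positive semi-definite since v^T K v = \|X^T v\|^2 ≥ 0, and numerically its eigenvalues are non-negative up to round-off:

import numpy as np

X = np.random.default_rng(0).normal(size=(6, 3))
K = X @ X.T                       # Gram matrix of pairwise inner products
eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so use eigvalsh
print(eigvals >= -1e-10)          # all True: K is positive semi-definite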

Kernel Trick

Suppose x ∈ R^2 and we perform a degree-2 polynomial expansion; we could use the map:

ψ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T

But we could also use the map:

φ(x) = [1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2]^T

If x = [x_1, x_2]^T and x' = [x_1', x_2']^T, then

φ(x)^T φ(x') = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 (x_1')^2 + x_2^2 (x_2')^2 + 2 x_1 x_2 x_1' x_2' = (1 + x_1 x_1' + x_2 x_2')^2 = (1 + x · x')^2

Instead of spending ≈ n^d time to compute inner products after a degree-d polynomial basis expansion, we only need O(n) time.
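A numerical check of this identity (my addition; φ exactly as defined above):

import numpy as np

def phi(x):
    # Feature map whose inner product equals (1 + x.x')^2 in two dimensions
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(xp), (1 + x @ xp) ** 2)  # the two values agree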

Kernel Trick

We can use any symmetric positive semi-definite kernel (Mercer kernel):

K = \begin{pmatrix} κ(x_1, x_1) & κ(x_1, x_2) & \cdots & κ(x_1, x_m) \\ κ(x_2, x_1) & κ(x_2, x_2) & \cdots & κ(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ κ(x_m, x_1) & κ(x_m, x_2) & \cdots & κ(x_m, x_m) \end{pmatrix}

Here κ(x, x') is some measure of similarity between x and x'.

The dual program becomes

maximise: \sum_{i=1}^m α_i − \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m α_i α_j y_i y_j K_{i,j}
subject to: 0 ≤ α_i ≤ C and \sum_{i=1}^m α_i y_i = 0

To make a prediction on a new point x_new, we only need to compute κ(x_i, x_new) for the support vectors x_i (those with α_i > 0).
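To make the prediction rule concrete (my addition; a sketch assuming scikit-learn's SVC, whose dual_coef_ stores α_i y_i for the support vectors), the decision value \sum_i α_i y_i κ(x_i, x_new) + w_0 can be reproduced by hand:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1])           # not linearly separable

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def rbf(a, b):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.3, -1.2])
# Sum over support vectors of (alpha_i y_i) kappa(x_i, x_new), plus the bias:
decision = clf.dual_coef_[0] @ rbf(clf.support_vectors_, x_new) + clf.intercept_[0]
print(decision, clf.decision_function([x_new])[0])  # the two values agree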

Polynomial Kernels

Rather than performing a basis expansion, we can use the kernel

κ(x, x') = (1 + x · x')^d

This gives all terms of degree up to d.

If we use κ(x, x') = (x · x')^d, we get only the degree-d terms.

Linear kernel: κ(x, x') = x · x'

All of these satisfy the Mercer (positive semi-definite) condition.

Gaussian or RBF Kernel

Radial Basis Function (RBF) or Gaussian kernel:

κ(x, x') = \exp\left( −\frac{\|x − x'\|^2}{2σ^2} \right)

σ^2 is known as the bandwidth.

We used this with γ = \frac{1}{2σ^2} when we studied kernel basis expansion for regression.

Can generalise to more general covariance matrices.

This results in a Mercer kernel.

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

Neural Networks : Unit

[Diagram: a single unit. Inputs x_1, x_2 and a constant 1 enter with weights w_1, w_2 and bias b; a linear function Σ is followed by a non-linearity, producing the output a(b + w · x).]

A unit in a neural network computes an affine function of its input, which is then composed with a non-linear activation function a.

For example, the activation function could be the logistic sigmoid

σ(z) = \frac{1}{1 + e^{−z}}
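A direct transcription of the unit (my addition; names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(x, w, b, a=sigmoid):
    # A single unit: an affine function of the input, composed with a non-linearity
    return a(b + w @ x)

print(unit(np.array([1.0, -2.0]), w=np.array([0.5, 0.3]), b=0.1))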

Feedforward Neural Networks

[Diagram: a fully connected network with Layer 1 (Input), Layers 2 and 3 (Hidden) and Layer 4 (Output); each consecutive pair of layers forms a fully connected layer.]

Neural Networks

Only consider fully-connected, feed-forward neural networks, with non-linear activation functions applied element-wise to units.

A layer l : R^{d_1} → R^{d_2} consists of an element-wise composition of a non-linear activation a (e.g. rectifier or logistic sigmoid) and an affine map:

l(z) = a(Wz + b)

An L-hidden-layer network represents a function f : R^n → R,

f(x) = w · l_L(l_{L−1}(··· (l_1(x)) ···)) + b

Typically, the output layer is simply an affine map of the penultimate layer (without any non-linear activation).
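A minimal sketch of the forward pass f(x) = w · l_L(··· l_1(x) ···) + b (my addition; names and the choice of the rectifier activation are illustrative):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers, w, b, a=relu):
    # layers is a list of (W, b) pairs; the output layer is affine: w.z + b
    z = x
    for W_l, b_l in layers:
        z = a(W_l @ z + b_l)      # l(z) = a(Wz + b), applied element-wise
    return w @ z + b

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(4, 4)), rng.normal(size=4))]   # L = 2 hidden layers
print(forward(rng.normal(size=3), layers, w=rng.normal(size=4), b=0.0))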

Capacity of Neural Networks

VC dimension of Neural Nets. Informally, if a is the sgn function and C is the class of all neural networks with at most ω parameters, then VC(C) ≤ 2ω \log_2(eω).

Rademacher Complexity of Neural Nets. Suppose F is the class of feed-forward neural nets with L − 1 hidden layers, with
I every row w of any weight matrix W in the net satisfying \|w\|_1 ≤ W,
I every bias vector b satisfying \|b\|_∞ ≤ B,
I the activations a being 1-Lipschitz,
I and furthermore, the inputs x ∈ X satisfying \|x\|_∞ ≤ 1.

Then,

\hat{R}_m(F) ≤ \frac{1}{\sqrt{m}} \left( (2W)^L + B \sum_{i=0}^{L−1} (2W)^i \right)

Exercise: Prove this using the fact that if \bar{G} is the function class consisting of all convex combinations of functions in G, then \hat{R}_m(\bar{G}) = \hat{R}_m(G). (Also prove the latter claim.)
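To get a feel for this bound (my addition), the snippet below evaluates \frac{1}{\sqrt{m}}((2W)^L + B \sum_{i=0}^{L−1}(2W)^i), which grows like (2W)^L with depth, so small per-row l1 norms matter:

import numpy as np

def nn_rademacher_bound(m, W, B, L):
    # (1/sqrt(m)) * ((2W)^L + B * sum_{i=0}^{L-1} (2W)^i)
    powers = (2 * W) ** np.arange(L)
    return ((2 * W) ** L + B * powers.sum()) / np.sqrt(m)

for L in (1, 2, 3, 4):
    print(L, nn_rademacher_bound(m=10_000, W=1.0, B=1.0, L=L))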

Neural Networks : Universality Results

(Simplified) Theorem (Cybenko) [6]. Let σ be the logistic sigmoid activation function. Then the set of functions of the form G(x) = \sum_{j=1}^N α_j σ(w_j · x + b_j) is dense in the set of continuous functions on [0, 1]^n.

Several other authors proved similar results at roughly the same time [1,8,11].

This doesn't give an explicit upper bound on the number of units required.

It is known that the number of units required can be exponential for arbitrary continuous functions.

These kinds of results don't inform us directly about the success of training algorithms or the possibility of generalisation.

Depth Separation Results

Universality results establish that neural nets with one hidden layer are universal approximators.

Establishing the benefits of depth (both for representation and learning) is an active area of research.

Eldan and Shamir [7] established the existence of a function that can be well approximated by a depth-3 (2 hidden layers) neural network using polynomially many units (in the dimension), but requires exponentially many units using a depth-2 network.

Telgarsky [15] established, for each k ∈ N, the existence of a function that can be well approximated by a depth-k^3 neural network using polynomially many units (in the dimension and k), but requires exponentially many units using a depth-k neural network.

Outline

Statistical (Supervised) Learning Theory Framework

Linear Regression

Rademacher Complexity

Support Vector Machines

Kernels

Neural Networks

Algorithmic Stability

So far we have seen uniform convergence bounds, i.e. bounds of the form that ``under suitable conditions'' with high probability, ∀f ∈ F,

R(f) ≤ \hat{R}_S(f) + ε

These results only depend on certain complexity/capacity measures of the class of functions F used by the learning algorithm.

Q. Can analysing learning algorithms directly yield a (possibly different/better) way to obtain bounds on the true risk?

Algorithmic Stability

Let S = {(x_1, y_1), ..., (x_m, y_m)} be a sample drawn from D over X × Y, and let S' be a sample that differs from S on exactly one point, say it has (x_m', y_m') instead of (x_m, y_m).

A (possibly randomised) learning algorithm A takes a sample S as input and outputs a function f_S = A(S).

Recall that γ : Y × Y → R^+ is the cost function.

Uniform Stability. A learning algorithm A is uniformly β-stable if for any samples S, S' of size m, differing in exactly one point, it holds for every (x, y) that:

|γ(f_S(x), y) − γ(f_{S'}(x), y)| ≤ β

Algorithmic Stability

Theorem (Bousquet & Elisseeff) [3]. Suppose γ is a bounded cost function, |γ| ≤ M, and that A is uniformly β-stable. Let S ∼ D^m. Then for every δ > 0, with probability at least 1 − δ, it holds that:

R(f_S) ≤ \hat{R}_S(f_S) + β + (2mβ + M) \sqrt{\frac{\log(1/δ)}{2m}}

Clearly we need β = o(1/\sqrt{m}) to get a non-trivial bound.

This cannot be used for the zero-one classification loss.
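A quick evaluation of the theorem's bound (my addition), with β = c/m as holds for ridge regression below:

import numpy as np

def stability_bound(emp_risk, beta, M, m, delta=0.05):
    # R(f_S) <= R_hat_S(f_S) + beta + (2*m*beta + M) * sqrt(log(1/delta) / (2m))
    return emp_risk + beta + (2 * m * beta + M) * np.sqrt(np.log(1 / delta) / (2 * m))

# With beta = 1/m, the bound shrinks as m grows:
for m in (100, 1_000, 10_000):
    print(m, stability_bound(emp_risk=0.1, beta=1.0 / m, M=1.0, m=m))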

Algorithmic Stability

A cost function γ is σ-admissible with respect to a class of functions F if for every f, f' ∈ F and (x, y) ∈ X × Y, it is the case that

|γ(f'(x), y) − γ(f(x), y)| ≤ σ|f'(x) − f(x)|

Example: Ridge Regression. The ridge regression method finds

\hat{w} = \arg\min_{w ∈ R^n} \frac{1}{m} \sum_{i=1}^m (w · x_i − y_i)^2 + λ\|w\|_2^2

If \|x\|_2 ≤ X and if Y = [−M, M], then it is easy to see that any minimiser \hat{w} has to satisfy \|\hat{w}\|_2^2 ≤ M^2/λ.

Consequently, γ(y', y) = (y' − y)^2 is σ-admissible for the class of functions that can be solutions to the ridge regression problem, with σ = 2(MX/\sqrt{λ} + M).

Theorem. Since γ is convex and σ-admissible, ridge regression is uniformly β-stable with β ≤ \frac{σ^2 X^2}{mλ} = O(1/m).
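A rough empirical check of this theorem (my addition; a sketch with illustrative names, estimating β by retraining after replacing a single training point and measuring the largest change in the squared loss over a grid of test points):

import numpy as np

def ridge_fit(X, y, lam):
    # Minimise (1/m) ||Xw - y||^2 + lam ||w||^2 via the closed-form solution
    m, n = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)

rng = np.random.default_rng(0)
m, n, lam = 200, 5, 0.1
X = rng.normal(size=(m, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # ||x||_2 <= 1
y = np.clip(X @ rng.normal(size=n) + 0.1 * rng.normal(size=m), -1, 1)       # |y| <= M = 1

w = ridge_fit(X, y, lam)
Xp, yp = X.copy(), y.copy()
v = rng.normal(size=n)
Xp[-1], yp[-1] = v / np.linalg.norm(v), 0.5   # replace exactly one training point
wp = ridge_fit(Xp, yp, lam)

Xt = rng.normal(size=(1000, n)); Xt /= np.linalg.norm(Xt, axis=1, keepdims=True)
yt = np.clip(rng.normal(size=1000), -1, 1)
beta_hat = np.max(np.abs((Xt @ w - yt) ** 2 - (Xt @ wp - yt) ** 2))
print("empirical beta:", beta_hat)  # O(1/m), as the theorem predicts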

Recent work by Hardt et al. [9] has shown that stochastic gradient descent (with early stopping) is uniformly stable.

Summary

Uniform convergence bounds for bounding generalisation error using Rademacher complexity bounds

Application of Rademacher complexity bounds to Linear Regression, GLMs, SVMs

A brief view of some results about neural networks

Algorithmic stability as a means to bound generalisation error

References I

[1] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930--945, 1993.
[2] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463--482, 2002.
[3] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499--526, 2002.
[4] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169--207. Springer, 2004.
[5] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning. Now, 2015.
[6] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303--314, 1989.
[7] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907--940, 2016.

References II

[8] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183--192, 1989.
[9] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, pages 1225--1234, 2016.
[10] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. In Proceedings of the 1st Workshop on Algorithmic Learning Theory, pages 21--41, 1990.
[11] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359--366, 1989.
[12] Sham M Kakade, Varun Kanade, Adam Kalai, and Ohad Shamir. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927--935, 2011.
[13] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115--141, 1994.

References III

[14] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[15] Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517--1539, 2016.
[16] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264--280, 1971.
[17] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[18] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.