Statistical (Supervised) Learning Theory FoPPS Logic and Learning School
Lady Margaret Hall University of Oxford
Varun Kanade July 1, 2018 Previous Mini-Course
Introduction to Computational Learning Theory (PAC)
Learnability and the VC dimension
Sample Compression Schemes
Learning with Membership Queries
(Computational) Hardness of Learning
1 This Mini-Course
Statistical Learning Theory Framework
Capacity Measures : Rademacher Complexity
Uniform Convergence : Generalisation Bounds
Some Machine Learning Techniques
Algorithmic Stability to prove Generalisation
2 Outline
Statistical (Supervised) Learning Theory Framework
Linear Regression
Rademacher Complexity
Support Vector Machines
Kernels
Neural Networks
Algorithmic Stability Target values: Y I Y = {−1, 1} : binary classification I Y = R : regression
We consider data to be generated from a joint distribution D over X × Y
Sometimes convenient to factorise: D(x, y) = D(x)D(y|x)
Make no assumptions about a specific functional relationship between x and y, a.k.a. agnostic setting 10,13,16
Statistical (Supervised) Learning Theory Framework
n Input space: X (most often X ⊂ R )
3 We consider data to be generated from a joint distribution D over X × Y
Sometimes convenient to factorise: D(x, y) = D(x)D(y|x)
Make no assumptions about a specific functional relationship between x and y, a.k.a. agnostic setting 10,13,16
Statistical (Supervised) Learning Theory Framework
n Input space: X (most often X ⊂ R ) Target values: Y I Y = {−1, 1} : binary classification I Y = R : regression
3 Sometimes convenient to factorise: D(x, y) = D(x)D(y|x)
Make no assumptions about a specific functional relationship between x and y, a.k.a. agnostic setting 10,13,16
Statistical (Supervised) Learning Theory Framework
n Input space: X (most often X ⊂ R ) Target values: Y I Y = {−1, 1} : binary classification I Y = R : regression
We consider data to be generated from a joint distribution D over X × Y
3 Make no assumptions about a specific functional relationship between x and y, a.k.a. agnostic setting 10,13,16
Statistical (Supervised) Learning Theory Framework
n Input space: X (most often X ⊂ R ) Target values: Y I Y = {−1, 1} : binary classification I Y = R : regression
We consider data to be generated from a joint distribution D over X × Y
Sometimes convenient to factorise: D(x, y) = D(x)D(y|x)
3 Statistical (Supervised) Learning Theory Framework
n Input space: X (most often X ⊂ R ) Target values: Y I Y = {−1, 1} : binary classification I Y = R : regression
We consider data to be generated from a joint distribution D over X × Y
Sometimes convenient to factorise: D(x, y) = D(x)D(y|x)
Make no assumptions about a specific functional relationship between x and y, a.k.a. agnostic setting 10,13,16
3 How can we fit a function to the data?
I Classical approach to function approximation: polynomials, trigonometric functions, universality theorems I These suffer from the curse of dimensionality
Finding any function that fits the observed data may perform arbitrarily badly on unseen points leading to overfitting
We will focus on fitting functions from a class of functions whose ‘‘complexity’’ or ‘‘capacity’’ is bounded
Statistical (Supervised) Learning Theory Framework
Input space: X , target values: Y Arbitrary data distribution D over X × Y (agnostic setting)
4 Finding any function that fits the observed data may perform arbitrarily badly on unseen points leading to overfitting
We will focus on fitting functions from a class of functions whose ‘‘complexity’’ or ‘‘capacity’’ is bounded
Statistical (Supervised) Learning Theory Framework
Input space: X , target values: Y Arbitrary data distribution D over X × Y (agnostic setting)
How can we fit a function to the data?
I Classical approach to function approximation: polynomials, trigonometric functions, universality theorems I These suffer from the curse of dimensionality
4 We will focus on fitting functions from a class of functions whose ‘‘complexity’’ or ‘‘capacity’’ is bounded
Statistical (Supervised) Learning Theory Framework
Input space: X , target values: Y Arbitrary data distribution D over X × Y (agnostic setting)
How can we fit a function to the data?
I Classical approach to function approximation: polynomials, trigonometric functions, universality theorems I These suffer from the curse of dimensionality
Finding any function that fits the observed data may perform arbitrarily badly on unseen points leading to overfitting
4 Statistical (Supervised) Learning Theory Framework
Input space: X , target values: Y Arbitrary data distribution D over X × Y (agnostic setting)
How can we fit a function to the data?
I Classical approach to function approximation: polynomials, trigonometric functions, universality theorems I These suffer from the curse of dimensionality
Finding any function that fits the observed data may perform arbitrarily badly on unseen points leading to overfitting
We will focus on fitting functions from a class of functions whose ‘‘complexity’’ or ‘‘capacity’’ is bounded
4 Generative Models: Model the full joint distribution D(x, y)
I Gaussian Discriminant Analysis, Naïve Bayes Discriminative Models: Model only the conditional distribution D(y|x)
2 I Linear Regression: y|w0, w, x ∼ w0 + w · x + N (0, σ )
I Classification: y|w0, w, x ∼ 2 · Bernoulli(sigmoid(w0 + w · x)) − 1
The (basic) PAC model in CLT assumes a functional form, y = c(x), for some concept c in class C, and the VC dimension of C controls learnability.
Aside : Connections to classical Statistics/ML
Attempt to explicitly model the distributions D(x) and/or D(y|x)
5 Discriminative Models: Model only the conditional distribution D(y|x)
2 I Linear Regression: y|w0, w, x ∼ w0 + w · x + N (0, σ )
I Classification: y|w0, w, x ∼ 2 · Bernoulli(sigmoid(w0 + w · x)) − 1
The (basic) PAC model in CLT assumes a functional form, y = c(x), for some concept c in class C, and the VC dimension of C controls learnability.
Aside : Connections to classical Statistics/ML
Attempt to explicitly model the distributions D(x) and/or D(y|x)
Generative Models: Model the full joint distribution D(x, y)
I Gaussian Discriminant Analysis, Naïve Bayes
5 The (basic) PAC model in CLT assumes a functional form, y = c(x), for some concept c in class C, and the VC dimension of C controls learnability.
Aside : Connections to classical Statistics/ML
Attempt to explicitly model the distributions D(x) and/or D(y|x)
Generative Models: Model the full joint distribution D(x, y)
I Gaussian Discriminant Analysis, Naïve Bayes Discriminative Models: Model only the conditional distribution D(y|x)
2 I Linear Regression: y|w0, w, x ∼ w0 + w · x + N (0, σ )
I Classification: y|w0, w, x ∼ 2 · Bernoulli(sigmoid(w0 + w · x)) − 1
5 Aside : Connections to classical Statistics/ML
Attempt to explicitly model the distributions D(x) and/or D(y|x)
Generative Models: Model the full joint distribution D(x, y)
I Gaussian Discriminant Analysis, Naïve Bayes Discriminative Models: Model only the conditional distribution D(y|x)
2 I Linear Regression: y|w0, w, x ∼ w0 + w · x + N (0, σ )
I Classification: y|w0, w, x ∼ 2 · Bernoulli(sigmoid(w0 + w · x)) − 1
The (basic) PAC model in CLT assumes a functional form, y = c(x), for some concept c in class C, and the VC dimension of C controls learnability.
5 0 0 I E.g. Y = {−1, 1}, γ(y , y) = I(y 6= y) 0 0 p I E.g. Y = R, γ(y , y) = y − y for p ≥ 1
Let F ⊂ YX be a class of functions. A learning algorithm will output some function from the class F.
+ A cost function γ : Y × Y → R
The loss for f ∈ F at point (x, y) is given by
`(f; x, y) = γ(f(x), y)
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Statistical (Supervised) Learning Theory Framework
X instance space; Y target values Distribution D over X × Y
6 0 0 I E.g. Y = {−1, 1}, γ(y , y) = I(y 6= y) 0 0 p I E.g. Y = R, γ(y , y) = y − y for p ≥ 1
+ A cost function γ : Y × Y → R
The loss for f ∈ F at point (x, y) is given by
`(f; x, y) = γ(f(x), y)
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Statistical (Supervised) Learning Theory Framework
X instance space; Y target values Distribution D over X × Y
Let F ⊂ YX be a class of functions. A learning algorithm will output some function from the class F.
6 The loss for f ∈ F at point (x, y) is given by
`(f; x, y) = γ(f(x), y)
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Statistical (Supervised) Learning Theory Framework
X instance space; Y target values Distribution D over X × Y
Let F ⊂ YX be a class of functions. A learning algorithm will output some function from the class F.
+ A cost function γ : Y × Y → R 0 0 I E.g. Y = {−1, 1}, γ(y , y) = I(y 6= y) 0 0 p I E.g. Y = R, γ(y , y) = y − y for p ≥ 1
6 + The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Statistical (Supervised) Learning Theory Framework
X instance space; Y target values Distribution D over X × Y
Let F ⊂ YX be a class of functions. A learning algorithm will output some function from the class F.
+ A cost function γ : Y × Y → R 0 0 I E.g. Y = {−1, 1}, γ(y , y) = I(y 6= y) 0 0 p I E.g. Y = R, γ(y , y) = y − y for p ≥ 1
The loss for f ∈ F at point (x, y) is given by
`(f; x, y) = γ(f(x), y)
6 Statistical (Supervised) Learning Theory Framework
X instance space; Y target values Distribution D over X × Y
Let F ⊂ YX be a class of functions. A learning algorithm will output some function from the class F.
+ A cost function γ : Y × Y → R 0 0 I E.g. Y = {−1, 1}, γ(y , y) = I(y 6= y) 0 0 p I E.g. Y = R, γ(y , y) = y − y for p ≥ 1
The loss for f ∈ F at point (x, y) is given by
`(f; x, y) = γ(f(x), y)
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
6 Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
Would like to find f ∈ F that ‘‘minimises’’ the risk R Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest Only have access to D through a sample of size m drawn from D called the training data
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
7 Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest Only have access to D through a sample of size m drawn from D called the training data
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Would like to find f ∈ F that ‘‘minimises’’ the risk R
7 Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
Only have access to D through a sample of size m drawn from D called the training data
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Would like to find f ∈ F that ‘‘minimises’’ the risk R Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest
7 Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Would like to find f ∈ F that ‘‘minimises’’ the risk R Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest Only have access to D through a sample of size m drawn from D called the training data
7 Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Would like to find f ∈ F that ‘‘minimises’’ the risk R Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest Only have access to D through a sample of size m drawn from D called the training data
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D
7 Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Would like to find f ∈ F that ‘‘minimises’’ the risk R Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest Only have access to D through a sample of size m drawn from D called the training data
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
7 Statistical (Supervised) Learning Theory Framework
+ The Risk functional R : F → R is given by: R(f) = E `(f; x, y) = E γ(f(x), y) (x,y)∼D (x,y)∼D
Would like to find f ∈ F that ‘‘minimises’’ the risk R Even calculating (let alone minimising) the risk is essentially impossible in most cases of interest Only have access to D through a sample of size m drawn from D called the training data
Throughout the talk, S = {(x1, y1), (x2, y2),..., (xm, ym)} a sample of size m drawn i.i.d. (independent and identically distributed) from D A learning algorithm (possibly randomised) is a map A from 2X ×Y to F
Goal: To guarantee with high probability (over S) that if fb = A(S), then for some small > 0:
R(fb) ≤ inf R(f) + f∈F
7 Tractable if there exists a separator with no error!
Define the empirical risk on a sample S as: m 1 X RbS (f) = γ(f(xi), yi) m i=1
ERM (Empirical Risk Minimisation) principle suggests that we find f ∈ F that minimises the empirical risk I Focus mostly on statistical questions I Computationally ERM is intractable for most problems of interest
E.g. Find a linear separator that minimises the number of misclassifications
Empirical Risk Minimisation
Training sample S = {(x1, y1),..., (xm, ym)} Learning algorithm: A maps 2X ×Y to F
8 Tractable if there exists a separator with no error!
ERM (Empirical Risk Minimisation) principle suggests that we find f ∈ F that minimises the empirical risk I Focus mostly on statistical questions I Computationally ERM is intractable for most problems of interest
E.g. Find a linear separator that minimises the number of misclassifications
Empirical Risk Minimisation
Training sample S = {(x1, y1),..., (xm, ym)} Learning algorithm: A maps 2X ×Y to F
Define the empirical risk on a sample S as: m 1 X RbS (f) = γ(f(xi), yi) m i=1
8 Tractable if there exists a separator with no error!
E.g. Find a linear separator that minimises the number of misclassifications
Empirical Risk Minimisation
Training sample S = {(x1, y1),..., (xm, ym)} Learning algorithm: A maps 2X ×Y to F
Define the empirical risk on a sample S as: m 1 X RbS (f) = γ(f(xi), yi) m i=1
ERM (Empirical Risk Minimisation) principle suggests that we find f ∈ F that minimises the empirical risk I Focus mostly on statistical questions I Computationally ERM is intractable for most problems of interest
8 Tractable if there exists a separator with no error!
Empirical Risk Minimisation
Training sample S = {(x1, y1),..., (xm, ym)} Learning algorithm: A maps 2X ×Y to F
Define the empirical risk on a sample S as: m 1 X RbS (f) = γ(f(xi), yi) m i=1
ERM (Empirical Risk Minimisation) principle suggests that we find f ∈ F that minimises the empirical risk I Focus mostly on statistical questions I Computationally ERM is intractable for most problems of interest
E.g. Find a linear separator that minimises the number of misclassifications
8 Empirical Risk Minimisation
Training sample S = {(x1, y1),..., (xm, ym)} Learning algorithm: A maps 2X ×Y to F
Define the empirical risk on a sample S as: m 1 X RbS (f) = γ(f(xi), yi) m i=1
ERM (Empirical Risk Minimisation) principle suggests that we find f ∈ F that minimises the empirical risk I Focus mostly on statistical questions I Computationally ERM is intractable for most problems of interest
E.g. Find a linear separator that minimises the number of misclassifications Tractable if there exists a separator with no error!
8 I How do we guarantee that the (actual) risk is close to optimal? I Focus on classification, i.e. F ⊂ {−1, 1}X and suppose VC(F) = d < ∞ 0 0 I Cost function is γ(y , y) = I(y 6= y)
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {−1, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Empirical Risk Minimisation
ERM Principle: Learning algorithm should pick f ∈ F that minimises the empirical risk m 1 X RbS (f) = γ(f(xi), yi) m i=1
9 I Focus on classification, i.e. F ⊂ {−1, 1}X and suppose VC(F) = d < ∞ 0 0 I Cost function is γ(y , y) = I(y 6= y)
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {−1, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Empirical Risk Minimisation
ERM Principle: Learning algorithm should pick f ∈ F that minimises the empirical risk m 1 X RbS (f) = γ(f(xi), yi) m i=1
I How do we guarantee that the (actual) risk is close to optimal?
9 Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {−1, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Empirical Risk Minimisation
ERM Principle: Learning algorithm should pick f ∈ F that minimises the empirical risk m 1 X RbS (f) = γ(f(xi), yi) m i=1
I How do we guarantee that the (actual) risk is close to optimal? I Focus on classification, i.e. F ⊂ {−1, 1}X and suppose VC(F) = d < ∞ 0 0 I Cost function is γ(y , y) = I(y 6= y)
9 Empirical Risk Minimisation
ERM Principle: Learning algorithm should pick f ∈ F that minimises the empirical risk m 1 X RbS (f) = γ(f(xi), yi) m i=1
I How do we guarantee that the (actual) risk is close to optimal? I Focus on classification, i.e. F ⊂ {−1, 1}X and suppose VC(F) = d < ∞ 0 0 I Cost function is γ(y , y) = I(y 6= y)
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {−1, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
9 ∗ ≤ RbS (f ) + /2 As fbminimises RbS ≤ R(f ∗) + Using Theorem (flipped)
Suppose f ∗ is the ‘‘minimiser’’ of the true risk R and fbis the minimiser of the empirical risk RbS
Then, we have,
R(fb) ≤ RbS (fb) + /2 Using Theorem
Where is chosen to be a suitable function of δ and m
Empirical Risk Minimisation
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {0, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
10 ∗ ≤ RbS (f ) + /2 As fbminimises RbS ≤ R(f ∗) + Using Theorem (flipped)
Then, we have,
R(fb) ≤ RbS (fb) + /2 Using Theorem
Where is chosen to be a suitable function of δ and m
Empirical Risk Minimisation
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {0, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Suppose f ∗ is the ‘‘minimiser’’ of the true risk R and fbis the minimiser of the empirical risk RbS
10 ∗ ≤ RbS (f ) + /2 As fbminimises RbS ≤ R(f ∗) + Using Theorem (flipped)
Where is chosen to be a suitable function of δ and m
Empirical Risk Minimisation
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {0, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Suppose f ∗ is the ‘‘minimiser’’ of the true risk R and fbis the minimiser of the empirical risk RbS
Then, we have,
R(fb) ≤ RbS (fb) + /2 Using Theorem
10 ≤ R(f ∗) + Using Theorem (flipped)
Where is chosen to be a suitable function of δ and m
Empirical Risk Minimisation
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {0, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Suppose f ∗ is the ‘‘minimiser’’ of the true risk R and fbis the minimiser of the empirical risk RbS
Then, we have,
R(fb) ≤ RbS (fb) + /2 Using Theorem ∗ ≤ RbS (f ) + /2 As fbminimises RbS
10 Where is chosen to be a suitable function of δ and m
Empirical Risk Minimisation
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {0, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Suppose f ∗ is the ‘‘minimiser’’ of the true risk R and fbis the minimiser of the empirical risk RbS
Then, we have,
R(fb) ≤ RbS (fb) + /2 Using Theorem ∗ ≤ RbS (f ) + /2 As fbminimises RbS ≤ R(f ∗) + Using Theorem (flipped)
10 Empirical Risk Minimisation
Theorem (Vapnik, Chervonenkis) 14,16 Let F ⊂ {0, 1}X with VC(F) = d < ∞. Let S ∼ Dm for some distribution D over X × {0, 1}. Then, for every δ > 0, with probability at least 1 − δ, for every f ∈ F, r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
Suppose f ∗ is the ‘‘minimiser’’ of the true risk R and fbis the minimiser of the empirical risk RbS
Then, we have,
R(fb) ≤ RbS (fb) + /2 Using Theorem ∗ ≤ RbS (f ) + /2 As fbminimises RbS ≤ R(f ∗) + Using Theorem (flipped)
Where is chosen to be a suitable function of δ and m
10 How should be pick the class of functions F? I More ‘‘complex’’ F can achieve smaller empirical risk I Difference between true risk and empirical risk (generalisation error) will be higher for more ‘‘complex’’ F
Choose an infinite family of classes {Fd : d = 1, 2,...} and find the minimiser: fb = argmin RbS (f) + κ(d, m) f∈Fd,d∈N where κ(d, m) is a penalty term that depends on the sample size and the ‘‘complexity’’ or ‘‘capacity’’ measure
Related to the more commonly used approach in practice:
fb = argmin RbS (f) + λ · regulariser(f) f∈F
Structural Risk Minimisation
r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
11 Choose an infinite family of classes {Fd : d = 1, 2,...} and find the minimiser: fb = argmin RbS (f) + κ(d, m) f∈Fd,d∈N where κ(d, m) is a penalty term that depends on the sample size and the ‘‘complexity’’ or ‘‘capacity’’ measure
Related to the more commonly used approach in practice:
fb = argmin RbS (f) + λ · regulariser(f) f∈F
Structural Risk Minimisation
r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
How should be pick the class of functions F? I More ‘‘complex’’ F can achieve smaller empirical risk I Difference between true risk and empirical risk (generalisation error) will be higher for more ‘‘complex’’ F
11 Related to the more commonly used approach in practice:
fb = argmin RbS (f) + λ · regulariser(f) f∈F
Structural Risk Minimisation
r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
How should be pick the class of functions F? I More ‘‘complex’’ F can achieve smaller empirical risk I Difference between true risk and empirical risk (generalisation error) will be higher for more ‘‘complex’’ F
Choose an infinite family of classes {Fd : d = 1, 2,...} and find the minimiser: fb = argmin RbS (f) + κ(d, m) f∈Fd,d∈N where κ(d, m) is a penalty term that depends on the sample size and the ‘‘complexity’’ or ‘‘capacity’’ measure
11 Structural Risk Minimisation
r r ! 2d log(em/d) log(1/δ) R(f) ≤ RbS (f) + + O m 2m
How should be pick the class of functions F? I More ‘‘complex’’ F can achieve smaller empirical risk I Difference between true risk and empirical risk (generalisation error) will be higher for more ‘‘complex’’ F
Choose an infinite family of classes {Fd : d = 1, 2,...} and find the minimiser: fb = argmin RbS (f) + κ(d, m) f∈Fd,d∈N where κ(d, m) is a penalty term that depends on the sample size and the ‘‘complexity’’ or ‘‘capacity’’ measure
Related to the more commonly used approach in practice:
fb = argmin RbS (f) + λ · regulariser(f) f∈F
11 Outline
Statistical (Supervised) Learning Theory Framework
Linear Regression
Rademacher Complexity
Support Vector Machines
Kernels
Neural Networks
Algorithmic Stability h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
12 h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
12 h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
12 h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
12 + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D
12 h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
12 If g ∈ F, we are in the so-called realisable setting
Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
12 Linear Regression n Let K ⊂ R . Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
Consider the squared loss as a cost function: γ(y0, y) = (y0 − y)2
Let D be a distribution over X × Y, let g(x) = E[y | x]
For any h : X → R:
h 2i h 2i R(h) = E (h(x) − y) = E (h(x) − g(x) + g(x) − y) (x,y)∼D (x,y)∼D
h 2i h 2i = E (h(x) − g(x)) + E (g(x) − y) (x,y)∼D (x,y)∼D + 2 E (h(x) − g(x))(g(x) − y) (x,y)∼D | {z } =0 as E[y | x]=g(x)
h 2i = E (h(x) − g(x)) + R(g) (x,y)∼D
If g ∈ F, we are in the so-called realisable setting
12 We can defined the likelihood of observing the data under this model
m 2 ! 1 Y (yi − w · xi) p(y , . . . , y | w, x ,..., x ) = exp − 1 m 1 m (2πσ2)m/2 2σ2 i=1
Looking at the log likelihood is slightly simpler
m m 2 1 X 2 LL(y , . . . , y | w, x ,..., x ) = − log(2πσ ) − (y − w · x ) 1 m 1 m 2 2σ2 i i i=1
Finding parameters w that maximise the (log) likelihood is the same as finding w that minimises the empirical risk with the squared error cost
The method of least squares goes back at least 200 years to Gauss, Laplace
Aside: Maximum Likelihood Principle
Discriminative Setting: Model y | w, x ∼ w · x + N (0, σ2)
13 Looking at the log likelihood is slightly simpler
m m 2 1 X 2 LL(y , . . . , y | w, x ,..., x ) = − log(2πσ ) − (y − w · x ) 1 m 1 m 2 2σ2 i i i=1
Finding parameters w that maximise the (log) likelihood is the same as finding w that minimises the empirical risk with the squared error cost
The method of least squares goes back at least 200 years to Gauss, Laplace
Aside: Maximum Likelihood Principle
Discriminative Setting: Model y | w, x ∼ w · x + N (0, σ2)
We can defined the likelihood of observing the data under this model
m 2 ! 1 Y (yi − w · xi) p(y , . . . , y | w, x ,..., x ) = exp − 1 m 1 m (2πσ2)m/2 2σ2 i=1
13 Finding parameters w that maximise the (log) likelihood is the same as finding w that minimises the empirical risk with the squared error cost
The method of least squares goes back at least 200 years to Gauss, Laplace
Aside: Maximum Likelihood Principle
Discriminative Setting: Model y | w, x ∼ w · x + N (0, σ2)
We can defined the likelihood of observing the data under this model
m 2 ! 1 Y (yi − w · xi) p(y , . . . , y | w, x ,..., x ) = exp − 1 m 1 m (2πσ2)m/2 2σ2 i=1
Looking at the log likelihood is slightly simpler
m m 2 1 X 2 LL(y , . . . , y | w, x ,..., x ) = − log(2πσ ) − (y − w · x ) 1 m 1 m 2 2σ2 i i i=1
13 Aside: Maximum Likelihood Principle
Discriminative Setting: Model y | w, x ∼ w · x + N (0, σ2)
We can defined the likelihood of observing the data under this model
m 2 ! 1 Y (yi − w · xi) p(y , . . . , y | w, x ,..., x ) = exp − 1 m 1 m (2πσ2)m/2 2σ2 i=1
Looking at the log likelihood is slightly simpler
m m 2 1 X 2 LL(y , . . . , y | w, x ,..., x ) = − log(2πσ ) − (y − w · x ) 1 m 1 m 2 2σ2 i i i=1
Finding parameters w that maximise the (log) likelihood is the same as finding w that minimises the empirical risk with the squared error cost
The method of least squares goes back at least 200 years to Gauss, Laplace
13 ERM for Linear Regression
m 1 X 2 wb = argmin (w · xi − yi) w∈K m i=1
How do we argue about the generalisation properties of this algorithm? Use a different capacity measure I Rademacher complexity, VC dimension, pseudo-dimension, covering numbers, fat-shattering dimension, ... We will require some boundedness assumptions on the data and the linear functions
Linear Regression
n Let K ⊂ R , e.g. K = {w | kwk2 ≤ W }. Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
14 How do we argue about the generalisation properties of this algorithm? Use a different capacity measure I Rademacher complexity, VC dimension, pseudo-dimension, covering numbers, fat-shattering dimension, ... We will require some boundedness assumptions on the data and the linear functions
Linear Regression
n Let K ⊂ R , e.g. K = {w | kwk2 ≤ W }. Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
ERM for Linear Regression
m 1 X 2 wb = argmin (w · xi − yi) w∈K m i=1
14 Use a different capacity measure I Rademacher complexity, VC dimension, pseudo-dimension, covering numbers, fat-shattering dimension, ... We will require some boundedness assumptions on the data and the linear functions
Linear Regression
n Let K ⊂ R , e.g. K = {w | kwk2 ≤ W }. Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
ERM for Linear Regression
m 1 X 2 wb = argmin (w · xi − yi) w∈K m i=1
How do we argue about the generalisation properties of this algorithm?
14 We will require some boundedness assumptions on the data and the linear functions
Linear Regression
n Let K ⊂ R , e.g. K = {w | kwk2 ≤ W }. Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
ERM for Linear Regression
m 1 X 2 wb = argmin (w · xi − yi) w∈K m i=1
How do we argue about the generalisation properties of this algorithm? Use a different capacity measure I Rademacher complexity, VC dimension, pseudo-dimension, covering numbers, fat-shattering dimension, ...
14 Linear Regression
n Let K ⊂ R , e.g. K = {w | kwk2 ≤ W }. Consider the family of linear functions F = {x 7→ w · x | w ∈ K}
ERM for Linear Regression
m 1 X 2 wb = argmin (w · xi − yi) w∈K m i=1
How do we argue about the generalisation properties of this algorithm? Use a different capacity measure I Rademacher complexity, VC dimension, pseudo-dimension, covering numbers, fat-shattering dimension, ... We will require some boundedness assumptions on the data and the linear functions
14 Outline
Statistical (Supervised) Learning Theory Framework
Linear Regression
Rademacher Complexity
Support Vector Machines
Kernels
Neural Networks
Algorithmic Stability S = {z1, . . . , zm} ⊂ Z be a fixed sample of size m
Then the Empirical Rademacher Complexity of G with respect to S is defined as: m 1 X RS (G) = sup σig(zi) b E m σ∼u{−1,1} g∈G m i=1
m where (σ1, . . . , σm) =: σ ∼u {−1, 1} indicates that each σi is a random variable taking the values {−1, 1} with equal probability. These are called Rademacher random variables
Empirical Rademacher Complexity
Let G be a class of functions from Z → [a, b] ⊂ R
15 Then the Empirical Rademacher Complexity of G with respect to S is defined as: m 1 X RS (G) = sup σig(zi) b E m σ∼u{−1,1} g∈G m i=1
m where (σ1, . . . , σm) =: σ ∼u {−1, 1} indicates that each σi is a random variable taking the values {−1, 1} with equal probability. These are called Rademacher random variables
Empirical Rademacher Complexity
Let G be a class of functions from Z → [a, b] ⊂ R
S = {z1, . . . , zm} ⊂ Z be a fixed sample of size m
15 Empirical Rademacher Complexity
Let G be a class of functions from Z → [a, b] ⊂ R
S = {z1, . . . , zm} ⊂ Z be a fixed sample of size m
Then the Empirical Rademacher Complexity of G with respect to S is defined as: m 1 X RS (G) = sup σig(zi) b E m σ∼u{−1,1} g∈G m i=1
m where (σ1, . . . , σm) =: σ ∼u {−1, 1} indicates that each σi is a random variable taking the values {−1, 1} with equal probability. These are called Rademacher random variables
15 Rademacher Complexity Let D be a distribution over the set Z. Let G be a class of functions from Z → [a, b] ⊂ R. For any m ≥ 1, the Rademacher complexity of G is the expectation of the empirical Rademacher complexity of G over a sample drawn from Dm: h i Rm(G) = Rb S (G) S∼EDm
Rademacher Complexity
Empirical Rademacher Complexity
m 1 X RS (G) = sup σig(zi) b E m σ∼u{−1,1} g∈G m i=1
16 Rademacher Complexity
Empirical Rademacher Complexity
m 1 X RS (G) = sup σig(zi) b E m σ∼u{−1,1} g∈G m i=1
Rademacher Complexity Let D be a distribution over the set Z. Let G be a class of functions from Z → [a, b] ⊂ R. For any m ≥ 1, the Rademacher complexity of G is the expectation of the empirical Rademacher complexity of G over a sample drawn from Dm: h i Rm(G) = Rb S (G) S∼EDm
16 Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G:
m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1
Henceforth, for S = {z1, . . . , zm}, we will use the notation: m 1 X Eb g(z) = g(zi) z∼uS m i=1
We will see a full proof of this theorem. First, let’s apply this to linear regression.
Rademacher Complexity
m 1 X h i RS (G) = sup σig(zi) ; Rm(G) = RS (G) b E m E m b σ∼u{−1,1} g∈G m S∼D i=1
17 Henceforth, for S = {z1, . . . , zm}, we will use the notation: m 1 X Eb g(z) = g(zi) z∼uS m i=1
We will see a full proof of this theorem. First, let’s apply this to linear regression.
Rademacher Complexity
m 1 X h i RS (G) = sup σig(zi) ; Rm(G) = RS (G) b E m E m b σ∼u{−1,1} g∈G m S∼D i=1
Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G:
m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1
17 Rademacher Complexity
m 1 X h i RS (G) = sup σig(zi) ; Rm(G) = RS (G) b E m E m b σ∼u{−1,1} g∈G m S∼D i=1
Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G:
m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1
Henceforth, for S = {z1, . . . , zm}, we will use the notation: m 1 X Eb g(z) = g(zi) z∼uS m i=1
We will see a full proof of this theorem. First, let’s apply this to linear regression.
17 m 1 X = E sup w · σixi m σ w,kwk2≤W i=1 m W X = E σixi m σ i=1 2
Let S = {x1,..., xm}. Then we have:
m 1 X Rb S (F) = E sup σi(w · xi) m σ w,kwk2≤W i=1
The last step follows from (the equality condition of) the Cauchy-Schwartz Inequality
Generalisation Bounds for Linear Regression
n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M]
Let F = {x 7→ w · x | kwk2 ≤ W }
18 m 1 X = E sup w · σixi m σ w,kwk2≤W i=1 m W X = E σixi m σ i=1 2
The last step follows from (the equality condition of) the Cauchy-Schwartz Inequality
Generalisation Bounds for Linear Regression
n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M]
Let F = {x 7→ w · x | kwk2 ≤ W }
Let S = {x1,..., xm}. Then we have:
m 1 X Rb S (F) = E sup σi(w · xi) m σ w,kwk2≤W i=1
18 m W X = E σixi m σ i=1 2
The last step follows from (the equality condition of) the Cauchy-Schwartz Inequality
Generalisation Bounds for Linear Regression
n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M]
Let F = {x 7→ w · x | kwk2 ≤ W }
Let S = {x1,..., xm}. Then we have:
m 1 X Rb S (F) = E sup σi(w · xi) m σ w,kwk2≤W i=1
m 1 X = E sup w · σixi m σ w,kwk2≤W i=1
18 Generalisation Bounds for Linear Regression
n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M]
Let F = {x 7→ w · x | kwk2 ≤ W }
Let S = {x1,..., xm}. Then we have:
m 1 X Rb S (F) = E sup σi(w · xi) m σ w,kwk2≤W i=1
m 1 X = E sup w · σixi m σ w,kwk2≤W i=1 m W X = E σixi m σ i=1 2
The last step follows from (the equality condition of) the Cauchy-Schwartz Inequality
18 1 2 2 m W X ≤ E σixi m σ i=1 2
1 2 m W X 2 2 X = σ kx k + 2 σ σ x · x E i i 2 i j i j m σ i=1 i 1 m 2 W X 2 WX = kxik = √ m 2 m i=1 m W X Rb S (F) = E σixi m σ i=1 2 Generalisation Bounds for Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } 19 1 2 2 m W X ≤ E σixi m σ i=1 2 1 2 m W X 2 2 X = σ kx k + 2 σ σ x · x E i i 2 i j i j m σ i=1 i 1 m 2 W X 2 WX = kxik = √ m 2 m i=1 Generalisation Bounds for Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } m W X Rb S (F) = E σixi m σ i=1 2 19 1 2 m W X 2 2 X = σ kx k + 2 σ σ x · x E i i 2 i j i j m σ i=1 i 1 m 2 W X 2 WX = kxik = √ m 2 m i=1 Generalisation Bounds for Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } 1 2 2 m m W X W X Rb S (F) = E σixi ≤ E σixi m σ m σ i=1 i=1 2 2 19 1 m 2 W X 2 WX = kxik = √ m 2 m i=1 Generalisation Bounds for Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } 1 2 2 m m W X W X Rb S (F) = E σixi ≤ E σixi m σ m σ i=1 i=1 2 2 1 2 m W X 2 2 X = σ kx k + 2 σ σ x · x E i i 2 i j i j m σ i=1 i 19 Generalisation Bounds for Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } 1 2 2 m m W X W X Rb S (F) = E σixi ≤ E σixi m σ m σ i=1 i=1 2 2 1 2 m W X 2 2 X = σ kx k + 2 σ σ x · x E i i 2 i j i j m σ i=1 i 1 m 2 W X 2 WX = kxik = √ m 2 m i=1 19 For this we need to look at the composition of the linear function and the loss/cost function Let G be a class of functions from Z → [a, b] and let ϕ : [a, b] → R be L-Lipschitz Then Talagrand’s Lemma tells us that: Rb S (φ ◦ G) ≤ L · Rb S (G) Rm(φ ◦ G) ≤ L · Rm(G) Talagrand’s Lemma We computed the Rademacher complexity of linear functions, but we’d like to apply the ‘‘main theorem’’ to the true risk 20 Let G be a class of functions from Z → [a, b] and let ϕ : [a, b] → R be L-Lipschitz Then Talagrand’s Lemma tells us that: Rb S (φ ◦ G) ≤ L · Rb S (G) Rm(φ ◦ G) ≤ L · Rm(G) Talagrand’s Lemma We computed the Rademacher complexity of linear functions, but we’d like to apply the ‘‘main theorem’’ to the true risk For this we need to look at the composition of the linear function and the loss/cost function 20 Then Talagrand’s Lemma tells us that: Rb S (φ ◦ G) ≤ L · Rb S (G) Rm(φ ◦ G) ≤ L · Rm(G) Talagrand’s Lemma We computed the Rademacher complexity of linear functions, but we’d like to apply the ‘‘main theorem’’ to the true risk For this we need to look at the composition of the linear function and the loss/cost function Let G be a class of functions from Z → [a, b] and let ϕ : [a, b] → R be L-Lipschitz 20 Talagrand’s Lemma We computed the Rademacher complexity of linear functions, but we’d like to apply the ‘‘main theorem’’ to the true risk For this we need to look at the composition of the linear function and the loss/cost function Let G be a class of functions from Z → [a, b] and let ϕ : [a, b] → R be L-Lipschitz Then Talagrand’s Lemma tells us that: Rb S (φ ◦ G) ≤ L · Rb S (G) Rm(φ ◦ G) ≤ L · Rm(G) 20 Consider the following: H = {(x, y) 7→ (f(x) − y)2 | x ∈ X , y ∈ Y, f ∈ F} 2 φ : [−(M + WX), (M + WX)] → R, φ(z) = z φ is 2(M + WX)-Lipschitz on its domain √ Rb S ([−M,M]) ≤ M/ m Using Rb S (F + G) ≤ Rb S (F) + Rb S (G) and Talagrand’s Lemma, we get 2(M + WX)2 Rb S (H) ≤ √ m h i Note that Rm(H) = Rb S (H) ≤ supS,|S|=m Rb S (H) S∼EDm Generalisation of Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } 21 φ is 2(M + WX)-Lipschitz on its domain √ Rb S ([−M,M]) ≤ M/ m Using Rb S (F + G) ≤ Rb S (F) + Rb S (G) and Talagrand’s Lemma, we get 2(M + WX)2 Rb S (H) ≤ √ m h i Note that Rm(H) = Rb S (H) ≤ supS,|S|=m Rb S (H) S∼EDm Generalisation of Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } Consider the following: H = {(x, y) 7→ (f(x) − y)2 | x ∈ X , y ∈ Y, f ∈ F} 2 φ : [−(M + WX), (M + WX)] → R, φ(z) = z 21 √ Rb S ([−M,M]) ≤ M/ m Using Rb S (F + G) ≤ Rb S (F) + Rb S (G) and Talagrand’s Lemma, we get 2(M + WX)2 Rb S (H) ≤ √ m h i Note that Rm(H) = Rb S (H) ≤ supS,|S|=m Rb S (H) S∼EDm Generalisation of Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } Consider the following: H = {(x, y) 7→ (f(x) − y)2 | x ∈ X , y ∈ Y, f ∈ F} 2 φ : [−(M + WX), (M + WX)] → R, φ(z) = z φ is 2(M + WX)-Lipschitz on its domain 21 Using Rb S (F + G) ≤ Rb S (F) + Rb S (G) and Talagrand’s Lemma, we get 2(M + WX)2 Rb S (H) ≤ √ m h i Note that Rm(H) = Rb S (H) ≤ supS,|S|=m Rb S (H) S∼EDm Generalisation of Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } Consider the following: H = {(x, y) 7→ (f(x) − y)2 | x ∈ X , y ∈ Y, f ∈ F} 2 φ : [−(M + WX), (M + WX)] → R, φ(z) = z φ is 2(M + WX)-Lipschitz on its domain √ Rb S ([−M,M]) ≤ M/ m 21 h i Note that Rm(H) = Rb S (H) ≤ supS,|S|=m Rb S (H) S∼EDm Generalisation of Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } Consider the following: H = {(x, y) 7→ (f(x) − y)2 | x ∈ X , y ∈ Y, f ∈ F} 2 φ : [−(M + WX), (M + WX)] → R, φ(z) = z φ is 2(M + WX)-Lipschitz on its domain √ Rb S ([−M,M]) ≤ M/ m Using Rb S (F + G) ≤ Rb S (F) + Rb S (G) and Talagrand’s Lemma, we get 2(M + WX)2 Rb S (H) ≤ √ m 21 Generalisation of Linear Regression n Instance space X ⊂ R , ∀x ∈ X , kxk2 ≤ X Target values Y = [−M,M] Let F = {x 7→ w · x | kwk2 ≤ W } Consider the following: H = {(x, y) 7→ (f(x) − y)2 | x ∈ X , y ∈ Y, f ∈ F} 2 φ : [−(M + WX), (M + WX)] → R, φ(z) = z φ is 2(M + WX)-Lipschitz on its domain √ Rb S ([−M,M]) ≤ M/ m Using Rb S (F + G) ≤ Rb S (F) + Rb S (G) and Talagrand’s Lemma, we get 2(M + WX)2 Rb S (H) ≤ √ m h i Note that Rm(H) = Rb S (H) ≤ supS,|S|=m Rb S (H) S∼EDm 21 How can we solve this optimisation problem? (without norm constraint there is a closed form solution) This convex optimisation problem can solved using projected (stochastic) gradient descent Guaranteed to find a near-optimal solution in polynomial time Aside: Algorithms for the Linear Regression Model ERM for Linear Regression m 1 X 2 J(w) = (w · x − y ) m i i i=1 wb = argmin J(w) w,kwk2≤W 22 This convex optimisation problem can solved using projected (stochastic) gradient descent Guaranteed to find a near-optimal solution in polynomial time Aside: Algorithms for the Linear Regression Model ERM for Linear Regression m 1 X 2 J(w) = (w · x − y ) m i i i=1 wb = argmin J(w) w,kwk2≤W How can we solve this optimisation problem? (without norm constraint there is a closed form solution) 22 Guaranteed to find a near-optimal solution in polynomial time Aside: Algorithms for the Linear Regression Model ERM for Linear Regression m 1 X 2 J(w) = (w · x − y ) m i i i=1 wb = argmin J(w) w,kwk2≤W How can we solve this optimisation problem? (without norm constraint there is a closed form solution) This convex optimisation problem can solved using projected (stochastic) gradient descent 22 Aside: Algorithms for the Linear Regression Model ERM for Linear Regression m 1 X 2 J(w) = (w · x − y ) m i i i=1 wb = argmin J(w) w,kwk2≤W How can we solve this optimisation problem? (without norm constraint there is a closed form solution) This convex optimisation problem can solved using projected (stochastic) gradient descent Guaranteed to find a near-optimal solution in polynomial time 22 Recall in our case K = {w | kwk2 ≤ W }, ΠK (·) is the projection operator Informal Theorem 5 0 P Suppose sup 0 w − w ≤ R and ∇J(w) ≤ L, then with √ w,w ∈K 2 w∈K 2 η = R/(L T ) RL J(w) ≤ min J(w) + √ w∈K T Aside: Gradient Descent Algorithm 1 Projected Gradient Descent Inputs: η, T Pick w1 ∈ K for t = 1,...,T do 0 wt+1 = wt − η∇J(wt) 0 wt+1 = ΠK (wt+1) end for 1 PT Output: w = T t=1 wt 23 Informal Theorem 5 0 P Suppose sup 0 w − w ≤ R and ∇J(w) ≤ L, then with √ w,w ∈K 2 w∈K 2 η = R/(L T ) RL J(w) ≤ min J(w) + √ w∈K T Aside: Gradient Descent Algorithm 2 Projected Gradient Descent Inputs: η, T Pick w1 ∈ K for t = 1,...,T do 0 wt+1 = wt − η∇J(wt) 0 wt+1 = ΠK (wt+1) end for 1 PT Output: w = T t=1 wt Recall in our case K = {w | kwk2 ≤ W }, ΠK (·) is the projection operator 23 Aside: Gradient Descent Algorithm 3 Projected Gradient Descent Inputs: η, T Pick w1 ∈ K for t = 1,...,T do 0 wt+1 = wt − η∇J(wt) 0 wt+1 = ΠK (wt+1) end for 1 PT Output: w = T t=1 wt Recall in our case K = {w | kwk2 ≤ W }, ΠK (·) is the projection operator Informal Theorem 5 0 P Suppose sup 0 w − w ≤ R and ∇J(w) ≤ L, then with √ w,w ∈K 2 w∈K 2 η = R/(L T ) RL J(w) ≤ min J(w) + √ w∈K T 23 We can consider the ERM problem: m 1 X 2 J(w) = (u(w · x ) − y ) ; w = argmin J(w) m i i b i=1 w,kwk2≤W Can bound Rademacher complexity easily using the boundedness and Lipschitz property of u However, the optimisation problem is now non-convex! Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w · x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } 24 Can bound Rademacher complexity easily using the boundedness and Lipschitz property of u However, the optimisation problem is now non-convex! Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w · x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } We can consider the ERM problem: m 1 X 2 J(w) = (u(w · x ) − y ) ; w = argmin J(w) m i i b i=1 w,kwk2≤W 24 Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w · x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } We can consider the ERM problem: m 1 X 2 J(w) = (u(w · x ) − y ) ; w = argmin J(w) m i i b i=1 w,kwk2≤W Can bound Rademacher complexity easily using the boundedness and Lipschitz property of u However, the optimisation problem is now non-convex! 24 Can consider a different cost/loss function: Z u−1(y0) γ(y0, y) = (u(z) − y)dz 0 Z w·x `(w; x, y) (u(z) − y)dz 0 The resulting objective function is convex in w m 1 X Je(w) = `(w; xi, yi) m i=1 In the realisable setting, i.e. E y |x = u(w · x), the global minimisers of J(w) (squared error) and Je(w) coincide, yielding computationally and statistically efficient algorithms. 12 Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w·x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } 25 The resulting objective function is convex in w m 1 X Je(w) = `(w; xi, yi) m i=1 In the realisable setting, i.e. E y |x = u(w · x), the global minimisers of J(w) (squared error) and Je(w) coincide, yielding computationally and statistically efficient algorithms. 12 Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w·x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } Can consider a different cost/loss function: Z u−1(y0) γ(y0, y) = (u(z) − y)dz 0 Z w·x `(w; x, y) (u(z) − y)dz 0 25 In the realisable setting, i.e. E y |x = u(w · x), the global minimisers of J(w) (squared error) and Je(w) coincide, yielding computationally and statistically efficient algorithms. 12 Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w·x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } Can consider a different cost/loss function: Z u−1(y0) γ(y0, y) = (u(z) − y)dz 0 Z w·x `(w; x, y) (u(z) − y)dz 0 The resulting objective function is convex in w m 1 X Je(w) = `(w; xi, yi) m i=1 25 Aside: Generalised Linear Models Can consider more general models called generalised linear models GLM = {x 7→ u(w·x) | u bounded, increasing & 1-Lipschitz, kwk2 ≤ W } Can consider a different cost/loss function: Z u−1(y0) γ(y0, y) = (u(z) − y)dz 0 Z w·x `(w; x, y) (u(z) − y)dz 0 The resulting objective function is convex in w m 1 X Je(w) = `(w; xi, yi) m i=1 In the realisable setting, i.e. E y |x = u(w · x), the global minimisers of J(w) (squared error) and Je(w) coincide, yielding computationally and statistically efficient algorithms. 12 25 Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G: m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 We will make use of a concentration of measure inequality, called McDiarmid’s inequality. McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci Rademacher Complexity : Main Result 26 We will make use of a concentration of measure inequality, called McDiarmid’s inequality. McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci Rademacher Complexity : Main Result Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G: m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 26 McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci Rademacher Complexity : Main Result Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G: m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 We will make use of a concentration of measure inequality, called McDiarmid’s inequality. 26 Rademacher Complexity : Main Result Theorem 2,14 Let G be a class of functions mapping Z → [0, 1]. Let D be a distribution over Z and suppose that a sample S of size m is drawn from Dm. Then for every δ > 0, with probability at least 1 − δ, the following holds for each g ∈ G: m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 We will make use of a concentration of measure inequality, called McDiarmid’s inequality. McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci 26 For S ⊂ Z, define the function: Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS i 0 Let S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, and consider, 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci Proof of Main Result 0 0 0 m Let S = {z1, . . . , zm},S = {z1, . . . , zm} ∼ D 27 i 0 Let S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, and consider, 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci Proof of Main Result 0 0 0 m Let S = {z1, . . . , zm},S = {z1, . . . , zm} ∼ D For S ⊂ Z, define the function: Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS 27 McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci Proof of Main Result 0 0 0 m Let S = {z1, . . . , zm},S = {z1, . . . , zm} ∼ D For S ⊂ Z, define the function: Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS i 0 Let S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, and consider, 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m 27 Proof of Main Result 0 0 0 m Let S = {z1, . . . , zm},S = {z1, . . . , zm} ∼ D For S ⊂ Z, define the function: Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS i 0 Let S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, and consider, 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m McDiarmid’s Inequality m Let Z be a set and let f : Z → R be a function such that, ∀i, ∃ci > 0 0, ∀z1, . . . , zm, zi, 0 |f(z1, . . . , zi, . . . , zm) − f(z1, . . . , zi, . . . , zm)| ≤ ci. Let Z1,...,Zm be i.i.d. random variables taking values in Z, then ∀ε > 0, 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci 27 i 0 Let S = {z1, . . . , zm}, S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, we have 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m Applying McDiarmid’s inequality with ci = 1/m for all i, Φ(S) ≥ Φ(S) + ε ≤ exp(−2ε2m) P S∼EDm Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Proof of Main Result McDiarmid’s Inequality 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci 28 Applying McDiarmid’s inequality with ci = 1/m for all i, Φ(S) ≥ Φ(S) + ε ≤ exp(−2ε2m) P S∼EDm Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Proof of Main Result McDiarmid’s Inequality 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci i 0 Let S = {z1, . . . , zm}, S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, we have 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m 28 Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Proof of Main Result McDiarmid’s Inequality 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci i 0 Let S = {z1, . . . , zm}, S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, we have 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m Applying McDiarmid’s inequality with ci = 1/m for all i, Φ(S) ≥ Φ(S) + ε ≤ exp(−2ε2m) P S∼EDm 28 Proof of Main Result McDiarmid’s Inequality 2 ! h i 2ε P f(Z1,...,Zm) ≥ E f(Z1,...,Zm) + ε ≤ exp − P 2 i ci i 0 Let S = {z1, . . . , zm}, S = {z1, . . . , zi−1, zi, zi+1, . . . , zm}, we have 1 1 Φ(S) − Φ(S0) ≤ |g(z ) − g(z0)| ≤ m i i m Applying McDiarmid’s inequality with ci = 1/m for all i, Φ(S) ≥ Φ(S) + ε ≤ exp(−2ε2m) P S∼EDm Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m 28 Wouldn’t it be nice if Φ(S) ≤ 2Rm(G)? S∼EDm Thus, for any δ > 0, with probability at least 1 − δ, for every g ∈ G, r ! log(1/δ) g(z) ≤ g(z) + Φ(S) + O E Eb E m z∼D z∼uS S∼D m Recall that, m 1 X Eb g(z) = g(zi) z∼uS m i=1 Want to show m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 Proof of Main Result Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m 29 Wouldn’t it be nice if Φ(S) ≤ 2Rm(G)? S∼EDm Recall that, m 1 X Eb g(z) = g(zi) z∼uS m i=1 Want to show m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 Proof of Main Result Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Thus, for any δ > 0, with probability at least 1 − δ, for every g ∈ G, r ! log(1/δ) g(z) ≤ g(z) + Φ(S) + O E Eb E m z∼D z∼uS S∼D m 29 Wouldn’t it be nice if Φ(S) ≤ 2Rm(G)? S∼EDm Want to show m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 Proof of Main Result Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Thus, for any δ > 0, with probability at least 1 − δ, for every g ∈ G, r ! log(1/δ) g(z) ≤ g(z) + Φ(S) + O E Eb E m z∼D z∼uS S∼D m Recall that, m 1 X Eb g(z) = g(zi) z∼uS m i=1 29 Wouldn’t it be nice if Φ(S) ≤ 2Rm(G)? S∼EDm Proof of Main Result Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Thus, for any δ > 0, with probability at least 1 − δ, for every g ∈ G, r ! log(1/δ) g(z) ≤ g(z) + Φ(S) + O E Eb E m z∼D z∼uS S∼D m Recall that, m 1 X Eb g(z) = g(zi) z∼uS m i=1 Want to show m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 29 Proof of Main Result Φ(S) = sup E g(z) − Eb g(z) g∈G z∼D z∼uS Alternatively, for any δ > 0, with probability at least 1 − δ, r ! log(1/δ) Φ(S) ≤ E Φ(S) + O S∼Dm m Thus, for any δ > 0, with probability at least 1 − δ, for every g ∈ G, r ! log(1/δ) g(z) ≤ g(z) + Φ(S) + O E Eb E m z∼D z∼uS S∼D m Recall that, m 1 X Eb g(z) = g(zi) z∼uS m i=1 Want to show m r ! 1 X log(1/δ) E g(z) ≤ g(zi) + 2Rm(G) + O . z∼D m m i=1 Wouldn’t it be nice if Φ(S) ≤ 2Rm(G)? S∼EDm 29 Introduce a fresh sample S0 ∼ Dm ! Φ(S) = sup g(z) − g(z) E m E m 0 E m Eb 0 Eb S∼D S∼D g∈G S ∼D z∼uS z∼uS Pushing the sup inside the expectation " # Φ(S) ≤ sup g(z) − g(z) E m mE 0 m Eb 0 Eb S∼D S∼D ,S ∼D g∈G z∼uS z∼uS Consider " # Φ(S) = sup g(z) − g(z) E m E m E Eb S∼D S∼D g∈G z∼D z∼uS Proof of Main Result All that remains to show is that Φ(S) ≤ 2Rm(G) S∼EDm 30 Introduce a fresh sample S0 ∼ Dm ! Φ(S) = sup g(z) − g(z) E m E m 0 E m Eb 0 Eb S∼D S∼D g∈G S ∼D z∼uS z∼uS Pushing the sup inside the expectation " # Φ(S) ≤ sup g(z) − g(z) E m mE 0 m Eb 0 Eb S∼D S∼D ,S ∼D g∈G z∼uS z∼uS Proof of Main Result All that remains to show is that Φ(S) ≤ 2Rm(G) S∼EDm Consider " # Φ(S) = sup g(z) − g(z) E m E m E Eb S∼D S∼D g∈G z∼D z∼uS 30 Pushing the sup inside the expectation " # Φ(S) ≤ sup g(z) − g(z) E m mE 0 m Eb 0 Eb S∼D S∼D ,S ∼D g∈G z∼uS z∼uS Proof of Main Result All that remains to show is that Φ(S) ≤ 2Rm(G) S∼EDm Consider " # Φ(S) = sup g(z) − g(z) E m E m E Eb S∼D S∼D g∈G z∼D z∼uS Introduce a fresh sample S0 ∼ Dm ! Φ(S) = sup g(z) − g(z) E m E m 0 E m Eb 0 Eb S∼D S∼D g∈G S ∼D z∼uS z∼uS 30 Proof of Main Result All that remains to show is that Φ(S) ≤ 2Rm(G) S∼EDm Consider " # Φ(S) = sup g(z) − g(z) E m E m E Eb S∼D S∼D g∈G z∼D z∼uS Introduce a fresh sample S0 ∼ Dm ! Φ(S) = sup g(z) − g(z) E m E m 0 E m Eb 0 Eb S∼D S∼D g∈G S ∼D z∼uS z∼uS Pushing the sup inside the expectation " # Φ(S) ≤ sup g(z) − g(z) E m mE 0 m Eb 0 Eb S∼D S∼D ,S ∼D g∈G z∼uS z∼uS 30 S and S0 are identically distributed, so their elements can be swapped by introducing Rademacher random variables σi ∈ {−1, 1} " # 1 0 Φ(S) ≤ sup σi(g(z ) − g(zi)) E m E i S∼D S∼Dm,S0∼Dm,σ g∈G m " # 1 ≤ 2 sup σig(zi) = 2Rm(G) Em S∼D ,σ g∈G m Proof of Main Result Pushing the sup inside the expectation " # Φ(S) ≤ sup g(z) − g(z) E m mE 0 m Eb 0 Eb S∼D S∼D ,S ∼D g∈G z∼uS z∼uS 31 Proof of Main Result Pushing the sup inside the expectation " # Φ(S) ≤ sup g(z) − g(z) E m mE 0 m Eb 0 Eb S∼D S∼D ,S ∼D g∈G z∼uS z∼uS S and S0 are identically distributed, so their elements can be swapped by introducing Rademacher random variables σi ∈ {−1, 1} " # 1 0 Φ(S) ≤ sup σi(g(z ) − g(zi)) E m E i S∼D S∼Dm,S0∼Dm,σ g∈G m " # 1 ≤ 2 sup σig(zi) = 2Rm(G) Em S∼D ,σ g∈G m 31 Outline Statistical (Supervised) Learning Theory Framework Linear Regression Rademacher Complexity Support Vector Machines Kernels Neural Networks Algorithmic Stability Which separator should be picked? Support Vector Machines: Binary Classification Goal: Find a linear separator Data is linearly separable if there exists a linear separator that classifies all points correctly 32 Support Vector Machines: Binary Classification Goal: Find a linear separator Data is linearly separable if there exists a linear separator that classifies all points correctly Which separator should be picked? 32 Support Vector Machines: Maximum Margin Principle Maximise the distance of the closest point from the decision boundary Points that are closest to the decision boundary are support vectors 33 Support Vector Machines : Geometric View n Given a hyperplane: H ≡ w · x + w0 = 0 and a point x ∈ R , how far is x from H? 34 Support Vector Machines : Geometric View Consider the hyperplane: H ≡ w · x + w0 = 0 The distance of point x from H is given by |w · x + w0| kwk2 All points on one side of the hyperplane satisfy (labelled y = +1) w · x + w0 ≥ 0 and points on the other side satisfy (labelled y = −1) w · x + w0 < 0 35 SVM Formulation : Separable Case 1 2 minimise: 2 kwk2 subject to: yi(w · xi + w0) ≥ 1 for i = 1, . . . , m Here yi ∈ {−1, 1} If data is separable, then we find a classifier with no classification error on the training set 1 ∗ The margin of the classifier is ∗ if w is the optimal solution kw k2 This is a convex quadratic program and hence can be solved efficiently 36 SVM Formulation : The Dual m m m X 1 X X 1 2 maximise αi − αiαj yiyj xi · xj minimise: 2 kwk2 2 i=1 i=1 j=1 subject to: subject to: yi(w · xi + w0) − 1 ≥ 0 Pm i=1 αiyi = 0 for i = 1, . . . , m 0 ≤ αi Here yi ∈ {−1, 1} for i = 1, . . . , m Lagrange Function m 1 2 X Λ(w, w ; α) = kwk − α (y (w · x + w ) − 1) 0 2 2 i i i 0 i=1 Complementary Slackness αi(yi(w · xi + w0) − 1) = 0, i = 1, . . . , m 37 SVM Formulation : Non-Separable Case m 1 2 X minimise: 2 kwk2 + C ζi i=1 subject to: yi(w · xi + w0) ≥ 1 − ζi ζi ≥ 0 for i = 1, . . . , m Here yi ∈ {−1, 1} 38 6 4 2 Hinge Loss 0 −6 −4 −2 0 2 4 6 y(w · x + w0) Note that for the optimal solution, ζi = max{0, 1 − yi(w · xi + w0)} Thus, SVM can be viewed as minimising the hinge loss with regularisation SVM Formulation : Loss Function m 1 2 X minimise: kwk + C ζ 2 2 i | {z } i=1 Regulariser | {z } Loss Function subject to: yi(w · xi + w0) ≥ 1 − ζi ζi ≥ 0 for i = 1, . . . , m Here yi ∈ {−1, 1} 39 Note that for the optimal solution, ζi = max{0, 1 − yi(w · xi + w0)} Thus, SVM can be viewed as minimising the hinge loss with regularisation SVM Formulation : Loss Function m 1 2 X minimise: kwk + C ζ 2 2 i 6 | {z } i=1 Regulariser | {z } Loss Function 4 subject to: 2 yi(w · xi + w0) ≥ 1 − ζi Hinge Loss ζi ≥ 0 0 −6 −4 −2 0 2 4 6 for i = 1, . . . , m y(w · x + w0) Here yi ∈ {−1, 1} 39 SVM Formulation : Loss Function m 1 2 X minimise: kwk + C ζ 2 2 i 6 | {z } i=1 Regulariser | {z } Loss Function 4 subject to: 2 yi(w · xi + w0) ≥ 1 − ζi Hinge Loss ζi ≥ 0 0 −6 −4 −2 0 2 4 6 for i = 1, . . . , m y(w · x + w0) Here yi ∈ {−1, 1} Note that for the optimal solution, ζi = max{0, 1 − yi(w · xi + w0)} Thus, SVM can be viewed as minimising the hinge loss with regularisation 39 SVM : Deriving the Dual m 1 2 X minimise: 2 kwk2 + C ζi i=1 subject to: yi(w · xi + w0) − (1 − ζi) ≥ 0 ζi ≥ 0 for i = 1, . . . , m Here yi ∈ {−1, 1} Lagrange Function m m m 1 2 X X X Λ(w, w , ζ; α, µ) = kwk +C ζ − α (y (w·x +w )−(1−ζ ))− µ ζ 0 2 2 i i i i 0 i i i i=1 i=1 i=1 40 SVM : Deriving the Dual Lagrange Function m m m 1 2 X X X Λ(w, w , ζ; α, µ) = kwk +C ζ − α (y (w·x +w )−(1−ζ ))− µ ζ 0 2 2 i i i i 0 i i i i=1 i=1 i=1 We write derivatives with respect to w, w0 and ζi, m ∂ Λ X = − αiyi ∂w0 i=1 ∂ Λ = C − αi − µi ∂ζi m X ∇wΛ = w − αiyixi i=1 For (KKT) dual feasibility constraints, we require αi ≥ 0, µi ≥ 0 41 SVM : Deriving the Dual Setting the derivatives to 0, substituting the resulting expressions in Λ (and simplifying), we get a function g(α) and some constraints m m m X 1 X X g(α) = α − α α y y x · x i 2 i j i j i j i=1 i=1 j=1 Constraints 0 ≤ αi ≤ C i = 1, . . . , m m X αiyi = 0 i=1 Finding critical points of Λ satisfying the KKT conditions corresponds to finding the maximum of g(α) subject to the above constraints 42 SVM: Primal and Dual Formulations Primal Form Dual Form m m m m X X 1 X X minimise: 1 kwk2 +C ζ maximise α − α α y y x ·x 2 2 i i 2 i j i j i j i=1 i=1 i=1 j=1 subject to: subject to: y (w · x + w ) ≥ (1 − ζ ) Pm i i 0 i i=1 αiyi = 0 ζi ≥ 0 0 ≤ αi ≤ C for i = 1, . . . , m for i = 1, . . . , m 43 Pm Recall the form of the solution: w = i=1 αiyixi Thus, only those datapoints xi for which αi > 0, determine the solution This is why they are called support vectors KKT Complementary Slackness Conditions