Statistical (Supervised) Learning Theory
FoPPS Logic and Learning School
Lady Margaret Hall, University of Oxford
Varun Kanade
July 1, 2018
Previous Mini-Course
- Introduction to Computational Learning Theory
- (PAC) Learnability and the VC dimension
- Sample Compression Schemes
- Learning with Membership Queries
- (Computational) Hardness of Learning

This Mini-Course
- Statistical Learning Theory Framework
- Capacity Measures: Rademacher Complexity
- Uniform Convergence: Generalisation Bounds
- Some Machine Learning Techniques
- Algorithmic Stability to prove Generalisation

Outline
- Statistical (Supervised) Learning Theory Framework
- Linear Regression
- Rademacher Complexity
- Support Vector Machines
- Kernels
- Neural Networks
- Algorithmic Stability

Statistical (Supervised) Learning Theory Framework
- Input space: X (most often X ⊆ R^n)
- Target values: Y
  - Y = {−1, +1}: binary classification
  - Y = R: regression
- We consider data to be generated from a joint distribution D over X × Y.
- It is sometimes convenient to factorise: D(x, y) = D(x) D(y | x).
- We make no assumptions about a specific functional relationship between x and y, a.k.a. the agnostic setting.
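To make the data model concrete, here is a minimal sketch (not from the slides) of drawing i.i.d. examples from a joint distribution D over X × Y in the agnostic setting. The particular choice of D below (Gaussian inputs, a noisy threshold label) is purely an illustrative assumption; the framework itself prescribes no such form.

```python
# Illustrative sketch of the agnostic data model: examples (x, y) are drawn
# i.i.d. from a joint distribution D over X x Y, factored as D(x, y) = D(x) D(y | x).
# The specific distribution used here is an assumption made only for illustration.
import numpy as np

rng = np.random.default_rng(0)

def sample_D(m, n=2, flip_prob=0.1):
    """Draw m i.i.d. examples from an illustrative D over R^n x {-1, +1}."""
    x = rng.normal(size=(m, n))        # D(x): standard Gaussian on R^n
    y = np.sign(x[:, 0])               # a tendency in the labels ...
    flip = rng.random(m) < flip_prob   # ... but D(y | x) is not a point mass:
    y[flip] *= -1                      # there is no exact functional rule y = c(x)
    return x, y

X_sample, y_sample = sample_D(100)
```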
How can we fit a function to the data?
- Classical approaches to function approximation: polynomials, trigonometric functions, universality theorems.
- These suffer from the curse of dimensionality.
- An arbitrary function that fits the observed data may perform arbitrarily badly on unseen points, leading to overfitting.
- We will focus on fitting functions from a class of functions whose "complexity" or "capacity" is bounded.

Aside: Connections to classical Statistics/ML
Attempt to explicitly model the distributions D(x) and/or D(y | x).
- Generative models: model the full joint distribution D(x, y).
  - E.g. Gaussian Discriminant Analysis, Naïve Bayes
- Discriminative models: model only the conditional distribution D(y | x).
  - Linear regression: y | w0, w, x ∼ w0 + w · x + N(0, σ²)
  - Classification: y | w0, w, x ∼ 2 · Bernoulli(sigmoid(w0 + w · x)) − 1
- The (basic) PAC model in computational learning theory assumes a functional form, y = c(x), for some concept c in a class C, and the VC dimension of C controls learnability.
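As a small sketch of the two discriminative models above: the formulas follow the slide, while the weights w0, w, the noise level σ, and the inputs are arbitrary values chosen for illustration.

```python
# Sketch of the discriminative models; w0, w, sigma and the inputs are
# arbitrary illustrative choices, not values from the slides.
import numpy as np

rng = np.random.default_rng(1)
w0, w, sigma = 0.5, np.array([1.0, -2.0]), 0.3

def sample_regression_y(x):
    """Linear regression: y | w0, w, x ~ w0 + w . x + N(0, sigma^2)."""
    return w0 + x @ w + rng.normal(0.0, sigma, size=x.shape[0])

def sample_classification_y(x):
    """Classification: y | w0, w, x ~ 2 * Bernoulli(sigmoid(w0 + w . x)) - 1."""
    p = 1.0 / (1.0 + np.exp(-(w0 + x @ w)))   # sigmoid of the linear score
    return 2 * rng.binomial(1, p) - 1          # maps {0, 1} to {-1, +1}

x = rng.normal(size=(5, 2))
print(sample_regression_y(x))
print(sample_classification_y(x))
```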
Statistical (Supervised) Learning Theory Framework
- X instance space; Y target values; distribution D over X × Y.
- Let F ⊆ Y^X be a class of functions. A learning algorithm will output some function from the class F.
- A cost function γ : Y × Y → R_+
  - E.g. Y = {−1, +1}, γ(y′, y) = I(y′ ≠ y)
  - E.g. Y = R, γ(y′, y) = |y′ − y|^p for p ≥ 1
- The loss for f ∈ F at a point (x, y) is given by ℓ(f, x, y) = γ(f(x), y).
- The Risk functional R : F → R_+ is given by

    R(f) = E_{(x,y)∼D}[ℓ(f, x, y)] = E_{(x,y)∼D}[γ(f(x), y)]
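The following sketch just restates the cost, loss, and risk definitions above in code. Note that the true risk R(f) is an expectation over D; here it can only be estimated by an average over a finite sample, and the sample and the fixed classifier f used below are illustrative assumptions.

```python
# Sketch of the cost gamma, loss ell(f, x, y) = gamma(f(x), y), and an
# empirical estimate of the risk R(f) = E_{(x,y)~D}[gamma(f(x), y)].
# The sample and the classifier f are illustrative, not from the slides.
import numpy as np

def zero_one_cost(y_pred, y):
    """gamma(y', y) = I(y' != y), for Y = {-1, +1}."""
    return (y_pred != y).astype(float)

def power_cost(y_pred, y, p=2):
    """gamma(y', y) = |y' - y|^p, for Y = R and p >= 1."""
    return np.abs(y_pred - y) ** p

def empirical_risk(f, X, y, cost):
    """Sample average of the loss ell(f, x, y) = gamma(f(x), y)."""
    return float(np.mean(cost(f(X), y)))

# Illustrative sample and a fixed linear-threshold classifier f in F.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0])
f = lambda X: np.sign(X @ np.array([1.0, 1.0]))
print(empirical_risk(f, X, y, zero_one_cost))   # estimate of R(f)
```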
