An Introduction to Statistical Learning Theory - Theoretical Aspects -

Samy Bengio [email protected]

Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny, Switzerland http://www.idiap.ch/~bengio

IDIAP - October 21, 2004 Statistical Learning Theory

1. The Data
2. The Function Space
3. The Loss Function
4. The Risk and the Empirical Risk
5. The Training Error
6. The Capacity
7. The Bias-Variance Dilemma
8. Regularization
9. Estimation of the Risk
10. Model Selection
11. Methodology

The Data

• Available training data:

◦ Dn = {z1, z2, ..., zn} ∈ Z,
◦ independently and identically distributed (iid),
◦ drawn from an unknown distribution p(Z)

• Various forms of the data:
◦ Classification: Z = (X, Y) ∈ R^d × {−1, 1}; objective: given a new x, estimate P(Y | X = x)
◦ Regression: Z = (X, Y) ∈ R^d × R; objective: given a new x, estimate E[Y | X = x]
◦ Density estimation: Z ∈ R^d; objective: given a new z, estimate p(z)

The Function Space

• Learning: search for a good function in a function space F
• Examples of functions f(·; θ) ∈ F:
◦ Regression:

ŷ = f(x; a, b, c) = a·x² + b·x + c

◦ Classification:

ŷ = f(x; a, b, c) = sign(a·x² + b·x + c)

◦ Density estimation:

p̂(z) = f(z; μ, Σ) = 1 / ((2π)^(|z|/2) |Σ|^(1/2)) · exp(−½ (z − μ)ᵀ Σ⁻¹ (z − μ))
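This density can be evaluated directly in code; a minimal sketch, assuming a diagonal covariance matrix (so |Σ| and Σ⁻¹ reduce to a product and element-wise divisions), with hypothetical names:

```python
import math

def gaussian_density(z, mu, var):
    """Multivariate Gaussian density with a diagonal covariance
    (var holds the diagonal of Sigma); a simplification of the
    general formula on the slide."""
    d = len(z)
    det = math.prod(var)  # |Sigma| of a diagonal matrix
    quad = sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, mu, var))
    norm = (2 * math.pi) ** (d / 2) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm
```

For a standard 1-D normal, `gaussian_density([0.0], [0.0], [1.0])` recovers the familiar 1/√(2π).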

The Loss Function

• Learning: search for a good function in a function space F
• Examples of loss functions L : Z × F → R:
◦ Regression:

L(z, f) = L((x, y), f) = (f(x) − y)²

◦ Classification:

L(z, f) = L((x, y), f) = 0 if f(x) = y, 1 otherwise

◦ Density estimation:

L(z, f) = − log f(z)
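The three losses above translate directly into code; a minimal sketch, with hypothetical function names:

```python
import math

def squared_loss(f_x, y):
    # Regression loss: (f(x) - y)^2
    return (f_x - y) ** 2

def zero_one_loss(f_x, y):
    # Classification loss: 0 if the prediction is correct, 1 otherwise
    return 0 if f_x == y else 1

def nll_loss(p_z):
    # Density estimation loss: negative log-likelihood of z under f
    return -math.log(p_z)
```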

The Risk and the Empirical Risk

• Learning: search for a good function in a function space F
• Minimize the Expected Risk on F, defined for a given f as

R(f) = E_Z[L(Z, f)] = ∫_Z L(z, f) p(z) dz

• Induction Principle:
◦ select f* = arg min_{f∈F} R(f)
◦ problem: p(z) is unknown!!!
• Empirical Risk:

R̂(f, Dn) = (1/n) Σ_{i=1}^{n} L(z_i, f)
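The empirical risk is just an average over the training set; a minimal sketch, with a hypothetical squared-loss regression setup as a usage example:

```python
def empirical_risk(loss, f, data):
    """R_hat(f, D_n) = (1/n) * sum_i L(z_i, f): the average loss
    of f over the n training examples."""
    return sum(loss(z, f) for z in data) / len(data)

# Hypothetical illustration: squared loss on z = (x, y), f = identity.
sq_loss = lambda z, f: (f(z[0]) - z[1]) ** 2
identity = lambda x: x
data = [(1.0, 1.0), (2.0, 3.0)]
# empirical_risk(sq_loss, identity, data) averages (0 + 1) / 2
```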

The Risk and the Empirical Risk

• The empirical risk is an unbiased estimate of the risk:

E_D[R̂(f, D)] = R(f)

• The principle of empirical risk minimization:

f*(Dn) = arg min_{f∈F} R̂(f, Dn)

• Training error:

R̂(f*(Dn), Dn) = min_{f∈F} R̂(f, Dn)

• Is the training error a biased estimate of the risk?

The Training Error

• Is the training error biased? Yes:

E[R(f*(Dn)) − R̂(f*(Dn), Dn)] ≥ 0

• The solution f*(Dn) found by minimizing the training error is better on Dn than on any other set D′n drawn from p(Z).
• Can we bound the difference between the training error and the risk?

|R(f*(Dn)) − R̂(f*(Dn), Dn)| ≤ ?

• Answer: under certain conditions on F, yes.
• These conditions depend on the notion of capacity h of F.

The Capacity

• The capacity h(F) of a function space F is a measure of its size, or complexity.
• Classification: the capacity h(F) is the largest n such that there exists a set of examples Dn such that one can always find an f ∈ F which gives the correct answer for all examples in Dn, for any possible labeling.
• Regression and density estimation: a notion of capacity also exists, but is more complex to derive (for instance, we can always reduce a regression problem to a classification problem).
• Bound on the expected risk: let τ = sup L − inf L; then

P( sup_{f∈F} |R(f) − R̂(f, Dn)| ≤ 2τ √[ (h (ln(2n/h) + 1) − ln(η/9)) / n ] ) ≥ 1 − η
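The width of the confidence interval in this bound is easy to compute numerically; a minimal sketch, assuming the constants exactly as written on the slide (other texts use slightly different ones):

```python
import math

def vc_bound_gap(h, n, eta, tau):
    """Width of the confidence interval in the capacity bound:
    2*tau*sqrt((h*(ln(2n/h) + 1) - ln(eta/9)) / n)."""
    return 2 * tau * math.sqrt(
        (h * (math.log(2 * n / h) + 1) - math.log(eta / 9)) / n)
```

As expected, for fixed capacity h the gap shrinks as the number of examples n grows.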

Theoretical Curves

[Figure: the bound on the expected risk, the confidence interval, and the empirical risk, plotted as a function of the capacity h.]

Theoretical Curves

[Figure: the bound on the expected risk, inf R(f), and the empirical risk, plotted as a function of the number of examples n.]

The Bias-Variance Dilemma

• The generalization error can be decomposed into 3 parts:
◦ the bias: due to the fact that the set of functions F does not contain the optimal solution,
◦ the variance: due to the fact that if we had used another set D′n drawn from the same distribution p(Z), we would have obtained a different solution,
◦ the noise: even the optimal solution could be wrong! (for instance if for a given x there is more than one possible y)
• Intrinsic dilemma: when the capacity h(F) grows, the bias goes down, but the variance goes up!

The Bias-Variance Dilemma (Graphical View)

[Figure: the set of functions, the optimal solution, and the solutions obtained with training sets 1 and 2, illustrating bias, variance, and noise; the size of the variance depends on the size of the set of functions.]

Regularization

• We have seen that learning = searching in a set of functions
• This set should not be too small (underfitting)
• This set should not be too large (overfitting)
• One solution: regularization
• Penalize functions f according to prior knowledge
• For instance, penalize functions that have very large parameters:

f*(Dn) = arg min_{f∈F} R̂(f, Dn) + H(f)

with H(f) a function that penalizes f according to your prior
• For example, in some models: small parameters → simpler solutions → less capacity
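As an illustration, penalizing the squared parameter of a one-parameter linear model y ≈ a·x (so H(f) = λ·a²) gives a closed-form penalized minimizer; a minimal sketch, with hypothetical names:

```python
def ridge_1d(data, lam):
    """Penalized empirical risk minimization for y ~ a*x with
    H(f) = lam * a^2; minimizing sum((a*x - y)^2) + lam*a^2 gives
    a = sum(x*y) / (sum(x^2) + lam).  Larger lam shrinks the
    parameter toward 0: an instance of 'small parameters ->
    less capacity'."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)
```

With λ = 0 this is ordinary least squares through the origin; increasing λ pulls the slope toward zero.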

Early Stopping

• Another method for regularization: early stopping.
• Works when training is an iterative process.
• Instead of selecting the function that minimizes the empirical risk on Dn, we can:
◦ divide the training set Dn into two parts,
◦ a train set D^tr = {z_1, z_2, ..., z_tr},
◦ a validation set D^va = {z_{tr+1}, z_{tr+2}, ..., z_{tr+va}},
◦ with tr + va = n
◦ let f^t(D^tr) be the current function found at iteration t
◦ let R̂(f^t(D^tr), D^va) = (1/va) Σ_{z_i ∈ D^va} L(z_i, f^t(D^tr))
◦ stop training at iteration t* such that

t* = arg min_t R̂(f^t(D^tr), D^va)

and return the function f(Dn) = f^{t*}(D^tr)
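The procedure can be sketched as a generic loop; a minimal sketch, where `train_step`, `val_risk`, and the `patience` cutoff are hypothetical additions (the slide itself simply takes the arg min over all iterations):

```python
def early_stopping(train_step, val_risk, max_iters, patience=3):
    """Run the iterative trainer, track the validation risk after
    each iteration, and return the iteration t* that minimized it.
    'patience' is a practical extra (not on the slide): stop after
    that many iterations without improvement."""
    best_t, best_risk, since_best = 0, float("inf"), 0
    for t in range(max_iters):
        train_step(t)          # advance training by one iteration
        r = val_risk()         # R_hat(f^t(D^tr), D^va)
        if r < best_risk:
            best_t, best_risk, since_best = t, r, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_t, best_risk
```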

Methodology

• First: identify the goal! It could be
1. to give the best model you can obtain given a training set,
2. to give the expected performance of a model obtained by empirical risk minimization given a training set,
3. to give the best model and its expected performance given a training set.
• If the goal is (1), you need to do model selection
• If the goal is (2), you need to estimate the risk
• If the goal is (3), you need to do both!
• There are various methods that can be used for either risk estimation or model selection:
◦ simple validation
◦ cross validation (k-fold, leave-one-out)
◦ sequential validation (for time series)

Model Selection - Validation

• Select a family of functions with hyper-parameter θ
• Divide your training set Dn into two parts
◦ D^tr = {z_1, z_2, ..., z_tr}
◦ D^te = {z_{tr+1}, z_{tr+2}, ..., z_{tr+te}}
◦ tr + te = n
• For each value θ_m of the hyper-parameter θ
◦ select f*_{θm}(D^tr) = arg min_{f∈F_{θm}} R̂(f, D^tr)
◦ let R̂(f*_{θm}, D^te) = (1/te) Σ_{z_i ∈ D^te} L(z_i, f*_{θm}(D^tr))
• select θ*_m = arg min_{θm} R̂(f*_{θm}, D^te)
• return f*(Dn) = arg min_{f∈F_{θ*m}} R̂(f, Dn)
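This procedure can be sketched end to end; a minimal sketch, using a hypothetical one-parameter ridge model (y ≈ a·x with penalty λ·a²) as the family F_θ:

```python
def fit_ridge(data, lam):
    # Hypothetical family F_theta: y ~ a*x with penalty lam*a^2
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(a, data):
    # Empirical risk of the slope a under the squared loss
    return sum((a * x - y) ** 2 for x, y in data) / len(data)

def select_by_validation(train, valid, lambdas):
    """Fit each hyper-parameter value on D^tr, score it on the
    held-out part, keep the best value, then (as on the slide)
    refit the winner on the whole set."""
    best = min(lambdas, key=lambda lam: mse(fit_ridge(train, lam), valid))
    return best, fit_ridge(train + valid, best)
```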

Model Selection - Cross-validation

• Select a family of functions with hyper-parameter θ
• Divide your training set Dn into k distinct and equal parts D^1, ..., D^k
• For each value θ_m of the hyper-parameter θ
◦ For each part D^j (and its complement D̄^j):
◦ select f*_{θm}(D̄^j) = arg min_{f∈F_{θm}} R̂(f, D̄^j)
◦ let R̂(f*_{θm}(D̄^j), D^j) = (1/|D^j|) Σ_{z_i ∈ D^j} L(z_i, f*_{θm}(D̄^j))
◦ estimate R̂_{θm}(f) = (1/k) Σ_j R̂(f*_{θm}(D̄^j), D^j)
• select θ*_m = arg min_{θm} R̂_{θm}(f)
• return f*(Dn) = arg min_{f∈F_{θ*m}} R̂(f, Dn)
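The cross-validated risk of a single hyper-parameter value can be sketched as follows; `fit` and `risk` are caller-supplied hypothetical helpers:

```python
def kfold_cv_risk(data, k, fit, risk, theta):
    """k-fold cross-validated risk for one hyper-parameter value:
    fit on the k-1 other folds (D_bar_j), test on fold D_j, and
    average the k held-out risks."""
    n = len(data)
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    total = 0.0
    for j in range(k):
        held_out = folds[j]
        rest = [z for i in range(k) if i != j for z in folds[i]]
        total += risk(fit(rest, theta), held_out)
    return total / k
```

Model selection then just takes the θ value with the smallest cross-validated risk, followed by a refit on all of Dn.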

Estimation of the Risk - Validation

• Divide your training set Dn into two parts
◦ D^tr = {z_1, z_2, ..., z_tr}
◦ D^te = {z_{tr+1}, z_{tr+2}, ..., z_{tr+te}}
◦ tr + te = n
• select f*(D^tr) = arg min_{f∈F} R̂(f, D^tr)
(this optimization process could include model selection)
• estimate R(f) ≈ R̂(f*(D^tr), D^te) = (1/te) Σ_{z_i ∈ D^te} L(z_i, f*(D^tr))

Estimation of the Risk - Cross-validation

• Divide your training set Dn into k distinct and equal parts D^1, ..., D^k
• For each part D^j
◦ let D̄^j be the set of examples that are in Dn but not in D^j
◦ select f*(D̄^j) = arg min_{f∈F} R̂(f, D̄^j)
(this process could include model selection)
◦ let R̂(f*(D̄^j), D^j) = (1/|D^j|) Σ_{z_i ∈ D^j} L(z_i, f*(D̄^j))
• estimate R(f) ≈ (1/k) Σ_j R̂(f*(D̄^j), D^j)
• When k = n: leave-one-out cross-validation

Estimation of the Risk - Sequential Validation

• When the data is sequential in nature (time series)
• Divide your training set Dn into k distinct and equal sequential blocks D^1, ..., D^k
• For j = 1 → k − 1
◦ select f*(D^{1→j}) = arg min_{f∈F} R̂(f, ∪_{i=1}^{j} D^i)
(this process could include model selection)
◦ let R̂(f*(D^{1→j}), D^{j+1}) = (1/|D^{j+1}|) Σ_{z_i ∈ D^{j+1}} L(z_i, f*(D^{1→j}))
• estimate R(f) ≈ (1/(k−1)) Σ_{j=1}^{k−1} R̂(f*(D^{1→j}), D^{j+1})
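A minimal sketch of this scheme, with hypothetical `fit` and `risk` helpers; note that, unlike k-fold cross-validation, each model is trained only on blocks that precede the test block:

```python
def sequential_validation_risk(blocks, fit, risk):
    """Sequential validation for time series: for j = 1..k-1,
    train on blocks 1..j, test on block j+1, and average the
    k-1 held-out risks."""
    k = len(blocks)
    total = 0.0
    for j in range(1, k):
        train = [z for b in blocks[:j] for z in b]   # D^1 ... D^j
        total += risk(fit(train), blocks[j])         # test on D^{j+1}
    return total / (k - 1)
```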

Estimation of the Risk - Bootstrap

• Is our estimate of the risk really accurate?
• Let us use the bootstrap to estimate the accuracy of a given statistic:
◦ create N bootstraps of Dn
◦ for each bootstrap B_i, get an estimate of the risk R_i (using cross-validation for instance)
◦ you can now compute estimates of the mean and the standard deviation of your estimates of the risk:

R̄ = (1/N) Σ_{i=1}^{N} R_i

σ = √[ (1/(N−1)) Σ_{i=1}^{N} (R_i − R̄)² ]

Bootstrap

• Given a data set Dn with n examples drawn from p(Z)

• A bootstrap Bi of Dn also contains n examples: th • For j = 1 → n, the j example of Bi is drawn independently

with replacement from Dn • Hence,

◦ some examples from Dn are in multiple copies in Bi

◦ and some examples from Dn are not in Bi • Hypothesis: the examples were iid drawn from p(Z)

• Hence, the datasets Bi are as plausible as Dn, but drawn from

Dn instead of p(Z).
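Bootstrap resampling, and the mean and standard-deviation formulas from the previous slide, can be sketched as follows, with hypothetical names:

```python
import random

def bootstrap_sample(data, rng):
    # A bootstrap B_i of D_n: n examples drawn independently
    # with replacement from D_n
    return [rng.choice(data) for _ in data]

def bootstrap_risk_stats(data, estimate_risk, n_bootstraps, seed=0):
    """Mean and (unbiased) standard deviation of a risk estimate
    over N bootstrap replicates, following the formulas on the
    slide; estimate_risk maps a dataset to a risk value."""
    rng = random.Random(seed)
    risks = [estimate_risk(bootstrap_sample(data, rng))
             for _ in range(n_bootstraps)]
    mean = sum(risks) / len(risks)
    var = sum((r - mean) ** 2 for r in risks) / (len(risks) - 1)
    return mean, var ** 0.5
```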

Estimation of the Risk and Model Selection

• When you want both the best model and its expected risk.
• You then need to merge the methods already presented. For instance:
◦ train-validation-test: 3 separate data sets are necessary
◦ cross-validation + test: cross-validate on the train set, then test on a separate set
◦ double cross-validation: for each subset, do a second cross-validation with the k − 1 other subsets
• Other important methodological aspects:
◦ compare your results with other methods!!!!
◦ use statistical tests to verify significance
◦ verify your model on other datasets

Train - Validation - Test

• Select a family of functions with hyper-parameter θ
• Divide your training set Dn into three parts D^tr, D^va, and D^te
• For each value θ_m of the hyper-parameter θ
◦ select f*_{θm}(D^tr) = arg min_{f∈F_{θm}} R̂(f, D^tr)
◦ let R̂(f*_{θm}, D^va) = (1/va) Σ_{z_i ∈ D^va} L(z_i, f*_{θm}(D^tr))
• select θ*_m = arg min_{θm} R̂(f*_{θm}, D^va)
• select f*(D^tr ∪ D^va) = arg min_{f∈F_{θ*m}} R̂(f, D^tr ∪ D^va)
• estimate R(f) ≈ (1/te) Σ_{z_i ∈ D^te} L(z_i, f*(D^tr ∪ D^va))

Cross-validation + Test

• Select a family of functions with hyper-parameter θ
• Divide your dataset Dn into two parts: a training set D^tr and a test set D^te
• For each value θ_m of the hyper-parameter θ
◦ estimate R̂_{θm}(D^tr) using cross-validation
• select θ*_m = arg min_{θm} R̂_{θm}(D^tr)
• retrain f*(D^tr) = arg min_{f∈F_{θ*m}} R̂(f, D^tr)
• estimate R(f) ≈ (1/te) Σ_{z_i ∈ D^te} L(z_i, f*(D^tr))

Double Cross-validation

• Select a family of functions with hyper-parameter θ
• Divide your training set Dn into k distinct and equal parts D^1, ..., D^k
• For each part D^j
◦ select the best model f*(D̄^j) by cross-validation on D̄^j
◦ let R̂(f*(D̄^j), D^j) = (1/|D^j|) Σ_{z_i ∈ D^j} L(z_i, f*(D̄^j))
• estimate R(f) ≈ (1/k) Σ_j R̂(f*(D̄^j), D^j)
• Note: this process only gives you an estimate of the risk, but not a model. If you need the model as well, you have to perform a separate model selection process!

Double Cross-validation

[Figure: the whole dataset is cut into 3 parts; the first 2 parts are cut into 3 parts, then a 3-fold cross-validation is performed to select the best hyper-parameter; the best hyper-parameter is used to retrain on the 2 original parts and test on the other one; the same is done for each part to estimate the risk.]