UE: Statistical Learning Theory

Massih-Reza Amini

Université Joseph Fourier, Laboratoire d'Informatique de Grenoble
[email protected]

What is theory?

- A theory is a set of statements or principles that aims to explain a group of facts or phenomena, especially one that has been repeatedly tested or is widely accepted, and that can be used to make predictions about natural phenomena.

- The aim is therefore to model a phenomenon (provide a schematic description of it that accounts for its properties),

- in order to understand it (use the model to further study the characteristics of the phenomenon),

- and to predict it (derive consequences that can be tested).

Induction vs. deduction

- Deduction is the process of reasoning in which a conclusion follows necessarily from the stated premises; it is inference from the general to the specific.

This is how mathematicians prove theorems from axioms.

- Induction is, on the other hand, the process of deriving general principles from particular facts or instances.

What is learning?

- Learning is to gain knowledge, comprehension or mastery of some skill through experience or study.

- Learning becomes interesting when what is gained is more than what was explicitly given.

- We will consider these terms as synonyms:
  1. learning,
  2. generalization,
  3. induction,
  4. modeling.

Formally

We consider an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y}$.

Assumption: example pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are independently and identically distributed (i.i.d.) according to an unknown but fixed probability distribution $\mathcal{D}$.

Samples: we observe a sequence of $m$ example pairs $(x_i, y_i)$ generated i.i.d. from $\mathcal{D}$.

Aim: construct a prediction function $f : \mathcal{X} \to \mathcal{Y}$ that predicts an output $y$ for a new input $x$ with minimum probability of error.

Learning theory [Vapnik88]: bound the probability of error $R(f)$:

$$R(f) \le \text{empirical error of } f + \text{complexity of the class of functions} + \text{residual term}$$

Example

1. In binary classification, we generally consider $\mathcal{Y} = \{-1, +1\}$.
2. In document categorization, a document is represented by a vector in the vector space of terms (bag-of-words representation).

[Figure: a document $d$, its vector representation $(w_{1,d}, \dots, w_{V,d})$, and the vocabulary $(t_1, \dots, t_V)$]
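As a rough illustration of the bag-of-words representation, here is a minimal sketch. It assumes the weights $w_{i,d}$ are raw term counts (the slides do not say which weighting is used), and the vocabulary and document are hypothetical toy values.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Return the term-count vector (w_{1,d}, ..., w_{V,d}) of a document."""
    # Assumption: weights are raw counts after lowercasing and whitespace splitting.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical toy vocabulary (t_1, ..., t_V) and document d.
vocabulary = ["learning", "theory", "risk", "perceptron"]
document = "Learning theory bounds the risk of a learning machine"
vector = bag_of_words(document, vocabulary)
print(vector)  # [2, 1, 1, 0]
```

Other weightings could replace the raw counts without changing the structure of the vector.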

Example: document classification

[Figure: document classification pipeline. Training: the documents of a labeled collection are vectorized and passed, together with their labels, to a learning algorithm, which outputs a classifier. Test: a document d' from the test set is vectorized and the classifier outputs the label of d'.]

Discriminant models for classification

- Discriminant models directly find a classification function $f : \mathcal{X} \to \mathcal{Y}$ without making any assumption about how the examples are generated.
- The classifier is, however, assumed to belong to a given class of functions $\mathcal{F}$, and its analytical form is found by minimizing an objective function (also called risk) built from an error function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$.

The error function considered in classification is usually the misclassification error:

$$\forall (x, y),\quad L(f(x), y) = [[f(x) \ne y]]$$

where $[[\pi]]$ is equal to 1 if the predicate $\pi$ is true and 0 otherwise.

Discriminant models for classification (2)

- Recall: the learned classifier should be able to make good predictions on new examples (induction), that is, have a small generalization error, which under the i.i.d. assumption is written

$$R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, L(f(x), y) = \int_{\mathcal{X} \times \mathcal{Y}} L(f(x), y)\, d\mathcal{D}(x, y)$$

- Empirical risk minimization (ERM) principle: find $f$ by minimizing the unbiased estimator of $R$ on a given training set $S = (x_i, y_i)_{i=1}^m$:

$$\hat{R}_m(f, S) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$
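The empirical risk with the misclassification error can be sketched in a few lines. The classifier `sign_classifier` and the sample `S` below are hypothetical and only serve to make the formula concrete.

```python
def zero_one_loss(prediction, label):
    """Misclassification error [[f(x) != y]]."""
    return 1 if prediction != label else 0

def empirical_risk(f, sample, loss=zero_one_loss):
    """Average loss of f over the training set S = [(x_1, y_1), ..., (x_m, y_m)]."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)

# Hypothetical one-dimensional classifier: predict the sign of x.
def sign_classifier(x):
    return 1 if x >= 0 else -1

# Hypothetical training set; only the last example is misclassified.
S = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, -1)]
risk = empirical_risk(sign_classifier, S)
print(risk)  # 0.25
```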

Discriminant models for classification (3)

Convex loss functions considered:

- Hinge loss: $L_h(f(x), y) = \max(0,\, 1 - y f(x))$
- Logistic loss: $L_\ell(f(x), y) = \ln(1 + \exp(-y f(x)))$
- Exponential loss: $L_e(f(x), y) = e^{-y f(x)}$
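A small sketch of these three surrogate losses as functions of the real-valued score $f(x)$ and the label $y \in \{-1, +1\}$. The hinge loss is written here in its standard form $\max(0, 1 - y f(x))$; the function names are illustrative only.

```python
import math

def hinge_loss(score, y):
    """Standard hinge loss max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(score, y):
    """Logistic loss ln(1 + exp(-y * f(x)))."""
    return math.log(1.0 + math.exp(-y * score))

def exponential_loss(score, y):
    """Exponential loss exp(-y * f(x))."""
    return math.exp(-y * score)

# Evaluate the losses for a positive example at a few scores.
for score in (-1.0, 0.0, 2.0):
    print(score, hinge_loss(score, 1), logistic_loss(score, 1), exponential_loss(score, 1))
```

All three decrease as the margin $y f(x)$ grows, which is what makes them usable in place of the non-convex misclassification error.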

Gradient descent

- Gradient descent is a simple and widely used first-order optimization algorithm for differentiable loss functions.

Algorithm 1 Gradient descent

Initialize the weights $w^{(0)}$
$t \leftarrow 0$
Learning rate $\lambda > 0$
Precision $\epsilon > 0$
repeat
    $w^{(t+1)} \leftarrow w^{(t)} - \lambda \nabla_{w^{(t)}} L(w^{(t)})$
    $t \leftarrow t + 1$
until $|L(w^{(t)}) - L(w^{(t-1)})| < \epsilon$
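Algorithm 1 could be sketched as follows. The quadratic objective, the default values, and the `max_iter` safety cap are assumptions added for the example and are not part of the algorithm as stated.

```python
def gradient_descent(loss, grad, w0, learning_rate=0.1, precision=1e-8, max_iter=10_000):
    """Iterate w <- w - lambda * grad(w) until the loss changes by less than `precision`."""
    w = list(w0)
    previous = loss(w)
    for _ in range(max_iter):  # max_iter is an added safeguard against non-termination
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
        current = loss(w)
        if abs(current - previous) < precision:
            break
        previous = current
    return w

# Hypothetical convex objective: L(w) = (w_0 - 3)^2 + (w_1 + 1)^2, minimized at (3, -1).
def quadratic(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def quadratic_grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w_star = gradient_descent(quadratic, quadratic_grad, [0.0, 0.0])
print(w_star)
```

With a learning rate that is too large the iterates can diverge, which is why the cap on the number of iterations is useful in practice.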

Some standard learning algorithms

- Perceptron

- Logistic regression

- AdaBoost

- Support Vector Machines (SVM)

But also:

- K-nearest neighbors (k-NN)

Perceptron

- One of the first learning algorithms, proposed by Rosenblatt.
- The prediction function has the form $f_w : \mathcal{X} \to \mathbb{R},\ x \mapsto \langle w, x \rangle$.

Algorithm 2 The perceptron algorithm

Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
Initialize the weights $w^{(0)} \leftarrow 0$
$t \leftarrow 0$
Learning rate $\lambda > 0$
Number of tolerated misclassified examples $K$
repeat
    Choose randomly an example $(x, y) \in S$
    if $y \langle w^{(t)}, x \rangle \le 0$ then
        $w^{(t+1)} \leftarrow w^{(t)} + \lambda \times y \times x$
    end if
    $t \leftarrow t + 1$
until the number of misclassified examples is at most $K$

What is the convex loss function minimized by the perceptron?
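A minimal sketch of Algorithm 2, under a few assumptions that are not in the slides: an example counts as misclassified when $y \langle w, x \rangle \le 0$ (so that updates can start from $w^{(0)} = 0$), the loop stops when at most `tolerated` examples are misclassified, and a seeded generator plus a `max_iter` cap make the run reproducible and finite. The toy data set is hypothetical.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron(S, learning_rate=1.0, tolerated=0, max_iter=10_000, seed=0):
    """Stochastic perceptron: update w on randomly drawn misclassified examples."""
    rng = random.Random(seed)
    w = [0.0] * len(S[0][0])
    for _ in range(max_iter):
        # Stopping test: count the examples currently misclassified.
        errors = sum(1 for x, y in S if y * dot(w, x) <= 0)
        if errors <= tolerated:
            break
        x, y = rng.choice(S)
        if y * dot(w, x) <= 0:
            w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical linearly separable set; the last coordinate is a constant 1 (bias term).
S = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
     ([-1.0, -2.0, 1.0], -1), ([-2.0, -0.5, 1.0], -1)]
w = perceptron(S)
print(w, [1 if dot(w, x) > 0 else -1 for x, _ in S])
```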

Perceptron (convergence)

[Novikoff62] showed that

- if there exists a weight vector $w^*$ such that $\forall i \in \{1, \dots, m\},\ y_i \times \langle w^*, x_i \rangle > 0$,

- then, denoting $\rho = \min_{i \in \{1, \dots, m\}} y_i \left\langle \frac{w^*}{\|w^*\|}, x_i \right\rangle$,

- and $R = \max_{i \in \{1, \dots, m\}} \|x_i\|$,

- and $w^{(0)} = 0$, $\lambda = 1$,

- we have a bound on the maximum number of updates $l$:

$$l \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor$$
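The bound can be checked numerically on a toy problem. The sketch below assumes a cyclic pass over the examples instead of random draws (the bound counts updates, so the order does not matter), and the data set and the separating vector `w_ref` are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def count_perceptron_updates(S, max_epochs=1000):
    """Cycle through S with w(0) = 0 and lambda = 1; return (w, number of updates)."""
    w = [0.0] * len(S[0][0])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, y in S:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:  # a full pass without update: every example is well classified
            break
    return w, updates

S = [([1.0, 0.2], 1), ([0.8, 1.0], 1), ([-1.0, -0.3], -1), ([-0.2, -1.0], -1)]
w_ref = [1.0, 1.0]  # hypothetical separating vector playing the role of w*
norm = math.sqrt(dot(w_ref, w_ref))
rho = min(y * dot(w_ref, x) / norm for x, y in S)
R = max(math.sqrt(dot(x, x)) for x, _ in S)
bound = math.floor((R / rho) ** 2)
w, updates = count_perceptron_updates(S)
print(updates, bound)
```

Any separating vector gives a valid bound; a vector with a larger margin $\rho$ gives a tighter one.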

Logistic regression

- Logistic regression models the posterior probability of the classes with the logistic function:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-\langle w, x \rangle}} = g_w(x)$$

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{\langle w, x \rangle}} = 1 - g_w(x)$$

$$P(y \mid x) = (g_w(x))^{y} (1 - g_w(x))^{1 - y}$$

[Figure: plot of the logistic function $1/(1 + \exp(-x))$ for $x$ between -6 and 6, with values between 0 and 1]

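Logistic regression

- For $g : \mathbb{R} \to\ ]0, 1[,\ x \mapsto \frac{1}{1 + e^{-x}}$, we have

$$g'(x) = \frac{\partial g}{\partial x} = g(x)(1 - g(x))$$

- The model parameters $w$ are found by maximizing the complete log-likelihood, which, assuming that the $m$ training examples are generated independently, is written

$$\mathcal{L} = \log \prod_{i=1}^{m} P(x_i, y_i) = \log \prod_{i=1}^{m} P(y_i \mid x_i) = \sum_{i=1}^{m} \log \left( (g_w(x_i))^{y_i} (1 - g_w(x_i))^{1 - y_i} \right)$$

where the second equality holds up to the additive term $\sum_{i} \log P(x_i)$, which does not depend on $w$.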

Logistic regression

- If we consider the function $f_w : x \mapsto \langle w, x \rangle$, maximizing the log-likelihood $\mathcal{L}$ is equivalent to minimizing the empirical logistic loss in the case where $\forall i,\ y_i \in \{-1, +1\}$:

$$\hat{R}_m(f_w, S) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-y_i f_w(x_i)}\right)$$

- The maximization of the log-likelihood can be carried out with the gradient ascent algorithm (here with labels $y_i \in \{0, 1\}$):

$$w^{new} = w^{old} + \lambda \nabla_{w^{old}} \mathcal{L} = w^{old} + \lambda \left( \sum_{i=1}^{m} (y_i - g_{w^{old}}(x_i))\, x_i \right)$$
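The gradient ascent update can be sketched as below, assuming labels in $\{0, 1\}$ as in the definition of $P(y \mid x)$. The fixed number of iterations, the learning rate, and the toy data (a single feature plus a constant 1 for the bias) are choices made for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def log_likelihood(w, S):
    """Sum over examples of log P(y_i | x_i) for labels y_i in {0, 1}."""
    total = 0.0
    for x, y in S:
        p = sigmoid(dot(w, x))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logistic(S, learning_rate=0.1, n_iter=200):
    """Gradient ascent: w <- w + lambda * sum_i (y_i - g_w(x_i)) * x_i."""
    w = [0.0] * len(S[0][0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, y in S:
            residual = y - sigmoid(dot(w, x))
            grad = [g + residual * xi for g, xi in zip(grad, x)]
        w = [wi + learning_rate * g for wi, g in zip(w, grad)]
    return w

# Hypothetical, non-separable toy data: (feature, constant 1) with labels in {0, 1}.
S = [([-2.0, 1.0], 0), ([-1.0, 1.0], 0), ([0.0, 1.0], 1),
     ([1.0, 1.0], 0), ([2.0, 1.0], 1), ([3.0, 1.0], 1)]
w = fit_logistic(S)
print(w, log_likelihood(w, S))
```

The data are deliberately not linearly separable, so that the maximum of the log-likelihood is reached at a finite $w$.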

Boosting

- The boosting algorithm proposed by [Schapire99] generates a set of weak learners and combines them with a weighted majority vote in order to produce an accurate final classifier.

- Each weak classifier is trained sequentially so as to take into account the classification errors of the previous classifier. This is done by assigning weights to the training examples and, at each iteration, increasing the weights of the examples that the current classifier misclassifies.

In this way, each new classifier focuses on the hard examples that were misclassified by the previous classifier.

Boosting

Algorithm 3 The boosting algorithm

Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
Initialize the distribution over examples: $\forall i \in \{1, \dots, m\},\ D_1(i) = \frac{1}{m}$
$T$, the maximum number of iterations (or classifiers to be combined)
for $t = 1, \dots, T$ do
    Train a weak classifier $f_t : \mathcal{X} \to \{-1, +1\}$ using the distribution $D_t$
    Set $\epsilon_t = \sum_{i : f_t(x_i) \ne y_i} D_t(i)$
    Choose $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
    Update the distribution of weights:
        $\forall i \in \{1, \dots, m\},\ D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i f_t(x_i)}}{Z_t}$, where $Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i f_t(x_i)}$
end for
The final classifier: $\forall x,\ F(x) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)$
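Algorithm 3 can be illustrated with decision stumps as weak classifiers. The choice of stumps, the exhaustive stump search, the clipping of $\epsilon_t$ away from 0 and 1, and the toy data are all assumptions made for this sketch; the slides do not specify the weak learner.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` if x[feature] > threshold, and -polarity otherwise."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def train_stump(S, D):
    """Weak learner: the decision stump with the smallest weighted error under D."""
    best, best_err = None, float("inf")
    for feature in range(len(S[0][0])):
        for threshold in sorted({x[feature] for x, _ in S}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(d for d, (x, y) in zip(D, S) if stump_predict(stump, x) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost(S, T):
    m = len(S)
    D = [1.0 / m] * m
    ensemble = []
    for _ in range(T):
        stump, eps = train_stump(S, D)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)  # added safeguard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, stump))
        # Reweight: misclassified examples gain weight, then normalize by Z_t.
        D = [d * math.exp(-alpha * y * stump_predict(stump, x)) for d, (x, y) in zip(D, S)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    H = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if H >= 0 else -1

# Hypothetical 1-D data that no single stump classifies perfectly.
S = [([1.0], 1), ([2.0], 1), ([3.0], -1), ([4.0], -1), ([5.0], 1), ([6.0], 1)]
ensemble = adaboost(S, T=3)
print([predict(ensemble, x) for x, _ in S])
```

On this toy set a single stump makes at least two errors, while the combination of three stumps fits the training labels.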

Boosting

- If we denote $\forall x,\ H(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ and $F(x) = \text{sign}(H(x))$, then

$$\frac{1}{m} \sum_{i=1}^{m} [[y_i \ne F(x_i)]] \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)}$$

- On the other hand,

$$\frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = Z_1 \sum_{i=1}^{m} D_2(i) \prod_{t > 1} e^{-y_i \alpha_t f_t(x_i)}$$

So

$$\frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = \prod_{t=1}^{T} Z_t$$

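Boosting

- The minimum of the normalization term with respect to the combination weight $\alpha_t$,

$$\forall t,\ Z_t = \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t}$$

is reached for $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.

- Setting $\gamma_t = \frac{1}{2} - \epsilon_t$, and when $\epsilon_t < \frac{1}{2}$,

$$\forall t,\ Z_t = \sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}$$

- The empirical misclassification error decreases exponentially to 0 (provided the $\gamma_t$ stay bounded away from 0):

$$\frac{1}{m} \sum_{i=1}^{m} [[y_i \ne F(x_i)]] \le \prod_{t=1}^{T} Z_t \le e^{-2 \sum_{t=1}^{T} \gamma_t^2}$$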
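For completeness, the optimal $\alpha_t$ and the resulting value of $Z_t$ can be recovered by setting the derivative of $Z_t$ to zero. The derivation below is a standard calculus step written out here as a check; it is not taken from the slides.

```latex
\begin{align*}
\frac{\partial Z_t}{\partial \alpha_t}
  &= \epsilon_t e^{\alpha_t} - (1 - \epsilon_t) e^{-\alpha_t} = 0
  \;\Longrightarrow\; e^{2\alpha_t} = \frac{1 - \epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}, \\
Z_t
  &= \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}
   + (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}}
   = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
   = \sqrt{1 - 4\gamma_t^2}
   \quad \text{with } \gamma_t = \tfrac{1}{2} - \epsilon_t, \\
\sqrt{1 - 4\gamma_t^2}
  &\le \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2}
   \quad \text{since } 1 - x \le e^{-x} \text{ for all } x .
\end{align*}
```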

Remarks

- The induction principle that consists in learning a predictor by minimizing the empirical error, or an upper bound of the misclassification error, over a training set is called the Empirical Risk Minimization (ERM) principle.

- But if the goal is to find the predictor $f$ that best minimizes the empirical error, why not choose $f$ from a complex class of functions with which we are sure to reach zero error on a given training set?

- Or equivalently, if different learning algorithms each produce a predictor, which algorithm should we choose?

Overfitting

- Recall that learning becomes interesting when what is gained is more than what was explicitly given, but:

[Figure: two illustrations of overfitting, one for classification and one for regression, from [Bishop06]]

- Complex models, with too many parameters relative to the number of observations, tend to overfit the training observations.

- A model that overfits the training data will generally have poor predictive performance.

Overfitting

- So what to do?
- Statistical learning theory says: take into account the capacity of the class of functions that the learning machine can implement. But how?

Study the consistency of the ERM principle

- For a given $S = \{(x_i, y_i);\ i \in \{1, \dots, m\}\}$ and a fixed classifier $f : \mathcal{X} \to \{-1, +1\}$, let

$$\forall i,\ \xi_i = \frac{1}{2} |f(x_i) - y_i| \in \{0, 1\}$$

Here, the $\xi_i$ are independent Bernoulli trials.

- In this case, the empirical risk is

$$\hat{R}_m(f, S) = \frac{1}{m} \sum_{i=1}^{m} \xi_i$$

- and $R(f) = \mathbb{E}(\xi)$.

Chernoff's Bound

- If $(\xi_i)_{i=1}^{m}$ are independent, identically distributed random variables taking values in $[0, 1]$, then

$$\forall \epsilon > 0,\ P\left( \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - \mathbb{E}(\xi) \right| \ge \epsilon \right) \le 2 \exp(-2m\epsilon^2)$$

- However, if we look at the training set before choosing $f$, the $\xi_i$, which depend on $f$, are no longer independent.
- Hence we look for a uniform bound [Vapnik88] that takes into account the whole set of functions that can be implemented by the learning machine, that is, a bound on

$$\forall \epsilon > 0,\ P\left( \sup_{f \in \mathcal{F}} \left( R(f) - \hat{R}_m(f, S) \right) \ge \epsilon \right)$$
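To get a feel for the numbers, the right-hand side $2\exp(-2m\epsilon^2)$ can be evaluated directly, or inverted to find how many examples are needed for a given precision $\epsilon$ and confidence $1 - \delta$. The function names and the numerical values are illustrative only.

```python
import math

def chernoff_bound(m, epsilon):
    """Right-hand side 2 * exp(-2 * m * epsilon^2) of the bound."""
    return 2.0 * math.exp(-2.0 * m * epsilon ** 2)

def sample_size(epsilon, delta):
    """Smallest m such that 2 * exp(-2 * m * epsilon^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

m = sample_size(epsilon=0.05, delta=0.05)
print(m, chernoff_bound(m, 0.05))
```

This applies to a single function $f$ chosen before seeing the data, which is exactly the limitation discussed above.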

Uniform convergence [Vapnik88]

- If the function class is a singleton $\mathcal{F} = \{f\}$, then Chernoff's bound suffices:

$$P\left( \sup_{f \in \mathcal{F}} \left( R(f) - \hat{R}_m(f, S) \right) > \epsilon \right) = P\left( R(f) - \hat{R}_m(f, S) > \epsilon \right)$$

- If the function class is finite, $\mathcal{F} = \{f_1, \dots, f_N\}$, then use the union bound. First set

$$C^j = \left\{ (x_i, y_i)_{i=1}^{m} \mid R(f_j) - \hat{R}_m(f_j, S) > \epsilon \right\}$$

Then

$$P\left( \sup_{f \in \mathcal{F}} \left( R(f) - \hat{R}_m(f, S) \right) > \epsilon \right) = P(C^1 \cup \dots \cup C^N) \le \sum_{j=1}^{N} P(C^j)$$

Then use the Chernoff bound for each term of the sum.
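Carrying this last step through gives the following bound for a finite class. This is a sketch that reuses the two-sided form $2\exp(-2m\epsilon^2)$ of Chernoff's bound as stated earlier in these slides; a one-sided version would remove the factor 2.

```latex
\begin{align*}
P\Big( \sup_{f \in \mathcal{F}} \big( R(f) - \hat{R}_m(f, S) \big) > \epsilon \Big)
  &\le \sum_{j=1}^{N} P(C^j) \le 2N \exp(-2m\epsilon^2), \\
\text{so, setting } \delta = 2N \exp(-2m\epsilon^2),
  \text{ with probability at least } 1 - \delta:\quad
\forall f \in \mathcal{F},\; R(f)
  &\le \hat{R}_m(f, S) + \sqrt{\frac{\ln N + \ln(2/\delta)}{2m}} .
\end{align*}
```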

Uniform convergence [Vapnik88]

- For an infinite function class, the trick is to see that for a finite sample only finitely many functions are involved.

Lemma 1 [Vapnik88]: $\forall \epsilon > 0$ and $m \ge \frac{2}{\epsilon^2}$, we have

$$P\left( \sup_{f \in \mathcal{F}} \left( R(f) - \hat{R}_m(f, S) \right) > \epsilon \right) \le 2 P\left( \sup_{f \in \mathcal{F}} \left( \hat{R}_m(f, S) - \hat{R}'_m(f, S') \right) > \epsilon/2 \right)$$

Here the first $P$ refers to the distribution of i.i.d. samples of size $m$, while the second one refers to i.i.d. samples of size $2m$ (the sample $S$ together with a second, independent sample $S'$ of size $m$).

- So the number of functions that we need to consider in $\mathcal{F}$ does not exceed the maximum number of different ways the function class can separate $2m$ points into two classes, $\mathcal{N}(\mathcal{F}, 2m)$.

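Uniform convergence [Vapnik88]

- Putting everything together, $\forall \epsilon > 0$ we have

$$
\begin{aligned}
P\left( \sup_{f \in \mathcal{F}} \left( R(f) - \hat{R}_m(f, S) \right) > \epsilon \right)
  &\le 2 P\left( \sup_{f \in \mathcal{F}} \left( \hat{R}_m(f, S) - \hat{R}'_m(f, S') \right) > \epsilon/2 \right) \\
  &\le 2 \sum_{j=1}^{\mathcal{N}(\mathcal{F}, 2m)} P\left( |\hat{R}_m(f_j, S) - \hat{R}'_m(f_j, S')| > \epsilon/2 \right) \\
  &\le 4\, \mathcal{N}(\mathcal{F}, 2m) \exp\left( -\frac{m\epsilon^2}{8} \right)
\end{aligned}
$$

- So with probability at least $1 - \delta$ we have

$$R(f) \le \underbrace{\hat{R}_m(f, S)}_{\text{Error}} + \underbrace{\sqrt{\frac{8}{m} \left( \ln \mathcal{N}(\mathcal{F}, 2m) + \ln \frac{4}{\delta} \right)}}_{\text{Complexity}}$$

- The study of the consistency of the ERM principle leads to another principle, called structural risk minimization.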
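The complexity term of this bound is easy to evaluate once a value for $\ln \mathcal{N}(\mathcal{F}, 2m)$ is available. In the sketch below that value is simply passed in as `log_growth`; the fixed value $\ln 1000$ is a hypothetical stand-in used to show how the term shrinks with $m$, not a property of any particular function class.

```python
import math

def complexity_term(m, log_growth, delta):
    """sqrt((8 / m) * (ln N(F, 2m) + ln(4 / delta)))."""
    return math.sqrt(8.0 / m * (log_growth + math.log(4.0 / delta)))

# Hypothetical case: ln N(F, 2m) held at ln(1000) for every sample size.
for m in (100, 1_000, 10_000, 100_000):
    print(m, round(complexity_term(m, math.log(1000), delta=0.05), 3))
```

For a real function class $\mathcal{N}(\mathcal{F}, 2m)$ grows with $m$, so the decrease is slower than in this simplified illustration.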

Tradeoff complexity vs empirical error

[Figure: curves of the empirical error, of the complexity, and of their sum (empirical error + complexity)]

Tradeoff complexity vs empirical error (2)

Image from : http://www.svms.org/srm/

Regularization

- Find a predictor by minimizing the empirical risk with an added penalty for the size of the model.
- A simple approach consists in choosing a large class of functions $\mathcal{F}$, defining on $\mathcal{F}$ a regularizer, typically a norm $\|f\|$, and then minimizing the regularized empirical risk

$$\hat{f} = \text{argmin}_{f \in \mathcal{F}}\ \hat{R}_m(f, S) + \underbrace{\gamma}_{\text{hyperparameter}} \times \|f\|^2$$

- The hyperparameter, or regularization parameter, makes it possible to choose the right trade-off between fit and complexity.

K-fold cross validation

- Create a K-fold partition of the dataset.
- For each of the K experiments, use K − 1 folds for training and the remaining fold for testing. This procedure is illustrated in the following figure for K = 4.

[Figure: 4-fold cross validation; in each of the four experiments a different fold is held out as the test set and the three other folds are used for training]

- The value retained for the hyperparameter is the candidate value $\gamma_k$ for which the testing performance, averaged over the K folds, is the highest.
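The procedure can be sketched as follows. Several choices here are assumptions for the sake of a short, self-contained example: the score of a hyperparameter value is the test error averaged over the folds (a common convention), the model is a one-dimensional ridge regression fitted in closed form, and the data and candidate values of $\gamma$ are hypothetical.

```python
def k_fold_indices(n, k):
    """Partition range(n) into k folds (fold j holds indices j, j + k, j + 2k, ...)."""
    return [list(range(j, n, k)) for j in range(k)]

def cross_validate(data, k, fit, error):
    """Average test error of `fit` over the k train/test splits."""
    total = 0.0
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in test_idx]
        total += error(fit(train), test)
    return total / k

# Hypothetical model: 1-D ridge regression y ~ w * x with penalty gamma * w^2.
def make_ridge(gamma):
    def fit(train):
        return sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + gamma)
    return fit

def squared_error(w, test):
    return sum((y - w * x) ** 2 for x, y in test) / len(test)

# Hypothetical data: y = 2x plus a small alternating perturbation.
data = [(float(x), 2.0 * x + (0.5 if x % 2 else -0.5)) for x in range(1, 13)]
scores = {gamma: cross_validate(data, 4, make_ridge(gamma), squared_error)
          for gamma in (0.0, 1.0, 100.0)}
best_gamma = min(scores, key=scores.get)
print(scores, best_gamma)
```

Since the score here is an error, the best value is the one with the lowest average; with an accuracy-type score one would take the highest instead.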

Other frameworks: unsupervised learning

- There is no association between the inputs and the outputs.
- Input examples are assumed to be multidimensional random variables generated by a mixture of $K$ probability densities with proportions $(\pi_k)_{k=1}^{K}$: $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$ $(k = 1, \dots, K)$.
- Each probability density is a parametric function modeling the conditional probability distribution

$$\forall k \in \mathcal{Y},\ P(x \mid y = k) = f_k(x, \theta_k)$$

- The mixture density models the generation of an example $x$ by the $K$ density functions:

$$P(x, \Theta) = \sum_{k=1}^{K} \pi_k f_k(x, \theta_k)$$

Other frameworks: unsupervised learning

- where $\Theta$ is the set of mixture proportions and all the parameters defining the parametric density functions:

$$\Theta = \{\theta_k, \pi_k : k \in \{1, \dots, K\}\}$$

- The aim is then to estimate the model parameters $\Theta$ over a set of observed data $X = (x_i)_{i=1}^{n}$ by maximizing the log-likelihood

$$L(\Theta) = \ln P(X \mid \Theta)$$

- Once the parameters have been estimated, examples can be classified using the Bayes decision rule:

$$x' \text{ belongs to the class } k \text{ iff } k = \text{argmax}_{h \in \mathcal{Y}} P(y = h \mid x')$$
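The mixture density and the Bayes decision rule can be illustrated as below. The sketch assumes one-dimensional Gaussian components with parameters $\theta_k = (\text{mean}, \text{standard deviation})$ that are already known; the estimation step is not shown, and the numerical values are hypothetical. Classes are indexed from 0.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, proportions, params):
    """P(x, Theta) = sum_k pi_k * f_k(x, theta_k)."""
    return sum(pi * gaussian_pdf(x, mean, std)
               for pi, (mean, std) in zip(proportions, params))

def bayes_classify(x, proportions, params):
    """argmax_k P(y = k | x), which is proportional to pi_k * f_k(x, theta_k)."""
    scores = [pi * gaussian_pdf(x, mean, std)
              for pi, (mean, std) in zip(proportions, params)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-component mixture: proportions pi_k and parameters (mean, std).
proportions = [0.3, 0.7]
params = [(-2.0, 1.0), (3.0, 1.5)]
print(mixture_density(0.0, proportions, params))
print(bayes_classify(-1.0, proportions, params), bayes_classify(2.0, proportions, params))
```

The posterior $P(y = k \mid x)$ is obtained by dividing $\pi_k f_k(x, \theta_k)$ by the mixture density, which does not change the argmax.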


Other frameworks: Learning with partially labeled data

- Semi-supervised learning
- Multiview learning* (internship → PhD)
- Active learning*

Some useful links

- Books
  - Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006
  - Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, 2001
  - Tom Mitchell, Machine Learning, 1997
  - Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, Foundations of Machine Learning, 2012
  - John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, 2004
- Software
  - General
    - Weka 3: Software in Java http://www.cs.waikato.ac.nz/ml/weka
    - Lush http://lush.sourceforge.net
  - SVM
    - http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    - http://svmlight.joachims.org/
- Test sets
  - UCI machine learning repository http://archive.ics.uci.edu/ml/
  - DMOZ open directory project http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Datasets/

References

[Bishop06] Christopher M. Bishop. Pattern Recognition and Machine Learning. 2006.

[Novikoff62] A. B. Novikoff. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, Vol. 12, pp. 615–622, 1962.

[Rosenblatt58] F. Rosenblatt. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), pp. 386–408, 1958.

[Schapire99] R. E. Schapire. Theoretical views of Boosting and Applications. Proceedings of the 10th International Conference on Algorithmic Learning Theory, pp. 13–25, 1999.

[Vapnik88] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1998.
