Statistical Learning Theory

UE: Machine Learning
Massih-Reza Amini
Université Joseph Fourier, Laboratoire d'Informatique de Grenoble
[email protected]

What is theory?

- A theory is a set of statements or principles that aims to explain a group of facts or phenomena, especially one that has been repeatedly tested or is widely accepted, and that can be used to make predictions about natural phenomena.
- The aim is therefore to model a phenomenon (provide a schematic description of it that accounts for its properties),
- in order to understand it (use the model to further study the characteristics of the phenomenon),
- and to predict it (derive consequences that can be tested).

Induction vs. deduction

- Deduction is the process of reasoning in which a conclusion follows necessarily from the stated premises; it is inference from the general to the specific. This is how mathematicians prove theorems from axioms.
- Induction is, on the other hand, the process of deriving general principles from particular facts or instances.

What is learning?

- Learning is gaining knowledge, comprehension or mastery of some skill through experience or study.
- Learning becomes interesting when what is gained is more than what was explicitly given.
- We will treat the following terms as synonyms: 1. learning, 2. generalization, 3. induction, 4. modeling.

Formally

We consider an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y}$.

Assumption: Example pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are independently and identically distributed (i.i.d.) according to an unknown but fixed probability distribution $\mathcal{D}$.

Samples: We observe a sequence of $m$ example pairs $(x_i, y_i)$ generated i.i.d. from $\mathcal{D}$.

Aim: Construct a prediction function $f : \mathcal{X} \to \mathcal{Y}$ that predicts an output $y$ for a given new $x$ with a minimum probability of error.
Learning theory [Vapnik88]: Bound the probability of error $R(f)$:

$$R(f) \le \text{empirical error of } f + \text{complexity of the class of functions} + \text{residual term}$$

Example

1. In the case of binary classification, we generally consider $\mathcal{Y} = \{-1, +1\}$.
2. In document categorization, a document is represented by a vector in the vector space of terms (bag-of-words representation).

[Figure: a document $d$, the vocabulary of terms $t_1, \dots, t_V$, and the vector representation of $d$ with term weights $w_{1,d}, \dots, w_{V,d}$.]

Example: document classification

[Figure: document classification pipeline. Labels in the diagram: labeled collection, documents with their labels, vector space model, vectorized documents, learning algorithm, classifier; test set, test document $d'$, vectorized test document, label of $d'$.]

Discriminant models for classification

- Discriminant models directly find a classification function $f : \mathcal{X} \to \mathcal{Y}$ without making any hypothesis on how the examples are generated.
- The classifier is, however, assumed to belong to a given class of functions $\mathcal{F}$, and its analytical form is found by minimizing an objective function (also called risk) built from an error function

$$L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$$

The error function considered in classification is usually the misclassification error:

$$\forall (x, y), \quad L(f(x), y) = [[f(x) \neq y]]$$

where $[[\pi]]$ is equal to 1 if the predicate $\pi$ is true and 0 otherwise.
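As a small illustration of the misclassification error defined above, the sketch below evaluates $[[f(x) \neq y]]$ on a few hand-made points in Python. The classifier `f`, the helper `zero_one_loss`, and the toy data are hypothetical and chosen only for this example.

```python
def zero_one_loss(prediction, label):
    """Misclassification error [[prediction != label]]: 1 if wrong, 0 otherwise."""
    return 1 if prediction != label else 0


def f(x):
    """Hypothetical classifier (an assumption for this example):
    predicts +1 when the first coordinate is non-negative, -1 otherwise."""
    return 1 if x[0] >= 0 else -1


# Toy labeled pairs (x, y) with y in {-1, +1}; made up for illustration.
sample = [((0.5, 1.0), 1), ((-1.2, 0.3), -1), ((2.0, -0.7), -1), ((-0.1, 0.0), 1)]

# One loss value per example, then the fraction of misclassified examples.
losses = [zero_one_loss(f(x), y) for x, y in sample]
error_rate = sum(losses) / len(sample)
```

Averaging the loss over the sample gives the fraction of misclassified points, stored here in `error_rate`.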
Discriminant models for classification (2)

- Recall: the learned classifier should be able to make good predictions on new examples (induction), that is, have a small generalization error, which under the i.i.d. hypothesis is written:

$$R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, L(f(x), y) = \int_{\mathcal{X} \times \mathcal{Y}} L(f(x), y)\, d\mathcal{D}(x, y)$$

- Empirical risk minimization (ERM) principle: find $f$ by minimizing the unbiased estimator of $R$ on a given training set $S = (x_i, y_i)_{i=1}^{m}$:

$$\hat{R}_m(f, S) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$

Discriminant models for classification (3)

Convex loss functions considered:

- Hinge loss: $L_h(f(x), y) = \max(0,\, 1 - y f(x))$
- Logistic loss: $L_\ell(f(x), y) = \ln(1 + \exp(-y f(x)))$
- Exponential loss: $L_e(f(x), y) = e^{-y f(x)}$

Gradient descent

- Gradient descent is a simple and widely used first-order optimization algorithm for differentiable loss functions.

Algorithm 1: Gradient descent

    Initialize the weights $w^{(0)}$
    $t \leftarrow 0$
    Learning rate $\lambda > 0$
    Precision $\epsilon > 0$
    repeat
        $w^{(t+1)} \leftarrow w^{(t)} - \lambda \nabla_{w^{(t)}} L(w^{(t)})$
        $t \leftarrow t + 1$
    until $|L(w^{(t)}) - L(w^{(t-1)})| < \epsilon$

Some standard learning algorithms

- Perceptron
- Logistic regression
- AdaBoost
- Support Vector Machines (SVM)

But also:

- K-nearest neighbors (k-NN)

Perceptron

- One of the first learning algorithms, proposed by Rosenblatt.
- The prediction function has the form:

$$f_w : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \langle w, x \rangle$$

Algorithm 2: The perceptron algorithm

    Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
    Initialize the weights $w^{(0)} \leftarrow 0$
    $t \leftarrow 0$
    Learning rate $\lambda > 0$
    Number of tolerated misclassified examples $K$
    repeat
        Choose an example $(x, y) \in S$ at random
        if $y \langle w^{(t)}, x \rangle \le 0$ then
            $w^{(t+1)} \leftarrow w^{(t)} + \lambda \times y \times x$
        else
            $w^{(t+1)} \leftarrow w^{(t)}$
        end if
        $t \leftarrow t + 1$
    until the number of misclassified examples is less than $K$

- Question: which convex loss function does the perceptron minimize?
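The perceptron update described above can be sketched in plain Python as follows. This is a minimal illustration, not the author's implementation: the function name `perceptron`, the toy data, the fixed random seed, and the `max_iter` safeguard are assumptions added for the example. The test `y * <w, x> <= 0` is used (rather than a strict inequality) so that the zero initial vector can trigger a first update.

```python
import random


def dot(w, x):
    """Inner product <w, x> of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def count_errors(w, sample):
    """Number of examples with y * <w, x> <= 0 (misclassified or on the boundary)."""
    return sum(1 for x, y in sample if y * dot(w, x) <= 0)


def perceptron(sample, lr=1.0, tolerated=0, max_iter=10_000, seed=0):
    """Stochastic perceptron: draw a random example, update w if it is misclassified.

    Stops when at most `tolerated` examples are misclassified, or after
    `max_iter` draws (a safeguard that is an assumption of this sketch).
    """
    rng = random.Random(seed)
    w = [0.0] * len(sample[0][0])          # w^(0) = 0
    updates = 0
    for _ in range(max_iter):
        if count_errors(w, sample) <= tolerated:
            break
        x, y = rng.choice(sample)
        if y * dot(w, x) <= 0:             # misclassified: move w toward y * x
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            updates += 1
    return w, updates


# Toy linearly separable data in R^2, with a constant 1 appended as a bias feature.
toy = [((2.0, 1.0, 1.0), 1), ((1.5, 2.0, 1.0), 1),
       ((-1.0, -1.5, 1.0), -1), ((-2.0, -0.5, 1.0), -1)]

w, updates = perceptron(toy)
```

On separable data such as `toy`, the loop stops once `count_errors(w, toy)` reaches zero; on non-separable data it would stop only through `tolerated` or `max_iter`.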
Perceptron (convergence)

[Novikoff62] showed that

- if there exists a weight vector $w^*$ such that $\forall i \in \{1, \dots, m\},\; y_i \times \langle w^*, x_i \rangle > 0$,
- then, denoting $\rho = \min_{i \in \{1, \dots, m\}} y_i \left\langle \frac{w^*}{\|w^*\|}, x_i \right\rangle$,
- and $R = \max_{i \in \{1, \dots, m\}} \|x_i\|$,
- and with $w^{(0)} = 0$, $\lambda = 1$,
- we have a bound on the maximum number of updates $l$:

$$l \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor$$

Logistic regression

- Logistic regression was proposed to model the posterior probability of the classes via logistic functions:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-\langle w, x \rangle}} = g_w(x)$$

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{\langle w, x \rangle}} = 1 - g_w(x)$$

$$P(y \mid x) = (g_w(x))^{y} (1 - g_w(x))^{1 - y}$$

[Figure: the logistic function $1/(1 + e^{-\langle w, x \rangle})$ plotted against $\langle w, x \rangle$ over the range $[-6, 6]$, with values between 0 and 1.]

- For

$$g : \mathbb{R} \to\, ]0, 1[, \qquad x \mapsto \frac{1}{1 + e^{-x}}$$

we have

$$g'(x) = \frac{\partial g}{\partial x} = g(x)(1 - g(x))$$

- The model parameters $w$ are found by maximizing the complete log-likelihood, which, assuming that the $m$ training examples are generated independently, is written

$$\mathcal{L} = \log \prod_{i=1}^{m} P(x_i, y_i) = \log \prod_{i=1}^{m} P(y_i \mid x_i) = \sum_{i=1}^{m} \log \left( (g_w(x_i))^{y_i} (1 - g_w(x_i))^{1 - y_i} \right)$$

(the second equality holds up to the additive term $\sum_i \log P(x_i)$, which does not depend on $w$).

- If we consider the function $f_w : x \mapsto \langle w, x \rangle$, the maximization of the log-likelihood $\mathcal{L}$ is equivalent to the minimization of the empirical logistic loss in the case where $\forall i,\; y_i \in \{-1, +1\}$:

$$\hat{R}_m(f_w, S) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-y_i f_w(x_i)}\right)$$

- The maximization of the log-likelihood can be carried out with the gradient ascent algorithm (here with labels $y_i \in \{0, 1\}$):

$$w^{new} = w^{old} + \lambda \nabla_{w^{old}} \mathcal{L} = w^{old} + \lambda \left( \sum_{i=1}^{m} (y_i - g_{w^{old}}(x_i))\, x_i \right)$$

Boosting

- The boosting algorithm proposed by [Schapire99] generates a set of weak learners and combines them with a weighted majority vote in order to produce an efficient final classifier.
- The weak classifiers are trained sequentially, each one taking into account the classification errors of the previous classifier.
  - This is done by assigning weights to the training examples and, at each iteration, increasing the weights of those that the current classifier misclassifies.
  - In this way the new classifier focuses on the hard examples that were misclassified by the previous classifier.

Boosting

Algorithm 3: The boosting algorithm

    Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
    Initialize the distribution over examples: $\forall i \in \{1, \dots, m\},\; D_1(i) = \frac{1}{m}$
    $T$, the maximum number of iterations (or classifiers to be combined)
    for $t = 1, \dots, T$ do
        Train a weak classifier $f_t : \mathcal{X} \to \{-1, +1\}$ using the distribution $D_t$
        Set $\epsilon_t = \sum_{i : f_t(x_i) \neq y_i} D_t(i)$
        Choose $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
        Update the distribution of weights:
            $\forall i \in \{1, \dots, m\},\; D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i f_t(x_i)}}{Z_t}$
            where $Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i f_t(x_i)}$
    end for
    The final classifier: $\forall x,\; F(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)$

Boosting

- If we denote $\forall x,\; H(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ and $F(x) = \mathrm{sign}(H(x))$, then

$$\frac{1}{m} \sum_{i=1}^{m} [[y_i \neq F(x_i)]] \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)}$$

- On the other hand, since $D_1(i) = \frac{1}{m}$ and $D_1(i)\, e^{-\alpha_1 y_i f_1(x_i)} = Z_1 D_2(i)$,

$$\frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = Z_1 \sum_{i=1}^{m} D_2(i) \prod_{t > 1} e^{-y_i \alpha_t f_t(x_i)}$$

Iterating this argument and using $\sum_{i=1}^{m} D_{T+1}(i) = 1$, we get

$$\frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = \prod_{t=1}^{T} Z_t$$

Boosting

- The minimum of the normalization term with respect to the combination weights $\alpha_t$,

$$\forall t, \quad Z_t = \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t},$$

is reached for $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.

- Setting $\gamma_t = \frac{1}{2} - \epsilon_t$, and when $\epsilon_t < \frac{1}{2}$,

$$\forall t, \quad Z_t = \sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}$$

- The empirical misclassification error therefore decreases exponentially fast toward 0 with the number of iterations, as long as the $\gamma_t$ stay bounded away from 0:

$$\frac{1}{m} \sum_{i=1}^{m} [[y_i \neq F(x_i)]] \le \prod_{t=1}^{T} Z_t \le e^{-2 \sum_{t=1}^{T} \gamma_t^2}$$

Remarks

- The induction principle that consists in learning a predictor by minimizing, over a training set, the empirical average of an upper bound of the misclassification error is called the Empirical Risk Minimization (ERM) principle.
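To make the boosting procedure (Algorithm 3) and its training-error bound concrete, here is a minimal pure-Python sketch of the boosting loop with decision stumps on one-dimensional data. The choice of stumps as weak classifiers, the helper names (`stump`, `best_stump`, `adaboost`, `predict`), the toy data, and the small clamp on $\epsilon_t$ are assumptions made for this illustration; they are not prescribed by the slides.

```python
import math


def stump(threshold, sign):
    """Weak classifier on a real input: returns `sign` if x > threshold, else -sign."""
    return lambda x: sign if x > threshold else -sign


def best_stump(xs, ys, dist):
    """Return (stump, weighted error) minimizing the error under distribution `dist`."""
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for th in thresholds:
        for sign in (1, -1):
            h = stump(th, sign)
            err = sum(d for x, y, d in zip(xs, ys, dist) if h(x) != y)
            if best is None or err < best[1]:
                best = (h, err)
    return best


def adaboost(xs, ys, rounds):
    """Run `rounds` boosting iterations; return [(alpha_t, f_t)] and the list of Z_t."""
    m = len(xs)
    dist = [1.0 / m] * m                       # D_1(i) = 1/m
    ensemble, zs = [], []
    for _ in range(rounds):
        h, eps = best_stump(xs, ys, dist)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # clamp keeps the log finite (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)
        weights = [d * math.exp(-alpha * y * h(x)) for x, y, d in zip(xs, ys, dist)]
        z = sum(weights)                       # normalization term Z_t
        dist = [w / z for w in weights]        # D_{t+1}
        ensemble.append((alpha, h))
        zs.append(z)
    return ensemble, zs


def predict(ensemble, x):
    """Final classifier F(x) = sign(sum_t alpha_t f_t(x)); a zero score maps to +1."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can classify perfectly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]

ensemble, zs = adaboost(xs, ys, rounds=3)
train_error = sum(predict(ensemble, x) != y for x, y in zip(xs, ys)) / len(xs)
bound = math.prod(zs)                          # product of the Z_t
```

Comparing `train_error` with `bound` checks the inequality between the empirical misclassification error and $\prod_t Z_t$ on this toy sample.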
