Machine Learning: Empirical Risk Minimization

Dmitrij Schlesinger, WS2014/2015, 08.12.2014

Recap – tasks considered before

For a training dataset L = {(x_l, y_l)} with x_l ∈ R^n (data) and y_l ∈ {+1, −1} (classes), find a separating hyperplane

    y_l \, [\langle w, x_l \rangle + b] \ge 0 \quad \forall l

→ Perceptron algorithm.

The goal is to find a "corridor" (stripe) of maximal width that separates the data → Large Margin learning, linear SVM:

    \frac{1}{2} \|w\|^2 \to \min_w \quad \text{s.t.} \quad y_l [\langle w, x_l \rangle + b] \ge 1

In both cases the data is assumed to be separable. What if it is not?

Empirical Risk Minimization

Let a loss function C(y, y') be given that penalizes deviations between the true class and the estimated one (like the loss in Bayesian decision theory). The empirical risk of a decision strategy e is the total loss over the training set:

    R(e) = \sum_l C\bigl(y_l, e(x_l)\bigr) \to \min_e

It should be minimized with respect to the decision strategy e.

Special case (today):
– the set of decisions is {+1, −1}
– the loss is the delta function C(y, y') = \delta(y \ne y')
– the decision strategy can be expressed in the form e(x) = sign f(x) with an evaluation function f : X → R

Example: f(x) = ⟨w, x⟩ − b is a linear classifier.

Hinge Loss

The problem: the objective (the empirical risk with the delta loss) is not convex. The way out: replace the real loss by a convex upper bound (example for y = 1; for y = −1 it is flipped):

    \delta\bigl(y \ne \operatorname{sign} f(x)\bigr) \le \max\bigl(0, 1 - y \cdot f(x)\bigr)

This upper bound is called the Hinge loss.

Sub-gradient algorithm

Let the evaluation function be parameterized, i.e. f(x; θ), for example θ = (w, b) for linear classifiers. The optimization problem reads:

    H(\theta) = \sum_l \max\bigl(0, 1 - y_l \cdot f(x_l; \theta)\bigr) \to \min_\theta

It is convex with respect to f but not differentiable. Solution by the sub-gradient (descent) algorithm:
1. Compute the sub-gradient (see below).
2. Apply it with a step size that decreases over time:

    \theta^{(t+1)} = \theta^{(t)} - \gamma(t) \, \frac{\partial H}{\partial \theta}

with \lim_{t \to \infty} \gamma(t) = 0 and \sum_{t=1}^{\infty} \gamma(t) = \infty (e.g. \gamma(t) = 1/t).

The sub-gradient for the Hinge loss

    H(\theta) = \sum_l \max\bigl(0, 1 - y_l \cdot f(x_l; \theta)\bigr) \to \min_\theta

1. Find the data points whose Hinge loss is greater than zero:

    L' = \{\, l : y_l \cdot f(x_l) < 1 \,\}

2. The sub-gradient is

    \frac{\partial H}{\partial \theta} = - \sum_{l \in L'} y_l \cdot \frac{\partial f(x_l; \theta)}{\partial \theta}

In particular, for linear classifiers

    \frac{\partial H}{\partial w} = - \sum_{l \in L'} y_l \cdot x_l

i.e. some data points are added to the parameter vector → it is reminiscent of the Perceptron algorithm.

Kernelization

Let us optimize the Hinge loss in a feature space Φ : X → H, i.e. the (linear) evaluation function is no longer f(x; w) = ⟨x, w⟩ but f(x; w) = ⟨Φ(x), w⟩. Remember (see the previous lecture) that the evaluation function can be expressed as a linear combination:

    f(x; \alpha) = \langle \Phi(x), w \rangle = \Bigl\langle \Phi(x), \sum_i \alpha_i y_i \Phi(x_i) \Bigr\rangle = \sum_i \alpha_i y_i \langle \Phi(x), \Phi(x_i) \rangle = \sum_i \alpha_i y_i \kappa(x, x_i)

The objective to be minimized now reads:

    H(\alpha) = \sum_l \max\Bigl(0, 1 - y_l \sum_i \alpha_i y_i \kappa(x_l, x_i)\Bigr) \to \min_\alpha

The sub-gradient is

    \frac{\partial H}{\partial \alpha_i} = - y_i \sum_{l \in L'} y_l \, \kappa(x_l, x_i)

As usual for kernels, neither the feature space H nor the mapping Φ(x) is needed in order to estimate the α-s; the kernel κ(x, x') is enough.
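To make the kernelized sub-gradient step concrete, here is a minimal NumPy sketch, assuming a precomputed Gram matrix K with K[l, i] = κ(x_l, x_i) and labels y in {+1, −1}; the function name, iteration count, and the schedule γ(t) = 1/t are illustrative choices, not part of the slides.

    # Minimal sketch (assumed names/defaults): sub-gradient descent for the
    # kernelized Hinge loss H(alpha) = sum_l max(0, 1 - y_l * sum_i alpha_i y_i K[l, i]).
    import numpy as np

    def kernel_hinge_subgradient(K, y, n_iters=200):
        n = len(y)
        alpha = np.zeros(n)
        for t in range(1, n_iters + 1):
            f = K @ (alpha * y)          # f(x_l) = sum_i alpha_i y_i kappa(x_l, x_i)
            active = y * f < 1           # L' = {l : y_l f(x_l) < 1}
            # dH/dalpha_i = -y_i * sum_{l in L'} y_l kappa(x_l, x_i)
            grad = -y * (K[active].T @ y[active])
            alpha -= (1.0 / t) * grad    # gamma(t) = 1/t, decreasing step size
        return alpha

Note that only kernel values enter the computation; neither H nor Φ appears, which is exactly the point made above. With the linear kernel K = X @ X.T this reduces to the linear sub-gradient update.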
Maximum margin vs. minimum loss

– Linear SVM: maximum margin, separable data
– Non-separable data: minimize the empirical risk, e.g. the Hinge loss
– "Kernelization" can be done for both variants

Does it make sense to minimize a loss (especially the Hinge loss, which has a "geometrical nature") defined in the feature space? Indeed, it is always possible to make the training set separable by choosing a suitable kernel. Interestingly, both formulations are equivalent under certain circumstances.

A "unified" formulation

In machine learning it is a common technique to augment an objective function (e.g. the average loss) with a regularizer:

    R(w) + \frac{C}{|L|} \sum_l \ell(x_l, y_l, w)

with
– parameter vector w
– loss \ell(x_l, y_l, w), e.g. delta, Hinge, metric, additive etc.
– regularizer R(w), e.g. \|w\|^2, |w|_1 etc.
– balancing factor C

Now we start from the unified formulation, make some assumptions, and end up with maximum margin learning.

Hinge loss → maximum margin

    R(w) + \frac{C}{|L|} \sum_l \ell(x_l, y_l, w) \to \min_w

Assumption: the training set is separable, i.e. there exists a w for which the loss is zero. Set C to a very high value and obtain

    R(w) \to \min_w \quad \text{s.t.} \quad \ell(x_l, y_l, w) = 0 \ \forall l

Set R(w) = \frac{1}{2}\|w\|^2 and ℓ to the Hinge loss (for linear classifiers):

    \ell(x_l, y_l, w) = \max\bigl(0, 1 - y_l \langle w, x_l \rangle\bigr) = 0

This results in maximum margin learning:

    \frac{1}{2}\|w\|^2 \to \min_w \quad \text{s.t.} \quad y_l \langle w, x_l \rangle \ge 1 \ \forall l

Maximum margin → Hinge loss

Let us "weaken" maximum margin learning by introducing so-called slack variables ξ_l ≥ 0 that represent "errors":

    \frac{1}{2}\|w\|^2 + \frac{C}{|L|} \sum_l \xi_l \to \min_w \quad \text{s.t.} \quad y_l \langle w, x_l \rangle \ge 1 - \xi_l \ \forall l

Note that we want to minimize the total error. On the other hand, ξ_l ≥ 1 − y_l⟨w, x_l⟩ together with ξ_l ≥ 0 implies

    \xi_l = \max\bigl(0, 1 - y_l \langle w, x_l \rangle\bigr)

Substituting this gives

    \frac{1}{2}\|w\|^2 + \frac{C}{|L|} \sum_l \max\bigl(0, 1 - y_l \langle w, x_l \rangle\bigr) \to \min_w

i.e. Hinge loss minimization.

Summary

    R(w) + \frac{C}{|L|} \sum_l \ell(x_l, y_l, w) \to \min_w

Two extremes (a small numerical sketch follows below):
– Big C → the loss is more important → better recognition rate but smaller margin (worse generalization)
– Small C → generalization is more important → larger margin (more robust classifier) but worse recognition rate

Recommended reading: Sebastian Nowozin and Christopher H. Lampert, "Structured Prediction and Learning in Computer Vision", Foundations and Trends in Computer Graphics and Vision, Vol. 6, No. 3-4. http://www.nowozin.net/sebastian/cvpr2012tutorial
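The following sketch minimizes the unified objective \frac{1}{2}\|w\|^2 + \frac{C}{|L|} \sum_l \max(0, 1 - y_l \langle w, x_l \rangle) for a linear classifier by sub-gradient descent; the bias b is omitted and all names and defaults are assumptions made for illustration only.

    # Minimal sketch (assumed names/defaults): sub-gradient descent for the
    # regularized Hinge loss 1/2 ||w||^2 + (C/|L|) * sum_l max(0, 1 - y_l <w, x_l>).
    import numpy as np

    def regularized_hinge_subgradient(X, y, C=1.0, n_iters=500):
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, n_iters + 1):
            margins = y * (X @ w)
            active = margins < 1                          # points with non-zero Hinge loss
            # sub-gradient: w - (C/|L|) * sum_{l in L'} y_l x_l
            grad = w - (C / n) * (X[active].T @ y[active])
            w -= (1.0 / t) * grad                         # gamma(t) = 1/t
        return w

A large C pushes the Hinge term towards zero (better recognition rate, smaller margin), while a small C lets the regularizer dominate (larger margin, possibly more training errors), matching the two extremes listed in the summary.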
