Linear classifiers, a family
Václav Hlaváč

Czech Technical University in Prague, Czech Institute of Informatics, Robotics and Cybernetics, 160 00 Prague 6, Jugoslávských partyzánů 1580/3, Czech Republic
http://people.ciirc.cvut.cz/hlavac, [email protected]
Also Center for Machine Perception, http://cmp.felk.cvut.cz
Courtesy: M.I. Schlesinger, V. Franc.

Outline of the talk:

 A classifier, dichotomy, a multi-class classifier.
 A linear discriminant function.
 Learning a linear classifier.
 Perceptron and its learning.
 ε-solution.
 Learning for infinite training sets.

A classifier 2/23

The analyzed object is represented by
X – a space of observations, a vector space of dimension n,
Y – a set of hidden states.

The aim of classification is to determine a relation between X and Y, i.e., to find a discriminant function f : X → Y.

A classifier q : X → J maps observations x ∈ X to the set of class indices J = {1, ..., |Y|}.

Mutual exclusion of the classes is required:

X = X1 ∪ X2 ∪ ... ∪ X|Y|,   Xi ∩ Xj = ∅, i ≠ j, i, j = 1, ..., |Y|.

Classifier, an illustration 3/23

 A classifier partitions the observation space X into class-labelled regions Xi, i = 1,..., |Y |.

 Classification determines to which region Xi, corresponding to a hidden state yi, an observed feature vector x belongs.

 Borders between regions are called decision boundaries.

(Figure: examples of partitions of the observation space into regions X1, ..., X4, with X4 as the background class.)

Several possible arrangements of decision boundaries correspond to hidden states yi (classes in a more specific case).

A multi-class decision strategy 4/23

 Discriminant functions fi(x) should have the (ideal) property:

fi(x) > fj(x) for x ∈ class i, i ≠ j.

(Figure: a bank of discriminant functions f1(x), ..., f|Y|(x) followed by a maximum selector producing the decision y.)

 Strategy: j = argmax_j fj(x).

 However, it is not easy to find such discriminant functions. Most ‘good classifiers’ are dichotomic (e.g., the perceptron, SVM).

 The usual solution: the One-against-All classifier or the One-against-One classifier.

Linear discriminant function 5/23

 fj(x) = ⟨wj, x⟩ + bj, where ⟨·, ·⟩ denotes a scalar product.

 The strategy j = argmax_j fj(x) divides X into |Y| convex regions.

(Figure: a partition of the plane into six convex regions labelled y = 1, ..., y = 6.)

The separation of the classes corresponds to a Voronoi diagram (a term from computational geometry). The aim is to obtain a partition that fills the space completely and reflects the proximity information. A Voronoi diagram is a partition of a space (a plane in the special case) into regions close to each of the given objects (seeds). The region corresponding to a particular seed consists of all points closer to that seed than to any other seed.
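To make the argmax strategy concrete, here is a minimal sketch (mine, not from the slides) of a multi-class linear classifier fj(x) = ⟨wj, x⟩ + bj with the decision j = argmax_j fj(x); the weight values are hypothetical.

```python
import numpy as np

# Hypothetical parameters for |Y| = 3 classes in a 2-D observation space:
# one weight vector w_j (row of W) and one offset b_j per class.
W = np.array([[ 1.0,  0.5],
              [-1.0,  1.0],
              [ 0.0, -1.5]])
b = np.array([0.0, 0.2, -0.1])

def classify(x):
    """Multi-class linear strategy: j = argmax_j <w_j, x> + b_j (1-based index)."""
    scores = W @ x + b
    return int(np.argmax(scores)) + 1

print(classify(np.array([2.0, 1.0])))   # scores [2.5, -0.8, -1.6] -> class 1
```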

Dichotomy, two classes only 6/23

|Y| = 2, i.e., two hidden states (typically also classes):

q(x) = 1 if ⟨w, x⟩ + b ≥ 0,
q(x) = 2 if ⟨w, x⟩ + b < 0.

(Figure: the perceptron (F. Rosenblatt, 1958) with inputs x1, ..., xn, weights w1, ..., wn, a summation unit Σ, the threshold b, and a threshold activation function producing the output y.)
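A minimal sketch of this two-class decision rule, with hypothetical w and b:

```python
import numpy as np

def dichotomy(x, w, b):
    """Two-class linear classifier: returns 1 if <w, x> + b >= 0, else 2."""
    return 1 if np.dot(w, x) + b >= 0 else 2

# Hypothetical weights and offset.
w = np.array([1.0, -2.0])
b = 0.5
print(dichotomy(np.array([3.0, 1.0]), w, b))   # 3 - 2 + 0.5 = 1.5 >= 0 -> class 1
```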

Learning linear classifiers 7/23

The aim of learning is to estimate the classifier parameters wi, bi for all i.

The learning algorithms differ in

 The character of the training set:
  1. A finite set consisting of individual observations and hidden states, i.e., {(x1, y1), ..., (xL, yL)}.
  2. An infinite set described, e.g., by a mixture of Gaussian distributions.

 The learning task formulation.

Learning task formulations for finite training sets 8/23

Empirical risk minimization:

 The true risk is approximated by the empirical risk
  Remp(q(x, Θ)) = (1/L) Σ_{i=1}^{L} W(q(xi, Θ), yi), where W is a penalty.

 Learning is based on the empirical risk minimization principle Θ∗ = argmin_Θ Remp(q(x, Θ)).

 Examples of learning algorithms: Perceptron, Back-propagation.
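As an illustrative sketch (mine, not from the slides), the empirical risk with a 0/1 penalty W can be computed as follows:

```python
import numpy as np

def empirical_risk(classify, X, y):
    """R_emp = (1/L) * sum_i W(q(x_i), y_i) with the 0/1 penalty W."""
    L = len(X)
    return sum(classify(x_i) != y_i for x_i, y_i in zip(X, y)) / L

# Toy example with a hypothetical two-class linear rule.
w, b = np.array([1.0, -1.0]), 0.0
q = lambda x: 1 if np.dot(w, x) + b >= 0 else 2
X = [np.array([2.0, 1.0]), np.array([0.0, 1.0]), np.array([1.0, 3.0])]
y = [1, 2, 1]
print(empirical_risk(q, X, y))  # fraction of misclassified training examples (1/3 here)
```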

Structural risk minimization:

 The true risk is approximated by a guaranteed risk (a regularizer securing an upper bound on the risk is added to the empirical risk; Vapnik–Chervonenkis theory of learning).

 Example: the Support Vector Machine (SVM).

Perceptron, the history 9/23

 The perceptron (F. Rosenblatt, 1958) was intended to be a machine (even though it was first implemented as a program on an IBM 704).

 This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the “neurons”. Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.

 Although the perceptron initially seemed promising, its limitations soon appeared. In 1969, the famous book Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for this class of networks to learn the XOR function.

 It is often believed (incorrectly) that they also conjectured that a similar result would hold for a multi-layer perceptron network.

 The often-miscited Minsky/Papert text caused a significant decline in interest and funding of neural network research.

Courtesy: Wikipedia

(A single-layer) perceptron learning; task formulation 10/23

Input: T = {(x1, y1), ..., (xL, yL)}, yi ∈ {1, 2}, i = 1, ..., L, dim(xi) = n.

Output: a weight vector w and an offset b satisfying, for all j ∈ {1, ..., L}:

⟨w, xj⟩ + b ≥ 0 for yj = 1,
⟨w, xj⟩ + b < 0 for yj = 2.

(Figure: the separating hyperplane ⟨w, x⟩ + b = 0 in Xⁿ and ⟨w′, x′⟩ = 0 in Xⁿ⁺¹.)

The task can be formally transcribed to a single inequality ⟨w′, x′j⟩ ≥ 0 by embedding it into the (n + 1)-dimensional space, where

w′ = [w b],   x′ = [x 1] for y = 1,   x′ = −[x 1] for y = 2.

We drop the primes and return to the w, x notation.
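A small sketch (the helper name is mine) of this embedding into n + 1 dimensions:

```python
import numpy as np

def embed(x, y):
    """Map (x, y) to x' in n+1 dimensions so that correct classification
    becomes the single inequality <w', x'> >= 0 with w' = [w, b]."""
    x1 = np.append(x, 1.0)          # [x 1]
    return x1 if y == 1 else -x1    # negate for class 2

print(embed(np.array([2.0, -1.0]), 2))   # -> [-2.  1. -1.]
```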

Perceptron learning: the algorithm, 1957 11/23

Input: T = {(x1, y1), ..., (xL, yL)}.
Output: a weight vector w.

The Perceptron algorithm:
1. w1 = 0.
2. A wrongly classified observation xj is sought, i.e., ⟨wt, xj⟩ < 0, j ∈ {1, ..., L}.
3. If there is no misclassified observation, the algorithm terminates; otherwise wt+1 = wt + xj.
4. Go to 2.

(Figure: the perceptron update rule; the misclassified xt is added to wt, which rotates the hyperplane ⟨w, x⟩ = 0.)

The perceptron algorithm was invented at the Cornell Aeronautical Laboratory by Frank Rosenblatt (∗1928, †1971). F. Rosenblatt: The perceptron: a probabilistic model for information storage and organization in the brain, Psychological Review 65.6 (1958): 386.
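A runnable sketch of the algorithm above, assuming the observations have already been embedded as on the previous slide (function and variable names are mine). One deliberate deviation: the misclassification test is ⟨w, x⟩ ≤ 0 rather than the strict < 0 of the slide, so that the all-zero initial w is not accepted as a solution.

```python
import numpy as np

def perceptron(X_emb, max_iter=1000):
    """Rosenblatt's perceptron on embedded observations (each row is x').
    Seeks w with <w, x'> > 0 for every row; returns w, or None if not found."""
    w = np.zeros(X_emb.shape[1])                          # step 1: w_1 = 0
    for _ in range(max_iter):
        wrong = [x for x in X_emb if np.dot(w, x) <= 0]   # step 2 (<= 0, see note above)
        if not wrong:                                     # step 3: nothing misclassified
            return w
        w = w + wrong[0]                                  # step 3: w_{t+1} = w_t + x_j
    return None                                           # data may not be separable

# Toy separable set: two class-1 points [x 1] and one class-2 point -[x 1].
X_emb = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, -1.0]])
print(perceptron(X_emb))   # e.g. -> [2. 1. 1.]
```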

Novikoff theorem, 1962 12/23

 Proves that the Perceptron algorithm converges in a finite number of steps if a linearly separable solution exists.

 Let X̄ denote the convex hull of the set of observations X.

 Let D = max_j |xj| and m = min_{x ∈ X̄} |x| > 0.

Novikoff theorem: If the data are linearly separable, then there exists a number t∗ ≤ D²/m² such that the vector wt∗ satisfies

⟨wt∗, xj⟩ > 0, ∀j ∈ {1, ..., L}.

(Figure: the convex hull of the observations; D is the largest distance of an observation from the origin, m is the distance of the convex hull from the origin.)

 What if the data are not separable?

 How to terminate the perceptron learning?

Idea of the Novikoff theorem proof 13/23

Let us express bounds for |wt|².

Upper bound:

|wt+1|² = |wt + xt|² = |wt|² + 2⟨xt, wt⟩ + |xt|² ≤ |wt|² + |xt|² ≤ |wt|² + D²,

because ⟨xt, wt⟩ ≤ 0 for a misclassified xt. Hence

|w0|² = 0, |w1|² ≤ D², |w2|² ≤ 2D², ..., |wt+1|² ≤ t·D², ...

Lower bound: analogically, |wt+1|² > t²·m².

Solution: t²·m² ≤ t·D² ⇒ t ≤ D²/m².

An alternative training algorithm: Kozinec (1973) 14/23

Input: T = {(x1, y1), ..., (xL, yL)}.
Output: a weight vector w∗.

The Kozinec algorithm:
1. w1 = xj, i.e., any observation.
2. A wrongly classified observation xt is sought, i.e., ⟨wt, xt⟩ < 0.
3. If there is no wrongly classified observation, the algorithm finishes; otherwise wt+1 = (1 − k)·wt + k·xt, where k = argmin_k |(1 − k)·wt + k·xt|, k ∈ R.
4. Go to 2.

(Figure: wt+1 is the point on the segment between wt and xt closest to the origin.)
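A sketch of the Kozinec update (helper names are mine); the closed form for k follows from minimizing the quadratic |(1 − k)·wt + k·xt|² in k:

```python
import numpy as np

def kozinec_step(w, x):
    """One Kozinec update: w_{t+1} = (1-k)*w + k*x with k minimizing its norm."""
    d = w - x
    k = np.dot(w, d) / np.dot(d, d)      # argmin_k |(1-k)*w + k*x| (closed form)
    return (1.0 - k) * w + k * x

def kozinec(X_emb, max_iter=1000):
    """Kozinec's algorithm on embedded observations (rows of X_emb)."""
    w = X_emb[0].astype(float)                        # step 1: start from any observation
    for _ in range(max_iter):
        wrong = [x for x in X_emb if np.dot(w, x) < 0]   # step 2
        if not wrong:                                 # step 3: no misclassified observation
            return w
        w = kozinec_step(w, wrong[0])
    return w

print(kozinec(np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, -1.0]])))
```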

Perceptron learning as an optimization problem (1) 15/23

The Perceptron algorithm, the batch version, handling non-separability; another perspective:

 Input: T = {(x1, y1), ..., (xL, yL)}, yi ∈ {1, 2}, i = 1, ..., L, dim(xi) = n.

 Output: a weight vector w minimizing

J(w) = |{x ∈ X : ⟨w, x⟩ < 0}|,

or, equivalently,

J(w) = Σ_{x ∈ X : ⟨w, x⟩ < 0} 1.

Q: Is the most common optimization method, gradient descent wt+1 = wt − η∇J(wt), applicable?

A: Unfortunately not. The gradient of J(w) is either 0 or undefined.

Perceptron learning as an optimization problem (2) 16/23

Let us redefine the cost function:

Jp(w) = Σ_{x ∈ X : ⟨w, x⟩ < 0} −⟨w, x⟩,

∇Jp(w) = ∂Jp/∂w = Σ_{x ∈ X : ⟨w, x⟩ < 0} −x.
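A small sketch (my naming) that evaluates this cost and its gradient; a gradient-descent step with η = 1 on the misclassified points reproduces the perceptron-style update wt+1 = wt + Σ xj:

```python
import numpy as np

def perceptron_cost_and_grad(w, X_emb):
    """J_p(w) = sum of -<w, x> over misclassified x, and its gradient."""
    wrong = [x for x in X_emb if np.dot(w, x) < 0]
    Jp = sum(-np.dot(w, x) for x in wrong)
    grad = -np.sum(wrong, axis=0) if wrong else np.zeros_like(w)
    return Jp, grad

w = np.array([0.5, -1.0, 0.0])
X_emb = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, -1.0]])
Jp, grad = perceptron_cost_and_grad(w, X_emb)
w_next = w - 1.0 * grad          # gradient descent with eta = 1 adds the misclassified points
```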

 The Perceptron algorithm performs gradient descent optimization of Jp(w).

 Learning by the empirical risk minimization is just an instance of an optimization problem.

 Either gradient minimization (in neural networks) or convex (quadratic) minimization (called convex programming in the mathematical literature) is used.

Perceptron algorithm 17/23
Classifier learning, a non-separable case, a batch version

Input: T = {(x1, y1), ..., (xL, yL)}, yi ∈ {1, 2}, i = 1, ..., L, dim(xi) = n.
Output: a weight vector w∗.

1. w1 = 0, E = |T| = L, w∗ = 0.
2. Find all misclassified observations X− = {x ∈ X : ⟨wt, x⟩ < 0}.
3. If |X−| < E, then E = |X−|, w∗ = wt, and the time of the last update tlu = t.
4. If the termination condition tc(w∗, t, tlu) holds, then terminate; else wt+1 = wt + ηt Σ_{x ∈ X−} x.
5. Go to 2.
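A sketch of this batch variant; the termination condition tc and the step size ηt are simple placeholder choices of mine:

```python
import numpy as np

def batch_perceptron(X_emb, eta=0.1, max_iter=1000, patience=100):
    """Batch perceptron keeping the best w* seen so far (non-separable case).
    tc: stop after `patience` iterations without an improvement (placeholder)."""
    w = np.zeros(X_emb.shape[1])
    w_best, E = w.copy(), len(X_emb)                  # step 1
    t_lu = 0
    for t in range(1, max_iter + 1):
        X_minus = [x for x in X_emb if np.dot(w, x) < 0]   # step 2
        if len(X_minus) < E:                          # step 3: new best solution
            E, w_best, t_lu = len(X_minus), w.copy(), t
        if E == 0 or t - t_lu > patience:             # step 4: termination condition tc
            return w_best, E
        w = w + eta * np.sum(X_minus, axis=0)         # step 4: batch update
    return w_best, E
```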

 The algorithm converges with probability 1 to the optimal solution.

 The convergence rate is not known.

 The termination condition tc(·) is a complex function of the quality of the best solution, the time since the last update t − tlu, and the requirements on the solution.

The optimal separating plane and the closest point to the convex hull 18/23

The problem of the optimal separation by a hyperplane,

w∗ = argmax_w min_j ⟨w/|w|, xj⟩,   (1)

can be converted to the search for the closest point of the convex hull (denoted by the overline, X̄) of the observations x in the training multi-set:

x∗ = argmin_{x ∈ X̄} |x|.

It holds that x∗ also solves problem (1).

Recall that the classifier that maximizes the separation minimizes the structural risk Rstr.
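As an illustrative sketch (my formulation, using scipy), the closest point of the convex hull to the origin can be found as a small quadratic program over convex weights α:

```python
import numpy as np
from scipy.optimize import minimize

def closest_point_of_hull(X):
    """Find x* = argmin_{x in conv(X)} |x| by optimizing convex weights alpha:
    minimize |X^T alpha|^2  s.t.  alpha >= 0, sum(alpha) = 1."""
    L = X.shape[0]
    obj = lambda a: np.sum((X.T @ a) ** 2)
    cons = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)
    bounds = [(0.0, 1.0)] * L
    a0 = np.full(L, 1.0 / L)
    res = minimize(obj, a0, bounds=bounds, constraints=cons)
    return X.T @ res.x

X = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
print(closest_point_of_hull(X))   # the hull point nearest the origin
```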

The convex hull, an illustration 19/23

(Figure: the convex hull X̄ of the observations; at the optimum |w∗| = m.)

min_j ⟨w/|w|, xj⟩ ≤ m ≤ |w|,   w ∈ X̄,

where the left-hand side is the lower bound and the right-hand side is the upper bound.

ε-solution 20/23

 The aim is to speed up the algorithm.

 An allowed uncertainty ε is introduced:

|wt| − min_j ⟨wt/|wt|, xj⟩ ≤ ε.

(Figure: the Kozinec update wt+1 = (1 − γ)·wt + γ·xt illustrated on the segment between wt and xt.)

Kozinec and the ε-solution 21/23

The second step of the Kozinec algorithm is modified to:

A wrongly classified observation xt is sought, i.e., one for which

|wt| − min_j ⟨wt/|wt|, xj⟩ ≥ ε.

(Figure: |wt| decreases with the iteration number t towards the lower bound m; the algorithm stops once the gap is at most ε.)
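A sketch of the ε-termination test (the helper name is mine), reusing embedded observations:

```python
import numpy as np

def is_eps_solution(w, X_emb, eps):
    """True if |w| - min_j <w/|w|, x_j> <= eps, i.e., w is an epsilon-solution."""
    margin = min(np.dot(w, x) for x in X_emb) / np.linalg.norm(w)
    return np.linalg.norm(w) - margin <= eps

print(is_eps_solution(np.array([2.0, 1.0, 1.0]),
                      np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, -1.0]]),
                      eps=3.0))
```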

Learning task formulation for infinite training sets 22/23

The generalization of Anderson's (1962) task by M.I. Schlesinger (1972) solves a quadratic optimization task.

 It solves the learning problem for a linear classifier and two hidden states only.

 It is assumed that the class-conditional distributions pX|Y(x | y) corresponding to both hidden states are multi-dimensional Gaussian distributions.

 The mathematical expectation µy and the covariance matrix σy, y = 1, 2, of these probability distributions are not known.

 The Generalized Anderson task (abbreviated GAndersonT) is an extension of the Anderson–Bahadur task (1962), which solved the problem when each of the two classes is modelled by a single Gaussian.

GAndersonT illustrated in 2D space 23/23

Illustration of the statistical model, i.e., a mixture of Gaussians.

(Figure: Gaussian components with means µ1, ..., µ5 assigned to the classes y = 1 and y = 2, separated by the linear classifier q.)

 The parameters of individual Gaussians µi, σi, i = 1, 2,... are known.

 Weights of the Gaussian components are unknown.