4. Linear Discriminant Functions
Aleix M. Martinez
[email protected]
Handouts for ECE 874, 2007.

Why Linear?

•It is simple and intuitive.
•It minimizes a criterion error; e.g., sample risk, training error, margin, etc.
•It can be generalized to find non-linear discriminant regions.
•It is generally very difficult to calculate the distance of a testing sample to a nonlinear function.
•It works with a limited number of training samples.
•There is no need to estimate class distributions.

Distance to a nonlinear function

[Figure: the distance from a test sample to a nonlinear decision surface. From Murase & Nayar, 1995.]

Linear Discriminant Analysis

•If we have samples corresponding to two or more classes, we prefer to select those features that best discriminate between classes, rather than those that best describe the data.
•This will, of course, depend on the classifier.
•Assume our classifier is Bayes.
•Thus, we want to minimize the probability of error.
•We will develop a method based on scatter matrices.

Theorem

•Let the samples of two classes be Normally distributed in $\mathbb{R}^p$, with common covariance $\Sigma$. Then, the Bayes errors in the p-dimensional space and in the one-dimensional subspace given by
  $v = \Sigma^{-1}(\mu_1 - \mu_2) / \|\Sigma^{-1}(\mu_1 - \mu_2)\|$
are the same; where $\|x\|$ is the Euclidean norm of the vector x.
•That is, there is no loss in classification when reducing from p dimensions to one.

[Figure: the PCA and LDA projection directions for two Gaussian classes.]
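A minimal numpy sketch of the projection direction in the theorem; the class means and the shared covariance below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in R^p with a common covariance Sigma (illustrative data).
p = 5
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)          # shared covariance, positive definite

# Direction from the theorem: v = Sigma^{-1}(mu1 - mu2), normalized to unit length.
v = np.linalg.solve(Sigma, mu1 - mu2)
v /= np.linalg.norm(v)

# Projecting any sample x onto v gives the 1-D feature that preserves the Bayes error.
x = rng.multivariate_normal(mu1, Sigma)
print("1-D projection of x:", v @ x)
```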

Scatter matrices and separability criteria

•Within-class scatter matrix:
  $S_W = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (x_{ij} - \mu_j)(x_{ij} - \mu_j)^T$.
•Between-class scatter matrix:
  $S_B = \sum_{j=1}^{C} (\mu_j - \mu)(\mu_j - \mu)^T$.
•Note that: $\hat{\Sigma} = S_W + S_B$.
•To formulate criteria for class separability, we need to convert these matrices to numerical values, e.g.:
  $tr(S_2^{-1} S_1)$,  $\ln|S_1| - \ln|S_2|$,  $tr\,S_1 / tr\,S_2$.
•Typical combinations of scatter matrices are:
  $\{S_1, S_2\} = \{S_B, S_W\}, \{S_B, \hat{\Sigma}\}$, and $\{S_W, \hat{\Sigma}\}$.
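A short numpy sketch of the scatter matrices and of the criterion $tr(S_W^{-1} S_B)$ on synthetic labelled data (the function and variable names are mine, not from the slides):

```python
import numpy as np

def scatter_matrices(X, labels):
    """X: n x p data matrix, labels: length-n class labels."""
    mu = X.mean(axis=0)                        # global mean
    p = X.shape[1]
    S_W = np.zeros((p, p))                     # within-class scatter
    S_B = np.zeros((p, p))                     # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

# Illustrative data: three classes in R^4.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 30)

S_W, S_B = scatter_matrices(X, labels)
J = np.trace(np.linalg.solve(S_W, S_B))       # tr(S_W^{-1} S_B): larger means more separable
print("tr(S_W^-1 S_B) =", J)
```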

Ronald Fisher (1890-1962)

•Fisher was an eminent scholar and one of the great scientists of the first part of the 20th century. After graduating from Cambridge and being denied entry into the British army because of his poor eyesight, he worked as a statistician for six years before starting a farming business. While a farmer, he continued his genetics and statistics research. During this time, he developed the well-known analysis of variance (ANOVA) method. After the war, Fisher finally moved to Rothamsted Experimental Station. Among his many accomplishments, Fisher invented ANOVA, the technique of maximum likelihood (ML), Fisher Information, the concept of sufficiency, and the method now known as Linear Discriminant Analysis (LDA). During World War II, the field of eugenics suffered a big blow, mainly due to the Nazis' use of it as a justification for some of their actions. Fisher moved back to Rothamsted and then to Cambridge, where he retired. Fisher has been credited as one of the founders of modern statistics, and one cannot study pattern recognition without encountering several of his ground-breaking insights. Yet as great a statistician as he was, he also became a major figure in genetics. A classic quote in the Annals of Statistics reads: "I occasionally meet geneticists who ask me whether it is true that the great geneticist R.A. Fisher was also an important statistician."

A solution to LDA

•Again, we want to minimize the Bayes error.
•Therefore, we want the projection from Y to X that minimizes the error:
  $\hat{X}(p) = \sum_{i=1}^{p} y_i \phi_i + \sum_{i=p+1}^{n} b_i \phi_i$.
•The eigenvalue decomposition is the optimal transformation:
  $S_W^{-1} S_B \phi_i = \lambda_i \phi_i$.   (Simultaneous diagonalization.)
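A sketch of the eigenvalue solution $S_W^{-1} S_B \phi_i = \lambda_i \phi_i$, solved here as the equivalent generalized symmetric eigenproblem $S_B \phi = \lambda S_W \phi$ with scipy; the data and the small ridge added to $S_W$ are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(40, 6)) for m in (0.0, 1.5, 3.0)])
labels = np.repeat([0, 1, 2], 40)

# Scatter matrices (same construction as in the previous sketch).
mu = X.mean(axis=0)
S_W = np.zeros((6, 6)); S_B = np.zeros((6, 6))
for c in np.unique(labels):
    Xc = X[labels == c]
    mu_c = Xc.mean(axis=0)
    S_W += (Xc - mu_c).T @ (Xc - mu_c)
    S_B += np.outer(mu_c - mu, mu_c - mu)

# Generalized eigenproblem S_B phi = lambda S_W phi  <=>  S_W^{-1} S_B phi = lambda phi.
evals, evecs = eigh(S_B, S_W + 1e-8 * np.eye(6))
order = np.argsort(evals)[::-1]                 # largest eigenvalues first
V = evecs[:, order[:2]]                         # at most C-1 = 2 useful directions
Y = X @ V                                       # LDA projection of the data
print("leading eigenvalues:", evals[order[:3]])
print("projected data shape:", Y.shape)
```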

Example: Face Recognition

Limitations of LDA

•To prevent $S_W$ from becoming singular, we need N > d.
•There are only C-1 eigenvectors with nonzero eigenvalues.
•Again, this limits the number of features one can use.
•Nonparametric LDA is designed to solve the last problem (we'll see this later in the course).

PCA versus LDA

•In many applications the number of samples is relatively small compared to the dimensionality of the data.
•Even for simple PDFs, PCA can outperform LDA (on testing data).
•PCA is usually a safer bet, because all we try to do is to minimize the representation error.

Problems with Multi-class Eigen-based Algorithms

•In general, researchers define algorithms which are optimal in the 2-class case and then extend this idea (way of thinking) to the multi-class problem.
•This may cause problems.
•This is the case for eigen-based approaches which use the idea of scatter matrices defined above.

[Figure: samples drawn from underlying but unknown PDFs.]

•Let's define the general case: $M_1 V = M_2 V \Lambda$.
•This is the same as selecting those eigenvectors v that maximize:
  $\dfrac{v^T M_1 v}{v^T M_2 v}$.

•Note that this can only be achieved if $M_1$ and $M_2$ agree.
•The existence of a solution depends on the angle between the eigenvectors of $M_1$ and $M_2$.
  ($v_i$ is the i-th vector of the solution space; $w_i$ are the eigenvectors of $M_1$; $u_i$ are the eigenvectors of $M_2$.)

How to know?

$K = \sum_{i=1}^{r} \sum_{j=1}^{i} \cos^2 \theta_{ij} = \sum_{i=1}^{r} \sum_{j=1}^{i} \big( u_j^T w_i \big)^2,$

where r < q and q is the number of eigenvectors of M1.

•The larger K is, the less probable it is that the results will be correct.
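A sketch of K under the reading of the formula above, i.e. summing the squared cosines $(u_j^T w_i)^2$ between the leading eigenvectors of $M_1$ and $M_2$ (the matrices below are illustrative):

```python
import numpy as np

def agreement_K(M1, M2, r):
    """K = sum_{i<=r} sum_{j<=i} (u_j^T w_i)^2, with w_i and u_i the eigenvectors
    of M1 and M2 sorted by decreasing eigenvalue."""
    def sorted_evecs(M):
        evals, evecs = np.linalg.eigh(M)        # assumes symmetric M
        return evecs[:, np.argsort(evals)[::-1]]
    W, U = sorted_evecs(M1), sorted_evecs(M2)
    K = 0.0
    for i in range(r):
        for j in range(i + 1):
            K += (U[:, j] @ W[:, i]) ** 2       # squared cosine (unit-norm eigenvectors)
    return K

# Illustrative symmetric matrices.
rng = np.random.default_rng(3)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
M1, M2 = A @ A.T, B @ B.T
print("K =", agreement_K(M1, M2, r=3))
```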

Classification: The linear case

Decision Surfaces

•A linear discriminant function can be mathematically written as:
  $g(x) = w^T x + w_0$,
  where w is the weight vector and $w_0$ the threshold.
•2-class case:
  –Decide $w_1$ if $g(x) > 0$.
  –Decide $w_2$ if $g(x) < 0$.
•We can also do that with: decide $w_1$ if $w^T x > -w_0$.

Discriminant function = distance

•The discriminant function gives an algebraic measure of the distance.
•Take two vectors $x_1$ and $x_2$, both on the decision boundary. Then, write x as:
  $x = x_p + r \dfrac{w}{\|w\|}$, with $r = \dfrac{g(x)}{\|w\|}$,
  where $x_p$ is the projection of x onto g(x) = 0 and r is the distance from x to g(x) = 0.
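A minimal sketch of $g(x) = w^T x + w_0$ and of the signed distance $r = g(x)/\|w\|$; the weight vector and sample below are illustrative:

```python
import numpy as np

w = np.array([2.0, -1.0])        # weight vector (illustrative)
w0 = 0.5                         # threshold / bias

def g(x):
    return w @ x + w0            # linear discriminant function

x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w)     # signed distance from x to the hyperplane g(x) = 0
label = 1 if g(x) > 0 else 2     # 2-class decision rule
print(g(x), r, label)
```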

Multicategory case

•Two straightforward algorithms:
  –Reduce the problem to C-1 2-class problems.
  –Construct C(C-1)/2 linear discriminant functions.
•A linear machine assigns x to $w_i$ if $g_i(x) > g_j(x) \ \forall j \ne i$.
•The decision boundaries (between two adjacent decision regions) are given by: $g_i(x) = g_j(x)$.
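A sketch of a linear machine, one $g_i$ per class, assigning x to the largest discriminant (the weights below are illustrative):

```python
import numpy as np

# One row of W and one entry of w0 per class (illustrative values, C = 3 classes).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])

def linear_machine(x):
    g = W @ x + w0               # g_i(x) for every class i
    return np.argmax(g)          # decide w_i if g_i(x) > g_j(x) for all j != i

print(linear_machine(np.array([0.5, 2.0])))   # -> index of the winning class
```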

Linearly Separable

•Our previous algorithm is general and can be applied even when the classes are not linearly separable.
•C classes are linearly separable iff for every $w_i$ there exists a linear classifier (hyperplane) such that all the samples of $w_i$ lie on its positive side ($g_i(x) > 0$) and all the samples of $w_j$ ($\forall j \ne i$) are on the negative side of $g_i(x)$.

Linear Classifier

•If the classes (or training data) are linearly separable, then there exists a unique $g_i(x) > 0$.
•This means that a new sample t can be classified as $w_i$ if $g_i(t) > 0$.
•Alternatively (when the data is not linearly separable), we can use:
  $w(t \mid w_1, \ldots, w_C, w_{01}, \ldots, w_{0C}) = \arg\max_i g_i(t)$,
  where the $w_i$ are the weights and the $w_{0i}$ the thresholds.

It is simpler to start with regression.

•Remember that regression is a related (but different) problem to that of classification.
•We search for that function g(x) which best interpolates a given training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^p$ and the $y_i$ are scalars.
•Here $g(x) = \langle w, x \rangle = w^T x$, and we have assumed $w_0 = 0$.
•We want to minimize: $f(x, y) = |y - g(x)| = |y - \langle w, x \rangle| = 0$.
•If enough training data is available, there exists a unique solution: $w^T X = y^T$, i.e., $w = X^{-T} y$.
•When there is noise (i.e., there does not exist a g(.) for which f(.) = 0), we use least squares (LS).
•The LS error function is given by: $E(w, X, y) = \sum_{i=1}^{n} (y_i - g(x_i))^2 = \|y - X^T w\|^2$.
•Now, differentiating with respect to w, we get: $-2 y^T X^T + 2 w^T X X^T = 0$, so $w^T X X^T = y^T X^T$.
•If the inverse exists, then: $w = (X X^T)^{-1} X y$.
•When there are fewer samples than dimensions ($n < p$), $X X^T$ is singular; this motivates ridge regression (next).
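A short sketch of the least-squares solution $w = (X X^T)^{-1} X y$ on synthetic data, with samples stored as the columns of X as in the derivation above:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 3, 50
X = rng.normal(size=(p, n))                  # samples as columns
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=n)  # noisy targets

# Normal equations: w = (X X^T)^{-1} X y  (solve the system instead of inverting).
w = np.linalg.solve(X @ X.T, X @ y)
print(w)                                     # close to w_true
```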

Ridge Regression

•We re-formulate the problem as:
  $\min_w E(w, X, y) = \min_w \lambda \|w\|^2 + \sum_{i=1}^{n} (y_i - g(x_i))^2$.
•Differentiating with respect to the parameters, we get:
  $w^T (X X^T + \lambda I_p) = y^T X^T$,
  $w = (X X^T + \lambda I_p)^{-1} X y$.   (Primal solution.)
•Note that we could have also written w as a function of the inputs X:
  $w = \lambda^{-1} X (y - X^T w) = X \alpha$, with $\alpha = \lambda^{-1} (y - X^T w)$,
  $\lambda \alpha = y - X^T X \alpha$,
  $(X^T X + \lambda I_n) \alpha = y$,
  $\alpha = (G + \lambda I_n)^{-1} y$.
•$G = X^T X$ is the Gram matrix: $G_{ij} = \langle x_i, x_j \rangle$.
•Dual solution: $w = \sum_{i=1}^{n} \alpha_i x_i$.
•Note that for any given $\lambda$, we choose that solution which minimizes the norm of w.
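A sketch checking that the primal and dual ridge solutions coincide, $w = (X X^T + \lambda I_p)^{-1} X y = X (G + \lambda I_n)^{-1} y$ (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, lam = 4, 30, 0.1
X = rng.normal(size=(p, n))                   # samples as columns
y = rng.normal(size=n)

# Primal solution: w = (X X^T + lam I_p)^{-1} X y   (a p x p system).
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

# Dual solution: alpha = (G + lam I_n)^{-1} y with Gram matrix G = X^T X,
# and w = X alpha   (an n x n system).
G = X.T @ X
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X @ alpha

print(np.allclose(w_primal, w_dual))          # both routes give the same weight vector
```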

A look back at PCA

•Remember that to compute the PCs of a distribution when $n < p$, we worked with the $n \times n$ matrix $X^T X$ (the Gram matrix) rather than the $p \times p$ matrix $X X^T$.

Generalized Linear Discriminant

•We can rewrite our linear discriminant as:
  $g(x) = a_1 + a_2 x + a_3 x^2 = a^T y$,
  where $y = (1, x, x^2)^T$.

Kernels

•We can also address this problem using kernels.
•If we consider $x \in \mathbb{R}^p \to \phi(x) \in F \subseteq \mathbb{R}^q$, we can write $f(x, y) = y - \langle w, \phi(x) \rangle$.
•We have already shown that this implies: $G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ (the kernel).
•We now have: $g(x) = y^T (G + \lambda I_n)^{-1} k$, where $k_i = \langle \phi(x_i), \phi(x) \rangle$.

Training

•The weight vector a must be determined from training observations (samples) of the world.
•The data must also be linearly separable, at least in y.
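A sketch of kernel ridge regression, $g(x) = y^T (G + \lambda I_n)^{-1} k$, with an assumed Gaussian kernel and synthetic 1-D data:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(40, 1))              # training inputs (rows)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)   # noisy targets

lam = 0.1
G = rbf_kernel(X, X)                              # Gram matrix G_ij = <phi(x_i), phi(x_j)>
coef = np.linalg.solve(G + lam * np.eye(40), y)   # (G + lam I)^{-1} y

def g(x_new):
    k = rbf_kernel(X, x_new.reshape(1, -1))[:, 0]   # k_i = <phi(x_i), phi(x)>
    return coef @ k                                 # g(x) = y^T (G + lam I)^{-1} k

print(g(np.array([0.5])), np.sin(0.5))            # prediction vs. true function
```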

Complexity

•The overall complexity of computing the solution is $O(n^3 + n^2 q)$.
•And, to evaluate g on a new sample, $O(nq)$.
•Kernels are very useful, because they allow us to compute classification in a space of q dimensions (generally, $q \gg p$), without the need to explicitly calculate the non-linear projection; only the kernel $\kappa(x, z)$ is required.
•Do not be deceived though. Methods based on kernels are computationally expensive and slow.

Linearly Separable: 2-class problem

•The goal is to find the vector a that correctly classifies all samples $y_i$.
•Classification:
  $a^T y_i > 0 \Rightarrow w_1$,
  $a^T y_i < 0 \Rightarrow w_2$.
•"Normalization": the replacement of all the samples of $w_2$ by their negatives. Then:
  $a^T y_i > 0, \ \forall i$.

"Normalized": all samples are positive.

Gradient descent procedure

•We define a criterion function J(a) that is minimized when a is a solution of $a^T y_i > 0$.
•We start with some arbitrary value and then use the direction of the gradient: $\nabla J(a)$.
•Learning equation:
  $a(k+1) = a(k) - \eta(k) \nabla J(a(k))$,
  where $\eta(k)$ is the learning rate.

Newton’s descent

•Uses: $a(k+1) = a(k) - H^{-1} \nabla J$, where H is the Hessian matrix of J.
•Algorithm:
  1. Do $a \leftarrow a - H^{-1} \nabla J$
  2. Until $\|H^{-1} \nabla J(a)\| <$ threshold
  3. Return a.

•One of the simplest criteria used to find a solution of $a^T y_i > 0$ is the Perceptron criterion.
•It only uses those feature vectors that are misclassified:
  $J_P(a) = \sum_{y \in Y} (-a^T y)$, where Y is the set of misclassified samples.
•This criterion is never negative.
•When this is zero, a is a solution.
•We now calculate the gradient: $\nabla J_P(a) = \sum_{y \in Y} (-y)$.
•And include this solution in our learning equation:
  $a(k+1) = a(k) + \eta(k) \sum_{y \in Y} y$.
•Single-sample algorithm (k = 0):
  –Do k = (k+1) mod n
    •If $y^k$ is misclassified by a, then $a \leftarrow a + y^k$
  –Until all patterns are properly classified
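A sketch of the single-sample Perceptron rule above, i.e. the learning equation instantiated with the Perceptron gradient; the data below are illustrative and already "normalized":

```python
import numpy as np

def perceptron(Y, eta=1.0, max_epochs=1000):
    """Single-sample Perceptron on 'normalized' augmented samples Y (one per row):
    samples of class w2 have already been replaced by their negatives, so the
    goal is a^T y_k > 0 for every row y_k."""
    n, d = Y.shape
    a = np.zeros(d)                       # arbitrary starting value
    for _ in range(max_epochs):
        errors = 0
        for y in Y:
            if a @ y <= 0:                # y is misclassified
                a = a + eta * y           # a <- a + eta * y
                errors += 1
        if errors == 0:                   # all patterns properly classified
            return a
    return a

# Illustrative, linearly separable data: augment with a leading 1, negate class-2 samples.
X1 = np.array([[1.0, 2.0], [2.0, 3.0]])
X2 = np.array([[-1.0, -1.0], [-2.0, 0.0]])
Y = np.vstack([np.hstack([np.ones((2, 1)), X1]),
               -np.hstack([np.ones((2, 1)), X2])])
print(perceptron(Y))
```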

[Figure: criterion functions as a function of a: the number of patterns misclassified vs. the Perceptron criterion $J_P(a) = \sum_{y \in Y} (-a^T y)$.]

Perceptron: Batch variable increment

•We include a margin b in the solution: $a^T(k)\, y^k > b, \ \forall k$.
•And the set of properly classified samples is given by: $a^T y^k > b$.
•The selection of b is generally difficult.

Convergence

•Theorem (Perceptron Convergence): If the training samples are linearly separable, then the sequence of weight vectors given by the Perceptron algorithm will terminate at a solution vector.

Winnow Algorithm

•Considers the "positive" and "negative" weight vectors separately: $a^+$ and $a^-$.
•Corrections on the positive weight are made only when one or more training patterns in $w_1$ are misclassified.
•Corrections on the negative weight are made only when patterns of $w_2$ are misclassified.

Relaxation Procedure

•The Perceptron criterion described above is by no means the only solution.
•Another possibility would be: $J_q(a) = \sum_{y \in Y} (a^T y)^2$.
•Its gradient is continuous (whereas that of $J_P$ is not).
•Unfortunately, the function is too smooth. Convergence is usually at a = 0.

•Another problem with $J_q$ is that it is dominated by the largest training vectors.
•Instead, we usually use the following (with margin b):
  $J_r(a) = \dfrac{1}{2} \sum_{y \in Y} \dfrac{(a^T y - b)^2}{\|y\|^2}$,
  where Y is the set of samples for which $a^T y \le b$.
•Its gradient is:
  $\nabla J_r = \sum_{y \in Y} \dfrac{a^T y - b}{\|y\|^2}\, y$.
•And the learning (update) equation is:
  $a(k+1) = a(k) + \eta(k) \sum_{y \in Y} \dfrac{b - a^T y}{\|y\|^2}\, y$.
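A sketch of the batch relaxation update above; the margin b and the learning rate are assumed values, and the data reuse the normalized samples of the Perceptron sketch:

```python
import numpy as np

def relaxation(Y, b=1.0, eta=1.5, max_iter=2000):
    """Batch relaxation with margin b (a sketch of the update equation above):
    a(k+1) = a(k) + eta * sum over misclassified y of (b - a^T y)/||y||^2 * y,
    where a sample counts as misclassified while a^T y <= b; use 0 < eta < 2."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= b]                        # current set Y of misclassified samples
        if len(mis) == 0:
            break
        norms2 = (mis ** 2).sum(axis=1)
        a = a + eta * (((b - mis @ a) / norms2)[:, None] * mis).sum(axis=0)
    return a

# Same 'normalized' augmented samples as in the Perceptron sketch.
Y = np.array([[ 1., 1., 2.], [ 1., 2., 3.], [-1., 1., 1.], [-1., 2., 0.]])
a = relaxation(Y)
print(a, Y @ a)                                    # every entry of Y a exceeds the margin b
```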

Minimum Square-Error Procedure

•We will now build a closed-form solution that uses all samples $y_i$ simultaneously.
•For that we shall try to make $a^T y_k = b_k$.
•A solution is given by the following set of linear equations: stack the samples as the rows of an $n \times (d+1)$ matrix Y with entries $y_{ik}$, $k = 0, \ldots, d$, and let $a = (a_0, a_1, \ldots, a_d)^T$ and $b = (b_1, \ldots, b_n)^T$.
•In matrix form: $Ya = b$.
•Unfortunately, Y is rectangular and there are more equations than unknowns; the system for a is, therefore, overdetermined.
•Again, we need to minimize an error, e.g. the error vector $Ya - b$, from which we can define the following criterion:
  $J_S(a) = \|Ya - b\|^2 = \sum_{i=1}^{n} (a^T y_i - b_i)^2$.

•The gradient is:
  $\nabla J_S = \sum_{i=1}^{n} 2 (a^T y_i - b_i) y_i = 2 Y^T (Ya - b)$.
•Now, we want to minimize this:
  $Y^T (Ya - b) = 0$,
  $Y^T Y a - Y^T b = 0$,
  $Y^T Y a = Y^T b$.
•Why is this new equation important? Because $Y^T Y$ is a square matrix.
•A solution is now given by:
  $a = (Y^T Y)^{-1} Y^T b = Y^* b$,
  where $Y^* = (Y^T Y)^{-1} Y^T$ is the pseudoinverse (the MSE solution).
•If Y is square and nonsingular, the pseudoinverse equals the inverse.
•A solution can always be found with:
  $Y^* = \lim_{\epsilon \to 0} (Y^T Y + \epsilon I)^{-1} Y^T$.
•MSE = LDA (for an appropriate choice of the margin vector b).

A simple example

•$w_1$: $(1, 2)^T$, $(2, 0)^T$;  $w_2$: $(3, 1)^T$, $(2, 3)^T$.
•$g(x) = a^T (1, x_1, x_2)^T$.
•$Y = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}$,  $Y^* = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}$.

LMS (least-mean-square) procedure

•Two problems of MSE:
  1. $Y^T Y$ can be singular.
  2. Y is generally a very large matrix.
•We can solve this by minimizing the above defined criterion, $J_S(a) = \|Ya - b\|^2$, by gradient descent.
•The update equation would be:
  $a(k) = a(k-1) + \eta(k)\, Y^T \big( b - Y a(k-1) \big)$.
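A quick numpy check of the example above; the margin vector $b = (1, 1, 1, 1)^T$ is an assumed choice, it is not given on the slide:

```python
import numpy as np

# Normalized augmented samples of the example (class-2 rows already negated).
Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])

Y_star = np.linalg.pinv(Y)     # matches the fractions on the slide (5/4, 13/12, ...)
print(Y_star)

b = np.ones(4)                 # assumed margin vector
a = Y_star @ b                 # MSE solution a = Y* b
print(a, Y @ a)                # Y a > 0: the MSE solution separates this data
```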

•Usually, however, we consider the samples sequentially:
  $a(k) = a(k-1) + \eta(k) \big( b_k - y_k^T a(k-1) \big) y_k$.
•Advantage: less memory is required.
•It can be shown that with $\eta(k) = \eta(1)/k$ this algorithm converges.
•Unfortunately, the solution need not give a separating vector (even if one exists).
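A sketch of the sequential LMS rule with $\eta(k) = \eta(1)/k$; the starting rate is an assumption and the data are those of the example above:

```python
import numpy as np

def lms(Y, b, eta0=0.1, epochs=200):
    """Sequential LMS (Widrow-Hoff): a(k) = a(k-1) + eta(k) (b_k - y_k^T a) y_k,
    with eta(k) = eta(1)/k so the updates shrink over time."""
    n, d = Y.shape
    a = np.zeros(d)
    k = 0
    for _ in range(epochs):
        for yk, bk in zip(Y, b):
            k += 1
            eta = eta0 / k
            a = a + eta * (bk - yk @ a) * yk
    return a

Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)
print(lms(Y, b))    # slowly approaches the MSE solution; it need not be a separating vector
```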

So far, so good

•The Perceptron finds the separating vector when the data is linearly separable, but does not converge on non-separable sets.
•Convergence is guaranteed in MSE & LMS, but the resulting vector need not be the correct one.
•We will now define another algorithm which attempts to address this problem.

The Ho-Kashyap procedure

•The MSE minimizes: $\|Ya - b\|^2$.
•There exists a b for which the MSE yields a correct solution. However, b is unknown.
•The Ho-Kashyap procedure searches for both a and b. The equation to minimize is:
  $J_S(a, b) = \|Ya - b\|^2$.

•And its gradients are:
  $\nabla_a J_S = 2 Y^T (Ya - b)$,  $\nabla_b J_S = -2 (Ya - b)$.
•Because $a = Y^* b$, one only needs to minimize the criterion with respect to b.
•However, we also need to guarantee that b > 0. We can do that with an update equation which only takes positive steps:
  $b(1) > 0$,
  $b(k+1) = b(k) - \eta(k) \dfrac{1}{2} \big( \nabla_b J_S - |\nabla_b J_S| \big)$.
•If the samples are linearly separable, convergence is guaranteed.
•If they are not, then $Ya(k) - b(k) \le 0$ at some step and the algorithm stops. This can be used as a proof that the samples are not separable.
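A sketch of the Ho-Kashyap procedure as described above, alternating $a = Y^* b$ with the positive update of b; the step size and stopping thresholds are assumptions:

```python
import numpy as np

def ho_kashyap(Y, eta=0.5, tol=1e-8, max_iter=1000):
    """Ho-Kashyap: alternately set a = Y* b and grow b where the error is positive.
    Returns (a, b, separable_flag)."""
    n, d = Y.shape
    b = np.ones(n)                       # b(1) > 0
    Y_pinv = np.linalg.pinv(Y)
    a = Y_pinv @ b
    for _ in range(max_iter):
        a = Y_pinv @ b                   # a = Y* b
        e = Y @ a - b                    # error vector Ya - b
        if np.all(np.abs(e) < tol):
            return a, b, True            # converged to a separating vector
        if np.all(e <= 0) and np.any(e < 0):
            return a, b, False           # Ya - b <= 0: samples are not separable
        b = b + eta * (e + np.abs(e))    # only ever increases b (keeps b > 0)
    return a, b, True

Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
a, b, sep = ho_kashyap(Y)
print(sep, Y @ a)                        # all entries positive when a separates the data
```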

•As usual we have $g(y) = a^T y$, and we want to find the separating hyperplane, $z_k g(y_k) \ge 1$, with largest margin b:
  $\dfrac{z_k g(y_k)}{\|a\|} \ge b$.
•The goal is to find the weight vector a that maximizes b.
•This is the same as: find the separating hyperplane that is farthest from the most difficult patterns to classify.

Support Vector Machines

•As we did before, the goal is to represent the data in a high-dimensional space where the data is linearly separable.
•We use a nonlinear mapping $\phi(\cdot)$; for example a polynomial or a Gaussian mapping function.
•Optional homework: show that any dataset that belongs to two classes can be separated by a hyperplane if the data is represented in a sufficiently large space.

An SVM algorithm

•Recalculate a using the Perceptron algorithm and the worst-classified patterns:
  $a(k+1) = a(k) + \eta(k) \sum_{y \in \mathcal{Y}_{worst}} y$.
•The resulting a is determined only by the worst-classified patterns; these are the support vectors.

Support Vectors

•The samples that are closest to the margin are called the support vectors.
•We want to find those samples that are closest to the hyperplane that divides our feature space into two regions (+ and -).
•We can formulate this as follows:
  $r_i (a^T y_i + a_0) \ge$ margin,
  where i = {1, ..., n} (# of samples), and $r_i = 1$ if the i-th sample belongs to class 1 and -1 if the sample belongs to the second class.
•Our goal is to maximize the margin.
•To solve this problem we need a regularizing term. As we did before, we will minimize $\|a\|$. Formally,
  $\min_a \dfrac{1}{2} \|a\|^2$,
  subject to $r_i (a^T y_i + a_0) \ge 1$.

Soft Margin SVM

•Our previous SVM algorithms may not perform adequately when the data is not linearly separable.
•A soft margin SVM allows for the misclassification of a few samples while still maximizing the margin for the rest.
•We can model the error of each sample as follows:
  $r_i (a^T y_i + a_0) \ge 1 - \xi_i$,
  where $\xi_i$ plays the role of a reconstruction error (as in PCA).

Lagrange Multipliers

•We can solve this quadratic optimization problem using Lagrange multipliers:
  $L(a, \alpha) = \dfrac{1}{2} \|a\|^2 - \sum_{i=1}^{n} \alpha_i \big[ r_i (a^T y_i + a_0) - 1 \big]$.
•The goal is to minimize with respect to a and maximize with respect to $\alpha_i$.
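A hedged usage sketch with scikit-learn (the library, its SVC class, and the chosen C value are assumptions, not part of the slides): a soft-margin linear SVM whose C parameter trades margin width against the slack terms $\xi_i$.

```python
import numpy as np
from sklearn.svm import SVC        # assumes scikit-learn is installed

rng = np.random.default_rng(7)
X1 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[ 2.0, 0.0], scale=1.0, size=(50, 2))
X = np.vstack([X1, X2])
r = np.hstack([np.ones(50), -np.ones(50)])   # class indicators r_i = +1 / -1

# Soft-margin linear SVM: minimize 1/2 ||w||^2 + C * sum(xi_i)
# subject to r_i (w^T x_i + w_0) >= 1 - xi_i.
clf = SVC(kernel="linear", C=1.0).fit(X, r)

print("number of support vectors:", len(clf.support_vectors_))
print("w =", clf.coef_[0], " w0 =", clf.intercept_[0])
```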

Multiclass Classification

•Unfortunately, there is no general way to achieve multiclass classification.
•A learning machine: $g_i(x) = a_i^T y$, where y is the augmented vector. x is assigned to $w_i$ if $g_i(x) > g_j(x) \ \forall j$.
•Which is the same as: $a_i^T y_k > a_j^T y_k, \ \forall j \ne i$.

Kesler's construction

•For a sample $y_k$ of class $w_1$, we start with a set of c-1 inequalities:
  $a_1^T y_k - a_j^T y_k \ge 0, \quad j = 2, \ldots, c.$
•Define $\alpha = [a_1^T, a_2^T, \ldots, a_c^T]^T$ (c = # of classes) and
  $z_{12} = (y^T, -y^T, 0, \ldots, 0)^T, \ z_{13} = (y^T, 0, -y^T, \ldots, 0)^T, \ \ldots, \ z_{1c} = (y^T, 0, \ldots, 0, -y^T)^T.$
•The goal is to solve: $\alpha^T z_{ij} > 0$.
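A sketch of Kesler's construction for one augmented sample, using the block layout described above (the sample values are illustrative):

```python
import numpy as np

def kesler(y, i, c):
    """Kesler's construction: for an augmented sample y of class i, build the c-1
    vectors z_ij (j != i). Each z_ij has c blocks of len(y): +y in block i, -y in
    block j, zeros elsewhere, so a_i^T y > a_j^T y becomes [a_1,...,a_c]^T z_ij > 0."""
    d = len(y)
    zs = []
    for j in range(c):
        if j == i:
            continue
        z = np.zeros(c * d)
        z[i * d:(i + 1) * d] = y
        z[j * d:(j + 1) * d] = -y
        zs.append(z)
    return np.array(zs)

y = np.array([1.0, 0.5, 2.0])      # augmented sample (leading 1), illustrative
print(kesler(y, i=0, c=3))          # two rows: z_12 and z_13 in the slide's notation
```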

Generalization of MSE

•The only straightforward solution is to consider a set of c 2-class problems:
  $a_i^T y = 1, \ \forall y \in Y_i$ (the set of vectors in class i),
  $a_i^T y = -1, \ \forall y \notin Y_i$.
•In matrix form, with $A = [a_1, a_2, \ldots, a_c]$ and $B = [b_1, b_2, \ldots, b_c]$:
  $A = Y^* B$.
