4. Linear Discriminant Functions
Aleix M. Martinez
[email protected]
Handouts for ECE 874, 2007.

Why Linear?

•It is simple and intuitive.
•It minimizes a criterion error; e.g., sample risk, training error, margin, etc.
•It can be generalized to find non-linear discriminant regions.
•It is generally very difficult to calculate the distance of a testing sample to a nonlinear function.
•It works with a limited number of training samples.
•There is no need to estimate class distributions.

Distance to a nonlinear function

[Figure: the distance from a test sample to a nonlinear decision surface. From Murase & Nayar, 1995.]

Linear Discriminant Analysis

•If we have samples corresponding to two or more classes, we prefer to select those features that best discriminate between classes, rather than those that best describe the data.
•This will, of course, depend on the classifier.
•Assume our classifier is Bayes.
•Thus, we want to minimize the probability of error.
•We will develop a method based on scatter matrices.

Theorem

•Let the samples of two classes be Normally distributed in $\mathbb{R}^p$, with common covariance $\Sigma$. Then, the Bayes errors in the p-dimensional space and in the one-dimensional subspace given by
  $v = \Sigma^{-1}(\mu_1 - \mu_2) / \|\Sigma^{-1}(\mu_1 - \mu_2)\|$
are the same; where $\|x\|$ is the Euclidean norm of the vector x.
•That is, there is no loss in classification when reducing from p dimensions to one.

[Figure: the PCA and LDA projection directions for two Gaussian classes.]
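A minimal numpy sketch of the projection direction in the theorem; the class means and the shared covariance below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in R^p with a common covariance Sigma (illustrative data).
p = 5
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)          # shared covariance, positive definite

# Direction from the theorem: v = Sigma^{-1}(mu1 - mu2), normalized to unit length.
v = np.linalg.solve(Sigma, mu1 - mu2)
v /= np.linalg.norm(v)

# Projecting any sample x onto v gives the 1-D feature that preserves the Bayes error.
x = rng.multivariate_normal(mu1, Sigma)
print("1-D projection of x:", v @ x)
```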

Scatter matrices and separability criteria

•Within-class scatter matrix:
  $S_W = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (x_{ij} - \mu_j)(x_{ij} - \mu_j)^T$.
•Between-class scatter matrix:
  $S_B = \sum_{j=1}^{C} (\mu_j - \mu)(\mu_j - \mu)^T$.
•Note that: $\hat{\Sigma} = S_W + S_B$.
•To formulate criteria for class separability, we need to convert these matrices to numerical values, e.g.:
  $tr(S_2^{-1} S_1)$,  $\ln|S_1| - \ln|S_2|$,  $tr\,S_1 / tr\,S_2$.
•Typical combinations of scatter matrices are:
  $\{S_1, S_2\} = \{S_B, S_W\}, \{S_B, \hat{\Sigma}\}$, and $\{S_W, \hat{\Sigma}\}$.
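A short numpy sketch of the scatter matrices and of the criterion $tr(S_W^{-1} S_B)$ on synthetic labelled data (the function and variable names are mine, not from the slides):

```python
import numpy as np

def scatter_matrices(X, labels):
    """X: n x p data matrix, labels: length-n class labels."""
    mu = X.mean(axis=0)                        # global mean
    p = X.shape[1]
    S_W = np.zeros((p, p))                     # within-class scatter
    S_B = np.zeros((p, p))                     # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

# Illustrative data: three classes in R^4.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 30)

S_W, S_B = scatter_matrices(X, labels)
J = np.trace(np.linalg.solve(S_W, S_B))       # tr(S_W^{-1} S_B): larger means more separable
print("tr(S_W^-1 S_B) =", J)
```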

Ronald Fisher (1890-1962)

•Fisher was an eminent scholar and one of the great scientists of the first part of the 20th century. After graduating from Cambridge and being denied entry into the British army because of his poor eyesight, he worked as a statistician for six years before starting a farming business. While a farmer, he continued his genetics and statistics research. During this time, he developed the well-known analysis of variance (ANOVA) method. After the war, Fisher finally moved to Rothamsted Experimental Station. Among his many accomplishments, Fisher invented ANOVA, the technique of maximum likelihood (ML), Fisher Information, the concept of sufficiency, and the method now known as Linear Discriminant Analysis (LDA). During World War II, the field of eugenics suffered a big blow, mainly due to the Nazis' use of it as a justification for some of their actions. Fisher moved back to Rothamsted and then to Cambridge, where he retired. Fisher has been credited as one of the founders of modern statistics, and one cannot study pattern recognition without encountering several of his ground-breaking insights. Yet as great a statistician as he was, he also became a major figure in genetics. A classic quote in the Annals of Statistics reads: "I occasionally meet geneticists who ask me whether it is true that the great geneticist R.A. Fisher was also an important statistician."

A solution to LDA

•Again, we want to minimize the Bayes error.
•Therefore, we want the projection from Y to X that minimizes the error:
  $\hat{X}(p) = \sum_{i=1}^{p} y_i \phi_i + \sum_{i=p+1}^{n} b_i \phi_i$.
•The eigenvalue decomposition is the optimal transformation:
  $S_W^{-1} S_B \phi_i = \lambda_i \phi_i$.   (Simultaneous diagonalization.)
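A sketch of the eigenvalue solution $S_W^{-1} S_B \phi_i = \lambda_i \phi_i$, solved here as the equivalent generalized symmetric eigenproblem $S_B \phi = \lambda S_W \phi$ with scipy; the data and the small ridge added to $S_W$ are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(40, 6)) for m in (0.0, 1.5, 3.0)])
labels = np.repeat([0, 1, 2], 40)

# Scatter matrices (same construction as in the previous sketch).
mu = X.mean(axis=0)
S_W = np.zeros((6, 6)); S_B = np.zeros((6, 6))
for c in np.unique(labels):
    Xc = X[labels == c]
    mu_c = Xc.mean(axis=0)
    S_W += (Xc - mu_c).T @ (Xc - mu_c)
    S_B += np.outer(mu_c - mu, mu_c - mu)

# Generalized eigenproblem S_B phi = lambda S_W phi  <=>  S_W^{-1} S_B phi = lambda phi.
evals, evecs = eigh(S_B, S_W + 1e-8 * np.eye(6))
order = np.argsort(evals)[::-1]                 # largest eigenvalues first
V = evecs[:, order[:2]]                         # at most C-1 = 2 useful directions
Y = X @ V                                       # LDA projection of the data
print("leading eigenvalues:", evals[order[:3]])
print("projected data shape:", Y.shape)
```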

Example: Face Recognition

Limitations of LDA

•To prevent $S_W$ from becoming singular, we need N > d.
•There are only C-1 eigenvectors with nonzero eigenvalues.
•Again, this limits the number of features one can use.
•Nonparametric LDA is designed to solve the last problem (we'll see this later in the course).

PCA versus LDA

•In many applications the number of samples is relatively small compared to the dimensionality of the data.
•Even for simple PDFs, PCA can outperform LDA (on testing data).
•PCA is usually a safer bet, because all we try to do is to minimize the representation error.

Problems with Multi-class Eigen-based Algorithms

•In general, researchers define algorithms which are optimal in the 2-class case and then extend this idea (way of thinking) to the multi-class problem.
•This may cause problems.
•This is the case for eigen-based approaches which use the idea of scatter matrices defined above.

[Figure: samples drawn from underlying but unknown PDFs.]

•Let's define the general case: $M_1 V = M_2 V \Lambda$.
•This is the same as selecting those eigenvectors v that maximize:
  $\dfrac{v^T M_1 v}{v^T M_2 v}$.

•Note that this can only be achieved if $M_1$ and $M_2$ agree.
•The existence of a solution depends on the angle between the eigenvectors of $M_1$ and $M_2$.
  ($v_i$ is the i-th vector of the solution space; $w_i$ are the eigenvectors of $M_1$; $u_i$ are the eigenvectors of $M_2$.)

How to know?

$K = \sum_{i=1}^{r} \sum_{j=1}^{i} \cos^2 \theta_{ij} = \sum_{i=1}^{r} \sum_{j=1}^{i} \big( u_j^T w_i \big)^2,$

where r < q and q is the number of eigenvectors of M1.

•The larger K is, the less probable it is that the results will be correct.
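A sketch of K under the reading of the formula above, i.e. summing the squared cosines $(u_j^T w_i)^2$ between the leading eigenvectors of $M_1$ and $M_2$ (the matrices below are illustrative):

```python
import numpy as np

def agreement_K(M1, M2, r):
    """K = sum_{i<=r} sum_{j<=i} (u_j^T w_i)^2, with w_i and u_i the eigenvectors
    of M1 and M2 sorted by decreasing eigenvalue."""
    def sorted_evecs(M):
        evals, evecs = np.linalg.eigh(M)        # assumes symmetric M
        return evecs[:, np.argsort(evals)[::-1]]
    W, U = sorted_evecs(M1), sorted_evecs(M2)
    K = 0.0
    for i in range(r):
        for j in range(i + 1):
            K += (U[:, j] @ W[:, i]) ** 2       # squared cosine (unit-norm eigenvectors)
    return K

# Illustrative symmetric matrices.
rng = np.random.default_rng(3)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
M1, M2 = A @ A.T, B @ B.T
print("K =", agreement_K(M1, M2, r=3))
```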

Classification: The linear case

Decision Surfaces

•A linear discriminant function can be mathematically written as:
  $g(x) = w^T x + w_0$,
  where w is the weight vector and $w_0$ the threshold.
•2-class case:
  –Decide $w_1$ if $g(x) > 0$.
  –Decide $w_2$ if $g(x) < 0$.
•We can also do that with: decide $w_1$ if $w^T x > -w_0$.

Discriminant function = distance

•The discriminant function gives an algebraic measure of the distance.
•Take two vectors $x_1$ and $x_2$, both on the decision boundary. Then, write x as:
  $x = x_p + r \dfrac{w}{\|w\|}$, with $r = \dfrac{g(x)}{\|w\|}$,
  where $x_p$ is the projection of x onto g(x) = 0 and r is the distance from x to g(x) = 0.
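A minimal sketch of $g(x) = w^T x + w_0$ and of the signed distance $r = g(x)/\|w\|$; the weight vector and sample below are illustrative:

```python
import numpy as np

w = np.array([2.0, -1.0])        # weight vector (illustrative)
w0 = 0.5                         # threshold / bias

def g(x):
    return w @ x + w0            # linear discriminant function

x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w)     # signed distance from x to the hyperplane g(x) = 0
label = 1 if g(x) > 0 else 2     # 2-class decision rule
print(g(x), r, label)
```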

Multicategory case

•Two straightforward algorithms:
  –Reduce the problem to C-1 2-class problems.
  –Construct C(C-1)/2 linear discriminant functions.
•A linear machine assigns x to $w_i$ if $g_i(x) > g_j(x) \ \forall j \ne i$.
•The decision boundaries (between two adjacent decision regions) are given by: $g_i(x) = g_j(x)$.
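A sketch of a linear machine, one $g_i$ per class, assigning x to the largest discriminant (the weights below are illustrative):

```python
import numpy as np

# One row of W and one entry of w0 per class (illustrative values, C = 3 classes).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])

def linear_machine(x):
    g = W @ x + w0               # g_i(x) for every class i
    return np.argmax(g)          # decide w_i if g_i(x) > g_j(x) for all j != i

print(linear_machine(np.array([0.5, 2.0])))   # -> index of the winning class
```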

Linearly Separable

•Our previous algorithm is general and can be applied even when the classes are not linearly separable.
•C classes are linearly separable iff for every $w_i$ there exists a linear classifier (hyperplane) such that all the samples of $w_i$ lie on its positive side ($g_i(x) > 0$) and all the samples of $w_j$ ($\forall j \ne i$) are on the negative side of $g_i(x)$.

Linear Classifier

•If the classes (or training data) are linearly separable, then there exists a unique $g_i(x) > 0$.
•This means that a new sample t can be classified as $w_i$ if $g_i(t) > 0$.
•Alternatively (when the data is not linearly separable), we can use:
  $w(t \mid w_1, \ldots, w_C, w_{01}, \ldots, w_{0C}) = \arg\max_i g_i(t)$,
  where the $w_i$ are the weights and the $w_{0i}$ the thresholds.

It is simpler to start with regression.

•Remember that regression is a related (but different) problem to that of classification.
•We search for that function g(x) which best interpolates a given training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^p$ and the $y_i$ are scalars.
•Here $g(x) = \langle w, x \rangle = w^T x$, and we have assumed $w_0 = 0$.
•We want to minimize: $f(x, y) = |y - g(x)| = |y - \langle w, x \rangle| = 0$.
•If enough training data is available, there exists a unique solution: $w^T X = y^T$, i.e., $w = X^{-T} y$.
•When there is noise (i.e., there does not exist a g(.) for which f(.) = 0), we use least squares (LS).
•The LS error function is given by: $E(w, X, y) = \sum_{i=1}^{n} (y_i - g(x_i))^2 = \|y - X^T w\|^2$.
•Now, differentiating with respect to w, we get: $-2 y^T X^T + 2 w^T X X^T = 0$, so $w^T X X^T = y^T X^T$.
•If the inverse exists, then: $w = (X X^T)^{-1} X y$.
•When there are fewer samples than dimensions ($n < p$), $X X^T$ is singular; this motivates ridge regression (next).
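A short sketch of the least-squares solution $w = (X X^T)^{-1} X y$ on synthetic data, with samples stored as the columns of X as in the derivation above:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 3, 50
X = rng.normal(size=(p, n))                  # samples as columns
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=n)  # noisy targets

# Normal equations: w = (X X^T)^{-1} X y  (solve the system instead of inverting).
w = np.linalg.solve(X @ X.T, X @ y)
print(w)                                     # close to w_true
```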

Ridge Regression

•We re-formulate the problem as:
  $\min_w E(w, X, y) = \min_w \lambda \|w\|^2 + \sum_{i=1}^{n} (y_i - g(x_i))^2$.
•Differentiating with respect to the parameters, we get:
  $w^T (X X^T + \lambda I_p) = y^T X^T$,
  $w = (X X^T + \lambda I_p)^{-1} X y$.   (Primal solution.)
•Note that we could have also written w as a function of the inputs X:
  $w = \lambda^{-1} X (y - X^T w) = X \alpha$, with $\alpha = \lambda^{-1} (y - X^T w)$,
  $\lambda \alpha = y - X^T X \alpha$,
  $(X^T X + \lambda I_n) \alpha = y$,
  $\alpha = (G + \lambda I_n)^{-1} y$.
•$G = X^T X$ is the Gram matrix: $G_{ij} = \langle x_i, x_j \rangle$.
•Dual solution: $w = \sum_{i=1}^{n} \alpha_i x_i$.
•Note that for any given $\lambda$, we choose that solution which minimizes the norm of w.
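A sketch checking that the primal and dual ridge solutions coincide, $w = (X X^T + \lambda I_p)^{-1} X y = X (G + \lambda I_n)^{-1} y$ (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, lam = 4, 30, 0.1
X = rng.normal(size=(p, n))                   # samples as columns
y = rng.normal(size=n)

# Primal solution: w = (X X^T + lam I_p)^{-1} X y   (a p x p system).
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

# Dual solution: alpha = (G + lam I_n)^{-1} y with Gram matrix G = X^T X,
# and w = X alpha   (an n x n system).
G = X.T @ X
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X @ alpha

print(np.allclose(w_primal, w_dual))          # both routes give the same weight vector
```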

A look back at PCA

•Remember that to compute the PCs of a distribution when $n < p$, we worked with the $n \times n$ matrix $X^T X$ (the Gram matrix) rather than the $p \times p$ matrix $X X^T$.

Generalized Linear Discriminant

•We can rewrite our linear discriminant as:
  $g(x) = a_1 + a_2 x + a_3 x^2 = a^T y$,
  where $y = (1, x, x^2)^T$.

Kernels

•We can also address this problem using kernels.
•If we consider $x \in \mathbb{R}^p \to \phi(x) \in F \subseteq \mathbb{R}^q$, we can write $f(x, y) = y - \langle w, \phi(x) \rangle$.
•We have already shown that this implies: $G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ (the kernel).
•We now have: $g(x) = y^T (G + \lambda I_n)^{-1} k$, where $k_i = \langle \phi(x_i), \phi(x) \rangle$.

Training

•The weight vector a must be determined from training observations (samples) of the world.
•The data must also be linearly separable, at least in y.
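A sketch of kernel ridge regression, $g(x) = y^T (G + \lambda I_n)^{-1} k$, with an assumed Gaussian kernel and synthetic 1-D data:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(40, 1))              # training inputs (rows)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)   # noisy targets

lam = 0.1
G = rbf_kernel(X, X)                              # Gram matrix G_ij = <phi(x_i), phi(x_j)>
coef = np.linalg.solve(G + lam * np.eye(40), y)   # (G + lam I)^{-1} y

def g(x_new):
    k = rbf_kernel(X, x_new.reshape(1, -1))[:, 0]   # k_i = <phi(x_i), phi(x)>
    return coef @ k                                 # g(x) = y^T (G + lam I)^{-1} k

print(g(np.array([0.5])), np.sin(0.5))            # prediction vs. true function
```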

Complexity

•The overall complexity of computing the solution is $O(n^3 + n^2 q)$.
•And, to evaluate g on a new sample, $O(nq)$.
•Kernels are very useful, because they allow us to compute classification in a space of q dimensions (generally, $q \gg p$), without the need to explicitly calculate the non-linear projection; only the kernel $\kappa(x, z)$ is required.
•Do not be deceived though. Methods based on kernels are computationally expensive and slow.

Linearly Separable: 2-class problem

•The goal is to find the vector a that correctly classifies all samples $y_i$.
•Classification:
  $a^T y_i > 0 \Rightarrow w_1$,
  $a^T y_i < 0 \Rightarrow w_2$.
•"Normalization": the replacement of all the samples of $w_2$ by their negatives. Then:
  $a^T y_i > 0, \ \forall i$.

"Normalized": all samples are positive.

Gradient descent procedure

•We define a criterion function J(a) that is minimized when a is a solution of $a^T y_i > 0$.
•We start with some arbitrary value and then use the direction of the gradient: $\nabla J(a)$.
•Learning equation:
  $a(k+1) = a(k) - \eta(k) \nabla J(a(k))$,
  where $\eta(k)$ is the learning rate.

Newton’s descent

•Uses: $a(k+1) = a(k) - H^{-1} \nabla J$, where H is the Hessian matrix of J.
•Algorithm:
  1. Do $a \leftarrow a - H^{-1} \nabla J$
  2. Until $\|H^{-1} \nabla J(a)\| <$ threshold
  3. Return a.

•One of the simplest criteria used to find a solution of $a^T y_i > 0$ is the Perceptron criterion.
•It only uses those feature vectors that are misclassified:
  $J_P(a) = \sum_{y \in Y} (-a^T y)$, where Y is the set of misclassified samples.
•This criterion is never negative.
•When this is zero, a is a solution.
•We now calculate the gradient: $\nabla J_P(a) = \sum_{y \in Y} (-y)$.
•And include this solution in our learning equation:
  $a(k+1) = a(k) + \eta(k) \sum_{y \in Y} y$.
•Single-sample algorithm (k = 0):
  –Do k = (k+1) mod n
    •If $y^k$ is misclassified by a, then $a \leftarrow a + y^k$
  –Until all patterns are properly classified
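A sketch of the single-sample Perceptron rule above, i.e. the learning equation instantiated with the Perceptron gradient; the data below are illustrative and already "normalized":

```python
import numpy as np

def perceptron(Y, eta=1.0, max_epochs=1000):
    """Single-sample Perceptron on 'normalized' augmented samples Y (one per row):
    samples of class w2 have already been replaced by their negatives, so the
    goal is a^T y_k > 0 for every row y_k."""
    n, d = Y.shape
    a = np.zeros(d)                       # arbitrary starting value
    for _ in range(max_epochs):
        errors = 0
        for y in Y:
            if a @ y <= 0:                # y is misclassified
                a = a + eta * y           # a <- a + eta * y
                errors += 1
        if errors == 0:                   # all patterns properly classified
            return a
    return a

# Illustrative, linearly separable data: augment with a leading 1, negate class-2 samples.
X1 = np.array([[1.0, 2.0], [2.0, 3.0]])
X2 = np.array([[-1.0, -1.0], [-2.0, 0.0]])
Y = np.vstack([np.hstack([np.ones((2, 1)), X1]),
               -np.hstack([np.ones((2, 1)), X2])])
print(perceptron(Y))
```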

[Figure: criterion functions as a function of a: the number of patterns misclassified vs. the Perceptron criterion $J_P(a) = \sum_{y \in Y} (-a^T y)$.]

Perceptron: Batch variable increment

•We include a margin b in the solution: $a^T(k)\, y^k > b, \ \forall k$.
•And the set of properly classified samples is given by: $a^T y^k > b$.
•The selection of b is generally difficult.

Convergence

•Theorem (Perceptron Convergence): If the training samples are linearly separable, then the sequence of weight vectors given by the Perceptron algorithm will terminate at a solution vector.

Winnow Algorithm

•Considers the "positive" and "negative" weight vectors separately: $a^+$ and $a^-$.
•Corrections on the positive weight are made only when one or more training patterns in $w_1$ are misclassified.
•Corrections on the negative weight are made only when patterns of $w_2$ are misclassified.

Relaxation Procedure

•The Perceptron criterion described above is by no means the only solution.
•Another possibility would be: $J_q(a) = \sum_{y \in Y} (a^T y)^2$.
•Its gradient is continuous (whereas that of $J_P$ is not).
•Unfortunately, the function is too smooth. Convergence is usually at a = 0.

•Another problem with $J_q$ is that it is dominated by the largest training vectors.
•Instead, we usually use the following (with margin b):
  $J_r(a) = \dfrac{1}{2} \sum_{y \in Y} \dfrac{(a^T y - b)^2}{\|y\|^2}$,
  where Y is the set of samples for which $a^T y \le b$.
•Its gradient is:
  $\nabla J_r = \sum_{y \in Y} \dfrac{a^T y - b}{\|y\|^2}\, y$.
•And the learning (update) equation is:
  $a(k+1) = a(k) + \eta(k) \sum_{y \in Y} \dfrac{b - a^T y}{\|y\|^2}\, y$.
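A sketch of the batch relaxation update above; the margin b and the learning rate are assumed values, and the data reuse the normalized samples of the Perceptron sketch:

```python
import numpy as np

def relaxation(Y, b=1.0, eta=1.5, max_iter=2000):
    """Batch relaxation with margin b (a sketch of the update equation above):
    a(k+1) = a(k) + eta * sum over misclassified y of (b - a^T y)/||y||^2 * y,
    where a sample counts as misclassified while a^T y <= b; use 0 < eta < 2."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= b]                        # current set Y of misclassified samples
        if len(mis) == 0:
            break
        norms2 = (mis ** 2).sum(axis=1)
        a = a + eta * (((b - mis @ a) / norms2)[:, None] * mis).sum(axis=0)
    return a

# Same 'normalized' augmented samples as in the Perceptron sketch.
Y = np.array([[ 1., 1., 2.], [ 1., 2., 3.], [-1., 1., 1.], [-1., 2., 0.]])
a = relaxation(Y)
print(a, Y @ a)                                    # every entry of Y a exceeds the margin b
```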

Minimum Square-Error Procedure

•We will now build a closed-form solution that uses all samples $y_i$ simultaneously.
•For that we shall try to make $a^T y_k = b_k$.
•A solution is given by the following set of linear equations: stack the samples as the rows of an $n \times (d+1)$ matrix Y with entries $y_{ik}$, $k = 0, \ldots, d$, and let $a = (a_0, a_1, \ldots, a_d)^T$ and $b = (b_1, \ldots, b_n)^T$.
•In matrix form: $Ya = b$.
•Unfortunately, Y is rectangular and there are more equations than unknowns; the system for a is, therefore, overdetermined.
•Again, we need to minimize an error, e.g. the error vector $Ya - b$, from which we can define the following criterion:
  $J_S(a) = \|Ya - b\|^2 = \sum_{i=1}^{n} (a^T y_i - b_i)^2$.

•The gradient is:
  $\nabla J_S = \sum_{i=1}^{n} 2 (a^T y_i - b_i) y_i = 2 Y^T (Ya - b)$.
•Now, we want to minimize this:
  $Y^T (Ya - b) = 0$,
  $Y^T Y a - Y^T b = 0$,
  $Y^T Y a = Y^T b$.
•Why is this new equation important? Because $Y^T Y$ is a square matrix.
•A solution is now given by:
  $a = (Y^T Y)^{-1} Y^T b = Y^* b$,
  where $Y^* = (Y^T Y)^{-1} Y^T$ is the pseudoinverse (the MSE solution).
•If Y is square and nonsingular, the pseudoinverse equals the inverse.
•A solution can always be found with:
  $Y^* = \lim_{\epsilon \to 0} (Y^T Y + \epsilon I)^{-1} Y^T$.
•MSE = LDA (for an appropriate choice of the margin vector b).

A simple example

•$w_1$: $(1, 2)^T$, $(2, 0)^T$;  $w_2$: $(3, 1)^T$, $(2, 3)^T$.
•$g(x) = a^T (1, x_1, x_2)^T$.
•$Y = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}$,  $Y^* = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}$.

LMS (least-mean-square) procedure

•Two problems of MSE:
  1. $Y^T Y$ can be singular.
  2. Y is generally a very large matrix.
•We can solve this by minimizing the above defined criterion, $J_S(a) = \|Ya - b\|^2$, by gradient descent.
•The update equation would be:
  $a(k) = a(k-1) + \eta(k)\, Y^T \big( b - Y a(k-1) \big)$.
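A quick numpy check of the example above; the margin vector $b = (1, 1, 1, 1)^T$ is an assumed choice, it is not given on the slide:

```python
import numpy as np

# Normalized augmented samples of the example (class-2 rows already negated).
Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])

Y_star = np.linalg.pinv(Y)     # matches the fractions on the slide (5/4, 13/12, ...)
print(Y_star)

b = np.ones(4)                 # assumed margin vector
a = Y_star @ b                 # MSE solution a = Y* b
print(a, Y @ a)                # Y a > 0: the MSE solution separates this data
```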

•Usually, however, we consider the samples sequentially:
  $a(k) = a(k-1) + \eta(k) \big( b_k - y_k^T a(k-1) \big) y_k$.
•Advantage: less memory is required.
•It can be shown that with $\eta(k) = \eta(1)/k$ this algorithm converges.
•Unfortunately, the solution need not give a separating vector (even if one exists).
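A sketch of the sequential LMS rule with $\eta(k) = \eta(1)/k$; the starting rate is an assumption and the data are those of the example above:

```python
import numpy as np

def lms(Y, b, eta0=0.1, epochs=200):
    """Sequential LMS (Widrow-Hoff): a(k) = a(k-1) + eta(k) (b_k - y_k^T a) y_k,
    with eta(k) = eta(1)/k so the updates shrink over time."""
    n, d = Y.shape
    a = np.zeros(d)
    k = 0
    for _ in range(epochs):
        for yk, bk in zip(Y, b):
            k += 1
            eta = eta0 / k
            a = a + eta * (bk - yk @ a) * yk
    return a

Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)
print(lms(Y, b))    # slowly approaches the MSE solution; it need not be a separating vector
```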

So far, so good

•The Perceptron finds the separating vector when the data is linearly separable, but does not converge on non-separable sets.
•Convergence is guaranteed in MSE & LMS, but the resulting vector need not be the correct one.
•We will now define another algorithm which attempts to address this problem.

The Ho-Kashyap procedure

•The MSE minimizes: $\|Ya - b\|^2$.
•There exists a b for which the MSE yields a correct solution. However, b is unknown.
•The Ho-Kashyap procedure searches for both a and b. The equation to minimize is:
  $J_S(a, b) = \|Ya - b\|^2$.

•And its gradients are:
  $\nabla_a J_S = 2 Y^T (Ya - b)$,  $\nabla_b J_S = -2 (Ya - b)$.
•Because $a = Y^* b$, one only needs to minimize the criterion with respect to b.
•However, we also need to guarantee that b > 0. We can do that with an update equation which only takes positive steps:
  $b(1) > 0$,
  $b(k+1) = b(k) - \eta(k) \dfrac{1}{2} \big( \nabla_b J_S - |\nabla_b J_S| \big)$.
•If the samples are linearly separable, convergence is guaranteed.
•If they are not, then $Ya(k) - b(k) \le 0$ at some step and the algorithm stops. This can be used as a proof that the samples are not separable.
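A sketch of the Ho-Kashyap procedure as described above, alternating $a = Y^* b$ with the positive update of b; the step size and stopping thresholds are assumptions:

```python
import numpy as np

def ho_kashyap(Y, eta=0.5, tol=1e-8, max_iter=1000):
    """Ho-Kashyap: alternately set a = Y* b and grow b where the error is positive.
    Returns (a, b, separable_flag)."""
    n, d = Y.shape
    b = np.ones(n)                       # b(1) > 0
    Y_pinv = np.linalg.pinv(Y)
    a = Y_pinv @ b
    for _ in range(max_iter):
        a = Y_pinv @ b                   # a = Y* b
        e = Y @ a - b                    # error vector Ya - b
        if np.all(np.abs(e) < tol):
            return a, b, True            # converged to a separating vector
        if np.all(e <= 0) and np.any(e < 0):
            return a, b, False           # Ya - b <= 0: samples are not separable
        b = b + eta * (e + np.abs(e))    # only ever increases b (keeps b > 0)
    return a, b, True

Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
a, b, sep = ho_kashyap(Y)
print(sep, Y @ a)                        # all entries positive when a separates the data
```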

•As usual we have $g(y) = a^T y$, and we want to find the separating hyperplane, $z_k g(y_k) \ge 1$, with largest margin b:
  $\dfrac{z_k g(y_k)}{\|a\|} \ge b$.
•The goal is to find the weight vector a that maximizes b.
•This is the same as: find the separating hyperplane that is farthest from the most difficult patterns to classify.

Support Vector Machines

•As we did before, the goal is to represent the data in a high-dimensional space where the data is linearly separable.
•We use a nonlinear mapping $\phi(\cdot)$; for example a polynomial or a Gaussian mapping function.
•Optional homework: show that any dataset that belongs to two classes can be separated by a hyperplane if the data is represented in a sufficiently large space.

An SVM algorithm

•Recalculate a using the Perceptron algorithm and the worst-classified patterns:
  $a(k+1) = a(k) + \eta(k) \sum_{y \in \mathcal{Y}_{worst}} y$.
•The resulting a is determined only by the worst-classified patterns; these are the support vectors.

Support Vectors

•The samples that are closest to the margin are called the support vectors.
•We want to find those samples that are closest to the hyperplane that divides our feature space into two regions (+ and -).
•We can formulate this as follows:
  $r_i (a^T y_i + a_0) \ge$ margin,
  where i = {1, ..., n} (# of samples), and $r_i = 1$ if the i-th sample belongs to class 1 and -1 if the sample belongs to the second class.
•Our goal is to maximize the margin.
•To solve this problem we need a regularizing term. As we did before, we will minimize $\|a\|$. Formally,
  $\min_a \dfrac{1}{2} \|a\|^2$,
  subject to $r_i (a^T y_i + a_0) \ge 1$.

Soft Margin SVM

•Our previous SVM algorithms may not perform adequately when the data is not linearly separable.
•A soft margin SVM allows for the misclassification of a few samples while still maximizing the margin for the rest.
•We can model the error of each sample as follows:
  $r_i (a^T y_i + a_0) \ge 1 - \xi_i$,
  where $\xi_i$ plays the role of a reconstruction error (as in PCA).

Lagrange Multipliers

•We can solve this quadratic optimization problem using Lagrange multipliers:
  $L(a, \alpha) = \dfrac{1}{2} \|a\|^2 - \sum_{i=1}^{n} \alpha_i \big[ r_i (a^T y_i + a_0) - 1 \big]$.
•The goal is to minimize with respect to a and maximize with respect to $\alpha_i$.
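A hedged usage sketch with scikit-learn (the library, its SVC class, and the chosen C value are assumptions, not part of the slides): a soft-margin linear SVM whose C parameter trades margin width against the slack terms $\xi_i$.

```python
import numpy as np
from sklearn.svm import SVC        # assumes scikit-learn is installed

rng = np.random.default_rng(7)
X1 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[ 2.0, 0.0], scale=1.0, size=(50, 2))
X = np.vstack([X1, X2])
r = np.hstack([np.ones(50), -np.ones(50)])   # class indicators r_i = +1 / -1

# Soft-margin linear SVM: minimize 1/2 ||w||^2 + C * sum(xi_i)
# subject to r_i (w^T x_i + w_0) >= 1 - xi_i.
clf = SVC(kernel="linear", C=1.0).fit(X, r)

print("number of support vectors:", len(clf.support_vectors_))
print("w =", clf.coef_[0], " w0 =", clf.intercept_[0])
```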

Multiclass Classification

•Unfortunately, there is no general way to achieve multiclass classification.
•A learning machine: $g_i(x) = a_i^T y$, where y is the augmented vector. x is assigned to $w_i$ if $g_i(x) > g_j(x) \ \forall j$.
•Which is the same as: $a_i^T y_k > a_j^T y_k, \ \forall j \ne i$.

Kesler's construction

•For a sample $y_k$ of class $w_1$, we start with a set of c-1 inequalities:
  $a_1^T y_k - a_j^T y_k \ge 0, \quad j = 2, \ldots, c.$
•Define $\alpha = [a_1^T, a_2^T, \ldots, a_c^T]^T$ (c = # of classes) and
  $z_{12} = (y^T, -y^T, 0, \ldots, 0)^T, \ z_{13} = (y^T, 0, -y^T, \ldots, 0)^T, \ \ldots, \ z_{1c} = (y^T, 0, \ldots, 0, -y^T)^T.$
•The goal is to solve: $\alpha^T z_{ij} > 0$.
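A sketch of Kesler's construction for one augmented sample, using the block layout described above (the sample values are illustrative):

```python
import numpy as np

def kesler(y, i, c):
    """Kesler's construction: for an augmented sample y of class i, build the c-1
    vectors z_ij (j != i). Each z_ij has c blocks of len(y): +y in block i, -y in
    block j, zeros elsewhere, so a_i^T y > a_j^T y becomes [a_1,...,a_c]^T z_ij > 0."""
    d = len(y)
    zs = []
    for j in range(c):
        if j == i:
            continue
        z = np.zeros(c * d)
        z[i * d:(i + 1) * d] = y
        z[j * d:(j + 1) * d] = -y
        zs.append(z)
    return np.array(zs)

y = np.array([1.0, 0.5, 2.0])      # augmented sample (leading 1), illustrative
print(kesler(y, i=0, c=3))          # two rows: z_12 and z_13 in the slide's notation
```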

Generalization of MSE

•The only straightforward solution is to consider a set of c 2-class problems:
  $a_i^T y = 1, \ \forall y \in Y_i$ (the set of vectors in class i),
  $a_i^T y = -1, \ \forall y \notin Y_i$.
•In matrix form, with $A = [a_1, a_2, \ldots, a_c]$ and $B = [b_1, b_2, \ldots, b_c]$:
  $A = Y^* B$.
