INF 3300/4300 Lecture 10, Asbjørn Berge

Lecture outline
- Dimensionality reduction: why do we want to throw away information?
- Selecting a set of features: forward, backward and floating selection
- Linear projections: principal components and linear discriminants
INF 3300 / 4300, Autumn 2005
The ”curse” of dimensionality
Bellman (1961), referring to the computational complexity of searching the neighborhood of data points in high-dimensional settings. The term is commonly used in statistics to describe the problem of data sparsity.

Example: a 3-class pattern recognition problem
- Divide the feature space into uniform bins
- Compute the ratio of objects from each class in each bin
- For a new object, find the correct bin and assign it to the dominant class
- Preserving the resolution of the bins, 3 bins in 1D increases to 3^2 = 9 bins in 2D
- Roughly 3 examples per bin in 1D means about 27 examples are needed to preserve the density of examples in 2D
Using 3 features makes the problem worse: the number of bins is now 3^3 = 27, and 81 examples are needed to preserve the density of examples. Using the original number of examples (9) means that 2/3 of the feature space is empty!

Dividing the sample space into equally spaced bins is thus very inefficient. Can we beat the curse?
- Use prior knowledge (e.g. discard parts of the sample space or restrict the estimates)
- Increase the smoothness of the density estimate (wider bins)
- Reduce the dimensionality

In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade.
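A quick sketch makes the bin arithmetic above concrete (plain Python; the 3 bins per axis and 3 examples per bin are the numbers used in the lecture, while the function names are my own):

```python
# Curse of dimensionality: uniform binning with a fixed number of
# bins per axis. The bin count grows exponentially with dimension d.

def n_bins(d, bins_per_axis=3):
    """Total number of uniform bins in a d-dimensional feature space."""
    return bins_per_axis ** d

def examples_needed(d, bins_per_axis=3, per_bin=3):
    """Examples needed to keep roughly per_bin examples in every bin."""
    return per_bin * n_bins(d, bins_per_axis)

for d in (1, 2, 3):
    print(f"{d}D: {n_bins(d)} bins, {examples_needed(d)} examples needed")

# With only the original 9 examples in 3D, at most 9 of the 27 bins
# can be occupied, so at least 2/3 of the feature space stays empty.
empty_fraction = 1 - 9 / n_bins(3)
```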
Dimensionality reduction
In general there are two approaches to dimensionality reduction:
- Feature selection: choose a subset of the features
- Feature extraction: create a set of new features by combining the existing features

Feature selection

Given a feature set x = {x1, x2, …, xn}, find a subset ym = {xi1, xi2, …, xim}, with m < n, that optimizes an objective function J.

Motivation for feature selection:
- Features may be expensive to obtain (think blood samples, seismic measurements etc.)
- Selected features possess simple meanings (measurement units remain intact), so we may directly derive understanding of our problem from the classifier
- Features may be discrete or non-numeric (e.g. binary existence indicators)

Search strategy: exhaustive search implies evaluating C(n, m) = n!/(m!(n-m)!) subsets if we fix m, and 2^n subsets if we need to search over all possible m as well. Choosing 10 features out of 100 will result in roughly 10^13 queries to J. Obviously we need to guide the search!

Objective function

The objective function J should ”predict” classifier performance. Common choices are:
- Distance metrics: Euclidean (mean difference, minimum distance, maximum distance), parametric (Mahalanobis, Bhattacharyya) or information theoretic (divergence)
- Classifier performance itself: this requires generating a test set from the training data (e.g. by cross-validation)
Note that distance metrics are usually pairwise comparisons, so a C-class extension of the distance needs to be defined.

Naïve individual feature selection

Goal: select the two best features individually. Any reasonable objective J will rank the features J(x1) > J(x2) ≈ J(x3) > J(x4), so the features chosen are [x1, x2] or [x1, x3]. However, x4 is the only feature that provides complementary information to x1.

Sequential Forward Selection (SFS)

Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Yk + x+) when combined with the features Yk that have already been selected.

Algorithm:
1. Start with the empty set Y0 = {∅}
2. Select the next best feature x+ = argmax over x ∉ Yk of J(Yk + x)
3. Update Yk+1 = Yk + x+; k = k + 1
4. Go to 2

SFS performs best when the optimal subset has a small number of features. SFS cannot discard features that become obsolete when other features are added.

Sequential Backward Selection (SBS)

Starting from the full set, sequentially remove the feature x- whose removal results in the smallest decrease in the value of J(Yk - x-).

Algorithm:
1. Start with the full set Y0 = X
2. Remove the worst feature x- = argmax over x ∈ Yk of J(Yk - x)
3. Update Yk+1 = Yk - x-; k = k + 1
4. Go to 2

SBS performs best when the optimal subset has a large number of features. SBS cannot re-enable a feature once it has been discarded.

Plus-L Minus-R Selection (LRS)

If L > R, LRS starts from the empty set and repeatedly adds L features and removes R features. If L < R, LRS starts from the full set and repeatedly removes R features and then adds back L features.

Sequential Floating Selection (SFFS and SFBS)

An extension of the LRS algorithms with flexible backtracking capabilities: the number of forward and backward steps is not fixed in advance but determined dynamically from the data.

Vector spaces

A set of vectors {u1, u2, …, un} is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination x = a1u1 + a2u2 + … + anun. The coefficients {a1, a2, …, an} are called the components of the vector x with respect to the basis {ui}. In order to form a basis, it is necessary and sufficient that the vectors {ui} be linearly independent. A basis {ui} is said to be orthogonal if ui·uj = 0 for all i ≠ j, and orthonormal if in addition ui·ui = 1 for all i.

Linear transformations

A linear transformation is a mapping from a vector space R^N onto a vector space R^M, and is represented by a matrix. Given a vector x ∈ R^N, the corresponding vector y ∈ R^M is y = Wx, where W is an M × N matrix.

Eigenvectors and eigenvalues

Given an N × N matrix A, we say that v is an eigenvector if there exists a scalar λ (the eigenvalue) such that Av = λv. The zeros of the characteristic equation det(A - λI) = 0 are the eigenvalues of A. A is non-singular if and only if all eigenvalues are non-zero. If A is real and symmetric, all eigenvalues are real and the eigenvectors are orthogonal.

Interpretation: the eigenvectors of the covariance matrix Σ correspond to the principal axes of the equiprobability ellipses! The linear transformation defined by the eigenvectors of Σ leads to vectors that are uncorrelated regardless of the form of the distribution. If the distribution happens to be Gaussian, then the transformed vectors will even be statistically independent.

Feature extraction

Feature extraction can be stated as follows: given a feature space x ∈ R^N, find an optimal mapping y = f(x): R^N → R^M with M < N.

Signal representation vs classification: the two classical linear mappings differ in what they try to preserve.
- Principal components analysis (PCA): preserves signal variance (”randomness”); signal representation, unsupervised
- Linear discriminant analysis (LDA): preserves class discriminatory information; classification, supervised

PCA - Principal components analysis

Reduce the dimension while preserving the signal variance. Represent x as a linear combination of n orthonormal basis vectors [φ1 ⊥ φ2 ⊥ … ⊥ φn]: x = ∑_{i=1}^n yi φi. Approximate x with only m < n of the basis vectors, replacing the discarded coefficients by preselected constants bi:

x̂(m) = ∑_{i=1}^m yi φi + ∑_{i=m+1}^n bi φi

The approximation error is then Δx(m) = x - x̂(m) = ∑_{i=m+1}^n (yi - bi) φi. To measure the representation error we use the mean squared error, ε²(m) = E[ |Δx(m)|² ] = ∑_{i=m+1}^n E[(yi - bi)²]. The optimal values of bi can be found by taking the partial derivatives of the approximation error and equating them to zero, which gives bi = E[yi]: we replace the discarded dimensions by their expected value, which also feels intuitively correct.
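Anticipating the result derived next, that the optimal basis vectors φi are the eigenvectors of the covariance matrix, the whole construction can be checked numerically. A minimal sketch in Python (numpy assumed; the data and the choice m = 1 are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: 500 samples of a linearly mixed Gaussian.
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
mu = X.mean(axis=0)
Xc = X - mu                                  # centering implements b_i = E[y_i]

# Covariance with the 1/n convention, so the MSE identity below is exact.
cov = Xc.T @ Xc / len(Xc)
evals, evecs = np.linalg.eigh(cov)           # ascending order for symmetric cov
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Keep only the m leading eigenvectors and reconstruct the approximation.
m = 1
Y = Xc @ evecs[:, :m]                        # coefficients y_i of the kept basis
Xhat = mu + Y @ evecs[:, :m].T               # approximation of x

# Mean squared error equals the sum of the discarded eigenvalues.
mse = np.mean(np.sum((X - Xhat) ** 2, axis=1))
print(mse, evals[m:].sum())
```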
With bi = E[yi], each term of the MSE becomes the variance of yi = φi^T x, so ε²(m) = ∑_{i=m+1}^n φi^T Σx φi, where Σx is the covariance matrix of x. The orthonormality constraint on the basis vectors can be incorporated into the optimization using Lagrange multipliers λi:

ε²(m) = ∑_{i=m+1}^n [ φi^T Σx φi + λi (1 - φi^T φi) ]

Setting the partial derivatives with respect to φi to zero gives Σx φi = λi φi: the optimal φi and λi are the eigenvectors and eigenvalues of the covariance matrix Σx.

The optimal approximation of a random vector x ∈ R^N by a linear combination of m < N independent vectors is therefore obtained by projecting x onto the eigenvectors of Σx corresponding to the m largest eigenvalues. Note that this also implies that the resulting MSE is the sum of the discarded eigenvalues, ε²(m) = ∑_{i=m+1}^n λi.

PCA uses the eigenvectors of the covariance matrix Σx, and is thus able to find the independent axes of the data under the unimodal Gaussian assumption. For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes. The main limitation of PCA is that it is unsupervised, so it does not consider class separability: it is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance, and there is no guarantee that these directions are good features for discrimination.

PCA example: a 3D Gaussian with given parameters (figure omitted).

Linear Discriminant Analysis (LDA)

Goal: reduce the dimension while preserving class discriminatory information.

Strategy (2 classes): we have a set of samples x = {x1, x2, …, xn}, of which n1 belong to class ω1 and the rest, n2, belong to class ω2.
We obtain a scalar value by projecting x onto a line, y = w^T x, and select the w that maximizes the separability of the classes. To do so we need to define a measure of separation between the projections, J(w).

The mean vectors of each class in the spaces spanned by x and y are μi = (1/ni) ∑_{x∈ωi} x and μ̃i = (1/ni) ∑_{y∈ωi} y = w^T μi. A naïve choice of objective would be the projected mean difference, J(w) = |μ̃1 - μ̃2|.

Fisher's solution: maximize a function that represents the difference between the means, scaled by a measure of the within-class scatter. Define the classwise scatter (similar to a variance) s̃i² = ∑_{y∈ωi} (y - μ̃i)²; the total scatter of the projections, s̃1² + s̃2², is the within-class scatter. Fisher's criterion is then

J(w) = |μ̃1 - μ̃2|² / (s̃1² + s̃2²)

We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible.

To optimize w we need J(w) to be an explicit function of w. Redefine the scatter: s̃1² + s̃2² = w^T Sw w, where Sw = S1 + S2 is the within-class scatter matrix and Si = ∑_{x∈ωi} (x - μi)(x - μi)^T. The projected mean difference can be expressed by the original means, |μ̃1 - μ̃2|² = w^T Sb w, where Sb = (μ1 - μ2)(μ1 - μ2)^T is called the between-class scatter. The Fisher criterion in terms of Sw and Sb is

J(w) = (w^T Sb w) / (w^T Sw w)

To find the optimal w, differentiate J(w) and equate to zero. Dividing through by w^T Sw w, one gets the generalized eigenvalue problem Sw^(-1) Sb w = J(w) w, which has the solution w* = Sw^(-1) (μ1 - μ2).

LDA generalizes nicely to C classes. Instead of one projection y, we seek C-1 projections [y1, y2, …, yC-1] obtained from C-1 projection vectors W = [w1, w2, …, wC-1]: y = W^T x. The generalization of the within-class scatter is Sw = ∑_{i=1}^C Si, and the generalization of the between-class scatter is Sb = ∑_{i=1}^C ni (μi - μ)(μi - μ)^T, where μ is the mean of all samples.

Similar to the 2-class case, the mean vectors and scatter matrices for the projected samples can be expressed as S̃w = W^T Sw W and S̃b = W^T Sb W. We want a scalar objective function and use the ratio of matrix determinants:

J(W) = |W^T Sb W| / |W^T Sw W|

Another variant of the scalar criterion is the ratio of the traces of the matrices, i.e. J(W) = tr(W^T Sb W) / tr(W^T Sw W).

The matrix W* that maximizes this ratio can be shown to be composed of the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem Sb wi = λi Sw wi. Sb is the sum of C matrices of rank one or less, and the mean vectors are constrained by n μ = ∑_{i=1}^C ni μi, so Sb will be of rank C-1 or less. Only C-1 of the eigenvalues will be non-zero, so we can only find C-1 projection vectors.

Solving multiclass LDA:
1. Estimate Sw and find the C × p matrix of class centroids, M
2. Transform the centroid matrix into a space where Sw is diagonal: find Sw^(-1/2), a rotating and scaling linear transformation, and use it to transform M into M*
3. Compute Sb*, the scatter matrix of M*
4. Eigenanalyze Sb*, i.e. find the principal components of the centroids. Note that since we only have C centroids, these can span at most a (C-1)-dimensional subspace, which means that only C-1 principal components are non-zero. The eigenvectors of Sb* define the optimal projections W*
5. Transform the solutions back to the original space by inverting the linear transformation; the maximizing W is thus Sw^(-1/2) W*

Limitations of LDA:
- LDA produces at most C-1 feature projections
- LDA is parametric, since it assumes unimodal Gaussian likelihoods
- LDA will fail when the discriminatory information is not in the mean but in the variance of the data

LDA demo (not reproduced here)

Matlab implementations

Feature selection (PRTools):
- Evaluate features: feateval
- Selection strategies: featselm
- Distance measures: distmaha, distm, proxm

Linear projections (PRTools):
- Principal component analysis: pca, klm
- LDA: fisherm
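For readers without PRTools, the two-class Fisher discriminant w* = Sw^(-1)(μ1 - μ2) from the LDA section is easy to sketch directly. A minimal version in Python (numpy assumed; the toy data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes with equal covariance but different means.
X1 = rng.standard_normal((100, 2))                         # class omega_1
X2 = rng.standard_normal((120, 2)) + np.array([3.0, 2.0])  # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter Sw = S1 + S2 (unnormalized class covariances).
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
Sw = S1 + S2

# Fisher's solution for two classes: w* = Sw^(-1) (mu1 - mu2).
w = np.linalg.solve(Sw, mu1 - mu2)

# Project every sample onto the line y = w^T x and compare the classes.
y1, y2 = X1 @ w, X2 @ w
separation = abs(y1.mean() - y2.mean()) / (y1.std() + y2.std())
print(separation)
```

A large value of `separation` means the projected class means are far apart relative to the projected within-class spread, which is exactly what Fisher's criterion maximizes.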