
INF 3300/4300 Lecture 10
Asbjørn Berge

Lecture outline

- Why do we want to throw away information?
- Selecting a set of features
  - Forward, backward and floating selection
- Linear projections
  - Principal components
  - Linear discriminants


The ”curse” of dimensionality

- Bellman (1961) coined the term, referring to the computational complexity of searching the neighborhood of data points in high-dimensional settings
- Commonly used in statistics to describe the problem of data sparsity
- Example: a 3-class pattern recognition problem
  - Divide the feature space into uniform bins
  - Compute the ratio of objects from each class in each bin
  - For a new object, find the correct bin and assign it to the dominant class in that bin
- Preserving the resolution of the bins is costly:
  - 3 bins in 1D become 3² = 9 bins in 2D
  - With roughly 3 examples per bin in 1D (9 examples in total), we need 27 examples to preserve the density of examples in 2D

The ”curse” of dimensionality

- Using 3 features makes the problem worse (see the numeric sketch below):
  - The number of bins is now 3³ = 27
  - 81 examples are needed to preserve a density of 3 examples per bin
  - Using the original amount of examples (9) means that at least 2/3 of the feature space is empty!
- Dividing the sample space into equally spaced bins is thus very inefficient
- Can we beat the curse?
  - Use prior knowledge (e.g. discard parts of the sample space or restrict the estimates)
  - Increase the smoothness of the density estimate (wider bins)
  - Reduce the dimensionality
- In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade
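
The bin-counting argument above can be checked with a few lines of Python (a minimal sketch; the values of 3 bins per axis and 3 samples per bin are the ones used in the example):

    # Growth of the number of bins, and of the samples needed to keep
    # roughly 3 samples per bin, as the feature dimension increases.
    bins_per_axis = 3
    samples_per_bin = 3

    for dim in (1, 2, 3):
        n_bins = bins_per_axis ** dim          # 3, 9, 27 bins
        n_needed = samples_per_bin * n_bins    # 9, 27, 81 samples
        print(dim, n_bins, n_needed)

    # With only the original 9 samples in 3D, at most 9 of the 27 bins can be
    # occupied, so at least 2/3 of the bins are empty.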


Dimensionality reduction

- In general there are two approaches to dimensionality reduction:
  - Feature selection: choose a subset of the existing features
    - Given a feature set x = {x1, x2, ..., xn}, find a subset ym = {xi1, xi2, ..., xim} with m < n that optimizes an objective function J(y)
  - Feature extraction: create a set of new features by combining the existing features


Feature selection

- Motivation for feature selection:
  - Features may be expensive to obtain (think blood samples, seismic measurements, etc.)
  - Selected features keep a simple meaning (measurement units remain intact), so we may derive understanding of our problem directly from the classifier
  - Features may be discrete or non-numeric (e.g. binary existence indicators)
- Objective function J: "predicts" classifier performance
- Search strategy:
  - Exhaustive search implies C(n, m) = n!/(m!(n−m)!) subsets if we fix m, and 2^n subsets if we need to search over all possible m as well
  - Choosing 10 out of 100 features results in roughly 10^13 queries to J (see the sketch below)
  - Obviously we need to guide the search!
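
The subset counts above are easy to verify (a minimal sketch using only Python's standard library):

    from math import comb

    n, m = 100, 10
    print(comb(n, m))   # about 1.7e13 evaluations of J for a fixed subset size m
    print(2 ** n)       # about 1.3e30 evaluations if all subset sizes are searched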


Objective function

- Distance metrics:
  - Euclidean (mean difference, minimum distance, maximum distance)
  - Parametric (Mahalanobis, Bhattacharyya)
  - Information theoretic (divergence)
  - Note that distance metrics are usually pairwise comparisons, so a C-class extension of the distance needs to be defined
- Classifier performance:
  - Need to generate a test set from the training data (e.g. by cross validation)

Naïve individual feature selection

- Goal: select the two best features individually
- Any reasonable objective J will rank the features J(x1) > J(x2) ≈ J(x3) > J(x4)
- The features chosen are thus [x1, x2] or [x1, x3]
- However, x4 is the only feature that provides information complementary to x1 (see the sketch below)
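
This effect can be reproduced on a small synthetic data set (a sketch; the data construction, the Mahalanobis-style objective J and all names are illustrative choices, not material from the lecture). Ranking the features one at a time favours x1 and one of its redundant copies, while the jointly best pair contains the individually worst feature x4:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    n = 2000
    y = rng.integers(0, 2, n)                    # two classes
    x1 = y + rng.normal(0, 0.5, n)               # informative on its own
    x2 = x1 + rng.normal(0, 0.25, n)             # near-duplicate of x1
    x3 = x1 + rng.normal(0, 0.25, n)             # near-duplicate of x1
    x4 = y - x1 + rng.normal(0, 0.1, n)          # useless alone, complements x1
    X = np.column_stack([x1, x2, x3, x4])

    def J(cols):
        """Mahalanobis distance between the class means (pooled covariance)."""
        A, B = X[y == 0][:, cols], X[y == 1][:, cols]
        S = np.atleast_2d(0.5 * (np.cov(A, rowvar=False) + np.cov(B, rowvar=False)))
        d = A.mean(axis=0) - B.mean(axis=0)
        return float(d @ np.linalg.solve(S, d))

    print([round(J([i]), 2) for i in range(4)])        # ranks x1 best, x4 worst
    print({c: round(J(list(c)), 2) for c in combinations(range(4), 2)})
    # The pair of columns (0, 3), i.e. [x1, x4], scores far higher than [x1, x2].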


Sequential Forward Selection (SFS)

- Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Yk + x+) when combined with the features Yk that have already been selected
- Algorithm:
  1. Start with the empty set Y0 = {∅}
  2. Select the next best feature x+ = argmax over x ∉ Yk of J(Yk + x)
  3. Update Yk+1 = Yk + x+; k = k + 1
  4. Go to 2
- SFS performs best when the optimal subset has a small number of features
- SFS cannot discard features that become obsolete when other features are added

Sequential Backward Selection (SBS)

- Starting from the full set, sequentially remove the feature x− that results in the smallest decrease in the value of J(Yk − x−)
- Algorithm:
  1. Start with the full set Y0 = X
  2. Remove the worst feature x− = argmax over x ∈ Yk of J(Yk − x)
  3. Update Yk+1 = Yk − x−; k = k + 1
  4. Go to 2
- SBS performs best when the optimal subset has a large number of features
- SBS cannot bring back a feature once it has been discarded (a code sketch of both searches follows below)
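
A compact sketch of the two greedy searches, assuming (as in the toy example above) that the objective J takes a list of column indices; these are illustrative implementations, not code from the lecture:

    def sfs(X, J, m):
        """Sequential forward selection: greedily add the feature that gives
        the largest objective J when joined with those already selected."""
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < m:
            best = max(remaining, key=lambda f: J(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        return selected

    def sbs(X, J, m):
        """Sequential backward selection: greedily remove the feature whose
        removal keeps J of the remaining set as high as possible."""
        selected = list(range(X.shape[1]))
        while len(selected) > m:
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            selected.remove(worst)
        return selected

With the toy data and objective J from the earlier sketch, sfs(X, J, 2) should recover columns 0 and 3, i.e. x1 together with x4.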


Plus-L Minus-R Selection (LRS)

- If L > R, LRS starts from the empty set and repeatedly adds L features and removes R features; if L < R, it starts from the full set and repeatedly removes R features and then adds L features
- Algorithm:
  1. If L > R, start with the empty set Y0 = {∅}; else start with the full set Y0 = X and go to step 3
  2. Repeat the SFS step L times
  3. Repeat the SBS step R times
  4. Go to step 2
- LRS attempts to compensate for the weaknesses of SFS and SBS by backtracking
- It is difficult to find the optimal L and R

Sequential Floating Selection (SFFS and SFBS)

- An extension of the LRS algorithms with flexible backtracking capabilities
- Sequential Floating Forward Selection (SFFS) starts from the empty set
  - After each forward step, SFFS performs backward steps as long as the objective function increases
- Sequential Floating Backward Selection (SFBS) starts from the full set
  - After each backward step, SFBS performs forward steps as long as the objective function increases (a simplified code sketch of SFFS follows below)
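
A simplified sketch of SFFS, reusing the same J convention as above; it captures the idea of conditional backward steps that must beat the best value recorded for the smaller subset size, but it is not a faithful reimplementation of the original algorithm:

    def sffs(X, J, m):
        """Simplified sequential floating forward selection (SFFS)."""
        n_features = X.shape[1]
        selected, best = [], {}            # best J value recorded per subset size
        while len(selected) < m:
            # Forward (SFS) step: add the feature that maximizes J.
            f = max((g for g in range(n_features) if g not in selected),
                    key=lambda g: J(selected + [g]))
            selected.append(f)
            best[len(selected)] = max(best.get(len(selected), float("-inf")),
                                      J(selected))
            # Backward (SBS) steps: remove features only while the result beats
            # the best value seen so far at that size (the "floating" part).
            while len(selected) > 2:
                g = max(selected, key=lambda h: J([s for s in selected if s != h]))
                val = J([s for s in selected if s != g])
                if val > best.get(len(selected) - 1, float("-inf")):
                    selected.remove(g)
                    best[len(selected)] = val
                else:
                    break
        return selected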


Vector spaces

- A set of vectors {u1, u2, ..., un} is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination x = a1u1 + a2u2 + ... + anun
- The coefficients {a1, a2, ..., an} are called the components of the vector x with respect to the basis {ui}
- In order to form a basis, it is necessary and sufficient that the vectors {ui} be linearly independent
- A basis {ui} is said to be orthogonal if ui·uj = 0 for all i ≠ j
- A basis {ui} is said to be orthonormal if, in addition, ui·ui = 1 for all i

Linear transformation

- A linear transformation is a mapping from a vector space R^N onto a vector space R^M, and is represented by an M×N matrix W
- Given a vector x ∈ R^N, the corresponding vector y ∈ R^M is y = Wx (see the numeric example below)
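
A small numeric illustration of these definitions (a sketch; the basis, the matrix W and the numbers are arbitrary examples):

    import numpy as np

    theta = np.pi / 6
    u1 = np.array([np.cos(theta), np.sin(theta)])    # an orthonormal basis for R^2
    u2 = np.array([-np.sin(theta), np.cos(theta)])
    U = np.column_stack([u1, u2])
    print(np.allclose(U.T @ U, np.eye(2)))           # True: ui.uj = 0, ui.ui = 1

    x = np.array([2.0, 1.0])
    a = U.T @ x                                      # components of x w.r.t. {u1, u2}
    print(np.allclose(a[0] * u1 + a[1] * u2, x))     # True: x = a1*u1 + a2*u2

    W = np.array([[1.0, -1.0]])                      # a 1x2 matrix maps R^2 to R^1
    print(W @ x)                                     # y = Wx -> [1.]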


Eigenvectors and eigenvalues

- Given an N×N matrix A, we say that v is an eigenvector if there exists a scalar λ (the eigenvalue) such that Av = λv
- The zeroes of the characteristic equation det(A − λI) = 0 are the eigenvalues of A
- A is non-singular ⇔ all eigenvalues are non-zero
- A is real and symmetric ⇔ all eigenvalues are real and the eigenvectors are orthogonal

Interpretation of eigenvectors and eigenvalues

- The eigenvectors of the covariance matrix Σ correspond to the principal axes of the equiprobability ellipses!
- The linear transformation defined by the eigenvectors of Σ leads to vectors that are uncorrelated, regardless of the form of the distribution (see the sketch below)
- If the distribution happens to be Gaussian, then the transformed vectors will also be statistically independent
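
The decorrelation property is easy to verify numerically (a sketch; the covariance matrix below is an arbitrary example):

    import numpy as np

    rng = np.random.default_rng(1)
    Sigma = np.array([[3.0, 1.2],
                      [1.2, 1.0]])                      # example covariance matrix
    X = rng.multivariate_normal([0, 0], Sigma, 5000)    # correlated Gaussian samples

    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    Y = X @ evecs                                       # project onto the eigenvectors

    print(np.round(np.cov(Y, rowvar=False), 2))  # ~diagonal: components uncorrelated
    print(np.round(evals, 2))                    # variances along the principal axes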


Feature extraction

- Feature extraction can be stated as follows: given a feature space with samples xi ∈ R^N, find an optimal mapping y = f(x): R^N → R^M with M < N

Signal representation vs classification

- The search for the feature extraction mapping y = f(x) is guided by an objective, typically either signal representation or classification


Signal representation vs classification

- Principal components analysis (PCA): signal representation, unsupervised
- Linear discriminant analysis (LDA): classification, supervised

PCA - Principal components analysis

- Goal: reduce the dimensionality while preserving the signal variance ("randomness")
- Represent x as a linear combination of n orthonormal basis vectors [φ1 ⊥ φ2 ⊥ ... ⊥ φn]: x = ∑_{i=1}^n yiφi
- Approximate x using only m < n of the basis vectors, replacing the discarded components by preselected constants bi: x̂(m) = ∑_{i=1}^m yiφi + ∑_{i=m+1}^n biφi


PCA - Principal components analysis

- The approximation error is then Δx(m) = x − x̂(m) = ∑_{i=m+1}^n (yi − bi)φi
- To measure the representation error we use the mean squared error ε²(m) = E[‖Δx(m)‖²] = ∑_{i=m+1}^n E[(yi − bi)²]
- The optimal values of bi are found by taking the partial derivatives of the approximation error: ∂/∂bi E[(yi − bi)²] = −2(E[yi] − bi) = 0 ⇒ bi = E[yi]
- In other words, we replace the discarded components by their expected values (this also feels intuitively correct)


PCA - Principal components analysis

- The MSE can now be written as ε²(m) = ∑_{i=m+1}^n E[(yi − E[yi])²] = ∑_{i=m+1}^n φi^T Σx φi, where Σx is the covariance matrix of x
- The orthonormality constraint on the φi can be incorporated in the optimization using Lagrange multipliers λi: ε²(m) = ∑_{i=m+1}^n [φi^T Σx φi + λi(1 − φi^T φi)]
- Thus, we can find the optimal φi by partial derivation: ∂/∂φi [φi^T Σx φi + λi(1 − φi^T φi)] = 2(Σx φi − λiφi) = 0 ⇒ Σx φi = λiφi
- The optimal φi and λi are therefore the eigenvectors and eigenvalues of the covariance matrix Σx


PCA - Principal components analysis

- The optimal approximation of a random vector x ∈ R^n by a linear combination of m < n basis vectors is thus obtained by projecting x onto the eigenvectors φi corresponding to the m largest eigenvalues λi of the covariance matrix Σx
- Note that this also implies that the remaining mean squared error equals the sum of the discarded eigenvalues: ε²(m) = ∑_{i=m+1}^n λi (see the sketch below)
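
A numpy sketch of this result: PCA by eigendecomposition of the sample covariance matrix, applied to a synthetic 3-d Gaussian (all parameters are made up for illustration). It also checks numerically that the mean squared reconstruction error matches the sum of the discarded eigenvalues:

    import numpy as np

    def pca(X, m):
        """Return the m principal axes, the projected data and all eigenvalues."""
        mu = X.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
        order = np.argsort(evals)[::-1]               # sort eigenvalues descending
        Phi = evecs[:, order[:m]]                     # n x m matrix of eigenvectors
        return Phi, (X - mu) @ Phi, evals[order]

    rng = np.random.default_rng(0)
    Sigma_true = np.array([[4.0, 1.5, 0.5],
                           [1.5, 2.0, 0.3],
                           [0.5, 0.3, 0.5]])
    X = rng.multivariate_normal([1.0, -2.0, 0.0], Sigma_true, 10000)

    Phi, Y, evals = pca(X, m=2)
    X_hat = X.mean(axis=0) + Y @ Phi.T                # reconstruct from 2 components
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(round(mse, 3), round(evals[2:].sum(), 3))   # MSE ~ sum of discarded eigenvalues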


PCA - Principal components analysis

- PCA uses the eigenvectors of the covariance matrix Σx, and is thus able to find the independent axes of the data under the unimodal Gaussian assumption
  - For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes
- The main limitation of PCA is that it is unsupervised: it does not consider class separability
  - It is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance
  - There is no guarantee that these directions are good features for discrimination

PCA example

[Figure: PCA applied to a 3-d Gaussian with given parameters; not reproduced here]


Linear Discriminant Analysis (LDA)

- Goal: reduce the dimensionality while preserving class discriminatory information
- Strategy (2 classes):
  - We have a set of samples x = {x1, x2, ..., xn} where n1 belong to class ω1 and the remaining n2 to class ω2
  - Obtain a scalar value by projecting x onto a line y: y = w^T x
  - Select the w that maximizes the separability of the classes
- To find a good projection vector, we need to define a measure of separation between the projections, J(w)
- The mean vectors of each class in the spaces spanned by x and y are μi = (1/ni) ∑_{x∈ωi} x and μ̃i = (1/ni) ∑_{y∈ωi} y = w^T μi
- A naïve choice would be the projected mean difference, J(w) = |μ̃1 − μ̃2| = |w^T (μ1 − μ2)|


Linear Discriminant Analysis (LDA)

- Fisher's solution: maximize a function that represents the difference between the projected means, scaled by a measure of the within-class scatter
- Define the class-wise scatter (similar to the variance) of the projection y: s̃i² = ∑_{y∈ωi} (y − μ̃i)²
- s̃1² + s̃2² is the within-class scatter of the projected samples
- Fisher's criterion is then J(w) = |μ̃1 − μ̃2|² / (s̃1² + s̃2²)
- We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible
- To optimize w we need J(w) to be an explicit function of w:
  - Redefine the scatter in the original space: Si = ∑_{x∈ωi} (x − μi)(x − μi)^T and Sw = S1 + S2, where Sw is the within-class scatter matrix
  - The scatter of the projection y can then be written s̃1² + s̃2² = w^T Sw w
  - The projected mean difference can be expressed by the original means: |μ̃1 − μ̃2|² = w^T (μ1 − μ2)(μ1 − μ2)^T w = w^T Sb w, where Sb is called the between-class scatter matrix
  - The Fisher criterion in terms of Sw and Sb is J(w) = (w^T Sb w) / (w^T Sw w)


Linear Discriminant Analysis (LDA)

- To find the optimal w, differentiate J(w) with respect to w and equate to zero
  - Dividing by w^T Sw w one gets Sw^-1 Sb w − J(w) w = 0, a generalized eigenvalue problem
  - For two classes this has the solution w* = Sw^-1 (μ1 − μ2) (see the sketch below)
- LDA generalizes nicely to C classes
  - Instead of one projection y, we seek C−1 projections [y1, y2, ..., yC−1] given by C−1 projection vectors W = [w1, w2, ..., wC−1]: y = W^T x
  - The generalization of the within-class scatter is Sw = ∑_{i=1}^C Si, with Si = ∑_{x∈ωi} (x − μi)(x − μi)^T and μi = (1/ni) ∑_{x∈ωi} x
  - The generalization of the between-class scatter is Sb = ∑_{i=1}^C ni (μi − μ)(μi − μ)^T, with μ = (1/n) ∑_x x
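
A sketch of the two-class solution w* = Sw^-1 (μ1 − μ2) in numpy (the toy data and all names are illustrative):

    import numpy as np

    def fisher_lda_2class(X, y):
        """Two-class Fisher discriminant direction w* = Sw^-1 (mu1 - mu2)."""
        X1, X2 = X[y == 0], X[y == 1]
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)   # S1 + S2
        w = np.linalg.solve(Sw, mu1 - mu2)
        return w / np.linalg.norm(w)

    rng = np.random.default_rng(2)
    cov = [[2.0, 1.0], [1.0, 1.0]]
    X = np.vstack([rng.multivariate_normal([0, 0], cov, 500),
                   rng.multivariate_normal([2, 1], cov, 500)])
    y = np.repeat([0, 1], 500)

    w = fisher_lda_2class(X, y)
    proj = X @ w                       # scalar projections y = w^T x
    print(w, proj[y == 0].mean() - proj[y == 1].mean())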


Linear Discriminant Analysis (LDA)

- Similar to the 2-class example, the mean vectors and scatter matrices of the projected samples can be expressed as S̃w = W^T Sw W and S̃b = W^T Sb W
- We want a scalar objective function, and use the ratio of matrix determinants: J(W) = |S̃b| / |S̃w| = |W^T Sb W| / |W^T Sw W|
- Another scalar variant of the criterion is the ratio of the traces of the matrices, i.e. J(W) = tr(S̃b) / tr(S̃w)
- The matrix W* that maximizes this ratio can be shown to be composed of the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem Sb wi = λi Sw wi
- Sb is the sum of C matrices of rank one or less, and the mean vectors are constrained through the overall mean μ = (1/n) ∑_{i=1}^C ni μi, so Sb will be of rank C−1 or less
- Only C−1 of the eigenvalues will be non-zero, so we can only find C−1 projection vectors


Solving multiclass LDA

- Estimate Sw and find the C × p matrix of class centroids M
- Transform the centroid matrix into a space where Sw is diagonal:
  - Find Sw^(-1/2), a rotating and scaling linear transformation
  - Transform M: M* = M Sw^(-1/2)
- Compute Sb*, the scatter matrix of M*
- Eigenanalyze Sb*, i.e. find the principal components of the centroids
  - Note that since we only have C centroids, these can span at most a (C−1)-dimensional subspace, which means that only C−1 principal components are nonzero
  - The eigenvectors of Sb* define the optimal projections W*
- Transform the set of solutions back to the original space by performing the inverse of the linear transformation
  - Thus, the maximizing W is W = Sw^(-1/2) W* (a numpy sketch of this procedure follows below)

Limitations of LDA

- LDA produces at most C−1 feature projections
- LDA is parametric, since it assumes unimodal Gaussian likelihoods
- LDA will fail when the discriminatory information is not in the mean but in the variance of the data
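
A numpy sketch of the procedure described under "Solving multiclass LDA" above (whitening with Sw^(-1/2), eigenanalysis of the centroid scatter, and mapping back); this is an illustrative implementation, not the PRTools code mentioned on the last slide:

    import numpy as np

    def multiclass_lda(X, y, m=None):
        """Return up to C-1 discriminant directions W (columns); project with X @ W."""
        classes = np.unique(y)
        C = len(classes)
        mu = X.mean(axis=0)
        M = np.vstack([X[y == c].mean(axis=0) for c in classes])   # C x p centroids

        # Within-class scatter Sw and its inverse square root Sw^(-1/2).
        Sw = sum((X[y == c] - M[i]).T @ (X[y == c] - M[i])
                 for i, c in enumerate(classes))
        d, V = np.linalg.eigh(Sw)
        Sw_inv_sqrt = V @ np.diag(1.0 / np.sqrt(d)) @ V.T

        # Scatter of the whitened, class-size-weighted centroids (Sb*).
        n_c = np.array([np.sum(y == c) for c in classes])
        M_star = (M - mu) @ Sw_inv_sqrt
        Sb_star = (M_star * n_c[:, None]).T @ M_star

        # Principal components of the centroids; at most C-1 are non-zero.
        evals, evecs = np.linalg.eigh(Sb_star)
        order = np.argsort(evals)[::-1][: (m or C - 1)]
        return Sw_inv_sqrt @ evecs[:, order]           # back to the original space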


LDA Demo

[Figure: LDA demo; not reproduced here]

Matlab implementations

- Feature selection (PRTools)
  - Evaluate features: feateval
  - Selection strategies: featselm
  - (Distance measures: distmaha, distm, proxm)
- Linear projections (PRTools)
  - Principal component analysis: pca, klm
  - LDA: fisherm

