INF 3300/4300 Lecture 10, Asbjørn Berge

Lecture outline
- Dimensionality reduction: why do we want to throw away information?
- Selecting a set of features: forward, backward and floating selection
- Linear projections: principal components and linear discriminants
INF 3300 / 4300, Autumn 2005
The ”curse” of dimensionality
Bellman (1961), referring to the computational complexity of searching the neighborhood of data points in high-dimensional settings. The term is commonly used in statistics to describe the problem of data sparsity.

Example: a 3-class pattern recognition problem
- Divide the feature space into uniform bins
- Compute the ratio of objects from each class in each bin
- For a new object, find the correct bin and assign it to the dominant class
- Preserving the resolution of the bins, 3 bins in 1D increases to 3^2 = 9 bins in 2D
- Roughly 3 examples per bin in 1D means about 27 examples are needed to preserve the density of examples in 2D
Using 3 features makes the problem worse: the number of bins is now 3^3 = 27, and 81 examples are needed to preserve the density of examples. Using the original number of examples (9) means that 2/3 of the feature space is empty!

Dividing the sample space into equally spaced bins is thus very inefficient. Can we beat the curse?
- Use prior knowledge (e.g. discard parts of the sample space or restrict the estimates)
- Increase the smoothness of the density estimate (wider bins)
- Reduce the dimensionality

In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade.
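A quick sketch makes the bin arithmetic above concrete (plain Python; the 3 bins per axis and 3 examples per bin are the numbers used in the lecture, while the function names are my own):

```python
# Curse of dimensionality: uniform binning with a fixed number of
# bins per axis. The bin count grows exponentially with dimension d.

def n_bins(d, bins_per_axis=3):
    """Total number of uniform bins in a d-dimensional feature space."""
    return bins_per_axis ** d

def examples_needed(d, bins_per_axis=3, per_bin=3):
    """Examples needed to keep roughly per_bin examples in every bin."""
    return per_bin * n_bins(d, bins_per_axis)

for d in (1, 2, 3):
    print(f"{d}D: {n_bins(d)} bins, {examples_needed(d)} examples needed")

# With only the original 9 examples in 3D, at most 9 of the 27 bins
# can be occupied, so at least 2/3 of the feature space stays empty.
empty_fraction = 1 - 9 / n_bins(3)
```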
Dimensionality reduction
In general there are two approaches to dimensionality reduction:
- Feature selection: choose a subset of the features
- Feature extraction: create a set of new features by combining the existing features

Feature selection

Given a feature set x = {x1, x2, …, xn}, find a subset ym = {xi1, xi2, …, xim}, with m < n, that optimizes an objective function J.

Motivation for feature selection:
- Features may be expensive to obtain (think blood samples, seismic measurements etc.)
- Selected features possess simple meanings (measurement units remain intact), so we may directly derive understanding of our problem from the classifier
- Features may be discrete or non-numeric (e.g. binary existence indicators)

Search strategy: exhaustive search implies evaluating C(n, m) = n!/(m!(n-m)!) subsets if we fix m, and 2^n subsets if we need to search over all possible m as well. Choosing 10 features out of 100 will result in roughly 10^13 queries to J. Obviously we need to guide the search!

Objective function

The objective function J should ”predict” classifier performance. Common choices are:
- Distance metrics: Euclidean (mean difference, minimum distance, maximum distance), parametric (Mahalanobis, Bhattacharyya) or information theoretic (divergence)
- Classifier performance itself: this requires generating a test set from the training data (e.g. by cross-validation)
Note that distance metrics are usually pairwise comparisons, so a C-class extension of the distance needs to be defined.

Naïve individual feature selection

Goal: select the two best features individually. Any reasonable objective J will rank the features J(x1) > J(x2) ≈ J(x3) > J(x4), so the features chosen are [x1, x2] or [x1, x3]. However, x4 is the only feature that provides complementary information to x1.

Sequential Forward Selection (SFS)

Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Yk + x+) when combined with the features Yk that have already been selected.

Algorithm:
1. Start with the empty set Y0 = {∅}
2. Select the next best feature x+ = argmax over x ∉ Yk of J(Yk + x)
3. Update Yk+1 = Yk + x+; k = k + 1
4. Go to 2

SFS performs best when the optimal subset has a small number of features. SFS cannot discard features that become obsolete when other features are added.

Sequential Backward Selection (SBS)

Starting from the full set, sequentially remove the feature x- whose removal results in the smallest decrease in the value of J(Yk - x-).

Algorithm:
1. Start with the full set Y0 = X
2. Remove the worst feature x- = argmax over x ∈ Yk of J(Yk - x)
3. Update Yk+1 = Yk - x-; k = k + 1
4. Go to 2

SBS performs best when the optimal subset has a large number of features. SBS cannot re-enable a feature once it has been discarded.

Plus-L Minus-R Selection (LRS)

If L > R, LRS starts from the empty set and repeatedly adds L features and removes R features. If L < R, LRS starts from the full set and repeatedly removes R features and then adds back L features.

Sequential Floating Selection (SFFS and SFBS)

An extension of the LRS algorithms with flexible backtracking capabilities: the number of forward and backward steps is not fixed in advance but determined dynamically from the data.

Vector spaces

A set of vectors {u1, u2, …, un} is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination x = a1u1 + a2u2 + … + anun. The coefficients {a1, a2, …, an} are called the components of the vector x with respect to the basis {ui}. In order to form a basis, it is necessary and sufficient that the vectors {ui} be linearly independent. A basis {ui} is said to be orthogonal if ui·uj = 0 for all i ≠ j, and orthonormal if in addition ui·ui = 1 for all i.

Linear transformations

A linear transformation is a mapping from a vector space R^N onto a vector space R^M, and is represented by a matrix. Given a vector x ∈ R^N, the corresponding vector y ∈ R^M is y = Wx, where W is an M × N matrix.

Eigenvectors and eigenvalues

Given an N × N matrix A, we say that v is an eigenvector if there exists a scalar λ (the eigenvalue) such that Av = λv. The zeros of the characteristic equation det(A - λI) = 0 are the eigenvalues of A. A is non-singular if and only if all eigenvalues are non-zero. If A is real and symmetric, all eigenvalues are real and the eigenvectors are orthogonal.

Interpretation: the eigenvectors of the covariance matrix Σ correspond to the principal axes of the equiprobability ellipses! The linear transformation defined by the eigenvectors of Σ leads to vectors that are uncorrelated regardless of the form of the distribution. If the distribution happens to be Gaussian, then the transformed vectors will even be statistically independent.

Feature extraction

Feature extraction can be stated as follows: given a feature space x ∈ R^N, find an optimal mapping y = f(x): R^N → R^M with M < N.

Signal representation vs classification: the two classical linear mappings differ in what they try to preserve.
- Principal components analysis (PCA): preserves signal variance (”randomness”); signal representation, unsupervised
- Linear discriminant analysis (LDA): preserves class discriminatory information; classification, supervised

PCA - Principal components analysis

Reduce the dimension while preserving the signal variance. Represent x as a linear combination of n orthonormal basis vectors [φ1 ⊥ φ2 ⊥ … ⊥ φn]: x = ∑_{i=1}^n yi φi. Approximate x with only m < n of the basis vectors, replacing the discarded coefficients by preselected constants bi:

x̂(m) = ∑_{i=1}^m yi φi + ∑_{i=m+1}^n bi φi

The approximation error is then Δx(m) = x - x̂(m) = ∑_{i=m+1}^n (yi - bi) φi. To measure the representation error we use the mean squared error, ε²(m) = E[ |Δx(m)|² ] = ∑_{i=m+1}^n E[(yi - bi)²]. The optimal values of bi can be found by taking the partial derivatives of the approximation error and equating them to zero, which gives bi = E[yi]: we replace the discarded dimensions by their expected value, which also feels intuitively correct.
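Anticipating the result derived next, that the optimal basis vectors φi are the eigenvectors of the covariance matrix, the whole construction can be checked numerically. A minimal sketch in Python (numpy assumed; the data and the choice m = 1 are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: 500 samples of a linearly mixed Gaussian.
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
mu = X.mean(axis=0)
Xc = X - mu                                  # centering implements b_i = E[y_i]

# Covariance with the 1/n convention, so the MSE identity below is exact.
cov = Xc.T @ Xc / len(Xc)
evals, evecs = np.linalg.eigh(cov)           # ascending order for symmetric cov
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Keep only the m leading eigenvectors and reconstruct the approximation.
m = 1
Y = Xc @ evecs[:, :m]                        # coefficients y_i of the kept basis
Xhat = mu + Y @ evecs[:, :m].T               # approximation of x

# Mean squared error equals the sum of the discarded eigenvalues.
mse = np.mean(np.sum((X - Xhat) ** 2, axis=1))
print(mse, evals[m:].sum())
```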
With bi = E[yi], each term of the MSE becomes the variance of yi = φi^T x, so ε²(m) = ∑_{i=m+1}^n φi^T Σx φi, where Σx is the covariance matrix of x. The orthonormality constraint on the basis vectors can be incorporated into the optimization using Lagrange multipliers λi:

ε²(m) = ∑_{i=m+1}^n [ φi^T Σx φi + λi (1 - φi^T φi) ]

Setting the partial derivatives with respect to φi to zero gives Σx φi = λi φi: the optimal φi and λi are the eigenvectors and eigenvalues of the covariance matrix Σx.

The optimal approximation of a random vector x ∈ R^N by a linear combination of m < N independent vectors is therefore obtained by projecting x onto the eigenvectors of Σx corresponding to the m largest eigenvalues. Note that this also implies that the resulting MSE is the sum of the discarded eigenvalues, ε²(m) = ∑_{i=m+1}^n λi.

PCA uses the eigenvectors of the covariance matrix Σx, and is thus able to find the independent axes of the data under the unimodal Gaussian assumption. For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes. The main limitation of PCA is that it is unsupervised, so it does not consider class separability: it is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance, and there is no guarantee that these directions are good features for discrimination.

PCA example: a 3D Gaussian with given parameters (figure omitted).

Linear Discriminant Analysis (LDA)

Goal: reduce the dimension while preserving class discriminatory information.

Strategy (2 classes): we have a set of samples x = {x1, x2, …, xn}, of which n1 belong to class ω1 and the rest, n2, belong to class ω2.
We obtain a scalar value by projecting x onto a line, y = w^T x, and select the w that maximizes the separability of the classes. To do so we need to define a measure of separation between the projections, J(w).

The mean vectors of each class in the spaces spanned by x and y are μi = (1/ni) ∑_{x∈ωi} x and μ̃i = (1/ni) ∑_{y∈ωi} y = w^T μi. A naïve choice of objective would be the projected mean difference, J(w) = |μ̃1 - μ̃2|.

Fisher's solution: maximize a function that represents the difference between the means, scaled by a measure of the within-class scatter. Define the classwise scatter (similar to a variance) s̃i² = ∑_{y∈ωi} (y - μ̃i)²; the total scatter of the projections, s̃1² + s̃2², is the within-class scatter. Fisher's criterion is then

J(w) = |μ̃1 - μ̃2|² / (s̃1² + s̃2²)

We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible.

To optimize w we need J(w) to be an explicit function of w. Redefine the scatter: s̃1² + s̃2² = w^T Sw w, where Sw = S1 + S2 is the within-class scatter matrix and Si = ∑_{x∈ωi} (x - μi)(x - μi)^T. The projected mean difference can be expressed by the original means, |μ̃1 - μ̃2|² = w^T Sb w, where Sb = (μ1 - μ2)(μ1 - μ2)^T is called the between-class scatter. The Fisher criterion in terms of Sw and Sb is

J(w) = (w^T Sb w) / (w^T Sw w)

To find the optimal w, differentiate J(w) and equate to zero. Dividing through by w^T Sw w, one gets the generalized eigenvalue problem Sw^(-1) Sb w = J(w) w, which has the solution w* = Sw^(-1) (μ1 - μ2).

LDA generalizes nicely to C classes. Instead of one projection y, we seek C-1 projections [y1, y2, …, yC-1] obtained from C-1 projection vectors W = [w1, w2, …, wC-1]: y = W^T x. The generalization of the within-class scatter is Sw = ∑_{i=1}^C Si, and the generalization of the between-class scatter is Sb = ∑_{i=1}^C ni (μi - μ)(μi - μ)^T, where μ is the mean of all samples.

Similar to the 2-class case, the mean vectors and scatter matrices for the projected samples can be expressed as S̃w = W^T Sw W and S̃b = W^T Sb W. We want a scalar objective function and use the ratio of matrix determinants:

J(W) = |W^T Sb W| / |W^T Sw W|

Another variant of the scalar criterion is the ratio of the traces of the matrices, i.e. J(W) = tr(W^T Sb W) / tr(W^T Sw W).

The matrix W* that maximizes this ratio can be shown to be composed of the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem Sb wi = λi Sw wi. Sb is the sum of C matrices of rank one or less, and the mean vectors are constrained by n μ = ∑_{i=1}^C ni μi, so Sb will be of rank C-1 or less. Only C-1 of the eigenvalues will be non-zero, so we can only find C-1 projection vectors.

Solving multiclass LDA:
1. Estimate Sw and find the C × p matrix of class centroids, M
2. Transform the centroid matrix into a space where Sw is diagonal: find Sw^(-1/2), a rotating and scaling linear transformation, and use it to transform M into M*
3. Compute Sb*, the scatter matrix of M*
4. Eigenanalyze Sb*, i.e. find the principal components of the centroids. Note that since we only have C centroids, these can span at most a (C-1)-dimensional subspace, which means that only C-1 principal components are non-zero. The eigenvectors of Sb* define the optimal projections W*
5. Transform the solutions back to the original space by inverting the linear transformation; the maximizing W is thus Sw^(-1/2) W*

Limitations of LDA:
- LDA produces at most C-1 feature projections
- LDA is parametric, since it assumes unimodal Gaussian likelihoods
- LDA will fail when the discriminatory information is not in the mean but in the variance of the data

LDA demo (not reproduced here)

Matlab implementations

Feature selection (PRTools):
- Evaluate features: feateval
- Selection strategies: featselm
- Distance measures: distmaha, distm, proxm

Linear projections (PRTools):
- Principal component analysis: pca, klm
- LDA: fisherm
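For readers without PRTools, the two-class Fisher discriminant w* = Sw^(-1)(μ1 - μ2) from the LDA section is easy to sketch directly. A minimal version in Python (numpy assumed; the toy data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes with equal covariance but different means.
X1 = rng.standard_normal((100, 2))                         # class omega_1
X2 = rng.standard_normal((120, 2)) + np.array([3.0, 2.0])  # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter Sw = S1 + S2 (unnormalized class covariances).
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
Sw = S1 + S2

# Fisher's solution for two classes: w* = Sw^(-1) (mu1 - mu2).
w = np.linalg.solve(Sw, mu1 - mu2)

# Project every sample onto the line y = w^T x and compare the classes.
y1, y2 = X1 @ w, X2 @ w
separation = abs(y1.mean() - y2.mean()) / (y1.std() + y2.std())
print(separation)
```

A large value of `separation` means the projected class means are far apart relative to the projected within-class spread, which is exactly what Fisher's criterion maximizes.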