9/7/2017
Feature Extraction Techniques
Unsupervised Learning II
Unsupervised methods can also be used to find features which can be useful for categorization. Some unsupervised methods represent a form of smart feature extraction.

Feature Extraction
Feature extraction transforms the input data into a set of features that still describes the data with sufficient accuracy. In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction.
What is feature reduction?
• Feature reduction maps the original data x ∈ R^p to reduced data y ∈ R^d (d < p) by a linear transformation G ∈ R^(p×d):

  G : x ∈ R^p → y = G^T x ∈ R^d

Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data.
• The input data to an algorithm may be too large to be processed and suspected to be redundant (much data, but not much information).
• Analysis with a large number of variables generally requires a large amount of memory and computing power, or a classification algorithm that overfits the training sample and generalizes poorly to new samples.
• The important dimension may be small. For example, the number of genes responsible for a certain type of disease may be small.
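The mapping y = G^T x above can be sketched in a few lines of NumPy. The matrix G here is random, chosen only to illustrate the shapes involved; in PCA (below) its columns would be eigenvectors of the covariance matrix.

```python
import numpy as np

# A minimal sketch of linear feature reduction: a p-dimensional sample x
# is mapped to a d-dimensional representation y = G^T x, with d < p.
rng = np.random.default_rng(0)

p, d = 10, 3                        # original and reduced dimensionality
G = rng.standard_normal((p, d))     # some linear transformation (p x d)
x = rng.standard_normal(p)          # one high-dimensional sample

y = G.T @ x                         # reduced representation
print(x.shape, y.shape)             # (10,) (3,)
```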
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.

Feature reduction versus feature selection
• Feature reduction
  – All original features are used.
  – The transformed features are linear combinations of the original features.
• Feature selection
  – Only a subset of the original features is used.
• Continuous versus discrete
Application of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification

Feature Extraction Techniques — Algorithms
• Principal component analysis
• Singular value decomposition
• Non-negative matrix factorization
• Independent component analysis
What is Principal Component Analysis?
• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
  – Retains most of the sample's information.
  – Useful for the compression and classification of data.
• By information we mean the variation present in the sample, given by the correlations between the original variables.
  – The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

Geometric picture of principal components (PCs)
• The 1st PC z1 is a minimum-distance fit to a line in X space.
• The 2nd PC z2 is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
• PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
Principal Components Analysis (PCA)
• Principle
  – Linear projection method to reduce the number of parameters.
  – Transforms a set of correlated variables into a new set of uncorrelated variables.
  – Maps the data into a space of lower dimensionality.
  – A form of unsupervised learning.
• Properties
  – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables.
  – The new axes are orthogonal and represent the directions of maximum variability.

Background Mathematics
• Linear Algebra
• Calculus
• Probability and Computing
Eigenvalues & eigenvectors
Example. Consider the matrix

  A = |  1   2   1 |
      |  6  -1   0 |
      | -1  -2  -1 |

and the three column matrices

  C1 = |   1 |    C2 = | -1 |    C3 = |  2 |
       |   6 |         |  2 |         |  3 |
       | -13 |         |  1 |         | -2 |

Easy calculations give

  A C1 = 0 C1,   A C2 = -4 C2,   A C3 = 3 C3.

In other words, 0, -4, and 3 are eigenvalues of A, and C1, C2, C3 are eigenvectors.

Now consider the matrix P whose columns are C1, C2, and C3:

  P = |   1  -1   2 |
      |   6   2   3 |
      | -13   1  -2 |

We have det(P) = 84, so this matrix is invertible, with

  P^(-1) = (1/84) |  -7   0  -7 |
                  | -27  24   9 |
                  |  32  12   8 |

Next we evaluate the matrix P^(-1) A P:

  P^(-1) A P = (1/84) |  -7   0  -7 |   |  1   2   1 |   |   1  -1   2 |
                      | -27  24   9 | × |  6  -1   0 | × |   6   2   3 |
                      |  32  12   8 |   | -1  -2  -1 |   | -13   1  -2 |

             = | 0   0  0 |
               | 0  -4  0 |
               | 0   0  3 |
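The diagonalization above can be checked numerically; this short sketch recomputes P^(-1) A P with NumPy and confirms it equals diag(0, -4, 3).

```python
import numpy as np

# Numerical check of the worked example: P^(-1) A P is the diagonal
# matrix of the eigenvalues 0, -4, 3.
A = np.array([[ 1,  2,  1],
              [ 6, -1,  0],
              [-1, -2, -1]], dtype=float)
P = np.array([[  1, -1,  2],
              [  6,  2,  3],
              [-13,  1, -2]], dtype=float)

D = np.linalg.inv(P) @ A @ P
print(np.round(D, 10))   # diag(0, -4, 3)
```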
Eigenvalues & eigenvectors
Definition. Let A be a square matrix. A non-zero vector C is called an eigenvector of A if and only if there exists a number (real or complex) λ such that

  A C = λ C.

If such a number λ exists, it is called an eigenvalue of A, and the vector C is called an eigenvector associated to the eigenvalue λ.

Remark. The eigenvector C must be non-zero, since A·0 = 0 = λ·0 (0 is the zero vector) for any number λ.

From the example above, we have

  P^(-1) A P = | 0   0  0 |
               | 0  -4  0 |
               | 0   0  3 |

and, multiplying on the left by P and on the right by P^(-1),

  A = P | 0   0  0 | P^(-1),
        | 0  -4  0 |
        | 0   0  3 |

which implies that A is similar to a diagonal matrix. In particular, we have

  A^n = P | 0     0     0   | P^(-1)   for n = 1, 2, ....
          | 0  (-4)^n   0   |
          | 0     0    3^n  |
Determinants

Determinant of order 2 (easy to remember, for order 2 only):

  | a11  a12 | = a11·a22 − a12·a21
  | a21  a22 |

Example. Evaluate the determinant:

  | 1  2 | = 1·4 − 2·3 = −2
  | 3  4 |

Eigenvalues & eigenvectors

Example. Consider again the matrix

  A = |  1   2   1 |
      |  6  -1   0 |
      | -1  -2  -1 |

We have seen that

  A C1 = 0 C1,   A C2 = -4 C2,   A C3 = 3 C3,

where

  C1 = |   1 |    C2 = | -1 |    C3 = |  2 |
       |   6 |         |  2 |         |  3 |
       | -13 |         |  1 |         | -2 |

So C1 is an eigenvector of A associated to the eigenvalue 0, C2 is an eigenvector of A associated to the eigenvalue -4, while C3 is an eigenvector of A associated to the eigenvalue 3.
Eigenvalues & eigenvectors
For a square matrix A of order n, the number λ is an eigenvalue if and only if there exists a non-zero vector C such that

  A C = λ C.

This equation translates into (A − λI_n) C = 0. We know this homogeneous system has only the zero solution if and only if the matrix coefficient is invertible. Since the zero vector is a solution and C is not the zero vector, we must have

  det(A − λI_n) = 0.

Example. Consider the matrix

  A = | 1  2 |
      | 2  0 |

Using the determinant formula, we obtain

  det(A − λI_2) = (1 − λ)(0 − λ) − 4 = 0,

which is equivalent to the quadratic equation

  λ² − λ − 4 = 0.

Solving this equation (use the quadratic formula) leads to

  λ = (1 + √17)/2   and   λ = (1 − √17)/2.

In other words, the matrix A has only two eigenvalues.
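The two roots from the quadratic formula can be checked against a numerical eigenvalue routine; this sketch compares them with `np.linalg.eigvals`.

```python
import numpy as np

# The 2x2 example: det(A - lambda*I) = lambda^2 - lambda - 4 = 0
A = np.array([[1., 2.],
              [2., 0.]])

# Roots from the quadratic formula
roots = np.array([(1 - np.sqrt(17)) / 2, (1 + np.sqrt(17)) / 2])

# Numerical eigenvalues, sorted for comparison
eigvals = np.sort(np.linalg.eigvals(A))
print(eigvals)   # matches (1 ± sqrt(17)) / 2
```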
Eigenvalues & eigenvectors
Example. Consider the diagonal matrix

  D = | a  0  0  0 |
      | 0  b  0  0 |
      | 0  0  c  0 |
      | 0  0  0  d |

Its characteristic polynomial is

  det(D − λI_4) = (a − λ)(b − λ)(c − λ)(d − λ) = 0.

So the eigenvalues of D are a, b, c, and d, i.e. the entries on the diagonal.

In general, for a square matrix A of order n, the equation

  det(A − λI_n) = 0

gives the eigenvalues of A. It is a polynomial function in λ of degree n, so this equation has at most n roots or solutions. Hence a square matrix A of order n has at most n eigenvalues.
Computation of Eigenvectors

Let A be a square matrix of order n and λ one of its eigenvalues. Let X be an eigenvector of A associated to λ. We must have

  A X = λ X   or   (A − λI_n) X = 0.

This is a linear system whose matrix coefficient is A − λI_n. Since the zero vector is a solution, the system is consistent.

Remark. Note that if X is a vector which satisfies AX = λX, then the vector Y = cX (for any arbitrary number c) satisfies the same equation, i.e. AY = λY. In other words, if we know that X is an eigenvector, then cX is also an eigenvector associated to the same eigenvalue.

Example. Consider the matrix

  A = |  1   2   1 |
      |  6  -1   0 |
      | -1  -2  -1 |

First we look for the eigenvalues of A. These are given by the characteristic equation det(A − λI_3) = 0, i.e.

  | 1−λ    2      1   |
  |  6   −1−λ     0   | = 0.
  | −1    −2   −1−λ   |

If we develop this determinant along the third column, we obtain

  1 · |  6   −1−λ |  +  (−1−λ) · | 1−λ    2   |  =  0.
      | −1    −2  |              |  6   −1−λ  |

By algebraic manipulations, we get

  −λ(λ + 4)(λ − 3) = 0,

which implies that the eigenvalues of A are 0, −4, and 3.
Computation of Eigenvectors

EIGENVECTORS ASSOCIATED WITH EIGENVALUES

1. Case λ = 0. The associated eigenvectors are given by the linear system A X = 0, which may be rewritten as

  x + 2y + z = 0
  6x − y = 0
  −x − 2y − z = 0.

The third equation is identical to the first. From the second equation, we have y = 6x, so the first equation reduces to 13x + z = 0. So this system is equivalent to

  y = 6x
  z = −13x.

So the unknown vector X is given by

  X = |   x  |  =  x |   1 |
      |  6x  |       |   6 |
      | −13x |       | −13 |

Therefore, any eigenvector X of A associated to the eigenvalue 0 is given by

  X = c |   1 |
        |   6 |
        | −13 |

where c is an arbitrary number.
Computation of Eigenvectors
2. Case λ = −4. The associated eigenvectors are given by the linear system

  A X = −4X   or   (A + 4I_3) X = 0,

which may be rewritten as

  5x + 2y + z = 0
  6x + 3y = 0
  −x − 2y + 3z = 0.

We use elementary row operations to solve it. First we consider the augmented matrix [A + 4I_3 | 0], with the third row multiplied by −1:

  | 5   2   1   0 |
  | 6   3   0   0 |
  | 1   2  −3   0 |

Then we use elementary row operations to reduce it to an upper-triangular form. First we move the last row to the top:

  | 1   2  −3   0 |
  | 5   2   1   0 |
  | 6   3   0   0 |

Next, we use the first row to eliminate the 5 and the 6 in the first column. We obtain

  | 1   2   −3   0 |
  | 0  −8   16   0 |
  | 0  −9   18   0 |
Computation of Eigenvectors
If we cancel the −8 and −9 from the second and third rows (dividing each row by its leading entry), we obtain

  | 1   2  −3   0 |
  | 0   1  −2   0 |
  | 0   1  −2   0 |

Finally, we subtract the second row from the third to get

  | 1   2  −3   0 |
  | 0   1  −2   0 |
  | 0   0   0   0 |

Next, we set z = c. From the second row, we get y = 2z = 2c. The first row then implies x = −2y + 3z = −c. Hence

  X = | −c |  =  c | −1 |
      | 2c |       |  2 |
      |  c |       |  1 |

Therefore, any eigenvector X of A associated to the eigenvalue −4 is given by

  X = c | −1 |
        |  2 |
        |  1 |

where c is an arbitrary number.
Computation of Eigenvectors
3. Case λ = 3. Using similar ideas to those described above, one may easily show that any eigenvector X of A associated to the eigenvalue 3 is given by

  X = c |  2 |
        |  3 |
        | −2 |

where c is an arbitrary number.

Summary. Let A be a square matrix and assume λ is an eigenvalue of A. In order to find the associated eigenvectors, we do the following steps:

1. Write down the associated linear system

     A X = λ X   or   (A − λI_n) X = 0.

2. Solve the system.
3. Rewrite the unknown vector X as a linear combination of known vectors.
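The three hand-computed eigenpairs of the running example can be verified directly by checking A C = λ C for each pair; this sketch does exactly that.

```python
import numpy as np

# Verify the three eigenpairs of the running example: A C = lambda * C
# for lambda = 0, -4, 3 with eigenvectors (1,6,-13), (-1,2,1), (2,3,-2).
A = np.array([[ 1,  2,  1],
              [ 6, -1,  0],
              [-1, -2, -1]], dtype=float)

pairs = [( 0, np.array([ 1., 6., -13.])),
         (-4, np.array([-1., 2.,   1.])),
         ( 3, np.array([ 2., 3.,  -2.]))]

for lam, v in pairs:
    assert np.allclose(A @ v, lam * v)
print("all eigenpairs verified")
```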
Why Eigenvectors and Eigenvalues?

An eigenvector of a square matrix is a non-zero vector that, when multiplied by the matrix, yields a vector that differs from the original by at most a multiplicative scalar. That scalar is the eigenvalue.

In a shear mapping, for example, a tilted (red) arrow changes direction but a horizontal (blue) arrow does not. The blue arrow is an eigenvector of the shear, and since its length is unchanged, its eigenvalue is 1.

Principal Components Analysis (PCA)
• Principle
  – Linear projection method to reduce the number of parameters.
  – Transforms a set of correlated variables into a new set of uncorrelated variables.
  – Maps the data into a space of lower dimensionality.
  – A form of unsupervised learning.
• Properties
  – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables.
  – The new axes are orthogonal and represent the directions of maximum variability.
PCs and Variance
• The first PC retains the greatest amount of variation in the sample.
• The kth PC retains the kth greatest fraction of the variation in the sample.
• The kth largest eigenvalue of the covariance matrix C is the variance in the sample along the kth PC.

Dimensionality Reduction

We can ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much:
– n dimensions in the original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– the final data set has only p dimensions
PCA Example – STEP 1
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

• Subtract the mean from each of the data dimensions. This produces a data set whose mean is zero. Subtracting the mean makes variance and covariance calculation easier by simplifying their equations. The variance and covariance values are not affected by the mean value.

  DATA:             ZERO-MEAN DATA:
   x     y            x       y
  2.5   2.4          .69     .49
  0.5   0.7        -1.31   -1.21
  2.2   2.9          .39     .99
  1.9   2.2          .09     .29
  3.1   3.0         1.29    1.09
  2.3   2.7          .49     .79
  2.0   1.6          .19    -.31
  1.0   1.1         -.81    -.81
  1.5   1.6         -.31    -.31
  1.1   0.9         -.71   -1.01
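The mean subtraction in this step is a one-liner in NumPy; this sketch reproduces the zero-mean data from the ten (x, y) points of the tutorial.

```python
import numpy as np

# The tutorial's ten (x, y) points; subtracting the per-dimension mean
# gives the zero-mean data shown above.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

centered = data - data.mean(axis=0)
print(data.mean(axis=0))   # ≈ [1.81 1.91]
print(centered[0])         # ≈ [0.69 0.49]
```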
PCA Example – STEP 2

• Calculate the covariance matrix:

  cov = | .616555556   .615444444 |
        | .615444444   .716555556 |
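The covariance matrix above can be reproduced with `np.cov`, which uses the sample (N − 1) denominator that matches the slide's values.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# np.cov expects variables in rows; it uses the (N-1) denominator,
# matching the covariance matrix on the slide.
C = np.cov(data.T)
print(np.round(C, 9))
```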
Variance measures how far a set of numbers is spread out.

Covariance provides a measure of the strength of the correlation between two or more sets of random variates. The covariance of two random variates X and Y, each with sample size N, is defined by the expectation value

  cov(X, Y) = E[(X − μ_X)(Y − μ_Y)],

estimated from a sample as

  cov(X, Y) = (1/(N − 1)) Σᵢ (xᵢ − x̄)(yᵢ − ȳ).
PCA Example – STEP 3
• Calculate the eigenvectors and eigenvalues of the covariance matrix:

  eigenvalues = | 0.04908340 |
                | 1.28402771 |

  eigenvectors = | -0.73517866   -0.677873399 |
                 |  0.67787340   -0.735178656 |

• When the eigenvectors are plotted as diagonal dotted lines over the data, note that they are perpendicular to each other.
• One of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line, but are off to the side of it by some amount.
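The eigendecomposition in this step can be checked with `np.linalg.eigh`, which handles symmetric matrices like a covariance matrix and returns eigenvalues in ascending order.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
C = np.cov(data.T)

# eigh is the right routine for a symmetric (covariance) matrix;
# it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)   # ≈ [0.0490834, 1.28402771]
```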
PCA Example – STEP 4
• Reduce dimensionality and form a feature vector. The eigenvector with the highest eigenvalue is the principal component of the data set. In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.

• Once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

• Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much:
  – n dimensions in your data
  – calculate n eigenvectors and eigenvalues
  – choose only the first p eigenvectors
  – the final data set has only p dimensions
PCA Example – STEP 4

• Feature Vector

  FeatureVector = (eig₁ eig₂ eig₃ … eigₙ)

We can either form a feature vector with both of the eigenvectors:

  | -.677873399   -.735178656 |
  | -.735178656    .677873399 |

or we can choose to leave out the smaller, less significant component and only have a single column:

  | -.677873399 |
  | -.735178656 |

PCA Example – STEP 5

• Deriving the new data:

  FinalData = RowFeatureVector × RowZeroMeanData

RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top. RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension.
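The FinalData product amounts to projecting each mean-adjusted point onto the chosen eigenvector(s). This sketch reproduces the first column of the tutorial's result using the most significant eigenvector (signs follow the tutorial's convention).

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
centered = data - data.mean(axis=0)

# Most significant eigenvector of the covariance matrix (unit length).
v1 = np.array([-0.677873399, -0.735178656])

# FinalData = RowFeatureVector x RowZeroMeanData: project each
# mean-adjusted point onto the chosen eigenvector.
final = centered @ v1
print(np.round(final[:2], 6))   # ≈ [-0.82797, 1.77758]
```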
PCA Example – STEP 5
FinalData transpose: dimensions along columns

       x               y
  -.827970186     -.175115307
  1.77758033       .142857227
  -.992197494      .384374989
  -.274210416      .130417207
 -1.67580142      -.209498461
  -.912949103      .175282444
   .0991094375    -.349824698
  1.14457216       .0464172582
   .438046137      .0177646297
  1.22382056      -.162675287
PCA Algorithm
1. Get some data
2. Subtract the mean
3. Calculate the covariance matrix
4. Calculate the eigenvectors and eigenvalues of the covariance matrix
5. Choose components and form a feature vector
6. Derive the new data set

Why PCA?

Maximum variance theory
• In general, the variance of noise should be low and that of signal should be high. A common measure is the signal-to-noise ratio (SNR); a high SNR indicates high-precision data.
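The six steps of the PCA algorithm above can be collected into one short function. This is a minimal sketch; the function name `pca` and the random test data are illustrative, not part of the original slides.

```python
import numpy as np

def pca(X, p):
    """Minimal PCA following the steps above: subtract the mean, compute
    the covariance matrix, eigendecompose it, keep the top-p eigenvectors,
    and project the data onto them."""
    Xc = X - X.mean(axis=0)               # step 2: subtract the mean
    C = np.cov(Xc.T)                      # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # step 4: eigenvalues ascending
    order = np.argsort(eigvals)[::-1]     # sort high -> low significance
    W = eigvecs[:, order[:p]]             # step 5: feature vector (top p)
    return Xc @ W                         # step 6: derive the new data set

# Illustrative data: 100 samples in 5 dimensions, reduced to 2.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
Y = pca(X, 2)
print(Y.shape)   # (100, 2)
```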
Projection

• The dot product a·b = |a||b| cos(θ) is related to the angle between the two vectors (but it doesn't tell us the angle by itself). For a unit vector u, x⁽ⁱ⁾ᵀu is the length of the projection of x⁽ⁱ⁾ onto u, i.e. the distance from the projected point to the origin of coordinates. The mean of the projections is zero, since the data are centered.

Maximum Variance

• The variance of the projections is

  var(xᵀu) = (1/m) Σᵢ₌₁..ₘ (x⁽ⁱ⁾ᵀu)²
           = (1/m) Σᵢ₌₁..ₘ uᵀ x⁽ⁱ⁾ x⁽ⁱ⁾ᵀ u
           = uᵀ ( (1/m) Σᵢ₌₁..ₘ x⁽ⁱ⁾ x⁽ⁱ⁾ᵀ ) u
           = uᵀ Cov u,

  where Cov = (1/m) Σᵢ₌₁..ₘ x⁽ⁱ⁾ x⁽ⁱ⁾ᵀ is the covariance matrix (since the mean is zero).

• Write λ = var(xᵀu) = uᵀ Cov u. Maximizing this variance subject to the unit-length constraint uᵀu = 1 leads to

  Cov u = λ u.

So λ is an eigenvalue of the matrix Cov and u is the corresponding eigenvector. The goal of PCA is to find the u for which the variance of all projected points is maximum: since that variance equals the eigenvalue, u is the eigenvector of Cov with the largest eigenvalue.
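The maximum-variance identity can be confirmed numerically: for centered data, the variance of the projections onto the top eigenvector equals the largest eigenvalue of the covariance matrix. The data here are synthetic, generated only for the check.

```python
import numpy as np

# Check the identity var(x^T u) = u^T Cov u = lambda for the top
# eigenvector u of the covariance matrix of zero-mean data.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)               # center the data

Cov = (X.T @ X) / len(X)             # (1/m) sum x x^T, mean already zero
eigvals, eigvecs = np.linalg.eigh(Cov)
u = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

proj_var = np.mean((X @ u) ** 2)     # variance of the projections
print(proj_var, eigvals[-1])         # the two agree
```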