9/7/2017

Feature Extraction Techniques

Unsupervised methods can also be used to find features that are useful for categorization. Some unsupervised methods represent a form of smart feature extraction.

Unsupervised Learning II: Feature Extraction

Feature extraction transforms the input data into a set of features that still describes the data with sufficient accuracy. In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction.

What is feature reduction?

Feature reduction maps the original data to reduced data by a linear transformation: given $G \in \mathbb{R}^{d \times p}$, a sample $X \in \mathbb{R}^d$ is mapped to
$$Y = G^T X \in \mathbb{R}^p, \qquad p < d.$$

Why feature reduction?

• Most machine learning and data mining techniques may not be effective for high-dimensional data.
• The input data to an algorithm may be too large to be processed and is suspected to be redundant (much data, but not much information).
• Analysis with a large number of variables generally requires a large amount of memory and computation power, or a classification algorithm which overfits the training sample and generalizes poorly to new samples.
• The important dimension may be small.
  – For example, the number of genes responsible for a certain type of disease may be small.

A small sketch of such a linear projection follows.
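To make the transformation concrete, here is a minimal NumPy sketch of the linear feature-reduction map Y = GᵀX. The matrix G used here is a random orthonormal basis purely for illustration; it stands in for whatever projection a method such as PCA would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

d, p, n = 10, 3, 100           # original dim, reduced dim, number of samples
X = rng.normal(size=(n, d))    # data matrix, one sample per row

# Illustrative projection matrix G (d x p) with orthonormal columns.
# In practice G would come from a method such as PCA.
G, _ = np.linalg.qr(rng.normal(size=(d, p)))

# Feature reduction: Y = G^T X, applied row-wise here.
Y = X @ G                      # shape (n, p)

print(X.shape, "->", Y.shape)  # (100, 10) -> (100, 3)
```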


Why feature reduction?

• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.

Feature reduction versus feature selection

• Feature reduction
  – All original features are used.
  – The transformed features are linear combinations of the original features.
• Feature selection
  – Only a subset of the original features is used.
• Continuous versus discrete

Application of feature reduction

• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification

Feature Extraction Techniques: Algorithms

• Principal component analysis
• Singular value decomposition
• Non-negative matrix factorization
• Independent component analysis


What is Principal Component Analysis?

• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
  – Retains most of the sample's information.
  – Useful for the compression and classification of data.
• By information we mean the variation present in the sample, given by the correlations between the original variables.
  – The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

Geometric picture of principal components (PCs)

• The 1st PC z1 is a minimum-distance fit to a line in X space.
• The 2nd PC z2 is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
• PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.

Principal Components Analysis (PCA)

• Principle
  – Linear projection method to reduce the number of parameters.
  – Transfers a set of correlated variables into a new set of uncorrelated variables.
  – Maps the data into a space of lower dimensionality.
  – A form of unsupervised learning.
• Properties
  – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables.
  – The new axes are orthogonal and represent the directions with maximum variability.

Background Mathematics

• Linear Algebra
• Calculus
• Probability and Computing


Eigenvalues & eigenvectors

Example. Consider the matrix
$$A = \begin{pmatrix} 1 & 2 & 1 \\ 6 & -1 & 0 \\ -1 & -2 & -1 \end{pmatrix}$$
and the three column matrices
$$C_1 = \begin{pmatrix} 1 \\ 6 \\ -13 \end{pmatrix}, \quad C_2 = \begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix}, \quad C_3 = \begin{pmatrix} 2 \\ 3 \\ -2 \end{pmatrix}.$$
We have
$$AC_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad AC_2 = \begin{pmatrix} 4 \\ -8 \\ -4 \end{pmatrix}, \quad AC_3 = \begin{pmatrix} 6 \\ 9 \\ -6 \end{pmatrix}.$$
In other words,
$$AC_1 = 0\,C_1, \quad AC_2 = -4\,C_2, \quad AC_3 = 3\,C_3,$$
so 0, -4 and 3 are eigenvalues of A, and C_1, C_2 and C_3 are eigenvectors of A.

Now consider the matrix P whose columns are C_1, C_2 and C_3, i.e.,
$$P = \begin{pmatrix} 1 & -1 & 2 \\ 6 & 2 & 3 \\ -13 & 1 & -2 \end{pmatrix}.$$
The determinant of P is det(P) = 84, so this matrix is invertible. Easy calculations give
$$P^{-1} = \frac{1}{84}\begin{pmatrix} -7 & 0 & -7 \\ -27 & 24 & 9 \\ 32 & 12 & 8 \end{pmatrix}.$$
Next we evaluate the matrix P^{-1}AP:
$$P^{-1}AP = \frac{1}{84}\begin{pmatrix} -7 & 0 & -7 \\ -27 & 24 & 9 \\ 32 & 12 & 8 \end{pmatrix}\begin{pmatrix} 1 & 2 & 1 \\ 6 & -1 & 0 \\ -1 & -2 & -1 \end{pmatrix}\begin{pmatrix} 1 & -1 & 2 \\ 6 & 2 & 3 \\ -13 & 1 & -2 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -4 & 0 \\ 0 & 0 & 3 \end{pmatrix}.$$
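The diagonalisation above is easy to check numerically. The following NumPy snippet (an illustrative check, not part of the original slides) verifies that A C_i = λ_i C_i and that P⁻¹AP is the diagonal matrix of eigenvalues.

```python
import numpy as np

A = np.array([[1, 2, 1],
              [6, -1, 0],
              [-1, -2, -1]], dtype=float)

C1 = np.array([1, 6, -13], dtype=float)
C2 = np.array([-1, 2, 1], dtype=float)
C3 = np.array([2, 3, -2], dtype=float)

# A C_i should equal lambda_i * C_i for lambda = 0, -4, 3
for C, lam in [(C1, 0), (C2, -4), (C3, 3)]:
    assert np.allclose(A @ C, lam * C)

# P has the eigenvectors as columns; P^{-1} A P is then diagonal
P = np.column_stack([C1, C2, C3])
D = np.linalg.inv(P) @ A @ P
print(np.round(D, 10))   # diag(0, -4, 3)
```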

Eigenvalues & eigenvectors

In other words, we have
$$P^{-1}AP = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -4 & 0 \\ 0 & 0 & 3 \end{pmatrix}.$$
Using the matrix P, we obtain
$$A = P\begin{pmatrix} 0 & 0 & 0 \\ 0 & -4 & 0 \\ 0 & 0 & 3 \end{pmatrix}P^{-1},$$
which implies that A is similar to a diagonal matrix. In particular, we have
$$A^n = P\begin{pmatrix} 0 & 0 & 0 \\ 0 & (-4)^n & 0 \\ 0 & 0 & 3^n \end{pmatrix}P^{-1} \quad \text{for } n = 1, 2, \ldots$$

Definition. Let A be a square matrix. A non-zero vector C is called an eigenvector of A if and only if there exists a number (real or complex) λ such that
$$AC = \lambda C.$$
If such a number λ exists, it is called an eigenvalue of A. The vector C is called an eigenvector associated to the eigenvalue λ.

Remark. The eigenvector C must be non-zero, since we have
$$A\,\mathbf{0} = \mathbf{0} = \lambda\,\mathbf{0} \quad (\mathbf{0} \text{ is the zero vector})$$
for any number λ.


Eigenvalues & eigenvectors

Example. Consider again the matrix
$$A = \begin{pmatrix} 1 & 2 & 1 \\ 6 & -1 & 0 \\ -1 & -2 & -1 \end{pmatrix}.$$
We have seen that
$$AC_1 = 0\,C_1, \quad AC_2 = -4\,C_2, \quad AC_3 = 3\,C_3,$$
where
$$C_1 = \begin{pmatrix} 1 \\ 6 \\ -13 \end{pmatrix}, \quad C_2 = \begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix}, \quad C_3 = \begin{pmatrix} 2 \\ 3 \\ -2 \end{pmatrix}.$$
So C_1 is an eigenvector of A associated to the eigenvalue 0, C_2 is an eigenvector of A associated to the eigenvalue -4, while C_3 is an eigenvector of A associated to the eigenvalue 3.

Determinants

Determinant of order 2 (easy to remember, for order 2 only):
$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}.$$
Example: Evaluate the determinant
$$\begin{vmatrix} 1 & 2 \\ 3 & 4 \end{vmatrix} = 1\cdot 4 - 2\cdot 3 = -2.$$

Eigenvalues & eigenvectors

For a square matrix A of order n, the number λ is an eigenvalue if and only if there exists a non-zero vector C such that
$$AC = \lambda C.$$
The equation (A − λI_n)C = 0 is a homogeneous linear system. We also know that this system has only one solution if and only if the matrix coefficient A − λI_n is invertible, i.e. det(A − λI_n) ≠ 0. Since the zero vector is a solution and C is not the zero vector, we must have
$$\det(A - \lambda I_n) = 0.$$

Example. Consider the matrix
$$A = \begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix}.$$
Using the determinant of order 2, we obtain
$$\det(A - \lambda I_2) = \begin{vmatrix} 1-\lambda & 2 \\ 2 & -\lambda \end{vmatrix} = (1-\lambda)(0-\lambda) - 4 = 0,$$
which is equivalent to the quadratic equation
$$\lambda^2 - \lambda - 4 = 0.$$
Solving this equation (use the quadratic formula) leads to
$$\lambda_1 = \frac{1+\sqrt{17}}{2} \quad \text{and} \quad \lambda_2 = \frac{1-\sqrt{17}}{2}.$$
In other words, the matrix A has only two eigenvalues.


Eigenvalues & eigenvectors

Example. Consider the diagonal matrix
$$D = \begin{pmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & c & 0 \\ 0 & 0 & 0 & d \end{pmatrix}.$$
Its characteristic polynomial is
$$\det(D - \lambda I_n) = \begin{vmatrix} a-\lambda & 0 & 0 & 0 \\ 0 & b-\lambda & 0 & 0 \\ 0 & 0 & c-\lambda & 0 \\ 0 & 0 & 0 & d-\lambda \end{vmatrix} = (a-\lambda)(b-\lambda)(c-\lambda)(d-\lambda) = 0.$$
So the eigenvalues of D are a, b, c, and d, i.e. the entries on the diagonal.

In general, for a square matrix A of order n, the equation
$$\det(A - \lambda I_n) = 0$$
will give the eigenvalues of A. It is a polynomial function in λ of degree n, so this equation has at most n roots or solutions. Therefore a square matrix A of order n has at most n eigenvalues.

Computation of Eigenvectors

Let A be a square matrix of order n and λ one of its eigenvalues. Let X be an eigenvector of A associated to λ. We must have
$$AX = \lambda X \quad \text{or} \quad (A - \lambda I_n)X = 0.$$
This is a linear system for which the matrix coefficient is A − λI_n. Since the zero vector is a solution, the system is consistent.

Remark. Note that if X is a vector which satisfies AX = λX, then the vector Y = cX (for any arbitrary number c) satisfies the same equation, i.e. AY = λY. In other words, if we know that X is an eigenvector, then cX is also an eigenvector associated to the same eigenvalue.

Example. Consider the matrix
$$A = \begin{pmatrix} 1 & 2 & 1 \\ 6 & -1 & 0 \\ -1 & -2 & -1 \end{pmatrix}.$$
First we look for the eigenvalues of A. These are given by the characteristic equation det(A − λI_3) = 0, i.e.
$$\begin{vmatrix} 1-\lambda & 2 & 1 \\ 6 & -1-\lambda & 0 \\ -1 & -2 & -1-\lambda \end{vmatrix} = 0.$$
If we develop this determinant using the third column, we obtain
$$\begin{vmatrix} 6 & -1-\lambda \\ -1 & -2 \end{vmatrix} + (-1-\lambda)\begin{vmatrix} 1-\lambda & 2 \\ 6 & -1-\lambda \end{vmatrix} = 0.$$
By algebraic manipulations, we get
$$-\lambda(\lambda + 4)(\lambda - 3) = 0,$$

which implies that the eigenvalues of A are 0, -4, and 3.
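As a quick numerical cross-check (not part of the original slides), NumPy can compute the characteristic polynomial coefficients and their roots directly:

```python
import numpy as np

A = np.array([[1, 2, 1],
              [6, -1, 0],
              [-1, -2, -1]], dtype=float)

# Coefficients of det(lambda*I - A) = lambda^3 + lambda^2 - 12*lambda (up to rounding)
coeffs = np.poly(A)
print(np.round(coeffs, 10))                # [1, 1, -12, 0] -> lambda*(lambda+4)*(lambda-3)

print(np.round(np.roots(coeffs), 10))      # roots 0, -4, 3 (in some order)
print(np.round(np.linalg.eigvals(A), 10))  # same eigenvalues
```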


Computation of Eigenvectors

EIGENVECTORS ASSOCIATED WITH EIGENVALUES

1. Case λ = 0: The associated eigenvectors are given by the linear system
$$AX = 0,$$
which may be rewritten as

x + 2y + z = 0
6x − y = 0
−x − 2y − z = 0

The third equation is identical to the first. From the second equation, we have y = 6x, so the first equation reduces to 13x + z = 0. So this system is equivalent to

y = 6x
z = −13x

So the unknown vector X is given by
$$X = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x \\ 6x \\ -13x \end{pmatrix} = x\begin{pmatrix} 1 \\ 6 \\ -13 \end{pmatrix}.$$
Therefore, any eigenvector X of A associated to the eigenvalue 0 is given by
$$X = c\begin{pmatrix} 1 \\ 6 \\ -13 \end{pmatrix},$$
where c is an arbitrary number.

Computation of Eigenvectors

2. Case λ = -4: The associated eigenvectors are given by the linear system
$$AX = -4X \quad \text{or} \quad (A + 4I_3)X = 0,$$
which may be rewritten as

5x + 2y + z = 0
6x + 3y = 0
−x − 2y + 3z = 0

We use elementary operations to solve it. First we consider the augmented matrix [A + 4I_3 | 0], i.e.
$$\left(\begin{array}{ccc|c} 5 & 2 & 1 & 0 \\ 6 & 3 & 0 & 0 \\ -1 & -2 & 3 & 0 \end{array}\right).$$
Then we use elementary row operations to reduce it to an upper-triangular form. First we bring the third row to the top:
$$\left(\begin{array}{ccc|c} -1 & -2 & 3 & 0 \\ 5 & 2 & 1 & 0 \\ 6 & 3 & 0 & 0 \end{array}\right).$$
Next, we use the first row to eliminate the 5 and the 6 in the first column. We obtain
$$\left(\begin{array}{ccc|c} -1 & -2 & 3 & 0 \\ 0 & -8 & 16 & 0 \\ 0 & -9 & 18 & 0 \end{array}\right).$$


Computation of Eigenvectors

If we cancel the -8 and the -9 from the second and third rows (dividing each row by its leading entry), we obtain
$$\left(\begin{array}{ccc|c} -1 & -2 & 3 & 0 \\ 0 & 1 & -2 & 0 \\ 0 & 1 & -2 & 0 \end{array}\right).$$
Finally, we subtract the second row from the third to get
$$\left(\begin{array}{ccc|c} -1 & -2 & 3 & 0 \\ 0 & 1 & -2 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right).$$
Next, we set z = c. From the second row, we get y = 2z = 2c. The first row then implies x = -2y + 3z = -c. Hence
$$X = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} -c \\ 2c \\ c \end{pmatrix} = c\begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix}.$$
Therefore, any eigenvector X of A associated to the eigenvalue -4 is given by
$$X = c\begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix},$$
where c is an arbitrary number.

Computation of Eigenvectors

3. Case λ = 3: Using similar ideas as the ones described above, one may easily show that any eigenvector X of A associated to the eigenvalue 3 is given by
$$X = c\begin{pmatrix} 2 \\ 3 \\ -2 \end{pmatrix},$$
where c is an arbitrary number.

Summary: Let A be a square matrix. Assume λ is an eigenvalue of A. In order to find the associated eigenvectors, we do the following steps:
1. Write down the associated linear system
$$AX = \lambda X \quad \text{or} \quad (A - \lambda I_n)X = 0.$$
2. Solve the system.
3. Rewrite the unknown vector X as a linear combination of known vectors.
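The three cases can be checked in code. A minimal sketch (assuming SciPy is available, which is not part of the original slides) computes a basis of the null space of A − λI for each eigenvalue; each basis vector is proportional to the eigenvector found by hand.

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, 1],
              [6, -1, 0],
              [-1, -2, -1]], dtype=float)

for lam, hand in [(0, [1, 6, -13]), (-4, [-1, 2, 1]), (3, [2, 3, -2])]:
    ns = null_space(A - lam * np.eye(3))   # basis of solutions of (A - lam*I)X = 0
    v = ns[:, 0]
    # The numerical basis vector is a scalar multiple of the hand-computed one.
    ratio = v / np.array(hand, dtype=float)
    print(lam, np.round(ratio / ratio[0], 6))   # all components equal 1
```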


Why Eigenvectors and Eigenvalues

An eigenvector of a square matrix is a non-zero vector that, when multiplied by the matrix, yields a vector that differs from the original at most by a multiplicative scalar. That scalar is its eigenvalue.

[Figure: a shear mapping. The red arrow changes direction but the blue arrow does not. The blue arrow is an eigenvector, and since its length is unchanged its eigenvalue is 1.]

Principal Components Analysis (PCA)

• Principle
  – Linear projection method to reduce the number of parameters.
  – Transfers a set of correlated variables into a new set of uncorrelated variables.
  – Maps the data into a space of lower dimensionality.
  – A form of unsupervised learning.
• Properties
  – It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables.
  – The new axes are orthogonal and represent the directions with maximum variability.

PCs and Variance

• The first PC retains the greatest amount of variation in the sample.
• The kth PC retains the kth greatest fraction of the variation in the sample.
• The kth largest eigenvalue of the correlation matrix C is the variance in the sample along the kth PC.

Dimensionality Reduction

You can ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much:
– n dimensions in the original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– the final data set has only p dimensions

A sketch of this selection step appears below.
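A minimal NumPy sketch of the selection step (illustrative only, not from the slides), assuming a data matrix X with samples in rows: keep the first p eigenvectors ranked by eigenvalue, e.g. enough to retain 95% of the variance.

```python
import numpy as np

def choose_components(X, var_to_keep=0.95):
    """Return the top-p eigenvectors of the covariance matrix of X."""
    Xc = X - X.mean(axis=0)                    # subtract the mean
    C = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum() # cumulative explained variance
    p = int(np.searchsorted(ratio, var_to_keep)) + 1
    return eigvecs[:, :p], eigvals[:p]

# Example: random correlated data, purely for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
W, lam = choose_components(X)
print(W.shape, np.round(lam, 3))
```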


PCA Example – STEP 1

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

• Subtract the mean from each of the data dimensions. This produces a data set whose mean is zero. Subtracting the mean makes the variance and covariance calculation easier by simplifying their equations. The variance and covariance values are not affected by the mean value.

DATA:              ZERO MEAN DATA:
  x     y            x       y
 2.5   2.4           .69     .49
 0.5   0.7         -1.31   -1.21
 2.2   2.9           .39     .99
 1.9   2.2           .09     .29
 3.1   3.0          1.29    1.09
 2.3   2.7           .49     .79
 2.0   1.6           .19    -.31
 1.0   1.1          -.81    -.81
 1.5   1.6          -.31    -.31
 1.1   0.9          -.71   -1.01

PCA Example – STEP 2

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

• Calculate the covariance matrix:
$$\mathrm{cov} = \begin{pmatrix} .616555556 & .615444444 \\ .615444444 & .716555556 \end{pmatrix}$$

Variance measures how far a set of numbers is spread out.

Covariance provides a measure of the strength of the correlation between two or more sets of random variates. The covariance for two random variates X and Y, each with sample size N, is defined by the expectation value
$$\mathrm{cov}(X, Y) = \langle (X - \mu_X)(Y - \mu_Y) \rangle.$$
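The covariance matrix above can be reproduced with NumPy (an illustrative check; np.cov uses the same N−1 normalisation as the tutorial):

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

zero_mean = data - data.mean(axis=0)      # STEP 1: subtract the mean
cov = np.cov(zero_mean, rowvar=False)     # STEP 2: covariance matrix
print(cov)                                # matches the matrix on the slide (up to print precision)
```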


PCA Example – STEP 3

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

• Calculate the eigenvectors and eigenvalues of the covariance matrix:

eigenvalues =
  0.04908340
  1.28402771

eigenvectors =
  -0.73517866   -0.677873399
   0.67787340   -0.735178656

• The eigenvectors are plotted as diagonal dotted lines on the plot.
• Note they are perpendicular to each other.
• Note one of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line, but are off to the side of the main line by some amount.
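Continuing the illustrative NumPy check from STEP 2, np.linalg.eigh returns the same eigenvalues and eigenvectors (the eigenvectors are the columns; their sign and column order are conventions, so they may differ from the slide by a factor of -1 or by ordering):

```python
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices, ascending order
print(np.round(eigvals, 8))               # [0.0490834  1.28402771]
print(np.round(eigvecs, 8))               # columns are the eigenvectors
```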

PCA Example – STEP 4

• Reduce dimensionality and form the feature vector.

Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. The eigenvector with the highest eigenvalue is the principal component of the data set. In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.

Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much:
• n dimensions in your data
• calculate n eigenvectors and eigenvalues
• choose only the first p eigenvectors
• the final data set has only p dimensions


PCA Example – STEP 4

• Feature Vector

FeatureVector = (eig_1 eig_2 eig_3 … eig_n)

We can either form a feature vector with both of the eigenvectors:
$$\begin{pmatrix} -.677873399 & -.735178656 \\ -.735178656 & .677873399 \end{pmatrix}$$
or we can choose to leave out the smaller, less significant component and only have a single column:
$$\begin{pmatrix} -.677873399 \\ -.735178656 \end{pmatrix}$$

PCA Example – STEP 5

• Deriving the new data

FinalData = RowFeatureVector x RowZeroMeanData

RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top.
RowZeroMeanData is the mean-adjusted data, transposed, i.e. the data items are in each column, with each row holding a separate dimension.

PCA Example – STEP 5

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

FinalData (dimensions along columns):

      x              y
 -.827970186    -.175115307
 1.77758033      .142857227
 -.992197494     .384374989
 -.274210416     .130417207
-1.67580142     -.209498461
 -.912949103     .175282444
  .0991094375   -.349824698
 1.14457216      .0464172582
  .438046137     .0177646297
 1.22382056     -.162675287
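The final projection, and the inverse mapping back to the original space, can be sketched as follows; this continues the illustrative NumPy example and is not part of the original tutorial:

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
mean = data.mean(axis=0)

# Feature vector from STEP 4: eigenvectors as rows, most significant first.
row_feature_vector = np.array([[-0.677873399, -0.735178656],
                               [-0.735178656,  0.677873399]])

row_zero_mean_data = (data - mean).T               # dimensions in rows
final_data = row_feature_vector @ row_zero_mean_data
print(np.round(final_data.T, 6))                   # matches the FinalData table

# Going back: RowZeroMeanData = RowFeatureVector^T x FinalData (rows are orthonormal)
reconstructed = (row_feature_vector.T @ final_data).T + mean
print(np.allclose(reconstructed, data))            # True
```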


PCA Algorithm

• Get some data
• Subtract the mean
• Calculate the covariance matrix
• Calculate the eigenvectors and eigenvalues of the covariance matrix
• Choose components and form a feature vector
• Derive the new data set
(A compact sketch of these steps appears below.)

Why PCA?

• Maximum variance theory
  o In general, the variance of noise should be low and the variance of the signal should be high. A common measure is the signal-to-noise ratio (SNR); a high SNR indicates high-precision data.

Projection

$$a \cdot b = |a|\,|b|\cos(\theta)$$

The dot product is related to the angle between the two vectors – but it doesn't tell us the angle itself.

[Figure: a data point x^(i) in the original coordinate system and its projection onto a unit vector u.]

x^(i)T u is the length from the blue node (the projected point) to the origin of coordinates. The mean of the projected data is zero, so

$$\operatorname{var}\!\left(x^{(i)T}u\right) = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^2 = \frac{1}{m}\sum_{i=1}^{m} u^T x^{(i)} x^{(i)T} u = u^T\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}\right)u.$$

Maximum Variance

$$\mathrm{Cov} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}$$

is the covariance matrix, since the mean is zero. With u a unit vector (u^T u = 1),

$$\lambda = \operatorname{var}\!\left(x^{(i)T}u\right) = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}u\right)^2 = u^T\,\mathrm{Cov}\,u.$$

Multiplying both sides by u,

$$u\lambda = \lambda u = u\,u^T\,\mathrm{Cov}\,u = \mathrm{Cov}\,u,$$

and therefore

$$\mathrm{Cov}\,u = \lambda u.$$

(More rigorously, Cov u = λu is the stationarity condition for maximising u^T Cov u subject to u^T u = 1.)

We got it: λ is an eigenvalue of the matrix Cov, and u is the corresponding eigenvector. The goal of PCA is to find a u for which the variance of all projection points is maximum.
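A quick numerical sanity check of the maximum-variance claim, using the worked example's data (illustrative, not from the slides): the variance of the projections onto the top eigenvector equals the largest eigenvalue of the covariance matrix.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X = data - data.mean(axis=0)                 # zero-mean data

cov = np.cov(X, rowvar=False)                # covariance matrix (N-1 normalisation)
eigvals, eigvecs = np.linalg.eigh(cov)
u = eigvecs[:, -1]                           # eigenvector of the largest eigenvalue

proj = X @ u                                 # projections x^(i)T u
print(np.round(proj.var(ddof=1), 8))         # 1.28402771
print(np.round(eigvals[-1], 8))              # 1.28402771 -> maximum variance = top eigenvalue
```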
