Principal Component Analysis
Weitao Tan
August 9, 2018
Instructor: Junwen Peng

1 Introduction

Principal component analysis (PCA) is a mathematical method for finding the main axes of a multidimensional data set. In data mining and machine learning, each data point is typically stored as a vector, and these vectors often have thousands of dimensions, so the dimension of the data must be reduced before the data can be processed efficiently. Reducing the dimension necessarily loses some information. Because PCA yields the main axes of the data set, projecting each data point onto those axes minimizes the amount of information lost. For example, PCA is commonly used in face recognition: each pixel of a face image is one coordinate of a data point, so an entire image is a point in a space with millions of dimensions. Applying PCA reveals which pixels contribute most to distinguishing one picture from another, which reduces the running time of an identification program significantly.

2 What is Principal Component Analysis

Principal component analysis is an algorithm for finding the principal axes of a multidimensional data set. It can be used to analyze and to simplify large data sets: projecting the data onto the main axes reduces the dimension of the data set while keeping the relatively more important information. The reason it works is the principal axis theorem.

2.1 Principal axis theorem

A principal axis is a certain line in a Euclidean space associated with an ellipsoid or hyperboloid, generalizing the major and minor axes of an ellipse or hyperbola. The principal axis theorem gives a constructive procedure for finding the principal axes. It concerns quadratic forms on $\mathbb{R}^n$, which are homogeneous polynomials of degree 2, such as $Ax^2 + Bxy + Cy^2$. Any quadratic form may be represented as

$Q(\mathbf{x}) = \mathbf{x}^T M \mathbf{x}$  (1)

where $M$ is a symmetric matrix. For example, $Ax^2 + Bxy + Cy^2 = [x \; y] \, M \, [x \; y]^T$ for $M = \begin{pmatrix} A & B/2 \\ B/2 & C \end{pmatrix}$. The principal axis theorem tells us that each eigenvector of $M$ represents a principal axis of the corresponding ellipse or hyperbola.

2.2 Principal Component Analysis

While the principal axis theorem finds the principal axes of a single quadratic form, principal component analysis finds them for an entire data set. The first step is to center the data set at zero. The second step is to compute the covariance matrix of the centered data set, whose entries are sums and averages of products of coordinates taken over all data points. The final step is to compute the eigenvalues and eigenvectors of the covariance matrix; the eigenvectors are the principal axes of the data set.

3 PCA algorithms

Step one: centering the data set at zero.

Data: double[][] matrix
Result: double[][] matrix
n, m = matrix.dimensions;
$a_{ij}$ = matrix entries;
$\mathrm{ave}_j = \frac{1}{n} \sum_{i=1}^{n} a_{ij}$;
for each $a_{ij}$ in matrix do
  $a_{ij} = a_{ij} - \mathrm{ave}_j$
end
Algorithm 1: changeAverageToZero

Step two: computing the covariance matrix of the centered data set.

Data: double[][] matrix
Result: double[][] matrix
n, m = matrix.dimensions;
matrix = (matrix$^T$ * matrix) / n;
Algorithm 2: getVarianceMatrix

Step three: computing the eigenvalues and eigenvectors.

Data: double[][] matrix
Result: double[] eigenvalue; double[][] eigenvector
matrix = matrix.changeAverageToZero; (zero-center the data set)
matrix = matrix.getVarianceMatrix; (compute the covariance matrix)
eigenvalue = matrix.getEigenvalue; (eigenvalues of the covariance matrix)
eigenvector = matrix.getEigenVectorMatrix; (eigenvectors of the covariance matrix)
Algorithm 3: PCA
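To make the pseudocode concrete, here is a minimal runnable sketch of Algorithms 1 and 2 in Python with NumPy. The function names mirror the pseudocode but are my own illustrative choices, not part of any standard library.

import numpy as np

def change_average_to_zero(matrix):
    # Algorithm 1: subtract each column's mean so that every
    # column (dimension) of the data set averages to zero.
    return matrix - matrix.mean(axis=0)

def get_variance_matrix(matrix):
    # Algorithm 2: covariance matrix of the centered data,
    # (X^T X) / n, where n is the number of data points (rows).
    n = matrix.shape[0]
    return matrix.T @ matrix / n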
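Continuing the sketch, Algorithm 3 chains the two steps and eigendecomposes the result. Because the covariance matrix is symmetric, numpy.linalg.eigh is the appropriate routine; it returns the eigenvalues in ascending order, with the matching eigenvectors in the columns of the second result. The usage data below is made up for illustration.

def pca(matrix):
    # Algorithm 3: center the data, form the covariance matrix,
    # then compute its eigenvalues and eigenvectors.
    cov = get_variance_matrix(change_average_to_zero(matrix))
    eigenvalue, eigenvector = np.linalg.eigh(cov)
    return eigenvalue, eigenvector

# Usage: a cloud stretched along the x-axis should give a major
# axis close to (1, 0), up to sign.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
vals, vecs = pca(data)
print(vals)         # eigenvalues in ascending order
print(vecs[:, -1])  # eigenvector of the largest eigenvalue: the major axis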
4 Data input and data result

4.1 Example result for a rectangular data set

In this example the data input is roughly rectangular; the red points in the figure show the eigenvectors scaled to unit size. The eigenvalues are 0.082 and 0.750, so the eigenvector with eigenvalue 0.750 (pointing right) is the major axis.

4.2 Example result for an oval data set

In this example the data input is oval-shaped; the red points show the eigenvectors scaled to unit size. The eigenvalues are 0.231 and 0.798, so the eigenvector with eigenvalue 0.798 (pointing right) is the major axis.

4.3 Example result for the oval data set rotated by 30°

This data set is the same as in the previous example, but rotated clockwise by 30°; the red points show the eigenvectors scaled to unit size. The eigenvalues are again 0.231 and 0.798, so the eigenvector with eigenvalue 0.798 (now pointing down and to the right) is the major axis. Because this data set is obtained by rotating the previous one, the eigenvalues do not change; the major axis simply rotates with the data (this invariance is checked numerically at the end of Section 5).

5 Restrictions of PCA and solutions

The PCA algorithm also has some restrictions. The main one is that the data set should lie in a simply connected region. Suppose two parts of the data set are disconnected from each other: one part lies in a circle of radius 2 centered at (2, 2), and the other lies in a circle of radius 1 centered at (-3, -3). In that case PCA will return the vector connecting the two circle centers rather than describing the shape of either part. A way around this restriction is to apply PCA to each part of the data separately and then analyze the results separately; a numerical illustration follows below.

Another restriction is that if two eigenvalues returned by PCA are very close, it is difficult to say which axis is more important. One proposed remedy is to use the eigenvectors as a new input data set and apply PCA again, repeating until the eigenvalues differ.
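As a concrete check of the first restriction, the sketch below (reusing the pca function sketched after Section 3, with made-up cluster data) builds two disconnected clusters like those described above. The leading eigenvector comes out roughly proportional to (1, 1), the direction joining the two centers, and says nothing about the shape of either cluster.

import numpy as np

rng = np.random.default_rng(1)
# Two disconnected clusters, as in the example above: one scattered
# around (2, 2) and one scattered around (-3, -3).
c1 = np.array([2.0, 2.0]) + rng.normal(scale=0.7, size=(100, 2))
c2 = np.array([-3.0, -3.0]) + rng.normal(scale=0.35, size=(100, 2))
data = np.vstack([c1, c2])

vals, vecs = pca(data)  # pca() as sketched in Section 3
print(vecs[:, -1])      # roughly +-(0.71, 0.71): the direction
                        # joining the two cluster centers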
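Finally, the observation in Section 4.3 is also easy to verify numerically with the same hedged sketch: rotating every point by an orthogonal matrix R replaces the covariance matrix C with R C R^T, which has the same eigenvalues, while the eigenvectors rotate with the data.

theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(2)
oval = rng.normal(size=(300, 2)) * np.array([2.0, 0.8])  # oval-like cloud
vals_a, vecs_a = pca(oval)
vals_b, vecs_b = pca(oval @ R.T)  # the same cloud, rotated by 30 degrees

print(np.allclose(vals_a, vals_b))   # True: eigenvalues are unchanged
print(vecs_a[:, -1], vecs_b[:, -1])  # the major axis rotates with the data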