Dimensionality Reduction and Principal Component Analysis

Dimensionality Reduction
Before running any ML algorithm on our data, we may want to reduce the number of features:
● To save computer memory/disk space if the data are large.
● To reduce execution time.

● To visualize our data, e.g., if the number of features can be reduced to 2 or 3.

Projecting Data - 2D to 1D
Dimensionality reduction results in an approximation of the original data.
[Figures: 2D data plotted against x1 and x2 is projected onto a line, yielding a single new feature; an image is shown at original quality and at 50%, 5%, and 1% approximations.]

Projecting Data - 3D to 2D
[Figure: 3D data is projected onto a 2D plane.]
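As a minimal sketch of such a projection (hypothetical points and a hand-picked direction, not the data in the figures above), 2D points can be reduced to a single new feature by projecting them onto a unit vector:

import numpy as np

# Hypothetical 2D data: two correlated features, six points.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2],
              [4.0, 3.8], [5.0, 5.1], [6.0, 6.0]])

u = np.array([1.0, 1.0]) / np.sqrt(2)  # direction of the 1D line (unit vector)
z = X @ u                              # the new feature: position of each point along u
X_approx = np.outer(z, u)              # back-projection: 2D approximation of X
print(z.shape)                         # shape (6,): one value per point
print(np.abs(X - X_approx).max())      # projection error is small because x1 and x2 are correlated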

Why such projections are effective

● In many datasets, some of the features are correlated.

● Correlated features can often be well approximated by a smaller number of new features.

● For example, consider the problem of predicting housing prices. Some of the features may be the square footage of the house, number of bedrooms, number of bathrooms, and lot size. These features are likely correlated.

Principal Component Analysis (PCA)
Suppose we want to reduce data from d dimensions to k dimensions, where d > k. PCA finds k vectors onto which to project the data so that the projection errors are minimized. In other words, PCA finds the principal components, which offer the best approximation.
● Determine a new basis.
● Project the data onto the k axes of the new basis with the largest variance.
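As an illustration of this idea with synthetic data (not the housing dataset itself), two features that are nearly linear functions of each other can be summarized by a single principal component:

import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3500, size=200)               # hypothetical square footage
bedrooms = sqft / 800 + rng.normal(0, 0.3, size=200)  # strongly correlated with sqft
X = np.column_stack([sqft, bedrooms])

pca = PCA(n_components=1)
Z = pca.fit_transform(preprocessing.scale(X))         # scale, then project 2D onto 1D
print(pca.explained_variance_ratio_)                  # close to 1: one new feature captures most of the variance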

PCA ≠ Linear Regression
[Figure: two panels over x1. PCA minimizes the perpendicular projection error onto the new axis (x2 vs. x1); linear regression minimizes the vertical error between y and the prediction (y vs. x1).]

PCA Algorithm
Given n data points, each with d features, i.e., an n×d matrix X:
● Preprocessing: perform feature scaling.
● Compute the covariance matrix.
● Calculate the eigenvalues and corresponding eigenvectors of the covariance matrix via singular value decomposition.
● This yields a new basis of d vectors as well as the variance along each of the d axes.
● Retain the k vectors with the largest corresponding variance and project the data onto this k-dimensional subspace.
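A from-scratch sketch of these steps using NumPy (variable names are illustrative; this is not the sklearn implementation used in the examples below):

import numpy as np

def pca_project(X, k):
    # Preprocessing: feature scaling (zero mean, unit variance per feature).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix (d x d).
    Sigma = np.cov(Xs, rowvar=False)
    # Eigenvalues/eigenvectors via singular value decomposition.
    U, S, Vt = np.linalg.svd(Sigma)
    # Columns of U form the new basis; S holds the variance along each of the d axes.
    # Retain the k directions with the largest variance and project onto them.
    return Xs @ U[:, :k], S / S.sum()

Z, variance_ratio = pca_project(np.random.rand(30, 2), k=1)
print(Z.shape, variance_ratio)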

PCA Example
sklearn PCA with X a 30×2 matrix:

from sklearn.decomposition import PCA
pca = PCA(n_components=1)
Z = pca.fit_transform(X)
print(Z.shape)
print(pca.explained_variance_ratio_)

The shape of the compressed data Z is (30, 1). For three different example datasets X, the output is:
● [ 0.5815958 ]: 58% of variance explained by the first principal component
● [ 0.9945036 ]: 99% of variance explained by the first principal component
● [ 1.0000000 ]: 100% of variance explained by the first principal component

Iris Data
For 50 instances of each type of iris (Iris setosa, Iris versicolor, Iris virginica), we have four features: sepal length, sepal width, petal length, petal width.
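One way to obtain this 150×4 matrix is sklearn's built-in copy of the iris data (a sketch; the slides do not show how X was loaded):

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                 # 150x4: sepal length/width and petal length/width in cm
print(X.shape)                # (150, 4)
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']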

Feature Scaling
X is a 150×4 matrix of raw iris measurements; X_scaled is the 150×4 standardized version:

  X (cm)                          X_scaled
  sepal   sepal   petal   petal   sepal   sepal   petal   petal
  length  width   length  width   length  width   length  width
  5.1     3.5     3.4     0.2     -0.9     0.3    -0.1     0.2
  4.9     3.0     1.1     1.8     -1.1    -0.7    -0.7    -0.5
  4.7     3.2     2.3     2.1      0.6     0.0    -0.3     0.0
  4.6     2.6     4.5     0.1      0.1    -0.1    -1.2     0.1
  5.0     2.8     1.4     0.3      1.4    -1.0     0.5    -1.1
  5.4     3.3     5.7     1.4     -0.8     0.0     0.6    -0.4
  4.6     2.2     6.1     0.2     -0.5     0.4    -0.3     0.0
  5.0     3.4     2.5     2.0      0.0     0.9     1.3     0.6
  4.4     2.9     1.4     1.7     -0.4    -0.8     0.9     0.7
  …       …       …       …       …       …       …       …
  5.9     3.0     5.1     1.8     -0.2    -0.6     0.3    -0.8

PCA: 4D to 2D

from sklearn import preprocessing
from sklearn.decomposition import PCA
X_scaled = preprocessing.scale(X)
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)
print(Z.shape)
print(pca.explained_variance_ratio_)

Z is a 150×2 matrix: the shape of the compressed data Z is (150, 2), and [ 0.728 0.230 ] means 95% of the variance is explained by the first two principal components.
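To see why the scaling step matters, a brief comparison sketch (illustrative only; it reuses the iris matrix X from above, and the numbers it prints are not the slide's):

from sklearn import preprocessing
from sklearn.decomposition import PCA

# Without scaling, features measured on larger scales dominate the variance.
print(PCA(n_components=2).fit(X).explained_variance_ratio_)

# With scaling, each feature contributes comparably.
print(PCA(n_components=2).fit(preprocessing.scale(X)).explained_variance_ratio_)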

Eigenfaces
We reduce the features of 250x250 pixel images from 62,500 to 200.

from sklearn.decomposition import PCA
pca = PCA(n_components=200)
eigenfaces = pca.components_
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

Output:
[ 0.22 0.09 0.05 0.04 0.04 0.03 0.03 0.03 0.02 0.02 ...]: 57% of variance explained by the first 10 principal components
0.952283956204: 95% of variance explained by the first 200 principal components
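The snippet above omits the fitting step. A fuller sketch, assuming a hypothetical matrix X_faces of shape (n_images, 62500) holding flattened 250x250 pixel images (the slides do not specify how the images are loaded):

from sklearn.decomposition import PCA

pca = PCA(n_components=200)
Z = pca.fit_transform(X_faces)                        # X_faces: assumed (n_images, 62500) array
eigenfaces = pca.components_.reshape(200, 250, 250)   # each principal component viewed as an image
print(pca.explained_variance_ratio_.sum())            # fraction of variance retained by 200 components

# Approximate reconstruction of the first face from its 200 components.
face_approx = pca.inverse_transform(Z[:1]).reshape(250, 250)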

How to choose k
● We want to lose as little information as possible, i.e., we want the proportion of variance that is retained to be as large as possible.
● Typically, k is chosen so that 95% or 99% of the variance in the data is retained (see the closing sketch at the end of this section).
● If there are many correlated features in the data, often a high percentage of the variance can be retained while using a small number of features, i.e., k much less than d.

ML Algorithms Overview
[Course map: Supervised Learning splits into Non-Parametric methods (Decision Trees, kNN) and Parametric methods (Regression Models such as Linear Regression, Linear Classifiers such as Logistic Regression and Support Vector Machines, Non-Linear Classifiers such as Neural Networks), alongside K-Means, Gaussian Mixture Models, Dimensionality Reduction, Collaborative Filtering, and Hidden Markov Models.]

PCA Summary
● Prior to running a ML algorithm, PCA can be used to reduce the number of dimensions in the data. This is helpful, e.g., to speed up execution of the ML algorithm.
● Since datasets often have many correlated features, PCA is effective in reducing the number of features while retaining most of the variance in the data.
● Before performing PCA, feature scaling is critical.
● Principal components aren't easily interpreted features. We're not using a subset of the original features.
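As a closing sketch tying the summary together (hypothetical code, not from the slides): sklearn's PCA also accepts a fractional n_components, which keeps the smallest k that retains the requested share of the variance, matching the 95%/99% rule of thumb above. Here X is assumed to be a data matrix such as the 150×4 iris features.

import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA

X_scaled = preprocessing.scale(X)                 # feature scaling is critical before PCA
pca = PCA(n_components=0.95)                      # keep enough components for 95% of the variance
Z = pca.fit_transform(X_scaled)
print(pca.n_components_)                          # the chosen k
print(np.cumsum(pca.explained_variance_ratio_))   # cumulative variance retained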