Dimensionality Reduction
CSL603 - Fall 2017
Narayanan C Krishnan
[email protected]

Outline

• Motivation
• Unsupervised Dimensionality Reduction
  • Principal Component Analysis
• Supervised Dimensionality Reduction
  • Linear Discriminant Analysis

Motivation (1)

• Easy and convenient to collect data
  • Redundant and “not so useful” pieces of information
  • Simultaneously collected for multiple objectives
  • Data accumulates at a rapid pace
• Need an effective approach to downsize data
• Intrinsic dimensionality of the data may be small
  • e.g., facial poses are governed by two to three degrees of freedom
• Visualization
  • Projecting high-dimensional data onto 2-3 dimensions

Motivation (2)

• Ineffectiveness of most machine learning techniques on high-dimensional data
• Data compression
  • Efficient storage and retrieval
• Noise removal

Examples of High Dimensional Data (1)

• Computer Vision

(Slide credit: Jieping Ye)

Examples of High Dimensional Data (2)

• Document/Text Analysis

• Classify the documents
  • Thousands of documents
  • Thousands of terms

• 20 Newsgroups data
  • 20K documents
  • 40K features!
(Slide credit: Jieping Ye)

Dimensionality Reduction Approaches

• Feature extraction (reduction)
  • Subspace learning: mapping high-dimensional data onto a lower-dimensional space
  • Linear
  • Non-linear
• Feature selection
  • Selects an optimal subset of features from the existing set according to an objective function

Representative Algorithms

• Unsupervised
  • Principal Component Analysis (PCA)
  • Kernel PCA, Probabilistic PCA
  • Independent Component Analysis (ICA)
  • Latent Semantic Indexing (LSI)
  • Manifold Learning
• Supervised
  • Linear Discriminant Analysis (LDA)
  • Canonical Correlation Analysis (CCA)
  • Partial Least Squares (PLS)

Principal Component Analysis (PCA)

• Basic Idea
  • Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
  • Retains most of the sample’s information
  • The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains
• Also known as the Karhunen-Loève transform
• A very popular feature extraction technique

PCA - Formulation

• Objective:
  • Minimize the reconstruction error $\sum_{i=1}^{N} \|x_i - \tilde{x}_i\|^2$, where $\tilde{x}_i$ is the projection of $x_i$ onto the lower-dimensional subspace (minimum projection error)
  • Maximize the variance of the projected data
• Theoretically, both formulations are equivalent
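
Both views can be checked numerically on toy data: for any unit-norm direction, the mean squared reconstruction error and the variance of the projections add up to the total variance, so minimizing one is the same as maximizing the other. A minimal NumPy sketch (the random data and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # toy data: 500 points in 5 dimensions
Xc = X - X.mean(axis=0)                # center the data

u = rng.normal(size=5)
u /= np.linalg.norm(u)                 # an arbitrary unit-norm direction

proj = Xc @ u                          # scalar projections u^T x_i
recon = np.outer(proj, u)              # rank-1 reconstructions of the centered data

total_var = np.mean(np.sum(Xc**2, axis=1))
proj_var = np.mean(proj**2)
recon_err = np.mean(np.sum((Xc - recon)**2, axis=1))

# Reconstruction error + projected variance = total variance (up to float error),
# so minimizing the error is equivalent to maximizing the projected variance.
assert np.isclose(total_var, proj_var + recon_err)
```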

PCA – Maximum Variance Formulation (1)

• Consider a set of data points $x_i$, where $i = 1, \dots, N$ and $x_i \in \mathbb{R}^D$
• Mean of the original data: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Goal: project the data onto a $K < D$ dimensional space while maximizing the variance of the projected data
• To begin with, consider $K = 1$:
  • Let $u_1$ be the direction of the projection
  • Set $\|u_1\| = 1$, as it is only the direction that is important
  • Projected data: $u_1^T x_i$; projected mean: $u_1^T \bar{x}$

Dimensionality Reduction CSL603 - Machine Learning 11 PCA – Maximum Variance Formulation (2) • of original data: 1 Σ = � − �̅ � − �̅ �

Dimensionality Reduction CSL603 - Machine Learning 12 PCA – Maximum Variance Formulation (3) • Covariance of original data: 1 Σ = � − �̅ � − �̅ � • Variance of the projected data:

Dimensionality Reduction CSL603 - Machine Learning 13 PCA – Maximum Variance Formulation (4) • Covariance of original data: 1 Σ = � − �̅ � − �̅ � • Variance of the projected data: 1 1 {�� − ��̅} = {�(� − �̅)} � � 1 = � (� − �)̅ � − � ̅ � = � Σ � � • Goal: maximizing variance of the projected data max � Σ � such that � = 1
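
A quick numerical sanity check of the identity above, using randomly generated toy data (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))               # toy data, purely illustrative
xbar = X.mean(axis=0)
Sigma = (X - xbar).T @ (X - xbar) / len(X)  # covariance with the 1/N convention used above

u = rng.normal(size=3)
u /= np.linalg.norm(u)                      # unit-norm projection direction

# Sample variance of the projected data (1/N convention) equals u^T Sigma u.
proj_var = np.mean((X @ u - u @ xbar) ** 2)
assert np.isclose(proj_var, u @ Sigma @ u)
```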

PCA – Maximum Variance Formulation (5–6)

• Using Lagrange multipliers:
$$\max_{u_1}\; u_1^T \Sigma\, u_1 + \lambda_1\left(1 - u_1^T u_1\right)$$
• Setting the derivative with respect to $u_1$ equal to 0:
$$\Sigma u_1 = \lambda_1 u_1$$
• $u_1$ must be an eigenvector of $\Sigma$! And $u_1^T \Sigma\, u_1 = \lambda_1$, implying the variance is maximized by choosing the eigenvector associated with the largest eigenvalue.

• $u_1$ corresponds to the first principal component.
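
A small sketch, on toy correlated data, checking that the leading eigenvector of $\Sigma$ indeed satisfies $\Sigma u_1 = \lambda_1 u_1$ and attains at least as much projected variance as random unit directions (the data and tolerance are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated toy data so the principal direction is well defined.
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh: ascending eigenvalues for symmetric Sigma
u1, lam1 = eigvecs[:, -1], eigvals[-1]

assert np.allclose(Sigma @ u1, lam1 * u1)   # Sigma u1 = lambda1 u1
assert np.isclose(u1 @ Sigma @ u1, lam1)    # variance along u1 equals lambda1

# No random unit direction beats the top eigenvector.
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    assert v @ Sigma @ v <= lam1 + 1e-9
```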

PCA – Maximum Variance Formulation (7–9)

• Second principal direction: maximize the variance $u_2^T \Sigma\, u_2$, subject to $\|u_2\| = 1$ and $u_2^T u_1 = 0$
• Using Lagrange multipliers:
$$\max_{u_2}\; u_2^T \Sigma\, u_2 + \lambda_2\left(1 - u_2^T u_2\right) + \eta\left(u_2^T u_1\right)$$
• Setting the derivative with respect to $u_2$ equal to 0:
$$2\Sigma u_2 - 2\lambda_2 u_2 + \eta\, u_1 = 0$$
• Left-multiplying by $u_1^T$ and using $u_1^T u_2 = 0$ and $u_1^T \Sigma u_2 = \lambda_1 u_1^T u_2 = 0$ gives $\eta = 0$
• Hence $\Sigma u_2 = \lambda_2 u_2$: choose $u_2$ as the eigenvector of $\Sigma$ with the second largest eigenvalue $\lambda_2$
• And so on …

PCA – Maximum Variance Formulation (10)

• Final solution:
  • Eigenvalue decomposition (SVD) of the covariance matrix: $\Sigma = U \Lambda U^T$
  • For a $K < D$ dimensional projection space, choose the $K$ eigenvectors $\{u_1, u_2, \dots, u_K\}$ with the largest associated eigenvalues $\{\lambda_1, \lambda_2, \dots, \lambda_K\}$

PCA - Algorithm

1. Create the $N \times D$ data matrix $X$, with one row vector $x_i^T$ per data point

2. Subtract the mean $\bar{x}$ from each row vector $x_i$ in $X$
3. $\Sigma \leftarrow$ covariance matrix of $X$
4. Find the eigenvectors $U$ and eigenvalues $\Lambda$ of $\Sigma$

5. PCs $\leftarrow$ the $K$ eigenvectors with the largest eigenvalues
6. Transformed data: $Z = X U_K$ (with $X$ mean-centered and $U_K$ holding the $K$ chosen eigenvectors as columns)
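
The six steps translate almost directly into NumPy. A minimal sketch (the function name `pca` and the toy data are illustrative, not part of the slides):

```python
import numpy as np

def pca(X, K):
    """Project the N x D data matrix X onto its top-K principal components."""
    xbar = X.mean(axis=0)                     # step 2: mean of each feature
    Xc = X - xbar                             # subtract the mean from each row
    Sigma = Xc.T @ Xc / len(X)                # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # step 4: eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]         # step 5: sort eigenvalues, largest first
    U_K = eigvecs[:, order[:K]]               # top-K eigenvectors as columns
    Z = Xc @ U_K                              # step 6: transformed (projected) data
    return Z, U_K, eigvals[order]

# Toy usage: reduce 10-dimensional data to 2 dimensions.
X = np.random.default_rng(3).normal(size=(100, 10))
Z, U_K, spectrum = pca(X, K=2)
print(Z.shape)   # (100, 2)
```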

PCA – Illustration (1)

[Figure: (left) original data, (center) principal components (PCs), (right) original data projected into the PC space]

PCA – Illustration (2)

[Figure: reconstructions of a D = 784 digit image using K = 1, 10, 50, and 250 principal components, shown next to the original]

PCA – Model Selection

• How many principal components to select?
• Heuristic: detect the knee in the eigenvalue spectrum and retain the large eigenvalues

[Figure: (left) eigenvalue spectrum for the off-line digits dataset; (right) reconstruction error from projecting the data onto the first K PCs]
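
A hedged sketch of both heuristics on stand-in data (the digits dataset from the slides is not reproduced here): the cumulative explained-variance curve exposes the knee, and the reconstruction error for a given $K$ equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))               # stand-in data; the slides use a digits dataset
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest eigenvalue first

# Fraction of total variance explained by the first K components.
explained = np.cumsum(eigvals) / np.sum(eigvals)

# Mean squared reconstruction error when projecting onto the first K PCs
# equals the sum of the discarded eigenvalues.
for K in (1, 5, 10, 20):
    U_K = eigvecs[:, :K]
    err = np.mean(np.sum((Xc - Xc @ U_K @ U_K.T) ** 2, axis=1))
    print(K, round(explained[K - 1], 3), np.isclose(err, eigvals[K:].sum()))
```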

Limitations of PCA (1)

• Non-parametric
  • No probabilistic model for the observed data
• The covariance matrix needs to be calculated
  • Can be computationally intensive for datasets with a large number of dimensions
• Does not deal with missing data
  • Incomplete data must either be discarded or imputed using ad-hoc methods
• Outliers in the data can unduly affect the analysis
• Batch mode algorithm

Assumptions of PCA

• Gaussian Assumption with linear embedding

Limitations of PCA (2)

• Non-linear embedding

Limitations of PCA (3)

• Non-Gaussian Assumption

Nonlinear PCA using Kernels

• Traditional PCA applies a linear transformation
  • May not be effective for nonlinear data
• Solution: apply a nonlinear transformation $\phi: x \rightarrow \phi(x)$ into a high-dimensional feature space
• Computational efficiency through the kernel trick:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
• Compute the eigenvectors of the $N \times N$ kernel matrix $K$ rather than the covariance matrix in the high-dimensional feature space (a sketch follows below)
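
A minimal kernel PCA sketch assuming an RBF kernel; the kernel choice, the bandwidth `gamma`, and the toy data are my own assumptions (scikit-learn's `KernelPCA` implements the same idea):

```python
import numpy as np

def kernel_pca(X, K, gamma=1.0):
    """Project data onto the top-K nonlinear components using an RBF kernel."""
    # RBF (Gaussian) kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kmat = np.exp(-gamma * sq_dists)

    # Center the kernel matrix (equivalent to centering phi(x) in feature space).
    N = len(X)
    ones = np.ones((N, N)) / N
    Kc = Kmat - ones @ Kmat - Kmat @ ones + ones @ Kmat @ ones

    # Eigenvectors of the centered kernel matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1][:K], eigvecs[:, ::-1][:, :K]

    # Scale so that the projections correspond to unit-length directions in feature space.
    return eigvecs * np.sqrt(np.maximum(eigvals, 1e-12))

# Toy usage on data lying on two concentric circles (nonlinear structure).
rng = np.random.default_rng(5)
angles, radii = rng.uniform(0, 2 * np.pi, 200), np.repeat([1.0, 3.0], 100)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
Z = kernel_pca(X, K=2, gamma=0.5)
print(Z.shape)   # (200, 2)
```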

Linear Discriminant Analysis (LDA)

• Objective: perform dimensionality reduction while preserving as much of the class-discriminatory information as possible
• Supervised dimensionality reduction
• Seeks to find directions along which the classes are best separated

LDA - Illustration

• Consider a set of data points $x_i$, where $i = 1, \dots, N$ and $x_i \in \mathbb{R}^D$, $N_1$ of which belong to class $C_1$ and $N_2$ to class $C_2$
• Obtain a scalar $y_i$ by projecting the samples $x_i$ onto a line (a one-dimensional space): $y_i = w^T x_i$
• Find the $w$ that maximizes the separability of the scalars

LDA Formulation – Two Classes (1–2)

• In order to find a good projection vector, we need to define a measure of separation between the projections.
• The mean vector of each class in the original and transformed spaces:
$$\mu_k = \frac{1}{N_k}\sum_{i \in C_k} x_i, \qquad \tilde{\mu}_k = \frac{1}{N_k}\sum_{i \in C_k} w^T x_i = w^T \mu_k$$
• We could then choose the distance between the projected means as our objective function:
$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$

LDA Formulation – Two Classes (3)

• Is this a good measure? Not by itself: the distance between the projected means ignores the within-class scatter, so classes with large variance along $w$ can overlap heavily even when their projected means are far apart.

LDA Formulation – Two Classes (4)

• Proposed by Fisher (Fisher Discriminant Analysis)
• Maximize a function that represents the difference between the means, normalized by a measure of the within-class variability (scatter):
$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
• where $\tilde{s}_k^2 = \sum_{i \in C_k}(y_i - \tilde{\mu}_k)^2$ is the scatter of the projected samples of class $C_k$

LDA Formulation – Two Classes (5)

• In order to find the optimum projection $w^*$, we need to express $J(w)$ as an explicit function of $w$.
• We define a measure of the scatter in the original multivariate feature space $x$ through the scatter matrices:
$$S_k = \sum_{i \in C_k}(x_i - \mu_k)(x_i - \mu_k)^T, \qquad S_W = S_1 + S_2$$
• $S_W$ is called the within-class scatter matrix

LDA Formulation – Two Classes (6–8)

• The scatter of the transformed data $y$ can be expressed as a function of the scatter matrices of the original data $x$:
$$\tilde{s}_k^2 = \sum_{i \in C_k}(y_i - \tilde{\mu}_k)^2 = \sum_{i \in C_k}\left(w^T x_i - w^T \mu_k\right)^2 = w^T S_k w$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T S_W w$$
• The difference between the projected means can likewise be expressed in terms of the means in the original feature space:
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(w^T \mu_1 - w^T \mu_2\right)^2 = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$$
• $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter matrix

LDA Formulation – Two Classes (9–10)

• Express the Fisher criterion in terms of $S_W$ and $S_B$:
$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}$$
• To find the maximum of $J(w)$, differentiate with respect to $w$ and equate to 0:
$$S_B w - J(w)\, S_W w = 0$$
• Writing $\lambda = J(w)$, this becomes $S_W^{-1} S_B w = \lambda w$
• $w$ is in the direction of $S_W^{-1}(\mu_1 - \mu_2)$: Fisher's Linear Discriminant
• Using the same notation as PCA, $w$ will be the eigenvector corresponding to the maximum eigenvalue of $S_W^{-1} S_B$
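
A minimal sketch of Fisher's linear discriminant for two classes on synthetic Gaussian data (the data and variable names are illustrative); it also checks that $S_W^{-1}(\mu_1 - \mu_2)$ matches the top eigenvector of $S_W^{-1} S_B$:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two synthetic Gaussian classes, purely for illustration.
X1 = rng.normal(loc=[0.0, 0.0], scale=[2.0, 0.5], size=(100, 2))
X2 = rng.normal(loc=[3.0, 2.0], scale=[2.0, 0.5], size=(120, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
S_W = S1 + S2
S_B = np.outer(mu1 - mu2, mu1 - mu2)

# Fisher's linear discriminant: w is in the direction of S_W^{-1} (mu1 - mu2).
w = np.linalg.solve(S_W, mu1 - mu2)
w /= np.linalg.norm(w)

# The same direction is the top eigenvector of S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(np.abs(w @ w_eig))   # ~1.0: the two solutions agree up to sign

# Fisher criterion J(w) for the chosen direction.
J = (w @ S_B @ w) / (w @ S_W @ w)
print(J)
```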

LDA – 2-Class Example (1)

Solving for $\lambda$ and $w$ in $S_W^{-1} S_B w = \lambda w$ gives:

Eigenvalue    Eigenvector
10.8153       (0.9422, 0.3350)'
0             (-0.4603, 0.8878)'

Or, directly using $w^* \propto S_W^{-1}(\mu_1 - \mu_2)$, which will result in
$$w^* = \begin{pmatrix} 0.9422 \\ 0.3350 \end{pmatrix}$$

LDA – 2-Class Example (2)

LDA – 2-Class Example (3)

LDA – Multiple Classes (1)

• We have $C > 2$ classes.
• Seek $(C - 1)$ projections $[y_1, y_2, \dots, y_{C-1}]$ by means of $(C - 1)$ projection vectors $w_k$.
• The $w_k$ can be arranged by columns into a projection matrix $W = [w_1 | w_2 | \dots | w_{C-1}]$ such that
$$y_k = w_k^T x \;\Rightarrow\; y = W^T x$$

LDA – Multiple Classes (2)

• Recall that in the two-class case, the within-class scatter was computed as:
$$S_W = S_1 + S_2$$
• This can be generalized to the $C$-class scenario as:
$$S_W = \sum_{k=1}^{C} S_k, \qquad S_k = \sum_{i \in C_k}(x_i - \mu_k)(x_i - \mu_k)^T$$
where $\mu_k$ is the mean of class $C_k$.

LDA – Multiple Classes (3)

• Similarly, in the two-class case the between-class scatter was computed as $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$.
• For the $C$-class case, the between-class scatter is computed with respect to the mean of all classes:
$$S_B = \sum_{k=1}^{C} N_k (\mu_k - \mu)(\mu_k - \mu)^T$$
where $\mu$ is the mean of all the data points, $\mu_k$ is the mean of the points belonging to class $C_k$, and $N_k$ is the number of points in $C_k$.

LDA – Multiple Classes (4)

• Recall that in the two-class case, we expressed the scatter matrices of the transformed samples in terms of those of the original samples as:
$$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W$$
• Seek a projection that maximizes the ratio of between-class to within-class scatter.
• Since the projection is no longer a scalar (it has $C - 1$ dimensions), the determinants of the scatter matrices are used to obtain a scalar objective function:
$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$

LDA – Multiple Classes (5)

• To find the maximum of $J(W)$, differentiate it with respect to $W$ and equate to 0.
• For the $C$-class case we have $C - 1$ projection vectors, hence the eigenvalue problem generalizes to:
$$S_W^{-1} S_B w_k = \lambda_k w_k$$
• It can thus be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem:
$$S_W^{-1} S_B w_k^* = \lambda_k w_k^*$$
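
A sketch of multi-class LDA under these definitions, solving the generalized eigenproblem $S_B w = \lambda S_W w$ with `scipy.linalg.eigh`; the helper name `lda_fit` and the synthetic three-class data are my own:

```python
import numpy as np
from scipy.linalg import eigh   # solves the generalized symmetric eigenproblem

def lda_fit(X, y, n_components=None):
    """Return the LDA projection matrix W (columns = discriminant directions)."""
    classes = np.unique(y)
    C, D = len(classes), X.shape[1]
    mu = X.mean(axis=0)

    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                  # within-class scatter
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)     # between-class scatter

    # Generalized eigenproblem S_B w = lambda S_W w; keep the largest eigenvalues.
    eigvals, eigvecs = eigh(S_B, S_W)
    order = np.argsort(eigvals)[::-1]
    K = n_components or (C - 1)           # at most C-1 useful directions
    return eigvecs[:, order[:K]]

# Toy usage with three synthetic classes in 4 dimensions.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=m, size=(50, 4))
               for m in ([0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1])])
y = np.repeat([0, 1, 2], 50)
W = lda_fit(X, y)
Z = X @ W                                 # projected data, shape (150, C-1) = (150, 2)
print(Z.shape)
```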

LDA – Multiple Class – Example (1)

LDA – Multiple Class – Example (2)

PCA vs LDA

• Coffee Bean dataset
  • Five types of coffee beans were presented to an array of chemical gas sensors
  • For each coffee type, 45 “sniffs” were performed and the response of the gas sensor array was recorded in order to obtain a 60-dimensional feature vector
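
The coffee-bean sensor data is not included here; as a stand-in, the sketch below compares the 2-D projections produced by PCA and LDA on scikit-learn's wine dataset (13 features, 3 classes), which illustrates the same qualitative point that the supervised LDA projection separates the classes more cleanly:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in for the coffee-bean sensor data: 13-dimensional, 3 classes.
X, y = load_wine(return_X_y=True)

Z_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised projection
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised projection

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in zip(axes, (Z_pca, Z_lda), ("PCA", "LDA")):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=15)
    ax.set_title(title)
plt.show()
```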

Limitations of LDA

• LDA produces at most $C - 1$ feature projections
  • If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features
• LDA is a parametric method since it assumes unimodal Gaussian likelihoods
  • If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data that may be needed for classification

• LDA will fail when the discriminatory information is not in the mean, but rather in the variance of the data

Summary

• Motivation
• Unsupervised Dimensionality Reduction
  • Principal Component Analysis
• Supervised Dimensionality Reduction
  • Linear Discriminant Analysis
