
Lecture 14

Dimensionality Reduction II: Feature Extraction

short version

STAT 479: Machine Learning, Fall 2018
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/


Dimensionality Reduction

• Feature Selection
• Feature Extraction (today)

Linear Methods:
• Principal Component Analysis (PCA)
• Independent Component Analysis (ICA)
• Autoencoders (linear act. func.)
• Singular Value Decomposition (SVD)
• Linear Discriminant Analysis (LDA) (supervised)
• ...

Nonlinear Methods:
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Uniform Manifold Approximation & Projection (UMAP)
• Kernel PCA
• Spectral Clustering
• Autoencoders (non-linear act. func.)
• ...

Goals of Dimensionality Reduction

• Reduce curse of dimensionality problems
• Increase storage and computational efficiency
• Visualize data in 2D or 3D

Principal Component Analysis (PCA)

1) Find directions of maximum variance

(Figure: 2D data points in the feature space (x1, x2) with the two directions of maximum variance, PC1 and PC2, drawn as axes.)

Principal Component Analysis (PCA)

2) Transform features onto directions of maximum variance

(Figure: the same data before and after transforming the feature axes (x1, x2) onto the principal component axes (PC1, PC2).)

Principal Component Analysis (PCA)

3) Usually, only the subset of vectors with the most variance is kept (dimensionality reduction)

(Figure: the data projected onto PC1 only; PC2, the direction with less variance, is discarded.)

Principal Component Analysis (PCA) (in a nutshell)

Given a design matrix X ∈ ℝ^(n×m):

Find the vector α_1 with maximum variance.
Repeat: find α_(i+1) with maximum variance that is uncorrelated with α_i.
(Repeat k times, where k is the desired number of dimensions; k ≤ m.)

Principal Component Analysis (PCA)

Two approaches to solve PCA (on standardized data):

1. Constrained maximization (e.g., Lagrange multipliers)
2. Eigendecomposition of the covariance matrix directly

Principal Component Analysis (PCA) (in a nutshell)

Collect the vectors α_i in a projection matrix A ∈ ℝ^(m×k) (sorted from highest to lowest associated eigenvalue).

Compute projected data points: Z = XA
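As an illustration of the eigendecomposition route (approach 2 above), here is a minimal NumPy sketch of the procedure described on the previous slides. The variable names and the toy data are made up for this example; they are not from the lecture notebook.

```python
import numpy as np

# toy data: n samples, m features (hypothetical example data)
rng = np.random.RandomState(0)
X = rng.randn(150, 4) @ rng.randn(4, 4)

# standardize the features (PCA on standardized data)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# eigendecomposition of the covariance matrix
cov = np.cov(X_std.T)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh, since the covariance matrix is symmetric

# sort the eigenvectors by decreasing eigenvalue and keep the top k
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
A = eigvecs[:, :k]                          # projection matrix, shape (m, k)

# project the data: Z = XA
Z = X_std @ A                               # shape (n, k)
```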

Principal Component Analysis (PCA)

Usually useful to plot the explained variance (normalized eigenvalues)
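One possible way to produce such a plot, sketched with scikit-learn and matplotlib (the toy data and parameter choices are illustrative, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(150, 4) @ rng.randn(4, 4)            # hypothetical example data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA()
pca.fit(X_std)
var_exp = pca.explained_variance_ratio_            # normalized eigenvalues
cum_var_exp = np.cumsum(var_exp)

plt.bar(range(1, len(var_exp) + 1), var_exp, label='individual explained variance')
plt.step(range(1, len(var_exp) + 1), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.xlabel('Principal component index')
plt.ylabel('Explained variance ratio')
plt.legend(loc='best')
plt.show()
```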

Principal Component Analysis (PCA)

Keep in mind that PCA is unsupervised!

PCA Factor Loadings

• The loadings are the unstandardized values of the eigenvectors
• We can interpret the loadings as the covariances (or correlations, in case we standardized the input features) between the input features and the principal components (or eigenvectors), which have been scaled to unit length
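A small NumPy sketch of how the loadings can be computed and sanity-checked on standardized data; the scaling by the square roots of the eigenvalues and the comparison to the feature/PC correlations are an illustration, not code from the slides:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ rng.randn(3, 3)                 # hypothetical example data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# loadings: unit-length eigenvectors scaled by the square roots of their eigenvalues
loadings = eigvecs * np.sqrt(eigvals)

# for standardized inputs, the loadings match the correlations between
# the original features and the principal component scores
Z = X_std @ eigvecs
corr = np.array([[np.corrcoef(X_std[:, i], Z[:, j])[0, 1]
                  for j in range(Z.shape[1])]
                 for i in range(X_std.shape[1])])
print(np.allclose(loadings, corr, atol=1e-2))           # True (up to sampling noise)
```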

Mirrored Results in PCA


• Not due to an error; the reason for this difference is that, depending on the eigensolver, eigenvectors can have either negative or positive signs.
For instance, if v is an eigenvector of a matrix Σ with Σv = λv, where λ is the eigenvalue, then −v is also an eigenvector with the same eigenvalue, since Σ(−v) = −Σv = −λv = λ(−v).
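A quick NumPy check of this sign argument (the matrix below is a made-up symmetric example):

```python
import numpy as np

rng = np.random.RandomState(0)
B = rng.randn(3, 3)
Sigma = B @ B.T                                  # a symmetric, covariance-like matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(Sigma @ v, lam * v))           # True:  Σ v    = λ v
print(np.allclose(Sigma @ (-v), lam * (-v)))     # True:  Σ(−v) = λ(−v)
```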

t-Distributed Stochastic Neighbor Embedding (t-SNE)

(t-SNE is only meant for visualization, not for preparing datasets!)

Note that MNIST has 28 x 28 = 784 dimensions.

(a) Visualization by t-SNE. Shown are 6000 images from MNIST projected in 2D

Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.


(b) Visualization by Sammon mapping. Figure 2: Visualizations of 6,000 handwritten digits from the MNIST data set.

Stochastic Neighbor Embedding (SNE)

Given high-dimensional datapoints x_1, ..., x_n ∈ ℝ^m, represent the intrinsic structure of the data in 1D, 2D, or 3D (for visualization).

How?

1) Model neighboring datapoint pairs based on the distance between those points in the high-dimensional space
2) Find a probability distribution over the pairwise distances in the low-dimensional space that is as close as possible to the original probability distribution

Main idea: map points that are near each other on a manifold to nearby positions in the low-dimensional space.

Stochastic Neighbor Embedding (SNE)

Based on the probability of selecting neighboring points:

p_ij = (p_i|j + p_j|i) / (2n)

(this modification makes the pairwise similarities symmetric, p_ij = p_ji; it is used in symmetric SNE and t-SNE [later slides])

where the conditional probability p_j|i is the probability that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i:

p_j|i = exp(−||x_i − x_j||² / (2σ_i²)) / ∑_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))

The neighborhood size is controlled by σ_i (in turn controlled by the perplexity parameter).

Denominator makes sure that similarity is independent of the point's density

For reference, note that the normal distribution is defined as

p(x | μ, σ) = 1 / √(2πσ²) · exp(−(x − μ)² / (2σ²))
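A minimal NumPy sketch of the conditional probabilities p_j|i defined above. For brevity, a single shared sigma is assumed; a real implementation would instead run a binary search for each σ_i to match a user-specified perplexity (see the excerpt a few slides ahead).

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """p[i, j] = probability that point i picks point j as its neighbor."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))      # Gaussian similarities
    np.fill_diagonal(p, 0.0)                        # a point is not its own neighbor
    p /= p.sum(axis=1, keepdims=True)               # denominator: sum over k != i
    return p

X = np.random.RandomState(0).randn(5, 10)           # 5 toy points in 10 dimensions
P_cond = conditional_probabilities(X, sigma=1.0)
P_sym = (P_cond + P_cond.T) / (2 * X.shape[0])      # symmetrized p_ij from the slide above
```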

t-Distributed Stochastic Neighbor Embedding (t-SNE)

In t-SNE, the low-dimensional similarities are modeled with a Student's t-distribution (to prevent the crowding problem):

q_ij = (1 + ||z_i − z_j||²)^(−1) / ∑_{k≠i} (1 + ||z_i − z_k||²)^(−1)

where zi and zj are the points in the low-dimensional space

• The t-distribution has "fatter" tails (more scale-invariant to points far away)
• The t-distribution avoids the crowding problem
• (Minor point: the density is faster to compute; no exponential)
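A matching sketch for the low-dimensional similarities, following the formula on this slide (which normalizes over k ≠ i; note that the original t-SNE paper normalizes over all pairs instead):

```python
import numpy as np

def student_t_similarities(Z):
    """q[i, j] proportional to (1 + ||z_i - z_j||^2)^(-1)."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq_dists)               # Student-t kernel with one degree of freedom
    np.fill_diagonal(q, 0.0)
    return q / q.sum(axis=1, keepdims=True)  # normalization over k != i, as on the slide

Z = np.random.RandomState(1).randn(5, 2)     # 5 toy points in a 2D embedding
Q = student_t_similarities(Z)
```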

t-Distribution

With 1 degree of freedom, the t-distribution is the same as the Cauchy distribution.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Idea: map points that are near each other on a manifold to nearby positions in the low-dimensional space.

1. Measure the Euclidean distances in the high-dimensional space and convert them to the probability of picking a point as a neighbor (similarity is proportional to probability); use a Gaussian distribution for the density around each point
2. Do the same as in 1. in the low-dimensional space, but with a t-distribution (which has heavier tails)
3. Minimize the difference between the conditional probabilities (KL divergence)

Kullback-Leibler Divergence

Measures difference between 2 distributions; asymmetric

D_KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx

= ∫_{−∞}^{∞} p(x) log p(x) dx − ∫_{−∞}^{∞} p(x) log q(x) dx

The first term is the negative entropy of P; the second term, including the minus sign, is the cross-entropy H(P, Q). So D_KL(P||Q) = H(P, Q) − H(P).

Remember Entropy from the Decision Tree lecture for discrete distributions?

Shannon entropy (the average amount of information produced by a stochastic source of data), for feature x_j and class label i:

H(i; x_j) = −∑_{i=1}^{n} p(i | x_j) log2 p(i | x_j)
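For discrete distributions, the KL divergence and its decomposition into cross-entropy and entropy can be checked in a few lines of NumPy (the two distributions below are made up for illustration):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])     # hypothetical distribution P
q = np.array([0.2, 0.2, 0.6])     # hypothetical distribution Q

kl_pq = np.sum(p * np.log(p / q))                    # D_KL(P || Q)
cross_entropy = -np.sum(p * np.log(q))               # H(P, Q)
entropy = -np.sum(p * np.log(p))                     # H(P)

print(np.isclose(kl_pq, cross_entropy - entropy))    # True
print(np.isclose(kl_pq, np.sum(q * np.log(q / p))))  # False here: KL is asymmetric
```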

t-Distributed Stochastic Neighbor Embedding (t-SNE)

conditional similarity between points in original space:

p_j|i = exp(−||x_i − x_j||² / (2σ_i²)) / ∑_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))

(replace p_j|i with p_ij in symmetric SNE and t-SNE)

Excerpt from van der Maaten & Hinton (2008): There is a large cost for using a small q_j|i to model a large p_j|i, but only a small cost for using nearby map points to represent widely separated datapoints; this small cost comes from wasting some of the probability mass in the relevant Q distributions. In other words, the SNE cost function focuses on retaining the local structure of the data (for reasonable values of the variance of the Gaussian in the high-dimensional space, σ_i).

The remaining parameter to be selected is the variance σ_i of the Gaussian that is centered over each high-dimensional datapoint x_i. It is not likely that there is a single value of σ_i that is optimal for all datapoints, because the density of the data is likely to vary; in dense regions, a smaller value of σ_i is usually more appropriate than in sparser regions. Any particular value of σ_i induces a probability distribution P_i over all of the other datapoints, and the entropy of this distribution increases as σ_i increases. SNE performs a binary search for the value of σ_i that produces a P_i with a fixed perplexity that is specified by the user. The perplexity is defined as

Perp(P_i) = 2^(H(P_i)),

where H(P_i) is the Shannon entropy of P_i measured in bits:

H(P_i) = −∑_j p_j|i log2 p_j|i

The perplexity can be interpreted as a smooth measure of the effective number of neighbors; the performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50. (Note that the perplexity increases monotonically with the variance σ_i.)

The SNE cost function is minimized by gradient descent. The gradient has a surprisingly simple form,

∂C/∂y_i = 2 ∑_j (p_j|i − q_j|i + p_i|j − q_i|j)(y_i − y_j),

and can be interpreted as the resultant force created by a set of springs between the map point y_i and all other map points y_j: each spring repels or attracts the pair depending on whether their distance in the map is too small or too large to represent the similarity of the corresponding high-dimensional datapoints, and its stiffness is the mismatch (p_j|i − q_j|i + p_i|j − q_i|j). The gradient descent is initialized by sampling map points from an isotropic Gaussian with small variance centered at the origin; to speed up the optimization and to avoid poor local minima, a momentum term is added (the current gradient is combined with an exponentially decaying sum of previous gradients):

Y^(t) = Y^(t−1) + η ∂C/∂Y + α(t) (Y^(t−1) − Y^(t−2))

Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Gradient Descent Optimization

Cost function C:

C = ∑_i KL(P_i || Q_i) = ∑_i ∑_j p_j|i log( p_j|i / q_j|i )

Regular SNE gradient w.r.t. z:

∂C/∂z_i = 2 ∑_j (p_j|i − q_j|i + p_i|j − q_i|j)(z_i − z_j)

replace pj|i with pij in symmetric SNE
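Putting the formulas from the last few slides together, a rough toy sketch of regular SNE with plain gradient descent (fixed sigma, no perplexity search, no momentum; purely for illustration):

```python
import numpy as np

def gaussian_similarities(X, sigma):
    """Row-normalized Gaussian similarities (the p_j|i / q_j|i from the slides)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    s = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(s, 0.0)
    return s / s.sum(axis=1, keepdims=True)

def sne_gradient(P, Q, Z):
    """dC/dz_i = 2 * sum_j (p_j|i - q_j|i + p_i|j - q_i|j) * (z_i - z_j)."""
    grad = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        coeff = (P[i, :] - Q[i, :]) + (P[:, i] - Q[:, i])
        grad[i] = 2.0 * np.sum(coeff[:, None] * (Z[i] - Z), axis=0)
    return grad

rng = np.random.RandomState(0)
X = rng.randn(20, 5)                                  # toy high-dimensional data
P = gaussian_similarities(X, sigma=1.0)               # fixed sigma instead of a perplexity search

Z = 1e-4 * rng.randn(20, 2)                           # small random init of the map points
for step in range(200):
    Q = gaussian_similarities(Z, sigma=np.sqrt(0.5))  # low-dim Gaussian with variance 1/2
    Z -= 10.0 * sne_gradient(P, Q, Z)                 # plain gradient descent, no momentum
```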

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Crowding problem

(Figure: four points, labeled 1-4, in the 2D feature space (x1, x2) with pairwise distances d and 2d, next to an attempted arrangement of the same points on a 1D axis z, where some distances become 2d or 3d and are marked "bad!")

Suppose you want to maintain the neighbor relationships of the 2D space in 1D.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Crowding problem

(Figure: further attempted 1D arrangements of the four points along z; each placement distorts at least one of the pairwise distances and is marked "bad!")

t-Distributed Stochastic Neighbor Embedding (t-SNE)

(Figure: more attempted 1D arrangements, including one where two points coincide, d = 0; all are marked "bad!")

This is a case where an exact distance representation in the low dimension is impossible :(

t-Distributed Stochastic Neighbor Embedding (t-SNE)

What would regular SNE do?

(Figure: the same four 2D points mapped to 1D by regular SNE)

Crowding problem! It squashes all the points together.

t-Distributed Stochastic Neighbor Embedding (t-SNE)


Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.


(a) Gradient of SNE. (b) Gradient of UNI-SNE. (c) Gradient of t-SNE.

Figure 1: Gradients of three types of SNE as a function of the pairwise Euclidean distance between two points in the high-dimensional and the pairwise distance between the points in the low-dimensional data representation.

Negative gradient if points are too close in the low-dimensional space, to provide some repulsion against crowding.

t-SNE gradient w.r.t. z:

∂C/∂z_i = 4 ∑_j (p_ij − q_ij)(z_i − z_j)(1 + ||z_i − z_j||²)^(−1), where p_ij = (p_i|j + p_j|i) / (2n)

Excerpt from van der Maaten & Hinton (2008): An advantage of the Student t-distribution is that it is closely related to the Gaussian, as it is an infinite mixture of Gaussians; a computationally convenient property is that evaluating the density of a point under a Student t-distribution is much faster than under a Gaussian because it does not involve an exponential. In Figure 1(a) to 1(c), positive gradient values represent an attraction between the low-dimensional datapoints y_i and y_j, whereas negative values represent a repulsion. Two main advantages of the t-SNE gradient over the gradients of SNE and UNI-SNE: first, t-SNE strongly repels dissimilar datapoints that are modeled by a small pairwise distance in the low-dimensional representation (SNE has such a repulsion as well, but its effect is minimal compared to the strong attractions elsewhere in the gradient, and in UNI-SNE the repulsion is only strong when the low-dimensional distance is already large). Second, although t-SNE introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances, these repulsions do not go to infinity, unlike in UNI-SNE.
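The t-SNE gradient above translates to NumPy along the same lines as the SNE toy sketch earlier. Again a sketch only; it assumes the symmetrized p_ij and the Student-t q_ij (normalized over all pairs, as in the paper) have already been computed:

```python
import numpy as np

def student_t_joint_q(Z):
    """Symmetric q_ij with the Student-t kernel, normalized over all pairs."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(q, 0.0)
    return q / q.sum()

def tsne_gradient(P, Q, Z):
    """dC/dz_i = 4 * sum_j (p_ij - q_ij) * (z_i - z_j) * (1 + ||z_i - z_j||^2)^(-1)."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    inv_dist = 1.0 / (1.0 + sq_dists)                 # the Student-t kernel term
    grad = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        coeff = (P[i, :] - Q[i, :]) * inv_dist[i, :]
        grad[i] = 4.0 * np.sum(coeff[:, None] * (Z[i] - Z), axis=0)
    return grad

# toy usage with a made-up uniform P (in practice P comes from the high-dimensional data)
Z = np.random.RandomState(0).randn(10, 2)
P = np.full((10, 10), 1.0 / 90); np.fill_diagonal(P, 0.0)
Z_new = Z - 100.0 * tsne_gradient(P, student_t_joint_q(Z), Z)   # one gradient-descent step
```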

t-Distributed Stochastic Neighbor Embedding (t-SNE)

• Great for visualizing datasets in 2D
• Need to analyze multiple perplexity values (a tuning parameter related to the standard deviation of the Gaussian, balancing local and global structure)
• Not deterministic; the cost function of t-SNE is not convex
• More hyperparameters (e.g., the learning rate)
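In practice, one would typically use an existing implementation rather than the toy sketches above, for example scikit-learn's TSNE; the data and parameter values below are only illustrative settings to vary, not recommendations from the slides:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(123)
X = rng.randn(500, 50)                      # hypothetical high-dimensional dataset

# try several perplexity values, since the embedding can change substantially
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2,
                perplexity=perplexity,
                learning_rate=200.0,
                init='pca',
                random_state=123)           # fix the seed; t-SNE is not deterministic
    Z = tsne.fit_transform(X)
    print(perplexity, Z.shape)              # (500, 2) for each setting
```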

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Source: https://distill.pub/2016/misread-tsne/

f-Divergences

In probability theory, an f-divergence is a function D_f(P||Q) for measuring the difference between two probability distributions P and Q.

Csiszár, I. (1963). "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoffschen Ketten". Magyar. Tud. Akad. Mat. Kutato Int. Kozl. 8: 85–108.

Morimoto, T. (1963). "Markov processes and the H-theorem". J. Phys. Soc. Jpn. 18 (3): 328–331

t-SNE embeddings based on different f-divergences

Table 1 from the paper below lists commonly used f-divergences (along with their generating function) and their corresponding t-SNE objective (referred to as ft-SNE), together with the kind of distance relationship that gets emphasized by each choice of f-divergence:

• Kullback-Leibler (KL): f(t) = t log t; objective ∑_ij p_ij log(p_ij / q_ij); emphasis: Local
• Chi-square (χ² or CH): f(t) = (t − 1)²; objective ∑_ij (p_ij − q_ij)² / q_ij; emphasis: Local
• Reverse-KL (RKL): f(t) = −log t; objective ∑_ij q_ij log(q_ij / p_ij); emphasis: Global
• Jensen-Shannon (JS): f(t) = −(t + 1) log((t + 1)/2) + t log t; objective ½ (KL(p_ij || (p_ij + q_ij)/2) + KL(q_ij || (p_ij + q_ij)/2)); emphasis: Both
• Hellinger distance (HL): f(t) = (√t − 1)²; objective ∑_ij (√p_ij − √q_ij)²; emphasis: Both
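To make the table concrete, a small NumPy sketch that evaluates D_f(P||Q) = ∑_x q(x) f(p(x)/q(x)) for a few of the generators above on made-up discrete distributions (for intuition only; this is not the ft-SNE code from the paper):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions."""
    return np.sum(q * f(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.2, 0.6])

generators = {
    'KL':         lambda t: t * np.log(t),
    'Chi-square': lambda t: (t - 1) ** 2,
    'Reverse-KL': lambda t: -np.log(t),
    'Hellinger':  lambda t: (np.sqrt(t) - 1) ** 2,
}

for name, f in generators.items():
    print(name, f_divergence(p, q, f))

# sanity check: the KL generator recovers sum_x p(x) * log(p(x) / q(x))
print(np.isclose(f_divergence(p, q, generators['KL']), np.sum(p * np.log(p / q))))  # True
```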

Im, D. J., Verma, N., & Branson, K. (2018). Stochastic Neighbor Embedding under f-divergences. arXiv preprint arXiv:1811.01247.

The authors show that emphasizing precision- or recall-like structure can be achieved by minimizing f-divergences other than the KL divergence, and suggest that data scientists create and explore low-dimensional visualizations of their data corresponding to several different f-divergences, each of which is geared toward different types of structure.


Uniform Manifold Approximation and Projection (UMAP)

(Figure: side-by-side UMAP and t-SNE embeddings; see the Figure 1 caption below)

McInnes, L., & Healy, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Compared to t-SNE, UMAP seems to be

• faster
• deterministic
• better at preserving clusters
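For reference, a typical way to compute a UMAP embedding with the umap-learn package (assuming it is installed, e.g., via `pip install umap-learn`; the data and parameter values are illustrative):

```python
import numpy as np
import umap                            # from the umap-learn package

rng = np.random.RandomState(123)
X = rng.randn(500, 50)                 # hypothetical high-dimensional dataset

reducer = umap.UMAP(n_components=2,
                    n_neighbors=15,    # roughly analogous in spirit to t-SNE's perplexity
                    min_dist=0.1,
                    random_state=123)
Z = reducer.fit_transform(X)           # shape (500, 2)
```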


Figure 1: Comparison of UMAP and t-SNE embeddings for a number of real world datasets. More of the loops in the COIL20 dataset are kept intact by UMAP, including the intertwined loops. Similarly, the global relationships among different digits in the MNIST digits dataset are more clearly captured, with 1 (red) and 0 (dark red) at far corners of the embedding space, and 4, 7, 9 (yellow, sea-green, and violet) and 3, 5, 8 (orange, chartreuse, and blue) separated as distinct clumps of similar digits. In the Fashion MNIST dataset the distinction between clothing (dark red, yellow, orange, vermilion) and footwear (chartreuse, sea-green, and violet) is made more clear.

Reading Assignments

• Python Machine Learning, 2nd Edition. Chapter 5: Compressing Data via Dimensionality Reduction

• Scikit-learn doc 2.2. Manifold learning: https://scikit-learn.org/stable/modules/manifold.html

Code Examples

https://github.com/rasbt/stat479-machine-learning-fs18/blob/master/14_feat-extract/14_feat-extract_code.ipynb
