
Lecture 14

Dimensionality Reduction II: Feature Extraction

short version

STAT 479: Machine Learning, Fall 2018
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/


Dimensionality Reduction

• Feature Selection
• Feature Extraction (today)

Linear Methods:
• Principal Component Analysis (PCA)
• Independent Component Analysis (ICA)
• Autoencoders (linear act. func.)
• Singular Value Decomposition (SVD)
• Linear Discriminant Analysis (LDA) (supervised)
• ...

Nonlinear Methods:
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Uniform Manifold Approximation & Projection (UMAP)
• Kernel PCA
• Spectral Clustering
• Autoencoders (non-linear act. func.)
• ...

Goals of Dimensionality Reduction

• Reduce curse of dimensionality problems
• Increase storage and computational efficiency
• Visualize data in 2D or 3D

Principal Component Analysis (PCA)

1) Find directions of maximum variance

(Figure: 2D data points in the feature space (x1, x2) with the two directions of maximum variance, PC1 and PC2, drawn as axes.)

Principal Component Analysis (PCA)

2) Transform features onto directions of maximum variance

(Figure: the same data before and after transforming the feature axes (x1, x2) onto the principal component axes (PC1, PC2).)

Principal Component Analysis (PCA)

3) Usually, only the subset of vectors with the most variance is kept (dimensionality reduction)

(Figure: the data projected onto PC1 only; PC2, the direction with less variance, is discarded.)

Principal Component Analysis (PCA) (in a nutshell)

Given a design matrix X ∈ ℝ^(n×m):

Find the vector α_1 with maximum variance.
Repeat: find α_(i+1) with maximum variance that is uncorrelated with α_i.
(Repeat k times, where k is the desired number of dimensions; k ≤ m.)

Principal Component Analysis (PCA)

Two approaches to solve PCA (on standardized data):

1. Constrained maximization (e.g., Lagrange multipliers)
2. Eigendecomposition of the covariance matrix directly

Principal Component Analysis (PCA) (in a nutshell)

Collect the vectors α_i in a projection matrix A ∈ ℝ^(m×k) (sorted from highest to lowest associated eigenvalue).

Compute projected data points: Z = XA
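As an illustration of the eigendecomposition route (approach 2 above), here is a minimal NumPy sketch of the procedure described on the previous slides. The variable names and the toy data are made up for this example; they are not from the lecture notebook.

```python
import numpy as np

# toy data: n samples, m features (hypothetical example data)
rng = np.random.RandomState(0)
X = rng.randn(150, 4) @ rng.randn(4, 4)

# standardize the features (PCA on standardized data)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# eigendecomposition of the covariance matrix
cov = np.cov(X_std.T)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh, since the covariance matrix is symmetric

# sort the eigenvectors by decreasing eigenvalue and keep the top k
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
A = eigvecs[:, :k]                          # projection matrix, shape (m, k)

# project the data: Z = XA
Z = X_std @ A                               # shape (n, k)
```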

Principal Component Analysis (PCA)

Usually useful to plot the explained variance (normalized eigenvalues)
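One possible way to produce such a plot, sketched with scikit-learn and matplotlib (the toy data and parameter choices are illustrative, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(150, 4) @ rng.randn(4, 4)            # hypothetical example data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA()
pca.fit(X_std)
var_exp = pca.explained_variance_ratio_            # normalized eigenvalues
cum_var_exp = np.cumsum(var_exp)

plt.bar(range(1, len(var_exp) + 1), var_exp, label='individual explained variance')
plt.step(range(1, len(var_exp) + 1), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.xlabel('Principal component index')
plt.ylabel('Explained variance ratio')
plt.legend(loc='best')
plt.show()
```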

Principal Component Analysis (PCA)

Keep in mind that PCA is unsupervised!

PCA Factor Loadings

• The loadings are the unstandardized values of the eigenvectors
• We can interpret the loadings as the covariances (or correlations, in case we standardized the input features) between the input features and the principal components (or eigenvectors), which have been scaled to unit length
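A small NumPy sketch of how the loadings can be computed and sanity-checked on standardized data; the scaling by the square roots of the eigenvalues and the comparison to the feature/PC correlations are an illustration, not code from the slides:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ rng.randn(3, 3)                 # hypothetical example data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# loadings: unit-length eigenvectors scaled by the square roots of their eigenvalues
loadings = eigvecs * np.sqrt(eigvals)

# for standardized inputs, the loadings match the correlations between
# the original features and the principal component scores
Z = X_std @ eigvecs
corr = np.array([[np.corrcoef(X_std[:, i], Z[:, j])[0, 1]
                  for j in range(Z.shape[1])]
                 for i in range(X_std.shape[1])])
print(np.allclose(loadings, corr, atol=1e-2))           # True (up to sampling noise)
```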

Mirrored Results in PCA


• Not due to an error; the reason for this difference is that, depending on the eigensolver, eigenvectors can have either negative or positive signs.
For instance, if v is an eigenvector of a matrix Σ with Σv = λv, where λ is the eigenvalue, then −v is also an eigenvector with the same eigenvalue, since Σ(−v) = −Σv = −λv = λ(−v).
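A quick NumPy check of this sign argument (the matrix below is a made-up symmetric example):

```python
import numpy as np

rng = np.random.RandomState(0)
B = rng.randn(3, 3)
Sigma = B @ B.T                                  # a symmetric, covariance-like matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(Sigma @ v, lam * v))           # True:  Σ v    = λ v
print(np.allclose(Sigma @ (-v), lam * (-v)))     # True:  Σ(−v) = λ(−v)
```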

t-Distributed Stochastic Neighbor Embedding (t-SNE)

(t-SNE is only meant for visualization, not for preparing datasets!)

Note that MNIST has 28 x 28 = 784 dimensions.

(a) Visualization by t-SNE. Shown are 6000 images from MNIST projected in 2D

Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.


(b) Visualization by Sammon mapping. Figure 2: Visualizations of 6,000 handwritten digits from the MNIST data set.

Stochastic Neighbor Embedding (SNE)

Given high-dimensional datapoints x_1, ..., x_n ∈ ℝ^m, represent the intrinsic structure of the data in 1D, 2D, or 3D (for visualization).

How?

1) Model neighboring datapoint pairs based on the distance between those points in the high-dimensional space
2) Find a probability distribution over the pairwise distances in the low-dimensional space that is as close as possible to the original probability distribution

Main idea: map points that are near each other on a manifold to nearby positions in the low-dimensional space.

Stochastic Neighbor Embedding (SNE)

Based on the probability of selecting neighboring points:

p_ij = (p_i|j + p_j|i) / (2n)

(this modification makes the pairwise similarities symmetric, p_ij = p_ji; it is used in symmetric SNE and t-SNE [later slides])

where the conditional probability p_j|i is the probability that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i:

p_j|i = exp(−||x_i − x_j||² / (2σ_i²)) / ∑_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))

The neighborhood size is controlled by σ_i (in turn controlled by the perplexity parameter).

Denominator makes sure that similarity is independent of the point's density

For reference, note that the normal distribution is defined as

p(x | μ, σ) = 1 / √(2πσ²) · exp(−(x − μ)² / (2σ²))
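A minimal NumPy sketch of the conditional probabilities p_j|i defined above. For brevity, a single shared sigma is assumed; a real implementation would instead run a binary search for each σ_i to match a user-specified perplexity (see the excerpt a few slides ahead).

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """p[i, j] = probability that point i picks point j as its neighbor."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))      # Gaussian similarities
    np.fill_diagonal(p, 0.0)                        # a point is not its own neighbor
    p /= p.sum(axis=1, keepdims=True)               # denominator: sum over k != i
    return p

X = np.random.RandomState(0).randn(5, 10)           # 5 toy points in 10 dimensions
P_cond = conditional_probabilities(X, sigma=1.0)
P_sym = (P_cond + P_cond.T) / (2 * X.shape[0])      # symmetrized p_ij from the slide above
```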

t-Distributed Stochastic Neighbor Embedding (t-SNE)

In t-SNE, the low-dimensional similarities are modeled with a Student's t-distribution (to prevent the crowding problem):

q_ij = (1 + ||z_i − z_j||²)^(−1) / ∑_{k≠i} (1 + ||z_i − z_k||²)^(−1)

where zi and zj are the points in the low-dimensional space

• The t-distribution has "fatter" tails (more scale-invariant to points far away)
• The t-distribution avoids the crowding problem
• (Minor point: the density is faster to compute; no exponential)
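A matching sketch for the low-dimensional similarities, following the formula on this slide (which normalizes over k ≠ i; note that the original t-SNE paper normalizes over all pairs instead):

```python
import numpy as np

def student_t_similarities(Z):
    """q[i, j] proportional to (1 + ||z_i - z_j||^2)^(-1)."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq_dists)               # Student-t kernel with one degree of freedom
    np.fill_diagonal(q, 0.0)
    return q / q.sum(axis=1, keepdims=True)  # normalization over k != i, as on the slide

Z = np.random.RandomState(1).randn(5, 2)     # 5 toy points in a 2D embedding
Q = student_t_similarities(Z)
```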

t-Distribution

With 1 degree of freedom, the t-distribution is the same as the Cauchy distribution.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Idea: map points that are near each other on a manifold to nearby positions in the low-dimensional space.

1. Measure the Euclidean distances in the high-dimensional space and convert them to the probability of picking a point as a neighbor (similarity is proportional to probability); use a Gaussian distribution for the density around each point
2. Do the same as in 1. in the low-dimensional space, but with a t-distribution (which has heavier tails)
3. Minimize the difference between the conditional probabilities (KL divergence)

Kullback-Leibler Divergence

Measures difference between 2 distributions; asymmetric

D_KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx

= ∫_{−∞}^{∞} p(x) log p(x) dx − ∫_{−∞}^{∞} p(x) log q(x) dx

The first term is the negative entropy of P; the second term, including the minus sign, is the cross-entropy H(P, Q). So D_KL(P||Q) = H(P, Q) − H(P).

Remember Entropy from the Decision Tree lecture for discrete distributions?

Shannon entropy (the average amount of information produced by a stochastic source of data), for feature x_j and class label i:

H(i; x_j) = −∑_{i=1}^{n} p(i | x_j) log2 p(i | x_j)
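For discrete distributions, the KL divergence and its decomposition into cross-entropy and entropy can be checked in a few lines of NumPy (the two distributions below are made up for illustration):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])     # hypothetical distribution P
q = np.array([0.2, 0.2, 0.6])     # hypothetical distribution Q

kl_pq = np.sum(p * np.log(p / q))                    # D_KL(P || Q)
cross_entropy = -np.sum(p * np.log(q))               # H(P, Q)
entropy = -np.sum(p * np.log(p))                     # H(P)

print(np.isclose(kl_pq, cross_entropy - entropy))    # True
print(np.isclose(kl_pq, np.sum(q * np.log(q / p))))  # False here: KL is asymmetric
```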

t-Distributed Stochastic Neighbor Embedding (t-SNE)

conditional similarity between points in original space:

p_j|i = exp(−||x_i − x_j||² / (2σ_i²)) / ∑_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))

(replace p_j|i with p_ij in symmetric SNE and t-SNE)

Excerpt from van der Maaten & Hinton (2008): There is a large cost for using a small q_j|i to model a large p_j|i, but only a small cost for using nearby map points to represent widely separated datapoints; this small cost comes from wasting some of the probability mass in the relevant Q distributions. In other words, the SNE cost function focuses on retaining the local structure of the data (for reasonable values of the variance of the Gaussian in the high-dimensional space, σ_i).

The remaining parameter to be selected is the variance σ_i of the Gaussian that is centered over each high-dimensional datapoint x_i. It is not likely that there is a single value of σ_i that is optimal for all datapoints, because the density of the data is likely to vary; in dense regions, a smaller value of σ_i is usually more appropriate than in sparser regions. Any particular value of σ_i induces a probability distribution P_i over all of the other datapoints, and the entropy of this distribution increases as σ_i increases. SNE performs a binary search for the value of σ_i that produces a P_i with a fixed perplexity that is specified by the user. The perplexity is defined as

Perp(P_i) = 2^(H(P_i)),

where H(P_i) is the Shannon entropy of P_i measured in bits:

H(P_i) = −∑_j p_j|i log2 p_j|i

The perplexity can be interpreted as a smooth measure of the effective number of neighbors; the performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50. (Note that the perplexity increases monotonically with the variance σ_i.)

The SNE cost function is minimized by gradient descent. The gradient has a surprisingly simple form,

∂C/∂y_i = 2 ∑_j (p_j|i − q_j|i + p_i|j − q_i|j)(y_i − y_j),

and can be interpreted as the resultant force created by a set of springs between the map point y_i and all other map points y_j: each spring repels or attracts the pair depending on whether their distance in the map is too small or too large to represent the similarity of the corresponding high-dimensional datapoints, and its stiffness is the mismatch (p_j|i − q_j|i + p_i|j − q_i|j). The gradient descent is initialized by sampling map points from an isotropic Gaussian with small variance centered at the origin; to speed up the optimization and to avoid poor local minima, a momentum term is added (the current gradient is combined with an exponentially decaying sum of previous gradients):

Y^(t) = Y^(t−1) + η ∂C/∂Y + α(t) (Y^(t−1) − Y^(t−2))

Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Gradient Descent Optimization

Cost function C:

C = ∑_i KL(P_i || Q_i) = ∑_i ∑_j p_j|i log( p_j|i / q_j|i )

Regular SNE gradient w.r.t. z:

∂C/∂z_i = 2 ∑_j (p_j|i − q_j|i + p_i|j − q_i|j)(z_i − z_j)

replace pj|i with pij in symmetric SNE
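Putting the formulas from the last few slides together, a rough toy sketch of regular SNE with plain gradient descent (fixed sigma, no perplexity search, no momentum; purely for illustration):

```python
import numpy as np

def gaussian_similarities(X, sigma):
    """Row-normalized Gaussian similarities (the p_j|i / q_j|i from the slides)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    s = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(s, 0.0)
    return s / s.sum(axis=1, keepdims=True)

def sne_gradient(P, Q, Z):
    """dC/dz_i = 2 * sum_j (p_j|i - q_j|i + p_i|j - q_i|j) * (z_i - z_j)."""
    grad = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        coeff = (P[i, :] - Q[i, :]) + (P[:, i] - Q[:, i])
        grad[i] = 2.0 * np.sum(coeff[:, None] * (Z[i] - Z), axis=0)
    return grad

rng = np.random.RandomState(0)
X = rng.randn(20, 5)                                  # toy high-dimensional data
P = gaussian_similarities(X, sigma=1.0)               # fixed sigma instead of a perplexity search

Z = 1e-4 * rng.randn(20, 2)                           # small random init of the map points
for step in range(200):
    Q = gaussian_similarities(Z, sigma=np.sqrt(0.5))  # low-dim Gaussian with variance 1/2
    Z -= 10.0 * sne_gradient(P, Q, Z)                 # plain gradient descent, no momentum
```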

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Crowding problem

(Figure: four points, labeled 1-4, in the 2D feature space (x1, x2) with pairwise distances d and 2d, next to an attempted arrangement of the same points on a 1D axis z, where some distances become 2d or 3d and are marked "bad!")

Suppose you want to maintain the neighbor relationships of the 2D space in 1D.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Crowding problem

(Figure: further attempted 1D arrangements of the four points along z; each placement distorts at least one of the pairwise distances and is marked "bad!")

t-Distributed Stochastic Neighbor Embedding (t-SNE)

(Figure: more attempted 1D arrangements, including one where two points coincide, d = 0; all are marked "bad!")

This is a case where an exact distance representation in the low dimension is impossible :(

t-Distributed Stochastic Neighbor Embedding (t-SNE)

What would regular SNE do?

(Figure: the same four 2D points mapped to 1D by regular SNE)

Crowding problem! It squashes all the points together.

t-Distributed Stochastic Neighbor Embedding (t-SNE)


Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.


(a) Gradient of SNE. (b) Gradient of UNI-SNE. (c) Gradient of t-SNE.

Figure 1: Gradients of three types of SNE as a function of the pairwise Euclidean distance between two points in the high-dimensional and the pairwise distance between the points in the low-dimensional data representation.

Negative gradient if points are too close in the low-dimensional space, to provide some repulsion against crowding.

t-SNE gradient w.r.t. z:

∂C/∂z_i = 4 ∑_j (p_ij − q_ij)(z_i − z_j)(1 + ||z_i − z_j||²)^(−1), where p_ij = (p_i|j + p_j|i) / (2n)

Excerpt from van der Maaten & Hinton (2008): An advantage of the Student t-distribution is that it is closely related to the Gaussian, as it is an infinite mixture of Gaussians; a computationally convenient property is that evaluating the density of a point under a Student t-distribution is much faster than under a Gaussian because it does not involve an exponential. In Figure 1(a) to 1(c), positive gradient values represent an attraction between the low-dimensional datapoints y_i and y_j, whereas negative values represent a repulsion. Two main advantages of the t-SNE gradient over the gradients of SNE and UNI-SNE: first, t-SNE strongly repels dissimilar datapoints that are modeled by a small pairwise distance in the low-dimensional representation (SNE has such a repulsion as well, but its effect is minimal compared to the strong attractions elsewhere in the gradient, and in UNI-SNE the repulsion is only strong when the low-dimensional distance is already large). Second, although t-SNE introduces strong repulsions between dissimilar datapoints that are modeled by small pairwise distances, these repulsions do not go to infinity, unlike in UNI-SNE.
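The t-SNE gradient above translates to NumPy along the same lines as the SNE toy sketch earlier. Again a sketch only; it assumes the symmetrized p_ij and the Student-t q_ij (normalized over all pairs, as in the paper) have already been computed:

```python
import numpy as np

def student_t_joint_q(Z):
    """Symmetric q_ij with the Student-t kernel, normalized over all pairs."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(q, 0.0)
    return q / q.sum()

def tsne_gradient(P, Q, Z):
    """dC/dz_i = 4 * sum_j (p_ij - q_ij) * (z_i - z_j) * (1 + ||z_i - z_j||^2)^(-1)."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    inv_dist = 1.0 / (1.0 + sq_dists)                 # the Student-t kernel term
    grad = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        coeff = (P[i, :] - Q[i, :]) * inv_dist[i, :]
        grad[i] = 4.0 * np.sum(coeff[:, None] * (Z[i] - Z), axis=0)
    return grad

# toy usage with a made-up uniform P (in practice P comes from the high-dimensional data)
Z = np.random.RandomState(0).randn(10, 2)
P = np.full((10, 10), 1.0 / 90); np.fill_diagonal(P, 0.0)
Z_new = Z - 100.0 * tsne_gradient(P, student_t_joint_q(Z), Z)   # one gradient-descent step
```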

t-Distributed Stochastic Neighbor Embedding (t-SNE)

• Great for visualizing datasets in 2D
• Need to analyze multiple perplexity values (a tuning parameter related to the standard deviation of the Gaussian, balancing local and global structure)
• Not deterministic; the cost function of t-SNE is not convex
• More hyperparameters (e.g., the learning rate)
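In practice, one would typically use an existing implementation rather than the toy sketches above, for example scikit-learn's TSNE; the data and parameter values below are only illustrative settings to vary, not recommendations from the slides:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(123)
X = rng.randn(500, 50)                      # hypothetical high-dimensional dataset

# try several perplexity values, since the embedding can change substantially
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2,
                perplexity=perplexity,
                learning_rate=200.0,
                init='pca',
                random_state=123)           # fix the seed; t-SNE is not deterministic
    Z = tsne.fit_transform(X)
    print(perplexity, Z.shape)              # (500, 2) for each setting
```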

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Source: https://distill.pub/2016/misread-tsne/

f-Divergences

In probability theory, an f-divergence is a function D_f(P||Q) for measuring the difference between two probability distributions P and Q.

Csiszár, I. (1963). "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoffschen Ketten". Magyar. Tud. Akad. Mat. Kutato Int. Kozl. 8: 85–108.

Morimoto, T. (1963). "Markov processes and the H-theorem". J. Phys. Soc. Jpn. 18 (3): 328–331

t-SNE embeddings based on different f-divergences

Table 1 from the paper below lists commonly used f-divergences (along with their generating function) and their corresponding t-SNE objective (referred to as ft-SNE), together with the kind of distance relationship that gets emphasized by each choice of f-divergence:

• Kullback-Leibler (KL): f(t) = t log t; objective ∑_ij p_ij log(p_ij / q_ij); emphasis: Local
• Chi-square (χ² or CH): f(t) = (t − 1)²; objective ∑_ij (p_ij − q_ij)² / q_ij; emphasis: Local
• Reverse-KL (RKL): f(t) = −log t; objective ∑_ij q_ij log(q_ij / p_ij); emphasis: Global
• Jensen-Shannon (JS): f(t) = −(t + 1) log((t + 1)/2) + t log t; objective ½ (KL(p_ij || (p_ij + q_ij)/2) + KL(q_ij || (p_ij + q_ij)/2)); emphasis: Both
• Hellinger distance (HL): f(t) = (√t − 1)²; objective ∑_ij (√p_ij − √q_ij)²; emphasis: Both
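To make the table concrete, a small NumPy sketch that evaluates D_f(P||Q) = ∑_x q(x) f(p(x)/q(x)) for a few of the generators above on made-up discrete distributions (for intuition only; this is not the ft-SNE code from the paper):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions."""
    return np.sum(q * f(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.2, 0.6])

generators = {
    'KL':         lambda t: t * np.log(t),
    'Chi-square': lambda t: (t - 1) ** 2,
    'Reverse-KL': lambda t: -np.log(t),
    'Hellinger':  lambda t: (np.sqrt(t) - 1) ** 2,
}

for name, f in generators.items():
    print(name, f_divergence(p, q, f))

# sanity check: the KL generator recovers sum_x p(x) * log(p(x) / q(x))
print(np.isclose(f_divergence(p, q, generators['KL']), np.sum(p * np.log(p / q))))  # True
```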

Im, D. J., Verma, N., & Branson, K. (2018). Stochastic Neighbor Embedding under f-divergences. arXiv preprint arXiv:1811.01247.

The authors show that emphasizing precision- or recall-like structure can be achieved by minimizing f-divergences other than the KL divergence, and suggest that data scientists create and explore low-dimensional visualizations of their data corresponding to several different f-divergences, each of which is geared toward different types of structure.


Uniform Manifold Approximation and Projection (UMAP)

(Figure: side-by-side UMAP and t-SNE embeddings; see the Figure 1 caption below)

McInnes, L., & Healy, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Compared to t-SNE, UMAP seems to be

• faster
• deterministic
• better at preserving clusters
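For reference, a typical way to compute a UMAP embedding with the umap-learn package (assuming it is installed, e.g., via `pip install umap-learn`; the data and parameter values are illustrative):

```python
import numpy as np
import umap                            # from the umap-learn package

rng = np.random.RandomState(123)
X = rng.randn(500, 50)                 # hypothetical high-dimensional dataset

reducer = umap.UMAP(n_components=2,
                    n_neighbors=15,    # roughly analogous in spirit to t-SNE's perplexity
                    min_dist=0.1,
                    random_state=123)
Z = reducer.fit_transform(X)           # shape (500, 2)
```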


Figure 1: Comparison of UMAP and t-SNE embeddings for a number of real world datasets. More of the loops in the COIL20 dataset are kept intact by UMAP, including the intertwined loops. Similarly, the global relationships among different digits in the MNIST digits dataset are more clearly captured, with 1 (red) and 0 (dark red) at far corners of the embedding space, and 4, 7, 9 (yellow, sea-green, and violet) and 3, 5, 8 (orange, chartreuse, and blue) separated as distinct clumps of similar digits. In the Fashion MNIST dataset the distinction between clothing (dark red, yellow, orange, vermilion) and footwear (chartreuse, sea-green, and violet) is made more clear.

Reading Assignments

• Python Machine Learning, 2nd Edition. Chapter 5: Compressing Data via Dimensionality Reduction

• Scikit-learn doc 2.2. Manifold learning: https://scikit-learn.org/stable/modules/manifold.html

Code Examples

https://github.com/rasbt/stat479-machine-learning-fs18/blob/master/14_feat-extract/14_feat-extract_code.ipynb
