MTTTS17 Dimensionality Reduction and Visualization
Spring 2020, Jaakko Peltonen

Lecture 6: Nonlinear dimensionality reduction, part 1

Reminder: dimension reduction methods
• What to do if I have to analyze multidimensional data?
• Solution: dimensionality reduction, e.g. from 17 dimensions down to 1D, 2D or 3D.
• Problem: how to project the data points faithfully into the low-dimensional space (1D-3D)?
[Figure: original 3D data and its 2D projection by Curvilinear Component Analysis.]

Visualization of a data set
• Analysing many dimensions individually takes a lot of time and can miss relevant properties of the data, so we use dimensionality reduction to try to capture the important properties in a few dimensions. (Pictures from Juha Vesanto's doctoral thesis, 2002.)
[Figure 4.5: Small multiples of the system data set (variables blks/s, wblks/s, usr, sys, intr, wio, idle, ipkts, opkts). (a) Component planes: linking is done by position; each subplot corresponds to one variable, and each dot in each subplot to one data sample; the coordinates come from the PCA projection of the data. The name "component planes" comes from a similar technique used in visualization of the SOM; the last component plane shows the color coding of the map and thus links the component planes with (b)-(d). (b) Time-series data: the system data for Tuesday 6:00 to 22:00; the samples have been colored using the color coding. (c) Scatterplot matrix: pairwise scatterplots of all variable pairs, linked together using color, with the histograms of the individual variables on the diagonal. (d) Parallel coordinates: each vertical axis corresponds to one variable and each horizontal line to one object, so linking is done explicitly using lines; the line color encodes similarity between objects. Figures (a), (c) and (d) all show all of the data, and can be used to detect correlations between variables.]

Principal component analysis (PCA)
• PCA is stable: there are no additional parameters, and it is guaranteed to always arrive at the same solution.
• Hence, PCA is usually the first dimension reduction method to try (if it doesn't work, then try something fancier).
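As a concrete illustration, here is a minimal NumPy sketch of the PCA pipeline just described: center the data, eigendecompose the covariance matrix, project, and measure the variance explained and the reconstruction error. The toy dataset is made up for illustration; the slides do not give the data behind the worked example that follows.

```python
import numpy as np

# Toy 2D data roughly in the spirit of the worked example on the next slide
# (the exact dataset is not given in the slides, so this is made up).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.98], [0.98, 1.0]], size=1000)

X = X - X.mean(axis=0)              # subtract the mean from the data
C = X.T @ X                         # covariance matrix (unnormalized, as on the slide)
lam, V = np.linalg.eigh(C)          # eigenvalues (ascending) and eigenvectors
lam, V = lam[::-1], V[:, ::-1]      # reorder so PC1 has the largest variance

scores = X @ V[:, :1]               # projection onto the first principal component
X_hat = scores @ V[:, :1].T         # rank-1 reconstruction of the data

TVE = lam[0] / lam.sum()            # proportion of total variance explained by PC1
E = np.var(X - X_hat, axis=0).sum() / np.var(X, axis=0).sum()  # reconstruction error
print(f"TVE = {TVE:.4f}, E = {E:.4f}")  # TVE + E = 1 for PCA
```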
Principal component analysis (PCA): a worked example
Covariance matrix (after subtracting the mean from the data):

    C = X^T X = [ 999.00  979.84 ]
                [ 979.84  999.00 ]

Eigenvector matrix (projection to the principal components):

    V = [ 0.7071  -0.7071 ]
        [ 0.7071   0.7071 ]

Eigenvalues (variance along the principal components):

    [λ1, λ2] = [1978.8, 19.2]

Proportion of variance explained by the first principal component:

    TVE = λ1 / (λ1 + ... + λd) = 0.9904

Reconstruction error:

    E = Σ_{i=1..d} var[X_i − X̂_i] / (λ1 + ... + λd) = 0.0096

Example: Bi-Gaussian distribution — clusters [Linear methods]
• The data points are distributed along two stretched and rotated Gaussians.
• The PCA transformation rotates the data so that the first principal component (P1) is collinear with the direction of maximal total variance, losing the information about the two clusters.
• The ICA transformation maps the first independent component (I1) to the maximally non-Gaussian direction, revealing the clusters.
• PCA thus fails to separate the clusters: you don't see the cluster structure in the 1D visualization (the histogram of P1 is unimodal, while the histogram of I1 shows the two clusters).
[Figures: the original data; scatterplots of the PCA projection (P1 vs. P2) and the ICA projection (I1 vs. I2); histograms of the first principal component (P1) and the first independent component (I1).]

Properties of PCA
• Strengths:
  - eigenvector method
  - no tuning parameters
  - non-iterative
  - no local optima
• Weaknesses:
  - limited to second-order statistics
  - limited to linear projections

PCA on nonlinear data
• The first principal component is given by the red line. The green line on the right gives the "correct" nonlinear dimension (which PCA is of course unable to find).

Manifolds
• In the figure on the left, a PCA mapping would not find the "correct" 1D manifold, shown in green, because linear methods try to preserve global features.
• Often, preserving local features, like trustworthiness, continuity (next lecture) or conformity, or the manifold structure in general, is more important than global properties.

Dimension reduction methods
• Matlab+Excel example: PCA versus nonlinear multidimensional scaling! (A rough Python stand-in is sketched below.)
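The Matlab+Excel demo itself is not reproduced here. As a rough stand-in (an assumption, not the course demo), the following sketch contrasts PCA with nonmetric, hence nonlinear, multidimensional scaling on points along a curved 1D manifold, using scikit-learn's PCA and MDS; the arc data and all parameter values are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

# Points along a strongly curved 1D manifold (a 3/4 circle arc) in 2D;
# this toy data is a made-up stand-in for the demo's data.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 1.5 * np.pi, size=200))  # position along the arc
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.01 * rng.standard_normal((200, 2))

# Linear baseline: PCA projects onto the single direction of maximal
# variance, which folds the arc onto itself.
x_pca = PCA(n_components=1).fit_transform(X).ravel()

# Nonmetric (hence nonlinear) multidimensional scaling: preserves the rank
# order of pairwise distances, so it can unfold the arc into 1D.
x_mds = MDS(n_components=1, metric=False, random_state=0).fit_transform(X).ravel()

# Compare how well each 1D embedding recovers the ordering along the curve.
print("|corr(PCA, t)| =", abs(np.corrcoef(x_pca, t)[0, 1]))
print("|corr(MDS, t)| =", abs(np.corrcoef(x_mds, t)[0, 1]))
```

On data this strongly curved, the nonmetric MDS coordinate usually tracks the position along the arc more faithfully than the linear PCA projection, which is the point of the demo; the exact numbers vary with the random seed.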
Nonlinear dimensionality reduction and manifold learning

Topology, spaces and manifolds
• If variables depend on each other, the support of their joint distribution does not span the whole space: the dependencies induce some structure in the distribution (a geometrical locus, an object in space).
• Topology (a branch of mathematics) studies the properties of objects that are preserved through deformations, twisting and stretching. Tearing is forbidden, which guarantees that the intrinsic structure or connectivity is not altered (a circle is topologically equivalent to an ellipse).
• Tearing may still be an interesting operation (e.g. unfolding the earth onto a flat map).
• Objects with the same topological properties are called homeomorphic.

Topological spaces
• A topological space is a set for which a topology is specified. For a set X, a topology T is defined as a collection of subsets of X with the following properties:
  - trivially, the empty set and X itself belong to T;
  - whenever two sets are in T, so is their intersection;
  - whenever two or more sets are in T, so is their union.
• The definition holds for Cartesian spaces as well as for graphs. (Example: a topology on R, the set of real numbers, is given by the union of all open intervals.)

Neighborhoods and Hausdorff's axioms
• More generally, a topological space can be defined using neighborhoods (Ns) and Hausdorff's axioms (1919):
  - each point y has at least one neighborhood containing y;
  - if two neighborhoods of the same point y exist, then a neighborhood of y contained in their intersection also exists;
  - if z lies in a neighborhood of y, then z has a neighborhood contained in that neighborhood of y;
  - for two distinct points, two disjoint neighborhoods of these points exist.
• The ε-neighborhood of a point y in R^d, an infinitesimal open set, is often defined as the open ball B_ε(y): the set of points inside a d-dimensional hollow sphere of radius ε > 0 centered on y. (A small code sketch of ε-neighborhoods follows after these slides.)

Manifolds
• Within this framework, a topological manifold M is a topological space that is locally Euclidean (any object that is nearly "flat" on small scales).
• A manifold can be compact or non-compact, and connected or disconnected.
• Embedding: a representation of a topological object (a manifold, a graph, etc.) in a Cartesian space (usually R^d) that preserves its topological properties.
• A smooth manifold is called a differentiable manifold.
• For example, the hollow d-dimensional hypersphere is a (d-1)-manifold.

Embedding manifolds in practice
• Whitney (1930) showed that any P-manifold can be embedded in R^(2P+1).
• A line can be embedded in R.
• A circle is a compact 1-manifold and can be embedded in R^2.
• The trefoil (a knotted circle) reaches the bound R^3 = R^(2P+1).
• In practice, the manifold is nothing more than the underlying support of a data distribution that is known only through a finite sample. Two problems appear: (1) dimensionality reduction has to work with sparse and limited data; (2) assuming a manifold accounts for the support of the data distribution but not for other properties such as its density, which is problematic for finding latent variables (there the model of the data density is of prime importance).
• A manifold also does not account for noise: observed points may lie near the manifold rather than on it. Dimensionality reduction then re-embeds the manifold and projects the noisy data onto it.
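As a small illustration of ε-neighborhoods and of "locally Euclidean" (referenced from the neighborhoods slide above), the sketch below samples a finite set of points from a 1-manifold, the unit circle in R^2, collects the sample points inside an open ball B_ε(y), and checks with local PCA that the neighborhood is essentially one-dimensional. The radius ε = 0.2 and the sample size are arbitrary choices.

```python
import numpy as np

def epsilon_neighbors(X, y, eps):
    """Indices of the sample points strictly inside the open ball B_eps(y)."""
    return np.flatnonzero(np.linalg.norm(X - y, axis=1) < eps)

# A finite sample from a 1-manifold: the unit circle embedded in R^2.
rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 2.0 * np.pi, size=300)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Locally the circle looks Euclidean (1D): within a small open ball the
# neighbors spread along essentially one direction only.
y = X[0]
idx = epsilon_neighbors(X, y, eps=0.2)
local = X[idx] - X[idx].mean(axis=0)
variances = np.linalg.eigvalsh(local.T @ local / len(idx))  # ascending order
print(len(idx), "neighbors; local variances:", variances)   # one dominant direction
```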
Intrinsic dimension
• The intrinsic dimension(ality) of a random vector y equals the topological dimension of the support of the distribution of y.
• Given a topological space, a covering of a subset S is a collection C of open subsets whose union contains S.
• A subset S of a topological space has topological dimension D_top (the Lebesgue covering dimension) if every covering C of S has a refinement C' in which every point of S belongs to at most D_top + 1 open subsets (and D_top is the smallest such integer).
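The covering definition is not directly computable from a finite sample, so in practice the intrinsic dimension must be estimated from the data. The sketch below is one crude proxy of my own choosing (not an estimator given in this excerpt): in small neighborhoods of a sampled manifold, count how many local PCA eigenvalues are needed to explain most of the variance; the neighborhood size k and the threshold are arbitrary.

```python
import numpy as np

def local_pca_dimension(X, k=15, var_threshold=0.95):
    """Crude intrinsic-dimension proxy: for each point, count how many local
    PCA eigenvalues are needed to reach var_threshold of the variance among
    its k nearest neighbors, then take the median over all points."""
    dims = []
    for i in range(len(X)):
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        nbrs = X[np.argsort(d2)[1:k + 1]]              # k nearest neighbors
        nbrs = nbrs - nbrs.mean(axis=0)
        lam = np.linalg.eigvalsh(nbrs.T @ nbrs)[::-1]  # descending local variances
        ratios = np.cumsum(lam) / lam.sum()
        dims.append(int(np.searchsorted(ratios, var_threshold) + 1))
    return int(np.median(dims))

# Sample from a 1-manifold (a circle) embedded in R^3: the estimate should be 1.
rng = np.random.default_rng(3)
theta = rng.uniform(0.0, 2.0 * np.pi, size=500)
X = np.column_stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)])
print(local_pca_dimension(X))  # expected output: 1
```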