MTTTS17 Dimensionality Reduction and Visualization
Spring 2020 Jaakko Peltonen

Lecture 6: Nonlinear dimensionality reduction, part 1

Dimension reduction

Reminder: Dimension reduction methods
• What to do if I have to analyze multidimensional data?
• Solution: dimensionality reduction, e.g. from 17 dimensions down to 1D, 2D or 3D.
• Problem: how to project the data points faithfully into the low-dimensional (1D–3D) space?
[Figure: original 3D data and its 2D projection by Curvilinear Component Analysis.]

Visualization of a data set
• Analysing a lot of dimensions individually takes a lot of time and can miss relevant properties of the data, so we use dimensionality reduction to try to capture the important properties in a few dimensions. (Pictures from Juha Vesanto's doctoral thesis, 2002; a small plotting sketch of two of these views follows after the figure description.)
[Figure 4.5 from Vesanto (2002): small multiples of a system-monitoring data set with variables blks/s, wblks/s, usr, sys, intr, wio, idle, ipkts and opkts. (a) Component planes: each subplot corresponds to one variable and each dot to one data sample; the coordinates come from a PCA projection of the data, and the last plane shows the color coding that links (a) with (b)–(d). (b) Time-series plot of the system data from Tuesday 6:00 to 22:00, with samples colored by the color coding. (c) Scatterplot matrix: pairwise scatterplots of all variable pairs, linked by color, with histograms of the individual variables on the diagonal. (d) Parallel coordinates: each vertical axis corresponds to one variable and each line to one object, so linking is explicit; line color encodes similarity between objects. Figures (a), (c) and (d) show all of the data and can be used to detect correlations between variables.]
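The scatterplot-matrix and parallel-coordinates views above are easy to reproduce. Below is a minimal sketch using the pandas plotting helpers on synthetic data that merely stands in for Vesanto's system-monitoring variables; the variable names, value ranges and the two-way grouping are assumptions made for illustration.

```python
# Minimal sketch of two "small multiples" views: a scatterplot matrix and a
# parallel-coordinates plot. The DataFrame below is synthetic and only stands
# in for the system-monitoring variables of the figure (names and value
# ranges are assumptions for illustration).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix, parallel_coordinates

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "usr":  rng.normal(50, 10, n),    # hypothetical stand-ins for a few
    "sys":  rng.normal(20, 5, n),     # of the monitored variables
    "intr": rng.normal(300, 60, n),
    "idle": rng.normal(30, 8, n),
})

# Scatterplot matrix: pairwise scatterplots with histograms on the diagonal.
scatter_matrix(df, diagonal="hist", figsize=(6, 6))

# Parallel coordinates need a class column for coloring; split the samples
# into two arbitrary groups just for the sake of the example.
df["group"] = np.where(df["usr"] > df["usr"].median(), "high usr", "low usr")
plt.figure(figsize=(6, 3))
parallel_coordinates(df, class_column="group")
plt.show()
```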
PCA: Principal component analysis
• PCA is stable, has no additional parameters, and is guaranteed to always converge to the same optimum.
• Hence, PCA is usually the first dimension reduction method to try (if it doesn't work, then try something fancier).

Principal component analysis: a worked example
Covariance matrix (after subtracting the mean from the data):
  $C = X^\top X = \begin{bmatrix} 999.00 & 979.84 \\ 979.84 & 999.00 \end{bmatrix}$
Eigenvector matrix (projection onto the principal components):
  $V = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix}$
Eigenvalues (variance along the principal components):
  $[\lambda_1 \ \lambda_2] = [1978.8 \ \ 19.2]$
Proportion of variance explained:
  $\mathrm{TVE} = \lambda_1 \big/ \textstyle\sum_{i=1}^{d} \lambda_i = 0.9904$
Reconstruction error (keeping only the first principal component):
  $E = \textstyle\sum_{i=1}^{d} \mathrm{var}[X^i - \hat{X}^i] \big/ \sum_{i=1}^{d} \lambda_i = 0.0096$
(A small numerical sketch of this computation is given below, after the last slide of this part.)

Example: Bi-Gaussian distribution — clusters [Linear methods]
• The data points are distributed along two stretched and rotated Gaussians.
• The PCA transformation rotates the data so that the first principal component (P1) is collinear with the direction of maximal total variance, losing the information about the two clusters.
• The ICA transformation maps the first independent component (I1) to the maximally non-Gaussian direction, revealing the clusters.
• PCA thus fails to separate the clusters: the cluster structure is invisible in the 1D visualization along P1, whereas the histogram along I1 shows two clear modes. (See the PCA/ICA sketch below.)
[Figure: original data; P1 vs P2 and I1 vs I2 scatterplots; histograms along the first principal component (P1) and the first independent component (I1).]

Properties of PCA
• Strengths
  – Eigenvector method
  – No tuning parameters
  – Non-iterative
  – No local optima
• Weaknesses
  – Limited to second-order statistics
  – Limited to linear projections

PCA and nonlinear data
• The first principal component is given by the red line; the green line on the right gives the "correct" nonlinear dimension, which PCA is of course unable to find.

Manifolds
• Left: a PCA mapping would not find the "correct" 1D manifold, shown in green, because linear projections try to preserve global features.
• Often, preserving local features, like trustworthiness, continuity (next lecture) or conformity - that is, manifold structure - is more important than preserving global properties.

Dimension reduction methods
• Matlab+Excel example: PCA versus nonlinear multidimensional scaling (a rough Python stand-in is sketched below).
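To make the covariance example above concrete, here is a minimal sketch of the same computation: mean-centering, the covariance matrix, its eigendecomposition, the proportion of variance explained by the first component, and the reconstruction error when only that component is kept. The synthetic data is an assumption, so the printed numbers are only of the same flavour as the 999.00/979.84 example, not identical to it.

```python
# Minimal PCA-by-eigendecomposition sketch on synthetic 2D data; the data and
# its numbers are assumptions, chosen only to resemble the strongly correlated
# example on the slide (the printed values will not match it exactly).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(0, 1, n)
X = np.column_stack([x1, x1 + 0.15 * rng.normal(0, 1, n)])  # correlated pair
X = X - X.mean(axis=0)                  # subtract the mean first

C = X.T @ X                             # (unnormalized) covariance matrix
eigvals, V = np.linalg.eigh(C)          # eigh returns ascending order
eigvals, V = eigvals[::-1], V[:, ::-1]  # reorder to descending

tve = eigvals[0] / eigvals.sum()        # proportion of variance explained by PC1

# Reconstruction error when keeping only the first principal component.
X_hat = (X @ V[:, :1]) @ V[:, :1].T
err = ((X - X_hat) ** 2).sum() / eigvals.sum()

print(C)
print(eigvals, round(tve, 4), round(err, 4))   # tve + err = 1, as on the slide
```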
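The bi-Gaussian example can be imitated with scikit-learn. This sketch uses FastICA as a stand-in for the ICA transformation of the slides; the cluster geometry, the rotation angle and the choice of FastICA are assumptions. Projected onto the first principal component the two clusters typically overlap, while one of the independent components tends to pick the cluster-separating, maximally non-Gaussian direction.

```python
# Sketch of the bi-Gaussian example: two stretched, offset Gaussian clusters,
# rotated in the plane. FastICA is used as a stand-in for the ICA
# transformation of the slides; the cluster geometry and rotation angle are
# assumptions made for illustration.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(2)
n = 500
cluster = rng.integers(0, 2, n)                        # cluster labels
base = np.column_stack([rng.normal(0, 10, n),          # long, high-variance axis
                        rng.normal(0, 1, n)])          # short axis
base[:, 1] += np.where(cluster == 0, -3.0, 3.0)        # offset along the short axis
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = base @ R.T                                         # rotate the whole cloud

def separation(scores):
    """Gap between the cluster means, in units of the component's spread."""
    gap = abs(scores[cluster == 0].mean() - scores[cluster == 1].mean())
    return gap / scores.std()

p1 = PCA(n_components=2).fit_transform(X)[:, 0]        # first principal component
ica = FastICA(n_components=2, whiten="unit-variance",
              random_state=0).fit_transform(X)

# ICA components are not ordered by importance, so report the better of the two.
print("separation along P1:", round(separation(p1), 2))
print("best separation along an IC:",
      round(max(separation(ica[:, j]) for j in range(2)), 2))
```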
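The Matlab+Excel demo itself is not reproduced here. As a rough stand-in, the sketch below compares a linear PCA projection with a non-metric MDS embedding from scikit-learn on an S-curve; both the data set and the particular MDS variant are assumptions rather than the demo shown in the lecture.

```python
# Rough stand-in for the Matlab+Excel demo (not the original): compare a
# linear PCA projection with a non-metric MDS embedding on an S-curve.
# The data set and the particular MDS variant are assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

X, color = make_s_curve(n_samples=600, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)
X_mds = MDS(n_components=2, metric=False, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=color, s=5)
axes[0].set_title("PCA (linear)")
axes[1].scatter(X_mds[:, 0], X_mds[:, 1], c=color, s=5)
axes[1].set_title("Non-metric MDS")
plt.show()
```

How to judge which embedding is better, with quality measures such as trustworthiness and continuity, is the topic of the next lecture.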
Nonlinear dimensionality reduction and manifold learning

Topology, Spaces and Manifolds
• If variables depend on each other, the support of their joint distribution does not span the whole space: the dependence induces some structure in the distribution (a geometrical locus, an object in space).
• Topology (in mathematics) studies properties of objects that are preserved through deformations, twisting and stretching. Tearing is forbidden, guaranteeing that the intrinsic structure or connectivity is not altered (a circle is topologically equivalent to an ellipse).
• Tearing may still be an interesting operation (e.g. unfolding the Earth onto a flat map).
• Objects with the same topological properties are called homeomorphic.

Topological spaces
• A topological space is a set X for which a topology is specified. A topology T on X is a collection of subsets of X with the following properties:
  – trivially, the empty set and X itself belong to T
  – whenever two sets are in T, so is their intersection
  – whenever two or more sets are in T, so is their union
• The definition holds for Cartesian spaces as well as for graphs (example: a topology on R, the set of real numbers, is given by unions of open intervals).

Neighborhoods
• More generally, a topological space can be defined using neighborhoods (Ns) and Hausdorff's axioms (1919):
  – each point y has at least one neighborhood containing y
  – if two neighborhoods of the same point y exist, their intersection contains a neighborhood of y
  – if a neighborhood of y contains a point z, it also contains a neighborhood of z
  – for two distinct points, two disjoint neighborhoods of these points exist
• An ε-neighborhood (an infinitesimal open set) is often defined as an open ball: the set of points inside a d-dimensional hollow sphere of radius ε > 0 centered on y.

Manifolds
• Within this framework a topological manifold M is a topological space that is locally Euclidean (any object that is nearly "flat" on small scales).
• A manifold can be compact or non-compact, connected or disconnected.
• Embedding: a representation of a topological object (manifold, graph, etc.) in a Cartesian space (usually R^D) that preserves its topological properties.
• A smooth manifold is called a differentiable manifold.
• For example, a hollow d-dimensional hypersphere is a (d - 1)-manifold.

Embeddings of manifolds
• Whitney showed (around 1930) that any P-manifold can be embedded in R^(2P+1).
• A line can be embedded in R.
• A circle is a compact 1-manifold and can be embedded in R^2.
• A trefoil (a knotted circle) reaches the bound: the knotted embedding needs R^3 = R^(2·1+1).
• In practice, the manifold is nothing more than the underlying support of a data distribution, known only through finite samples.
• Two problems appear: (1) dimensionality reduction has to work with sparse and limited data; (2) the manifold assumption takes into account only the support of the data distribution, but not other properties such as its density → problematic for latent variable finding, where the model of the data density is of prime importance.
• A manifold does not account for noise: points may lie near, but not exactly on, it → dimensionality reduction re-embeds a manifold, and the noisy data are projected onto it. (A small sketch of finite noisy samples near a manifold follows below.)
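To make the last two bullets concrete, here is a small illustration, not taken from the slides, of finite noisy samples lying near a one-dimensional manifold embedded in R^3.

```python
# Small illustration (an assumption, not from the slides): finite, noisy
# samples that lie near, but not exactly on, a one-dimensional manifold
# (a helix) embedded in three-dimensional space. A dimensionality-reduction
# method would try to recover the single underlying coordinate t.
import numpy as np

rng = np.random.default_rng(3)
n = 400
t = rng.uniform(0, 4 * np.pi, n)                       # intrinsic 1D coordinate
helix = np.column_stack([np.cos(t), np.sin(t), 0.2 * t])
X = helix + rng.normal(scale=0.05, size=helix.shape)   # observation noise

print(X.shape)   # (400, 3): ambient dimension 3, intrinsic dimension 1
```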
Intrinsic Dimension
• The intrinsic dimension(ality) of a random vector y equals the topological dimension of the support of the distribution of y.
• Given a topological space X, a covering of a subset S is a collection C of open subsets of X whose union contains S.
• A subset S of a topological space has topological dimension D_top (the Lebesgue covering dimension) if every covering C of S has a refinement C' in which every point of S belongs to at most D_top + 1 open subsets.
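The covering dimension above is a topological definition and cannot be computed directly from finite data. As a loose, purely illustrative proxy (an assumption, not a method from the slides), one can count how many principal components are needed to explain most of the variance; this linear estimate is crude and overestimates the intrinsic dimension of curved manifolds.

```python
# Loose, purely illustrative proxy for intrinsic dimension (an assumption,
# not the Lebesgue covering dimension defined above): count the principal
# components needed to explain most of the variance. Being linear, this
# overestimates the dimension of curved manifolds (the helix sketch above
# would typically give 3, even though its intrinsic dimension is 1).
import numpy as np

def pca_dimension_estimate(X, threshold=0.95):
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]    # descending eigenvalues
    explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative variance ratio
    return int(np.searchsorted(explained, threshold) + 1)

# Data lying near a flat 2D plane inside R^3 gives 2, as expected.
rng = np.random.default_rng(4)
plane = np.column_stack([rng.normal(0, 3, 500),
                         rng.normal(0, 2, 500),
                         np.zeros(500)])
print(pca_dimension_estimate(plane + 0.01 * rng.normal(size=plane.shape)))  # -> 2
```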