Umap Documentation Release 0.3

Leland McInnes
Jan 29, 2020

User Guide / Tutorial:

1 How to Use UMAP
    1.1 Iris data
    1.2 Digits data
2 Basic UMAP Parameters
    2.1 n_neighbors
    2.2 min_dist
    2.3 n_components
    2.4 metric
3 Plotting UMAP results
    3.1 Plotting larger datasets
    3.2 Interactive plotting, and hover tools
    3.3 Plotting connectivity
    3.4 Diagnostic plotting
4 UMAP Reproducibility
5 Transforming New Data with UMAP
6 Inverse transforms
7 UMAP on sparse data
    7.1 A mathematical example
    7.2 A text analysis example
8 UMAP for Supervised Dimension Reduction and Metric Learning
    8.1 UMAP on Fashion MNIST
    8.2 Using Labels to Separate Classes (Supervised UMAP)
    8.3 Using Partial Labelling (Semi-Supervised UMAP)
    8.4 Training with Labels and Embedding Unlabelled Test Data (Metric Learning with UMAP)
9 Using UMAP for Clustering
    9.1 Traditional clustering
    9.2 UMAP enhanced clustering
10 Outlier detection using UMAP
11 Embedding to non-Euclidean spaces
    11.1 Plane embeddings
    11.2 Spherical embeddings
    11.3 Embedding on a Custom Metric Space
    11.4 A Practical Example
    11.5 Bonus: Embedding in Hyperbolic space
12 Gallery of Examples of UMAP usage
    12.1 UMAP on the MNIST Digits dataset
    12.2 UMAP on the MNIST Digits dataset
    12.3 UMAP as a Feature Extraction Technique for Classification
    12.4 UMAP on the Fashion MNIST Digits dataset using Datashader
    12.5 Comparison of Dimension Reduction Techniques
13 Frequently Asked Questions
    13.1 Should I normalise my features?
    13.2 Can I cluster the results of UMAP?
    13.3 The clusters are all squashed together and I can’t see internal structure
    13.4 I ran out of memory. Help!
    13.5 UMAP is eating all my cores. Help!
    13.6 Is there GPU or multicore-CPU support?
    13.7 Can I add a custom loss function?
    13.8 Is there support for the R language?
    13.9 Is there a C/C++ implementation?
    13.10 I can’t get UMAP to run properly!
    13.11 What is the difference between PCA / UMAP / VAEs?
    13.12 Successful use-cases
14 How UMAP Works
    14.1 Topological Data Analysis and Simplicial Complexes
    14.2 Adapting to Real World Data
    14.3 Finding a Low Dimensional Representation
    14.4 The UMAP Algorithm
15 Performance Comparison of Dimension Reduction Implementations
    15.1 Performance scaling by dataset size
16 Interactive Visualizations
    16.1 UMAP Zoo
    16.2 Tensorflow Embedding Projector
    16.3 PixPlot
    16.4 UMAP Explorer
    16.5 Audio Explorer
17 Exploratory Analysis of Interesting Datasets
    17.1 Prime factorizations of numbers
    17.2 Structure of Recent Philosophy
    17.3 Language, Context, and Geometry in Neural Networks
    17.4 Activation Atlas
    17.5 Open Syllabus Galaxy
18 Scientific Papers
    18.1 The single-cell transcriptional landscape of mammalian organogenesis
    18.2 A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution
    18.3 Exploring Neural Networks with Activation Atlases
    18.4 TimeCluster: dimension reduction applied to temporal data for visual analytics
    18.5 Dimensionality reduction for visualizing single-cell data using UMAP
    18.6 Revealing multi-scale population structure in large cohorts
    18.7 Understanding Vulnerability of Children in Surrey
19 UMAP API Guide
    19.1 UMAP
    19.2 Useful Functions
20 Indices and tables
Python Module Index
Index

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on arXiv:

    McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

You can find the software on GitHub.

Installation

Conda install, via the excellent work of the conda-forge team:

    conda install -c conda-forge umap-learn

The conda-forge packages are available for Linux, OS X, and Windows 64 bit.

PyPI install, presuming you have numba and sklearn and all its requirements (numpy and scipy) installed:

    pip install umap-learn

CHAPTER 1
How to Use UMAP

UMAP is a general purpose manifold learning and dimension reduction algorithm. It is designed to be compatible with scikit-learn, making use of the same API and able to be added to sklearn pipelines. If you are already familiar with sklearn you should be able to use UMAP as a drop-in replacement for t-SNE and other dimension reduction classes. If you are not so familiar with sklearn, this tutorial will step you through the basics of using UMAP to transform and visualise data.

First we’ll need to import a bunch of useful tools.
We will need numpy, of course, and we’ll also use some of the datasets available in sklearn, along with the train_test_split function to divide up data. Finally we’ll need some plotting tools (matplotlib and seaborn) to help us visualise the results of UMAP, and pandas to make that a little easier.

    import numpy as np
    from sklearn.datasets import load_iris, load_digits
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    # Jupyter/IPython magic: render plots inline in the notebook
    %matplotlib inline

    sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

1.1 Iris data

The next step is to get some data to work with. To ease us into things we’ll start with the iris dataset. It isn’t very representative of what real data would look like, but it is small both in number of points and number of features, and will let us get an idea of what the dimension reduction is doing. We can load the iris dataset from sklearn.

    iris = load_iris()
    print(iris.DESCR)

    Iris Plants Database
    ====================

    Notes
    -----
    Data Set Characteristics:
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
        :Summary Statistics:

        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
        ============== ==== ==== ======= ===== ====================

        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael
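
The description above says the dataset is small in both points and features. As a quick check, the following sketch loads the data and wraps it in a pandas DataFrame. The names iris_df and the "species" column are illustrative choices, not something the text above prescribes.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# 150 samples, each with 4 numeric features
print(iris.data.shape)  # (150, 4)

# Build a DataFrame with one column per feature (illustrative name: iris_df)
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer class labels 0, 1, 2 to their species names
label_to_name = dict(enumerate(iris.target_names))
iris_df["species"] = pd.Series(iris.target).map(label_to_name)

# Each of the three classes should appear 50 times
print(iris_df["species"].value_counts())
```

Having the species name as a column makes it easy to colour plots by class later on.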
