Introduction to the R Package TDA
Total Page:16
File Type:pdf, Size:1020Kb
Introduction to the R package TDA Brittany T. Fasy, Jisu Kim, Fabrizio Lecci, Cl´ement Maria, David L. Millman, and Vincent Rouvreau In collaboration with the CMU TopStat Group Abstract We present a short tutorial and introduction to using the R package TDA, which pro- vides some tools for Topological Data Analysis. In particular, it includes implementations of functions that, given some data, provide topological information about the underlying space, such as the distance function, the distance to a measure, the kNN density estima- tor, the kernel density estimator, and the kernel distance. The salient topological features of the sublevel sets (or superlevel sets) of these functions can be quantified with persistent homology. We provide an R interface for the efficient algorithms of the C++ libraries GUDHI, Dionysus, and PHAT, including a function for the persistent homology of the Rips filtration, and one for the persistent homology of sublevel sets (or superlevel sets) of arbitrary functions evaluated over a grid of points. The significance of the features in the resulting persistence diagrams can be analyzed with functions that implement the methods discussed in Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, and Singh(2014), Chazal, Fasy, Lecci, Rinaldo, and Wasserman(2014c) and Chazal, Fasy, Lecci, Michel, Rinaldo, and Wasserman(2014a). The R package TDA also includes the implementation of an algorithm for density clustering, which allows us to identify the spatial organization of the probability mass associated to a density function and visualize it by means of a dendrogram, the cluster tree. Keywords: Topological Data Analysis, Persistent Homology, Density Clustering. 1. Introduction Topological Data Analysis (TDA) refers to a collection of methods for finding topological structure in data (Carlsson 2009). Recent advances in computational topology have made it possible to actually compute topological invariants from data. The input of these pro- cedures typically takes the form of a point cloud, regarded as possibly noisy observations from an unknown lower-dimensional set S whose interesting topological features were lost during sampling. The output is a collection of data summaries that are used to estimate the topological features of S. One approach to TDA is persistent homology (Edelsbrunner and Harer 2010), a method for studying the homology at multiple scales simultaneously. More precisely, it provides a framework and efficient algorithms to quantify the evolution of the topology of a family of nested topological spaces. Given a real-valued function f, such as the ones described in Section2, persistent homology describes how the topology of the lower level sets fx : f(x) ≤ tg (or superlevel sets fx : f(x) ≥ tg) changes as t increases from −∞ to 1 (or decreases from 1 to −∞). This information is encoded in the persistence diagram, a multiset of points in 2 Introduction to the R package TDA the plane, each corresponding to the birth and death of a homological feature that existed for some interval of t. This paper is devoted to the presentation of the R package TDA, which provides a user- friendly interface for the efficient algorithms of the C++ libraries GUDHI (Maria 2014), Dionysus (Morozov 2007), and PHAT (Bauer, Kerber, and Reininghaus 2012). In Section2, we describe how to compute some widely studied functions that, starting from a point cloud, provide some topological information about the underlying space: the distance function (distFct), the distance to a measure function (dtm), the k Nearest Neighbor density estimator (knnDE), the kernel density estimator (kde), and the kernel distance (kernelDist). Section3 is devoted to the computation of persistence diagrams: the function gridDiag can be used to compute persistent homology of sublevel sets (or superlevel sets) of functions evaluated over a grid of points; the function ripsDiag returns the persistence diagram of the Rips filtration built on top of a point cloud. One of the key challenges in persistent homology is to find a way to isolate the points of the persistence diagram representing the topological noise. Statistical methods for persistent homology provide an alternative to its exact computation. Knowing with high confidence that an approximated persistence diagrams is close to the true{computationally infeasible{ diagram is often enough for practical purposes. Fasy et al. (2014), Chazal et al. (2014c), and Chazal et al. (2014a) propose several statistical methods to construct confidence sets for persistence diagrams and other summary functions that allow us to separate topological signal from topological noise. The methods are implemented in the TDA package and described in Section3. Finally, the TDA package provides the implementation of an algorithm for density clustering. This method allows us to identify and visualize the spatial organization of the data, without specific knowledge about the data generating mechanism and in particular without any a priori information about the number of clusters. In Section4, we describe the function clusterTree, that, given a density estimator, encodes the hierarchy of the connected components of its superlevel sets into a dendrogram, the cluster tree (Kpotufe and von Luxburg 2011; Kent 2013). 2. Distance Functions and Density Estimators As a first toy example to using the TDA package, we show how to compute distance functions and density estimators over a grid of points. The setting is the typical one in TDA: a set d of points X = fx1; : : : ; xng ⊂ R has been sampled from some distribution P and we are interested in recovering the topological features of the underlying space by studying some functions of the data. The following code generates a sample of 400 points from the unit circle and constructs a grid of points over which we will evaluate the functions. > library("TDA") > X <- circleUnif(400) > Xlim <- c(-1.6, 1.6); Ylim <- c(-1.7, 1.7); by <- 0.065 > Xseq <- seq(Xlim[1], Xlim[2], by = by) > Yseq <- seq(Ylim[1], Ylim[2], by = by) > Grid <- expand.grid(Xseq, Yseq) Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, Cl´ement Maria, David L. Millman, Vincent Rouvreau3 The TDA package provides implementations of the following functions: d • The distance function is defined for each y 2 R as ∆(y) = infx2X kx − yk2 and is computed for each point of the Grid with the following code: > distance <- distFct(X = X, Grid = Grid) • Given a probability measure P , the distance to measure (DTM) is defined for each d y 2 R as Z m0 1=r 1 −1 r dm0(y) = (Gy (u)) du ; m0 0 where Gy(t) = P (kX − yk ≤ t), and m0 2 (0; 1) and r 2 [1; 1) are tuning parame- ters. As m0 increases, DTM function becomes smoother, so m0 can be understood as a smoothing parameter. r affects less but also changes DTM function as well. The default value of r is 2. The DTM can be seen as a smoothed version of the distance function. See (Chazal, Cohen-Steiner, and M´erigot 2011, Definition 3.2) and (Chazal, Massart, and Michel 2015, Equation (2)) for a formal definition of the "distance to measure" function. Given X = fx1; : : : ; xng, the empirical version of the DTM is 0 11=r 1 X d^ (y) = kx − ykr ; m0 @k i A xi2Nk(y) where k = dm0 ∗ ne and Nk(y) is the set containing the k nearest neighbors of y among x1; : : : ; xn. For more details, see (Chazal et al. 2011) and (Chazal et al. 2015). The DTM is computed for each point of the Grid with the following code: > m0 <- 0.1 > DTM <- dtm(X = X, Grid = Grid, m0 = m0) d • The k Nearest Neighbor density estimator, for each y 2 R , is defined as ^ k δk(y) = d ; n vd rk(y) d where vn is the volume of the Euclidean d dimensional unit ball and rk(x) is the Eu- clidean distance form point x to its kth closest neighbor among the points of X. It is computed for each point of the Grid with the following code: > k <- 60 > kNN <- knnDE(X = X, Grid = Grid, k = k) 4 Introduction to the R package TDA d • The Gaussian Kernel Density Estimator (KDE), for each y 2 R , is defined as n 2 1 X −ky − xik p^ (y) = p exp 2 : h d 2h2 n( 2πh) i=1 where h is a smoothing parameter. It is computed for each point of the Grid with the following code: > h <- 0.3 > KDE <- kde(X = X, Grid = Grid, h = h) d • The Kernel distance estimator, for each y 2 R , is defined as v n n n u 1 X X 1 X κ^ (y) = u K (x ; x ) + K (y; y) − 2 K (y; x ); h tn2 h i j h n h i i=1 j=1 i=1 2 −kx−yk2 where Kh(x; y) = exp 2h2 is the Gaussian Kernel with smoothing parameter h. The Kernel distance is computed for each point of the Grid with the following code: > h <- 0.3 > Kdist <- kernelDist(X = X, Grid = Grid, h = h) For this 2 dimensional example, we can visualize the functions using persp form the graphics package. For example the following code produces the KDE plot in Figure1: > persp(Xseq, Yseq, + matrix(KDE, ncol = length(Yseq), nrow = length(Xseq)), xlab = "", + ylab = "", zlab = "", theta = -20, phi = 35, ltheta = 50, + col = 2, border = NA, main = "KDE", d = 0.5, scale = FALSE, + expand = 3, shade = 0.9) Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, Cl´ement Maria, David L. Millman, Vincent Rouvreau5 Distance Function DTM Sample X 1.0 0.5 0.0 −0.5 −1.0 −1.0 −0.5 0.0 0.5 1.0 kNN KDE Kernel Distance Figure 1: distance functions and density estimators evaluated over a grid of points. 2.1. Bootstrap Confidence Bands We can construct a (1 − α) confidence band for a function using the bootstrap algorithm, which we briefly describe using the kernel density estimator: 1.