Partitioned Persistent Homology

A thesis submitted to the

Division of Research and Advanced Studies of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE in the Department of Electrical Engineering & Computer Science of the College of Engineering and Applied Sciences

November 9, 2020

by

Nicholas O. Malott BSCompE, University of Cincinnati, 2016

Thesis Advisor and Committee Chair: Dr. Philip A. Wilsey

Abstract

In an era of increasing data size and complexity, novel algorithms to mine embedded relationships within big data are being studied throughout high performance computing. One such approach, Topological Data Analysis (TDA), studies homologies in a point cloud. Identified homologies represent topological features — loops, holes, and voids — independent of continuous deformation and resilient to noise. These topological features identify a generalized structure of input data. TDA has shown promise in several applications including studying various networks [1–3], digital images and movies [4–9], protein analysis and classification [10–12], and genomic sequences [1, 13–15]. TDA provides an alternative approach to classification for general data mining applications.

The principal tool of TDA that examines the topological structures in a point cloud is Persistent Homology. Persistent homology identifies the underlying homologies of the point cloud as the connectedness of the point cloud increases, or persists. The output of persistent homology can be used to classify embedded structures in high dimensional spaces [16, 17], enable automated classification from connected component analysis [18], identify abnormalities in data streams [15], classify structure of protein chains [14, 15, 19, 20], along with many other science and biological applications. Although the uses are vast, persistent homology suffers from exponential growth in both run-time and memory complexity, leading to limitations for evaluating large datasets [21].

Persistent homology cannot be directly applied to large point clouds; modern tools are limited to the analysis of a few thousand points in ℝ³. Optimization of data structures and enhancements to the algorithm have led to significant performance improvements, but the approaches still suffer in relation to the size of the input point cloud. One approach studied over recent years is to reduce the input point cloud, either in size or dimensions [16, 22–25]. These reductions gain obvious improvement, enabling computation on data sets 3–4 orders of magnitude larger than the exact approach. Salient topological features have been shown to be preserved with the reduced point cloud and can reasonably approximate the large topological features of the original point cloud. However, the loss of smaller features in approximate approaches is still a significant problem for some application areas [18, 26].

A partitioned technique to compute persistent homology that pairs the approximation of the salient topological features with the reconstruction of the smaller features embedded into the partitions is a natural extension to reduction of the point cloud. Several other studies [27, 28] implement this approach by using partitioning algorithms to approximate the large topological features while simultaneously reconstructing the smaller topological features from the individual partitions. This technique, developed and formalized in this thesis, is called Partitioned Persistent Homology (PPH)1.

This thesis outlines the theory and approach of Partitioned Persistent Homology coupled with limitations of the reconstructed results. The technique is detailed and integrated into the Lightweight Homology Framework (LHF) for demonstration. Current optimization algorithms are implemented into the framework and detailed to provide a current state of tools for persistent homology and usage in the LHF application. Analysis of the approach against theory, experimental results, and proofs provides an encompassing overview of Partitioned Persistent Homology. Detail of the considerations for parameter selection frames the use of Partitioned Persistent Homology for big data research.

1Support for this work was provided in part by the National Science Foundation under grants ACI–1440420 and IIS–1909096.


Acknowledgments

I would like to thank my advisor, Philip A. Wilsey, for support and guidance during the arduous process of exploring this work. The opportunities and challenges set forth by the entire research team I have worked with over the past few years have reshaped my approach to engineering, problem solving, and analysis of processes beyond the scope of this paper. To the many individuals that have participated in the high performance algorithms group I express my gratitude for the conversations and adversities we have overcome through our joint studies.

I would also like to thank my colleagues from Interstates. My experience in the industry has continued to drive my curiosity of the links between academic research and the advancement of technology as it evolves into business. I am humbled by the technologies that I have not grown to understand, and without the support of my colleagues my drive to contribute to innovative solutions would prove more exhausting.

Finally, I would like to express gratitude to my family, friends, and acquaintances who have unconditionally supported my endeavors and explorations over the years. The countless conversations, explanations, and questions have encouraged me to seek deeper understanding of everything I explore.

Contents

1 Introduction 1

1.1 Motivation and Principal Hypothesis ...... 3

1.2 Thesis Overview ...... 5

2 Background 6

2.1 Persistent Homology ...... 6

2.1.1 Simplicial Complex: Simplices ...... 8

2.1.2 Simplicial Complex: Filtration ...... 10

2.1.3 Analysis and Visualization of Results ...... 12

2.2 Space Partitioning ...... 14

3 Related Work 16

3.1 Related Work in Distributed Data Mining ...... 18

4 Overview of the Approach 20

4.1 Partitioning of Point Cloud ...... 21

4.1.1 Partitioning bounds and error ...... 23

4.1.2 Partitioning Comparison ...... 25

4.1.3 Partitioning with k-means++ ...... 27

4.2 Upscaling ...... 28

4.2.1 Limitations of Upscaling ...... 28

4.2.2 Upscaling of Large Features ...... 30


4.3 Regional Persistent Homology ...... 31

4.3.1 Regional Reconstruction Analysis ...... 32

4.3.2 Merging Persistence Intervals ...... 34

5 Implementation Details 36

5.1 LHF Architecture ...... 37

5.2 LHF Data Structures ...... 41

5.3 Limitations ...... 42

6 Experimental Analysis 44

6.1 Centroid-Approximated PH ...... 45

6.2 Centroid-Approximated Upscaling ...... 48

6.3 Small Feature Reconstruction ...... 51

6.4 Partitioned Persistent Homology ...... 54

6.5 Performance: Parallel PPH ...... 57

7 Discussion 60

7.1 Suggestions for Future Work ...... 62

List of Figures

2.1 Example of filtration of a point cloud with Vietoris–Rips complex and resulting persistence intervals. ...... 8

2.2 Example complex representing 1-simplices ...... 9

2.3 Example complex representing 1-simplices and 2-simplices ...... 9

2.4 Data flow diagram of computation of persistent homology on a point cloud ...... 11

2.5 Flamingo triangulated mesh point cloud and resultant persistence intervals displayed as a barcode diagram (center) and a persistence diagram (right). ...... 13

4.1 Persistence interval accuracy comparison for various partitioning algorithms at set reduction levels. ...... 27

4.2 Reduction of the flamingo triangulated mesh model with k-means++ ...... 27

4.3 Example of created feature due to partitioning of the point cloud...... 29

4.4 Example of lost feature due to partitioning of the point cloud...... 30

4.5 Representation of point cloud connections formed by the MST when εmax ≥ d(x1, x2) ≥ 2rmax. Red lines indicate duplicates identified in two partitions. The blue line represents the approximated distance in P′, and the green line represents the true distance for the H0 homology. ...... 33

5.1 Limitations of Ripser for computing persistent homology on 128GB RAM. Values on the x-axis did not finish. ...... 37

5.2 FastPersistence pipeline and reduced mode with a preprocessor...... 39

5.3 Upscaled persistent homology mode and the reduced persistent homology mode pipelines. . 39


5.4 Partitioned persistent homology pipeline and the iterative mode for recursive partitioning. . 40

6.1 Change in heat kernel distance at various k-means++ reduction levels for all experimental datasets. ...... 47

6.2 Reduction effect on H0 features...... 47

6.3 Reduction effect on Hd, d > 0 features ...... 47
6.4 Average improvement (%) over all data sets with upscaling persistent homology for Heat Kernel (HK) and filtered Heat Kernel (fHK) distances. ...... 50

6.5 Average improvement (%) in heat kernel distance over all data sets with regional persistent homology at various levels of scalars (s ∗ rmax). Results compared to H0 feature heat kernel distance (Table 6.5). ...... 54

6.6 Heat Kernel Distance improvement on selected datasets alongside average. Results compared to centroid approximated heat kernel distance (Table 6.1). ...... 56

6.7 PPH Speedup for selected datasets alongside average. Results compared to standard fastPersistence performance. ...... 56

6.8 Performance results based on varying input parameters of the PPH approach...... 58

6.9 PPH parallel speedup H1 features of synthetic dSphere; 10k points embedded in ℝ⁶. ...... 59
6.10 PPH parallel speedup H2 features of synthetic dSphere; 10k points embedded in ℝ⁶. ...... 59

List of Tables

6.1 Effect of k-means++ reduction on the heat kernel distance for the experimental datasets. . . 46

6.2 Effect of k-means++ reduction on the heat kernel distance on Hd > 0 for the experimental datasets...... 48

6.3 Heat kernel distance and improvement (%) of upscaling over centroid approximated persistent homology for Hd > 0 features. Results compared to Hd, d > 0 feature heat kernel distance (Figure 6.3). Dashed lines indicate when upscaling failed due to physical memory limitations. ...... 49

6.4 Heat kernel distance and improvement (%) of upscaling over filtered centroid approximated persistent homology. ...... 50

6.5 Effect of k-means++ reduction on the heat kernel distance on Hd = 0 for the experimental datasets...... 51

6.6 Heat kernel distance and improvement (%) of regional persistent homology over centroid approximated persistent homology of H0 features. Results compared to H0 feature heat kernel distance (Table 6.5). ...... 52

6.7 Heat kernel distance and improvement (%) of regional persistent homology over centroid approximated persistent homology of H0 features. Results compared to H0 feature heat kernel distance (Table 6.5). ...... 53

6.8 Heat kernel distance and improvement (%) of partitioned persistent homology on the heat kernel distance for the experimental datasets. Results compared to centroid approximated heat kernel distance (Table 6.1). ...... 56

Chapter 1

Introduction

Topological Data Analysis (TDA) is a method to study homologies — embedded structures of a space independent of continuous deformation. Many of the tools of TDA identify the homology of a point cloud through extensions of group theory, set theory, and graph theory. The algebraic analysis examines the connectedness of a set of points in an abstract manner; that is, relationships are examined independent of the location of the points in space. Abstraction of points leads to generalization of structures, or homologies, present within a point cloud. These features can be compared and classified based on similarities or differences between them.

Persistent homology, the principal tool of TDA, examines the homology of a point cloud at various levels of connectivity to determine what homologies persist in the space. Persistent homology was formally introduced by Edelsbrunner et al. [29] alongside the persistence diagram, a visualization of the identified homologies in a point cloud. The extension of algebraic topology into a computational approach provided a foundation for algorithmic improvements, the study of higher dimensional topologies, and the recent understandings of how topology can be utilized for data mining and machine learning applications.

The general approach to persistent homology is limited by the time and space exponential complexity of the underlying algorithms [21]. Current optimizations of the algorithm have only reduced the runtime from O(nK^3) to O(nK^2.376), where nK represents the total generated simplices of the complex. Identification of topological features requires storage of generated simplices of the space, which can quickly expand into the millions on small point clouds. This limits the analysis of the persistent homology to smaller point clouds and prevents applications in big data analysis.

Although complexity limitations prevent the application for big data, many studies have been conducted showing the significance of persistent homology on smaller data sets. These studies include bioinformatics [14, 19, 20], networking [30, 31], classification [16, 17, 32], pattern recognition [18, 33, 34], and additional fields closely related to machine learning. A majority of these studies focus on identification of salient or long persistence intervals; small intervals are generally regarded as noise but can provide differentiating structures relevant to classification [18]. Extension of persistent homology to larger data will continue to enable scientists and researchers to analyze more complex relationships embedded into point clouds.

Persistent homology, as opposed to other methods for determining relationships between variables, does not require data reduction and can be analyzed at the full scale of data. Multivariate relationships between variables can be found with persistent homology without the need for principal component analysis or other prior knowledge of the domain. Full scale data analysis can be an invaluable tool to recognize and detect structures embedded over multiple features. The analysis of data in high dimensions and for large point clouds can require significant resources when evaluating the exact persistent homology. Experiments with persistent homology on diverse data sets have provided several landmark results that encourage further study of the topic.

The output of persistent homology is a set of persistence intervals that provides a multidimensional summary of the Hd topological features within the point cloud. The dimensional structures are intuitive: an H0 topological feature represents connected components; an H1 topological structure represents an ℝ² loop, such as a hole within a circle or a tunnel through a 3-dimensional manifold. Likewise, an H2 topological feature is an ℝ³ void, such as the interior of a hollow sphere. Generalization of higher dimensional topological structures continues to be an active topic of research in the TDA community.

Persistence intervals represent features as they persist throughout time. To compute the intervals of the point cloud, a parameter, ε, is varied from 0 to the maximum distance between any two points in the point cloud. In some cases the maximum epsilon, εmax, is provided to limit the computation complexity but restrict the identified features. When a topological feature is born, it is denoted as the birth time (εbirth). Likewise, when a topological feature collapses and is no longer identified in the point cloud, it is denoted as (εdeath). The dimension of the topological structure is recorded with the ⟨εbirth, εdeath⟩ pair to form the barcode itself. All topological features identified where εbirth < εdeath < εmax are recorded in the persistence intervals alongside the dimension of the feature (e.g., ⟨dimension, εbirth, εdeath⟩). Persistence intervals provide the foundation for comparison and classification in a data science approach.
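As an illustration (not drawn from the thesis itself), a minimal sketch of persistence intervals stored as ⟨dimension, birth, death⟩ tuples and filtered by εmax and persistence might look like the following; the interval values are invented for the example:

```python
# Illustrative sketch only: persistence intervals as
# (dimension, birth, death) tuples; the values are invented.
intervals = [
    (0, 0.0, 0.35),   # H0: a component that merges into another at eps = 0.35
    (1, 0.40, 0.90),  # H1: a salient loop persisting over a long eps range
    (1, 0.62, 0.64),  # H1: a short-lived interval, often treated as noise
]

eps_max = 1.0

# Keep only intervals satisfying birth < death < eps_max.
valid = [t for t in intervals if t[1] < t[2] < eps_max]

# Rank by persistence (death - birth); longer intervals are generally
# regarded as the salient topological features.
salient = sorted(valid, key=lambda t: t[2] - t[1], reverse=True)
print(salient)
```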

In order to enable applications in big data, the limitations of persistent homology in both time and space complexity need to be further addressed for modern computing hardware. While approaches to optimize the algorithm itself have provided significant speedup, additional approaches that approximate the persistent homology of the original point cloud are necessary.

1.1 Motivation and Principal Hypothesis

While the general persistent homology algorithm continues to be optimized to compute the exact persistence intervals, approximate approaches have also been shown to provide suitable results in several applications [15, 16, 22, 23, 27, 28, 35]. Current approximation methods for persistent homology can be categorized into two general classes: approximation of the complex and approximation of the point cloud.

Approximation of the complex has been studied through various complex representations. One significant approach, developed by Sheehy, attempts to sparsify the Vietoris–Rips complex [35] to remove points from the complex that do not affect the topology. Dey et al. introduced a similar, scalable approach through batch collapse of simplices to reduce the size of the complex. An inherent loss of small topological features in the collapsed complex results in an approximation of the point cloud with sufficient results for certain applications.

Approximation of the input point cloud can be traced back to [36] where the effect of a functor on a topological space is examined. A functor is described as a mapping between topological spaces; in the case of approximation, the original point cloud and the reduced point cloud. Subsampling first exhibited motivating results through random selection methods, presented by Chazal et al. [16]. Random samples of the point cloud are analyzed to reconstruct the persistence intervals from multiple trials, preserving and providing reconstruction of an approximated set of persistence intervals of a much larger point cloud. A more statistical approach utilizing nano-clusters was described by Moitra et al. [22] to approximate the point cloud using geometric centroids of partitions of the space. This approximation provided bounded geometric preservation of the salient topological features in the point cloud at the cost of small features.


Partitioned Persistent Homology further expands on the notion of approximating the input point cloud by describing a reconstruction of the lost persistence intervals by parts. Salient topological features are identified in the centroid-approximated point cloud while remaining small topological features are identified from respective partitions. This approach has been studied in several other works [27, 28], but the complete reconstruction and analysis of partitioned persistent homology is fully detailed in this thesis.

Several factors affect the error induced from partitioning on the output persistence intervals. These include the loss of features due to the reduction, the shifting of birth times and death times due to the replacement of points with representative centroids, and the creation of false features in several edge cases.

Results using k-means++ as a partitioning algorithm and replacing the constituent points of a partition with their geometric centroid have been proven to have bounded error on the barcodes related to the radius of the partitions [22]. This result is encouraging, as the approximation of long-lasting features is of significant interest in the results.

While the centroid approximation will bound the large persistence intervals to a small amount of induced error, further refinement can be accomplished by recomputing the persistent homology on points in each individual partition. Many of these partitions can be evaluated in parallel, providing more accurate results in the persistence interval output without requiring significant hardware resources.

The study of partitioned persistent homology can be difficult with existing tools and libraries. Several libraries are outdated; others serve a very specific purpose and are not built to be extensible. A simple, Lightweight Homology Framework that does not rely on external libraries for computation will provide researchers additional opportunity to explore, implement, and experiment with persistent homology methods. This motivation led to the development of the Lightweight Homology Framework and the implementation of several features to provide research-level access to computing persistent homology in a modular and understandable way.

The Lightweight Homology Framework (LHF) library is introduced as a pipeline-driven persistent homology framework that can perform both standard persistent homology and partitioned persistent homology. The framework is designed to give a base understanding of the ability for upscaling to provide refined output on partitioned point clouds. LHF provides a tool for exploring the results of the partitioned persistent homology. The LHF code base is an open source project and the source code is available at github.com/wilseypa/lhf.

1.2 Thesis Overview

The remainder of this thesis is organized as follows:

Chapter 2 covers the background of Persistent Homology from a theoretical and algebraic perspective. The background contains a short introduction into persistent homology paired with analysis and visualization of persistence intervals. A general approach to partitioning point clouds is described.

Chapter 3 describes related work in the field of Topology and recent advances through other methodologies to improve the speed and memory requirements of persistent homology. These include studies that have led to new complex construction algorithms, more efficient data types, and optimizations that significantly improve the performance of persistent homology computation.

Chapter 4 gives the approach of Partitioned Persistent Homology. The individual steps, theories, and limitations are described to provide analysis of the technique.

Chapter 5 focuses on the implementation and design of the Lightweight Homology Framework (LHF) library and techniques used to enable partitioned persistent homology. The chapter serves as an in-depth walkthrough of the pipelines and data structures used by the LHF library.

Chapter 6 examines the accuracy of the partitioned persistent homology to measure characteristics of each step in the approach. Detailed experiments for each step in partitioned persistent homology are performed and analyzed to determine the strengths and weaknesses of the approach.

Chapter 7 summarizes the work from a qualitative perspective, giving detailed information on the findings and future for partitioned persistent homology. Limitations found through the integration and analysis of LHF are discussed, along with future opportunities for expansion of the partitioning methodology to other partitioning algorithms. The chapter concludes with suggestions for future research to provide additional opportunities for expansion of the methods described in this study.

Chapter 2

Background

Partitioned persistent homology combines partitioning with multiple computations of persistent homology. A brief background of both persistent homology and partitioning is necessary to preface the approach. While partitioned persistent homology is evaluated in this thesis using the Vietoris–Rips complex, the approach functions independent of the complex type or boundary matrix reduction techniques employed to compute persistent homology. In the same sense, several partitioning algorithms are examined in Chapter 4, but any general spatial partitioning is analyzed and can be used within the framework.

With this in mind, this background provides the process of computing the persistent homology in detail through Section 2.1. General partitioning is covered in Section 2.2. The combination of these two approaches is key in the approach of partitioned persistent homology and in the general applicability of the approach to continued improvements in computational persistent homology.

2.1 Persistent Homology

Persistent homology examines the point cloud to identify topological features at varying levels of connectivity. These varying levels of connectivity expose different topologies as points become more interconnected and more structures formed. The connectivity parameter, ε, is used to represent the distance used to form connections between neighboring points. This parameter is varied from 0 to either a defined maximum or the maximum distance between any two points in the point cloud, such that 0 < εi < εmax.

Representing the connectedness of the point cloud removes the dimensional constraint of the input point cloud. Instead of analyzing points in space and their absolute locations, we instead focus on the relative distance between every pair of points in the point cloud. This representation flattens the point cloud into an embedded distance matrix, later used to form representations of structures within the point cloud.

The connectedness of the point cloud can be represented in many ways, but is typically stored as a simplicial complex. The simplicial complex holds representations of simplices, generalized geometric objects such as points, edges, and triangles. Representations of simplices can take different forms; this paper focuses on triangular simplices, described in Section 2.1.1. Additionally, the simplicial complex stores a weight, indicating the birth of the simplex. This is described in more detail in Section 2.1.2.

Once the complex is constructed, filtration of the complex defines methods to reduce the complex size and complexity to only compute over a defined range of simplices. There are several different filtrations, complex storage types, and even representations of simplices that are integral to the design and development of purpose-built data mining pipelines. These concepts are briefly discussed in Sections 2.1.1 and 2.1.2.

Post-construction of the simplicial complex leads to the reduction of the boundary matrix to extract algebraic loops within the complex. These algebraic loops represent topological features. In a way, the boundary matrix can be thought of as a system of equations, where the solutions represent topological features at each dimension of homology. These solutions are captured as generators of algebraic loops, output as persistence intervals representing the lifespan of the algebraic loop.

An example of the filtration and corresponding persistence intervals is displayed in Figure 2.1. As the connectedness ε increases, the point cloud becomes increasingly more connected. At ε = 0.4 an H1 feature is formed and persists, representing the 2-dimensional loop in the center of the point cloud. The interval persists to around ε = 0.9 and is recorded in the resultant persistence intervals as a longer, salient topological feature.

Analysis of the persistence intervals is integral to the study of both persistent homology and applications. For the purposes of LHF, persistence intervals are used to identify the error of approximation methods between two sets of intervals. Detail on analysis used for persistent homology is described in Chapter 6.

Figure 2.1: Example of filtration of a point cloud (ε = 0.0, 0.4, 0.8, 1.2) with Vietoris–Rips complex and resulting persistence intervals.

2.1.1 Simplicial Complex: Simplices

The simplicial complex, first and foremost, is an embedding of the entire point cloud into a space-independent representation of all relevant connections. Represented points in the complex are no longer spatially oriented; rather, the connections that form within the point cloud are indicated by a weighted-edge graph. This step immediately reduces the dimensions of the input data by abstracting the topological space into finely-grained simplices.

There are alternative complex representations as well, such as the cubical or cover complexes, which characterize the space in slightly different manners. While these complexes have strengths in some applications, the approach in this thesis focuses on the simplicial complex for consistency.

One common abstraction of a point cloud is through triangulation. In this representation, the complex defines a point as a 0-dimensional simplex, a 1-simplex as an edge between two points, and a triangle as a 2-simplex, consisting of 3 edges and 3 points. Higher dimensional simplices exist, such as the tetrahedron, the 3-simplex, with 6 edges, 4 points, and 4 faces. This relationship is fundamental in understanding the storage of simplices in a complex and computation of the algebraic topology in an efficient manner.

The simplicial complex stores relationships of these simplices, typically through face lineage. A face represents a (d − 1)-simplex; the set of all faces of a d-simplex are represented in this manner.

The simplicial complex is evaluated to identify loops within the different dimensions indicating presence of features. Features are a generalized term for the connected components, loops, voids, and higher dimensional homological structures that can be identified from the point cloud. A connected component (H0) represents connections in the 1-dimensional space and identifies independent structures in the complex. A loop (H1) represents a set of edges that form a 2-dimensional hole in the point cloud. A void (H2) represents a set of faces1, or 2-simplices, that form a 3-dimensional void in the point cloud. These features may be different shapes or sizes, but homology views the topological features as identical through continuous deformations of the point space. This is important: the structure of the data is the focus of persistent homology.

Figure 2.2 and Figure 2.3 visualize the relationship between a simplicial complex and the points in a point cloud. Figure 2.2 represents the 1-simplices, or edges, of the points labeled a through f. Each edge is formed at a specified distance or weight, typically identified over a Euclidean distance metric. Figure 2.3 shows the same points, instead represented with 1-simplices and 2-simplices. The edges, or 1-simplices, are still present in the representation with the inclusion of several triangles formed by the connectedness of all 3 edges of a triangle. Translation from the 1-simplices to the 2-simplices is intuitive when visualizing the connectedness of the point cloud.

Figure 2.2: Example complex representing 1-simplices
Figure 2.3: Example complex representing 1-simplices and 2-simplices

1A face is a generalization of the simplicial complex relationship representing a (d − 1)-simplex that is part of a given feature.


2.1.2 Simplicial Complex: Filtration

Computation of algebraic persistent homology requires several steps to output the lifespans (persistence intervals) of topological features. First, the data is transformed from a point cloud to a distance matrix that represents the pairwise distances between each point. The original locations of the points are no longer necessary after this transformation, as persistent homology only deals with relative closeness of points, ignoring their orientation in space. The distance matrix is transformed into a simplicial complex, a data structure to store and query against for persistent homology computation. Several algorithms exist for defining simplicial complexes. This paper primarily deals with the Vietoris–Rips complex, the only polynomial time algorithm for construction of a complete simplicial complex.

The Vietoris–Rips complex V(P, ε) is defined as:

V(P, ε) = { σ ⊂ P | dist(x, y) ≤ ε for all x, y ∈ σ }

where ε represents the current distance parameter for all real numbers ε ≥ 0.

The Vietoris–Rips complex V(P, ε) represents the radius of n-balls used to connect neighboring points when constructing the graph at a given ε. If any point falls within ε of another, they will be connected by an edge. The Vietoris–Rips expansion then constructs any higher level simplices from the connected edges, and iterates until no higher simplices can be formed or the dimension of the simplices exceeds dmax.
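A minimal brute-force sketch of this construction may help fix the idea; it assumes a small NumPy point cloud and is illustrative only (LHF's actual implementation is in C++ and far more optimized):

```python
import itertools
import numpy as np

def vietoris_rips(points, eps, d_max=2):
    """Brute-force Vietoris-Rips sketch: connect points within eps, then
    add every higher simplex whose edges all exist (clique expansion)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [frozenset([i]) for i in range(n)]            # 0-simplices
    edges = {frozenset(e) for e in itertools.combinations(range(n), 2)
             if dist[e[0], e[1]] <= eps}
    simplices.extend(edges)                                    # 1-simplices
    # Expansion: a (k-1)-simplex on k vertices exists iff all its edges do.
    for k in range(3, d_max + 2):
        for cand in itertools.combinations(range(n), k):
            if all(frozenset(e) in edges for e in itertools.combinations(cand, 2)):
                simplices.append(frozenset(cand))
    return simplices

points = np.random.default_rng(0).random((20, 3))
print(len(vietoris_rips(points, eps=0.5)))
```

The combinatorial enumeration here makes the exponential growth of the complex with ε and dmax directly visible.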

Once the simplicial complex is constructed, algebraic loops are detected by creating a boundary matrix at each dimension. This boundary matrix represents the system of equations for each simplex in the dimension and their relationship to the other simplices through shared (d − 1)-simplices.

For example, two edges (1-simplices) that are connected will share a common point (0-simplex). Similarly, two triangles (2-simplices) that are connected will share a common edge (1-simplex). This construction allows the boundary matrix to be reduced as a system of equations to evaluate the number of loops present in the matrix.
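The standard column-reduction algorithm over Z/2 makes this concrete; the following sketch and its three-edge toy boundary matrix are illustrative, not LHF's implementation:

```python
import numpy as np

def reduce_boundary(B):
    """Standard column reduction of a binary boundary matrix (mod 2):
    repeatedly cancel a column against the earlier column that owns the
    same pivot (lowest non-zero row). Zeroed columns indicate cycles."""
    B = B.copy() % 2
    low = {}  # pivot row -> column index that owns it
    for j in range(B.shape[1]):
        while B[:, j].any():
            i = int(np.flatnonzero(B[:, j]).max())  # pivot of column j
            if i not in low:
                low[i] = j
                break
            B[:, j] = (B[:, j] + B[:, low[i]]) % 2
    return B, low

# Toy d1 boundary matrix: rows are vertices a, b, c; columns are edges
# ab, bc, ca. Entry 1 marks a vertex on the boundary of an edge.
d1 = np.array([[1, 0, 1],
               [1, 1, 0],
               [0, 1, 1]])
R, pivots = reduce_boundary(d1)
print(R)  # the third column reduces to zero: edges ab, bc, ca close a loop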

By calculating these loops at different values of ε up to εmax, features can be identified by their birth time (where the original simplex is formed) and death (where the simplex collapses or merges with another simplex). These birth and death times are used as the persistence intervals for the output of persistent homology, indicating the span of ε in which the feature exists.

Salient or long lasting persistence intervals are generally important in the classification of features in a point cloud. In homology, the counts of the shapes or features identified are called Betti numbers, with the Betti numbers classifying the general shape. Betti numbers are representations of the dimensional features; salient H0 features from the persistent homology represent B0 Bettis, and likewise for all dimensions. For example, a torus, one of the most notable structures in topology, has a Betti number B1 = 2, representing the 2-dimensional loops through the center of the ring and around the interior. Persistent homology can be utilized to identify the Betti numbers of a point cloud.
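For instance, sampling a torus and counting the salient H1 intervals should recover B1 = 2. The sketch below assumes the ripser.py Python package as the persistent homology backend and an arbitrary persistence threshold; both are assumptions of the example, not tools used in the thesis:

```python
import numpy as np
from ripser import ripser  # assumed PH backend for this example (ripser.py)

# Sample a torus in R^3 (major radius 2, minor radius 0.5).
rng = np.random.default_rng(1)
u, v = rng.uniform(0, 2 * np.pi, (2, 400))
X = np.column_stack([(2 + 0.5 * np.cos(v)) * np.cos(u),
                     (2 + 0.5 * np.cos(v)) * np.sin(u),
                     0.5 * np.sin(v)])

h1 = ripser(X, maxdim=1)['dgms'][1]
persistence = h1[:, 1] - h1[:, 0]
# Counting the salient H1 bars (threshold chosen by eye here) should
# recover B1 = 2 for the torus.
print("long H1 bars:", int((persistence > 0.5).sum()))
```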

Figure 2.4 shows the general pipeline for computing the persistent homology of the point cloud. The distance matrix is typically evaluated from a point cloud, though some libraries accept distance matrix input. The simplicial complex is created using the Vietoris–Rips complex and expansion. The complete simplicial complex is formed and the boundary matrix is created from the simplices present. The boundary matrix is reduced and Betti numbers are computed to provide the output lifespans of features within the point cloud.

Figure 2.4: Data flow diagram of computation of persistent homology on a point cloud: Input Point Cloud → 1. Compute Distance Matrix → 2. Populate Simplicial Complex Structure → 3. Boundary Matrix Creation and Reduction → 4. Compute Betti Numbers → Output Lifespans

The steps to compute persistent homology provide a straightforward pipeline for analyzing the topology of the point cloud.

Visualization of the resultant persistence intervals can be carried out in several forms including persistence diagrams [29], barcode diagrams [37], persistence landscapes [38], and persistence heat maps [39]. Visual analysis can be useful when determining the Betti numbers of the point cloud. Additionally, selection of an appropriate distance metric, εmax, simplicial complex algorithm, and simplicial complex storage type define the parameters that have been optimized to provide significant performance enhancements.

Selection of the appropriate distance metric and selecting εmax are closely related. The distance metric used is typically Euclidean, but certain data sets may benefit from a different distance metric being used to separate features in the point cloud that have small distances between them. The distance metric also governs εmax. Using the L2 (Euclidean) distance, the value of ε is intuitive; it represents the euclidean, or direct, distance between vectors. Other distance metrics, such as the L1 (Manhattan) distance metric, will change the value of εmax needed to identify larger features in the data set. Because altering the underlying distance metric used to calculate pairwise distances affects the impact of εmax, the Euclidean distance is used consistently throughout this study.
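A quick illustration of the metric's effect on scale (the point cloud is arbitrary; any data will do):

```python
import numpy as np
from scipy.spatial.distance import pdist

# The same point cloud produces different distance scales under L2 and L1,
# so a fixed eps_max captures different connectivity depending on the metric.
X = np.random.default_rng(2).random((100, 5))
print("L2 max pairwise distance:", pdist(X, metric="euclidean").max())
print("L1 max pairwise distance:", pdist(X, metric="cityblock").max())
```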

There are multiple algorithms for the creation of the simplicial complex. The Vietoris–Rips complex [40] uses n-balls to evaluate connectedness of the graph at an ε value. The Witness complex [24] uses two sets of vectors to compute the complex of points within an inner distance. The Skeleton Blocker complex [41] encodes only implicit representations of the simplices in a lighter data structure, using missing faces or blockers. Any of these complex construction algorithms can be interchanged to evaluate the persistent homology of a point cloud.

Storage of the simplicial complex plays a major role in determining the memory limitations of a designed library. The Simplex Array List is the naive approach, storing each identified simplex into an array. The Simplex Tree [42] provides a more efficient data structure for insertion and retrieval of simplices, but still has limitations as the size of the simplicial complex grows. Attempts to reduce or encode the simplicial complex can be provided from the algorithm or within the data structure itself.

Combination of the distance metric, simplicial complex algorithm, and simplicial complex storage type provides many different approaches to extracting the homological features from the point cloud. Each selection has benefits and drawbacks that require an understanding of the evaluated point cloud for appropriate selection.

2.1.3 Analysis and Visualization of Results

The output of persistent homology is a set of persistence intervals that are formed as a 3-tuple ⟨dimension, birth, death⟩. These represent the birth of a feature at birth and the death or collapse of that feature at death. Each persistence interval identifies a homological feature that exists over the interval, with longer intervals typically representing features that persist and are generally considered more significant.

Visualization of the output lifespans from persistent homology can take several forms, each providing different means for interpreting results. The barcode diagram plots the lifespans as bars on a horizontal axis and makes recognition of long-lasting features easy and intuitive. The persistence diagram plots the birth times and death times of features identified through persistent homology on an x-y plot, with features identified shown as points indicating their lifespan. Other methods, such as the landscape diagram and dissimilarity matrix, exist for more specific analysis of persistence intervals.

Figure 2.5: Flamingo triangulated mesh point cloud and resultant persistence intervals displayed as a barcode diagram (center) and a persistence diagram (right).

Barcode diagrams are generated by plotting each lifespan, by length, along the y-axis of the plot. The variation of ε is plotted along the x-axis. Representations of lifespans are plotted from the birth time to death time as horizontal lines. A plot of each dimension is typically trellised or included in a different color, with the dimension increasing towards the top of the plot. Figure 2.5 displays the output barcode diagram for the flamingo point cloud in the center. Long persistence intervals are easily distinguishable in this method, such as the green H2 interval representing a void in the space. Referencing the barcode diagram is a quick way to recognize dimensional structures existent in a point cloud over the connectedness of the simplicial complex.
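A minimal plotting sketch of this layout, assuming intervals as (dimension, birth, death) tuples and matplotlib; the colors and example intervals are arbitrary:

```python
import matplotlib.pyplot as plt

def plot_barcode(intervals):
    """Plot (dimension, birth, death) tuples as a barcode diagram:
    one horizontal bar per interval, colored by dimension."""
    colors = {0: "tab:blue", 1: "tab:orange", 2: "tab:green"}
    ax = plt.gca()
    for y, (dim, birth, death) in enumerate(sorted(intervals)):
        ax.hlines(y, birth, death, color=colors.get(dim, "gray"), lw=2)
    ax.set_xlabel("epsilon")
    ax.set_yticks([])
    return ax

plot_barcode([(0, 0.0, 0.3), (0, 0.0, 1.2), (1, 0.4, 0.9), (2, 0.7, 0.8)])
plt.show()
```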

Persistence diagrams provide an alternative visual on what features are identified from persistent homology. The persistence diagram gives an easy way to identify the changes between different lifespan outputs. Each quadrant is typically split into 2 dimensions for compactness. The x-axis represents the birth times of the dn lifespans and the death times of the dn+1 lifespans. The y-axis represents the death times of the dn lifespans and the birth times of the dn+1 lifespans. Because the death time of a barcode will never occur before the birth time, the dn lifespans are plotted as points in one half of the quadrant and the dn+1 lifespans plotted in the other. Figure 2.5 displays the persistence diagram for the flamingo example on the right.

While the barcode and persistence diagrams provide a qualitative view of the persistence intervals, there are several measures that can give a quantitative difference between two sets of persistence intervals. This analysis is useful when comparing an approximated set of persistence intervals to the original set. In these cases, the Bottleneck distance, Wasserstein distance, and Heat Kernel distance provide measures of the difference in persistence intervals.

The bottleneck distance measures the maximum difference between two sets of persistence intervals. This becomes useful when there are large topological features that exist in the original persistence intervals but do not in the approximated intervals. This would produce a large bottleneck distance, indicating that long or significant persistence intervals have been lost.

The Wasserstein distance compares the two sets of persistence intervals and computes the total change needed to make the persistence intervals equal, sometimes referred to as the earth-mover's distance. The Wasserstein distance gives a metric that tracks the total amount of change between sets of persistence intervals; however, it can be expensive to compute for large sets.

The heat-kernel distance [43], a recent metric for comparing persistence intervals, uses Gaussian kernel density estimates to provide a stable heat-kernel metric for classification applications. The heat-kernel distance is robust to noise and outliers within the data, making it suitable for set comparison. Analysis for this paper generally uses the heat-kernel distance to provide a general metric for comparing persistence intervals.
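A naive sketch of such a computation, in the spirit of the kernel construction in [43]: every diagram point contributes a Gaussian, mirrored negatively across the diagonal. The bandwidth and example diagrams are illustrative assumptions, not the thesis's experimental setup:

```python
import numpy as np

def heat_kernel(F, G, sigma=0.1):
    """Naive multi-scale heat kernel between two persistence diagrams,
    in the spirit of [43]: every diagram point contributes a Gaussian,
    mirrored negatively across the diagonal to respect the boundary."""
    F, G = np.asarray(F, float), np.asarray(G, float)
    Gbar = G[:, ::-1]  # mirror (birth, death) -> (death, birth)
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = np.exp(-sq(F, G) / (8 * sigma)) - np.exp(-sq(F, Gbar) / (8 * sigma))
    return k.sum() / (8 * np.pi * sigma)

def heat_kernel_distance(F, G, sigma=0.1):
    # Distance induced by the kernel in its feature space.
    return np.sqrt(heat_kernel(F, F, sigma) + heat_kernel(G, G, sigma)
                   - 2 * heat_kernel(F, G, sigma))

F = [(0.40, 0.90), (0.62, 0.64)]  # illustrative diagrams
G = [(0.42, 0.88)]
print(heat_kernel_distance(F, G))
```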

Pairing visual and metric analysis of the persistence intervals can give an understanding of the homologies present in different point clouds and the differences between the topological features in each. When comparing different point clouds, such as classifying different triangulated mesh representations of objects, metrics can provide a sense of dissimilarity between the individual persistence intervals where the visual analysis is not enough.

2.2 Space Partitioning

A brief introduction to space partitioning must also be addressed to provide a complete understanding of the approach. This study utilizes space partitioning to classify similar points into a single representative point as an attempt to reduce the number of input vectors while preserving the topological features present in the point cloud. This partitioning step can be evaluated as a mapping function; map all of the vectors that are locally similar to a single representative vector, for all vectors in the source data. More formally the partition can be expressed as:

P̂ = {p | p ⊂ P}, ∀p, q ∈ P̂ | p ≠ q, p ∩ q = ∅, and ∪p∈P̂ p = P.

The partitioning algorithm needs to select a representative point of the partitioned space; in the case of k-means++, the centroid is used to represent the source points. k-means++ provides several features that help preserve the topological features of the data at high levels of reduction. The algorithm selects k centroids and attempts to minimize the within-cluster sum of squared errors, or WCSSE. Each iteration attempts to refine the WCSSE until the location and constituent points of the partition are minimized. k-means++ runs in O(nkd) time, where n is the number of input vectors, k is the number of classifications to generate, and d is the dimensionality of the input vector.

Density-based space partitioning locates dense regions of ℝᵈ and attempts to connect similar dense regions by location. This approach focuses on estimating the underlying gradient or density function that defines the space and separating the points into dense partitions.

Grid-based space partitioning divides space in ℝᵈ into partitions by splitting the point cloud by n-dimensional hyperplanes. The hyperplane can be oriented in space to split points evenly, or into specific partitions based on the density or characteristics of the point cloud. The grid-based partitioning algorithm can then assign points in a similar classification to a representative point. This may be a centroid of the points contained in the classification, a randomly selected point from the set, or some other representative method. The selection for preserving topological features should attempt to preserve the geometric shape of the data.

Chapter 3

Related Work

An approach to computational persistent homology in ℝ³ was introduced by Edelsbrunner et al. along with the representative persistence diagram [29]. This work was expanded by Carlsson et al. with reformulation of the definition and creation of the barcode diagram [44]. Zomorodian later introduced an optimized and generalized algorithm for extracting persistence pairs in ℝⁿ [37]. These primary works formed the foundation for computation of persistent homology in higher dimensions.

Efforts to reduce the complexity of both runtime and memory can be categorized into several classes: (i) complex storage and construction, (ii) boundary matrix reduction, (iii) complex collapses, and (iv) approximation of the input point cloud. Significant developments in each of these areas are covered, though many other techniques and approaches have been studied.

Complex storage and construction involves identifying connections and building a multidimensional connected graph of filtrations. Zomorodian et al. [45] describes fast construction of the Vietoris–Rips complex, used throughout this study. Construction of the complex also requires storage of the simplices, in which case the Simplex Tree [42] is used to efficiently insert and retrieve co-faces of simplices. The compressed annotation matrix, a structure to store both complex and cohomology groups, is a recent structure introduced by Boissonnat et al. for reduced memory footprint of the complex [46]. Efficiently storing and using the complex to compute persistence intervals continues to be a significant area of study.

Boundary matrix reduction can take significant time with the original approach described by Zomorodian as the number of points and dimensions of homology increase. Methods to provide faster reduction of the boundary matrix have been studied in various aspects, with some of the most notable including the twist algorithm [47], clearing [48], and coreduction [49]. Ripser [50] provides one of the most efficient implementations of these algorithms for cohomology of point clouds alongside an implicit Vietoris–Rips representation of the boundary matrix for reduction.

Approximation of the complex has been widely studied to reduce the size of the complex during construction. This approach has obvious improvements for the size of the boundary matrix and can be useful for identifying salient topological features. Sparsification of the Vietoris–Rips complex has been one significant area of study [35, 51], leading to methods to approximate filtrations on much larger point clouds through batch collapse. Sparrips, a Julia library, implements a version of sparsification of the boundary matrix that enables computation on larger point clouds [52]. Approximation of the complex has led to scalable solutions for identifying salient topological features in a point cloud.

Approximation of the input point cloud is a more recent branch of study for computing the persistent homology. Reductions of the number of points n and of the number of dimensions d have both been studied, through subsampling methods and random projection, respectively. Subsampling [16, 22, 24], in particular, has shown remarkable success in identifying large topological features in a point cloud. Reducing the dimensionality of the input point cloud through random projection [23, 25] continues to be an area of study but can also preserve topological features from a reduced point cloud.

Experimental analysis of random selection from point clouds by Chazal et al. [16] provides significant motivation for the methods in this work. Preservation of topological features from random subsampling of the point cloud and reconstruction of the landscape diagrams indicates that a targeted approach may be able to accurately approximate the large persistence intervals with some bounded error.

Approximation of the point cloud through nano-clusters can be traced back to Moitra et al. [22] to provide topologically preserving reduction. Salient topological features identified in the full scale point cloud are shown to be identifiable up to significant reduction percentages using nano-clusters. Malott et al. [27] expanded on this work by studying reconstruction of the persistence intervals from the partitions and generating co-cycles of the approximated point cloud, leading to the approach of Partitioned Persistent Homology.


3.1 Related Work in Distributed Data Mining

The approach described in this thesis has existing parallels in distributed data mining (DDM) frameworks. In such cases a localized data mining algorithm can perform on distributed partitions of related data sources, forming local models preceding an aggregation model to produce a final output. An overview of distributed data mining algorithms and systems from an application perspective can be found from Park et al. [53].

There are many domains benefiting from such an approach including wireless communications, classifier systems, distributed production systems, and various privacy preserving systems [54]. Use of distributed data mining frameworks with persistent homology has been previously studied by Bauer [55] and implemented into the DIPHA library for exact persistent homology.

Two of the most notable distributed approaches significant to this work are distributed clustering and graph mining. Within distributed clustering, several algorithms such as k-means [56, 57], ensemble based k-harmonic means [58], and Expectation Maximization (EM) [59] have been shown to work in parallel and distributed environments. Several of these algorithms require exchange of statistics or classification information necessary for determining clusters. The approach described in this paper can be independently computed over a partitioned set of data, which may be suitable in a distributed environment with a distributed clustering frontend.

Distributed approaches to graph mining such as the distributed minimum spanning tree have been extensively studied [60]. These approaches influence the provided theorems and proofs around reconstruction of the H0 topological features discussed in Section 4.3. One frequent approach in parallel computing literature is Borůvka/Sollin's algorithm [61], where partitions of an entire weighted edge graph can be individually added to a forest of minimum-weight edge incidences, then combined with minimum-weight edges from each of the other partitions in O(E log V) performance, where E represents the number of edges and V represents the number of vertices in a graph, G. The merging and reconstruction of persistent homology from the partitions themselves closely resemble this algorithm. Additionally, the computation of the H0 persistence intervals in the fast persistence computation utilizes Kruskal's algorithm for computing a local minimum spanning tree [62].
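A sketch of this Kruskal-style H0 computation: each minimum-spanning-tree edge records the death of a connected component. The union-find structure and example data are illustrative, not the fastPersistence code:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def h0_intervals(points):
    """Kruskal-style sketch of the H0 persistence intervals: every
    minimum-spanning-tree edge merges two components, and its weight is
    the death time of the merged component (all H0 births are at 0)."""
    D = squareform(pdist(points))
    n = len(points)
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    intervals = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            intervals.append((0, 0.0, w))  # a component dies at epsilon = w
    return intervals  # n - 1 finite intervals; one component never dies

pts = np.random.default_rng(4).random((50, 2))
print(len(h0_intervals(pts)))  # 49
```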

MapReduce [63–65] design patterns and similar distributed models can provide fault tolerance and distributed computation of algorithms in a flexible framework. Many of the aforementioned algorithms have been implemented into MapReduce frameworks to provide simplicity in the distribution, local processing, and aggregation of partitioned models. While the algorithms introduced in this thesis do not directly utilize the MapReduce framework, a similar approach is taken when computing the persistent homology over individual partitions.

Many of these related works influence the described approach in both theory and implementation. Continued developments in the field of distributed data mining provide improvement to the partitioned approach to persistent homology and should be monitored for various improvements to both the accuracy and performance of the technique.

Chapter 4

Overview of the Approach

Partitioned Persistent Homology attempts to reconstruct persistence intervals of a large point cloud from smaller, more computationally feasible partitions. The approach requires several steps: partitioning of the input point cloud, computation of PH on the approximated point cloud with upscaling [27], computation of the distributed PH in each of the partitions, and merging of the results into a unified set of persistence intervals. In general the reconstructed persistence intervals can accurately approximate the original point cloud’s persistence intervals with minor perturbations from misidentification and dimensional shifts. The result is a fast, highly parallel solution that can approximate the persistent homology of point clouds beyond the limitations of current state-of-the-art approaches.

In the remainder of this paper, the following terms will be used to describe the components of partitioned persistent homology:

• P, the original point cloud,

• P̂, the partitions,

• P′, the centroids,

• ri, the distance from the partition centroid, P′i ∈ P′, to the most distant point in that partition, and

• rmax = max(ri), the maximum ri of all the partitions.

Partitioning of the input point cloud requires selection of an appropriate partitioning algorithm and representative point for approximation. The algorithm acts as a functor that maps the topology of the original point cloud, P, to the representative point cloud, P̂. The number of nano-clusters inherently affects the error induced from the mapping on the persistence intervals and must also be selected to characterize the salient topological features of the space. Analysis of partitioning approaches for partitioned persistent homology is described in Section 4.1.

Once the approximated persistent homology is computed over P′, the generators of the identified cycles can be utilized to upscale large topological features. This process maps representative points to their respective partitions to compute the persistent homology of the original point cloud. Independent features within the point cloud can be upscaled in parallel to refine the persistence intervals identified in P′. Upscaling is described more formally in Section 4.2.

Smaller topological features of the space lost in P′ can be reconstructed from the individual partitions, P̂, to some degree. Fuzzy partitions, including points outside of each individual partition, can provide refined reconstruction of smaller topological features, leading to more accurate persistence intervals when compared to the original persistence intervals of P. The persistent homology of each partition can be performed in parallel, but requires a merge step to prevent duplicate persistence intervals being reported when identified in multiple fuzzy partitions. A description of the approach for reconstructing the small persistence intervals from the regional partitions is available in Section 4.3.
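A sketch of this per-partition parallelism using Python's multiprocessing, with ripser.py assumed as the PH backend for the example (LHF performs this in C++); the random partition labels stand in for a k-means++ assignment, and the merge step is only indicated:

```python
from multiprocessing import Pool

import numpy as np
from ripser import ripser  # assumed PH backend for this sketch (ripser.py)

def partition_ph(points):
    """Persistent homology of one partition, independent of the others."""
    return ripser(points, maxdim=1)['dgms']

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    P = rng.random((2000, 3))
    labels = rng.integers(0, 8, len(P))   # stand-in for k-means++ labels
    partitions = [P[labels == i] for i in range(8)]

    with Pool() as pool:                  # one PH task per partition
        regional = pool.map(partition_ph, partitions)

    # A merge step (Section 4.3.2) would deduplicate intervals discovered
    # in overlapping fuzzy partitions before reporting the final set.
    print([len(dgms[1]) for dgms in regional])
```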

4.1 Partitioning of Point Cloud

Partitioning of P enables both the identification of large topological features in the approximated point cloud P′ and small topological features within the partitions themselves. This leads to two goals for a partitioning algorithm: construct a point cloud P′ that preserves large topological features present in P, and construct partitions of P that can reconstruct the smaller persistence intervals embedded in P̂.

This section will focus on the effects of partitioning on approximation of the input point cloud. Error induced from the approximation on the persistent homology is characterized to provide insight into algorithm selection. Effects of the approximation, such as lost or mischaracterized features, are identified to provide a complete analysis of the partitioning algorithm and resultant persistence intervals.

A general partitioning algorithm can be described as follows: let $\hat{P} = \{p \mid p \subset P\}$ be a partitioning of $P$; then $\forall p, q \in \hat{P}$ with $p \neq q$, $p \cap q = \emptyset$ and $\bigcup_{p \in \hat{P}} p = P$.
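As a minimal illustration of this definition (a sketch over index sets, not part of LHF), a candidate partitioning can be validated directly:

```python
def is_valid_partitioning(partitions, points):
    """Verify the partition definition: the parts are pairwise disjoint
    and their union recovers the full point set."""
    parts = [set(p) for p in partitions]
    union = set().union(*parts)
    pairwise_disjoint = sum(len(p) for p in parts) == len(union)
    return pairwise_disjoint and union == set(points)

# Example: a valid partitioning of six point indices into three parts.
print(is_valid_partitioning([[0, 1], [2, 3], [4, 5]], range(6)))  # True
```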


The maximum radius, $r_{max}$, is then defined as the maximum $r_i$ over all of the partitions, where $r_i$ represents the radius of each partition $\hat{P}_i$. Notably, the maximum radius of the partitions bounds the induced error in the resultant persistence intervals:

$$W_\infty(Dg_D, Dg_C) \leq 2\,r_{max} \qquad (4.1)$$

as proved by [22].

In this case, the persistence intervals identified by the persistent homology of $P'$ cannot be shifted in birth or death time by more than $2r_{max}$. Smaller persistence intervals, which do not exist in the approximated space, are absent from the persistence intervals and cannot be compared. In cases where the persistent homology of $P'$ is reduced beyond the existence of any relevant topological features, all of the error can be attributed to topologies embedded in the partitions, $\hat{P}$.

Selection of a partitioning algorithm that minimizes the maximum radius of the partitions will produce the least error in the identified persistence intervals. The number of partitions generated also affects the maximum radius; more partitions produce tighter clusters, leading to a smaller maximum radius.

However, performance is best when the input point cloud is reduced as far as possible while still preserving the topological features; scaling the number of partitions therefore trades accuracy against runtime when approximating the persistent homology.

Centroid clustering algorithms are a natural approach to minimizing $r_{max}$. Several existing studies into partitioned persistent homology have focused on k-means++ [22, 27, 28]. k-means++ minimizes the within-cluster sum of squared error (WCSSE), a measure of the compactness of partitions that directly relates to $r_{max}$. k-means++ also takes a parameter $k$ for the number of partitions to generate, useful for examining a point cloud at various reductions. Its linear runtime complexity ($O(nkd)$) makes it well suited for reducing the point cloud prior to persistent homology.
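As an illustration, the following sketch (assuming scikit-learn for k-means++, ripser for persistent homology, and persim for the bottleneck distance; none of these are LHF components) reduces a synthetic point cloud, computes $r_{max}$, and checks the bound of Equation 4.1 empirically:

```python
import numpy as np
from sklearn.cluster import KMeans
from ripser import ripser
from persim import bottleneck

rng = np.random.default_rng(0)
# Noisy circle: one dominant H1 feature.
theta = rng.uniform(0, 2 * np.pi, 500)
P = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.05, (500, 2))

# k-means++ reduction: P' is the set of k centroids.
km = KMeans(n_clusters=50, init="k-means++", n_init=10, random_state=0).fit(P)
P_prime, labels = km.cluster_centers_, km.labels_

# r_i: distance from each centroid to the most distant point of its partition.
r = [np.linalg.norm(P[labels == i] - c, axis=1).max() for i, c in enumerate(P_prime)]
r_max = max(r)

# Bottleneck (W_inf) distance between the exact and approximated H1 diagrams.
dgm_full = ripser(P, maxdim=1)["dgms"][1]
dgm_approx = ripser(P_prime, maxdim=1)["dgms"][1]
print(f"W_inf = {bottleneck(dgm_full, dgm_approx):.3f} <= 2*r_max = {2 * r_max:.3f}")
```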

Other partitioning schemes may provide different characteristics more targeted to a specific application.

Density-based or hierarchy-based partitioning can provide a different preservation of topological features that can be exploited through iterative upscaling of a reduced point cloud. The designed library incorporates methods to add additional partitioning functions in the future to continue investigation of the effect of partitioning on persistent homology and suitability for upscaling.


4.1.1 Partitioning bounds and error

Moitra et al. [22] have provided an error bound for the identified persistence intervals in the approximated point cloud, $P'$. This error bound characterizes persistence intervals of features that are identified in both $P$ and $P'$, but does not account for lost or mischaracterized features of the approximation. This section describes the possible losses from partitioning not accounted for by the error bound.

Potential loss of topological features can be categorized into three primary observations: false topological features, dimensional shift of topological features, and lost topological features. Preserving the persistence intervals is the goal of the approximation, but a method of refining the topological features, called upscaling, can be utilized to more accurately identify small intervals in the approximated space. Upscaling is described fully in Section 4.2.

False Topological Features

The approximate point cloud $P'$ consists of centroids representing regions or partitions of the original point cloud. When using centroids, false voids can be identified that lie between centroids but do not exist in the original point cloud $P$. More formally, false features can occur when:

Theorem 1 (False voids from centroid gaps). False voids can appear when $\epsilon_{min} < r_{max} < \epsilon_{max}$.

Proof. Consider a partitioning of a uniformly distributed space into four partitions, where each partition is of radius $r$ such that $\epsilon_{min} < r < \epsilon_{max}$. Also consider that, in the uniformly distributed space, the minimum pairwise distance between any two points in $P$ is less than $\epsilon_{min}$. It follows that the topological feature identified in the approximated point cloud, $P'$, does not exist in the original point cloud, $P$, indicating a false topological feature identified in the approximation. □

As identified by Malott et al. [28].

In essence, a false topological feature can be formed when the centroids create a new boundary of a topological feature that was originally filled with constituent points in $P$. Fortunately, false topological features can be rectified in the upscaling step to identify the original topological features that exist in the point cloud. Any upscaling performed will negate false topological features identified in the approximated point cloud.


Dimensional Shift of Topological Features

The approximation of the point cloud can also have the resultant effect of shifting topological features between dimensions. A higher-dimensional shift can occur when the approximation stretches a topological feature into a higher-dimensional void. More formally, a $d$-dimensional convex hull in $P$ can be approximated such that the cover in one (or more) dimensions is lost in $P'$ and only occurs in $j > d$ dimensions.

Theorem 2 (Feature shift to higher dimensions). The PH computation in $P'$ may shift an identified topological feature into a higher dimension.

Proof. Consider a point cloud where points are uniformly distributed around the edge of a circle in $\mathbb{R}^2$, embedded into $\mathbb{R}^3$. Then a partitioning of the space with rectangular partitions perpendicular to the face of the circle can be constructed, extending infinitely in the $z$ direction. Selection of representative points at various $z$ values within each of these partitions will add additional dimensionality to the point cloud, possibly creating an $H_2$ void in the space that does not exist in $P$. □

As identified by Malott et al. [28].

Shifting of topological features into higher dimensions may occur when the approximation adds additional dimensionality to the simplicial complex. Alternatively, a feature shift may also occur into lower dimensions. This occurs when the approximation $P'$ loses dimensionality by flattening the original point cloud $P$.

Theorem 3 (Feature shift to lower dimensions). The PH computation in $P'$ may shift an identified topological feature into a lower dimension.

Proof. Consider a point cloud representing the boundary of a cylinder that is approximated such that rectangular partitions perpendicular to the top and bottom faces of the cylinder are utilized, reducing the point cloud to the boundary of a circle. In this case the $H_2$ void within the cylinder is shifted to an $H_1$ loop representing the boundary of the circle. □

As identified by Malott et al. [28].

A lower-dimensional shift occurs when the approximated point cloud $P'$ reduces the dimensionality of the point cloud.


Lost Topological Features

Lost topological features occur naturally from the mapping of $P$ to $P'$. The most obvious of these lost features are the $H_0$ connections in $P$, which represent the minimum spanning tree of the point cloud. In the approximated point cloud $P'$, the minimum spanning tree is constructed from the representative centroids of the space. The total number of $H_0$ connections in $P$ is $|P|$, while the total number of $H_0$ connections in $P'$ is the number of centroids, $k$. This difference can be characterized as a loss of topological features due to the mapping of $P$ to $P'$.

While $H_0$ provides an intuitive characterization of lost topological features due to the approximation, higher-dimensional topological features can also be lost. In these cases the defining points of the convex hull of a topological feature of $P$ do not provide sufficient approximation in $P'$ to identify the corresponding convex hull. This indicates that the centroids representing partitions of the original space do not preserve the topological features of the space. In many cases the approach aims to preserve the large topological features, while the small topological features are considered noise.

The accuracy impact of lost topological features can be directly attributed to the number of representative centroids, $k$. With a larger number of centroids, the approximation nears the original point cloud $P$, and more of the topological features in the original point cloud are preserved in the persistence intervals of $P'$. A smaller number of centroids provides faster approximation of the persistent homology, but can suffer from lost topological features embedded in the partitions themselves. Fortunately, many of the lost topological features from the mapping can be recovered through upscaling of the identified feature boundaries and the persistence intervals embedded within the partitions. To combat these occurrences, a suitable partitioning scheme that can be controlled to approximate the persistence intervals at different reduction scales is necessary.

4.1.2 Partitioning Comparison

Although this study focuses on k-means++ as the partitioning algorithm to approximate the point cloud, other partitioning approaches may provide similar or better results. This depends on several factors, including the amount of noise present in the point cloud, the density of boundaries around significant topological features, and the effectiveness of the partitioning algorithm in preserving the topological shapes and distribution of points around those shapes.

Density-based clustering, such as DBSCAN [66], can provide clustering of dense regions connected throughout a point cloud. Density clustering does not require an input parameter for the number of clusters, and can be useful for classification of connected objects within a space. Density-based clustering has many similarities to the construction of the Vietoris–Rips complex and the connectedness of points found with persistent homology. For example, DBSCAN utilizes $\epsilon$-balls around each point to build a neighborhood graph and identify connectedness to establish classifications of the point cloud.

Density-based clustering has applications for identifying connected dense regions without defining the number of clusters to be emitted. For the approach described in this thesis, however, the number of output clusters needs to be large to approximate the geometric shape of the point cloud. Density-based clustering does not have a direct application for outputting a large number of partitions and is not considered further in this approach.

Hierarchical clustering, sometimes referred to as a connectivity-based algorithm, provides a similar approach to density-based clustering by examining the distances between points in the point cloud. Hierarchical clustering builds a structure that represents the hierarchy of pairwise connectedness. This structure can be used to extract clusters that have high degrees of connectivity without fixing a single distance threshold between the points. Hierarchical clustering can emit any number of classification groups, which can provide the reduced data necessary for persistent homology and upscaling in the approach, as the sketch below illustrates.
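A sketch of this extraction (using scikit-learn's AgglomerativeClustering rather than an LHF component) cuts a Ward-linkage hierarchy at $k$ clusters and represents each cluster by a mean point, mirroring the centroid reduction:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hierarchical_partitions(P, k):
    """Cut the Ward-linkage hierarchy at k clusters and represent each
    cluster by its mean point, analogous to the centroid reduction."""
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(P)
    centroids = np.array([P[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids
```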

While a detailed analysis of how well different partitioning methods preserve topological structure is beyond the scope of this section, a brief illustration of the effectiveness of different partitioning algorithms on several datasets is shown in Figure 4.1.

While most algorithms perform better than random selection of points from the point cloud, k-means++ and agglomerative ward preserve the topological features best under the highest levels of reduction. This is significant for accurately preserving the topological features of the input point cloud while reducing the size of the constructed complex. A more detailed comparative analysis is available in [67, 68].

There are other partitioning algorithms that can be used to classify and reduce the number of points for the input to persistent homology. Some of these include geometric or spatial partitioning, approximate partitioning, and distribution-based partitioning. Study of these algorithms requires intimate knowledge of the distribution and target data features to provide the intended functionality beyond the applications examined in this study.

Figure 4.1: Persistence interval accuracy comparison for various partitioning algorithms at set reduction levels (heat kernel distance versus number of points for the Seeds and WaterTreatment datasets; algorithms compared are k-means++, agglomerative ward, agglomerative single, maxmin, GMM, and three random selections).

4.1.3 Partitioning with k-means++

Partitioning with k-means++ to produce centroids preserves large topological features under significant reduction. Previous experiments [22] have shown the induced error in a cluster can be bounded by $r_{max}$, providing a measurable limit on how much features can change from data reduction with k-means++. The k-means algorithm preserves geometric shape under high levels of reduction and has some resistance to the noise or density of a point cloud. Figure 4.2 shows the Flamingo model from the UCI Machine Learning Repository as a triangulated mesh model at varying levels of reduction.

Figure 4.2: Reduction of the flamingo triangulated mesh model with k-means++ (original point cloud of 2000 points; reduced point clouds of 300 and 100 points).

The geometric shape of the Flamingo model is still discernible when reduced to 100 points. Recognizing this feature and upscaling the boundary points would allow the error induced from the partition to be reduced and converge to the barcode of the original point cloud. Partitioning with k-means++ provides subsampling of the point cloud in linear time while preserving the geometric shape of the original data.

4.2 Upscaling

The representative co-cycles of persistence intervals are often discarded when examining persistent homology results to interpret the underlying homology of the point cloud. For each persistence interval, the representative co-cycles are the generating simplices that form the boundary of the feature. While the persistence interval measures how long the feature exists, the generators give a set of simplices that form the minimum occurrence of that feature. These are often unnecessary for analysis, but can relate the identified features back to the source point cloud.

Partitioned persistent homology utilizes the generators of persistence intervals to upscale the identified topological features. Upscaling involves mapping the generators of the intervals of $P'$ back to the original points in $P$, then recomputing the persistence intervals on the segmented portion of the original point cloud. In many cases this can provide exact boundary refinements of large topological features identified in $P'$.

Upscaling can be performed in parallel for independent features identified in $P'$. In some cases this may result in slightly perturbed death times for a refined persistence interval. Upscaling may also fail to refine the features of $P'$ in cases where the topological feature is lost, as identified in Section 4.1.1. Additionally, an upscaled feature may generate a complex too large for available memory, in which case a strategy of iterative refinement may be employed to more accurately approximate a persistence interval.
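A minimal sketch of this step, assuming the approximate computation has already produced, for each large feature, the set of centroid indices on its generating boundary (the hypothetical `boundaries` below), and using ripser in place of LHF's FastPersistence:

```python
import numpy as np
from ripser import ripser  # pip install ripser

def merge_dependent_boundaries(boundaries):
    """Union boundaries that share a centroid so dependent features are
    upscaled together; the remaining sets are independent and can be
    processed in parallel."""
    merged = []
    for b in map(set, boundaries):
        for m in [m for m in merged if m & b]:
            merged.remove(m)
            b |= m
        merged.append(b)
    return merged

def upscale_feature(P, labels, boundary_centroids, maxdim=1):
    """Recompute PH on the original points of the partitions whose
    centroids form one feature boundary in P'."""
    mask = np.isin(labels, list(boundary_centroids))
    return ripser(P[mask], maxdim=maxdim)["dgms"]
```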

4.2.1 Limitations of Upscaling

Persistent homology of the boundary points of each feature will inherently produce several different sets of persistence intervals. These intervals need to be merged to recover the persistence of the original point cloud, but this requires evaluation of several cases to determine whether a barcode from a partition is valid.

The first case occurs when new features are identified from the upscaled point cloud. These new barcodes represent features that are formed prior to the radius of the current cluster being examined. If the radius of the cluster is greater than the maximum distance of the simplex, the simplex will be lost during the partitioning step and recreated during upscaling. These features can be inserted directly into the final barcode list for output once evaluated at the original scale.

An example of a created feature from partitioning can be seen in Figure 4.3. The points in the center of the diagram present a fully connected graph with no topological loops. The second figure shows a partitioning of the data that results in a false topological feature being formed in $\mathbb{R}^2$. This feature would not be identified by an exact algorithm, but will appear in the partitioned computation.

Figure 4.3: Example of a created feature due to partitioning of the point cloud (left: original point cloud; right: partitioned point cloud).

The second case occurs when a feature extends over multiple partitions. If the feature birth time is greater than $2r_{max}$, the feature will be identified from the original partitioning and upscaled appropriately. However, if the feature birth time is less than $2r_{max}$ and spans multiple partitions, the feature will need to be identified by examining a target partition and several other partitions within a range of that original partition. Fortunately, this feature can be identified through the regional reconstruction step in Section 4.3.

The third case occurs as a larger feature is upscaled. Ideally, the original points of the entire feature identified from the partitioned point cloud fit within memory. If they do not, and persistent homology of the original points forming the boundary of the feature cannot be executed, the point cloud may be reduced again at some level to fit the data within memory. This may happen multiple times to obtain the most accurate lifespan of the feature without exhausting resources. Once the feature has been upscaled and the largest feasible point cloud used to compute the persistent homology, the results are used in the persistent homology output. Any features identified with a larger partitioning are discarded so that the same feature is not reported multiple times. This process is referred to as iterative upscaling.

Figure 4.4: Example of a lost feature due to partitioning of the point cloud (left: original point cloud; right: partitioned point cloud, annotated with $r_{max}$ and $2r_{max}$).

Following these three cases, the barcode output of persistent homology can be merged across the subsets of the original data set. The result is a complete set of barcodes, with feature intervals refined within the resource constraints, that represents the large topological features of the original point cloud, $P$.

4.2.2 Upscaling of Large Features

Each independent larger feature identified from the boundary matrix can be upscaled separately. In some cases these new partitions of independent features will allow persistent homology to be computed at full scale. If the persistent homology cannot be computed given the resource availability of the system and the size of the complex, iterative upscaling will provide a refinement of the larger features to reduce the induced error and identify additional features that extend over multiple partitions in the original partitioning.

Any upscaling that can occur will reduce the value of $r_{max}$, in turn reducing the error in the persistence. The upscaling will also combat the limitation described in Figure 4.4, where a feature has been lost due to the birth time of features being less than $r_{max}$. In the example, the feature extending over multiple partitions is lost because of the placement of representative points. With a larger number of partitions, the external structure of the loop will be represented by additional points, preserving the boundary of the feature. As the number of partitions nears the total number of points in the original point cloud, the boundary of the lost feature becomes more defined and eventually will be recognized as a boundary.

The iteration of upscaling refines the features in multiple steps. Each time a refinement occurs, additional features identified by boundaries may be upscaled on a smaller set of constituent points, providing more accurate persistence intervals. In an ideal scenario, the homological features that exist over multiple partitions become apparent as the scale of the points increases. This approach leads to refinement of large features from the initial reduction and identification of additional smaller features embedded within the partitions.
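The iteration can be sketched as a simple recursion (assuming scikit-learn and ripser; the hypothetical `threshold` stands in for the largest complex that fits in memory, and a full implementation would upscale each identified boundary rather than the whole reduced cloud):

```python
import numpy as np
from sklearn.cluster import KMeans
from ripser import ripser

def iterative_upscale(points, threshold=1000, maxdim=1):
    """If the feature's constituent points fit within the resource
    threshold, compute PH exactly; otherwise reduce with k-means++
    and recurse on the refined representation."""
    if len(points) <= threshold:
        return ripser(points, maxdim=maxdim)["dgms"]
    km = KMeans(n_clusters=threshold, init="k-means++", n_init=5).fit(points)
    return iterative_upscale(km.cluster_centers_, threshold, maxdim)
```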

4.3 Regional Persistent Homology

Regional persistent homology attempts to reconstruct the small topological features lost in $P'$. Although these small topological features cannot be directly computed in $P'$, the embeddings are contained within the partitions themselves and can be utilized to reconstruct the regional persistent homology in parts. This section identifies the bounds of reconstruction and introduces a scalar parameter, $s$, to represent the overlap between partitions for identifying small topological features.

The persistence intervals identified from $P'$ are assumed to be gathered separately; that is, large features in $H_d, d > 0$ with birth time $s_B > 2r_{max}$ are recorded separately and merged with the same approach described in Section 4.3.2. The regional persistence prioritizes identification of all $H_0$ intervals and the higher-dimensional features where $s_B \leq 2r_{max}$.

In order to present an appropriate bound for the reconstruction of the small persistence intervals from the regional persistent homology, a parallel is drawn to the minimum spanning tree, as the $H_0$ persistence intervals of $P$ represent the minimum spanning tree of the point cloud. This parallel provides a basis for evaluating how the reconstruction of the persistence intervals is performed. Analysis of the reconstruction of persistence intervals from the regional persistent homology is detailed in Section 4.3.1.
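This parallel can be checked directly: for a Vietoris–Rips filtration, the finite $H_0$ death times coincide with the minimum spanning tree edge weights. A sketch using scipy and ripser (neither is an LHF component):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from ripser import ripser

rng = np.random.default_rng(1)
P = rng.random((200, 2))

# MST edge weights of the point cloud...
mst_weights = np.sort(minimum_spanning_tree(squareform(pdist(P))).data)
# ...equal the finite H0 death times (drop the single infinite interval).
h0 = ripser(P, maxdim=0)["dgms"][0]
h0_deaths = np.sort(h0[np.isfinite(h0[:, 1]), 1])
print(np.allclose(mst_weights, h0_deaths))  # True, up to float tolerance
```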

Once the regional persistence intervals are obtained, merging them requires several rules for total ordering of the source dataset, ownership of a persistence interval by a region, and trimming of duplicate intervals in the results. Merging of the PPH persistence intervals is covered in Section 4.3.2.


4.3.1 Regional Reconstruction Analysis

Regional reconstruction of the persistence intervals from the partitions requires some amount of overlap between the partitions to identify features spanning multiple partitions. Features wholly contained within $\hat{P}_i$ can be reconstructed by persistent homology, but features spanning multiple partitions are lost due to the independence of the partitions. In order to reconstruct spanning features, some overlap of the evaluated partitions is required.

Fuzzy partitions, which contain overlapping points between partitions, hold a similar definition to regular partitions (Section 2.2). In the case of generating fuzzy partitions from centroids, they can be easily generated for each partition as:

$$\hat{P}^f_i = \{p \in P \mid d(p, P'_i) \leq r_i + \delta\}, \quad \delta \geq 0,$$

where $\delta$ represents the overlap added to the partition.
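A sketch of this construction in NumPy (with `centroids` and `labels` as produced by any centroid-based clustering; not LHF's implementation):

```python
import numpy as np

def fuzzy_partitions(P, centroids, labels, delta):
    """Fuzzy partition i contains every point of P within r_i + delta
    of centroid i, not only the points assigned to partition i."""
    k = len(centroids)
    # r_i: radius of the crisp partition around centroid i.
    r = np.array([np.linalg.norm(P[labels == i] - centroids[i], axis=1).max()
                  for i in range(k)])
    # Distances from every point to every centroid.
    d = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2)
    # Return the point indices belonging to each fuzzy partition.
    return [np.flatnonzero(d[:, i] <= r[i] + delta) for i in range(k)]
```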

Section 4.1 details the induced error from partitioning on large features as bounded by $2r_{max}$. Features smaller than $2r_{max}$ are inherently lost in the approximated persistent homology, but they determine a target for reconstruction of the smaller topological features. For this reason, $2r_{max}$ was studied for generating fuzzy partitions. However, with such a large value for $\delta$, the fuzzy partitions overlap significantly and require large complexes to compute persistent homology over.

Determining the optimal overlap needed to completely reconstruct the small features relies on the analogous minimum spanning tree and $H_0$ features. Interpretation of the regional reconstruction of the persistence intervals can then be extended to higher dimensions for reconstruction of $H_d, d > 0$ features. Figure 4.5 depicts cases that can occur when attempting to reconstruct the minimum spanning tree from the partitioned points.

Two cases of reconstruction from partitions are obvious: the case where the overlap of partitions includes the nearest point outside of the partition, and the case where the partition contains no overlap. In the former case, depicted by the relationship of $C_0$ and $C_1$ in the diagram, the red connections indicate persistence intervals identified by both $C_0$ and $C_1$, including the MST connection. If all fuzzy partitions fall into this case, the $H_0$ persistence intervals can be completely reconstructed. However, the latter case of partitions containing no overlap may still provide an approximated MST from the persistence intervals of $P'$.

Figure 4.5: Representation of point cloud connections formed by the MST when $\epsilon_{max} \geq d(x_1, x_2) \geq 2r_{max}$, for partitions $C_0$, $C_1$, and $C_2$. Red lines indicate duplicates identified in two partitions. The blue line represents the approximated distance in $P'$, and the green line represents the true distance for the $H_0$ homology.

Reconstruction of the persistence intervals from the partitions for $H_0$ only requires that the nearest point outside of the partition is included in the fuzzy partition. While this guarantees complete reconstruction of the minimum spanning tree, higher-dimensional topological features still need to be reconstructed when they fall between partitions and are lost in the approximated point cloud $P'$. This can be summarized by:

Theorem 4. All $H_0$ topological features can be identified from their partitions and the outer-partition point closest to each respective centroid in the original point cloud.

Proof. Consider any set of partitions of a point cloud in which two or more partitions are relatively close, with a single partition further away. In this case each close partition will generate the minimum spanning tree to all other partitions using the outer-partition point closest to the respective centroid. This may create multiple identical connections in the case where two partitions are closest to one another. Consequently, the partition that is further away will emit the minimum distance to the closest outer-partition point, representing one of the partitions in the relatively close grouping. This results in a single recording of the interval representing the minimum spanning tree of the point cloud, generating the $H_0$ topological features. □

As introduced by Malott et al. [28].


For the connections between partitions $C_1$ and $C_2$, the use of $2r_{max}$ for generating fuzzy partitions does not yield shared points. The original MST connection in the point cloud is drawn in green to show the true value. In the centroid-approximated point cloud, indicated by the purple diamond centroids, this connection is estimated and may be used in the output persistent homology. For a more accurate interval, that feature may even be upscaled from the approximated space to recover the exact distance shown in green.

This leads to a tradeoff between the regional reconstruction of features and the features that need to be upscaled from the centroid-approximated intervals. A larger amount of overlap between the partitions will generate more accurate $H_0$ intervals, requiring less upscaling of identified centroids in the approximated intervals. For example, if the fuzzy partition overlap were set to $2r_{max}$ from the centroid, upscaling would only need to occur for the $H_0$ intervals where birth $> 2r_{max}$ to completely reconstruct the $H_0$ intervals.

One of the benefits of using the Vietoris–Rips filtration to evaluate persistence intervals, in this case, is the Rips expansion that occurs after the neighborhood graph is constructed. The reconstructed $H_0$ connections within the fuzzy partition are combined to evaluate the higher-dimensional features in the space. Naturally, any feature wholly contained within the fuzzy partition is then recognized by persistent homology. Consequently, if birth $> 2r_{max}$, the boundary will not be wholly contained in a fuzzy partition and must be identified from the approximated point cloud.

Scaling $\delta$, the fuzzy partition overlap, becomes a critical factor in tuning the performance and accuracy of the approach. In cases where there are significant dense regions throughout the point cloud, the approach may generate overly large fuzzy partitions, leading to a larger memory footprint than intended. Refinement of the regional topological features may nevertheless provide significant differentiating data in comparison.

4.3.2 Merging Persistence Intervals

Merging of persistence intervals from the partitioned persistent homology is relatively straightforward. Persistence intervals identified from the approximated point cloud $P'$ are included if their birth time occurs after $\delta$. Intervals that are born before $\delta$ are gathered from the partitioned results.

Within the partitioned results there may be duplicate entries for persistence intervals, depicted in Figure 4.5 as red lines between points in the partition. Merging these persistence intervals requires assigning ownership of each persistence interval. To do this, a total ordering is placed on the original points in $P$. A persistence interval is owned, and therefore reported, by a partition $\hat{P}_i$ if the minimum total-ordered index of the persistence interval is part of the original partition $\hat{P}_i$. Ownership of the persistence intervals by the minimum indexed point prevents duplicate reporting of persistence intervals by the partitions, leading to a merge step that can be executed individually by each partition in parallel.

Chapter 5

Implementation Details

Computing persistent homology is generally limited by two factors: the size of the point cloud, $n$, and the dimension of homology to compute up to, $H_{max}$. There are several existing libraries that compute the persistent homology of an input dataset, including GUDHI [69], Eirene [70], and Ripser [50]. Of these, Ripser is the fastest and most efficient approach to compute the Vietoris–Rips persistence intervals.

However, Ripser still requires significant memory resources to compute persistent homology; it is currently limited to only $n = 2000$ points in $H_2$ to identify covered voids of the space.

Figure 5.1 shows that as both $n$ and $H_{max}$ increase, computing the persistent homology with Ripser becomes limited. This can be attributed to the size of the simplicial complex; for a Vietoris–Rips complex the number of simplices grows as $O(N!/(H_{max}!\,(N - H_{max})!))$, i.e., $O\binom{N}{H_{max}}$. Simplices must be stored and reduced to compute the persistence intervals of the point cloud.
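The growth is easy to see numerically; the number of $d$-simplices on $N$ vertices in a complete complex is $\binom{N}{d+1}$:

```python
from math import comb

N = 2000
for d in range(4):
    # Number of d-simplices on N vertices: C(N, d+1).
    print(f"dimension {d}: {comb(N, d + 1):,} potential simplices")
# At N = 2000, dimension 2 alone admits roughly 1.3 billion triangles.
```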

This is the principal motivation for Partitioned Persistent Homology. However, a study of the full approach with existing libraries is limited by the need to generate the co-cycles for upscaling, as detailed in Section 4.2. To provide an end-to-end framework for computing the Partitioned Persistent Homology, and to enable further study of other approaches to persistent homology, the Lightweight Homology Framework was developed.

The Lightweight Homology Framework (LHF) was designed to provide an end-to-end pipeline of data reduction, persistent homology computation, generating co-cycle tracking, and upscaling with iterative refinement. LHF is designed and implemented completely in native C++. This allows the project and libraries to be ported between Windows and Linux machines, giving the program the flexibility to perform in different environments and applications. While partitioned persistent homology has yet to be utilized in a machine learning application, future development with large data sets can utilize the LHF libraries to bridge the gap into advanced analytics with computational topology. The LHF library is released under an open source license and the source code is available at gitlab.com/wilseypa/lhf.

Figure 5.1: Limitations of Ripser for computing persistent homology on 128GB RAM (execution time in seconds versus number of points, for $H_1$ through $H_5$). Values on the x-axis did not finish.

The remainder of this chapter details the implemented complexes, pipeline functions, and partitioning algorithms in LHF. Section 5.1 describes the high-level modes that the LHF library can execute for computing persistent homology. Section 5.2 provides analysis of the data structures utilized by the LHF architecture in computing persistent homology. Finally, limitations of the library are discussed in Section 5.3.

5.1 LHF Architecture

LHF uses pipeline segments to construct several different operating modes for computing persistent homology. Each of these pipelines is detailed below to display the wide capabilities of the approach:

Partitioning Functions:

• kMeans++: Seeded version of the k-means clustering algorithm that minimizes the within-cluster sum of squares (WCSS); takes a parameter $k$ for the number of centroids to generate.

• StreamingkMeans: Streaming version of the k-means clustering algorithm for continuous updating of the source points. Useful for streaming data analysis.

• DBScan: Density-based spatial clustering of applications with noise (DBSCAN) that groups dense regions of many points into partitions.

• DenStream: Density-based clustering for evolving data streams.

Pipe Functions:

• DistanceMatrix: Computes the upper triangular distance matrix of an input point cloud.

• NeighborhoodGraph: Computes the neighborhood graph for the insertion of $\sigma_0$ and $\sigma_1$ simplices into the complex.

• RipsExpansion: Generates higher-dimensional simplices from the $\sigma_1$ simplices recursively up to $H_{max} + 1$.

• FastPersistence: Creates and reduces the boundary matrix from the simplicial complex using methods described in [50], including the generating co-cycles.

• Upscale: Recomputes refined persistence intervals using partitions $\hat{P}$, labels, and generating co-cycles.

• SlidingWindow: Computes the persistent homology of a sliding window of data for streaming applications.

• SmartWindow: Computes the persistent homology of a complex- or function-driven sliding window for streaming applications.

Combinations of these pipe functions can form many different approaches to computing the persistent homology with LHF. The basic pipeline consists of computing the fast persistence of the input point cloud. This pipeline is depicted in Figure 5.2. Any of the preprocessors can also be used to provide the approximated persistent homology of $P'$, as shown below the FastPersistence pipeline.
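Conceptually, each mode is a composition of pipe segments. A Python sketch of the idea (not LHF's C++ API; the commented stage functions are hypothetical stand-ins):

```python
from functools import reduce

def run_pipeline(data, *pipes):
    """Thread a working state (point cloud, complex, intervals) through
    a sequence of pipe functions, as in LHF's pipeline modes."""
    return reduce(lambda state, pipe: pipe(state), pipes, data)

# Hypothetical stage functions mirroring the basic FastPersistence mode:
# intervals = run_pipeline(points, distance_matrix, neighborhood_graph,
#                          rips_expansion, fast_persistence)
```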

Figure 5.2: The FastPersistence pipeline (Input Data, DistanceMatrix, NeighborhoodGraph, RipsExpansion, FastPersistence, persistence intervals) and the reduced mode, which adds a Partition preprocessor to the same pipeline to produce $P'$.

When using a preprocessor, the upscaled persistent homology and the reduced persistent homology can be individually controlled. Each of these cases is shown in Figure 5.3. The upscaled persistent homology only attempts to refine the persistence intervals identified from $P'$, while the reduced persistent homology reconstructs the smaller persistence intervals from the partitions $\hat{P}$. These each provide methods to study the individual and combined effects of the partitioned persistent homology.

The final two modes focus on the partitioned persistent homology approach, shown in Figure 5.4. The partitioned persistent homology mode performs both upscaling of identified features in $P'$ and reconstruction from the partitions $\hat{P}$. This is the primary focus of this thesis and is studied further in Chapter 6. The iterative mode, which recursively reduces partitions that are too large to store the complex in memory, is also implemented in LHF. The iterative mode can repartition point clouds too large to compute persistent homology over and remains an active area of study with LHF.

Figure 5.3: The upscaled persistent homology mode and the reduced persistent homology mode pipelines (a Partition preprocessor feeds FastPersistence on $P'$; partitions and labels feed either an Upscale stage or $k$ parallel FastPersistence instances followed by a Merge).

Several other methods to compute the persistent homology, such as the sliding window mode, are included in the LHF architecture to support other studies on streaming persistent homology and its effects on the simplicial complex, the primary storage structure in LHF. Experimental modes that continue to reduce the memory footprint of the complex are one of the driving factors for the development of the LHF library.

Figure 5.4: The partitioned persistent homology pipeline and the iterative mode for recursive partitioning (partitions with $n \leq$ threshold proceed to parallel FastPersistence instances; partitions with $n >$ threshold recurse into PPH before the Upscale and Merge stages).


5.2 LHF Data Structures

LHF utilizes several compact and efficient data structures for storage of the simplicial complex to compute the persistence intervals. The first of these is the simplexNode, which records each simplex of dimension $d$ with a unique index, a set of the indices forming the simplex, and the weight of the simplex. For a Vietoris–Rips complex, the weight of a simplex is defined by its largest constituent face.

This leads to a compact representation of a simplex; however, the number of simplices within a simplicial complex grows exponentially, which also requires a fast data structure for insertion and access of nodes within the complex. LHF implements two separate storage structures: the simplexArrayList and the simplexTree. The simplexArrayList is an array of sorted lists that store the ordered simplices of each dimension.

The simplexTree provides faster insertion and construction of the complex at the cost of additional metadata for the tree itself. In many cases the simplexArrayList performs better than the simplexTree due to the additional memory structures required by the simplexTree. However, in cases where a complex must be maintained continuously, such as streaming applications, the simplexTree can provide benefits over the simplexArrayList.

Once the simplicial complex is used to compute the persistence intervals, a second data structure is used to track the persistence intervals and generating co-cycles. This data structure consists of the dimension of the persistence interval, the birth and death time of the interval, and a set of the co-cycles. The full set of persistence intervals is stored in an array and passed between modes that require multiple iterations or merging of the persistence intervals. This leads to a lightweight representation of the results of persistent homology that can represent any portion of the entire point cloud.
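In outline (a Python sketch of the two C++ records, with field names paraphrased from the descriptions above rather than taken from the LHF source):

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class SimplexNode:
    """A d-simplex: its weight (for a Vietoris-Rips complex, determined
    by its largest face), a unique index, and the vertex set."""
    weight: float
    index: int
    vertices: frozenset = field(compare=False, default=frozenset())

@dataclass
class PersistenceInterval:
    """A persistence interval with its generating co-cycles."""
    dimension: int
    birth: float
    death: float
    cocycles: list = field(default_factory=list)
```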

The overall memory footprint is dominated by the simplicial complex. In several of its modes, LHF may run in parallel, in which case multiple simplicial complexes are constructed and managed within memory. Efficient data structures and memory deallocation are a continued area of evaluation for the architecture. A small decrease in the size of the simplicial complex can lead to significant improvement in the memory limitation.


5.3 Limitations

There are several limitations of the LHF library that affect its performance and the minimum achievable error.

Upscaling requires recreating the simplicial complex multiple times, along with identification of the constituent boundary points of features and additional partitioning steps. This additional time spent to use memory efficiently may only pay off for point clouds with several features that can be independently computed. When running on point clouds that cannot be iteratively upscaled to a significant number of points, the error reduction may not be worth the iteration steps taken to refine the barcodes. In these cases the partitioning and approximation of the barcodes may provide appropriate results.

Changes to the topology and features of the space due to the partitioning algorithm may also contribute to inaccuracies in the upscaled point cloud. This error may come from features that were lost or shifted dimensions due to partitioning and cannot be recovered with the current upscaling approach. Changes in the topology of the point cloud due to the mapping of classified points onto a representative centroid may occur in certain edge cases, adding error to the barcodes.

Partitioning can cause features to be lost if the size of the feature is less than $r_{max}$. This may apply to a feature as a whole, or to parts of a feature that form before $r_{max}$. Some of these features may be identified through upscaling, but a partition that bisects a feature could cause it to be lost when upscaling independent features and merging the persistence intervals.

Partitioning can also cause features to shift dimensions. This happens when a feature that should be in one dimension is flattened into another dimension, or expanded up a dimension, because of the arrangement of points in the point cloud. When the data is partitioned, the effects of a dimension may cancel out, leaving centroids connected in another dimension with neighboring points when evaluating the persistent homology. This indicates that the topology of the space has changed, and the mapping from partitioning was not a proper topology-preserving mapping. An increase in the number of vectors used for the initial partitioning may alleviate this problem in practice. LHF is suited to evaluate this effect in a controllable manner.

All of these limitations should be considered when extracting relevant homological features using upscaling for larger data. While not every feature may be identified through upscaling, a majority of large features can be identified and refined for a more accurate analysis of the persistent homology. The approach can be combined with the complexes and optimizations used for persistent homology to further extend the application to data beyond the current limitations of exact persistent homology.

Chapter 6

Experimental Analysis

This chapter introduces the performance and accuracy metrics used to evaluate the approach of Partitioned Persistent Homology in the Lightweight Homology Framework. First, the accuracy of the centroid-approximated persistent homology is examined in terms of feature size in Section 6.1. This analysis establishes the bounds for approximation of the large topological features from $P'$. Upscaling is introduced in Section 6.2 and compared against the approximation to identify the amount of error rectified by the approach, along with its limitations when the boundary contains a large portion of the point cloud. The large features need to be resolved in the final output and can be identified in this approach with proper tuning.

For small features, the partitions, $\hat{P}$, are merged to reconstruct the partitioned persistent homology. This step depends heavily on the selection of an appropriate $k$, the number of partitions to generate, and $s$, the overlap scalar used to generate fuzzy partitions. A complete analysis of these parameters against experimental and real-world data is presented in Section 6.3 to demonstrate the effectiveness of the partitioned persistent homology and merging step in reconstructing the small features.

Finally, the complete performance and accuracy analysis of the partitioned persistent homology is presented in Section 6.4. The scalability of the parallel implementation is analyzed in Section 6.5. This gives a direct notion of the speed and scale of persistent homology that can be computed with partitioned persistent homology.

Several datasets used throughout the persistent homology literature are utilized for the comparison and are detailed below.


• Klein Bottle (K), a traditional topological shape that has been utilized in multiple studies with persistent homology and topological data analysis. The shape consists of 900 points in $\mathbb{R}^3$, creating a point cloud representing the triangulation of a Klein bottle.

• twoMoons (TM), a generated 2000-vector point cloud of two overlapping moons in $\mathbb{R}^2$.

• seeds (S), a UCI dataset measuring geometrical properties of three different wheat kernels. The dataset is primarily used for classification and cluster analysis.

• iris (I), a UCI dataset measuring characteristics of several classes of flowers and corresponding metrics of the petals. Iris is a well known dataset used in pattern recognition and classification.

• waterTreatment (WT), a UCI dataset describing daily measurements of various sensors in an urban waste water treatment plant. The dataset provides a multivariate time-series dataset that can be utilized to predict faults through state variables of the treatment process.

• dSpheres, embedded d-dimensional spheres in a point cloud, generated to evaluate the approach against. dSpheres are useful for evaluating higher-dimensional topological features and are primarily studied to determine the performance implications of partitioned persistent homology.

Additional information on the number of points and dimensionality of each dataset is included in Table 6.1.

The synthetic dataset dSpheres is utilized in Section 6.5, and its describing parameters are noted throughout.

All experiments were carried out on an AMD(R) Ryzen Threadripper 1950X CPU @ 2.60GHz with 128GB of RAM. Each experiment was executed through scripts to consistently test and analyze results. All experimental results were stored to permit re-analysis of the results using different filters and comparisons.

Several supporting Python scripts were used to run and analyze the results from the LHF library, available in the GitHub repository at github.com/wilseypa/LHF.

6.1 Centroid-Approximated PH

Partitioning of the input point cloud and replacement with representative centroids to approximate the persistent homology has been studied in various aspects [22, 27]. While these studies have provided insight into the general effect of partitioning, the datasets explored in this thesis are re-evaluated to provide a baseline for improvement from the partitioned reconstruction of the persistence intervals.

The datasets are compared at various reductions to examine the effect on the resultant intervals. Only datasets whose intended homological features could be computed at full scale were used, to reduce possible bias from using a centroid-based reduction of the original dataset. In some previous studies further reduction has shown improvement, but the bias of using k-means as a preprocessor can leave unclear results.

For all experiments in this section the input point cloud was reduced using the LHF framework. k-means++ is utilized for the reductions to preserve the geometric shape of the data [22, 67]. In this thesis, a statement of X% reduction means that X% of the points are removed from the data set. The heat kernel distance [43] was used to measure the difference between the exact persistence intervals and the approximated persistence intervals. This quantitative measure provides the error induced by the centroid-approximated persistent homology. Comparisons of the heat kernel distance are indicated as percent improvement, calculated as $\%_{imp} = (HKD_{old} - HKD_{new})/HKD_{old} \times 100$. Results for each of the datasets are displayed in Table 6.1 and are plotted for visual interpretation in Figure 6.1.
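As a worked instance of this formula (using the rounded Klein bottle values at 75% reduction from Tables 6.2 and 6.3 later in this chapter):

```python
def pct_improvement(hkd_old, hkd_new):
    """Percent improvement of the new HKD over the old HKD."""
    return (hkd_old - hkd_new) / hkd_old * 100

# Klein bottle, 75% reduction, Hd > 0: HKD falls from 2.56 to 0.17.
print(f"{pct_improvement(2.56, 0.17):.2f}%")
# ~93.4% from the rounded values, close to the reported 93.24%
# (which is computed from the unrounded distances).
```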

Dataset   N      d    Hmax   25%     50%     75%     80%     85%     90%      95%
K         900    3    2      1.98    7.52    15.27   17.34   19.71   22.30    25.76
S         210    7    3      1.72    4.65    8.80    9.80    10.71   12.15    13.56
I         150    4    4      0.48    1.69    3.39    3.84    4.28    4.84     5.56
WT        527    41   3      24.38   54.47   86.94   93.48   99.83   106.63   112.71
TM        2000   2    1      0.27    0.99    2.55    3.03    3.59    4.46     5.63

Table 6.1: Effect of k-means++ reduction on the heat kernel distance for the experimental datasets.

Reduction with k-means++ shows the expected results; the increase in reduction adds minimal error to the persistence intervals. These results serve as a baseline against which the improvement of upscaling and regional intervals is compared. This kernel distance includes all connected components, loops, voids, etc. of the space and can be further separated to show how each step of the partitioned persistent homology affects topological features. To do this, the heat kernel distance is separated by dimension; that is, connected component ($H_0$) features are evaluated separately from higher-dimensional features.

Figure 6.1: Change in heat kernel distance at various k-means++ reduction levels for all experimental datasets.

Separation of the $H_0$ features negates a large amount of noise that is known to be lost when utilizing the centroid-approximated persistent homology. Typically the $H_0$ features account for a large portion of the persistence intervals of the point cloud and can add significant error to the overall heat kernel distance.

Analysis using the $H_0$ features and the set of all higher-dimensional features ($H_d, d > 0$) is captured in Figure 6.2 and Figure 6.3. Note the scale of each of the graphs; the $H_0$ features account for a large portion of the overall heat kernel distance, while the higher-dimensional features are a small percentage of the result.

Figure 6.2: Reduction effect on $H_0$ features. Figure 6.3: Reduction effect on $H_d, d > 0$ features. (Both plot the heat kernel distance against the k-means++ reduction percentage for all experimental datasets.)

The effect of centroid approximation of the point cloud on the resultant persistence intervals serves as a baseline for improvement from the two approaches in partitioned persistent homology. Analysis of upscaling (Section 6.2) utilizes the $H_d, d > 0$ features to characterize the improvement in the large persistence intervals as the identified boundaries are refined. Regional reconstruction (Section 6.3) primarily focuses on the $H_0$ features to reconstruct the minimum spanning tree from the partitions. This analysis only forms a small basis for the regional reconstruction; additional $H_d, d > 0$ features identified in the partitions will contribute to the refinement of the centroid-approximated intervals, but are difficult to separate in a meaningful way.

6.2 Centroid-Approximated Upscaling

Upscaling of large topological features from the centroid-approximated persistent homology refines the individual persistence intervals to provide a more accurate representation. This step involves extracting the constituent boundary points from the identified topological features and recomputing the persistence intervals on the original points in $P$. The boundary points, each identifying a topological feature, are combined using a set union to determine the independence of features in the original point cloud. Representative centroid points that are part of two boundaries need to be upscaled together, while independent boundaries can be computed in parallel as described in Section 4.2.

In order to provide an accurate comparison of upscaling on the resultant persistence intervals, the $H_d, d > 0$ persistence intervals are examined separately. The upscaling step does not attempt to refine persistence intervals for $H_0$ features, which would otherwise contribute a significant portion of noise to the heat kernel distance. While these $H_0$ persistence intervals could be upscaled to form the minimum spanning tree, these features are better suited to be determined from the partitions. Table 6.2 provides the $H_d, d > 0$ heat kernel distance for the same data sets and reduction levels presented in Section 6.1 and graphically illustrated in Figure 6.3.

Dataset   N      d    Hmax   Meas.   25%    50%    75%    80%    85%    90%    95%
K         900    3    2      HKD     0.10   0.96   2.56   2.82   3.26   3.60   4.05
S         210    7    3      HKD     0.11   0.34   0.81   0.92   0.89   1.03   1.02
I         150    4    4      HKD     0.08   0.18   0.16   0.18   0.22   0.24   –
WT        527    41   3      HKD     1.52   3.56   4.11   4.29   4.34   4.56   4.57
TM        2000   2    1      HKD     0.01   0.06   0.24   0.28   0.32   0.43   0.51

Table 6.2: Effect of k-means++ reduction on the heat kernel distance for $H_d, d > 0$ features of the experimental datasets.

Comparison of the centroid-approximated persistence intervals and the upscaled intervals, using the heat kernel distance against the exact persistence intervals, gives a measure of the improvement from the approach.


Table 6.3 details the resultant heat kernel distance alongside the percent improvement over the results in Section 6.1.

Dataset   N      d    Hmax   Meas.   25%       50%       75%     80%     85%     90%     95%
K         900    3    2      HKD     0         0         0.17    0.37    0.06    0.54    0.41
                             %imp    100       100       93.24   86.84   98.29   85.02   89.83
S         210    7    3      HKD     0.31      0.36      0.34    0.65    0.41    0.91    –
                             %imp    -194.87   -3.36     57.88   29.35   54.27   11.68   –
I         151    4    4      HKD     0.13      0.18      0.14    0.16    –       –       –
                             %imp    -53.04    -1.30     13.13   6.79    –       –       –
WT        527    41   3      HKD     0.62      1.24      1.41    1.98    0.91    2.40    4.57
                             %imp    59.45     65.11     65.75   53.90   79.02   47.39   0.11
TM        2000   2    1      HKD     –         0.02      0.01    0.04    0.01    0.07    0.01
                             %imp    –         65.87     95.67   85.56   97.43   83.30   98.55

Table 6.3: Heat kernel distance and improvement (%) of upscaling over centroid-approximated persistent homology for $H_d, d > 0$ features. Results are compared to the $H_d, d > 0$ feature heat kernel distance (Figure 6.3). Dashes indicate when upscaling failed due to physical memory limitations.

The results show overall improvement in both the heat kernel and filtered heat kernel distances. Notably, in the filtered heat kernel distance, upscaling of the Klein bottle from 25% and 50% reduction results in the complete reconstruction of the full-scale persistence intervals. In all cases, reductions in the range of 80-95% provide suitable improvement to the large topological features originally identified from the centroid-approximated persistent homology. This follows the expected effect of upscaling described in Section 4.2.

The persistence intervals receiving the benefit of upscaling are generally the longer intervals as the size of the reduction increases. This indicates the need to filter the persistence intervals to focus only on longer intervals and ignore the noise generated in the heat kernel distance by smaller intervals.

Several methods have been studied to filter out the shorter intervals, in a similar approach as described by Malott et al. [67]. By applying a filter and comparing the longer persistence intervals, the effect of upscaling can be seen more clearly. In the experimental results presented, the filter removes persistence intervals shorter than x̄ + σ (mean + standard deviation) calculated from the full scale set of persistence intervals.
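A minimal sketch of this x̄ + σ filter, assuming diagrams are stored as (birth, death) arrays:

```python
# Keep only persistence intervals whose lifetime exceeds the mean plus one
# standard deviation of lifetimes in the full-scale (exact) set of intervals.
import numpy as np

def filter_intervals(intervals: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """intervals, reference: (n, 2) arrays of (birth, death) pairs.
    The threshold is computed from the reference (full-scale) diagram."""
    ref_life = reference[:, 1] - reference[:, 0]
    threshold = ref_life.mean() + ref_life.std()
    lifetimes = intervals[:, 1] - intervals[:, 0]
    return intervals[lifetimes > threshold]
```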

Finally, the average heat kernel and filtered heat kernel distances over all datasets are plotted in Figure 6.4. Both metrics show significant improvement over the centroid approximated persistence intervals with a larger reduction percentage. The filtered heat kernel distance demonstrates a more stable measurement of upscaling as the reduction increases, focusing only on the longer persistence intervals of the point cloud. This can be attributed to the variation of features generated in the approximated point cloud when only analyzing Hd, d > 0 features that may be considered noise in the underlying intervals. Stability of the filtered heat kernel distance corresponds with the theory covered in Section 4.2.

Figure 6.4: Average improvement (%) over all data sets with upscaling persistent homology for Heat Kernel (HK) and filtered Heat Kernel (fHK) distances.

                                   Reduction (%)
Dataset  N     d   Hmax  Meas.  25%      50%      75%     80%     85%    90%    95%
K        900   3   2     HKD    0        0        0.01    0.01    0.02   0.02   0.06
                         %imp   100      100      95.42   96.53   92.59  94.43  91.04
S        210   7   3     HKD    0.20     0.13     0.05    0.26    0.21   0.31   –
                         %imp   -195.41  -88.07   84.13   -9.31   33.54  16.99  –
I        151   4   4     HKD    0.07     0.09     0.04    0.04    –      –      –
                         %imp   -214.90  -66.12   -28.01  1.86    –      –      –
WT       527   41  3     HKD    0.45     0.63     0.80    0.76    0.63   1.03   1.66
                         %imp   11.63    39.56    43.27   50.89   58.53  37.70  –
TM       2000  2   1     HKD    –        0.03     0.01    0.04    0.02   0.07   0.04
                         %imp   –        -528.36  -99.01  -77.36  34.91  15.78  67.40

Table 6.4: Heat kernel distance and improvement (%) of upscaling over filtered centroid approximated persistent homology.


6.3 Small Feature Reconstruction

While the centroid approximation and upscaling approaches provide identification and refinement of large topological features, small topological features must be recovered from the partitions. The small feature refinement generally focuses on the H0 features and the small higher-dimensional features embedded within the partitions. This section studies the effect of the regional persistent homology and merging on the H0 features; however, additional accuracy improvement for higher dimensional features is also achieved from the regional construction.

To analyze the H0 features for regional reconstruction, Table 6.5 provides the baseline heat kernel distances for H0 alone at each reduction level. These values are used to determine the improvement in the heat kernel distance for regional persistent homology.

                                   Reduction (%)
Dataset  N     d   Hmax  Meas.  25%    50%    75%    80%    85%    90%     95%
K        900   3   2     HKD    1.89   6.56   12.73  14.54  16.48  18.74   21.76
S        210   7   3     HKD    1.62   4.33   8.03   8.93   9.86   11.15   12.52
I        150   4   4     HKD    0.48   1.60   3.21   3.64   4.07   4.59    5.29
WT       527   41  3     HKD    23.25  52.02  83.54  89.95  96.22  102.84  108.84
TM       2000  2   1     HKD    0.26   0.93   2.30   2.75   3.27   4.03    5.12

Table 6.5: Effect of k-means++ reduction on the heat kernel distance on H0 for the experimental datasets.

Regional persistent homology requires selection of a scalar parameter, s, to generate the fuzzy partitions from the original clustering. In the case of LHF, the scale parameter is multiplied by rmax to provide the additional points to add within a centroid; that is, if a point is within s ∗ rmax distance from a centroid it is included in the fuzzy partition. Naturally, a larger s will result in more overlap between the partitions, leading to a more accurate reconstruction of the regional persistence intervals, specifically the H0 features constituting the minimum spanning tree as described in Section 4.3. However, a larger s also results in larger point clouds in the partitions, consequently leading to a larger complex being formed to compute the persistent homology. Balancing the performance of the partitioned approach against the accuracy of the reconstruction becomes the tradeoff when computing the regional persistent homology.
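A minimal sketch of the fuzzy partition construction described above, assuming k-means++ clustering via scikit-learn; function and variable names are illustrative, not LHF's API.

```python
# Build overlapping (fuzzy) partitions: each partition contains its hard
# cluster members plus every point within s * r_max of its centroid.
import numpy as np
from sklearn.cluster import KMeans

def fuzzy_partitions(P: np.ndarray, k: int, s: float):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(P)
    # r_max: the largest distance from any point to its assigned centroid.
    dists = np.linalg.norm(P - km.cluster_centers_[km.labels_], axis=1)
    r_max = dists.max()
    partitions = []
    for i, c in enumerate(km.cluster_centers_):
        near = np.linalg.norm(P - c, axis=1) <= s * r_max
        member = (km.labels_ == i) | near   # s = 0 yields the hard partition
        partitions.append(P[member])
    return partitions, r_max
```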

To compare the heat kernel distance of the regional persistent homology, several selections of the scalar were used: s = 0, s = 0.5, s = 1.0, s = 1.5, and s = 2.0. These values give an indication of the regional persistent homology with no fuzzy partitions and the improvements gained by adding the additional fuzzy partition connections to build the minimum spanning tree. Tables 6.6 and 6.7 present the improvement of the regional persistent homology of H0 features from the original point cloud.

                         Reduction (%)
Dataset  s    Meas.  25%       50%      75%    80%    85%    90%    95%
K        0.0  HKD    24.31     19.90    11.41  9.28   6.97   4.39   2.33
              %imp   -1189.22  -203.55  10.39  36.17  57.69  76.57  89.28
         0.5  HKD    24.44     19.67    11.69  9.41   6.80   4.14   2.34
              %imp   -1195.99  -200.02  8.16   35.32  58.74  77.92  89.25
         1.0  HKD    24.14     11.80    5.69   4.25   4.67   4.81   4.05
              %imp   -1180.12  -79.99   55.31  70.80  71.66  74.31  81.38
         1.5  HKD    18.17     1.06     2.12   3.39   3.73   3.83   4.07
              %imp   -863.52   83.79    83.32  76.67  77.39  79.58  81.31
         2.0  HKD    7.36      0.77     1.36   2.52   3.72   4.41   4.26
              %imp   -290.26   88.25    89.28  82.65  77.40  76.49  80.41
S        0.0  HKD    12.17     9.39     5.49   4.41   3.34   1.82   0.85
              %imp   -649.75   -117.05  31.60  50.56  66.07  83.70  93.17
         0.5  HKD    12.23     9.52     5.43   4.34   3.34   2.13   0.81
              %imp   -653.47   -120.02  32.34  51.39  66.15  80.90  93.50
         1.0  HKD    12.24     8.63     4.36   3.34   2.73   2.69   1.97
              %imp   -654.49   -99.48   45.63  62.52  72.23  75.90  84.30
         1.5  HKD    9.20      4.74     2.60   2.27   2.05   2.69   2.00
              %imp   -467.24   -9.53    67.55  74.52  79.18  75.92  84.03
         2.0  HKD    7.35      2.85     2.86   2.74   2.33   1.93   1.48
              %imp   -353.08   34.16    64.37  69.35  76.38  82.68  88.22
I        0.0  HKD    5.70      4.56     2.61   1.98   1.68   1.13   0.75
              %imp   -1098.94  -185.40  18.69  45.65  58.82  75.32  85.77
         0.5  HKD    5.67      4.73     2.59   2.02   1.60   1.24   0.80
              %imp   -1093.08  -195.60  19.59  44.53  60.65  72.98  84.85
         1.0  HKD    5.52      4.43     2.05   1.79   0.99   0.76   0.50
              %imp   -1062.18  -176.93  36.19  50.74  75.70  83.35  90.50
         1.5  HKD    4.43      2.72     1.14   1.24   1.14   0.90   0.58
              %imp   -833.14   -70.19   64.60  65.77  71.85  80.47  89.08
         2.0  HKD    3.25      2.03     1.39   1.08   0.84   1.01   0.71
              %imp   -584.52   -27.22   56.83  70.17  79.45  78.02  86.52

Table 6.6: Heat kernel distance and improvement (%) of regional persistent homology over centroid approximated persistent homology of H0 features. Results compared to H0 feature heat kernel distance (Table 6.5).

Figure 6.5 presents the average regional improvement at each level of scalar for all datasets.


                         Reduction (%)
Dataset  s    Meas.  25%       50%      75%     80%    85%    90%    95%
WT       0.0  HKD    97.37     69.33    36.79   30.64  23.95  17.00  10.66
              %imp   -318.76   33.27    55.96   65.93  75.11  83.47  90.20
         0.5  HKD    97.10     69.38    36.62   30.65  27.58  16.18  18.26
              %imp   -317.62   -33.36   56.16   65.93  71.33  84.27  83.23
         1.0  HKD    94.95     67.28    43.37   37.47  33.21  25.19  18.66
              %imp   -308.38   -29.32   48.09   58.34  64.48  75.51  82.85
         1.5  HKD    55.61     42.15    32.30   30.53  26.35  21.67  19.11
              %imp   -139.15   18.98    61.33   66.06  72.61  78.93  82.45
         2.0  HKD    43.56     38.06    31.66   29.33  28.09  26.01  17.15
              %imp   -87.35    26.83    62.10   67.40  70.81  74.71  84.25
TM       0.0  HKD    6.70      5.50     3.09    2.48   1.69   0.97   0.29
              %imp   -2523.19  -493.66  -34.11  9.94   48.17  75.96  94.39
         0.5  HKD    6.70      5.50     3.05    2.45   1.69   0.92   0.28
              %imp   -2519.78  -494.11  -32.57  10.92  48.41  77.27  94.58
         1.0  HKD    6.67      4.76     1.65    1.44   1.21   0.81   0.69
              %imp   -2509.44  -413.59  28.27   47.69  63.02  80.03  86.58
         1.5  HKD    5.96      2.90     1.16    1.19   1.04   0.91   0.71
              %imp   -2234.32  -212.99  49.65   56.59  68.33  77.33  86.13
         2.0  HKD    5.05      1.95     1.23    1.17   1.08   0.86   0.73
              %imp   -1877.68  -110.26  46.40   57.50  67.12  78.73  85.66

Table 6.7: Heat kernel distance and improvement (%) of regional persistent homology over centroid approximated persistent homology of H0 features. Results compared to H0 feature heat kernel distance (Table 6.5).

Interestingly, an inflection point at 90% reduction indicates that the number of points contained in each partition becomes closer to approximating the H0 topological features of the space. This is intuitive; a large number of partitions (low reduction percentage) generates more tightly compact clusters. These clusters result in a smaller value of rmax being required to reconstruct a majority of the persistence intervals. Likewise, a higher reduction percentage generates fewer partitions and a larger value of rmax. Utilizing a larger scalar of rmax in this case results in a significant improvement in the heat kernel distance because the centroid-approximated metric realizes very few H0 intervals. This follows the analysis in Section 4.3.

While this section primarily focuses on the H0 persistence intervals generated from the regional persistent homology, it is important to note that the higher dimensional persistence intervals embedded within the partitions are also realized. Any persistence interval wholly contained in a given fuzzy partition will be identified and properly characterized by the regional step. While this effect is difficult to characterize with current analysis of persistent homology, it does play a role in the partitioned persistent homology approach and affects the accuracy improvement of the generated persistence intervals.


Figure 6.5: Average improvement (%) in heat kernel distance over all data sets with regional persistent homology at various levels of scalars (s ∗ rmax). Results compared to H0 feature heat kernel distance (Table 6.5).

6.4 Partitioned Persistent Homology

This section evaluates the approach of partitioned persistent homology to reconstruct the original persistence intervals. Analysis of the accuracy and performance benefits of the approach is presented to demonstrate the effectiveness of accurately merging the persistence intervals in a serial approach. Additional studies of the parallel mode are covered separately in Section 6.5.

As indicated in the studies of the upscaled persistence intervals and regional persistent homology, the reduction percentage becomes a driving factor in the accuracy of the approach. A smaller number of partitions leads to a more accurate reconstruction of the H0 persistence intervals while subsequently losing larger persistence intervals in the centroids. If the reduction loses salient topological features in the approximated results, they will not be upscaled or realized in the results. Finding the balance between these two factors is studied in this section.

Table 6.8 presents the improvement over centroid-approximated upscaling on each of the studied datasets. All studied datasets achieve significant improvement to the heat kernel distance after a reduction of 75%.


Referring back to Figure 6.4, the upscaling improvement also shows significant benefits on both the heat kernel and filtered heat kernel distances beyond this point. This indicates that a reduction level larger than 75% will gain significant improvement from the approach. These results are graphically displayed in Figure 6.6 to show the overall improvement of the approach.

The performance of the partitioned persistent homology remains important to the overall approach. The primary goal of using PPH is to provide tighter memory bounds for large data sets alongside faster parallel computation of the results. In this case, a larger number of partitions provides more parallelism, but each partition is generally smaller in size, leading to the centroid approximated persistent homology and upscaling requiring a dominating amount of processing time in a single thread. Alternatively, a small number of partitions results in larger parallel workloads but a less accurate approximation of the salient topological features.

Figure 6.7 displays the performance improvement for partitioned persistent homology on each dataset. Even with a low reduction percentage, the performance averages over 10x speedup. As the reduction percentage continues to increase, several of the datasets reach 100x and 1000x speedup, dependent on the dataset. This can be attributed to the nature of the data being studied; in the case of the Klein bottle, a triangulated mesh representation of the geometric shape, the reduction provides unparalleled performance improvements. In less structured datasets, such as the classification sets for Seeds and Water Treatment, the speedup is not nearly as significant. Additionally, in low dimensional datasets such as twoMoons, searching for only H1 features and below does not require significant resources per thread and is generally dominated by the approximated persistent homology. Further study of this performance is covered in Section 6.5.

Overall, significant accuracy and performance gains are evident with partitioned persistent homology. Even at low levels of reduction, the accuracy and performance on tested datasets are motivating for the approach to be extended to larger datasets and higher dimensional computation of persistent homology.


                                   Reduction (%)
Dataset  N     d   Hmax  Meas.  25%       50%      75%    80%    85%    90%    95%
K        900   3   2     HKD    24.02     11.75    7.74   4.03   4.72   4.81   3.92
                         %imp   -1111.34  -56.31   49.31  76.76  76.04  78.40  84.79
S        210   7   3     HKD    6.71      2.83     3.29   1.95   2.03   2.10   1.51
                         %imp   -289.56   39.23    62.66  80.06  81.09  82.75  88.23
I        150   4   4     HKD    5.57      4.53     1.83   1.67   1.35   1.05   0.65
                         %imp   -1052.76  -167.33  45.92  56.40  68.55  78.34  88.23
WT       527   41  3     HKD    93.18     63.46    41.89  39.06  27.80  26.52  18.96
                         %imp   -282.20   -16.50   51.82  58.22  72.15  75.13  83.18
TM       2000  2   1     HKD    6.69      4.93     1.91   1.53   1.01   0.81   0.68
                         %imp   -2379.34  -397.59  24.91  49.37  71.87  81.77  87.94

Table 6.8: Heat kernel distance and improvement (%) of partitioned persistent homology for the experimental datasets. Results compared to the centroid approximated heat kernel distance (Table 6.1).

Figure 6.6: Heat Kernel Distance improvement on selected datasets alongside average. Results compared to the centroid approximated heat kernel distance (Table 6.1).

Figure 6.7: PPH speedup for selected datasets alongside average. Results compared to standard fastPersistence performance.


6.5 Performance: Parallel PPH

One benefit of partitioned persistent homology is the parallelism between the partitions and the centroid approximated persistence intervals. Reconstruction of the persistence intervals has been the focus of this work, but it is paired with the ability to compute on the partitions independently. This leads to an embarrassingly parallel approach where each partition is evaluated for embedded persistence intervals, persistence intervals are trimmed based on the ownership of the interval to the partition, and the results are then merged into the full set of persistence intervals.

The merge step, although possessing a few serial steps for cleanup, is primarily handled by each individual partition. This allows the approach to easily scale not only to multithreaded but also to multiprocessor execution. LHF implements both approaches through OpenMP and OpenMPI, enabling simple multithreading and multiprocessing using the parallel partitions as work points for listening workers. For this study the OpenMP approach is utilized to limit interference from heterogeneous hardware and network overhead. Similar results can be obtained from the OpenMPI approach of distributing partitions to worker nodes within a cluster.
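The per-partition pipeline can be sketched as follows. LHF implements this in C++ with OpenMP/OpenMPI, so the Python below is purely illustrative, with compute_ph and owns_interval as hypothetical placeholders for the persistent homology kernel and the interval-ownership test described above.

```python
# Illustrative sketch of the embarrassingly parallel step: each fuzzy
# partition is processed independently, its intervals are trimmed to those
# it owns, and the results are merged in a cheap serial cleanup.
from multiprocessing import Pool

def compute_ph(partition):
    # Placeholder: would run persistent homology on one fuzzy partition.
    return []

def owns_interval(interval, owned_points):
    # Placeholder: would keep only intervals generated by points the
    # partition owns, discarding duplicates from the fuzzy overlap.
    return True

def process_partition(args):
    partition, owned_points = args
    intervals = compute_ph(partition)
    return [iv for iv in intervals if owns_interval(iv, owned_points)]

def parallel_pph(partitions, owned, workers=8):
    # Each partition is an independent task for a worker process.
    with Pool(workers) as pool:
        per_partition = pool.map(process_partition, zip(partitions, owned))
    return [iv for group in per_partition for iv in group]
```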

Performance of the parallel partitioned approach is captured in several aspects. First, the parallel speedup over the standard fastPersistence mode is presented in Figure 6.8(A). The parallel speedup is computed on a synthetic dSphere of 10k points, a data set large enough to show some improvement. However, the persistent homology can only be computed up to H1 to compare to the fastPersistence mode in this case. The number of threads utilized does show some improvement in the performance, but without larger workloads for the individual processes the graph fails to show significant speedup with additional threads.

The parallel speedup by number of points is important in the computation as it directly affects the number of generated simplices in the complex. This speedup changes significantly with the dimension of homology to compute up to, Hmax. In cases where only low-dimensional topological features are identified, the approach gains no benefit (and costs more than standard fastPersistence). Higher dimensional features require significantly larger complexes and benefit more from the partitioned approach, as shown in Figure 6.8(B). In several cases the fastPersistence mode cannot process into these higher dimensions; values pinned at the top of the plot (100k) indicate these instances. Computing H2 and H3 features is expensive in the fastPersistence approach but achievable with the partitioned persistent homology.
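To make this growth concrete, a worst-case bound follows from simple counting: computing Hd requires simplices up to dimension d + 1, and a full complex on n points (the worst case for a Vietoris–Rips construction at large scale) contains at most the binomial number of k-simplices shown below.

```latex
% Worst-case size of a complex built to compute H_d on n points:
\[
  \#\{k\text{-simplices}\} \;\le\; \binom{n}{k+1},
  \qquad
  |K| \;\le\; \sum_{k=0}^{d+1} \binom{n}{k+1}.
\]
% Example with n = 10{,}000:
\[
  \binom{10000}{3} \approx 1.7 \times 10^{11} \ \text{triangles (for } H_1\text{)},
  \qquad
  \binom{10000}{4} \approx 4.2 \times 10^{14} \ \text{tetrahedra (for } H_2\text{)}.
\]
```

Partitioning attacks exactly this combinatorial blow-up by keeping n small within each partition.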


Figure 6.8: Performance results based on varying input parameters of the PPH approach.
A. Parallel speedup over fastPersistence: synthetic dSphere; 10k points embedded in R^6; H1.
B. Parallel speedup by number of points: synthetic dSphere embedded in R^6; k selected as 0.1 ∗ ||P||.
C. Parallel speedup by number of partitions: synthetic dSphere; 1k points embedded in R^6.
D. Parallel speedup by scalar (s ∗ rmax): synthetic dSphere; 1k points embedded in R^6; k = 100.

The number of generated partitions also directly affects the performance. A larger number of partitions generates fewer points per partition, leading to lighter workloads for each of the processes. When the number of partitions is low, the regional persistent homology can accurately reconstruct the persistence intervals while parallelizing a large portion of the computation. Results again differ based on the dimension of homology to compute to (Hmax), as shown in Figure 6.8(C).

The remaining factor that generally affects the performance of the approach is the scalar used to generate the fuzzy partitions. This scalar is used to multiply rmax such that points are included if they are within s ∗ rmax of the corresponding centroid. Results for the parallel speedup by scalar are presented in Figure 6.8(D). Values lying on the x-axis, such as the H3 persistent homology with scalars of 1.5, 1.75, and 2.0, indicate the partitions have become too large to compute in parallel. In general the speedup is roughly constant up to a large value of s, indicating that s = 1.0 may be a suitable value for generating the fuzzy partitions. However, the results of the regional persistent homology in Figure 6.5 indicate very little improvement in the persistence intervals at lower levels of reduction and should be taken into consideration.


As previously indicated, the number of processes has minimal effect on low dimensions of persistent homology. Comparison to the serial version of partitioned persistent homology with a larger Hmax becomes the only way to gauge the performance improvement with multiple threads. Figure 6.9 provides the parallel speedup by number of threads in H1, utilizing the same results as Figure 6.8(A). Increasing the workload of each parallel thread requires an increase in Hmax. Figure 6.10 presents the parallel speedup by number of threads in H2. Detecting higher dimensional topological features requires a much larger complex, exponential in Hd. Even an increase from H1 to H2 shows a more significant speedup when computing the partitioned persistent homology in multiple threads, as the workloads have become much larger.

Overall, the performance of the approach is guided by several factors: the number of points, the number of partitions, the scalar used for generating fuzzy partitions, the dimension of homology to compute to (Hmax), and the number of threads utilized to provide parallelism. Appropriately tuning these parameters can lead to even more significant speedup while still providing accurate reconstruction of the persistent homology.

Figure 6.9: PPH parallel speedup for H1 features of a synthetic dSphere; 10k points embedded in R^6.

Figure 6.10: PPH parallel speedup for H2 features of a synthetic dSphere; 10k points embedded in R^6.

Chapter 7

Discussion

This chapter covers the conclusions and suggestions for future research derived from the study of Partitioned Persistent Homology (PPH). PPH combines partitioning and sampling of a large point cloud to build an approximate-methods solution that enables the computation of persistent homology on “big data”. Partitioning and iterative upscaling have been shown to provide significant benefits when computing persistent homology over large point clouds. The upscaling technique refines the identified topological features for both large and small features in the source point cloud.

Effects of data reduction through partitioning have shown a significant increase in performance of the LHF library and other libraries [22, 27]. The approach of data reduction remains a topic of interest when working with topological spaces and provides alternatives for exploiting partitioning when computing persistent homology. The work may be applicable to other types of machine learning techniques where reducing the number of input vectors has significant impact on the performance of the algorithm. In the case of persistent homology, k-means++ preserves topological structures of the data after significant reduction by retaining the geometric shape of the input point cloud. A finite bound can be placed on the error induced from reduction, relating directly to the maximum radius of the clusters. This is a worst-case scenario; k-means++ reduction will typically induce error below the ravg of the cluster centroids.
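A minimal sketch of how these two radii could be measured after a k-means++ reduction, assuming scikit-learn and treating ravg as the mean of the per-cluster maximum radii (an assumption of this sketch):

```python
# Estimate the worst-case (r_max) and typical (r_avg) error bounds induced
# by k-means++ reduction, following the discussion above.
import numpy as np
from sklearn.cluster import KMeans

def reduction_error_bounds(P: np.ndarray, k: int):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(P)
    dists = np.linalg.norm(P - km.cluster_centers_[km.labels_], axis=1)
    r_max = dists.max()   # worst-case displacement of any point
    # average radius: mean of each non-empty cluster's maximum radius
    r_avg = np.mean([dists[km.labels_ == i].max()
                     for i in range(k) if np.any(km.labels_ == i)])
    return r_max, r_avg
```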

Extraction of the constituent boundary points from the boundary matrix provides a new way to interpret results of persistent homology. Most libraries utilize optimizations in the boundary matrix reduction that ignore the identification of specific boundaries as they persist. The method of boundary extraction can track boundaries and constituent points as ε is varied. These points could be used to visualize where in the data a feature is found and even the shape of the feature for further analysis. Significant topological features in the point cloud could be further partitioned to refine understanding of topological features identified by boundaries.

The heat kernel, bottleneck, and Wasserstein error indices provide general mappings between two persistence diagrams and the average or maximum distance between them. One pitfall of these methods when comparing a partitioned data set is the tracking of boundaries to their true matches in the second persistence diagram. If the mappings are incorrect, error may appear in any of the indices that does not properly reflect the error of the reduced mapping. By tracking the constituent boundary points of a barcode, the true mappings can be compared between the original and reduced diagrams. This gives a more accurate representation of the error induced by partitioning and reduction.
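One illustrative way to realize this matching, assuming each interval carries its constituent boundary point set; all names are hypothetical, not an LHF routine:

```python
# Match each reduced interval to the original interval whose constituent
# boundary points overlap it most (Jaccard similarity), then measure error
# over the true matches instead of a geometry-only pairing.
def match_by_boundary(original, reduced):
    """original, reduced: lists of (birth, death, boundary_point_set)."""
    errors = []
    for rb, rd, rpts in reduced:
        best, best_sim = None, 0.0
        for ob, od, opts in original:
            union = rpts | opts
            sim = len(rpts & opts) / len(union) if union else 0.0
            if sim > best_sim:
                best, best_sim = (ob, od), sim
        if best is not None:
            # error of this true match: max coordinate difference
            errors.append(max(abs(rb - best[0]), abs(rd - best[1])))
    return errors
```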

The iterative upscaling approach taken for this study focuses on upscaling significant features and gathering smaller features in parallel processing steps. The approach comes with drawbacks. Features not identified in a partitioned space may be lost when upscaling. Features can shift dimensions, whether flattening into a lower dimension or expanding into higher dimensions due to representative point placement. Many of these issues can be avoided by selecting a suitable partitioning algorithm that preserves the topological features in the space, but they will still exist with any partitioning scheme in certain edge cases.

Iterative upscaling provides a means to further refine computation of the persistent homology of large point clouds. Other approaches have attempted to reduce the size of the simplicial complex, optimize the storage structure of the simplicial complex, structure the complex differently to identify significant features, and even perform collapses on the simplicial complex prior to computing the persistent homology. All of these optimizations can be exploited with the iterative upscaling approach if designed effectively. Partitioning and iterative upscaling serve as wrappers to the existing and future optimizations for persistent homology.

Study of larger data sets with TDA tools is necessary to begin understanding higher dimensional structures in point clouds and visualizing the evolution of structure through persistent homology. Identification of topological features in large data sets is difficult with existing tools due to memory constraints and processing time. Partitioning and iterative upscaling, examined through the LHF library and this study, provide a foundation for exploitation of partitioning optimizations for computing persistent homology on large data sets with a bounded error on the induced change in shape of the data.

7.1 Suggestions for Future Work


This study has laid the groundwork for an expanded study on the effect of data partitioning and upscaling for larger data sets. The library provides tools and modular functionality to replace components of the simplicial complex algorithm, partitioning algorithm, simplex storage data structure, and other techniques beyond those studied here. The effects of variance, noise, data distribution, and performance need to be evaluated for each of these components to design an optimized system for data sets beyond the reach of existing libraries.

Other complexes besides Vietoris–Rips should be implemented in LHF to understand the effects of partitioning the topological space on complex creation and storage structure. Specifically, the Witness and Skeleton Blocker complexes are of interest for evaluation of partitioned persistent homology.

Partitioning algorithms beyond k-means need to be evaluated to determine which methods provide suitable space partitioning and preservation of the topological features in the point cloud. Partitioning and classification govern the overall identification of large topological features; a forest of different partitioning algorithms may provide further insight into how a point cloud reacts to approximation.

Further optimization of the algorithms used in LHF should be explored to bring the non-upscaled runtime closer to the GUDHI and Ripser implementations examined in this study.

Distributed persistent homology with the LHF framework should be explored through a Map-Reduce framework when partitioning and iteratively upscaling the data. A first partitioning pass would gather independent features within the point cloud; identified features would then be distributed to remote resources for reprocessing after upscaling. This framework would provide a distributed upscaling model that could utilize a larger resource pool to execute on big data sets.

Streaming persistent homology with the LHF framework is also a topic of interest beyond this study. By reducing data within a stream and recomputing the persistent homology only when a change in the topological features is identified, changes in the topology of the data can be detected. LHF has been utilized to perform initial studies into streaming persistent homology and will continue to be a useful tool in this exploration [15].

Visualization and evaluation of persistent homology outputs remain difficult for understanding the higher dimensional topological features in a point cloud. While barcodes identify the existence of a feature and the connectedness of the feature's boundary, no method exists for extracting the constituent boundaries and visualizing the d-dimensional feature in some way. For R^2 and R^3 features the shape may be intuitive. Examining higher dimensional features from constituent boundary point extraction can identify additional multivariate relationships in a source point cloud. Understanding and evaluating these topological features is important in continued experimentation and conveyance of results to readers.

Bibliography

[1] G. Petri, M. Scolamiero, I. Donato, and F. Vaccarino, “Topological strata of weighted complex networks,” PLOS ONE, vol. 8, no. 6, pp. 1–8, Jun. 2013. [Online]. Available: https://doi.org/10.1371/journal.pone.0066506
[2] D. Horak, S. Maletić, and M. Rajković, “Persistent homology of complex networks,” Journal of Statistical Mechanics: Theory and Experiment, Mar. 2009. [Online]. Available: https://doi.org/10.1088%2F1742-5468%2F2009%2F03%2Fp03034
[3] M. Hajij, B. Wang, C. E. Scheidegger, and P. Rosen, “Visual detection of structural changes in time-varying graphs using persistent homology,” IEEE Pacific Visualization Symp, pp. 125–134, 2018.
[4] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational Homology, ser. Applied Mathematical Sciences. Springer-Verlag New York, 2004, vol. 157.
[5] H. Wagner, C. Chen, and E. Vuçini, Efficient Computation of Persistent Homology for Cubical Data. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 91–106. [Online]. Available: https://doi.org/10.1007/978-3-642-23175-9_7
[6] P. Bendich, H. Edelsbrunner, and M. Kerber, “Computing robustness and persistence for images,” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1251–1260, Nov. 2010.
[7] K. Mischaikow and V. Nanda, “Morse theory for filtrations and efficient computation of persistent homology,” Discrete & Computational Geometry, vol. 50, no. 2, pp. 330–353, 2013.
[8] V. Nanda, “Discrete morse theory for filtrations,” Ph.D. dissertation, Department of Mathematics, Rutgers University, Oct. 2012. [Online]. Available: http://people.maths.ox.ac.uk/nanda/source/Thesis.pdf
[9] G. Carlsson, T. Ishkhanov, V. de Silva, and A. Zomorodian, “On the local behavior of spaces of natural images,” International Journal of Computer Vision, vol. 76, no. 1, pp. 1–12, Jan. 2008.
[10] Z. Cang, L. Mu, K. Wu, K. Opron, K. Xia, and G.-W. Wei, “A topological approach for protein classification,” Molecular Based Mathematical Biology, vol. 3, no. 1, Nov. 2015.
[11] T. K. Dey and S. Mandal, “Protein classification with improved topological data analysis,” in 18th International Workshop on Algorithms in Bioinformatics, ser. WABI 2018, Aug. 2019, pp. 6:1–6:13.
[12] K. Xia and G.-W. Wei, “Persistent homology analysis of protein structure, flexibility, and folding,” International Journal for Numerical Methods in Biomedical Engineering, vol. 30, no. 8, pp. 814–844, 2014.
[13] P. G. Camara, D. I. S. Rosenbloom, K. J. Emmett, A. J. Levine, and R. Rabadan, “Topological data analysis generates high-resolution, genome-wide maps of human recombination,” Cell Systems, vol. 3, no. 1, pp. 83–94, 2016.
[14] J. M. Chan, G. Carlsson, and R. Rabadan, “Topology of viral evolution,” Proceedings of the National Academy of Sciences, vol. 110, no. 46, pp. 18566–18571, 2013. [Online]. Available: https://www.pnas.org/content/110/46/18566
[15] A. Moitra, N. O. Malott, and P. A. Wilsey, “Persistent homology on streaming data,” in The 8th Workshop on Data Mining in Biomedical Informatics and Healthcare, ser. DMBIH’20, Nov. 2020.
[16] F. Chazal, B. T. Fasy, F. Lecci, B. Michel, A. Rinaldo, and L. Wasserman, “Subsampling methods for persistent homology,” in International Conference on Machine Learning, ser. ICML 2015, Lille, France, Jul. 2015. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01073073
[17] H. Lee, H. Kang, M. K. Chung, B.-N. Kim, and D. S. Lee, “Persistent brain network homology from the perspective of dendrogram,” IEEE Transactions on Medical Imaging, vol. 31, no. 12, pp. 2267–2277, Dec. 2012.
[18] P. Bendich, J. S. Marron, E. Miller, A. Pieloch, and S. Skwerer, “Persistent homology analysis of brain artery trees,” The Annals of Applied Statistics, vol. 10, no. 1, pp. 198–218, Mar. 2016.
[19] L. Li, W.-Y. Cheng, B. S. Glicksberg, O. Gottesman, R. Tamler, R. Chen, E. P. Bottinger, and J. T. Dudley, “Identification of type 2 diabetes subgroups through topological analysis of patient similarity,” Science Translational Medicine, vol. 7, no. 311, Oct. 2015.
[20] M. Nicolau, A. J. Levine, and G. Carlsson, “Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival,” Proceedings of the National Academy of Sciences, vol. 108, no. 17, pp. 7265–7270, 2011.
[21] N. Otter, M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington, “A roadmap for the computation of persistent homology,” EPJ Data Science, vol. 6, no. 1, Aug. 2017.
[22] A. Moitra, N. Malott, and P. A. Wilsey, “Cluster-based data reduction for persistent homology,” in 2018 IEEE International Conference on Big Data, ser. Big Data 2018, Dec. 2018, pp. 327–334.
[23] D. R. Sheehy, “The persistent homology of distance functions under random projection,” in Proceedings of the Thirtieth Annual Symposium on Computational Geometry, ser. SOCG’14. New York, NY, USA: ACM, 2014, pp. 328–334.
[24] V. de Silva and G. Carlsson, “Topological estimation using witness complexes,” in Eurographics Symposium on Point-Based Graphics, ser. SPBG ’04, M. Gross, H. Pfister, M. Alexa, and S. Rusinkiewicz, Eds. The Eurographics Association, 2004.
[25] K. N. Ramamurthy, K. R. Varshney, and J. J. Thiagarajan, “Computing persistent homology under random projection,” in IEEE Workshop on Statistical Signal Processing, Jun. 2014, pp. 105–108.
[26] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier, “Persistence images: A stable vector representation of persistent homology,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 218–252, Jan. 2017.
[27] N. O. Malott and P. A. Wilsey, “Fast computation of persistent homology with data reduction and data partitioning,” in 2019 IEEE International Conference on Big Data, ser. Big Data 2019, Dec. 2019, pp. 880–889.
[28] N. O. Malott, R. R. Verma, and P. A. Wilsey, “Parallel computation of partitioned persistent homology,” in IEEE International Parallel and Distributed Processing Symposium, 2020, (submitted).
[29] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” in Proceedings of the 41st Annual Symposium on Foundations of Computer Science, ser. FOCS ’00. Washington, DC, USA: IEEE Computer Society, 2000.
[30] R. Ghrist and A. Muhammad, “Coverage and hole-detection in sensor networks via homology,” in Fourth International Symposium on Information Processing in Sensor Networks, ser. IPSN 2005. IEEE, Apr. 2005, pp. 254–260.
[31] V. D. Silva and R. Ghrist, “Homological sensor networks,” Notices of the American Mathematical Society, vol. 54, no. 1, pp. 10–17, Jan. 2007.
[32] A. Adcock, D. Rubin, and G. Carlsson, “Classification of hepatic lesions using the matching metric,” Computer Vision and Image Understanding, vol. 121, pp. 36–42, Apr. 2014.
[33] G. Carlsson, “Topological pattern recognition for point cloud data,” Acta Numerica, vol. 23, pp. 289–368, 2014.
[34] P. Frosini and C. Landi, “Persistent betti numbers for a noise tolerant shape-based approach to image retrieval,” Pattern Recognition Letters, vol. 34, no. 8, pp. 863–872, Jun. 2013.
[35] D. R. Sheehy, “Linear-size approximations to the vietoris–rips filtration,” Discrete & Computational Geometry, vol. 49, no. 4, pp. 778–796, Jun. 2013.
[36] G. Carlsson, “Topology and data,” Bulletin of the American Mathematical Society, vol. 46, no. 3, pp. 255–308, Apr. 2009.
[37] A. Zomorodian and G. Carlsson, “Computing persistent homology,” Discrete & Computational Geometry, vol. 33, no. 2, pp. 249–274, Feb. 2005.
[38] P. Bubenik, “Statistical topological data analysis using persistence landscapes,” The Journal of Machine Learning Research, vol. 16, no. 1, pp. 77–102, Jan. 2015.
[39] P. Donatini, P. Frosini, and A. Lovato, “Size functions for signature recognition,” in Vision Geometry VII, vol. 3454. International Society for Optics and Photonics, 1998, pp. 178–183.
[40] L. Vietoris, “Über den höheren Zusammenhang kompakter Räume und eine Klasse von zusammenhangstreuen Abbildungen,” Mathematische Annalen, vol. 97, no. 1, pp. 454–472, 1927.
[41] D. Attali, A. Lieutier, and D. Salinas, “Efficient data structure for representing and simplifying simplicial complexes in high dimensions,” International Journal of Computational Geometry & Applications, vol. 22, no. 04, pp. 279–303, 2012.
[42] J.-D. Boissonnat and C. Maria, “The simplex tree: An efficient data structure for general simplicial complexes,” Algorithmica, vol. 70, no. 3, pp. 406–427, Nov. 2014.
[43] J. M. Phillips, B. Wang, and Y. Zheng, “Geometric inference on kernel density estimates,” arXiv preprint arXiv:1307.7760, 2013.
[44] G. Carlsson, A. Zomorodian, A. Collins, and L. J. Guibas, “Persistence barcodes for shapes,” International Journal of Shape Modeling, vol. 11, no. 02, pp. 149–187, 2005.
[45] A. Zomorodian, “Fast construction of the vietoris–rips complex,” Computers & Graphics, pp. 263–271, 2010.
[46] J.-D. Boissonnat, T. K. Dey, and C. Maria, “The compressed annotation matrix: an efficient data structure for computing persistent cohomology,” CoRR, vol. abs/1304.6813, 2013. [Online]. Available: http://arxiv.org/abs/1304.6813
[47] C. Chen and M. Kerber, “Persistent homology computation with a twist,” in Proceedings 27th European Workshop on Computational Geometry (EuroCG’11), 2011, pp. 197–200.
[48] U. Bauer, M. Kerber, and J. Reininghaus, “Clear and compress: Computing persistent homology in chunks,” in Topological Methods in Data Analysis and Visualization III, P. T. Bremer, I. Hotz, V. Pascucci, and R. Peikert, Eds. Springer International Publishing, Mar. 2014, pp. 103–117.
[49] M. Mrozek and B. Batko, “Coreduction homology algorithm,” Discrete & Computational Geometry, vol. 41, no. 1, pp. 96–118, Jan. 2009.
[50] U. Bauer, “Ripser: efficient computation of vietoris–rips persistence barcodes,” 2019.
[51] T. K. Dey, D. Shi, and Y. Wang, “Simba: An efficient tool for approximating rips-filtration persistence via simplicial batch-collapse,” in 24th Annual European Symposium on Algorithms (ESA 2016), 2016.
[52] B. Brehm and H. Hardering, “Sparips,” 2018. [Online]. Available: https://arxiv.org/abs/1807.09982
[53] B.-H. Park and H. Kargupta, “Distributed data mining: Algorithms, systems, and applications,” in Data Mining Handbook, N. Ye, Ed., 2002, pp. 341–358.
[54] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, “Tools for privacy preserving distributed data mining,” ACM SIGKDD Explorations Newsletter, vol. 4, no. 2, pp. 28–34, 2002.
[55] U. Bauer, M. Kerber, and J. Reininghaus, Distributed Computation of Persistent Homology. SIAM, 2014, pp. 31–38.
[56] G. Forman and B. Zhang, “Distributed data clustering can be efficient and exact,” ACM SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 34–38, 2000.
[57] R. Jin, A. Goswami, and G. Agrawal, “Fast and exact out-of-core and distributed k-means clustering,” Knowledge and Information Systems, vol. 10, no. 1, pp. 17–40, 2006.
[58] K. Thangavel and N. K. Visalakshi, “Ensemble based distributed k-harmonic means clustering,” International Journal of Recent Trends in Engineering, vol. 2, no. 1, 2009.
[59] R. D. Nowak, “Distributed em algorithms for density estimation and clustering in sensor networks,” IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2245–2253, 2003.
[60] R. G. Gallager, P. A. Humblet, and P. M. Spira, “A distributed algorithm for minimum-weight spanning trees,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 5, no. 1, pp. 66–77, 1983.
[61] C. Berge and A. Ghouila-Houri, “Programming, games and transportation networks,” 1965.
[62] J. B. Kruskal, “On the shortest spanning subtree of a graph and the traveling salesman problem,” Proceedings of the American Mathematical Society, vol. 7, no. 1, pp. 48–50, Feb. 1956.
[63] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, “Mapreduce online,” in NSDI, vol. 10, no. 4, 2010, p. 20.
[64] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492
[65] ——, “Mapreduce: a flexible data processing tool,” Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[66] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, ser. KDD’96. AAAI Press, Aug. 1996, pp. 226–231. [Online]. Available: http://dl.acm.org/citation.cfm?id=3001460.3001507
[67] N. O. Malott, A. Sens, and P. A. Wilsey, “Topology preserving data reduction for computing persistent homology,” in International Workshop on Big Data Reduction, 2020, (submitted).
[68] A. Sens, “Topology preserving data reductions for computing persistent homology,” Master’s thesis, Dept of EECS, University of Cincinnati, 2021, (expected).
[69] C. Maria, J.-D. Boissonnat, M. Glisse, and M. Yvinec, “The gudhi library: Simplicial complexes and persistent homology,” INRIA, Tech. Rep. RR-8548, 2014. [Online]. Available: https://hal.inria.fr/hal-01005601v2
[70] G. Henselman. (2019) Eirene: julia library for homological persistence. [Online]. Available: https://github.com/Eetion/Eirene.jl