Partitioned Persistent Homology
A thesis submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in the Department of Electrical Engineering & Computer Science of the College of Engineering and Applied Sciences

November 9, 2020

by Nicholas O. Malott
BSCompE, University of Cincinnati, 2016

Thesis Advisor and Committee Chair: Dr. Philip A. Wilsey

Abstract

In an era of increasing data size and complexity, novel algorithms to mine embedded relationships within big data are being studied throughout high performance computing. One such approach, Topological Data Analysis (TDA), studies homologies in a point cloud. Identified homologies represent topological features (loops, holes, and voids) that are independent of continuous deformation and resilient to noise. These topological features identify a generalized structure of the input data. TDA has shown promise in several applications, including studying various networks [1–3], digital images and movies [4–9], protein analysis and classification [10–12], and genomic sequences [1,13–15]. TDA provides an alternative approach to classification for general data mining applications.

The principal tool of TDA for examining the topological structures in a point cloud is Persistent Homology. Persistent homology identifies the underlying homologies of the point cloud as the connectedness of the point cloud increases, or persists. The output of persistent homology can be used to classify embedded structures in high dimensional spaces [16, 17], enable automated classification from connected component analysis [18], identify abnormalities in data streams [15], and classify the structure of protein chains [14,15,19,20], along with many other scientific and biological applications.
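As a minimal illustration of how features persist as connectedness increases, the sketch below computes only the H0 (connected component) intervals of a Vietoris–Rips filtration with a union-find structure. It is an assumption-laden example rather than the method used by any particular tool: the names `h0_intervals` and `cloud` are hypothetical, Euclidean distance is assumed, and higher-dimensional features (loops, voids) are not handled.

```python
import math
from itertools import combinations

def h0_intervals(points):
    """Return H0 persistence intervals (birth, death) of a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0; a component dies at the
    length of the edge that first merges it into another component.
    """
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))

    def find(i):
        # Find the component representative, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by length (the filtration order of 1-simplices).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    intervals = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                # two components merge at scale eps
            intervals.append((0.0, eps))   # one of them dies here
    intervals.append((0.0, math.inf))      # the last component never dies
    return intervals

# Hypothetical toy input: three nearby points and one distant point.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
intervals = h0_intervals(cloud)
```

The long interval ending at 9.0 reflects the distant point, which stays separate until the scale reaches its distance to the nearest member of the cluster; the short intervals come from the tightly grouped points.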
Although the uses are vast, persistent homology suffers from exponential growth in both run-time and memory complexity, which limits the evaluation of large datasets [21]. Persistent homology cannot be directly applied to large point clouds; modern tools are limited to the analysis of a few thousand points in R^3. Optimization of data structures and enhancements to the algorithm have led to significant performance improvements, but the approaches still scale poorly with the size of the input point cloud.

One approach studied over recent years is to reduce the input point cloud, either in size or in dimensions [16, 22–25]. These reductions yield substantial improvement, enabling computation on data sets 3–4 orders of magnitude larger than the exact approach. Salient topological features have been shown to be preserved in the reduced point cloud, which can reasonably approximate the large topological features of the original point cloud. However, the loss of smaller features in approximate approaches is still a significant problem for some application areas [18, 26].

A partitioned technique to compute persistent homology, one that pairs the approximation of the salient topological features with the reconstruction of the smaller features embedded in the partitions, is a natural extension to reduction of the point cloud. Several other studies [27, 28] implement this approach by using partitioning algorithms to approximate the large topological features while simultaneously reconstructing the smaller topological features from the individual partitions. This technique, developed and formalized in this thesis, is called Partitioned Persistent Homology (PPH).¹

This thesis outlines the theory and approach of Partitioned Persistent Homology together with the limitations of the reconstructed results. The technique is detailed and integrated into the Lightweight Homology Framework (LHF) for demonstration.
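The pairing of a centroid approximation with per-partition reconstruction can be sketched roughly as follows. This is a simplified illustration under several assumptions and is not the LHF implementation or its API: only H0 intervals are computed, a deterministic farthest-point seeding with Lloyd iterations is swapped in for the k-means++ partitioning named in the contents, and the merging of intervals across partitions is not attempted. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

def h0_intervals(points):
    """H0 persistence intervals of a Vietoris-Rips filtration via union-find."""
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    intervals = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            intervals.append((0.0, eps))
    intervals.append((0.0, math.inf))
    return intervals

def partition(points, k, iters=10):
    """Lloyd iterations with farthest-point seeding (a stand-in for k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    groups = [[] for _ in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical toy input: two well-separated clusters of three points each.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
         (10.0, 0.0), (11.0, 0.0), (10.0, 1.0)]
centroids, groups = partition(cloud, k=2)

large = h0_intervals(centroids)                   # approximates large-scale features
small = [h0_intervals(g) for g in groups if g]    # reconstructs small features per partition
```

In this toy case the centroid computation recovers the single large-scale separation between the two clusters, while the per-partition computations recover the short intervals that the centroid reduction alone would discard.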
Current optimization algorithms are implemented in the framework and detailed to present the current state of tools for persistent homology and their usage in the LHF application. Analysis of the approach against theory, experimental results, and proofs provides an encompassing overview of Partitioned Persistent Homology. A detailed treatment of the considerations for parameter selection frames the use of Partitioned Persistent Homology for big data research.

¹ Support for this work was provided in part by the National Science Foundation under grants ACI–1440420 and IIS–1909096.

Acknowledgments

I would like to thank my advisor, Philip A. Wilsey, for support and guidance during the arduous process of exploring this work. The opportunities and challenges set forth by the entire research team I have worked with over the past few years have reshaped my approach to engineering, problem solving, and analysis of processes beyond the scope of this paper. To the many individuals who have participated in the high performance algorithms group, I express my gratitude for the conversations we have shared and the adversities we have overcome through our joint studies.

I would also like to thank my colleagues from Interstates. My experience in the industry has continued to drive my curiosity about the links between academic research and the advancement of technology as it evolves into business. I am humbled by the technologies that I have not grown to understand, and without the support of my colleagues my drive to contribute to innovative solutions would prove more exhausting.

Finally, I would like to express gratitude to my family, friends, and acquaintances who have unconditionally supported my endeavors and explorations over the years. The countless conversations, explanations, and questions have encouraged me to seek deeper understanding of everything I explore.

Contents

1 Introduction
1.1 Motivation and Principle Hypothesis
1.2 Thesis Overview
2 Background
2.1 Persistent Homology
2.1.1 Simplicial Complex: Simplices
2.1.2 Simplicial Complex: Filtration
2.1.3 Analysis and Visualization of Results
2.2 Space Partitioning
3 Related Work
3.1 Related Work in Distributed Data Mining
4 Overview of the Approach
4.1 Partitioning of Point Cloud
4.1.1 Partitioning bounds and error
4.1.2 Partitioning Comparison
4.1.3 Partitioning with k-means++
4.2 Upscaling
4.2.1 Limitations of Upscaling
4.2.2 Upscaling of Large Features
4.3 Regional Persistent Homology
4.3.1 Regional Reconstruction Analysis
4.3.2 Merging Persistence Intervals
5 Implementation Details
5.1 LHF Architecture
5.2 LHF Data Structures
5.3 Limitations
6 Experimental Analysis
6.1 Centroid-Approximated PH
6.2 Centroid-Approximated Upscaling
6.3 Small Feature Reconstruction
6.4 Partitioned Persistent Homology
6.5 Performance: Parallel PPH
7 Discussion
7.1 Suggestions for Future Work

List of Figures

2.1 Example of filtration of a point cloud with Vietoris–Rips complex and resulting persistence intervals.
2.2 Example complex representing 1-simplices.
2.3 Example complex representing 1-simplices and 2-simplices.
2.4 Data flow diagram of computation of persistent homology on a point cloud.
2.5 Flamingo triangulated mesh point cloud and resultant persistence intervals displayed as a barcode diagram (center) and a persistence diagram (right).
4.1 Persistence interval accuracy comparison for various partitioning algorithms at set reduction levels.
4.2 Reduction of the flamingo triangulated mesh model with k-means++.
4.3 Example of created feature due to partitioning of the point cloud.
4.4 Example of lost feature due to partitioning of the point cloud.
4.5 Representation of point cloud connections formed by the MST when max ≥ d(x_1, x_2) ≥ 2r_max. Red lines indicate duplicates identified in two partitions.
The blue line represents the approximated distance in P', and the green line represents the true distance for the H_0 homology.
5.1 Limitations of Ripser for computing persistent homology on 128GB RAM. Values on the x-axis did not finish.
5.2 FastPersistence pipeline and reduced mode with a preprocessor.
5.3 Upscaled persistent homology mode and the reduced persistent homology mode pipelines.
5.4 Partitioned persistent homology pipeline and the iterative mode for recursive partitioning.
6.1 Change in heat kernel distance at various k-means++ reduction levels for all experimental datasets.
6.2 Reduction effect on H_0 features.
6.3 Reduction effect on H_d, d > 0 features.
6.4 Average improvement (%) over all data sets with upscaling persistent homology for Heat Kernel (HK) and filtered Heat Kernel (fHK) distances.
6.5 Average improvement (%) in heat kernel distance over all data sets with regional persistent homology at various levels of scalars (s · r_max). Results compared to H_0 feature heat kernel distance (Table 6.5).
6.6 Heat Kernel Distance improvement on selected datasets alongside average. Results compared to centroid approximated heat kernel distance (Table 6.1).
6.7 PPH Speedup for selected datasets alongside average. Results compared to standard fastPersistence performance.
6.8 Performance results based on varying input parameters of the PPH approach.
6.9 PPH parallel speedup H_1 features of synthetic dSphere; 10k points embedded in R^6.
6.10 PPH parallel speedup H_2 features of synthetic dSphere; 10k points embedded in R^6.

List of Tables

6.1 Effect of k-means++ reduction on the heat kernel distance for the experimental datasets.
6.2 Effect of k-means++ reduction on the heat kernel distance on H_{d>0} for the experimental datasets.
6.3 Heat kernel distance and improvement (%) of upscaling over centroid approximated persistent homology for H_{d>0} features.
Results compared to H_d, d > 0 feature heat kernel distance (Figure 6.3). Dashed lines indicate when upscaling failed due to physical memory limitations.
6.4 Heat kernel distance and improvement (%) of upscaling over filtered centroid approximated persistent homology.
6.5 Effect of k-means++ reduction on the heat kernel distance on H_{d=0} for the experimental datasets.
6.6 Heat kernel distance and improvement (%) of regional persistent homology over centroid approximated persistent homology of H_0 features.