
Applications of Persistent Homology and Cycles

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Sayan Mandal, B.Tech., M.E.

Graduate Program in Department of Computer Science and Engineering

The Ohio State University

2020

Dissertation Committee:

Dr. Tamal Dey, Advisor

Dr. Yusu Wang

Dr. Raphael Wenger

© Copyright by

Sayan Mandal

2020

Abstract

The growing need to understand and process data has driven innovation in many disparate

areas of data science. The computational biology, graphics, and machine learning communities,

among others, are striving to develop robust and efficient methods for such analysis. In this work, we demonstrate the utility of topological data analysis (TDA), a new and powerful

tool to understand the shape and structure of data, to these diverse areas.

First, we develop a new way to use persistent homology, a core tool in topological data

analysis, to extract machine learning features for classification. Our work focuses on

improving modern image classification techniques by considering topological features. We

show that incorporating this information into supervised learning models allows our models to

improve classification, thus providing evidence that topological signatures can be leveraged

for enhancing some of the pioneering applications in computer vision.

Next, we propose a topology-based, fast, scalable, and parameter-free technique to explore

a related problem in protein analysis and classification. On an initial simplicial complex

built using constituent protein atoms and bonds, simplicial collapse is used to construct a

filtration which we use to compute persistent homology. This is ultimately our signature for

the protein molecules. Our method, besides being scalable, shows sizable time and memory

improvements compared to similar topology-based approaches. We use the signature to train

a protein domain classifier and compare against state-of-the-art structure-based protein signatures,

achieving a substantial improvement in accuracy.

Besides considering the intervals of persistent homology, as in our first two applications,

some applications need to find representative cycles for them. These cycles, especially the

minimal ones, are useful geometric features functioning as augmentations for the intervals

in a purely topological barcode. We address the problem of computing these representative

cycles, termed persistent d-cycles. Since generating an optimal persistent 1-cycle is NP-hard, we propose an alternative set of meaningful persistent 1-cycles that is computable using

an efficient polynomial time algorithm. Next, we address the same problem for general

dimensions. We illustrate the use of an algorithm to generate d-cycles for finite intervals on

a weak (d + 1)-pseudomanifold. We design two specialised software tools to compute persistent

1-cycles and d-cycles respectively. Experiments on 3D point clouds, mineral structures,

images, and medical data show the effectiveness of our algorithms in practice.

We further investigate the use of these representative persistent cycles in the field of

bio-science and technology. Our concluding work tries to understand gene-expression levels

for various organisms that are either infected or under the effect of antigens. We use persistent

cycles to curate both the cohort list and gene-expression levels so as to obtain a “crux” of

better representatives. This, in turn, improves both deep and shallow learning

classifications. We further show that the n-cycles have an unsupervised inclination towards

phenotype labels. The penultimate chapter of this thesis provides evidence that topological

signatures are able to comprehend gene-expression levels and classify cohorts on that basis.

To Mammam and Babai

Acknowledgments

If I have seen further, it is by standing on the shoulders of giants.

Sir Isaac Newton

I would like to thank my Ph.D. supervisor Dr. Tamal K. Dey for his guidance and

support over the past few years. His wisdom and mentoring have been a continuous source of

encouragement to me. He has been a source of inspiration not just in my domain of research

but rather in analytical and independent thinking, resource management, and many other

quintessential areas of being a productive individual. Above all, he has always prioritised

taking care of self-enrichment over academic progress. I honestly thank him for being a true

mentor and steadfast supporter.

This thesis was enriched significantly through helpful discussions with my predecessors

Dr. Dayu Shi, Dr. Alfred Rossi, and Dr. Mickael Buchet. They had a significant impact on my

research and on my understanding of topological data analysis in general by answering all my

queries, no matter how mundane. Dayu helped me comprehend and maintain the open

source code repositories of our team, including Simpers, SimBa, and ShortLoop. His input extended beyond these works and helped me a lot in building later software:

Persloop and Pers2cyc-fin.

The members of the TGDA group have contributed immensely to my personal and

professional time at Ohio State. The group has been a source of friendships as well as good

advice and collaboration. Among them, Tianqi Li and Ryan Slechta need special mention.

They have been with me through the tough times in grad school, especially when research was at a stalemate. Ryan, in particular, has been instrumental in refining my research ideas and

reports.

The enjoyment of learning increases when we share our thoughts and work

together in a constructive way. Any productive work, including research, is a collaborative

effort and I have been lucky to have worked with William Varcho, Tao Hao, and Soham

Mukherjee as my co-authors. I have learned much from them and gained valuable insights

from our collaborations.

I had the opportunity to take several courses during my time as a graduate student

at The Ohio State University. I would like to thank all faculty members who have helped

me augment my knowledge base. Special mention to Dr. Yusu Wang, Dr. Ten-Hwang Lai,

Dr. Tamal Dey, Dr. Hanwei Shen, and Dr. Jim Davis, whose lectures have been truly

enjoyable and inspired me to improve as an educator as well. The course materials they covered

included state-of-the-art topics and have been directly influential in my research.

Research meetings have long been a medium to exchange ideas and insights. In fact,

2,500 years ago the gymnasiums of Athens were a hotbed for discussions in mathematics,

literature, and philosophy, frequented by Plato, Socrates, and Alcibiades. In this era of digital

connectivity, we have serious discussions as to whether traditional classrooms or meet-and-greets

are still relevant in academia. I have always been a proponent of these traditional meet-ups

and strongly believe real-time interaction with scientists helps spur a plethora of

research insights and ideas. I would therefore take this opportunity to thank the organisers

and committee members of VMV 2017, WABI 2018, and CTIC 2019 where I have met many

stalwarts in our field of research and learned a lot. Helpful discussions with faculty and

peers have given me a great deal of insight and inspiration for future research methodologies

and ideas.

I would also take this opportunity to thank my committee members, Dr. Yusu Wang and

Dr. Raphael Wenger, for their feedback throughout my graduate career. From the

candidacy exam to the thesis committee meetings, their constructive feedback has always been

helpful.

Finally, I would like to thank the National Science Foundation for supporting the research work presented here.

Vita

June, 1990 ...... Born - Kolkata, India

1993 ...... School - St. Stephens’ School, Kolkata, India

2008 ...... B.Tech - Computer Sc. and Technology, WBUT, India

2012 ...... M.E. - Computer Sc. and Engineering, IIEST Shibpur, India

2014 ...... Senior Research Fellow, Computer Science and Engineering, IIT Kharagpur, India

2015 ...... University Fellow, Computer Sc. and Engineering, The Ohio State University, USA

2016 ...... Graduate Teaching/Research Assistant, Computer Sc. and Engg., The Ohio State University, USA

2019 ...... Graduate Research Intern, Health Care and Life Science, TJ Watson Labs IBM, Yorktown Heights, USA

2019-present ...... Graduate Research Assistant, Computer Sc. and Engg., The Ohio State University, USA

Publications

Research Publications

T. Dey, S. Mandal, S. Mukherjee, “Gene expression data classification using topology and machine learning models”. arXiv, May 2020.

S. Mandal, A. Guzman-Saenz, N. Haiminen, L. Parida, S. Basu, “A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data”. AICoB 2020: International Conference on Algorithms for Computational Biology, LNCS/LNBI Springer, April 2020.

T. Dey, T. Hao, S. Mandal, “Computing Minimal Persistent Cycles: Polynomial and Hard Cases”. Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 10.5555/3381089.3381247, 2587–2606, Jan 2020.

T. Dey, T. Hao, S. Mandal, “Persistent 1-Cycles: Definition, Computation, and Its Application”. Computational Topology in Image Context (CTIC 2019), Lecture Notes in Computer Science, 10.1007/978-3-030-10828-1_10, 123–136, Dec 2018.

T. Dey, S. Mandal, “Protein Classification with Improved Topological Data Analysis”. 18th International Workshop on Algorithms in Bioinformatics, 10.4230/LIPIcs.WABI.2018.6, 1–13, Aug 2018.

T. Dey, S. Mandal, W. Varcho, “Improved Image Classification using Topological Persistence”. Vision, Modeling and Visualization: The Eurographics Association, 10.2312/vmv.20171272, Sep 2017.

Fields of Study

Major Field: Department of Computer Science and Engineering

Table of Contents

Page

Abstract...... ii

Dedication...... iv

Acknowledgments...... v

Vita...... viii

List of Tables...... xii

List of Figures...... xiii

1. Introduction...... 1

1.1 Topological persistence of point cloud data...... 3

1.2 Topological persistence via Successive Collapses...... 5

1.2.1 Subsampling...... 7

1.3 Representative cycles for Homology classes...... 10

1.4 Topologically Relevant Gene Expression and Cohort Analysis...... 12

1.4.1 Contour of the thesis...... 14

2. Some Applications of Persistent Homology...... 15

2.1 General applications of Persistent Homology...... 15

2.2 Work on Computer Vision & Graphics using Persistent Homology...... 16

2.3 Work on Bio Science using Persistent Homology...... 17

2.4 Work on Machine Learning and Persistent Homology...... 19

2.5 Work on representative cycles for homology group...... 20

3. Image Classification...... 21

3.1 Feature Vector Generation...... 24

3.2 Algorithm for fast computation of topological signature...... 25

3.2.1 Choosing Parameters...... 26

3.3 Results...... 27

3.3.1 Feature Vector based Supervised Learning...... 28

3.3.2 CNN and Deep Learning based Training...... 31

4. Protein Classification...... 33

4.1 Topological persistence...... 34

4.2 Collapse-induced persistent homology from point clouds...... 36

4.2.1 Feature vector generation...... 38

4.3 Experiments and results...... 39

4.3.1 Topological description of alpha helix and beta sheet...... 40

4.3.2 Topological Description of macromolecular structures...... 43

4.3.3 Supervised learning based classification models...... 45

5. Representative Persistent cycles...... 51

5.1 Persistent 1-cycles...... 51

5.2 Definitions: Persistent Basis and Cycles...... 52

5.3 Minimal Persistent q-Basis and Their Limitations...... 53

5.4 Computing Meaningful Persistent 1-Cycles in Polynomial Time...... 54

5.5 Results and Experiments...... 57

5.5.1 Persistent 1-Cycles for 3D Point Clouds...... 58

5.5.2 Image Segmentation and Characterization Using Persistent 1-Cycles...... 58

5.5.3 Hexagonal Structure of Crystalline Solids...... 60

5.6 Finite Persistent n-cycle...... 62

5.7 Minimal persistent d-cycles of finite intervals for weak (d + 1)-pseudomanifolds...... 63

5.8 Experimental results...... 65

6. Topologically Relevant Cohort and Gene Expression...... 70

6.1 Introduction...... 70

6.2 Idea...... 74

6.3 Methods...... 75

6.3.1 Data...... 76

6.3.2 Topo-Curated Cohort...... 80

6.3.3 Topo-Relevant Gene Expression...... 84

6.3.4 Neural Network Architecture...... 86

7. Contributions and Future Work...... 90

Bibliography...... 93

List of Tables

Table Page

1.1 Compute 1-dimensional homology: Time comparison with SimBa...... 9

3.1 Precision (P) and Recall (R) for different methods with and without using topological features. (P-T and R-T) indicate Precision and Recall without topology, whereas (P+T and R+T) indicate Precision and Recall with topology. Note that except for Cal-256, which has 20 classes, all the other datasets have 10 classes...... 28

3.2 Qualitative comparison of our algorithm with SimBa with and without 2- dimensional homological features. Acc: Accuracy, Pr: Precision, Re: Recall. I- 5 classes of CalTech256, II - CIFAR-10...... 30

4.1 Time comparison of our method against SimBa and VR...... 44

4.2 Accuracy comparison with FragBag and Cang...... 44

4.3 Classification accuracy for different techniques on Protein dataset. SC: SCOP95, CA: CATH95, Sf: Superfamily, Fm: Family, F: Filtered, T: Topology, H: Ho- mology, C: Class, 5f: 5fold, A: Architecture, Si: Similarity...... 48

6.1 Classification using topo-relevant cohort. Each of the datasets is explained in section 6.3.1. The # symbol indicates the size of each dataset. ‘-’ in the table means the stats were too low: the relevant classifier was unable to classify the given data...... 78

6.2 Neural network result. The column TP (Z) indicates the results on the reduced gene set using topology. Full indicates results on the full gene set. Tr-Loss, Tr-Acc, Tr-F1, Tr-Prec, Tr-Rec are the loss, accuracy, F1-score, precision, and recall on the training data, whereas the prefix Ts- indicates the same on the test set...... 88

List of Figures

Figure Page

1.1 Persistence of a point cloud in R2 and its corresponding barcode...... 3

1.2 A visual example of the sequence generated by subsampling a point set via the Morton Ordering...... 5

1.3 Visualization of a Morton Ordering of point cloud data. Points with similar hue are close in the total ordering...... 10

3.1 Top: The topological features act as inputs to the fully connected part of our modified convolutional neural network. Bottom: Using modified Fisher Vector on SIFT along with topological features for training...... 22

3.2 One dimensional homology for an image in Caltech-256. The bars in blue are above the widest gap and chosen as feature vector...... 25

3.3 T-SNE was used as an aid in picking parameters for the computation of barcodes. (a) Good clustering on MNIST. (b) Bad clustering on CIFAR-10. (c) Better clustering on CIFAR-10...... 27

3.4 Comparison of accuracy and precision with and without topological features. (a) Performance versus training size. (b) Fluctuations in precision across different classes...... 30

4.1 Workflow...... 35

4.2 Weighted Alpha complex for protein structure...... 37

4.3 Top: Collapse of weighted alpha complex generated from protein structure via simplicial map. Bottom: Same algorithm applied to a kitten model in R3 .. 38

4.4 (a) Top: Alpha helix from PDB 1C26, Middle: Barcode of [114], Right: Our Barcode. (b) Left: Beta sheet from PDB 2JOX, Middle: Barcode of [114], Bottom: Our Barcode. Each segment of the barcodes shows β0 (top) and β1 (bottom)...... 40

4.5 Barcode and Ribbon diagram of (Left): PDB: 1c26. (Right): PDB: 1o9a. Diagram courtesy NCBI [82]...... 42

4.6 Heatmap correlating secondary structure against our feature vector. Each column in the heatmap is the feature vector...... 44

4.7 Plot showing accuracy against varying training data size. 100(%) indicates the entire training and test data...... 44

4.8 Left: a) Difference in precision and recall from FragBag. Middle: b) Difference in precision and recall from [16]. Right: c) ROC curve for SVM classification of our algorithm...... 47

5.1 (a) Point cloud of Botijo model. (b,c) Barcode and persistent 1-cycles for Botijo, where the 3 longest bars (dark blue, light blue, and green) have their corresponding persistent 1-cycles drawn with the same colors. (d,e) Barcode and persistent 1-cycles for the retinal image, with each green cycle corresponding to a red bar...... 52

5.2 PersLoop user interface demonstrating persistent 1-cycles computed for a 3D point cloud (a) and a 2D image (b), where green cycles correspond to the chosen bars...... 57

5.3 Persistent 1-cycles (green) corresponding to long intervals computed for three different point clouds...... 58

5.4 Persistent 1-cycles computed for image segmentation. Green cycles indicate persistent 1-cycles consisting of only one component (|G| = 1) and red cycles indicate those consisting of multiple components (|G| > 1). (a,b) Persistent 1-cycles for the top 20 and 350 longest intervals on the nerve dataset. (c) Persistent 1-cycles for the top 200 longest intervals on the Drosophila larvae dataset...... 59

5.5 (a) Hexagonal cyclic structure of silicate glass. (b) Persistent 1-cycles computed for the green bars with red points denoting silicate atoms and grey points denoting oxygen atoms. (c) Persistent 1-cycles computed for the red bars. (d) Barcode for the filtration on silicate glass...... 60

5.6 A weak 2-pseudomanifold K embedded in R2 with three voids. Its dual graph is drawn in blue. The complex has one 1-connected component and four 2-connected components, with the 2-simplices in different 2-connected components colored differently...... 63

5.7 An example of the constructions in our algorithm showing the duality between persistent cycles and cuts having finite capacity for d = 1. (a) The input weak 2-pseudomanifold K with its dual flow network drawn in blue, where the central hollow vertex denotes the dummy vertex, the red vertex denotes the source, and all the orange vertices (including the dummy one) denote the sinks. All “dangled” graph edges dual to the outer boundary 1-simplices actually connect to the dummy vertex and these connections are not drawn. (b) The partial complex Kβ in the input filtration F, where the bold green 1-simplex denotes σβ which creates the green 1-cycle. (c) The partial complex Kδ in F, where the 2-simplex σδ creates the pink 2-chain killing the green 1-cycle. (d) The green persistent 1-cycle of the interval [β, δ) is dual to a cut (S, T) having finite capacity, where S contains all the vertices inside the pink 2-chain and T contains all the other vertices. The red graph edges denote the edges across (S, T), and their dual 1-chain is the green persistent 1-cycle...... 64

5.8 (a,b) Cosmology dataset and the minimal persistent 2-cycles of the top five longest intervals. (c,d) Turbulent combustion dataset and its corresponding minimal persistent 2-cycles...... 66

5.9 (a,b) Minimal persistent 2-cycles for the hurricane model. (c) Minimal per- sistent 2-cycles of the larger intervals for the human skull. i: Right and left cheek muscles with the right one rotated for better visibility. ii: Right and left eyes. iii: Jawbone. iv: Nose cartilage. v: Nerves in the parietal lobe..... 67

5.10 (a) Cubic lattice structure of BaTiO3 (courtesy Springer Materials [91]) with diffused structure in backdrop. (b) Minimal persistent 2-cycles computed on the original function. (c) Minimal persistent 2-cycles computed on the negated function. (d) Minimal persistent 2-cycles computed on the negated function of a tetragonal lattice structure of BaTiO3. The inlaid picture [91] illustrates the bonds of the structure...... 68

6.1 (a): Flowchart for topo-relevant gene expression extraction. Refer to Section 6.3.3 for details. (b): Flowchart for topo-curated cohort extraction. Refer to Section 6.3.2 for details. In both, bold lines show the path to take for training or testing large data. Dotted lines used in Figure 6.2...... 72

xv 6.2 Flowchart of proposed pipeline. For units Topo-relevant gene pipeline and Topo-curated cohort pipeline we follow the dotted lines in Fig. 6.1(a) and (b) respectively...... 74

6.3 (a) Filtration F = K0, ..., K9 explaining persistence pairing. (b) Different H1 cycles for the same homology class...... 74

6.4 t-SNE on the entire cohort point cloud (D0). Red vertices indicate cohorts included in the top 100 H2 cycles whereas blue indicate otherwise...... 77

6.5 Plot of geodesic centers for dominating cycles using t-SNE. Red vertices: non-dominating cycles. Graded green points: dominating cycles. Alpha values indicate the ratio of the dominating phenotype in each cycle versus the other labels...... 81

6.6 Three figures plotting individual dominating cycles for gene dataset D0. These cycles actually reside in m-dimensional space and are projected down to 3D using principal component analysis. The colors indicate cohort phenotype labels X ∈ {0, 1, 2}...... 82

6.7 (a): Count of vertex labels in individual H2-cycles for D0. The red points indicate cycles having phenotype labels 0 and 1, blue indicates cycles with labels 1 and 2, whereas green (very few in the top 500 H2 cycles) indicates labels 0 and 2. (b): Count of vertex labels in individual H2-cycles for D1. Red indicates cycles having equal phenotype-labelled vertices. Blue and cyan indicate prevalence of labels 0 and 1 respectively. In both diagrams, black points indicate cycles having a single phenotype label...... 84

6.8 Three 2-cycles for gene dataset D6 with colors indicating gene function labels. The total number of colors indicates the κ value of the cycle. Note that a gene can be responsible for several functionalities. The legend in this plot takes into account only the single functionality which contributes towards the maximal cover of the representative cycle. (a,b) Low κ: 3. (c) High κ: 6...... 86

6.9 Comparison of (a) Accuracy, (b) F1-score, (c) Loss function for 50 epochs. For the TDA-curated data, red and yellow lines represent train and test scores respectively. For the full data, they are represented by the green and blue lines...... 89

Chapter 1: Introduction

We are living in an era in which a plethora of data demands analysis using powerful tools from statistics and learning to solve complex real-world problems. These problems are extremely stochastic in nature, and formulating mathematical models to describe them is a predicament. Among the contrivances harnessed to explain the structure of such data, topological data analysis has gained significant cognizance in the past decade. The idea behind topological data analysis (TDA for short) finds its roots in homology theory, contrived by the French mathematician Henri Poincaré in his work Analysis Situs. With the advent of big data, TDA gained traction leading to persistent homology [45], where the associated persistence diagrams serve as stable signatures to describe various domains in data science (Ref. Ch. 2). The ability of topological approaches to define structures in high-dimensional data, coupled with mathematical guarantees of robustness and isometric and scale invariance, is what makes it a propitious tool in contemporary data analysis.

In this work, we investigate the creation of potent methods and tools based on persistent homology to solve problems in engineering and science, namely computer vision (Ref. Ch. 3), computer graphics (Ref. Sec. 5.5.1), protein structure (Ref. Ch. 4), medical images and volumetric scans (Ref. Sec. 5.5.2, Sec. 5.8), gene expression levels (Ref. Ch. 6), and material science (Ref. Sec. 5.5.3, Sec. 5.8). In fact, we manifest direct applications of such signatures in image segmentation and characterisation (Ref. Sec. 5.5.2). In addition, they are

harnessed to render supplementary signatures on top of shallow and deep learning based machine architectures (Ref. Sec. 4.2.1, 3.1, 6.3.1).

Besides considering the multiset of intervals included in a persistence diagram, some applications need to find representative cycles for the intervals. In this work, we also address the problem of computing these representative cycles, termed persistent d-cycles. Since it has been shown that the computation of optimal persistent d-cycles is NP-hard, both in the 1-dimensional case and higher [32,36], we propose an alternative set of meaningful persistent 1-cycles that can be computed with an efficient polynomial time algorithm. For d ≥ 2, we show the applications of another polynomial time algorithm on finite intervals. We design software which applies our algorithm to various datasets. Experiments on 3D point clouds, mineral structures, and images show the effectiveness of our algorithm in practice.

Chapter 5 deals with these d-dimensional homology cycles.

We further investigate the use of these representative cycles in bio-science and technology. Chapter 6 deals with how gene expressions and cohorts embedded on these representative cycles in high dimensions provide relevant information about their phenotype.

A main difficulty in using topological signatures for learning and classification tasks is that the state-of-the-art techniques for computing such signatures for large sets of data do not scale up appropriately. Attempting to compute the persistent homology of points in high-dimensional spaces is a roadblock in applying topological data analysis techniques to classification tasks. To work around this bottleneck, we take a different approach to computing a topological signature. In the next section we discuss the traditional method to build topological signatures as well as propose a lighter and faster technique to do the same.


Figure 1.1: Persistence of a point cloud in R2 and its corresponding barcode

1.1 Topological persistence of point cloud data

We can start with point cloud data in any n-dimensional space. However,

to illustrate the idea of persistent homology, we consider a toy example of taking a set

of points in two dimensions sampled uniformly from a structure with two holes (Fig. 1.1a). We

start growing balls around each point, increasing their radius r continually and tracking the

behavior of the union of these growing balls. If we start increasing r from zero, we notice that

at some r = r1 (Fig. 1.1b) both holes are prominent in the union of balls. Further

increasing r to r2 leads to the filling of the smaller hole (Fig. 1.1d). This continues till the value

of r is large enough for the union of balls to fill the entire structure. During the change in the

structure of the union of balls due to increase in radius, the larger of the two holes ‘persists’

for a larger range of r compared to the smaller one. Hence features that are more prominent

are expected to persist for longer periods of increasing r. This is the basic intuition for

topological persistence. The connected components and holes in this example are captured by

calculating a set of birth-death pairs of homology cycle classes that indicate at which value

of r the class is born and where it dies. The persistence is visualized in R2 using horizontal line segments that connect two points whose x-coordinates coincide with the birth and death values of the homology classes. This collection of line segments, as shown in Figure 1.1e,

is called a barcode [20]. The length of each line segment corresponds to the persistence of

either a connected component (H0) or a cycle (H1) in the structure. Hence, the short blue line

segments in H0 correspond to the short-lived components that are formed intermittently as

the radius increases. The two long blue line segments in H1 correspond to the two holes in

the structure, the longer one corresponding to the bigger hole. For computational purposes, the growing

sequence of the union of balls is converted to a growing sequence of triangulations; simplicial

complexes in general, called a filtration. In some cases, cycles called ‘essential cycles’

persist till the end of the filtration. In our example, the red line in H0 is one such instance

of an essential cycle as the final connected component persists till the end.
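The birth-death pairing for connected components (H0) described above can be made concrete with a small sketch: process the pairwise distances in increasing order and merge components with a union-find structure, recording a bar each time a component dies. This is purely an illustration (the function name h0_barcode is ours, and this is not the Simpers pipeline used later in this thesis):

```python
from itertools import combinations
import math

def h0_barcode(points):
    """H0 persistence of a point cloud under the Rips filtration,
    via union-find. Every point is a component born at 0; each merge
    kills one component at the current edge length; the last surviving
    component is the essential class (death = infinity)."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Edges sorted by length drive the filtration.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge: one component dies at d
            bars.append((0.0, d))
    bars.append((0.0, math.inf))     # the essential component never dies
    return bars
```

For three collinear points at 0, 1, and 10, for instance, the two finite bars end at the two merge scales and one essential bar remains, mirroring the red line in the H0 barcode of Fig. 1.1e.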

The ranks of the persistent homology groups, called the persistent Betti numbers, capture

the number of persistent features. For the d-dimensional homology group, we denote this number

as βd. This means β0 counts the number of connected components that arise in the filtration.

Similarly, β1 counts the number of circular holes being born as we proceed through the

filtration. With the above technique, difficulties arise as the radius r increases, since it

leads to a steep increase in time complexity and a combinatorial explosion. In proteins, for

example, an average protein in a database such as CATH [103] has 20,000∼30,000 atoms,

thus creating a point cloud of the same size in R3. Furthermore, the initial complex, including 3-simplices (or tetrahedra), becomes colossal. On average, this complex grows to

(50 ∼ 100)x simplices of dimension up to 4 and becomes exorbitant and quite arduous to

process. For an image with dimension 200x200 pixels, persistence computation takes about

9.6s per image. Now imagine having to do this for several thousands of images which is the

typical size of a database. Building a filtration using this growing sequence of balls is thus

not scalable.

Traditionally, topological persistence for a point cloud is computed on the Vietoris-Rips

(VR) complex where a simplex σ ∈ VRα(P) iff d(p, q) ≤ α for every pair of vertices p, q of σ.

4 As the value of α increases, the filtration follows a sequence of nested simplicial complexes

through which we track the persistent homology classes represented by cycles. For our case,

this essentially determines how long classes of different cycles ‘persist’ as α changes, thereby

generating a signature for each point cloud (which may in turn represent images, proteins,

mineral structures, etc.).
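The definition of VRα(P) can be illustrated with a brute-force sketch that enumerates the simplices up to dimension 2 directly from the pairwise-distance condition. This is for illustration only (our own toy code, not the construction used by any library or by our pipeline); real TDA software uses far more efficient expansions:

```python
from itertools import combinations
import math

def rips_complex(points, alpha, max_dim=2):
    """Enumerate simplices of the Vietoris-Rips complex VR_alpha(P):
    a simplex is included iff every pair of its vertices lies within
    distance alpha of each other."""
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= alpha for j in range(n)]
             for i in range(n)]
    simplices = [(i,) for i in range(n)]             # all vertices
    for dim in range(1, max_dim + 1):
        # Check every candidate vertex subset of size dim + 1.
        for verts in combinations(range(n), dim + 1):
            if all(close[i][j] for i, j in combinations(verts, 2)):
                simplices.append(verts)
    return simplices
```

The size blow-up discussed next is visible here: the inner loop ranges over all (n choose d+1) vertex subsets, which grows rapidly with n and with the dimension.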

The problem with filtrations built using the VR-complex, however, is that, since we connect

every pair of points within distance α to form higher dimensional simplices, there is a steep

rise in size, more so for points living in a high dimension. Hence we use a new

method to calculate the persistence of a point cloud to generate a topological signature for our

classification problems.

Figure 1.2: A visual example of the sequence generated by subsampling a point set via the Morton Ordering

1.2 Topological persistence via Successive Collapses

As a first step towards computing persistence faster, we start by sparsifying the point

cloud by taking an ε-sparse, δ-net subsample of it:

Definition 1. [33] A finite set P ⊆ X is an ε-sample of a metric space (X, d) if for each point x ∈ X there is a point p ∈ P so that d(x, p) ≤ ε. Additionally, P is called δ-sparse if d(p1, p2) ≥ δ for any pair of points in P.

Intuitively, it means that for each point in the initial cloud, we have a point within distance

ε in the subsample, and that no two points are closer than δ to each other in the sparser cloud (we use ε = δ in our experiments later). We then build a simplicial complex C on this sparse point cloud. The exact complex being built depends on the application: chiefly the graph induced complex [40] for our image classification and the weighted alpha complex [43] for describing protein structures. This initial complex is reduced in size by successive edge collapses that collapse vertices as well; see Figure 1.2. In effect, it generates a sequence of simplicial complexes where successive complexes do not nest, but are connected by simplicial maps (these are maps that extend vertex maps to simplices; see [80] for details).

We generate a sequence of subsets of vertices V^i using space filling curves (see the next section for details). This sequence of subsets V^i allows us to define a simplicial map between any two adjacent subsets V^i and V^{i+1} by the following map:

f^i(p) = p, if p ∈ V^{i+1}; otherwise f^i(p) = v, where d(p, v) = inf_{v' ∈ V^{i+1}} d(p, v').

Essentially, each vertex in V^i is either retained in the subsample or mapped to its nearest neighbor in V^{i+1}.
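The vertex map f^i above can be sketched as follows; the helper name vertex_map is ours, and vertices are assumed to be coordinate tuples (a sketch under those assumptions, not the thesis implementation):

```python
import math

def vertex_map(V_i, V_next):
    """The simplicial vertex map f^i: a vertex kept in the next
    subsample V^{i+1} maps to itself; any other vertex maps to its
    nearest neighbor in V^{i+1}."""
    kept = set(V_next)
    f = {}
    for p in V_i:
        if p in kept:
            f[p] = p                                        # retained vertex
        else:
            f[p] = min(V_next, key=lambda v: math.dist(p, v))  # snap to nearest
    return f
```

Applying this map to every vertex of a simplex yields its image simplex, which is how the map on vertices induces the simplicial map on all of C.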

This map on the vertices then induces a map on all higher-order simplices of C. More formally, these maps are collapses of the simplicial complex C.

C^0 --f^1--> C^1 --f^2--> ... --f^n--> C^n

Given a sequence of simplicial maps f^1, f^2, ..., f^n between an initial simplicial complex C^0 and a resulting simplicial complex C^n, the authors in [34] describe an annotation-based technique to generate the persistence of such a sequence. They use a set of elementary inclusions (not needed in our case) and collapses to break down the simplicial maps into their fundamental elements, and from these derive the persistence of the simplicial maps. For our purposes, we apply this annotation-based algorithm to the sequence of maps f^i described above.

Algorithms for computing persistence under simplicial maps are presented in [34], and the authors have released a software (Simpers) for it, which we use for our purpose. The persistence under simplicial maps gives us a set of intervals (barcodes) which we use later in our TDA toolkit for description and classification (Chapters 3, 4).

In the next section we discuss the space filling curve used earlier to find the sequence of subsets of vertices.

1.2.1 Subsampling

The topological signature using simplicial collapse defined in the previous section works on a sequence of subsamples of vertices. To choose a subsample quickly while respecting the local density, we use the Morton Ordering [72,79] of the point cloud. The Morton Ordering provides a total ordering on points in Z^d in which points close in the ordering are usually close in space, thus respecting spatial density (see Figure 1.3). Our data is sparsified by removing every nth point from the current Morton ordering, and then repeating the process until fewer than n points remain, for a chosen n. Note that other algorithms could be used for this purpose, such as running k-means clustering with n clusters and choosing the center of each cluster for removal. However, the Morton Ordering is very fast, since it is based on bit operations and is non-iterative; hence we incorporate it in our algorithm.

Now we will discuss our method for generating a sequence of subsamples, given an initial

simplicial complex C. To do this, we first create a total ordering on our point set V^0. This ordering is explicitly defined by the Morton Ordering mapping M : Z^N → Z given by

    M(p) = OR_{b=0}^{B} OR_{i=0}^{N} ( p_2^{i,b} << (N(b+1) − (i+1)) )

where p_2^{i,b} denotes the b-th bit value of the binary representation of the i-th component of p, 'OR' denotes the binary or operation, and '<<' denotes the left binary shift operation. This mapping is merely a bit interleaving of the different components of p. Applying M to every p ∈ V^0 yields a total ordering on our initial point set.
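The mapping M can be illustrated by the following bit-interleaving sketch (the function name and the explicit bit loop are our own; production implementations typically use magic-number bit tricks for speed). The shift N(b+1) − (i+1) equals bN + (N−1−i), which is what the code computes:

```python
def morton_key(p, bits):
    """Interleave the bits of the integer components of p (component 0
    most significant within each bit group), mirroring the mapping M:
    bit b of component i lands at position b*N + (N-1-i)."""
    n = len(p)
    key = 0
    for b in range(bits):            # bit position within each component
        for i, x in enumerate(p):    # component index
            key |= ((x >> b) & 1) << (b * n + (n - 1 - i))
    return key

# x = 2 (binary 10), y = 3 (binary 11) interleave to binary 1101 = 13:
print(morton_key((2, 3), bits=2))  # 13

# Sorting grid points by their Morton key gives the Z-order traversal:
pts = [(x, y) for x in range(4) for y in range(4)]
pts.sort(key=lambda p: morton_key(p, bits=2))
```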

We can exploit the knowledge that points with similar Morton encodings are found in similar areas to generate a new subset V^1 ⊂ V^0 that respects the underlying density of the

initial point set.

First choose a value n such that 1 < n ≤ |V^0|. The set V^{i+1} can then be defined as

    V^{i+1} = { x_j | x_j ∈ V^i, j ≢ 0 (mod n) }

where x_j is the j-th vertex in the Morton Ordering of V^i. Following this approach, the process can be repeated to create a sequence of subsets

    V^0 ⊃ V^1 ⊃ ... ⊃ V^k,   |V^k| ≤ n

It should be noted that datasets often contain real-valued data instead of the integer values required for the Morton Ordering. To overcome this, we apply a basic scaling to the data as a preprocessing step, and then consider the closest integer point in Z^d when determining the ordering.
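The repeated every-nth decimation of the Morton order can then be sketched as follows (a self-contained illustration of ours, in which the ordering key is supplied by the caller rather than recomputed):

```python
def subsample_sequence(points, n, key):
    """Generate V^0 ⊃ V^1 ⊃ ... by repeatedly dropping every n-th point
    of the current ordering (1-indexed positions j with j ≡ 0 mod n),
    stopping once fewer than n points remain."""
    levels = [sorted(points, key=key)]
    while len(levels[-1]) >= n:
        cur = levels[-1]
        nxt = [x for j, x in enumerate(cur, start=1) if j % n != 0]
        levels.append(nxt)
    return levels

levels = subsample_sequence(list(range(10)), n=3, key=lambda x: x)
print([len(v) for v in levels])  # [10, 7, 5, 4, 3, 2]
```

With a Morton key in place of the identity key above, each level is a density-respecting sparsification of the previous one.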

Table 1.1: Compute 1-dimensional homology: time comparison with SimBa

Data          #points   Dim   SimBa (sec)   Our-Algo (sec)
Kitten        90120     3     35.72         19.05
PrCl-s        15000     18    94.13         28.17
PrCl-l        15000     25    254.37        47.12
Surv-s        252996    3     469.40        165.28
Surv-l        252996    150   1696.59       294.6
Caltech-256   10786     5     8.38          2.27
MNIST         2786      5     1.86          0.56

Finally, we compute the persistence of this sequence of collapse operations (simplicial maps)

connecting successive complexes using the software Simpers [34]. The detailed algorithm

changes with the specific application and is provided in Sections 3.2 and 4.2.

To illustrate the speed-up we gain by our collapse based persistence computation with

Morton Ordering (on graph induced complex), we report its running time on several datasets

ranging from 3D data as geometric meshes to high dimensional data embedded in dimension

as high as 150-dim (Table 1.1). The time taken by the algorithm when run on a random

sample image from the Caltech-256 and MNIST datasets has also been included. We compare

the speed of computation with SimBa [38]. Since the authors of [38] already showed that it

generates results faster than existing techniques, beating its speed indicates the superiority

of our approach in this context. For this comparison, we only compute persistence up to

the one-dimensional homology. While comparing our technique with SimBa, we retain the

default parameter values suggested in the software manual. We test the speed of our method

for computing topological signature on several datasets having dimensions much larger than

three in Table 1.1. The PrCl-s and PrCl-l datasets contain primary circle point clouds formed from natural images [1]. In PrCl-s, each point corresponds to a 5 × 5 image patch, whereas in PrCl-l the patches are of size 7 × 7. We run our algorithm on the Adult data [61]

Figure 1.3: Visualization of a Morton Ordering of point cloud data. Points with similar hue are close in the total ordering.

obtained from the UCI Machine Learning Repository. This is a 14-dimensional classification dataset used to determine whether an individual's income exceeds 50k. We also experimented on the Surviving protein dataset [55]. This includes 252,996 protein conformations, each considered as a point in R^150. We also generate a scaled-down version of this dataset by reducing the dimension to R^3 using PCA, and test on it. As is evident from Table 1.1, our algorithm performs much faster, especially in high

dimensions. Since we avoid the simplex insertions of classical inclusion-based persistence computation, we obtain a significant speed-up. We have successfully applied the technique of generating topological signatures using simplicial collapse in several domains. Details of its deployment in image classification can be found in Chapter 3, while Chapter 4 covers its applications in protein characterisation and classification.

1.3 Representative cycles for Homology classes

So far we have used barcodes computed using persistent homology as features to describe

and learn from data. Next, we look into applications which require finding the representative

cycles for persistent homology. Because the minimal persistent 1-cycles are not stable and

their computation is NP-hard [32], we propose an alternative set of meaningful persistent

1-cycles which can be computed efficiently in polynomial time. The persistent 1-cycle we

calculate for a finite interval is a sum of shortest cycles born at different indices. Since a

shortest cycle is usually a good representative of its class, the sum of shortest cycles ought

to be a good choice of representative for an interval. In many cases, this sum contains only

one component. The persistent 1-cycles computed for such intervals are guaranteed to be

optimal. In fact, we show experimentally that such optimal intervals occur quite frequently. In particular, for some image datasets, nearly all computed persistent 1-cycles contain only one

component and hence are minimal.

We also deal with a practical problem of the previous algorithm: the unnecessary computation of tiny intervals regarded as noise. Users are generally more

interested in finding representative cycles for the significantly large intervals only. The

previous algorithm computes 1-cycles even for tiny intervals which are considered as noise in

most applications. We present an improved algorithm where we only compute the shortest

cycles at the birth indices whose corresponding intervals contain the input interval [b,d).

Since a user often provides a long interval, the intervals containing it constitute a small

subset of all the intervals. This makes our latest algorithm run much faster in practice.

Since it is NP-hard to compute minimal persistent cycles in 1-dimensional homology

groups, this naturally leads to the following question: are there other interesting cases

beyond 1-dimension for which minimal persistent cycles can be computed in polynomial time?

In a follow-up work, it is again shown that when d ≥ 2, computing minimal persistent d-cycles for finite intervals is NP-hard in general [32]. This work also describes computing minimal

persistent d-cycles for finite intervals based on a special but important class of simplicial

complexes, called weak (d + 1)-pseudomanifolds. Our chapter on representative persistent cycles also explores applications of this algorithm in several domains.

We announce a software based on our algorithm to generate tight persistent 1-cycles on 3D

point clouds and 2D images. We experiment with various datasets commonly used in geometric

modeling, computer vision and material science, details of which are illustrated in Section 5.

The software, named PersLoop, along with an introductory video and other supplementary

materials are available at the project website http://web.cse.ohio-state.edu/~dey.8/PersLoop. We present a follow-up software to compute minimal d-dimensional persistent

cycles for finite intervals. Experiments with representative 2-cycles on scientific data indicate

that the minimal persistent cycles capture various significant features of the data. In Section 5.8, we show how these cycles extract information from medical data, scientific visualisation data, as well as molecular configurations. The software is available online at https://github.com/Sayan-m90/Minimum-Persistent-Cycles/tree/master/pers2cyc_fin.

1.4 Topologically Relevant Gene Expression and Cohort Analysis

The rapid progress of genome-scale sequencing has yielded comprehensive gene lists for different organisms. Interpreting high-throughput gene expression data continues to require mathematical tools that recognize the shape of the data in high dimensions. The fundamental challenge in modelling a mathematical structure to explain these high-dimensional data is the stochastic nature of biological processes and the associated noise acquired during the mining process. In the penultimate chapter of this thesis, we employ the representative persistent cycles discussed in the forthcoming chapters to curate gene expression data. This work differs from the preceding works in two aspects:

(1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors used for classification. In contrast, this work curates said features to obtain a "crux" of better representatives. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most prior works (including our work on image and protein analysis in Chapters 3 and 4) employ barcodes obtained from topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data points and said barcodes. The representative persistent cycles we use allow us to directly identify genes involved in similar processes.

We look into two problems in the understanding of genome sequence where scientists

have shown particular interest. A genome-wide association study (GWAS) is a method to

link a subset of genes to a particular disease or physical phenomenon in an organism. Since

the number of gene expressions in a cohort profile is far greater than the number of sample cohorts, it becomes important in these cases to identify a subset of genes whose expression levels reflect the phenotype of the cohorts. Conversely, it is often the case that some cohorts have incorrect or uncorrelated data due to instrument or manual error. We find in practice that the elimination of such instances leads to better prediction scores and performance. We delve into both these problems of identifying relevant cohorts and genes in Chapter 6. Since genes have previously shown clustering tendencies [46,86], we investigate whether they contain higher-dimensional topological information as well. We use the representative n-cycles discussed in the forthcoming chapters to extract

this information, and results based on real-life gene expression data of various organisms are encouraging. The topology-relevant curated data that we obtain provides reasonable improvement in both shallow and deep machine learning based classification. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts on that basis.

1.4.1 Contour of the thesis

This thesis centers on the theme of building topological tools to address an ensemble of problems in science and engineering. In the next chapter, we give a comprehensive review of some of the pivotal findings in the domain. Chapter 3 employs the subsampling-based techniques discussed in Section 1.1 to classify images. Chapter 4 uses a similar strategy, with further optimisations, to show tractable performance in protein classification. In contrast to the image classification, where we augmented topological features with traditional feature vectors to show learning improvement, this work uses unaided persistent barcodes to surpass the state of the art.

With these two chapters dealing with applications of persistent barcodes directly, our later chapters investigate representative cycles. Chapter 5 describes the computation and applications of persistent 1-cycles. The same chapter uses a generalised algorithm (Section 5.6) to compute d-dimensional persistent cycles for finite intervals on a special type of simplicial complex. This section exhibits some interesting applications and instances of representative persistent 2-cycles in Z2. Chapter 6 utilises the algorithms of Chapter 5 to analyse gene expression profiles. We show the unsupervised ability of the topological features to identify phenotype labels, and adopt the same in supervised learning as well. In sum, these four chapters articulate how we build several tools in topology and show their utility in a myriad of applications. Finally, we conclude our work in Chapter 7.

Chapter 2: Some Applications of Persistent Homology

In this chapter, we provide a brief survey of the different domains in which topological data analysis, specifically persistent homology, has found applications.

2.1 General applications of Persistent Homology

The idea of persistent homology has found applications in several areas including robotics [41], sensor networks [28,52], and various other domains [53,71,98].

Several structures in material science have been analysed using persistent homology. Nakamura et al. [81] proposed a methodology based on persistent homology to describe Medium Range Order (MRO) in amorphous solids. Their experiments on crystalline, random and amorphous structures show the power of persistence diagrams to explain geometric structures in single and many-body atomic configurations. For instance, in glass materials, the presence of characteristic curves in a function over the persistence diagram implies the presence of MRO. For crystalline solids, the periodic structure yields a few island supports in the persistence diagram with high multiplicity. The diagonal region in this diagram corresponds to the secondary holes that represent distortion from the primary holes. In effect, this paper provides strong evidence of the importance of persistence diagrams in the analysis of material structures. In another work, Hiraoka et al. [56] show similar results, distinguishing the different states of matter using persistence diagrams. They also study thermal fluctuations and strains on molecules using persistent homology.

The idea of persistent homology has found application in time series data analysis as well. The authors of [109] identify qualitative changes in time series using delay embeddings and topological data analysis. This delay-variant embedding method reveals multiple time-scale patterns in a time series. The authors also combine these features with kernel techniques in machine learning algorithms to classify general time-series data. In another work, Perea et al. [88] use a sliding-window technique and persistence to compute the periodicity of signals.

2.2 Work on Computer Vision & Graphics using Persistent Homology

In computer vision, persistent homology has been used for a wide array of applications

ranging from image segmentation [65] to shape characterization [6]. We discuss some of the

recent works in vision and computer graphics in this section.

Carrière et al. [22] used persistent homology to estimate 3D shapes. They work on

generating stable signatures to describe compact smooth surfaces in R^3. Given a shape S,

they calculate persistence by growing geodesic balls around each point x ∈ S. They vectorise the persistence diagrams by treating them as metric spaces and using pairwise distances. Thus for

each shape, they have a set of vectors, one for each point in the shape. Multidimensional

Scaling on these vectors shows that there is some continuity between vectors with identical

labels, which suggests that the signatures vary continuously along the shape. These vectors are in turn used as features for machine learning. Results of classifying shapes based on these signatures are quite encouraging. In a similar application, Thomas et al. [10] use persistent homology to estimate 3D shape poses. Their work uses the intervals obtained in persistent homology as features for pooling in a bag-of-words approach. Their results fare better than the state-of-the-art techniques as well.

16 In a work related to image analysis of medical data, [104] studies the cell arrangement of

microscopic images of breast cancer using topological data analysis. They take each nucleus as a point, weighted by its mass. Using this, they

compute the persistent homology of the VR complex starting from a weighted point cloud.

They classify different cancer types and demonstrate that the topological features contain

useful complementary information to image-appearance based features that can improve

discriminatory performance of classifiers.

Two more recent works deserve mention. The first one is by Leon et al. [70], who use

filters on gaits of human silhouettes to build simplicial complexes. They then use these

complexes for gender classification. In another work, Aras et al. [5] use persistent homology

to analyse image tampering. They utilise the non-uniformity in the Local Binary Pattern of

images to build simplicial complexes. They use the number of connected components in

this simplicial complex for different threshold values as features for their classifier. Their

results show that the persistent homology sequence defines a discriminating criterion for the

morphing attacks (i.e. distinguishing morphed images from genuine ones).

2.3 Work on Bio Science using Persistent Homology

In bio-science, topological data analysis has found applications in medical imaging [26,

110], molecular architecture [92,97], and genomics [14], among others.

Topological structures have also been used to analyse viruses. Emmett et al. worked on influenza [47] to show a bimodality in the distribution of intervals in the persistence diagram.

This bimodality indicates two scales of topological structure, corresponding to intra-subtype

(involving one HA subtype) and inter-subtype (involving multiple HA subtypes) viral re-

assortments. These results on viruses suggest that persistent homology can also be used to study other forms of reticulate evolution. Overall, this paper presents clear examples of

topological structures demonstrating different biological processes. In another work, Parida

et al. [87] used topological characteristics to detect subtle admixture in related populations.

In [99], gene expressions from peripheral blood data were used to build a TDA network model, combined with discrete methods, to look into routes of disease progression.

Persistent homology has also been employed in [42] for comparison of several weighted gene

co-expression networks.

Persistent Homology has also been used to identify DNA copy number aberrations [4].

Their experiments found a new amplification in 11q at the location of the progesterone

receptor in the Luminal A subtype. Seemann et al. [100] used persistent homology to identify

correlated patient samples in gene expression profiles. Their work focuses on the H0 homology class, which is used to partition the point cloud. The famous paper by Nicolau et al. [84]

identified a subgroup of breast cancers using topological data analysis in gene expressions.

Several works [93,107] on the use of machine learning techniques on gene expression profiles have shown promising results. Kong et al. [63] used random forests to extract features for their neural network architecture. They investigate a problem similar to our 'topo-relevant gene' problem, and their results show significant improvement. [58] analyzes gene expression data to classify cancer types. Different techniques of supervised learning are used to understand genes and classify cancer. The authors of [112] use machine learning to identify novel diagnostic and prognostic markers and therapeutic targets for soft tissue sarcomas. Their work shows an overlap of three groups of tumors in their molecular profiles.

The authors of [16] characterised proteins using persistent homology. Since we will compare our algorithm with this work in Section 4, we explain its working in some detail. The work uses amino acid molecules as points to build a VR filtration. Based on this filtration,

the authors chose a feature vector of length 13, based on number, length, birth and death

time, average, and summation of certain specific intervals. This feature vector was used as

input to supervised classification. This work shows that the classification accuracy for protein

structures is improved even by choosing feature vectors naively without any statistical basis.

In another work [18], the authors have used barcodes to describe the secondary structure

of proteins such as alpha helices and beta sheets. They also employ said barcodes to analyse

protein elastic network models.


2.4 Work on Machine Learning and Persistent Homology

There have been experiments on different kernels using persistence, including sliced Wasserstein [21], persistence scale space [94], weighted Gaussian [66], and persistence Fisher kernels on Riemannian manifolds [68].

Some works use persistence diagrams as features for machine learning classifiers by computing statistical measures on the intervals. The authors of [15,17] have used a binning process, collecting values at the grid points. Other works [16,65] used the top 'n' intervals as features. Bubenik [12] introduced persistence landscapes, which were subsequently used as feature vectors either through binning or as intensity maps for neural networks. Recently, persistence diagrams have been represented using a persistence surface function [2]. The persistence surface can be discretized by a Cartesian grid into image data. By integrating the persistence surface function over each grid cell (or pixel), a persistence image is obtained. These images can in turn be fed directly into classifiers.

2.5 Work on representative cycles for homology groups

The computation of representative cycles for homology groups with Z2 coefficients has

been extensively studied over the decades. While a polynomial time algorithm computing

an optimal basis for first homology group H1 [39] has been proposed, finding an optimal

basis for dimension greater than one and localizing a homology class of any dimension are

proved NP-hard [25]. There are a few works addressing the problem of finding representatives

for persistent homology, some of which compute an optimal cycle at the birth index of an

interval but do not consider what actually dies at the death index [48,50].

Obayashi [85] formalizes the computation of optimal representatives for a finite interval as

an integer programming problem. He advocates solving it with linear programs though the

correctness is not necessarily guaranteed. Wu et al. [113] proposed an algorithm for computing

an optimal representative for a finite interval with a worst-case complexity exponential in the cardinality of the persistence diagram.

Chambers et al. [23] proved that the localization problem over dimension one is NP-hard when the given simplicial complex is a 2-manifold. Several other works [11, 24,35,49] address variants of the two problems while considering special input classes, alternative cycle measures,

or coefficients for homology other than Z2.

Chapter 3: Image Classification

Image classification has been an active topic of interest for researchers in both computer vision and machine learning for quite some time. Affordable high-quality cameras and online social media platforms have provided the internet with billions of images, most of which are raw and unclassified. Recently developed classification techniques show very promising results, even on large datasets such as ImageNet [29]. In this chapter, we investigate whether topological signatures arising from topological persistence can provide additional information that could be used in the classification of images.

For supervised image classification with machine learning techniques, we investigate both

feature vector based shallow classification and neural network based deep classification; see

Figure 3.1 for a schematic diagram. For the shallow learning method, we use one of the

state-of-the-art techniques using SIFT followed by Fisher-Vector encoding [96] to generate the

feature vectors. Classification using these vectors has an accuracy of 59.7% on the Caltech-256 dataset for 60 or more training samples. Classification using Convolutional Neural Networks was tested with another state-of-the-art model: AlexNet [64]. We reproduced the experiments with and without our topological features and observed an improvement when they were included. This trend is

also evident in modifying AlexNet (which has a precision of 83.2%) and evaluating on the

CIFAR-10 image data set, where we found consistent improvements in model precision when

including topological features.

Figure 3.1: Top: The topological features act as inputs to the fully connected part of our modified convolutional neural network. Bottom: Using modified Fisher Vector on SIFT along with topological features for training.

We compute the topological features for each image and append them to the feature vector obtained from the traditional method to classify the images. While there may be

improvements in classifiers and methods for augmenting features into them, we claim that

even naively including the topological features adds additional relevant information about the

image which can be utilized by the network in making more accurate classifications. Since

traditional feature extractors rely on geometric and image processing properties, such as

gradient, orientation of sub-pixels, statistical distribution of colors or the learned features

found in Convolutional Neural Networks (CNNs), topological features are lost in this process.

By reintroducing these features, we add additional relevant information which can be utilized

by existing classification techniques. Our entire technique has been illustrated in a video which is available at https://youtu.be/hq4DYse2c-Y.

We compute the topological features for images and use these features for image classification using machine learning techniques. We use the theory of persistent homology to generate

feature vectors for a point cloud. But, instead of the classical Rips complex filtrations, we

use a hierarchy of collapsed simplicial complexes and a novel point selection strategy to save

time and prevent a combinatorial blowup on the number of simplices. The background and

motivation for our algorithm can be found in Section 1.1.

We transform each pixel to a point p ∈ R^5 by taking the RGB intensity values as well as its (x, y) coordinates. Note that this is only one way of transforming an image to a point cloud (other techniques, such as the lower star filtration on intensity, exist as well). On this point cloud we build a graph induced complex [30], which is essentially the insertion of a simplex with vertex set V ⊆ Q whenever a set of points in P, each being closest to exactly one vertex in V, forms a clique. Formally,

Definition 2. Let G(V) be a graph with vertex set V and let ν : V → V' be a vertex map with ν(V) = V' ⊆ V. The graph induced complex G(V, V', ν) is defined as the simplicial complex where a k-simplex σ = {v'_1, v'_2, ..., v'_{k+1}} is in G(V, V', ν) iff there exists a (k+1)-clique {v_1, v_2, ..., v_{k+1}} ⊆ V so that ν(v_i) = v'_i for each i ∈ {1, 2, ..., k+1}. To see that it is indeed a simplicial complex, observe that a subset of a clique is also a clique.
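As an illustration, the following sketch assembles a graph induced complex up to triangles from an input graph and the closest-vertex map ν. The function name and brute-force clique enumeration are our own, not the implementation of [30]:

```python
def graph_induced_complex(edges, nu):
    """Build the graph induced complex up to dimension 2: a simplex on
    distinct images {nu(v_1), ..., nu(v_k+1)} is added whenever
    {v_1, ..., v_k+1} is a clique in the input graph."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    simplices = {frozenset([nu[u]]) for u in adj}   # vertices
    for u, w in edges:                              # 2-cliques -> edges
        img = frozenset({nu[u], nu[w]})
        if len(img) == 2:
            simplices.add(img)
    for u, w in edges:                              # 3-cliques -> triangles
        for z in adj[u] & adj[w]:
            img = frozenset({nu[u], nu[w], nu[z]})
            if len(img) == 3:
                simplices.add(img)
    return simplices

# Square 0-1-2-3 with diagonal 0-2; vertices 0 and 1 collapse to 0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
nu = {0: 0, 1: 0, 2: 2, 3: 3}
gic = graph_induced_complex(edges, nu)
print(sorted(map(sorted, gic)))
```

Note how the clique {0, 1, 2} contributes only the edge {0, 2} after collapse, while {0, 2, 3} survives as a triangle.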

We build this initial simplicial complex and proceed with the collapse-based filtration described in Section 1.2. The theory of persistent homology tracks the birth and death of cycles. For ease of computation, we compute persistence only up to one-dimensional homology, which keeps track of the 1-cycles or loops in R^5. This allows us to balance the necessity of

getting relevant topological information against the increased computation time required for

generating high-dimensional homological features.

23 3.1 Feature Vector Generation

Now we describe the method we use to incorporate the topological signatures as features

for image classification.

Definition 3. Given an image I, we use the function f : I → R^5 mapping the RGB intensity (r, g, b) of the pixel with coordinates (x, y) ∈ I to the point

    ( (r − μ_r)/σ_r, (g − μ_g)/σ_g, (b − μ_b)/σ_b, x − x̄, y − ȳ )

in the point cloud P ⊂ R^5.
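A minimal sketch of this mapping on a toy list of pixels (pure Python; the function name and the guard against constant channels are our own additions, not part of the method as stated):

```python
from statistics import mean, pstdev

def image_to_point_cloud(pixels):
    """Map pixels [(r, g, b, x, y), ...] to points in R^5: intensity
    channels are standardized (zero mean, unit variance) and spatial
    coordinates are centered, as in Definition 3."""
    cols = list(zip(*pixels))                       # r, g, b, x, y columns
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols[:3]]   # guard flat channels
    cloud = []
    for r, g, b, x, y in pixels:
        cloud.append(((r - mus[0]) / sigmas[0],
                      (g - mus[1]) / sigmas[1],
                      (b - mus[2]) / sigmas[2],
                      x - mus[3],
                      y - mus[4]))
    return cloud

pix = [(0, 10, 20, 0, 0), (255, 10, 40, 0, 1),
       (128, 10, 60, 1, 0), (64, 10, 80, 1, 1)]
cloud = image_to_point_cloud(pix)
print(cloud[0])
```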

(Here μ_i and σ_i refer to the mean and standard deviation of the intensity channel i, which can be red, green or blue, and x̄, ȳ are the corresponding coordinate means.) This process essentially normalises both the color values, which range over 0-255, and the image size, typically around 200 × 200 depending on the dataset. We apply this mapping to all pixels in the input image in order to obtain an initial point set P on which the algorithm in Section 3.2 operates. This algorithm computes the barcodes denoting the birth-death value for each cycle (as described in Section 1.1). Typically, cycles with short barcode length

correspond to intermediate/insignificant cycles or noise. So, to find the cycles which persist

longer, we sort the barcodes w.r.t their lengths and find the largest difference in length

between two consecutive barcodes. The (death-birth) value for every barcode above this gap

is taken as our feature vector. Therefore, if there are ‘m’ barcodes above the widest gap of

the barcodes for an image, li denoting length of the ith barcode, we take the length of the

top ‘m’ barcodes (l1, l2, ..., lm) as our feature vector. This m-length vector is added to the

feature vector obtained from the traditional machine learning approach and used for training

and testing. The barcode of a sample image from Caltech-256 is given in Figure 3.2 with

the bottom 6 lines in blue forming (l1, l2, ..., l6). We note that, one may compute the feature

24 vectors from our topological signatures using the methods proposed in [12][67]. We adopt

the simple approach as described here because of the consideration of speed and simplicity.
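The largest-gap selection described above can be sketched in a few lines. The helper name `persistent_barcode_features` is ours, and the barcodes are assumed to be finite (birth, death) pairs:

```python
def persistent_barcode_features(barcodes):
    """Keep the bar lengths above the widest gap in the sorted lengths.

    `barcodes` is a list of (birth, death) pairs with finite death times;
    returns the (death - birth) lengths of the bars above the largest
    consecutive gap, in descending order.
    """
    lengths = sorted((d - b for b, d in barcodes), reverse=True)
    if len(lengths) < 2:
        return lengths
    # position of the largest drop between consecutive sorted lengths
    gaps = [lengths[i] - lengths[i + 1] for i in range(len(lengths) - 1)]
    m = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return lengths[:m]
```

For instance, bars of lengths 10, 9, 7, 1, 0.5, 0.2 have their widest gap between 7 and 1, so the m = 3 longest bars survive as the feature vector.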

Figure 3.2: One dimensional homology for an image in Caltech-256. The bars in blue are above the widest gap and are chosen as the feature vector

3.2 Algorithm for fast computation of topological signature

To compute the topological signature for an initial point cloud P ⊂ R^n, we follow the

procedure below:

1. Create a Nearest Neighbor graph on P by creating an edge between each point and its

k-nearest neighbors.

2. Create a δ-sparse, δ-net subsample of P to form V^0 (Def. 1 in Sec. 1.2).

3. Build a graph induced complex [40] C^0 on V^0.

4. Undergo a sequence of subsamplings of the initial point set V^0 ⊃ V^1 ⊃ ... ⊃ V^k based on the Morton Ordering discussed in 1.2.1. (For every V^i ⊃ V^{i+1}, we remove every n-th point from the i-th sample to form V^{i+1}.)

5. Generate a sequence of vertex maps f^i : V^i → V^{i+1}, as defined in section 1.2. This in turn generates a sequence of collapsed complexes C^0, C^1, ..., C^k. Each vertex map induces a simplicial map f^i : C^{i−1} → C^i that associates simplices in C^{i−1} with simplices in C^i.

6. Compute the persistence for the simplicial maps in the sequence C^0 −f^1→ C^1 −f^2→ ... −f^k→ C^k to generate the topological signature of the point set P.

Thus, given a sequence of simplicial maps, we can compute persistence by a sequence of collapse operations (induced by the maps) on our initial complex, as described in section 1.2.
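Steps 1 and 2 of this pipeline can be sketched as follows. The brute-force k-nearest-neighbor graph and the greedy δ-net below are simplified stand-ins, under our own assumptions, for the construction of Def. 1 in Sec. 1.2, not the optimized implementation:

```python
import math

def knn_graph(points, k):
    """Step 1: connect each point to its k nearest neighbours (brute force)."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def delta_net(points, delta):
    """Step 2: greedy delta-sparse, delta-net subsample of the point set.

    Every kept point is at least delta away from every other kept point,
    and every input point is within delta of some kept point.
    """
    net = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[j]) >= delta for j in net):
            net.append(i)
    return net
```

The greedy pass keeps the first point of every δ-cluster it meets, which is the standard way to obtain a δ-sparse, δ-net subsample without extra data structures.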

3.2.1 Choosing Parameters

Next, we discuss how we choose the different parameters to generate persistence for the

images. We need to tune two parameters, the first being the value k, the number of nearest neighbors to which each node is connected for the δ-sparse, δ-net subsample we build (Section 3.2, Step 1). The second is the parameter n: we remove every n-th point of V^i and collapse it to its nearest neighbor for the simplicial map f^i : C^{i−1} → C^i; see Section 3.2. For choosing these parameters, we do an initial unsupervised clustering of the images based

on the t-distributed stochastic neighbor embedding or t-SNE [111]. This technique provides visualizations of high-dimensional point clouds via a nonlinear embedding which attempts

to map points that are close in the high-dimensional space to nearby locations in either R^2 or R^3. We

take a subset of images from each of our datasets and generate the persistent signatures as

described above.

These signatures are embedded in R^2 using t-SNE. The effects are evident in Figure 3.3a, where we experiment on the MNIST digit dataset. Clustering the MNIST digits based on the computed barcodes shows that digits with similar one-dimensional homology features are

close together. Specifically, the digits 0-9 can be partitioned into 3 equivalence classes based

on the number of holes in each digit, where

[0] = {1, 2, 3, 5, 7}, [1] = {0, 4, 6, 9}, [2] = {8}


Figure 3.3: t-SNE was used as an aid in picking parameters for the computation of barcodes. (a) Good clustering on MNIST. (b) Bad clustering on CIFAR-10. (c) Better clustering on CIFAR-10

and this is reflected in 3.3a. Figure 3.3b provides a bad clustering example on the CIFAR-10 dataset with the value k = 8: t-SNE produces a noisy result in which few distinct clusters are formed. Tuning the parameter value to k = 15, more distinct, visible clusters were formed on the CIFAR-10 dataset, as evident in 3.3c. This indicated that we should tune the aforementioned parameters. We repeat this experiment for k ∈ {15, 17, 19, 21, 23, 25} nearest neighbors and values of n ranging from 8 to 20. Finally, we choose those values of k and n for which t-SNE produces the most distinct clustering of

images, namely k=17 and n=15.
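The grid search over k and n can be sketched as below. Since the chapter selects parameters by visually inspecting t-SNE plots, the numeric `separation_score` used here is purely an illustrative stand-in for that visual judgment, and `signature_fn` is a hypothetical callable producing the barcode features for one image:

```python
import math
from itertools import product

def separation_score(features, labels):
    """Mean intra-class distance divided by mean inter-class distance.

    Lower is better: tight classes that sit far apart score close to 0.
    """
    intra, inter = [], []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            d = math.dist(features[i], features[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

def pick_parameters(signature_fn, images, labels,
                    ks=(15, 17, 19, 21, 23, 25), ns=range(8, 21)):
    """Grid search over (k, n); signature_fn(image, k, n) -> feature vector."""
    best = None
    for k, n in product(ks, ns):
        feats = [signature_fn(img, k, n) for img in images]
        score = separation_score(feats, labels)
        if best is None or score < best[0]:
            best = (score, k, n)
    return best[1], best[2]
```

Any clustering-quality criterion could replace `separation_score`; the point is only that the (k, n) pair giving the most distinct clusters is retained.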

3.3 Results

We are primarily focused on two supervised classification frameworks. The first is the traditional feature-vector-based approach, where we generate a feature set for each image and then train and test an optimal classifier on it. The second is the convolutional neural network approach, where we modify the final layers of the classifier to accommodate our new features. We worked on several image datasets, chiefly CIFAR-10, Caltech-256,

Dataset      Method                          P − T   P + T   R − T   R + T
CIFAR-10     SIFT + Fisher Vector Encoding   23.63   28.24   23.56   31.16
CIFAR-10     AlexNet                         83.2    83.8    98.25   98.15
Caltech-256  SIFT + Fisher Vector Encoding   57.50   59.50   51.50   64.00
MNIST        Deep MNIST SoftMax Regression   98.15   98.46   99.57   99.48

Table 3.1: Precision (P) and Recall (R) for different methods with and without topological features. (P − T and R − T) indicate precision and recall without topology, whereas (P + T and R + T) indicate precision and recall with topology. Note that except for Caltech-256, which has 20 classes, all the other datasets have 10 classes.

and the MNIST handwritten digit dataset. All our code is freely available on our website: https://web.cse.ohio-state.edu/~dey.8/imagePers/ImagePers.html.

3.3.1 Feature Vector based Supervised Learning

For classification, the number of features extracted for each image is generally quite large and, in addition, can vary from image to image. Because of this, the images are transformed into a fixed-size representation using a global image descriptor known as the Fisher Vector encoding [96]. We assume 16 Gaussian functions to form the GMM

used as the generative model. We first apply the Hessian-Affine [78] detector to generate interest points, and thereafter calculate the SIFT [74] features of these interest points. If there are l interest points, this process generates a 128 × l dimensional vector. This vector is transformed into a feature vector of length 4096 × 1 using the Fisher Vector encoding. Finally, we train an SVM model using the feature vector generated from each image and use it to

classify the images.

We first tried to classify images from the CIFAR-10 dataset [75]. The dataset contains 6 batches of 10,000 images, with a total of 6,000 images for each of the 10 output classes. Each individual image is very small, with dimensions 32 × 32. Since these images are so low in

resolution, the number of interest points extracted is very small, and thus insufficient to

characterize each image. There are ten classes of images in CIFAR-10, giving a baseline of

0.1 for precision and recall. For each class we trained on 4000 images and tested on the other

2000. We present the average result over all the 10 classes in Table 3.1.

Next, we compute the persistence of each image in R^5. The value of m as discussed in Section 3.1, computed as an average over all images, is 10. Hence we append the longest 10 barcodes to the signature described above, giving a vector of length 4106 × 1. The precision

and recall for this case increased significantly as noted in the Table 3.1.

The second dataset that we use is Caltech-256 [54]. The number of images in each class varies between 80 and 827, the median being 100. Since the dataset is skewed, we use a subset of the dataset, taking the first 20 classes having 100 or more images

as a measure for image classification. We also fix the number of training and test images as

75 and 20 respectively, for each class to maintain consistency. We use the same technique as

before, computing precision and recall for features with and without the persistence values.

In this case, for the 20 classes, the precision and recall improved significantly as well. The average accuracy over all 20 classes using SIFT with the Fisher Vector Encoding comes out to be 53.27%. However, if we use the signature from our persistent homology, the accuracy increases to 56.74%. There is an increase in the average precision and recall for each class as well, as listed in Table 3.1. We also plot the accuracy varying the training set size from 25

to 75. The accuracy increases considerably using topological features in each case (see Figure 3.4a). Figure 3.4b plots the precision for a subset of eight classes of the dataset, and shows that the precision fluctuates heavily across classes for the Fisher Vector method, yielding a result of 0% on two occasions, whereas the output using topology is reasonable for all the classes.

Two things are worth noting at this point. First, our algorithm runs faster than the state-of-the-art software SimBa used for generating topological features from point cloud data. In this


Figure 3.4: Comparison of accuracy and precision with and without topological features. (a) Performance versus training size. (b) Fluctuations in precision across different classes

      SimBa               Our Algo            Our Algo + β2
      Acc   Pr    Re      Acc   Pr    Re      Acc   Pr    Re
I     54.6  61.9  63.9    58.9  65.6  64.8    59.2  60.1  64.8
II    19.6  21.2  33.0    21.3  28.2  31.4    22.4  28.2  31.2

Table 3.2: Qualitative comparison of our algorithm with SimBa, with and without 2-dimensional homological features. Acc: Accuracy, Pr: Precision, Re: Recall. I - 5 classes of Caltech-256, II - CIFAR-10

regard, we provide a quantitative comparison (Table 1.1). Second, we do not include the two-dimensional homology features, to save computational time. Therefore, we also show the results that we would have obtained after including these topological features. The following table

(3.2) shows the accuracy, precision and recall of running SimBa and our algorithm (with 2D

features) on 5 classes of the Caltech-256 dataset and on CIFAR-10. Note that, since we took

a subset of the entire dataset, better accuracy on this subset does not necessarily mean better

overall performance. Interestingly, in some cases, considering only one dimensional features

provides better accuracy.

3.3.2 CNN and Deep Learning based Training

The second framework in our experimentation was based on the Convolutional and Deep

Neural Network models. For these models we started by experimenting with the MNIST

handwritten dataset [69]. We implement a straightforward Deep MNIST softmax regression

network [106]. In a nutshell, the network comprises two sets of convolutional layers, each followed by pooling and rectified linear activations, which feed into a fully connected layer from which we determine the final output probabilities. After training, this model has a precision

of 98.16%. However, including the topological features in the second-to-last layer of the fully connected network, we get a further improvement of 0.36% over the previously reported result. While this may not seem like a significant improvement, getting even a slight improvement on a model which is already so accurate is encouraging. This also indicates that topological signatures contain information which neural network pipelines are not able to mine. This trend of improvement continues for another dataset that we tried, namely CIFAR-10, which we discussed earlier. While the SIFT feature vector is not a very good method to

classify these tiny images, Deep Neural Networks have proven to be quite effective in such

cases. A classic, successful model for this dataset is AlexNet, invented by Krizhevsky et al. [64] in 2012. We modify this model slightly to accommodate our features. AlexNet starts with multiple convolutions, followed by normalization and pooling layers, and finally two fully connected layers with the ReLU activation function; see [64] for more details. The original model was also trained on multiple GPUs by splitting the layers of AlexNet, and it had a layer for local response normalisation which normalised specific positions across all channels in a certain layer. We do not include these two features in our experimentation as

they have not been very popular subsequently. Training the classifier for 50,000 iterations with a batch size of 64, we obtained a precision of 83.2%. On top of that, we added the topological features to the fully connected layer in the last stage of the model to get a 0.6% increase in precision. The details of all the results are included in Table 3.1.
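Schematically, appending the topological features amounts to widening the input of the final fully connected layer. The toy forward pass below illustrates only the concatenation; the shapes, weights, and function names are illustrative, not those of the actual AlexNet model:

```python
def dense(x, W, b):
    """One fully connected layer: logits[i] = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def classify(cnn_features, topo_features, W, b):
    """Concatenate the network's penultimate activations with the m
    topological barcode lengths, apply the final dense layer, and
    return the index of the largest logit (the predicted class)."""
    x = list(cnn_features) + list(topo_features)
    logits = dense(x, W, b)
    return max(range(len(logits)), key=logits.__getitem__)
```

In a real framework this is a single concatenation followed by the existing output layer, so the rest of the network and its training procedure are unchanged.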

Thus we see that topological features provide additional information for the classification of images. This is not surprising, as most techniques rely on geometric, textural, or gradient-based features for classification that do not necessarily capture topological structure. In the next chapter, we investigate the use of such features for protein characterisation and classification. In contrast to this chapter, protein structures can be described directly using topological features without the need for additional features.

Chapter 4: Protein Classification

Proteins are by far the most anatomically intricate and functionally sophisticated molecules known. The benchmarking and classification of unannotated proteins has long occupied researchers. This effort has a direct influence on understanding the behavior of unknown proteins and on more advanced tasks such as genome sequencing. Since the sheer volume of protein structures is huge, until the last decade it had been a cumbersome task for scientists to manually evaluate and classify them. Over the last decade, several works aiming at automating the classification of proteins were developed. The majority of annotation and classification techniques are based on sequence comparisons (for example in BLAST [101],

HHblits [7] and [95]) that try to align protein sequences to find homologs or a common ancestor. However, since those methods focus on finding sequence similarity, they are mainly effective at finding close homologs. Some domains, such as remote homologs, are known to have less than 25% sequence similarity and yet have common ancestors and are functionally similar. So, we miss out on important information about structural variability when classifying proteins solely based on sequences. Even though homology is sometimes established by comparing structural alignments [73], accurate and fast structural classification for the rapidly expanding Protein Data Bank remains a challenge.

The algorithm that we present here is a fast technique to generate a topological signature for protein structures. We build our signature based on the coordinates of the atoms in R^3, using their radii as weights. Since we also consider existing chemical bonds between the atoms while building the signature, we believe that the hierarchical convoluted structure

of protein is captured in our features. Finally, we developed a new technique to generate

persistence that is much quicker and uses less space than even the current state-of-the-

art such as SimBa. It helps us generate the signature even for reasonably large protein

structures. In sum, in this chapter, we focus on three problems: (1) effectively map a protein

structure into a suitable complex; (2) develop a technique similar to our previous chapter to

generate fast persistent signature from this complex; (3) use this signature to train a machine

learning model for classification and compare against other techniques. Our entire method

is summarized in figure 4.1. We also illustrate this method using a supplementary video

available at https://youtu.be/yfcf9UWgdTo.

With the traditional technique discussed in section 1.1, difficulties arise as r increases. An average protein in a database such as CATH [103] has 20,000 ∼ 30,000 atoms, thus creating a point cloud of the same size in R^3. Furthermore, the initial complex including 3-simplices (or tetrahedra) becomes quite large. On average, this complex grows to (50 ∼ 100) × 10^4 simplices of dimension up to 4 and becomes quite difficult to process.

Building a filtration using this growing sequence of balls is thus not scalable. We attack the problem with two strategies: (1) we only consider simplices on the boundary of the entire simplicial complex in our algorithm, and (2) we employ a new filtration technique that is based on collapsing simplices rather than growing their number by addition (described in Ch. 1).

4.1 Topological persistence

Traditionally, given a point cloud, its persistence signature is calculated by building a filtration over a simplicial complex called the Vietoris-Rips (VR) complex. The VR complex is easy to implement, but its size can become a hindrance for even a moderate-size protein molecule.

Figure 4.1: Workflow

Thus, instead of a VR complex, we use the (weighted) alpha complex that is sparser and has

been used to model molecules in earlier works [59].

Definition 4. Alpha complex AC(α): For a given value of α, a simplex σ ∈ AC(α) if:

• The circumball of σ is empty and has radius < α, or

• σ is a face of some other higher dimensional simplex in AC(α).
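For a 2-simplex in the plane, the first condition of Definition 4 can be checked directly. The sketch below is our own illustration (with a small tolerance `eps` for the emptiness test), not part of the software used in this work:

```python
import math

def circumcircle(a, b, c):
    """Circumcenter and circumradius of the triangle abc in the plane."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    center = (ux, uy)
    return center, math.dist(center, a)

def triangle_in_alpha(a, b, c, alpha, points, eps=1e-9):
    """First condition of Definition 4 for a 2-simplex: the circumball
    is empty of other sample points and has radius < alpha."""
    center, r = circumcircle(a, b, c)
    if r >= alpha:
        return False
    return all(math.dist(center, p) >= r - eps
               for p in points if p not in (a, b, c))
```

The second condition of the definition (being a face of a higher-dimensional simplex already in AC(α)) is a closure step applied afterwards.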

Definition 5. Weighted Alpha Complex WAC_P̂(α): Let B_k(p̂) be a k-dimensional closed ball with center p and weight r_p. It is orthogonal or sub-orthogonal to a weighted point (p', r_{p'}) iff ||p − p'||^2 = r_p^2 + r_{p'}^2 or ||p − p'||^2 < r_p^2 + r_{p'}^2, respectively.

An orthoball of a k-simplex σ = {p̂_0, ..., p̂_k} is a k-dimensional ball that is orthogonal to every vertex. A simplex is in the weighted alpha complex WAC_P̂(α) iff its orthoball has radius less than α and is sub-orthogonal to all other weighted points in P̂.
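The orthogonality conditions of Definition 5 translate directly into code; the two predicates below are a minimal illustration for weighted points in any dimension:

```python
def orthogonal(p, rp, q, rq, tol=1e-9):
    """True if the weighted points (p, rp) and (q, rq) are orthogonal:
    ||p - q||^2 == rp^2 + rq^2 (up to floating-point tolerance)."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return abs(d2 - (rp * rp + rq * rq)) <= tol

def suborthogonal(p, rp, q, rq):
    """True if ||p - q||^2 < rp^2 + rq^2, i.e. the weighted points
    overlap more than orthogonally."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return d2 < rp * rp + rq * rq
```

For example, weighted points at distance 5 with weights 3 and 4 are exactly orthogonal (25 = 9 + 16), and increasing either weight makes them sub-orthogonal.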

4.2 Collapse-induced persistent homology from point clouds

The following procedure computes a topological signature for a weighted point cloud P̂ = {(p, r_p)} using subsamples and subsequent collapses:

1. Compute a weighted alpha complex C^0 on the point set P̂ = {(p, r_p)} using the algorithm described in [108]. Let V^0 be the vertex set of C^0.

2. Compute a sequence of subsamples V^0 ⊃ V^1 ⊃ ... ⊃ V^k of the initial vertex set V^0 based on the Morton Ordering as discussed in 1.2.1. (For every V^i, we remove every n-th point in the Morton Ordering from V^i to form V^{i+1}. We choose n based on the number of initial points.)

3. This sequence of subsets allows us to define a vertex map between any two adjacent subsets: f^i : V^i → V^{i+1}. We use the same definition of f^i as in Sec. 1.2.

4. The vertex maps f^i : V^i → V^{i+1} in turn generate a sequence of collapsed complexes C^0, C^1, ..., C^k. Each vertex map induces a simplicial map f^i : C^{i−1} → C^i that associates simplices in C^{i−1} with simplices in C^i (see Figure 4.3).

5. Compute the persistence for the simplicial maps in the sequence C^0 −f^1→ C^1 −f^2→ ... −f^k→ C^k to generate the topological signature of the point set P̂.
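Steps 3 and 4 — sending each removed vertex to a surviving neighbor and extending the vertex map to simplices — can be sketched as follows. The helper names are our own, and the complex is represented simply as a set of frozensets of vertex indices:

```python
import math

def nearest_neighbor_vertex_map(points, kept):
    """f^i: send every vertex to itself if it survives the subsample,
    otherwise to its nearest surviving vertex (the collapse target)."""
    return {
        i: (i if i in kept
            else min(kept, key=lambda j: math.dist(points[i], points[j])))
        for i in range(len(points))
    }

def collapse_complex(simplices, vmap):
    """Apply the vertex map to every simplex; images with repeated
    vertices degenerate to lower-dimensional simplices, so the complex
    shrinks from C^{i-1} to C^i."""
    return {frozenset(vmap[v] for v in s) for s in simplices}
```

Applying `collapse_complex` repeatedly along the subsample sequence yields the tower of complexes whose persistence is computed in step 5.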

In step 1 of the procedure, the weighted points alone lead to disconnected weighted atoms in C^0 rather than capturing the actual protein structure. To sidestep this difficulty, we increase the weights of these points based on the existence of covalent or ionic bonds in the structure. That is, if there exists a chemical bond between two atoms (which we get from the input .pdb file), we scale up the weight of each point so that they are connected in the weighted alpha complex WAC_P̂(α) (see Fig. 4.2). We determine a global multiplying factor ρ ≥ 1 for

Figure 4.2: Weighted alpha complex for a protein structure

this purpose. As mentioned earlier, we take the boundary of this weighted complex, which forms our initial simplicial complex C^0.

In step 2, in order to generate the sequence of subsamples, we pick vertices uniformly from the simplicial complex to be collapsed to their respective nearest neighbors. To choose a subsample that respects local density, we use a space-filling curve technique called Morton Ordering [79]. The details of this method are given in section 1.2.1.
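A standard way to realize the Morton Ordering is to interleave the bits of integer coordinates. The sketch below, including the hypothetical `morton_subsample` helper mirroring the "remove every n-th point" rule, is our illustration, not the exact implementation of section 1.2.1:

```python
def morton_key(x, y, z, bits=10):
    """Interleave the bits of integer coordinates (x, y, z) into a Morton
    code; points that are near in 3D tend to be near in this 1D order."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

def morton_subsample(points, n):
    """Sort point indices by Morton code and drop every n-th point,
    mirroring the V^i -> V^{i+1} subsampling step."""
    order = sorted(range(len(points)), key=lambda i: morton_key(*points[i]))
    return [i for rank, i in enumerate(order) if (rank + 1) % n != 0]
```

Real atom coordinates would first be quantized onto an integer grid before computing the keys.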

Finally, as described in step 3, instead of constructing the filtration by increasing the value of α, we perform a series of successive collapses starting with the initial simplicial complex. This leads to a sequence of complexes that decreases in size instead of growing as we proceed forward. Effectively, it generates a sequence called a tower of simplicial complexes, where successive complexes are connected by simplicial maps. These maps, which are the counterpart of continuous maps in the combinatorial setting, extend maps on vertices (vertex maps) to simplices (see [80] for details). In our case, collapses of vertices generate these simplicial maps from one simplicial complex in the tower to the one it is collapsed to. Persistence for towers under simplicial maps can be computed by the algorithm introduced in [34]. We use the Simpers package provided by the authors of that work.

To summarize, the algorithm generates an initial weighted alpha complex. It then proceeds

by recursively choosing vertices based on Morton Ordering to be collapsed to their nearest

Figure 4.3: Top: Collapse of the weighted alpha complex generated from a protein structure via simplicial maps. Bottom: The same algorithm applied to a kitten model in R^3

neighbors, resulting in vertex maps. These vertex maps are then extended to higher-order simplices (such as triangles and tetrahedra) using simplicial maps. Finally, given the simplicial maps, we generate the persistence and obtain the barcodes for the zero and one dimensional homology groups.

4.2.1 Feature vector generation

We now discuss how we generate a feature vector for a given protein structure. We take protein data bank (*.pdb) files as input to extract protein structures. These files contain the coordinates of every atom, their names, chemical bonds with neighboring atoms, and other meta-data such as helix, sheet, and peptide residue information. We introduce a weighted point for each atom in the protein, where the point is the center of the atom and its weight is the specified radius. For instance, for a Nitrogen atom in an amino acid, we assign a weight equal to its covalent radius of 71 pm. On this weighted point cloud p̂ = (p, r_p), if two atoms p̂ and q̂ are involved

in a chemical bond, we increase their weights so that p and q get connected in the alpha

complex. We compute the persistence by generating the initial alpha complex and undergoing

a series of collapses as described in the previous section. For computational efficiency, we

only consider the barcodes in zero and one dimensional homology groups. Note that some of

the barcodes can have death time equal to infinity indicating an essential feature. For finite

barcodes, shorter lengths (death − birth) indicate noise. Elimination of these intermittent features serves an interesting purpose, as we will see in section 4.3. To find the relatively long barcodes, we sort them in descending order of their lengths. Let {l_1, l_2, ..., l_k} be this

sorted sequence. Consider the sequence {l'_1, l'_2, ..., l'_{k−1}} where l'_i = l_i − l_{i+1}, and let l'_m be a maximal element for 1 ≤ m ≤ k − 1. All barcodes with lengths in [l_1..l_m] form part of the feature vector. Essentially, we remove all barcodes whose lengths are shorter than the largest

gap between two consecutive barcodes when sorted according to their lengths. A similar

technique used in [65] has shown improved results in image segmentation over other heuristics

and parameterizations. Since the feature vector needs to be of a fixed length for feeding into

a classifier, we compute the index m of lm over all protein structures and take an average.

The feature vector also includes the numbers of essential zero and one dimensional cycles. Therefore, we have a feature vector of length 2m + 2: {l^0_1, l^0_2, ..., l^0_m, l^1_1, l^1_2, ..., l^1_m, c_β0, c_β1}. Here l^0_i and l^1_i are the lengths of zero and one dimensional homology cycles respectively, whereas c_βi is the total number of essential cycles in i-dimensional homology.
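Assembling the 2m + 2 vector from the two barcode families can be sketched as below. Bars with infinite death time are counted as essential, and we pad with zeros when fewer than m finite bars exist (a boundary case the text does not specify — our assumption):

```python
import math

def gap_threshold_index(lengths):
    """Index m of the maximal drop l_i - l_{i+1} in a descending sequence."""
    drops = [lengths[i] - lengths[i + 1] for i in range(len(lengths) - 1)]
    return max(range(len(drops)), key=drops.__getitem__) + 1

def protein_feature_vector(bars_h0, bars_h1, m):
    """Build the 2m + 2 vector {l0_1..l0_m, l1_1..l1_m, c_b0, c_b1}.

    Bars are (birth, death) pairs; death == inf marks an essential cycle,
    which is counted rather than measured.
    """
    vec, essentials = [], []
    for bars in (bars_h0, bars_h1):
        finite = sorted((d - b for b, d in bars if math.isfinite(d)),
                        reverse=True)
        finite += [0.0] * m                 # pad if fewer than m finite bars
        vec.extend(finite[:m])
        essentials.append(sum(1 for b, d in bars if math.isinf(d)))
    return vec + essentials
```

Here m would be the average gap-threshold index computed over all proteins, as described above, so every structure maps to a vector of the same length.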

4.3 Experiments and results

We perform several experiments to establish the utility of the generated topological

signature. First, we show how our feature vector captures various connections in the single

strands of secondary structures and compare them against the signatures obtained in [114].

Then we investigate if there is a correlation between the count of such secondary structures

and our feature vector. Next, we describe the topological feature vector obtained from

two macromolecular protein structures. We also compare the size and time needed by our algorithm (software) with other commonly used persistence software (as in [16]). Lastly, we show the effectiveness of our approach in classifying protein structures using machine

learning models.

Figure 4.4: (a) Top: Alpha helix from PDB 1C26, Middle: Barcode of [114], Right: Our barcode. (b) Left: Beta sheet from PDB 2JOX, Middle: Barcode of [114], Bottom: Our barcode. Each segment of the barcodes shows β0 (top) and β1 (bottom)

4.3.1 Topological description of alpha helix and beta sheet

It is known that barcodes can explain the structure of an alpha helix and a beta sheet [114]. The authors in [114] use a coarse-grained (CG) representation of the protein, replacing each amino acid molecule with a single atom. This representation removes the short barcodes corresponding to the edges and cycles of the chemical bonds inside the amino acid molecule. We do not need this CG representation, as our procedure can implicitly determine a threshold

lm and therefore delete all barcodes of length shorter than the largest gap between two

consecutive barcodes (as described in section 4.2.1). So, we get a barcode that describes the

essential features of the secondary structures without including noise or short lived cycles

from the amino acids. For a fair comparison, we compute our barcodes on the same alpha

helix residue as in [114] with 19 residues extracted from the protein strand having PDB ID

1C26 (see figure 4.4). Analogous to the barcode of [114] (as shown in the middle diagram of

figure 4.4a), we have 19 bars in the zero-dimensional homology for the alpha helix representing

the nineteen initial residues. These components die as edges are introduced in the weighted alpha complex, which connects them. For one-dimensional homology, an initial ring with 4 residues is formed, followed by additional rings resulting from the growing connections in each amino acid. These cycles eventually die by the collapse operations in our algorithm.

The same process is followed for beta sheets after we extract two parallel beta sheet

strands from the protein structure with PDB ID 2JOX. The zero-dimensional homology

cycles are killed when individual disconnected amino acid residues belonging to the same

beta sheet strand are connected by edges, as represented in the top 17 barcodes (leftmost

figure of 4.4b). However, other than these barcodes and the longest bar corresponding to the essential cycle, there is one bar in the zero-dimensional homology which is longer than the

top 17 bars. This bar represents the component which is killed by joining the closest adjacent

amino acid molecules from the two parallel beta strands. The one dimensional homology

bars are formed as more adjacent amino acid molecules are connected and killed once the

collapse operation starts. Note that the two barcodes shown in figure 4.4 comparing our work with [114] are not to scale. This is because, in contrast to [114], the barcodes in our

figure are plotted not against Euclidean distance but against the step at which each insertion and collapse operation occurs.

Figure 4.5: Barcode and ribbon diagram of (Left): PDB: 1C26. (Right): PDB: 1O9A. Diagram courtesy NCBI [82]

A caveat

Our aim is to compute signatures that capture discriminating structural information

useful for classifying proteins. Even though we can use our signature to describe secondary

structures, we do not want our signature to be directly correlated to the number of alpha helix

or beta sheet, as that would mean they are redundant. We generate a 2 × 12 matrix where each cell contains the correlation value between the beta-sheet (top row) and alpha-helix (bottom row) counts and each individual component of the feature vector: {l^0_1, l^0_2, ..., l^0_m, l^1_1, l^1_2, ..., l^1_m, c_β0, c_β1}. We use

it by a heatmap (Fig 4.6). Essentially, we first generate two vectors vα and vβ of the number

of alpha helices and beta sheets respectively in each protein over all entries in the database.

Similarly, we produce a vector for each component of the feature vector: {v_{l^0_1}, ..., v_{c_β1}}. Now we populate the matrix by calculating the correlation between each of these individual vectors and v_α and v_β. For example, row 1 and column 1 of the matrix contain the correlation value between the vectors v_{l^0_1} and v_β. The heat map color ranges from blue for zero correlation to dark-red for complete correlation. As we can see from the figure, almost all matrix entries have

a blue tinge, indicating low correlation. This shows that our feature vector is not redundant with respect to the frequency of secondary structures.

4.3.2 Topological Description of macromolecular structures

In the previous section, we used our signature to describe the secondary structures and compared it with the work in [114]. In this section, we further show how our signature works by describing two macromolecular protein structures that are built from multiple secondary structures. We start by describing the tetrameric protein 1C26. The ribbon diagram and associated barcode after noise removal are given in figure 4.5. It essentially contains four monomers, associated pairwise to form two dimers. These two dimers, in turn, join across a distinct parallel helix-helix interface to form the tetramer. When we build the filtration on this protein structure, two monomers on opposite sides are killed first by connecting to their adjacent monomers to form two distinct dimers. This is evident as there are two short bars in the zero dimensional barcode (Fig. 4.5 left: shown in red). We now have two dimers, one of which is killed when it joins with the other to form a third, slightly longer non-essential barcode (shown in purple). The second dimer lives on as the tetramer and forms an essential barcode (shown in black). Next, if we look into the one dimensional homology (shown as blue lines), we notice that the most notable feature of the protein is the tetramer structure, which contains a large loop when the two dimers are connected. This is evident in our 1D-barcode as there is a distinct long bar representing the large one dimensional cycle. Note that the birth time of this cycle in 1D corresponds with the death time of the non-essential dimer in

0D.

Next, we consider the protein structure 1O9A. The structure contains several antiparallel beta-strands and is an example of a tandem beta-zipper. As we can see from the ribbon diagram in Fig. 4.5 right, there are six beta sheets on one side and five on the other, connected together to form a fibronectin. This is evident as there are ten non-essential and one essential

Figure 4.6: Heatmap correlating secondary structure against our feature vector. Each column in the heatmap is the feature vector. Figure 4.7: Plot showing accuracy against varying training data size. 100(%) indicates the entire training and test data.

SVM KNN Size Time (in sec) FB Cang Our FB Cang Our Data Dim VR SimBa Our VR SimBa Our Class 91.08 89.07 92.36 86.01 86.40 86.39 CATH 3 – 1422 443 – 1.75 0.35 Architecture 90.26 91.11 92.20 88.17 87.47 89.11 Soneko 3 324802 10188 576 32 6.77 2.05 Surv-l 150 – 3.1 × 106 1.09 × 106 – 5.08 × 103 884 Topology 92.19 94.87 96.71 91.54 94.02 96.20 PrCl-I 25 – 10.2 × 106 0.22 × 106 – 585 141.3 Homology 93.33 94.06 94.17 90.28 91.11 93.30

Table 4.1: Time comparison of our approach against SimBa and VR. Table 4.2: Accuracy comparison with FragBag and Cang

bar in the zero dimensional homology owing to the six beta sheets on one side and five on

the other. Each component is killed as the beta sheets join with another as the filtration

proceeds. Note that the last connected component after joining all beta sheets forms an

essential bar. Moreover, since there is no distinct cycle in the structure, we do not get any

distinct long bar in the one dimensional homology. The presence of multiple one dimensional

bars of similar size are probably due to the antiparallel beta-strands on either side which

form a ring once joined. Thus, we can see that using the same signature generation method, we can describe secondary structures (as in the previous section) as well as macromolecular

proteins without any change in the parameter. It is therefore evident that our signature is

intrinsic and scale independent.

Time and space comparison

The method in [16] uses persistent homology as feature vectors for machine learning.

However, as mentioned earlier, the use of the Vietoris-Rips (VR) complex leads to a size blow-up that not only increases runtime but also, in most cases, causes the operating system to abort the process due to space requirements. The experiments in [16] achieve good results as the datasets are of moderate size, but the same could not be reported for larger, real-life protein structures. In Table 4.1, we show a size and time comparison of our approach with the original feature generation technique used in [16]. We also tabulate the size and time to generate the same feature vector in [16] using a state-of-the-art persistence algorithm called

SimBa [38]. Table 4.1 contains a mix of protein databases and other higher dimensional datasets. As we see in the table, our algorithm is faster even when the features in [16] are generated with SimBa.

4.3.3 Supervised learning based classification models

Classification model.

For the purpose of protein classification, we train two classifiers, an SVM model and a k-nn model, on several protein databases. Once a model is trained, we test it to find accuracy, precision, and recall. The reason behind choosing Support Vector Machine and k-nn based supervised learning techniques over more sophisticated, state-of-the-art classifiers is their simplicity: results obtained from basic learning techniques prove the effectiveness of the feature vectors rather than that of the classifier. We can further improve the classification accuracy for proteins by using advanced supervised learning or neural network based classifiers with our proposed features.
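To see why such a basic learner isolates the quality of the features, consider a k-nn classifier written from scratch: it has essentially no tunable modeling capacity, so its accuracy reflects how well the feature vectors separate the classes. A minimal sketch with made-up toy vectors, not our actual protein signatures:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training vectors under Euclidean distance."""
    ranked = sorted(range(len(train)),
                    key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy "signatures": two well-separated classes in feature space.
train = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.0, 0.8]]
labels = ["alpha", "alpha", "beta", "beta"]
print(knn_predict(train, labels, [0.05, 0.1]))  # -> alpha
print(knn_predict(train, labels, [0.95, 0.9]))  # -> beta
```

If feature vectors from two protein classes are well separated, even this voting rule classifies them correctly; any remaining error is attributable to the features, not the model.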

Benchmark techniques.

In order to test the effectiveness of our protein signature, we need to compare it against

some of the state-of-the-art protein structure classification techniques. We generate feature vectors through these techniques and train and test the same classification models as before.

The first technique, known as PRIDE [51], classifies proteins at the Homology level in the

CATH database. It represents protein structures by distance distributions and compares two

sets of distributions via contingency table analysis. The more recent work by Budowski-Tal

et al. [13], which has achieved significant improvement in protein classification, is used as

our second benchmark technique. Their work, known as FragBag, mimics the bag-of-words

representation used in natural language processing and computer vision. They maintain

protein fragments as a benchmark. Given a large protein strand, they generate a vector which

is essentially a count of the number of times each fragment approximates a segment in this

strand. This vector now acts as a signature for the protein structure and that is what forms

the basis for their feature vector which we use to train and test our classifier. The protein

fragment benchmark is available from the library [62]. We choose 250 protein fragments

of length 7. The third work that we test against is the topological approach to generate a

protein fingerprint [16]. However, as we saw earlier, it is not possible to generate all the

protein signatures using the original algorithm used by the authors. Therefore, we replace

the Vietoris-Rips filtration by the state-of-the-art SimBa and generate feature vectors the

same way as mentioned in their paper.

Figure 4.8: Left: (a) Difference in precision and recall from FragBag. Middle: (b) Difference in precision and recall from [16]. Right: (c) ROC curve for SVM classification of our algorithm

Database

The database that we use is called Protein Classification Benchmark Collection (PCBC) [105].

It has 20 classification tasks, each derived from either the SCOP or CATH databases, each

containing about 12,000 protein structures. The classification experiment uses a domain

group as the positive set, with a sub-domain group serving as the positive test set. The negative

set includes the rest of the database outside the superfamily divided into a negative test or

negative training set. The result for some of the classification tasks for the database is given

in Table 4.3. As evident from the table, the accuracy obtained by using our signature has a

considerable improvement over the state-of-the-art techniques. The only classification task in which our algorithm under-performs is the protein domain CATH95 Class 5fold Filtered

(fourth row of table 4.3). The class domain is randomly sub-divided into 5 subclasses in this

task. Since the class is divided randomly into subclasses, we believe some proteins belonging

to different sub-classes generate a similar initial complex resulting in a similar filtration and

ultimately a decrease in performance.

The PCBC dataset, even though suitable for learning algorithms, suffers from being

skewed as the number of negative instances in any classification is much larger than the

number of positive instances, leading to probable incorrect classifications. Therefore, we test

               SVM (Pride / Fragbag / Cang / Our)       k-NN (Pride / Fragbag / Cang / Our)
SC Sf Fm F     90.09 / 93.01 / 93.39 / 95.24           89.58 / 87.31 / 89.83 / 91.66
CA T 5f        94.23 / 92.97 / 94.87 / 99.53           90.96 / 91.16 / 94.57 / 97.87
CA T H F       90.15 / 89.89 / 95.06 / 98.80           84.98 / 81.11 / 86.65 / 95.51
CA C 5f F      85.09 / 84.76 / 80.98 / 82.36           80.18 / 84.74 / 83.83 / 78.81
CA H Si F      98.60 / 95.89 / 98.24 / 99.05           95.45 / 91.11 / 79.469 / 97.56
CA A T F       87.56 / 91.58 / 74.58 / 90.95           67.47 / 89.00 / 68.90 / 87.00

Table 4.3: Classification accuracy for different techniques on Protein dataset. SC: SCOP95, CA: CATH95, Sf: Superfamily, Fm: Family, F: Filtered, T: Topology, H: Homology, C: Class, 5f: 5fold, A: Architecture, Si: Similarity

on one of the most popular protein databases known as CATH [27]. The CATH database

contains proteins divided into different domains (C: class; A: architecture; T: topology; H:

homologous superfamily). For each domain, we get protein structures and their labels in

accordance with the sub-domain they belong to. For any classification task, we randomly

choose positive instances from one sub-domain and the same number of negative instances

sampled equally from the other sub-domains. Each such task, on average, has 400 protein

structures containing approximately 30,000 atoms each. We then divide this into 80%-training

and 20%-test set. The result of classification on the CATH database averaged over several

such randomly chosen sub-domains as positive classes is illustrated in Table 4.2. We see yet again that, for each case, there is an improvement of about 3-4% over the benchmark

techniques.

Classification result

We list our main results in tables 4.2 and 4.3 showing the improvement in accuracy

using our method over the state-of-the-art techniques of FragBag, PRIDE and the preceding work on topology by Cang et al. [16]. We provide further evidence of the efficiency of our

algorithm by comparing the precision and recall in figures 4.8a and b. In these plots, we

show the difference between the precision and recall obtained using our algorithm against

that of FragBag(4.8a) and Cang(4.8b). A green bar indicates that our algorithm performed

better and the difference is positive while a red bar suggests the opposite. This experiment is

done on the CATH database and the figure shows the precision and recall for each domain:

class(C), architecture(A), topology(T) and homology(H). Notice that, since the classification

is binary, we get two precision and two recall values for every class in each domain. Thus, there are

four bars for each of C,A,T,H. Yet again, other than a few marginal cases, our algorithm

largely performs better. Finally, we calculate the ROC curve using SVM on a subset of

the CATH dataset, the result of which is shown in figure 4.8c. The ROC curve is a plot of

the true positive rate against false positive rate obtained by changing the input size and

parameters. The further the curves lie from the diagonal, the better the classifier.

For the positive test cases, we investigate further the trend of the output. We try to see

the correlation of accuracy with the change in training set size. We therefore change the

training and test set sizes by taking a fraction of the entire dataset and trace the accuracy in

each case. This is done over all the test cases shown in Table 4.3 and the average is shown in

Fig 4.7. We plot the output of our algorithm in blue with two instances of FragBag with

(fragment, library) sizes (5,225) and (7,250) in red and green respectively. In addition, we

plot the output of PRIDE as well. Ideally, the accuracy should decrease uniformly with a

decrease in training set size and we should get a straight line across the diagonal. In this

case, all the trendlines lie close to the diagonal and hence we can say that they are

correlated. Moreover, we observe that even as the training data size decreases, the accuracy

of our algorithm remains better or comparable to the other algorithms. This indicates that

topological features work better with a lower number of samples as well.

The last two chapters were dedicated to the applications of persistent barcodes in two very disparate fields: image and protein classification. In the forthcoming chapters, we

mildly switch gears to see if we can extract more geometric information out of the intervals.

In effect, we dive into the computation of representative persistent cycles.

Chapter 5: Representative Persistent Cycles

The previous chapters describe the use of topological data analysis, particularly barcodes from persistent homology, to express and classify data from a multitude of domains. Besides considering the multiset of intervals included in a persistence diagram, some applications need to find representative cycles for the intervals.

5.1 Persistent 1-cycles

In this work, we study the problem of computing representative cycles for the persistent first homology group (H1-persistent homology) with Z2 coefficients. We term these cycles persistent 1-cycles. Since it has been shown that the computation of the optimal cycles is

NP-hard [32], we propose an alternative set of meaningful persistent 1-cycles with an efficient polynomial time algorithm. Although the original persistence algorithm of Edelsbrunner et al. [45] implicitly computes persistent cycles, it does not necessarily provide minimal ones.

In fact, similar definitions for finite intervals have already been proposed [85, 113]; however, to our knowledge, an explicit explanation of how the representative cycles are related to persistent homology has not been addressed. As we discuss later, our polynomial time algorithm produces cycles that are not any worse than optimally generated ones in terms of stability, while being much more efficient in time complexity.

We use a software based on our algorithm to generate tight persistent 1-cycles on 3D point clouds and 2D images as shown in Figure 5.1. We experiment with various datasets


Figure 5.1: (a) Point cloud of Botijo model. (b,c) Barcode and persistent 1-cycles for Botijo, where the 3 longest bars (dark blue, light blue, and green) have their corresponding persistent 1-cycles drawn with the same colors. (d,e) Barcode and persistent 1-cycles for the retinal image, with each green cycle corresponding to a red bar.

commonly used in geometric modeling, computer vision and material science, details of which

are given in Section 5.5. The software, named PersLoop, along with an introductory video

and other supplementary materials are available at the project website http://web.cse.ohio-state.edu/~dey.8/PersLoop.

5.2 Definitions: Persistent Basis and Cycles

Definition 6 (Persistent Basis). An indexed set of q-cycles {cj | j ∈ J} is called a persistent q-basis for a filtration F if P_q^F = ⊕_{j∈J} I^{[bj,dj)} and, for each j ∈ J and bj ≤ k < dj, I^{[bj,dj)}(k) = {0, [cj]}.

Definition 7 (Persistent Cycle). For an interval [b, d) ∈ D(P_q^F), a q-cycle c is called a persistent q-cycle for the interval if one of the following holds:

• d ≠ +∞, c is a cycle in Kb containing σb, and c is not a boundary in Kd−1 but becomes a boundary in Kd;

• d = +∞ and c is a cycle in Kb containing σb.

The following theorem characterizes each cycle in a persistent basis:

Theorem 1. An indexed set of q-cycles {cj | j ∈ J} is a persistent q-basis for a filtration F if and only if P_q^F = ⊕_{j∈J} I^{[bj,dj)} and cj is a persistent q-cycle for every interval [bj, dj) ∈ D(P_q^F).

The proof is given in [32]. With Definition 7 and Theorem 1, it follows that for a persistent q-cycle c of an interval [b, d) ∈ Dq(F), we can always form an interval decomposition of P_q^F where c is a representative cycle for the interval module of [b, d).

5.3 Minimal Persistent q-Basis and Their Limitations

The optimal versions of persistent basis are of particular interest because they capture

more geometric information of the space. The cycles for an optimal (minimal) persistent

basis is already defined and studied in [50,85]. In particular, the author of [85] proposed an

integer program to compute these cycles. Although these integer programs can be solved

exactly by linear programs for certain cases [31], the integer program is NP-hard in general.

This of course does not settle the question of whether the problem of computing minimal

persistent 1-cycles is NP-hard or not. We prove that it is indeed NP-hard and thus has no

hope of admitting a polynomial time algorithm unless P = NP.

Consider a simplicial complex K with each edge being assigned a non-negative weight.

We refer to such K as a weighted complex. For a 1-cycle c in K, define its weight to be the

sum of all weights of its edges.

Definition 8 (Minimal Persistent 1-Cycle and 1-Basis). Given a filtration F on a weighted

complex K, a minimal persistent 1-cycle for an interval of D1(F) is defined to be a persistent

1-cycle for the interval with the minimal weight. An indexed set of 1-cycles {cj | j ∈ J} is a

minimal persistent 1-basis for F if for every [bj, dj) ∈ D1(F), cj is a minimal persistent

1-cycle for [bj, dj).

The following special version of the problem of finding a minimal persistent 1-cycle is

NP-hard (proof in [32]).

Problem 1 (LST-PERS-CYC). Given a filtration F : ∅ = K0 ⊆ K1 ⊆ ... ⊆ Km = K, and

a finite interval [b, d) ∈ D1(F), find a 1-cycle with the least number of edges which is born in

Kb and becomes a boundary in Kd.

Given the problem, we have the following theorem:

Theorem 2. The problem LST-PERS-CYC is NP-hard. (See [32] for details).

It is also shown in the paper that neither the minimal persistent 1-cycles nor the persistent 1-cycles computed by our algorithm are stable under perturbation. So, in this regard, our polynomial time algorithm is not any worse than an optimal cycle generating algorithm, while being much more efficient in terms of time complexity.

5.4 Computing Meaningful Persistent 1-Cycles in Polynomial Time

Because the minimal persistent 1-cycles are not stable and their computation is NP-hard, we propose an alternative set of meaningful persistent 1-cycles which can be computed

efficiently in polynomial time. We first present a general framework in Algorithm 1. Although

the computed persistent 1-cycles have no guaranteed properties, the framework lays the

foundation for the algorithm computing meaningful persistent 1-cycles that we propose later.

Refer to [32] for proof of correctness.

Based on Algorithm 1, we present another algorithm which produces meaningful persistent

1-cycles.

Algorithm 1
Input: A simplicial complex K, a filtration F : ∅ = K0 ⊆ K1 ⊆ ... ⊆ Km = K, and D1(F).
Output: A persistent 1-basis for F.
The algorithm maintains a basis Bi for H1(Ki) for every i ∈ [0, m].
1: Initially, let B0 = ∅
2: for i = 1, . . . , m do
3:   if σi is positive then
4:     Find a 1-cycle ci containing σi in Ki
5:     Let Bi = Bi−1 ∪ {ci}
6:   end if
7:   if σi is negative then
8:     Find a set {cg | g ∈ G} ⊆ Bi−1 so that Σ_{g∈G} [cg] = 0. This can be done in O(βi = |Bi|) time by the annotation algorithm in [33]. Maintaining the annotations takes O(n^ω) time altogether, where K has n simplices and ω is the matrix multiplication exponent.
9:     Let g∗ be the greatest index in G; then [g∗, i) is an interval of D1(F).
10:    Assign Σ_{g∈G} cg to this interval as a persistent 1-cycle and let Bi = Bi−1 \ {cg∗}
11:  end if
12:  Otherwise, let Bi = Bi−1
13: end for
14: For each cycle cj ∈ Bm, assign cj as a persistent 1-cycle to the interval [j, +∞)

Algorithm 2. In Algorithm 1, when σi is positive, let ci be the shortest cycle containing σi in Ki. The cycle ci can be constructed by adding σi to the shortest path between the vertices of σi in Ki−1, which can be computed by Dijkstra's algorithm applied to the 1-skeleton of Ki−1.
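This step can be sketched directly: remove the new edge (u, v) from the 1-skeleton, compute the shortest u-v path with Dijkstra's algorithm, and close the path with the edge itself. A minimal sketch with illustrative edge weights; the graph below is hypothetical, not a complex from our datasets:

```python
import heapq

def shortest_cycle_through_edge(edges, u, v):
    """Shortest cycle containing the edge (u, v): run Dijkstra between
    u and v in the graph with (u, v) removed, then close the path with
    the edge itself. `edges` holds (a, b, weight) triples."""
    adj = {}
    for a, b, w in edges:
        if {a, b} == {u, v}:
            continue  # exclude the edge the cycle must be closed with
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist, prev = {u: 0.0}, {}
    heap = [(0.0, u)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in adj.get(x, []):
            if d + w < dist.get(y, float("inf")):
                dist[y], prev[y] = d + w, x
                heapq.heappush(heap, (d + w, y))
    if v not in prev:
        return None  # no cycle through (u, v)
    path, x = [v], v
    while x != u:
        x = prev[x]
        path.append(x)
    return path[::-1] + [u]  # u -> ... -> v, closed back to u by (u, v)

# Unit square plus the newly added diagonal edge (0, 2).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0), (0, 2, 1.4)]
print(shortest_cycle_through_edge(edges, 0, 2))  # -> [0, 1, 2, 0]
```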

Note that in Algorithm 2, a persistent 1-cycle for a finite interval is a sum of shortest cycles born at different indices. Since a shortest cycle is usually a good representative of its class, the sum of shortest cycles ought to be a good choice of representative for an interval. In some cases, when σi is negative, the sum Σ_{g∈G} cg contains only one component. The persistent 1-cycles computed by Algorithm 2 for such intervals are guaranteed to be optimal

as shown below.

Proposition 1. In Algorithm 2, when σi is negative, if |G| = 1, then Σ_{g∈G} cg is a minimal persistent 1-cycle for the interval ending with i.

In Section 5.5, where we present the experimental results, we can see that scenarios depicted by Proposition 1 occur quite frequently. In particular, for the larvae and nerve datasets, nearly all computed persistent 1-cycles contain only one component and hence are minimal.

A practical problem with Algorithm 2 is that unnecessary computational resources are spent computing tiny intervals regarded as noise, especially when the user cares only about significantly large intervals. We present a more efficient algorithm for such cases.

Proposition 2. In Algorithms 1 and 2, when σi is negative, for any g ∈ G, one has bg ≤ g∗ and dg ≥ i.

Proof. Note that σbg must be unpaired before σi is added; this implies that dg ≥ i. Since g∗ is the greatest index in G, bg = g ≤ g∗.

Proposition 2 leads to Algorithm 3, in which we only compute the shortest cycles at the birth indices whose corresponding intervals contain the input interval [b, d). In the worst case, Algorithms 2 and 3 run in O(n^ω + n^2 log n) = O(n^ω) time. However, since a user often provides a long interval, the intervals containing it constitute a small subset of all the intervals. This makes Algorithm 3 run much faster than Algorithm 2 in practice.

Algorithm 3
Input: The input of Algorithm 2 plus an interval [b, d)
Output: A persistent 1-cycle for [b, d) as output by Algorithm 2.
1: G′ ← ∅
2: for i ← 1, . . . , b do
3:   if σi is positive & (σi is paired with a σj s.t. j ≥ d, or σi never gets paired) then
4:     ci ← the shortest cycle containing σi in Ki
5:     G′ ← G′ ∪ {i}
6:   end if
7: end for
8: Find a G ⊆ G′ s.t. Σ_{g∈G} [cg] = 0 in Kd
9: Output Σ_{g∈G} cg as the persistent 1-cycle for [b, d)


Figure 5.2: PersLoop user interface demonstrating persistent 1-cycles computed for a 3D point cloud (a) and a 2D image (b), where green cycles correspond to the chosen bars.

Proposition 3 reveals some characteristics of the persistent 1-cycles computed by Algorithms 2 and 3:

Proposition 3 (Minimality Property). The persistent 1-cycle Σ_{g∈G} cg computed by Algorithms 2 and 3 has the following property: there is no non-empty proper subset G′ of G such that Σ_{g∈G′} [cg] = 0 in H1(Kd), where d is the death index of the interval to which Σ_{g∈G} cg is associated.

5.5 Results and Experiments

Our software PersLoop implements Algorithm 3. Given a raw input, which is a 2D image

or a 3D point cloud, and a filtration built from the raw input, the software first generates

and plots the barcode of the filtration. The user can then click an individual bar to obtain

the persistent 1-cycle for the bar (Figure 5.2). The experiments on 3D point clouds and 2D

images using the software show how our algorithm can find meaningful persistent 1-cycles in

several geometric and vision related applications.

Figure 5.3: Persistent 1-cycles (green) corresponding to long intervals computed for three different point clouds

5.5.1 Persistent 1-Cycles for 3D Point Clouds

We take a 3D point cloud as input and build a Rips filtration using the Gudhi library [108].

As shown in Figure 5.3, persistent 1-cycles computed for the three point clouds sampled

from various models are tight and capture essential geometrical features of the corresponding

persistent homology. Note that our implementation of Algorithm 3 runs very fast in practice.

For example, it took 0.3 secs to generate 50 cycles on a regular commodity laptop for the

Botijo (Figure 5.1a) point cloud of size 10,000.
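The Rips construction that the Gudhi call performs can be sketched in a few lines: every simplex enters the filtration at the length of its longest edge, with vertices at 0. A stand-alone sketch, not the Gudhi implementation itself:

```python
import math
from itertools import combinations

def rips_filtration(points, max_edge, max_dim=2):
    """Vietoris-Rips filtration: vertices enter at 0, and every other
    simplex enters at the length of its longest edge, kept only if
    that length is at most `max_edge`."""
    n = len(points)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    simplices = [((i,), 0.0) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices = dimension k - 1
        for s in combinations(range(n), k):
            value = max(dist[e] for e in combinations(s, 2))
            if value <= max_edge:
                simplices.append((s, value))
    # Sort by filtration value, breaking ties by dimension so that
    # faces always precede cofaces.
    simplices.sort(key=lambda t: (t[1], len(t[0])))
    return simplices

# Four corners of a unit square: the two diagonals and all four
# triangles enter together at length sqrt(2), filling the square hole.
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
filt = rips_filtration(pts, max_edge=2.0)
print(len(filt))  # -> 14 (4 vertices, 6 edges, 4 triangles)
```

This brute-force enumeration is exponential in the worst case; libraries such as Gudhi build the same filtration far more efficiently, which is why we use them in practice.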

5.5.2 Image Segmentation and Characterization Using Cubical Complex

In this section, we show the application of our algorithm on image segmentation and

characterization problems. We interpret an image as a piecewise linear function on a 2-

dimensional cubical complex. The cubical complex for an image has a vertex for each pixel,

an edge connecting each pair of horizontally or vertically adjacent vertices, and squares to

fill all the holes such that the complex becomes a disc. The function values on the vertices

are the density or color values of the corresponding pixels. The lower star filtration [44] of


Figure 5.4: Persistent 1-cycles computed for image segmentation. Green cycles indicate persistent 1-cycles consisting of only one component (|G| = 1) and red cycles indicate those consisting of multiple components (|G| > 1). (a,b) Persistent 1-cycles for the top 20 and 350 longest intervals on the nerve dataset. (c) Persistent 1-cycles for the top 200 longest intervals on the Drosophila larvae dataset.

the PL function is then built and fed into our software. We use the coning based annotation

strategy [33] to compute the persistence diagrams. In our implementation, a cubical tree, which is similar to the simplicial tree [9], is built to store the elementary cubes. Each

elementary cube points to a row in the annotation matrix via the union find data structure.

The simplicial counterpart of this association technique is described in [8].
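The cubical construction just described can be sketched as follows: each vertex carries its pixel value, and every edge or square enters the filtration at the maximum value of its vertices, which is the lower star filtration of the induced function. (A stand-alone sketch on a toy image; our actual implementation stores the cells in the cubical tree with annotations, as described above.)

```python
def lower_star_cubical(img):
    """Lower-star filtration of an image on its cubical complex.

    img: 2D list of pixel values. Returns (cell, value) pairs, where a
    cell is the tuple of pixel-vertices it spans and its filtration
    value is the maximum pixel value among them.
    """
    rows, cols = len(img), len(img[0])
    cells = []
    for r in range(rows):
        for c in range(cols):
            cells.append((((r, c),), img[r][c]))           # vertex
            if c + 1 < cols:                               # horizontal edge
                cells.append((((r, c), (r, c + 1)),
                              max(img[r][c], img[r][c + 1])))
            if r + 1 < rows:                               # vertical edge
                cells.append((((r, c), (r + 1, c)),
                              max(img[r][c], img[r + 1][c])))
            if r + 1 < rows and c + 1 < cols:              # square
                quad = ((r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1))
                cells.append((quad, max(img[v[0]][v[1]] for v in quad)))
    cells.sort(key=lambda t: (t[1], len(t[0])))  # value, then dimension
    return cells

# A dark ring (low values) around one bright centre pixel.
img = [[1, 1, 1],
       [1, 9, 1],
       [1, 1, 1]]
filt = lower_star_cubical(img)
print(len(filt))  # -> 25 (9 vertices, 12 edges, 4 squares)
```

Every cell incident to the bright centre pixel enters last, so the dark ring closes into a 1-cycle before the centre fills it in, mirroring how hemorrhage blobs produce long 1-dimensional bars.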

Our first experiment is the segmentation of a serial section Transmission Electron Mi-

croscopy (ssTEM) data set of the Drosophila first instar larva ventral nerve cord (VNC) [19].

The segmentation result is shown in Figures 5.4a and 5.4b, from which we can see that the

cycles correspond exactly to the membranes and hence segment the nerve regions quite

appropriately. The difference between Figure 5.4a and 5.4b shows that longer intervals tend

to have longer cycles. Another similar application is the segmentation of the low magnification

micrographs of a Drosophila embryo [60]. As seen in Figure 5.4c, the cycles corresponding to

the top 200 longest intervals indicate that the larvae image is properly segmented.


Figure 5.5: (a) Hexagonal cyclic structure of silicate glass. (b) Persistent 1-cycles computed for the green bars with red points denoting silicate atoms and grey points denoting oxygen atoms. (c) Persistent 1-cycles computed for the red bars. (d) Barcode for the filtration on silicate glass.

We experiment on another dataset from the STARE project [57] to show how persistent

1-cycles computed by our algorithm can be utilized for characterization of images. The dataset

contains ophthalmologist annotated retinal images which are either healthy or suffering from

diseases. Our aim is to automatically identify retinal and sub-retinal hemorrhages, which are

black patches of blood accumulated inside eyes. Figures 5.1e and 5.2b show that a very tight

cycle is derived around each dark hemorrhage blob even when the input is noisy.

5.5.3 Hexagonal Structure of Crystalline Solids

In this experiment, we use our persistent 1-cycles to describe the crystalline structure of

silicate glass (SiO2), commonly known as quartz. Silicate glass has a non-compact structure with three silicon and three oxygen atoms arranged alternately in a hexagon, as shown in Figure 5.5a.

We build an 8×8×8 weighted point cloud with the silicon and oxygen atoms arranged according

to the space group on the crystal structure as illustrated in Figure 5.5b. The weights of the

60 points correspond to the atomic weights of the atoms. On this weighted point cloud, we

generate a filtration of weighted alpha complexes [43] by increasing α from 0 to ∞.

Persistent 1-cycles computed by our algorithm for this dataset reveal both the local and

global structures of silicate glass. Figure 5.5d lists the barcode of the filtration we build

and Figure 5.5b shows the persistent 1-cycles corresponding to the medium sized green bars

in Figure 5.5d. We can see on close observation that the cycles in Figure 5.5b are in exact accordance with the hexagonal cyclic structure of quartz shown in Figure 5.5a. The larger

persistent 1-cycles in Figure 5.5c, which span the larger lattice structure formed by our weighted point cloud, correspond to the longer red bars in Figure 5.5d. These cycles arise

from the long-range order 1 of the crystalline solid. This is evident from our experiment

because if we increase the size of the input point cloud, these cycles grow larger to span the

entire lattice.

Convinced that representative topological 1-cycles have several real world applications, it is natural to ask whether we can extend the computation of these persistent cycles to higher dimensions. The next section investigates such possibilities and presents an interesting case in general dimensional persistent cycles.

1Long-range order is the translational periodicity where the self-repeating structure extends infinitely in all directions

5.6 Finite Persistent n-cycle

This section is a continuation of the previous one, where we ask the question: are there other interesting cases beyond dimension 1 for which minimal persistent cycles can be computed in polynomial time? We compute minimal persistent cycles with Z2 coefficients in general dimensions. In a recent work [32], a special but important class of simplicial complexes, termed weak (d + 1)-pseudomanifolds, has been identified for which minimal

persistent d-cycles can be computed in polynomial time. We apply this algorithm to several

application domains, which is our contribution detailed in this section. Although the details

of the algorithm and its proof of correctness will appear in another thesis work, we describe

some of the essential concepts to describe the algorithm and its applications.

A weak (d + 1)-pseudomanifold 2 is a generalization of a (d + 1)-manifold:

Definition 9. A simplicial complex K is a weak (d + 1)-pseudomanifold if each d-simplex is

a face of no more than two (d + 1)-simplices in K.

Specifically, if the given complex is a weak (d + 1)-pseudomanifold, the problem of

computing minimal persistent d-cycles for finite intervals can be cast into a minimal cut

problem.

The persistent cycle problem. The minimal persistent cycle problem is as follows:

PCYC-FINd Given a finite d-weighted simplicial complex K, a filtration F : ∅ = K0 ⊆

K1 ⊆ ... ⊆ Kn = K, and a finite interval [β, δ) ∈ Dd(F), this problem asks for computing a

d-cycle with the minimal weight which is born in Kβ and becomes a boundary in Kδ.

In order to solve this problem for weak (d+1)-pseudomanifolds, we need the following

definition as well:

2The naming of weak pseudomanifold is adapted from the commonly accepted name pseudomanifold.

62 Figure 5.6: A weak 2-pseudomanifold Ke embedded in R2 with three voids. Its dual graph is drawn in blue. The complex has one 1-connected component and four 2-connected components with the 2-simplices in different 2-connected components colored differently.

Undirected flow network. An undirected flow network (G, s1, s2) consists of an undirected

graph G with vertex set V (G) and edge set E(G), a capacity function c : E(G) → [0, +∞],

and two non-empty disjoint subsets s1 and s2 of V (G). Vertices in s1 are referred to as sources

and vertices in s2 are referred to as sinks. A cut (S, T) of (G, s1, s2) consists of two disjoint

subsets S and T of V (G) such that S ∪ T = V (G), s1 ⊆ S, and s2 ⊆ T. The set of edges

that connect a vertex in S and a vertex in T is referred to as the set of edges across the cut (S, T) and is denoted ξ(S, T). The capacity of a cut (S, T) is defined as c(S, T) = Σ_{e∈ξ(S,T)} c(e).

A minimal cut of (G, s1, s2) is a cut with the minimal capacity. Note that we allow parallel

edges in G (see Figure 5.6) to ease the presentation. These parallel edges can be merged into

one edge during computation.
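By the max-flow min-cut theorem, a minimal cut of such a network can be computed with any max-flow routine after merging the source set and sink set into single terminals. A minimal Edmonds-Karp sketch on a small hypothetical undirected network (toy capacities, not one of our dual graphs):

```python
from collections import deque

def min_cut(n, edges, sources, sinks):
    """Minimal cut of an undirected flow network via Edmonds-Karp.

    edges: (u, v, capacity) triples; parallel edges are merged by
    summing capacities. Returns (capacity, S) where S is the source
    side of a minimal cut."""
    cap = [[0.0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c  # undirected: capacity in both directions
        cap[v][u] += c
    s, t = n, n + 1     # merge the source/sink sets into single terminals
    cap = [row + [0.0, 0.0] for row in cap] + [[0.0] * (n + 2) for _ in range(2)]
    for u in sources:
        cap[s][u] = float("inf")
    for u in sinks:
        cap[u][t] = float("inf")
    flow = 0.0
    while True:
        prev, q = {s: s}, deque([s])    # BFS for a shortest augmenting path
        while q and t not in prev:
            x = q.popleft()
            for y in range(n + 2):
                if y not in prev and cap[x][y] > 0:
                    prev[y] = x
                    q.append(y)
        if t not in prev:
            break
        path, x = [], t
        while x != s:
            path.append((prev[x], x))
            x = prev[x]
        b = min(cap[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:
            cap[u][v] -= b
            cap[v][u] += b
        flow += b
    S, q = {s}, deque([s])  # vertices reachable in the residual network
    while q:
        x = q.popleft()
        for y in range(n + 2):
            if y not in S and cap[x][y] > 0:
                S.add(y)
                q.append(y)
    return flow, {v for v in S if v < n}

# Toy network: vertex 0 on the source side, vertex 3 on the sink side.
edges = [(0, 1, 3.0), (0, 2, 2.0), (1, 3, 1.0), (2, 3, 4.0)]
print(min_cut(4, edges, sources={0}, sinks={3}))  # -> (3.0, {0, 1})
```

In our setting the edge capacities are the weights of d-simplices (or +∞), so the returned S corresponds to the (d + 1)-chain whose boundary is the minimal persistent cycle.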

5.7 Minimal persistent d-cycles of finite intervals for weak (d + 1)- pseudomanifolds

In this section, we describe the algorithm which computes minimal persistent d-cycles for

finite intervals given a filtration of a weak (d + 1)-pseudomanifold when d ≥ 1. Although the

detailed process can be found in [32], the general process is as follows: Suppose that the input weak (d + 1)-pseudomanifold is K associated with a filtration F : K0 ⊆ K1 ⊆ ... ⊆ Kn and


Figure 5.7: An example of the constructions in our algorithm showing the duality between persistent cycles and cuts having finite capacity for d = 1. (a) The input weak 2-pseudomanifold K with its dual flow network drawn in blue, where the central hollow vertex denotes the dummy vertex, the red vertex denotes the source, and all the orange vertices (including the dummy one) denote the sinks. All "dangled" graph edges dual to the outer boundary 1-simplices actually connect to the dummy vertex and these connections are not drawn. (b) The partial complex Kβ in the input filtration F, where the bold green 1-simplex denotes σ_β^F, which creates the green 1-cycle. (c) The partial complex Kδ in F, where the 2-simplex σ_δ^F creates the pink 2-chain killing the green 1-cycle. (d) The green persistent 1-cycle of the interval [β, δ) is dual to a cut (S, T) having finite capacity, where S contains all the vertices inside the pink 2-chain and T contains all the other vertices. The red graph edges denote the edges across (S, T) and their dual 1-chain is the green persistent 1-cycle.

the task is to compute the minimal persistent cycle of a finite interval [β, δ) ∈ Dd(F). We first construct an undirected dual graph G for K, where vertices of G are dual to (d + 1)-simplices of K and edges of G are dual to d-simplices of K. One dummy vertex, termed the infinite vertex, which does not correspond to any (d + 1)-simplex, is added to G for the graph edges dual to the boundary d-simplices. We then build an undirected flow network on top of G, where the source is the vertex dual to σ_δ^F and the sinks are the infinite vertex along with the set of vertices dual to those (d + 1)-simplices which are added to F after σ_δ^F. If a d-simplex is σ_β^F or added to F before σ_β^F, we let the capacity of its dual graph edge be its weight; otherwise, we let the capacity of its dual graph edge be +∞. Finally, we calculate a minimal cut of this

flow network and return the d-chain dual to the edges across the minimal cut as a minimal persistent cycle of the interval.
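The construction above can be sketched with an off-the-shelf max-flow/min-cut routine. The sketch below uses networkx and assumes the dual graph has already been built; the function name and the toy edge encoding are illustrative, not taken from [32].

```python
import networkx as nx

def minimal_persistent_cycle_cut(dual_edges, source, sinks):
    """Compute a minimal cut of the dual flow network.

    dual_edges: list of (u, v, capacity) for the undirected dual-graph
                edges (capacity = simplex weight, or float('inf') for
                d-simplices added to F after sigma_beta).
    source:     vertex dual to the (d+1)-simplex sigma_delta.
    sinks:      the infinite vertex plus vertices dual to (d+1)-simplices
                added to F after sigma_delta.
    Returns (cut_value, cut_edges); the d-simplices dual to cut_edges
    form the minimal persistent cycle.
    """
    G = nx.DiGraph()
    for u, v, c in dual_edges:
        # model each undirected edge as two directed arcs of equal capacity
        G.add_edge(u, v, capacity=c)
        G.add_edge(v, u, capacity=c)
    # merge all sinks into a single super-sink
    for t in sinks:
        G.add_edge(t, "_super_sink", capacity=float("inf"))
    cut_value, (S, T) = nx.minimum_cut(G, source, "_super_sink")
    cut_edges = [(u, v) for u, v, c in dual_edges if (u in S) != (v in S)]
    return cut_value, cut_edges
```

The chain dual to the vertices in S is the (d+1)-chain whose boundary gives the reported persistent cycle.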

The intuition of the above algorithm is best explained by an example in Figure 5.7, where

d = 1. The key to the algorithm is the duality between persistent cycles of the input interval

and cuts of the dual flow network having finite capacity. To see this duality, first consider a

persistent d-cycle ζ of the input interval [β, δ). There exists a (d + 1)-chain A in Kδ created

by σδ whose boundary equals ζ, so that ζ is killed. We can let S be the set of graph vertices dual to the simplices in A and let T be the set of the remaining graph vertices; then (S, T) is

a cut. Furthermore, (S,T ) must have finite capacity as the edges across it are exactly dual to

the d-simplices in ζ and the d-simplices in ζ have indices in F less than or equal to β. On the

other hand, let (S,T ) be a cut with finite capacity, then the (d + 1)-chain whose simplices

are dual to the vertices in S is created by σδ. Taking the boundary of this (d + 1)-chain, we get a d-cycle ζ. Because the d-simplices of ζ are exactly dual to the edges across (S, T) and each

edge across (S,T ) has finite capacity, ζ must reside in Kβ. We only need to ensure that ζ

contains σβ in order to show that ζ is a persistent cycle of [β, δ). This is established in Proposition 3.2 of [32], which shows that for any finite-capacity cut of the flow network (G, s1, s2), the d-chain ζ is a persistent cycle of [β, δ).

5.8 Experimental results

We experiment with our algorithms for PCYC-FIN2 on several volume datasets. Since volume data have a natural cubical complex structure, we adapt our implementation slightly

in order to work on cubical complexes. The cubical complex for volume data consists of cells

in dimensions from 0 to 3 with the underlying space homeomorphic to a 3-dimensional ball.

Note that a filtration built from a volume dataset does not produce any infinite intervals.

We use the Gudhi [108] library to build the filtrations and compute the persistence intervals.

From the experiments, we can see that the minimal persistent 2-cycles computed by our

algorithms capture various features of the data which originate from different fields. Note


Figure 5.8: (a,b) Cosmology dataset and the minimal persistent 2-cycles of the top five longest intervals. (c,d) Turbulent combustion dataset and its corresponding minimal persistent 2-cycles.

that the combustion, hurricane, and medical datasets are time-varying and we chose a single

time frame to compute the persistent intervals and cycles.

Cosmology. The simulation data shown in Figure 5.8a from computational cosmology [3]

consist of dark matter represented as particles. The thread-like structures in deep purple

shown in Figure 5.8a correspond to sites of large scale structure formation. Galaxy clus-

ters/superclusters are contained in such large scale structures. Figure 5.8b shows the minimal

persistent 2-cycles of the top five longest intervals computed by our algorithms and these

cycles precisely represent the top five galaxy clusters/superclusters in volume.

Combustion. The data shown in Figure 5.8c correspond to the physical variable3 χ from

a model of a turbulent combustion process. The variable χ represents scalar dissipation

rate and provides a measure of the maximum possible chemical reaction rate. The minimal

persistent 2-cycles shown in Figure 5.8d represent areas with high value of χ.

3A physical variable defines a scalar value of a certain kind on each point.


Figure 5.9: (a,b) Minimal persistent 2-cycles for the hurricane model. (c) Minimal persistent 2-cycles of the larger intervals for the human skull. i: Right and left cheek muscles with the right one rotated for better visibility. ii: Right and left eyes. iii: Jawbone. iv: Nose cartilage. v: Nerves in the parietal lobe.

Hurricane. This dataset4 with 11 physical variables corresponds to the devastating hurricane named Isabel. We down-sampled the data to a resolution of 250 × 250 × 50 and worked with two physical variables. The minimal persistent 2-cycle colored blue in Figure 5.9a is

computed on the cloud-volume variable and extracts the eye of the hurricane. The minimal

persistent 2-cycle colored green in Figure 5.9b is computed on the pressure variable and

captures the jagged shape of the pressure variation around the hurricane.

Medical imaging. This dataset from the ADNI [89] project contains the MRI scan of a

healthy human skull. The minimal persistent 2-cycles corresponding to the larger intervals as

shown in Figure 5.9c are computed from two time frames. They extract significant features

such as eyes, cartilages, nerves, and muscles.

4The Hurricane Isabel data is produced by the Weather Research and Forecast (WRF) model, courtesy of NCAR, and the U.S. National Science Foundation (NSF).


Figure 5.10: (a) Cubic lattice structure of BaTiO3 (courtesy Springer Materials [91]) with diffused structure in backdrop. (b) Minimal persistent 2-cycles computed on the original function. (c) Minimal persistent 2-cycles computed on the negated function. (d) Minimal persistent 2-cycles computed on the negated function of a tetragonal lattice structure of BaTiO3. The inlaid picture [91] illustrates the bonds of the structure.

Material science. We consider the atomic configuration of BaTiO3, which is a ferroelectric

material used for making capacitors, transducers, and microphones. Figure 5.10a shows

the atomic configuration of the molecule, where the red, grey, and green balls denote the

Oxygen, Titanium, and Barium atoms respectively, and the radii of the balls equal the radii of

the corresponding atoms. Volume data are built by uniformly sampling a 3 × 3 × 3 lattice

structure similar to the one shown in Figure 5.10a, with the step width equal to one angstrom

(note that Figure 5.10a only shows a 2 × 2 × 2 lattice structure). The scalar value at a point of the volume is determined as follows: for each atom, let the distance from the point to the atom’s

center be d, then the scalar value of the point contributed by the atom is max{w(r − d)/r, 0}, where r is the radius of the atom and w is the atomic weight. The scalar value on the point

is then equal to the sum of the above values contributed by all atoms. For the purpose of this

experiment, we computed minimal persistent 2-cycles on both the original scalar function and

its negated one. Figure 5.10b shows a portion of the minimal persistent 2-cycles computed

on the original function, where the purple, red, and green cycles correspond to atoms of

Barium, Titanium, and Oxygen respectively. In our experiment, every atom corresponds

to such a minimal persistent 2-cycle of a long interval. Figure 5.10c shows a portion of the

minimal persistent 2-cycles computed on the negated function, where the cycles complement

the Barium atoms. Figure 5.10d shows the output on the negated function from a tetragonal

lattice structure [91], where the atomic bonds are not straight (see Figure 5.10d inlay). The

stretch on the lattice structure leads to minimal persistent 2-cycles with non-trivial geometry.
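The scalar field used in this experiment follows directly from its definition above. The sketch below is an illustrative implementation; the atom list and grid encodings are our assumptions.

```python
import numpy as np

def scalar_field(grid_points, atoms):
    """Scalar value at each volume grid point from an atom configuration.

    grid_points: (n, 3) array of sample points.
    atoms:       list of (center, radius r, atomic weight w); an atom at
                 distance d from a point contributes max(w*(r - d)/r, 0),
                 and the point's value is the sum over all atoms.
    """
    values = np.zeros(len(grid_points))
    for center, r, w in atoms:
        d = np.linalg.norm(grid_points - np.asarray(center), axis=1)
        values += np.maximum(w * (r - d) / r, 0.0)
    return values
```

A point at an atom's center receives the full weight w, and the contribution decays linearly to zero at distance r.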

As all these instances show, persistent 2-cycles can be quite instrumental in understanding properties of data from different domains. We build on this understanding in the next chapter, where we discuss the uses of persistent cycles in gene expressions.

Chapter 6: Topologically Relevant Cohort and Gene Expression

6.1 Introduction

The rapid advances in genome-scale sequencing have provided comprehensive catalogs of genes for different organisms. These data give us a broad scope to comprehend the developmental and functional processes of these organisms. Since the advent of the DNA microarray, it is now possible to measure the expression levels of a large number of genes

simultaneously. This has made a holistic analysis possible for gene data using their expression

levels. The stochastic nature of biological processes and associated noise acquired during

the mining process pose a fundamental challenge in modelling a mathematical structure

explaining these high dimensional data. We look into two problems in data analysis involving

gene expressions that are of current research interest.

A genome-wide association study (GWAS) is a method to link a subset of genes to a

particular disease or physical phenomenon in an organism. Identifying specific gene subsets is important not only from a clinical perspective but also from a data science perspective. The assimilation of these subsets enables better phenotype identification and improves prediction of cohort status using machine learning based approaches. Our definition

of cohort follows its common usage in biology where “a cohort is a group of animals of the same species, identified by a common characteristic, which are studied over a period of time as part of a scientific or medical investigation”. For small or medium sized data sets, since

the number of gene expressions in a cohort profile is far greater than the number of sample cohorts, disease prediction using neural networks is challenging, as these architectures largely succeed when the number of samples is much larger. For these cases, it becomes important to identify a subset of genes whose expression levels reflect the phenotype of the cohorts.

In addition, it is often the case that some cohorts have incorrect or uncorrelated data due to instrumental or manual error. Hence, their gene expressions may not reflect their phenotype class. We find in practice that the elimination of such instances leads to better prediction scores and performance. In this work, we use topological data analysis to investigate both of these issues. We identify cohorts which are topologically relevant (Sec. 6.3.2). We show that using these cohorts to determine phenotypes, instead of the entire set, improves classification. Next, in Section 6.3.3 we look into the classic GWAS problem mentioned above to identify a small subset of genes using topological data analysis. We compare classification results obtained by using this reduced gene subset against those obtained by using the full gene pool. The reduced gene profile yields better prediction rates.

In previous works it has been shown that genes sharing similar attributes tend to cluster in high dimensions [46, 86]. This is because protein-encoding genes that are part of the same biological pathway or have similar functionality are co-regulated. This ultimately leads such gene clusters to have similar expression profiles. The property of clustering is essentially captured by the zeroth-order homology class in Persistent Homology (see next section). Motivated by these works, we are interested in finding whether relationships among similar genes exist in the higher-order homology classes as well.

Traditional TDA pipelines, including those in our first two chapters, use Persistent Homology to compute a set of intervals called barcodes which are used as topological features in subsequent processing such as learning [18, 37, 76]. While such barcodes provide robust topological signatures for the persistent features in data (such as tunnels, voids, loops, cycles etc.), their

Figure 6.1: (a): Flowchart for topo-relevant gene expression extraction. Refer to Section 6.3.3 for details. (b): Flowchart for topo-curated cohort extraction. Refer to Section 6.3.2 for details. In both, bold lines show the path to take for training or testing large data. The dotted lines are used in Figure 6.2.

association to data is not immediately clear, thus missing some crucial information. In effect,

since these intervals represent homology classes, they contain the set of all loops around a

topological hole (ref. Fig. 6.3b). Thus using barcodes, it is hard to localize a feature, e.g.,

the shortest cycles or holes in a Persistent Homology class. This, in turn, hinders getting

any direct mapping between the topological signatures and input cohorts or genes. So far

there have been few studies addressing the problem of localizing persistent features, and it has

been shown that finding shortest cycles in given Persistent Homology classes is an NP-hard

problem [32, 36]. However, taking advantage of the recent results in [32, 36], we are able to compute good representative cycles for our application. These cycles capture definitive

geometric features and provide a mapping between two domains of gene expression and

topology.

In this chapter we conduct two main experiments using the representative cycles: one to

extract topologically relevant genes and the other to curate relevant cohorts. For these studies,

some organisms were control units while others were either infected and/or injected with some

antigen. The input consists of a matrix K which has n rows signifying the cohorts and their

corresponding gene expressions in m columns. For obtaining and classifying topologically

relevant (topo-relevant) genes, our experiment follows the pipeline in Fig. 6.1a, whereas for

determining curated cohorts, it follows the pipeline in Fig. 6.1b. For a large data set, we can

trim out both insignificant cohorts and genes starting from the ‘Training data Kn,m’. This

can be done following the pipeline in Fig. 6.2. We train our neural network architecture on

the final curated dataset and thereby test against any unknown cohort. For our experiments, we work on gene expressions extracted from different organisms including Drosophila, Mus

Musculus, and Homo Sapiens. We convert these data into a binary or multi-class classification

problem based on their phenotype and feed it into the pipeline. Our methodology and results

have been listed in section 6.3.


Figure 6.3: (a) Filtration F = K0, ..., K9 explaining persistence pairing. (b) Different H1 cycles for the same homology class.

Figure 6.2: Flowchart of the proposed pipeline. For the units ‘Topo-relevant gene pipeline’ and ‘Topo-curated cohort pipeline’ we follow the dotted lines in Fig. 6.1(a) and (b) respectively.

6.2 Idea

In this section we provide a brief overview of the synthesis of gene expressions.

A gene, informally, is the basic physical unit of heredity; genes are present in two copies, one obtained from each parent. A gene is a sequence of nucleotides (a molecule of sugar

linked to a nitrogen-containing organic ring compound attached to a phosphate group) in

DNA or RNA. These genes contain instructions which are read to synthesize protein or RNA

molecules. These molecules are useful in carrying out all the vital functions of our body such

as producing enzymes, hormones, or building muscles, bones, blood etc. Gene expression

is the process by which instructions in the genes are converted to the functional products

as proteins. It is a tightly regulated process that allows a cell to respond to its changing

environment. For instance, if an organism consumes some nutrition, there is an increase in the expression of the genes which produce the proteins responsible for assimilating that nutrition and converting it to energy. Hence any phenotype (an observable trait) is a direct result of a change in gene expression levels for the gene or set of genes responsible for the said phenotype. The regulation of gene expression thus gives control over the timing, location, and amount of a given gene product (protein or RNA) present in a cell, and can have a profound effect on cellular structure and function. Regulation of gene expression is the basis for cellular differentiation, development, morphogenesis, and the versatility and adaptability of any organism. Any disease or physical phenomenon results from the interplay of many genes and environmental factors. The challenge is thus to unravel the individual sets of genes responsible for a certain phenotype through changes in their expression levels (be it up- or down-regulation). These changes in expression levels may either be a direct response to some change in environment (food intake or a virus) or a cascading effect of changes in the expression levels of other genes that act as regulators. In either case, identifying and mapping changes in gene expression levels associated with a certain phenotype is vital in several aspects of life science, including disease diagnosis.

6.3 Methods

We work under the hypothesis that topological data analysis extracts relevant information sufficient for cohort classification. We note that topological feature extraction methods used in earlier works may not work in this setting. Traditionally, for many applications in bioscience (say, protein classification) and engineering, we find corresponding topological signatures using Persistent Homology for each sample (in this case cohorts or genes). These signatures are appended to the feature vectors. However, in this case, since each cohort is represented by a single 1D vector of gene expression levels, we are not able to find suitable signatures to append. This is why the algorithms we described in the previous sections come in handy, as we will see in this section. For algorithms 3 and 5.7, we need a simplicial complex

K, a filtration F, and finite intervals. For all the studies in the chapter, we use Sparse

Rips [102] to obtain the simplicial complex K and its filtration (F). We can apply the theory

of Persistent Homology to obtain the set of all finite intervals. In addition, algorithm 5.7 requires a weak pseudomanifold (Ke) instead of a regular simplicial complex K. For our case, this means that every triangle (d = 2-simplex) has at most two tetrahedra ((d + 1) = 3-simplices) attached to it. We convert K into Ke by allowing at most two cofaces (tetrahedra) per triangle, keeping those which appear first in the filtration:

• Add all simplices of dimension 0, ..., d to Ke.

• For each d-simplex σd ∈ K:

– Sort its cofaces T = {σd+1} by F(σd+1).

– If |T| ≥ 2, insert into Ke the first two σd+1 in T;

– Else, insert T into Ke.
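The steps above translate into a short routine. This is a minimal sketch, assuming simplices are encoded as frozensets of vertex ids and the filtration is a dictionary of filtration values; both encodings are our assumptions.

```python
def to_pseudomanifold(simplices, filt, d=2):
    """Restrict K so every d-simplex keeps at most two (d+1)-cofaces,
    preferring the cofaces that appear earliest in the filtration.

    simplices: iterable of simplices as frozensets of vertex ids.
    filt:      dict mapping each simplex to its filtration value.
    """
    K = set(simplices)
    # keep all simplices of dimension <= d (a k-simplex has k+1 vertices)
    Ke = {s for s in K if len(s) <= d + 1}
    kept_cofaces = set()
    for sigma in (s for s in K if len(s) == d + 1):  # the d-simplices
        cofaces = [t for t in K if len(t) == d + 2 and sigma < t]
        cofaces.sort(key=lambda t: filt[t])
        kept_cofaces.update(cofaces[:2])  # first two in filtration order
    return Ke | kept_cofaces
```

For d = 2 this keeps, per triangle, the two earliest tetrahedra containing it.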

6.3.1 Data

We have a set of n cohorts (C), each represented by the gene expression profile of m genes (G). Thus our input is a matrix K of dimension (n × m) where each Ki,j represents the j-th gene of the i-th cohort. In addition, we have X : C → I, where I is the phenotype for the cohort.

For instance, X(c) = 0 may imply that c is a healthy control, whereas X(c) = 1 may imply

they are infected or treated with an antigen depending on the experiment. Throughout our

experiments we will work on several datasets containing gene expression profile of different

organisms [90]. We provide a brief description of these data.

Figure 6.4: t-SNE on the entire cohort point cloud (D0). Red vertices indicate cohorts included in the top 100 H2 cycles whereas blue indicate otherwise.

(DT = Decision Tree, NB = Naive Bayes Classifier; for each classifier the columns are FULL, H1+H2, and H2.)

| Dataset | Metric | DT FULL | DT H1+H2 | DT H2 | NB FULL | NB H1+H2 | NB H2 |
|---|---|---|---|---|---|---|---|
| Droso Melano | # | 131 | 116 | 101 | 131 | 116 | 101 |
| | Accuracy | 0.714125 | 0.751768 | 0.793434 | 0.398146 | 0.412121 | 0.422444 |
| | Precision | 0.745417 | 0.815000 | 0.835000 | 0.389111 | 0.431756 | 0.451673 |
| | Recall | 0.712500 | 0.754167 | 0.795833 | 0.400000 | 0.416667 | 0.434478 |
| Droso Parasitoid | # | 89 | 85 | 51 | - | - | - |
| | Accuracy | 0.792778 | 0.796667 | 0.811667 | - | - | - |
| | Precision | 0.817381 | 0.823571 | 0.859167 | - | - | - |
| | Recall | 0.792500 | 0.797500 | 0.825000 | - | - | - |
| Mouse Prion | # | 321 | 292 | 168 | 321 | 292 | 146 |
| | Accuracy | 0.562310 | 0.616240 | 0.586843 | 0.555112 | 0.578489 | 0.576131 |
| | Precision | 0.562716 | 0.591471 | 0.543743 | 0.383462 | 0.378556 | 0.384572 |
| | Recall | 0.539712 | 0.564394 | 0.558267 | 0.415855 | 0.422354 | 0.423651 |
| Mouse Liver Cancer | # | 242 | 229 | 190 | 242 | 229 | 190 |
| | Accuracy | 0.682761 | 0.698934 | 0.729545 | 0.723232 | 0.723232 | 0.721404 |
| | Precision | 0.590716 | 0.579833 | 0.656051 | 0.444761 | 0.444761 | 0.412018 |
| | Recall | 0.573319 | 0.602582 | 0.641168 | 0.499837 | 0.499837 | 0.506429 |
| Mouse EColi | # | 226 | 206 | 166 | 226 | 206 | 166 |
| | Accuracy | 0.880731 | 0.851794 | 0.892900 | 0.592770 | 0.592105 | 0.592105 |
| | Precision | 0.880541 | 0.853406 | 0.901481 | 0.604010 | 0.651101 | 0.652203 |
| | Recall | 0.868052 | 0.842963 | 0.891786 | 0.509841 | 0.511111 | 0.511111 |
| Human Bowel Disease | # | 1745 | 101 | 101 | - | - | - |
| | Accuracy | 0.499698 | 0.510987 | 0.510987 | - | - | - |
| | Precision | 0.493808 | 0.509147 | 0.509147 | - | - | - |
| | Recall | 0.491258 | 0.501173 | 0.501173 | - | - | - |

Table 6.1: Classification using topo-relevant cohorts. Each of the datasets is explained in Section 6.3.1. The # symbol indicates the size of each dataset. A ‘-’ in the table means the statistics were too low: the relevant classifier was unable to classify the given data.

(D0): Droso Breeding: In this data set, Drosophila melanogaster larvae are bred on an Aspergillus nidulans-infested breeding substrate. The phenotypes differ by the

breeding condition for the Drosophilas. We assign label 0 to control, label 1 to the

Drosophilas bred on Aspergillus nidulans mutant laeA, and label 2 to both the

Drosophilas bred on wild Aspergillus nidulans and sterigmatocystin. Note that in this

experiment, mutating laeA from wild Aspergillus nidulans removes sterigmatocystin

production. Hence, both the wild Aspergillus and the class with external sterigmatocystin should have similar gene expression profiles. The experiments on the dataset website confirm this fact, as there is no change in any gene expression profile between these two classes. The number of cohorts in the database is 131.

(D1): Droso Parasitoid: The data contain the profile of Drosophila larvae after a parasitoid attack. There are two labels on the phenotype, one for the control and the other for the cohorts under parasitoid attack. Thus we have a binary classification problem in this case. The total cohort count is 91.

(D2): Mouse Prion: This data has Mus Musculus as the cohort. The experiment investigates the effects of two different strains of prion disease. The phenotypes are ‘RML infected’, ‘301V infected’, and the healthy control, which are assigned labels 0, 1, and 2 respectively. The total cohort count is 418.

(D3): Mouse Liver Cancer: This is again a binary classification problem for Mus Musculus. The two phenotypes are the control type and liver cancer cohorts. The total cohort count is 242.

(D4): Mouse EColi: The three phenotypes in this dataset are Escherichia coli, Staphylococcus, and control. The total number of cohorts across all three phenotypes is 321.

(D5): Human Bowel Disease: A binary classification problem where the phenotypes are from cohorts suffering from Crohn’s disease and placebo cases. This is a bigger dataset having gene expressions of 1745 humans.

(D6): Human Bone Marrow: This data set contains gene expressions of patients having bone marrow failure and cytogenetic abnormalities, along with healthy cohorts who serve

as control. This dataset has 469 cohorts.

(D7): Human Dengue: This is yet another big dataset having two types of phenotypes, where we have gene expressions of dengue patients versus control cohorts. The cohort count for this dataset is 2978.

We apply the t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize our

entire cohort point cloud for the D0 Drosophila dataset in Fig. 6.4. To get a sense of the

distribution of topological cycles, we calculate the top 100 representative H2 cycles based on

their interval length (δ − β). In figure 6.4, we color a cohort vertex red if it is contained in

any of the top 100 H2 cycles. The cohorts not included are painted blue. This figure shows

the uniform distribution of the topological cycles w.r.t the entire dataset.
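The coloring used in Figure 6.4 can be sketched with scikit-learn's t-SNE as follows; the perplexity value and the function name are illustrative choices, not taken from the dissertation.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_cohorts(K, top_cycle_vertices, random_state=0):
    """2-D t-SNE embedding of the (n x m) cohort point cloud, plus a color
    flag per cohort: red if it lies on one of the top H2 cycles, blue
    otherwise (matching the coloring of Figure 6.4)."""
    coords = TSNE(n_components=2, random_state=random_state,
                  init="random", perplexity=5).fit_transform(K)
    colors = ["red" if i in top_cycle_vertices else "blue"
              for i in range(len(K))]
    return coords, colors
```

The returned coordinates and colors can be fed directly to any 2-D scatter plot.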

We now discuss the two ways to reduce the input Kn,m into K′n′,m′ where n′ ≤ n and m′ ≤ m. The first section deals with finding pertinent cohorts, and the next with finding pertinent genes.

6.3.2 Topo-Curated Cohort

For our first proof of concept, we find a subset of cohorts who provide topologically

relevant information for classification. The aim is to remove cohorts having either incorrect

or uncorrelated data due to instrumental or manual error. Specifically, given Kn,m, we would

like to find K′n′,m ⊆ Kn,m for n′ ≤ n which improves classification odds for the cohorts. This subset of n′ cohorts should therefore be topologically more relevant. We start by converting

the matrix Kn,m into a point cloud. This point cloud has n points each of dimension m.

Hence each cohort in the matrix is converted to an m-dimensional point where each dimension

represents the expression level for each gene. We use Sparse Rips on the resulting point cloud

to obtain a simplicial complex K and its filtration (F) and apply the theory of Persistent

Homology to obtain the set of finite intervals.

Figure 6.5: Plot of geodesic centers for dominating cycles using t-SNE. Red vertices: non-dominating cycles. Graded green points: dominating cycles. Alpha values indicate the ratio of the dominating phenotype in each cycle versus the other labels.

We consider the dataset D0 having three phenotypes. We generate the longest 100 H2 cycles based on their interval length (δ − β). For each cycle, we consider the constituent vertices and their corresponding phenotype labels (X). We plot the count of X values in individual H2-cycles in Fig. 6.7(a), with the X, Y, and Z axes representing X = 0, 1, and 2


Figure 6.6: Three figures plotting individual dominating cycles for gene dataset D0. These cycles actually reside in m dimensions and are projected down to 3D using principal component analysis. The colors indicate cohort phenotype labels X ∈ {0, 1, 2}.

respectively. The black points indicate cycles where all vertices belong to a single phenotype.

The red, green, and blue points indicate cycles having labels (0, 1), (0, 2) and (1, 2) respectively.

The yellow points correspond to cycles having all three labels 0, 1, and 2. The takeaway from

this plot is that, since most points are skewed towards some particular axis, most H2-cycles

have constituent vertices that belong predominantly to some particular label in X. Thus

topological cycles in general are inclined towards some X labels without any supervision as

they were not fed with the phenotype labels. Note that we added a small random noise to

each point coordinate to illustrate multiplicity. Figure 6.7(b) plots similar values for the top

200 H2 cycles for dataset D1. Since this dataset has two phenotypes, we get a 2D plot. The red labels denote cycles which have equal counts of the constituent phenotypes, whereas blue and cyan represent skew, with black representing single-labeled cycles as before. As is evident, most

cycles exhibit a predominance in either X = 0 or 1. Based on the intuition of this plot, we define a cycle Z as a Dominant Cycle if there exists a vertex set U ⊆ Vert(Z)5 such that every vertex in U has the same label and |U| ≥ |Vert(Z)|/2.6
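The definition translates directly into code; a minimal sketch (the vertex and label encodings are our assumptions):

```python
from collections import Counter

def is_dominant(cycle_vertices, labels):
    """A cycle Z is a Dominant Cycle if some subset of Vert(Z) sharing a
    single phenotype label covers at least half of the cycle's vertices."""
    counts = Counter(labels[v] for v in cycle_vertices)
    return max(counts.values()) >= len(cycle_vertices) / 2
```

The largest single-label group plays the role of the set U in the definition.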

To illustrate the frequency of dominating cycles versus non-dominating ones, we plot

the geodesic centers of the H2 cycles for D0 by projecting them down to 2D using t-SNE (Fig. 6.5). Red vertices indicate non-dominating cycles while each of the graded green points

indicate the dominating ones. Clearly, most of the topology cycles are dominating and

indicate a vote towards some phenotype class. The alpha values (denoted by the green bar at the right) indicate the ratio of the dominating phenotype in each cycle versus the other labels (X). Hence, intuitively, the more opaque a given point is, the more it is dominated by a single

class phenotype. Finally, we plot some of the individual dominating H2 cycles along with

their phenotype labels in Fig. 6.6. Note that these points are part of the original D0 cohort

point cloud and they were projected down to 3D using PCA.

Classification using machine learning

We work on several gene expression data extracted from different organisms. On each of

these, we create a classification problem as described in the data section. For each dataset, we use the entire cohort list (irrespective of phenotype) as an (n × m)-dimensional

point cloud. We generate the top 100 H1 and H2 cycles and select the dominant cycles.

Next we select the vertices contained in these dominant cycles, which form our new set of n′ (≤ n) cohorts. Taking the gene expressions for these n′ cohorts lets us form our new smaller matrix K′n′,m. Thereafter, we train supervised classification models once using Kn,m and then again using K′n′,m and compare results for each. We use 10-fold cross validation by splitting the data randomly into 80%-20% in each fold. For our classification models, we use Decision

Tree and Naive Bayes. The average value of accuracy, precision, and recall for the 10-fold

5 Vert(Z) denotes the constituent vertices of the cycle Z. 6 |·| denotes the size of a set.


Figure 6.7: (a): Count of vertex labels in individual H2-cycles for D0. The red points indicate cycles having phenotype labels 0 and 1, blue indicates cycles with labels 1 and 2, whereas green (very few in the top 500 H2 cycles) indicates labels 0 and 2. (b): Count of vertex labels in individual H2-cycles for D1. Red indicates cycles having equal counts of phenotype-labelled vertices. Blue and cyan indicate prevalence of labels 0 and 1 respectively. In both diagrams, black points indicate cycles having a single phenotype label.

cross validation is reported in Table 6.1. The column ‘FULL’ represents training on Kn,m while ‘H1+H2’ represents the union of the n′ topo-relevant cohorts obtained from the dominant cycles

in either H1 or H2. We also get good classification statistics for the vertices in dominant cycles picked up by H2 cycles only, as reported in the same table. As is evident from the

results, reduction in the number of cohorts leads to an increase in classification measures.

Thus TDA is able to pick out cohorts that carry more decisive gene expression levels for their

individual phenotype classes.
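The evaluation protocol can be sketched with scikit-learn; macro averaging for precision and recall is our assumption, since the dissertation does not state which averaging it uses.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def evaluate(K, y, model):
    """10-fold cross-validation reporting the averaged accuracy,
    precision, and recall, mirroring the layout of Table 6.1."""
    scores = cross_validate(model, K, y, cv=10,
                            scoring=("accuracy", "precision_macro",
                                     "recall_macro"))
    return {m: scores["test_" + m].mean()
            for m in ("accuracy", "precision_macro", "recall_macro")}
```

Running `evaluate` on Kn,m and on the curated K′n′,m with a `DecisionTreeClassifier()` or `GaussianNB()` gives the two columns compared in the table.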

6.3.3 Topo-Relevant Gene Expression

Our next problem is to reduce the matrix Kn×m to K′ of dimension (n × m′) where m′ ≪ m. We use the persistent cycle descriptors H1 and H2 introduced in the previous section to extract m′ meaningful genes (G′) such that G′ ⊂ G. To this effect, we use the

84 annotation of the gene set G based on their functional classification obtained from the ‘Panther

Classification System by GeneOntology’ [77] and the ‘NCBI Gene Dataset’ [83]. Thus for

each g ∈ G, ∃f : g → R, where R is a vector of functional attributes obtained from [77].

Once we obtain the representative cycles, we find the maximal cover of each cycle defined

as follows:

Maximal cover of a representative cycle (κ): For each gene expression g ∈ Vert(Z)

represented as vertices in a single representative cycle, we have a set of annotations f(g). We

select the minimum set consisting of at least one annotation for each g ∈ Vert(Z). Let S be

any set of annotations which contains at least one annotation for each g ∈ Vert(Z). Thus,

κ = inf{ |S| : ∀g ∈ Vert(Z), S ∩ f(g) ≠ ∅ }

The idea behind using κ is to get a sense of the functionality of the gene. A gene may

be responsible for multiple processes described in the Panther and NCBI database. If κ

is low or unity for a certain Z, it probably indicates that the gene expressions involved in

Z reflect the functionality captured by κ. This is illustrated in figure 6.8 where we plot

some of the H2 cycles generated on K′, with colors annotated by their functionality. We use

PCA as before to project the points down to 3-dimensions. The three figures illustrate three

instances of different κ-values. Consider the example in Fig.6.8(a) for getting the intuition

behind κ. The six vertices representing genes in the H2 cycles have function annotations: {1:

Localization, 2: Not annotated, 3: Metabolic process, Cellular process, 4: Metabolic process,

Cellular process, Biological regulation, 5: Metabolic process, Cellular process, Localization, 6:

Not annotated}. Of these, the set {Localization, Not annotated, Metabolic process}

covers all the vertices, and hence κ is 3.
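As an illustration, κ can be computed exactly as a minimum set cover over the annotations of a cycle's vertices. The sketch below recovers κ = 3 on the six-vertex example of Fig. 6.8(a); the brute-force search over annotation subsets is an illustrative choice (feasible because cycles are small), not necessarily the thesis implementation:

```python
from itertools import combinations

def maximal_cover(annotations):
    """kappa: size of a smallest annotation set hitting every vertex of Z.

    annotations: dict mapping vertex g -> set of functional annotations f(g).
    Exact search by increasing subset size (fine for small cycles).
    """
    universe = sorted(set().union(*annotations.values()))
    for k in range(1, len(universe) + 1):
        for subset in combinations(universe, k):
            chosen = set(subset)
            # chosen covers Z iff it intersects f(g) for every vertex g
            if all(chosen & f_g for f_g in annotations.values()):
                return k
    return len(universe)

# The six-vertex H2 cycle from Fig. 6.8(a)
f = {
    1: {"Localization"},
    2: {"Not annotated"},
    3: {"Metabolic process", "Cellular process"},
    4: {"Metabolic process", "Cellular process", "Biological regulation"},
    5: {"Metabolic process", "Cellular process", "Localization"},
    6: {"Not annotated"},
}
print(maximal_cover(f))  # -> 3
```

Vertex 1 forces "Localization", vertices 2 and 6 force "Not annotated", and one of "Metabolic process" or "Cellular process" covers the rest, so no 2-element set suffices.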

We choose cycles with low κ values and select their component genes as part of G′. We can

control the size of G′ through the threshold on κ.
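A minimal sketch of this selection step, assuming each cycle is given as a list of gene identifiers with its κ already computed; the function name and threshold parameter are hypothetical:

```python
def select_genes(cycles, kappa_values, kappa_max):
    """Build G' as the union of component genes of all cycles with low kappa.

    cycles       : list of cycles, each a list of gene identifiers (Vert(Z))
    kappa_values : kappa of each cycle, aligned with `cycles`
    kappa_max    : threshold controlling the size of G'
    """
    selected = set()
    for cycle, kappa in zip(cycles, kappa_values):
        if kappa <= kappa_max:
            selected |= set(cycle)  # add the cycle's genes to G'
    return selected

# Toy usage: only cycles with kappa <= 2 contribute genes
g_prime = select_genes([["g1", "g2"], ["g3"], ["g2", "g4"]], [1, 5, 2], 2)
print(sorted(g_prime))  # -> ['g1', 'g2', 'g4']
```

Lowering `kappa_max` shrinks G′, trading coverage for functional coherence of the retained genes.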

For all our experiments, we run each architecture and obtain performance measures

on K, which contains the exhaustive set of m genes. We re-run these experiments on our

trimmed set K′ containing m′ (≪ m) topologically significant genes. Note that we may use the

topo-relevant cohort extraction to additionally reduce the number of cohorts from n to n′, obtaining K″ of dimension n′ × m′. But since the

public datasets we use as our proof of concept contain too few samples for a NN architecture, we do not trim the number of cohorts.


Figure 6.8: Three 2-cycles for gene dataset D6 with colors indicating gene function labels. The total number of colors indicates the κ value of the cycle. Note that a gene can be responsible for several functionalities; the legend in this plot takes into account only the single functionality which contributes towards the maximal cover of the representative cycle. (a,b) Low κ: 3. (c) High κ: 6.

6.3.4 Neural Network Architecture

We use a one-dimensional convolutional neural network to perform experiments on gene-

expression data. Our architecture is inspired by [63], who procured promising

results on the same task. The authors use a series of dense networks connected by activation

functions. Since we provide some functional relevance among the genes, we sort them by

their functionality and feed them to a convolutional layer. We start with this 1D-CNN layer

activated by the sigmoid function. Sigmoid is a traditional activation function which provides

a smooth nonlinearity in the network, and since the architecture is not too deep, we need not worry about its shortcomings such as the vanishing gradient. This is followed by a max pooling of size 2 and subsequently a dropout layer. This layer is connected to two densely connected layers of decreasing sizes. These layers use ReLU as their activation function, as in [63]. In the end, we add a softmax activation layer to determine the

final label of the data. The hyper-parameters of the network can be tuned using advanced hyper-parameter optimization algorithms such as Bayesian optimization. However, since this study is a proof of concept, and its purpose is to show the effectiveness of our feature selection, we fine-tune them by manual observation.
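The layer sequence described above can be sketched in Keras as follows. The filter count, kernel size, dense-layer widths, and dropout rate are hypothetical placeholders, since the thesis tunes these by manual observation; only the layer ordering and activations follow the text:

```python
from tensorflow.keras import layers, models

def build_model(n_genes, n_classes, n_filters=32, kernel=8, drop=0.5):
    """1D-CNN over genes sorted by functionality (layer sizes are placeholders)."""
    model = models.Sequential([
        layers.Input(shape=(n_genes, 1)),              # one expression value per gene
        layers.Conv1D(n_filters, kernel, activation="sigmoid"),
        layers.MaxPooling1D(pool_size=2),              # max pooling of size 2
        layers.Dropout(drop),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # two dense layers of
        layers.Dropout(drop),
        layers.Dense(32, activation="relu"),           # decreasing size
        layers.Dropout(drop),
        layers.Dense(n_classes, activation="softmax"), # final label
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g. the bowel-disease reduced gene set with binary labels
model = build_model(n_genes=1801, n_classes=2)
```

The sigmoid-activated convolution sits directly on the functionally sorted gene vector, so neighbouring filters mix functionally related genes.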

Since the number of samples is still small for a CNN, overfitting is an issue. Notice that, for this precise reason, we do not curate this data using the pipeline in section 6.3.2. Dropout layers are added after each layer to further prevent overfitting and reduce high variance. We, however, do not use early stopping, as those pipelines are not amenable to orthogonalization.

Finally, the model is optimised using Adam, which adapts the learning rate during training. The dataset is split 80-20 and cross-validated for 50 epochs. The neural network is implemented in Python using Tensorflow and Keras. The results of our experiments on datasets D5, D6 and D7 are shown in table 6.2. The ‘# genes’ row shows that the gene sets selected from topological cycles are less than 30% the size of the original gene pool. The results have, however, improved in all the cases. We also compare our technique against a baseline generated by randomly choosing the same number of genes as the curated data. For the human dengue dataset, we average this baseline over 10 random samples.

It gives an average test loss, accuracy, precision and recall of 0.3262, 81.59, 81.2 and

81.2 respectively, so our technique performs substantially better than the baseline. For the human bowel disease dataset (D5), the training result on the full data is slightly better than on our curated data. This is probably due to overfitting, since its performance

Data Name       Human Dengue        Human Bone Marrow    Human Bowel Disease
                (#: 4415)           (#: 469)             (#: 1745)
Method          TP (Z)    Full      TP (Z)    Full       TP (Z)    Full
# genes         —         4415      5464      17258      1801      54715
Tr-Loss(e−2)    5.95      10.06     05.16     05.70      13.11     9.58
Tr-Acc          97.84     96.64     99.72     99.15      96.63     97.56
Tr-F1           97.86     96.48     99.72     99.15      96.60     97.55
Tr-Prec         97.86     96.48     99.72     99.15      96.60     97.55
Tr-Rec          97.86     96.48     99.72     99.15      96.60     97.55
Ts-Loss(e−2)    21.99     14.55     06.34     51.30      84.29     83.73
Ts-Acc          93.21     91.65     97.46     95.76      90.10     89.62
Ts-F1           92.26     90.67     96.95     95.74      90.34     89.66
Ts-Prec         93.48     90.67     96.95     95.74      90.34     89.66
Ts-Rec          93.48     90.67     96.95     95.74      90.34     89.66

Table 6.2: Neural network results. The column TP (Z) indicates results on the reduced gene set obtained using topology; Full indicates results on the full gene set. Tr-Loss, Tr-Acc, Tr-F1, Tr-Prec, and Tr-Rec are the loss, accuracy, F1-score, precision, and recall on the training data; the prefix Ts- indicates the same on the test set.

improvement is not reflected in the test data. We could employ other techniques to regularise

this instance, but since we wanted to preserve the same architecture for all datasets, we left the results

as is. On reducing the number of features, we get an improvement in the metrics on the

test set.

We follow the trend of the loss, accuracy and F1 score by plotting their values after every

epoch. Figure 6.9 shows this result on dataset D7. We see that the loss on the test data is slightly higher but smoother than for the full dataset. Despite this,

with TDA the accuracy and F1 score are consistently better in every iteration

for both the training and test data.


Figure 6.9: Comparison of (a) accuracy, (b) F1-score, and (c) loss over 50 epochs. For the TDA-curated data, red and yellow lines represent train and test scores respectively; for the full data, they are represented by the green and blue lines.

Chapter 7: Contributions and Future Work

This work deals with the creation of several topological tools that help in understanding

data in a number of domains. In Chapter 3, we accumulated ample evidence that topological

features provide additional information for the classification of images. This is not surprising

as most techniques rely on geometric, textural, or gradient based features for classification

that do not necessarily capture topological features. The aggressive sub-sampling-based

algorithm helped reduce the computational time for generating the topological signatures, which was the main bottleneck. This work was an attempt to see whether topology provides added

features for machine learning. Since the results affirm its effectiveness, further research

should be done to find more innovative methods to assimilate the topological features in the

deep learning architectures.

In the next chapter, we showed a practical topological technique to generate signatures

for protein molecules that can be used as feature vectors for classification. This work

uses unaided persistent barcodes to transcend state-of-the-art techniques. In addition

to procuring encouraging results on classification, we can draw a direct correlation between

the persistent intervals and the hierarchical protein structures. Since we investigated the

descriptive power of our signature, we believe it can be used for other purposes such as

protein energy computation, or finding protein B-factor. We believe that this signature can

be extended to other biomolecular data such as DNA or enzymes.

In the next chapter, we inspect methods to compute minimal persistent cycles. We

generate representative persistent 1-cycles which are computable in polynomial time. We further showed that if the output cycle has one component (i.e., ∑g∈G cg contains only one component), the computed cycle is minimal. Further, our experiments highlighted the

frequent occurrence of such events. Finally, we showed that Algorithm 3, which computes a

persistent 1-cycle for a specific interval, runs in O(n^ω) time. Note that this algorithm only needs

to compute the cycles for intervals containing the input interval. Since a user is often

interested in a long interval, the intervals containing it constitute a small subset of all the

intervals. This makes Algorithm 3 run much faster in practice.

For the general dimension, we find the problems to be tractable if the given complex is

a weak (d+1)-pseudomanifold. The details of the algorithm are not part of this thesis and will appear elsewhere. We focus on the applications of these algorithms in image analysis,

materials science, medical data, and scientific visualisation, among others. This research leads to

some open questions concerning persistent cycles:

• In our experiments, some persistent cycles correspond to important features of the

data (see Section 5.8). However, we also ran into some intervals whose persistent cycles

do not have obvious meanings. If there are ways to design filtrations for data such

that persistent cycles are related to the important features, then the prospect for the

application of persistent cycles or persistence in general would be more extensive.

• In Section 5.6, we presented O(n^2)-time algorithms for computing a minimal

persistent cycle for a given interval. A natural question is whether this time complexity

can be improved. Furthermore, can we devise a better algorithm to compute minimal

persistent cycles for all intervals (i.e., the minimal persistent basis described in Chapter

5), improving upon the obvious O(n^3)-time algorithm that runs our algorithm on each

interval?

Finally, in our last chapter, we utilised the representative persistent cycles to extract

relevant cohorts and gene expressions so as to improve feature selection. Both our test cases

show that the data follow some topological alignment, due to which the representative cycles

are able to extract the “crux” of the data. This is why we are able to fit our training

models better and reduce variance, thereby getting better accuracy and F1-score. In future work, one can try to further tune our models so as to correlate the selected features with

their functionality.

This work makes it evident that topological data analysis has an important role to play in

understanding and describing complex data. This is not surprising as most classical techniques

to describe data (including neural networks) rely on geometric, textural, or gradient based

features that do not necessarily capture topological features.

Having said that, it is not true that the notions in topology are entirely untrodden.

For instance, as discussed in the last chapter, any zero-dimensional barcode using a Rips

or similar filtration encodes information similar to hierarchical clustering. One-dimensional

simplicial complexes are essentially graph structures. Machine learning techniques using

random forests echo similar ideas. So it would be more meaningful to perceive topological

advancements as a natural extension of traditional data science measures as opposed to a

distinct domain. Based on the evidence found in this thesis, it would be safe to

echo Dr. Noah Giansiracusa’s view that “TDA should be viewed as a tool to be added to the quiver of data science arrows, rather than an entirely new weapon.”

What would be interesting to see are the effects of even higher-order homology features in

data analysis. We have already seen them to be instrumental in several domains in Chapters

5 and 6. To further the cause, better use of TDA to explain data would require further

investigation of the theoretical premises of higher-order homology. This needs to be coupled with the development of better toolkits for use by researchers from other domains.

Bibliography

[1] Henry Adams and Gunnar Carlsson. On the nonlinear statistics of range image patches. SIAM Journal on Imaging Sciences, 2(1):110–117, 2009.

[2] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1–35, 2017.

[3] Ann S. Almgren, John B. Bell, Mike J. Lijewski, Zarija Lukić, and Ethan Van Andel. Nyx: A massively parallel AMR code for computational cosmology. The Astrophysical Journal, 765(1):39, feb 2013.

[4] Javier Arsuaga, Tyler Borrman, Raymond Cavalcante, Georgina Gonzalez, and Catherine Park. Identification of copy number aberrations in breast cancer subtypes using persistence topology. PubMed: 27600228, 4:339–369, 2015.

[5] Aras Asaad and Sabah Jassim. Topological data analysis for image tampering detection. In Christian Kraetzer, Yun-Qing Shi, Jana Dittmann, and Hyoung Joong Kim, editors, Digital Forensics and Watermarking, pages 136–146, Cham, 2017. Springer International Publishing.

[6] P. Bendich, H. Edelsbrunner, and M. Kerber. Computing robustness and persistence for images. IEEE Transactions on Visualization and Computer Graphics, 16(6):1251–1260, Nov 2010.

[7] Juliana Bernardes, Gerson Zaverucha, Catherine Vaquero, Alessandra Carbone, and Levitt Michael. Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence. Public Library of Science, 12, 07 2016.

[8] Jean-Daniel Boissonnat, Tamal K. Dey, and Clément Maria. The compressed annotation matrix: an efficient data structure for computing persistent cohomology. CoRR, abs/1304.6813, 2013.

[9] Jean-Daniel Boissonnat and Clément Maria. The simplex tree: An efficient data structure for general simplicial complexes. 20th Annual European Symposium, Ljubljana, Slovenia, 2:731–742, 2012.

[10] Thomas Bonis, Maks Ovsjanikov, Steve Oudot, and Frédéric Chazal. Persistence-based pooling for shape pose recognition. In Alexandra Bac and Jean-Luc Mari, editors, Computational Topology in Image Context, pages 19–29, Cham, 2016. Springer International Publishing.

[11] Glencora Borradaile, Erin Wolf Chambers, Kyle Fox, and Amir Nayyeriy. Minimum cycle and homology bases of surface-embedded graphs. Journal of Computational Geometry, 8(2), 2017.

[12] Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1):77–102, 2015.

[13] Inbal Budowski-Tal, Yuval Nov, and Rachel Kolodny. Fragbag, an accurate representation of protein structure, retrieves structural neighbors from the entire pdb quickly and accurately. PNAS, 107(8):3481–3486, February 2010.

[14] Pablo Camara. Topological methods for genomics: present and future directions. Curr Opin Syst Biol, pages 95–101, 2017.

[15] Zixuan Cang, Lin Mu, and Guowei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLOS Computational Biology, 14, 08 2018.

[16] Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and Guo-Wei Wei. A topological approach for protein classification. In A topological approach for protein classification. MBMB, Nov 2015.

[17] Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. IJNMBE, 2017.

[18] Zixuan Cang and Guo-Wei Wei. Topologynet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Computational Biology, Oct 2017.

[19] Cardona A, Saalfeld S, Preibisch S, Schmid B, Cheng A, Pulokas J, et al. An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy. PLoS Biol, 8, 2010.

[20] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas Guibas. Persistence barcodes for shapes. In Proceedings of the 2004 Eurographics ACM SIGGRAPH Symposium on Geometry Processing, SGP ’04, pages 124–135. ACM, 2004.

[21] Mathieu Carriere, Steve Oudot, and Maks Ovsjanikov. Sliced Wasserstein Kernel for Persistence Diagrams. In ICML 2017 - Thirty-fourth International Conference on Machine Learning, pages 1–10, Sydney, Australia, August 2017.

[22] Mathieu Carrière, Steve Y. Oudot, and Maks Ovsjanikov. Stable topological signatures for points on 3d shapes. In Proceedings of the Eurographics Symposium on Geometry Processing, SGP ’15, pages 1–12, Aire-la-Ville, Switzerland, Switzerland, 2015. Eurographics Association. [23] Erin W. Chambers, Jeff Erickson, and Amir Nayyeri. Minimum cuts and shortest homologous cycles. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 377–385. ACM, 2009. [24] Chao Chen and Daniel Freedman. Measuring and computing natural generators for homology groups. Computational Geometry, 43(2):169–181, 2010. [25] Chao Chen and Daniel Freedman. Hardness results for homology localization. Discrete & Computational Geometry, 45(3):425–448, 2011. [26] Moo Chung, Peter Bubenik, and Peter Kim. Persistence diagrams of cortical surface data. In Information processing in medical imaging, pages 386–397. Springer, 2009. [27] Natalie Dawson, Tony E Lewis, Sayoni Das, Jonathan Lees, David Lee, Paul Ashford, Christine Orengo, and Ian Sillitoe. Cath: An expanded resource to predict protein function through structure and sequence. Nucleic acids research, 45, 11 2016. [28] Vin De Silva and Robert Ghrist. Coverage in sensor networks via persistent homology. Algebraic & Geometric Topology, 7(1):339–358, 2007. [29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, volume 115, pages 211–252, 2009. [30] T. Dey, F. Fan, and Y. Wang. Graph induced complex on point data. Computational Geometry, 200, 1995. [31] T. K. Dey, A. Hirani, and B. Krishnamoorthy. Optimal homologous cycles, total unimodularity, and linear programming. SIAM Journal on Computing, 40(4):1026– 1044, 2011. [32] Tamal Dey, Tao Hou, and Sayan Mandal. Computing minimal persistent cycles: Polynomial and hard cases. ACM-SIAM Symposium on Discrete Algorithms (SODA20), 07 2019. [33] Tamal K Dey, Fengtao Fan, and Yusu Wang.
Computing topological persistence for simplicial maps. In Proceedings of the thirtieth annual symposium on Computational geometry, page 345. ACM, 2014. [34] Tamal K. Dey, Fengtao Fan, and Yusu Wang. Computing topological persistence for simplicial maps. Symposium on Computational Geometry, pages 345–354, june 2014. [35] Tamal K. Dey, Anil N. Hirani, and Bala Krishnamoorthy. Optimal homologous cycles, total unimodularity, and linear programming. SIAM Journal on Computing, 40(4):1026– 1044, 2011.

[36] Tamal K. Dey, Tao Hou, and Sayan Mandal. Persistent 1-cycles: Definition, computation, and its application. In Computational Topology in Image Context, pages 123–136, Cham, 2019. Springer International Publishing. [37] Tamal K. Dey and Sayan Mandal. Protein classification with improved topological data analysis. In WABI, 2018. [38] Tamal K. Dey, Dayu Shi, and Yusu Wang. Simba: An efficient tool for approximating rips-filtration persistence via simplicial batch-collapse. In ESA, volume 57, 2016. [39] Tamal K Dey, Jian Sun, and Yusu Wang. Approximating loops in a shortest homology basis from point data. In Proceedings of the twenty-sixth annual symposium on Computational geometry, pages 166–175. ACM, 2010. [40] Tamal Krishna Dey, Fengtao Fan, and Yusu Wang. Graph induced complex on point data. In Proceedings of the Twenty-ninth Annual Symposium on Computational Geometry, SoCG ’13, pages 107–116, New York, NY, USA, 2013. ACM. [41] A. Dirafzoon and E. Lobaton. Topological mapping of unknown environments using an unlocalized robotic swarm. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5545–5551, Nov 2013. [42] Ali Nabi Duman and Harun Pirim. Gene coexpression network comparison via persistent homology. International Journal of Genomics, 11, 2018. [43] Herbert Edelsbrunner. Weighted alpha shapes. Technical report, IL, USA, 1992. [44] Herbert Edelsbrunner and John Harer. Computational topology: an introduction. American Mathematical Soc., 2010. [45] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28(4):511–533, nov 2002. [46] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998. [47] Kevin Emmett, Daniel Rosenbloom, Pablo Camara, and Raul Rabadan.
Parametric inference using persistence diagrams: A case study in population genetics. arXiv, 2014. [48] Kevin Emmett, Benjamin Schweinhart, and Raul Rabadan. Multiscale topology of chromatin folding. In Proceedings of the 9th EAI international conference on bio-inspired information and communications technologies (formerly BIONETICS), pages 177–180. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016. [49] Jeff Erickson and Kim Whittlesey. Greedy optimal homotopy and homology generators. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1038–1046. Society for Industrial and Applied Mathematics, 2005.

[50] Emerson G Escolar and Yasuaki Hiraoka. Optimal cycles for persistent homology via linear programming. In Optimization in the Real World, pages 79–96. Springer, 2016.

[51] Zoltán Gáspári, Kristian Vlahovicek, and Sándor Pongor. Efficient recognition of folds in protein 3d structures by the improved pride algorithm. Bioinformatics, 21(15), 2005.

[52] R. Ghrist and A. Muhammad. Coverage and hole-detection in sensor networks via homology. In IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005., pages 254–260, April 2005.

[53] D Goldfarb. An application of topological data analysis to hockey analytics. arXiv preprint, 21(15), 2014.

[54] Greg Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. Technical report: Caltech Authors, 2006.

[55] William Harvey, In-Hee Park, Oliver Rübel, Valerio Pascucci, Peer-Timo Bremer, Chenglong Li, and Yusu Wang. A collaborative visual analytics suite for protein folding research. Journal of Molecular Graphics and Modelling, 53:59 – 71, 2014.

[56] Yasuaki Hiraoka, Takenobu Nakamura, Akihiko Hirata, Emerson G. Escolar, Kaname Matsue, and Yasumasa Nishiura. Hierarchical structures of amorphous solids characterized by persistent homology. Proceedings of the National Academy of Sciences, 113(26):7035–7040, 2016.

[57] A. Hoover and M. Goldbaum. Locating the optic nerve in a retinal image using the fuzzy convergence of the blood vessels. IEEE Transactions on Medical Imaging, 22(8):951–958, Aug 2003.

[58] Kyu-Baek Hwang, Dong-Yeon Cho, Sang-Wook Park, Sung-Dong Kim, and Byoung-Tak Zhang. Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis, pages 167–182. Springer US, Boston, MA, 2002.

[59] J. Liang, H. Edelsbrunner, P. Fu, P. V. Sudhakar, and S. Subramaniam. Analytical shape computation of macromolecules: II. Molecular area and volume through alpha shape. In Proteins, volume 33, pages 18–29, 1998.

[60] Daniel P. Kiehart, Catherine G. Galbraith, Kevin A. Edwards, Wayne L. Rickoll, and Ruth A. Montague. Multiple forces contribute to cell sheet morphogenesis for dorsal closure in drosophila. The Journal of Cell Biology, 149(2):471–490, 2000.

[61] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 202–207. AAAI Press, 1996.

[62] Rachel Kolodny, Patrice Koehl, Leonidas Guibas, and Michael Levitt. Small libraries of protein fragments model native protein structures accurately. JMB, 323, 2002.

[63] Yunchuan Kong and Tianwei Yu. A deep neural network model using random forest to extract feature representation for gene expression data classification. Scientific Reports, 8(1):16477, 2018.

[64] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[65] Vitaliy Kurlin. A fast persistence-based segmentation of noisy 2D clouds with provable guarantees. Pattern Recognition Letters, 83:3–12, 2015.

[66] Genki Kusano, Kenji Fukumizu, and Yasuaki Hiraoka. Persistence weighted gaussian kernel for topological data analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2004– 2013. JMLR.org, 2016.

[67] Roland Kwitt, Stefan Huber, Marc Niethammer, Weili Lin, and Ulrich Bauer. Statistical topological data analysis - a kernel perspective. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3070–3078. Curran Associates, Inc., 2015.

[68] Tam Le and Makoto Yamada. Persistence fisher kernel: A riemannian manifold kernel for persistence diagrams. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10007–10018. Curran Associates, Inc., 2018.

[69] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[70] Javier Lamar Leon, Andrea Cerri, Edel Garcia Reyes, and Rocio Gonzalez Diaz. Gait-based gender classification using persistent homology. In José Ruiz-Shulcloper and Gabriella Sanniti di Baja, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 366–373, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[71] Max Z. Li, Megan S. Ryerson, and Hamsa Balakrishnan. Topological data analysis for aviation applications. Transportation Research Part E: Logistics and Transportation Review, 128:149 – 174, 2019.

[72] Shengren Li, Lance Simons, Jagadeesh Bhaskar Pakaravoor, Fatemeh Abbasinejad, John D. Owens, and Nina Amenta. kANN on the GPU with Shifted Sorting. Euro- graphics Association, pages 39–47, 2012.

[73] Liisa Holm and Päivi Rosenström. Dali server: conservation mapping in 3d. Nucleic Acids Research, 38:W545–W549, 2010.

[74] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157 vol.2, 1999. [75] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, April 2009. [76] Sayan Mandal, Aldo Guzmán-Sáenz, Niina Haiminen, Saugata Basu, and Laxmi Parida. A topological data analysis approach on predicting phenotypes from gene expression data. Springer Lecture Notes in Computer Science/LNBI, September 2020. [77] Huaiyu Mi, Anushya Muruganujan, Dustin Ebert, Xiaosong Huang, and Paul D Thomas. PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools. Nucleic Acids Research, 47(D1):D419–D426, 11 2018. [78] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. 7th European Conference on Computer Vision (ECCV ’02), 7, May 2002. [79] G. M. Morton. A computer oriented geodetic data base; and a new technique in file sequencing. International Business Machines Co., 1966. [80] J. R. Munkres. Elements of Algebraic Topology, chapter 1. Perseus, Cambridge, Massachusetts, 1 edition, 1984. [81] Takenobu Nakamura, Yasuaki Hiraoka, Akihiko Hirata, Emerson G Escolar, and Yasumasa Nishiura. Persistent homology and many-body atomic structure for medium-range order in the glass. IOP Science Nanotechnology, 26(30), 2015. [82] USA National Institutes of Health. National center for biotechnology information. https://www.ncbi.nlm.nih.gov. [83] USA National Institutes of Health. National center for biotechnology information. https://www.ncbi.nlm.nih.gov//gene. [84] Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011. [85] Ippei Obayashi.
Volume optimal cycle: Tightest representative cycle of a generator on persistent homology. arXiv preprint arXiv:1712.05103, 2017. [86] Jelili Oyelade, Itunuoluwa Isewon, Funke Oladipupo, Olufemi Aromolaran, Efosa Uwoghiren, Faridah Ameh, Moses Achas, and Ezekiel Adebiyi. Clustering algo- rithms: Their application to gene expression data. Bioinformatics and Biology Insights, 10:BBI.S38316, 2016. [87] Laxmi Parida, Filippo Utro, Deniz Yorukoglu, Anna Paola Carrieri, David Kuhn, and Saugata Basu. Topological signatures for population admixture. In Teresa M. Przytycka, editor, Research in Computational Molecular Biology, pages 261–275, Cham, 2015. Springer International Publishing.

[88] Jose A. Perea and John Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, 15(3):799–838, Jun 2015.

[89] Ronald Carl Petersen, PS Aisen, Laurel A Beckett, MC Donohue, AC Gamst, Danielle J Harvey, CR Jack, WJ Jagust, LM Shaw, AW Toga, et al. Alzheimer’s disease neu- roimaging initiative (ADNI): clinical characterization. Neurology, 74(3):201–209, 2010.

[90] Robert Petryszak, Maria Keays, Y. Amy Tang, Nuno A. Fonseca, Elisabet Barrera, Tony Burdett, Anja Füllgrabe, Alfonso Muñoz-Pomer Fuentes, Simon Jupp, Satu Koskinen, Oliver Mannion, Laura Huerta, Karine Megy, Catherine Snow, Eleanor Williams, Mitra Barzine, Emma Hastings, Hendrik Weisser, James Wright, Pankaj Jaiswal, Wolfgang Huber, Jyoti Choudhary, Helen E. Parkinson, and Alvis Brazma. Expression Atlas update—an integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Research, 44(D1):D746–D752, 10 2015.

[91] Pierre Villars (Chief Editor). PAULING FILE in: Inorganic Solid Phases. SpringerMaterials (online database), Springer, Heidelberg (ed.).

[92] Jeremy A. Pike, Abdullah O. Khan, Chiara Pallini, Steven G. Thomas, Markus Mund, Jonas Ries, Natalie S. Poulter, and Iain B. Styles. Topological data analysis quantifies biological nano-structure from single molecule localization microscopy. bioRxiv, 2018.

[93] Mehdi Pirooznia, Jack Y. Yang, Mary Qu Yang, and Youping Deng. A comparative study of different machine learning methods on microarray gene expression data. BMC genomics, 9 Suppl 1(Suppl 1):S13–S13, 2008.

[94] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multi-scale kernel for topological machine learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4741–4748, June 2015.

[95] M. Remmert, A. Biegert, A. Hauser, and J. Söding. Hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment. Nature Methods, 9, Dec 2011.

[96] Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision,Springer, pages 222–245, May 2013.

[97] Natalie Sauerwald, Yihang Shen, and Carl Kingsford. Topological data analysis reveals principles of chromosome structure throughout cellular differentiation. bioRxiv, 2019.

[98] Aleksandar Savic, Gergely Toth, and Ludovic Duponchel. Topological data analysis (tda) applied to reveal pedogenetic principles of european topsoil system. Science of The Total Environment, 586:1091 – 1100, 2017.

[99] James P R Schofield, Fabio Strazzeri, Jeannette Bigler, Michael Boedigheimer, Ian M Adcock, Kian Fan Chung, Aruna Bansal, Richard Knowles, Sven-Erik Dahlen, Craig E.

Wheelock, Kai Sun, Ioannis Pandis, John Riley, Charles Auffray, Bertrand De Meulder, Diane Lefaudeux, Devi Ramanan, Ana R Sousa, Peter J Sterk, Rob. M Ewing, Ben D Macarthur, Ratko Djukanovic, Ruben Sanchez-Garcia, Paul J Skipp, et al. A topological data analysis network model of asthma based on blood gene expression profiles. bioRxiv, 2019.

[100] Lars Seemann, Jason Shulman, and Gemunu H. Gunaratne. A robust topology-based algorithm for gene expression profiling, 2012.

[101] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

[102] Donald R. Sheehy. Linear-size approximations to the Vietoris–Rips filtration. Discrete & Computational Geometry, 49(4):778–796, 2013.

[103] Ian Sillitoe, Tony E. Lewis, et al. CATH: Comprehensive structural and functional annotations for genome sequences. Nucleic Acids Research, 43, 01 2015.

[104] Nikhil Singh, Heather D. Couture, J. S. Marron, Charles Perou, and Marc Niethammer. Topological descriptors of histology images. In Guorong Wu, Daoqiang Zhang, and Luping Zhou, editors, Machine Learning in Medical Imaging, pages 231–239, Cham, 2014. Springer International Publishing.

[105] Paolo Sonego, Mircea Pacurar, Somdutta Dhir, Attila Kertész-Farkas, András Kocsor, Zoltán Gáspári, Jack A. M. Leunissen, and Sándor Pongor. A protein classification benchmark collection for machine learning. Nucleic Acids Research, 35:D232–D236, 02 2007.

[106] Richard S. Sutton and Andrew Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 1998.

[107] Sara Tarek, Reda Abd Elwahab, and Mahmoud Shoman. Gene expression based cancer classification. Egyptian Informatics Journal, 18(3):151–159, 2017.

[108] The GUDHI Project. GUDHI User and Reference Manual. GUDHI Editorial Board, 2015.

[109] Q. H. Tran and Y. Hasegawa. Topological time-series analysis with delay-variant embedding. Physical Review E, 99(3):032209, March 2019.

[110] Katharine Turner, Sayan Mukherjee, and Doug M Boyer. Persistent homology transform for modeling shapes and surfaces. Information and Inference, page iau011, 2014.

[111] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.

[112] David G. P. van IJzendoorn, Karoly Szuhai, Inge H. Briaire-de Bruijn, Marie Kostine, Marieke L. Kuijjer, and Judith V. M. G. Bovée. Machine learning analysis of gene expression data reveals novel diagnostic and prognostic biomarkers and identifies therapeutic targets for soft tissue sarcomas. PLoS Computational Biology, 15(2):e1006826, 2019.

[113] Pengxiang Wu, Chao Chen, Yusu Wang, Shaoting Zhang, Changhe Yuan, Zhen Qian, Dimitris Metaxas, and Leon Axel. Optimal topological cycles and their application in cardiac trabeculae restoration. In International Conference on Information Processing in Medical Imaging, pages 80–92. Springer, 2017.

[114] Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure, flexibility and folding. International Journal for Numerical Methods in Biomedical Engineering, 30(8):814–844, 2014.