Applications of Persistent Homology and Cycles
Dissertation
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Sayan Mandal, B.Tech., M.E.
Graduate Program in Department of Computer Science and Engineering
The Ohio State University
2020
Dissertation Committee:
Dr. Tamal Dey, Advisor
Dr. Yusu Wang
Dr. Raphael Wenger

© Copyright by
Sayan Mandal
2020

Abstract
The growing need to understand and process data has driven innovation in many disparate
areas of data science. The computational biology, graphics, and machine learning communities,
among others, are striving to develop robust and efficient methods for such analysis. In this work, we demonstrate the utility of topological data analysis (TDA), a new and powerful
tool to understand the shape and structure of data, to these diverse areas.
First, we develop a new way to use persistent homology, a core tool in topological data
analysis, to extract machine learning features for image classification. Our work focuses on
improving modern image classification techniques by considering topological features. We
show that incorporating this information to supervised learning models allows our models to
improve classification, thus providing evidence that topological signatures can be leveraged
for enhancing some of the pioneering applications in computer vision.
Next, we propose a topology based, fast, scalable, and parameter-free technique to explore
a related problem in protein analysis and classification. On an initial simplicial complex
built using constituent protein atoms and bonds, simplicial collapse is used to construct a
filtration which we use to compute persistent homology. This is ultimately our signature for
the protein-molecules. Our method, besides being scalable, shows sizable time and memory
improvements compared to similar topology-based approaches. We use the signature to train
a protein domain classifier and compare state-of-the-art structure-based protein signatures to
achieve a substantial improvement in accuracy.
Besides considering the intervals of persistent homology, as in our first two applications,
some applications need to find representative cycles for them. These cycles, especially the
minimal ones, are useful geometric features functioning as augmentations for the intervals
in a purely topological barcode. We address the problem of computing these representative
cycles, termed persistent d-cycles. Since generating an optimal persistent 1-cycle is NP-hard, we propose an alternative set of meaningful persistent 1-cycles that is computable using an efficient polynomial time algorithm. Next, we address the same problem for general dimensions. We illustrate the use of an algorithm to generate d-cycles for finite intervals on a weak (d + 1)-pseudomanifold. We design two specialised software tools to compute persistent
1-cycles and d-cycles respectively. Experiments on 3D point clouds, mineral structures,
images, and medical data show the effectiveness of our algorithms in practice.
We further investigate the use of these representative persistent cycles in the field of bio-science and technology. Our concluding work tries to understand gene-expression levels for various organisms that are either infected or under the effect of antigens. We use persistent cycles to curate both the cohort list and the gene expression levels so as to obtain a “crux” of better representatives. This, in turn, improves both deep and shallow learning classifications. We further show that the n-cycles have an unsupervised inclination towards phenotype labels. The penultimate chapter of this thesis provides evidence that topological signatures are able to comprehend gene expression levels and classify cohorts on that basis.
To Mammam and Babai
Acknowledgments
If I have seen further, it is by standing on the shoulders of giants.
Sir Isaac Newton
I would like to thank my Ph.D. supervisor Dr. Tamal K. Dey for his guidance and support over the past few years. His wisdom and mentoring have been a continuous source of encouragement to me. He has been a source of inspiration not just in my domain of research but also in analytical and independent thinking, resource management, and many other quintessential areas of being a productive individual. Above all, he has always prioritised taking care of self-enrichment over academic progress. I honestly thank him for being a true mentor and steadfast supporter.
This thesis was enriched significantly through helpful discussions with my predecessors Dr. Dayu Shi, Dr. Alfred Rossi, and Dr. Mickael Buchet. They had a significant impact on my research, and on my understanding of topological data analysis in general, by answering all my queries, no matter how mundane. Dayu helped me comprehend and maintain the open source code repositories of our team, including Simpers, SimBa, and ShortLoop. His input extended beyond these works and helped me a lot in building later software: Persloop and Pers2cyc-fin.
The members of the TGDA group have contributed immensely to my personal and
professional time at Ohio State. The group has been a source of friendships as well as good
advice and collaboration. Among them, Tianqi Li and Ryan Slechta need special mention.
They have been with me through the tough times in grad school, especially when research was at a stalemate. Ryan, in particular, has been instrumental in refining my research ideas and
reports.
The enjoyment of learning increases manifold when we share our thoughts and work together in a constructive way. Any productive work, research included, is a collaborative effort, and I have been lucky to have worked with William Varcho, Tao Hao, and Soham Mukherjee as my co-authors. I have learned much from them and gained valuable insight from each.
I had the opportunity to take several courses during my time as a graduate student at The Ohio State University. I would like to thank all the faculty members who have helped me augment my knowledge base. Special mention to Dr. Yusu Wang, Dr. Ten-Hwang Lai, Dr. Tamal Dey, Dr. Hanwei Shen, and Dr. Jim Davis, whose lectures have been truly enjoyable and inspired me to improve as an educator as well. The course materials they covered included all the state-of-the-art topics and have been directly influential in my research as well.
Research meetings have been an age-old medium to exchange ideas and insights. In fact, 2500 years ago the gymnasiums of Athens were a hotbed for discussions in mathematics, literature, and philosophy, frequented by Plato, Socrates, Alcibiades, and others. In this era of digital connectivity, we have serious discussions as to whether traditional classrooms or meet-and-greets are still valid in academia. I have always been a proponent of these traditional meet-ups and strongly believe that real-time interaction with scientists helps spur a plethora of research insights and ideas. I would therefore take this opportunity to thank the organisers and committee members of VMV 2017, WABI 2018, and CTIC 2019, where I have met many stalwarts in our field of research and learned a lot. Helpful discussions with faculty and peers have given me a lot of insight and inspiration for future research methodologies and ideas.
I would also take this opportunity to thank my committee members, Dr. Yusu Wang and Dr. Raphael Wenger, for their feedback throughout my graduate career. From my candidacy exam to my thesis committee meetings, their constructive feedback has always been helpful.
Finally, I would like to thank the National Science Foundation for supporting the research work presented here.
Vita
June, 1990 — Born, Kolkata, India
1993 — School: St. Stephens’ School, Kolkata, India
2008 — B.Tech, Computer Sc. and Technology, WBUT, India
2012 — M.E., Computer Sc. and Engineering, IIEST Shibpur, India
2014 — Senior Research Fellow, Computer Science and Engineering, IIT Kharagpur, India
2015 — University Fellow, Computer Sc. and Engineering, The Ohio State University, USA
2016 — Graduate Teaching/Research Assistant, Computer Sc. and Engg., The Ohio State University, USA
2019 — Graduate Research Intern, Health Care and Life Science, TJ Watson Labs IBM, Yorktown Heights, USA
2019–present — Graduate Research Assistant, Computer Sc. and Engg., The Ohio State University, USA
Publications
Research Publications
T. Dey, S. Mandal, S. Mukherjee, “Gene expression data classification using topology and machine learning models”. arXiv, May 2020.

S. Mandal, A. Guzman-Saenz, N. Haiminen, L. Parida, S. Basu, “A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data”. AlCoB 2020: International Conference on Algorithms for Computational Biology, LNCS/LNBI, Springer, April 2020.

T. Dey, T. Hao, S. Mandal, “Computing Minimal Persistent Cycles: Polynomial and Hard Cases”. Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 10.5555/3381089.3381247, 2587–2606, Jan 2020.

T. Dey, T. Hao, S. Mandal, “Persistent 1-Cycles: Definition, Computation, and Its Application”. Computational Topology in Image Context (CTIC 2019), Lecture Notes in Computer Science, 10.1007/978-3-030-10828-1_10, 123–136, Dec 2018.

T. Dey, S. Mandal, “Protein Classification with Improved Topological Data Analysis”. 18th International Workshop on Algorithms in Bioinformatics (WABI), 10.4230/LIPIcs.WABI.2018.6, 1–13, Aug 2018.

T. Dey, S. Mandal, W. Varcho, “Improved Image Classification using Topological Persistence”. Vision, Modeling and Visualization, The Eurographics Association, 10.2312/vmv.20171272, Sep 2017.
Fields of Study
Major Field: Department of Computer Science and Engineering
Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Topological persistence of point cloud data
   1.2 Topological persistence via Successive Collapses
       1.2.1 Subsampling
   1.3 Representative cycles for Homology classes
   1.4 Topologically Relevant Gene Expression and Cohort Analysis
       1.4.1 Contour of the thesis

2. Some Applications of Persistent Homology
   2.1 General applications of Persistent Homology
   2.2 Work on Computer Vision & Graphics using Persistent Homology
   2.3 Work on Bio Science using Persistent Homology
   2.4 Work on Machine Learning and Persistent Homology
   2.5 Work on representative cycles for homology group

3. Image Classification
   3.1 Feature Vector Generation
   3.2 Algorithm for fast computation of topological signature
       3.2.1 Choosing Parameters
   3.3 Results
       3.3.1 Feature Vector based Supervised Learning
       3.3.2 CNN and Deep Learning based Training

4. Protein Classification
   4.1 Topological persistence
   4.2 Collapse-induced persistent homology from point clouds
       4.2.1 Feature vector generation
   4.3 Experiments and results
       4.3.1 Topological description of alpha helix and beta sheet
       4.3.2 Topological Description of macromolecular structures
       4.3.3 Supervised learning based classification models

5. Representative Persistent cycles
   5.1 Persistent 1-cycles
   5.2 Definitions: Persistent Basis and Cycles
   5.3 Minimal Persistent q-Basis and Their Limitations
   5.4 Computing Meaningful Persistent 1-Cycles in Polynomial Time
   5.5 Results and Experiments
       5.5.1 Persistent 1-Cycles for 3D Point Clouds
       5.5.2 Image Segmentation and Characterization Using Cubical Complex
       5.5.3 Hexagonal Structure of Crystalline Solids
   5.6 Finite Persistent n-cycle
   5.7 Minimal persistent d-cycles of finite intervals for weak (d + 1)-pseudomanifolds
   5.8 Experimental results

6. Topologically Relevant Cohort and Gene Expression
   6.1 Introduction
   6.2 Idea
   6.3 Methods
       6.3.1 Data
       6.3.2 Topo-Curated Cohort
       6.3.3 Topo-Relevant Gene Expression
       6.3.4 Neural Network Architecture

7. Contributions and Future Work

Bibliography
List of Tables

1.1 Compute 1-dimensional homology: Time comparison with SimBa.

3.1 Precision (P), Recall (R) for different methods with and without using topological features. (P-T and R-T) indicates Precision and Recall without topology whereas (P+T and R+T) indicates Precision and Recall with topology. Note that except for Cal-256, which has 20 classes, all the other datasets have 10 classes.

3.2 Qualitative comparison of our algorithm with SimBa with and without 2-dimensional homological features. Acc: Accuracy, Pr: Precision, Re: Recall. I - 5 classes of CalTech256, II - CIFAR-10.

4.1 Time comparison against SimBa and VR.

4.2 Accuracy comparison with FragBag and Cang.

4.3 Classification accuracy for different techniques on the protein dataset. SC: SCOP95, CA: CATH95, Sf: Superfamily, Fm: Family, F: Filtered, T: Topology, H: Homology, C: Class, 5f: 5fold, A: Architecture, Si: Similarity.

6.1 Classification using topo-relevant cohort. Each of the datasets is explained in Section 6.3.1. The # symbol indicates the size of each dataset. ‘-’ in the table means the stats were too low: the relevant classifier was unable to classify the given data.

6.2 Neural network results. The column TP (Z) indicates the results on the reduced gene set using topology. Full indicates results on the full gene set. Tr-Loss, Tr-Acc, Tr-F1, Tr-Prec, Tr-Rec are loss, accuracy, F1-score, precision, and recall on the training data, whereas the prefix Ts- indicates the same on the test set.
List of Figures

1.1 Persistence of a point cloud in R2 and its corresponding barcode.

1.2 A visual example of the sequence generated by subsampling a point set via the Morton Ordering.

1.3 Visualization of a Morton Ordering of point cloud data. Points with similar hue are close in the total ordering.

3.1 Top: The topological features act as inputs to the fully connected part of our modified convolutional neural network. Bottom: Using modified Fisher Vector on SIFT along with topological features for training.

3.2 One-dimensional homology for an image in Caltech-256. The bars in blue are above the widest gap and chosen as feature vector.

3.3 t-SNE was used as an aid in picking parameters for the computation of barcodes. (a) Good clustering on MNIST. (b) Bad clustering on CIFAR-10. (c) Better clustering on CIFAR-10.

3.4 Comparison of accuracy and precision with and without topological features. (a) Performance versus training size. (b) Fluctuations in precision across different classes.

4.1 Workflow.

4.2 Weighted alpha complex for protein structure.

4.3 Top: Collapse of weighted alpha complex generated from protein structure via simplicial map. Bottom: Same algorithm applied to a kitten model in R3.

4.4 (a) Top: Alpha helix from PDB 1C26, Middle: Barcode of [114], Right: Our barcode. (b) Left: Beta sheet from PDB 2JOX, Middle: Barcode of [114], Bottom: Our barcode. Each segment of the barcodes shows β0 (top) and β1 (bottom).

4.5 Barcode and ribbon diagram of (Left): PDB 1c26. (Right): PDB 1o9a. Diagram courtesy NCBI [82].

4.6 Heatmap correlating secondary structure against our feature vector. Each column in the heatmap is the feature vector.

4.7 Plot showing accuracy against varying training data size. 100(%) indicates the entire training and test data.

4.8 Left: a) Difference in precision and recall from FragBag. Middle: b) Difference in precision and recall from [16]. Right: c) ROC curve for SVM classification of our algorithm.

5.1 (a) Point cloud of Botijo model. (b,c) Barcode and persistent 1-cycles for Botijo, where the 3 longest bars (dark blue, light blue, and green) have their corresponding persistent 1-cycles drawn with the same colors. (d,e) Barcode and persistent 1-cycles for the retinal image, with each green cycle corresponding to a red bar.

5.2 PersLoop user interface demonstrating persistent 1-cycles computed for a 3D point cloud (a) and a 2D image (b), where green cycles correspond to the chosen bars.

5.3 Persistent 1-cycles (green) corresponding to long intervals computed for three different point clouds.

5.4 Persistent 1-cycles computed for image segmentation. Green cycles indicate persistent 1-cycles consisting of only one component (|G| = 1) and red cycles indicate those consisting of multiple components (|G| > 1). (a,b) Persistent 1-cycles for the top 20 and 350 longest intervals on the nerve dataset. (c) Persistent 1-cycles for the top 200 longest intervals on the Drosophila larvae dataset.

5.5 (a) Hexagonal cyclic structure of silicate glass. (b) Persistent 1-cycles computed for the green bars with red points denoting silicate atoms and grey points denoting oxygen atoms. (c) Persistent 1-cycles computed for the red bars. (d) Barcode for the filtration on silicate glass.

5.6 A weak 2-pseudomanifold K embedded in R2 with three voids. Its dual graph is drawn in blue. The complex has one 1-connected component and four 2-connected components, with the 2-simplices in different 2-connected components colored differently.

5.7 An example of the constructions in our algorithm showing the duality between persistent cycles and cuts having finite capacity for d = 1. (a) The input weak 2-pseudomanifold K with its dual flow network drawn in blue, where the central hollow vertex denotes the dummy vertex, the red vertex denotes the source, and all the orange vertices (including the dummy one) denote the sinks. All “dangled” graph edges dual to the outer boundary 1-simplices actually connect to the dummy vertex and these connections are not drawn. (b) The partial complex K_β^F in the input filtration F, where the bold green 1-simplex denotes σ_β^F which creates the green 1-cycle. (c) The partial complex K_δ^F in F, where the 2-simplex σ_δ^F creates the pink 2-chain killing the green 1-cycle. (d) The green persistent 1-cycle of the interval [β, δ) is dual to a cut (S, T) having finite capacity, where S contains all the vertices inside the pink 2-chain and T contains all the other vertices. The red graph edges denote those edges across (S, T) and their dual 1-chain is the green persistent 1-cycle.

5.8 (a,b) Cosmology dataset and the minimal persistent 2-cycles of the top five longest intervals. (c,d) Turbulent combustion dataset and its corresponding minimal persistent 2-cycles.

5.9 (a,b) Minimal persistent 2-cycles for the hurricane model. (c) Minimal persistent 2-cycles of the larger intervals for the human skull. i: Right and left cheek muscles with the right one rotated for better visibility. ii: Right and left eyes. iii: Jawbone. iv: Nose cartilage. v: Nerves in the parietal lobe.

5.10 (a) Cubic lattice structure of BaTiO3 (courtesy Springer Materials [91]) with diffused structure in backdrop. (b) Minimal persistent 2-cycles computed on the original function. (c) Minimal persistent 2-cycles computed on the negated function. (d) Minimal persistent 2-cycles computed on the negated function of a tetragonal lattice structure of BaTiO3. The inlaid picture [91] illustrates the bonds of the structure.

6.1 (a) Flowchart for topo-relevant gene expression extraction; refer to Section 6.3.3 for details. (b) Flowchart for topo-curated cohort extraction; refer to Section 6.3.2 for details. In both, bold lines show the path to take for training or testing large data. Dotted lines are used in Figure 6.2.

6.2 Flowchart of proposed pipeline. For the units Topo-relevant gene pipeline and Topo-curated cohort pipeline we follow the dotted lines in Fig. 6.1(a) and (b) respectively.

6.3 (a) Filtration F = K0, .., K9 explaining persistence pairing. (b) Different H1 cycles for the same homology class.

6.4 t-SNE on the entire cohort point cloud (D0). Red vertices indicate cohorts included in the top 100 H2 cycles whereas blue indicates otherwise.

6.5 Plot of geodesic centers for dominating cycles using t-SNE. Red vertices: non-dominating cycles. Graded green points: dominating cycles. Alpha values indicate the ratio of the dominating phenotype in each cycle versus the other labels.

6.6 Three figures plotting individual dominating cycles for gene dataset D0. These cycles actually reside in m dimensions and are projected down to 3D using principal component analysis. The colors indicate cohort phenotype labels X ∈ {0, 1, 2}.

6.7 (a) Count of vertex labels in individual H2-cycles for D0. The red points indicate cycles having phenotype labels 0 and 1, blue indicates cycles with labels 1 and 2, whereas green (very few in the top 500 H2 cycles) indicates labels 0 and 2. (b) Count of vertex labels in individual H2-cycles for D1. Red indicates cycles having equal phenotype-labelled vertices. Blue and cyan indicate prevalence of labels 0 and 1 respectively. In both diagrams, black points indicate cycles having a single phenotype label.

6.8 Three 2-cycles for gene dataset D6 with colors indicating gene function labels. The total number of colors indicates the κ value of the cycle. Note that a gene can be responsible for several functionalities. The legend in this plot takes into account only a single functionality which contributes towards the maximal cover of the representative cycle. (a,b) Low κ: 3. (c) High κ: 6.

6.9 Comparison of (a) accuracy, (b) F1-score, and (c) loss function for 50 epochs. For the TDA-curated data, red and yellow lines represent train and test scores respectively. For the full data, they are represented by the green and blue lines.
Chapter 1: Introduction
We are living in an era where a plethora of data exists that demands analysis with powerful tools from statistics and learning to solve complex real-world problems. These problems are extremely stochastic in nature, and formulating mathematical models to describe them is difficult. Among the contrivances harnessed to explain the structure of such data, topological data analysis has gained significant recognition in the past decade. The idea behind topological data analysis (TDA for short) finds its roots in algebraic topology, contrived by the French mathematician Henri Poincaré in his work Analysis Situs. With the advent of big data, TDA gained traction, leading to persistent homology [45], where the associated persistence diagrams serve as stable signatures to describe various domains in data science (Ref. Ch. 2). The ability of topological approaches to define structures in high-dimensional data, coupled with mathematical guarantees of robustness and isometric and scale invariance, is what makes it a propitious tool in contemporary data analysis.
In this work, we investigate the creation of potent methods and tools based on persistent homology to solve problems in engineering and science, namely computer vision (Ref. Ch. 3), computer graphics (Ref. Sec. 5.5.1), protein structure (Ref. Ch. 4), medical images and volumetric scans (Ref. Sec. 5.5.2, Sec. 5.8), gene expression levels (Ref. Ch. 6), and material science (Ref. Sec. 5.5.3, Sec. 5.8). In fact, we demonstrate direct applications of such signatures in image segmentation and characterisation (Ref. Sec. 5.5.2). In addition, they are harnessed to render supplementary signatures on top of shallow and deep learning based machine architectures (Ref. Sec. 4.2.1, 3.1, 6.3.1).
Besides considering the multiset of intervals included in a persistence diagram, some applications need to find representative cycles for the intervals. In this work, we also address the problem of computing these representative cycles, termed persistent d-cycles. Since it has been shown that computing optimal persistent d-cycles is NP-hard, both for the 1-dimensional case and higher [32, 36], we propose an alternative set of meaningful persistent 1-cycles that can be computed with an efficient polynomial time algorithm. For d ≥ 2, we show the applications of another polynomial time algorithm on finite intervals. We design software which applies our algorithms to various datasets. Experiments on 3D point clouds, mineral structures, and images show the effectiveness of our algorithms in practice. Chapter 5 deals with these d-dimensional homology cycles.
We further investigate the use of these representative cycles in bio-science and technology. Chapter 6 deals with how gene expressions and cohorts embedded on these representative cycles in high dimensions provide relevant information towards their phenotypes.
A main difficulty in using topological signatures for learning and classification tasks is that the current state-of-the-art techniques for computing such signatures for large sets of data do not scale up appropriately. Attempting to compute the persistent homology of points in high-dimensional spaces is a roadblock in applying topological data analysis techniques to classification tasks. To work around this bottleneck, we take a different approach to computing a topological signature. In the next section we discuss the traditional method to build topological signatures as well as propose a lighter and faster technique to do the same.
Figure 1.1: Persistence of a point cloud in R2 and its corresponding barcode
1.1 Topological persistence of point cloud data
We can start with a point cloud data in any n-dimensional Euclidean space. However,
to illustrate the theory of persistent homology, we consider a toy example of taking a set
of points in two dimensions sampled uniformly from a two-hole structure (Fig. 1.1a). We
start growing balls around each point, increasing their radius r continually and tracking the
behavior of the union of these growing balls. If we start increasing r from zero, we notice that
at some r = r1 (Fig. 1.1b) both holes are prominent in the union-of-balls structure. Further increasing r to r2 leads to the filling of the smaller hole (Fig. 1.1d). This continues till the value
of r is large enough for the union of balls to fill the entire structure. During the change in the
structure of the union of balls due to increase in radius, the larger of the two holes ‘persists’
for a larger range of r compared to the smaller one. Hence features that are more prominent
are expected to persist for longer periods of increasing r. This is the basic intuition for
topological persistence. The connected components and holes in this example are captured by
calculating a set of birth-death pairs of homology cycle classes that indicate at which value
of r the class is born and where it dies. The persistence is visualized in R2 using horizontal line segments that connect two points whose x-coordinates coincide with the birth and death values of the homology classes. This collection of line segments, as shown in Figure 1.1e, is called a barcode [20]. The length of each line segment corresponds to the persistence of
either a connected component (H0) or a cycle (H1) in the structure. Hence, the short blue line segments in H0 correspond to the short-lived components that are formed intermittently as the radius increases. The two long blue line segments in H1 correspond to the two holes in the structure, the longest corresponding to the bigger hole. For computational purposes, the growing
sequence of the union of balls is converted to a growing sequence of triangulations (simplicial complexes in general), called a filtration. In some cases, some cycles, called ‘essential cycles’, persist till the end of the filtration. In our example, the red line in H0 is one such instance of an essential cycle, as the final connected component persists till the end.
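The intuition above can be made concrete for 0-dimensional persistence, which is computable exactly with a union-find structure: sorting pairwise distances and merging components is Kruskal's algorithm, and the merge distances are the death times of the H0 bars. The following is a minimal illustrative sketch, not the software used in this thesis; it assumes a Rips-style convention in which an edge enters the filtration at the distance between its endpoints.

```python
from itertools import combinations
from math import dist

def h0_barcode(points):
    """0-dimensional persistence of a point cloud (Rips-style convention).

    Every point is born at radius 0; a component dies at the edge length
    that merges it into another component.  These merge events are exactly
    the edges of a minimum spanning tree (Kruskal's algorithm), and one
    component never dies: the essential class, reported with death inf.
    """
    parent = list(range(len(points)))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # two components merge at distance d
            parent[ri] = rj
            bars.append((0.0, d))     # a component born at 0 dies at d
    bars.append((0.0, float("inf")))  # the essential component
    return bars
```

On two well-separated pairs of points, e.g. `[(0, 0), (0, 1), (5, 0), (5, 1)]`, this yields two short bars of length 1 and one long bar of length 5, mirroring the "prominent features persist longer" intuition from the text.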
The ranks of the persistent homology groups, called the persistent Betti numbers, capture the number of persistent features. For the d-dimensional homology group, we denote this number as βd. This means β0 counts the number of connected components that arise in the filtration. Similarly, β1 counts the number of circular holes being born as we proceed through the
filtration. With the above technique, difficulties arise as the radius r increases, since it leads to a steep increase in time complexity and a combinatorial explosion. In proteins, for example, an average protein in a database such as CATH [103] has 20,000∼30,000 atoms, thus creating a point cloud of the same size in R3. Furthermore, the initial complex including 3-simplices (or tetrahedra) becomes colossal. On average, this complex grows to (50 ∼ 100)x as many simplices of dimension up to 4, and becomes exorbitant and quite arduous to process. For an image of dimension 200x200 pixels, persistence computation takes about 9.6s per image. Now imagine having to do this for several thousand images, which is the typical size of a database. Building a filtration using this growing sequence of balls is thus not scalable.
Traditionally, topological persistence for a point cloud is computed on the Vietoris-Rips (VR) complex, where a simplex σ ∈ VRα(P) iff d(p, q) ≤ α for every pair of vertices p, q of σ.
4 As the value of α increases, the filtration follows a sequence of nested simplicial complexes
through which we track the persistent homology classes represented by cycles. For our case,
this essentially determines how long classes of different cycles ‘persist’ as α changes, thereby
generating a signature for each point cloud (which may in turn represent images or proteins or mineral structures, etc.).
The problem with filtrations built using the VR-complex, however, is that, since we connect every pair of points within distance α to form a higher dimensional simplex, there is a steep rise in size, all the more so for points living in a high dimension. Hence we use a new method to calculate the persistence of a point cloud to generate topological signatures for our classification problems.
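The size blow-up is easy to see by brute force: a VR simplex is exactly a clique in the α-neighborhood graph, so the complex is determined by pairwise distance checks alone. The sketch below is our own illustration (the function name and the max_dim cutoff are assumptions, not part of the thesis software) of enumerating VR_α(P) up to a fixed dimension.

```python
from itertools import combinations
from math import dist

def rips_simplices(points, alpha, max_dim=2):
    """Enumerate the simplices of VR_alpha(P) up to dimension max_dim.

    A subset of k points spans a (k-1)-simplex iff all pairwise distances
    are at most alpha, i.e. it is a clique in the alpha-neighborhood graph.
    The exhaustive clique check is what makes the complex size explode
    as alpha grows.
    """
    n = len(points)
    simplices = [(i,) for i in range(n)]           # vertices (0-simplices)
    for k in range(2, max_dim + 2):                # k vertices = (k-1)-simplex
        for combo in combinations(range(n), k):
            if all(dist(points[i], points[j]) <= alpha
                   for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices
```

For the four corners of a unit square, α = 1 admits only the four sides (8 simplices in total), while α = 1.5 already admits both diagonals and all four triangles (14 simplices); on larger, denser clouds the growth is far steeper.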
Figure 1.2: A visual example of the sequence generated by subsampling a point set via the Morton Ordering
1.2 Topological persistence via Successive Collapses
As a first step towards computing persistence faster, we start by sparsifying the point cloud by taking an ε-sparse, δ-net subsample of it:
Definition 1 ([33]). A finite set P ⊆ X is an ε-sample of a metric space (X, d) if for each point x ∈ X there is a point p ∈ P such that d(x, p) ≤ ε. Additionally, P is called δ-sparse if d(p1, p2) ≥ δ for any pair of points p1, p2 ∈ P.
Intuitively, this means that for each point in the initial cloud there is a point within distance ε in the subsample, and that no two points in the sparser cloud are closer than δ to each other (we use ε = δ for our experiments later). We then build a simplicial complex C on this sparse point cloud. The exact complex being built depends on the application: chiefly the graph induced complex [40] for our image classification and the weighted alpha complex [43] for describing protein structures. This initial complex is reduced in size by successive edge collapses that collapse vertices as well; see Figure 1.2. In effect, it generates a sequence of simplicial complexes where successive complexes do not nest, but are connected with simplicial maps (these are maps that extend vertex maps to simplices, see [80] for details).
We generate a sequence of subsets of vertices V^i using space filling curves (see the next section for details). This sequence of subsets V^i allows us to define a simplicial map between any two adjacent subsets V^i and V^{i+1} by the following map:

f^i(p) = \begin{cases} p & \text{if } p \in V^{i+1} \\ v \ \text{such that}\ d(p, v) = \inf_{u \in V^{i+1}} d(p, u) & \text{otherwise} \end{cases}

Essentially, each vertex in V^i is either chosen for the subsampling or mapped to its nearest neighbor in V^{i+1}.
This map on the vertices then induces a map on all higher-order simplices of C. More formally, these maps are collapses of the simplicial complex C.
\[
C^0 \xrightarrow{f^1} C^1 \xrightarrow{f^2} \cdots \xrightarrow{f^k} C^k
\]
Given a sequence of simplicial maps f^1, f^2, ..., f^k between an initial simplicial complex C^0 and a resulting simplicial complex C^k, the authors of [34] describe an annotation-based technique to generate the persistence of such a sequence. They use a set of elementary inclusions (not needed in our case) and collapses to break the simplicial maps down into their fundamental elements, from which they derive the persistence of the simplicial maps. For our purposes, we apply this annotation-based algorithm to the sequence of maps f^i described above.
An algorithm for computing persistence under simplicial maps is presented in [34], and the authors have released a software (Simpers) for it, which we use for our purpose. Persistence under simplicial maps gives us a set of intervals (barcodes), which we use later in our TDA toolkit for description and classification (Chapters 3 and 4).
In the next section we discuss the space filling curve used to generate the sequence of vertex subsets.
1.2.1 Subsampling
The topological signature via simplicial collapse defined in the previous section operates on a sequence of vertex subsamples. To choose a subsample quickly while respecting the local density, we use the Morton Ordering [72,79] of the point cloud. The Morton Ordering provides a total ordering on points in Z^d where points close in the ordering are usually close in space, thus respecting spatial density (see Figure 1.3). Our data is sparsified by removing every nth point from the current Morton ordering, and then repeating the process until fewer than n points remain, for a chosen n. Other algorithms could serve this purpose, such as k-means clustering with n clusters and choosing the center of each cluster for removal. However, the Morton Ordering is very fast since it is based on bit operations and is non-iterative, hence we incorporate it in our algorithm.
Now we discuss our method for generating a sequence of subsamples, given an initial simplicial complex C. To do this, we first create a total ordering on our point set V^0. This ordering is defined explicitly by the Morton mapping M : Z^N → Z such that
\[
M(p) = \bigvee_{b=0}^{B} \bigvee_{i=0}^{N} x_2^{i,b} \ll \big(N(b+1) - (i+1)\big)
\]
where x_2^{i,b} denotes the bth bit value of the binary representation of the ith component of p, '∨' denotes the binary or operation, and '≪' denotes the left binary shift operation. This mapping is merely a bit interleaving of the different components of p. Applying M to every p ∈ V^0 yields a total ordering on our initial point set.
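The bit interleaving can be sketched directly from this formula. The helper below (`morton_key`, a hypothetical name, assuming non-negative integer coordinates and a fixed bit width) places bit b of component i at position N(b+1) − (i+1):

```python
def morton_key(p, bits=8):
    """Bit-interleave the (non-negative integer) coordinates of p, following
    M(p): bit b of component i lands at position N*(b+1) - (i+1), N = len(p)."""
    n = len(p)
    key = 0
    for b in range(bits):
        for i, x in enumerate(p):
            key |= ((x >> b) & 1) << (n * (b + 1) - (i + 1))
    return key

pts = [(3, 5), (2, 2), (0, 1), (7, 7)]
ordered = sorted(pts, key=morton_key)
# interleaving keeps spatially nearby points close in the total order
```

Sorting the points by this key realizes the total ordering used for subsampling.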
We can exploit the fact that points with similar Morton encodings are found in similar areas to generate a new subset V^1 ⊂ V^0 that respects the underlying density of the initial point set.
First choose a value n such that 1 < n ≤ |V^0|. V^{i+1} can then be defined as
\[
V^{i+1} = \{\, x_j \mid x_j \in V^i,\; j \not\equiv 0 \ (\mathrm{mod}\ n) \,\}
\]
where x_j is the jth vertex in the Morton Ordering of V^i. Following this approach, the process can be repeated to create a sequence of subsets
\[
V^0 \supset V^1 \supset \dots \supset V^k, \qquad |V^k| \le n
\]
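The repeated deletion can be sketched as follows (a hypothetical illustration; any total order stands in for the Morton key):

```python
def subsample_sequence(points, n, key):
    """Sketch of the Morton subsampling loop: delete every point whose index j
    in the current ordering satisfies j = 0 (mod n), until <= n points remain."""
    seq = [sorted(points, key=key)]
    while len(seq[-1]) > n:
        seq.append([x for j, x in enumerate(seq[-1]) if j % n != 0])
    return seq

# Integers stand in for points already keyed by their Morton code.
seq = subsample_sequence(range(10), n=3, key=lambda x: x)
# each level drops indices 0, 3, 6, ... of the previous ordering
```

Each level removes roughly a 1/n fraction of the remaining points, so the sequence shrinks geometrically and terminates quickly.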
It should be noted that datasets often contain real-valued data rather than the integer values required by the Morton Ordering. To overcome this, we apply a basic scaling to the data as a preprocessing step, and then use the closest integer point in Z^d when determining the ordering.
Table 1.1: Computing 1-dimensional homology: time comparison with SimBa

Data          #points   Dim   SimBa (sec)   Our-Algo (sec)
Kitten        90120     3     35.72         19.05
PrCl-s        15000     18    94.13         28.17
PrCl-l        15000     25    254.37        47.12
Surv-s        252996    3     469.40        165.28
Surv-l        252996    150   1696.59       294.6
Caltech-256   10786     5     8.38          2.27
MNIST         2786      5     1.86          0.56
Finally, we compute the persistence of this sequence of collapse operations (simplicial maps)
connecting successive complexes using the software Simpers [34]. The detailed algorithm
varies with the specific application and is provided in Sections 3.2 and 4.2.
To illustrate the speed-up we gain by our collapse based persistence computation with
Morton Ordering (on graph induced complex), we report its running time on several datasets
ranging from 3D data as geometric meshes to high dimensional data embedded in dimension
as high as 150-dim (Table 1.1). The time taken by the algorithm when run on a random
sample image from the Caltech-256 and MNIST datasets has also been included. We compare
the speed of computation with SimBa [38]. Since the authors of [38] already showed that it
generates results faster than existing techniques, beating its speed indicates the superiority
of our approach in this context. For this comparison, we only compute persistence up to
the one-dimensional homology. While comparing our technique with SimBa, we retain the
default parameter values suggested in the software manual. We test the speed of our method
for computing topological signature on several datasets having dimensions much larger than
three in Table 1.1. The PrCl-s and PrCl-l are the datasets containing Primary Circles
formed from natural images [1]. In PrCl-s, each point corresponds to a 5 × 5 image patch whereas in PrCl-l, they are of size 7 × 7. We run our algorithm on the Adult data [61]
Figure 1.3: Visualization of a Morton Ordering of point cloud data. Points with similar hue are close in the total ordering.
obtained from the UCI Machine Learning Repository. This is a 14-dimensional classification dataset used for determining whether a person's annual income in the US exceeds 50K. We also experimented on the Survivin protein dataset [55]. This includes 252,996 protein conformations, each considered as a point in R^150. We also generate a scaled-down version of this dataset by reducing the dimension to R^3 using PCA and test on it.

As is evident from Table 1.1, our algorithm performs much faster, especially in high
dimensions. Since we avoid the simplex insertions of classical inclusion-based persistence computations, we achieve a significant speed-up. We have successfully applied this technique of generating topological signatures using simplicial collapse in several domains. Details of its deployment in image classification can be found in Chapter 3, while Chapter 4 describes its applications in protein characterisation and classification.
1.3 Representative cycles for Homology classes
So far we have used barcodes computed using persistent homology as features to describe
and learn data. Next, we look into applications which require finding the representative
cycles for persistent homology. Because the minimal persistent 1-cycles are not stable and
their computation is NP-hard [32], we propose an alternative set of meaningful persistent
1-cycles which can be computed efficiently in polynomial time. The persistent 1-cycle we
calculate for a finite interval is a sum of shortest cycles born at different indices. Since a
shortest cycle is usually a good representative of its class, the sum of shortest cycles ought
to be a good choice of representative for an interval. In many cases, this sum contains only
one component. The persistent 1-cycles computed for such intervals are guaranteed to be
optimal. In fact, we show experimentally that such optimal intervals occur quite frequently. In particular, for some image datasets, nearly all computed persistent 1-cycles contain only one
component and hence are minimal.
We also address a practical shortcoming of the previous algorithm: it computes 1-cycles even for tiny intervals, which are regarded as noise in most applications, whereas users are generally interested in representative cycles only for the significantly large intervals. We present an improved algorithm where we only compute the shortest cycles at the birth indices whose corresponding intervals contain the input interval [b, d). Since a user typically provides a long interval, the intervals containing it constitute a small subset of all the intervals. This makes our latest algorithm run much faster in practice.
Since it is NP-hard to compute minimal persistent cycles in 1-dimensional homology
groups, this naturally leads to the following questions: Are there other interesting cases
beyond 1-dimension for which minimal persistent cycles can be computed in polynomial time?
In a follow-up work, it is again shown that when d ≥ 2, computing minimal persistent d-cycles for finite intervals is NP-hard in general [32]. This work also describes computing minimal persistent d-cycles for finite intervals on a special but important class of simplicial complexes, called weak (d + 1)-pseudomanifolds. Our chapter on representative persistent cycles also explores applications of this algorithm in several domains.
We announce a software based on our algorithm to generate tight persistent 1-cycles on 3D
point clouds and 2D images. We experiment with various datasets commonly used in geometric
modeling, computer vision and material science, details of which are illustrated in Section 5. The software, named PersLoop, along with an introductory video and other supplementary materials, is available at the project website http://web.cse.ohio-state.edu/~dey.8/PersLoop. We present a follow-up software to compute minimal d-dimensional persistent
cycles for finite intervals. Experiments with representative 2-cycles on scientific data indicate
that the minimal persistent cycles capture various significant features of the data. In section
5.8, we show how these cycles extract information from medical data, scientific visualisation
data, as well as molecular configurations. The software is available online at https://github.
com/Sayan-m90/Minimum-Persistent-Cycles/tree/master/pers2cyc_fin.
1.4 Topologically Relevant Gene Expression and Cohort Analysis
The rapid progress of genome-scale sequencing has delivered a comprehensive list of genes for different organisms. Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. The fundamental challenge in modelling a mathematical structure to explain these high dimensional data is the stochastic nature of biological processes and the associated noise acquired during the mining process. In the penultimate chapter of this thesis, we employ the representative persistent cycles discussed in the forthcoming chapters to
curate gene expression data. This work differs from the preceding works in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating said features to obtain a "crux" of better representatives. This kernel of the entire data facilitates better comprehension of the phenotype labels. (2) Most prior works (including our works in image
and protein analysis in Chapters 3 and 4) employ barcodes obtained from topological summaries as fingerprints for the data. Even though these are stable signatures, there exists no direct mapping between the datum and said barcodes. The representative persistent cycles we use allow us to directly identify genes involved in similar processes.
We look into two problems in the understanding of genome sequence where scientists
have shown particular interest. A genome-wide association study (GWAS) is a method to
link a subset of genes to a particular disease or physical phenomenon in an organism. Since
the number of gene expressions in a cohort profile is far greater than the number of sample
cohorts, it becomes important for these cases to identify a subset of genes whose expression
levels reflect the phenotype of the cohorts. On the other hand, it is often the case that some cohorts have incorrect or uncorrelated data due to instrument or manual error. We find in practice that the elimination of such instances leads to better prediction scores and performance. We delve into both these problems of identifying relevant cohorts and genes in Chapter 6. Since genes have previously shown clustering tendencies [46,86], this work investigates whether they contain higher dimensional topological information as well. We use the representative n-cycles discussed in the forthcoming chapters to extract
these data, and the results based on real-life gene expression data of various organisms seem encouraging. The topology-relevant curated data that we obtain provides reasonable improvement in both shallow and deep machine learning based classifications. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts on that basis.
1.4.1 Outline of the thesis
This thesis centers on the theme of building topological tools to address an ensemble of problems in science and engineering. In the next chapter, we give a comprehensive review of some of the pivotal findings in the domain. Chapter 3 employs the subsampling-based techniques discussed in Section 1.1 to classify images. Chapter 4 uses a similar strategy, with further optimisations, to show tractable performance in protein classification. In contrast to the image classification, where we augmented topological features with traditional feature vectors to show learning improvement, this work uses unaided persistent barcodes to surpass the state of the art.
With these two chapters dealing with applications of persistent barcodes directly, our later chapters investigate representative cycles. Chapter 5 describes the computation and applications of persistent 1-cycles. The same chapter uses a generalised algorithm (Section 5.6) to compute d-dimensional persistent cycles for finite intervals on a special type of simplicial complexes. This section exhibits some interesting applications and instances of representative persistent 2-cycles in Z2. Chapter 6 utilises the algorithms of Chapter 5 to analyse gene expression profiles. We show the unsupervised ability of the topological features to identify phenotype labels, and adopt the same in supervised learning as well. In sum, these four chapters articulate how we build several tools in topology and show their utility in a myriad of applications. Finally, we conclude our work in Chapter 7.
Chapter 2: Some Applications of Persistent Homology
In this chapter, we provide a brief survey of the different domains in which topological data analysis, specifically persistent homology, has found applications.
2.1 General applications of Persistent Homology
The idea of persistent homology has found applications in several areas including robotics
[41], sensor networks [28,52], and the analysis of different other domains [53,71,98].
Several structures in material science have been analysed using persistent homology. Nakamura et al. [81] proposed a methodology based on persistent homology to describe Medium Range Order (MRO) in amorphous solids. Their experiments on crystalline, random, and amorphous structures show the power of persistence diagrams in explaining geometric structures in single and many-body atomic configurations. For instance, in glass materials the presence of characteristic curves in a function over the persistence diagram implies the presence of MRO. For crystalline solids, the periodic structure yields a few island supports in the persistence diagram with high multiplicity. The diagonal region in this diagram corresponds to the secondary holes that represent distortion from the primary holes. In effect, this paper provides strong evidence of the importance of persistence diagrams in the analysis of material structures. In another work, Hiroaka et al. [56] show similar results, distinguishing the different states of matter using persistence diagrams. They also study thermal fluctuations and strains on molecules using persistent homology.
The idea of persistent homology has found application in time series data analysis as well.
The authors of [109] identify the qualitative changes in time series using delay embeddings and
topological data analysis. This delay-variant embedding method reveals multiple time-scale
patterns in a time series. The authors also combine these features with the kernel technique
in machine learning algorithms to classify the general time-series data. In another work,
Perea et al. [88] use a sliding window based technique and persistence to compute the periodicity
of signals.
2.2 Work on Computer Vision & Graphics using Persistent Homology
In computer vision, persistent homology has been used for a wide array of applications
ranging from image segmentation [65] to shape characterization [6]. We discuss some of the
recent works in vision and computer graphics in this section.
Carrière et al. [22] used persistent homology to estimate 3D shapes. They work on
generating stable signatures to describe compact smooth surfaces in R3. Given a shape S,
they calculate persistence by growing geodesic balls for each point x ∈ S. They vectorise the
persistent diagrams by treating them as metric spaces and using pairwise distances. Thus for
each shape, they have a set of vectors, one for each point in the shape. Multidimensional
Scaling on these vectors shows that there is some continuity between vectors with identical labels, which suggests that the signatures vary continuously along the shape. It is these vectors that are in turn used as features for machine learning. Results of classifying shapes based on these signatures are quite encouraging. In a similar application, Thomas et al. [10] use persistent homology to estimate 3D shape poses. Their work uses the intervals obtained
in persistent homology as features for pooling in a bag-of-word approach. Their results fare
better than the state-of-the-art techniques as well.
In a work related to image analysis of medical data, [104] studies the cell arrangement of
microscopic images of breast cancer using topological data analysis. They take the points
and their corresponding weights as the individual nucleus and its mass. Using this, they
compute the persistent homology of the VR complex starting from a weighted point cloud.
They classify different cancer types and demonstrate that the topological features contain
useful complementary information to image-appearance based features that can improve
discriminatory performance of classifiers.
Two more recent works deserve mention. The first one is by Leon et al. [70], who use
filters on gaits of human silhouettes to build simplicial complexes. They then use these
complexes for gender classification. In another work, Aras et al. [5] use persistent homology
to analyse image tampering. They utilise the non-uniformity in the Local Binary Pattern of
images to build simplicial complexes. They use the number of connected components in
this simplicial complex for different threshold values as features for their classifier. Their
results show that the persistent homology sequence defines a discriminating criterion for the
morphing attacks (i.e. distinguishing morphed images from genuine ones).
2.3 Work on Bio Science using Persistent Homology
In bio-science, topological data analysis has found applications in medical imaging [26,
110], molecular architecture [92,97], and genomics [14], among others.
Topological structures have also been used to analyse viruses. Emmet et al. worked on influenza [47] to show a bimodal distribution of intervals in the persistence diagram.
This bimodality indicates two scales of topological structure, corresponding to intra-subtype
(involving one HA subtype) and inter-subtype (involving multiple HA subtypes) viral re-
assortments. These results on viruses suggest that persistent homology can be also used
to study other forms of reticulate evolution. Overall, this paper presents clear examples of
topological structures demonstrating different biological processes. In another work, Parida
et al. [87] used topological characteristics to detect subtle admixture in related populations.
In [99], gene expression data from peripheral blood was used to build a model based on a TDA network and discrete Morse theory to look into routes of disease progression.
Persistent homology has also been employed in [42] for comparison of several weighted gene
co-expression networks.
Persistent Homology has also been used to identify DNA copy number aberrations [4].
Their experiments found a new amplification in 11q at the location of the progesterone
receptor in the Luminal A subtype. Seemann et al. [100] used persistent homology to identify
correlated patient samples in gene expression profiles. Their work focuses on the H0 homology
class which is used to partition the point cloud. The famous paper by Nicolau et al. [84]
identified a subgroup of breast cancers using topological data analysis in gene expressions.
Several works [93,107] on use of machine learning techniques on gene expression profile
have shown promising results. Kong et al. [63] used random forests to extract features for
their Neural Network architecture. They investigate a problem similar to our 'Topo-relevant gene' problem, and the results show significant improvement. [58] analyzes gene expression data to
classify cancer types. Different techniques of supervised learning are used to understand genes
and classify cancer. The authors of [112] use machine learning to identify novel diagnostic
and prognostic markers and therapeutic targets for soft tissue sarcomas. Their work shows
overlap of three groups of tumors in their molecular profile.
The authors of [16] characterised proteins using persistent homology. Since we will
compare our algorithm with this work in Section 4, we explain its working in some detail. The work uses amino acid molecules as points to build a VR filtration. Based on this filtration,
the authors chose a feature vector of length 13, based on number, length, birth and death
time, average, and summation of certain specific intervals. This feature vector was used as
input to supervised classification. This work shows that the classification accuracy for protein
structures is improved even by choosing feature vectors naively without any statistical basis.
In another work [18], the authors have used barcodes to describe the secondary structure
of proteins such as alpha helix and beta sheets. They also employ said barcodes to analyse
protein elastic network models.
2.4 Work on Machine Learning and Persistent Homology
There have been experiments on different kernels using persistence, including the Sliced Wasserstein [21], persistence scale space [94], weighted Gaussian [66], and Persistence Fisher kernels on Riemannian manifolds [68].
Some works use persistence diagrams as features for machine learning classifiers by computing statistical measures on the intervals. The authors of [15,17] have used a binning process, collecting values at grid points. Other works [16,65] used the top 'n' intervals as features. Bubenik [12] introduced persistence landscapes, which were subsequently used as feature vectors either through binning or as intensity maps for neural networks. Recently, persistence diagrams have been represented using a persistence surface function [2]. The persistence surface can be discretized by a Cartesian grid into image data. By integrating the persistence surface function over each grid cell (or pixel), a persistence image is obtained. These images can in turn be fed directly into classifiers.
2.5 Work on representative cycles for homology group
The computation of representative cycles for homology groups with Z2 coefficients has
been extensively studied over the decades. While a polynomial time algorithm computing
an optimal basis for first homology group H1 [39] has been proposed, finding an optimal
basis for dimension greater than one and localizing a homology class of any dimension are
proved NP-hard [25]. There are a few works addressing the problem of finding representatives
for persistent homology, some of which compute an optimal cycle at the birth index of an
interval but do not consider what actually dies at the death index [48,50].
Obayashi [85] formalizes the computation of optimal representatives for a finite interval as
an integer programming problem. He advocates solving it with linear programs though the
correctness is not necessarily guaranteed. Wu et al. [113] proposed an algorithm for computing
an optimal representative for a finite interval with a worst-case complexity exponential in the cardinality of the persistence diagram.
Chambers et al. [23] proved that the localization problem over dimension one is NP-hard when the given simplicial complex is a 2-manifold. Several other works [11, 24,35,49] address variants of the two problems while considering special input classes, alternative cycle measures,
or coefficients for homology other than Z2.
Chapter 3: Image Classification
Image classification has been a topic of great interest for researchers in both computer vision and machine learning for quite some time. Affordable high-quality cameras and online social media platforms have provided the internet with billions of images, most of which are raw and unclassified. Recent developments of new classification techniques show very promising results, even on large datasets such as ImageNet [29]. In this chapter, we intend to
investigate if topological signatures arising from topological persistence can provide additional
information that could be used in the classification of images.
For supervised image classification with machine learning techniques, we investigate both
feature vector based shallow classification and neural network based deep classification; see
Figure 3.1 for a schematic diagram. For the shallow learning method, we use one of the
state-of-the-art techniques using SIFT followed by Fisher-Vector encoding [96] to generate the
feature vectors. Classification using these vectors has an accuracy of 59.7% on the Caltech-256
dataset for 60 or more training samples. Classification using Convolutional Neural Networks was tested with another state-of-the-art model: AlexNet [64]. We reproduced the experiments with and without our topological features and observed an improvement when they were included. This trend is also evident in modifying AlexNet (which has a precision of 83.2%) and evaluating on the
CIFAR-10 image data set, where we found consistent improvements in model precision when
including topological features.
Figure 3.1: Top: The topological features act as inputs to the fully connected part of our modified convolutional neural network. Bottom: Using modified Fisher Vector on SIFT along with topological features for training.
We compute the topological features for each image and append this data with the feature vector obtained from the traditional method to classify the images. While there may be
improvements in classifiers and methods for augmenting features into them, we claim that
even naively including the topological features adds additional relevant information about the
image which can be utilized by the network in making more accurate classifications. Since
traditional feature extractors rely on geometric and image processing properties, such as
gradient, orientation of sub-pixels, statistical distribution of colors or the learned features
found in Convolutional Neural Networks (CNNs), topological features are lost in this process.
By reintroducing these features, we add additional relevant information which can be utilized
by existing classification techniques. Our entire technique has been illustrated in a video which is available at https://youtu.be/hq4DYse2c-Y.
We compute the topological features for images and use these features for image classifica-
tion using machine learning techniques. We use the theory of persistent homology to generate
feature vectors for a point cloud. But, instead of the classical Rips complex filtrations, we
use a hierarchy of collapsed simplicial complexes and a novel point selection strategy to save
time and prevent a combinatorial blowup on the number of simplices. The background and
motivation for our algorithm can be found in section 1.1.
We transform each pixel to a point p ∈ R^5 by taking the RGB intensity values as well as its (x, y) coordinates. Note that this is only one way of transforming an image to a point cloud (other techniques, such as the lower star filtration on the intensity, exist as well). On this point
cloud we build a graph induced complex [30], which is formed by inserting a simplex on a vertex set V ⊆ Q whenever a set of points in P, each being closest to exactly one vertex in V, forms a clique. Formally,
Definition 2. Let G(V) be a graph with vertex set V and let ν : V → V′ be a vertex map with ν(V) = V′ ⊆ V. The graph induced complex G(V, V′, ν) is defined as the simplicial complex in which a k-simplex σ = {v′_1, v′_2, ..., v′_{k+1}} belongs to G(V, V′, ν) iff there exists a (k+1)-clique {v_1, v_2, ..., v_{k+1}} ⊆ V such that ν(v_i) = v′_i for each i ∈ {1, 2, ..., k+1}. To see that it is indeed a simplicial complex, observe that a subset of a clique is also a clique.
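The following brute-force sketch makes Definition 2 concrete for small graphs (a hypothetical illustration, not our actual implementation): a simplex on the image vertices is inserted whenever some clique of G(V) maps bijectively onto it under ν.

```python
from itertools import combinations

def graph_induced_complex(edges, nu, max_dim=2):
    """Brute-force sketch of Definition 2: insert a simplex on vertices of V'
    whenever some clique of G(V) maps bijectively onto it under nu."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    simplices = set()
    for k in range(1, max_dim + 2):                 # clique sizes 1 .. max_dim+1
        for clique in combinations(sorted(adj), k):
            if all(b in adj[a] for a, b in combinations(clique, 2)):
                image = frozenset(nu[v] for v in clique)
                if len(image) == k:                 # nu maps the clique bijectively
                    simplices.add(image)
    return simplices

# Toy graph: triangle {1,2,3} plus edge (3,4); nu collapses 2 onto 1 and 4 onto 3.
gic = graph_induced_complex([(1, 2), (2, 3), (1, 3), (3, 4)], {1: 1, 2: 1, 3: 3, 4: 3})
# gic contains the vertices {1}, {3} and the edge {1, 3}
```

Note that closure under taking faces comes for free: every subset of a clique is itself a clique, so every face of an inserted simplex is also inserted.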
We build this initial simplicial complex and proceed with the collapse based filtration
described in 1.2. The theory of persistent homology tracks the birth and death of the cycles.
For ease of computation we compute persistence only up to one-dimensional homology, which keeps track of the 1-cycles or loops in R^5. This allows us to balance the necessity of
getting relevant topological information against the increased computation time required for
generating high-dimensional homological features.
3.1 Feature Vector Generation
Now we describe the method we use to incorporate the topological signatures as features
for image classification.
Definition 3. Given an image I, we use the function f : I → R^5 mapping the RGB intensity of the pixel with coordinates (x, y) ∈ I to the point
\[
\left( \frac{r - \mu_r}{\sigma_r},\; \frac{g - \mu_g}{\sigma_g},\; \frac{b - \mu_b}{\sigma_b},\; x - \bar{x},\; y - \bar{y} \right)
\]
in the point cloud P ⊂ R^5.
(Here μ_i and σ_i refer to the mean and standard deviation of the intensity channel i, which can be red, green, or blue, and x̄, ȳ are the corresponding coordinate means.) The colors of images, which vary from 0 to 255, and the sizes of images, typically 200 × 200 depending on the dataset, are essentially normalised by this process. We apply this mapping to all pixels in the input image in order to obtain an initial point set P on which the algorithm in Section 3.2 operates. This algorithm computes the barcodes denoting the birth-death value for each cycle (as described in Section 1.1). Typically, cycles with short barcode length
correspond to intermediate/insignificant cycles or noise. So, to find the cycles which persist
longer, we sort the barcodes w.r.t their lengths and find the largest difference in length
between two consecutive barcodes. The (death-birth) value for every barcode above this gap
is taken as our feature vector. Therefore, if there are ‘m’ barcodes above the widest gap of
the barcodes for an image, li denoting length of the ith barcode, we take the length of the
top ‘m’ barcodes (l1, l2, ..., lm) as our feature vector. This m-length vector is added to the
feature vector obtained from the traditional machine learning approach and used for training
and testing. The barcode of a sample image from Caltech-256 is given in Figure 3.2 with
the bottom 6 lines in blue forming (l1, l2, ..., l6). We note that one may compute the feature vectors from our topological signatures using the methods proposed in [12, 67]. We adopt the simple approach described here for reasons of speed and simplicity.
Figure 3.2: One-dimensional homology for an image in Caltech-256. The bars in blue are above the widest gap and are chosen as the feature vector.
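As a concrete illustration, the widest-gap selection described above can be sketched in a few lines; the function name and the sample barcodes below are hypothetical, not taken from our implementation:

```python
# Keep the barcode lengths above the largest gap between consecutive
# lengths when sorted in decreasing order; everything below the gap is
# treated as noise.

def topological_feature_vector(barcodes):
    """barcodes: list of (birth, death) pairs; returns the top lengths."""
    lengths = sorted((death - birth for birth, death in barcodes), reverse=True)
    if len(lengths) < 2:
        return lengths
    # Differences between consecutive sorted lengths; the widest gap
    # separates the persistent bars from the noisy ones.
    gaps = [lengths[i] - lengths[i + 1] for i in range(len(lengths) - 1)]
    m = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return lengths[:m]

bars = [(0.0, 5.0), (0.1, 4.0), (0.2, 0.5), (0.3, 0.4)]
print(topological_feature_vector(bars))  # [5.0, 3.9]
```

The two long bars survive and the two short ones are discarded, mirroring the (l1, ..., lm) construction above.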
3.2 Algorithm for fast computation of topological signature
To compute the topological signature for an initial point cloud P ⊂ Rn, we follow the
procedure below:
1. Create a Nearest Neighbor graph on P by creating an edge between each point and its
k-nearest neighbors.
2. Create a δ-sparse, δ-net subsample of P to form V0 (Def. 1 in Sec. 1.2).
3. Build a graph induced complex [40] C0 on V0.
4. Undergo a sequence of subsamplings of the initial point set, V0 ⊃ V1 ⊃ ... ⊃ Vk, based on the Morton ordering discussed in 1.2.1. (For every Vi ⊃ Vi+1, we remove every nth point from the ith sample to form Vi+1.)
5. Generate a sequence of vertex maps f i : Vi → Vi+1, as defined in section 1.2. This in turn generates a sequence of collapsed complexes C0, C1, ..., Ck. Each vertex map induces a simplicial map f i : Ci−1 → Ci that associates simplices in Ci−1 with simplices in Ci.

6. Compute the persistence of the sequence of simplicial maps C0 → C1 → ... → Ck (with maps f1, f2, ..., fk) to generate the topological signature of the point set P.
Thus, given a sequence of simplicial maps, we can compute persistence by a sequence of collapse operations (induced by the maps) on our initial complex, as described in section 1.2.
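Steps 4 and 5 can be sketched as below, under simplifying assumptions (points quantized to non-negative integer grid coordinates, brute-force nearest-neighbour search); every name here is illustrative rather than our actual implementation:

```python
# Sort points by Morton code (bit interleaving, z-order), drop every n-th
# point, and map each dropped vertex to its nearest surviving neighbour.

def morton_code(pt, bits=10):
    """Interleave the bits of a tuple of non-negative ints (z-order)."""
    code = 0
    for b in range(bits):
        for d, coord in enumerate(pt):
            code |= ((coord >> b) & 1) << (b * len(pt) + d)
    return code

def subsample_and_map(points, n):
    """Remove every n-th point in Morton order; map it to nearest survivor."""
    order = sorted(points, key=morton_code)
    removed = order[n - 1::n]
    kept = [p for i, p in enumerate(order) if (i + 1) % n != 0]
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    vertex_map = {p: min(kept, key=lambda q: dist2(p, q)) for p in removed}
    return kept, vertex_map

pts = [(x, y) for x in range(4) for y in range(4)]
kept, vmap = subsample_and_map(pts, 4)
print(len(kept), len(vmap))  # 12 4
```

Extending such a vertex map to triangles and higher simplices is what produces the simplicial maps whose persistence is computed in step 6.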
3.2.1 Choosing Parameters
Next, we discuss how we choose the parameters used to generate persistence for the images. We need to tune two parameters: the first is k, the number of nearest neighbors to which each node is connected in the δ-sparse, δ-sample complex we build (Section 3.2, Step 1); the second is n, where we choose every nth point of Vi and collapse it to its nearest neighbor for the simplicial map f i : Ci−1 → Ci (see Section 3.2). To choose these parameters, we perform an initial unsupervised clustering of the images based
on the t-distributed stochastic neighbor embedding, or t-SNE [111]. This technique provides visualizations of high-dimensional point clouds via a nonlinear embedding that attempts to place points that are close in the high-dimensional space at nearby locations in either R2 or R3. We
take a subset of images from each of our datasets and generate the persistent signatures as
described above.
These signatures are embedded in R2 using t-SNE. The effects are evident in Figure 3.3a, where we experiment on the MNIST digit dataset. Clustering the MNIST digits based on the computed barcodes shows that digits with similar one-dimensional homology features lie close together. Specifically, the digits 0-9 can be partitioned into 3 equivalence classes based on the number of holes in each digit, where

[0] = {1, 2, 3, 5, 7}, [1] = {0, 4, 6, 9}, [2] = {8}
Figure 3.3: t-SNE was used as an aid in picking parameters for the computation of barcodes. (a) Good clustering on MNIST. (b) Bad clustering on CIFAR-10. (c) Better clustering on CIFAR-10.
and this is reflected in Figure 3.3a. Figure 3.3b provides a bad clustering example on the CIFAR-10 dataset with k = 8: t-SNE produces a noisy result in which few distinct clusters form. Tuning the parameter to k = 15, more distinct, visible clusters form on the CIFAR-10 dataset, as evident in Figure 3.3c. We found such improvements to indicate how the aforementioned parameters should be adjusted. We repeat this experiment for k ∈ {15, 17, 19, 21, 23, 25} and for n ranging from 8 to 20. Finally, we choose the values of k and n for which t-SNE produces the most distinct clustering of images, namely k = 17 and n = 15.
3.3 Results
We focus primarily on two supervised classification frameworks. The first is the traditional feature-vector-based approach, where we generate a feature set for each image and then train and test an optimal classifier on it. The second is the convolutional neural network approach, where we modify the final layers of the classifier to accommodate our new features. We worked on several image datasets, chiefly CIFAR-10, Caltech-256,
Dataset       Method                           P − T    P + T    R − T    R + T
CIFAR-10      SIFT + Fisher Vector Encoding    23.63    28.24    23.56    31.16
CIFAR-10      AlexNet                          83.2     83.8     98.25    98.15
Caltech-256   SIFT + Fisher Vector Encoding    57.50    59.50    51.50    64.00
MNIST         Deep MNIST SoftMax Regression    98.15    98.46    99.57    99.48
Table 3.1: Precision (P) and Recall (R) for different methods with and without topological features. P−T and R−T indicate precision and recall without topology, whereas P+T and R+T indicate precision and recall with topology. Note that except for Caltech-256, which has 20 classes, all the other datasets have 10 classes.
and the MNIST handwritten digit dataset. All our code is freely available on our website:
https://web.cse.ohio-state.edu/~dey.8/imagePers/ImagePers.html.
3.3.1 Feature Vector based Supervised Learning
For classification, the number of features extracted per image is generally quite large and, in addition, can vary from image to image. Because of this, each image is transformed into a fixed-size representation using a global image descriptor known as the Fisher vector encoding [96]. We use 16 Gaussian functions to form the GMM used as the generative model. We first run the Hessian-Affine [78] detector to generate interest points, and then calculate the SIFT [74] features of these interest points. If there are l interest points, this process generates a 128 × l dimensional matrix, which is transformed into a feature vector of length 4096 × 1 using the Fisher vector encoding. Finally, we train an SVM model using the feature vector generated from each image and use it to
classify the images.
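As a quick sanity check on these dimensions: the standard Fisher vector stacks gradients with respect to the means and variances of the K mixture components, giving length 2KD for D-dimensional descriptors (the variable names below are ours, for illustration):

```python
K = 16   # Gaussian components in the GMM
D = 128  # SIFT descriptor dimension

# Gradients w.r.t. the K means and the K variances, each D-dimensional.
fisher_vector_length = 2 * K * D
print(fisher_vector_length)  # 4096
```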
We first tried to classify images from the CIFAR-10 dataset [75]. The dataset contains 6 batches of 10,000 images, for a total of 6,000 images in each of the 10 output classes. Each individual image is tiny, with dimensions 32 × 32. Since these images are so low in resolution, the number of interest points extracted is very small, and thus insufficient to characterize each image. There are ten classes of images in CIFAR-10, giving a baseline of
0.1 for precision and recall. For each class we trained on 4000 images and tested on the other
2000. We present the average result over all the 10 classes in Table 3.1.
Next, we compute the persistence of each image in R5. The value of m discussed in Section 3.1, computed as an average over all images, is 10. Hence we append the lengths of the 10 longest barcodes to the signature described above, giving a vector of length 4106 × 1. The precision and recall in this case increase significantly, as noted in Table 3.1.
The second dataset that we use is Caltech-256 [54]. The number of images in each category varies between 80 and 827, the median being 100. Since the dataset is skewed, we use a subset of the dataset, taking the first 20 classes having 100 or more images. We also fix the number of training and test images at 75 and 20, respectively, for each class to maintain consistency. We use the same technique as
before, computing precision and recall for features with and without the persistence values.
In this case, for the 20 classes, the precision and recall improved significantly as well. The
average accuracy for all the 20 datasets using SIFT with the Fisher Vector Encoding comes
out to be 53.27%. However if we use the signature of our persistent homology, the accuracy
increases to 56.74%. The average precision and recall for each class increase as well, as listed in Table 3.1. We also plot the accuracy while varying the training set size from 25 to 75. Accuracy increases considerably with topological features in each case (see Figure 3.4a). Figure 3.4b plots the precision for a subset of eight classes of the dataset and shows that precision fluctuates widely across classes for the Fisher vector method, yielding 0% on two occasions, whereas the output using topology is reasonable for all classes.
Two things are worth noting at this point. First, our algorithm runs faster than the state-of-the-art software SimBa used for generating topological features from point cloud data. In this
Figure 3.4: Comparison of accuracy and precision with and without topological features. (a) Performance versus training size. (b) Fluctuations in precision across different classes.
       SimBa                   Our Algo                Our Algo + β2
       Acc    Pr     Re        Acc    Pr     Re        Acc    Pr     Re
I      54.6   61.9   63.9      58.9   65.6   64.8      59.2   60.1   64.8
II     19.6   21.2   33.0      21.3   28.2   31.4      22.4   28.2   31.2
Table 3.2: Qualitative comparison of our algorithm with SimBa, with and without 2-dimensional homological features. Acc: Accuracy, Pr: Precision, Re: Recall. I: 5 classes of Caltech-256; II: CIFAR-10.
regard, we provide a quantitative comparison (Table 1.1). Second, we do not include the two-dimensional homology features, in order to save computational time. Nevertheless, we show the results that we would have obtained after including these features. Table 3.2 shows the accuracy, precision, and recall of running SimBa and our algorithm (with 2D features) on 5 classes of the Caltech-256 dataset and on CIFAR-10. Note that, since we took a subset of the entire dataset, better accuracy on this subset does not necessarily imply better overall performance. Interestingly, in some cases, considering only one-dimensional features provides better accuracy.
3.3.2 CNN and Deep Learning based Training
The second framework in our experimentation was based on the Convolutional and Deep
Neural Network models. For these models we started by experimenting with the MNIST
handwritten dataset [69]. We implement a straightforward Deep MNIST softmax regression network [106]. In a nutshell, the network comprises two sets of convolutional layers, each followed by pooling and rectified linear activations, whose output is fed to a fully connected layer from which we determine the final output probabilities. After training, this model has a precision of 98.16%. However, by including the topological features in the second-to-last layer of the fully connected network, we get a further improvement of 0.36% over the previously reported result. While this may not seem a significant improvement, obtaining even a slight gain on a model which is already so accurate is encouraging. It also indicates that topological signatures contain information which neural network pipelines are not able to mine. This
trend in improvement continues for another dataset that we tried, namely CIFAR-10, which we discussed earlier. While the SIFT feature vector is not a very good method for classifying these tiny images, deep neural networks have proven to be quite effective in such cases. A classic, successful model for this dataset is AlexNet, invented by Krizhevsky et al. [64] in 2012. We modify this model slightly to accommodate our features. AlexNet starts with multiple convolutions, followed by normalization and pooling layers, and finally two fully connected layers with a ReLU activation function; see [64] for more details. The original model was also trained on multiple GPUs by splitting the layers of AlexNet, and it had a layer for local response normalisation which normalised specific positions across all channels in a given layer. We do not include these two features in our experimentation as they have not remained popular. Training the classifier for 50,000 iterations with a batch size of 64, we obtained a precision of 83.2%. On top of that, we added the
topological features to the fully connected layer in the last stage of the model to get a 0.6% increase in precision. The details of all the results are included in Table 3.1.
Thus we see that topological features provide additional information for the classification of images. This is not surprising, as most techniques rely on geometric, textural, or gradient-based features for classification that do not necessarily capture topology. In the next chapter, we investigate the use of such features for protein characterisation and classification. In contrast to this chapter, protein structures can be described directly using topological features without the need for additional features.
Chapter 4: Protein Classification
Proteins are by far the most anatomically intricate and functionally sophisticated molecules known. Researchers have long worked on the benchmarking and classification of unannotated proteins. This effort directly influences our understanding of the behavior of unknown proteins as well as more advanced tasks such as genome sequencing. Since the sheer volume of protein structures is huge, until the last decade it had been a cumbersome task for scientists to manually evaluate and classify them. Over the last decade, several works aiming to automate the classification of proteins have been developed. The majority of annotation and classification techniques are based on sequence comparisons (for example BLAST [101], HHblits [7], and [95]) that try to align protein sequences to find homologs or a common ancestor. However, since those methods focus on finding sequence similarity, they are mostly effective at finding close homologs. Some remote homologs are known to have less than 25% sequence similarity and yet have common ancestors and are functionally similar. So we miss important information on structural variability when classifying proteins solely on the basis of sequence. Even though homology is sometimes established by comparing structural alignments [73], accurate and fast structural classification for the rapidly expanding Protein Data Bank remains a challenge.
The algorithm that we present here is a fast technique to generate a topological signature for protein structures. We build our signature based on the coordinates of the atoms in R3 using their radius as weights. Since we also consider existing chemical bonds between the
atoms while building the signature, we believe that the hierarchical, convoluted structure of a protein is captured in our features. Finally, we developed a new technique to generate
persistence that is much quicker and uses less space than even the current state-of-the-art such as SimBa. It helps us generate the signature even for reasonably large protein
structures. In sum, in this chapter we focus on three problems: (1) effectively mapping a protein structure into a suitable complex; (2) developing a technique, similar to that of the previous chapter, to generate a fast persistent signature from this complex; and (3) using this signature to train a machine learning model for classification and comparing it against other techniques. Our entire method
is summarized in figure 4.1. We also illustrate this method using a supplementary video
available at https://youtu.be/yfcf9UWgdTo.
With the traditional technique discussed in section 1.1, difficulties arise as r increases. An average protein in a database such as CATH [103] has 20,000 ∼ 30,000 atoms, thus creating a point cloud of the same size in R3. Furthermore, the initial complex, which includes 3-simplices (tetrahedra), becomes quite large. On average, this complex grows to (50 ∼ 100) × 10⁴ simplices of dimension up to 4 and becomes quite difficult to process.
Building a filtration using this growing sequence of balls is thus not scalable. We attack the problem with two strategies: (1) we only consider simplices on the boundary of the entire simplicial complex in our algorithm, and (2) we use a new filtration technique based on collapsing simplices rather than growing their number by addition (described in Ch. 1).
4.1 Topological persistence
Traditionally, given a point cloud, its persistence signature is calculated by building a
filtration over a simplicial complex called the Vietoris-Rips (VR) complex. The VR complex is easy to implement, but its size can become a hindrance for even a moderate-size protein molecule.
Figure 4.1: Workflow
Thus, instead of a VR complex, we use the (weighted) alpha complex that is sparser and has
been used to model molecules in earlier works [59].
Definition 4. Alpha complex AC(α): For a given value of α, a simplex σ ∈ AC(α) if:
• The circumball of σ is empty and has radius < α, or
• σ is a face of some other higher dimensional simplex in AC(α).
Definition 5. Weighted Alpha Complex WAC_P̂(α): Let Bk(p̂) be a k-dimensional closed ball with center p and weight rp. It is orthogonal or sub-orthogonal to a weighted point (p′, rp′) iff ||p − p′||² = rp² + rp′² or ||p − p′||² < rp² + rp′², respectively.

An orthoball of a k-simplex σ = {p̂0, ..., p̂k} is a k-dimensional ball that is orthogonal to every vertex p̂i. A simplex is in the weighted alpha complex WAC_P̂(α) iff its orthoball has radius less than α and is sub-orthogonal to all other weighted points in P̂.
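The two ball relations in Definition 5 reduce to a simple numeric test, sketched below; the function name and tolerance are illustrative assumptions, not part of our implementation:

```python
import math

def ball_relation(p, rp, q, rq, tol=1e-9):
    """Classify the weighted point (q, rq) relative to the ball (p, rp)."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))  # squared center distance
    s = rp * rp + rq * rq
    if abs(d2 - s) <= tol:
        return "orthogonal"
    return "sub-orthogonal" if d2 < s else "neither"

# Two unit balls whose centers are sqrt(2) apart satisfy
# ||p - q||^2 = 2 = r_p^2 + r_q^2, hence orthogonality.
print(ball_relation((0, 0, 0), 1.0, (0, 0, math.sqrt(2)), 1.0))  # orthogonal
```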
4.2 Collapse-induced persistent homology from point clouds
The following procedure computes a topological signature for a weighted point cloud P̂ = {(p, rp)} using subsamples and subsequent collapses:

1. Compute a weighted alpha complex C0 on the point set P̂ = {(p, rp)} using the algorithm described in [108]. Let V0 be the vertex set of C0.
2. Compute a sequence of subsamples V0 ⊃ V1 ⊃ ... ⊃ Vk of the initial vertex set V0 based on the Morton ordering as discussed in 1.2.1. (For every Vi, we remove every nth point in the Morton ordering from Vi to form Vi+1. We choose n based on the number of initial points.)
3. This sequence of subsets of V0 allows us to define a vertex map between any two adjacent subsets, f i : Vi → Vi+1. We use the same definition of f i as in Sec. 1.2.
4. This vertex map f i : Vi → Vi+1 in turn generates a sequence of collapsed complexes C0, C1, ..., Ck. Each vertex map induces a simplicial map f i : Ci−1 → Ci that associates simplices in Ci−1 with simplices in Ci (see Figure 4.3).
5. Compute the persistence of the sequence of simplicial maps C0 → C1 → ... → Ck (with maps f1, f2, ..., fk) to generate the topological signature of the point set P̂.
In step 1 of the procedure, the weighted points alone lead to disconnected weighted atoms in C0 rather than capturing the actual protein structure. To sidestep this difficulty, we increase
the weights of these points based on the existence of covalent or ionic bonds in the structure.
That is, if there exists a chemical bond between two atoms (which we get from the input
.pdb file), we scale-up the weight of each point so that they are connected in the weighted
alpha complex WAC_P̂(α) (see Fig. 4.2). We determine a global multiplying factor ρ ≥ 1 for this purpose.

Figure 4.2: Weighted alpha complex for a protein structure

As mentioned earlier, we take the boundary of this weighted complex which
forms our initial simplicial complex C0. In step 2, in order to generate the sequence of subsamples, we pick vertices uniformly
from the simplicial complex to be collapsed to their respective nearest neighbors. To choose
a subsample that respects local density, we use a space-filling curve technique called Morton ordering [79]. The details of this method are given in section 1.2.1.
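To illustrate the bond-driven weight increase from step 1, the sketch below computes the smallest global factor ρ ≥ 1 under a simple overlap criterion; both the criterion and all names are illustrative assumptions, not our exact implementation:

```python
import math

# For each bonded pair of atoms, find the smallest factor rho >= 1 whose
# scaled balls overlap (squared distance at most (rho*r_p)^2 + (rho*r_q)^2),
# then take the maximum over all bonds as the global factor.

def min_scale_for_bond(p, rp, q, rq):
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return max(1.0, math.sqrt(d2 / (rp * rp + rq * rq)))

def global_scale(bonds):
    """bonds: iterable of ((p, r_p), (q, r_q)) pairs of bonded atoms."""
    return max(min_scale_for_bond(p, rp, q, rq) for (p, rp), (q, rq) in bonds)

# Covalent radii in picometres (C: 76 pm, N: 71 pm); the farther pair
# dictates the global factor.
bonds = [(((0, 0, 0), 76), ((147, 0, 0), 71)),
         (((0, 0, 0), 76), ((200, 0, 0), 71))]
print(round(global_scale(bonds), 3))  # 1.923
```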
Finally, as described in step 3, instead of constructing the filtration by increasing the value of α, we perform a series of successive collapses starting with the initial simplicial
complex. This leads to a sequence of complexes that decreases in size instead of growing as we proceed. Effectively, it generates a sequence called a tower of simplicial complexes, in which successive complexes are connected by simplicial maps. These maps, which are the counterpart of continuous maps in the combinatorial setting, extend maps on vertices (vertex maps) to simplices (see [80] for details). In our case, collapses of vertices generate these simplicial maps from a simplicial complex in the tower to the one it is collapsed to.
Persistence for towers under simplicial maps can be computed by the algorithm introduced
in [34]. We use the package called Simpers that the authors provide for this purpose.
To summarize, the algorithm generates an initial weighted alpha complex. It then proceeds
by recursively choosing vertices based on Morton Ordering to be collapsed to their nearest
Figure 4.3: Top: Collapse of the weighted alpha complex generated from a protein structure via simplicial maps. Bottom: The same algorithm applied to a kitten model in R3
neighbors, resulting in vertex maps. These vertex maps are then extended to higher-order simplices (such as triangles and tetrahedra) via the simplicial map. Finally, given the simplicial maps, we generate the persistence and obtain the barcodes for the zero- and one-dimensional homology groups.
4.2.1 Feature vector generation
We discuss how we generate a feature vector given a protein structure. We take Protein Data Bank (*.pdb) files as input to extract protein structures. Such a file contains the coordinates of every atom, its name, its chemical bonds with neighboring atoms, and other metadata such as helix, sheet, and peptide residue information. We introduce a weighted point for each atom in the protein, where the point is the center of the atom and its weight is the specified radius. For instance, for a nitrogen atom in an amino acid, we assign a weight equal to its covalent radius of 71 pm. On this weighted point cloud, with p̂ = (p, rp), if two atoms p̂ and q̂ are involved
in a chemical bond, we increase their weights so that p and q get connected in the alpha
complex. We compute the persistence by generating the initial alpha complex and undergoing
a series of collapses as described in the previous section. For computational efficiency, we
only consider the barcodes in zero and one dimensional homology groups. Note that some of
the barcodes can have death time equal to infinity indicating an essential feature. For finite
barcodes, a short length (death − birth) indicates noise. Eliminating these intermittent features serves an interesting purpose, as we will see in section 4.3. To find relatively
long barcodes, we sort them in descending order of their lengths. Let {l1, l2, ..., lk} be this
sorted sequence. Consider the sequence {l′1, l′2, ..., l′(k−1)}, where l′i = li − li+1, and let l′m be a maximal element for 1 ≤ m ≤ k − 1. All barcodes with the lengths l1, ..., lm form part of the feature vector. Essentially, we remove all barcodes whose lengths are shorter than the largest
gap between two consecutive barcodes when sorted according to their lengths. A similar
technique used in [65] has shown improved results in image segmentation over other heuristics
and parameterizations. Since the feature vector needs to be of a fixed length for feeding into
a classifier, we compute the index m of lm over all protein structures and take an average.
The feature vector also includes the number of essential zero and one dimensional cycles.
Therefore, we have a feature vector of length 2 × m + 2: {l⁰₁, l⁰₂, ..., l⁰ₘ, l¹₁, l¹₂, ..., l¹ₘ, cβ₀, cβ₁}. Here l⁰ᵢ and l¹ᵢ are the lengths of the zero- and one-dimensional homology cycles respectively, whereas cβᵢ is the total number of essential cycles in i-dimensional homology.
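The assembly of this fixed-length vector can be sketched as follows, assuming barcodes arrive per homology dimension as (birth, death) pairs with death = inf for essential cycles; the names and the zero-padding behaviour are our illustrative choices, not the dissertation's code:

```python
import math

def top_lengths(barcodes, m):
    """Lengths of the m longest finite bars, zero-padded to length m."""
    finite = sorted((d - b for b, d in barcodes if math.isfinite(d)), reverse=True)
    return (finite + [0.0] * m)[:m]

def protein_feature_vector(h0, h1, m):
    """Concatenate top H0 lengths, top H1 lengths, and essential-cycle counts."""
    essential = lambda bars: sum(1 for _, d in bars if math.isinf(d))
    return top_lengths(h0, m) + top_lengths(h1, m) + [essential(h0), essential(h1)]

h0 = [(0.0, math.inf), (0.0, 3.0), (0.0, 1.0)]
h1 = [(1.0, 4.0), (2.0, math.inf)]
fv = protein_feature_vector(h0, h1, m=2)
print(fv, len(fv))  # [3.0, 1.0, 3.0, 0.0, 1, 1] 6
```

With m = 2, the resulting length is 2 × m + 2 = 6, matching the formula above.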
4.3 Experiments and results
We perform several experiments to establish the utility of the generated topological
signature. First, we show how our feature vector captures various connections in the single
strands of secondary structures and compare them against the signatures obtained in [114].
Then we investigate whether there is a correlation between the count of such secondary structures
and our feature vector. Next, we describe the topological feature vector obtained from
two macromolecular protein structures. We also compare the size and time needed by our algorithm against other commonly used persistence software (as in [16]). Lastly, we show the effectiveness of our approach in classifying protein structures using machine
learning models.
Figure 4.4: (a) Top: alpha helix from PDB 1C26; Middle: barcode of [114]; Bottom: our barcode. (b) Top: beta sheet from PDB 2JOX; Middle: barcode of [114]; Bottom: our barcode. Each segment of the barcodes shows β0 (top) and β1 (bottom).
4.3.1 Topological description of alpha helix and beta sheet
It is known that barcodes can explain the structure of an alpha helix and a beta sheet [114].
The authors in [114] use a coarse-grained (CG) representation of the protein by replacing each
amino acid molecule with a single atom. This representation removes the short barcodes
corresponding to the edges and cycles of the chemical bonds inside the amino acid molecule.
We do not need this CG representation as our procedure can implicitly determine a threshold
lm and therefore delete all barcodes of length shorter than the largest gap between two
consecutive barcodes (as described in section 4.2.1). So, we get a barcode that describes the
essential features of the secondary structures without including noise or short lived cycles
from the amino acids. For a fair comparison, we compute our barcodes on the same alpha
helix residue as in [114] with 19 residues extracted from the protein strand having PDB ID
1C26 (see figure 4.4). Analogous to the barcode of [114] (as shown in the middle diagram of
figure 4.4a), we have 19 bars in the zero-dimensional homology for the alpha helix representing
the nineteen initial residues. These components die as edges introduced in the weighted alpha complex connect them. For one-dimensional homology, an initial ring with 4 residues is formed, followed by additional rings resulting from the growing connections
in each amino acid. These cycles eventually die by the collapse operations in our algorithm.
The same process is followed for beta sheets after we extract two parallel beta sheet
strands from the protein structure with PDB ID 2JOX. The zero-dimensional homology
cycles are killed when individual disconnected amino acid residues belonging to the same
beta sheet strand are connected by edges, as represented in the top 17 barcodes (leftmost
figure of 4.4b). However, other than these barcodes and the longest bar corresponding to the essential cycle, there is one bar in the zero-dimensional homology which is longer than the
top 17 bars. This bar represents the component which is killed by joining the closest adjacent
amino acid molecules from the two parallel beta strands. The one dimensional homology
bars are formed as more adjacent amino acid molecules are connected and killed once the
collapse operation starts. Note that the two barcodes shown in figure 4.4 comparing our work with [114] are not to scale. This is because, in contrast to [114], the barcodes in our
figure are not plotted against Euclidean distance, but rather against the step at which each insertion and collapse operation occurs.
Figure 4.5: Barcode and ribbon diagram of (left) PDB 1C26 and (right) PDB 1O9A. Diagram courtesy NCBI [82]
A caveat
Our aim is to compute signatures that capture discriminating structural information
useful for classifying proteins. Even though we can use our signature to describe secondary
structures, we do not want our signature to be directly correlated to the number of alpha helix
or beta sheets, as that would make it redundant. We generate a 2 × 12 matrix where each cell contains the correlation value between the beta-sheet (top row) and alpha-helix (bottom row) counts and each individual component of the feature vector {l⁰₁, l⁰₂, ..., l⁰ₘ, l¹₁, l¹₂, ..., l¹ₘ, cβ₀, cβ₁}. We use
it by a heatmap (Fig 4.6). Essentially, we first generate two vectors vα and vβ of the number
of alpha helices and beta sheets respectively in each protein over all entries in the database.
Similarly, we produce a vector for each component of the feature vector: v(l⁰₁), ..., v(cβ₁). Now we populate the matrix by calculating the correlation between each of these individual vectors and vα and vβ. For example, row 1, column 1 of the matrix contains the correlation value between the vectors v(l⁰₁) and vβ. The heatmap color ranges from blue for zero correlation to dark red for complete correlation. As we can see from the figure, almost all matrix entries have a blue tinge, indicating low correlation. This shows that our feature vector is non-redundant with respect to the frequency of secondary structures.
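The matrix behind the heatmap can be populated with plain Pearson correlation, as in the sketch below; all data values are made up for illustration:

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

v_alpha = [4, 7, 2, 9, 5]   # alpha-helix counts per protein (made up)
v_beta = [3, 1, 6, 2, 4]    # beta-sheet counts per protein (made up)
features = [[0.2, 0.9, 0.1, 1.1, 0.5],   # one row per feature-vector component
            [5.0, 4.2, 6.1, 3.9, 5.5]]

# Row 0 correlates against beta-sheet counts, row 1 against alpha-helix counts.
matrix = [[pearson(v, f) for f in features] for v in (v_beta, v_alpha)]
print([[round(c, 2) for c in row] for row in matrix])
```

Values near zero across the matrix are what the blue tinge in the heatmap conveys.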
4.3.2 Topological Description of macromolecular structures
In the previous section, we use our signature to describe the secondary structures and compare it with the work in [114]. In this section, we further show how our signature works by describing two macromolecular protein structures that are built on multiple secondary structures. We start by describing the tetrametric protein: 1C26. The ribbon diagram and associated barcode after noise removal is given in figure 4.5. It essentially contains four monomers, associated pairwise to form two dimers. These two dimers, in turn, join across a distinct parallel helix-helix interface to form the tetramer. When we build the filtration on this protein structure, two monomers on opposite sides are killed first by connecting to their adjacent monomers to form two distinct dimers. This is evident as there are two short bars in the zero dimensional barcode (Fig. 4.5 left: shown in red). We now have two dimers, one of which is killed when it joins with the other to form a third slightly longer non-essential barcode (shown in purple). The second dimer lives on as the tetramer and forms an essential barcode (shown in black). Next, if we look into the one dimensional homology (shown as blue lines), we notice that the most notable feature for the protein is the tetramer structure which contains a large loop when the two dimers are connected. This is evident in our 1D-barcode as there is a distinct long bar representing the large one dimensional cycle. Note that the birth time of this cycle in 1D corresponds with the death time of the non-essential dimer in
0D.
Next, we consider the protein structure 1O9A. The structure contains several antiparallel beta-strands and is an example of a tandem beta-zipper. As we can see from the ribbon diagram in Fig. 4.5 (right), there are six beta sheets on one side and five on the other, connected together to form a fibronectin. This is evident as there are ten non-essential and one essential
Figure 4.6: Heatmap correlating secondary structure against our feature vector. Each column in the heatmap is the feature vector.

Figure 4.7: Plot showing accuracy against varying training data size. 100(%) indicates the entire training and test data.
Data     Dim   Size: VR / SimBa / Our           Time (sec): VR / SimBa / Our
CATH     3     – / 1422 / 443                   – / 1.75 / 0.35
Soneko   3     324802 / 10188 / 576             32 / 6.77 / 2.05
Surv-l   150   – / 3.1 × 10⁶ / 1.09 × 10⁶       – / 5.08 × 10³ / 884
PrCl-I   25    – / 10.2 × 10⁶ / 0.22 × 10⁶      – / 585 / 141.3

Table 4.1: Size and time comparison against VR and SimBa.

                SVM: FB / Cang / Our        KNN: FB / Cang / Our
Class           91.08 / 89.07 / 92.36       86.01 / 86.40 / 86.39
Architecture    90.26 / 91.11 / 92.20       88.17 / 87.47 / 89.11
Topology        92.19 / 94.87 / 96.71       91.54 / 94.02 / 96.20
Homology        93.33 / 94.06 / 94.17       90.28 / 91.11 / 93.30

Table 4.2: Accuracy comparison with FragBag (FB) and Cang.
bars in the zero-dimensional homology, owing to the six beta sheets on one side and five on the other. Each component is killed as the beta sheets join with one another as the filtration proceeds. Note that the last connected component, after all beta sheets have joined, forms an essential bar. Moreover, since there is no distinct cycle in the structure, we do not get any distinct long bar in the one-dimensional homology. The presence of multiple one-dimensional bars of similar size is probably due to the antiparallel beta-strands on either side, which form a ring once joined. Thus, we can see that using the same signature generation method, we can describe secondary structures (as in the previous section) as well as macromolecular proteins without any change in parameters. It is therefore evident that our signature is intrinsic and scale independent.
Time and space comparison
The method in [16] uses persistent homology as feature vectors for machine learning. However, as mentioned earlier, the use of the Vietoris-Rips (VR) complex leads to a size blow-up that not only increases runtime but also, in most cases, causes the operating system to abort the process due to space requirements. The experiments in [16] procure good results because the datasets are of moderate size, but the same could not be reported for larger, real-life protein structures. In Table 4.1, we show a size and time comparison of our approach with the original feature generation technique used in [16]. We also tabulate the size and time needed to generate the same feature vector as [16] using a state-of-the-art persistence algorithm called SimBa [38]. The datasets in Table 4.1 are a mix of protein databases and other higher-dimensional datasets. As we see in the table, our algorithm is faster even when the features in [16] are generated with SimBa.
4.3.3 Supervised learning based classification models
Classification model.
For the purpose of protein classification, we train two classifiers, an SVM model and a k-NN model, on some protein databases. Once a model is trained, we test it to find accuracy, precision, and recall. The reason behind choosing Support Vector Machine and k-NN based supervised learning over other sophisticated, state-of-the-art classifiers is their basic nature: results obtained from basic learning techniques demonstrate the effectiveness of the feature vectors rather than that of the classifier. The classification accuracy for proteins could be further improved by feeding our proposed features to more advanced supervised learning or neural network based classifiers.
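To illustrate how such a basic classifier consumes signature vectors, the following is a minimal k-NN sketch in plain Python/NumPy. This is our own illustrative code, not the implementation used in the experiments; the array names and the choice of Euclidean distance are assumptions.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=5):
    """Classify each test signature by majority vote among its
    k nearest training signatures (Euclidean distance)."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)      # distances to all training points
        nearest = train_y[np.argsort(d)[:k]]          # labels of the k closest
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])         # majority label
    return np.array(preds)
```

In the actual experiments the feature vectors would be the persistence-based signatures and the labels the (sub-)domain memberships.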
Benchmark techniques.
In order to test the effectiveness of our protein signature, we need to compare it against
some of the state-of-the-art protein structure classification techniques. We generate feature vectors through these techniques and train and test the same classification models as before.
The first technique, known as PRIDE [51], classifies proteins at the Homology level in the
CATH database. It represents protein structures by distance distributions and compares two
sets of distributions via contingency table analysis. The more recent work by Budowski-Tal
et al. [13], which has achieved significant improvement in protein classification, is used as our second benchmark technique. Their work, known as FragBag, mimics the bag-of-words representation used in natural language processing and computer vision. They maintain a benchmark library of protein fragments. Given a large protein strand, they generate a vector that essentially counts the number of times each fragment approximates a segment of this strand. This vector acts as a signature for the protein structure and forms the basis of their feature vector, which we use to train and test our classifier. The protein
fragment benchmark is available from the library [62]. We choose 250 protein fragments
of length 7. The third work that we test against is the topological approach to generate a
protein fingerprint [16]. However, as we saw earlier, it is not possible to generate all the
protein signatures using the original algorithm used by the authors. Therefore, we replace
the Vietoris-Rips filtration by the state-of-the-art SimBa and generate feature vectors the
same way as mentioned in their paper.
Figure 4.8: Left: (a) Difference in precision and recall from FragBag. Middle: (b) Difference in precision and recall from [16]. Right: (c) ROC curve for SVM classification of our algorithm.
Database
The database that we use is called the Protein Classification Benchmark Collection (PCBC) [105]. It has 20 classification tasks, each derived from either the SCOP or CATH databases and each containing about 12,000 protein structures. In each classification experiment, a domain group serves as the positive set, with one of its sub-domain groups as the positive test set. The negative set comprises the rest of the database outside the superfamily, divided into a negative test and a negative training set. The results for some of the classification tasks on this database are given in Table 4.3. As evident from the table, the accuracy obtained by using our signature shows a considerable improvement over the state-of-the-art techniques. The only classification task in which our algorithm under-performs is the protein domain CATH95 Class 5fold Filtered (fourth row of Table 4.3). In this task the class domain is randomly sub-divided into 5 subclasses. Since the division is random, we believe some proteins belonging to different sub-classes generate a similar initial complex, resulting in a similar filtration and ultimately a decrease in performance.
The PCBC dataset, even though suitable for learning algorithms, suffers from being
skewed as the number of negative instances in any classification is much larger than the
number of positive instances, leading to probable incorrect classifications. Therefore, we test
                       SVM                                   k-NN
  Task          Pride   Fragbag  Cang    Our         Pride   Fragbag  Cang    Our
  SC Sf Fm F    90.09   93.01    93.39   95.24       89.58   87.31    89.83   91.66
  CA T 5f       94.23   92.97    94.87   99.53       90.96   91.16    94.57   97.87
  CA T H F      90.15   89.89    95.06   98.80       84.98   81.11    86.65   95.51
  CA C 5f F     85.09   84.76    80.98   82.36       80.18   84.74    83.83   78.81
  CA H Si F     98.60   95.89    98.24   99.05       95.45   91.11    79.469  97.56
  CA A T F      87.56   91.58    74.58   90.95       67.47   89.00    68.90   87.00
Table 4.3: Classification accuracy for different techniques on Protein dataset. SC: SCOP95, CA: CATH95, Sf: Superfamily, Fm: Family, F: Filtered, T: Topology, H: Homology, C: Class, 5f: 5fold, A: Architecture, Si: Similarity
on one of the most popular protein databases known as CATH [27]. The CATH database
contains proteins divided into different domains (C: class; A: architecture; T: topology; H:
homologous superfamily). For each domain, we get protein structures and their labels in
accordance with the sub-domain they belong to. For any classification task, we randomly
choose positive instances from one sub-domain and the same number of negative instances
sampled equally from the other sub-domains. Each such task, on average has 400 protein
structures containing approximately 30,000 atoms each. We then divide this into 80%-training
and 20%-test set. The result of classification on the CATH database averaged over several
such randomly chosen sub-domains as positive classes, are illustrated in table 4.2. We see yet again that for each case, there is an improvement of about 3-4% over the benchmark
techniques.
Classification result
We list our main results in tables 4.2 and 4.3 showing the improvement in accuracy
using our method over the state-of-the-art techniques of FragBag, PRIDE and the preceding work on topology by Cang et al. [16]. We provide further evidence of the efficiency of our
algorithm by comparing the precision and recall in Figures 4.8a and 4.8b. In these plots, we show the difference between the precision and recall obtained using our algorithm against that of FragBag (4.8a) and Cang (4.8b). A green bar indicates that our algorithm performed better (the difference is positive), while a red bar suggests the opposite. This experiment is done on the CATH database, and the figure shows the precision and recall for each domain: class (C), architecture (A), topology (T), and homology (H). Notice that, since the classification is binary, we get two precision and two recall values for every class in each domain. Thus, there are four bars for each of C, A, T, H. Yet again, other than a few marginal cases, our algorithm largely performs better. Finally, we calculate the ROC curve using SVM on a subset of the CATH dataset, the result of which is shown in Figure 4.8c. The ROC curve plots the true positive rate against the false positive rate obtained by changing the input size and parameters. The further the curve lies from the diagonal, the better the classifier.
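For reference, an ROC curve of this kind can be traced by sweeping a decision threshold over the classifier's scores. The following is a small illustrative sketch; the function and variable names are our own, not taken from the classification pipeline used here.

```python
def roc_points(scores, labels):
    """Trace an ROC curve: for each threshold taken from the sorted
    scores, report (false positive rate, true positive rate).
    Labels are 0/1; higher scores mean 'more positive'."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```

A curve hugging the top-left corner (far from the diagonal) corresponds to a classifier that separates the classes well.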
For the positive test cases, we investigate further the trend of the output. We examine the correlation of accuracy with the training set size: we vary the training and test set sizes by taking a fraction of the entire dataset and trace the accuracy in each case. This is done over all the test cases shown in Table 4.3, and the average is shown in Fig. 4.7. We plot the output of our algorithm in blue, and two instances of FragBag with (fragment, library) sizes (5, 225) and (7, 250) in red and green, respectively. In addition, we plot the output of PRIDE. Ideally, the accuracy should decrease uniformly with a decrease in training set size, giving a straight line along the diagonal. Here, all the trendlines lie close to the diagonal, so accuracy and training size are indeed correlated. Moreover, we observe that even as the training data size decreases, the accuracy of our algorithm remains better than or comparable to the other algorithms. This indicates that topological features also work well with a lower number of samples.
The last two chapters were dedicated to the applications of persistent barcodes in two very disparate fields: image and protein classification. In the forthcoming chapters, we mildly switch gears to see if we can extract more geometric information out of the intervals. In effect, we dive into the computation of representative persistent cycles.
Chapter 5: Representative Persistent Cycles
The previous chapters described the use of topological data analysis, particularly barcodes from persistent homology, to express and classify data from a multitude of domains. Besides considering the multiset of intervals included in a persistence diagram, some applications need to find representative cycles for the intervals.
5.1 Persistent 1-cycles
In this work, we study the problem of computing representative cycles for the persistent first homology group (H1-persistent homology) with Z2 coefficients. We term these cycles persistent 1-cycles. Since it has been shown that the computation of the optimal cycles is NP-hard [32], we propose an alternative set of meaningful persistent 1-cycles with an efficient polynomial time algorithm. Although the original persistence algorithm of Edelsbrunner et al. [45] implicitly computes persistent cycles, it does not necessarily provide minimal ones. In fact, similar definitions for finite intervals have already been proposed [85, 113]; however, to our knowledge, an explicit explanation of how the representative cycles are related to persistent homology has not been addressed. Since neither class of cycles is stable, our polynomial time algorithm is not any worse than an optimal cycle generating algorithm, while being much more efficient in terms of time complexity.
We use a software based on our algorithm to generate tight persistent 1-cycles on 3D point clouds and 2D images as shown in Figure 5.1. We experiment with various datasets
Figure 5.1: (a) Point cloud of Botijo model. (b,c) Barcode and persistent 1-cycles for Botijo, where the 3 longest bars (dark blue, light blue, and green) have their corresponding persistent 1-cycles drawn with the same colors. (d,e) Barcode and persistent 1-cycles for the retinal image, with each green cycle corresponding to a red bar.
commonly used in geometric modeling, computer vision and material science, details of which
are given in Section 5.5. The software, named PersLoop, along with an introductory video
and other supplementary materials are available at the project website http://web.cse.ohio-state.edu/~dey.8/PersLoop.
5.2 Definitions: Persistent Basis and Cycles
Definition 6 (Persistent Basis). An indexed set of q-cycles {c_j | j ∈ J} is called a persistent q-basis for a filtration F if P_q^F = ⊕_{j∈J} I^{[b_j,d_j)} and, for each j ∈ J and b_j ≤ k < d_j, I^{[b_j,d_j)}(k) = {0, [c_j]}.
Definition 7 (Persistent Cycle). For an interval [b, d) ∈ D(P_q^F), a q-cycle c is called a persistent q-cycle for the interval if one of the following holds:
• d ≠ +∞, c is a cycle in K_b containing σ_b, and c is not a boundary in K_{d−1} but becomes a boundary in K_d;
• d = +∞ and c is a cycle in K_b containing σ_b.
The following theorem characterizes each cycle in a persistent basis:
Theorem 1. An indexed set of q-cycles {c_j | j ∈ J} is a persistent q-basis for a filtration F if and only if P_q^F = ⊕_{j∈J} I^{[b_j,d_j)} and c_j is a persistent q-cycle for every interval [b_j, d_j) ∈ D(P_q^F).
The proof is given in [32]. With Definition 7 and Theorem 1, it follows that for a persistent q-cycle c of an interval [b, d) ∈ D_q(F), we can always form an interval module decomposition of P_q^F in which c is a representative cycle for the interval module of [b, d).
5.3 Minimal Persistent q-Basis and Their Limitations
The optimal versions of persistent bases are of particular interest because they capture more geometric information of the space. The cycles for an optimal (minimal) persistent basis have already been defined and studied in [50, 85]. In particular, the author of [85] proposed an
integer program to compute these cycles. Although these integer programs can be solved
exactly by linear programs for certain cases [31], the integer program is NP-hard in general.
This of course does not settle the question of whether the problem of computing minimal
persistent 1-cycles is NP-hard or not. We prove that it is indeed NP-hard and thus has no
hope of admitting a polynomial time algorithm unless P = NP.
Consider a simplicial complex K with each edge being assigned a non-negative weight.
We refer to such K as a weighted complex. For a 1-cycle c in K, define its weight to be the
sum of all weights of its edges.
Definition 8 (Minimal Persistent 1-Cycle and 1-Basis). Given a filtration F on a weighted
complex K, a minimal persistent 1-cycle for an interval of D1(F) is defined to be a persistent
1-cycle for the interval with the minimal weight. An indexed set of 1-cycles {c_j | j ∈ J} is a
minimal persistent 1-basis for F if for every [bj, dj) ∈ D1(F), cj is a minimal persistent
1-cycle for [bj, dj).
The following special version of the problem of finding a minimal persistent 1-cycle is NP-hard (proof in [32]).
Problem 1 (LST-PERS-CYC). Given a filtration F : ∅ = K0 ⊆ K1 ⊆ ... ⊆ Km = K, and
a finite interval [b, d) ∈ D1(F), find a 1-cycle with the least number of edges which is born in
Kb and becomes a boundary in Kd.
Given the problem, we have the following theorem:
Theorem 2. The problem LST-PERS-CYC is NP-hard. (See [32] for details).
It is also shown in the paper that neither the minimal persistent 1-cycles nor the persistent 1-cycles computed by our algorithm are stable under perturbation. So, in this regard, our polynomial time algorithm is not any worse than an optimal cycle generating algorithm, while being much more efficient in terms of time complexity.
5.4 Computing Meaningful Persistent 1-Cycles in Polynomial Time
Because the minimal persistent 1-cycles are not stable and their computation is NP-hard, we propose an alternative set of meaningful persistent 1-cycles which can be computed
efficiently in polynomial time. We first present a general framework in Algorithm 1. Although
the computed persistent 1-cycles have no guaranteed properties, the framework lays the
foundation for the algorithm computing meaningful persistent 1-cycles that we propose later.
Refer to [32] for proof of correctness.
Based on Algorithm 1, we present another algorithm which produces meaningful persistent 1-cycles.
Algorithm 1
Input: A simplicial complex K, a filtration F: ∅ = K_0 ⊆ K_1 ⊆ ... ⊆ K_m = K, and D_1(F).
Output: A persistent 1-basis for F.
The algorithm maintains a basis B_i for H_1(K_i) for every i ∈ [0, m].
1: Initially, let B_0 = ∅
2: for i = 1, ..., m do
3:   if σ_i is positive then
4:     Find a 1-cycle c_i containing σ_i in K_i
5:     Let B_i = B_{i−1} ∪ {c_i}
6:   end if
7:   if σ_i is negative then
8:     Find a set {c_g | g ∈ G} ⊆ B_{i−1} so that Σ_{g∈G} [c_g] = 0. This can be done in O(β_i = |B_i|) time by the annotation algorithm in [33]. Maintaining the annotations will take O(n^ω) time altogether, where K has n simplices and ω is the matrix multiplication exponent.
9:     Let g* be the greatest index in G; then [g*, i) is an interval of D_1(F)
10:    Assign Σ_{g∈G} c_g to this interval as a persistent 1-cycle and let B_i = B_{i−1} \ {c_{g*}}
11:  end if
12:  Otherwise, let B_i = B_{i−1}
13: end for
14: For each cycle c_j ∈ B_m, assign c_j as a persistent 1-cycle to the interval [j, +∞)
Algorithm 2. In Algorithm 1, when σ_i is positive, let c_i be the shortest cycle containing σ_i in K_i. The cycle c_i can be constructed by adding σ_i to the shortest path between the vertices of σ_i in K_{i−1}, which can be computed by Dijkstra's algorithm applied to the 1-skeleton of K_{i−1}.
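The shortest-cycle step above can be sketched as follows in plain Python. This is an illustrative sketch under our own assumptions (adjacency-list representation, function names), not the dissertation's implementation; it closes the shortest path between the endpoints of the edge with the edge itself.

```python
import heapq

def shortest_path(adj, src, dst):
    # Dijkstra on the 1-skeleton; adj maps a vertex to a list of
    # (neighbor, edge weight) pairs.
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path

def shortest_cycle_through_edge(adj, u, v):
    # Shortest cycle containing the edge (u, v): the shortest u-v path
    # avoiding the edge itself, closed up by the edge.
    adj_minus = {x: [(y, w) for (y, w) in nbrs if {x, y} != {u, v}]
                 for x, nbrs in adj.items()}
    path = shortest_path(adj_minus, u, v)
    return list(zip(path, path[1:])) + [(v, u)]
```

In Algorithm 2 the path is taken in K_{i−1}, which does not yet contain σ_i, so removing the edge mimics that setting.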
Note that in Algorithm 2, a persistent 1-cycle for a finite interval is a sum of shortest cycles born at different indices. Since a shortest cycle is usually a good representative of its class, the sum of shortest cycles ought to be a good choice of representative for an interval. In some cases, when σ_i is negative, the sum Σ_{g∈G} c_g contains only one component. The persistent 1-cycles computed by Algorithm 2 for such intervals are guaranteed to be optimal, as shown below.

Proposition 1. In Algorithm 2, when σ_i is negative, if |G| = 1, then Σ_{g∈G} c_g is a minimal persistent 1-cycle for the interval ending with i.
In Section 5.5, where we present the experimental results, we can see that scenarios depicted by Proposition 1 occur quite frequently. Especially for the larvae and nerve datasets, nearly all computed persistent 1-cycles contain only one component and hence are minimal. A practical problem with Algorithm 2 is that unnecessary computational resources are spent computing tiny intervals regarded as noise, especially when the user cares about significantly large intervals only. We present a more efficient algorithm for such cases.
Proposition 2. In Algorithms 1 and 2, when σ_i is negative, for any g ∈ G, one has b_g ≤ g* and d_g ≥ i.

Proof. Note that σ_{b_g} must be unpaired before σ_i is added; this implies that d_g ≥ i. Since g* is the greatest index in G, b_g = g ≤ g*.
Proposition 2 leads to Algorithm 3, in which we only compute the shortest cycles at the birth indices whose corresponding intervals contain the input interval [b, d). In the worst case, Algorithms 2 and 3 run in O(n^ω + n^2 log n) = O(n^ω) time. However, since a user often provides a long interval, the intervals containing it constitute a small subset of all the intervals. This makes Algorithm 3 run much faster than Algorithm 2 in practice.
Algorithm 3
Input: The input of Algorithm 2 plus an interval [b, d).
Output: A persistent 1-cycle for [b, d) output by Algorithm 2.
1: G′ ← ∅
2: for i ← 1, ..., b do
3:   if σ_i is positive and (σ_i is paired with a σ_j s.t. j ≥ d, or σ_i never gets paired) then
4:     c_i ← the shortest cycle containing σ_i in K_i
5:     G′ ← G′ ∪ {i}
6:   end if
7: end for
8: Find a G ⊆ G′ s.t. Σ_{g∈G} [c_g] = 0 in K_d
9: Output Σ_{g∈G} c_g as the persistent 1-cycle for [b, d)
Figure 5.2: PersLoop user interface demonstrating persistent 1-cycles computed for a 3D point cloud (a) and a 2D image (b), where green cycles correspond to the chosen bars.
Proposition 3 reveals some characteristics of the persistent 1-cycles computed by Algorithms 2 and 3:

Proposition 3 (Minimality Property). The persistent 1-cycle Σ_{g∈G} c_g computed by Algorithms 2 and 3 has the following property: there is no non-empty proper subset G′ of G such that Σ_{g∈G′} [c_g] = 0 in H_1(K_d), where d is the death index of the interval to which Σ_{g∈G} c_g is associated.
5.5 Results and Experiments
Our software PersLoop implements Algorithm3. Given a raw input, which is a 2D image
or a 3D point cloud, and a filtration built from the raw input, the software first generates
and plots the barcode of the filtration. The user can then click an individual bar to obtain
the persistent 1-cycle for the bar (Figure 5.2). The experiments on 3D point clouds and 2D
images using the software show how our algorithm can find meaningful persistent 1-cycles in
several geometric and vision related applications.
Figure 5.3: Persistent 1-cycles (green) corresponding to long intervals computed for three different point clouds
5.5.1 Persistent 1-Cycles for 3D Point Clouds
We take a 3D point cloud as input and build a Rips filtration using the Gudhi library [108].
As shown in Figure 5.3, persistent 1-cycles computed for the three point clouds sampled
from various models are tight and capture essential geometrical features of the corresponding
persistent homology. Note that our implementation of Algorithm3 runs very fast in practice.
For example, it took 0.3 secs to generate 50 cycles on a regular commodity laptop for the
Botijo (Figure 5.1a) point cloud of size 10,000.
5.5.2 Image Segmentation and Characterization Using Cubical Complex
In this section, we show the application of our algorithm on image segmentation and
characterization problems. We interpret an image as a piecewise linear function on a 2-
dimensional cubical complex. The cubical complex for an image has a vertex for each pixel,
an edge connecting each pair of horizontally or vertically adjacent vertices, and squares to
fill all the holes such that the complex becomes a disc. The function values on the vertices
are the density or color values of the corresponding pixels. The lower star filtration [44] of
Figure 5.4: Persistent 1-cycles computed for image segmentation. Green cycles indicate persistent 1-cycles consisting of only one component (|G| = 1) and red cycles indicate those consisting of multiple components (|G| > 1). (a,b) Persistent 1-cycles for the top 20 and 350 longest intervals on the nerve dataset. (c) Persistent 1-cycles for the top 200 longest intervals on the Drosophila larvae dataset.
the PL function is then built and fed into our software. We use the coning based annotation
strategy [33] to compute the persistence diagrams. In our implementation, a cubical tree, which is similar to the simplex tree [9], is built to store the elementary cubes. Each elementary cube points to a row in the annotation matrix via the union-find data structure. The simplicial counterpart of this association technique is described in [8].
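As a concrete illustration of the construction above, the following sketch assigns lower-star filtration values to the cells of the cubical complex of a small grayscale image: each cell enters the filtration at the maximum pixel value among its vertices. The function name and cell encoding are our own; a real pipeline would pass the sorted cells to a persistence solver.

```python
import numpy as np

def lower_star_cubical_filtration(img):
    """Assign every cell of the image's 2D cubical complex the maximum
    pixel value over its vertices (lower-star filtration of the PL
    function). Returns cells sorted by filtration value, with
    lower-dimensional cells first on ties."""
    h, w = img.shape
    cells = []  # (filtration value, dimension, cell id)
    for i in range(h):
        for j in range(w):
            cells.append((img[i, j], 0, ('v', i, j)))          # vertex
            if j + 1 < w:                                      # horizontal edge
                cells.append((max(img[i, j], img[i, j + 1]), 1, ('h', i, j)))
            if i + 1 < h:                                      # vertical edge
                cells.append((max(img[i, j], img[i + 1, j]), 1, ('u', i, j)))
            if i + 1 < h and j + 1 < w:                        # square
                val = max(img[i, j], img[i, j + 1],
                          img[i + 1, j], img[i + 1, j + 1])
                cells.append((val, 2, ('s', i, j)))
    cells.sort(key=lambda c: (c[0], c[1]))
    return cells
```

Ordering faces before cofaces at equal function values keeps the sequence a valid filtration.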
Our first experiment is the segmentation of a serial section Transmission Electron Mi-
croscopy (ssTEM) data set of the Drosophila first instar larva ventral nerve cord (VNC) [19].
The segmentation result is shown in Figures 5.4a and 5.4b, from which we can see that the
cycles are in exact correspondence to the membranes hence segment the nerve regions quite
appropriately. The difference between Figure 5.4a and 5.4b shows that longer intervals tend
to have longer cycles. Another similar application is the segmentation of the low magnification
micrographs of a Drosophila embryo [60]. As seen in Figure 5.4c, the cycles corresponding to
the top 200 longest intervals indicate that the larvae image is properly segmented.
Figure 5.5: (a) Hexagonal cyclic structure of silicate glass. (b) Persistent 1-cycles computed for the green bars, with red points denoting silicon atoms and grey points denoting oxygen atoms. (c) Persistent 1-cycles computed for the red bars. (d) Barcode for the filtration on silicate glass.
We experiment on another dataset from the STARE project [57] to show how persistent
1-cycles computed by our algorithm can be utilized for characterization of images. The dataset
contains ophthalmologist annotated retinal images which are either healthy or suffering from
diseases. Our aim is to automatically identify retinal and sub-retinal hemorrhages, which are
black patches of blood accumulated inside eyes. Figures 5.1e and 5.2b show that a very tight
cycle is derived around each dark hemorrhage blob even when the input is noisy.
5.5.3 Hexagonal Structure of Crystalline Solids
In this experiment, we use our persistent 1-cycles to describe the crystalline structure of
silicate glass (SiO2), commonly known as quartz. Silicate glass has a non-compact structure, with three silicon and three oxygen atoms arranged alternately in a hexagon as shown in Figure 5.5a. We build an 8×8×8 weighted point cloud with the silicon and oxygen atoms arranged according to the space group of the crystal structure as illustrated in Figure 5.5b. The weights of the
60 points correspond to the atomic weights of the atoms. On this weighted point cloud, we
generate a filtration of weighted alpha complexes [43] by increasing α from 0 to ∞.
Persistent 1-cycles computed by our algorithm for this dataset reveal both the local and
global structures of silicate glass. Figure 5.5d lists the barcode of the filtration we build
and Figure 5.5b shows the persistent 1-cycles corresponding to the medium-sized green bars in Figure 5.5d. On close observation, we can see that the cycles in Figure 5.5b are in exact accordance with the hexagonal cyclic structure of quartz shown in Figure 5.5a. The larger
persistent 1-cycles in Figure 5.5c, which span the larger lattice structure formed by our weighted point cloud, correspond to the longer red bars in Figure 5.5d. These cycles arise
from the long-range order 1 of the crystalline solid. This is evident from our experiment
because if we increase the size of the input point cloud, these cycles grow larger to span the
entire lattice.
Convinced that representative topological 1-cycles have several real-world applications, it is natural to ponder whether we can extend the computation of these persistent cycles to higher dimensions. The next section investigates such possibilities and presents an interesting case of general dimensional persistent cycles.
1Long-range order is the translational periodicity where the self-repeating structure extends infinitely in all directions
5.6 Finite Persistent n-cycle
This section is a continuation of the previous one, where we ask the question: are there other interesting cases beyond dimension one for which minimal persistent cycles can be computed in polynomial time? We compute minimal persistent cycles with Z2 coefficients in general dimensions. In a recent work [32], a special but important class of simplicial complexes, termed weak (d + 1)-pseudomanifolds, has been identified for which minimal persistent d-cycles can be computed in polynomial time. We apply this algorithm to several application domains, which is our contribution detailed in this section. Although the details of the algorithm and its proof of correctness will appear in another thesis work, we describe some of the essential concepts needed to state the algorithm and its applications.
A weak (d + 1)-pseudomanifold 2 is a generalization of a (d + 1)-manifold:
Definition 9. A simplicial complex K is a weak (d + 1)-pseudomanifold if each d-simplex is
a face of no more than two (d + 1)-simplices in K.
Specifically, if the given complex is a weak (d + 1)-pseudomanifold, the problem of computing minimal persistent d-cycles for finite intervals can be cast into a minimal cut problem.
The persistent cycle problem. The minimal persistent cycle problem is as follows:

PCYC-FIN_d: Given a finite d-weighted simplicial complex K, a filtration F: ∅ = K_0 ⊆ K_1 ⊆ ... ⊆ K_n = K, and a finite interval [β, δ) ∈ D_d(F), this problem asks for a d-cycle with the minimal weight which is born in K_β and becomes a boundary in K_δ.
In order to solve this problem for weak (d + 1)-pseudomanifolds, we also need the following definition:
2The naming of weak pseudomanifold is adapted from the commonly accepted name pseudomanifold.
Figure 5.6: A weak 2-pseudomanifold K̃ embedded in R² with three voids. Its dual graph is drawn in blue. The complex has one 1-connected component and four 2-connected components, with the 2-simplices in different 2-connected components colored differently.
Undirected flow network. An undirected flow network (G, s_1, s_2) consists of an undirected graph G with vertex set V(G) and edge set E(G), a capacity function c: E(G) → [0, +∞], and two non-empty disjoint subsets s_1 and s_2 of V(G). Vertices in s_1 are referred to as sources and vertices in s_2 as sinks. A cut (S, T) of (G, s_1, s_2) consists of two disjoint subsets S and T of V(G) such that S ∪ T = V(G), s_1 ⊆ S, and s_2 ⊆ T. The edges connecting a vertex in S to a vertex in T are referred to as the edges across the cut (S, T), denoted ξ(S, T). The capacity of a cut (S, T) is defined as c(S, T) = Σ_{e∈ξ(S,T)} c(e). A minimal cut of (G, s_1, s_2) is a cut with the minimal capacity. Note that we allow parallel edges in G (see Figure 5.6) to ease the presentation. These parallel edges can be merged into one edge during computation.
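The merging of parallel edges mentioned above amounts to summing their capacities, since any cut separating their common endpoints must pay for all of them. A small sketch, with our own function and variable names:

```python
from collections import defaultdict

def merge_parallel_edges(edges):
    """edges: list of (u, v, capacity) for an undirected graph.
    Parallel edges between the same endpoints are merged by summing
    their capacities, which preserves the capacity of every cut."""
    merged = defaultdict(float)
    for u, v, cap in edges:
        merged[frozenset((u, v))] += cap   # (u, v) and (v, u) coincide
    return {tuple(sorted(k)): c for k, c in merged.items()}
```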
5.7 Minimal persistent d-cycles of finite intervals for weak (d + 1)-pseudomanifolds
In this section, we describe the algorithm which computes minimal persistent d-cycles for finite intervals, given a filtration of a weak (d + 1)-pseudomanifold with d ≥ 1. Although the detailed process can be found in [32], the general idea is as follows. Suppose the input weak (d + 1)-pseudomanifold is K, associated with a filtration F: K_0 ⊆ K_1 ⊆ ... ⊆ K_n, and
Figure 5.7: An example of the constructions in our algorithm showing the duality between persistent cycles and cuts having finite capacity for d = 1. (a) The input weak 2-pseudomanifold K with its dual flow network drawn in blue, where the central hollow vertex denotes the dummy vertex, the red vertex denotes the source, and all the orange vertices (including the dummy one) denote the sinks. All "dangled" graph edges dual to the outer boundary 1-simplices actually connect to the dummy vertex; these connections are not drawn. (b) The partial complex K_β in the input filtration F, where the bold green 1-simplex denotes σ_β^F, which creates the green 1-cycle. (c) The partial complex K_δ in F, where the 2-simplex σ_δ^F creates the pink 2-chain killing the green 1-cycle. (d) The green persistent 1-cycle of the interval [β, δ) is dual to a cut (S, T) having finite capacity, where S contains all the vertices inside the pink 2-chain and T contains all the other vertices. The red graph edges denote the edges across (S, T), and their dual 1-chain is the green persistent 1-cycle.
the task is to compute a minimal persistent cycle of a finite interval [β, δ) ∈ D_d(F). We first construct an undirected dual graph G for K, where vertices of G are dual to (d + 1)-simplices of K and edges of G are dual to d-simplices of K. One dummy vertex, termed the infinite vertex, which does not correspond to any (d + 1)-simplex, is added to G to receive the graph edges dual to the boundary d-simplices. We then build an undirected flow network on top of G, where the source is the vertex dual to σ_δ^F and the sinks are the infinite vertex along with the vertices dual to those (d + 1)-simplices added to F after σ_δ^F. If a d-simplex is σ_β^F or is added to F before σ_β^F, we let the capacity of its dual graph edge be its weight; otherwise, we let the capacity of its dual graph edge be +∞. Finally, we calculate a minimal cut of this flow network and return the d-chain dual to the edges across the minimal cut as a minimal persistent cycle of the interval.
64 The intuition of the above algorithm is best explained by an example in Figure 5.7, where
d = 1. The key to the algorithm is the duality between persistent cycles of the input interval
and cuts of the dual flow network having finite capacity. To see this duality, first consider a
persistent d-cycle ζ of the input interval [β, δ). There exists a (d + 1)-chain A in Kδ created
by σδ whose boundary equals ζ, so that ζ is killed. We can let S be the set of graph vertices dual to the simplices in A and let T be the set of the remaining graph vertices; then (S, T) is a cut. Furthermore, (S, T) must have finite capacity, as the edges across it are exactly dual to the d-simplices in ζ, and the d-simplices in ζ have indices in F less than or equal to β. On the other hand, let (S, T) be a cut with finite capacity; then the (d + 1)-chain whose simplices are dual to the vertices in S is created by σδ. Taking the boundary of this (d + 1)-chain, we get a d-cycle ζ. Because the d-simplices of ζ are exactly dual to the edges across (S, T) and each edge across (S, T) has finite capacity, ζ must reside in Kβ. We only need to ensure that ζ contains σβ in order to show that ζ is a persistent cycle of [β, δ). This is done in proposition 3.2 of the paper [32], which shows that for any such cut (S, T), the d-chain ζ is a persistent cycle of [β, δ).
5.8 Experimental results
We experiment with our algorithms for PCYC-FIN2 on several volume datasets. Since volume data have a natural cubical complex structure, we adapt our implementation slightly
in order to work on cubical complexes. The cubical complex for volume data consists of cells
in dimensions from 0 to 3 with the underlying space homeomorphic to a 3-dimensional ball.
Note that a filtration built from a volume dataset does not produce any infinite intervals.
We use the Gudhi [108] library to build the filtrations and compute the persistence intervals.
From the experiments, we can see that the minimal persistent 2-cycles computed by our
algorithms capture various features of the data which originate from different fields. Note
Figure 5.8: (a,b) Cosmology dataset and the minimal persistent 2-cycles of the top five longest intervals. (c,d) Turbulent combustion dataset and its corresponding minimal persistent 2-cycles.
that the combustion, hurricane, and medical datasets are time-varying and we chose a single
time frame to compute the persistent intervals and cycles.
Cosmology. The simulation data shown in Figure 5.8a from computational cosmology [3]
consist of dark matter represented as particles. The thread-like structures in deep purple
shown in Figure 5.8a correspond to sites of large scale structure formation. Galaxy clusters/superclusters are contained in such large scale structures. Figure 5.8b shows the minimal
persistent 2-cycles of the top five longest intervals computed by our algorithms and these
cycles precisely represent the top five galaxy clusters/superclusters in volume.
Combustion. The data shown in Figure 5.8c correspond to the physical variable3 χ from
a model of a turbulent combustion process. The variable χ represents scalar dissipation
rate and provides a measure of the maximum possible chemical reaction rate. The minimal
persistent 2-cycles shown in Figure 5.8d represent areas with high value of χ.
3 A physical variable assigns a scalar value of a certain kind to each point.
Figure 5.9: (a,b) Minimal persistent 2-cycles for the hurricane model. (c) Minimal persistent 2-cycles of the larger intervals for the human skull. i: Right and left cheek muscles with the right one rotated for better visibility. ii: Right and left eyes. iii: Jawbone. iv: Nose cartilage. v: Nerves in the parietal lobe.
Hurricane. This dataset4 with 11 physical variables corresponds to the devastating hurricane Isabel. We down-sampled the data to a resolution of 250 × 250 × 50 and worked with two physical variables. The minimal persistent 2-cycle colored blue in Figure 5.9a is
computed on the cloud-volume variable and extracts the eye of the hurricane. The minimal
persistent 2-cycle colored green in Figure 5.9b is computed on the pressure variable and
captures the jagged shape of the pressure variation around the hurricane.
Medical imaging. This dataset from the ADNI [89] project contains the MRI scan of a
healthy human skull. The minimal persistent 2-cycles corresponding to the larger intervals as
shown in Figure 5.9c are computed from two time frames. They extract significant features
such as eyes, cartilages, nerves, and muscles.
4The Hurricane Isabel data is produced by the Weather Research and Forecast (WRF) model, courtesy of NCAR, and the U.S. National Science Foundation (NSF).
Figure 5.10: (a) Cubic lattice structure of BaTiO3 (courtesy Springer Materials [91]) with diffused structure in backdrop. (b) Minimal persistent 2-cycles computed on the original function. (c) Minimal persistent 2-cycles computed on the negated function. (d) Minimal persistent 2-cycles computed on the negated function of a tetragonal lattice structure of BaTiO3. The inlaid picture [91] illustrates the bonds of the structure.
Material science. We consider the atomic configuration of BaTiO3, which is a ferroelectric material used for making capacitors, transducers, and microphones. Figure 5.10a shows the atomic configuration of the molecule, where the red, grey, and green balls denote the Oxygen, Titanium, and Barium atoms respectively, and the radii of the balls equal the radii of
the corresponding atoms. Volume data are built by uniformly sampling a 3 × 3 × 3 lattice
structure similar to the one shown in Figure 5.10a, with the step width equal to one angstrom
(note that Figure 5.10a only shows a 2 × 2 × 2 lattice structure). The scalar value at a point of the volume is determined as follows: for each atom, let the distance from the point to the atom's center be d; then the scalar value contributed to the point by the atom is max{w(r − d)/r, 0}, where r is the radius of the atom and w is its atomic weight. The scalar value at the point is then the sum of the above values contributed by all atoms. For the purpose of this
experiment, we computed minimal persistent 2-cycles on both the original scalar function and
its negated one. Figure 5.10b shows a portion of the minimal persistent 2-cycles computed
on the original function, where the purple, red, and green cycles correspond to atoms of
Barium, Titanium, and Oxygen respectively. In our experiment, every atom corresponds
to such a minimal persistent 2-cycle of a long interval. Figure 5.10c shows a portion of the
minimal persistent 2-cycles computed on the negated function, where the cycles complement
the Barium atoms. Figure 5.10d shows the output on the negated function from a tetragonal
lattice structure [91], where the atomic bonds are not straight (see Figure 5.10d inlay). The
stretch on the lattice structure leads to minimal persistent 2-cycles with non-trivial genus.
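The scalar field construction described above can be sketched in a few lines. The atom list here is illustrative (hypothetical centers, radii, and weights), not the values used in the experiment.

```python
from math import dist

# hypothetical atom list: (center, radius r in angstroms, atomic weight w);
# these centers, radii, and weights are illustrative only
atoms = [((0.0, 0.0, 0.0), 1.52, 16.0),
         ((2.0, 0.0, 0.0), 2.15, 137.3)]

def scalar_value(p):
    """Field value at grid point p: sum over atoms of max{w(r - d)/r, 0},
    with d the distance from p to the atom's center; atoms farther away
    than their own radius contribute nothing."""
    return sum(max(w * (r - dist(p, c)) / r, 0.0) for c, r, w in atoms)
```

The negated function used for Figures 5.10c-d is simply -scalar_value(p), evaluated on the same uniform grid.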
As all these instances show, persistent 2-cycles can be quite instrumental in understanding properties of data from different domains. We build on this understanding in the next chapter, where we discuss the use of persistent cycles in gene expression analysis.
Chapter 6: Topologically Relevant Cohort and Gene Expression
6.1 Introduction
The rapid advances in genome-scale sequencing have provided comprehensive lists of genes for different organisms. These data give us broad scope to comprehend the developmental and functional processes of these organisms. Since the advent of the DNA microarray, it is possible to measure the expression levels of a large number of genes simultaneously. This has made a holistic analysis of gene data using their expression levels possible. The stochastic nature of biological processes and the noise acquired during the mining process pose a fundamental challenge in modelling a mathematical structure explaining these high-dimensional data. We look into two problems in data analysis involving gene expressions that are of current research interest.
A genome-wide association study (GWAS) is a method to link a subset of genes to a particular disease or physical phenomenon in an organism. Identifying specific gene subsets is especially important not only from a clinical perspective but from a data science perspective as well. The assimilation of these subsets enables better phenotype identification and improves the prediction of cohort status using machine learning based approaches. Our definition of cohort follows its common usage in biology, where "a cohort is a group of animals of the same species, identified by a common characteristic, which are studied over a period of time as part of a scientific or medical investigation". For small or medium sized data sets, since
the number of gene expressions in a cohort profile is far greater than the number of sample cohorts, disease prediction using neural networks is challenging, as these architectures largely succeed when the number of samples is much larger. For these cases, it becomes important to identify a subset of genes whose expression levels reflect the phenotype of the cohorts.
In addition, it is often the case that some cohorts have incorrect or uncorrelated data due to instrumental or manual error. Hence, their gene expressions may not reflect their phenotype class. We find in practice that the elimination of such instances leads to better prediction scores and performance. In this work, we use topological data analysis to investigate both of these issues. We identify cohorts which are topologically relevant (Sec. 6.3.2). We show that using these cohorts to determine phenotypes, instead of the entire set, improves classification. Next, in Section 6.3.3 we look into the classic GWAS problem mentioned above to identify a small subset of genes using topological data analysis. We compare classification results obtained using this reduced gene subset against those obtained using the full gene pool. The reduced gene profile yields a better prediction rate.
In previous works it has been shown that genes sharing similar attributes tend to cluster in high dimensions [46, 86]. This is because protein-encoding genes that are part of the same biological pathway or have similar functionality are coregulated. This ultimately leads such gene clusters to have similar expression profiles. The property of clustering is essentially captured by the zeroth-order homology class in Persistent Homology (see next section). Motivated by these works, we are interested in finding whether there exist relationships among similar genes in the higher-order homology classes as well.
Traditional TDA pipelines, including our first two chapters, use Persistent Homology to compute a set of intervals called barcodes, which are used as topological features in subsequent processing such as learning [18, 37, 76]. While such barcodes provide robust topological signatures for the persistent features in data (such as tunnels, voids, loops, cycles etc.), their
Figure 6.1: (a): Flowchart for topo-relevant gene expression extraction; refer to Section 6.3.3 for details. (b): Flowchart for topo-curated cohort extraction; refer to Section 6.3.2 for details. In both, bold lines show the path taken for training or testing large data. The dotted lines are used in Figure 6.2.
association to the data is not immediately clear, thus missing some crucial information. In effect,
since these intervals represent homology classes, they contain the set of all loops around a
topological hole (ref. Fig. 6.3b). Thus using barcodes, it is hard to localize a feature, e.g.,
the shortest cycles or holes in a Persistent Homology class. This, in turn, hinders getting
any direct mapping between the topological signatures and the input cohorts or genes. So far there have been few studies addressing the problem of localizing persistent features, and it has been shown that finding shortest cycles in given Persistent Homology classes is an NP-hard problem [32, 36]. However, taking advantage of the recent results in [32, 36], we are able to compute good representative cycles for our application. These cycles capture definitive
to compute good representative cycles for our application. These cycles capture definitive
geometric features and provide a mapping between two domains of gene expression and
topology.
In this chapter we conduct two main experiments using the representative cycles: one to
extract topologically relevant genes and the other to curate relevant cohorts. For these studies,
some organisms were controls while others were infected and/or injected with some
antigen. The input consists of a matrix K which has n rows signifying the cohorts and their
corresponding gene expressions in m columns. For obtaining and classifying topologically
relevant (topo-relevant) genes, our experiments follow the pipeline in Fig. 6.1a, whereas for
determining curated cohorts, it follows the pipeline in Fig. 6.1b. For a large data set, we can
trim out both insignificant cohorts and genes starting from the ‘Training data Kn,m’. This
can be done following the pipeline in Fig. 6.2. We train our neural network architecture on
the final curated dataset and thereby test against any unknown cohort. For our experiments, we work on gene expressions extracted from different organisms including Drosophila, Mus musculus, and Homo sapiens. We convert these data into a binary or multi-class classification
problem based on their phenotype and feed it into the pipeline. Our methodology and results
have been listed in section 6.3.
Figure 6.3: (a) Filtration F = K0, ..., K9 explaining persistence pairing. (b) Different H1 cycles for the same homology class.
Figure 6.2: Flowchart of the proposed pipeline. For the units Topo-relevant gene pipeline and Topo-curated cohort pipeline, we follow the dotted lines in Fig. 6.1(a) and (b) respectively.
6.2 Idea
In this section we provide a brief overview of the synthesis of gene expressions. A gene, informally, is the basic physical unit of heredity; genes are present in two copies, one obtained from each parent. A gene is a sequence of nucleotides (a nucleotide is a molecule of sugar linked to a nitrogen-containing organic ring compound attached to a phosphate group) in
DNA or RNA. These genes contain instructions which are read to synthesize protein or RNA
molecules. These molecules are useful in carrying out all the vital functions of our body such
as producing enzymes, hormones, or building muscles, bones, blood etc. Gene expression
is the process by which instructions in the genes are converted to the functional products
as proteins. It is a tightly regulated process that allows a cell to respond to its changing
environment. For instance, if an organism consumes some nutrition, there is an increase in the expression values of the genes which produce the proteins responsible for assimilating that nutrition and converting it to energy. Hence any observable phenotype (an observable trait) is a direct result of a change in the expression levels of the gene or set of genes responsible for the said phenotype. Thus the regulation of gene expression gives control over the timing, location, and amount of a given gene product (protein or RNA) present in a cell, and can have a profound effect on cellular structure and function. Regulation of gene expression is the basis for cellular differentiation, development, morphogenesis, and the versatility and adaptability of any organism. Any disease or physical phenomenon results from the interplay of many genes and environmental factors. The challenge is thus to unravel the individual set of genes responsible for a certain phenotype through changes in their expression levels (be it up- or down-regulation). These changes in expression levels may either be a direct response to some change in environment (food intake or a virus) or a recursive effect of changes in the expression levels of other genes that act as regulators. In either case, identifying and mapping the change in gene expression level due to a certain phenotype is vital in several aspects of the life sciences, including disease diagnosis.
6.3 Methods
We work under the hypothesis that topological data analysis extracts relevant information sufficient for cohort classification. We note that topological feature extraction methods used in earlier works may not work in this setting. Traditionally, for many applications in bioscience (say, protein classification) and engineering, we find corresponding topological signatures using Persistent Homology for each sample (in this case cohorts or genes). These signatures are appended to the feature vectors. However, in this case, since each cohort is represented by a single 1D vector of gene expression levels, we are not able to find suitable signatures to append. This is why the algorithms described in the previous section come in handy, as we will see in this section. For Algorithms 3 and 5.7, we need a simplicial complex K, a filtration F, and finite intervals. For all the studies in this chapter, we use Sparse Rips [102] to obtain the simplicial complex K and its filtration (F). We then apply the theory of Persistent Homology to obtain the set of all finite intervals. In addition, Algorithm 5.7 requires a pseudomanifold (Ke) instead of a regular simplicial complex K. For our case, this means that every triangle (d = 2-simplex) has at most two tetrahedra ((d + 1) = 3-simplices) attached to it. We convert K into Ke by allowing at most two cofaces (tetrahedra) per triangle, namely those which appear first in the filtration:
• Add all σ0...d to Ke.
• For each σd ∈ K:
  – Sort its cofaces T = {σd+1} by F(σd+1).
  – If |T| ≥ 2: insert into Ke the first two σd+1 in T.
  – Else: insert T into Ke.
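The steps above can be sketched as follows. One point the listed steps leave open is what happens when a tetrahedron kept for one triangle is among the dropped cofaces of a neighboring triangle; the sketch below resolves this (an assumption on our part) by keeping a (d+1)-simplex only when it is among the two earliest cofaces of every one of its d-faces, which guarantees that each d-simplex ends up with at most two cofaces.

```python
from collections import defaultdict

def to_pseudomanifold(K, F, d):
    """K: simplices as frozensets of vertices (dimension = size - 1);
    F: filtration value for each (d+1)-simplex.  Returns Ke in which
    every d-simplex has at most two (d+1)-cofaces."""
    Ke = {s for s in K if len(s) <= d + 1}       # all simplices sigma_0..d
    top = [t for t in K if len(t) == d + 2]      # the (d+1)-simplices
    cofaces = defaultdict(list)
    for t in top:
        for v in t:
            cofaces[t - {v}].append(t)           # t is a coface of each d-face
    allowed = {s: set(sorted(T, key=lambda t: F[t])[:2])  # two earliest cofaces
               for s, T in cofaces.items()}
    Ke |= {t for t in top if all(t in allowed[t - {v}] for v in t)}
    return Ke
```

For d = 1 and three triangles sharing an edge, the triangle appearing last in the filtration is dropped, leaving the shared edge with exactly two cofaces.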
6.3.1 Data
We have a set of n cohorts (C), each represented by the gene expression profile of m genes (G). Thus our input is a matrix K of dimension (n × m), where each Ki,j represents the j-th gene of the i-th cohort. In addition, we have X : C → I, where I is the set of phenotype labels for the cohorts. For instance, X(c) = 0 may imply that c is a healthy control, whereas X(c) = 1 may imply that c is infected or treated with an antigen, depending on the experiment. Throughout our
experiments we will work on several datasets containing gene expression profile of different
organisms [90]. We provide a brief description of these data.
Figure 6.4: t-SNE on the entire cohort point cloud (D0). Red vertices indicate cohorts included in the top 100 H2 cycles, whereas blue vertices indicate the rest.
Dataset               Metric      Decision Tree                        Naive Bayes Classifier
                                  FULL       H1+H2      H2             FULL       H1+H2      H2
DROSO MELANO          #           131        116        101            131        116        101
                      Accuracy    0.714125   0.751768   0.793434       0.398146   0.412121   0.422444
                      Precision   0.745417   0.815000   0.835000       0.389111   0.431756   0.451673
                      Recall      0.712500   0.754167   0.795833       0.400000   0.416667   0.434478
DROSO PARASITOD       #           89         85         51             -          -          -
                      Accuracy    0.792778   0.796667   0.811667       -          -          -
                      Precision   0.817381   0.823571   0.859167       -          -          -
                      Recall      0.792500   0.797500   0.825000       -          -          -
MOUSE PRION           #           321        292        168            321        292        146
                      Accuracy    0.562310   0.616240   0.586843       0.555112   0.578489   0.576131
                      Precision   0.562716   0.591471   0.543743       0.383462   0.378556   0.384572
                      Recall      0.539712   0.564394   0.558267       0.415855   0.422354   0.423651
MOUSE LIVER CANCER    #           242        229        190            242        229        190
                      Accuracy    0.682761   0.698934   0.729545       0.723232   0.723232   0.721404
                      Precision   0.590716   0.579833   0.656051       0.444761   0.444761   0.412018
                      Recall      0.573319   0.602582   0.641168       0.499837   0.499837   0.506429
MOUSE ECOLI           #           226        206        166            226        206        166
                      Accuracy    0.880731   0.851794   0.892900       0.592770   0.592105   0.592105
                      Precision   0.880541   0.853406   0.901481       0.604010   0.651101   0.652203
                      Recall      0.868052   0.842963   0.891786       0.509841   0.511111   0.511111
HUMAN BOWEL DISEASE   #           1745       101        101            -          -          -
                      Accuracy    0.499698   0.510987   0.510987       -          -          -
                      Precision   0.493808   0.509147   0.509147       -          -          -
                      Recall      0.491258   0.501173   0.501173       -          -          -
Table 6.1: Classification using topo-relevant cohorts. Each of the datasets is explained in Section 6.3.1. The # symbol indicates the size of each dataset. A ‘-’ in the table means the statistics were too low: the relevant classifier was unable to classify the given data.
(D0): Droso Breeding: In this data set, Drosophila melanogaster larvae are bred on an Aspergillus nidulans-infested breeding substrate. The phenotypes differ based on the breeding conditions for the Drosophilas. We assign label 0 to the control, label 1 to the Drosophilas bred on the Aspergillus nidulans mutant laeA, and label 2 to both the Drosophilas bred on wild Aspergillus nidulans and those bred on sterigmatocystin. Note that in this experiment, mutating laeA from wild Aspergillus nidulans removes sterigmatocystin production. Hence, both the wild Aspergillus class and the class with external sterigmatocystin should have similar gene expression profiles. The experiments on the dataset website confirm this fact, as there is no change in any gene expression profile between these two classes. The number of cohorts in the database is 131.
(D1): Droso Parasitod: The data contains the profile of Drosophila larvae after a parasitoid attack. There are two labels on the phenotype, one for the control and the other for the cohorts under parasitoid attack. Thus, we have a binary classification problem in this case. The total cohort count is 91.
(D2): Mouse Prion: This data has Mus musculus as the cohort. The experiment investigates the effects of two different strains of prion disease. The phenotypes are ‘RML infected’, ‘301V infected’, and the healthy control, which are assigned labels 0, 1, and 2 respectively. The total cohort count is 418.
(D3): Mouse Liver Cancer: This is again a binary classification problem for Mus musculus. The two phenotypes are the control type and liver cancer cohorts. The total cohort count is 242.
(D4): Mouse EColi: The three phenotypes in this dataset are Escherichia coli, Staphylococcus, and control. The total number of cohorts across all three phenotypes is 321.
(D5): Human Bowel Disease: A binary classification problem where the phenotypes are from cohorts suffering from Crohn's disease and placebo cases. This is a bigger dataset containing the gene expressions of 1745 humans.
(D6): Human Bone Marrow: This data set contains gene expressions of patients having bone marrow failure and cytogenetic abnormalities, along with healthy cohorts who serve as control. This dataset has 469 cohorts.
(D7): Human Dengue: This is yet another big dataset having two types of phenotypes, where we have the gene expressions of Dengue patients versus control cohorts. The cohort count for this dataset is 2978.
We apply the t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize our
entire cohort point cloud for the D0 Drosophila dataset in Fig. 6.4. To get a sense of the
distribution of topological cycles, we calculate the top 100 representative H2 cycles based on
their interval length (δ − β). In Figure 6.4, we color a cohort vertex red if it is contained in any of the top 100 H2 cycles. The cohorts not included are colored blue. The figure shows the uniform distribution of the topological cycles with respect to the entire dataset.
We now discuss the two ways to reduce the input Kn,m into K'n',m' where n' ≤ n and m' ≤ m. The first section deals with finding pertinent cohorts, and the next with finding pertinent genes.
6.3.2 Topo-Curated Cohort
For our first proof of concept, we find a subset of cohorts who provide topologically
relevant information for classification. The aim is to remove cohorts having either incorrect
or uncorrelated data due to instrumental or manual error. Specifically, given Kn,m, we would
like to find K'n',m ⊆ Kn,m for n' ≤ n which improves the classification odds for the cohorts. This subset of n' cohorts should therefore be topologically more relevant. We start by converting the matrix Kn,m into a point cloud. This point cloud has n points, each of dimension m.
Hence each cohort in the matrix is converted to an m-dimensional point where each dimension
represents the expression level for each gene. We use Sparse Rips on the resulting point cloud
to obtain a simplicial complex K and its filtration (F) and apply the theory of Persistent
Homology to obtain the set of finite intervals.
Figure 6.5: Plot of geodesic centers for dominating cycles using t-SNE. Red vertices: non-dominating cycles. Graded green points: dominating cycles. Alpha values indicate the ratio of the dominating phenotype in each cycle versus the other labels.
We consider the dataset D0 having three phenotypes. We generate the longest 100 H2 cycles based on their interval length (δ − β). For each cycle, we consider the constituent vertices and their corresponding phenotype labels (X). We plot the count of X values in individual H2-cycles in Fig. 6.7(a), with the X, Y and Z axes representing X = 0, 1, and 2
Figure 6.6: Three figures plotting individual dominating cycles for gene dataset D0. These cycles actually reside in m dimensions and are projected down to 3D using Principal Component Analysis. The colors indicate cohort phenotype labels X ∈ {0, 1, 2}.
respectively. The black points indicate cycles where all vertices belong to a single phenotype.
The red, green, and blue points indicate cycles having labels (0, 1), (0, 2) and (1, 2) respectively.
The yellow points correspond to cycles having all three labels 0, 1, and 2. The takeaway from
this plot is that, since most points are skewed towards some particular axis, most H2-cycles
have constituent vertices that belong predominantly to some particular label in X. Thus topological cycles in general are inclined towards some X labels without any supervision, as they were not fed the phenotype labels. Note that we added a small random noise to
each point coordinate to illustrate multiplicity. Figure 6.7(b) plots similar values for the top
200 H2 cycles for dataset D1. Since this dataset has two phenotypes, we get a 2D plot. The red labels denote cycles which have equal counts of the constituent phenotypes, whereas blue and cyan represent skew, with black representing single-labeled cycles as before. As is evident, most cycles exhibit a predominance of either X = 0 or 1. Based on the intuition of this plot, we
define a cycle Z to be a Dominant Cycle if there exists a vertex set U ⊆ Vert(Z)5 such that every vertex in U has the same label and |U| ≥ |Vert(Z)|/26.
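The definition translates directly into a predicate. This is a minimal sketch; the vertex ids and phenotype labels below are made up for illustration.

```python
from collections import Counter

def is_dominant(vert_Z, X):
    """vert_Z: the vertices Vert(Z) of a cycle Z; X: vertex -> phenotype label.
    Z is dominant when one label covers at least half of Vert(Z)."""
    counts = Counter(X[v] for v in vert_Z)
    return max(counts.values()) >= len(vert_Z) / 2

# hypothetical phenotype labels for the vertices of two cycles
X = {0: 1, 1: 1, 2: 0, 3: 1, 4: 2, 5: 0}
assert is_dominant([0, 1, 2, 3], X)          # label 1 covers 3 of 4 vertices
assert not is_dominant([0, 1, 2, 4, 5], X)   # no label reaches half of 5
```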
To illustrate the frequency of dominating cycles versus non-dominating ones, we plot
the geodesic centers of the H2 cycles for D0 by projecting them down to 2D using t-SNE (Fig. 6.5). Red vertices indicate non-dominating cycles, while each of the graded green points indicates a dominating one. Clearly, most of the topological cycles are dominating and
indicate a vote towards some phenotype class. The alpha values (denoted by the green bar at the right) indicate the ratio of the dominating phenotype in each cycle versus the other labels (X). Hence, intuitively, the more opaque a given point is, the more it is dominated by a single
class phenotype. Finally, we plot some of the individual dominating H2 cycles along with
their phenotype labels in Fig. 6.6. Note that these points are part of the original D0 cohort point cloud, projected down to 3D using PCA.
Classification using machine learning
We work on several gene expression data extracted from different organisms. On each of
these, we create a classification problem as described in the data section. For each dataset, we use the entire cohort list (irrespective of phenotype) as a point cloud of n points in m dimensions. We generate the top 100 H1 and H2 cycles and select the dominant cycles. Next we select the vertices contained in these dominant cycles, which form our new set of n' (≤ n) cohorts. Taking the gene expressions of these n' cohorts lets us form our new, smaller matrix K'n',m. Thereafter, we train supervised classification models once using Kn,m and again using K'n',m, and compare the results. We use 10-fold cross validation by splitting the data randomly into 80%−20% in each fold. For our classification models, we use Decision
Tree and Naive Bayes. The average value of accuracy, precision, and recall for the 10-fold
5 Vert(Z) denotes the constituent vertices of the cycle Z. 6 |·| denotes the size of a set.
Figure 6.7: (a): Count of vertex labels in individual H2-cycles for D0. The red points indicate cycles having phenotype labels 0 and 1, blue indicates cycles with labels 1 and 2, whereas green (very few in the top 500 H2 cycles) indicates labels 0 and 2. (b): Count of vertex labels in individual H2-cycles for D1. Red indicates cycles having an equal number of vertices of each phenotype label. Blue and cyan indicate a prevalence of labels 0 and 1 respectively. In both diagrams, black points indicate cycles having a single phenotype label.
cross validation is reported in Table 6.1. The column ‘FULL’ represents training on Kn,m, while ‘H1 + H2’ represents the union of the n' topo-relevant cohorts obtained from the dominant cycles in either H1 or H2. We also get good classification statistics for the vertices in the dominant cycles picked up by H2 alone, as reported in the same table. As is evident from the
results, reduction in the number of cohorts leads to an increase in classification measures.
Thus TDA is able to pick up cohorts that carry more decisive gene expression levels for their
individual phenotype classes.
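The evaluation loop of this section can be sketched end to end. The sketch below uses a hand-rolled Gaussian Naive Bayes and synthetic, well-separated data so that it is self-contained; the actual experiments use standard Decision Tree and Naive Bayes implementations on the real cohort matrices, and all sizes here are illustrative.

```python
import random
from math import log, pi
from statistics import mean, pvariance

def gnb_fit(X, y):
    """Gaussian Naive Bayes: log-prior plus per-class feature mean/variance."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
        model[c] = (log(len(rows) / len(y)), stats)
    return model

def gnb_predict(model, x):
    def score(prior, stats):
        return prior + sum(-0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
                           for xi, (m, v) in zip(x, stats))
    return max(model, key=lambda c: score(*model[c]))

def cross_validate(X, y, folds=10, test_frac=0.2, seed=0):
    """Average accuracy over `folds` random 80-20 splits, as in the text."""
    rng, accs = random.Random(seed), []
    n_test = max(1, int(test_frac * len(X)))
    for _ in range(folds):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        model = gnb_fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(gnb_predict(model, X[i]) == y[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / folds

# synthetic stand-in for the curated matrix K'_{n',m}: two separated classes
data_rng = random.Random(1)
X = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)] \
  + [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(20)]
y = [0] * 20 + [1] * 20
acc = cross_validate(X, y)
```

Running the same loop once on the full matrix and once on the topo-curated one, and comparing the averaged accuracy, precision, and recall, reproduces the shape of Table 6.1.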
6.3.3 Topo-Relevant Gene Expression
Our next problem is to reduce the matrix Kn×m to K' of dimension (n × m') where m' ≪ m. We use the persistent cycle descriptors H1 and H2 introduced in the previous section to extract m' meaningful genes (G') such that G' ⊂ G. To this effect, we use the
annotation of the gene set G based on its functional classification, obtained from the ‘Panther Classification System by Geneontology’ [77] and the ‘NCBI Gene Data set’ [83]. Thus for each g ∈ G, ∃ f : g → R, where R is a vector of functional attributes obtained from [77].
Once we obtain the representative cycles, we find the maximal cover of each cycle defined
as follows:
Maximal cover of a representative cycle (κ): For each gene g ∈ Vert(Z), represented as a vertex of a single representative cycle, we have a set of annotations f(g). Let S be any set of annotations which contains at least one annotation for each g ∈ Vert(Z); we select a smallest such set. Thus,
κ = inf{ |S| : ∀g ∈ Vert(Z), S ∩ f(g) ≠ ∅ }
The idea behind using κ is to get a sense of the functionality of the gene. A gene may
be responsible for multiple processes described in the Panther and NCBI database. If κ
is low or unity for a certain Z, it probably indicates that the gene expressions involved in
Z reflect the functionality captured by κ. This is illustrated in Figure 6.8, where we plot some of the H2 cycles generated on K', with colors annotated by functionality. We use PCA as before to project the points down to 3 dimensions. The three figures illustrate three instances of different κ-values. Consider the example in Fig. 6.8(a) to get the intuition
behind κ. The six vertices representing genes in the H2 cycles have function annotations: {1:
Localization, 2: Not annotated, 3: Metabolic process, Cellular process, 4: Metabolic process,
Cellular process, Biological regulation, 5: Metabolic process, Cellular process, Localization, 6:
Not Annotated}. Out of these, the set {Localization, Not annotated, Metabolic process} covers all the vertices, and hence κ is 3.
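κ is a minimum hitting set over the cycle's annotation sets, which can be computed exhaustively for the handful of genes on a single cycle. This is a sketch: the annotation strings simply repeat the worked example above, and minimum set cover is NP-hard in general, so an exhaustive search is only reasonable at this scale.

```python
from itertools import combinations

def maximal_cover(annotations):
    """annotations: one set f(g) per gene vertex of Z.  Returns (kappa, S)
    for a smallest annotation set S with S ∩ f(g) nonempty for every g.
    Exhaustive search over subsets of the annotation universe."""
    universe = sorted(set().union(*annotations))
    for k in range(1, len(universe) + 1):
        for S in combinations(universe, k):
            if all(set(S) & f for f in annotations):
                return k, set(S)

# the six genes of the worked example and their annotations
f = [{"Localization"},
     {"Not annotated"},
     {"Metabolic process", "Cellular process"},
     {"Metabolic process", "Cellular process", "Biological regulation"},
     {"Metabolic process", "Cellular process", "Localization"},
     {"Not annotated"}]
kappa, S = maximal_cover(f)
```

For this input the search returns κ = 3 with a witness such as {Localization, Not annotated, Metabolic process}.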
We choose cycles with low κ values and select their component genes as part of G'. We can control the size of G' based on the value of κ.
For all our experiments, we run each architecture and obtain performance measures on K, which contains the exhaustive set of m genes. We re-run these experiments on our trimmed set K', containing the m' (≪ m) topologically significant genes. Note that we may use the topo-relevant cohort extraction to additionally reduce K'n'×m into K''n'×m'. But since the public datasets we work on as our proof of concept have too few samples for an NN architecture, we do not additionally trim the number of cohorts.
Figure 6.8: Three 2-cycles for gene dataset D6, with colors indicating gene function labels. The total number of colors indicates the κ value of the cycle. Note that a gene can be responsible for several functionalities; the legend in this plot takes into account only the single functionality which contributes towards the maximal cover of the representative cycle. (a,b) Low κ: 3. (c) High κ: 6.
6.3.4 Neural Network Architecture
We use a one-dimensional convolutional neural network (1D-CNN) to perform experiments on gene-
expression data. Our architecture is inspired by [63], who reported promising
results on the same task. The authors use a series of dense networks connected by activation
functions. Since we provide some functional relevance among the genes, we sort them by
their functionality and feed them to a convolutional layer. We start with this 1D-CNN layer
activated by the sigmoid function. Sigmoid is a traditional activation function that provides
a smooth non-linearity in the network, and since the architecture is not very deep, we do not need to worry about its shortcomings such as the vanishing gradient. This is followed by a max pooling of size 2 and subsequently a dropout layer. This layer is connected to two densely connected layers of decreasing size. These layers use ReLU as their activation function, as in [63]. In the end, we add a softmax activation layer to determine the
final label of the data. The hyper-parameters of the network can be tuned using advanced hyper-parameter optimization algorithms such as Bayesian optimization. However, since this study is a proof of concept whose purpose is to show the effectiveness of our feature selection, we fine-tune them manually.
Since the number of samples is still small for a CNN, overfitting is an issue. Notice that, for this precise reason, we do not curate this data using the pipeline in Section 6.3.2. Dropout layers are added after each layer to further prevent overfitting and reduce high variance. We do not, however, use early stopping, as such pipelines are not amenable to orthogonalization.
Finally, the model is optimised using Adam, which adapts the learning rate per parameter. The dataset is split 80-20 into training and test sets and cross-validated for 50 epochs. The neural network is implemented in Python using TensorFlow and Keras. The results of our experiments on datasets D5, D6, and D7 are shown in Table 6.2. The "# genes" row shows that the gene set selected from topological cycles is less than 30% of the size of the original gene pool. The results have, however, improved in all the cases. We also compare our technique against a baseline generated by randomly choosing the same number of genes as the curated data. For the human dengue dataset, we average this baseline over 10 random samples.
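As a concrete illustration, the architecture described above could be assembled in Keras roughly as follows. This is a hedged sketch under stated assumptions: the filter count, kernel size, dense-layer widths, and dropout rates are illustrative placeholders, not the tuned values used in the experiments, and `build_model` is a hypothetical helper name.

```python
from tensorflow.keras import layers, models

def build_model(n_genes, n_classes, n_filters=32, kernel_size=8, dropout=0.25):
    """Sketch of the described 1D-CNN over functionality-sorted genes:
    sigmoid-activated convolution, max pooling, dropout, two shrinking
    dense ReLU layers, and a softmax output. Sizes are placeholders."""
    model = models.Sequential([
        layers.Input(shape=(n_genes, 1)),            # one expression value per gene
        layers.Conv1D(n_filters, kernel_size, activation="sigmoid"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(dropout),                     # dropout after each layer
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(64, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Adam adapts per-parameter learning rates, as noted above
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training with `model.fit(X_train, y_train, epochs=50)` after an 80-20 train/test split would mirror the schedule described above.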
This gives an average test loss, accuracy, precision, and recall of 0.3262, 81.59, 81.2, and
81.2, respectively. So, we observe that our technique performs substantially better than the baseline. For the human bowel disease dataset (D5), the training result on the full data performs slightly better than on our curated data. This is probably due to overfitting, since its performance
Data            Human Dengue      Human Bone Marrow   Human Bowel Disease
                (#: 4415)         (#: 469)            (#: 1745)
Method          TP (Z)   Full     TP (Z)   Full       TP (Z)   Full
# genes         4415     17258    469      5464       1801     54715
Tr-Loss (e-2)   5.95     10.06    5.16     5.70       13.11    9.58
Tr-Acc          97.84    96.64    99.72    99.15      96.63    97.56
Tr-F1           97.86    96.48    99.72    99.15      96.60    97.55
Tr-Prec         97.86    96.48    99.72    99.15      96.60    97.55
Tr-Rec          97.86    96.48    99.72    99.15      96.60    97.55
Ts-Loss (e-2)   21.99    14.55    6.34     51.30      84.29    83.73
Ts-Acc          93.21    91.65    97.46    95.76      90.10    89.62
Ts-F1           92.26    90.67    96.95    95.74      90.34    89.66
Ts-Prec         93.48    90.67    96.95    95.74      90.34    89.66
Ts-Rec          93.48    90.67    96.95    95.74      90.34    89.66
Table 6.2: Neural network results. The column TP (Z) gives the results on the reduced gene set obtained using topology; Full gives the results on the full gene set. Tr-Loss, Tr-Acc, Tr-F1, Tr-Prec, and Tr-Rec are the loss, accuracy, F1-score, precision, and recall on the training data; the prefix Ts- indicates the same on the test set.
improvement is not reflected in the test data. We could employ other techniques to regularise
this instance, but since we wanted to preserve the same architecture throughout, we left the results
as is. On reducing the number of features, we get an improvement in the metrics on the
test set.
We follow the trend of the loss, accuracy, and F1 score by plotting their values after every
epoch. Figure 6.9 shows this result on dataset D7. We see that the loss on the test data is slightly higher but smoother than on the full dataset. Despite this,
using TDA, the accuracy and F1 score are consistently better in every epoch
for both the training and test data.
Figure 6.9: Comparison of (a) accuracy, (b) F1-score, and (c) loss over 50 epochs. For the TDA-curated data, red and yellow lines represent the train and test scores, respectively; for the full data, they are represented by the green and blue lines.
Chapter 7: Contributions and Future Work
This work deals with the creation of several topological tools that help in understanding
data in a number of domains. In Chapter 3, we accumulated ample evidence that topological
features provide additional information for the classification of images. This is not surprising,
as most techniques rely on geometric, textural, or gradient-based features for classification
that do not necessarily capture topology. The aggressive subsampling-based
algorithm improved the computational time for generating the topological signatures, which was the main bottleneck. This work was an attempt to see whether topology provides added
features for machine learning. Since we have an affirmative answer on its effectiveness, further research
should seek more innovative methods to assimilate topological features into
deep learning architectures.
In the next chapter, we showed a practical topological technique to generate signatures
for protein molecules that can be used as feature vectors for classification. This work
uses unaided persistent barcodes to surpass state-of-the-art techniques. In addition
to procuring encouraging classification results, we can draw a direct correlation between
the persistence intervals and the hierarchical protein structures. Having investigated the
descriptive power of our signature, we believe it can be used for other purposes such as
protein energy computation or estimating the protein B-factor. We believe that this signature can
be extended to other biomolecular data such as DNA or enzymes.
In the next chapter, we inspect methods to compute minimal persistent cycles. We
generate representative persistent 1-cycles which are computable in polynomial time. We further showed that if the output cycle has only one component (i.e., the sum Σ_{g∈G} c_g contains only one component), the computed cycle is minimal. Further, our experiments highlighted the
frequent occurrence of such events. Finally, we showed that Algorithm 3, which computes a
persistent 1-cycle for a specific interval, runs in O(n^ω) time. Note that this algorithm only needs
to compute the cycles for intervals containing the input interval. Since a user is often
interested in a long interval, the intervals containing it constitute a small subset of all the
intervals. This makes Algorithm 3 run much faster in practice.
For the general dimension, we find the problems to be tractable if the given complex is
a weak (d+1)-pseudomanifold. The details of the algorithm are not part of this thesis and will appear elsewhere. We focus on the applications of these algorithms in image analysis,
material science, medical data, and scientific visualisation, among others. This research leads to
some open questions concerning persistent cycles:
• In our experiments, some persistent cycles correspond to important features of the
data (see Section 5.8). However, we also ran into some intervals whose persistent cycles
do not have obvious meanings. If there are ways to design filtrations for data such
that persistent cycles are related to the important features, then the prospect for the
application of persistent cycles or persistence in general would be more extensive.
• In Section 5.6, we presented O(n^2)-time algorithms for computing a minimal
persistent cycle for a given interval. A natural question is whether this time complexity
can be improved. Furthermore, can we devise a better algorithm to compute minimal
persistent cycles for all intervals (i.e., the minimal persistent basis described in Chapter
5), improving upon the obvious O(n^3)-time algorithm that runs our algorithm on each
interval?
Finally, in our last chapter, we utilised the representative persistent cycles to extract
relevant cohorts and gene expressions so as to improve feature selection. Both our test cases
show that the data follow some topological alignment, owing to which the representative cycles
are able to extract, in a sense, the "crux" of the data. This is why we are able to fit our training
models better and reduce variance, thereby obtaining better accuracy and F1-score. In future work, one can try to further tune our models so as to correlate the selected features with
their functionality.
This work makes it evident that topological data analysis has an important role to play in
understanding and describing complex data. This is not surprising as most classical techniques
to describe data (including neural networks) rely on geometric, textural, or gradient based
features that do not necessarily capture topological features.
Having said that, it is not true that the notions in topology are entirely untrodden.
For instance, as discussed in the last chapter, any zero-dimensional barcode computed from a Rips
or similar filtration encodes information similar to hierarchical clustering. One-dimensional
simplicial complexes are essentially graph structures. Machine learning techniques using
random forests echo similar ideas. So it would be more meaningful to perceive topological
advancements as a natural extension of traditional data science measures, as opposed to a
separate domain. Based on the evidence found in this thesis, I think it would be safe to
echo Dr. Noah Giansiracusa's words: "TDA should be viewed as a tool to be added to the quiver of data science arrows, rather than an entirely new weapon."
It would be interesting to see the effects of even higher-order homology features in
data analysis. We have already seen them to be instrumental in several domains in Chapters
5 and 6. To further the cause, better use of TDA to explain data will require deeper
investigation of the theoretical premises of higher-order homology, coupled with the development of better toolkits for use by researchers from other domains.
Bibliography
[1] Henry Adams and Gunnar Carlsson. On the nonlinear statistics of range image patches. SIAM Journal on Imaging Sciences, 2(1):110–117, 2009.
[2] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1–35, 2017.
[3] Ann S. Almgren, John B. Bell, Mike J. Lijewski, Zarija Lukić, and Ethan Van Andel. Nyx: A massively parallel AMR code for computational cosmology. The Astrophysical Journal, 765(1):39, February 2013.
[4] Javier Arsuaga, Tyler Borrman, Raymond Cavalcante, Georgina Gonzalez, and Catherine Park. Identification of copy number aberrations in breast cancer subtypes using persistence topology. PubMed: 27600228, 4:339–369, 2015.
[5] Aras Asaad and Sabah Jassim. Topological data analysis for image tampering detection. In Christian Kraetzer, Yun-Qing Shi, Jana Dittmann, and Hyoung Joong Kim, editors, Digital Forensics and Watermarking, pages 136–146, Cham, 2017. Springer International Publishing.
[6] P. Bendich, H. Edelsbrunner, and M. Kerber. Computing robustness and persistence for images. IEEE Transactions on Visualization and Computer Graphics, 16(6):1251–1260, Nov 2010.
[7] Juliana Bernardes, Gerson Zaverucha, Catherine Vaquero, Alessandra Carbone, and Michael Levitt. Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence. Public Library of Science, 12, 07 2016.
[8] Jean-Daniel Boissonnat, Tamal K. Dey, and Clément Maria. The compressed annotation matrix: an efficient data structure for computing persistent cohomology. CoRR, abs/1304.6813, 2013.
[9] Jean-Daniel Boissonnat and Clément Maria. The simplex tree: An efficient data structure for general simplicial complexes. 20th Annual European Symposium, Ljubljana, Slovenia, 2:731–742, 2012.
[10] Thomas Bonis, Maks Ovsjanikov, Steve Oudot, and Frédéric Chazal. Persistence-based pooling for shape pose recognition. In Alexandra Bac and Jean-Luc Mari, editors, Computational Topology in Image Context, pages 19–29, Cham, 2016. Springer International Publishing.
[11] Glencora Borradaile, Erin Wolf Chambers, Kyle Fox, and Amir Nayyeri. Minimum cycle and homology bases of surface-embedded graphs. Journal of Computational Geometry, 8(2), 2017.
[12] Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1):77–102, 2015.
[13] Inbal Budowski-Tal, Yuval Nov, and Rachel Kolodny. Fragbag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. PNAS, 107(8):3481–3486, February 2010.
[14] Pablo Camara. Topological methods for genomics: present and future directions. Curr Opin Syst Biol, pages 95–101, 2017.
[15] Zixuan Cang, Lin Mu, and Guowei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLOS Computa- tional Biology, 14, 08 2018.
[16] Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and Guo-Wei Wei. A topological approach for protein classification. MBMB, Nov 2015.
[17] Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. IJNMBE, 2017.
[18] Zixuan Cang and Guo-Wei Wei. Topologynet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Computational Biology, Oct 2017.
[19] A. Cardona, S. Saalfeld, S. Preibisch, B. Schmid, A. Cheng, J. Pulokas, et al. An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy. PLoS Biol, 8, 2010.
[20] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas Guibas. Persistence barcodes for shapes. In Proceedings of the 2004 Eurographics ACM SIGGRAPH Symposium on Geometry Processing, SGP ’04, pages 124–135. ACM, 2004.
[21] Mathieu Carriere, Steve Oudot, and Maks Ovsjanikov. Sliced Wasserstein Kernel for Persistence Diagrams. In ICML 2017 - Thirty-fourth International Conference on Machine Learning, pages 1–10, Sydney, Australia, August 2017.
[22] Mathieu Carrière, Steve Y. Oudot, and Maks Ovsjanikov. Stable topological signatures for points on 3d shapes. In Proceedings of the Eurographics Symposium on Geometry Processing, SGP '15, pages 1–12, Aire-la-Ville, Switzerland, 2015. Eurographics Association.
[23] Erin W. Chambers, Jeff Erickson, and Amir Nayyeri. Minimum cuts and shortest homologous cycles. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 377–385. ACM, 2009.
[24] Chao Chen and Daniel Freedman. Measuring and computing natural generators for homology groups. Computational Geometry, 43(2):169–181, 2010.
[25] Chao Chen and Daniel Freedman. Hardness results for homology localization. Discrete & Computational Geometry, 45(3):425–448, 2011.
[26] Moo Chung, Peter Bubenik, and Peter Kim. Persistence diagrams of cortical surface data. In Information processing in medical imaging, pages 386–397. Springer, 2009.
[27] Natalie Dawson, Tony E Lewis, Sayoni Das, Jonathan Lees, David Lee, Paul Ashford, Christine Orengo, and Ian Sillitoe. Cath: An expanded resource to predict protein function through structure and sequence. Nucleic acids research, 45, 11 2016.
[28] Vin De Silva and Robert Ghrist. Coverage in sensor networks via persistent homology. Algebraic & Geometric Topology, 7(1):339–358, 2007.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[30] T. Dey, F. Fan, and Y. Wang. Graph induced complex on point data. Computational Geometry, 200, 1995.
[31] T. K. Dey, A. Hirani, and B. Krishnamoorthy. Optimal homologous cycles, total unimodularity, and linear programming. SIAM Journal on Computing, 40(4):1026–1044, 2011.
[32] Tamal Dey, Tao Hou, and Sayan Mandal. Computing minimal persistent cycles: Polynomial and hard cases. ACM-SIAM Symposium on Discrete Algorithms (SODA20), 07 2019.
[33] Tamal K Dey, Fengtao Fan, and Yusu Wang.
Computing topological persistence for simplicial maps. In Proceedings of the thirtieth annual symposium on Computational geometry, page 345. ACM, 2014.
[34] Tamal K. Dey, Fengtao Fan, and Yusu Wang. Computing topological persistence for simplicial maps. Symposium on Computational Geometry, pages 345–354, June 2014.
[35] Tamal K. Dey, Anil N. Hirani, and Bala Krishnamoorthy. Optimal homologous cycles, total unimodularity, and linear programming. SIAM Journal on Computing, 40(4):1026–1044, 2011.
[36] Tamal K. Dey, Tao Hou, and Sayan Mandal. Persistent 1-cycles: Definition, computation, and its application. In Computational Topology in Image Context, pages 123–136, Cham, 2019. Springer International Publishing.
[37] Tamal K. Dey and Sayan Mandal. Protein classification with improved topological data analysis. In WABI, 2018.
[38] Tamal K. Dey, Dayu Shi, and Yusu Wang. Simba: An efficient tool for approximating rips-filtration persistence via simplicial batch-collapse. In ESA, volume 57, 2016.
[39] Tamal K Dey, Jian Sun, and Yusu Wang. Approximating loops in a shortest homology basis from point data. In Proceedings of the twenty-sixth annual symposium on Computational geometry, pages 166–175. ACM, 2010.
[40] Tamal Krishna Dey, Fengtao Fan, and Yusu Wang. Graph induced complex on point data. In Proceedings of the Twenty-ninth Annual Symposium on Computational Geometry, SoCG '13, pages 107–116, New York, NY, USA, 2013. ACM.
[41] A. Dirafzoon and E. Lobaton. Topological mapping of unknown environments using an unlocalized robotic swarm. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5545–5551, Nov 2013.
[42] Ali Nabi Duman and Harun Pirim. Gene coexpression network comparison via persistent homology. International Journal of Genomics, 11, 2018.
[43] Herbert Edelsbrunner. Weighted alpha shapes. Technical report, IL, USA, 1992.
[44] Herbert Edelsbrunner and John Harer. Computational topology: an introduction. American Mathematical Soc., 2010.
[45] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28(4):511–533, Nov 2002.
[46] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
[47] Kevin Emmett, Daniel Rosenbloom, Pablo Camara, and Raul Rabadan. Parametric inference using persistence diagrams: A case study in population genetics. arXiv, 2014.
[48] Kevin Emmett, Benjamin Schweinhart, and Raul Rabadan. Multiscale topology of chromatin folding. In Proceedings of the 9th EAI international conference on bio-inspired information and communications technologies (formerly BIONETICS), pages 177–180. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016.
[49] Jeff Erickson and Kim Whittlesey. Greedy optimal homotopy and homology generators. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1038–1046. Society for Industrial and Applied Mathematics, 2005.
[50] Emerson G Escolar and Yasuaki Hiraoka. Optimal cycles for persistent homology via linear programming. In Optimization in the Real World, pages 79–96. Springer, 2016.
[51] Zoltán Gáspári, Kristian Vlahovicek, and Sándor Pongor. Efficient recognition of folds in protein 3d structures by the improved pride algorithm. Bioinformatics, 21(15), 2005.
[52] R. Ghrist and A. Muhammad. Coverage and hole-detection in sensor networks via homology. In IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005., pages 254–260, April 2005.
[53] D Goldfarb. An application of topological data analysis to hockey analytics. arXiv preprint, 2014.
[54] Greg Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. Technical report: Caltech Authors, 2006.
[55] William Harvey, In-Hee Park, Oliver Rübel, Valerio Pascucci, Peer-Timo Bremer, Chenglong Li, and Yusu Wang. A collaborative visual analytics suite for protein folding research. Journal of Molecular Graphics and Modelling, 53:59–71, 2014.
[56] Yasuaki Hiraoka, Takenobu Nakamura, Akihiko Hirata, Emerson G. Escolar, Kaname Matsue, and Yasumasa Nishiura. Hierarchical structures of amorphous solids characterized by persistent homology. Proceedings of the National Academy of Sciences, 113(26):7035–7040, 2016.
[57] A. Hoover and M. Goldbaum. Locating the optic nerve in a retinal image using the fuzzy convergence of the blood vessels. IEEE Transactions on Medical Imaging, 22(8):951–958, Aug 2003.
[58] Kyu-Baek Hwang, Dong-Yeon Cho, Sang-Wook Park, Sung-Dong Kim, and Byoung-Tak Zhang. Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis, pages 167–182. Springer US, Boston, MA, 2002.
[59] Liang J, Edelsbrunner H, Fu P, Sudhakar PV, and Subramaniam S. Analytical shape computation of macromolecules: Ii. molecular area and volume through alpha shape. In Proteins, volume 33, pages 18–29, 1998.
[60] Daniel P. Kiehart, Catherine G. Galbraith, Kevin A. Edwards, Wayne L. Rickoll, and Ruth A. Montague. Multiple forces contribute to cell sheet morphogenesis for dorsal closure in drosophila. The Journal of Cell Biology, 149(2):471–490, 2000.
[61] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 202–207. AAAI Press, 1996.
[62] Rachel Kolodny, Patrice Koehl, Leonidas Guibas, and Michael Levitt. Small libraries of protein fragments model native protein structures accurately. JMB, 323, 2002.
[63] Yunchuan Kong and Tianwei Yu. A deep neural network model using random forest to extract feature representation for gene expression data classification. Scientific Reports, 8(1):16477, 2018.
[64] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[65] Vitaliy Kurlin. A fast persistence-based segmentation of noisy 2D clouds with provable guarantees. Pattern Recognition Letters, 83:3–12, 2015.
[66] Genki Kusano, Kenji Fukumizu, and Yasuaki Hiraoka. Persistence weighted gaussian kernel for topological data analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2004– 2013. JMLR.org, 2016.
[67] Roland Kwitt, Stefan Huber, Marc Niethammer, Weili Lin, and Ulrich Bauer. Statistical topological data analysis - a kernel perspective. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3070–3078. Curran Associates, Inc., 2015.
[68] Tam Le and Makoto Yamada. Persistence fisher kernel: A riemannian manifold kernel for persistence diagrams. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa- Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10007–10018. Curran Associates, Inc., 2018.
[69] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[70] Javier Lamar Leon, Andrea Cerri, Edel Garcia Reyes, and Rocio Gonzalez Diaz. Gait-based gender classification using persistent homology. In José Ruiz-Shulcloper and Gabriella Sanniti di Baja, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 366–373, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
[71] Max Z. Li, Megan S. Ryerson, and Hamsa Balakrishnan. Topological data analysis for aviation applications. Transportation Research Part E: Logistics and Transportation Review, 128:149 – 174, 2019.
[72] Shengren Li, Lance Simons, Jagadeesh Bhaskar Pakaravoor, Fatemeh Abbasinejad, John D. Owens, and Nina Amenta. kANN on the GPU with Shifted Sorting. Euro- graphics Association, pages 39–47, 2012.
[73] Liisa Holm and Päivi Rosenström. Dali server: conservation mapping in 3d. Nucleic Acids Research, 38:W545–W549, 2010.
[74] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.
[75] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, April 2009.
[76] Sayan Mandal, Aldo Guzmán-Sáenz, Niina Haiminen, Saugata Basu, and Laxmi Parida. A topological data analysis approach on predicting phenotypes from gene expression data. Springer Lecture Notes in Computer Science/LNBI, September 2020.
[77] Huaiyu Mi, Anushya Muruganujan, Dustin Ebert, Xiaosong Huang, and Paul D Thomas. PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools. Nucleic Acids Research, 47(D1):D419–D426, 11 2018.
[78] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. 7th European Conference on Computer Vision (ECCV '02), 7, May 2002.
[79] G. M. Morton. A computer oriented geodetic data base; and a new technique in file sequencing. International Business Machines Co., 1966.
[80] J. R. Munkres. Elements of Algebraic Topology, chapter 1. Perseus, Cambridge, Massachusetts, 1 edition, 1984.
[81] Takenobu Nakamura, Yasuaki Hiraoka, Akihiko Hirata, Emerson G Escolar, and Yasumasa Nishiura. Persistent homology and many-body atomic structure for medium-range order in the glass. IOP Science Nanotechnology, 26(30), 2015.
[82] USA National Institutes of Health. National center for biotechnology information. https://www.ncbi.nlm.nih.gov.
[83] USA National Institutes of Health. National center for biotechnology information. https://www.ncbi.nlm.nih.gov//gene.
[84] Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.
[85] Ippei Obayashi. Volume optimal cycle: Tightest representative cycle of a generator on persistent homology. arXiv preprint arXiv:1712.05103, 2017.
[86] Jelili Oyelade, Itunuoluwa Isewon, Funke Oladipupo, Olufemi Aromolaran, Efosa Uwoghiren, Faridah Ameh, Moses Achas, and Ezekiel Adebiyi. Clustering algorithms: Their application to gene expression data. Bioinformatics and Biology Insights, 10:BBI.S38316, 2016.
[87] Laxmi Parida, Filippo Utro, Deniz Yorukoglu, Anna Paola Carrieri, David Kuhn, and Saugata Basu. Topological signatures for population admixture. In Teresa M. Przytycka, editor, Research in Computational Molecular Biology, pages 261–275, Cham, 2015. Springer International Publishing.
[88] Jose A. Perea and John Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, 15(3):799–838, Jun 2015.
[89] Ronald Carl Petersen, PS Aisen, Laurel A Beckett, MC Donohue, AC Gamst, Danielle J Harvey, CR Jack, WJ Jagust, LM Shaw, AW Toga, et al. Alzheimer's disease neuroimaging initiative (ADNI): clinical characterization. Neurology, 74(3):201–209, 2010.
[90] Robert Petryszak, Maria Keays, Y. Amy Tang, Nuno A. Fonseca, Elisabet Barrera, Tony Burdett, Anja Füllgrabe, Alfonso Muñoz-Pomer Fuentes, Simon Jupp, Satu Koskinen, Oliver Mannion, Laura Huerta, Karine Megy, Catherine Snow, Eleanor Williams, Mitra Barzine, Emma Hastings, Hendrik Weisser, James Wright, Pankaj Jaiswal, Wolfgang Huber, Jyoti Choudhary, Helen E. Parkinson, and Alvis Brazma. Expression Atlas update—an integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Research, 44(D1):D746–D752, 10 2015.
[91] Pierre Villars (Chief Editor). PAULING FILE in: Inorganic Solid Phases. SpringerMaterials (online database), Springer, Heidelberg (ed.).
[92] Jeremy A. Pike, Abdullah O. Khan, Chiara Pallini, Steven G. Thomas, Markus Mund, Jonas Ries, Natalie S. Poulter, and Iain B. Styles. Topological data analysis quantifies biological nano-structure from single molecule localization microscopy. bioRxiv, 2018.
[93] Mehdi Pirooznia, Jack Y. Yang, Mary Qu Yang, and Youping Deng. A comparative study of different machine learning methods on microarray gene expression data. BMC genomics, 9 Suppl 1(Suppl 1):S13–S13, 2008.
[94] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multi-scale kernel for topological machine learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4741–4748, June 2015.
[95] M. Remmert, A. Biegert, A. Hauser, and J. Söding. Hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment. Nature Methods, 9, Dec 2011.
[96] Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision, Springer, pages 222–245, May 2013.
[97] Natalie Sauerwald, Yihang Shen, and Carl Kingsford. Topological data analysis reveals principles of chromosome structure throughout cellular differentiation. bioRxiv, 2019.
[98] Aleksandar Savic, Gergely Toth, and Ludovic Duponchel. Topological data analysis (tda) applied to reveal pedogenetic principles of european topsoil system. Science of The Total Environment, 586:1091 – 1100, 2017.
[99] James P R Schofield, Fabio Strazzeri, Jeannette Bigler, Michael Boedigheimer, Ian M Adcock, Kian Fan Chung, Aruna Bansal, Richard Knowles, Sven-Erik Dahlen, Craig E. Wheelock, Kai Sun, Ioannis Pandis, John Riley, Charles Auffray, Bertrand De Meulder, Diane Lefaudeux, Devi Ramanan, Ana R Sousa, Peter J Sterk, Rob. M Ewing, Ben D Macarthur, Ratko Djukanovic, Ruben Sanchez-Garcia, and Paul J Skipp. A topological data analysis network model of asthma based on blood gene expression profiles. bioRxiv, 2019.
[100] Lars Seemann, Jason Shulman, and Gemunu H. Gunaratne. A Robust Topology-Based Algorithm for Gene Expression Profiling, 2012.
[101] Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[102] Donald R. Sheehy. Linear-size approximations to the vietoris-rips filtration. Discrete & Computational Geometry, 49(4):778–796, 2013.
[103] Ian Sillitoe, Tony E Lewis, and et al. Cath: Comprehensive structural and functional annotations for genome sequences. Nucleic Acids Research, 43, 01 2015.
[104] Nikhil Singh, Heather D. Couture, J. S. Marron, Charles Perou, and Marc Niethammer. Topological descriptors of histology images. In Guorong Wu, Daoqiang Zhang, and Luping Zhou, editors, Machine Learning in Medical Imaging, pages 231–239, Cham, 2014. Springer International Publishing.
[105] Paolo Sonego, Mircea Pacurar, Somdutta Dhir, Attila Kertesz-Farkas, András Kocsor, Zoltán Gáspári, Jack A M Leunissen, and Sándor Pongor. A protein classification benchmark collection for machine learning. Nucleic acids research, 35:D232–6, 02 2007.
[106] Richard S. Sutton and Andrew Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 1998.
[107] Sara Tarek, Reda Abd Elwahab, and Mahmoud Shoman. Gene expression based cancer classification. Egyptian Informatics Journal, 18(3):151 – 159, 2017.
[108] The GUDHI Project. GUDHI User and Reference Manual. GUDHI Editorial Board, 2015.
[109] Q. H. Tran and Y. Hasegawa. Topological time-series analysis with delay-variant embedding. Physical Review E, 99(3):032209, March 2019.
[110] Katharine Turner, Sayan Mukherjee, and Doug M Boyer. Persistent homology transform for modeling shapes and surfaces. Information and Inference, page iau011, 2014.
[111] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.
[112] David G. P. van IJzendoorn, Karoly Szuhai, Inge H. Briaire-de Bruijn, Marie Kostine, Marieke L. Kuijjer, and Judith V. M. G. Bovée. Machine learning analysis of gene expression data reveals novel diagnostic and prognostic biomarkers and identifies therapeutic targets for soft tissue sarcomas. PLoS computational biology, 15(2):e1006826, 2019.
[113] Pengxiang Wu, Chao Chen, Yusu Wang, Shaoting Zhang, Changhe Yuan, Zhen Qian, Dimitris Metaxas, and Leon Axel. Optimal topological cycles and their application in cardiac trabeculae restoration. In International Conference on Information Processing in Medical Imaging, pages 80–92. Springer, 2017.
[114] Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure, flexibility and folding. IJNMBE, 30(8):814–844, 2014.