An Introduction to Topological Data Analysis

Outline Why Topology? Simplicial Complex Persistent Homology

Yuan Yao

Department of Mathematics HKUST

April 22, 2020

Outline

1 Why Topological Methods?
   • Methods for Visualizing a Data Geometry

2 Simplicial Complex for Data Representation
   • Simplicial Complex
   • Nerve, Reeb Graph, and Mapper
   • Applications of Mapper Graph
   • Čech, Vietoris-Rips, and Witness Complexes

3 Persistent Homology
   • Betti Numbers
   • Betti Number at Different Scales
   • Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches


Methods for Imposing a Geometry

Figure: Define a metric

Methods for Summarizing or Visualizing a Geometry

Figure: Linear projection (PCA, MDS, etc.; Euclidean metric)

Figure: Nonlinear dimensionality reduction (ISOMAP, LLE, etc.; Riemannian metric)

Geometric Data Reduction

The general method of manifold learning takes the following spectral kernel embedding approach:
   • construct a neighborhood graph of data, G
   • construct a positive semi-definite kernel on graphs, K
   • find global embedding coordinates of data by eigen-decomposition of $K = Y Y^T$

• Sometimes the ‘distance metric’ is just a similarity measure (nonmetric MDS, ordinal embedding)
• Sometimes coordinates are not a good way to organize/visualize the data (e.g. d > 3)
• Sometimes all that is required is a qualitative view
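As a minimal sketch of the eigen-decomposition step, the following pure-Python code runs classical MDS on a full distance matrix: it double-centers the squared distances to obtain a positive semi-definite kernel K and extracts the leading embedding coordinate by power iteration. The function name `classical_mds_1d`, the power-iteration solver, and the toy distances are illustrative assumptions, not part of the slides; a neighborhood-graph method such as ISOMAP would instead feed in graph shortest-path distances.

```python
import math

def classical_mds_1d(dist, iters=200):
    """Leading classical-MDS coordinate from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering: K = -1/2 * J * D^2 * J, which equals Y Y^T for Euclidean data
    K = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector of K
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue
    lam = sum(v[i] * sum(K[i][j] * v[j] for j in range(n)) for i in range(n))
    # Embedding coordinate = sqrt(eigenvalue) * eigenvector
    return [math.sqrt(max(lam, 0.0)) * x for x in v]

# Hypothetical toy data: three collinear points at positions 0, 1, 3
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
y = classical_mds_1d(D)
```

The recovered coordinates reproduce the input distances up to translation and sign, which is all an embedding built from distances can determine.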

Figure: Clustering the data

Figure: Cluster trees under average, complete, and single linkage. From Introduction to Statistical Learning with Applications in R.

Hierarchical Cluster Trees

1 Start with each data point as its own cluster;
2 Repeatedly merge the two “closest” clusters, where the “distance” between two clusters is given by:
   • Single linkage: closest pair of points
   • Complete linkage: furthest pair of points
   • Average linkage (several variants):
      (i) distance between centroids
      (ii) average pairwise distance
      (iii) Ward’s method: increase in k-means cost due to merger
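The two steps above can be sketched in a few lines of pure Python. This is a naive illustration rather than an efficient implementation; the function name `agglomerative`, the toy points, and the choice of variant (ii) for average linkage are assumptions made for the example.

```python
import math

def agglomerative(points, linkage="single"):
    """Merge clusters bottom-up and return the merge heights in order."""
    def cluster_distance(a, b):
        pairwise = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # closest pair of points
            return min(pairwise)
        if linkage == "complete":    # furthest pair of points
            return max(pairwise)
        if linkage == "average":     # variant (ii): average pairwise distance
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    # Step 1: every point starts as its own cluster (tuples of indices)
    clusters = [(i,) for i in range(len(points))]
    heights = []
    # Step 2: repeatedly merge the two closest clusters
    while len(clusters) > 1:
        candidates = [
            (cluster_distance(a, b), a, b)
            for k, a in enumerate(clusters)
            for b in clusters[k + 1:]
        ]
        height, a, b = min(candidates, key=lambda t: t[0])
        clusters = [c for c in clusters if c != a and c != b] + [a + b]
        heights.append(height)
    return heights

# Hypothetical toy data: four points on a line
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
single = agglomerative(points, "single")
complete = agglomerative(points, "complete")
average = agglomerative(points, "average")
```

On the same data the three linkages merge the same clusters here but at different heights, which is why the resulting cluster trees can look quite different.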

Figure: Define a graph or network structure

Topology

Origins of Topology in Math
• Leonhard Euler, 1736: Seven Bridges of Königsberg
• Johann Benedict Listing, 1847: Vorstudien zur Topologie
• J. B. Listing (obituary), Nature 27:316-317, 1883: “qualitative geometry from the ordinary geometry in which quantitative relations chiefly are treated.”

RNA hairpin folding pathways

Figure: Jointly with Xuhui Huang, Jian Sun, Greg Bowman, Gunnar Carlsson, Leo Guibas, and Vijay Pande, JACS’08, JCP’09

Differentiation process from murine embryonic stem cells to motor neurons

Figure: Mapper graph of single cell data, where the different regions in the Mapper graph nicely line up with different points along the differentiation timeline. Rizvi et al. Nature Biotechnol. 35.6 (2017), 551-560.

Key elements

• Coordinate-free representation
• Invariance under deformations
• Compressed qualitative representation

Topology in continuous spaces

• To regard points in a neighborhood as the same requires distortion of distances, i.e. stretching and shrinking
• We do not permit tearing, i.e. distorting distances in a discontinuous way

Continuous Topology

Figure: Homeomorphic


Figure: Homeomorphic

Discrete case?

How does topology make sense in a discrete and noisy setting?

Properties of Data Geometry

Fact: We Don’t Trust Large Distances!

• In the life or social sciences, distances (metrics) are constructed using a notion of similarity (proximity) but have no theoretical backing (e.g. distance between faces, gene expression profiles, Jukes-Cantor distance between sequences)
• Small distances still represent similarity (proximity), but long-distance comparisons hardly make sense

Fact: We Only Trust Small Distances a Bit!

• Both pairs are regarded as similar, but the strength of the similarity as encoded by the distance may not be so significant
• Similar objects lie in a neighborhood of each other, which suffices to define a topology

Fact: Even Local Connections are Noisy, depending on the observer’s scale!

• Is it a circle, dots, or a circle of circles?
• To see the circle, we ignore variations in small distances (tolerance for proximity)

So we need topology that is robust against metric distortions

• Distance measurements are noisy
• Physical devices such as human eyes may ignore differences in proximity (or register them as an average effect)
• Topology is the crudest way to capture invariants under distortions of distances
• In the presence of noise, one needs topology that varies with scale

What kind of topology?

Topology studies (global) mappings between spaces
• Point-set topology: continuous mappings on open sets
• Differential topology: differentiable mappings on smooth manifolds
   • Morse theory tells us that the topology of a continuous space can be learned from discrete information on critical points
• Algebraic topology: homomorphisms on algebraic structures, the most concise encoder for topology
• Combinatorial topology: mappings on simplicial (cell) complexes
   • Simplicial complexes may be constructed from data
   • Algebraic and differential structures can be defined here

Topological Data Analysis

What kind of topological information is often useful?
   • 0-homology: clustering or connected components
   • 1-homology: coverage of sensor networks; paths in robotic planning
   • 1-homology as obstructions: inconsistency in statistical ranking; harmonic flow games
   • high-order homology: high-order connectivity?

How to compute homology in a stable way?
   • simplicial complexes for data representation
   • filtration on simplicial complexes
   • persistent homology
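To make the 0-homology item concrete, here is a minimal sketch that counts connected components of the neighborhood graph at a chosen scale, using a union-find structure. The function name `betti0`, the threshold parameter `eps`, and the toy points are assumptions for illustration; running it at several scales already hints at why a filtration is needed.

```python
import math

def betti0(points, eps):
    """Number of connected components when points within eps are joined."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})

# Hypothetical toy data: two well-separated pairs of points
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
counts = [betti0(points, eps) for eps in (0.5, 1.0, 4.0)]
```

The component count drops from four to two to one as the scale grows, so no single threshold tells the whole story.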


Simplicial Complexes for Data Representation

Definition (Simplicial Complex). An abstract simplicial complex is a collection $\Sigma$ of subsets of $V$ which is closed under inclusion (or deletion), i.e. if $\tau \in \Sigma$ and $\sigma \subseteq \tau$, then $\sigma \in \Sigma$.

• Chess-board complex
• Term-document co-occurrence complex
• Nerve complex
   • Point cloud data in metric spaces: Čech, Rips, Witness complexes
   • Mayer-Vietoris blowup
• Clique complex in pairwise comparison graphs
• Strategic complex in game theory
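The closure condition in the definition can be checked mechanically. The sketch below represents simplices as frozensets of vertices and, by a convention assumed here, leaves out the empty simplex; the helper names `closure` and `is_simplicial_complex` are hypothetical.

```python
from itertools import combinations

def closure(simplices):
    """All nonempty faces of the given simplices (their downward closure)."""
    faces = set()
    for s in simplices:
        vertices = tuple(s)
        for k in range(1, len(vertices) + 1):
            faces.update(frozenset(c) for c in combinations(vertices, k))
    return faces

def is_simplicial_complex(collection):
    """True if every nonempty proper face of every member is also a member."""
    sigma = {frozenset(s) for s in collection}
    return all(
        frozenset(face) in sigma
        for s in sigma
        for k in range(1, len(s))
        for face in combinations(s, k)
    )

triangle = closure([(1, 2, 3)])              # full triangle: 3 vertices, 3 edges, 1 face
edge_only = [frozenset({1, 2})]              # an edge without its endpoints
```

Storing only the maximal simplices and generating the faces on demand, as `closure` does, is a common way to keep the representation small.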

Chess-board Complex

Definition (Chess-board Complex). Let $V$ be the positions on a chess board. $\Sigma$ collects the subsets of $V$ on which one can place queens (rooks) without them capturing each other.

Closedness under deletion: if $\sigma \in \Sigma$ is a set of “safe” positions, then any subset $\tau \subseteq \sigma$ is also a set of “safe” positions.
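As a small worked instance, the sketch below enumerates the rook version of the chess-board complex on an n × n board by brute force and confirms closedness under deletion. The function name `rook_complex` and the board sizes are assumptions for illustration; the queen version would only need a stricter `safe` test.

```python
from itertools import combinations

def rook_complex(n):
    """Nonempty sets of cells on an n x n board where no two rooks attack."""
    cells = [(r, c) for r in range(n) for c in range(n)]

    def safe(placement):
        rows = {r for r, _ in placement}
        cols = {c for _, c in placement}
        # Safe iff all rooks sit in distinct rows and distinct columns
        return len(rows) == len(placement) == len(cols)

    # At most n non-attacking rooks fit on the board
    return [
        frozenset(p)
        for k in range(1, n + 1)
        for p in combinations(cells, k)
        if safe(p)
    ]

complex3 = rook_complex(3)
members = set(complex3)
# Closedness under deletion: removing any rook from a safe placement stays safe
closed = all(
    s - {cell} in members
    for s in complex3 if len(s) > 1
    for cell in s
)
```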

Term-Document Co-occurrence Complex

• Left is a term-document co-occurrence matrix
• Right is a simplicial complex representation of terms
• Connectivity analysis captures more information than Latent Semantic Index (Li & Kwong 2009)

Figure: Excerpt from Journal of the American Society for Information Science and Technology, March 2010, p. 597 (DOI: 10.1002/asi). Recoverable content of the excerpt:

Example 3. Consider the following matrix (8) with six rows r1, r2, ..., r6 and five columns c1, c2, ..., c5:

        c1  c2  c3  c4  c5
   r1    1   0   0   0   0
   r2    1   1   1   0   0
   r3    0   0   1   1   0
   r4    0   0   1   1   0
   r5    0   0   0   0   1
   r6    0   0   0   0   1

For row r1, the column c1 contains a “1” and the other columns contain “0,” so r1 is associated with the 0-simplex $\sigma^0_{(r_1)} = (c_1)$. The remaining rows give

   $\sigma^2_{(r_2)} = (c_1, c_2, c_3)$, $\sigma^1_{(r_3)} = (c_3, c_4)$, $\sigma^1_{(r_4)} = (c_3, c_4)$, $\sigma^0_{(r_5)} = (c_5)$, $\sigma^0_{(r_6)} = (c_5)$.   (9)

Definition 3. Two simplexes $\sigma_a$ and $\sigma_b$ in a simplicial family are q-near if they have a common q-face. They are q-connected if there exists a sequence $\sigma_1, \sigma_2, \ldots, \sigma_j$ of distinct simplexes of the family such that $\sigma_1 = \sigma_a$, $\sigma_j = \sigma_b$, and $\sigma_i$ is $q_i$-near to $\sigma_{i+1}$ for all $1 \le i \le j-1$, with $q = \min\{q_i\}$. If two simplexes are q-connected, then they are also (q−1)-connected for (q−1) ≥ 0.

Example 4. The simplexes $\sigma^0_{(r_1)}$ and $\sigma^2_{(r_2)}$ are 0-near, $\sigma^1_{(r_3)}$ and $\sigma^1_{(r_4)}$ are 1-near, and $\sigma^0_{(r_5)}$ and $\sigma^0_{(r_6)}$ are 0-near. However, $\sigma^0_{(r_5)}$ and $\sigma^0_{(r_6)}$ are not connected to any of the other four simplexes.

Definition 4. The relation “is q-connected to” is an equivalence relation on the simplexes of dimension greater than or equal to q; its equivalence classes are called the q-connected components, and $Q_q$ denotes their number. Determining the components and $Q_q$ for each value of q is termed a Q-analysis.
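The connectivity analysis on the six-row, five-column term-document matrix of the example above can be sketched as follows. Each row becomes a simplex on the columns where it has a 1, and two simplices are treated as q-near when they share at least q + 1 vertices (that is, a common q-face). The function name `q_components` and the union-find bookkeeping are assumptions for illustration, not the authors' implementation.

```python
# Rows of the example matrix (r1..r6) over columns c1..c5
matrix = {
    "r1": [1, 0, 0, 0, 0],
    "r2": [1, 1, 1, 0, 0],
    "r3": [0, 0, 1, 1, 0],
    "r4": [0, 0, 1, 1, 0],
    "r5": [0, 0, 0, 0, 1],
    "r6": [0, 0, 0, 0, 1],
}
columns = ["c1", "c2", "c3", "c4", "c5"]

# Each row generates the simplex spanned by the columns holding a 1
simplices = {
    row: frozenset(c for c, bit in zip(columns, bits) if bit)
    for row, bits in matrix.items()
}

def q_components(simplices, q):
    """Number of q-connected components among simplices of dimension >= q."""
    names = [name for name, s in simplices.items() if len(s) - 1 >= q]
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for k, a in enumerate(names):
        for b in names[k + 1:]:
            # q-near: a common q-face means at least q + 1 shared vertices
            if len(simplices[a] & simplices[b]) >= q + 1:
                parent[find(a)] = find(b)
    return len({find(name) for name in names})

# Component counts Q_0, Q_1, Q_2 for this example
Q = [q_components(simplices, q) for q in range(3)]
```

For this matrix the rows r5 and r6 form their own 0-connected component, matching the statement that they are not connected to the other four simplexes.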

Nerve complex

Definition (Nerve Complex). Define a cover of $X$, $X = \cup_\alpha U_\alpha$. Let $V = \{U_\alpha\}$ and define $\Sigma = \{U_I : \cap_{\alpha \in I} U_\alpha \neq \emptyset\}$.

• Closedness under deletion
• Can be applied to any topological space $X$
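For a finite cover the nerve can be computed directly from the definition. The sketch below assumes the cover is given as a dictionary from labels to sets of sample points and lists every index set whose members have a common point, up to a chosen dimension; the function name `nerve`, the `max_dim` parameter, and the six-point circle sample are hypothetical.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Index sets I (as tuples) whose cover elements have a common point."""
    labels = sorted(cover)
    simplices = []
    for k in range(1, max_dim + 2):          # k labels span a (k-1)-simplex
        for idx in combinations(labels, k):
            common = set.intersection(*(set(cover[a]) for a in idx))
            if common:
                simplices.append(idx)
    return simplices

# Six sample points 0..5 around a circle, covered by three overlapping arcs
circle_cover = {
    "A": {0, 1, 2},
    "B": {2, 3, 4},
    "C": {4, 5, 0},
}
circle_nerve = nerve(circle_cover)
vertices = [s for s in circle_nerve if len(s) == 1]
edges = [s for s in circle_nerve if len(s) == 2]
triangles = [s for s in circle_nerve if len(s) == 3]
```

The three arcs meet pairwise but have no common point, so the nerve is a hollow triangle, which has the shape of the circle being covered.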

Nerve Theorem

Theorem (Nerve Theorem). Consider the nerve complex of $X$,

$\Sigma = \{U_I : \cap_{\alpha \in I} U_\alpha \neq \emptyset\}, \quad X = \cup_\alpha U_\alpha.$

If every nonempty intersection $\cap_{\alpha \in I} U_\alpha$ is contractible, then $X$ has the same homotopy type as $\Sigma$.

Nerve complex example

Figure: Covering of circle

Figure: Create nodes

Figure: Create edges, which gives a nerve complex (graph)

Nerve of Seven Bridges of Königsberg

Figure: Nerve graph of Seven Bridges of Königsberg

Point cloud data

• Now given point cloud data $X = \{x_1, \ldots, x_n\}$ and a covering $V = \{U_\alpha\}$, where each $U_\alpha$ is a cluster of data
• Build a simplicial complex (nerve) in the same way, but with components replaced by clusters

Mapping

How to choose coverings?
• Create a reference map (or filter) $h : X \to Z$, where $Z$ is a topological space often with interesting metrics (e.g. $\mathbb{R}$, $\mathbb{R}^2$, $S^1$, etc.), and a covering $\mathcal{U} = \{U_\alpha\}$ of $Z$; then construct the covering of $X$ using the inverse map, $\{h^{-1}(U_\alpha)\}$.

Example: Morse Theory and Reeb graph

• a nice (Morse) function $h : X \to \mathbb{R}$ on a smooth manifold $X$
• the topology of $X$ is reconstructed from the level sets $h^{-1}(t)$
• the topology of $h^{-1}(t)$ only changes at ‘critical values’
• Reeb graph: a simplified version, contracting into points the connected components in $h^{-1}(t)$

Figure: Construction of Reeb graph; h maps each point on torus to its height.

Mapper: from Continuous to Discrete...

Figure: An illustration of Mapper.

Note: degree-one nodes contain local minima/maxima; degree-three nodes contain saddle points (critical points); degree-two nodes consist of regular points

Mapper algorithm

[Singh-Memoli-Carlsson. Eurograph-PBG, 2007] Given a data set $X$,
• choose a filter map $h : X \to Z$, where $Z$ is a topological space such as $\mathbb{R}$, $S^1$, $\mathbb{R}^d$, etc.
• choose a cover $Z \subseteq \cup_\alpha U_\alpha$
• cluster/partition the level sets $h^{-1}(U_\alpha)$ into $V_{\alpha,\beta}$
• graph representation: a node for each $V_{\alpha,\beta}$, an edge between $(V_{\alpha_1,\beta_1}, V_{\alpha_2,\beta_2})$ iff $U_{\alpha_1} \cap U_{\alpha_2} \neq \emptyset$ and $V_{\alpha_1,\beta_1} \cap V_{\alpha_2,\beta_2} \neq \emptyset$
• extendable to a simplicial complex representation

Note: it extends the Reeb graph from $\mathbb{R}$ to a general topological space $Z$; it may lead to a particular implementation of the Nerve theorem through the filter map $h$.
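The steps of the algorithm can be sketched for a one-dimensional filter as follows. This is a toy illustration under stated assumptions, not the reference implementation of the cited paper: the cover is a hand-picked list of overlapping intervals, clustering is single linkage at a fixed threshold `eps`, and the names `single_linkage_clusters` and `mapper_graph` are hypothetical.

```python
import math

def single_linkage_clusters(indices, points, eps):
    """Connected components of the eps-neighborhood graph on the given indices."""
    indices = list(indices)
    seen, clusters = set(), []
    for start in indices:
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            i = stack.pop()
            component.add(i)
            for j in indices:
                if j not in seen and math.dist(points[i], points[j]) <= eps:
                    seen.add(j)
                    stack.append(j)
        clusters.append(frozenset(component))
    return clusters

def mapper_graph(points, h, intervals, eps):
    """Nodes are clusters of the preimages h^{-1}(U); edges join overlapping clusters."""
    nodes = []
    for lo, hi in intervals:
        preimage = [i for i, p in enumerate(points) if lo <= h(p) <= hi]
        nodes.extend(single_linkage_clusters(preimage, points, eps))
    # Clusters sharing a data point necessarily come from intersecting intervals,
    # so testing the cluster intersection covers both conditions for an edge.
    edges = [
        (a, b)
        for a in range(len(nodes))
        for b in range(a + 1, len(nodes))
        if nodes[a] & nodes[b]
    ]
    return nodes, edges

# Hypothetical toy data: 12 points on the unit circle, filtered by height
points = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
          for k in range(12)]
intervals = [(-1.01, -0.2), (-0.6, 0.6), (0.2, 1.01)]   # overlapping cover of [-1, 1]
nodes, edges = mapper_graph(points, lambda p: p[1], intervals, eps=0.6)
degrees = [sum(1 for e in edges if k in e) for k in range(len(nodes))]
```

With the height filter the middle interval splits into a left and a right cluster, so the resulting graph is a four-node cycle that recovers the loop of the circle. Changing the interval overlap or `eps` can change the graph, so these parameters deserve some experimentation.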

In applications

The Reeb graph has found various applications in computational geometry and statistics under different names:
• computer science: contour trees, Reeb graphs
• statistics: density cluster trees (Hartigan)

Reference Mapping

Typical one-dimensional filters/mappings:
• Density estimators
• Measures of data (ec-)centrality: e.g. $\sum_{x' \in X} d(x, x')^p$
• Geometric embeddings: PCA/MDS, manifold learning, diffusion maps, etc.
• Response variable in statistics: progression stage of disease, etc.
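The centrality filter in the list above is simple to compute for small data sets. The sketch below evaluates the sum of p-th powers of distances from each point to all others, assuming Euclidean distance; the function name `eccentricity` and the toy points are illustrative assumptions.

```python
import math

def eccentricity(points, p=1):
    """Filter value sum_{x'} d(x, x')^p for every point x (Euclidean distance assumed)."""
    return [sum(math.dist(x, y) ** p for y in points) for x in points]

# Hypothetical toy data: three points on a line; the middle one is the most central
points = [(0.0,), (1.0,), (2.0,)]
ecc1 = eccentricity(points, p=1)
ecc2 = eccentricity(points, p=2)
```

Central points receive small values and peripheral points large ones, so level sets of this filter peel the data from the core outward.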

Example: RNA Tetraloop

Biological relevance:
• serve as nucleation sites for RNA folding
• form sequence-specific tertiary interactions
• protein recognition sites
• certain tetraloops can pause RNA transcription

Note: a simple system, but there are biological debates over intermediate states on folding pathways

Figure: RNA GCAA-Tetraloop

Debates: Two-state vs. Multi-state Models

• 2-state: transition state with any one stem base pair, from thermodynamic experiments [Ansari A, et al. PNAS, 2001, 98: 7771-7776]
• multi-state: there is a stable intermediate state, which contains collapsed structures, from kinetic measurements [Ma H, et al. PNAS, 2007, 104:712-6]
• experiments: no structural information
• computer simulations at full-atom resolution:
   • existence of intermediate states
   • if yes, what’s the structure?

Figure: (a) 2-state model; (b) multi-state model

MD Simulation by Folding@Home

[Bowman, Huang, Y., Sun, ... Vijay. JACS, 2008]
• 2800 SREMD (Serial Replica Exchange Molecular Dynamics) simulations with RNA hairpin (5’-GGGCGCAAGCCU-3’)
• 389 RNA atoms, ∼4000 water and 11 Na+
• SREMD random walks in temperature space (56 ladders from 285K to 646K) with molecular dynamics trajectories
• 210,000 ns simulations with ∼105,000,000 configurations

Figure: Simulation box.

Unfortunately, sampling still not converged!

Dimensionality Reduction using Contact Map

• Massive volume and high dimensionality: 100M samples in 12K Cartesian coordinates ⇒ contact maps as 55-bit strings
• Samples are not in equilibrium distribution
• Looking for a needle in a haystack:
   • intermediates/transition states of interest are of low density
   • folded/unfolded states are dominant

Figure: Left: NMR structure of the GCAA tetraloop. Right: Contact map.

Mapper with density filters in biomolecular folding

Reference: Bowman-Huang-Yao et al. J. Am. Chem. Soc. 2008; Yao, Sun, Huang, et al. J. Chem. Phys. 2009.
• densest regions (energy basins) may correspond to metastates (e.g. folded, extended)
• intermediate/transition states on pathways connecting them are relatively sparse

Therefore, with Mapper:
• clustering on density level sets helps separate and identify metastates and intermediate/transition states
• the graph representation reflects kinetic connectivity between states

A vanilla version

Figure: Mapper flow chart: kernel matrix $K = (\exp(-d_{ij}))_{i,j=1}^{n}$, row sum, clustering, graph.

1 Kernel density estimation $h(x) = \sum_i K(x, x_i)$ with Hamming distance for contact maps
2 Rank the data by $h$ and divide the data into $n$ overlapping sets
3 Single-linkage clustering on each level set
4 Graphical representation
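Steps 1 and 2 of this pipeline can be sketched as below, assuming each contact map is stored as a Python integer whose bits are the contacts. The kernel exp(−d) follows the flow chart; the function names, the `overlap` fraction, and the toy bit strings are assumptions made for illustration.

```python
import math

def hamming(a, b):
    """Hamming distance between two contact maps stored as integer bit strings."""
    return bin(a ^ b).count("1")

def density(samples):
    """Row sums of the kernel matrix K_ij = exp(-d_ij) with Hamming distance d."""
    return [sum(math.exp(-hamming(x, y)) for y in samples) for x in samples]

def overlapping_rank_sets(values, n_sets, overlap=0.5):
    """Rank indices by value and cut them into n_sets consecutive, overlapping blocks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = math.ceil(len(order) / n_sets)
    extra = int(size * overlap)              # how far each block reaches into the next
    return [order[k * size:(k + 1) * size + extra] for k in range(n_sets)]

# Hypothetical toy data: three short "contact maps", two of them identical
samples = [0b0, 0b0, 0b1]
h = density(samples)
blocks = overlapping_rank_sets([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], n_sets=3)
```

Each block would then be clustered by single linkage and the clusters linked wherever they share samples, as in steps 3 and 4. The full kernel matrix costs quadratic time, so a data set of the size quoted above would need subsampling or an approximate density estimate.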

Mapper output for Unfolding Pathways

Figure: Unfolding pathway

Mapper output for Refolding Pathways

Figure: Refolding pathway

Example: Progression of Breast Cancer

• We study samples of expression data in $\mathbb{R}^n$ ($n = 262$) from 295 breast cancers as well as additional samples from normal breast tissue.
   • The distance metric was given by the correlation between (projected) expression vectors.
   • The filter function used was a measure taking values in $\mathbb{R}$ of the deviation of the expression of the tumor samples relative to normal controls (l2-eccentrality).
   • The cover was overlapping intervals in $\mathbb{R}$.
• Two branches of breast cancer progression are discovered.

Progression of Breast Cancer: l2-eccentrality Mapping

Figure: Diagram of gene expression profiles for breast cancer. M. Nicolau, A. Levine, and G. Carlsson, PNAS 2011.

Note: Progression of Breast Cancer

The lower right branch itself has a subbranch (referred to as c-MYB+ tumors), which are some of the most distinct from normal and are characterized by high expression of genes including c-MYB, ER, DNALI1 and C9ORF116. Interestingly, all patients with c-MYB+ tumors had very good survival and no metastasis. These tumors do not correspond to any previously known breast cancer subtype; the grouping seems to be invisible to classical hierarchical clustering methods.

Example: differentiation process using single cell data

Figure 2.31: Over time, embryonic stem cells differentiate into distinct cell types. These pictures (panels Day 2 to Day 6) capture the in vitro differentiation of mouse embryonic stem cells into motor neurons over the course of a week. Embryonic stem cells are marked in red, and fully differentiated neurons in green. Figure from experiment performed by Elena Kandror, Abbas Rizvi and Tom Maniatis at Columbia University.

Figure 2.32: The different regions in the Mapper graph (pluripotent cells, neural precursors, progenitors, neurons; panels colored by log2(1+TPM) expression of Group 1a, Group 1b, Group 2, and Group 3 genes) nicely line up with different points along the differentiation timeline. Source: [431].

Effectively, the issue is that a mismatch between the scale of change in the data and the width of the overlap of inverse images can give rise to dramatic changes in the Mapper graph in response to small shifts in filter function or cover. (See Figure 2.33 for a representative example of this phenomenon.) There are various different approaches to handling this instability in practice.

Applications of Mapper Graph Diﬀerentiation process visualization by Mapper

Over time, undifferentiated embryonic cells become differentiated motor neurons when retinoic acid and sonic hedgehog (a differentiation-promoting protein) are applied.

Mapper graph of differentiation process from murine embryonic stem cells to motor neurons:
• The data generated corresponds to RNA expression profiles from roughly 2000 single cells.
• The distance metric was provided by correlation between expression vectors.
• The filter function used was multidimensional scaling (MDS) projection into $\mathbb{R}^2$.
• The cover was overlapping rectangles in $\mathbb{R}^2$.
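The Mapper ingredients listed above (distance, filter function, overlapping cover, clustering in each preimage) can be sketched on toy data. This is only an illustration under stated assumptions, not the pipeline used in the study: the filter is a 1-D height function instead of an MDS projection into the plane, the cover consists of three hand-picked overlapping intervals instead of rectangles, clustering is single linkage at a fixed scale `eps`, and the names `single_linkage_clusters` and `mapper_graph` are hypothetical.

```python
import math
from itertools import combinations

def single_linkage_clusters(indices, points, eps):
    """Partition `indices` into connected components of the eps-neighborhood graph."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(indices, 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    groups = {}
    for i in indices:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())

def mapper_graph(points, filter_values, cover, eps):
    """Nodes are clusters inside each preimage; edges join clusters sharing a point."""
    nodes = []
    for lo, hi in cover:
        members = [i for i, f in enumerate(filter_values) if lo <= f <= hi]
        nodes.extend(single_linkage_clusters(members, points, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges

# Toy data (hypothetical): 12 points on the unit circle, filtered by height.
circle = [(math.cos(math.radians(30 * k)), math.sin(math.radians(30 * k)))
          for k in range(12)]
heights = [y for _, y in circle]
cover = [(-1.1, -0.4), (-0.6, 0.6), (0.4, 1.1)]  # three overlapping intervals
nodes, edges = mapper_graph(circle, heights, cover, eps=0.6)
print(len(nodes), len(edges))
```

On this toy circle the middle interval splits into a left and a right cluster, so the sketch yields four nodes joined in a 4-cycle, which reflects the loop in the data.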

Applications of Mapper Graph Mapper Graph of Differentiation Process

Figure: The different regions in the Mapper graph nicely line up with different points along the differentiation timeline. Rizvi et al. Nature Biotechnol. 35.6 (2017), 551-560.

Applications of Mapper Graph

Example: Brain Tumor

Figure: A patient with two focal glioblastomas, on the left and right hemispheres. After surgery and standard treatment, the tumor reappeared on the left side. Genomic analysis shows that the initial tumors were seeded by two independent, but related clones. The recurrent tumor was genetically similar to the left one. Jin-Ku Lee et al. Nature Genetics 49.4 (2017): 594-599.

Single cell techniques provide the means to study heterogeneous cell populations. The following example studies the mutational and transcriptional profile of a multicentric glioblastoma. Multicentric glioblastomas represent tumors that occur in multiple discrete areas in the brain. In this particular case, at diagnosis, the tumor presented two focal points, on the left and on the right brain frontal lobes. After surgery, chemoradiotherapy, and EGFR targeted therapy, the tumor recurred on the left side. Different samples were taken from the initial left and right loci and two samples at recurrence. The history of this tumor was then reconstructed using genomic sequencing from each of the biopsies. The genetic characterization shows that the right tumor shares most but not all genetic alterations with the left tumor, indicating a common origin for the two clones that seeded the left and right tumors.

Applications of Mapper Graph Mapper Graph of Single Cell Seq.

Figure 7.2: Single cell RNA-seq allows the spatial and temporal study of the structure of tumors. This is a particular case of a patient with two focal glioblastomas, on the left and right hemispheres. After surgery and standard treatment, the tumor reappeared on the left side. Genomic analysis (on the left) shows that the initial tumors were seeded by two independent, but related clones. The recurrent tumor was genetically similar to the one on the left. The expression profiles from single cells from the two foci at diagnosis and the relapse recapitulate the clonal history. Transcriptionally and genetically, the recurrence resembles the left parental tumor. A small subset of the cells in the initial left tumor show a similar transcription profile as the recurrent tumor, suggesting that the resistant population originated from a subclonal population in the original tumor. (Panel labels: Left, Right, Recurrence; samples GBM9-1, GBM9-2, GBM9-R1, GBM9-R2; expression subtypes Proneural, Neural, Classical, Mesenchymal.) Source: [320]. From Jin-Ku Lee et al., Spatiotemporal genomic architecture informs precision oncology in glioblastoma, Nature Genetics 49.4 (2017): 594-599. © 2017. Reprinted with permission from Springer Nature.

Applications of Mapper Graph Note: Mapper Graph

Using Mapper, one can appreciate a more continuous structure that recapitulates the clonal and genetic history.
• The tumor on the right appears to be transcriptionally distinct from the left tumor and the recurrence tumor.
• Expression profiles from cells in the recurrence tumor resembled the originating initial tumor.
• This is an important finding, as it shows a continued progression at the expression level, with a few cells at diagnosis having a similar pattern as cells at relapse.
• It also shows that EGFR mutation is a subclonal event, occurring only in the tumor at diagnosis that is not responsible for the relapse.
So tumors with heterogeneous populations of cells are less sensitive to specific therapies which target a subpopulation.


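Čech, Vietoris-Rips, and Witness Complexes Čech complex

Definition (Čech Complex $C_\epsilon$)

In a metric space $(X, d)$, define a cover of $X$, $X = \bigcup_\alpha U_\alpha$, where $U_\alpha = B_\epsilon(t_\alpha) := \{x \in X : d(x, t_\alpha) \le \epsilon\}$. Let $V = \{U_\alpha\}$ and define $\Sigma = \{U_I \subseteq V : \bigcap_{\alpha \in I} U_\alpha \neq \emptyset\}$.
• Closedness under deletion
• Can be applied to any metric space $X$

Nerve Theorem: if every nonempty intersection $\bigcap_{\alpha \in I} U_\alpha$ is contractible, then $X$ has the same homotopy type as $\Sigma$.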


Čech, Vietoris-Rips, and Witness Complexes Example: Čech Complex

Figure: Čech complex of a circle, $C_\epsilon$, covered by a set of balls.


Čech, Vietoris-Rips, and Witness Complexes Vietoris-Rips complex

Čech complex is hard to compute, even in Euclidean space
One can easily compute an upper bound for Čech complex
• Construct a Čech subcomplex of 1-dimension, i.e. a graph with edges connecting point pairs whose distance is no more than $\epsilon$.
• Find the clique complex, i.e. maximal complex whose 1-skeleton is the graph above, where every $k$-clique is regarded as a $(k-1)$-simplex

Definition (Vietoris-Rips Complex) Let $V = \{x_\alpha \in X\}$. Define $VR_\epsilon = \{U_I \subseteq V : d(x_\alpha, x_\beta) \le \epsilon, \ \alpha, \beta \in I\}$.
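The two-step construction (neighborhood graph, then clique complex) can be sketched in a few lines. This is a brute-force illustration that enumerates all small subsets, assuming Euclidean points and simplices capped at dimension `max_dim`; the function name `rips_complex` is hypothetical, and practical software uses far more efficient clique enumeration.

```python
import math
from itertools import combinations

def rips_complex(points, eps, max_dim=2):
    """Simplices of VR_eps up to dimension max_dim, as sorted index tuples."""
    n = len(points)

    def close(i, j):
        return math.dist(points[i], points[j]) <= eps

    simplices = [(i,) for i in range(n)]
    # A (k-1)-simplex is a k-subset whose points are pairwise within eps.
    for k in range(2, max_dim + 2):
        for subset in combinations(range(n), k):
            if all(close(i, j) for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices

# Corners of the unit square: at eps = 1 only the four sides are edges.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
cx = rips_complex(square, eps=1.0)
edges = [s for s in cx if len(s) == 2]
print(edges)
```

At `eps=1.0` the diagonals (length about 1.414) are excluded, so the complex is a hollow square; raising `eps` to 1.5 adds both diagonals and all four triangles.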


Čech, Vietoris-Rips, and Witness Complexes Example: Rips Complex

Figure: Left: Čech complex gives a circle; Right: Rips complex gives a sphere $S^2$.


Čech, Vietoris-Rips, and Witness Complexes Generalized Vietoris-Rips for Symmetric Relations

Definition (Symmetric Relation Complex) Let $V$ be a set and $R = \{(u, v)\} \subseteq V^2$ a symmetric relation, i.e. $(u, v) \in R \Rightarrow (v, u) \in R$. $\Sigma$ collects subsets of $V$ which are in pairwise relations.

• Closedness under deletion: if $\sigma \in \Sigma$ is a set of related items, then any subset $\tau \subseteq \sigma$ is a set of related items
• Generalized Vietoris-Rips complex beyond metric spaces
• E.g. Zeeman’s tolerance space
• C.H. Dowker defines simplicial complex for unsymmetric relations


Čech, Vietoris-Rips, and Witness Complexes Sandwich Theorems

• Rips is easier to compute than Čech; even so, Rips is exponential in dimension generally
• However, Vietoris-Rips cannot preserve the homotopy type as Čech does
• But there is still hope to find a lower bound on homology:

Theorem (“Sandwich”)

$VR_\epsilon \subseteq C_\epsilon \subseteq VR_{2\epsilon}$

If a homology group “persists” through $VR_\epsilon \to VR_{2\epsilon}$, then it must exist in $C_\epsilon$; but not vice versa.
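Both inclusions can be checked directly from the definitions. The following is a short sketch, assuming closed balls $B_\epsilon(t_\alpha)$ of radius $\epsilon$ for the Čech complex, the pairwise threshold $\epsilon$ for the Vietoris-Rips complex, and the same vertex set $\{t_\alpha\}$ for both; the symbol $\sigma$ for a simplex is introduced here for illustration.

```latex
% First inclusion: every vertex of sigma lies in all the balls.
\sigma = \{t_\alpha\}_{\alpha \in I} \in VR_\epsilon
  \;\Rightarrow\; d(t_\alpha, t_\beta) \le \epsilon \quad \forall\, \alpha, \beta \in I
  \;\Rightarrow\; t_\beta \in \bigcap_{\alpha \in I} B_\epsilon(t_\alpha) \neq \emptyset
  \;\Rightarrow\; \sigma \in C_\epsilon .

% Second inclusion: triangle inequality through a common point x.
\sigma \in C_\epsilon
  \;\Rightarrow\; \exists\, x \in \bigcap_{\alpha \in I} B_\epsilon(t_\alpha)
  \;\Rightarrow\; d(t_\alpha, t_\beta) \le d(t_\alpha, x) + d(x, t_\beta) \le 2\epsilon
  \;\Rightarrow\; \sigma \in VR_{2\epsilon} .

% Consequence for homology: the inclusion-induced map factors through the Cech complex.
H_k(VR_\epsilon) \to H_k(C_\epsilon) \to H_k(VR_{2\epsilon})
  \;\Rightarrow\;
  \operatorname{rank}\bigl(H_k(VR_\epsilon) \to H_k(VR_{2\epsilon})\bigr) \le \dim H_k(C_\epsilon) .
```

The last line is the lower bound on homology: classes that survive from $VR_\epsilon$ to $VR_{2\epsilon}$ are images of classes in $C_\epsilon$.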


Čech, Vietoris-Rips, and Witness Complexes A further simplification: Witness complex

Definition (Strong Witness Complex)

Let $V = \{t_\alpha \in X\}$. Define $W^s_\epsilon = \{U_I \subseteq V : \exists x \in X, \ \forall \alpha \in I, \ d(x, t_\alpha) \le d(x, V) + \epsilon\}$.

Definition (Weak Witness Complex)

Let $V = \{t_\alpha \in X\}$. Define $W^w_\epsilon = \{U_I \subseteq V : \exists x \in X, \ \forall \alpha \in I, \ d(x, t_\alpha) \le d(x, V_{-I}) + \epsilon\}$.

• $V$ can be a set of landmarks, much smaller than $X$
• Monotonicity: $W^*_\epsilon \subseteq W^*_{\epsilon'}$ if $\epsilon \le \epsilon'$
• But not easy to control homotopy types between $W^*$ and $X$

Čech, Vietoris-Rips, and Witness Complexes Strategic Simplicial Complex for Flow Games

• Strategic simplicial complex is the clique complex of pairwise comparison graph above, inspired by ranking
• Every game can be decomposed as the direct sum of potential games and zero-sum games (harmonic games) (Candogan, Menache, Ozdaglar and Parrilo 2010)

(a) Battle of the sexes

      O       F
O   3, 2    0, 0
F   0, 0    2, 3

(b) Modified battle of the sexes

      O       F
O   4, 2    0, 0
F   1, 0    2, 3

It is easy to see that these two games have the same pairwise comparisons, which will lead to identical equilibria for the two games: (O, O) and (F, F). It is only the actual equilibrium payoffs that would differ. In particular, in the equilibrium (O, O), the payoff of the row player is increased by 1.

The usual solution concepts in games (e.g., Nash, mixed Nash, correlated equilibria) are defined in terms of pairwise comparisons only. Games with identical pairwise comparisons share the same equilibrium sets. Thus, we refer to games with identical pairwise comparisons as strategically equivalent games. By employing the notion of pairwise comparisons, we can concisely represent any strategic-form game in terms of a flow in a graph. We recall this notion next.

Let $G = (N, L)$ be an undirected graph, with set of nodes $N$ and set of links $L$. An edge flow (or just flow) on this graph is a function $Y : N \times N \to \mathbb{R}$ such that $Y(p, q) = -Y(q, p)$ and $Y(p, q) = 0$ for $(p, q) \notin L$ [21, 2]. Note that the flow conservation equations are not enforced under this general definition.

Given a game $\mathcal{G}$, we define a graph where each node corresponds to a strategy profile, and each edge connects two comparable strategy profiles. This undirected graph is referred to as the game graph and is denoted by $G(\mathcal{G}) \triangleq (E, A)$, where $E$ and $A$ are the strategy profiles and pairs of comparable strategy profiles defined above, respectively. Notice that, by definition, the graph $G(\mathcal{G})$ has the structure of a direct product of $M$ cliques (one per player), with clique $m$ having $h_m$ vertices. The pairwise comparison function $X : E \times E \to \mathbb{R}$ defines a flow on $G(\mathcal{G})$, as it satisfies $X(p, q) = -X(q, p)$ and $X(p, q) = 0$ for $(p, q) \notin A$. This flow may thus serve as an equivalent representation of any game (up to a “non-strategic” component). It follows directly from the statements above that two games are strategically equivalent if and only if they have the same flow representation and game graph.

Two examples of game graph representations are given below.

Example 2.2. Consider again the “battle of the sexes” game from Example 2.1. The game graph has four vertices, corresponding to the direct product of two 2-cliques, and is presented in Figure 2.

Figure 2: Flows on the game graph corresponding to “battle of the sexes” (Example 2.2). The four vertices are (O, O), (O, F), (F, O), and (F, F); the edge flows shown are 2, 3, 2, and 3.

Example 2.3. Consider a three-player game, where each player can choose between two strategies $\{a, b\}$. We represent the strategic interactions among the players by the directed graph in Figure 3a, where the payoff of player $i$ is $-1$ if its strategy is identical to the strategy of its successor


3 Persistent Homology Betti Numbers Betti Number at Diﬀerent Scales Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches


Betti Numbers Betti Numbers: the number of i-dim holes


Figure: Sphere: β0 = 1, β1 = 0, β2 = 1, and βk = 0 for k ≥ 3


Betti Numbers Betti Numbers and Homology Groups

Betti numbers are computed as dimensions of Boolean vector spaces (E. Noether, $\mathbb{Z}_2$-homology group)
• $\beta_i(X) = \dim H_i(X, \mathbb{Z}_2)$, $\mathbb{Z}_2$-homology
• or more general homology group associated with any field or integral domain (e.g. $\mathbb{Z}$, $\mathbb{Q}$, and $\mathbb{R}$)

$H_i(X)$ is functorial, i.e. a continuous mapping $f : X \to Y$ induces a linear transformation $H_i(f) : H_i(X) \to H_i(Y)$, structure preserving
• computation is simple linear algebra over fields or integers
• data representation by simplicial complexes
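To make "simple linear algebra" concrete, here is a minimal sketch that computes $\mathbb{Z}_2$ Betti numbers from boundary-matrix ranks, using $\beta_k = \dim C_k - \operatorname{rank} \partial_k - \operatorname{rank} \partial_{k+1}$. It assumes the input list of simplices is closed under taking faces; the function names `rank_gf2` and `betti_numbers` are hypothetical, and matrix rows are stored as integer bitmasks so that row reduction over $\mathbb{Z}_2$ is just XOR.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are integer bitmasks."""
    basis = {}  # leading-bit position -> reduced row
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]  # eliminate the leading bit
    return len(basis)

def betti_numbers(simplices):
    """Z_2 Betti numbers of a simplicial complex (list of vertex tuples, face-closed)."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    ranks = {0: 0, top + 1: 0}  # rank of d_k : C_k -> C_{k-1}
    for k in range(1, top + 1):
        index = {face: i for i, face in enumerate(by_dim[k - 1])}
        rows = []
        for s in by_dim[k]:
            mask = 0
            for face in combinations(s, k):  # the k-vertex faces of s
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]

# Hollow triangle (a circle) and hollow tetrahedron (a sphere).
hollow_triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
sphere = [s for k in (1, 2, 3) for s in combinations(range(4), k)]
print(betti_numbers(hollow_triangle), betti_numbers(sphere))
```

The hollow tetrahedron reproduces the sphere values $\beta_0 = 1$, $\beta_1 = 0$, $\beta_2 = 1$ quoted on the earlier slide.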


Betti Number at Different Scales Topology at Different Scales

Is it a circle, dots, or circle of circles? How to find robust topology at different scales?


Betti Number at Different Scales Example I: Persistent Homology of Čech Complexes

Figure: Scale ε1: β0 = 1, β1 = 3


Figure: Scale ε2 > ε1: β0 = 1, β1 = 2. Persistent β0 = 1 and β1 = 1 from ε1 to ε2 suggest that a connected component and a loop are stable topological features here.


Betti Number at Different Scales Example II: Persistent 0-Homology induced by Height Function

Figure: The birth and death of connected components.


Betti Number at Different Scales Example III: Persistent Homology as Online Algorithm to Track Topology Changes

Figure: The birth and death of simplices.

Betti Number at Different Scales Persistent Betti Numbers: Barcodes

Toolbox: JavaPlex (https://github.com/appliedtopology/javaplex/wiki/Tutorial)
• Java version of Plex, works with Matlab
• Rips, Witness complex, Persistent Homology
Other Choices: Plex 2.5 for Matlab (not maintained any more), Dionysus (Dmitriy Morozov)


Betti Number at Different Scales Persistent Homology: Algebraic Characterization

All of the above gives rise to a filtration of simplicial complexes

$\emptyset = \Sigma_0 \subseteq \Sigma_1 \subseteq \Sigma_2 \subseteq \dots$

Functoriality of inclusion: there are homomorphisms between homology groups

$0 \to H_1 \to H_2 \to \dots$

A persistent homology is the image of $H_i$ in $H_j$ with $j > i$.
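The "image of $H_i$ in $H_j$" can be written out as follows. This is a sketch with notation introduced here for illustration ($f^{i,j}$, $H^{i,j}$, $\beta^{i,j}$, $\mu^{i,j}$ do not appear on the slides); the index on $H$ is the filtration step rather than the homological dimension, and coefficients are assumed to lie in a field so that ranks are well defined.

```latex
% Map induced by the inclusion \Sigma_i \subseteq \Sigma_j, for i \le j:
f^{i,j} : H(\Sigma_i) \longrightarrow H(\Sigma_j)

% Persistent homology group and persistent Betti number:
H^{i,j} := \operatorname{im} f^{i,j} \subseteq H(\Sigma_j),
\qquad
\beta^{i,j} := \dim H^{i,j} = \operatorname{rank} f^{i,j}

% Number of classes born at step i and dying at step j (bars [i, j) in the barcode):
\mu^{i,j} = \bigl(\beta^{i,j-1} - \beta^{i,j}\bigr) - \bigl(\beta^{i-1,j-1} - \beta^{i-1,j}\bigr),
\qquad i < j
```

The barcode is the multiset of intervals $[i, j)$ counted with multiplicity $\mu^{i,j}$, together with the bars that never die.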


Betti Number at Different Scales Persistent 0-Homology of Rips Complex

• Equivalent to single-linkage clustering or minimal spanning tree
• Barcode is the single linkage dendrogram (tree) without labels
• Kleinberg’s Impossibility Theorem for clustering: no clustering algorithm satisfies scale invariance, richness, and consistency simultaneously
• Memoli & Carlsson 2009: single-linkage is the unique persistent clustering (functorial) with scale invariance
• Open Question: but, is persistence necessary for clustering?

Notes: try matlab command linkage or R hclust for single-linkage clustering.
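The equivalence with the minimal spanning tree can be sketched in pure Python, as an alternative to the Matlab and R commands mentioned in the note. Edges are processed in order of length, as in Kruskal's algorithm; each edge that merges two components kills one bar of the 0-dimensional barcode. The function name `h0_barcode` is hypothetical, and the sketch assumes Euclidean points with an edge entering the Rips complex at scale equal to its length.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Bars (birth, death) of persistent 0-homology of the Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components (a spanning tree edge)
            parent[ri] = rj
            deaths.append(length)
    # Every component is born at scale 0; one bar never dies.
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]

pts = [(0, 0), (1, 0), (3, 0), (7, 0)]
print(h0_barcode(pts))
```

For the four collinear points the finite bars end at 1, 2, and 4, which are exactly the merge heights of the single-linkage dendrogram and the edge lengths of the minimal spanning tree.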


Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches Application: Evolutionary Trees

Figure: Are phylogenetic trees good representations for evolution?


Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches Virus gene reassortment may introduce loops

Figure 5.16 Left: Reassortments in viruses lead to incompatibility between trees. Reticulate network representing the reassortment of three parental strains. The reticulate network results from merging the three parental phylogenetic trees. Source: [100]. Right: Indeed, incompatibility between tree topologies inferred from different genes is a criterion used for the identification of events of genomic material exchange. Here we represent two genes of influenza A virus with different topologies using phylogenetic networks. From Joseph Minhow Chan, Gunnar Carlsson, and Raúl Rabadán, ‘Topology of viral evolution’, Proceedings of the National Academy of Sciences 110.46 (2013): 18566–18571. Reprinted with Permission from Proceedings of the National Academy of Sciences.

Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches Influenza

... between ferrets, suggesting that human to human transmission of H7N9 has most likely already occurred [548]. These outbreaks underscore the need for further investigation into the mechanisms of viral evolution and the adaptation of animal viruses to humans. Influenza viruses are enveloped and nearly 100 nm in diameter. Their genome is 13,000 bases long and is composed of eight segments of single-stranded antisense RNA (Figure 5.14). Each segment encodes one or two viral genes. Antisense RNA is the complement of the RNA that codes for proteins; thus it cannot be directly translated into functional protein. In order for the influenza genome to express protein, positive-sense strands must be produced from the template of the antisense strands. Further complexity arises when the virus attempts to make new virions, the infectious particles that allow the virus to be transmitted outside of the host cell. The replicating virus must duplicate its original antisense RNA and, in order to do so, it must polymerize new strands of ribonucleotides complementary to the template of the positive-sense RNA. Influenza carries its own polymerase complex, which it uses for all of its RNA replication; in fact, the three longest genes of influenza (PB2, PB1, PA) code for the three proteins directly involved in replicating genomic material. The polymerase complex interacts directly with viral RNA and the nucleoproteins (NPs) that attach to it. An RNA segment, together with a copy ...

Figure 5.14: Influenza A is an antisense single-stranded RNA virus whose genome is composed of eight different segments containing one or two genes per segment. This virus contains an envelope borrowed from the infected cell that expressed two viral proteins, hemagglutinin and neuraminidase. When circulating viruses co-infect the same cell, new viruses can be created that contain segments from both parents. This phenomenon, called reassortment, can lead to dramatic adaptations to novel environments, and it is thought to be one of the contributing factors to human influenza pandemics. (Labels in the figure: segments PB2, PB1, PA, HA, NP, NA, M, NS; hemagglutinin, neuraminidase, matrix, ion channel.)

Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches Origins of H1N1-2009

Figure: Origins of H1N1 2009 pandemic virus. Using phylogenetic trees, the history of the HA gene of the 2009 H1N1 pandemic virus was reconstructed. It was related to viruses that circulated in pigs potentially since the 1918 H1N1 pandemic. These viruses had diverged since that date into various independent strains, infecting humans and swine. Major reassortments between strains led to new sets of segments from different sources. In 1998, triple reassortant viruses were found infecting pigs in North America. These triple reassortant viruses contained segments that were circulating in swine, humans and birds. Further reassortment of these viruses with other swine viruses created the ancestors of this pandemic. Until this day, it is unclear how, where or when these reassortments happened. Source: [506]. From New England Journal of Medicine, Vladimir Trifonov, Hossein Khiabanian, and Raúl Rabadán, Geographic dependence, surveillance, and origins of the 2009 influenza A (H1N1) virus, 361.2, 115–119. © 2009 Massachusetts Medical Society. Reprinted with permission from Massachusetts Medical Society.

(Panel A: phylogenetic tree of HA sequences whose leaves are human, swine, and avian isolates, from A/South Carolina/1/1918 to A/California/05/2009. Panel B: lineages labeled 2009 Human H1N1, North American swine H3N2, North American swine H1N2, Eurasian swine, Classic swine H1N1, Human H3N2, and Avian, with time marks at 1990, 2000, and 2009.)

Applications: H1N1 Evolution, Sensor Network Coverage, Natural Image Patches When Persistent Betti-0 meets Phylogenetic Trees

5Evolution,Trees,andBeyond 297

A C H13 H16

H11

H9 H6

Betti Number 0 H8 H5

100 200 300 400 500 600 H12 B H12 H1 H9 H8 H6 H16 H13 H11 H5 H1 H4 H3 H10 H15 H4 H17 H7 100 200 300 400 500 600 H3 H15 Base Pairs H10 Figure 5.17 In case of vanishing higher dimensional homology, zero dimen- Figure: In case of vanishingsional higher homology dimensional generates homology, trees. When zero dimen- applied sional to only homology one gene generates of influenza trees. When applied to only one gene of influenza A, in thisA, case in hemagglutinin, this case hemagglutinin, the only significant the only homology significant occurs homology in dimen- occurs sion zero in dimen- (panel A). The barcode represents a summary of a clusteringsion zero (panelprocedure A). (panel The barcode B), that represents recapitulates a summary the known of a phylogenetic clustering procedure relation between different hemagglutinin types (panel(panel C). Source: B), that [100]. recapitulates From Joseph the Minhow known Chan, phylogenetic Gunnar Carlsson,relation between and RaúlRabadán,‘Topology different of viral evolution’, Proceedings of thehemagglutinin National Academy types (panel of Sciences C). Source: 110.46 [(2013):100]. From 18566–18571. Joseph Minhow Chan, Gun- nar Carlsson, and Raúl Rabadán, ‘Topology of viral evolution’, Proceedings of the National Academy of Sciences 110.46 (2013): 18566–18571. Reprinted with Persistent HomologyPermission from Proceedings of the National Academy of Sciences. 85

higher homology, the zero dimensional homology closely follows the traditional tree structure. However, when studying the persistent homology for several genes at the same time, large numbers of homology classes appear at dimensions one and higher, indicating pervasive reassortments. By looking in detail at the cycles in higher dimensional homologies, we can attribute these cycles to different biological processes that violate tree-like assumptions: homoplasies, recombinations or reassortments. If several sequences generate a large non-trivial class, a reassortment event likely took place among the ancestors of these isolates [100]. We can generate useful statistics based on barcode information; for instance, we can estimate how often different combinations of the eight segments cosegregate in an effort to identify preferences among the potential combinations. As an example, we rarely see cycles form with the segments that interact to form the polymerase complex PA, PB1, PB2, NP, indicating that these segments tend to cosegregate [100]. This

Whole Genomic Persistent Betti Numbers

Figure 5.18 Influenza evolves through mutations and reassortment. When the persistent homology approach is applied to finite metric spaces derived from only one segment, up to small noise, the homology is zero dimensional, suggesting a tree-like process (left). However, when different segments are put together, the structure is more complex, revealing non-trivial homology at different dimensions (right). 3105 influenza whole genomes were analyzed. Data from isolates collected between 1956 and 2012; all influenza A subtypes.

Two modes in persistent β1 distributions suggest intra- and inter-subtypes

Figure 5.19: Co-reassortment of viral segments as structure in persistent homology diagrams. Left: The non-random cosegregation of influenza segments was measured by testing a null model of equal reassortment. Significant cosegregation was identified within PA, PB1, PB2, NP, consistent with the cooperative function of the polymerase complex. Source: [100]. Right: The persistence diagram for whole-genome avian flu sequences revealed bimodal topological structure. Annotating each interval as intra- or inter-subtype clarified a genetic barrier to reassortment at intermediate scales. From Joseph Minhow Chan, Gunnar Carlsson, and Raúl Rabadán, ‘Topology of viral evolution’, Proceedings of the National Academy of Sciences 110.46 (2013): 18566–18571. Reprinted with Permission from Proceedings of the National Academy of Sciences.

finding is consistent with the cooperative functioning of these proteins, which engenders negative selection against new combinations that do not cooperate as effectively (Figure 5.19). In addition, each of the sequenced viruses (isolates) comes with information of where and when the virus was isolated, together with the hemagglutinin and neuraminidase subtype. Under the assumption that smaller cycles in the non-trivial homology classes are in some way closer genetically, one can also infer when and where the event took place and what the types of the parental strains were. Other relevant information is provided by the birth and death times of the class which provide information about how genetically distant parental viruses were. Numbers associated to one and higher dimensional classes (birth, death and size of bars in the barcode diagram) provide a useful way to summarize the type of event. The size of the bars associated to non-zero homology classes is also indicative of the type of reassortment events that could occur. The persistence diagram for whole genomes of avian flu sequences reveals bimodal topological structure (Figure 5.19, right). In other words, there are smaller bars and larger bars. Inspection of generators of different bars immediately reveals two types of reassortment processes. Small bars are generated by mixing of viruses that are closely related, belonging to the

Application: Sensor Network Coverage by Persistent Homology

V. de Silva and R. Ghrist (2005) Coverage in sensor networks via persistent homology.
Ideally, sensor communication can be modeled by a Rips complex:
• if two sensors are within a short range of each other, they receive strong signals;
• if two sensors are within a middle range of each other, they receive weak signals;
• otherwise, they receive no signals.
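A minimal sketch of this model, assuming planar sensor positions and Euclidean distance: the Rips complex at a given range has an edge for every pair of sensors within that range and a triangle whenever all three edges are present. The function name `rips_complex`, the sensor coordinates, and the two range values are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps):
    """Vertices, edges, and triangles of the Vietoris-Rips complex at scale eps.

    An edge joins two points at distance <= eps; a triangle is included
    whenever all three of its edges are present.
    """
    n = len(points)
    edges = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= eps
    }
    triangles = [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if (i, j) in edges and (i, k) in edges and (j, k) in edges
    ]
    return list(range(n)), sorted(edges), triangles


# Hypothetical sensor positions: the corners of a unit square.
sensors = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
STRONG_RANGE = 1.0  # assumed short range
WEAK_RANGE = 1.5    # assumed middle range

_, strong_edges, strong_tris = rips_complex(sensors, STRONG_RANGE)
_, weak_edges, weak_tris = rips_complex(sensors, WEAK_RANGE)
```

At the short range the four sides of the square form a cycle with no triangles, so the complex has a hole; at the middle range the diagonals appear and the triangles fill it. A hole of this kind does not persist between the two scales.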

Sandwich Theorem

Theorem (de Silva-Ghrist 2005)
Let $X$ be a set of points in $\mathbb{R}^d$ and $C_\epsilon(X)$ the Čech complex of the cover of $X$ by balls of radius $\epsilon/2$. Then there is a chain of inclusions

$$R_{\epsilon'}(X) \subset C_\epsilon(X) \subset R_\epsilon(X) \quad \text{whenever} \quad \frac{\epsilon}{\epsilon'} \ge \sqrt{\frac{2d}{d+1}}.$$

Moreover, this ratio is the smallest for which the inclusions hold in general.

Note: this gives a sufficient condition to detect holes in sensor network coverage
• The Čech complex is hard to compute while the Rips complex is easy;
• If a hole persists from $R_{\epsilon'}$ to $R_\epsilon$, then it must exist in $C_\epsilon$.
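As a worked instance of the threshold in the de Silva-Ghrist theorem, the ratio between the two Rips scales, written here as $\epsilon/\epsilon'$, must be at least $\sqrt{2d/(d+1)}$. The values below are plain arithmetic from that formula for the planar and spatial cases, together with the limit in high dimension.

```latex
\[
  \frac{\epsilon}{\epsilon'} \ge \sqrt{\frac{2d}{d+1}}, \qquad
  d = 2:\ \sqrt{\tfrac{4}{3}} \approx 1.155, \qquad
  d = 3:\ \sqrt{\tfrac{3}{2}} \approx 1.225, \qquad
  \lim_{d \to \infty} \sqrt{\frac{2d}{d+1}} = \sqrt{2} \approx 1.414 .
\]
```

So for sensors in the plane, a hole that survives from the smaller Rips scale to one roughly 15.5% larger is already guaranteed to be present in the Čech complex at the larger scale.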

Persistent 1-Homology in Rips Complexes

Figure: Left: $R_{\epsilon'}$; Right: $R_\epsilon$. The middle hole persists from $R_{\epsilon'}$ to $R_\epsilon$.

Application: Natural Image Statistics

G. Carlsson, V. de Silva, T. Ishkhanov, A. Zomorodian (2008) On the local behavior of spaces of natural images, International Journal of Computer Vision, 76(1):1-12.
• An image taken by a black and white digital camera can be viewed as a vector, with one coordinate for each pixel
• Each pixel has a “gray scale” value, which can be thought of as a real number (in reality, it takes one of 256 values for an 8-bit sensor)
• A typical camera uses tens of thousands of pixels, so images lie in a very high dimensional space, call it pixel space, $\mathcal{P}$

Natural Image Statistics

D. Mumford: What can be said about the set of images $\mathcal{I} \subseteq \mathcal{P}$ one obtains when one takes many images with a digital camera?
Lee, Mumford, Pedersen: Useful to study local structure of images statistically

Natural Image Statistics

Figure: 3 × 3 patches in images

Natural Image Statistics

Lee-Mumford-Pedersen [LMP] study only high contrast patches.
• Collect: 4.5M high contrast patches from a collection of images obtained by van Hateren and van der Schaaf
• Normalize mean intensity by subtracting the mean from each pixel value to obtain patches with mean intensity = 0
• This puts the data on an 8-D hyperplane, $\approx \mathbb{R}^8$
• Furthermore, normalize contrast by dividing by the norm to obtain patches with norm = 1, whence the data lies on a 7-D ellipsoid, $\approx S^7$
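The two normalization steps can be sketched as follows. This is a simplified illustration, not the LMP pipeline: it assumes a patch is given as a flat list of 9 pixel values and uses the plain Euclidean norm in place of the contrast norm, so the result lies on a round sphere. The function name `normalize_patch` and the example patch are hypothetical.

```python
from math import sqrt


def normalize_patch(patch):
    """Center a flattened 3x3 patch to mean 0, then scale it to norm 1.

    Assumption: the plain Euclidean norm stands in for the contrast norm.
    Returns None for a constant patch, which has zero contrast.
    """
    if len(patch) != 9:
        raise ValueError("expected 9 pixel values")
    mean = sum(patch) / 9
    centered = [p - mean for p in patch]  # now on the 8-D hyperplane sum = 0
    norm = sqrt(sum(c * c for c in centered))
    if norm == 0:
        return None
    return [c / norm for c in centered]  # now on the unit sphere in that hyperplane


# Hypothetical patch: a vertical edge, dark on the left, bright on the right.
patch = [0, 128, 255, 0, 128, 255, 0, 128, 255]
unit = normalize_patch(patch)
```

Constant patches are dropped because they have no contrast to normalize, which is consistent with restricting attention to high contrast patches.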

Natural Image Statistics: Primary Circle

High density subsets M(k = 300, t = 0.25):
Codensity filter: let $d_k(x)$ be the distance from $x$ to its k-th nearest neighbor
• the lower $d_k(x)$, the higher the density at $x$
Take k = 300, then extract the 5,000 top t = 25% densest points, which concentrate on a primary circle
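A minimal sketch of the codensity filter, assuming Euclidean distance and a brute-force neighbor search: compute the k-th nearest neighbor distance for each point and keep the fraction t with the smallest values. The function names and the toy data are hypothetical, and the quadratic search is only suitable for small samples, not for millions of patches.

```python
from math import dist


def kth_neighbor_distance(points, k):
    """d_k(x) for every x: distance to its k-th nearest neighbor, excluding x."""
    result = []
    for i, x in enumerate(points):
        ds = sorted(dist(x, y) for j, y in enumerate(points) if j != i)
        result.append(ds[k - 1])
    return result


def densest_fraction(points, k, t):
    """Keep the fraction t of points with the smallest d_k, i.e. the densest."""
    dk = kth_neighbor_distance(points, k)
    keep = max(1, int(t * len(points)))
    order = sorted(range(len(points)), key=lambda i: dk[i])
    return [points[i] for i in order[:keep]]


# Hypothetical toy data: a tight cluster plus scattered outliers.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
outliers = [(5.0, 5.0), (-4.0, 6.0), (7.0, -3.0), (-6.0, -6.0)]
dense = densest_fraction(cluster + outliers, k=2, t=0.5)
```

The choice of k controls the scale at which density is measured: a large k gives a smooth, global estimate, while a small k responds to finer local structure.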

Natural Image Statistics: Three Circles

Take k = 15, then extract the 5,000 top 25% densest points, which show persistent β1 = 5, the 3-circle model

Natural Image Statistics: Three Circles

Generators for 3 circles

Natural Image Statistics: Klein Bottle

Natural Image Statistics: Klein Bottle Model

Reference

Edelsbrunner, Letscher, and Zomorodian (2002) Topological Persistence and Simplification.
Ghrist, R. (2007) Barcodes: the Persistent Topology of Data. Bulletin of AMS, 45(1):61-75.
Edelsbrunner, Harer (2008) Persistent Homology - a survey. Contemporary Mathematics.
Carlsson, G. (2009) Topology and Data. Bulletin of AMS, 46(2):255-308.
Camara et al. (2016) Topological Data Analysis Generates High-Resolution, Genome-wide Maps of Human Recombination, Cell Systems, 3(1):83–94.
Wei, Guowei (2017) Persistent Homology Analysis of Biomolecular Data, SIAM News.
Rabadan, Raul and Blumberg, Andrew J. (2020) Topological Data Analysis for Genomics and Evolution. Cambridge University Press.
