How Persistent Homology Can Be Useful in Predicting Proteins Functions
University of Rabat (Morocco)
Faculty of Medicine
MASTER BIOTECHNOLOGIE MÉDICALE, OPTION "BIOINFORMATIQUE"
End-of-Year Project (PFA)

How persistent homology can be useful in predicting proteins functions

Author: Zakaria Lamine
Jury:
President: .....
Examiner:
Advisor: My Ismail Mamouni, CRMEF Rabat
Defended on ....

Dedication

I dedicate this modest work to my parents: to my mother, for all the sacrifices she has made and for her assistance and presence in my life, and to my father, for all his sacrifice and support.

Thanks

I thank GOD for giving me the strength to do this work. I thank my family for their support and sacrifice. I thank my professor My Ismail Mamouni for his support and for our discussions, which have always been fruitful; it is an honour to be one of his students.

Introduction

Persistent homology is a method for computing topological features of a space at different spatial resolutions. More persistent features are detected over a wide range of length scales and are deemed more likely to represent true features of the underlying space, rather than artifacts of sampling, noise, or a particular choice of parameters. To find the persistent homology of a space, the space must first be represented as a simplicial complex. A distance function on the underlying space corresponds to a filtration of the simplicial complex, that is, a nested sequence of increasing subsets. In other words, persistent homology characterizes geometric features with persistent topological invariants by defining a scale parameter relevant to topological events. Through filtration and persistence, persistent homology can capture topological structures continuously over a range of spatial scales.
Unlike commonly used computational homology, which results in truly metric-free or coordinate-free representations, persistent homology is able to embed geometric information into topological invariants, so that the "birth" and "death" of isolated components, circles, rings, loops, pockets, voids and cavities at all geometric scales can be monitored by topological measurements. The basic concept was introduced by Frosini and Landi, and in a general form by Robins, Edelsbrunner, and Zomorodian and Carlsson, independently. Efficient computational algorithms have been proposed to track topological variations during the filtration process. Usually, the persistence diagram is visualized through barcodes, in which horizontal line segments (bars) represent the homology generators that persist over a range of filtration scales.

Persistent homology has been applied to a variety of domains, including image analysis, image retrieval, chaotic dynamics verification, sensor networks, complex networks, data analysis, computer vision, shape recognition and computational biology. Compared with computational topology and computational homology, persistent homology inherently has an additional dimension, the filtration parameter, which can be used to embed crucial geometric or quantitative information into the topological invariants. In medicine, persistent homology has led to a number of interesting results. One example among many, presented in chapter 4, is the application of this method to detect properties of proteins.

Contents

1 Persistent homology of proteins
1.1 Proteins (overview)
1.2 Structure-function relationship in proteins
1.3 Introduction to persistent homology in analysis of proteins
1.3.1 Prediction of the optimal characteristic distance using persistent homology
2 Conclusion

Chapter 1
Simplicial Homology

1.1 Simplicial complex

In algebraic topology, simplicial homology formalizes the idea of the number of holes of a given dimension in a simplicial complex. This generalizes the number of connected components (the case of dimension 0).

Simplicial homology arose as a way to study topological spaces whose building blocks are p-simplices, the p-dimensional analogs of triangles. These include a point (0-simplex), a line segment (1-simplex), a triangle (2-simplex) and a tetrahedron (3-simplex). By definition, such a space is homeomorphic to a simplicial complex (more precisely, to the geometric realization of an abstract simplicial complex). Such a homeomorphism is referred to as a triangulation of the given space. Many topological spaces of interest can be triangulated, including every smooth manifold.

Simplicial homology is defined by a simple recipe for any abstract simplicial complex. It is a remarkable fact that simplicial homology depends only on the associated topological space. As a result, it gives a computable way to distinguish one space from another.

Singular homology is a related theory which is more commonly used by mathematicians today. Singular homology is defined for all topological spaces, and it agrees with simplicial homology for spaces which can be triangulated. Nonetheless, because the simplicial homology of a simplicial complex can be computed automatically and efficiently, simplicial homology has become important for applications to real-life situations, such as image analysis, medical imaging, and data analysis in general.

In mathematics, a simplicial complex is a set composed of points, line segments, triangles, and their p-dimensional counterparts. Simplicial complexes should not be confused with the more abstract notion of a simplicial set appearing in modern simplicial homotopy theory. The purely combinatorial counterpart to a simplicial complex is an abstract simplicial complex.
Figure 1.1: Simplicial complex

Definition 1 (p-simplex). A p-dimensional simplex (or p-simplex) $\sigma^p = [e_0, e_1, \dots, e_p]$ is the smallest convex set in a Euclidean space $\mathbb{R}^m$ containing the $p + 1$ points $e_0, \dots, e_p$. We usually require that the points of a p-simplex be affinely independent, that is, not contained in any affine subspace of dimension less than $p$. The standard p-simplex is

$$\Delta^p = \Bigl\{ (t_0, \dots, t_p) \in \mathbb{R}^{p+1} : \sum_{i=0}^{p} t_i = 1 \text{ and } t_i \ge 0 \text{ for all } i = 0, \dots, p \Bigr\}.$$

Affinely independent means that the $p$ vectors $e_i - e_0$ for $i = 1, \dots, p$ are linearly independent, i.e., the points are in general position.

ENSIAS MASTER BIOTECHNOLOGIE MÉDICALE PFA : Persistent Homology Fac. Médecine OPTION "BIOINFORMATIQUE"

The convex hull is simply the solid polyhedron determined by the $p + 1$ vertices. A 0-simplex is a vertex, a 1-simplex an edge, a 2-simplex a triangle, and a 3-simplex a tetrahedron.

Figure 1.2: 0-, 1- and 2-simplices

Definition 2. Given a simplex $\sigma^p$, any sub-simplex spanned by a nonempty subset of its vertices is called a face. In particular, the $(p - 1)$-dimensional faces are obtained by removing one vertex.

For example, a tetrahedron has four triangle faces, corresponding to the four subsets S obtained by removing one vertex at a time from $\sigma$. These four triangle faces are 2-simplices themselves. It also has six edge faces and four singleton vertex faces.

Our space of interest consists of properly arranged simplices:

Definition 3 (Simplicial complex). A simplicial complex K is a finite set of simplices satisfying the following conditions:
1. For every simplex $A \in K$ and every face $\alpha$ of $A$, we have $\alpha \in K$.
2. $A, B \in K \Rightarrow A, B$ are properly situated, that is, their intersection is either empty or a common face of both.

The dimension of a complex is the maximum dimension of the simplices contained in it. The intuition behind a simplicial complex is that if a simplex is in K, all its faces need to be in K, too. In addition, the simplices have to be glued together along whole faces or be separate.
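The face-closure condition of Definition 3 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: simplices are represented combinatorially as tuples of vertex labels (an abstract simplicial complex), so only condition 1 is checked, and the function names `proper_faces`, `closure`, `is_closed_under_faces` and `dimension` are hypothetical helpers, not part of any library.

```python
from itertools import combinations


def proper_faces(simplex):
    """Return all nonempty proper faces of a simplex (tuple of vertex labels)."""
    verts = tuple(sorted(simplex))
    return {
        face
        for k in range(1, len(verts))
        for face in combinations(verts, k)
    }


def closure(simplices):
    """Smallest face-closed collection containing the given simplices."""
    K = set()
    for s in simplices:
        s = tuple(sorted(s))
        K.add(s)
        K |= proper_faces(s)
    return K


def is_closed_under_faces(simplices):
    """Condition 1 of Definition 3: every face of every simplex is present."""
    K = {tuple(sorted(s)) for s in simplices}
    return all(proper_faces(s) <= K for s in K)


def dimension(simplices):
    """Maximum dimension of the simplices (a p-simplex has p + 1 vertices)."""
    return max(len(s) for s in simplices) - 1


# A solid tetrahedron together with all of its faces.
tetrahedron = closure([(0, 1, 2, 3)])
counts = {d: sum(1 for s in tetrahedron if len(s) == d + 1) for d in range(4)}
print(counts)                               # {0: 4, 1: 6, 2: 4, 3: 1}
print(is_closed_under_faces(tetrahedron))   # True

# A triangle whose edge (1, 2) is missing violates condition 1.
broken = [(0, 1, 2), (0, 1), (0, 2), (0,), (1,), (2,)]
print(is_closed_under_faces(broken))        # False
```

The counts reproduce the example from the text: a tetrahedron has four vertices, six edges and four triangle faces.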
The figure on the left is a simplicial complex, while the one on the right is not:

Figure 1.3: Left: a simplicial complex. Right: not a simplicial complex.

A simplicial complex plays the role of the yellow space in the rubber band example. We next introduce the discrete version of the rubber bands.

1.2 Simplicial homology groups

Definition 4 (p-chains). A p-chain is a formal sum

$$c = \sum_{i=1}^{N_p} c_i \sigma_i^p$$

where the $\sigma_i^p$ are the p-simplices in K, $N_p$ is their number, and $c_i \in \mathbb{Z}$.

A. Akkari & Z. Lamine

For example, let K be a tetrahedron. By definition the four triangle faces (i.e., 2-simplices) are in K, too. With coefficients taken modulo 2, a 2-chain is a subset of these four triangles, e.g., all four triangles, the bottom triangle face only, or the empty set. There are $2^4 = 16$ distinct 2-chains of this kind. Similarly, all six edges of the tetrahedron are in K, so there are $2^6 = 64$ distinct 1-chains. Despite the name "chain", a p-chain does not have to be connected.

Figure 1.4: A 2-chain on the left and a 1-chain (the blue edges) on the right

Definition 5 (p-chain group). The p-chain group $C_p(K)$ of K is the free Abelian group generated by the oriented p-simplices of K:

$$C_p(K) \simeq \bigoplus_{i=1}^{N_p} \mathbb{Z}$$

By definition, $C_p(K) = 0$ for $p > n$, where $n$ is the dimension of K. When two p-chains are added with coefficients modulo 2, we get another p-chain in which duplicate p-simplices cancel out. We have a separate chain group for each dimension p.

Figure 1.5: 1-chain addition

Definition 6 (Boundary operator). The boundary operator is a homomorphism $\partial_p : C_p(K) \to C_{p-1}(K)$ defined on generators as follows. The boundary of an oriented p-simplex $\sigma^p = [e_0, e_1, \dots, e_p]$ is the $(p - 1)$-chain defined by

$$\partial_p \sigma^p = \sum_{i=0}^{p} (-1)^i [e_0, e_1, \dots, \hat{e}_i, \dots, e_p]$$

where $\hat{e}_i$ indicates that the vertex $e_i$ is omitted.

The boundary of a tetrahedron is the set of its four triangle faces; the boundary of a triangle is its three edges; the boundary of an edge is its two vertices.
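The boundary formula of Definition 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: an oriented simplex is a tuple of vertex labels in a fixed order, a chain is a dictionary mapping simplices to integer coefficients, the boundary of a vertex is taken to be zero, and the names `boundary_of_simplex` and `boundary` are hypothetical.

```python
def boundary_of_simplex(simplex):
    """Boundary of an oriented simplex [e0, ..., ep] as {face: coefficient}."""
    if len(simplex) <= 1:
        return {}  # assumption: the boundary of a vertex is the zero chain
    chain = {}
    for i in range(len(simplex)):
        face = simplex[:i] + simplex[i + 1:]   # omit the i-th vertex
        chain[face] = chain.get(face, 0) + (-1) ** i
    return chain


def boundary(chain):
    """Extend the boundary to chains by linearity, dropping zero terms."""
    result = {}
    for simplex, coeff in chain.items():
        for face, sign in boundary_of_simplex(simplex).items():
            result[face] = result.get(face, 0) + coeff * sign
    return {s: c for s, c in result.items() if c != 0}


triangle = (0, 1, 2)
print(boundary_of_simplex(triangle))   # {(1, 2): 1, (0, 2): -1, (0, 1): 1}

tetrahedron = {(0, 1, 2, 3): 1}
faces = boundary(tetrahedron)
print(len(faces))                      # 4 triangle faces
print(boundary(faces))                 # {} : the boundary of a boundary vanishes
```

The signs record orientation: with them, each edge of the tetrahedron appears once with coefficient +1 and once with -1 in the second boundary, so everything cancels.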
The boundary of a p-chain is defined by linearity.

Definition 7 (Chain complex). The chain complex is a sequence of free Abelian groups and homomorphisms

$$0 \overset{i}{\hookrightarrow} C_p(K) \xrightarrow{\partial_p} C_{p-1}(K) \xrightarrow{\partial_{p-1}} \cdots \xrightarrow{\partial_1} C_0(K) \xrightarrow{\partial_0} 0$$

where $\hookrightarrow$ denotes the inclusion map.
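As a worked check, the computation below applies two consecutive boundary maps to a single oriented triangle $[e_0, e_1, e_2]$, using the alternating-sum formula of the boundary operator. The result is zero, which illustrates the standard property $\partial_{p-1} \circ \partial_p = 0$ of a chain complex. That property is not stated explicitly in the text and is included here as a known fact.

```latex
\begin{align*}
\partial_2 [e_0, e_1, e_2]
  &= [e_1, e_2] - [e_0, e_2] + [e_0, e_1], \\
\partial_1 \partial_2 [e_0, e_1, e_2]
  &= \partial_1 [e_1, e_2] - \partial_1 [e_0, e_2] + \partial_1 [e_0, e_1] \\
  &= \bigl([e_2] - [e_1]\bigr) - \bigl([e_2] - [e_0]\bigr) + \bigl([e_1] - [e_0]\bigr) \\
  &= 0.
\end{align*}
```

Each vertex appears once with a plus sign and once with a minus sign, so the terms cancel in pairs.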