Applications of Persistent Homology and Cycles
Applications of Persistent Homology and Cycles

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Sayan Mandal, B.Tech., M.E.
Graduate Program in Department of Computer Science and Engineering

The Ohio State University
2020

Dissertation Committee:
Dr. Tamal Dey, Advisor
Dr. Yusu Wang
Dr. Raphael Wenger

© Copyright by Sayan Mandal 2020

Abstract

The growing need to understand and process data has driven innovation in many disparate areas of data science. The computational biology, graphics, and machine learning communities, among others, are striving to develop robust and efficient methods for such analysis. In this work, we demonstrate the utility of topological data analysis (TDA), a new and powerful tool to understand the shape and structure of data, in these diverse areas.

First, we develop a new way to use persistent homology, a core tool in topological data analysis, to extract machine learning features for image classification. Our work focuses on improving modern image classification techniques by considering topological features. We show that incorporating this information into supervised learning models improves classification, thus providing evidence that topological signatures can be leveraged to enhance some of the pioneering applications in computer vision.

Next, we propose a topology-based, fast, scalable, and parameter-free technique to explore a related problem in protein analysis and classification. On an initial simplicial complex built using constituent protein atoms and bonds, simplicial collapse is used to construct a filtration, which we use to compute persistent homology. This is ultimately our signature for the protein molecules. Our method, besides being scalable, shows sizable time and memory improvements compared to similar topology-based approaches.
We use the signature to train a protein domain classifier and compare it against state-of-the-art structure-based protein signatures, achieving a substantial improvement in accuracy.

Besides considering the intervals of persistent homology, as in our first two applications, some applications need to find representative cycles for them. These cycles, especially the minimal ones, are useful geometric features that augment the intervals in a purely topological barcode. We address the problem of computing these representative cycles, termed persistent d-cycles. Since generating optimal persistent 1-cycles is NP-hard, we propose an alternative set of meaningful persistent 1-cycles that is computable using an efficient polynomial-time algorithm. Next, we address the same problem for general dimensions. We illustrate the use of an algorithm to generate d-cycles for finite intervals on a weak (d + 1)-pseudomanifold. We design two specialised software tools to compute persistent 1-cycles and d-cycles, respectively. Experiments on 3D point clouds, mineral structures, images, and medical data show the effectiveness of our algorithms in practice.

We further investigate the use of these representative persistent cycles in the field of bio-science and technology. Our concluding work tries to understand gene expression levels for various organisms that are either infected or under the effect of antigens. We use persistent cycles to curate both the cohort list and gene expression levels so as to obtain a "crux" of better representatives. This, in turn, provides improvement in both deep and shallow learning classifications. We further show that the n-cycles have an unsupervised inclination towards phenotype labels. The penultimate chapter of this thesis provides evidence that topological signatures are able to comprehend gene expression levels and classify cohorts on that basis.

To Mammam and Babai

Acknowledgments

If I have seen further, it is by standing on the shoulders of giants.
Sir Isaac Newton

I would like to thank my Ph.D. supervisor Dr. Tamal K. Dey for his guidance and support over the past few years. His wisdom and mentoring have been a continuous source of encouragement to me. He has been a source of inspiration not just in my domain of research but also in analytical and independent thinking, resource management, and many other quintessential areas of being a productive individual. Above all, he has always prioritised taking care of self-enrichment over academic progress. I sincerely thank him for being a true mentor and steadfast supporter.

This thesis was enriched significantly through helpful discussions with my predecessors Dr. Dayu Shi, Dr. Alfred Rossi, and Dr. Mickael Buchet. They had a significant impact on my research and my understanding of topological data analysis in general by answering all my queries, no matter how mundane. Dayu helped me comprehend and maintain the open source code repositories of our team, including Simpers, SimBa, and ShortLoop. His input extended beyond these works and helped me a lot in building later software: Persloop and Pers2cyc-fin.

The members of the TGDA group have contributed immensely to my personal and professional time at Ohio State. The group has been a source of friendships as well as good advice and collaboration. Among them, Tianqi Li and Ryan Slechta need special mention. They have been with me through the tough times in grad school, especially when research was at a stalemate. Ryan has been especially instrumental in refining my research ideas and reports. The enjoyment of learning increases manifold when we share our thoughts and work together in a constructive way. Any productive work, including research, is a collaborative effort, and I have been lucky to have worked with William Varcho, Tao Hao, and Soham Mukherjee as my co-authors.
I have learned much from them and gained valuable insight from all of them.

I had the opportunity to take several courses during my time as a graduate student at The Ohio State University. I would like to thank all the faculty members who have helped me augment my knowledge base. Special mention goes to Dr. Yusu Wang, Dr. Ten-Hwang Lai, Dr. Tamal Dey, Dr. Hanwei Shen, and Dr. Jim Davis, whose lectures have been truly enjoyable and inspired me to improve as a faculty member as well. The course materials they covered included state-of-the-art topics and have been directly influential in my research.

Research meets have been a medium to exchange ideas and insights since ancient times. In fact, 2500 years ago the gymnasiums in Athens were a hotbed for discussions in mathematics, literature, and philosophy, frequented by Plato, Socrates, Alcibiades, and others. In this era of digital connectivity, we have serious discussions as to whether traditional classrooms or meet-and-greets are still valid in academia. I have always been a proponent of these traditional meetups and strongly believe that real-time interaction with scientists helps spur a plethora of research insights and ideas. I would therefore take this opportunity to thank the organisers and committee members of VMV 2017, WABI 2018, and CTIC 2019, where I met many stalwarts in our field of research and learned a lot. Helpful discussions with faculty and peers have helped me gain a lot of insight and inspiration for future research methodologies and ideas.

I would also take this opportunity to thank my committee members, Dr. Yusu Wang and Dr. Raphael Wenger, for their feedback throughout my graduate career. From the beginning of the candidacy exam to the thesis committee meeting, their constructive feedback has always been helpful.

Finally, I would like to thank the National Science Foundation for supporting the research work presented here.

Vita

June, 1990      Born - Kolkata, India
1993            School - St. Stephens' School, Kolkata, India
2008            B.Tech - Computer Sc. and Technology, WBUT, India
2012            M.E. - Computer Sc. and Engineering, IIEST Shibpur, India
2014            Senior Research Fellow, Computer Science and Engineering, IIT Kharagpur, India
2015            University Fellow, Computer Sc. and Engineering, The Ohio State University, USA
2016            Graduate Teaching/Research Assistant, Computer Sc. and Engg., The Ohio State University, USA
2019            Graduate Research Intern, Health Care and Life Science, TJ Watson Labs IBM, Yorktown Heights, USA
2019-present    Graduate Research Assistant, Computer Sc. and Engg., The Ohio State University, USA

Publications

Research Publications

T. Dey, S. Mandal, S. Mukherjee, "Gene expression data classification using topology and machine learning models". arXiv, May 2020.

S. Mandal, A. Guzman-Saenz, N. Haiminen, L. Parida, S. Basu, "A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data". AlCoB 2020: International Conference on Algorithms for Computational Biology, LNCS/LNBI, Springer, April 2020.

T. Dey, T. Hao, S. Mandal, "Computing Minimal Persistent Cycles: Polynomial and Hard Cases". Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 10.5555/3381089.3381247, 2587-2606, Jan 2020.

T. Dey, T. Hao, S. Mandal, "Persistent 1-Cycles: Definition, Computation, and Its Application". Computational Topology in Image Context. CTIC 2019. Lecture Notes in Computer Science, 10.1007/978-3-030-10828-1_10, 123-136, Dec 2018.

T. Dey, S. Mandal, "Protein Classification with Improved Topological Data Analysis". 18th International Workshop on Algorithms in Bioinformatics, 10.4230/LIPIcs.WABI.2018.6, 1-13, Aug 2018.

T. Dey, S. Mandal, W. Varcho, "Improved Image Classification using Topological Persistence". Vision, Modeling and Visualization: The Eurographics Association