Computational Methods for Functional Interpretation of Diverse Omics Data

by

Sumaiya Nazeen

B.Sc. Engg., Bangladesh University of Engineering and Technology (2011)
S.M., Massachusetts Institute of Technology (2014)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2019

© Massachusetts Institute of Technology 2019. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 30, 2019

Certified by: Bonnie Berger, Simons Professor of Mathematics, Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Computational Methods for Functional Interpretation of Diverse Omics Data
by Sumaiya Nazeen

Submitted to the Department of Electrical Engineering and Computer Science on August 30, 2019, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

Recent technological advances have resulted in an explosive growth of various types of “omics” data, including genomic, transcriptomic, proteomic, and metagenomic data. Functional interpretation of these data is key to elucidating the potential role of different molecular levels (e.g., genome, transcriptome, proteome, metagenome) in human health and disease. However, the massive size and heterogeneity of raw data pose substantial computational and statistical challenges in integrating and interpreting these data. To overcome these challenges, we need sophisticated approaches and scalable analytical frameworks. This thesis outlines two research efforts along these lines.
First, we develop a novel three-tiered integrative omics framework for integrating and functionally analyzing heterogeneous omics datasets across a group of co-occurring diseases. We demonstrate the effectiveness of this framework in investigating the shared pathophysiology of autism spectrum disorder (ASD) and its multi-organ-system co-morbid diseases (e.g., inflammatory bowel disease, asthma, muscular dystrophy, cerebral palsy) and uncover a novel innate immunity connection between them. Second, we develop a new end-to-end computational tool, Carnelian, for robust, alignment-free functional profiling of whole metagenome sequencing reads, which is uniquely suited to finding hidden functional trends across diverse data sets in comparative analysis. Carnelian can find shared metabolic pathways and concordant functional dysbioses, and can distinguish microbial metabolic function missed by state-of-the-art functional annotation tools. We demonstrate Carnelian’s effectiveness on large-scale metagenomic studies of type-2 diabetes, Crohn’s disease, Parkinson’s disease, and industrialized versus non-industrialized cohorts.

Thesis Supervisor: Bonnie Berger
Title: Simons Professor of Mathematics, Professor of Electrical Engineering and Computer Science

To the wonderful memories of my beloved maternal and paternal grandmothers, Jamal Ara Begum and Hayaten Nesha. They always supported me and celebrated every success of mine. I miss them every day!

Acknowledgments

First and foremost, I would like to immensely thank my thesis supervisor, Bonnie Berger, who has been a constant source of inspiration and encouragement. I am grateful to her for providing me the freedom to explore many topics in computational biology and to pursue my interests while offering technical guidance and advice throughout my years at MIT. Besides being an extremely resourceful mentor, Bonnie is also a compassionate person who always cared for my well-being.
When I was diagnosed with a rare health condition during the second year of my Ph.D. and had to go through major neurosurgery, she offered her full support. She patiently supported me throughout my surgery and the year-long recovery process, for which I am really grateful. I hope to extend the same level of kindness and compassion towards my students in the future.

I am grateful to Eric Alm and Manolis Kellis for being on my thesis committee. They believed in me, provided guidance, and pointed me to the areas where I could improve to succeed in academia. I want to acknowledge my collaborators, Nathan Palmer, Isaac Kohane, and Yun William Yu; this thesis would not have been possible without their invaluable contributions. Special thanks to Eric Alm, Mathieu Groussin, and Mathilde Poyet for sharing the unpublished gut microbiome data from Bostonian and Baka individuals from the Global Microbiome Conservancy project (http://microbiomeconservancy.org/) with us.

I am immensely thankful for my peers in the Berger lab, who have not only helped me grow as a researcher but also helped me persevere through the ups and downs of grad school. In particular, thanks to Po-Ru, George, and Mark for inspiring me to join the Berger lab; to Deniz and Jian for helping me get started on research; to Sean, Deniz, Hoon, William, Ariya, Rohit, Noah, Yaron, Sepehr, Ibrahim, Sarah, Tristan, Perry, Youn, Ashwin, Andrew, Ellen, Brian, Max, Lillian, Ben, Rachel, and Alex for all the conversations I had with them about things both research-related and not; and to Patrice for greeting me every day with warmth, lending a shoulder whenever I needed one, and making life smooth for all of us in the lab (you really do keep the lab from falling apart!).

My graduate experience has been greatly enhanced by the warm set of people around me. I have enjoyed being involved with the GW6, GWAMIT, and TOC communities, and getting to know so many amazing minds.
Thanks to Mie, Yiou, Yi, Becca, Lily, Cindy, and Debbie, with whom I have shared many enjoyable moments throughout grad school. I owe my gratitude to Christina; I have always appreciated your friendship and encouragement; the wisdom you shared with me about life and faith has always inspired me and helped me build confidence whenever I felt low. Thank you, Marty, for all the fun conversations and encouragement; I learned so much about life from you. I have also been blessed with the best roommates and neighbors: Rosalie, Kristi, Bella, Dora, Katherine, Jack, and Dhavala, who helped me have the much-needed work-life balance.

I had a great time being part of the Bangladeshi Students Association of MIT and organizing social and cultural events. It was great to have an excellent set of friends in the Boston area, including Ehsan bhai, Nafisa apu, Nasim bhai, Zakia apu, Zubair, Saima, Sabrina, Hasnain bhai, Deeni, Nazia, Urmi, Sujoy bhai, Tapoti apu, Saquib, Mahi, Bristy, Sabrine, Tahsin, Sanzeed, Caleb, Lana, Phei Er, John, Tiara, Jianqiao, Nadim, Takian, Nishit, Deena, Murshid bhai, Zayan, Wasifa, Sourav, Adeeb, Protyasha, Taniya apu, Rumpa apu, Shafqat bhai, Maisha, Usama, and Ayesha. The time spent with all of you will always remain very memorable to me! I am also thankful for my friends from Bangladesh and all over the world made through the Fulbright Fellowship. They always made time to chat with me despite being in different time zones. Special thanks to Fatema and Rimpi for patiently listening to all my rants throughout the years and keeping me sane during stressful periods of grad school.

I owe my gratitude to all the members of the EECS Graduate Office, Financial Services, CSAIL Headquarters, and International Students Office for helping me with the logistics of grad school. Special thanks to Janet, Leslie, Alicia, Kathy, and Sylvia for being there for me whenever I needed advice.
I am incredibly grateful to my care teams at MIT Medical and MGH, who provided top-notch care for my health concerns.

Throughout my Ph.D., I was generously supported by the International Fulbright Science & Technology Fellowship, MIT’s Ludwig Center Molecular Oncology Fellowship, and grants from the National Institutes of Health, the Center for Microbiome Informatics and Therapeutics, and the F. Thomson Leighton fund.

Thanks also to all the great advisors and mentors I have had over the years outside of MIT. These include Dr. M Sohel Rahman, Dr. Mohammad Kaykobad, and Dr. Mohammad Eunus Ali at BUET, and Dr. Ishtiaque Ahmed at the University of Toronto, who have always been a source of great advice and encouragement.

I am grateful to my mom (Nilufar Jahan), dad (S M Salim Jahangir), and brothers (Shahriar Salim and Taqi Tahmeed Sakib) for their unwavering support of my research pursuits. Their love and trust in my abilities have been a significant source of inspiration throughout this journey. I would also like to acknowledge my extended family, who have always prayed for me, believed in me, and encouraged me to push forward.

Finally, I express my utmost gratitude to my greatest supporter: the Almighty Allah, who has bestowed this opportunity upon me and helped me persevere through every part of this journey. I am immensely grateful for His innumerable blessings all these years, which have helped me grow not only academically but also spiritually.

“He (Allah) is your Protector; and excellent is the Protector and excellent is the Helper.”
∼ Al Quran 22:78 ∼

Contents

1 Introduction
  1.1 Types of omics data
  1.2 Challenges in functional interpretation of omics data
  1.3 Prior work on these challenges
    1.3.1 Study of groups of diseases in the literature
    1.3.2 Functional profiling of metagenomic reads and comparative functional metagenomics
  1.4 Outline
2 Multi-level integrative omics analysis for ASD and its comorbidities
  2.1 Introduction
  2.2 Results
    2.2.1 Overview of the three-tiered integrative omics pipeline
    2.2.2 Involvement of innate immunity pathways in ASD and its comorbidities
    2.2.3 Disease–innate immunity pathway overlap at gene level
    2.2.4 Discriminatory power of innate immunity pathway genes
  2.3 Methods
    2.3.1 Gene-centric transcriptomic analysis per disease
    2.3.2 Pathway-centric enrichment analysis per disease
    2.3.3 Across-disease shared significance analysis of pathways