The First Version of the Thesis Was Deposited in May 24, 2016. This Is
COMPUTATIONAL APPROACHES TO IMPROVING THE RECONSTRUCTION OF METABOLIC PATHWAYS

Faizah Aplop

A Thesis in the Department of Computer Science and Software Engineering

Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Concordia University
Montréal, Québec, Canada

September 2016

© Faizah Aplop, 2016

Concordia University
School of Graduate Studies

This is to certify that the thesis prepared

By: Mrs. Faizah Aplop
Entitled: Computational Approaches to Improving the Reconstruction of Metabolic Pathways

and submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science) complies with the regulations of this University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

Chair: Dr M. Reza Soleymani
External Examiner: Dr Anthony J. Kusalik
Examiner: Dr Nawwaf Kharma
Examiner: Dr Volker Haarslev
Examiner: Dr Lata Narayanan
Supervisor: Dr Gregory Butler

Approved: Chair of Department or Graduate Program Director
Dr Amir Asif, Dean, Faculty of Engineering and Computer Science

Abstract

Computational Approaches to Improving the Reconstruction of Metabolic Pathways
Faizah Aplop, Ph.D.
Concordia University, 2016

Metabolic pathway reconstruction is the essence of systems biology where in silico modeling and prediction of the cell's function is based on the interaction of the cell's components represented as a network of reactions. The reconstructed model and the associated database of information about the organism's genes and their functional roles facilitate a variety of analysis and simulation techniques that can enrich our understanding.
However, there are unresolved issues for genome-scale metabolic network reconstruction, such as our incomplete knowledge of the cell's networks for metabolism, transport, and regulation; the completeness, accuracy, and specificity of the annotation of genomes; and our ability to fully utilise the available information from -omics (genomics, proteomics, metabolomics, etc.) for the reconstruction of the networks. These issues result in incomplete metabolic models, which limit our ability to analyse the cell and to make predictions about it that are based on the network model.

This dissertation discusses the state of the art of metabolic pathway reconstruction and highlights the outstanding issues. In particular, we consider a number of case studies using genomes of fungi relevant to industrial applications, such as biofuels, to demonstrate the performance of existing techniques and illustrate the issues. Our case studies focus on the cell's central metabolism, and the utilisation and transport of sugars as a carbon source, since these are essential concerns for industrial applications.

A significant deficiency in the existing state of the art for the reconstruction of metabolic pathways is the limited ability to associate genes and proteins to the transport reactions that move specific compounds across the membranes of the cell. The dissertation reviews the state of the art of prediction methods for transmembrane transport proteins by developing a scheme to describe and compare existing methods, and applying the existing techniques to the fungal genome of A. niger CBS 513.88. This reveals the split between those methods that use the Transporter Classification (TC) as their target for prediction, and those that use the type of chemical substrates being transported as their target.
Despite this difficulty in comparing approaches, it is clear that the state of the art cannot predict specific substrates being transported, and hence cannot associate genes and proteins to the transport reactions.

The dissertation presents TransATH, which stands for Transporters via ATH (Annotation Transfer by Homology), a system which automates Saier's protocol, includes the computation of subcellular localization, and improves the computation of transmembrane segments. The choice of thresholds for the parameters of TransATH is investigated to determine optimal performance as defined by a gold standard set of transporters and non-transporters from S. cerevisiae. The dissertation demonstrates TransATH on the fungal genome of A. niger CBS 513.88 and evaluates the correctness of TransATH using the curated information in AspGD (the Aspergillus Database). A website for TransATH is available for use.

Acknowledgments

I wish to express my sincere gratitude to Dr. Gregory Butler for always being there for me. I am extremely grateful and indebted to him for his expertise, time, patience, continuous support, sincere and unwavering guidance, and encouragement extended to me. Also, I would like to place on record my sincere gratitude to my committee members, Dr. Volker Haarslev, Dr. Lata Narayanan and Dr. Nawwaf Kharma, for their patience, precious time, and constructive advice throughout the processes and challenges of this work. A special thanks to my external examiner, Dr. Anthony J. Kusalik, for his support and invaluable advice. My sincere thanks to my lab mates, Christine Kehyayan, Lin Cheng, Yi Qing, Stuart Thiel, Jun Luo, Jianlong Qi and Stephen Barrett, for being a great source of friendship. I owe a deep sense of gratitude to all the good people at the Centre for Structural and Functional Genomics and the Department of Computer Science and Software Engineering at Concordia University who have provided me with theoretical and technical support.
I profusely thank the good people of Universiti Malaysia Terengganu, my supportive colleagues and friends, Noorasiah Moidu and her team, for being very tolerant with me throughout my study period. Also, I am extremely thankful to my good friends Abu Suffian Abu Bakar, Yoisel Melis Santana and Marie-France Lessard for their continuous support. It is my privilege to thank my loving husband, Mohd Riduan Abd Rahim, for his understanding, sacrifice and support throughout my research period. To my beloved parents, Aplop Awang and Che' Ramlah Ismail, who have been a source of inspiration to me throughout my life, a very special thank you for your unconditional love, prayer, and nurture. I dedicate this work to my family, many friends and the apple of my eye, Qurratul Aini Aplop. Above all, my utmost appreciation and gratitude go to The Almighty God for enabling me to complete this thesis.

Contents

List of Figures
List of Tables
List of Terms and Abbreviations

1 Introduction
  1.1 Genome-Scale Network Reconstruction
    1.1.1 Some Historical Context
    1.1.2 Resources
    1.1.3 Issues and Challenges
  1.2 Contributions
  1.3 Organization of the Thesis

2 Background
  2.1 Basic Concepts from Biology
    2.1.1 Nucleic Acids
    2.1.2 Central Dogma of Molecular Biology
    2.1.3 Proteins
    2.1.4 Domains
    2.1.5 Classification Schemes for Enzymes
  2.2 Metabolic Pathways
    2.2.1 Central Carbon Metabolism
  2.3 Transport
    2.3.1 Classification Schemes
  2.4 Genome-Scale Network Reconstruction
  2.5 Machine Learning in Bioinformatics
    2.5.1 Binary, Multi-Class and Multi-Label Classifiers
    2.5.2 Basic Local Alignment Search Tool
    2.5.3 Amino Acid Composition
    2.5.4 Hidden Markov Models for Protein Sequences
  2.6 Genomics Resources

3 Metabolic Pathway Reconstruction
  3.1 The State of the Art
    3.1.1 Pathway Tools
    3.1.2 SEED
    3.1.3 Pathway Analyst
    3.1.4 AUTOGRAPH
    3.1.5 Pantograph
    3.1.6 Other Tools
  3.2 Well-Curated Fungal Genomes
  3.3 Case Studies
    3.3.1 Datasets
    3.3.2 Methods
    3.3.3 Results
    3.3.4 Details for P. chrysosporium RP78
    3.3.5 Discussion
  3.4 Conclusion

4 Prediction of Transport Proteins
  4.1 A Scheme to Compare Transport Predictors
  4.2 The State of the Art
    4.2.1 TransAAP
    4.2.2 Transport Inference Parser
    4.2.3 Saier Lab
    4.2.4 Zhao Lab
    4.2.5 Gromiha Lab
    4.2.6 Helms Lab
  4.3 Case Study
    4.3.1 A Pathway Tools Reconstruction
    4.3.2 TCDB-Blast: Our G-Blast(v2) Implementation
    4.3.3 Sanity Check of Prediction on TCDB
    4.3.4 A. niger CBS 513.88
    4.3.5 Transport Prediction on Fungal Genomes
    4.3.6 Discussion
  4.4 Automation of Manual Protocol of Saier
    4.4.1 The Protocol
    4.4.2 TCDB-Blast Search
    4.4.3 Topology Step
    4.4.4 Localization Step
    4.4.5 Substrate Information
    4.4.6 Case Study Revisited
    4.4.7 The TransATH Web Service
  4.5 Evaluation
    4.5.1 Thresholds for TCDB-Blast
    4.5.2 Thresholds of TCDB-Blast for A. niger CBS 513.88
    4.5.3 Correctness of TransATH
  4.6 Predicting Specific Substrates
  4.7 A New Computational Framework
    4.7.1 The Relational Dataspace
  4.8 Conclusion

5 Conclusion
  5.1 Contributions
  5.2 Limitations
  5.3 Future Directions
  5.4 Postscript

A Sugar Porters
B TransportTP Results
C TCDB-Blast Results
  C.1 TCDB-Blast Results for A. niger CBS 513.88
  C.2 TCDB-Blast Results for Fungal Genomes
D TCDB-Blast Results with Substrates and Localization
  D.1 TCDB-Blast Results with TrSSP Predictions
  D.2 TCDB-Blast Results with LocTree3 Predictions

Bibliography

List of Figures

1 Relating Hypotheses from -Omics to the Central Dogma
2 Example of a GENRE
3 The Ongoing Reconstruction of the E. coli Metabolic Network
4 GENREs and their Coverage
5 Components of the Eukaryotic Cell