Cheminformatics Tools for Enabling Metabolomics by Yannick Djoumbou Feunang
Total Page:16
File Type:pdf, Size:1020Kb
Cheminformatics Tools for Enabling Metabolomics by Yannick Djoumbou Feunang A thesis submitted in partial fulfillment of requirements for the degree of Doctor of Philosophy in Microbiology and Biotechnology Department of Biological Sciences University of Alberta ©Yannick Djoumbou Feunang, 2017 ii Abstract Metabolites are small molecules (<1500 Da) that are used in or produced during chemical reactions in cells, tissues, or organs. Upon absorption or biosynthesis in humans (or other organisms), they can either be excreted back into the environment in their original form, or as a pool of degradation products. The outcome and effects of such interactions is function of many variables, including the structure of the starting metabolite, and the genetic disposition of the host organism. For this reasons, it is usually very difficult to identify the transformation products as well as their long-term effect in humans and the environment. This can be explained by many factors: (1) the relevant knowledge and data are for the most part unavailable in a publicly available electronic format; (2) when available, they are often represented using formats, vocabularies, or schemes that vary from one source (or repository) to another. Assuming these issues were solved, detecting patterns that link the metabolome to a specific phenotype (e.g. a disease state), would still require that the metabolites from a biological sample be identified and quantified, using metabolomic approaches. Unfortunately, the amount of compounds with publicly available experimental data (~20,000) is still very small, compared to the total number of expected compounds (up to a few million compounds). For all these reasons, the development of cheminformatics tools for data organization and mapping, as well as for the prediction of biotransformation and spectra, is more crucial than ever. My PhD thesis focused on developing several cheminformatics tools that address these limitations. First, I developed ClassyFire and ChemOnt. ClassyFire is a publicly available software tool and webserver that automatically and hierarchically classifies any given molecule based on its structure. It relies partly on ChemOnt, a comprehensive and iii comprehensible taxonomy that contains >4,800 chemical categories, as well as their textual descriptions and mappings to other ontologies. ClassyFire was used to classify and annotate >80 million compounds. The webserver also integrates a text-based search engine. These features make ClassyFire unique in the sphere of publicly available computational tools. ClassyFire and ChemOnt are available at http://classyfire.wishartlab.com. Second, I developed BioTransformer and BioTransformerDB. BioTransformer is a software tool for the prediction of small molecule metabolism in mammals. It uses a hybrid approach that partly relies on BioTransformerDB, a unique database of biotransformations containing experimentally confirmed metabolic reactions that transform >1,000 drugs, pesticides, cosmetics, and food compounds, among others. The current version of BioTransformer, which is available at https://bitbucket.org/djoumbou/biotransformer, focuses on the human species, but is easily expandable to other species. Third, I developed CFM-ID 3.0, an extension of CFM-ID (1.0, and 2.0), originally developed by Felicity Allen et al. CFM- ID 3.0 is a software tool and webserver for the prediction and annotation of MS spectra, as well as the identification of metabolites. With the integration of a rule-based fragmentation approach for spectra prediction, the development of new ranking functions, and the expansion of the spectral database, CFM-ID 3.0 showed a significant improvement, in terms of speed and accuracy, compared to previous versions. CFM-ID 3.0 is currently available as we web server at http://cfmid-staging.wishartlab.com/. ClassyFire, BioTransformer, and CFM-ID have found applications in various fields including chemical information management, metabolomics, and exposomics, among others. Together, they build a cheminformatics platform that can enable iv metabolomics, and contribute to the understanding of our environment as well as the advancement of science. v Preface This thesis is an original work by Yannick Djoumbou Feunang, based on several original ideas provided by my supervisor Dr. David S. Wishart. This research project was led by Dr. David Wishart from the Departments of Biological Sciences and Computing Science at University of Alberta. All experiments and research activities were performed in Dr. Wishart’s Lab at the University of Alberta. To complete the work described in this thesis, I was assisted by a number of individuals including Dr. Wishart and members of his laboratory. More specifically, Dr. Wishart supervised my training, coordinated the programming activities, designed the assessment methods used for this work and played a key role in the writing and editing of this thesis. Craig Knox, Roman Eisner, Michael Wilson helped in the development of the classification infrastructure for the ClassyFire program. Nazanin Assempour, and Dr. Ithayavani Iynkkaran contributed to the annotation/validation of biotransformations and structures in BioTransformerDB. Allison Pon, and Tanvir Sajed, contributed to the management of experimental spectral data obtained from other sources. Tammy Zheng prepared samples and collected experimental ESI-MS/MS spectra that were analyzed to create fragmentation rules. Dr Naama Karu provided feedback for the design and validation of the fragmentation rules. Dr Felicity Allen provided feedback in regard to previous versions of CFM-ID. In addition to the aforementioned colleagues, several other scientists provided feedback for different projects. Dr Evan Bolton facilitated the collaboration with the PubChem development team (NIH, USA). Dr Christoph Steinbeck, Dr Gary Owen, Dr Jana Hastings facilitated the collaboration with the ChEBI development team (EBI, UK). Dr Fahy Eoin, and Dr Shankar Subramanian facilitated the collaboration with the LIPID MAPS consortium vi (UCSD, USA). Dr Jarlei Fiamoncini (INRA, France), and Dr Claudine Manach (INRA, France) provided feedback in regard to gut microbial metabolism for BioTransformer. Dr Katherine Fenner (EAWAG, Switzerland) provided access to up to date preferences rules for the prediction of environmental microbial degradation. I was responsible for writing and testing the programs constituting ClassyFire, BioTransformer and CFM-ID 3.0. I was also responsible for developing the chemical ontology (ChemOnt) and the biotransformation database (BioTransformerDB) that were required to implement ClassyFire and BioTransformer, respectively. In addition I was largely responsible for acquiring and collecting all of the data required for BioTransformerDB and the experimental mass spectra files (from various databases) for CFM-ID 3.0’s spectral database, and for writing this thesis. vii Dedication I would like to dedicate this thesis to my late father, Nkemnoubong Djoumbou Joseph, and my mother Métago Janette épouse Djoumbou. Knowledge without wisdom is like water in the sand – African proverb viii Acknowledgements First and foremost, I would like to express my infinite gratitude to the almighty God, for giving me the honour and the ability maintain the course throughout this long journey. I aim at always doing my best for the betterment of the world and myself. I would like to express my sincere gratitude to my supervisor, Dr. David S. Wishart, for giving me the opportunity to work on some very exciting projects in his research group. I am particular grateful for his continuous support, patience, and motivation. Dr. Wishart’s vision, attention to detail, and high expectation for excellent work have played a significant role in my journey as a PhD student; they helped me acquire and improve my scientific skills that have helped make me a better, more independent researcher. It has been a real honour working in the Wishart group with such a diverse group of scientists and students from so many backgrounds and cultures. I would like to acknowledge the past and present members of the Wishart lab for contributing to my experience here at the University of Alberta. This is a special group of people and I will always appreciate the encouragement and good will we shared. I am especially thankful to Craig Knox, Roman Eisner, Allison Pon, Michael Wilson, Nazanin Assempour, Tammy Zheng, Tanvir Sajed, Siyang Tian, Zheng Shi, Dr. Naama Karu, Dr. Ithayavani Iynkkaran, and Dr. Felicity Allen. Thanks also to Dr. Rupa Mandal, the lab’s metabolomics manager, who has been a great source of help and provided wonderful suggestions during my research. I was able to work or collaborate with several scientists who contributed to the evaluation or development of my projects. I would like to particularly thank Dr. Evan ix Bolton (NIH, USA), Dr. Jarlei Fiamoncini (INRA, France), Dr. Claudine Manach (INRA, France), Dr. Christoph Steinbeck (EBI, UK), Dr. Gary Owen (EBI, UK), Dr. Jana Hastings (EBI, UK), Dr. Fahy Eoin (UCSD, USA), Dr. Shankar Subramanian (UCSD, USA), and Dr. Katherine Fenner (EAWAG, Switzerland). I would like to gratefully acknowledge my generous funding sources, especially the Canadian Institutes of Health Research and Genome Canada for financial support. I am also sincerely grateful to my supervisory committee members. I would like to thank Dr. Russ Greiner for his continuous support, and for always having the door open whenever I needed his help. He has played a significant role in my