Cheminformatics for Genome-Scale Metabolic Reconstructions
Total Page:16
File Type:pdf, Size:1020Kb
CHEMINFORMATICS FOR GENOME-SCALE METABOLIC RECONSTRUCTIONS John W. May European Molecular Biology Laboratory European Bioinformatics Institute University of Cambridge Homerton College A thesis submitted for the degree of Doctor of Philosophy June 2014 Declaration This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university, and no part has already been, or is currently being submitted for any degree, diploma or other qualification. This dissertation does not exceed the specified length limit of 60,000 words as defined by the Biology Degree Committee. This dissertation has been typeset using LATEX in 11 pt Palatino, one and half spaced, according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee. June 2014 John W. May to Róisín Acknowledgements This work was carried out in the Cheminformatics and Metabolism Group at the European Bioinformatics Institute (EMBL-EBI). The project was fund- ed by Unilever, the Biotechnology and Biological Sciences Research Coun- cil [BB/I532153/1], and the European Molecular Biology Laboratory. I would like to thank my supervisor, Christoph Steinbeck for his guidance and providing intellectual freedom. I am also thankful to each member of my thesis advisory committee: Gordon James, Julio Saez-Rodriguez, Kiran Patil, and Gos Micklem who gave their time, advice, and guidance. I am thankful to all members of the Cheminformatics and Metabolism Group. In particular, the extremely knowledgeable ChEBI curators as well as past and present members of the research team: Pablo Moreno, Stephan Beisken, Luís de Figueiredo, Felicity Allen, and Kalai Vanii Jayaseelan. I owe a great deal of thanks to Egon Willighagen for meticulously review- ing the deluge of patches, programming advice, and for encouraging me to start a blog. I am grateful to attendees of the Cambridge Cheminformatics Network Meetings for stimulating conversations. Specifically, I would like to thank Roger Sayle for many enjoyable discussions on efficient algorithms and for prompting me to address some long standing issues in the CDK library. I am also thankful to Mark Vine for valuable input and advice. I would like to thank: Gordon James, Stephan Beisken, Pablo Moreno, Egon Willighagen, and Róisín Sarsfield for reading and providing feed- back of draft chapters of this dissertation. Finally, on a personal note, I would like to thank my loving parents, my sister Alexandra, and my partner Róisín who have always encouraged me to pursue what I find interesting and for providing support throughout my studies. Abstract Genome-scale metabolic reconstructions are an important resource in the study of metabolism. They provide both a system and component level view of the biochemical transformations of metabolites. As more recon- structions have been created it remains a challenge to integrate and reason about their contents. This thesis focuses on the development of computa- tional methods to allow on-demand comparison and alignment of meta- bolic reconstructions. A novel method is introduced that utilises chemical structure representa- tions to identify equivalent metabolites between reconstructions. Using a graph theoretic representation allows the identification and reasoning of metabolites that have a non-exact match. A key advantage is that the method uses the contents of reconstructions directly and does not rely on the creation or use of a common reference. To annotate reconstructions with chemical structure representations an in- teractive desktop application is introduced. The application assists in the creation and curation of metabolic information using manual, semi-auto- mated, and automated methods. Chemical structure representations can be retrieved, drawn, or generated to allow precise metabolite annotation. In processing chemical information, efficient and optimised algorithms are required. Several areas are addressed and implementations have been con- tributed to the Chemistry Development Kit. Rings are a fundamental prop- erty of chemical structures therefore multiple ring definitions and fast al- gorithms are explored. Conversion and standardisation between structure representations present a challenge. Efficient algorithms to determine aro- maticity, assign a Kekulé form, and generate tautomers are detailed. Many enzymes are selective and specific to stereochemistry. Methods for the identification, depiction, comparison, and description of stereochem- istry are described. Contents Contents xi List of Figures xix 1 Introduction1 1.1 Introduction to genome-scale metabolic models . .2 1.1.1 Constraint-based analysis . .2 1.1.2 Creation . .3 1.1.3 Exchange . .5 1.1.4 Databases . .6 1.1.5 Software tools . .8 1.2 Introduction to cheminformatics . 12 1.2.1 Representation and applications . 12 1.2.2 Exchange . 14 1.2.3 Software libraries . 16 1.3 Chemical information in metabolic reconstructions . 18 1.3.1 Applications . 18 1.3.2 Availability of reconstructions with chemical structures . 21 1.4 Objectives of this thesis . 23 1.5 Publications . 24 I Chemical data in metabolic reconstructions 25 2 Alignment of reconstructions 27 2.1 Introduction . 27 2.1.1 Motivation . 27 xi 2.1.2 Identifying metabolites . 27 2.1.3 Existing approaches . 29 2.1.4 Introduction to chemical identity . 30 2.1.5 Objectives . 33 2.2 Methods . 35 2.2.1 Connectivity hash codes . 35 2.2.2 Candidate retrieval . 40 2.2.3 Candidate reasoning and classification . 42 2.3 Results and discussion . 46 2.3.1 Exploratory analysis of aligning iJR904 and MetaCyc 16 . 46 2.3.2 Hash code collision analysis . 49 2.3.3 Structure key distributions . 55 2.3.4 Alignment of iJR904 and MetaCyc 17.5 . 59 2.3.5 Alignment of iYO844 and iBsu1103 .................. 69 2.4 Conclusions . 78 2.5 Availability . 80 3 Annotation of reconstructions 83 3.1 Introduction . 83 3.1.1 Motivation . 83 3.1.2 Existing software . 84 3.1.3 Objectives . 85 3.2 Methods . 86 3.2.1 Data representation . 86 3.2.2 Data import and export . 90 3.2.3 Data resources . 95 3.2.4 Graphical interface . 99 3.3 Results and discussion . 104 3.3.1 Cross-references . 104 3.3.2 Chemical structure annotations . 108 3.3.3 Verification . 112 3.3.4 Standardisation . 112 3.4 Conclusions . 115 3.5 Availability . 117 II Chemical information processing 119 4 Ring perception 121 4.1 Introduction . 121 4.1.1 Motivation . 121 4.1.2 Types of ring information . 122 4.1.3 Objectives . 126 4.2 Methods . 129 4.2.1 Graph data structures . 129 4.2.2 Cycle membership . 131 4.2.3 Minimum cycle basis, essential and relevant cycles . 132 4.2.4 All elementary cycles . 134 4.3 Results and discussion . 135 4.3.1 Cycle membership . 135 4.3.2 Minimum cycle basis, essential, and relevant cycles . 137 4.3.3 All elementary cycles . 141 4.4 Conclusions . 143 4.5 Availability . 144 5 Aromaticity, Kekulisation, and tautomerism 147 5.1 Introduction . 147 5.1.1 Motivation . 147 5.1.2 Aromaticity . 148 5.1.3 Kekulisation . 149 5.1.4 Tautomerism . 150 5.1.5 Objectives . 151 5.2 Methods . 153 5.2.1 Aromaticity . 153 5.2.2 Graph matching . 154 5.2.3 Fast Kekulisation . 158 5.2.4 Unique tautomer generation . 160 5.3 Results and discussion . 161 5.3.1 Aromaticity . 161 5.3.2 Kekulisation . 162 5.3.3 Unique tautomer generation . 167 5.4 Conclusions . 170 5.5 Availability . 171 6 Stereochemistry 173 6.1 Introduction . 173 6.1.1 Motivation . 173 6.1.2 Stereocentres . 173 6.1.3 Describing stereochemistry . 174 6.1.4 Objectives . 175 6.2 Methods . 177 6.2.1 Representation . 177 6.2.2 Identification . 178 6.2.3 Depiction . ..