
UvA-DARE (Digital Academic Repository)

The road to knowledge: from biology to databases and back again

Stobbe, M.D.

Publication date: 2012
Document version: Final published version

Citation for published version (APA): Stobbe, M. D. (2012). The road to knowledge: from biology to databases and back again.


- Miranda Stobbe -

Chapter 2: Copyright © 2011 Stobbe et al
Chapter 3: Copyright © 2012 Oxford Journals
Chapter 4: Copyright © 2012 FASEB Journal
Chapter 5: Copyright © 2012 Stobbe et al

All rights reserved. Chapter 2 is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. None of the other parts of this thesis may be reproduced or transmitted in any form or by any means electronic, mechanical, photocopying, any information storage and retrieval system, or otherwise, without written permission from the copyright owner.

The road to knowledge: from biology to databases and back again

ACADEMIC DISSERTATION

to obtain the degree of doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus prof. dr. D.C. van den Boom, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel

on Thursday 18 October 2012, at 12:00

by Miranda Daniëlle Stobbe, born in Laren

Doctoral committee:

Supervisor (promotor): Prof. dr. A.H.C. van Kampen
Co-supervisor (copromotor): Dr. ir. P.D. Moerland

Other members: Prof. dr. S. Brul, Dr. ir. C.T.A. Evelo, Prof. dr. W.J. Stiekema, Prof. dr. B. Teusink, Dr. I. Thiele, Prof. dr. R.J.A. Wanders, Prof. dr. L.F.A. Wessels

Faculty of Medicine

The printing of this thesis is supported by the Stichting ter bevordering van de Klinische Epidemiologie en Biostatistiek, the Academic Medical Center Amsterdam and the Netherlands Bioinformatics Centre.

This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). The research in this thesis was carried out in the department of Clinical Epidemiology, Biostatistics and Bioinformatics of the Academic Medical Center, Amsterdam, the Netherlands.

Printed by Wöhrmann Print Service, Zutphen, the Netherlands

For grandpa and grandma Praag, and grandpa Stobbe

Contents

Chapter 1: General introduction

Chapter 2: Critical assessment of human metabolic pathway databases: a stepping stone for future integration

Chapter 3: Knowledge representation in metabolic pathway databases

Chapter 4: Improving the description of metabolic networks: the TCA cycle as example

Chapter 5: Consensus and Conflict Cards for metabolic pathway databases

Chapter 6: General discussion

Bibliography

Summary

Samenvatting (summary in Dutch)

Curriculum Vitae & list of publications

Dankwoord - Acknowledgements

General introduction

“There is nothing like looking, if you want to find something. You certainly usually find something, if you look, but it is not always quite the something you were after.”

J.R.R. Tolkien (The Hobbit)

Chapter 1

From biology …

The study of metabolism has a long history, the roots of which can be traced back to the year 1614, when Santorio Sanctorius published his results on body weight fluctuations during the course of a day (Sanctorius, 1614). The realization in the 19th century that the reactions within a cell are the same as those studied in chemistry marked a new era in the field of metabolism. Another important milestone was the discovery that enzymes catalyze the metabolic reactions (Buchner, 1897). Series of reactions leading to a particular biological outcome have traditionally been organized into pathways. The first complete metabolic pathways were described in the 1930s, among them the tricarboxylic acid (TCA) cycle. This classical pathway was discovered by Hans Krebs (Krebs and Johnson, 1937), for which he earned a Nobel Prize in 1953. From the 1930s onwards an increasing number of metabolic pathways have been unraveled. Although often studied in isolation, pathways interact and together constitute what is referred to as the metabolic network. This highly organized and complex network of reactions is able to adapt to a constantly changing environment. Nowadays, metabolic networks are studied on a genome-wide scale for a wide range of organisms, fueled by the tremendous progress in the assembly and functional annotation of whole genomes over the past 15 years.

… to pathway databases …

The knowledge on metabolism gained over the years is scattered across a multitude of resources, including scientific literature. To collect and organize this knowledge a growing number of (public) metabolic pathway databases have been created for many different organisms. Figure 1 provides a high-level overview of the main aspects of the metabolic network as described in a pathway database. One key objective of these databases is to accurately represent the metabolic network in a format suitable for computational processing and analyses. The databases also serve as a digital encyclopedia and are often accompanied by powerful visualization aids. One of the first databases to be made publicly available was the Kyoto Encyclopedia of Genes and Genomes (KEGG) in 1995 (Kanehisa et al, 2012). KEGG was initiated to move from a list of parts, such as catalogs containing functions of individual genes, to pathways that show how these parts interact. A metabolic pathway database integrates the functional annotation of the genome and the metabolic network of an organism. Such an integrative approach is crucial, since the inner workings of the metabolic network cannot be unraveled by only considering the individual roles of its parts, just as we cannot understand the principles of powered flight and the working details of a modern aircraft by only considering the components of an airplane laid out on a hangar floor (Vastrik et al, 2007).


Figure 1 - The main aspects of the metabolic network as described in pathway databases. The description of the metabolic network in a pathway database can also be used as a basis for the formulation of a mathematical model.

Only by studying the metabolic network at the systems level can one begin to understand (human) metabolism in healthy and diseased states, and for this pathway databases are instrumental.

Metabolic network construction

The great value and potential of pathway databases has inspired many research groups to construct their own database for their organism of interest. The metabolic networks of several key organisms, such as S. cerevisiae and H. sapiens, are even described in multiple databases. Different strategies have been used to build a metabolic network (Table 1). Frequently a genome-based approach is taken to construct an initial draft using the functional annotation of the genome of a particular organism as a basis. Using annotation, reactions are retrieved from other databases such as the one of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, KEGG LIGAND (Goto et al, 2002) or MetaCyc (Caspi et al, 2012). The manner and degree of manual curation of the resulting genome-scale network differ among the various pathway databases. In most cases the genome-based draft is further refined based on expert knowledge, scientific literature, and other resources. A next step taken by research groups focusing on systems biology is to convert the network into a constraint-based mathematical model (Thiele and Palsson, 2010b). This model is iteratively adapted and fine-tuned by verifying its ability to simulate known metabolic functions. An alternative to the genome-based approach is to build the metabolic network incrementally. One particular example is Reactome, in which independent researchers with expert knowledge of parts of the metabolic network are invited to curate a specific process (Croft et al, 2011). New information is peer-reviewed before being added to Reactome. WikiPathways is another example of a database using an incremental approach (Pico et al, 2008). This database enables contributions from the entire community, and any user can either add a pathway or edit an existing one.

… and back again.

One of the most popular applications of a pathway database is to provide context for the analysis and interpretation of high-throughput data such as those obtained from next-generation sequencing, microarrays, proteomics or metabolomics. Typically, the differentially expressed genes, proteins or metabolites are mapped to the metabolic network to identify the affected metabolic processes. This gives more insight into the possible mechanisms underlying the condition of interest than solely looking at a long list of individual genes, proteins or metabolites (Khatri et al, 2012). Pathway databases have also been used in the study of the evolution of the metabolic network across species (Tanaka et al, 2006) and in disease-related studies, such as the relation between diseases frequently co-occurring in the same patient (Lee et al, 2008). Furthermore, the holistic view that the metabolic network offers allows researchers to more easily identify gaps in our knowledge on metabolism. For instance, analyzing the network can reveal metabolites that are produced, but not consumed, indicating that the fate of the metabolites is unknown (Rolfsson et al, 2011). One of the ultimate goals is to construct the full metabolic network in all its detail as a mathematical model that can be used to generate experimentally verifiable hypotheses, to identify potential drug targets and to simulate the effect of network perturbations, such as loss of function (Oberhardt et al, 2009).
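The gap analysis mentioned above (metabolites that are produced but never consumed) can be illustrated with a minimal sketch. The toy network, the reaction identifiers and the function name `dead_end_metabolites` are hypothetical and chosen only for illustration. The sketch also assumes irreversible reactions and ignores compartments and exchange with the environment, which a real analysis would need to handle.

```python
# Hypothetical toy network: reaction id -> (set of substrates, set of products).
# All reactions are treated as irreversible for simplicity (an assumption).
reactions = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"B"}, {"D"}),
}


def dead_end_metabolites(reactions):
    """Return metabolites produced by at least one reaction but consumed by none."""
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed |= substrates
        produced |= products
    return produced - consumed


print(sorted(dead_end_metabolites(reactions)))  # ['C', 'D']
```

In this toy network B is both produced and consumed, whereas C and D have no consuming reaction and are therefore flagged as candidates for a knowledge gap.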


| Database | Description | Construction strategy | Curation | # of organisms |
|---|---|---|---|---|
| BioCyc | Collection of Pathway/Genome Databases, each describing the genome and metabolic pathways of a single organism | genome-based | Three levels of curation: Tier 1 - highly curated, tens of man-years of literature-based curation; Tier 2 - moderately curated, i.e., partly curated and partly automatically derived; Tier 3 - automatically derived, not curated | > 1700 |
| BiGG | BIochemical Genetic and Genomic knowledgebase of large scale metabolic networks | genome-based | Curation based on primary literature articles, reviews, and biochemical textbooks and by performing functional validation tests | 8* |
| EHMN | Edinburgh Human Metabolic Network | genome-based | Curated based on literature and the literature-based Enzymes and Metabolic Pathways database | 1 (human) |
| INOH | Integrating Network Objects with Hierarchies | incremental | Constructed manually from text books | 1 (human) |
| KEGG | Kyoto Encyclopedia of Genes and Genomes: resource for understanding high-level functions and utilities of the biological system from molecular-level information | genome-based | The main 15 databases KEGG consists of are manually curated. Several auxiliary databases containing genomic information, such as the one for draft genomes, are automatically generated. | > 1700 |
| Panther | Protein Analysis THrough Evolutionary Relationships: protein sequences are grouped into functional subfamilies, which allows for a more accurate association with biological pathways. | incremental | Manually curated by 20 external experts with an option for community pathway curation | 48 |
| Reactome | Open-source, open access, manually curated and peer-reviewed pathway database | incremental | Human network is curated, other organisms are derived automatically using orthology**. Curation is done by invited biological experts and each module is peer-reviewed before it is added to Reactome. | 20 |
| UniPathway | A resource for the exploration and annotation of metabolic pathways | starting from reactions, which are linked to proteins | Curated based on literature and existing metabolic resources such as KEGG and MetaCyc | > 2800 |
| WikiPathways | Open, public platform dedicated to the curation of biological pathways by and for the scientific community | incremental | Curated by the community | 22 |

Table 1 - Selection of publicly available pathway databases. *Ten networks are described, three of which are of E. coli. **There are 4 spin-offs, e.g., FlyReactome and Gallus Reactome, maintained by other research groups.


Outline

Our own interest in metabolic pathway databases started when developing algorithms for the identification of plausible candidates for 'missing genes' (Orth and Palsson, 2010). A gene is referred to as ‘missing’ if, for a metabolic reaction that is known to take place, the gene product (enzyme) catalyzing it is unidentified. When evaluating the performance of our algorithms to identify such missing genes we observed, however, that the outcome depended heavily on the metabolic pathway database used. The dependency of the outcome of computational analyses on the choice of a specific metabolic database has also been observed by others (Lee et al, 2008; Zelezniak et al, 2010). A single, complete, and accurate description of the metabolic network is therefore essential, not only for resolving the ‘missing genes’ problem, but also in many other applications such as the analysis of high-throughput data and the in silico prediction of phenotypes. This motivated us to continue on a different path and focus on the metabolic pathway databases themselves to unravel the reasons for the differences observed in the outcome of computational analyses. Some people considered (and still consider) this path to be short, straight and simple to travel, but it turned out to be a long, but rewarding journey.

In Chapters 2 and 3 we describe the results of the analysis and comparison of the description of the human metabolic network as given by five frequently used pathway databases. We compared these databases on two levels. On the level of the biology underlying the metabolic network we wanted to see if different descriptions agree on which reactions take place in human and which genes are involved. It is to be expected that to some extent the descriptions differ since the networks were built and curated in different ways and by different research groups (Table 1). On the other hand, these databases aim to represent the metabolic capabilities of the same organism and one would therefore expect that at least the core of the metabolic network described by these databases would be similar, but we show that this is not the case. So far, however, the extent of the differences and the explanations for the differences had not been systematically analyzed for the human metabolic network, likely because the differences were assumed to be small. The results of our comparison are valuable in their own right but also provide further insight into the reasons why these databases differ in the first place. The second level on which we compared the pathway databases is the representation of the metabolic network in a digital format. Not only differences in content may influence the results of analyses, but also the varying definitions that are used by the databases to, for instance, define a pathway (Green and Karp, 2006). A detailed understanding of how a pathway database represents knowledge is, therefore, also important. These two comparisons provide useful insights into the road ahead to integrate the knowledge contained in the pathway databases into a single metabolic network and the challenges to be met.

From pathway database experts…

In Chapter 2 we describe the results of the comprehensive and systematic analysis of the descriptions of the human metabolic network as found in five of the major databases, i.e., EHMN (Hao et al, 2010), H. sapiens Recon 1 (Duarte et al, 2007), HumanCyc (Romero et al, 2004), and the metabolic subsets of KEGG and Reactome. Reasons why the comparison of the databases is not as trivial as it may seem include the widely different ways in which data is stored and organized, and the different ways in which it needs to be retrieved. Another reason is the difficulty of establishing whether databases refer to the same metabolite and consequently the same reaction. We have successfully addressed these challenges. The results of our comparison show a surprisingly limited consensus, even for core metabolic processes like metabolism. Moreover, the databases differ in the breadth and depth of their coverage of the human metabolic network, emphasizing the importance of integration efforts to further refine its description.

In Chapter 3 we discuss how the five pathway databases solve the challenge of capturing the knowledge on human metabolism in a digital format. The choices made in how to represent knowledge affect the ability of a pathway database to capture the biological complexity of human metabolism. It depends on the application at hand which aspects of the metabolic network are important and to what detail they need to be represented. The differences in representation we identified are of interest for (future) database developers, knowledge curators, and domain experts and can help to further improve knowledge representation.

… to biologists

Gathering all up-to-date knowledge on metabolism to build an accurate metabolic network is a huge challenge as the relevant literature is extensive. This is further complicated by the changing nomenclature of enzymes and metabolites in the course of time. Moreover, conclusive evidence is not available for every piece of the metabolic network, and some parts might still be subject to controversy. Chapters 4 and 5 illustrate that the different views on the same biological system offered by the databases can reveal both controversial and complementary biological knowledge. By exploiting these different views we can further improve the description of the (human) metabolic network. These chapters also underline the importance of the involvement of experts on metabolism.


The aim of Chapter 4 is to increase the awareness of the scientific community of the existing differences and biological inaccuracies within the descriptions provided by pathway databases, and to convince experts to help resolve them. For this purpose, we use one of the most well-known pathways, the TCA cycle, as an example. We show that the lack of consensus between pathway databases can partly be explained by an inaccurate description of the knowledge found in scientific literature. We compared the descriptions of the TCA cycle as given by ten databases. None of these were entirely consistent with the literature. Based on the ten descriptions, additional literature research and the knowledge of two experts in the field of metabolism we propose an improved description of the TCA cycle. This proved to be quite a time-consuming challenge, even for this relatively small pathway. First of all, it is challenging to keep an overview of the large volume of articles related to the TCA cycle from 1937 onwards and to cope with the changing nomenclature for enzymes and metabolites. Moreover, the biochemistry behind the TCA cycle turned out to be not as clear-cut as one might expect, and active involvement of experts proved to be crucial to resolve the conflicting information in the ten databases.

In Chapter 5 we present the web application 'Consensus and Conflict Cards' (C2Cards) that allows experts to more easily identify the differences between metabolic pathway databases mentioned above. Case studies illustrate that the concise overview a C2Card provides can reveal conflicting information that requires additional biochemical experiments to be resolved. Although built for H. sapiens, C2Cards can easily be constructed for other organisms as well. Identifying conflicts is essential for ongoing efforts to reconcile the various descriptions of the metabolic network available for a particular organism. The examples of conflicting information uncovered by the C2Cards application also illustrate the advantage of combining descriptions and the importance of going back to the literature.

We conclude by discussing the road ahead in further refining the description of the (human) metabolic network (Chapter 6).


Critical assessment of human metabolic pathway databases: a stepping stone for future integration

Miranda D. Stobbe (1,3), Sander M. Houten (5,6), Gerbert A. Jansen (1,3), Antoine H.C. van Kampen (1,2,3,4), Perry D. Moerland (1,3)

1 Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, 1100 DE, Amsterdam, the Netherlands
2 Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH, Amsterdam, the Netherlands
3 Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA, Nijmegen, the Netherlands
4 Netherlands Consortium for Systems Biology, University of Amsterdam, PO Box 94215, 1090 GE, Amsterdam, the Netherlands
5 Department of Clinical Chemistry, Laboratory Genetic Metabolic Diseases, Academic Medical Center, University of Amsterdam, PO Box 22700, 1100 DE, Amsterdam, the Netherlands
6 Department of Pediatrics, Emma Children's Hospital, Academic Medical Center, University of Amsterdam, PO Box 22700, 1100 DE, Amsterdam, the Netherlands

Published in: BMC Systems Biology 2011, 5:165

Abstract

Background

Multiple pathway databases are available that describe the human metabolic network and have proven their usefulness in many applications, ranging from the analysis and interpretation of high-throughput data to their use as a reference repository. However, so far the various human metabolic networks described by these databases have not been systematically compared and contrasted, nor has the extent to which they differ been quantified. For a researcher using these databases for particular analyses of human metabolism, it is crucial to know the extent of the differences in content and their underlying causes. Moreover, the outcomes of such a comparison are important for ongoing integration efforts.

Results

We compared the genes, EC numbers and reactions of five frequently used human metabolic pathway databases. The overlap is surprisingly low, especially on reaction level, where the databases agree on 3% of the 6968 reactions they have combined. Even for the well-established tricarboxylic acid cycle the databases agree on only 5 out of the 30 reactions in total. We identified the main causes for the lack of overlap. Importantly, the databases are partly complementary. Other explanations include the number of steps a conversion is described in and the number of possible alternative substrates listed. Missing metabolite identifiers and ambiguous names for metabolites also affect the comparison.

Conclusions

Our results show that each of the five networks compared provides us with a valuable piece of the puzzle of the complete reconstruction of the human metabolic network. To enable integration of the networks, next to a need for standardizing the metabolite names and identifiers, the conceptual differences between the databases should be resolved. Considerable manual intervention is required to reach the ultimate goal of a unified and biologically accurate model for studying the systems biology of human metabolism. Our comparison provides a stepping stone for such an endeavor.

Comparison of human metabolic pathway databases

Introduction

A detailed description of the human metabolic network is essential for a better understanding of human health and disease (Mo and Palsson, 2009). Several of the most prevalent diseases in modern societies, such as cardiovascular disease, diabetes, and obesity have a strong metabolic component. These multifactorial diseases involve hundreds of genes and many developmental and environmental factors. Therefore, network-based approaches are needed to uncover the parts of the molecular mechanisms perturbed by disease (Lusis et al, 2008) and to identify possible drug targets. For example, metabolic networks are nowadays routinely used for the systems-level interpretation of high-throughput data, such as microarray gene expression profiles (Antonov et al, 2008; Goffard et al, 2009).

Over the past fifteen years several groups have constructed high-quality human (metabolic) pathway databases that can be used in this endeavor (Bader et al, 2006; Croft et al, 2011; Duarte et al, 2007; Hao et al, 2010; Pico et al, 2008; Romero et al, 2004). One of the first pathway databases was the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al, 2010) that was initiated to depart from the existing gene catalogs to pathways. Another example is Reactome (Croft et al, 2011), which has as one of its main goals to serve as a knowledgebase that describes human biological processes and can be used for computational analyses. The first fully compartmentalized, genome-scale in silico model of the human metabolic network is Homo sapiens Recon 1 (Duarte et al, 2007). This model forms a stepping stone for modeling human metabolic phenotypes.

The various pathway databases available differ in a number of ways and all have their own strengths and weaknesses. For example, they have different solutions for technical issues such as how the data is presented to the user, how one can query the database (Soh et al, 2010; Wittig and De Beuckelaer, 2001), and the exchange formats provided (Bauer-Mehren et al, 2009; Chowbina et al, 2009; Soh et al, 2010). Several initiatives, such as BioWarehouse (Lee et al, 2006) and Pathway Commons (http://www.pathwaycommons.org), have used a data warehouse approach to resolve these differences. By bringing multiple databases under one roof, a data warehouse can be used as a “one-stop shop” for answering most of the questions that the source databases can handle, but via a uniform interface (Stein, 2003). Another type of difference is that the conceptualizations used vary, for example, with respect to the definition of a pathway (Green and Karp, 2006; Soh et al, 2010). Furthermore, different databases have taken different approaches in the reconstruction process of the human metabolic network. The reconstruction of the Edinburgh Human Metabolic Network (EHMN) (Hao et al, 2010), for example, is based on a genome-scale approach using genome annotation as a starting point. Reactome, on the other hand, takes an incremental approach, regularly adding new parts to its network, with reactions as basic units (Croft et al, 2011). The manner and level of curation may also differ per database. For instance, Recon 1 is completely manually curated using evidence from literature and then fine-tuned and validated by simulating 288 known metabolic functions in silico. Most of the initial content of HumanCyc (Romero et al, 2004) was automatically derived from both genome annotation and MetaCyc, a multiorganism curated metabolic pathway database, and only curated to a limited extent. Further manual curation of HumanCyc resumed in 2009. Finally, in the reconstruction process evidence from literature may be interpreted differently by curators (Mo and Palsson, 2009).

It may be apparent that the described differences will have an effect on the metabolic networks defined by the databases. However, so far the various metabolic networks available have not been systematically compared, nor has the extent to which they differ been quantified. For a researcher, e.g., a biomedical scientist who wants to use these databases as a reference repository or a bioinformatician who wants to perform a systems-level analysis of human metabolism, it is crucial to know the extent of the differences in content as well as their underlying causes. The choice for a particular database may, for example, influence the outcome of a computational analysis, as evidenced by diverging results for methods that were applied to multiple metabolic pathway databases (Elbers et al, 2009; Green and Karp, 2006; Lee et al, 2008; Zelezniak et al, 2010). Moreover, the sheer variety of metabolic pathway databases is unsatisfactory and their integration is desired. This has been recognized by several groups and integration initiatives are currently ongoing for various organisms (Thiele and Palsson, 2010a). This has already led to the publication of consensus metabolic networks for S. cerevisiae (Herrgård et al, 2008) and for the human pathogen S. typhimurium (Thiele et al, 2011). The results of a systematic comparison, including the reasons for the differences, can be used as a stepping stone for the reconciliation of human metabolic networks.

We performed a systematic comparison of five frequently used databases, each of which is based on a different approach towards reconstructing the human metabolic network and built by an independent research group: EHMN, Homo sapiens Recon 1 (referred to as BiGG in the rest of the paper), HumanCyc, and the metabolic subsets of KEGG and Reactome. We compared the metabolic reactions, Enzyme Commission (EC) numbers, enzyme encoding genes as well as combinations of these three elements across the five selected databases. We provide an overall analysis, but also compare the tricarboxylic acid (TCA) cycle separately to see to what extent the databases agree on this classical metabolic pathway. Our comparison allows us to identify the parts the databases agree on and at the same time to reveal conflicting information. Moreover, current reconstructions of the human metabolic network are work in progress and, therefore, still contain gaps as evidenced by the regular updates of the various databases, reported dead-end metabolites (Recon 1) (Duarte et al, 2007), and listed missing genes (HumanCyc). Our comparison provides a valuable source of complementary information that can be used to fill such knowledge gaps.

Our results show a surprisingly limited level of agreement between the five databases and highlight the challenges to be met when integrating their contents into a single metabolic network.

Results

For each of the five pathway databases, i.e., BiGG, EHMN, HumanCyc, KEGG, and Reactome (Table 1), we retrieved all metabolic reactions with their corresponding genes, EC numbers, and pathways. Data was imported into a relational database. The database content statistics of the five databases (Table 2) already show that there are notable differences in database size.

| Database | Export formats used | Version (a) | Downloaded from |
|---|---|---|---|
| BiGG | Flat file, SBML | 1 | http://bigg.ucsd.edu/ |
| EHMN | Excel | 2 | http://www.ehmn.bioinformatics.ed.ac.uk/ |
| HumanCyc | Flat file | 15.0 | http://biocyc.org/download.shtml |
| KEGG | Flat file, KGML | 58 | ftp://ftp.genome.jp/pub/kegg/ |
| Reactome | MySQL database | 36 | http://reactome.org/download/index.html |

Table 1 - Overview of metabolic pathway databases used. (a) Downloaded in the first week of May 2011. KGML: KEGG Markup Language; SBML: Systems Biology Markup Language.

| Database | Number of genes | Number of EC numbers | Number of metabolites | Number of reactions |
|---|---|---|---|---|
| BiGG | 1496 | 645 | 1485 | 2617 |
| EHMN | 2517 | 940 | 2676 | 3893 |
| HumanCyc | 3586 | 1215 | 1681 | 1785 |
| KEGG | 1535 | 726 | 1553 | 1635 |
| Reactome | 1159 | 356 | 984 | 1175 |

Table 2 – Pathway database content statistics. Genes: counts are based on the internal database identifiers and include genes encoding for a component of a protein complex as separate entities. EC numbers: only fully specified EC numbers are counted. Metabolites: counts are based on the internal database identifiers and include instances of metabolite classes for HumanCyc and members of sets for Reactome. Reactions: if reactions only differ in direction and/or compartments they are counted as one.


For each comparison we calculated the consensus, defined as the overlap between the databases as a percentage of their union:

$$\text{consensus} = \frac{|C_{\text{BiGG}} \cap C_{\text{EHMN}} \cap C_{\text{HumanCyc}} \cap C_{\text{KEGG}} \cap C_{\text{Reactome}}|}{|C_{\text{BiGG}} \cup C_{\text{EHMN}} \cup C_{\text{HumanCyc}} \cup C_{\text{KEGG}} \cup C_{\text{Reactome}}|} \times 100\%$$

where C is the set of entities (genes, EC numbers, metabolites, reactions) under consideration. The consensus is constrained by the smallest database for a specific entity, which is in all cases Reactome. Therefore, we also calculated a score that is less sensitive to these differences in database size, the majority score, defined as the number of entities that occurs in at least three out of the five pathway databases as a percentage of their union. To limit the impact of out-of-date identifiers and EC numbers on our comparison, the ones that had been transferred were replaced by their new ID/EC number and otherwise they were not used in the comparison (Supplementary Table S1).
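The two scores defined above can be illustrated with a minimal Python sketch. The function name `consensus_and_majority`, the dictionary layout, and the toy gene labels are assumptions made for illustration only; they are not taken from the original analysis pipeline.

```python
from collections import Counter


def consensus_and_majority(sets_by_db, majority=3):
    """Return (consensus %, majority score %) for a dict: database -> set of entities.

    consensus: entities present in every database, as a percentage of the union.
    majority score: entities present in at least `majority` databases,
    as a percentage of the union.
    """
    sets = list(sets_by_db.values())
    union = set().union(*sets)
    if not union:
        return 0.0, 0.0
    # Count in how many databases each entity occurs.
    counts = Counter(entity for s in sets for entity in s)
    in_all = sum(1 for c in counts.values() if c == len(sets))
    in_majority = sum(1 for c in counts.values() if c >= majority)
    return 100.0 * in_all / len(union), 100.0 * in_majority / len(union)


# Toy data (hypothetical gene labels, not real database content).
toy = {
    "BiGG": {"g1", "g2", "g3"},
    "EHMN": {"g1", "g2", "g4"},
    "HumanCyc": {"g1", "g2", "g5", "g6"},
    "KEGG": {"g1", "g3"},
    "Reactome": {"g1"},
}
consensus, majority_score = consensus_and_majority(toy)
print(f"consensus = {consensus:.1f}%, majority score = {majority_score:.1f}%")
```

In this toy example only `g1` occurs in all five sets, while `g1` and `g2` occur in at least three, so the smallest set (here "Reactome") caps the consensus but not the majority score.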

Comparison: genes

Although some reactions in the metabolic network may take place spontaneously, most reactions are catalyzed by an enzyme. In the first comparison we, therefore, investigated the genes encoding for these enzymes by comparing their Gene IDs. The consensus on gene level is only 13% of the 3858 Entrez Gene IDs contained in the union of all five databases (Table 3, Figure 1). The majority score shows that only 42% of all genes can be found in at least three databases. There are 1139 genes that are present in only one of the databases, representing 30% of the total.

We compared the Gene Ontology (GO) annotation of the 510 genes in the consensus on gene level versus the union of the remaining genes using FatiGO (Medina et al, 2010) to gain a better understanding of the biological processes the consensus genes are involved in. The set of consensus genes is significantly enriched (adjusted P<0.01) for processes related to the generation of precursor metabolites and energy, nucleotide metabolism, alcohol metabolism, and metabolism (Supplementary File S1).

Comparison: EC numbers

The Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) classifies and names enzymes according to the reaction they catalyze (http://www.chem.qmul.ac.uk/iubmb/enzyme/). EC numbers are used as the vocabulary to describe this classification. An EC number consists of four numbers. The first three indicate increasingly narrower classes of enzymatic functions. The fourth number serves as a serial number and defines the substrate specificity of the enzyme (Kotera et al, 2004). Comparing EC numbers on the basis of only the first three numbers (ignoring the last) thus gives a global indication whether the databases agree on the types of enzymatic functions involved in the human metabolic network. There are 164 unique entries in the union of all databases with a consensus of 51%. For the remaining 49% the five databases do not agree. For example, the group of peptidases present in the union of the five databases is not part of the consensus.

| Number of (percentage of union) | Genes | EC numbers | Metabolites | Reactions | Reactions (ignoring e-, H+, H2O) |
|---|---|---|---|---|---|
| union | 3858 | 1410 | 4679 | 7758 | 6968 |
| consensus | 510 (13%) | 259 (18%) | 400 (9%) | 101 (1%) | 199 (3%) |
| majority score | 1636 (42%) | 709 (50%) | 967 (21%) | 732 (9%) | 1004 (14%) |
| Unique per database (percentage of union) | | | | | |
| BiGG | 45 (1%) | 48 (3%) | 528 (11%) | 1441 (19%) | 1250 (18%) |
| EHMN | 128 (3%) | 82 (6%) | 1184 (25%) | 2120 (27%) | 1832 (26%) |
| HumanCyc | 759 (20%) | 294 (21%) | 739 (16%) | 1032 (13%) | 905 (13%) |
| KEGG | 63 (2%) | 13 (1%) | 282 (6%) | 414 (5%) | 348 (5%) |
| Reactome | 144 (4%) | 11 (1%) | 406 (9%) | 601 (8%) | 539 (8%) |
| Total | 1139 (30%) | 448 (32%) | 3139 (67%) | 5608 (72%) | 4874 (70%) |

Table 3 – Statistics of the pathway database comparison. Genes: Entrez Gene IDs, including genes encoding for a component of a protein complex as separate entities. EC numbers: only fully specified EC numbers. Metabolites: if two metabolites of one database both match the same metabolite in another database this is counted as one match. The first 'Reactions' column: all reactions are considered. The second 'Reactions' column: reactions were not required to match on e-, H+ and/or H2O.

Figure 1 – Overlap between the five metabolic pathway databases for the global comparison. The dark green bars give the percentage of entities (genes, EC numbers, metabolites and reactions) that are part of the consensus. The majority score is given by the combined percentages of the dark green, light green (4 out of 5 databases agree) and yellow (3 out of 5) bars. The orange bars indicate the percentage of entities that can only be found in 2 databases. The percentage of unique entities is indicated by the red bars. In matching the reactions we did not take into account e-, H+ and H2O.


If we compare complete EC numbers, thus taking into account the serial number that represents substrate specificity, the consensus decreases to 18% of the 1410 EC numbers contained in the union (Table 3, Figure 1). Of the total set of EC numbers 32% can only be found in a single database, primarily HumanCyc.
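The difference between comparing EC numbers on their first three numbers and on all four can be sketched as follows. This is a minimal illustration: the helper names `ec_class` and `is_fully_specified` and the two toy sets are hypothetical, and the check for a "fully specified" EC number (four purely numeric fields) is an assumption about how such a filter might be written.

```python
def ec_class(ec, levels=3):
    """Truncate an EC number to its first `levels` fields, e.g. '1.1.1.41' -> '1.1.1'."""
    return ".".join(ec.split(".")[:levels])


def is_fully_specified(ec):
    """Assumed definition: four fields, all numeric (so '1.1.1.-' is excluded)."""
    parts = ec.split(".")
    return len(parts) == 4 and all(p.isdigit() for p in parts)


# Two toy databases that share enzyme classes but not complete EC numbers.
db_a = {"1.1.1.41", "1.1.1.42", "4.2.1.2"}
db_b = {"1.1.1.37", "4.2.1.3", "1.1.1.-"}

full_a = {ec for ec in db_a if is_fully_specified(ec)}
full_b = {ec for ec in db_b if is_fully_specified(ec)}

shared_full = full_a & full_b
shared_classes = {ec_class(ec) for ec in full_a} & {ec_class(ec) for ec in full_b}
print("shared complete EC numbers:", sorted(shared_full))
print("shared three-number classes:", sorted(shared_classes))
```

The toy sets agree on both enzyme classes while sharing no complete EC number, which mirrors the drop in consensus reported when the serial number is taken into account.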

Comparison: metabolites

Agreement on the metabolites that are part of the metabolic network is a prerequisite for consensus between databases on reaction level. Metabolites were matched based on the KEGG Compound ID, if available for both metabolites. If the KEGG Compound ID was absent, metabolites were matched on one of the other four available metabolite identifiers (KEGG Glycan, ChEBI, PubChem Compound or CAS) or on metabolite name, provided that also the chemical formula matched. The consensus for the metabolites is only 9% of the 4679 metabolites contained in the union (Table 3, Figure 1). The majority score equals 21% of the metabolites.
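The matching cascade described above could be sketched as below. This is one possible reading, not the original implementation: the `Metabolite` class, its field names, and the identifier keys are hypothetical, and the sketch assumes that the chemical formula requirement applies to name matches only.

```python
from dataclasses import dataclass, field
from typing import Optional

# Secondary identifier types named in the text; the key spellings are assumptions.
SECONDARY_ID_TYPES = ("kegg_glycan", "chebi", "pubchem_compound", "cas")


@dataclass
class Metabolite:
    name: str
    formula: Optional[str] = None
    kegg_compound: Optional[str] = None
    other_ids: dict = field(default_factory=dict)


def same_metabolite(a, b):
    """Decide whether two database entries refer to the same metabolite."""
    # Step 1: the KEGG Compound ID decides when both entries have one.
    if a.kegg_compound and b.kegg_compound:
        return a.kegg_compound == b.kegg_compound
    # Step 2: otherwise, accept a shared secondary identifier.
    for id_type in SECONDARY_ID_TYPES:
        id_a, id_b = a.other_ids.get(id_type), b.other_ids.get(id_type)
        if id_a is not None and id_a == id_b:
            return True
    # Step 3: fall back to the name, but only if the formula matches as well.
    same_name = a.name.strip().lower() == b.name.strip().lower()
    same_formula = a.formula is not None and a.formula == b.formula
    return same_name and same_formula


citrate_a = Metabolite("citrate", formula="C6H8O7")
citrate_b = Metabolite("Citrate", formula="C6H8O7")
citrate_c = Metabolite("citrate", formula="C6H5O7")
lipoamide_e = Metabolite("lipoamide-E", kegg_compound="C15972")
lipoamide = Metabolite("lipoamide", kegg_compound="C00248")

print(same_metabolite(citrate_a, citrate_b))
print(same_metabolite(citrate_a, citrate_c))
print(same_metabolite(lipoamide_e, lipoamide))
```

Note that a purely identifier-based rule keeps the enzyme-bound and unbound forms of lipoamide apart, because they carry different KEGG Compound IDs; pairs like this need a manual decision.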

Comparison: reactions

Reactions were considered to be the same if all substrates and products matched (see above). As expected, given the outcome of the metabolite comparison, the number of reactions included in all five databases is small: consensus on reaction level equals 1% of 7758 reactions in the union of all five databases (Table 3).

Reactions are not always balanced, especially with respect to electrons (e-), protons (H+) and water (H2O) (Ott and Vriend, 2006). Therefore, we performed a second comparison where reactions were not required to match with respect to these three metabolites. The number of reactions in the consensus nearly doubled to 199 reactions, corresponding to 3% of the 6968 reactions in total. The majority score for this reaction comparison equals 14% (Table 3, Figure 1). Around one third of the 199 consensus reactions are part of nucleotide metabolism or cofactors and vitamins metabolism (Supplementary File S2). This is in line with the results of the functional enrichment analyses of the consensus genes.
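The strict and the relaxed reaction comparison can be illustrated with a small sketch. It assumes that metabolites have already been mapped to common labels, that direction is ignored (reactions differing only in direction are counted as one), and that stoichiometric coefficients play no role; the function name `reaction_key` and the label spellings are hypothetical.

```python
# Metabolites that the relaxed comparison does not require to match.
IGNORED = frozenset({"e-", "H+", "H2O"})


def reaction_key(substrates, products, ignore=frozenset()):
    """Build a direction-independent key from the two sides of a reaction."""
    left = frozenset(substrates) - ignore
    right = frozenset(products) - ignore
    # A frozenset of the two sides makes A -> B and B -> A compare equal.
    return frozenset({left, right})


# Malate dehydrogenase as two toy databases might describe it:
# one with an explicit proton, the other reversed and without it.
db_a = (["(S)-malate", "NAD+"], ["oxaloacetate", "NADH", "H+"])
db_b = (["oxaloacetate", "NADH"], ["(S)-malate", "NAD+"])

strict_match = reaction_key(*db_a) == reaction_key(*db_b)
relaxed_match = reaction_key(*db_a, ignore=IGNORED) == reaction_key(*db_b, ignore=IGNORED)
print("strict:", strict_match, "relaxed:", relaxed_match)
```

The two descriptions differ only in the proton, so they fail the strict comparison and match once e-, H+ and H2O are left out.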

We compared a relatively large number of pathway databases, each restricting the consensus, which partly explains the small overlap. If we only compare pairs of databases, the overlap on reaction level increases substantially. The consensus of two databases ranges from 11%, when comparing EHMN and Reactome, to as much as 28% when comparing EHMN and KEGG (Supplementary Figure S1). The pairwise comparisons on gene, EC number, and metabolite level also show a substantial increase in overlap.
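A pairwise version of the consensus can be computed along the following lines. The sketch assumes each database has been reduced to a set of comparable reaction keys; the function name and the toy reaction labels are hypothetical and the resulting percentages do not reproduce the reported figures.

```python
from itertools import combinations


def pairwise_consensus(sets_by_db):
    """Overlap of every pair of databases as a percentage of the pair's union."""
    result = {}
    for a, b in combinations(sorted(sets_by_db), 2):
        union = sets_by_db[a] | sets_by_db[b]
        shared = sets_by_db[a] & sets_by_db[b]
        result[(a, b)] = 100.0 * len(shared) / len(union) if union else 0.0
    return result


# Toy reaction sets (hypothetical labels).
toy_reactions = {
    "EHMN": {"r1", "r2", "r3", "r4"},
    "KEGG": {"r1", "r2", "r5"},
    "Reactome": {"r1", "r6"},
}
pairs = pairwise_consensus(toy_reactions)
for (a, b), pct in pairs.items():
    print(f"{a} vs {b}: {pct:.0f}%")
```

Because every additional database can only shrink the intersection and enlarge the union, each pairwise value is at least as large as the consensus over all databases.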


Comparison: combinations

So far, we only compared the databases on a single level. We also investigated the consensus on two levels by requiring the gene and the complete EC number to match. For 35% of the 510 genes in the consensus, the five databases also agree on all EC number(s) connected to a gene. For 63% of the consensus genes the databases agree on at least one EC number. The mismatches suggest that the databases do not fully agree on the enzymatic activities that gene products can have.
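The two-level check for a single consensus gene might look like the sketch below. The function name `ec_agreement`, its return labels, and the toy EC assignments are hypothetical; the categories are mutually exclusive here, so the share of genes with agreement on "at least one" EC number in the sense of the text corresponds to the "all" and "at least one" categories taken together.

```python
def ec_agreement(ec_by_db):
    """Classify the agreement on the EC numbers linked to one gene.

    ec_by_db maps each database to the set of EC numbers it links to the gene.
    """
    sets = [set(s) for s in ec_by_db.values()]
    shared = set.intersection(*sets)
    if shared and all(s == shared for s in sets):
        return "all"            # every database lists exactly the same EC numbers
    if shared:
        return "at least one"   # some, but not all, EC numbers are shared
    return "none"


DATABASES = ("BiGG", "EHMN", "HumanCyc", "KEGG", "Reactome")

# Toy assignments for three hypothetical genes.
gene_x = {db: {"1.1.1.41"} for db in DATABASES}
gene_y = {db: {"4.2.1.3"} for db in DATABASES}
gene_y["EHMN"] = {"4.2.1.3", "4.2.1.2"}
gene_z = {db: {"6.2.1.4"} for db in DATABASES}
gene_z["Reactome"] = {"6.2.1.5"}

for label, gene in (("gene_x", gene_x), ("gene_y", gene_y), ("gene_z", gene_z)):
    print(label, "->", ec_agreement(gene))
```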

If we require an exact match on all three levels - EC number, gene, and reaction - then the five pathway databases agree on the genes and EC numbers of 85 of the 199 reactions in the consensus (when not taking into account e-, H+ and H2O). For 44 reactions the databases agree only on the EC number and for 25 reactions only on the genes. For 24 consensus reactions there is not a single EC number the databases agree on and not a single gene for 9 reactions. The main reason (57 reactions) that there is no agreement on all genes is that one or more of the databases links additional genes to the reaction in comparison to the other databases. See Supplementary File S2 for a detailed summary of the consensus reactions with their associated EC numbers, genes, and pathways.

Comparison: TCA cycle

We also analyzed the well-known TCA cycle, already described in 1937 by Hans Krebs (Krebs and Johnson, 1937; Krebs et al, 1938). For this pathway we expected a high agreement between the databases. However, also for the TCA cycle the consensus on reaction level is surprisingly low, although higher than what was observed at database level. The databases agree on 5 (17%) of the 30 reactions in total (Table 4 and Figure 2, see Supplementary Figure S2 and Supplementary File S3 for a breakdown per database). On gene level the consensus is 36% of 45 genes and on EC number level the consensus is 30% of 20 EC numbers. For the five reactions in the reaction consensus, the databases all agree on the EC number and on at least one gene. For only two reactions (EC 1.1.1.41, EC 4.2.1.2) do they agree on exactly the same set of genes.

| Number of (percentage of union) | Genes | EC numbers | Metabolites | Reactions |
|---|---|---|---|---|
| Union | 45 | 20 | 41 | 30 |
| Consensus | 16 (36%) | 6 (30%) | 18 (44%) | 5 (17%) |
| Majority | 23 (51%) | 11 (55%) | 25 (61%) | 12 (40%) |

Table 4 – Statistics of the comparison of the TCA cycle. Genes: including genes encoding for a component of a protein complex as separate entities. EC numbers: fully specified EC numbers. Metabolites: several metabolites were matched manually (see Materials and Methods). Reactions: reactions were not required to match on H+.


Figure 2 – Comparison of the TCA cycle in five metabolic pathway databases. Map illustrating the (lack of) consensus for the TCA cycle. Metabolites are represented by rectangles, genes by rounded rectangles, and EC numbers by parallelograms. Color indicates how many of the five databases agree on a specific entity (gene, EC number, reaction). We first matched reactions based on their metabolites. Genes and EC numbers were matched within matching reactions. Color of an arrow indicates the number of databases that agree upon an entire reaction, i.e., all its metabolites (except H+ which was matched separately). ‘x’ denotes a missing EC number. Areas labeled A-D highlight the reactions that are discussed in the running text as examples of differences caused by a disagreement on pathway definition.


Analysis of differences between databases

The above results show that consensus between the five databases is low on all levels compared. This is most pronounced for the reactions. First, we use the TCA cycle to illustrate a number of reasons for these differences. Next, we describe how this translates to the comparison at database level.

TCA Cycle

In what follows we present the main, sometimes overlapping, causes for lack of consensus at the reaction level: (i) disagreement on pathway definition, (ii) difference in number of intermediate steps, (iii) a different number of possible alternative substrates. In addition, it is difficult to determine when databases refer to the same metabolite. Missing and out-of-date gene identifiers also hinder the comparison. Since genes and EC numbers are tightly linked to reactions, most differences on these two levels are caused by differences on the reaction level. We conclude with additional causes for lack of consensus for genes and EC numbers.

Pathway definition

The five pathway databases each have their own definition of which reactions are part of the pathway describing the TCA cycle. For example, in KEGG the conversion of pyruvate into acetyl-CoA is included in the TCA Cycle (Figure 2, purple area). In EHMN and BiGG this conversion is part of the glycolysis/gluconeogenesis pathway and in the other two databases it is part of a separate pathway. The differences in pathway definition are further illustrated by several reactions that are not in the consensus of the TCA cycle, but are part of the consensus at database level:

A. The reaction transforming oxaloacetate into phosphoenolpyruvate (EC 4.1.1.32 via GTP → GDP, Figure 2). In general, this reaction, although tightly linked to the TCA Cycle, is considered to be part of gluconeogenesis (Berg et al, 2002). However, KEGG includes this reaction in the TCA cycle pathway. KEGG and EHMN also mention the same conversion with an alternative cosubstrate (EC 4.1.1.32 via ITP → IDP). The latter reaction is not part of the consensus at database level.

B. The reaction converting citrate back to oxaloacetate (EC 2.3.3.8). This reaction is found in BiGG, EHMN, and KEGG. According to Reactome the reaction belongs to the pathway ‘Fatty Acyl-CoA Biosynthesis’ and HumanCyc assigns it to ‘acetyl-CoA biosynthesis (from citrate)’. Moreover, Reactome also provides evidence that the reaction takes place in the cytosol and not in the mitochondrion where the TCA cycle takes place. Interestingly, BiGG and EHMN also claim that the reaction does not take place in the mitochondrion, but they include it in the TCA cycle nevertheless.

C. The reaction transforming succinyl-CoA into succinate via GDP → GTP (EC 6.2.1.4). This reaction is described in HumanCyc, but was not assigned to any pathway.

D. The interconversion of NAD+/NADPH and NADH/NADP+. Only Reactome includes this reaction in the TCA Cycle, in the other four databases it is part of pathways related to nicotinate and nicotinamide metabolism.

Differences in pathway definition explain why 14 of the 30 reactions are not in the consensus (Supplementary File S3).

Number of intermediate steps

Another explanation for the differences observed is that the number of intermediate steps used to describe a specific conversion varies. A typical example is the oxidative decarboxylation of 2-oxoglutarate to succinyl-CoA (2-oxoglutarate dehydrogenase complex). KEGG describes this reaction in four steps. In BiGG, HumanCyc and Reactome the entire oxidative decarboxylation is described in a single step. Interestingly, EHMN describes it both in a single step as well as in three steps.

The databases also disagree on the number of steps for describing the conversion of citrate to isocitrate (EC 4.2.1.3). In BiGG and Reactome this is a single step, but it takes two steps in HumanCyc and KEGG with cis-aconitate as intermediate. Indeed, cis-aconitate has been shown to be an intermediate in the conversion of citrate into isocitrate (Berg et al, 2002; Krebs and Holzach, 1952). EHMN includes both the single and the two-step variant. Note that there is no automated way in which we could tell whether the difference in the number of steps is because of a difference in the level of detail used to describe a particular conversion or due to a disagreement on the number of steps needed for that conversion. Differences in number of intermediate steps explain 14 mismatches on reaction level.

Number of alternative substrates

A third explanation for the observed differences is the variation in the number of possible alternative substrates listed. This is, for example, observed for the type of nucleotide diphosphate as cosubstrate for the conversion of succinyl-CoA to succinate. According to EHMN and KEGG, not only ADP (EC 6.2.1.5) can be used, but also IDP (EC 6.2.1.4). Differences caused by alternative substrates explain six mismatches on reaction level.


Establishing identity

The comparison of the metabolites is hindered by the difficulty of determining in an automated way when databases refer to the same compound. For example, we decided for three pairs of metabolites that the databases are referring to the same metabolite, even though the databases linked different KEGG Compound IDs to these metabolites (see Materials and Methods). The only difference within each of these pairs is that one is the enzyme-bound form of the metabolite, e.g., lipoamide-E (KEGG Compound ID: C15972), and the other is indicated as being unbound, e.g., lipoamide (KEGG Compound ID: C00248). For a relatively small pathway like the TCA Cycle, such highly similar compounds can be easily identified manually, but on database level this is very challenging.

Also out-of-date and missing identifiers influence the comparison. Five unmatched genes from Reactome had an Entrez Gene ID that had become obsolete and could not be transferred to another entry. For a single gene in HumanCyc there were no gene identifiers available at all.

Additional explanations on gene and EC number level

On gene level, ten differences remain that are not caused by differences on the reaction level or out-of-date identifiers. Three genes (ACO1, IREB2, and MDH1) encode for proteins that are not localized in the mitochondrion, according to the UniProt annotation. Since the TCA cycle takes place in the mitochondrion, these may be annotation errors of the pathway databases. In BiGG PDHX encodes for a component of the 2-oxoglutarate dehydrogenase complex, but according to Entrez Gene it encodes for a component of the similar, but different, pyruvate dehydrogenase complex. The gene OGDHL, which is found in three databases, is described by Entrez Gene as ‘oxoglutarate dehydrogenase-like’, which refers to the OGDH gene that is part of the consensus. For two genes (LOC283398 and SUCLA2P1) in Reactome the RefSeq status is ‘inferred’, which may be a reason for the other databases to not include these genes. For the other three genes (AMAC1, DHTKD1, MDH1B) there is no clear explanation. Possibly these are incorrectly connected to the reactions of the TCA cycle.

For four EC numbers the differences on reaction level do not explain why they are not part of the consensus. All four are assigned to the reaction converting 2-oxoglutarate to succinyl-CoA by at least one of the databases. Three of these EC numbers belong to the individual components of the complex catalyzing the reaction. BiGG only assigns one (EC 1.2.4.2) of these three to the catalyst, EHMN assigns all three and HumanCyc leaves the EC number blank. According to IUBMB the EC number (EC 1.2.1.52) assigned by Reactome belongs to the enzyme that can catalyze a similar reaction, but with NADP+/NADPH as cosubstrates instead of NAD+/NADH.

Database level

The explanations we gave for the lack of consensus in the TCA cycle can be generalized to the comparisons at database level. One exception is the difference in pathway definition, as the subdivision of the network in pathways no longer plays a role in the comparisons on database level. However, a similar effect can be observed due to differences in metabolic network coverage.

Metabolic network coverage

All five databases are work in progress and, therefore, do not yet fully cover the complete metabolic network. As the database content statistics (Table 2) show, there are large differences in the number of genes, EC numbers, and reactions contained in each database. On gene and EC number level HumanCyc is largely a superset of the other four databases and contains the highest number of unique entities on these two levels. EHMN has the highest number of metabolites and reactions. This is to a large extent explained by a set of 1100 transport reactions and 1016 reactions in contained in EHMN, compared to 484 and 211, respectively, in Reactome, for example. In general, the size differences can be partly explained by the different criteria the five databases have for including reactions in their metabolic network. A difference in coverage could also to some extent explain the large percentage of data that is only found in one of the databases. For example, there are 1139 unique genes and 4874 unique reactions (Table 3).

To gain a better understanding of which parts of the metabolic network are only described in a single database, we compared the GO annotation of all 1139 unique genes versus the union of the remaining genes using FatiGO (Supplementary File S1). The unique genes are significantly enriched for terms related to ion transport, protein metabolism like proteolysis, and to RNA metabolism such as tRNA processing.

For a more in-depth analysis of the coverage of the individual databases, we compared for each database separately its unique genes with the remaining genes contained in the union (Supplementary File S1). HumanCyc has the largest set of unique genes, which are significantly enriched for terms related to, among others, ion transport, protein metabolic processes like proteolysis, and (t)RNA processing. Enriched terms for Reactome include transport, protein catabolic processes, and regulation of catalytic activity. As metabolic and non-metabolic reactions in Reactome are intertwined, this might be an indication that some non-metabolic reactions are described in the metabolic pathways we selected. EHMN only has few significant terms, which are related to Golgi vesicle transport and budding. EHMN contains the highest number of transport reactions, but 55% of these are not linked to a gene and, therefore, do not influence the GO analysis. BiGG and KEGG contain the lowest number of unique genes and only BiGG has a significantly enriched GO term, namely signal processing.

The GO enrichment analysis shows that the pathway databases are partly complementary and include reactions that are peripheral to metabolism proper, such as ion transport and macromolecular reactions. On the other hand, the genes contained in the majority of the databases, compared to the union of the remaining genes, proved to be significantly enriched for many of what one could consider to be core metabolic processes (Supplementary File S1): nucleotide metabolism, carbohydrate metabolism, lipid metabolism, and generation of energy, among others. One might, therefore, conjecture that the consensus between the databases would significantly increase by restricting the comparison to core metabolic processes. For this purpose, we first grouped the pathways from each of the databases into categories using the KEGG hierarchy as a guideline (see Materials and Methods, Supplementary File S4). Next, we restricted the comparison to the following six core metabolic categories: amino acid metabolism, carbohydrate metabolism, energy metabolism, lipid metabolism, metabolism of cofactors and vitamins, and nucleotide metabolism. Reactions not part of any pathway could not be assigned to a category and were therefore excluded. Furthermore, we excluded macromolecular reactions, i.e., reactions in which at least one metabolite was labeled as being a protein, for HumanCyc and Reactome. Transport reactions found in BiGG, EHMN, HumanCyc and Reactome were also excluded. The database content statistics for this comparison of core metabolic processes are given in Table 5. The consensus hardly changes for any of the entities compared (Figure 3, Table 6). The majority score for the gene comparison increased considerably, by 9 percentage points (from 42% to 51%). For the reaction comparison the majority score increased by only 4 percentage points (from 14% to 18%).

These results support the conclusion that the networks are partly complementary, but also indicate that there are additional reasons for the lack of overlap, which we will describe below.


| Database | Genes | EC numbers | Metabolites | Reactions |
|---|---|---|---|---|
| BiGG | 957 (64%) | 558 (87%) | 1041 (70%) | 1301 (50%) |
| EHMN | 1221 (49%) | 707 (75%) | 1924 (72%) | 2180 (56%) |
| HumanCyc | 832 (23%) | 503 (41%) | 847 (50%) | 815 (46%) |
| KEGG | 1291 (84%) | 619 (85%) | 1190 (77%) | 1308 (80%) |
| Reactome | 413 (36%) | 279 (78%) | 532 (54%) | 511 (43%) |

All cells give the number (percentage of total).

Table 5 – Pathway database content statistics of core metabolic processes. Genes: counts based on the internal database identifiers and including genes encoding for a component of a protein complex as separate entities. EC numbers: only fully specified EC numbers are counted. Metabolites: counts based on the internal database identifiers and including instances of metabolite classes for HumanCyc and members of sets for Reactome. Reactions: if reactions only differ in direction and/or compartments they are counted as one.

| Number of (percentage of union), change vs. Table 3 | Genes | EC numbers | Metabolites | Reactions |
|---|---|---|---|---|
| union | 1723, -55% | 959, -32% | 2805, -40% | 3713, -47% |
| consensus | 264 (15%), +2% | 180 (19%), +1% | 316 (11%), +2% | 144 (4%), +1% |
| majority score | 875 (51%), +9% | 508 (53%), +3% | 757 (27%), +6% | 674 (18%), +4% |
| Unique per database (percentage of union) | | | | |
| BiGG | 75 (4%), +3% | 63 (7%), +4% | 276 (10%), -1% | 621 (17%), -1% |
| EHMN | 186 (11%), +8% | 125 (13%), +7% | 910 (32%), +7% | 1129 (30%), +4% |
| HumanCyc | 70 (4%), -16% | 62 (6%), -15% | 243 (9%), -7% | 326 (9%), -4% |
| KEGG | 180 (10%), +8% | 45 (5%), +4% | 209 (7%), +1% | 274 (7%), +2% |
| Reactome | 19 (1%), -3% | 8 (1%), 0 | 80 (3%), -6% | 154 (4%), -4% |
| Total | 530 (31%), +1% | 303 (32%), 0 | 1718 (61%), -6% | 2504 (67%), -3% |

Table 6 – Statistics of the pathway database comparison of core metabolic processes. Genes: Entrez Gene IDs, including genes encoding for a component of a protein complex as separate entities. EC numbers: only fully specified EC numbers. Metabolites: if two metabolites of one database both match the same metabolite in another database this is counted as one match. Reactions: reactions were not required to match on e-, H+ and/or H2O. Differences with the outcomes of the global comparison (Table 3) are indicated in percentages for each level of comparison.

Figure 3 – Overlap between the five metabolic pathway databases for the comparison of the core metabolic processes. The dark green bars give the percentage of entities (genes, EC numbers, metabolites and reactions) that are part of the consensus in the comparison restricted to the following six categories: amino acid metabolism, carbohydrate metabolism, energy metabolism, lipid metabolism, metabolism of cofactors and vitamins, and nucleotide metabolism. The majority score is given by the combined percentages of the dark green, light green (4 out of 5 databases agree) and yellow (3 out of 5) bars. The orange bars indicate the percentage of entities that can only be found in 2 databases. The percentage of unique entities is indicated by the red bars. In matching the reactions we did not take into account e-, H+ and H2O.


Number of intermediate steps

In the comparison of the TCA cycle a difference in the number of steps used to describe a specific metabolic conversion could easily be identified manually. On database level, however, this poses a considerable challenge and would require very generic tools for network alignment. One indication that the problem is not restricted to the TCA cycle is given by 64 reactions in BiGG for which the comments in the SBML file indicate that the reaction summarizes a conversion that actually consists of several steps. For example, BiGG describes the breakdown of palmitoyl-CoA to octanoyl-CoA in a single step. However, this is a simplification of four rounds of β-oxidation, each round consisting of four separate reactions. In KEGG the same conversion makes up a large part of the ‘Fatty acid metabolism’ pathway.

Number of alternative substrates

The number of reactions linked to one of the 259 consensus EC numbers varies considerably across the databases and equals 411 for Reactome, 441 for HumanCyc, 539 for BiGG, 582 for KEGG, and 942 for EHMN. A possible explanation for a low number of reactions is the use of a single generic reaction to model the broad substrate specificity of an enzyme instead of explicitly describing each specific reaction separately with the same EC number. HumanCyc, for example, uses generic metabolites, such as ‘an alcohol’, in 24% of the reactions linked to an EC number from the consensus. The high number of reactions in EHMN is at least partly explained by the number of alternative substrates specified. Focusing on lipid metabolism, the median number of reactions per EC number is three for EHMN, while for HumanCyc, for example, the median is one. The effect of alternative substrates has been noticed before (Kuffner et al, 2000) in a comparison of all reactions in BRENDA (Chang et al, 2009), ENZYME (Bairoch, 2000), and KEGG. Only 21% of the reactions in these databases overlapped. Consensus increased to 67% when they included only the main reactions, as defined by IUBMB, of BRENDA and not the reactions derived from these with alternative substrates.
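The median number of reactions per EC number, used above to contrast EHMN and HumanCyc, can be computed as in the following sketch. The function name, the input layout (pairs of reaction identifier and EC number) and the toy data are assumptions for illustration; the toy values are chosen only to mimic the reported medians and are not real database content.

```python
from collections import defaultdict
from statistics import median


def median_reactions_per_ec(reaction_ecs):
    """Median number of distinct reactions linked to each EC number.

    reaction_ecs: iterable of (reaction_id, ec_number) pairs.
    """
    per_ec = defaultdict(set)
    for reaction_id, ec in reaction_ecs:
        per_ec[ec].add(reaction_id)
    return median(len(reactions) for reactions in per_ec.values())


# Toy database that lists alternative substrates as separate reactions.
explicit_style = [
    ("r1", "1.1.1.1"), ("r2", "1.1.1.1"), ("r3", "1.1.1.1"),
    ("r4", "2.3.1.15"), ("r5", "2.3.1.15"), ("r6", "2.3.1.15"),
    ("r7", "3.1.1.3"),
]
# Toy database that uses one generic reaction per EC number.
generic_style = [("g1", "1.1.1.1"), ("g2", "2.3.1.15"), ("g3", "3.1.1.3")]

median_explicit = median_reactions_per_ec(explicit_style)
median_generic = median_reactions_per_ec(generic_style)
print(median_explicit, median_generic)
```

A database that models broad substrate specificity with one generic reaction ends up with a lower median than one that enumerates each alternative substrate, even when both cover the same enzymes.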

Establishing identity

The difficulty of determining when databases refer to the same compound partly explains the lack of overlap on metabolite level and consequently on reaction level. Metabolite identifiers provide a common ground for finding corresponding metabolites in a reliable way, provided the correct identifier was assigned to each metabolite. The only identifier type that is shared among the five databases and that is available for a substantial number of metabolites is the KEGG Compound ID (Supplementary Tables S2 and S3). Unfortunately, for 34% (HumanCyc) to 42% (BiGG) of the metabolites included in the pathway databases, except for KEGG, this identifier is missing. In KEGG for 8% of its metabolites the KEGG Glycan ID is provided instead. To increase the number of metabolites for which we could potentially identify corresponding metabolites we also included KEGG Glycan, ChEBI, PubChem and CAS IDs for the comparison. However, 25% (HumanCyc) to 34% (BiGG) of the metabolites included in the pathway databases (except KEGG) were not linked to any of the four metabolite databases (Supplementary Table S3). We, therefore, decided to also match on the metabolite name, which has the disadvantage that there will often be a large number of, possibly ambiguous, synonyms and spelling variants (Hettne et al, 2009). To restrict the possibility of false positive matches caused by matching on the metabolite name, we also required the chemical formula to match.

Even using this strategy, a large number of metabolites without identifier remains that could not be matched on name, see Supplementary File S5 for an overview. This overview shows that the majority of these unique metabolites are part of specific metabolic processes, illustrating the different choices made by each of the databases. In EHMN, for example, 60% of the unique metabolites without an identifier are part of lipid metabolism. In BiGG, 55% is found in glycan biosynthesis and metabolism, e.g., precursors or degradation products of long unbranched polysaccharides such as chondroitin sulfate, heparin sulfate, or keratan sulfate. Furthermore, in Reactome, 64% of the unique reactants without a metabolite identifier are proteins and complexes directly encoded by the genome and have a UniProt ID instead. In HumanCyc, finally, 55% of the metabolites are part of reactions that have not been assigned to any pathway and which are possibly peripheral to metabolism proper. Restricting the comparison to the core metabolic processes and removing macromolecular reactions from Reactome and HumanCyc, reduced the impact of the mismatches because of missing metabolite identifiers. For BiGG, HumanCyc, and Reactome the percentage of metabolites without an identifier decreased from 34%, 25%, and 31% to 21%, 13% and 10%, respectively (Supplementary Table S4). Since lipid metabolism is part of the core comparison, EHMN is still greatly affected by the lack of identifiers for lipids and misses an identifier for 38% of its metabolites. There is a large variety of lipids, which may explain the lack of identifiers for this type of metabolite.

On gene level the only identifier type shared by all five databases is the Entrez Gene ID (Supplementary Table S2). In total 356 genes do not have an Entrez Gene ID (after removing obsolete IDs), most of which are contained in HumanCyc (327 genes). On the level of EC numbers the five databases combined contain 83 EC numbers that are not fully specified. Moreover, the catalysts of 41%, 27%, and 17% of the reactions in Reactome, BiGG and EHMN, respectively, are not linked to an EC number. In both cases this may be because IUBMB has not yet assigned an EC number to the enzyme. For more than half of these reactions not linked to an EC number in BiGG, the catalyst facilitates a transport reaction. In EHMN and Reactome this is even 73% and 70% of the cases, respectively. For such reactions the Transport Classification (TC) system (Saier et al, 2009) of the IUBMB might provide a more appropriate descriptor. A number of EC numbers are missing because the database curators did not enter them into the database.

Miscellaneous Next to the reasons outlined above, we also identified a number of more subtle and less frequent explanations for the limited overlap. An example at the metabolite level is that BiGG uses D-glucose in its reactions instead of specifying whether it is α-D-glucose or β-D-glucose, while Reactome only uses α-D-glucose. The other databases use all three variations. On the other hand, BiGG does not use generic metabolites like ‘an alcohol’ (KEGG Compound ID: C00069) or ‘an L-amino acid’ (KEGG Compound ID: C00151) in contrast to HumanCyc, KEGG and EHMN. Furthermore, BiGG and HumanCyc explicitly state that their reactions are charge and mass balanced. The chemical formula and charge of the metabolites were based on their ionization state at a pH level of 7.2 and 7.3, respectively, while the other three databases use the neutral form of the metabolites. This partly explains the observed increase in consensus when we did not take into account H+. By using the KEGG Compound ID as the prime identifier for matching metabolites, we reduce the impact of a difference in protonation state as in general the distinction between the base and the acid form of a metabolite is not made in KEGG Compound in contrast to, e.g., ChEBI. We also compared the databases while allowing for an inexact match of the chemical formula with respect to the number of H atoms, to account for the variation in protonation state between the databases. This hardly affected our results (data not shown).

Discussion

Our comparison revealed that there is only a small core of the metabolic network on which all five databases agree. Especially on reaction level the overlap is surprisingly low: only 199 reactions could be found in all five databases. Our analysis shows that the small overlap between the databases is partly explained by conceptual differences like a difference in coverage of the metabolic network. One clear example is the large set of transport reactions and reactions in lipid metabolism in EHMN, which account for 23% of the unique reactions.

Our decision to compare five pathway databases, also limits the consensus: the more databases one includes in the comparison, the lower the consensus is likely to be. We indeed observe a substantial increase in overlap when we compare pairs of databases (Supplementary Figure S1) instead of five. However, also in this case with a median consensus of around 15%, the agreement on reaction level is still relatively low. Two main factors can strongly bias the size of the consensus detected. Firstly, the consensus is constrained by differences in database size. This partly explains, for example, the consensus of only 11% when comparing a large database such as EHMN and a small database such as Reactome. Secondly, the consensus is positively influenced by the fact that databases are not constructed independently from each other. For example, EHMN used KEGG as a starting point for its reconstruction (Hao et al, 2010), which explains the higher consensus of 28%. However, even if we would restrict our comparison to three pathway databases, BiGG, EHMN, and KEGG, that are most interdependent (Duarte et al, 2007; Hao et al, 2010), the consensus on reaction level is still only 14%, when not considering the transport reactions from BiGG and EHMN.

Despite the observed lack of overlap, the GO enrichment analysis of the consensus and majority genes (Supplementary File S1) does provide us with evidence that there is a core of metabolic processes the databases agree on. Examples of such processes are nucleotide metabolism and carbohydrate metabolism, which is also reflected on reaction level (Supplementary File S2). The comparison of the core metabolic processes indeed showed a considerable increase of the majority score at the gene level and to a lesser extent at reaction level. However, the consensus on reaction level remains low even for this more limited set.

Especially on reaction level the comparison is clouded by several conceptual differences and technical difficulties. The main technical challenge is to establish the identity of metabolites between databases. This was also observed to be one of the main problems for the experts involved in the construction of the consensus of two in silico metabolic network reconstructions of S. cerevisiae (Herrgård et al, 2008). Matching metabolites by name is not an ideal solution, as many, possibly ambiguous, synonyms and spelling variants exist for the same metabolite (Hettne et al, 2009). Matching metabolites using metabolite identifiers is, in our comparison, restricted by the relatively large number of metabolites that had not been linked to any of the four metabolite databases (KEGG, ChEBI, PubChem Compound, and CAS). One reason for the lack of metabolite identifiers is that the metabolite databases themselves are also work in progress. Metabolites that exist in a large number of structural variations such as, for example, lipids may not have been described yet in full detail in the metabolite databases. This was indeed observed for EHMN, where a large set of the unique metabolites without an identifier is involved in lipid metabolism. On the other hand, part of the metabolites of the pathway databases may not be described in any of the four metabolite databases we considered, because they, for example, do not meet the criteria to be included, such as proteins encoded by the genome found in Reactome. Furthermore, all pathway databases have a preference for one of the metabolite databases for which they curate the link. For example, BiGG mainly derived its identifiers from KEGG Compound. Similarly, for Reactome only ChEBI IDs have been manually curated. Due to this, metabolites may not link out to a metabolite database if the metabolite does not exist in the preferred reference database.

It will require a considerable manual effort to correctly assign metabolite identifiers to each metabolite and establish the correspondence of metabolites between databases. An initiative that could aid in solving some of these problems is ChemSpider (http://www.chemspider.com/), which integrates a wide variety of metabolite databases. The use of database-independent structural representations such as SMILES and InChI strings has also been recommended (Herrgård et al, 2008). In our case, three databases (EHMN, HumanCyc and KEGG) provide InChI strings for 77%, 58%, and 75% of their metabolites, respectively. The consensus is, however, only 66 of the 3475 InChI strings in total. The low consensus when matching on InChI string can partly be explained by a difference in the amount of detail with which the structure of metabolites has been described and a difference in protonation state.

The question remains to what extent the reaction consensus would increase, even if all metabolites were properly described. As illustrated by our comparison of the TCA cycle, conceptual differences also play an important role in explaining the lack of overlap. A similar conclusion can be drawn from a comparison of the two yeast metabolic networks that were used in building a consensus network (Herrgård et al, 2008). Even after the identity of the metabolites between the two reconstructions had been established manually, the consensus on reaction level was still only 36%. In a recent comparison of two metabolic networks of A. thaliana (Radrich et al, 2010) only 33% of the total number of reactions could be matched unambiguously. Furthermore, it is important to keep in mind that even if we would find unambiguous descriptions for each metabolite this does not guarantee a match. Firstly, the databases, or more specifically their metabolites, are partly complementary. EHMN, for example, explicitly focused on expanding lipid metabolism in comparison to KEGG (Hao et al, 2010). Secondly, many of the reactants without a metabolite identifier are part of reactions that are peripheral to metabolism proper, such as precursor and degradation products of BiGG and proteins in Reactome, and are therefore unlikely to have a match in all five databases.

An example of a conceptual difference is the variation in the number of intermediate steps used to describe a specific metabolic conversion. This could be because of different database-specific criteria for when the intermediate steps of a conversion should be described or not. A second example is the use of generic metabolites (e.g., alcohol) in reactions, as HumanCyc does. This may be done to model the broad substrate specificity of the enzyme or to indicate that the exact substrate specificity is unknown. Other databases, for example BiGG, focus more on indicating the specific metabolite, e.g., ethanol instead of alcohol. This difference may be amplified by the number of specific instances given. Also more subtle conceptual differences play a role, like a different protonation state (neutral versus charged), the detail in which the structure of a metabolite is described (e.g., D-Glucose versus α-D-Glucose) or whether the metabolite is described as enzyme bound or not (e.g., lipoamide-E versus lipoamide). Finally, our GO enrichment analysis showed that the scope of the metabolic networks described by the five databases differs. The set of genes that are only found in at most two databases is, compared to the genes found in the majority of the databases, enriched for terms related to protein metabolic processes, like protein phosphorylation, proteolysis, and RNA metabolism (Supplementary File S1). EHMN and HumanCyc, for example, both include a generic reaction describing the phosphorylation of a protein, which is connected to a large set of 250 and 304 kinases, respectively. Differences in the metabolic processes covered by the databases also explain to some extent the differences in size of the databases.

The differences mentioned above not only make it difficult to determine the consensus between databases, but also to distinguish between conflicting and complementary content. This is especially so if one also keeps in mind that all five databases are work in progress. For example, a difference in the coverage of the metabolic network could be caused by a fundamental disagreement on whether certain processes are part of the human metabolic network. It could also be that they just did not include these processes yet and then this could be seen as complementary information. Similarly, for 45% of the consensus reactions the databases do not fully agree on the genes coding for the catalyst (Supplementary File S2), which may point to either complementary or conflicting information. Another example is the difference in number of steps, which can in most cases be explained by a difference in the level of detail of the description. It could, however, also reflect disagreement on the number of intermediate steps required for a particular conversion.

The low level of consensus provides compelling evidence that additional curation and the integration of the content of the five pathway databases in a single human metabolic network is desired and would improve the description of human metabolism. However, given the results of our comparison and all difficulties outlined above, what would be the way forward towards an integrated network? The consensus consists of only 199 reactions, even less when also considering the connected genes and EC numbers, and is therefore not of direct practical use. Another option is to take the union of the reactions contained in the individual databases. This is the approach taken by, for example, ConsensusPathDB (Kamburov et al, 2009) for integrating functional interactions, including metabolic reactions. Besides being restricted by the same conceptual and technical issues that we described, combining the content of the databases is not the definite answer. It will not solve disagreements between databases regarding, for example, the gene product catalyzing a reaction or whether a reaction can take place in human or not. Conflicting information would end up in the union and ultimately requires manual curation or at least annotation of such conflicts. Reasons for disagreement are manifold and database-dependent. Some databases, for example HumanCyc, prefer to err on the side of false positives to bring potential pathways to the attention of the community (Romero et al, 2004). In BiGG, some reactions without evidence were included because they improved the performance of the in silico model. A different interpretation of the literature used in the construction of the network also causes disagreements (Herrgård et al, 2008). Moreover, some parts of the metabolic network are still subject of debate and the current literature reflects these different opinions. The union will for a large part consist of data that is only supported by one of the databases.

A third option is to only include reactions on which the majority of the databases agree. This gives a higher level of confidence and in our case also a considerably larger set of 1004 reactions instead of the 199 reactions in the consensus. However, caution is warranted as for instance the databases are not strictly independent as illustrated by our pairwise comparison of KEGG and EHMN, for example. Erroneous data may, therefore, be propagated in multiple databases. Our case study of the TCA cycle also illustrates the problems of the majority vote strategy (Supplementary Figure S3). If we retain all entities the majority agrees on, 40% of the reactions are included. However, the genes MDH1 and ACO1 encoding for cytosolic proteins are also part of the majority as is the conversion of citrate to oxaloacetate (EC 2.3.3.8), which is also cytosolic. Moreover, there is no majority for any of the EC numbers proposed by one of the databases for the conversion of 2-oxoglutarate to succinyl-CoA. Also conceptual differences can be observed as, for example, we are left with two routes for both the conversion of citrate to isocitrate. Furthermore, reactions that are not part of the majority, but only found in one or two databases are not necessarily incorrect, but could be valuable complementary information. For example, KEGG gives a more detailed description of the conversion of 2-oxoglutarate to succinyl-CoA.
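The difference between the consensus and the majority vote can be made concrete with a small sketch. It assumes that reactions have already been matched across databases and are represented by shared keys. Taking 'majority' to mean more than half of the databases (three out of five) is an assumption, and the function name and toy data are hypothetical.

```python
from collections import Counter

def consensus_and_majority(db_to_reactions):
    """Return (consensus, majority) sets of reaction keys.

    db_to_reactions maps a database name to the set of matched reaction keys
    it contains. Consensus = present in every database; majority = present
    in more than half of the databases (assumed definition).
    """
    n = len(db_to_reactions)
    counts = Counter()
    for keys in db_to_reactions.values():
        counts.update(set(keys))  # count each reaction once per database
    consensus = {k for k, c in counts.items() if c == n}
    majority = {k for k, c in counts.items() if c >= n // 2 + 1}
    return consensus, majority

# Toy example with five hypothetical databases A-E
toy = {
    "A": {"r1", "r2"},
    "B": {"r1", "r2"},
    "C": {"r1", "r2"},
    "D": {"r1"},
    "E": {"r1", "r3"},
}
consensus, majority = consensus_and_majority(toy)
```

Note that such a vote treats the databases as independent sources, which, as discussed above, they are not.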

If the conceptual differences and technical issues we identified were resolved, the overlap would increase. It will, however, remain very difficult to (automatically) discern useful complementary information from conflicting information. In this respect, a more widespread use of evidence codes indicating the type of evidence supporting the data would make it possible to distinguish between high and low confidence data. However, extensive annotation of evidence is currently only provided by BiGG and HumanCyc.

Significant manual intervention will be needed to reach the ultimate goal of a single human metabolic network. A promising model is a community-based approach, such as WikiPathways (Pico et al, 2008) or an annotation jamboree as advocated by Mo and Palsson (2009). A wiki-based approach allows the community to curate existing pathways and add new ones. Annotation jamborees are organized around domain experts and facilitate the reconciliation and refinement of metabolic pathway databases. They have already been carried out successfully for various organisms (Herrgård et al, 2008; Thiele and Palsson, 2010a; Thiele et al, 2011). The results of our comparison could be used as a stepping stone for such an effort as it is crucial to understand the underlying causes of the differences to be able to resolve them. For integration purposes, we also provide an automatically derived overview of all reactions in which matching reactions are aligned, along with their associated genes, EC number and pathways (Supplementary File S6). The overviews of the comparison on gene, EC number and reaction level can also be found online (http://www.molgenis.org/humanpathwaydb). Here, results of the comparison can be queried, sorted, and exported in a number of ways. The web application was generated using the MOLGENIS toolkit (Swertz et al, 2010) and next to the graphical user interface also provides several scriptable interfaces, e.g., an R interface. Using, for example, the majority reactions as a starting point for curation, these overviews could aid experts on the human metabolic network to consolidate the differences between the networks and arrive at a unified model of human metabolism.

Conclusions

An accurate and complete reconstruction of the human metabolic network is of utmost importance for its successful application in the life sciences. Our results will help curators to even further improve the metabolic network as described in the individual databases. Furthermore, as our analysis shows, each of the five pathway databases discussed in this paper provides us with a valuable piece of the puzzle. Combining the expert knowledge put into these five reconstructions and the evidence provided will improve our understanding of the human metabolic network. However, we explicitly identified many issues that prohibit the (automatic) integration of the metabolic networks. Not only the unambiguous identification of metabolites is required but the conceptual differences need to be addressed as well. Considerable manual intervention and a broad community effort are needed to reach the ultimate goal of a consolidated and biologically accurate model of human metabolism. Community efforts, such as BioPAX (Demir et al, 2010) and SBGN (Le Novère et al, 2009), which standardize the representation of the pathway databases, could also aid the integration of the databases. Our detailed comparison of five metabolic networks and the identification of the conceptual differences between the databases provide a stepping stone for their integration. The construction of such an integrated network will, however, require considerable time and effort. It would therefore be advisable that users keep in mind, for now, the large differences found and carefully weigh their decision when choosing a particular database or if possible apply their analyses to multiple networks to ensure the robustness of the results.

Methods

Data retrieval For each of the five pathway databases, we retrieved all metabolic reactions with their corresponding gene(s), EC numbers, and pathway(s). All files mentioned below were downloaded in May, 2011.

For the metabolites we retrieved the following, most frequently provided, types of identifiers, if available in the specific pathway database (Supplementary Table S2): KEGG Compound (http://www.genome.jp/kegg/compound/), KEGG Glycan (http://www.genome.jp/kegg/glycan/), ChEBI (http://www.ebi.ac.uk/chebi/), PubChem (http://pubchem.ncbi.nlm.nih.gov/), and CAS Registry Numbers (proprietary, assigned by the CAS registry, http://www.cas.org/). There are two types of PubChem IDs, Substance and Compound. Substance IDs are specific for the depositor of the metabolite. Compound IDs unite the different Substance IDs for the same metabolite. To convert the Substance IDs to Compound IDs we used the CID-SID file (ftp://ftp.ncbi.nih.gov/pubchem/Compound/Extras/CID-SID.gz).

For genes we retrieved the Entrez Gene ID, which is the only type of gene identifier the databases have in common (Supplementary Table S2).

Syntactically incorrect and out-of-date identifiers We manually corrected seven syntactically incorrect KEGG Compound IDs and 50 KEGG Glycan IDs in BiGG. We did the same for seven CAS IDs in BiGG and one in HumanCyc. For the KEGG Compound, KEGG Glycan, ChEBI and PubChem Compound IDs we checked if the IDs were up-to-date (Supplementary Table S1). For the KEGG IDs we used the 'compound', 'glycan' and 'merged_compound.lst' file. For ChEBI we used its SQL database (ftp://ftp.ebi.ac.uk/pub/databases/chebi/generic_dumps/) and for PubChem the Batch Entrez from the NCBI website (http://www.ncbi.nlm.nih.gov/sites/batchentrez). We also checked and, when necessary, updated Entrez Gene IDs (Supplementary Table S1) using the 'gene_info' and 'gene_history' files from the FTP site (ftp://ftp.ncbi.nih.gov/gene/DATA/) of Entrez Gene. Finally, also the EC numbers were updated using the ‘enzyme.dat’ file downloaded from Expasy (ftp://ftp.expasy.org/databases/enzyme/). If an out-of-date metabolite ID, Entrez Gene ID or EC number had been transferred, we replaced it with the new one, and otherwise the ID or EC number was not taken into account in the comparison.
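The update rule described above (replace a transferred identifier by its successor, otherwise leave the obsolete identifier out of the comparison) can be sketched as follows. This is a minimal illustration and not the code used for the comparison. The function name, the `transferred` mapping and the example identifiers are hypothetical, and the sketch assumes that a replacement identifier is itself up-to-date.

```python
def update_ids(ids, current, transferred):
    """Return the up-to-date identifiers for one entity.

    ids         -- identifiers as found in a pathway database
    current     -- set of identifiers that are still valid
    transferred -- dict mapping an obsolete identifier to its replacement
                   (obsolete identifiers without a replacement are absent)
    """
    updated = []
    for identifier in ids:
        if identifier not in current:
            # Obsolete: use the replacement if one exists, else None
            identifier = transferred.get(identifier)
        if identifier is not None and identifier not in updated:
            updated.append(identifier)
    return updated

# Made-up identifiers: X2 was transferred to X9, X3 was withdrawn
result = update_ids(["X1", "X2", "X3"],
                    current={"X1", "X9"},
                    transferred={"X2": "X9"})
```

In practice the `current` and `transferred` lookups would be built from the files listed above, whose formats differ per resource.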

BiGG We downloaded the flat files containing reactions and metabolites from http://bigg.ucsd.edu/ (Schellenberger et al, 2010). We removed the 406 exchange reactions, indicated by the prefix ‘EX-’, added to BiGG for simulation purposes. Gene information was extracted from the SBML file. We ignored the suffix that was added to the Entrez Gene IDs to discern transcript variants. We removed 38 reaction duplicates that only differed in their tissue annotation. We raised the total percentage of metabolites with an identifier from 53% to 66% by parsing the HTML files of the metabolite pages available from the BiGG website.

EHMN We downloaded from http://www.ehmn.bioinformatics.ed.ac.uk/ the EHMN Excel file containing sheets in which the reactions are linked to: (i) pathway(s), (ii) genes, represented by an Entrez Gene ID, (iii) EC number(s). A separate file was provided to us by the curators of this database containing information about the metabolites including the five types of identifiers mentioned above.

HumanCyc We used Pathway Tools (Karp et al, 2010) to export the content of HumanCyc into flat files. These were combined using the internal Pathway Tools identifiers. We excluded two signaling pathways, i.e., the ‘BMP Signalling Pathway’ and the ‘MAP kinase cascade’. HumanCyc uses classes as substrates in some reactions (e.g., an amino acid, an alcohol) as a way of catering for enzymes with broad substrate specificity or enzymes for which the exact substrate specificity is unknown. For the metabolite comparison we retrieved the instances provided for each metabolite class. There are 563 metabolite classes that do not have instances, of which 192 have a metabolite identifier, e.g., a KEGG Compound ID. To retrieve the identifiers for these metabolite classes we used the Lisp API as they were not available in the exported flat files. Finally, the Entrez Gene ID is missing for 605 genes. If provided, the Ensembl Gene ID was mapped to an Entrez Gene ID, if available, via Ensembl BioMart (181 genes). If both gene identifiers were absent the UniProt ID was mapped to an Entrez Gene ID via the UniProt ID Mapping service (101 genes). After mapping an Entrez Gene ID was still missing for 323 genes and these were therefore not included in the comparison. For 82% of this set all three IDs mentioned are missing.
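The fallback order used for genes without an Entrez Gene ID can be sketched as below. The two lookup tables stand in for the results of Ensembl BioMart and the UniProt ID Mapping service. Their contents, the record layout and the function name are hypothetical, and the sketch follows a literal reading of the text in which the UniProt ID is only consulted when no Ensembl Gene ID is provided.

```python
def map_to_entrez(gene, ensembl_to_entrez, uniprot_to_entrez):
    """Return an Entrez Gene ID for a gene record, or None if unmappable.

    gene is a dict with optional keys 'entrez', 'ensembl' and 'uniprot'.
    """
    if gene.get("entrez"):
        return gene["entrez"]
    if gene.get("ensembl"):
        # Ensembl Gene ID provided: map it if a mapping is available
        return ensembl_to_entrez.get(gene["ensembl"])
    if gene.get("uniprot"):
        # Neither Entrez nor Ensembl ID provided: fall back to UniProt
        return uniprot_to_entrez.get(gene["uniprot"])
    return None  # all three identifiers are missing

# Made-up identifiers for illustration only
ens_map = {"ENS_A": "111"}
uni_map = {"UNI_B": "222"}
genes = [
    {"entrez": "999"},
    {"ensembl": "ENS_A"},
    {"uniprot": "UNI_B"},
    {},
]
mapped = [map_to_entrez(g, ens_map, uni_map) for g in genes]
```

Genes for which the function returns `None` correspond to those left out of the comparison.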

KEGG We selected all human pathways from the metabolism category. For each pathway, we downloaded from the KEGG FTP site (ftp://ftp.genome.jp/pub/kegg/) the human-specific KGML file, from which we retrieved the genes, and the KGML file containing the reference pathway linked to the EC numbers. Entries in both files are numbered, which we used to link genes to their associated EC numbers. In both files, the catalyzed reaction can be found. A single entry can contain more than one reaction, gene, and/or EC number. In that case, we assigned all genes and EC numbers contained in the entry to each reaction. Note that we cannot retrieve spontaneous reactions and reactions for which the human gene encoding the catalyst is unknown. Since KGML files only contain the main metabolites of a reaction, we retrieved the complete reaction from the flat ‘reaction’ file available on the FTP site. We used the ‘H.sapiens.ent’ file to get the Ensembl Gene IDs, and the ‘compound’ and ‘glycan’ files to extract ChEBI, PubChem Substance, and CAS IDs for metabolites.

Reactome We used the dump file of the MySQL database to retrieve data from Reactome. From the top-level pathways on the front page of the Reactome website, we selected the ten pathways focused on (normal) metabolic processes, excluding, e.g., signaling and disease-related pathways (see Supplementary Text S1 for a complete list). We retrieved all reactions assigned to the selected metabolic pathways. EC numbers were obtained from the table that links catalyst activity to a GO term. Reactome contains reactions operating on sets of metabolites. We retrieved the instances of these sets from the MySQL database dump. Following the description from the Reactome Curator Guide (http://wiki.reactome.org/index.php/Reactome_Curator_Guide) we instantiated the reactions by taking the first member of the set at the left hand side and the first member of the set at the right hand side, and so on. In five cases this was not possible and we, therefore, did not instantiate the sets in these five reactions. Two examples are shown in Supplementary Text S2. Reactome’s black box events represent reactions for which the molecular details are not specified or unknown. We excluded a black box event if the input or output of the reaction was unknown.
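The instantiation of reactions that operate on sets can be illustrated with a short sketch: the i-th member of each set on the left hand side is paired with the i-th member of each set on the right hand side. The data layout and function name are hypothetical. Treating sets of unequal size as not instantiable is an assumption, since the text does not state why instantiation failed in the five cases mentioned.

```python
def instantiate(substrates, products):
    """Expand a reaction on metabolite sets into specific reactions.

    Each side is a list whose entries are either a metabolite name (str)
    or a list holding the ordered members of a set. Returns a list of
    (substrates, products) pairs, or None if the sets cannot be paired.
    """
    sizes = {len(x) for x in substrates + products if isinstance(x, list)}
    if not sizes:
        return [(list(substrates), list(products))]  # nothing to expand
    if len(sizes) != 1:
        return None  # sets of different size: no positional pairing

    def pick(side, i):
        # Replace every set on this side by its i-th member
        return [x[i] if isinstance(x, list) else x for x in side]

    n = sizes.pop()
    return [(pick(substrates, i), pick(products, i)) for i in range(n)]

# Hypothetical reaction: {a1, a2} + ATP -> {b1, b2} + ADP
expanded = instantiate([["a1", "a2"], "ATP"], [["b1", "b2"], "ADP"])
```

Here `expanded` holds two specific reactions, one for `a1`/`b1` and one for `a2`/`b2`.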

TCA cycle Two EC numbers only mentioned in the comment field of the SBML file of BiGG were also taken into account. We left out the transport reactions that EHMN included in this pathway as KEGG does not contain any transport reactions in its metabolic network.

Pathway database comparison We compared five metabolic pathway databases at different levels: genes, EC numbers, metabolites, reactions, and relations between these components. Below, we describe in detail how we compared each of these components.

Genes For the primary comparison at gene level we used Entrez Gene IDs, since it is the only gene identifier common to all five databases. BiGG, HumanCyc, and Reactome provide syntactic mechanisms for defining protein complexes, while EHMN and KEGG do not. Therefore, we did not make a distinction in the comparison between genes encoding a component of a catalyst or genes that encode a single protein catalyst.

EC numbers A fully specified EC number consists of four numbers separated by a period (http://www.chem.qmul.ac.uk/iubmb/enzyme/). The first three numbers indicate increasingly narrower classes and the fourth number is the serial number of the enzyme in its subclass. The databases combined contain 83 partial EC numbers, such as 1.1.1.-, which were excluded from the comparison, since they are semantically ambiguous (Green and Karp, 2005).
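A simple filter for partial EC numbers could look like the sketch below. The function name is hypothetical, and the pattern only accepts four purely numeric fields, which matches the description above but would also reject any other non-standard notation a database might use.

```python
import re

# Four numeric fields separated by periods, e.g. 2.3.3.8
FULL_EC = re.compile(r"\d+\.\d+\.\d+\.\d+")

def is_fully_specified(ec_number):
    """True if all four fields of the EC number are numeric."""
    return FULL_EC.fullmatch(ec_number.strip()) is not None

ec_numbers = ["2.3.3.8", "1.1.1.-", "1.-.-.-", "1.1.1.1"]
kept = [ec for ec in ec_numbers if is_fully_specified(ec)]
```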

Metabolites Establishing identity between metabolites is a challenging task. For the comparison we, in general, used the KEGG Compound ID, which is in each database the most frequently provided metabolite identifier. However, KEGG Compound IDs are not available for each metabolite (Supplementary Table S3). If the KEGG Compound ID was not provided, metabolites were matched on any of the other metabolite identifiers (KEGG Glycan, ChEBI, PubChem Compound or CAS) or metabolite name, in the latter case we also required an exact match of the chemical formula. Matching was case-insensitive and spaces and punctuation in the metabolite names were ignored. Furthermore, we computed the transitive closure of the metabolite matches. This means that if for a particular metabolite there was a match between database A and B, e.g., on CAS ID, and between database B and C on, e.g., ChEBI ID then the metabolite was considered to match between database A and C as well. Instances of metabolite classes in HumanCyc and members of sets in Reactome were included in the comparison at metabolite level. To make the comparison as accurate as possible we did not match more generic metabolites, like alcohol or glucose, with more specific metabolites, like ethanol or α-D-glucose.
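The transitive closure of the metabolite matches can be computed with a union-find (disjoint-set) structure, as in the sketch below. This is an illustration under simplifying assumptions and not the code used for the comparison: metabolites are linked whenever they share any identifier or the same normalized name plus formula, and the priority given to the KEGG Compound ID is not reproduced. The record layout, function names and identifiers are hypothetical.

```python
import string
from collections import defaultdict

def normalize_name(name):
    """Lower-case a name and drop spaces and (ASCII) punctuation."""
    drop = set(string.punctuation + string.whitespace)
    return "".join(ch for ch in name.lower() if ch not in drop)

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def match_metabolites(records):
    """Group records that match directly or via intermediate records."""
    uf = UnionFind()
    first_seen = {}  # matching key -> first record that carried it
    for rec in records:
        node = (rec["db"], rec["id"])
        uf.find(node)  # register the record even if it matches nothing
        keys = list(rec.get("xrefs", {}).items())
        if rec.get("name") and rec.get("formula"):
            keys.append(("name+formula", normalize_name(rec["name"]), rec["formula"]))
        for key in keys:
            if key in first_seen:
                uf.union(node, first_seen[key])
            else:
                first_seen[key] = node
    groups = defaultdict(set)
    for rec in records:
        node = (rec["db"], rec["id"])
        groups[uf.find(node)].add(node)
    return list(groups.values())

# A and B share a CAS ID, B and C share a ChEBI ID, so A matches C as well
records = [
    {"db": "A", "id": "m1", "xrefs": {"cas": "cas-1"}},
    {"db": "B", "id": "m2", "xrefs": {"cas": "cas-1", "chebi": "chebi-1"}},
    {"db": "C", "id": "m3", "xrefs": {"chebi": "chebi-1"}},
    {"db": "A", "id": "m4", "name": "D-Glucose", "formula": "C6H12O6"},
    {"db": "B", "id": "m5", "name": "d glucose", "formula": "C6H12O6"},
]
groups = match_metabolites(records)
```

A drawback of taking the transitive closure is that a single incorrect identifier can merge two otherwise unrelated groups, which is one reason why correct identifier assignment matters.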

Reactions We considered reactions to be the same if all substrates and products matched (see above). The direction of a reaction was not taken into account in the comparison. The same reaction written in two directions was counted as one reaction. Compartment(s) were not considered as well. We again took the transitive closure for the reaction matches (see above).
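A direction-independent reaction key makes this matching rule explicit. In the sketch below each metabolite is assumed to be represented by the identifier of its match group. Stoichiometric coefficients and compartments are ignored, and the optional `ignore` argument (for example to leave out H+) as well as the function name are hypothetical.

```python
def reaction_key(substrates, products, ignore=frozenset()):
    """Return a key that is equal for matching reactions.

    The key is an unordered pair of unordered metabolite sets, so the same
    reaction written in either direction yields the same key.
    """
    left = frozenset(m for m in substrates if m not in ignore)
    right = frozenset(m for m in products if m not in ignore)
    return frozenset((left, right))

forward = reaction_key(["A", "B"], ["C"])
reverse = reaction_key(["C"], ["B", "A"])
with_proton = reaction_key(["A", "B"], ["C", "H+"])
without_proton = reaction_key(["A", "B"], ["C", "H+"], ignore={"H+"})
```

Collecting these keys per database gives the sets on which the overlap between databases can be counted.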

TCA cycle In our detailed comparison of the TCA cycle, the following three pairs of metabolites were considered to match despite not having the same KEGG Compound ID: s-succinyldihydrolipoamide-E and s-succinyldihydrolipoamide; lipoamide-E and lipoamide; dihydrolipoamide-E and dihydrolipoamide. The only difference within each pair is that one is the enzyme-bound form of the metabolite, e.g., lipoamide-E, and the other is indicated as being unbound, e.g., lipoamide. The reactions were compared while not taking into account H+. In contrast to the comparison of the entire networks we removed neither the obsolete Entrez Gene IDs nor the gene for which the Entrez Gene ID was not available at all.


Gene ontology analysis Differences in GO biological process annotation between two lists of genes were assessed with the FatiGO functional enrichment module of the Babelomics suite (version 4.2, http://babelomics.bioinfo.cipf.es/). FatiGO uses Fisher's exact test for 2×2 contingency tables to check for significant over-representation of GO biological process terms (levels 3-9) in one of the sets with respect to the other one. We used the default settings except that we set the filter for the minimum and maximum number of annotated IDs per term to 1 and 10000, respectively. GO terms were considered to be significantly over-represented if the p-values, adjusted for multiple testing by using Benjamini and Hochberg’s method, were <0.01.
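To make the statistical procedure concrete, the sketch below implements Fisher's exact test for a 2×2 table and the Benjamini-Hochberg adjustment in plain Python. It is not the FatiGO implementation; the two-sided form of the test and the function names are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as probable as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(p for p in map(prob, range(low, high + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

In such a setup, a GO term would be called significantly over-represented when its adjusted p-value is below 0.01, the threshold stated above.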

Grouping pathways into categories For the comparison of the core metabolic processes, we manually assigned the pathways of each database to one of the following nine categories using the division of KEGG as a guideline: amino acid metabolism, carbohydrate metabolism, energy metabolism, glycan biosynthesis and metabolism, lipid metabolism, metabolism of cofactors and vitamins, metabolism of secondary metabolites, nucleotide metabolism, and xenobiotics biodegradation and metabolism (Supplementary File S4). Pathways that did not fit in any of these nine KEGG categories were assigned to the category ‘Miscellaneous’. Transport reactions of BiGG, EHMN, HumanCyc and Reactome were assigned to a separate category. A reaction was considered a transport reaction if not all metabolites were localized in the same compartment. Most transport reactions in BiGG and Reactome were originally already assigned to separate transport pathways by the databases themselves. Reactions that were not assigned to any pathway could not be assigned to any category. Note that reactions, EC numbers and genes may be found in multiple pathways and consequently may be part of multiple categories.
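The assignment of reactions to categories can be illustrated with a short Python sketch. The data layout is hypothetical, and the choice to place a transport reaction only in the transport category is an assumption of this sketch; the manual pathway-to-category mapping itself is the one described in Supplementary File S4.

```python
TRANSPORT = "Transport"  # label chosen for this sketch


def is_transport_reaction(metabolite_compartments):
    """True if the metabolites of a reaction span more than one compartment."""
    return len(set(metabolite_compartments)) > 1


def categories_for_reaction(metabolite_compartments, pathways, pathway_category):
    """Return the set of categories a reaction belongs to.

    pathways: names of the pathways the reaction is assigned to.
    pathway_category: manual mapping from pathway name to category.
    A reaction without any pathway ends up with an empty set.
    """
    if is_transport_reaction(metabolite_compartments):
        return {TRANSPORT}
    return {pathway_category[p] for p in pathways if p in pathway_category}
```

Because a reaction may occur in several pathways, the function returns a set, which reflects that a reaction can be part of multiple categories.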

Acknowledgements We are indebted to Dr. Hongwu Ma at the University of Edinburgh (EHMN), and user support at BioCyc and KEGG for providing additional information. We would like to thank Mark Stobbe for optimizing the code, Joris Scharp for helping with the Lisp API of HumanCyc, and Dr. Morris Swertz and Trebor Rengaw for their help with MOLGENIS and hosting the web application. We also thank the reviewers for their helpful comments and suggestions for improving the presentation and comprehensibility of the paper. This research was carried out within the BioRange programme (project SP1.2.4) of The Netherlands Bioinformatics Centre (NBIC; http://www.nbic.nl), supported by a BSIK grant through The Netherlands Genomics Initiative (NGI) and within the research programme of the Netherlands Consortium for Systems Biology (NCSB), which is part of the Netherlands Genomics Initiative / Netherlands Organization for Scientific Research. Sander M. Houten was supported by the Netherlands Organization for Scientific Research (VIDI-grant No. 016.086.336).

Supplementary material

Supplementary Figure S1 – Pairwise comparison of the five databases on gene, EC, metabolite, and reaction level Consensus between pairs of databases is calculated as in the main text:

$(|C_{DB1} \cap C_{DB2}| / |C_{DB1} \cup C_{DB2}|) \times 100\%$, where C is the set of entities under consideration. Databases are compared on Entrez Gene IDs, EC numbers, metabolites, and reactions, which were not required to match on e-, H+ and/or H2O. See next page.
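The pairwise consensus defined for Supplementary Figure S1 is the size of the intersection divided by the size of the union, expressed as a percentage. A minimal Python sketch, with hypothetical function names and toy input, is:

```python
from itertools import combinations


def consensus_percentage(entities_1, entities_2):
    """Percentage of the union of two entity sets that both sets share."""
    union = entities_1 | entities_2
    if not union:
        return 0.0  # convention of this sketch for two empty sets
    return 100.0 * len(entities_1 & entities_2) / len(union)


def pairwise_consensus(entities_by_database):
    """Consensus percentage for every pair of databases.

    entities_by_database: dict mapping a database name to a set of entities
    (for example Entrez Gene IDs or EC numbers).
    """
    return {
        (db1, db2): consensus_percentage(entities_by_database[db1],
                                         entities_by_database[db2])
        for db1, db2 in combinations(sorted(entities_by_database), 2)
    }
```

For two sets {1, 2, 3} and {2, 3, 4} the intersection has two members and the union four, so the consensus is 50%.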

Supplementary Figure S2 – TCA cycle as represented in each of the five metabolic pathway databases Adapted version of Figure 2 in the main text for each of the metabolic pathway databases separately. Reactions occurring in the TCA cycle for the selected database are highlighted. Metabolites are represented by rectangles, genes by rounded rectangles, and EC numbers by parallelograms. Color indicates how many of the five databases include a specific entity. Color of an arrow indicates the number of databases that agree upon an entire reaction, i.e., all its metabolites (except H+ which was matched separately). ‘x’ denotes a missing EC number.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s5.pdf

Supplementary Figure S1 (panel values): pairwise consensus (%) between the five databases. Color scale: least overlapping to most overlapping.

Genes
            BiGG   EHMN   HumanCyc   KEGG   Reactome
BiGG        x      46.4   36.0       60.7   42.5
EHMN               x      60.4       43.2   31.8
HumanCyc                  x          40.1   24.6
KEGG                                 x      32.2
Reactome                                    x

EC numbers
            BiGG   EHMN   HumanCyc   KEGG   Reactome
BiGG        x      52.4   44.4       62.8   38.3
EHMN               x      61.3       62.7   33.9
HumanCyc                  x          55.8   26.7
KEGG                                 x      39.1
Reactome                                    x

Metabolites
            BiGG   EHMN   HumanCyc   KEGG   Reactome
BiGG        x      27.0   26.8       36.0   23.8
EHMN               x      24.7       40.1   16.8
HumanCyc                  x          28.2   21.7
KEGG                                 x      24.9
Reactome                                    x

Reactions
            BiGG   EHMN   HumanCyc   KEGG   Reactome
BiGG        x      22.4   14.4       17.9   14.4
EHMN               x      14.7       28.5   11.1
HumanCyc                  x          20.6   13.5
KEGG                                 x      13.0
Reactome                                    x

Supplementary Figure S3 – TCA cycle: majority vote Adapted version of Figure 2 in the main text when retaining only the entities that at least three out of five databases agree on. Reactions occurring in the majority are highlighted. Metabolites are represented by rectangles, genes by rounded rectangles, and EC numbers by parallelograms. Color indicates how many of the five databases include a specific entity. Color of an arrow indicates the number of databases that agree upon an entire reaction, i.e., all its metabolites (except H+ which was matched separately). See next page.


Supplementary Table S1 – Transferred and obsolete identifiers and EC numbers per database

Genes

            Number of Entrez Gene IDs      Total number of Entrez Gene IDs
Database    transferred^a    obsolete      before update    after update
BiGG        10 (1)           5             1496             1490
EHMN        4 (1)            24            2517             2492
HumanCyc    38 (19)          5             3233             3209
KEGG        1 (0)            0             1535             1535
Reactome    10 (9)           21            1210             1180

^a The number of the genes that were transferred to an ID that was already present in the set of Entrez Gene IDs of the particular database is indicated between brackets.

EC numbers

            Number of EC numbers                      Total number of EC numbers
Database    incomplete   transferred   obsolete      before update    after update
BiGG        2            8             1             644              645
EHMN        43           4             1             936              940
HumanCyc    34           2             0             1212             1215
KEGG        34           0             0             726              726
Reactome    19           3             0             354              356

Incomplete EC numbers were not taken into account in the comparison because of their ambiguity. Some EC numbers were transferred to multiple new EC numbers. In Reactome there are two cases in which the new EC number was already included in Reactome.

Metabolites

Number of identifiers per database that were incorrectly formatted, transferred, or obsolete. For PubChem Compound the first column gives the number of PubChem Substance IDs (SIDs) that do not map to a PubChem Compound ID.

            KEGG Compound                             KEGG Glycan                  CAS             ChEBI                      PubChem Compound
Database    incorr. formatted  transferred  obsolete  incorr. formatted  obsolete  incorr. formatted  transferred  obsolete  SIDs not mapped  obsolete
BiGG        8^b                14           21        50                 2         7                  x            x         x                1
EHMN        0                  28           4         0                  0         0                  3            0         35               0^c
HumanCyc    0                  9            0         x                  x         2^b                1            39        x                106
KEGG        0                  0            0         0                  0         0                  0            0         259              0^c
Reactome    0                  8            1         x                  x         x                  0            0         12               0^c

An 'x' indicates that the particular identifier is not available for this database. ^b One could not be corrected and was therefore removed. ^c As the CID-SID.gz file from PubChem was used to convert the PubChem Substance IDs to PubChem Compound IDs these will naturally be up-to-date.
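The effect of transferred and obsolete identifiers on the totals in Supplementary Table S1 can be illustrated with a small Python sketch. The function name and the dictionary layout are hypothetical; the sketch assumes that a transferred identifier is replaced by one or more new identifiers and that an obsolete identifier is dropped.

```python
def update_identifiers(identifiers, transferred, obsolete):
    """Apply identifier transfers and remove obsolete identifiers.

    identifiers: iterable of identifiers found in a database.
    transferred: dict mapping an old identifier to a list of new identifiers
                 (an EC number may be transferred to several new EC numbers).
    obsolete: set of identifiers that no longer exist.
    """
    updated = set()
    for identifier in identifiers:
        if identifier in obsolete:
            continue
        # A transfer to an identifier that is already present merges the two.
        updated.update(transferred.get(identifier, [identifier]))
    return updated
```

This reproduces the pattern visible in the table: the total decreases when an identifier is obsolete or is transferred to one that was already present, and it increases when one EC number is transferred to multiple new EC numbers.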

Supplementary Table S2 – Identifier types for genes and metabolites present in each of the databases

Gene identifiers
BiGG: Entrez Gene, HGNC (only on website and in PDF)
EHMN: Entrez Gene, Ensembl Gene, HGNC (only on website)
HumanCyc: Entrez Gene, Ensembl gene, UCSC, UniGene, Entrez, Genecards, RefSeq_NM
KEGG: KEGG Gene ID (=Entrez Gene), NCBI-GeneID (=Entrez Gene), HGNC, Ensembl, RefSeq_XM
Reactome: Entrez Gene, Ensembl Gene, UCSC, KEGG Gene ID, BioGPS, CTD

Metabolite identifiers
BiGG: KEGG Compound, KEGG Glycan, PubChem Compound (in comments), PubChem Substance (in comments), CAS
EHMN: KEGG Compound, KEGG Glycan, ChEBI, PubChem Substance, CAS, InChI, SMILES, EMP ID, HMDB
HumanCyc: KEGG Compound, ChEBI, PubChem Compound, CAS, InChI, SMILES, NCI, UM-BBD-CPD, KNApSAcK, NIKKAJI, Wikipedia
KEGG: KEGG Compound, KEGG Glycan, ChEBI, PubChem Substance, CAS, InChI, PDB-CCD, 3DMET, KNApSAcK, LIPIDMAPS, LIPIDBANK
Reactome: KEGG Compound, ChEBI, PubChem Substance

Note that some identifiers are only present for a few genes/metabolites.

Supplementary Table S3 – Metabolite counts per database

Columns: total number of metabolites; % of metabolites without a chemical formula; % of metabolites without identifier; # of metabolites with identifier; % metabolites with a KEGG Compound, KEGG Glycan, ChEBI, PubChem Compound, or CAS identifier.

Database    Total   % no formula   % no identifier   # with identifier   % KEGG Compound   % KEGG Glycan   % ChEBI   % PubChem Compound   % CAS
BiGG        1485    0              34                984                 87                15              0         3                    57
EHMN        2676    10             32                1830                91                6               49        33                   40
HumanCyc    1681    18             25                1258                88                0               49        77                   40
KEGG        1553    10             0                 1553                92                14              67        75                   47
Reactome    984     44             31                682                 85                0               98        54                   0

For each of the five databases the percentage of metabolites without a chemical formula and the percentage of metabolites without an identifier is indicated. Furthermore, for each pathway database the percentage of metabolites linked to a particular metabolite database (KEGG Compound, KEGG Glycan, ChEBI, PubChem Compound, and CAS) is indicated. We also included the instances of metabolite classes for HumanCyc and members of sets for Reactome, see Materials and Methods.

Supplementary Table S4 – Metabolite counts per database for the comparison of core metabolic processes

Columns: total number of metabolites; % of metabolites without a chemical formula; % of metabolites without identifier; # of metabolites with identifier; % metabolites with a KEGG Compound, KEGG Glycan, ChEBI, PubChem Compound, or CAS identifier.

Database    Total   % no formula   % no identifier   # with identifier   % KEGG Compound   % KEGG Glycan   % ChEBI   % PubChem Compound   % CAS
BiGG        1041    0              21                824                 94                5               0         2                    64
EHMN        1924    2              38                1195                96                0               62        40                   46
HumanCyc    847     6              13                738                 88                0               56        87                   51
KEGG        1190    2              0                 1190                100               3               79        81                   50
Reactome    532     22             10                481                 92                0               99        58                   0

For the metabolites of the core metabolic processes in each of the five pathway databases the percentage of metabolites without a chemical formula and the percentage of metabolites without an identifier is indicated. Furthermore, for each pathway database the percentage of metabolites linked to a particular metabolite database (KEGG Compound, KEGG Glycan, ChEBI, PubChem Compound, and CAS) is indicated. We also included the instances of metabolite classes for HumanCyc and members of sets for Reactome, see Materials and Methods.


Supplementary Text S1 – Top-level pathways from Reactome (not) considered in the comparison

Included:
Biological oxidations
Metabolism of amino acids and derivatives
Metabolism of carbohydrates
Metabolism of lipids and lipoproteins
Metabolism of nucleotides
Metabolism of porphyrins
Metabolism of vitamins and cofactors
Pyruvate metabolism and Citric Acid (TCA) cycle
Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling
Transmembrane transport of small molecules

Excluded:
Apoptosis
Axon guidance
Botulinum neurotoxicity
Cell Cycle Checkpoints
Cell Cycle, Mitotic
Cell junction organization
Maintenance
Circadian Clock
Diabetes pathways
DNA Repair
DNA Replication
Gene Expression
HIV Infection
Hemostasis
Influenza infection
Integration of energy metabolism
Integrin cell surface interactions
Interactions of the immunoglobulin superfamily (IgSF) member proteins
Meiotic Recombination
Membrane Trafficking
Metabolism of nitric oxide
Metabolism of proteins
Metabolism of RNA
Muscle contraction
mRNA Processing
Myogenesis
Opioid Signalling
Regulation of beta-cell development
Regulatory RNA pathways
Signaling by BMP
Signaling by EGFR
Signaling by FGFR
Signaling by GPCR
Signaling by PDGF
Signaling in Immune system
Signaling in Insulin receptor
Signaling by NGF
Signaling by Notch
Signaling by Rho GTPases
Signaling by TGF beta
Signaling by VEGF
Signaling by Wnt
Synaptic Transmission
Transcription

Supplementary Text S2 – Instantiating reactions containing metabolite sets Reactome contains reactions defined in terms of sets of metabolites. For our comparison we wanted to use the specific reactions that can be derived from these. According to the curator guide of Reactome (http://wiki.reactome.org/index.php/Reactome_Curator_Guide) if a set is used as input for a reaction and another set as output, the annotation is taken to mean that the first member of the input set is converted to the first member of the output set and so on. This indeed works in most cases, except for five cases, including the following two examples.


Example 1

(choloyl-CoA, chenodeoxycholoyl-CoA) + (glycine, taurine) → (glycocholate, glycochenodeoxycholate, taurocholate, taurochenodeoxycholate) + CoA

If we take the first member of each set and do the same for the second member, this gives us:
choloyl-CoA + glycine → glycocholate + CoA (1)
chenodeoxycholoyl-CoA + taurine → glycochenodeoxycholate + CoA (2)

This leaves us with the last two members of the output set. Assuming we should recycle the members of the input sets on the left-hand side of the reaction, we obtain:
choloyl-CoA + glycine → taurocholate + CoA (3)
chenodeoxycholoyl-CoA + taurine → taurochenodeoxycholate + CoA (4)

Reactions 1 and 4 are indeed correct, but reactions 2 and 3 are not.

Another option would be to make all possible combinations between the sets, which would give 16 reactions of which only 4 are correct:
choloyl-CoA + glycine → glycocholate + CoA
choloyl-CoA + taurine → taurocholate + CoA
chenodeoxycholoyl-CoA + glycine → glycochenodeoxycholate + CoA
chenodeoxycholoyl-CoA + taurine → taurochenodeoxycholate + CoA

Example 2

(TMP, uridine 5' monophosphate, 2'-deoxyuridine 5' monophosphate, uridine 2' monophosphate, uridine 3' monophosphate) + H2O → (thymidine, uridine, deoxyuridine) + orthophosphate

In this example there are five metabolites in the set at the left hand side of the reaction and only three metabolites at the right hand side, which again makes it impossible to match the first member of one set with the first of the other, and so on.
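The two instantiation strategies discussed in these examples, positional pairing and taking all combinations, can be sketched in Python as follows. The function names and the representation of a set-based reaction as lists of member lists are hypothetical; the sketch only shows why positional pairing is undefined for sets of unequal size and why the combinatorial approach over-generates.

```python
from itertools import product


def instantiate_positional(input_sets, output_sets):
    """Pair the i-th member of every input set with the i-th output member.

    Raises ValueError when the sets differ in size, because the positional
    rule then does not define which members belong together.
    """
    all_sets = list(input_sets) + list(output_sets)
    if len({len(members) for members in all_sets}) != 1:
        raise ValueError("sets differ in size; positional pairing is undefined")
    return list(zip(zip(*input_sets), zip(*output_sets)))


def instantiate_all_combinations(input_sets, output_sets):
    """Generate every combination of one member per set (may over-generate)."""
    n_inputs = len(input_sets)
    return [(combo[:n_inputs], combo[n_inputs:])
            for combo in product(*input_sets, *output_sets)]
```

For Example 1 the combinatorial approach yields 2 × 2 × 4 = 16 candidate reactions, whereas positional pairing fails for both examples because the sets are of unequal size.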

Supplementary File S1 – Results of the FatiGO analyses GO biological processes enriched according to FatiGO for the following comparisons: WS1) genes in the consensus on gene level versus the union of the remaining genes, WS2) all unique genes versus the union of the remaining genes, WS3-WS7) unique genes per database (BiGG, EHMN, HumanCyc, KEGG, Reactome) versus the remaining genes contained in the union, WS8) genes contained in the majority of the databases versus the union of the remaining genes.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s2.xls


Supplementary File S2 – Consensus reactions Overview of the reactions that are part of the consensus of all five pathway databases (when not taking into account e-, H+ and H2O). For each consensus reaction the corresponding EC numbers, genes (Entrez Gene IDs), and pathways are also given for each database.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s3.xls

Supplementary File S3 – TCA cycle as represented in each of the five metabolic pathway databases Breakdown of the TCA cycle per database. WS1) Overview of all reactions, plus corresponding EC numbers and genes. WS2) Reactions of each database; matching reactions are aligned. For reactions that are not part of the TCA cycle consensus an explanation for the differences observed is given in column B (see also section ‘Analysis of differences between databases’ in the main text). WS3) Metabolites of each database; matching metabolites are aligned. WS4) EC numbers of each database; matching EC numbers are aligned. WS5) Genes of each database; matching genes are aligned. In WS3-WS5 metabolites, EC numbers, and genes are matched across the entire TCA cycle.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s6.xls

Supplementary File S4 – Grouping of pathways into categories. Overview of the manual grouping of pathways of each database into one of eleven categories, see Materials and Methods.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s7.xls

Supplementary File S5 – Unmatched metabolites without a metabolite identifier Names of metabolites without a match in any of the four other databases and without any of the five types of metabolite identifiers.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s10.xls

Supplementary File S6 – Overview of all reactions and their matches Overview of all reactions and their matches (when not taking into account e-, H+ and H2O). Rows are colored according to the number of databases that agree on a reaction. For each reaction the corresponding EC numbers, genes (Entrez Gene IDs), and pathways are also given for each database.

Available on: http://www.biomedcentral.com/content/supplementary/1752-0509-5-165-s13.xls


Knowledge representation in metabolic pathway databases

Miranda D. Stobbe (1,3), Gerbert A. Jansen (1,3), Perry D. Moerland (1,3), Antoine H.C. van Kampen (1,2,3,4)

1 Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, 1100 DE, Amsterdam, the Netherlands
2 Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH, Amsterdam, the Netherlands
3 Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA, Nijmegen, the Netherlands
4 Netherlands Consortium for Systems Biology, University of Amsterdam, PO Box 94215, 1090 GE, Amsterdam, the Netherlands

Accepted for publication in: Briefings in Bioinformatics

Abstract

The accurate representation of all aspects of a metabolic network in a structured format, such that it can be used for a wide variety of computational analyses, is a challenge faced by a growing number of researchers. Analysis of five major metabolic pathway databases reveals that each database has made widely different choices to address this challenge, including how to deal with knowledge that is uncertain or missing. In concise overviews we show how concepts such as compartments, enzymatic complexes, and the direction of reactions are represented in each database. Importantly, concepts that a database does not represent are also described. Which aspects of the metabolic network need to be available in a structured format, and in what detail, differs per application. For example, for in silico phenotype prediction a detailed representation of gene-protein-reaction relations and the compartmentalization of the network is essential. Our analysis also shows that current databases are still limited in capturing all details of the biology of the metabolic network, which is further illustrated with a detailed analysis of three metabolic processes. Finally, we conclude that the conceptual differences between the databases, which make knowledge exchange and integration a challenge, have so far not been resolved by the exchange formats in which knowledge representation is standardized.

Knowledge representation

Introduction Our understanding of metabolism is ever expanding, as evidenced by the increasing amount of bibliomic data. Pathway databases have been built to collect and capture this knowledge. Besides serving as knowledge repositories, the databases aim to represent the metabolic network in a digital format in such a way that it can be used for computational analyses. This has enabled numerous analyses ranging from the prediction of phenotypes (Jerby et al, 2010), studying evolution (Tanaka et al, 2006), to the analysis and interpretation of high-throughput data (Antonov et al, 2008). The number of pathway databases describing the metabolic network for one or more organisms continues to grow (Karp and Caspi, 2011; Oberhardt et al, 2009).

From the perspective of a researcher used to the compact representation of biological knowledge on metabolism in a pathway, it may seem trivial to represent the metabolic network in an electronic form. However, the biology of the metabolic network is complex and the terminology used by biologists changes over time and varies among biologists (Karp and Mavrovouniotis, 1994). Furthermore, a pathway database needs to accommodate a wide range of users with different requirements. Numerous choices need to be made by database developers and curators on how to represent and relate each component of the metabolic network (Figure 1). Important considerations hereby are what needs to be described in a structured and standardized form to enable computational analyses and what can be described as background information in unstructured text fields.

We selected five frequently used pathway databases and compared their approaches to representing the human metabolic network in a digital format. Furthermore, we discuss how the databases deal with knowledge that is uncertain or missing. To illustrate the challenges faced in making biological knowledge amenable to computational analyses, we give a detailed description of how each database represents three complex metabolic processes: fatty acid beta oxidation, oxidative phosphorylation, and the pyruvate dehydrogenase reaction. Finally, we discuss the challenges posed by the differences in knowledge representation for analyses across pathway databases and exchange of knowledge between databases.

With this paper we intend to raise awareness of the complexity of representing the current knowledge on metabolism. A detailed understanding of how knowledge is represented is crucial for users of pathway databases, as differences in representation can affect the outcome of computational analyses. As argued by Green and Karp (2006), the pathway definition alone may already influence analysis results. Moreover, the choices made in how to represent the metabolic network affect the ability of the database to capture every detail. By pointing out the current limitations our research will also aid (future) database developers, knowledge curators, and domain experts in their quest to further improve knowledge representation.

Figure 1 - Gluconeogenesis. Selection of the information that needs to be stored to accurately represent the gluconeogenesis pathway (Berg et al, 2012, pp. 469-513).

Database    # of organisms^1   Name human network    Version   File formats and computational access
BiGG        8                  H. sapiens Recon 1    1         Tab-delimited files, SBML
BioCyc      1700               HumanCyc              15.5      Pathway Tools (export flat files), API, BioPAX, SBML
EHMN        1                  EHMN                  2         Excel file, SBML
KEGG        1646               ---                   61        KGML, API, dbget, flat files
Reactome    46                 ---                   39        MySQL dump, API, BioPAX, SBML

Table 1 – Metabolic pathway databases compared. ^1 Note that numbers should not be directly compared, since the level of curation may differ per organism, both within and between databases.

Results To illustrate the differences in representation of the metabolic network, we selected the following five databases: H. sapiens Recon 1 (Duarte et al, 2007) from BiGG (Schellenberger et al, 2010) (referred to as Recon 1 in the rest of the paper), HumanCyc (Romero et al, 2004) from BioCyc (Karp et al, 2005), EHMN (Hao et al, 2010), KEGG (Kanehisa et al, 2012), and Reactome (Croft et al, 2011) (Table 1). Our analysis of how knowledge is represented in these databases was based on the descriptions given by the pathway database curators themselves in articles and online manuals (if available); when necessary we contacted the database curators for additional details. Moreover, for a more detailed insight we also examined the data files provided by each database, which contain the actual representation of the metabolic network (Supplementary Table S1). Note that we did not consider concepts that were only represented on the website of the database or knowledge that was only provided indirectly by references to other databases, e.g., metabolite databases. For each database we analyzed how and to what detail knowledge is represented on the level of (i) the entire network, (ii) its reactions, and (iii) the enzymes and their encoding genes. On all three levels the databases have made different choices (Table 2). We focused in our review on how knowledge is represented rather than how it is collected, although both issues are intimately linked to how accurately the metabolic network is captured.

Representation of the network

Types of networks An important difference on network level is whether a database only describes metabolic processes (EHMN, Recon 1) or also other types of biological processes such as signaling and genetic information processing (HumanCyc, KEGG, Reactome; Tables 2 and 3). Since in Reactome the different types of processes are intertwined it is non-trivial to only retrieve the metabolic network, needed for instance to carry out flux balance analyses (Latendresse et al, 2012; Orth et al, 2010). Note that all five databases describe what is referred to as the global human metabolic network in which all possible reactions are combined, even though they may not take place in every tissue or cell type. Defining tissue-specific models is left to algorithms like the one designed by Jerby et al. (2010) or more specialized (manually) curated networks like HepatoNet1 (Gille et al, 2010).

[Table 2: columns EHMN, Recon 1, HumanCyc, KEGG, Reactome. Rows, grouped by level: network (type: metabolism, signaling, genetic information processing; compartmentalization; division into pathways), reaction (reaction type; linking of reactions; physiological direction; metabolite protonation state), enzyme (isozymes; isoforms; complexes: heteromers, homomers; prosthetic groups/cofactors). Cell symbols not reproduced.]

Table 2 – Representation of concepts in metabolic pathway databases. Check mark: concept is represented. Cross: concept is not represented. Bar: database is able to represent the concept, but this is only done to a limited extent.

Compartmentalization Different cellular compartments have distinct metabolic functions. KEGG does not provide any information on compartments. In HumanCyc compartmentalization is work in progress, but their cell component ontology (Zhang et al, 2005) does allow for a detailed representation. EHMN, Recon 1, and Reactome provide a fully compartmentalized network. Reactome has the most fine-grained compartmentalization of the latter three databases and thereby conveys the most detailed knowledge (Table 3). Moreover, Reactome not only indicates the compartment for each reaction, but also for each enzyme and even for some pathways. Recon 1 and EHMN account for the same set of eight compartments, but handle subcellular locations not included in this set differently. In Recon 1 the intermembrane space of the mitochondrion, for example, is merged with the cytosol (Duarte et al, 2007). EHMN uses the hierarchy of the Gene Ontology (GO) to determine which of their eight compartments is the ancestor of the subcellular location in question. In the example above the ancestor is the mitochondrion. Such a more coarse-grained compartmentalization will result in a less accurate representation of, for example, oxidative phosphorylation (Supplementary Text S1).

Table 3 – Representation differences on network level.

Multiple types of processes^a?
  EHMN: no, only metabolic reactions
  Recon 1: no, only metabolic reactions
  HumanCyc: yes, all types of reactions represented in the same way
  KEGG: yes, different types of reactions represented differently
  Reactome: yes, all types of reactions represented in the same way and intertwined

Compartmentalization: # compartments^b
  EHMN: 8
  Recon 1: 8
  HumanCyc: 27^c
  KEGG: not applicable
  Reactome: 47

Compartmentalization: ontology
  EHMN: GO cellular component
  Recon 1: no ontology used
  HumanCyc: Cell Component Ontology
  KEGG: not applicable
  Reactome: GO cellular component

Compartmentalization: specified on the level of
  EHMN: metabolites and/or the reaction^d
  Recon 1: metabolites or reaction^e
  HumanCyc: metabolites, reaction and enzyme
  KEGG: not applicable
  Reactome: metabolites, reaction, enzyme and pathway

Compartmentalization: if compartment is unknown
  EHMN: ‘uncertain’, but for practical purposes also provided as ‘cytosol’
  Recon 1: cytosol
  HumanCyc: 'NIL' (few cases) or nothing is specified, in which case ‘cytosol’ is the default
  KEGG: not applicable
  Reactome: not applicable (reaction not included in database if compartment is unknown)

Pathway definition
  EHMN: re-division of the pathways of KEGG and EMP: less overlap between pathways; small functionally related pathways are grouped; human-specific
  Recon 1: definition of KEGG, but human-specific
  HumanCyc: guidelines used: a single biological process; evolutionary conserved; regulated as a unit; boundaries at stable and high-connectivity metabolites
  KEGG: centered on the synthesis and/or degradation of one or more related substrates
  Reactome: a series of reactions, connected by their participants, leading to a biological outcome

Pathway categorization
  EHMN: categories of KEGG
  Recon 1: categories of KEGG
  HumanCyc: hierarchy, first classified according to type of process (e.g. degradation) and next on the type of metabolites involved
  KEGG: 11 categories (e.g. amino acid metabolism), based on the type of metabolites involved
  Reactome: hierarchy, based on the type of metabolites involved

Reactions without pathway?
  EHMN and Recon 1: yes, labeled ‘isolated’ (one database) or ‘other’ or ‘miscellaneous’ (the other)
  HumanCyc: yes, no link to instance of the pathway frame
  KEGG: yes
  Reactome: no

^a Metabolic, signaling and gene information processing. ^b Includes extracellular space. ^c Includes generic compartments like ‘in’ and ‘membrane’ and the non-human compartment ‘inner membrane (sensu Gram-negative Bacteria)’. ^d If it is a transport reaction, the compartment is only indicated at the metabolite level. ^e If it is a transport reaction, the compartment is indicated at the metabolite level, otherwise only for the entire reaction. GO: Gene Ontology, EMP: Enzymes and Metabolic Pathways database.

                                            EHMN   Recon 1   HumanCyc   KEGG   Reactome
number of pathways                          69     96        257        84     171
average number of reactions per pathway     52     30        4          23     8
% of reactions occurring in >1 pathway^a    1%     9%        13%        10%    12%

Table 4 – Pathway statistics. Statistics based upon the same data we used previously (Stobbe et al, 2011) for a comparison of the content of the five databases. Only the metabolic pathways of HumanCyc, KEGG and Reactome are considered. For Reactome and HumanCyc the lowest level in the hierarchy was used when counting the number of pathways. If reactions only differ in direction and/or compartments they are counted as one. ^a With respect to the total number of reactions assigned to at least one pathway.
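The three statistics reported in Table 4 can be computed from a mapping of pathways to reactions, as in the following Python sketch. The function name and input layout are hypothetical, and the sketch assumes that reactions differing only in direction and/or compartment have already been collapsed to a single identifier.

```python
from collections import Counter


def pathway_statistics(pathways):
    """Summarize a division of a network into pathways.

    pathways: dict mapping a pathway name to the set of its reaction IDs.
    Returns (number of pathways, average number of reactions per pathway,
    percentage of reactions occurring in more than one pathway). The
    percentage is relative to all reactions assigned to at least one pathway.
    """
    number_of_pathways = len(pathways)
    average_size = sum(len(r) for r in pathways.values()) / number_of_pathways
    occurrences = Counter(reaction
                          for reactions in pathways.values()
                          for reaction in reactions)
    shared = sum(1 for count in occurrences.values() if count > 1)
    return number_of_pathways, average_size, 100.0 * shared / len(occurrences)
```

For two toy pathways {r1, r2} and {r2, r3, r4}, the average size is 2.5 reactions and one of the four distinct reactions (25%) occurs in more than one pathway.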

Division into pathways All five databases divided their network into pathways to provide insight into the functional organization of the metabolic network. Although this division into pathways is not arbitrary and is based on biological criteria in each of the databases, there is no generally accepted definition of a pathway. Consequently, each database defines the boundaries of its pathways differently (Table 3). This results in a large difference in the number of pathways, the average number of reactions per pathway, and the overlap between pathways (Table 4). As described by Green and Karp (Green and Karp, 2006) for BioCyc and KEGG, the pathway definition might influence the outcome of pathway-based analyses.

Reactome defines a pathway as a series of reactions, connected by their participants, leading to a biological outcome (Reactome Glossary, 2012). In HumanCyc stricter guidelines are used, i.e., a pathway is a single biological process that should be evolutionarily conserved and regulated as a unit (Green and Karp, 2006). This partly explains the low average size of a pathway in HumanCyc. Moreover, variants of the same metabolic process are considered as separate pathways, thereby increasing the overlap between pathways. In contrast, in EHMN the emphasis is on the functional relationships between reactions and overlapping metabolic processes are merged into a single pathway (Ma et al, 2007). The pathways of KEGG are a mosaic of reactions that take place in any of the organisms included in KEGG and are substrate-centric (Green and Karp, 2006). The organism-specific version of a KEGG pathway consists of those reactions to which a gene of the organism of interest has been linked. This approach can result in artifacts. For example, ‘lysine biosynthesis’ is part of the human metabolic network in KEGG, although our metabolism lacks the ability to synthesize lysine. Recon 1 uses the same pathway definitions and categories as KEGG; however, only human-specific pathways are included, so that, e.g., lysine biosynthesis is not included in Recon 1. Ultimately, pathways cannot be studied in isolation as the entire network is connected.

Representation of metabolic reactions A metabolic reaction can be defined as the synthesis or degradation of chemical compounds, which may or may not be a reversible process. The type of reaction, e.g., an 'oxidation-reduction' reaction, is indicated by an Enzyme Commission (EC) number in all databases, although in Reactome a link to GO is preferred. KEGG and HumanCyc also have their own reaction ontology (Table 5). The ontology of HumanCyc enables selecting, for instance, only small molecule reactions. The exact representation of a metabolic reaction differs per database. For example, each database uses a different terminology in its data model to refer to the metabolites before the arrow, e.g., ‘substrates’ or ‘input’, and after the arrow, e.g. ‘products’ or ‘output’ (Figure 2). In addition, the level of detail in which a conversion is described varies within and between databases (Case Study and Supplementary Text S2). Differences in detail between databases can reflect a disagreement on the number of steps required for the conversion. Intermediate steps may, however, also have been left out because of a lack of evidence, or to simplify the description of a process. For these reasons, an apparent disagreement on the underlying biology between multiple descriptions of the metabolic network could also be caused by different decisions on how to represent the same knowledge. If intermediate steps have been left out for simplification only, a mechanism to retrieve these steps should be provided to allow users to determine themselves the level of detail that is required for the application at hand. Only HumanCyc enables its curators to indicate 'subreactions', but this option has not been used yet (release 15.5). The reasons for leaving out intermediate steps are not indicated in a structured way in any of the databases.

| | EHMN | Recon 1 | HumanCyc | KEGG | Reactome |
|---|---|---|---|---|---|
| Reaction: type | indirectly (link to NC-IUBMB) | indirectly (link to NC-IUBMB) | 2 parallel reaction ontologies, i.e., classification by conversion type (a) or by substrate | classified in KEGG BRITE (a), a collection of functional hierarchies | indirectly (link to GO Biological Process) |
| Reaction: linked to next/previous reaction? | no | no | yes, to previous reaction(s) within a pathway. Links to other pathways via a metabolite. | yes, successive steps within a pathway are linked. Links to other pathways via a metabolite. | yes, to preceding reaction(s) and/or pathway(s) |
| Reaction: direction of writing, irreversible reactions | as stored in NC-IUBMB | direction it has in the pathway | as stored in NC-IUBMB | as stored in NC-IUBMB | physiological direction |
| Reaction: direction of writing, reversible reactions | as stored in NC-IUBMB | direction it has in the pathway | as stored in NC-IUBMB | as stored in NC-IUBMB | both directions, stored separately |
| Metabolites: type | indirectly (links to metabolite databases) | indirectly (links to metabolite databases) | classified in their own compound ontology | classified in KEGG BRITE, a collection of functional hierarchies | indirectly (links to metabolite databases) |
| Metabolites: protonation state | always the neutral form | most common state at pH level 7.2 | most common state at pH level 7.3 | always the neutral form | always the neutral form |

Table 5 – Representation differences on reaction level. (a) Includes the NC-IUBMB classification. NC-IUBMB: Nomenclature Committee of the International Union of Biochemistry and Molecular Biology.

Figure 2 – Last, irreversible, step of glycolysis. Words in italic indicate reserved terms in a database. The term used to indicate the side of the metabolite in the reaction is shown using braces. The term used to describe the reversibility of the reaction is shown in a solid-lined box. Each database indicates the direction of the reaction differently as shown in the dotted-lined boxes. For example, KEGG indicates for the main metabolites of a reaction whether it is a substrate or product, thereby implying the direction of the reaction. HumanCyc, on the other hand, explicitly indicates the direction with, in this example, 'RIGHT-TO-LEFT'. (A) EHMN, KEGG, and HumanCyc store the reaction in the direction defined by NC-IUBMB. (B) Recon 1 and Reactome store the reaction in the physiological direction.

Linking of reactions For various types of network analysis, the network first needs to be constructed from the individual reactions in a pathway database (Lacroix et al, 2008). In HumanCyc, KEGG, and Reactome reactions are explicitly linked to the preceding and/or following steps, both within and across pathways (Table 5). For each reaction KEGG also stores its main compounds, which connect consecutive reactions. In HumanCyc the main compounds are deduced automatically per pathway for most reactions (Karp and Paley, 1994); in Reactome main compounds are only captured in the graphical representation of the pathway. Note that in Reactome a metabolic reaction can be preceded by the activation of an enzyme catalyzing this reaction. This gives a more complete view of a biological process, compared to only describing its metabolic component. However, as stated above, retrieving solely the metabolic processes from Reactome is difficult. How to link reactions to each other is not explicitly indicated in EHMN and Recon 1. In this case reactions are generally linked based on the substrates or products they have in common. This strategy makes network construction more difficult due to ‘currency’ metabolites (ATP, H+, etc) connecting unrelated reactions (Huss and Holme, 2007). The number of possible connections can be restricted by only linking reactions via metabolites that are assigned to the same compartment.
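Linking reactions via shared metabolites, while skipping currency metabolites and respecting compartments, can be sketched as follows. This is a minimal illustration and not the procedure of any of the databases: the reaction records, the compartment codes, and the set of currency metabolites are all assumptions chosen for the example.

```python
from itertools import permutations

# Assumed set of 'currency' metabolites; the actual choice is application-specific.
CURRENCY = {"ATP", "ADP", "H+", "H2O", "NAD+", "NADH", "Pi"}

def link_reactions(reactions, currency=CURRENCY):
    """Return directed edges (producer, consumer, metabolite, compartment).

    `reactions` maps a reaction ID to a dict with 'substrates' and 'products',
    each a set of (metabolite, compartment) pairs.
    """
    edges = set()
    for a, b in permutations(reactions, 2):
        # A product of `a` must be a substrate of `b` in the same compartment;
        # comparing (metabolite, compartment) pairs enforces this.
        shared = reactions[a]["products"] & reactions[b]["substrates"]
        for metabolite, compartment in shared:
            if metabolite not in currency:
                edges.add((a, b, metabolite, compartment))
    return edges

# Toy network with hypothetical IDs; 'c' = cytosol, 'r' = endoplasmic reticulum.
toy = {
    "hexokinase": {"substrates": {("glucose", "c"), ("ATP", "c")},
                   "products": {("G6P", "c"), ("ADP", "c")}},
    "isomerase": {"substrates": {("G6P", "c")},
                  "products": {("F6P", "c")}},
    "atp_regeneration": {"substrates": {("ADP", "c"), ("X", "c")},
                         "products": {("ATP", "c"), ("Y", "c")}},
    "g6pase": {"substrates": {("G6P", "r"), ("H2O", "r")},
               "products": {("glucose", "r"), ("Pi", "r")}},
}
edges = link_reactions(toy)
print(edges)
```

In this toy network only the hexokinase-to-isomerase link survives: the ATP/ADP connections are discarded as currency metabolites, and the glucose 6-phosphate pools in different compartments are kept apart.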

Physiological direction Knowing the physiological direction of reactions is crucial when, for example, building an in silico model to predict phenotypes. All databases indicate the direction in slightly different ways (Figure 2, Table 5). In general, reactions in EHMN, HumanCyc, and KEGG are stored in the direction defined by NC-IUBMB, which is not necessarily the physiological direction in which the reaction takes place in humans. For example, the last step of glycolysis, the formation of pyruvate, is given in the direction opposite to the one in which it takes place (Figure 2A). EHMN only indicates that the reaction is irreversible and thus does not provide the correct physiological direction. In KEGG the (ir)reversibility of a reaction is indicated independently of the specific organism, thereby ignoring that whether a reaction is reversible or not varies among species. Note that in Reactome reversibility is only indicated by providing a link to the reaction in the opposite direction if the reaction is reversible. Finally, only in HumanCyc can the direction be defined in the context of (i) a specific enzyme that catalyzes the reaction, (ii) the pathway or (iii) only the reaction itself. For only 4% of the reactions, mainly transport reactions, the direction is specified in the context of the enzyme. Furthermore, for around 40% of the reactions the direction is not specified at all. These reactions are also not included in any pathway.

Metabolites The identity of a metabolite is determined by its name and by identifiers from specialized metabolite databases such as ChEBI (de Matos et al, 2010). An identifier is meant to unambiguously designate a metabolite across multiple resources. Identifiers enable, for example, mapping of experimental data onto the network. All five databases try to link a metabolite to at least one specialized metabolite database. The number of metabolite databases referred to differs (Supplementary Table S2). To minimize ambiguity, it is advisable that pathway databases provide a link to a single, common metabolite database for every metabolite. However, in practice this is not possible yet, as metabolite databases are far from complete.

Metabolite databases use different criteria for assigning IDs to metabolites. Consequently, the type of ID chosen affects the characterization of a metabolite in a pathway database and the level of distinction that can be made. For example, in ChEBI, both a base and its conjugate acid are assigned separate IDs, whereas in KEGG Compound no distinction between these two is made, and they are combined in a single entry with one ID. KEGG, EHMN, and Reactome prefer to state the neutral form of the metabolite. This choice may result in a different ChEBI ID compared to the one assigned to a metabolite by Recon 1 and HumanCyc, which specify the most common protonation state of a metabolite at a pre-defined and fixed pH level. Indicating the correct protonation state is important to be able to build a charge-balanced network. As the pH level varies between compartments, metabolites may have multiple protonation states. Ideally, databases should also be able to indicate these multiple states, which is currently not the case in any of the databases.

[Table 6. Columns: EHMN, Recon 1, HumanCyc, KEGG, Reactome. Rows: formula; charge; mass; Gibbs free energy of formation; structure (InChI, SMILES, other (a)). The check marks and crosses are not recoverable.]

Table 6 – Characteristics metabolites. Check mark: information is present. Cross: information is not available. (a) Other structure formats: mol file (KEGG), structure atoms and bonds (HumanCyc), atomicConnectivity and chemical fingerprint (Reactome)

The pathway databases have made different choices with respect to how much information about metabolites is contained in the pathway database itself. EHMN, Recon 1, and Reactome contain the least information on metabolites and refer to specialized metabolite databases for additional information. Aside from referring to metabolite databases, KEGG and HumanCyc themselves also provide detailed information about metabolites (Table 6). KEGG even has its own metabolite database (KEGG Compound), which is referred to by many other pathway databases. HumanCyc and KEGG also provide their own hierarchical classification of the metabolites (Table 5 and Supplementary Figure S1), although not as extensively as ChEBI. These ontologies are a powerful way to provide some level of abstraction, and at the same time explicitly define what is meant by the abstract term. HumanCyc uses compound classes, e.g., ‘an alcohol’, in reactions as a level of abstraction. Such generic metabolites are, in general, linked to specific metabolites, and used to represent the broad substrate specificity of an enzyme (see Case Study). If for a generic metabolite no specific metabolites are provided at all, the substrate specificity of the enzyme is likely to be undetermined (BioCyc, 2012). Especially when constructing a computational model, it is important to be able to instantiate such generic metabolites and derive specific reactions. For HumanCyc a mechanism for instantiation of generic reactions has recently been added to their Pathway Tools software for the purpose of building flux balance analysis models (Latendresse et al, 2012). Instantiation is done using the compound ontology by selecting those combinations of specific instances of the generic metabolites that lead to a mass-balanced reaction. However, appropriate instances that fulfill this requirement do not exist for every generic reaction. Moreover, some of the generic reactions cannot be instantiated because multiple products are possible for a given substrate (and vice versa). Recon 1 was specifically built to serve as an in silico model capable of predicting phenotypes. For this purpose very generic reactions do not provide enough information and were therefore not included in Recon 1. The (broad) substrate specificity of an enzyme is, consequently, not explicitly captured in this database.
In Reactome some level of abstraction is given by grouping metabolites that undergo the same conversion into a set (Supplementary Figure S2).
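The idea of instantiating a generic reaction by keeping only mass-balanced combinations of specific instances can be sketched as below. This is not the Pathway Tools implementation; the small compound ontology, the elemental formulas, and the function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Illustrative elemental formulas (assumed for this sketch).
FORMULA = {
    "ethanol": Counter(C=2, H=6, O=1),
    "1-propanol": Counter(C=3, H=8, O=1),
    "acetaldehyde": Counter(C=2, H=4, O=1),
    "propanal": Counter(C=3, H=6, O=1),
    "NAD+": Counter(C=21, H=26, N=7, O=14, P=2),
    "NADH": Counter(C=21, H=27, N=7, O=14, P=2),
    "H+": Counter(H=1),
}

# Hypothetical compound ontology: generic class -> specific instances.
INSTANCES = {
    "an alcohol": ["ethanol", "1-propanol"],
    "an aldehyde": ["acetaldehyde", "propanal"],
}

def total(side):
    """Sum the elemental composition of all metabolites on one side."""
    counts = Counter()
    for metabolite in side:
        counts.update(FORMULA[metabolite])
    return counts

def instantiate(substrates, products):
    """Yield the mass-balanced specific reactions of a generic reaction."""
    generic = [m for m in substrates + products if m in INSTANCES]
    for choice in product(*(INSTANCES[g] for g in generic)):
        mapping = dict(zip(generic, choice))
        left = [mapping.get(m, m) for m in substrates]
        right = [mapping.get(m, m) for m in products]
        if total(left) == total(right):
            yield left, right

specific = list(instantiate(["an alcohol", "NAD+"], ["an aldehyde", "NADH", "H+"]))
for left, right in specific:
    print(" + ".join(left), "->", " + ".join(right))
```

Of the four possible combinations only the two with matching carbon chains are mass balanced. If the ontology lacked an aldehyde for a given alcohol, no specific reaction would be produced for it, which mirrors the limitation described above.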


Representation of enzymes and their encoding genes A metabolic reaction is nearly always catalyzed by an enzyme, which in turn is encoded by one or more genes (Figure 1). An enzyme may be a single protein or a complex consisting of multiple copies of the same protein (homomer) or of multiple different proteins (heteromer). The concepts of an enzyme and a gene are represented differently in each database or, in some cases, even not at all (Figure 3). The same holds for the gene-protein-reaction relationship.

Enzymes The concept of an enzyme, as defined above, is not explicitly represented in KEGG. Information on a protein, e.g., its sequence, is indicated at the gene level. Furthermore, aside from a few exceptions, complexes are not indicated in KEGG. EHMN merely represents an enzyme by the UniProt ID(s) assigned to the protein(s) constituting the enzyme; complexes are not represented in EHMN either. In Recon 1 heteromeric complexes are represented using a Boolean expression. However, this is not done at the protein level, as one would expect, but at the gene level. Homomeric complexes are not represented at all. HumanCyc and Reactome do have a separate enzyme level and both types of complexes, i.e., heteromers and homomers, are represented.

Genes A gene is defined by its Entrez Gene ID in EHMN, KEGG and Recon 1, while HumanCyc has its own definition which is closer to the definition of Ensembl. Reactome focuses more on proteins. Genes are not represented as a single entity in its MySQL database, but as a collection of identifiers from different genome databases (Figure 3). The various identifiers provided by Reactome are only united through their link to the same protein entry. Like metabolite databases, genome databases use different criteria for assigning an identifier. The answer to the seemingly simple question of how many genes are involved in the human metabolic network according to each pathway database depends on which type of identifier one counts or whether one follows the convention of the pathway database itself.

[Figure 3: schematic of the data model of EHMN, Recon 1, HumanCyc, Reactome, and KEGG at the gene, enzyme, enzyme activity, and reaction level. Recon 1 panel annotations: suffix to Entrez Gene ID (alternatively spliced variants), Boolean expression ‘OR’ (isozyme), Boolean expression ‘AND’ (heteromer).]

Figure 3 – Representation differences on the level of enzymes and encoding genes. Boxes with a dotted line indicate where in the data model the specific identifier is provided. KEGG: The KEGG MODULE database describes four types of modules, among which structural complexes. Only the complexes of the […] and oligosaccharyltransferase are represented in this way.

Gene-Protein-Reaction relationship There is not necessarily a one-to-one relation between a reaction and the catalyst, e.g., multiple enzymes may catalyze the same reaction (isozymes) (Karp and Riley, 1993). EHMN and KEGG do not specify whether the products of multiple genes linked to the same reaction are isozymes, which can separately catalyze the reaction, or whether the products together form a complex. This could result in incorrect conclusions with respect to the feasibility of a reaction when studying the effect of a protein deficiency (Schellenberger et al, 2010). Furthermore, the EC number and KEGG Orthology (KO) number are used instead of a specific enzyme to connect a gene to the corresponding reaction in KEGG. Moreover, only the gene level is organism-specific and the relation between ‘enzyme activity’ (EC number) and reaction is not. To retrieve species-specific reactions, the gene coding for the enzyme that catalyzes a reaction needs to be known. In Recon 1 the relation between a protein and the encoding gene is only available via its website. Isozymes are represented using a Boolean expression and are defined at the gene level in Recon 1. Reactome groups isozymes at the enzyme level into a set, each member of which can catalyze the reaction it is linked to. In HumanCyc isozymes are implied when multiple proteins are separately linked to the same reaction (Figure 3). None of the five databases indicates tissue-specificity of isozymes and isoforms, aside from statements in unstructured comment fields, which cannot easily be used in computational analyses.
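Why the distinction between isozymes and complexes matters for the feasibility of a reaction can be shown with a small sketch of Boolean gene-protein-reaction rules, in the spirit of the ‘AND’/‘OR’ expressions of Recon 1. The nested-tuple encoding, the function name, and the gene IDs G1 to G3 are hypothetical and chosen only for this illustration.

```python
def is_catalyzed(rule, active_genes):
    """Evaluate a gene-protein-reaction rule against a set of active genes.

    A rule is either a gene ID (str) or a tuple ("and" | "or", rule, rule, ...):
    "and" models a heteromeric complex, "or" models isozymes.
    """
    if isinstance(rule, str):
        return rule in active_genes
    operator, *operands = rule
    results = (is_catalyzed(r, active_genes) for r in operands)
    return all(results) if operator == "and" else any(results)

# Hypothetical reaction: catalyzed by a G1/G2 complex or by the isozyme G3.
rule = ("or", ("and", "G1", "G2"), "G3")
print(is_catalyzed(rule, {"G2", "G3"}))  # G1 deficient, isozyme G3 still present
print(is_catalyzed(rule, {"G2"}))        # G1 and G3 deficient

# The same two genes read as isozymes or as a complex give opposite answers
# when one of them is lost.
as_isozymes = is_catalyzed(("or", "G1", "G2"), {"G2"})
as_complex = is_catalyzed(("and", "G1", "G2"), {"G2"})
```

When a database only lists the genes linked to a reaction, as in EHMN and KEGG, the choice between the last two readings has to be made by the user, and a wrong choice changes the predicted effect of the deficiency.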

Representation of uncertain and missing knowledge Our current knowledge on human metabolism is incomplete and based on different types of evidence such as biochemical, genetic, or sequence data and studies on other organisms. It is, therefore, important that databases explicitly indicate the source of a piece of knowledge and the degree of confidence associated with it. The databases, except EHMN, commonly cite scientific articles as evidence source. HumanCyc and Recon 1 also indicate the type of evidence available (Table 7). HumanCyc uses an evidence ontology (Karp et al, 2004) with 160 terms such as 'Inferred from experiment’, which can be combined with a probability that the evidence is correct. However, evidence codes are available for less than 18% of the reactions and their catalyzing enzymes. Evidence codes are available for each pathway as a whole. In Recon 1 the type of evidence is indicated for each reaction. Five types of evidence are discerned which are assigned a confidence score (Table 7) (Thiele and Palsson, 2010b).

It is also important to explicitly indicate a complete lack of information, which is, however, rarely done. In Recon 1, for instance, 'cytosol' is used as the default value if the compartment is unknown, instead of explicitly indicating that there is a lack of knowledge as done in EHMN (Table 7). Similarly, of the five databases only HumanCyc explicitly distinguishes a spontaneous reaction from a reaction for which the enzyme is unknown. In KEGG, spontaneous reactions are only indicated in unstructured comment fields. Moreover, for an organism-specific network spontaneous reactions cannot be retrieved as this requires the presence of a gene. The same holds for reactions for which the corresponding gene is unknown for the organism of interest.


| Database | Type of evidence | Degree of confidence |
|---|---|---|
| EHMN | Source of transport reactions is indicated: dead-end analysis (Hao et al, 2010); H. sapiens Recon 1; another database | If the compartment of a protein is unknown it is set to uncertain |
| Recon 1 | Five types of evidence are discerned: biochemical; genetic; sequence; physiological; modeling (a). Literature references | Five confidence scores, ranging from 0 (low) to 4 (high), reflecting the information and evidence currently available: 4 – biochemical data; 3 – genetic data; 2 – sequence homology or physiological data; 1 – modeling data (a); 0 – unevaluated. If multiple types of evidence are available scores are added. Remarks in comment fields |
| HumanCyc | Evidence ontology containing 160 terms, main evidence types: inferred from computation; inferred from experiment; inferred by curator; author statement. Basis for assignment of a protein to a reaction, e.g., EC number or decision of curator. Literature references | Probability that the evidence is correct. Orphan reaction: reaction for which no enzyme that catalyzes the reaction has been sequenced. Unbalanced reactions are labeled. Reaction may be labeled as hypothetical (b). Remarks in comment fields |
| KEGG | Literature references | Remarks in comment fields |
| Reactome | Data supported by evidence from other organisms is indicated. Literature references | CandidateSet: enzymes hypothesized to catalyze the reaction. BlackBoxEvents used for: reactions that have imbalances for various reasons; complex processes of which not all details are known; summarizing a complex process in a single step for which each step is known. OtherEntity: entities that curators are unable or unwilling to describe in chemical detail. GenomeEncodedEntity: polypeptide or polynucleotide whose sequence is unknown. Remarks in comment fields |

Table 7 – Evidence description. (a) Reaction is included, because it improved the performance of the in silico model. (b) The presence of the substrates, products or catalyst of the reaction has not yet been demonstrated.


Finally, it is not always indicated, or only in a comment field, that intermediate steps have been left out. In Reactome a ‘BlackBoxEvent’ can be used for this, which, however, does not necessarily mean that the intermediary steps are unknown.

Case study We selected three complex metabolic processes to further illustrate the different challenges faced by the pathway databases in accurately representing the metabolic network in silico. Here we discuss fatty acid beta oxidation (Berg et al, 2012, pp. 663-696) in more detail. Two more examples, oxidative phosphorylation (Berg et al, 2012, pp. 543-584) and the pyruvate dehydrogenase reaction (Berg et al, 2012, pp. 515-542), are discussed in the Supplementary Text S1 and S2, respectively. These three case studies show the implications of different design decisions on the ability of the pathway databases to represent a biological process in full detail.

Fatty acid beta oxidation We focused on the beta oxidation of saturated fatty acids with a straight chain of even length. One particular challenge in representing this pathway is the repetitive nature of this process. The chain of an activated fatty acid is shortened by two carbons via four subsequent reactions, yielding one unit of acetyl-CoA. Several chain-length specific isozymes are available for each cycle. For the complete oxidation of a fatty acid, this cycle is repeated until only acetyl-CoA is left. The number of cycles needed depends on the chain length of the fatty acid. There is a wide range of fatty acids of which the majority, i.e., short-chain, medium-chain and long-chain fatty acids, are degraded in the mitochondrion. In mammals and many fungi, very-long-chain fatty acids are first shortened in the peroxisome, after which they may be transported to the mitochondrion for further oxidation (Cornell et al, 2007). The exact enzymes involved differ in the two compartments, but the reactions are the same except for the co-substrates of the first step of a cycle.
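The dependence of the number of cycles on the chain length can be made explicit with a short calculation. The sketch below only covers saturated, straight-chain acyl-CoAs of even length, as in this case study; the function names are ours and purely illustrative.

```python
STEPS_PER_CYCLE = 4  # the four reactions of one beta-oxidation cycle

def cycles_between(n_start, n_end=2):
    """Cycles needed to shorten an even-chain acyl-CoA from n_start to n_end carbons."""
    if n_start % 2 or n_end % 2 or not 2 <= n_end <= n_start:
        raise ValueError("expected even chain lengths with 2 <= n_end <= n_start")
    # Each cycle removes two carbons as one acetyl-CoA.
    return (n_start - n_end) // 2

def acetyl_coa_yield(n_start):
    """Acetyl-CoA units from complete oxidation (the final cycle yields two)."""
    cycles_between(n_start)  # reuse the validation of the chain length
    return n_start // 2

# Palmitoyl-CoA (C16) to octanoyl-CoA (C8).
partial_cycles = cycles_between(16, 8)
partial_steps = partial_cycles * STEPS_PER_CYCLE

# Complete oxidation of palmitoyl-CoA.
full_cycles = cycles_between(16)
full_yield = acetyl_coa_yield(16)
print(partial_cycles, partial_steps, full_cycles, full_yield)
```

A database that describes every step therefore needs four reactions per cycle for every chain length it covers, whereas a generic description of a single cycle needs only four reactions in total.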

[Table 8. Columns: EHMN, Recon 1, HumanCyc, KEGG, Reactome. Rows: the four steps of a cycle; the repetitive process (cycle described multiple times; degradation described for a complete range of fatty acids; every step of the complete degradation of fatty acids included in the database); mitochondrial versus peroxisomal pathway (differences; similarities); chain length specificity of enzymes. The check marks, crosses, and bars are not recoverable.]

Table 8 – Representational challenges fatty acid beta oxidation. Check mark: concept is represented. Cross: concept is not represented. Bar: not for all cases the concept is represented. (a) Not all possible fatty acids are specified as instances for each of the generic metabolites. (b) By using generic metabolites to represent one cycle all possibilities are captured, but not all possible specific instances can be deduced (Figure 4). (c) For the mitochondrial pathway each step of all cycles is described. For the peroxisomal pathway eight of the nine cycles are summarized in a single reaction (‘BlackBoxEvent’).

Representation Each database has its own strengths and limitations in representing fatty acid beta oxidation (Table 8). KEGG and HumanCyc make no distinction between the peroxisomal and mitochondrial pathway, which emphasizes the similarities, but disregards the differences. For KEGG it is difficult to separate the two pathways, since KEGG does not provide information on compartments. Moreover, as mentioned above, pathways in KEGG are not species-specific and the distinction between the peroxisomal and the mitochondrial pathway does not hold for the majority of organisms described in KEGG. In HumanCyc the repetitive nature of this metabolic process is captured by describing a single cycle, using generic metabolites to describe the oxidation for the complete range of saturated fatty acids. Furthermore, a ‘polymerization link’ explicitly connects the fatty acid before and after the removal of the 2-carbon acetyl-CoA (Figure 4). Which combinations of instances of the generic metabolites together form the specific reactions is, however, not explicitly indicated. Moreover, in this particular example, several of the required metabolite instances for each of the four steps are lacking. The chain length specificity of the enzymes is also not captured in HumanCyc. Reactome chooses to represent the repeating cycles of the mitochondrial pathway as separate subpathways and to describe each step of every cycle. In this way, Reactome is able to represent the chain length specificity of each enzyme. Note, though, that all four steps in the peroxisomal pathway are described for one cycle only; the next eight cycles are lumped into a single step. No mechanism is provided that allows users to retrieve the intermediate steps of these eight cycles. Similarly, in Recon 1 the repetitive nature of this process is not captured as the conversion of, for example, palmitoyl-CoA into octanoyl-CoA is described in a single step instead of 16 steps. Also, the four steps of a single cycle are not described for any of the fatty acids.
This decision is indicated in the comment field, but the intermediate steps cannot be retrieved. Describing each step is, however, important to be able to simulate enzyme deficiencies that lead to the abnormal build up of intermediate products of a cycle of beta oxidation (Das et al, 2006; Molven et al, 2004; Wanders et al, 1992; Wanders et al, 1990). In EHMN and KEGG every step is described for fatty acids with a chain of length 16 or shorter, but not for those with a longer chain length. A disadvantage of describing every step is that, in contrast to the generic approach of HumanCyc, it requires that each of the highly similar steps be specified separately for the whole range of possible fatty acids. Moreover, the intermediate steps are not of interest for all applications. A mechanism that allows database users to switch between a high-level and a more detailed representation would be preferred.

Figure 4 – Fatty acid beta oxidation in HumanCyc. Part of the ‘fatty acid β-oxidation I’ pathway in HumanCyc, only the reactions of the cycle itself are shown.

Challenges

Within a database It remains a challenge to capture every detail of the knowledge on the human metabolic network, both for relatively straightforward processes like gluconeogenesis (Figure 1) and the more complex processes illustrated by the three case studies. On the other hand, the question of which level of detail is required to be able to perform a wide range of possible computational analyses does not always have a clear-cut answer. For example, to represent a (de)polymerization process it is not always an option nor always strictly necessary to specify each step. The degradation of glycogen, for instance, consists of thousands of steps (Berg et al, 2012, pp. 637-661), but not each intermediate product might be of interest. Furthermore, it is also important for users to have some degree of abstraction, such as the ontologies provided by some databases, to see how everything fits in the bigger picture.

Unstructured text fields in the databases, which cannot be easily used in computational analyses, frequently contain more detailed information, such as the tissue-specificity of an enzyme. As argued by Khatri et al. (2012), tissue- and cell-specific information is essential to improve the accuracy and relevance of pathway analyses. In the Biological Connection Markup Language format (Beltrame et al, 2011) that was recently proposed, this information can be stored, but this format has not yet been adopted by the major databases. Ultimately, there are even more factors to consider to accurately capture the complete (human) physiology in a digital format. This includes the inherent dynamics of metabolic processes and the multiple levels at which these processes are being controlled. The complexity of metabolism is further increased by its multi-scale nature, ranging from cellular compartments and cells to organs. These issues are not (yet) addressed by the pathway databases selected in this review. However, several large-scale projects have taken up this challenge. For example, the goal of the Virtual Liver project (http://www.virtual-liver.de) (Holzhütter et al, 2012) is to construct a multi-scale representation of liver physiology.

The ability to indicate that a piece of knowledge is missing is another desirable characteristic of a pathway database. This is, however, not yet done in each database. A further extension of the pathway databases is to not only provide affirmative evidence as done by HumanCyc and Recon 1, but to also indicate ‘negative evidence’ such as a statement that a reaction cannot take place in human. This would enable users to distinguish such cases from knowledge gaps. This information is highly valuable, especially given that the (human) metabolic network is not yet complete. Note that although we focussed on the human network, most observations also hold for the other organisms the five databases describe (when applicable).

Across databases Efforts to reconcile different descriptions of the metabolic network for a specific species are hampered by the representation differences discussed in this paper (Chindelevitch et al, 2012; Herrgård et al, 2008; Radrich et al, 2010; Thiele et al, 2011). Similar problems arise in analyses that require the metabolic network of multiple organisms from different databases, e.g., to study evolution (Tanaka et al, 2006). For these purposes it is essential to be aware of which differences could be caused by a difference in representation rather than a true difference in opinion on the underlying biology. One example is the number of steps in which a process is described (see Case Study and Supplementary Text S2). Furthermore, it is important to realize that even the smallest difference in terminology and the definition of a concept needs to be accounted for.

To simplify the exchange of knowledge between databases several standards have been proposed such as SBML and BioPAX, each with their own advantages (Strömbäck and Lambrix, 2005). However, these do not resolve all representational differences that we discussed. HumanCyc and Reactome provide their network in the BioPAX (level 3) format (Demir et al, 2010). They, however, followed their own representation when converting their data into this format. For example, in Reactome a reversible reaction is stored separately in both directions in their own data model and also in their BioPAX file. In HumanCyc’s data model a reversible reaction is stored only once and its direction is indicated as ‘reversible’. The same is done in their BioPAX file. Semantic standards like BioPAX also cannot enforce the level of detail in which a process needs to be described by a curator or how a pathway is defined. Curators will, therefore, need to adhere to strict guidelines to make these exchange formats more easily comparable. Alternatively, rules could be formulated to translate one representation into the other. Based on Figure 3, for example, one could develop more precise rules to translate the different ways of representing the relations between gene products. The results of our comparison and the accompanying overviews provide useful insights into the road ahead to further simplify integration of the knowledge contained in pathway databases. Integration will also enable the construction of a more accurate in silico representation of metabolic networks. Endeavors in this direction have already been undertaken for multiple organisms (Herrgård et al, 2008; Thiele et al, 2011), including human (Thiele et al, submitted).
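A translation rule of the kind mentioned above can be sketched for the reversibility example: reactions stored separately in both directions are collapsed into a single record flagged as reversible. The record layout used here, a pair of frozensets plus a Boolean flag, is a simplification assumed for this sketch and does not correspond to the actual BioPAX classes or to the data model of either database.

```python
def merge_directions(reactions):
    """Collapse forward/backward pairs into single reversible reactions.

    `reactions` is a list of (substrates, products) pairs of frozensets, each
    describing one stored direction. Returns (substrates, products, reversible)
    triples, keeping the first stored direction of each pair.
    """
    stored = set(reactions)
    merged, done = [], set()
    for substrates, products in reactions:
        if (substrates, products) in done:
            continue
        reverse = (products, substrates)
        reversible = reverse in stored
        merged.append((substrates, products, reversible))
        done.add((substrates, products))
        if reversible:
            done.add(reverse)  # skip the opposite direction later on
    return merged

# Toy input with hypothetical metabolites: A <-> B stored twice, C -> D once.
A, B, C, D = (frozenset({name}) for name in "ABCD")
two_directional = [(A, B), (B, A), (C, D)]
single_record = merge_directions(two_directional)
print(single_record)
```

The reverse translation is equally mechanical, but note that the merged record keeps whichever direction happened to be stored first, so the physiological direction still has to be taken from elsewhere.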

Discussion

The five pathway databases have each made different decisions on how and in what detail to represent the metabolic network. Of these five databases EHMN provides the least detailed information and HumanCyc the most (Table 2). At the same time not every aspect of the data model of HumanCyc is used yet. Filling in every detail will likely require a lot of time and effort from the curators. Moreover, the lack of knowledge on human metabolism may currently preclude the use of every feature of HumanCyc’s data model. It also depends on the application at hand which aspects of the metabolic network are important and in what detail they need to be represented in a structured format. The overviews given in this paper provide insight into the differences between the databases and can help to make a well-informed decision on which database to use. The different choices the database developers have made each have their own advantages and disadvantages. For example, to perform in silico simulations and predict metabolic phenotypes, Recon 1 may be preferred. This network is fully compartmentalized and fully mass and charge balanced. Also the relation between gene products is provided, which is important for simulating the effect of, for example, gene defects. On the other hand, for pathway enrichment analyses (Antonov et al, 2008) each of the five databases may suffice and the determining factor is the pathway definition used. Factors other than the representation of the network will also play a role in selecting the database that best fits the application, such as the coverage of the metabolic network (Stobbe et al, 2011), e.g., EHMN contains a more extensive description of lipid metabolism than other databases. Finally, retrieving and capturing every detail of the (human) metabolic network in a digital format is a huge challenge and will require a joint effort of a broad scientific community.

Acknowledgements

We would like to thank the reviewers for their helpful comments and suggestions for improving the presentation and comprehensibility of the paper. This research was carried out within the BioRange programme (project SP1.2.4) of The Netherlands Bioinformatics Centre (NBIC; http://www.nbic.nl), supported by a BSIK grant through The Netherlands Genomics Initiative (NGI) and within the research programme of the Netherlands Consortium for Systems Biology (NCSB), which is part of the Netherlands Genomics Initiative / Netherlands Organization for Scientific Research.

Supplementary material

Supplementary Figure S1 – Metabolite hierarchy HumanCyc and KEGG

A) HumanCyc: The different levels of the metabolite hierarchy ending at α-D-glucose. B) KEGG: The hierarchy has four levels and stops at D-glucose, in contrast to HumanCyc (A).

Supplementary Figure S2 – Sets used in Reactome

Sets are used to group metabolites undergoing the same conversion. The relation between the two sets is not explicitly indicated. Consequently, deducing the correct specific reactions is dependent on the members of the sets being in the same order.

Supplementary Table S1 – Information sources used

Database | Data files analyzed | Online manuals | Articles of interest
EHMN | Excel file | --- | (Hao et al, 2010; Ma et al, 2007)
Recon 1 | Tab-delimited files, SBML | --- | (Duarte et al, 2007; Schellenberger et al, 2010; Thiele and Palsson, 2010b)
HumanCyc | Pathway Tools (export flat files), BioPAX file | User’s guide available via the Pathway Tools program; Curator’s guide: http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf | (Green and Karp, 2006; Karp et al, 2005; Karp et al, 2004; Latendresse et al, 2012; Romero et al, 2004; Zhang et al, 2005)
KEGG | KGML, flat files | http://www.kegg.jp/kegg/xml/docs/ | (Kanehisa et al, 2010; Kanehisa et al, 2012)
Reactome | MySQL dump, BioPAX file | http://wiki.reactome.org/index.php/Main_Page | (Croft et al, 2011; Vastrik et al, 2007)

Supplementary Table S2 – Identifier types

Metabolite databases
EHMN | KEGG Compound, KEGG Glycan
Recon 1 | KEGG Compound, CAS
HumanCyc | KEGG Compound, ChEBI, PubChem Compound, CAS, InChI, SMILES, NCI, UM-BBD-CPD, KNApSAcK, Wikipedia
KEGG | KEGG Compound, KEGG Glycan, ChEBI, PubChem Substance, CAS, InChI, 3DMET, KNApSAcK, NIKKAJI, LIPIDMAPS, LIPIDBANK
Reactome | KEGG Compound, ChEBI, PubChem Substance

Protein databases
EHMN | UniProt
Recon 1 | ---
HumanCyc | NCBI Protein, UniProt
KEGG | HPRD, NCBI-GI, UniProt
Reactome | RefSeq, UniProt

Gene databases
EHMN | Entrez Gene
Recon 1 | Entrez Gene
HumanCyc | Entrez Gene, Ensembl Gene, UCSC, UniGene, Entrez, Genecards, RefSeq_NM
KEGG | KEGG Gene ID, NCBI-GeneID (=Entrez Gene), Ensembl Gene, HGNC
Reactome | Entrez Gene, Ensembl Gene, UCSC, KEGG Gene ID, BioGPS

Note that some identifiers are only present for a few proteins/genes/metabolites.

Supplementary Text S1 – Case Study: Oxidative phosphorylation

Biological background

Oxidative phosphorylation is a crucial biological process as it is the main source of ATP in human (Berg et al, 2012, pp. 543-584). The electrons from NADH and FADH2 are transferred to oxygen via a number of electron-transfer reactions using multiple transmembrane complexes in the mitochondrion. Via prosthetic groups electrons flow through the complexes. This causes protons to be pumped from the matrix side of the inner mitochondrial membrane to the cytosolic side of the membrane, creating a pH gradient and membrane potential that drive the synthesis of ATP.

Representation

In HumanCyc oxidative phosphorylation is not represented as a single pathway, but only as isolated reactions. In EHMN it is also not represented as a separate pathway; moreover, not all reactions are described. KEGG does represent oxidative phosphorylation as a pathway, but the reactions are only shown in the pathway diagram itself and cannot be retrieved in an automated way. All five databases only describe the details of oxidative phosphorylation, such as the role of the prosthetic groups and the change in pH gradient, in unstructured text or in a figure. The redox half-reactions are captured by the data model of HumanCyc, providing some more detail. For KEGG this process is one of the few cases in which the enzymes are indicated as complexes, while in HumanCyc, although possible in their data model, the complexes are not represented as such. Recon 1 lacks the ability to indicate the intermembrane space as the compartment for part of the metabolites. Instead, in Recon 1, the protons are indicated to be pumped into the cytosol. HumanCyc erroneously indicates the ‘periplasmic space’ and ‘inner membrane (sensu Gram-negative Bacteria)’ as compartments for some of the metabolites. In Reactome the compartments of the metabolites are correctly indicated. The enzymatic complexes are assigned to the mitochondrial inner membrane. Although this is true for components of the complexes, all of them are transmembrane enzymes with parts also in the mitochondrial matrix and the mitochondrial intermembrane space.

Table S1.1 - Representation challenges oxidative phosphorylation

Columns: EHMN, Recon 1, HumanCyc, KEGG, Reactome.
Rows: complexes (heteromer); redox half-reactions; pH gradient; role prosthetic group; metabolites in intermembrane mitochondrial space; transmembrane enzymes.
Cell entries (check marks and crosses) are not available in this text version.

Check mark: concept is represented. Cross: concept is not represented.

Supplementary Text S2 – Case Study: Pyruvate dehydrogenase reaction

Biological background

Under aerobic conditions pyruvate is converted to acetyl-CoA in three reaction steps, catalyzed by the pyruvate dehydrogenase complex (Figure S2.1A) (Berg et al, 2012, pp. 515-542). The first two steps are catalyzed by the pyruvate dehydrogenase component (E1), and the third step by dihydrolipoyl transacetylase (E2). Two more steps are needed to reactivate lipoamide, a prosthetic group, to enable another cycle. Both these steps are catalyzed by the third component dihydrolipoyl dehydrogenase (E3). Only acetyl-CoA, CO2 and NADH are released from the complex, all other (intermediate) products remain bound to the complex.

Representation

In each database, a different number of steps is used to describe the degradation of pyruvate (Figure S2.1). EHMN, HumanCyc and KEGG describe it in multiple steps and each component of the pyruvate dehydrogenase complex is linked separately to the reaction in which it is active. In EHMN the same process is even described three times in the same pathway with a different number of steps (Table S2.1, Figure S2.1C, Figure S2.1E, and Figure S2.1G). In HumanCyc there are also two alternative routes described, but these are not assigned to any pathway. Recon 1 and Reactome describe the degradation of pyruvate in one step and represent the catalyst as a complex. In the comment field of Recon 1 it is indicated that multiple reaction steps are lumped into one. In earlier releases (before version 32) of Reactome this process was presented in more detail, describing the process in five steps as in the student textbook ‘Biochemistry’ (Berg et al, 2012) and indicating the changes in the prosthetic groups of the complex. The active component of the complex, however, was also not indicated at that time. It is important to note that in this case the difference in the number of steps in which the databases describe this process is not a disagreement on the underlying biology, but a difference in representation. One could argue, however, that representing all intermediate products of the pyruvate dehydrogenase reaction as unbound is incorrect, as most remain attached to the enzyme complex.

 | EHMN | Recon 1 | HumanCyc | KEGG | Reactome
pathway assigned to | glycolysis and gluconeogenesis | glycolysis / gluconeogenesis | acetyl-CoA biosynthesis (from pyruvate); isolated reactions | glycolysis / gluconeogenesis; citrate cycle (TCA cycle); pyruvate metabolism | pyruvate metabolism network
number of steps process is described in | 4, 3 and 1 | 1 | within pathway: 4; isolated reactions: 3 and 1 | 4 | 1
reaction intermediate products represented as | unbound metabolites | not applicable | proteins | unbound metabolites (a) | not applicable
complex or separate components? | separate components | complex (heteromer) | separate components | separate components | complex (heteromer)
indicated which protein enzyme catalyzes which step? | partly, two steps are missing an enzyme | yes | no | yes | no
prosthetic groups indicated? | no | no | no | yes | yes

Table S2.1 - Differences in representation of the pyruvate dehydrogenase reaction
(a) The names of the metabolites and structure of three metabolites do suggest that they are attached to an enzyme.

Figure S2.1 - Pyruvate dehydrogenase reaction as represented in ‘Biochemistry’ (Berg et al, 2012, pp. 515-542) (A) and in each of the databases (B-G).


Improving the description of metabolic networks: the TCA cycle as example Miranda D. Stobbe1,4,*, Sander M. Houten3,*, Antoine H.C. van Kampen1,2,4,5, Ronald J.A. Wanders3, Perry D. Moerland1,4

1 Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, P.O. Box 22700, 1100 DE, Amsterdam, the Netherlands 2 Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH, Amsterdam, the Netherlands 3 Laboratory Genetic Metabolic Diseases, Departments of Clinical Chemistry and Pediatrics, Academic Medical Center, University of Amsterdam, P.O. Box 22700, 1100 DE, Amsterdam, the Netherlands 4 Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA, Nijmegen, the Netherlands 5 Netherlands Consortium for Systems Biology, University of Amsterdam, P.O. Box 94215, 1090 GE, Amsterdam, the Netherlands

*These authors contributed equally.

Accepted for publication in: FASEB Journal (May 21, 2012)

Abstract

To collect the ever increasing, yet scattered knowledge on metabolism, multiple pathway databases, like the Kyoto Encyclopedia of Genes and Genomes, have been created. A complete and accurate description of the metabolic network for human and other organisms is essential to foster new biological discoveries. Previous research has shown, however, that the level of agreement between pathway databases is surprisingly low. We investigated if the lack of consensus between databases can be explained by an inaccurate representation of the knowledge described in scientific literature. As an example, we focus on the well-known tricarboxylic acid (TCA) cycle and evaluated the description of this pathway as found in a comprehensive selection of ten human metabolic pathway databases. Remarkably, none of the descriptions given by these databases is entirely correct. Moreover, there is consensus on only three reactions. Mistakes in pathway databases might lead to the propagation of incorrect knowledge, misinterpretation of high-throughput molecular data, and poorly designed follow-up experiments. We provide an improved description of the TCA cycle via the community-curated database WikiPathways. We review various initiatives that aim to improve the description of the human metabolic network and discuss the importance of the active involvement of biological experts in these.

Introduction

Metabolism has been studied for decades and interest in this topic is going through a marked revival (DeBerardinis and Thompson, 2012; Hanahan and Weinberg, 2011). To collect our ever increasing, but scattered knowledge on metabolism, pathway databases, like the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al, 2012), have been created. The number of pathway databases describing the metabolic network of one or more organisms is growing rapidly (Karp and Caspi, 2011; Oberhardt et al, 2009). For many organisms there are even multiple databases available describing their metabolic network. These databases provide a holistic view of the metabolic network (Oberhardt et al, 2009) and are routinely used to provide context for the analysis and interpretation of high-throughput molecular data (Rey et al, 2011). In silico models of the metabolic network can also be used to generate experimentally verifiable hypotheses, such as potential drug targets, or to simulate the effect of network perturbations, such as loss of function. Recent research has shown, however, that the level of agreement between the metabolic network descriptions of the same organism given by the various pathway databases is surprisingly low (Herrgård et al, 2008; Radrich et al, 2010; Stobbe et al, 2011). For example, five pathway databases that each describe the human metabolic network were shown to agree on only 199 (about 3%) of the close to 7,000 reactions they have combined (Stobbe et al, 2011). Databases differ in the way they retrieved information to build the metabolic network and in the way the network is curated. For example, Homo sapiens Recon 1 (Duarte et al, 2007) was built by first automatically generating a preliminary network based on genome annotation and the reactions from KEGG (Kanehisa et al, 2012). Next, the network was manually refined using literature and computer simulations. In contrast, Reactome (Croft et al, 2011) takes an incremental approach, regularly adding new parts to its network, which are curated by selected experts and peer-reviewed.
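The agreement figure quoted above (199 out of close to 7,000 reactions, i.e. 199 / 7,000 ≈ 0.028) is in essence the size of the intersection of the reaction sets divided by the size of their union. The following minimal Python sketch shows that calculation on invented toy data; the database names, the reaction labels and the function `consensus` are hypothetical, and the sketch assumes that identical reactions already carry identical labels, which in practice is the difficult part of any such comparison.

```python
def consensus(networks):
    """Return (#shared reactions, #reactions in the union, shared fraction)."""
    reaction_sets = [set(reactions) for reactions in networks.values()]
    shared = set.intersection(*reaction_sets)  # present in every database
    union = set.union(*reaction_sets)          # present in at least one database
    return len(shared), len(union), len(shared) / len(union)


# Toy example with three hypothetical databases.
toy_networks = {
    "db_a": {"R1", "R2", "R3"},
    "db_b": {"R2", "R3", "R4"},
    "db_c": {"R3", "R5"},
}

n_shared, n_union, fraction = consensus(toy_networks)
```

In this toy example only `R3` is shared by all three databases, so one of five reactions (20%) is agreed upon.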

Here, we discuss to what extent this lack of consensus between the databases is caused by an inaccurate representation of the knowledge described in scientific literature. To answer this question, we chose, as an example, the tricarboxylic acid (TCA) cycle, a well-studied pathway that has been the subject of extensive research ever since its discovery by Hans Krebs in 1937 (Krebs and Johnson, 1937). Furthermore, this pathway can be found in virtually every student textbook on biochemistry. One would therefore expect the description of the TCA cycle in pathway databases to be highly accurate and, hence, the level of agreement between these databases to be high.

89 Chapter 4

We evaluated the description of this pathway as found in a comprehensive selection of ten public human metabolic pathway databases (Table 1). Although pathway databases have certainly proven their value in a wide range of applications, we show, based on a thorough review of the literature, that none of the selected ten descriptions is entirely correct. Using the knowledge contained in the ten databases and additional scientific literature, we provide an improved description of the TCA cycle. The observations made for the TCA cycle are not unique to this pathway, but are also valid for the entire metabolic network. To further improve upon the description of the entire (human) metabolic network, various initiatives have been implemented to which the community of biological experts can contribute. We conclude by outlining some of these initiatives and discuss the challenges ahead.

Results

We retrieved the descriptions of the TCA cycle from a comprehensive set of ten databases (Table 1) and compared the descriptions to each other to identify differences. Figure 1 displays the union of all reactions from the ten databases and shows that there is consensus on only three reactions and two of the corresponding catalysts. The number of steps in which a conversion is described, such as the formation of D-threo-isocitrate from citrate, is one explanation of a difference between the databases. Next, two experts in the field of metabolism (SH and RW) compared the knowledge described in the literature with the descriptions from the ten pathway databases. Relevant literature was extracted from Medline based on MeSH terms and keywords related to the TCA cycle. We observed that many inconsistencies in the databases are explained by an inaccurate representation of the knowledge described in scientific literature (Tables 2 and 3). In some cases, conclusive evidence in literature was lacking, referred to henceforth as ‘unconfirmed’.
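For the union shown in Figure 1, the legend states that the direction of an arrow follows the majority of the databases and that a tie is resolved as irreversible. A minimal Python sketch of such a voting rule is given below; the function name and the direction labels are hypothetical and only serve to illustrate the rule, not to reproduce the procedure actually used to draw the figure.

```python
from collections import Counter


def consensus_direction(votes):
    """Majority vote over direction labels; a tie for first place gives 'irreversible'."""
    if not votes:
        raise ValueError("at least one database must describe the reaction")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "irreversible"  # tie-breaking rule from the legend of Figure 1
    return ranked[0][0]


majority = consensus_direction(["reversible", "reversible", "irreversible"])
tie = consensus_direction(["reversible", "irreversible"])
```

Here `majority` evaluates to `"reversible"` and `tie` to `"irreversible"`.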

Database | Version | URL for TCA cycle pathway
BioCarta | March 2001 | http://www.biocarta.com/pathfiles/krebPathway.asp
EHMN | 2 | http://www.ehmn.bioinformatics.ed.ac.uk/?se=rea&sefor=rea&seterm=&pa=17
H. sapiens Recon 1 | 1 | http://bigg.ucsd.edu/
HumanCyc | 15.1 | http://humancyc.org/human/new-image?type=PATHWAY&object=PWY-5690
INOH | 4.0 | http://www.inoh.org/download.html#MetabolicPathway
KEGG | 59 | http://www.genome.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00020
Panther | 2.1 | http://www.pantherdb.org/pathway/pathwayDiagram.jsp?catAccession=P00051
Reactome | 37 | http://www.reactome.org/entitylevelview/PathwayBrowser.html#DB=test_reactome_37&FOCUS_SPECIES_ID=48887&FOCUS_PATHWAY_ID=71406&ID=71403&
SMPDB | 1.0 | http://pathman.smpdb.ca/pathways/SMP00057/pathway
UniPathway | 2010_05 | http://www.grenoble.prabi.fr/obiwarehouse/unipathway/upa?upid=UPA00223&oscode=HUMAN

Table 1 – Pathway databases. Ten pathway databases from which we retrieved the description of the TCA cycle. Where possible, a direct link to the TCA cycle is provided.

Below we discuss the outcome of our literature study and the inconsistencies observed in the databases in more detail (Figure 2, capital letters below refer to the different panels). The resulting improved description of the TCA cycle is illustrated in the blue boxes. This description is based on literature, the knowledge captured by the ten databases combined and our own expertise. In contrast to some databases, all reactions in our description are mass and charge balanced. In addition, we determined the protonation state of the metabolites at the pH level of the mitochondrion, which is estimated to be between 7.8 and 8.0 (Hoek et al, 1980; Llopis et al, 1998; Porcelli et al, 2005).
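What it means for a reaction to be mass and charge balanced can be checked mechanically. The Python sketch below does so for the hydration of fumarate to (S)-malate. It is a sketch under stated assumptions: the function `is_balanced` is hypothetical, and both dicarboxylic acids are assumed to be present as dianions (C4H2O4 with charge -2 and C4H4O5 with charge -2) at the mitochondrial pH mentioned above.

```python
from collections import Counter


def is_balanced(substrates, products):
    """True if the element counts and the total charge are equal on both sides."""
    def totals(side):
        elements, charge = Counter(), 0
        for formula, species_charge in side:
            elements.update(formula)  # adds the element counts of this species
            charge += species_charge
        return elements, charge

    return totals(substrates) == totals(products)


# Species as (element counts, charge); the protonation states are assumptions.
fumarate = ({"C": 4, "H": 2, "O": 4}, -2)
water = ({"H": 2, "O": 1}, 0)
malate = ({"C": 4, "H": 4, "O": 5}, -2)

balanced = is_balanced([fumarate, water], [malate])
unbalanced = is_balanced([fumarate], [malate])  # water omitted on purpose
```

With water included the reaction is balanced; leaving it out makes the check fail on hydrogen and oxygen.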

Citrate synthase (A)

The TCA cycle starts with the condensation of oxaloacetate and acetyl-CoA by the enzyme citrate synthase. Overall, the different databases have a high degree of agreement with respect to this reaction. Some disagreement exists over the reversibility of this reaction. Evidence in the literature shows that in rat liver mitochondria, radiolabeled citrate did not label products of mitochondrial acetyl-CoA metabolism. This shows that the citrate synthase reaction is irreversible in vivo (Greksak et al, 1982).

Aconitase (B)

In this reversible reaction citrate is converted into D-threo-isocitrate. The reaction proceeds via the intermediate cis-aconitate. Since cis-aconitate is not the final product of this reaction, the necessity of adding it to the model could be questioned. Indeed some databases do not include this intermediate step. We argue that an intermediate should be included in a model when there is evidence available that the metabolite is a true intermediate that can be accurately measured enabling the use of the concentration of this metabolite in mathematical models (Hoppe et al, 2007). A second reason to include the intermediate is to be able to model the accumulation of the metabolite under pathologic conditions such as an inborn error of metabolism. Cis-aconitate meets the first criterion, as it is readily measured as a product of the aconitase reaction (Krebs and Holzach, 1952), but also in body fluids using organic acid analysis (Lawson et al, 1976). Therefore, we decided to include cis-aconitate in our description.

Database | Enzyme and encoding gene(s) missing (a): n | Fig. 2 | Enzyme and encoding gene(s) incorrect: n | Fig. 2 | Enzyme and encoding gene(s) unconfirmed (a): n | Fig. 2 | Complex not indicated (a): n | Fig. 2
BioCarta | 13 | c (3x), d*, e (2x), f (2x), g, h (3x), j | 3 | c, f, j | 0 | - | 5 | c, e, f, g, h
EHMN | 1 | g | 3 | b, d, j | 1 | e | 5 | c, e, f, g, h
H. sapiens Recon 1 | 0 | - | 5 | b (2x), d, e, j | 1 | j | 0 | -
HumanCyc | 3 | d*, f* (2x) | 3 | a, b, j | 2 | e (2x) | 1 | f*
INOH | 2 | d*, g | 2 | b, j | 0 | - | 4 | c, e, f, g
KEGG | 0 | - | 5 | b, d, f, g, j | 1 | e | 5 | c, e, f, g, h
Panther | 12 | c (3x), d*, e (2x), f, g* (2x), h (2x), j | 3 | c (2x), j | 3 | e (2x), j | 5 | c, e, f, g*, h
Reactome | 0 | - | 2 | f, g | 0 | - | 0 | -
SMPDB | 3 | d*, g* (2x) | 0 | - | 0 | - | 1 | g*
UniPathway | 13 | c (3x), d, e (3x), f* (2x), g, h (2x), j | 0 | - | 0 | - | 5 | c, e, f*, g, h

Table 2 – Overview of inconsistencies per pathway database: enzymes and encoding genes. For each database the number of times (n) a specific inconsistency was found is indicated and, whenever possible, linked to a specific panel of Figure 2 via letters a-j. (a) Excluding reactions not considered to be part of the TCA cycle in our description. * Enzyme (complex) is missing because the indicated reaction is not described in the database.

Database | Reaction not part of the TCA cycle: n | Reaction missing: n | Fig. 2 | Reaction direction incorrect (a): n | Fig. 2 | Inclusion of enzyme-bound intermediates: n | Fig. 2
BioCarta | 0 | 1 | d | 0 | - | 1 | h
EHMN | 16 | 0 | - | 4 | a, d, e, h | 1 | e
H. sapiens Recon 1 | 2 | 0 | - | 0 | - | 1 | h
HumanCyc | 0 | 2 | d, f | 2 | a, b | 0 | -
INOH | 3 | 1 | d | 2 | c, h | 1 | e
KEGG | 8 | 0 | - | 3 | a, c, h | 1 | e
Panther | 1 | 2 | d, g | 4 | b, f, i, j | 1 | c
Reactome | 1 | 0 | - | 3 | d, f, g | 1 | h
SMPDB | 2 | 2 | d, g | 0 | - | 1 | h
UniPathway | 7 | 1 | f | 5 | b, d, g, i, j | 0 | -

Table 3 – Overview of mistakes per pathway database: reactions. For each database the number of times (n) a specific inconsistency was found is indicated and, whenever possible, linked to a specific panel of Figure 2. (a) Excluding reactions not considered to be part of the TCA cycle in our description.

Figure 1 – Union of the descriptions of the TCA cycle given by ten pathway databases. Overview showing the reactions and genes annotated to play a role in the TCA cycle by the ten databases. The transport reactions found in the EHMN database (Hao et al, 2010) were excluded. The main metabolites of the TCA cycle are indicated by white rectangles. All genes linked to a reaction are combined in a box with grey borders. Colors indicate the level of agreement on a reaction and on the gene(s). The direction of an arrow is determined by what is indicated by the majority of the databases. In case of a tie irreversible was chosen. The letters (A-J) refer to the reactions described in Figure 2.

Isocitrate dehydrogenase (C and D)

The biochemistry associated with the reactions catalyzed by the two mitochondrial isocitrate dehydrogenase enzymes IDH2 and IDH3 is complex, which may explain some of the discrepancies between databases. The main difference between the IDH2 and IDH3 enzymes is at the level of the electron acceptor, with the latter using NAD (Figure 2C) and the former NADP (Figure 2D). Most databases agree on this except two that incorrectly assign IDH2 to the NAD-dependent variant. More confusion exists over the direction of the reactions catalyzed by the IDH enzymes, i.e., forward (oxidative decarboxylation) and/or backward (reductive carboxylation). IDH3 is allosterically regulated by positive (Ca2+, ADP, citrate) and negative modifiers (NADH, NADPH, ATP) (Gabriel et al, 1986), which is consistent with its activity in the oxidative direction of the TCA cycle. Most databases agree on this direction. Much more controversy exists on the NADP-dependent reaction catalyzed by the IDH2 enzyme. Only half of the databases include the NADP-dependent reaction. Furthermore, only one database, H. sapiens Recon 1, mentions that this reaction is reversible, while the others claim that the NADP-dependent reaction operates in the oxidative direction of the TCA cycle. Biochemical evidence, however, indicates that the IDH2 enzyme operates in the (reverse) reductive direction, synthesizing D-threo-isocitrate. This is facilitated by the virtually fully reduced mitochondrial NADPH/NADP-redox state caused by the action of the nicotinamide nucleotide transhydrogenase, which is driven by the proton gradient across the mitochondrial membrane (Hoek and Rydström, 1988). Only under abnormal conditions, such as limiting substrate supply or hypoxia, which are characterized by a low proton electrochemical gradient, could this reaction theoretically proceed in the (forward) oxidative direction, but conclusive biochemical evidence is lacking. In contrast, it has been shown that in normoxic, but also hypoxic cell lines, the IDH2 enzyme is crucial for the reductive reaction to convert glutamine via 2-oxoglutarate into D-threo-isocitrate that is subsequently converted into citrate, which is exported to the cytosol where it is used for lipid synthesis (Mullen et al, 2012; Wise et al, 2011).
With IDH2 and IDH3 operating in opposite directions, they form a substrate cycle that has been speculated to contribute to the fine regulation of the TCA cycle (Sazanov and Jackson, 1994). Indeed, multiple in vivo studies have established that this substrate cycle takes place in liver (Des Rosiers et al, 1994) and heart (Comte et al, 2002), but although biochemically plausible, it is not possible to conclusively attribute the reverse reaction to the IDH2 enzyme in the type of experiment that was done. In spite of this biochemical evidence, it was recently suggested that IDH2 serves as the main enzyme in the oxidative direction (Hartong et al, 2008). This was based on an observation in two patients with mutations in the β-subunit of the IDH3 complex presenting with retinitis pigmentosa and no other disease phenotypes that point to general TCA cycle dysfunction. Although interesting, this genetic finding does not prove that the IDH2 enzyme functions in the oxidative direction. Further biochemical studies to address the role of the IDH2 enzyme were not reported, nor studies addressing a more likely compensating role for the cytosolic IDH1 enzyme. In the latter scenario, D-threo-isocitrate would be transported to the cytosol and converted to 2-oxoglutarate using NADP, which is readily available in the cytosol. Next, 2-oxoglutarate is transported back to the mitochondrion. For our description, we have decided to include only the reductive direction for the IDH2 reaction because, as explained above, the oxidative direction does not take place under normal conditions. The role of IDH2 in the TCA cycle under pathophysiological conditions is unclear and should be further investigated. We therefore indicated the possibility that the IDH2 reaction is reversible as ‘unconfirmed’ in Figure 2D.

Databases also differ on including oxalosuccinate as an intermediate in the IDH reactions. In vitro studies have shown that the IDH2 enzyme can use oxalosuccinate as a substrate for reduction to D-threo-isocitrate as well as decarboxylation to 2-oxoglutarate. Although the catalytic mechanism that was inferred from the crystal structure of IDH2 enzyme shows that oxalosuccinate is an intermediate in the dehydrogenation of D-threo-isocitrate (Ceccarelli et al, 2002), there is strong evidence that oxalosuccinate is not a free intermediate (Siebert et al, 1957). Moreover, enzyme-bound oxalosuccinate was also not detected (Ramakrishna and Krishnaswamy, 1966). Most likely, the decarboxylation reaction proceeds very rapidly. Although oxalosuccinate is probably also a catalytic intermediate for the IDH3 reaction, the IDH3 enzyme complex does not accept oxalosuccinate as a substrate (Plaut and Sung, 1954). Since oxalosuccinate does not fulfill the two criteria we set (see section on aconitase), we have decided not to include oxalosuccinate as a TCA cycle intermediate in the reactions performed by IDH2 and IDH3.

2-Oxoglutarate dehydrogenase (E) The 2-oxoglutarate dehydrogenase complex performs a series of complicated reactions including oxidative decarboxylation, formation of CoA ester and reoxidation of a lipoamide cofactor for which three different subunits are necessary, commonly referred to as the E1, E2 and E3 subunits. The overall reaction is irreversible, which is driven by the decrease in free energy and the removal of the

CO2 generated in the first, E1-catalyzed step (Sheu and Blass, 1999). Some databases describe this as a single reaction, while others use multiple steps. Representing it as a single reaction has the disadvantage that it will be more difficult to indicate, which reaction step is catalyzed by the different subunits. Indeed, a specific inherited defect has been described in only the E3 subunit (Liu et al, 1993). Theoretically, this would not affect the E1 and E2 enzymes that initiate the 2-oxoglutarate dehydrogenase reaction. However, in practice the entire complex operates as one functional unit with all intermediary products being enzyme–bound. Moreover, the complete cycle of steps has to be completed before a new reaction can start, which probably explains

95 Chapter 4

Figure 2 – Improved description of the TCA cycle. Reaction-wise overview of our description of the TCA cycle based on literature and the ten databases evaluated. Blue colored boxes contain the correct gene(s), enzyme(s), and reaction. The numbers in square boxes indicate how many of the ten databases are in agreement with this description and their color indicates the level of agreement. Inconsistencies with the literature as found in any of the ten databases are shown in dark red outside the blue boxes. Genes and their products for which a role in the TCA cycle could neither be confirmed nor refuted by evidence from the literature are indicated in orange font. The same holds for the direction of reactions. Reactions in purple font differ in the number of steps the conversion is described in compared to our description.

[Figure panels not reproduced: E, 2-oxoglutarate dehydrogenase; F and G, succinyl-CoA synthetase (GDP/GTP- and ADP/ATP-dependent); H, succinate dehydrogenase; I, fumarate hydratase; J, malate dehydrogenase.]

why E3-deficient patients accumulate only the substrate 2-oxoglutarate. In line with the criteria mentioned above, we therefore describe the reaction in a single step.

The majority of the databases attribute the OGDH, DLST and DLD proteins to this reaction, although four databases do not indicate that the proteins act as a complex. Some databases (also) assign the DHTKD1, OGDHL and PDHX proteins. The first two are similar to the E1 subunit (OGDH) of the 2-oxoglutarate dehydrogenase complex. Both proteins have recently been further characterized using sequence comparison. OGDHL most likely represents a previously unknown isoform of the OGDH protein. Thus, OGDHL might play a role in the TCA cycle, but its expression levels are much lower than those of OGDH. More biochemical evidence is needed to elucidate the role of OGDHL. DHTKD1 may accommodate more polar and/or bulkier structural analogs of the 2-oxoglutarate metabolite (Bunik and Degtyarev, 2008). The DHTKD1 gene most likely encodes a dehydrogenase with a new function. For the PDHX protein there is no biochemical evidence that it is required for the formation of the 2-oxoglutarate dehydrogenase complex (McCartney et al, 1998). Moreover, patients with PDHX deficiency have a selective pyruvate dehydrogenase deficiency (Aral et al, 1997).

Succinyl-CoA synthetase (F and G)

In this reversible reaction, an energy-rich CoA ester bond is cleaved, which was classically thought to be coupled to the formation of GTP (Figure 2F). In 1998, however, it was established that there is a second succinyl-CoA synthetase, which produces ATP rather than GTP (Figure 2G) (Johnson et al, 1998). Some databases also give a third purine nucleotide as a product, ITP (Figure 1). Although IDP is a substrate for this enzyme in vitro, it is very unlikely to play a role in vivo. The concentrations of IDP and ITP are very low compared to those of the other nucleotides, and both are considered byproducts of purine nucleotide metabolism (Bierau et al, 2007).

The two succinyl-CoA synthetase enzymes are dimers with one common alpha subunit (SUCLG1) and a beta subunit that confers nucleotide specificity: SUCLG2 for the GTP-specific isozyme and SUCLA2 for the ATP-specific isozyme. Although widely expressed, the relative amounts of these two subunits vary from tissue to tissue. SUCLA2 is highly expressed in testis, brain, heart, and kidney (Lambeth et al, 2004). SUCLG2 is expressed in liver, kidney, and heart, but barely detected in brain and testis (Lambeth et al, 2004). These two complexes are often incorrectly represented in the different databases. Three types of mistakes are made: (i) not all components are indicated, (ii) all three proteins mentioned are assigned to both reactions, (iii) it is not described that the proteins form a complex.


The correct representation of these complexes is important to be able to understand the effect of deficiencies in the SUCLA2 and/or SUCLG1 proteins, as the biochemical consequences differ. In a SUCLG1 deficiency, both the GTP- and ATP-specific isozymes are affected, whereas in a SUCLA2 deficiency only the ATP-specific isozyme is deficient (Ostergaard et al, 2007). Consequently, in the former both reactions would be affected, while in the latter only the ADP/ATP-dependent reaction is influenced.
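The difference between the two deficiencies can be illustrated with a minimal Python sketch. It assumes a simple rule in which a reaction is affected as soon as one subunit of its catalyzing complex is deficient; the reaction labels and the function name are illustrative and are not taken from any of the databases.

```python
# Illustrative sketch: each reaction is catalyzed by a complex that
# requires all of its subunits (assumption: simple AND rule).
COMPLEXES = {
    "GDP/GTP-dependent succinyl-CoA synthetase": {"SUCLG1", "SUCLG2"},
    "ADP/ATP-dependent succinyl-CoA synthetase": {"SUCLG1", "SUCLA2"},
}


def affected_reactions(deficient_genes):
    """Return the reactions whose complex contains a deficient subunit."""
    deficient = set(deficient_genes)
    return sorted(
        name for name, subunits in COMPLEXES.items() if subunits & deficient
    )


print(affected_reactions(["SUCLG1"]))  # both reactions are listed
print(affected_reactions(["SUCLA2"]))  # only the ADP/ATP-dependent reaction
```

If all three proteins were assigned to both reactions, as in mistake (ii) above, the same function would report both reactions as affected by a SUCLA2 deficiency, which is not what is observed.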

Succinate dehydrogenase (H)

The succinate dehydrogenase enzyme oxidizes succinate into fumarate and is also known as complex II of the respiratory chain. The enzyme is membrane associated and contains four different subunits, which are not all included by each database and are also not always indicated as forming a complex. Most biochemistry textbooks teach that electrons are transferred to FAD giving FADH2, which may explain the choice made by four of the databases. Succinate dehydrogenase, however, is a covalent flavoprotein; the FAD is therefore an enzyme-bound prosthetic group that cannot dissociate from the enzyme (Mewies et al, 1998). In fact, the electrons are contained in enzyme-bound FADH2 and further transferred into the electron transfer chain via ubiquinone-10, forming ubiquinol-10. Since it is ubiquinol-10 that dissociates from succinate dehydrogenase, we decided to describe these cosubstrates in the reaction.

Fumarate hydratase (I)

Fumarate hydratase catalyzes the reversible hydration of fumarate into (S)-malate. The highest level of agreement between databases is on this reaction. The only disagreement concerns the reversibility of this reaction. Although undoubtedly reversible, two databases give this reaction as unidirectional. In these two databases, however, all reactions are unidirectional, while in the other eight databases the information on the reversibility of a reaction is provided.

Malate dehydrogenase (J)

Malate dehydrogenase completes the TCA cycle by converting (S)-malate into oxaloacetate. In the liver, malate dehydrogenase is shared between the TCA cycle and gluconeogenesis, illustrating that this reaction is reversible (Des Rosiers et al, 1995; Fernandez and Des Rosiers, 1995). There are two malate dehydrogenase enzymes, the cytosolic MDH1 and the mitochondrial MDH2. Some of the databases associate the MDH1 enzyme with the TCA cycle, whereas only the MDH2 enzyme can perform this role because of its mitochondrial localization (see below). Two databases also associate the MDH1B enzyme with this reaction. There is, however, no supporting evidence for this. Furthermore, the protein is most likely not localized to mitochondria.


Annotation of gene function

Our analysis brought to light several genes, found in one or more pathway databases, which are suggested to be involved in the TCA cycle, but for which no evidence exists in the literature. The availability of the complete human genome sequence allowed for the identification of new genes, without any functional characterization. Based on homology, the products of three genes, i.e., MDH1B, DHTKD1 and OGDHL, were annotated to play a role in the TCA cycle in several databases, but without formal biochemical proof. We therefore indicated these enzymes as ‘unconfirmed’ (Figure 2 and Table 2). Furthermore, in one of the databases, HumanCyc (Romero et al, 2004), the protein encoded by SLC35G3 is associated with the citrate synthase reaction, but there is no evidence at all for such a role. In another database, Reactome (Croft et al, 2011), pseudogenes (SUCLA2P1 and LOC283398) are linked to the succinyl-CoA synthetase reactions. For the products of ACO1, IREB2, IDH1 and MDH1, there is currently no evidence that they can be localized in the mitochondrion where the TCA cycle takes place. They all catalyze a reaction also found in the TCA cycle, but the proteins are localized in the cytosol (and also the peroxisome for the IDH1 protein). Although some of these databases mention that these reactions are compartmentalized to the cytosol, their annotated role in the TCA cycle is incorrect.

Links of the TCA cycle with other pathways

The TCA cycle is a hub in cellular metabolism connecting many different pathways. There are many associated transporters, and anaplerotic and cataplerotic reactions, supplying or removing the main metabolites of the TCA cycle. In defining the boundaries of this pathway for our description, we focused on the biochemical cycle itself, with no real starting substrate or end product (Berg et al, 2012, pp. 515-542). Some databases include reactions that transport TCA cycle intermediates and selected reactions associated with TCA cycle intermediates such as the pyruvate carboxylase, phosphoenolpyruvate carboxykinase and pyruvate dehydrogenase reactions. Pyruvate carboxylase is an example of an important anaplerotic reaction, which supplies the TCA cycle with oxaloacetate, and is therefore by definition not part of the cycle itself. The same holds for the cataplerotic phosphoenolpyruvate carboxykinase reaction. Finally, the pyruvate dehydrogenase reaction is often included because it generates the acetyl-CoA that is used for the synthesis of citrate in the first step of the TCA cycle, but it is not part of the cycle itself. Moreover, acetyl-CoA can also be produced from fatty acids and amino acids. Therefore we also left out the pyruvate dehydrogenase reaction, including all its associated regulating kinases and phosphatases.


Dissemination

One database, WikiPathways (Pico et al, 2008), was excluded from the analyses described above as it only provided the main metabolites of each reaction. Instead we made use of an important characteristic of this database, namely, that WikiPathways enables researchers to adapt the description of a pathway. Based on the results of our comparison and literature study, we refined the description of the TCA cycle in WikiPathways (http://www.wikipathways.org/index.php?title=Pathway:WP78&oldid=47741) and also added literature references. Our corrections can be viewed in detail by looking at the differences with an earlier description in WikiPathways using the Pathway Difference Viewer (Pico et al, 2008) (http://wikipathways.org/index.php/Help:Viewing_Pathways). Our improved description of the TCA cycle can be downloaded in various formats to allow for a broad dissemination of our results. The original description of WikiPathways is also shown in the Wikipedia entry on this pathway and therefore we requested an update.

Given the extensive amount of literature on the TCA cycle spanning decades of research, fully capturing current knowledge on this biochemically complex process remains a huge challenge. Consequently, we cannot exclude that our description still contains some inconsistencies. Moreover, as shown above, parts of the TCA cycle are the subject of ongoing research, which might lead to new insights. In line with the philosophy of WikiPathways, we therefore encourage others to refine our description.

Discussion

Metabolic pathway databases have proven highly valuable in a broad range of applications ranging from the analysis and visualization of high-throughput data (Antonov et al, 2008) to in silico predictions of phenotypes (Jerby et al, 2010). At the same time it is important to be aware, however, of possible limitations inherent to pathway databases. Based on the evaluation of ten descriptions of the TCA cycle we conclude that none of the selected pathway databases accurately represents the knowledge available in the literature on this pathway. In the UniPathway database (Morgat et al, 2012), for example, 13 enzymes are missing, while in KEGG five enzymes are incorrectly linked to one of the reactions of the TCA cycle (Table 2). Furthermore, we also observe a difference in how the boundaries of the TCA cycle are defined. For example, following our definition, 16 reactions of EHMN and 8 of KEGG (Table 3) would not be part of this pathway.


Our detailed analysis confirms our hypothesis that part of the lack of consensus can be explained by these inconsistencies and a partial coverage of the literature. Also in the description in ‘Biochemistry’ (Berg et al, 2012), one of the most popular student textbooks, we identified similar inconsistencies. Two reactions are incorrectly indicated to be reversible, i.e., the NAD-dependent IDH reaction and the 2-oxoglutarate dehydrogenase reaction. The GDP-dependent succinyl-CoA synthetase reaction is described as acting only in the opposite direction of the TCA cycle and not as reversible. Furthermore, the NADP-dependent IDH reaction is left out. Given that the TCA cycle is one of the best-known pathways, our results are surprising. However, our review also showed that the biochemistry of this pathway may not be as clear cut as one would expect, which explains some of the inconsistencies discovered. This is underscored by recent literature showing that this particular pathway is still actively studied. One example is the controversy surrounding the direction of the IDH reaction catalyzed by the IDH2 enzyme, despite the traditional biochemical evidence that is available. Lack of consensus may also be partly explained by a different judgment between curators on the strength of the evidence from the literature. For example, part of our evidence has been obtained in model organisms such as rat and mouse. Since core metabolic pathways such as the TCA cycle are broadly conserved across organisms, we considered evidence from mammals conclusive. Human-specific evidence is often lacking, but once it becomes available it can be added to the description of the TCA cycle on WikiPathways.

Our results are likely to extend to other pathways as well. For five of the eleven databases, we have previously shown that the lack of consensus translates to the entire human metabolic network (Stobbe et al, 2011). Moreover, in that analysis on network level various mistakes were found in other pathways as well. Overviews of all differences between these five databases can be retrieved via the web application called Consensus and Conflict Cards (C2Cards) (http://www.c2cards.nl). A C2Card provides a concise overview of what the databases do and do not agree on with respect to a single reaction, gene or EC number. This enables experts to more easily identify mistakes and to reconcile the descriptions for other pathways than the TCA cycle.

We expect that the issues encountered in our curation effort extend to other organisms. In fact, various analyses have already shown that also for other organisms there is a lack of consensus between the multiple descriptions of the metabolic network that are available (Chindelevitch et al, 2012; Herrgård et al, 2008; Radrich et al, 2010; Thiele et al, 2011). A complete and accurate description of the metabolic network for human and other organisms is essential to foster new biological discoveries. For example, the holistic view that a model of the metabolic network offers allows for the identification of gaps in our knowledge on human metabolism for which further experiments are required (Rolfsson et al, 2011). Furthermore, pathway databases are used more and more often as the primary knowledge resources on metabolism by biologists. Reliance on databases will continue to increase since the complexity of the high-throughput datasets that are handled increases as well. Mistakes in pathway databases are also propagated to other types of resources such as the Gene Ontology (GO) (Ashburner et al, 2000), STRING (Szklarczyk et al, 2011) and Wikipedia. According to GO, for example, the cytosolic proteins that are encoded by ACO1, IDH1 and MDH1 play a role in the TCA cycle. Pathways contained in WikiPathways are featured by Wikipedia as interactive pathway maps and, therefore, contain the same inconsistencies. For the TCA cycle we requested an update of the current Wikipedia entry to our description in WikiPathways. Incorrect information may lead to misinterpretation of high-throughput molecular data and to poorly designed follow-up experiments. Furthermore, a consequence of the many differences between the descriptions is that one could arrive at different conclusions for an analysis depending on which database one uses (Lee et al, 2008; Zelezniak et al, 2010).

Various initiatives are in place that aim to improve upon the current state of affairs and for which the support of a broad community is essential. One example of an initiative to improve upon an already existing description of the human metabolic network is the Reactome Portal (http://wikipathways.org/index.php/Portal:Reactome). Pathways from the Reactome database have been incorporated into WikiPathways and thus can be edited by everyone, thereby following the same principle as Wikipedia. Periodically, curators of Reactome evaluate the changes made and decide whether or not to include them in their centralized database, which cannot be edited by the public. An important difference with Wikipedia and a bottleneck for initiatives like the Reactome Portal is that the pool of experts knowledgeable enough to contribute is much smaller. Therefore, a larger percentage of the community needs to contribute to reach the necessary momentum to improve upon the current descriptions of the (human) metabolic network.

A second example of an endeavor to refine and reconcile current descriptions of the (human) metabolic network is the organization of reconstruction annotation jamborees (Herrgård et al, 2008; Thiele and Palsson, 2010a; Thiele et al, 2011). In a jamboree, experts from multiple disciplines, including biochemistry, molecular description of the metabolic network. Our improved description of the TCA cycle provides a nice example of the reconciliation of ten individual descriptions. The knowledge on the metabolic network will continue to expand and therefore the consensus network needs to be kept up-to-date, for example, by organizing subsequent jamborees. This requires the continuing commitment of experts.

Other initiatives focus on the accurate extraction of knowledge from the scientific literature. One of the explanations for the lack of consensus between the descriptions of the metabolic network is that pathway database curators have based their conclusions on a different set of articles and/or interpreted the literature differently (Mo and Palsson, 2009). Importantly, curators may not have interpreted the article as intended by the authors. The original authors of a novel scientific fact are generally not the ones that put their conclusions into a pathway database. For the TCA cycle, our literature study led to the reappraisal of the knowledge that the NAD/NADH and NADP/NADPH redox couples (Houtkooper et al, 2010) are widely different. It is, however, quite a challenge to oversee the huge volume of articles already available, which is further complicated by the changing nomenclature of enzymes and metabolites. Moreover, as the biomedical literature grows exponentially, it is impossible for curators of pathway databases to keep track of everything published on metabolism. To cope with this, various innovative ideas have been proposed. For example, authors could add semantic annotation to their article (Jensen and Bork, 2010), making the knowledge described more machine-readable. This would allow for an easier way to automatically retrieve and put new knowledge into a pathway database. Another approach is to provide the newly discovered facts also as nanopublications (Groth et al, 2010). A nanopublication is a traceable author statement, which consists of three parts: a statement, e.g., protein X (subject) catalyzes (predicate) reaction Y (object), conditions under which the statement holds, e.g., a specific compartment, and provenance of the statement, e.g., author and literature. By staying close to the source of the newly discovered fact on metabolism, misinterpretation of the article can be prevented. 
Moreover, an additional advantage is that it will significantly reduce the workload for curators of metabolic pathway databases when retrieving new knowledge. It remains a challenge, however, to convince experts to invest time and effort in improving the description of a metabolic network, as they may feel that they do not directly benefit from this. Therefore, the effort required should be kept to a minimum and the contributions of the expert should be clearly acknowledged. In the C2Cards application, for example, curation is done at the level of a single reaction or the metabolic functions of a single gene product. Nanopublications provide an explicit recognition of the knowledge contributed by the expert(s). Journals could also play a role by, for example, requiring authors to contribute their results to one of the discussed initiatives.
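The three-part structure of a nanopublication described above (statement, conditions, provenance) can be sketched as a small data structure. This is only an illustration under our own assumptions: the class and field names are hypothetical, the values are placeholders, and the sketch does not follow the actual serialization proposed by Groth et al (2010).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """A subject-predicate-object assertion."""
    subject: str
    predicate: str
    object: str


@dataclass(frozen=True)
class Nanopublication:
    """Hypothetical container for the three parts named in the text."""
    statement: Statement
    conditions: dict = field(default_factory=dict)   # e.g., compartment
    provenance: dict = field(default_factory=dict)   # e.g., author, literature


# Placeholder values only; no real protein, reaction or reference is implied.
example = Nanopublication(
    statement=Statement("protein X", "catalyzes", "reaction Y"),
    conditions={"compartment": "a specific compartment"},
    provenance={"author": "author of the article", "literature": "the article"},
)

print(example.statement.subject, example.statement.predicate, example.statement.object)
```

Keeping the statement separate from its conditions and provenance means a curator can import the assertion without having to reinterpret the article it came from.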

The active involvement of a broad community, across multiple disciplines within the field of biology, is essential to further improve the current description of the metabolic network of human and other organisms (Kitano et al, 2011). Via this article, we would, therefore, like to urge biologists to donate their knowledge and actively contribute to reach the ultimate goal of a complete and biologically accurate description of the (human) metabolic network.

Acknowledgements

This research was carried out within the BioRange programme (project SP1.2.4) of The Netherlands Bioinformatics Centre (NBIC; http://www.nbic.nl), supported by a BSIK grant through The Netherlands Genomics Initiative (NGI) and within the research programme of the Netherlands Consortium for Systems Biology (NCSB), which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research. Sander M. Houten was supported by the Netherlands Organization for Scientific Research (VIDI-grant No. 016.086.336).


Consensus and Conflict Cards for metabolic pathway databases

Miranda D Stobbe1,6, Morris A Swertz5,6, Ines Thiele3,4, Trebor Rengaw5,6, Antoine HC van Kampen1,2,6,7, Perry D Moerland1,6

1 Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, P.O. Box 22700, 1100 DE, Amsterdam, the Netherlands
2 Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Science Park 904, 1098 XH, Amsterdam, the Netherlands
3 Center for Systems Biology, University of Iceland, Sturlugata 8, 101 Reykjavik, Iceland
4 Faculty of Industrial Engineering, Mechanical Engineering & Computer Science, University of Iceland, Sturlugata 8, 101 Reykjavik, Iceland
5 Genomics Coordination Center, University Medical Center Groningen & University of Groningen, P.O. Box 30001, 9700 RB Groningen, the Netherlands
6 Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA, Nijmegen, the Netherlands
7 Netherlands Consortium for Systems Biology, University of Amsterdam, P.O. Box 94215, 1090 GE, Amsterdam, the Netherlands

Submitted to BMC Systems Biology

Abstract

Background

The metabolic network of H. sapiens and many other organisms is described in multiple pathway databases. The level of agreement between these descriptions, however, has proven to be low. We can use these different descriptions to our advantage by identifying conflicting information and combining their knowledge into a single, more accurate, and more complete description. This task is, however, far from trivial.

Results

We introduce the concept of Consensus and Conflict Cards (C2Cards) to provide concise overviews of what the databases do or do not agree on. Each card is centered at a single gene, EC number or reaction. These three complementary perspectives make it possible to distinguish disagreements on the underlying biology of a metabolic process from differences that can be explained by different decisions on how and in what detail to represent knowledge. As a proof-of-concept, we implemented C2CardsHuman as a web application (www.c2cards.nl), covering five human pathway databases.

Conclusions

C2Cards can contribute to ongoing reconciliation efforts by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute. Several case studies illustrate the potential of the C2Cards in identifying disagreements on the underlying biology of a metabolic process. The overviews may also point out controversial biological knowledge that should be subject of further research. Finally, the examples provided emphasize the importance of manual curation and the need for a broad community involvement.

Consensus and Conflict Cards

Introduction

Metabolic pathway databases have proven very valuable for a wide range of applications, varying from the analysis of high-throughput data to in silico phenotype prediction. Over the past decade, the number of pathway databases has grown markedly, providing extensive descriptions of the metabolic network for an increasing number of organisms (Karp and Caspi, 2011; Oberhardt et al, 2009). The metabolic networks of several key organisms, for example, S. cerevisiae and H. sapiens, are even described in multiple databases. A comparison of two yeast networks showed, however, that the two agreed on only 36% of their reactions (Herrgård et al, 2008). Similarly, five pathway databases describing the human metabolic network agreed on only 3% of the 6968 reactions they jointly contain (Stobbe et al, 2011). Given that these databases aim to represent the metabolic capabilities of the same organism, the level of agreement is much lower than one might expect and hope for. There are several explanations for the observed lack of consensus. These include the different ways in which the networks have been built, their manner of curation, and a different interpretation of literature (Mo and Palsson, 2009). The comparison of Stobbe et al (2011) also revealed large differences in the breadth and depth of coverage of the five human metabolic networks.

The advantage of having several descriptions of the metabolic network for the same organism is that they offer different views on the same biological system and thus can reveal controversial biological knowledge. In addition, the databases each have a particular focus and their curators have specific fields of expertise. Therefore, each database may provide complementary pieces of the puzzle of the complete metabolic network. These observations have motivated still ongoing efforts to consolidate the different networks for the same organism and to build consensus metabolic networks using a largely manual approach (Herrgård et al, 2008; Thiele and Palsson, 2010a; Thiele et al, 2011).

Combining all the knowledge on the metabolic network contained in the various pathway databases and identifying conflicting information is, however, far from trivial. Retrieving all required information from multiple databases is in itself already a cumbersome task. One reason that it is challenging to identify instances where pathway databases do not agree on the underlying biology of a metabolic process is that each database makes different decisions on how to represent knowledge (Stobbe et al, 2011; Wittig and De Beuckelaer, 2001). For example, a particular difference may be simply explained by the different levels of granularity with which metabolic processes are described by each database, instead of a fundamentally different biological insight. Secondly, it remains a challenge to determine whether databases refer to the same gene or the same metabolite. Thirdly, the definition of a pathway also differs per database, which makes it nearly impossible to compare the networks on a smaller scale, i.e., per pathway. Fourthly, the larger the number of pathway databases considered, the more difficult it is to identify the consensus and the conflicts. Recently, algorithms have been proposed to semi-automatically merge two descriptions of the metabolic network of the same organism (Chindelevitch et al, 2012; Radrich et al, 2010). These approaches mainly address the challenge of matching metabolites, partly via interactions with the user. The core of their resulting merged description consists of reactions that can be found in both networks. Integrating more than two descriptions will, however, significantly reduce the size of the core and limit its utility (Stobbe et al, 2011). The merged description also contains reactions that could not be (exactly) matched and are therefore unique to one of the descriptions. Such an approach will, however, neither resolve the conflicting information between databases nor filter out erroneous information. Furthermore, the semi-automatic approaches do not explicitly address all issues mentioned above. For example, conflicts due to differences in granularity are not taken into account. While semi-automatic approaches generate a useful scaffold for a consensus network, the resulting description still requires extensive manual curation.
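Why the core shrinks as more descriptions are integrated can be seen in a minimal sketch that treats the core as the set of reactions found in every network. The sketch assumes exact matching on metabolite sets; the database names (db_a, db_b, db_c) and their contents are hypothetical toy data, and this is not the algorithm of the cited merging approaches.

```python
from functools import reduce


def core_reactions(networks):
    """Reactions present in every network (exact match on metabolite sets)."""
    reaction_sets = [set(reactions) for reactions in networks.values()]
    if not reaction_sets:
        return set()
    return reduce(set.intersection, reaction_sets)


# Each reaction is represented by the set of its metabolites.
fumarase = frozenset({"fumarate", "H2O", "(S)-malate"})
sdh_fad = frozenset({"succinate", "FAD", "fumarate", "FADH2"})
sdh_q10 = frozenset({"succinate", "ubiquinone-10", "fumarate", "ubiquinol-10"})

# Hypothetical toy networks; not the content of any real database.
two = {"db_a": [fumarase, sdh_fad], "db_b": [fumarase, sdh_fad]}
three = dict(two, db_c=[fumarase, sdh_q10])

print(len(core_reactions(two)))    # size of the core for two networks
print(len(core_reactions(three)))  # size of the core after adding a third
```

In this toy case the third network describes the succinate dehydrogenase reaction with different cosubstrates, so that reaction drops out of the core even though all three networks contain the conversion of succinate into fumarate.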

Altogether, the issues described above make the construction of a single, more accurate, and more complete network based on the pathway databases available a laborious and largely manual process (Thiele and Palsson, 2010a). Moreover, it is an ongoing process, as new knowledge continues to become available both in the scientific literature and in pathway databases.

To more easily visualize the opinion of multiple pathway databases, we introduce the concept of Consensus and Conflict Cards (C2Cards). C2Cards combine the knowledge from multiple pathway databases for a specific target organism. A C2Card can be centered at a single gene, Enzyme Commission (EC) number or reaction of interest and gives a concise overview of what the databases do or do not agree on with respect to the entity the C2Card is centered at. These three perspectives offer complementary views on the knowledge contained in the pathway databases. Importantly, the perspectives provide ways to identify differences that may be explained by a different decision on how and in how much detail to represent knowledge. C2Cards can be used to assist reconciliation efforts and make users of pathway databases more aware of the exact differences that currently exist between databases.


As a proof-of-concept, we implemented C2CardsHuman (www.c2cards.nl), which combines the knowledge of the following five frequently used human pathway databases: the Biochemically, Genetically and Genomically structured (BiGG) knowledgebase (Schellenberger et al, 2010) (H. sapiens Recon 1 (Duarte et al, 2007)), the Edinburgh Human Metabolic Network (EHMN) (Hao et al, 2010), HumanCyc (Romero et al, 2004), and the metabolic subsets of the Kyoto Encyclopedia of Genes and Genomes database (KEGG) (Kanehisa et al, 2012) and Reactome (Croft et al, 2011). Below, we first give an overview of the various features of the C2Cards, the combined strength of the three perspectives, and how C2Cards can aid in the curation of gene and metabolite identifiers. Next, we describe several case studies illustrating the potential of the C2Cards in identifying conflicts between pathway databases. Finally, we discuss the next steps to be taken in curating metabolic networks.

Results

Each C2Card provides an overview of the knowledge of multiple pathway databases from the perspective of a specific gene, EC number or reaction of interest. A C2Card answers the basic question of which databases contain the entity of interest. Importantly, each card provides a concise overview of what the databases do and do not agree on with respect to the entity of interest. The core component of a C2Card is a table in which each row contains the following basic elements: a reaction and the EC number(s), gene(s) and pathway linked to it in one of the pathway databases (Figure 1). Any of these elements may be missing, except for the entity on which the C2Card is centered. By focusing on these basic elements, the overviews remain compact. For additional information provided by the pathway databases, e.g., pathway visualization and literature references, a direct link is provided to the original entry of the reaction in the pathway database. The second core component of a C2Card is that each card explicitly indicates the similarity of the reactions displayed on it. Similarity is indicated either between all pairs of reactions (gene and EC number perspective; Figure 1) or with respect to the reaction of interest (reaction perspective; Figure 1). Here, reaction similarity is defined as the percentage of metabolites found in both reactions (see Materials and Methods). The strengths of each of the three perspectives are discussed in more detail below.
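The reaction similarity mentioned above can be sketched as follows. The sketch assumes that the percentage is taken over the union of the metabolites of both reactions; the exact definition is given in Materials and Methods and may differ, and the function name is our own.

```python
def reaction_similarity(reaction_a, reaction_b):
    """Percentage of metabolites shared by two reactions.

    Assumption: shared metabolites divided by all distinct metabolites
    of the two reactions (intersection over union), times 100.
    """
    a, b = set(reaction_a), set(reaction_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Two ways of writing the succinate dehydrogenase reaction.
with_fad = ["succinate", "FAD", "fumarate", "FADH2"]
with_q10 = ["succinate", "ubiquinone-10", "fumarate", "ubiquinol-10"]

print(round(reaction_similarity(with_fad, with_q10), 1))
```

Under this assumption the two descriptions share two of six distinct metabolites, so a card would flag them as only partially overlapping rather than identical.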

Three complementary perspectives

C2Cards offer three complementary perspectives (gene, EC number, reaction) on the knowledge contained in the pathway databases. Each perspective can answer various types of questions, accommodating the different interests one may have.

111 Chapter 5

Figure 1 – Examples of two C2Cards. C2Card centered at the CTPS gene (top) and the C2Card retrieved by clicking on the reaction of Reactome in the C2Card centered at the CTPS gene (bottom). Each C2Card consists of a table in which each row contains the following basic elements: a reaction and the EC number(s), gene(s) and pathway linked to it in one of the pathway databases. One can switch perspective by clicking on any of the elements in the table. For additional information provided by the pathway databases, e.g., pathway visualizations and literature references, a direct link is provided to the original entry of the reaction in the pathway database. The second core ingredient of a C2Card is that each card explicitly shows the similarity of the reactions displayed on it. The percentage of overlap between reactions is indicated and relevant cells are colored according to the degree of overlap. Information on the IDs assigned to the metabolites and genes by a pathway database is shown by clicking on the i icon. For EC numbers the reaction and name linked to it by NC-IUBMB are shown.

112 Consensus and Conflict Cards

Importantly, the three perspectives can be used to identify and complement information missing in one (or more) of the pathway databases using the knowledge from the other pathway databases.

Gene perspective The 'gene perspective' shows for each of the pathway databases, which metabolic functions the product of a gene has, as indicated by the reaction(s) and EC number(s) linked to it. This perspective may also answer the question whether other genes, either encoding isozymes or components of the same complex, are linked to the same reaction.

EC number perspective The 'EC number perspective' shows on which elements linked to the EC number the pathway databases (dis)agree for a specific type of conversion. It may also reveal possible alternative substrates, which is one of the sources of conflict between metabolic pathway databases (Stobbe et al, 2011). The C2Card centered at the EC number 1.1.1.35 (3-hydroxyacyl-CoA dehydrogenase) provides an example of this scenario (Supplementary File S1). The EC number perspective can also be used to answer the question of which genes encode an enzyme with the specified enzymatic function, according to each database.

Reaction perspective The 'reaction perspective' provides a compact overview of which gene(s) and EC number(s) are linked to a reaction of interest in each pathway database. This perspective can assist in resolving a commonly occurring gap in reconstructions of the metabolic network, namely cases in which the gene product catalyzing a known metabolic reaction is missing (Orth and Palsson, 2010). The reaction perspective (and also the EC number perspective) can be used to find possible candidates for a missing gene in a particular database or reveal that the gene is missing in all pathway databases.

By clicking on any of the entities shown in a C2Card one can easily switch perspective. Furthermore, each C2Card is opened in a new window to enable a simultaneous view of the C2Cards of a linked triple of a reaction, EC number, and gene from different viewpoints. Using all three perspectives is essential to get a complete picture of what the databases do or do not agree on. The EC number perspective can, for example, neither fully replace the gene perspective nor the reaction perspective, as illustrated by the example in Table 1. An EC number does not uniquely identify a reaction or an enzyme. As the example shows, the pathway databases linked different EC numbers to the same reaction. Furthermore, in this case the databases either do not agree on the substrate specificity of the gene product, or curators assigned the EC number based on the reaction instead of the functionality of the gene product (Table 2). Finally, in the C2Cards application one can also cast a wider net when querying for an EC number by allowing a mismatch on the fourth number of an EC number. In contrast to the first three numbers, the last number does not indicate a specific subclass of enzymes and only serves to distinguish enzymes with different substrate specificities.

ATP + UMP <==> ADP + UDP

| Database | EC number | Gene(s) |
|---|---|---|
| EHMN | 2.7.4.14, 2.7.4.4 | CMPK1 |
| H. sapiens Recon 1 | 2.7.4.14 | CMPK1 |
| H. sapiens Recon 1 | 2.7.4.22 | -- |
| HumanCyc | 2.7.4.14 | CMPK1 |
| KEGG | 2.7.4.14 | CMPK1, CMPK2 |
| Reactome | 2.7.4.4 | CMPK1 |

Table 1 – Excerpt of the C2Card centered at the reaction ‘ATP + UMP <==> ADP + UDP’. Different EC numbers linked to the same reaction and gene, which illustrates the difference in enzyme activity assigned to the product of the CMPK1 gene. Matching EC numbers have the same color.

| EC number | Enzyme name | Reaction as defined by NC-IUBMB |
|---|---|---|
| 2.7.4.4 | nucleoside-phosphate kinase | ATP + nucleoside phosphate = ADP + nucleoside diphosphate |
| 2.7.4.14 | UMP/CMP kinase | (1) ATP + (d)CMP = ADP + (d)CDP; (2) ATP + UMP = ADP + UDP |
| 2.7.4.22 | UMP kinase | ATP + UMP = ADP + UDP |

Table 2 – Definition of EC numbers in NC-IUBMB. The enzyme name and reaction(s) linked to the EC numbers of Table 1 by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). The information of NC-IUBMB is available in a C2Card for each EC number that is part of the overview (see Figure 1).
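The relaxed EC number query described above, in which the fourth number is allowed to differ, can be illustrated with a minimal sketch. The function names and the `allow_fourth_mismatch` flag are hypothetical and are not taken from the C2Cards application; the EC numbers are those of Table 1.

```python
def parse_ec(ec: str) -> tuple[str, ...]:
    """Split an EC number such as '2.7.4.14' into its four fields."""
    parts = tuple(ec.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"not a four-field EC number: {ec!r}")
    return parts


def ec_matches(query: str, candidate: str, allow_fourth_mismatch: bool = False) -> bool:
    """Compare two EC numbers, optionally ignoring the fourth field.

    The flag name is an assumption made for this illustration.
    """
    q, c = parse_ec(query), parse_ec(candidate)
    if allow_fourth_mismatch:
        return q[:3] == c[:3]  # only the first three fields must agree
    return q == c


# EC numbers linked to 'ATP + UMP <==> ADP + UDP' in Table 1
table1_ecs = ["2.7.4.14", "2.7.4.4", "2.7.4.22"]

strict = [ec for ec in table1_ecs if ec_matches("2.7.4.14", ec)]
relaxed = [ec for ec in table1_ecs
           if ec_matches("2.7.4.14", ec, allow_fourth_mismatch=True)]
```

With the strict comparison only 2.7.4.14 is retrieved, whereas the relaxed comparison also returns 2.7.4.4 and 2.7.4.22, the two other EC numbers that the databases linked to the same reaction.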

Dealing with conceptual differences Combining different perspectives also offers a way to side-step differences that do not reflect a true disagreement on the underlying biology. For example, the detail with which a metabolite or a conversion is described varies between and within databases. One database may describe the specific form of a metabolite, e.g., α-D-glucose or β-D-glucose, while in another database the more general form is used, D-glucose in this case. A possible motivation for database curators to choose the general version is that in an experiment the distinction between two isomers may be difficult to make. This type of difference is unlikely to affect the gene or EC number that is assigned to the reaction and can, therefore, be revealed using the gene or EC number perspective.

Another example of a difference in level of detail is a biochemical conversion that is described in a single reaction using generic metabolites, like 'a long chain alcohol', versus multiple reactions with more specific examples of metabolites, i.e., 'hexadecanol' and 'octadecanol' instead of 'a long chain alcohol'. The gene or EC number perspective can be used to uncover such a difference. The number of steps used to describe a biochemical process may also differ and will prevent a perfect match on reaction level as well. The latter type of difference is not necessarily explained by different decisions made on how to represent a biochemical process, but could also be due to a disagreement on the underlying biology. This commonly occurring difference in level of granularity can be revealed via the gene or EC number perspective as well (Table 3).

| Database | Reaction | EC number | Genes |
|---|---|---|---|
| EHMN | citrate <==> cis-aconitate + H2O; cis-aconitate + H2O <==> isocitrate; citrate <==> isocitrate | 4.2.1.3 | ACO1, ACO2 |
| H. sapiens Recon 1 | citrate <==> isocitrate | 4.2.1.3 | ACO1, ACO2, IREB2 |
| HumanCyc | citrate → cis-aconitate + H2O; cis-aconitate + H2O → isocitrate | 4.2.1.3 | ACO1, ACO2 |
| KEGG | citrate <==> cis-aconitate + H2O; cis-aconitate + H2O <==> isocitrate | 4.2.1.3 | ACO1, ACO2 |
| Reactome | citrate <==> isocitrate | 4.2.1.3 | ACO2 |

Table 3 – Excerpt of the C2Card centered at the EC number ‘4.2.1.3’ (aconitate hydratase). Conversion of citrate into isocitrate (part of the TCA cycle) in one (green) or two steps (blue). The EC number and gene on which all five databases agree are underlined.

Gene and metabolite identity Next to exploring the genes, EC numbers, and reactions contained in the pathway databases, as described above, C2Cards can also be of direct use in curating the identifiers (IDs) assigned to the genes and metabolites by the pathway databases. Identifiers are essential for the unambiguous identification of genes and metabolites across multiple resources and enable linking experimental data to the metabolic network. For each gene and metabolite a C2Card provides the identifiers assigned to them by the pathway databases (see Figure 1, and Materials and Methods). Obsolete or transferred identifiers are explicitly indicated. For genes the HUGO Gene Nomenclature Committee (HGNC) symbol is provided and for metabolites their name and synonyms. If available in a pathway database, two structural IDs (InChI and SMILES) and the chemical formula are also shown for a metabolite. The information on the identifiers helps to reveal cases where the assignment of identifiers to a metabolite or gene can be improved. Firstly, it can uncover metabolites that completely lack an ID in one or more pathway databases. Secondly, ID information can also help to identify cases where pathway databases assigned IDs from different gene and metabolite databases to the same entity. This can be used to propose additional identifiers for that particular gene or metabolite, which may also facilitate matching between databases. Thirdly, it can reveal genes and metabolites to which a pathway database assigned multiple identifiers from the same genome or metabolite database, respectively. In summary, C2Cards can assist in the considerable amount of manual curation required to correctly link each component of the metabolic network to external databases.

Reaction of interest: l-arginine + H2O → ornithine + urea

| Database | Reaction | Overlap (%) | EC number | Gene | Pathway |
|---|---|---|---|---|---|
| H. sapiens Recon 1 | l-arginine[c] + H2O[c] → ornithine[c] + urea[c] | 100 | 3.5.3.1 | ARG1 | Urea cycle / amino group metabolism |
| H. sapiens Recon 1 | l-arginine[m] + H2O[m] → ornithine[m] + urea[m] | 100 | 3.5.3.1 | ARG2 | Urea cycle / amino group metabolism |
| Reactome | l-arginine[c] + H2O[c] → l-ornithine[c] + urea[c] | 66 | 3.5.3.1 | ARG1 | Urea Cycle |
| Reactome | l-arginine[m] + H2O[m] → l-ornithine[m] + urea[m] | 66 | 3.5.3.1 | ARG2 | Urea Cycle |

Table 4 – Excerpt of the C2Card centered at the reaction ‘l-arginine + H2O → ornithine + urea’. One metabolite was allowed not to match in this reaction search. The only difference between the reactions is the use of ornithine versus l-ornithine (both in bold). Note that H2O is not taken into account for computing the percentage of overlap. '[c]' stands for cytosol and '[m]' for mitochondrion.

The ability to correctly match metabolites when comparing reactions is influenced by the different decisions the curators of the pathway databases have taken. For example, in Recon 1 and HumanCyc the protonation state of a metabolite is determined at a pH level of 7.2 and 7.3, respectively. The other three databases always use the neutral form of a metabolite. As illustrated in the C2Card centered at the CTPS gene (Figure 1), this leads to a reaction mismatch between EHMN and KEGG, which use ammonia (NH3), and Recon 1, which uses ammonium. The gene and EC number perspectives offer a possible way to uncover such differences. The C2Cards application provides an additional means to uncover reactions that are similar, but not an exact match, by allowing the user to specify that one or more mismatches are allowed when querying for a reaction. An example of the results of a query in which one mismatch was allowed is given in Table 4. Note that the genes and EC number do match, which suggests that the two reactions can be considered equivalent. Moreover, in this example the reactions only seem to differ in the level of detail with which the metabolite ornithine was described. Allowing mismatches also makes it possible to retrieve reactions for which the identity of one or more metabolites could not be established, because of missing identifiers or for which matching on name was hindered by the use of different synonyms.
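The overlap percentages of Table 4 and the effect of allowing one mismatch can be reproduced with the following minimal sketch. It rests on several assumptions that are not taken from the source: metabolites are compared by name only, compartments are ignored, and the overlap is the number of shared metabolites divided by the size of the larger reaction. The exact definition is given in Materials and Methods and may differ. All function names are hypothetical.

```python
# H2O is excluded from the overlap computation, as stated in the caption of Table 4.
IGNORED = {"H2O"}


def overlap_percentage(reaction_a: set[str], reaction_b: set[str]) -> float:
    """Percentage of metabolites shared by two reactions.

    Assumption: the denominator is the size of the larger reaction.
    """
    a, b = reaction_a - IGNORED, reaction_b - IGNORED
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / max(len(a), len(b))


def mismatches(reaction_a: set[str], reaction_b: set[str]) -> int:
    """Number of metabolites of one reaction without a counterpart in the other."""
    a, b = reaction_a - IGNORED, reaction_b - IGNORED
    return max(len(a - b), len(b - a))


def is_hit(query: set[str], candidate: set[str], allowed_mismatches: int = 0) -> bool:
    """True if the candidate reaction matches the query within the tolerance."""
    return mismatches(query, candidate) <= allowed_mismatches


query = {"l-arginine", "H2O", "ornithine", "urea"}
recon1 = {"l-arginine", "H2O", "ornithine", "urea"}
reactome = {"l-arginine", "H2O", "l-ornithine", "urea"}

exact_only = is_hit(query, reactome)                          # strict search
with_tolerance = is_hit(query, reactome, allowed_mismatches=1)  # one mismatch allowed
```

Under these assumptions the Recon 1 reaction overlaps for 100% and the Reactome reaction for two out of three metabolites, which corresponds to the 66% shown in Table 4. The Reactome reaction is only retrieved when one mismatch is allowed.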


C2Cards interfaces

C2Cards can be accessed using common JavaScript-enabled browsers on all major platforms including Windows, Linux, and Apple. A C2Card centered at a gene or EC number of interest can be retrieved in a single step. For the reaction perspective two routes are offered, either of which requires three steps. A reaction can be found by entering one or more metabolites or by selecting the pathway it is part of in one of the pathway databases. More detail on how to retrieve a C2Card is described on the C2Cards website (www.c2cards.nl). Once retrieved, a C2Card can also be downloaded for off-line use. In addition, for each database the C2Cards for all its genes, EC numbers, and reactions, respectively, can be downloaded in tab-delimited format in a single ZIP file.

Next to the web interface, programming interfaces to R, SOAP (Simple Object Access Protocol), and REST (Representational State Transfer) are provided to enable programmatic querying of the collection of C2Cards. One possible application would be to perform computational analyses on each of the pathway databases. A typical example is an enrichment test to prioritize pathways most likely to be affected in a given high-throughput experiment. The differences between pathway databases can be quite large both with respect to content and conceptual differences (Stobbe et al, 2011). For example, the number of pathways, in the five selected human pathway databases ranges from 69 in EHMN to 257 in HumanCyc (see Materials and Methods). Consequently, it is to be expected that the choice of a particular pathway database affects the outcome of pathway enrichment analyses (Elbers et al, 2009). It would, therefore, be advisable to apply analyses to multiple pathway databases to verify the robustness of the results. Specifically, to accommodate pathway enrichment analyses, we provide two additional tables, accessible via the programmatic interfaces only. In these tables the metabolites and genes of each pathway database are linked to the corresponding pathways. The results of our reaction comparison could be used to zoom into the outcomes of an enrichment analysis to see if the differences found can perhaps be attributed to the different pathway definitions used by the databases.
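To make the suggestion of running an enrichment analysis on several pathway databases concrete, the sketch below applies a one-sided hypergeometric over-representation test to each database separately. The choice of test, the function names, and the toy gene sets are assumptions made for illustration only; the source does not prescribe a particular test, and the data are not taken from any pathway database.

```python
from math import comb


def hypergeom_pvalue(population: int, pathway_size: int, sample: int, hits: int) -> float:
    """Upper-tail probability P(X >= hits) of the hypergeometric distribution."""
    upper = min(pathway_size, sample)
    favourable = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


def enrich(pathways: dict[str, set[str]], background: set[str],
           hit_genes: set[str]) -> dict[str, float]:
    """Return an over-representation p-value for every pathway of one database."""
    hits_in_background = hit_genes & background
    return {
        name: hypergeom_pvalue(
            len(background),
            len(genes & background),
            len(hits_in_background),
            len(genes & hits_in_background),
        )
        for name, genes in pathways.items()
    }


# Hypothetical toy data: the same pathway name with different gene content
# in two databases.
background = {f"g{i}" for i in range(10)}
hit_genes = {"g0", "g1", "g2"}
databases = {
    "database_A": {"pathway_P": {"g0", "g1", "g2", "g3"}},
    "database_B": {"pathway_P": {"g0", "g5", "g6", "g7"}},
}
results = {db: enrich(pathways, background, hit_genes)
           for db, pathways in databases.items()}
```

In this toy example the pathway is clearly enriched according to the first database but not according to the second, which illustrates why it can be informative to compare the outcomes across databases before drawing conclusions.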

Another additional feature offered is the possibility to look up the fate of a metabolite, contained in any of the five databases, by retrieving the list of reactions in which the metabolite of interest participates. Furthermore, databases in which the metabolite is a ‘dead-end’, i.e., it is either only produced or consumed, are explicitly indicated. The list of reactions provided allows the user to find candidate reactions to resolve these dead-ends in the network of a particular database using information from other databases. All reactions in this list are linked to their corresponding C2Card.
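The notion of a dead-end metabolite described above can be sketched as follows. The reaction representation (substrates, products, reversibility flag), the function names, and the two toy networks are assumptions made for this illustration; a real network would also need to account for transport and exchange reactions.

```python
# A reaction is represented as (substrates, products, reversible).
Reaction = tuple[frozenset[str], frozenset[str], bool]


def dead_end_metabolites(reactions: list[Reaction]) -> set[str]:
    """Metabolites that are only produced or only consumed in a network."""
    produced: set[str] = set()
    consumed: set[str] = set()
    for substrates, products, reversible in reactions:
        consumed |= substrates
        produced |= products
        if reversible:  # a reversible reaction both produces and consumes
            consumed |= products
            produced |= substrates
    return produced ^ consumed  # in exactly one of the two sets


def reactions_involving(metabolite: str, reactions: list[Reaction]) -> list[Reaction]:
    """All reactions in which the metabolite participates."""
    return [r for r in reactions if metabolite in r[0] or metabolite in r[1]]


# Hypothetical toy networks of two databases
db_one: list[Reaction] = [
    (frozenset({"A"}), frozenset({"B"}), False),
    (frozenset({"B"}), frozenset({"C"}), False),
]
db_two: list[Reaction] = db_one + [(frozenset({"C"}), frozenset({"D"}), True)]

dead_in_one = dead_end_metabolites(db_one)
dead_in_two = dead_end_metabolites(db_two)

# Candidate reactions from the second database that could resolve the
# dead-end 'C' in the first database.
candidates = [r for r in reactions_involving("C", db_two) if r not in db_one]
```

Here metabolite C is a dead-end in the first network only, and the reversible reaction between C and D from the second network is the single candidate for resolving it.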

C2Cards case studies

For each of the three perspectives we provide a concrete example derived from C2CardsHuman of consensus and conflicts between the five human pathway databases below. The examples have all been chosen from primary metabolic processes, highlighting that conflicts still occur even in well-studied parts of the metabolic network. The case studies also illustrate why manual curation remains crucial to resolve contradictory information and to determine in which cases further biochemical experiments are required to verify what is correct and what is not.

Case study I: Gene perspective

The C2Card focused on the CTPS gene (Figure 1) shows that the gene is found in all five databases and is linked to the same EC number by each database. However, Reactome and Recon 1 link the gene to two different reactions, i.e., the glutamine dependent reaction ‘l-glutamine + ATP + UTP + H2O → l-glutamate + ADP + CTP + orthophosphate’ and the ammonium dependent reaction ‘ammonium + ATP + UTP → ADP + CTP + phosphate + H+’, respectively. The C2Card focused on the reaction of Reactome (Figure 1) shows that Recon 1 does contain this reaction, but links it only to the CTPS2 gene and not to CTPS. The same observation can be made when starting from the EC number perspective, as both genes are linked to the same EC number (not shown).

The products of both the CTPS and CTPS2 gene contain a glutamine amidotransferase domain and have high sequence similarity. This, and the fact that all databases assigned the same EC numbers to both genes, suggests that they have similar catalytic activity. Both gene products can indeed catalyze the glutamine dependent reaction, as demonstrated by overexpression of both human genes in yeast (Han et al, 2005). For L. lactis it is known that both ammonium derived from the hydrolysis of glutamine by the CTP synthase enzymes themselves and ammonium from other external sources of amine donors can be utilized for CTP synthesis (Willemoës, 2004). The human counterparts of these enzymes may follow the same reaction mechanism as found for L. lactis. This is supported by the fact that at room temperature glutamine is unstable and will dissociate into an ammonium ion and oxo-proline. We, therefore, conclude that CTPS and CTPS2 should probably be linked to both reactions. This means that Recon 1 could be improved by adding CTPS to each reaction. In Reactome and HumanCyc the ammonium dependent reaction then needs to be added.


EC number of interest: 6.2.1.4

| Database | Reaction | Gene(s) | Pathway |
|---|---|---|---|
| EHMN | GTP + succinate + CoA \|==\| GDP + succinyl-CoA + Pi | SUCLG1, SUCLG2 | TCA cycle |
| EHMN | ITP + succinate + CoA <==> IDP + succinyl-CoA + Pi | SUCLG1, SUCLG2 | TCA cycle |
| H. sapiens Recon 1 | GTP + succinate + CoA <==> GDP + succinyl-CoA + Pi | (SUCLG1 and SUCLG2) | TCA cycle |
| H. sapiens Recon 1 | GTP + itaconate + CoA <==> GDP + itaconyl-CoA + Pi | (SUCLG1 and SUCLG2) | C5-branched dibasic acid metabolism |
| H. sapiens Recon 1 | GTP + mesaconate + CoA <==> GDP + mesaconyl-CoA + Pi | (SUCLG1 and SUCLG2) | C5-branched dibasic acid metabolism |
| HumanCyc | GTP + succinate + CoA \|==\| GDP + succinyl-CoA + Pi | (SUCLG1 and SUCLG2) | --- |
| HumanCyc | GTP + itaconate + CoA → GDP + aconyl-CoA + Pi | SUCLG1 or SUCLG2 | itaconate degradation |
| KEGG | GTP + succinate + CoA <==> GDP + succinyl-CoA + Pi | SUCLA2, SUCLG1, SUCLG2 | TCA cycle and propanoate metabolism |
| KEGG | ITP + succinate + CoA <==> IDP + succinyl-CoA + Pi | SUCLA2, SUCLG1, SUCLG2 | TCA cycle |
| Reactome | GTP + succinate + CoA GDP + succinyl-CoA + Pi | (SUCLG1 and SUCLG2) | TCA cycle |

Table 5 – Excerpt of the C2Card centered at the EC number 6.2.1.4 (succinate-CoA ligase (GDP-forming)). The reaction in grey is found in all databases, the reaction in red only in EHMN and KEGG. ‘|==|’ indicates no direction provided by the database. Genes are represented by HGNC symbols, retrieved via Entrez Gene IDs. Genes, the products of which form a complex, are placed between parentheses and connected by the Boolean operator ‘and' (see Materials and Methods). If the gene products are isozymes ‘or’ is used.

Case study II: EC number perspective

The EC number 6.2.1.4 (succinate-CoA ligase (GDP-forming)) is found in all five databases. They all agree on one reaction and two genes linked to it (Table 5, reaction indicated in grey). The reaction is considered to be part of the tricarboxylic acid (TCA) cycle, a mitochondrial pathway, by all databases except HumanCyc. Both EHMN and KEGG also include a very similar reaction (Table 5, reaction indicated in red), which only differs with respect to its co-substrates, i.e., IDP/ITP instead of GDP/GTP. Although IDP is a substrate for this enzyme in vitro, it is extremely unlikely to play a role in vivo. The concentrations of IDP and ITP are very low as compared to other nucleotides, and they are considered byproducts of purine nucleotide metabolism (Bierau et al, 2007). The reaction should therefore not be included in the description of the human metabolic network.


Reaction of interest: deoxyuridine + phosphate <==> 2-deoxy-d-ribose 1-phosphate + uracil

| Database | EC number | Gene(s) | Pathway |
|---|---|---|---|
| EHMN | 2.4.2.1, 2.4.2.4 | PNP*, TYMP*, UPP1 | Pyrimidine metabolism |
| H. sapiens Recon 1 | --- | PNP* or UPP2 | Nucleotides |
| HumanCyc | 2.4.2.23 | --- | --- |
| HumanCyc | --- | --- | salvage pathways of pyrimidine deoxyribonucleotides |
| KEGG | 2.4.2.1 | PNP* | Pyrimidine metabolism |
| KEGG | 2.4.2.4 | TYMP* | Pyrimidine metabolism |
| Reactome | 2.4.2.3 | UPP1 or UPP2 | Pyrimidine catabolism and Pyrimidine salvage reactions |
| Reactome | 2.4.2.- | TYMP* | Pyrimidine catabolism and Pyrimidine salvage reactions |

Table 6 – Excerpt of the C2Card centered at the reaction ‘deoxyuridine + phosphate <==> 2-deoxy-d-ribose 1-phosphate + uracil’. Genes are represented by the HGNC symbol to which their Entrez Gene IDs are linked. The genes on which the majority of the five pathway databases agree, i.e., PNP and TYMP, are indicated with a ‘*’.

Case study III: Reaction perspective

All five databases contain the reaction ‘deoxyuridine + phosphate <==> 2-deoxy-d-ribose 1-phosphate + uracil’ and assigned it to similarly named pathways (Table 6). However, there is no consensus regarding the genes linked to this reaction. For UPP2 there is clear experimental evidence that its gene product can catalyze the reaction (Johansson, 2003). To the best of our knowledge the activity of the enzyme encoded by UPP1 was only evaluated for two substrates, uridine and thymidine (Watanabe and Uchida, 1995). For TYMP evidence exists that its product can indeed catalyze this reaction in placenta (Kubilus et al, 1978; Yoshimura et al, 1990) and in platelets (Desgranges et al, 1981), but in liver, for example, such activity was not observed (Yoshimura et al, 1990). For PNP there is not enough evidence clearly confirming or refuting that its product can catalyze this specific reaction. In conclusion, additional experiments are required to determine whether the products of UPP1 and PNP can catalyze this reaction. This also illustrates that even though the majority of the databases link PNP to the reaction, this is not necessarily corroborated by conclusive evidence. For the TYMP gene there is only evidence for two highly specific tissues, which leaves it open for discussion whether its product should be included as a catalyst of this particular reaction. We can conclude that EHMN, HumanCyc and KEGG should at least link the UPP2 gene to this reaction. This would resolve the ‘missing gene’ issue in HumanCyc.
Note also that the majority of the databases do not link UPP2 to this reaction, although clear evidence for its role is available.

Discussion

We proposed the concept of Consensus and Conflict Cards to provide concise overviews of the knowledge contained in metabolic pathway databases for an organism of interest. In a single step one can find, for example, a gene of interest and see if the databases agree on the role of its product in the metabolic network. The C2Cards will increase the awareness of the differences that exist between the various pathway databases. Other initiatives also provide a web-based interface to browse and search multiple pathway databases (Cerami et al, 2010; Kamburov et al, 2011). However, they are focused on the union of various (pathway) databases instead of explicitly pointing out the differences between pathway databases. Furthermore, they do not provide a clear and compact overview of the content of each of the five selected databases as a C2Card does. Also, the C2Cards application enables users to find reactions that are similar to the reaction of interest, but that are not exactly the same. The three perspectives offered by the C2Cards application provide complementary views on the knowledge contained in the pathway databases. This makes it possible to distinguish differences that reflect a disagreement on the underlying biology (case studies I-III) from differences that may be explained by, for example, different decisions taken on how to represent knowledge (Table 4).

Ultimately, to reconcile differences and to integrate the networks manual curation is required. While a C2Card can highlight differences between databases, it cannot distinguish between errors in one (or more) of the databases and cases where databases do not agree due to lack of consensus in the scientific literature. Moreover, for any given organism metabolic pathway databases are still being refined, expanded, and corrected. This makes it challenging to distinguish complementary information from cases in which the database curators purposely excluded, for example, a reaction or gene. Even the parts the pathway databases agree on may need to be reviewed as the databases share information sources and may copy data from each other, thereby possibly propagating incorrect information. Manual curation is also needed to unambiguously assign identifiers to genes and metabolites.

In summary, C2Cards offer an elegant solution to bring cases that deserve further inspection to the attention of pathway database curators. The overviews may also point out controversial biological knowledge that should be the subject of further research.

Conclusions

A biologically accurate and complete description of the metabolic network for human and other organisms is of utmost importance to, e.g., increase our knowledge about pathways perturbed by a disease, find new drug targets, and interpret the deluge of high-throughput data. A crucial step towards a more complete description is to combine the knowledge captured by each of the available pathway databases for a specific organism. Much time and effort has already been put into pathway databases and we should profit from this to the fullest extent. However, it requires the commitment and the support of a broad community to construct an initial consensus network and to extend it with new knowledge from domain experts, the scientific literature, and as captured by the various pathway databases. C2Cards can contribute to such an endeavor in several ways. As illustrated by the three case studies the C2Cards are a perfect starting point for manual curation of the human metabolic network in future reconstruction jamborees (Thiele and Palsson, 2010a). The set of five pathway databases currently contained in C2CardsHuman can also be further expanded with additional pathway databases. Importantly, C2Cards can be set up for other organisms as well (see www.c2cards.nl for a description).

As a guide for integrating pathway databases, we provide overviews of which genes, EC numbers, and reactions can be found in which database. The entries in these overviews are linked to the corresponding C2Card. One could start by curating the reactions contained in all or the majority of the databases. In fact, for more than half of the reactions found in all five human metabolic pathway databases, there is no agreement on the EC numbers and genes linked to a reaction (Stobbe et al, 2011) and additional curation is needed. C2Cards can also be of use if a consensus network for a given organism has already been established. We envision that the C2Cards application could serve as a central platform in which the consensus network can be further refined and extended with knowledge available in pathway databases not used for its construction. We are planning to include the recently completed consensus human metabolic network Recon 2 (Thiele et al, submitted) in C2CardsHuman. Recon 2 combines the content of three reconstructions, H. sapiens Recon 1, EHMN, and the liver-specific network HepatoNet1 (Gille et al, 2010). By including Recon 2 as a point of reference, we can compare this state-of-the-art consensus network with other pathway databases. The overview of all reactions in C2CardsHuman, for example, could be a source of candidates for expanding Recon 2. Bringing the differences between the consensus network and other descriptions to the attention of experts would enable further refinement of Recon 2. As a first step towards such a platform, users can already add comments to a C2Card, preferably substantiated by references to the literature. They can subscribe to C2Cards of their interest and receive an e-mail when new comments are added. Based on these contributions a team of curators could then decide to incorporate the necessary changes in the consensus network, if enough evidence supports this claim.
Notably, as illustrated by case study III, it may lead to the conclusion that further biochemical characterization experiments are required. Since pathway databases are continuously being refined and new information is being added, we could also include the possibility to automatically alert the curators by mailing them updated or additional C2Cards.


It is important to actively involve domain experts in this continuous curation process, even though they may only indirectly benefit from contributing to such an effort. To make the barrier to contribute as low as possible, the web interface of the C2Cards was designed to be easy to use and suitable for users with different backgrounds. The application can be accessed via smartphones and tablets as well, allowing C2Cards to be viewed and discussed nearly anywhere. Furthermore, a C2Card can be downloaded for off-line use. The curation of a C2Card is done at the level of a single reaction or the metabolic functions of a single gene product. This may lower the threshold for experts to contribute as well and also allows (very) detailed knowledge of just a single step in the metabolic network to be added. One way to stimulate expert contributions would be to make the contribution traceable and citable in the form of 'nanopublications' (Groth et al, 2010). A nanopublication consists of three parts: a statement, e.g., protein X (subject) catalyzes (predicate) reaction Y (object), conditions under which the statement holds, e.g., a specific compartment, and provenance of the statement, e.g., author and literature. Besides providing an incentive for experts to share their knowledge, this is also a way to ensure that contributions of curators are substantiated by references to the literature.

We also plan to include in C2CardsHuman the human metabolic pathways of WikiPathways (Pico et al, 2008), an open platform in which anyone can contribute a pathway. By incorporating the knowledge from this database we indirectly have a second way in which experts can contribute their knowledge. Ultimately, to reconstruct a biochemical network that closely resembles the metabolism of a target organism, extensive literature research and additional biochemical experiments will be needed to resolve all conflicts revealed and to fill in the gaps that remain. The continuous support, time and effort of a large and diverse community are therefore essential. C2Cards can contribute to this endeavor by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute.

Materials and Methods

Materials

C2CardsHuman was built upon the same dataset we used previously (Stobbe et al, 2011) for a comparison of five pathway databases, i.e., EHMN, H. sapiens Recon 1, HumanCyc, and the human metabolic subsets of KEGG and Reactome (Table 7). For each reaction we retrieved: the EC number(s) and gene(s) linked to it, and the pathway(s) the reaction is part of (Table 8). To compare the reactions, we retrieved for each metabolite, besides its primary name and available synonyms, the chemical formula and the following five types of metabolite identifiers, if available in the specific pathway database: KEGG Compound, KEGG Glycan, PubChem, ChEBI and CAS. There are two types of PubChem IDs, Substance and Compound. Substance IDs are specific for the depositor of the metabolite. Compound IDs unite the different Substance IDs for the same metabolite. We used the CID-SID file (ftp://ftp.ncbi.nih.gov/pubchem/Compound/Extras/CID-SID.gz) to convert PubChem Substance IDs to PubChem Compound IDs.

| Database | Export formats used | Version | Downloaded from |
|---|---|---|---|
| EHMN | Excel | 2 | http://www.ehmn.bioinformatics.ed.ac.uk/ |
| H. sapiens Recon 1 | Flat file, SBML | 1 | http://bigg.ucsd.edu/ |
| HumanCyc | Flat file | 15.0 | http://biocyc.org/download.shtml |
| KEGG | Flat file, KGML | 58 | ftp://ftp.genome.jp/pub/kegg/ |
| Reactome | MySQL database | 36 | http://reactome.org/download/index.html |

Table 7 – Overview of metabolic pathway databases used. All data from the pathway databases was downloaded in the first week of May 2011.

| Database | Number of genes | Number of EC numbers | Number of reactions | Number of pathways |
|---|---|---|---|---|
| EHMN | 2517 | 981 | 3893 | 69 |
| H. sapiens Recon 1 | 1496 | 647 | 2617 | 96 |
| HumanCyc | 3586 | 1249 | 1785 | 257 |
| KEGG | 1535 | 760 | 1635 | 84 |
| Reactome | 1159 | 375 | 1175 | 171 |

Table 8 – Pathway database content statistics. Genes: counts are based on the internal database identifiers and include genes encoding for a component of a protein complex as separate entities. EC numbers: including incomplete EC numbers. Reactions: if reactions only differ in direction and/or compartments they are counted as one. Pathways: counts for HumanCyc and Reactome are based on the lowest level of their pathway hierarchy.

Although not used for comparing metabolites, we also retrieved the InChI and SMILES of metabolites, when provided by the pathway database, as additional information. For the genes we retrieved the Entrez Gene and Ensembl Gene ID, if available. For display and comparison purposes we mapped the Entrez Gene and Ensembl Gene IDs to their corresponding HGNC symbol as provided by the Entrez Gene and Ensembl databases, respectively. For 396 genes in HumanCyc, neither the Entrez Gene ID nor the Ensembl Gene ID was available. For 106 of these genes the UniProt ID was used to retrieve the Entrez Gene ID and/or Ensembl Gene ID. All out-of-date identifiers and EC numbers were transferred to the current ID/EC number (Supplementary Table S1). If that was not possible the ID or EC number was flagged as being obsolete. All data is made available under the original license terms of the primary databases.


Methods

Data retrieval and storage

We used dedicated in-house scripts to retrieve the data needed for C2CardsHuman from the five pathway databases and stored these data in a local MySQL database. The database was designed for easy comparison of the genes, EC numbers, and reactions. The results of all comparisons were stored in the database as well. A second database, optimized for the queries needed for generating the C2CardsHuman (Supplementary Figure S1), was derived from this database, including precomputed results of all the comparisons to avoid heavy computations in the web application.

Matching

In C2CardsHuman, genes, EC numbers, metabolites and reactions were matched as follows:

Genes

Two genes were considered to match if they agreed based on the Entrez Gene ID and/or Ensembl Gene ID. In addition, both types of gene identifiers were mapped to the corresponding HGNC symbols. This provides a basis for matching genes that are not linked to the same genome database, i.e., Entrez Gene or Ensembl, via their HGNC symbol.
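The gene matching rule can be sketched in Python as follows. The record layout (a dict with optional sets under the keys `entrez`, `ensembl` and `hgnc`) and all identifier values are hypothetical placeholders, not the C2Cards data model.

```python
def genes_match(gene_a, gene_b):
    """Return True if two gene records agree on any shared identifier type.

    Each record is a dict with optional keys 'entrez', 'ensembl' and
    'hgnc', each holding a set of identifiers (hypothetical layout).
    Direct ID agreement is tried first; the HGNC symbol serves as the
    bridge when the records are linked to different genome databases.
    """
    for key in ("entrez", "ensembl", "hgnc"):
        ids_a = gene_a.get(key, set())
        ids_b = gene_b.get(key, set())
        if ids_a & ids_b:
            return True
    return False


# Placeholder records: one database provides only an Entrez Gene ID,
# the other only an Ensembl Gene ID, but both map to the same symbol.
db_a_gene = {"entrez": {"1001"}, "hgnc": {"GENE1"}}
db_b_gene = {"ensembl": {"ENSG_PLACEHOLDER"}, "hgnc": {"GENE1"}}
db_c_gene = {"entrez": {"2002"}, "hgnc": {"GENE2"}}

bridged = genes_match(db_a_gene, db_b_gene)
unrelated = genes_match(db_a_gene, db_c_gene)
```

Here `db_a_gene` and `db_b_gene` share no genome database identifier and are matched only through the HGNC symbol.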

EC numbers

Matching of EC numbers is straightforward except for the 71 incomplete EC numbers that the five databases contain in total. Up to three numbers of the four that make up a complete EC number may be missing. This is indicated by ‘-’, for example, EC 1.-.-.-. Incomplete EC numbers have an ambiguous meaning (Green and Karp, 2005). They may indicate that further specification of the enzyme activity is not possible, but also that a complete EC number for the specific enzyme activity is not yet included by NC-IUBMB. To reduce the number of spurious matches, incomplete EC numbers were matched literally, i.e., the ‘-’ was not treated as a wildcard.
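The difference between literal and wildcard matching of incomplete EC numbers can be made concrete with a short Python sketch. The function names are hypothetical, and the wildcard variant is shown only as the alternative that was not used.

```python
def is_incomplete(ec):
    """True if at least one of the four EC fields is given as '-'."""
    return "-" in ec.split(".")


def ec_numbers_match(ec_a, ec_b):
    """Literal comparison: '-' is an ordinary symbol, not a wildcard."""
    return ec_a.strip() == ec_b.strip()


def wildcard_match(ec_a, ec_b):
    """The alternative that was NOT used: '-' matches any value in that field."""
    fields_a, fields_b = ec_a.split("."), ec_b.split(".")
    if len(fields_a) != len(fields_b):
        return False
    return all(a == b or "-" in (a, b) for a, b in zip(fields_a, fields_b))


literal_same = ec_numbers_match("1.-.-.-", "1.-.-.-")
literal_mixed = ec_numbers_match("1.-.-.-", "1.1.1.35")
wildcard_mixed = wildcard_match("1.-.-.-", "1.1.1.35")
```

Under wildcard matching, EC 1.-.-.- would match every oxidoreductase, which is the kind of spurious match the literal comparison avoids.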

Metabolites

Metabolites were matched based on the KEGG Compound ID, when available. If the KEGG Compound ID was not provided, the metabolites had to match on any of four other identifiers (KEGG Glycan, ChEBI, PubChem Compound or CAS ID) or on name. In the latter case we also required the chemical formula to match. A difference in the number of H atoms when comparing chemical formulae was ignored.
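A possible reading of these metabolite matching rules is sketched below in Python. Several details are assumptions made for illustration only: the record layout, the decision to fall back to other identifiers when either metabolite lacks a KEGG Compound ID, the case-insensitive name comparison, and the simple formula parser, which does not handle parentheses or charges.

```python
import re

OTHER_ID_TYPES = ("kegg_glycan", "chebi", "pubchem_compound", "cas")


def formula_without_h(formula):
    """Parse a simple formula such as 'C6H12O6' and drop the hydrogen count."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    counts.pop("H", None)
    return counts


def metabolites_match(m_a, m_b):
    """Match two metabolite records (dicts with optional ID, name and formula keys)."""
    kegg_a, kegg_b = m_a.get("kegg_compound"), m_b.get("kegg_compound")
    if kegg_a and kegg_b:
        # Both provide a KEGG Compound ID: this identifier decides.
        return kegg_a == kegg_b
    for id_type in OTHER_ID_TYPES:
        if m_a.get(id_type) and m_a.get(id_type) == m_b.get(id_type):
            return True
    # Name match, which additionally requires the formulae to agree
    # apart from the number of H atoms.
    names_a = {name.lower() for name in m_a.get("names", ())}
    names_b = {name.lower() for name in m_b.get("names", ())}
    if names_a & names_b:
        f_a, f_b = m_a.get("formula"), m_b.get("formula")
        return bool(f_a and f_b) and formula_without_h(f_a) == formula_without_h(f_b)
    return False


acid = {"names": ["Pyruvate"], "formula": "C3H4O3"}
anion = {"names": ["pyruvate"], "formula": "C3H3O3"}
other = {"names": ["pyruvate"], "formula": "C3H4O2"}

same_up_to_h = metabolites_match(acid, anion)
different_formula = metabolites_match(acid, other)
```

The first pair matches because the formulae differ only in the number of H atoms. The second pair shares a name but not a formula and is therefore rejected.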

Reactions

For reactions we determined the percentage of metabolites they agreed upon, respecting the two sides of a reaction, but ignoring the direction of a reaction. We did not require e-, H+, and H2O to match, as reactions are not always balanced for these metabolites. Furthermore, we did not take into account the compartmentalization of reactions. The similarity of two reactions was measured by the percentage of overlap:

\[
\frac{|\text{matching metabolites}|}{\max\left(|\text{metabolites}_{R_1}|,\ |\text{metabolites}_{R_2}|\right)} \times 100\%
\]

where R1 and R2 denote the two reactions being compared. The metabolites e-, H+, and H2O were not taken into account in computing the overlap.
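The overlap measure can be sketched in Python as follows. This is one possible interpretation and not the original implementation. It assumes that metabolites have already been matched and mapped to shared labels, that a reaction is given as a pair of collections (left side, right side), and that "ignoring the direction" means taking the better of the two possible orientations.

```python
IGNORED = {"e-", "H+", "H2O"}


def _side(metabolites):
    """Metabolites on one side of a reaction, without e-, H+ and H2O."""
    return set(metabolites) - IGNORED


def reaction_overlap(reaction_1, reaction_2):
    """Percentage overlap of two reactions given as (left, right) pairs."""
    left_1, right_1 = _side(reaction_1[0]), _side(reaction_1[1])
    left_2, right_2 = _side(reaction_2[0]), _side(reaction_2[1])
    # Respect the two sides, but allow reaction_2 to be written in reverse.
    same = len(left_1 & left_2) + len(right_1 & right_2)
    flipped = len(left_1 & right_2) + len(right_1 & left_2)
    largest = max(len(left_1) + len(right_1), len(left_2) + len(right_2))
    if largest == 0:
        return 0.0
    return 100.0 * max(same, flipped) / largest


r1 = (["A", "NAD+"], ["B", "NADH", "H+"])
r2 = (["B", "NADH"], ["A", "NAD+"])   # same reaction, reverse direction
r3 = (["A", "NAD+"], ["C", "NADH"])   # one product differs

full = reaction_overlap(r1, r2)
partial = reaction_overlap(r1, r3)
```

In this example `r1` and `r2` agree on all four counted metabolites, whereas `r1` and `r3` agree on three out of four.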

It depends on the organism and the specific pathway databases included in the C2Cards database which IDs can best be used for comparing genes and metabolites. Only a few changes to the code and the original C2Cards database scheme are required to use other IDs for matching. A more detailed description of the changes to make is available on our website (www.c2cards.nl).

Construction web application

C2CardsHuman was built using the Molecular Genetics Information Systems (MOLGENIS) toolkit (Swertz et al, 2010). This software enables bioinformaticians to model a complete web application with rich data structures and user interfaces using a simple and short XML file. From this model, the toolkit automatically generates software in the Java language that provides a basic web user interface (using Freemarker templates, http://www.freemarker.org), and programming interfaces in Java, R, SOAP and REST to the underlying MySQL database. Building on this generated software, we used the MOLGENIS ‘plug-in’ framework to program, in Java and JavaScript, extra features that are specific for C2CardsHuman, such as the various search options. The result is installed on a standard Tomcat web server, but can also run ‘standalone’ using the MOLGENIS embedded web server. A local installation of C2CardsHuman is also available upon request. All code and the database scheme are open source and can be used as a basis for building a C2Cards application for other organisms. A manual on how to do this is available on our website (www.c2cards.nl). The code for the C2Cards application is available at http://www.molgenis.org/svn/c2cards/trunk/. A copy of the core MOLGENIS project is also required, which is available at http://www.molgenis.org/svn/molgenis/branches/molgenis_c2cards.

Representation

Each row in a C2Card contains a reaction, the EC number(s), gene(s), and the pathway linked to the reaction, and the name of the source database. If a reaction was assigned to multiple pathways, a separate row is used for each pathway. The metabolites of a reaction are represented by their primary name as indicated by the pathway database. Although not taken into account when matching reactions, the direction of a reaction and the compartment(s) as indicated by the source database are shown in a C2Card. If the direction was not provided this is indicated with ‘|==|’. Multiple EC numbers are connected by a comma. Following the convention used in Recon 1, genes of which the products are isozymes are connected by the Boolean operator ‘or’. If the gene products form a complex ‘and’ is used. EHMN and KEGG, however, do not have a syntactic mechanism for describing isozymes or complexes. Therefore, if multiple genes were linked to a reaction by EHMN and KEGG, they are connected by a comma. Genes are represented by the HGNC symbol retrieved from Entrez Gene. The Entrez Gene ID was, however, not always available for every gene, and the HGNC symbol could not always be retrieved when the Entrez Gene ID was available. In these cases we used, when available, the Ensembl Gene ID to retrieve the HGNC symbol. For 358 genes the HGNC symbol was not available via either gene identifier type. In this case the gene is represented by its Entrez Gene or Ensembl Gene ID, depending on which of these two was available. For 274 genes in HumanCyc these two gene identifiers were also not available and for these cases the internal gene identifier of HumanCyc is used for representation. If multiple HGNC symbols were linked to a gene they are separated by two underscores. Note also that HumanCyc and Reactome may link multiple Entrez Gene IDs to a single gene, which in most cases will also result in multiple HGNC symbols. Similarly, KEGG and Reactome contain genes linked to multiple Ensembl Gene IDs.

Acknowledgements

We would like to thank Erik Roos for his contributions to the web application in the initial phase of this project and Joeri van der Velde for his contributions in the final phase. We also thank the reviewers for their helpful comments and suggestions for improving the presentation and comprehensibility of the paper and suggesting additional features of the web application. This research was carried out within the BioRange (project SP1.2.4) and BioAssist (project 4.1 'Molgenis') programmes of The Netherlands Bioinformatics Centre (NBIC; http://www.nbic.nl), supported by BSIK; Netherlands Proteomics Center grants through The Netherlands Genomics Initiative (NGI); the research programme of the Netherlands Consortium for Systems Biology (NCSB), which is part of the Netherlands Genomics Initiative / Netherlands Organization for Scientific Research. IT was supported by a European Research Council grant (N° 232816) and by a Marie Curie International Reintegration grant (N° 249261) within the 7th European Community Framework Program.


Supplementary material

Supplementary Figure S1 – Database scheme C2CardsHuman

Overview of the tables in the database of C2CardsHuman. Available at www.c2cards.nl. Only the overview tables and the table with the statistics of the comparison of the five human pathway databases are specific for C2CardsHuman. The SQL script needed to generate the database is available at http://www.molgenis.org/svn/c2cards/trunk/data/c2cardsdb_empty.sql.

Supplementary Table S1 – Transferred and obsolete identifiers and EC numbers per database

Number of transferred and obsolete EC numbers, gene and metabolite identifiers for each of the five pathway databases.

Genes (number of identifiers)

                    Entrez Gene IDs           Ensembl Gene IDs
Database            transferred  obsolete     obsolete
EHMN                4            24           37
H. sapiens Recon 1  10           5            x
HumanCyc            38           5            31
KEGG                1            0            12
Reactome            10           22           35

EC numbers (number of EC numbers)

Database            incomplete   transferred  obsolete
EHMN                41           4            1
H. sapiens Recon 1  2            8            1
HumanCyc            34           2            0
KEGG                34           0            0
Reactome            19           3            0

Metabolites (number of identifiers)

                    KEGG Compound                       KEGG Glycan            CAS          ChEBI                  PubChem Substance  PubChem Compound
Database            incorrectly  transferred  obsolete  incorrectly  obsolete  incorrectly  transferred  obsolete  not mapped to a    obsolete
                    formatted                           formatted              formatted                           Compound ID
EHMN                0            28           4         0            0         0            3            0         35                 0b
H. sapiens Recon 1  8a           14           21        50           2         7            x            x         x                  1
HumanCyc            0            9            0         x            x         2a           1            39        x                  106
KEGG                0            0            0         0            0         0            0            0         259                0b
Reactome            0            8            1         x            x         x            0            0         12                 0b

a One could not be corrected and was therefore removed.
b As the CID-SID.gz file from PubChem was used to convert the PubChem Substance IDs to PubChem Compound IDs these are naturally up-to-date.

An 'x' indicates that the particular identifier is not available for this database.

Supplementary File S1 – Example of a C2Card

A C2Card centered at an EC number may reveal possible alternative substrates, which is one of the sources of conflict between metabolic pathway databases (Stobbe et al, 2011). The C2Card centered at the EC number 1.1.1.35 (3-hydroxyacyl-CoA dehydrogenase) provides an example of this scenario. The C2Card was exported to an Excel file via the web application. This file contains, besides the core table of the C2Card, the overview of the reaction comparison and information on the metabolites, gene(s), and EC number(s) in the C2Card. The number of unique reactions, not taking into account compartmentalization, linked to the EC number 1.1.1.35 varies from 2 in HumanCyc and Recon 1 to 62 in EHMN, as shown in the first worksheet. Available at www.c2cards.nl.

General discussion

“Not all those who wander are lost.”

J.R.R. Tolkien (The Fellowship of the Ring)

Chapter 6

The interest in metabolism as a research topic is going through a marked revival (DeBerardinis and Thompson, 2012) as it has become clear that several of the most prevalent diseases, such as cancer, cardiovascular disease, diabetes, and obesity have a strong metabolic component. Cancer research groups, for example, are exploring the possibilities to develop drugs targeting the metabolic pathways involved (Hanahan and Weinberg, 2011). How to capture the increasing amount of knowledge on metabolism has become a firmly established topic of research in the relatively short history of bioinformatics and has led to the development of a multitude of metabolic pathway databases (see also: www.pathguide.org). The use of these databases in various types of analysis has become commonplace and examples of successful applications are plentiful. It remains, however, a challenge to gather all knowledge on metabolism and keep up with new discoveries. Moreover, due to the complexity of these networks, it is still a challenge to capture every detail of the metabolic network in a digital format that is suitable for a wide range of computational analyses. An accurate and complete description of metabolism is crucial to reach the ultimate goal of constructing in silico models that are capable of generating experimentally verifiable hypotheses, such as potential drug targets, or to simulate the effect of network perturbations such as loss of function. The systematic analyses described in this thesis provide an overview of the current status of human metabolic pathway databases regarding (i) how well they agree on the description of the metabolic network, and (ii) how the knowledge is represented in silico. We further uncovered some of the obstacles that need to be overcome to further refine the description of the human metabolic network and keep it in sync with new discoveries.

Differences in content

Given the extensive research efforts in the field of metabolism, in past and present, one would expect that pathway databases agree at least on the core metabolic processes, like carbohydrate, nucleotide, and amino acid metabolism. We indeed confirmed that for these core processes, a higher level of agreement exists between the five human pathway databases than for other parts of the network. However, the consensus for this core is still only 4% of around 3,700 reactions that these databases jointly contain. We identified several other explanations for the differences in content (Chapters 2 and 4). First of all, differences are caused by disagreements on the biology underlying the metabolic network, possibly because there is controversy in literature as well. Our analysis of the TCA cycle also showed that the descriptions are not always in agreement with literature (Chapter 4). Secondly, another important explanation for the differences in content is that the databases differ in the breadth and depth of their coverage of the metabolic network (Chapter 2). For example, lipid metabolism is described in greater detail in EHMN than in the other four databases. In general, each database has a particular focus and its curators have specific fields of expertise. Furthermore, each database is a work in progress, which is to be expected given that it is a time-consuming challenge for curators to cover the huge volume of articles. This is a daunting task even for a single pathway (Chapter 4). Thirdly, the comparison is hampered by the difficulty of relating metabolites between the different databases and, consequently, of establishing whether the databases describe the same reaction (Chindelevitch et al, 2012; Herrgård et al, 2008; Radrich et al, 2010). However, we have shown that this issue certainly does not explain all observed differences. In the comparison of the core metabolic processes the problem of matching metabolites is less pronounced, but the consensus is still small (Chapter 2). Finally, we revealed several conceptual differences, which partly cloud the true disagreements on the underlying biology of the metabolic network. Examples include the number of steps in which a process is described and the use of generic substrates, like ‘an amino acid’, in reactions versus describing every specific instance (Chapters 2 and 3).

In our comparison we focused on reactions, EC numbers and genes as the main components of the metabolic network. There are, however, even more aspects to consider: whether there is agreement on the direction of the reactions, on the compartments in which the reactions take place and whether a catalyst is a complex or not, and so forth. Even when we arrive at a consensus network for all these aspects, further details still have to be worked out. Ott and Vriend (2006) have shown that in KEGG mistakes are made in describing the structure of metabolites. This type of information is important for drug development. Furthermore, the metabolic network differs per tissue and even per cell type, while the five databases we analyzed all aim to describe what is referred to as the global human metabolic network. As argued by Khatri et al (2012), tissue information is also essential to improve the accuracy and relevance of pathway analyses. There are already some examples of tissue-specific networks deduced from a global reconstruction, such as HepatoNet1 (Gille et al, 2010) for liver metabolism. Agren et al (2012) recently published 69 cell type specific models of the human metabolic network, as a first step towards a Human Metabolic Atlas. This resource could in the future be used in the field of personalized medicine to enable a systems-level approach for analyzing patient data, such as gene expression profiles and metabolomics measurements.


Differences in knowledge representation

In our first comparison of the databases, we already observed several differences in how databases represent knowledge on metabolism (Chapter 2). Further analysis showed that widely different choices were made by the five databases in how and to what detail to represent the network in a structured way (Chapter 3). The choices made are often determined by the intended application domain of a database. For instance, H. sapiens Recon 1 is geared towards serving as a basis for mathematical models. For this purpose, it is important to accurately describe gene-protein-reaction relations and compartmentalization, and to ensure that the network is charge and mass balanced. KEGG chooses not to represent these aspects of the metabolic network and puts more emphasis on functional hierarchies of the components of the metabolic network and providing context for the analysis and interpretation of high-throughput data. The different requirements researchers have may be part of the reason why so many pathway databases have been developed.

Our analysis also revealed that not every detail of the metabolic network can yet be captured in a structured way. One example is the representation of fatty acid beta oxidation. Different solutions were chosen by the databases, but none captures the complete process for all fatty acids. Furthermore, unstructured text fields, which cannot be easily interpreted by computer programs, often contain additional information such as the tissue-specificity of enzymes, which could be used to (automatically) derive the metabolic network for a particular tissue. Data provenance, indicating the type of evidence supporting a piece of information, can also be improved in most databases. Information on evidence is important as it can be used to guide further experiments and allows users to retrieve only that part of the network for which there is a high degree of confidence. Only HumanCyc and H. sapiens Recon 1 provide the type of supporting evidence and the level of confidence for a piece of knowledge. In addition, a complete lack of knowledge also needs to be indicated explicitly. For example, for the ‘missing gene’ problem (Chapter 1) it is important to know whether the catalyst of a reaction is really unknown or whether the reaction takes place spontaneously. Only in HumanCyc is this difference explicitly annotated. Finally, the more elaborate a data model is, the more of a challenge it will be to acquire all necessary details. This is problematic in practice as describing a metabolic process in full detail is a very time-consuming process and requires extensive knowledge that may not even be available yet. At the same time, there is not always a clear-cut answer to the question of which level of detail is required to be able to perform a wide range of possible computational analyses.


Integration of databases

The results of the comparisons outlined above illustrate that differences between the databases are large both with respect to content and their representation of the network. It is therefore advisable that users carefully weigh their decision when selecting one of the databases, as this choice may affect data analysis results (Lee et al, 2008; Zelezniak et al, 2010). If possible, users should compare and contrast the outcome of their analysis using different databases to ensure robustness of the results. In retrospect, for our initial quest to find candidates for missing genes, the best option might have been to apply the algorithm to the network of each database and combine the results, ranking the candidates predicted by multiple networks higher. This would, however, have been a far from optimal solution. Instead, having a single, complete, and accurate description of the human metabolic network is to be preferred.

Integration of the multiple descriptions of the (human) metabolic network and, importantly, the reconciliation of the differences between them will lead to a more complete and more accurate description. Moreover, by integrating the individual databases we can profit to the fullest extent from all the knowledge, time, effort, and money that has already been put into these pathway databases. The same is true for other types of networks, including signaling and gene regulatory networks. Also for these networks, comparisons have shown that large differences exist between the various databases available (Bauer-Mehren et al, 2009; Kirouac et al, 2012). Similar issues will play a role when integrating these types of network, including problems in matching of network components and differences in representation. Moreover, the various cellular processes are not isolated, but are inherently intertwined. Therefore, also integration of databases describing different cellular processes is necessary. Integration on this level is further complicated by the heterogeneity of the data.

We now discuss three approaches to integrate multiple metabolic network descriptions, i.e., automatic, semi-automatic, and manual integration, and the strengths and weaknesses of each of these approaches.

Automatic integration

A fully automatic integration of the multiple metabolic pathway databases would clearly be the fastest approach. However, it is not possible to integrate them in a fully automated way (Chapter 2), even though this is commonly assumed to be the case. From a technical perspective, automatic integration is already quite challenging and with respect to reaching consensus on the underlying biology it is virtually impossible. From a technical perspective there are three main challenges. Firstly, retrieving the content of databases is cumbersome to automate. Each database requires a different approach that may even have to be adapted with each new release of a database, since underlying data models are subject to change. Some databases offer an Application Programming Interface (API), which should shield the user from such changes. However, the API itself may also be subject to change between subsequent releases. In addition, APIs do not always provide access to the entire content of a pathway database. Secondly, databases use different representations and definitions. The different pathway definitions used by each database, for example, make it impossible to integrate networks in a modular way, i.e., per pathway. It is important to realize that even the smallest difference in terminology and the definition of a concept needs to be accounted for when integrating databases. Humans may easily tell when different terms or concepts are equivalent, but computers need to be programmed to do so. The results of our comparison described in Chapter 3 provide guidance on how to translate the different representations into a single format. Thirdly, there is a lack of a common ground to compare metabolites. As yet, the problem of matching metabolites has not been resolved by naming standards (e.g., IUPAC) and small molecule databases (e.g., ChEBI) that aim to provide an unambiguous way to specify metabolites (Chapter 2). It has been suggested to use identifiers that depend on the structure of a metabolite, such as SMILES (Weininger, 1988) and InChI (McNaught, 2006), instead of identifiers from a particular metabolite database. However, in this case a difference in the level of detail with which a structure is described may prevent a straightforward match. The same holds for differences in protonation state. The question then is to what detail the structure of the metabolites needs to be the same to consider them a match.

Standard representation formats for molecular pathways, e.g., BioPAX (Demir et al, 2010) and SBML (Hucka et al, 2003), have been proposed to facilitate the information exchange between different resources (Strömbäck and Lambrix, 2005). We investigated the use of BioPAX for our comparison, but the BioPAX files turned out to be insufficient as information about genes encoding for the metabolic enzymes was not represented. BioPAX also did not resolve the issues related to matching of metabolites. Furthermore, although BioPAX reduces differences in terminology, most of the conceptual differences between the databases that prevent their integration remain (Chapter 3). For example, differences between databases in the representation of gene-protein-reaction relationships are still present in the BioPAX files. Consequently, even if all databases offer their data in the same exchange format - which is currently not the case - we would still need tailored scripts to get the data needed for our comparison from the different databases. In summary, curators will need to adhere to strict guidelines to make the BioPAX files more easily comparable.

The representation differences outlined above complicate an automatic approach for determining whether databases do not agree on the underlying biology or only made a different choice on how to represent the biology. For example, a difference in the number of steps in which a process is described could either point to an alternative route or to a difference in representation. Furthermore, the question remains how to combine the networks in an automated fashion. Simply taking their union will neither resolve conflicting information nor filter out erroneous information. Only including those parts on which the majority of the pathway databases agree is a better approach. However, this does not take into account that databases are not independent as they often share the same knowledge resources and may also have copied data from each other. Consequently, even if the majority agrees on a piece of knowledge it may still be an error that has been propagated through the databases. Moreover, it cannot be excluded that even the majority can be wrong and a single database with an opposing statement may be right. Manual curation is, therefore, crucial if one wants to combine all knowledge captured by these databases and to decide what is correct and what is not.
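The difference between taking the union and keeping only what a majority of databases agrees on can be illustrated with a small Python sketch. It assumes that reactions have already been matched and mapped to shared labels. The database names and reaction labels are hypothetical. As argued above, neither rule accounts for databases that are not independent of each other.

```python
from collections import Counter


def consensus(reaction_sets, min_support):
    """Reactions reported by at least `min_support` of the given databases."""
    support = Counter(
        reaction for reactions in reaction_sets.values() for reaction in set(reactions)
    )
    return {reaction for reaction, count in support.items() if count >= min_support}


# Hypothetical, already matched reaction labels per database.
databases = {
    "db_A": {"r1", "r2"},
    "db_B": {"r1", "r3"},
    "db_C": {"r1", "r2"},
}

union = consensus(databases, min_support=1)
majority = consensus(databases, min_support=2)
unanimous = consensus(databases, min_support=len(databases))
```

In this toy example the union keeps `r3` although only one database reports it, while the majority rule drops it without any check of whether that single database is in fact correct.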

Semi-automatic integration

Algorithms have recently been developed to integrate two networks in a semi-automatic manner to overcome some of the difficulties of a fully automated approach. One example is MetaMerge, which starts by matching metabolites and reactions (Chindelevitch et al, 2012). Next, users are asked to confirm the matched reactions and their metabolites. This manual step can be skipped, but this will give less reliable results. The reactions that match exactly form a core set, which is expanded by adding reactions of which almost all metabolites match. These new reaction matches are again shown to the user for approval. These two steps are repeated until no reactions are a good enough match. Next, the core network is complemented with reactions that could not be matched. In the resulting merged description conflicting reactions may have been included, which also need to be resolved manually. Another example of a semi-automatic integration procedure is the algorithm proposed by Radrich et al (2010), which they used to integrate two descriptions of the metabolic network of A. thaliana. The algorithm results in three merged descriptions with a decreasing degree of confidence. The core description contains only reactions that are found in both networks with a high degree of confidence. In the intermediate description reactions are included that are likely to match. The third level simply contains all remaining reactions from the original two descriptions. Also in this algorithm, part of the metabolite matches has to be checked manually. Furthermore, conflicting information on the second and third level still needs to be resolved manually. Due to the inability to match the metabolites automatically, Radrich et al could not integrate two other descriptions that are available for A. thaliana. Note that in both approaches when integrating more than two network descriptions, the size of the initial core description will be significantly smaller (Chapter 2), which limits its utility.

Manual integration

Efforts are ongoing in the form of reconstruction annotation jamborees to manually integrate the different descriptions of the metabolic network. In a jamboree, experts from multiple disciplines, including biochemistry, molecular biology and systems biology, come and work together on refining the description of the metabolic network of a particular organism. Jamborees have been held for multiple organisms already, including human (Thiele et al, submitted), and have also resulted in consensus networks (Herrgård et al, 2008; Thiele et al, 2011). Our C2Cards application can assist in such an endeavor by bringing the differences between descriptions of the metabolic network of the same organism to the attention of experts via concise overviews (Chapter 5). As we have shown, this may even lead to the conclusion that further biochemical characterization experiments are required. C2Cards provides a good starting point to construct a consensus network. A manual approach will require the commitment of a large group of experts from various research fields.

In summary, although the (semi-)automatic approaches proposed are likely to speed up the integration process, they address only part of the challenges we discussed above. It will require more than only technical solutions to integrate the multitude of available descriptions of the (human) metabolic network. Manual curation will remain necessary to resolve the conflicts and filter out erroneous data. This is, however, a huge undertaking. To reduce the manual effort required, adhering to standards or guidelines, like MIRIAM (Minimal Information Required In the Annotation of Models), should be a prerequisite for (pathway) databases. The standards need to be followed to the letter and coexistence of multiple interpretations, as we observed for BioPAX (Chapter 3), should be prevented. Furthermore, it is crucial that the metabolites, proteins, and genes are unambiguously identified. The effort that all this requires will be worthwhile in the end, as it will prevent knowledge from being lost to the community and allow other researchers to build upon the already discovered facts (Le Novère et al, 2005).


The road ahead: keeping up with new discoveries

The extent of our knowledge on cellular processes for different organisms varies, but in general it is still far from complete and new pieces of the puzzle continue to be discovered. Even for a classical metabolic pathway like the TCA cycle, discussion about this small but crucial part of metabolism continues (Chapter 4). Also in the latest version of 'Biochemistry' (Berg et al, 2012), one of the most popular student textbooks on the subject, a change was made to the description of this pathway. The importance of an accurate description of the cellular networks has been recognized by many research groups. Pathway databases are still being improved, traditionally by only a few researchers (Finn et al, 2012). For some organisms, multiple research groups even work independently on improving their own description of the metabolic network. Maintaining and refining these databases requires not only a continuing commitment of the research group, but also long-term funding. Similarly, initiatives like reconstruction annotation jamborees cannot just be a one-time effort and need to be repeated regularly. Unfortunately, it is still a major challenge to acquire funding for setting up and maintaining public databases. For precisely this reason KEGG, one of the most widely used pathway databases, had to switch to paid subscriptions to their FTP site (http://www.genome.jp/kegg/docs/plea.html). Interestingly, performing the experiments to acquire data is much more expensive yet easier to fund. Failing to make the newly discovered facts broadly available through a database may result in loss of the knowledge gained. To quote Amos Bairoch: “It’s quite depressive to think that we are spending millions in grants for people to perform experiments, produce new knowledge, hide this knowledge in often badly written text and then spend some more millions trying to second guess what the authors really did and found.” (Bairoch, 2009). The key to keeping up with new discoveries will be to stimulate active contributions from the scientific community or what is referred to as ‘social engineering’ (Kitano et al, 2011).

Social engineering

To complement the curation done by small groups of curators or during jamborees, it has been suggested to mobilize a much larger part of the life sciences community. One of the strengths of a social engineering approach is that directly involving the researchers who actually discovered a new scientific fact can avoid misinterpretation of their article. Furthermore, once a critical mass of people with the relevant expertise is reached, the curation process will be accelerated and conflicting information more easily resolved.


Various initiatives already exist in which the scientific community as a whole can assist in curating existing pathway databases. WikiPathways is a typical example (Pico et al, 2008), which follows the same strategy as Wikipedia. Everyone can contribute to this database by either improving the pathways provided by the database or adding new pathways using the embedded pathway editor PathVisio (van Iersel et al, 2008). The content of Reactome is also available in this format via a dedicated WikiPathways portal. In this way, improvements to pathways of the centralized database of Reactome can be proposed. We contributed to WikiPathways ourselves by improving the existing description of the TCA cycle (Chapter 4). Through its availability in WikiPathways, our description based on the knowledge of ten pathway databases, literature and expert knowledge can easily be adapted if new facts are discovered.

Kitano et al. (2011) proposed an open-flow model of knowledge aggregation, whereby both users and multiple pathway databases contribute their knowledge to a centralized forum. Refinements are fed back into the participating pathway databases. Our C2Cards application would fit in such a model. It could serve as the central forum in which the various available pathway databases are combined, including a current state-of-the-art consensus network, such as Recon 2 for human (Thiele et al, submitted). The concise overviews provided by a C2Card bring the differences between the consensus network and other descriptions to the attention of the scientific community (Chapter 5). Experts could then resolve these differences and add a nanopublication to the corresponding C2Card as supporting evidence. A nanopublication is a traceable author statement consisting of three parts: a statement, e.g., protein X (subject) catalyzes (predicate) reaction Y (object); the conditions under which the statement holds, e.g., a specific compartment; and the provenance of the statement, e.g., author and literature (Groth et al, 2010). Based on the contributions of experts, a team of curators will then decide to incorporate the necessary changes into the consensus network if enough evidence supports them.
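As an illustration only, the three parts of a nanopublication described above could be sketched as a small Python data structure. The class and field names are hypothetical and chosen for readability; they do not reproduce the format defined by Groth et al (2010), nor anything implemented in the C2Cards application, and the example values are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Statement:
    """Core assertion as a subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Nanopublication:
    """Hypothetical container for the three parts named in the text."""
    statement: Statement
    conditions: Tuple[str, ...]  # e.g. the compartment in which the statement holds
    authors: Tuple[str, ...]     # provenance: who made the statement
    literature: Tuple[str, ...]  # provenance: supporting references

    def is_traceable(self) -> bool:
        """Treat a statement as traceable if it names an author and a source."""
        return bool(self.authors) and bool(self.literature)


# Placeholder values mirroring the example in the text.
example = Nanopublication(
    statement=Statement("protein X", "catalyzes", "reaction Y"),
    conditions=("compartment: mitochondrion",),
    authors=("A. Expert",),
    literature=("placeholder reference",),
)

# The same statement without provenance would not count as traceable.
anonymous = Nanopublication(
    statement=example.statement,
    conditions=example.conditions,
    authors=(),
    literature=(),
)
```

Keeping the statement separate from its conditions and provenance means the same assertion can be compared across contributors while the supporting evidence for each contribution stays attached to it.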

It is already possible to add comments to a C2Card, enabling a first discussion on the inconsistencies observed. Furthermore, we are planning to add the option to automatically alert the curators when C2Cards are updated or added.

Lowering the barrier to contribute

For social engineering to be successful, a large community needs to be willing to spend the additional time and effort required. Its members may, however, feel that they do not directly benefit from this, although at the very least it provides an additional way to advertise one’s research (Finn et al, 2012). To entice experts to participate, contributions should be clearly acknowledged and the threshold to contribute should be kept low. In the C2Cards application, for example, curation is done at the level of only a single reaction or the metabolic functions of a single gene product (Chapter 5). Furthermore, by requiring experts to contribute in the form of nanopublications, as mentioned above, their contribution will become traceable and citable (Mons et al, 2011). Journals could play a role by requiring authors to contribute their results to initiatives such as WikiPathways. Several publishers already recommend that authors deposit, for example, their computational model of a biological process in the BioModels Database, a repository for sharing peer-reviewed and published mathematical models (Li et al, 2010). In a way, this is similar to the current requirements to deposit microarray and sequencing data in public databases.

Another approach would be to let authors semantically annotate their own articles. This would enable the automated retrieval of knowledge that can be used to further refine the description of cellular processes (Jensen and Bork, 2010) and significantly reduce the workload for curators of pathway databases. To assist authors and to reduce the manual effort in annotating an article, well-designed tools are required. The question, though, is to what extent the authors themselves are the best candidates to do this, as it can be quite challenging to correctly annotate an article in a systematic way (Jensen and Bork, 2010). Again, a community approach in which readers correct and enhance the annotation may be beneficial.

The issue remains that for funding agencies and tenure-track committees mainly the number of publications and their citation index truly count. Appropriate recognition for other types of contributions to science will require a cultural shift in the scientific arena to a situation in which contributions are not only measured in terms of publications. The Scholar Factor, proposed by Bourne and Fink, is an example of a metric that takes into account the number of entries a researcher has made in a public database (Bourne and Fink, 2008). In this way all scientific contributions of a researcher are acknowledged, rather than only the number of articles.

Concluding remarks

We would like to stress the importance of broad community efforts for unraveling the complete physiology of an organism. Improving currently available databases is a continuous effort, both with respect to their content and their data model. This, however, should not be used as a reason to justify the development of yet another database, but should instead encourage collaborations to further improve existing resources. The solution is not to continue to build more and more databases or to design new exchange formats, but to convince the community to contribute to existing initiatives, to stick to the existing exchange formats, and to lower the threshold for doing so. Ultimately, by joining forces we will be more capable of eliciting knowledge from the huge amounts of data being generated and constructing an accurate model of the cellular processes taking place in human and many other organisms.


Bibliography

1. Agren R, Bordel S, Mardinoglu A et al (2012) Reconstruction of Genome-Scale Active Metabolic Networks for 69 Human Cell Types and 16 Cancer Types Using INIT. PLoS Comput Biol 8: e1002518.
2. Antonov AV, Dietmann S, Mewes HW (2008) KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol. 9: R179.
3. Aral B, Benelli C, Ait-Ghezala G et al (1997) Mutations in PDX1, the human lipoyl-containing component X of the pyruvate dehydrogenase-complex gene on chromosome 11p1, in congenital lactic acidosis. Am J Hum Genet 61: 1318-1326.
4. Ashburner M, Ball CA, Blake JA et al (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25: 25-29.
5. Bader GD, Cary MP, Sander C (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34: D504-D506.
6. Bairoch A (2000) The ENZYME database in 2000. Nucleic Acids Res 28: 304-305.
7. Bairoch A (2009) The future of annotation/biocuration. Nature Precedings.
8. Bauer-Mehren A, Furlong LI, Sanz F (2009) Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol Syst Biol 5: 290.
9. Beltrame L, Calura E, Popovici RR et al (2011) The Biological Connection Markup Language: a SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics 27: 2127-2133.
10. Berg JM, Tymoczko JL, Stryer L (2012) Biochemistry. W.H. Freeman and Company, New York. 7th edition.
11. Berg JM, Tymoczko JL, Stryer L (2002) Biochemistry. W.H. Freeman and Company, New York. 5th edition.
12. Bierau J, Lindhout M, Bakker JA (2007) Pharmacogenetic significance of inosine triphosphatase. Pharmacogenomics 8: 1221-1228.
13. BioCyc (2012): Curator's Guide for Pathway/Genome Databases.
14. Bourne PE, Fink JL (2008) I Am Not a Scientist, I Am a Number. PLoS Comput Biol 4: e1000247.
15. Buchner E (1897) Alkoholische gärung ohne hefezellen (vorläufige mitteilung). Ber. Dtsch. Chem. Ges. 30: 117-124.
16. Bunik VI, Degtyarev D (2008) Structure-function relationships in the 2-oxo acid dehydrogenase family: substrate-specific signatures and functional predictions for the 2-oxoglutarate dehydrogenase-like proteins. Proteins 71: 874-890.
17. Caspi R, Altman T, Dreher K et al (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucl. Acids Res. 40: D742-D753.
18. Ceccarelli C, Grodsky NB, Ariyaratne N et al (2002) Crystal structure of porcine mitochondrial NADP+-dependent isocitrate dehydrogenase complexed with Mn2+ and isocitrate. J. Biol. Chem. 277: 43454-43462.
19. Cerami EG, Gross BE, Demir E et al (2010) Pathway Commons, a web resource for biological pathway data. Nucl. Acids Res.
20. Chang A, Scheer M, Grote A et al (2009) BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009. Nucleic Acids Res 37: D588-D592.
21. Chindelevitch L, Stanley S, Hung D et al (2012) MetaMerge: scaling up genome-scale metabolic reconstructions, with application to Mycobacterium tuberculosis. Genome Biology 13: R6.


22. Chowbina SR, Wu X, Zhang F et al (2009) HPD: an online integrated human pathway database enabling systems biology studies. BMC Bioinformatics 10 Suppl 11: S5.
23. Comte B, Vincent G, Bouchard B et al (2002) Reverse flux through cardiac NADP+-isocitrate dehydrogenase under normoxia and ischemia. Am J Physiol Heart Circ Physiol 283: H1505-H1514.
24. Cornell MJ, Alam I, Soanes DM et al (2007) Comparative genome analysis across a kingdom of eukaryotic organisms: Specialization and diversification in the Fungi. Genome Res 17: 1809-1822.
25. Croft D, O'Kelly G, Wu G et al (2011) Reactome: a database of reactions, pathways and biological processes. Nucl. Acids Res. 39: D691-D697.
26. Das AM, Illsinger S, Lücke T et al (2006) Isolated mitochondrial long-chain ketoacyl-CoA deficiency resulting from mutations in the HADHB gene. Clinical Chemistry 52: 530-534.
27. de Matos P, Alcántara R, Dekker A et al (2010) Chemical Entities of Biological Interest: an update. Nucl. Acids Res. 38: D249-D254.
28. DeBerardinis R, Thompson C (2012) Cellular metabolism and disease: What do metabolic outliers teach us? Cell 148: 1132-1144.
29. Demir E, Cary MP, Paley S et al (2010) The BioPAX community standard for pathway data sharing. Nat Biotech 28: 935-942.
30. Des Rosiers C, Donato LD, Comte B et al (1995) Isotopomer analysis of citric acid cycle and gluconeogenesis in rat liver. J. Biol. Chem. 270: 10027-10036.
31. Des Rosiers C, Fernandez CA, David F et al (1994) Reversibility of the mitochondrial isocitrate dehydrogenase reaction in the perfused rat liver. Evidence from isotopomer analysis of citric acid cycle intermediates. J. Biol. Chem. 269: 27179-27182.
32. Desgranges C, Razaka G, Rabaud M et al (1981) Catabolism of thymidine in human blood platelets: purification and properties of thymidine phosphorylase. Biochimica et Biophysica Acta (BBA) - Nucleic Acids and Protein Synthesis 654: 211-218.
33. Duarte NC, Becker SA, Jamshidi N et al (2007) Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad. Sci. U. S. A. 104: 1777-1782.
34. Elbers CC, van Eijk KR, Franke L et al (2009) Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet Epidemiol 33: 419-431.
35. Fernandez CA, Des Rosiers C (1995) Modeling of liver citric acid cycle and gluconeogenesis based on 13C mass isotopomer distribution analysis of intermediates. J. Biol. Chem. 270: 10037-10042.
36. Finn RD, Gardner PP, Bateman A (2012) Making your database available through Wikipedia: the pros and cons. Nucl. Acids Res. 40: D9-D12.
37. Gabriel JL, Zervos PR, Plaut GWE (1986) Activity of purified NAD-specific isocitrate dehydrogenase at modulator and substrate concentrations approximating conditions in mitochondria. Metabolism 35: 661-667.
38. Gille C, Bölling C, Hoppe A et al (2010) HepatoNet1: a comprehensive metabolic reconstruction of the human hepatocyte for the analysis of liver physiology. Mol Syst Biol 6.
39. Goffard N, Frickey T, Weiller G (2009) PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways. Nucleic Acids Res 37: W335-W339.
40. Goto S, Okuno Y, Hattori M et al (2002) LIGAND: database of chemical compounds and reactions in biological pathways. Nucl. Acids Res. 30: 402-404.


41. Green ML, Karp PD (2005) Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Res 33: 4035-4039.
42. Green ML, Karp PD (2006) The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Res 34: 3687-3697.
43. Greksak M, Lopes-Cardozo M, van den Bergh SG (1982) Citrate synthesis in intact rat-liver mitochondria is irreversible. Eur J Biochem 122: 423-427.
44. Groth P, Gibson A, Velterop J (2010) The anatomy of a nanopublication. Information Services and Use 30: 51-56.
45. Han G-S, Sreenivas A, Choi M-G et al (2005) Expression of human CTP synthetase in Saccharomyces cerevisiae reveals phosphorylation by protein kinase A. J. Biol. Chem. 280: 38328-38336.
46. Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144: 646-674.
47. Hao T, Ma HW, Zhao XM et al (2010) Compartmentalization of the Edinburgh Human Metabolic Network. BMC Bioinformatics 11: 393.
48. Hartong DT, Dange M, McGee TL et al (2008) Insights from retinitis pigmentosa into the roles of isocitrate in the Krebs cycle. Nat Genet 40: 1230-1234.
49. Herrgård MJ, Swainston N, Dobson P et al (2008) A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnol 26: 1155-1160.
50. Hettne KM, Stierum RH, Schuemie MJ et al (2009) A dictionary to identify small molecules and drugs in free text. Bioinformatics 25: 2983-2991.
51. Hoek JB, Nicholls DG, Williamson JR (1980) Determination of the mitochondrial protonmotive force in isolated hepatocytes. J. Biol. Chem. 255: 1458-1464.
52. Hoek JB, Rydström J (1988) Physiological roles of nicotinamide nucleotide transhydrogenase. Biochemical Journal 254: 1-10.
53. Holzhütter H-G, Drasdo D, Preusser T et al (2012) The virtual liver: a multidisciplinary, multilevel challenge for systems biology. WIREs Syst Biol Med 4: 221-235.
54. Hoppe A, Hoffmann S, Holzhütter H-G (2007) Including metabolite concentrations into flux balance analysis: thermodynamic realizability as a constraint on flux distributions in metabolic networks. BMC Syst Biol 1: 23.
55. Houtkooper RH, Cantó C, Wanders RJ et al (2010) The secret life of NAD+: an old metabolite controlling new metabolic signaling pathways. Endocrine Reviews 31: 194-223.
56. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19: 524-531.
57. Huss M, Holme P (2007) Currency and commodity metabolites: their identification and relation to the modularity of metabolic networks. IET Systems Biology 1: 280-285.
58. Jensen LJ, Bork P (2010) Ontologies in quantitative biology: a basis for comparison, integration, and discovery. PLoS Biol 8: e1000374.
59. Jerby L, Shlomi T, Ruppin E (2010) Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism. Mol Syst Biol 6: 401.
60. Johansson M (2003) Identification of a novel human uridine phosphorylase. Biochemical and Biophysical Research Communications 307: 41-46.
61. Johnson JD, Muhonen WW, Lambeth DO (1998) Characterization of the ATP- and GTP-specific succinyl-CoA synthetases in pigeon. J. Biol. Chem. 273: 27573-27579.


62. Kamburov A, Wierling C, Lehrach H et al (2009) ConsensusPathDB--a database for integrating human functional interaction networks. Nucleic Acids Res 37: D623-D628.
63. Kamburov A, Pentchev K, Galicka H et al (2011) ConsensusPathDB: toward a more complete picture of cell biology. Nucl. Acids Res. 39: D712-D717.
64. Kanehisa M, Goto S, Furumichi M et al (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38: D355-D360.
65. Kanehisa M, Goto S, Sato Y et al (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucl. Acids Res. 40: D109-D114.
66. Karp PD, Paley S, Krieger CJ et al (2004) An evidence ontology for use in pathway/genome databases. Pac Symp Biocomput., pp. 190-201.
67. Karp PD, Paley SM (1994) Representations of metabolic knowledge: pathways. Proc Int Conf Intell Syst Mol Biol.: 203-211.
68. Karp PD, Caspi R (2011) A survey of metabolic databases emphasizing the MetaCyc family. Archives of Toxicology 85: 1015-1033.
69. Karp PD, Mavrovouniotis ML (1994) Representing, analyzing, and synthesizing biochemical pathways. IEEE Expert: Intelligent Systems and Their Applications 9: 11-21.
70. Karp PD, Ouzounis CA, Moore-Kochlacs C et al (2005) Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucl. Acids Res. 33: 6083-6089.
72. Karp PD, Paley SM, Krummenacker M et al (2010) Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief Bioinform 11: 40-79.
73. Karp PD, Riley M (1993) Representations of metabolic knowledge. Proc Int Conf Intell Syst Mol Biol: 207-215.
74. Khatri P, Sirota M, Butte AJ (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8: e1002375.
75. Kirouac D, Saez-Rodriguez J, Swantek J et al (2012) Creating and analyzing pathway and protein interaction compendia for modelling signal transduction networks. BMC Systems Biology 6: 29.
76. Kitano H, Ghosh S, Matsuoka Y (2011) Social engineering for virtual 'big science' in systems biology. Nat Chem Biol 7: 323-326.
77. Kotera M, Okuno Y, Hattori M et al (2004) Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Chem Soc 126: 16487-16498.
78. Krebs HA, Holzach O (1952) The conversion of citrate into cis-aconitate and isocitrate in the presence of aconitase. Biochem J 52: 527-528.
79. Krebs HA, Johnson WA (1937) The role of citric acid in intermediate metabolism in animal tissues. Enzymologia 4: 148-156.
80. Krebs HA, Salvin E, Johnson WA (1938) The formation of citric and alpha-ketoglutaric acids in the mammalian body. Biochem J 32: 113-117.
81. Kubilus J, Lee LD, Baden HP (1978) Purification of thymidine phosphorylase from human amniochorion. Biochimica et Biophysica Acta (BBA) - Enzymology 527: 221-228.
82. Kuffner R, Zimmer R, Lengauer T (2000) Pathway analysis in metabolic databases via differential metabolic display (DMD). Bioinformatics 16: 825-836.
83. Lacroix V, Cottret L, Thebault P et al (2008) An introduction to metabolic networks and their structural analysis. IEEE/ACM. Trans. Comput. Biol. Bioinform. 5: 594-617.
84. Lambeth DO, Tews KN, Adkins S et al (2004) Expression of two succinyl-CoA synthetases with different nucleotide specificities in mammalian tissues. J. Biol. Chem. 279: 36621-36624.


85. Latendresse M, Krummenacker M, Trupp M et al (2012) Construction and completion of flux balance models from pathway databases. Bioinformatics 28: 388-396.
86. Lawson AM, Chalmers RA, Watts RWE (1976) Urinary organic acids in man. I. Normal patterns. Clinical Chemistry 22: 1283-1287.
87. Le Novère N, Finney A, Hucka M et al (2005) Minimum information requested in the annotation of biochemical models (MIRIAM). Nat Biotech 23: 1509-1515.
88. Le Novère N, Hucka M, Mi H et al (2009) The Systems Biology Graphical Notation. Nat Biotech 27: 735-741.
89. Lee DS, Park J, Kay KA et al (2008) The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci USA 105: 9880-9885.
90. Lee TJ, Pouliot Y, Wagner V et al (2006) BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics 7: 170.
91. Li C, Donizelli M, Rodriguez N et al (2010) BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology 4: 92.
92. Liu TC, Kim H, Arizmendi C et al (1993) Identification of two missense mutations in a dihydrolipoamide dehydrogenase-deficient patient. Proc. Natl. Acad. Sci. U. S. A. 90: 5186-5190.
93. Llopis J, McCaffery JM, Miyawaki A et al (1998) Measurement of cytosolic, mitochondrial, and Golgi pH in single living cells with green fluorescent proteins. Proc. Natl. Acad. Sci. U. S. A. 95: 6803-6808.
94. Lusis AJ, Attie AD, Reue K (2008) Metabolic syndrome: from epidemiology to systems biology. Nat Rev Genet 9: 819-830.
95. Ma H, Sorokin A, Mazein A et al (2007) The Edinburgh Human Metabolic Network reconstruction and its functional analysis. Mol Syst Biol 3: 135.
96. McCartney RG, Rice JE, Sanderson SJ et al (1998) Subunit interactions in the mammalian α-ketoglutarate dehydrogenase complex. Evidence for direct association of the alpha-ketoglutarate dehydrogenase and dihydrolipoamide dehydrogenase components. J. Biol. Chem. 273: 24158-24164.
97. McNaught A (2006) The IUPAC International Chemical Identifier: InChI - A New Standard for Molecular Informatics. Chemistry International 28.
98. Medina I, Carbonell J, Pulido L et al (2010) Babelomics: an integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling. Nucl. Acids Res. 38: W210-W213.
99. Mewies M, McIntire WS, Scrutton NS (1998) Covalent attachment of flavin adenine dinucleotide (FAD) and flavin mononucleotide (FMN) to enzymes: The current state of affairs. Protein Science 7: 7-21.
100. Mo ML, Palsson BØ (2009) Understanding human metabolic physiology: a genome-to-systems approach. Trends Biotechnol 27: 37-44.
101. Molven A, Matre GE, Duran M et al (2004) Familial hyperinsulinemic hypoglycemia caused by a defect in the SCHAD enzyme of mitochondrial fatty acid oxidation. Diabetes 53: 221-227.
102. Mons B, van Haagen H, Chichester C et al (2011) The value of data. Nature Genetics.
103. Morgat A, Coissac E, Coudert E et al (2012) UniPathway: a resource for the exploration and annotation of metabolic pathways. Nucl. Acids Res. 40: D761-D769.
104. Mullen AR, Wheaton WW, Jin ES et al (2012) Reductive carboxylation supports growth in tumour cells with defective mitochondria. Nature 481: 385-388.


105. Oberhardt MA, Palsson BØ, Papin JA (2009) Applications of genome-scale metabolic reconstructions. Mol Syst Biol 5: 320.
106. Orth JD, Palsson BØ (2010) Systematizing the generation of missing metabolic knowledge. Biotechnol. Bioeng. 107: 403-412.
107. Orth JD, Thiele I, Palsson BØ (2010) What is flux balance analysis? Nat Biotech 28: 245-248.
108. Ostergaard E, Christensen E, Kristensen E et al (2007) Deficiency of the α subunit of succinate- ligase causes fatal infantile lactic acidosis with mitochondrial DNA depletion. Am J Hum Genet 81: 383-387.
109. Ott M, Vriend G (2006) Correcting ligands, metabolites, and pathways. BMC Bioinformatics 7: 517.
110. Pico AR, Kelder T, van Iersel MP et al (2008) WikiPathways: pathway editing for the people. PLoS Biol 6: e184.
111. Plaut GWE, Sung SC (1954) Diphosphopyridine nucleotide isocitric dehydrogenase from animal tissues. J. Biol. Chem. 207: 305-314.
112. Porcelli AM, Ghelli A, Zanna C et al (2005) pH difference across the outer mitochondrial membrane measured with a green fluorescent protein mutant. Biochem Biophys Res Commun 326: 799-804.
113. Radrich K, Tsuruoka Y, Dobson P et al (2010) Integration of metabolic databases for the reconstruction of genome-scale metabolic networks. BMC Syst Biol 4: 114.
114. Ramakrishna M, Krishnaswamy PR (1966) Formation of enzyme bound carbon dioxide in the reductive carboxylation of alpha-ketoglutarate by isocitrate dehydrogenase. Biochem Biophys Res Commun 25: 378-382.
115. Reactome Glossary (2012): http://wiki.reactome.org/index.php/Glossary#P
116. Rey G, Cesbron F, Rougemont J et al (2011) Genome-wide and phase-specific DNA-binding rhythms of BMAL1 control circadian output functions in mouse liver. PLoS Biol 9(2): e1000595.
117. Rolfsson O, Palsson BØ, Thiele I (2011) The human metabolic reconstruction Recon 1 directs hypotheses of novel human metabolic functions. BMC Syst Biol 5: 155.
118. Romero P, Wagg J, Green ML et al (2004) Computational prediction of human metabolic pathways from the complete human genome. Genome Biol 6: R2.
119. Saier MHJr, Yen MR, Noto K et al (2009) The transporter classification database: recent advances. Nucleic Acids Res 37: D274-D278.
120. Sanctorius S (1614) Ars de statica medicina.
121. Sazanov LA, Jackson JB (1994) Proton-translocating transhydrogenase and NAD- and NADP-linked isocitrate dehydrogenases operate in a substrate cycle which contributes to fine regulation of the tricarboxylic acid cycle activity in mitochondria. FEBS letters 344: 109-116.
122. Schellenberger J, Park JO, Conrad TM et al (2010) BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics 11: 213.
123. Sheu K-FR, Blass JP (1999) The α-ketoglutarate dehydrogenase complex. Ann N Y Acad Sci 893: 61-78.
124. Siebert G, Carsiotis M, Plaut GWE (1957) The enzymatic properties of isocitric dehydrogenase. J. Biol. Chem. 226: 977-991.
125. Soh D, Dong D, Guo Y et al (2010) Consistency, comprehensiveness, and compatibility of pathway databases. BMC Bioinformatics 11: 449.
126. Stein LD (2003) Integrating biological databases. Nat Rev Genet 4: 337-345.


127. Stobbe MD, Houten SM, Jansen GA et al (2011) Critical assessment of human metabolic pathway databases: a stepping stone for future integration. BMC Syst Biol 5: 165.
128. Strömbäck L, Lambrix P (2005) Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics 21: 4401-4407.
129. Swertz MA, Dijkstra M, Adamusiak T et al (2010) The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button. BMC Bioinformatics 11: S12.
130. Szklarczyk D, Franceschini A, Kuhn M et al (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucl. Acids Res. 39: D561-D568.
131. Tanaka T, Ikeo K, Gojobori T (2006) Evolution of metabolic networks by gain and loss of enzymatic reaction in eukaryotes. Gene 365: 88-94.
132. Thiele I, Palsson BØ (2010a) Reconstruction annotation jamborees: a community approach to systems biology. Mol Syst Biol 6: 361.
133. Thiele I, Hyduke DR, Steeb B et al (2011) A community effort towards a knowledge-base and mathematical model of the human pathogen Salmonella Typhimurium LT2. BMC Syst Biol 5: 8.
134. Thiele I, Palsson BØ (2010b) A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat. Protocols 5: 93-121.
135. van Iersel M, Kelder T, Pico A et al (2008) Presenting and exploring biological pathways with PathVisio. BMC Bioinformatics 9: 399.
136. Vastrik I, D'Eustachio P, Schmidt E et al (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol 8: R39.
137. Wanders RJA, Ijlst L, Poggi F et al (1992) Human trifunctional protein deficiency: A new disorder of mitochondrial fatty acid β-oxidation. Biochemical and Biophysical Research Communications 188: 1139-1145.
138. Wanders RJA, Ijlst L, van Gennip AH et al (1990) Long-chain 3-hydroxyacyl-CoA dehydrogenase deficiency: identification of a new inborn error of mitochondrial fatty acid β-oxidation. Journal of Inherited Metabolic Disease 13: 311-314.
139. Watanabe SI, Uchida T (1995) Cloning and expression of human uridine phosphorylase. Biochemical and Biophysical Research Communications 216: 265-272.
140. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28: 31-36.
141. Willemoës M (2004) Competition between ammonia derived from internal glutamine hydrolysis and hydroxylamine present in the solution for incorporation into UTP as catalysed by Lactococcus lactis CTP synthase. Archives of Biochemistry and Biophysics 424: 105-111.
142. Wise DR, Ward PS, Shay JES et al (2011) Hypoxia promotes isocitrate dehydrogenase-dependent carboxylation of α-ketoglutarate to citrate to support cell growth and viability. Proc. Natl. Acad. Sci. U. S. A. 108: 19611-19616.
143. Wittig U, De Beuckelaer A (2001) Analysis and comparison of metabolic pathway databases. Brief Bioinform 2: 126-142.
144. Yoshimura A, Kuwazuru Y, Furukawa T et al (1990) Purification and tissue distribution of human thymidine phosphorylase; high expression in lymphocytes, reticulocytes and tumors. Biochimica et Biophysica Acta (BBA) - General Subjects 1034: 107-113.
145. Zelezniak A, Pers TH, Soares S et al (2010) Metabolic network topology reveals transcriptional regulatory signatures of type 2 diabetes. PLoS Comput Biol 6: e1000729.
146. Zhang P, Foerster H, Tissier CP et al (2005) MetaCyc and AraCyc. Metabolic pathway databases for plant research. Plant Physiology 138: 27-37.


Summary

Metabolic processes are constantly taking place in our body, whether we eat, sleep or exercise. Food is broken down to generate energy, for example, to enable our muscles to contract and to synthesize building blocks needed to maintain our body’s cells. In the 19th century researchers first realized that metabolism can be viewed as a network of connected biochemical reactions. Enzymes catalyze these reactions, that is, they enable them and accelerate the rate at which they take place. Several of the most prevalent diseases in modern society, including diabetes, cardiovascular disease and obesity, involve disruptions of metabolic processes. In patients with diabetes mellitus, for example, such a disruption results in an excessive amount of glucose in the blood. Metabolism also plays an important role in cancer, as a tumor requires energy to grow.

The study of metabolism has a long history, with the first scientific article on a metabolic process dating back to the 17th century¹. To date, nearly six million articles on metabolism have been published according to Medline, the largest repository of biomedical articles. New insights continue to be published, which all provide pieces of the puzzle about the mechanisms of metabolic processes in human and many other organisms. We cannot, however, fully understand metabolism if we only study the individual parts of the network of reactions. This would be similar to trying to understand the principles of powered flight and the working details of a modern aircraft by only considering the components of the airplane laid out on a hangar floor². In the case of metabolism, we therefore need to know how the reactions and the enzymes catalyzing them fit together and act as a whole. To this end, databases have been developed to collect and organize the current knowledge on metabolism scattered across a multitude of scientific articles. These databases, referred to as 'metabolic pathway databases', are like a digital encyclopedia and can often be freely accessed via a standard web browser. An important long-term goal in the field of bioinformatics is to use the knowledge gathered in such databases to build an accurate computer model of (human) metabolism. Such a model could be used to run simulations to determine the effect of disruptions of metabolic processes in a certain disease and to predict possible drug targets for treating the disease.

¹ Sanctorius S (1614) Ars de statica medicina.
² Vastrik I et al (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biology 8: R39.

Gathering all knowledge on metabolism to build an accurate model and keeping track of new discoveries is a huge challenge. The literature on metabolism is extensive, yet conclusive evidence is not available for every piece of the (human) metabolic network. A second challenge is that the metabolic network needs to be represented in an electronic format that a computer is able to understand and work with. Notwithstanding these challenges, various initiatives have already resulted in more than ten human metabolic pathway databases. One would expect these databases to contain largely the same information. In this thesis we show that the contrary is true. The differences between the databases are extensive, which may influence computational analyses based on these databases. We also propose ways to resolve these differences and arrive at a more complete and more accurate description of human metabolism.

We compared five frequently used human metabolic pathway databases and showed that, of the nearly 7,000 reactions the databases jointly contain, only 199 are common to all five (Chapter 2). We also zoomed in on a single well-known metabolic process, the tricarboxylic acid (TCA) cycle, which serves as an example in nearly every biology and chemistry curriculum. This process plays a key role in generating energy. The entire process was described for the first time in 1937 by Hans Krebs³, for which he was awarded the Nobel Prize. Our comparison showed that even for this well-studied process there is considerable disagreement between the five databases. We identified several explanations for the lack of consensus between the five descriptions of the metabolic network. One important explanation is that the databases complement each other, i.e., to some extent they provide different pieces of the puzzle of the complete description of human metabolism. To profit from all the time, effort and money that have already been put into the development of these different databases, we should strive towards integrating the knowledge gathered. In this way we can further improve the digital description of human metabolism.
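
The kind of overlap count reported above can be illustrated with a minimal Python sketch. It assumes that the reactions of the different databases have already been matched to shared identifiers, which is itself a major part of the work. The database names (`db_a` to `db_e`) and reaction identifiers are hypothetical toy data, not taken from the databases compared in this thesis.

```python
from collections import Counter

# Hypothetical toy data: each database maps to the set of reaction
# identifiers it contains. In practice the databases use different
# identifier schemes, so reactions must be matched before this step.
reactions_by_db = {
    "db_a": {"R1", "R2", "R3", "R4"},
    "db_b": {"R1", "R2", "R5"},
    "db_c": {"R1", "R3", "R5", "R6"},
    "db_d": {"R1", "R2", "R3"},
    "db_e": {"R1", "R6", "R7"},
}


def consensus_and_union(sets_by_db):
    """Return (reactions present in every database, reactions present in any)."""
    all_sets = list(sets_by_db.values())
    return set.intersection(*all_sets), set.union(*all_sets)


def coverage_histogram(sets_by_db):
    """Map k to the number of reactions present in exactly k databases."""
    per_reaction = Counter(r for s in sets_by_db.values() for r in s)
    return Counter(per_reaction.values())


consensus, union = consensus_and_union(reactions_by_db)
print(f"{len(consensus)} of {len(union)} reactions are shared by all databases")
for k, n in sorted(coverage_histogram(reactions_by_db).items()):
    print(f"{n} reaction(s) occur in exactly {k} database(s)")
```

The histogram is included because a small consensus can mean either conflict or complementarity: reactions found in only one database may be errors, or they may be pieces of the network that the other databases simply lack.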

³ Krebs HA, Johnson WA (1937) The role of citric acid in intermediate metabolism in animal tissues. Enzymologia 4: 148-156.

Integrating the knowledge contained in the various databases is, however, easier said than done. The databases have made widely different choices on how to represent the metabolic network in a digital format (Chapter 3). Several formats have been proposed to standardize how knowledge should be stored and to enable integration and exchange of data between different databases. However, we show in our analysis that even such a standardized format is not enough to resolve all differences in representation of the metabolic network. Another issue that makes the comparison and integration of databases difficult is that different names are used to refer to the same biochemical compounds and enzymes. Multiple naming standards have been proposed in the literature, but these are not always used, or the databases use different conventions. Another important challenge is that simply combining the different descriptions of human metabolism provided by the databases would neither resolve conflicting information nor filter out mistakes. For these reasons a completely automated integration is impossible.
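
The naming problem described above can be made concrete with a small Python sketch. It assumes a hand-made synonym table (`SYNONYMS`, hypothetical and far from complete) that maps alternative compound names to one canonical label before two compound lists are compared. It is an illustration only, not the matching procedure used in this thesis.

```python
# Hypothetical synonym table: lower-cased name -> canonical label.
# A real table would be derived from curated identifier resources and
# would still leave ambiguous cases for manual inspection.
SYNONYMS = {
    "alpha-ketoglutarate": "2-oxoglutarate",
    "2-oxoglutarate": "2-oxoglutarate",
    "l-malate": "(S)-malate",
    "(s)-malate": "(S)-malate",
}


def canonical(name):
    """Return the canonical label for a name, or the cleaned name itself."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def shared_compounds(names_a, names_b):
    """Compounds present in both lists after name normalization."""
    return {canonical(n) for n in names_a} & {canonical(n) for n in names_b}


db_a_names = {"Alpha-Ketoglutarate", "L-Malate"}
db_b_names = {"2-oxoglutarate", "(S)-malate", "succinate"}

print("literal overlap:   ", db_a_names & db_b_names)
print("normalized overlap:", shared_compounds(db_a_names, db_b_names))
```

Without normalization the two lists share nothing, although they describe two identical compounds. Names missing from the table fall through unchanged, which is one reason why manual curation remains necessary.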

For the TCA cycle we manually integrated the descriptions as given by ten different human metabolic pathway databases (Chapter 4). We identified and resolved the differences between the descriptions using literature and the knowledge of two experts in the field of metabolism. In this way, we were able to propose an improved description of the TCA cycle. This endeavor illustrates the importance of going back to the biology as described in the literature and the crucial role of experts in resolving the differences between the databases. It is, however, quite a time-consuming endeavor, even for a metabolic process with a relatively small number of steps. To do so for all metabolic processes will require a combined effort of a large group of experts. Moreover, it needs to be an ongoing process as new facts on metabolism continue to be discovered.

We built a web application called 'Consensus and Conflict Cards' (C2Cards) that can be used in the quest to further improve the description of the human metabolic network (Chapter 5). The application provides concise and easy-to-retrieve overviews of the differences between five metabolic pathway databases. Experts can use the C2Cards to try to resolve the disagreements on, for example, which enzyme catalyzes a specific reaction. By showing the different views the databases have on a piece of the metabolic network, both controversial and complementary biological knowledge may be revealed. A number of case studies illustrate that in some cases additional biochemical experiments are required to resolve the differences observed. By making the C2Card application available as a webpage, the threshold for experts to use the tool is low.
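
The idea of an overview that shows, per reaction, which databases agree and which disagree can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the implementation of the C2Cards application: the triples, the names `db_a` to `db_c`, and the rule that a single enzyme per reaction counts as consensus are all hypothetical.

```python
from collections import defaultdict

# Hypothetical annotations as (database, reaction, enzyme) triples.
annotations = [
    ("db_a", "reaction_1", "enzyme_X"),
    ("db_b", "reaction_1", "enzyme_X"),
    ("db_c", "reaction_1", "enzyme_Y"),
    ("db_a", "reaction_2", "enzyme_Z"),
    ("db_b", "reaction_2", "enzyme_Z"),
]


def build_cards(triples):
    """Group, per reaction, the databases that support each enzyme."""
    cards = defaultdict(lambda: defaultdict(set))
    for db, reaction, enzyme in triples:
        cards[reaction][enzyme].add(db)
    return cards


def status(card):
    """'consensus' if all databases name the same enzyme, else 'conflict'."""
    return "consensus" if len(card) == 1 else "conflict"


cards = build_cards(annotations)
for reaction, card in sorted(cards.items()):
    print(reaction, status(card))
    for enzyme, dbs in sorted(card.items()):
        print(f"  {enzyme}: {', '.join(sorted(dbs))}")
```

A 'conflict' in this sketch only flags a difference. Whether the databases contradict or complement each other still has to be judged by an expert, possibly with additional experiments.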

In conclusion, the results described in this thesis give a clear signal to the scientific community that it is of great importance to integrate the knowledge captured by multiple databases. Our analyses and the C2Cards tool provide a starting point for such an endeavor. The contribution and continuing commitment of a broad community of experts will be necessary to bring us closer to reaching the ultimate goal of a highly accurate model of (human) metabolism.


Summary (translated from the Dutch 'Samenvatting')

Metabolism takes place in our body continuously, whether we are eating, sleeping or exercising. Food is converted into energy, for example to enable our muscles to contract and to produce the building blocks needed to maintain our body's cells. In the 19th century researchers first realized that metabolism can be viewed as a network of linked biochemical reactions. Enzymes catalyze these reactions, that is, they make it possible for reactions to take place and increase the rate of reactions. Several common diseases of affluence, such as diabetes, cardiovascular disease and obesity, involve a disruption of metabolic processes. In patients with diabetes mellitus, for example, this results in an excessive amount of sugar in the blood. Metabolism also plays an important role in cancer, since a tumor needs energy to be able to grow.

Research into metabolism has a long history. The first scientific article on a metabolic process dates from the 17th century¹. To date, nearly six million articles on metabolism have been published according to Medline, the largest database of biomedical articles. New articles on this subject are still being published, each contributing its part to completing our knowledge of metabolism in humans and many other (animal) species. However, we cannot fully fathom metabolism if we look solely at the individual parts of the metabolic network. In the same way, we cannot understand how an airplane can fly if we consider only its separate components². In the case of metabolism we need to know how the reactions and the enzymes that catalyze them are connected and work together as a whole. Because our current knowledge of metabolism is scattered across a large number of articles, dedicated databases have been set up to collect and integrate this knowledge. These databases are comparable to a digital encyclopedia and are usually freely accessible via a standard web browser. An important long-term goal in bioinformatics is to use the knowledge collected in these databases to build a computer model of (human) metabolism. Such a model can, for example, be used to run simulations to determine the effect of disruptions of metabolism in a disease. Ultimately, this can also contribute to the selection or development of new treatments, such as drugs.

¹ Sanctorius S (1614) Ars de statica medicina.
² Vastrik I et al (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biology 8: R39.

Collecting all knowledge on metabolism with the aim of building an accurate computer model is an enormous challenge. The literature on metabolism is extensive and, at the same time, conclusive evidence is not available for every part of the human metabolic network. A second challenge is to store the metabolic network in an electronic format such that a computer can understand it and work with it. Nevertheless, various initiatives have by now resulted in more than ten databases on human metabolism. One would expect these databases to contain largely the same information, but in this thesis we show that this is not the case. The differences between the databases are enormous, which can also affect the results of computational analyses that make use of these databases. Furthermore, we propose ways in which these differences can be resolved in order to arrive at a more complete and better description of human metabolism.

We compared five frequently used databases that all describe human metabolism. We show that, of the nearly 7,000 reactions that the databases jointly contain, only 199 can be found in all five databases (Chapter 2). We looked in more detail at the description of the citric acid cycle, a well-known metabolic process that serves as an example in nearly every biology and chemistry curriculum. The citric acid cycle plays an important role in generating energy and was first described in its entirety in 1937 by Hans Krebs³, for which he was awarded the Nobel Prize. Our comparison shows that even for this well-studied process there are large differences between the five databases. We identified a number of explanations for this lack of consensus between the five descriptions of the metabolic network. One important explanation is that the databases complement each other; they sometimes describe different pieces of the human metabolic network. To profit from all the time, effort and money that have already been invested in the development of these different databases, we should strive to combine all the knowledge that has been gathered. In this way we can further improve the digital description of human metabolism.

³ Krebs HA, Johnson WA (1937) The role of citric acid in intermediate metabolism in animal tissues. Enzymologia 4: 148-156.


Integrating the knowledge in the various databases is, however, easier said than done. The databases have made very different choices in the way the metabolic network is represented in a digital format (Chapter 3). Several formats have been proposed to standardize the way knowledge is stored and thereby enable integration and exchange of data between databases. Our analysis has shown, however, that even a standardized format is not enough to resolve all differences in the representation of the metabolic network. Another problem that makes the comparison and integration of databases difficult is that different names are used for the same biochemical compounds and enzymes. Multiple naming standards have been proposed in the literature, but these are not always used, or the databases use different standards. Another important challenge is that merely combining the different descriptions of human metabolism is not enough, since this neither resolves contradictions nor corrects possible errors. For these reasons a fully automatic integration is impossible.

For the citric acid cycle we manually integrated ten descriptions from different databases (Chapter 4). We identified and resolved the differences between the descriptions with the help of a literature study and the knowledge of two experts in the field of metabolism. In this way we were able to improve the description of the citric acid cycle. This effort illustrates that it is important to go back to the biology, as described in the literature, and that experts play a crucial role in resolving differences between databases. It is, however, a rather time-consuming activity, even for a metabolic process with a relatively small number of steps. Doing the same for all metabolic processes requires the joint effort of a large group of experts. Moreover, it needs to be a continuous process, since new facts about metabolism are still being discovered.

We built a web application called 'Consensus and Conflict Cards' (C2Cards) that can be used to further improve the description of the human metabolic network (Chapter 5). The application generates concise and easy-to-retrieve overviews of the differences between five databases that describe human metabolism. Experts can use the C2Cards to try to resolve the disagreements, for example on which enzyme catalyzes a specific reaction. By making visible the different views the databases have on a piece of the metabolic network, both controversial and complementary biological knowledge can be revealed. A number of our examples show that in some cases further biochemical experiments are needed to resolve the differences observed. By offering the C2Card application as a webpage, the threshold for experts to use this application is low.

An important conclusion of this thesis is that it is of great importance to combine the knowledge gathered by the various databases. Our analyses and the C2Cards application form a suitable starting point for this. The contribution and continuing commitment of a broad community of experts are needed to come closer to the ultimate goal of an accurate model of (human) metabolism.


Curriculum Vitae

Miranda Daniëlle Stobbe was born on March 24th, 1983 in Laren and grew up in Huizen. After finishing the gymnasium at the Willem de Zwijger College in Bussum, she studied Medical Technical Computer Science at Utrecht University. During her Master's programme she specialized in Biomedical Image Sciences. She did her Master's project at the Image Science Institute in the University Medical Center Utrecht on the automatic detection of red lesions in retinal images. Directly after finishing her Master's project she started as a PhD student in the bioinformatics group at the Academic Medical Center in Amsterdam. Under the supervision of Dr. ir. P.D. Moerland and Prof. dr. A.H.C. van Kampen she analyzed and compared metabolic pathway databases as described in this thesis. During her PhD period she was secretary of RSG Netherlands, the network for young researchers in the field of bioinformatics in the Netherlands. As of November 2012 she will hold a postdoctoral position at the Institute for Research in Biomedicine in Barcelona, where she will switch focus to cancer-related research. She will work on the project titled: "Integration of genome and proteome wide approaches in the regulation of gene expression by stress-activated kinases".

List of publications

1. Miranda D. Stobbe, Sander M. Houten, Gerbert A. Jansen, Antoine H.C. van Kampen, Perry D. Moerland (2011). Critical assessment of human metabolic pathway databases: a stepping stone for future integration. BMC Systems Biology 5: 165.

2. Miranda D. Stobbe, Gerbert A. Jansen, Perry D. Moerland, Antoine H.C. van Kampen. Knowledge representation in metabolic pathway databases (accepted for publication in Briefings in Bioinformatics, August 17, 2012).

3. Miranda D. Stobbe*, Sander M. Houten*, Antoine H.C. van Kampen, Ronald A. Wanders, Perry D. Moerland. Improving the description of metabolic networks: the TCA cycle as example (accepted for publication in the FASEB Journal, May 21, 2012). *These authors contributed equally.

4. Miranda D. Stobbe, Morris A. Swertz, Ines Thiele, T. Rengaw, Antoine H.C. van Kampen, Perry D. Moerland. Consensus and Conflict Cards for metabolic pathway databases (submitted to BMC Systems Biology).

5. Ines Thiele*, Neil Swainston*, Ronan M.T. Fleming, Andreas Hoppe, Swagatika Sahoo, Maike K. Aurich, Hulda Haraldsdottir, Monica L. Mo, Ottar Rolfsson, Miranda D. Stobbe, Stefan G. Thorleifsson, Rasmus Agren, Christian Bölling, Sergio Bordel, Arvind K. Chavali, Paul Dobson, Warwick B. Dunn, Lukas Endler, Igor Goryanin, Steinn Gudmundsson, David Hala, Michael Hucka, Duncan Hull, Daniel Jameson, Neema Jamshidi, Janette Jones, Jon J. Jonsson, Nick Juty, Sarah Keating, Intawat Nookaew, Nicolas Le Novère, Naglis Malys, Alexander Mazein, Jason A. Papin, Yogendra Patel, Nathan D. Price, Evgeni Selkov, Sr., Martin I. Sigurdsson, Evangelos Simeonidis, Nikolaus Sonnenschein, Kieran Smallbone, Anatoly Sorokin, Hans Van Beek, Dieter Weichart, Jens B. Nielsen, Hans V. Westerhoff, Douglas B. Kell, Pedro Mendes, Bernhard Ø. Palsson. A community-driven global reconstruction of human metabolism (submitted to Nature Biotechnology). *These authors contributed equally.


Acknowledgements

“If I have seen further it is by standing on the shoulders of giants.”

Sir Isaac Newton (Letter to Robert Hooke)


It has been a long and challenging journey, but certainly also a very rewarding one. I have learnt a lot from all the people I have met along the way. The thesis that you now have in your hands could not have been completed without the help of many people. I want to thank all of you! There are a number of people I would like to thank specifically:

First of all I want to thank my co-supervisor, Perry. None of this would have been possible without all your help. I admire your patience and your ability to understand what I meant even when all you had to go on was an incoherent story. Thank you for all the energy you put into every abstract, presentation, poster and article. Since behind every great man there stands a strong woman, I would also like to thank Rachel.

Antoine, thank you for your fresh perspective on the various articles and for arranging a much-needed one-year contract extension. I also want to thank you for encouraging me to join the board of RSG Netherlands. But perhaps I should keep the rest short? So: thank you!

Prof. dr. S. Brul, Dr. ir. C.T.A. Evelo, Prof. dr. W.J. Stiekema, Prof. dr. B. Teusink, Prof. dr. R.J.A. Wanders and Prof. dr. L.F.A. Wessels: thank you for your willingness to serve on my doctorate committee. Dr. I. Thiele, thank you for taking a seat on my doctorate committee. Prof. dr. B. Mons, thank you for your willingness to take part in my doctorate committee as guest opponent.

Gerbert and Marcel, my regular office mates for the past six years, thank you for all your moral support and good company. Being able to share the ups and downs helped me enormously to keep going until the end. Gerbert, thank you also for your biological perspective in our collaboration on two articles. Marcel, good luck with your PhD; I am curious to see your thesis! I would also like to thank our 'rotating' office mates (Jasper, Wiebe, Aviral, Christin, Herman, Mia). Fortunately, we also laughed a great deal over the past years.

Barbera and Angela, colleagues from the very beginning, thank you for all the enjoyable coffee breaks, lunches, drinks, outings and conferences! Barbera, thank you for being willing to take photos during my defense.


Umesh and Shayan, fellow travelers so to speak, thank you and good luck with your PhD projects. I am looking forward to receiving your booklets. Umesh, do not forget the promise you made me, only a few more months to go! Shayan, I am sure you will soon find the perfect research question. Aldo, Eelke, Jack, the e-bioscience group and former BioLab colleagues, many thanks to you as well; I had a great time! Joris, I certainly also learned from the experience of supervising you in the final phase of your studies. Good luck with your further career!

Iris, we had almost stopped believing it was possible, but now we have both made it! Moreover, we have both managed to find a new position. Thank you for all the moral support over the past years! I would also like to thank two other members of the notorious room 207, Michel and Erik; it was great fun! To all current and former PhD students of the KEBB, thank you and I wish you all the best in finishing your PhD project and/or with the next step in your career!

I would also like to thank all the other people who were my colleagues at the KEBB over the past six years. I had a wonderful time! I will think back with pleasure on all the drinks, KEBB outings and Christmas parties.

Sander and Ronald, I enjoyed our collaboration; thank you for that. Your enthusiasm for your field is contagious. It was very instructive and, on top of that, it resulted in a wonderful article in the FASEB Journal!

Robert and Morris, thank you for all your help in building the C2Card application and making the design I had in mind possible. Erik, thank you for your help in getting the project started, and Joeri for the last-minute help with the final stretch of this project! I also want to thank the other members of the "MOLGENIS group" for the warm welcome during my two visits.

Ines, thank you for giving me the opportunity to present my work at the University of Iceland and the collaboration following that first meeting. I also enjoyed meeting your group during the Systems Biology short course and the COBRA conference.

A PhD project is not only about doing research and therefore I am grateful to have had the opportunity to be part of the board of RSG Netherlands. Jayne, Jeroen and Hanka, I think we have been very successful in setting up the initial network for PhD students in bioinformatics in the Netherlands. Thank you for all your time and effort. I have enjoyed working with you. Inken and Jurgen, thank you for helping to continue the work in the second RSG board. Vikram, it has only been brief, but thank you for your input. I wish the new board (Umesh, Margherita, Lex, Punto, Sepideh) the best of luck in keeping up the good work and expanding the activities of RSG Netherlands. Perhaps we should plan a joint event with RSG Spain now?

Everything we organized with RSG Netherlands would not have been possible without all the help and support of the people at NBIC. In particular I would like to thank Karin, Celia and Femke for all their help! I also enjoyed taking part in the NBIC conference every year.

A PhD project is not entirely free of frustrations. I cannot think of a better way to deal with them than letting off steam during a Pencak Silat training session. For those who do not know, Pencak Silat is an Indonesian martial art, which I have been practicing for more than 12 years. Dear (former) members of Panglipur Huizen and Amsterdam, perhaps without being aware of it, you too have most certainly contributed; thank you for that! Julian, thank you for all the training sessions over all those years, including those in the Amsterdamse Bos. I will miss eating ice cream at Pisa! Silvy and Jenna, it was great fun not only during the training sessions and demos, but also at every birthday! I always feel completely at home with you right away and, of course, the food is delicious.

Tango lessons were also a very welcome distraction over the past year. Cosima and Mariano, when you dance together it is as if the rest of the world briefly ceases to exist for you; beautiful! Thank you for all the inspiring lessons!

Gerwert, I would like to thank you for being my paranymph and also for all the delicious (home-cooked) dinners, which have by now almost become a tradition. I hope we can continue this 'tradition' when I am in Barcelona.

Last, but certainly not least, I would like to thank my parents and my brother Mark. Dear Dad, Mum and Mark, thank you for all your support over the past years. It was certainly not always easy, but thanks in part to you it has finally worked out. Mark, thank you also for the technical support; a brother like that always comes in handy. Now on to finding a festival in Barcelona, right? Thank you too for standing beside me as paranymph on 18 October. Muchísimas gracias!
