An Analysis of Differences in Biological Pathway Resources

Total Page:16

File Type:pdf, Size:1020Kb

An Analysis of Differences in Biological Pathway Resources An Analysis of Differences In Biological Pathway Resources Lucy L. Wang, John H. Gennari, and Neil F. Abernethy Department of Biomedical Informatics and Medical Education, University of Washington Seattle, Washington 98195 Email: [email protected], [email protected], [email protected] Abstract—Integrating content from multiple biological bilistically combine protein interactions from various pathway resources is necessary to fully exploit pathway pathway databases into merged networks [11]. These knowledge for the benefit of biology and medicine. Dif- tools improve querying of multiple resource, and pave ferences in content, representation, coverage, and more occur between databases, and are challenges to resource the way towards more comprehensive network models merging. We introduce a typology of representational of human biological processes. differences between pathway resources and give exam- Some work has also been done in inter-resource ples using several databases: BioCyc, KEGG, PANTHER pathways, and Reactome. We also detect and quantify comparison, quantifying the overlap between different annotation mismatches between HumanCyc and Reactome. databases [12–15]. These comparison studies emphasize The typology of mismatches can be used to guide entity and differences in entity membership in pathways and differ- relationship alignment between these databases, helping us ing counts of unique entities and pathways, but do not identify and understand deficiencies in our knowledge, and focus on cross-resource entity alignment. Existing tools allowing the research community to derive greater benefit from the existing pathway data. for entity normalization of proteins [16] and metabo- lites [17] may provide a starting point for alignment. Keywords—pathway database, knowledge representa- Other studies emphasize aligning metabolic pathways of tion, resource comparison different species in order to find analogous but missing relationships [18, 19], merging resources for combined I.I NTRODUCTION network analysis [11, 20], or defining conserved path- Describing and studying biological pathways is neces- way elements across existing pathway resources [21]. sary for understanding biological and disease processes. However, although they represent progress, the tools and Biological functions and processes follow from com- studies mentioned above accomplish goals that do not plex networks of interactions among gene products and include aligning representations across resources. molecules. Through the study of pathways of known Given the number and uniqueness of pathway re- biochemical reactions, we can gain deeper insights into sources, inter-resource merging is a challenge. In order to these interactions. Many of these relationships and reac- successfully align and integrate the content of multiple tions have been catalogued in pathway resources such as knowledge bases, we must contend with variability in Reactome, BioCyc, KEGG, and others [1–5]. content correctness, standards usage, knowledge repre- As of April 2016, PathGuide, a pathway resource ag- sentation choices, and coverage. Pathway data sharing gregator, lists 547 pathway resources [6], each providing standards such as BioPAX, SBML, and PSI MI [8, 22, specialized knowledge in niche areas of biology. Efforts 23] assist in the interchange of pathway resources, but have been made to integrate some of these databases. even resources available in the same standard still retain PathwayCommons catalogs human pathway resources differences in content and representation. Nonetheless, under a unified biological pathway exchange umbrella our goal is to align knowledge, so that users can benefit (BioPAX), allowing easier querying of pathways across from a semantic union across multiple resources. 22 different resources [7, 8]. Tools such as Consensus Pathway DB [9] and hiPATHDB [10] offer querying To align resources, we must comprehensively under- and visualization of pathways from multiple databases. stand the types of differences one may encounter. Stobbe Statistical frameworks like R Spider seek to proba- et al. have made an excellent start in this direction, providing numerous examples and descriptions of the This work was supported by the National Library of Medicine (NLM) sorts of differences among metabolic pathway resources Training Grant T15LM007442. [13, 24]. Here, we extend this work, aiming at a typology Fig. 1: The conversion of phosphoenolpyruvate and ADP into pyruvate and ATP assisted by the enzyme pyruvate kinase, as represented by HumanCyc, KEGG, PANTHER, and Reactome. The display name for each entity is given, along with ChEBI or UniProt identifiers where available. Entities related to the reaction by the BioPAX left property are red, and entities related by the BioPAX right property are green. of mismatches among pathway resources. In particular, we see a variety of important mismatches, of which a we describe and give examples of mismatches in (a) subset are described below.∗ annotation, (b) existence, (c) reaction semantics, and (d) granularity. By classifying mismatches, we enable the A. Annotation better understanding and discussion of resource differ- We first consider annotation mismatches on the par- ences, and allow for improved consensus formation in ticipating physical entities. Inconsistencies arise when multiple pathway resource applications. two pathway resources refer to the same entity with We also present some results in quantifying annotation different identifiers or different names. Pyruvate is rep- mismatches between two popular human pathway re- resented by all four resources (Fig. 1), but is anno- sources: HumanCyc and Reactome. Results demonstrate tated with two identifiers, ChEBI:15361 (HumanCyc and the pervasiveness of representational differences and PANTHER) and ChEBI:32816 (KEGG and Reactome). suggest further work towards consensus pathway repre- The ChEBI:15361 entity “pyruvate” and ChEBI:32816 sentations. Understanding the types of mismatches that entity “pyruvic acid” are conjugate acids and bases of exist between resources is a first step towards expanding one another in ChEBI. The display name for pyruvate and deriving the full benefit of our pathway knowledge. also differs between resources, and is given as “pyru- vate” (HumanCyc), “Pyruvate” (KEGG, PANTHER), or II.M ISMATCHES IN PATHWAY RESOURCES:A “PYR” (Reactome). Differences in identifiers and names TYPOLOGY are also seen for all other participants in this reaction. To provide examples of mismatches, we retrieved re- action representations from HumanCyc, KEGG, PAN- ∗Pathways were retrieved from Reactome v55 (http://reactome. org) and HumanCyc v19.5 (http://humancyc.org) BioPAX3 ex- THER, and Reactome. Fig. 1 shows several different ports and through PathwayCommons v7 [7]. Glycolysis path- representations of a step of glycolysis in Homo sapiens: ways for KEGG and PANTHER are located at http://purl. the conversion of phosphoenolpyruvate and ADP to org/pc2/7/#Pathway 307add3cea6530288cc1016267ec055b and http: //identifiers.org/panther.pathway/P00024 respectively and are sup- pyruvate and ATP modulated by the enzyme pyruvate plemented by the pathway diagrams at http://kegg.jp and http:// kinase. In this single, well-studied biochemical reaction, pantherdb.org. 2 In order to resolve these mismatches, we must either enforce consistent labeling of entities across resources, or somehow infer alignment of similar but differently annotated entities across resources. The former strategy is usually impractical; in this case, we can infer similarity by treating ChEBI identifiers that refer to conjugate acid/base pairs as synonyms. A second type of annotation mismatch occurs when entities lack cross-referenced identifiers, e.g., no identi- fiers are given for ADP or ATP in PANTHER pathways. Other features such as string name, entity relationships, Fig. 2: The oxidative decarboxylation of isocitrate can be represented as a two-step process with an oxalosuccinate intermediary (left) and and local network topology can be used to align entities as a one-step process (right). between resources when identifiers are insufficient. right are used to indicate reaction direction, as well as B. Existence the identities of reactants and products [25]. In KEGG, Existence refers to missing or extraneous physical en- PANTHER, and Reactome, phosphoenolpyruvate is la- tities, reactions, relationships, or information, e.g., en- beled left and pyruvate right, with a reaction direction of tities that participate in a reaction or reactions that are left-to-right. However, in HumanCyc, phosphoenolpyru- members of a pathway in one resource but not another, vate is labeled right and pyruvate left and the reaction or a connection between two reactions that occurs in direction is right-to-left, a choice dictated by the Enzyme one resource but not another. In Reactome, for example, Commission system [2]. Even though HumanCyc is in the conversion of fructose 6-phosphate to fructose 2,6- the minority, its choice follows recommendations from biphosphate is a reaction in the glycolysis pathway. This the BioPAX3 specifications [25]. reaction is not included in the glycolysis pathway of Resolving this type of semantic mismatch between the other three resources. Although the reaction involves resources requires knowledge about the ordering of entities that participate in glycolysis, there is uncertainty reactions, which can be derived from pathway design, in whether it is important to the overall process. or when reactions are taken out of context, may depend
Recommended publications
  • Reactome Pathway Analysis
    Fabregat et al. BMC Bioinformatics (2017) 18:142 DOI 10.1186/s12859-017-1559-2 SOFTWARE Open Access Reactome pathway analysis: a high- performance in-memory approach Antonio Fabregat1,2, Konstantinos Sidiropoulos1, Guilherme Viteri1, Oscar Forner1, Pablo Marin-Garcia3,4, Vicente Arnau5,6, Peter D’Eustachio7, Lincoln Stein8,9 and Henning Hermjakob1,10* Abstract Background: Reactome aims to provide bioinformatics tools for visualisation, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modelling, systems biology and education. Pathway analysis methods have a broad range of applications in physiological and biomedical research; one of the main problems, from the analysis methods performance point of view, is the constantly increasing size of the data samples. Results: Here, we present a new high-performance in-memory implementation of the well-established over- representation analysis method. To achieve the target, the over-representation analysis method is divided in four different steps and, for each of them, specific data structures are used to improve performance and minimise the memory footprint. The first step, finding out whether an identifier in the user’s sample corresponds to an entity in Reactome, is addressed using a radix tree as a lookup table. The second step, modelling the proteins, chemicals, their orthologous in other species and their composition in complexes and sets, is addressed with a graph. The third and fourth steps, that aggregate the results and calculate the statistics, are solved with a double-linked tree. Conclusion: Through the use of highly optimised, in-memory data structures and algorithms, Reactome has achieved a stable, high performance pathway analysis service, enabling the analysis of genome-wide datasets within seconds, allowing interactive exploration and analysis of high throughput data.
    [Show full text]
  • Biological Pathways Exchange Language Level 3, Release Version 1 Documentation
    BioPAX – Biological Pathways Exchange Language Level 3, Release Version 1 Documentation BioPAX Release, July 2010. The BioPAX data exchange format is the joint work of the BioPAX workgroup and Level 3 builds on the work of Level 2 and Level 1. BioPAX Level 3 input from: Mirit Aladjem, Ozgun Babur, Gary D. Bader, Michael Blinov, Burk Braun, Michelle Carrillo, Michael P. Cary, Kei-Hoi Cheung, Julio Collado-Vides, Dan Corwin, Emek Demir, Peter D'Eustachio, Ken Fukuda, Marc Gillespie, Li Gong, Gopal Gopinathrao, Nan Guo, Peter Hornbeck, Michael Hucka, Olivier Hubaut, Geeta Joshi- Tope, Peter Karp, Shiva Krupa, Christian Lemer, Joanne Luciano, Irma Martinez-Flores, Zheng Li, David Merberg, Huaiyu Mi, Ion Moraru, Nicolas Le Novere, Elgar Pichler, Suzanne Paley, Monica Penaloza- Spinola, Victoria Petri, Elgar Pichler, Alex Pico, Harsha Rajasimha, Ranjani Ramakrishnan, Dean Ravenscroft, Jonathan Rees, Liya Ren, Oliver Ruebenacker, Alan Ruttenberg, Matthias Samwald, Chris Sander, Frank Schacherer, Carl Schaefer, James Schaff, Nigam Shah, Andrea Splendiani, Paul Thomas, Imre Vastrik, Ryan Whaley, Edgar Wingender, Guanming Wu, Jeremy Zucker BioPAX Level 2 input from: Mirit Aladjem, Gary D. Bader, Ewan Birney, Michael P. Cary, Dan Corwin, Kam Dahlquist, Emek Demir, Peter D'Eustachio, Ken Fukuda, Frank Gibbons, Marc Gillespie, Michael Hucka, Geeta Joshi-Tope, David Kane, Peter Karp, Christian Lemer, Joanne Luciano, Elgar Pichler, Eric Neumann, Suzanne Paley, Harsha Rajasimha, Jonathan Rees, Alan Ruttenberg, Andrey Rzhetsky, Chris Sander, Frank Schacherer,
    [Show full text]
  • Integration of Metabolic Pathway Linked Data: Lessons Learned
    Integration of Metabolic Pathway Linked Data: Lessons Learned Maciej Rybinski, María Jesús García-Godoy, Ismael Navas-Delgado and José F. Aldana-Montes Universidad de Málaga, Spain Abstract. In the last few years, the Life Science domain has experienced a rapid growth in the amount of available biological databases. The heterogeneity of these databases makes the data integration a challenging issue. Some inte- gration challenges refer to locating resources, relations, data formats, synonyms or ambiguity. Linked Data approach partly solves the heterogeneity problems (mainly the syntactic ones). Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. However, Linked Data approach is not a solution in itself. This paper illustrates a data integration and interlinking process for metabolic pathway databases such as Uniprot, Kegg, Brenda and Reactome. This process was an important part of the development of a tool for biological end-users that integrates metabolic pathways information by using Linked Data technology. The integration process and the tool de- velopment shows that heterogeneity problems have not been solved in this domain, and so this paper discusses the problems that arise from the use of data and links published as Linked Data and the possible approaches to solve these issues. Keywords: Linked Data, Knowledge integration, Bioinformatic databases (phosphoglucomutase, Homo sapiens) in order to 1. Introduction obtain information about this enzyme in glycoly- sis/gluconeogenesis pathways. The result is a list of Over the last two decades, the biological database pathways in which the target enzyme works. The user community has witnessed a rapid growth in the num- has to browse the map to reach the link to the enzyme.
    [Show full text]
  • An R Package to Generate Integrated Reproducible Pathway Models
    biology Technical Note Multipath: An R Package to Generate Integrated Reproducible Pathway Models Zaynab Hammoud * and Frank Kramer IT-Infrastructure for Translational Medical Research, University of Augsburg, 86159 Augsburg, Germany; [email protected] * Correspondence: [email protected] or [email protected] Received: 24 October 2020; Accepted: 19 December 2020; Published: 21 December 2020 Simple Summary: In biological terms, the term "pathway" is used to describe a collection of processes within a cell that lead to one or more actions. The graphical representation of these processes enables the reader to understand complex relationships and interactions much more easily compared to free-text descriptions. While there is usually agreement on the existence and function of these high-level processes, the specific molecules and their interactions are often disputed and a matter of current research. A standardized computational representation of biological networks has become desirable, especially with the recent surge in new knowledge generation in biology and medicine. Our work is influenced by challenges emerging from previous work on biological pathways, knowledge encoding, and visualization as well as pathway databases. Our main motivation is the difficulty of reproducing pathway knowledge used within publications, even in top-tier journals. We propose a new way of integrating and modeling pathways and other influencing knowledge, such as drugs, and documenting their modifications using multilayered networks. We provide a tool that transforms encoded pathway data to multilayered graphs, with the possibility to modify them, and integrate other knowledge from external databases. Abstract: Biological pathway data integration has become a topic of interest in the past years.
    [Show full text]
  • Reactome: a Database of Reactions, Pathways and Biological Processes
    City University of New York (CUNY) CUNY Academic Works Publications and Research Hunter College 2011 Reactome: a database of reactions, pathways and biological processes David Croft Wellcome Trust Gavin O’Kelly Wellcome Trust Genome Campus Guanming Wu Ontario Institute for Cancer Research Robin Haw Ontario Institute for Cancer Research Marc Gillespie St. John's University See next page for additional authors How does access to this work benefit ou?y Let us know! More information about this work at: https://academicworks.cuny.edu/hc_pubs/177 Discover additional works at: https://academicworks.cuny.edu This work is made publicly available by the City University of New York (CUNY). Contact: [email protected] Authors David Croft, Gavin O’Kelly, Guanming Wu, Robin Haw, Marc Gillespie, Lisa Matthews, Michael Caudy, Phani Garapati, Gopal Gopinath, Bijay Jassal, Steven Jupe, Irina Kalatskaya, Shahana S. Mahajan, Bruce May, Nelson Ndegwa, Esther Schmidt, Veronica Shamovsky, Christina Yung, Ewan Birney, Henning Hermjakob, Peter D’Eustachio, and Lincoln Stein This article is available at CUNY Academic Works: https://academicworks.cuny.edu/hc_pubs/177 Published online 9 November 2010 Nucleic Acids Research, 2011, Vol. 39, Database issue D691–D697 doi:10.1093/nar/gkq1018 Reactome: a database of reactions, pathways and biological processes David Croft1, Gavin O’Kelly1, Guanming Wu2, Robin Haw2, Marc Gillespie3, Lisa Matthews4, Michael Caudy2, Phani Garapati1, Gopal Gopinath5, Bijay Jassal1, Steven Jupe1, Irina Kalatskaya2, Shahana Mahajan4,6, Bruce May2, Nelson Ndegwa1, Esther Schmidt1, Veronica Shamovsky4, Christina Yung2, Ewan Birney1, Henning Hermjakob1, Peter D’Eustachio4 and Lincoln Stein2,7,8,* 1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK, 2Ontario Institute for Cancer Research, Informatics and Bio-Computing, Toronto, ON, M5G0A3, Canada, 3St.
    [Show full text]
  • SCIENCE CHINA Pathppi
    SCIENCE CHINA Life Sciences • RESEARCH PAPER • June 2015 Vol.58 No.6: 579–589 doi: 10.1007/s11427-014-4766-3 PathPPI: an integrated dataset of human pathways and protein-protein interactions TANG HaiLin1†, ZHONG Fan2†, LIU Wei1, HE FuChu2,3* & XIE HongWei1* 1College of Mechanical & Electronic Engineering and Automatization, National University of Defense Technology, Changsha 410073, China; 2Institutes of Biomedical Sciences, Fudan University, Shanghai 200032, China; 3State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China Received April 5, 2014; accepted July 20, 2014; published online January 14, 2015 Integration of pathway and protein-protein interaction (PPI) data can provide more information that could lead to new biologi- cal insights. PPIs are usually represented by a simple binary model, whereas pathways are represented by more complicated models. We developed a series of rules for transforming protein interactions from pathway to binary model, and the protein in- teractions from seven pathway databases, including PID, BioCarta, Reactome, NetPath, INOH, SPIKE and KEGG, were transformed based on these rules. These pathway-derived binary protein interactions were integrated with PPIs from other five PPI databases including HPRD, IntAct, BioGRID, MINT and DIP, to develop integrated dataset (named PathPPI). More de- tailed interaction type and modification information on protein interactions can be preserved in PathPPI than other existing da- tasets. Comparison analysis results indicate that most of the interaction overlaps values (OAB) among these pathway databases were less than 5%, and these databases must be used conjunctively. The PathPPI data was provided at http://proteomeview. hupo.org.cn/PathPPI/PathPPI.html.
    [Show full text]
  • An Ecosystem for Exploring, Analyzing, and Curating Mappings Across Pathway Databases
    www.nature.com/npjsba Corrected: Publisher Correction TECHNOLOGY FEATURE OPEN ComPath: an ecosystem for exploring, analyzing, and curating mappings across pathway databases Daniel Domingo-Fernández 1,2, Charles Tapley Hoyt 1,2, Carlos Bobis-Álvarez3, Josep Marín-Llaó1,4 and Martin Hofmann-Apitius 1,2 Although pathways are widely used for the analysis and representation of biological systems, their lack of clear boundaries, their dispersion across numerous databases, and the lack of interoperability impedes the evaluation of the coverage, agreements, and discrepancies between them. Here, we present ComPath, an ecosystem that supports curation of pathway mappings between databases and fosters the exploration of pathway knowledge through several novel visualizations. We have curated mappings between three of the major pathway databases and present a case study focusing on Parkinson’s disease that illustrates how ComPath can generate new biological insights by identifying pathway modules, clusters, and cross-talks with these mappings. The ComPath source code and resources are available at https://github.com/ComPath and the web application can be accessed at https://compath.scai.fraunhofer.de/. npj Systems Biology and Applications (2018) 4:43 ; https://doi.org/10.1038/s41540-018-0078-8 INTRODUCTION gained traction.3,6 Further, the variability in curation team The notion of pathways enables the representation, formalization, composition, database scope (e.g., signaling pathways, gene and interpretation of biological events or series of interactions. regulatory networks, and metabolic processes), and curation Cataloging biological knowledge into pathways reduces complex- guidelines led to the adoption of different (and in many ways ity from all possible interacting molecular entities to a set of well- incompatible) schemata and formalisms such as Biological Path- 7 studied and validated functional relationships between molecular way Exchange (BioPAX; ) and Systems Biology Markup Language 8 entities culminating in biological processes.
    [Show full text]