An Analysis of Differences in Biological Pathway Resources
Total Page:16
File Type:pdf, Size:1020Kb
An Analysis of Differences In Biological Pathway Resources Lucy L. Wang, John H. Gennari, and Neil F. Abernethy Department of Biomedical Informatics and Medical Education, University of Washington Seattle, Washington 98195 Email: [email protected], [email protected], [email protected] Abstract—Integrating content from multiple biological bilistically combine protein interactions from various pathway resources is necessary to fully exploit pathway pathway databases into merged networks [11]. These knowledge for the benefit of biology and medicine. Dif- tools improve querying of multiple resource, and pave ferences in content, representation, coverage, and more occur between databases, and are challenges to resource the way towards more comprehensive network models merging. We introduce a typology of representational of human biological processes. differences between pathway resources and give exam- Some work has also been done in inter-resource ples using several databases: BioCyc, KEGG, PANTHER pathways, and Reactome. We also detect and quantify comparison, quantifying the overlap between different annotation mismatches between HumanCyc and Reactome. databases [12–15]. These comparison studies emphasize The typology of mismatches can be used to guide entity and differences in entity membership in pathways and differ- relationship alignment between these databases, helping us ing counts of unique entities and pathways, but do not identify and understand deficiencies in our knowledge, and focus on cross-resource entity alignment. Existing tools allowing the research community to derive greater benefit from the existing pathway data. for entity normalization of proteins [16] and metabo- lites [17] may provide a starting point for alignment. Keywords—pathway database, knowledge representa- Other studies emphasize aligning metabolic pathways of tion, resource comparison different species in order to find analogous but missing relationships [18, 19], merging resources for combined I.I NTRODUCTION network analysis [11, 20], or defining conserved path- Describing and studying biological pathways is neces- way elements across existing pathway resources [21]. sary for understanding biological and disease processes. However, although they represent progress, the tools and Biological functions and processes follow from com- studies mentioned above accomplish goals that do not plex networks of interactions among gene products and include aligning representations across resources. molecules. Through the study of pathways of known Given the number and uniqueness of pathway re- biochemical reactions, we can gain deeper insights into sources, inter-resource merging is a challenge. In order to these interactions. Many of these relationships and reac- successfully align and integrate the content of multiple tions have been catalogued in pathway resources such as knowledge bases, we must contend with variability in Reactome, BioCyc, KEGG, and others [1–5]. content correctness, standards usage, knowledge repre- As of April 2016, PathGuide, a pathway resource ag- sentation choices, and coverage. Pathway data sharing gregator, lists 547 pathway resources [6], each providing standards such as BioPAX, SBML, and PSI MI [8, 22, specialized knowledge in niche areas of biology. Efforts 23] assist in the interchange of pathway resources, but have been made to integrate some of these databases. even resources available in the same standard still retain PathwayCommons catalogs human pathway resources differences in content and representation. Nonetheless, under a unified biological pathway exchange umbrella our goal is to align knowledge, so that users can benefit (BioPAX), allowing easier querying of pathways across from a semantic union across multiple resources. 22 different resources [7, 8]. Tools such as Consensus Pathway DB [9] and hiPATHDB [10] offer querying To align resources, we must comprehensively under- and visualization of pathways from multiple databases. stand the types of differences one may encounter. Stobbe Statistical frameworks like R Spider seek to proba- et al. have made an excellent start in this direction, providing numerous examples and descriptions of the This work was supported by the National Library of Medicine (NLM) sorts of differences among metabolic pathway resources Training Grant T15LM007442. [13, 24]. Here, we extend this work, aiming at a typology Fig. 1: The conversion of phosphoenolpyruvate and ADP into pyruvate and ATP assisted by the enzyme pyruvate kinase, as represented by HumanCyc, KEGG, PANTHER, and Reactome. The display name for each entity is given, along with ChEBI or UniProt identifiers where available. Entities related to the reaction by the BioPAX left property are red, and entities related by the BioPAX right property are green. of mismatches among pathway resources. In particular, we see a variety of important mismatches, of which a we describe and give examples of mismatches in (a) subset are described below.∗ annotation, (b) existence, (c) reaction semantics, and (d) granularity. By classifying mismatches, we enable the A. Annotation better understanding and discussion of resource differ- We first consider annotation mismatches on the par- ences, and allow for improved consensus formation in ticipating physical entities. Inconsistencies arise when multiple pathway resource applications. two pathway resources refer to the same entity with We also present some results in quantifying annotation different identifiers or different names. Pyruvate is rep- mismatches between two popular human pathway re- resented by all four resources (Fig. 1), but is anno- sources: HumanCyc and Reactome. Results demonstrate tated with two identifiers, ChEBI:15361 (HumanCyc and the pervasiveness of representational differences and PANTHER) and ChEBI:32816 (KEGG and Reactome). suggest further work towards consensus pathway repre- The ChEBI:15361 entity “pyruvate” and ChEBI:32816 sentations. Understanding the types of mismatches that entity “pyruvic acid” are conjugate acids and bases of exist between resources is a first step towards expanding one another in ChEBI. The display name for pyruvate and deriving the full benefit of our pathway knowledge. also differs between resources, and is given as “pyru- vate” (HumanCyc), “Pyruvate” (KEGG, PANTHER), or II.M ISMATCHES IN PATHWAY RESOURCES:A “PYR” (Reactome). Differences in identifiers and names TYPOLOGY are also seen for all other participants in this reaction. To provide examples of mismatches, we retrieved re- action representations from HumanCyc, KEGG, PAN- ∗Pathways were retrieved from Reactome v55 (http://reactome. org) and HumanCyc v19.5 (http://humancyc.org) BioPAX3 ex- THER, and Reactome. Fig. 1 shows several different ports and through PathwayCommons v7 [7]. Glycolysis path- representations of a step of glycolysis in Homo sapiens: ways for KEGG and PANTHER are located at http://purl. the conversion of phosphoenolpyruvate and ADP to org/pc2/7/#Pathway 307add3cea6530288cc1016267ec055b and http: //identifiers.org/panther.pathway/P00024 respectively and are sup- pyruvate and ATP modulated by the enzyme pyruvate plemented by the pathway diagrams at http://kegg.jp and http:// kinase. In this single, well-studied biochemical reaction, pantherdb.org. 2 In order to resolve these mismatches, we must either enforce consistent labeling of entities across resources, or somehow infer alignment of similar but differently annotated entities across resources. The former strategy is usually impractical; in this case, we can infer similarity by treating ChEBI identifiers that refer to conjugate acid/base pairs as synonyms. A second type of annotation mismatch occurs when entities lack cross-referenced identifiers, e.g., no identi- fiers are given for ADP or ATP in PANTHER pathways. Other features such as string name, entity relationships, Fig. 2: The oxidative decarboxylation of isocitrate can be represented as a two-step process with an oxalosuccinate intermediary (left) and and local network topology can be used to align entities as a one-step process (right). between resources when identifiers are insufficient. right are used to indicate reaction direction, as well as B. Existence the identities of reactants and products [25]. In KEGG, Existence refers to missing or extraneous physical en- PANTHER, and Reactome, phosphoenolpyruvate is la- tities, reactions, relationships, or information, e.g., en- beled left and pyruvate right, with a reaction direction of tities that participate in a reaction or reactions that are left-to-right. However, in HumanCyc, phosphoenolpyru- members of a pathway in one resource but not another, vate is labeled right and pyruvate left and the reaction or a connection between two reactions that occurs in direction is right-to-left, a choice dictated by the Enzyme one resource but not another. In Reactome, for example, Commission system [2]. Even though HumanCyc is in the conversion of fructose 6-phosphate to fructose 2,6- the minority, its choice follows recommendations from biphosphate is a reaction in the glycolysis pathway. This the BioPAX3 specifications [25]. reaction is not included in the glycolysis pathway of Resolving this type of semantic mismatch between the other three resources. Although the reaction involves resources requires knowledge about the ordering of entities that participate in glycolysis, there is uncertainty reactions, which can be derived from pathway design, in whether it is important to the overall process. or when reactions are taken out of context, may depend