Supplementary Information

PathwayMatcher: multi-omics pathway mapping and proteoform network generation

Luis Francisco Hernández Sánchez1,2,3, Bram Burger4,5, Carlos Horro4,5, Antonio Fabregat3, Stefan Johansson1,2, Pål Rasmus Njølstad1,6, Harald Barsnes4,5, Henning Hermjakob3,7, and Marc Vaudel1,2,*

1 K.G. Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
2 Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
3 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
4 Proteomics Unit, Department of Biomedicine, University of Bergen, Bergen, Norway
5 Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
6 Department of Pediatrics, Haukeland University Hospital, Bergen, Norway
7 Beijing Proteome Research Center, National Center for Protein Sciences Beijing, Beijing, China
* To whom correspondence should be addressed

Abstract

Mapping biomedical data to functional knowledge is an essential task in biomedicine and can be achieved by querying gene or protein identifiers in pathway knowledgebases. Here, we demonstrate that including fine-granularity information such as post-translational modifications greatly increases the specificity of the analysis. We present PathwayMatcher (github.com/PathwayAnalysisPlatform/PathwayMatcher), a bioinformatic application for mapping multi-omics data to pathways and show how this enables the building of biological networks at the proteoform level.

Hernández Sánchez et al. PathwayMatcher

Table of Contents

1. Introduction
2. Availability
3. Post-translational modifications in the Reactome data model
4. Mapping omics data to pathways
5. Input
   a) Genetic variants
   b) Genes
   c) Peptides
   d) Proteins
   e) Proteoforms
   a) Superset (with and without PTM types)
   b) Subset (with and without PTM types)
   c) One (with and without PTM types)
   d) Strict
6. Output
   a) Search
   b) Analysis
   c) Graph
7. Performance
8. Metrics and Figures
9. References

1. Introduction

Biological pathways are a common way to represent biological processes. A pathway is a sequence of biochemical reactions in a cell that achieves a specific biological goal. Pathways are consolidated in public knowledgebases where they can be accessed, queried, and navigated. One of the main use cases is to map biomedical data to provide functional interpretation, and potentially uncover underlying causes for certain diseases, through so-called pathway analysis.
Pathway analysis consists of two steps: (i) mapping of omics data to the knowledgebase, and (ii) statistical analysis evaluating how confidently the pathways relate to a clinical sample. The search for relevant pathways can be done using lists of genes or proteins. Proteins provide a finer level of detail given that multiple protein products can originate from the same gene. After the search has been performed, statistical methods are applied to filter and rank the resulting pathways (García-Campos et al., 2015).

Proteins are the main participants of pathways, acting as reactants, catalysts, regulators or products. They take multiple forms and can also be chemically modified; these forms are collectively referred to as proteoforms and give proteins the ability to perform highly specific tasks. Knowledgebases such as PhosphoSitePlus (Hornbeck et al., 2014) or Reactome (Fabregat et al., 2018a) gather information on proteoforms. Reactome notably annotates reactions involving proteoforms, which include the proteins' processed peptide sequences, isoforms and sets of known post-translational modifications (PTMs). This type of annotation reflects the dynamic nature of proteins and allows identifying the reactions and pathways in which proteins need specific sets of PTMs to take part. However, so far, no bioinformatic tool has allowed the mapping and analysis of the detailed information contained in proteoform pathway networks.

Here we present a more fine-grained approach to pathway search, using not only gene names or protein identifiers, but also proteoforms. As demonstrated in the main text, this tailored matching allows for a more specific analysis, and can reduce the prevalence of artefacts in the matching of the results.
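The difference between a protein-level and a proteoform-level search can be illustrated with a minimal Python sketch. It is not PathwayMatcher's implementation: the `Proteoform` class, the two matching functions, the accessions (`PROT_A`, `PROT_B`), the pathway names and the toy annotation table are all hypothetical, and the PSI-MOD-style identifiers serve only as placeholders. The proteoform search below uses exact equality on accession, isoform and PTM set, a deliberately simple criterion chosen for this sketch; the matching types named in the table of contents are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proteoform:
    """Hypothetical record: protein accession, optional isoform, and PTMs."""
    accession: str                     # placeholder accession, not a real UniProt entry
    isoform: Optional[int] = None      # None means the canonical sequence
    ptms: frozenset = frozenset()      # {(modification_id, sequence_position), ...}


# Toy annotation table (assumption): reaction participant -> pathways it takes part in.
TOY_ANNOTATIONS = {
    Proteoform("PROT_A"): {"Pathway A"},
    Proteoform("PROT_A", ptms=frozenset({("MOD:00046", 15)})): {"Pathway B"},
    Proteoform("PROT_A", ptms=frozenset({("MOD:00046", 15), ("MOD:00047", 20)})): {"Pathway C"},
    Proteoform("PROT_B"): {"Pathway A"},
}


def match_by_protein(accessions, annotations):
    """Protein-level search: every annotated form of a queried protein counts as a hit."""
    hits = set()
    for participant, pathways in annotations.items():
        if participant.accession in accessions:
            hits |= pathways
    return hits


def match_by_proteoform(queries, annotations):
    """Proteoform-level search using exact equality of accession, isoform and PTM set."""
    hits = set()
    for query in queries:
        hits |= annotations.get(query, set())
    return hits


# Query the same protein once by accession and once as a singly modified proteoform.
protein_hits = match_by_protein({"PROT_A"}, TOY_ANNOTATIONS)
proteoform_hits = match_by_proteoform(
    {Proteoform("PROT_A", ptms=frozenset({("MOD:00046", 15)}))}, TOY_ANNOTATIONS
)

print(sorted(protein_hits))     # ['Pathway A', 'Pathway B', 'Pathway C']
print(sorted(proteoform_hits))  # ['Pathway B']
```

In this toy table, the accession alone retrieves every pathway in which any form of the protein participates, whereas the modified proteoform retrieves only the pathway annotated with that exact form. This is the gain in specificity that the fine-grained search aims for.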
We developed PathwayMatcher, an open-source standalone Java command line tool that maps multiple types of omics data to the pathways in the Reactome graph database, including: (i) lists of genetic variants, (ii) gene or protein identifiers, (iii) lists of peptides including post-translational modifications, and (iv) lists of proteoform identifiers. PathwayMatcher converts the input to either proteins or proteoforms and searches for them as participants in pathway reactions. The output comprises three types of files: (i) a list of the matched pathways, (ii) the result of an over-representation analysis, and (iii) the connection graphs.

PathwayMatcher uses Reactome, a free, open-source, curated knowledgebase containing human reactions categorized in hierarchical pathways, which also includes proteoform-level annotation. Protein post-translational modifications are notably supported through the protein sequence coordinate and the modification type following the PSI-MOD ontology (Montecchi-Palazzi et al., 2008). Proteins have a UniProt (The UniProt Consortium, 2017) identifier associated with an additional indication of the isoform participating in the reaction. The detailed annotation of Reactome is therefore instrumental in our new fine-grained pathway search.

PathwayMatcher contains all mappings internally and therefore does not rely on web services, e.g. from Ensembl, UniProt, or Reactome. This allows PathwayMatcher to run on high-performance setups without compromising efficiency through dependency on third-party services. Furthermore, it allows PathwayMatcher to run in secure environments without access to the internet. PathwayMatcher is readily available for integration in bioinformatic workflows thanks to implementations in Bioconda and Galaxy, as detailed below.

2. Availability

PathwayMatcher is freely available at github.com/PathwayAnalysisPlatform/PathwayMatcher under the permissive Apache 2.0 license.
It is also possible to use PathwayMatcher as a Docker image: hub.docker.com/r/lfhs/pathwaymatcher. The Docker image allows the creation of isolated, self-contained containers comprising PathwayMatcher, its dependencies and internal data without