Supplementary Information

PathwayMatcher: multi-omics pathway mapping and proteoform network generation

Luis Francisco Hernández Sánchez1,2,3, Bram Burger4,5, Carlos Horro4,5, Antonio Fabregat3, Stefan Johansson1,2, Pål Rasmus Njølstad1,6, Harald Barsnes4,5, Henning Hermjakob3,7, and Marc Vaudel1,2,*

1 K.G. Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
2 Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
3 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
4 Proteomics Unit, Department of Biomedicine, University of Bergen, Bergen, Norway
5 Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
6 Department of Pediatrics, Haukeland University Hospital, Bergen, Norway
7 Beijing Proteome Research Center, National Center for Protein Sciences Beijing, Beijing, China

* To whom correspondence should be addressed

Abstract

Mapping biomedical data to functional knowledge is an essential task in biomedicine and can be achieved by querying gene or protein identifiers in pathway knowledgebases. Here, we demonstrate that including fine-granularity information such as post-translational modifications greatly increases the specificity of the analysis. We present PathwayMatcher (github.com/PathwayAnalysisPlatform/PathwayMatcher), a bioinformatic application for mapping multi-omics data to pathways, and show how this enables the building of biological networks at the proteoform level.

Hernández Sánchez et al. PathwayMatcher

Table of Contents

1. Introduction
2. Availability
3. Post-translational modifications in the Reactome data model
4. Mapping omics data to pathways
5. Input
    a) Genetic variants
    b) Genes
    c) Peptides
    d) Proteins
    e) Proteoforms
        a) Superset (with and without PTM types)
        b) Subset (with and without PTM types)
        c) One (with and without PTM types)
        d) Strict
6. Output
    a) Search
    b) Analysis
    c) Graph
7. Performance
8. Metrics and Figures
9. References


1. Introduction

Biological pathways are a common way to represent biological processes. A pathway is a sequence of biochemical reactions in a cell that achieves a specific biological goal. Pathways are consolidated in public knowledgebases where they can be accessed, queried, and navigated. One of the main use cases is to map biomedical data to provide functional interpretation, and potentially uncover underlying causes for certain diseases, through so-called pathway analysis.

Pathway analysis consists of two steps: (i) mapping of omics data to the knowledgebase, and (ii) statistical analysis evaluating how confidently the pathways relate to a clinical sample. The search for relevant pathways can be done using lists of genes or proteins. Proteins provide a finer level of detail given that multiple protein products can originate from the same gene. After the search has been performed, statistical methods are applied to filter and rank the resulting pathways (García-Campos et al., 2015).

Proteins are the main participants of pathways, acting as reactants, catalysts, regulators or products. They take multiple forms and can also be chemically modified; these forms are collectively referred to as proteoforms and give proteins the ability to perform highly specific tasks. Knowledgebases such as PhosphoSitePlus (Hornbeck et al., 2014) or Reactome (Fabregat et al., 2018a) gather information on proteoforms. Reactome notably annotates reactions involving proteoforms, which include the proteins’ processed peptide sequences, isoforms and sets of known post-translational modifications (PTMs). This type of annotation reflects the dynamic nature of proteins and allows identifying the reactions and pathways where proteins need specific sets of PTMs to take part in the reactions.

However, so far, no bioinformatic tool allowed the mapping and analysis of the detailed information contained in proteoform pathway networks. Here we present a more fine-grained approach to pathway search, not only using gene names or protein identifiers, but also proteoforms. As demonstrated in the main text, this tailored matching allows for a more specific analysis, and can reduce the prevalence of artefacts in the matching of the results.


We developed PathwayMatcher, an open-source standalone Java command line tool that maps multiple types of omics data to the pathways in the Reactome graph database, including: (i) lists of genetic variants, (ii) gene or protein identifiers, (iii) lists of peptides including post-translational modifications, and (iv) lists of proteoform identifiers. PathwayMatcher converts the input to either proteins or proteoforms and searches for them as participants in pathway reactions. The output comprises three types of files: (i) a list of the matched pathways, (ii) the result of an over-representation analysis, and (iii) the connection graphs.

PathwayMatcher uses Reactome, a free, open source, curated knowledgebase containing human reactions categorized in hierarchical pathways which also includes proteoform-level annotation. Protein post-translational modifications are notably supported through the protein sequence coordinate and the modification type following the PSI-MOD ontology (Montecchi-Palazzi et al., 2008). Proteins have a UniProt (The UniProt, 2017) identifier associated with an additional indication of the isoform participating in the reaction. The detailed annotation of Reactome is therefore instrumental in our new fine-grained pathway search.

PathwayMatcher contains all mappings internally and therefore does not rely on web services, e.g. from Ensembl, UniProt, or Reactome. This allows PathwayMatcher to run on high-performance setups without compromising efficiency through dependency on third-party services. Furthermore, it allows PathwayMatcher to run in secure environments without access to the internet. PathwayMatcher is readily available for integration in bioinformatic workflows thanks to implementations in Bioconda and Galaxy, as detailed below.


2. Availability

PathwayMatcher is freely available at github.com/PathwayAnalysisPlatform/PathwayMatcher under the permissive Apache 2.0 license. It is also possible to use PathwayMatcher as a Docker image: hub.docker.com/r/lfhs/pathwaymatcher. The Docker image allows the creation of isolated, self-contained containers comprising PathwayMatcher, its dependencies and internal data without installing or making any changes to the main user environment. Such a container easily allows the integration into modularized pipelines and greatly improves the reproducibility of bioinformatic workflows.

PathwayMatcher can be obtained from the Bioconda channel of the Conda (Grüning et al., 2018) package manager at bioconda.github.io/recipes/pathwaymatcher/README.html. This allows easy dependency management and simple integration in bioinformatic pipelines. Finally, PathwayMatcher is available as a Galaxy (Afgan et al., 2018) tool in the Galaxy ToolShed (Blankenberg et al., 2014) at toolshed.g2.bx.psu.edu/view/galaxyp/reactome_pathwaymatcher, where it can be readily integrated into analysis workflows. PathwayMatcher has also been installed on the public European Galaxy instance, usegalaxy.eu, making it possible to use the application without any local configuration, simply by providing valid input files and options. The complete URL for the online tool is: https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fgalaxyp%2Freactome_pathwaymatcher%2Freactome_pathwaymatcher

Upon installation, PathwayMatcher can be used from the command line to query Reactome using various types of omics data. Either the “.jar” file is run directly using Java or the Docker image is instantiated to a container. Detailed information on implementation, installation, usage and format specifications is available in the online documentation: github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki.


3. Post-translational modifications in the Reactome data model

The Reactome object model specifies physical entities, e.g. complexes, proteins and small molecules, and proteins are annotated using unique identifiers. These entities participate in reactions in specific cellular compartments. They can also be connected to multiple instances of Translational Modification objects, each containing a specific coordinate on the protein sequence and an identifier following the PSI-MOD ontology (Montecchi-Palazzi et al., 2008). The physical entities that refer to proteins are associated with another class of objects, reference entities, which contain protein annotations in external databases such as UniProt (Natale et al., 2017). A proteoform is therefore defined by the physical entity associated with a set of modifications for specific processes at specific subcellular locations. In total, 127 different protein modifications are annotated in Reactome for humans; Supplementary Figure 1 displays the occurrence of the most frequent ones.

Supplementary Figure 1: Prevalence of the different PTM annotations in Reactome. PTM labels are extracted from the Reactome database and the number of proteins annotated with the PTM is displayed for each label. If a protein is carrying multiple instances of the PTM, the PTM is counted only once.


4. Mapping omics data to pathways

PathwayMatcher takes as input a file containing biological entities, queries the content of the Reactome graph database organized in tables (version 64 at time of writing), and produces an output file with reactions and pathways where the input could be matched (see Supplementary Figure 2). Input can be gene names, single nucleotide polymorphism (SNP) identifiers, peptides, protein accessions, or proteoforms. Peptides and proteoforms can be provided with post-translational modifications and their coordinates. The input is mapped to proteins or proteoforms to find the reactions where the input entities are participants (Supplementary Figure 3). The input is mapped to proteins when data types without PTMs or specific translation products are specified; otherwise a mapping to proteoforms is used. When one type of data yields multiple results due to ambiguity, e.g. a SNP or peptide mapping to multiple proteins, all the possibilities are included in the search entities.

[Figure 2 panels: Static Resources (Map from Proteins & Proteoforms to Reactome Pathways; Map from SNP to Proteins; UniProt Proteins FASTA); Input (Omics Data); Pathway Search and Analysis; Output (Analysis Results)]

Supplementary Figure 2: PathwayMatcher general overview. The program takes the user input in the form of omics data files and the reference pathways from the database as input. It then executes the search and analysis algorithm to create a resulting list of output files.


When a list of SNPs is provided, mapping from the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016) is used to find the possibly affected proteins. When peptides are provided, their sequence is mapped to UniProt protein identifiers (The UniProt, 2017) using PeptideMapper (Kopczynski et al., 2017) and possible proteoforms are constructed. When proteins or proteoforms are available, PathwayMatcher maps them to reactions and pathways using data structures embedded in the PathwayMatcher jar file. These data structures are extracted and serialized from the Reactome Neo4j graph database (neo4j.com). The code for extraction of the relationships from proteins to pathways is available at github.com/PathwayAnalysisPlatform/Extractor.

Supplementary Figure 3: If different from proteins or proteoforms, the input is mapped to proteins or proteoforms depending on the specificity of the input. From there, reactions and pathways can be obtained from the Reactome graph database.


Proteins in Reactome are defined according to UniProt following a gene-centric paradigm. From all reactions in Reactome, 9,734 involve two proteins, participating in 2,208 human pathways (Burger et al., 2018) (version 64 at time of writing). Using additional information from Reactome on the post-translational state required for a protein to participate in a reaction, PathwayMatcher allows matching proteoforms to reactions and pathways. Supplementary Tables 1 and 2 respectively list the proteins and proteoforms that are participating in the highest number of pathways.

UniProt | Gene | Protein name | Reactions Mapped | Pathways Mapped
P62979 | RPS27A | Ubiquitin-40S ribosomal protein S27a | 306 | 292
P62987 | UBA52 | Ubiquitin-60S ribosomal protein L40 | 299 | 288
P0CG47 | UBB | Polyubiquitin-B | 279 | 270
P0CG48 | UBC | Polyubiquitin-C | 279 | 270
P62993 | GRB2 | Growth factor receptor-bound protein 2 | 259 | 144
P28482 | MAPK1 | Mitogen-activated protein kinase 1 | 70 | 119
P30153 | PPP2R1A | Serine/threonine-protein phosphatase 2A 65 kDa regulatory subunit A alpha isoform | 83 | 116
P01112 | HRAS | GTPase HRas | 89 | 112
P01116 | KRAS | GTPase KRas | 87 | 108
Q07889 | SOS1 | Son of sevenless homolog 1 | 118 | 107

Supplementary Table 1: Human proteins participating in the highest number of pathways in Reactome. Note that the Reactions Mapped column shows the number of reactions that are part of the mapped pathways. A protein may participate in a reaction that was not assigned to a pathway, and a reaction can be included in multiple pathways.


UniProt | Gene | Protein name | PTMs | Reactions Mapped | Pathways Mapped
P28482 | MAPK1 | Mitogen-activated protein kinase 1 | [00047:185, 00048:187] | 48 | 111
P27361 | MAPK3 | Mitogen-activated protein kinase 3 | [00047:202, 00048:204] | 43 | 95
P31749 | AKT1 | RAC-alpha serine/threonine-protein kinase (EC 2.7.11.1) | [00046:473, 00047:308] | 51 | 78
P31751 | AKT2 | RAC-beta serine/threonine-protein kinase (EC 2.7.11.1) | [00046:474, 00047:309] | 33 | 69
Q9GZV9 | FGF23 | Fibroblast growth factor 23 | [00164:178] | 152 | 68
P16220 | CREB1 | Cyclic AMP-responsive element-binding protein 1 | [00046:133] | 13 | 68
Q16539 | MAPK14 | Mitogen-activated protein kinase 14 | [00047:180, 00048:182] | 16 | 63
Q15759 | MAPK11 | Mitogen-activated protein kinase 11 | [00047:180, 00048:182] | 15 | 57
P19174 | PLCG1 | 1-phosphatidylinositol 4,5-bisphosphate phosphodiesterase gamma-1 (EC 3.1.4.11) | [00048:472, 00048:771, 00048:783, 00048:1253] | 38 | 55
Q8WU20 | FRS2 | Fibroblast growth factor receptor substrate 2 | [00048:196, 00048:306, 00048:349, 00048:392, 00048:436, 00048:471] | 94 | 53

Supplementary Table 2: Human proteoforms participating in the highest number of pathways in Reactome.

Note that PathwayMatcher maps experimental data to pathways in a systematic and unbiased fashion. This means that it collects all pathways containing at least one of the participant proteins or proteoforms of the input data and does not perform any filtering or biological inference. It therefore attempts to minimize the prevalence of false negatives by considering all the possible pathways annotated in the reference database. It cannot, however, control for missing annotation, i.e. what is not annotated in the knowledgebase is not considered to be happening.


5. Input

Detailed and updated documentation of the input can be found in the online documentation at github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki/Input. The supported types of input are: (a) Genetic variants, (b) Genes, (c) Peptides, (d) Proteins, (e) Proteoforms.

a) Genetic variants

i. SNP rsId list:

The file contains one rsid identifier, as defined in dbSNP, on each row. The list must be ordered by chromosome and base pair (bp), and it must not contain duplicates. All rsids must be included in the Genome Reference Consortium Human Build 37 patch release 13 (GRCh37.p13). Example:

rs187174427
rs182321900
rs566371895
rs375798137

ii. Chromosome and base pair

Genetic variants can also be represented using the chromosome and the base pair numbers. The input should be sorted by chromosome number and then by base pair. Example:

1 210827406
2 14370
2 17330
10 1110696
18 1230237
20 1234567

iii. Variant Call Format Specification (VCF)

The input follows the Variant Call Format Specification v4.3 (Danecek et al., 2011). It is also possible to specify only the first four columns in the data section of the file: CHROM, POS, ID, REF.

Whenever a value is missing, it is represented by a dot. The values for the columns CHROM, POS and REF are mandatory; only the ID column can have missing values. The data records do not need to be ordered by chromosome and base pair. The search will only take into account the Single Nucleotide Polymorphisms present in the human assembly GRCh37.p13. Example:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=
##phasing=partial
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 210827406 NA T
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

b) Genes

File with one gene name for each line. Genes follow the HUGO gene nomenclature (Wain et al., 2002). Example:

CFTR
TGFB1
FCGR2A
DCTN4
SCNN1B
SCNN1G
SCNN1A
TNFRSF1A
CLCA4
STX1A
CXCL8


c) Peptides

i. Simple list

File with one peptide sequence for each line. Example:

VGENHLVKVA
MSDVAIVKEG
GSPGKARPGT
HHLSPHPPGT
HHLSPHPPGT
QNKTLIEELKALKDLYCHKSD
MSSARFDSSDRSAWYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVL
LTEYVATRWYRAPEIMLNSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINMKARNYLQSLPSKTKVAWAKLFPKSD
LPKPSRHNTEFRDSTYDLPRSLASHGHTKG
EALAHAYFSQ
SQELRPEAKN
MKLNISFPATGCQKLIEVDD
MGSNKSKPKDASQRRRSLEPAENVHGAGGG
WDQVAEVLSWQFSSTTKRGLSIEQLTTLAEKLLGPGVNYSGCQITWAKFC
CVMEYHQATGTLSAHFRNMSLKRIKRADRRGAESVTEEKF

ii. Peptide list with post-translational modifications

Each line of the file corresponds to a single peptide with post-translational modifications. It has two fields: the peptide sequence and a set of PTMs. Each PTM is specified by a MOD type and a site number. The site of the modification is relative to the peptide sequence, using 1-based coordinates. Example:

KDGATMKTFC
KDGATMKTFC,MOD:00048:7
QFSYSASGTA,MOD:00048:2
LTEYVATRWY,MOD:031878:3,
QCEGEEDTEYMTPSSRPLRPLDTSQSSRACDCDQQIDSCTYEAMYNIQSQAPSITESSTFGEGNLAAAHANTGPEESENEDDGYDVPKPPVPAVLARRTL,MOD:00046:9;MOD:00047:40;MOD:00047:83

Here again the pathway mapping is as exhaustive as possible and considers all proteins containing each peptide. If various peptides are related to the same protein, then all the PTM site combinations for the different peptides are considered. If a peptide maps to different proteins, all possible proteins are considered for the search and protein inference must be conducted a posteriori (Nesvizhskii and Aebersold, 2005).
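The peptide format above can be illustrated with a minimal Python sketch. It is not the PathwayMatcher implementation (which is written in Java and relies on PeptideMapper); the function names and the toy protein sequence are hypothetical, and a naive substring search stands in for the peptide mapping. It shows how a peptide-relative, 1-based site translates into a protein-level site for every occurrence of the peptide.

```python
from typing import List, Tuple

Ptm = Tuple[str, int]  # (PSI-MOD identifier, 1-based site)


def parse_peptide_line(line: str) -> Tuple[str, List[Ptm]]:
    """Split 'SEQUENCE,MOD:xxxxx:site;MOD:xxxxx:site' into sequence and PTMs."""
    sequence, _, ptm_field = line.strip().partition(",")
    ptms: List[Ptm] = []
    for token in ptm_field.split(";"):
        token = token.strip().strip(",")  # tolerate stray separators
        if not token:
            continue
        prefix, mod_id, site = token.split(":")
        if prefix != "MOD":
            raise ValueError(f"Unexpected PTM token: {token!r}")
        ptms.append((mod_id, int(site)))
    return sequence, ptms


def to_protein_sites(protein: str, peptide: str, ptms: List[Ptm]) -> List[List[Ptm]]:
    """For each occurrence of the peptide, shift PTM sites to protein coordinates."""
    occurrences: List[List[Ptm]] = []
    start = protein.find(peptide)  # 0-based offset of the peptide in the protein
    while start != -1:
        # peptide site is 1-based, so protein site = offset + peptide site
        occurrences.append([(mod_id, start + site) for mod_id, site in ptms])
        start = protein.find(peptide, start + 1)
    return occurrences


if __name__ == "__main__":
    peptide, ptms = parse_peptide_line("KDGATMKTFC,MOD:00048:7")
    toy_protein = "MMKDGATMKTFCXX"  # hypothetical sequence, for illustration only
    print(to_protein_sites(toy_protein, peptide, ptms))
```

Keeping one result list per occurrence mirrors the exhaustive strategy described above: every protein position where the peptide fits yields its own candidate set of modification sites.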


d) Proteins

i. UniProt accession list

File with one UniProt Accession (The UniProt, 2017) on each line. Example:

P00519
P31749
P11274
P22681
P22681
P16220
P46109
P27361
Q9UQC2

ii. Ensembl identifier list

File with one Ensembl (Ruffier et al., 2017) identifier for each line. Example:

ENSG00000101076
ENSG00000106633
ENSP00000223366
ENSP00000312987
ENSP00000315180
ENSP00000379142
ENSP00000384247
ENSP00000396216


e) Proteoforms

The input file consists of one protein or proteoform per line. Each proteoform consists of two fields, the UniProt protein accession and the set of PTMs, separated by a semicolon ‘;’. The protein accession can include the isoform number, specified with a dash ‘-’. The PTM set contains the PTMs separated by commas ‘,’. Each PTM is specified using a modification identifier and a site, separated by a colon ‘:’.

P12318-1;00046:133,00048:304

[Format diagram: the accession (P12318) is mandatory; the isoform (1), PTM 1 (type 00046, site 133) and PTM 2 (type 00048, site 304) are optional.]

Note that the order of PTMs is not relevant for the search. The PTM identifier is a five-digit identifier from the PSI-MOD Protein Modification ontology (Montecchi-Palazzi et al., 2008). The site is an integer specifying the 1-based index of the modified amino acid on the sequence as defined by UniProt.

It is common to write the identifiers for the PTM types with the prefix ‘MOD:’ before the five digits of the ontology term. PathwayMatcher also allows the user to write the identifier without the prefix. PathwayMatcher also allows querying all proteoforms modified at a given site using the ‘00000’ wild card for modification type. Note that the modification site is mandatory, but a tolerance window can be set, see details below.

Examples:

• A single protein UniProt accession with no PTMs: P00519

• A protein with one PTM at a specific amino acid: P16220;00046:133

• A protein carrying two PTMs: P62753;00046:235,00048:236

• A specific isoform with a modification: O60879-2;00046:196


• Protein with any modification at a given site: O75916-3;00000:478

Example:

P00519;00046:245,00048:412
P31749;00047:473,00047:308
P11274;00187:177
P22681;0098:774
P22681
P16220;00046:133
P46109;01192:207
P27361;00047:202,00048:204
Q9UQC2;00000:452
Q15759;00048:182,00047:180
O15530;00048:241
P62753;00048:235,00049:236,00126:240
P12931;00048:419
P40763;00046:705,00046:727
P42229;00048:694
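As an illustration of the proteoform format described in this section, the following Python sketch parses one input line into its accession, optional isoform and PTM set. It is a reading aid under stated assumptions rather than the PathwayMatcher parser: the `Proteoform` type and `parse_proteoform` function are hypothetical names, and no validation of identifiers against PSI-MOD or UniProt is attempted.

```python
from typing import NamedTuple, Optional, Tuple

Ptm = Tuple[str, int]  # (PSI-MOD identifier without prefix, 1-based site)


class Proteoform(NamedTuple):
    accession: str
    isoform: Optional[int]   # None means the default isoform
    ptms: Tuple[Ptm, ...]    # sorted, since PTM order is not relevant


def parse_proteoform(line: str) -> Proteoform:
    """Parse 'ACCESSION[-ISOFORM][;TYPE:SITE,TYPE:SITE,...]'."""
    protein, _, ptm_field = line.strip().partition(";")
    accession, _, isoform = protein.partition("-")
    ptms = []
    for token in ptm_field.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("MOD:"):      # the 'MOD:' prefix is optional
            token = token[len("MOD:"):]
        mod_id, site = token.split(":")
        ptms.append((mod_id, int(site)))
    return Proteoform(accession, int(isoform) if isoform else None, tuple(sorted(ptms)))


if __name__ == "__main__":
    for example in ("P00519", "P12318-1;00046:133,00048:304", "O75916-3;00000:478"):
        print(parse_proteoform(example))
```

Sorting the PTMs at parse time makes two lines that list the same modifications in a different order compare equal, which reflects the statement that PTM order is not relevant for the search.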

We can expand our support or provide converters to other proteoform formats upon request, e.g. ProForma (LeDuc et al., 2018), the Protein Ontology proteoform representation (Natale et al., 2017), or the PSI Extended FASTA Format (PEFF) (Deutsch, 2012).

Proteoform matching

Searching pathways using gene names or protein accessions solely requires mapping a string of characters between the input and the knowledgebase. In order to map the proteoforms to reactions and pathways, it is necessary to decide if the proteoforms in the input are equivalent to the proteoforms annotated in the Reactome database, taking into account the protein accession, isoform information, and the set of PTMs. Two proteoforms can have all, some or none of these elements in common.

We defined a set of criteria to compare two proteoforms, one from the input and one from the reference database and decide whether they are equivalent to each other. The matching types defined for PathwayMatcher are: a) Superset, b) Subset, c) One, d) Strict.


a) Superset (with and without PTM types)

The set of input PTMs is a superset of the reference PTMs set. This includes the command line arguments: -m superset or -m superset_no_types.

• The UniProt accession is the same.
• The isoform is the same; either:
    o Both have an isoform specified. Ex: P31749-3
    o Or both refer to the default one. Ex: P31749
• The PTMs:
    o The input contains ALL the reference PTMs or more (input is superset or equal). Each reference PTM must have a matching input PTM. Some input PTMs may not have a matching reference PTM.
• A PTM matches if these two requirements are true:
    o The types match:
        ▪ If chosen superset then types should be equal.
        ▪ If chosen superset_no_types the type is not considered.
    o The coordinates match if any of the following is true:
        ▪ Both are known identical (positive integer) coordinates.
        ▪ Both are known different (positive integer) coordinates, but the absolute difference between the two coordinates is less than or equal to a user-defined margin (‘range’ option in command line).
        ▪ One of the coordinates is unknown (null, empty, ?, -1).

b) Subset (with and without PTM types)

The set of input PTMs is a subset of the reference PTMs set. This includes the command line arguments: -m subset or -m subset_no_types.

• The UniProt accession is the same.
• The isoform is the same; either:
    o Both have an isoform specified. Ex: P31749-3
    o Or both refer to the default one. Ex: P31749
• The PTMs:
    o Each input PTM must have a matching reference PTM. Some reference PTMs may not have a matching input PTM.
• A PTM matches if these two requirements are true:
    o Types match, i.e.:
        ▪ If chosen subset (then types must be equal), or
        ▪ If chosen subset_no_types (type is not considered)
    o The coordinates match if any of the following is true:
        ▪ Both are known identical (positive integer) coordinates.
        ▪ Both are known different (positive integer) coordinates, but the absolute difference between the two coordinates is less than or equal to a user-defined margin (‘range’ option in command line).
        ▪ One of the coordinates is unknown (null, empty, ?, -1).

c) One (with and without PTM types)

At least one input PTM matches. This includes the command line arguments: -m one or -m one_no_types.

• The UniProt accession is the same.
• The isoform is the same; either:
    o Both have an isoform specified. Ex: P31749-3
    o Or both refer to the default one. Ex: P31749
• The PTMs:
    o At least one input PTM must have a matching reference PTM.
• A PTM matches if these two requirements are true:
    o The types match:
        ▪ If chosen one (then types should be equal), or
        ▪ If chosen one_no_types (type is not considered)
    o The coordinates match if any of the following is true:
        ▪ Both are known identical (positive integer) coordinates.
        ▪ Both are known different (positive integer) coordinates, but the absolute difference between the two coordinates is less than or equal to a user-defined margin (‘range’ option in command line).
        ▪ One of the coordinates is unknown (null, empty, ?, -1).

d) Strict

Proteoforms must match exactly in all the attributes.

• The UniProt accession is the same.

• The isoforms are the same; either:

o Both have an isoform specified. Ex: P31749-3

o Or both refer to the default one. Ex: P31749

• The PTMs have the same elements:

o The reference PTM set and the input PTM set have the same size.

o Each reference PTM has a matching input PTM.

• A PTM matches if:

o Types are the same.

o Coordinates are the same:

▪ In case they are numbers, they should be equal

▪ In case they are null, then both should be null.

Extra considerations:

• Negative, zero or floating-point values are invalid as sequence coordinates in the input.

• We accept only PSI-MOD ontology modification types.

• The margin to compare the coordinates should be set as an unsigned integer.
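The four matching types can be summarised in a short Python sketch. This is an illustrative reading of the criteria listed above, not the PathwayMatcher code: the function names are hypothetical, the accession and isoform are assumed to have been compared already, unknown coordinates are represented by `None`, and the behaviour for empty PTM sets simply follows from `all()`/`any()` and may differ from the tool.

```python
from typing import Optional, Sequence, Tuple

Ptm = Tuple[str, Optional[int]]  # (PSI-MOD type, site or None if unknown)


def site_matches(a: Optional[int], b: Optional[int], margin: int) -> bool:
    """Sites match if one is unknown or they differ by at most the margin."""
    if a is None or b is None:
        return True
    return abs(a - b) <= margin


def ptm_matches(inp: Ptm, ref: Ptm, use_types: bool, margin: int) -> bool:
    """A PTM matches if the type (when considered) and the site both match."""
    return (not use_types or inp[0] == ref[0]) and site_matches(inp[1], ref[1], margin)


def ptm_sets_match(input_ptms: Sequence[Ptm], reference_ptms: Sequence[Ptm],
                   mode: str, margin: int = 0) -> bool:
    """Compare PTM sets; accession and isoform are assumed equal already."""
    if mode == "strict":
        # Exact comparison: same size, identical types and coordinates.
        return (len(input_ptms) == len(reference_ptms)
                and all(ref in input_ptms for ref in reference_ptms))
    suffix = "_no_types"
    use_types = not mode.endswith(suffix)
    kind = mode if use_types else mode[:-len(suffix)]

    def found(ptm: Ptm, others: Sequence[Ptm], ptm_is_input: bool) -> bool:
        return any(ptm_matches(ptm, o, use_types, margin) if ptm_is_input
                   else ptm_matches(o, ptm, use_types, margin) for o in others)

    if kind == "superset":   # every reference PTM has a matching input PTM
        return all(found(ref, input_ptms, False) for ref in reference_ptms)
    if kind == "subset":     # every input PTM has a matching reference PTM
        return all(found(inp, reference_ptms, True) for inp in input_ptms)
    if kind == "one":        # at least one input PTM has a matching reference PTM
        return any(found(inp, reference_ptms, True) for inp in input_ptms)
    raise ValueError(f"Unknown matching type: {mode}")


if __name__ == "__main__":
    reference = [("00046", 473), ("00047", 308)]
    observed = [("00046", 473), ("00047", 308), ("00048", 10)]
    for mode in ("superset", "subset", "one", "strict"):
        print(mode, ptm_sets_match(observed, reference, mode))
```

The `mode` strings reuse the values of the `-m` argument quoted above purely as labels; how the tool parses its arguments internally is not covered here.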


Supplementary Table 3 shows examples of PTM coordinate matching. Each row compares a PTM coordinate in an input PTM with a PTM coordinate in a reference PTM. The letter k represents any positive integer.

Input | Reference | Margin | Matched | Comment
17 | 17 | 0 | Yes | Equal
16 | 17 | 0 | No | Out of margin
18 | 17 | 0 | No | Out of margin
7 | 13 | 5 | No | Out of margin
8 | 13 | 5 | Yes | In margin
9 | 13 | 5 | Yes | In margin
17 | 13 | 5 | Yes | In margin
18 | 13 | 5 | Yes | In margin
19 | 13 | 5 | No | Out of margin
0 | 2 | 5 | No | Input in margin but not valid
-1 | 2 | 5 | No | Input in margin but negative
?, empty, null | Positive integer | k | Yes | Input is less specific
Positive integer | ?, empty, null, -1 | k | Yes | Input is more specific
?, empty, null | ?, empty, null, -1 | k | Yes | Equally unspecific
Negative int, zero | Any | k | No | Negative or zero input are invalid

Supplementary Table 3: Post-translational modification coordinates comparison criteria.
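The coordinate comparison of Supplementary Table 3 can be reproduced with a few lines of Python. The sketch below is an interpretation of the table, with hypothetical function names; in particular, it assumes that -1 denotes an unknown coordinate only on the reference side, while a negative or zero input coordinate is rejected as invalid, as the table rows suggest.

```python
from typing import Optional, Union

Raw = Union[int, str, None]


def parse_coordinate(raw: Raw, reference: bool = False) -> Optional[int]:
    """Return a positive site, or None if unknown; raise ValueError if invalid."""
    if raw is None or str(raw).strip() in ("", "?", "null"):
        return None
    value = int(str(raw).strip())  # floating-point text such as '1.5' raises ValueError
    if reference and value == -1:
        return None                # -1 marks an unknown site in the reference
    if value <= 0:
        raise ValueError(f"Invalid coordinate: {raw!r}")
    return value


def coordinates_match(input_raw: Raw, reference_raw: Raw, margin: int) -> bool:
    """Apply the comparison criteria of Supplementary Table 3."""
    try:
        inp = parse_coordinate(input_raw)
    except ValueError:
        return False               # negative, zero or non-integer input never matches
    ref = parse_coordinate(reference_raw, reference=True)
    if inp is None or ref is None:
        return True                # an unknown coordinate matches anything
    return abs(inp - ref) <= margin


if __name__ == "__main__":
    rows = [(17, 17, 0), (16, 17, 0), (8, 13, 5), (19, 13, 5), (0, 2, 5), ("?", 13, 5)]
    for inp, ref, margin in rows:
        print(inp, ref, margin, coordinates_match(inp, ref, margin))
```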


6. Output

PathwayMatcher produces three output files: (a) the result of the pathway search, (b) the results of the over-representation analysis, and (c) biological networks related to the input.

a) Search

The search output is a tab-separated file listing the reactions and pathways mapped from the input. The columns are:

• UNIPROT: The UniProt accession number of the protein associated with the input. Note that the proteins reported in this column are not necessarily explicitly given in the input and may also be the result of mapping peptides and/or genetic variants.
• REACTION_STID: Reaction stable identifier
• REACTION_DISPLAY_NAME: Reaction name
• PATHWAY_STID: Pathway stable identifier
• PATHWAY_DISPLAY_NAME: Pathway name
• TOP_LEVEL_PATHWAY_STID: Top level pathway stable identifier
• TOP_LEVEL_PATHWAY_DISPLAY_NAME: Top level pathway name


For gene, genetic variant, Ensembl and proteoform input, an extra column is added with the respective name:

• GENE
• ENSEMBL
• RSID
• PROTEOFORM: The set of post-translational modifications with their PSI-MOD type and coordinate.

b) Analysis

Output files

A csv file with the results of the statistical analysis. Each row corresponds to a pathway containing at least one participant entity of the input. It contains the following columns:

• Pathway StId: The Reactome stable unique identifier.
• Pathway Name: The name of the pathway in Reactome.
• # Entities Found: The number of entities (proteins or proteoforms) found as participants in the pathway.
• # Entities Total: The total number of entities participating in the pathway.
• Entities Ratio: The number of entities found divided by the total number of entities in the pathway.
• Entities P-Value: The probability of finding at least this number of entities if the entities in the input were selected completely at random and independently of each other.
• Significant: Whether the p-value is less than or equal to 0.05.
• Entities FDR: The false discovery rate.
• # Reactions Found: The number of reactions in the pathway with a participating entity of the input.
• # Reactions Total: The total number of reactions in the pathway.
• Reactions Ratio: The number of reactions found divided by the total number of reactions in the pathway.
• Entities Found: The UniProt accession numbers of the entities found in the pathway.
• Reactions Found: The Reactome stable identifiers of the reactions with participating entities found in the pathway.

Over-representation analysis

This analysis is performed after the result set of reactions and pathways is collected and follows the first generation of pathway analysis methods: over-representation analysis (García-Campos et al., 2015). A p-value for each pathway in the reference database is calculated using a binomial distribution followed by Benjamini-Hochberg correction (Benjamini and Hochberg, 1995) in a similar way as performed by the Reactome online analysis tool (Fabregat et al., 2018a). This p-value represents how likely it would be to find the same or more proteins or proteoforms in the pathway if the sample was completely random.

The matching of each entity to a given pathway is modelled as a Bernoulli trial with two possible outcomes: success or failure, depending on whether the protein or proteoform is a participant of a reaction in the pathway. Trials are considered independent from each other, meaning that the outcome of previous trials does not affect the next. Finally, the probability of success is the proportion of proteins in the pathway over the total number of possible proteins; the probability is therefore constant over all trials.

First, we search all the input entities (proteins or proteoforms) across all the pathways and count how many of them were found in each pathway. The number of entities found in a pathway is taken as the number of successful trials. Then, with the binomial probability distribution, we calculate how likely it would be to get a result equal to or more extreme than the current result (the same number or more proteins or proteoforms in the pathway), given that the input entities (proteins or proteoforms) were randomly selected (García-Campos et al., 2015).

This is done using the cumulative distribution function for the binomial distribution, which calculates the probability of getting at most k successes out of n trials, with a probability p ∈ [0,1], where X is a random variable following the binomial distribution, as detailed in Equation 1.

$$F(k, n, p) = \Pr(X \le k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^{i} (1 - p)^{n-i} \tag{1}$$

For each pathway, p is set to the ratio between the total number of proteins or proteoforms in the pathway and the total number of possible entities in the database, n is the number of proteins or proteoforms in the input sample, and k is the number of input entities successfully mapped to the pathway after the search. X is the random variable describing the number of entities that would be found in the pathway if the input was randomly selected.

Finally, given that the p-value requires the calculation of the probability of an equal or more extreme result, we use the complement of Equation 1 to calculate the probability of getting at least k successful trials out of n as stated in Equation 2.

Pr(X ≥ k) = 1 − Pr(X ≤ k − 1) (2)
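
Equations 1 and 2 can be illustrated with a minimal Python sketch. The function names and the numbers in the example are hypothetical and chosen for illustration only; this is not the code used by PathwayMatcher. For very large inputs, a numerically robust implementation from a statistics library would be preferable to this direct summation.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Pr(X <= k) for X ~ Binomial(n, p), following Equation 1."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def pathway_p_value(k: int, n: int, p: float) -> float:
    """Pr(X >= k) = 1 - Pr(X <= k - 1), following Equation 2."""
    return 1.0 - binomial_cdf(k - 1, n, p)


# Hypothetical example: a pathway holding 50 of 10,000 possible entities,
# an input of 100 entities, 3 of which map to the pathway.
p = 50 / 10_000
print(pathway_p_value(k=3, n=100, p=p))
```

With k = 0 the function returns 1, since finding zero or more entities is certain.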

The calculations for proteins and proteoforms are similar, but are performed separately depending on the input. If the input consists of protein accessions, the number of participants is calculated by only considering proteins. For proteoform input, the entities counted in the pathways and in the database are the participating proteoforms.
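
The Benjamini-Hochberg correction applied to the pathway p-values can be sketched as follows. This is a generic illustration of the step-up procedure, assuming one p-value per pathway; the function name and the example values are hypothetical and do not reproduce the PathwayMatcher source code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```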

Please note that the over-representation analysis is included as a simple analysis to identify the most covered pathways. We recommend, however, that users interpret the results of the mapping and the networks using the systems biology method that best suits the experiment and biomedical context. PathwayMatcher is developed as a hypothesis generation tool that helps to navigate large datasets and guide experiments. It is not a validation or mechanism inference tool.

c) Graph

i. Graph definition

The connection graph is defined by a set of vertices and edges, where vertices represent genes, proteins or proteoforms. The edges represent connections/relations between proteins according to the data model in the Reactome database.

Proteins are referenced only by their UniProt (The UniProt Consortium, 2017) accession. Genes follow the HUGO gene nomenclature (Wain et al., 2002). The proteoforms follow the format used for the input.

There is a connection between two proteins if:

• Both are components of the same complex: (Protein1)--(Complex)--(Protein2).

• Both participate in the same reaction: (Protein1)--(Reaction)--(Protein2).

• Both are members of the same entity set: (Protein1)--(Set)--(Protein2).

Note that these connections are undirected, i.e. the two proteins are simply related to each other.
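
The three connection rules above can be sketched in Python as follows. The data structure (a dictionary from container to member proteins) and the function name are assumptions made for illustration; the identifiers are taken from the edge file example in this document.

```python
from itertools import combinations


def connect_proteins(containers):
    """Build undirected edges between proteins sharing a container.

    `containers` maps (type, container_id) to the proteins it holds, where
    type is "Complex", "Reaction" or "Set" (hypothetical structure).
    """
    edges = set()
    for (kind, container_id), members in containers.items():
        # Sorting gives each undirected pair a single canonical orientation.
        for first, second in combinations(sorted(set(members)), 2):
            edges.add((first, second, kind, container_id))
    return edges


containers = {
    ("Reaction", "R-HSA-5675373"): ["P27361", "P28482", "P28562"],
    ("Complex", "R-HSA-1535906"): ["O43524", "P84022"],
}
for edge in sorted(connect_proteins(containers)):
    print(edge)
```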

Proteins can participate with multiple roles in a chemical reaction:

• input (reactant)

• output (product)

• catalyst

• regulator

Proteins participate independently or as components of a complex or entity set:

• (Reaction)--(Protein)

• (Reaction)--(Complex)--(Protein)

• (Reaction)--(Complex)--(Complex)--(Protein)

• (Reaction)--(Set)--(Protein)

• (Reaction)--(Set)--(Set)--(Protein)

• (Reaction)--(Complex)--(Set)--(Protein)

• (Reaction)--(Complex)--(Set)--(Set)--(Complex)--(Protein)

• ...

For genes and proteoforms, the connections are set up in a similar way, replacing the protein by the respective gene or proteoform.
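
The nesting of complexes and entity sets listed above can be resolved with a simple recursive traversal. The sketch below assumes a hypothetical dictionary mapping each reaction, complex or set to its direct participants, and assumes that the nesting contains no cycles; the names "Reaction1", "ComplexA" and "SetB" are placeholders.

```python
def proteins_in(entity, children):
    """Return all proteins reachable from an entity through nested
    complexes and sets. Entities without children are treated as proteins."""
    if entity not in children:
        return {entity}
    found = set()
    for child in children[entity]:
        found |= proteins_in(child, children)
    return found


# (Reaction)--(Complex)--(Set)--(Protein) plus a direct participant.
children = {
    "Reaction1": ["ComplexA", "P35070"],
    "ComplexA": ["SetB", "P21359"],
    "SetB": ["Q8IV61"],
}
print(sorted(proteins_in("Reaction1", children)))
```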

Finally, there are two types of edges: internal and external.

• Internal edges: connections between proteins of the input list.

• External edges: connections between a protein in the input list and a protein not in the input list.
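
The distinction between internal and external edges can be sketched as follows, assuming edges are given as pairs of accessions. The function name is hypothetical. Pairs with no endpoint in the input list fall under neither definition and are skipped in this sketch.

```python
def split_edges(edges, input_proteins):
    """Separate edges into internal (both ends in the input list) and
    external (exactly one end in the input list)."""
    input_set = set(input_proteins)
    internal, external = [], []
    for first, second in edges:
        hits = (first in input_set) + (second in input_set)
        if hits == 2:
            internal.append((first, second))
        elif hits == 1:
            external.append((first, second))
    return internal, external


edges = [("P27361", "P28482"), ("P27361", "P28562"), ("O43524", "P84022")]
internal, external = split_edges(edges, input_proteins=["P27361", "P28482"])
print("internal:", internal)
print("external:", external)
```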

ii. Graph representation

The graph is defined in three files: vertices.tsv, internalEdges.tsv and externalEdges.tsv. The format chosen to represent these graphs is compatible with iGraph (Ferres et al., 2006), in which files can be readily imported for further analysis. By default, the files are saved in the directory where PathwayMatcher is located. To save them in a different directory use the command line argument -o.

Vertices file

A tab separated file (.tsv) with two columns, one vertex (protein) per row:

• id: UniProt accession of the protein

• name: Colloquial name of the protein

Example:

id      name
P35070  Probetacellulin
P21359  Neurofibromin
Q8IV61  Ras guanyl-releasing protein 3

Edges files

Tab separated files (.tsv) with six columns, one edge (connection) per row:

• id1: UniProt accession of the first protein in the connection

• id2: UniProt accession of the second protein in the connection

• type: Where the two proteins meet (Complex or Reaction)

• container_id: Id of the complex or reaction

• role1: Role of the first protein in the connection

• role2: Role of the second protein in the connection

Example:

id1     id2     type      container_id   role1      role2
P27361  P28482  Reaction  R-HSA-5675373  input      output
P27361  P28562  Reaction  R-HSA-5675373  input      catalyst
P27361  P28562  Reaction  R-HSA-5675373  output     catalyst
O43524  P84022  Complex   R-HSA-1535906  component  component
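
As an illustration of how the edge files can be consumed, the following Python sketch parses tab-separated content with the six columns described above and builds an undirected adjacency list. It assumes that the file starts with the header row shown in the example; the function name is hypothetical, and an in-memory string replaces the actual file so that the sketch is self-contained.

```python
import csv
import io
from collections import defaultdict

# Content mirroring the example above; in practice this would be read from
# internalEdges.tsv or externalEdges.tsv.
EDGES_TSV = (
    "id1\tid2\ttype\tcontainer_id\trole1\trole2\n"
    "P27361\tP28482\tReaction\tR-HSA-5675373\tinput\toutput\n"
    "P27361\tP28562\tReaction\tR-HSA-5675373\tinput\tcatalyst\n"
    "P27361\tP28562\tReaction\tR-HSA-5675373\toutput\tcatalyst\n"
    "O43524\tP84022\tComplex\tR-HSA-1535906\tcomponent\tcomponent\n"
)


def load_adjacency(handle):
    """Read a tab-separated edge file and return neighbours per protein."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Edges are undirected, so both directions are recorded.
        adjacency[row["id1"]].add(row["id2"])
        adjacency[row["id2"]].add(row["id1"])
    return adjacency


adjacency = load_adjacency(io.StringIO(EDGES_TSV))
for protein in sorted(adjacency):
    print(protein, sorted(adjacency[protein]))
```

Using sets collapses repeated rows for the same protein pair, such as the two rows for P27361 and P28562 that differ only in role.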

7. Performance

We measured the performance of PathwayMatcher using data sets of different sizes. Benchmark data sets were randomly sampled from publicly available resources:

• Proteins: human complement of the UniProtKB/Swiss-Prot database (release 2017_10).

• Peptides: ProteomeTools (Zolg et al., 2017) as available in PRIDE (Vizcaíno et al., 2016), dataset PXD004732, release date 23/01/2017.

• Genetic variants: variants from the human assembly GRCh37.p13.

• Proteoforms: annotated proteoforms in Reactome Graph database version 62.

Performance testing was done using a standard desktop computer (Intel® Core™ i7-6600U CPU @ 2.60GHz with 2 cores using 64-bit Windows 10 with Java SE 1.8.0_144 on SSD). Details and code are available at github.com/PathwayAnalysisPlatform/PathwayMatcher/wiki/Test-datasets.

PathwayMatcher takes advantage of in-memory mapping of genetic variant effect, efficient peptide sequence matching, and direct connection to the Reactome graph database to conduct complex queries in a reasonable time. Supplementary Figures 4A-D show the performance of PathwayMatcher for four types of input: genetic variants, proteins, peptides, and proteoforms, respectively.

Supplementary Figure 4: Performance of PathwayMatcher using (A) genetic variants, (B) proteins, (C) peptides and (D) proteoforms. Performance in minutes is plotted against input size. The mean is displayed as a solid line and the 95% range as a ribbon.

For proteins and proteoforms, the processing time increases linearly with query size with a small slope, making it possible to search all available proteins within a few minutes. As expected, protein identifiers provide the fastest response time, while proteoforms are the second fastest. Mapping peptides takes approximately 30 seconds more, which corresponds to the indexing time of the protein sequence database by PeptideMapper, after which the time increases linearly in a similar fashion as for proteins. For genetic variants, an extra step is required to map SNPs to possibly affected proteins, which adds computing time.

The overall mapping time for a million SNPs is less than a minute, which is fast compared to the other steps of a variant analysis workflow. Note that the processing time is very reproducible across runs, where minor variation is only noticeable using genetic variants, resulting in very thin ribbons in Supplementary Figure 4B-D.

8. Metrics and Figures

The metrics presented in this manuscript were obtained by querying the Reactome graph database directly (Fabregat et al., 2018b). The queries used can be found in the online documentation at: github.com/PathwayAnalysisPlatform/PathwayMatcher/blob/master/docs/queriesForStatistics.md

The figures in this manuscript were built in R version 3.4.1 (2017-06-30) - "Single Candle" (r-project.org) using the following packages: ggplot2, ggrepel, igraph, scico, grid, and gtable. The R scripts used to build the figures are available in the tool repository at: github.com/PathwayAnalysisPlatform/PathwayMatcher/tree/master/docs/figures/scripts

9. References

Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Čech, M., Chilton, J., Clements, D., Coraor, N., Grüning, B.A., et al. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research 46, W537-W544.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological), 289-300.

Blankenberg, D., Von Kuster, G., Bouvier, E., Baker, D., Afgan, E., Stoler, N., Taylor, J., and Nekrutenko, A. (2014). Dissemination of scientific software with Galaxy ToolShed. Genome Biology 15, 403.

Burger, B., Hernandez Sanchez, L.F., Lereim, R.R., Barsnes, H., and Vaudel, M. (2018). Analysing the structure of pathways and its influence on the interpretation of biomedical datasets. bioRxiv.

Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., and Sherry, S.T. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156-2158.

Deutsch, E.W. (2012). File formats commonly used in mass spectrometry proteomics. Molecular & Cellular Proteomics 11, 1612-1621.

Fabregat, A., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., Haw, R., Jassal, B., Korninger, F., May, B., et al. (2018a). The Reactome Pathway Knowledgebase. Nucleic Acids Research 46, D649-D655.

Fabregat, A., Korninger, F., Viteri, G., Sidiropoulos, K., Marin-Garcia, P., Ping, P., Wu, G., Stein, L., D'Eustachio, P., and Hermjakob, H. (2018b). Reactome graph database: Efficient access to complex pathway data. PLoS Computational Biology 14, e1005968.

Ferres, L., Parush, A., Li, Z., Oppacher, Y., and Lindgaard, G. (2006). Representing and Querying Line Graphs in Natural Language: The iGraph System. Paper presented at: Smart Graphics (Berlin, Heidelberg: Springer Berlin Heidelberg).

García-Campos, M.A., Espinal-Enríquez, J., and Hernández-Lemus, E. (2015). Pathway Analysis: State of the Art. Frontiers in Physiology 6.

Grüning, B., Dale, R., Sjödin, A., Chapman, B.A., Rowe, J., Tomkins-Tinch, C.H., Valieris, R., and Köster, J. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods 15, 475-476.

Hornbeck, P.V., Zhang, B., Murray, B., Kornhauser, J.M., Latham, V., and Skrzypek, E. (2014). PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Research 43, D512-D520.

Kopczynski, D., Barsnes, H., Njolstad, P.R., Sickmann, A., Vaudel, M., and Ahrends, R. (2017). PeptideMapper: Efficient and Versatile Amino Acid Sequence and Tag Mapping. Bioinformatics.

LeDuc, R.D., Schwämmle, V., Shortreed, M.R., Cesnik, A.J., Solntsev, S.K., Shaw, J.B., Martin, M.J., Vizcaino, J.A., Alpi, E., Danis, P., et al. (2018). ProForma: A Standard Proteoform Notation. Journal of Proteome Research 17, 1321-1325.

McLaren, W., Gil, L., Hunt, S.E., Riat, H.S., Ritchie, G.R., Thormann, A., Flicek, P., and Cunningham, F. (2016). The Ensembl Variant Effect Predictor. Genome Biology 17, 122.

Montecchi-Palazzi, L., Beavis, R., Binz, P.A., Chalkley, R.J., Cottrell, J., Creasy, D., Shofstahl, J., Seymour, S.L., and Garavelli, J.S. (2008). The PSI-MOD community standard for representation of protein modification data. Nature Biotechnology 26, 864-866.

Natale, D.A., Arighi, C.N., Blake, J.A., Bona, J., Chen, C., Chen, S.C., Christie, K.R., Cowart, J., D'Eustachio, P., Diehl, A.D., et al. (2017). Protein Ontology (PRO): enhancing and scaling up the representation of protein entities. Nucleic Acids Research 45, D339-D346.

Nesvizhskii, A.I., and Aebersold, R. (2005). Interpretation of shotgun proteomic data: the protein inference problem. Molecular & Cellular Proteomics 4, 1419-1440.

Ruffier, M., Kähäri, A., Komorowska, M., Keenan, S., Laird, M., Longden, I., Proctor, G., Searle, S., Staines, D., Taylor, K., et al. (2017). Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation. Database (Oxford) 2017.

The UniProt Consortium (2017). UniProt: the universal protein knowledgebase. Nucleic Acids Research 45, D158-D169.

Vizcaíno, J.A., Csordas, A., del-Toro, N., Dianes, J.A., Griss, J., Lavidas, I., Mayer, G., Perez-Riverol, Y., Reisinger, F., Ternent, T., et al. (2016). 2016 update of the PRIDE database and its related tools. Nucleic Acids Research 44, D447-D456.

Wain, H.M., Bruford, E.A., Lovering, R.C., Lush, M.J., Wright, M.W., and Povey, S. (2002). Guidelines for human gene nomenclature. Genomics 79, 464-470.

Zolg, D.P., Wilhelm, M., Schnatbaum, K., Zerweck, J., Knaute, T., Delanghe, B., Bailey, D.J., Gessulat, S., Ehrlich, H.-C., Weininger, M., et al. (2017). Building ProteomeTools based on a complete synthetic human proteome. Nature Methods 14, 259.
