Biological Pathway Analysis: Trends and Applications

Pathway resources, data access, and pathway standards

BIME 591 2017.1.11

Lucy Lu Wang

Uses for pathway resources

• Functional analysis
• General information retrieval
• Biosimulation
• Visualizing relationships
• Disease modeling

What information should be represented?

• Entities and their relationships
• Biological entities (chemical species, proteins/enzymes, cofactors)
• Cellular entities (locality, type of cell, compartments)
• Organism
• Function
• Rates?
• Reactions
• Reference to a published resource
• Synonyms/aliases
• Related pathways
• Analogous pathways from different species (ortholog?)
• Curation status
• Meta-information for how to draw the pathway diagram

Strengths and weaknesses of pathway representation

Strengths
• Useful abstraction for human interpretation
• Clarifies relationships between genes and molecules
• Connects genes to function

Weaknesses
• Pathways are not independent
• Representations may disagree between resources
• Human curation is slow and expensive
• Lack of consistency in naming, classification, search, and download
• Pathway boundaries are arbitrary

Today’s topics

Pathway resources and available data
Data access channels
Relevant linked databases

Pathway Resources

Different ways to view pathways:
Pathway ontologies
Pathway diagrams
Pathway representations
Protein-protein interaction networks
Gene sets

Pathway Ontologies

These describe the organizational hierarchy of pathways, i.e. pathway classes. Actual pathway representations are instances of those classes.

Pathway ontologies:
• BioPortal’s Pathway Ontology
• INOH (Integrating Network Objects with Hierarchies): defunct, data available through PathwayCommons

Compare:
http://purl.bioontology.org/ontology/PW (ontology only)
http://reactome.org/PathwayBrowser/ (ontology with pathway instances)

Pathway Diagrams

Visual display of pathway information (entities and relationships)
Some are backed by pathway representations
e.g. Reactome “Glycolysis” pathway

Pathway Representations

BioPAX pathway converted from "Metabolism of carbohydrates" in the Reactome database.

Protein-Protein Interaction Network

An example network of proteins involved in glycolysis

From Krishna et al. (2014), “Systems genomics evaluation of the SH-SY5Y neuroblastoma cell line as a model for Parkinson’s disease”

Gene Set

An example glycolysis gene set

ALDOA, ALDOB, ALDOC, BPGM, ENO1, ENO2, ENO3, GALM, GCK, GPI, HK2, HK3, PFKL, PGAM2, PGK1, PGK2, PGM1, PGM2, PGM3, PKLR, TPI1

Databases of pathway representations

Pathguide: http://pathguide.org/

Pathway Data Standards

BioPAX (Biological Pathway Exchange)
SBML (Systems Biology Markup Language)

Other:
PSI-MI (Proteomics Standards Initiative Molecular Interactions)
KGML (KEGG Markup Language)
GPML (Graphical Pathway Markup Language)
SBGN (Systems Biology Graphical Notation)

BioPAX

Triple store with many pathway-specific keywords
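To make the triple idea concrete, here is a minimal sketch in plain Python, with no RDF library, that stores a few BioPAX-style statements as (subject, predicate, object) tuples and looks them up by pattern. The `ex:` identifiers and the reaction label are made up for illustration; `bp:Pathway`, `bp:BiochemicalReaction`, `bp:displayName`, and `bp:pathwayComponent` are BioPAX Level 3 terms written in abbreviated prefix form.

```python
# Each statement is a (subject, predicate, object) triple.
# "ex:" identifiers are hypothetical; "bp:" terms abbreviate BioPAX Level 3 names.
TRIPLES = [
    ("ex:Pathway1", "rdf:type", "bp:Pathway"),
    ("ex:Pathway1", "bp:displayName", "Glycolysis"),
    ("ex:Pathway1", "bp:pathwayComponent", "ex:Reaction1"),
    ("ex:Reaction1", "rdf:type", "bp:BiochemicalReaction"),
    ("ex:Reaction1", "bp:displayName", "Hexokinase step"),  # made-up label
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def pathway_names(triples):
    """Collect the display name of every subject typed as bp:Pathway."""
    names = []
    for subject, _, _ in match(triples, p="rdf:type", o="bp:Pathway"):
        for _, _, name in match(triples, s=subject, p="bp:displayName"):
            names.append(name)
    return names


if __name__ == "__main__":
    print(pathway_names(TRIPLES))
```

A real triple store answers the same kind of pattern query, only over millions of statements and with a proper query language on top.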

BioPAX2 and BioPAX3 specifications are NOT interoperable!

http://www.biopax.org/webprotege/

http://www.biopax.org/mediawiki/index.php/Specification

SBML

XML encoding designed for describing biological models (entities, interactions, rates etc) e.g. given a set of reactions and initial conditions, how does the system proceed?
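The question such a model is meant to answer can be illustrated without SBML itself. The sketch below is plain Python, not libSBML: it takes a single irreversible reaction A -> B with mass-action kinetics and steps it forward with fixed-step Euler integration. The rate constant, step size, and initial amounts are made-up values chosen only for illustration.

```python
def simulate(initial, k, dt, steps):
    """Euler integration of A -> B with rate k * [A].

    initial: dict with starting amounts for "A" and "B"
    Returns a list of (time, A, B) tuples, including the state at t = 0.
    """
    a, b = initial["A"], initial["B"]
    trajectory = [(0.0, a, b)]
    for i in range(1, steps + 1):
        flux = k * a * dt  # amount of A converted to B during this step
        a -= flux
        b += flux
        trajectory.append((i * dt, a, b))
    return trajectory


if __name__ == "__main__":
    # Hypothetical reaction parameters and initial conditions.
    for t, a, b in simulate({"A": 10.0, "B": 0.0}, k=0.5, dt=0.1, steps=5):
        print(f"t={t:.1f}  A={a:.3f}  B={b:.3f}")
```

An SBML file encodes the same ingredients (species, reactions, rate laws, initial conditions) in XML so that any compliant simulator can run them.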

References: libSBML SBML parser python libAntimony

http://sbml.org/Documents/Specifications

Pathway resource reference database stack

Most entities in pathway databases are cross-referenced to identifiers in the following linked databases

Pathway/Process and Cellular location: Gene Ontology
Reaction: EC Number (Enzyme Commission)
Protein/Enzyme: EC Number, Entrez-protein, MOPED, UniProt
Small molecule: CAS, ChEBI, ChemSpider, HMDB, KEGG, PubChem
Gene: Ensembl, Entrez-gene, GeneCards, HUGO, OMIM

[Diagram labels: Cellular location, Pathway, Reaction, Protein or Enzyme, Small molecule, DNA or RNA, Gene]

And now, some popular pathway databases…

Public databases

Reactome
• Central curation
• Supports BioPAX and SBML

BioCyc (HumanCyc)
• Focuses on metabolic pathways
• Supports BioPAX and SBML

WikiPathways
• Community-based curation
• Uses GPML

Public databases: other

• NCI Pathway Interaction Database: data available at NDEx or through PathwayCommons
• PANTHER pathways: primarily signaling pathways
• NetPath: focuses on signal transduction pathways
• SMPDB (Small Molecule Pathway Database): focuses on human small molecule pathways
• SignaLink: signaling pathways, cross-talks

Pathway Diagrams:
• BioCarta: available as gene sets from MSigDB; formed the basis of NCI PID
• KEGG pathway diagrams: available at PathwayCommons, converted by BioModels
• PharmGKB: available in BioPAX and GPML

A history of pathway resources

[Timeline figure, 1995 to 2015, resources in order of appearance: KEGG (now available by subscription only), BioCarta diagrams, HumanCyc, Reactome, PANTHER, NCI PID, WikiPathways, NetPath, SMPDB]

Subscription databases

KEGG (Kyoto Encyclopedia of Genes and Genomes), last public version: 2011
IPA (Ingenuity Pathway Analysis)
MetaCore
TRANSFAC Professional, last public version: 2005

Size of resources?

Canonical versus species-specific pathways
Pathway uniqueness
Pathway overlap

Hard to judge…

Data access

APIs/web services
SPARQL endpoints
Raw data

APIs

Quick questions might be best answered through API calls:

Documentation: BioCyc, Reactome, Pathway Commons, KEGG, WikiPathways
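As a rough sketch of what such a call can look like, the snippet below builds a request URL and fetches JSON using only the Python standard library. The base URL and the `/data/query/{id}` path are assumed to follow the pattern of Reactome's Content Service, and R-HSA-70171 is assumed to be the identifier of the human glycolysis pathway; check both against the current documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for Reactome's Content Service; verify against the docs.
BASE_URL = "https://reactome.org/ContentService"


def build_query_url(stable_id, base_url=BASE_URL):
    """Build the URL for looking up one database object by its identifier."""
    return f"{base_url}/data/query/{urllib.parse.quote(stable_id)}"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON response body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def lookup_display_name(stable_id):
    """Fetch one record (needs network access) and return its displayName field, if any."""
    record = fetch_json(build_query_url(stable_id))
    return record.get("displayName")
```

Calling `lookup_display_name("R-HSA-70171")` would perform the actual request; the other resources follow the same build-a-URL, parse-the-response pattern with their own paths and formats.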

Libraries available in various programming languages, e.g. bioservices (Python)

SPARQL endpoints

SPARQL is a query language for RDF
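For a sense of what a query looks like, the sketch below holds a small SPARQL query that lists pathways and their display names using BioPAX Level 3 terms, then encodes it as a GET request the way the SPARQL protocol expects (a `query` URL parameter). The endpoint address is a placeholder, not a real service, and whether a given resource exposes its data with BioPAX terms has to be confirmed per endpoint.

```python
import urllib.parse

# Placeholder endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "https://example.org/sparql"

QUERY = """
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
SELECT ?pathway ?name
WHERE {
  ?pathway a bp:Pathway ;
           bp:displayName ?name .
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL (parameter name: query)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query.strip()})


if __name__ == "__main__":
    print(build_request_url(ENDPOINT, QUERY))
```

The `WHERE` block is a graph pattern: every pair of `?pathway` and `?name` that makes both triples true in the data becomes one result row.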

Some resources have online SPARQL endpoints: Reactome, WikiPathways, Pathway Commons

Or you can host your own: Stardog, Virtuoso

Raw data

Download directly from sites with open data: Reactome, HumanCyc, Pathway Commons
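Once a file is downloaded, even the standard library is enough for a first look. The sketch below parses a tiny hand-written fragment in the style of a BioPAX Level 3 RDF/XML export (it is not real data from any of these sites) and collects the display name of each pathway element. Real exports are far larger and are usually better handled with a dedicated RDF or BioPAX library.

```python
import xml.etree.ElementTree as ET

BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written illustrative fragment, not taken from any real database.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:Pathway rdf:ID="Pathway1">
    <bp:displayName>Glycolysis</bp:displayName>
  </bp:Pathway>
  <bp:Pathway rdf:ID="Pathway2">
    <bp:displayName>Gluconeogenesis</bp:displayName>
  </bp:Pathway>
</rdf:RDF>"""


def pathway_names_from_xml(xml_text):
    """Map each bp:Pathway element's rdf:ID to its bp:displayName."""
    root = ET.fromstring(xml_text)
    names = {}
    for pathway in root.iter(f"{{{BP}}}Pathway"):
        identifier = pathway.get(f"{{{RDF}}}ID")
        names[identifier] = pathway.findtext(f"{{{BP}}}displayName")
    return names


if __name__ == "__main__":
    print(pathway_names_from_xml(SAMPLE))
```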

Interact with data through any programming language:
RDF libraries: rdflib (Python), rrdf (R), Jena (Java), Redland (C)
Paxtools (BioPAX; Java and R)
libSBML (SBML; C/C++, Matlab, Java, Python)

Let’s see some examples

Assignment:

• Find a partner
  NOTE: At least one person per pair should have taken KR
• Pick a disease with a complex genetic component
• Before next class, spend some time on Google Scholar or PubMed and find some of the genes that show association with your disease
• We will use this next time

Next class: 2017.1.18
Experimenting with APIs, SPARQL, and RDF libraries

Think about: • When might you want to access data through APIs versus SPARQL versus directly?

Read:
http://www.pathwaycommons.org/pc2/
http://www.dataversity.net/introduction-to-sparql/