BIMM 143
Pathway Analysis and the Interpretation of Gene Lists
Lecture 15
Barry Grant
http://thegrantlab.org/bimm143

[Figure: RNA-seq workflow. Inputs: Control and Treatment reads (FastQ; Reads R1, Reads R2 [optional]), Reference Genome (UCSC Fasta), Annotation (UCSC GTF). Steps: 1. Quality Control (FastQC); 2. Alignment/Mapping (TopHat2); 3. Read Counting (CuffLinks); 4. Differential expression analysis! (DESeq2). Outputs: Count Table (data.frame). One stage is labelled "Last day's step".]

Volcano Plot: Fold change vs P-value
Significant (P < 0.01 & log2 fold change > 2)

My high-throughput experiment generated a long list of genes/proteins…
What do I do now?

Pathway analysis! (a.k.a. geneset enrichment)
Use bioinformatics methods to help extract biological meaning from such lists…
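The volcano-plot cut-offs quoted above (P < 0.01 and log2 fold change > 2) can be applied to a results table to produce the "long list" of genes. The following is a minimal sketch in base R using made-up values; the column names `log2FoldChange` and `pvalue` are assumptions, as is the use of the absolute fold change so that both up- and down-regulated genes are kept.

```r
# Toy differential-expression results (made-up values, for illustration only)
res <- data.frame(
  gene           = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange = c(3.1, -2.6, 0.4, 2.2, -0.1),
  pvalue         = c(0.0004, 0.002, 0.30, 0.04, 0.80),
  stringsAsFactors = FALSE
)

# Assumption: use |log2 fold change| so down-regulated genes are also kept
is_sig <- res$pvalue < 0.01 & abs(res$log2FoldChange) > 2

# The gene list that would be passed on to pathway analysis
sig_genes <- res$gene[is_sig]
print(sig_genes)
```

Only genes passing both cut-offs are retained; a gene with a large fold change but a weak p-value (geneD here) is dropped.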

[Figure: the RNA-seq workflow again, now with a fifth step. Steps: 1. Quality Control (FastQC); 2. Alignment/Mapping (TopHat2); 3. Read Counting (CuffLinks); 4. Differential expression analysis! (DESeq2); 5. Gene set enrichment analysis! (KEGG, GO, …).]

Basic idea
[Figure: Differentially Expressed Genes (DEGs) -> Annotate... -> Gene-sets (Pathways, annotations, etc...). Overlap... between DEGs and Pathways -> Pathway analysis (geneset enrichment)]

Pathway analysis (a.k.a. geneset enrichment): Principle
[Figure: DEGs overlapping a Pathway (geneset): Enriched vs Not enriched]

• DEGs come from your experiment
  ➢ Critical, needs to be as clean as possible
• Pathway genes (“geneset”) come from annotations
  ➢ Important, but typically not a competitive advantage
• Variations of the math: overlap, ranking, networks...
  ➢ Not critical, different algorithms show similar performances
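One common form of the "overlap" variant of the math is a hypergeometric (one-sided Fisher) test: is the overlap between the DEG list and a geneset larger than expected by chance? The sketch below uses base R's `phyper()`. The gene names, the universe size of 1000 and the set sizes are all made up for illustration, and this is only one of several ways the enrichment statistic can be computed.

```r
# Toy inputs (hypothetical gene names and sizes, for illustration only)
universe <- paste0("gene", 1:1000)           # all genes measured in the experiment
degs     <- universe[1:50]                   # our differentially expressed genes
pathway  <- universe[c(1:10, 501:530)]       # one annotated geneset (40 genes)

# Number of DEGs that fall inside the pathway
overlap <- length(intersect(degs, pathway))

# P(overlap >= observed) if the DEGs were drawn at random from the universe
p_value <- phyper(
  q = overlap - 1,                           # lower.tail = FALSE gives P(X > q)
  m = length(pathway),                       # pathway genes in the universe
  n = length(universe) - length(pathway),    # non-pathway genes
  k = length(degs),                          # size of the DEG list
  lower.tail = FALSE
)

print(overlap)
print(p_value)
```

Here 10 of the 50 DEGs land in a 40-gene pathway where about 2 would be expected by chance, so the pathway counts as "enriched". In practice the test is repeated for every geneset and the p-values are corrected for multiple testing.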

Side-note: Pathway analysis (a.k.a. geneset enrichment): Limitations

• Geneset annotation bias: can only discover what is already known
• Non-model organisms: no high-quality genesets available
• Post-transcriptional regulation is neglected
• Tissue-specific variations of pathways are not annotated
  • e.g. NF-κB regulates metabolism, not inflammation, in adipocytes
• Size bias: stats are influenced by the size of the pathway
• Many pathways/receptors converge to few regulators
  e.g. Tens of innate immune receptors activate four TFs: NF-κB, AP-1, IRF3/7, NFAT

Starting point for pathway analysis: Your gene list

• You have a list of genes/proteins of interest
• You have quantitative data for each gene/protein
  • Fold change
  • p-value
  • Spectral counts
  • Presence/absence

[Figure: example gene list mixing several identifier types, e.g. 228018_at, ENSG00000090339, NP_000192, F26A1.8, ICAM1]

Translating between identifiers

• Many different identifiers exist for genes and proteins, e.g. UniProt, Entrez, etc.
• Often you will have to translate one set of ids into another
  • A program might only accept certain types of ids
  • You might have a list of genes with one type of id and info for genes with another type of id
• Various web sites translate ids -> best for small lists
  • UniProt < www.uniprot.org >; IDConverter < idconverter.bioinfo.cnio.es >

Translating between identifiers: UniProt < www.uniprot.org >

• VLOOKUP in Excel -> good if you are an Excel whizz - I am not!
  • Download flat file from Entrez, UniProt, etc; Open in Excel; Find columns that correspond to the 2 IDs you want to convert between; Sort by ID; Use VLOOKUP to translate your list

Translating between identifiers: Excel VLOOKUP
VLOOKUP(lookup_value, table_array, col_index_num)


• Use the merge() or mapIds() functions in R - fast, versatile & reproducible!
  • Also the clusterProfiler::bitr() function and many others… (see the clusterProfiler vignette)

Reminder: Using the merge() function

# This is an annotation file
anno <- read.csv("data/annotables_grch38.csv")

# mygenes is our differentially expressed genes (row names are Ensembl gene ids)
merge(mygenes, anno, by.x="row.names", by.y="ensgene")

Alternative: Using the mapIds() function from Bioconductor

# Load the required Bioconductor packages
library("AnnotationDbi")
library("org.Hs.eg.db")

mygenes$symbol <- mapIds( org.Hs.eg.db,
                          keys=row.names(mygenes),  # Our vector of gene names
                          keytype="ENSEMBL",        # & their format
                          column="SYMBOL" )         # Annotation we want to add
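To see what the merge() call does without the annotation file, here is a small self-contained sketch in base R. The identifiers and symbols (`ENSG_A`, `SYM_A`, etc.) are placeholders rather than real Ensembl ids, and `all.x = TRUE` is an added assumption so that genes missing from the annotation are kept with `NA` instead of being silently dropped.

```r
# Toy DEG table with gene ids as row names (placeholder ids, made-up values)
mygenes <- data.frame(
  log2FoldChange = c(2.5, -3.0, 2.1),
  row.names      = c("ENSG_A", "ENSG_B", "ENSG_C")
)

# Toy annotation table standing in for the CSV file read with read.csv()
anno <- data.frame(
  ensgene = c("ENSG_A", "ENSG_B", "ENSG_D"),
  symbol  = c("SYM_A", "SYM_B", "SYM_D"),
  stringsAsFactors = FALSE
)

# Join on row names (left) and the "ensgene" column (right).
# all.x = TRUE keeps every row of mygenes, filling unmatched ids with NA.
annotated <- merge(mygenes, anno,
                   by.x = "row.names", by.y = "ensgene",
                   all.x = TRUE)
print(annotated)
```

The joined key comes back as a column called `Row.names`. `ENSG_C` has no entry in the toy annotation, so its symbol is `NA`, which makes untranslated ids easy to spot.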

What functional set databases do you want?

[Figure: DEGs compared against Pathway, GO, KEGG, IPA etc....]

• Most commonly used:
  • Gene Ontology (GO)
  • KEGG Pathways (mostly metabolic)
  • GeneGO MetaBase
  • Ingenuity Pathway Analysis (IPA)
• Many others...
  • Enzyme Classification, PFAM, Reactome,
  • Disease Ontology, MSigDB, Chemical Entities of Biological Interest, Network of Cancer Genes etc…
• See: Open Biomedical Ontologies ( www.obofoundry.org )

See package vignette: https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html

GO < www.geneontology.org >

• What function does HSF1 perform?
  • response to heat; sequence-specific DNA binding; transcription; etc
• Ontology => a structured and controlled vocabulary that allows us to annotate gene products consistently, interpret the relationships among annotations, and can easily be handled by a computer
• GO database consists of 3 ontologies that describe gene products in terms of their associated biological processes, cellular components and molecular functions

GO Annotations

• GO is not a stand-alone database of genes/proteins or sequences
• Rather gene products get annotated with GO terms by UniProt and other organism specific databases, such as Flybase, Wormbase, MGI, ZFIN, etc.
• Annotations are available through AmiGO < amigo.geneontology.org >

GO is structured as a “directed graph”

• GO Terms (nodes)
• Relationships (edges)
• Parent terms are more general & child terms more specific

Use and misuse of the gene ontology annotations. Seung Yon Rhee, Valerie Wood, Kara Dolinski & Sorin Draghici. Nature Reviews Genetics 9, 509-515 (2008)

GO evidence codes

• See AmiGO for details: http://amigo.geneontology.org/amigo/base_statistics

Can now do gene list analysis with GeneGO online!

Another popular online tool: DAVID at NIAID < david.abcc.ncifcrf.gov >

DAVID

• Functional Annotation Chart

Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Da Wei Huang, Brad T Sherman & Richard A Lempicki. Nature Protocols 4, 44-57 (2009)

Overlapping functional sets

• Many functional sets overlap
• In particular those from databases that are hierarchical in nature (e.g. GO)
• Hierarchy enables:
  • Annotation flexibility (e.g. allow different degrees of annotation completeness based on what is known)
  • Computational methods to “understand” function relationships (e.g. ATPase function is a subset of enzyme function)
• Unfortunately, this also makes functional profiling trickier
• Clustering of functional sets can be helpful in these cases

DAVID Functional Annotation Clustering

• DAVID now offers functional annotation clustering:
  • Based on shared genes between functional sets

Want more?

• GeneGO < portal.genego.com >

• MD/PhD curated annotations, great for certain domains (e.g. Cystic Fibrosis)

• Nice network analysis tools

• Email us for access

• Oncomine < www.oncomine.org >

• Extensive cancer related expression datasets

• Nice concept analysis tools

• Research edition is free for academics, Premium edition $$$

• Lots and lots of other R/Bioconductor packages in this area!!!
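The idea of clustering functional sets by their shared genes, mentioned earlier for overlapping sets, can be sketched in a few lines of base R. This is not DAVID's actual algorithm: the sketch swaps in a plain Jaccard similarity and hierarchical clustering as a simple stand-in, and the geneset names, their members and the cut height of 0.5 are all made up for illustration.

```r
# Hypothetical genesets (names and members are made up for illustration)
genesets <- list(
  setA = c("g1", "g2", "g3", "g4"),
  setB = c("g2", "g3", "g4", "g5"),
  setC = c("g8", "g9")
)

# Jaccard similarity: shared genes divided by all genes in either set
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix across all genesets
sim <- sapply(genesets, function(a) sapply(genesets, function(b) jaccard(a, b)))

# Hierarchical clustering on the dissimilarity (1 - similarity),
# cutting the tree at an arbitrary height of 0.5
clusters <- cutree(hclust(as.dist(1 - sim)), h = 0.5)

print(round(sim, 2))
print(clusters)
```

setA and setB share three of their five distinct genes, so they fall into the same cluster, while setC shares nothing and stays on its own. Grouping redundant sets this way makes a long enrichment table easier to read.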

Do it Yourself!
Hands-on time!
http://thegrantlab.org/bimm143

[Figure: the five-step RNA-seq workflow, ending in 5. Gene set enrichment analysis! (KEGG, GO, …)]

Pathway and Network Analysis

• Any analysis involving pathway or network information
• Most commonly applied to interpret lists of genes
• Most popular type is pathway enrichment analysis, but many others are useful
• Helps gain mechanistic insight into ‘omics data

Module 12: Introduction to Pathway and Network Analysis, bioinformatics.ca (3/7/17)

Inputs: counts + metadata

countData is the count matrix (Number of reads coming from each gene for each sample)

gene    ctrl_1  ctrl_2  exp_1  exp_2
geneA   10      11      56     45
geneB   0       0       128    54
geneC   42      41      59     41
geneD   103     122     1      23
geneE   10      23      14     56
geneF   0       1       2      0
...

colData describes metadata about the columns of countData

id      treatment  sex     ...
ctrl_1  control    male    ...
ctrl_2  control    female  ...
exp_1   treated    male    ...
exp_2   treated    female  ...

N.B. First column of colData must match column names (i.e. sample names) of countData (-1st)
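The N.B. about countData and colData can be checked in one line before any analysis is run. The sketch below rebuilds a few rows of the two tables from the slide as base R data frames; the variable name `samples_match` is only for illustration.

```r
# Count matrix: first column is the gene id, remaining columns are samples
countData <- data.frame(
  gene   = c("geneA", "geneB", "geneC", "geneD"),
  ctrl_1 = c(10, 0, 42, 103),
  ctrl_2 = c(11, 0, 41, 122),
  exp_1  = c(56, 128, 59, 1),
  exp_2  = c(45, 54, 41, 23),
  stringsAsFactors = FALSE
)

# Sample metadata: one row per sample (i.e. per count column)
colData <- data.frame(
  id        = c("ctrl_1", "ctrl_2", "exp_1", "exp_2"),
  treatment = c("control", "control", "treated", "treated"),
  sex       = c("male", "female", "male", "female"),
  stringsAsFactors = FALSE
)

# First column of colData must equal the column names of countData,
# ignoring countData's first (gene) column; order matters too
samples_match <- all(colData$id == colnames(countData)[-1])
print(samples_match)
```

If this prints `FALSE`, the samples are mislabelled or in a different order, and the metadata should be reordered before continuing.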

Next Class: Pathways vs Networks

Pathways
- Detailed, high-confidence consensus
- Biochemical reactions
- Small-scale, fewer genes
- Concentrated from decades of literature

Networks
- Simplified cellular logic, noisy
- Abstractions: directed, undirected
- Large-scale, genome-wide
- Constructed from omics data integration

Next Class: Types of Pathway/Network Analysis

- What biological processes are altered in this cancer?
- Are new pathways altered in this cancer? Are there clinically-relevant tumour subtypes?
- How are pathway activities altered in a particular patient? Are there targetable pathways in this patient?



Pathway & Network Analysis Overview

Find biological processes underlying a phenotype
[Figure: Databases, Literature and Expert knowledge feed Network Information; Network Information and Experimental Data feed Network Analysis]

Do it Yourself!

R Knowledge Check for BIMM-143
Quiz: This will be marked but not graded (i.e. will not factor into your course grade)
Time Limit: 40 mins