PANTHER Classification System version 7

Huaiyu Mi
Department of Preventive Medicine
Keck School of Medicine
University of Southern California, USA

August 27, 2011, ICSB Tutorial, Heidelberg, Germany

Outline

• PANTHER Background
  – How PANTHER is built
• PANTHER Website at a Glance
  – Brief overview of all PANTHER pages
• PANTHER Basic Functionalities
• PANTHER Tools
  – Tutorial on tool usage

PANTHER BACKGROUND

PANTHER Database

What’s new in PANTHER 7.0?

• Whole genome sequence coverage from 48 organisms.
• New tree building algorithm (GIGA) for improved phylogenetic relationships of genes and families.
• Improved Hidden Markov Models.
• Improved ortholog identification.
• GO slim and PANTHER protein class implemented for classifying genes and families.
• Expanded sets of genomes and sequence identifiers for PANTHER tools.
• PANTHER Pathway diagrams in SBGN.

PANTHER PROTEIN LIBRARY

What is PANTHER?

PANTHER library (PANTHER/LIB)

Sequences are grouped into PANTHER subfamily HMM models. Each family is represented by
• a family tree (phylogenetic trees)
• a multiple sequence alignment
• an HMM (statistical models)

PANTHER GO slim and Protein Class
• Molecular function
• Biological process
• Cellular component
• Protein class

Building PANTHER Protein Family Library

[Flowchart: Select sequences → Build clusters → Build MSA → Build trees → Build HMMs. Curation with the PANTHER GO slim and Protein Class ontology leads to the PANTHER Protein Library.]

Complete Gene Sets

• 12 GO Reference Genomes
• 36 other genomes to help reconstruct evolutionary history
  – 14 bacterial genomes
  – 2 archaeal genomes
  – 2 fungal genomes
  – 2 plant genomes
  – 1 amoebozoan genome
  – 3 protist genomes
  – 2 protostome genomes
  – 10 deuterostome genomes

“Standard” set of protein coding genes and corresponding protein sequences

• 48 genomes
• Sources of genes
  – MOD
  – ENSEMBL
  – NCBI (Entrez)
• Sources of protein sequences
  – UniProt
  – NCBI (Refseq)
  – ENSEMBL
• One protein is selected for each gene.

[Flowchart: Get list of genes in each genome → Get list of all protein products from given source → Get mapping of each protein product to UniProt → Select one “representative” protein for each gene.]

Building Clusters and MSA

• Family and subfamily IDs (PTHRxxxxx:SFx) are tracked as much as possible.
• New IDs are assigned if necessary.
• In PANTHER 7.2 (release at the end of 2011), all clusters with at least one sequence from the 12 MOD will be included in the library.
• MSAs are built with mafft, a freely available multiple sequence alignment software package (Katoh, Nucleic Acids Res., 30:3059-3066).

[Flowchart: Score against PANTHER 6.1 HMM library → Hit an HMM? If yes: Family cluster. If no: Interpro for additional clusters. Then: MSA by mafft.]

GIGA

• An algorithm that makes phylogenetic inferences under the constraint of the species tree.
• Uses sequence-based distance from the multiple sequence alignment at each step.
  – Speciation
  – Duplication
  – Ortholog group (subfamily)

Thomas, 2010, BMC Bioinformatics, 11:312

Phylogenetic inferences based on species tree

[Figure: two speciation events on a species tree, with “fixed differences” between species.]

Speciation event

[Figure: tree over human, chimpanzee, mouse, rat, cow, horse, chicken, frog, mosquito, fruit fly, worm, and yeast, with a speciation event marked.]

Duplication event

[Figure: tree over the same species set, with a duplication event marked.]

PANTHER Phylogenetic Tree

Tree from PTHR11537

• Green node: speciation
• Yellow node: duplication
• Blue diamond: subfamily

PANTHER Protein Library Building

• 600,000 sequences from 48 organisms
• 400,000 sequences in 6,594 family clusters
• After curation: 62,972 subfamilies annotated with GO terms and PANTHER pathways

Tree Representation of Subfamilies

MSA

PANTHER Ontology in Tree

PANTHER in InterPro

PANTHER in FlyBase

PANTHER and Gene Ontology Reference Genome Project

PANTHER PATHWAY

Goals

• Go beyond the individual protein.
• Understand how multiple proteins work together in a complex system.
• Build an integrated infrastructure with expert-curated pathways.
• Help to establish a standard that will enable the content to be used across a large number of software applications.
• The system should allow users to:
  – Predict gene and protein functions
  – Analyze research data
  – Navigate or browse the literature
  – Design new experiments

Biological process ontology vs. Pathway

Phylogenetic relationships help pathway building

[Figure, built up over three slides: pathway diagram with labels M, p, A, and X, connected through >40,000 orthologous trees.]

Two approaches to build pathway databases

• Bottom-up
  – Start from individual protein/reaction
  – Build species-specific pathways (or partial pathways)
  – Infer to other organisms based on orthologue mapping
  – Generate a more comprehensive pathway map
  – Example databases: MetaCyc and Reactome
• Top-down
  – Start with pathways at the conceptual level, usually based on review papers or textbooks
  – Build a comprehensive pathway map
  – Assign protein sequences to the pathway

PANTHER Pathway Data Structure

PANTHER pathway
• A pathway diagram
• Curate the pathway
• Display the pathway
• Unambiguous graphical representation of pathway data
• Structured data for pathway
• Link pathway classes to the sequence database

[Diagram: PANTHER pathway (Pathway; Reaction, Pathway Molecule Classes, Cell type/Cellular location) linked through Sequences to the PANTHER library (PANTHER subfamily HMM models; statistical models (HMM), phylogenetic tree, multisequence alignment).]

PANTHER Pathway Data Structure

Reaction
• Catalysis
• Transition
• Transcription and translation activation/inhibition
• Activation / Inhibition
• Phosphorylation / dephosphorylation
• Complex formation
• Transportation
• Upstream / downstream

Cell type/Cellular location
• Nucleus
• Mitochondria
• Cytoplasm
• Nerve terminal
• Lymphocyte
• Astrocytes

Pathway Molecule Classes
• Proteins: receptor, kinase
• Genes: receptor gene, kinase gene
• Simple molecules: glucose, pyruvate
• Ions: calcium ion
• Phenotypes: stress, glucose deprivation
• This entity is also used to link out to other pathways.

[Diagram: same structure as the previous slide, from Pathway through Sequences to the PANTHER library.]

CellDesigner

Pathway Curation Process

[Flowchart components: identify pathways to curate; identify curators; CellDesigner; pathway diagrams; SBML parser; pathway index; PANTHER library; pathway DB; pathway curation; web infrastructure; PANTHER database with library sequences associated to pathways; pathway diagram applet; web delivery.]

Activity flow view

Standard view

SBGN-PD view

History of PANTHER

• 1998: Project was launched at Molecular Application Group.
• 1999: Acquired by Celera Genomics.
• 2000: PANTHER 1 released in Celera Discovery Systems (CDS).
• 2001: PANTHER 2 released; it was used in the annotation of the first published human genome by Celera.
• 2002: PANTHER 3 released. PANTHER annotations are integrated in FlyBase. Moved to ABI.
• 2003: PANTHER 4 released with the public release of the PANTHER Classification System.
• 2005: PANTHER 5 released with PANTHER Pathway and analysis tools. Collaboration with InterPro established.
• 2006: PANTHER 6 released. Moved to SRI.
• 2010: PANTHER 7 released.
• 2011: Moved to USC.

User Statistics

• 12,000 visits per month
• Visitors from over 90 countries and territories; the top 10 are USA, India, UK, Germany, China, Japan, Canada, France, Australia, and the Netherlands.
• 130,000 page views per month
• Cited in 2,280 scientific papers (up to August 2011)

PANTHER Statistics

• 48 organisms
• 400,000 genes
• 62,972 subfamilies
• 6,594 families
• 165 pathways

PANTHER WEBSITE AT A GLANCE

• Main menu tabs give access to each subject's main page.
• PANTHER keyword search and HMM score.
• Quick links to popular PANTHER functionalities.
• PANTHER news and publications.

PANTHER Website Pages

• List page
  – Gene list page
  – Family/subfamily list page
  – Ontology or pathway list page
• Information detail page
  – Gene detail page
  – Family/subfamily detail page
  – Pathway description page
  – Pathway molecule class detail page
  – Ontology term detail page
• Graph and diagram page
  – Pie chart
  – Pathway diagram
  – Tree viewer

PANTHER Gene List Page

• Click to view the pie chart.
• Choose an organism to display your gene list.
• Sort the list by clicking the column name.
• Collapse the column(s) by clicking on the “x” icon.
• Convert the gene list to another list type.
• Export the list to
  – Workspace (you need to register an account)
  – File on your computer
  – Text on the website

Gene Detail Page

• Information is divided into 3 sections
  – General information about the gene
    • Including IDs, names, gene symbol, alternative IDs, etc.
  – PANTHER classification of the gene
    • PANTHER family and subfamily information
    • Links to view the tree and MSA
    • PANTHER GO slim and protein class
  – Orthologs of the gene

Gene Detail Page

• Columns
  – ID: unique gene identifiers in PANTHER.
  – Organism: the modern-day organisms in which the ortholog is found. For paralogs, the organism column gives the two speciation events between which the duplication occurred that generated the paralogous genes. “ND” means “not determined”. Thus different paralogs can be distinguished by how long ago the relevant duplications occurred.
  – Type
    • LDO: least diverged ortholog
    • O: other, more diverged orthologs (in case of gene duplication)
    • P: paralogs
• Orthologs are genes that can be traced to the same gene in the genome of their most recent common ancestor species.
• Paralogs are genes that are traced to related, but distinct, genes in the genome of their most recent common ancestor species.
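The ortholog/paralog definitions above can be illustrated with a small Python sketch. It assumes a gene tree whose internal nodes are already labelled as speciation or duplication events (as in the PANTHER trees), and it classifies two genes by the event at their most recent common ancestor. The `Node` class, the toy tree, and the gene names are hypothetical; the sketch does not separate LDO from other orthologs and is not PANTHER's own implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A gene-tree node. Internal nodes carry an event label; leaves do not."""
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" for internal nodes
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def relationship(gene_a, gene_b):
    """Classify two distinct leaf genes by the event at their most recent common ancestor."""
    seen = {id(n) for n in ancestors(gene_a)}
    for n in ancestors(gene_b):
        if id(n) in seen:
            return "ortholog" if n.event == "speciation" else "paralog"
    raise ValueError("genes are not in the same tree")


# Hypothetical toy family: one ancient duplication, then a speciation in each copy.
root = Node("root", event="duplication")
s1 = root.add(Node("S1", event="speciation"))
s2 = root.add(Node("S2", event="speciation"))
human_a = s1.add(Node("human_A"))
mouse_a = s1.add(Node("mouse_A"))
human_b = s2.add(Node("human_B"))
mouse_b = s2.add(Node("mouse_B"))

print(relationship(human_a, mouse_a))  # ortholog
print(relationship(human_a, human_b))  # paralog
```

In this toy tree, human_A and mouse_B are also paralogs, because their lineages separated at the duplication node rather than at a speciation node.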

Pie Chart

How to read the numbers:
• 1st number: the number of genes that are classified to this category. In our example in figure 2.20, it is 2067.
• 2nd number: the percent of genes classified to this category over the total number of genes.
• 3rd number: the percent of genes classified to this category over the total number of class hits.
• Class hit means independent ontology terms. If a gene is classified to 2 ontology terms that are not parent or child to each other, it counts as 2 class hits.
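As a rough illustration of how the three pie chart numbers relate, the sketch below computes them for one category. The count 2067 comes from the example above; the totals (10,000 genes and 15,000 class hits) are made-up values for illustration only, and the function name is hypothetical.

```python
def pie_chart_numbers(genes_in_category, total_genes, total_class_hits):
    """Return (gene count, percent of all genes, percent of all class hits)."""
    pct_of_genes = 100.0 * genes_in_category / total_genes
    pct_of_hits = 100.0 * genes_in_category / total_class_hits
    return genes_in_category, pct_of_genes, pct_of_hits


# 2067 is the count from the example; both totals below are invented.
count, pct_genes, pct_hits = pie_chart_numbers(2067, 10_000, 15_000)
print(f"{count} genes, {pct_genes:.1f}% of genes, {pct_hits:.1f}% of class hits")
```

Because a gene can contribute more than one class hit, the total number of class hits is at least the number of classified genes, so the third number is never larger than the second when every gene in the list is classified.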

PANTHER BASICS

PANTHER Basic Functions

• Keyword search
  – Simple
  – Advanced
• Online HMM score (single sequence)
• Prowler
• Batch ID search
• Downloads

Basic Keyword Search

• Search term:
  – Identifier
  – Word
  – Phrase (multiple words)
• Exact word match. Use the wildcard character “*” for partial word search.
• Specify a subject to return the search results.
• The search looks in all fields.

Advanced Keyword Search

• You can refine the field you want to search.
• You can select the genome from the 12 MOD to search. We expect to expand this to all 48 organisms in the next release.

Online HMM Score

• Only the top hit HMM is reported here.
• The green dots next to the score indicate how closely related the protein is to the model. There are three categories:
  – closely related (3 green dots, p<E-23): molecular function is likely to be correct, but biological process/pathway is less certain
  – distantly related (1 green dot, E-3>p>E-11): protein is evolutionarily related, but function may have diverged

Batch ID Search

• Enter ID
  – Type or paste in the box
  – Browse and upload a file
• File format
  – Simple text (.txt) format. Microsoft Excel files are not supported.
  – Previously exported search results
• Supported IDs
• Select from the 12 MOD to search.

Batch ID Search

Why are my IDs not mapped?
• IDs are not supported.
• Wrong file format.
• Wrong organism(s) selected for the search.
• The ID mapping is missing from the UniProt idmapping mechanism.
• IDs are not covered in the PANTHER protein library.
• The IDs have been updated by the source database (MOD, UniProt, Refseq, ENSEMBL).

Prowler

• Multiple ontology and pathway terms can be selected.
• Species can be defined.
• The results have to meet all the search criteria.

PANTHER TOOLS

PANTHER Tools

• Gene expression analysis tools
  – Compare gene list(s) (binomial distribution test)
  – Analyze a list of genes with expression values (Mann-Whitney U Test)
• Coding SNP analysis

Compare Gene List

• Your gene list 1 of interest
• Your gene list 2 of interest
• Reference gene list

Binomial Distribution Test

[Figure: a reference gene list containing genes in Category 1 and genes in Category 2, next to your gene list of interest from a gene expression experiment.]

Are any categories over- or under-represented in your gene list of interest compared to a reference?
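One way to make the question above concrete is the following Python sketch of a binomial test for a single category. It takes the category's fraction in the reference list as the success probability, computes the expected count in your list, and sums a one-sided binomial tail. The function name, the one-sided tail, and all the example counts are assumptions for illustration; the slides do not specify how the tool computes its P value.

```python
from math import comb


def binomial_representation(ref_in_category, ref_total, list_in_category, list_total):
    """Return (expected count, '+' or '-', one-sided binomial P value) for one category."""
    p = ref_in_category / ref_total          # probability under the reference list
    expected = list_total * p                # expected number of genes in your list

    def pmf(k):
        """Probability of exactly k category genes among list_total genes."""
        return comb(list_total, k) * p**k * (1 - p) ** (list_total - k)

    if list_in_category >= expected:
        # Over-representation: probability of observing this many or more.
        sign = "+"
        p_value = sum(pmf(k) for k in range(list_in_category, list_total + 1))
    else:
        # Under-representation: probability of observing this many or fewer.
        sign = "-"
        p_value = sum(pmf(k) for k in range(0, list_in_category + 1))
    return expected, sign, p_value


# Invented numbers: 400 of 20,000 reference genes are in the category,
# and 8 of the 100 genes in your list are in it.
expected, sign, p_value = binomial_representation(400, 20_000, 8, 100)
print(f"expected={expected:.1f} sign={sign} P={p_value:.2e}")
```

The three returned values correspond to the "# expected", "+/-", and "P value" columns of the result table. For long gene lists, a statistics library would be a more numerically robust choice than summing the probability mass function directly.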

Members of clusters: randomly distributed?

[Figure: all genes on the reference gene list compared with genes upregulated in the tumor sample, for the category “Intracellular signaling cascade”.]

Compare Gene List

Result columns:
• # in the reference list
• # observed
• # expected
• +: over-represented; -: under-represented
• P value: probability of the result occurring at random

The data on this slide is not real. It is for training purposes only.

Mann-Whitney U Test

• A non-parametric (distribution-free) statistical significance test
• Compares 2 samples
  – Sample list: values from genes in a particular pathway or category
  – Reference list: values from all genes in the experiment
• The output: a list of P values between a functional category distribution and the reference distribution.

Clark A. et al., Science 302: 1960
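A minimal Python sketch of the Mann-Whitney U test follows, assuming the usual rank-sum definition with a normal approximation for the two-sided P value. It omits the tie and continuity corrections that a statistics library would apply, the function names and expression values are invented, and it is not presented as the tool's own implementation.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample, reference):
    """Return (U for the sample, '+' or '-', two-sided P value by normal approximation)."""
    n1, n2 = len(sample), len(reference)
    ranks = average_ranks(list(sample) + list(reference))
    rank_sum = sum(ranks[:n1])                 # ranks belonging to the sample list
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction in this sketch
    z = (u - mean_u) / sd_u
    p_value = erfc(abs(z) / sqrt(2))
    sign = "+" if u > mean_u else "-"          # sample shifted up or down vs. reference
    return u, sign, p_value


# Invented expression values: the category genes are all higher than the rest.
u, sign, p_value = mann_whitney_u([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0])
print(u, sign, round(p_value, 3))
```

Running the function once per pathway or category, each time against the reference values, would give the list of P values described above.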

Mann-Whitney U Test

Result columns:
• # genes in the category
• +: over-expressed
• -: under-expressed
• P: probability of the result occurring at random

Results shown in PANTHER pathway in a “heat map”

The data on the Mann-Whitney U Test and heat map slides is not real. It is for training purposes only.

Evolutionary cSNP analysis: Predicting functional SNPs using observed sequences across different organisms

• Which cSNPs are most likely to affect protein function?
  – When function is conserved across homologous proteins
    • Protein sequences change due to neutral drift
    • Conserved amino acids are under negative selection
  – When function is evolving across homologous proteins
    • Protein sequences change due to selective pressure
    • Variable amino acids may be under positive selection

When function is conserved across homologous proteins

• Evolution: mutation experiments & in vivo functional assay
  – Molecular function and interactions
• How to identify cSNPs likely to have functional impact:
  – HMM-based analysis balances prior knowledge of average with weighted, site-specific observations
  – Position-specific evolutionary conservation (PSEC)
  – subPSEC = ln(Pmin/Pmax)
    Thomas et al., Genome Res. 13:2129 (2003)
    Thomas & Kejariwal, PNAS 101:15398 (2004)
• Optimal when sampling is over more distant homologs (both paralogs and orthologs)
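The subPSEC formula above can be written as a short Python function. The sketch assumes that Pmin and Pmax are the smaller and larger of the two amino acid probabilities at the aligned HMM position (wild type and substituted), so the score is zero or negative. The function name and the probability values are made up for illustration.

```python
from math import log


def sub_psec(p_wild_type, p_substituted):
    """subPSEC = ln(Pmin / Pmax) for the two amino acid probabilities at one position."""
    if p_wild_type <= 0 or p_substituted <= 0:
        raise ValueError("probabilities must be positive")
    p_min = min(p_wild_type, p_substituted)
    p_max = max(p_wild_type, p_substituted)
    return log(p_min / p_max)


# Invented HMM probabilities: a common residue replaced by a rare one.
score = sub_psec(0.60, 0.006)
print(round(score, 2))  # ln(0.01), about -4.61
```

Because the ratio is always taken as smaller over larger, swapping the two amino acids gives the same score, and two equally probable amino acids give a score of 0.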

Functional effects of cSNPs: genotype to phenotype via protein (dys)function

[Figure: mouse wild type and mouse albino, with mammalian tyrosinases excerpted from an alignment spanning vertebrates.]

subPSEC = ln(PS/PC) = -4.9

cSNP Scoring Tool

[Screenshot: input form with fields for the sequence and the substitutions.]

• subPSEC (substitution position-specific evolutionary conservation)
  – 0 (neutral) to -10 (most likely to be deleterious)
  – -3 is significant
• Pdeleterious: probability that a given SNP has a deleterious effect. A subPSEC score of -3 corresponds to a Pdeleterious of 0.5.
• NIC: number of independent counts.

Thomas et al., Genome Res. 13:2129 (2003)
Thomas & Kejariwal, PNAS 101:15398 (2004)
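The relationship between subPSEC and Pdeleterious can be sketched as a logistic curve centred at -3. The exact functional form used below is an assumption: the slide only states that a subPSEC of -3 corresponds to a Pdeleterious of 0.5, and this curve is one form consistent with that anchor point. The function name is hypothetical.

```python
from math import exp


def p_deleterious(sub_psec_score):
    """Assumed logistic mapping from subPSEC to Pdeleterious, centred at -3."""
    return 1.0 - 1.0 / (1.0 + exp(-sub_psec_score - 3.0))


print(p_deleterious(-3.0))  # 0.5, the anchor point stated on the slide
for score in (0.0, -5.0, -10.0):
    print(score, round(p_deleterious(score), 3))
```

Under this assumed curve, scores near 0 give a low probability and scores near -10 give a probability close to 1, matching the neutral-to-deleterious range described above.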

“position does not align to the HMM” message
• The substitution occurs at a position that does not appear in the multiple sequence alignment;
• The substitution occurs at a position that is inserted relative to the consensus HMM for the given HMM.
• In most cases, these positions are not conserved and are not modeled by the HMMs.
• Substitutions at inserted positions are generally not likely to be deleterious.

Future Development

• PANTHER 7.2
  – Improved sequence coverage (15% more)
  – Improved batch ID search by updating the ID mapping
  – Improved tool and website performance
  – BioPAX support for all pathways
  – Expected release date: Jan. 2012
• PANTHER 8.0
  – New and updated sequence set
  – Improved statistical analysis algorithm for gene expression analysis
  – Support for online batch HMM scoring
  – Customized phylogenetic trees
  – Expanded tools to analyze other genome-wide studies, such as GWAS, CNV, RNAseq, NGS, etc.
  – Improved website with enhanced performance
  – Expected release date:
    • 2013 with partial functionalities
    • 2015 with full functionalities

Contact

[email protected]

Acknowledgements

• PANTHER Group
  – Paul Thomas*
  – Anushya Muruganian
  – Stan Dong
  – All former PANTHER members
• Systems Biology Institute in Japan
  – Hiroake Kitano
  – Akira Funahashi
  – Yukiko Matsuoka
  – Samet Ghosh
• GO Consortium
  – Suzanna Lewis
  – Pascale Gaudet
  – All GO Consortium members
