PANTHER Classification System version 7
Huaiyu Mi, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, USA
August 27, 2011, ICSB Tutorial, Heidelberg, Germany

Outline
• PANTHER Background
  – How is PANTHER built?
• PANTHER Website at a Glance
  – Brief overview of all PANTHER pages
• PANTHER Basic Functionalities
• PANTHER Tools
  – Tutorial on tool usage
PANTHER BACKGROUND

PANTHER Database

What's new in PANTHER 7.0?
• Whole genome sequence coverage from 48 organisms.
• New tree building algorithm (GIGA) for improved phylogenetic relationships of genes and families.
• Improved hidden Markov models (HMMs).
• Improved ortholog identification.
• GO slim and PANTHER protein class implemented for classifying genes and families.
• Expanded sets of genomes and sequence identifiers for PANTHER tools.
• PANTHER Pathway diagrams in SBGN.
PANTHER PROTEIN LIBRARY

What is PANTHER?
PANTHER library (PANTHER/LIB)
• a family tree
• a multiple sequence alignment
• an HMM
PANTHER GO slim and Protein Class
• Molecular function
• Biological process
• Cellular component
• Protein class
[Diagram labels: Sequences; PANTHER subfamily HMM models; Statistic models (HMM); Phylogenetic trees; Multiple sequence alignments]
Building PANTHER Protein Family Library
[Flowchart: Select sequences → Build clusters → Build MSA → Build trees → Build HMMs. Other boxes: Curation; PANTHER Protein Library; PANTHER GO slim and Protein Class ontology]
Complete Gene Sets
• 12 GO Reference Genomes
• 36 other genomes to help reconstruct evolutionary history
  – 14 bacterial genomes
  – 2 archaeal genomes
  – 2 fungal genomes
  – 2 plant genomes
  – 1 amoebozoan genome
  – 3 protist genomes
  – 2 protostome genomes
  – 10 deuterostome genomes
“Standard” set of protein coding genes and corresponding protein sequences
• 48 genomes
• Sources of genes
  – MOD
  – ENSEMBL
  – NCBI (Entrez)
• Sources of protein sequences
  – UniProt
  – NCBI (RefSeq)
  – ENSEMBL
• One protein is selected for each gene.
[Flowchart: Get list of genes in each genome → Get list of all protein products from given source → Get mapping of each protein product to UniProt → Select one “representative” protein for each gene]
Building Clusters and MSA
• Family and subfamily IDs (PTHRxxxxx:SFx) are tracked as much as possible.
• New IDs are assigned if necessary.
• In PANTHER 7.2 (to be released at the end of 2011), all clusters with at least one sequence from the 12 MOD will be included in the library.
• MSAs are built with MAFFT, a freely available multiple sequence alignment software package (Katoh, Nucleic Acids Res., 30:3059-3066).
[Flowchart: Score against PANTHER 6.1 HMM library → Hit an HMM? → if yes: Family cluster; if no: InterPro for additional clusters → MSA by MAFFT]
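The ID notation above (PTHRxxxxx:SFx) can be handled with a small parser. This is a minimal sketch, not PANTHER code: the function name, the `PantherId` type, and the exact pattern (five digits for the family, an optional `:SF` suffix with a number) are assumptions read off the notation on the slide.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the "PTHRxxxxx:SFx" notation: five digits for the
# family, optionally followed by ":SF" and a subfamily number.
PANTHER_ID = re.compile(r"^PTHR(\d{5})(?::SF(\d+))?$")


class PantherId(NamedTuple):
    family: str               # e.g. "PTHR11537"
    subfamily: Optional[int]  # None when only a family ID was given


def parse_panther_id(text: str) -> PantherId:
    """Split a family or family:subfamily ID into its parts (hypothetical helper)."""
    match = PANTHER_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a PANTHER family/subfamily ID: {text!r}")
    family_digits, subfamily_digits = match.groups()
    subfamily = int(subfamily_digits) if subfamily_digits is not None else None
    return PantherId(f"PTHR{family_digits}", subfamily)
```

Keeping the family and subfamily parts separate makes it easy to group subfamilies under their family when IDs are tracked across releases.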
GIGA
• An algorithm that makes phylogenetic inferences under the constraint of the species tree.
• Uses sequence-based distance from the multiple sequence alignment at each step.
  – Speciation
  – Duplication
  – Ortholog group (subfamily)
Thomas, 2010, BMC Bioinformatics, 11:312
Phylogenetic inferences based on species tree
[Figure labels: speciation; speciation; “Fixed differences” between species]

Speciation event
[Figure: two side-by-side trees with leaves human, chimpanzee, mouse, rat, cow, horse, chicken, frog, mosquito, fruit fly, worm, yeast]

Duplication event
[Figure: two side-by-side trees with the same leaves, in which human and chimpanzee each appear twice]
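To make the distinction between the two event types concrete, here is a minimal sketch that labels the internal nodes of a gene tree. It is not GIGA: it uses the simpler species-overlap rule (a node is a duplication if the same species appears under more than one child, otherwise a speciation), and the `Node` class and `label_events` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    species: Optional[str] = None              # set on leaves only
    children: List["Node"] = field(default_factory=list)
    event: Optional[str] = None                # filled in for internal nodes


def label_events(node: Node) -> Set[str]:
    """Label every internal node and return the set of species below `node`."""
    if not node.children:
        return {node.species}
    seen: Set[str] = set()
    overlap = False
    for child in node.children:
        child_species = label_events(child)
        if seen & child_species:               # same species under two children
            overlap = True
        seen |= child_species
    node.event = "duplication" if overlap else "speciation"
    return seen


# Two copies of a gene, each present in human and chimpanzee.
copy_a = Node(children=[Node("human"), Node("chimpanzee")])
copy_b = Node(children=[Node("human"), Node("chimpanzee")])
root = Node(children=[copy_a, copy_b])
label_events(root)
print(root.event, copy_a.event, copy_b.event)
```

In this toy tree the root is labeled a duplication and the two inner nodes are speciations, mirroring the human/chimpanzee pattern in the duplication figure.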
PANTHER Phylogenetic Tree
Tree from PTHR11537
• Green node: speciation
• Yellow node: duplication
• Blue diamond: subfamily
PANTHER Protein Library Building
[Flowchart: 600,000 sequences from 48 organisms → 400,000 sequences in 6,594 family clusters → Curation → 62,972 subfamilies annotated with GO terms and PANTHER pathways]
Tree Representation of Subfamilies

MSA

PANTHER Ontology in Tree

PANTHER in InterPro

PANTHER in FlyBase

PANTHER and Gene Ontology Reference Genome Project

PANTHER PATHWAY

Goals
• Go beyond individual proteins.
• Understand how multiple proteins work together in a complex system.
• Build an integrated infrastructure with expert-curated pathways.
• Help establish a standard that will enable the content to be used across a large number of software applications.
• The system should allow users to:
  – Predict gene and protein functions
  – Analyze research data
  – Navigate or browse the literature
  – Design new experiments
Biological process ontology vs. Pathway

Phylogenetic relationships help pathway building
[Figure, built up over three slides: pathway fragments with node labels M, A, X and p; annotation: >40,000 orthologous trees]
Two approaches to building pathway databases
• Bottom-up
  – Start from individual protein/reaction
  – Build species-specific pathways (or partial pathways)
  – Infer to other organisms based on orthologue mapping
  – Generate a more comprehensive pathway map
  – Example databases: MetaCyc and Reactome
• Top-down
  – Start with pathways at the conceptual level, usually based on review papers or textbooks
  – Build a comprehensive pathway map
  – Assign protein sequences to the pathway
PANTHER Pathway Data Structure
PANTHER pathway
• A pathway diagram
• Curate the pathway
• Display the pathway
• Unambiguous graphical representation of pathway data
• Structured data for pathway
• Link pathway classes to the sequence database
[Diagram labels: Pathway; Reaction; Pathway Molecule Classes; Cell type/Cellular location; Sequences; PANTHER subfamily HMM models; Statistic models (HMM); Phylogenetic tree; Multiple sequence alignment; PANTHER library]

PANTHER Pathway Data Structure (components)
• Reaction
  – Catalysis
  – Transition
  – Transcription and translation activation/inhibition
  – Activation / Inhibition
  – Phosphorylation / dephosphorylation
  – Complex formation
  – Transportation
  – Upstream / downstream
• Cell type/Cellular location
  – Nucleus
  – Mitochondria
  – Cytoplasm
  – Nerve terminal
  – Lymphocyte
  – Astrocytes
• Pathway Molecule Classes
  – Proteins: receptor, kinase
  – Genes: receptor gene, kinase gene
  – Simple molecules: glucose, pyruvate
  – Ions: calcium ion
  – Phenotypes: stress, glucose deprivation
  – This entity is also used to link out to other pathways.
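The data structure on these two slides can be sketched as a few record types. This is an illustrative model only, not the PANTHER schema: every class, field, and example value below (including the placeholder subfamily IDs) is an assumption made to show how a pathway links reactions and molecule classes to library subfamilies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MoleculeClass:
    name: str
    kind: str          # e.g. "protein", "gene", "simple molecule", "ion", "phenotype"
    location: str      # cell type or cellular location, e.g. "cytoplasm"
    subfamily_ids: List[str] = field(default_factory=list)  # links into the library


@dataclass
class Reaction:
    kind: str          # e.g. "catalysis", "phosphorylation", "transportation"
    reactants: List[str]
    products: List[str]
    modifiers: List[str] = field(default_factory=list)


@dataclass
class Pathway:
    name: str
    molecule_classes: List[MoleculeClass] = field(default_factory=list)
    reactions: List[Reaction] = field(default_factory=list)

    def subfamilies(self) -> List[str]:
        """Library subfamily IDs reachable from this pathway, de-duplicated in order."""
        seen: List[str] = []
        for molecule in self.molecule_classes:
            for subfamily_id in molecule.subfamily_ids:
                if subfamily_id not in seen:
                    seen.append(subfamily_id)
        return seen


# Made-up example data; the IDs are placeholders, not real PANTHER subfamilies.
receptor = MoleculeClass("receptor", "protein", "cytoplasm", ["PTHR00001:SF1"])
kinase = MoleculeClass("kinase", "protein", "cytoplasm", ["PTHR00002:SF1", "PTHR00001:SF1"])
example = Pathway(
    "example signaling pathway",
    molecule_classes=[receptor, kinase],
    reactions=[Reaction("phosphorylation", ["kinase"], ["kinase"], modifiers=["receptor"])],
)
print(example.subfamilies())
```

The molecule class is the level that carries the link to sequences, which is what lets one curated pathway apply to every organism with a member of the linked subfamily.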
CellDesigner

Pathway Curation Process
[Flowchart: Identify pathways to curate; Identify curators → CellDesigner → Pathway diagrams → SBML parser → Pathway index. Other boxes: PANTHER library; Pathway DB; Pathway curation web infrastructure; PANTHER database with library sequences associated to pathways; Pathway diagram applet; Web delivery]
Activity flow view

Standard view

SBGN-PD view

History of PANTHER
• 1998: Project was launched at Molecular Application Group.
• 1999: Acquired by Celera Genomics.
• 2000: PANTHER 1 released in Celera Discovery Systems (CDS).
• 2001: PANTHER 2 released and used in the annotation of the first published human genome (Celera).
• 2002: PANTHER 3 released. PANTHER annotations integrated into FlyBase. Moved to ABI.
• 2003: PANTHER 4 released with the public release of the PANTHER Classification System.
• 2005: PANTHER 5 released with PANTHER Pathway and analysis tools. Collaboration with InterPro established.
• 2006: PANTHER 6 released. Moved to SRI.
• 2010: PANTHER 7 released.
• 2011: Moved to USC.
User Statistics
• 12,000 visits per month
• Visitors from over 90 countries and territories; the top 10 are the USA, India, the UK, Germany, China, Japan, Canada, France, Australia, and the Netherlands.
• 130,000 page views per month
• Cited in 2,280 scientific papers (up to August 2011)

PANTHER Statistics
• 48 organisms
• 400,000 genes
• 62,972 subfamilies
• 6,594 families
• 165 pathways
PANTHER WEBSITE AT A GLANCE
• Main menu tabs give access to each subject's main page.
• PANTHER keyword search and HMM score.
• Quick links to popular PANTHER functionalities.
• PANTHER news and publications.

PANTHER Website Pages
• List page
  – Gene list page
  – Family/subfamily list page
  – Ontology or pathway list page
• Information detail page
  – Gene detail page
  – Family/subfamily detail page
  – Pathway description page
  – Pathway molecule class detail page
  – Ontology term detail page
• Graph and diagram page
  – Pie chart
  – Pathway diagram
  – Tree viewer
PANTHER Gene List Page
• Click to view the pie chart.
• Choose an organism to display your gene list.
• Sort the list by clicking the column name.
• Collapse the column(s) by clicking on the “x” icon.
• Convert the gene list to another list type.
• Export the list to:
  – Workspace (need to register an account)
  – File on your computer
  – Text on the website
Gene Detail Page
• Information is divided into 3 sections
  – General information about the gene: IDs, names, gene symbol, alternative IDs, etc.
  – PANTHER classification of the gene: PANTHER family and subfamily information; links to view the tree and MSA; PANTHER GO slim and protein class
  – Orthologs of the gene
• Columns
  – ID: unique gene identifiers in PANTHER.
  – Organism: the modern-day organism in which the ortholog is found. For paralogs, the organism column gives the two speciation events between which the duplication that generated the paralogous genes occurred. “ND” means “not determined”. Different paralogs can thus be distinguished by how long ago the relevant duplications occurred.
  – Type: LDO (least diverged ortholog); O (other, more diverged orthologs, in case of gene duplication); P (paralogs)
• Orthologs are genes that can be traced to the same gene in the genome of their most recent common ancestor species.
• Paralogs are genes that are traced to related, but distinct, genes in the genome of their most recent common ancestor species.
Pie Chart
How to read the numbers:
• 1st number: the number of genes classified to this category. In our example in figure 2.20, it is 2067.
• 2nd number: the percent of genes classified to this category over the total number of genes.
• 3rd number: the percent of genes classified to this category over the total number of class hits.
• Class hit means independent ontology terms. If a gene is classified to 2 ontology terms that are not parent or child to each other, it counts as 2 class hits.
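The three numbers shown for a pie chart slice can be reproduced with simple arithmetic. The sketch below assumes the two totals are already known; the function and type names are hypothetical, and the totals in the usage lines are made-up values chosen only to show the calculation (only the count 2067 comes from the slide).

```python
from typing import NamedTuple


class PieSlice(NamedTuple):
    gene_count: int               # 1st number
    percent_of_genes: float       # 2nd number
    percent_of_class_hits: float  # 3rd number


def pie_slice(genes_in_category: int, total_genes: int, total_class_hits: int) -> PieSlice:
    """Compute the three numbers displayed for one category (hypothetical helper)."""
    if total_genes <= 0 or total_class_hits <= 0:
        raise ValueError("totals must be positive")
    return PieSlice(
        genes_in_category,
        100.0 * genes_in_category / total_genes,
        100.0 * genes_in_category / total_class_hits,
    )


# 2067 is the count from the slide; the two totals are invented for illustration.
example_slice = pie_slice(2067, total_genes=20000, total_class_hits=25000)
print(f"{example_slice.gene_count} genes, "
      f"{example_slice.percent_of_genes:.1f}% of genes, "
      f"{example_slice.percent_of_class_hits:.1f}% of class hits")
```

Because one gene can contribute several class hits, the total number of class hits is usually at least the number of classified genes, so the third number is typically no larger than it would be if computed over classified genes alone.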
PANTHER BASICS

PANTHER Basic Functions
• Keyword search
  – Simple
  – Advanced
• Online HMM score (single sequence)
• Prowler
• Batch ID search
• Downloads

Basic Keyword Search
• Search term:
  – Identifier
  – Word
  – Phrase (multiple words)
• Exact word match. Use the wildcard character “*” for partial word search.
• Specify a subject to return the search results.
• The search covers all fields.
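The difference between exact word match and a “*” wildcard search can be illustrated with a short sketch. This is not how the PANTHER website implements search; it is an assumed model that splits a field into lowercase words and compares each one to a single-word search term using Python's `fnmatch` module. Phrase search is not covered, and `fnmatch` also treats `?` and `[...]` as special characters, which the slide does not mention.

```python
import re
from fnmatch import fnmatchcase
from typing import List


def words(text: str) -> List[str]:
    """Split a field into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(term: str, field_text: str) -> bool:
    """True if any word in the field equals the term, with '*' matching any run of characters."""
    pattern = term.lower()
    return any(fnmatchcase(word, pattern) for word in words(field_text))


description = "Tyrosine kinase receptor"
print(matches("kinase", description))   # whole word
print(matches("kinas", description))    # partial word without a wildcard
print(matches("kinas*", description))   # partial word with a wildcard
```

Under this model a partial word such as "kinas" finds nothing unless the wildcard is added, which is the behavior the slide describes.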
Advanced Keyword Search
• You can restrict the search to specific fields.
• You can select which genome from the 12 MOD to search. We expect to expand this to all 48 organisms in the next release.

Online HMM Score
• Only the top-hit HMM is reported here.
• The green dots next to the score indicate how closely related the protein is to the model. There are three categories:
  – closely related (3 green dots,