The HMMerThread Database: A Resource for Weakly Conserved Domains
Charles Bradshaw
Max-Planck Institute for Cell Biology and Genetics Dresden
Habermann / Zerial Group
Contents
• HMMerThread Database:
– A Bit of Biology: The Central Dogma
– Conserved Domains
– HMMerThread Overview
– Results
– Database
• Summary
The Central Dogma
DNA → RNA → Protein
ATG CCG GGG → AUG CCG GGG → M-P-G
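As a minimal sketch of the central dogma in code, the snippet below transcribes a DNA coding strand to RNA and translates it with the standard genetic code. The function names are illustrative, and the DNA string is written with nine bases so that it matches the three codons AUG CCG GGG shown above.

```python
from itertools import product

# Standard genetic code, built from the conventional UCAG codon ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(rna) - len(rna) % 3, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


rna = transcribe("ATGCCGGGG")  # nine-base example (assumption, see above)
print(rna, translate(rna))
```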
MPGIDKLPIEETLEDSPQTRSLLGVFEEDATAISNYMNQLYQAMHRIYDAQNELSAATHLTSKLLKEYEKQRFPLGGDDEVMSSTLQQFSKVIDELSSCHAVLSTQLADAMMFPITQFKER
DLKEILTLKEVFQIASNDHDAAINRYSRLSKKRENDKVKYEVTEDVYTSRKKQHQTMMHYFCALNTLQYKKKIALLEPLLGYMQAQISFFKMGSENLNEQLEEFLANIGTSVQNVRREMDS
DIETMQQTIEDLEVASDPLYVPDPDPTKFPVNRNLTRKAGYLNARNKTGLVSSTWDRQFYFTQGGNLMSQARGDVAGGLAMDIDNCSVMAVDCEDRRYCFQITSFDGKKSSILQAESKKDH
EEWICTINNISKQIYLSENPEETAARVNQSALEAVTPSPSFQQRHESLRPAAGQSRPPTARTSSSGSLGSESTNLAALSLDSLVAPDTPIQFDIISPVCEDQPGQAKAFGQGGRRTNPFGE
SGGSTKSETEDSILHQLFIVRFLGSMEVKSDDHPDVVYETMRQILAARAIHNIFRMTESHLLVTCDCLKLIDPQTQVTRLTFPLPCVVLYATHQENKRLFGFVLRTSSGRSESNLSSVCYI
FESNNEGEKICDSVGLAKQIALHAELDRRASEKQKEIERVKEKQQKELNKQKQIEKDLEEQSRLIAASSRPNQASSEGQFVVLSSSQSEESDLGEGGKKRESEA
BAR PH PTB
Conserved Domains are the Main Source of Functional Information
FYVE PX PH BAR
FYVE Kinase
Uptake of materials / Protein Modification
Sequence: any sequence (DNA, RNA, Protein)
Profile Databases (InterPro, Pfam)
Domain Searches (HMMer, InterProScan)
Coverage of the worm genome
[Bar chart: genes with domains in the worm genome (axis 0 to 14,000); bar: InterPro]
Domain Type     % of Genome
InterPro        69%
Limitations of Functional Annotation based on Conserved Domains
Weakly conserved members of domain families are not detected by sequence based methods
Can we expand Functional Predictions by the detection of Weakly Conserved Domains?
Sequence evolves faster than Structure
Moving into the Twilight Zone
[Plot: % similarity against sequence length, with regions labelled "sequence similarity search techniques" and "fold recognition"]
Default Settings give many False Negatives
Relaxed Settings give many False Positives
Exploiting High Conservation of Structures for Functional Predictions
Fold Recognition: Sequence to Structure Alignments
HMMerThread Overview
Sequence
Domain Database
Search for weakly conserved domains with relaxed settings
Structure Database
Does the sequence conform to the 3D structure of the domain?
Weakly Conserved Domain
Novel Functional Predictions
Does it work?
• Essential Tests:
– Weakly Conserved Domains
  • Can we really pick up weakly conserved domains?
– Coverage
  • How much of domain space can we test?
– Confidence
  • How can we give confidence to a hit?
Detection of a weakly conserved BAR domain in APPL proteins
APPL1/2
BAR PH PTB PDZ
Miaczynska, et al., 2004
1) HMMer Search
HMMer detects multiple candidate domains in the N-terminal region of APPL1, all with e-values above the standard threshold (0.0001)
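To make the thresholding concrete, here is a small sketch that filters domain hits by e-value. The hit list, its e-values, the placeholder domain names and the relaxed cutoff of 10.0 are invented for illustration and are not the APPL1 results; only the standard threshold of 0.0001 comes from the slide.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    domain: str
    start: int
    end: int
    evalue: float


STANDARD_THRESHOLD = 1e-4  # value quoted on the slide
RELAXED_THRESHOLD = 10.0   # assumption, for illustration only

# Hypothetical hits in the N-terminal region of a protein (not real data).
hits = [
    DomainHit("BAR", 5, 230, 0.8),
    DomainHit("DomainX", 20, 180, 2.5),
    DomainHit("DomainY", 60, 210, 7.1),
]


def passing(candidates: List[DomainHit], threshold: float) -> List[DomainHit]:
    """Keep hits whose e-value is at or below the threshold (lower is better)."""
    return [h for h in candidates if h.evalue <= threshold]


print(len(passing(hits, STANDARD_THRESHOLD)))
print([h.domain for h in passing(hits, RELAXED_THRESHOLD)])
```

In this toy case the standard threshold discards every candidate, while the relaxed one keeps all of them, so the e-value alone cannot say which overlapping candidate is the right one.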
Can we identify the correct domain?
2) HMMerThread
Coverage of the domain space
Only domains with an existing 3D structure can be considered for analysis with HMMerThread
[Diagram values: 12,430; 3,192 (34%); 9,316]
Increase in confidence by cross-species validation
Use close orthologues to validate weak hits
Mouse (Mus musculus), Chicken (Gallus gallus), Dog (Canis familiaris)
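The idea of validating a weak hit through orthologues can be sketched as a simple set check: a domain predicted in the query protein counts as validated if the same domain is also predicted in the orthologue from at least one closely related species. The data layout, function names and example predictions below are hypothetical.

```python
from typing import Dict, List, Set


def validated_in(domain: str, orthologue_domains: Dict[str, Set[str]]) -> List[str]:
    """Return the related species whose orthologue carries the same domain."""
    return sorted(sp for sp, doms in orthologue_domains.items() if domain in doms)


def is_validated(domain: str,
                 orthologue_domains: Dict[str, Set[str]],
                 min_species: int = 1) -> bool:
    """A hit is validated if it recurs in at least `min_species` other species."""
    return len(validated_in(domain, orthologue_domains)) >= min_species


# Hypothetical predictions for the orthologues of one protein (not real data).
orthologues = {
    "Mus musculus": {"BAR", "PH"},
    "Gallus gallus": {"PH"},
    "Canis familiaris": {"BAR", "PTB"},
}

print(validated_in("BAR", orthologues))
print(is_validated("PDZ", orthologues))
```

Raising `min_species` gives a stricter notion of validation at the cost of fewer validated hits.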
Consider a domain to be validated only if it is found in at least 1 other closely related organism
Validation
• M. musculus
– Rat (Rattus norvegicus)
– Human (Homo sapiens)
– Chicken (Gallus gallus)
• D. melanogaster
– Beetle (Tribolium castaneum)
– Honey Bee (Apis mellifera)
– Mosquito (Anopheles gambiae)
• D. rerio
– Fugu (Takifugu rubripes)
– Pufferfish (Tetraodon nigroviridis)
• C. elegans
– C. briggsae
• S. cerevisiae
– Kluyveromyces lactis
– Candida glabrata
– Eremothecium gossypii
Genome-Wide HMMerThread
Proteome (8 species: Human, Mouse, Zebrafish, Fruit fly, Worm, Fission yeast, Budding yeast, Slime mold)
HMMer Domain Search against Pfam
Select domains with lowest e-values for each region
Pre-Processing (2° Structure, TM)
Fold Recognition (OpenProspect against SCOP)
Validation
HMMerThread Database
HMMerThread Database Statistics
Genome          Total      Total         Domain Hits with   Validated in species
                Proteins   Domain Hits   e-value > 0.1      3        2        1
Human           33,466     23,151        12,038             1,180    2,670    5,284
Mouse           34,981     22,037        11,460             736      2,084    4,055
Zebrafish       29,720     23,251        11,422             -        971      2,050
Worm            23,518     13,709        7,664              -        -        2,058
Fly             19,388     13,368        7,430              302      543      1,282
S. cerevisiae   5,868      3,117         1,919              48       73       435
S. pombe        5,004      2,761         1,506              -        -        -
D. discoideum   13,501     8,176         4,891              -        -        -
TOTAL:          165,446    109,570       58,330             2,266    6,341    15,164
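As a quick arithmetic check on the table, the sketch below computes what fraction of the weak hits (e-value > 0.1) was validated per species. The numbers are copied from the table; reading the "1" column as "validated in at least one other species" is an assumption.

```python
# (weak hits with e-value > 0.1, hits in the "1" validation column), from the table
stats = {
    "Human": (12038, 5284),
    "Mouse": (11460, 4055),
    "Zebrafish": (11422, 2050),
    "Worm": (7664, 2058),
    "Fly": (7430, 1282),
    "S. cerevisiae": (1919, 435),
}

# Fraction of weak hits that were validated, per species.
validated_fraction = {sp: v / weak for sp, (weak, v) in stats.items()}

# Sum of the "1" column; should agree with the TOTAL row of the table.
total_validated = sum(v for _, v in stats.values())

for sp, frac in validated_fraction.items():
    print(f"{sp}: {frac:.1%}")
print(total_validated)
```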
To complete all species + validation: ~500,000 runs (~3,000,000 CPU hours)
HMMerThread finds weakly conserved domains
[Bar chart: counts (axis 0 to 200) for the ubiquitin, PH, PX, FYVE, BAR and Syntaxin domains; series: HMMerThread, InterPro]
HMMerThread improves sequence based coverage
Domain Coverage - Worm
[Bar chart: genes with domains (axis 0 to 16,000); bars: InterPro, HMMerThread; annotation: 437 Novel Functions for Unknowns]
Domain Type     In Library (Genes)   % of Genome
InterPro        13,800               69%
HMMerThread     3,194                16%
Data Integration
http://cluster-12.mpi-cbg.de/htdb/html/
Example from the database
http://cluster-12.mpi-cbg.de/htdb/html/
Summary
• HMMerThread can detect remotely conserved domains
• Cross-species validation of remotely conserved domains adds confidence to domain predictions
• We have identified 5,284 novel, validated domains in the Human Genome and 2,058 in the Worm Genome
• Weak domains predicted by HMMerThread increase the coverage of functional predictions for the worm genome by 16% of the genome
Essential Statistics!
Most CPUs used on deimos concurrently: 2,492
CPU hours used so far: 3,593,468
Personal Crash Success Rate (Entire System): 2
Personal e-mails from the Admins: 308
Acknowledgements
Bioinformatics Group: Bianca Habermann, Ian Henry, Michael Volkmer, Marino Zerial, Sabine Bernauer, Vineeth Surendranath
High Performance Computing / MPI-CBG Cluster: Robert Henschel, Matt Boes, Guido Juckeland, Jeff Omega, Matthias Müller