The HMMerThread Database: A Resource for Weakly Conserved Domains

Charles Bradshaw

Max-Planck Institute for Cell Biology and Genetics Dresden

Habermann / Zerial Group

Contents

• HMMerThread Database:
  – A Bit of Biology: The Central Dogma
  – Conserved Domains
  – HMMerThread Overview
  – Results
  – Database
• Summary

The Central Dogma

DNA → RNA → Protein

ATGCCGGGG → AUG CCG GGG → M-P-G
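The DNA to RNA to protein example above can be reproduced in a few lines. This is a minimal sketch, assuming the DNA string ATGCCGGGG (the nine bases that match the three codons AUG CCG GGG). The codon table is deliberately reduced to those three codons, and the function names are illustrative rather than taken from any library.

```python
# Reduced codon table: only the three codons used in the example.
CODON_TABLE = {"AUG": "M", "CCG": "P", "GGG": "G"}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """Translate mRNA codon by codon, ignoring a trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "-".join(CODON_TABLE[c] for c in codons)

rna = transcribe("ATGCCGGGG")
print(rna)             # AUGCCGGGG
print(translate(rna))  # M-P-G
```

A full translation would need the complete 64-codon genetic code and stop-codon handling, both omitted here.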

[Figure: full-length amino-acid sequence of an example protein, starting M-P-G]

[Figure labels: BAR, PH, PTB]

Conserved Domains are the Main Source of Functional Information

[Figure labels: FYVE, PX, PH, BAR]

• FYVE: uptake of materials
• Kinase: protein modification

Sequence (any sequence: DNA, RNA, Protein) + Profile Databases (e.g. InterPro) → Domain Searches (HMMer, InterProScan)

Coverage of the worm genome

[Bar chart: genes with domains, x-axis 0 to 14,000; bar: InterPro]

Domain Type    % of Genome
InterPro       69%

Limitations of Functional Annotation based on Conserved Domains

Weakly conserved members of domain families are not detected by sequence-based methods

Can we expand Functional Predictions by the detection of Weakly Conserved Domains?

Sequence evolves faster than Structure

Moving into the Twilight Zone

[Figure: % similarity against sequence length, with regions labelled "sequence similarity search techniques" and "fold recognition"]

• Default Settings give many False Negatives
• Relaxed Settings give many False Positives

Exploiting High Conservation of Structures for Functional Predictions

Fold Recognition: Sequence to Structure Alignments

HMMerThread Overview

Sequence

Domain Database

Search for weakly conserved domains with relaxed settings

Structure Database

Does the sequence conform to the 3D structure of the domain?

Weakly Conserved Domain

Novel Functional Predictions

Does it work?

• Essential Tests:
  – Weakly Conserved Domains
    • Can we really pick up weakly conserved domains?
  – Coverage
    • How much of domain space can we test?
  – Confidence
    • How can we give confidence to a hit?

Detection of a weakly conserved BAR domain in APPL proteins

APPL1/2

BAR PH PTB PDZ

Miaczynska et al., 2004

1) HMMer Search

HMMer detects multiple domains in the N-terminal region of APPL1, all with e-values above the standard threshold (0.0001), so none of them would be reported with default settings
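To make the problem concrete, the sketch below keeps the lowest e-value hit in each overlapping region and then reports which of those survivors fall above the standard threshold, i.e. the weak candidates that would go on to fold recognition. The hit coordinates and e-values are invented for illustration and are not APPL1 results. The `Hit` type, the relaxed cutoff of 10, and the greedy overlap rule are assumptions too, and a lowest e-value is not proof that a domain is the correct one.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    domain: str
    start: int      # first residue of the match (1-based)
    end: int        # last residue of the match (inclusive)
    evalue: float

STANDARD_THRESHOLD = 1e-4   # the standard threshold quoted above
RELAXED_CUTOFF = 10.0       # assumed relaxed setting, illustration only

def weak_hits(hits):
    """Hits missed by default settings but reported under relaxed settings."""
    return [h for h in hits if STANDARD_THRESHOLD < h.evalue <= RELAXED_CUTOFF]

def best_per_region(hits):
    """Greedy: take hits by ascending e-value, skip any overlapping a kept hit."""
    kept = []
    for h in sorted(hits, key=lambda h: h.evalue):
        if all(h.end < k.start or h.start > k.end for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h.start)

# Hypothetical hits, not real search output.
hits = [
    Hit("BAR", 5, 220, 0.02),
    Hit("DomainX", 40, 180, 1.5),
    Hit("PH", 280, 370, 1e-12),
    Hit("DomainY", 300, 360, 0.8),
]
candidates = weak_hits(best_per_region(hits))
print([h.domain for h in candidates])  # ['BAR']
```

In this toy input the strong PH hit masks the weaker overlapping DomainY, and BAR wins its region over DomainX while still sitting above the standard threshold.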

Can we identify the correct domain?

2) HMMerThread

Coverage of the domain space

Only domains with an existing 3D structure can be considered for analysis with HMMerThread

[Figure labels: 12,430; 3,192 (34%); 9,316]

Increase in confidence by cross-species validation

Use close orthologues to validate weak hits

Mouse (Mus musculus), Chicken (Gallus gallus), Dog (Canis familiaris)

Consider a domain to be Validated only if it is found in at least 1 other closely related organism

Validation

• M. musculus
  – Rat (Rattus norvegicus)
  – Human (Homo sapiens)
  – Chicken (Gallus gallus)
• D. melanogaster
  – Beetle (Tribolium castaneum)
  – Honey Bee (Apis mellifera)
  – Mosquito (Anopheles gambiae)
• D. rerio
  – Fugu (Takifugu rubripes)
  – Pufferfish (Tetraodon nigroviridis)
• C. elegans
  – C. briggsae
• S. cerevisiae
  – Kluyveromyces lactis
  – Candida glabrata
  – Eremothecium gossypii

Genome-Wide HMMerThread

Proteome

8 Species: Human, Mouse, Zebrafish, Fruit fly, Worm, Fission yeast, Budding yeast, Slime mold

HMMer Domain Search against Pfam

Select domains with lowest e-values for each region

Pre-Processing (2° Structure, TM)

Fold Recognition: OpenProspect against SCOP

Validation

HMMerThread Database

HMMerThread Database Statistics

Genome           Total       Total Domain   Domain Hits with   Validated in species
                 Proteins    Hits           e-value > 0.1      3        2        1
Human            33,466      23,151         12,038             1,180    2,670    5,284
Mouse            34,981      22,037         11,460             736      2,084    4,055
Zebrafish        29,720      23,251         11,422             -        971      2,050
Worm             23,518      13,709         7,664              -        -        2,058
Fly              19,388      13,368         7,430              302      543      1,282
S. cerevisiae    5,868       3,117          1,919              48       73       435
S. pombe         5,004       2,761          1,506              -        -        -
D. discoideum    13,501      8,176          4,891              -        -        -
TOTAL:           165,446     109,570        58,330             2,266    6,341    15,164
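The "validated in species" columns can be read as cumulative counts: a hit found in three related species also counts under two and one. That reading is an assumption, based on the counts growing from the 3 column to the 1 column. The sketch below tallies such columns from a per-hit record of supporting species. The hit identifiers and species sets are made up, and `validation_counts` is an illustrative name.

```python
def validation_counts(support, max_species=3):
    """Map k -> number of hits found in at least k closely related species."""
    return {
        k: sum(1 for species in support.values() if len(species) >= k)
        for k in range(max_species, 0, -1)
    }

# Hypothetical hits in mouse proteins, with the related species in which
# the same domain was also detected.
support = {
    "hit_a": {"rat", "human", "chicken"},
    "hit_b": {"rat", "human"},
    "hit_c": {"human"},
    "hit_d": set(),          # not validated in any related species
}
print(validation_counts(support))  # {3: 1, 2: 2, 1: 3}
```

Under the rule stated earlier (found in at least 1 other closely related organism), the count for k = 1 is the number of validated hits.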

To complete all species + validation: ~500,000 runs (~3,000,000 CPU hours)

HMMerThread finds weakly conserved domains

[Bar chart: HMMerThread and InterPro, y-axis 0 to 200, for the domains ubiquitin, PH, PX, FYVE, BAR, Syntaxin]

HMMerThread improves sequence-based coverage

Domain Coverage - Worm

[Bar chart: genes with domains, x-axis 0 to 16,000; bars labelled InterPro and HMMerThread; annotation: 437 Novel Functions for Unknowns]

Domain Type    In Library (Genes)    % of Genome
InterPro       13,800                69%
HMMerThread    3,194                 16%

Data Integration

http://cluster-12.mpi-cbg.de/htdb/html/

Example from the database

Summary

• HMMerThread can detect remotely conserved domains

• Cross-species validation of remotely conserved domains adds confidence to domain predictions

• We have identified 5,284 novel, validated domains in the Human Genome and 2,058 in the Worm Genome

• Weak domains predicted by HMMerThread increase the coverage of functional predictions for the worm genome by 16%

Essential Statistics!

Most CPUs used on deimos concurrently: 2,492

CPU hours used so far: 3,593,468

Personal Crash Success Rate (Entire System): 2

Personal e-mails from the Admins: 308

Acknowledgements

Bioinformatics Group: Bianca Habermann, Ian Henry, Michael Volkmer, Marino Zerial, Sabine Bernauer, Vineeth Surendranath

High Performance Computing / MPI-CBG Cluster: Robert Henschel, Matt Boes, Guido Juckeland, Jeff Omega, Matthias Müller