
Fast, sensitive detection using HMMER

Rob Finn, Sequence Families Team Lead
@robdfinn, [email protected]
14th Nov 2018

Making sense of sequence data

Sequence Data               Information

Model Organisms             Experimental Literature
Reference Proteomes         Sporadic Literature
Complete Proteomes          Similarity
Other Sequences & MGnify    Uncharacterized

MGnify Length distribution

[Figure: Growth of MGnify Compared with UniProt. Number of protein sequences (0 to 1,200,000,000) by year (2002 to 2018); series: UniProt, MGnify.]

[Figure: MGnify length distribution. Frequency (millions) against length of sequence (0 to 2000); series: Partial, C-term truncated, N-term truncated, Full length.]

• >1 billion sequences, mean length of 205

• <1% match UniProtKB, but 58% match Pfam

Sequence And Structure Alignments

406 STRUCTURE COMPARISON AND ALIGNMENT


Figure 16.2. Structure alignment for c-phycocyanin (1CPC:A) (black) and colicin A (1COL:A) (gray) as computed by SALIGN. The alignment extended over 86 residues with a 0.97 Å RMSD. The sequence identity of the superposed residues with respect to the shorter of the two structures was 11.9%.

Fig. adapted from Chap. 16, Structural Bioinformatics, 2nd Ed., Marti-Renom et al.

undergone convergent evolution to form a stable 3-on-3 α-helical sandwich fold. Interestingly, it was subsequently discovered that phycocyanins can aggregate forming clusters that then adhere to the membrane forming the so-called phycobilisomes. Such a functional relationship may indeed point to convergent evolution from a distant common ancestor.

The second example, which is extracted from the work of one of our groups (Tsigelny et al., 2000), illustrated how the combination and integration of different sources of information, including structural alignments, could help to functionally characterize a protein. In our work, two new EF-hand motifs were identified in acetylcholinesterase (AChE) and related proteins by combining the results from a hidden Markov model sequence search, Prosite pattern extraction, and protein structure alignments by CE. It was also found that the α–β hydrolase fold family, including acetylcholinesterases, contains putative Ca2+ binding sites, indicative of an EF-hand motif, and which in some family members may be critical for heterologous cell associations. This putative finding represented the second characterization of an EF-hand motif within an extracellular protein, which previously had only been found in osteonectins. Thus, structure alignment had contributed to our understanding of an important family of proteins.

Finally, the third example, also from a previous work of one of our groups (McMahon et al., 2005), combined information from structural alignments deposited in the DBAli database and experiments to analyze the sequence and fold diversity of a C-type lectin domain. We demonstrated that the C-type lectin fold adopted by a major tropism determinant sequence, a retroelement-encoded receptor binding protein, provides a highly static structural scaffold in support of a diverse array of sequences. Immunoglobulins are known to fulfill the same role of a scaffold supporting a large variety of sequences necessary for an antigenic response. C-type lectins were shown to represent a different evolutionary solution taken by retroelements to balance diversity against stability.

MULTIPLE STRUCTURE ALIGNMENT

Our discussions thus far have involved only pair-wise structure comparison and alignment, or at best, alignment of multiple structures to a single representative in a pair-wise fashion (i.e., progressive pair-wise structure alignment). Most of the available methods for multiple structure alignment start by computing all pair-wise alignments between a set of structures but then use them to generate the optimal consensus alignment between all the structures.

Profile hidden Markov models

• Statistical inference, accounting for uncertainty

• Use more information

Compare the probability of the target t under a model of homology to the query q with its probability under a model of nonhomology:

\[
\frac{P(t \mid \text{model of homology to } q)}{P(t \mid \text{model of nonhomology})}
= \frac{P(t \mid H)}{P(t \mid R)},
\qquad
S = \log \frac{P(t \mid H)}{P(t \mid R)}
\]

Using the joint probability of t and the alignment \(\pi_o\):

\[
S = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)}
\]

Optimal alignment scores are only an approximation, and the approximation breaks down on remote homologs.

[Example alignment: ...GHRL... aligned to ...GI-M...]

According to inference theory, the correct score is a log-odds ratio summed over all alignments.

Depends on:
- a probability model of alignment, not just scores
- fast enough to use in practice

Optimal alignment score (HMMs: "Viterbi" score, V; BLAST, almost):

\[
V = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)} = \log \frac{\max_{\pi} P(t, \pi \mid H)}{P(t \mid R)}
\]

Score summed over all alignments (HMMs: "Forward" score, F):

\[
F = \log \frac{P(t \mid H)}{P(t \mid R)} = \log \frac{\sum_{\pi} P(t, \pi \mid H)}{P(t \mid R)}
\]
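The difference between the Viterbi score V and the Forward score F can be seen on a small example. The sketch below is not HMMER's implementation: it uses a hypothetical two-state toy model (the state names, probabilities, four-letter alphabet and target sequence are all invented for illustration) and a uniform background as the null model R. The same dynamic programming recursion is run twice, once keeping only the best path (max) and once summing over all paths.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-probabilities."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_prob(seq, start, trans, emit, combine):
    """Log-probability of seq under the model H.

    combine=max keeps only the single best hidden path (Viterbi);
    combine=logsumexp sums over every hidden path (Forward).
    """
    states = list(start)
    column = {s: math.log(start[s] * emit[s][seq[0]]) for s in states}
    for symbol in seq[1:]:
        column = {
            s: math.log(emit[s][symbol])
            + combine([column[r] + math.log(trans[r][s]) for r in states])
            for s in states
        }
    return combine(list(column.values()))


def log_odds(seq, start, trans, emit, background, combine):
    """Log-odds score: log P(seq | H) minus log P(seq | R), in nats."""
    null = sum(math.log(background[x]) for x in seq)
    return log_prob(seq, start, trans, emit, combine) - null


# Hypothetical toy model H: a "match" state that prefers G and a flat "insert" state.
START = {"match": 0.9, "insert": 0.1}
TRANS = {
    "match": {"match": 0.8, "insert": 0.2},
    "insert": {"match": 0.5, "insert": 0.5},
}
EMIT = {
    "match": {"G": 0.7, "H": 0.1, "R": 0.1, "L": 0.1},
    "insert": {"G": 0.25, "H": 0.25, "R": 0.25, "L": 0.25},
}
# Null model R: every residue drawn independently from a uniform background.
BACKGROUND = {"G": 0.25, "H": 0.25, "R": 0.25, "L": 0.25}

target = "GHGL"
viterbi_score = log_odds(target, START, TRANS, EMIT, BACKGROUND, max)
forward_score = log_odds(target, START, TRANS, EMIT, BACKGROUND, logsumexp)
print(f"V = {viterbi_score:.3f} nats, F = {forward_score:.3f} nats")
```

Because a sum over all paths can never be smaller than its largest term, F is always at least V. The gap grows when many alternative alignments carry comparable probability, which is the situation described above for remote homologs.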

HMMER scores with F, the "Forward" score.

Profile hidden Markov models

• Statistical inference, accounting for uncertainty

• Use more information

Profile Hidden Markov Models - Encapsulate diversity

Input multiple alignment, with consensus columns assigned, defining inserts and deletes:

seq1  ACG-LD
seq2  SCG--E
Seq3  NCGgFD
Seq4  TCG-WQ
      123-45

[Figure: Plan7 core model built from this alignment. Begin state B, match states M1 to M5, end state E; delete states D1 to D5; insert states I0 to I5. Residues shown on the match states: A, S, N, T; C; G; L, F, W, Y; D, E, Q.]
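The step from alignment to model can be sketched in a few lines. The code below takes the four-sequence alignment from the slide, marks a column as consensus when at least half of the sequences have a residue there (an assumed rule chosen for illustration, not necessarily the one HMMER applies), labels every position as match, delete or insert, and estimates match-state emission frequencies from raw counts. Real profile HMMs also mix in prior information instead of using raw frequencies; that part is omitted here.

```python
from collections import Counter

# Alignment as shown on the slide; lowercase marks an inserted residue.
ALIGNMENT = {
    "seq1": "ACG-LD",
    "seq2": "SCG--E",
    "Seq3": "NCGgFD",
    "Seq4": "TCG-WQ",
}


def consensus_mask(alignment, min_residue_fraction=0.5):
    """True for columns treated as consensus (match) columns.

    The threshold is an assumption made for this sketch.
    """
    rows = list(alignment.values())
    mask = []
    for column in zip(*rows):
        residues = sum(1 for char in column if char != "-")
        mask.append(residues / len(rows) >= min_residue_fraction)
    return mask


def reference_line(mask):
    """Number the consensus columns and mark insert columns with '-'."""
    labels, index = [], 0
    for is_match in mask:
        if is_match:
            index += 1
            labels.append(str(index))
        else:
            labels.append("-")
    return "".join(labels)


def state_paths(alignment, mask):
    """Per sequence: M (match), D (delete), I (insert), '.' (gap in an insert column)."""
    paths = {}
    for name, row in alignment.items():
        path = []
        for char, is_match in zip(row, mask):
            if is_match:
                path.append("D" if char == "-" else "M")
            else:
                path.append("." if char == "-" else "I")
        paths[name] = "".join(path)
    return paths


def match_emissions(alignment, mask):
    """Relative residue frequencies for each match state (no pseudocounts)."""
    emissions = []
    for column, is_match in zip(zip(*alignment.values()), mask):
        if not is_match:
            continue
        counts = Counter(char.upper() for char in column if char != "-")
        total = sum(counts.values())
        emissions.append({res: n / total for res, n in counts.items()})
    return emissions


mask = consensus_mask(ALIGNMENT)
paths = state_paths(ALIGNMENT, mask)
emissions = match_emissions(ALIGNMENT, mask)
print(reference_line(mask))
for name, path in paths.items():
    print(name, path)
```

With this rule the fourth column, where only Seq3 has a residue, becomes an insert column, and seq2 uses a delete state at the fourth consensus position, matching the "123-45" line on the slide.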

anecdotal search example: globin superfamily

E-value (statistical significance)

                                      PSI-BLAST   HMMER
alpha hemoglobins        HBA_HUMAN    4e-46       9e-62
                         HBA_MOUSE    3e-42       4e-55
beta hemoglobins         HBB_HUMAN    2e-57       4e-64
                         HBB1_MOUSE   9e-50       2e-57
myoglobins               MYG_HUMAN    1e-45       2e-58
                         MYG_MOUSE    2e-41       6e-54
neuroglobins             NGB_HUMAN    -           1e-7
                         NGB_MOUSE    -           2e-7
plant leghaemoglobins    LGB1_PEA     1.1         5e-5
                         LGB2_PEA     0.45        5e-6
bacterial nitric oxide   HMP_VIBCH    -           0.004
                         HMP_ECOLI    -           -

Approximate divergence times: ~300 Mya, ~550 Mya, ~600-700 Mya?, ~1000 Mya?, ~2500 Mya?

[Figure: Aplysia myoglobin (PDB 1mba)]

query: alignment of three vertebrate hemoglobins and one myoglobin

target db: Uniprot 7.0 (207K seqs) (contains about 1060 known globins)

at E <= 0.01:
PSI-BLAST sees: 915 globins (9 sec)
HMMER3 sees: 1002 globins (8 sec)

Projecting profile HMMs back onto structures

generated using http://www.skylign.org/3DPatch/

Jakubec D et al., 2018

Different HMMER search methods

• phmmer: single protein sequence against protein sequence database.

• hmmscan: single protein sequence against profile HMM library (Pfam, CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs).

• hmmsearch: either multiple sequence alignment or profile HMM against protein sequence database.

• jackhmmer: iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database.

Find out more?

The HMMER Web Server for Protein Sequence Similarity Search (UNIT 3.15)

Ananth Prakash,1 Matt Jeffryes,1 Alex Bateman,1 and Robert D. Finn1
1European Molecular Biology Laboratory, The European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom

Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc.

Keywords: bioinformatics, homology, profile hidden Markov model, protein sequence analysis

How to cite this article: Prakash, A., Jeffryes, M., Bateman, A., & Finn, R. D. (2017). The HMMER web server for protein sequence similarity search. Current Protocols in Bioinformatics, 60, 3.15.1–3.15.23. doi: 10.1002/cpbi.40

INTRODUCTION

The HMMER Web server (http://www.ebi.ac.uk/Tools/hmmer/) is an open-access protein sequence similarity search tool that hosts a suite of HMMER algorithms to identify evolutionarily related proteins and/or domains by employing profile hidden Markov models (HMMs; APPENDIX 3A, Schuster-Böckler & Bateman, 2007) for fast and efficient detection of close and remote homologs. The HMMER Web server provides four search interfaces to the corresponding algorithms in the HMMER suite (http://hmmer.org): phmmer, hmmscan, hmmsearch, and jackhmmer. The functionality of these algorithms is outlined in Table 3.15.1. The HMMER Web server can work with various input formats and user-defined parameters to provide results that are presented to help infer protein sequence conservation, function, and evolution. This article provides detailed protocols for using the Web versions of the PHMMER, HMMSCAN, and JACKHMMER algorithms, and ways to navigate and interpret the output.

Basic Protocol 1 and Alternate Protocol 1 describe in detail how to use the basic and advanced search features, respectively, in PHMMER, and interpret the results using a protein sequence as the starting point. The logical organization and interpretation of the output described in Basic Protocol 1 is common to all other protocols and is therefore described in detail; user is referred back to this section in subsequent protocols.

Finding Similarities and Inferring Homologies

Current Protocols in Bioinformatics 3.15.1–3.15.23, December 2017. Published online December 2017 in Wiley Online Library (wileyonlinelibrary.com). doi: 10.1002/cpbi.40. Copyright © 2017 John Wiley & Sons, Inc. Supplement 60.

Acknowledgements

EMBL-EBI, the Sequence Families team:
Matthias Blum, Hsin-Yu Chang, Sara El-Gebali, Matthew Fraser, Jaina Mistry, Alex Mitchell, Gift Nuka, Typhaine Paysan-Lafosse, Sebastien Pesseat, Simon Potter, Matloob Qureshi, Lorna Richardson, Gustavo Salazar-Orejuela, Amaia Sangrador

Collaborators:
Harvard University
University of Montana: Travis Wheeler

InterPro Demo
http://www.ebi.ac.uk/Tools/hmmer

hmmscan: single protein sequence against profile HMM library

hmmscan - Search results

ABC transporter domain

CFTR_RAT (P34158): ABC transporter, a chloride ion channel controlled by phosphorylation.

ABC transporter trans-membrane region

jackhmmer: iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database

phmmer: single protein sequence against protein sequence database

phmmer search: TRPA1_HUMAN (O75762)
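The web searches in the demo have command-line counterparts in the HMMER suite (http://hmmer.org). As a rough sketch, the helper below only assembles the command lines and does not run anything. The file names are hypothetical placeholders, and the helper itself is an illustration, not part of HMMER. The positional argument order and the `-E` (reporting E-value threshold) and `-N` (maximum jackhmmer iterations) options follow the HMMER usage lines; check the documentation of the installed version before relying on them.

```python
import shlex

# Positional arguments expected by each program, in order.
PROGRAMS = {
    "phmmer": ("query sequence file", "target sequence database"),
    "hmmscan": ("profile HMM library", "query sequence file"),
    "hmmsearch": ("profile HMM file", "target sequence database"),
    "jackhmmer": ("query sequence file", "target sequence database"),
}


def build_command(program, first, second, evalue=0.01, iterations=None):
    """Return the argument list for one HMMER search (nothing is executed)."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown HMMER search program: {program}")
    command = [program, "-E", str(evalue)]
    if iterations is not None:
        if program != "jackhmmer":
            raise ValueError("only jackhmmer searches are iterative")
        command += ["-N", str(iterations)]
    return command + [first, second]


# Hypothetical file names standing in for the demo queries and databases.
examples = [
    build_command("phmmer", "TRPA1_HUMAN.fasta", "uniprot_sprot.fasta"),
    # The library given to hmmscan must first be prepared with hmmpress.
    build_command("hmmscan", "Pfam-A.hmm", "CFTR_RAT.fasta"),
    build_command("jackhmmer", "query.fasta", "uniprot_sprot.fasta", iterations=3),
]
for command in examples:
    print(shlex.join(command))
```

Note that hmmscan takes the profile library first and the query second, the reverse of phmmer and jackhmmer, which is an easy mistake to make when moving from the web forms to the command line.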