Fast, Sensitive Homology Detection Using HMMER
Rob Finn
Sequence Families Team Lead
@robdfinn, [email protected]
14th Nov 2018

Making sense of sequence data

Sequence Data                     Information
Model Organisms                   Experimental Literature
Reference Proteomes               Sporadic Literature
Complete Proteomes                Similarity
Other Sequences & Metagenomics    Uncharacterized

MGnify Protein Database

[Figure: Growth of MGnify compared with UniProt. Number of sequences by year, 2002 to 2018; series: UniProt, MGnify.]
[Figure: Length distribution. Frequency (millions) against length of sequence (0 to 2000); series: full length, N-term truncated, C-term truncated, partial.]

• >1 billion sequences, mean length of 205
• <1% match UniProtKB, but 58% match Pfam

Sequence And Structure Alignments

406 STRUCTURE COMPARISON AND ALIGNMENT

Figure 16.2. Structure alignment for c-phycocyanin (1CPC:A) (black) and colicin A (1COL:A) (gray) as computed by SALIGN. The alignment extended over 86 residues with a 0.97 Å RMSD. The sequence identity of the superposed residues with respect to the shorter of the two structures was 11.9%.

... undergone convergent evolution to form a stable 3-on-3 α-helical sandwich fold. Interestingly, it was subsequently discovered that phycocyanins can aggregate forming clusters that then adhere to the membrane forming the so-called phycobilisomes. Such a functional relationship may indeed point to convergent evolution from a distant common ancestor.

The second example, which is extracted from the work of one of our groups (Tsigelny et al., 2000), illustrated how the combination and integration of different sources of information, including structural alignments, could help to functionally characterize a protein. In our work, two new EF-hand motifs were identified in acetylcholinesterase (AChE) and related proteins by combining the results from a hidden Markov model sequence search, Prosite pattern extraction, and protein structure alignments by CE. It was also found that the α–β hydrolase fold family, including acetylcholinesterases, contains putative Ca²⁺ binding sites, indicative of an EF-hand motif, and which in some family members may be critical for heterologous cell associations. This putative finding represented the second characterization of an EF-hand motif within an extracellular protein, which previously had only been found in osteonectins. Thus, structure alignment had contributed to our understanding of an important family of proteins.

Finally, the third example, also from a previous work of one of our groups (McMahon et al., 2005), combined information from structural alignments deposited in the DBAli database and experiments to analyze the sequence and fold diversity of a C-type lectin domain. We demonstrated that the C-type lectin fold adopted by a major tropism determinant sequence, a retroelement-encoded receptor binding protein, provides a highly static structural scaffold in support of a diverse array of sequences. Immunoglobulins are known to fulfill the same role of a scaffold supporting a large variety of sequences necessary for an antigenic response. C-type lectins were shown to represent a different evolutionary solution taken by retroelements to balance diversity against stability.

MULTIPLE STRUCTURE ALIGNMENT

Our discussions thus far have involved only pair-wise structure comparison and alignment, or at best, alignment of multiple structures to a single representative in a pair-wise fashion (i.e., progressive pair-wise structure alignment). Most of the available methods for multiple structure alignment start by computing all pair-wise alignments between a set of structures but then use them to generate the optimal consensus alignment between all the structures.

Fig. adapted from Chap. 16, Structural Bioinformatics, 2nd Ed., Marti-Renom et al.

Profile hidden Markov models

• Statistical inference, accounting for uncertainty
• Use more information

Compare the probability of the target t under two models:

$\frac{P(t \mid \text{model of homology to } q)}{P(t \mid \text{model of nonhomology})} = \frac{P(t \mid H)}{P(t \mid R)}$

$S = \log \frac{P(t \mid H)}{P(t \mid R)}$

Using the joint probability of t and the alignment $\pi_o$ instead:

$S = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)}$

Optimal alignment scores are only an approximation, and the approximation breaks down on remote homologs.
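The log-odds score S = log P(t | H) / P(t | R) can be made concrete with a toy calculation. The sketch below is illustrative only: the four-letter alphabet, the three-column ungapped homology model, and every probability in it are assumptions invented for the example, not values from the talk or from HMMER. Base 2 is used so that the score is in bits; the slides leave the base of the logarithm unspecified.

```python
import math

# Hypothetical null model R: every residue drawn from one background distribution.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical homology model H: one emission distribution per ungapped column.
HOMOLOGY_COLUMNS = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
]


def log_odds_score(target: str) -> float:
    """Return S = log2(P(t | H) / P(t | R)) for an ungapped target sequence."""
    if len(target) != len(HOMOLOGY_COLUMNS):
        raise ValueError("target must have exactly one residue per model column")
    score = 0.0
    for residue, column in zip(target, HOMOLOGY_COLUMNS):
        # Each column contributes the log ratio of its two emission probabilities.
        score += math.log2(column[residue] / BACKGROUND[residue])
    return score


if __name__ == "__main__":
    for target in ("ACG", "TTT"):
        print(target, round(log_odds_score(target), 3))
```

A positive score means the target is better explained by the homology model than by the null model; a negative score means the opposite.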
Example of a single optimal alignment $\pi_o$:

...GHRL...
...| |...
...GI-M...

According to inference theory, the correct score is a log-odds ratio summed over all alignments.

Optimal alignment score (HMMs: "Viterbi" score, V), used by BLAST (almost):

$V = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)} = \log \frac{\max_{\pi} P(t, \pi \mid H)}{P(t \mid R)}$

Score summed over all alignments (HMMs: "Forward" score, F), used by HMMER:

$F = \log \frac{P(t \mid H)}{P(t \mid R)} = \log \frac{\sum_{\pi} P(t, \pi \mid H)}{P(t \mid R)}$

Depends on:
- a probability model of alignment, not just scores
- algorithms fast enough to use in practice

Profile Hidden Markov Models - Encapsulate diversity

Input multiple alignment, with consensus columns assigned, defining inserts and deletes:

seq1 ACG-LD
seq2 SCG--E
Seq3 NCGgFD
Seq4 TCG-WQ
     123-45

[Figure: Plan7 core model. Begin state B, match states M1 to M5, end state E; delete states D1 to D5; insert states I0 to I5.]
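The Viterbi score V and the Forward score F differ only in whether the best state path is taken or all state paths are summed. The sketch below shows this on a small generic HMM, not a profile HMM and not HMMER's implementation: the two states, the four-letter alphabet, and all probabilities are hypothetical values chosen for illustration. It works in probability space for readability, which is only safe for short sequences; a practical implementation would need to guard against underflow.

```python
import math

# Hypothetical two-state HMM standing in for the homology model H.
STATES = ("conserved", "variable")
START = {"conserved": 0.5, "variable": 0.5}
TRANS = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.3, "variable": 0.7},
}
EMIT = {
    "conserved": {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    "variable": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
# Hypothetical null model R: independent residues from a background distribution.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def path_probability(target, combine):
    """Dynamic programming over state paths.

    combine=max gives the probability of the single best path (Viterbi);
    combine=sum gives the total probability over all paths (Forward).
    """
    if not target:
        raise ValueError("target must contain at least one residue")
    current = {s: START[s] * EMIT[s][target[0]] for s in STATES}
    for residue in target[1:]:
        current = {
            s: EMIT[s][residue] * combine(current[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return combine(current.values())


def null_probability(target):
    """P(t | R) under the background model."""
    return math.prod(BACKGROUND[r] for r in target)


def viterbi_score(target):
    """V = log2(max over paths of P(t, path | H) / P(t | R))."""
    return math.log2(path_probability(target, max) / null_probability(target))


def forward_score(target):
    """F = log2(sum over paths of P(t, path | H) / P(t | R))."""
    return math.log2(path_probability(target, sum) / null_probability(target))


if __name__ == "__main__":
    t = "AAGA"
    print("V =", round(viterbi_score(t), 3), "F =", round(forward_score(t), 3))
```

Because the sum over paths includes the best path, F is never smaller than V. The gap between them grows when probability is spread over many plausible alignments, which is the situation the slides describe for remote homologs.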
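The step from an input alignment to consensus columns, inserts, and deletes can be sketched in a few lines. The rule used here is an assumption made for illustration: a column counts as a consensus column when at least half of the sequences have a residue in it. The function names and the M/D/I labels are hypothetical and are not HMMER's API.

```python
from typing import Dict, List, Optional

# The four-sequence alignment shown on the slide.
ALIGNMENT: Dict[str, str] = {
    "seq1": "ACG-LD",
    "seq2": "SCG--E",
    "Seq3": "NCGgFD",
    "Seq4": "TCG-WQ",
}


def assign_consensus_columns(
    alignment: Dict[str, str], min_residue_fraction: float = 0.5
) -> List[Optional[int]]:
    """Number consensus columns from 1; insert columns get None.

    The threshold rule is an assumption of this sketch.
    """
    rows = list(alignment.values())
    if not rows:
        raise ValueError("alignment must contain at least one sequence")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    columns: List[Optional[int]] = []
    next_index = 1
    for position in range(width):
        occupied = sum(1 for row in rows if row[position] != "-")
        if occupied / len(rows) >= min_residue_fraction:
            columns.append(next_index)
            next_index += 1
        else:
            columns.append(None)
    return columns


def label_row(row: str, columns: List[Optional[int]]) -> str:
    """Label each cell: M (match), D (delete), I (insert), '.' (gap in an insert column)."""
    labels = []
    for symbol, column in zip(row, columns):
        if column is not None:
            labels.append("D" if symbol == "-" else "M")
        else:
            labels.append("." if symbol == "-" else "I")
    return "".join(labels)


if __name__ == "__main__":
    columns = assign_consensus_columns(ALIGNMENT)
    # Ruler line in the slide's style; readable only for fewer than ten columns.
    print("".join("-" if c is None else str(c) for c in columns))
    for name, row in ALIGNMENT.items():
        print(name, row, label_row(row, columns))
```

With this rule the fourth column, where only Seq3 has a residue, becomes an insert column, and the gap in seq2 at consensus column 4 becomes a delete, which reproduces the "123-45" ruler under the slide's alignment.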