Temple University Structural II

Practical use of HMMs and Sequence Databases

Allan Haldane, Ron Levy Group

Outline

Part 1: HMMs
● Practical HMM usage: HMMER, E/p-values and limitations
● HMMs for family alignment and for secondary structure prediction

Part 2: Protein Sequence Variation
● Tour of the protein sequence space: Pfam, InterPro, SCOP
● Protein Evolution: Observations

References:
● Biological Sequence Analysis
● HMMER User Guide

● Fundamentals of Molecular Evolution (Graur/Li)
● The Neutral Theory of Molecular Evolution (Kimura)

Review: HMMs (Hidden Markov Models)

● The system moves through a series of “hidden” states with specified transition probabilities
● The system “emits” observables, with different emission probabilities for each state

Examples: the “occasionally dishonest casino”; the weather/behavior model (Wikipedia), where the weather is the hidden state and the behavior is the observable.

Important Algorithms:
● Forward algorithm: compute the probability of a sequence of observables, summing over all possible paths through the hidden states
● Viterbi algorithm: compute the most likely path through the hidden states, given a sequence of observables
● Baum–Welch algorithm: given a set of observed sequences and a topology, estimate emission and transition probabilities (build an HMM)

There are many possible choices of “topology” for HMMs, depending on the application. We will focus on two types:

1. Secondary-Structure HMMs (SAM, OSS-HMM, TMHMM)

[State diagram: H (helix), C (coil), S (strand)]

2. Profile HMMs (HMMER) (for multiple sequence alignments)

Simple Transmembrane Helix HMM

Simple Two-State Model: we want to model which parts of a protein sequence are transmembrane helices.

X = Everything else M = TM helix

[Tables: emission probabilities and transition probabilities for the two states]

Hidden Markov Models for Prediction of Protein Features. Christopher Bystroff and Anders Krogh. Methods in Molecular Biology, vol. 413

Can improve with more complex topologies
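To make this concrete, here is a minimal sketch of the two-state model in Python, using the Viterbi algorithm to label each residue as M (TM helix) or X (everything else). All probabilities below are invented placeholders, not the values from the figure:

```python
import numpy as np

# Hypothetical two-state TM-helix HMM: states X (non-membrane) and M (TM helix).
# All probabilities are invented placeholders, not fitted values.
states = ["X", "M"]
start = np.log([0.9, 0.1])                      # initial state probabilities
trans = np.log([[0.95, 0.05],                   # X -> X, X -> M
                [0.10, 0.90]])                  # M -> X, M -> M

# Toy emission model: M favors hydrophobic residues, X is closer to uniform.
hydrophobic = set("AILMFVWC")
def log_emit(state, aa):
    if state == "M":
        return np.log(0.10 if aa in hydrophobic else 0.015)
    return np.log(0.05)                          # X: roughly uniform over 20 aa

def viterbi(seq):
    """Most likely hidden-state path for a protein sequence."""
    n, k = len(seq), len(states)
    score = np.full((n, k), -np.inf)             # best log-prob ending in state j at position i
    back = np.zeros((n, k), dtype=int)           # backpointers
    for j in range(k):
        score[0, j] = start[j] + log_emit(states[j], seq[0])
    for i in range(1, n):
        for j in range(k):
            prev = score[i - 1] + trans[:, j]
            back[i, j] = np.argmax(prev)
            score[i, j] = prev[back[i, j]] + log_emit(states[j], seq[i])
    # Trace back the best path
    path = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return "".join(states[j] for j in reversed(path))

print(viterbi("MKTAYIAKQRLLLLLVVVAAAILMFWGGSDE"))
```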

TMHMM

“Real” high-performance HMMs can be much more complicated.

“Correctly predicts 97-98 % of transmembrane helices.”

Used to estimate that 20-30 % of all genes are TM proteins

Posterior probabilities for a single sequence:

Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes. Krogh et al. JMB 2001

HMMs for Secondary Structure Prediction

● Simplest idea: 3-state topology: Helix (H), Coil (C), Strand (S)
● Real world: a 36-hidden-state HMM (OSS-HMM)
● Topology not chosen by hand: an algorithm tries topologies to find the one with the best match to DSSP predictions

Similar Performance to PSIPRED

Insights into secondary structure patterns?

[Table: % correct predictions]

Analysis of an optimal hidden Markov model for secondary structure prediction. Martin et al., BMC Structural Biology 2006

Profile HMMs (HMMER)

HMMs used to model Multiple Sequence Alignments and to align sequences to them

● Topology of a profile HMM
● Building the HMM
● Using the HMM to align sequences
● Using the HMM to search for sequences in a database
● Tool: HMMER
● Practical challenges to building a profile HMM given an MSA

Profile HMMs (HMMER)

Review of Topology:

Delete states (gap)

Insert states (emit aa, with unbiased probabilities)

Match states (emit aa, with specific probabilities)
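This topology can be made concrete with a small data structure; the sketch below is only illustrative (field names, default numbers, and transition keys are invented, not HMMER's actual representation):

```python
from dataclasses import dataclass, field
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

@dataclass
class ProfileNode:
    """One match/insert/delete column of a profile HMM (illustrative sketch only)."""
    match_emit: Dict[str, float]                  # position-specific residue probabilities
    insert_emit: Dict[str, float] = field(
        default_factory=lambda: {a: 1.0 / 20 for a in AMINO_ACIDS})  # unbiased background
    # Transition probabilities out of the M/I/D states of this node
    # ("MM" = match->match, "MI" = match->insert, "MD" = match->delete, ...)
    trans: Dict[str, float] = field(
        default_factory=lambda: {"MM": 0.90, "MI": 0.05, "MD": 0.05,
                                 "IM": 0.80, "II": 0.20,
                                 "DM": 0.80, "DD": 0.20})

# A profile HMM is then essentially a list of nodes, one per match column of the MSA.
```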

Profile HMMs (HMMER)

Example of a profile HMM with all values illustrated:

Weight of line = transition probability

Unaligned (insert) columns

What you can do with profile HMMs:

Basic (given a model):
● Align sequences to a known model:
  – Find the most probable alignment of an unaligned sequence (Viterbi)
● Score sequences:
  – Compute the probability that a sequence was generated by the HMM
● Search for matching sequences:
  – Score all sequences in a database, pick only the high-scoring ones
● Compute site posterior probabilities:
  – Reflects the degree of confidence in each individual aligned residue

Higher-level (no initial model):
● Build a model given an aligned MSA
  – Baum-Welch algorithm (iterates the forward/backward algorithms)
● Build a model given unaligned sequences
  – Iteratively run Baum-Welch, and re-align the sequences with the model
● Iterative model building/search
  – Iteratively run Baum-Welch, and collect new sequences from a database using the model

Scoring Sequences

Given an observed sequence, what is the probability that it was generated by the model?

We have two types of probabilities:
● We can calculate the probability of the sequence along the most likely path through the hidden states (Viterbi algorithm)
[Diagram: observed sequence, model, and Viterbi path]

● Or, we can calculate the probability of the sequence summed over all paths (forward algorithm)

In practice we use the second: “To maximize power to discriminate true homologs from nonhomologs in a database search, statistical inference theory says you ought to be scoring sequences by integrating over alignment uncertainty, not just scoring the single best alignment”
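A minimal sketch of the forward algorithm for a generic discrete HMM (the toy parameters are made up; HMMER's version works on the profile match/insert/delete topology above):

```python
import numpy as np
from scipy.special import logsumexp

def forward_logprob(obs, log_start, log_trans, log_emit):
    """log P(obs | model), summed over all hidden-state paths (forward algorithm).

    obs:       sequence of observation indices
    log_start: (K,)   log initial state probabilities
    log_trans: (K, K) log transition probabilities, [i, j] = i -> j
    log_emit:  (K, A) log emission probabilities, [i, a] = state i emits symbol a
    """
    alpha = log_start + log_emit[:, obs[0]]                 # forward variable at position 0
    for o in obs[1:]:
        # alpha'_j = emit_j(o) + logsum_i( alpha_i + trans_ij )
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[:, o]
    return logsumexp(alpha)

# Toy 2-state, 2-symbol example (made-up numbers)
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.2, 0.8]])
log_emit  = np.log([[0.7, 0.3], [0.1, 0.9]])
print(forward_logprob([0, 1, 1, 0], log_start, log_trans, log_emit))
```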

We can compare this to the probability that it would be generated by a “random” model

● Null model (random aa)

(mean residue frequencies in Swiss-Prot 50.8)

Scoring Sequences: Log Odds Ratio

The “log odds ratio”:

    log-odds = log [ P(seq | HMM) / P(seq | null) ]

where P(seq | HMM) is the probability of the sequence averaged over all paths (forward algorithm), and P(seq | null) is the probability of the sequence under the null model (random aa).

● Standard way of testing model discriminatory power
● Odds-ratio values >> 1 imply the sequence was likely generated by the HMM
● We take the log to avoid numerical difficulties (the numbers can be very small)
● When the log is base 2, the value has units called “bits”
● The log-odds ratio is also called the “sequence bit score”
● Can be efficiently computed for HMMs using a modified Forward algorithm
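For example, assuming we already have the log-probability of a sequence under the HMM and under the null model (e.g. from a forward computation like the sketch above), the bit score is just their difference in log base 2:

```python
import numpy as np

def bit_score(logp_hmm, logp_null):
    """Log-odds score in bits: log2[ P(seq | HMM) / P(seq | null) ].

    logp_hmm, logp_null: natural-log probabilities of the same sequence
    under the profile HMM and under the null (random-residue) model.
    """
    return (logp_hmm - logp_null) / np.log(2)

# A positive bit score means the sequence is more probable under the HMM than
# under the null model; the larger the score, the stronger the evidence.
```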

Scoring Sequences: E-Values

The “Log odds ratio” (“sequence bit score”)

This gives us a score for each individual sequence. But if you are testing many, many sequences to find matches, some will still have high scores by chance.

E-values: Expected number of sequences having this score or more, when looking in a dataset of size N.

Theory: Can work out that the “null” distribution of bit-scores is exponential

This is a way to estimate whether a high score appeared by chance due to having a large dataset.
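A hedged sketch of turning a bit score into an E-value, assuming the null score tail is exponential with a location parameter mu and scale tau (placeholder numbers; in practice these are the calibrated parameters mentioned below):

```python
import numpy as np

def evalue(score_bits, n_database, mu=3.0, tau=0.7):
    """Expected number of chance hits with score >= score_bits in a database of
    n_database sequences, assuming an exponential null tail:
        P(S >= s) = exp(-(s - mu) / tau)   for s >= mu
    mu and tau are placeholder calibration parameters (assumptions here).
    """
    pvalue = np.exp(-(score_bits - mu) / tau)
    pvalue = min(pvalue, 1.0)                 # a probability cannot exceed 1
    return n_database * pvalue

# Example: a 25-bit hit searched against 10 million sequences
print(evalue(25.0, 10_000_000))               # tiny E-value -> likely a true homolog
```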

(Calibration is necessary to estimate the mu/tau parameters of this distribution.)

Scoring Sequences: Summary

Given an HMM, and a sequence database of size N to search through, we can calculate:

● A log-odds ratio (bit score) for each sequence. A positive bit score (odds ratio > 1) means the sequence is more likely to be generated by the HMM than by the null model.
● An E-value: the number of sequences we expect to reach this score by chance. E-value < 1 means that on average we don't expect a score this strong by chance given our dataset size, so the sequence is likely a homolog. Commonly E < 0.01 is used as a significance cutoff.

By setting a threshold on the E-value, we can filter the database to pick out the homologous sequences.

● Bonus computation: per-site posterior probability annotation. The probability that emitted observable i came from state k, averaging over all possible paths. Computable from the forward/backward equations.

(see Biological Sequence Analysis, equation 3.16)

What you can do with profile HMMs:

Basic (given a model):
● Align sequences to a known model:
  – Find the most probable alignment of an unaligned sequence (Viterbi)
● Score sequences:
  – Compute the probability that a sequence was generated by the HMM
● Search for matching sequences:
  – Score all sequences in a database, pick only the high-scoring ones

Higher-level (no initial model):
● Build a model given an aligned MSA
  – Baum-Welch algorithm (iterates the forward/backward algorithms)
● Build a model given unaligned sequences
  – Iteratively run Baum-Welch, and re-align the sequences with the model
● Iterative model building/search
  – Iteratively run Baum-Welch, and collect new sequences from a database using the model

Iterative HMM Algorithms

Building an HMM given an aligned MSA

You already know this: the Baum-Welch algorithm. Idea: given a trial HMM, use the forward/backward algorithms to average over all possible alignments of all sequences in the MSA. Add up the emissions and transitions seen in the data, weighted by these probabilities, and update the model with these expected values. Repeat.
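For reference, a compact sketch of one Baum-Welch iteration for a plain discrete HMM on a single observation sequence (HMMER's real implementation works on the profile match/insert/delete topology, handles many weighted sequences, and uses log-space arithmetic):

```python
import numpy as np

def baum_welch_step(obs, start, trans, emit):
    """One E-step + M-step of Baum-Welch for a plain discrete HMM.

    obs:   list of observation indices, length T
    start: (K,)   initial state probabilities
    trans: (K, K) transition probabilities
    emit:  (K, A) emission probabilities
    Returns updated (start, trans, emit). Plain probabilities for brevity;
    a real implementation should rescale or work in log space.
    """
    T, K = len(obs), len(start)

    # Forward pass: alpha[t, i] = P(obs[:t+1], state_t = i)
    alpha = np.zeros((T, K))
    alpha[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]

    # Backward pass: beta[t, i] = P(obs[t+1:] | state_t = i)
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])

    p_obs = alpha[-1].sum()                      # total sequence probability

    # Expected state occupancies and transition counts
    gamma = alpha * beta / p_obs                 # gamma[t, i] = P(state_t = i | obs)
    xi = np.zeros((K, K))                        # expected transitions i -> j
    for t in range(T - 1):
        xi += np.outer(alpha[t], emit[:, obs[t + 1]] * beta[t + 1]) * trans / p_obs

    # M-step: re-estimate parameters from the expected counts
    new_start = gamma[0]
    new_trans = xi / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit)
    for a in range(emit.shape[1]):
        new_emit[:, a] = gamma[np.array(obs) == a].sum(axis=0)
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_start, new_trans, new_emit
```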

Building an HMM given unaligned sequences

Idea: similar to the Baum-Welch algorithm, except that we re-align the sequences using the current trial HMM each round of iteration. This changes which positions in each sequence count as match vs insert states. Have to be careful not to get stuck in local optima.

Building an HMM given a seed MSA and a database

Idea: given a seed alignment, fit an initial HMM. Use this HMM to search the database for more homologous sequences, giving a bigger, more diverse MSA. Fit a new HMM to this MSA. Repeat.
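Conceptually the loop looks like the sketch below; build_hmm, search_database, and align_to_hmm are hypothetical placeholder functions supplied by the caller, standing for Baum-Welch fitting, an E-value-filtered database search, and HMM alignment:

```python
def iterative_search(seed_msa, database, build_hmm, search_database, align_to_hmm,
                     n_rounds=5, e_cutoff=0.01):
    """Jackhmmer-style iterative model building (conceptual sketch only).

    build_hmm, search_database, and align_to_hmm are caller-supplied functions
    (hypothetical placeholders, not real library calls).
    """
    msa = seed_msa                                          # start from the curated seed alignment
    for _ in range(n_rounds):
        hmm = build_hmm(msa)                                # fit a new model to the current MSA
        hits = search_database(hmm, database, e_cutoff)     # keep significant homologs
        msa = align_to_hmm(hmm, hits)                       # bigger, more diverse MSA next round
    return build_hmm(msa)
```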

HMMER
● Software package for using/building profile HMMs
● HMMER3 is highly optimized. Fast!

HMMER: Example Run

Profile HMMs: Practical issues

1) Local vs Global Alignments
2) Choosing Match States
3) Phylogenetic Weighting
4) Pseudocounts
5) Filtering Sequences

1. Local vs Global Alignments

HMM for global alignment

Try to match all residues in the query sequence to the model

WIGDEVAVKAARHDDEDIS WHGD-VAVKILKVVDPTPE

HMM for local alignment

Allows unmatched states at the beginning and end, so it can find a match within a longer sequence, or a partial sequence.

EVAVKAARHDDEDISvykig WHGD-VAVKILKVVDPTPE

(Flanking model states; switching points)

2. Choosing Match States

When building a new HMM given a seed MSA, you have to make a decision: which columns will count as match states, and which as insert states? (model construction)

● Manual construction: the user marks alignment columns by hand.
● Heuristic construction: a rule is used to decide whether a column should be marked. For instance, a column might be marked when the proportion of gap symbols in it is below a certain threshold (see the sketch below).
● MAP construction: a maximum a posteriori choice is determined by dynamic programming.
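For instance, the gap-fraction rule from the heuristic option might look like the following sketch (the 50% threshold is a placeholder choice, not HMMER's exact default):

```python
def choose_match_columns(msa, max_gap_fraction=0.5):
    """Heuristic model construction: a column becomes a match state when the
    proportion of gap characters in it is below max_gap_fraction (placeholder)."""
    n_seq = len(msa)
    n_col = len(msa[0])
    match_columns = []
    for j in range(n_col):
        gap_fraction = sum(seq[j] in "-." for seq in msa) / n_seq
        if gap_fraction < max_gap_fraction:
            match_columns.append(j)
    return match_columns

# Toy example: columns 0, 1 and 3 are mostly occupied, column 2 is mostly gaps.
msa = ["WIGD", "WH-D", "W--D", "WI-D"]
print(choose_match_columns(msa))   # -> [0, 1, 3]
```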

2. Choosing Match States

3. Phylogenetic Weighting

"God has an inordinate fondness for beetles." - J.B.S Haldane

https://evolution.berkeley.edu

If a closely related group of species appears often in the dataset, this introduces a bias in our estimates of emission probabilities. Similar to overcounting. Will hurt us for distant homology detection.

Beetles with similar sequences:
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSAHSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGSSKVFQVLLEKTQIYAIKKENLEEADNQTLDEINY
QIGEGGSKKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGFSKVFQVLNEHKQIYAIKYVNLEEADNQTLDEIAY

ELGRGESGTVYKGVLEDDRHVAVKKLENVRQGKEVFQELSV
RLGAGGFGSVYKATYHGVP-VAIKQVNKCTKNRLASRELNV
KIGAGNSGTVVKALHVPDSKVAKKTIPVEQNNSTIINELSI
ILGEGAGGSVSKCKLKNGSKFALKVINTLNTDPEYQKELQF
HLGEGNGGAVSLVKHRNI--FMARKTVYVGSDSKLQKELGV
KIGQGASGGVYTAYEIGTNVVAIKQMNLEKQPKKELIEILV
VIGRGSYGVvykainkhTDQVAIKEVVYENDEELNDIEISL
LIGSGSFGQVYLGMNASSGEMAVKQVILDSVSESKDREIAL
IIGRGGFGEVYGCRKADTGKYAMKCLDKKRIKMKQGENERI
LLGKGTFGKVILVKEKATGRYAMKILKKEVIVAKDEVENRV
SLGSGSFGTAKLCRHRGSGLFCSKTLRRETIVHEKHKEINI
LLGQGDVGKVYLVRERDTNQFALKVLNKHEMIKRKKIEQEI
TLGIGGFGRVVKAHHQDRVDFALKCLKKRHIVDTKQEERHI

3. Phylogenetic Weighting

Many different strategies to “unweight” the dataset. One way (the “Henikoff scheme”):

For a sequence whose residue at position i is “a”, the per-position weight factor is

    w_i = 1 / (r_i × n_ia)

where n_ia = number of times “a” is seen at position i, and r_i = number of different residues ever seen at position i. This is the weight factor needed to give “equal” emission probabilities at position i.

The sequence weight is the average of these weight factors over all positions.
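A minimal sketch of this position-based weighting, following the description above (illustrative only, not HMMER's exact implementation):

```python
from collections import Counter

def henikoff_weights(msa):
    """Position-based sequence weights (Henikoff scheme sketch).

    At column i, a sequence carrying residue a gets weight 1 / (r_i * n_ia),
    where r_i is the number of distinct residues in the column and n_ia is how
    many sequences carry residue a there. The sequence weight is the average
    of its per-column weights.
    """
    n_seq, n_col = len(msa), len(msa[0])
    weights = [0.0] * n_seq
    for i in range(n_col):
        column = [seq[i] for seq in msa]
        counts = Counter(column)                  # n_ia for each residue a
        r = len(counts)                           # number of distinct residues r_i
        for k in range(n_seq):
            weights[k] += 1.0 / (r * counts[column[k]])
    return [w / n_col for w in weights]

# Near-duplicate sequences share their weight; the diverged one gets more.
print(henikoff_weights(["QIGS", "QIGS", "QIGS", "ELGR"]))
```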

4. Pseudocounts

We build an HMM given a limited seed alignment. At some positions there may be residues that we didn’t see in our dataset, but exist some fraction of the time in nature, particularly when we have few sequences.

Without any corrections, our HMM would predict that these other residues could never be output. We correct for this by adding pseudocounts:

Different ways of computing emission probabilities:

● No correction: e_i(a) = c_i(a) / Σ_b c_i(b) (residues never seen in the seed get zero probability)
● Simple pseudocount: e_i(a) = (c_i(a) + A·q_a) / (Σ_b c_i(b) + A), where q_a are background frequencies (A = 20 works well)
● Dirichlet mixture pseudocounts (assumes there are different “types” of positions which have different biases)
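A sketch of the simple pseudocount correction (the Dirichlet-mixture version used in practice is more involved; the toy column below is invented):

```python
def emission_probs(counts, background, A=20.0):
    """Pseudocount-corrected emission probabilities for one match column.

    counts:     observed residue counts c_a in the seed alignment column
    background: background frequencies q_a (e.g. Swiss-Prot averages)
    A:          total pseudocount weight (A = 20 is said to work well)

        e(a) = (c_a + A * q_a) / (sum_b c_b + A)

    With A = 0 this reduces to the uncorrected maximum-likelihood estimate,
    which assigns zero probability to residues never seen in the seed.
    """
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0) + A * q) / total for a, q in background.items()}

# Toy column: only L and V observed among 10 sequences, uniform background.
background = {a: 1.0 / 20 for a in "ACDEFGHIKLMNPQRSTVWY"}
probs = emission_probs({"L": 7, "V": 3}, background)
print(round(probs["L"], 3), round(probs["I"], 3))   # unseen I still gets a small probability
```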

5. Filtering

When searching a database, there are choices about which sequences to “keep”: E values, bit scores, etc.

Aside: Protein vs DNA alignments: Always use protein if possible

Illustration using “dot matrix” method for manual alignment: Put both sequences on X and Y axes, and put a dot anywhere they are the same.

[Figures: example of the dot matrix method; dot matrix for a nucleotide sequence, and for the same sequence translated to aa]
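A minimal text-mode sketch of the dot matrix comparison (the fragments are adapted from the alignment example earlier):

```python
def dot_matrix(seq_x, seq_y):
    """Print a mark wherever the residues on the X and Y sequences are identical."""
    print("  " + " ".join(seq_x))
    for a in seq_y:
        row = ["*" if a == b else "." for b in seq_x]
        print(a + " " + " ".join(row))

# With only 4 nucleotides, chance matches clutter the plot; the 20-letter
# protein alphabet gives a much cleaner diagonal for truly homologous regions.
dot_matrix("WIGDEVAVK", "WHGDVAVKI")
```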

The greater number of symbols in protein sequences (20 amino acids vs 4 nucleotides) makes alignment easier.

Part 2

What you can expect to find in the Pfam Database

Protein Families and their Evolution – A Structural Perspective. Annual Review of Biochemistry 2005
Bridging the physical scales in evolutionary biology: from protein sequence space to fitness of organisms and populations. COSB 2017
Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. COSB 2017

Pfam

● Pfam is a collection of HMMs and MSAs of many protein families. Uses HMMs to detect homologous sequences
● The underlying sequence database (Pfamseq) is based on UniProt
● Intimately tied to HMMER
● http://pfam.xfam.org/

● 16,306 curated protein families (folds)
● 23 million protein sequences
● 7.6 billion residues
● Collected from all branches of life

[Figure: tree of life, designed to reflect the amount of diversity among branches]

A new view of the tree of life. Hug et al., Nature Microbiology 2016

UniProt database sequence statistics

(23 million Pfam protein sequences. They come from Uniprot)

[Figure panels: Domains; Eukaryotes only]

Pfam

Each Pfam family consists of:

● a curated seed alignment containing a small set of representative members of the family
● profile HMMs built from the seed alignment
● an automatically generated full alignment, which contains all detectable protein sequences belonging to the family

Pfam entries are classified in one of six ways:

● Family: a collection of related protein regions
● Domain: a structural unit
● Repeat: a short unit which is unstable in isolation but forms a stable structure when multiple copies are present
● Motif: a short unit found outside globular domains
● Coiled-Coil: regions that predominantly contain coiled-coil motifs, i.e. alpha-helices coiled together in bundles of 2-7
● Disordered: regions that are conserved, yet are either shown or predicted to have biased sequence composition and/or to be intrinsically disordered (non-globular)

(From the Pfam website)

Pfam website example

Fibronectin type III

Seed Matches

● All fibronectin sequences in the alignment have the same “beta sandwich” fold
● ~65,000 different sequences
● ~100 residues long

How similar do you think the sequences are (% identical aa)?

Fibronectin Type III domain MSA Average sequence identity: 19%

WebLogo

[Figures: MSA and sequence identity (Seq. ID) distribution]

Fibronectin has quite typical diversity

The Protein Universe

● The protein sequence → structure mapping is highly degenerate
● Typically ~30-40% identity within a fold, and as low as 8-9% (equal to random sequences)
● The space of all possible sequences is enormous (20^L for length-L sequences)
● The space of all sequences sharing a common folded structure may also be very large (rough intuition/guess)

(This poses a challenge for HMM methods: below 20% identity, HMMs have difficulty predicting whether sequences share a fold. This is known as the “twilight zone” of sequence similarity.)

How much of protein sequence space has been explored by life on Earth? Dryden et al. J. Royal Society Interface (2008)
Twilight zone of protein sequence alignments. Rost, Protein Eng Des Sel (1999)

The Protein Universe

● Many authors suggest there are only 1,000 to 10,000 existing protein folds across all life
● 16,384 families in Pfam
● 1,086 folds, 3,464 families in SCOP
● (compare to ~20,000 protein-coding genes in the human genome)

Protein Families and their Evolution – A Structural Perspective. Annual Review of Biochemistry 2005
Estimating the total number of protein folds. Govindarajan et al. Proteins: Struct. Func. Bioinfo. 1999
Expanding protein universe and its origin from the Biological Big Bang. PNAS 2002
The Protein Folds as Platonic Forms. Denton et al. Journal of Theoretical Biology 2002

Evolutionary origins of proteins

4 genomes with multiple variants of the same protein fold:

New variants/copies are generated by:
● Gene duplication (paralogs, often different function)
● Speciation (orthologs, usually same function)

[Figures: tree of c-SRC kinase across species (orthologs); tree of 512 kinases in the human genome (paralogs)]

Types of Sequence Patterns to Be Aware of:

● Tandem gene duplication
● Partial gene duplication
  – Domain duplication
  – Invariant repeats (dose repetition)
  – Exon shuffling, pseudoexonization
● Pseudogenes
  – Unprocessed vs processed
  – (Will have much more sequence variation/less homology)

[Figures: repeat elements; globin gene duplicates and pseudogenes]

Fundamentals of Molecular Evolution (Graur/Li)

Understanding Sequence Diversity: Observations

More closely related species have fewer differences per (orthologous) protein

% difference in Hemoglobin between related species

The Neutral Theory of Molecular Evolution. Kimura 1983 How is sequence diversity generated? Observations e t i s

r e p

s n o i t u t i t s b u s

f o

#

(fossil record) substitutions occur at a constant rate “Rate constant hypothesis” or “Molecular Clock” (Zuckerland & Pauling 1965)

Technical detail: correction for repeated substitutions at the same site, e.g. the Poisson correction K = -ln(1 - p), where p = observed fraction of sites that differ (final sequence difference) and K = estimated number of past substitution events per site.

How is sequence diversity generated? Observations

● Proteins appear to accumulate substitutions at a constant rate, usually about 1 substitution per site per billion years


● Different proteins have different evolutionary rates
● Different parts of a single protein have different evolutionary rates

Why is there variation in evolutionary rate?

Causes of evolutionary rate variation among protein sites. Nature Reviews Genetics (2016)

Summary

● Many possible sequences lead to the same fold
● Proteins in a common family/fold accumulate substitutions at a constant rate over time