Bioinformatics Applications of HMMs

BIOINFORMATICS APPLICATIONS OF HMMS

CSE/BIMM/BENG 181 MAY 17, 2011 SERGEI L KOSAKOVSKY POND [[email protected]] WWW.HYPHY.ORG/PUBS/181/LECTURES

OUTLINE

Definitions and terms
Training approaches
Sequence feature selection
Secondary structure prediction
Probabilistic alignment using HMMs: PFAM, HMMER
Gene finding [next major topic]
Prokaryotic genes and generalized HMMs
Eukaryotic genes

DEFINITIONS AND REVIEW

A hidden Markov model (HMM) is a generative stochastic model that assigns probabilities to finite-length strings over an alphabet $A$.

A four-tuple $(A, Q, P_e, P_t)$ defines a hidden Markov model $H$:
$A$: the finite alphabet over which the observed strings are defined.
$Q$: the finite collection of hidden states of the model.
$P_e(a_i \mid q_k)$: the probability of emitting character $a_i$ if the hidden state is $q_k$.
$P_t(q_m \mid q_k)$: the probability of transition from hidden state $q_k$ to hidden state $q_m$ in one step.

[Figure: diagram with hidden states labeled 0 and 1 and emitted symbols H and T.]

ALGORITHMS

Forward or backward (sum peeling): compute the probability of an observed string $a_1 a_2 \ldots a_n$ given emission and transition probabilities. Runs in time $O(|Q|^2 n)$, or $O(|Q| n)$ for sparse models.

Decoding (Viterbi): compute the sequence of hidden states $q_1 q_2 \ldots q_n$ that is most likely to have given rise to an observed sequence $a_1 a_2 \ldots a_n$. Runs in time $O(|Q|^2 n)$, or $O(|Q| n)$ for sparse models.
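
The forward and Viterbi computations described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the two-state coin model (states 0 and 1, symbols H and T) and every probability in it are assumed values chosen only for the example, and the names `forward`, `viterbi`, `STATES`, `START`, `TRANS` and `EMIT` are hypothetical.

```python
import math

# Hypothetical two-state coin model; all numbers are illustrative assumptions.
STATES = [0, 1]
START = {0: 0.5, 1: 0.5}                      # initial state distribution
TRANS = {0: {0: 0.9, 1: 0.1},                 # TRANS[k][m] = P_t(q_m | q_k)
         1: {0: 0.1, 1: 0.9}}
EMIT = {0: {"H": 0.9, "T": 0.1},              # EMIT[k][a] = P_e(a | q_k)
        1: {"H": 0.1, "T": 0.9}}

def _log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def forward(obs, states, start, trans, emit):
    """Pr(obs | model): sums over all hidden paths in O(|Q|^2 n) time."""
    f = {q: start[q] * emit[q][obs[0]] for q in states}
    for a in obs[1:]:
        f = {m: emit[m][a] * sum(f[k] * trans[k][m] for k in states)
             for m in states}
    return sum(f.values())

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden path, computed in log space with back-pointers."""
    v = {q: _log(start[q]) + _log(emit[q][obs[0]]) for q in states}
    back = []
    for a in obs[1:]:
        ptr, nxt = {}, {}
        for m in states:
            best = max(states, key=lambda k: v[k] + _log(trans[k][m]))
            ptr[m] = best
            nxt[m] = v[best] + _log(trans[best][m]) + _log(emit[m][a])
        back.append(ptr)
        v = nxt
    path = [max(states, key=lambda q: v[q])]
    for ptr in reversed(back):                # trace back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi("HHHHTTTT", STATES, START, TRANS, EMIT))  # → [0, 0, 0, 0, 1, 1, 1, 1]
print(forward("HHHHTTTT", STATES, START, TRANS, EMIT))
```

Both functions loop over every pair of states at each position, which gives the $O(|Q|^2 n)$ running time; a sparse model would iterate only over the transitions with non-zero probability. For long sequences the forward recursion would normally be rescaled or run in log space to avoid underflow, which this sketch omits.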

Training: estimate transition and/or emission probabilities given
  a set of labeled observed sequences (the corresponding hidden states are known): use frequency counts, possibly corrected;
  only observed sequences: use Baum-Welch or another non-linear optimization procedure.

TRAINING HMMS FROM LABELED SEQUENCES

Observed sequence (top) and hidden-state labels (bottom); the label string has one extra state 0 at each end:

 CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
011111112222222111111222211111112222111110

TRANSITIONS (counts $A_{i,j}$, row percentages in parentheses)

              to state 0    to state 1    to state 2
from state 0  0 (0%)        1 (100%)      0 (0%)
from state 1  1 (4%)        21 (84%)      3 (12%)
from state 2  0 (0%)        3 (20%)       12 (80%)

$a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}}$

EMISSIONS (counts $E_{i,k}$ of symbol $k$ in state $i$, row percentages in parentheses)

              A             C             G             T
state 1       6 (24%)       7 (28%)       5 (20%)       7 (28%)
state 2       3 (20%)       3 (20%)       2 (13%)       7 (47%)

$e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}}$, where $\Sigma$ denotes the emission alphabet.

EXAMPLE FROM: HTTP://WWW.GENEPREDICTION.ORG/BOOK/HMM-PART1.PPT

PROTEIN STRUCTURE PREDICTION

A simple model states that each residue in a folded protein can be assigned to one of three structural features:
An α-helix (offset 4 hydrogen bonds)
A β-strand/sheet
Other (a loop, L)

[Figure: protein 1DZOA. Cheng and Baldi, BMC Bioinformatics 2007, 8:113]

EMISSION AND TRANSITION FREQUENCIES

The frequency distributions of amino-acid residues differ between structural classes, so they can be used to estimate emission probabilities.

To estimate transition probabilities, we simply tabulate how frequently the transitions happen in a large reference dataset with known structure.

[Figure: stationary frequencies of the hidden Markov chain. Goldman, Thorne and Jones, JMB 1996]

TRAINING CAVEATS

Probabilities of rare transition events are difficult to estimate from count data.
Some state $k$ may not appear in any of the training sequences.
This means $\#_{k \to l} = 0$ for every state $l$, so $P_t(k, l)$ cannot be computed from counts.
One can 'pad' the observed counts with pseudocounts $r$ (reflecting our prior beliefs):

$A_{k,l} = (\text{number of } k \to l \text{ transitions}) + r_{k,l}$
$E_k(b) = (\text{number of emissions of symbol } b \text{ from state } k) + r_k(b)$

STRUCTURE INFERENCE

Given a trained HMM $H$ and a sequence $S$ we can:

Run Viterbi decoding to infer the most likely hidden path of α, β and L states for the sequence.

Use the forward-backward algorithm to compute the posterior probabilities that a given position $i$ in the amino acid sequence is in an α-helix, a β-sheet or a loop:

$p_{i,\alpha} = \Pr\{q_i = \alpha \mid S, H\} = \frac{\Pr\{q_i = \alpha, S \mid H\}}{\Pr\{S \mid H\}}$
$p_{i,\beta} = \Pr\{q_i = \beta \mid S, H\} = \frac{\Pr\{q_i = \beta, S \mid H\}}{\Pr\{S \mid H\}}$
$p_{i,L} = \Pr\{q_i = L \mid S, H\} = \frac{\Pr\{q_i = L, S \mid H\}}{\Pr\{S \mid H\}}$

[Figure: stacked panels (x-axis 0 to 120, y-axis 0.0 to 0.8) for the query (weight=0.0963, Q3=68.5%), sequence 1 (weight=0.0963), sequence 2 (weight=0.146), sequence 3 (weight=0.129), sequence 4 (weight=0.140), sequence 5 (weight=0.109), sequence 6 (weight=0.133), sequence 7 (weight=0.150), the d1jyoa protein consensus (Q3=79.2%), and the true secondary structure (h1, s1, s2). Source: HTTP://WWW.BIOMEDCENTRAL.COM/1472-6807/6/25]

Q3: a standard measure of structural prediction accuracy, defined as the proportion of residues assigned to the correct class.

ALHEASGPSVILFGSDVTVPPASNAEQAK
hhhhhoooossssooosssooooohhhhh  (true)
ohhhoooossssooooosssooohhhhhh  (22/29 = 76% - useful)
hhhhhoooohhhhooohhhooooohhhhh  (22/29 = 76% - terrible)

Random assignment: Q3 = 33%
State-of-the-art prediction: Q3 ~ 80%

HTTP://NOOK.CS.UCDAVIS.EDU/~KOEHL/CLASSES/CSB/CSB_LECTURE11.PPT

HMMS ACTUALLY USEFUL FOR STRUCTURE PREDICTION...

[Figure: HMM state diagram with three blocks, HELIX (states H1 to H15), COIL (states c1 to c12) and STRAND (states b1 to b9). Legend: p in [0.1, 0.25[, p in [0.25, 0.5[, p in [0.5, 0.75[, p >= 0.75; hydrophilic preference; hydrophobic preference; secondary structure entry state; secondary structure exit state.]

LOG-ODDS SCORE

$\text{Score} = \log_2 (p_{iq} / P_i)$, where $p_{iq}$ is the frequency of residue $i$ in state $q$ and $P_i$ is the frequency of residue $i$ in all training sequences.

[Figure: log-odds scores for the helix, coil and strand states.]

HTTP://WWW.BIOMEDCENTRAL.COM/1472-6807/6/25

PROFILE HMM ALIGNMENT/MATCHING

A distant cousin of functionally related sequences in a protein family may have only weak pairwise similarity with each member of the family and thus fail a significance test. However, it may have weak similarities with many members of the family.
The goal is to align a sequence to all members of the family at once.
A family of related proteins can be represented by their multiple alignment and the corresponding profile.

PROFILE REPRESENTATION OF SEQUENCE FAMILIES

Aligned DNA sequences can be represented by a $4 \times N$ profile matrix reflecting the frequencies of nucleotides in every aligned position.
A protein family can be represented by a $20 \times N$ profile representing frequencies of amino acids.
These can be used to estimate emission probabilities of an HMM.

[Figure: nucleotide profile of seven aligned positions (x-axis 0 to 7, y-axis 0 to 1), titled HIV protease. Frequency sets as extracted, one set of four per line; each set sums to 1, but the assignment to nucleotides and positions is not recoverable from the text:
0.000455373  0.000819672  9.10747e-05  0.998634
0.0512143    0.119885     0.000273224  0.828628
0.000335008  0.000167504  0            0.999497
8.37521e-05  8.37521e-05  0.999749     8.37521e-05
0.000167504  0.0274707    0.000167504  0.972194
0.957377     0.0021062    0.0332003    0.00731626
0.0100599    0.981792     0.00108081   0.00706684]

MULTIPLE ALIGNMENTS AND PROTEIN FAMILY CLASSIFICATION

A multiple alignment of a protein family shows variations in conservation along the length of the protein.
Example: after aligning many globin proteins, biologists recognized that the helical regions in globins are more conserved than other regions.
One way to visualize this: entropy plots.

[Figure: entropy plot for Influenza A hemagglutinin (x-axis 0 to 300, y-axis 0 to 1.5), with antigenic sites marked.]

WHAT ARE PROFILE HMMS

A profile HMM is a probabilistic representation of a multiple alignment.
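
The profile matrix and the per-column entropy used in entropy plots can be sketched as follows. This is a minimal illustration under stated assumptions: the four-sequence toy alignment is made up, the function names `profile` and `entropy` are hypothetical, and the optional `pseudocount` argument is one simple way to pad counts so that unobserved symbols keep non-zero probability when the columns are used as HMM emission estimates.

```python
import math
from collections import Counter

def profile(alignment, alphabet="ACGT", pseudocount=0.0):
    """Return one {symbol: frequency} dict per alignment column.

    Characters outside `alphabet` (for example gaps) are ignored. With
    pseudocount=0 a column made only of such characters would divide by
    zero; the toy data below has no such column.
    """
    columns = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        columns.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return columns

def entropy(freqs):
    """Shannon entropy in bits; 0 for a fully conserved column."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

# Hypothetical toy alignment of four DNA sequences (N = 4 columns).
toy = ["ACGT", "ACGA", "ACTT", "AGGT"]

prof = profile(toy)                       # 4 x N profile, stored column by column
ents = [entropy(col) for col in prof]     # values one would plot as an entropy plot
padded = profile(toy, pseudocount=1.0)    # add-one padding of the counts

print(prof[0])    # first column is fully conserved: A has frequency 1.0
print(ents)       # first entry is 0 bits; the other columns are variable
```

Low entropy marks conserved columns and high entropy marks variable ones. For a protein family the same code applies with a 20-letter `alphabet`.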
