Introduction to Hidden Markov Models and Profiles in

Utah State University – Spring 2010 STAT 5570: Statistical Notes 6.3

1 References

 Chapters 3-6 of Biological Sequence Analysis (Durbin et al., 2001)

 Chapter 9 of A First Course in Probability (Ross, 1997)

 Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763

2 The “occasionally dishonest casino”

 A casino usually uses a fair die, but sometimes (5% of the time) switches to a loaded die. Once using the loaded die, they usually keep using it (90% of the time).
 How do you know which die you’re playing?
  - not sure, but have to look at many plays to see pattern
  - the “state” here is “hidden”
 One possible representation of this “model”:

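Transition probabilities between the two dice (states):

From \ To    Fair   Loaded
Fair         0.95   0.05
Loaded       0.10   0.90

Emission probabilities (chance of each face):

Face     1     2     3     4     5     6
Fair     1/6   1/6   1/6   1/6   1/6   1/6
Loaded   1/10  1/10  1/10  1/10  1/10  1/2

Possible partial sequence of rolls:

Roll ...62625335636616366646623...
Die  ...FFFFFFFFLLLLLLLLLFFFFFF...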

F = “fair” state, L = “loaded” state

3 Follow-up to a sequence alignment

 Consider pairwise (or multiple) alignment
 What does alignment mean? It possibly represents common ancestry
 Possible questions:
  - Does the alignment describe some “family”?
  - How can we describe its internal structure?
 Can sometimes characterize these “family” structures as a Markov chain

4 “Family” Example: CpG islands

 In DNA: C & G are always paired with each other across the two strands of the helix
 CpG (C followed by G along one strand) is rare in sequence, but is more frequent in promoter or start regions of a gene
 General idea: CpG → at start of gene
 Possible question about our alignment: Does CpG frequency suggest we are in a promoter region?

5 Markov Chain Examples

 Occasionally dishonest casino: Probability you are using a fair die on one play depends on whether you were using a fair die on the previous play.
 Rain indicator: Chance of rain tomorrow depends on whether it rains today.
 Gambler’s Ruin: A gambler starts out with a fortune and plays a game repeatedly (with same prob. of winning or losing $1 each time) until her fortune reaches either 0 or M. Her sequence of fortunes is a Markov chain.
 CpG Island: Probability you are in a CpG island at one point in alignment depends on whether you were in a CpG island at the previous point.

6 Markov Chains – a little more formally

 Sequence of random variables: $X_0, X_1, \ldots$
 Set of possible values: $\{0, 1, \ldots, M\}$
 Think of $X_t$ as state of process at time $t$
 This sequence is a Markov chain if:

$$P\{X_{t+1}=j \mid X_t=i, X_{t-1}=i_{t-1}, \ldots, X_1=i_1, X_0=i_0\} = P\{X_{t+1}=j \mid X_t=i\} = P_{ij}$$

 So state at time $t+1$ depends only on state at time $t$
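As a minimal sketch of this definition, the die-switching process from the casino example can be written as a two-state Markov chain, using the transition probabilities given there (0.95/0.05 from the fair die, 0.10/0.90 from the loaded die). The function names, the starting state, and the random seed are illustrative assumptions.

```python
import random

# Transition matrix P[i][j] = P{X_{t+1} = j | X_t = i}
# State 0 = fair die, state 1 = loaded die (values from the casino example).
P = [[0.95, 0.05],
     [0.10, 0.90]]

def simulate_chain(P, start, n_steps, rng):
    """Simulate n_steps transitions; the next state depends only on the current one."""
    states = [start]
    for _ in range(n_steps):
        current = states[-1]
        states.append(rng.choices(range(len(P)), weights=P[current])[0])
    return states

def n_step_matrix(P, n):
    """n-step transition probabilities, i.e. the matrix power P^n."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = [[sum(result[i][k] * P[k][j] for k in range(size))
                   for j in range(size)] for i in range(size)]
    return result

rng = random.Random(0)  # fixed seed, only so the run is repeatable
path = simulate_chain(P, start=0, n_steps=20, rng=rng)
print("".join("FL"[s] for s in path))
print(n_step_matrix(P, 100)[0])  # long-run chance of fair vs. loaded
```

After many steps the rows of the n-step matrix approach (2/3, 1/3), so in the long run this chain uses the fair die about two thirds of the time regardless of the starting die.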

7 Hidden Markov Models (HMM) - vocabulary

 State: in which “family” the process is (CpG vs. not, fair vs. loaded die, etc.)
  - $\pi$: the “path”; $\pi_i$: the $i$th state in the path
 Symbol: observed “outcome” $x$ (sequence, die, etc.) from unknown state
 Transition probabilities: $a_{kh} = P\{\pi_i = h \mid \pi_{i-1} = k\}$
 Emission probabilities: $e_k(b) = P\{x_i = b \mid \pi_i = k\}$
 Joint probability of observed sequence $x$ and state sequence $\pi$ (state 0 is the begin state):

$$P(x, \pi) = a_{0\pi_1} \prod_i e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
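The joint probability can be evaluated directly for the casino example. The sketch below uses the transition and emission probabilities from that example and the partial roll and die sequences shown there. Two things are assumptions, since the slides do not specify them: the begin-state probabilities (an even split) and the absence of an end state.

```python
import math

# Transition probabilities a[k][h] from the casino example (F = fair, L = loaded).
a = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
# Emission probabilities e[k][b] for faces "1".."6".
e = {"F": {str(b): 1 / 6 for b in range(1, 7)},
     "L": {**{str(b): 1 / 10 for b in range(1, 6)}, "6": 1 / 2}}
# ASSUMPTION: the slides give no begin-state probabilities a_{0k};
# an even split is used here purely for illustration.
a0 = {"F": 0.5, "L": 0.5}

def log_joint(x, path):
    """log P(x, pi) = log a_{0,pi_1} + sum_i [log e_{pi_i}(x_i) + log a_{pi_i,pi_{i+1}}].

    ASSUMPTION: no end state is modeled, so the transition out of the
    last position is left out of the product.
    """
    logp = math.log(a0[path[0]])
    for i, (symbol, state) in enumerate(zip(x, path)):
        logp += math.log(e[state][symbol])
        if i + 1 < len(path):
            logp += math.log(a[state][path[i + 1]])
    return logp

rolls = "62625335636616366646623"
dice = "FFFFFFFFLLLLLLLLLFFFFFF"
print(log_joint(rolls, dice))               # path shown in the casino example
print(log_joint(rolls, "F" * len(rolls)))   # all-fair path, for comparison
```

Working in logs avoids numerical underflow when the sequence is long. Comparing the two printed values shows which of the two labelings the model scores higher.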

8 Estimating the HMM path

Several approaches to find the “most probable state path” $\pi^* = \arg\max_\pi P(x, \pi)$

 Viterbi algorithm – focuses on identifying the most probable state path

 Forward algorithm – sums over all possible paths to give the probability of the observed sequence

 Backward algorithm – combined with the forward algorithm, focuses more on posterior state probabilities (position-specific prob. that observation came from state k given observed sequence)
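The three algorithms can be sketched for the casino example as follows. This is an illustrative implementation, not code from the course: the begin-state probabilities are assumed to be an even split, no end state is modeled, and the forward and backward passes use plain probabilities (fine for this short sequence; longer ones would need scaling or logs).

```python
import math

STATES = ["F", "L"]  # F = fair die, L = loaded die
a = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
e = {"F": {str(b): 1 / 6 for b in range(1, 7)},
     "L": {**{str(b): 1 / 10 for b in range(1, 6)}, "6": 1 / 2}}
a0 = {"F": 0.5, "L": 0.5}  # ASSUMPTION: begin probabilities are not in the slides

def viterbi(x):
    """Most probable state path, computed in log space with back-pointers."""
    v = [{k: math.log(a0[k]) + math.log(e[k][x[0]]) for k in STATES}]
    back = []
    for symbol in x[1:]:
        scores, pointers = {}, {}
        for h in STATES:
            best = max(STATES, key=lambda k: v[-1][k] + math.log(a[k][h]))
            pointers[h] = best
            scores[h] = v[-1][best] + math.log(a[best][h]) + math.log(e[h][symbol])
        v.append(scores)
        back.append(pointers)
    state = max(STATES, key=lambda k: v[-1][k])
    path = [state]
    for pointers in reversed(back):  # trace back from the last position
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

def forward(x):
    """f[i][k] = P(x_1..x_i, pi_i = k)."""
    f = [{k: a0[k] * e[k][x[0]] for k in STATES}]
    for symbol in x[1:]:
        f.append({h: e[h][symbol] * sum(f[-1][k] * a[k][h] for k in STATES)
                  for h in STATES})
    return f

def backward(x):
    """b[i][k] = P(x_{i+1}..x_L | pi_i = k), with b[L][k] = 1 (no end state)."""
    b = [{k: 1.0 for k in STATES}]
    for symbol in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][h] * e[h][symbol] * b[0][h] for h in STATES)
                     for k in STATES})
    return b

def posterior(x):
    """P(pi_i = k | x) for every position i."""
    f, b = forward(x), backward(x)
    px = sum(f[-1][k] for k in STATES)  # P(x), summed over all paths
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))]

rolls = "62625335636616366646623"
print(viterbi(rolls))
print("".join(max(STATES, key=p.get) for p in posterior(rolls)))
```

The first printed line is the single best path; the second labels each roll with its most probable state individually, which need not give the same answer.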

9 Estimation when paths unknown: Baum-Welch

 An iterative procedure estimates the transition and emission probabilities

 A special case of the EM algorithm (a general approach to deal with maximum likelihood with missing/incomplete/latent data) - think of missing covariate: state

 Can also consider an approximation based on iterations of the Viterbi algorithm
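The Viterbi-based approximation alternates between decoding a path and re-estimating the probabilities from counts along that path. The sketch below shows only the counting step, applied to the labeled rolls from the casino example; it is not a full Baum-Welch implementation, and the function name and the optional pseudocount are illustrative assumptions.

```python
from collections import Counter

def estimate_from_path(x, path, pseudocount=0.0):
    """Estimate transition and emission probabilities from counts along one state path."""
    trans = Counter(zip(path, path[1:]))   # (state, next state) counts
    emit = Counter(zip(path, x))           # (state, symbol) counts
    states = sorted(set(path))
    symbols = sorted(set(x))
    a, e = {}, {}
    for k in states:
        t_total = sum(trans[(k, h)] + pseudocount for h in states)
        a[k] = {h: (trans[(k, h)] + pseudocount) / t_total for h in states}
        e_total = sum(emit[(k, b)] + pseudocount for b in symbols)
        e[k] = {b: (emit[(k, b)] + pseudocount) / e_total for b in symbols}
    return a, e

rolls = "62625335636616366646623"
dice = "FFFFFFFFLLLLLLLLLFFFFFF"
a_hat, e_hat = estimate_from_path(rolls, dice)
print(a_hat["F"]["L"], a_hat["L"]["L"], e_hat["L"]["6"])
```

With this one short sequence the estimates are 1/13 for fair-to-loaded, 8/9 for staying loaded, and 6/9 for a six from the loaded die. Baum-Welch replaces these hard counts with expected counts weighted by the posterior state probabilities.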

10 Pairwise Alignments as HMMs

Recall notation: x & y are sequences to be aligned, with gap opening penalty of d and gap extension penalty of e.

Let (+w,+v) here represent change in sequence position, with M=match, and X,Y=insertion (gap) in x or y.

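[Figure: two versions of the same three-state diagram (states M, X, Y). In the score-based version, entering M scores s(x_i,y_j), opening a gap (M to X or M to Y) scores -d, and extending a gap (X to X or Y to Y) scores -e. The probabilistic (pair HMM) version is summarized below.]

State   Position change   Emission probability
M       (+1,+1)           p_{x_i y_j}
X       (+1,+0)           q_{x_i}
Y       (+0,+1)           q_{y_j}

Transition   Probability
M → M        1-2δ
M → X        δ
M → Y        δ
X → X        ε
X → M        1-ε
Y → Y        ε
Y → M        1-ε

States: insertion (X or Y), match (M)
δ = probability of moving to a specific insertion state
ε = prob. of staying in an insertion state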

11 What can be done with pairwise HMM

 Build HMM for a random (non-matched) model
 Evaluate likelihood of matched model by considering log-odds of matched vs. random models
 Search for other alignments: sub-optimal
 Consider posterior probability of alignment: $P(x_i \diamond y_j \mid x, y)$, where $\diamond$ means “is aligned with”

12 Using HMMs to describe a “family”

 Suppose we have an alignment of multiple sequences
  - we can model their “relationship” as a family of sequences
  - call this the family’s “profile”
 PSSM – position-specific score matrix
  - estimate this to describe this particular profile (e.g., should ‘A’ count for more at a particular position in the alignment?)
 Allow for insertions and deletions, where “cost” could also be position-specific
 Use this profile to describe the alignment and look for other similar sequences
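One common way to fill in a PSSM is as log-odds scores of column frequencies against a background distribution. The sketch below illustrates that idea on a small made-up alignment; the toy sequences, the uniform background, the pseudocount of 1, and the function names are all assumptions for illustration, not values from the course data.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet, background, pseudocount=1.0):
    """One dict per column: log2( P(residue in column) / background frequency )."""
    n_cols = len(alignment[0])
    pssm = []
    for col in range(n_cols):
        # Gaps ("-") are skipped when counting residues in a column.
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts[r] for r in alphabet) + pseudocount * len(alphabet)
        pssm.append({r: math.log2((counts[r] + pseudocount) / total / background[r])
                     for r in alphabet})
    return pssm

def score(pssm, segment):
    """Sum of position-specific scores for an ungapped segment."""
    return sum(column[res] for column, res in zip(pssm, segment))

# HYPOTHETICAL toy alignment (not taken from the course data).
toy = ["ACGT", "ACGA", "TCGT", "AC-T"]
alphabet = "ACGT"
background = {r: 0.25 for r in alphabet}  # ASSUMPTION: uniform background
pssm = build_pssm(toy, alphabet, background)
print(round(pssm[0]["A"], 3), round(pssm[0]["T"], 3))
print(score(pssm, "ACGT") > score(pssm, "TTTT"))
```

In the first column ‘A’ appears in three of four sequences, so it scores higher there than ‘T’; the same letter can score differently at another position, which is what makes the matrix position-specific.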

13 Transition structure of a profile HMM

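[Figure: transition structure of a profile HMM, running from Begin to End. For each specific position j of the profile there are three states:]

M_j: match state
I_j: insertion state
D_j: deletion state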
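The transition structure can be written down as a small lookup table. This sketch assumes the textbook layout described in Durbin et al., in which every state at position j may move to the next match state, the insertion state at j, or the next deletion state; software implementations may restrict some of these moves. The state naming and the function are illustrative assumptions.

```python
def profile_hmm_transitions(length):
    """Allowed transitions for a profile HMM with `length` match positions.

    ASSUMPTION: textbook layout in which states at position j can move to
    M_{j+1}, I_j or D_{j+1}. Begin acts as M_0 and End as M_{length+1}.
    """
    def match(j):
        if j == 0:
            return "Begin"
        if j == length + 1:
            return "End"
        return f"M{j}"

    transitions = {}
    for j in range(length + 1):
        targets = [match(j + 1), f"I{j}"]
        if j < length:                      # no deletion state after the last position
            targets.append(f"D{j + 1}")
        sources = [match(j), f"I{j}"]
        if j >= 1:                          # there is no D_0
            sources.append(f"D{j}")
        for source in sources:
            transitions[source] = list(targets)
    return transitions

transitions = profile_hmm_transitions(3)
for state, targets in transitions.items():
    print(state, "->", ", ".join(targets))
```

Each insertion state lists itself as a target, which is what lets an insertion of any length sit between two profile positions.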

14 How do we get this “family”?

 Multiple Sequence Alignment - many possible strategies to find and score possible alignments

 One common way: ClustalW

 a “progressive alignment” approach

 construct pairwise distances based on evolutionary distance

 essentially follow an agglomerative clustering approach, progressively aligning nodes in order of decreasing similarity

 additional heuristics make final alignment more accurate
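The "most similar first" ordering behind progressive alignment can be sketched with a simple agglomerative clustering of pairwise distances. Average linkage is used here as a stand-in for the guide-tree construction; it is a simplification, not ClustalW's actual procedure. The distance values below are hypothetical and only the sequence names come from the example data.

```python
def merge_order(names, dist):
    """Average-linkage clustering; returns the pairs of clusters in the order they are joined."""
    clusters = [(name,) for name in names]
    index = {name: i for i, name in enumerate(names)}

    def avg_dist(c1, c2):
        pairs = [(index[p], index[q]) for p in c1 for q in c2]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    order = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (smallest average distance).
        _, i, j = min((avg_dist(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        order.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return order

names = ["FOSB_MOUSE", "FOSB_HUMAN", "FOS_CHICK", "FOS_RAT", "FOS_MOUSE"]
# HYPOTHETICAL pairwise distances, chosen only to illustrate the ordering.
dist = [[0.00, 0.05, 0.60, 0.55, 0.55],
        [0.05, 0.00, 0.60, 0.55, 0.56],
        [0.60, 0.60, 0.00, 0.30, 0.31],
        [0.55, 0.55, 0.30, 0.00, 0.03],
        [0.55, 0.56, 0.31, 0.03, 0.00]]
order = merge_order(names, dist)
for left, right in order:
    print(left, "+", right)
```

In a progressive aligner, each join in this list would correspond to one alignment step: first two sequences, later a sequence against an existing alignment, and finally two alignments against each other.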

15 Possible Strategies

image from HMMER (on bioweb.pasteur.fr server)

16 One possible analysis approach

 Obtain multiple alignment using ClustalW
  - http://www.ebi.ac.uk/Tools/clustalw2
  - creates alignment files in various formats (some specialized for tree-viewing, for example)
  - can get FASTA format of alignment to pass to HMMER

 Obtain HMM model using HMMER
  - http://bioweb2.pasteur.fr/alignment/intro-en.html
  - creates a “consensus” sequence to summarize the profile (hmmbuild)
  - can use this profile to search database for similar sequences (hmmsearch)
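These notes run HMMER through a web server. If the HMMER command-line tools happen to be installed locally, the same two steps could be scripted roughly as below. This is a sketch under assumptions: it relies on the basic usage `hmmbuild <hmm file> <alignment file>` and `hmmsearch <hmm file> <sequence database>`, and all file names are hypothetical.

```python
import os
import shutil
import subprocess

def hmmer_commands(alignment_file, hmm_file, database_file):
    """Command lines for the two HMMER steps: build a profile, then search with it."""
    build = ["hmmbuild", hmm_file, alignment_file]     # alignment -> profile HMM
    search = ["hmmsearch", hmm_file, database_file]    # profile HMM vs. sequence database
    return build, search

# HYPOTHETICAL file names, used only for illustration.
alignment_file = "fos_aligned.fasta"
database_file = "proteins.fasta"
build_cmd, search_cmd = hmmer_commands(alignment_file, "fos.hmm", database_file)
print(" ".join(build_cmd))
print(" ".join(search_cmd))

# Run only if HMMER is on the PATH and the input files actually exist.
tools_present = shutil.which("hmmbuild") and shutil.which("hmmsearch")
files_present = os.path.exists(alignment_file) and os.path.exists(database_file)
if tools_present and files_present:
    subprocess.run(build_cmd, check=True)
    subprocess.run(search_cmd, check=True)
```

Without a local installation the script only prints the two command lines, which mirror the build-then-search order of the web interface.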

17 Example

 (Source: JalView example at ClustalW; same as HW5 data)

 Five sequences from different species (FASTA format)
  - Mouse (2)
  - Human
  - Chicken
  - Rat

>FOSB_MOUSE fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRR
RELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_HUMAN Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRR
RELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSYTSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANFVPTVTAISTSPDLQWLVQPTLISSVAPSQNRGHPYGVPAPAPPAAYSRPAVLKAPGGRGQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEEEKS
ALQAEIANLLKEKEKLEFILAAHRPACKMPEELRFSEELAAATALDLGAPSPAAAEEAFALPLMTEAPPAVPPKEPSGSGLELKAEPFDELLFSAGPREASRSVPDMDLPGASSFYASDWEPLGAGSGGELEPLCTPVVTCTPCPSTYTSTFVFTYPEADAFPSCAAAHRKGSSSN
EPSSDSLSSPTLLAL
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVVKTMSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEK
SALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTSLDLTGGLPEATTPESEEAFTLPLLNDPEPKPSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSGSFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFP
SCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMVKTVSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEK
SALQTEIANLLKEKEKLEFILAAHRPACKIPDDLGFPEEMSVASLDLTGGLPEASTPESEEAFTLPLLNDPEPKPSLEPVKSISNVELKAEPFDDFLFPASSRPSGSETSRSVPDVDLSGSFYAADWEPLHSNSLGMGPMVTELEPLCTPVVTCTPGCTTYTSSFVFTYPEADSFP
SCAAAHRKGSSSNEPSSDSLSSPTLLAL

18 ClustalW – quick example

set alignment options here

paste multiple sequences here (in FASTA format, e.g.)

click Run to start alignment

20 ClustalW – format output in Jalview (Java applet)

Here, color is by BLOSUM62 score

>FOS_RAT/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPD
LQWLVQPTLVSSVAPSQ------TRAPHPYGLPTPS-TGAYARAGVVKTMSGGRAQSIG------
------RRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLK
EKEKLEFILAAHRPACKIPNDLGFPEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKN
ISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSG--SFYAADWEPLHSSSLGMGPMVTELEPLCTPVV
TCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPD
LQWLVQPTLVSSVAPSQ------TRAPHPYGLPTQS-AGAYARAGMVKTVSGGRAQSIG------...

from “File” → “Output to Textbox” → “FASTA” format (others available)

paste FASTA format here (from ClustalW, for example)

http://bioweb2.pasteur.fr/alignment/intro-en.html

22 hmmbuild results

23 hmmbuild results (reformatted by hand)

HMM      A      C      D    ...     Q      R      S      T    ...
 15    2.35   4.27   3.26   ...   3.50   3.44   0.99   2.83   ...   15
 16    3.08   4.91   3.57   ...   3.09   0.88   3.16   3.34   ...   16
 17    2.66   0.81   4.20   ...   4.12   3.89   2.93   3.13   ...   17
 18    2.35   4.27   3.26   ...   3.50   3.44   0.99   2.83   ...   18
 19    2.35   4.27   3.26   ...   3.50   3.44   0.99   2.83   ...   19
...
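One way to read this table is to pick, at each position, the residue with the smallest score. The sketch below does that for the rows and columns shown above. It assumes each value is a negative natural-log probability (so smaller means more likely); that interpretation is an assumption about the file format, not something stated in the notes, and only the seven residue columns shown are considered.

```python
import math

# Values copied from the hand-reformatted hmmbuild table (only the columns shown there).
residues = ["A", "C", "D", "Q", "R", "S", "T"]
scores = {
    15: [2.35, 4.27, 3.26, 3.50, 3.44, 0.99, 2.83],
    16: [3.08, 4.91, 3.57, 3.09, 0.88, 3.16, 3.34],
    17: [2.66, 0.81, 4.20, 4.12, 3.89, 2.93, 3.13],
    18: [2.35, 4.27, 3.26, 3.50, 3.44, 0.99, 2.83],
    19: [2.35, 4.27, 3.26, 3.50, 3.44, 0.99, 2.83],
}

def most_likely_residue(row):
    """ASSUMPTION: each score is -ln(probability), so the smallest score wins."""
    best = min(range(len(row)), key=row.__getitem__)
    return residues[best], math.exp(-row[best])

consensus = "".join(most_likely_residue(scores[pos])[0] for pos in sorted(scores))
print(consensus)
for pos in sorted(scores):
    residue, prob = most_likely_residue(scores[pos])
    print(pos, residue, round(prob, 2))
```

Under that reading, positions 15 to 19 spell SRCSS, a stretch that is conserved across the five input sequences.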

24 HMMER – search for “family” members


27 hmmsearch results

                          12345678901234567890123456789012
alignfile_data          1 mmfqafagdyeasssrcssaspaadslsyyls
                          mmf++f++dyeasssrcssaspa+dslsyy+s
gp|BC029814|BC029814_1  1 MMFSGFNADYEASSSRCSSASPAGDSLSYYHS
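The three alignment rows can be summarized with a short count. In this sketch the middle row is read in the usual way for this kind of output: a letter marks a position where the database sequence matches the profile consensus, and “+” marks a non-identical but favorably scored residue. That reading of “+”, and the function name, are assumptions.

```python
consensus = "mmfqafagdyeasssrcssaspaadslsyyls"    # profile consensus row
match_line = "mmf++f++dyeasssrcssaspa+dslsyy+s"   # middle row
target = "MMFSGFNADYEASSSRCSSASPAGDSLSYYHS"       # database sequence row

def summarize(consensus, match_line, target):
    """Count identical positions, '+' positions, and the aligned length."""
    identical = sum(c.upper() == t.upper() for c, t in zip(consensus, target))
    similar = match_line.count("+")
    return identical, similar, len(target)

identical, similar, length = summarize(consensus, match_line, target)
print(identical, similar, length, identical / length)
```

For these 32 positions, 26 are identical to the consensus and the remaining 6 are marked “+”, giving about 81% identity.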

(more to profile than just SRCSS)

29 Summary

 Hidden Markov Models
  - use to describe sequence alignments
  - main idea: how does each portion of alignment represent the “family profile”

 Idea of profile: general “family” characteristics

 Online resources
  - ClustalW – perform multiple alignments
  - HMMER – build (& use) HMM model from multiple alignment
