Introduction to Hidden Markov Models and Profiles in Sequence Alignment
Utah State University – Spring 2010 STAT 5570: Statistical Bioinformatics Notes 6.3
1 References
Chapters 3-6 of Biological Sequence Analysis (Durbin et al., 2001)
Chapter 9 of A First Course in Probability (Ross, 1997)
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763
2 The “occasionally dishonest casino”
A casino usually uses a fair die, but sometimes (5% of the time) switches to a loaded die. Once using the loaded die, they usually keep using it (90% of the time).
How do you know which die you’re playing?
- not sure, but have to look at many plays to see pattern
- the “state” here is “hidden”
One possible representation of this “model”:
Transition probabilities:
  Fair -> Fair     0.95
  Fair -> Loaded   0.05
  Loaded -> Loaded 0.90
  Loaded -> Fair   0.10
Emission probabilities:
  Roll   Fair   Loaded
  1      1/6    1/10
  2      1/6    1/10
  3      1/6    1/10
  4      1/6    1/10
  5      1/6    1/10
  6      1/6    1/2
Possible partial sequence of rolls:
Roll ...62625335636616366646623...
Die  ...FFFFFFFFLLLLLLLLLFFFFFF...
L = “loaded” state (F = “fair” state)
3 Follow-up to a sequence alignment
Consider pairwise (or multiple) alignment
What does alignment mean? It possibly represents common ancestry
Possible questions:
- Does alignment describe some “family”?
- How can we describe its internal structure?
Can sometimes characterize these “family” structures as a Markov Chain
4 “Family” Example: CpG islands
In DNA: C & G are always matched up in helical structure
CpG (C followed by G) in sequence is rare, but is more frequent in promoters or start regions of a gene
General idea: CpG at start of gene
Possible question about our alignment: Does CpG frequency suggest we are in a promoter region?
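The CpG-frequency question can be made concrete with a small counting sketch. The observed/expected ratio used here is a common summary statistic and is an assumption of this example, not something these notes define; the function name `cpg_stats` and the toy sequence are hypothetical.

```python
def cpg_stats(seq):
    """Return (CpG count, observed/expected ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    # Count positions where C is immediately followed by G.
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    # Expected CpG count if C and G occurred independently (assumed baseline).
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return cpg, ratio

count, ratio = cpg_stats("ACGCGTTACG")  # toy sequence, not real data
print(count, round(ratio, 2))
```

A ratio well above what is seen in bulk sequence would hint at a CpG-rich region, but any cutoff is a modelling choice.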
5 Markov Chain Examples
- Occasionally dishonest casino: Probability you are using a fair die on one play depends on whether you were using a fair die on the previous play.
- Rain indicator: Chance of rain tomorrow depends on whether it rains today.
- Gambler’s Ruin: A gambler starts out with a fortune and plays a game repeatedly (with same prob. of winning or losing $1 each time) until her fortune reaches either 0 or M. Her sequence of fortunes is a Markov chain.
- CpG Island: Probability you are in a CpG island at one point in alignment depends on whether you were in a CpG island at the previous point.
6 Markov Chains – a little more formally
Sequence of random variables: $X_0, X_1, \ldots$
Set of possible values: $\{0, 1, \ldots, M\}$
Think of $X_t$ as state of process at time $t$
This sequence is a Markov Chain if:
$$P\{X_{t+1}=j \mid X_t=i, X_{t-1}=i_{t-1}, \ldots, X_1=i_1, X_0=i_0\} = P\{X_{t+1}=j \mid X_t=i\} = P_{ij}$$
So state at time $t+1$ depends only on state at time $t$
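A minimal sketch of the Markov property, using the casino transition probabilities from earlier in these notes. The function name `simulate_chain` and the seed are assumptions for illustration only.

```python
import random

# P[i][j] = P(X_{t+1} = j | X_t = i), from the casino example:
# F = fair die, L = loaded die.
P = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}

def simulate_chain(P, start, n_steps, rng):
    """Simulate n_steps transitions; each draw looks only at the current state."""
    path = [start]
    for _ in range(n_steps):
        current = path[-1]
        states = list(P[current])
        weights = [P[current][s] for s in states]
        path.append(rng.choices(states, weights=weights)[0])
    return path

rng = random.Random(0)  # arbitrary seed, only for reproducibility
path = simulate_chain(P, "F", 30, rng)
print("".join(path))
```

Nothing earlier than `path[-1]` is consulted when drawing the next state, which is exactly the defining condition above.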
7 Hidden Markov Model (HMM) - vocabulary
State: in which “family” the process is (CpG vs. not, fair vs. loaded die, etc.)
$\pi$ - the “path”; the $i$th state in path: $\pi_i$
Symbol: observed “outcome” $x$ (sequence, die, etc.) from unknown state
Transition probabilities: $a_{kh} = P\{\pi_i = h \mid \pi_{i-1} = k\}$
Emission probabilities: $e_k(b) = P\{x_i = b \mid \pi_i = k\}$
Joint probability of observed sequence $x$ & state sequence $\pi$:
$$P(x, \pi) = a_{0\pi_1} \prod_i e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
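As a worked instance of the joint probability, the sketch below evaluates log P(x, π) for the casino model. The initial probabilities `start` (playing the role of $a_{0k}$) are assumed to be 0.5 each, since the notes do not give them, and the final transition to an end state is left out.

```python
import math

start = {"F": 0.5, "L": 0.5}  # a_{0k}: assumed, not given in the notes
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2},
}

def joint_log_prob(x, path):
    """log P(x, pi): start term, then one emission and one transition per step.

    The transition out of the last state (to an end state) is omitted.
    """
    if len(x) != len(path):
        raise ValueError("x and path must have the same length")
    logp = math.log(start[path[0]])
    for i, (symbol, state) in enumerate(zip(x, path)):
        logp += math.log(emit[state][symbol])
        if i + 1 < len(path):
            logp += math.log(trans[state][path[i + 1]])
    return logp

# A run of sixes is better explained by the loaded die than by the fair one.
print(joint_log_prob("666666", "LLLLLL") > joint_log_prob("666666", "FFFFFF"))
```

Working in logs avoids underflow when the sequence is long.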
8 Estimating the HMM path
Several approaches to find the “most probable state path” $\pi^* = \arg\max_\pi P(x, \pi)$
- Viterbi algorithm – focuses on identifying the most probable state path
- Forward algorithm, Backward algorithm – focus more on posterior state probabilities (position-specific prob. that observation came from state k given observed sequence)
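Both approaches can be sketched for the casino model. This is an illustration under stated assumptions, not a reference implementation: the initial distribution `start` is assumed to be 0.5/0.5, no end state is modelled, and the function names are hypothetical. `viterbi` works in log space; `posterior` uses scaled forward and backward passes.

```python
import math

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}  # assumed initial distribution
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def viterbi(x):
    """Most probable state path for x, computed in log space."""
    v = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    back = []
    for symbol in x[1:]:
        row, ptr = {}, {}
        for h in states:
            best = max(states, key=lambda k: v[-1][k] + math.log(trans[k][h]))
            ptr[h] = best
            row[h] = v[-1][best] + math.log(trans[best][h]) + math.log(emit[h][symbol])
        v.append(row)
        back.append(ptr)
    path = [max(states, key=lambda k: v[-1][k])]
    for ptr in reversed(back):  # trace the back-pointers from the end
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

def posterior(x):
    """P(pi_i = k | x) at each position, via scaled forward and backward tables."""
    f, scale = [], []
    for i, symbol in enumerate(x):
        if i == 0:
            row = {k: start[k] * emit[k][symbol] for k in states}
        else:
            row = {h: emit[h][symbol] * sum(f[-1][k] * trans[k][h] for k in states)
                   for h in states}
        s = sum(row.values())
        scale.append(s)
        f.append({k: row[k] / s for k in states})
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 1, 0, -1):
        b.insert(0, {k: sum(trans[k][h] * emit[h][x[i]] * b[0][h] for h in states)
                        / scale[i] for k in states})
    return [{k: f[i][k] * b[i][k] for k in states} for i in range(len(x))]

rolls = "62625335636616366646623"  # the partial roll sequence from these notes
print(viterbi(rolls))
print([round(p["L"], 2) for p in posterior(rolls)])
```

The Viterbi output is a single best path, while the posterior gives a per-position probability of the loaded die, which is often more informative when several paths are nearly equally likely.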
9 Estimation when paths unknown: Baum-Welch
An iterative procedure estimates the transition and emission probabilities
A special case of the EM algorithm (a general approach to deal with maximum likelihood with missing/incomplete/latent data) - think of missing covariate: state
Can also consider an approximation based on iterations of the Viterbi algorithm
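One Baum-Welch iteration for a single observed sequence can be sketched as follows. Simplifying assumptions made here: the initial distribution is held fixed, no pseudocounts are added, no end state is modelled, and the starting parameter guesses are arbitrary. Function names are hypothetical.

```python
import math

def forward_backward(x, states, start, trans, emit):
    """Scaled forward and backward tables, scaling factors, and log P(x)."""
    f, scale = [], []
    for i, sym in enumerate(x):
        if i == 0:
            row = {k: start[k] * emit[k][sym] for k in states}
        else:
            row = {h: emit[h][sym] * sum(f[-1][k] * trans[k][h] for k in states)
                   for h in states}
        s = sum(row.values())
        scale.append(s)
        f.append({k: row[k] / s for k in states})
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 1, 0, -1):
        b.insert(0, {k: sum(trans[k][h] * emit[h][x[i]] * b[0][h] for h in states)
                        / scale[i] for k in states})
    return f, b, scale, sum(math.log(s) for s in scale)

def baum_welch_step(x, states, start, trans, emit):
    """E-step (expected counts) then M-step (renormalise); returns old log P(x)."""
    f, b, scale, loglik = forward_backward(x, states, start, trans, emit)
    A = {k: {h: 0.0 for h in states} for k in states}          # expected transitions
    E = {k: {s: 0.0 for s in emit[k]} for k in states}         # expected emissions
    for i in range(len(x)):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k]
            if i + 1 < len(x):
                for h in states:
                    A[k][h] += (f[i][k] * trans[k][h] * emit[h][x[i + 1]]
                                * b[i + 1][h] / scale[i + 1])
    new_trans = {k: {h: A[k][h] / sum(A[k].values()) for h in states} for k in states}
    new_emit = {k: {s: E[k][s] / sum(E[k].values()) for s in E[k]} for k in states}
    return new_trans, new_emit, loglik

states = ("F", "L")
start = {"F": 0.5, "L": 0.5}                                   # assumed, held fixed
trans = {"F": {"F": 0.8, "L": 0.2}, "L": {"F": 0.2, "L": 0.8}}  # arbitrary initial guess
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.12 for r in "12345"}, "6": 0.4}}         # arbitrary initial guess
rolls = "62625335636616366646623"  # the partial roll sequence from these notes
logliks = []
for _ in range(10):
    trans, emit, ll = baum_welch_step(rolls, states, start, trans, emit)
    logliks.append(ll)
print([round(v, 3) for v in logliks])
```

As with EM in general, the log-likelihood should not decrease from one iteration to the next, but with so short a sequence the fitted values mostly illustrate the mechanics rather than recover the true model.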
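10 Pairwise Alignments as HMMs
Recall notation: x & y are sequences to be aligned, with gap opening penalty of d and gap extension penalty of e
Let (+w,+v) here represent change in sequence position, with M=match, and X,Y=insertion (gap) in x or y
[Figure: state diagram of the pair HMM, summarized below]
States:
  State   Position change   Emission probability
  M       (+1,+1)           $p_{x_i y_j}$
  X       (+1,+0)           $q_{x_i}$
  Y       (+0,+1)           $q_{y_j}$
Transitions (probability; corresponding score in the penalty-based alignment):
  M -> M   $1-2\delta$       $s(x_i,y_j)$
  M -> X   $\delta$          $-d$
  M -> Y   $\delta$          $-d$
  X -> X   $\varepsilon$     $-e$
  Y -> Y   $\varepsilon$     $-e$
  X -> M   $1-\varepsilon$   $s(x_i,y_j)$
  Y -> M   $1-\varepsilon$   $s(x_i,y_j)$
States: insertion (X or Y), match (M)
$\delta$ = probability of moving to a specific insertion state
$\varepsilon$ = prob. of staying in an insertion state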
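To make the three-state model concrete, the sketch below scores a given gapped alignment under it. Assumptions: the path is treated as if it began in M, begin/end states are ignored, and the emission tables `p` and `q` are toy DNA values invented for illustration (not estimated from data).

```python
import math

def pair_hmm_log_prob(aligned_x, aligned_y, delta, eps, p, q):
    """Log-probability of one gapped alignment under the M/X/Y pair HMM.

    Column with two residues -> M; gap in y -> X (emits x_i); gap in x -> Y.
    """
    trans = {
        "M": {"M": 1 - 2 * delta, "X": delta, "Y": delta},
        "X": {"M": 1 - eps, "X": eps},
        "Y": {"M": 1 - eps, "Y": eps},
    }
    logp, prev = 0.0, "M"  # assumption: behave as if the path started in M
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a != "-" and b != "-":
            state, e = "M", p[(a, b)]
        elif b == "-":
            state, e = "X", q[a]
        else:
            state, e = "Y", q[b]
        if state not in trans[prev]:
            raise ValueError(f"transition {prev}->{state} is not in this model")
        logp += math.log(trans[prev][state]) + math.log(e)
        prev = state
    return logp

# Toy emission tables (hypothetical values): matches favoured, uniform background.
p = {(a, b): 0.2 if a == b else 0.2 / 12 for a in "ACGT" for b in "ACGT"}
q = {a: 0.25 for a in "ACGT"}
print(pair_hmm_log_prob("A-C", "AGC", delta=0.1, eps=0.3, p=p, q=q))
```

Finding the best alignment, rather than scoring a given one, would use a Viterbi recursion over the same three states.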
11 What can be done with pairwise HMM
- Build HMM for a random (non-matched) model
- Evaluate likelihood of matched model by considering log-odds of matched vs. random models
- Search for other (sub-optimal) alignments
- Consider posterior probability of alignment: $P(x_i \diamond y_j \mid x, y)$, where $\diamond$ means “is aligned with”
12 Using HMMs to describe a “family”
Suppose we have an alignment of multiple sequences – we can model their “relationship” as a family of sequences – call this the family’s “profile”
PSSM – position-specific score matrix
- estimate this to describe this particular profile (e.g., should ‘A’ count for more at a particular position in the alignment?)
Allow for insertions and deletions, where “cost” could also be position-specific
Use this profile to describe the alignment and look for other similar sequences
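A PSSM without insertions or deletions can be sketched as column-wise log-odds scores. The pseudocount scheme, the uniform background, and the toy DNA alignment are all assumptions of this example; the function names are hypothetical.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet, background, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (gap characters skipped)."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log((counts[a] + pseudocount) / total / background[a])
            for a in alphabet
        })
    return pssm

def score(pssm, seq):
    """Sum of position-specific scores for an ungapped sequence."""
    return sum(col[a] for col, a in zip(pssm, seq))

alignment = ["ACGT", "ACGA", "ACCT"]        # toy alignment, not real data
background = {a: 0.25 for a in "ACGT"}      # assumed uniform background
pssm = build_pssm(alignment, "ACGT", background)
print(round(score(pssm, "ACGT"), 3), round(score(pssm, "TTTT"), 3))
```

The same letter scores differently in different columns, which is the sense in which ‘A’ can count for more at one position than another.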
13 Transition structure of a profile HMM
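[Figure: profile HMM transition diagram running from Begin to End through match states Mj, with insertion states Ij and deletion states Dj]
j: specific position of profile
Mj: match state; Ij: insertion state; Dj: deletion state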
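The transition structure can be enumerated programmatically. This sketch assumes the fully connected layout in which every match, insertion, and deletion state can reach the next match state, the current insertion state, and the next deletion state; some implementations omit the insertion-deletion transitions. The function name is hypothetical.

```python
def profile_transitions(length):
    """List allowed (source, target) transitions for `length` match positions.

    Position 0 stands for Begin (with its own insertion state I0).
    """
    edges = []
    for j in range(length + 1):
        sources = ["Begin" if j == 0 else f"M{j}", f"I{j}"]
        if j > 0:
            sources.append(f"D{j}")
        next_match = "End" if j == length else f"M{j + 1}"
        for s in sources:
            edges.append((s, next_match))       # move on to the next position
            edges.append((s, f"I{j}"))          # insert extra residues here
            if j < length:
                edges.append((s, f"D{j + 1}"))  # skip the next position
    return edges

edges = profile_transitions(3)
print(len(edges))
```

The number of transitions grows linearly with profile length, which is what keeps position-specific gap costs tractable.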
14 How do we get this “family”?
Multiple Sequence Alignment - many possible strategies to find and score possible alignments
One common way: ClustalW
a “progressive alignment” approach
construct pairwise distances based on evolutionary distance
essentially follow an agglomerative clustering approach, progressively aligning nodes in order of decreasing similarity
additional heuristics make final alignment more accurate
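The agglomerative idea behind progressive alignment can be sketched with average-linkage clustering on a distance matrix. This is not ClustalW's actual guide-tree method, only an illustration of joining the most similar items first; the labels and distances are made up.

```python
def merge_order(names, dist):
    """Average-linkage agglomerative clustering; returns merges in order."""
    clusters = {n: (n,) for n in names}  # cluster label -> tuple of members
    d = {frozenset(pair): v for pair, v in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(
            ((p, q) for p in clusters for q in clusters if p < q),
            key=lambda pq: d[frozenset(pq)],
        )
        new = f"({a},{b})"
        na, nb = len(clusters[a]), len(clusters[b])
        for c in clusters:
            if c not in (a, b):
                # Size-weighted average distance from the merged cluster to c.
                d[frozenset((new, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        clusters[new] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges

# Made-up distances between four hypothetical sequences a-d.
dist = {("a", "b"): 0.1, ("a", "c"): 0.5, ("a", "d"): 0.6,
        ("b", "c"): 0.5, ("b", "d"): 0.6, ("c", "d"): 0.2}
merges = merge_order(["a", "b", "c", "d"], dist)
print(merges)
```

In a progressive aligner, each merge would correspond to aligning two sequences (or two existing sub-alignments) to each other.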
15 Possible Strategies
[Figure: image from HMMER (on bioweb.pasteur.fr server)]
16 One possible analysis approach
Obtain multiple alignment using ClustalW
http://www.ebi.ac.uk/Tools/clustalw2
- creates alignment files in various formats
- some specialized for tree-viewing, for example
- can get FASTA format of alignment to pass to HMMER
Obtain HMM model using HMMER
http://bioweb2.pasteur.fr/alignment/intro-en.html
- creates a “consensus” sequence to summarize the profile (hmmbuild)
- can use this profile to search database for similar sequences (hmmsearch)
17 Example
(Source: JalView example at ClustalW; same as HW5 data)
Five proteins from different species (FASTA format): Mouse (2), Human, Chicken, Rat
>FOSB_MOUSE Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRR
RELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_HUMAN Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRR
RELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSYTSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANFVPTVTAISTSPDLQWLVQPTLISSVAPSQNRGHPYGVPAPAPPAAYSRPAVLKAPGGRGQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEEEKS
ALQAEIANLLKEKEKLEFILAAHRPACKMPEELRFSEELAAATALDLGAPSPAAAEEAFALPLMTEAPPAVPPKEPSGSGLELKAEPFDELLFSAGPREASRSVPDMDLPGASSFYASDWEPLGAGSGGELEPLCTPVVTCTPCPSTYTSTFVFTYPEADAFPSCAAAHRKGSSSN
EPSSDSLSSPTLLAL
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVVKTMSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEK
SALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTSLDLTGGLPEATTPESEEAFTLPLLNDPEPKPSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSGSFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFP
SCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMVKTVSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEK
SALQTEIANLLKEKEKLEFILAAHRPACKIPDDLGFPEEMSVASLDLTGGLPEASTPESEEAFTLPLLNDPEPKPSLEPVKSISNVELKAEPFDDFLFPASSRPSGSETSRSVPDVDLSGSFYAADWEPLHSNSLGMGPMVTELEPLCTPVVTCTPGCTTYTSSFVFTYPEADSFP
SCAAAHRKGSSSNEPSSDSLSSPTLLAL
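For working with such files outside the web tools, a FASTA reader can be sketched in a few lines. The function name is hypothetical, and the example text uses only the first residues of two of the sequences above, truncated for brevity.

```python
def parse_fasta(text):
    """Return {identifier: sequence}; identifier is the first word after '>'."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {k: "".join(v) for k, v in records.items()}

# Truncated toy input (first residues only), for illustration.
example = """>FOSB_MOUSE Protein fosB
MFQAFPGDYD
SGSRCSSSPS
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADY
"""
seqs = parse_fasta(example)
print({name: len(s) for name, s in seqs.items()})
```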
18 ClustalW – quick example
set alignment options here
paste multiple sequences here (in FASTA format, e.g.)
click Run to start alignment
20 ClustalW – format output in Jalview (Java applet)
Here, color is by BLOSUM62 score
>FOS_RAT/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPD
LQWLVQPTLVSSVAPSQ------TRAPHPYGLPTPS-TGAYARAGVVKTMSGGRAQSIG------
------RRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLK
EKEKLEFILAAHRPACKIPNDLGFPEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKN
ISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSG--SFYAADWEPLHSSSLGMGPMVTELEPLCTPVV
TCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPD
LQWLVQPTLVSSVAPSQ------TRAPHPYGLPTQS-AGAYARAGMVKTVSGGRAQSIG------
...
from “File” > “Output to Textbox” > “FASTA” format (others available)
21
paste FASTA format here (from ClustalW, for example)
http://bioweb2.pasteur.fr/alignment/intro-en.html
22 hmmbuild results
…
…
23 hmmbuild results (reformatted by hand)
HMM      A      C      D    ...     Q      R      S      T    ...
...
 15    2.35   4.27   3.26   ...   3.50   3.44   0.99   2.83   ...   15
 16    3.08   4.91   3.57   ...   3.09   0.88   3.16   3.34   ...   16
 17    2.66   0.81   4.20   ...   4.12   3.89   2.93   3.13   ...   17
 18    2.35   4.27   3.26   ...   3.50   3.44   0.99   2.83   ...   18
 19    2.35   4.27   3.26   ...   3.50   3.44   0.99   2.83   ...   19
...
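One way to read these numbers, assuming they are negative natural-log emission probabilities (an assumption about this output, not something stated in these notes): smaller values mean more probable residues. The sketch uses only the seven columns shown for profile position 16.

```python
import math

# Values shown for profile position 16; the other amino-acid columns are elided.
scores_pos16 = {"A": 3.08, "C": 4.91, "D": 3.57, "Q": 3.09,
                "R": 0.88, "S": 3.16, "T": 3.34}

# ASSUMPTION: each score is -ln(probability) of emitting that residue.
probs = {aa: math.exp(-s) for aa, s in scores_pos16.items()}
best = min(scores_pos16, key=scores_pos16.get)
print(best, round(probs[best], 3))
```

Under that reading, positions 15 to 19 favour S, R, C, S, S in turn.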
24 HMMER – search for “family” members
27 hmmsearch results
…
28
                         12345678901234567890123456789012
alignfile_data         1 mmfqafagdyeasssrcssaspaadslsyyls
                         mmf++f++dyeasssrcssaspa+dslsyy+s
gp|BC029814|BC029814_1 1 MMFSGFNADYEASSSRCSSASPAGDSLSYYHS
(more to profile than just SRCSS)
29 Summary
Hidden Markov Models can be used to describe sequence alignments
- main idea: how does each portion of alignment represent the “family profile”
Idea of profile: general “family” characteristics
Online resources
- ClustalW – perform multiple alignments
- HMMER – build (& use) HMM model from multiple alignment