Hidden Markov Models
[Figure: a profile HMM]

What is an HMM?
• It's a model. It condenses information.
• Models 1D discrete data.
• Directed graph.
• Nodes emit discrete data.
• Edges transition between nodes.
• What's hidden about it? Node identity.
• Who is Markov?

Markov processes
A Markov process is any process where the next item in the sequence depends only on the current item. The dimension can be time, sequence position, etc.

Modeling protein secondary structure using Markov chains
"States" connected by "transitions":
[State diagram: three states H, E, L with transitions among them]
H = helix, E = extended (strand), L = loop

Setting the parameters of a Markov model from data
Secondary structure data:

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLLEEEEELLLLLLLLLLLEEEEEEEEELLLLLEEEEEEEEELL
LLLLLLEEEEEELLLLLEEEEEELLLLLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLLLLLLEEEELLLLEEEELLLLEEE
EEEEELLLLLLEEEEEEEEELLLLLLEELLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLEEEEEELLLLLLLLLLEEEEEEL
LLLLEEELLLLLLLLLLLLLEEEEEEEEELLLEEEEEELLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHLLLLLLLEEL
HHHHHHHHHHLLLLLLHHHHHHHHHHHLLLLLLLELHHHHHHHHHHHHLLLLLHHHHHHHHHHHHHLLLLLEEEL
HHHHHHHHHHLLLLLLHHHHHHHHHHEELLLLLLHHHHHHHHHHHLLLLLLLHHHHHHHHHHHHHHHHHHHHH
HHHHHHLLLLLLHHHHHHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHLLLLEEEELLLLLLLLLLLLLLLLEEEEL
LLLHHHHHHHHHHHHHHHLLLLLLLLEELLLLLHHHHHHHHHHHHHHLLLLLLEEEEELLLLLLLLLLHHHHHHHH
HHHHHHHHHHHHHHHLLLLLHHHHHHHHHLLLLHHHHHHHLLHHHHHHHHHHHHHHHHHHHH

Count the pairs to get the transition probability:
P(L|E) = P(EL)/P(E) = counts(EL)/counts(E)
counts(E) = counts(EE) + counts(EL) + counts(EH)
Therefore: P(E|E) + P(L|E) + P(H|E) = 1.

A transition matrix, P(q_t | q_{t-1}):

        H     E     L
  H    .93   .01   .06
  E    .01   .80   .19
  L    .04   .06   .90

[State diagram: H, E, L with all nine transitions labeled P(H|H), P(E|H), P(E|E), P(H|E), P(L|E), P(L|H), P(H|L), P(E|L), P(L|L)]

**This is a "first-order" MM. Transition probabilities depend on only the current state.

P(S|λ): the probability of a sequence, given the model
Using the transition matrix above, with initial probability P(H) = .33:
P("HHEELL" | λ) = P(H) P(H|H) P(E|H) P(E|E) P(L|E) P(L|L)
               = (.33)(.93)(.01)(.80)(.19)(.90)
               = 4.2E-4
P("HHHHHH" | λ) = 0.69    common protein secondary structure
P("HEHEHE" | λ) = 1E-6    not protein secondary structure
Probability discriminates between realistic and unrealistic sequences.

What is the maximum likelihood model given a dataset of sequences?

Dataset:     Pair counts:      Maximum likelihood model:
HHEELL          H  E  L            H    E    L
HHEELL       H  1  1  0         H  0.5  0.5  0
HHEELL       E  0  1  1         E  0    0.5  0.5
HHEELL       L  0  0  1         L  0    0    1.0
HHEELL
HHEELL

Count the state pairs. Normalize by row. (A code sketch of this count-and-normalize procedure appears at the end of this section.)

Is this model too simple?
[Figure: frequency histograms of helix length (1-10 residues), comparing synthetic helix length data generated from this model with real helix length data from L. Pal et al., J. Mol. Biol. (2003) 326, 273-291]
"A model should be as simple as possible but not simpler" --Einstein

A pseudo-higher-order HMM
A Markov chain for proteins where helices are always exactly 4 residues long:
[State diagram: a chain of four H states leading out to L and E]
A Markov chain for proteins where helices are always at least 4 residues long:
[State diagram: a chain of four H states, the last with a self-loop, leading out to L and E]
Can you draw a Markov chain where helices are always a multiple of 4 long?
[State diagram: H1 → H2 → H3 → H4, with H4 returning to H1 or exiting to E and L]

        H1   H2   H3   H4   E    L
  H1    0    1    0    0    0    0
  H2    0    0    1    0    0    0
  H3    0    0    0    1    0    0
  H4   0.5   0    0    0   0.1  0.4
  E    0.2   0    0    0   0.7  0.1
  L    0.2   0    0    0   0.2  0.6

Calculate the probability of EHHHHHLLE.

Example application: a Markov chain for CpG islands
[State diagram: a "saturated" 4-state MM over A, C, G, T with all transitions present]
$P(\mathrm{ATCGCGTA}\ldots) = \pi_A\, a_{AT}\, a_{TC}\, a_{CG}\, a_{GC}\, a_{CG}\, a_{GT}\, a_{TA}\ldots$

CpG islands
[Diagram: DNA as alternating regions, "+" (not methylated) and "-" (methylated)]
DNA is methylated on C to protect against endonucleases. Using mass spectrometry we can find regions of DNA that are methylated and regions that are not. Regions that are protected from methylation may be functionally important, e.g. transcription factor binding sites.
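To make the count-and-normalize procedure and the P(S|λ) calculation concrete, here is a minimal Fortran sketch. The program layout, names, and the uniform initial probabilities are illustrative assumptions, not from the slides; it estimates the H/E/L transition matrix from the six-sequence HHEELL dataset above and then scores a query sequence.

   ! Sketch: maximum-likelihood estimation of a first-order Markov chain
   ! over states H, E, L, then scoring a sequence with the estimated model.
   program markov_chain_demo
      implicit none
      character(len=*), parameter :: states = "HEL"
      character(len=*), parameter :: query  = "HHEELL"
      character(len=6) :: seqs(6)
      integer :: counts(3,3), s, i, r, c
      real    :: a(3,3), p

      seqs = "HHEELL"                     ! the toy dataset: six identical sequences

      ! Count adjacent state pairs within each sequence
      counts = 0
      do s = 1, size(seqs)
         do i = 1, len(seqs(s)) - 1
            r = index(states, seqs(s)(i:i))        ! row: current state
            c = index(states, seqs(s)(i+1:i+1))    ! column: next state
            counts(r,c) = counts(r,c) + 1
         end do
      end do

      ! Normalize each row to get transition probabilities P(next|current)
      a = 0.0
      do r = 1, 3
         if (sum(counts(r,:)) > 0) a(r,:) = real(counts(r,:)) / sum(counts(r,:))
      end do
      do r = 1, 3
         print '(a1,3f6.2)', states(r:r), a(r,:)
      end do

      ! P(S|lambda) = P(q1) * product of transition probabilities,
      ! assuming uniform initial probabilities (1/3 each)
      p = 1.0 / 3.0
      do i = 2, len(query)
         p = p * a(index(states, query(i-1:i-1)), index(states, query(i:i)))
      end do
      print '(a,es10.3)', "P(query|lambda) = ", p
   end program markov_chain_demo

Run on this dataset, the printed matrix reproduces the maximum likelihood model above (row H: .50 .50 .00; row E: .00 .50 .50; row L: .00 .00 1.00).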
Over the course of evolution, methylated CpG's get mutated to TpG's:
NNNCGNNN → NNNTGNNN

Using Markov chains for discrimination: CpG islands in human chromosome sequences
[Diagram: a chromosome as alternating CpG-rich and CpG-poor regions, ...+ - + -...]
CpG rich = "+", CpG poor = "-"

  +    A      C      G      T           -    A      C      G      T
  A  0.180  0.274  0.426  0.120         A  0.300  0.205  0.285  0.210
  C  0.171  0.368  0.274  0.188         C  0.322  0.298  0.078  0.302
  G  0.161  0.339  0.385  0.125         G  0.248  0.246  0.298  0.208
  T  0.079  0.355  0.384  0.182         T  0.177  0.239  0.292  0.292

P(CGCG|+) = π_C (0.274)(0.339)(0.274) = π_C (0.0255)
P(CGCG|-) = π_C (0.078)(0.246)(0.078) = π_C (0.0015)

From Durbin, Eddy, Krogh and Mitchison, "Biological Sequence Analysis" (1998), p. 50.

Rabiner's notation
$a_{yx} = P(x \mid y) = \frac{P(y,x)}{P(y)} = \frac{F(y,x)}{F(y)}$ ... the conditional probability of x given y, or the transition probability of state y to state x.
$\pi_x = P(x) = F(x)/N$ ... the unconditional probability of being in state x (used to start a state pathway).
$b_x(y) = P(y \mid x)$ ... the conditional probability of emitting character y given state x.

Comparing two MMs: the log likelihood ratio (LLR)
$\mathrm{LLR} = \log \prod_{i=1}^{L} \frac{a^{+}_{x_{i-1}x_i}}{a^{-}_{x_{i-1}x_i}} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1}x_i}}{a^{-}_{x_{i-1}x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1}x_i}$

Log-likelihood ratios for transitions (base-2 logarithms):

  β    A       C       G       T
  A  -0.740   0.419   0.580  -0.803
  C  -0.913   0.302   1.812  -0.685
  G  -0.624   0.461   0.331  -0.730
  T  -1.169   0.573   0.393  -0.679

Sum the LLRs. If the result is positive, it's a CpG island; otherwise not.
LLR(CGCG) = 1.812 + 0.461 + 1.812 = 4.085 → yes
(A code sketch of this computation appears at the end of this section.)

Combining two Markov chains to make a hidden Markov model
A hidden Markov model can have multiple paths for a sequence.
[Diagram: the "+" model and the "-" model, each with states A, C, G, T, plus transitions between the two models]
In hidden Markov models (HMMs), there is no one-to-one correspondence between the state and the emitted symbol.

Probability of a sequence using an HMM
Different state sequences can produce the same emitted sequence.
Nucleotide sequence (S): C G C G
State sequences (Q):     P(sequence, path):
  C+ G+ C+ G+            $\pi_{C+}\, a_{C+G+}\, a_{G+C+}\, a_{C+G+}$
  C- G- C- G-            $\pi_{C-}\, a_{C-G-}\, a_{G-C-}\, a_{C-G-}$
  C+ G+ C- G-            $\pi_{C+}\, a_{C+G+}\, a_{G+C-}\, a_{C-G-}$
  C+ G- C- G+            $\pi_{C+}\, a_{C+G-}\, a_{G-C-}\, a_{C-G+}$
  etc.                   etc.

$P(\mathrm{CGCG} \mid \lambda) = \sum_{\text{all paths } Q} P(\mathrm{CGCG}, Q)$

Each state sequence has a probability; the sum over all state sequences that emit CGCG gives P(CGCG).

Three HMM algorithms
1. The Viterbi algorithm: get the optimal state pathway. Maximum joint probability.
2. The Forward/Backward algorithm: get the probability of each state at each position. Sum over all joint probabilities.
3. Expectation/Maximization: refine the parameters of the model using the data.

Back to secondary structure prediction...
Parallel HMM: emits secondary structure and amino acid
[Diagram: states H, E, L; each state carries an emission distribution b, e.g. b_H(i), drawn as a "marble bag" of amino acids]
Each state emits one amino acid from its profile (the marble bag) on each visit. The marble bag represents a probability distribution over amino acids, b: a set of probabilities (0 ≤ p ≤ 1) that sum to 1. States emit both an amino acid and a secondary structure state.
Given an amino acid sequence, what is the most probable state sequence (the secondary structure)?

HMM data structure for a parallel HMM in Fortran...

   module hmm_structure        ! module wrapper added so the types compile as a unit
      implicit none
      type HMMNODE
         integer :: id
         type (HMMEDGE), pointer :: a(:) => null()  ! outgoing edges (mutually referential
                                                    ! pointer components need a Fortran 2008 compiler)
         real :: b(20)                              ! amino acid emission profile
         real, pointer :: emit(:) => null()         ! optional extra emission; a deferred-shape
                                                    ! component must be a pointer or allocatable
         logical :: emitting = .false.
      end type HMMNODE
      type HMMEDGE
         integer :: id
         real :: p                                  ! transition probability
         type (HMMNODE), pointer :: q => null()     ! destination state
      end type HMMEDGE
      type (HMMNODE), pointer :: hmm_root => null() ! the original's HMMSTATE is undefined;
                                                    ! HMMNODE is the type that was declared
   end module hmm_structure

A linked list... hmm_root should be the "begin" state, with hmm_root%emitting set to .false. If emitting is .true., the state emits amino acid profile b, and optionally something else called emit(:).
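A short usage sketch of the types above; the allocation scheme and subroutine name are hypothetical, not from the slides:

   ! Hypothetical usage: create the non-emitting "begin" state and give it
   ! a single outgoing edge to the first emitting state. Error handling omitted.
   subroutine init_hmm()
      use hmm_structure
      allocate (hmm_root)
      hmm_root%id = 0
      hmm_root%emitting = .false.      ! the begin state emits nothing
      allocate (hmm_root%a(1))         ! one outgoing transition
      hmm_root%a(1)%p = 1.0
      allocate (hmm_root%a(1)%q)       ! destination: first emitting state
      hmm_root%a(1)%q%id = 1
      hmm_root%a(1)%q%emitting = .true.
   end subroutine init_hmm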
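Returning to the CpG discrimination above, before the in-class exercise: here is a minimal sketch of the LLR sum, with the β table transcribed from the slide (the program layout is an illustrative assumption). Summing β over adjacent pairs reproduces LLR(CGCG) = 4.085.

   ! Sketch: sum beta over adjacent nucleotide pairs to get the LLR.
   ! A positive total classifies the sequence as a CpG island.
   program cpg_llr
      implicit none
      character(len=*), parameter :: nts = "ACGT"
      character(len=*), parameter :: seq = "CGCG"   ! try the exercise sequence too
      real :: beta(4,4), llr
      integer :: i
      ! beta table from the slides; transpose() lets us list it row by row
      beta = transpose( reshape( (/ &
         -0.740,  0.419,  0.580, -0.803, &   ! from A
         -0.913,  0.302,  1.812, -0.685, &   ! from C
         -0.624,  0.461,  0.331, -0.730, &   ! from G
         -1.169,  0.573,  0.393, -0.679 /), (/4,4/) ) )
      llr = 0.0
      do i = 1, len(seq) - 1
         llr = llr + beta(index(nts, seq(i:i)), index(nts, seq(i+1:i+1)))
      end do
      print '(a,f7.3)', "LLR = ", llr       ! CGCG gives 4.085
   end program cpg_llr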
In-class exercise: what's the LLR?
What is the LLR that this sequence is a CpG island?
ATGTCTTAGCGCGATCAGCGAAAGCCACG

  β    A       C       G       T
  A  -0.740   0.419   0.580  -0.803
  C  -0.913   0.302   1.812  -0.685
  G  -0.624   0.461   0.331  -0.730
  T  -1.169   0.573   0.393  -0.679

$\mathrm{LLR} = \sum_{i=1}^{L} \beta_{x_{i-1}x_i} =$ _______________

HMM: assigning the states given the sequence is not as easy.
Typically, when using an HMM, the task is to determine the optimal state pathway given the sequence. The state pathway provides some predictive feature, such as secondary structure, splice site/not splice site, CpG island/not CpG island, etc.
In principle, we can do this task by trying all state pathways Q and choosing the optimal one. In practice, this is usually impossible, because the number of pathways increases as the number of states raised to the power of the sequence length, i.e. O(n^m) for n states and length m. How do we do it, then?

Maximize the joint probability of a sequence and pathway
Q = {q_1, q_2, q_3, ..., q_T} = sequence of Markov states, or pathway
S = {s_1, s_2, s_3, ..., s_T} = sequence of amino acids or nucleotides
T = length of S and Q

Joint probability of a pathway and sequence, given an HMM λ:
S = A G P L V D
Q = H H E E E L    (one path through the trellis of H/E/L states at each position)
P = π_H b_H(A) × a_HH b_H(G) × a_HE b_E(P) × a_EE b_E(L) × a_EE b_E(V) × a_EL b_L(D)

Joint probability: general expression for pathway Q through HMM λ:
$P(S, Q \mid \lambda) = \pi_{q_1} \prod_{t=1}^{T} b_{q_t}(s_t)\, a_{q_t q_{t+1}}$ **
** When t = T there is no q_{t+1}; use $a_{q_T q_{T+1}} = 1$.

The Viterbi algorithm: the maximum probability path
For all states k, recursively compute v_k(t), the probability of the best path ending in state k at position t:
v_k(1) = π_k b_k(s_1)
v_k(t) = b_k(s_t) · max_j [ v_j(t-1) a_{jk} ]
Tracing back the maximizing choices gives the optimal state pathway.
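To make the recursion concrete, here is a minimal Viterbi sketch over a toy two-state, two-symbol HMM. All parameters are invented for illustration; log space is used to avoid underflow on long sequences.

   ! Sketch: the Viterbi algorithm on a toy 2-state, 2-symbol HMM.
   program viterbi_demo
      implicit none
      integer, parameter :: nstate = 2, slen = 4
      real    :: a(nstate,nstate), b(nstate,2), pi(nstate)
      real    :: v(nstate,slen)                ! v(k,t): log prob of best path ending in k at t
      integer :: ptr(nstate,slen), path(slen)
      integer :: obs(slen) = (/ 1, 2, 2, 1 /)  ! observed symbol indices
      integer :: j, k, t

      a  = reshape( (/ 0.9, 0.2, 0.1, 0.8 /), (/2,2/) )   ! a(j,k) = P(k|j); rows sum to 1
      b  = reshape( (/ 0.7, 0.4, 0.3, 0.6 /), (/2,2/) )   ! b(k,y) = P(symbol y | state k)
      pi = (/ 0.5, 0.5 /)

      ! Initialization: v_k(1) = log pi_k + log b_k(s_1)
      v(:,1) = log(pi) + log(b(:,obs(1)))

      ! Recursion: v_k(t) = log b_k(s_t) + max_j [ v_j(t-1) + log a(j,k) ]
      do t = 2, slen
         do k = 1, nstate
            j = maxloc(v(:,t-1) + log(a(:,k)), dim=1)
            ptr(k,t) = j                       ! remember the best predecessor
            v(k,t) = v(j,t-1) + log(a(j,k)) + log(b(k,obs(t)))
         end do
      end do

      ! Traceback from the best final state
      path(slen) = maxloc(v(:,slen), dim=1)
      do t = slen - 1, 1, -1
         path(t) = ptr(path(t+1), t+1)
      end do
      print '(a,4i2)',  "optimal state path: ", path
      print '(a,f8.3)', "log P(S,Q*|lambda) = ", maxval(v(:,slen))
   end program viterbi_demo

The second printed value is the log of the joint probability P(S,Q|λ) defined above, evaluated at the maximizing path.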