Hidden Markov Models

Jacques van Helden
Aix-Marseille Université (AMU)
Lab. Theory and Approaches of Genomic Complexity (TAGC), https://tagc.univ-amu.fr/
Institut Français de Bioinformatique (IFB), http://www.france-bioinformatique.fr
https://orcid.org/0000-0002-8799-8584

A seminal book

- In 1998, Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison published a seminal book entitled « Biological sequence analysis ».
  - A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis.
  - The authors restate the classical sequence analysis problems in terms of Hidden Markov Models (HMM).
  - Even their table of contents is presented as an HMM (their Figure 1.1).
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1998. ISBN 0-521-62041-4 (hardback)

Applications of Hidden Markov Models in biology

- Hidden Markov models can be applied to solve a diversity of problems in bioinformatics
- Sequence segmentation
  - Detection of CpG islands
  - Intron/exon prediction
- Motif detection
  - Protein domains (long motifs in peptidic sequences)
  - Transcription factor binding sites (short motifs on DNA sequences)
- Secondary structure prediction
- …

Markov models (nothing to hide so far)

Markov process

- A Markov process is defined by
  - A finite number of states (A, B, C, …)
- Example: 2-state Markov process
  - States: {X, Y}
  - Transitions: {X → X, X → Y, Y → X, Y → Y}

Transition matrix (2-state Markov process)

        X     Y
  X    0.9   0.1
  Y    0.2   0.8

- The probability of transition from each state to each other one is described in a transition matrix.
  - Rows: current state $s_i$
  - Columns: next state $s_{i+1}$
  - Values: $P(s_{i+1} \mid s_i)$
  - Transition probabilities sum to 1 on each row
- Examples of Markov models to annotate genomic sequences
  1. State X = intron, state Y = exon
  2. State X = transcribed region, state Y = intergenic region
  3. State X = CpG island, state Y = other genomic region

[Figure: examples of biological applications]
  1. Segmentation of the genome into transcribed and intergenic regions (genome fragment, transcript)
  2. Segmentation of transcribed regions into introns and exons
  3. Segmentation of the genome into CpG islands and non-CpG islands

Markov process

- In order to annotate the genome, we could conceive a multi-state Markov model that would represent the different types of genomic regions
  1. State W = intron
  2. State X = exon
  3. State Y = CpG island
  4. State Z = other genomic region

[Figure: k-state Markov process, node labels W, X, Y, Z, B, E. Segmentation of the genome into different types of regions: intron, exon, CpG island, other type of genomic region]

Transition matrix (arbitrary values)

        W          X          Y          Z
  W    0.990      0.010      0.000      0.000
  X    0.010      0.988      0.001      0.001
  Y    0.00000    0.00002    0.99898    0.00100
  Z    0          0.000002   0.000001   0.999997

Markov model of a sequence

- We can model a macromolecular sequence as a Markov process
  - DNA: n = 4 states (A, C, G, T)
  - Proteins: n = 20 states (amino acids)
  - Optionally, additional states can be used to represent the beginning (B) and the end (E) of the sequence. This makes it possible to generate sequences of different lengths.
- Transition probabilities indicate the probability to generate a given residue (suffix) given the current residue (prefix)
- Exercise
  - DNA sequences are generated using a Markov model with ending probability of 0.99 (irrespective of the current residue). What is the distribution of sequence lengths?

[Figure: 4-state Markov process for DNA sequence, node labels A, C, G, T, B, E]

Probability of a sequence segment

- What is the probability of a given sequence segment?
- Different models can be chosen
  - Bernoulli model
    - Assumes independence between successive nucleotides.
    - The probability of each residue is fixed a priori (prior residue probability)
      - Example: P(A) = 0.35; P(T) = 0.32; P(C) = 0.17; P(G) = 0.16
    - Particular case: equiprobable residues
      - P(A) = P(T) = P(C) = P(G) = 0.25
      - Simple, but NOT realistic!
  - Markov model
    - The probability of each residue depends on the m preceding residues.
    - The parameter m is called the order of the Markov model
    - Remark: a Bernoulli model can be considered as a Markov model of order 0

Independent and equiprobable nucleotides

- The simplest model: Bernoulli with independently and identically distributed (i.i.d.) nucleotides, p = P(A) = P(C) = P(G) = P(T) = 0.25
- The probability of a sequence, $P(S) = p^L$
  - Is the product of its residue probabilities (independence)
  - Equiprobability: since all residues have the same probability, it is simply computed as the residue probability (p) to the power of the sequence length (L)
    - S is a sequence segment (e.g. an oligonucleotide)
    - L is the length of the sequence segment
    - p is the nucleotide probability
    - P(S) is the probability to observe this sequence segment at a given position of a larger sequence
- Example
  - P(CACGTG) = 0.25^6 = 2.44e-4

Bernoulli model: independently distributed nucleotides

- A more refined model consists in using residue-specific probabilities. The probability of each residue is assumed to be constant on the whole sequence (Bernoulli schema).
- The probability of a sequence is the product of its residue probabilities: $P(S) = \prod_{i=1}^{L} P(r_i)$
  - $i = 1, \dots, L$ is the index of nucleotide positions
  - $r_i$ is the residue found at position $i$
  - $P(r_i)$ is the probability of this residue
- Example: non-coding sequences in the yeast genome
  - P(A) = P(T) = 0.325
  - P(C) = P(G) = 0.175
  - P(CACGTG) = P(C) P(A) P(C) P(G) P(T) P(G) = 0.325^2 * 0.175^4 = 9.91E-5

Bernoulli models

- A Bernoulli model assumes that
  - each residue has a specific prior probability
  - this probability is constant over the sequence (no context dependencies)
- The heat maps depict the nucleotide frequencies in non-coding upstream sequences of various organisms.
- The frequencies of AT versus CG show strong inter-organism differences.

[Figure: heat maps of nucleotide frequencies. Panels: Saccharomyces cerevisiae (Fungus); Escherichia coli K12 (Proteobacteria); Mycobacterium leprae (Actinobacteria); Mycoplasma genitalium (Firmicute, intracellular); Bacillus subtilis (Firmicute, extracellular); Plasmodium falciparum (Apicomplexa, intracellular); Anopheles gambiae (Insect); Homo sapiens (Mammalian)]

Markov chains and transition matrices

- In a Markov model, the probability to find a letter at position i, $P(r_i \mid S_{i-m,i-1})$, depends on the residues found at the m preceding positions.
- The tables represent the transition matrices for Markov chain models of order m=1 (top) and m=2 (bottom).
- Each row specifies one prefix, each column one suffix.

Transition matrix, order 1

  Prefix   A        C        G        T
  A        P(A|A)   P(C|A)   P(G|A)   P(T|A)
  C        P(A|C)   P(C|C)   P(G|C)   P(T|C)
  G        P(A|G)   P(C|G)   P(G|G)   P(T|G)
  T        P(A|T)   P(C|T)   P(G|T)   P(T|T)
Transition matrix, order 2

  Prefix   A         C         G         T
  AA       P(A|AA)   P(C|AA)   P(G|AA)   P(T|AA)
  AC       P(A|AC)   P(C|AC)   P(G|AC)   P(T|AC)
  AG       P(A|AG)   P(C|AG)   P(G|AG)   P(T|AG)
  AT       P(A|AT)   P(C|AT)   P(G|AT)   P(T|AT)
  CA       P(A|CA)   P(C|CA)   P(G|CA)   P(T|CA)
  CC       P(A|CC)   P(C|CC)   P(G|CC)   P(T|CC)
  CG       P(A|CG)   P(C|CG)   P(G|CG)   P(T|CG)
  CT       P(A|CT)   P(C|CT)   P(G|CT)   P(T|CT)
  GA       P(A|GA)   P(C|GA)   P(G|GA)   P(T|GA)
  GC       P(A|GC)   P(C|GC)   P(G|GC)   P(T|GC)
  GG       P(A|GG)   P(C|GG)   P(G|GG)   P(T|GG)
  GT       P(A|GT)   P(C|GT)   P(G|GT)   P(T|GT)
  TA       P(A|TA)   P(C|TA)   P(G|TA)   P(T|TA)
  TC       P(A|TC)   P(C|TC)   P(G|TC)   P(T|TC)
  TG       P(A|TG)   P(C|TG)   P(G|TG)   P(T|TG)
  TT       P(A|TT)   P(C|TT)   P(G|TT)   P(T|TT)

- The values indicate the probability to observe a given residue (suffix $r_i$) at position $i$ of the sequence, as a function of the m preceding residues (the prefix $S_{i-m,i-1}$)
- Particular case
  - A Bernoulli model is a Markov model of order 0.

Markov model estimation (“training”)

- Transition frequencies for a Markov model of order m can be estimated from the frequencies observed for oligomers (k-mers) of length k=m+1 in a reference sequence set.
- Example
  - The table shows dinucleotide frequencies (k=2) computed from the whole set of upstream sequences of the yeast Saccharomyces cerevisiae.
  - This table can be used to estimate a Markov model of order m = k - 1 = 1.

Dinucleotide frequencies

  Sequence S   Occurrences N(S)   Frequency F(S)
  AA           526,149            0.112
  AC           251,377            0.054
  AG           275,056            0.059
  AT           414,453            0.088
  CA           294,423            0.063
  CC           178,324            0.038
  CG           146,052            0.031
  CT           275,859            0.059
  GA           277,343            0.059
  GC           184,367            0.039
  GG           173,404            0.037
  GT           239,569            0.051
  TA           369,980            0.079
  TC           280,475            0.060
  TG           279,932            0.060
  TT           521,236            0.111
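
The estimation step described above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the function name `estimate_transitions` and the variable `dinucleotide_counts` are hypothetical, and the sketch assumes the usual estimate P(suffix | prefix) = N(prefix + suffix) / sum of N(prefix + x) over all residues x, without pseudo-counts. The counts are those of the dinucleotide table for yeast upstream sequences.

```python
# Dinucleotide counts N(S) copied from the yeast upstream sequence table.
dinucleotide_counts = {
    "AA": 526149, "AC": 251377, "AG": 275056, "AT": 414453,
    "CA": 294423, "CC": 178324, "CG": 146052, "CT": 275859,
    "GA": 277343, "GC": 184367, "GG": 173404, "GT": 239569,
    "TA": 369980, "TC": 280475, "TG": 279932, "TT": 521236,
}


def estimate_transitions(kmer_counts):
    """Estimate a Markov model of order m = k - 1 from k-mer counts.

    Assumption (not stated in the slides): each k-mer count is divided
    by the total count of all k-mers sharing the same prefix, so that
    each row of the transition matrix sums to 1.
    """
    # Total number of occurrences for each prefix (first k-1 letters).
    prefix_totals = {}
    for kmer, count in kmer_counts.items():
        prefix = kmer[:-1]
        prefix_totals[prefix] = prefix_totals.get(prefix, 0) + count

    # Row = prefix, column = suffix, value = P(suffix | prefix).
    transitions = {}
    for kmer, count in kmer_counts.items():
        prefix, suffix = kmer[:-1], kmer[-1]
        transitions.setdefault(prefix, {})[suffix] = count / prefix_totals[prefix]
    return transitions


transitions = estimate_transitions(dinucleotide_counts)
for prefix in sorted(transitions):
    row = "  ".join(f"P({s}|{prefix})={transitions[prefix][s]:.3f}" for s in "ACGT")
    print(row)
```

Because the function splits each k-mer into a prefix of length k - 1 and a one-letter suffix, the same sketch would apply to trinucleotide counts to obtain an order-2 matrix.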

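Going back to the Bernoulli models presented earlier in these slides, the probability of a sequence segment is the product of its residue probabilities. The sketch below recomputes the two CACGTG examples; the function name `bernoulli_probability` and the dictionary names are hypothetical, chosen for illustration only.

```python
import math


def bernoulli_probability(sequence, residue_probs):
    """Probability of a sequence under a Bernoulli model.

    Residues are assumed independent, so P(S) is the product of the
    prior probabilities P(r_i) of the residues found at each position.
    """
    return math.prod(residue_probs[residue] for residue in sequence.upper())


# Equiprobable nucleotides: p = 0.25 for each residue.
equiprobable = {residue: 0.25 for residue in "ACGT"}

# Residue-specific priors given for yeast non-coding sequences.
yeast_noncoding = {"A": 0.325, "T": 0.325, "C": 0.175, "G": 0.175}

print(f"{bernoulli_probability('CACGTG', equiprobable):.3g}")
print(f"{bernoulli_probability('CACGTG', yeast_noncoding):.3g}")
```

With the equiprobable model the result is 0.25^6, about 2.44e-4; with the yeast priors it is about 9.91e-5, since CACGTG contains one A, one T, two C and two G.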