Hidden Markov Models

Jacques van Helden

Aix-Marseille Université (AMU) Lab. Theory and Approaches of Genomic Complexity (TAGC) https://tagc.univ-amu.fr/

Institut Français de Bioinformatique (IFB) http://www.france-bioinformatique.fr

[email protected] https://orcid.org/0000-0002-8799-8584

A seminal book

- In 1998, R. Durbin, S. Eddy, A. Krogh and G. Mitchison published a seminal book entitled « Biological Sequence Analysis ».
  - A tutorial introduction to hidden Markov models and other probabilistic modelling approaches in computational sequence analysis.
  - The authors restate the classical sequence analysis problems in terms of Hidden Markov Models (HMM).
  - Even their table of contents is presented as an HMM (their Figure 1.1 below).
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1998. ISBN 0-521-62041-4 (hardback)

Applications of Hidden Markov Models in biology

- Hidden Markov models can be applied to solve a diversity of problems.
- Sequence segmentation
  - Detection of CpG islands
  - Intron/exon prediction
- Motif detection
  - Protein domains (long motifs in peptidic sequences)
  - Transcription factor binding sites (short motifs on DNA sequences)
- Secondary structure prediction
- …

Markov models (nothing to hide so far)

Markov process

- A Markov process is defined by
  - A finite number of states (A, B, C, …)
- Example: 2-state Markov process
  - States: {X, Y}
  - Transitions: {X → X, X → Y, Y → X, Y → Y}
- The probability of transition from each state to each other one is described in a transition matrix.
  - Rows: current state s_i
  - Columns: next state s_{i+1}
  - Values: P(s_{i+1} | s_i)
  - Transition probabilities sum to 1 on each row
- Examples of Markov models to annotate genomic sequences
  1. State X = intron, State Y = exon
  2. State X = transcribed region, state Y = intergenic region
  3. State X = CpG island; State Y = other genomic region

Transition matrix (2-state Markov process)

      X     Y
X   0.9   0.1
Y   0.2   0.8

Examples of biological applications

1. Segmentation of the genome into transcribed and intergenic regions (genome fragment, transcript)
2. Segmentation of transcribed regions into introns and exons
3. Segmentation of the genome into CpG islands and non-CpG islands

Markov process
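As an illustration of the 2-state process introduced earlier (states X and Y, transition probabilities 0.9/0.1 and 0.2/0.8), the sketch below checks that each row of the transition matrix sums to 1 and generates a chain of states. It is a minimal sketch, not part of the original slides: the helper names `check_rows` and `simulate`, the starting state and the random seed are illustrative assumptions.

```python
import random

# Transition matrix of the 2-state example: rows = current state, columns = next state.
TRANSITIONS = {
    "X": {"X": 0.9, "Y": 0.1},
    "Y": {"X": 0.2, "Y": 0.8},
}

def check_rows(transitions, tol=1e-9):
    """Return True if the transition probabilities sum to 1 on each row."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in transitions.values())

def simulate(transitions, start, n_steps, rng):
    """Generate a chain of n_steps states; `start` is an assumed initial state."""
    chain = [start]
    for _ in range(n_steps - 1):
        row = transitions[chain[-1]]
        states = list(row)
        weights = [row[s] for s in states]
        # Draw the next state according to the row of the current state.
        chain.append(rng.choices(states, weights=weights, k=1)[0])
    return "".join(chain)

rng = random.Random(42)  # arbitrary seed, for reproducibility only
chain = simulate(TRANSITIONS, "X", 50, rng)
print(check_rows(TRANSITIONS))
print(chain)
```

Because both states have a high probability of staying unchanged, the generated chain is expected to show runs of identical states, which is what makes such a process suitable for describing genomic segments.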

- In order to annotate the genome, we could conceive a multi-state Markov model that would represent the different types of regions
  1. State W = intron
  2. State X = exon
  3. State Y = CpG island
  4. State Z = other genomic region

[Figure: k-state Markov process with states W, X, Y, Z, plus a begin (B) and an end (E) state. Segmentation of the genome into different types of regions: intron, exon, CpG island, other type of genomic region.]

Transition matrix (arbitrary values)

    W         X          Y          Z
W   0.990     0.010      0.000      0.000
X   0.010     0.988      0.001      0.001
Y   0.00000   0.00002    0.99898    0.00100
Z   0         0.000002   0.000001   0.999997

Markov model of a sequence

- We can model a macromolecular sequence as a Markov process
  - DNA: n = 4 states (A, C, G, T)
  - Proteins: n = 20 states (amino acids)
  - Optionally, additional states can be used to represent the beginning (B) and the end (E) of the sequence. This makes it possible to generate sequences of different lengths.
- Transition probabilities indicate the probability to generate a given residue (suffix) given the current residue (prefix)
- Exercise
  - DNA sequences are generated using a Markov model with ending probability of 0.99 (irrespective of the current residue). What is the distribution of sequence lengths?

[Figure: 4-state Markov process for DNA sequence, with states A, C, G, T plus a begin (B) and an end (E) state.]

Probability of a sequence segment

- What is the probability for a given sequence segment?
- Different models can be chosen
  - Bernoulli model
    - Assumes independence between successive nucleotides.
    - The probability of each residue is fixed a priori (prior residue probability)
      - Example: P(A) = 0.35; P(T) = 0.32; P(C) = 0.17; P(G) = 0.16
    - Particular case: equiprobable residues
      - P(A) = P(T) = P(C) = P(G) = 0.25
      - Simple, but NOT realistic!
  - Markov model
    - The probability of each residue depends on the m preceding residues.
    - The parameter m is called the order of the Markov model.
    - Remark: a Bernoulli model can be considered as a Markov model of order 0.

Independent and equiprobable nucleotides

- The simplest model: Bernoulli with independent and identically distributed (i.i.d.) nucleotides.

  $$p = P(A) = P(C) = P(G) = P(T) = 0.25$$

- The probability of a sequence

  $$P(S) = p^L$$

  - is the product of its residue probabilities (independence)
  - Equiprobability: since all residues have the same probability, it is simply computed as the residue probability (p) to the power of the sequence length (L)
    - S is a sequence segment (e.g. an oligonucleotide)
    - L is the length of the sequence segment
    - p is the nucleotide probability
    - P(S) is the probability to observe this sequence segment at a given position of a larger sequence
- Example
  - $P(\mathrm{CACGTG}) = 0.25^6 = 2.44 \times 10^{-4}$

Bernoulli model: independently distributed nucleotides

- A more refined model consists in using residue-specific probabilities. The probability of each residue is assumed to be constant on the whole sequence (Bernoulli schema).
- The probability of a sequence is the product of its residue probabilities.

  $$P(S) = \prod_{i=1}^{L} P(r_i)$$

  - i = 1..L is the index of nucleotide positions
  - r_i is the residue found at position i
  - P(r_i) is the probability of this residue
- Example: non-coding sequences in the yeast genome
  - P(A) = P(T) = 0.325
  - P(C) = P(G) = 0.175
  - $P(\mathrm{CACGTG}) = P(C)\,P(A)\,P(C)\,P(G)\,P(T)\,P(G) = 0.325^2 \times 0.175^4 = 9.91 \times 10^{-5}$
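The two Bernoulli computations for CACGTG (equiprobable nucleotides, then yeast non-coding priors) follow the same recipe, a product of residue priors. The sketch below reproduces both; the function name `bernoulli_probability` and the dictionary names are illustrative, not taken from the slides.

```python
import math

def bernoulli_probability(sequence, priors):
    """Probability of a sequence under a Bernoulli model: product of residue priors."""
    return math.prod(priors[residue] for residue in sequence)

# Equiprobable nucleotides.
EQUIPROBABLE = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Residue priors quoted for non-coding sequences of the yeast genome.
YEAST_NONCODING = {"A": 0.325, "T": 0.325, "C": 0.175, "G": 0.175}

p_equi = bernoulli_probability("CACGTG", EQUIPROBABLE)
p_yeast = bernoulli_probability("CACGTG", YEAST_NONCODING)
print(f"equiprobable: {p_equi:.3g}")
print(f"yeast priors: {p_yeast:.3g}")
```

CACGTG contains two A/T residues and four C/G residues, so with the yeast priors the product is 0.325 squared times 0.175 to the fourth power, which is lower than the equiprobable value because the motif is GC-rich.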

Bernoulli models

- A Bernoulli model assumes that
  - each residue has a specific prior probability
  - this probability is constant over the sequence (no context dependencies)
- The heat-maps below depict the nucleotide frequencies in non-coding upstream sequences of various organisms.
- The frequencies of AT versus CG show strong inter-organism differences.

[Figure: heat-maps of nucleotide frequencies, one panel per organism: Saccharomyces cerevisiae (Fungus); Escherichia coli K12 (Proteobacteria); Mycobacterium leprae (Actinobacteria); Mycoplasma genitalium (Firmicute, intracellular); Bacillus subtilis (Firmicute, extracellular); Plasmodium falciparum (Apicomplexa, intracellular); Anopheles gambiae (Insect); Homo sapiens (Mammalian).]

Markov chains and transition matrices

- In a Markov model, the probability to find a letter at position i depends on the residues found at the m preceding positions: $P(r_i \mid S_{i-m,i-1})$.
- The tables represent the transition matrices for Markov chain models of order m=1 (top) and m=2 (bottom).
- Each row specifies one prefix, each column one suffix.
- The values indicate the probability to observe a given residue (suffix $r_i$) at position $i$ of the sequence, as a function of the m preceding residues (the prefix $S_{i-m,i-1}$).
- Particular case
  - A Bernoulli model is a Markov model of order 0.

Transition matrix, order 1

Prefix   A        C        G        T
A        P(A|A)   P(C|A)   P(G|A)   P(T|A)
C        P(A|C)   P(C|C)   P(G|C)   P(T|C)
G        P(A|G)   P(C|G)   P(G|G)   P(T|G)
T        P(A|T)   P(C|T)   P(G|T)   P(T|T)

Transition matrix, order 2

Prefix   A         C         G         T
AA       P(A|AA)   P(C|AA)   P(G|AA)   P(T|AA)
AC       P(A|AC)   P(C|AC)   P(G|AC)   P(T|AC)
AG       P(A|AG)   P(C|AG)   P(G|AG)   P(T|AG)
AT       P(A|AT)   P(C|AT)   P(G|AT)   P(T|AT)
CA       P(A|CA)   P(C|CA)   P(G|CA)   P(T|CA)
CC       P(A|CC)   P(C|CC)   P(G|CC)   P(T|CC)
CG       P(A|CG)   P(C|CG)   P(G|CG)   P(T|CG)
CT       P(A|CT)   P(C|CT)   P(G|CT)   P(T|CT)
GA       P(A|GA)   P(C|GA)   P(G|GA)   P(T|GA)
GC       P(A|GC)   P(C|GC)   P(G|GC)   P(T|GC)
GG       P(A|GG)   P(C|GG)   P(G|GG)   P(T|GG)
GT       P(A|GT)   P(C|GT)   P(G|GT)   P(T|GT)
TA       P(A|TA)   P(C|TA)   P(G|TA)   P(T|TA)
TC       P(A|TC)   P(C|TC)   P(G|TC)   P(T|TC)
TG       P(A|TG)   P(C|TG)   P(G|TG)   P(T|TG)
TT       P(A|TT)   P(C|TT)   P(G|TT)   P(T|TT)
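The layout of these matrices (one row per prefix of length m, one column per suffix residue) can be made concrete with a small data structure. The sketch below builds an empty, uniformly filled matrix for a given order; the function name `empty_transition_matrix` and the uniform placeholder values are assumptions made for illustration only.

```python
from itertools import product

ALPHABET = "ACGT"

def empty_transition_matrix(m, alphabet=ALPHABET):
    """Rows: all prefixes of length m; columns: suffix residues.

    Values are uniform placeholders, to be replaced by trained probabilities.
    """
    uniform = 1.0 / len(alphabet)
    return {
        "".join(prefix): {residue: uniform for residue in alphabet}
        for prefix in product(alphabet, repeat=m)
    }

for m in (0, 1, 2):
    matrix = empty_transition_matrix(m)
    # Number of rows grows as 4**m for DNA.
    print(m, len(matrix), list(matrix)[:4])
```

With m = 0 the matrix has a single row indexed by the empty prefix, which is the Bernoulli case; with m = 1 and m = 2 it has 4 and 16 rows respectively, matching the two tables.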

Markov model estimation (“training”)

- Transition frequencies for a Markov model of order m can be estimated from the frequencies observed for oligomers (k-mers) of length k=m+1 in a reference sequence set.
- Example
  - The table below shows dinucleotide frequencies (k=2) computed from the whole set of upstream sequences of the yeast Saccharomyces cerevisiae.
  - This table can be used to estimate a Markov model of order m = k - 1 = 1.

Dinucleotide frequencies

Sequence S   Occurrences N(S)   Frequency F(S)
AA           526,149            0.112
AC           251,377            0.054
AG           275,056            0.059
AT           414,453            0.088
CA           294,423            0.063
CC           178,324            0.038
CG           146,052            0.031
CT           275,859            0.059
GA           277,343            0.059
GC           184,367            0.039
GG           173,404            0.037
GT           239,569            0.051
TA           369,980            0.079
TC           280,475            0.060
TG           279,932            0.060
TT           521,236            0.111


Exercise: estimate P(G|T) from the dinucleotide frequency table
- Give the formula with symbols
- Replace the symbols by their values
- Compute the result

$$P(r_i \mid S_{1..m}) = \frac{F_{bg}(r_i \mid S_{1..m})}{\sum_{j \in A} F_{bg}(r_j \mid S_{1..m})} = \frac{F_{bg}(S_{1..m}\, r_i)}{\sum_{j \in A} F_{bg}(S_{1..m}\, r_j)}$$

$$P(G \mid T) = \frac{F(G \mid T)}{\sum_{j \in A} F(j \mid T)} = \frac{F(TG)}{F(T*)} = \frac{0.060}{0.079 + 0.060 + 0.060 + 0.111} = \frac{0.060}{0.310} = 0.194$$
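The same normalisation can be applied to every prefix at once to obtain the full order-1 transition matrix. The sketch below does this from the dinucleotide occurrences N(S) of the table; the function name `transition_matrix` is illustrative. It works from the raw occurrences rather than the rounded frequencies, so the last digit may differ slightly from the hand computation.

```python
# Dinucleotide occurrences N(S) from the table (yeast upstream sequences).
DINUCLEOTIDE_COUNTS = {
    "AA": 526149, "AC": 251377, "AG": 275056, "AT": 414453,
    "CA": 294423, "CC": 178324, "CG": 146052, "CT": 275859,
    "GA": 277343, "GC": 184367, "GG": 173404, "GT": 239569,
    "TA": 369980, "TC": 280475, "TG": 279932, "TT": 521236,
}

def transition_matrix(kmer_counts, alphabet="ACGT"):
    """Estimate P(suffix | prefix) by normalising k-mer counts over each prefix."""
    prefixes = sorted({kmer[:-1] for kmer in kmer_counts})
    matrix = {}
    for prefix in prefixes:
        # Total occurrences of all k-mers starting with this prefix, e.g. N(T*).
        row_total = sum(kmer_counts[prefix + r] for r in alphabet)
        matrix[prefix] = {r: kmer_counts[prefix + r] / row_total for r in alphabet}
    return matrix

matrix = transition_matrix(DINUCLEOTIDE_COUNTS)
print(f"P(G|T) = {matrix['T']['G']:.3f}")
```

The function only relies on the last letter of each k-mer being the suffix, so the same code would apply to trinucleotide counts for an order-2 model.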

Examples of transition matrices

- The two tables below show the transition matrices for a Markov model of order 1 (top) and 2 (bottom), respectively.
- Trained with the whole set of non-coding upstream sequences of the yeast Saccharomyces cerevisiae.
- Notice the high probability of transitions from AA to A and TT to T.

Transition matrix, order 1: values are $P(r_i \mid S_{i-m,i-1})$

Prefix/Suffix   A       C       G       T       P(Prefix)
a               0.371   0.165   0.178   0.285   0.321
c               0.327   0.190   0.167   0.316   0.183
g               0.312   0.214   0.189   0.285   0.177
t               0.273   0.179   0.173   0.375   0.320
Sum             1.283   0.748   0.708   1.261
P(suffix)       0.321   0.183   0.176   0.320

Transition matrix, order 2

Prefix/Suffix   A       C       G       T       P(Prefix)
aa              0.416   0.151   0.187   0.246   0.119
ac              0.352   0.181   0.171   0.297   0.053
ag              0.339   0.202   0.193   0.267   0.057
at              0.346   0.166   0.162   0.326   0.092
ca              0.344   0.185   0.180   0.291   0.060
cc              0.305   0.200   0.171   0.324   0.035
cg              0.282   0.232   0.193   0.294   0.031
ct              0.241   0.189   0.184   0.385   0.058
ga              0.411   0.144   0.187   0.257   0.055
gc              0.334   0.192   0.182   0.293   0.038
gg              0.315   0.220   0.194   0.271   0.033
gt              0.307   0.156   0.200   0.338   0.050
ta              0.304   0.184   0.160   0.352   0.087
tc              0.313   0.192   0.152   0.343   0.057
tg              0.300   0.214   0.180   0.307   0.055
tt              0.218   0.194   0.164   0.423   0.120
Sum             5.127   3.000   2.860   5.013
P(suffix)       0.321   0.183   0.176   0.319

Markov chains and Bernoulli models

- By extension of the concept of Markov chain, Bernoulli models can be qualified as Markov models of order 0 (the order 0 means that there is no dependency between a residue and the preceding ones).
- The prior probabilities of a Markov model of order m=0 can be estimated from the frequencies of single nucleotides (k=m+1=1) in a background sequence set.
- The table below shows the residue frequencies in the genomes of the yeast Saccharomyces cerevisiae and the bacterium Escherichia coli K12, respectively.
- Notice the strong differences between these genomes.

Markov order 0 = Bernoulli

A       C       G       T       Genome
0.310   0.191   0.191   0.309   Saccharomyces cerevisiae
0.246   0.254   0.254   0.246   Escherichia coli K12

Scoring a sequence segment with a Markov model

- Exercise: compute the probability P(S|B) of a sequence segment S with a background Markov model B of order 2, estimated from trinucleotide (3 nt) frequencies on the yeast non-coding upstream sequences.

CCTACTATATGCCCAGAATT

Sequence probability given the background model

$$P(S \mid B) = P(S_{1,m} \mid B) \prod_{i=m+1}^{L} P(r_i \mid S_{i-m,i-1}, B)$$

Background model B: transition matrix, order 2

Prefix/Suffix   A           C         G         T           P(Prefix)   N(Prefix)
AA              0.388       0.161     0.200     0.251       0.112       525,000
AC              0.339       0.198     0.173     0.290       0.054       251,072
AG              0.345       0.204     0.196     0.255       0.059       274,601
AT              0.311       0.184     0.182     0.323       0.088       413,946
CA              0.347       0.178     0.189     0.286       0.063       293,750
CC              0.341       0.190     0.161     0.309       0.038       178,110
CG              0.293       0.221     0.196     0.290       0.031       145,876
CT              0.229       0.195     0.205     0.371       0.059       275,634
GA              0.394       0.155     0.187     0.264       0.059       277,053
GC              0.330       0.205     0.169     0.297       0.039       184,192
GG              0.318       0.217     0.187     0.277       0.037       173,266
GT              0.285       0.175     0.204     0.336       0.051       239,384
TA              0.300       0.193     0.168     0.339       0.079       369,426
TC              0.313       0.203     0.152     0.332       0.060       280,131
TG              0.302       0.209     0.208     0.282       0.060       279,783
TT              0.210       0.208     0.189     0.392       0.111       520,906
P(Suffix)       0.313       0.191     0.187     0.310
N(suffix)       1,466,075   893,444   873,260   1,449,351

Scoring a sequence segment with a Markov model

- The example below illustrates the computation of the probability P(S|B) of a sequence segment S with a background Markov model B of order 2, estimated from trinucleotide (3 nt) frequencies on the yeast non-coding upstream sequences.

CCTACTATATGCCCAGAATT

Sequence probability given the background model (background model B: the order-2 transition matrix given with the exercise)

$$P(S \mid B) = P(S_{1,m} \mid B) \prod_{i=m+1}^{L} P(r_i \mid S_{i-m,i-1}, B)$$

pos   P(R|W)    value   wR    S                      P(S)
1     P(CC)     0.038   cc    CC                     3.80E-02
2     P(T|CC)   0.309   ccT   CCT                    1.17E-02
3     P(A|CT)   0.229   ctA   CCTA                   2.69E-03
4     P(C|TA)   0.193   taC   CCTAC                  5.19E-04
5     P(T|AC)   0.290   acT   CCTACT                 1.50E-04
6     P(A|CT)   0.229   ctA   CCTACTA                3.45E-05
7     P(T|TA)   0.339   taT   CCTACTAT               1.17E-05
8     P(A|AT)   0.311   atA   CCTACTATA              3.63E-06
9     P(T|TA)   0.339   taT   CCTACTATAT             1.23E-06
10    P(G|AT)   0.182   atG   CCTACTATATG            2.25E-07
11    P(C|TG)   0.209   tgC   CCTACTATATGC           4.69E-08
12    P(C|GC)   0.205   gcC   CCTACTATATGCC          9.61E-09
13    P(C|CC)   0.190   ccC   CCTACTATATGCCC         1.82E-09
14    P(A|CC)   0.341   ccA   CCTACTATATGCCCA        6.21E-10
15    P(G|CA)   0.189   caG   CCTACTATATGCCCAG       1.17E-10
16    P(A|AG)   0.345   agA   CCTACTATATGCCCAGA      4.04E-11
17    P(A|GA)   0.394   gaA   CCTACTATATGCCCAGAA     1.59E-11
18    P(T|AA)   0.251   aaT   CCTACTATATGCCCAGAAT    4.00E-12
19    P(T|AT)   0.323   atT   CCTACTATATGCCCAGAATT   1.29E-12
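The step-by-step table above can be reproduced programmatically. The sketch below is a minimal illustration, assuming the order-2 matrix is stored as a dictionary keyed by prefix; the names `BACKGROUND` and `sequence_probability` are not from the slides. Since it uses the rounded table values, the result may differ from the slide in the last digit.

```python
# Order-2 background model B, copied from the transition matrix:
# prefix -> (P(A|prefix), P(C|prefix), P(G|prefix), P(T|prefix), P(prefix))
BACKGROUND = {
    "AA": (0.388, 0.161, 0.200, 0.251, 0.112),
    "AC": (0.339, 0.198, 0.173, 0.290, 0.054),
    "AG": (0.345, 0.204, 0.196, 0.255, 0.059),
    "AT": (0.311, 0.184, 0.182, 0.323, 0.088),
    "CA": (0.347, 0.178, 0.189, 0.286, 0.063),
    "CC": (0.341, 0.190, 0.161, 0.309, 0.038),
    "CG": (0.293, 0.221, 0.196, 0.290, 0.031),
    "CT": (0.229, 0.195, 0.205, 0.371, 0.059),
    "GA": (0.394, 0.155, 0.187, 0.264, 0.059),
    "GC": (0.330, 0.205, 0.169, 0.297, 0.039),
    "GG": (0.318, 0.217, 0.187, 0.277, 0.037),
    "GT": (0.285, 0.175, 0.204, 0.336, 0.051),
    "TA": (0.300, 0.193, 0.168, 0.339, 0.079),
    "TC": (0.313, 0.203, 0.152, 0.332, 0.060),
    "TG": (0.302, 0.209, 0.208, 0.282, 0.060),
    "TT": (0.210, 0.208, 0.189, 0.392, 0.111),
}
COLUMNS = "ACGT"

def sequence_probability(seq, model, m=2):
    """P(S|B) = P(first m residues) * product of P(r_i | m preceding residues)."""
    probability = model[seq[:m]][4]          # prior probability of the initial prefix
    for i in range(m, len(seq)):
        prefix = seq[i - m:i]                # the m residues preceding position i
        probability *= model[prefix][COLUMNS.index(seq[i])]
    return probability

p = sequence_probability("CCTACTATATGCCCAGAATT", BACKGROUND)
print(f"P(S|B) = {p:.3g}")
```

For long sequences the product underflows quickly, so a practical implementation would typically sum log-probabilities instead of multiplying probabilities.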

Sequence discrimination

Discriminating sequences based on alternative Markov models

- Problem: for a given sequence of symbols (e.g. a nucleotide sequence), identify the most likely Markov model.
- Approach: compute the log-likelihood ratio (LLR) of the sequence probabilities computed with the two respective transition matrices (CpG island versus genomic background)

  $$P_{CpG}(S) = P_{CpG}(r_1) \cdot \prod_{i=2}^{L} P_{CpG}(r_i \mid r_{i-1})$$

  $$P_{Bg}(S) = P_{Bg}(r_1) \cdot \prod_{i=2}^{L} P_{Bg}(r_i \mid r_{i-1})$$

  $$LLR(S) = \log \frac{P_{CpG}(S)}{P_{Bg}(S)}$$

Transition matrix, CpG islands (row B: beginning of the sequence)

Prefix   A       C       G       T
A        0.154   0.288   0.438   0.119
C        0.179   0.295   0.318   0.208
G        0.182   0.384   0.296   0.138
T        0.09    0.378   0.376   0.156
B        0.162   0.337   0.339   0.162

Transition matrix, genomic background (row B: beginning of the sequence)

Prefix   A       C       G       T
A        0.263   0.188   0.261   0.287
C        0.375   0.213   0.052   0.361
G        0.307   0.219   0.213   0.26
T        0.243   0.223   0.272   0.263
B        0.29    0.21    0.21    0.29

- A more efficient approach
  - Compute (only once) a log-odds matrix from the two transition matrices

    $$LLR(r_i \mid r_{i-1}) = \log \frac{P_{CpG}(r_i \mid r_{i-1})}{P_{Bg}(r_i \mid r_{i-1})}$$

  - Compute the LLR of the sequence by summing the transition LLRs

    $$LLR(S) = LLR(r_1 \mid B) + \sum_{i=2}^{L} LLR(r_i \mid r_{i-1})$$

Log-odds transition matrix (CpG / Bg log-odds)

Prefix   A        C       G       T
A        −0.769   0.611   0.745   −1.264
C        −1.062   0.471   2.605   −0.791
G        −0.76    0.809   0.478   −0.918
T        −1.431   0.762   0.47    −0.753
B        −0.843   0.684   0.691   −0.842
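The log-odds approach can be sketched in a few lines. In the code below, the function names `log_odds_matrix` and `llr` are illustrative. The base of the logarithm is not legible in the slides; base 2 is assumed here because it reproduces the values of the log-odds matrix to within rounding (for example, about 2.6 for the transition from C to G).

```python
import math

COLUMNS = "ACGT"

# Transition matrices (row = previous residue, "B" = beginning of the sequence).
CPG = {
    "A": (0.154, 0.288, 0.438, 0.119), "C": (0.179, 0.295, 0.318, 0.208),
    "G": (0.182, 0.384, 0.296, 0.138), "T": (0.090, 0.378, 0.376, 0.156),
    "B": (0.162, 0.337, 0.339, 0.162),
}
BG = {
    "A": (0.263, 0.188, 0.261, 0.287), "C": (0.375, 0.213, 0.052, 0.361),
    "G": (0.307, 0.219, 0.213, 0.260), "T": (0.243, 0.223, 0.272, 0.263),
    "B": (0.290, 0.210, 0.210, 0.290),
}

def log_odds_matrix(model, background, base=2):
    """Element-wise log ratio of two transition matrices (base 2 is an assumption)."""
    return {
        prefix: tuple(math.log(p / q, base)
                      for p, q in zip(model[prefix], background[prefix]))
        for prefix in model
    }

def llr(seq, log_odds):
    """Log-likelihood ratio of a sequence: sum of the transition log-odds."""
    total, prefix = 0.0, "B"
    for residue in seq:
        total += log_odds[prefix][COLUMNS.index(residue)]
        prefix = residue
    return total

LOG_ODDS = log_odds_matrix(CPG, BG)
print(f"log-odds C->G: {LOG_ODDS['C'][COLUMNS.index('G')]:.2f}")
print(f"LLR(CGCG) = {llr('CGCG', LOG_ODDS):.2f}")
print(f"LLR(ATAT) = {llr('ATAT', LOG_ODDS):.2f}")
```

A positive LLR indicates that the sequence is better explained by the CpG island model, a negative LLR that it is better explained by the genomic background.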

Exercise

1. Open a connection to the UCSC genome browser, and select the table browser tool.
2. Choose a mammalian genome (e.g. Human version hg38), select the CpG track in the Regulation group, and download the sequences of all the annotated CpG islands.
3. Open a connection to RSAT Metazoa.
4. Compute the transition matrix of a 1st order Markov model from the CpG island sequences with the tool create background model.
5. Use the tool random genome fragments to extract sequences of random genomic fragments of the same sizes as the CpG islands.
6. Use these random genome fragments to compute the genomic background (transition matrix of a 1st order Markov model).
7. With the tool sequence proba, compute the sequence probabilities for each sequence file (CpG islands, random genome fragments) with each model (CpG island, genome background).
8. Open the 4 result files with R or in a spreadsheet, and compute the log-likelihood ratio log(P(S|CpG) / P(S|Bg)) for
   - the CpG islands
   - the genomic background
9. Compare the distributions of these LLR (you can depict them with histograms, boxplots, violin plots, …).

Hidden Markov Models (HMM)

Probabilities of transition between states

- Let us consider the genome as a Markov chain composed of a succession of regions in states CpG island (+) or non-CpG island (-).
- The total size of the Human genome is 3 Gb, and its annotation contains 31,144 CpG islands totaling 24.2 Mb.
- Exercise: based on these numbers, estimate the transition probabilities between states.

[Figure: segmentation of the genome into CpG islands and non-CpG islands; 2-state Markov process with states CpG (+) and other (-).]

Transition matrix (to be filled in): rows and columns are the states CpG and Other.

Solution: transition probabilities between states

- The total size of non-CpG islands ($N_{-}$) is the difference between genome size ($G$ = 3e+09) and the total size of the CpG islands ($N_{+}$ = 24,200,434).

  $$N_{-} = G - N_{+} = 3 \times 10^{9} - 24{,}200{,}434 = 2{,}975{,}799{,}566$$

- In total, there are 31,144 CpG islands in the genome ($n_{CpG}$), and each of them is preceded by a non-CpG island. There are thus 31,144 transitions from non-CpG to CpG. The probability of transition from non-CpG to CpG is thus the number of non-CpG positions preceding a CpG divided by the total number of non-CpG positions.

  $$P(+ \mid -) = n_{CpG} / N_{-} = \frac{31{,}144}{2{,}975{,}799{,}566} = 1.0466 \times 10^{-5}$$

- Since there are only two possible states, the transition probability from non-CpG to non-CpG is the complement of the transition probability from non-CpG to CpG.

  $$P(- \mid -) = 1 - P(+ \mid -) = 0.99999$$

- Each CpG island has exactly one ending nucleotide, which precedes a non-CpG island nucleotide. The number of nucleotides marking a transition from CpG to non-CpG is thus 31,144. The probability of transition from CpG to non-CpG is this number divided by the total size of all CpG islands.

  $$P(- \mid +) = n_{CpG} / N_{+} = \frac{31{,}144}{24{,}200{,}434} = 0.00129$$

- The probability of transition from CpG to CpG is the complement.

  $$P(+ \mid +) = 1 - P(- \mid +) = 0.99871$$

[Figure: segmentation of the genome into CpG islands and non-CpG islands; 2-state Markov process with states CpG (+) and other (-).]

Transition matrix

            CpG (+)   Other (-)
CpG (+)     0.99871   0.00129
Other (-)   0.00001   0.99999

Hidden Markov Models
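The arithmetic behind the state transition matrix estimated above (from the genome size, the total size of CpG islands and their number) is easy to re-check numerically. The sketch below recomputes the matrix; the variable names are illustrative, and the genome size is the rounded value 3e+09 used in the solution.

```python
GENOME_SIZE = 3_000_000_000      # rounded size of the Human genome (3e+09)
CPG_TOTAL_SIZE = 24_200_434      # total size of the annotated CpG islands
N_CPG_ISLANDS = 31_144           # number of annotated CpG islands

# Total size of non-CpG regions.
non_cpg_total = GENOME_SIZE - CPG_TOTAL_SIZE

# One entry into and one exit from a CpG state per island.
p_minus_to_plus = N_CPG_ISLANDS / non_cpg_total   # P(+|-)
p_plus_to_minus = N_CPG_ISLANDS / CPG_TOTAL_SIZE  # P(-|+)

state_transitions = {
    "+": {"+": 1 - p_plus_to_minus, "-": p_plus_to_minus},
    "-": {"+": p_minus_to_plus, "-": 1 - p_minus_to_plus},
}

for state, row in state_transitions.items():
    print(state, {target: f"{prob:.5g}" for target, prob in row.items()})
```

Both rows sum to 1 by construction, since the probability of staying in a state is defined as the complement of the probability of leaving it.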

- Hidden Markov Models (HMM) are an extension of Markov chains, where we assume a process with a given number of states, and a specific probability of emitting symbols associated to each state.
- The state-specific emission probabilities can themselves be modeled as Markov chains, or as Bernoulli models.
- Example: CpG islands in genomic sequences
  - 2 states: CpG islands and genomic background
  - Each state has a specific nucleotide transition matrix
  - Note: the background emission probabilities were computed from a random selection of genomic regions, which might include some fragments of CpG islands. However the proportion is likely to be very low, so we can use it as estimator for the emission probabilities of non-CpG islands (other).

[Figure: segmentation of the genome into CpG islands and non-CpG islands; 2-state Markov process with states CpG (+) and Other (-).]

Transition matrix

            CpG (+)   Other (-)
CpG (+)     0.99871   0.00129
Other (-)   0.00001   0.99999

Emission probabilities, CpG state (+): CpG islands

Prefix   A       C       G       T
A        0.154   0.288   0.438   0.119
C        0.179   0.295   0.318   0.208
G        0.182   0.384   0.296   0.138
T        0.09    0.378   0.376   0.156
B        0.162   0.337   0.339   0.162

Emission probabilities, Bg state (-): genomic background

Prefix   A       C       G       T
A        0.263   0.188   0.261   0.287
C        0.375   0.213   0.052   0.361
G        0.307   0.219   0.213   0.26
T        0.243   0.223   0.272   0.263
B        0.29    0.21    0.21    0.29

Sequence segmentation

- Problem: given a long unannotated sequence, identify the segments corresponding to CpG islands.
- Notes
  - This problem differs from the discrimination problem seen before, where we assign a class to each sequence as a whole.
  - No unequivocal correspondence from symbols to states: the same symbol can be emitted by different states.
  - The state underlying each emission (nucleotide) is thus “hidden”.
  - We need to discover it by finding the chain of states most likely to have produced the sequence.
- Example: sequence ATTATGGGCGCGAA
- Each nucleotide (symbol) can be generated by either the CpG (upper row) or the non-CpG (lower row) state.
- Between each pair of nucleotides, we can either stay on the current state (horizontal arrows) or switch to the other state (oblique arrows).
- The problem amounts to finding, among all possible paths from B (sequence beginning) to E (end), the path having the maximal likelihood.
- Exercise: how many possible paths are there between B and E?

Sequence position   1  2  3  4  5  6  7  8  9 10 11 12 13 14
CpG island          A  T  T  A  T  G  G  G  C  G  C  G  A  A
non-CpG             A  T  T  A  T  G  G  G  C  G  C  G  A  A
Hidden state        ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?

(B, the sequence beginning, is at the left of the lattice and E, the end, at the right.)

Uncovering the hidden chain of states

- In the drawing below, we highlighted one of the possible paths from the beginning to the end of the sequence ATTATGGGCGCGAA.
- Exercise
  - Annotate the sequence of hidden states with + and -
  - Compute the probability of this path according to the previously defined parameters
  - How many possible paths are there between B and E?
  - Which path would you intuitively propose as the best?
  - Which path would you intuitively propose as the worst?
  - Compute the probability of these paths

[Parameters shown on the slide: the transition matrix between the CpG (+) and Other (-) states, and the two emission matrices (CpG islands, genomic background), with the same values as on the “Hidden Markov Models” slide.]

[Figure: lattice of the sequence ATTATGGGCGCGAA (positions 1 to 14) with a CpG island row and a non-CpG row between B and E, one path highlighted, and the hidden states to be annotated.]

Viterbi algorithm

- The Viterbi algorithm makes it possible to find the optimal path.
- Same principle as dynamic programming:
  - Compute the probability to reach each node (emitted symbol in a given state), from each one of the incoming arrows.
  - Assign to this node the highest probability value, and keep track of the corresponding incoming arrow.
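The two steps above can be sketched for the CpG example as follows. This is a minimal sketch under stated assumptions, not the implementation used for the slides: the slides give no initial state probabilities, so a uniform start is assumed; the end state E is not modelled; log-probabilities are used to avoid numerical underflow; and the names `emission` and `viterbi` are illustrative.

```python
import math

COLUMNS = "ACGT"

# Transition probabilities between hidden states.
STATE_TRANSITIONS = {
    "+": {"+": 0.99871, "-": 0.00129},
    "-": {"+": 0.00001, "-": 0.99999},
}

# State-specific emission matrices: row = previous nucleotide ("B" = sequence start),
# values = P(A), P(C), P(G), P(T) given that previous nucleotide.
EMISSIONS = {
    "+": {"A": (0.154, 0.288, 0.438, 0.119), "C": (0.179, 0.295, 0.318, 0.208),
          "G": (0.182, 0.384, 0.296, 0.138), "T": (0.090, 0.378, 0.376, 0.156),
          "B": (0.162, 0.337, 0.339, 0.162)},
    "-": {"A": (0.263, 0.188, 0.261, 0.287), "C": (0.375, 0.213, 0.052, 0.361),
          "G": (0.307, 0.219, 0.213, 0.260), "T": (0.243, 0.223, 0.272, 0.263),
          "B": (0.290, 0.210, 0.210, 0.290)},
}

# ASSUMPTION: uniform initial state probabilities (not specified in the slides).
START = {"+": 0.5, "-": 0.5}

def emission(state, previous, residue):
    """Probability of emitting `residue` after `previous` in the given state."""
    return EMISSIONS[state][previous][COLUMNS.index(residue)]

def viterbi(seq):
    """Return the most likely chain of hidden states and its log-probability."""
    states = list(STATE_TRANSITIONS)
    # scores[s]: best log-probability of any path ending in state s at the current position.
    scores = {s: math.log(START[s]) + math.log(emission(s, "B", seq[0])) for s in states}
    backpointers = []
    for i in range(1, len(seq)):
        new_scores, pointers = {}, {}
        for state in states:
            # Evaluate each incoming arrow and keep the best one.
            best_prev = max(
                states,
                key=lambda prev: scores[prev] + math.log(STATE_TRANSITIONS[prev][state]),
            )
            new_scores[state] = (
                scores[best_prev]
                + math.log(STATE_TRANSITIONS[best_prev][state])
                + math.log(emission(state, seq[i - 1], seq[i]))
            )
            pointers[state] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state along the stored arrows.
    state = max(scores, key=scores.get)
    best_log_prob = scores[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), best_log_prob

path, log_prob = viterbi("ATTATGGGCGCGAA")
print(path)
print(f"log-probability of the best path: {log_prob:.2f}")
```

The running time grows linearly with the sequence length (and quadratically with the number of states), whereas enumerating all paths would grow exponentially. Note that the result depends on the assumed initial state probabilities, especially for a sequence as short as this example.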

[Figure: lattice of the sequence ATTATGGGCGCGAA (positions 1 to 14) with a CpG island row and a non-CpG row between B and E; hidden states to be determined.]

Bioinformatics

Sequence motifs

Profile matrices (= position-specific scoring matrices, PSSM)

- Starting from a multiple alignment, one can build a matrix which reflects the most representative residues at each position
  - Each column represents a position
  - Each row represents a residue (20 rows for proteins, 4 rows for DNA)
  - The cells indicate the frequency of each residue at each position of the multiple alignment.

Multiple alignment

W S K T N V T S T L H I C W G A Q A G L
W S K T N V T S T L H I C W G A Q A G L
W T Q S H V H R T L N I C W A A Q A A V
F L K Q N V T S S M Y I C W G A M A A L
W S V T N V T S T I H I C W G A Q A G L
W S K D H V T S T L F V C W A V Q A A L
W S K D H V T S T L F V C W A V Q A A L
W S K S H V Y S S L H I C W G A Q A A L
W T T T N V H S T L N V C W G G M A A V
W A K D H V T S T L F V C W A V Q A A L
W A K D H V T S T L F V C W A V Q A A L
W S K T H V Y S T L H I C W G A Q A G L
W S R H N V Y S T M F I C W A A Q A G L
W A K A H V T S T L Y I C W A A Q A G L
W A K E H V T S T L F V C W A V Q A A L
W T Q T N V H S T L N V C W G A M A A I
W S K T H V Y S T L H I C W G A Q A G L

Position-Specific Scoring Matrix (counts)

Position    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
A           0   4   0   1   0   0   0   0   0   0   0   0   0   0   8  11   0  17  10   0
C           0   0   0   0   0   0   0   0   0   0   0   0  17   0   0   0   0   0   0   0
D           0   0   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E           0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
F           1   0   0   0   0   0   0   0   0   0   6   0   0   0   0   0   0   0   0   0
G           0   0   0   0   0   0   0   0   0   0   0   0   0   0   9   1   0   0   7   0
H           0   0   0   1  10   0   3   0   0   0   6   0   0   0   0   0   0   0   0   0
I           0   0   0   0   0   0   0   0   0   1   0  10   0   0   0   0   0   0   0   1
K           0   0  12   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
L           0   1   0   0   0   0   0   0   0  14   0   0   0   0   0   0   0   0   0  14
M           0   0   0   0   0   0   0   0   0   2   0   0   0   0   0   0   3   0   0   0
N           0   0   0   0   7   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0
P           0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q           0   0   2   1   0   0   0   0   0   0   0   0   0   0   0   0  14   0   0   0
R           0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0
S           0   9   0   2   0   0   0  16   2   0   0   0   0   0   0   0   0   0   0   0
T           0   3   1   7   0   0  10   0  15   0   0   0   0   0   0   0   0   0   0   0
V           0   0   1   0   0  17   0   0   0   0   0   7   0   0   0   5   0   0   0   2
W          16   0   0   0   0   0   0   0   0   0   0   0   0  17   0   0   0   0   0   0
Y           0   0   0   0   0   0   4   0   0   0   2   0   0   0   0   0   0   0   0   0
sum        17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17
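The count matrix can be derived mechanically from the multiple alignment by counting residues column by column. The sketch below illustrates this with the 17 aligned sequences (written without spaces); the function name `count_matrix` is illustrative.

```python
from collections import Counter

# The 17 sequences of the multiple alignment, one string per row.
ALIGNMENT = [
    "WSKTNVTSTLHICWGAQAGL", "WSKTNVTSTLHICWGAQAGL", "WTQSHVHRTLNICWAAQAAV",
    "FLKQNVTSSMYICWGAMAAL", "WSVTNVTSTIHICWGAQAGL", "WSKDHVTSTLFVCWAVQAAL",
    "WSKDHVTSTLFVCWAVQAAL", "WSKSHVYSSLHICWGAQAAL", "WTTTNVHSTLNVCWGGMAAV",
    "WAKDHVTSTLFVCWAVQAAL", "WAKDHVTSTLFVCWAVQAAL", "WSKTHVYSTLHICWGAQAGL",
    "WSRHNVYSTMFICWAAQAGL", "WAKAHVTSTLYICWAAQAGL", "WAKEHVTSTLFVCWAVQAAL",
    "WTQTNVHSTLNVCWGAMAAI", "WSKTHVYSTLHICWGAQAGL",
]

def count_matrix(alignment):
    """counts[residue][j] = number of sequences with `residue` at column j (0-based)."""
    width = len(alignment[0])
    columns = [Counter(seq[j] for seq in alignment) for j in range(width)]
    # Only residues observed in the alignment get a row (e.g. no row for P here).
    residues = sorted({residue for seq in alignment for residue in seq})
    return {residue: [column[residue] for column in columns] for residue in residues}

counts = count_matrix(ALIGNMENT)
for residue, row in counts.items():
    print(residue, " ".join(f"{n:2d}" for n in row))
```

Residues that never occur in the alignment (proline in this example) get no row in this sketch, whereas the slide table lists them with zero counts; a fixed 20-letter alphabet could be passed in instead if all rows are needed.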

Weight matrix

Count matrix: the per-position counts are those of the Position-Specific Scoring Matrix (counts) table; the two additional columns give, for each residue, the sum of its counts over the 20 positions and the corresponding frequency.

Residue   Sum   Freq
A          51   0.150
C          17   0.050
D           4   0.012
E           1   0.003
F           7   0.021
G          17   0.050
H          20   0.059
I          12   0.035
K          12   0.035
L          29   0.085
M           5   0.015
N          10   0.029
P           0   0.000
Q          17   0.050
R           2   0.006
S          29   0.085
T          36   0.106
V          32   0.094
W          33   0.097
Y           6   0.018
sum       340   1.000

Weight matrix

$$f'_{i,j} = \frac{n_{i,j} + p_i k}{\sum_{r=1}^{A} n_{r,j} + k}$$

$$W_{i,j} = \ln\left(\frac{f'_{i,j}}{p_i}\right)$$

Position     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
A        -1.72  0.19 -1.72 -0.39 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72  0.49  0.63 -1.72  0.82  0.59 -1.72
C        -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26  1.28 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26
D        -0.70 -0.70 -0.70  1.21 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70
E        -0.30 -0.30 -0.30  1.02 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30
F         0.42 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90  1.18 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90
G        -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26  1.00  0.07 -1.26 -1.26  0.89 -1.26
H        -1.32 -1.32 -1.32  0.00  0.98 -1.32  0.46 -1.32 -1.32 -1.32  0.76 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32
I        -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11  0.21 -1.11  1.19 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11  0.21
K        -1.11 -1.11  1.27 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11
L        -1.48 -0.15 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48  0.97 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48  0.97
M        -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78  0.83 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78  1.01 -0.78 -0.78 -0.78
N        -1.04 -1.04 -1.04 -1.04  1.11 -1.04 -1.04 -1.04 -1.04 -1.04  0.74 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04
P         0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Q        -1.26 -1.26  0.36  0.07 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26  1.19 -1.26 -1.26 -1.26
R        -0.48 -0.48  0.85 -0.48 -0.48 -0.48 -0.48  0.85 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48
S        -1.48  0.78 -1.48  0.14 -1.48 -1.48 -1.48  1.03  0.14 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48
T        -1.57  0.22 -0.25  0.58 -1.57 -1.57  0.73 -1.57  0.91 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57
V        -1.52 -1.52 -0.20 -1.52 -1.52  1.01 -1.52 -1.52 -1.52 -1.52 -1.52  0.63 -1.52 -1.52 -1.52  0.49 -1.52 -1.52 -1.52  0.09
W         0.98 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53  1.00 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53
Y        -0.85 -0.85 -0.85 -0.85 -0.85 -0.85  1.06 -0.85 -0.85 -0.85  0.77 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85
sum      -17.8 -14.4 -13.7 -10.7 -17.2 -19.1 -15.7 -17.8 -17.6 -16.3 -14.1 -17.2 -19.1 -19.1 -17.2 -16   -17.4 -19.1 -17.2 -16.3

Scoring a sequence with a profile matrix


Position       1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
Sequence       L      W      A      K      D      H      V      T      S      T      M      F      V      C      W      A      V      M      A      A
Score      -1.48  -1.53  -1.72  -1.11  -0.70  -1.32  -1.52  -1.57  0.136  -1.57  -0.78  -0.90  -1.52  -1.26  -1.53  0.628  -1.52  -0.78  0.587  -1.72

SUM: -21.1626

Scoring a sequence with a profile matrix


Position       1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
Sequence       W      A      K      D      H      V      T      S      T      M      F      V      C      W      A      V      M      A      A      L
Score      0.975  0.192  1.268   1.21  0.981  1.014  0.735  1.029   0.91  0.835   1.18  0.631  1.277  1.001  0.491  0.486  1.007  0.817  0.587  0.972

SUM: 17.59818

Scoring a sequence with a profile matrix
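The window scores shown on these slides can be reproduced with a few lines of Python. The sketch below is only an illustration: the function names (`parse_pssm`, `score_window`, `scan`) are hypothetical, the matrix is transcribed from the slide with its two-decimal rounding, and the 22-residue sequence is assumed to be the concatenation of the three overlapping windows shown on the slides.

```python
# Weight matrix transcribed from the slide: one row per residue,
# one column per motif position (1 to 20). Values are rounded to 2 decimals.
PSSM_TEXT = """
A -1.72 0.19 -1.72 -0.39 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 -1.72 0.49 0.63 -1.72 0.82 0.59 -1.72
C -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 1.28 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26
D -0.70 -0.70 -0.70 1.21 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70 -0.70
E -0.30 -0.30 -0.30 1.02 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30 -0.30
F 0.42 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 1.18 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90 -0.90
G -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 1.00 0.07 -1.26 -1.26 0.89 -1.26
H -1.32 -1.32 -1.32 0.00 0.98 -1.32 0.46 -1.32 -1.32 -1.32 0.76 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32
I -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 0.21 -1.11 1.19 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 0.21
K -1.11 -1.11 1.27 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11 -1.11
L -1.48 -0.15 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 0.97 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 0.97
M -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 0.83 -0.78 -0.78 -0.78 -0.78 -0.78 -0.78 1.01 -0.78 -0.78 -0.78
N -1.04 -1.04 -1.04 -1.04 1.11 -1.04 -1.04 -1.04 -1.04 -1.04 0.74 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04 -1.04
P 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Q -1.26 -1.26 0.36 0.07 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 -1.26 1.19 -1.26 -1.26 -1.26
R -0.48 -0.48 0.85 -0.48 -0.48 -0.48 -0.48 0.85 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48 -0.48
S -1.48 0.78 -1.48 0.14 -1.48 -1.48 -1.48 1.03 0.14 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48 -1.48
T -1.57 0.22 -0.25 0.58 -1.57 -1.57 0.73 -1.57 0.91 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57 -1.57
V -1.52 -1.52 -0.20 -1.52 -1.52 1.01 -1.52 -1.52 -1.52 -1.52 -1.52 0.63 -1.52 -1.52 -1.52 0.49 -1.52 -1.52 -1.52 0.09
W 0.98 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53 1.00 -1.53 -1.53 -1.53 -1.53 -1.53 -1.53
Y -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 1.06 -0.85 -0.85 -0.85 0.77 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85 -0.85
"""


def parse_pssm(text):
    """Return {residue: [weight at position 1, ..., weight at position w]}."""
    pssm = {}
    for line in text.strip().splitlines():
        fields = line.split()
        pssm[fields[0]] = [float(value) for value in fields[1:]]
    return pssm


def score_window(pssm, window):
    """Sum the weight of each residue at the position it occupies in the window."""
    return sum(pssm[residue][pos] for pos, residue in enumerate(window))


def scan(pssm, sequence):
    """Slide the matrix along the sequence; return (start, window, score) tuples."""
    width = len(next(iter(pssm.values())))
    return [
        (start + 1, sequence[start:start + width],
         score_window(pssm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


PSSM = parse_pssm(PSSM_TEXT)
# Assumption: the three windows on the slides overlap to form this sequence.
SEQUENCE = "LWAKDHVTSTMFVCWAVMAALV"

for start, window, score in scan(PSSM, SEQUENCE):
    print(f"{start:>2} {window} {score:7.2f}")
```

Because the weights are rounded to two decimals, the printed scores agree with the sums on the slides only up to a few hundredths. The middle window, where every residue matches a preferred position, is the only one with a positive score.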


Position       1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
Sequence       A      K      D      H      V      T      S      T      M      F      V      C      W      A      V      M      A      A      L      V
Score      -1.72  -1.11  -0.70   0.00  -1.52  -1.57  -1.48  -1.57  -0.78  -0.90  -1.52  -1.26  -1.53  -1.72  -1.52  -0.78  -1.72  0.817  -1.48  0.094

SUM: -21.9422

PSI-BLAST

n PSI-BLAST stands for Position-Specific Iterated BLAST (Altschul et al., 1997)

q BLAST runs a first time in normal mode.

q Resulting sequences are aligned together (Multiple sequence alignment) and a PSSM is calculated.

q This PSSM is used to scan the database for new matches.

q Steps 2-3 (building the PSSM and scanning the database) can be iterated several times.

n The PSSM increases the sensitivity of the search.

References

n Substitution matrices

q PAM series
• Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, 345-352.

q BLOSUM substitution matrices
• Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.

q Gonnet matrices, built by an iterative procedure
• Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science 256, 1443-5.

n Sequence alignment algorithms

q Needleman-Wunsch (pairwise, global)
• Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.

q Smith-Waterman (pairwise, local)
• Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7.

q FastA (database searches, pairwise, local)
• Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-2448.

q BLAST (database searches, pairwise, local)
• Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). A basic local alignment search tool. J Mol Biol 215, 403-410.
• Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402.

q Clustal (multiple, global)
• Higgins, D. G. & Sharp, P. M. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-44.
• Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Methods Enzymol 266, 383-402.

q Dialign (multiple, local)
• Morgenstern, B., Frech, K., Dress, A. & Werner, T. (1998). DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14, 290-4.

q MUSCLE (multiple, local)

Modelling protein families with HMM

Limitations of position weight matrices

n The main limitation of position weight matrices is that they cannot easily handle gaps (insertions and deletions): each column scores one fixed position of the motif.

n Hidden Markov Models (HMM) can be used to handle gaps.

The Pfam database

n http://pfam.xfam.org/

q “The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).”

Exam questions

Exam Dec 19, 2019 – Hidden Markov Models

n The matrices below define

q the emission probabilities of nucleotides according to two alternative models (CpG and non-CpG)

q the probabilities of transitions between these models

n In the drawing below, we highlighted one of the possible paths from the beginning to the end of the sequence ATCGCAA.

n Show the way to compute the following statistics. In each case, write the formula with symbols, then replace the symbols by the values taken from the matrices. It is not necessary to compute the final value.

a. Probability of the sequence according to the CpG model
b. Probability of the sequence according to the non-CpG model
c. Log-likelihood
d. The states along the path labelled with solid black arrows.
e. Probability of this path

CpG islands: CpG (+)

         A      C      G      T
A    0.154  0.288  0.438  0.119
C    0.179  0.295  0.318  0.208
G    0.182  0.384  0.296  0.138
T    0.09   0.378  0.376  0.156
B    0.162  0.337  0.339  0.162

Genomic background: Non-CpG (-)

         A      C      G      T
A    0.263  0.188  0.261  0.287
C    0.375  0.213  0.052  0.361
G    0.307  0.219  0.213  0.26
T    0.243  0.223  0.272  0.263
B    0.29   0.21   0.21   0.29

Transitions between models

             CpG (+)   Other (-)
CpG (+)      0.99871   0.00129
Other (-)    0.00001   0.99999

[Drawing: trellis from B to E with one row of states per model; the highlighted path is not reproduced in this text.]

Sequence position       1  2  3  4  5  6  7
CpG island              A  T  C  G  C  A  A
                    B                          E
non-CpG                 A  T  C  G  C  A  A

State ......
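As a rough illustration of questions a to c, the sketch below multiplies the probabilities read from the exam matrices along the sequence ATCGCAA. It rests on several assumptions that the slide does not state explicitly: each matrix row is taken as the preceding nucleotide (row B being the begin state) and each column as the nucleotide that follows; the "log-likelihood" of question c is read as the natural logarithm of the ratio between the two sequence probabilities; and the names `make_model` and `sequence_probability` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"  # column order of the exam matrices


def make_model(rows):
    """Turn {previous: [pA, pC, pG, pT]} into {previous: {next: probability}}."""
    return {prev: dict(zip(NUCLEOTIDES, probs)) for prev, probs in rows.items()}


# Values copied from the exam slide. Assumption: row = preceding nucleotide
# (B = begin), column = nucleotide that follows.
CPG = make_model({
    "A": [0.154, 0.288, 0.438, 0.119],
    "C": [0.179, 0.295, 0.318, 0.208],
    "G": [0.182, 0.384, 0.296, 0.138],
    "T": [0.09, 0.378, 0.376, 0.156],
    "B": [0.162, 0.337, 0.339, 0.162],
})
BACKGROUND = make_model({
    "A": [0.263, 0.188, 0.261, 0.287],
    "C": [0.375, 0.213, 0.052, 0.361],
    "G": [0.307, 0.219, 0.213, 0.26],
    "T": [0.243, 0.223, 0.272, 0.263],
    "B": [0.29, 0.21, 0.21, 0.29],
})


def sequence_probability(model, sequence):
    """P(sequence | model) = P(s1 | B) * product over i of P(s_i | s_(i-1))."""
    probability = 1.0
    previous = "B"
    for residue in sequence:
        probability *= model[previous][residue]
        previous = residue
    return probability


SEQUENCE = "ATCGCAA"
p_cpg = sequence_probability(CPG, SEQUENCE)                # question a
p_background = sequence_probability(BACKGROUND, SEQUENCE)  # question b
# Question c, read here as a natural-log ratio (an assumption).
log_ratio = math.log(p_cpg / p_background)

print(f"P(S | CpG)     = {p_cpg:.3e}")
print(f"P(S | non-CpG) = {p_background:.3e}")
print(f"log ratio      = {log_ratio:.3f}")
```

A positive log ratio means the sequence is more probable under the CpG model than under the genomic background. Questions d and e depend on the highlighted path and on the probabilities of transition between models, which this sketch does not use.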