
Stochastic processes and Hidden Markov Models

Dr Mauro Delorenzi and Dr Frédéric Schütz

Swiss Institute of Bioinformatics

EMBnet course – Basel 23.3.2006

Introduction

• A mainstream topic in bioinformatics is the problem of annotation: given a DNA/RNA or protein sequence, we want to identify "interesting" elements
• Examples:
  – DNA/RNA: genes, promoters, splicing signals, segmentation of heterogeneous DNA, binding sites, etc.
  – Proteins: coiled-coil domains, transmembrane domains, signal peptides, phosphorylation sites, etc.
  – Generally: homologs, etc.

• "The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster" – http://www.fruitfly.org/GASP1/tutorial/presentation/

Sequence annotation

• The sequence of many of these interesting elements can be characterized statistically, so we are interested in modeling them.
• By modeling, we try to find statistical models that can:
  – Accurately describe the observed elements in the provided examples;
  – Accurately predict the presence of particular elements in new, unannotated sequences;
  – If possible, be readily interpretable and provide some insight into the actual biological process involved (i.e. not a black box).


Example: heterogeneity of DNA sequences

• The nucleotide composition of segments of genomic DNA changes between different regions in a single organism
  – Example: coding regions in the human genome tend to be GC-rich.
• Modeling the differences between different homogeneous regions is interesting because
  – These differences often have a biological meaning
  – Many bioinformatics tools depend on the "background distribution" of nucleotides, which is often assumed to be constant.

Modeling tools (quick review)

• Among the different tools used for modeling sequences, we have (sorted by increasing complexity):
  – Consensus sequences
  – Regular expressions
  – Position-Specific Scoring Matrices (PSSM), or Weight Matrices
  – Markov Models, Hidden Markov Models and other stochastic processes
• These tools (in particular the stochastic processes) are also used for bioinformatics problems other than pure sequence analysis.


Consensus sequence

• Exact sequence that corresponds to a certain region
• Example: transcription initiation in E. coli
  – Transcription is initiated at the promoter; the sequence of the promoter is recognised by the sigma factor of RNA polymerase
  – For the sigma factor σ70, the consensus sequence of the promoter is given by
    -35: TTGACA … -10: TATAAT
• Very rigid: does not allow for any variation
• This works well for restriction enzyme sites or, in general, for sites for which strict conservation is important (in the case of restriction sites, cutting of the DNA at a certain site is a question of "life and death" for the DNA)

Example: binding site for TF p53

• The Transcription Factor Binding Site (TFBS) for p53 has been described as having the consensus sequence

GGA CATG CCC * GGG CATG TCT

where * represents a spacer of variable length.

• In this case, the sequence is not entirely conserved; this is believed to allow the cell some flexibility in the level of response to different signals (something that was not possible or desirable for restriction sites).


Example: binding site for TF p53

• This flexibility translates into the need for more complicated models to describe the site.
• Since the binding site is not entirely conserved, the consensus sequence represents only the nucleotides most frequently observed.
• The protein could potentially bind to many other similar, but different, sites along the genome.
• In theory, if the sites are not independent, the protein may not even bind to the actual consensus sequence!

Patterns / Regular Expressions

• Patterns attempt to explain observed motifs by identifying the most important combinations of positions and nucleotides/residues of a given site (compare with the consensus sequence, where only the most important nucleotide/residue at each position is identified)
• They are often described using the regular expression syntax.
• Prosite database (developed at the SIB): http://www.expasy.org/prosite/


Example: Cys-Cys-His-His zinc finger DNA binding domain

• Its characteristic motif has the regular expression

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

• where 'x' stands for any amino acid, '(2,4)' means between 2 and 4 occurrences, and '[…]' indicates a list of possible amino acids.

• Example: 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX

Example: TFBS for p53

• The TFBS has been described with the pattern →← … →← where →← is the palindromic sequence
  (5') Pu-Pu-Pu-C-[AT]-[TA]-G-Py-Py-Py (3')
  and "…" is a spacer of 0 to 14 nucleotides

• Note that this pattern (with the palindromic condition) cannot be expressed using a regular expression (at least not in a simple or general way).

J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", PNAS 99(13), June 2002, 8467-8472.

Example: TFBS for p53

• The pattern approach clearly allows more flexibility than the consensus sequence; however, it is still too rigid, especially for sites that are not well conserved.
• When applying the pattern, each possible amino acid or nucleotide at a given position has the same weight, i.e. it is not possible to specify whether one is more likely to appear than another.

Position-Specific Scoring Matrices

• A "stochastic consensus sequence"
• Indicates the relative importance of a given nucleotide or amino acid at a certain position.
• Usually built from an alignment of sequences corresponding to the domain we are interested in, and either a collection of sequences known not to contain the domain, or (most often) background frequencies for the different nucleotides or amino acids.


Building a PSSM

Counts from 242 known sites:

  Pos.    1     2     3     4     5     6
  A       9   214    63   142   118     8
  C      22     7    26    31    52    13
  G      18     2    29    38    29     5
  T     193    19   124    31    43   216

Relative frequencies f_bl:

  Pos.    1      2      3      4      5      6
  A     0.04   0.88   0.26   0.59   0.49   0.03
  C     0.09   0.03   0.11   0.13   0.22   0.05
  G     0.07   0.01   0.12   0.16   0.12   0.02
  T     0.80   0.08   0.51   0.13   0.18   0.89

PSSM: log f_bl / p_b (p_b = background probabilities):

  Pos.     1       2       3       4       5       6
  A     -2.76    1.82    0.06    1.23    0.96   -2.92
  C     -1.46   -3.11   -1.22   -1.00   -0.22   -2.21
  G     -1.76   -5.00   -1.06   -0.67   -1.06   -3.58
  T      1.67   -1.66    1.04   -1.00   -0.49    1.84

Scoring a sequence using a PSSM

Sequence: C T A T A A T C

Score matrix:

  Pos.     1      2      3      4      5      6
  A      -38     19      1     12     10    -48
  C      -15    -38     -8    -10     -3    -32
  G      -13    -48     -6     -7    -10    -48
  T       17    -32      8     -9     -6     19

• Move the matrix along the sequence and sum the scores in each "window" (see the sketch below):
  – CTATAA: -93
  – TATAAT: +85
  – ATAATC: -95
• Peaks should occur at the "true" sites
• Of course, in general any threshold will have some false positive and false negative rate

Sequence logo: graphical representation Cys-Cys-His-His zinc finger DNA binding domain

The total height of each stack represents the degree of conservation at that position; the heights of the letters in a stack are proportional to their frequencies.

PSSM for p53 binding site

Counts from 37 known sites

  Pos.    1    2    3    4    5    6    7     8    9   10    11   12    13   14   15   16   17    18   19   20
  A      14   11   26    0   28  2.5    0   0.5    0    3     6    2  11.5    0   27    4    0   0.5    1    2
  C       3    1    1   36    1  0.5    0  24.5   33   23     2    0   0.5   36    2    0    0   9.5   24   15
  G      16   24   10    0    0    0   37     0    0    0  23.5   34    25    0    2    1   37     0    0    3
  T       4    1    0    1    7   34    0    12    4   10   5.5    1     0    1    5   32    0    27   12   16

J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", 2002.

What is missing ?

• PSSMs help to deal with the stochastic distribution of symbols at a given position.
• However, they lack the ability to deal with the length distribution of the motif they describe.
• Stochastic processes provide a more general framework to deal with these questions.


Stochastic processes

• A stochastic process X is a collection of random variables {X_t, t ∈ T}
• The variable t is often interpreted as time, and X(t) is the state of the process at time t.
• A realization of X is called a sample path.
• While this definition is quite general, there are a number of special cases that are of high interest in bioinformatics, in particular Markov processes.


Markov Chain

• A Markov process is a stochastic process X with the following properties:

  – The random variables X_t can take values in a finite list of states S = {s_1, s_2, …, s_n} (often taken as {1, 2, …, n})
  – P(X_{t+1} = j_{t+1} | X_t = j_t, X_{t-1} = j_{t-1}, …) = P(X_{t+1} = j_{t+1} | X_t = j_t)
    This means that the probability of getting to the next state depends only on the current state, and not on the previous states; in other words, the Markov process is memoryless.
  – We assume that the process is homogeneous, meaning that the probabilities are the same at all time points: P(X_{t+1} = j | X_t = i) = p_ij
  – The p_ij are usually represented as a matrix P (rows: current state i, columns: next state j) together with an initial vector π:

      P = | p_11  p_12  …  p_1n |
          | p_21  p_22  …  p_2n |
          | p_31  p_32  …  p_3n |
          |  …                  |
          | p_n1  p_n2  …  p_nn |

Markov chain: matrix form

• Given a vector π_t containing the probabilities of each state at time t, the probabilities of the states at time t+1 are given by π_{t+1} = π_t P
• Looking further into the future of the chain, the probabilities P(X_{t+n} = j | X_t = i) are given by

  π_{t+n} = π_t P^n

  (repeated matrix multiplication with the transition matrix P defined above calculates the long-range probabilities)

Stationary processes

• Under certain conditions, there exists a stationary distribution for the states of the chain, that is, a distribution for which the probabilities of the different states do not change anymore.
• A distribution π is stationary if π = π P
• If it exists, the stationary distribution can be found by solving this equation (along with the constraint that the elements of π must sum to 1)
• An alternative way is to calculate lim_{n→∞} P^n

An example: mutations in DNA

• A Markov chain can be used to model mutations in DNA, e.g. for use in evolutionary problems.
• Different models are possible, depending on the constraints that we place on the possible mutations
• See for example http://www.molecularevolution.org/resources/models/dnamodels.php
• Example (fake):

        A     C     G     T
  A   0.80  0.05  0.05  0.10
  C   0.05  0.70  0.20  0.05
  G   0.05  0.20  0.70  0.05
  T   0.10  0.05  0.05  0.80

S. Tavaré, "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences".

Models for mutations in DNA

• The stationary distribution of this process is given by the vector (0.25, 0.25, 0.25, 0.25), meaning that after a certain time the distribution of nucleotides will be uniform, regardless of the initial distribution (see the sketch below).
• Of course, this does not take into account other factors such as natural selection, etc.
• These substitution models are especially important for database searches of proteins, where they allow the search to give more weight to sequences that are evolutionarily close, even if they are different from the input (cf. PAM matrices, etc.).
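A small numpy sketch (added here, not in the original slides) illustrating π_{t+n} = π_t P^n with the fake mutation matrix above: whatever the initial distribution, the propagated probabilities converge to the uniform stationary distribution (0.25, 0.25, 0.25, 0.25).

```python
import numpy as np

# Transition matrix of the (fake) mutation model above; rows/columns in order A, C, G, T.
P = np.array([
    [0.80, 0.05, 0.05, 0.10],
    [0.05, 0.70, 0.20, 0.05],
    [0.05, 0.20, 0.70, 0.05],
    [0.10, 0.05, 0.05, 0.80],
])

pi0 = np.array([1.0, 0.0, 0.0, 0.0])      # start from a pure "A" distribution

# pi_{t+n} = pi_t P^n : propagate the distribution n generations ahead
for n in (1, 10, 100):
    print(n, pi0 @ np.linalg.matrix_power(P, n))

# The limit is the stationary distribution pi = pi P, here (0.25, 0.25, 0.25, 0.25).
pi_stat = pi0 @ np.linalg.matrix_power(P, 1000)
print(np.allclose(pi_stat, 0.25, atol=1e-6))   # True
```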

Example: CpG islands

• In the human genome, the frequency of the dinucleotide CG (written CpG to differentiate it from a C-G pair across the two strands of DNA) is lower than what would be expected from the frequencies of the C and G nucleotides, because the cytosine is usually methylated and often mutates into a thymine (T)
• However, in some regions of the DNA (in general around the promoter region of a gene), called CpG islands, the methylation process is suppressed and a much higher frequency of CpG dinucleotides is observed.


Difference between the 2 examples

• In the example concerning mutations of DNA, we have a single nucleotide that changes over time
• If we look at several nucleotides, they can mutate independently, and a different Markov chain is associated with each of them:

  ACTAGCTAGCTTGATCTGATCGACTGTGG
  ACAAGGTAGCTAGACCTGATCGACAGTGG
  TCAAGCTAGATAGACCTGCTCGTCGGTGG
  ...

• In the example concerning CpG islands, we are looking at consecutive nucleotides and at how they are linked, using a single Markov chain:

A→C→T→A→G→C→T→A→G→C→T→T→G→A→T→C→T ...

A Markov model for CpG islands

• Roughly speaking, being in a CpG island means that C is more frequent than in the rest of the genome, and is more often followed by G.
• This can be modeled using a Markov model.
• Given a list of sequences from putative CpG islands, a Markov model ("+") can be derived
• Given a list of sequences that are not part of a CpG island, a second Markov model ("-") can be derived


Deriving the Markov models

• Data from 48 putative CpG islands, total of about 60,000 nucleotides.
• To estimate the probabilities, calculate the observed frequencies of transition from each nucleotide to any other (maximum likelihood estimator; see the sketch below).
• Example: in CpG islands, 1,000 "A"s are observed, 270 of which are followed by a "C"; this gives an estimated probability P(C|A) = 0.27.

  Model "+"                           Model "-"
        A     C     G     T                 A     C     G     T
  A   0.18  0.27  0.43  0.12          A   0.30  0.21  0.28  0.21
  C   0.17  0.37  0.27  0.19          C   0.32  0.30  0.08  0.30
  G   0.16  0.34  0.38  0.12          G   0.25  0.24  0.30  0.21
  T   0.08  0.36  0.38  0.18          T   0.18  0.24  0.29  0.29

Durbin, Eddy, Krogh and Mitchison, "Biological sequence analysis"
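A minimal Python sketch of the maximum-likelihood estimation described above (an added illustration, not the actual training code behind the slides): count how often each nucleotide is followed by each other nucleotide in a set of training sequences and normalise each row. The toy training sequences are made up.

```python
from collections import defaultdict

def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum-likelihood estimate of a first-order Markov model:
    P(b | a) = (times a is followed by b) / (times a is followed by anything)."""
    counts = {a: defaultdict(int) for a in alphabet}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a].values())
        model[a] = {b: (counts[a][b] / total if total else 0.0) for b in alphabet}
    return model

# Toy usage with made-up data (real estimates would use the 48 putative CpG islands):
plus_model = estimate_transitions(["ACGCGCGTACGCGG", "CGCGCAGCGCG"])
print(plus_model["A"])
```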

How to use the model?

• For a given sequence X_0 X_1 X_2 … X_n, we are interested in calculating the probability that one of our Markov models generated this particular sequence
• This is easily done using the Markov property:

  P(X_0, X_1, X_2, …, X_n) = P(X_0) P(X_1|X_0) P(X_2|X_1) … P(X_n|X_{n-1})

• The different probabilities are provided either by the transition matrix or by an initial probability vector
• The initial probability vector can correspond, for example, to the stationary distribution, here:

  Model "+":  A 0.16   C 0.34   G 0.35   T 0.15
  Model "-":  A 0.26   C 0.24   G 0.24   T 0.25


How to use the model ?

• Example: A → C → G → T → A → C → G → T

  P(CpG) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 ≈ 9.8 × 10^-7

• Simply calculating the probability of going through a certain path in the Markov chain may not be the best thing to do, because the probability will keep becoming smaller and smaller (the probability of seeing any particular sequence decreases quickly as the sequence length increases).
• More interesting is the likelihood ratio (LR): the probability according to the "CpG island" model, compared with the probability according to a "background model".

Likelihood Ratio

• The background distribution can be given by the "non-CpG island" Markov chain, or by another distribution of probabilities.
• The LR will indicate whether we are more likely to be in a CpG island (LR > 1) or outside of one (LR < 1); in practice its logarithm is often used.
• In our example (see the sketch below):
  – P(+) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 ≈ 9.8 × 10^-7
  – P(-) = 0.26 × 0.21 × 0.08 × 0.21 × 0.18 × 0.21 × 0.08 × 0.21 ≈ 5.8 × 10^-7
  – LR = (9.8 × 10^-7) / (5.8 × 10^-7) ≈ 1.7
• This is an example of why we indicated that the problem of DNA segmentation was important.
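A small sketch (not from the slides) that recomputes the two probabilities and the likelihood ratio for the sequence ACGTACGT, using the "+" and "-" transition matrices and the stationary initial distributions given earlier.

```python
import numpy as np

BASES = "ACGT"

def seq_probability(seq, init, trans):
    """P(x_0, ..., x_n) = P(x_0) * prod P(x_i | x_{i-1}) under a first-order Markov chain."""
    p = init[BASES.index(seq[0])]
    for a, b in zip(seq, seq[1:]):
        p *= trans[BASES.index(a), BASES.index(b)]
    return p

plus_trans = np.array([[0.18, 0.27, 0.43, 0.12],
                       [0.17, 0.37, 0.27, 0.19],
                       [0.16, 0.34, 0.38, 0.12],
                       [0.08, 0.36, 0.38, 0.18]])
minus_trans = np.array([[0.30, 0.21, 0.28, 0.21],
                        [0.32, 0.30, 0.08, 0.30],
                        [0.25, 0.24, 0.30, 0.21],
                        [0.18, 0.24, 0.29, 0.29]])
plus_init = np.array([0.16, 0.34, 0.35, 0.15])
minus_init = np.array([0.26, 0.24, 0.24, 0.25])

p_plus = seq_probability("ACGTACGT", plus_init, plus_trans)     # ~9.8e-07
p_minus = seq_probability("ACGTACGT", minus_init, minus_trans)  # ~5.8e-07
print(p_plus, p_minus, p_plus / p_minus)   # LR ~ 1.7 > 1: more likely a CpG island
```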


LR in practice

• Problems with the analysis of a long DNA sequence with this method:
  – We need to specify a sliding window on which the test will be applied
  – If the window is too small, the results may not be significant (few data points used for the LR)
  – If the window is too large, some CpG islands may be missed.
• Hidden Markov Models provide a framework where we can use a single model to specify both the observations (nucleotides) and a state ("are we in a CpG island or not?"), which we call hidden because it is not observed and does not appear directly when looking at the sequence.

Hidden Markov Models

• Described by Leonard E. Baum and colleagues in a series of 5 papers between 1966 and 1972
• Originally used for speech recognition
• Applied in genetics in the late 1980s
• Now ubiquitous in bioinformatics:

• Multiple alignment
• Functional classification of proteins
• Prediction of …
• Gene finding
• Prediction of protein domains
• Interpretation of peptide tandem MS data
• Analysis of ChIP-chip microarray data
• …


HMM: Formal definition

• Two processes {(S_t, O_t), t = 1, 2, …}, where
  – S_t is the hidden state at time t
  – O_t is the observation at time t
  – Pr(S_t | S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, …) = Pr(S_t | S_{t-1})   (Markov chain)
  – Pr(O_t | S_t, S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, …) = Pr(O_t | S_t)
• Many variants are used (e.g. the distribution of O can depend on the previous S or the previous O)

Example: heterogeneity of DNA

• 2 states {AT, CG}, representing AT-rich or CG-rich regions
• What we see is a series of symbols A, C, G, T.
• What is hidden is the state: are we in an AT-rich or a CG-rich region?
• Example (with made-up numbers) of a possible architecture:
  – Emission probabilities: state AT: A 0.40, C 0.10, G 0.10, T 0.40; state CG: A 0.10, C 0.40, G 0.40, T 0.10
  – Transition probabilities: AT→AT 0.95, AT→CG 0.05, CG→CG 0.90, CG→AT 0.10
• Models are usually specified with a BEGIN and an END state (not shown here)


CpG islands, part 2

• CpG islands can be modelled using an HMM
• Parameters:
  – 2 sets of states A, C, G and T: one set for "CpG island", and another one for "non-CpG island"
  – Transition probabilities between the different states (within the two sets, and between them)
  – Observation probabilities

CpG islands, part 2

(Figure: two sets of four states A, C, G, T, one set for "CpG island" and one for "non-CpG island", with transitions within each set and between the two sets.)

• The green transitions (within each set) are similar to the probabilities defined by our "+" and "-" Markov models for CpG islands.
• The red transitions (between the two sets) incorporate the probabilities of entering or leaving a CpG island (absent from the previous model).


Fitting of an HMM

• If the sequence of states is known, the probabilities can be trained by maximum likelihood, exactly as for our Markov models for CpG islands
  – However, we cannot simply use 2 separate sets of data for CpG islands / non-CpG islands: we require a set of annotated sequences with transitions from one state to the other, in order to estimate these transition probabilities.
• In most cases, however, we know only the sequence of symbols, and not the underlying states
• In this case, there is no closed-form equation to estimate the parameters, and a (usually iterative) optimisation procedure must be used, starting from a random estimate of the parameters (starting point).

Fitting of an HMM

• Given the general architecture of an HMM (number of states, possible transitions) and a list of training sequences, the standard algorithm for estimating the transition and emission probabilities is the Baum-Welch training algorithm (based on the Expectation-Maximization (EM) algorithm)
• An alternative is the Viterbi training algorithm, which calculates the most probable paths for the training sequences and re-estimates the probabilities from these results.


Expectation-Maximization (EM) algorithm

• The EM algorithm allows us to perform maximum likelihood estimation when some data are missing (in this case, the unknown sequence of states).
• It requires a "starting point", i.e. an initial guess of the parameters of the model
• It consists of iterating 2 steps:
  – E-step: using the current model, estimate the missing observations
  – M-step: use the observations (real + estimated) to find a new model by maximum likelihood
• It can be shown that the likelihood increases with each iteration, so that the new model always gets better, until we reach a (hopefully global) maximum
• Used to solve many different problems in bioinformatics

Questions

• Given an HMM and a (new) sequence, we may want to know:
  – What is the probability that the sequence was generated by this model (scoring problem)? Solved by the forward algorithm
  – What is the most probable path of states followed by the HMM when generating this sequence ("annotation" problem, or decoding)? Solved by the Viterbi algorithm
  – What is the most probable state for a given observation (symbol) (posterior probability)? Solved by the backward algorithm.
• These elegant algorithms are based on a similar concept, dynamic programming; we will detail only the forward algorithm.


Forward algorithm

• If the number of states increases, the number of possible paths increases exponentially
• For most models, it is not possible to enumerate all the paths
• The "trick" is to recognise the following fact (based on the Markov property):
  – The probability that the k-th character of a given sequence was generated by state i depends only on which state generated the (k-1)-th character, and on the transition probabilities between the states
  – The actual path that was followed before is not relevant.
• This allows us to calculate the total probability recursively.

Forward algorithm: example

• What is the probability that our model for DNA heterogeneity produced the sequence ATG?
• Naïve approach: 3 symbols, 2 possible states for each symbol, 8 paths to consider

α_1(AT) = P_AT(A) × 0.8 = 0.4 × 0.8 = 0.32
α_1(CG) = P_CG(A) × 0.2 = 0.1 × 0.2 = 0.02

(Figure: trellis for s_1 = A, s_2 = T, s_3 = G over the states AT and CG, with Begin → AT 0.8, Begin → CG 0.2, transitions AT→AT 0.95, AT→CG 0.05, CG→CG 0.90, CG→AT 0.10, and a final transition to End.)

Forward algorithm

α_1(AT) = P_AT(A) × 0.8 = 0.4 × 0.8 = 0.32
α_1(CG) = P_CG(A) × 0.2 = 0.1 × 0.2 = 0.02

α_2(AT) = P_AT(T) × (0.95 × α_1(AT) + 0.1 × α_1(CG)) = 0.1224
α_2(CG) = P_CG(T) × (0.05 × α_1(AT) + 0.9 × α_1(CG)) = 0.0034

(Same trellis as above.)

Forward algorithm

α_1(AT) = P_AT(A) × 0.8 = 0.4 × 0.8 = 0.32
α_1(CG) = P_CG(A) × 0.2 = 0.1 × 0.2 = 0.02

α_2(AT) = P_AT(T) × (0.95 × α_1(AT) + 0.1 × α_1(CG)) = 0.1224
α_2(CG) = P_CG(T) × (0.05 × α_1(AT) + 0.9 × α_1(CG)) = 0.0034

α_3(AT) = P_AT(G) × (0.95 × α_2(AT) + 0.1 × α_2(CG)) ≈ 0.0117
α_3(CG) = P_CG(G) × (0.05 × α_2(AT) + 0.9 × α_2(CG)) ≈ 0.0037

α = α_3(AT) + α_3(CG) ≈ 0.0153

(Same trellis as above.)

Viterbi algorithm and posterior decoding

• For a given sequence (e.g. ATG), the Viterbi algorithm finds the most likely path of states across the model.
• In some cases, this may not be the most relevant quantity (e.g. if two paths have very close probabilities). Posterior decoding using the backward algorithm can find the most probable state for a given observation, summed over all the possible paths (the Viterbi algorithm considers the maximum over all paths instead of the sum).
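For comparison with the forward sketch above, here is a minimal Viterbi decoder (again an added illustration, not the original course code) for the same two-state model: it tracks, for each state, the probability of the best path ending there and then backtracks to recover the most likely state sequence.

```python
# Same two-state AT/CG model as in the forward-algorithm sketch.
states = ("AT", "CG")
start = {"AT": 0.8, "CG": 0.2}
trans = {"AT": {"AT": 0.95, "CG": 0.05},
         "CG": {"AT": 0.10, "CG": 0.90}}
emit = {"AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "CG": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(seq):
    """Most probable state path (max over paths instead of the sum used by forward)."""
    v = {k: start[k] * emit[k][seq[0]] for k in states}
    backpointers = []
    for x in seq[1:]:
        ptr, new_v = {}, {}
        for k in states:
            best_prev = max(states, key=lambda j: v[j] * trans[j][k])
            ptr[k] = best_prev
            new_v[k] = v[best_prev] * trans[best_prev][k] * emit[k][x]
        backpointers.append(ptr)
        v = new_v
    # Backtrack from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path)), v[last]

print(viterbi("ATG"))   # (['AT', 'AT', 'AT'], probability of that best path)
```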

A longer example

• Several hundred bases were generated using the model for heterogeneity of DNA
• The resulting sequence was "decoded" using the Viterbi algorithm to find the most likely sequence of states.


Obs. seq: TTATTTAACTTAATAAATATGTCAATCAATTTTCTGCTTCAGTTCAGTAGGGGAACATCATACTTGGAAAGGAAATATAA Real state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111 Pred.state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111

CTTTATTGCATATTAAGGCGCCACGGGGCCGGCGCGCGGGCATAAAATAGCTATTTTTCTTCGTATGTGAATGAATCCAT 11111111111111111222222222222222222222222211111111221111111111111111111111111111 11111111111111112222222222222222222222222111111111111111111111111111111111111111

TATCCAAATCATTTGATCCATTTAAATTTTATTATGTTTTTCAGCTGTAACAGTAAAGTTTTTACACGCAATTGTGAAGA 11111111111111121111111111111111111111111111111111111111112111122222222111111111 11111111111111111111111111111111111111111111111111111111111111111111111111111111

ATCATCTCAAAAGAAATAAAAAATTTAAGGTGACCCGGGACGGCGCCTGAATTTAGAATCGCCGGGACCCCAGCGACTTG 11111111111111111111111111111222222222222222222211111111111222222222222222222211 11111111111111111111111111112222222222222222222111111111111222222222222222222222

CCACTTTGGGTTCGATATGAATATTATCAACGTGGGGGCCGACCGCGCTTAATAAAATATTATAGTGCAAAAATATAAAG 22221111122111111111111111111112222222222222212222211111111111111111111111111111 22111111111111111111111111111122222222222222222211111111111111111111111111111111

ATCTAATATTGCAGTTCAAATTTGTAATATATATTTTGCCGAAATATGTGCGCGTGGCCCTTTTGACTTAACATTATTCG 11111111111121111111111121111111111111222211111112222222222221111111111111111111 11111111111111111111111111111111111112222111111112222222222211111111111111111122

GCACAGCCCCTGCGCCGAAGAATTAACAATAGATATATTTCATATTTAATACGTAAAACAAAGATTACATTTGCATAGAT 22222222222222222222211111111111111111111111111111111111111111111111111111111111 22222222222222222111111111111111111111111111111111111111111111111111111111111111

TAGGCCTAACCTCGATGATGGTATATAACTATTAATTTGATAATAATAATGGGTTCAATCTATTTTTACGCCCCCACATA 11111111112211111111111111111111111111111111111111122222211111111111122222222211 11222211111111111111111111111111111111111111111111111111111111111111222222211111

Some comments

• The model is unable to find an "island" of one state "2" in a series of states "1" (or the opposite), because two state transitions in a row are much less likely than any observation
• Since we work with random processes, this is expected: a misleading sequence of observations can always appear by chance
• Longer runs in one state are more likely to be found, because the model has more evidence in hand to find them; longer runs are also more likely to be real, and less likely to happen just by chance.
• Boundaries of regions are often incorrect, because a nucleotide at the boundary will be "attached" to the region it is most likely to come from.


Coupling of state and observation

• One problem with HMMs is the fact that the states and observations are coupled.
• If a symbol is emitted at each visit to a state and the probability of staying in a given state is (1-p), the number of symbols emitted is distributed according to a geometric distribution (mean 1/p; see the sketch below).
• Variants have been developed where state and observation are decoupled; they are called generalized HMMs (GHMMs), HMMs with duration, or hidden semi-Markov models.
• With a GHMM, states do not loop on themselves to repeat symbols; instead, each state can emit a number of symbols according to an arbitrary distribution.
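A quick simulation sketch (added, not in the original slides) of the geometric length distribution: if the probability of staying in a state is 1-p, the number of symbols emitted before leaving has mean 1/p.

```python
import random

random.seed(0)

def state_duration(p):
    """Number of symbols emitted before leaving a state whose self-loop probability is 1-p."""
    n = 1
    while random.random() > p:   # with probability 1-p we stay and emit one more symbol
        n += 1
    return n

p = 0.05
durations = [state_duration(p) for _ in range(100_000)]
print(sum(durations) / len(durations))   # close to the geometric mean 1/p = 20
```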

Gene finding

• Given all the genomes that have been sequenced in the past few years, gene finding is an important (and challenging) task.
• By "gene finding" we can mean:
  – Find whether a particular piece of DNA sequence is in a gene or not (segmentation)
  – Find and annotate all the different parts of a gene
  – … and anything in between


A very simple model for prokaryotes

• Calculate the frequency of appearance of bases (or codons) in genes and in intergenic regions
• Similar to the models we used for the heterogeneity of DNA (segmentation)
• This could be modelled using an HMM with this architecture and the relevant observation probabilities:

(Figure: two states, "Coding region" and "Non-coding region"; non-coding → coding with probability p, coding → non-coding with probability q, and self-loops with probabilities 1-p and 1-q.)

More sophisticated model

• A more sophisticated model uses frame-dependent composition:

(Figure: the coding state is replaced by three states "Position 1" → "Position 2" → "Position 3", traversed with probability 1; the remaining transitions, labelled p, q, 1-p and 1-q in the original diagram, connect the non-coding state and the coding cycle.)

More sophisticated model (2)

• We can modify our model so that it requires a translation initiation site (ATG), as well as a translation termination site (TAA, TAG or TGA).
• Note that the "Coding region" and "Non-coding region" boxes can themselves represent HMMs (not detailed here)

(Figure: the non-coding state enters the coding region through an A → T → G start-codon path, and the coding region returns to the non-coding state through the stop-codon paths T-A-A, T-A-G and T-G-A.)

More sophisticated models

• We can build more complicated models on top of the simpler ones
  – Promoters, second strand, etc.
• For every model, we hope that it will be better at finding real genes than the previous ones
• However, if the model gets more complicated, we have more parameters to estimate.

Human gene finding

• Finding genes in eukaryotic genomes is complicated by the presence of exons and introns.
• Many gene finders have been developed specifically for finding genes in the human genome
• GENSCAN is currently one of the most popular and successful human gene finders
• It is based on a GHMM.

Burge et al., J. Mol. Biol., 1997

GENSCAN

(Figure: the GENSCAN GHMM state diagram, with exon states E0, E1, E2, intron states I0, I1, I2, initial and terminal exon states Ei and Et, a single-exon state Es, 5'UTR, 3'UTR, promoter, poly-A and intergenic region states, duplicated for the forward (+) and reverse (-) strands, shown next to an annotated stretch of human genomic sequence.)

Assessing performance

• The performance of a predictive model can be assessed using a second dataset (testing set), different from the one used to train the model (learning or training set). We assume that we know the "truth" in each dataset.
• Important quantities:
  – TP (True Positives): number of positive predictions which are correct
  – FP (False Positives): number of positive predictions which are incorrect (i.e. spurious prediction of a site)
  – TN (True Negatives): number of negative predictions which are correct
  – FN (False Negatives): number of negative predictions which are incorrect (i.e. true site, missed).


Assessing performance

• Most of the time, the number of False Positives and False Negatives (i.e. the errors that we would like to minimise) is not fixed, but depends on a continuous score (e.g. the probability returned by the forward algorithm) on which we set a threshold.
• Changing this threshold will change the number of errors; a higher threshold (i.e. a more stringent criterion for classifying an observation as "positive") will usually result in a lower number of False Positives, but a higher number of False Negatives

Specificity and sensitivity

• Sensitivity is TP/(TP+FN), the proportion of true sites that are correctly found and annotated
• Specificity is generally defined as TN/(TN+FP), the proportion of false sites that are correctly annotated
• In some cases, specificity is defined instead as TP/(TP+FP), the proportion of all positive predictions that are correct. This is especially useful if the proportion of false sites correctly annotated is high, in which case TN/(TN+FP) will not be sensitive to changes in FP.
• The false positive rate is 1 - Specificity. (A small computation sketch is given below.)

S.E. Cawley, A.I. Wirth and T.P. Speed, Mol. and Biochem. Parasitology, 118(2), 167-174 (2001).
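A tiny sketch (not in the original slides) computing TP, FP, TN, FN, sensitivity and both definitions of specificity for a chosen threshold; the scores and labels are made-up toy data.

```python
def confusion(scores, labels, threshold):
    """Classify as 'positive' every observation whose score exceeds the threshold."""
    tp = sum(s > threshold and y for s, y in zip(scores, labels))
    fp = sum(s > threshold and not y for s, y in zip(scores, labels))
    tn = sum(s <= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s <= threshold and y for s, y in zip(scores, labels))
    sensitivity = tp / (tp + fn)                  # proportion of true sites found
    specificity = tn / (tn + fp)                  # proportion of false sites correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0  # alternative "specificity" TP/(TP+FP)
    return tp, fp, tn, fn, sensitivity, specificity, precision

# Toy example: made-up scores; labels are True for real sites.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]
print(confusion(scores, labels, threshold=0.5))
```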

Graphical representation

(Figure: two overlapping score distributions, one for negative results and one for positive results, with a chosen threshold on the score axis. Observations scoring above the threshold are classified as "positive", those below as "negative"; the areas on each side of the threshold correspond to TN, FN, FP and TP.)

The two distributions overlap, so it is impossible to use this score to perfectly discriminate between positive and negative results.


Comparison of different models

• Different models can be applied to a given statistical problem, as we have seen in the case of gene finding.

• ROC (Receiver Operating Characteristic) curves help summarise the FP/FN rates of one or several models.

ROC curve

(Figure: a ROC curve, with the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis, both running from 0% to 100%.)


ROC curve

• The Area Under the Curve (AUC) provides a way to compare the global performance of different models (see the sketch below).
• A perfect model would have an AUC of 1
• A random model would have an AUC of 0.5 (worst)
• Models in between can have AUCs around 0.9 (excellent), 0.8 (good), 0.7 (fair), 0.6 (poor).
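To make the AUC concrete, here is a small sketch (an added illustration) that sweeps the threshold over all observed scores, collects the (FPR, TPR) points of the ROC curve and integrates it with the trapezoidal rule; the scores and labels are toy data.

```python
def roc_auc(scores, labels):
    """Build the ROC curve by sweeping a threshold over the scores, and return its area."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort observations by decreasing score; lower the threshold one observation at a time.
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal integration of TPR over FPR.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]   # made-up scores
labels = [True, True, False, True, False, True, False, False]
points, auc = roc_auc(scores, labels)
print(auc)   # 0.8125 for this toy data, i.e. a "good" classifier
```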

ROC curves and other methods

• Problems with ROC curves:
  – When comparing two curves, the comparison is not straightforward if the two curves cross once or several times.
  – Most of the time, we are only interested in some regions of the graph, typically the low-FP region (we usually do not care about an algorithm that produces 85% false positives, even if it produces no false negatives).
  – The comparison between models does not take into account the complexity of the models; a model A may be slightly better than model B, but if model A is much more complicated (higher number of parameters), the gain may not be worth the extra work. Comparison methods that penalise more complicated models also exist (for example, AIC or BIC).


What is still missing with HMMs ?

• By default, HMMs use a Markov model of order 1, i.e. the transition and emission probabilities depend only on the current state
• This means that they cannot take into account dependencies between the sites
• Models of higher order, which solve this problem, can be transformed into models of order 1 by increasing the number of states
• Doing so is straightforward, but the complexity of the resulting models increases exponentially and they quickly become computationally infeasible.
• In general, this problem is unsolved.

What is still missing with HMMs ?

• Example: probabilities of observation of nucleotides in DNA in a Markov model
• Increase the number of states by having a state for each digram instead of each symbol (see the sketch after the tables below)

Order-1 model (one state per symbol):

        A     C     G     T
  A   0.18  0.27  0.43  0.12
  C   0.17  0.37  0.27  0.19
  G   0.16  0.34  0.38  0.12
  T   0.08  0.36  0.38  0.18

Equivalent model with one state per digram:

        A     C     G     T
  AA  0.18  0.27  0.43  0.12
  AC  0.17  0.37  0.27  0.19
  AG  0.16  0.34  0.38  0.12
  AT  0.17  0.37  0.27  0.19
  CA  0.16  0.34  0.38  0.12
  CC  0.08  0.36  0.38  0.18
  …     …     …     …     …
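A minimal sketch (not from the slides) of the trick described above: an order-2 model is stored as an order-1 transition table whose states are digrams, so the distribution of the next base is looked up with the previous two bases. The probabilities are illustrative placeholders.

```python
from itertools import product

BASES = "ACGT"

# Order-2 transition probabilities P(next base | previous two bases).
# Illustrative values only; a real model would be estimated from counts as before.
order2 = {"".join(pair): {b: 0.25 for b in BASES}
          for pair in product(BASES, repeat=2)}
order2["CA"] = {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12}

def next_base_distribution(sequence):
    """Equivalent order-1 lookup: the 'state' is the digram made of the last two bases."""
    state = sequence[-2:]
    return order2[state]

print(next_base_distribution("TTGCA"))   # distribution of the base following ...CA
```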


Other stochastic models

• Many other stochastic models are available, such as:
  – Variable Length Markov Models: many variants; use models of order higher than 1, but only when needed, and simpler models otherwise; a compromise that reduces the increase in complexity. Based on trees.
  – Permuted Variable Length Markov Models: similar to VLMMs, but the sites may be reordered, so that the "previous" site in the probabilities may not be the previous site in the sequence; this makes it possible to model simple but long-range effects

Some caveats…

• Some care has to be taken with these models, because they can break down at any time, without warning!
• Ideally, any prediction made with a statistical model should be confirmed experimentally.
• However, these models have been hugely successful, as shown by our examples and references


Introduction: Coiled-coil domains

• Coiled-coil domains (CCD) are a protein motif found in many types of proteins
• They consist of two or more identical strands of amino acid sequences forming alpha-helices that are wrapped around each other, forming one of the simplest tertiary structures
• Most coiled-coil domains contain a pattern called the heptad repeat: seven residues of the form abcdefg, where the a and d positions are generally hydrophobic.

Coiled-coil domains

• This apparent structure is a good candidate for statistical modeling.
• Methods used for modeling:
  – PSSM
  – HMM
• For the practical work, we will use an HMM to model CCDs.


References (web)

• Wikipedia (http://en.wikipedia.org/) contains much information on HMMs and other topics
• Introduction to Coiled Coils: http://www.lifesci.sussex.ac.uk/research/woolfson/html/coils.html
• Terry Speed's courses at the University of California, Berkeley: http://www.stat.berkeley.edu/users/terry/Classes/index.html

References (General articles)

• Sean R. Eddy, "What is a hidden Markov model?", Nat. Biotech. 22, p. 1315-1316 (2004).
• Karen Heyman, "Gene Finding with Hidden Markov Models", The Scientist, 19(6), p. 26 (2005).
• L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. IEEE 77(2), p. 257-285 (1989).


References (Specific articles)

• p53 transcription factor:
  – J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", PNAS 99(13), 25 June 2002, 8467-8472.
  – C.-L. Wei et al., "A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome", Cell 124, 13 January 2006, 207-219.
• Other stochastic models:
  – X. Zhao, H. Huang and T.P. Speed, "Finding short DNA motifs using permuted Markov models", Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB 2004).

References (Specific articles)

• S. Tavaré, "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences", Lectures on Mathematics in the Life Sciences, American Mathematical Society, vol. 17, 1986.
• L. Peshkin and M.S. Gelfand, "Segmentation of yeast DNA using hidden Markov models", Bioinformatics 15(12), p. 980-986 (1999).
• C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA", J. Mol. Biol. 268, p. 78-94 (1997).
• M. Delorenzi and T. Speed, "An HMM model for coiled-coil domains and a comparison with PSSM-based predictions", Bioinformatics 18(4), p. 617-625 (2002).


References (Books)

• R. Durbin, S. Eddy, A. Krogh and G. Mitchison, "Biological sequence analysis", Cambridge University Press, 1998.
• T. Koski, "Hidden Markov Models for Bioinformatics", Kluwer Academic Publishers, 2001.
• W.J. Ewens and G.R. Grant, "Statistical Methods in Bioinformatics" (2nd edition), Springer, 2005.
• S.M. Ross, "Stochastic Processes" (2nd edition), John Wiley and Sons, 1996.

Acknowledgements

• Many slides and ideas were borrowed from talks and courses by Terry Speed (University of California, Berkeley, and WEHI, Melbourne, Australia)


More questions ?

[email protected]

[email protected]
