
Stochastic processes and Hidden Markov Models

Dr Mauro Delorenzi and Dr Frédéric Schütz

Swiss Institute of Bioinformatics

EMBnet course – Basel 23.3.2006

Introduction

• A mainstream topic in bioinformatics is the problem of annotation: given a DNA/RNA or protein sequence, we want to identify "interesting" elements
• Examples:
  – DNA/RNA: genes, promoters, splicing signals, segmentation of heterogeneous DNA, binding sites, etc.
  – Proteins: coiled-coil domains, transmembrane domains, signal peptides, phosphorylation sites, etc.
  – Generally: homologs, etc.

• "The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster" – http://www.fruitfly.org/GASP1/tutorial/presentation/

Sequence annotation

• The sequence of many of these interesting elements can be characterized statistically, so we are interested in modeling them.
• By modeling, we try to find statistical models that can:
  – Accurately describe the observed elements in the provided examples;
  – Accurately predict the presence of particular elements in new, unannotated sequences;
  – If possible, be readily interpretable and provide some insight into the actual biological process involved (i.e. not a black box).


Example: heterogeneity of DNA sequences

• The nucleotide composition of segments of genomic DNA changes between different regions in a single organism
  – Example: coding regions in the human genome tend to be GC-rich.
• Modeling the differences between different homogeneous regions is interesting because
  – These differences often have a biological meaning
  – Many bioinformatics tools depend on the "background distribution" of nucleotides, which is often assumed to be constant.

Modeling tools (quick review)

• Among the different tools used for modeling sequences, we have (sorted by increasing complexity):
  – Consensus sequences
  – Regular expressions
  – Position-Specific Scoring Matrices (PSSM), or Weight Matrices
  – Markov Models, Hidden Markov Models and other stochastic processes
• These tools (in particular the stochastic processes) are also used for bioinformatics problems other than pure sequence analysis.


Consensus sequence

• Exact sequence that corresponds to a certain region
• Example: transcription initiation in E. coli
  – Transcription is initiated at the promoter; the sequence of the promoter is recognised by the sigma factor of RNA polymerase
  – For the sigma factor σ70, the consensus sequence of the promoter is given by
    -35: TTGACA … -10: TATAAT
• Very rigid: does not allow for any variation
• This works well for restriction enzyme sites or, in general, for sites for which strict conservation is important (in the case of restriction sites, cutting of the DNA at a certain site is a question of "life and death" for the DNA)

Example: binding site for TF p53

• The Transcription Factor Binding Site (TFBS) for p53 has been described as having the consensus sequence

GGA CATG CCC * GGG CATG TCT

where * represents a spacer of variable length.

• In this case, the sequence is not entirely conserved; this is believed to allow the cell some flexibility in the level of response to different signals (something that was not possible or desirable for restriction sites).


Example: binding site for TF p53

• This flexibility translates into the need for more complicated models to describe the site.
• Since the binding site is not entirely conserved, the consensus sequence represents only the nucleotides most frequently observed.
• The protein could potentially bind to many other similar, but different, sites along the genome.
• In theory, if the sites are not independent, the protein may not even bind to the actual consensus sequence!

Patterns / Regular Expressions

• Patterns attempt to explain observed motifs by identifying the most important combinations of positions and nucleotides/residues of a given site (compare with the consensus sequence, where only the most important nucleotide/residue at each position is identified)
• They are often described using the regular expression syntax.
• Prosite database (developed at the SIB): http://www.expasy.org/prosite/


Example: Cys-Cys-His-His zinc finger DNA binding domain

• Its characteristic motif has the regular expression

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

• where 'x' stands for any amino acid, '(2,4)' means between 2 and 4 occurrences, and '[…]' indicates a list of possible amino acids.

• Example: 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX

Example: TFBS for p53

• The TFBS has been described with the pattern →← … →← where →← is the palindromic sequence
  (5') Pu-Pu-Pu-C-[AT]-[TA]-G-Py-Py-Py (3')
  and "…" is a spacer of 0 to 14 nucleotides

• Note that this pattern (with the palindromic condition) cannot be expressed using a regular expression (at least not in a simple or general way).

J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", PNAS 99(13), June 2002, 8467-8472.

Example: TFBS for p53

• The pattern approach clearly allows more flexibility than the consensus sequence; however, it is still too rigid, especially for sites that are not well conserved.
• When applying the pattern, each possible amino acid or nucleotide at a given position has the same weight, i.e. it is not possible to specify whether one is more likely to appear than another.

Position-Specific Scoring Matrices

• A "stochastic consensus sequence"
• Indicates the relative importance of a given nucleotide or amino acid at a certain position.
• Usually built from an alignment of sequences corresponding to the domain we are interested in, and either a collection of sequences known not to contain the domain, or (most often) background frequencies for the different nucleotides or amino acids.


Building a PSSM

Counts from 242 known sites:

  Pos.    1     2     3     4     5     6
  A       9   214    63   142   118     8
  C      22     7    26    31    52    13
  G      18     2    29    38    29     5
  T     193    19   124    31    43   216

Relative frequencies f_bl:

  Pos.    1      2      3      4      5      6
  A     0.04   0.88   0.26   0.59   0.49   0.03
  C     0.09   0.03   0.11   0.13   0.22   0.05
  G     0.07   0.01   0.12   0.16   0.12   0.02
  T     0.80   0.08   0.51   0.13   0.18   0.89

PSSM: log f_bl / p_b (p_b = background probabilities):

  Pos.     1       2       3       4       5       6
  A     -2.76    1.82    0.06    1.23    0.96   -2.92
  C     -1.46   -3.11   -1.22   -1.00   -0.22   -2.21
  G     -1.76   -5.00   -1.06   -0.67   -1.06   -3.58
  T      1.67   -1.66    1.04   -1.00   -0.49    1.84

Scoring a sequence using a PSSM

Sequence: C T A T A A T C

Score matrix:

  Pos.     1      2      3      4      5      6
  A      -38     19      1     12     10    -48
  C      -15    -38     -8    -10     -3    -32
  G      -13    -48     -6     -7    -10    -48
  T       17    -32      8     -9     -6     19

• Move the matrix along the sequence and sum the scores in each "window" (see the sketch below):
  – CTATAA: -93
  – TATAAT: +85
  – ATAATC: -95
• Peaks should occur at the "true" sites
• Of course, in general any threshold will have some false positive and false negative rate

Sequence logo: graphical representation Cys-Cys-His-His zinc finger DNA binding domain

The total height of each stack represents the degree of conservation at that position; the heights of the letters in a stack are proportional to their frequencies.

PSSM for p53 binding site

Counts from 37 known sites

  Pos.    1    2    3    4    5    6    7     8    9   10    11   12    13   14   15   16   17    18   19   20
  A      14   11   26    0   28  2.5    0   0.5    0    3     6    2  11.5    0   27    4    0   0.5    1    2
  C       3    1    1   36    1  0.5    0  24.5   33   23     2    0   0.5   36    2    0    0   9.5   24   15
  G      16   24   10    0    0    0   37     0    0    0  23.5   34    25    0    2    1   37     0    0    3
  T       4    1    0    1    7   34    0    12    4   10   5.5    1     0    1    5   32    0    27   12   16

J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", 2002.

What is missing ?

• PSSMs help to deal with the stochastic distribution of symbols at a given position.
• However, they lack the ability to deal with the length distribution of the motif they describe.
• Stochastic processes provide a more general framework to deal with these questions.


Stochastic processes

• A stochastic process X is a collection of random variables {X_t, t ∈ T}
• The variable t is often interpreted as time, and X(t) is the state of the process at time t.
• A realization of X is called a sample path.
• While this definition is quite general, there are a number of special cases that are of high interest in bioinformatics, in particular Markov processes.


Markov Chain

• A Markov process is a stochastic process X with the following properties:

  – The random variables X_t can take values in a finite list of states S = {s_1, s_2, …, s_n} (often taken as {1, 2, …, n})
  – P(X_{t+1} = j_{t+1} | X_t = j_t, X_{t-1} = j_{t-1}, …) = P(X_{t+1} = j_{t+1} | X_t = j_t)
    This means that the probability of getting to the next state depends only on the current state, and not on the previous states; in other words, the Markov process is memoryless.
  – We assume that the process is homogeneous, meaning that the probabilities are the same at all time points: P(X_{t+1} = j | X_t = i) = p_ij
  – The p_ij are usually represented as a matrix P (rows: current state i, columns: next state j) together with an initial vector π:

      P = | p_11  p_12  …  p_1n |
          | p_21  p_22  …  p_2n |
          | p_31  p_32  …  p_3n |
          |  …                  |
          | p_n1  p_n2  …  p_nn |

Markov chain: matrix form

• Given a vector π_t containing the probabilities of each state at time t, the probabilities of the states at time t+1 are given by π_{t+1} = π_t P
• Looking further into the future of the chain, the probabilities P(X_{t+n} = j | X_t = i) are given by

  π_{t+n} = π_t P^n

  (repeated matrix multiplication with the transition matrix P defined above calculates the long-range probabilities)

Stationary processes

• Under certain conditions, there exists a stationary distribution for the states of the chain, that is, a distribution for which the probabilities of the different states do not change anymore.
• A distribution π is stationary if π = π P
• If it exists, the stationary distribution can be found by solving this equation (along with the constraint that the elements of π must sum to 1)
• An alternative way is to calculate lim_{n→∞} P^n

An example: mutations in DNA

• A Markov chain can be used to model mutations in DNA, e.g. for use in evolutionary problems.
• Different models are possible, depending on the constraints that we place on the possible mutations
• See for example http://www.molecularevolution.org/resources/models/dnamodels.php
• Example (fake):

        A     C     G     T
  A   0.80  0.05  0.05  0.10
  C   0.05  0.70  0.20  0.05
  G   0.05  0.20  0.70  0.05
  T   0.10  0.05  0.05  0.80

S. Tavaré, "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences".

Models for mutations in DNA

• The stationary distribution of this process is given by the vector (0.25, 0.25, 0.25, 0.25), meaning that after a certain time the distribution of nucleotides will be uniform, regardless of the initial distribution (see the sketch below).
• Of course, this does not take into account other factors such as natural selection, etc.
• These substitution models are especially important for database searches of proteins, where they allow the search to give more weight to sequences that are evolutionarily close, even if they are different from the input (cf. PAM matrices, etc.).
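A small numpy sketch (added here, not in the original slides) illustrating π_{t+n} = π_t P^n with the fake mutation matrix above: whatever the initial distribution, the propagated probabilities converge to the uniform stationary distribution (0.25, 0.25, 0.25, 0.25).

```python
import numpy as np

# Transition matrix of the (fake) mutation model above; rows/columns in order A, C, G, T.
P = np.array([
    [0.80, 0.05, 0.05, 0.10],
    [0.05, 0.70, 0.20, 0.05],
    [0.05, 0.20, 0.70, 0.05],
    [0.10, 0.05, 0.05, 0.80],
])

pi0 = np.array([1.0, 0.0, 0.0, 0.0])      # start from a pure "A" distribution

# pi_{t+n} = pi_t P^n : propagate the distribution n generations ahead
for n in (1, 10, 100):
    print(n, pi0 @ np.linalg.matrix_power(P, n))

# The limit is the stationary distribution pi = pi P, here (0.25, 0.25, 0.25, 0.25).
pi_stat = pi0 @ np.linalg.matrix_power(P, 1000)
print(np.allclose(pi_stat, 0.25, atol=1e-6))   # True
```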

Example: CpG islands

• In the human genome, the frequency of the dinucleotide CG (written CpG to differentiate it from a C-G pair across the two strands of DNA) is lower than what would be expected from the frequencies of the C and G nucleotides, because the cytosine is usually methylated and often mutates into a thymine (T)
• However, in some regions of the DNA (in general around the promoter region of a gene), called CpG islands, the methylation process is suppressed and a much higher frequency of CpG dinucleotides is observed.


Difference between the 2 examples

• In the example concerning mutations of DNA, we have a single nucleotide that changes over time
• If we look at several nucleotides, they can mutate independently, and a different Markov chain is associated with each of them:

  ACTAGCTAGCTTGATCTGATCGACTGTGG
  ACAAGGTAGCTAGACCTGATCGACAGTGG
  TCAAGCTAGATAGACCTGCTCGTCGGTGG
  ...

• In the example concerning CpG islands, we are looking at consecutive nucleotides and at how they are linked, using a single Markov chain:

A→C→T→A→G→C→T→A→G→C→T→T→G→A→T→C→T ...

A Markov model for CpG islands

• Roughly speaking, being in a CpG island means that C is more frequent than in the rest of the genome, and is more often followed by G.
• This can be modeled using a Markov model.
• Given a list of sequences from putative CpG islands, a Markov model ("+") can be derived
• Given a list of sequences that are not part of a CpG island, a second Markov model ("-") can be derived


Deriving the Markov models

• Data from 48 putative CpG islands, total of about 60,000 nucleotides.
• To estimate the probabilities, calculate the observed frequencies of transition from each nucleotide to any other (maximum likelihood estimator; see the sketch below).
• Example: in CpG islands, 1,000 "A"s are observed, 270 of which are followed by a "C"; this gives an estimated probability P(C|A) = 0.27.

  Model "+"                           Model "-"
        A     C     G     T                 A     C     G     T
  A   0.18  0.27  0.43  0.12          A   0.30  0.21  0.28  0.21
  C   0.17  0.37  0.27  0.19          C   0.32  0.30  0.08  0.30
  G   0.16  0.34  0.38  0.12          G   0.25  0.24  0.30  0.21
  T   0.08  0.36  0.38  0.18          T   0.18  0.24  0.29  0.29

Durbin, Eddy, Krogh and Mitchison, "Biological sequence analysis"
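A minimal Python sketch of the maximum-likelihood estimation described above (an added illustration, not the actual training code behind the slides): count how often each nucleotide is followed by each other nucleotide in a set of training sequences and normalise each row. The toy training sequences are made up.

```python
from collections import defaultdict

def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum-likelihood estimate of a first-order Markov model:
    P(b | a) = (times a is followed by b) / (times a is followed by anything)."""
    counts = {a: defaultdict(int) for a in alphabet}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a].values())
        model[a] = {b: (counts[a][b] / total if total else 0.0) for b in alphabet}
    return model

# Toy usage with made-up data (real estimates would use the 48 putative CpG islands):
plus_model = estimate_transitions(["ACGCGCGTACGCGG", "CGCGCAGCGCG"])
print(plus_model["A"])
```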

How to use the model?

• For a given sequence X_0 X_1 X_2 … X_n, we are interested in calculating the probability that one of our Markov models generated this particular sequence
• This is easily done using the Markov property:

  P(X_0, X_1, X_2, …, X_n) = P(X_0) P(X_1|X_0) P(X_2|X_1) … P(X_n|X_{n-1})

• The different probabilities are provided either by the transition matrix or by an initial probability vector
• The initial probability vector can correspond, for example, to the stationary distribution, here:

  Model "+":  A 0.16   C 0.34   G 0.35   T 0.15
  Model "-":  A 0.26   C 0.24   G 0.24   T 0.25


How to use the model ?

• Example: A → C → G → T → A → C → G → T

  P(CpG) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 ≈ 9.8 × 10^-7

• Simply calculating the probability of going through a certain path in the Markov chain may not be the best thing to do, because the probability will keep becoming smaller and smaller (the probability of seeing any particular sequence decreases quickly as the sequence length increases).
• More interesting is the likelihood ratio (LR): the probability according to the "CpG island" model, compared with the probability according to a "background model".

Likelihood Ratio

• The background distribution can be given by the "non-CpG island" Markov chain, or by another distribution of probabilities.
• The LR will indicate whether we are more likely to be in a CpG island (LR > 1) or outside of one (LR < 1); in practice its logarithm is often used.
• In our example (see the sketch below):
  – P(+) = 0.16 × 0.27 × 0.27 × 0.12 × 0.08 × 0.27 × 0.27 × 0.12 ≈ 9.8 × 10^-7
  – P(-) = 0.26 × 0.21 × 0.08 × 0.21 × 0.18 × 0.21 × 0.08 × 0.21 ≈ 5.8 × 10^-7
  – LR = (9.8 × 10^-7) / (5.8 × 10^-7) ≈ 1.7
• This is an example of why we indicated that the problem of DNA segmentation was important.
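A small sketch (not from the slides) that recomputes the two probabilities and the likelihood ratio for the sequence ACGTACGT, using the "+" and "-" transition matrices and the stationary initial distributions given earlier.

```python
import numpy as np

BASES = "ACGT"

def seq_probability(seq, init, trans):
    """P(x_0, ..., x_n) = P(x_0) * prod P(x_i | x_{i-1}) under a first-order Markov chain."""
    p = init[BASES.index(seq[0])]
    for a, b in zip(seq, seq[1:]):
        p *= trans[BASES.index(a), BASES.index(b)]
    return p

plus_trans = np.array([[0.18, 0.27, 0.43, 0.12],
                       [0.17, 0.37, 0.27, 0.19],
                       [0.16, 0.34, 0.38, 0.12],
                       [0.08, 0.36, 0.38, 0.18]])
minus_trans = np.array([[0.30, 0.21, 0.28, 0.21],
                        [0.32, 0.30, 0.08, 0.30],
                        [0.25, 0.24, 0.30, 0.21],
                        [0.18, 0.24, 0.29, 0.29]])
plus_init = np.array([0.16, 0.34, 0.35, 0.15])
minus_init = np.array([0.26, 0.24, 0.24, 0.25])

p_plus = seq_probability("ACGTACGT", plus_init, plus_trans)     # ~9.8e-07
p_minus = seq_probability("ACGTACGT", minus_init, minus_trans)  # ~5.8e-07
print(p_plus, p_minus, p_plus / p_minus)   # LR ~ 1.7 > 1: more likely a CpG island
```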


LR in practice

• Problems with the analysis of a long DNA sequence with this method:
  – We need to specify a sliding window on which the test will be applied
  – If the window is too small, the results may not be significant (few data points used for the LR)
  – If the window is too large, some CpG islands may be missed.
• Hidden Markov Models provide a framework where we can use a single model to specify both the observations (nucleotides) and a state ("are we in a CpG island or not?"), which we call hidden because it is not observed and does not appear directly when looking at the sequence.

Hidden Markov Models

• Described by Leonard E. Baum and colleagues in a series of 5 papers between 1966 and 1972
• Originally used for speech recognition
• Applied in genetics in the late 1980s
• Now ubiquitous in bioinformatics:

• Multiple alignment
• Functional classification of proteins
• Prediction of …
• Gene finding
• Prediction of protein domains
• Interpretation of peptide tandem MS data
• Analysis of ChIP-chip microarray data
• …


HMM: Formal definition

• Two processes {(S_t, O_t), t = 1, 2, …}, where
  – S_t is the hidden state at time t
  – O_t is the observation at time t
  – Pr(S_t | S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, …) = Pr(S_t | S_{t-1})   (Markov chain)
  – Pr(O_t | S_t, S_{t-1}, O_{t-1}, S_{t-2}, O_{t-2}, …) = Pr(O_t | S_t)
• Many variants are used (e.g. the distribution of O can depend on the previous S or the previous O)

Example: heterogeneity of DNA

• 2 states {AT, CG}, representing AT-rich or CG-rich regions
• What we see is a series of symbols A, C, G, T.
• What is hidden is the state: are we in an AT-rich or a CG-rich region?
• Example (with made-up numbers) of a possible architecture:
  – Emission probabilities: state AT: A 0.40, C 0.10, G 0.10, T 0.40; state CG: A 0.10, C 0.40, G 0.40, T 0.10
  – Transition probabilities: AT→AT 0.95, AT→CG 0.05, CG→CG 0.90, CG→AT 0.10
• Models are usually specified with a BEGIN and an END state (not shown here)


CpG islands, part 2

• CpG islands can be modelled using an HMM
• Parameters:
  – 2 sets of states A, C, G and T: one set for "CpG island", and another one for "non-CpG island"
  – Transition probabilities between the different states (within the two sets, and between them)
  – Observation probabilities

CpG islands, part 2

(Figure: two sets of four states A, C, G, T, one set for "CpG island" and one for "non-CpG island", with transitions within each set and between the two sets.)

• The green transitions (within each set) are similar to the probabilities defined by our "+" and "-" Markov models for CpG islands.
• The red transitions (between the two sets) incorporate the probabilities of entering or leaving a CpG island (absent from the previous model).


Fitting of an HMM

• If the sequence of states is known, the probabilities can be trained by maximum likelihood, exactly as for our Markov models for CpG islands
  – However, we cannot simply use 2 separate sets of data for CpG islands / non-CpG islands: we require a set of annotated sequences with transitions from one state to the other, in order to estimate these transition probabilities.
• In most cases, however, we know only the sequence of symbols, and not the underlying states
• In this case, there is no closed-form equation to estimate the parameters, and a (usually iterative) optimisation procedure must be used, starting from a random estimate of the parameters (starting point).

Fitting of an HMM

• Given the general architecture of an HMM (number of states, possible transitions) and a list of training sequences, the standard algorithm for estimating the transition and emission probabilities is the Baum-Welch training algorithm (based on the Expectation-Maximization (EM) algorithm)
• An alternative is the Viterbi training algorithm, which calculates the most probable paths for the training sequences and re-estimates the probabilities from these results.


Expectation-Maximization (EM) algorithm

• The EM algorithm allows us to perform maximum likelihood estimation when some data are missing (in this case, the unknown sequence of states).
• It requires a "starting point", i.e. an initial guess of the parameters of the model
• It consists of iterating 2 steps:
  – E-step: using the current model, estimate the missing observations
  – M-step: use the observations (real + estimated) to find a new model by maximum likelihood
• It can be shown that the likelihood increases with each iteration, so that the new model always gets better, until we reach a (hopefully global) maximum
• Used to solve many different problems in bioinformatics

Questions

• Given an HMM and a (new) sequence, we may want to know:
  – What is the probability that the sequence was generated by this model (scoring problem)? Solved by the forward algorithm
  – What is the most probable path of states followed by the HMM when generating this sequence ("annotation" problem, or decoding)? Solved by the Viterbi algorithm
  – What is the most probable state for a given observation (symbol) (posterior probability)? Solved by the backward algorithm.
• These elegant algorithms are based on a similar concept, dynamic programming; we will detail only the forward algorithm.


Forward algorithm

• If the number of states increases, the number of possible paths increases exponentially
• For most models, it is not possible to enumerate all the paths
• The "trick" is to recognise the following fact (based on the Markov property):
  – The probability that the k-th character of a given sequence was generated by state i depends only on which state generated the (k-1)-th character, and on the transition probabilities between the states
  – The actual path that was followed before is not relevant.
• This allows us to calculate the total probability recursively.

Forward algorithm: example

• What is the probability that our model for DNA heterogeneity produced the sequence ATG?
• Naïve approach: 3 symbols, 2 possible states for each symbol, 8 paths to consider

α_1(AT) = P_AT(A) × 0.8 = 0.4 × 0.8 = 0.32
α_1(CG) = P_CG(A) × 0.2 = 0.1 × 0.2 = 0.02

(Figure: trellis for s_1 = A, s_2 = T, s_3 = G over the states AT and CG, with Begin → AT 0.8, Begin → CG 0.2, transitions AT→AT 0.95, AT→CG 0.05, CG→CG 0.90, CG→AT 0.10, and a final transition to End.)

Forward algorithm

α_1(AT) = P_AT(A) × 0.8 = 0.4 × 0.8 = 0.32
α_1(CG) = P_CG(A) × 0.2 = 0.1 × 0.2 = 0.02

α_2(AT) = P_AT(T) × (0.95 × α_1(AT) + 0.1 × α_1(CG)) = 0.1224
α_2(CG) = P_CG(T) × (0.05 × α_1(AT) + 0.9 × α_1(CG)) = 0.0034

(Same trellis as above.)

Forward algorithm

α_1(AT) = P_AT(A) × 0.8 = 0.4 × 0.8 = 0.32
α_1(CG) = P_CG(A) × 0.2 = 0.1 × 0.2 = 0.02

α_2(AT) = P_AT(T) × (0.95 × α_1(AT) + 0.1 × α_1(CG)) = 0.1224
α_2(CG) = P_CG(T) × (0.05 × α_1(AT) + 0.9 × α_1(CG)) = 0.0034

α_3(AT) = P_AT(G) × (0.95 × α_2(AT) + 0.1 × α_2(CG)) ≈ 0.0117
α_3(CG) = P_CG(G) × (0.05 × α_2(AT) + 0.9 × α_2(CG)) ≈ 0.0037

α = α_3(AT) + α_3(CG) ≈ 0.0153

(Same trellis as above.)

Viterbi algorithm and posterior decoding

• For a given sequence (e.g. ATG), the Viterbi algorithm finds the most likely path of states across the model.
• In some cases, this may not be the most relevant quantity (e.g. if two paths have very close probabilities). Posterior decoding using the backward algorithm can find the most probable state for a given observation, summed over all the possible paths (the Viterbi algorithm considers the maximum over all paths instead of the sum).
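For comparison with the forward sketch above, here is a minimal Viterbi decoder (again an added illustration, not the original course code) for the same two-state model: it tracks, for each state, the probability of the best path ending there and then backtracks to recover the most likely state sequence.

```python
# Same two-state AT/CG model as in the forward-algorithm sketch.
states = ("AT", "CG")
start = {"AT": 0.8, "CG": 0.2}
trans = {"AT": {"AT": 0.95, "CG": 0.05},
         "CG": {"AT": 0.10, "CG": 0.90}}
emit = {"AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "CG": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(seq):
    """Most probable state path (max over paths instead of the sum used by forward)."""
    v = {k: start[k] * emit[k][seq[0]] for k in states}
    backpointers = []
    for x in seq[1:]:
        ptr, new_v = {}, {}
        for k in states:
            best_prev = max(states, key=lambda j: v[j] * trans[j][k])
            ptr[k] = best_prev
            new_v[k] = v[best_prev] * trans[best_prev][k] * emit[k][x]
        backpointers.append(ptr)
        v = new_v
    # Backtrack from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path)), v[last]

print(viterbi("ATG"))   # (['AT', 'AT', 'AT'], probability of that best path)
```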

A longer example

• Several hundred bases were generated using the model for heterogeneity of DNA
• The resulting sequence was "decoded" using the Viterbi algorithm to find the most likely sequence of states.


Obs. seq: TTATTTAACTTAATAAATATGTCAATCAATTTTCTGCTTCAGTTCAGTAGGGGAACATCATACTTGGAAAGGAAATATAA Real state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111 Pred.state: 11111111111111111111111111111111111111111111111112222111111111111111111111111111

CTTTATTGCATATTAAGGCGCCACGGGGCCGGCGCGCGGGCATAAAATAGCTATTTTTCTTCGTATGTGAATGAATCCAT 11111111111111111222222222222222222222222211111111221111111111111111111111111111 11111111111111112222222222222222222222222111111111111111111111111111111111111111

TATCCAAATCATTTGATCCATTTAAATTTTATTATGTTTTTCAGCTGTAACAGTAAAGTTTTTACACGCAATTGTGAAGA 11111111111111121111111111111111111111111111111111111111112111122222222111111111 11111111111111111111111111111111111111111111111111111111111111111111111111111111

ATCATCTCAAAAGAAATAAAAAATTTAAGGTGACCCGGGACGGCGCCTGAATTTAGAATCGCCGGGACCCCAGCGACTTG 11111111111111111111111111111222222222222222222211111111111222222222222222222211 11111111111111111111111111112222222222222222222111111111111222222222222222222222

CCACTTTGGGTTCGATATGAATATTATCAACGTGGGGGCCGACCGCGCTTAATAAAATATTATAGTGCAAAAATATAAAG 22221111122111111111111111111112222222222222212222211111111111111111111111111111 22111111111111111111111111111122222222222222222211111111111111111111111111111111

ATCTAATATTGCAGTTCAAATTTGTAATATATATTTTGCCGAAATATGTGCGCGTGGCCCTTTTGACTTAACATTATTCG 11111111111121111111111121111111111111222211111112222222222221111111111111111111 11111111111111111111111111111111111112222111111112222222222211111111111111111122

GCACAGCCCCTGCGCCGAAGAATTAACAATAGATATATTTCATATTTAATACGTAAAACAAAGATTACATTTGCATAGAT 22222222222222222222211111111111111111111111111111111111111111111111111111111111 22222222222222222111111111111111111111111111111111111111111111111111111111111111

TAGGCCTAACCTCGATGATGGTATATAACTATTAATTTGATAATAATAATGGGTTCAATCTATTTTTACGCCCCCACATA 11111111112211111111111111111111111111111111111111122222211111111111122222222211 11222211111111111111111111111111111111111111111111111111111111111111222222211111

Some comments

• The model is unable to find an "island" of one state "2" in a series of states "1" (or the opposite), because two state transitions in a row are much less likely than any observation
• Since we work with random processes, this is expected: a misleading sequence of observations can always appear by chance
• Longer runs in one state are more likely to be found, because the model has more evidence in hand to find them; longer runs are also more likely to be real, and less likely to happen just by chance.
• Boundaries of regions are often incorrect, because a nucleotide at the boundary will be "attached" to the region it is most likely to come from.


Coupling of state and observation

• One problem with HMMs is the fact that the states and observations are coupled.
• If a symbol is emitted at each visit to a state and the probability of staying in a given state is (1-p), the number of symbols emitted is distributed according to a geometric distribution (mean 1/p; see the sketch below).
• Variants have been developed where state and observation are decoupled; they are called generalized HMMs (GHMMs), HMMs with duration, or hidden semi-Markov models.
• With a GHMM, states do not loop on themselves to repeat symbols; instead, each state can emit a number of symbols according to an arbitrary distribution.
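A quick simulation sketch (added, not in the original slides) of the geometric length distribution: if the probability of staying in a state is 1-p, the number of symbols emitted before leaving has mean 1/p.

```python
import random

random.seed(0)

def state_duration(p):
    """Number of symbols emitted before leaving a state whose self-loop probability is 1-p."""
    n = 1
    while random.random() > p:   # with probability 1-p we stay and emit one more symbol
        n += 1
    return n

p = 0.05
durations = [state_duration(p) for _ in range(100_000)]
print(sum(durations) / len(durations))   # close to the geometric mean 1/p = 20
```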

Gene finding

• Given all the genomes that have been sequenced in the past few years, gene finding is an important (and challenging) task.
• By "gene finding" we can mean:
  – Find whether a particular piece of DNA sequence is in a gene or not (segmentation)
  – Find and annotate all the different parts of a gene
  – … and anything in between


A very simple model for prokaryotes

• Calculate the frequency of appearance of bases (or codons) in genes and in intergenic regions
• Similar to the models we used for the heterogeneity of DNA (segmentation)
• This could be modelled using an HMM with this architecture and the relevant observation probabilities:

(Figure: two states, "Coding region" and "Non-coding region"; non-coding → coding with probability p, coding → non-coding with probability q, and self-loops with probabilities 1-p and 1-q.)

More sophisticated model

• A more sophisticated model uses frame-dependent composition:

(Figure: the coding state is replaced by three states "Position 1" → "Position 2" → "Position 3", traversed with probability 1; the remaining transitions, labelled p, q, 1-p and 1-q in the original diagram, connect the non-coding state and the coding cycle.)

More sophisticated model (2)

• We can modify our model so that it requires a translation initiation site (ATG), as well as a translation termination site (TAA, TAG or TGA).
• Note that the "Coding region" and "Non-coding region" boxes can themselves represent HMMs (not detailed here)

(Figure: the non-coding state enters the coding region through an A → T → G start-codon path, and the coding region returns to the non-coding state through the stop-codon paths T-A-A, T-A-G and T-G-A.)

More sophisticated models

• We can build more complicated models on top of the simpler ones
  – Promoters, second strand, etc.
• For every model, we hope that it will be better at finding real genes than the previous ones
• However, if the model gets more complicated, we have more parameters to estimate.

Human gene finding

• Finding genes in eukaryotic genomes is complicated by the presence of exons and introns.
• Many gene finders have been developed specifically for finding genes in the human genome
• GENSCAN is currently one of the most popular and successful human gene finders
• It is based on a GHMM.

Burge et al., J. Mol. Biol., 1997

GENSCAN

(Figure: the GENSCAN GHMM state diagram, with exon states E0, E1, E2, intron states I0, I1, I2, initial and terminal exon states Ei and Et, a single-exon state Es, 5'UTR, 3'UTR, promoter, poly-A and intergenic region states, duplicated for the forward (+) and reverse (-) strands, shown next to an annotated stretch of human genomic sequence.)

Assessing performance

• The performance of a predictive model can be assessed using a second dataset (testing set), different from the one used to train the model (learning or training set). We assume that we know the "truth" in each dataset.
• Important quantities:
  – TP (True Positives): number of positive predictions which are correct
  – FP (False Positives): number of positive predictions which are incorrect (i.e. spurious prediction of a site)
  – TN (True Negatives): number of negative predictions which are correct
  – FN (False Negatives): number of negative predictions which are incorrect (i.e. true site, missed).


Assessing performance

• Most of the time, the number of False Positives and False Negatives (i.e. the errors that we would like to minimise) is not fixed, but depends on a continuous score (e.g. the probability returned by the forward algorithm) on which we set a threshold.
• Changing this threshold will change the number of errors; a higher threshold (i.e. a more stringent criterion for classifying an observation as "positive") will usually result in a lower number of False Positives, but a higher number of False Negatives

Specificity and sensitivity

• Sensitivity is TP/(TP+FN), the proportion of true sites that are correctly found and annotated
• Specificity is generally defined as TN/(TN+FP), the proportion of false sites that are correctly annotated
• In some cases, specificity is defined instead as TP/(TP+FP), the proportion of all positive predictions that are correct. This is especially useful if the proportion of false sites correctly annotated is high, in which case TN/(TN+FP) will not be sensitive to changes in FP.
• The false positive rate is 1 - Specificity. (A small computation sketch is given below.)

S.E. Cawley, A.I. Wirth and T.P. Speed, Mol. and Biochem. Parasitology, 118(2), 167-174 (2001).
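A tiny sketch (not in the original slides) computing TP, FP, TN, FN, sensitivity and both definitions of specificity for a chosen threshold; the scores and labels are made-up toy data.

```python
def confusion(scores, labels, threshold):
    """Classify as 'positive' every observation whose score exceeds the threshold."""
    tp = sum(s > threshold and y for s, y in zip(scores, labels))
    fp = sum(s > threshold and not y for s, y in zip(scores, labels))
    tn = sum(s <= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s <= threshold and y for s, y in zip(scores, labels))
    sensitivity = tp / (tp + fn)                  # proportion of true sites found
    specificity = tn / (tn + fp)                  # proportion of false sites correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0  # alternative "specificity" TP/(TP+FP)
    return tp, fp, tn, fn, sensitivity, specificity, precision

# Toy example: made-up scores; labels are True for real sites.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]
print(confusion(scores, labels, threshold=0.5))
```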

Graphical representation

(Figure: two overlapping score distributions, one for negative results and one for positive results, with a chosen threshold on the score axis. Observations scoring above the threshold are classified as "positive", those below as "negative"; the areas on each side of the threshold correspond to TN, FN, FP and TP.)

The two distributions overlap, so it is impossible to use this score to perfectly discriminate between positive and negative results.


Comparison of different models

• Different models can be applied to a given statistical problem, as we have seen in the case of gene finding.

• ROC (Receiver Operating Characteristic) curves help summarise the FP/FN rates of one or several models.

ROC curve

(Figure: a ROC curve, with the true positive rate (sensitivity) on the vertical axis and the false positive rate (1 - specificity) on the horizontal axis, both running from 0% to 100%.)


ROC curve

• The Area Under the Curve (AUC) provides a way to compare the global performance of different models (see the sketch below).
• A perfect model would have an AUC of 1
• A random model would have an AUC of 0.5 (worst)
• Models in between can have AUCs around 0.9 (excellent), 0.8 (good), 0.7 (fair), 0.6 (poor).
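To make the AUC concrete, here is a small sketch (an added illustration) that sweeps the threshold over all observed scores, collects the (FPR, TPR) points of the ROC curve and integrates it with the trapezoidal rule; the scores and labels are toy data.

```python
def roc_auc(scores, labels):
    """Build the ROC curve by sweeping a threshold over the scores, and return its area."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort observations by decreasing score; lower the threshold one observation at a time.
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal integration of TPR over FPR.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]   # made-up scores
labels = [True, True, False, True, False, True, False, False]
points, auc = roc_auc(scores, labels)
print(auc)   # 0.8125 for this toy data, i.e. a "good" classifier
```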

ROC curves and other methods

• Problems with ROC curves:
  – When comparing two curves, the comparison is not straightforward if the two curves cross once or several times.
  – Most of the time, we are only interested in some regions of the graph, typically the low-FP region (we usually do not care about an algorithm that produces 85% false positives, even if it produces no false negatives).
  – The comparison between models does not take into account the complexity of the models; a model A may be slightly better than model B, but if model A is much more complicated (higher number of parameters), the gain may not be worth the extra work. Comparison methods that penalise more complicated models also exist (for example, AIC or BIC).


What is still missing with HMMs ?

• By default, HMMs use a Markov model of order 1, i.e. the transition and emission probabilities depend only on the current state
• This means that they cannot take into account dependencies between the sites
• Models of higher order, which solve this problem, can be transformed into models of order 1 by increasing the number of states
• Doing so is straightforward, but the complexity of the resulting models increases exponentially and they quickly become computationally infeasible.
• In general, this problem is unsolved.

What is still missing with HMMs ?

• Example: probabilities of observation of nucleotides in DNA in a Markov model
• Increase the number of states by having a state for each digram instead of each symbol (see the sketch after the tables below)

Order-1 model (one state per symbol):

        A     C     G     T
  A   0.18  0.27  0.43  0.12
  C   0.17  0.37  0.27  0.19
  G   0.16  0.34  0.38  0.12
  T   0.08  0.36  0.38  0.18

Equivalent model with one state per digram:

        A     C     G     T
  AA  0.18  0.27  0.43  0.12
  AC  0.17  0.37  0.27  0.19
  AG  0.16  0.34  0.38  0.12
  AT  0.17  0.37  0.27  0.19
  CA  0.16  0.34  0.38  0.12
  CC  0.08  0.36  0.38  0.18
  …     …     …     …     …
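A minimal sketch (not from the slides) of the trick described above: an order-2 model is stored as an order-1 transition table whose states are digrams, so the distribution of the next base is looked up with the previous two bases. The probabilities are illustrative placeholders.

```python
from itertools import product

BASES = "ACGT"

# Order-2 transition probabilities P(next base | previous two bases).
# Illustrative values only; a real model would be estimated from counts as before.
order2 = {"".join(pair): {b: 0.25 for b in BASES}
          for pair in product(BASES, repeat=2)}
order2["CA"] = {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12}

def next_base_distribution(sequence):
    """Equivalent order-1 lookup: the 'state' is the digram made of the last two bases."""
    state = sequence[-2:]
    return order2[state]

print(next_base_distribution("TTGCA"))   # distribution of the base following ...CA
```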


Other stochastic models

• Many other stochastic models are available, such as:
  – Variable Length Markov Models: many variants; use models of order higher than 1, but only when needed, and simpler models otherwise; a compromise that reduces the increase in complexity. Based on trees.
  – Permuted Variable Length Markov Models: similar to VLMMs, but the sites may be reordered, so that the "previous" site in the probabilities may not be the previous site in the sequence; this makes it possible to model simple but long-range effects

Some caveats…

• Some care has to be taken with these models, because they can break down at any time, without warning!
• Ideally, any prediction made with a statistical model should be confirmed experimentally.
• However, these models have been hugely successful, as shown by our examples and references


Introduction: Coiled-coil domains

• Coiled-coil domains (CCD) are a protein motif found in many types of proteins
• They consist of two or more identical strands of amino acid sequences forming alpha-helices that are wrapped around each other, forming one of the simplest tertiary structures
• Most coiled-coil domains contain a pattern called the heptad repeat: seven residues of the form abcdefg, where the a and d positions are generally hydrophobic.

Coiled-coil domains

• This apparent structure is a good candidate for statistical modeling.
• Methods used for modeling:
  – PSSM
  – HMM
• For the practical work, we will use an HMM to model CCDs.


References (web)

• Wikipedia (http://en.wikipedia.org/) contains much information on HMMs and other topics
• Introduction to Coiled Coils: http://www.lifesci.sussex.ac.uk/research/woolfson/html/coils.html
• Terry Speed's courses at the University of California, Berkeley: http://www.stat.berkeley.edu/users/terry/Classes/index.html

References (General articles)

• Sean R. Eddy, "What is a hidden Markov model?", Nat. Biotech. 22, p. 1315-1316 (2004).
• Karen Heyman, "Gene Finding with Hidden Markov Models", The Scientist, 19(6), p. 26 (2005).
• L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. IEEE 77(2), p. 257-285 (1989).


References (Specific articles)

• p53 transcription factor:
  – J. Hoh et al., "The p53MH algorithm and its application in detecting p53-responsive genes", PNAS 99(13), 25 June 2002, 8467-8472.
  – C.-L. Wei et al., "A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome", Cell 124, 13 January 2006, 207-219.
• Other stochastic models:
  – X. Zhao, H. Huang and T.P. Speed, "Finding short DNA motifs using permuted Markov models", Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB 2004).

References (Specific articles)

• S. Tavaré, "Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences", Lectures on Mathematics in the Life Sciences, American Mathematical Society, vol. 17, 1986.
• L. Peshkin and M.S. Gelfand, "Segmentation of yeast DNA using hidden Markov models", Bioinformatics 15(12), p. 980-986 (1999).
• C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA", J. Mol. Biol. 268, p. 78-94 (1997).
• M. Delorenzi and T. Speed, "An HMM model for coiled-coil domains and a comparison with PSSM-based predictions", Bioinformatics 18(4), p. 617-625 (2002).


References (Books)

• R. Durbin, S. Eddy, A. Krogh and G. Mitchison, "Biological sequence analysis", Cambridge University Press, 1998.
• T. Koski, "Hidden Markov Models for Bioinformatics", Kluwer Academic Publishers, 2001.
• W.J. Ewens and G.R. Grant, "Statistical Methods in Bioinformatics" (2nd edition), Springer, 2005.
• S.M. Ross, "Stochastic Processes" (2nd edition), John Wiley and Sons, 1996.

Acknowledgements

• Many slides and ideas were borrowed from talks and courses by Terry Speed (University of California, Berkeley, and WEHI, Melbourne, Australia)


More questions ?

[email protected]

[email protected]
