
Introduction to sequence motif discovery and search

EMBNET course of transcriptional regulation Jan 28 2008 Christoph Schmid

The complete genome era

Components of transcriptional regulation

Distal transcription-factor binding sites (enhancer)

cis-regulatory modules

Wasserman 5, 276-287 (2004)

DNA-protein interaction

Allen et al. © 1998 EMBL; Jordan et al. © 1991 Prentice-Hall

Sequence motifs

Sequence variants of site:
..A T C G C A..
..T T G G A C..
..T T G G T G..
..A T C G G T..

Matrix (simplest version):
A  2 0 0 0 1 1
C  0 0 2 0 1 1
G  0 0 2 4 1 1
T  2 4 0 0 1 1

+ cutoff!
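As an illustration of how such a count matrix can be derived, here is a minimal Python sketch that tabulates the four site variants listed above. The function name `count_matrix` is hypothetical and chosen only for this example.

```python
from collections import Counter

# The four example site variants from the slide.
sites = ["ATCGCA", "TTGGAC", "TTGGTG", "ATCGGT"]

def count_matrix(sites, alphabet="ACGT"):
    """Return {base: [count at position 1, 2, ...]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # zip(*sites) walks through the alignment column by column.
    columns = [Counter(column) for column in zip(*sites)]
    return {base: [column[base] for column in columns] for base in alphabet}

matrix = count_matrix(sites)
for base, counts in matrix.items():
    print(base, *counts)
```

The printed rows should match the matrix on the slide. A cutoff on the summed counts of a candidate site would then decide whether it is reported as a match.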

Representation of the binding specificity by a scoring matrix (also referred to as weight matrix)

     1    2    3    4    5    6    7    8    9
A  -10  -10  -14  -12  -10    5   -2  -10   -6
C    5  -10  -13  -13   -7  -15  -13    3   -4
G   -3  -14  -13  -11    5  -12  -13    2   -7
T   -5    5    5    5  -10   -9    5  -11    5

Strong binding site: C T T T G A T C T
5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43

Random sequence: A C G T A C G T A
-10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83
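The scoring arithmetic above can be reproduced with a short Python sketch. The weights are copied from the slide; the names `score` and `scan`, the test sequence, and the cutoff value of 30 are assumptions made only for this illustration.

```python
# Weight matrix from the slide: one row per base, one column per motif position.
WEIGHTS = {
    "A": [-10, -10, -14, -12, -10,   5,  -2, -10,  -6],
    "C": [  5, -10, -13, -13,  -7, -15, -13,   3,  -4],
    "G": [ -3, -14, -13, -11,   5, -12, -13,   2,  -7],
    "T": [ -5,   5,   5,   5, -10,  -9,   5, -11,   5],
}
MOTIF_LENGTH = 9

def score(window):
    """Sum the matrix entries selected by each base of a 9-bp window."""
    if len(window) != MOTIF_LENGTH:
        raise ValueError("window must be exactly 9 bases long")
    return sum(WEIGHTS[base][i] for i, base in enumerate(window))

def scan(sequence, cutoff):
    """Return (0-based start, score) for every window scoring >= cutoff."""
    hits = []
    for start in range(len(sequence) - MOTIF_LENGTH + 1):
        s = score(sequence[start:start + MOTIF_LENGTH])
        if s >= cutoff:
            hits.append((start, s))
    return hits

print(score("CTTTGATCT"))  # strong binding site from the slide
print(score("ACGTACGTA"))  # random sequence from the slide
# Hypothetical test sequence with the strong site embedded at offset 9.
print(scan("ACGTACGTACTTTGATCTACG", cutoff=30))
```

Scanning only the forward strand keeps the sketch short; a real search would usually also score the reverse complement.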

Biophysical interpretation of protein binding sites

Columns of a weight matrix characterize the specificity of base-pair acceptor sites on the protein surface.

Weight matrix elements represent negated energy contributions to the total binding energy
→ weight matrix score decreases linearly with binding energy: a higher score means a lower (more favourable) binding energy

Motif search:

Statistical over-representation of genomic sequence motifs

Conserved motifs represent protein binding sites

Putative conservation in sequence (motif) position relative to:
• TSS (transcription start site)
• other binding sites (protein complexes)

‘Noise’ due to redundancy in function of individual sites

-> different algorithms in use

Principles of motif finding

Input: set of nucleotide sequences

Difficulties:
-> high degeneracy
-> low frequency of motif sites!

Define matrices with statistical overrepresentation

[Figure: motif discovery yields Motif 1, Motif 2, Motif 3, which are then used for motif search]

Formal tools to describe regulatory elements

Consensus sequences:
• example: TATAAA (for eukaryotic TATA-box)
• a limited number of mismatches may be allowed
• may contain IUPAC codes for ambiguous positions, e.g. R = A or G.
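A consensus search with IUPAC codes and a mismatch allowance might be sketched as follows. The table holds the standard IUPAC nucleotide codes; the function names and the example sequences are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatches(consensus, window):
    """Count positions where the base is not allowed by the consensus code."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, window))

def find_consensus(sequence, consensus, max_mismatches=0):
    """Return (0-based start, matched word) for every acceptable window."""
    k = len(consensus)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if mismatches(consensus, window) <= max_mismatches:
            hits.append((start, window))
    return hits

# Exact search for the TATA-box consensus in a made-up sequence.
print(find_consensus("GGCTATAAAAGG", "TATAAA"))
# W = A or T, so TATATA is accepted without using the mismatch allowance.
print(find_consensus("CCTATATAGG", "TATAWA"))
```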

Weight matrices:
• synonym: position-specific scoring matrix (PSSM)
• a table with numbers for each residue at each position of a fixed-length (gap-free) motif
• two numerical representations: probabilities, additive scores
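The two numerical representations are related: one common way to obtain additive scores is to take log-odds of the position-specific probabilities against a background. The sketch below assumes a uniform background of 0.25 per base and a pseudocount of 1, neither of which is specified in the slides, and uses the count matrix from the earlier example as input.

```python
import math

# Count matrix from the "simplest version" example (rows: base, columns: position).
COUNTS = {
    "A": [2, 0, 0, 0, 1, 1],
    "C": [0, 0, 2, 0, 1, 1],
    "G": [0, 0, 2, 4, 1, 1],
    "T": [2, 4, 0, 0, 1, 1],
}

def to_probabilities(counts, pseudocount=1.0):
    """Convert counts to per-position probabilities (pseudocount is an assumption)."""
    bases = sorted(counts)
    length = len(counts[bases[0]])
    probs = {base: [] for base in bases}
    for i in range(length):
        total = sum(counts[base][i] for base in bases) + pseudocount * len(bases)
        for base in bases:
            probs[base].append((counts[base][i] + pseudocount) / total)
    return probs

def to_log_odds(probs, background=0.25):
    """Convert probabilities to additive log2-odds scores (uniform background assumed)."""
    return {base: [math.log2(p / background) for p in column]
            for base, column in probs.items()}

probs = to_probabilities(COUNTS)
scores = to_log_odds(probs)
for base in "ACGT":
    print(base, " ".join(f"{s:6.2f}" for s in scores[base]))
```

Because the scores are logarithms, summing them along a site corresponds to multiplying the underlying probability ratios.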

More advanced descriptors:
• HMMs can model spacers of variable length between conserved blocks
• Dinucleotide matrices: dependencies between nearest neighbors

Where are the motifs?

Word search algorithms for consensus sequences

Purpose: find the word that is optimal (most over-represented) for a given sequence set.

Algorithm: For each word wi of size k and mismatch threshold d:

• Count total frequency of word wi in data set.
• Compute expected frequency of word wi in data set, based on base composition of data set or some other null model.
• From the observed and expected frequencies, compute P-value for word wi according to some statistical distribution (e.g. Poisson distribution)
• Return word with lowest P-value (most significant over-representation), or N best words

This algorithm is also referred to as word-enumeration. It is guaranteed to find the optimal word. Computationally feasible only for short words (length ≤ 12). Heuristic algorithms exist for longer words.
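The enumeration steps can be sketched in Python as below. This is a simplified illustration, not the implementation of any particular tool: it uses exact matches only (mismatch threshold d = 0), a null model based on the base composition of the data set, and a Poisson upper-tail P-value. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), summed directly over the tail."""
    if observed <= 0:
        return 1.0
    log_term = -expected + observed * math.log(expected) - math.lgamma(observed + 1)
    term = math.exp(log_term)
    total = 0.0
    i = observed
    while term > total * 1e-16:  # stop once further terms no longer matter
        total += term
        i += 1
        term *= expected / i
    return min(total, 1.0)

def enumerate_words(sequences, k, n_best=5):
    """Rank all observed k-mers by P-value (exact matches only, d = 0)."""
    base_counts = Counter("".join(sequences))
    total_bases = sum(base_counts.values())
    freq = {base: count / total_bases for base, count in base_counts.items()}
    observed = Counter(seq[i:i + k] for seq in sequences
                       for i in range(len(seq) - k + 1))
    n_windows = sum(observed.values())
    results = []
    for word, count in observed.items():
        # Expected count under independent bases with data-set composition.
        expected = n_windows * math.prod(freq[base] for base in word)
        results.append((poisson_upper_tail(count, expected), word, count, expected))
    results.sort()  # lowest P-value first
    return results[:n_best]

# Toy data set: the word TTGACA is planted once in each sequence.
toy = ["ACGTTGACAGT", "GGTTGACACCA", "CATTGACATGC", "TTGACAGGCAT"]
for p_value, word, count, expected in enumerate_words(toy, k=6):
    print(f"{word}  observed={count}  expected={expected:.4f}  P={p_value:.2e}")
```

Only words that actually occur are scored, since a word with zero occurrences cannot be over-represented. Overlapping occurrences are counted independently here, which a more careful null model would correct for.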

Motif Discovery by EM: Inputs and outputs

Expectation-Maximization (EM) Algorithms

EM algorithms essentials:
• An iterative procedure to maximize the likelihood of a probabilistic model with regard to given data
• It can deal with missing (unobservable) data (motif positions)
• Not guaranteed to find the global maximum

Input:
• A model type with initial parameters to be optimized
• A mathematical formula for computing the likelihood of the model given complete (observed and unobserved) data
• The observed data: set of sequences containing motifs

Output: the maximum likelihood estimates of
• The model parameters
• The unobserved data
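To make the inputs and outputs concrete, here is a minimal EM sketch for a base probability matrix. It assumes exactly one motif occurrence per sequence, a fixed uniform background of 0.25 per base, and an initial model built from a seed word; these are simplifying assumptions for the example and not a description of how MEME or any other tool works. The function name `em_motif` and the toy data are hypothetical.

```python
def em_motif(sequences, width, seed_word, iterations=50, pseudocount=0.1):
    """EM for one motif occurrence per sequence; returns (model, best starts)."""
    bases = "ACGT"
    background = 0.25  # assumed uniform background, kept fixed
    # Initial parameters: the seed base gets 0.7 at each position, the others 0.1.
    model = [{b: (0.7 if b == seed_word[i] else 0.1) for b in bases}
             for i in range(width)]
    posteriors = []
    for _ in range(iterations):
        expected_counts = [{b: pseudocount for b in bases} for _ in range(width)]
        posteriors = []
        # E-step: posterior probability of each start position in each sequence.
        for seq in sequences:
            ratios = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for i in range(width):
                    ratio *= model[i][seq[start + i]] / background
                ratios.append(ratio)
            total = sum(ratios)
            z = [r / total for r in ratios]
            posteriors.append(z)
            for start, weight in enumerate(z):
                for i in range(width):
                    expected_counts[i][seq[start + i]] += weight
        # M-step: re-estimate base probabilities from the expected counts.
        model = [{b: column[b] / sum(column.values()) for b in bases}
                 for column in expected_counts]
    # Most probable motif position per sequence (the estimated unobserved data).
    best_starts = [max(range(len(z)), key=z.__getitem__) for z in posteriors]
    return model, best_starts

toy = ["ACGTTGACAGT", "GGTTGACACCA", "CATTGACATGC", "TTGACAGGCAT"]
model, starts = em_motif(toy, width=6, seed_word="TTGACA")
print(starts)
print("".join(max(column, key=column.get) for column in model))
```

The procedure is deterministic: the same seed word always leads to the same result, and a poor seed can end in a local maximum.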

Gibbs sampling

Motivation:
• EM is not guaranteed to find the globally optimal motif
• EM is deterministic: same initial model, same result

Stochastic algorithms also not guaranteed to find global optimum, but: same initial model → different result

By running stochastic algorithms several times: → better chance to find optimal solution
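The restart argument can be illustrated with a minimal Gibbs site sampler that is run several times and keeps the best-scoring solution. This sketch follows the common textbook formulation (hold out one sequence, rebuild the matrix from the other sites, sample a new position in proportion to its weight), which may differ in detail from the variant presented in this course. The uniform background, pseudocount, scoring function, and all names are assumptions.

```python
import math
import random

def build_model(sequences, starts, width, skip, pseudocount=0.5):
    """Base frequency matrix from the current sites, leaving out sequence `skip`."""
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
    for n, (seq, start) in enumerate(zip(sequences, starts)):
        if n == skip:
            continue
        for i in range(width):
            counts[i][seq[start + i]] += 1
    return [{b: column[b] / sum(column.values()) for b in column} for column in counts]

def window_weight(model, seq, start, background=0.25):
    """Likelihood ratio of motif model versus uniform background for one window."""
    weight = 1.0
    for i, column in enumerate(model):
        weight *= column[seq[start + i]] / background
    return weight

def gibbs_run(sequences, width, sweeps, rng):
    """One stochastic run from random start positions; returns (score, starts)."""
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for _ in range(sweeps):
        for n, seq in enumerate(sequences):
            model = build_model(sequences, starts, width, skip=n)
            weights = [window_weight(model, seq, j)
                       for j in range(len(seq) - width + 1)]
            # Sample the new motif position in proportion to its weight.
            starts[n] = rng.choices(range(len(weights)), weights=weights)[0]
    model = build_model(sequences, starts, width, skip=None)
    score = sum(math.log(window_weight(model, seq, j))
                for seq, j in zip(sequences, starts))
    return score, starts

def gibbs_restarts(sequences, width, runs=10, sweeps=50, seed=0):
    """Repeat the stochastic run several times and keep the best-scoring result."""
    rng = random.Random(seed)
    return max(gibbs_run(sequences, width, sweeps, rng) for _ in range(runs))

toy = ["ACGTTGACAGT", "GGTTGACACCA", "CATTGACATGC", "TTGACAGGCAT"]
best_score, best_starts = gibbs_restarts(toy, width=6)
print(best_score, best_starts)
```

Individual runs can end at different positions; taking the maximum over several runs is what improves the chance of reaching the optimal solution.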

Gibbs sampling: modification of the E-step: considers only one position per sequence for computing the new base frequency matrix. Chooses the motif position by sampling from the probability distribution given by the sequence position weights w(S,j).

Flavors of motif discovery algorithms

• Consensus sequence (allowing mismatch): Weeder (Pavesi et al.), cpu time*: ~2 hrs

• Expectation maximization: MEME (Bailey et al.), cpu time*: ~24 hrs

• Gibbs sampler (stochastic): MotifSampler (Thijs et al.), cpu time*: ~2 hrs

*) for 10 motifs in ~2000 x 100bp on 3 GHz Pentium4

Do motif discovery algorithms work in practice?

Two recent studies suggest that the performance of motif discovery algorithms is very poor.

1. Hu J, Li B, Kihara D (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 33: 4899-4913.

2. Tompa M, Li N, Bailey TL, Church GM, De Moor B, et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23: 137-144.

-> use transcription factor binding sites identified by experiments as ‘true’ solution

=> only about 20% to 30% of motifs found by current motif discovery algorithms correspond to experimentally defined motifs!

Performance indices used by Tompa et al. 2005

Bad performance of motif discovery algorithms on eukaryotic benchmark data sets (results from Tompa et al. 2005)

Possible reasons for insufficiency

– The heuristic motif discovery algorithms fail to find the optimal motif.
– The sequence sets are too small for the estimation of statistically robust models.
– The experimental data defining the binding sites used in these tests are flawed.
– The current models of binding sites, based only on sequences, are inadequate.

-> currently under intensive investigation...

Promoter sequence motifs

• Fundamental assumption:
-> Regulatory program of transcription (at least in part) encoded in nucleotide sequence in form of short motifs

• Additional assumptions:
-> multiple similar binding sites in genome for each transcription factor: >6200 transcription factors (in GO categories “ binding” and “transcription regulator activity”) for ~50’000 human TSS
-> binding sites in core in immediate vicinity to transcription start site

• Definition of ‘promoter motif’:
-> set of related sequences acting as binding site for specific transcription factor

Drosophila motif overview

Initiator 1813/1932

DRE 1045 DNA replication-related element

DPE 482 Downstream promoter element

TATA-box 475