Introduction to Sequence Motif Discovery and Search
EMBNET course: Bioinformatics of transcriptional regulation
Jan 28 2008
Christoph Schmid

The complete genome era

Components of transcriptional regulation
• Distal transcription-factor binding sites (enhancer)
• cis-regulatory modules
Wasserman 5, 276-287 (2004)

DNA-protein interaction
[Figures: Allen et al. © 1998 EMBL; Jordan et al. © 1991 Prentice-Hall]

Sequence motifs
Sequence variants of a site:
  ..A T C G C A..
  ..T T G G A C..
  ..T T G G T G..
  ..A T C G G T..

Matrix (simplest version), plus a cutoff:
       1   2   3   4   5   6
  A    2   0   0   0   1   1
  C    0   0   2   0   1   1
  G    0   0   2   4   1   1
  T    2   4   0   0   1   1

[Figure: sequence logo]

Representation of the binding specificity by a scoring matrix (also referred to as weight matrix)
        1    2    3    4    5    6    7    8    9
  A   -10  -10  -14  -12  -10    5   -2  -10   -6
  C     5  -10  -13  -13   -7  -15  -13    3   -4
  G    -3  -14  -13  -11    5  -12  -13    2   -7
  T    -5    5    5    5  -10   -9    5  -11    5

Strong binding site: C T T T G A T C T
  5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43
Random sequence: A C G T A C G T A
  -10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83

Biophysical interpretation of protein binding sites
Columns of a weight matrix characterize the specificity of base-pair acceptor sites on the protein surface.
Weight matrix elements represent negated energy contributions to the total binding energy
→ a higher weight matrix score corresponds to a lower binding energy

Motif search: Statistical over-representation of genomic sequence motifs
Conserved motifs represent protein binding sites.
Putative conservation in:
• nucleotide sequence (motif)
• position relative to: TSS, other binding sites (protein complexes)
‘Noise’ due to redundancy in function of individual sites
-> different algorithms in use

Principles of motif finding
Input: set of nucleotide sequences
Difficulties:
-> high degeneracy
-> low frequency of motif sites!
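The scoring-matrix arithmetic shown earlier (43 for the strong binding site, -83 for the random sequence) can be reproduced with a few lines of Python. This is only a minimal sketch: the matrix values are copied from the slide, while the function names `score_site` and `scan` and the cutoff of 40 are assumptions made for illustration, not part of the course material.

```python
# Scoring matrix copied from the slide: one row per base, one column per
# motif position (1-9).
SCORING_MATRIX = {
    "A": [-10, -10, -14, -12, -10, 5, -2, -10, -6],
    "C": [5, -10, -13, -13, -7, -15, -13, 3, -4],
    "G": [-3, -14, -13, -11, 5, -12, -13, 2, -7],
    "T": [-5, 5, 5, 5, -10, -9, 5, -11, 5],
}


def score_site(site, matrix=SCORING_MATRIX):
    """Sum the matrix entries selected by each base of a site (hypothetical helper)."""
    width = len(matrix["A"])
    if len(site) != width:
        raise ValueError(f"site must have length {width}")
    return sum(matrix[base][pos] for pos, base in enumerate(site.upper()))


def scan(sequence, cutoff, matrix=SCORING_MATRIX):
    """Return (start, score) for every window scoring at or above the cutoff."""
    width = len(matrix["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_site(sequence[start:start + width], matrix)
        if score >= cutoff:
            hits.append((start, score))
    return hits


print(score_site("CTTTGATCT"))    # 43, the strong binding site
print(score_site("ACGTACGTA"))    # -83, the random sequence
print(scan("GGCTTTGATCTAA", 40))  # cutoff 40 is an arbitrary illustrative value
```

The sketch accepts only the bases A, C, G and T; any other character raises a `KeyError`. Choosing the cutoff is the part that matters in practice, and the slides do not prescribe a value.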
Motif discovery: define matrices with statistical overrepresentation
[Figure: Motif 1, Motif 2, Motif 3]
Motif search

Formal tools to describe regulatory elements
Consensus sequences:
• example: TATAAA (for eukaryotic TATA-box)
• a limited number of mismatches may be allowed
• may contain IUPAC codes for ambiguous positions, e.g. R = A or G
Weight matrices:
• synonym: position-specific scoring matrix (PSSM)
• a table with numbers for each residue at each position of a fixed-length (gap-free) motif
• two numerical representations: probabilities, additive scores
More advanced descriptors:
• HMMs can model spacers of variable length between conserved blocks
• Dinucleotide matrices: dependencies between nearest neighbors

Where are the motifs?

Word search algorithms for consensus sequences
Purpose: find the optimal consensus sequence for a given sequence set.
Algorithm: for each word wi of size k and mismatch threshold d:
• Count the total frequency of word wi in the data set.
• Compute the expected frequency of word wi in the data set, based on the base composition of the data set or some other null model.
• From the observed and expected frequencies, compute a P-value for word wi according to some statistical distribution (e.g. Poisson distribution).
• Return the word with the lowest (most significant) P-value, or the N best words.
This algorithm is also referred to as word enumeration. It is guaranteed to find the optimal word. It is computationally feasible only for short words (length ≤ 12). Heuristic algorithms exist for longer words.
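The word-enumeration steps above can be sketched as follows. This is a simplified illustration under stated assumptions: only exact matches are counted (mismatch threshold d = 0), the null model is the base composition of the data set, and over-representation is measured by the Poisson upper-tail probability P(X ≥ observed), with the smallest value ranked first. The function names `poisson_upper_tail` and `enumerate_words` are hypothetical.

```python
import math
from collections import Counter


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), summed term by term."""
    total = 0.0
    k = observed
    while True:
        # Poisson probability mass, computed in log space for stability.
        term = math.exp(-expected + k * math.log(expected) - math.lgamma(k + 1))
        total += term
        # Past the mode the terms shrink monotonically; stop when negligible.
        if k > expected and term < 1e-16 * max(total, 1e-300):
            break
        k += 1
    return min(total, 1.0)


def enumerate_words(sequences, k):
    """Rank all observed words of size k by P-value (exact matches only, d = 0)."""
    word_counts = Counter()
    base_counts = Counter()
    n_windows = 0
    for seq in sequences:
        seq = seq.upper()
        base_counts.update(seq)
        for i in range(len(seq) - k + 1):
            word_counts[seq[i:i + k]] += 1
            n_windows += 1

    total_bases = sum(base_counts.values())
    base_freq = {b: c / total_bases for b, c in base_counts.items()}

    results = []
    for word, observed in word_counts.items():
        # Expected count under the base-composition null model.
        expected = n_windows * math.prod(base_freq[b] for b in word)
        p_value = poisson_upper_tail(observed, expected)
        results.append((p_value, word, observed, expected))
    results.sort()  # smallest P-value (most over-represented word) first
    return results


# Toy data: the word TATAAA planted once in each of four GC-rich sequences.
toy = ["GCGCTATAAAGCCG", "CCGGTATAAACGGC", "GGCCGTATAAAGCG", "CGCGCTATAAACCG"]
best = enumerate_words(toy, 6)[0]
print(best[1], best[2])  # best word and its observed count
```

Because every word of size k is visited, the cost grows with the number of distinct words, which is one reason the approach is restricted to short words. Allowing mismatches (d > 0) would require counting neighbouring words as well, which this sketch leaves out.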
Motif Discovery by EM: Inputs and outputs

Expectation-Maximization (EM) Algorithms
EM algorithm essentials:
• An iterative procedure to maximize the likelihood of a probabilistic model with regard to given data
• It can deal with missing (unobservable) data (motif positions)
• Not guaranteed to find the global maximum
Input:
• A model type with initial parameters to be optimized
• A mathematical formula that allows computing the likelihood of the model given complete (observed and unobserved) data
• The observed data: set of sequences containing motifs
Output: the maximum likelihood estimates of
• The model parameters
• The unobserved data

Gibbs sampling
Motivation:
• EM is not guaranteed to find the globally optimal motif
• EM is deterministic: same initial model, same result
• Stochastic algorithms are also not guaranteed to find the global optimum, but: same initial model → different result
• By running stochastic algorithms several times: → better chance to find the optimal solution
Gibbs sampling: modification of the E-step: considers only one sequence position for computing the new base frequency matrix. Chooses the motif position by sampling from the probability distribution given by the sequence position weights w(S,j).

Flavors of motif discovery algorithms
• Consensus sequence (allowing mismatch): Weeder (Pavesi et al.), cpu time*: ~2 hrs
• Expectation maximization: MEME (Bailey et al.), cpu time*: ~24 hrs
• Gibbs sampler (stochastic): MotifSampler (Thijs et al.), cpu time*: ~2 hrs
*) for 10 motifs in ~2000 x 100bp on 3 GHz Pentium4

Do motif discovery algorithms work in practice?
Two recent studies suggest that the performance of motif discovery algorithms is very bad.
1. Hu J, Li B, Kihara D (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 33: 4899-4913.
2. Tompa M, Li N, Bailey TL, Church GM, De Moor B, et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23: 137-144.
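Returning to the Gibbs sampling step described above, the sampling of a motif position from the weights w(S,j) can be illustrated with a short sketch. The slides do not spell out how w(S,j) is computed; here it is assumed to be the likelihood ratio of the current base frequency matrix against a background model, normalized over all start positions j of sequence S. The names `position_weights` and `sample_position` and the toy matrix are hypothetical.

```python
import random


def position_weights(sequence, freq_matrix, background):
    """Normalized weights w(S, j) for every possible motif start j.

    Assumption: w(S, j) is proportional to the product over motif columns of
    (motif base frequency / background base frequency).
    """
    width = len(freq_matrix)
    raw = []
    for j in range(len(sequence) - width + 1):
        ratio = 1.0
        for i in range(width):
            base = sequence[j + i]
            ratio *= freq_matrix[i][base] / background[base]
        raw.append(ratio)
    total = sum(raw)
    return [r / total for r in raw]


def sample_position(weights, rng):
    """Draw one motif start position according to the weights (the stochastic step)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Toy base frequency matrix of width 3 that prefers the word TAT.
toy_matrix = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
uniform_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

weights = position_weights("GGCTATGC", toy_matrix, uniform_background)
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
print([round(w, 3) for w in weights])
print(sample_position(weights, rng))
```

An EM-style update would use all weights at once, whereas the sampling step picks a single position, so repeated runs with different seeds can end in different solutions. A full sampler would additionally rebuild the frequency matrix from the sampled positions and iterate, which is omitted here.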
-> The studies use transcription factor binding sites identified by experiments as the ‘true’ solution
=> only about 20% to 30% of motifs found by current motif discovery algorithms correspond to experimentally defined motifs!

Performance indices used by Tompa et al. 2005

Bad performance of motif discovery algorithms on eukaryotic benchmark data sets (results from Tompa et al. 2005)

Possible reasons for insufficiency
• The heuristic motif discovery algorithms fail to find the optimal motif.
• The sequence sets are too small for the estimation of statistically robust models.
• The experimental data defining the binding sites used in these tests are flawed.
• The current models of binding sites, based only on sequences, are inadequate.
-> currently under intensive investigation...

Promoter sequence motifs
• Fundamental assumption:
  -> Regulatory program of transcription (at least in part) encoded in nucleotide sequence in form of short motifs
• Additional assumptions:
  -> multiple similar binding sites in genome for each transcription factor: >6200 transcription factors (proteins in GO categories “nucleic acid binding” and “transcription regulator activity”) for ~50’000 human TSS
  -> binding sites in core promoter in immediate vicinity to transcription start site
• Definition of ‘promoter motif’:
  -> set of related sequences acting as binding site for specific transcription factor

Drosophila motif overview
  Initiator   1813/1932
  DRE         1045        DNA replication-related element
  DPE         482         Downstream promoter element
  TATA-box    475