Regulatory Motifs of DNA
Lecture 3: PWMs and other models for regulatory elements in DNA
- Sequence motifs
- Position weight matrices (PWMs)
- Learning PWMs combinatorially, or probabilistically
- Learning from an alignment
- Ab initio learning: Baum-Welch & Gibbs sampling
- Visualization of PWMs as sequence logos
- Search methods for PWM occurrences
- Cis-regulatory modules

Regulatory motifs of DNA
Measuring gene activity ('expression') with a microarray

Jointly regulated gene sets
Cluster the co-expressed genes, then apply pattern discovery in their regulatory regions.
[Figure: expression profiles -> genes with similar expression pattern -> retrieve upstream regions (600 basepairs) -> pattern over-represented in the cluster -> find a hidden motif]

Find a hidden motif
Finding a motif hidden in a set of sequences is a multiple local alignment problem. Smith-Waterman??

Motif?
• Definition: a motif is a pattern that occurs (unexpectedly) often in a string or in a set of strings
• The problem of finding repetitions is similar:
cgccgagtgacagagacgctaatcagg
ctgtgttctcaggatgcgtaccgagtg
ggagacagcagcacgaccagcggtggc
agagacccttgcagacatcaagctctt
tgggaacaagtggagcaccgatgatgt
acagccgatcaatgacatttccctaat
gcaggattacattgcagtgcccaagga
gaagtatgccaagtaatacctccctca
cagtg...

Longest repeat?
Example: a 50 million base long fragment of the human genome. Using a suffix tree, we found the following repetitive sequence of length 2559, with one occurrence at position 28395980 and another at 28401554:
ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctcc
gaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagt
gagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttt
tatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatag
tgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttcta
gttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgt
tgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcatttttt
catgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttgga
gttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgc
tgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcc
tgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggt
gtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagtt
ttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgaggg
ctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggtt
ccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattg
gtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatt
tgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttga
agcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagac
tttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcct
cttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgt
cttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatccca
tcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttc
tgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgct
gctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttgg
tatcaggatgatgctggcctcataaaatgagttagg

Representations of motifs
• pattern
– substring
– substring with gaps
– string in a generalized alphabet (e.g., IUPAC)
– finite automaton
– Hidden Markov Model
– binding affinity matrix
– cluster of binding affinity matrices
– ...
(= the hidden structure to be learned from the data)

Types of occurrences
• occurrence
– exact
– approximate
– with high probability
– ...
• Subsequence motifs with approximate occurrences

Motif finding problem
• Given: a set of sequences S = x(1), x(2), ..., x(n) from alphabet Σ, and a motif length m
• Find the best motif of length m that occurs in the training data S
• Best? Occurs?
• Combinatorial approach: motif = a sequence of length m
– approximate occurrences defined using the Hamming distance (= number of mismatches)
• Probabilistic approach: motif = a PWM (position weight matrix) of length m
– approximate occurrences defined using the probability distribution

Combinatorial approach
• d(W, x(i)) = minimum Hamming distance between W and any length-m subsequence of x(i)
• d(W, S) = Σi d(W, x(i))

Combinatorial approach: a pattern-driven algorithm
1. For all sequences W ∈ Σ^m of length m: find d(W, S)
2. Report W* := arg minW d(W, S)
For DNA, the trivial implementation has running time O(mN·4^m). Why? Explain! (N = total length of the sequences in S)

Combinatorial approach: a sample-driven algorithm
1. For all subsequences W of length m occurring in S: find d(W, S)
2. Report W* := arg minW d(W, S)
For DNA, the trivial implementation has running time O(mN^2). Why? Explain! (N = total length of the sequences in S)
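The sample-driven algorithm is easy to prototype. Below is a minimal Python sketch, assuming the training sequences are plain strings over the DNA alphabet; the names (hamming, d_total, sample_driven_motif) are illustrative and not part of the lecture material.

def hamming(u, v):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for a, b in zip(u, v) if a != b)

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any length-m window of x."""
    m = len(W)
    return min(hamming(W, x[i:i + m]) for i in range(len(x) - m + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x) over all training sequences x in S."""
    return sum(d_seq(W, x) for x in S)

def sample_driven_motif(S, m):
    """Try every length-m subsequence occurring in S as a candidate motif W."""
    candidates = {x[i:i + m] for x in S for i in range(len(x) - m + 1)}
    return min(candidates, key=lambda W: d_total(W, S))

if __name__ == "__main__":
    # three of the example lines above used as toy input
    S = ["cgccgagtgacagagacgctaatcagg",
         "ctgtgttctcaggatgcgtaccgagtg",
         "ggagacagcagcacgaccagcggtggc"]
    print(sample_driven_motif(S, 6))

Each of the O(N) candidate words is compared against O(N) windows at a cost of O(m) per comparison, which gives the O(mN^2) bound stated above; enumerating all of Σ^m instead gives the O(mN·4^m) bound of the pattern-driven variant.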
Probabilistic approach
• Learn Position Weight Matrices (PWMs) from the data
• Probabilistic model (in fact, a simple HMM)

Positionally weighted pattern
• A weighted pattern w = (wij) of length m in alphabet Σ: a |Σ| × m matrix of real-valued scores
• Also called: position weight matrix (PWM), position-specific scoring matrix (PSSM), profile(-HMM), motif, ... (Stormo et al. 1980, Gribskov et al. 1987, Henikoff et al. 1990, ...)
• Public collections of PWMs: TRANSFAC, JASPAR
• Example:
       1    2    3    4    5    6
  A   0.3  0.0  0.1  0.2  1.0  0.3
  C   0.1  0.8  0.5  0.2  0.0  0.4
  G   0.2  0.0  0.4  0.3  0.0  0.0
  T   0.4  0.2  0.0  0.3  0.0  0.3

Construction of a PWM from an alignment
• Given:
– an aligned set of binding sites (= sequences of length m)
– a background distribution qa for a ∈ Σ
• PWM construction algorithm:
1. Construct the count matrix (Nij): Nij := the number of symbols i ∈ Σ in column j of the alignment
2. Add pseudocounts, either
• Nij := Nij + 1 (= Laplace rule), or
• Nij := Nij + βqi, where β is a scaling parameter that determines the total number of pseudocounts in an alignment column
3. Construct the probability matrix (fij) by normalizing the (pseudo)counts: fij := Nij / Σi Nij
4. Construct the log-odds of signal vs background: wij := log(fij / qi) (= weighted pattern w)

PWM construction: example
Sequences (each contains one aligned 7-long site):
CTCACACGTGGG
TCACACGTGGGA
ATTAGCACGTTTT
TTAGCACGTTTCGC
CGCTGCACGGGGCC

Step 1: count matrix (Nij)
       1  2  3  4  5  6  7
  A    2  0  5  0  0  0  0
  C    0  5  0  5  0  0  0
  G    3  0  0  0  5  1  3
  T    0  0  0  0  0  4  2

Normalization without pseudocounts:
  A   0.4  0    1.0  0    0    0    0
  C   0    1.0  0    1.0  0    0    0
  G   0.6  0    0    0    1.0  0.2  0.6
  T   0    0    0    0    0    0.8  0.4

Step 2: (Nij) with pseudocounts Nij := Nij + βqi, where qA = qT = 0.32, qG = qC = 0.18, and β = 1
  A   2.32  0.32  5.32  0.32  0.32  0.32  0.32
  C   0.18  5.18  0.18  5.18  0.18  0.18  0.18
  G   3.18  0.18  0.18  0.18  5.18  1.18  3.18
  T   0.32  0.32  0.32  0.32  0.32  4.32  2.32

Step 3: probability matrix (fij) := (Nij) / 6
  A   0.39  0.05  0.89  0.05  0.05  0.05  0.05
  C   0.03  0.87  0.03  0.87  0.03  0.03  0.03
  G   0.53  0.03  0.03  0.03  0.87  0.20  0.53
  T   0.05  0.05  0.05  0.05  0.05  0.72  0.39

Step 4: log-odds wij := log2(fij / qi) (columns 2-5 shown)
  A   -2.7   1.5  -2.7  -2.7
  C    2.3  -2.6   2.3  -2.6
  G   -2.6  -2.6  -2.6   2.3
  T   -2.7  -2.7  -2.7  -2.7
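As a sanity check on the recipe, here is a minimal Python sketch of steps 1-4, using the background distribution of the example above; the five aligned 7-mers are one choice consistent with the example's count matrix, and the names (build_pwm, beta) are illustrative.

import math

def build_pwm(sites, q, beta=1.0):
    """Steps 1-4: counts -> pseudocounts -> probabilities -> log-odds."""
    m = len(sites[0])
    alphabet = sorted(q)

    # Step 1: count matrix N[a][j] = number of symbols a in column j of the alignment
    N = {a: [0.0] * m for a in alphabet}
    for site in sites:
        for j, a in enumerate(site):
            N[a][j] += 1

    # Step 2: add beta * q_a pseudocounts to every column
    for a in alphabet:
        for j in range(m):
            N[a][j] += beta * q[a]

    # Step 3: probability matrix f[a][j] = N[a][j] / (column total)
    totals = [sum(N[a][j] for a in alphabet) for j in range(m)]
    f = {a: [N[a][j] / totals[j] for j in range(m)] for a in alphabet}

    # Step 4: log-odds against the background, w[a][j] = log2(f[a][j] / q_a)
    w = {a: [math.log2(f[a][j] / q[a]) for j in range(m)] for a in alphabet}
    return N, f, w

if __name__ == "__main__":
    # one alignment consistent with the count matrix of the example above
    sites = ["ACACGTG", "ACACGTG", "GCACGTT", "GCACGTT", "GCACGGG"]
    q = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}
    _, f, w = build_pwm(sites, q, beta=1.0)
    print([round(p, 2) for p in f["A"]])   # ~ [0.39, 0.05, 0.89, 0.05, 0.05, 0.05, 0.05]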
Visualisation of PWMs: sequence logo
• PWM probability matrix (faj): faj = probability of symbol a in motif position j
• Entropy of column j: Hj = -Σa∈Σ faj log2 faj
• Information present in column j: Ij = log2 |Σ| - Hj [for DNA: Ij ≤ 2]
• Logo: column j has total height Ij, and symbol a has height faj·Ij

Sequence logo: an example
       1   2   3   4   5   6   7   8
  A    9  11  49  51   0   1   1   4
  C   19   3   0   0   0  45  25  16
  G    5   1   2   0  17   0   4  21
  T   18  36   0   0  34   5  21  10

Ab initio construction of PWMs
• No alignment given, just the training sequences S = x(1), ..., x(N)
• Find all m-long words of S
• Assume each word comes either from the motif or from the background (don't care about overlaps!)
• Find the most likely
– motif model (PWM),
– background model, and
– classification of the m-long words of S into motif and background instances

HMM for mixture of multinomials
[Figure: mixture HMM - with probability λ an m-word is emitted by the PWM states M1 M2 M3 ... Mm, with probability 1-λ by the background state Q, repeated m-1 times]

PWM by the EM algorithm
• Train the mixture model with the EM algorithm by iterating
– the E step
– the M step
• Training data: all m-words of S
• For more details, see:
– T.L. Bailey & C. Elkan: The value of prior knowledge in discovering motifs with MEME
– other papers on MEME

Ab initio motif finding: Gibbs sampling
• Another popular probabilistic algorithm for motif (PWM) discovery
• A local search algorithm

Gibbs sampling: basic idea
• Current motif = the PWM formed by the currently circled substrings
• Delete one substring
• Try a replacement: compute its score, and accept the replacement depending on the score
• The result is the new current motif
• Repeat many times!

PWM by Gibbs sampling
• Goal: find the most probable pattern by sampling from the motif probabilities so as to maximize the ratio of model to background
• Given:
– training data S = X(1), ..., X(N), and a background distribution (qa)
– motif length m
• Notation & algorithm idea:
– Model
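To make the sampling step concrete, here is a rough Python sketch of the Gibbs-sampling loop pictured above, assuming one motif occurrence per training sequence; the function names and the toy sequences are illustrative (not from the lecture), and a practical implementation would also keep track of the best-scoring motif found so far.

import random

def pwm_from_positions(S, pos, m, q, beta=1.0):
    """Column-wise probability matrix from the currently chosen m-words (with beta*q pseudocounts)."""
    f = [{a: beta * q[a] for a in q} for _ in range(m)]
    for x, p in zip(S, pos):
        for j in range(m):
            f[j][x[p + j]] += 1
    for col in f:
        total = sum(col.values())
        for a in col:
            col[a] /= total
    return f

def gibbs_motif_search(S, m, q, iters=1000):
    pos = [random.randrange(len(x) - m + 1) for x in S]   # random starting occurrences
    for _ in range(iters):
        i = random.randrange(len(S))                      # leave one sequence out
        others = [k for k in range(len(S)) if k != i]
        f = pwm_from_positions([S[k] for k in others], [pos[k] for k in others], m, q)
        x = S[i]
        weights = []
        for p in range(len(x) - m + 1):
            w = 1.0
            for j in range(m):
                w *= f[j][x[p + j]] / q[x[p + j]]         # model / background ratio
            weights.append(w)
        # sample the new occurrence position proportionally to its weight
        pos[i] = random.choices(range(len(weights)), weights=weights)[0]
    return pos

if __name__ == "__main__":
    q = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}
    # hypothetical toy sequences, each containing a CACGTG-like site
    S = ["TTGACACGTGTT", "ACACGTGAAACC", "GGTACACGTGCA"]
    print(gibbs_motif_search(S, 6, q, iters=500))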