Regulatory Motifs of DNA

Lecture 3: PWMs and other models for regulatory elements in DNA -Sequence motifs -Position weight matrices (PWMs) -Learning PWMs combinatorially, or probabilistically -Learning from an alignment -Ab initio learning: Baum-Welch & Gibbs sampling -Visualization of PWMs as sequence logos -Search methods for PWM occurrences -Cis-regulatory modules Regulatory motifs of DNA Measuring gene activity (’expression’) with a microarray 1 Jointly regulated gene sets Cluster of co-expressed genes, apply pattern discovery in regulatory regions 600 basepairs Expression profiles Retrieve Upstream regions Genes with similar expression pattern Pattern over-represented in cluster Find a hidden motif 2 Find a hidden motif Find a hidden motif (cont.) Multiple local alignment! Smith-Waterman?? 3 Motif? • Definition: Motif is a pattern that occurs (unexpectedly) often in a string or in a set of strings • The problem of finding repetitions is similar cgccgagtgacagagacgctaatcagg ctgtgttctcaggatgcgtaccgagtg ggagacagcagcacgaccagcggtggc agagacccttgcagacatcaagctctt tgggaacaagtggagcaccgatgatgt acagccgatcaatgacatttccctaat gcaggattacattgcagtgcccaagga gaagtatgccaagtaatacctccctca cagtg... 4 Longest repeat? Example: A 50 million bases long fragment of human genome. We found (using a suffix-tree) the following repetitive sequence of length 2559, with one occurrence at 28395980, and other at 28401554r ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctcc gaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagt gagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttt tatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatag tgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttcta gttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgt tgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcatttttt catgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttgga gttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgc tgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcc tgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggt gtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagtt ttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgaggg ctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggtt ccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattg gtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatt tgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttga agcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagac tttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcct cttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgt cttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatccca tcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttc tgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgct gctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttgg tatcaggatgatgctggcctcataaaatgagttagg Representations of motifs • pattern – substring – substring with gaps – string in generalized alphabet (e.g., IUPAC) – finite automaton – Hidden Markov Model – binding affinity matrix – cluster of binding affinity matrices – … (= the hidden structure to be learned from data) 5 Types of occurrences • occurrence – exact – approximate – with high probability –… Subsequence motifs with approximate occurrences 6 Motif finding problem • Given: a set of sequences S = x(1),x(2),...,x(n) from alphabet Ȉ, and a motif length m • Find the best motif of length m that occurs in the training data S • Best? Occurs? • Combinatorial approach: motif = sequence of length m – Approximate occurrences defined using Hamming distance (= number of mismatches) • Probabilistic approach: motif = PWM (position weight matrix) of length m – Approximate occurrences defined using the probability distribution Combinatorial approach • d(W,x(i)) = minimum Hamming distance between W and any subsequence of x(i) of length m (i) • d(W,S) = i d(W,x ) 7 Combinatorial approach: a pattern- driven algorithm 1. For all sequences W İȈm of length m do Find d(W,S) 2. Report W* := arg minW (d(W,S)) For DNA, the trivial implementation has running time O(mN4m). Why? Explain! (N = total length of the sequences in S) Combinatorial approach: a sample- driven algorithm 1. For all subsequences W of length m from S do Find d(W,S) 2. Report W* := arg minW(d(W,S)) For DNA, the trivial implementation has running time O(mN2). Why? Explain! (N = total length of sequences in S) 8 Probabilistic approach • Learn Position Weight Matrices (PWMs) from data • Probabilistic model (in fact, a simple HMM) Positionally weighted pattern • Weighted pattern w = (wij) of length m in alphabet Ȉ: _Ȉ| x m matrix of real-valued scores • Also called as: position weight matrix (PWM), position- specific scoring matrix (PSSM), profile(-HMM), motif, ... (Stormo et al 1980, Gribskov et al 1987, Henikoff et al 1990,...) • Public collections of PWMs: TRANSFAC, JASPAR 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 9 Construction of PWM from an alignment • Given: – aligned set of binding sites (= sequences of length m) – background distribution qa for a ɽ Ȉ • PWM construction algorithm: 1. Construct count matrix (Nij): Nij := the number of symbols i ɽ Ȉ on column j of the aligment 2. Add pseudocounts: • Nij := Nij + 1 (= Laplace rule), or • Nij := Nij + ȕqi where ȕ is a scaling parameter that determines the total number of pseudocounts in an alignment column 3. Construct probability matrix (fij)by normalizing the (pseudo)counts: fij := Nij / i Nij 4. Construct log-odds of signal vs background: wij := log(fij / qi) (= weighted pattern w) PWM construction: example 1 2 3 4 5 6 7 CTCACACGTGGG (Nij): A 2 0 5 0 0 0 0 TCACACGTGGGA C 0 5 0 5 0 0 0 ATTAGCACGTTTT G 3 0 0 0 5 1 3 TTAGCACGTTTCGC T 0 0 0 0 0 4 2 CGCTGCACGGGGCC 10 PWM construction: example (cont.) Step 1: (Nij) Normalization without pseudocounts A 2 0 5 0 0 0 0 A 0.4 0 1.0 0 0 0 0 C 0 5 0 5 0 0 0 C 0 1.0 0 1.0 0 0 0 G 3 0 0 0 5 1 3 G 0.6 0 0 0 1.0 0.2 0.6 T 0 0 0 0 0 4 2 T 0 0 0 0 0 0.8 0.4 Nij := Nij + ȕqi where qA = qT = 0.32, qG = qC = 0.18 ȕ = 1 Step 2: (Nij) with pseudocounts A 2.32 0.32 5.32 0.32 0.32 0.32 0.32 C 0.18 5.18 0.18 5.18 0.18 0.18 0.18 G 3.18 0.18 0.18 0.18 5.18 1.18 3.18 T 0.32 0.32 0.32 0.32 0.32 4.32 2.32 Step 4: log-odds wij := log2 (fij / qi ) Step 3: probability matrix (f ) := (N ) / 6 ij ij A ... -2.7 1.5 -2.7 -2.7 ... 0.39 0.05 0.89 0.05 0.05 0.05 0.05 A C 2.3 -2.6 2.3 -2.6 C 0.03 0.87 0.03 0.87 0.03 0.03 0.03 G -2.6 -2.6 -2.6 2.3 G 0.53 0.03 0.03 0.03 0.87 0.20 0.53 T ... -2.7 -2.7 -2.7 -2.7 ... T 0.05 0.05 0.05 0.05 0.05 0.72 0.39 Visualisation of PWMs: sequence logo • PWM probability matrix (faj): faj = probability of symbol a in motif position j f I • Entropy of column j: A Aj j Hj = -a ɽȈ faj log2 faj Ij T C • Information present in column j: G Ij = log2 _Ȉ| - Hj [DNA: Ij 2] • Logo: Column j has total height Ij, symbol a has height faj·Ij 11 Sequence logo: an example A 9 11 49 51 0 1 1 4 C 19 3 0 0 0 45 25 16 G 5 1 2 0 17 0 4 21 T 18 36 0 0 34 5 21 10 Ab initio construction of PWMs • No alignment given, just the training sequences S = x(1), ..., x(N) • Find from S all m-long words • Assume each word comes either from motif or from background (don’t care about overlaps!) • Find the most likely – motif model (PWM), – background model, and – classification of the m-long words of S into motif and background instances 12 HMM for mixture of multinomials PWM ... Ȝ M1 M2 M3 Mm 1-Ȝ Q Background m-1 times PWM by the EM algorithm • Train the mixture model using EM algorithm by iterating – E step – M step • Training data: all m-words from S • For more details, see: – T.L. Bailey & C. Elkan: The value of prior knowledge in discovering motifs with MEME – Other papers on MEME 13 Ab initio motif finding: Gibbs sampling • Another popular probabilistic algorithm for motif (PWM) discovery • Local search algorithm Gibbs sampling: basic idea Current motif = PWM formed by circled substrings 14 Gibbs sampling: basic idea Delete one substring Gibbs sampling: basic idea Try a replacement: Compute its score, and accept the replacement depending on the score. 15 Gibbs sampling: basic idea New current motif Repeat many times! PWM by Gibbs sampling • Goal: Find most probable pattern by sampling from motif probabilities to maximize the ratio of model to background • Given: (1) (N) – Training data S = X , …, X , and background distribution (qa) – motif length m • Notation & algorithm idea: – Model

Regulatory Motifs of DNA

Motif Discovery in Sequential Data Kyle L. Jensen

A General Pairwise Interaction Model Provides an Accurate Description of in Vivo Transcription Factor Binding Sites

A New Sequence Logo Plot to Highlight Enrichment and Depletion

Predicting Expression Levels of De NovoProtein Designs in Yeast

Low-N Protein Engineering with Data-Efficient Deep Learning

Lecture 12. Gene Finding & Rnaseq

Interpretable Convolution Methods for Learning Genomic Sequence Motifs

Specificity and Nonspecificity in RNA–Protein Interactions

Ancient Mechanisms for the Evolution of the Bicoid Homeodomain's

Probability and Statistics for Bioinformatics and Genetics Course

Predicting Variation of DNA Shape Preferences in Protein-DNA Interaction in Cancer Cells with a New Biophysical Model

Regulatory Motif Discovery Using Pwms and the Architecture of Eukaryotic Core Promoters