Introduction to 6.046J/18.401J

LECTURE 18
• Bio intro: Regulatory Motifs
• Combinatorial motif discovery
 – Median string finding
• Probabilistic motif discovery
 – Expectation maximization
• Comparative

Prof. Manolis Kellis
April 15, 2008

Chromosomes inside the cell
• Eukaryote cell
• Prokaryote cell

DNA packaging
• Why packaging
 – DNA is very long
 – Cell is very small
• Compression
 – Chromosome is 50,000 times shorter than extended DNA
• Using the DNA
 – Before a piece of DNA is used for anything, this compact structure must open locally

DNA: The double helix
• The most noble molecule of our time

[Figure: a long stretch of raw genomic DNA sequence (A/C/G/T text), with overlaid labels: Genes / Encode / proteins; Regulatory motifs / Control / gene expression]

The Central Dogma of Biology

Challenges in Computational Biology

Central dogma: DNA makes RNA makes Protein
• DNA: Inheritance
• RNA: Messages
• Protein: Function

Challenges:
1 Gene Finding
2 Sequence alignment
3 Database lookup
4 Genome Assembly
5 Regulatory motif discovery
6 Comparative Genomics
7 Evolutionary Theory
8 Gene expression analysis
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Metabolic modeling
13 Emerging network properties
(and much more)

[Figure labels: DNA; aligned sequences TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT; RNA transcript]

The regulatory code

• Enhancer regions
• Promoter motifs
• Splicing signals
• Motifs at RNA level (5'-UTR, 3'-UTR)

[Figure: human / dog / mouse / rat multiple alignment of a regulatory region, shown in three blocks. Columns conserved across the species are marked with *, and two conserved sites are labeled Errα and Gabpα.]

Gene regulation

Cells respond to environment and change during development.
These events are minutely controlled by short patterns within the DNA.

Regulatory motifs: Sequence patterns that control gene usage, recognized by specific regulators. General, short (~6-12 letters), possibly degenerate, act at varying distances.

Regulatory motif discovery

[Figure: promoter of GAL1 (ATGACTAAATCTCATTCAGAAGAA) with two Gal4 sites (CGG … CCG) and a Mig1 site (CCCCW)]

• Regulatory motifs (summary)
 – Genes are turned on / off in response to changing environments
 – No direct addressing: subroutines (genes) contain sequence tags (motifs)
 – Specialized proteins (transcription factors) recognize these tags
• What makes motif discovery hard?
 – Motifs are short (6-8 bp), sometimes degenerate
 – Can contain any set of nucleotides (no ATG or other rules)
 – Act at variable distances upstream (or downstream) of target gene
• How can we discover them?

Three-dimensional contacts of regulators and DNA

• Protein-DNA interactions
 – "Feeling" chemical properties of the bases
 – DO NOT open DNA (not by base complementarity)
• Sequence specificity
 – Topology of 3D contact dictates sequence specificity of binding
 – Some positions are fully constrained; other positions are degenerate
 – "Ambiguous / degenerate" positions are loosely contacted by the transcription factor

Motifs capture regulator sequence specificity

• Summarize information
• Building blocks of gene regulation
• Underlying code linking regulatory networks together
• How do we go about discovering them?
• Motif vs. motif instance!
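The distinction between a motif and its instances can be sketched in a few lines. The snippet below is an illustration only: it treats a motif as a degenerate consensus written in the standard IUPAC nucleotide codes (such as the Mig1 pattern CCCCW, where W means A or T) and lists the concrete instances found in a sequence. The function names and the test sequence are hypothetical, not part of the lecture.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile a degenerate consensus into a regex.

    The pattern sits inside a lookahead so that overlapping
    instances are reported as well.
    """
    body = "".join(IUPAC[c] for c in motif)
    return re.compile("(?=(" + body + "))")

def find_instances(motif, sequence):
    """Return (0-based position, matched word) for every instance."""
    return [(m.start(), m.group(1))
            for m in motif_to_regex(motif).finditer(sequence)]

# One motif (CCCCW), two different instances (CCCCA and CCCCT).
print(find_instances("CCCCW", "GGCCCCAGGCCCCTG"))
```

The motif is the pattern; each matched word at a specific position is a motif instance.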

Three settings for motif discovery

• Combinatorial solutions
 – Exhaustive search
 – Greedy motif clustering
 – Wordlets and motif refinement
• Probabilistic solutions
 – Expectation maximization
 – Gibbs sampling
• Comparative genomics
 – Genome-wide conservation
 – Evolutionary signatures

Computational motif discovery (traditional)

Lots of experimentation → Discover groups of co-regulated genes.
Computation → Find common sequence patterns within them.

[Figure: Regulatory motif discovery; DNA of a group of co-regulated genes sharing a common subsequence]

Problem Definition

Given a collection of promoter sequences s_1, …, s_N of genes with common expression

Combinatorial
Motif M: m_1 … m_W
Some of the m_i's blank
• Find M that occurs in all s_i with ≤ k differences
• Or, find M with smallest total Hamming dist

Probabilistic
Motif: M_ij; 1 ≤ i ≤ W, 1 ≤ j ≤ 4
M_ij = Prob[ letter j, pos i ]
Find best M, and positions p_1, …, p_N in sequences

Discrete Formulations

Given sequences S = {x^1, …, x^n}
• A motif W is a consensus string w_1 … w_K
• Find motif W* with "best" match to x^1, …, x^n

Definition of "best":
d(W, x^i) = min Hamming dist. between W and any word in x^i
d(W, S) = Σ_i d(W, x^i)

Exhaustive Searches

1. Pattern-driven:

For W = AA…A to TT…T (4^K possibilities)
  Find d( W, S )
Report W* = argmin( d(W, S) )

Running time: O( K N 4^K ) (where N = Σ_i |x^i|)

Advantage: Finds provably "best" motif W
Disadvantage: Time
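The pattern-driven search can be written down almost literally from the definitions of d(W, x^i) and d(W, S). The following is a minimal sketch, assuming every sequence is at least K letters long; the function names and the toy sequences are illustrative choices, not from the lecture.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    """d(W, x): min Hamming distance between W and any K-long word of x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

def pattern_driven(S, K):
    """Try all 4^K candidate words, AA...A to TT...T; return the best one."""
    candidates = ("".join(p) for p in product("ACGT", repeat=K))
    return min(candidates, key=lambda W: d_total(W, S))

S = ["TTACGTT", "GACGTGG", "CCCACGT"]   # toy data: ACGT occurs in every sequence
best = pattern_driven(S, 4)
print(best, d_total(best, S))
```

The loop over `product("ACGT", repeat=K)` is the 4^K factor in the running time, which is why this approach is only practical for short K.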

Exhaustive Searches

2. Sample-driven algorithm:

For W = every K-long word occurring in some x^i
  Find d( W, S )
Report W* = argmin( d( W, S ) )
or, Report a local improvement of W*

Running time: O( K N^2 )

Advantage: Time
Disadvantage: If the true motif is weak and does not occur in data, then a random motif may score better than any instance of true motif

Overview

• Introduction
• Bio review: Where do ambiguities come from?
• Computational formulation of the problem
• Combinatorial solutions
 – Exhaustive search
 – Greedy motif clustering
 – Wordlets and motif refinement
• Probabilistic solutions
 – Expectation maximization
 – Gibbs sampling
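The sample-driven search, and its stated disadvantage, can be illustrated with a small sketch. It restricts the candidates to words that actually occur in the data; the toy data set is a hypothetical one in which every sequence holds a one-mismatch copy of ACGT, so the true motif itself never occurs. Names and data are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def d_total(W, S):
    """d(W, S): sum over sequences of the best-window Hamming distance."""
    K = len(W)
    return sum(
        min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))
        for x in S
    )

def sample_driven(S, K):
    """Only try K-long words that occur somewhere in the sequences."""
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    # sorted() only makes tie-breaking deterministic
    return min(sorted(candidates), key=lambda W: d_total(W, S))

# Each sequence is a single mutated instance of the (absent) motif ACGT.
S = ["ACGA", "ACTT", "AGGT", "CCGT"]
W = sample_driven(S, 4)
print(W, d_total(W, S))            # best word that occurs in the data
print("ACGT", d_total("ACGT", S))  # the true motif scores better
```

Here every occurring word is two mismatches away from each of the other three instances, while the unseen consensus is one mismatch away from all four, which is the weakness the slide describes.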

Greedy motif clustering (CONSENSUS)

Algorithm:

Cycle 1:
For each word W in S (of fixed length!)
  For each word W' in S
    Create alignment (gap free) of W, W'
Keep the C_1 best alignments, A_1, …, A_C1

  ACGGTTG , CGAACTT , GGGCTCT …
  ACGCCTG , AGAACTA , GGGGTGT …

Cycle t:
For each word W in S
  For each alignment A_j from cycle t-1
    Create alignment (gap free) of W, A_j
Keep the C_t best alignments A_1, …, A_Ct

  ACGGTTG , CGAACTT , GGGCTCT …
  ACGCCTG , AGAACTA , GGGGTGT …
  … … …
  ACGGCTC , AGATCTT , GGCGTCT …

• C_1, …, C_n are user-defined heuristic constants
 – N is sum of sequence lengths
 – n is the number of sequences

Running time:
O(N^2) + O(N C_1) + O(N C_2) + … + O(N C_n) = O( N^2 + N C_total )
where C_total = Σ_i C_i, typically O(nC), where C is a big constant
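A compact sketch of the greedy cycles described for CONSENSUS follows. Two things are assumptions of this sketch rather than statements from the slides: the alignment score (per column, the count of the most common letter, summed over columns) and the restriction that each cycle adds a word from a sequence not yet represented in the alignment. A single constant `keep` stands in for C_1, …, C_n.

```python
from collections import Counter

def kmers(seq, k):
    """All k-long words of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def alignment_score(words):
    """Assumed score: count of the most common letter per column, summed."""
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*words))

def keep_best(alignments, keep):
    """Keep the `keep` highest-scoring alignments of this cycle."""
    ranked = sorted(sorted(alignments),  # inner sort: deterministic ties
                    key=lambda a: alignment_score([w for _, w in a]),
                    reverse=True)
    return ranked[:keep]

def greedy_consensus(seqs, k, keep):
    # Every word is tagged with the index of the sequence it came from.
    pool = [(i, w) for i, s in enumerate(seqs) for w in kmers(s, k)]
    # Cycle 1: gap-free alignments of word pairs from different sequences.
    pairs = {(a, b) for a in pool for b in pool if a[0] < b[0]}
    alignments = keep_best(pairs, keep)
    # Cycle t: extend each kept alignment by one word from an unused sequence.
    for _ in range(len(seqs) - 2):
        extended = set()
        for aln in alignments:
            used = {i for i, _ in aln}
            for w in pool:
                if w[0] not in used:
                    extended.add(tuple(sorted(aln + (w,))))
        alignments = keep_best(extended, keep)
    return alignments

seqs = ["TTACGTACTT", "GGACGTACGG", "CCACGTACCC"]  # toy data, planted ACGTAC
best = greedy_consensus(seqs, k=6, keep=5)[0]
print([w for _, w in best])
```

Because only `keep` alignments survive each cycle, a good alignment can be lost early; that is the price of avoiding the exhaustive search.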

Motif Refinement and wordlets (MULTIPROFILER)

• Extended sample-driven approach

Given a K-long word W, define:
  N_α(W) = words W' in S s.t. d(W, W') ≤ α

Idea: Assume W is an occurrence of the true motif W*
Will use N_α(W) to correct "errors" in W

Assume W differs from true motif W* in at most L positions

Define:
A wordlet G of W is an L-long pattern with blanks, differing from W
 – L is smaller than the word length K

Example:
K = 7; L = 3
W = ACGTTGA
G = --A--CG
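The two building blocks defined above, the α-neighborhood N_α(W) and the application of a wordlet to a word, can be sketched as follows. The helper names and the toy sequences are hypothetical; the wordlet example is the one from the slide (W = ACGTTGA, G = --A--CG).

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def alpha_neighbors(W, S, alpha):
    """N_alpha(W): all K-long words W' in S with d(W, W') <= alpha."""
    K = len(W)
    return [x[i:i + K]
            for x in S
            for i in range(len(x) - K + 1)
            if hamming(W, x[i:i + K]) <= alpha]

def apply_wordlet(W, G):
    """Overwrite W at the non-blank positions of wordlet G ('-' = blank)."""
    return "".join(w if g == "-" else g for w, g in zip(W, G))

print(apply_wordlet("ACGTTGA", "--A--CG"))   # W modified by G gives W'

S = ["TTACGTTGATT", "GGACATTCGGG"]           # toy sequences
print(alpha_neighbors("ACGTTGA", S, alpha=3))
```

A wordlet that shows up consistently among the neighbors of W is a hint that W carries "errors" at those positions relative to the true motif.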

Motif Refinement and wordlets (MULTIPROFILER)

Algorithm:

For each W in S:
  For L = 1 to L_max
    1. Find the α-neighbors of W in S → N_α(W)
    2. Find all "strong" L-long wordlets G in N_α(W)
    3. For each wordlet G,
       1. Modify W by the wordlet G → W'
       2. Compute d(W', S)

Report W* = argmin d(W', S)

Step 1 above: Smaller motif-finding problem; Use exhaustive search

Where do ambiguous bases come from?

• Protein-DNA interactions
 – Proteins read DNA by "feeling" the chemical properties of the bases
 – Without opening DNA (not by base complementarity)
• Sequence specificity
 – Topology of 3D contact dictates sequence specificity of binding
 – Some positions are fully constrained; other positions are degenerate
 – "Ambiguous / degenerate" positions are loosely contacted by the transcription factor

Representing motif ambiguities

entropy - n
1: (communication theory) a numerical measure of the uncertainty of an outcome; "the signal contained thousands of bits of information" [information, selective information]
2: (thermodynamics) a thermodynamic quantity representing the amount of energy in a system that is no longer available for doing mechanical work; "entropy increases as matter and energy in the universe degrade to an ultimate state of inert uniformity" [randomness]

• Entropy at pos'n i, H(i) = – Σ{letter x} freq(x, i) log2 freq(x, i)
• Height of x at pos'n i, L(x, i) = freq(x, i) (2 – H(i))
 – Examples:
  • freq(A, i) = 1; H(i) = 0; L(A, i) = 2
  • A: ½; C: ¼; G: ¼; H(i) = 1.5; L(A, i) = ¼; L(not T, i) = ¼

Starting positions ⇔ Motif matrix

• given aligned sequences ⇒ easy to compute profile matrix
• given profile matrix ⇒ easy to find starting position probabilities

Profile matrix of the shared motif (rows: letters; columns: sequence positions):

     1    2    3    4    5    6    7    8
A   0.1  0.3  0.1  0.2  0.2  0.4  0.3  0.1
C   0.5  0.2  0.1  0.1  0.6  0.1  0.2  0.7
G   0.2  0.2  0.6  0.5  0.1  0.2  0.2  0.1
T   0.2  0.3  0.2  0.2  0.1  0.3  0.3  0.1

Key idea: Iterative procedure for estimating both, given uncertainty
(learning problem with hidden variables: the starting positions)
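Both directions mentioned on these slides are short computations. The sketch below first builds a profile matrix from a set of already aligned words, then evaluates the entropy H(i) and the letter heights L(x, i) column by column. Function names and the toy alignment are illustrative assumptions; the formulas are the ones stated above.

```python
import math
from collections import Counter

def profile_matrix(aligned, alphabet="ACGT"):
    """One dict of letter frequencies per column of the aligned words."""
    n = len(aligned)
    columns = []
    for col in zip(*aligned):
        counts = Counter(col)
        columns.append({c: counts[c] / n for c in alphabet})
    return columns

def column_entropy(freqs):
    """H(i) = -sum_x freq(x, i) * log2 freq(x, i), with 0 * log 0 taken as 0."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)

def letter_heights(freqs):
    """L(x, i) = freq(x, i) * (2 - H(i)) for a 4-letter alphabet."""
    info = 2.0 - column_entropy(freqs)
    return {x: f * info for x, f in freqs.items()}

# Toy alignment: column 1 is fully constrained, column 2 is degenerate.
profile = profile_matrix(["AC", "AG", "AC", "AT"])
for i, col in enumerate(profile, start=1):
    print(i, column_entropy(col), letter_heights(col))
```

The first column reproduces the slide's fully constrained example (H = 0, height 2); the second has the ½, ¼, ¼ distribution with H = 1.5.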

Representing Motif (p_ck) and Background (p_c0)

• Assume motif has fixed width, W
• Motif represented by matrix of probabilities: p_ck, the probability of character c in column k

        1    2    3
   A   0.1  0.5  0.2
p = C   0.4  0.2  0.1
   G   0.3  0.1  0.6
   T   0.2  0.2  0.1     (~CAG)

• Background represented by p_c0, frequency of each base

        0
   A   0.26
p0 = C   0.24     (near uniform)
   G   0.23     (see also: di-nucleotide etc)
   T   0.27

Basic Iterative Approach

Given: length parameter W, training set of sequences
set initial values for motif
do
  ⇒ re-estimate starting-positions from motif
  ⇒ re-estimate motif from starting-positions
until convergence (change < ε)
return: motif, starting-positions

Starting positions (Z_ij) ⇔ Motif matrix (p_ck)

[Figure: sequences X_1 … X_n with starting positions Z_ij on one side and the 4 × 8 motif matrix p_ck (rows c = A, C, G, T; columns k = 1 … 8) on the other. The E-step goes from the motif to the starting positions; the M-step goes from the starting positions to the motif.]

• Z_ij: Probability that on sequence i, motif starts at position j
• p_ck: Probability that k-th character of motif is letter c
• Computing Z_ij matrix from p_ck is straightforward
 – At each position, evaluate start probability by multiplying across the matrix
• Three variations for re-computing motif p_ck from Z_ij matrix
 – Expectation maximization ⇒ All starts weighted by Z_ij prob distribution
 – Gibbs sampling ⇒ Single start for each seq X_i by sampling Z_ij
 – Greedy approach ⇒ Best start for each seq X_i by maximum Z_ij

Representing the starting position probabilities (Z_ij)

• the element Z_ij of the matrix Z represents the probability that the motif starts in position j in sequence i

          1    2    3    4
   seq1  0.1  0.1  0.2  0.6
Z = seq2  0.4  0.2  0.1  0.3
   seq3  0.3  0.1  0.5  0.1
   seq4  0.1  0.5  0.1  0.3

Some examples:
• Z1: no clear winner
• Z2: two candidates
• Z3: one big winner
• Z4: uniform

Three examples of Greedy, Gibbs Sampling, EM

• Z1: two candidates
 – Greedy always picks maximum
 – Gibbs sampling picks one at random (or)
 – EM uses both in estimating motif (and)
• Z2: one big winner
 – All methods agree
• Z3: uniform (no preference)
 – Greedy ignores most of the probability
 – Gibbs sampling rapidly converges to some choice
 – EM averages over the entire sequence

E-step: Calculating Z_ij from motif

[Figure: sequences X_1 … X_n and the 4 × 8 motif matrix p_ck (rows c = A, C, G, T; columns k = 1 … 8); the E-step arrow points from Motif: p_ck to Starting positions: Z_ij]
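The difference between the three variants comes down to what each one does with a single row of Z. A minimal sketch, with hypothetical function names and 0-based position indices (the slides count from 1):

```python
import random

def greedy_start(z_row):
    """Greedy: commit to the single most probable start."""
    return max(range(len(z_row)), key=lambda j: z_row[j])

def gibbs_start(z_row, rng=random):
    """Gibbs sampling: draw one start at random, weighted by Z."""
    return rng.choices(range(len(z_row)), weights=z_row, k=1)[0]

def em_weights(z_row):
    """EM: keep every start, each weighted by its (normalized) probability."""
    total = sum(z_row)
    return [z / total for z in z_row]

z = [0.1, 0.1, 0.2, 0.6]   # the seq1 row of the Z matrix above
print(greedy_start(z))     # always the same position
print(gibbs_start(z))      # varies from run to run
print(em_weights(z))       # all positions contribute
```

With a uniform row the greedy choice is arbitrary and the Gibbs draw is pure chance, while EM spreads its weight evenly, which matches the "uniform" case on the slide.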

Calculating P(X_i) when motif position is known

• Probability of training sequence X_i, given hypothesized start position j:

$$\Pr(X_i \mid Z_{ij}=1, p) = \underbrace{\prod_{k=1}^{j-1} p_{c_k,0}}_{\text{before motif}} \; \underbrace{\prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}}_{\text{motif}} \; \underbrace{\prod_{k=j+W}^{L} p_{c_k,0}}_{\text{after motif}}$$

where c_k is the character at position k of X_i and L is the length of X_i.

• Example:

X_i = G C T G T A G

        0     1    2    3
   A   0.25  0.1  0.5  0.2
p = C   0.25  0.4  0.2  0.1
   G   0.25  0.3  0.1  0.6
   T   0.25  0.2  0.2  0.1

Pr(X_i | Z_i3 = 1, p) = p_G,0 × p_C,0 × p_T,1 × p_G,2 × p_T,3 × p_A,0 × p_G,0
                      = 0.25 × 0.25 × 0.2 × 0.1 × 0.1 × 0.25 × 0.25

Calculating the Z vector ( using P(X_i) )

• To estimate the starting positions in Z at step t (Bayes' rule):

$$\Pr(Z_{ij}=1 \mid X_i, p) = \frac{\Pr(X_i \mid Z_{ij}=1, p)\,\Pr(Z_{ij}=1)}{\Pr(X_i)}$$

• At iteration t, calculate Z_ij^(t) based on p^(t)
 – We just saw how to calculate Pr(X_i | Z_ij=1, p)
 – To obtain total probability Pr(X_i), sum over all starting positions

$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij}=1, p^{(t)})\,\Pr(Z_{ij}=1)}{\sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik}=1, p^{(t)})\,\Pr(Z_{ik}=1)}$$

 – Assume uniform priors (motif equally likely to start at any position)

Calculating the Z vector: Example

X_i = G C T G T A G, with the same matrix p as above

Z_i1 = 0.3 × 0.2 × 0.1 × 0.25 × 0.25 × 0.25 × 0.25
Z_i2 = 0.25 × 0.4 × 0.2 × 0.6 × 0.25 × 0.25 × 0.25
...

• then normalize so that

$$\sum_{j=1}^{L-W+1} Z_{ij} = 1$$

M-step: Calculating motif from Z_ij

[Figure: sequences X_1 … X_n and the 4 × 8 motif matrix p_ck (rows c = A, C, G, T; columns k = 1 … 8); the M-step arrow points from Starting positions: Z_ij to Motif: p_ck]
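The E-step for one sequence is a direct transcription of the product formula followed by normalization. The sketch below uses the example sequence GCTGTAG and the example matrix p with columns 0 to 3 (column 0 is the background); positions are 1-based to match the slides, uniform priors are assumed as stated, and the function names are illustrative.

```python
# p[c][k]: probability of letter c in column k; k = 0 is the background.
p = {
    "A": [0.25, 0.1, 0.5, 0.2],
    "C": [0.25, 0.4, 0.2, 0.1],
    "G": [0.25, 0.3, 0.1, 0.6],
    "T": [0.25, 0.2, 0.2, 0.1],
}

def seq_prob_given_start(x, j, p, W):
    """Pr(X_i | Z_ij = 1, p) for a 1-based start position j."""
    prob = 1.0
    for pos, c in enumerate(x, start=1):
        inside = j <= pos <= j + W - 1
        k = pos - j + 1 if inside else 0   # motif column, or background
        prob *= p[c][k]
    return prob

def e_step_row(x, p, W):
    """Z_i1 ... Z_i,L-W+1 under uniform priors, normalized to sum to 1."""
    likelihoods = [seq_prob_given_start(x, j, p, W)
                   for j in range(1, len(x) - W + 2)]
    total = sum(likelihoods)
    return [v / total for v in likelihoods]

x = "GCTGTAG"
print(seq_prob_given_start(x, 3, p, W=3))  # 0.25*0.25*0.2*0.1*0.1*0.25*0.25
print(e_step_row(x, p, W=3))
```

With uniform priors the prior terms cancel in Bayes' rule, so normalizing the per-start likelihoods is all that is needed.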

The M-step: Estimating the motif p

• recall p_c,k represents the probability of character c in position k; values for position 0 represent the background

$$p_{c,k}^{(t+1)} = \frac{n_{c,k} + d_{c,k}}{\sum_{b} \left(n_{b,k} + d_{b,k}\right)} \qquad (d_{c,k}\text{: pseudo-counts})$$

$$n_{c,k} = \begin{cases} \displaystyle\sum_{i} \sum_{\{j \,\mid\, X_{i,j+k-1} = c\}} Z_{ij} & k > 0 \ \text{(motif)} \\[2ex] \displaystyle n_c - \sum_{j=1}^{W} n_{c,j} & k = 0 \ \text{(background)} \end{cases}$$

where n_c is the total # of c's in the data set.

M-step example: Estimating p_ck from Z_ij

X_1 = A C A G C A     Z_1 = 0.1 0.7 0.1 0.1
X_2 = A G G C A G     Z_2 = 0.4 0.1 0.1 0.4
X_3 = T C A G T C     Z_3 = 0.2 0.6 0.1 0.1

$$p_{A,1} = \frac{Z_{1,1} + Z_{1,3} + Z_{2,1} + Z_{3,3} + 1}{Z_{1,1} + Z_{1,2} + \dots + Z_{3,3} + Z_{3,4} + 4}$$

• EM: sum over full probability
 – n_A,1 = 0.1+0.1+0.4+0.1 = 0.7
 – n_C,1 = 0.7+0.4+0.6 = 1.7
 – n_G,1 = 0.1+0.1+0.1+0.1 = 0.4
 – n_T,1 = 0.2 = 0.2
 – Total: T = 0.7+1.7+0.4+0.2 = 3.0
• Normalize and add pseudo-counts
 – P_A,1 = (0.7+1)/(T+4) = 1.7/7 = 0.24
 – P_C,1 = (1.7+1)/(T+4) = 2.7/7 = 0.39
 – P_G,1 = (0.4+1)/(T+4) = 1.4/7 = 0.2
 – P_T,1 = (0.2+1)/(T+4) = 1.2/7 = 0.17

          1     2     3
     A   0.24  0.39  0.21
P_ck = C   0.39  0.21  0.19
     G   0.2   0.24  0.44
     T   0.17  0.16  0.16

• Gibbs sampling: Pick one
• Greedy: Pick max
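The worked M-step example can be checked with a few lines of code. The sketch below implements only the motif columns (k > 0) of the update, with a pseudo-count of 1 per letter as in the example; the background column is left out, and the function name is an illustrative choice. Start positions are 0-based inside the code.

```python
def m_step(seqs, Z, W, alphabet="ACGT", pseudo=1.0):
    """Re-estimate the motif columns p[c][k-1], k = 1..W, from Z (EM variant)."""
    # n[c][k]: expected count of letter c in motif column k+1
    n = {c: [0.0] * W for c in alphabet}
    for x, z_row in zip(seqs, Z):
        for j, z in enumerate(z_row):      # j: 0-based start position
            for k in range(W):             # k: 0-based motif column
                n[x[j + k]][k] += z
    # add pseudo-counts and normalize each column
    p = {c: [0.0] * W for c in alphabet}
    for k in range(W):
        total = sum(n[c][k] + pseudo for c in alphabet)
        for c in alphabet:
            p[c][k] = (n[c][k] + pseudo) / total
    return p

seqs = ["ACAGCA", "AGGCAG", "TCAGTC"]
Z = [[0.1, 0.7, 0.1, 0.1],
     [0.4, 0.1, 0.1, 0.4],
     [0.2, 0.6, 0.1, 0.1]]

p = m_step(seqs, Z, W=3)
for c in "ACGT":
    print(c, [round(v, 2) for v in p[c]])
```

Every start position contributes in proportion to its Z value, which is what distinguishes the EM update from the Gibbs (pick one) and greedy (pick max) variants.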

The EM Algorithm

• EM converges to a local maximum in the likelihood of the data given the model:

$$\prod_{i} \Pr(X_i \mid p)$$

• usually converges in a small number of iterations
• sensitive to initial starting point (i.e. values in p)

Resolving power in mammals, flies, fungi

[Figure: phylogenetic trees for 32 mammals, 12 flies and 17 fungi (9 Yeasts, with post-duplication and pre-duplication species; 8 Candida, with diploid and haploid species)]

Many species lead to high resolving power in close distances

Comparative Genomics

• Using evolution to study genomes
• Using genomics to study evolution

Comparative genomics and evolutionary signatures

• Comparative genomics can reveal functional elements
 – For example: exons are deeply conserved to mouse, chicken, fish
 – Many other elements are also strongly conserved: exons / regulatory?
• Can we also pinpoint specific functions of each region? Yes!
 – Patterns of change distinguish different types of functional elements
 – Specific function ⇔ Selective pressures ⇔ Patterns of mutation/insertion/deletion
• Develop evolutionary signatures characteristic of each function

Evolutionary signal for regulatory motif discovery

Known engrailed site (footprint), between the 5'-UTR and 3'-UTR:

D.mel  CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.sim  CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.sec  CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.yak  CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.ere  CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTC-CAAGTC
D.ana  CACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAG

(* in the original figure marks columns identical in all six species; the longest run, 11 columns, covers CTCTAATTAGC)

[Figure: species tree with D. mel, D. ere, D. ana, D. pse.]

• Individual motif instances are preferentially conserved
• Measure conservation across entire genome
 – Over thousands of motif instances → Increased discovery power
 – Couple to rapid enumeration and rapid string search
⇒ De novo discovery of regulatory motifs

Kellis et al, Nature 2003
Xie et al, Nature 2005
Stark et al, Nature 2007

Framing the problem computationally

• How do we find all instances of a motif in a genome?
 – Naïve algorithm: Search every position
• How do we count all instances of every 6-mer in a genome?
 – Naïve algorithm: Scan the genome for each motif
 – Improvement: Scan genome once, filling a table
• How do we count all instances of every 50-mer in a genome?
 – Table is no longer feasible, most entries empty
 – Use a hash table
• How do we search a new motif in a known genome?
 – Pre-processing of the database
• How do we deal with motif degeneracy and ambiguities?
 – Hash in multiple places, increase alphabet size, partial hashing

Computational approaches for motif discovery

• Method #1: Enumerate all motifs
 – Combinatorial search
• Method #2: Randomly sample the genome
 – Statistical approach
• Method #3: Enumerate motif seeds + refinement
 – Hill-climbing
• Method #4: Content-based addressing
 – Hashing

Evaluating genome-wide motif conservation

[Figure: example genomic sequence ATTAGCCAGTAGCGCAGTGCATGCATGCACGACTGCAAGTGCATGCATGCTAGCTACGTAGCTAGCCGCGCATGCTGTGACTGCTAG scanned for 6-mers]

• Genome-wide conservation reveals real motifs
 – Count conserved instances → Not informative
 – Count total instances → Not informative
 – Evaluate conserved/total instances → Real motifs!

Motif    cons   total   (ratio = cons / total)
AGTGAA    20    4000
AGTGAC    20    4000
AGTGAG    20    4000
AGTGAT    20    4000
AGTGCA    50     200
AGTGCC    20    4000
AGTGCG    20    4000
AGTGCT    20    4000

• Motif enumeration
 – Perfectly conserved motif instances
 – Each motif is a fully-specified 6-mer
• Algorithmic speed-up
 – Do not search entire genome for every motif
 – Scan genome once, fill in table of motif instances
 – Content-based indexing
• What's missing: Motif collapsing

Power of evolutionary signatures for motif discovery

(Expression enrichment is given as Promoters / Enhancers; where the source shows a single value, its column is not recoverable.)

 #   Consensus     MCS    Matches to known              Expression enrichment
 1   CTAATTAAA     65.6   engrailed (en)                25.4 / 2
 2   TTKCAATTAA    57.3   reversed-polarity (repo)      5.8 / 4.2
 3   WATTRATTK     54.9   araucan (ara)                 11.7 / 2.6
 4   AAATTTATGCK   54.4   paired (prd)                  4.5 / 16.5
 5   GCAATAAA      51     ventral veins lacking (vvl)   13.2 / 0.3
 6   DTAATTTRYNR   46.7   Ultrabithorax (Ubx)           16 / 3.3
 7   TGATTAAT      45.7   apterous (ap)                 7.1 / 1.7
 8   YMATTAAAA     43.1   abdominal A (abd-A)           7 / 2.2
 9   AAACNNGTT     41.2                                 20.1 / 4.3
10   RATTKAATT     40                                   3.9 / 0.7
11   GCACGTGT      39.5   fushi tarazu (ftz)            17.9
12   AACASCTG      38.8   broad-Z3 (br-Z3)              10.7
13   AATTRMATTA    38.2                                 19.5 / 1.2
14   TATGCWAAT     37.8                                 5.8 / 2
15   TAATTATG      37.5   Antennapedia (Antp)           14.1 / 5.4
16   CATNAATCA     36.9                                 1.8 / 1.7
17   TTACATAA      36.9                                 5.4
18   RTAAATCAA     36.3                                 3.2 / 2.8
19   AATKNMATTT    36                                   3.6 / 0
20   ATGTCAAHT     35.6                                 2.4 / 4.6
21   ATAAAYAAA     35.5                                 57.2 / -0.5
22   YYAATCAAA     33.9                                 5.3 / 0.6
23   WTTTTATG      33.8   Abdominal B (Abd-B)           6.3 / 6
24   TTTYMATTA     33.6   extradenticle (exd)           6.7 / 1.7
25   TGTMAATA      33.2                                 8.9 / 1.6
26   TAAYGAG       33.1                                 4.7 / 2.7
27   AAAKTGA       32.9                                 7.6 / 0.3
28   AAANNAAA      32.9                                 449.7 / 0.8
29   RTAAWTTAT     32.9   gooseberry-neuro (gsb-n)      11 / 0.8
30   TTATTTAYR     32.9   Deformed (Dfd)                30.7

Ability to discover full dictionary of regulatory motifs de novo

Stark et al, Nature, 2007
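The "scan genome once, fill in table" speed-up and the conserved/total ratio can be sketched as follows. A hash table (here a `Counter`) is used instead of a dense array so the same code also works for long k-mers where most table entries would be empty. The function names are illustrative, and reading the table's two numeric columns as conserved count and total count is an assumption of this sketch.

```python
from collections import Counter

def count_kmers(genome, k):
    """Single pass over the genome, counting every k-mer in a hash table."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        counts[genome[i:i + k]] += 1
    return counts

def conservation_ratio(conserved, total):
    """conserved / total instances for every motif seen at least once."""
    return {m: conserved.get(m, 0) / n for m, n in total.items() if n > 0}

# One scan answers the count question for every 6-mer at once.
print(count_kmers("AGTGCAGTGCA", 6).most_common(1))

# Counts taken from the slide's table for two of the 6-mers.
conserved = {"AGTGAA": 20, "AGTGCA": 50}
total = {"AGTGAA": 4000, "AGTGCA": 200}
print(conservation_ratio(conserved, total))
```

Neither the conserved count nor the total count singles out AGTGCA on its own scale, but the ratio does, which is the point of the "conserved/total" bullet.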

5. Evolutionary signatures of motif instances

• Allow for motif movements
 – Sequencing/alignment errors
 – Loss, movement, divergence
• Measure branch-length score
 – Sum evidence along branches
 – Close species little contribution

[Figure: two instances of the Mef2 motif (YTAWWWWTAR), one with BLS: 25% and one with BLS: 83%]
