This lecture

¾Restriction enzymes and the partial digest problem Exhaustive search ¾Finding regulatory motifs in DNA Sequences

Torgeir R. Hvidsten ¾Exhaustive search methods

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 1 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 2

Restriction enzymes

¾ HindII - first – Restriction enzymes and the was discovered accidentally in 1970 while studying how the ppgpartial digest problem bacterium Haemophilus influenzae takes up DNA from the virus

¾ Restriction enzymes are used as a defense mechanism by bacteria to break down the DNA of attacking viruses

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 3 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 4

Uses of restriction enzymes Restriction maps

¾Recombinant DNA technology (i.e. combining DNA ¾ A map showing the positions of restriction sites in a DNA sequences that would not normally occur together) sequence ¾Cloning ¾ If the DNA sequence is known then constructing a ¾cDNA/genomic library construction restriiiction map is tri iilvial ¾DNA mapping ¾ In early days of , DNA sequences were often unknown ¾ Biologists had to solve the problem of constructing restriction maps without knowing DNA sequences

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 5 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 6

1 Measuring length of restriction fragments Partial restriction digest ¾ The sample of DNA is exposed to the restriction enzyme for only a limited amount of time to prevent it from being cut at all restriction sites ¾ Restriction enzymes break DNA into restriction ¾ This experiment generates the set of all possible restriction fragments fragments between every two (not necessarily consecutive) cuts ¾ is a process for separating DNA by size and measuring sizes of restriction fragments ¾ This set of fragment sizes is used to ¾ Can separate DNA fragments that differ in length in determine the only 1 nucleotide for fragments up to 500 nucleotides positions of the long restriction sites in the DNA sequence

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 7 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 8

Multiset of restriction fragments Partial Digest Problem (PDP) ¾ We assume that multiplicity of a fragment can be detected, i.e., X the set of n integers representing the location of all the number of restriction fragments of the same length can be cuts in the restriction map, including the start and determined end i.e. X = {x1 = 0, x2, …, xn}. ¾ Restriction sites: X = {0, 5, 14, 19, 22} n the total number of cuts ¾ Multiset: ΔX = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22} ΔX thlifihe multiset of integers represent ilhfing lengths of each of the n(n-1)/2 fragments produced from a

partial digest i.e. ΔX = {xj –xi | 1 ≤ i < j ≤ n}

Problem: Given the multiset L, find a set X such that ΔX = L

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 9 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 10

Partial Digest: Multiple Solutions (I) Partial Digest: Multiple Solutions (II)

It is not always possible to uniquely reconstruct a set X based only on ΔX. 01257912 015781012 For example, the set 0 1 2 5 7 912 0 1 5 7 81012 X = {0, 2, 5} 1 1 4 6811 1 4 6 7 911 and 2 3 5 7 10 5 2 3 5 7 (X + 10) = {10, 12, 15} both produce ΔX={2, 3, 5} as their partial digest set (they are 5 2 4 7 7 1 3 5 homometric). 7 2 5 8 2 4 The sets {0,1,2,5,7,9,12} and {0,1,5,7,8,10,12} present a less trivial 9 3 10 2 example of non-uniqueness. They both digest into: {1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12}. 12 12

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 11 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 12

2 BruteForcePDP Efficiency of BruteForcePDP

BruteForcePDP(L, n) ¾The running time of BruteForcePDP is O(|L|n-2) 1 M ← Maximum element in L or, since |L| = n(n-1)/2, O(n2n-4)

2 for every set of n – 2 integers 0 < x2 < … xn-1 < M ¾ Note that without restricting the elements of X to n-2 such that xi ε L for 1 < i < n elements in L, the running time would be O(M ) 3 X ← {0,x2,…,xn-1,M } ¾If L = {2, 998, 1000} (n = 3, M = 1000), the 4Form ΔX from X running time would be very different 5 if ΔX = L ¾Nonetheless, both algorithms have a exponential 6 return X running time 7 output “No solution”

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 13 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 14

Branch and Bound Algorithm for PDP Definitions

1. Initiation: Add the start (0) and end (width) point of the Before describing PartialDigest, first define sequence to X (and remove the end point from L) Δ(y, X) 2. Find the largest element y in L as the multiset of all distances between point y and all 3. See if y fits o n t he rig ht o r le ft s ide o f t he rest rict io n map other points in the set X by checking whether the other lengths (fragments) it Δ(y, X) = {|y – x1|, |y – x2|, …, |y – xn|} creates are in L 4. If it fits, remove these lengths from L and add y (or width for X = {x1, x2, …, xn} –y) to X (if not, backtrack) 5. Go back to step 2 until L is empty

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 15 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 16

PartialDigest(L) 1 width ← Maximum element in L 2 Remove width from L PartialDigest An Example 3 X ← {0, width} 4 Place(L, X) L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} Place(L,X) 1 if L is empty Inititation: 2 output X width = 10, so delete 10 from L and add 0 and 10 to X 3 return 4 y ← MiMaximum e lement in L 5 if Δ(y, X ) ⊆ L L = {2, 2, 3, 3, 4, 5, 6, 7, 8} 6 Remove lengths Δ(y, X) from L and add y to X X = {0, 10} 7 Place(L, X ) 8 Remove y from X and add lengths Δ(y, X) to L 9 if Δ(width – y, X ) ⊆ L 10 Remove lengths Δ(width – y, X) from L and add width – y to X and 11 Place(L, X ) 12 Remove width – y from X and add lengths Δ(width – y, X ) to L 13 return T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 17 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 18

3 An Example An Example

L = {2, 2, 3, 3, 4, 5, 6, 7, 8} L = {2, 3, 3, 4, 5, 6, 7} X = {0, 10} X = {0, 8, 10}

y = 8 y = 7 Δ(y, X ) = {8, 2 }, which is a subset of L Δ(y, X ) = {7, 1, 3 }, which is not a subset of L Δ(width – y, X) = {3, 5, 7}, which is a subset of L L = {2, 3, 3, 4, 5, 6, 7} X = {0, 8, 10} L = {2, 3, 4, 6} X = {0, 3, 8, 10}

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 19 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 20

An Example An Example

L = {2, 3, 4, 6} L = {} X = {0, 3, 8, 10} X = {0, 3, 6, 8, 10}

y = 6 L is empty! Δ(y, X ) = {6, 3, 2, 4 }, which is a subset of L Output X

L = {} And continue searching for more solutions … X = {0, 3, 6, 8, 10}

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 21 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 22

An Example An Example

L = {2, 3, 4, 6} L = {2, 3, 3, 4, 5, 6, 7} X = {0, 3, 8, 10} X = {0, 8, 10}

y = 6 y = 7 Δ(y, X ) = {6, 3, 2, 4 }, which is a subset of L Δ(y, X ) = {7, 1, 3 }, which is not a subset of L Δ(width – y, X) = {4, 1, 4, 6}, which is not a subset of L Δ(width – y, X) = {3, 5, 7}, which is a subset of L

Backtrack! Backtrack!

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 23 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 24

4 An Example An Example

L = {2, 2, 3, 3, 4, 5, 6, 7, 8} L = {2, 3, 3, 4, 5, 6, 7} X = {0, 10} X = {0, 2, 10}

y = 8 y = 7 Δ(y, X ) = {8, 2 }, which is a subset of L Δ(y, X ) = {7, 5, 3 }, which is a subset of L Δ(width – y, X) = {2, 8}, which is a subset of L L = {2, 3, 4, 6} L = {2, 3, 3, 4, 5, 6, 7} X = {0, 2, 7, 10} X = {0, 2, 10}

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 25 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 26

An Example An Example

L = {2, 3, 4, 6} L = {} X = {0, 2, 7, 10} X = {0, 2, 4, 7, 10}

y = 6 L is empty! Δ(y, X ) = {6, 4, 1, 4 }, which is not a subset of L Output X Δ(width – y, X) = {4, 2, 3, 6}, which is a subset of L And continue searching for more solutions … L = {} X = {0, 2, 4, 7, 10}

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 27 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 28

An Example Recursion tree

1 L = {2, 3, 3, 4, 5, 6, 7} y = 8 10-y = 2 X = {0, 2, 10} 2 4 y = 7 y = 7 10-y = 3 y = 7 10-y = 3 Δ(y, X ) = {7, 5, 3 }, which is a subset of L Δ(width – y, X) = {3, 1, 7} which is not a subset of L 3 5 y = 6 10-y = 4 y = 6 10-y = 6 Backtrack and finish! 6

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 29 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 30

5 Analyzing PartialDigest (I) Analyzing PartialDigest (II)

¾ Let T(n) be the time PartialDigest takes to place n T(n) = 2T(n-1) + O(n) T(1) = 1 cuts O(1) = 1 − No branching case: T(n) = T(n-1) + O(n) T(()n) + O ()(n) = 2T(n( -1) + O(()n) + O( ()n) = 2( T(n( -1) + O(())n)) − BhiBranching case: T(n ) = 2T( n-1) + O( n) ¾ The ”No branching case” is quadratic O(n2) (like Let U(n) = T(n) + O(n) SelectionSort) ¾ n U(n) = 2U(n-1) The ”Branching case” is exponential O(2 ) U(1) = 2

This gives the sequence: 2, 4, 8, 16, 32, 64, …→ U(n) = 2n T(n) = 2n –O(n)

Random sample

Finding regulatory motifs in atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca DNA sequences tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 33 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 34

Implanting motif AAAAAAAGGGGGGG Where is the implanted motif?

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 35 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 36

6 Implanting motif AAAAAAGGGGGGG Where is the motif? with four random mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 37 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 38

Why finding motif is difficult Gene regulation

¾A microarray experiment showed that when atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa gene X is knocked out, 20 other genes are not tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga expressed gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat − How can one gene have such drastic effects?

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG ..|..|||.|..||| cAAtAAAAcGGcGGG

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 39 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 40

Regulatory proteins Regulatory regions

¾ Gene X encodes a regulatory protein, a.k.a. a ¾ Every gene contains a regulatory region (RR) typically transcription factor (TF) stretching 100-1000 bp upstream of the transcriptional start site ¾ The 20 unexpressed genes rely on gene X’s TF to ¾ Located somewhere within the RR are the transcription induce transcription factor binding sites (TFBS), also known as motifs, specific for a given transcription factor ¾ TFs influence gene expression by binding to a specific ¾ A single TF may regulate multiple genes location in the respective gene’s regulatory region - TFBS - and recruiting the DNA polymerase

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 41 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 42

7 Motif logo The motif finding problem TGGGGGA ¾ Motifs can mutate on non Given a random sample of DNA sequences: important bases TGAGAGA ¾ The five motifs in five TGGGGGA cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat different genes have TGAGAGA agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc mutatiiions in posi ii3tion 3 TGAGGGA aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt and 5 agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

¾ Representations called motif ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc logos illustrate the conserved and variable Find the pattern that is implanted in each of the regions of a motif individual sequences, namely, the motif

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 43 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 44

Parameters Definitions

l = 8 DNA t number of sample DNA sequences cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat n length of each DNA sequence agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc DNA samppq(le of DNA sequences (t x n array) t=5 aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt l length of the motif (l -mer) agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca si starting position of an l -mer in sequence i ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc s=(s1, s2,… st) array of motif starting positions n = 69

s s1 = 26 , s2 = 21 , s3= 3 , s4 = 56 , s5 = 60

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 45 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 46

Motifs: Profiles and consensus Consensus

a G g t a c T t ¾ C c A t a c g t Line up the patterns by ¾ Think of consensus as an “ancestor” motif, from Alignment a c g t T A g t their start indexes a c g t C c A t which mutated motifs emerged C c g t a c g G ¾ The distance between a real motif and the consensus s = (s1, s2, …, st) ______seqqguence is generally less than that for two real motifs A 3 0 1 0 3 1 1 0 ¾ Construct matrix profile Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 with frequencies of each T 0 0 0 5 1 0 1 4 nucleotide in columns ¾ We need to introduce a scoring function to compare ______different motifs and choose the “best” one.

Consensus A C G T A C G T ¾ Consensus nucleotide in each position has the highest score in column

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 47 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 48

8 Scoring motifs: consensus score The motif finding problem

l ¾ Goal: Given a set of DNA sequences, find a set of l - a G g t a c T t ¾Given s = (s1, … st) and DNA: C c A t a c g t mers, one from each sequence, that maximizes the a c g t T A g t t consensus score a c g t C c A t C c g t a c g G ¾Score(s,DNA) = ______¾ Input: A t x n matrix DNA, and l, the length of the l A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 pattern to find count(k,i) G 0 1 4 0 0 0 3 1 ∑ max T 0 0 0 5 1 0 1 4 i=1 k∈{ A,T ,C,G} ______

Consensus a c g t a c g t ¾ Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA) Score 3+4+4+5+3+4+3+4=30

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 49 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 50

BruteForceMotifSearch Running Time of BruteForceMotifSearch

BruteForceMotifSearch(DNA, t, n, l) ¾ Varying (n – l + 1) positions in each of t sequences, we’re – t 1 bestScore ← 0 looking at (n l + 1) sets of starting positions

2 for each s=(s1,s2 , . . ., st) from (1,1 . . .,1) to (n-l+1, . . ., n-l+1) 3 if (((Score(s,DNA) > bestScore) ¾ For each set of starting positions, the scoring function 4 bestScore ← Score(s, DNA) makes l operations, so complexity is l(n – l + 1)t = O(lnt) 5 bestMotif ← (s1,s2 , . . . , st) 6 return bestMotif ¾ That means that for t = 8, n = 1000, and l = 10 we must perform approximately 1020 computations – it will take billions of years!

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 51 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 52

The median string problem Hamming Distance

¾Hamming distance:

¾Given a set of t DNA sequences, find a pattern − dH(v,w) is the number of nucleotide pairs that that appears in all t sequences with the minimum do not match when v and w are aligned. For number of mutations example:

¾This pattern will be the motif dH(AAAAAA,ACAAAC) = 2

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 53 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 54

9 Total Distance: Example The median string problem ¾ ¾ Given v = “acgtacgt” and s Goal: Given a set of DNA sequences, find a

d (v, x) = 1 median string H acgtacgt cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat acgtacgt ¾Input: A t x n matrix DNA, and l, the length of dH(v, x) = 0 aggggggtactggtgtacatttgatacgggtacgtacaccggcaacct gaaacaaac gctca gaacca gaa gt gc acgtacgt the pattern to find aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt acgtacgt d (v, x) = 2 dH(v, x) = 0 ¾ H agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca Output: A string v of l nucleotides that d (v, x) = 1 acgtacgt ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaH acgtaGgtc minimizes TotalDistance(v,DNA) over all strings

v is the sequence in red, x is the sequence in blue of that length

¾ TotalDistance(v,DNA) = 1+0+2+0+1 = 4

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 55 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 56

Median string search algorithm Motif finding problem = median string problem l

a G g t a c T t ¾ At any column i BruteForceMedianStringSearch (DNA, t, n, l) C c A t a c g t Score + TotalDistance = t Alignment a c g t T A g t t i i 1 bestWord ← AAA…A a c g t C c A t C c g t a c g G ¾ ______Because there are l columns 2 bestDistance ←∞ Score + TotalDistance = l × t A 3 0 1 0 3 1 1 0 3 for eachlh l-mer v from AAA… A to TTT… T Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 ¾ Rearranging: 4 if TotalDistance(v,DNA) < bestDistance T 0 0 0 5 1 0 1 4 ______Score = l × t - TotalDistance

5 bestDistance ← TotalDistance(v,DNA) Consensus a c g t a c g t ¾ l × t is constant, thus the 6 bestWord ← v Score 3+4+4+5+3+4+3+4 minimization of TotalDistance is TotalDistance 2+1+1+0+2+1+2+1 equivalent to the maximization of 7 return bestWord Score Sum 5 5 5 5 5 5 5 5

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 57 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 58

Motif finding problem vs. Structuring the search median string problem

Why bother reformulating the motif finding problem For the median string problem we need to consider all into the median string problem? 4l possible l-mers: l l − The motif finding problem needs to examine aa… aa 11… 11 all the combinations for s. That is (n - l + 1)t aa… ac 11… 12 aa… ag 11… 13 combinations aa… at 11… 14 . − The median string problem needs only to . examine all 4l combinations for v. tt… tt 44… 44 How to organize this search?

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 59 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 60

10 Search Tree Analyzing Search Trees

¾Characteristics of the search trees: root − The sequences are contained in its leaves -- − The parent of a node is the prefix of its children

1- 2- 3- 4- ¾How can we move through the tree?

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 61 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 62

Visit the Next Leaf NextLeaf: Example

Let a be an array of digits: a = {a1, a2, …aL} where ai ε [1,k], 1 ≤ i ≤ L

Given a current leaf a, we need to compute the “next” leaf: root

NextLeaf(a,L,k ) -- 1 for i ← L to 1

2 if ai < k 1- 2- 3- 4-

3 ai ← ai + 1 4 return a 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 5 ai ← 1 6 return a T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 63 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 64

Visit all leaves Visit all leaves: Example

Printing all permutations in ascending order:

root

AllLeaves(L,k) -- 1 a ← (111,1,...,1 ) 2 while forever 1- 2- 3- 4- 3output a 4 a ← NextLeaf(a,L,k) 5 if a = (1,1,...,1) 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 6 return

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 65 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 66

11 Visit the next vertex Visit the next vertex: Example (five moves)

We can search though all vertices of the tree with a depth first search

i is the prefix length root

-- NextVertex(a,i,L,k) 1 if i < L

2 ai+1 ← 1 3 return (a,i+1) 1- 2- 3- 4- 4 else 5 for j ← L to 1

6 if aj < k 7 aj ← aj +1 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 8 return(a,j ) 9 return(a,0)

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 67 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 68

Bypass Move Bypass Move: Example Given a prefix (internal vertex), find the next vertex after skipping all its children root Bypass(a,i,L,k) -- 1 for j ← i to 1 2 if a < k j 1- 2- 3- 4- 3 aj ← aj +1 4 return(a,j) 5 return(a,0) 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 69 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 70

Brute force search again Can We Do Better?

¾ Let TotalDistance(prefix,DNA) be the distance for a BruteForceMedianStringSearchAgain(DNA, t, n, l) nucleotide string corresponding to (s , s , …,s ) 1 s ← (1,1,…, 1) 1 2 i 2 bestDistance ←∞ ¾ Note that if the total distance for a prefix is greater than 3 while forever that for the best word so far: 4 s ← NextLeaf (s, l, 4)

5 word ← Nucleotide string corresponding to (s1,s2 , . . . , sl) 6 if (TotalDistance(word,DNA) < bestDistance) TotalDistance (prefix, DNA) > BestDistance 7 bestDistance ← TotalDistance(word, DNA) 8 bestWord ← word 9 if s = (1,1,…, 1) there is no use exploring the remaining part of the word 10 return bestWord ¾ Use ByPass()!

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 71 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 72

12 Bounded Median String Search Running time BranchAndBoundMedianStringSearch(DNA,t,n,l ) 1 s ← (1,…,1) 2 bestDistance ←∞ 3 i ← 1 ¾As usual with branch-and-bound algorithms, 4 while i > 0 5 if i < l there is no improved running time in the worst

6 prefix ← Nucleotide string corresponding to (s1, s2, …, si) 7 optimisticDistance ← TotalDistance(prefix,DNA) case 8 if optimisticDistance > bestDistance 9(s, i ) ← Bypass(s,i, l, 4) ¾However, it often results in a practical speedup 10 else 11 (s, i ) ← NextVertex(s, i, l, 4) 12 else 13 word ← Nucleotide string corresponding to (s1, s2, …, sl) 14 if TotalDistance(s,DNA) < bestDistance 15 bestDistance ← TotalDistance(word, DNA) 16 bestWord ← word 17 (s,i) ← NextVertex(s,i,l, 4) 18 return bestWord

T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 73 T.R. Hvidsten: 1MB304: Discrete structures for bioinformatics II 74

13