Bioinformatics Algorithms

• Simon Frost, [email protected]
• No biology in the exam questions
• You need to know only the biology in the slides to understand the reason for the algorithms
• Partly based on the book: Compeau and Pevzner, Bioinformatics Algorithms (chapters 3 and 5 in Vol. I, 7-10 in Vol. II)
  – Also: Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
• Color slides are available from the course website

Sequence Alignment: Outline
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

Biological background
• DNA: 4-letter alphabet, A (adenine), T (thymine), C (cytosine) and G (guanine). In the double helix, A pairs with T and C pairs with G.
• Gene: hereditary information located on the chromosomes and consisting of DNA.
• RNA: same as DNA, but with T replaced by U (uracil).
• Three letters (a triplet, called a codon) code for one amino acid in a protein.
• Proteins: the units are the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.
• Genome: an organism's genetic material.

  DNA                  CCTGAGCCAACTATTGATGAA
                       GGACTCGGTTGATAACTACTT
  mRNA (transcription) CCUGAGCCAACUAUUGAUGAA
  Protein (translation) PEPTIDE

Why Align Sequences?
• Biological sequences can be represented as a vector of characters:
  – A, C, T and G for DNA
• We align sequences such that sites with the same ancestry are aligned
• This is a requirement, e.g.
  for downstream phylogenetic analyses.

The Alignment Game
  A T G T T A T A
  A T C G T C C
Alignment Game (maximizing the number of points). Each move is one of:
• Remove the 1st symbol from each sequence
  – 1 point if the symbols match, 0 points if they don't match
• Remove the 1st symbol from one of the sequences
  – 0 points

The Alignment Game: an example play
  A  T  -  G  T  T  A  T  A
  A  T  C  G  T  -  C  -  C
  +1 +1    +1 +1              = 4

What Is a Sequence Alignment?
An alignment of two sequences is a two-row matrix:
• 1st row: symbols of the 1st sequence (in order) interspersed with "-"
• 2nd row: symbols of the 2nd sequence (in order) interspersed with "-"
Each column of the matrix is a match, a mismatch, an insertion or a deletion.

Longest Common Subsequence
The matches in an alignment of two sequences (ATGT in the example above) form a common subsequence of the two sequences.
Longest Common Subsequence Problem: Find a longest common subsequence of two strings.
• Input: Two strings.
• Output: A longest common subsequence of these strings.

From Manhattan to a Grid Graph
Walk from the source to the sink (only in the South ↓ and East → directions) and visit the maximum number of attractions.

Manhattan Tourist Problem
Manhattan Tourist Problem: Find a longest path in a rectangular city grid.
• Input: A weighted rectangular grid.
• Output: A longest path from the source to the sink in the grid.

Greedy algorithm?
[Figure: a weighted rectangular grid of 5 x 5 nodes, and the same grid with the path chosen by a greedy algorithm; the running totals along the greedy path end at 23.]

From a regular to an irregular grid
[Figure: the weighted grid with some edges removed and diagonal edges added.]

Search for Longest Paths in a Directed Graph
Longest Path in a Directed Graph Problem: Find a longest path between two nodes in an edge-weighted directed graph.
• Input: An edge-weighted directed graph with source and sink nodes.
• Output: A longest path from source to sink in the directed graph.
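
As a small check of the Alignment Game scoring described above, the points of a finished play can be counted directly from the two-row alignment matrix. This is only a sketch: the function name `alignment_score` and the choice of "-" as the gap symbol are assumptions made for illustration, not part of the slides.

```python
def alignment_score(row1: str, row2: str, gap: str = "-") -> int:
    """Count Alignment Game points: +1 for every column whose two symbols match.

    `alignment_score` is a hypothetical helper name; the rows are the two rows
    of the alignment matrix, with `gap` marking a removed-from-one-sequence move.
    """
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gap symbols")
        if a == b:  # a match; a gap never equals a sequence symbol
            score += 1
    return score


# The example play from the slides.
top = "AT-GTTATA"
bottom = "ATCGT-C-C"
score = alignment_score(top, bottom)

# Removing the gaps must give back the two original sequences.
spelled = (top.replace("-", ""), bottom.replace("-", ""))
print(score, spelled)
```

The four matching columns spell ATGT, the common subsequence mentioned in the Longest Common Subsequence slide.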
Do You See a Connection between the Manhattan Tourist and the Alignment Game?

Alignment → path
  A  T  -  G  T  T  A  T  A
  A  T  C  G  T  -  C  -  C
  ↘  ↘  →  ↘  ↘  ↓  ↘  ↓  ↘
A column with two symbols is a diagonal move ↘, a column with "-" in the 1st row is a move →, and a column with "-" in the 2nd row is a move ↓.
[Figure: a grid with ATCGTCC along the top and ATGTTATA down the side; the alignment is drawn as a path from the top-left corner to the bottom-right corner.]

Path → alignment
  ↓  ↓  ↘  ↘  ↘  →  ↘  ↘  ↘
  A  T  G  T  T  -  A  T  A
  -  -  A  T  C  G  T  C  C
Highest-scoring alignment = longest path in a properly built Manhattan.

How to build a Manhattan for the Alignment Game and the Longest Common Subsequence Problem?
Diagonal red edges correspond to matching symbols and have score 1.
[Figure: the grid for ATCGTCC and ATGTTATA with red diagonal edges at matching symbols.]

South or East?
There are only 2 ways to arrive at the sink: by moving South ↓ or by moving East →.

  SouthOrEast(n, m)
      if n = 0 and m = 0
          return 0
      x ← -infinity, y ← -infinity
      if n > 0
          x ← SouthOrEast(n-1, m) + weight of edge "↓" into (n, m)
      if m > 0
          y ← SouthOrEast(n, m-1) + weight of edge "→" into (n, m)
      return max{x, y}

[Figure: the grid filled in step by step. The first row of node scores is 0, 3, 5, 9, 9 and the first column is 0, 1, 5, 9, 14.]
South or East? At node (1,1): 1 + 3 > 3 + 0, so its score is 4.
We arrived at (1,1) by the bold edge "→".

Backtracking pointers: the best way to get to each node
Node scores for the whole grid:
   0   3   5   9   9
   1   4   7  13  15
   5  10  17  20  24
   9  14  22  22  25
  14  20  30  32  34

Dynamic programming
• Break down a complex problem into simpler subproblems
• Solve each subproblem once
• Store the solutions ("memoization")

Dynamic Programming Recurrence
s_{i,j}: the length of a longest path from (0,0) to (i,j)

  s_{i,j} = max { s_{i-1,j} + weight of edge "↓" into (i,j),
                  s_{i,j-1} + weight of edge "→" into (i,j) }

How does the recurrence change for this graph?
[Figure: the irregular grid, with diagonal edges.]
  s_a = max over all predecessors b of node a { s_b + weight of edge from b to a }

[Figure: the irregular grid. The marked node has 4 choices: 5 + 2, 3 + 7, 5 + 4, 4 + 2. Is its score 10?]
[Figure: the irregular grid with all node scores filled in.]

Dynamic Programming Recurrence for the Alignment Graph
s_{i,j}: the length of a longest path from (0,0) to (i,j)

  s_{i,j} = max { s_{i-1,j}   + weight of edge "↓" into (i,j),
                  s_{i,j-1}   + weight of edge "→" into (i,j),
                  s_{i-1,j-1} + weight of edge "↘" into (i,j) }

Red edges ↘ have weight 1; all other edges have weight 0.

Dynamic Programming Recurrence for the Longest Common Subsequence Problem
s_{i,j}: the length of a longest path from (0,0) to (i,j)

  s_{i,j} = max { s_{i-1,j}   + 0,
                  s_{i,j-1}   + 0,
                  s_{i-1,j-1} + 1, if v_i = w_j }

Red edges ↘ have weight 1; all other edges have weight 0.
[Figure: backtracking pointers for the Longest Common Subsequence of ATGTTATA and ATCGTCC.]

Computing Backtracking Pointers

  s_{i,j} ← max { s_{i,j-1}   + 0,
                  s_{i-1,j}   + 0,
                  s_{i-1,j-1} + 1, if v_i = w_j }

  backtrack_{i,j} ← "→", if s_{i,j} = s_{i,j-1}
                    "↓", if s_{i,j} = s_{i-1,j}
                    "↘", if s_{i,j} = s_{i-1,j-1} + 1 and v_i = w_j

Why did we store the backtracking pointers?
[Figure: the rectangular grid with node scores and backtracking pointers.]

What is the optimal alignment path?
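
The Longest Common Subsequence recurrence and the backtracking pointers above can be sketched as follows. The names `lcs_backtrack` and `trace_lcs` are assumptions made for this example, and the iterative trace-back loop is one possible way of using the stored pointers, not necessarily the one used in the course.

```python
def lcs_backtrack(v: str, w: str):
    """Fill the score table s and the backtracking pointers for strings v and w.

    Rows follow v and columns follow w, so "→" consumes a symbol of w,
    "↓" consumes a symbol of v and "↘" is a match v_i = w_j.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    backtrack = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The diagonal edge only exists (with weight 1) when symbols match;
            # -1 makes it lose the max otherwise, since all scores are >= 0.
            diag = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            s[i][j] = max(s[i][j - 1], s[i - 1][j], diag)
            if s[i][j] == s[i][j - 1]:
                backtrack[i][j] = "→"
            elif s[i][j] == s[i - 1][j]:
                backtrack[i][j] = "↓"
            else:
                backtrack[i][j] = "↘"
    return s, backtrack


def trace_lcs(backtrack, v: str, i: int, j: int) -> str:
    """Follow the pointers from (i, j) back to the border, collecting matches."""
    symbols = []
    while i > 0 and j > 0:
        if backtrack[i][j] == "→":
            j -= 1
        elif backtrack[i][j] == "↓":
            i -= 1
        else:
            symbols.append(v[i - 1])
            i -= 1
            j -= 1
    return "".join(reversed(symbols))


v, w = "ATGTTATA", "ATCGTCC"
s, backtrack = lcs_backtrack(v, w)
lcs = trace_lcs(backtrack, v, len(v), len(w))
print(s[len(v)][len(w)], lcs)
```

Storing one pointer per node is what makes the trace-back possible: the score table alone gives the length of a longest common subsequence, while the pointers recover the subsequence itself.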
[Figure: backtracking pointers for the Longest Common Subsequence of ATGTTATA and ATCGTCC.]

Using Backtracking Pointers to Compute LCS

  OutputLCS(backtrack, v, i, j)
      if i = 0 or j = 0
          return
      if backtrack_{i,j} = "→"
          OutputLCS(backtrack, v, i, j-1)
      else if backtrack_{i,j} = "↓"
          OutputLCS(backtrack, v, i-1, j)
      else
          OutputLCS(backtrack, v, i-1, j-1)
          output v_i

Computing Scores of ALL Predecessors

  s_a = max over ALL predecessors b of node a { s_b + weight of edge from b to a }

[Figure: a small edge-weighted graph. The score of the node marked "?" needs the scores of all its predecessors, some of which are themselves still marked "?".]

A Vicious Cycle
[Figure: the same graph, in which the unknown scores depend on one another.]

In What Order Should We Explore Nodes of the Graph?

  s_a = max over ALL predecessors b of node a { s_b + weight of edge from b to a }

• By the time a node is analyzed, the scores of all its predecessors should already be computed.
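
One ordering that satisfies this requirement is a topological order, which exists whenever the directed graph has no cycles. The sketch below uses Kahn's algorithm to build such an order and then applies the recurrence for s_a; the function name `longest_path_dag`, the edge-list format and the small example graph are assumptions made for illustration.

```python
from collections import defaultdict, deque


def longest_path_dag(edges, source, sink):
    """Longest source-to-sink path in an edge-weighted directed acyclic graph.

    edges: list of (b, a, weight) triples, meaning an edge from b to a.
    Returns (length, path), or None if the sink is unreachable from the source.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for b, a, weight in edges:
        successors[b].append((a, weight))
        indegree[a] += 1
        nodes.update((b, a))

    # Kahn's algorithm: repeatedly take a node whose predecessors are all done.
    queue = deque(x for x in nodes if indegree[x] == 0)
    order = []
    while queue:
        b = queue.popleft()
        order.append(b)
        for a, _ in successors[b]:
            indegree[a] -= 1
            if indegree[a] == 0:
                queue.append(a)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle; no valid order exists")

    # When b is visited, every predecessor of b has already pushed its score,
    # so score[b] is final and can be pushed on to the successors of b.
    score = {x: float("-inf") for x in nodes}
    score[source] = 0
    best_pred = {}
    for b in order:
        if score[b] == float("-inf"):
            continue  # b is not reachable from the source
        for a, weight in successors[b]:
            if score[b] + weight > score[a]:
                score[a] = score[b] + weight
                best_pred[a] = b  # backtracking pointer for node a

    if score[sink] == float("-inf"):
        return None
    path = [sink]
    while path[-1] != source:
        path.append(best_pred[path[-1]])
    return score[sink], path[::-1]


# Hypothetical example graph.
edges = [("s", "a", 4), ("s", "b", 1), ("a", "b", 6), ("a", "t", 2), ("b", "t", 1)]
result = longest_path_dag(edges, "s", "t")
print(result)
```

In the rectangular and alignment grids, filling the table row by row is already a topological order, which is why the earlier recurrences never run into the vicious cycle.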
