
algorithms

• Simon Frost, [email protected]
• No biology in the exam questions
• You need to know only the biology in the slides to understand the reason for the algorithms
• Partly based on book: Compeau and Pevzner, Bioinformatics Algorithms (chapters 3, 5 in Vol I; 7–10 in Vol II)
  – Also Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison
• Color slides from the course website

1 Sequence Alignment Outline

• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space‐Efficient Sequence Alignment
• Multiple Sequence Alignment

DNA: 4-letter alphabet: A (adenine), T (thymine), C (cytosine) and G (guanine). In the double helix, A pairs with T and C with G.
Gene: hereditary information located on the chromosomes and consisting of DNA.
RNA: same as DNA but T -> U (uracil). 3 letters (a triplet, or codon) code for one amino acid in a protein.
Proteins: units are the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.
Genome: an organism’s genetic material.

DNA    CCTGAGCCAACTATTGATGAA
       GGACTCGGTTGATAACTACTT

transcription

CCUGAGCCAACUAUUGAUGAA mRNA

translation

Protein   PEPTIDE

3 Why Align Sequences?

• Biological sequences can be represented as a vector of characters:
  – A, C, T and G for DNA
• We align sequences so that sites with the same ancestry are aligned
• This is a requirement e.g. for downstream phylogenetic analyses

4 The Alignment Game

A T G T T A T A
A T C G T C C

Alignment Game (maximizing the number of points):

• Remove the 1st symbol from each sequence: 1 point if the symbols match, 0 points if they don’t match
• Remove the 1st symbol from one of the sequences: 0 points

5 The Alignment Game

A T - G T T A T A
A T C G T - C - C
+1 +1    +1 +1              = 4

6 What Is a Sequence Alignment?

A T - G T T A T A
A T C G T - C - C

Each column is a match, a mismatch, an insertion, or a deletion; here the 4 matches score +1 each, so the alignment scores 4.

Alignment of two sequences is a two‐row matrix:

1st row: symbols of the 1st sequence (in order), interspersed by “‐”
2nd row: symbols of the 2nd sequence (in order), interspersed by “‐”

7 Longest Common Subsequence

A T - G T T A T A
A T C G T - C - C

The matched symbols in an alignment of two sequences (here ATGT) form a common subsequence of the two sequences.

Longest Common Subsequence Problem: Find a longest common subsequence of two strings.
• Input: Two strings.
• Output: A longest common subsequence of these strings.

8 From Manhattan to a Grid Graph

Walk from the source to the sink (only in the South ↓ and East → directions) and visit the maximum number of attractions

9 Manhattan Tourist Problem

Manhattan Tourist Problem: Find a longest path in a rectangular city grid.
• Input: A weighted rectangular grid.
• Output: A longest path from the source to the sink in the grid.

10 [Figure: a weighted rectangular grid]

Greedy algorithm?

11 [Figure: the greedy path scores 23, which is not the longest]

12 From a regular to an irregular grid

[Figure: a grid with extra diagonal edges]

13 Search for Longest Paths in a Directed Graph

Longest Path in a Directed Graph Problem: Find a longest path between two nodes in an edge‐weighted directed graph. • Input: An edge‐weighted directed graph with source and sink nodes. • Output: A longest path from source to sink in the directed graph.

14 Do You See a Connection between the Manhattan Tourist and the Alignment Game?

A T - G T T A T A
A T C G T - C - C

(a column with a gap in the bottom row is a ↓ move; a gap in the top row is a → move)

15 [Figures, slides 15–17: the alignment graph for ATCGTCC (columns) and ATGTTATA (rows); an alignment translates to a path of →, ↓ and ↘ edges, and a path translates back to an alignment]

The highest‐scoring alignment = the longest path in a properly built Manhattan.

18 How to build a Manhattan for the Alignment Game and the Longest Common Subsequence Problem?

Diagonal red edges correspond to matching symbols and have score 1.

19 [Figure: a weighted grid]

There are only 2 ways to arrive at the sink: by moving South ↓ or by moving East →.

20 South or East?

SouthOrEast(n, m)
  if n = 0 and m = 0
    return 0
  x ← −infinity
  y ← −infinity
  if n > 0
    x ← SouthOrEast(n−1, m) + weight of edge “↓” into (n, m)
  if m > 0
    y ← SouthOrEast(n, m−1) + weight of edge “→” into (n, m)
  return max{x, y}

21 [Figures, slides 21–25: filling in the grid one node at a time; e.g. we arrive at (1,1) by the bold South edge because 1 + 3 > 3 + 0]

26 Backtracking pointers: the best way to get to each node. [Figure: the completed grid; the sink’s score is 34]

Dynamic programming

• Break down a complex problem into simpler subproblems
• Solve each subproblem only once
• Store the solutions: ‘memoization’
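As a runnable sketch (not from the slides), here is the SouthOrEast recursion with memoization; the 2×2 grid of edge weights is a made-up example:

```python
from functools import lru_cache

# Made-up edge weights for a tiny grid (nodes (0,0)..(1,1)):
# down[i][j]  = weight of the "South" edge entering node (i, j) from above
# right[i][j] = weight of the "East" edge entering node (i, j) from the left
down = [[0, 0], [3, 0]]
right = [[0, 2], [0, 4]]

@lru_cache(maxsize=None)
def south_or_east(n, m):
    """Length of a longest path from (0, 0) to (n, m); each subproblem is solved once."""
    if n == 0 and m == 0:
        return 0
    best = float("-inf")
    if n > 0:
        best = max(best, south_or_east(n - 1, m) + down[n][m])
    if m > 0:
        best = max(best, south_or_east(n, m - 1) + right[n][m])
    return best

print(south_or_east(1, 1))  # → 7
```

Without memoization the recursion revisits each node exponentially many times; `lru_cache` stores each subproblem's answer, which is exactly the memoization idea above.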

27 Dynamic Programming Recurrence

si, j: the length of a longest path from (0,0) to (i,j)

si,j = max { si−1,j + weight of edge “↓” into (i,j),
             si,j−1 + weight of edge “→” into (i,j) }
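The same recurrence can be filled in bottom-up; a sketch, again with made-up edge weights:

```python
def longest_path_grid(down, right, n, m):
    """s[i][j] = length of a longest path from (0, 0) to (i, j).

    down[i][j]:  weight of the "down" edge into (i, j)
    right[i][j]: weight of the "right" edge into (i, j)
    """
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # leftmost column: only down moves
        s[i][0] = s[i - 1][0] + down[i][0]
    for j in range(1, m + 1):            # top row: only right moves
        s[0][j] = s[0][j - 1] + right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i][j],
                          s[i][j - 1] + right[i][j])
    return s[n][m]

down = [[0, 0], [3, 0]]    # made-up weights
right = [[0, 2], [0, 4]]
print(longest_path_grid(down, right, 1, 1))  # → 7
```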

28 3 2 4 0 0 3 5 How does 5 the 135 4 7 2 3 recurrence 3244 1 4 change for 2 this graph? 3 65 4 1 072 3 4 4 4 6 2 4 1 3 30 2

5 6853 1 3 22

29 sa = maxall predecessors b of node a{sb+ weight of edge from b to a} 3 2 4 0 0 3 5 4 choices: 5 5 + 2 135 4 7 2 3 3 + 7 3244 1 4 10? 5 + 4 2 4 + 2 3 65 4 1 072 3 4 4 4 6 2 4 1 3 30 2

5 6853 1 3 22

30 sa = maxall predecessors b of node a{sb+ weight of edge from b to a} 3 2 4 0 0 3 5 9 9 4 choices: 5 5 + 2 135 4 7 2 3 3 + 7 3244 1 4 10? 14 18 5 + 4 2 4 + 2 3 65 1 12 4 07 3 4 10 17 2 14 19 4 4 4 6 2 4 1 3 30 2 8 14 17 17 20 5 6853 1 3 22 13 20 25 27 29 31 Dynamic Programming Recurrence for the Alignment Graph si, j: the length of a longest path from (0,0) to (i,j)

si,j = max { si−1,j + weight of edge “↓” into (i,j),
             si,j−1 + weight of edge “→” into (i,j),
             si−1,j−1 + weight of edge “↘” into (i,j) }

red edges ↘ – weight 1; other edges – weight 0

32 Dynamic Programming Recurrence for the Longest Common Subsequence Problem

si,j: the length of a longest path from (0,0) to (i,j)

si,j = max { si−1,j + 0,
             si,j−1 + 0,
             si−1,j−1 + 1, if vi = wj }

red edges ↘ – weight 1; other edges – weight 0
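A minimal Python sketch of this recurrence, computing only the LCS length (the full (n+1)×(m+1) table is kept):

```python
def lcs_length(v, w):
    """Length of a longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])       # the weight-0 edges
            if v[i - 1] == w[j - 1]:                   # red diagonal edge
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
    return s[n][m]

print(lcs_length("ATGTTATA", "ATCGTCC"))  # → 4  (ATGT)
```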

33 [Figures, slides 33–34: the LCS alignment graph for ATCGTCC and ATGTTATA with backtracking pointers; red diagonal edges ↘ have weight 1, all other edges weight 0]

35 Computing Backtracking Pointers

si,j ← max { si,j−1 + 0,
             si−1,j + 0,
             si−1,j−1 + 1, if vi = wj }

backtracki,j ← “→”, if si,j = si,j−1
               “↓”, if si,j = si−1,j
               “↘”, if si,j = si−1,j−1 + 1

36 Why did we store the backtracking pointers? [Figure: the completed grid with scores and pointers]

37 What is the optimal alignment path? [Figure: following the pointers back from the sink traces it out]

38 [Figure: backtracking pointers for the Longest Common Subsequence]

39 Using Backtracking Pointers to Compute LCS

OutputLCS(backtrack, v, i, j)
  if i = 0 or j = 0
    return
  if backtracki,j = “→”
    OutputLCS(backtrack, v, i, j−1)
  else if backtracki,j = “↓”
    OutputLCS(backtrack, v, i−1, j)
  else
    OutputLCS(backtrack, v, i−1, j−1)
    output vi
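Putting the last two slides together, a runnable sketch that fills the table, stores pointers, and walks them back (the labels "down"/"right"/"diag" stand in for ↓, →, ↘):

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j], back[i][j] = max(
                (s[i - 1][j], "down"),
                (s[i][j - 1], "right"),
            )
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 >= s[i][j]:
                s[i][j], back[i][j] = s[i - 1][j - 1] + 1, "diag"
    # Walk the pointers back from the sink, collecting matched symbols.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if back[i][j] == "down":
            i -= 1
        elif back[i][j] == "right":
            j -= 1
        else:                      # diagonal pointer: a matched symbol
            out.append(v[i - 1])
            i, j = i - 1, j - 1
    return "".join(reversed(out))

print(lcs("ATGTTATA", "ATCGTCC"))  # → "ATGT"
```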

40 Computing Scores of ALL Predecessors

[Figure: a node whose predecessors' scores are not yet known]

sa = max over ALL predecessors b of node a { sb + weight of edge from b to a }

A Vicious Cycle

[Figure: in a graph with a directed cycle, the nodes on the cycle wait on one another, so their scores can never be computed in any order]

44 In What Order Should We Explore Nodes of the Graph?

sa = maxALL predecessors b of node a{sb+ weight of edge from b to a}

• By the time a node is analyzed, the scores of all its predecessors should already be computed.
• If the graph has a directed cycle, this condition is impossible to satisfy.
• Directed Acyclic Graph (DAG): a graph without directed cycles.

45 Topological Ordering

• Topological Ordering: an ordering of the nodes of a DAG on a line such that all edges go from left to right.

• Theorem: Every DAG has a topological ordering.

46 LongestPath

LongestPath(Graph, source, sink)
  for each node a in Graph
    sa ← −infinity
  ssource ← 0
  topologically order Graph
  for each node a (from source to sink in topological order)
    sa ← max over all predecessors b of node a { sb + weight of edge from b to a }
  return ssink
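A sketch of LongestPath in Python; the DFS-based topological ordering and the toy graph are illustrative choices, not from the slides:

```python
from collections import defaultdict

def longest_path(edges, source, sink):
    """Longest source->sink path length in an edge-weighted DAG.

    edges: list of (u, v, weight) triples.
    """
    preds = defaultdict(list)          # node -> [(predecessor, weight)]
    succs = defaultdict(list)
    nodes = {source, sink}
    for u, v, w in edges:
        preds[v].append((u, w))
        succs[u].append(v)
        nodes |= {u, v}
    # Topological order: reversed depth-first postorder.
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in succs[u]:
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in list(nodes):
        if u not in seen:
            dfs(u)
    order.reverse()
    s = {u: float("-inf") for u in nodes}
    s[source] = 0
    for a in order:                    # predecessors are already final
        for b, w in preds[a]:
            s[a] = max(s[a], s[b] + w)
    return s[sink]

edges = [("src", "a", 1), ("src", "b", 5), ("a", "sink", 10), ("b", "sink", 1)]
print(longest_path(edges, "src", "sink"))  # → 11
```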

47 Mismatches and Indel Penalties

score = #matches − μ·#mismatches − σ·#indels

A T - G T T A T A
A T C G T - C - C
+1 +1 −2 +1 +1 −2 −3 −2 −3 = −8

     A   C   G   T   −              A   C   G   T   −
A   +1  −μ  −μ  −μ  −σ         A   +1  −3  −5  −1  −3
C   −μ  +1  −μ  −μ  −σ         C   −4  +1  −3  −2  −3
G   −μ  −μ  +1  −μ  −σ         G   −9  −7  +1  −1  −3
T   −μ  −μ  −μ  +1  −σ         T   −3  −5  −8  +1  −4
−   −σ  −σ  −σ  −σ             −   −4  −2  −2  −1

Scoring matrix                     Even more general scoring matrix

48 Dynamic Programming Recurrence for the Alignment Graph

si,j = max { si−1,j + weight of edge “↓” into (i,j),
             si,j−1 + weight of edge “→” into (i,j),
             si−1,j−1 + weight of edge “↘” into (i,j) }

49 Dynamic Programming Recurrence for the Alignment Graph

si,j = max { si−1,j − σ,
             si,j−1 − σ,
             si−1,j−1 + 1, if vi = wj,
             si−1,j−1 − μ, if vi ≠ wj }

50 Dynamic Programming Recurrence for the Alignment Graph

si,j = max { si−1,j + score(vi, −),
             si,j−1 + score(−, wj),
             si−1,j−1 + score(vi, wj) }

51 Global Alignment

Global Alignment Problem: Find the highest‐scoring alignment between two strings by using a scoring matrix.

• Input: Strings v and w as well as a matrix score.

• Output: An alignment of v and w whose alignment score (as defined by the scoring matrix score) is maximal among all possible alignments of v and w.
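A sketch of the Global Alignment Problem's score computation with a toy scoring matrix (+1 match, −1 mismatch, indel penalty σ = 1; these values are made up):

```python
def global_alignment_score(v, w, score, sigma):
    """Highest score of a global alignment of v and w.

    score[(x, y)]: match/mismatch score for symbols x, y.
    sigma: indel penalty, subtracted for every gap symbol.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: all gaps in w
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):          # first row: all gaps in v
        s[0][j] = s[0][j - 1] - sigma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma,
                          s[i - 1][j - 1] + score[(v[i - 1], w[j - 1])])
    return s[n][m]

# Toy scoring matrix: +1 for a match, -1 for a mismatch.
score = {(x, y): 1 if x == y else -1 for x in "ACGT" for y in "ACGT"}
print(global_alignment_score("GAT", "GCAT", score, 1))  # → 2
```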

52 Which Alignment is Better?

• Alignment 1: score = 22 (matches) ‐ 20 (indels)=2.

GCC-C-AGT--TATGT-CAGGGGGCACG--A-GCATGCAGA- GCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT

• Alignment 2: score = 17 (matches) ‐ 30 (indels)=‐13. ---G----C-----C--CAGTTATGTCAGGGGGCACGAGCATGCAGA GCCGCCGTCGTTTTCAGCAGTTATGTCAG-----A------T-----

53 Which Alignment is Better?

• Alignment 1: score = 22 (matches) ‐ 20 (indels)=2.

GCC-C-AGT--TATGT-CAGGGGGCACG--A-GCATGCAGA- GCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT

• Alignment 2: score = 17 (matches) ‐ 30 (indels)=‐13. ---G----C-----C--CAGTTATGTCAGGGGGCACGAGCATGCAGA GCCGCCGTCGTTTTCAGCAGTTATGTCAG-----A------T----- local alignment

54 [Figure: the alignment graph of the two strings, with the local alignment path (through the matching region) and the global alignment path drawn]

Local alignment:
−−−G−−−−C−−−−−C−−CAGTTATGTCAGGGGGCACGAGCATGCAGA
GCCGCCGTCGTTTTCAGCAGTTATGTCAG−−−−−A−−−−−−T−−−−−

Global alignment:
GCC−C−AGT−TATGT-CAGGGGGCACG−−A−GCATGCAGA-
GCCGCC−GTCGT-T-TTCAG----CA−GTTATG−T−CAGAT

55 [Figure: the two strings' alignment graph again, unannotated]

56 Local Alignment

Global alignment

57 Local Alignment = Global Alignment in a Subrectangle

Compute a global alignment within each subrectangle to get a local alignment.

58 Local Alignment Problem

Local Alignment Problem: Find the highest‐scoring local alignment between two strings.

• Input: Strings v and w as well as a matrix score.

• Output: Substrings of v and w whose global alignment (as defined by the matrix score) is maximal among all global alignments of all substrings of v and w.

59 Free Taxi Rides!

[Figure: the alignment graph of the two strings]

Global alignment:
GCC−C−AGT−TATGT-CAGGGGGCACG−−A−GCATGCACA-
GCCGCC−GTCGT-T-TTCAG----CA−GTTATG−T−CAGAT

Local alignment:
−−−G−−−−C−−−−−C−−CAGTTATGTCAGGGGGCACGAGCATGCACA
GCCGCCGTCGTTTTCAGCAGTTATGTCAG−−−−−A−−−−−−T−−−−−

60 What Do Free Taxi Rides Mean in Terms of the Alignment Graph?

61 Building Manhattan for the Local Alignment Problem

How many edges have we added?

62 Dynamic Programming for the Local Alignment

si,j = max { weight of edge from (0,0) to (i,j),
             si−1,j + weight of edge “↓” into (i,j),
             si,j−1 + weight of edge “→” into (i,j),
             si−1,j−1 + weight of edge “↘” into (i,j) }

63 Dynamic Programming for the Local Alignment

si,j = max { 0,
             si−1,j + weight of edge “↓” into (i,j),
             si,j−1 + weight of edge “→” into (i,j),
             si−1,j−1 + weight of edge “↘” into (i,j) }
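The same sketch with both "free taxi rides" added: the 0 option (a free ride from the source) and taking the maximum over all nodes (a free ride to the sink). The toy scores are again made up:

```python
def local_alignment_score(v, w, score, sigma):
    """Highest score of a local alignment (Smith-Waterman-style, score only)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,                       # free ride from the source
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma,
                          s[i - 1][j - 1] + score[(v[i - 1], w[j - 1])])
            best = max(best, s[i][j])              # free ride to the sink
    return best

score = {(x, y): 1 if x == y else -1 for x in "ACGT" for y in "ACGT"}
print(local_alignment_score("TTTACGT", "ACGTGGG", score, 1))  # → 4
```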

64 Scoring Gaps

• We previously assigned a fixed penalty σ to each indel.
• However, this fixed penalty may be too severe for a series of 100 consecutive indels.
• A series of k indels often represents a single evolutionary event (a gap) rather than k events:

GATCCAG                   GATCCAG
GA-C-AG                   GA--CAG
two gaps (lower score)    a single gap (higher score)

65 More Adequate Gap Penalties

Affine gap penalty for a gap of length k: σ + ε·(k − 1)

σ: the gap opening penalty
ε: the gap extension penalty
σ > ε, since starting a gap should be penalized more than extending it.
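The affine penalty itself is a one-liner; the σ = 11, ε = 1 values below are toy numbers, not from the slides:

```python
def affine_gap_penalty(k, sigma, epsilon):
    """Penalty for a gap of k consecutive indels: sigma + epsilon * (k - 1)."""
    assert k >= 1
    return sigma + epsilon * (k - 1)

# Opening a gap costs 11; each extension costs only 1.
print(affine_gap_penalty(1, 11, 1))    # → 11
print(affine_gap_penalty(100, 11, 1))  # → 110
```

With a fixed per-indel penalty σ, a 100-indel gap would cost 100·σ = 1100; the affine penalty charges only 110, treating the gap as one event plus cheap extensions.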

66 Modelling Affine Gap Penalties by Long Edges

67 Building Manhattan with Affine Gap Penalties

[Figure: for every possible gap length k we add a long edge of weight σ + ε·(k − 1); e.g. σ + ε·2 for a gap of length 3]

We have just added O(n³) edges to the graph…

68 Building Manhattan on 3 levels

lower level (deletions)
middle level (matches/mismatches)
upper level (insertions)

69 How can we emulate this path in the 3‐level Manhattan?

loweri,j  = max { loweri−1,j − ε,
                  middlei−1,j − σ }

upperi,j  = max { upperi,j−1 − ε,
                  middlei,j−1 − σ }

middlei,j = max { loweri,j,
                  middlei−1,j−1 + score(vi, wj),
                  upperi,j }

(ε extends a gap on the lower or upper level; σ opens one from the middle level.)

70 Middle Column of the Alignment

[Figure: the alignment graph of ACGGAA (columns) and ATTCAA (rows) with its middle column highlighted]

middle column (middle = #columns/2)

71 Middle Node of the Alignment

[Figure: the same graph with the middle node marked]

middle node (a node where an optimal alignment path crosses the middle column)

72 Divide and Conquer Approach to Sequence Alignment

AlignmentPath(source, sink)
  find MiddleNode
  AlignmentPath(source, MiddleNode)
  AlignmentPath(MiddleNode, sink)

[Slides 73–74 build this recursion one call at a time on the ACGGAA × ATTCAA alignment graph.]

The only problem left is how to find this middle node in linear space!

75 Computing Alignment Score in Linear Space

Finding the longest path in the alignment graph requires storing all backtracking pointers –O(nm) memory.

Finding the length of the longest path in the alignment graph does not require storing any backtracking pointers –O(n) memory.
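A sketch of the linear-space score: only the previous column is kept (column recycling), so memory is O(n) while time stays O(nm). Toy scores as before:

```python
def alignment_score_linear_space(v, w, score, sigma):
    """Global alignment score using O(len(v)) memory: keep one column at a time."""
    n, m = len(v), len(w)
    col = [-sigma * i for i in range(n + 1)]       # scores of column j = 0
    for j in range(1, m + 1):
        new = [0] * (n + 1)
        new[0] = -sigma * j
        for i in range(1, n + 1):
            new[i] = max(new[i - 1] - sigma,               # vertical edge
                         col[i] - sigma,                   # horizontal edge
                         col[i - 1] + score[(v[i - 1], w[j - 1])])  # diagonal
        col = new                                  # recycle: drop column j - 1
    return col[n]

score = {(x, y): 1 if x == y else -1 for x in "ACGT" for y in "ACGT"}
print(alignment_score_linear_space("GAT", "GCAT", score, 1))  # → 2
```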

76 Recycling the Columns in the Alignment Graph

      A  C  G  G  A  A
   0  0  0  0  0  0  0
A  0  1  1  1  1  1  1
T  0  1  1  1  1  1  1
T  0  1  1  1  1  1  1
C  0  1  2  2  2  2  2
A  0  1  2  2  2  3  3
A  0  1  2  2  2  3  4

77 Can We Find the Middle Node without Constructing the Longest Path?

[Figure: a 4‐path, i.e. a path that visits the node (4, middle) in the middle column]

i‐path: a longest path among the paths that visit the i‐th node in the middle column

78 Can We Find The Lengths of All i‐paths?

[Figure: length(i), the length of an i‐path, for each node in the middle column: 2, 3, 3, 3, 4, 3, 1; e.g. length(0) = 2 and length(4) = 4]

length(i) = fromSource(i) + toSink(i)

81 Computing fromSource and toSink

[Figure: fromSource(i) is computed by a forward sweep of columns up to the middle column; toSink(i) by a backward sweep from the sink]

82 How Much Time Did It Take to Find the Middle Node?

area/2 + area/2 = area

83 Laughable Progress: O(nm) Time to Find ONE Node!

[Figure: the alignment graph split at the middle node into two subproblems]

Each subproblem can be conquered in time proportional to its area: area/4 + area/4 = area/2

How much time would it take to conquer 2 subproblems?

84 Laughable Progress: O(nm + nm/2) Time to Find THREE Nodes!

Each subproblem can be conquered in time proportional to its area: area/8 + area/8 + area/8 + area/8 = area/4

How much time would it take to conquer 4 subproblems?

85 O(nm + nm/2 + nm/4) Time to Find NEARLY ALL Nodes!

area + area/2 + area/4 + area/8 + area/16 + … < 2·area

How much time would it take to conquer ALL subproblems?

86 Total Time: area + area/2 + area/4 + area/8 + area/16 + …

[Figure: the 1st pass covers the whole area, the 2nd pass 1/2, the 3rd pass 1/4, the 4th pass 1/8, …]

1 + 1/2 + 1/4 + … < 2

87 The Middle Edge

[Figure: the alignment graph of GAGCAATT and ACTTAATT]

Middle Edge: an edge in an optimal alignment path starting at the middle node

88 The Middle Edge Problem

Middle Edge in Linear Space Problem. Find a middle edge in the alignment graph in linear space.

• Input: Two strings and matrix score.

• Output: A middle edge in the alignment graph of these strings (as defined by the matrix score).

[Figures: the middle edge computed for the full GAGCAATT × ACTTAATT problem, then recursively within each half]

91 Recursive LinearSpaceAlignment

LinearSpaceAlignment(top, bottom, left, right)
  if left = right
    return alignment formed by (bottom − top) edges “↓”
  if top = bottom
    return alignment formed by (right − left) edges “→”
  middle ← ⌊(left + right)/2⌋
  midNode ← MiddleNode(top, bottom, left, right)
  midEdge ← MiddleEdge(top, bottom, left, right)
  LinearSpaceAlignment(top, midNode, left, middle)
  output midEdge
  if midEdge = “→” or midEdge = “↘”
    middle ← middle + 1
  if midEdge = “↓” or midEdge = “↘”
    midNode ← midNode + 1
  LinearSpaceAlignment(midNode, bottom, middle, right)

92 Generalizing Pairwise to Multiple Alignment

• Alignment of 2 sequences is a 2‐row matrix.
• Alignment of 3 sequences is a 3‐row matrix:

A T - G C G -
A - C G T - A
A T C A C - A

• Our scoring function should score alignments with conserved columns higher.

93 Alignments = Paths in 3‐D

• Alignment of ATGC, AATC, and ATGC

A - T G C      0 1 1 2 3 4   ← #symbols used up to a given position
A A T - C      0 1 2 3 3 4
- A T G C      0 0 1 2 3 4

The same alignment as a path in 3‐D:
(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

95 2‐D Alignment Cell versus 3‐D Alignment Cell

In 2‐D, node (i,j) has 3 predecessors; in 3‐D, node (i,j,k) has 7:
(i−1,j,k), (i,j−1,k), (i,j,k−1), (i−1,j−1,k), (i−1,j,k−1), (i,j−1,k−1), (i−1,j−1,k−1)

96 Multiple Alignment: Dynamic Programming

 si1, j1,k 1  vi , w j ,uk     si1, j1,k  vi , w j ,  s  v ,,u  i1, j,k 1 i k   si, j,k  maxsi, j1,k 1  , w j ,uk  s  v ,,  i1, j,k i s  , w ,  i, j1,k j si, j,k 1  ,,uk

• (x, y, z) is an entry in the 3‐D scoring matrix.

97 Multiple Alignment: Running Time

• For 3 sequences of length n, the run time is proportional to 7n³

• For a k‐way alignment, build a k‐dimensional Manhattan graph with
  – n^k nodes
  – most nodes have 2^k − 1 incoming edges
  – Runtime: O(2^k · n^k)
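A brute-force sketch of the 3-way recurrence; the column score (+1 per matching non-gap pair) is a made-up choice, not the slides' scoring function:

```python
from itertools import product

def score3(x, y, z):
    """Toy column score: +1 per matching pair of non-gap symbols."""
    pairs = [(x, y), (x, z), (y, z)]
    return sum(1 for a, b in pairs if a != "-" and a == b)

def msa3_score(v, w, u):
    """Score of the best 3-way alignment via the 7-predecessor recurrence."""
    n1, n2, n3 = len(v), len(w), len(u)
    s = {}
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if (i, j, k) == (0, 0, 0):
            s[i, j, k] = 0
            continue
        best = float("-inf")
        # each predecessor steps back by 0 or 1 in every dimension (not all 0)
        for di, dj, dk in product((0, 1), repeat=3):
            if (di, dj, dk) == (0, 0, 0) or di > i or dj > j or dk > k:
                continue
            col = (v[i - 1] if di else "-",
                   w[j - 1] if dj else "-",
                   u[k - 1] if dk else "-")
            best = max(best, s[i - di, j - dj, k - dk] + score3(*col))
        s[i, j, k] = best
    return s[n1, n2, n3]

print(msa3_score("ATGC", "AATC", "ATGC"))  # → 10
```

Each of the n³ cells inspects its 2³ − 1 = 7 predecessors, matching the O(7n³) bound above.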

98 Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments:

AC-GCGG-C
AC-GC-GAG
GCCGC-GAG

ACGCGG-C      AC-GCGG-C      AC-GCGAG
ACGC-GAG      GCCGC-GAG      GCCGCGAG

99 Idea: Construct Multiple from Pairwise Alignments

Given a set of arbitrary pairwise alignments, can we construct a multiple alignment that induces them?

AAAATTTT------AAAATTTT
TTTTGGGG------TTTTGGGG
GGGGAAAA------GGGGAAAA

100 Aligning Profile Against Profile

• In the past we were aligning a sequence against a sequence.
  – Can we align a sequence against a profile?
  – Can we align a profile against a profile?

- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

A   0  1  0  0  0 0 1  0 0 .8  0  0  0 0
C  .6  0  0  0  1 0 0 .4 1  0 .6 .2  0 0
G   0  0  1 .2  0 0 0  0 0 .2  0  0 .4 1
T  .2  0  0  0  0 1 0 .6 0  0  0  0 .2 0
-  .2  0  0 .8  0 0 0  0 0  0 .4 .8 .4 0

101 Multiple Alignment: Greedy Approach

• Choose the most similar sequences and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k − 2 sequences and 1 profile.
• Iterate

102 Greedy Approach: Example • Sequences: GATTCA, GTCTGA, GATATT, GTCAGC.

• 6 pairwise alignments (premium for match +1, penalties for indels and mismatches −1):

s2 GTCTGA            s1 GATTCA--
s4 GTCAGC            s4 G-T-CAGC
(score = 2)          (score = 0)

s1 GAT-TCA           s2 G-TCTGA
s2 G-TCTGA           s3 GATAT-T
(score = 1)          (score = -1)

s1 GAT-TCA           s3 GAT-ATT
s3 GATAT-T           s4 G-TCAGC
(score = 1)          (score = -1)

103 Greedy Approach: Example

• Since s2 and s4 are closest, we consolidate them into a profile:

s2   GTCTGA
s4   GTCAGC     →   s2,4 = GTC(T/A)G(A/C)

• New set of 3 sequences to align:

s1   GATTCA
s3   GATATT
s2,4 GTC(T/A)G(A/C)

104 Other alphabets

• Amino acid space (n=20) • Codon space (n=64)

105 Scoring Matrices for Amino Acid Sequences

Y (Tyr) often mutates into F (score +7) but rarely mutates into P (score ‐5)


106 Summary

• Pairwise alignment can be performed with dynamic programming
• Multiple alignment often uses heuristics
• How best to choose the scoring matrix?
• Current work in this area focuses on aligning coding sequences at the codon (3 bases) level
  – Complicated due to:
    • Insertions and deletions
    • Different reading frames

107 Phylogeny Outline

• Transforming Distance Matrices into Evolutionary Trees
• Toward an Algorithm for Distance‐Based Phylogeny Construction
• Additive Phylogeny
• Using Least‐Squares to Construct Distance‐Based Phylogenies
• Ultrametric Evolutionary Trees
• The Neighbor‐Joining Algorithm
• Character‐Based Reconstruction
• The Small Parsimony Problem
• The Large Parsimony Problem
• Back to the alignment: progressive alignment

108 Phylogeny

• A phylogenetic tree represents inferred evolutionary relationships among various biological species or other entities, based upon similarities and differences in their physical or genetic characteristics
• The taxa joined together in the tree are implied to have descended from a common ancestor

109 What do we use phylogenies for?

• Inferring relationships between taxa
• Identifying common ancestors
• With additional information, we can infer when different taxa diverged

110 Phylogenetic tree of MERS‐Coronavirus in humans and camels

(Dudas et al., 2017)

111 Methods for reconstructing phylogenies

• Distance
• Parsimony
• Likelihood

112 Constructing a Distance Matrix

Di,j = number of differing symbols between i-th and j-th rows of a multiple alignment.

SPECIES   ALIGNMENT       DISTANCE MATRIX
                          Chimp  Human  Seal  Whale
Chimp     ACGTAGGCCT        0      3     6     4
Human     ATGTAAGACT        3      0     7     5
Seal      TCGAGAGCAC        6      7     0     2
Whale     TCGAAAGCAT        4      5     2     0

How else could we form a distance matrix?

Aside: more realistic models of nucleotide distance
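A sketch of building the distance matrix from the aligned rows above:

```python
def distance_matrix(seqs):
    """Hamming-style distance matrix from equal-length aligned rows.

    seqs: dict mapping species name -> aligned sequence.
    Returns a dict keyed by (name_a, name_b).
    """
    names = list(seqs)
    D = {}
    for a in names:
        for b in names:
            # count positions where the two rows differ
            D[a, b] = sum(x != y for x, y in zip(seqs[a], seqs[b]))
    return D

seqs = {"Chimp": "ACGTAGGCCT", "Human": "ATGTAAGACT",
        "Seal":  "TCGAGAGCAC", "Whale": "TCGAAAGCAT"}
D = distance_matrix(seqs)
print(D["Chimp", "Human"], D["Seal", "Whale"])  # → 3 2
```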

• Counting the number of differences may not be a good measure of divergence time
  – Base frequencies are skewed
  – The rates at which one base substitutes for another differ
  – See ‘Models_of_DNA_evolution’ in Wikipedia
• What about:
  – Insertions and deletions?
  – Ambiguous nucleotides?

116 Trees

Tree: Connected graph containing no cycles.

Leaves (degree = 1): present-day species

Internal nodes (degree > 1): ancestral species

Trees

Most Recent Ancestor

TIME

Present Day

Rooted tree: one node is designated as the root (most recent common ancestor) Distance-Based Phylogeny

Distance-Based Phylogeny Problem: Construct an evolutionary tree from a distance matrix. • Input: A distance matrix. • Output: The unrooted tree “fitting” this distance matrix. Fitting a Tree to a Matrix

        Chimp  Human  Seal  Whale
Chimp     0      3     6     4
Human     3      0     7     5
Seal      6      7     0     2
Whale     4      5     2     0

[Tree fitting this matrix: Chimp (limb 1) and Human (limb 2) join one internal node; Seal (limb 2) and Whale (limb 0) join the other; the internal edge has length 3.]

Distance-Based Phylogeny Problem: Construct an evolutionary tree from a distance matrix. • Input: A distance matrix. • Output: The unrooted tree fitting this distance matrix.

Now is this problem well-defined? Return to Distance-Based Phylogeny

Exercise Break: Try fitting a tree to the following matrix.

   i  j  k  l
i  0  3  4  3
j  3  0  4  5
k  4  4  0  2
l  3  5  2  0

No Tree Fits a Matrix

Additive matrix: a distance matrix such that there exists an unrooted tree fitting it.

More Than One Tree Fits a Matrix

[The same Chimp/Human/Seal/Whale matrix is fit by the simple tree above, and also by a tree with degree-2 nodes, e.g. splitting the Seal limb into 0.5 + 1.5 and the Human limb into 1 + 1.]

Which Tree is “Better”?

[Figure: the simple tree fitting the matrix]

Simple tree: tree with no nodes of degree 2.

Theorem: There is a unique simple tree fitting an additive matrix.

Reformulating Distance-Based Phylogeny

Distance-Based Phylogeny Problem: Construct an evolutionary tree from a distance matrix. • Input: A distance matrix. • Output: The simple tree fitting this distance matrix (if this matrix is additive). An Idea for Distance-Based Phylogeny

[The matrix and its tree again: Seal and Whale, the closest leaves, are neighbours in the tree.]

An Idea for Distance-Based Phylogeny

Seal and whale are neighbours (meaning they share the same parent).

Theorem: Every simple tree with at least two nodes has at least one pair of neighbouring leaves.

An Idea for Distance-Based Phylogeny

If we merge neighbours Seal and Whale into their parent, how do we compute the unknown limb distances?

Toward a Recursive Algorithm

For leaves i, j, k attached to internal node m:

di,k = di,m + dk,m
di,j = di,m + dj,m
dj,k = dj,m + dk,m

dk,m = [(di,m + dk,m) + (dj,m + dk,m) − (di,m + dj,m)] / 2
dk,m = (di,k + dj,k − di,j) / 2
dk,m = (Di,k + Dj,k − Di,j) / 2

di,m = Di,k − (Di,k + Dj,k − Di,j) / 2
di,m = (Di,k + Di,j − Dj,k) / 2
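Checking the derived formula on the slides' numbers (the dict-of-pairs matrix is this sketch's own representation):

```python
def limb_to_parent(D, i, j, k):
    """d(i, m) for leaves i, j attached to parent m, using any third leaf k."""
    return (D[i, k] + D[i, j] - D[j, k]) / 2

# Distances from the Chimp/Human/Seal/Whale matrix
D = {("Seal", "Chimp"): 6, ("Seal", "Whale"): 2, ("Whale", "Chimp"): 4}
print(limb_to_parent(D, "Seal", "Whale", "Chimp"))  # → 2.0
```

This matches the slides: Seal's limb to the parent it shares with Whale has length 2.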

Using di,m = (Di,k + Di,j − Dj,k) / 2:

dSeal,m = (DSeal,Chimp + DSeal,Whale − DWhale,Chimp) / 2 = (6 + 2 − 4) / 2 = 2
dWhale,m = (DWhale,Chimp + DWhale,Seal − DSeal,Chimp) / 2 = (4 + 2 − 6) / 2 = 0

An Idea for Distance-Based Phylogeny

Add the parent m to the matrix (DChimp,m = 4, DHuman,m = 5), then remove the Seal and Whale rows and columns:

        Chimp  Human  m
Chimp     0      3    4
Human     3      0    5
m         4      5    0

An Idea for Distance-Based Phylogeny

Chimp and Human are now neighbours; attach them to a new internal node a:

dChimp,a = (DChimp,m + DChimp,Human − DHuman,m) / 2 = (4 + 3 − 5) / 2 = 1
dHuman,a = (DHuman,m + DHuman,Chimp − DChimp,m) / 2 = (5 + 3 − 4) / 2 = 2
da,m = DChimp,m − dChimp,a = 4 − 1 = 3

This recovers the tree: Chimp limb 1, Human limb 2, internal edge a–m 3, Seal limb 2, Whale limb 0.

An Idea for Distance-Based Phylogeny

Exercise Break: Apply this recursive approach to the distance matrix below.

     i   j   k   l
i    0  13  21  22
j   13   0  12  13
k   21  12   0  13
l   22  13  13   0

What Was Wrong With Our Algorithm?

[The tree fitting this matrix has limb lengths i = 11, j = 2, k = 6, l = 7 and internal edge 4.]

The minimum element of the matrix is Dj,k = 12, yet j and k are not neighbours!

From Neighbours to Limbs

Rather than trying to find neighbours, let’s instead try to compute the length of limbs, the edges attached to leaves.

[The same tree with all limb lengths unknown.]

From Neighbours to Limbs

di,k = di,m + dk,m;  di,j = di,m + dj,m;  dj,k = dj,m + dk,m

dk,m = (Di,k + Dj,k − Di,j) / 2
di,m = (Di,k + Di,j − Dj,k) / 2

These formulas assume that i and j are neighbours.

Computing Limb Lengths

Limb Length Theorem: LimbLength(i) is equal to the minimum value of (Di,k + Di,j – Dj,k)/2 over all leaves j and k.
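A sketch for the Code Challenge, directly applying the Limb Length Theorem (the dict-keyed matrix is this sketch's own representation):

```python
from itertools import combinations

def limb_length(D, leaves, i):
    """LimbLength(i): min over pairs of other leaves j, k of (D[i,j] + D[i,k] - D[j,k]) / 2."""
    others = [x for x in leaves if x != i]
    return min((D[i, j] + D[i, k] - D[j, k]) / 2
               for j, k in combinations(others, 2))

leaves = ["Chimp", "Human", "Seal", "Whale"]
M = {"Chimp": {"Chimp": 0, "Human": 3, "Seal": 6, "Whale": 4},
     "Human": {"Chimp": 3, "Human": 0, "Seal": 7, "Whale": 5},
     "Seal":  {"Chimp": 6, "Human": 7, "Seal": 0, "Whale": 2},
     "Whale": {"Chimp": 4, "Human": 5, "Seal": 2, "Whale": 0}}
D = {(a, b): M[a][b] for a in leaves for b in leaves}
print(limb_length(D, leaves, "Chimp"))  # → 1.0
```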

Limb Length Problem: Compute the length of a limb in the simple tree fitting an additive distance matrix. • Input: An additive distance matrix D and an integer j. • Output: The length of the limb connecting leaf j to its parent, LimbLength(j). Code Challenge: Solve the Limb Length Problem. Computing Limb Lengths

Limb Length Theorem: LimbLength(chimp) is

equal to the minimum value of (Dchimp,k + Dchimp,j – Dchimp,k)/2 over all leaves j and k.

Chimp Human Seal Whale Chimp 0364 Human 3075 Seal 6702 Whale 4520

(Dchimp, human + Dchimp, seal –Dhuman, seal) / 2 = (3 + 6 – 7) / 2 = 1 Computing Limb Lengths

Limb Length Theorem: LimbLength(chimp) is

equal to the minimum value of (Dchimp,k + Dchimp,j – Dchimp,k)/2 over all leaves j and k.

Chimp Human Seal Whale Chimp 0364 Human 3075 Seal 6702 Whale 4520

(Dchimp, human + Dchimp, seal –Dhuman, seal) / 2 = (3 + 6 – 7) / 2 = 1

(Dchimp, human + Dchimp, whale –Dhuman, whale) / 2 = (3 + 4 – 5) / 2 = 1 Computing Limb Lengths

Limb Length Theorem: LimbLength(chimp) is

equal to the minimum value of (Dchimp,k + Dchimp,j – Dchimp,k)/2 over all leaves j and k.

Chimp Human Seal Whale Chimp 0364 Human 3075 Seal 6702 Whale 4520

(Dchimp, human + Dchimp, seal –Dhuman, seal) / 2 = (3 + 6 – 7) / 2 = 1

(Dchimp, human + Dchimp, whale –Dhuman, whale) / 2 = (3 + 4 – 5) / 2 = 1

(Dchimp whale + Dchimp seal – Dwhale seal)/2 = (6 + 4 – 2) / 2 = 4 Computing Limb Lengths

Limb Length Theorem: LimbLength(chimp) is

equal to the minimum value of (Dchimp,k + Dchimp,j – Dchimp,k)/2 over all leaves j and k.

Chimp Human Seal Whale Chimp 0364 Human 3075 Seal 6702 Whale 4520

(Dhuman, chimp + Dchimp, seal –Dhuman, seal) / 2 = (3 + 6 – 7) / 2 = 1

(Dhuman, chimp + Dchimp, whale –Dhuman, whale) / 2 = (3 + 4 – 5) / 2 = 1

(Dwhale chimp + Dchimp seal – Dwhale seal)/2 = (6 + 4 – 2) / 2 = 4 Computing Limb Lengths

Limb Length Theorem: LimbLength(chimp) is equal to the minimum value of (Dchimp,k + Dchimp,j – Dchimp,k)/2 over all leaves j and k.

Chimp Human Seal Whale Chimp 0364 Human 3075 Seal 6702 Whale 4520

Chimp Seal 1 2 3

2 0 Human Whale AdditivePhylogeny In Action

        i    j    k    l
i       0   13   21   22
j      13    0   12   13
k      21   12    0   13
l      22   13   13    0

(TREE(D): leaves i, j, k and l attach by limbs of length 11, 2, 6 and 7, with an internal edge of length 4.) AdditivePhylogeny In Action

        i    j    k    l
i       0   13   21   22
j      13    0   12   13
k      21   12    0   13
l      22   13   13    0

1. Pick an arbitrary leaf j. AdditivePhylogeny In Action

        i    j    k    l
i       0   13   21   22
j      13    0   12   13
k      21   12    0   13
l      22   13   13    0

LimbLength(j) = 2

2. Compute its limb length, LimbLength(j). AdditivePhylogeny In Action

Dbald:
        i    j    k    l
i       0   11   21   22
j      11    0   10   11
k      21   10    0   13
l      22   11   13    0

(TREE(Dbald): as TREE(D), but the limb of j now has length 0.) AdditivePhylogeny In Action

3. Subtract LimbLength(j) from each row and column to produce Dbald in which j is a bald limb (length 0). AdditivePhylogeny In Action

Dtrim:
        i    k    l
i       0   21   22
k      21    0   13
l      22   13    0

4. Remove the j-th row and column of the matrix to form the (n –1) x (n – 1) matrix Dtrim. AdditivePhylogeny In Action

Dtrim:
        i    k    l
i       0   21   22
k      21    0   13
l      22   13    0

(TREE(Dtrim): leaves i, k and l attach to a single internal node by limbs of length 15, 6 and 7.) AdditivePhylogeny In Action

5. Construct Tree(Dtrim). AdditivePhylogeny In Action

Dbald:
        i    j    k    l
i       0   11   21   22
j      11    0   10   11
k      21   10    0   13
l      22   11   13    0

(TREE(Dbald): j attaches at distance 11 from i on the path from i to k.) AdditivePhylogeny In Action

6. Identify the point in Tree(Dtrim) where leaf j should be attached. AdditivePhylogeny In Action

        i    j    k    l
i       0   13   21   22
j      13    0   12   13
k      21   12    0   13
l      22   13   13    0

(TREE(D): limbs i = 11, j = 2, k = 6, l = 7; internal edge of length 4.) AdditivePhylogeny In Action

LimbLength(j) = 2

7. Attach j by an edge of length LimbLength(j) in order to form Tree(D). AdditivePhylogeny

AdditivePhylogeny(D):
1. Pick an arbitrary leaf j.
2. Compute its limb length, LimbLength(j).
3. Subtract LimbLength(j) from each row and column to produce Dbald, in which j is a bald limb (length 0).
4. Remove the j-th row and column of the matrix to form the (n – 1) x (n – 1) matrix Dtrim.
5. Construct Tree(Dtrim).
6. Identify the point in Tree(Dtrim) where leaf j should be attached.
7. Attach j by an edge of length LimbLength(j) in order to form Tree(D). AdditivePhylogeny

Attaching a Limb

Dbald:
        i    j    k    l
i       0   11   21   22
j      11    0   10   11
k      21   10    0   13
l      22   11   13    0

Limb Length Theorem: the length of the limb of j is equal to the minimum value of (Dbald i,j + Dbald j,k – Dbald i,k)/2 over all leaves i and k. Attaching a Limb

Since the limb of j in Tree(Dbald) has length 0, there must be leaves i and k with

(Dbald i,j + Dbald j,k – Dbald i,k)/2 = 0, i.e., Dbald i,j + Dbald j,k = Dbald i,k. Attaching a Limb

Dbald:
        i    j    k    l
i       0   11   21   22
j      11    0   10   11
k      21   10    0   13
l      22   11   13    0

The attachment point for j is found on the path between leaves i and k at distance Dbald i,j from i.

Dbald i,j + Dbald j,k = Dbald i,k

AdditivePhylogeny(D):
1. Pick an arbitrary leaf j.
2. Compute its limb length, LimbLength(j).
3. Subtract LimbLength(j) from each row and column to produce Dbald, in which j is a bald limb (length 0).
4. Remove the j-th row and column of the matrix to form the (n – 1) x (n – 1) matrix Dtrim.
5. Construct Tree(Dtrim).
6. Identify the point in Tree(Dtrim) where leaf j should be attached.
7. Attach j by an edge of length LimbLength(j) in order to form Tree(D).

Code Challenge: Implement AdditivePhylogeny. Sum of Squared Errors

Discrepancy(T, D) = Σ1≤i<j≤n (di,j(T) – Di,j)²

(Tree T: leaves i and j attach by limbs of length 1.5, leaves k and l by limbs of length 1, with an internal edge of length 1.5.)

D:                        d (distances in T):
     i  j  k  l                i  j  k  l
i    0  3  4  3           i    0  3  4  4
j    3  0  4  5           j    3  0  4  4
k    4  4  0  2           k    4  4  0  2
l    3  5  2  0           l    4  4  2  0

Discrepancy(T, D) = (4 – 3)² + (4 – 5)² = 2 Sum of Squared Errors

Exercise Break: Assign lengths to edges in T in order to minimize Discrepancy(T, D).
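Under the definition above, the sum of squared errors is a one-liner; the sketch below assumes both matrices are given as lists of lists (names are illustrative, not from the course).

```python
def discrepancy(d, D):
    """Sum of squared errors between tree distances d and data matrix D, over i < j."""
    n = len(D)
    return sum((d[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))
```

On the matrices above it returns 2, matching the two unit errors at (i, l) and (j, l).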

(Tree T with unknown edge lengths.)

D:                        d:
     i  j  k  l                i  j  k  l
i    0  3  4  3           i    0  ?  ?  ?
j    3  0  4  5           j    ?  0  ?  ?
k    4  4  0  2           k    ?  ?  0  ?
l    3  5  2  0           l    ?  ?  ?  0

Least-Squares Phylogeny

Least-Squares Distance-Based Phylogeny Problem: Given a distance matrix, find the tree that minimizes the sum of squared errors. • Input: An n x n distance matrix D. • Output: A weighted tree T with n leaves minimizing Discrepancy(T, D) over all weighted trees with n leaves.

Unfortunately, this problem is NP-Complete... Ultrametric Trees

Rooted binary tree: an unrooted binary tree with a root (of degree 2) placed on one of its edges. Edge weights correspond to the difference in ages of the nodes that the edge connects. Ultrametric tree: the distance from the root to any leaf is the same (i.e., the age of the root).

(Figure: an ultrametric tree over eight species, with internal nodes labeled by their ages.)

Squirrel Baboon Orangutan Gorilla Chimpanzee Bonobo Human Monkey Ultrametric Trees

Ultrametric tree: the distance from the root to any leaf is the same (i.e., the age of the root).

2 6 2 2

Squirrel Baboon Orangutan Gorilla Chimpanzee Bonobo Human Monkey UPGMA: A Clustering Heuristic

1. Form a cluster for each present-day species, each containing a single leaf.

     i  j  k  l
i    0  3  4  3
j    3  0  4  5
k    4  4  0  2
l    3  5  2  0

Clusters: {i}, {j}, {k}, {l} UPGMA: A Clustering Heuristic

2. Find the two closest clusters C1 and C2 according to the average distance

Davg(C1, C2) = Σi in C1, j in C2 Di,j / (|C1| · |C2|) where |C| denotes the number of elements in C.

     i  j  k  l
i    0  3  4  3
j    3  0  4  5
k    4  4  0  2
l    3  5  2  0

The closest clusters are {k} and {l} (distance 2). UPGMA: A Clustering Heuristic

3. Merge C1 and C2 into a single cluster C.

     i  j  k  l
i    0  3  4  3
j    3  0  4  5
k    4  4  0  2
l    3  5  2  0

Clusters: {i}, {j}, {k, l} UPGMA: A Clustering Heuristic

4. Form a new node for C and connect to C1 and C2 by an edge. Set the age of C as Davg(C1, C2)/2.

The new node for {k, l} has age Davg({k}, {l})/2 = 1 and connects to k and l by edges of length 1. UPGMA: A Clustering Heuristic

5. Update the distance matrix by computing the average distance between each pair of clusters.

          i     j    {k, l}
i         0     3     3.5
j         3     0     4.5
{k, l}   3.5   4.5     0

UPGMA: A Clustering Heuristic

6. Iterate until a single cluster contains all species.

          i     j    {k, l}
i         0     3     3.5
j         3     0     4.5
{k, l}   3.5   4.5     0

The closest clusters are now {i} and {j} (distance 3); the new node {i, j} has age 1.5. UPGMA: A Clustering Heuristic

6. Iterate until a single cluster contains all species.

          {i, j}   {k, l}
{i, j}      0        4
{k, l}      4        0

UPGMA: A Clustering Heuristic

6. Iterate until a single cluster contains all species.

          {i, j}   {k, l}
{i, j}      0        4
{k, l}      4        0

The final cluster {i, j, k, l} has age 4/2 = 2. UPGMA: A Clustering Heuristic

6. Iterate until a single cluster contains all species.

(The resulting ultrametric tree: root at age 2, internal nodes {i, j} at age 1.5 and {k, l} at age 1, leaves at age 0.) UPGMA: A Clustering Heuristic

UPGMA(D):
1. Form a cluster for each present-day species, each containing a single leaf.
2. Find the two closest clusters C1 and C2 according to the average distance
   Davg(C1, C2) = Σi in C1, j in C2 Di,j / (|C1| · |C2|)
   where |C| denotes the number of elements in C.
3. Merge C1 and C2 into a single cluster C.
4. Form a new node for C and connect to C1 and C2 by an edge. Set the age of C as Davg(C1, C2)/2.
5. Update the distance matrix by computing the average distance between each pair of clusters.
6. Iterate steps 2-5 until a single cluster contains all species. UPGMA Doesn't "Fit" a Tree to a Matrix
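The UPGMA steps can be sketched as follows; this is a minimal illustration that only tracks cluster ages (not the tree edges), and `upgma_ages` is a name chosen here, not from the course materials.

```python
def upgma_ages(D):
    """Run UPGMA on matrix D; return the age of every cluster it forms."""
    n = len(D)
    clusters = [frozenset([i]) for i in range(n)]
    ages = {c: 0.0 for c in clusters}  # present-day species have age 0

    def d_avg(c1, c2):
        # average pairwise distance between members of the two clusters
        return sum(D[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # find the two closest clusters under the average distance
        c1, c2 = min(((a, b) for a in clusters for b in clusters if a != b),
                     key=lambda pair: d_avg(*pair))
        merged = c1 | c2
        ages[merged] = d_avg(c1, c2) / 2  # age of the new internal node
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return ages
```

On the i, j, k, l matrix above (rows 0..3), it reproduces the ages from the slides: {k, l} at 1, {i, j} at 1.5, and the root at 2.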

     i  j  k  l
i    0  3  4  3
j    3  0  4  5
k    4  4  0  2
l    3  5  2  0

In the UPGMA tree (root age 2, internal nodes at ages 1.5 and 1), the tree distance between j and l is 4, while Dj,l = 5: the tree does not fit the matrix. In Summary...

• AdditivePhylogeny:
  – good: produces the tree fitting an additive matrix
  – bad: fails completely on a non-additive matrix
• UPGMA:
  – good: produces a tree for any matrix
  – bad: tree doesn't necessarily fit an additive matrix
• ?????:
  – good: produces the tree fitting an additive matrix
  – good: provides a heuristic for a non-additive matrix

Neighbour-Joining Theorem

Given an n x n distance matrix D, its neighbour- joining matrix is the matrix D* defined as

D*i,j = (n –2)Di,j – TotalDistanceD(i) – TotalDistanceD(j) where TotalDistanceD(i) is the sum of distances from i to all other leaves.

D:                      TotalDistanceD      D*:
     i   j   k   l                               i    j    k    l
i    0  13  21  22           56             i    0  -68  -60  -60
j   13   0  12  13           38             j  -68    0  -60  -60
k   21  12   0  13           46             k  -60  -60    0  -68
l   22  13  13   0           48             l  -60  -60  -68    0

Neighbour-Joining Theorem
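As a quick sketch (the function name is mine, not the course's), the neighbour-joining matrix follows directly from the definition:

```python
def nj_matrix(D):
    """D*[i][j] = (n - 2) * D[i][j] - TotalDistance(i) - TotalDistance(j), 0 on the diagonal."""
    n = len(D)
    total = [sum(row) for row in D]  # TotalDistanceD for each leaf
    return [[0 if i == j else (n - 2) * D[i][j] - total[i] - total[j]
             for j in range(n)] for i in range(n)]
```

On the matrix above, it reproduces the entries shown (e.g., D*i,j = 2·13 – 56 – 38 = –68).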

Neighbor-Joining Theorem: If D is additive, then the smallest element of D* corresponds to neighbouring leaves in Tree(D).

D:                      TotalDistanceD      D*:
     i   j   k   l                               i    j    k    l
i    0  13  21  22           56             i    0  -68  -60  -60
j   13   0  12  13           38             j  -68    0  -60  -60
k   21  12   0  13           46             k  -60  -60    0  -68
l   22  13  13   0           48             l  -60  -60  -68    0

Neighbour-Joining in Action

D*:                     TotalDistanceD
     i    j    k    l
i    0  -68  -60  -60        56
j  -68    0  -60  -60        38
k  -60  -60    0  -68        46
l  -60  -60  -68    0        48

1. Construct neighbour-joining matrix D* from D. Neighbour-Joining in Action

D*:
     i    j    k    l
i    0  -68  -60  -60
j  -68    0  -60  -60
k  -60  -60    0  -68
l  -60  -60  -68    0

A minimum element is D*i,j = –68.

2. Find a minimum element D*i,j of D*. Neighbour-Joining in Action


∆i,j = (TotalDistanceD(i) – TotalDistanceD(j)) / (n – 2) = (56 – 38) / (4 – 2) = 9

3. Compute ∆i,j = (TotalDistanceD(i) – TotalDistanceD(j)) / (n –2). Neighbour-Joining in Action

D:
     i   j   k   l        TotalDistanceD(i) = 56, TotalDistanceD(j) = 38
i    0  13  21  22
j   13   0  12  13        ∆i,j = (56 – 38) / (4 – 2) = 9
k   21  12   0  13
l   22  13  13   0

LimbLength(i) = ½(13 + 9) = 11
LimbLength(j) = ½(13 – 9) = 2

4. Set LimbLength(i) equal to ½(Di,j + ∆i,j) and LimbLength(j) equal to ½(Di,j – ∆i,j). Neighbour-Joining in Action

D':
     m   k   l        TotalDistanceD'
m    0  10  11             21
k   10   0  13             23
l   11  13   0             24

5. Form a matrix D’ by removing i-th and j-th row/column from D and adding an m-th row/column such that for any k, Dk,m = (Di,k + Dj,k – Di,j) / 2. Flashback: Computation of dk,m

Leaves i and j are neighbours attached to node m, with k any other leaf:

di,k = di,m + dk,m
di,j = di,m + dj,m
dj,k = dj,m + dk,m

dk,m = [(di,m + dk,m) + (dj,m + dk,m) – (di,m + dj,m)] / 2
dk,m = (di,k + dj,k – di,j) / 2
dk,m = (Di,k + Dj,k – Di,j) / 2 Neighbour-Joining in Action

i k mk l Tree(D’) 6 m 01011 m 4 D’ k 10 0 13 l 11 13 0 7 j l

6. Apply NeighbourJoining to D’ to obtain Tree(D’). Neighbour-Joining in Action

(Tree(D): reattaching i by a limb of length 11 and j by a limb of length 2 at node m yields limbs i = 11, j = 2, k = 6, l = 7 and an internal edge of length 4.)

LimbLength(i) = ½(13 + 9) = 11
LimbLength(j) = ½(13 – 9) = 2

7. Reattach limbs of i and j to obtain Tree(D). Neighbour-Joining

NeighbourJoining(D):
1. Construct the neighbour-joining matrix D* from D.
2. Find a minimum element D*i,j of D*.
3. Compute ∆i,j = (TotalDistanceD(i) – TotalDistanceD(j)) / (n – 2).
4. Set LimbLength(i) equal to ½(Di,j + ∆i,j) and LimbLength(j) equal to ½(Di,j – ∆i,j).
5. Form a matrix D' by removing the i-th and j-th rows/columns from D and adding an m-th row/column such that for any k, Dk,m = (Dk,i + Dk,j – Di,j) / 2.
6. Apply NeighbourJoining to D' to obtain Tree(D').
7. Reattach limbs of i and j to obtain Tree(D).

Code Challenge: Implement NeighbourJoining. Neighbour-Joining
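A recursive sketch of the seven steps, assuming D is additive. Representing internal nodes as tuples (so each recursion level gets a fresh node identity) is an implementation choice made here, not part of the course pseudocode.

```python
def neighbour_joining(D, labels):
    """Recursive neighbour joining; returns a list of (node, node, length) edges."""
    n = len(D)
    if n == 2:
        return [(labels[0], labels[1], D[0][1])]
    total = [sum(row) for row in D]
    # steps 1-2: minimum element of the neighbour-joining matrix D*
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: (n - 2) * D[p[0]][p[1]] - total[p[0]] - total[p[1]])
    # steps 3-4: limb lengths of the chosen neighbours
    delta = (total[i] - total[j]) / (n - 2)
    limb_i = (D[i][j] + delta) / 2
    limb_j = (D[i][j] - delta) / 2
    m = ('m', labels[i], labels[j])          # fresh internal node
    # step 5: shrink the matrix, replacing i and j by m
    rest = [k for k in range(n) if k not in (i, j)]
    new_labels = [m] + [labels[k] for k in rest]
    size = len(rest) + 1
    new_D = [[0.0] * size for _ in range(size)]
    for a, k in enumerate(rest):
        dkm = (D[i][k] + D[j][k] - D[i][j]) / 2
        new_D[0][a + 1] = new_D[a + 1][0] = dkm
        for b, l in enumerate(rest):
            new_D[a + 1][b + 1] = D[k][l]
    # steps 6-7: recurse, then reattach the limbs of i and j at m
    edges = neighbour_joining(new_D, new_labels)
    edges.append((labels[i], m, limb_i))
    edges.append((labels[j], m, limb_j))
    return edges
```

On the i, j, k, l matrix from the slides it recovers the edge lengths 11, 2, 6, 7 and 4.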

Exercise Break: Find the tree returned by NeighborJoining on the following non-additive matrix. How does the result compare with the tree produced by UPGMA?

D:
     i  j  k  l
i    0  3  4  3
j    3  0  4  5
k    4  4  0  2
l    3  5  2  0

(UPGMA tree shown for comparison: root age 2, internal nodes at ages 1.5 and 1.) Current work

• What if some taxa are ancestors of others?
  – 'Family joining' https://doi.org/10.1093/molbev/msw123
• What if sequences are sampled at several points in time?
  – Serial UPGMA https://doi.org/10.1093/oxfordjournals.molbev.a026281
  – Least squares dating (adjusts branch lengths of an existing tree) https://doi.org/10.1093/sysbio/syv068

208 Weakness of Distance-Based Methods

Distance-based algorithms for evolutionary tree reconstruction say nothing about ancestral states at internal nodes.

We lost information when we converted a multiple alignment to a distance matrix...

SPECIES   ALIGNMENT     DISTANCE MATRIX (Chimp, Human, Seal, Whale)
Chimp     ACGTAGGCCT    0 3 6 4
Human     ATGTAAGACT    3 0 7 5
Seal      TCGAGAGCAC    6 7 0 2
Whale     TCGAAAGCAT    4 5 2 0

An Alignment As a Character Table

SPECIES ALIGNMENT

Chimp ACGTAGGCCT Human ATGTAAGACT n species Seal TCGAGAGCAC Whale TCGAAAGCAT

m characters Toward a Computational Problem

Chimp ACGTAGGCCT Human ATGTAAGACT n species Seal TCGAGAGCAC Whale TCGAAAGCAT

m characters Toward a Computational Problem

Chimp ACGTAGGCCT Human ATGTAAGACT Seal TCGAGAGCAC Whale TCGAAAGCAT

??????????

?????????? ??????????

ACGTAGGCCT ATGTAAGACT TCGAGAGCAC TCGAAAGCAT Chimp Human Seal Whale Toward a Computational Problem

ACGAAAGCCT

ACGTAAGCCT TCGAAAGCAT

ACGTAGGCCT ATGTAAGACT TCGAGAGCAC TCGAAAGCAT Chimp Human Seal Whale Toward a Computational Problem

Parsimony score: sum of Hamming distances along each edge.

ACGAAAGCCT 1 2

ACGTAAGCCT TCGAAAGCAT

1 2 2 0

ACGTAGGCCT ATGTAAGACT TCGAGAGCAC TCGAAAGCAT Chimp Human Seal Whale Toward a Computational Problem

Parsimony score: sum of Hamming distances along each edge.

Parsimony Score: 8

ACGAAAGCCT 1 2

ACGTAAGCCT TCGAAAGCAT

1 2 2 0

ACGTAGGCCT ATGTAAGACT TCGAGAGCAC TCGAAAGCAT Chimp Human Seal Whale Toward a Computational Problem
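The parsimony score of a fully labeled tree is just a sum of Hamming distances along its edges; a minimal sketch (the edge-list and dict representations are choices made here, not the course's):

```python
def hamming(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

def parsimony_score(edges, label):
    """Sum of Hamming distances along the edges of a labeled tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)
```

On the labeled tree above (internal nodes ACGAAAGCCT, ACGTAAGCCT and TCGAAAGCAT), it returns 8.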

Small Parsimony Problem: Find the most parsimonious labeling of the internal nodes of a rooted tree. • Input: A rooted binary tree with each leaf labeled by a string of length m. • Output: A labeling of all other nodes of the tree by strings of length m that minimizes the tree’s parsimony score. Toward a Computational Problem

Small Parsimony Problem: Find the most parsimonious labeling of the internal nodes of a rooted tree. • Input: A rooted binary tree with each leaf labeled by a string of length m. • Output: A labeling of all other nodes of the tree by strings of length m that minimizes the tree’s parsimony score.

Is there any way we can simplify this problem statement? Toward a Computational Problem

Small Parsimony Problem: Find the most parsimonious labeling of the internal nodes of a rooted tree. • Input: A rooted binary tree with each leaf labeled by a single symbol. • Output: A labeling of all other nodes of the tree by single symbols that minimizes the tree’s parsimony score. Toward a Computational Problem

ACGAAAGCCT

ACGTAAGCCT TCGAAAGCAT

ACGTAGGCCT ATGTAAGACT TCGAGAGCAC TCGAAAGCAT Chimp Human Seal Whale A Dynamic Programming Algorithm

Let Tv denote the subtree of T whose root is v.

Tv

Define sk(v) as the minimum parsimony score of Tv over all labelings of Tv, assuming that v is labeled by k.

The minimum parsimony score for the tree is equal to the minimum value of sk(root) over all symbols k. A Dynamic Programming Algorithm

For symbols i and j, define Tv • δi,j = 0 if i = j

• δi,j = 1 otherwise.

Exercise Break: Prove the following recurrence relation:

sk(v) = minall symbols i {si(Daughter(v)) + δi,k} + minall symbols j {sj(Son(v)) + δj,k} A Dynamic Programming Algorithm

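The recurrence can be turned into a short recursive sketch for a single character. The dict-based tree encoding, the function name and the explicit root argument are choices made here for illustration, not the course's reference solution.

```python
def small_parsimony(tree, leaf_char, root='root', alphabet='ACGT'):
    """Minimum parsimony score for one character on a rooted binary tree.

    tree maps each internal node to its (daughter, son) pair;
    leaf_char maps each leaf to its symbol.  Implements
    s_k(v) = min_i {s_i(daughter) + delta(i,k)} + min_j {s_j(son) + delta(j,k)}.
    """
    def scores(v):
        if v in leaf_char:  # leaf: score 0 for its own symbol, infinity otherwise
            return {k: (0 if k == leaf_char[v] else float('inf')) for k in alphabet}
        left, right = tree[v]
        sd, ss = scores(left), scores(right)
        return {k: min(sd[i] + (i != k) for i in alphabet) +
                   min(ss[j] + (j != k) for j in alphabet)
                for k in alphabet}
    return min(scores(root).values())
```

Summing the per-column scores over the 10 columns of the Chimp/Human/Seal/Whale alignment on the tree ((Chimp, Human), (Seal, Whale)) gives the parsimony score 8 from the earlier slide.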

Exercise Break: “Backtrack” to fill in the remaining nodes of the tree. A Dynamic Programming Algorithm

Code Challenge: Solve the Small Parsimony Problem.

(Coronavirus tree with leaves: Cow, Pig, Horse, Mouse, Palm Civet, Human, Turkey, Dog, Cat.)

Exercise Break: Apply SmallParsimony to this tree to reconstruct ancestral coronavirus sequences. Small Parsimony for Unrooted Trees

Small Parsimony in an Unrooted Tree Problem: Find the most parsimonious labeling of the internal nodes of an unrooted tree. • Input: An unrooted binary tree with each leaf labeled by a string of length m. • Output: A position of the root and a labeling of all other nodes of the tree by strings of length m that minimizes the tree’s parsimony score.

Code Challenge: Solve this problem. Finding the Most Parsimonious Tree

ACGAAAGCCT 1 2

ACGTAAGCCT TCGAAAGCAT

1 2 2 0

ACGTAGGCCT ATGTAAGACT TCGAGAGCAC TCGAAAGCAT Chimp Human Seal Whale

Parsimony Score: 8 Finding the Most Parsimonious Tree

                ACGTAAGCAT
             0              0
     ACGTAAGCAT          ACGTAAGCAT
     2        4          3        2
ACGTAGGCCT TCGAGAGCAC ATGTAAGACT TCGAAAGCAT
Chimp      Seal       Human      Whale

Parsimony Score: 11 Finding the Most Parsimonious Tree

                ACGTAAGCCT
             1              2
     ACGTAAGCCT          ACGTAAGCCT
     1        3          2        5
ACGTAGGCCT TCGAAAGCAT ATGTAAGACT TCGAGAGCAC
Chimp      Whale      Human      Seal

Parsimony Score: 14 Finding the Most Parsimonious Tree

Large Parsimony Problem: Given a set of strings, find a tree (with leaves labeled by all these strings) having minimum parsimony score. • Input: A collection of strings of equal length. • Output: A rooted binary tree T that minimizes the parsimony score among all possible rooted binary trees with leaves labeled by these strings. Finding the Most Parsimonious Tree

Large Parsimony Problem: Given a set of strings, find a tree (with leaves labeled by all these strings) having minimum parsimony score. • Input: A collection of strings of equal length. • Output: A rooted binary tree T that minimizes the parsimony score among all possible rooted binary trees with leaves labeled by these strings.

Unfortunately, this problem is NP-Complete... A Greedy Heuristic for Large Parsimony

Note that removing an internal edge, an edge connecting two internal nodes (along with the nodes), produces four subtrees (W, X, Y, Z).

W w y Y

a b

X x z Z A Greedy Heuristic for Large Parsimony


Rearranging these subtrees is called a nearest neighbor interchange. A Greedy Heuristic for Large Parsimony

Nearest Neighbors of a Tree Problem: Given an edge in a binary tree, generate the two neighbors of this tree. • Input: An internal edge in a binary tree. • Output: The two nearest neighbors of this tree (for the given internal edge).

Code Challenge: Solve this problem. A Greedy Heuristic for Large Parsimony
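A sketch of the Nearest Neighbors of a Tree Problem on an adjacency-list representation (the representation and names are assumptions made here): given internal edge (a, b) with subtrees w, x under a and y, z under b, swapping x with y and x with z yields the two neighbours.

```python
def nearest_neighbours(adj, a, b):
    """The two nearest-neighbour interchanges across internal edge (a, b).

    adj maps each node to a list of its neighbours; a and b are internal
    nodes of degree 3 joined by an edge.
    """
    w, x = [n for n in adj[a] if n != b]  # subtrees hanging off a
    y, z = [n for n in adj[b] if n != a]  # subtrees hanging off b
    trees = []
    for p, q in ((x, y), (x, z)):  # swap subtree p (under a) with q (under b)
        new = {node: list(nbrs) for node, nbrs in adj.items()}
        new[a][new[a].index(p)] = q
        new[b][new[b].index(q)] = p
        new[p][new[p].index(a)] = b
        new[q][new[q].index(b)] = a
        trees.append(new)
    return trees
```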

Nearest Neighbor Interchange Heuristic:
1. Set the current tree equal to an arbitrary binary rooted tree structure.
2. Go through all internal edges and perform all possible nearest neighbor interchanges.
3. Solve the Small Parsimony Problem on each tree.
4. If any tree has a parsimony score improving over the optimal tree, set it equal to the current tree. Otherwise, return the current tree.

Code Challenge: Implement the nearest-neighbor interchange heuristic. Current uses of parsimony

• Parsimony has generally fallen out of favour relative to model-based approaches
• Uses:
  – Inference of 'difficult' states, e.g. indels
  – Heuristic for guiding iterative tree search

241 Back to alignment: progressive alignment

Progressive alignment methods are heuristic in nature. They produce multiple alignments from a number of pairwise alignments. Perhaps the most widely used algorithm of this type is CLUSTALW. Progressive Alignment

Clustalw: 1. Given N sequences, align each sequence against each other. 2. Use the score of the pairwise alignments to compute a distance matrix. 3. Build a guide tree (tree shows the best order of progressive alignment). 4. Progressive Alignment guided by the tree. Progressive Alignment

Not all the pairwise alignments build well into a multiple sequence alignment (compare the alignments on the left and right). Progressive Alignment

The progressive alignment builds a final alignment by merging sub-alignments (bottom to top) with a guide tree. Progressive Alignment

Small section (3 columns) of the alignment of 4 sequences

Implementation: http://www.ebi.ac.uk/Tools/msa/ Genome Sequencing Outline

• What Is Genome Sequencing? • Exploding Newspapers • The String Reconstruction Problem • String Reconstruction as a Hamiltonian Path Problem • String Reconstruction as an Eulerian Path Problem • Similar Problems with Different Fates • De Bruijn Graphs • Euler's Theorem • Assembling Read‐Pairs

247 Next Generation Sequencing Technologies

• Late 2000s: The market for new sequencing machines takes off. – Illumina reduces the cost of sequencing a human genome from $3 billion to $10,000. – Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month. – Beijing Genomics Institute orders hundreds of sequencing machines, becoming the world's largest sequencing center.

248 Why Do We Sequence Personal Genomes? • 2010: Nicholas Volker became the first human being to be saved by genome sequencing. – Doctors could not diagnose his condition; he went through dozens of surgeries. – Sequencing revealed a rare mutation in a XIAP gene linked to a defect in his immune system. – This led doctors to use immunotherapy, which saved the child. • Different people have slightly different genomes: on average, roughly 1 mutation in 1000 nucleotides.

249 The Newspaper Problem

250 The Newspaper Problem as an Overlapping Puzzle

251 The Newspaper Problem as an Overlapping Puzzle

252 Multiple Copies of a Genome (Millions of them)

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC

Breaking the Genomes at Random Positions

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGGATCAGCTACCACATCGTAGC

253 Generating “Reads”

CTGATGA TGGACTACGCTAC TACTGCTAG CTGTATTACG ATCAGCTACCACA TCGTAGCTACG ATGCATTAGCAA GCTATCGGA TCAGCTACCA CATCGTAGC CTGATGATG GACTACGCT ACTACTGCTA GCTGTATTACG ATCAGCTACC ACATCGTAGCT ACGATGCATTA GCAAGCTATC GGATCAGCTAC CACATCGTAGC CTGATGATGG ACTACGCTAC TACTGCTAGCT GTATTACGATC AGCTACCAC ATCGTAGCTACG ATGCATTAGCA AGCTATCGG A TCAGCTACCA CATCGTAGC CTGATGATGGACT ACGCTACTACT GCTAGCTGTAT TACGATCAGC TACCACATCGT AGCTACGATGCA TTAGCAAGCT ATCGGATCA GCTACCACATC GTAGC

“Burning” Some Reads

CTGATGA TGGACTACGCTAC TACTGCTAG CTGTATTACG ATCAGCTACCACA TCGTAGCTACG ATGCATTAGCAA GCTATCGGA TCAGCTACCA CATCGTAGC CTGATGATG GACTACGCT ACTACTGCTA GCTGTATTACG ATCAGCTACC ACATCGTAGCT ACGATGCATTA GCAAGCTATC GGATCAGCTAC CACATCGTAGC CTGATGATGG ACTACGCTAC TACTGCTAGCT GTATTACGATC AGCTACCAC ATCGTAGCTACG ATGCATTAGCA AGCTATCGG A TCAGCTACCA CATCGTAGC CTGATGATGGACT ACGCTACTACT GCTAGCTGTAT TACGATCAGC TACCACATCGT AGCTACGATGCA TTAGCAAGCT ATCGGATCA GCTACCACATC GTAGC

254 No Idea What Position Every Read Comes From

CTGATGATGGACT

GCTGTATTACG

GCTATCGGA ATGCATTAGCAA

ACTACTGCTA TACCACATCGT CTGATGATGG

255 From Experimental to Computational Challenges

Multiple (unsequenced) genome copies

Read generation

Reads

Genome assembly

Assembled genome …GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…

256 What Makes Genome Sequencing Difficult? • Modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end (like we read a book) – Progress in this area via Nanopore and PacBio • They can only shred the genome and generate short reads. • The genome assembly is not the same as a jigsaw puzzle: we must use overlapping reads to reconstruct the genome, a giant overlap puzzle!

Genome Sequencing Problem. Reconstruct a genome from reads. • Input. A collection of strings Reads. • Output. A string Genome reconstructed from Reads.

257 What Is k‐mer Composition?

Composition3(TAATGCCATGGGATGTT)= TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

258 k‐mer Composition

Composition3(TAATGCCATGGGATGTT)= TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT = AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

e.g., lexicographic order (like in a dictionary)
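The k-mer composition is easy to compute when the genome is known; a one-line sketch (the function name is illustrative):

```python
def composition(k, text):
    """The k-mer composition of text, in lexicographic order."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))
```

`composition(3, 'TAATGCCATGGGATGTT')` returns the 15 sorted 3-mers listed above.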

259 Reconstructing a String from its Composition

String Reconstruction Problem. Reconstruct a string from its k‐mer composition.

• Input. A collection of k‐mers.

• Output. A Genome such that Compositionk(Genome) is equal to the collection of k‐mers.

260 A Naive String Reconstruction Approach

ATG ATG ATG CAT CCA GAT GCC GGAGGG GTT TGC TGG TGT

TAA AAT

ATG ATG CAT CCA GAT GCC GGAGGG TGC TGG

TAA AAT ATG TGT GTT 261 Representing a Genome as a Path

Composition3(TAATGCCATGGGATGTT)=

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Can we construct this genome path without knowing the genome TAATGCCATGGGATGTT, only from its composition?

Yes. We simply need to connect k‐mer1 with k‐mer2 if suffix(k‐mer1)=prefix(k‐mer2). E.g. TAA → AAT
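Spelling the genome back out of such a path of overlapping k-mers can be sketched as (the function name is illustrative):

```python
def path_to_genome(path):
    """Spell the string along a path of k-mers that overlap by k - 1 symbols."""
    return path[0] + ''.join(kmer[-1] for kmer in path[1:])
```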

262 A Path Turns into a Graph

TAATGCCATGGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Yes. We simply need to connect k‐mer1 with k‐mer2 if suffix(k‐mer1)=prefix(k‐mer2). E.g. TAA → AAT

263 A Path Turns into a Graph

TAATGCCATGGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

Can we still find the genome path in this graph?

264 Where Is the Genomic Path?

A Hamiltonian path: a path that visits each node in a graph exactly once. TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

Nodes are arranged from left to right in lexicographic order. What are we trying to find in this graph?

Does This Graph Have a Hamiltonian Path? Hamiltonian Path Problem. Find a Hamiltonian path in a graph. Input. A graph. Output. A path visiting every node in the graph exactly once.

(Figure: a 20-node undirected graph from William Hamilton's Icosian game, 1857.)

266 TAATGGGATGCCATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

267 A Slightly Different Path

TAATGCCATGGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

3-mers as nodes

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

3-mers as edges

How do we label the starting and ending nodes of an edge?

For the edge TAA: its starting node is labeled by its prefix TA, and its ending node by its suffix AA.

268 Labeling Nodes in the New Path

TAATGCCATGGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

3-mers as nodes

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

3-mers as edges and 2-mers as nodes

269 Labeling Nodes in the New Path

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

3-mers as edges and 2-mers as nodes

270 Gluing Identically Labeled Nodes

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

CC CCA GCC CA GC CAT TGC

AT ATG TG TAA AAT TGG GGG GGA GAT ATG TGT GTT TA AA AT ATG TG GG GG GA AT TG GT TT

271 Gluing Identically Labeled Nodes

CC TAATGCCATGGGATGTT CCA GCC CA GC CAT TGC

AT ATG TG TAA ATG TA AA AT TG GT TT TGT GTT AAT ATG AT TG GAT TGG GA GG GGG GGA GG

272 Gluing Identically Labeled Nodes

CC TAATGCCATGGGATGTT CCA GCC CA GC CAT TGC

ATG TG TAA AT ATG TA AA AT TG GT TT TGT GTT AAT AT ATG TG GAT TGG GA GG GGG GGA GG

273 Gluing Identically Labeled Nodes

CC TAATGCCATGGGATGTT CCA GCC CA GC TGC CAT ATG TG TAA AAT TA AA AT TG GT TT ATG TGT GTT ATG GAT TG TGG GA GG GGG GGA GG

274 Gluing Identically Labeled Nodes

CC TAATGCCATGGGATGTT CCA GCC CA GC TGC CAT ATG TAA AAT TG TA AA AT TG GT TT ATG TGT GTT ATG TG GAT TGG GA GG GGG GGA GG

275 De Bruijn Graph of TAATGCCATGGGATGTT

(The de Bruijn graph of TAATGCCATGGGATGTT.) Where is the Genome hiding in this graph?

276 It Was Always There!

TAATGCCATGGGATGTT

(The same de Bruijn graph.) What are we trying to find in this graph? An Eulerian path in a graph is a path that visits each edge exactly once.

277 Eulerian Path Problem Eulerian Path Problem. Find an Eulerian path in a graph.

• Input. A graph.

• Output. A path visiting every edge in the graph exactly once.
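The Eulerian Path Problem can be solved in linear time; below is a sketch of Hierholzer's algorithm, one standard approach (the representation, node -> list of successors in a directed multigraph, is an assumption made here):

```python
from collections import defaultdict

def eulerian_path(graph):
    """Hierholzer's algorithm on a directed multigraph (node -> successor list)."""
    out = {u: list(vs) for u, vs in graph.items()}  # copy so we can consume edges
    indeg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            indeg[v] += 1
    # start where out-degree exceeds in-degree by one (or anywhere, for a cycle)
    start = next((u for u in out if len(out[u]) - indeg[u] == 1), next(iter(out)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out.get(u):                    # unused edge remains: follow it
            stack.append(out[u].pop())
        else:                             # dead end: emit node in reverse order
            path.append(stack.pop())
    return path[::-1]
```

Applied to the de Bruijn graph of the 3-mer composition of TAATGCCATGGGATGTT, it returns an Eulerian path spelling a genome with exactly that composition.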

278 Eulerian Versus Hamiltonian Paths Eulerian Path Problem. Find an Eulerian path in a graph.

• Input. A graph.

• Output. A path visiting every edge in the graph exactly once. Hamiltonian Path Problem. Find a Hamiltonian path in a graph.

• Input. A graph.

• Output. A path visiting every node in the graph exactly once.

Find a difference!

279 What Problem Would You Prefer to Solve?

Hamiltonian Path Problem Eulerian Path Problem

While Euler solved the Eulerian Path Problem (even for a city with a million bridges), nobody has developed a fast algorithm for the Hamiltonian Path Problem yet.

NP‐Complete Problems • The Hamiltonian Path Problem belongs to a collection containing thousands of computational problems, the NP‐Complete problems, for which no fast algorithms are known. Whether or not NP‐Complete problems can be solved efficiently is one of the seven Millennium Problems in mathematics.

NP‐Complete problems are all equivalent: find an efficient solution to one, and you have an efficient solution to them all. 281 Eulerian Path Problem Eulerian Path Problem. Find an Eulerian path in a graph.

• Input. A graph.

• Output. A path visiting every edge in the graph exactly once.

We constructed the de Bruijn graph from Genome, but in reality, Genome is unknown!

282 What We Have Done: From Genome to de Bruijn Graph TAATGCCATGGGATGTT

CC CCA GCC CA GC TGC CAT TAA AAT ATG TA AA AT TG GT TT ATG TGT GTT ATG GAT TGG GA GG GGA GGG

283 What We Want: From Reads (k‐mers) to Genome TAATGCCATGGGATGTT

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

284 What We will Show: From Reads to de Bruijn Graph to Genome

TAATGCCATGGGATGTT

CC CCA GCC CA GC TGC CAT TAA AAT ATG TA AA AT TG GT TT ATG TGT GTT ATG GAT TGG GA GG GGA GGG

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

285 Constructing de Bruijn Graph when Genome Is Known

TAATGCCATGGGATGTT

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

286 Constructing de Bruijn when Genome Is Unknown

TAA ATG GCC CAT TGG GGA ATG GTT

AAT TGC CCA ATG GGG GAT TGT

Composition3(TAATGCCATGGGATGTT)

287 Representing Composition as a Graph Consisting of Isolated Edges

TAA ATG GCC CAT TGG GGA ATG GTT

AAT TGC CCA ATG GGG GAT TGT

Composition3(TAATGCCATGGGATGTT)

288 Constructing de Bruijn Graph from k‐mer Composition

TAA ATG GCC CAT TGG GGA ATG GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

AAT TGC CCA ATG GGG GAT TGT AA AT TG GC CC CA AT TG GG GG GA AT TG GT

Composition3(TAATGCCATGGGATGTT)

289 Gluing Identically Labeled Nodes

TAA ATG GCC CAT TGG GGA ATG GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

AA TGC CCA ATG GGG GAT TGT AT TG GC CC CA AT TG GG GG GA AT TG GT

290 TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

TGT GT

291 We Are Not Done with Gluing Yet

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

292 Gluing Identically Labeled Nodes

TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT

CC CCA GCC CA GC CAT TGC

AT ATG TG TAA AAT TGG GGG GGA GAT ATG TGT GTT TA AA AT ATG TG GG GG GA AT TG GT TT

293 Gluing Identically Labeled Nodes

[Figure: intermediate graph, now spelling TAATGCCATGGGATGTT along a path]

294 [Figure: continued gluing]

295 Gluing Identically Labeled Nodes

[Figure: the resulting de Bruijn graph]

296 The Same de Bruijn Graph: DeBruijn(Genome) = DeBruijn(GenomeComposition)

[Figure: the graph built from the genome and the graph built from its composition are identical]

297 Constructing de Bruijn Graph

De Bruijn graph of a collection of k‐mers: – Represent every k‐mer as an edge between its prefix and suffix – Glue ALL nodes with identical labels.

DeBruijn(k-mers)
    form a node for each (k-1)-mer from k-mers
    for each k-mer in k-mers
        connect its prefix node with its suffix node by an edge
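The pseudocode above can be rendered in Python (an illustrative sketch, not from the slides; using the (k-1)-mer strings as dictionary keys glues identically labeled nodes automatically):

```python
from collections import defaultdict

def de_bruijn(kmers):
    """De Bruijn graph of a k-mer collection: each k-mer becomes an
    edge from its prefix (k-1)-mer to its suffix (k-1)-mer; nodes
    with identical labels are glued by sharing the same dict key."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# 3-mer composition of TAATGCCATGGGATGTT (15 reads, with repeats)
kmers = ["AAT", "ATG", "ATG", "ATG", "CAT", "CCA", "GAT", "GCC",
         "GGA", "GGG", "GTT", "TAA", "TGC", "TGG", "TGT"]
graph = de_bruijn(kmers)
# node AT keeps three parallel edges, one per copy of ATG
```

Note that repeated k-mers yield parallel edges, which is exactly what the Eulerian-path formulation needs.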

298 From Hamilton to Euler to de Bruijn

Universal String Problem (Nicolaas de Bruijn, 1946). Find a circular string containing each binary k‐mer exactly once. 000 001 010 011 100 101 110 111


299 From Hamilton to Euler to de Bruijn

Universal String Problem (Nicolaas de Bruijn, 1946). Find a circular string containing each binary k‐mer exactly once. 000 001 010 011 100 101 110 111

[Figure: de Bruijn graph on the four nodes 00, 01, 10, 11, with the eight binary 3-mers 000, 001, 010, 011, 100, 101, 110, 111 as edges, each connecting its prefix to its suffix]

300 From Hamilton to Euler to de Bruijn


301 De Bruijn Graph for 4‐Universal String

Does it have an Eulerian cycle? If yes, how can we find it?

302 Eulerian Cycle Problem

Eulerian Cycle Problem. Find an Eulerian cycle in a graph.

• Input. A graph.

• Output. A cycle visiting every edge in the graph exactly once.

303 A Graph is Eulerian if It Contains an Eulerian Cycle.

Is this graph Eulerian?

304 A Graph is Eulerian if It Contains an Eulerian Cycle.

Is this graph Eulerian? 1 in, 2 out

A graph is balanced if indegree = outdegree for each node

305 Euler’s Theorem

• Every Eulerian graph is balanced • Every balanced* graph is Eulerian

(*) and strongly connected, of course! 306 Recruiting an Ant to Prove Euler’s Theorem

Let an ant randomly walk through the graph. The ant cannot use the same edge twice!

307 If Ant Was a Genius…

“Yay! Now can I go home please?”

308 A Less Intelligent Ant Would Randomly Choose a Node and Start Walking… Can it get stuck? In what node?

309 The Ant Has Completed a Cycle BUT has not Proven Euler’s theorem yet…

The constructed cycle is not Eulerian. Can we enlarge it?

310 Let’s Start at a Different Node in the Green Cycle

Let’s start at a node with still unexplored edges.

“Why should I start at a different node? Backtracking? I’m not evolved to walk backwards! And what difference does it make???”

311 An Ant Traversing Previously Constructed Cycle Starting at a node that has an unused edge, traverse the already constructed (green cycle) and return back to the starting node.


“Why do I have to walk along the same cycle again??? Can I see something new?”

312 I Returned Back BUT… I Can Continue Walking! Starting at a node that has an unused edge, traverse the already constructed (green cycle) and return back to the starting node.

After completing the cycle, start random exploration of still untraversed edges in the graph.


313 Stuck Again!

No Eulerian cycle yet… can we enlarge the green‐blue cycle?

The ant should walk along the constructed cycle starting at yet another node. Which one?


314 I Returned Back BUT… I Can Continue Walking!

“Hmm, maybe these instructions were not that stupid…”


315 I Proved Euler’s Theorem!

EulerianCycle(BalancedGraph)
    form a Cycle by randomly walking in BalancedGraph (avoiding already visited edges)
    while Cycle is not Eulerian
        select a node newStart in Cycle with still unexplored outgoing edges
        form a Cycle’ by traversing Cycle from newStart and randomly walking afterwards
        Cycle ← Cycle’
    return Cycle
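The ant's strategy is essentially Hierholzer's algorithm. A compact stack-based Python sketch (names are mine; the graph is assumed balanced and strongly connected):

```python
def eulerian_cycle(graph):
    """Eulerian cycle in a balanced, strongly connected directed graph
    given as {node: list of successors}: walk until stuck, back up to
    a node with unused outgoing edges, and splice in new sub-cycles."""
    unused = {node: list(edges) for node, edges in graph.items()}
    stack, cycle = [next(iter(graph))], []
    while stack:
        node = stack[-1]
        if unused.get(node):
            stack.append(unused[node].pop())  # keep walking
        else:
            cycle.append(stack.pop())         # stuck: record and back up
    return cycle[::-1]

# balanced graph on the nodes 00, 01, 10, 11 (binary 3-mers as edges)
g = {"00": ["00", "01"], "01": ["10", "11"],
     "10": ["00", "01"], "11": ["10", "11"]}
cycle = eulerian_cycle(g)  # 8 edges, so the cycle lists 9 nodes
```

Spelling the edge labels along this cycle yields a 3-universal circular string.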

316 From Reads to de Bruijn Graph to Genome

TAATGCCATGGGATGTT

CC CCA GCC CA GC TGC CAT TAA AAT ATG TA AA AT TG GT TT ATG TGT GTT ATG GAT TGG GA GG GGA GGG

AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

317 Multiple Eulerian Paths

TAATGCCATGGGATGTT vs. TAATGGGATGCCATGTT

[Figure: two Eulerian paths in the same de Bruijn graph spell two different genomes]

318 Breaking Genome into Contigs

TAATGCCATGGGATGTT

[Figure: the de Bruijn graph broken into contigs, e.g. TAAT, TGCCAT, TGG, GGG, GGGAT, TGTT]

319 DNA Sequencing with Read‐pairs

Multiple identical copies of genome

Randomly cut genomes into large equally sized fragments of size InsertLength

Generate read-pairs: two reads (200 bp each in the figure) from the ends of each fragment, separated by a fixed distance

320 From k‐mers to Paired k‐mers

Read 1 Read 2

Genome: ...ATCAGATTACGTTCCGAG…

A paired k-mer is a pair of k-mers at a fixed distance d apart in Genome. E.g. TCA and TCC are at distance d = 11 apart.

Disclaimers: 1. In reality, Read1 and Read2 are typically sampled from different strands: (→ ……. ← rather than → ……. →) 2. In reality, the distance d between reads is measured with errors.

321 What is PairedComposition(TAATGCCATGGGATGTT)?

Representing a paired 3-mer such as (TAA|GCC) as a 2-line expression: TAA on the top line, GCC on the bottom line.

Paired 3-mers of TAATGCCATGGGATGTT in genome order (top line = first read, bottom line = second read):
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

322 PairedComposition(TAATGCCATGGGATGTT)

In genome order:
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA
GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT

In lexicographic order:
AAT ATG ATG CAT CCA GCC GGA GGG TAA TGC TGG
CCA CAT GAT GGA GGG TGG GTT TGT GCC ATG ATG

Representing PairedComposition in lexicographic order
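For illustration (not part of the slides), PairedComposition can be computed directly when the genome is known; here d is taken as the gap between the two reads (conventions for "distance" vary, as the disclaimers above note):

```python
def paired_composition(text, k, d):
    """All paired k-mers of text, where the second read starts d
    positions after the first read ends, in lexicographic order."""
    pairs = [(text[i:i + k], text[i + k + d:i + 2 * k + d])
             for i in range(len(text) - 2 * k - d + 1)]
    return sorted(pairs)

pc = paired_composition("TAATGCCATGGGATGTT", 3, 1)
# 11 paired 3-mers, starting with (AAT|CCA) and containing (TAA|GCC)
```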

323 String Reconstruction from Read‐Pairs Problem

String Reconstruction from Read-Pairs Problem. Reconstruct a string from its paired k-mers.
• Input. A collection of paired k-mers.
• Output. A string Text such that PairedComposition(Text) is equal to the collection of paired k-mers.

How Would de Bruijn Assemble Paired k‐mers?

324 Representing Genome TAATGCCATGGGATGTT as a Path

[Figure: a path whose edges are the paired 3-mers in genome order]

The paired prefix of (CCA|GGG) is (CC|GG); the paired suffix of (CCA|GGG) is (CA|GG).

325 Labeling Nodes by Paired Prefixes and Suffixes

[Figure: the path with each node labeled by the paired prefix or paired suffix of its adjacent paired 3-mers]

326 Glue nodes with identical labels

[Figure: gluing identically labeled nodes of the path]

327 Glue nodes with identical labels

[Figure: the paired de Bruijn graph of the genome]

Paired de Bruijn Graph from the Genome

328 Constructing Paired de Bruijn Graph

[Figure: paired 3-mers drawn as isolated edges between their paired prefixes and suffixes]

The paired prefix of (CCA|GGG) is (CC|GG); the paired suffix of (CCA|GGG) is (CA|GG).

329 Constructing Paired de Bruijn Graph

[Figure: the isolated paired edges, with identically labeled nodes being glued]

• Paired de Bruijn graph for a collection of paired k‐mers: – Represent every paired k‐mer as an edge between its paired prefix and paired suffix. – Glue ALL nodes with identical labels. 330 Constructing Paired de Bruijn Graph

[Figure: the graph after gluing]

We Are Not Done with Gluing Yet

[Figure: identically labeled nodes remain to be glued]

331 Constructing Paired de Bruijn Graph

[Figure: the paired de Bruijn graph constructed from read-pairs]

Paired de Bruijn Graph from read‐pairs

• Paired de Bruijn graph for a collection of paired k‐mers: – Represent every paired k‐mer as an edge between its paired prefix and paired suffix. – Glue ALL nodes with identical labels.

332 Which Graph Represents a Better Assembly? Unique genome reconstruction Multiple genome reconstructions

TAATGCCATGGGATGTT TAATGCCATGGGATGTT

TAATGGGATGCCATGTT

[Figure: the paired de Bruijn graph (left) has a unique Eulerian path, while the ordinary de Bruijn graph (right) has several]

Paired de Bruijn Graph vs. De Bruijn Graph

333 Some Ridiculously Unrealistic Assumptions

• Perfect coverage of genome by reads (every k‐mer from the genome is represented by a read)

• Reads are error‐free.

• Multiplicities of k‐mers are known

• Distances between reads within read‐pairs are exact.

334 Some Ridiculously Unrealistic Assumptions • Imperfect coverage of genome by reads (every k‐ mer from the genome is represented by a read)

• Reads are error‐prone.

• Multiplicities of k‐mers are unknown.

• Distances between reads within read‐pairs are inexact.

• Etc., etc., etc.

335 1st Unrealistic Assumption: Perfect Coverage atgccgtatggacaacgact atgccgtatg gccgtatgga gtatggacaa gacaacgact

250‐nucleotide reads generated by Illumina technology capture only a small fraction of 250‐ mers from the genome, thus violating the key assumption of the de Bruijn graphs.

336 Breaking Reads into Shorter k‐mers

[Figure: the genome atgccgtatggacaacgact and its four 10-nucleotide reads, each broken into 5-mers: atgcc, tgccg, gccgt, ccgta, cgtat, gtatg, tatgg, atgga, tggac, ggaca, gacaa, acaac, caacg, aacga, acgac, cgact]
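The read-breaking step above can be sketched as (illustrative):

```python
def break_reads(reads, k):
    """Replace each read by all of its k-mers: shorter k-mers give
    better coverage, at the price of a more tangled de Bruijn graph."""
    kmers = []
    for read in reads:
        kmers.extend(read[i:i + k] for i in range(len(read) - k + 1))
    return kmers

reads = ["atgccgtatg", "gccgtatgga", "gtatggacaa", "gacaacgact"]
five_mers = break_reads(reads, 5)  # 4 reads x 6 5-mers each
```

For this example, the 24 resulting 5-mers cover every 5-mer of the genome, restoring (for k = 5) the perfect-coverage assumption.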

337 2nd Unrealistic Assumption: Error‐free Reads

[Figure: the same reads plus an erroneous read cgtaCggaca (change of t into C), which contributes the erroneous 5-mers cgtaC, gtaCg, taCgg, aCgga, Cggac]

338 De Bruijn Graph of ATGGCGTGCAATG… Constructed from Error‐Free Reads

[Figure: the path through the error-free 5-mers ATGCC, TGCCG, GCCGT, CCGTA, CGTAT, GTATG, TATGG, ATGGA, TGGAC, GGACA, with nodes labeled by 4-mers]

Errors in Reads Lead to Bubbles in the De Bruijn Graph

[Figure: the same path plus the erroneous 5-mers GCCGC, CCGCA, CGCAT, GCATG, which form a bubble alongside the correct path]

339 Bubble Explosion…Where Are the Correct Edges of the de Bruijn Graph?

340 De Bruijn Graph of N. meningitidis Genome AFTER Removing Bubbles

Red edges represent repeats 341 Current work

• Extension to multiple samples using coloured de Bruijn graphs – https://doi.org/10.1038/ng.1028 – Colour nodes according to sample – Bubbles represent biological variation • Succinct coloured de Bruijn graphs with reduced memory footprint – https://doi.org/10.1093/bioinformatics/btx067

342 Clustering Algorithms Outline

• Clustering as an optimization problem • The Lloyd algorithm for k-means clustering • From Hard to Soft Clustering • From Coin Flipping to k-means Clustering • Expectation Maximization • Soft k-means Clustering • Hierarchical Clustering • Markov Clustering Algorithm

343 Measuring 3 Genes at 7 Checkpoints

Measure expression of various yeast genes at 7 checkpoints:

         -6h  -4h  -2h    0  +2h  +4h  +6h   (diauxic shift at 0)
YLR258W  1.1  1.4  1.4  3.7  4.0 10.0  5.9
YPL012W  1.1  0.8  0.9  0.4  0.3  0.1  0.1
YPR055W  1.1  1.1  1.1  1.1  1.1  1.1  1.1

e_ij = expression level of gene i at checkpoint j

344 Switching to Logarithms of Expression Levels

YLR258W  1.1  1.4  1.4  3.7  4.0 10.0  5.9
YPL012W  1.1  0.8  0.9  0.4  0.3  0.1  0.1
YPR055W  1.1  1.1  1.1  1.1  1.1  1.1  1.1

taking logarithms (base 2):

YLR258W  0.1  0.4  0.5  1.9  2.0  3.3  2.6
YPL012W  0.1 -0.3 -0.2 -1.2 -1.6 -3.0 -3.1
YPR055W  0.2  0.2  0.2  0.1  0.1  0.1  0.1

345 Gene Expression Matrix

gene expression vector

346 Gene Expression Matrix

1997: Joseph DeRisi measured expression of 6,400 yeast genes at 7 checkpoints before and after the diauxic shift, yielding a 6,400 x 7 gene expression matrix.

Goal: partition all yeast genes into clusters so that:
• genes in the same cluster have similar behavior
• genes in different clusters have different behavior

347 Genes as Points in Multidimensional Space

n x m gene expression matrix

n points in m-dimensional space, e.g. (8, 7), (1, 6), (5, 6), (3, 4), (1, 3), (10, 3), (5, 2), (7, 1)

348 Gene Expression and Cancer Diagnostics

MammaPrint: a test that evaluates the likelihood of breast cancer recurrence based on the expression of just 70 genes.

But how did scientists discover these 70 human genes? 349 Toward a Computational Problem

Good Clustering Principle: Elements within the same cluster are closer to each other than elements in different clusters.

350 Toward a Computational Problem

• distance between elements in the same cluster < ∆ • distance between elements in different clusters > ∆

351 Clustering Problem

Clustering Problem: Partition a set of expression vectors into clusters. • Input: A collection of n vectors and an integer k. • Output: Partition of n vectors into k disjoint clusters satisfying the Good Clustering Principle.

Any partition into two clusters does not satisfy the Good Clustering Principle!

352 What is the “best” partition into three clusters?

353 Clustering as Finding Centers

Goal: partition a set Data into k clusters.

Equivalent goal: find a set of k points Centers that will serve as the “centers” of the k clusters in Data.

354 Clustering as Finding Centers

Goal: partition a set Data into k clusters.

Equivalent goal: find a set of k points Centers that will serve as the “centers” of the k clusters in Data and will minimize some notion of distance from Centers to Data .

What is the “distance” from Centers to Data?

355 Distance from a Single DataPoint to Centers

The distance from DataPoint in Data to Centers is the distance from DataPoint to the closest center:

d(DataPoint, Centers) = min over all points x in Centers of d(DataPoint, x)

356 Distance from Data to Centers

MaxDistance(Data, Centers) = max over all points DataPoint in Data of d(DataPoint, Centers)

357 k-Center Clustering Problem

k-Center Clustering Problem. Given a set of points Data, find k centers minimizing MaxDistance(Data, Centers).
• Input: A set of points Data and an integer k.
• Output: A set of k points Centers that minimizes MaxDistance(Data, Centers) over all possible choices of Centers.

358 An even better set of centers!

359 k‐Center Clustering Heuristic

FarthestFirstTraversal(Data, k)
    Centers ← the set consisting of a single DataPoint from Data
    while Centers have fewer than k points
        DataPoint ← a point in Data maximizing d(DataPoint, Centers) among all data points
        add DataPoint to Centers

360 k-Center Clustering Heuristic
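A direct Python rendering of the heuristic (illustrative; the first data point serves as the arbitrary initial center, and the sample points are the ones used earlier):

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def farthest_first_traversal(data, k):
    """Take the first data point as a center, then repeatedly add the
    point farthest from the centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers

points = [(1, 6), (1, 3), (3, 4), (5, 6), (5, 2), (7, 1), (8, 7), (10, 3)]
centers = farthest_first_traversal(points, 3)
```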

361 What Is Wrong with FarthestFirstTraversal?

FarthestFirstTraversal selects Centers that minimize MaxDistance(Data, Centers).

But biologists are interested in typical rather than maximum deviations, since maximum deviations may represent outliers (experimental errors).

[Figure: the clusters found by the human eye vs. by FarthestFirstTraversal]

362 Modifying the Objective Function

The maximal distance between Data and Centers:
MaxDistance(Data, Centers) = max over DataPoint in Data of d(DataPoint, Centers)

The squared error distortion between Data and Centers:
Distortion(Data, Centers) = ∑ over DataPoint in Data of d(DataPoint, Centers)² / n

A single data point contributes to MaxDistance; all data points contribute to Distortion.
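Both objective functions are easy to state in code; this sketch (with made-up sample points) shows how a single outlier dominates MaxDistance but is averaged away in Distortion:

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def d_to_centers(point, centers):
    # distance from a point to its closest center
    return min(dist(point, c) for c in centers)

def max_distance(data, centers):
    return max(d_to_centers(p, centers) for p in data)

def distortion(data, centers):
    return sum(d_to_centers(p, centers) ** 2 for p in data) / len(data)

data = [(0, 0), (2, 0), (10, 0)]   # (10, 0) plays the outlier
centers = [(1, 0)]
```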

363 k-Means Clustering Problem

k-Center Clustering Problem:
Input: A set of points Data and an integer k.
Output: A set of k points Centers that minimizes MaxDistance(Data, Centers) over all choices of Centers.

k-Means Clustering Problem:
Input: A set of points Data and an integer k.
Output: A set of k points Centers that minimizes Distortion(Data, Centers) over all choices of Centers.

The k-Means Clustering Problem is NP-Hard for k > 1.

364 k-Means Clustering for k = 1

Center of Gravity Theorem: The center of gravity of points Data is the only point solving the 1-Means Clustering Problem.

The center of gravity of points Data is

∑ over all points DataPoint in Data of DataPoint / (number of points in Data)

The i-th coordinate of the center of gravity is the average of the i-th coordinates of the data points, e.g. for three points with x-coordinates 2, 4, 6 and y-coordinates 3, 1, 5: ((2+4+6)/3, (3+1+5)/3) = (4, 3).

365 The Lloyd Algorithm in Action

Select k arbitrary data points as Centers The Lloyd Algorithm in Action

Clusters

Centers

assign each data point to its nearest center The Lloyd Algorithm in Action

Clusters

Centers

new centers ← clusters’ centers of gravity The Lloyd Algorithm in Action

Clusters

again!

Centers

assign each data point to its nearest center The Lloyd Algorithm in Action

Clusters

again!

Centers

new centers ← clusters’ centers of gravity The Lloyd Algorithm in Action

Clusters

again!

Centers

assign each data point to its nearest center The Lloyd Algorithm

Select k arbitrary data points as Centers and then iteratively perform the following two steps:
• Centers to Clusters: Assign each data point to the cluster corresponding to its nearest center (ties are broken arbitrarily).
• Clusters to Centers: After the assignment of data points to k clusters, compute new centers as the clusters’ centers of gravity.

The Lloyd algorithm terminates when the centers stop moving (convergence). 372 Must the Lloyd Algorithm Converge?
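A minimal Python sketch of the Lloyd algorithm under these definitions (illustrative; the sample points and initial centers are made up, and an empty cluster simply keeps its old center):

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def center_of_gravity(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def lloyd(data, centers):
    """Alternate Centers-to-Clusters and Clusters-to-Centers until
    the centers stop moving."""
    centers = list(centers)
    while True:
        clusters = [[] for _ in centers]
        for p in data:  # Centers to Clusters
            nearest = min(range(len(centers)),
                          key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [center_of_gravity(c) if c else centers[i]
                       for i, c in enumerate(clusters)]  # Clusters to Centers
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = lloyd(data, [(1, 1), (8, 8)])
```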

• If a data point is assigned to a new center during the Centers to Clusters step: – the squared error distortion is reduced because this center must be closer to the point than the previous center was.

• If a center is moved during the Clusters to Centers step: – the squared error distortion is reduced since the center of gravity is the only point minimizing the distortion (the Center of Gravity Theorem).

373 Clustering Yeast Genes

[Figure: expression profiles of yeast genes partitioned into six clusters, with log-ratios ranging from -4 to 4 across the seven checkpoints]

374 k-means Clustering vs. the Human Eye

How would the Lloyd algorithm cluster these sets of points? Soft vs. Hard Clustering

• The Lloyd algorithm assigns the midpoint either to the red or to the blue cluster. •“hard” assignment of data points to clusters.

Midpoint: A point approximately halfway between two clusters.

376 Soft vs. Hard Clustering

• The Lloyd algorithm assigns the midpoint either to the red or to the blue cluster. •“hard” assignment of data points to clusters.

• Can we color the midpoint half-red and half-blue? •“soft” assignment of data points to clusters.

377 Soft vs. Hard Clustering

(0.98, 0.02)

(0.01, 0.99) (0.48, 0.52)

Hard choices: points are colored red or blue depending on their cluster membership. Soft choices: points are assigned “red” and “blue” responsibilities r_red and r_blue (with r_red + r_blue = 1). 378 Flipping One Biased Coin

• We flip a loaded coin with an unknown bias θ (the probability that the coin lands on heads).
• The coin lands on heads i out of n times.
• For each bias, we can compute the probability of the resulting sequence of flips.

The probability of generating the given sequence of flips is Pr(sequence|θ) = θ^i · (1-θ)^(n-i)

This expression is maximized at θ = i/n (the most likely bias). 379 Flipping Two Biased Coins

Data        fraction of heads
HTTTHTTHTH  0.4
HHHHTHHHHH  0.9
HTHHHHHTHH  0.8
HTTTTTHHTT  0.3
THHHTHHHTH  0.7

Goal: estimate the probabilities θA and θB

380 If We Knew Which Coin Was Used in Each Sequence…

Data HiddenVector HTTTHTTHTH 0.4 1 HHHHTHHHHH 0.9 0 HTHHHHHTHH 0.8 0 HTTTTTHHTT 0.3 1 THHHTHHHTH 0.7 0

Goal: estimate Parameters = (θA ,θB) when HiddenVector is given 381 If We Knew Which Coin Was Used in Each Sequence…

Data HiddenVector HTTTHTTHTH 0.4 1 HHHHTHHHHH 0.9 0 HTHHHHHTHH 0.8 0 HTTTTTHHTT 0.3 1 THHHTHHHTH 0.7 0

θA = fraction of heads generated in all flips with coin A = (4+3) / (10+10) = (0.4+0.3) / 2 = 0.35

θB = fraction of heads generated in all flips with coin B = (9+8+7) / (10+10+10) = (0.9+0.8+0.7) / (1+1+1) = 0.80
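The two estimates can be reproduced with plain dot products, anticipating the next slides (an illustrative sketch; variable names are mine):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

data = [0.4, 0.9, 0.8, 0.3, 0.7]   # fraction of heads in each sequence
hidden = [1, 0, 0, 1, 0]           # 1 = coin A was used, 0 = coin B
ones = [1] * len(data)
complement = [1 - h for h in hidden]

theta_A = dot(data, hidden) / dot(ones, hidden)          # ≈ 0.35
theta_B = dot(data, complement) / dot(ones, complement)  # ≈ 0.80
```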

382 Parameters as a Dot‐Product

Data        HiddenVector  Parameters=(θA, θB)
HTTTHTTHTH  0.4 * 1
HHHHTHHHHH  0.9 * 0
HTHHHHHTHH  0.8 * 0       (0.35, 0.80)
HTTTTTHHTT  0.3 * 1
THHHTHHHTH  0.7 * 0

θA = fraction of heads generated in all flips with coin A = (4+3) / (10+10) = (0.4+0.3) / 2 = 0.35 = (0.4·1 + 0.9·0 + 0.8·0 + 0.3·1 + 0.7·0) / (1+0+0+1+0)

∑_i Data_i · HiddenVector_i / ∑_i HiddenVector_i = Data · HiddenVector / 1 · HiddenVector = 0.35, where 1 refers to the all-ones vector (1, 1, …, 1). 383 Parameters as a Dot‐Product

Data HiddenVector Parameters=(θA, θB) HTTTHTTHTH 0.4* 1 HHHHTHHHHH 0.9 * 0 HTHHHHHTHH 0.8 * 0 (0.35, 0.80) HTTTTTHHTT 0.3* 1 THHHTHHHTH 0.7 * 0

θB = fraction of heads generated in all flips with coin B = (9+8+7) / (10+10+10) = (0.9+0.8+0.7) / (1+1+1) = 0.80 = (0.4·0 + 0.9·1 + 0.8·1 + 0.3·0 + 0.7·1) / (0+1+1+0+1)

∑_i Data_i · (1 - HiddenVector_i) / ∑_i (1 - HiddenVector_i) =

Data · (1 - HiddenVector) / 1 · (1 - HiddenVector) 384 Parameters as a Dot‐Product

Data HiddenVector Parameters=(θA, θB) HTTTHTTHTH 0.4* 1 HHHHTHHHHH 0.9 * 0 HTHHHHHTHH 0.8 * 0 (0.35, 0.80) HTTTTTHHTT 0.3 * 1 THHHTHHHTH 0.7 * 0 θA = fraction of heads generated in all flips with coin A = (0.4+0.3)/2=0.35 = Data * HiddenVector / 1 * HiddenVector

θB = fraction of heads generated in all flips with coin B = (0.9+0.8+0.7)/3=0.80 = Data * (1-HiddenVector) / 1 * (1 - HiddenVector) 385 Data, HiddenVector, Parameters

Data HiddenVector Parameters=(θA, θB) 0.4 1 0.9 0 0.8 0 (0.35, 0.80) 0.3 1 0.7 0

HiddenVector   Parameters

386 Data, HiddenVector, Parameters

Data HiddenVector Parameters=(θA, θB) 0.4 ? 0.9 ? 0.8 ? (0.35, 0.80) 0.3 ? 0.7 ?

? HiddenVector   Parameters

387 From Data & Parameters to HiddenVector

Data HiddenVector Parameters=(θA, θB) 0.4 ? 0.9 ? 0.8 ? (0.35, 0.80) 0.3 ? 0.7 ? Which coin is more likely to generate the 1st sequence (with 4 H)?

Pr(1st sequence|θA) = θA^4 · (1-θA)^6 = 0.35^4 · 0.65^6 ≈ 0.00113  >  Pr(1st sequence|θB) = θB^4 · (1-θB)^6 = 0.80^4 · 0.20^6 ≈ 0.00003 388

Data HiddenVector Parameters=(θA, θB) 0.4 1 0.9 ? 0.8 ? (0.35, 0.80) 0.3 ? 0.7 ? Which coin is more likely to generate the 1st sequence (with 4 H)?

Pr(1st sequence|θA) = θA^4 · (1-θA)^6 = 0.35^4 · 0.65^6 ≈ 0.00113  >  Pr(1st sequence|θB) = θB^4 · (1-θB)^6 = 0.80^4 · 0.20^6 ≈ 0.00003 389

Data HiddenVector Parameters=(θA, θ ) B 1 0.4 0.9 ? 0.8 ? (0.35, 0.80) 0.3 ? 0.7 ? Which coin is more likely to generate the 2nd sequence (with 9 H)?

Pr(2nd sequence|θA) = θA^9 · (1-θA)^1 = 0.35^9 · 0.65 ≈ 0.00005  <  Pr(2nd sequence|θB) = θB^9 · (1-θB)^1 = 0.80^9 · 0.20 ≈ 0.02684 390

Data HiddenVector Parameters=(θA, θ ) B 1 0.4 0 0.9 0.8 ? (0.35, 0.80) 0.3 ? 0.7 ? Which coin is more likely to generate the 2nd sequence (with 9 H)?

Pr(2nd sequence|θA) = θA^9 · (1-θA)^1 = 0.35^9 · 0.65 ≈ 0.00005  <  Pr(2nd sequence|θB) = θB^9 · (1-θB)^1 = 0.80^9 · 0.20 ≈ 0.02684 391

Data HiddenVector Parameters=(θA, θB) 0.4 1 0.9 0 0.8 0 (0.35, 0.80) 0.3 1 0.7 0

392 Reconstructing HiddenVector and Parameters

Data

HiddenVector   Parameters

393 Reconstructing HiddenVector and Parameters

Data

HiddenVector   Parameters’

394 Reconstructing HiddenVector and Parameters

Data

HiddenVector   Parameters’

395 Reconstructing HiddenVector and Parameters

Data

Iterate!

HiddenVector’   Parameters’

396 What does this algorithm remind you of?

Parameters

HiddenVector

Parameters

HiddenVector

Parameters 397 From Coin Flipping to k‐means Clustering: Where Are Data, HiddenVector, and Parameters?

Data: data points Data = (Data1,…,Datan)

Parameters: Centers = (Center1,…,Centerk)

HiddenVector: assignments of data points to k centers (n-dimensional vector with coordinates varying from 1 to k).

398 Coin Flipping and Soft Clustering

• Coin flipping: how would you select between coins A and B if

Pr(sequence|θA) = Pr(sequence|θB)?

• k-means clustering: which cluster would you assign a data point to if it is a midpoint of centers C1 and C2?

Soft assignments: assigning C1 and C2 “responsibility” ≈0.5

for a midpoint. 399 Memory Flash: From Data & Parameters to HiddenVector Data HiddenVector Parameters =

(θA,θB) 0.4 ? 0.9 ? 0.8 ? (0.60, 0.82) 0.3 ? 0.7 ? Which coin is more likely to have generated the first sequence (with 4 H)?

Pr(1st sequence|θA) = θA^4 · (1-θA)^6 = 0.60^4 · 0.40^6 ≈ 0.000531  >  Pr(1st sequence|θB) = θB^4 · (1-θB)^6 = 0.82^4 · 0.18^6 ≈ 0.000015

400 Memory Flash: From Data & Parameters to HiddenVector Data HiddenVector Parameters = (θ ,θ ) A B 1 0.4 0.9 ? 0.8 ? (0.60, 0.82) 0.3 ? 0.7 ? Which coin is more likely to have generated the first sequence (with 4 H)?

Pr(1st sequence|θA) = θA^4 · (1-θA)^6 = 0.60^4 · 0.40^6 ≈ 0.000531  >  Pr(1st sequence|θB) = θB^4 · (1-θB)^6 = 0.82^4 · 0.18^6 ≈ 0.000015

401 From Data & Parameters to HiddenMatrix

Data HiddenMatrix Parameters =

(θA,θB) 0.4 0.97 0.03 0.9 ? 0.8 ? (0.60, 0.82) 0.3 ? 0.7 ? What are the responsibilities of coins for this sequence? st Pr(1 sequence|θA) ≈ 0.000531 > st Pr(1 sequence|θB ) ≈ 0.000015

0.000531 / (0.000531 + 0.000015) ≈ 0.97

0.000015 / (0.000531 + 0.000015) ≈ 0.03 402 From Data & Parameters to HiddenMatrix

Data HiddenMatrix Parameters =

(θA, θB) 0.4 0.97 0.03 0.12 0.88 0.9 0.8 ? (0.60, 0.82) 0.3 ? 0.7 ? What are the responsibilities of coins for the 2nd sequence? nd Pr(2 sequence|θA) ≈ 0.0040 < nd Pr(2 sequence|θB ) ≈ 0.0302

0.0040 / (0.0040 + 0.0302) = 0.12

0.0302 / (0.0040 + 0.0302) ≈ 0.88 403 HiddenMatrix Reconstructed!

Data HiddenMatrix Parameters =

(θA,θB) 0.4 0.97 0.03 0.12 0.88 0.9 0.8 0.29 0.71 (0.60, 0.82) 0.3 0.99 0.01 0.7 0.55 0.45

404 Expectation Maximization Algorithm

Data

HiddenMatrix   Parameters

405 E‐step

Data

HiddenMatrix   Parameters

406 M‐step

Data

???

HiddenVector   Parameters’

407 Memory Flash: Dot Product

Data HiddenVector Parameters=(θA, θB) HTTTHTTHTH 0.4 * 1 HHHHTHHHHH 0.9* 0 * HTHHHHHTHH 0.8 0 ??? HTTTTTHHTT 0.3* 1 THHHTHHHTH 0.7* 0 θA = Data * HiddenVector / 1 * HiddenVector

θB = Data * (1-HiddenVector) / 1 * (1-HiddenVector)

408 From Data & HiddenMatrix to Parameters Data HiddenVector

Parameters=(θA,θB) HTTTHTTHTH 0.4 1 HHHHTHHHHH 0.9 0 HTHHHHHTHH 0.8 0 HTTTTTHHTT 0.3 1 THHHTHHHTH 0.7 0 θA = Data * HiddenVector / 1 * HiddenVector

θB = Data *(1-HiddenVector) / 1 * (1-HiddenVector)

HiddenVector = ( 1 0 0 1 0 ) What is HiddenMatrix corresponding to this HIddenVector? 409 From Data & HiddenMatrix to Parameters Data HiddenVector

Parameters=(θA,θB) HTTTHTTHTH 0.4 1 HHHHTHHHHH 0.9 0 HTHHHHHTHH 0.8 0 HTTTTTHHTT 0.3 1 THHHTHHHTH 0.7 0 θA = Data * HiddenVector / 1 * HiddenVector

θA = Data · (1st row of HiddenMatrix) / 1 · (1st row of HiddenMatrix)

θB = Data · (1-HiddenVector) / 1 · (1-HiddenVector)
θB = Data · (2nd row of HiddenMatrix) / 1 · (2nd row of HiddenMatrix)

HiddenVector = ( 1 0 0 1 0 )

HiddenMatrix = 1 0 0 1 0  = HiddenVector
               0 1 1 0 1  = 1 - HiddenVector

410 From Data & HiddenMatrix to Parameters

Data HiddenMatrix

Parameters=(θA,θB) HTTTHTTHTH 0.4 0.97 0.03 HHHHTHHHHH 0.9 0.12 0.88 HTHHHHHTHH 0.8 0.29 0.71 HTTTTTHHTT 0.3 0.99 0.01 THHHTHHHTH 0.7 0.55 0.45 θA = Data * HiddenVector / 1 * HiddenVector

θA = Data · (1st row of HiddenMatrix) / 1 · (1st row of HiddenMatrix)
θB = Data · (2nd row of HiddenMatrix) / 1 · (2nd row of HiddenMatrix)

HiddenMatrix = .97 .12 .29 .99 .55
               .03 .88 .71 .01 .45

411 From HiddenVector to HiddenMatrix
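Putting the E-step (responsibilities) and M-step (dot products with the rows of HiddenMatrix) together gives expectation maximization for the two coins. An illustrative sketch (function and variable names are mine):

```python
def em_coins(data, n, theta, steps):
    """EM for the two-coin model: data[j] is the fraction of heads in
    a sequence of n flips; theta = (theta_A, theta_B) is the initial
    guess of the coins' biases."""
    theta_A, theta_B = theta
    for _ in range(steps):
        # E-step: responsibility of coin A for each sequence
        resp = []
        for x in data:
            heads = round(x * n)
            p_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
            p_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
            resp.append(p_A / (p_A + p_B))
        # M-step: dot products with the two rows of HiddenMatrix
        theta_A = sum(r * x for r, x in zip(resp, data)) / sum(resp)
        theta_B = (sum((1 - r) * x for r, x in zip(resp, data))
                   / sum(1 - r for r in resp))
    return theta_A, theta_B, resp

data = [0.4, 0.9, 0.8, 0.3, 0.7]
theta_A, theta_B, resp = em_coins(data, 10, (0.60, 0.82), steps=1)
# resp reproduces the slides' first column: ≈ 0.97, 0.12, 0.29, 0.99, 0.55
```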

Data: data points Data = {Data1, … ,Datan} Parameters: Centers = {Center1, … ,Centerk} HiddenVector: assignments of data points to centers

               A B C D E F G H
HiddenVector:  1 2 1 3 2 1 3 3

HiddenMatrix:  1 0 1 0 0 1 0 0
               0 1 0 0 1 0 0 0
               0 0 0 1 0 0 1 1

412 From HiddenVector to HiddenMatrix

Data: data points Data = {Data1, … ,Datan} Parameters: Centers = {Center1, … ,Centerk} HiddenMatrixi,j: responsibility of center i for data point j

               A   B C D E F G H
HiddenMatrix:  0.7 0 1 0 0 1 0 0
               0.2 1 0 0 1 0 0 0
               0.1 0 0 1 0 0 1 1

413 From HiddenVector to HiddenMatrix

Data: data points Data = {Data1, … ,Datan} Parameters: Centers = {Center1, … ,Centerk} HiddenMatrixi,j: responsibility of center i for data point j

               A    B    C    D    E    F    G    H
HiddenMatrix:  0.70 0.15 0.73 0.40 0.15 0.80 0.05 0.05
               0.20 0.80 0.17 0.20 0.80 0.10 0.05 0.20
               0.10 0.05 0.10 0.40 0.05 0.10 0.90 0.75

414 Responsibilities and the Law of Gravitation

Think of data points as planets and centers as stars: the responsibility of star i for planet j is proportional to its pull (Newton's law of gravitation):

Force_i,j = 1 / distance(Data_j, Center_i)²

HiddenMatrix_i,j = Force_i,j / ∑ over all centers i of Force_i,j

415 Responsibilities and Statistical Mechanics

The responsibility of center i for data point j is proportional to

Force_i,j = e^(-β·distance(Data_j, Center_i)), where β is a stiffness parameter.

HiddenMatrix_i,j = Force_i,j / ∑ over all centers i of Force_i,j

416 How Does Stiffness Affect Clustering?

[Figure: hard k-means clustering vs. soft k-means clustering with stiffness β = 1 and with stiffness β = 0.3]
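The soft assignment with the statistical-mechanics force can be sketched as follows (illustrative; the sample points are made up so that the third one sits exactly at the midpoint and receives a 0.5/0.5 split):

```python
import math

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def soft_responsibilities(data, centers, beta):
    """HiddenMatrix of soft k-means: each center pulls on each point
    with force e^(-beta * distance); responsibilities for a point are
    the forces normalized to sum to 1 over the centers."""
    forces = [[math.exp(-beta * dist(p, c)) for p in data]
              for c in centers]
    matrix = []
    for row in forces:
        matrix.append([row[j] / sum(f[j] for f in forces)
                       for j in range(len(data))])
    return matrix

data = [(0, 0), (6, 0), (3, 0)]    # the third point is the midpoint
centers = [(0, 0), (6, 0)]
soft = soft_responsibilities(data, centers, beta=1.0)
# each center takes responsibility 0.5 for the midpoint
```

Raising beta makes the responsibilities more extreme, approaching the hard assignment of the Lloyd algorithm.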

417 Stratification of Clusters

Clusters often have subclusters, which have subsubclusters, and so on.

418 Stratification of Clusters

Clusters often have subclusters, which have sub-subclusters, and so on.

419 From Data to a Tree

To capture stratification, the hierarchical clustering algorithm organizes n data points into a tree.

420 From a Tree to a Partition into 4 Clusters

To capture stratification, the hierarchical clustering algorithm organizes n data points into a tree.

Line crossing the tree at 4 points

421 From a Tree to a Partition into 6 Clusters

To capture stratification, the hierarchical clustering algorithm first organizes n data points into a tree.

[Figure: a line crossing the tree at 6 points partitions genes g1–g10 into 6 clusters]

422 Constructing the Tree

Hierarchical clustering starts by transforming the n × m expression matrix into an n × n similarity matrix or distance matrix. Distance Matrix

423 Constructing the Tree

Identify the two closest clusters and merge them.

{g3, g5}

g3 g5 g8 g7 g1 g6 g10 g2 g4 g9

424 Constructing the Tree

Recompute the distance between two clusters as the average distance between elements of the two clusters.

{g3, g5}

g3 g5 g8 g7 g1 g6 g10 g2 g4 g9

425 Constructing the Tree

Identify the two closest clusters and merge them.

{g2, g4}

{g3, g5}

g3 g5 g8 g7 g1 g6 g10 g2 g4 g9

426 Constructing the Tree

Recompute the distance between two clusters (as the average distance between elements of the two clusters).

{g2, g4}

{g3, g5}

g3 g5 g8 g7 g1 g6 g10 g2 g4 g9

427 Constructing the Tree

Identify the two closest clusters and merge them.

{g3, g5, g8}

{g2, g4}

g3 g5 g8 g7 g1 g6 g10 g2 g4 g9

428 Constructing the Tree

Iterate until all elements form a single cluster (root).

429 Constructing a Tree from a Distance Matrix D

HierarchicalClustering(D, n)
  Clusters ← n single-element clusters labeled 1 to n
  T ← a graph with the n isolated nodes labeled 1 to n
  while there is more than one cluster
    find the two closest clusters Ci and Cj
    merge Ci and Cj into a new cluster Cnew with |Ci| + |Cj| elements
    add a new node labeled by cluster Cnew to T
    connect node Cnew to Ci and Cj by directed edges
    remove the rows and columns of D corresponding to Ci and Cj
    remove Ci and Cj from Clusters
    add a row and column to D for Cnew by computing D(Cnew, C) for each cluster C in Clusters
    add Cnew to Clusters
  assign root in T as a node with no incoming edges
  return T
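The pseudocode can be sketched in Python. This version uses average linkage and returns the merge order instead of building the tree T explicitly (names and the merge-list output are my choices):

```python
def hierarchical_clustering(D, n):
    """Sketch of HierarchicalClustering with average linkage (D_avg).
    D is an n x n distance matrix; returns the merges as
    ((cluster_i, cluster_j), new_cluster_id) tuples, new ids n, n+1, ..."""
    clusters = {i: [i] for i in range(n)}          # id -> member points
    d = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        ci, cj = min(d, key=d.get)                 # two closest clusters
        new = clusters.pop(ci) + clusters.pop(cj)
        for pair in [p for p in d if ci in p or cj in p]:
            del d[pair]                            # drop stale distances
        # Average distance between elements of the two clusters.
        for ck, members in clusters.items():
            avg = (sum(D[p][q] for p in new for q in members)
                   / (len(new) * len(members)))
            d[(ck, next_id)] = avg
        clusters[next_id] = new
        merges.append(((ci, cj), next_id))
        next_id += 1
    return merges
```

Swapping the `avg` computation for a minimum over pairs gives D_min instead, which generally produces a different tree.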

430 Different Distance Functions Result in Different Trees

Average distance between elements of two clusters:

D_avg(C1, C2) = (∑ over all points i in C1 and j in C2 of D_i,j) / (|C1| · |C2|)

Minimum distance between elements of two clusters:

D_min(C1, C2) = min over all points i in C1 and j in C2 of D_i,j

431 Clusters Constructed by HierarchicalClustering

[Figure: expression profiles of genes in Clusters 1–6; one cluster shows a surge in expression at the final checkpoint]

432 Markov Clustering Algorithm

Unlike most clustering algorithms, the MCL (micans.org/mcl) does not require the number of expected clusters to be specified beforehand. The basic idea underlying the algorithm is that dense clusters correspond to regions with a larger number of paths.

Material and code at micans.org/mcl

433 Markov Clustering Algorithm

We take a random walk on the graph described by the similarity matrix, but after each step we weaken the links between distant nodes and strengthen the links between nearby nodes. A random walk has a higher probability of staying inside a cluster than of leaving it soon. The crucial point lies in boosting this effect by iteratively alternating expansion and inflation steps. The inflation step strengthens strong currents and weakens already weak currents; its parameter r controls the extent of this strengthening/weakening and, in the end, the granularity of the clusters.
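The alternation of expansion and inflation can be sketched in a few lines, assuming a small dense adjacency matrix given as a list of lists (real MCL implementations such as the one at micans.org/mcl use sparse matrices and pruning):

```python
def mcl(matrix, r=2, iterations=20):
    """Sketch of Markov Clustering: alternate expansion (matrix
    squaring, i.e. taking two random-walk steps) with inflation
    (raising entries to the power r and renormalizing columns),
    which strengthens strong currents and weakens weak ones."""
    n = len(matrix)
    def normalize(m):
        # Column-normalize so each column is a probability distribution.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
        return m
    m = normalize([row[:] for row in matrix])
    for _ in range(iterations):
        # Expansion: m <- m @ m
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to the power r, renormalize columns.
        m = normalize([[v ** r for v in row] for row in m])
    return m
```

After convergence, the nonzero pattern of the matrix reveals the clusters: columns attracted to the same rows belong together.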

434 Markov Clustering Algorithm

Matrix representation

435 Markov Clustering Algorithm

436 Markov Clustering Algorithm

437 Markov Clustering Algorithm

The number of steps to converge is not proven, but is experimentally shown to be 10 to 100 steps; the iterates mostly consist of sparse matrices after the first few steps.

The expansion step of MCL has time complexity O(n³). The inflation step has complexity O(n²). However, the matrices are generally very sparse, or at least the vast majority of the entries are near zero. Pruning in MCL involves setting near-zero matrix entries to zero, which allows sparse matrix operations to vastly improve the speed of the algorithm.

438 Genome Assembly Outline

• Why do we map reads? • Using the Trie • From a Trie to a Suffix Tree • String Compression and the Burrows‐Wheeler Transform • Inverting Burrows‐Wheeler • Using Burrows‐Wheeler for Pattern Matching • Finding the Matched Patterns • Setting Up Checkpoints • Inexact Matching

439 Toward a Computational Problem

• Reference genome: database genome used for comparison.

• Question: How can we assemble individual genomes efficiently using the reference?

CTGATGATGGACTACGCTACTACTGCTAGCTGTAT Individual

CTGAGGATGGACTACGCTACTACTGATAGCTGTTT Reference

440 Why Not Use Assembly?

Multiple copies of a genome

Shatter the genome into reads

Sequence the reads AGAATATCA TGAGAATAT GAGAATATC

AGAATATCA Assemble the genome with GAGAATATC overlapping reads TGAGAATAT ...TGAGAATATCA... 441 Why Not Use Assembly?

• Constructing a de Bruijn graph takes a lot of memory. [Figure: de Bruijn graph built from read k-mers]

• Hope: a machine in a clinic that would collect and map reads in 10 minutes.

• Idea: use existing structure of reference genome to help us sequence a patient’s genome. 442 Read Mapping

• Read mapping: determine where each read has high similarity to the reference genome.

CTGAGGATGGACTACGCTACTACTGATAGCTGTTT Reference GAGGA CCACG TGA-A Reads

443 Why Not Use Alignment?

• Fitting alignment: align each read Pattern to the best substring of Genome.

• Has runtime O(|Pattern|* |Genome|) for each Pattern.

• Has runtime O(|Patterns| * |Genome|) for a collection of Patterns.

444 Exact Pattern Matching

• Focus on a simple question: where do the reads match the reference genome exactly?

• Single Pattern Matching Problem: – Input: A string Pattern and a string Genome. – Output: All positions in Genome where Pattern appears as a substring.

445 Exact Pattern Matching

• Focus on a simple question: where do the reads match the reference genome exactly?

• Multiple Pattern Matching Problem: – Input: A collection of strings Patterns and a string Genome. – Output: All positions in Genome where a string from Patterns appears as a substring.

446 A Brute Force Approach

• We can simply apply a brute force method, sliding each Pattern down Genome.

panamabananas Genome nana Pattern

• Note: we use words instead of DNA strings for convenience.
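The slide-down procedure is easy to state precisely (a sketch, using 0-based positions):

```python
def brute_force_match(genome, pattern):
    """Slide pattern down genome, comparing at every offset:
    O(|Genome| * |Pattern|) time for a single pattern."""
    return [i for i in range(len(genome) - len(pattern) + 1)
            if genome[i:i + len(pattern)] == pattern]
```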

447 Brute Force Is Too Slow

• The runtime of the brute force approach is too high! – Single Pattern: O(|Genome| * |Pattern|) – Multiple Patterns: O(|Genome| * |Patterns|) – |Patterns| = combined length of Patterns

448 Processing Patterns into a Trie

• Idea: combine reads into a graph. Each substring of the genome can match at most one read. So each read will correspond to a unique path through this graph.

• The resulting graph is called a trie.

449

[Figure: a trie, rooted at Root, combining the Patterns banana, pan, and, nab, antenna, bandana, ananas, nana]

450 Using the Trie for Pattern Matching

• TrieMatching: Slide the trie down the genome.

• At each position, walk down the trie and see if we can reach a leaf by matching symbols.

• Analogy: bus stops
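A sketch of trie construction and TrieMatching, using nested dictionaries with "$" marking the end of a pattern (this representation is my choice, not the book's):

```python
def build_trie(patterns):
    """Combine Patterns into a trie of nested dicts; '$' marks a leaf."""
    root = {}
    for pattern in patterns:
        node = root
        for symbol in pattern:
            node = node.setdefault(symbol, {})
        node["$"] = {}
    return root

def trie_matching(genome, trie):
    """Slide the trie down genome: at each position, walk down the trie
    and report the position if we reach a leaf by matching symbols."""
    matches = []
    for start in range(len(genome)):
        node = trie
        for symbol in genome[start:]:
            if "$" in node:          # reached the end of some pattern
                break
            if symbol not in node:
                node = None
                break
            node = node[symbol]
        if node is not None and "$" in node:
            matches.append(start)
    return matches
```

Construction takes O(|Patterns|) time; matching takes O(|Genome| · |LongestPattern|), as claimed below.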

[Figure: sliding the trie down panamabananas, walking down from the root at each starting position]

452 Success!

• Runtime of Brute Force: – Total: O(|Genome|*|Patterns|)

• Runtime of Trie Matching: – Trie Construction: O(|Patterns|) – Pattern Matching: O(|Genome| * |LongestPattern|)

453 Memory Analysis of TrieMatching

• We completely forgot about memory!

• Our trie: 30 edges, |Patterns| = 39

• Worst case: # edges = O(|Patterns|)

[Figure: the trie from before, with its edges highlighted]

454 Preprocessing the Genome

• What if instead we create a data structure from the genome itself? • Split Genome into all its suffixes. (For example, we can match “banana” because it appears as a prefix of the suffix “bananas$”.) • How can we combine these suffixes into a data structure? • Let’s use a trie!

[Figure: the suffix trie of panamabananas$]

456 The Suffix Trie and Pattern Matching

• For each Pattern, see if Pattern can be spelled out from the root downward in the suffix trie.

457

[Figure: spelling ana from the root of the suffix trie of panamabananas$; leaves are labeled with the starting positions of their suffixes]

458 Memory Trouble Once Again

• Worst case: the suffix trie holds O(|Suffixes|) nodes.

• For a Genome of length n, |Suffixes| = n(n – 1)/2 = O(n²)

[The suffixes of panamabananas$: panamabananas$, anamabananas$, namabananas$, amabananas$, mabananas$, abananas$, bananas$, ananas$, nanas$, anas$, nas$, as$, s$, $]

459 Compressing the Trie

• This doesn’t mean that our idea was bad!

• To reduce memory, we can compress each “nonbranching path” of the tree into an edge.

[Figure: the suffix trie of panamabananas$ (left) and, after compressing each nonbranching path into an edge, its suffix tree (right); leaves are labeled 0–12 with suffix starting positions]

• This data structure is called a suffix tree.

• For any Genome, # nodes < 2|Genome|. – # leaves = |Genome|; – # internal nodes <|Genome| –1

462 Runtime and Memory Analysis

• Runtime: – O(|Genome|2) to construct the suffix tree. – O(|Genome| + |Patterns|) to find pattern matches.

• Memory: – O(|Genome|2) to construct the suffix tree. – O(|Genome|) to store the suffix tree.

463 Runtime and Memory Analysis

• Runtime: – O(|Genome|) to construct the suffix tree directly. – O(|Genome| + |Patterns|) to find pattern matches. – Total: O(|Genome| + |Patterns|)

• Memory: – O(|Genome|) to construct the suffix tree directly. – O(|Genome|) to store the suffix tree. – Total: O(|Genome| + |Patterns|) 464 We are Not Finished Yet

• I am happy with the suffix tree, but I am not completely satisfied. • Runtime: O(|Genome| + |Patterns|) • Memory: O(|Genome|)

• However, big‐O notation ignores constants! • The best known suffix tree implementations require ~ 20 times the length of |Genome|. • Can we reduce this constant factor?

465 Genome Compression

• Idea: decrease the amount of memory required to hold Genome.

• This indicates that we need methods of compressing a large genome, which is seemingly a separate problem.

466 Idea #1: Run‐Length Encoding

• Run‐length encoding: compresses a run of n identical symbols.

Genome GGGGGGGGGGCCCCCCCCCCCAAAAAAATTTTTTTTTTTTTTTCCCCCG

10G11C7A15T5C1G Run-length encoding

• Problem: Genomes don’t have lots of runs…
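The encoding step itself is short with itertools.groupby (the function name is mine):

```python
from itertools import groupby

def run_length_encode(genome):
    """Run-length encoding: each run of identical symbols becomes
    count + symbol, e.g. GGGG -> 4G."""
    return "".join(str(len(list(group))) + symbol
                   for symbol, group in groupby(genome))
```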

467 Converting Repeats to Runs

• …but they do have lots of repeats!

Genome

How do we do this step? Convert repeats to runs

Genome*

Run-length encoding

CompressedGenome*

468 The Burrows‐Wheeler Transform

[Figure: panamabananas$ written around a circle; reading from each position gives a cyclic rotation]

Form all cyclic rotations of “panamabananas$”

469 The Burrows‐Wheeler Transform

panamabananas$
$panamabananas
s$panamabanana
as$panamabanan
nas$panamabana
anas$panamaban
nanas$panamaba
ananas$panamab
bananas$panama
abananas$panam
mabananas$pana
amabananas$pan
namabananas$pa
anamabananas$p

Form all cyclic rotations of “panamabananas$”

470 The Burrows‐Wheeler Transform

[Two columns: the cyclic rotations of panamabananas$ (left) and the same rotations sorted lexicographically, with $ first (right)]

Form all cyclic rotations of “panamabananas$”; sort the strings lexicographically ($ comes first). 471 The Burrows‐Wheeler Transform

panamabananas$      $panamabananas
$panamabananas      abananas$panam
s$panamabanana      amabananas$pan
as$panamabanan      anamabananas$p
nas$panamabana      ananas$panamab
anas$panamaban      anas$panamaban
nanas$panamaba      as$panamabanan
ananas$panamab      bananas$panama
bananas$panama      mabananas$pana
abananas$panam      namabananas$pa
mabananas$pana      nanas$panamaba
amabananas$pan      nas$panamabana
namabananas$pa      panamabananas$
anamabananas$p      s$panamabanana

Form all cyclic rotations of “panamabananas$”. Burrows-Wheeler Transform: last column of the sorted rotations = smnpbnnaaaaa$a

472 BWT: Converting Repeats to Runs
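The transform as defined above — form all rotations, sort, take the last column — is a three-liner (a sketch; it assumes the text ends in '$', which sorts before the other symbols):

```python
def bwt(text):
    """Burrows-Wheeler Transform: sort all cyclic rotations of text
    lexicographically and read off the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)
```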

Genome

Burrows-Wheeler Transform! Convert repeats to runs

BWT(Genome)

Run-length encoding

Compression(BWT(Genome))

473 How Can We Decompress?

Genome

IS IT POSSIBLE? Burrows-Wheeler Transform

BWT(Genome)

EASY Run-length encoding

Compression(BWT(Genome))

474 Reconstructing banana

[Figure, shown step by step: starting from the first column (the sorted symbols) and the last column of the matrix, prepending the last column and re-sorting yields the sorted 2-mers of the circular string banana$; repeating the step yields the sorted 3-mers, 4-mers, 5-mers, and 6-mers]

• At each step k, we know the k-mer composition of the circular string banana$ • Sorting gives us the first k columns of the matrix.

480 Reconstructing banana

$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

• We now know the entire matrix!

• Taking all elements in the first row (after $) produces banana.
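The repeated prepend-and-sort procedure above can be written directly (a sketch; note that it stores |Genome| strings of length |Genome|, the memory problem raised on the next slide):

```python
def invert_bwt_naive(last_column):
    """Naive BWT inversion: repeatedly prepend the last column to the
    sorted partial matrix and re-sort, recovering one more column per
    round. Requires O(|Genome|^2) memory."""
    matrix = [""] * len(last_column)
    for _ in range(len(last_column)):
        matrix = sorted(c + row for c, row in zip(last_column, matrix))
    # The row ending in '$' is the original text.
    return next(row for row in matrix if row.endswith("$"))
```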

481 More Memory Issues

• Reconstructing Genome from BWT(Genome) required us to store |Genome| copies of |Genome|. $banana a$banan ana$ban anana$b banana$ na$bana nana$ba • Can we invert BWT with less space?

482 A Strange Observation

[Figure: panamabananas$ on a circle next to its sorted rotations; the first and last columns are highlighted]

483 A Strange Observation

[The same figure, with the occurrences of a highlighted in both the first and last columns]

484 Is It True in General?

[Figure, shown in three steps: chop the leading a off the six rotations beginning with a — the remaining strings are still sorted; adding the a back at the end does not change the ordering, so the strings remain sorted]

These strings are sorted

487 Is It True in General?

• First‐Last Property: the k‐th occurrence of a symbol in FirstColumn and the k‐th occurrence of the same symbol in LastColumn correspond to the same position of that symbol in Genome.

[Figure: the sorted rotations of panamabananas$ with occurrence counts attached to the symbols of the first and last columns]

488 More Efficient BWT Decompression
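With the First-Last Property, inversion needs only the two columns. A sketch (the helper names are mine; the rank lookup here is quadratic for simplicity, where a real implementation would precompute counts):

```python
def invert_bwt(last_column):
    """Invert BWT using the First-Last Property: the k-th occurrence of
    a symbol in FirstColumn and in LastColumn correspond to the same
    position in Genome, so we walk backwards through the text using
    only the two columns -- O(|Genome|) memory."""
    first_column = sorted(last_column)
    def with_ranks(column):
        counts, ranked = {}, []
        for symbol in column:
            counts[symbol] = counts.get(symbol, 0) + 1
            ranked.append((symbol, counts[symbol]))
        return ranked
    first = with_ranks(first_column)
    last = with_ranks(last_column)
    # Row in FirstColumn holding the same symbol occurrence as row i
    # of LastColumn.
    last_to_first = [first.index(pair) for pair in last]
    row, symbols = 0, []        # row 0 is the rotation starting with $
    for _ in range(len(last_column) - 1):
        symbols.append(last_column[row])
        row = last_to_first[row]
    return "".join(reversed(symbols)) + "$"
```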

[Figure, shown in three steps: using the First‐Last Property to walk backwards through the first and last columns and spell out panamabananas$ without storing the whole matrix]

• Memory: 2|Genome| = O(|Genome|).

491 Recalling Our Goal

• Suffix Tree Pattern Matching: – Runtime: O(|Genome| + |Patterns|) – Memory: O(|Genome|) – Problem: suffix tree takes 20 x |Genome| space

• Can we use BWT(Genome) as our data structure instead?

492 Finding Pattern Matches Using BWT

• Searching for ana in panamabananas

[Figure, shown in four steps: matching ana from its last symbol to its first, the search narrows the range of matrix rows from those beginning with a, to those beginning with na, to those beginning with ana]

496 Where Are the Matches?

• Multiple Pattern Matching Problem: – Input: A collection of strings Patterns and a string Genome. – Output: All positions in Genome where one of Patterns appears as a substring.

• Where are the positions? BWT has not revealed them.

497 Where Are the Matches?

• Example: We know that ana occurs 3 times, but where?

[Figure: the sorted rotations of panamabananas$; the BWT columns alone do not reveal positions]

498 Using the Suffix Array to Find Matches

• Suffix array: holds the starting position of the suffix beginning each row.

[Figure, revealed one entry at a time: the sorted rotations of panamabananas$ alongside their suffix array 13, 5, 3, 1, 7, 9, 11, 6, 4, 2, 8, 10, 0, 12]

• Thus, ana occurs at positions 1, 7, 9 of panamabananas$.
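A sketch of the suffix array and the lookup it enables (names are mine; a real implementation would binary-search without materializing the suffixes):

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Starting positions of the lexicographically sorted suffixes of
    text (text should end in '$', which sorts first)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def pattern_positions(text, pattern):
    """All positions where pattern occurs, via binary search over the
    sorted suffixes."""
    sa = suffix_array(text)
    suffixes = [text[i:] for i in sa]
    lo = bisect_left(suffixes, pattern)
    # '\x7f' sorts after every symbol we use, so this bounds the range
    # of suffixes that start with pattern.
    hi = bisect_right(suffixes, pattern + "\x7f")
    return sorted(sa[lo:hi])
```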

512 The Suffix Array: Memory Once Again

• Memory: ~ 4 × |Genome|.

[Figure: the suffix tree of panamabananas$ next to its suffix array [13 5 3 1 7 9 11 6 4 2 8 10 0 12]]

515 Reducing Suffix Array Size

• We don’t want to have to store all of the suffix array; can we store only part of it? Checkpointing can be used to store only 1/100 of the suffix array.

A Return to Constants

• Using a checkpointed array increases runtime by a constant factor, but in practice it is a worthwhile trade‐off.

516

[Figure, shown in four steps: matching ana using checkpointed counts of the first and last columns]

517 Returning to Our Original Problem

• We need to look at INEXACT matching in order to find variants.

• Approx. Pattern Matching Problem: – Input: A string Pattern, a string Genome, and an integer d. – Output: All positions in Genome where Pattern appears as a substring with at most d mismatches.

518 Returning to Our Original Problem

• We need to look at INEXACT matching in order to find variants.

• Multiple Approx. Pattern Matching Problem: – Input: A collection of strings Patterns, a string Genome, and an integer d. – Output: All positions in Genome where a string from Patterns appears as a substring with at most d mismatches.

519 Method 1: Seeding

• Say that Pattern appears in Genome with 1 mismatch:

Pattern acttggct

Genome …ggcacactaggctcc…

520 Method 1: Seeding

• Say that Pattern appears in Genome with 1 mismatch:

Pattern acttggct

Genome …ggcacactaggctcc…

• One of the substrings must match!

521 Method 1: Seeding

• Theorem: If Pattern occurs in Genome with d mismatches, then we can divide Pattern into d + 1 “equal” pieces and find at least one exact match.

[Figure: Pattern divided into d + 1 equal pieces; at least one piece must match Genome exactly]
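The theorem suggests a seed-and-check procedure, sketched below (assumptions: 0-based positions, and the last piece absorbs any remainder when the pattern length is not divisible by d + 1):

```python
def approximate_matches(genome, pattern, d):
    """Seed-and-check: split pattern into d + 1 pieces. If pattern
    occurs with at most d mismatches, at least one piece occurs exactly
    (pigeonhole), so find exact seed hits and verify the full pattern
    around each hit."""
    k = len(pattern) // (d + 1)
    hits = set()
    for s in range(d + 1):
        seed = pattern[s * k:] if s == d else pattern[s * k:(s + 1) * k]
        start = genome.find(seed)
        while start != -1:
            p = start - s * k              # implied pattern start
            if 0 <= p <= len(genome) - len(pattern):
                window = genome[p:p + len(pattern)]
                if sum(a != b for a, b in zip(pattern, window)) <= d:
                    hits.add(p)
            start = genome.find(seed, start + 1)
    return sorted(hits)
```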

522 Method 1: Seeding

• Say we are looking for at most d mismatches.

• Divide each of our strings into d + 1 smaller pieces, called seeds.

• Check if each Pattern has a seed that matches Genome exactly.

• If so, check the entire Pattern against Genome523. Method 2: BWT Saves the Day Again

• Recall: searching for ana in panamabananas

# Mismatches

[Figure: the first and last columns of the sorted rotations; now we extend all strings with at most 1 mismatch]

524 Method 2: BWT Saves the Day Again

• Recall: searching for ana in panamabananas

# Mismatches

[Figure: one string produces a second mismatch (the $), so we discard it]

525 Method 2: BWT Saves the Day Again

• Recall: searching for ana in panamabananas

# Mismatches

[Figure: in the end, we have five 3-mers with at most 1 mismatch]

526 Method 2: BWT Saves the Day Again

• Recall: searching for ana in panamabananas

Suffix Array

[Figure: the suffix array entries 5, 3, 1, 7, 9 give the positions of the five 3-mers matching ana with at most 1 mismatch]

528 Related work

• Accommodate mismatches in string using e.g. a seed and extend approach – Heuristic approaches – BatMis https://doi.org/10.1093/bioinformatics/bts339 • Lots of interest in processing k‐mers of sequence data – E.g. MinHash for comparing two sets of reads without alignment • Next, we will consider how we can compare more divergent sequences

529 Hidden Markov Models Outline

• From a Crooked Casino to a Hidden Markov Model • Decoding Problem • The Viterbi Algorithm • Profile HMMs for Sequence Alignment • Classifying proteins with profile HMMs • Viterbi Learning • Soft Decoding Problem • Baum‐Welch Learning

530 The Crooked Casino

A crooked dealer may use one of two identically looking coins: • The fair coin (F) gives heads with probability ½:

PrF(“Head”) = 1/2 PrF(“Tail”) = 1/2 • The biased coin (B) gives heads with probability ¾:

PrB(“Head”) = 3/4    PrB(“Tail”) = 1/4

Did the dealer more likely use the fair or the biased coin if 63 out of 100 flips resulted in heads?

Hint: 63 is closer to 75 than to 50!

531 Fair or Biased?

• Given a sequence of n flips with k “Heads”:

x = x1 x2 . . . xn
• The probability this sequence was generated by the fair coin:
Pr(x|F) = PrF(x1) * … * PrF(xn) = (1/2)^n
• The probability that it was generated by the biased coin:
Pr(x|B) = PrB(x1) * … * PrB(xn) = (3/4)^k • (1/4)^(n-k)

Pr(x|F) > Pr(x|B) → fair is more likely
Pr(x|F) < Pr(x|B) → biased is more likely
Equilibrium: Pr(x|F) = Pr(x|B)
(1/2)^n = (3/4)^k • (1/4)^(n-k) → 2^n = 3^k → n = k • log23 → k = n / log23 ≈ 0.63•n
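The comparison above can be checked numerically. A minimal sketch, using only the probabilities given on the slides (fair: 1/2 heads; biased: 3/4 heads):

```python
# Compare Pr(x|F) and Pr(x|B) for a sequence of n flips with k heads.
import math

def pr_fair(n: int) -> float:
    """Probability of any particular sequence of n flips under the fair coin."""
    return (1 / 2) ** n

def pr_biased(n: int, k: int) -> float:
    """Probability of a particular sequence with k heads under the biased coin."""
    return (3 / 4) ** k * (1 / 4) ** (n - k)

n, k = 100, 63
print(pr_fair(n) > pr_biased(n, k))  # True: the fair coin is more likely
print(n / math.log2(3))              # ≈ 63.09, the break-even head count
```

At 64 heads the comparison flips, confirming that 63 sits just below the equilibrium point n / log2(3).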

Even though 63 is closer to 75 than to 50, the fair coin is more likely to result in 63 heads!

532 Two Coins Up the Dealer's Sleeve

• Now the dealer has both fair and biased coins and can switch between them at any time (without you noticing) with probability 0.1. After watching a sequence of flips, can you tell when the dealer was using the fair coin and when he was using the biased coin?

533 Reading the Dealer’s Mind

Casino Problem: Given a sequence of coin flips, determine when the dealer used the fair coin and when the biased coin.

• Input: A sequence x = x1 x2 . . . xn of flips made by coins F (fair) and B (biased).

• Output: A sequence π = π1 π2 · · · πn, with each πi being

equal to either F or B and indicating that xi is the result of flipping the fair or biased coin, respectively.

534 The Problem with the Casino Problem

• Any outcome of coin tosses could have been generated by any combination of fair and biased coins! – HHHHH could be generated by BBBBB, FFFFF, FBFBF, etc.

We need to grade different scenarios: BBBBB, FFFFF, FBFBF, etc. differently, depending on how likely they are.

How can we explore and grade 2^n possible scenarios?

535 Reading the Dealer's Mind (one window at a time)

HHHTHTHHHT
BBBBB: Pr(x|F)/Pr(x|B) < 1    FFFFF: Pr(x|F)/Pr(x|B) > 1

Log-odds ratio of sequence x = log2 [Pr(x|F) / Pr(x|B)]
= log2 [(1/2)^n / ((3/4)^k • (1/4)^(n-k))] = #Tosses - log23 * #Heads

536 Reading the Dealer's Mind (one window at a time)

HHHTHTHHHT
BBBBB: Log-odds < 0    FFFFF: Log-odds > 0

Log-odds ratio of sequence x = log2 Pr(x|F) / Pr(x|B)

= #Tosses - log23 * #Heads

Log-odds ratio < 0: biased coin more likely    Log-odds ratio > 0: fair coin more likely

537 Reading the Dealer's Mind

HHHTHTHHHT BBBBB FFFFF FFFFF FFFFF BBBBB FFFFF

What are the disadvantages of this approach?
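Before answering, the window-based classifier itself can be sketched in a few lines. This is a minimal illustration, assuming non-overlapping windows and an arbitrary window length of 5 (the choice of length is itself one of the approach's problems):

```python
# Classify each window of flips as fair (F) or biased (B) by the sign of the
# log-odds ratio: log2 Pr(x|F)/Pr(x|B) = #tosses - log2(3) * #heads.
import math

def classify_windows(flips: str, w: int = 5) -> list[str]:
    labels = []
    for i in range(0, len(flips) - w + 1, w):
        window = flips[i:i + w]
        log_odds = w - math.log2(3) * window.count("H")
        labels.append("F" if log_odds > 0 else "B")
    return labels

print(classify_windows("HHHTHTHHHT"))  # ['B', 'F']
```

Overlapping windows (step 1 instead of step w) would reproduce the slide's picture, where neighboring windows can disagree about the same flip.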

538 Disadvantages of the Sliding Window Approach

HHHTHTHHHT
BBBBB FFFFF FFFFF FFFFF BBBBB FFFFF

• Different windows may classify the same coin flip differently!
• The results depend on the window length. How to choose it?

539 Why Are CG Dinucleotides More Rare than GC Dinucleotides in Genomic Sequences?

• Different species have widely varying GC‐content (percentages of G+C nucleotides in the genome).

46% for gorilla and human, 58% for platypus
• Each of the dinucleotides CC, CG, GC, and GG is expected to occur in the human genome with frequency 0.23 * 0.23 = 5.29%. But the frequency of CG in the human genome is only 1%!

540 Methylation

Methylation: adds a methyl (CH3) group to the cytosine nucleotide (often within a CG dinucleotide).

• The resulting methylated cytosine has the tendency to deaminate into thymine.

• As a result of methylation, CG is the least frequent dinucleotide in many genomes. 541 Looking for CG‐islands

Methylation is often suppressed around genes in areas called CG-islands (CG appears frequently). ATTTCTTCTCGTCGACGCTAATTTCTTGGAAATATCATTAT

In a first attempt to find genes, how would you search for CG-islands?

542 Looking for CG‐islands

[Figure: log-odds ratio scale; below 0, a non-CG-island is more likely, above 0, a CG-island is more likely.]

• Different windows may classify the same position in the genome differently.
• It is not clear how to choose the length of the window for detecting CG-islands.
• Does it make sense to choose the same window length for all regions in the genome?

543 Turning the Dealer into a Machine

• Think of the dealer as a machine with hidden states (F and B) that proceeds in a sequence of steps.
• In each step, it emits a symbol (H or T) while being in one of its hidden states.
• While in a certain state, the machine makes two decisions:
  o Which symbol will I emit?
  o Which hidden state will I move to next?

544 Why “Hidden”?

• An observer can see the emitted symbols of an HMM but does not know which state the HMM is currently in.

• The goal is to infer the most likely sequence of hidden states of an HMM based on the sequence of emitted symbols. 545 Hidden Markov Model (HMM) Σ: an alphabet of emitted symbols H and T

States : a set of hidden states F and B

Transition = (transitionl,k): a |States| × |States| matrix of transition probabilities (of changing from state l to state k):

      F    B
F   0.9  0.1
B   0.1  0.9

Emission = (emissionk(b)): a |States| × |∑| matrix of emission probabilities (of emitting symbol b when the HMM is in state k):

      H     T
F   0.50  0.50
B   0.75  0.25

546 HMM Diagram

Transition F B F 0.9 0.1 B 0.1 0.9

F B Emission H T F 0.50 0.50 B 0.75 0.25

547 Hidden Path

Hidden path: the sequence π = π1… πn of states that the HMM passes through. • Pr(x, π): the probability that an HMM follows the hidden

path π and emits the string x = x1 x2 . . . xn. x: T H T H H H T H T T H π: F F F B B B B B F F F

∑ all possible emitted strings x ∑ all possible hidden paths π Pr(x, π) = 1 •Pr(x|π): the conditional probability that an HMM emits the string x after following the hidden path π.

∑ all possible emitted strings x Pr(x|π) = 1

548 Computing Pr(x, π)

Pr(x, π) = Pr(x|π) * Pr(π)

• Pr(x, π): the probability that an HMM follows the hidden path π and emits the string x.
• Pr(xi|πi): probability that xi was emitted from the state πi (equal to emissionπi(xi)).
• Pr(πi-1→πi): probability that the HMM moved from πi-1 to πi (equal to transitionπi-1,πi).

x:            T  H  T  H  H  H  T  H  T  T  H
π:            F  F  F  B  B  B  B  B  F  F  F
Pr(πi-1→πi): .5 .9 .9 .1 .9 .9 .9 .9 .1 .9 .9
Pr(xi|πi):    ½  ½  ½  ¾  ¾  ¾  ¼  ¾  ½  ½  ½

Pr(π) = Πi=1,n Pr(πi-1→πi) = Πi=1,n transitionπi-1,πi

Pr(x|π) = Πi=1,n Pr(xi|πi) = Πi=1,n emissionπi(xi)

549 Computing Probability of a Hidden Path Pr(π) and Conditional Probability of an Outcome Pr(x|π)

Probability of a Hidden Path Problem. Compute the probability of an HMM's hidden path.
• Input: A hidden path π in an HMM (∑, States, Transition, Emission).
• Output: The probability of this path, Pr(π).

Probability of an Outcome Given a Hidden Path Problem. Compute the probability that an HMM will emit a given string given its hidden path.

• Input: A string x=x1,…xn emitted by an HMM (∑, States, Transition, Emission) and a hidden path π= π1,…, πn. • Output: The conditional probability Pr(x|π) that x will be

emitted given that the HMM follows the hidden path π. 550 Decoding Problem

Decoding Problem: Find an optimal hidden path in an HMM given its emitted string.

• Input: A string x = x1 . . . xn emitted by an HMM (∑, States, Transition, Emission). • Output: A path π that maximizes the probability Pr(x,π) over all possible paths through this HMM.

Pr(x, π) = Pr(x|π) * Pr(π)

= Π i=1,n Pr(xi|πi) * Pr(πi-1→πi) = Π i=1,n emissionπi (xi) * transitionπi-1,πi

551 Building Manhattan for the Crooked Casino

HMM diagram

x1 x2 x3 x4 x5 xn

B B B B BB

F F F F F F

552 Building Manhattan for the Crooked Casino

HMM diagram

source → B B B B B B → sink
         F F F F F F

553 Building Manhattan for Decoding Problem

HMM diagram

554 Building Manhattan for Decoding Problem

HMM diagram

555 Alignment Manhattan vs. Decoding Manhattan

Alignment Decoding three valid directions many valid directions

556 Edge Weights in the HMM Manhattan k

l

i-1 i

Edge (l, k,i-1) from node (l, i-1) to node (k, i):

• transitioning from state l to state k (with probability transitionl,k) • emitting symbol xi (with probability emissionk(xi)

weight(l, k, i-1)=emissionk(xi) * transitionl,k 557 Edge Weights for the Crooked Casino

weight(l, k, i-1) = emissionk(xi) * transitionl,k

Transition:           Emission:
      F    B               H     T
F   0.9  0.1          F  0.50  0.50
B   0.1  0.9          B  0.75  0.25

Example: weight(B, B, 1) = emissionB(H) * transitionB,B = 0.75 * 0.9

Emitted string: H H T T H H

558 Product Weight of a Hidden Path

Pr(x, π) = Π i=1,n emissionπi(xi) * transitionπi-1,πi
         = Π i=1,n weight of the i-th edge in path π
         = Π i=1,n weight(πi-1, πi, i-1)

559 Dynamic Programming for Decoding Problem

scorek,i : the maximum product weight among all paths from source to node (k, i):

scorek,i = max all states l { scorel,i-1 · weight of edge from (l, i-1) to (k, i) }
         = max all states l { scorel,i-1 · weight(l, k, i-1) }

560 Recurrence for Viterbi Algorithm

• Recurrence:

scorek, i = max all states l {scorel,i‐1 ∙ weight(l,k,i‐1)}

• Initialization:

scoresource = 1

• The maximum product weight over all paths from source to sink:

scoresink = max all states l scorel,n
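The recurrence above, together with backtracking pointers, gives the full Viterbi algorithm. A minimal sketch for the casino HMM; the source edge into each first-column state is assumed to have probability 1/2, as in the slides:

```python
# Viterbi: maximize score_{k,i} = max_l score_{l,i-1} * weight(l, k, i-1),
# then trace backtracking pointers from the best final state.
states = ["F", "B"]
transition = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emission = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def viterbi(x: str) -> str:
    score = {k: 0.5 * emission[(k, x[0])] for k in states}
    backtrack = []
    for symbol in x[1:]:
        prev, score, ptr = score, {}, {}
        for k in states:
            best = max(states, key=lambda l: prev[l] * transition[(l, k)])
            ptr[k] = best
            score[k] = prev[best] * transition[(best, k)] * emission[(k, symbol)]
        backtrack.append(ptr)
    state = max(states, key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backtrack):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("HHHHHHHHHH"))  # BBBBBBBBBB
print(viterbi("TTTTTTTTTT"))  # FFFFFFFFFF
```

A production version would work in log space (see the slide on underflow below); the plain products are kept here to match the recurrence exactly.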

561 Running Time of the Viterbi Algorithm

Running time ~ #edges in the Viterbi graph ~ O(|States|^2 • n)

562 Running Time of the Viterbi Algorithm

Forbidden transition: an edge not represented in the HMM diagram.

Running time ~ #edges in the Viterbi graph

~ O(#edges in the HMM diagram • n)

563 From Product of Weights to Sum of Their Logarithms

Since scorek,i may become small (danger of underflow), biologists prefer to work with logarithms of scores:

scorek, i = max all states l { scorel,i-1 · weight(l,k,i-1) }

log(scorek, i) = max all states l { log(scorel,i-1)+log(weight(l,k,i-1) } This transformation substitutes weights of edges by their logarithms:

product of weights → sum of weights 564 Computing Pr(π) Versus Computing Pr(x)

• Pr(x, π): the probability that an HMM follows the hidden

path π and emits the string x = x1 x2 . . . xn. x: T H T H H H T H T T H π: F F F B B B B B F F F .5 .9 .9 .1 .9 .9 .9 .9 .1 .9 .9

Pr(π) = ∑ all possible emitted strings x Pr(x, π) = Πi=1,n transitionπi-1,πi

Pr(x) = ∑ all possible hidden paths π Pr(x, π) = ∑ all possible hidden paths π product weight of π

Compare: scoresink = max all possible hidden paths π product weight of π

565 What is the Most Likely Outcome of an HMM?

• Outcome Likelihood Problem. Find the probability that an HMM emits a given string.

• Input: A string x = x1 . . . xn emitted by an HMM (∑, States, Transition, Emission). • Output: The probability Pr(x) that the HMM emits x.

Can you solve the Outcome Likelihood Problem by making a single change in the Viterbi recurrence

scorek, i = max all states l {scorel,i-1 · weight(l,k,i-1)} ?

566 Viterbi Algorithm: From MAX to ∑

• forwardk,i : total product weight of all paths to (k,i): k

forwardk,i = ∑ all states l { forwardl,i-1 ∙ weight of (l, i-1)→(k, i) }
           = ∑ all states l { forwardl,i-1 ∙ weight(l, k, i-1) }

567 Viterbi Algorithm: From MAX to ∑

= ∑ all states l {forwardl,i‐1 ∙ weight(l, k, i‐1)} 567 Viterbi Algorithm: From MAX to ∑

• forwardk,i : total product weight of all paths to (k,i): k

scorek,i = max all states l { scorel,i-1 ∙ weight(l, k, i-1) }
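Swapping max for a sum in the Viterbi recurrence solves the Outcome Likelihood Problem. A minimal sketch for the casino HMM, again assuming 1/2 initial probabilities:

```python
# Forward algorithm: Pr(x) is the total product weight of all paths,
# computed by replacing max with sum in the Viterbi recurrence.
states = ["F", "B"]
transition = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emission = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def outcome_likelihood(x: str) -> float:
    forward = {k: 0.5 * emission[(k, x[0])] for k in states}
    for symbol in x[1:]:
        forward = {
            k: sum(forward[l] * transition[(l, k)] for l in states) * emission[(k, symbol)]
            for k in states
        }
    return sum(forward.values())

# Sanity check: Pr of a single flip being heads = 1/2 * 1/2 + 1/2 * 3/4 = 0.625.
print(outcome_likelihood("H"))  # 0.625
```

As a consistency check, summing Pr(x) over all emitted strings of a fixed length gives 1.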

568 Classifying Proteins into Families

• Proteins are organized into protein families represented by multiple alignments.

• A distant cousin may have weak pairwise similarities with family members, failing a significance test.

• However, it may have weak similarities with many family members, indicating a relationship.

569 From Alignment to Profile

Remove columns if the fraction of space symbols (“-”) exceeds θ, the maximum fraction of insertions threshold.

570 From Alignment to Profile

571 From Profile to HMM

[HMM diagram: the path emitting A D D A F F D F has probability 1 * .25 * .75 * .20 * 1 * .20 * .75 * .60]

572 Toward a Profile HMM

M1 M2 M3 M4 M5 M6 M7 M8

A F D D A F F D F

How do we model insertions? Toward a Profile HMM: Insertions

I0 I1 I2 I3 I4 I5 I6 I7 I8

M1 M2 M3 M4 M5 M6 M7 M8

A F D D A F F D F

How do we model deletions? Toward a Profile HMM: Deletions

I0 I1 I2 I3 I4 I5 I6 I7 I8

M1 M2 M3 M4 M5 M6 M7 M8

A A F F D F Toward a Profile HMM: Deletions

I0 I1 I2 I3 I4 I5 I6 I7 I8

M1 M2 M3 M4 M5 M6 M7 M8

A A F F D F

How many edges are in this HMM diagram? Adding “Deletion States”

D1 D2 D3 D4 D5 D6 D7 D8

I0 I1 I2 I3 I4 I5 I6 I7 I8

M1 M2 M3 M4 M5 M6 M7 M8

A A F F D F Adding “Deletion States”

D1 D2 D3 D4 D5 D6 D7 D8

I0 I1 I2 I3 I4 I5 I6 I7 I8

M1 M2 M3 M4 M5 M6 M7 M8

A A F F D F

Are any edges still missing in this HMM diagram? Adding Edges Between Deletion/Insertion States

D1 D2 D3 D4 D5 D6 D7 D8

I0 I1 I2 I3 I4 I5 I6 I7 I8

M1 M2 M3 M4 M5 M6 M7 M8

The Profile HMM is Ready to Use!

Start   D1 D2 D3 D4 D5 D6 D7 D8   End
        I0 I1 I2 I3 I4 I5 I6 I7 I8
        M1 M2 M3 M4 M5 M6 M7 M8

Profile HMM Problem: Construct a profile HMM from a multiple alignment.
• Input: A multiple alignment Alignment and a threshold θ (maximum fraction of insertions per column).
• Output: Transition and emission matrices of the profile HMM HMM(Alignment, θ).

But what are the Transition and Emission matrices?

Hidden Paths Through Profile HMM

Note: this is a hidden path in an HMM diagram (not in a Viterbi graph). 581 Transition Probabilities of Profile HMM

4 transitions from M5 :

1 + 1 + 1 = 3 into I5 1 into M6 0 into D6

transitionMatch(5),Insertion(5) = 3/4 transitionMatch(5),Match(6) = 1/4 transitionMatch(5),Deletion(6) = 0
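These transition probabilities are just normalized counts of where each aligned sequence goes next. A minimal sketch of the slide's M5 example (the successor state names I5/M6/D6 are hardcoded for illustration):

```python
# Estimate transition probabilities out of a profile-HMM state from the
# observed next state of each aligned sequence. Slide example: of the 4
# transitions out of M5, 3 go to I5, 1 to M6, and 0 to D6.
from collections import Counter

def transition_probs(next_states: list[str]) -> dict[str, float]:
    counts = Counter(next_states)
    return {s: counts[s] / len(next_states) for s in ("I5", "M6", "D6")}

print(transition_probs(["I5", "I5", "I5", "M6"]))
# {'I5': 0.75, 'M6': 0.25, 'D6': 0.0}
```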

582 Emission Probabilities of Profile HMM

symbols emitted from

M2: C, F, C, D

emissionMatch(2)(A) = 0 emissionMatch(2)(C) = 2/4 emissionMatch(2)(D) = 1/4 emissionMatch(2)(E) = 0 emissionMatch(2)(F) = 1/4
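Emission probabilities for a match state are likewise normalized symbol counts from its alignment column. A minimal sketch of the M2 example (alphabet restricted to the slide's symbols; pseudocounts σ would be added to each count in the same way):

```python
# Estimate a match state's emission probabilities from its alignment column.
from collections import Counter

def match_emissions(column: str, alphabet: str = "ACDEF") -> dict[str, float]:
    counts = Counter(column)
    return {b: counts[b] / len(column) for b in alphabet}

print(match_emissions("CFCD"))
# {'A': 0.0, 'C': 0.5, 'D': 0.25, 'E': 0.0, 'F': 0.25}
```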

583 Forbidden Transitions

Gray cells: edges in the HMM diagram. Clear cells: forbidden transitions.

Don’t forget pseudocounts: HMM(Alignment,θ,σ)

584 Aligning a Protein Against a Profile HMM

Alignment

Protein ACAFDEAF

585 Aligning a Protein Against a Profile HMM

Alignment

Protein ACAFDEAF

Apply Viterbi algorithm to find optimal hidden path! 586 Aligning a Protein Against a Profile HMM

Alignment

Protein ACAFDEAF

Apply Viterbi algorithm to find optimal hidden path! 587 Profile HMM diagram

How many rows and columns does the Viterbi graph of this profile HMM have?

588 Profile HMM diagram

Viterbi graph of profile HMM: #columns= #visited states

What is wrong with this Viterbi graph?

589 Profile HMM diagram

Viterbi graph of profile HMM: #columns= #visited states

By definition, #columns = #emitted symbols

590 Profile HMM diagram

Nearly correct Viterbi graph of profile HMM:

Vertical edges enter “silent” deletion states

591 Profile HMM diagram

Correct Viterbi graph of profile HMM:

Adding 0-th column that contains only silent states

592 Alignment with a Profile HMM

Sequence Alignment with Profile HMM Problem: Align a new sequence to a family of aligned sequences using a profile HMM. • Input: A multiple alignment Alignment, a string Text, a threshold θ (maximum fraction of insertions per column), and a pseudocount σ. • Output: An optimal hidden path emitting Text in the profile HMM HMM(Alignment, θ, σ). 593 Have I Wasted Your Time?

sM(j),i = max { sI(j-1),i-1 * weight(I(j-1), M(j), i-1),
                sD(j-1),i-1 * weight(D(j-1), M(j), i-1),
                sM(j-1),i-1 * weight(M(j-1), M(j), i-1) }

si,j = max { si-1,j + score(vi, -),
             si,j-1 + score(-, wj),
             si-1,j-1 + score(vi, wj) }

The choice of alignment path is now based on varying transition and emission probabilities!

594 I Have Not Wasted Your Time!

sM(j),i = max { sI(j-1),i-1 * weighti-1(I(j-1), M(j)),
                sD(j-1),i-1 * weighti-1(D(j-1), M(j)),
                sM(j-1),i-1 * weighti-1(M(j-1), M(j)) }

Individual scoring parameters for each edge in the alignment graph capture subtle similarities that evade traditional alignments.

595 HMM Parameter Estimation • Thus far, we have assumed that the transition and emission probabilities are known. • Imagine that you only know that the crooked dealer is using two coins and observe:

HHTHHHTHHHTTTTHTTTTH

What are the biases of the coins, and how often does the dealer switch coins?

Can we develop an algorithm for parameter estimation for an arbitrary HMM? 596 If Dealer Reveals the Hidden Path...

HMM Parameter Estimation Problem: Find optimal parameters explaining the emitted string and the hidden path.

• Input: A string x = x1 . . . xn emitted by a k-state HMM with unknown transition and emission

probabilities following a known hidden path π = π1 . . . πn. • Output: Transition and Emission matrices that maximize Pr(x, π) over all possible matrices of transition and emission probabilities. HHTHHHTHHHTTTTHTTTTH

BFFBFFFBBFFBFFBBBBFF 597 If the Hidden Path is Known...

• Tl,k: #transitions from state l to state k in path π.

transitionl,k= #transitions from state l to state k / # all transitions from l

= Tl,k / #visits to state l

HHTHHHTHHHTTTTHTTTTH
BFFBFFFBBFFBFFBBBBFF
TB,F = 5/9

598 If the Hidden Path is Known...

• Tl,k: # transitions from state l to state k in path π.

• Ek(b): # times symbol b is emitted when path π is in state k.

transitionl,k= #transitions from state l to state k / # all transitions from l

= Tl,k / #visits to state l

emissionk(b) = #times symbol b is emitted in state k / # all symbols emitted in state k

= Ek(b) / #visits to state k

HHTHHHTHHHTTTTHTTTTH
BFFBFFFBBFFBFFBBBBFF
EF(T) = 6/11    TB,F = 5

599 When BOTH Hidden Path and Parameters Are Unknown
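Before turning to the unknown-path case, the known-path estimators above can be sketched directly (a minimal version; it assumes every state has at least one outgoing transition and one emission, as in the slide's example):

```python
# Estimate transition_{l,k} = T_{l,k} / #transitions out of l and
# emission_k(b) = E_k(b) / #symbols emitted in state k from a known path.
from collections import Counter

def estimate_parameters(x: str, path: str):
    t_counts, e_counts = Counter(), Counter()
    for i in range(len(path) - 1):
        t_counts[(path[i], path[i + 1])] += 1
    for state, symbol in zip(path, x):
        e_counts[(state, symbol)] += 1
    states = sorted(set(path))
    transition = {
        (l, k): t_counts[(l, k)] / sum(t_counts[(l, m)] for m in states)
        for l in states for k in states
    }
    emission = {
        (k, b): e_counts[(k, b)] / sum(c for (s, _), c in e_counts.items() if s == k)
        for k in states for b in sorted(set(x))
    }
    return transition, emission

t, e = estimate_parameters("HHTHHHTHHHTTTTHTTTTH", "BFFBFFFBBFFBFFBBBBFF")
print(t[("B", "F")])  # 5/9, as on the slide
print(e[("F", "T")])  # 6/11, as on the slide
```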

HMM Parameter Learning Problem. Estimate the parameters of an HMM explaining an emitted string.

• Input: A string x = x1 . . . xn emitted by a k-state HMM with unknown transition and emission probabilities. • Output: Matrices Transition and Emission that maximize Pr(x, π) over all possible transition and emission matrices and over all hidden paths π.

600 Reconstructing HiddenPath AND Parameters

emitted string

Start from an arbitrary choice of Parameters and solve the Decoding Problem: (emitted string, Parameters) → hidden path

601 Reconstructing HiddenPath AND Parameters

emitted string

Then solve the HMM Parameter Estimation Problem: (emitted string, hidden path) → Parameters

602 Viterbi Learning

emitted string

Iterate! (emitted string, Parameters) → hidden path → Parameters' → ...

603 Changing the Question

• The Viterbi algorithm gives a “yes” or “no” answer to the question: “Was the HMM in state k at time i given that it emitted string x?” This question fails to account for how certain we are in the “yes”/“no” answer. How can we change this hard question into a soft one?

604 What Is Pr(πi=k, x)?

Pr(πi=k, x): the unconditional probability Pr(πi=k, x) that a hidden path will pass through state k at time i and emit x.

What is the probability that the dealer was using the Fair coin at the 5th flip given that he generated a sequence of flips HHTHTHHHTT?

605 Pr(πi=k, x): Total Product Weight of All Paths Through k

Pr(πi=k, x): the unconditional probability that a hidden path will pass through state k at time i and emit x.

Pr(πi=k, x) = ∑ all paths π with πi=k Pr(x, π)

k

!! i

∑ all possible states k, all possible emitted strings x Pr(πi=k, x) = 1

606 What Is Pr(πi=k|x)?

Pr(πi =k|x):the conditional probability that the HMM was in state k at time i given that it emitted string x. What is the probability that the dealer was using the Fair coin at the 5th flip given that he generated a sequence of flips HHTHTHHHTT? Compare with:

Pr(πi=k, x): the unconditional probability that a hidden path will pass through state k at time i and emit x.

What is the probability that the dealer will generate a sequence of flips HHTHTHHHTT while using the Fair coin at the 5th flip?

607 What Is Pr(πi=k|x)?

Pr(πi=k|x): the conditional probability that the HMM was in state k at time i given that it emitted string x.

k

!! i Pr(πi =k|x): the fraction of the product weight of paths visiting k over the weight of all paths:

Pr(πi =k|x) = Pr(πi=k, x) / Pr(x)

= ∑ all paths π with πi=k Pr(x, π) / ∑ all paths π Pr(x, π)

608 Soft Decoding Problem

Soft Decoding Problem: Find the probability that an HMM was in a particular state at a particular moment, given its output.

• Input: A string x = x1 . . . xn emitted by an HMM (∑, States, Transition, Emission).

• Output: The conditional probability Pr(πi =k|x) that the HMM was in state k at step i, given x.

609 Computing Pr(πi=k, x)

• Pr(πi=k, x) = total product weight of all paths through the Viterbi graph for x that pass through the node (k, i).
• Each such path is formed by a blue subpath ending in the node (k, i) and a red subpath starting in that node (a path in the Viterbi graph with all edges reversed).

Pr(πi=k, x) = ∑ product weights of all blue paths * ∑ product weights of all red paths = forwardk,i * ???

610 Computing Pr(πi=k, x)

• The first factor is forwardk,i; the second factor, computed on the Viterbi graph with all edges reversed, is backwardk,i:

Pr(πi=k, x) = ∑ product weights of all blue paths * ∑ product weights of all red paths = forwardk,i * backwardk,i

611 Forward‐Backward Algorithm

• Since the reverse edge connecting node (l, i+1) to node (k, i) in the reversed graph has weight weight(k, l,i):

backwardk,i = ∑ all states l backwardl,i+1 ∙ weight(k, l, i)
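The forward and backward recurrences combine into the posterior formula derived just below, Pr(πi=k|x) = forwardk,i * backwardk,i / forward(sink). A minimal sketch for the casino HMM (initial probabilities again assumed to be 1/2):

```python
# Forward-backward: compute Pr(pi_i = k | x) for every position i and state k.
states = ["F", "B"]
transition = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "F"): 0.1, ("B", "B"): 0.9}
emission = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def posterior(x: str) -> list[dict[str, float]]:
    n = len(x)
    fwd = [dict() for _ in range(n)]
    bwd = [dict() for _ in range(n)]
    for k in states:
        fwd[0][k] = 0.5 * emission[(k, x[0])]
        bwd[n - 1][k] = 1.0
    for i in range(1, n):
        for k in states:
            fwd[i][k] = sum(fwd[i - 1][l] * transition[(l, k)] for l in states) * emission[(k, x[i])]
    for i in range(n - 2, -1, -1):
        for k in states:
            bwd[i][k] = sum(transition[(k, l)] * emission[(l, x[i + 1])] * bwd[i + 1][l] for l in states)
    pr_x = sum(fwd[n - 1][k] for k in states)  # forward(sink)
    return [{k: fwd[i][k] * bwd[i][k] / pr_x for k in states} for i in range(n)]

post = posterior("HHTHTHHHTT")
print(round(post[4]["F"], 3))  # Pr(fair coin at the 5th flip | x)
```

At every position the posteriors over states sum to 1, which is a useful sanity check.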

k

• Combining the forward‐backward algorithm with the solution to the Outcome Likelihood Problem yields:

Pr(πi=k|x) = Pr(πi=k, x) / Pr(x) = forwardk,i * backwardk,i / forward(sink)

612 The Conditional Probability Pr(πi=l, πi+1=k|x) that the HMM Passes Through an Edge in the Viterbi Graph

l

k

ii+1 ∑ weights of blue paths * weight of black edge * ∑ weights of red paths

Pr(πi=l, πi+1=k|x) = forwardl,i * weight(l, k, i) * backwardk,i+1 / forward(sink)

613 Node Responsibility Matrix

• Node responsibility matrix Π* = (Π*k,i):

Π*k,i = Pr(πi=k|x)

k

!! Node responsibility matrix for the crooked casino

614 Edge Responsibility Matrix • Edge responsibility matrix

Π**l,k,i = Pr(πi = l, πi+1=k|x)

l

k

Edge responsibility matrix for the crooked casino

615 Baum‐Welch Learning

Baum-Welch learning alternates between two steps:

• Re‐estimating the responsibility profile Π given the current HMM parameters (the E‐step): (emitted string, ?, Parameters) → Π

• Re‐estimating the HMM parameters given the current responsibility profile (the M‐step): (emitted string, Π, ?) → Parameters

616 Using a Responsibility Matrix to Compute Parameters

• We have defined a transformation (x, π, ?) → Parameters

that uses estimators Tl,k and Ek(b) based on a path π.

• We now want to define a transformation: (x, Π, ?) → Parameters but the path is unknown.

Idea: Use expected values Tl,k and Ek(b) over all possible paths. 617 Redefining Estimators for Parameters (for a known path π)

• Tl,k: #transitions from state l to state k in path π

Ti l,k = { 1 if πi = l and πi+1 = k;  0 otherwise }

Rewriting estimators: Tl,k = ∑ i=1,n-1 Ti l,k

HHTHHHTHHHTTTTHTTTTH
BFFBFFFBBFFBFFBBBBFF
Ti B,F = 10010000100100000100    TB,F = 5

618 Redefining Estimators for Parameters (for a known path π)

• Tl,k: #transitions from state l to state k in path π

• Ek(b): #times b is emitted when the path π is in state k
• We now define:

Ti l,k = { 1 if πi = l and πi+1 = k;  0 otherwise }
Ei k(b) = { 1 if πi = k and xi = b;  0 otherwise }

Rewriting estimators: Tl,k = ∑ i=1,n-1 Ti l,k    Ek(b) = ∑ i=1,n Ei k(b)

HHTHHHTHHHTTTTHTTTTH
BFFBFFFBBFFBFFBBBBFF
Ei F(T) = 00100010001011000010    EF(T) = 6
Ti B,F = 10010000100100000100    TB,F = 5

How would you redefine these estimators if π is unknown?

619 Redefining the Estimators Ti l,k and Ei k(b) When the Path is Unknown

• Tl,k: #transitions from state l to state k in path π

• Ek(b): #times b is emitted when the path π is in state k
• When π is known:

Ti l,k = { 1 if πi = l and πi+1 = k;  0 otherwise }
Ei k(b) = { 1 if πi = k and xi = b;  0 otherwise }

• When π is unknown, replace the indicators by their expectations over all paths:

Ti l,k = Pr(πi=l, πi+1=k|x)
Ei k(b) = { Pr(πi=k|x) if xi = b;  0 otherwise }

• In terms of the responsibility matrices:

Ti l,k = Π**l,k,i
Ei k(b) = { Π*k,i if xi = b;  0 otherwise }

620 Baum‐Welch Learning

Baum-Welch learning alternates between two steps:

• Re‐estimating the responsibility profile Π given the current HMM parameters (the E‐step): (emitted string, ?, Parameters) → Π

• Re‐estimating the HMM parameters given the current responsibility profile (the M‐step): (emitted string, Π, ?) → Parameters
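The M-step can be sketched as a direct translation of the expected-count estimators. This is an illustrative fragment, not the slides' code; the responsibility containers and their names are assumptions:

```python
# M-step of Baum-Welch: given node responsibilities pi_node[i][k] = Pr(pi_i=k|x)
# and edge responsibilities pi_edge[i][(l, k)] = Pr(pi_i=l, pi_{i+1}=k|x),
# re-estimate the parameters as normalized expected counts.
def m_step(x, states, alphabet, pi_node, pi_edge):
    n = len(x)
    transition, emission = {}, {}
    for l in states:
        out_weight = sum(pi_edge[i][(l, k)] for i in range(n - 1) for k in states)
        for k in states:
            transition[(l, k)] = sum(pi_edge[i][(l, k)] for i in range(n - 1)) / out_weight
    for k in states:
        visits = sum(pi_node[i][k] for i in range(n))
        for b in alphabet:
            emission[(k, b)] = sum(pi_node[i][k] for i in range(n) if x[i] == b) / visits
    return transition, emission

# Degenerate single-state check: every responsibility is 1.
t, e = m_step("HT", ["F"], "HT", [{"F": 1.0}, {"F": 1.0}], [{("F", "F"): 1.0}])
print(t, e)
```

With real responsibilities from the forward-backward E-step, iterating the two steps gives the full Baum-Welch loop.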

621 Stopping Rules for the Baum‐Welch Learning

• Compute the probability that the HMM emits the string x under current Parameters: Pr(emitted string|Parameters) – Compare with the probability for previous values of Parameters and stop if the difference is small. – Stop after a certain number of iterations.

622 Nature is a Tinkerer and Not an Inventor

Protein domain: a conserved part of a protein that often can function independently.

Nature uses domains as building blocks, shuffling them to create multi‐domain proteins.

Goal: classify domains into families even though sequence similarities between domains from the same family can be low. [Figure: a multi-domain protein]

623 Searching for Protein Domains with Profile HMMs

1. Use alignments to break proteins into domains (e.g. ABCDEFGHKLMNP, ERGHKLNPABTD, KLSNPACDEFTH).
2. Construct an alignment of domains from a given family (starting from highly similar domains whose attribution to a family is non-controversial):
   ABCD KLMNP EFGH
   ABTD KL-NP ERGH
   AC-D KLSNP EFTH
3. For each family, construct a profile HMM and estimate its parameters.
4. Align the new sequence against each such HMM to find the best-fitting HMM.

624 Pfam: Profile HMM Database

Each domain family in Pfam has: • Seed alignment: Initial multiple alignment of domains in this family. • HMM: Built from seed alignment for new searches. • Full alignment:Enlarged multiple alignment generated by aligning new domains against the seed HMM.

Final Challenge: Using the Pfam HMM for gp120 (constructed from a seed alignment of just 24 gp120 HIV proteins), construct alignments of all known gp120 proteins and identify the “most diverged” gp120 sequence.

625 Related work

• Mixture Hidden Markov Models (MHMMs) can be used to cluster data into homogeneous subsets on the basis of latent characteristics

626