Dynamic Programming
Comp 122, Fall 2004

Review: the previous lecture
• Principles of dynamic programming: optimization problems, the optimal substructure property, overlapping subproblems, trading space for time, and implementation via bottom-up computation or memoization.
• Example of the Match Game.
• Example of Fibonacci: from divide and conquer (exponential time complexity) to dynamic programming (O(n) time and space), then to more efficient dynamic programming (constant space, O(log n) time).
• Example of LCS in O(mn) time and space.
Today we will show how to reduce the space complexity of LCS to O(n), and in future lectures we will show how to reduce the time complexity of LCS.

Longest Common Subsequence
• Problem: Given 2 sequences, X = x1,...,xm and Y = y1,...,yn, find a common subsequence whose length is maximum.
X: springtime
Y: printing
LCS(X,Y): printi

$$
c[\alpha, \beta] =
\begin{cases}
0 & \text{if } \alpha \text{ empty or } \beta \text{ empty,} \\
c[\mathit{prefix}\,\alpha, \mathit{prefix}\,\beta] + 1 & \text{if } \mathit{end}(\alpha) = \mathit{end}(\beta), \\
\max(c[\mathit{prefix}\,\alpha, \beta],\ c[\alpha, \mathit{prefix}\,\beta]) & \text{if } \mathit{end}(\alpha) \neq \mathit{end}(\beta).
\end{cases}
$$

• Keep track of c[α, β] in a table of nm entries:
• top/down: increasing row order
• within each row left-to-right: increasing column order

       p  r  i  n  t  i  n  g
    0  0  0  0  0  0  0  0  0
 s  0  0  0  0  0  0  0  0  0
 p  0  1  1  1  1  1  1  1  1
 r  0  1  2  2
 i
 n
 g
 t
 i
 m
 e

Time Complexity: O(nm).
Space Complexity: O(nm).

Can the space complexity be improved if we just compute the length of the LCS and do not need to recover an actual LCS? In this case we only need to compute the alignment score.

Can the space complexity be improved and still allow the recovery of an actual LCS? Yes, but for this we will have to address LCS as a longest path problem.
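When only the length of the LCS is needed, each table row depends only on the row directly above it, so two rows suffice. The following is a minimal sketch of that idea, not the lecture's own code; the function name `lcs_length` is hypothetical, and the sequences are assumed to be Python strings.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence, keeping only two table rows."""
    # Put the shorter sequence along the columns so a row has min(m, n) + 1 cells.
    if len(y) > len(x):
        x, y = y, x
    prev = [0] * (len(y) + 1)              # row for the empty prefix of x
    for xi in x:                           # top/down: increasing row order
        curr = [0]                         # column 0: empty prefix of y
        for j, yj in enumerate(y, 1):      # left-to-right: increasing column order
            if xi == yj:
                curr.append(prev[j - 1] + 1)                 # ends match
            else:
                curr.append(max(prev[j], curr[j - 1]))       # drop one end
        prev = curr                        # the older row is no longer needed
    return prev[-1]


print(lcs_length("springtime", "printing"))  # 6, the length of "printi"
```

The time is still O(nm), but the space drops to O(min(n, m)). The sketch returns only the length; it discards the information needed to recover an actual LCS.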
Other sequence questions
• Edit distance: Given 2 sequences, X = x1,...,xm and Y = y1,...,yn, what is the minimum number of deletions, insertions, and changes that you must do to change one to another?
• Protein sequence alignment: Given a score matrix on amino acid pairs, s(a,b) for a,b ∈ {-} ∪ A, and 2 amino acid sequences, X = x1,...,xm ∈ A^m and Y = y1,...,yn ∈ A^n, find the alignment with lowest score…

Outline
• DNA Sequence Comparison: Biological background
• Sequence Alignment
• First Biological Success Stories
• More Grid Graphs: Manhattan Tourist Problem
• Longest Paths in Grid Graphs
• Back to Longest Common Subsequence Problem
• Review: divide and conquer paradigm
• Reducing the space requirement of LCS

Dynamic Programming: String Editing

The Central Dogma of Molecular Biology
DNA → RNA → PROTEIN
Gene → Function

> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGG
CAACACTGAGAAAATGGCAGAGCTCATCGCTAAA
GGTATCATCGAATCTGGTAAAGACGTCAACACCA
TCAACGTGTCTGACGTTAACATCGATGAACTGCT
GAACGAAGATATCCTGATCCTGGGTTGCTCTGCC
ATGGGCGATGAAGTTCTCGAGGAAAGCGAATTTG

> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIES
GKDVNTINVSDVNIDELLNEDILILGC
SAMGDEVLEESEFEPFIEEISTKISG
KKVALFGSYGWGDGKWMRDFEER
MNGYGCVVVETPLIVQNEPDEAEQD
CIEFGKKIANI

The Central Dogma of Molecular Biology
• Genome: the digital backbone of molecular biology
• Transcripts, from gene (DNA subsequence) to protein sequence: perform functions encoded in the genome

The Sequence Alignment Problem
A = c t a c g a g a c
B = a a c g a c g a t

The Scoring Matrix
      -   a   c   g   t
 -       -1  -1  -1  -1
 a   -1   1  -1  -1  -1
 c   -1  -1   1  -1  -1
 g   -1  -1  -1   1  -1
 t   -1  -1  -1  -1   1

Compare two strings A and B and measure their similarity by finding the optimal alignment between them. The alignment is classically based on the transformation of one sequence into the other, via operations of substitutions, insertions, and deletions (indels).

Two Sequence Alignment Problems
Global Alignment.
A = c t a c g a g a c
B = a a c g a c g a t
Local Alignment.
A = c t a c g a g a c
B = a a c g a c g a t

Global alignment value: 2
Local alignment value: 5

The O(n^2) time, Classical Dynamic Programming Algorithm
The Alignment Graph
[Figure: the alignment graph, a grid with rows 0-9 labeled by the characters of A (|A| = n) and columns 0-9 labeled by the characters of B (|B| = n), shown next to the scoring matrix.]

Computing the Optimal Global Alignment Value
[Figure: the alignment graph for A and B, with an edge legend giving scores of 1 and -1.]
Classical Dynamic Programming: O(n^2)

Computing an Optimal Local Alignment Value
[Figure: the alignment graph for A and B, with an edge legend giving scores of 1 and -1.]
Classical Dynamic Programming: O(n^2)

DNA Sequence Comparison: First Success Story
• Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene's function
• In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene

Cystic Fibrosis
• Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands (abnormally high level of mucus in glands). CF primarily affects the respiratory systems in children.
• Mucus is a slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva

Cystic Fibrosis: Inheritance
• In the early 1980s biologists hypothesized that CF is an autosomal recessive disorder caused by mutations in a gene that remained unknown until 1989
• Heterozygous carriers are asymptomatic
• A person must be homozygous recessive in this gene in order to be diagnosed with CF

Cystic Fibrosis: Finding the Gene

Cystic Fibrosis and the CFTR Protein
• The CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in the cell membrane of epithelial cells that secrete mucus
• These cells line the airways of the nose, lungs, the stomach wall, etc.

Mechanism of Cystic Fibrosis
• The CFTR protein (1480 amino acids) regulates a chloride ion channel
• Adjusts the “wateriness” of fluids secreted by the cell
• Those with cystic fibrosis are missing one single amino acid in their CFTR
• Mucus ends up being too thick, affecting many organs
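The similarity searches behind these success stories rest on the alignment values defined in the earlier alignment slides. As a minimal sketch of the classical O(n^2) dynamic program, assuming that deck's scoring matrix (match 1, mismatch -1, gap -1), the two values can be computed as follows. The function names are hypothetical, and only one previous row of the table is kept at a time.

```python
GAP = -1  # score of aligning any character with "-", per the scoring matrix


def score(a: str, b: str) -> int:
    """Substitution score from the scoring matrix: 1 on the diagonal, else -1."""
    return 1 if a == b else -1


def global_alignment_value(A: str, B: str) -> int:
    """Value of the best alignment of all of A with all of B."""
    prev = [j * GAP for j in range(len(B) + 1)]   # empty prefix of A vs prefixes of B
    for a in A:
        curr = [prev[0] + GAP]                    # prefix of A vs empty prefix of B
        for j, b in enumerate(B, 1):
            curr.append(max(prev[j - 1] + score(a, b),   # substitute or match
                            prev[j] + GAP,               # a aligned with a gap
                            curr[j - 1] + GAP))          # b aligned with a gap
        prev = curr
    return prev[-1]


def local_alignment_value(A: str, B: str) -> int:
    """Value of the best alignment between a substring of A and a substring of B."""
    best = 0
    prev = [0] * (len(B) + 1)
    for a in A:
        curr = [0]
        for j, b in enumerate(B, 1):
            cell = max(0,                                # start a fresh alignment here
                       prev[j - 1] + score(a, b),
                       prev[j] + GAP,
                       curr[j - 1] + GAP)
            curr.append(cell)
            best = max(best, cell)                       # an alignment may end anywhere
        prev = curr
    return best


A, B = "ctacgagac", "aacgacgat"
print(global_alignment_value(A, B))  # 2
print(local_alignment_value(A, B))   # 5
```

Both functions return only the value. Recovering the alignment itself requires the full table or the space-saving techniques listed in the outline.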
Finding Similarities between the Cystic Fibrosis Gene and ATP binding proteins
• ATP binding proteins are present on the cell membrane and act as transport channels
• In 1989 biologists found similarity between the cystic fibrosis gene and ATP binding proteins
• This is a plausible function for the cystic fibrosis gene, given that CF involves sweat secretion with an abnormally high sodium level

Cystic Fibrosis: Mutation Analysis
If a high percentage of cystic fibrosis (CF) patients have a certain mutation in the gene and healthy individuals do not, then that mutation could be related to CF.
A certain mutation was found in 70% of CF patients, convincing evidence that it is a predominant genetic diagnostic marker for CF.

Outline
• DNA Sequence Comparison: Biological background
• The Sequence Alignment problem
• First Biological Success Stories
• Grid Graphs: Manhattan Tourist Problem
• Longest Paths in Grid Graphs
• Back to Longest Common Subsequence Problem
• Review: divide and conquer paradigm
• Reducing the space requirement of LCS

Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the greatest number of attractions (*) in the Manhattan grid.
[Figure: a Manhattan street grid with attractions marked *, and with the Source and Sink corners labeled.]

Manhattan Tourist Problem: Formulation
Goal: Find the longest (highest scoring) path in a weighted grid.
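The formulation above can be sketched as a dynamic program over the grid nodes: the best score at a node is the better of arriving from the north or from the west. This is an illustrative sketch under stated assumptions, not the lecture's own code. The names `manhattan_tourist`, `down`, and `right` are hypothetical, the source is taken to be node (0, 0) and the sink the opposite corner, and the edge weights below are made-up example data.

```python
def manhattan_tourist(down, right):
    """Weight of the longest source-to-sink path using only south and east moves.

    down[i][j]  = weight of the edge from node (i, j) south to node (i + 1, j)
    right[i][j] = weight of the edge from node (i, j) east to node (i, j + 1)
    """
    rows = len(down) + 1          # number of node rows
    cols = len(right[0]) + 1      # number of node columns
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):      # first column: reachable only by going south
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, cols):      # first row: reachable only by going east
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],    # arrive from the north
                          s[i][j - 1] + right[i][j - 1])   # arrive from the west
    return s[-1][-1]


# A 3 x 3 grid of nodes with made-up weights (e.g. attractions per street block).
down = [[1, 0, 2],
        [4, 3, 1]]
right = [[3, 2],
         [0, 5],
         [1, 2]]
print(manhattan_tourist(down, right))  # 9
```

Each node is filled in once from at most two predecessors, so the running time is proportional to the number of grid nodes.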