Edit Distance – Levenshtein Sequence Alignment – Needleman

Edit Distance – Levenshtein Sequence Alignment – Needleman

Edit Distance – Levenshtein Sequence Alignment – Needleman & Wunsch Not in Book CS380 Algorithm Design and Analysis 1 http://en.wikipedia.org/wiki/Levenshtein_distance EDIT DISTANCE CS380 Algorithm Design and Analysis 2 Edit Distance • Mutation in DNA is evolutionary. • DNA replication errors cause o Substitutions o Insertions o Deletions • of nucleotides, leading to “edited” DNA texts CS380 Algorithm Design and Analysis 3 Edit Distance: Definition • Introduced by Vladimir Levenshtein in 1966 • The Edit Distance between two strings is the minimum number of editing operations needed to transform one string into another • Operations are: o Insertion of a symbol o Deletion of a symbol o Substitution of one symbol for another CS380 Algorithm Design and Analysis 4 Example • How would you transform: o X: TGCATAT • To the string: o Y: ATCCGAT CS380 Algorithm Design and Analysis 5 Edit Distance • How many insertions, deletions, substitutions will transform one string into another? • Backtracking will give us the steps used to convert one string to another CS380 Algorithm Design and Analysis 6 Recursive Solution • Let dij = the minimum edit distance of x1x2x3..xi and y1y2y3..yi Match Deletion from X Insertion to X Substitution CS380 Algorithm Design and Analysis 7 Backtracking • No need to keep track of the arrows • Just know that: o Match/Substitution: Diagonal o Insertion: Horizontal (Left) o Deletion: Vertical (Up) CS380 Algorithm Design and Analysis 8 Example • X = ATCGTT • Y = AGTTAC CS380 Algorithm Design and Analysis 9 Kleinberg, Tardos, Algorithm Design, Pearson Addison Wesley, 2006, p 278 http://www.aw-bc.com/info/kleinberg/ SEQUENCE ALIGNMENT CS380 Algorithm Design and Analysis 10 Sequence Alignment • Edit Distance: o Gave the minimum number of changes to convert one string into another • Sequence Alignment o Maximizes the similarity by giving weights to types of differences CS380 Algorithm Design and Analysis 11 Sequence Alignment • Needleman-Wunsch • Similarity based on gaps and mismatches • Generalized form of Levenshtein o additional parameters: § gap penalty, δ § mismatch cost ( αx,y ; αx,x = 0 ) CS380 Algorithm Design and Analysis 12 Recurrence • Two strings x1...xm and y1...yn • In an optimal alignment, M, at least one of the following is true: o (xm, yn) is in M o xm is not matched o yn is not matched CS380 Algorithm Design and Analysis 13 Recurrence • So, for i and j > 0 CS380 Algorithm Design and Analysis 14 Example • Assume that: o δ = 2 o α (v, v) = 1 o α (c, c) = 1 o α (v, c) = 3 • What is the cost of aligning the strings: o mean o name CS380 Algorithm Design and Analysis 15 SPACE-EFFICIENT SEQUENCE ALIGNMENT CS380 Algorithm Design and Analysis 16 Sequence Alignment Space Usage • O(n2) is pretty low space usage • However, for a 10GB genome, you’d need a huge amount of memory • Can we use less? o Hirschberg’s algorithm o 1975 CS380 Algorithm Design and Analysis 17 Linear Space for Alignment Scores • If you are only interested in the cost of the alignment, you need to only use O(n) space • How? o When filling the entries, we only ever look at the current and previous cols o Only keep those two in memory CS380 Algorithm Design and Analysis 18 Space-Efficient-Alignment (X, Y) CS380 Algorithm Design and Analysis 19 Actual Alignment • How do we recover the actual alignment? • Do we need the entire matrix? CS380 Algorithm Design and Analysis 20 Divide-and-Conquer-Alignment (X,Y) CS380 Algorithm Design and Analysis 21 .

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    21 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us