
The Basics of Edit Distance

COSI 114 – Computational Linguistics
James Pustejovsky
Brandeis University
January 20, 2015

Spelling Correction

— We can detect spelling errors (spell check) by building an FST-based lexicon and noting any strings that are rejected.
— But how do I fix “graffe”? That is, how do I come up with suggested corrections?
◦ Search through all words in my lexicon
  – Graft, craft, grail, giraffe, crafted, etc.
◦ Pick the one that’s closest to graffe
◦ But what does “closest” mean?
  – We need a distance metric
  – The simplest one: minimum edit distance
  – As in the Unix command


Dynamic Programming

• Algorithm design technique often used for optimization problems

• Generally usable for recursive approaches if the same partial solutions are required more than once

• Approach: store partial results in a table

• Advantage: improvement of complexity, often polynomial instead of exponential


Two different approaches

Bottom-up:
+ controlled, efficient table management, saves time
+ special optimized order of computation, saves space
- requires extensive recoding of the original program
- possible computation of unnecessary values

Top-down (note-pad method):
+ original program changed only marginally or not at all
+ computes only those values that are actually required
- separate table management takes additional time
- table size often not optimal

Edit Distance

— The minimum edit distance between two strings is the minimum number of editing operations
◦ Insertion
◦ Deletion
◦ Substitution
— that one would need to transform one string into the other

Note

— The following discussion has two goals
1. Learn the minimum edit distance computation and algorithm
2. Introduce dynamic programming

Min Edit Example

I N T E * N T I O N
* E X E C U T I O N
d s s     i s

(d = delete, s = substitute, i = insert)

Minimum Edit Distance

— If each operation has a cost of 1, the distance between intention and execution is 5
— If substitutions cost 2 (Levenshtein), the distance between them is 8

Min Edit As Search

— That’s all well and good, but how did we find that particular (minimum) set of operations for those two strings?
— We can view edit distance as a search for a path (a sequence of edits) that gets us from the start string to the final string
◦ Initial state is the word we’re transforming
◦ Operators are insert, delete, substitute
◦ Goal state is the word we’re trying to get to
◦ Path cost is what we’re trying to minimize: the number of edits


Min Edit As Search

— But that generates a huge search space
— Navigating that space in a naïve backtracking fashion would be incredibly wasteful
— Why? Lots of distinct paths wind up at the same state. But there is no need to keep track of them all. We only care about the shortest path to each of those revisited states.

Defining Min Edit Distance

— For two strings S1 of length n and S2 of length m, define distance(i,j), or D(i,j), as
◦ the minimum edit distance of S1[1..i] and S2[1..j]
◦ that is, the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2
— The edit distance of S1 and S2 is then D(n,m)
— We compute D(n,m) by computing D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)

Edit Distance

Three ways to align RIGHT with RITE:

R I G H T * * * *
* * * * * R I T E
D D D D D I I I I        cost: 1+1+1+1+1+1+1+1+1 = 9

R I G H T
R I T E *
    S S D                cost: 0+0+2+2+1 = 5

R I G H T *
R I * * T E
    D D   I              cost: 0+0+1+1+0+1 = 3

Minimum Edit Distance Algorithm

— Create a matrix
— Initialize: 0 through the string length in the left-hand column and the bottom row
— For each cell, take the minimum of:
◦ Insertion: +1 from the cell to the left
◦ Deletion: +1 from the cell below
◦ Substitution: the diagonal cell, +0 if the characters are the same, +2 if different
— Keep track of where you came from

Example

The initialized matrix:

T 5
H 4
G 3
I 2
R 1
# 0   1   2   3   4
  #   R   I   T   E

— Cell (row R, column R): minimum of
◦ 1+1 (from the left)
◦ 1+1 (from below)
◦ 0+0 (diagonal, same character)
— Cell (row R, column I): minimum of
◦ 0+1 (from the left)
◦ 2+1 (from below)
◦ 1+2 (diagonal, different characters)

Answer to Right-Rite

T 5 | 6, 6, 4 | 5, 5, 3 | 4, 2, 4 | 3, 5, 5
H 4 | 5, 5, 3 | 4, 4, 2 | 3, 3, 3 | 4, 4, 4
G 3 | 4, 4, 2 | 3, 3, 1 | 2, 2, 2 | 3, 3, 3
I 2 | 3, 3, 1 | 2, 0, 2 | 1, 3, 3 | 2, 4, 4
R 1 | 2, 0, 2 | 1, 3, 3 | 2, 4, 4 | 3, 5, 5
# 0 |    1    |    2    |    3    |    4
    |    R    |    I    |    T    |    E

In each box the three values X, Y, Z are: X from the left (insertion: add one to the box on the left), Y from the diagonal (add 0 if the characters are the same, 2 if different), and Z from below (deletion: add one to the box below). The minimum of the three becomes the cell’s value, highlighted in the original slide with an arrow back to its source. Every box has such arrows; only one path back to the root was shown.

Defining Min Edit Distance

— Base conditions:
◦ D(i,0) = i
◦ D(0,j) = j
— Recurrence relation:

  D(i,j) = min of:
    D(i-1,j) + 1
    D(i,j-1) + 1
    D(i-1,j-1) + 2, if S1(i) ≠ S2(j)
    D(i-1,j-1) + 0, if S1(i) = S2(j)
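
To make the recurrence concrete, here is a minimal Python sketch (not from the slides; the function name and defaults are my own) that fills the table bottom-up with insert/delete cost 1 and a configurable substitution cost:

```python
def min_edit_distance(source, target, sub_cost=2):
    """Fill the D table from the recurrence above and return D(n, m).

    Insertions and deletions cost 1; a substitution costs `sub_cost`
    (2 for the Levenshtein variant used in these slides, 0 on a match).
    """
    n, m = len(source), len(target)
    # D[i][j] = min edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case: D(i, 0) = i
        D[i][0] = i
    for j in range(1, m + 1):          # base case: D(0, j) = j
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + diag)   # substitution / match
    return D[n][m]

if __name__ == "__main__":
    print(min_edit_distance("right", "rite"))                        # 3
    print(min_edit_distance("intention", "execution"))               # 8
    print(min_edit_distance("intention", "execution", sub_cost=1))   # 5
```

The last two calls reproduce the costs quoted earlier for intention/execution: 8 with substitution cost 2, 5 when every operation costs 1.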

Dynamic Programming

— A tabular computation of D(n,m)
— Bottom-up
◦ We compute D(i,j) for small i,j
◦ And compute larger D(i,j) based on previously computed smaller values

The Edit Distance Table

N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0   1   2   3   4   5   6   7   8   9
  #   E   X   E   C   U   T   I   O   N

Filled in:

N 9   8   9  10  11  12  11  10   9   8
O 8   7   8   9  10  11  10   9   8   9
I 7   6   7   8   9  10   9   8   9  10
T 6   5   6   7   8   9   8   9  10  11
N 5   4   5   6   7   8   9  10  11  10
E 4   3   4   5   6   7   8   9  10   9
T 3   4   5   6   7   8   7   8   9   8
N 2   3   4   5   6   7   8   7   8   7
I 1   2   3   4   5   6   7   6   7   8
# 0   1   2   3   4   5   6   7   8   9
  #   E   X   E   C   U   T   I   O   N

The minimum edit distance, in the top-right cell, is 8.

Min Edit Distance

— Note that the result isn’t all that informative
◦ For a pair of strings we get back a single number
  – The min number of edits to get from here to there
— That’s like a map routing program that tells you the distance from here to Denver but doesn’t tell you how to get there.

Paths

— Keep a back pointer
◦ Every time we fill a cell, add a pointer back to the cell that was used to create it (the min cell that led to it)
◦ To get the sequence of operations, follow the backpointers from the final cell

Backtrace

(The same filled-in table as above, with backpointer arrows tracing one optimal path from the top-right cell back to the origin.)

Adding Backtrace to MinEdit

— Base conditions:
◦ D(i,0) = i
◦ D(0,j) = j
— Recurrence relation:

  D(i,j) = min of:
    D(i-1,j) + 1                                            (Case 1)
    D(i,j-1) + 1                                            (Case 2)
    D(i-1,j-1) + 2 if S1(i) ≠ S2(j), + 0 if S1(i) = S2(j)   (Case 3)

  ptr(i,j) = LEFT if Case 1
             DOWN if Case 2
             DIAG if Case 3

Complexity

— Time: O(nm)
— Space: O(nm)
— Backtrace: O(n+m)

Alignments

— An alignment is a 1-to-1 pairing of each element in a sequence with a corresponding element in the other sequence or with a gap...

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Weighted Edit Distance

— Why would we want to add weights to the minimum edit distance computation?

Confusion Matrix

Problem: similarity of strings

Edit distance

For two given strings A and B, compute, as efficiently as possible, the edit distance D(A,B) and a minimal sequence of edit operations which transforms A into B.

i n f - - - o r m a t i k -
i n t e r p o l - a t i o n

Problem: similarity of strings

Approximate string matching

For a given text T, a pattern P, and a distance d, find all substrings P´ in T with D(P,P´) ≤ d

Sequence alignment

Find optimal alignments of DNA sequences

(Example: an alignment of two DNA sequences, with “-” marking gap positions.)

Edit distance

Given: two strings A = a1a2 .... am and B = b1b2 ... bn

Wanted: minimal cost D(A,B) for a sequence of edit operations to transform A into B.

Edit operations:

1. Replace one character in A by a character from B 2. Delete one character from A 3. Insert one character from B

Edit distance

Cost model:

  c(a,b) = 1 if a ≠ b
           0 if a = b

  (a = ε or b = ε is possible)

We assume the triangle inequality holds for c:

  c(a,c) ≤ c(a,b) + c(b,c)

→ Each character is changed at most once.

Edit distance

Trace as representation of edit sequences

A = b a a c a a b c

B = a b a c b c a c

or, using indels:

A = - b a a c a - a b c

B = a b a - c b c a - c

Edit distance (cost): 5

Division of an optimal trace results in two optimal sub-traces
→ dynamic programming can be used

Computation of the edit distance

Let Ai = a1 ... ai and Bj = b1 ... bj, and let

  Di,j = D(Ai, Bj)

Computation of the edit distance

Three possibilities of ending a trace:

1. am is replaced by bn:  Dm,n = Dm-1,n-1 + c(am, bn)
2. am is deleted:         Dm,n = Dm-1,n + 1
3. bn is inserted:        Dm,n = Dm,n-1 + 1

Computation of the edit distance

Recurrence relation, for m, n ≥ 1:

  Dm,n = min { Dm-1,n-1 + c(am, bn),
               Dm-1,n + 1,
               Dm,n-1 + 1 }

Each entry Di,j depends only on its neighbours Di-1,j-1 (diagonal, +c), Di-1,j (+1), and Di,j-1 (+1).
→ Computation of all Di,j is required, for 0 ≤ i ≤ m, 0 ≤ j ≤ n.

Recurrence relation for the edit distance

Base cases:

D0,0 = D(ε, ε) = 0

D0,j = D(ε, Bj) = j

Di,0 = D(Ai,ε) = i

Recurrence equation:

  Di,j = min { Di-1,j-1 + c(ai, bj),
               Di-1,j + 1,
               Di,j-1 + 1 }

Order of computation for the edit distance

The table has one row for each of a1, ..., am and one column for each of b1, ..., bn. Entry Di,j can be computed as soon as its three neighbours Di-1,j-1, Di-1,j, and Di,j-1 are known, so the table can be filled row by row (or column by column).

Algorithm for the edit distance

Algorithm edit_distance
Input: two strings A = a1 ... am and B = b1 ... bn
Output: the matrix D = (Di,j)

D[0,0] := 0
for i := 1 to m do D[i,0] := i
for j := 1 to n do D[0,j] := j
for i := 1 to m do
    for j := 1 to n do
        D[i,j] := min( D[i-1,j] + 1,
                       D[i,j-1] + 1,
                       D[i-1,j-1] + c(ai,bj) )
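
A direct Python transcription of this pseudocode might look as follows. This is a sketch: the cost model c (unit-cost replacement) and the structure come from the slides, the names are my own.

```python
def edit_distance(A, B):
    """Compute the full matrix D with D[i][j] = D(A_i, B_j).

    Cost model from the slides: insert and delete cost 1,
    c(a, b) = 0 if a == b else 1 for a replacement.
    """
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if A[i - 1] == B[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # delete a_i
                          D[i][j - 1] + 1,       # insert b_j
                          D[i - 1][j - 1] + c)   # replace a_i by b_j
    return D

# Example from the following slides: A = "baac", B = "abac"
# gives D[4][4] == 2.
```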

Example

        a   b   a   c
    0   1   2   3   4
b   1
a   2
a   3
c   4

Computation of the edit operations

Algorithm edit_operations(i, j)
Input: the matrix D (already computed)

if i = 0 and j = 0 then return
if i ≠ 0 and D[i,j] = D[i-1,j] + 1 then
    "delete a[i]"
    edit_operations(i-1, j)
else if j ≠ 0 and D[i,j] = D[i,j-1] + 1 then
    "insert b[j]"
    edit_operations(i, j-1)
else  /* D[i,j] = D[i-1,j-1] + c(a[i], b[j]) */
    "replace a[i] by b[j]"
    edit_operations(i-1, j-1)

Initial call: edit_operations(m, n)
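
An iterative Python rendering of edit_operations, returning the operations in left-to-right order instead of printing them during the recursion (a sketch; function and variable names are my own):

```python
def edit_operations(D, A, B):
    """Recover one minimal edit sequence by walking back from D[m][n].

    A "replace x by x" step simply keeps the character, exactly as in
    the recursive pseudocode above.
    """
    ops = []
    i, j = len(A), len(B)
    while i > 0 or j > 0:
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(f"delete {A[i - 1]}")
            i -= 1
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            ops.append(f"insert {B[j - 1]}")
            j -= 1
        else:  # D[i][j] == D[i-1][j-1] + c(A[i], B[j])
            ops.append(f"replace {A[i - 1]} by {B[j - 1]}")
            i -= 1
            j -= 1
    return list(reversed(ops))

# Usage with the edit_distance sketch above:
#   D = edit_distance("baac", "abac")
#   edit_operations(D, "baac", "abac")
# yields: insert a, replace b by b, replace a by a, delete a, replace c by c
# (total cost 2, matching D[4][4]).
```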

Trace graph of the edit operations

A = b a a c, B = a b a c:

        a   b   a   c
    0   1   2   3   4
b   1   1   1   2   3
a   2   1   2   1   2
a   3   2   2   2   2
c   4   3   3   3   2

Sub-graph of the edit operations

Trace graph: an overview of all possible traces for the transformation of A into B, with directed edges from vertex (i, j) to (i+1, j), (i, j+1) and (i+1, j+1). The weights of the edges represent the edit costs.

Costs are monotonically increasing along an optimal path.

Each path with monotonically increasing cost from the upper left corner to the lower right corner represents an optimal trace.

Approximate string matching

Given: two strings P = p1 p2 ... pm (pattern) and T = t1 t2 ... tn (text)

Wanted: an interval [j′, j], 1 ≤ j′ ≤ j ≤ n, such that the substring Tj′,j = tj′ ... tj of T is the one with the greatest similarity to the pattern P, i.e. for all other intervals [k′, k], 1 ≤ k′ ≤ k ≤ n:

  D(P, Tj′,j) ≤ D(P, Tk′,k)

Approximate string matching

Naïve approach:

for all 1 ≤ j´ ≤ j ≤ n do

    compute D(P, Tj′,j)
choose the minimum

Approximate string matching

Consider a related problem: for each text position j and each pattern position i, compute the edit distance E(i, j) between Pi and the substring Tj′,j of T ending at j which has the greatest similarity to Pi.

Approximate string matching

Method: for all 1 ≤ j ≤ n, compute the j′ for which D(P, Tj′,j) is minimal.

For 1 ≤ i ≤ m and 0 ≤ j ≤ n let:

  Ei,j = min over 1 ≤ j′ ≤ j+1 of D(Pi, Tj′,j)

Optimal trace:

  Pi    = b a a c a a b c
  Tj′,j = b a c b c a c

Approximate string matching

Recurrence relation:

  Ei,j = min { Ei-1,j-1 + c(pi, tj),
               Ei-1,j + 1,
               Ei,j-1 + 1 }

Remark: j′ can be completely different for Ei-1,j-1, Ei-1,j and Ei,j-1. A subtrace of an optimal trace is an optimal subtrace.

Approximate string matching

Base cases:

  E0,0 = E(ε, ε) = 0
  Ei,0 = E(Pi, ε) = i

but

  E0,j = E(ε, Tj) = 0

Observation:

The optimal edit sequence from P to Tj´, j does not start with an insertion of tj´ .

Approximate string matching

Dependency graph (P = a d b b c, T = a b b d a d c b c):

        a   b   b   d   a   d   c   b   c
    0   0   0   0   0   0   0   0   0   0
a   1   0   1   1   1   0   1   1   1   1
d   2   1   1   2   1   1   0   1   2   2
b   3   2   1   1   2   2   1   1   1   2
b   4   3   2   1   2   3   2   2   1   2
c   5   4   3   2   2   3   3   2   2   1
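
A small Python sketch of this computation (names are my own); note that only the base case E(0, j) = 0 distinguishes it from the plain edit-distance table:

```python
def approx_match_table(P, T):
    """Compute E[i][j] from the recurrence above: the edit distance between
    P_i and the best substring of T ending at position j.

    E[0][j] = 0 for all j, so a match may start anywhere in T.
    Returns the full (m+1) x (n+1) table.
    """
    m, n = len(P), len(T)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = i                      # E(i, 0) = i
    # E[0][j] stays 0 for all j          # E(0, j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if P[i - 1] == T[j - 1] else 1
            E[i][j] = min(E[i - 1][j - 1] + c,
                          E[i - 1][j] + 1,
                          E[i][j - 1] + 1)
    return E

E = approx_match_table("adbbc", "abbdadcbc")
print(E[len("adbbc")])   # [5, 4, 3, 2, 2, 3, 3, 2, 2, 1]
```

The printed last row matches the bottom row of the dependency-graph table above; the final 1 says the best match ending at the last position of T has distance 1 from the full pattern.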

Approximate string matching

Theorem: If there is a path from E0,j′-1 to Ei,j in the dependency graph, then Tj′,j is a substring of T ending at j with the greatest similarity to Pi, and

  D(Pi, Tj′,j) = Ei,j

Back To Spelling

— Remember graffe…
— We can compute the score/distance between graffe and assorted candidates
— MS Word gives: giraffe, gaffe, giraffes, graft
— How does it do that?

DP Search

— In the context of language processing (and signal processing) this kind of algorithm is often referred to as a DP search
◦ Min edit distance
◦ Viterbi and Forward algorithms
◦ CKY and Earley
◦ MT decoding

Word Prediction

— Guess the next word...
◦ So I notice three guys standing on the ???

What are some of the knowledge sources you used to come up with those predictions?

Word Prediction

— We can formalize this task using what are called N-gram models
◦ N-grams are token sequences of length N
  – “gram” means “written”
— Our earlier example contains the following 2-grams (aka bigrams)
◦ (So I), (I notice), (notice three), (three guys), (guys standing), (standing on), (on the)
— Given knowledge of counts of N-grams such as these, we can guess likely next words in a sequence.

N-Gram Models

— More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.
— Or, we can use them to assess the probability of an entire sequence of words.
◦ Pretty much the same thing, as we’ll see...

Applications

— It turns out that being able to predict the next word (or any linguistic unit) in a sequence is an extremely useful thing to be able to do.
— As we’ll see, it lies at the core of the following applications
◦ Automatic speech recognition
◦ Handwriting and character recognition
◦ Spelling correction
◦ Machine translation
◦ And many more

Counting

— Simple counting lies at the core of any probabilistic approach. So let’s first take a look at what we’re counting.
◦ He stepped out into the hall, was delighted to encounter a water brother.
  – 13 tokens, 15 if we include “,” and “.” as separate tokens.
  – Assuming we include the comma and period as tokens, how many bigrams are there?

Counting

— Not always that simple
◦ I do uh main- mainly business data processing
— Spoken language poses various challenges.
◦ Should we count “uh” and other fillers as tokens?
◦ What about the repetition of “mainly”? Should such do-overs count twice or just once?
◦ The answers depend on the application.
  – If we’re focusing on something like ASR to support indexing for search, then “uh” isn’t helpful (it’s not likely to occur as a query).
  – But filled pauses are very useful in dialog management, so we might want them there.
— Tokenization of text raises the same kinds of issues

Counting: Types and Tokens

— How about
◦ They picnicked by the pool, then lay back on the grass and looked at the stars.
  – 18 tokens (again counting punctuation)
— But we might also note that “the” is used 3 times, so there are only 16 unique types (as opposed to tokens).
— In going forward, we’ll have occasion to focus on counting both types and tokens of both words and N-grams.

Counting: Corpora

— What happens when we look at large bodies of text instead of single utterances?
— Google Web Crawl
◦ Crawl of 1,024,908,267,229 English tokens in Web text
◦ 13,588,391 wordform types
  – That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?
    • Numbers
    • Misspellings
    • Names
    • Acronyms
    • etc.

Language Modeling

— Now that we know how to count, back to word prediction
— We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence
◦ P(wn | w1, w2, ..., wn-1)
— We’ll call a statistical model that can assess this a Language Model

Language Modeling

— How might we go about calculating such a conditional probability?
◦ One way is to use the definition of conditional probabilities and look for counts. So to get
◦ P(the | its water is so transparent that)
— By definition, that is:

  P(its water is so transparent that the) / P(its water is so transparent that)

— We can get each of those from counts in a large corpus.

Very Easy Estimate

— How to estimate?
◦ P(the | its water is so transparent that)

  P(the | its water is so transparent that) =
      Count(its water is so transparent that the) / Count(its water is so transparent that)

Very Easy Estimate

— According to Google those counts are 12,000 and 19,000, so the conditional probability of interest is...
◦ P(the | its water is so transparent that) ≈ 0.63

Language Modeling

— Unfortunately, for most sequences and for most text collections we won’t get good estimates from this method.
◦ What we’re likely to get is 0. Or worse, 0/0.
— Clearly, we’ll have to be a little more clever.
◦ Let’s first use the chain rule of probability
◦ And then apply a particularly useful independence assumption

The Chain Rule

— Recall the definition of conditional probabilities:

  P(A | B) = P(A ∧ B) / P(B)

— Rewriting:

  P(A ∧ B) = P(A | B) P(B)

— For sequences...
◦ P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
— In general
◦ P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1...xn-1)

The Chain Rule

P(its water was so transparent) =
    P(its) ×
    P(water | its) ×
    P(was | its water) ×
    P(so | its water was) ×
    P(transparent | its water was so)

Unfortunately

— There are still a lot of possible sequences in there
— In general, we’ll never be able to get enough data to compute the statistics for those longer prefixes
◦ Same problem we had for the strings themselves

Independence Assumption

— Make the simplifying assumption
◦ P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | a)
— Or maybe
◦ P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | saw, a)
— That is, the probability in question is to some degree independent of its earlier history.

Markov Assumption

So for each component in the product, replace it with the approximation (assuming a prefix of N-1 words):

  P(wn | w1 ... wn-1) ≈ P(wn | wn-N+1 ... wn-1)

Bigram version:

  P(wn | w1 ... wn-1) ≈ P(wn | wn-1)

Estimating Bigram Probabilities

— The Maximum Likelihood Estimate (MLE):

  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

An Example

— I am Sam
— Sam I am
— I do not like green eggs and ham
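
A minimal Python sketch of the MLE computation on this toy corpus, assuming the usual <s> and </s> sentence boundaries (the helper names are my own):

```python
from collections import Counter

# The toy corpus from this slide, padded with sentence boundaries.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

# count(prev): every token that has a following token (i.e. all but </s>)
history_counts = Counter(w for s in sentences for w in s[:-1])
bigram_counts = Counter((s[k], s[k + 1]) for s in sentences
                        for k in range(len(s) - 1))

def p(w, prev):
    """MLE bigram probability P(w | prev) = count(prev, w) / count(prev)."""
    return bigram_counts[(prev, w)] / history_counts[prev]

print(p("I", "<s>"))     # 2/3
print(p("am", "I"))      # 2/3
print(p("Sam", "am"))    # 1/2
print(p("</s>", "Sam"))  # 1/2
```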

Maximum Likelihood Estimates

— The maximum likelihood estimate of some parameter of a model M from a training set T
◦ is the estimate that maximizes the likelihood of the training set T given the model M
— Suppose the word “Chinese” occurs 400 times in a corpus of a million words (the Brown corpus)
— What is the probability that a random word from some other text from the same distribution will be “Chinese”?
— The MLE estimate is 400/1,000,000 = .0004
◦ This may be a bad estimate for some other corpus
— But it is the estimate that makes it most likely that “Chinese” will occur 400 times in a million-word corpus.

Berkeley Restaurant Project Sentences

— can you tell me about any good cantonese restaurants close by
— mid priced thai food is what i’m looking for
— tell me about chez panisse
— can you give me a listing of the kinds of food that are available
— i’m looking for a good place to eat breakfast
— when is caffe venezia open during the day

Bigram Counts

— Out of 9222 sentences
◦ E.g., “I want” occurred 827 times

Bigram Probabilities

— Divide bigram counts by prefix unigram counts to get probabilities.

Bigram Estimates of Sentence Probabilities

— P(I want english food) =
    P(i | <s>) ×
    P(want | I) ×
    P(english | want) ×
    P(food | english) ×
    P(</s> | food)
  = .000031

Kinds of Knowledge

§ As crude as they are, N-gram probabilities capture a range of interesting facts about language.

— World knowledge:
◦ P(english | want) = .0011
◦ P(chinese | want) = .0065
◦ P(to | want) = .66

— Syntax:
◦ P(eat | to) = .28
◦ P(food | to) = 0
◦ P(want | spend) = 0

— Discourse:
◦ P(i | <s>) = .25

Shannon’s Method

— Assigning probabilities to sentences is all well and good, but it’s not terribly illuminating. A more entertaining task is to turn the model around and use it to generate random sentences that are like the sentences from which the model was derived.
— Generally attributed to Claude Shannon.

Shannon’s Method

— Sample a random bigram (<s>, w) according to the probability distribution over bigrams
— Now sample a new random bigram (w, x) according to its probability
◦ where the prefix w matches the suffix of the first
— And so on until we randomly choose a (y, </s>)
— Then string the words together
◦ <s> I, I want, want to, to eat, eat Chinese, Chinese food, food </s>
◦ giving: I want to eat Chinese food
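
A minimal Python sketch of Shannon’s method for a bigram model. The two training sentences are made-up stand-ins for the Berkeley Restaurant data, and the function names are my own:

```python
import random
from collections import Counter, defaultdict

def train(sentences):
    """Bigram counts over <s>/</s>-padded sentences, keyed by the history."""
    counts = defaultdict(Counter)
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for prev, w in zip(toks, toks[1:]):
            counts[prev][w] += 1
    return counts

def shannon_sentence(counts):
    """Sample bigrams left to right, starting from <s>, until </s> is drawn."""
    w, out = "<s>", []
    while True:
        successors = counts[w]
        nxt = random.choices(list(successors.keys()),
                             weights=list(successors.values()))[0]
        if nxt == "</s>":
            return " ".join(out)
        out.append(nxt)
        w = nxt

counts = train(["I want to eat Chinese food",
                "I want to eat at a Chinese restaurant"])
print(shannon_sentence(counts))
```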

Shakespeare

Shakespeare as a Corpus

— N = 884,647 tokens, V = 29,066
— Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...
◦ So, 99.96% of the possible bigrams were never seen (have zero entries in the table)
◦ This is the biggest problem in language modeling; we’ll come back to it.
— Quadrigrams are worse: what’s coming out looks like Shakespeare because it is Shakespeare

The Wall Street Journal is Not Shakespeare

Evaluating N-Gram Models

— Best evaluation for a language model
◦ Put model A into an application
  – For example, a speech recognizer
◦ Evaluate the performance of the application with model A
◦ Put model B into the application and evaluate
◦ Compare performance of the application with the two models
◦ Extrinsic evaluation

Difficulty of Extrinsic (In-Vivo) Evaluation of N-Gram Models

— Extrinsic evaluation
◦ This is really time-consuming
◦ Can take days to run an experiment
— So
◦ As a temporary solution, in order to run experiments
◦ To evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity
◦ But perplexity is a poor approximation unless the test data looks just like the training data
◦ So it is generally only useful in pilot experiments (generally not sufficient to publish)
◦ But it is helpful to think about.

Model Evaluation

— How do we know if our models are any good?
◦ And in particular, how do we know if one model is better than another?
— Well, Shannon’s game gives us an intuition.
◦ The generated texts from the higher-order models sure look better.
  – That is, they sound more like the text the model was obtained from.
◦ The generated texts from the WSJ and Shakespeare models look different.
  – That is, they look like they’re based on different underlying models.
— But what does that mean? Can we make that notion operational?

Evaluation

— Standard method
◦ Train the parameters of our model on a training set.
◦ Look at the model’s performance on some new data
  – This is exactly what happens in the real world; we want to know how our model performs on data we haven’t seen
◦ So use a test set: a dataset which is different from our training set, but is drawn from the same source
◦ Then we need an evaluation metric to tell us how well our model is doing on the test set.
  – One such metric is perplexity

But First

— Once we start looking at test data, we’ll run into words that we haven’t seen before (pretty much regardless of how much training data we have).
— With an Open Vocabulary task
◦ Create an unknown word token <UNK>
◦ Training of probabilities
  – Create a fixed lexicon L, of size V
    – from a dictionary, or
    – a subset of terms from the training set
  – At the text normalization phase, any training word not in L is changed to <UNK>
  – Now we count that like a normal word
◦ At test time
  – Use the <UNK> counts for any word not seen in training
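
One common way to realize the fixed lexicon L is a frequency cutoff on the training data (the slides also allow taking L from a dictionary). The sketch below is my own illustration of that idea, mapping rare training words and unseen test words to <UNK>:

```python
from collections import Counter

def with_unk(train_tokens, min_count=2):
    """Build a closed lexicon from training counts and map everything
    outside it (rare training words, unseen test words) to <UNK>."""
    counts = Counter(train_tokens)
    lexicon = {w for w, c in counts.items() if c >= min_count}

    def normalize(w):
        return w if w in lexicon else "<UNK>"

    return [normalize(w) for w in train_tokens], normalize

train_tokens = "the cat sat on the mat the dog sat".split()
train_unk, normalize = with_unk(train_tokens)
print(train_unk)              # rare words are replaced by <UNK>
print(normalize("aardvark"))  # an unseen test word also maps to <UNK>
```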

Perplexity

— The intuition behind perplexity as a measure is the notion of surprise.
◦ How surprised is the language model when it sees the test set?
  – Where surprise is a measure of... “Gee, I didn’t see that coming...”
  – The more surprised the model is, the lower the probability it assigned to the test set
  – The higher the probability, the less surprised it was

Perplexity

— Perplexity is the probability of a test set (assigned by the language model), normalized by the number of words:

  PP(W) = P(w1 w2 ... wN)^(-1/N)

— Expanding with the chain rule:

  PP(W) = ( Π over i of 1 / P(wi | w1 ... wi-1) )^(1/N)

— For bigrams:

  PP(W) = ( Π over i of 1 / P(wi | wi-1) )^(1/N)

§ Minimizing perplexity is the same as maximizing probability
§ The best language model is one that best predicts an unseen test set
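
A minimal Python sketch of the bigram version, computed in log space for numerical stability. It assumes a probability function p(w, prev) that never returns zero, for example a smoothed variant of the MLE estimator sketched earlier (smoothing itself is not shown):

```python
import math

def bigram_perplexity(test_sentences, p):
    """Perplexity of a bigram model p(w, prev) on <s>/</s>-padded test text.

    PP(W) = P(w_1 ... w_N)^(-1/N); we sum log probabilities and exponentiate.
    Assumes p() is never 0 (i.e. the model is smoothed).
    """
    log_prob, N = 0.0, 0
    for s in test_sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for prev, w in zip(toks, toks[1:]):
            log_prob += math.log(p(w, prev))
            N += 1
    return math.exp(-log_prob / N)
```

A lower value means the model was less surprised by the test set, matching the intuition on the previous slide.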

Lower Perplexity Means a Better Model

— Training: 38 million words; test: 1.5 million words; WSJ
— Reported perplexities for this setup: unigram 962, bigram 170, trigram 109