<<

-TAC-AT TGA-TAT

T-ACAT TGATAT

???

Similarity vs.

Algorithms for Computational Biology Two ways of measuring the same thing: Zsuzsanna Lipt´ak 1. How similar are two strings? 2. How di↵erent are two strings? Masters in Molecular and Medical Biotechnology a.a. 2015/16, fall term 1. Similarity:thehigher the value, the closer the two strings. String Distance Measures 2. Distance:thelower the value, the closer the two strings.

2/21

Similarity vs. distance Alignment score and

Edit operations Example substitution: a becomes b,wherea = b • 6 s=TATTACTATC deletion:deletecharactera t=CATTAGTATC • insertion:insertcharactera • Often one views alignments in this way: thinking about the changes that number of equal positions: i : si = ti = 8 (out of 10) happened turning one string into the other, e.g. • |{ }| 80% similarity (s = t if 100%, i.e. if high) number of di↵erent positions: i : s = t = 2 (out of 10) ACCT ACCT-- -ACCT • |{ i 6 i }| Hamming distance 2(s = t if 0, i.e. if low) CACT --CACT CA-CT

(Note that both are defined only if s = t .) 1 insertion, | | | | 2 substitutions 2 deletions, 1 substition, 1 deletion 2 insertions

3/21 4/21

The edit distance Alignments vs. edit operations

Edit distance,alsocalledLevenshtein distance, or unit-cost edit distance (Levenshtein, 1965) Not every series of operations corresponds to an alignment: Definition The edit distance d(s, t)istheminimum number of edit operations TACAT subst GACAT del GAAT ins TGAAT ins TGATAT needed to transform s into t. • ! ! ! ! Example s=TACAT,t=TGATAT TACAT ins TGACAT subst TGATAT • ! ! TACAT subst GACAT del GAAT ins TGAAT ins TGATAT 4 edit op’s • ! ! ! ! TACAT ins TGACAT subst TGATAT 2 edit op’s • ! ! ins subst subst ins subst subst TACAT TGACAT TGAGAT TGATAT TACAT TGACAT TGAGAT TGATAT 3 edit op’s • ! ! ! • ! ! !

5/21 6/21 Examples

2 2 2 Euclidean distance on R : d(x, y)= (x1 y1) +(x2 y2) • p where x =(x1, x2), y =(y1, y2) 2 Manhattan distance on R : d(x, y)= x1 y1 + x2 y2 • | | | | Hamming distance on ⌃n: d (s, t)= i : s = t . • H { i 6 i }

Alignments vs. edit operations Alignments vs. edit operations

But every alignment corresponds to a series of operations: Not every series of operations corresponds to an alignment: match do nothing • -TAC-AT 7! TACAT subst GACAT del GAAT ins TGAAT ins TGATAT mismatch substitution • ! ! ! ! TGA-TAT • 7! gap below deletion • 7! gap on top insertion • 7! T-ACAT TACAT ins TGACAT subst TGATAT • ! ! TGATAT Example T-ACAT- TGAT-AT TACAT ins TGACAT subst TGAGAT subst TGATAT ??? • ! ! ! TACAT ins TGACAT subst TGATAT del TGATT subst TGATA ins TGATAT ! ! ! ! !

6/21 7/21

Alignments vs. edit operations Minimum length series of edit operations

Take the following scoring function: match = 0, mismatch = -1, gap = -1. If alignment corresponds to the series of operations ,then: A S We are looking for a series of operations of minimum length (=shortest): score( )= A |S| where = no. of operations in . dist(s, t) = min : is a series of operations transforming s into t |S| S {|S| S } Example subst del ins ins N.B. TACAT GACAT GAAT TGAAT TGATAT There may be more than one series of op’s of minimum length, but the • ! ! ! ! -TAC-AT length is unique. TGA-TAT

TACAT ins TGACAT subst TGATAT • ! ! T-ACAT TGATAT

8/21 9/21

Exercises on edit distance What is a distance?

A distance function () on a set X is a function d : X X R s.t. ⇥ ! for all x, y, z X : 2 1. d(x, y) 0, and (d(x, y)=0 x = y) (non-negative, Exercises , identity of indiscernibles) If t is a substring of s,thenwhatisdist(s, t)? • 2. d(x, y)=d(y, x)(symmetric) What is dist(s, ✏)? • 3. d(x, y) d(x, z)+d(z, y) (triangleinequality) If we can transform s into t by using only deletions, then what can we  • say about s and t? If we can transform s into t by using only substitutions, then what • can we say about s and t?

10 / 21 11 / 21 What is a distance? The edit distance is a metric

A distance function (metric) on a set X is a function d : X X R s.t. Claim: The edit distance is a metric (distance function). ⇥ ! for all x, y, z X : 2 Proof: Let s, t, u ⌃⇤ (strings over ⌃): 1. d(x, y) 0, and (d(x, y)=0 x = y) (non-negative, 2 , 1. dist(s, t) 0: to transform s to t, we need 0 or more edit op’s. Also, identity of indiscernibles) we can transform s into t with 0 edit op’s if and only if s = t. 2. d(x, y)=d(y, x)(symmetric) 2. Since every edit operation can be inverted, we get 3. d(x, y) d(x, z)+d(z, y) (triangleinequality)  dist(s, t)=dist(t, s). Examples 3. (by contradiction) Assume that dist(s, u)+dist(u, t) < dist(s, t), and transforms s into u in dist(s, u)steps,and 0 transforms u into t in 2 2 2 S S Euclidean distance on R : d(x, y)= (x1 y1) +(x2 y2) dist(u, t) steps. Then the series of op’s (first ,then ) • S0 S S S0 p where x =(x1, x2), y =(y1, y2) transforms s into t,butisshorterthandist(s, t), a contradiction to 2 the definition of dist. Manhattan distance on R : d(x, y)= x1 y1 + x2 y2 • | | | | n Hamming distance on ⌃ : dH (s, t)= i : si = ti . • { 6 } (Exercise:ShowthattheHammingdistanceisametric.)

11 / 21 12 / 21

Computing the edit distance Computing the edit distance

We will need a DP-table (matrix) E of size (n + 1) (m + 1) ⇥ (where n = s and m = t ). Note first that we can assume that edit operations happen left-to-right. As | | | | for computing an optimal alignment, we look at what happens to the last characters. Transforming s into t can be done in one of 3 ways: Definition: E(i, j)=dist(s1 ...si , t1 ...tj )

1. transform s1 ...sn 1 into t and then delete last character of s Computation of E(i, j): 2. if sn = tm:transforms1 ...sn 1 into t1 ...tm 1 if s = t : Fill in first row and column: E(0, j)=j and E(i, 0) = i n 6 m • transform s1 ...sn 1 into 11 ...tm 1 and substitute sn with tm for i, j > 0: now E(i, j)istheminimum of 3 entries plus 1 (top and • 3. transform s into t1 ...tm 1 and insert tm left) or plus 0/plus 1, depending on whether current chars are the same or di↵erent So again we can use Dynamic Programming! return entry on bottom right E(n, m) • backtrace for shortest series of edit operations •

13 / 21 14 / 21

Algorithm for computing the edit distance Analysis

Algorithm DP algorithm for edit distance Input: strings s, t,with s = n, t = m | | | | Output: value dist(s, t) 1. for j =0tom do E(0, j) j; Space: O(nm)fortheDP-table 2. for i =1ton do E(i, 0) i; • Time: 3. for i =1ton do • computing dist(s, t): 3nm + n + m +1 O(nm) 4. for j =1tom do • (resp. O(n2)ifn = m) 2 E(i 1, j)+1 finding an optimal series of edit op’s: O(n + m) • 8 E(i 1, j 1) if si = tj (resp. O(n)ifn = m) E(i, j) min > >(E(i 1, j 1) + 1 if si = tj <> 6 E(i, j 1) + 1 > 5. return E(n, m); > :>

15 / 21 16 / 21 General cost functions General cost edit distance: di↵erent edit operations can have di↵erent cost (but some conditions must hold, e.g. cost(insert) = cost(delete), why?). Also computable with same algorithm in same time and space.

Again alignment vs. edit distance Again alignment vs. edit distance sim(s, t)vs.dist(s, t) sim(s, t)vs.dist(s, t) Recall the scoring function from before: Recall the scoring function from before: match = 0, mismatch = -1, gap = -1. Then we have: match = 0, mismatch = -1, gap = -1. Then we have:

sim(s, t)= dist(s, t) sim(s, t)= dist(s, t) (This seems obvious but it actually needs to be proved. Formal proof see Setubal & (This seems obvious but it actually needs to be proved. Formal proof see Setubal & Meidanis book, Sec. 3.6.1.) Meidanis book, Sec. 3.6.1.) General cost functions General cost edit distance: di↵erent edit operations can have di↵erent cost (but some conditions must hold, e.g. cost(insert) = cost(delete), why?). Also computable with same algorithm in same time and space.

17 / 21 17 / 21

LCS distance LCS distance Given two strings s and t, dLCS (s, t)= s + t 2LCS(s, t) | | | | LCS(s, t)=max u : u is a subsequence of s and t {| | } N.B. is the length of a longest common subsequence of s and t. There may be more than one longest common subsequence, but the length Example LCS(s, t) is unique! E.g. s0 =TAACAT,t0 =ATCTA,then LCS(s , t ) = 3, and ACA, TCA, TCT, ACT are all longest common Let s =TACATandt =TGATAT,thenwehaveLCS(s, t) = 4. 0 0 subsequences. s=TACAT,t=TGATAT LCS distance LCS-distance In the examples above, we have d (s, t)=5+6 2 4 = 3, and LCS · dLCS (s0, t0)=6+5 2 3 = 5. dLCS (s, t)= s + t 2LCS(s, t) · | | | | Exercise (*)

Example (1) Prove that dLCS is a metric. (2) Find a DP-algorithm that computes We have dLCS (s, t)=5+6 2 4 = 3. LCS(s, t). · (*) means: for particularly motivated students 18 / 21 19 / 21

Summary: Similarity and distance Summary: Similarity and distance

Similarity measures for strings sim(s, t) - score of an optimal alignment of s, t • percent similarity (only for equal length strings!) two ways of expressing the same thing (similarity vs. distance) • • similarity:thehigher the value, the more similar the strings • Distance measures for strings distance:thelower the value, the more similar the strings • optimal alignment = minimum length edit transformation edit distance () - minimum no. of edit operations • ⇠ • both computable in quadratic time and quadratic space to transform s into t • Hamming distance (only for equal length strings!) • LCS distance • (q-gram distance) •

20 / 21 21 / 21