
Introduction Basic definitions Alignment algorithms on strings Conclusion

Pairwise

Solon P. Pissis, Tomáš Flouri

Heidelberg Institute for Theoretical Studies

November 17, 2012

1 Introduction
    Introduction

2 Basic definitions
    Alphabet and strings
    Distance metrics between strings
    Alignment

3 Alignment algorithms on strings
    Global alignment
    Local alignment
    Substitution matrices
    Hamming distance

4 Conclusion
    Overview


Introduction Introduction

Sequence alignment is the process of comparing two or more strings of letters (e.g. nucleotides or amino acids) to infer their similarity. Pairwise sequence alignment is the process of comparing only two strings. Useful in dozens of biological applications; e.g. genome assembly: taking a huge number of DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated.

        1 2 3 4 5 6 7 8 9
    x = G C G A C G T C C
        | | | |     | . |
    y = G C G A - - T A C

Figure: Alignment between x = GCGACGTCC and y = GCGATAC: one mismatch at position 8 and a gap of length two inserted in y after position 4

Introduction Introduction

We focus on online sequence alignment: the sequences cannot be preprocessed to build an index on them.

There exist four main approaches to online sequence alignment:
  - algorithms based on dynamic programming (DP);
  - algorithms based on automata;
  - algorithms based on word-level parallelism; and
  - algorithms based on filtering.

We focus on algorithms based on dynamic programming.

There mainly exist two different distances for comparing two strings: the edit distance (Damerau-Levenshtein distance) and the Hamming distance.

Biological applications require modifying the algorithms that measure the distance between two strings, in order to perform two main types of sequence alignment, local and global, between nucleotide (or protein) sequences.

Introduction Introduction

    1 2 3 4 5 6 7 8 9 10 11
    C G T C C G A A G T  G
        | . | | | |
    - - T A C G A A - -  -

Table: Global alignment between x = CGTCCGAAGTG and y = TACGAA

    3 4 5 6 7 8
    T C C G A A
    | . | | | |
    T A C G A A

Table: Local alignment between x = CGTCCGAAGTG and y = TACGAA


Alphabet and strings Alphabet and strings

Definition (Alphabet) An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String) A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ*.

Definition (Length of string) The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

Alphabet and strings Alphabet and strings

We denote by x[i], for all 0 ≤ i < |x|, the letter at index i of x. We also call index i, for all 0 ≤ i < |x|, a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i − 1 in x, and that x = x[0 .. |x| − 1].

Definition (Identity between strings) The identity between any two strings x and y is defined as

    x = y

if and only if

    |x| = |y| and x[i] = y[i], for all 0 ≤ i < |x|

Alphabet and strings Alphabet and strings

Definition (Concatenation of strings) The concatenation of two strings x and y is the string of the letters of x followed by the letters of y. It is denoted by xy.

Definition (Factor of string) A string x is a factor (substring) of a string y if there exist two strings u and v, such that y = uxv.

Notice that u and v are possibly empty strings!

Definition (Occurrence of string) Let x be a non-empty string and y be a string. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y.
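The factor and occurrence definitions can be illustrated with a small sketch. The function name `occurrences` is a hypothetical helper, not part of the slides; it reports the 0-based starting positions at which x occurs in y, using the position convention defined earlier.

```python
def occurrences(x: str, y: str) -> list[int]:
    """Return every 0-based position i such that y = u x v with |u| = i."""
    if not x:
        # The definition of an occurrence requires x to be non-empty.
        raise ValueError("x must be a non-empty string")
    return [i for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x]


# x is a factor of y exactly when the list is non-empty.
print(occurrences("GA", "CGTCCGAAGTG"))
print(occurrences("A", "ACGA"))
```

A naive scan like this takes O(|x| × |y|) time in the worst case; it is only meant to make the definition concrete.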

Distance metrics between strings Distance

Definition (Distance between two strings) We say that a function δ : Σ* × Σ* → R is a distance on Σ* if the four following properties are satisfied, for every u, v ∈ Σ*:
  - Positivity: δ(u, v) ≥ 0
  - Separation: δ(u, v) = 0 if and only if u = v
  - Symmetry: δ(u, v) = δ(v, u)
  - Triangle inequality: δ(u, v) ≤ δ(u, w) + δ(w, v), for every w ∈ Σ*

The distances are defined from operations that transform x into y. Three types of elementary operations are considered:
  - substitution (sub) of a letter of x at a given position by a letter of y
  - deletion (del) of a letter of x at a given position
  - insertion (ins) of a letter of y in x at a given position

Distance metrics between strings Edit distance

We implicitly assume that the costs of edit operations are independent of the positions at which the operations are realized, and that sub(a, b) := sub(b, a) := del(a) := ins(b) := 1, for a, b ∈ Σ, a ≠ b, while sub(a, a) := 0.

Definition (Edit distance) From the elementary costs, we set

\[
\delta_E(x, y) = \min\{\text{cost of } \sigma : \sigma \in S_{x,y}\}
\]

where S_{x,y} is the set of sequences of elementary edit operations that transform x into y, and the cost of an element σ ∈ S_{x,y} is the sum of the costs of the edit operations of the sequence σ. The function δ_E is then a distance on Σ*, and it is called the edit distance (Damerau-Levenshtein distance).

Distance metrics between strings Hamming distance

Definition (Hamming distance)

The Hamming distance, denoted by δ_H, is defined for two strings of the same length as the number of positions in which the two strings possess different letters. The Hamming distance is a particular case of edit distance for which only the operation of substitution is considered. This amounts to setting del(a) = ins(a) = +∞, for each a ∈ Σ.
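As a minimal sketch of this definition, the Hamming distance can be computed by counting mismatching positions. The function name `hamming` is an illustrative choice, and rejecting strings of different lengths is an assumption that mirrors the definition being restricted to strings of the same length.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which x and y carry different letters."""
    if len(x) != len(y):
        # The Hamming distance is only defined for strings of equal length.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(x, y))


# The two factors from the local alignment example differ at one position.
print(hamming("TCCGAA", "TACGAA"))
```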

Alignment Alignment

Definition (Alignment between two strings) An alignment between x and y is a string z on the alphabet of pairs of letters, more accurately on

\[
(\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \setminus \{(\varepsilon, \varepsilon)\}
\]

whose projection on the first component is x, and the projection on the second component is y. Thus, if z is an alignment of length p between x and y, we have

\[
\begin{aligned}
z &= (x'_0, y'_0)(x'_1, y'_1) \cdots (x'_{p-1}, y'_{p-1}) \\
x &= x'_0 x'_1 \cdots x'_{p-1} \\
y &= y'_0 y'_1 \cdots y'_{p-1}
\end{aligned}
\]

where x'_i ∈ Σ ∪ {ε} and y'_i ∈ Σ ∪ {ε}, for all 0 ≤ i < p.

Alignment Example

Example
Let the string x = ACGA and the string y = ATGCTA. An alignment between x and y is

    ACG--A
    ATGCTA

    Operation            Aligned pair   Cost
    substitute A for A   (A,A)          0
    substitute T for C   (C,T)          1
    substitute G for G   (G,G)          0
    insert C             (-,C)          1
    insert T             (-,T)          1
    substitute A for A   (A,A)          0

This alignment is optimal: its cost is 3, which equals the edit distance between the two strings.
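The example can be checked with a short sketch that treats an alignment as a list of letter pairs, with "-" standing in for ε. The names `GAP`, `project` and `cost` are hypothetical, and unit costs are assumed as in the table above.

```python
GAP = "-"  # stands in for the empty string ε inside an aligned pair


def project(z: list[tuple[str, str]]) -> tuple[str, str]:
    """Recover x and y as the projections of z on each component."""
    x = "".join(a for a, _ in z if a != GAP)
    y = "".join(b for _, b in z if b != GAP)
    return x, y


def cost(z: list[tuple[str, str]], sub: int = 1, indel: int = 1) -> int:
    """Sum the costs of the aligned pairs (a matching pair costs 0)."""
    total = 0
    for a, b in z:
        if a == GAP or b == GAP:
            total += indel      # insertion or deletion
        elif a != b:
            total += sub        # substitution of different letters
    return total


z = list(zip("ACG--A", "ATGCTA"))
print(project(z), cost(z))
```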


Edit distance Edit distance

We focus on algorithms based on Dynamic Programming (DP). Let x and y be two strings of lengths m and n, respectively. The cells of the DP matrix T[0 .. m][0 .. n] can be computed by the following formula, where, positions being 0-based, the i-th letter of x is x[i − 1] and the j-th letter of y is y[j − 1]:

\[
T[i][j] =
\begin{cases}
0 & i = j = 0 \\
T[i-1][j] + \mathrm{del}(x[i-1]) & 0 < i \le m,\ j = 0 \\
T[i][j-1] + \mathrm{ins}(y[j-1]) & 0 < j \le n,\ i = 0 \\
\min \left\{
\begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{array}
\right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]
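The recurrence for T can be sketched directly in code. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `edit_distance_table` and its parameters are hypothetical, costs are taken to be independent of the letters involved, and a substitution of a letter by itself costs 0.

```python
def edit_distance_table(x: str, y: str, sub: int = 1,
                        ins: int = 1, dele: int = 1) -> list[list[int]]:
    """Fill the (m+1) x (n+1) table T; T[m][n] is the edit distance."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: delete x[0 .. i-1]
        T[i][0] = T[i - 1][0] + dele
    for j in range(1, n + 1):          # first row: insert y[0 .. j-1]
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            T[i][j] = min(T[i - 1][j - 1] + s,   # substitution or match
                          T[i - 1][j] + dele,    # deletion of x[i-1]
                          T[i][j - 1] + ins)     # insertion of y[j-1]
    return T


T = edit_distance_table("ACGA", "ATGCTA")
print(T[-1][-1])
```

Python strings are 0-based, like the strings in the slides, which is why cell (i, j) reads x[i - 1] and y[j - 1].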

Edit distance Edit distance - Example 1

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, b) := 3, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the amino acid alphabet. Applying the recurrence for T gives:

    T   -   E   R   D   A   W   C   Q   P   G   K   W   Y
    -   0   1   2   3   4   5   6   7   8   9  10  11  12
    E   1   0   1   2   3   4   5   6   7   8   9  10  11
    A   2   1   2   3   2   3   4   5   6   7   8   9  10
    W   3   2   3   4   3   2   3   4   5   6   7   8   9
    A   4   3   4   5   4   3   4   5   6   7   8   9  10
    C   5   4   5   6   5   4   3   4   5   6   7   8   9
    Q   6   5   6   7   6   5   4   3   4   5   6   7   8
    G   7   6   7   8   7   6   5   4   5   4   5   6   7
    K   8   7   8   9   8   7   6   5   6   5   4   5   6
    L   9   8   9  10   9   8   7   6   7   6   5   6   7


Optimal alignments obtained by traceback:

    E--AWACQ-GK--L    E--AWACQ-GK-L-    E--AWACQ-GKL--
    ERDAW-CQPGKWY-    ERDAW-CQPGKW-Y    ERDAW-CQPGK-WY

Edit distance Edit distance - Example 2

Let x = ACGA, y = ATGCTA, sub(a, b) := 1, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the DNA alphabet. Applying the recurrence for T gives:

    T   -   A   T   G   C   T   A
    -   0   1   2   3   4   5   6
    A   1   0   1   2   3   4   5
    C   2   1   1   2   2   3   4
    G   3   2   2   1   2   3   4
    A   4   3   3   2   2   3   3


Optimal alignments obtained by traceback:

    A--CGA    ACG--A
    ATGCTA    ATGCTA
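One way to recover an optimal alignment from the table is to walk back from cell (m, n), at each step choosing a neighbour that explains the current value. The sketch below is illustrative only: the name `align` is hypothetical, and the tie-breaking order (diagonal first, then deletion, then insertion) is an assumption, so it returns just one of the possibly several optimal alignments.

```python
def align(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1):
    """Return (distance, aligned_x, aligned_y) for one optimal alignment."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + dele
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            T[i][j] = min(T[i - 1][j - 1] + s,
                          T[i - 1][j] + dele,
                          T[i][j - 1] + ins)

    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = 0 if i > 0 and j > 0 and x[i - 1] == y[j - 1] else sub
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + s:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + dele:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return T[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(align("ACGA", "ATGCTA"))
```

With this tie-breaking order the call above reproduces the first of the two alignments shown for Example 2; a different order may return the other one.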

Edit distance Edit distance - Complexities

The first algorithm for solving this problem has been rediscovered many times in different fields (Vintsyuk, 1968; Needleman-Wunsch, 1970; Sankoff, 1972; Sellers, 1974; etc.).

  - The value of each cell of the table T depends only on its three neighbouring cells, so it is computed in O(1) time.
  - For the DP matrix T[0 .. m][0 .. n], there are m × n values computed in this way.
  - The initialization phase requires O(m + n) time.
  - Hence, table T can be computed in O(m × n) time and space.
  - If we are only interested in the edit distance between the strings (but not the alignment!), it suffices to keep two columns (or two rows) of T at a time, so the distance can be computed in O(m × n) time and O(min(n, m)) space.
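The two-row idea can be sketched as follows. The function name `edit_distance` is hypothetical and unit costs are assumed; because unit costs make the distance symmetric, the sketch swaps the strings so that each row has length min(m, n) + 1.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance using only two rows of the DP table."""
    if len(y) > len(x):
        x, y = y, x                      # keep the shorter string along the row
    prev = list(range(len(y) + 1))       # row 0 of the table
    for i, a in enumerate(x, 1):
        curr = [i]                       # first column: i deletions
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j - 1] + (a != b),   # substitution or match
                            prev[j] + 1,              # deletion
                            curr[j - 1] + 1))         # insertion
        prev = curr                      # older rows are no longer needed
    return prev[-1]


print(edit_distance("ACGA", "ATGCTA"))
```

Discarding earlier rows is what prevents the traceback, which is why this variant returns the distance but not the alignment.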

Global alignment Needleman-Wunsch algorithm & Global alignment

Needleman and Wunsch re-formulated the edit distance problem (Damerau, 1964; Levenshtein, 1966) in terms of maximizing similarity (Needleman and Wunsch, 1970). Sellers later showed that the two problems are equivalent (Sellers, 1974).

The notion of distance between two strings is not suitable for biological applications. We rather use a notion of similarity between strings, in which dissimilarities are penalized and similarities are favored; i.e. sub(a, a) > 0, sub(a, b) < 0, ins(a) < 0, del(a) < 0, for a, b ∈ Σ, a ≠ b. Maximizing this similarity over alignments of the whole of x with the whole of y is known as the Needleman-Wunsch algorithm for global sequence alignment.
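A minimal sketch of the similarity formulation is given below. The function name `nw_score` and the default scores (match +1, mismatch −1, gap −1) are assumptions chosen for illustration; in practice the scores would come from a substitution matrix. The only structural changes from the edit distance recurrence are that scores replace costs and max replaces min.

```python
def nw_score(x: str, y: str, match: int = 1,
             mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of x and y under a simple scoring scheme."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap      # x[0 .. i-1] aligned against gaps
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap      # y[0 .. j-1] aligned against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    return S[m][n]


# With match = 0 and all penalties -1, the score is the negated edit distance.
print(nw_score("ACGA", "ATGCTA", match=0, mismatch=-1, gap=-1))
```

The last call uses match = 0, which falls outside the condition sub(a, a) > 0 stated above; it is included only to show one simple instance of the correspondence between distance minimization and similarity maximization.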

Local alignment Smith-Waterman algorithm & Local alignment

Instead of considering a global alignment between x and y, in molecular biology it is often more relevant to determine a best alignment between a substring of x and a substring of y. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

As for global alignment, the notion of distance between two strings is not suitable for biological applications; we rather use a notion of similarity between strings, in which dissimilarity is penalized and similarity is favored. The search for similar substrings then consists in maximizing the similarity between them. This is known as the Smith-Waterman algorithm for local sequence alignment.

Local alignment Smith-Waterman algorithm & Local alignment

Let x and y be two strings of lengths m and n, respectively. The cells of the DP matrix S[0 .. m][0 .. n] are computed by the following formula, where, positions being 0-based, the i-th letter of x is x[i − 1] and the j-th letter of y is y[j − 1]:

\[
S[i][j] =
\begin{cases}
0 & i = j = 0 \\
0 & 0 < i \le m,\ j = 0 \\
0 & 0 < j \le n,\ i = 0 \\
\max \left\{
\begin{array}{l}
0 \\
S[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
S[i-1][j] + \mathrm{del}(x[i-1]) \\
S[i][j-1] + \mathrm{ins}(y[j-1])
\end{array}
\right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]
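The recurrence for S can be sketched as follows. The function name `sw_table` is hypothetical, and the default scores (match +1, mismatch −3, gap −1) are assumptions taken from the example used in these slides. The first row and column stay at 0, and no cell is allowed to drop below 0.

```python
def sw_table(x: str, y: str, match: int = 1,
             mismatch: int = -3, gap: int = -1) -> list[list[int]]:
    """Fill the (m+1) x (n+1) local-alignment score table S."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # borders are already 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                     # start a new local alignment
                          S[i - 1][j - 1] + s,   # substitution or match
                          S[i - 1][j] + gap,     # deletion of x[i-1]
                          S[i][j - 1] + gap)     # insertion of y[j-1]
    return S


S = sw_table("EAWACQGKL", "ERDAWCQPGKWY")
print(max(max(row) for row in S))
```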

Local alignment Smith-Waterman algorithm & Local alignment

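Recall the formula for the global alignment! Compared with it, the local-alignment recurrence sets the first row and column to 0, replaces min by max, and adds 0 as a fourth candidate value.

\[
T[i][j] =
\begin{cases}
0 & i = j = 0 \\
T[i-1][j] + \mathrm{del}(x[i-1]) & 0 < i \le m,\ j = 0 \\
T[i][j-1] + \mathrm{ins}(y[j-1]) & 0 < j \le n,\ i = 0 \\
\min \left\{
\begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{array}
\right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]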

Local alignment Smith-Waterman algorithm - Example

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, a) := 1, sub(a, b) := −3, ins(a) := del(a) := −1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the amino acid alphabet. Applying the recurrence for S gives:

    S   -   E   R   D   A   W   C   Q   P   G   K   W   Y
    -   0   0   0   0   0   0   0   0   0   0   0   0   0
    E   0   1   0   0   0   0   0   0   0   0   0   0   0
    A   0   0   0   0   1   0   0   0   0   0   0   0   0
    W   0   0   0   0   0   2   1   0   0   0   0   1   0
    A   0   0   0   0   1   1   0   0   0   0   0   0   0
    C   0   0   0   0   0   0   2   1   0   0   0   0   0
    Q   0   0   0   0   0   0   1   3   2   1   0   0   0
    G   0   0   0   0   0   0   0   2   1   3   2   1   0
    K   0   0   0   0   0   0   0   1   0   2   4   3   2
    L   0   0   0   0   0   0   0   0   0   1   3   2   1

Local alignment Smith-Waterman algorithm - Example

1. Locate one among the equally largest values in table S.
2. Trace back the path from the cell of this value by following the largest value of the neighbor cells.
3. Stop the scan on a zero value.

For the example above, the largest value is S[8][10] = 4, and the traceback yields the local alignment:

AWACQ-GK
AW-CQPGK
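The table fill and the three traceback steps can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name and default parameters are assumptions taken from the example (match 1, mismatch −3, gap −1), and the traceback follows whichever move produced the current cell value, preferring the diagonal on ties.

```python
def smith_waterman(x, y, match=1, mismatch=-3, gap=-1):
    """Return (best score, aligned x, aligned y, table S) for a local alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s,   # substitution or match
                          S[i - 1][j] + gap,     # deletion of x[i]
                          S[i][j - 1] + gap)     # insertion of y[j]
            if S[i][j] > best:                   # step 1: remember a largest cell
                best, bi, bj = S[i][j], i, j

    ax, ay = [], []
    i, j = bi, bj
    while S[i][j] > 0:                           # step 3: stop on a zero value
        s = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:       # step 2: follow the producing move
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), S


score, ax, ay, S = smith_waterman("EAWACQGKL", "ERDAWCQPGKWY")
print(score)  # 4
print(ax)     # AWACQ-GK
print(ay)     # AW-CQPGK
```

The whole table is kept in memory here, so the sketch uses O(m × n) time and space.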

Local alignment Smith-Waterman algorithm - Complexities

A naive implementation of the Smith-Waterman algorithm to compute an optimal alignment requires O(m² × n) time and O(m × n) space (Smith and Waterman, 1981).

An improved version of this algorithm requires O(m × n) time (Gotoh, 1982).

An improved version of Gotoh's algorithm requires O(max(m, n)) space (Myers and Miller, 1988).

This was inspired by Hirschberg's 1975 paper on computing longest common subsequences in linear space (Hirschberg, 1975); see the Unix command diff for details.

The Hirschberg algorithm is based on the divide-and-conquer principle: it divides the DP matrix into smaller parts and solves each of these parts separately.

The same idea can be directly applied to DP-based global alignment algorithms!
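To give a feeling for where the space saving comes from, the sketch below computes only the optimal local score while keeping two rows of the table. This is not the Myers and Miller or Hirschberg algorithm: recovering the alignment itself in linear space additionally needs the divide-and-conquer step described above. The function name and scoring defaults are assumptions carried over from the earlier example.

```python
def local_score_linear_space(x, y, match=1, mismatch=-3, gap=-1):
    """Best local alignment score using O(min(m, n)) extra space (score only)."""
    if len(y) > len(x):
        # The scoring scheme is symmetric, so the strings can be swapped
        # to keep the shorter one along the stored row.
        x, y = y, x
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        curr = [0]                      # column 0 of every row is 0
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            curr.append(max(0,
                            prev[j - 1] + s,     # diagonal
                            prev[j] + gap,       # from the row above
                            curr[j - 1] + gap))  # from the left
        best = max(best, max(curr))
        prev = curr                     # only the previous row is ever needed
    return best


print(local_score_linear_space("EAWACQGKL", "ERDAWCQPGKWY"))  # 4
```

Each cell depends only on the current and the previous row, which is why two rows are enough for the score.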

Substitution matrices Substitution matrices

A substitution matrix describes the rate at which one character in a sequence changes to other character states over time.

Usually seen in the context of amino acid or DNA sequence alignments.

The similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix.

For example, in the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations.

Substitution matrices Substitution matrices

A BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of protein sequences.

Several sets of BLOSUM matrices exist, built from different alignment databases and named with numbers.

BLOSUM matrices were first introduced by Henikoff and Henikoff (Henikoff and Henikoff, 1992).

All BLOSUM matrices are based on observed (empirical) local alignments.

BLOSUM matrices with high numbers, e.g. BLOSUM80, are designed for comparing closely related sequences, while those with low numbers, e.g. BLOSUM45, are designed for comparing distantly related sequences.

Substitution matrices Substitution matrices – BLOSUM

The BLOCKS database contains multiple alignments of conserved regions in protein families.

Henikoff and Henikoff scanned the BLOCKS database for highly conserved regions of protein families that do not have gaps in the sequence alignment.

Then they counted the relative frequencies of amino acids and their substitution probabilities.

Finally, they calculated a log-odds score for each of the 210 possible substitutions of the 20 standard amino acids.

Substitution matrices Substitution matrices – BLOSUM

Scores within a BLOSUM matrix are log-odds scores: for an alignment, each score is the logarithm of the ratio between the likelihood of two amino acids being aligned for a biological reason and the likelihood of the same amino acids being aligned by chance.

Every possible match or substitution is assigned a score based on its observed frequencies in the alignment of related proteins.

A positive score is given to the more likely substitutions, while a negative score is given to the less likely substitutions.
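The log-odds idea can be illustrated with a small Python sketch. The function name and the frequencies in the usage lines are hypothetical toy values, not data from the BLOCKS database. The scaling factor of 2 with a base-2 logarithm (half-bit units, rounded to the nearest integer) follows the convention commonly described for BLOSUM62.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score of an aligned amino acid pair.

    observed: frequency of the pair in alignments of related proteins
    expected: frequency of the pair expected by chance
    scale:    2.0 gives half-bit units (an assumption stated in the lead-in)
    """
    return round(scale * math.log2(observed / expected))


# Toy values (hypothetical): a pair seen twice as often as expected by chance
# gets a positive score, a pair seen half as often gets a negative one.
print(log_odds_score(0.02, 0.01))   # 2
print(log_odds_score(0.005, 0.01))  # -2
```

A pair observed exactly as often as expected by chance scores 0.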

Substitution matrices Substitution matrices – BLOSUM62

-   C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
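A matrix like this is used as the sub function of the alignment recurrences. The Python sketch below scores an ungapped alignment using only a handful of entries copied from the table above. The dictionary name, the helper names and the choice of entries are illustrative assumptions, and a real program would load the full matrix.

```python
# A few entries copied from the BLOSUM62 table above (excerpt only).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("W", "W"): 11, ("C", "C"): 9, ("Q", "Q"): 5,
    ("G", "G"): 6, ("K", "K"): 5, ("D", "E"): 2, ("K", "R"): 2,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up the score of a pair, trying both orders of the two letters."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]  # raises KeyError for pairs outside the excerpt


def ungapped_score(u, v, matrix=BLOSUM62_EXCERPT):
    """Sum of pair scores over two equal-length strings aligned without gaps."""
    if len(u) != len(v):
        raise ValueError("ungapped alignment needs strings of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(u, v))


print(ungapped_score("AWCQ", "AWCQ"))  # 4 + 11 + 9 + 5 = 29
print(ungapped_score("EK", "DR"))      # 2 + 2 = 4
```

Note how a rare residue such as W contributes far more to a match (11) than a common one such as A (4).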

Hamming distance Hamming distance

Let x and y be two strings of lengths m and n, respectively, and let k be the maximum number of allowed errors (maximum Hamming distance).

The computation of the cells of the DP matrix D[0 .. m][0 .. n] is described by the following formula:

\[
D[i][j] =
\begin{cases}
k + 1 & : 0 < i \le m,\ j = 0 \\
0 & : 0 \le j \le n,\ i = 0 \\
D[i-1][j-1] & : x[i] = y[j],\ 0 < i \le m,\ 0 < j \le n \\
D[i-1][j-1] + 1 & : x[i] \ne y[j],\ 0 < i \le m,\ 0 < j \le n
\end{cases}
\]

Hamming distance Hamming distance - Example

Let x = ADBBCA, y = ADCABCAABADBBCA, and k = 3.

D  -  A  D  C  A  B  C  A  A  B  A  D  B  B  C  A
-  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
A  4  0  1  1  0  1  1  0  0  1  0  1  1  1  1  0
D  4  5  0  2  2  1  2  2  1  1  2  0  2  2  2  2
B  4  5  6  1  3  2  2  3  3  1  2  3  0  2  3  3
B  4  5  6  7  2  3  3  3  4  3  2  3  3  0  3  4
C  4  5  6  6  8  3  3  4  4  5  4  3  4  4  0  4
A  4  4  6  7  6  9  4  3  4  5  5  5  4  5  5  0

The last row holds two values not exceeding k, D[6][7] = 3 and D[6][15] = 0, which correspond to the following two alignments of x against substrings of y:

DCABCA    ADBBCA
...|||    ||||||
ADBBCA    ADBBCA
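The recurrence for D and the reading of its last row can be sketched in Python as follows. The function name and the return format are illustrative assumptions. End positions are reported 1-indexed, as in the table.

```python
def hamming_occurrences(x, y, k):
    """Return ([(end position j, distance)], table D) for all alignments of x
    against a length-m substring of y with at most k mismatches."""
    m, n = len(x), len(y)
    # Row 0 is all zeros; column 0 of every other row is k + 1, so that
    # alignments running off the left end of y always exceed k.
    D = [[0] * (n + 1)] + [[k + 1] + [0] * n for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Only the diagonal neighbour is used: substitutions, no gaps.
            D[i][j] = D[i - 1][j - 1] + (x[i - 1] != y[j - 1])
    hits = [(j, D[m][j]) for j in range(1, n + 1) if D[m][j] <= k]
    return hits, D


hits, D = hamming_occurrences("ADBBCA", "ADCABCAABADBBCA", 3)
print(hits)  # [(7, 3), (15, 0)]
```

Since each cell depends only on its diagonal neighbour, the same values could also be computed by comparing x directly against every length-m window of y. The table form is kept here to mirror the formula.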


Overview Overview

Pairwise sequence alignment is the process of comparing two strings of letters to infer their similarity.

There exist two main distances for comparing two strings: the edit distance and the Hamming distance.

A different formulation of the edit distance is to maximize the similarity of the two strings, rather than minimize the distance between them: global alignment (Needleman-Wunsch algorithm).

Instead of considering a global alignment between two strings, in biological applications it is often more relevant to determine a best alignment between substrings of the two strings: local alignment (Smith-Waterman algorithm).

The Hamming distance is a particular case of edit distance for which only the operation of substitution is considered.

References

N. Alachiotis, S. Berger, and A. Stamatakis. Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics, 13:196, 2012.

F. J. Damerau. A technique for computer detection and correction of spelling errors. Commun. ACM, 7(3):171–176, 1964.

M. Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2):156–161, 2007.

O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705–708, 1982.

S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.

D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, 1975.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, Soviet Physics Doklady, 1966.

W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig. Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE Trans. Parallel Distrib. Syst., 18(9):1270–1281, 2007.

S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(S-2), 2008.

E. W. Myers and W. Miller. Optimal alignments in linear space. Computer Applications in the Biosciences, 4(1):11–17, 1988.

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

T. Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12:221, 2011.

T. Rognes and E. Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

D. Sankoff. Matching Sequences under Deletion/Insertion Constraints. Proceedings of the National Academy of Sciences of the United States of America, 69(1):4–6, 1972.

P. H. Sellers. On the theory and computation of evolutionary distances. SIAM Journal on Applied Mathematics, 26(4):787–793, 1974.

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

T. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4:52–57, 1968.