Pairwise Sequence Alignment

Solon P. Pissis, Tomáš Flouri
Heidelberg Institute for Theoretical Studies
November 17, 2012

Contents

1 Introduction
     Introduction
2 Basic definitions
     Alphabet and strings
     Distance metrics between strings
     Alignment
3 Alignment algorithms on strings
     Edit distance
     Global alignment
     Local alignment
     Substitution matrices
     Hamming distance
4 Conclusion
     Overview

1 Introduction

Sequence alignment is the process of comparing two or more strings of letters (e.g. nucleotides or amino acids) to infer their similarity. Pairwise sequence alignment is the process of comparing only two strings.

Pairwise alignment is useful in dozens of biological applications, e.g. genome assembly: taking a huge number of DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated.

        123456789
    x = GCGACGTCC
        ||||  |.|
    y = GCGA--TAC

Figure: Alignment between x = GCGACGTCC and y = GCGATAC: one mismatch at position 8 and a gap of length two inserted in y after position 4.

We focus on online sequence alignment, in which the sequences cannot be preprocessed to build an index on them. There exist four main approaches to online sequence alignment: algorithms based on dynamic programming (DP); algorithms based on automata; algorithms based on word-level parallelism; and algorithms based on filtering. We focus on algorithms based on dynamic programming.

There mainly exist two different distances for comparing two strings: the edit distance (Damerau-Levenshtein distance) and the Hamming distance.
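As a rough illustration of the two distances just named, the following Python sketch computes a Hamming distance and a dynamic-programming edit distance. It rests on assumptions not stated in the slides: unit costs, and an edit distance that counts only insertions, deletions and substitutions (the plain Levenshtein variant, without the transpositions of the Damerau-Levenshtein distance). The function names are chosen for this example only.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    # Before processing x[i-1], prev[j] is the distance between x[:i-1] and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty string
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming_distance("TCCGAA", "TACGAA"))
    print(edit_distance("GCGACGTCC", "GCGATAC"))
```

Under these assumptions the edit distance between GCGACGTCC and GCGATAC is 3, which agrees with the alignment in the figure: two gap columns and one mismatch. Only two rows of the DP table are kept, so memory use is proportional to |y| rather than to |x| · |y|.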
Biological applications require the modification of algorithms measuring the distance between two strings in order to perform mainly two types of sequence alignment, local and global, between nucleotide (or protein) sequences.

     1  2  3  4  5  6  7  8  9 10 11
     C  G  T  C  C  G  A  A  G  T  G
           |  .  |  |  |  |
     -  -  T  A  C  G  A  A  -  -  -

Table: Global alignment between x = CGTCCGAAGTG and y = TACGAA

     3  4  5  6  7  8
     T  C  C  G  A  A
     |  .  |  |  |  |
     T  A  C  G  A  A

Table: Local alignment between x = CGTCCGAAGTG and y = TACGAA
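The two tables can be checked mechanically. The Python sketch below takes an alignment that is already given as two gapped rows and rebuilds the marker line, counting matches, mismatches and gap columns. It does not compute an alignment itself. The function name `summarize_alignment` and the use of "-" as the gap symbol are assumptions made for this example.

```python
def summarize_alignment(row_x: str, row_y: str, gap: str = "-"):
    """Return (marker line, matches, mismatches, gap columns) for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same number of columns")
    markers = []
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            markers.append(" ")   # a gap column carries no marker
            gaps += 1
        elif a == b:
            markers.append("|")   # identical letters
            matches += 1
        else:
            markers.append(".")   # mismatch
            mismatches += 1
    return "".join(markers), matches, mismatches, gaps


if __name__ == "__main__":
    # Global alignment: all of y is aligned against all of x, padded with gaps.
    print(summarize_alignment("CGTCCGAAGTG", "--TACGAA---"))
    # Local alignment: only the part of x at positions 3 to 8 is aligned with y.
    print(summarize_alignment("TCCGAA", "TACGAA"))
```

Both alignments share the same five matches and one mismatch. They differ only in the five gap columns that the global alignment needs to cover the rest of x.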
2 Basic definitions

Alphabet and strings

Definition (Alphabet) An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String) A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ*.

Definition (Length of string) The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

We denote by x[i], for all 0 ≤ i < |x|, the letter at index i of x. We also call index i, for all 0 ≤ i < |x|, a position in x when x ≠ ε. It follows that the ith letter of x is the letter at position i − 1 in x, and that x = x[0 .. |x| − 1].
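The notation maps directly onto zero-based string indexing in most programming languages. The following Python sketch is one possible illustration. The DNA alphabet `SIGMA` and the helper names are assumptions made for this example, not part of the definitions.

```python
SIGMA = {"A", "C", "G", "T"}  # assumed example alphabet (DNA nucleotides)


def is_string_over(x: str, alphabet: set) -> bool:
    """True if every letter of x belongs to the alphabet; the empty string qualifies."""
    return all(letter in alphabet for letter in x)


def ith_letter(x: str, i: int) -> str:
    """The i-th letter of x (counting from 1), i.e. the letter at position i - 1."""
    if not 1 <= i <= len(x):
        raise IndexError("i must satisfy 1 <= i <= |x|")
    return x[i - 1]


if __name__ == "__main__":
    x = "GCGATAC"
    print(is_string_over(x, SIGMA))  # is x a string on SIGMA?
    print(len(x))                    # |x|
    print(ith_letter(x, 1), x[0])    # the first letter sits at position 0
    print(x == x[0:len(x)])          # x = x[0 .. |x| - 1]; Python slices exclude the end index
```

Note that the interval x[0 .. |x| − 1] is inclusive at both ends, whereas the Python slice `x[0:len(x)]` excludes its end index. Both describe the whole string.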
Definition (Identity between strings) The identity between any two strings x and y is defined as x = y if and only if |x| = |y| and x[i] = y[i], for all 0 ≤ i < |x|.

Definition (Concatenation of strings) The concatenation of two strings x and y is the string of the letters of x followed by the letters of y. It is denoted by xy.

Definition (Factor of string) A string x is a factor (substring) of a string y if there exist two strings u and v, such that y = uxv. Notice that u and v are possibly empty strings!

Definition (Occurrence of string) Let x be a
