Lecture notes Faculty of Technology, Bielefeld University

Winter 2015/16

Contents

1 Overview
   1.1 Application Areas of Sequence Analysis
   1.2 A Small Selection of Problems on Sequences

2 Basic Definitions
   2.1 Sets and Basic Combinatorics
   2.2 Review of Elementary Probability Theory
   2.3 Asymptotics
   2.4 Alphabets and Sequences

3 Metrics on Sequences
   3.1 Problem Motivation
   3.2 Definition of a Metric
   3.3 Transformation Distances
   3.4 Metrics on Sequences of the Same Length
   3.5 Edit Distances for Sequences
   3.6 An Efficient Algorithm to Compute Edit Distances
   3.7 The q-gram Distance
   3.8 The Maximal Matches Distance
   3.9 Filtering

4 Pairwise Alignment
   4.1 Definition of Alignment
   4.2 The Alignment Score
   4.3 The Alignment Graph
   4.4 A Universal Alignment Algorithm
   4.5 Alignment Types: Global, Free End Gaps, Local
   4.6 Score and Cost Variations for Alignments


5 Advanced Topics in Pairwise Alignment
   5.1 Suboptimal Alignments
   5.2 Approximate String Matching
   5.3 The Forward-Backward Technique
   5.4 Pairwise Alignment in Linear Space

6 Pairwise Alignment in Practice
   6.1 Alignment Visualization with Dot Plots
   6.2 Fundamentals of Rapid Database Search Methods
   6.3 BLAST: An On-line Database Search Method

7 Suffix Trees
   7.1 Motivation
   7.2 An Informal Introduction to Suffix Trees
   7.3 A Formal Introduction to Suffix Trees
   7.4 Suffix Tree Construction: The WOTD Algorithm
   7.5 Suffix Tree Construction: Ukkonen’s Linear-Time Online Algorithm
   7.6 Applications of Suffix Trees
      7.6.1 Exact String Matching
      7.6.2 The Shortest Unique Substring
      7.6.3 Maximal Repeats
      7.6.4 Maximal Unique Matches

8 Suffix Arrays
   8.1 Motivation
   8.2 Basic Definitions
   8.3 Suffix Array Construction Algorithms
      8.3.1 Linear-Time Construction using a Suffix Tree
      8.3.2 Direct Construction
      8.3.3 Construction of the rank and lcp Arrays
   8.4 Applications of Suffix Arrays

9 Burrows-Wheeler Transformation
   9.1 Introduction
   9.2 Transformation and Retransformation
   9.3 Exact String Matching
   9.4 Other Applications
      9.4.1 Compression with Run-Length Encoding

10 Multiple Sequence Alignment
   10.1 Basic Definitions
   10.2 Why multiple sequence comparison?
   10.3 Multiple Alignment Scoring Functions
   10.4 Multiple Alignment Problems
   10.5 Digression: NP-completeness
   10.6 A Guide to Multiple Sequence Alignment Algorithms

11 Algorithms for Sum-of-Pairs Multiple Alignment
   11.1 An Exact Algorithm


      11.1.1 The Basic Algorithm
      11.1.2 Variations of the Basic Algorithm
   11.2 Carrillo and Lipman’s Search Space Reduction
   11.3 The Center-Star Approximation
   11.4 Divide-and-Conquer Alignment

12 Tree Alignment Algorithms
   12.1 Sankoff’s Algorithm
   12.2 Generalized Tree Alignment
      12.2.1 Greedy Three-Way Tree Alignment Construction
      12.2.2 The Deferred Path Heuristic

13 Whole Genome Alignment
   13.1 Filter Algorithms
   13.2 Pairwise Genome Alignment (MUMmer 1 & 2)
   13.3 Multiple Genome Alignment (MGA and MUMmer 3)
   13.4 Multiple Genome Alignment with Rearrangements (MAUVE)

A Distances versus Similarity Measures on Sequences
   A.1 Biologically Inspired Distances
   A.2 From Distance to Similarity
   A.3 Log-Odds Score Matrices

B Pairwise Sequence Alignment (Extended Material)
   B.1 The Number of Global Alignments

C Pairwise Alignment in Practice (Extended Material)
   C.1 Fast Implementations of the Smith-Waterman Algorithm
   C.2 FASTA: An On-line Database Search Method
   C.3 Index-based Database Search Methods
   C.4 Software

D Alignment Statistics
   D.1 Preliminaries
   D.2 Statistics of q-gram Matches and FASTA Scores
   D.3 Statistics of Local Alignments

E Basic RNA Secondary Structure Prediction
   E.1 Introduction
   E.2 The Optimization Problem
   E.3 Context-Free Grammars
   E.4 The Nussinov Algorithm
   E.5 Further Remarks

F Suffix Tree (Extended Material)
   F.1 Memory Representations of Suffix Trees

G Advanced Topics in Pairwise Alignment (Extended Material)
   G.1 Length-Normalized Alignment


      G.1.1 A Problem with Alignment Scores
      G.1.2 A Length-Normalized Scoring Function for Local Alignment
      G.1.3 Finding an Optimal Normalized Alignment
   G.2 Parametric Alignment
      G.2.1 Introduction to Parametric Alignment
      G.2.2 Theory of Parametric Alignment
      G.2.3 Solving the Ray Search Problem
      G.2.4 Finding the Polygon Containing p
      G.2.5 Filling the Parameter Space
      G.2.6 Analysis and Improvements
      G.2.7 Parametric Alignment in Practice

H Multiple Alignment in Practice: Mostly Progressive
   H.1 Progressive Alignment
      H.1.1 Aligning Two Alignments
   H.2 Segment-Based Alignment
      H.2.1 Segment Identification
      H.2.2 Segment Selection and Assembly
   H.3 Software for Progressive and Segment-Based Alignment
      H.3.1 The Program Clustal W
      H.3.2 T-COFFEE
      H.3.3 DIALIGN
      H.3.4 MUSCLE
      H.3.5 QAlign

Bibliography

Preface

At Bielefeld University, elements of sequence analysis are taught in several courses, starting with elementary pattern matching methods in “Algorithms and Data Structures” in the first semester. The present three-hour course “Sequence Analysis” is taught in the third semester and is extended by a practical course in the fourth semester.

Prerequisites. It is assumed that the student has had some exposure to algorithms and mathematics, as well as to elementary facts of molecular biology. The following topics are required to understand some of the material here:

• exact string matching algorithms (e.g. the naive method, Boyer-Moore, Boyer-Moore-Horspool, Knuth-Morris-Pratt),

• comparison-based sorting (e.g. insertion sort, mergesort, heapsort, quicksort),

• asymptotic complexity analysis (O notation).

These topics are assumed to have been covered before this course, and the first two do not appear in these notes. In Section 2.3 the O notation is briefly reviewed; more advanced complexity topics are discussed in Section 10.5.

Advanced topics. Because the material is taught in one three-hour course, some advanced sequence analysis techniques are not covered. These include:

• fast algorithms for approximate string matching (e.g. bit-shifting),

• advanced filtering methods for approximate string matching (e.g. gapped q-grams),

• efficient methods for concave gap cost functions in pairwise alignment,

v Contents

• in-depth discussion of methods used in practice. Using existing software to solve a particular problem generally requires hands-on experience, which can be conveyed in the “Sequence Analysis practical course”. Some of the theory behind such methods is covered in the Appendix of these notes.

Suggested Reading

Details about the recommended textbooks can be found in the bibliography. We suggest that students take note of the following ones.

• Gusfield (1997) published one of the first textbooks on sequence analysis. Nowadays, some of the algorithms described therein have been replaced by better and simpler ones. A revised edition would be very much appreciated, but it is still the fundamental reference for sequence analysis courses.

• Another good sequence analysis book that places more emphasis on probabilistic models was written by Durbin et al. (1998).

• An even more mathematical style can be found in the more recent book by Waterman et al. (2005).

• A more textual and less formal approach to sequence analysis is presented by Mount (2004). This book covers a lot of ground in bioinformatics and is a useful companion until the Master’s degree.

• For the practically inclined who want to learn about the actual tools that implement some of the algorithms discussed in this course, the above book or the “Dummies” book by Claverie and Notredame (2007) is recommended.

• The classic algorithms textbook of Cormen et al. (2001) should be part of every student’s personal library. While it does not cover much of sequence analysis, it is a useful reference to look up elementary data structures, O notation, and basic probability theory. It also contains a chapter on dynamic programming. This is one of the best books available for computer scientists.

These lecture notes are an extended re-write of previous versions by Robert Giegerich, Stefan Kurtz (now in Hamburg), Enno Ohlebusch (now in Ulm), and Jens Stoye. This version contains more explanatory text and should be, to some degree, suitable for self study. Sven Rahmann, July 2007

Over the last years, these lecture notes have been steadily improved. Besides minor corrections, many examples have been integrated and some parts were even completely rewritten. We want to thank Robert Giegerich for his comments and Eyla Willing for incorporating all the changes. Peter Husemann, Jens Stoye and Roland Wittler, September 2010

We want to thank Katharina Westerholt for writing a new chapter on the Burrows-Wheeler Transformation. Jens Stoye, September 2011


Until September 2012, “Sequence Analysis I” and “Sequence Analysis II” were taught as two-hour courses in the third and fourth semesters, accompanied by a two-hour practical course in the fourth semester. With the start of the winter semester 2012/13 and the new study regulations, the lecture, now named “Sequence Analysis”, is a three-hour course, only taught in the third semester, accompanied by an extended four-hour practical course in the fourth semester. Because there is now less time for the lecture and more time for the practical course, some topics previously covered in the lecture have been shifted to the practical course. These lecture notes still include all topics, but the order has changed. Topics not discussed in the lecture have been moved to the Appendix. In some chapters, just a few sections are moved to the Appendix; these chapters are marked as “Extended Material”. At the beginning of the original chapters, the context of the extended material is summarized. Linda Sundermann and Jens Stoye, October 2013


CHAPTER 1

Overview

1.1 Application Areas of Sequence Analysis

Sequences (or texts, strings, words, etc.) over a finite alphabet are a natural way to encode information. The following incomplete list presents some application areas of sequences of different kinds.

• Molecular biology (the focus of this course):

  Molecule   Example              Alphabet          Length
  DNA        ...AACGACGT...       4 nucleotides     ≈ 10^3–10^9
  RNA        ...AUCGGCUU...       4 nucleotides     ≈ 10^2–10^3
  Protein    ...LISAISTNETT...    20 amino acids    ≈ 10^2–10^3

  The main idea behind biological sequence comparison is that an evolutionary relationship implies structural and functional similarity, which again implies sequence similarity.

• Phonetics: English: 40 phonemes; Japanese: 113 “morae” (syllables)

• Spoken language, bird song: discrete multidimensional data (e.g. frequency, energy) over time

• Graphics: An image is a two-dimensional “sequence” of (r, g, b)-vectors with r, g, b ∈ [0, 255] to encode color intensities for red, green and blue of a screen pixel.

• Text processing: Texts are encoded (ASCII, Unicode) sequences of numbers.


• Information transmission: A sender sends binary digits (bits) over a (possibly noisy) channel to a receiver.

• Internet, Web pages

1.2 A Small Selection of Problems on Sequences

The “inform” in (bio-)informatics comes from information. Google, Inc. states that its mission is “to organize the world’s information and make it universally accessible and useful”. Information is often stored and addressed sequentially. Therefore, to process information, we need to be able to process sequences. Here is a small selection of problems on sequences.

1. Sequence comparison: Quantify the (dis-)similarity between two or more sequences and point out where they are particularly similar or different.

2. Exact string matching: Find all positions in a sequence (called the text) where another sequence (the pattern) occurs.

3. Approximate string matching: As exact string matching, but allow some differences between the pattern and its occurrence in the text.

4. Multiple (approximate) string matching: As above, but locate all positions where any pattern of a given pattern set occurs. Often more elegant and efficient methods than searching for each pattern separately are available.

5. Regular expression matching: Find all positions in a text that match an expression constructed according to specific rules.

6. Finding best approximations (dictionary search): Given a word w, find the most similar word to w in a given set of words (the dictionary). This is useful for correcting spelling errors.

7. Repeat discovery: Find long repeated parts of a sequence. This is useful for data compression, but also in genome analysis.

8. Suffix sorting: Sort all suffixes of a given sequence lexicographically. This has many applications, including data compression and text indexing.

9. The “holy grail”: Find all interesting parts of a sequence (this of course depends on your definition of interesting); this may concern surprisingly frequent subwords, tandem repeats, palindromes, unique subwords, etc.

10. Revision and change tracking: Compare different versions of a document, highlight their changes, produce a “patch” that succinctly encodes commands that transform one version into another. Version control systems like CVS and Subversion (http://subversion.tigris.org) are based on efficient algorithms for such tasks.

11. Error-correcting and re-synchronizing codes: During information transmission, the channel may be noisy, i.e., some of the bits may be changed, or sender and receiver may get out of sync. Therefore error-correcting and synchronizing codes for bit sequences have been developed. Problems are the development of new codes, and efficient encoding and decoding algorithms.

CHAPTER 2

Basic Definitions

Contents of this chapter: Sets of numbers. Elementary set combinatorics. Alphabet. Sequence = string = word. Substring. Subsequence. Number of substrings and subsequences. Asymptotic growth of functions. Landau symbols. Logarithms. Probability vector. Uniform / Binomial / Poisson distribution.

2.1 Sets and Basic Combinatorics

Sets of numbers. The following sets of numbers are commonly known.

• N := {1, 2, 3,... } is the set of natural numbers.

• N_0 := {0} ∪ N additionally includes zero.

• Z := {0, 1, −1, 2, −2, 3, −3, . . . } is the set of integers.

• Q := {n/d : n ∈ Z, d ∈ N} is the set of rational numbers, i.e., fractions of a numerator n and a denominator d. Q_0^+ is the set of nonnegative rational numbers.

• R (R_0^+) is the set of (nonnegative) real numbers.

• C := {x + iy : x ∈ R, y ∈ R} is the set of complex numbers. Here i = √−1 is the imaginary unit.

The absolute value or modulus of a number x is its distance from the origin and denoted by |x|, e.g. |−5| = |5| = |3 + i4| = 5.

An interval is a set of consecutive numbers and is written as follows:

• [a, b] := {x ∈ R : a ≤ x ≤ b} (closed interval),

• [a, b[ := {x ∈ R : a ≤ x < b} (half-open interval),


• ]a, b] := {x ∈ R : a < x ≤ b} (half-open interval),

• ]a, b[ := {x ∈ R : a < x < b} (open interval).

Sometimes the interval notation is used for integers, too, especially when we talk about indices. So, for a, b ∈ Z, [a, b] may also mean the integer set {a, a + 1, . . . , b − 1, b}.

Elementary set combinatorics. Let S be any set. Then |S| denotes the cardinality of S, i.e., the number of elements contained in S. We symbolically write |S| := ∞ if the number of elements is not a finite number.

With P(S) or 2^S we denote the power set of S, i.e., the set of all subsets of S. For each element of S, there are two choices if a subset is formed: it can be included or not. Thus the number of different subsets is |P(S)| = |2^S| = 2^|S|.

To more specifically compute the number of k-element subsets of an n-element set, we introduce the following notation:

• $n^k := n \cdot \ldots \cdot n$ (ordinary power, k factors),

• $n^{\underline{k}} := n \cdot (n-1) \cdot \ldots \cdot (n-k+1)$ (falling power, k factors),

• $n^{\overline{k}} := n \cdot (n+1) \cdot \ldots \cdot (n+k-1)$ (rising power, k factors),

• $n! := n^{\underline{n}} = 1^{\overline{n}}$ (factorial).

When choosing k elements out of n, we have n choices for the first element, n − 1 for the second one, and so on, for a total of $n^{\underline{k}}$ choices. To disregard the order among these k elements, we divide by the number of possible rearrangements or permutations of k elements; by the same argument as above these are $k^{\underline{k}} = k!$ many. It follows that there are $n^{\underline{k}}/k!$ different k-element subsets of an n-element set. This motivates the following definition of a binomial coefficient:

• $\binom{n}{k} := n^{\underline{k}}/k! = \frac{n!}{(n-k)! \cdot k!}$ (read as “n choose k”).
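The following small Python sketch (not part of the original notes; the values n = 5, k = 3 are arbitrary) illustrates the relationship between falling powers, permutations and binomial coefficients.

    from math import comb, factorial

    def falling_power(n: int, k: int) -> int:
        """n * (n-1) * ... * (n-k+1): the number of ordered k-selections from n elements."""
        result = 1
        for i in range(k):
            result *= n - i
        return result

    n, k = 5, 3
    print(falling_power(n, k))                  # 60 ordered selections
    print(falling_power(n, k) // factorial(k))  # 10 three-element subsets
    print(comb(n, k))                           # 10, the binomial coefficient "5 choose 3"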

2.2 Review of Elementary Probability Theory

A little understanding about probabilities is required to follow the subsequent chapters. A probability vector is a (row) vector with nonnegative entries whose sum equals 1. A matrix of several probability vectors on top of each other is called a stochastic matrix (naturally, its rows sum to 1). Relative frequencies can often be interpreted as probabilities.

Often, we will associate a probability or frequency with every letter in an alphabet Σ = {a1, . . . , aσ} by specifying a probability vector p = (p1, . . . , pσ). If p1 = ··· = pσ = 1/σ, we speak of the uniform distribution on Σ. Letter frequencies allow us to define a notion of random text or i.i.d. text, where each letter ak appears according to its frequency pk at any text position, independently of the other positions. Here “i.i.d.” stands for independent and identically distributed.

Then, for a fixed length q, the probability that a random word X of length q equals a given x = x1 . . . xq is

$$P(X = x) = \prod_{i=1}^{q} p_{k(x_i)},$$

where k(c) is the index k of the letter c such that c = a_k ∈ Σ. In particular, for the uniform distribution, we have that P(X = x) = 1/σ^q for all words x ∈ Σ^q.

How many times do we see a given word x of length q as a substring of a random text T of length n + q − 1? That depends on T, of course; so we can only make a statement about expected values and probabilities. At each starting position i = 1, . . . , n, the probability that the next q letters equal x is given by p_x := P(X = x) as computed above. Let us call this a success. The expected number of successes is then n · p_x. But what precisely is the probability to have exactly s successes? This is a surprisingly hard question, because successes at consecutive positions in the text are not independent: If x starts at position 17, it cannot also start at position 18 (unless it is the q-fold repetition of the same letter). However, if the word length is q = 1 (i.e., x is a single letter), we can make a stronger statement. Then the words starting at consecutive positions do not overlap and successes become independent of each other. Each text with s successes (each with probability p_x) also has n − s non-successes (each with probability 1 − p_x). Therefore each such text has probability $p_x^s \cdot (1 - p_x)^{n-s}$. There are $\binom{n}{s}$ possibilities to choose the s out of n positions where the successes occur. Therefore, if S denotes the random variable counting successes, we have

$$P(S = s) = \binom{n}{s} \cdot p_x^s \cdot (1 - p_x)^{n-s}$$

for s ∈ {0, 1, . . . , n}, and P(S = s) = 0 otherwise. This distribution is known as the Binomial distribution with parameters n and p_x. As said above, it specifies probabilities for the number s of successes in n independent trials, where each success has the same probability p_x. When n is very large and p is very small, but their product (the expected number of successes) is np = λ > 0, the above probability can be approximated by

$$P(S = s) \approx \frac{e^{-\lambda} \cdot \lambda^s}{s!},$$

as a transition to the limit (n → ∞, p → 0, np → λ) shows. (You may try to prove this as an exercise.) This distribution is called the Poisson distribution with parameter λ, which is the expected value as well as the variance. It is often a good approximation when we count the (random) number of certain rare events.
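To see the quality of the approximation numerically, one can compare both distributions for a large n and a small p. The following Python sketch does exactly that; the parameter values are arbitrary illustrations, not taken from the notes.

    from math import comb, exp, factorial

    def binomial_pmf(s: int, n: int, p: float) -> float:
        """Exact probability of s successes in n independent trials."""
        return comb(n, s) * p**s * (1 - p)**(n - s)

    def poisson_pmf(s: int, lam: float) -> float:
        """Poisson probability with parameter lam (= n*p in the approximation)."""
        return exp(-lam) * lam**s / factorial(s)

    n, p = 10_000, 0.0002          # large n, small p, so lambda = 2
    lam = n * p
    for s in range(6):
        print(s, binomial_pmf(s, n, p), poisson_pmf(s, lam))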

Example 2.1 Let s, t ∈ Σ^n and |Σ| = σ. In the i.i.d. model, what is the probability that s and t contain a particular character a the same number of times?

The probability that a is contained k times in s is $\binom{n}{k} \cdot \left(\frac{1}{\sigma}\right)^k \cdot \left(1 - \frac{1}{\sigma}\right)^{n-k}$. The same applies to t. Thus the probability that both s and t contain a exactly k times is the square of the above expression. Since k is arbitrary between 0 and n, we have to sum the probabilities over all k. Thus the answer is $\sum_{k=0}^{n} \left[ \binom{n}{k} \cdot \frac{1}{\sigma^k} \cdot \left(1 - \frac{1}{\sigma}\right)^{n-k} \right]^2$.
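The sum from Example 2.1 is easy to evaluate numerically; a possible Python sketch (string length and alphabet size chosen arbitrarily for illustration):

    from math import comb

    def prob_same_count(n: int, sigma: int) -> float:
        """Probability that two random length-n strings over an alphabet of size
        sigma contain a fixed character a equally often (i.i.d. uniform model)."""
        p = 1 / sigma
        return sum((comb(n, k) * p**k * (1 - p)**(n - k)) ** 2 for k in range(n + 1))

    print(prob_same_count(10, 4))   # e.g. two DNA strings of length 10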


2.3 Asymptotics

We will analyze several algorithms during this course. In order to formalize statements as “the running time increases quadratically with the sequence length”, we review the asymptotic “big-Oh notation” here, also known as Landau symbols.

Let f, g : N → R_0^+ be functions.

O(·): We write f(n) ∈ O(g(n)) or f(n) = O(g(n)), even though this is not an equality, if there exist n0 ∈ N and c > 0 such that f(n) ≤ c · g(n) for all n ≥ n0 (i.e., for eventually all n). In fact, O(f(n)) stands for the whole class of functions that grow at most as fast as f(n), apart from constant factors and smaller order terms.

Ω(·): We write f(n) ∈ Ω(g(n)) or f(n) = Ω(g(n)) if and only if g(n) ∈ O(f(n)).

Θ(·): If both f(n) ∈ O(g(n)) and g(n) ∈ O(f(n)), then we write f(n) = Θ(g(n)) and g(n) = Θ(f(n)). In this case, f and g are asymptotically of the same order.

∼: We say that f and g are asymptotically equivalent and write f(n) ∼ g(n) if f(n)/g(n) → 1 as n → ∞. This implies, and is stronger than, f(n) = Θ(g(n)).

o(·): We write f(n) ∈ o(g(n)) if f(n) is asymptotically of smaller order than g(n), i.e., if f(n)/g(n) → 0 as n → ∞.

ω(·): Conversely, f(n) ∈ ω(g(n)) if and only if g(n) ∈ o(f(n)).

A function in o(n) is said to be sublinear. There is another weaker definition of sublinearity, according to which already a·n + o(n) is called (weakly) sublinear if 0 < a < 1. A function in O(n^c) for some constant c ∈ N is said to be of polynomial growth; it does not need to be a polynomial itself, e.g. n^2.5 + log log n. A function in Ω(k^n) ∩ O(c^n) for constants c ≥ k > 1 is said to be of exponential growth. Functions such as n^(log n) that are no longer polynomial but not yet exponential are sometimes called of superpolynomial growth. Functions such as n! or n^n are of superexponential growth.

It is suggested to remember the following hierarchy of O(·)-classes and extend it as needed.

O(1) ⊂ O(log n) ⊂ O(√n) ⊂ O(n/log n) ⊂ O(n) ⊂ O(n log n) ⊂ O(n√n) ⊂ O(n^2) ⊂ O(n^3) ⊂ O(n^(log n)) ⊂ O(2^n) ⊂ O(3^n) ⊂ O(n!) ⊂ O(n^n)

A common intuitive misconception is that Θ(n log n) is somehow the “middle road” between Θ(n) and Θ(n^2). This is absolutely not the case. Especially when it comes to running times of algorithms, an O(n log n) algorithm is essentially as good as a Θ(n) algorithm (and often much simpler, so the constant factors hidden in the O-notation are often smaller). To develop intuition, compare the running times n^2 ∈ Θ(n^2) with 10 n log n ∈ Θ(n log n) and 100 n ∈ Θ(n) of different hypothetical algorithms.
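A short Python sketch can produce such a comparison table; the three hypothetical running-time functions are the ones named above, and the base of the logarithm is our (arbitrary) choice.

    from math import log2

    # Hypothetical running times from the text: n^2, 10*n*log n, 100*n.
    for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"n={n:>9}  n^2={n**2:>14}  10 n log n={10*n*log2(n):>14.0f}  100 n={100*n:>10}")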

It is useful to remember that logarithms with respect to different bases b, c > 1 differ only by a constant factor: log_b(x) = Θ(log_c(x)). More precisely, for any b > 1, log_b(x) = ln(x)/ln(b), where ln is the natural logarithm.


2.4 Alphabets and Sequences

Alphabets. A finite alphabet is simply a finite set; we mainly use Σ as the symbol for an alphabet. The elements of Σ are called characters, letters, or symbols. Here are some examples.

• The DNA alphabet {A, C, G, T} (adenine, cytosine, guanine, thymine)

• The puRine / pYrimidine alphabet for DNA {R, Y} (R = A or G; Y = T or C)

• The amino acid one-letter code {A, . . . , Z} \ {B, J, O, U, X, Z}; see http://en.wikipedia.org/wiki/List_of_standard_amino_acids for more information about the individual amino acids.

• The HydrophObe / hydroPhIle alphabet {O, I} or {H, P}

• The positive / negative charge alphabet {+, −}

• The IUPAC codes for DNA ({A, C, G, T, U, R, Y, M, K, W, S, B, D, H, V, N}) and protein ({A, . . . , Z} \ {J, O, U}) sequences (see http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html)

• The ASCII (American Standard Code for Information Interchange) alphabet, a 7-bit encoding (0–127) of commonly used characters in the American language (see http://en.wikipedia.org/wiki/ASCII).

• The alphanumeric subset of the ASCII alphabet {0, . . . , 9, A, . . . , Z, a, . . . , z}, encoded by the numbers 48–57, 65–90, and 97–122, respectively.

These examples show that alphabets may have very different sizes. The alphabet size σ := |Σ| is often an important parameter when we analyze the complexity of algorithms on sequences.

Sequences. By concatenation (also juxtaposition) of symbols from an alphabet, we create a sequence (also string or word). The empty sequence ε consists of no symbols. The length of a sequence s, written as |s|, is the number of symbols in it. We identify each symbol with a sequence of length 1.

We define Σ^0 := {ε} (this is: the set consisting of the empty sequence; not the empty set ∅; nor the empty sequence ε!), Σ^1 := Σ and Σ^n := {xa : x ∈ Σ^(n−1), a ∈ Σ} for n ≥ 2. Thus Σ^n is the set of words of length n over Σ; such a word s is written as s = (s1, . . . , sn) = s[1]s[2] . . . s[n]. We further define

$$\Sigma^* := \bigcup_{n \geq 0} \Sigma^n \qquad \text{and} \qquad \Sigma^+ := \bigcup_{n \geq 1} \Sigma^n$$

as the set of all (resp. all nonempty) sequences over Σ. With ←−s := s[n]s[n − 1] . . . s[1] we denote the reversal of s. We frequently use variables such as a, b, c, d for symbols from Σ, and p, s, t, u, v, w, x, y, z for sequences from Σ∗.


Substrings. If s = uvw for possibly empty sequences u, v, w from Σ∗, we call

• u a prefix of s,
• v a substring (subword) of s,
• w a suffix of s.

A prefix or suffix that is different from s is called proper. A substring v of s is called right-branching if there exist symbols a ≠ b such that both va and vb are substrings of s.

Let k ≥ 0. A k-tuple¹ (k-mer) of s is a length-k substring of s. Similarly, for q ≥ 0, a q-gram² of s is a length-q substring of s. We use the notation t s to indicate that t is a substring of s; equivalently we say that s is a superstring (superword) of t (written as s t). We write s[i . . . j] := s[i]s[i+1] . . . s[j] for the substring from position i to position j, assuming that i ≤ j. For i > j, we set s[i . . . j] := ε. We say that w ∈ Σ^m occurs at position i in s ∈ Σ^n if s[i + j] = w[1 + j] for all 0 ≤ j < m.

A set of sequences S ⊂ Σ∗ is called prefix-closed if each prefix of each s ∈ S is also a member of S. We define suffix-closed analogously.

Subsequences. While a substring is a contiguous part of a sequence, a subsequence need not be contiguous. Thus if s ∈ Σ^n, 1 ≤ k ≤ n and 1 ≤ i_1 < i_2 < ··· < i_k ≤ n, then s[i_1, i_2, . . . , i_k] := s_{i_1} s_{i_2} . . . s_{i_k} is called a subsequence of s. We write t s to indicate that t is a subsequence of s. Equivalently, s is a supersequence of t, which is written as s t.

Each substring is also a subsequence, but the converse is not generally true. For example, ABB is both a subsequence and a substring (even a prefix) of ABBAB, and BBB is a subsequence but not a substring.

Number of substrings and subsequences. Let s ∈ Σ^n.

A nonempty substring of s is specified by its starting position i and ending position j with 1 ≤ i ≤ j ≤ n. This gives n − k + 1 substrings of length k and $1 + 2 + \cdots + n = \binom{n+1}{2} = (n+1) \cdot n/2$ nonempty substrings in total. We might further argue that the empty substring occurs n + 1 times in s (before the first character and after each of the n characters), so including the empty strings, there are $\binom{n+2}{2}$ ways of selecting a substring. In any case, the number of substrings is in Θ(n^2).

A (possibly empty) subsequence of s is specified by any selection of positions. Thus there are 2^n possibilities to specify a subsequence, which is exponential in n. There are $\binom{n}{k}$ possibilities to specify a subsequence of length k.

In both of the above cases, the substrings or subsequences obtained by different selections need not be different. For example if s = AAA, then s[1 . . . 2] = s[2 . . . 3] = AA. It is an interesting problem (for which this course will provide efficient methods) to compute the number of distinct substrings or subsequences of a given s ∈ Σ^n. In order to appreciate these efficient methods, the reader is invited to think about algorithms for solving these problems before continuing!

¹ The origin of the word “tuple” is easy to explain: pair, triple, quadruple, quintuple, . . . , k-tuple, . . . . The popularity of the variable k for this purpose, however, is a mystery.
² “gram” is probably related to the Greek word γράμμα (gramma), meaning letter. The origin of the variable q for this purpose is unknown to the authors of these lecture notes.

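As a baseline for the counting problems just mentioned, here is a naive Python sketch; it enumerates substrings and subsequences explicitly and is therefore only usable for very short strings (the efficient methods come later in the course).

    from itertools import combinations

    def distinct_substrings(s: str) -> int:
        """Count distinct nonempty substrings by brute force: O(n^3) time."""
        return len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

    def distinct_subsequences(s: str) -> int:
        """Count distinct nonempty subsequences by brute force: O(2^n) time."""
        subs = set()
        for k in range(1, len(s) + 1):
            for positions in combinations(range(len(s)), k):
                subs.add("".join(s[i] for i in positions))
        return len(subs)

    print(distinct_substrings("AAA"))       # 3: A, AA, AAA
    print(distinct_substrings("ABBAB"))     # 11
    print(distinct_subsequences("ABBAB"))   # 17 (BBB is among them, although it is not a substring)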


CHAPTER 3

Metrics on Sequences

Contents of this chapter: Metric. Transformation distances. p-distance. Ham- ming distance. Edit distances. Edit sequence. Invariance properties of edit dis- tances. Efficient computation (dynamic programming) of edit distances. Optimal edit sequences. q-gram profile. q-gram distance. Ranking function. Unranking function. Efficient ranking of q-grams. De Bruijn graph. Maximal matches distance. Partition of a sequence. Left-to-right-partition. Filter.

3.1 Problem Motivation

The trivial method to compare two sequences is to compare them character by character: u and v are equal if and only if |u| = |v| and u_i = v_i for all i ∈ [1, |u|]. However, many problems are more subtle than simply deciding whether two sequences are equal or not. Some examples are

• searching for a name of which the spelling is not exactly known,

• finding inflected forms of a word,

• accounting for typing errors,

• tolerating error prone experimental measurements,

• allowing for ambiguities in the genetic code; for example, GCU, GCC, GCA, and GCG all code for alanine,

• looking for a protein sequence with known biological function that is similar to a given protein sequence with unknown function.


Therefore, we would like to define a notion of distance between sequences that takes the value zero if and only if the sequences are equal and otherwise gives a quantification of their differences. This quantity may depend very much on the application! The first step is thus to compile some properties that every distance should satisfy.

3.2 Definition of a Metric

In mathematical terms, a distance function is often called a metric, but also simply distance. Given any set X, a metric on X is a function d : X × X → R that assigns a number (their distance) to any pair (x, y) of elements of X and satisfies the following properties:

d(x, y) = 0 if and only if x = y   (identity of indiscernibles),   (3.1)
d(x, y) = d(y, x) for all x and y   (symmetry),   (3.2)
d(x, y) ≤ d(x, z) + d(z, y) for all x, y and z   (triangle inequality).   (3.3)

From (3.1)–(3.3), it follows that

d(x, y) ≥ 0 for all x and y (nonnegativity). (3.4)

The pair (X, d) is called a metric space. If (3.2)–(3.4) hold and also d(x, x) = 0 for all x, we call d a pseudo-metric. It differs from a true metric by the fact that there may be x ≠ y with d(x, y) = 0. A pseudo-metric on X can be turned into a true metric on a different set X′, where each set of elements with distance zero from each other is contracted into a single element. While a pseudo-metric is something weaker (more general) than a metric, there are also more special variants of metrics. An interesting one is an ultra-metric which satisfies (3.1), (3.2), and the following stronger version of the triangle inequality:

d(x, y) ≤ max{d(x, z), d(z, y)} for all x, y and z (ultrametric triangle inequality).

It is particularly important in phylogenetics.

3.3 Transformation Distances

Different kinds of distances can be defined on different spaces. Before discussing several metrics on sequences, let us discuss other examples of metrics, the so-called transformation distances; we begin with examples.

1. On any set X, we can define the discrete metric

   $$d(x, y) := \begin{cases} 0 & \text{if } x = y, \\ 1 & \text{if } x \neq y. \end{cases}$$

This is just the equal vs. unequal distinction discussed as not particularly useful above, but it satisfies all the properties that we demand from a metric.


2. Let Z be a set and X := P(Z) its power set. Two ways to define a metric on X are

d1(X,Y ) := |X \ Y | + |Y \ X| = |(X ∪ Y ) \ (X ∩ Y )| (symmetric set distance)

d∞(X,Y ) := max{|X \ Y |, |Y \ X|} (set transformation distance)

3. The numbers {1, . . . , n} can be arranged in n! different orders; such an arrangement π that uses each number exactly once is called a permutation. The set of all permutations of {1, . . . , n} is called the symmetric group of order n and denoted by Sn. Note that permutations can always be interpreted as particular sequences of length n over {1, . . . , n}. One possibility to define a metric on Sn that has applications in the study of genome rearrangements is the inversion distance or reversal distance. A reversal on a permutation π takes a subword of π and reverses it. Now d(π, τ) is defined as the minimal number of reversals required to transform π into τ.

For example, if π = (5, 2, 1, 4, 3) and τ = (1, 2, 3, 4, 5), we first reverse the first three numbers in π to obtain (1, 2, 5, 4, 3) and then the last three numbers to obtain τ. Thus two reversals suffice. It is easy to see that no single reversal will transform π into τ; therefore d(π, τ) = 2. Interestingly, the problem of computing this distance is known to be NP-hard.
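The two reversals from the example can be checked mechanically; the following Python sketch (0-based indices, an illustration rather than part of the notes) applies them to π.

    def reverse_segment(perm, i, j):
        """Return the permutation with the segment perm[i..j] reversed (0-based, inclusive)."""
        return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

    pi = (5, 2, 1, 4, 3)
    step1 = reverse_segment(pi, 0, 2)      # (1, 2, 5, 4, 3)
    step2 = reverse_segment(step1, 2, 4)   # (1, 2, 3, 4, 5) = tau
    print(step1, step2)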

All the above examples are particular cases of transformation distances. A transformation distance d(x, y) on a set X is always defined as the minimum number of allowed operations needed to transform one element x into another y. The allowed operations characterize the distance. Of course, they must be defined in such a way that the metric properties are satisfied. To make this clear, we list the operations in each of the examples above.

discrete metric: The only operation needed is a change, also called a substitution, of any element x to any other element y ≠ x.

symmetric set distance: The operations are removal and addition of an element from resp. to a set X.

set transformation distance: The operations are removal, addition and substitution of an element.

inversion distance on permutations: One might think that the only operation is a reversal, but in fact there are $1 + 2 + \cdots + (n-1) = \binom{n}{2}$ different types of reversals (one for each substring of length ≥ 2).

Care has to be taken that the allowed operations imply the symmetry condition of the defined metric. The triangle inequality is never a problem since the distance is defined as the minimum number of operations required (be sure to understand this!).

Increasing the number of allowed operations decreases the distance and vice versa. Therefore, the set transformation distance cannot be higher than the symmetric set distance.

As the above examples show, depending on the metric space X and the allowed operations, a transformation distance may be easy or hard to compute.


3.4 Metrics on Sequences of the Same Length

Take X = R^n and consider two points x = (x_1, . . . , x_n) and y = (y_1, . . . , y_n) (note that points in n-dimensional space are also sequences of length n over the infinite alphabet R). The following functions are metrics.

$$d_1(x, y) := \sum_{i=1}^{n} |x_i - y_i| \quad \text{(Manhattan distance)}$$

$$d_2(x, y) := \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2} \quad \text{(Euclidean distance)}$$

$$d_\infty(x, y) := \max_{i=1,\ldots,n} |x_i - y_i| \quad \text{(maximum metric)}$$

$$d_H(x, y) := \sum_{i=1}^{n} \mathbf{1}_{\{x_i \neq y_i\}} \quad \text{(Hamming distance)}$$

Example 3.1 Consider the location of the points A, B, and C in Figure 3.1. They have the following distances.

Manhattan distance:  d1(A, B) = 6
Euclidean distance:  d2(A, B) = √(4^2 + 2^2) = √20
Maximum metric:      d∞(A, B) = max{4, 2} = 4
Hamming distance:    dH(A, B) = 2, because the coordinates differ in two dimensions; dH(A, C) = 1, because the coordinates differ only in one dimension.

[Figure 3.1: Examples of different metrics. Panels: (a) Manhattan distance, (b) Euclidean distance, (c) Maximum metric, (d) Hamming distance.]
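The four metrics of this section are one-liners in Python. The coordinates of A and B below are chosen so that they reproduce the distances of Example 3.1; the actual positions in Figure 3.1 may differ.

    from math import sqrt

    def manhattan(x, y): return sum(abs(a - b) for a, b in zip(x, y))
    def euclidean(x, y): return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    def maximum(x, y):   return max(abs(a - b) for a, b in zip(x, y))
    def hamming(x, y):   return sum(1 for a, b in zip(x, y) if a != b)

    A, B = (0, 0), (4, 2)   # assumed coordinates reproducing Example 3.1
    print(manhattan(A, B), euclidean(A, B), maximum(A, B), hamming(A, B))
    # 6  4.472... (= sqrt(20))  4  2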

It is instructive to visualize (Figure 3.2) the set of points x = (x_1, x_2) ∈ R^2 that have distance 1 from the origin for each of the above metrics.


[Figure 3.2: Points with distance 1 from the origin for different metrics. Panels: (a) Manhattan distance, (b) Euclidean distance, (c) Maximum metric, (d) Hamming distance.]

More generally, for every real number p ≥ 1, the p-metric on R^n is defined by

$$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}.$$

It is not trivial to verify the triangle inequality for the p-metric (it is known as the Minkowski inequality and verifying it is a popular exercise in first-semester mathematics courses). The maximum metric is the limit case p → ∞, as can be verified by some calculations. The Hamming distance can be interpreted as the p-metric for p = 0 (but no p-metric is defined for 0 < p < 1).

For biological sequence data, the only distance in this section that makes general sense is the Hamming distance (there is no notion of difference or maximum between nucleotides).

Example 3.2 Given is an alphabet Σ of size σ. Consider a graph whose vertices are all strings in Σ^n, the so-called sequence graph of dimension n. Two vertices are connected by an edge if and only if the Hamming distance between them is 1. How many edges does the graph contain?

There are exactly n(σ − 1) strings at Hamming distance 1 from any given string (n positions times (σ − 1) choices of differing characters). Thus there are n(σ − 1) edges incident to each vertex. Obviously there are σ^n vertices. Since each edge is connected to two vertices, there are σ^n · n · (σ − 1)/2 edges altogether.

The distances in this section only make sense for sequences of the same length. The distances we consider next are also defined for sequences of different lengths.


3.5 Edit Distances for Sequences

Let Σ be a finite alphabet. Edit distances are a general class of distances on Σ∗, defined by edit operations. The distance is defined as the minimum number of edit operations needed to re-write a source sequence x ∈ Σ∗ into a target sequence y ∈ Σ∗. This is similar to the transformation distances considered above, with the important difference that the sequences here are processed from left to right.

The edit operations that we consider are

• C: copy the next character from x to y,

• Sc for each c ∈ Σ: substitute the next character from x by c in y,

• Ic: insert c at the current position in y,

• D: delete the next character from x (i.e., skip over it),

• F: flip the next two characters in x while copying.

The insert and delete operations are sometimes collectively called indel operations.

The following table lists some edit distances along with their operations and their associated costs. If a cost is infinite, the operation is not permitted.

Name                        C    Sc   Ic   D    F
Hamming distance dH         0    1    ∞    ∞    ∞
LCS distance dLCS           0    ∞    1    1    ∞
Edit distance d             0    1    1    1    ∞
Edit+Flip distance dF       0    1    1    1    1

Each distance definition must allow copying at zero cost to ensure d(x, x) = 0 for all x. All other operations receive a non-negative cost. Here are some subtle points to note:

• Previously, the Hamming distance dH (x, y) was undefined if |x| 6= |y|. With this definition, we could define it as ∞. This distinction is irrelevant in practice.

• The LCS distance (longest common subsequence distance) does not allow substitutions. However, since each substitution can be simulated by one deletion and one insertion, we don’t need to assign infinite cost. Any cost ≥ 2 will lead to the same distance value. The name stems from the fact that dLCS(x, y) = |x| + |y| − 2 · LCS(x, y), where LCS(x, y) denotes the length of a longest common subsequence of x and y.

• The edit distance should be more precisely called unit-cost standard edit distance. It is the most important one in practice and should be carefully remembered. The flips are comparatively unimportant in practice.

We now formally define how x ∈ Σ∗ is transformed into a different sequence by an edit sequence, i.e., a sequence over the edit alphabet E(Σ) := {C, Sc, Ic, D, F} (or the appropriate

subset of it, depending on which of the above distances we use). Thus we define the edit function E : Σ∗ × E∗ → Σ∗ inductively by

E(ε, ε) := ε
E(ax, C e) := a E(x, e)      (a ∈ Σ, x ∈ Σ∗, e ∈ E∗)
E(ax, Sc e) := c E(x, e)     (a, c ∈ Σ, x ∈ Σ∗, e ∈ E∗)
E(x, Ic e) := c E(x, e)      (c ∈ Σ, x ∈ Σ∗, e ∈ E∗)
E(ax, D e) := E(x, e)        (a ∈ Σ, x ∈ Σ∗, e ∈ E∗)
E(abx, F e) := ba E(x, e)    (a, b ∈ Σ, x ∈ Σ∗, e ∈ E∗)

For the remaining arguments, E is undefined, i.e., E(x, ε) is undefined for x ≠ ε since there are no edit operations specified. Similarly, E(ε, e) is undefined if e ≠ ε and e1 ∉ {Ic : c ∈ Σ}, because there is no symbol that the first edit operation e1 could operate on. This definition can be translated immediately into a Haskell program. As an example, we compute

E(AAB, IB C SB D) = B E(AAB, C SB D)
                  = BA E(AB, SB D)
                  = BAB E(B, D)
                  = BAB E(ε, ε)
                  = BAB.
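The notes remark that the inductive definition of E translates directly into a Haskell program. A comparable sketch in Python, mirroring the definition case by case, could look as follows; the tuple encoding of the edit operations is our own choice, not part of the notes.

    def apply_edits(x: str, e: list) -> str:
        """Apply an edit sequence e to x, following the inductive definition of E.
        Operations: ("C",) copy, ("S", c) substitute by c, ("I", c) insert c,
        ("D",) delete, ("F",) flip the next two characters.  Raises ValueError
        where E is undefined."""
        if not e:
            if x == "":
                return ""
            raise ValueError("E(x, eps) is undefined for nonempty x")
        op, rest = e[0], e[1:]
        if op[0] == "I":
            return op[1] + apply_edits(x, rest)
        if x == "":
            raise ValueError("no symbol left to operate on")
        if op[0] == "C":
            return x[0] + apply_edits(x[1:], rest)
        if op[0] == "S":
            return op[1] + apply_edits(x[1:], rest)
        if op[0] == "D":
            return apply_edits(x[1:], rest)
        if op[0] == "F":
            if len(x) < 2:
                raise ValueError("flip needs two symbols")
            return x[1] + x[0] + apply_edits(x[2:], rest)
        raise ValueError(f"unknown operation {op!r}")

    # The example from the text: E(AAB, IB C SB D) = BAB
    print(apply_edits("AAB", [("I", "B"), ("C",), ("S", "B"), ("D",)]))  # BAB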

We define the cost cost(e) of an edit sequence e ∈ E∗ as the sum of the costs of the individual operations according to the table above. We can now formally define the edit distance for a set of edit operations E as

dE(x, y) := min{cost(e) : e ∈ E∗, E(x, e) = y}.

Theorem 3.3 Each of the four distance variations according to the table above (Hamming distance, LCS distance, Edit distance, Edit+Flip distance) defines a metric.

Proof. We need to verify that all conditions of a metric are satisfied. The identity of indiscernibles is easy. The triangle inequality is also trivial because the distance is defined as the minimum cost. We need to make sure that symmetry holds, however. This can be seen by considering an optimal edit sequence e with E(x, e) = y. We shall show that there exists an edit sequence f with cost(f) = cost(e) =: c such that E(y, f) = x.

To obtain f with the desired properties, we replace in e each D by an appropriate Ic and vice versa. Substitutions Sc remain substitutions, but we need to replace the substituted character from y by the correct character from x. Copies C remain copies and flips F remain flips (in the model where they are allowed). The edit sequence f constructed in this way obviously satisfies cost(f) = cost(e) and E(y, f) = x. This shows that d(y, x) ≤ d(x, y).

To prove equality, we apply the argument twice and obtain d(x, y) ≤ d(y, x) ≤ d(x, y). Since the first and last term are equal, all inequalities must in fact be equalities, proving the symmetry property. □


Invariance properties of the edit distance. We note that the edit distance is invariant under sequence reversal and bijective maps of the alphabet (“renaming the symbols”):

Theorem 3.4 Let d denote any variant of the above edit distances, and let π : Σ → Σ′ be a bijective map between alphabets (if Σ′ = Σ, then π is a permutation of Σ). We extend π to strings over Σ by changing each symbol separately. Then

1. d(x, y) = d(←−x, ←−y), and
2. d(x, y) = d(π(x), π(y)).

Proof. Left as an exercise. Hint: Consider optimal edit sequences. □

This theorem immediately implies that two DNA sequences have the same distance as their reversed complements.

3.6 An Efficient Algorithm to Compute Edit Distances

While it is easy to find some e ∈ E∗ that transforms a given x ∈ Σ∗ into a given y ∈ Σ∗, it may not be obvious to find the minimum cost edit sequence. However, since x and y are processed from left to right, the following dynamic programming approach appears promising. We define a (|x| + 1) × (|y| + 1) matrix D, called the edit matrix, by

D(i, j) := d(x[1 . . . i], y[1 . . . j]) for 0 ≤ i ≤ |x|, 0 ≤ j ≤ |y|.

We are obviously looking for D(|x|, |y|) = d(x, y). We shall point out how the distance of short prefixes can be used to compute the distance of longer prefixes; for concreteness we focus on the standard unit cost edit distance (with substitution, insertion and deletion cost 1; flips are not permitted). If i = 0, we are transforming x[1 ... 0] = ε into y[1 . . . j]. This is only possible with j insertions, which cost j. Thus D(0, j) = j for 0 ≤ j ≤ |y|. Similarly D(i, 0) = i for 0 ≤ i ≤ |x|. The interesting question is how to obtain the values D(i, j) for the remaining pairs of (i, j).

Theorem 3.5 For 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|,

$$D(i, j) = \min \begin{cases} D(i-1, j-1) + \mathbf{1}_{\{x[i] \neq y[j]\}}, \\ D(i-1, j) + 1, \\ D(i, j-1) + 1. \end{cases}$$

Proof. The proof is by induction on i + j. The basis is given by the initialization. We may thus assume that the theorem correctly computes D(i − 1, j − 1) = min{cost(e) : E(x[1 . . . i − 1], e) = y[1 . . . j − 1]}, so let e↖ be an optimal edit sequence for this case. Similarly, let e↑ be optimal for D(i − 1, j) = min{cost(e) : E(x[1 . . . i − 1], e) = y[1 . . . j]} and e← be optimal for D(i, j − 1) = min{cost(e) : E(x[1 . . . i], e) = y[1 . . . j − 1]}.


We obtain three candidates for edit sequences to transform x[1 . . . i] into y[1 . . . j] as follows: (1) We extend e↖ either by C if x[i] = y[j] or by Sy[j] if x[i] ≠ y[j]; the former leaves the cost unchanged, the latter increases the cost by 1. (2) We extend e↑ by D, increasing the cost by 1. (3) We extend e← by Iy[j], increasing the cost by 1. Recall that D(i, j) is defined as the minimum cost of an edit sequence transforming x[1 . . . i] into y[1 . . . j]. Since we can pick the best of the three candidates, we have shown that an inequality ≤ holds in the theorem.

It remains to be shown that there can be no other edit sequence with lower cost. Assume such an edit sequence e∗ exists and consider its last operation o and its prefix e′ such that e∗ = e′o. If o = D, then e′ is an edit sequence transforming x[1 . . . i − 1] into y[1 . . . j] with cost(e∗) = cost(e′) + 1 < D(i − 1, j) + 1 by assumption; thus cost(e′) < D(i − 1, j), a contradiction, since D(i − 1, j) is the optimal cost by inductive assumption. The other options for o lead to similar contradictions: In each case, we would have seen a better value already at a previous element of D. Therefore, such an e∗ cannot exist. □

It is important to understand that the above proof consists of two parts. The first part shows that, naturally, by appropriate extension of the corresponding edit sequences, D(i, j) is at most the minimum of three values. The second part shows that the optimal edit sequence for D(i, j) is always one of those three possibilities.

Together with the initialization, Theorem 3.5 can be translated immediately into an algo- rithm to compute the edit distance. Some care must be taken, though:

• One might get the idea to implement a recursive function D that first checks for a boundary case and returns the appropriate value, or otherwise calls D recursively with smaller arguments. This is highly inefficient! Intermediate results would be computed again and again many times. For example, to compute D(6, 6), we need D(5, 5), D(5, 6) and D(6, 5), but to compute D(5, 6), we also need D(5, 5).

• It is thus much better to fill the matrix iteratively, either row by row, or column by column, or diagonal by diagonal where i + j is constant. This is called a pull-type or backward dynamic programming algorithm. To obtain the value of D(i, j), it pulls the information from previously computed cells (a short code sketch of this variant follows after this list).

• An alternative is to push information forward as soon as we have obtained it. We initialize D(0, 0) ← 0 and D(i, j) ← ∞ for the remainder of the matrix. Again, we proceed row by row, or column by column or diagonal by diagonal, starting at (0, 0), such that the following invariant holds. When we reach entry (i, j), it already contains the correct value. We then update the three successors as follows:

– D(i + 1, j + 1) ← min{D(i + 1, j + 1), D(i, j) + 1_{x[i+1] ≠ y[j+1]}}

– D(i + 1, j) ← min{D(i + 1, j), D(i, j) + 1}

– D(i, j + 1) ← min{D(i, j + 1),D(i, j) + 1}

Whenever we reach an entry, it has received its three updates and contains the correct value. This is called a push-type or forward dynamic programming algorithm. It seems less natural for this example, but there are other problems where this approach leads to much simpler code than the pull-type approach, so it is worth remembering.
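A minimal Python sketch of the pull-type, row-by-row computation described above; it keeps only the previous row, which already realizes the linear-space bound for the distance value.

    def edit_distance(x: str, y: str) -> int:
        """Unit-cost edit distance, pull-type (row-by-row) dynamic programming.
        Only the previous row is kept, so the space usage is O(|y|)."""
        m, n = len(x), len(y)
        prev = list(range(n + 1))            # D(0, j) = j
        for i in range(1, m + 1):
            curr = [i] + [0] * n             # D(i, 0) = i
            for j in range(1, n + 1):
                curr[j] = min(prev[j - 1] + (x[i - 1] != y[j - 1]),  # copy / substitute
                              prev[j] + 1,                           # delete x[i]
                              curr[j - 1] + 1)                       # insert y[j]
            prev = curr
        return prev[n]

    print(edit_distance("BCACD", "DBADAD"))  # 4, as in Example 3.6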


Example 3.6 Let x = BCACD and y = DBADAD. We compute the edit matrix D as follows:

               y:       D   B   A   D   A   D
   x                0   1   2   3   4   5   6
            0       0   1   2   3   4   5   6
   B        1       1   1   1   2   3   4   5
   C        2       2   2   2   2   3   4   5
   A        3       3   3   3   2   3   3   4
   C        4       4   4   4   3   3   4   4
   D        5       5   4   5   4   3   4   4

Hence the edit distance of x and y is d(x, y) = D(5, 6) = 4. 

Let us analyze the time and space complexity of the pull-type algorithm. Assuming |x| = m and |y| = n, we need to compute (m+1)·(n+1) values, and each computation takes constant time; thus the running time is O(mn). The space complexity also appears to be O(mn) since we apparently need to store the matrix D. However, it is sufficient to keep, for instance in a column-wise computation, just the last row or column, in order to calculate the next one; thus the space complexity is O(m + n). The push-type algorithm has the same complexities.

Reconstructing the optimal edit sequence. The above statement about the space com- plexity is true if we are only interested in the distance value. Often, however, we also would like to know the optimal edit sequence. We essentially get it for free by the above algorithm, but we need to remember which one of the three cases was chosen in each cell of the matrix; therefore the space complexity increases to O(mn). Clearly, instead of storing the operations they can also be recomputed, if just the edit matrix D is stored.

To find an optimal edit sequence, let us define another matrix E of the same dimensions as D. We compute E while computing D, such that E(i, j) contains the last edit operation of an optimal edit sequence that transforms x[1 . . . i] into y[1 . . . j]. This information is available whenever we make a decision for one of the three predecessors to compute D(i, j). In fact, there may be several optimal possibilities for E(i, j). We can use an array of three bits (for edit distance) to store any combination of the 7 possibilities.

From E, an optimal edit sequence e can be constructed backwards by backtracing (not to be confused with backtracking, described further below), where we use the stored operations as a trace which we follow through the matrix E. Clearly e ends with one of the operations stored in E(m, n), so we have determined the last character o of e. Depending on its value, we continue as follows.

• If o = C or o = Sc for some c ∈ Σ, continue with E(m − 1, n − 1).

• If o = D, continue with E(m − 1, n).

• If o = Ic for some c ∈ Σ, continue with E(m, n − 1).

We repeat this process until we arrive at E(0, 0). Note that in the first row we always move left, and in the first column we always move up.


There are divide-and-conquer algorithms that can determine an optimal edit sequence in O(m + n) space and O(mn) time, using a distance-only algorithm as a sub-procedure. These algorithms (e.g. the Hirschberg technique discussed in Section 5.4) are very important, since space, not time, is often the limiting factor when comparing long sequences.

If we want to obtain all optimal edit sequences, we systematically have to take different branches in the E-matrix whenever there are several possibilities. This can be done efficiently by a technique called backtracking (which systematically finds all solutions; not to be confused with backtracing, described above): We push the coordinates of each branching E-entry (along with information about which branch we took and the length of the partial edit sequence) onto a stack during backtracing. When we reach E(0, 0), we have completed an optimal edit sequence and report it. Then we go back to the last branching point by popping its coordinates from the stack, remove the appropriate prefix from the edit sequence, and construct a new one by taking the next branch (pushing the new information back onto the stack) if there is still one that we have not taken. If not, we backtrack further. As soon as the stack is empty, the backtracking ends.

Problem 3.7 (Edit Sequence Problem) Enumerate all optimal edit sequences (with respect to standard unit cost edit distance) of two sequences x ∈ Σ^m and y ∈ Σ^n.

Theorem 3.8 The edit sequence problem can be solved in O(mn + z) time, where z is the total length of all optimal edit sequences of x and y.

Example 3.9 Let x = BCACD and y = DBADAD. We have seen that the standard unit cost edit distance between x and y is d(x, y) = 4. There are 7 edit sequences transforming x into y with this cost. Can you find them all? 
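For small inputs the optimal edit sequences can be enumerated directly from the edit matrix D by recursive backtracing; the following Python sketch (recursion instead of the explicit stack described above) finds all of them.

    def all_optimal_edit_sequences(x: str, y: str):
        """Enumerate all optimal edit sequences (unit-cost edit distance).
        Recursion on the full edit matrix D; there may be exponentially many."""
        m, n = len(x), len(y)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i-1][j-1] + (x[i-1] != y[j-1]),
                              D[i-1][j] + 1, D[i][j-1] + 1)

        def trace(i, j):
            if i == 0 and j == 0:
                yield []
                return
            if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (x[i-1] != y[j-1]):
                op = "C" if x[i-1] == y[j-1] else "S" + y[j-1]
                for e in trace(i - 1, j - 1):
                    yield e + [op]
            if i > 0 and D[i][j] == D[i-1][j] + 1:
                for e in trace(i - 1, j):
                    yield e + ["D"]
            if j > 0 and D[i][j] == D[i][j-1] + 1:
                for e in trace(i, j - 1):
                    yield e + ["I" + y[j-1]]

        return list(trace(m, n))

    seqs = all_optimal_edit_sequences("BCACD", "DBADAD")
    print(len(seqs))   # 7, matching Example 3.9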

Variations. The algorithms in this section have focused on the standard unit cost edit distance. However, the different edit distance variations from the previous section are easily handled as well; this is a good exercise.

3.7 The q-gram Distance

Edit distances emphasize the correct order of the characters in a string. If, however, a large block of a text is moved somewhere else (e.g. two paragraphs of a text are exchanged), the edit distance will be high, although from a certain standpoint, the texts are still quite similar. An appropriate notion of distance can be defined in several ways. Here we introduce the q-gram distance dq, which is actually a pseudo-metric: There exist strings x ≠ y with dq(x, y) = 0.

Definition 3.10 Let x ∈ Σ^n, let σ := |Σ|, and choose q ∈ [1, n]. The occurrence count of a q-gram z ∈ Σ^q in x is the number

N(x, z) := |{i ∈ [1, n − q + 1] : x[i . . . (i + q − 1)] = z}|.

The q-gram profile of x is a σ^q-element vector pq(x) := (pq(x)_z)_{z ∈ Σ^q}, indexed by all q-grams, defined by pq(x)_z := N(x, z).


The q-gram distance of x ∈ Σ^n and y ∈ Σ^m is defined for 1 ≤ q ≤ min{n, m} as

$$d_q(x, y) := \sum_{z \in \Sigma^q} |p_q(x)_z - p_q(y)_z|.$$

Example 3.11 Consider Σ = {a, b}, x = abaa, y = abab, z = aaba, and q = 2. Using lexicographic order (aa, ab, ba, bb) among the q-grams, we have p2(x) = (1, 1, 1, 0), p2(y) = (0, 2, 1, 0), p2(z) = (1, 1, 1, 0), so d2(x, y) = 2 and d2(x, z) = 0 although x ≠ z.

Theorem 3.12 The q-gram distance dq is a pseudo-metric.

Proof. It fulfills nonnegativity, symmetry and the triangle inequality (exercise). □

Choice of q. In principle, we could use any q ∈ [1, min{m, n}] to compare two strings of length m and n. In practice, some choices are more reasonable than others. For example, for q = 1, we merely add the differences of the single letter counts. If m and n are large, there are many (even very differently looking) strings with the same letter counts, which would all have distance zero from each other.

If on the other hand q = n ≤ m and n ≈ m, the distance dq between the strings is always close to m − n, which does not give us a good resolution either. Also, the q-gram profile of both strings is extremely sparse in these cases. We thus ask for a value of q such that each q-gram occurs about once (or c ≥ 1 times) on average. Assuming a uniform random distribution of letters, the probability that a fixed q-gram z starts at a fixed position i ∈ [1, n − q + 1] is 1/σ^q. The expected number of occurrences of z in the text is thus (n − q + 1)/σ^q. The condition that this number equals c leads to the condition (n + 1)/c = q/c + σ^q, or approximately n/c = σ^q, since 1/c and q/c are comparatively small. Thus setting

$$q := \frac{\log(n) - \log(c)}{\log(\sigma)}$$

ensures an expected occurrence count ≥ c of each q-gram.
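In code, the suggested choice of q is a one-liner; note that rounding the real-valued formula down to an integer is our own addition.

    from math import log

    def suggested_q(n: int, sigma: int, c: float = 1.0) -> int:
        """Largest q such that each q-gram is still expected to occur at least c
        times in a text of length about n, following q = (log n - log c) / log sigma."""
        return int((log(n) - log(c)) / log(sigma))

    print(suggested_q(1_000_000, 4))   # 9 for a DNA text of length 10^6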

Relation to the edit distance. There is only a weak connection between unit-cost edit distance d and q-gram distance dq. Recall that we may have dq(x, y) = 0 even though the edit distance can be high. On the other hand, a single edit operation destroys up to q q-grams. Generally, one can show the following relationship.

Theorem 3.13 Let d denote the standard unit cost edit distance. Then

dq(x, y)/(2q) ≤ d(x, y).

Proof. Each edit operation locally affects up to q q-grams. Each changed q-gram can cause a change of up to 2 in dq, as the occurrence count of one q-gram decreases, the other increases. Therefore each edit operation can lead to an increase of 2q of the q-gram distance in the worst case (e.g. consider AAAAAAAAAAA vs. AABAABAABAA with edit distance 3, q = 3, and, as can be verified, d3 = 18). □


Practical computation. If the q-gram profile of a string x ∈ Σ^n is dense (i.e., if it does not contain mostly zeros), it makes sense to implement it as an array p[0 ... (σ^q − 1)], where p[i] counts the number of occurrences of the i-th q-gram in some arbitrary but fixed (e.g. lexicographic) order.

Definition 3.14 For a finite set X, a bijective function r : X → {0, ..., |X| − 1} is called a ranking function; an algorithm that implements it is called a ranking algorithm. The inverse of a ranking function is an unranking function, and its implementation an unranking algorithm.

Of course, we are interested in an efficient (un-)ranking algorithm for the set of all q-grams over Σ. A classical one is to first define a ranking function rΣ on the characters and then extend it to a ranking function r on the q-grams by interpreting them as base-σ numbers. There are two possibilities to number the positions of a q-gram: from q down to 1 (falling) or from 1 up to q (rising), as shown in Example 3.15.

Example 3.15 Two different possibilities to rank the q-gram z = ATACG: (with q =5, Σ={A, C, G, T}; σ =4, using rΣ(A) = 0, rΣ(C)=1, rΣ(G)=2, and rΣ(T)=3)

r(ATACG) = rΣ(z[1])·σ^4 + rΣ(z[2])·σ^3 + rΣ(z[3])·σ^2 + rΣ(z[4])·σ^1 + rΣ(z[5])·σ^0
         = 0·4^4 + 3·4^3 + 0·4^2 + 1·4^1 + 2·4^0
         = 0 + 192 + 0 + 4 + 2 = 198

or, respectively,

r(ATACG) = rΣ(z[1])·σ^0 + rΣ(z[2])·σ^1 + rΣ(z[3])·σ^2 + rΣ(z[4])·σ^3 + rΣ(z[5])·σ^4
         = 0·4^0 + 3·4^1 + 0·4^2 + 1·4^3 + 2·4^4
         = 0 + 12 + 0 + 64 + 512 = 588



The first numbering of positions corresponds naturally to the ordering of positional number systems; in the decimal number 23, the digit 2 has positional value (weight) 10^1 = 10, whereas the 3 has the lower weight 10^0 = 1. For texts, however, the second numbering is more natural since we expect lower position numbers on the left side of higher position numbers. In general, we see that z ∈ Σ^q has rank

r(z) = ∑_{i=1}^{q} rΣ(z[i]) · σ^{q−i},   resp.   r(z) = ∑_{i=1}^{q} rΣ(z[i]) · σ^{i−1}

in the first resp. second position numbering system. Whichever system we use, we have to do it consistently. The implementation efficiency of the two systems may differ slightly. Obviously, for z ∈ Σ^q, r(z) can be computed in O(q) time. Both ranking functions above have the additional advantage that when a window of length q is shifted along a text, the rank of the q-gram can be updated in O(1) time: Assume that t = azb with a ∈ Σ, z ∈ Σ^{q−1}, and b ∈ Σ. The first q-gram is az; let j := r(az). Then in the first system, r(zb) = (j mod σ^{q−1}) · σ + rΣ(b). In the second system, r(zb) = ⌊j/σ⌋ + f(b), where f(b) := rΣ(b) · σ^{q−1}. We see that

the second way is possibly more efficient (as it avoids the modulo operation and only uses integer division) if f is implemented as a lookup table f : Σ → N0. If σ = 2^k for some k ∈ N, multiplications and divisions can be implemented by bit shifting operators, e.g. in the second system, r(zb) = (j >> k) + f(b).

Now, to compute dq(x, y) for x ∈ Σ^m and y ∈ Σ^n,

1. initialize pq[0 ... (σ^q − 1)] with zeros,

2. for each i ∈ [1, m − q + 1], increment pq[r(x[i..(i + q − 1)])],

3. for each i ∈ [1, n − q + 1], decrement pq[r(y[i..(i + q − 1)])],

4. initialize d := 0,

5. for each j ∈ [0, (σ^q − 1)], increment d by |pq[j]|.

This procedure obviously takes O(σ^q + m + n) time. If m ≤ n and q is as suggested above, this is O(n) time overall. If q is comparatively large, it does not make sense to maintain a full array. Instead, we maintain a data structure that contains only the nonzero entries of pq. If q = O(log(m+n)), we can still achieve O(m + n) time using hashing.
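The following Python sketch (ours, not from the notes) implements this procedure using the second (rising) numbering system, including the O(1) rolling rank update:

def qgram_distance(x: str, y: str, q: int, alphabet: str = "ACGT") -> int:
    """Compute the q-gram distance dq(x, y) in O(sigma^q + |x| + |y|) time."""
    sigma = len(alphabet)
    rank_sigma = {c: i for i, c in enumerate(alphabet)}   # character ranking r_Sigma
    high = sigma ** (q - 1)                               # weight of the rightmost position

    def ranks(s):
        """Yield the ranks of all q-grams of s, updating the rank in O(1) per window shift."""
        j = sum(rank_sigma[s[i]] * sigma**i for i in range(q))   # rank of the first q-gram
        yield j
        for i in range(q, len(s)):
            j = j // sigma + rank_sigma[s[i]] * high             # drop left character, append right one
            yield j

    profile = [0] * sigma**q
    for r in ranks(x):
        profile[r] += 1          # step 2: increment for q-grams of x
    for r in ranks(y):
        profile[r] -= 1          # step 3: decrement for q-grams of y
    return sum(abs(v) for v in profile)

# Example 3.11 revisited (alphabet {a, b}):
print(qgram_distance("abaa", "abab", 2, alphabet="ab"))  # prints 2
print(qgram_distance("abaa", "aaba", 2, alphabet="ab"))  # prints 0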

Words with the same q-gram profile. Given x ∈ Σ^m and 1 ≤ q ≤ m, can we determine whether there are any y ≠ x with dq(x, y) = 0, and if there are, how many? For q = 1, the problem is trivial; any permutation of the symbols gives such a y. So we shall assume that q ≥ 2. An elegant answer can be found by constructing the Balanced De Bruijn subgraph B(x, q) for x and q.

Definition 3.16 Given x ∈ Σ^m and 2 ≤ q ≤ m, the Balanced De Bruijn Subgraph B(x, q) for x and q is a directed graph defined as follows:

• The vertices are all (q − 1)-grams of x.

• The edges correspond to the q-grams of x: For each q-gram azb of x with a ∈ Σ, z ∈ Σ^{q−2}, b ∈ Σ, we draw an edge from the (q − 1)-gram vertex az to the (q − 1)-gram vertex zb. If the same q-gram occurs multiple times in x, the same edge also occurs multiple times in the graph.

Since B(x, q) is constructed from overlapping q-grams, the graph is connected and balanced in the following sense.

Definition 3.17 A vertex in a directed graph is balanced if and only if its in-degree (number of incoming edges) equals its out-degree (number of outgoing edges). A directed graph is balanced if either 1. all except two vertices are balanced. One of them, called the initial vertex or source, has one more outgoing edge than incoming edges. (It necessarily corresponds to the first (q − 1)-gram of x in B(x, q).) The other one, called the final vertex or sink, has one more incoming edge than outgoing edges (and corresponds to the last (q − 1)-gram of x in B(x, q)). Or,


2. all vertices are balanced. In this case, source and sink must be the same vertex, but it can be chosen arbitrarily.

Definition 3.18 An Eulerian path in a graph G = (V,E) is a path that uses every edge in E exactly once, according to its multiplicity.

A fundamental theorem of graph theory says that an Eulerian path exists if and only if the graph is connected and the above balance condition is satisfied. In the first case above, the path must necessarily start in the initial vertex and end in the final vertex. In the second case, the path can start anywhere but must end in the same vertex where it started.

Since B(x, q) is balanced, it contains an Eulerian path, corresponding to the word x. However, it might contain more than one Eulerian path. The following observation, which follows directly from the construction of B(x, q), is crucial.

Observation 3.19 There is a one-to-one correspondence between Eulerian paths in B(x, q) and words with the same q-gram profile as x.

Thus other words y ≠ x with the same q-gram profile as x exist if and only if there is another Eulerian path through B(x, q). Since such a path uses the same edges in a different order, a necessary condition is that B(x, q) contains a cycle.

Theorem 3.20 The number of Eulerian paths in B(x, q) equals the number of words y with dq(x, y) = 0. If B(x, q) contains no cycle, then there cannot exist a y ≠ x with dq(x, y) = 0.

Note that B(x, q) may contain one or more cycles, but still admit only one Eulerian path. We give some examples.

Example 3.21 1. Consider x = GATTACA and q = 4. The graph B(x, q) is shown in Figure 3.3 (left). We see that it contains no cycle, and therefore only one Eulerian path. There are no other words with the same q-gram profile.

2. Consider y = GATTATTACA and q = 4. The graph B(y, q) is shown in Figure 3.3 (center). We see that it contains a cycle, but still only one Eulerian path. There are no other words with the same q-gram profile.

3. Consider z = GATTATTAATTACA and q = 4. The graph B(z, q) is shown in Figure 3.3 (right). We see that it contains cycles and more than one Eulerian path. Another word with the same q-gram profile is z′ = GATTAATTATTACA.



Figure 3.3: The three Balanced De Bruijn Subgraphs B(x, 4),B(y, 4) and B(z, 4).


One can write a computer program that constructs B(x, q), identifies the initial and final vertices (if they exist), and systematically enumerates all Eulerian paths in B(x, q) using backtracking.
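As an illustration (our sketch, not part of the notes), the following Python function represents B(x, q) implicitly as a multiset of q-gram edges and enumerates, by backtracking, all words with the same q-gram profile as x:

from collections import Counter

def words_with_same_profile(x: str, q: int) -> list:
    """Enumerate all words y with dq(x, y) = 0 by backtracking over the
    Eulerian paths of the balanced De Bruijn subgraph B(x, q)."""
    edges = Counter(x[i:i + q] for i in range(len(x) - q + 1))   # q-grams with multiplicities
    words = set()

    def extend(word):
        if len(word) == len(x):              # all edges used: one Eulerian path found
            words.add(word)
            return
        prefix = word[-(q - 1):]             # current (q-1)-gram vertex
        for z in edges:
            if edges[z] > 0 and z.startswith(prefix):
                edges[z] -= 1                # traverse one copy of edge z
                extend(word + z[-1])
                edges[z] += 1                # backtrack

    # every word with the same profile must start with some (q-1)-gram of x
    for start in {x[i:i + q - 1] for i in range(len(x) - q + 2)}:
        extend(start)
    return sorted(words)

print(words_with_same_profile("GATTATTAATTACA", 4))
# prints ['GATTAATTATTACA', 'GATTATTAATTACA'], cf. Example 3.21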

3.8 The Maximal Matches Distance

Another notion of distance that admits large block movements is the maximal matches distance. It is based on the idea that we can cover a sequence x with substrings of another sequence (and vice versa) that are interspersed with single characters.

Definition 3.22 A partition of x ∈ Σ^m with respect to y ∈ Σ^n is a sequence P = (z1, b1, . . . , zℓ, bℓ, zℓ+1) for some ℓ ≥ 0 that satisfies the following conditions:

1. x = z1 b1 z2 b2 . . . zℓ bℓ zℓ+1.

2. Each zl is a substring of y for 1 ≤ l ≤ ℓ + 1.

3. Each bl is a single character for 1 ≤ l ≤ ℓ.

The size of a partition P as above is |P| := ℓ. Note: If x is a substring of y, there exists the trivial partition P = (x) of size 0.

Example 3.23 Let x = cbaabdcb and y = abcba. The following sequences are parti- tions of x with respect to y, where we have marked the single characters in boldface: P1 = (cba, a, b, d, cb) with |P1| = 2, P2 = (c, b, a, a, b, d, cb) with |P2| = 3 and P3 = (cb, a, ab, d, cb) of size |P3| = 2. 

Definition 3.24 The maximal matches distance to x from y is defined as

δ(x‖y) := min{|P| : P is a partition of x with respect to y}.

Obviously δ is not symmetric: δ(x‖y) = 0 is equivalent to x being a substring of y. Therefore the name distance is a misnomer. We can turn δ into a metric by using an appropriate symmetric function f(·, ·), i.e., defining d‖(x, y) := f(δ(x‖y), δ(y‖x)). Care needs to be taken to satisfy the triangle inequality when choosing f.

Theorem 3.25 The following function is a metric on Σ^∗:

d‖(x, y) := log(δ(x‖y) + 1) + log(δ(y‖x) + 1).

Proof. Symmetry and identity of indiscernibles are easily checked; we now check the triangle inequality. Let x, y, u ∈ Σ^∗. The number of substrings from u needed to cover x is δ(x‖u) + 1. We further need δ(u‖y) + 1 substrings from y to cover u. This means that we can cover x with at most (δ(x‖u) + 1) · (δ(u‖y) + 1) substrings from y, i.e.,

δ(x‖y) + 1 ≤ (δ(x‖u) + 1) · (δ(u‖y) + 1) and (by the same argumentation) δ(y‖x) + 1 ≤ (δ(y‖u) + 1) · (δ(u‖x) + 1).


Taking the logarithm of the product of these two inequalities, we obtain

log(δ(x‖y) + 1) + log(δ(y‖x) + 1) ≤ log(δ(x‖u) + 1) + log(δ(u‖y) + 1) + log(δ(y‖u) + 1) + log(δ(u‖x) + 1),

which is the triangle inequality d‖(x, y) ≤ d‖(x, u) + d‖(u, y). □

Relation to the edit distance. An important property of the maximal matches distance is that it bounds the edit distance from below.

Theorem 3.26 Let d be the unit cost edit distance. Then

max{δ(x‖y), δ(y‖x)} ≤ d(x, y).

Proof. Consider a minimum-cost edit sequence e transforming x into y. Each run of copy operations in e corresponds to a substring of x that is exactly covered by a substring of y and can thus be used as part of a partition. It follows that there exists a partition (both of x with respect to y and of y with respect to x) whose size is bounded by the number of non-copy operations in e; this number is by definition equal to the edit distance. □

Efficient computation. The next question is how to find a minimum size partition of x with respect to y among all the partitions. Fortunately, this turns out to be easy: A greedy strategy does the job. We start covering x at the first position by a match of maximal length k such that z1 := x[1 . . . k] is a substring of y, but z1 b1 = x[1 ... (k + 1)] is not a substring of y. We apply the same strategy to the remainder of x, x[(k + 2) . . . m].

Definition 3.27 The left-to-right partition Plr(x, y) = (z1, b1, . . . , zℓ, bℓ, zℓ+1) of x with respect to y is the partition defined by the condition that for all l ∈ [1, ℓ], zl bl is not a substring of y. Similarly, we can define the right-to-left partition.

Theorem 3.28 The left-to-right partition is a partition of minimal size, i.e.,

|Plr(x, y)| = min{|P| : P is a partition of x with respect to y} = δ(x‖y).

The same is true for the right-to-left partition.

Proof. If x is a substring of y, then the optimal partition is the left-to-right partition and has size zero. So assume that x is not a substring of y and any partition has size ≥ 1. First, we prove the following (obvious) statement: If the optimal partition for a string x′ (with respect to y) has size ℓ, and if x = px′ contains x′ as a suffix, then the optimal partition for x cannot have size smaller than ℓ. If it did, we could restrict that smaller partition to the suffix x′, and would thus obtain a smaller partition for x′.

Now we show that the above statement proves the theorem: A partition P = (z1, b1, . . . ) decomposes x = z1 b1 x′, where z1 is a substring of y, b1 is the first separating single character, and x′ is the remaining suffix of x. The size of such a partition is at least one plus the size of an optimal partition of x′. Since the left-to-right partition induces the shortest suffix x′,

by the above statement, its optimal partition is never larger than all partitions of longer suffixes. The statement for the right-to-left partition follows since it corresponds to the left-to-right partition of the reversed strings. □

It remains to discuss how (and how fast) we can compute the size of a left-to-right partition. We will see later that the longest common substring z that starts at a given position i in x and occurs somewhere in y can be found in O(|z|) time using an appropriate index data structure of y that ideally can be constructed in O(|y|) time (e.g. a suffix tree of y, defined in Chapter 7). Since the substrings cover x, we obtain the following result.

Theorem 3.29 The maximal matches distance δ(x‖y) of a sequence x ∈ Σ^m with respect to another sequence y ∈ Σ^n can be computed in O(m + n) time (using a suffix tree of y).
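For illustration (our sketch, not from the notes), the greedy left-to-right partition can be computed as follows in Python; it uses naive substring tests instead of a suffix tree, so it does not achieve the O(m + n) bound, but it returns the correct value:

def maximal_matches_distance(x: str, y: str) -> int:
    """Size of the left-to-right partition of x with respect to y, i.e., delta(x || y)."""
    size, i = 0, 0
    while i < len(x):
        k = i
        # extend the current piece z = x[i..k-1] as long as it is a substring of y
        while k < len(x) and x[i:k + 1] in y:
            k += 1
        if k == len(x):      # the remaining suffix is a substring of y: done
            break
        size += 1            # x[k] becomes the next separating single character b_l
        i = k + 1
    return size

# Example 3.23: x = cbaabdcb, y = abcba
print(maximal_matches_distance("cbaabdcb", "abcba"))  # prints 2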

3.9 Filtering

The comparison of sequences is an important operation applied in several fields, such as molecular biology, speech recognition, computer science, and coding theory. The most important model for sequence comparison is the model of (standard unit cost) edit distance. However, when comparing biological sequences, the edit distance computation is often too time-consuming. A fortunate coincidence is that the maximal matches distance and the q-gram distance of two sequences can be computed in time proportional to the sum of the lengths of the two sequences (“linear time”) and provide lower bounds for the edit distance. These results imply that we can use the q-gram distance or the maximal matches distance as a filter: Assume that our task is to check whether d(x, y) ≤ t. Before we start the edit distance computation, we compute (for suitable q) the q-gram distance and the maximal matches distances. If we find that dq(x, y)/(2q) > t for any q, or δ(x‖y) > t or δ(y‖x) > t, then we can decide quickly and correctly that d(x, y) > t without actually computing d(x, y), because these distances provide lower bounds. If none of the above values exceeds t, we compute the edit distance. The distinguishing feature of a filter is that it makes no errors. On the other hand, one can develop heuristics that approximate the edit distance and are efficiently computable. Good heuristics will be close to the true result with high probability; hence the error made by using them is small most of the time.
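A minimal sketch of such a filter in Python, reusing the functions qgram_distance and maximal_matches_distance from above; the function edit_distance for the exact (expensive) computation is only assumed here and not shown:

def within_distance(x: str, y: str, t: int, q: int, alphabet: str = "ACGT") -> bool:
    """Decide whether d(x, y) <= t, using the q-gram and maximal matches
    distances as lower-bound filters before the expensive edit distance."""
    if qgram_distance(x, y, q, alphabet) / (2 * q) > t:
        return False                               # lower bound already exceeds t
    if maximal_matches_distance(x, y) > t or maximal_matches_distance(y, x) > t:
        return False                               # lower bound already exceeds t
    return edit_distance(x, y) <= t                # filters passed: compute exactly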

CHAPTER 4

Pairwise Sequence Alignment

Contents of this chapter: Alignment. Alignment alphabet. Alignment score. Alignment graph. Global alignment. Local alignment. Linear gap costs. Affine gap costs. Further contents in the appendix (Chapter B): Number of Alignments.

4.1 Definition of Alignment

An edit sequence e ∈ E∗ describes in detail how a sequence x can be transformed into another sequence y, but it is hard to visualize the exact relationship between single characters of x and y when looking at e. An equivalent description that is more visually useful is an alignment of x and y. We restrict our considerations to the operations of copying, substituting, inserting, and deleting characters.

We first give an example: Take x = AAB and e = I_B C S_B D; then E(x, e) = BAB =: y. Where does the first B in y come from? It has been inserted, so it is not related to the first A in x. The next A in y is a copy of the first A from x. The last B in y has been created by substituting the second A in x by B. Finally, the last B in x has no corresponding character in y, since it was deleted. We write these relationships as follows.

–AAB
BAB–

We see that an alignment consists of (vertical) pairs of symbols from Σ, representing copies or substitutions, or pairs of a symbol from Σ and a gap character (–), indicating an insertion or deletion. Note that a pair of gap characters is not possible. Formally, the alignment alphabet for Σ is defined as

A ≡ A(Σ) := (Σ ∪ {–})^2 \ {(–/–)},

where we write an alignment column with top entry a and bottom entry b as a/b.


An alignment in general is a string over A. A symbol C ∈ A is called a column of an alignment. If one of the characters of C is a gap character, it is called an indel column. A column of the form a/a for a ∈ Σ is called a match column (or copy column if we want to stick to the edit sequence terminology). Consequently, a column of the form a/b for a ≠ b is called a mismatch column (substitution column). There is an obvious one-to-one relationship between edit sequences that transform x into y and alignments of x and y. We define the bijective function α : E → A by α(S_{a,b}) := a/b for all a, b ∈ Σ, α(I_a) := –/a and α(D_a) := a/– for all a ∈ Σ.

Definition 4.1 Let C = c1/c2 ∈ A be an alignment column. The projection to the first row π{1} and the projection to the second row π{2} are defined as functions A → Σ ∪ {ε} with

π{1}(C) := c1 if c1 ≠ – and ε if c1 = –;   π{2}(C) := c2 if c2 ≠ – and ε if c2 = –.

Further, let A = (C1C2 ··· Cn) be an alignment. The projection to the first row π{1} and projection to the second row π{2} are then the following

π{1}(A) := π{1}(C1) π{1}(C2) ··· π{1}(Cn),
π{2}(A) := π{2}(C1) π{2}(C2) ··· π{2}(Cn).

In other words, the first (second) projection simply reads the first (second) row of an align- ment, omitting all gap characters.

Definition 4.2 Let x, y ∈ Σ^∗. A global alignment of x and y is a sequence A ∈ A^∗ of alignment columns with π{1}(A) = x and π{2}(A) = y.

Observation 4.3 Let x ∈ Σm, y ∈ Σn, and let A be an alignment of x and y. Let e be the edit sequence corresponding to A. Then

max{m, n} ≤ |A| = |e| ≤ m + n.

Proof. The equality |A| = |e| follows from the equivalence of edit sequences and alignments. Since each edit operation (alignment column) consumes at least one character from x or y, their number is bounded by m + n. The maximum is reached if only insertions and deletions (indel columns) are used. On the other hand, m = |x| = |π{1}(A)| ≤ |A| since the projection is never longer than the alignment. Similarly n ≤ |A|, so that |A| ≥ max{m, n}. The boundary case is reached if the maximum number of copy and substitution operations (match/mismatch columns) and only the minimum number max{m, n} − min{m, n} of indels is used. □

The above observation applies to alignments based on match/mismatch and indel columns (copy/substitution and insertion/deletion operations). If additionally flips are allowed, we have to extend the alignment alphabet such that columns may also contain flipped pairs of letters (“flip columns”): A ⊃ { ab/ba : a, b ∈ Σ, a ≠ b }.

Definition 4.4 For the edit distance model, the cost of an alignment column C ∈ A is defined as the cost of the corresponding edit operation: cost(C) := cost(α^{−1}(C)). The cost of an alignment A ∈ A^∗ is defined as the sum of its columns’ costs, i.e., cost(A) = ∑_{i=1}^{|A|} cost(A_i).


Definition 4.5 The alignment cost of two sequences x, y ∈ Σ^∗ is defined as d(x, y) := min{cost(A) : A ∈ A^∗, π{1}(A) = x, π{2}(A) = y}. The cost-minimizing alignments are A^opt(x, y) := {A ∈ A^∗ : π{1}(A) = x, π{2}(A) = y, cost(A) = d(x, y)}.

Note the following subtle point: alignment cost of two sequences should not be confused with the cost of a particular alignment of those sequences!

Problem 4.6 (Alignment Problem) For two given strings x, y ∈ Σ^∗ and a given cost function, find the alignment cost of x and y and one or all optimal alignment(s).

4.2 The Alignment Score

So far, we only considered cost models, e.g. the unit cost, to measure the distance between two sequences. As long as we are interested in a global comparison, that makes sense. But when we are interested in a local comparison to learn about the least different parts of two sequences, we get a problem. A distance can only punish differences, not reward similarities, since it can never drop below zero. Note that the empty sequence ε is a (trivial) substring of every sequence and d(ε, ε) = 0; so this (probably completely uninteresting) common part is always among the best possible ones.

The problem of finding the least different parts of two sequences is more easily formulated in terms of similarity than in terms of dissimilarity or distance. That means that we assign a positive similarity value (or score) to each pair that consists of a copy of the same letter and a negative score to each mismatched pair (corresponding to a substitution operation), and also to insertions and deletions.

Definition 4.7 Given x, y ∈ Σ^∗ and an edit alphabet E with a score function score(e) ∈ R for each edit operation e ∈ E, the edit score s(x, y) is defined as

s(x, y) := max{score(e): e ∈ E∗,E(x, e) = y}.

Similarly to the cost of an alignment, the score of an alignment can be defined.

Definition 4.8 Until we define more complex gap scoring models and give a sensible similarity function, the score of an alignment column C ∈ A is defined as the score of the corresponding edit operation: score(C) := score(α^{−1}(C)), so the gap score score(–/a) = score(a/–) is negative and often has the same value for all a ∈ Σ. Its absolute value is also called gap cost. The score of an alignment A ∈ A^∗ is defined as the sum of its columns’ scores, i.e., score(A) = ∑_{i=1}^{|A|} score(A_i).

Definition 4.9 The alignment score of two sequences x, y ∈ Σ^∗ is defined as s(x, y) := max{score(A) : A ∈ A^∗, π{1}(A) = x, π{2}(A) = y}. The score-maximizing alignments are A^opt(x, y) := {A ∈ A^∗ : π{1}(A) = x, π{2}(A) = y, score(A) = s(x, y)}.


4.3 The Alignment Graph

So far, we can describe the relation between two biological sequences with edit sequences and alignments. This section introduces a third way: as paths in a particular graph called the alignment graph.

Definition 4.10 The (global) alignment graph of two strings x ∈ Σm and y ∈ Σn is a directed acyclic edge-labeled and edge-weighted graph G(x, y) := (V, E, λ, w) with

• vertex set V := {(i, j) | 0 ≤ i ≤ m, 0 ≤ j ≤ n} ∪ {v◦, v•}, • edge set E ⊆ V × V (edges are written in the form u → v) with:

– “horizontal” edges (i, j) → (i, j + 1) for all 0 ≤ i ≤ m and 0 ≤ j < n,

– “vertical” edges (i, j) → (i + 1, j) for all 0 ≤ i < m and 0 ≤ j ≤ n,

– “diagonal” edges (i, j) → (i + 1, j + 1) for all 0 ≤ i < m and 0 ≤ j < n,

– an “initialization” edge v◦ → (0, 0) and a “finalization” edge (m, n) → v•, • edge labels λ : E → A∗ assigning zero, one, or several alignment columns to each edge as follows: –  – λ((i, j) → (i, j + 1)) := y[j+1] ,

x[i+1]  – λ((i, j) → (i + 1, j)) := – , x[i+1]  – λ((i, j) → (i + 1, j + 1)) := y[j+1] ,

– λ(v◦ → (0, 0)) := ε, and λ((m, n) → v•) := ε. Note that presently, each edge within the rectangular vertex matrix is labeled with exactly one alignment column.

• edge weight function w : E → R assigning a score or cost to each edge. It is defined as the score or cost of the respective label (alignment column): w(e) := score(λ(e)) or w(e) := cost(λ(e)), where score(ε) = cost(ε) := 0.

Example 4.11 For unit cost edit distance, the cost weights are w(e) = 1 for vertical and horizontal edges, w(e) = 1_{x[i+1] ≠ y[j+1]} for the diagonal edge e = ((i, j) → (i + 1, j + 1)), and w(e) = 0 for the edges from v◦ and to v•.

Definition 4.12 A path in a graph is a sequence of vertices p = (v0, . . . , vK) such that (v_{k−1} → v_k) ∈ E for all k ∈ [1, K]. The length of p is |p| := K. The weight of a path (score or cost, depending on the setting) is the sum of its edge weights, i.e., w(p) := ∑_{k=1}^{|p|} w(v_{k−1} → v_k).

Observation 4.13 Each global alignment A of x and y corresponds to a path p(A) from v◦ to v• in the global alignment graph G(x, y). The alignment that corresponds to p is simply the concatenation of the labels on the path’s edges. Furthermore, score(A) = w(p(A)) or cost(A) = w(p(A)), depending on the setting.


In particular, a maximum weight path from v◦ to v• corresponds to a score-maximizing alignment of x and y.

In more detail, there is a one-to-one correspondence between alignments of x[i0 . . . i] with y[j0 . . . j] and paths from (i0 − 1, j0 − 1) to (i, j). Also, there is a one-to-one correspondence between a vertex v = (i, j) in an alignment graph and the position (i, j) in the corresponding alignment matrix (Section 3.6).

Figure 4.1: Global alignment graph for a sequence of length 3 and one of length 4. Edge weights and labels have been omitted. The initial vertex v◦ is shown at the top left corner; the final vertex v• is shown at the bottom right corner.

4.4 A Universal Alignment Algorithm

The correspondence between paths in G(x, y) and alignments implies that the global alignment problem (i.e., to find one or all optimal alignment(s)) for x, y ∈ Σ^∗ is equivalent to finding a maximum weight path from v◦ to v• in G(x, y).

Since the number of paths is equal to the number of alignments and therefore grows super-exponentially (see Appendix B), it is impossible to enumerate them all. We therefore use the same idea as in Theorem 3.5 and apply dynamic programming. The following presentation is given in a score context. In a cost context, we minimize costs instead of maximizing scores. The principle remains unchanged, of course.

Algorithm 4.14 (Universal Alignment Algorithm) In G(x, y), we define vertex values S : V → R as follows: We set

S(v) := 0 if v = v◦,   and   S(v) := max_{(u→v)∈E} { S(u) + w(u → v) } if v ≠ v◦.      (4.1)

Since G(x, y) is acyclic, S is well-defined, and we can arrange the computation in such an order that by the time we arrive at any vertex v, we have already computed S(u) for all predecessors needed in the maximum for S(v). A possible order is: v◦ first (the only vertex with no incoming edges), then row-wise through the rectangular vertex array, and finally v• (the only vertex with no outgoing edges).


What is the interpretation of the value S(v)? The following theorem gives the answer.

Theorem 4.15 For the global alignment graph G(x, y), S(v) is the maximum weight of any path from v◦ to v. Therefore, if v = (i, j), it is equal to the maximum alignment score of x[1 . . . i] and y[1 . . . j]. It follows that the alignment score of x and y can be read off at v•, since its only predecessor is (|x|, |y|):

 s(x, y) = S (|x|, |y|) = S(v•).

Proof. This is exactly the same argument as in the proof of Theorem 3.5, except that we are now maximizing scores instead of minimizing costs. □

Definition 4.16 For each edge e = (u → v) ∈ E, by definition S(v) ≥ S(u) + w(e). We say that e is a maximizing edge if S(v) = S(u) + w(e) and that a path p is a maximizing path if it consists only of maximizing edges. (In a cost context, we define a minimizing edge and a minimizing path similarly, thus S(v) ≤ S(u) + w(e).)

Backtracing. How do we find the optimal paths? As previously for the computation of the edit sequence, by tracing back.

When computing S(v), we take note which of the incoming edges are maximizing edges. After arriving at v•, we trace back a maximizing path to v◦. If at some vertex we have to make a choice between several maximizing edges, we remember this branching point on a stack and first follow one choice. Later we use backtracking to come back and systematically explore the other choice(s). Thus, we can systematically enumerate all maximizing paths and hence construct all optimal alignments. (In a cost context, we only consider minimizing edges and trace back a minimizing path.) This procedure is entirely similar to the one described in Section 3.6. It takes O(|A|) time to reconstruct an alignment A.

4.5 Alignment Types: Global, Free End Gaps, Local

A remarkable feature of our graph-based formulation of the global alignment problem is that we can now consider different variations of the alignment problem without changing Algorithm 4.14. In this sense, it is a Universal Alignment Algorithm. We do change the structure of the graph, though! For each variation, four things should be specified:

1. the structure of the alignment graph G = G(x, y),

2. the interpretation of the vertex values S(v),

3. an argument of proof that the given interpretation is correct with respect to the graph structure and the Universal Alignment Algorithm 4.14, and

4. an explicit representation of Equation (4.1), accounting for the graph structure.


Global alignment. Global alignment is what we have discussed so far: Both sequences must be aligned from start to end. Often, the gap score is the same for all characters, say −γ with gap cost γ > 0. We can write the universal algorithm explicitly for global alignment, and it is a good exercise to do it: Vertex (0, 0) has only one incoming edge from v◦, and vertices (i, 0) and (0, j) for i, j ≥ 1 have one incoming edge from (i − 1, 0) and (0, j − 1), respectively. Vertices (i, j) for i ≥ 1 and j ≥ 1 have three incoming edges. Therefore we obtain the algorithm shown in Equation (4.2), known as the global alignment algorithm or Needleman-Wunsch algorithm (Needleman and Wunsch, 1970). It is important to understand that for our definition of G(x, y), the recurrence in Equation (4.2) is exactly equivalent to Equation (4.1) in the Universal Alignment Algorithm (see 4.14). Also note that it is essentially the same algorithm as the one in Theorem 3.5.

S(v) =
  0                                        if v = v◦,
  S(v◦)                                    if v = (0, 0),
  S((i − 1, 0)) − γ                        if v = (i, 0) for 1 ≤ i ≤ |x|,
  S((0, j − 1)) − γ                        if v = (0, j) for 1 ≤ j ≤ |y|,
  max{ S((i − 1, j − 1)) + score(x[i], y[j]),
       S((i − 1, j)) − γ,
       S((i, j − 1)) − γ }                 if v = (i, j) for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|,
  S((|x|, |y|))                            if v = v•.                                  (4.2)

The initial case is trivial, from now on it will be omitted.
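A direct Python transcription of Equation (4.2) might look as follows (our sketch; the similarity function score and the gap cost gamma are parameters, and only the optimal score is computed, not an optimal alignment):

def global_alignment_score(x: str, y: str, score, gamma) -> float:
    """Needleman-Wunsch: maximum global alignment score of x and y
    with match/mismatch scores score(a, b) and linear gap cost gamma."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] - gamma                  # vertical edges in column 0
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] - gamma                  # horizontal edges in row 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),   # diagonal edge
                          S[i - 1][j] - gamma,                          # vertical edge
                          S[i][j - 1] - gamma)                          # horizontal edge
    return S[m][n]

# toy scoring: +1 for a match, -1 for a mismatch, gap cost 1
print(global_alignment_score("AAB", "BAB", lambda a, b: 1 if a == b else -1, 1))  # prints 1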

Free end gaps. If we expect that one sequence is (except for a few differences) essentially a substring of the other, or that x = x′z and y = zy′, so that x and y have a long overlap z (but are otherwise unrelated), a global alignment makes little sense, since it would forcibly match equal symbols even in the regions where we do not expect any similarity. The culprit is the high cost to pay for gaps at either end of either sequence. We can remove those costs by changing the graph structure as follows (see Figure 4.2): In addition to the edges present in the global alignment graph, we add

• initialization edges v◦ → (i, 0) and v◦ → (0, j) with weight zero and empty label for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|,

• finalization edges (i, |y|) → v• and (|x|, j) → v• with weight zero and empty label for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|.

This allows an alignment to start at the beginning of either sequence; it can start at any character in the other sequence with zero penalty. Similarly, the alignment ends at the end of either sequence. By this construction, S(i, j) can be interpreted as the maximum score of a global alignment of either x[1 . . . i] with y[j′ . . . j] for any j′ ≤ j, or of x[i′ . . . i] for any i′ ≤ i with y[1 . . . j]. Explicitly, the free end gap alignment algorithm looks as in Equation (4.3).


S(v) =
  0                                        if v = v◦,
  S(v◦)                                    if v = (i, 0) for 0 ≤ i ≤ |x|,
  S(v◦)                                    if v = (0, j) for 0 ≤ j ≤ |y|,
  max{ S((i − 1, j − 1)) + score(x[i], y[j]),
       S((i − 1, j)) − γ,
       S((i, j − 1)) − γ }                 if v = (i, j) for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|,
  max{ max_{0≤i≤|x|} S((i, |y|)), max_{0≤j≤|y|} S((|x|, j)) }   if v = v•.             (4.3)

Again, it is important to understand that the recurrence is equivalent to Equation (4.1). Free end gap alignment is very important for genome assembly, i.e., when whole genomes are assembled from short sequenced DNA fragments: One has to determine how the fragments overlap, taking into account possible errors.
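In terms of the Python sketch above, free end gap alignment only changes the initialization and where the result is read off (again our sketch, under the same assumptions):

def free_end_gap_score(x: str, y: str, score, gamma) -> float:
    """Free end gap (overlap) alignment score, cf. Equation (4.3):
    gaps at the beginning and end of either sequence are free."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]      # first row and column stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          S[i - 1][j] - gamma,
                          S[i][j - 1] - gamma)
    # the alignment may end at the end of either sequence
    return max(max(S[i][n] for i in range(m + 1)),
               max(S[m][j] for j in range(n + 1)))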

Figure 4.2: Free end gap alignment graph for a sequence of length 3 against one of length 4. The initial vertex v◦ is shown at the top left corner; the final vertex v• is shown at the bottom right corner.

Approximate string matching. If one sequence, say x, is short and we are interested in the best match of x within y, we can modify global alignment or free end gap alignment in such a way that the whole of x is required to be aligned to any part of y. This is also referred to as semi-global or glocal alignment (global in the short sequence, local in the long sequence). We need the following edges in addition to those needed for a global alignment (they are a subset of those needed for free end gap alignment, see Figure 4.3):

• initialization edges v◦ → (0, j) with weight zero and empty label for 1 ≤ j ≤ |y| (i.e., to the 0-th row),

• finalization edges (|x|, j) → v• with weight zero and empty label for 1 ≤ j ≤ |y| (i.e., from the last row).

We do not give an explicit formulation here, but come back to this problem in Section 5.2, where we present Sellers’ algorithm and an improvement for this problem.


Figure 4.3: Alignment graph for the approximate string matching of a sequence of length 3 and one of length 4. The initial vertex v◦ is shown at the top left corner; the final vertex v• is shown at the bottom right corner.

Local alignment. Often, two proteins are not globally similar and also do not overlap at either end. Instead, they may share one highly similar region (e.g. a conserved domain) anywhere inside the sequences. In this case, it makes sense to look for the highest scoring pair of substrings of x and y. This is referred to as an optimal local alignment of x and y. As a local alignment can start and end anywhere, the following edges must be added to the global alignment graph (we do not show a figure, it would be too crowded):

• initialization edges v◦ → (i, j) for all 0 ≤ i ≤ |x| and 0 ≤ j ≤ |y|,

• finalization edges (i, j) → v• for all 0 ≤ i ≤ |x| and 0 ≤ j ≤ |y|.

Now S(i, j) is the maximum score attainable by aligning substrings x[i′ . . . i] with y[j′ . . . j] for any 0 ≤ i′ ≤ i and 0 ≤ j′ ≤ j. To get the best score of an alignment that ends anywhere, the finalization edges maximize over all ending positions. Explicitly written, we obtain the local alignment algorithm shown in Equation (4.4), also known as the Smith-Waterman algorithm (Smith and Waterman, 1981). It is one of the most fundamental algorithms in sequence analysis. Because of its importance, it deserves a few remarks.

S(v) =
  0                                        if v = v◦,
  S(v◦)                                    if v = (i, 0) for 0 ≤ i ≤ |x|,
  S(v◦)                                    if v = (0, j) for 0 ≤ j ≤ |y|,
  max{ S(v◦),
       S((i − 1, j − 1)) + score(x[i], y[j]),
       S((i − 1, j)) − γ,
       S((i, j − 1)) − γ }                 if v = (i, j) for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|,
  max_{0≤i≤|x|, 0≤j≤|y|} S((i, j))         if v = v•.                                  (4.4)

• Prior to the work of Smith and Waterman (1981), the notion of the “best” matching region of two sequences was not uniquely defined, and often computed by heuristics whose output was taken as the definition of “best”. Now we have a clear definition


(take any pair of substrings, compute an optimal global alignment for each pair, then take the pair with the highest score). Note that the time complexity of this algorithm is still O(mn) for sequences of lengths m and n, much better than in fact globally aligning each of the O(m^2 n^2) pairs of substrings in O(mn) time for a total time of O(m^3 n^3).

• Since the empty sequences are candidates for substrings and receive a global alignment score of zero, the local alignment score of any two sequences is always nonnegative.

• Two random sequences should have no similar region. Of course, even a single identical symbol will make a positive contribution to the score. Therefore small positive scores are meaningless. Also, the average score in a random model should be negative; otherwise we would obtain large positive scores with high probability for random sequences, which makes it hard to distinguish random similarities from true evolutionary relationships. We investigate these issues further in Appendix D.

• Even though the O(mn) running time appears reasonable at first sight, it can become a severe bottleneck in large-scale sequence comparison. Therefore, often a filter or a heuristic is used. Several practical sequence comparison tools are presented in Chapter 6.

• The O(mn) space requirement for S and the backtracing pointers is an even more severe limitation. As mentioned previously, linear-space methods exist (they incur a small time penalty, approximately a factor of two, see Section 5.4 for details) and are widely used in practice.

• The Smith-Waterman notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score, but it is unable to find the local alignment with maximum degree of similarity (e.g. maximal percent of matches). As a result, local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. Arslan et al. (2001) proposed an algorithm based on fractional programming with O(mn log m) running time on two sequences of lengths m ≥ n that reports the regions with maximum degree of similarity. In practice, this so-called length-normalized local alignment is only 3–5 times slower than the standard Smith-Waterman algorithm. We will discuss it in Appendix G.1.

Local alignment only makes sense when measuring with a score function. If we used a cost function, the empty alignment would always be the best.
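Written as a Python sketch in the same style as the global alignment code above (ours; only the optimal local score is computed, backtracing is omitted):

def local_alignment_score(x: str, y: str, score, gamma) -> float:
    """Smith-Waterman: maximum score over all pairs of substrings of x and y,
    cf. Equation (4.4)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row/column are 0 (empty local alignment)
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,                                            # start a new local alignment
                          S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          S[i - 1][j] - gamma,                          # gap character in y
                          S[i][j - 1] - gamma)                          # gap character in x
            best = max(best, S[i][j])           # maximize over all end positions (finalization edges)
    return best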

4.6 Score and Cost Variations for Alignments

Position-specific scores. The match/mismatch score in all of the above algorithms depends on a (sensible) similarity function score as defined in Definition A.2, often given by a log-odds score matrix. However, this does not need to be the case. The algorithm does not change in any way if indel and match/mismatch scores depend also on the position in the sequences. Therefore we can replace score(xi, yj) by score(i, j) and worry later about where to obtain reasonable score values.


Figure 4.4: Bipartite scoring scheme for detection of homologous transmembrane proteins due to Müller et al. (2001). The figure represents the Smith-Waterman alignment matrix and indicates which scoring matrix is used for which query positions (rows): In transmembrane helices (TM), the transmembrane-specific scoring matrix SLIM is used, elsewhere the general-purpose matrix BLOSUM.

A useful application of this observation is as follows: Local alignment is often used in large-scale database searches to find remote homologs of a given protein. Transmembrane (TM) proteins have an unusual sequence composition (as pointed out in Section A.3). According to Müller et al. (2001), one can increase the sensitivity of the search by using a bipartite scoring scheme with a (non-symmetric) transmembrane helix specific scoring matrix (such as SLIM) for the TM helices and a general-purpose scoring matrix (such as BLOSUM) for the remaining regions of the query protein; see Figure 4.4. As a consequence, the score of aligning xi with yj does not only depend on the amino acids xi and yj themselves, but also on the position i within the query sequence x.

General gap costs. So far, a gap of length d in any sequence receives a score of −d · γ, where γ > 0 is the gap cost (assuming that it is not character-specific or position-specific). Therefore it does not matter whether we interpret a run of d consecutive gap characters as one long gap or as d gaps of length 1. This is the case of linear gap costs. In general a function g : R → R is linear if g(k + l) = g(k) + g(l) and g(λx) = λg(x). When indels occur, they often concern more than a single character. Therefore we need more flexibility for specifying gap costs. The most general (not character- or position-specific) case allows a general gap cost function g : N → R_0^+, where we pay g(l) for a gap of length l. We always set g(0) = 0, and frequently, in practice, we demand that gap costs are subadditive, i.e., g(k + l) ≤ g(k) + g(l) for all k, l ≥ 0. In this way, we penalize longer gaps relatively less than shorter gaps. In protein coding regions, however, the following gap cost function may be reasonable; note that it is not subadditive:

g(ℓ) := ℓ/3 if ℓ mod 3 = 0,   g(ℓ) := ℓ + 2 otherwise.

For the increased generality, we have to pay a price: a significant increase in the number of edges in the alignment graph and in the time complexity of the algorithm. The changes

apply to all of the above variations (global, free end gap, glocal, and local alignment). To the appropriate G(x, y) we add

• horizontal edges (i, j′) → (i, j) for all 0 ≤ i ≤ |x| and all pairs 0 ≤ j′ < j ≤ |y| with respective weights g(j − j′) and labels (–– ... ––)/(y[j′+1 ... j]),

• vertical edges (i′, j) → (i, j) for all 0 ≤ j ≤ |y| and all pairs 0 ≤ i′ < i ≤ |x| with respective weights g(i − i′) and labels (x[i′+1 ... i])/(–– ... ––).

Since each vertex now has O(m + n) predecessors, the running time of Algorithm 4.14 increases to cubic (O(mn(m + n)) for sequences of lengths m and n). In Equation (4.5) we show explicitly the Smith-Waterman algorithm with general gap costs g : N → R_0^+.

S(v) =
  0                                        if v = (i, 0) for 0 ≤ i ≤ |x|,
  0                                        if v = (0, j) for 0 ≤ j ≤ |y|,
  max{ 0,
       S((i − 1, j − 1)) + score(x[i], y[j]),
       max_{0≤i′<i} { S((i′, j)) − g(i − i′) },
       max_{0≤j′<j} { S((i, j′)) − g(j − j′) } }   if v = (i, j) for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y|,
  max_{0≤i≤|x|, 0≤j≤|y|} S((i, j))         if v = v•.                                  (4.5)

Although this starts to look intimidating, remember that this formula is still nothing else than Equation (4.1). General gap costs are useful, but the price in time complexity is usually too expensive to be paid. Fortunately, two special cases (but still more general than linear gap costs) allow more efficient algorithms: affine gap costs, to be discussed next, and concave gap costs, an advanced topic not considered in these notes.

Affine gap costs. Affine gap costs are important and widely used in practice. A gap cost function g : N → R_0^+ is called an affine gap cost function if g(l) = d + (l − 1) · e, where d > 0 is called the gap open cost and 0 < e ≤ d is called the gap extension cost. The gap open cost is paid once for every consecutive run of gaps, including the first (opening) gap. Each additional gap character then costs only e. Of course, this case can be treated in the framework of general gap costs. We shall see, however, that a quadratic-time algorithm (O(mn) time) exists; the idea is due to Gotoh (1982). We explain it for global alignment; the required modifications for the other alignment types are easy.

Recall that S(i, j) is the alignment score for the two prefixes x[1 . . . i] and y[1 . . . j]. In general, such a prefix alignment can end with a match/mismatch, a deletion, or an insertion. In the indel case, either the gap is of length ℓ = 1, in which case its cost is g(1) = d, or its length is ℓ > 1, in which case its cost can recursively be computed as g(ℓ) = g(ℓ − 1) + e.

The main idea is to additionally keep track of (i.e., to tabulate) the state of the last alignment column. In order to put this idea into an algorithm, we define the following additional two

matrices:

V(i, j) := max{ score(A) : A is an alignment of the prefixes x[1 . . . i] and y[1 . . . j] that ends with a gap character in y },

H(i, j) := max{ score(A) : A is an alignment of the prefixes x[1 . . . i] and y[1 . . . j] that ends with a gap character in x }.

Then

S(i, j) = max{ S(i − 1, j − 1) + score(x[i], y[j]), V(i, j), H(i, j) },

which gives us a method to compute the alignment matrix S, given the matrices V and H. It remains to explain how V and H can be computed efficiently. Consider the case of V(i, j): A gap of length ℓ ending at position (i, j) is either a gap of length ℓ = 1, in which case we can easily compute V(i, j) as V(i, j) = S(i − 1, j) − d. Or, it is a gap of length ℓ > 1, in which case it is an extension of the best scoring vertical gap ending at position (i − 1, j), V(i, j) = V(i − 1, j) − e. Together, we see that for 1 ≤ i ≤ m and 0 ≤ j ≤ n,

V(i, j) = max{ S(i − 1, j) − d, V(i − 1, j) − e }.

Similarly, for horizontal gaps we obtain for 0 ≤ i ≤ m and 1 ≤ j ≤ n,

H(i, j) = max{ S(i, j − 1) − d, H(i, j − 1) − e }.

The border elements are initialized in such a way that they do not contribute to the maximum in the first row or column, for example:

V (0, j) = H(i, 0) = −∞.

An analysis of this algorithm reveals that it uses O(mn) time and space. Hidden in the O-notation is a larger constant factor compared to the case of linear gap costs. Note that, if only the optimal score is required, only two adjacent columns (or rows) from the matrices V , H, and S are needed. For backtracing, only the matrix S (or a matrix of back-pointers) is needed; so V and H never need to be kept in memory entirely. Thus the memory requirement for pairwise alignment with affine gap costs is even in practice not much worse than for linear gap costs. The above affine gap cost formulas can also be derived as a special case of the Universal Alignment Algorithm 4.14. However, we need three copies of the rectangular vertex array, corresponding to S, V , and H.
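The following Python sketch (ours) follows these recurrences; for simplicity it keeps the full matrices S, V, and H instead of only two adjacent columns:

def gotoh_global_score(x: str, y: str, score, d, e) -> float:
    """Global alignment score with affine gap costs g(l) = d + (l-1)*e
    (gap open cost d, gap extension cost e), using Gotoh's recurrences."""
    m, n = len(x), len(y)
    NEG = float("-inf")
    S = [[NEG] * (n + 1) for _ in range(m + 1)]   # best prefix alignment score
    V = [[NEG] * (n + 1) for _ in range(m + 1)]   # ... ending with a gap character in y (vertical)
    H = [[NEG] * (n + 1) for _ in range(m + 1)]   # ... ending with a gap character in x (horizontal)
    S[0][0] = 0
    for i in range(1, m + 1):                     # first column: one vertical gap of length i
        V[i][0] = S[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):                     # first row: one horizontal gap of length j
        H[0][j] = S[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(S[i - 1][j] - d, V[i - 1][j] - e)   # open or extend a vertical gap
            H[i][j] = max(S[i][j - 1] - d, H[i][j - 1] - e)   # open or extend a horizontal gap
            S[i][j] = max(S[i - 1][j - 1] + score(x[i - 1], y[j - 1]), V[i][j], H[i][j])
    return S[m][n]

# Example 4.17: WW vs. WNDW with the BLOSUM62 scores given there, d = 11, e = 1
blosum = {("W", "W"): 11, ("W", "N"): -4, ("N", "W"): -4, ("W", "D"): -4, ("D", "W"): -4}
print(gotoh_global_score("WW", "WNDW", lambda a, b: blosum[(a, b)], 11, 1))  # prints 10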

Example 4.17 Given amino acid strings WW and WNDW and the following scoring scheme, find the optimal global alignment. For affine gap costs: Gap open cost of d = 11, gap extension cost of e = 1, and the substitution scores are taken from the BLOSUM62 matrix (score(W, W) = 11, score(W, N) = −4, score(W, D) = −4). Figure 4.5 shows an alignment graph for global alignment with affine gap costs.

The initial (v◦) and final (v•) vertices are omitted. The thick edges mark the path to the optimal score. The optimal alignment is W––W over WNDW, where “−d” marks the start (opening) of the gap and “−e” its extension.



Figure 4.5: An alignment graph for a global alignment with affine gap costs. The score values for V , S and H are in the nodes as indicated on the left side. The edges pointing at each part of a vertex originate at the score-maximizing predecessor vertex or vertex part.

CHAPTER 5

Advanced Topics in Pairwise Alignment

Contents of this chapter: Suboptimal alignments, approximate string matching, Sellers’ algorithm, Ukkonen’s cutoff improvement, forward-backward technique, pairwise alignment in linear space, Hirschberg. Further contents in the appendix (Chapter G): Length-normalized alignment, shadow effect, mosaic effect, optimal normalized alignment score, parametric alignment, ray search problem.

5.1 Suboptimal Alignments

Often it is desirable not only to find one or all optimal alignments of two sequences, but also to find suboptimal alignments, i.e., alignments that score (slightly) below the optimal alignment score. This is especially relevant in a local alignment context where several similar regions might exist and be of biological interest, while the Smith-Waterman Algorithm reports only a single (optimal) one. Also, above we have already argued that in some cases a length-normalized alignment score may be more appropriate, giving us another reason to look at more than a single optimal alignment.

A simple way to find suboptimal local alignments could be to report first the alignment (obtained by backtracing) corresponding to the highest score in the local alignment graph, then the alignment corresponding to the second highest score, and so on. This approach has two disadvantages, though:

1. Redundancy: Many of the local alignments reported by this procedure will be largely overlapping, differing only by the precise placement of single gaps or the exact end characters of the alignments.


2. Shadowing effect: Some high scoring alignments will be invisible in this procedure if an even higher scoring alignment is very close by, for example if the two alignments cross each other in the edit graph, where always the higher scoring prefix will be chosen.

In order to circumvent these disadvantages, Waterman and Eggert (1987) gave a more sensible definition of suboptimal local alignments.

Definition 5.1 An alignment is a nonoverlapping suboptimal local alignment if it has no match or mismatch in common with any higher-scoring (sub)optimal local alignment. (Two alignments A and A′ have a match or mismatch in common (they overlap) if there is at least one pair of positions (i, j) such that x[i] is aligned with y[j] both in A and in A′.)

Algorithmically, nonoverlapping alignments can be computed as follows: First the Smith-Waterman algorithm is run as usual, returning an optimal local alignment. Then, the matrix S is computed again, but any match or mismatch from the reported optimal alignment is forbidden, i.e., in the alignment graph the corresponding edges are removed. In terms of computing the alignment matrix, in the maximization for these cells only a subset of the four possible cases is considered.

This could be done by re-computing the whole matrix, but it is easy to see that it suffices to compute only that part of the matrix from the starting point of the first alignment to the bottom-right of the matrix. Moreover, whenever in a particular row or column the entries in the re-computation do not change compared to the entries in the original matrix, the computation along this row or column can be stopped. Waterman and Eggert suggest to alternate the computation of one row and one column at a time, each one until the new value coincides with the old value.

This procedure can be repeated for the third-best nonoverlapping local alignment, and so on.

If we assume the usual type of similarity function with expected negative score of a random symbol pair so that large parts of the alignment matrix are filled with zeros, then the part of the matrix that needs to be re-computed is not much larger than the alignment itself.

Then, to compute the K best nonoverlapping suboptimal local alignments, the algorithm takes an expected time of O(mn + ∑_{k=1}^{K} L_k^2), where L_k is the length of the k-th reported alignment. In the worst case this is O(mnK), but typically much less in practice.

Huang and Miller (1991) showed that the Waterman-Eggert algorithm can also be implemented for affine gap costs and in linear space.

5.2 Approximate String Matching

We have already discussed a variant of the Universal Alignment Algorithm that can be used to find a best (highest-scoring) approximate match of a (short) pattern x ∈ Σm within a (long) text y ∈ Σn. Often, we are interested in all matches above a certain score (or below a certain cost) threshold. A simple modification makes this possible: We simply disregard the final vertex v• and look at the last row of the S-matrix.


In this section, we come back to the cost-based formulation (and a D-matrix). We assume that a sensible cost function cost : A → R_0^+ is given such that all indel costs have the same value γ > 0.

Definition 5.2 Given a pattern x ∈ Σ^m, a text y ∈ Σ^n, a sensible cost function cost : A → R_0^+ with cost(–/a) = cost(a/–) = γ > 0 for all a ∈ Σ, and a threshold k ≥ 0, the approximate string matching problem consists of finding all text substrings y′ of y such that d(x, y′) ≤ k, where d(·, ·) denotes edit distance with respect to the given cost function.

In fact, it suffices to discover the ending positions j of the substrings y′ because we can always reconstruct the whole substring and its starting position by backtracing.

For a change in style, this time we present an explicit formulation of an algorithm that solves the problem not as a recurrence, but in pseudo-code notation. Algorithm 5.1 is known as Sellers’ algorithm (Sellers, 1980). Of course, it corresponds to the Universal DP variant discussed in the previous section for approximate string matching, except that we do not look at the final vertex v• to find the best match, but look at all the vertices in the last row to identify all matches with D(m, j) ≤ k for 0 ≤ j ≤ n.

Algorithm 5.1 Sellers’ algorithm for a general cost function. D(i, j) is the minimum cost to align x[1 . . . i] to y[j′ . . . j] for any j′ ≤ j.
Input: pattern x ∈ Σ^m, text y ∈ Σ^n, threshold k ∈ N0, cost function cost with indel cost γ > 0, defining an edit distance d(·, ·)
Output: ending positions j such that d(x, y′) ≤ k, where y′ := y[j′ . . . j] for some j′ ≤ j
1: Initialize 0-th column of the edit matrix D:
2: for i ← 0 . . . m do
3:     D(i, 0) ← i · γ
4: Proceed column-by-column:
5: for j ← 1 . . . n do
6:     D(0, j) ← 0
7:     for i ← 1 . . . m do
8:         D(i, j) ← min{D(i − 1, j − 1) + cost(x[i], y[j]), D(i, j − 1) + γ, D(i − 1, j) + γ}
9:     if D(m, j) ≤ k then
10:        report (j, D(m, j))    ▷ report match ending at column j and its cost

Looking closely at Algorithm 5.1, we see that each iteration j = 1 . . . n transforms the previous column D(·, j − 1) of the edit matrix into the current column D(·, j), based on character y[j]. We can formalize this as follows.

Definition 5.3 (The NextColumn Function) For a fixed pattern x of length m ≥ 0, define the function NextColumn : (R_0^+)^{m+1} × Σ → (R_0^+)^{m+1}, (d, a) ↦ d⁺, where d = (d_0, . . . , d_m) and d⁺ = (d_0⁺, . . . , d_m⁺) are column vectors, as follows:

d_0⁺ := 0;
d_i⁺ := min{ d_{i−1} + cost(x[i], a), d_i + γ, d_{i−1}⁺ + γ } for i ≥ 1.


Let Dj := (D(i, j))0≤i≤m denote the j-th column of D for 0 ≤ j ≤ n. Then Sellers’ algorithm effectively consists of repeatedly computing Dj ← NextColumn(Dj−1, yj) for each j = 1 . . . n.
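A small Python sketch (ours) of NextColumn and the resulting column-wise formulation of Sellers’ algorithm; unit costs are used here only as an example choice of cost function:

def next_column(d, a, x, cost, gamma):
    """Compute the next column d+ of the edit matrix from the previous
    column d and the current text character a (Definition 5.3)."""
    dp = [0]                                           # d+_0 := 0
    for i in range(1, len(x) + 1):
        dp.append(min(d[i - 1] + cost(x[i - 1], a),    # diagonal step
                      d[i] + gamma,                    # horizontal step
                      dp[i - 1] + gamma))              # vertical step
    return dp

def sellers(x, y, k, cost=lambda a, b: 0 if a == b else 1, gamma=1):
    """Report (j, D(m, j)) for all positions j where a k-approximate match of x ends."""
    d = [i * gamma for i in range(len(x) + 1)]         # 0-th column
    matches = []
    for j, a in enumerate(y, start=1):
        d = next_column(d, a, x, cost, gamma)
        if d[-1] <= k:
            matches.append((j, d[-1]))
    return matches

print(sellers("AABB", "BABAABABB", 1))   # prints [(6, 1), (7, 1), (8, 1), (9, 1)], cf. Example 5.6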

Some remarks about Sellers’ algorithm:

1. Since we only report end positions and costs, there is no need to store back pointers and the entire edit matrix D in Algorithm 5.1. The current and the previous column suffice. This decreases the memory requirement to O(m), where m is the (short) pattern length.

2. If we have a very good match with cost ≪ k at position j, the neighboring positions may also match with cost ≤ k. To avoid this redundancy, it makes sense to post-process the output and only look for runs of local minima. Without formally defining it, if k = 3 and the respective costs at positions j − 2, . . . , j + 2 were (2, 3, 1, 1, 2), all of these positions would be reported. We are only interested in the locally minimal value 1, which occurs as a run of length 2. So it would make sense to report the run interval and the minimum cost as ([j, j + 1], 1).

3. Note that in fact we do not need to know the exact value of D(m, j) as long as we can be sure that it exceeds k and therefore will not be reported.

The last observation leads to a considerable improvement of Sellers’ algorithm: If we know that in column j all values after a certain row i*_j exceed k, we do not need to compute them (e.g. we could assume that they are all equal to ∞ or k + 1). The following definition captures this formally.

Definition 5.4 The last essential index i* (for threshold k) of a vector d ∈ R^{m+1} is defined as i*(d) := max{i : d_i ≤ k}. Two vectors d, d′ ∈ R^{m+1} are k-equivalent if and only if i*(d) = i*(d′) and d_i = d_i′ for all i ≤ i*(d). We write this as d ≡_k d′.

Thus the last essential index i*_j of column D_j is i*_j := i*(D_j) = max{i : D(i, j) ≤ k}. From the definition it is clear that we cannot miss any k-approximate matches if we replace a column of the edit matrix by any k-equivalent vector (e.g. by not computing the cells below the last essential index and assuming a value of k + 1).

The last essential index i*_0 of the 0-th column is easily found, because the values increase monotonically from 0 to m · γ. The other columns, however, are not necessarily monotone! Therefore, as we move column-by-column through the matrix, repeatedly invoking the NextColumn function, we have to make sure that we can efficiently find the last essential index of the current column, given the previous column. The following lemma is the key.

Lemma 5.5 The property of k-equivalence is preserved by the NextColumn function. In other words: Let d, d′ ∈ N_0^{m+1}. If d ≡_k d′, then NextColumn(d, a) ≡_k NextColumn(d′, a) for all a ∈ Σ.

Proof. Let e := NextColumn(d, a) and e′ := NextColumn(d′, a). By induction on i, we show that e_i ≤ k or e_i′ ≤ k implies e_i = e_i′. Since e_0 = e_0′ = 0, this certainly holds for i = 0.


Now we show that the statement holds for each i > 0, assuming that it holds for i − 1. First, assume that e_i ≤ k. We shall show that e_i = e_i′ from the recurrence e_i = min{d_{i−1} + cost(x[i], a), d_i + γ, e_{i−1} + γ}.

Since both cost(x[i], a) ≥ 0 and γ ≥ 0, we know that the minimizing predecessor (be it d_{i−1}, d_i, or e_{i−1}) can be at most k. Therefore it must equal its value in the k-equivalent vector (non-equal entries are by definition larger than k). Therefore we obtain the same minimum when we compute e_i′ = min{d_{i−1}′ + cost(x[i], a), d_i′ + γ, e_{i−1}′ + γ}, and hence e_i = e_i′.

Similarly, we show that e_i′ ≤ k implies e_i = e_i′.  □

How can we use the above fact to our advantage? Consider the effect of the d-entries below the last essential index on the next column. Since d_i > k for all i > i^* and costs are nonnegative, these cells provide candidate values > k for e_i for i > i^* + 1. Therefore, the only chance that such an e_i remains bounded by k is vertically, i.e., if e_i = e_{i−1} + γ. As soon as e_i > k for some i > i^* + 1, we know that the last essential index of e occurs before that i. Algorithm 5.2 shows the details.

Algorithm 5.2 Improved cutoff-variant of Sellers' algorithm, maintaining the last essential index i_j^* of each column

Input: pattern x ∈ Σ^m, text y ∈ Σ^n, threshold k ∈ N_0, cost function cost with indel cost γ > 0, defining an edit distance d(·, ·)
Output: ending positions j such that d(x, y′) ≤ k, where y′ := y[j′ . . . j] for some j′ ≤ j

 1:  Initialize 0-th column of the edit matrix D:
 2:  for i ← 0 . . . m do
 3:      D(i, 0) ← i · γ
 4:  i_0^* ← ⌊k/γ⌋
 5:  Proceed column-by-column:
 6:  for j ← 1 . . . n do
 7:      D(0, j) ← 0
 8:      i^+ ← min{m, i_{j−1}^* + 1}
 9:      for i ← 1 . . . i^+ do
10:          D(i, j) ← min{D(i − 1, j − 1) + cost(x[i], y[j]), D(i, j − 1) + γ, D(i − 1, j) + γ}
11:      if D(i^+, j) ≤ k then
12:          i_j^* ← min{m, i^+ + ⌊(k − D(i^+, j))/γ⌋}
13:          for i ← (i^+ + 1) . . . i_j^* do
14:              D(i, j) ← D(i − 1, j) + γ
15:      else
16:          i_j^* ← max{i ∈ [0, i^+ − 1] : D(i, j) ≤ k}
17:      if i_j^* = m then
18:          report (j, D(m, j))        (report match ending at column j and its cost)

Example 5.6 (Cutoff-variant of Sellers' Algorithm) Let S = AABB, T = BABAABABB and k = 1. The following table shows which values are computed, assuming unit cost. The last essential index in each column is circled (here shown in parentheses). The k-approximate matches are marked in the last row: they end at positions 6, 7, 8 and 9, each with cost 1.


         T:       B    A    B    A    A    B    A    B    B
    S         0   0    0    0    0    0    0    0    0    0
    A        (1) (1)   0    1    0    0    1    0    1    1
    A         2   2   (1)   1   (1)   0    1    1    1    2
    B         3             (1)  2   (1)   0    1    1    1
    B         4                  2        (1)  (1)  (1)  (1)

    (empty cells are not computed; parenthesized values mark the last essential index of each column)

Four exemplary alignments, one ending at each position in T , are shown below.

    I:  S: AABB      II:  S: AABB      III:  S: AAB-B      IV:  S: AABB
        T: AAB-           T: AABA            T: AABAB            T: BABB



Sometimes, because it was first published in Ukkonen (1985), this algorithm is referred to as Ukkonen's cutoff algorithm, although it is more common to use this name when the cost function is the standard unit cost. In that case, even further optimizations are possible, but we do not discuss the details here.

5.3 The Forward-Backward Technique

The following problem frequently arises as a sub-problem of more advanced alignment methods, which is why we discuss it in detail beforehand:

Problem 5.7 (Advanced Alignment Problem) Given two sequences s and t of lengths m and n, respectively, find their optimal global alignment under the condition that the alignment path passes through a given node (i_0, j_0) of the alignment graph.

This is not a new problem at all! In fact, we can decompose it into two standard global alignment problems: aligning the prefixes s[1 . . . i_0] and t[1 . . . j_0], and aligning the suffixes s[i_0 + 1 . . . m] and t[j_0 + 1 . . . n].

A little more interesting is the case when we need to solve this problem for every point (i, j) simultaneously. Fortunately, we can use the "forward-backward" technique: In addition to the usual alignment graph with alignment costs D(i, j) (which record the minimally attainable cost for a prefix alignment), we additionally define the reverse alignment graph: we exchange initial and final vertex and reverse the direction of all edges. Effectively, we are thus globally aligning the reversed sequences. Hence, the associated alignment cost matrix D^rev = (D^rev(i, j)) records the minimally attainable costs for the suffix alignments. By defining T(i, j) := D(i, j) + D^rev(i, j), we obtain the minimally attainable total cost over all paths that go through (i, j). Clearly, the matrices D, D^rev, and T can be computed in O(mn) time.
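As a concrete illustration, the following Java sketch (our own minimal version, restricted to unit cost and indel cost 1) computes D, the reverse matrix, and the additional cost matrix C(i, j) = D(i, j) + D^rev(i, j) − d(s, t) for two short strings. D^rev(i, j) is obtained by aligning the reversed sequences and mirroring the indices.

    // Forward-backward technique: additional cost matrix under unit cost.
    public class ForwardBackward {

        // Standard global alignment (edit distance) matrix of s versus t.
        static int[][] editMatrix(String s, String t) {
            int m = s.length(), n = t.length();
            int[][] D = new int[m + 1][n + 1];
            for (int i = 0; i <= m; i++) D[i][0] = i;
            for (int j = 0; j <= n; j++) D[0][j] = j;
            for (int i = 1; i <= m; i++)
                for (int j = 1; j <= n; j++)
                    D[i][j] = Math.min(
                        D[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1),
                        Math.min(D[i][j - 1] + 1, D[i - 1][j] + 1));
            return D;
        }

        public static void main(String[] args) {
            String s = "CT", t = "AGT";               // Example 5.9
            int m = s.length(), n = t.length();
            int[][] D = editMatrix(s, t);
            // Align the reversed strings; R[m-i][n-j] is then the cost of aligning
            // the suffixes s[i+1..m] and t[j+1..n], i.e. Drev(i, j).
            int[][] R = editMatrix(new StringBuilder(s).reverse().toString(),
                                   new StringBuilder(t).reverse().toString());
            int dst = D[m][n];                        // optimal global alignment cost d(s, t)
            for (int i = 0; i <= m; i++) {
                StringBuilder row = new StringBuilder();
                for (int j = 0; j <= n; j++)
                    row.append(D[i][j] + R[m - i][n - j] - dst).append(' ');
                System.out.println(row);              // row i of the additional cost matrix C
            }
        }
    }

For s = CT and t = AGT this prints the additional cost matrix of Example 5.9 below.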


For every (i, j), we can now obtain the additional cost C(i, j) for passing through (i, j), compared to staying on an optimal alignment path. By definition, if (i, j) is on the path of an optimal alignment, this value is zero, otherwise it is positive; see Figure 5.1.

Figure 5.1: An additional cost matrix. Entry C(i, j) contains the cost of the best possible alignment that passes through vertex (i, j) of the alignment graph.

Definition 5.8 Given two sequences s and t of lengths m and n, respectively, their additional cost matrix C = (C(i, j))_{0 ≤ i ≤ m, 0 ≤ j ≤ n} is defined by

    C(i, j) = min { D(A ++ B) − d(s, t) :  A is an alignment of the prefixes s[1 . . . i] and t[1 . . . j],
                                           B is an alignment of the suffixes s[i + 1 . . . m] and t[j + 1 . . . n] },

where "++" denotes concatenation of alignments. The score loss matrix is defined similarly in terms of maximal score.

For any additive alignment cost, where cost(A ++B) = cost(A) + cost(B), we have that C(i, j) = T (i, j) − d(s, t).

Note that T (0, 0) = T (m, n) is the optimal global alignment cost d(s, t) (since every path, especially the optimal one, passes through (0, 0)), and so T (i, j) ≥ T (0, 0) for all (i, j). The quantity T (i, j) − T (0, 0) thus is equal to the additional cost of vertex (i, j) in comparison to the optimal path. In similarity terms, the signs are reversed and T (0, 0) − T (i, j) is the score loss of going through node (i, j) in comparison to the optimal path.

Thus the additional cost matrix (or score loss matrix) can be computed in O(mn) time, too.

Example 5.9 Consider the sequences s = CT and t = AGT . For unit cost, the edit matrix and reverse edit matrix are

                A   G   T                           A   G   T
    D:      0   1   2   3          D^rev:       2   1   1   2
         C  1   1   2   3                   C   2   1   0   1
         T  2   2   2   2                   T   3   2   1   0

(where D^rev is drawn in reverse direction), resulting in the following additional cost matrix:

                A   G   T
    C:      0   0   1   3
         C  1   0   0   2
         T  3   2   1   0

We see that the paths of the two optimal alignments

    A_1^opt:  -CT        A_2^opt:  C-T
              AGT                  AGT

correspond to the 0-entries in C.

Affine gap costs. For affine gap costs (see Section 4.6) the computation of the total or additional cost matrix is slightly more difficult: Here, the cost of a concatenated alignment is not always the sum of the costs of the individual alignments because gaps might be merged, so that the gap initiation penalty that is imposed twice in the two separate alignments must be counted only once in the concatenated alignment. However, since the efficient computation of affine gap costs uses the two history matrices V and H (as defined in Section 4.6), and we know that by definition the costs stored in these matrices refer to those alignments that end with gaps, the additional cost matrix for affine gap costs can also be computed in quadratic time if the history matrices V and H and their reverse counterparts V rev and Hrev are given:

    T(i, j) = min { D(i, j) + D^rev(i, j),
                    V(i, j) + V^rev(i, j) − gapinit,
                    H(i, j) + H^rev(i, j) − gapinit }.

Applications. The forward-backward technique is used, for example, in the following problems:

• linear-space alignment (Hirschberg technique, see also Section 5.4),
• discovering the regions of slightly sub-optimal alignments (those cells where the score gap, resp. additional cost, remains below a small tolerance parameter),
• computing regions for the Carrillo-Lipman heuristic for multiple sequence alignment (see Section 11.2),
• finding cut-points for divide-and-conquer multiple alignment (see Section 11.4).

5.4 Pairwise Alignment in Linear Space

In Chapter 4 we have seen that the optimal alignment score of two sequences s and t of lengths m and n, respectively, as well as an optimal alignment, can be computed in O(mn) time and O(mn) space.

Even though the O(mn) time required to compute an optimal global or local alignment is sometimes frowned upon, it is not the main bottleneck in sequence analysis. The main bottleneck so far is the O(mn) space requirement. Think of a matrix of traceback pointers for two bacterial genomes of 5 million nucleotides each: that matrix contains 25 · 10^12 entries and would thus need 25 Terabytes of memory, an impossibility in the year 2010.

Fortunately, there is a solution (that is, a different one than waiting another 20 years for huge amounts of cheap memory). We have already seen that the score by itself can easily be computed in O(m + n) space, since we only need to store the sequences and one row or column of the edit matrix at a time. However, if we also want to output an optimal alignment, the whole edit matrix or a related matrix containing back-pointers is required for the backtracing phase.

In this section we present an algorithm that computes an optimal global alignment in linear space. We discuss the case of local alignment below. Our presentation is for the simple case of homogeneous linear gap costs g(ℓ) = ℓ · γ, where γ is the cost for a single gap character. However, it can be generalized to affine gap costs with some additional complications.

The algorithm works in a divide-and-conquer manner, where in each recursion the edit matrix S = (S(i, j)) is divided into an upper half and a lower half at its middle row m_0 = ⌈m/2⌉. The upper part of the edit matrix is computed as usual. The computation is continued in the lower part in the same way, but with the addition that for each cell S(i, j) we additionally store a back-pointer matrix M = (M(i, j)), where m_0 < i ≤ m, 0 ≤ j ≤ n. Entry M(i, j) points to a vertex of the middle row through which an optimal alignment ending in cell (i, j) passes, i.e., M(i, j) = n_0 means that an optimal global alignment ending at (i, j) passes through (m_0, n_0). We only need one row of M at a time, since the pointer to row m_0 can be updated as we proceed through the edit matrix.

In the end, we find in M(m, n) a pointer to a vertex of the middle row through which an optimal global alignment of the two sequences s and t passes. In other words, (⌈m/2⌉, M(m, n)) is a vertex on an optimal alignment path. The procedure is called recursively, once for the upper left "quarter" and once for the lower right "quarter" of the edit graph. The recursion continues until only one row remains, in which case the problem can easily be solved directly. See Figure 5.2 for an illustration.

Figure 5.2: Illustration of linear-space alignment.
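The crucial step of each recursion, finding a middle-row crossing point in linear space, can be sketched as follows in Java (our own minimal version for unit costs and linear gap cost 1; the recursion on the two quarters and the assembly of the final alignment are omitted). Only two rows of the cost matrix and two rows of the back-pointer matrix M are kept.

    // One divide step of linear-space alignment: find the column where an optimal
    // global alignment path crosses the middle row m0 = ceil(m/2) (unit cost).
    public class MiddleRowCrossing {

        static int crossingColumn(String s, String t) {
            int m = s.length(), n = t.length(), m0 = (m + 1) / 2;
            int[] prev = new int[n + 1], cur = new int[n + 1];
            int[] Mprev = new int[n + 1], Mcur = new int[n + 1];
            for (int j = 0; j <= n; j++) prev[j] = j;            // row 0 of the cost matrix
            for (int i = 1; i <= m; i++) {
                cur[0] = i;
                Mcur[0] = (i <= m0) ? 0 : Mprev[0];
                for (int j = 1; j <= n; j++) {
                    int diag = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                    int left = cur[j - 1] + 1, up = prev[j] + 1;
                    cur[j] = Math.min(diag, Math.min(left, up));
                    if (i <= m0) {
                        Mcur[j] = j;                             // in row m0 the path crosses at its own column
                    } else if (cur[j] == diag) {                 // below m0: inherit from the chosen predecessor
                        Mcur[j] = Mprev[j - 1];
                    } else if (cur[j] == left) {
                        Mcur[j] = Mcur[j - 1];
                    } else {
                        Mcur[j] = Mprev[j];
                    }
                }
                int[] swap = prev; prev = cur; cur = swap;
                swap = Mprev; Mprev = Mcur; Mcur = swap;
            }
            return Mprev[n];     // M(m, n): crossing column of the middle row
        }

        public static void main(String[] args) {
            // For s = CACG and t = GAG (Example 5.10) an optimal path crosses row 2 at column 2.
            System.out.println(crossingColumn("CACG", "GAG"));
        }
    }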


Remarks. While the linear-space algorithm presented here (which we have first seen in the book by Durbin et al. (1998)) is conceptually simple, the devil is in the details. An implementation probably requires a substantial amount of time for debugging. An alternative to this algorithm remembers not only one point in row m_0 where the optimal alignment passes through, but the whole interval in row m_0, as well as the previous vertex in row m_0 − 1. In yet another formulation, due to Hirschberg (1975), the lower part of the matrix S is computed in "backwards" direction. This is possible since the alignment score is symmetric. The middle row is then obtained as the sum of the two middle rows from the upper and the lower halves, and optimal alignments pass through all vertices where this sum is maximal. However, we believe that the formulation given above is slightly simpler, although it uses a little more memory, namely one row of the matrix M.

Complexity analysis. The space complexity is in O(m + n), since all matrix computations can be performed row-wise using memory for at most two adjacent rows, and the final alignment has at most m + n columns. A central question is how much we have to "pay" in terms of running time to obtain the linear space complexity. Fortunately, the answer is: only approximately a constant factor of 2. In the original (quadratic-space) algorithm, each cell of the edit matrix S is computed once. In the linear-space version, some cells are computed several times during the recursions. The number of cells computed in the first pass is mn, in the second pass it is only half of the matrix, in the third pass a quarter of the matrix, and so on, for a total of mn · (1 + 1/2 + 1/4 + · · · ) ≤ 2mn cells (geometric series). Thus, the asymptotic time complexity is still O(mn).

Example 5.10 Given strings s = CACG and t = GAG, show how alignment in linear space works.


The recursion ends in four base cases, resulting in the overall alignment

    CACG
    GA-G

Note that if a cut position is found (indicated by a dotted circle in the back-pointer matrix M and by a circle in the edit matrix), the sequences are cut in two after the two characters, such that the left side of t contains the circled character and the right side does not.

Affine and general gap costs. The presented technique integrates well with affine gap costs, but again, the devil is in the details. We need to remember whether a continuing horizontal or vertical gap passes through the optimal midpoint of the alignment and use this information appropriately during the recursive calls. The Hirschberg technique (forward computation of the upper half, backward computation of the lower half) is slightly more complicated to combine with affine gap costs. The details were first given by Myers and Miller (1988). General gap costs cannot be handled in linear space: we cannot even compute just the alignment score in linear space, since we always need to refer back to all cells to the left of and above the current one.

Local alignment. We have described how to compute an optimal global alignment in linear space. What about local alignments? Note that a local alignment is in fact a global alignment of the two best-aligning substrings of s and t. So once we know the start- and endpoints of these substrings, we can use the above global alignment procedure. It is easy to find the endpoints: They are given by the entry with the highest score in the (forward) local alignment matrix. To find the start points, we have two options: (a) We use the reversal technique: The start points are just the endpoints of the optimal local alignment of the reversed strings. This technique is simple, but may become problematic if there are several equally high-scoring local alignments. (b) We use the back-pointer technique, similar to the M-matrix above: Whenever a new local alignment is started (by choosing the zero-option in the local alignment recur- rence), we keep a back pointer to the cell containing the zero in all cells in which an optimal local alignment that starts in that zero cell ends.

Suboptimal alignments. In practice, we are often interested in several, say K good align- ments instead of a single optimal one. Can this also be done in linear space? Essentially, we need to remember the K highest entries in the edit matrix and once we have found an optimal alignment we store a list or hash-table to remember its path and subsequently take care that the edges of this path cannot be used anymore when we re-compute the next (now optimal) alignment. This requires additional space proportional to the total lengths of all discovered alignments, which is linear for a constant number of alignments.


CHAPTER 6

Pairwise Alignment in Practice

Contents of this chapter: Dot Plots, rapid database search methods, short exact matches, q-gram (index), BLAST: on-line database search method. Further contents in the appendix (Chapter C): Fast Smith-Waterman, FASTA: on-line database search method, hot spots, diagonal runs, index-based database search methods (BLAT, SWIFT, QUASAR), software propositions.

6.1 Alignment Visualization with Dot Plots

We begin with a visual method for (small-scale) sequence comparison that is very popular with biologists. The simplest way of comparing two sequences x ∈ Σ^m and y ∈ Σ^n is the dot plot: the two sequences are drawn along the horizontal and vertical axes of a coordinate system, respectively, and positions (i, j) with identical characters x[i] = y[j] are marked by a dot. Its time and memory requirements are O(mn). By visual inspection of such a dot plot, one can already observe a number of details about similar and dissimilar regions of x and y. For example, a diagonal stretch of dots corresponds to a common substring of the two strings, like SCENCE in Figure 6.1, upper part. A disadvantage of dot plots is that they do not give a quantitative measure of how similar the two sequences are. This is, of course, what the concept of sequence alignment provides. Dot plots are still important in practice since the human eye is not easily replaced by more abstract numeric quantities. Another disadvantage of the basic dot plot, especially for DNA with its small alphabet size, is the cluttering caused by scattered short "random" matches. Note that, even on random strings, the probability that a dot appears at any position is 1/|Σ| = 1/4 for DNA. So a quarter of all positions in a dot plot are black, which makes it hard to see the interesting

similarities. Therefore the dot plot is usually filtered, e.g. by removing all dots that are not part of a consecutive match of length ≥ q, where q is a user-adjustable parameter (see Figure 6.1, lower part).


Figure 6.1: Upper part: unfiltered dot plot. Lower part: filtered dot plots (q = 2 and q = 3). Here the filter keeps only those positions (i, j) that are part of a common substring of length ≥ q.
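A filtered dot plot is easy to compute with two diagonal passes. The following Java sketch (our own toy version with made-up input strings) keeps a dot at (i, j) exactly if the match x[i] = y[j] lies on a diagonal run of identical characters of length at least q.

    // Filtered dot plot: mark (i, j) if it is part of a common substring of length >= q.
    public class DotPlot {
        public static void main(String[] args) {
            String x = "FLUORESCENCE", y = "ESSENCE";   // toy input (ours)
            int q = 3, m = x.length(), n = y.length();
            int[][] end = new int[m + 2][n + 2];    // length of the match run ending at (i, j)
            int[][] start = new int[m + 2][n + 2];  // length of the match run starting at (i, j)
            for (int i = 1; i <= m; i++)
                for (int j = 1; j <= n; j++)
                    end[i][j] = (x.charAt(i - 1) == y.charAt(j - 1)) ? end[i - 1][j - 1] + 1 : 0;
            for (int i = m; i >= 1; i--)
                for (int j = n; j >= 1; j--)
                    start[i][j] = (x.charAt(i - 1) == y.charAt(j - 1)) ? start[i + 1][j + 1] + 1 : 0;
            for (int j = 1; j <= n; j++) {          // y runs along the vertical axis
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= m; i++)
                    row.append(end[i][j] + start[i][j] - 1 >= q ? '*' : '.');
                System.out.println(y.charAt(j - 1) + " " + row);
            }
        }
    }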

6.2 Fundamentals of Rapid Database Search Methods

In practice, pairwise alignment algorithms are used for two related, but still conceptually different purposes, and it is important to keep the different goals in mind.

1. True pairwise alignment: Given sequences x, y ∈ Σ∗ that we already know or suspect to be similar, report all similar regions and show the corresponding (even suboptimal) alignments.

2. Large-scale database searching: Given a query x ∈ Σ∗ and a family (database) Y of subjects, find out (quickly) which y ∈ Y share at least one sufficiently similar region with x and report those y along with the similarity score (e.g. the alignment score). The alignment itself is of little interest in this case; suboptimal alignments are of even less interest.


Definition 6.1 For our purposes, a sequence database Y = (y_1, y_2, . . . , y_L) is an ordered collection of sequences (sometimes we view Y simply as the set {y_1, y_2, . . . , y_L}). We write N := Σ_{y ∈ Y} |y| for the total length of the database.

The remainder of this chapter presents methods used in practice mainly for large-scale database search. Database search methods focus on speed and often sacrifice some accuracy to speed up similarity estimation. For the accurate alignment of a few sequences, the quadratic-time algorithms of the previous chapter are usually considered to produce the best-possible result (also called the gold standard), but recall the comment on length-normalized alignment at the end of Section 4.5.

Using short exact matches. Almost all methods make use of the following simple observation, the so-called q-gram Lemma:

Lemma 6.2 Given a local alignment of length ℓ with at most e errors (mismatches or indels), the aligned regions of the two strings contain at least T(ℓ, q, e) := ℓ + 1 − q · (e + 1) common q-grams.

For example, a local alignment of length ℓ = 50 with at most e = 2 errors contains at least T(50, 4, 2) = 50 + 1 − 4 · 3 = 39 common 4-grams.

In order to find exact matches of length q between query and database quickly, we first create a q-gram index I, either for the query x or for the database Y .

Definition 6.3 A q-gram index for x ∈ Σ^m is a map I : Σ^q → P({1, . . . , m − q + 1}) such that I(z) = {i_1(z), i_2(z), . . .}, where i_1(z) < i_2(z) < · · · are the starting positions of the q-gram z in x and |I(z)| is the occurrence count of z in x (cf. Definition 3.10).

A straightforward (but inefficient!) O(|Σ|^q + qm) method to create a q-gram index (e.g. in Java) would be to use a HashMap<String, ArrayList<Integer>>, initialize the ArrayList to an empty list for each q-gram, slide a q-window across x and add each position to the appropriate list. The following code snippet assumes that string indexing starts at 1 and that x.substring(i,j) returns x[i . . . j − 1].

    for (i = 1; i <= m - q + 1; i++)
        I.get(x.substring(i, i + q)).add(i);

A more low-level and efficient O(|Σ|^q + m) method is to use two integer arrays first[0..|Σ|^q] and pos[0..m−q+1] and the q-gram ranking function from Section 3.7, such that the starting positions of the q-gram z with rank r are stored in pos, starting at position first[r] and ending at position first[r+1]-1. These arrays can be constructed with a simple two-pass algorithm that scans x from left to right twice. In the first pass it counts the number of occurrences of each q-gram and creates first. In the second pass it inserts the q-gram starting positions at the appropriate spots in pos. Since the ranking function update takes constant time instead of O(q) time, this version is more efficient (and also avoids object-oriented overhead).
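A possible realization of the two-pass construction for the DNA alphabet is sketched below in Java (our own version; for simplicity the rank of each q-gram is recomputed from scratch instead of being updated in constant time per window shift, so this sketch costs O(|Σ|^q + qm) rather than O(|Σ|^q + m)).

    // Two-pass q-gram index over the DNA alphabet using the arrays first[] and pos[].
    public class QGramIndex {

        static int code(char c) {                   // A=0, C=1, G=2, T=3
            switch (c) { case 'A': return 0; case 'C': return 1; case 'G': return 2; default: return 3; }
        }

        static int rank(String x, int i, int q) {   // rank of the q-gram starting at 0-based position i
            int r = 0;
            for (int k = 0; k < q; k++) r = 4 * r + code(x.charAt(i + k));
            return r;
        }

        public static void main(String[] args) {
            String x = "ACACAGT"; int q = 2;
            int m = x.length(), sigmaQ = 1 << (2 * q);                     // |Sigma|^q = 4^q
            int[] first = new int[sigmaQ + 1];
            int[] pos = new int[m - q + 1];
            for (int i = 0; i + q <= m; i++) first[rank(x, i, q) + 1]++;   // pass 1: count each rank
            for (int r = 1; r <= sigmaQ; r++) first[r] += first[r - 1];    // prefix sums: first[r] = start of rank r
            int[] next = first.clone();
            for (int i = 0; i + q <= m; i++)                               // pass 2: fill positions (1-based)
                pos[next[rank(x, i, q)]++] = i + 1;
            // Starting positions of the q-gram with rank r are pos[first[r] .. first[r+1]-1].
            int r = rank("CA", 0, q);
            for (int k = first[r]; k < first[r + 1]; k++)
                System.out.println("CA starts at position " + pos[k]);     // prints 2 and 4
        }
    }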


On-line and index-based searching. There are two basic approaches to rapid database searching.

1. In the first (so-called on-line) approach, the database (of size N) is completely examined for every query (of size m). Each query, however, can be pre-processed as soon as it becomes known. This means that the running time of such a method is at least Θ(N + m) for each query, even if no similar sequences are found. This is in practice much faster than O(mN) time for a full alignment, and it allows the database to change after each query (e.g. new sequences might be added). FASTA (Section C.2) and BLAST (Section 6.3) are well-known database search programs that work in this way.

2. The second (index-based) approach preprocesses (indexes) the database before a number of queries are posed, assuming that the database changes only rarely, because indexing takes time: even if indexing time is only linear in the database size, the constant factor is usually quite high. On the other hand, the index allows us to immediately identify only those regions of the database that contain potentially similar sequences. If these do not exist, the time spent per query can become as small as Θ(m). Examples of index-based database search methods are BLAT and SWIFT, discussed in more detail in Section C.3.

The following table shows time complexities for on-line and index-based database searching if no similar regions are found. If there are such regions, they have to be examined, of course, which adds to the time complexities of the methods. However, the goal is to spend as little time as possible when there are no interesting similarities.

    Method        Preprocessing    Querying     Total, 1 query    Total, k queries
    On-line       -                O(N + m)     O(N + m)          O(k(N + m))
    Index-based   O(N)             O(m)         O(N + m)          O(N + k · m)

It is clear that indexing pays off as soon as the number of queries k on the same database becomes reasonably large.

6.3 BLAST: An On-line Database Search Method

BLAST1 (Altschul et al., 1990) is perhaps the most popular program to perform sequence database searches. Here we describe the program for protein sequences (BLASTP). We mainly consider an older version that was used until about 1998 (BLAST 1.4). The newer version (BLAST 2.0 and later) is discussed at the end of this section.

BLAST 1.4. The main idea of protein BLAST is to first find high-scoring exact q-gram hits, so-called BLAST hits, and then extend them.

Definition 6.4 For a given q ∈ N and k ≥ 0, a BLAST hit of x ∈ Σ^m and y ∈ Σ^n is a pair (i, j) such that score(x[i . . . i + q − 1], y[j . . . j + q − 1]) ≥ k.

1Basic local alignment search tool


To find BLAST hits, we proceed as follows: We create a list N_k(x) of all q-grams in Σ^q that score highly if aligned without gaps to any q-gram in x, and note the corresponding positions in x.

Definition 6.5 The k-neighborhood N_k(x) of x ∈ Σ^m is defined as

    N_k(x) := {(z, i) : z ∈ Σ^q, 1 ≤ i ≤ m − q + 1, score(z, x[i . . . i + q − 1]) ≥ k}.

The k-neighborhood can be represented similarly to a q-gram index: for each z ∈ Σ^q, we store the set P_k(z) of positions i such that (z, i) ∈ N_k(x). The size of this index grows exponentially with q and 1/k, so these parameters should be selected carefully. For the PAM250 score matrix, q = 4 and k = 17 have been used successfully. Once the index has been created, for each database sequence y ∈ Y the following steps are executed.

1. Find BLAST hits: Scan y from left to right with a q-window, updating the current q-gram rank in constant time. For each position j, the hits are {(i, j) : i ∈ P_k(y[j . . . j + q − 1])}. This takes O(|y| + z) time, where z is the number of hits.

2. Each hit (i, j) is scored (say, it has score X) and extended without gaps along the diagonal to the upper left and, separately, to the lower right, hoping to collect additional matching characters that increase the score. Each extension continues until too many mismatches accumulate and the score drops below M − ∆X, where M is the maximal score reached so far and ∆X is a user-specified drop-off parameter. This is the reason why this extension method is called the X-drop algorithm. The maximally scoring part of both extensions is retained. The pair of sequences around the hit delivered by this extension is called a maximum segment pair (MSP). For any such MSP, a significance score is computed. If this is better than a pre-defined significance threshold, then the MSP is reported. A sketch of the extension step is given after this list.

3. MSPs on different diagonals are combined, and a combined significance threshold is computed.
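The following Java sketch illustrates the ungapped X-drop extension of a single hit (our own simplified version: a +1/−1 match/mismatch score replaces a real amino-acid score matrix, the input sequences are invented, and positions are 1-based as in the text). Both directions are extended independently, and the best-scoring gains to the left and to the right of the seed are kept.

    // Ungapped X-drop extension of a BLAST hit (i, j) of length q along its diagonal.
    public class XDrop {

        static int score(char a, char b) { return a == b ? 1 : -1; }   // toy score function

        static int extend(String x, String y, int i, int j, int q, int dropOff) {
            int seed = 0;
            for (int k = 0; k < q; k++) seed += score(x.charAt(i - 1 + k), y.charAt(j - 1 + k));
            int gain = 0, bestRight = 0;                // extension to the lower right
            for (int k = 0; i + q - 1 + k < x.length() && j + q - 1 + k < y.length(); k++) {
                gain += score(x.charAt(i + q - 1 + k), y.charAt(j + q - 1 + k));
                if (gain < bestRight - dropOff) break;  // X-drop: fell too far below the maximum
                bestRight = Math.max(bestRight, gain);
            }
            gain = 0;
            int bestLeft = 0;                           // extension to the upper left
            for (int k = 1; i - k >= 1 && j - k >= 1; k++) {
                gain += score(x.charAt(i - 1 - k), y.charAt(j - 1 - k));
                if (gain < bestLeft - dropOff) break;
                bestLeft = Math.max(bestLeft, gain);
            }
            return seed + bestLeft + bestRight;         // score of the best segment pair around the hit
        }

        public static void main(String[] args) {
            String x = "PQRSTWLKDEA", y = "AYRSTWLKQQ"; // toy sequences (ours)
            System.out.println(extend(x, y, 4, 4, 4, 5)); // hit "STWL" at x position 4, y position 4
        }
    }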

BLAST 2.0. In later versions of BLAST (BLAST 2.0 or Gapped BLAST, see Altschul et al. (1997)), a different and much more time-efficient method is used. As before, hits are searched, but with reduced values for k and q. This results in more hits. However, BLAST 2.0 looks for pairs of hits on the same diagonal within some maximal distance. The midpoint between two hits on a diagonal is then used as the starting point for the left-extension and the right-extension method described above. Usually only a few regions are extended, which makes the method much faster. Practice shows that the two-hit approach is similar in sensitivity to the one-hit approach.


CHAPTER 7

Suffix Trees

Contents of this chapter: Suffix trees, (enhanced) suffix arrays, nested suffixes, common prefixes, Σ-tree, Σ+-tree, trie, generalized suffix trees, construction (WOTD, Ukkonen), applications (exact string matching, shortest unique substring, Maximal Unique Matches (MUM), Maximal Repeats). Further contents in the appendix (Chapter F): Memory representation of suffix trees.

7.1 Motivation

The amount of sequence information in today’s databases is growing rapidly. This is es- pecially true for the domain of genomics: Past, current and future projects to sequence large genomes (e.g. human, mouse, rice) produce terabytes of sequence data, mainly DNA sequences and protein sequences. To make use of these sequences, larger and larger instances of string processing problems have to be solved. While the sequence databases are growing rapidly, the sequence data they already contain does not change much over time. Consequently, indexing methods are applicable. These methods preprocess the sequences in order to answer queries much faster than methods that work sequentially. Apart from the size of the string processing problems in genomics, their diversity is also remarkable. For example, to assemble the Genome of Drosophila melanogaster, a multitude of string processing problems involving exact and approximate pattern matching tasks had to be solved. These problems are often complicated by additional constraints about uniqueness, containment, or repetitiveness of sequences. A suffix tree is a data structure suited to solve such problems. In this chapter, we introduce the concept of suffix trees, show their most important properties, and take a look at some applications of biological relevance.


Subsequently, in Chapter 8, we shall introduce the data structures suffix array and enhanced suffix array, which share many of the properties of suffix trees, but have a smaller memory footprint and perform better in practice.

7.2 An Informal Introduction to Suffix Trees

Let us consider a string s ∈ Σ^n. Think of s as a database of concatenated sequences (e.g. all sequences from UniProt), or a genome, etc. In order to answer substring queries about s, e.g. to enumerate all substrings of s that satisfy certain properties, it is helpful to have an index of all substrings of s. We already know that there can be up to O(n^2) distinct substrings of s, i.e., too many to represent them all explicitly with a reasonable amount of memory. We need an implicit representation. Each substring of s is a prefix of a suffix of s, and s has only n suffixes, so we can consider an index consisting of all suffixes of s. We append a unique character $ ∉ Σ (called a sentinel¹) to the end of s, so that no suffix is a prefix of another suffix. For example, if s = cabca, the suffixes of cabca$ are

$, a$, ca$, bca$, abca$, cabca$.

This list has n + 1 = Θ(n) elements, but its total size is still Θ(n^2), because there are n/2 suffixes of length ≥ n/2. It is a central idea behind suffix trees to identify common prefixes of suffixes, which is achieved by lexicographically sorting the suffixes. This naturally leads to a rooted tree structure, as shown in Figure 7.1.


Figure 7.1: Suffix trees of cabca (left) and of cabca$ (right). In the right tree, the leaves have been annotated with the starting positions of the suffixes.

Note the effect of appending the sentinel $: In the tree for cabca$, every leaf corresponds to one suffix. Without the sentinel, some suffixes can end in the middle of an edge or at an internal node. To be precise, these would be exactly those suffixes that are a prefix of another suffix; we call them nested suffixes. Appending $ ensures that there are no nested suffixes,

1It guards the end of the string. Often we assume that it is lexicographically smaller than any character in the alphabet.


because $ does not occur anywhere in s. Thus we obtain a one-to-one correspondence of suffixes and leaves in the suffix tree.

Figure 7.2: The suffix tree of CABACBCBACABCABCACBC$.

Note the following properties for the suffix tree T of s$ = cabca$, or the larger example shown in Figure 7.2.

• There is a bijection between suffixes of s$ and leaves of T .

• Each internal node has at least two children.

• Each outgoing edge of an internal node begins with a different letter.

• Edges are annotated with substrings of s$.

• Each substring s0 of s$ can be found by following a path from the root down the tree for |s0| characters. Such a path does not necessarily end at a leaf or at an internal node, but may end in the middle of an edge.

We now give more formal definitions and shall see that a "clever" representation of a suffix tree only needs linear, i.e., O(n), space. Most importantly, we cannot store substrings at the edges, because the sum of their lengths is still O(n^2). Instead, we will store references to the substrings.


7.3 A Formal Introduction to Suffix Trees

Definitions. A rooted tree is a connected directed acyclic graph with a special node r, called the root, such that all edges point away from it. The depth of a node v is its distance from the root; we have depth(r) = 0.

Let Σ be an alphabet. A Σ-tree, also called trie, is a rooted tree whose edges are labeled with a single character from Σ in such a way that no node has two outgoing edges labeled with the same letter. A Σ+-tree is a rooted tree whose edges are labeled with non-empty strings over Σ in such a way that no node has two outgoing edges whose labels start with the same letter. A Σ+-tree is compact if no node (except possibly the root) has exactly one child (i.e., all internal nodes have at least two children).

For a node v in a Σ- or Σ+-tree T, we call string(v) the concatenation of the edge labels on the unique path from the root to v. We define the string depth stringdepth(v) := |string(v)|. In a Σ-tree, this is equal to depth(v), but in a Σ+-tree, the depth of a node usually differs from its string depth because one edge can represent more than one character. We always have stringdepth(v) ≥ depth(v).

For a string x, if there exists a node v with string(v) = x, we write node(x) for v. Otherwise node(x) is undefined. Sometimes, one can also find x written for node(x) in the literature. Of course, node(ε) is the root r.

We say that T displays a string x ∈ Σ* if x can be read along a path down the tree, starting from the root, i.e., if there exists a node v and a possibly empty string y such that xy = string(v). We finally define the words displayed by T by words(T) := {x : T displays x}.

The suffix trie of s ∈ Σ* is the Σ-tree τ with words(τ) = {s′ : s′ is a substring of s}. The suffix tree of s is the compact Σ+-tree T with words(T) = {s′ : s′ is a substring of s}. As mentioned above, we often consider the suffix tree of s$, where each suffix ends in a leaf.

In a suffix trie or suffix tree, an edge leading to an internal node is an internal edge. An edge leading to a leaf is a leaf edge.

Space requirements. Note that, in general, the suffix trie τ of a string s of length n contains O(n^2) nodes. We now show that the number of nodes in the suffix tree T of s is linear in n.

Lemma 7.1 The suffix tree of a string of length n has at most n − 1 internal nodes.

Proof. Let L be the number of leaves and let I be the number of internal nodes. We do "double counting" of the edges; let E be their number. Each leaf and each internal node except the root has exactly one incoming edge; thus E = L + I − 1. On the other hand, each internal node is branching, i.e., has at least two outgoing edges; thus E ≥ 2I. It follows that L + I − 1 ≥ 2I, or I ≤ L − 1. Since L ≤ n (there cannot be more leaves than suffixes), we have I ≤ n − 1 and the lemma follows. Also note that E = L + I − 1 ≤ 2n − 2. By the above lemma, there are at most n leaves, n − 1 internal nodes and 2n − 2 edges; all of these are linear in the string length. The remaining problem are the edge labels, which are substrings of s and may each require O(n) space, for a total of O(n^2). To avoid this,

we do not store the edge labels explicitly, but only two numbers per edge: the start and end position of a substring of s that spells the edge label. The following theorem is now an easy consequence.

Theorem 7.2 The suffix tree of a string s of length n can be stored in O(n) space.

Generalized suffix trees. In many applications, we need a suffix tree built from more than one string (e.g. to compare two genomes). There is an important difference between the set of k suffix trees (one for each of k strings) and one (big) suffix tree for the concatenation of all k strings: the big suffix tree is very useful, while the collection of small trees is generally useless! When we concatenate several sequences into a long one, however, we need to make sure to separate the strings appropriately, in such a way that we do not create artificial substrings that occur in none of the original strings. For example, if we concatenate ab and ba without separating them, we would get abba, which contains bb as a substring, but bb does not occur in either original string.

Therefore we use additional unique sentinel characters $1, $2,... that delimit each string.

Definition 7.3 Given strings s_1, . . . , s_k ∈ Σ*, the generalized suffix tree of s_1, . . . , s_k is the suffix tree of the string s_1$_1 s_2$_2 . . . s_k$_k, where $_1 < $_2 < · · · < $_k are distinct sentinel characters that are not part of the underlying alphabet Σ.

In the case of only two strings, we usually use # to delimit the first string for convenience, thus the generalized suffix tree of s and t is the suffix tree of s#t$.

7.4 Suffix Tree Construction: The WOTD Algorithm

Suffix tree constructions have a long history and there are algorithms which construct suffix trees in linear time (McCreight, 1976; Ukkonen, 1995). Here we describe a simple suffix tree construction method that has quadratic worst-case time complexity, but is fast in practice and easy to explain: the Write Only Top Down (WOTD) suffix tree construction algorithm. We assume that the input string s$ ends with a sentinel character. The WOTD algorithm adheres to the recursive structure of a suffix tree. The idea is that for each branching node node(u), the subtree below node(u) is determined by the set of all suffixes of s$ that have u as a prefix. Starting with the root (where u = ε), we recursively construct the subtrees rooted in nodes corresponding to common prefixes. To construct the subtree rooted in node(u), we need the set

R(node(u)) := {v | uv is a suffix of s$} of remaining suffixes. To store this set, we would not store the suffixes explicitly, but only their starting positions in s$. To construct the subtree, we proceed as follows.


At first R(node(u)) is divided into groups according to the first character of each suffix. For any character c ∈ Σ, let group(node(u), c) := {w ∈ Σ∗ | cw ∈ R(node(u))} be the c-group of R(node(u)).

If for a particular c ∈ Σ, the set group(node(u), c) contains only one string w, then there is a leaf edge labeled cw outgoing from node(u).

If group(node(u), c) contains at least two strings, then there is an edge labeled cv leading to a branching node node(ucv), where v is the longest common prefix of all strings in group(node(u), c). The child node(ucv) has the set of remaining suffixes R(node(ucv)) = {w | vw ∈ group(node(u), c)}.

The WOTD algorithm starts by evaluating the root from the set of all suffixes of s$. All internal nodes are evaluated recursively in a top-down strategy.

Example 7.4 Consider the input string s$ := abab$. The WOTD algorithm works as follows.

At first, the root is evaluated from the set of all non-empty suffixes of the string s$, see the first five columns in Figure 7.3. The algorithm recognizes three groups of suffixes: the $-group, the a-group, and the b-group.

Figure 7.3: The write-only top-down construction of the suffix tree for abab$

The $-group is singleton, so we obtain a leaf reached by an edge labeled $.

The a-group contains two suffixes, bab$ and b$ (recall that the preceding a is not part of the suffixes of the group). We compute the longest common prefix of the strings in this group. This is b in our case. So the a-edge from the root is labeled by ab, and we obtain an unevaluated branching node with remaining suffixes ab$ and $, which is evaluated recursively. Since all suffixes differ in the first character, we obtain two singleton groups of suffixes, and thus two leaf edges outgoing from node(ab), labeled by $ and ab$.

The b-group of the root contains two suffixes, too: ab$ and $. The outgoing edge is labeled b, since there is no common prefix among the strings in the b-group, and the resulting branching node node(b) has two remaining suffixes ab$ and $, which are recursively classified. 
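To make the recursive structure explicit, here is a compact Java sketch of WOTD (our own illustration: for readability it stores edge labels as explicit substrings, whereas a space-efficient implementation would store (start, end) references into s$, and it collects the character groups with a map instead of low-level counting).

    import java.util.*;

    // Write-only top-down (WOTD) suffix tree construction, simplified.
    public class Wotd {

        static class Node {
            final Map<String, Node> children = new TreeMap<>(); // edge label -> child
            Integer suffixStart = null;                         // 1-based suffix start, set for leaves
        }

        // positions: 0-based starting points of the remaining suffixes (common prefix already consumed);
        // depth: number of characters consumed on the path from the root.
        static Node evaluate(String s, List<Integer> positions, int depth) {
            Node v = new Node();
            Map<Character, List<Integer>> groups = new TreeMap<>();
            for (int p : positions)
                groups.computeIfAbsent(s.charAt(p), c -> new ArrayList<>()).add(p);
            for (List<Integer> g : groups.values()) {
                if (g.size() == 1) {                            // singleton group: leaf edge
                    int p = g.get(0);
                    Node leaf = new Node();
                    leaf.suffixStart = p - depth + 1;
                    v.children.put(s.substring(p), leaf);
                } else {                                        // split off the longest common prefix
                    int l = lcp(s, g);
                    List<Integer> rest = new ArrayList<>();
                    for (int p : g) rest.add(p + l);
                    v.children.put(s.substring(g.get(0), g.get(0) + l), evaluate(s, rest, depth + l));
                }
            }
            return v;
        }

        static int lcp(String s, List<Integer> g) {             // longest common prefix of >= 2 suffixes
            int l = 0;
            while (true) {
                char c = s.charAt(g.get(0) + l);
                for (int p : g)
                    if (p + l >= s.length() || s.charAt(p + l) != c) return l;
                l++;
            }
        }

        static void print(Node v, String indent) {
            for (Map.Entry<String, Node> e : v.children.entrySet()) {
                Node w = e.getValue();
                System.out.println(indent + e.getKey()
                        + (w.children.isEmpty() ? "   (suffix " + w.suffixStart + ")" : ""));
                print(w, indent + "  ");
            }
        }

        public static void main(String[] args) {
            String s = "abab$";
            List<Integer> all = new ArrayList<>();
            for (int i = 0; i < s.length(); i++) all.add(i);
            print(evaluate(s, all, 0), "");
        }
    }

For s$ = abab$ this prints the $-leaf, the inner nodes reached by ab and b, and below each of them the leaf edges $ and ab$, exactly as constructed in Example 7.4.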

Analysis. The worst case running time of WOTD is O(n^2). Consider, for example, the string s = a^n. The suffix tree for s$ is a binary tree with exactly one branching node of depth i for each i ∈ [0, n−1]. To construct the branching node of depth i, exactly n − i suffixes are considered. That is, the number of steps is Σ_{i=0}^{n−1} (n − i) = Σ_{j=1}^{n} j = n(n+1)/2 ∈ O(n^2). In the expected case, the maximal depth of the branching nodes is much smaller than n − 1, namely O(log_{|Σ|}(n)). In other words, the length of the path to the deepest branching node


in the suffix tree is O(log_{|Σ|}(n)). The suffixes along the leaf edges are not read any more. Hence the expected running time of WOTD is O(n log_{|Σ|}(n)).

WOTD has several properties that make it interesting in practice:

• The subtrees of the suffix tree are constructed independently from each other. Hence the algorithm can easily be parallelized.

• The locality behavior is excellent: Due to the write-only property, the construction of the subtrees only depends on the set of remaining suffixes. Thus the data required to construct the subtrees is very small. As a consequence, it often fits into the cache. This makes the algorithm fast in practice since a cache access is much faster than an access to main memory. In many cases, WOTD is faster in practice than worst-case linear-time suffix tree construction methods.

• The paths in the suffix tree are constructed in the order they are searched, namely top-down. Thus one could construct a subtree only when it is traversed for the first time. This would result in a "lazy construction", which could also be implemented in an eager imperative language (such as C). Experiments show that such a lazy construction is very fast.

7.5 Suffix Tree Construction: Ukkonen’s Linear-Time Online Algorithm

This algorithm constructs the suffix tree of s online, i.e., it generates a sequence of suffix trees for all prefixes of s, starting with the suffix tree of the empty sequence ε, followed by the suffix trees of s[1], s[1]s[2], s[1]s[2]s[3], . . . , s, s$. The method is called online since in each step the suffix trees are constructed without knowing the remaining part of the input string. In other words, the algorithm may read the input string character by character from left to right. Note that during the intermediate steps, the string has no sentinel character at its end. Therefore, not all suffixes correspond to leaves. Instead, some suffixes may end in the middle of an edge or at an internal node. This changes as soon as the last character $ is added. The suffix tree of the empty sequence consists just of the root and is easy to construct. Thus we only need to focus on the changes that happen when we append a character to s. If we can find a way to use only constant time per appended character (asymptotically), then we have a linear-time algorithm. How this can be done is well explained in Gusfields’s book, Chapter 6 (Gusfield, 1997).

7.6 Applications of Suffix Trees

7.6.1 Exact String Matching

Problem 7.5 (Exact String Matching Problem) Given a text s ∈ Σ* and a pattern p ∈ Σ*, the problem comes in three variants:


1. decide whether p occurs at least once in s (i.e., whether p is a substring of s),

2. count the number of occurrences of p in s,

3. list the starting positions of all occurrences of p in s.

We shall see that, given the suffix tree of s$, the first problem can be decided in O(|p|) time, which is independent of the text length. The second problem can be solved in the same time, using additional annotation in the suffix tree. The time to solve the third problem must obviously depend on the number of occurrences z of p in s. We show that it can be solved in O(|p| + z) time.

1. Since the suffix tree for s$ contains all substrings of s$, it is easy to verify whether p is a substring of s by following the path from the root directed by the characters of p. If at some point you cannot proceed with the next character in p, then p is not displayed by the suffix tree and hence it is not a substring of s. Otherwise, if p occurs in the suffix tree, then it also is a substring of s. Processing each character of p takes constant time, either by verifying that the next character of an edge label agrees with the next character of p, or by finding the appropriate outgoing edge of a branching node. The latter case assumes a constant alphabet size, i.e., |Σ| = O(1). Therefore the total time is O(|p|).

2. To count the number of occurrences, we could proceed as follows after solving the first problem. If p occurs at least once in s, we will have found a position in the tree (either in the middle of an edge or a node) that represents p. Now we only need to count the number of leaves below that position. However, this would take time proportional to the number of leaves. A better way is to pre-process the tree once in a bottom-up fashion and annotate each node with the number of leaves below. Then the answer can be found in the node immediately below or at p’s position in the tree.

3. We first find the position in the tree that corresponds to p in O(|p|) time according to 1. Assuming that each leaf is annotated with the starting position of its associated suffix, we visit each of the z leaves below p and output its suffix starting position in O(z) time.
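For illustration, the following self-contained Java sketch answers all three queries. To keep it short it builds an uncompacted suffix trie (O(n^2) space) instead of a compact suffix tree and annotates every node with the starting positions of the suffixes below it; the walk down the tree that decides, counts and locates the occurrences of p is the same as described above.

    import java.util.*;

    // Exact string matching queries on a (naively built) suffix trie of s$.
    public class TrieMatch {

        static class Node {
            Map<Character, Node> children = new HashMap<>();
            List<Integer> positions = new ArrayList<>();   // suffix start positions (1-based) below this node
        }

        static Node build(String s) {                      // insert all suffixes of s$ in O(n^2)
            Node root = new Node();
            for (int i = 0; i < s.length(); i++) {
                Node v = root;
                for (int k = i; k < s.length(); k++) {
                    v = v.children.computeIfAbsent(s.charAt(k), c -> new Node());
                    v.positions.add(i + 1);
                }
            }
            return root;
        }

        static List<Integer> occurrences(Node root, String p) {
            Node v = root;
            for (char c : p.toCharArray()) {
                v = v.children.get(c);
                if (v == null) return Collections.emptyList();  // p is not a substring of s
            }
            return v.positions;                            // starting positions of p in s
        }

        public static void main(String[] args) {
            Node root = build("abbab$");
            System.out.println(occurrences(root, "aba"));          // []        -> aba does not occur
            System.out.println(occurrences(root, "b"));            // [2, 3, 5] -> positions of b
            System.out.println(occurrences(root, "b").size());     // 3         -> number of occurrences
        }
    }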

Example 7.6 Let s = abbab. The corresponding suffix tree of abbab$ is shown in Figure F.1 on page 166. Suppose p = aba is the pattern. Reading its first character a, we follow the a-edge from the root. Since the edge has length 2, we verify that the next character b agrees with the pattern. This is the case. We arrive at the branching node node(ab). Trying to continue, we see that there is no a-edge outgoing from node(ab), and we cannot proceed matching p against the suffix tree. In other words, p is not displayed by the tree, hence it is not a substring of s. Now suppose p = b. We follow the b-edge from the root, which brings us to the branching node node(b). Thus b is a substring of s. The leaf numbers in the subtree below node(b) are 2, 3 and 5. Indeed, b starts in s = abbab at positions 2, 3 and 5. 


Longest matching prefix. One variation of this exact pattern matching algorithm is to search for the longest prefix p′ of p that is a substring of s. This can clearly be done in O(|p′|) time. This operation is required to find the left-to-right partition (which is optimal) to compute the maximal matches distance (see Section 3.8).

7.6.2 The Shortest Unique Substring

Pattern discovery problems, in contrast to pattern matching problems, deal with the analysis of just one string, in which interesting regions are to be discovered. An example is given in the following.

Problem 7.7 (Shortest Unique Substring Problem) Given a string s ∈ Σ∗, find all shortest strings u that occur once, but only once, in s.

Extensions of this problem have applications in DNA primer design, for example. The length restriction serves to exclude too trivial solutions.

Example 7.8 Let s = abab. Then the shortest unique substring is ba. 

We exploit two properties of the suffix tree of s$.

• If a string w occurs at least twice in s, there are at least two suffixes in s$, of which w is a proper prefix. Hence in the suffix tree of s$, w corresponds to a path ending with an edge to a branching node.

• If a string w occurs only once in s, there is only one suffix in s$ of which w is a prefix. Hence in the suffix tree of s$, w corresponds to a path ending within a leaf-edge.

According to the second property, we can find the unique strings by looking at the paths ending on the edges to a leaf. So if we have reached a branching node, say node(w), then we only have to look at the leaf edges outgoing from node(w). Consider an edge node(w) → v, where v is a leaf, and assume that au is the edge label with first character a ∈ Σ. Then wa occurs only once in s, but w occurs at least twice, since it is a branching node. Among all such strings, we pick the shortest one(s).

Note that we disregard edges whose label starts with $, since the sentinel is always a unique substring by construction.

The running time of this simple algorithm is linear in the number of nodes and edges in the suffix tree, since we have to visit each of these only once, and for each we do a constant amount of work. The algorithm thus runs in linear time since the suffix tree can be constructed in linear time, and there is a linear number of nodes and edges in the suffix tree. This is optimal, since the running time is linear in the size of the input.
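As a simple cross-check of the linear-time method, here is a brute-force Java reference (our own, roughly cubic time) that finds the shortest unique substrings directly from the problem statement; it is only meant for small inputs.

    import java.util.*;

    // Brute-force reference for the shortest unique substring problem.
    public class ShortestUnique {

        static List<String> shortestUnique(String s) {
            for (int len = 1; len <= s.length(); len++) {
                Map<String, Integer> count = new LinkedHashMap<>();
                for (int i = 0; i + len <= s.length(); i++)
                    count.merge(s.substring(i, i + len), 1, Integer::sum);
                List<String> unique = new ArrayList<>();
                for (Map.Entry<String, Integer> e : count.entrySet())
                    if (e.getValue() == 1) unique.add(e.getKey());
                if (!unique.isEmpty()) return unique;   // the first length with unique substrings is the shortest
            }
            return Collections.emptyList();
        }

        public static void main(String[] args) {
            System.out.println(shortestUnique("abab"));   // prints [ba], cf. Example 7.8
        }
    }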


7.6.3 Maximal Repeats

Informally, a repeat of a string is a substring that occurs at least twice. However, care needs to be taken when formalizing the notion of repeat.

Definition 7.9 Given a string s ∈ Σ*, a repeat of s is a triple (i, i′, ℓ) of two starting positions i < i′ in s and a length ℓ such that s[i, . . . , i + ℓ − 1] = s[i′, . . . , i′ + ℓ − 1].

Equivalently, the same repeat can be described by two position pairs, containing the starting and ending position of each instance, i.e., for (i, i′, ℓ) we can also write ((i, j), (i′, j′)), where j = i + ℓ − 1 and j′ = i′ + ℓ − 1. In that case (i, j) is called the left instance of the repeat and (i′, j′) is called the right instance of the repeat; see Figure 7.4. Note that the two instances may overlap.

Figure 7.4: Illustration of a repeat: the left instance s[i, . . . , j] and the right instance s[i′, . . . , j′] are equal substrings of s.

Example 7.10 The string s = gagctcgagc, |s| = 10, contains the following repeats of length ≥ 2:

    ((1, 4), (7, 10))   gagc
    ((1, 3), (7, 9))    gag
    ((1, 2), (7, 8))    ga
    ((2, 4), (8, 10))   agc
    ((2, 3), (8, 9))    ag
    ((3, 4), (9, 10))   gc

We see that shorter repeats are often contained in longer repeats. To remove redundancy, we introduce maximal repeats, illustrated in Figure 7.5. Essentially, maximality means that the repeated substring cannot be extended to the left or right.

Definition 7.11 A repeat ((i, j), (i′, j′)) is left-maximal if and only if i = 1 or s[i − 1] ≠ s[i′ − 1]. It is right-maximal if and only if j′ = |s| or s[j + 1] ≠ s[j′ + 1]. A repeat is maximal if it is both left-maximal and right-maximal.

[Figure: the left instance s[i, . . . , j] is preceded by character a and followed by b, the right instance s[i′, . . . , j′] is preceded by c and followed by d; the repeat is maximal if and only if a ≠ c and b ≠ d.]

Figure 7.5: Illustration of maximality

From now on we restrict ourselves to maximal repeats. All non-maximal repeats can easily be obtained from the maximal repeats. In Example 7.10, all repeats except the first can be extended to the left or to the right. Hence only the first repeat ((1, 4), (7, 10)) is maximal.
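Before turning to the suffix-tree solution, a brute-force Java reference (our own, quadratic in |s|) helps to make Definition 7.11 concrete: for every pair of starting positions it takes the longest common extension, which is automatically right-maximal, and then checks left-maximality.

    // Brute-force enumeration of all maximal repeats of minimal length minLen.
    public class MaximalRepeats {
        public static void main(String[] args) {
            String s = "gagctcgagc"; int minLen = 2;          // Example 7.10
            int n = s.length();
            for (int i = 1; i <= n; i++) {
                for (int i2 = i + 1; i2 <= n; i2++) {
                    int len = 0;                               // longest common extension of positions i and i2
                    while (i2 + len <= n && s.charAt(i - 1 + len) == s.charAt(i2 - 1 + len)) len++;
                    boolean leftMaximal = (i == 1) || s.charAt(i - 2) != s.charAt(i2 - 2);
                    if (len >= minLen && leftMaximal)          // maximal extension implies right-maximality
                        System.out.println("((" + i + "," + (i + len - 1) + "),(" + i2 + "," + (i2 + len - 1)
                                + "))  " + s.substring(i - 1, i - 1 + len));
                }
            }
        }
    }

For s = gagctcgagc and minLen = 2 it prints only ((1,4),(7,10)) gagc, in agreement with the example.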


Problem 7.12 (Maximal Repeat Discovery Problem) Given a string s ∈ Σ*, find all maximal repeats of s (possibly of a given minimal length ℓ).

An optimal algorithm. We shall present a linear-time algorithm to compute all maximal repeats. It works in two phases: In the first phase, the leaves of the suffix tree are annotated. In the second phase, the maximal repeats are reported while the branching nodes are annotated simultaneously. In detail, suppose we have the suffix tree for some string s of length n over some alphabet Σ such that the first and the last character of s both occur exactly once in s. We ignore leaf edges from the root, since the root corresponds to repeats of length zero and we are not interested in these. Figure 7.6 gives an example for the string ggcgctgcgcc$.


Figure 7.6: The suffix tree for ggcgctgcgcc$. Leaf edges from the root are not shown. These edges are not important for the algorithm.

In the first phase, the algorithm annotates each leaf of the suffix tree: if ν = s[i . . . n], then the leaf v with path-label ν is annotated by the pair (a, i), where i is the position at which the suffix ν starts and a = s[i − 1] is the character to the immediate left of that position. We also write A(v, s[i − 1]) = {i} to denote the annotation, and assume A(v, σ) = ∅ (the empty set) for all characters σ ∈ Σ different from s[i − 1]. The latter assumption holds in general (also for branching nodes) whenever there is no annotation (σ, j) for some j. For the suffix tree of Figure 7.6, the leaf annotation is shown in Figure 7.7.


Figure 7.7: The suffix tree for ggcgctgcgcc$ with leaf annotation.

The leaf annotation gives us the character upon which we decide the left-maximality of a repeat, plus a position where a repeated string occurs. We only have to combine this information at the branching nodes appropriately.


This is done in the second phase of the algorithm: In a bottom-up traversal, the repeats are reported and simultaneously the annotation for the branching nodes is computed. A bottom-up traversal means that a branching node is visited only after all of its children have been visited. Each edge, say v → w, is processed as follows:

1. Repeats (for the string ν ending at node v) are reported by combining the annotation already computed for node v with the complete annotation stored for w (the latter was already computed due to the bottom-up strategy). In particular, we output all pairs ((i, i + q − 1), (j, j + q − 1)), where

   • q is the depth of node v, i.e. q = |ν|,
   • i ∈ A(v, σ) and j ∈ A(w, σ′) for some characters σ ≠ σ′,
   • A(v, σ) is the annotation already computed for v w.r.t. character σ, and A(w, σ′) is the annotation stored for node w w.r.t. character σ′.

2. Only those positions are combined which have different characters to the left; thus left-maximality of the repeats is guaranteed. Recall that we are processing the edge v → w, and let a be the first character of the label of this edge. The annotation already computed for v was inherited along edges outgoing from v that are different from v → w. Thus the first character of the label of such an edge, say σ, is different from a. Now since ν is the repeated substring, σ and a are characters to the right of ν. As a consequence, only those positions are combined which have different characters to the right. In other words, the algorithm also guarantees right-maximality of the repeats.

As soon as the repeats for the current edge are reported, the algorithm computes the union A(v, σ) ∪ A(w, σ) for all characters σ, i.e. the annotation is inherited from node w to node v. In this way, after processing all edges outgoing from v, this node is annotated by the set of positions where ν occurs, and this set is divided into (possibly empty) disjoint subsets A(v, σ_1), . . . , A(v, σ_r), where Σ = {σ_1, . . . , σ_r}.

Example 7.13 The suffix tree of the string s = ggcgctgcgcc$ is shown in Figure 7.6. We assume the leaf annotation of it (as shown in Figure 7.7) is already determined. Proceeding from leaves 7 and 2, the bottom up traversal of the suffix tree for ggcgctgcgcc$ begins with node v whose path-label is gcgc of depth q = 4. This node has two children which do not have the same character to their left (t vs. g). The node is annotated with (g, 2) and (t, 7). Because our repeat is also left-maximal, it is reported: ((2, 2 + 4 − 1), (7, 7 + 4 − 1)) = ((2, 5), (7, 10)), as can be seen in Figure 7.8, which shows the annotation for a large part of the suffix tree and some repeats. Next comes the node v with path-label gc of depth two. The algorithm starts by processing the first edge outgoing from v. Since initially there is no annotation for v, no repeat is reported and v is annotated with the label of leaf 9: (c, 9). Then the second edge is processed. This means that the annotation (g, 2), (t, 7) for w with path-label gcgc is combined with the annotation (c, 9). The new annotation for v becomes (c, 9), (t, 7), (g, 2). This gives the repeats ((7, 8), (9, 10)) and ((2, 3), (9, 10)). Finally, the third edge is processed. (c, 9) and (c, 4) cannot be combined, see condition 2 above. So only the repeats ((4, 5), (7, 8)) and



Figure 7.8: The annotation for a large part of the suffix tree of Figure 7.7 and some repeats.

((2, 3), (4, 5)) are reported, resulting from the combination of (t, 7) and (g, 2) with (c, 4). The final annotation for v is (c, 9), (t, 7), (g, 2), (c, 4), which can also be read as A(v, g) = {2}, A(v, c) = {4, 9} and A(v, t) = {7}. We leave further processing up to the reader. 

Running Time. Let us now consider the running time of the algorithm. Traversing the suffix tree bottom-up can surely be done in time linear in the number of nodes, since each node is visited only once and we only have to follow the paths in the suffix tree. There are two operations performed during the traversal: Output of repeats and combination of annotations. If the annotation for each node is stored in linked lists, then the output operation can be implemented such that it runs in time linear in the number of repeats. Combining the annotations only involves linking lists together, and this can be done in time linear in the number of nodes visited during the traversal. Recall that the suffix tree can be constructed in O(n) time. Hence the algorithm requires O(n + k) time where n is the length of the input string and k is the number of repeats. To analyze the space consumption of the algorithm, first note that we do not have to store the annotations for all nodes all at once. As soon as a node and its parent has been processed, we no longer need the annotation. As a consequence, the annotation only requires O(n) overall space. Hence the space consumption of the algorithm is O(n). Altogether the algorithm is optimal since its space and time requirement is linear in the size of the input plus the size of the output.

7.6.4 Maximal Unique Matches

The standard dynamic programming algorithm to compute the optimal alignment between two sequences of length m and n requires O(mn) steps. This is too slow if the sequences are hundreds of thousands or millions of characters long. There are other methods that allow aligning two genomes under the assumption that these are fairly similar. The basic idea is that the similarity often results in long identical substrings which occur in both genomes. These identities, called MUMs (for maximal unique matches), are almost surely part of any good alignment of the two genomes. So the first step is to find the MUMs. These are then taken as the fixed part of an alignment, and the remaining parts of the genomes (those parts not included in a MUM) are aligned with traditional dynamic programming methods. In this section, we will show how to compute the MUMs in linear time. This is very important for the practical applicability of the method. We do

not consider how to compute the final alignment (the whole procedure of genome alignment will be discussed in Chapter 13 of these notes). We first have to define the notion MUM precisely:

Definition 7.14 Given strings s, t ∈ Σ∗ and a minimal length ` ≥ 1, a string u that satisfies

1. |u| ≥ `,

2. u occurs exactly once in s and exactly once in t (uniqueness),

3. for any character a, neither au nor ua occurs in both s and t (left- and right-maximality)

is called a maximal unique match or MUM of s and t.

Problem 7.15 (Maximal Unique Matches Problem) Given s, t ∈ Σ∗ and a minimal length ` ≥ 1, find all MUMs of s and t.

Example 7.16 Let s = ccttcgt, t = ctgtcgt, and ` = 2. Then there are two maximal unique matches, ct and tcgt. Now consider an optimal alignment of these two sequences (assuming unit costs for insertions, deletions, and replacements):

cct-tcgt
-ctgtcgt

The two MUMs ct and tcgt are part of this alignment. 

To compute the MUMs, we first construct the generalized suffix tree for s and t, i.e. the suffix tree of the concatenated string x := s#t$.

A MUM u must occur exactly twice in x, once in s and once in t. Hence u corresponds to a path in the suffix tree ending with an edge to a branching node. Since a MUM must be right-maximal, u must even end in that branching node, and that node must have exactly two leaves as children, one in s and one in t, and no further children. It remains to check the left-maximality in each case. We thus arrive at the following algorithm:

For each branching node q of the suffix tree of x,

1. check that its string depth is at least `,

2. check that there are exactly two children, both of which are leaves,

3. check that the suffix starting positions i and j at those leaves correspond to positions from both s and t in x,

4. check that the characters x[i − 1] and x[j − 1] are different, or i = 0 or j = 0 (left-maximality condition).

If all checks are true, output string(q) and/or its positions i and j. Clearly, the algorithm runs in linear time since each step (1. – 4.) can be organized to run in constant time, and there are a linear number of branching nodes in the suffix tree of x.
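To make the four checks concrete, here is a minimal sketch in Java. The node type STNode with its fields is an assumption made only for this illustration; it does not correspond to the compact memory layout of Section F.1.

import java.util.List;

// Hypothetical minimal node type; field names are assumptions for this sketch.
class STNode {
    int stringDepth;        // string depth of the node, |string(q)|
    List<STNode> children;  // child nodes; empty for leaves
    int suffixStart;        // for leaves: start position of the corresponding suffix in x
    boolean isLeaf() { return children.isEmpty(); }
}

class MumCheck {
    // Applies checks 1.-4. to one branching node q of the suffix tree of x = s#t$,
    // where sLen = |s| and ell is the minimal MUM length.
    static void checkNode(STNode q, String x, int sLen, int ell) {
        if (q.stringDepth < ell) return;                           // 1. string depth at least ell
        if (q.children.size() != 2) return;                        // 2. exactly two children ...
        STNode c1 = q.children.get(0), c2 = q.children.get(1);
        if (!c1.isLeaf() || !c2.isLeaf()) return;                  //    ... both of which are leaves
        int i = c1.suffixStart, j = c2.suffixStart;
        if ((i < sLen) == (j < sLen)) return;                      // 3. one occurrence in s, one in t
        if (i > 0 && j > 0 && x.charAt(i - 1) == x.charAt(j - 1))  // 4. left-maximality
            return;
        System.out.println(x.substring(i, i + q.stringDepth) + " at positions " + i + " and " + j);
    }
}

Calling checkNode for every branching node of the suffix tree of x then reports all MUMs.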


Figure 7.9: The suffix tree for ccttcgt#ctgtcgt$ without the leaf edges from the root

Example 7.17 Let s = ccttcgt, t = ctgtcgt, ` = 2. Consider the suffix tree for s#t$ shown in Figure 7.9. The string tcgt occurs exactly once in s and once in t, since there are two corresponding leaf edges from the branching node node(tcgt). Comparing the characters g and t immediately to the left of the occurrences of tcgt in s and t verifies left-maximality. Similarly for ct. On the other hand, cgt is not left-maximal, because both occurrences have the left neighbor t. 


CHAPTER 8

Suffix Arrays

Contents of this chapter: suffix array pos, inverse suffix array rank, longest common prefix array lcp, suffix array construction, quicksort, Manber-Myers algorithm, construction of rank and pos, applications.

8.1 Motivation

We have already seen that suffix trees are a very useful data structure for a variety of string matching problems. We have also discussed their memory requirements: Storing a suffix tree in the way outlined in Section F.1 is among the most efficient possible ways, but also limits the string length somewhat. Additionally, this way of storing the tree (instead of explicitly representing leaves and internal nodes) is rather new. In the early 1990s, it was believed that storing a suffix tree needs around 30–40 bytes per character. Given the smaller amounts of available memory at that time, this led to the invention of a "flat" data structure that nevertheless captures the essence of the suffix tree: the suffix array.

More recently, it has become clear that a suffix array complemented with additional in- formation, called an enhanced suffix array or extended suffix array, can completely replace (because it is equivalent to) the suffix tree. Additionally, some problems have simpler algorithms on suffix arrays than on suffix trees (for other problems, the opposite is true).

A suffix array is easy to define: Imagine the suffix tree, and assume that at each internal node, the edges toward the children are alphabetically ordered. If furthermore each leaf is annotated by the starting position of its corresponding suffix, we obtain the suffix array by reading the leaf labels from left to right: This gives us the (starting positions of the) suffixes in ascending lexicographic order.


8.2 Basic Definitions

In this section, we start counting string positions at zero. Thus a string s of length n is written as s = (s[0], . . . , s[n − 1]). As before, we append a sentinel $ to each string, and assume that $ is lexicographically smaller than any character of the alphabet. In examples, we shall always assume a natural order on the alphabet, i.e., a < b < c < . . . .

Definition 8.1 For a string s ∈ Σ^n, the suffix array pos of t = s$ is a permutation of the integers {0, . . . , n} such that pos[r] is the starting position of the lexicographically r-th smallest suffix of t.

The inverse suffix array rank of s$ is a permutation of the integers {0, . . . , n} such that rank[p] is the lexicographic rank of the suffix starting at position p.

Clearly by definition rank[pos[r]] = r for all r ∈ {0, . . . , n}, and also pos[rank[p]] = p for all p ∈ {0, . . . , n}. Since we assume that $ is the smallest character and occurs only at position n, we have rank[n] = 0 and pos[0] = n.

The suffix array by itself represents the order of the leaves of the suffix tree, but it does not contain information about the internal nodes. Recall that the string depth of an internal node corresponds to the length of the maximal common prefix of all suffixes below that node. We can therefore recover information about the internal nodes by making the following definition.

Definition 8.2 Given a string s ∈ Σ^n and the suffix array pos of t = s$, we define the longest common prefix array lcp : {1, . . . , n} → N_0 by

lcp[r] := max{|x| : x is a prefix of both t[pos[r − 1]...n] and t[pos[r]...n]}.

In other words, lcp[r] is the length of the longest common prefix of the suffixes starting at positions pos[r − 1] and pos[r].

By convention, we additionally define lcp[0] := −1 and lcp[n+1] := −1; this avoids treating boundary cases specially.

Example 8.3 Table 8.1 shows the suffix array pos, its inverse rank and the lcp array of the string abbab$. The suffix tree was given in Figure F.1. Since the leaf labels there count string positions from 1, the suffix array is obtained by subtracting 1 from the leaf labels and reading them from left to right: pos = (5, 3, 0, 4, 2, 1). 

lcp intervals. It is important to realize the one-to-one correspondence between internal nodes in the suffix tree and certain intervals in the suffix array. We have already seen that all leaves below an internal node v in the suffix tree correspond to an interval pos[r], r ∈ [r−, r+], in the suffix array. If v has string depth d, it follows that

• lcp[r−] < d, since the suffix starting at pos[r− − 1] is not below v and hence does not share a common prefix of length d with the suffix starting at pos[r−];


r        0   1    2       3    4     5
suffix   $   ab$  abbab$  b$   bab$  bbab$
pos[r]   5   3    0       4    2     1

p        0   1    2    3    4    5
rank[p]  2   5    4    1    3    0

r        0   1    2    3    4    5    6
lcp[r]  -1   0    2    0    1    1   -1

Table 8.1: Suffix array pos and its inverse rank of abbab$. The table also shows the longest common prefix array lcp.

• lcp[r] ≥ d for all r− < r ≤ r+, since all suffixes below v share a common prefix of length at least d. On the other hand, for at least one r in this range we have lcp[r] = d: otherwise, if lcp[r] > d for all r in this range, the string depth of v would be larger than d.

• lcp[r+ + 1] < d, since again the suffix starting at pos[r+ + 1] is not below v.

Conversely, each pair ([r−, r+], d) satisfying r− < r+ and the above conditions corresponds to an internal node in the suffix tree. Such an interval [r−, r+] is called a d-interval, or generally, an lcp interval.

8.3 Suffix Array Construction Algorithms

8.3.1 Linear-Time Construction using a Suffix Tree

Given the suffix tree of s$, it is straightforward to construct the suffix array of s$ in linear time. We start with an empty list pos and then do a depth-first traversal of the suffix tree, visiting the children of each internal node in alphabetical order. Whenever we encounter a leaf, we append its annotation (the starting position of the associated suffix) to pos. This takes linear time, since we traverse each edge once, and there is a linear number of edges.

If we use the memory layout described in Section F.1 for the suffix tree and store the internal nodes in depth-first order, this construction is particularly simple. We simply walk through the memory locations from left to right, keep track of the current string depth, and collect all leaf pointers along the way. We recognize a leaf pointer by the L-flag. The leaf number is obtained by subtracting the current string depth from the leaf pointer (see Section F.1 for details).

This construction is simple and fast, but first requires the suffix tree. One of the major motivations of suffix arrays was to avoid constructing the tree in the first place. Therefore direct construction methods are more important in practice.


8.3.2 Direct Construction

Quicksort: The simplest direct construction method is to use any comparison-based sorting algorithm (e.g. MergeSort or QuickSort) and apply it to the suffixes of s$. We let n := |s|. For the analysis we need to consider that a comparison of two suffixes does not take constant time, but O(n) time, since up to n characters must be compared to decide which suffix is lexicographically smaller. Optimal comparison-based sorting algorithms need O(n log n) comparisons, so this approach constructs the suffix array in O(n^2 log n) time.

Especially QuickSort is an attractive choice for the basic sorting method, since it is fast in practice (but needs O(n^2) comparisons in the worst case, so this would lead to an O(n^3) worst-case suffix array construction algorithm), and sorting the suffix permutation can be done in place, so no additional memory besides the text and pos is required.

For an average random text, two suffixes differ after log_|Σ| n characters, so each comparison needs only O(log n) time on average. Combined with the average-case running time of QuickSort (or worst-case running time of MergeSort), we get an O(n log^2 n) average-case suffix array construction algorithm that is easy to implement (e.g. using the C stdlib function qsort) and performs very well in practice on strings that do not contain long repeats.
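A compact version of this approach in Java might look as follows; the class and method names are illustrative, and the sentinel $ is assumed to compare smaller than all other characters (which holds for its ASCII code relative to letters).

import java.util.Arrays;

class SuffixArrayBySorting {
    // Builds pos for t = s$ by comparison-based sorting of all suffix start positions.
    static int[] buildPos(String t) {
        int n = t.length() - 1;                    // t = s$ has positions 0..n
        Integer[] starts = new Integer[n + 1];
        for (int p = 0; p <= n; p++) starts[p] = p;
        Arrays.sort(starts, (a, b) -> {
            int i = a, j = b;
            while (i <= n && j <= n) {             // compare the two suffixes character by character
                int c = Character.compare(t.charAt(i), t.charAt(j));
                if (c != 0) return c;
                i++; j++;
            }
            return (i > n) ? -1 : 1;               // exhausted suffix is a proper prefix, hence smaller
        });
        int[] pos = new int[n + 1];
        for (int r = 0; r <= n; r++) pos[r] = starts[r];
        return pos;
    }
}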

Manber-Myers Algorithm: Manber and Myers (1990), who introduced the suffix array data structure, proposed an algorithm that runs in O(n log n) worst-case time on any string of length n. It uses an ingenious doubling technique due to Karp, Miller, and Rosenberg (Karp et al., 1972).

The algorithm starts with an initial character-sorting phase (called phase k = 0) and then proceeds in K := ⌈log_2(n)⌉ phases, numbered from k = 1 to k = K. The algorithm maintains the invariant that after phase k ∈ {0, . . . , K}, all suffixes have been correctly sorted according to their first 2^k characters. Thus in phase 0, it is indeed sufficient to classify all suffixes according to their first character. Each of the following phases must then double the number of accounted characters and update the suffix order accordingly. After phase K, all suffixes are correctly sorted with respect to their first 2^K ≥ n characters and therefore in correct lexicographic order. We shall show that the initial sorting and each doubling phase runs in O(n) time, thus establishing the O(n log n) running time of the algorithm.

Example 8.4 Given the string s$ = STETSTESTE$ (positions 0, . . . , 10), construct the suffix array using the Manber-Myers algorithm. The algorithm takes at most K = ⌈log_2(11)⌉ = 4 phases, k ∈ {0, . . . , K}.

The character counts and bucket start positions are:

σ   #   bucket start
$   1   0
E   3   1
S   3   4
T   4   7

After phase 0 (2^k = 1), the suffixes are grouped by their first character:

bucket       $ | E  E  E | S  S  S | T  T  T  T
suffix      10 | 2  6  9 | 0  4  7 | 1  3  5  8
look-ahead   - | T  S  $ | T  T  T | E  S  E  E

After phase 1 (2^k = 2):

bucket       $ | E$ | ES | ET | ST  ST  ST | TE  TE  TE | TS
suffix      10 |  9 |  6 |  2 |  0   4   7 |  1   5   8 |  3
look-ahead   - |  - | TE | ST | ET  ES  E$ | TS  ST   $ | TE

After phase 2 (2^k = 4):

bucket       $ | E$ | ES | ET | STE$ | STES | STET | TE$ | TEST | TETS | TS
suffix      10 |  9 |  6 |  2 |   7  |   4  |   0  |  8  |   5  |   1  |  3

The Manber-Myers algorithm has sorted the suffixes of s$ at the latest after phase K = 4, in this example already after phase 2. In each phase, the buckets are refined using the look-ahead row: below each suffix starting at position i, the substring of length 2^k starting at position i + 2^k is given (if it exists); this substring is used for lexicographically sorting within the same bucket in the next phase. 
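The doubling idea can be sketched compactly in Java if one is willing to give up the per-phase bucket sorting: the following version simply re-sorts the suffixes in every phase by the pair (rank of the first 2^k characters, rank of the following 2^k characters) and then assigns new ranks. It therefore runs in O(n log^2 n) rather than O(n log n); the genuine Manber-Myers algorithm replaces the comparison sort in each phase by bucket counting. All names are illustrative.

import java.util.Arrays;

class PrefixDoubling {
    // After the phase with offset h, the suffixes are ordered by their first 2h characters.
    static int[] buildPos(String t) {
        int n = t.length();
        Integer[] sa = new Integer[n];
        int[] rank = new int[n], tmp = new int[n];
        for (int i = 0; i < n; i++) { sa[i] = i; rank[i] = t.charAt(i); }   // phase 0: first character
        for (int h = 1; ; h *= 2) {
            final int step = h;
            final int[] r = rank;
            Arrays.sort(sa, (a, b) -> {                                     // sort by (rank, rank at +h)
                if (r[a] != r[b]) return Integer.compare(r[a], r[b]);
                int ra = a + step < n ? r[a + step] : -1;
                int rb = b + step < n ? r[b + step] : -1;
                return Integer.compare(ra, rb);
            });
            tmp[sa[0]] = 0;                                                 // assign new ranks
            for (int i = 1; i < n; i++) {
                int a = sa[i - 1], b = sa[i];
                boolean equal = r[a] == r[b]
                        && (a + step < n ? r[a + step] : -1) == (b + step < n ? r[b + step] : -1);
                tmp[b] = tmp[a] + (equal ? 0 : 1);
            }
            System.arraycopy(tmp, 0, rank, 0, n);
            if (rank[sa[n - 1]] == n - 1) break;                            // all ranks distinct: done
        }
        int[] pos = new int[n];
        for (int i = 0; i < n; i++) pos[i] = sa[i];
        return pos;
    }
}

For t = STETSTESTE$ this yields pos = (10, 9, 6, 2, 7, 4, 0, 8, 5, 1, 3), in agreement with Example 8.4.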

Skew Algorithm: There also exists a more advanced algorithm, called Skew (Kärkkäinen and Sanders, 2003), that directly constructs suffix arrays in O(n) time. However, we will not give details here.

8.3.3 Construction of the rank and lcp Arrays

We show how to construct the rank and lcp arrays in linear time when the suffix array pos is already available. The inverse suffix array rank can be easily computed from pos in linear time by the following Java one-liner, using the fact that rank and pos are inverse permutations of each other.

for(int r=0; r<=n; r++) rank[pos[r]]=r;

Computing lcp in linear time is a little bit more difficult. The naive algorithm, comparing the prefixes of all adjacent suffix pairs in pos until we find the first mismatch, can be described as follows:

lcp[0] = lcp[n+1] = -1; // by convention
for(int r=1; r<=n; r++) {
    lcp[r] = LongestCommonPrefixLength(pos[r-1], pos[r]);
}

Here we assume that LongestCommonPrefixLength(p1,p2) is a function that compares the suffixes starting at positions p1 and p2 character by character until it finds the first mismatch, and returns the prefix length. Clearly the overall time complexity is O(n^2). The following algorithm, due to Kasai et al. (2001), achieves O(n) time by comparing the suffix pairs in a different order. On a high level, it can be written as follows:

lcp[0] = lcp[n+1] = -1; // by convention
for(int p=0; p<=n; p++) {
    if (rank[p] > 0)   // the lexicographically smallest suffix has no left neighbor
        lcp[rank[p]] = LongestCommonPrefixLength(pos[rank[p]-1], p);
}


At first sight, we have gained nothing. Implemented in this way, the time complexity is still O(n^2). The only difference is that the lcp array is not filled from left to right, but in apparently random order, depending on rank.

The key observation is that we do not need to evaluate LongestCommonPrefixLength from scratch for every position p. First of all, let us simplify the notation. Define left[p] := pos[rank[p] − 1]; this is the number immediately left of the number p in the suffix array.

In the first iteration of the loop, we have p = 0 and compute the lcp-value at rank[p], i.e., the length L of the longest common prefix of the suffixes starting at position p and at position left[p] (which can be anywhere in the string). In the next iteration, we look for the longest common prefix length of the suffixes starting at positions p + 1 and at left[p + 1]. If left[p + 1] = left[p] + 1, we already know that the answer is L − 1, one less than the previously computed value, since we are looking at the same string with the first character chopped off. If left[p + 1] ≠ left[p] + 1, we know at least that the current lcp value cannot be smaller than L − 1, since (assuming L > 0 in the first place) the number left[p] + 1 must still appear somewhere to the left of p + 1 in the suffix array, but might not be directly adjacent. Still, the suffixes starting at left[p] + 1 and p + 1 share a prefix of length L − 1. Everything in between must share a common prefix that is at least as long. Thus we need not check the first L − 1 characters, we know that they are equal, and can immediately start comparing the L-th character.

The same idea applies to all p-iterations. We summarize the key idea in a lemma, which we have just proven by the above argument.

Lemma 8.5 Let L := lcp[rank[p]]. Then lcp[rank[p + 1]] ≥ L − 1.

The algorithm then looks as follows.

lcp[0] = lcp[n+1] = -1; // by convention
L = 0;
for(int p=0; p<=n; p++) {
    if (rank[p] > 0) {
        L = LongestCommonPrefixExtension(pos[rank[p]-1], p, L);
        lcp[rank[p]] = L;
    }
    if (L > 0) L--;
}

Here LongestCommonPrefixExtension(p1, p2, L) does essentially the same work as the function LongestCommonPrefixLength above, except that it starts comparing the suffixes at p1 + L and p2 + L, effectively skipping the first L characters of the suffixes starting at p1 and p2, as they are known to be equal. It remains to prove that these savings lead to a linear-time algorithm. The maximal value that L can take at any time, including upon termination, is n. Initially L is zero. L is decreased at most n times. Thus in total, L can increase by at most 2n across the whole algorithm. Each increase is due to a successful character comparison. Each call of LongestCommonPrefixExtension, of which there are n, ends with a failed character comparison. Thus at most 3n = O(n) character comparisons are made during the course of the algorithm. Thus we have proven the following theorem.
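A self-contained version of the whole computation might look as follows; the class, method and variable names are illustrative, and the comparison loop plays the role of LongestCommonPrefixExtension.

class KasaiLcp {
    // Computes lcp[0..n+1] (with lcp[0] = lcp[n+1] = -1) from t = s$ and its suffix array pos.
    static int[] buildLcp(String t, int[] pos) {
        int n = t.length() - 1;                  // positions 0..n, t.charAt(n) == '$'
        int[] rank = new int[n + 1];
        for (int r = 0; r <= n; r++) rank[pos[r]] = r;
        int[] lcp = new int[n + 2];
        lcp[0] = lcp[n + 1] = -1;                // by convention
        int L = 0;
        for (int p = 0; p <= n; p++) {
            int r = rank[p];
            if (r > 0) {
                int q = pos[r - 1];              // left[p]: the suffix preceding p in the suffix array
                while (t.charAt(p + L) == t.charAt(q + L)) L++;   // skip the first L known-equal characters
                lcp[r] = L;
            }
            if (L > 0) L--;                      // Lemma 8.5
        }
        return lcp;
    }
}

For abbab$ and pos = (5, 3, 0, 4, 2, 1) this reproduces the lcp values of Table 8.1.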


Theorem 8.6 Given the string s$ and its suffix array pos, the lcp array can be computed in linear time.

8.4 Applications of Suffix Arrays

• exact string matching (O(|p| log n), or O(|p| + log n) with the lcp array)
• Quasar (q-gram based database searching using a suffix array)
• matching statistics
• Burrows-Wheeler Transformation, see Chapter 9
• . . . and more


CHAPTER 9

Burrows-Wheeler Transformation

Contents of this chapter: Burrows-Wheeler transformation and retransformation, exact string matching with the BWT, compression with run-length encoding.

9.1 Introduction

The Burrows-Wheeler transformation (BWT) is a technique to transform a text into a permutation of this text that is easy to compress and search. The central idea is to sort the cyclic rotations of the text, gaining an output where equal characters are grouped together. These grouped characters are then a favorable input for run-length encoding, where a sequence of numbers and characters is constructed, considerably reducing the length of the text (see Section 9.4.1). Due to the fact that in sequence analysis mostly large data sets are processed, it is favorable to have a technique compressing the data to reduce the memory requirement while at the same time enabling important algorithms to be executed on the converted data. The BWT provides a transformation of the text fulfilling both requirements in a useful and elegant way. Further, the transformation is bijective, so it is guaranteed that the original text can be reconstructed in an uncompression step, called retransformation.

9.2 Transformation and Retransformation

For the transformation of the input string, first all the cyclic rotations of the text are computed and written into a matrix, one below the other. Afterwards the rotations are lexicographically sorted. The last column L is the output of the transformation step, besides an index I that gives the row in which the original string is found.


Definition 9.1 Let s be the input string of length |s| = n. Let M be the n × n-matrix containing in its rows the lexicographically sorted cyclic rotations of s. The Burrows-Wheeler transform bwt(s) is the last column of M,

bwt(s)[i] := M(i, n − 1), for all 0 ≤ i < n.

The index I is the index of the row that contains the original string s, i.e. s[j] = M(I, j) for all 0 ≤ j < n. Instead of this index, a sentinel character $ can be appended, so no additional information has to be saved.

Another possibility to compute bwt(s) is to construct the suffix array pos and to decrement each entry (modulo n if necessary): bwt(s)[i] = s[(pos[i] − 1) mod n] for all 0 ≤ i < n.
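A sketch of this construction in Java, assuming a sentinel-terminated text t of length n and its suffix array pos as defined in Chapter 8 (names are illustrative):

class BwtFromSuffixArray {
    // bwt(t)[i] = t[(pos[i] - 1) mod n]: the character cyclically preceding the suffix pos[i].
    static String bwt(String t, int[] pos) {
        int n = t.length();
        StringBuilder L = new StringBuilder(n);
        for (int i = 0; i < n; i++)
            L.append(t.charAt((pos[i] + n - 1) % n));
        return L.toString();
    }
}

For t = STETSTESTE$ and pos = (10, 9, 6, 2, 7, 4, 0, 8, 5, 1, 3) this yields ETTTET$SSSE, the string L of Example 9.2.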

An important observation about the output string bwt(s) is that equal characters are often grouped together due to the properties of natural language. In an English text, for example, there will be many occurrences of the word 'the' and also some occurrences of 'she' and 'he'. Sorting the cyclic rotations of the text will lead to a large group of 'h's in the first column, and thus 't's, 's's and spaces will be grouped together in the last column. This can be explained by the probability of a 't' preceding 'he', which is obviously quite high in contrast to the probability of e.g. an 'h' preceding 'he'. This phenomenon is called left context and can be observed in every natural language, of course with differently distributed probabilities.

The reconstruction of the text is done in a back-to-front manner. Besides bwt(s), which is the last column of the matrix M, the first column is available by lexicographically sorting bwt(s). After reconstructing the first column, the character at position I, respectively the row containing the sentinel $ in the last column, is examined. (This is equivalent, because I indicates the row where the original string was found in M, thus in the last column at position I the last character of s is found. The sentinel, which is unique in the string, is appended to the end of the text. Thus, the row where the sentinel is found in the last column has to be the row where the original string is found in the matrix.)

Due to the fact that the matrix contains the cyclic rotations, each character in the last column is the precursor of the character in the first column in the same row. So the next step is to search for the occurrence of the first reconstructed character in the first column and to look up its precursor in the last. If this character is unique in the text, this is trivial. Otherwise an important property of the Burrows-Wheeler transform is used, which says that the i-th occurrence of a character in L corresponds to the i-th occurrence of that character in F. So, in order to find the corresponding character from the last column in the first column, it is necessary to know how many instances of this character occur earlier in the last column. Next, the corresponding occurrence of this character is searched in the first column. The index of this character then gives the position where its precursor in the last column is found. (Efficient ways to perform these searches are known, see e.g. Kärkkäinen (2007) or Adjeroh et al. (2008), but the details go beyond the scope of these lecture notes.) This procedure is repeated until the starting index I, respectively the sentinel $, is reached again. Here the reconstruction phase ends and the original input string is obtained.
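The back-to-front reconstruction can be implemented by precomputing, for every row, the row of F that holds the same character occurrence (this array formulation is a standard realization of the procedure just described, not something introduced in the text above; all names are illustrative).

class InverseBwt {
    // Reconstructs t from L = bwt(t), using that the i-th occurrence of a character in L
    // corresponds to the i-th occurrence of that character in the sorted column F.
    static String invert(String L) {
        int n = L.length();
        int[] count = new int[256];
        for (int i = 0; i < n; i++) count[L.charAt(i)]++;
        int[] C = new int[256];                  // C[c]: number of characters in L smaller than c
        for (int c = 1; c < 256; c++) C[c] = C[c - 1] + count[c - 1];
        int[] occ = new int[256];
        int[] toF = new int[n];                  // toF[i]: row of F holding the same occurrence as L[i]
        int I = -1;
        for (int i = 0; i < n; i++) {
            char c = L.charAt(i);
            toF[i] = C[c] + occ[c]++;
            if (c == '$') I = i;                 // row of the original string: its last character is $
        }
        char[] t = new char[n];                  // walk back to front, starting in row I
        int row = I;
        for (int k = n - 1; k >= 0; k--) {
            t[k] = L.charAt(row);
            row = toF[row];
        }
        return new String(t);
    }
}

For L = ETTTET$SSSE this returns STETSTESTE$, as in Example 9.2 below.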


Example 9.2 Let the text be t = s$ = STETSTESTE$.

Transformation: Construct the matrix M by building the cyclic rotations of t and sorting them (shown in Table 9.1). The Burrows-Wheeler transform bwt(t) can be found in the last column L = bwt(t) = ETTTET$SSSE.

F           L
$ STETSTEST E
E $STETSTES T
E STE$STETS T
E TSTESTE$S T
S TE$STETST E
S TESTE$STE T
S TETSTESTE $
T E$STETSTE S
T ESTE$STET S
T ETSTESTE$ S
T STESTE$ST E

Table 9.1: This is the matrix M containing the cyclic rotations.

Retransformation: Reconstruct column F by lexicographically sorting the last column L = bwt(t) = ETTTET$SSSE, giving F = $EEESSSTTTT (this is sketched in Figure 9.1). Starting with the sentinel in L, the sentinel in F is searched. Because F is sorted, the sentinel is of course found at position 0. Thus the second reconstructed character is the E at L[0]. This is the first E in L, so the first E in F is searched, etc. In the end, the original input t = STETSTESTE$ is obtained again.

Figure 9.1: Illustration of the reconstruction path through the matrix.




9.3 Exact String Matching

A remarkable property of the BWT is the existence of an exact string matching algorithm that works directly on the transformed text. Similar to the reconstruction step of the BWT, this algorithm also works in a back-to-front manner by first searching for the last character of the pattern in F. All precursors occurring in L are then compared to the next character in the pattern, matching ones are marked in F, and so on.

Example 9.3 Let the text again be t = s$ = STETSTESTE$ and the pattern be p = TEST. First, the last character of p, namely T, is searched in F and all occurrences are highlighted in black. Then the next character of p, which is S, is searched in L. All occurrences of S that are precursors of the already highlighted Ts are also marked. The newly marked Ss are searched in F and the precursors, which correspond to the next character of the pattern (E), are searched. Here there is only one matching precursor. Afterwards the last searched character, the T, is found as a precursor of the E. In Table 9.2 this example is sketched.


Table 9.2: Searching the pattern TEST in STETSTESTE$. All observed characters are highlighted in black, the found pattern in addition is highlighted with a shaded background.

The pattern is found exactly once. To obtain the index where the pattern starts in the text, the occurrence of the last found character (the first character of p) is searched in F and its index is determined. Here, this is the second occurring T (marked with . in Table 9.2), which is at index 8 in F. Looking up the start position of the corresponding rotation gives the index 5, and this is exactly the position where the pattern starts in s. 
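The marking procedure of Example 9.3 corresponds to the standard backward-search formulation on bwt(t): an interval of matching rows is narrowed once per pattern character, from the last character to the first. The C array and the (here naively computed) occurrence counts belong to this formulation, not to the informal description above; all names are illustrative.

class BwtBackwardSearch {
    // Counts the occurrences of pattern p in the text, working only on L = bwt(t).
    static int countOccurrences(String L, String p) {
        int n = L.length();
        int[] count = new int[256];
        for (int i = 0; i < n; i++) count[L.charAt(i)]++;
        int[] C = new int[256];                       // C[c]: number of characters in L smaller than c
        for (int c = 1; c < 256; c++) C[c] = C[c - 1] + count[c - 1];
        int lo = 0, hi = n;                           // current interval [lo, hi) of matching rows
        for (int k = p.length() - 1; k >= 0 && lo < hi; k--) {
            char c = p.charAt(k);
            lo = C[c] + occ(L, c, lo);                // rows whose rotation starts with p[k..]
            hi = C[c] + occ(L, c, hi);
        }
        return hi - lo;
    }

    static int occ(String L, char c, int i) {         // occurrences of c in L[0..i-1], naive count
        int m = 0;
        for (int j = 0; j < i; j++) if (L.charAt(j) == c) m++;
        return m;
    }
}

For L = ETTTET$SSSE and p = TEST, the final interval contains exactly one row (row 8 of F), in agreement with the example.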

9.4 Other Applications

Besides exact string matching, other common string matching problems can also be solved with the BWT. For example, alignment tools have been developed that work on the BWT, such as BWA (Li and Durbin, 2009, 2010) or Bowtie (Langmead et al., 2009). BWA, for example, applies a global alignment algorithm to short queries and contains a Smith-Waterman-like heuristic for longer queries, allowing higher error rates. For more information see http://bio-bwa.sourceforge.net/bwa.shtml.


The BWT of a text is further suitable for effective compression, e.g. with move-to-front or run-length encoding. The occurrence of grouped characters enables a significantly better compression than compressing the original text.

9.4.1 Compression with Run-Length Encoding

The run-length encoding (RLE) algorithm constructs a sequence of numbers and characters from the text. The text is searched for runs of equal characters, and these are replaced by the number of occurrences of this character in the run followed by the character itself; e.g. instead of EEEEE just 5E is saved. A threshold value determines how long a run must at least be to be compressed in this way. The default for this threshold is 3, so single and double characters are not changed, but instead of a run of three times the character x, the algorithm will save 3x. Texts with many and long runs will thus obviously be compressed effectively.
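A minimal run-length encoder following this description (the class name, method name and threshold parameter are illustrative):

class RunLengthEncoding {
    // Replaces every run of at least `threshold` equal characters by its length followed by the
    // character; shorter runs are copied unchanged.
    static String encode(String s, int threshold) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;   // run s[i..j-1]
            int len = j - i;
            if (len >= threshold) out.append(len).append(s.charAt(i));
            else for (int k = 0; k < len; k++) out.append(s.charAt(i));
            i = j;
        }
        return out.toString();
    }
}

Calling encode("ETTTET$SSSE", 3) yields E3TET$3SE, as in Example 9.4 below.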

Example 9.4 Considering the example t = s$ = STETSTESTE$, the compression with RLE will have no advantage. If instead the Burrows-Wheeler transform of t bwt(t) = ETTTET$SSSE is compressed, the space requirement is reduced by two characters, since the compressed string is rle(bwt(t)) = E3TET$3SE. As soon as the input string gets longer and thus more grouped characters occur, the space requirement reduction is even more significant. 


CHAPTER 10

Multiple Sequence Alignment

Contents of this chapter: Multiple sequence comparison, multiple alignment scoring functions, (weighted) sum-of-pairs score, (weighted) sum-of-pairs cost, multiple sequence alignment problem, optimal alignment (score/cost), parsimony principle, generalized tree alignment.

10.1 Basic Definitions

Multiple (sequence) alignment deals with the problem of aligning generally more than two sequences. While at first sight multiple alignment might look like a straightforward generalization of pairwise alignment, we will see that there are clear reasons why multiple alignments should be considered separately, reasons both from the application side and from the computational side. In fact, the first applications of multiple alignments date back to the early days of phylogenetic studies on protein sequences in the 1960s. Back then, multiple alignments were possibly of higher interest than pairwise alignments. However, due to the computational hardness the progress on multiple alignment methods was slower than in the pairwise case, and the main results were not obtained before the 1980s and 1990s.

A multiple alignment is the simultaneous alignment of at least two sequences, usually dis- played as a matrix with multiple rows, each row corresponding to one sequence.

The formal definition of a multiple sequence alignment is a straightforward generalization of that of a pairwise sequence alignment (see Section 4.1). We shall usually assume that there are k ≥ 2 sequences s1, . . . , sk to be aligned.


Definition 10.1 Let Σ be the character alphabet, and s1, . . . , sk ∈ Σ*.

Let A(k) := (Σ ∪ {–})^k \ {(–)^k} be the multiple alignment alphabet for k sequences. In other words, the elements of this alphabet are columns of k symbols, at least one of which must be a character from Σ.

The projection π_{i} to the i-th sequence is defined as the homomorphism A(k)* → Σ* with

π_{i}((a1, . . . , ak)^T) := ai if ai ≠ –, and ε if ai = –.

In other words, the projection reads the i-th row of a multiple alignment, omitting all gap characters.

A global multiple alignment of s1, . . . , sk is a sequence A ∈ A(k)* with π_{i}(A) = si for all i = 1, . . . , k.

We generalize the definition of the projection homomorphism to a set I of sequences (in particular to two sequences).

Definition 10.2 The projection of a multiple alignment A to an index set I = {i1, i2, . . . , iq} ⊆ {1, . . . , k} is defined as the unique homomorphism that maps an alignment column a = (a1, . . . , ak)^T as follows:

π_I(a) := ε if a_{il} = – for all l = 1, . . . , q, and π_I(a) := (a_{i1}, . . . , a_{iq})^T otherwise.

In other words, the projection selects the rows with indices given by the index set I from the alignment and omits all columns that consist only of gaps. The definition is illustrated in Figure 10.1 and in the following example.

Figure 10.1: Projection of a multiple alignment of three sequences s1, s2, s3 on the three planes (s1, s2), (s2, s3) and (s3, s1).


Example 10.3 Let  -ACC--ATG   -A-CGAAT-  A =   .  TACC--AGG  -A-CCAATG Then  ACC--ATG  π (A) = , {1,2} A-CGAAT-  -ACCATG  π (A) = , {1,3} TACCAGG  TACC--AGG  π (A) = . {3,4} -A-CCAATG



10.2 Why multiple sequence comparison?

In the following we list a few justifications why multiple sequence alignment is more than just the multiple application of pairwise alignment.

Figure 10.2: A small pattern present in many sequences may become better visible in a multiple sequence alignment.

• In a multiple sequence alignment, several sequences are compared at the same time. Properties that do not clearly become visible from the comparison of only two sequences will be highlighted if they appear in many sequences. For example, in the alignment shown in Figure 10.2, if only s1 and s2 are compared, the two white areas might look most similar, inhibiting the alignment of the two gray regions. Only by studying


all six sequences at the same time it becomes clear that the gray region is a better characteristic feature of this sequence family than the white one. • Sometimes alignment ambiguities can be resolved due to the additional information given, as the following example shows: s1 = VIEQLA and s2 = VINLA may be aligned in the two different ways

A1 =
VIEQLA
VIN-LA

and

A2 =
VIEQLA
VI-NLA

Only the additional sequence s3 = VINQLA can show that the first alignment A1 is probably the correct one. • Dissimilar areas of related sequences become visible, showing regions where evolution- ary changes happened.

The following two typical uses of multiple alignments can be identified:

1. Highlight similarities of the sequences in a family. Examples of applications of this kind of multiple alignments are:
• sequence assembly,
• molecular modeling, structure-function conclusions,
• database search,
• primer design.

2. Highlight dissimilarities between the sequences in a family. Here, the main application is the analysis of phylogenetic relationships, like
• reconstruction of phylogenetic trees,
• analysis of single nucleotide polymorphisms (SNPs).

In conclusion: One or two homologous sequences whisper ... a full multiple alignment shouts out loud. (Hubbard et al., 1996)

10.3 Multiple Alignment Scoring Functions

We have not discussed quantitative measures of alignment quality so far. In fact, several such measures exist, and in this section we will define a few of them. Like in pairwise alignment, the first choice one has to make is that between a dissimilarity (cost) and a similarity score function. Again, both are possible and being used, and one always has to make sure not to confuse the two. We will encounter examples of both kinds. All of the multiple alignment score functions discussed here are based on pairwise alignment score or cost functions, here denoted by S2(·) for score functions and D2(·) for distance or cost functions, to highlight the fact that they apply to two sequences. (They may be weighted or unweighted, with homogeneous, affine, or general gap costs.)

Definition 10.4 The following three scoring functions are commonly used.


• The sum-of-pairs score is just the sum of the scores of all pairwise projections:

S_SP(A) = Σ_{1 ≤ p < q ≤ k} S_2(π_{p,q}(A)).

• The weighted sum-of-pairs score adds a weight factor w_{p,q} ≥ 0 for each projection:

S_WSP(A) = Σ_{1 ≤ p < q ≤ k} w_{p,q} · S_2(π_{p,q}(A)).

• To define the tree score of a multiple alignment, we assume that there additionally exists a given tree T (representing evolutionary relationships) with K nodes, whose k leaves represent the given sequences and whose K − k internal nodes represent “deduced” sequences. The alignment A hence contains not only the given sequences s1, s2, . . . , sk, but also the (initially unknown) internal sequences sk+1, . . . , sK that must be found or guessed. The tree alignment score of such an extended alignment A is the sum of the scores of the projection of each pair of sequences that are connected by an edge in the tree:

S_Tree^(T)(A) = Σ_{(p,q) ∈ E(T)} S_2(π_{p,q}(A)),

where E(T) is the set of edges of T. From a biological point of view, tree scores make more sense than sum-of-pairs scores because they better reflect the fact that the evolution of species, and also the relationship of genes inside the genome of a single species, happens mostly tree-like. That is why tree alignment is often considered to produce better alignments than sum-of-pairs alignment.

We similarly define the sum-of-pairs cost D_SP(A), the weighted sum-of-pairs cost D_WSP(A) and the tree cost D_Tree^(T)(A) of a multiple alignment A. Instead of cost, we also say distance.

We give an example for the distance setting.

Example 10.5 Let s1 = CGCTT, s2 = ACGGT, s3 = GCTGT and

A =
CGCT-T
-ACGGT
-GCTGT

Assuming that D_2(·) is the unit cost function, we get D_SP(A) = 4 + 2 + 2 = 8, D_WSP(A) = 4w_{1,2} + 2w_{1,3} + 2w_{2,3}, and for an (unrooted) star tree T with the three sequences at its leaves and the sequence s = CGCTGT at the center node, D_Tree^(T)(A) = 1 + 3 + 1 = 5. 
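Evaluating the sum-of-pairs cost of a given alignment amounts to summing the costs of all pairwise projections; a Java sketch for the unit-cost case (row strings with '-' as the gap character, names illustrative):

class SumOfPairsCost {
    // Unit-cost sum-of-pairs cost: every column of every pairwise projection contributes 1
    // unless both characters are equal; columns consisting of two gaps contribute nothing.
    static int dSP(String[] A) {
        int cost = 0, k = A.length, cols = A[0].length();
        for (int p = 0; p < k; p++)
            for (int q = p + 1; q < k; q++)
                for (int c = 0; c < cols; c++) {
                    char a = A[p].charAt(c), b = A[q].charAt(c);
                    if (a == '-' && b == '-') continue;   // column removed by the projection
                    if (a != b) cost++;                   // mismatch or indel costs 1
                }
        return cost;
    }
}

On the alignment of Example 10.5 it returns 4 + 2 + 2 = 8.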

Observation 10.6 We point out a design flaw in the sum-of-pairs score. Assume that we have many (k ≈ 1000) sequences, and all of them share a family-specific motif with a perfectly conserved alignment column (e.g. all As). This is a very strong indication for functional relevance of this particular site. If we obtain the 1001st sequence, we therefore should align it in the same way, especially we should also see an A in that alignment column. Everything else should be severely punished.


However, sum-of-pairs scoring does not really care: Assume an A-A match scores +1 and any mismatch scores −1. Having 1001 matching As results in a score of C(1001, 2) = 1001·1000/2 = 500500, and 1000 matching As and one mismatching character produce a score of C(1000, 2) − 1000 = 499500 − 1000 = 498500; hardly a big difference. More precisely, the ratio for one non-matching character out of k + 1 versus a perfectly conserved column is (C(k, 2) − k) / C(k+1, 2) = (k − 3)/(k + 1) = 1 − 4/(k + 1) → 1, as k → ∞. This is an undesired feature: The more evidence of the column's conservation we collect, the harder we should punish a mismatch; however, the sum-of-pairs score ratio approaches 1 in the limit and therefore makes no distinction between a perfectly conserved column and columns with a constant number of mismatches. This behavior is due to the quadratic number of terms in the sum-of-pairs score.

10.4 Multiple Alignment Problems

The multiple sequence alignment problem is formulated as a direct generalization of the pairwise case.

Problem 10.7 (Multiple Sequence Alignment Problem) Given k sequences s1, s2, . . . , sk and an alignment score (cost) function S (D), find an alignment A^opt of s1, s2, . . . , sk such that S(A^opt) is maximal (D(A^opt) is minimal) among all possible alignments of s1, s2, . . . , sk. Such an alignment A^opt is called an optimal alignment, and s(s1, s2, . . . , sk) := S(A^opt) is the optimal alignment score (d(s1, s2, . . . , sk) := D(A^opt) is the optimal alignment cost) of s1, s2, . . . , sk.

Remark. In the case of tree alignment, every alignment involves additional sequences at the inner nodes, as discussed above. To find these sequences is part of the optimization problem! Therefore, the tree alignment problem is more precisely stated in the following. In the first version of tree alignment that we consider, we assume that the tree structure, by which the given sequences are related, is known. A typical scenario is that the sequences are orthologous proteins from different species whose phylogenetic relationship is well known, as in Figure 10.3.

Figure 10.3: A phylogenetic tree relating the species human, monkey, rat, mouse and butterfly.


Problem 10.8 (Tree Alignment Problem) Given a tree T = (V, E) that has k sequences s1, . . . , sk attached to its leaves and a pairwise alignment score (cost) function S_2 (D_2), find sequences sk+1, . . . , sK to be attached to the internal vertices of T and an alignment A of the sequences s1, . . . , sK such that S_Tree^(T)(A) is maximal (D_Tree^(T)(A) is minimal).

The rationale behind tree alignment is that an alignment that minimizes the number of mutations along the branches of this tree probably best reflects the true evolutionary history and hence is most likely to be the correct one. This rationale is called the parsimony principle, and the attached sequences at internal vertices are called a most parsimonious assignment. (Obviously there is no guarantee that the most parsimonious assignment is the biologically correct one, and it can even be shown that in some special cases the most parsimonious assignment is likely to be wrong. In the field of phylogenetic tree studies that we are entering here, a long debate has been carried out about such topics, but that is another story . . . )

Example 10.9 Consider sequences s1 = LCD, s2 = LVCR, s3 = VC, s4 = LC and the following (unrooted) tree with four leaves and two internal nodes:

(The leaves s1 and s2 are attached to the internal node s5, the leaves s3 and s4 to the internal node s6, and s5 and s6 are connected by an edge.)

If the sequences at the internal vertices are chosen as s5 = LVCD and s6 = LVC and as pairwise cost function D_2 the unit edit cost function is used, the following alignment

A =
L-CD
LVCR
-VC-
L-C-
LVCD
LVC-

has cost

D_Tree^(T)(A) = D_2(A1, A5) + D_2(A2, A5) + D_2(A3, A6) + D_2(A4, A6) + D_2(A5, A6) = 1 + 1 + 1 + 1 + 1 = 5.




Hardness. Both (weighted) sum-of-pairs alignment and tree alignment are NP-hard optimization problems (see Section 10.5) with respect to the number of sequences k (Wang and Jiang, 1994). While we will not prove this, we give an intuitive argument why the problem is difficult. Let us have a look at the following example. It shows that sum-of-pairs-optimal multiple alignments cannot be constructed "greedily" by combination of several optimal pairwise alignments:

Example 10.10 Let s1 = CGCG, s2 = ACGC and s3 = GCGA. In a unit cost scenario, the (only) optimal alignment of s1 and s2 is

A^(1,2) =
-CGCG
ACGC-

with cost D(A^(1,2)) = 2, and the (only) optimal alignment of s1 and s3 is

A^(1,3) =
CGCG-
-GCGA

with cost D(A^(1,3)) = 2.

Combining the two alignments into one multiple alignment, using the common sequence s1 as seed, yields the multiple alignment

A^((1,2),(1,3)) =
-CGCG-
ACGC--
--GCGA

with cost D(A^((1,2),(1,3))) = 2 + 2 + 4 = 8. However, this is not the sum-of-pairs optimal alignment, which is

A^opt =
-CGCG
ACGC-
GCGA-

with cost D(A^opt) = 2 + 3 + 2 = 7.

Generalized Tree Alignment. Although tree alignment is already hard, there exists an even harder problem: the generalized tree alignment problem. Recall that the tree T is already given in the tree alignment problem. In practice, the tree is often unknown and an additional optimization parameter. Thus the problem (in its distance version) is as follows.

Problem 10.11 (Generalized Tree Alignment Problem) Given sequences s1, . . . , sk, find 1. a tree T ,

2. sequences sk+1, . . . , sK to be assigned to the internal vertices of T , and

3. a multiple alignment A of the sequences s1, . . . , sK,

such that the tree alignment cost D_Tree^(T)(A) is minimal among all such settings.


It can be solved (theoretically) by solving the tree alignment problem (see Section 12.1) for all tree topologies (of which there are super-exponentially many in k) and picking the best one. An alternative approach is to solve the Steiner tree problem in sequence space, see Section 12.2.

10.5 Digression: NP-completeness

It was stated before that the (weighted) sum-of-pairs alignment and the tree alignment are NP-hard optimization problems with respect to the number of sequences k without further explaining what that means. This section shall now give a brief introduction to the field of NP-completeness about which every good bioinformatician should know. The contents in this section are based on the book of Cormen et al.(2001).

Speaking in an abstract way, a problem Q is a binary relation on a set I of problem instances and a set S of problem solutions (see Figure 10.4). A problem instance can point to no, one or several problem solutions, and a problem solution can be the solution of no, one or several problem instances.

Figure 10.4: Binary relations between problem instances I and problem solutions S.

Example 10.12 A simple example of a problem is integer addition. Then the problem in- stances could be “5 + 3”, “4 + 4” and “1 + 4” pointing to the problem solutions “8”, “8” and “5”, respectively. 

An optimization problem (OP) or search problem addresses the minimization or maximization of a cost function. For example, finding the highest-scoring alignment of two or more sequences is an optimization problem.

A decision problem (DP) is a “yes/no question” with just two problem solutions, S = {0, 1}. The question whether for two given sequences there exists an alignment with a score higher than a constant k is a decision problem.

The discussion about NP-completeness mainly considers decision problems, but the results also influence optimization problems: By searching for a bound, every optimization problem can be stated as a series of decision problems. If a decision problem is difficult, the corresponding optimization problem is usually difficult as well.


Complexity classes. The set of all problems for which (decision) algorithms exist that have the same asymptotic behavior (e.g. running time) is contained in one complexity class. The class P contains all problems for which an algorithm exists that can decide in polynomial time whether a certain problem instance points to a valid problem solution. The class NP contains all problems for which an algorithm exists that can verify in polynomial time whether a given certificate (configuration of input variables) solves an actual problem instance, i.e. it can decide in polynomial time whether the certificate is a correct solution to the problem or not. It follows directly that P ⊆ NP. The question whether P = NP is still open.

Reducibility. A problem Q is reducible to a problem Q′ if every instance of Q can be formulated as an instance of Q′ so that the solution of problem Q follows from the solution of problem Q′. If Q can be reduced to Q′ in a simple way, it is not harder to solve Q than to solve Q′. Q is polynomial-time reducible to Q′ (shortly: Q ≤_p Q′) if the reduction takes only polynomial time. Q1 ≤_p Q2 implies: Q2 ∈ P ⇒ Q1 ∈ P.

NP-hardness, NP-completeness. A problem Q is NP-hard if Q′ ≤_p Q for every Q′ ∈ NP. A problem Q is in the class NP-complete (NPC) if it is NP-hard and Q ∈ NP. The following property holds for the class NPC: If some NP-complete problem is solvable in polynomial time, then P = NP. If some problem of NP is not solvable in polynomial time, then no single NP-complete problem is solvable in polynomial time.

For solving the question whether P = NP, NP-complete problems are intensely studied. The most common assumption is that P ≠ NP. An alternative possibility is that P = NP (see Figure 10.5 for schematic views of these two alternatives).


Figure 10.5: Left: P 6= NP . Right: P = NP .

The satisfiability problem SAT. The first known example of an NP-complete problem was the satisfiability problem (SAT). It is stated as follows: Given a boolean formula of length n, is it satisfiable, i.e. is there a configuration of variables so that the evaluation of the formula yields 1?


Example 10.13 Given the formula (a ∧ b) ∨ ¬a, is there a configuration of variables so that the formula is true? Yes, there are even several solutions. One possible solution is a = 1 and b = 1. 

To show that SAT is NP-complete, we prove two properties:

1. We show that SAT belongs to the class NP. This is the case, because given a certificate, it is easy to decide in polynomial time whether the evaluation of the formula with these variables indeed yields 1 or not.

2. We show that any problem Q′ belonging to the class NP can be reduced to SAT (Q′ ≤_p SAT) in polynomial time. This can be seen as follows:

• Q′ ∈ NP implies that a verification algorithm A(x, y) exists for Q′ that tells in polynomial time for any problem instance x ∈ I and certificate y ∈ S whether y is a correct solution of x (A(x, y) = 1) or not (A(x, y) = 0).

• Let f(A, x) be a reduction function that partially evaluates the verification algorithm A according to its first argument, yielding the function C(y) = f(A(x, y), x). Then we have:

– Encoded as a boolean function, C is satisfiable (C(y) = 1) if and only if y is a correct solution to x, otherwise C(y) = 0.

– It can be shown that for any Q′ ∈ NP the corresponding boolean function C(y) can be constructed in polynomial time; therefore, if SAT can be solved in polynomial time, also Q′ can be solved in polynomial time, i.e. Q′ ≤_p SAT.

Therefore SAT is an NP-complete problem.

Extension of the class NPC. For showing that some problem Q is in NPC, it is sufficient to show that Q ∈ NP and to reduce a known NP-complete problem Q′ (e.g. Q′ = SAT) to Q:

1. Show that Q ∈ NP .

2. Choose a known NP-complete problem Q′.

3. Describe an algorithm that calculates a function f that projects every instance of Q′ to an instance of Q.

4. Show that for f holds: x ∈ Q′ if and only if f(x) ∈ Q for all x ∈ {0, 1}*.

5. Show that f can be calculated in polynomial time.


Consequences of NP-completeness. While the proof of NP-completeness of a problem implies that it is quite unlikely to find a deterministic polynomial-time algorithm to solve the problem to optimality in general, there are different ways to respond to such a result if one wants to solve the problem in feasible time:

1. Small problem instances might be solvable anyway.
2. Using adequate techniques to efficiently prune the search space, even larger instances may be solvable in reasonable time (running time heuristics).
3. The problem may not be hard for certain subclasses, so polynomial-time algorithms may exist for them, e.g. FPT (Fixed Parameter Tractability).
4. It may be sufficient to approximate the optimal solution to a certain degree. If closeness to the optimal solution can be guaranteed, one speaks of an approximation algorithm.
5. By no longer confining oneself to any optimality guarantee, one may get very fast heuristic algorithms that produce good but not always optimal or guaranteed close-to-optimal solutions (correctness heuristics).

10.6 A Guide to Multiple Sequence Alignment Algorithms

Sum-of-pairs alignment algorithms are discussed in Chapter 11. We first explain a multi-dimensional generalization of the pairwise dynamic programming (Needleman-Wunsch) algorithm. We next show how to reduce the search space considerably (according to an idea of Carrillo and Lipman). A simple 2-approximation is given by the center-star method. Finally, we discuss a divide-and-conquer heuristic. Tree alignment and generalized tree alignment are discussed in Chapter 12. We discuss Sankoff's tree alignment algorithm and two heuristics for generalized tree alignment (Greedy Three-Way alignment and the Deferred Path Heuristic). As mentioned above, the multiple alignment problem is NP-hard and all known algorithms have running times exponential in the number of sequences; thus aligning more than 7 sequences of reasonable length is problematic. Since in practice one needs to produce multiple alignments of (sometimes) hundreds of sequences, heuristics are required: Progressive alignment methods, discussed in Chapter H, pick two similar sequences (or existing alignments) and align them to each other, and then proceed with the next pair, until only one alignment remains. These methods are by far the most widely used ones in practice. Another class of heuristics, called segment-based methods, first identify (unique) conserved segments and then chain them together to obtain multiple alignments.

CHAPTER 11

Algorithms for Sum-of-Pairs Multiple Alignment

Contents of this chapter: Basic algorithm, free end gaps, affine gap costs, Carrillo-Lipman, bounding pairwise alignment costs, pruning the alignment graph, Center-Star approximation, Divide-and-Conquer alignment, multiple additional cost, DCA algorithm.

Sum-of-pairs alignment has been extensively studied. However, be aware that from a biological point of view and because of Observation 10.6, tree alignment should be the preferred choice. While we only discuss the unweighted sum-of-pairs alignment problem explicitly, everything in this chapter also applies to weighted sum-of-pairs.

11.1 An Exact Algorithm

The Needleman-Wunsch algorithm for pairwise global alignment can be generalized for the multiple alignment of k sequences s1, s2, . . . , sk of lengths n1, n2, . . . , nk, respectively, in the obvious way (due to Waterman et al. (1976)). Here we give the formulation for distance minimization. Naturally, an equivalent algorithm exists for score maximization.

11.1.1 The Basic Algorithm

Instead of a two-dimensional edit graph, a k-dimensional weighted grid graph is constructed, one dimension for each sequence. Each edge e corresponds to a possible alignment column c and is weighted by the cost of that column, w(e) = D(c), and the optimal alignment problem translates to the problem of finding a shortest path from the source vertex v◦ to the sink vertex v•. The graph is illustrated in Figure 11.1.


Figure 11.1: Top: Part of the 2-dimensional alignment graph for two sequences. Bottom: Part of the 3-dimensional alignment graph for three sequences. Not shown: Parts of the k-dimensional alignment graphs for k ≥ 4.

Intuitively, for each vertex v in the edit graph we compute

D(v) = min{D(v′) + w(v′ → v) | v′ is a predecessor of v}. (11.1)

More formally, this can be written as follows:

D(0, 0, . . . , 0) = 0,

D(i1, i2, . . . , ik) = min over all (∆1, . . . , ∆k) ∈ {0, 1}^k with ∆1 + · · · + ∆k ≠ 0 of

    D(i1 − ∆1, i2 − ∆2, . . . , ik − ∆k)  [predecessor v′]  +  D_SP((∆1 s1[i1], ∆2 s2[i2], . . . , ∆k sk[ik])^T)  [alignment column],

where for a character c ∈ Σ the notation ∆c denotes ∆c = c if ∆ = 1 and ∆c = '−' if ∆ = 0.
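As an illustration of this recurrence, the following Java sketch implements the special case k = 3 with unit pairwise costs; it fills the complete (n1+1) × (n2+1) × (n3+1) table and returns only the optimal cost (no traceback). The class and method names as well as the unit-cost choice are assumptions made for the example.

class ThreeWayAlignment {
    // Sum-of-pairs optimal alignment cost of three sequences under unit costs.
    static int dSP3(String s1, String s2, String s3) {
        int n1 = s1.length(), n2 = s2.length(), n3 = s3.length();
        int[][][] D = new int[n1 + 1][n2 + 1][n3 + 1];
        for (int i = 0; i <= n1; i++)
            for (int j = 0; j <= n2; j++)
                for (int l = 0; l <= n3; l++) {
                    if (i == 0 && j == 0 && l == 0) { D[0][0][0] = 0; continue; }
                    int best = Integer.MAX_VALUE;
                    for (int d1 = 0; d1 <= 1; d1++)                      // enumerate the 2^3 - 1 predecessors
                        for (int d2 = 0; d2 <= 1; d2++)
                            for (int d3 = 0; d3 <= 1; d3++) {
                                if (d1 + d2 + d3 == 0) continue;
                                if (i - d1 < 0 || j - d2 < 0 || l - d3 < 0) continue;
                                char a = d1 == 1 ? s1.charAt(i - 1) : '-';
                                char b = d2 == 1 ? s2.charAt(j - 1) : '-';
                                char c = d3 == 1 ? s3.charAt(l - 1) : '-';
                                int col = pair(a, b) + pair(a, c) + pair(b, c);   // D_SP of the column
                                best = Math.min(best, D[i - d1][j - d2][l - d3] + col);
                            }
                    D[i][j][l] = best;
                }
        return D[n1][n2][n3];
    }

    static int pair(char a, char b) {                                    // unit cost of one projected column
        if (a == '-' && b == '-') return 0;
        return a == b ? 0 : 1;
    }
}

For the sequences of Example 10.10 (CGCG, ACGC, GCGA), dSP3 returns 7.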

Space and time complexity. The space complexity of this algorithm is given by the size of the k-dimensional edit graph. Obviously, this is O(n1 n2 · · · nk), which is in O(n^k) if n is the maximum sequence length. The time complexity is even higher, since at each internal vertex a minimization is taking place over the 2^k − 1 incoming edges. Hence, the total time complexity (for homogeneous gap costs) is O(2^k · n^k). For affine gap costs, the time complexity rises even more (see below). While the time may be tolerable in some cases, the exponentially growing space complexity does not permit to run the algorithm on more than, say, five or six typical protein sequences. Even worse, while in the pairwise case the space usage can be reduced from quadratic to linear by only a little time overhead, the possible space reduction in the case of multiple alignment is rather small, as we now discuss.


11.1.2 Variations of the Basic Algorithm

Memory reduction. From the pairwise alignment, we remember that it is possible to reduce the space for optimal alignment computation from quadratic to linear using a divide-and- conquer approach (Section 5.4).

Figure 11.2: Divide-and-conquer space reduction technique for multiple alignment.

In principle, the same idea can also be followed for multiple alignment, but the effect is much less dramatic. The space reduction is by one dimension, from O(n^k) to O(n^(k−1)). That the space savings here are less impressive than for pairwise alignment becomes clear when one considers that for k = 12 sequences, the space usage is still O(n^11) using this technique. Figure 11.2 sketches on the left side the first division step for the pairwise case, and on the right side the corresponding step in the multiple (here 3-dimensional) case.

Free end gaps. It is sometimes desired to penalize gaps at the beginning and the end of an alignment less than internal gaps (or not at all, also called free end-gap alignment), in order to avoid that shorter sequences are “spread” over longer ones, just to get a slightly higher score that is not counter-penalized by gap costs, because these have to be imposed anyway to account for the difference in sequence length. This can be done in multiple alignment as in the pairwise case. It is just a different initial- ization of the base cases, i.e. the edges of the k-dimensional edit graph.

Affine gap costs. Affine gap costs can be defined in a straightforward manner as a simple generalization of the pairwise case, called natural gap costs: The cost of a multiple alignment is the sum of all costs of the pairwise projections, where the cost of a pairwise projection is computed with affine gap costs. In principle it is possible to perform the computation of such alignments similarly as it is done for pairwise alignments. However, the number of "history matrices" (V and H in the pairwise case) grows exponentially with the number of sequences, so that very much space is needed. That is why Altschul (1989) suggested to use an approximation of the exact case, so-called quasi-natural gap costs. The idea is to look back by only one position in the edit graph and decide from the previous column in the alignment whether a new gap is started in the column to be computed or not. The choices, compared to the "correct" choices in natural gap costs, are given in the following table, whose header depicts the different possible scenarios of pairwise alignment projections (previous and current column of the two projected rows), where an asterisk (*) refers to a character, a dash (-) refers to a blank, and a question mark (?) refers to either a character or a blank. The entries correspond to the cost (d is the gap open cost, e is the gap extension cost), where "no" means no new gap is opened (substitution cost):


                     ? *    ? -    * *    * *    - *    - *
                     ? *    ? -    * -    - -    * -    - -
    natural          no      0      d      e      d     d/e
    quasi-natural    no      0      d      e      d      d

The last case in natural gap costs cannot be decided if only one previous column is given. It would be d for

    * - - - *
    * - - - -

and e for

    * * * - - *
    * - - - - -

(after removing the columns that are blank in both rows of the projection, the blank in the last column opens a new gap in the first case and extends an existing gap in the second). With this heuristic, the overall computation becomes slower by a factor of 2^k, yielding a time complexity for the complete multiple alignment computation of O(n^k 2^{2k}). The space requirements do not change in the worst case; they remain O(n^k).

11.2 Carrillo and Lipman’s Search Space Reduction

Carrillo and Lipman (1988) suggested an effective way to prune the search space for multiple alignment. Their algorithm is still exponential, but in many cases the time of alignment computation is considerably reduced compared to the full dynamic programming algorithm from Section 11.1, making it applicable in practice for data sets of moderate size (7–10 protein sequences of length 300–400).

Let sequences s1, s2, . . . , sk be given. The algorithm of Carrillo and Lipman is a branch-and-bound heuristic for the running time, based on a simple idea and observation.

Idea. As we have seen in Example 10.10, the pairwise projections of an optimal multiple alignment are not necessarily the optimal pairwise alignments (making the problem hard). However, intuitively, they can also not be particularly bad alignments. (If all pairwise projections were bad, the whole multiple alignment would also have a bad sum-of-pairs score.) If we can quantify this idea, we can restrict the search space to those vertices that are part of close-to-optimal pairwise alignments for each pairwise projection.

Bounding pairwise alignment costs. Here we derive the Carrillo-Lipman bound in the distance or cost setting. The same can be done (with reversed inequalities) in the score setting; this is left as an exercise. We assume that we already know some (suboptimal) alignment A of the given sequences with cost DSP(A) =: δ ≥ DSP(A^opt), for example computed by the center-star heuristic (Section 11.3) or a progressive alignment method (Section H). Remember the definition of the sum-of-pairs multiple alignment cost:

    DSP(A) = Σ_{p<q} D2(π{p,q}(A)).

We now pick a particular index pair (x, y), x < y, and remove it from the sum. For the remaining pairs, we use the fact that the pairwise projections of the optimal multiple alignment cannot have a lower cost than the corresponding optimal pairwise alignments: D2(π{p,q}(A^opt)) ≥ d(sp, sq). Thus

    δ ≥ DSP(A^opt)
      = D2(π{x,y}(A^opt)) + Σ_{p<q, {p,q}≠{x,y}} D2(π{p,q}(A^opt))
      ≥ D2(π{x,y}(A^opt)) + Σ_{p<q, {p,q}≠{x,y}} d(sp, sq),

and therefore

    D2(π{x,y}(A^opt)) ≤ δ − Σ_{p<q, {p,q}≠{x,y}} d(sp, sq) =: U(x,y).

Note that in order to compute the upper bound U(x,y), only pairwise optimal alignments and a heuristic multiple alignment need to be calculated. This can be done efficiently.
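As a small illustration, the following Python sketch computes the bounds U(x,y) from a heuristic alignment cost δ and the optimal pairwise distances. The function name and the dictionary-based representation are our own choices, not part of any existing implementation.

def carrillo_lipman_bounds(pairwise_dist, delta):
    """pairwise_dist: dict {(p, q): d(s_p, s_q)} for p < q (optimal pairwise costs);
    delta: cost of some heuristic multiple alignment (upper bound on the optimum).
    Returns U[(x, y)] = delta - sum of d(s_p, s_q) over all pairs except (x, y)."""
    total = sum(pairwise_dist.values())
    return {pair: delta - (total - d) for pair, d in pairwise_dist.items()}

# tiny usage example with three sequences (indices 0, 1, 2)
d = {(0, 1): 4, (0, 2): 6, (1, 2): 5}
print(carrillo_lipman_bounds(d, delta=16))   # {(0, 1): 5, (0, 2): 7, (1, 2): 6}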

Pruning the Alignment graph. We now keep only those vertices (i1, . . . , ik) of the alignment graph that satisfy the following condition for all pairs (x, y), 1 ≤ x < y ≤ k:

The best pairwise alignment of sx and sy through (ix, iy) has cost ≤ U(x,y).

How do we know the cost of the best pairwise alignment through any given point (ix, iy)? This question was answered in Section 5.3: we use the forward-backward technique. In other words, for every pair (x, y), x < y, we compute a matrix T(x,y) such that T(x,y)(ix, iy) contains the cost of the best alignment of sx and sy that passes through the point (ix, iy). We can define the set of vertices in the multi-dimensional alignment graph that satisfy the constraint imposed by the pair (x, y) as follows:

B(x,y) := {(i1, . . . , ik): T(x,y)(ix, iy) ≤ U(x,y)}

From the above considerations it is clear that the projection of an SP-optimal multiple alignment on sequences sx and sy can not have a cost higher than U(x,y), implying that in the picture of T(x,y) in Figure 11.3, the shaded regions with values T(x,y)(i, j) > U(x,y) cannot contain such a projection.

Thus the optimal multiple alignment consists at most of vertices that are in B(x,y) for all pairs (x, y), i.e., in

    B := ∩_{x<y} B(x,y).

This means: Instead of computing the distances in all O(nk) vertices in the alignment graph, we only need to compute the distances of |B| vertices. The size of B depends on many things, e.g. on


Figure 11.3: The Carrillo-Lipman bound states that the two-dimensional projection of an optimal multiple alignment cannot pass through the shaded region.

• the quality of the heuristic alignment with cost δ that serves as an upper bound of the optimal alignment cost. The closer δ is to the true optimal value, the smaller is B;

• the difference between the optimal pairwise alignment costs and the costs of the pair- wise projections of the optimal multiple alignment.

Generally, if all k sequences are similar and the gap costs are small, then a good heuristic alignment (maybe even an optimal one) is easy to find, so B may contain almost no additional vertices besides the true optimal multiple alignment path. Thus, for a medium number of similar sequences, the simple Carrillo-Lipman heuristic may already perform very well (because the relevant regions become very small), while for even fewer but less closely related sequences even the best known heuristics will not finish within weeks of computation. Sequence similarity plays a much greater role than sequence length or the number of sequences.

Some implementations of the Carrillo-Lipman bound attempt to reduce the size of B further by "cheating": Remember that we need a heuristic alignment to obtain an upper bound δ on the optimal cost. If we have reason to suspect that we have found a bad alignment, and believe that a better one exists, we may replace δ by a smaller value δ − ε(x,y) (which could even be chosen differently for each pair (x, y)). Of course, since we cannot be sure that such a better alignment exists, we may reduce the size of B too much and in fact exclude the optimal multiple alignment from B. In practice, however, this kind of cheating works quite well if done carefully, and further reduces the running time.

Algorithm. We give a few remarks about implementing the above idea. We emphasize that the set B is conceptual and does not need to be constructed explicitly. Instead, it is sufficient to precompute all T(x,y) matrices, the optimal pairwise alignment scores d(sp, sq) for all 1 ≤ p < q ≤ k, and a heuristic alignment to obtain δ.


It is advantageous not to implement the “pull-type” recurrence (11.1), but instead to use a “push-type” algorithm, as explained in Section 3.6.

Conceptually, we can imagine that every vertex has a distance of ∞ at the beginning of the algorithm, but we do not store this information explicitly. Instead, we maintain a list (e.g. a queue or priority queue) of “active” vertices, that initially contains only (0, 0,..., 0).

The algorithm then consists of “visiting” a vertex from the queue until the final vertex (n1, n2, . . . , nk) is visited. When a vertex is visited, it already contains the correct alignment distance.

Visiting a vertex v consists of the following steps.

• We look at the 2k − 1 successors w of v in the alignment graph and check for each of them if it belongs to the set B. This is the case if w is already in the queue, or if the current distance value plus edge cost v → w satisfies the Carrillo-Lipman bound for every index pair (x, y). In the former case, the distance (and backpointer) information of w is updated if the new distance value coming from v is lower than the current one. In the latter case w is “activated” with the appropriate distance and backpointer information and added to the queue.

• We remove v from the queue. If we want to create an alignment in the end, we need to store v’s information somewhere else to reconstruct the traceback information. If we only want the score, v can now be completely discarded.

• We pick the next vertex x to visit. The algorithm works correctly if x is

– a vertex of minimum distance; then the algorithm is essentially Dijkstra's algorithm for single-source shortest paths in the (reduced) alignment graph, or

– any vertex that has no active ancestors, so we can be sure that we do not need to update x’s distance information after visiting x.

Since often more than one vertex is available for visiting, a good target-driven choice may speed up the algorithm.
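The following Python sketch illustrates such a push-type, Dijkstra-like traversal of the k-dimensional alignment graph. The callbacks edge_cost and in_bound are placeholders (the latter standing for the Carrillo-Lipman test), so this shows only the visiting scheme, not the MSA program itself.

import heapq
from itertools import product

def push_type_alignment(lengths, edge_cost, in_bound):
    """Dijkstra-like 'push' traversal of the k-dimensional alignment graph.
    lengths   : sequence lengths (n_1, ..., n_k)
    edge_cost : edge_cost(v, w) -> cost of the edge v -> w
    in_bound  : in_bound(w, dist_w) -> True if w passes the Carrillo-Lipman test
    Returns the distance of the final vertex, or None if it was pruned."""
    k = len(lengths)
    start, goal = (0,) * k, tuple(lengths)
    dist = {start: 0}                      # only 'active' vertices are stored
    heap = [(0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):  # outdated queue entry
            continue
        if v == goal:
            return d
        for delta in product((0, 1), repeat=k):   # the 2^k - 1 successors
            if not any(delta):
                continue
            w = tuple(vi + di for vi, di in zip(v, delta))
            if any(wi > gi for wi, gi in zip(w, goal)):
                continue
            nd = d + edge_cost(v, w)
            if nd < dist.get(w, float("inf")) and in_bound(w, nd):
                dist[w] = nd               # activate or update w
                heapq.heappush(heap, (nd, w))
    return None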

Implementations. An implementation of the Carrillo-Lipman heuristic is the program MSA (Gupta et al., 1995).

A newer implementation of many of the mentioned algorithms is the program QALIGN (Sammeth et al., 2003b), see http://gi.cebitec.uni-bielefeld.de/QAlign. This program not only runs much more stably than MSA, it also has a very nice and flexible graphical user interface.

11.3 The Center-Star Approximation

There exists a series of results on the approximability of the sum-of-pairs multiple sequence alignment problem.


Definition 11.1 For c ≥ 1, an algorithm for a cost or distance minimization problem is called a c-approximation if its output solution has cost at most c times the optimal solution. For a maximization problem, an algorithm is called a c-approximation if its output solution has score at least 1/c times the optimal solution.

A simple algorithm, the so-called center star method (Gusfield, 1991, 1993), is a 2-approximation for the sum-of-pairs multiple alignment problem if the underlying weighted edit distance satisfies the triangle inequality.

The algorithm is the following. First, for each sequence sp, 1 ≤ p ≤ k, its overall distance dp to the other sequences is computed, i.e. the sum of the pairwise optimal alignment costs:

    dp = Σ_{1≤q≤k} d(sp, sq).

Let sc be the sequence that minimizes this overall distance, called the center sequence.
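A minimal sketch of this first phase in Python is given below; the unit-cost edit distance only stands in for whatever weighted distance d(·,·) is actually used.

from itertools import combinations

def edit_distance(s, t):
    """Unit-cost edit distance (simple DP); a stand-in for the weighted distance d."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def center_sequence(seqs):
    """Return the index c of the center sequence and the overall distances d_p."""
    k = len(seqs)
    d = [[0] * k for _ in range(k)]
    for p, q in combinations(range(k), 2):
        d[p][q] = d[q][p] = edit_distance(seqs[p], seqs[q])
    overall = [sum(row) for row in d]       # d_p = sum over q of d(s_p, s_q)
    c = min(range(k), key=overall.__getitem__)
    return c, overall

print(center_sequence(["CT", "AGT", "G"]))  # center index and the distances d_p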

Secondly, a multiple alignment Ac is constructed from all pairwise optimal alignments in which the center sequence is involved, i.e., all the optimal alignments of sc and the other sp, p ≠ c, are combined into one multiple alignment Ac. The claim is that this alignment is a 2-approximation for the optimal sum-of-pairs multiple alignment cost of the sequences s1, s2, . . . , sk, i.e.,

DSP(Ac) ≤ 2 · dSP(s1, s2, . . . , sk).

This can be seen as follows. On the one hand,

    DSP(Ac) = Σ_{p<q} D2(π{p,q}(Ac))                                   (definition)
            ≤ Σ_{p<q} ( D2(π{p,c}(Ac)) + D2(π{c,q}(Ac)) )              (triangle inequality)
            = (k − 1) · Σ_{q≠c} D2(π{c,q}(Ac))
            = (k − 1) · Σ_{q≠c} d(sc, sq)                              (projections on sc are optimal)
            = (k − 1) · dc.

On the other hand,

    2 · dSP(s1, s2, . . . , sk) = Σ_p Σ_{q≠p} D2(π{p,q}(A^opt)) ≥ Σ_p Σ_{q≠p} d(sp, sq) = Σ_p dp ≥ k · dc ≥ (k − 1) · dc ≥ DSP(Ac).

Hence Ac is a 2-approximation of the optimal alignment A^opt.


Time and space complexity. The running time of the algorithm can be estimated as follows, assuming all k sequences have the same length n:

In the first phase, k(k−1)/2 pairwise alignments are computed, requiring O(k^2 n^2) time overall. In the second phase, the k − 1 alignments are combined into one multiple alignment, which can be done in time proportional to the size of the constructed alignment, i.e. O(kn). This yields an overall running time of O(k^2 n^2).

The space complexity is O(n + k) in the first phase (linear-space score computation plus storage for the k computed values dp) and O(kn) in the second phase (storage of k pairwise alignments, computed with Hirschberg’s linear space algorithm, and the intermediate/final multiple alignment). Hence, the overall space complexity is O(kn).

This idea can be generalized to a (2 − κ/k)-approximation if, instead of pairwise alignments, optimal multiple alignments of κ sequences are computed (see Bafna et al. (1994), Jiang et al. (1994) and Wang et al. (1996)).

11.4 Divide-and-Conquer Alignment

In this section we present a heuristic for sum-of-pairs multiple sequence alignment that no longer guarantees that an optimal alignment will be found.

While in the worst case the running time of the heuristic is still exponential in the number of sequences, the advantage is that on average the practical running time is much faster while the result is very often still optimal or close to optimal.

Idea. The basic idea of the heuristic is to cut all sequences at suitable cut points into left and right parts, and then to solve the two such generated alignment problems for the left parts and the right parts separately. This is done either by recursion or, if the sequences are short enough, by an exact algorithm. For a graphical sketch of the procedure, see Figure 11.4.

Finding cut positions. The main question about this procedure is how to choose good cut positions, since this choice is obviously critical for the quality of the final alignment. The ideal case is easy to formulate:

Definition 11.2 A family of cut positions (c1, c2, . . . , ck) of the sequences s1, s2, . . . , sk of lengths n1, n2, . . . , nk, respectively, is called an optimal cut if and only if there exists an alignment A of the prefixes (s1[1..c1], s2[1..c2], . . . , sk[1..ck]) and an alignment B of the suffixes (s1[c1 + 1..n1], s2[c2 + 1..n2], . . . , sk[ck + 1..nk]) such that the concatenation A ++B is an optimal alignment of s1, s2, . . . , sk.

There are many optimal cut positions, e.g. the trivial ones (0, 0,..., 0) and (|s1|, |s2|,..., |sk|). In fact:


Figure 11.4: Overview of divide-and-conquer multiple sequence alignment.

Lemma 11.3 For any cut position ĉ1 ∈ {0, . . . , n1} of the first sequence s1, there exist cut positions c2, . . . , ck of the remaining sequences s2, . . . , sk such that (ĉ1, c2, . . . , ck) is an optimal cut of the sequences s1, s2, . . . , sk.

Proof. Assume that A^opt is an optimal alignment. Choose one of the points between those characters in the first row of A^opt that correspond to the characters s1[ĉ1] and s1[ĉ1 + 1]. The vertical "cut" at this point through the optimal alignment defines the cut positions c2, . . . , ck that, together with ĉ1, are optimal by definition. See Figure 11.5 for an illustration. □


Figure 11.5: Connection between unaligned sequences s1, . . . , sk (left) and a multiple alignment A of them (right). If A is an optimal alignment A^opt, any vertical cut in the alignment corresponds to an optimal family of cut positions of the sequences.

The lemma tells us that a first cut position ĉ1 of s1 can be chosen arbitrarily (for symmetry reasons and to arrive at an efficient divide-and-conquer procedure, we will use ⌈n1/2⌉ as the cut position), and then there exist cut positions of the other sequences that produce an optimal cut.


The NP-completeness of the SP-alignment problem implies, however, that it is unlikely that there exists a polynomial time algorithm for finding optimal cuts. Hence a heuristic is used to find good, though not always optimal cut positions.

This heuristic is based on pairwise sequence comparisons whose results are stored in an additional cost matrix that contains for each pair of cut positions (cp, cq) the penalty (addi- tional cost) imposed by cutting the two sequences at these points, instead of cutting them at points compatible with an optimal pairwise alignment (see Definition 5.8). The efficient computation of additional cost matrices was discussed in Section 5.3.

The idea of using additional cost matrices to quantify the quality of a multiple-sequence cut is that the closer a potential cut point is to an optimal pairwise alignment path, the smaller will be the contribution of this pair to the overall sum-of-pairs multiple alignment cost. Since we are dealing with sum-of-pairs alignment, we define the additional cost of a family of cuts along the same rationale.

Definition 11.4 The multiple additional cost of a family of cut positions (c1, c2, . . . , ck) is defined as

    C(c1, c2, . . . , ck) := Σ_{1≤p<q≤k} C(p,q)(cp, cq),

where C(p,q) denotes the additional cost matrix of the sequence pair (sp, sq).

A family of cut positions (c1, c2, . . . , ck) is called C-optimal if it minimizes the multiple additional cost C(c1, c2, . . . , ck).

Although perhaps unintuitive, even a family of cut positions with zero multiple additional cost, i.e. one whose implied pairwise cuts are all compatible with the corresponding pairwise optimal alignments, is not necessarily optimal (see Example 11.5). The converse is also not true: an optimal cut does not always have multiple additional cost zero, nor is a cut with the smallest possible additional cost necessarily optimal.

Example 11.5 Let s1 = CT , s2 = AGT , and s3 = G. Using unit cost edit distance, the three additional cost matrices are the following:

C(1,2) (rows: cut positions 0–2 of s1 = CT; columns: cut positions 0–3 of s2 = AGT):

        0  1  2  3
    0   0  0  1  3
    1   1  0  0  2
    2   3  2  1  0

C(1,3) (rows: cut positions of s1 = CT; columns: cut positions 0–1 of s3 = G):

        0  1
    0   0  1
    1   0  0
    2   1  0

C(2,3) (rows: cut positions 0–3 of s2 = AGT; columns: cut positions of s3 = G):

        0  1
    0   0  2
    1   0  1
    2   1  0
    3   2  0

There are two C-optimal cuts: (1, 1, 0) and (1, 2, 1), both of which yield a multiple additional cost C(1, 1, 0) = C(1, 2, 1) = 0. Nevertheless, the implied multiple alignments are not necessarily optimal as can be seen for the cut (1, 1, 0):

    C  | T           ( C )       ( - T )      ( C - T )
    A  | GT    →     ( A )  ++   ( G T )   =  ( A G T )     (SP-cost 7, not optimal).
    ε  | G           ( - )       ( G - )      ( - G - )


But they might be optimal, as for the cut (1, 2, 1):

    C  | T           ( - C )      ( T )       ( - C T )
    AG | T     →     ( A G )  ++  ( T )    =  ( A G T )     (SP-cost 6, optimal).
    G  | ε           ( - G )      ( - )       ( - G - )



Nevertheless, it is a good heuristic to choose a C-optimal cut.
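The additional cost matrices of this example can be recomputed with a small forward-backward sketch in Python (unit costs, as in the example); the function is our own illustration of the technique from Section 5.3, not code from the notes.

def additional_cost_matrix(s, t):
    """C(i, j) = (best unit-cost alignment of s and t through cut (i, j)) - d(s, t)."""
    def dp(a, b):
        m, n = len(a), len(b)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0: D[i][j] = j
                elif j == 0: D[i][j] = i
                else:
                    D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                                  D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D
    F = dp(s, t)                                   # forward: distances of prefixes
    Brev = dp(s[::-1], t[::-1])                    # backward: distances of suffixes
    m, n, d = len(s), len(t), dp(s, t)[len(s)][len(t)]
    return [[F[i][j] + Brev[m - i][n - j] - d for j in range(n + 1)]
            for i in range(m + 1)]

# reproduces the three matrices of Example 11.5
for p, q in (("CT", "AGT"), ("CT", "G"), ("AGT", "G")):
    print(p, q, additional_cost_matrix(p, q))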

DCA Algorithm. The divide-and-conquer alignment algorithm first chooses an arbitrary cut position ĉ1 for the first sequence (usually the sequence is cut in half for symmetry reasons) and then searches for the remaining cut positions c2, . . . , ck such that the cut (ĉ1, c2, . . . , ck) is C-optimal. Then the sequences are cut in this way, and the procedure is repeated recursively until the maximal sequence length drops below a threshold value L, at which point the exact multiple sequence alignment algorithm MSA (with Carrillo-Lipman bounds) is applied. The pseudocode for this procedure is given in Algorithm 11.1.

Algorithm 11.1 DCA(s1, s2, . . . , sk; L)

1: Let np ← |sp| for all p, 1 ≤ p ≤ k
2: if max{n1, n2, . . . , nk} ≤ L then
3:     return MSA(s1, s2, . . . , sk)
4: else
5:     ĉ1 ← ⌈n1/2⌉
6:     compute (c2, . . . , ck) such that (ĉ1, c2, . . . , ck) is a C-optimal cut
7:     return DCA(s1[1..ĉ1], s2[1..c2], . . . , sk[1..ck]; L) ++ DCA(s1[ĉ1+1..n1], s2[c2+1..n2], . . . , sk[ck+1..nk]; L)

The search space in the computation of C-optimal cut positions is sketched in Figure 11.6. Since the cut position of the first sequence s1 is fixed to ĉ1 but the others can vary over their whole sequence length, the search space has size O(n2 · n3 ··· nk). This implies that computing a C-optimal cut (i.e. finding the minimum in this search space) by exhaustive search takes O(k^2 n^2 + n^{k−1}) time and uses O(k^2 n^2) space.
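A direct transcription of this exhaustive search into Python might look as follows; the matrix representation and the function name are illustrative assumptions.

from itertools import product, combinations

def c_optimal_cut(addcost, lengths, c1):
    """Exhaustive search for a C-optimal cut with the first cut position fixed.
    addcost : dict {(p, q): M} with M[cp][cq] = additional cost C_(p,q)(cp, cq)
    lengths : the sequence lengths (n_1, ..., n_k)
    c1      : the fixed cut position of the first sequence"""
    k = len(lengths)
    best_cut, best_cost = None, float("inf")
    for rest in product(*(range(n + 1) for n in lengths[1:])):
        cuts = (c1,) + rest
        cost = sum(addcost[(p, q)][cuts[p]][cuts[q]]
                   for p, q in combinations(range(k), 2))
        if cost < best_cost:
            best_cost, best_cut = cost, cuts
    return best_cut, best_cost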

Overall complexity. Let L be the length bound below which the exact MSA algorithm is applied. Assuming that the maximum sequence length n is of the form n = L · 2^D and all cuts are symmetric (i.e. n(d) = n/2^d for all d = 0, . . . , D − 1), the whole divide-and-conquer alignment procedure has the overall time complexity

    Σ_{d=0}^{D−1} 2^d ( k^2 n(d)^2 + n(d)^{k−1} ) + (n/L) · 2^k L^k
      = Σ_{d=0}^{D−1} 2^d ( k^2 n^2/2^{2d} + n^{k−1}/2^{d(k−1)} ) + 2^k n L^{k−1}
      = ( k^2 n^2 + n^{k−1}/2^{k−2} ) Σ_{d=0}^{D−1} 1/2^d + 2^k n L^{k−1}
      ∈ O(k^2 n^2 + n^{k−1} + 2^k n L^{k−1})


Figure 11.6: Calculation of C-optimal cuts.

and space complexity

    max_{0≤d<D} ( k n(d) + (k(k−1)/2) n(d)^2 ) + L^k ∈ O(k^2 n^2 + L^k).

Compared to the standard dynamic programming for multiple alignment, the time is reduced only by one dimension, so it is not clear whether the whole method is really an advantage. However:

• The space usage is polynomial, since only the space for the k(k−1)/2 additional cost matrices is required.

• Additional cost matrices have a very nice structure: the low values that are of interest here are often close to the main diagonal, and the values increase rapidly when one leaves this diagonal. Based on this observation, several branch-and-bound heuristics have been developed to speed up the search for C-optimal cut positions; for more details see (Stoye, 1998).

An implementation of the divide-and-conquer alignment algorithm, plus further documentation, can be found at http://bibiserv.techfak.uni-bielefeld.de/dca; another implementation is part of the QAlign package.


CHAPTER 12

Tree Alignment Algorithms

Contents of this chapter: Sankoff, Minimum Mutation Problem, Fitch, general- ized tree alignment, Steiner tree, three-way tree alignment construction, deferred path heuristic.

12.1 Sankoff’s Algorithm

An exponential-time exact algorithm that solves the tree alignment problem (Problem 10.8) is due to Sankoff (1975).

Note that for a given alignment A the tree alignment cost with homogeneous gap costs can be equivalently written in the following column-wise form:

    DTree^(T)(A) = Σ_{{p,q}∈E} D2(π{p,q}(A)) = Σ_{column j of A} D^(T)(A_{*,j}),

where the cost of the j-th alignment column of A is given by

    D^(T)(A_{*,j}) := Σ_{{p,q}∈E} cost(A_{p,j}, A_{q,j})

and cost is a dissimilarity function on characters (and gaps) satisfying cost(–, –) = 0.
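For illustration, the cost of a single alignment column under a given tree could be computed as follows in Python; the edge list over row indices and the cost callback are assumptions made for this sketch.

def tree_column_cost(column, edges, cost):
    """D^(T) of one alignment column: sum of cost(column[p], column[q])
    over all tree edges {p, q}, where cost('-', '-') must be 0."""
    return sum(cost(column[p], column[q]) for p, q in edges)

# toy usage: a star tree on three rows with unit mismatch cost
unit = lambda a, b: 0 if a == b else 1
print(tree_column_cost(['A', 'A', '-'], [(0, 1), (0, 2)], unit))   # 0 + 1 = 1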


Then the optimal alignment cost for sequence prefixes s1[1..i1], s2[1..i2], . . . , sk[1..ik] can be computed by the recursive formula

    D(i1, i2, . . . , ik) = min_{Δ1,...,Δk ∈ {0,1}, Δ1+···+Δk ≠ 0} { D(i1 − Δ1, i2 − Δ2, . . . , ik − Δk)
                              + min_{ck+1,...,cK ∈ Σ∪{–}} D^(T)( Δ1 s1[i1], . . . , Δk sk[ik], ck+1, . . . , cK ) },

where the argument of D^(T) is read as an alignment column, we use the notation Δc := c if Δ = 1 and Δc := – if Δ = 0 for a character c ∈ Σ, and, as in Definition 10.4, K is the number of vertices of the tree T. The outer minimization alone is computationally intensive: Like for sum-of-pairs optimal multiple alignment, each entry in the k-dimensional search space of the indices (i1, i2, . . . , ik) has 2^k − 1 predecessors. Moreover, there is the inner minimization where the optimal characters for the sequences at the K − k internal nodes of the tree are to be selected:

Problem 12.1 (Minimum Mutation Problem) Given a phylogenetic tree T with characters attached to the leaves, find a labelling of the internal vertices of T with characters such that the overall number of character changes along the edges of T is minimized.

Here, the so-called Fitch Algorithm (Fitch, 1971) can be used, which we describe in its original version for the unit cost function and a rooted binary tree T (if the tree is unrooted, a root can be arbitrarily chosen, the algorithm will always give the same result).

1. Bottom-up phase: Assign to each internal node a set of potential labels.

   • For each leaf i set Ri = {xi} (xi = character at leaf i).

   • Bottom-up traversal of the tree (from leaves to root): for an internal node i with children j and k, set

         Ri = Rj ∩ Rk, if Rj ∩ Rk ≠ ∅,
         Ri = Rj ∪ Rk, otherwise.

2. Top-down phase: Pick a label for each internal node.

   • Choose arbitrarily xroot = some x ∈ Rroot.

   • Top-down traversal of the tree (from root to leaves): for an internal node j with parent i, set

         xj = xi, if xi ∈ Rj,
         xj = some x ∈ Rj, otherwise.

(A small code sketch of these two phases is given below.)
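The following Python sketch implements the two phases on a rooted binary tree given as nested pairs; the tree representation and the small example at the end are illustrative assumptions, not the tree of Figure 12.1.

def fitch(tree, leaf_char):
    """Fitch algorithm on a rooted binary tree: subtrees are tuples, leaves are
    names (assumed distinct, so nodes can serve as dictionary keys).
    leaf_char maps leaf names to characters. Returns (labels, number of changes)."""
    R = {}                                     # candidate label sets per node

    def bottom_up(node):
        if not isinstance(node, tuple):        # leaf
            R[node] = {leaf_char[node]}
            return
        left, right = node
        bottom_up(left)
        bottom_up(right)
        common = R[left] & R[right]
        R[node] = common if common else (R[left] | R[right])

    labels, changes = {}, 0

    def top_down(node, parent_char):
        nonlocal changes
        if parent_char is not None and parent_char in R[node]:
            x = parent_char
        else:
            x = next(iter(sorted(R[node])))    # pick some element deterministically
            if parent_char is not None:
                changes += 1                   # a change on the edge to the parent
        labels[node] = x
        if isinstance(node, tuple):
            for child in node:
                top_down(child, x)

    bottom_up(tree)
    top_down(tree, None)
    return labels, changes

# small illustrative tree (not the tree of Figure 12.1)
tree = ((("u", "v"), "w"), ("x", "y"))
chars = {"u": "A", "v": "G", "w": "A", "x": "T", "y": "G"}
print(fitch(tree, chars))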



Figure 12.1: Illustration of the Fitch algorithm for Σ = {A, G, T }. The characters assigned during the backtracing phase are underlined.

See Figure 12.1 for an example of the Fitch algorithm.

The analysis of the Fitch algorithm is easy: It takes O(k|Σ|) time and space, because the number of internal nodes is bounded by k − 1 and each step takes O(|Σ|) time.

Since in the overall algorithm, for each of the O(n^k) entries (i1, . . . , ik) in the alignment search space there are 2^k − 1 choices for (Δ1, Δ2, . . . , Δk), and for each of them we run the Fitch algorithm, the overall time complexity of Sankoff's algorithm is O(n^k 2^k k |Σ|). Knudsen (2003) shows how this algorithm can be generalized to affine gap costs.

Note: A bottom-up traversal of a tree can be implemented by a post-order traversal, which starts at the leaves and, at each internal node, recursively traverses the subtrees before visiting the node itself. A top-down traversal of a tree can be implemented by a pre-order traversal (also called depth-first traversal), which starts at the root and, for each internal node, visits the node first and then recursively traverses the subtrees.

12.2 Generalized Tree Alignment

The generalized tree alignment problem (Problem 10.11) can be equivalently formulated as a Steiner tree problem in sequence space:

Problem 12.2 Given the complete weighted graph whose vertices V represent the infinite set Σ∗ of all sequences over Σ and whose edges have as weight the edit distance between the connected vertices (the so-called Sequence Space), and a subset V 0 ⊆ V , find a tree T with vertices V 00 ⊆ V such that V 0 ⊆ V 00 and the total weight of the edges in T is minimal. Such a tree is called a Steiner tree for V 0.

For two-letter sequences over the alphabet Σ = {A, C, G, T}, this is visualized in Figure 12.2. Generalized tree alignment is a very hard, but very attractive problem to solve. For example, the chicken-and-egg problem of alignment and tree construction is avoided; see also Section H.1. Therefore heuristic methods have been developed, two of which are presented here.


Figure 12.2: The generalized tree alignment problem for two-letter sequences as a Steiner tree problem in sequence space. Given are the sequences AA, CA, AT , and GC. The tree shown contains one additional internal “Steiner” node AC and has total cost 4.

12.2.1 Greedy Three-Way Tree Alignment Construction

This heuristic method was developed by Vingron and von Haeseler (1997).

The idea is to construct alignment and tree simultaneously by iterative insertion of leaves into a growing tree, resulting in a series of trees T2,T3,...,Tk, and at the same time a series of alignments A1,A2,...,Ak.

The starting point is the (trivial) tree T2 with the two leaves representing s1 and s2, and the alignment A2 that is just the optimal pairwise alignment of s1 and s2.

Then, iteratively the other sequences s3, . . . , sk are inserted in the tree as new leaves and joined into the growing alignment. In order to decide where a new leaf si+1 is inserted in Ti, the cost of the new tree is computed for each possible insertion point, i.e., each internal edge of Ti, and compared to the cost of the corresponding alignment Ai.

In this procedure, the insertion of the new leaf si+1 in an edge e (with the set of leaves L on the one side and the set of leaves R on the other side) of the tree is performed by breaking the edge e into two parts eL and eR and adding a new edge enew leading to the new leaf (see Figure 12.3).

In order to compare tree cost and alignment cost, average distances between subtrees and subalignments are used. Thereby, the average distance between elements of two subtrees of T with leaves in L and R, respectively, is defined as follows:

    D̄T(L, R) = (1/(|L| · |R|)) Σ_{s∈L, t∈R} DT(s, t),

where DT(s, t) is the distance (sum of edge lengths as defined below) between the two leaves s and t in the tree T.

The new alignment Ai+1 is the result of an optimal three-way alignment of the projection of the previous alignment Ai to the leaves in L, the projection of the previous alignment Ai to the leaves in R, and the new sequence si+1. The average distance of the two sub-alignments is

    D̄A(L, R) = (1/(|L| · |R|)) Σ_{s∈L, t∈R} DA(s, t),


where DA(s, t) is the cost of the projection of A on sequences s and t, that is, DA(s, t) = D2(π{s,t}(A)).

Figure 12.3: Inserting a new leaf si+1 into edge e of tree T . Filled circles represent the given sequences, empty circles represent Steiner points (inferred sequences at internal points in the tree), and the dashed circle represents a possible insertion point for the edge leading to the newly inserted sequence si+1.

In order to avoid trivial results, during the optimization the average distances are required to be equal in the tree and in the alignment:

    D̄_{Ti+1}({si+1}, L) = D̄A({si+1}, L)
    D̄_{Ti+1}({si+1}, R) = D̄A({si+1}, R)          (12.1)
    D̄_{Ti+1}(L, R) = D̄A(L, R).

Now, we would like to derive the new edge lengths eL, eR, and enew. Therefore, observe the following from Figure 12.3:

    D̄_{Ti+1}({si+1}, L) = enew + eL + D̄_{Ti}(L, rL)
    D̄_{Ti+1}({si+1}, R) = enew + eR + D̄_{Ti}(R, rR)
    D̄_{Ti+1}(L, R) = eL + eR + D̄_{Ti}(L, rL) + D̄_{Ti}(R, rR).

Using the equalities (12.1), this system can be solved for the desired edge lengths:

    eL   = (1/2) ( D̄A(si+1, L) + D̄A(L, R) − D̄A(si+1, R) ) − D̄_{Ti}(L, rL)
    eR   = (1/2) ( D̄A(L, R) + D̄A(si+1, R) − D̄A(si+1, L) ) − D̄_{Ti}(R, rR)
    enew = (1/2) ( D̄A(si+1, R) + D̄A(si+1, L) − D̄A(L, R) ).
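These three formulas translate directly into code; the following snippet is only a transcription of the equations above, with variable names of our own choosing.

def insertion_edge_lengths(dA_sL, dA_sR, dA_LR, dT_L_rL, dT_R_rR):
    """Edge lengths for inserting the new leaf s_{i+1}, computed from the average
    alignment distances and the average tree distances to the split vertices."""
    e_L   = 0.5 * (dA_sL + dA_LR - dA_sR) - dT_L_rL
    e_R   = 0.5 * (dA_LR + dA_sR - dA_sL) - dT_R_rR
    e_new = 0.5 * (dA_sR + dA_sL - dA_LR)
    return e_L, e_R, e_new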

Note that by this equation, edge lengths can be negative. Therefore, among all insertion edges that result in no negative edge lengths, that one is selected for which the length of the new edge enew in the tree is minimal. This defines the edge where the new leaf is inserted.


The insertion step is repeated until all leaves are inserted into the tree and thus Tk and Ak are obtained. Finally, a refinement phase can be started, where one leaf at a time is removed from the tree and a new (possibly the same) insertion point is searched by the same procedure as above. While there is no guarantee that the algorithm computes trees that come close to the optimal Steiner tree, in practice the method is not too bad, and quick as well, especially if a fast three-way alignment procedure is used.

12.2.2 The Deferred Path Heuristic

This heuristic method was developed by Schwikowski and Vingron (1997). Similar to the greedy three-way alignment heuristic, the tree is iteratively constructed by inserting one sequence after the other. However, the tree is not directly constructed as a Steiner tree in sequence space; instead, first a directed acyclic sequence graph is computed that represents all co-optimal and slightly sub-optimal pairwise alignments of the sequences. The algorithm is a kind of agglomerative clustering, where initially all singleton sequence graphs are constructed (as trivial nodes in the sequence space), and then in each iteration the two closest sequence graphs are merged into a joint sequence graph, where optimal and suboptimal paths between the two sequence graphs are considered (see Figure 12.4). In the end, the multiple alignment is constructed from the final sequence graph for all k sequences.

Figure 12.4: The deferred path heuristic for generalized tree alignment. Filled circles represent the given sequences, dashed circles represent possible Steiner points (inferred sequences at internal points in the tree). The dotted points are introduced during the insertion of sequence si+1.

This method yields an approximation algorithm with a guaranteed error bound of at most (2 − 2/k) with respect to the optimal generalized tree alignment.

CHAPTER 13

Whole Genome Alignment

Contents of this chapter: Filter algorithms, Maximal Unique Match (MUM), Maximal Exact Match (MEM), pairwise genome alignment (MUMmer), Chaining Problem, multiple genome alignment (MGA, MUMmer 3), multiple genome alignment with rearrangements (MAUVE).

Because many prokaryotic and eukaryotic genomes have been sequenced recently, and their comparison reveals much information about their function and evolutionary history, the alignment of whole genomes is currently very popular. A review of algorithms for whole genome alignment can be found in (Chain et al., 2003).

We start this chapter with a few general remarks on whole genome alignment:

1. The alignment of whole genomes only makes sense if there is a global similarity between the compared genomes. If the genomes do not have a common layout, other methods like rearrangement studies (Gascuel, 2005) should be applied.

2. In principle, like in the case of “normal” alignments, alignments of two or more whole genomes could be computed by dynamic programming. Since one deals with large amounts of data, though, the quadratic (or exponential in the multiple case) time complexity leads to extremely long computation times.

3. The previous point is why in practice one needs to apply faster methods that are specialized for similar DNA sequences, as they often appear in closely related genomes. Most of the commonly used methods for whole genome alignment are filtration based, in the sense that in a first step highly similar (or identical) subregions are identified that in a second step are connected in the best possible way. This procedure may be repeated a few times.


In this chapter we will first explain the general approach of filtration with exact seeds in sequence analysis and how this can be efficiently performed using suffix trees. Then we describe a few of the more popular tools for whole genome alignment, MUMmer (Delcher et al., 1999, 2002; Kurtz et al., 2004), MGA (Höhl et al., 2002), and MAUVE (Darling et al., 2004).

13.1 Filter Algorithms

Filter algorithms (see also Section 3.9) are not only popular in whole genome alignment, but are also used in other methods for fast sequence comparison, the most popular examples being BLAST and FASTA; they are also used in repeat finding, for instance. Generally, a filter algorithm is a two-step procedure: first, in a filtration phase, small high-similarity regions, so-called seeds, are searched; then, in the second verification phase, those regions that contain one or more seeds are postprocessed in various ways. As the particular way of postprocessing depends very much on the application at hand, in this section we will not go into further details. The kind of verification used in whole genome alignment (chaining) will be described in the following sections.

In the filtration phase, different types of seeds are considered by the various methods. Possibilities are

• exact or approximate matches,
• maximal (non-extendable) matches, and
• unique matches.

In whole genome alignment, two types of seeds are in popular use: Maximal unique matches (MUMs) and maximal exact matches (MEMs). Both may be defined for two or multiple sequences. MUMs for two sequences were already discussed in Section 7.6.4; here we discuss the more general case. The precise definitions follow:

Definition 13.1 A MUM (Maximal Unique Match) is a substring that occurs exactly once in each sequence si, 1 ≤ i ≤ k, and that can not simultaneously be extended to the left or to the right in every sequence.

The definition of MEMs is less restrictive as they may occur several times in the same sequence:

Definition 13.2 A MEM (Maximal Exact Match) is a substring that occurs in all sequences s1, s2, . . . , sk and that cannot simultaneously be extended to the left or to the right in every sequence. A MEM in several sequences is sometimes called a multiMEM.

Both MUMs and MEMs can be efficiently found using the generalized suffix tree of the sequences s1, s2, . . . , sk. We begin with multiMEMs: They can be found by a bottom-up traversal of the generalized suffix tree T of s1, s2, . . . , sk (see Section 7.3). It is easy to see that there is a correspondence between the internal nodes of T that have, in their subtree, at least one leaf for each input

sequence, and the right-maximal exact matches. In addition, to also test for left-maximality, one has to check that there are no extensions possible to the left. This simple algorithm takes O(n + kr) time, where n = n1 + n2 + ··· + nk is the total length of all k genomes and r is the number of right-maximal pairs. Algorithms that run in O(n) time are also possible with some enhancement of the data structure.

MUMs have the additional restriction that in the subtree below the endpoint of the MUM, each sequence s1, s2, . . . , sk must correspond to exactly one leaf. MUMs can be found in O(kn) time.

Once the filtration has been performed, a verification step is necessary. In genome alignment this is done by chaining.

The task of finding a longest chain of compatible MUMs in step 2 (the Chaining Problem) can be modeled as the following graph problem: Let R = {r1, . . . , rz} be the MUMs found in the first phase of the algorithm. Define a partial order ≺ on MUMs where ri ≺ rj if and only if the end of MUM ri is smaller than the beginning of MUM rj in both s1 and s2. The directed, vertex-weighted graph G = (V,E) contains the vertex set V = R ∪ {start, stop} and an edge (ri → rj) ∈ E if and only if ri ≺ rj. Moreover, (start → ri) ∈ E and (ri → stop) ∈ E for all 1 ≤ i ≤ z. The weight w(v) of a vertex v is defined as the length of the MUM represented by v, and w(start) = w(stop) = 0.

Problem 13.3 (Chaining Problem) Find a chain c = (start = r_{i0}, r_{i1}, r_{i2}, . . . , r_{iℓ}, r_{iℓ+1} = stop), where two neighboring vertices are connected by an edge (r_{ij} → r_{ij+1}) for all 0 ≤ j ≤ ℓ, of heaviest weight w(c) := Σ_{j=1}^{ℓ} w(r_{ij}).

It is well known that in an acyclic graph a path of maximum weight can be found in O(|V| + |E|) time by topologically sorting the vertices and then applying dynamic programming. Here, this easily yields an O(z^2) time algorithm for the chaining. However, since the MUMs can be linearly ordered, using a heaviest increasing subsequence algorithm the computation can even be reduced to O(z log z) time.
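As an illustration of the quadratic variant, the following Python sketch chains MUMs given by their (half-open) coordinates in the two genomes; the tuple format and the length-based weight are our own assumptions.

def heaviest_chain(mums):
    """O(z^2) chaining DP: mums is a list of (start1, end1, start2, end2) with
    half-open intervals; r_i precedes r_j if r_i ends before r_j starts in both
    genomes. Returns (best total weight, chain as a list of MUM indices)."""
    if not mums:
        return 0, []
    z = len(mums)
    order = sorted(range(z), key=lambda i: mums[i][1])   # process by end in genome 1
    weight = [m[1] - m[0] for m in mums]                 # weight = MUM length
    best = weight[:]                                     # best chain ending at i
    pred = [-1] * z
    for j in order:
        for i in order:
            if mums[i][1] <= mums[j][0] and mums[i][3] <= mums[j][2]:
                if best[i] + weight[j] > best[j]:
                    best[j] = best[i] + weight[j]
                    pred[j] = i
    end = max(range(z), key=best.__getitem__)
    chain, node = [], end
    while node != -1:
        chain.append(node)
        node = pred[node]
    return best[end], chain[::-1]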

13.2 Pairwise Genome Alignment (MUMmer 1 & 2)

The first and most popular alignment program for two whole genomes was MUMmer whose first (Delcher et al., 1999) and second (Delcher et al., 2002) versions differ only slightly.

The overall algorithm is the following:

1. Given two genomes s1 and s2, all MUMs are found using the generalized suffix tree as described in the previous section.

2. The longest chain of compatible MUMs is selected, where a set of MUMs is compatible if the MUMs can be ordered linearly (see also Figure H.3 on page 190).

3. Short gaps (up to 5000 base pairs) are filled by ordinary pairwise alignment. Long gaps remain unaligned.


Overall, the first two phases take time O(n1 + n2 + z^2) or O(n1 + n2 + z log z), respectively. The time used by the last phase depends on the size of the remaining gaps and the algorithm used. In the worst case the last phase dominates the whole procedure, for example when no single MUM was found. But in a typical case, where the genomes are of considerable global similarity, not too many and not too large gaps should remain, so that the last phase does not require too much time.

13.3 Multiple Genome Alignment (MGA and MUMmer 3)

With only little change, the whole genome alignment procedure described above can also be applied to more than two genomes. One design decision in MUMmer 1 and 2 was to use MUMs as the output of the filtration phase. The advantage is that this gives reliable anchors, since the uniqueness is a strong hint that the regions in the two genomes indeed correspond to each other, i.e. are orthologous. However, if more than two sequences are compared, it is unlikely that there exist many MUMs that are present and unique in all considered sequences. That is why the authors of the first multiple genome alignment program MGA (Höhl et al., 2002) decided to search for multi-MEMs above a certain minimal length, instead of MUMs.

In addition to the use of multi-MEMs, MGA differs from MUMmer in the way the gaps are treated that remain after the chaining phase (which is identical to the approach described for MUMmer, except that it is multi-dimensional). Since the computation of multiple alignments is often very time-consuming, gaps are filled only if they are rather short, and then by the progressive alignment method ClustalW. For all long gaps, the MGA procedure is applied recursively with less stringent parameters, i.e. a shorter minimal multi-MEM length. This way weak signals may be found in a second or third round. The program MGA can be obtained from http://bibiserv.techfak.uni-bielefeld.de/mga/.

MUMmer in its third version (Kurtz et al., 2004) is now also based on multi-MEMs, as the authors have noted that these are less restrictive and the chaining does not become more complicated.

13.4 Multiple Genome Alignment with Rearrangements (MAUVE)

The last genome alignment program described here is MAUVE (Darling et al., 2004). This program uses multi-MUMs as seeds, but can also deal with rearrangements, i.e. its chaining algorithm is more general than in MUMmer and MGA, as it is not restricted to finding one chain of collinear seeds. Instead it looks for several locally collinear blocks of seeds. Among these blocks, the most reliable ones are selected by a greedy selection procedure and assembled into a global "alignment". Like MGA, MAUVE is applied recursively with less restrictive parameters to the regions not aligned in the previous phase. It can be summarized as follows:

1. Find local alignments (multiMUMs).


2. Use the multiMUMs to calculate a phylogenetic guide tree.

3. Select a subset of the multiMUMs to use as anchors; these anchors are partitioned into collinear groups called LCBs.

4. Perform recursive anchoring to identify additional alignment anchors within and outside each LCB.

5. Perform a progressive alignment of each LCB using the guide tree.

MAUVE can be found at http://gel.ahabs.wisc.edu/mauve. There is also an advanced version of MAUVE that includes rearrangements between the aligned genomes.


APPENDIX A

Distances versus Similarity Measures on Sequences

Contents of this chapter: Biologically inspired distances between DNA and protein sequences. Dissimilarity and similarity measures. Equivalence of dis- similarity and similarity formulation. Log-odds scores. Non-symmetric score matrices.

So far we have only defined distances between sequences, focusing on their differences. Another viewpoint is to focus on their common features. This leads to the notion of similarity measures (or scores). The purpose of this chapter is to introduce biologically more meaningful measures than edit distance, and to make the transition from the cost viewpoint to the score viewpoint.

A.1 Biologically Inspired Distances

The unit cost edit distance simply counts certain differences between strings. There is no inherent biological meaning. If we want to compare DNA or protein sequences, for example, a biologically more meaningful distance should be used.

The following is called the transversion/transition cost on the DNA alphabet:

        A  G  C  T
    A   0  1  2  2
    G   1  0  2  2
    C   2  2  0  1
    T   2  2  1  0


   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  0 14  7  9 20  9  8  7 12 13 17 11 14 24  7  6  4 32 23 11
R 14  0 11 15 26 11 14 17 11 18 19  7 16 25 13 12 13 25 24 18
N  7 11  0  6 23  5  6 10  7 16 19  7 16 25 10  7  7 30 23 15
D  9 15  6  0 25  6  3  9 11 19 22 10 19 29 11 11 10 34 27 17
C 20 26 23 25  0 26 25 21 25 22 26 25 26 26 22 20 21 34 22 21
Q  9 11  5  6 26  0  5 12  7 17 20  8 16 27 10 11 10 32 26 16
E  8 14  6  3 25  5  0  9 10 17 21 10 17 28 11 10  9 34 26 16
G  7 17 10  9 21 12  9  0 15 17 21 13 18 27 10  9  9 33 26 15
H 12 11  7 11 25  7 10 15  0 17 19 10 17 24 13 11 11 30 21 16
I 13 18 16 19 22 17 17 17 17  0  9 17  8 17 16 14 12 31 18  4
L 17 19 19 22 26 20 21 21 19  9  0 19  7 14 20 19 17 27 18 10
K 11  7  7 10 25  8 10 13 10 17 19  0 15 26 11 10 10 29 25 16
M 14 16 16 19 26 16 17 18 17  8  7 15  0 18 17 16 13 29 20  8
F 24 25 25 29 26 27 28 27 24 17 14 26 18  0 27 24 23 24  8 19
P  7 13 10 11 22 10 11 10 13 16 20 11 17 27  0  9  8 32 26 14
S  6 12  7 11 20 11 10  9 11 14 19 10 16 24  9  0  5 29 22 13
T  4 13  7 10 21 10  9  9 11 12 17 10 13 23  8  5  0 31 22 10
W 32 25 30 34 34 32 34 33 30 31 27 29 29 24 32 29 31  0 25 32
Y 23 24 23 27 22 26 26 26 21 18 18 25 20  8 26 22 22 25  0 20
V 11 18 15 17 21 16 16 15 16  4 10 16  8 19 14 13 10 32 20  0

Table A.1: Amino acid replacement costs as suggested by W. Taylor.

Bases A and G are called purines, and bases C and T are called pyrimidines. The transversion/transition cost function reflects the biological fact that a purine/purine or a pyrimidine/pyrimidine replacement is much more likely to occur than a purine/pyrimidine replacement. Often, an insertion/deletion cost of 3 is used with the above costs to take into account that a deletion or an insertion of a base is even rarer. These costs define the transition/transversion distance between two DNA sequences.

To compare protein sequences and define a cost function on the amino acid alphabet, we may count how many DNA bases must change minimally in a codon to transform one amino acid into another.

A more sophisticated cost function for replacement of amino acids was calculated by W. Taylor according to the method described in Taylor and Jones (1993) and is shown in Table A.1. To completely define a metric, we would also have to define the costs for insertions and deletions.

To allow such general costs, we have to re-define the edit alphabet. The reason is that so far, we have considered an operation Sc that substitutes the next character by c ∈ Σ, but does not tell us which character is being substituted. Since the cost of Sc can vary depending on the substituted character, from now on, we use the following edit alphabet:

E :≡ E(Σ) := {Sa,c, Ic, Dc : a ∈ Σ, c ∈ Σ}.

The previous copy operation C that copied the next arbitrary character a ∈ Σ is now the appropriate Sa,a. (We do not consider flips at all for the time being.) Similarly, we have to modify the definition of the edit function E :Σ∗ × E∗ → Σ∗ (cf. p. 17) accordingly.

We will not allow arbitrary cost functions on E, but only those that satisfy certain (intuitive) properties.

Definition A.1 A sensible cost function or sensible dissimilarity function on E(Σ) is


a function cost : E → R_0^+ such that

0 ≤ cost(Sa,a) ≤ cost(Da) = cost(Ia) for all a ∈ Σ

cost(Sa,a) ≤ cost(Sa,c) = cost(Sc,a) for all a, c ∈ Σ

cost(Sa,c) ≤ cost(Sa,b) + cost(Sb,c) for all a, b, c ∈ Σ

cost(Sa,c) ≤ cost(Da) + cost(Ic) for all a, c ∈ Σ

The first and second condition state that a copy operation on a ∈ Σ should always have lower cost than any other operation involving a; they also impose symmetry on the cost function. If we want the resulting edit distance to be a metric, we will explicitly demand cost(Sa,a) = 0 for all a. The third condition states that dissimilarity should be subadditive and is essential in proving the triangle inequality of the edit distance defined by this cost function. The fourth condition is essential for substitutions being used at all; if it were violated, it would always be cheaper to delete one character and insert another.

A.2 From Distance to Similarity

The concept of a metric is useful because it embodies the essential properties of our intuition about distances. For example, if the triangle inequality is violated, we could find a shorter way from x to y via a detour over z, which is not intuitive. However, distances also have disadvantages, which have not become apparent so far, because we have looked at whole sequences. What would happen if we adopt a local instead of a global view and start looking for the least different parts of two sequences (the sequences could be, as a whole, very different from each other)? By definition, a distance can only punish differences, not reward similarities, since it can never drop below zero. Note that the empty sequence ε is a (trivial) substring of every sequence and d(ε, ε) = 0; so this (probably completely uninteresting) common part is always among the best possible ones. More generally, short strings simply cannot accumulate as many differences as long strings, and would be preferred by a distance measure. As another example of a problematic issue, look at the amino acid dissimilarities above. The dissimilarity between glutamine (Q) and tryptophan (W) is twice as large as the dissimilarity between methionine (M) and arginine (R): d(Q, W) = 32 vs. d(M, R) = 16. Which units are they measured in? What is the intuition behind the above values? How could we derive meaningful distances or dissimilarities between amino acids? The answers to these questions are not easily found. It turns out that some problems are more easily formulated in terms of similarity than in terms of dissimilarity or distance. For example, when we want to find similar substrings of two sequences, we can assign a positive similarity value (or score) to each pair that consists of a copy of the same letter and a negative score to each mismatched pair (corresponding to a substitution operation), and also to insertions and deletions. The main difference is that we can explicitly reward desired features instead of only punishing undesired features. For some people, a similarity-centered view is also psychologically more attractive than a distance-centered view. Again, a similarity function on an edit alphabet should satisfy certain intuitive properties.


Definition A.2 A sensible similarity function or sensible score function on E(Σ) is a function score : E → R such that

score(Sa,a) ≥ 0 ≥ score(Da) = score(Ia) for all a ∈ Σ

score(Sa,a) ≥ score(Sa,c) for all a, c ∈ Σ

score(Sa,c) = score(Sc,a) for all a, c ∈ Σ

score(Sa,c) ≥ score(Da) + score(Ic) for all a, c ∈ Σ

The first and second condition state that the similarity between a and itself is always higher than between a and anything else. In particular, insertions and deletions never score posi- tively. The first and third condition impose symmetry, and the last condition states that a direct substitution is preferable to an insertion and a deletion.

In contrast to costs, similarity scores can be both negative and positive. In fact, the zero plays an important role: it represents a neutral event. This is why a copy operation should always have nonnegative score, and an insertion/deletion operation should always have nonpositive score. We now review the definition of edit distance and define edit similarity. Cost and score of an edit sequence are defined as the sum of their components’ costs and scores, respectively:

    cost(e) := Σ_{i=1}^{|e|} cost(ei)   and   score(e) := Σ_{i=1}^{|e|} score(ei)        (e ∈ E^*).

Definition A.3 Given x, y ∈ Σ∗, their edit distance d(x, y) is defined as

d(x, y) := min{cost(e): e ∈ E∗,E(x, e) = y}.

The edit sequences (there can be more than one) that achieve the minimum are called the distance-minimizing edit sequences and written as the set

    ed^opt(x, y) := argmin{cost(e) : e ∈ E^*, E(x, e) = y} = {e ∈ E^* : E(x, e) = y, cost(e) = d(x, y)}.

The edit score s(x, y) is defined as

s(x, y) := max{score(e): e ∈ E∗,E(x, e) = y}.

There is also the set of score-maximizing edit sequences

    es^opt(x, y) := argmax{score(e) : e ∈ E^*, E(x, e) = y} = {e ∈ E^* : E(x, e) = y, score(e) = s(x, y)}.

A general note on terminology: If X is some space and we maximize a function f : X → R in such a way that the maximum is achieved by at least one x ∈ X , then maxx∈X f(x) denotes the value of the maximum and argmaxx∈X f(x) denotes the set of x ∈ X that achieve the maximum.


An equivalence question. Given a cost function, an important question is whether we can define a score function in such a way that an edit sequence has minimal cost if and only if it has maximal score, i.e., such that ed^opt(x, y) = es^opt(x, y) for all x, y ∈ Σ^*. The same question arises when the score function is given, and we look for an equivalent cost function. We first present an equivalent score function for the unit cost edit distance. Then we give more general results.

Theorem A.4 The standard edit distance costs (0 for a copy, 1 for a substitution, insertion, and deletion) are equivalent to the following edit similarity scores: 1 for a copy, 0 for a substitution, −1/2 for an insertion and deletion. With this choice, s(x, y) = (|x| + |y|)/2 − d(x, y) for all x, y ∈ Σ∗.

Proof. The idea is to find a monotonicity preserving transformation from costs to scores. We introduce variables sC, sS, sI, and sD for the scores to be determined. We shall attempt to define them in such a way that there is a constant γ (that may depend on x and y, but not on e) such that score(e) = γ − cost(e).

Let e be any edit sequence with nC copy-type substitutions, nS non-copy substitutions, nI insertions and nD deletions, so

    score(e) = Σ_{o∈{C,S,I,D}} s_o · n_o.

Any edit sequence e that transforms some sequence x ∈ Σ^* into y ∈ Σ^* satisfies

    |x| + |y| = 2 nC + 2 nS + nI + nD,

or equivalently,

    nC = (|x| + |y|)/2 − nS − nI/2 − nD/2.

Therefore

    score(e) = sC · ( (|x| + |y|)/2 − nS − nI/2 − nD/2 ) + sS · nS + sI · nI + sD · nD
             = sC · (|x| + |y|)/2 + (sS − sC) · nS + (sI − sC/2) · nI + (sD − sC/2) · nD.

We want to write this in such a way that

    score(e) = γ − cost(e)

for some constant γ independent of the n-values, to preserve monotonicity. Since

    −cost(e) = −1 · nS − 1 · nI − 1 · nD,

we need to set the coefficients above correctly, i.e.,

    sC · (|x| + |y|)/2 = γ,
    sS − sC = −1,
    sI − sC/2 = −1,
    sD − sC/2 = −1.


It follows that we can arbitrarily set sC = 1 to obtain sS = 0 and sI = sD = −1/2 and γ = (|x| + |y|)/2; the theorem is proved. □

Note that we could also have chosen sC = 2 to obtain γ = |x| + |y|, sS = 1, sI = sD = 0. This is just at the limit of being a sensible score function. Similarly, for sC = 0, we would get γ = 0 and sS = sI = sD = −1. Is there a best choice? This question becomes important when we discuss local similarity in Section 4.5.

A more general version of the above theorem is the following one.

Theorem A.5 Let cost be a sensible cost function. Let M := max_{a∈Σ} cost(Sa,a). For each edit operation o ∈ E, define

    score(o) := M − cost(o)      if o ∈ {Sa,c : a, c ∈ Σ},
    score(o) := M/2 − cost(o)    if o ∈ {Ia, Da : a ∈ Σ}.

Then

1. score is a sensible similarity function;
2. for the constant γ := M · (|x| + |y|)/2, we have s(x, y) = γ − d(x, y);
3. es^opt(x, y) = ed^opt(x, y).

Proof. 1. Score symmetry follows from cost symmetry. By definition of M as maximum copy cost, score(Sa,a) ≥ 0 for all a ∈ Σ. Also, score(Da) = 1/2 · (M − 2 · cost(Da)) ≤ 0, since for sensible costs, M ≤ cost(Da) + cost(Ia) for all a ∈ Σ. We have score(Sa,a) ≥ score(Sa,c) for all a ≠ c, since the reverse inequality holds for the costs by definition. Finally, we use cost(Sa,c) ≤ cost(Da) + cost(Ic) to see that score(Sa,c) = M − cost(Sa,c) ≥ M/2 − cost(Da) + M/2 − cost(Ic) = score(Da) + score(Ic).

2. The proof idea is the same as in the special case. Let e be any edit sequence transforming x into y, let na,b be the number of Sa,b-operations for a, b ∈ Σ, and let da and ia be the numbers of Da and Ia operations, respectively. Then |x| + |y| = 2 · Σ_{a,b} na,b + Σ_a (ia + da). Thus

    score(e) = Σ_{a,b} na,b · score(Sa,b) + Σ_a ( da · score(Da) + ia · score(Ia) )
             = Σ_{a,b} na,b · (M − cost(Sa,b)) + Σ_a ( da · (M/2 − cost(Da)) + ia · (M/2 − cost(Ia)) )
             = M · ( Σ_{a,b} na,b + Σ_a (da/2 + ia/2) )
               − Σ_{a,b} na,b · cost(Sa,b) − Σ_a ( da · cost(Da) + ia · cost(Ia) )
             = M(|x| + |y|)/2 − cost(e).

This shows that score(e) = γ − cost(e) for all edit sequences that transform x into y. Let e^* be a cost-optimal edit sequence. By the above statement, it is also score-optimal. The converse argument holds for a score-optimal edit sequence. It follows that s(x, y) = max_{e: E(x,e)=y} score(e) = γ − min_{e: E(x,e)=y} cost(e) = γ − d(x, y).


3. This statement has just been proven as part of 2. □

When we apply the above theorem to the unit cost edit distance, we obtain M = 0 and thus similarity scores score(C) ≡ score(Sa,a) = 0 for all a ∈ Σ, and negative unit scores for the other operations. Since the whole point of a similarity measure is to reward identities, this is probably not what we desire. In fact, we can often increase the constant M in the previous theorem and still obtain a sensible similarity function. We did this in Theorem A.4, where we arbitrarily chose M := 1. Note that the conversion also works in the other direction when a similarity measure is given whose smallest identity score min_a score(Sa,a) is zero. In general, a conversion in the spirit of Theorem A.5 can always be attempted, but it may lead to a non-sensible cost function when starting with a similarity function.
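The conversion of Theorem A.5 is easy to mechanize; the following sketch assumes the cost function is given as two Python dictionaries (our own representation).

def cost_to_score(cost_sub, cost_indel):
    """Theorem A.5: build equivalent similarity scores from a sensible cost
    function, with M = max over a of cost_sub[(a, a)].
    cost_sub   : dict {(a, b): cost of substituting a by b}
    cost_indel : dict {a: cost of inserting/deleting a}"""
    M = max(c for (a, b), c in cost_sub.items() if a == b)
    score_sub = {(a, b): M - c for (a, b), c in cost_sub.items()}
    score_indel = {a: M / 2 - c for a, c in cost_indel.items()}
    return score_sub, score_indel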

A.3 Log-Odds Score Matrices

Now that we have established a similarity-based viewpoint, it is time to discuss how we can define similarity scores for biological sequences from an evolutionary point of view; we focus on proteins and define a similarity function on the alphabet of 20 amino acids. The basic observation is that over evolutionary time-scales, in functional proteins, two similar amino acids are replaced more frequently by each other than dissimilar amino acids. The twist is now to define similarity by observing how often one amino acid is replaced by another one in a given amount of time.

Let π = (π_a)_{a∈Σ} be the frequency vector (also called a probability vector, which means that π_a ≥ 0 for all a and Σ_a π_a = 1) of the naturally occurring amino acids. If we look at two random positions in two randomly selected proteins, the probability that we see the ordered pair (a, b) is π_a · π_b. The entries of π are also called background frequencies of the amino acids. Now assume that we can track the fate of individual amino acids over a fixed evolutionary time period t. Let M^t(a, b) be the observed frequency table of homologous amino acid pairs where we observed a at the beginning of the period and b at the end of the period, such that Σ_{a,b} M^t(a, b) = 1. Usually, we count substitutions and identities twice, i.e., once in each direction. This ensures the symmetry of M^t: M^t(a, b) = M^t(b, a). The reason is that we can in fact not track amino acids during evolution; we can only observe the state of two present-day sequences and do not know their most recent common ancestor. The fields of molecular evolution and phylogenetics examine more of the resulting implications. For more examples of these fields see Chapters 10.4, 12 or H.

We can find the background frequencies as the marginals of M^t: π_a = Σ_b M^t(a, b) = Σ_b M^t(b, a) because of symmetry.

There are two reasons why an entry M^t(a, b) can be relatively large. The first reason is the one we are interested in: because a and b are similar. The second reason is that simply a or b could be frequent amino acids. To remove the second effect, we consider the so-called likelihood ratio M^t(a, b)/(π_a · π_b), which relates the probability of the pair (a, b) in an evolutionary model to its probability in a random model. We declare that a and b are similar if this ratio exceeds 1 and that they are dissimilar if this ratio is below 1. To obtain

an additive function, we take the logarithm and define the log-odds score matrix with respect to time period t by

S^t(a, b) := log( M^t(a, b) / (π_a · π_b) ).

The time parameter t should be chosen in such a way that the score matrix is optimal for the problem at hand: If we want to compare two sequences whose most recent common ancestor existed, say, 530 million years ago, then we should ideally use a score matrix constructed from sequence pairs with distance t ≈ 1060 million years.

Some remarks:

• For convenience, scores are often scaled by a constant factor and then rounded to integers. The score unit is called a bit if the score is obtained by a log2(·) operation. When multiplied by a factor of three, say, we get scores in third-bits.

• Since we cannot easily observe how DNA or proteins change over millions of years, constructing a score matrix is somewhat difficult. Since the pioneering work of Margaret Dayhoff and colleagues in the 1970s, Markov processes have been used to model molecular evolution. Methods that allow information from sequences of different divergence times to be integrated in a consistent fashion have been developed more recently.

• Important score matrix families for proteins are the PAM(t) and the BLOSUM(s) matrices. Here t is a divergence time parameter and s is a clustering similarity parameter inversely correlated with t. (The inventors of BLOSUM have chosen to index their matrices in a different way.) The BLOSUM matrices are the most widely used ones for protein sequence comparison.
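As a small illustration of the definition above, the following Python sketch (not from the notes; the two-letter pair frequencies are invented toy values) computes a rounded log-odds matrix in third-bits from a symmetric pair-frequency table and its marginals.

    from math import log2

    # Toy pair-frequency table M^t over a two-letter alphabet {a, b};
    # the values are invented, but symmetric and summing to 1.
    Mt = {('a', 'a'): 0.30, ('a', 'b'): 0.10,
          ('b', 'a'): 0.10, ('b', 'b'): 0.50}
    alphabet = ['a', 'b']

    # Background frequencies as the marginals of M^t.
    pi = {a: sum(Mt[(a, b)] for b in alphabet) for a in alphabet}

    # Log-odds scores S^t(a, b) = log2(M^t(a, b) / (pi_a * pi_b)),
    # scaled by 3 and rounded, i.e. given in third-bits.
    S = {(a, b): round(3 * log2(Mt[(a, b)] / (pi[a] * pi[b])))
         for a in alphabet for b in alphabet}

    print(pi)  # {'a': 0.4, 'b': 0.6}
    print(S)   # positive for pairs occurring more often than by chance, negative otherwise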

Non-symmetric score matrices. We usually demand that similarity functions and hence score matrices are symmetric. In some cases, however, there are good reasons to choose a non-symmetric similarity function. Often the asymmetry does not stem from an asymmetry of M^t (for reasons explained above), but from different background frequencies. We mention an example. Assume that we have a particular protein fragment (the “query”) that forms a transmembrane helix. The amino acid composition in membrane regions differs strongly from that of the “average” protein: It is hydrophobically biased. Assume that we want to look for similar fragments in a large database of peptide sequences (of unknown or “average” type). It follows that the matrix M^t of pair frequencies should be one derived from an evolutionary process acting on transmembrane helices and that two different types of background frequencies should be used: The query background frequencies τ are those of the transmembrane model, different from the background frequencies π of the database. It follows that we should use

S^t(a, b) := log( M^t(a, b) / (τ_a · π_b) ),

or a rounded multiple thereof, so that now S(a, b) ≠ S(b, a) in general. Table A.2 shows the SLIM161 matrix for transmembrane helices (Müller et al., 2001). The 161 is the time parameter t referring to the expected number of evolutionary events per 100 positions.

136 A.3 Log-Odds Score Matrices

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    5  -8  -5 -10   3  -7 -11   0  -5   1   1 -10   1   1  -5   2   1  -2  -3   2
R   -3  10  -4 -10  -2  -3 -11  -4  -4  -2  -1  -3  -1  -2  -7  -4  -3  -2  -3  -3
N   -1  -5   8  -2   0  -2  -7  -2   2  -1  -1  -6   0   1  -5   2   0  -2   2  -2
D   -3  -9   1   9  -4  -2   1  -2  -2  -2  -2  -6  -2  -1  -5  -3  -3  -3  -2  -2
C    2  -8  -5 -11  11  -7 -12  -3  -8   0   1 -12   1   3 -11   2   0  -1   0   1
Q   -1  -2   0  -3   0   7  -4  -2   0   0   0  -4   2   2  -4   0  -1   4   1   0
E   -3  -7  -2   3  -2  -1   7  -3  -1  -2  -2  -8  -1   0  -4  -1  -3  -1  -1  -2
G    1  -7  -4  -7   0  -5 -10   7  -7  -1   0  -8   0   1  -5   1  -1  -3  -2  -1
H   -1  -5   3  -5  -2  -1  -5  -4  10  -1  -1  -8  -1   2  -7  -1  -2   1   5  -2
I   -1  -9  -7 -11  -1  -7 -12  -4  -7   6   3 -11   4   2  -6  -3  -1  -2  -3   4
L   -1  -8  -6 -10   1  -7 -12  -4  -7   4   5 -11   4   3  -7  -3  -1  -1  -2   2
K   -1   2  -1  -4  -2   0  -7  -1  -3   0  -1   6   0   0  -1  -1  -1  -1   1  -2
M   -1  -7  -6 -10   1  -5 -11  -3  -7   4   4 -11   7   3  -7  -2   0  -1  -2   2
F   -1  -9  -5 -10   2  -6 -11  -3  -4   1   2 -11   2   8  -7  -2  -2   3   4   0
P   -3  -9  -7  -9  -7  -8 -10  -4  -9  -2  -3  -8  -3  -2  11  -3  -3  -3  -4  -2
S    2  -8  -1  -8   4  -5  -9   0  -4   0   0  -9   0   1  -5   6   1  -2  -1   0
T    2  -7  -3  -8   2  -6 -10  -2  -5   2   2  -9   3   2  -4   2   4  -4  -2   2
W   -3  -7  -6 -10   0  -2 -11  -5  -3  -1   0 -10   0   4  -6  -4  -5  15   2  -2
Y   -4  -8  -2  -9   1  -5 -10  -5   0  -2  -1  -8  -1   6  -7  -3  -3   2  11  -2
V    1  -9  -7 -10   1  -7 -11  -4  -7   5   3 -12   3   2  -6  -2   0  -2  -3   5

Table A.2: The non-symmetric SLIM 161 score matrix. Scores are given in third-bits, i.e., as S(a, b) = round[3 · log_2(M(a, b)/(τ_a π_b))].

137 A Distances versus Similarity Measures on Sequences

APPENDIX B

Pairwise Sequence Alignment (Extended Material)

Contents of this chapter: Number of Alignments.

B.1 The Number of Global Alignments

How many ways are there to align two sequences of lengths m and n? Let N(m, n) be that number. It is equal to the number of different edit sequences that transform a length-m sequence into a length-n sequence.

An algorithmic perspective. First, we derive a recurrence for N(m, n). Obviously, there is just one possibility to align ε to any sequence; therefore N(m, 0) = N(0, n) = 1 for all m ≥ 0 and n ≥ 0.

Theorem B.1 For m ≥ 1 and n ≥ 1,

N(m, n) = N(m − 1, n − 1) + N(m − 1, n) + N(m, n − 1).

Proof. Each alignment must end with either a match/mismatch, an insertion, or a deletion. In the match/mismatch case, there are N(m − 1, n − 1) ways to align the corresponding prefixes of length m − 1 and n − 1. The other cases are treated correspondingly. All these possibilities lead to different alignments; see also Figure B.1. This argument shows that N(m, n) is at least as large as the stated sum. However, we have also covered all possibilities; therefore equality holds. □

Note that the proof of Theorem B.1 follows the same argumentation as the one for Theorem 3.5. This is because we are exploiting the same structural property of alignments (or edit sequences). A technique called algebraic dynamic programming developed in Robert


1   1   1    1    1  ...
1   3   5    7    9  ...
1   5  13   25   41  ...
1   7  25   63  ...
1   9  41  ...
1  ...

Figure B.1: The number N(m, n) of global alignments of sequences of lengths m and n is equal to N(m − 1, n − 1) + N(m, n − 1) + N(m − 1, n): This recurrence can be easily computed by dynamic programming, just like the edit distance.

Giegerich’s group “Praktische Informatik” at Bielefeld University uses methods from algebra to make such connections more explicit.

Theorem B.1 provides us with an algorithm to compute N(m, n) in O(mn) arithmetic operations. As we will see below, the numbers N(m, n) grow exponentially and thus have O(min{m, n}) bits, such that the total time is O(mn min{m, n}), which by the way is exponential in the input size (the input just consists of O(log m + log n) bits to encode the numbers m and n). When implementing a function to compute N(m, n), it is advisable to use a library that can compute with integers of arbitrary size (e.g. the BigInteger class of the Java Runtime Environment).
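A direct implementation of the recurrence is short; Python is used here only for illustration, and its built-in integers are already arbitrary-precision, so no extra library is needed (a sketch, not part of the notes):

    def num_alignments(m, n):
        # N(m, 0) = N(0, n) = 1;  N(m, n) = N(m-1, n-1) + N(m-1, n) + N(m, n-1).
        # Python integers grow as needed, so there is no overflow.
        N = [[1] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                N[i][j] = N[i-1][j-1] + N[i-1][j] + N[i][j-1]
        return N[m][n]

    assert num_alignments(4, 4) == 321
    print(len(str(num_alignments(1000, 1000))))   # number of decimal digits of N(1000, 1000)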

A combinatorial perspective. Fortunately, it is also not difficult to find a closed (i.e., non-recursive) formula for N(m, n). Because N is symmetric (which follows from the definition and formally from the symmetries of the initialization and of the recurrence), we will assume that m ≥ n ≥ 0.

Recall from Observation 4.3 that the number of match/mismatch columns k in any alignment satisfies 0 ≤ k ≤ n. If there are k such columns, the alignment length is m + n − k. The number of indel columns is thus m + n − 2k; there are then m − k insertions in the longer sequence and n − k insertions in the shorter sequence. There are thus (m+n−k choose k) possibilities to choose the match/mismatch positions in the alignment, and, for each of these, further (m+n−2k choose n−k) possibilities to distribute the insertion positions of the shorter sequence (deletion positions of the longer sequence) among the indel positions. For fixed k, we thus get (m+n−k choose k) · (m+n−2k choose n−k) = (m+r choose n−r) · (m−n+2r choose r) possibilities, where we have set r := n − k (the number of indel positions in the shorter sequence). It follows that

N(m, n) = Σ_{r=0}^{n} (m+r choose n−r) · (m−n+2r choose r).

For m = n, we get the diagonal entries

N(n, n) = Σ_{r=0}^{n} (n+r choose n−r) · (2r choose r).


Laquer (1981) shows that

N(n, n) ≈ (1 / 2^{5/4}) · (3 + 2√2)^{n+1/2} / √(πn).

For more information about this diagonal sequence, you may consult the extremely useful on-line encyclopedia of integer sequences maintained by Neil J.A. Sloane, sequence A001850, at http://www.research.att.com/~njas/sequences/A001850. By looking up research papers on N(m, n), particularly Gupta (1974), we can find out additional properties of N(m, n), e.g.

N(m, n) = Σ_{j=0}^{n} (n choose j) · (m+j choose n) = Σ_{j=0}^{n} (m+j choose j) · (m choose n−j) = Σ_{j=0}^{n} 2^j · (n choose j) · (m choose j).

Relating N(m, n) to the theory of hypergeometric functions (in the same paper), we can compute a single row (or column) of the matrix. The important observation is that m is fixed in the next recurrence.

Theorem B.2 Recall that N(m, 0) = 1 and N(m, 1) = 1 + 2m for m ≥ 1. For n ≥ 2, N(m, n) = [(2m + 1) · N(m, n − 1) + (n − 1) · N(m, n − 2)]/n.

Proof. This goes far beyond the scope of these lecture notes. □

Disregarding the order of consecutive indels. One could argue that the above numbers are exaggerated because the order of insertions and deletions immediately following each other is not important. For example, the three alignments

AX––C     A–X–C     A––XC
B–YZD     BY–ZD     BYZ–D

should be considered equivalent. That is, only the choice of the aligned character pairs should matter. Let N′(m, n) be the number of effective alignments, counted in this fashion; again we assume that m ≥ n. If k ∈ [0, n] denotes the number of match/mismatch positions in each sequence, we can select k such positions in (n choose k) ways in the shorter sequence and in (m choose k) ways in the longer sequence. The total number of possibilities is thus

N′(m, n) = Σ_{k=0}^{n} (m choose k) · (n choose k) = (m+n choose n),

where the last identity is known as the Vandermonde convolution of binomial coefficients. For n = m we get

N′(n, n) = (2n choose n) = (2n)! / (n!)^2 ≈ 4^n / √(πn) · exp(−3/(24n));

this is sequence A000984 in the on-line encyclopedia of integer sequences. The asymptotic value follows from Stirling's approximation for factorials:

ln(n!) = n ln n − n + (ln n)/2 + ln(√(2π)) + 1/(12n) + o(1/n).

141 B Pairwise Sequence Alignment (Extended Material)

Finally, some numbers.

n         0   1    2    3     4      5   ...   1000         Reference
N(n, n)   1   3   13   63   321   1683   ...   ≈ 10^767.4    A001850
N′(n, n)  1   2    6   20    70    252   ...   ≈ 10^600      A000984

As an exercise, draw all 13 alignments and, respectively, all 6 distinct equivalence classes of alignments of x = AB and y = CD.
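The small values in the table above, the closed formulas, and the row recurrence of Theorem B.2 are easy to cross-check with a few lines of Python (a sketch, not from the notes; math.comb is the standard-library binomial coefficient):

    from math import comb

    def N(m, n):        # number of global alignments, closed form (assumes m >= n)
        return sum(comb(m + r, n - r) * comb(m - n + 2*r, r) for r in range(n + 1))

    def N_prime(m, n):  # number of effective alignments (Vandermonde convolution)
        return comb(m + n, n)

    assert [N(k, k) for k in range(6)] == [1, 3, 13, 63, 321, 1683]
    assert [N_prime(k, k) for k in range(6)] == [1, 2, 6, 20, 70, 252]

    # Theorem B.2 (row recurrence with m fixed): quick numerical check for m = 7
    m = 7
    row = [1, 2*m + 1]
    for n in range(2, 8):
        row.append(((2*m + 1) * row[n-1] + (n - 1) * row[n-2]) // n)
    assert row == [N(m, n) for n in range(8)]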

APPENDIX C

Pairwise Alignment in Practice (Extended Material)

Contents of this chapter: Fast Smith-Waterman, FASTA: on-line database search method, hot spots, diagonal runs, index-based database search methods (BLAT, SWIFT, QUASAR), pointers to software.

C.1 Fast Implementations of the Smith-Waterman Algorithm

Since alignment by dynamic programming in a large-scale context can be very time-consuming, several approaches have been taken to accelerate the alignment computation. One approach is parallelization using special hardware or special instructions of common hardware. Remarkable is an effort where, by using special multimedia (MMX) instructions of the Intel architecture, it was possible to speed up the computation of Smith-Waterman local alignments by a factor of six on a Pentium computer. Other projects attempt to exploit the power of modern graphics cards. Another approach is the use of special hardware, either FPGAs or special processors. Commercial products are the GeneMatcher and BlastMachine from Paracel.

C.2 FASTA: An On-line Database Search Method

FASTA1 used to be a popular tool for comparing biological sequences; it seems to be used less these days. First consider the problem FASTA was designed for: Let x be a query sequence (e.g. a novel DNA-sequence or an unknown protein). Let Y be a set of sequences (the database). The problem is to find all sequences y ∈ Y that are similar to x. We need

1FASTA, pronounced “fast-ay”, stands for fast-all: It is fast on nucleotide as well as protein sequences.

to specify the similarity notion used by FASTA. There is no formal model of what FASTA computes; instead, FASTA is a heuristic stepwise procedure of three phases that is executed for each y ∈ Y. On a high level, the algorithm proceeds as follows.

• Preprocessing: Create a q-gram index I of the query x.
• For each y ∈ Y do:
  1. Find hot spots and compute the FASTA score C(x, y).
  2. Combine hot spots into diagonal runs.
  3. Find maximal paths in the graph of diagonal runs.

We explain the first step in detail and steps 2 and 3 on a higher level.

Finding hot spots.

Definition C.1 Given x ∈ Σ^m, y ∈ Σ^n, and q ∈ ℕ, a hot spot is a location (i, j) of starting positions of a common q-gram of x and y.

Hot spots are grouped according to the diagonal in the edit matrix where they occur: The coordinates (i, j) are said to lie on diagonal d := j − i. Diagonal indices d thus range from −m to n.

Definition C.2 For d ∈ {−m, . . . , n}, let c(x, y; d) be the number of hot spots on diagonal d, i.e.,

c(x, y; d) := #{ i : x[i . . . i+q−1] = y[i+d . . . i+d+q−1] }.

Then the FASTA score of x and y is defined as

C(x, y) := max_d c(x, y; d).

Example C.3 Take x = freizeit, y = zeitgeist, and q = 2. Then c(−4) = 3, c(−1) = 1, c(0) = 1, c(3) = 1 and c(d) = 0 for d ∉ {−4, −1, 0, 3}; see Figure C.1.

      z e i t g e i s t
   f  . . . . . . . . .
   r  . . . . . . . . .
   e  . & . . . & . . .
   i  . . & . . . & . .
   z  & . . . . . . . .
   e  . & . . . & . . .
   i  . . & . . . & . .
   t  . . . & . . . . &

Figure C.1: Diagonal matches for freizeit and zeitgeist


To compute c(x, y; d) for all diagonals d, we move a q-window across the subject y, and note which q-grams occur in the query x and where; so we can increment the correct diagonal counters. In essence, the following pseudo-code is executed (see Algorithm C.1). Note the use of the q-gram index I of the query x. The code hides the details of the implementation of the index I.

Algorithm C.1 Pseudo-code FASTA
 1: for d = −m to n do
 2:     c[d] = 0
 3: for j = 1 to n − q + 1 do
 4:     if j == 1 then
 5:         r = rank(y[1] . . . y[q])
 6:     else
 7:         r = update_rank(r, y[j + q − 1])
 8:     for each i in I[r] do
 9:         c[j − i]++

Now it is easy to compute C(x, y) = max_d c(x, y; d). The whole procedure takes O(m + n + Σ_d c(x, y; d)) time: The more common q-grams exist between x and y, the longer it takes. If in the end we find that C(x, y) is small, i.e., smaller than a pre-defined threshold t, we decide that y is not sufficiently similar to x to warrant further investigation: We skip the next steps and immediately proceed with the next subject from the database. Of course, a good choice of q and t is crucial; we investigate these issues further in Chapter D. If C(x, y) ≥ t, one option is to directly compute a full Smith-Waterman alignment of x and y, and maybe even suboptimal alignments. In this sense one can use FASTA as a heuristic filter before running a full alignment. Another option is to continue heuristically: We re-process the diagonals and note where on each diagonal we have a matching q-gram. Then these matches (including small mismatching regions in between) can be combined into diagonal runs.
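The q-gram index and the diagonal counters translate almost literally into Python (a sketch, not from the notes; a dictionary stands in for the rank-based index I, and positions are 0-based):

    from collections import defaultdict

    def fasta_score(x, y, q):
        # Preprocessing: q-gram index of the query x (q-gram -> list of start positions).
        index = defaultdict(list)
        for i in range(len(x) - q + 1):
            index[x[i:i+q]].append(i)
        # Scan the subject y with a q-window and increment diagonal counters.
        c = defaultdict(int)                    # c[d] = hot spots on diagonal d = j - i
        for j in range(len(y) - q + 1):
            for i in index.get(y[j:j+q], []):
                c[j - i] += 1
        return (max(c.values()) if c else 0), c

    # Example C.3: freizeit vs. zeitgeist with q = 2.
    best, diagonals = fasta_score("freizeit", "zeitgeist", 2)
    assert best == 3 and diagonals[-4] == 3     # three 2-gram hits on diagonal -4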

Diagonal runs: Creation and optimal connection. Diagonal runs are hot spots on the same diagonal, possibly with small mismatching regions in between; see Figure C.2 for an illustration. To score diagonal runs, one assigns positive scores to the hot spots and negative scores to the mismatching regions. Note that not necessarily all hot spots on the same diagonal are put into a single diagonal run: One diagonal can contain more than one run. In the next step, a directed graph is constructed. The nodes of the graph are the diagonal runs from the previous step with corresponding positive scores assigned. Let us represent a diagonal run r by its upper left corner (top(r), left(r)) and its lower right corner (bottom(r), right(r)). The nodes for runs r and r′ are connected by an edge r → r′ if and only if bottom(r) ≤ top(r′) and right(r) ≤ left(r′). Such an edge represents gaps and/or mismatches between r and r′, and therefore edges receive a negative score. The graph of runs is obviously acyclic and therefore we can compute all maximal paths. A path is maximal if extending it by another vertex (at either end or inside) lowers its score. The score of a path is the (positive) score of its nodes plus the (negative) score of its edges.


Figure C.2: Diagonal runs in the FASTA heuristic.

Suppose that all maximal paths are computed. From each path we pick the first and the last node. The upper left corner of the first node and the lower right corner of the last node define a pair of substrings of x and y. These are aligned using standard global alignment algorithms.

C.3 Index-based Database Search Methods

The previous methods do not pre-process the database in any way and therefore can react flexibly to continuing changes of the database. They do pre-process the query, but this is usually not referred to as indexing. As discussed above, database indexing can speed up searches tremendously. The general ideas are the same as for indexing the query: Often we simply generate position lists of q-grams of the database. If we do not want to specify q in advance, so-called suffix arrays provide an alternative. These are extremely useful data structures discussed in detail in Chapter 8. Here we give general ideas about three software tools that use slightly different indexing strategies.

Common strategies employed by these methods are:

• When building the index, partition the database into buckets to avoid keeping a hit counter for every database position or diagonal. Buckets usually overlap a little in order not to miss hits at bucket boundaries.

• Find high-scoring hits for each bucket. Discard a bucket when the number of hits is too low. This allows a large fraction of the database to be discarded very quickly.

• Run a more sensitive alignment algorithm on the remaining buckets.

BLAT. BLAT (Kent, 2002) was designed to be exceptionally fast for detecting highly similar sequences. It was developed for assembling the human genome, where hundreds of millions of DNA sequence reads have to be checked for possible overlap. Unlike BLAST, which can also find somewhat more remote homologies, BLAT focuses on essentially identical sequence parts with very few errors.

During preprocessing, BLAT constructs an index of nonoverlapping (!) q-grams and their positions in the database. BLAT excludes q-grams from the index that occur too often.


For each query, BLAT looks up each overlapping q-gram of the query sequence in the index; it also has the option of looking up q-grams with one mismatch. In this way, BLAT builds a list of hits (each hit is a pair (i, j) of query and database position) where query and database share a q-gram. The hit list is split into buckets (of size 64K), based on the database position. The hits in each bucket are sorted according to their diagonal i−j. Hits within the gap limit are bundled together into proto-clumps. Hits within proto-clumps are then sorted along the database coordinate and joined into real clumps if they are within the window limit on the database coordinate. Clumps with less than the minimum number of hits are discarded; the remaining ones are used to define regions of the database which are homologous to the query sequence.

SWIFT. SWIFT (Rasmussen et al., 2006) was designed as an exact filter to find all ε-matches of a query against a database: Let n be the minimum length of an alignment and e the number of allowed errors. The error rate ε := e/n is specified by the user: the smaller the allowed error rate, the faster the algorithm will run. After constructing a q-gram index for the database, the following filtration phase is executed for each query.

Let w be a window size whose value can be determined from n, q and e. SWIFT looks for parallelograms in the edit matrix of dimension w × e that may contain an ε-match. A threshold value t is determined such that only those parallelograms are of interest that contain t or more q-gram hits. In order to locate such parallelograms, a w-window is moved over the query, and q-gram hits are counted for each e + 1 adjacent diagonals. In practice, this step can be accelerated by using bins of size e + 1 + ∆ for some ∆ > 0, e.g. ∆ = 2^z for some z ∈ ℕ with 2^z > e. Each parallelogram whose hit counter reaches the threshold t is reported, possibly merging overlapping parallelograms.

The filtration phase guarantees that no ε-matches are lost. Now these matches can be extended to full local alignments. To this end, dynamic programming is used to find an optimal local alignment starting in the parallelogram, possibly using BLAST's X-drop extension algorithm.

The time complexity of the method is O(|Σ|^q + N) for the preprocessing (building the q-gram index) plus O(mN|Σ|^{−q}) expected time for filtration, which can be O(m) if the user-defined choice of ε allows an appropriate choice of q. The time of the extension obviously depends on the similarity of the two sequences, and on the exact method that is used for the extension.

QUASAR. Every q-gram based approach requires a careful choice of q (see also Chapter D). Depending on the application, a user may want to vary the value of q. However, with the q-gram index approach described here, this would require rebuilding the index. One way to leave the choice of q flexible is to use a data structure called a suffix array. The main difference between QUASAR2 (Burkhardt et al., 1999) and other q-gram based approaches is precisely the use of suffix arrays.

2Q-gram Alignment based on Suffix ARrays


C.4 Software

This final section contains some pointers to implementations of the methods discussed in this chapter. This list is far from complete.

Dotter: for creating dotplots with greyscale rendering
• E.L.L. Sonnhammer and R. Durbin
• http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html

DOTLET: a nice tool written in Java that generates dot plots for two given sequences
• T. Junier and M. Pagni
• Web server: http://myhits.isb-sib.ch/cgi-bin/dotlet

DNADot: nice interactive Web tool at http://www.vivo.colostate.edu/molkit/dnadot/

The FASTA package: contains FASTA for protein/protein or DNA/DNA database search and SSEARCH for full Smith-Waterman alignment
• W.R. Pearson and D.J. Lipman
• http://www.ebi.ac.uk/fasta/
• http://fasta.bioch.virginia.edu/

BLAST: Basic Local Alignment Search Tool
• NCBI BLAST website: http://130.14.29.110/BLAST/

BLAT: useful for mapping sequences to whole genomes
• J. Kent
• Web server: http://genome.ucsc.edu

SWIFT: guarantees not to miss any ε-match
• K. Rasmussen, J. Stoye, and E.W. Myers
• http://bibiserv.techfak.uni-bielefeld.de/swift/welcome.html

STELLAR: fast and exact pairwise local alignments using SWIFT
• B. Kehr, D. Weese, and K. Reinert
• http://www.seqan.de/projects/stellar.html

QUASAR: q-gram based database searching using a suffix array
• Stefan Burkhardt and others
• provided as part of SeqAn at http://www.seqan.de/

EMBOSS Pairwise Alignment Algorithms: European Molecular Biology Open Source Software Suite, http://www.ebi.ac.uk/emboss/align/index.html
• Global (Needleman-Wunsch) alignment: needle


• Local (Smith-Waterman) alignment: water

e2g: a web-based tool which efficiently maps large EST and cDNA data sets to genomic DNA
• J. Krüger, A. Sczyrba, S. Kurtz, and R. Giegerich
• Web server: http://bibiserv.techfak.uni-bielefeld.de/e2g/


APPENDIX D

Alignment Statistics

Contents of this chapter: Null model (for pairwise sequence alignment), statis- tics (q-gram matches, FASTA score), multiple testing problem, longest match, statistics of local alignments.

D.1 Preliminaries

In this chapter, we present some basic probabilistic calculations that are useful for choosing appropriate parameters of database search algorithms and for evaluating the quality of local alignments.

Often it is a problem to rank two alignments, especially if they have been created from different sequences under different scoring schemes (different score matrix and gap cost function): The absolute score value cannot be compared because the scores in one matrix might be scaled with a factor of 10, while the scores in the other matrix might be scaled with a factor of 100, so scores of 67 and 670 would not indicate that the second alignment is much better.

Statistical significance computations provide a universal way to evaluate and rank alignments (among other things). The central question in this context is: How probable is it to observe the present event (or a more extremal one) in a null model?

Definition D.1 A null model in general is a random model for objects of interest that does not contain signal features. It specifies a probability distribution on the set of objects under consideration.

More specifically, a null model for pairwise sequence alignment (for given sequence lengths m and n) specifies a probability for each sequence pair (x, y) ∈ Σm × Σn.


Definition D.2 The most commonly used null model for pairwise sequence alignment is the i.i.d. model1, where each sequence is created by randomly and independently drawing each character from a given alphabet with given character frequencies f = (f_c)_{c∈Σ}. The probability that a random sequence X of length m turns out to be exactly x ∈ Σ^m is the product of its character frequencies: P(X = x) = ∏_{i=1}^{m} f_{x[i]}. Similarly, P(Y = y) = ∏_{j=1}^{n} f_{y[j]}. The probability that the pair (X, Y) is exactly (x, y) is the product of the individual probabilities: P((X, Y) = (x, y)) = P(X = x) · P(Y = y).

When we observe an alignment score s for two given sequences, we can ask the following two questions: For random sequences from the null model of the same length as the given ones, what is the probability that two of these sequences have an alignment score of at least s and what is the expected number of pairwise alignments with an alignment score of at least s? The probability is called the p-value and the expected number is called the e-value associated to the event of observing score s.

Definition D.3 The p-value of an event (with respect to a null model) is the probability to observe the event or a more extremal one in the null model.

Definition D.4 The e-value of an event (with respect to a null model) is the expected number of events equal to or more extremal than the observed one in the null model.

Note that the null model ensures that the sequences have essentially no built-in similarity, since they are chosen independently. Any similarity measured by the alignment score is therefore due to chance. In other words, if a score ≥ s is quite probable in the null model, a score value of s is not an indicator of biological similarity or homology. The smaller the p-value and the e-value, the less likely it is that the observed similarity is simply due to chance and the more significant is the score. Good p-values are for example 10^−10 or 10^−20. A p-value can be converted into a universally interpretable score (a measure of surprise of the observed event), e.g. by B := −log_2(p), called the bit score. A bit score of B ≥ b always has probability 2^−b in the null model. It is often a difficult problem to compute the p-value associated to an event. This is especially true for local sequence alignment. The remainder of this chapter provides an intuitive, but mathematically inexact approach.

D.2 Statistics of q-gram Matches and FASTA Scores

Let us begin by computing the exact p-value of a q-gram match at position (i, j) in the alignment matrix. In the following, X and Y denote random sequences of length m and n, respectively, from the null model. Look at two arbitrary characters X[i] and Y[j]. What is the probability p that they are equal? The probability that both are equal to c ∈ Σ is f_c · f_c = f_c^2. Since c can be any character, we have

p = P(X[i] = Y[j]) = Σ_{c∈Σ} f_c^2.

1independent and identically distributed.


If the characters are uniformly distributed, i.e., if f_c = 1/|Σ| for all c ∈ Σ, then p = 1/|Σ|.

Now let us look at two q-grams X[i . . . i + q − 1] and Y [j . . . j + q − 1]. They are equal if and only if all q characters are equal; since these are independent in the i.i.d. model, the probability of an exact q-gram match (Hamming distance zero) is

p_0(q) := P(d_H(X[i . . . i+q−1], Y[j . . . j+q−1]) = 0) = P(X[i . . . i+q−1] = Y[j . . . j+q−1]) = p^q.

We can also compute the probability that the q-gram contains exactly one mismatch (probability 1 − p) and q − 1 matches (probability p each): There are q positions where the mismatch can occur; therefore the total probability for Hamming distance 1 is

P(d_H(X[i . . . i+q−1], Y[j . . . j+q−1]) = 1) = q · (1 − p) · p^{q−1}.

Taken together, the probability of a q-gram match with at most one mismatch is

p_1(q) := P(d_H(X[i . . . i+q−1], Y[j . . . j+q−1]) ≤ 1) = [p + q(1 − p)] · p^{q−1}.

Similarly, we can compute the p-value p_k(q) of a q-gram match with at most k mismatches.
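Under the same independence assumption, the number of mismatching positions in the q-gram is binomially distributed, so p_k(q) is a partial binomial sum. A small Python sketch (not from the notes; the uniform DNA alphabet used in the example is an assumption):

    from math import comb

    def p_match():
        # probability that two random characters are equal (uniform DNA alphabet assumed)
        return 4 * (1/4)**2            # = sum of f_c^2 = 1/4

    def p_k(q, k, p=None):
        # probability of a q-gram match with at most k mismatches:
        # sum over i mismatches of C(q, i) * (1-p)^i * p^(q-i)
        p = p_match() if p is None else p
        return sum(comb(q, i) * (1 - p)**i * p**(q - i) for i in range(k + 1))

    q = 8
    print(p_k(q, 0))   # = p^q, exact match
    print(p_k(q, 1))   # = [p + q*(1-p)] * p^(q-1), as derived above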

Online database search. So far, we have considered fixed but arbitrary coordinates (i, j). If we run a q-gram based filter in a large-scale database search and consider each q-gram a hit that must be extended, we are interested in the e-value, the expected number µ(q) of exact hits, and the probability p*(q) of at least one hit. This is a so-called multiple testing problem: At each position, there is a small chance of a random q-gram hit. When there are many positions, the probability of seeing at least one hit somewhere can become large, even though the individual probability is small.

Since there are (m − q + 1) · (n − q + 1) positions where a q-gram can start and each has the same probability, we have

µ(q) = (m − q + 1) · (n − q + 1) · p^q.

Computing p*(q) is much more difficult. If all positions were independent and p^q was small, we could argue that the number of hits N(q) has a Poisson distribution with expectation µ(q). Since the probability of having zero hits is P(N(q) = 0) ≈ e^{−µ(q)} · [µ(q)]^0 / 0! = e^{−µ(q)} (see Section 2.2), the probability of having at least one hit equals the probability of not having zero hits: p*(q) = P(N(q) ≥ 1) = 1 − P(N(q) = 0), which results in

p*(q) = P(N(q) ≥ 1) = 1 − e^{−µ(q)}.

However, many potential q-grams in the alignment matrix overlap and therefore cannot be independent.


The longest match. We can ask for the p-value of the longest exact match between x and y being at least ℓ characters long. This probability is equal to the probability of at least one match of length at least ℓ, which is approximately 1 − e^{−µ(ℓ)} = 1 − e^{−(m−ℓ+1)·(n−ℓ+1)·p^ℓ} ≈ 1 − e^{−mnp^ℓ} ≈ mnp^ℓ if mnp^ℓ ≪ 1. The take-home message is: For relatively large ℓ, the probability that the longest exact match has length ℓ increases linearly with the search space size mn and decreases exponentially with the match length ℓ.

The FASTA score. What is the probability to observe a FASTA score C(X,Y ) ≥ t for a given q-gram length q and random sequences X and Y of lengths m and n, respectively? The most probable way to observe a FASTA score ≥ t is via the existence of t consecutively matching q-grams in the same diagonal for a total match length of t + q − 1. By the above paragraph, the probability is approximately

P(C(X, Y) ≥ t) ≈ 1 − e^{−mnp^{t+q−1}} ≈ mnp^{t+q−1}.

This result can be used to guide the choice of the parameters q and score threshold t for the FASTA algorithm. For example, we could assume that average nucleotide queries have m = n = 1000 and p = 1/4. If we take q = 6, how must we choose t such that the above p-value is 0.01? Of course, such a strict threshold may miss many weak similarities. The null model view is one-sided: We can make sure that we do not get too many hits by chance, but we do not know how many biologically interesting hits we might miss. To weigh these considerations against each other, we would have to specify a model for the biological signal, which is often close to impossible.
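For the numbers in this example, the threshold can be solved for directly: we need mnp^{t+q−1} ≤ 0.01, i.e. t ≥ log(mn/0.01)/log(1/p) − (q − 1). A short Python check (a sketch, not from the notes):

    from math import log, ceil

    m = n = 1000; p = 1/4; q = 6; target = 0.01
    # smallest integer t with m*n*p**(t+q-1) <= target
    t = ceil(log(m * n / target) / log(1/p) - (q - 1))
    assert m * n * p**(t + q - 1) <= target < m * n * p**(t - 1 + q - 1)
    print(t)   # t = 9: threshold 9 pushes the approximate p-value below 0.01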

D.3 Statistics of Local Alignments

In the previous section we have argued that, if we measure the alignment score between random sequences X and Y of lengths m and n simply by the length C(X,Y ) of the longest exact match, we get

P(C(X, Y) ≥ t) ≈ 1 − e^{−mnp^{t+q−1}} = 1 − e^{−Cmn·e^{−λt}} for constants C > 0 and λ > 0 such that p = e^{−λ}.

−Cmne−λt h −λt −λt i P(S(X,Y ) ≥ t) ≈ 1 − e ≈ Cmne if Cmne  1

with constants C > 0 and λ > 0 that depend on the scoring scheme. This distribution is referred to as an extreme value distribution or Gumbel distribution.
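Once C and λ are known for a scoring scheme (in practice they are estimated, e.g. by fitting the Gumbel distribution to scores of randomly shuffled sequences), converting a score into an e-value and a p-value is immediate. A small Python sketch, not from the notes; the constants used below are purely illustrative:

    from math import exp

    def evalue(t, m, n, C, lam):
        # expected number of local alignments with score >= t in the null model
        return C * m * n * exp(-lam * t)

    def pvalue(t, m, n, C, lam):
        # probability of at least one such alignment, P(S(X,Y) >= t) ~ 1 - exp(-E)
        return 1.0 - exp(-evalue(t, m, n, C, lam))

    C, lam = 0.1, 0.3          # illustrative (made-up) constants for some scoring scheme
    print(evalue(50, 1000, 1000, C, lam), pvalue(50, 1000, 1000, C, lam))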


APPENDIX E

Basic RNA Secondary Structure Prediction

Contents of this chapter: RNA structure, RNA structure elements, the Opti- mization Problem, Simple RNA Secondary Structure Prediction Problem, context- free grammars, the Nussinov Algorithm.

E.1 Introduction

RNA is in many ways similar to DNA, with a few important differences.

1. Thymine (T) is replaced by uracil (U), so the RNA alphabet is {A, C, G, U}.

2. The additional base pair (G-U).

3. RNA is less stable than DNA and more easily degraded.

4. RNA mostly occurs as a single-stranded molecule that forms secondary structures by base-pairing.

5. While DNA is mainly the carrier of genetic information, RNA has a wide variety of tasks in the cell, e.g.

• it acts as messenger (mRNA), transporting the transcribed genetic information,

• it has regulatory functions,

• it has structural tasks, e.g. in forming a part of the ribosome (rRNA),

• it takes an active role in cellular processes, e.g. in the translation of mRNA to proteins, by transferring amino acids (tRNA).


As with proteins, the three-dimensional structure of an RNA molecule is crucial in determining its function. Since 3D-structure is hard to determine, an intermediate structural level, the secondary structure, is frequently considered. The secondary structure of an RNA molecule simply determines the intra-molecular basepairs. As with DNA, the Watson-Crick pairs {A, U} and {C, G} stabilize the molecule, but also the “wobble pair” {G, U} adds structural stability. We now define this formally.

Let s ∈ {A, C, G, U}n be an RNA sequence of length n.

Definition E.1 An s-basepair is a set of two different numbers {i, j} ⊂ {1, . . . , n} such that {s[i], s[j]} ∈ {{A, U}, {C, G}, {G, U}}. Basepairs must bridge at least δ ≥ 0 positions (e.g. δ = 3); to account for this, we have the additional constraint |i − j| > δ for a basepair {i, j}.

Definition E.2 A simple structure on s is a set S = {b_1, b_2, . . . , b_N} of s-basepairs with N ≥ 0 and the following properties:

• Either N ≤ 1, or

• if N ≥ 2 and {i, j} ∈ S and {k, l} ∈ S are two different basepairs such that i < j and k < l, then i < k < l < j or k < i < j < l (nested basepairs) or i < j < k < l or k < l < i < j (separated basepairs).

In other words, basepairs in a simple structure may not cross.

The restriction that basepairs may not cross is actually unrealistic, but it helps to keep several computational problems regarding RNA analysis of manageable complexity, as we shall see shortly.

With the above definition of a simple structure, we cover the following structural elements of RNA molecules: single-stranded RNA, stem or stacking region, hairpin loop, bulge loop, interior loop, junction or multi-loop; see Figure E.1.

Several more complex RNA interactions are not covered, e.g. pseudoknot, kissing hairpins, hairpin-bulge contact. This is not such a severe restriction, because these elements seem to be comparatively rare and can be considered separately at a later stage.

We point out that all of the above structural elements can be rigorously defined in terms of sets of basepairs. For our purposes, an intuitive visual understanding of these elements is suf- ficient. There exist different ways to represent a simple structure in the computer. A popular way is the dot-bracket notation, where dots represent unpaired bases and corresponding parentheses represent paired bases:

5' AACGCUCCCAUAGAUUACCGUACAAAGUAGGGCGC 3'
   ..(((.((..((...)).(.(...).)..))))).


Figure E.1: RNA secondary structure elements covered by Definition E.2.

As an exercise, draw a picture of this structure, similar to that in Figure E.1, and create the dot-bracket representation of the structure in that figure.

Another popular way to visualize a structure is by means of a mountain plot: Starting with the dot-bracket representation, replace dots by - and brackets () by /\, and use a second dimension. Another alternative is a circle plot; see Figure E.2.


Figure E.2: Mountain plot (left) and circle plot (right) of an RNA secondary structure.

E.2 The Optimization Problem

Over time, an RNA molecule may assume several distinct structures. The more energetically stable a structure, the more likely the molecule will assume this structure. To make this

precise, we need to define an energy model for all secondary structure elements. This can become quite complicated (there is a specialized course on RNA structure prediction), so here we only consider a drastically simplified version of the problem: We assign a score (that can be interpreted as a “stability” measure or negative free energy) to each basepair, depending on its type. We can simply count basepairs by assigning +1 to each pair, or we may take their relative stability into account by assigning score({A, U}) := 2, score({C, G}) := 3, score({G, U}) := 1, for example. On a sequence s of length n, the score of a basepair b = {i, j} is then defined as score(b) := score({s[i], s[j]}). The computational problem becomes the following one.

Problem E.3 (Simple RNA Secondary Structure Prediction Problem) Given an RNA sequence s of length n, determine a simple structure S* on s such that score(S*) = max_S score(S) among all simple structures S on s, where score(S) := Σ_{b∈S} score(b) is the sum of the basepair scores in the structure S.

Generally, the number of valid simple structures for a molecule of length n is exponential in n, so the most stable structure cannot be found in reasonable time by enumerating all possible structures. After a short digression on context-free grammars, we present an efficient algorithm to solve the problem.

E.3 Context-Free Grammars

From the above discussion of the dot-bracket notation, we may say that a (simple secondary) structure is

• either the empty structure,
• or an unpaired base, followed by a structure,
• or a structure, followed by an unpaired base,
• or an outer basepair (with a sequence constraint), enclosing a structure (with a minimum length constraint),
• or two consecutive structures.

We recall the notion of a context-free grammar here to show that simple secondary structures allow a straightforward recursive description.

Definition E.4 A context-free grammar (CFG) is a quadruple G = (N, T, P, S), where

• N is a finite set of non-terminal symbols,
• T is a finite set of terminal symbols, N ∩ T = ∅,
• P is a finite subset of N × (N ∪ T)*, the productions, which are rules for transforming a non-terminal symbol into a (possibly empty) sequence of terminal and non-terminal symbols,
• S ∈ N is the start symbol.


Ignoring basepair sequence and distance constraints, in the RNA structure setting, the terminal symbols are T = {(, ., )}, and N = {S}, where S is also the start symbol, and represents a secondary structure. The productions are formalized versions of the above rules, i.e.,

• S → ε, where ε denotes the empty sequence,
• S → .S
• S → S.
• S → (S)
• S → SS

Note that this context-free recursive description is only possible because we imposed the constraint that basepairs do not cross. Also note that this grammar is ambiguous: There are generally many ways to produce a given dot-bracket string from the start symbol S. For the Nussinov Algorithm in the next section, this ambiguity is not a problem. However, for other tasks, e.g. counting the number of possible structures, we need to consider a grammar that produces each structure in a unique way. We can argue as follows: Either the structure is empty, or it contains at least one base. In each nonempty RNA structure, look at the last base. Either it is unpaired and preceded by a (shorter) prefix structure, or it is paired with some base at a previous position. We obtain the following non-ambiguous grammar:

S → ε | S. | S(S)
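Because this last grammar produces every simple structure exactly once, it can be used directly to count the simple structures of a concrete sequence, now also respecting the basepair and minimum-distance constraints. The following Python sketch is not part of the notes; the sequence and δ = 1 are illustrative choices:

    from functools import lru_cache

    PAIRS = {("A","U"), ("U","A"), ("C","G"), ("G","C"), ("G","U"), ("U","G")}

    def count_structures(s, delta=3):
        n = len(s)

        @lru_cache(maxsize=None)
        def count(i, j):                      # structures of s[i..j], 0-based, inclusive
            if j < i:
                return 1                      # rule S -> epsilon
            total = count(i, j - 1)           # rule S -> S .   (last base j unpaired)
            for k in range(i, j - delta):     # rule S -> S (S) (last base j paired with k)
                if (s[k], s[j]) in PAIRS:
                    total += count(i, k - 1) * count(k + 1, j - 1)
            return total

        return count(0, n - 1)

    print(count_structures("GACUCGAGUU", delta=1))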

E.4 The Nussinov Algorithm

To solve Problem E.3, we now give a dynamic programming algorithm based on the structural description of simple secondary structures from the previous section. This algorithm is due to Nussinov and Jacobson (1980). We also take into account the distance and sequence constraints for basepairs. In the previous section, we have seen how to build longer secondary structures from shorter ones that can be arbitrary substrings of the longer ones. As before, let s ∈ {A, C, G, U}^n be an RNA sequence of length n. For i ≤ j, we define M(i, j) as the solution to Problem E.3 (i.e., the maximally attainable score) for the string s[i . . . j]. Since it is easy to find the optimal secondary structure of short sequences, let us start with those.

• If j − i ≤ δ (in particular if i = j, and to avoid complications also if i > j), then M(i, j) = 0, since no basepair is possible due to the length or distance constraints.

• If j − i > δ, then we can decompose the optimal structure according to one of the four cases. In each case, the resulting substructures must also be the optimal ones. Thus

  M(i, j) = max { M(i + 1, j),                                   (case S → .S)
                  M(i, j − 1),                                    (case S → S.)
                  µ(i, j) + M(i + 1, j − 1),                      (case S → (S))
                  max_{i+1 < k ≤ j} [M(i, k − 1) + M(k, j)] }     (case S → SS)


where µ(i, j) := score({s[i], s[j]}) > 0 if {s[i], s[j]} is a valid basepair, and µ(i, j) := −∞ otherwise, excluding the possibility of any forbidden basepair {i, j}.

This recurrence states: To find the optimal structure for s[i . . . j], consider all possible decompositions according to the grammar and their respective scores, and evaluate which score for s[i . . . j] would result from each decomposition. Then pick the best one.

This is possible because the composing elements of the optimal structure are themselves optimal. The proof is by contradiction: If they were not optimal and there existed better sub-structures, we could build a better overall structure from them, which would contradict its optimality.

The above recurrence is not yet an algorithm. We see that, to evaluate M(i, j), we need access to all entries M(x, y) for which [x, y] is a (strictly shorter) sub-interval of [i, j]. This suggests computing the matrix M in order of increasing j − i values. The initialization for j − i ≤ δ is easily done. Then we proceed for increasing d = j − i, as in Algorithm E.1.

Algorithm E.1 Nussinov-Algorithm
Input: RNA sequence s = s[1 . . . n]; a scoring function for basepairs; minimum number δ of unpaired bases in a hairpin loop
Output: maximal score M(i, j) for a simple secondary structure of each substring s[i . . . j]; traceback indicators T(i, j) indicating which case leads to the optimum.
 1: for i ← 2, . . . , n do
 2:     M(i, i − 1) ← 0
 3:     T(i, i − 1) ← “init”   // ε
 4: for d ← 0, . . . , δ do
 5:     for i ← 1, . . . , n − d do
 6:         M(i, i + d) ← 0
 7:         T(i, i + d) ← “init”   // .S and/or S.
 8: for d ← (δ + 1), . . . , (n − 1) do
 9:     for i ← 1, . . . , n − d do
10:         compute M(i, i + d) according to the recurrence
11:         store the maximizing case in T(i, i + d) (e.g. the maximizing k)
12: report score M(1, n)
13: start traceback with T(1, n)

To find the optimal secondary structure (in addition to the optimal score), we also store traceback indicators in a second matrix T that shows which of the production rules (and which value of k for the S → SS rule) was used at each step. By following the traceback pointers backwards, we can build the dot-bracket representation of the secondary structure. For this, we start in cell (1, n) and look at the recursion of the algorithm: When we follow a S → SS production traceback, the process branches recursively into the two substructures, reconstructing each substructure independently. Diagonals mean S → (S) and horizontal or vertical movement uses S → S. | .S. Note that when reaching the δ-diagonals only horizontal and vertical movement is allowed. Finally, when we end in the lowest diagonal (below the i = j diagonal) we have S → ε.
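The following Python sketch is not part of the notes; for compactness it organizes the recursion around the unambiguous decomposition from Section E.3 (last base unpaired, or paired with an earlier base), which covers the same set of simple structures and therefore yields the same optimal score as the four-case recurrence, together with one optimal dot-bracket string from a simple traceback. The basepair scores and δ = 1 are those of the example below.

    from functools import lru_cache

    # Basepair scores and minimum distance delta as in Example E.5 below.
    PAIR_SCORE = {("G","C"): 3, ("C","G"): 3, ("A","U"): 2, ("U","A"): 2,
                  ("G","U"): 1, ("U","G"): 1}

    def nussinov(s, delta=1):
        n = len(s)

        @lru_cache(maxsize=None)
        def M(i, j):
            # maximal score of a simple structure on s[i..j] (0-based, inclusive)
            if j - i <= delta:                    # too short: no basepair allowed
                return 0
            best = M(i, j - 1)                    # last base j left unpaired
            for k in range(i, j - delta):         # last base j paired with k (j - k > delta)
                mu = PAIR_SCORE.get((s[k], s[j]))
                if mu is not None:
                    best = max(best, M(i, k - 1) + mu + M(k + 1, j - 1))
            return best

        def traceback(i, j):
            # return one optimal dot-bracket string for s[i..j]
            if j < i:
                return ""
            if j - i <= delta or M(i, j) == M(i, j - 1):
                return traceback(i, j - 1) + "."
            for k in range(i, j - delta):
                mu = PAIR_SCORE.get((s[k], s[j]))
                if mu is not None and M(i, j) == M(i, k - 1) + mu + M(k + 1, j - 1):
                    return traceback(i, k - 1) + "(" + traceback(k + 1, j - 1) + ")"
            raise AssertionError("no matching traceback case")

        return M(0, n - 1), traceback(0, n - 1)

    score, structure = nussinov("GACUCGAGUU", delta=1)
    print(score, structure)    # optimal score 8 and one optimal dot-bracket string

Algorithm E.1 instead fills M diagonal by diagonal; the memoized recursion above visits the same O(n^2) subproblems and likewise runs in O(n^3) time.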


Complexity. The memory requirement for the whole procedure is obviously O(n^2) and its time complexity is O(n^3).

There exists a faster algorithm using the Four-Russians speedup, yielding an O(n^3 / log n) time algorithm (Frid and Gusfield, 2009), but this goes far beyond the scope of these lecture notes.

Example E.5 The Nussinov matrix for the RNA sequence GACUCGAGUU. We use δ = 1 and the scoring scheme (G, C) = 3, (A, U) = 2 and (G, U) = 1. As can be seen in Figure E.3, there exist two distinct variants (indicated in solid and dashed traces) in which the given RNA sequence can fold under this basepair scoring scheme. The resulting dot-bracket notations are ((.))((.)) and ((((..)))) for the solid and dashed variant, respectively.

Figure E.3: The Nussinov matrix for the RNA sequence GACUCGAGUU.

When reaching the δ-diagonals, we could move vertically or horizontally, which is possible in the ambiguous grammar. However, since in a hairpin loop a sequence of S → .S | S. productions will eventually end with S → ε no matter whether the . is put left or right, this would lead to identical dot-bracket notations. Hence, we could also choose to allow only horizontal or only vertical movement within the initialized 0-diagonals so as to simplify the backtracing of all optimal dot-bracket notations (only 1 backtrace instead of 4 for the dashed variant).

E.5 Further Remarks

As we have already pointed out, the single-basepair scoring model is much too simplistic. In practice, much more complicated energy functions are minimized. The basic idea, however, to exploit the recursive optimal substructure property, remains the same.

A specialized course on RNA structure prediction (by Robert Giegerich) also poses and answers the following problems, among others:


• How can we obtain suboptimal structures, and more importantly, classes of fundamentally different structures (“RNA shapes”) and count their members?
• Given a description of a secondary structure, how can we efficiently find RNA sequences (e.g. in a genome) that match this structure?
• How can we align RNA sequences, taking structural considerations into account?
• How can we computationally handle complex structures such as kissing hairpins or pseudoknots?

As RNA is now considered to be one of the most fundamentally important molecules in evolution, answers to these questions have far-reaching implications. A broad range of programs for analyzing, predicting and visualizing RNA secondary structures under different models and using different structure elements is available at:

http://bibiserv.techfak.uni-bielefeld.de/bibi/Tools_RNA_Studio.html.

APPENDIX F

Suffix Tree (Extended Material)

Contents of this chapter: Memory representation of suffix trees.

F.1 Memory Representations of Suffix Trees

Theorem 7.2 is important in both theory and practice: The index only requires a constant factor more space than the text. In practice, however, the exact constant factor is also important: Does the tree need 10, 100, or 1000 times the size of s? Here we discuss a memory-efficient way to store the tree.

We assume that the text length is bounded by the maximum integer size (often 2^32 for unsigned integers, or 2^31 for signed integers); thus storing a number or a reference to a suffix takes 4 bytes.

Above, we said that edge labels could be stored as a pair of numbers (8 bytes per edge). In fact, just one number (4 bytes per edge) is sufficient: Note that the labels of all leaf edges end at position n. Thus for leaf edges it is sufficient to store the starting position. Now we traverse the tree bottom-up: For edges e into internal nodes v, consider the leftmost outgoing edge of v and its already determined starting position j. It follows that we can label e by (j − k, j − 1) (if the length of the edge label is k) and only need to store the starting position j − k, since we can retrieve j − 1 by looking up the starting position j of the first outgoing edge of the child.

We can store the edges to the children of each node consecutively in memory as follows.

• For an edge to an internal node, we store two integers: The pointer into the string as described above in order to reconstruct the edge label, and a reference to the child list of the target node.


• For an edge to a leaf, we only store one integer: The pointer into the string as described above in order to reconstruct the edge label.

In the field where we store the string pointer, we also use two flags (that can be coded as the two highest bits of the integer, effectively restricting the size of the string further):

1. a leaf-flag L to indicate a leaf edge,

2. a last-child flag ∗ to indicate the last child in a child list.

Using the two flags, we can store 30-bit numbers in a 32-bit integer value:

bit index   31   30   29 . . . 2 1 0
content      L    *   30-bit number

The order in which we store the nodes is flexible, but it is customary (and integrated well with the WOTD algorithm described below) to store them in depth-first pre-order. This means, we first store the children of the root, then the first child of the root and its children recursively, then the second child of the root, and so on.

Figure F.1: The suffix tree of abbab$ with edge labels, substring pointers, and leaf labels.

For example, the suffix tree in Figure F.1 would be stored as follows (bold numbers indicate memory positions while string indices are in normal font).

parent node      r                        x           y
index            0    1    2    3    4    5    6      7    8    9
content          6L   4    5    5*   7    6L   3L*    6L   4L   3L*
points to        6        (x)       (y)   4    1      5    3    2

Note how the edge to the child (x) of (r) is represented by two numbers. First, the 4 indicates the starting position of the substring (ab) to read along the edge. Next, the 5 directly points to the memory location where the children of x are stored. Since it begins with 6, we know that the above edge ends at string position 5 and hence has length 2.


It remains to answer the question how we can deduce toward which leaf a leaf edge points. While we move in the tree, we always keep track of the current string depth. This is straightforward, since we can deduce the length of each edge as described above. We just have to subtract the current string depth from the string pointer on a leaf edge to obtain the leaf label. For example, the root has string depth zero and the edge into x has length 2, so we know that x has a string depth of 2. The second child of x is a leaf with substring pointer 3. Subtracting the string depth of x, we see that it points to leaf 1. The memory requirements for storing the suffix tree in this way are 2 integers for each internal node (except the root) and 1 integer for each leaf. This assumes that the flags L and ∗ are stored as high-order bits inside the integers. We point out that there are many other ways to organize the storage of a suffix tree in memory, especially if we want to store additional annotations for the internal nodes or leaves.
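To see the layout in action, here is a small Python sketch (not from the notes) that decodes the table above for s = abbab$ and prints every edge with its reconstructed label and, for leaf edges, the leaf number; for readability, the L and * flags are kept as explicit booleans instead of being packed into the two high bits.

    # Flat suffix tree layout for s = "abbab$" (1-based string positions, as in Figure F.1).
    # A leaf edge uses one slot (string pointer); an edge to an internal node uses two
    # slots (string pointer, start of the child's child list).

    s = "abbab$"
    n = len(s)

    memory = [
        (6, True,  False),        #  0: root, 1st child: leaf edge "$"   -> leaf 6
        (4, False, False), 5,     #  1,2: root, 2nd child: edge "ab" to x, children of x at 5
        (5, False, True),  7,     #  3,4: root, 3rd child: edge "b"  to y, children of y at 7
        (6, True,  False),        #  5: x, 1st child: leaf edge "$"      -> leaf 4
        (3, True,  True),         #  6: x, 2nd child: leaf edge "bab$"   -> leaf 1
        (6, True,  False),        #  7: y, 1st child: leaf edge "$"      -> leaf 5
        (4, True,  False),        #  8: y, 2nd child: leaf edge "ab$"    -> leaf 3
        (3, True,  True),         #  9: y, 3rd child: leaf edge "bab$"   -> leaf 2
    ]

    def walk(pos, depth, node_name):
        # print all edges of the child list starting at memory position pos;
        # depth is the string depth of the parent node
        while True:
            ptr, is_leaf, is_last = memory[pos]
            if is_leaf:
                label = s[ptr - 1:n]                  # leaf edge labels end at position n
                print(f"{node_name} -{label}-> leaf {ptr - depth}")
                pos += 1
            else:
                child_list = memory[pos + 1]          # where the child's own edges start
                end = memory[child_list][0] - 1       # label ends before the child's first edge starts
                label = s[ptr - 1:end]
                print(f"{node_name} -{label}-> internal node (children at {child_list})")
                walk(child_list, depth + len(label), f"node@{child_list}")
                pos += 2
            if is_last:
                break

    walk(0, 0, "root")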


APPENDIX G

Advanced Topics in Pairwise Alignment (Extended Material)

Contents of this chapter: Length-normalized alignment, shadow effect, mosaic effect, optimal normalized alignment score, parametric alignment, ray search problem.

G.1 Length-Normalized Alignment

G.1.1 A Problem with Alignment Scores

A convenient feature of the usual score definition of (local or global) alignments is its addi- tivity. This allows a simple computation of the score (by summing the scores of each column) and is also the basis for all alignment-related dynamic programming algorithms that we have seen so far. However, for local alignment, additivity also leads to at least two undesired effects from a biologist’s perspective (see also Figure G.1): Shadow effect (long mediocre alignments mask short excellent alignments): Consider two candidate local alignments A1 and A2. Assume that A1 achieves a score of 100 by aligning, say, 50 nucleotides of each sequence; further assume that there is no alignment with a higher score. We do not know how A1 looks (i.e., if there are many or few gaps or mismatches inside); all we can say is that the average score per alignment column is 2. Now assume that A2 achieves a score of 99 (and is therefore slightly suboptimal), but it only aligns 33 nucleotides of each sequence, thus is much shorter. On average, the score per alignment column is 3. We could argue that in fact this is a better alignment, but it will not be found by the Smith-Waterman algorithm because it scores slightly below the optimum. See Figure G.1, left-hand side.


Mosaic effect (inclusion of arbitrarily poor regions between two high-scoring ones): Consider an alignment that is well conserved at its borders, but has an extremely bad (thus low-scoring) middle part. For concreteness, let us assume that the score of the three parts is (+100, −99, +100). This gives a total score of 101 and is thus better than each of the two separate border parts. However, a biologist might in fact be only interested in those, but has to read through a lot of “noise” because of additive scoring. See Figure G.1, right-hand side.


Figure G.1: Left: Shadow effect. Right: Mosaic effect.

Let us emphasize that the above problems do not indicate that the Smith-Waterman algorithm is incorrect. It correctly solves the local alignment problem, as specified previously, with an additive scoring function. What we are facing here is instead a modeling problem. There may be something wrong (or at least, not ideal) with the way we (or the whole computational biology community) have defined the quality (score) of a local alignment. It is important to keep modeling issues separate from algorithmic issues.

How can we deal with the above problems? Of course, the cleanest solution is to re-define the objective function. However, this leads to an entirely new computational problem, for which so far we may not have an algorithm. (No need to worry, in this section we will develop one!) In practice and under the usual time constraints, though, we may not want to develop new theories, although this is often the most interesting part of working as a computational biologist. The usual approach is to deal with such problems in an ad-hoc manner that works well in practice.

We already know how to compute suboptimal Smith-Waterman alignments, so we can compute some of them and re-evaluate them with different scoring functions. Such a heuristic assumes, of course, that good alignments under our new scoring function are also not too bad Smith-Waterman alignments. This cannot always be guaranteed, however. Other ad-hoc methods would be postprocessing Smith-Waterman alignments to cut out bad regions in the middle of an alignment, effectively splitting one alignment into two shorter but (relatively) better ones.

Interestingly, the popular BLAST heuristic (Altschul et al., 1990) proceeds quite cleverly, as we discussed in Section 6.3: Starting from well conserved seeds, it extends the alignment only until the score drops too far below the best score reached so far, and then removes the bad border region again. This effectively avoids the mosaic effect and leads to shorter, better conserved alignments. So although BLAST is sometimes accused of not finding the best Smith-Waterman alignment, this behavior in fact has a good reason.


G.1.2 A Length-Normalized Scoring Function for Local Alignment

The above problems suggest that it is better to consider a score per alignment column or a related quantity. In this section, we follow Arslan et al. (2001) and normalize the score by the sum of the lengths of the aligned substrings. Recall that the standard optimal global and local alignment scores of s and t – let us call them GAS(s, t) and LAS(s, t), respectively – were defined as

$$\mathrm{GAS}(s,t) := \max_{A\ \text{alignment of}\ s\ \text{and}\ t} \mathrm{score}(A), \qquad \mathrm{LAS}(s,t) := \max_{\text{substrings}\ s'\ \text{of}\ s,\ t'\ \text{of}\ t} \mathrm{GAS}(s', t').$$

This relation is useful in many contexts: the local alignment score is in fact a global alignment score, once we know which substrings we have to align! The new idea is to normalize GAS(s′, t′) with a length-related quantity.

The length of an alignment can be defined in different ways. For our purposes, the following convention is useful: The length of an alignment of (sub)strings s′, t′ is defined as |s′| + |t′|. (The obvious alternative, taking the number of columns of the alignment, is also possible.)

Thus we could look for substrings s′, t′ maximizing GAS(s′, t′)/(|s′| + |t′|). This would give high preference to very short alignments (think about the two empty substrings!), so this idea still needs modifications. We could enforce the constraint |s′| + |t′| ≥ L_min, where L_min > 0 is a minimum length parameter. Here we use a different, but related approach. Let L > 0 be a free parameter. We define the optimal normalized alignment score (with parameter L) as

$$\mathrm{NAS}_L(s,t) := \max_{\text{substrings}\ s'\ \text{of}\ s,\ t'\ \text{of}\ t} \frac{\mathrm{GAS}(s', t')}{|s'| + |t'| + L}.$$

For L ↘ 0, very short alignments will be optimal. For L ↗ ∞, the lengths of the substrings become less important, and we eventually obtain the same results as for standard local alignment.

G.1.3 Finding an Optimal Normalized Alignment

Now that we have defined the objective function, we need an efficient algorithm to compute NAS_L(s, t).

The first (inefficient) possibility is to apply the definition: For each of the O(m²n²) pairs of substrings s′ of s and t′ of t, compute GAS(s′, t′)/(|s′| + |t′| + L) in O(mn) time and keep track of the maximum. This leads to a straightforward O(m³n³), or, if m = Θ(n), an O(n⁶) time algorithm, which is very impractical. By re-using some information during the computation, it is easy to speed this up to an O(m²n²) or O(n⁴) time algorithm, still not practical.

Fortunately, there is a better way. The main problem is that the target function to be maximized is not additive anymore (but this was the whole idea behind the change). We shall see that we can iteratively solve a series of standard alignment problems with changing scoring parameters to find the optimal length-normalized alignment.


Consider any alignment A. It consists of a certain number n_{a,b}(A) of substitutions (or identities) a → b, where (a, b) ∈ Σ², of a certain number g(A) of contiguous gaps, and of a certain number b(A) of blanks (gap characters). In this section it is useful to distinguish between the gap characters (which we call blanks) and the contiguous gaps, which consist of one or more blanks. For each gap's first blank, we pay an opening penalty of d, for each further blank, we pay an extension penalty of e.

The total score score(A) of an alignment A is thus

$$\mathrm{score}(A) = \sum_{(a,b)\in\Sigma^2} n_{a,b}(A)\cdot w(a,b) \;-\; g(A)\cdot d \;-\; \bigl(b(A) - g(A)\bigr)\cdot e.$$

The length |s′| + |t′| of an alignment A of two substrings s′ and t′ is furthermore

$$\mathrm{length}(A) = |s'| + |t'| = 2\cdot\sum_{(a,b)\in\Sigma^2} n_{a,b}(A) \;+\; b(A),$$

since each substitution column contains two characters and each column with a blank contains one character (of the other sequence).

Observation G.1 Both score(A) and length(A) are linear functions of the counts n_{a,b}(A), g(A) and b(A). In contrast, score(A)/(length(A) + L) is a rational function of these counts.
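To make Observation G.1 concrete, here is a minimal Python sketch (the function name and the toy counts are ours, purely for illustration) that evaluates score(A) and length(A) directly from an alignment's counts; both expressions are plainly linear in the counts.

```python
def score_and_length(n_ab, g, b, w, d, e):
    """score(A) and length(A) of an alignment A summarized by its counts:
    n_ab maps character pairs (a, b) to n_{a,b}(A), g = number of gaps,
    b = number of blanks; w = substitution scores, d/e = gap open/extension costs."""
    score = sum(count * w[pair] for pair, count in n_ab.items()) - g * d - (b - g) * e
    length = 2 * sum(n_ab.values()) + b
    return score, length

# toy counts: two A-A matches, one A-G mismatch, one gap consisting of two blanks
n_ab = {("A", "A"): 2, ("A", "G"): 1}
print(score_and_length(n_ab, g=1, b=2, w={("A", "A"): 2, ("A", "G"): 0}, d=1, e=1))
# (2, 8): score = 2*2 + 0 - 1 - 1 = 2, length = 2*3 + 2 = 8
```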

The main idea is as follows: What we want to solve is the fractional problem

$$(P):\quad \text{Find } S^{\max} = \max_A S(A), \quad\text{where } S(A) = \frac{\mathrm{score}(A)}{\mathrm{length}(A) + L}.$$

In order to solve this problem, we introduce a new parameter λ ≥ 0 and define the parameterized problem

$$(P_\lambda):\quad \text{Find } S_\lambda^{\max} = \max_A S_\lambda(A), \quad\text{where } S_\lambda(A) = \mathrm{score}(A) - \lambda\cdot\bigl(\mathrm{length}(A) + L\bigr).$$

First, in Lemma G.2, we show that for each fixed choice of λ, (P_λ) is a standard Smith-Waterman alignment problem with a scoring scheme that depends on λ; thus we know how to solve it. Next, in Lemma G.3, we show that we can solve (P) by searching for the right λ with a particular strategy, which is fortunately easy to describe.

Lemma G.2 Problem (P_λ) can be solved by solving a standard Smith-Waterman alignment problem with a different scoring function. If w(a, b), d, e are the substitution scores, gap open and gap extension costs of an instance of the parameterized alignment problem (P_λ), respectively, then define

$$w_\lambda(a, b) = w(a, b) - 2\lambda, \qquad d_\lambda = d + \lambda, \qquad e_\lambda = e + \lambda.$$

The optimal score S_λ^max of (P_λ) is the solution score_λ^max of the standard Smith-Waterman alignment problem using w_λ(a, b), d_λ, e_λ, minus λL: S_λ^max = score_λ^max − λL.


Proof. Let A be any valid alignment. Its score in (P_λ) is
\begin{align*}
S_\lambda(A) &= \mathrm{score}(A) - \lambda\cdot\bigl(\mathrm{length}(A) + L\bigr)\\
&= \Bigl[\sum_{(a,b)\in\Sigma^2} n_{a,b}(A)\cdot w(a,b) - g(A)\cdot d - \bigl(b(A)-g(A)\bigr)\cdot e\Bigr] - \lambda\Bigl[2\cdot\sum_{(a,b)\in\Sigma^2} n_{a,b}(A) + b(A) + L\Bigr]\\
&= \sum_{(a,b)\in\Sigma^2} n_{a,b}(A)\cdot\bigl(w(a,b) - 2\lambda\bigr) - g(A)\cdot(d - e) - b(A)\cdot(e + \lambda) - \lambda L\\
&= \sum_{(a,b)\in\Sigma^2} n_{a,b}(A)\cdot w_\lambda(a,b) - g(A)\cdot d_\lambda - \bigl(b(A)-g(A)\bigr)\cdot e_\lambda - \lambda L\\
&= \mathrm{score}_\lambda(A) - \lambda L,
\end{align*}
where score_λ(A) is the SW-score of A under the modified scoring scheme w_λ(a, b), d_λ and e_λ. Note that −λL is a constant, so we need not consider it during maximization; therefore the statement of the lemma holds. □

Lemma G.3 For fixed λ, let S_λ^max be the solution of (P_λ), i.e.,

$$S_\lambda^{\max} = \max_A \bigl(\mathrm{score}(A) - \lambda\cdot(\mathrm{length}(A) + L)\bigr).$$

Let S^max be the solution of (P), i.e., S^max = max_A score(A)/(length(A) + L). Then the following holds:

$$S_\lambda^{\max} = 0 \iff S^{\max} = \lambda, \qquad S_\lambda^{\max} < 0 \iff S^{\max} < \lambda, \qquad S_\lambda^{\max} > 0 \iff S^{\max} > \lambda.$$

Proof. First note that always length(A) + L > 0, since L > 0.

The definition of S^max = max_A score(A)/(length(A) + L) is equivalent to stating that for all alignments A

$$\mathrm{score}(A) - S^{\max}\cdot\bigl(\mathrm{length}(A) + L\bigr) \le 0,$$

and that there exists an alignment A^max for which equality holds: score(A^max) − S^max · (length(A^max) + L) = 0.

Thus for S^max = λ we have score(A^max) − λ·(length(A^max) + L) = 0, while score(A) − λ·(length(A) + L) ≤ 0 for all alignments A, thus S_λ^max = 0.

Now consider S^max < λ. Then for all alignments A we have S_λ(A) = score(A) − λ·(length(A) + L) < score(A) − S^max·(length(A) + L) ≤ 0. Thus also S_λ^max < 0.

Taking S^max > λ and considering A^max shows S_λ^max > 0.

This proves the three implications of the lemma in the ⇐ direction. But since the three possibilities are exhaustive on each side, equivalence follows. □

The lemma tells us that if the optimal score S_λ^max of the parameterized problem (which is a Smith-Waterman alignment problem that we can solve) is zero, then we have found the optimal normalized score S^max of the normalized alignment problem, namely as the parameter λ that led to the optimal score zero. This suggests a simple bisection algorithm:


1. Identify an initial interval [l, u] such that l ≤ S^max ≤ u.
2. Compute the midpoint λ = (l + u)/2.
3. Solve (P_λ). Let S_λ^max be the optimal score.
4. If S_λ^max = 0, we have found S^max = λ and stop.
   If S_λ^max < 0, then S^max < λ, so we set u := λ and continue searching in the lower half interval; otherwise, we set l := λ and continue searching in the upper half interval. Go back to step 2.

When the algorithm stops, the current value of λ is S^max = NAS_L(s, t), and the last alignment computed in step 3 is the optimal normalized alignment.

It remains to find the initial interval in step 1: We claim that 0 ≤ S^max ≤ M/2 is a safe choice, where M > 0 is the highest possible match score between characters. This is seen as follows. For λ = 0, we compute the ordinary Smith-Waterman alignment without subtracting a penalty for the length. This clearly has a nonnegative optimal score, thus S_0^max ≥ 0, hence S^max ≥ 0. If M is the highest positive match score, no alignment can score more than M per column, or M/2 per aligned character, thus S^max ≤ M/2.

Example G.4 Consider the two strings s = GAGTT and t = AGT, let L = 2, and choose the following scores: +2 for matches, 0 for mismatches and −1 for indels (linear homogeneous gap costs). The task is to find the optimal normalized alignment score for s and t, in particular a value of λ such that the optimal score S_λ^max is zero.

First, observe that M = +2 is the highest match score. Moreover, d_λ = e_λ because we use homogeneous gap costs.

Phase 1:

1. [l, u] = [0, M/2] = [0, 1]
2. λ = (l + u)/2 = 1/2
3. Modified scoring scheme:

   w_λ(a, a) = 2 − 2λ  ⇒  w_{1/2}(a, a) = 2 − 2·(1/2) = 1
   w_λ(a, b) = 0 − 2λ  ⇒  w_{1/2}(a, b) = 0 − 2·(1/2) = −1  (for a ≠ b)
   d_λ = e_λ = 1 + λ  ⇒  d_{1/2} = e_{1/2} = 3/2  (gap score −3/2)

   Smith-Waterman alignment (score_{1/2}^max = 3):

          ε    G    A    G    T    T
     ε    0    0    0    0    0    0
     A    0    0    1    0    0    0
     G    0    1    0    2   1/2   0
     T    0    0    0   1/2   3   3/2

   S_λ^max = score_λ^max − λ·L  ⇒  S_{1/2}^max = 3 − (1/2)·2 = 2
   S_{1/2}^max > 0  ⇒  set l := λ and go to Phase 2.

Phase 2:


1. [l, u] = [λ, u] = [1/2, 1]
2. λ = (l + u)/2 = (1/2)·(3/2) = 3/4
3. Modified scoring scheme:

   w_{3/4}(a, a) = 2 − 2·(3/4) = 1/2
   w_{3/4}(a, b) = 0 − 2·(3/4) = −3/2  (for a ≠ b)
   d_{3/4} = e_{3/4} = 1 + 3/4 = 7/4  (gap score −7/4)

   Smith-Waterman alignment (score_{3/4}^max = 3/2):

          ε    G    A    G    T    T
     ε    0    0    0    0    0    0
     A    0    0   1/2   0    0    0
     G    0   1/2   0    1    0    0
     T    0    0    0    0   3/2  1/2

   S_{3/4}^max = score_{3/4}^max − (3/4)·L = 3/2 − (3/4)·2 = 0
   S_{3/4}^max = 0  ⇒  we have found λ = 3/4.

Thus, we have found λ = 3/4, the optimal normalized alignment score for s and t, in only two iterations. □
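The bisection is straightforward to implement on top of any Smith-Waterman scorer. The following Python sketch (function names are ours; it assumes linear homogeneous gap costs, so d_λ = e_λ, exactly as in Example G.4) applies the modified scoring scheme of Lemma G.2 inside the bisection loop and reproduces λ = 3/4 for the example above.

```python
def smith_waterman_score(s, t, match, mismatch, gap):
    """Best local alignment score under linear (homogeneous) gap scores."""
    prev = [0.0] * (len(t) + 1)
    best = 0.0
    for i in range(1, len(s) + 1):
        cur = [0.0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(0.0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def nas_bisection(s, t, L, match, mismatch, indel, tol=1e-9, max_iter=100):
    """Bisection search for NAS_L(s, t) as described above (a sketch, not the
    authors' code).  `indel` is the (negative) score of a single gap character."""
    lo, hi = 0.0, match / 2.0                 # safe initial interval [0, M/2]
    lam = 0.0
    for _ in range(max_iter):
        lam = (lo + hi) / 2.0
        # modified scoring scheme of Lemma G.2 (w - 2*lam, gap penalty increased by lam)
        score_lam = smith_waterman_score(s, t, match - 2 * lam, mismatch - 2 * lam,
                                         indel - lam)
        S_lam = score_lam - lam * L           # S_lambda^max = score_lambda^max - lam*L
        if abs(S_lam) <= tol:
            break
        if S_lam < 0:
            hi = lam                          # S^max < lambda: search lower half
        else:
            lo = lam                          # S^max > lambda: search upper half
    return lam

# Example G.4: s = GAGTT, t = AGT, L = 2, match +2, mismatch 0, indel -1
print(nas_bisection("GAGTT", "AGT", 2, match=2, mismatch=0, indel=-1))   # 0.75
```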

Note. There are cases in which the algorithm takes infinitely many steps if calculated precisely, for example for λ = 1/2. In such cases, to reach numerical stability, the procedure should be stopped after a certain number of steps or when a satisfying precision is reached.

Algorithm G.1 Dinkelbach's algorithm for normalized alignment; DANA(s, t, L): returns NAS_L(s, t)
  λ ← 0, s ← −∞
  while s ≠ 0 do
    solve (P_λ) to obtain the optimal score s = S_λ^max and an optimal alignment A
    λ ← score(A)/(length(A) + L)
  return λ

Note that s here refers to the score under the modified scoring scheme using parameter λ, while score(A) refers to the unmodified scoring scheme.

Another way to find S^max (instead of the bisection approach) is to use a method distantly inspired by Newton's method. Once we have an alignment A from step 3 with nonzero S_λ^max, we ask how we have to modify λ so that this alignment obtains a score of zero. Of course, we expect that another alignment becomes optimal once we change λ. Nevertheless, choosing λ in this way brings us closer and closer to the solution. This latter technique of solving a fractional optimization problem by solving a series of linear problems is known as Dinkelbach's algorithm; see Algorithm G.1. We do not give a proof of convergence, nor a worst-case analysis (which is not very good) of Dinkelbach's algorithm. In practice, Dinkelbach's algorithm is efficient (more so than the bisection method) and usually requires only 3–5 iterations and rarely more than 10. With a more complicated algorithm (which we shall not discuss), it can be shown that the length-normalized local sequence alignment problem can be solved in O(nm log n) time (i.e., in O(log n) iterations).
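A sketch of Dinkelbach's iteration in Python, with the solver for (P_λ) passed in as a callback; the callback interface, returning S_λ^max together with score(A) and length(A) of an optimal alignment A under the original (unmodified) scoring, is our own assumption for this sketch.

```python
def dinkelbach_nas(solve_P_lambda, L, tol=1e-9, max_iter=50):
    """Dinkelbach's algorithm (Algorithm G.1) as a sketch.
    solve_P_lambda(lam) is assumed to return a triple
    (S_lambda_max, score_A, length_A) for an optimal alignment A of (P_lambda)."""
    lam = 0.0
    for _ in range(max_iter):
        S_lam, score_A, length_A = solve_P_lambda(lam)
        if abs(S_lam) <= tol:                 # optimal normalized score found
            return lam
        lam = score_A / (length_A + L)        # re-estimate lambda from the current optimum
    return lam
```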


G.2 Parametric Alignment

G.2.1 Introduction to Parametric Alignment

Several parameters are involved in the computation of pairwise sequence alignments, and depending on their values, different alignments may be optimal. In fact, the choice of alignment parameters is crucial when we want to find biologically meaningful alignments. Section G.1 on normalized alignment has already given an indication of this: By changing the scoring scheme appropriately, we were able to solve the length-normalized local alignment problem. The present section contains an approach for the systematic study of the relationship between parameters and optimal alignments, the parametric sequence alignment.

The two most obvious parameter sets in pairwise alignment are

1. the substitution scores for every pair of symbols from the alphabet; these can often be derived empirically as log-odds scores, as outlined in Section A.3;

2. the gap cost function; for practical and efficiency reasons, we almost always use affine gap costs, where we have to specify the gap open penalty d and the gap extension penalty e.

The gap parameters are particularly difficult to choose from a theoretical basis. In the following, we assume that the substitution scores are fixed and that d and e are free parameters. Having only two parameters has the additional advantage that we can easily visualize the whole parameter space as a two-dimensional plane.

The study of this parameter space is useful in various ways:

1. Choice of alignment parameters: Assume that two sequences are given, together with their “correct” alignment. Then one can look at the above described plane and test if this alignment occurs, and if it does, for which gap parameters. This way values for the gap parameters can be found that (at least for the given pair of sequences) will lead to a meaningful alignment.

2. Robustness test: The alignments that are found in the vicinity of the point in the (d, e)- plane corresponding to the parameters used to obtain a certain optimal alignment A give an idea about the robustness of that alignment, i.e., whether A would also be optimal if the alignment parameters were chosen a bit differently.

3. Efficient computation of all co-optimal alignments: Knowledge of the shape of the parameter space allows one to avoid redundant computations if all co-optimal alignments are sought. Moreover, one can also restrict these computations to those alignments that have support by wider ranges than just a single (almost arbitrary) point in the parameter space (Gusfield, 1997).

But how is it possible to compute the vicinity of a certain point, or even to find out if and where a given alignment occurs in the plane?

The first approach would probably be to sample the plane, for example by systematically computing optimal alignments for many different parameter settings. This approach has its disadvantages:


• Even if many parameter values are tested, it is still possible that the parameters giving the "true" alignment are not among the tested ones. This could lead to wrong conclusions such as the one that a "biologically true" alignment can never be obtained as a "mathematically optimal" alignment, although only the parameters for obtaining the biologically correct one were not among those used in the sampling procedure.

• Many different parameter settings may yield the same optimal alignment, which is then computed several times. By a more clever strategy, these redundant computations can be avoided.

We will see in the following that there exists a more systematic way to compute the optimal alignments for all parameter settings.

G.2.2 Theory of Parametric Alignment

Our description follows Gusfield (1997, Section 13.1). We first repeat a key observation from Section G.1.

Consider any alignment A. It consists of a certain number n_{a,b}(A) of substitutions (or identities) a → b, where (a, b) ∈ Σ², of a certain number g(A) of contiguous gaps, and of a certain number b(A) of blanks (gap characters). Again, we distinguish between the gap characters (blanks) and the contiguous gaps, which consist of one or more blanks. For each gap's first blank, we pay an opening penalty of d, for each further blank, we pay an extension penalty of e.

$$S(A) = \sum_{(a,b)\in\Sigma^2} n_{a,b}(A)\cdot \mathrm{score}(a,b) \;-\; g(A)\cdot d \;-\; \bigl(b(A) - g(A)\bigr)\cdot e.$$

Observation G.5 For a given alignment A, the score S(A) is a linear function of the parameters d and e; let us write S_A(d, e) := S(A).

For a given alignment, consider the following set of points: P(A) := {(d, e, S_A(d, e))}, where the reasonable values of d and e define the parameter space. Since S_A(d, e) is linear, this set of points is a two-dimensional plane in three-dimensional space.
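In code, the linearity is immediate. The sketch below (our own, with an alignment summarized by its summed substitution score, number of gaps and number of blanks) evaluates S_A(d, e); for fixed counts this is a plain linear expression in d and e, i.e. the plane P(A).

```python
def alignment_score_de(sub, g, b, d, e):
    """S_A(d, e) of a fixed alignment A, summarized (for illustration) by
    sub = summed substitution scores, g = number of gaps, b = number of blanks."""
    return sub - g * d - (b - g) * e

print(alignment_score_de(sub=3, g=1, b=2, d=4, e=2))   # 3 - 1*4 - (2-1)*2 = -3
```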

Lemma G.6 Let A and A′ be two alignments (of the same sequences). If P(A) and P(A′) intersect and are distinct, then there is a line L in the (d, e)-plane along which A and A′ have equal score. If the planes do not intersect, then one of the two alignments always has a better score than the other.

Proof. In 3D-space, two planes either intersect in a line or are parallel. If P(A) and P(A′) intersect, then S_A(d, e) = S_{A′}(d, e) by definition for each (d, e) along the intersection line. If the planes are parallel (and hence do not intersect) and distinct, then always S_A > S_{A′} or vice versa in the whole parameter space. □

The line L divides the (d, e)-plane into two half-planes, inside one of which A has a higher score than A′, and in the other of which A′ has a higher score. The region where A is an optimal alignment is defined by the intersection of many such half-planes (one for each competing alignment) in the (d, e)-plane. The resulting area is necessarily a convex polygon. Thus we have:


Observation G.7 If A is optimal for at least one point p in the (d, e)-plane, then it is optimal either for only this point, or for a line through p, or for a convex polygon that contains p.

Putting things together, we obtain the following result, which is illustrated in Figure G.2.

Figure G.2: Illustration of the (d, e)-plane and its decomposition into convex polygons. The picture was generated with XPARAL (see Section G.2.7 on page 183), with d on the x-axis and e on the y-axis.

Theorem G.8 Given two strings s and t, the (d, e)-plane decomposes into convex polygons, such that any alignment that is optimal for some point p₀ = (d₀, e₀) in the interior of a polygon P₀ is optimal for all points in P₀ and nowhere else.

After these analytical and geometrical considerations, we now formally state the algorithmic problem we would like to solve:

Problem G.9 (Parametric Alignment Problem) Given two sequences s and t and fixed sub- stitution scores, find the polygonal decomposition of the (d, e) parameter space, and for each polygon, find one (or all) of its optimal alignments.

We will solve this problem in several steps. Essential in all of these steps will be a technique called ray search, defined by the following problem statement:

Problem G.10 (Ray Search Problem) Given an alignment A, a point p where A is optimal, and a ray h in the (d, e)-plane starting at p, find the furthest point (call it r∗) from p on ray h where A remains optimal.

G.2.3 Solving the Ray Search Problem

Algorithm G.2 solves the Ray Search Problem. It also returns another alignment A′ that has the same score as A at the boundary point r∗ and hence is co-optimal at this point.

The new point r in the loop is found by first solving the equation S_A(d, e) = S_{A′}(d, e) for d and e, yielding a line in the (d, e)-plane. The intersection point between this line and the ray h is then r.


Algorithm G.2 Newton's ray search algorithm: For a given point p where the given alignment A is optimal and a direction (ray) h, it returns the furthest point r on h in (d, e)-space for which A is still optimal. It also returns an alignment A′ that is co-optimal at point r.
1: set r to the point where h intersects the border of the parameter space
2: set A′ = ⊥ (undefined)
3: while A is not an optimal alignment at point r do
4:   find an optimal alignment A′ at point r
5:   set r to the unique point on h where the value of A equals the value of A′
6: return (r, A′)

Example G.11 Given s = GAG and t = AATTG with fixed scores +3 for matches and −3 for mismatches (use −d − e·(ℓ − 1) for gaps of length ℓ), let p := (4, 2) be a point in the (d, e)-parameter space. Let h be the ray defined by its origin p and the direction (−2.5, −1). Calculate the optimal alignment A of s and t at p, as well as the farthest point on h for which A is still optimal. We denote two consecutive single gaps (d + d) as -d-d and a long gap (d + e) as -d-e.

Given: point p = (4, 2) and the line h(d) = (2/5)·d + 2/5, see Figure G.3.

Figure G.3: Line h(d) with its origin p = (4, 2) and its intersection with the e-axis at r∗ = (0, 2/5).

Step 1: Calculate A1 at p = (4, 2) and its score.

A1 = (AATTG / GA-d-eG),    score_{4,2}(A1) = −3

score(A1) = 2 · match + 1 · mismatch − d − e = 6 − 3 − d − e = 3 − d − e   (i)

Set r∗ to the point where h intersects the border of the parameter space: r∗ = (0, 2/5), and calculate an optimal alignment A2 at r∗. Because A1 ≠ A2, we need to calculate its score:

A2 = (A-dATTG / -dGA-d-dG),    score_{0,2/5}(A2) = 6


score(A2) = 2 · match − 4 · d = 6 − 4 d (ii)

Set score(A1) = score(A2) ⇒ line h′ (all points where A1 and A2 have the same score):

(i) = (ii):  3 − d − e = 6 − 4d  ⇔  e = 3d − 3 =: h′(d)   (iii)

Set h(d) = h′(d) ⇒ intersection point r1 on h with score(A1) = score(A2):

(2/5)·d + 2/5 = 3d − 3
⇔ (13/5)·d = 17/5
⇔ d = 17/13  ⇒  e = 3·(17/13) − 3 = 12/13
⇒ r1 = (17/13, 12/13)   (iv)

Calculate an optimal alignment at r1. However, neither A1 nor A2 is optimal at r1!

Step 2: Continue searching for the correct A. Let A3 be the calculated optimum at point r1. Calculate its score:

A3 = (-dAATTG / GA-d-e-eG),    score_{17/13,12/13}(A3) = 20/13

score(A3) = 2 · match − d − d − 2 · e = 6 − 2d − 2e   (v)


Set score(A1) = score(A3) ⇒ line h″(d):

(i) = (v):  3 − d − e = 6 − 2d − 2e  ⇔  e = −d + 3 =: h″(d)   (vi)

Set h(d) = h″(d) ⇒ intersection point r2 on h with score(A1) = score(A3):

(2/5)·d + 2/5 = −d + 3
⇔ (7/5)·d = 13/5
⇔ d = 13/7  ⇒  e = −13/7 + 3 = 8/7
⇒ r2 = (13/7, 8/7)   (vii)

Calculate an optimum at r2 ⇒ A1 and A3 are both optimal at r2.

We have found our point r2 = (13/7, 8/7), farthest away from p and lying on h, at which A1 is still optimal. □
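The core computation of the ray search, finding where two fixed alignments score equally and where that line meets the ray h, is a small linear computation. The following Python sketch (our own summary of an alignment as a triple (sub, g, b); exact arithmetic via fractions) reproduces the points r1 and r2 of Example G.11.

```python
from fractions import Fraction

def equal_score_point_on_ray(triA, triB, p, direction):
    """Point r on the ray p + t*direction at which two fixed alignments have equal
    score in the (d, e)-plane.  Each alignment is summarized (our assumption) by a
    triple (sub, g, b): summed substitution scores, number of gaps, number of blanks,
    so that S(d, e) = sub - g*d - (b - g)*e (Observation G.5).
    Returns None if the ray is parallel to the equal-score line."""
    subA, gA, bA = triA
    subB, gB, bB = triB
    # S_A(d, e) = S_B(d, e)  <=>  alpha*d + beta*e = gamma
    alpha, beta, gamma = gA - gB, (bA - gA) - (bB - gB), subA - subB
    denom = alpha * direction[0] + beta * direction[1]
    if denom == 0:
        return None
    t = (gamma - alpha * p[0] - beta * p[1]) / denom
    return (p[0] + t * direction[0], p[1] + t * direction[1])

# Example G.11: A1 = (3, 1, 2), A2 = (6, 4, 4), A3 = (6, 2, 4); p = (4, 2), direction (-5/2, -1)
p = (Fraction(4), Fraction(2))
h = (Fraction(-5, 2), Fraction(-1))
print(equal_score_point_on_ray((3, 1, 2), (6, 4, 4), p, h))  # r1 = (17/13, 12/13)
print(equal_score_point_on_ray((3, 1, 2), (6, 2, 4), p, h))  # r2 = (13/7, 8/7)
```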

G.2.4 Finding the Polygon Containing p

The next step is to find the polygon that contains a given point p in (d, e)-space. This is done by Algorithm G.3 on page 182; see also Figure G.4.

The two degenerate cases mentioned in step 3 of Algorithm G.3 are treated as follows.

1. r∗ is on a border of the parameter space: In this case the polygon Poly(A) touches the border of the parameter space and the line L∗ will be that border. The two ray searches in step 4 of Algorithm G.3 move along this border.


Algorithm G.3 Finding the polygon containing p.
1: Start at p with optimal alignment A, whose polygon Poly(A) is to be determined.
2: Solve the ray search problem for an arbitrary ray h starting at p, yielding r∗ and a co-optimal alignment A∗ at r∗.
3: Assume this is no degenerate case (r∗ is not on a border of the parameter space and r∗ is not a vertex of the decomposition); otherwise see text.
4: The alignments A and A∗ define a line L∗ that contains the dividing edge between the two polygons Poly(A) and Poly(A∗). The extent of the edge can be found by solving Newton's ray search problem two more times from the intersection point of h and L∗, once in each direction of L∗.
5: The other edges of Poly(A) are found by more rays pointing from p in directions not intersecting with edges determined in earlier steps. Endpoints of edges have to be identified. When the circle is closed, the procedure stops.

Figure G.4: Finding a whole polygon.

2. r∗ is a vertex of the polygonal decomposition. This will be recognized by the fact that one or both of the ray searches in step 4 of the algorithm will not yield new points. One simply continues with a new ray search from p.

G.2.5 Filling the Parameter Space

Finally, the whole parameter space is filled with polygons as described in Algorithm G.4 on page 183.

G.2.6 Analysis and Improvements

The parametric alignment procedure described above is remarkably efficient, especially when a few small improvements are employed. (Details of the analysis can be found in Gusfield (1997), Section 13.1.)


Algorithm G.4 Filling the whole parameter space
1: Find an alignment A that is optimal in the interior of some (unknown) polygon (and not just on a point or line); this is easily possible with two ray searches.
2: Store in a list L all alignments that share polygon edges with A, where duplicates are identified (through the values n_{a,b}(A), g(A) and b(A)) and removed.
3: while there are unmarked elements in list L do
4:   select an unmarked element A from L
5:   compute its polygon, possibly adding further alignments to L
6:   mark the element A

Let R be the number of polygons, E the number of edges, and V the number of vertices in the decomposition of the parameter space.

To find the edges of a polygon Poly(A) by Algorithm G.3, at most 6·E_A ray searches are necessary, where E_A is the number of edges of Poly(A). (Up to 3·E_A ray searches may be wasted by hitting vertices of the decomposition.) Since each of the E edges in the whole decomposition is shared by two polygons, the overall number of ray searches for all polygons is 12·E.

Further, note that each ray search requires at most R fixed-parameter alignment computations, and hence an upper bound for the overall time complexity is O(ERmn), where m and n are, as usual, the sequence lengths.

This can be improved by applying two ideas:

1. Modify Newton’s algorithm: Do not start with r∗ at the border of the parameter space but at a closer point that can be determined from polygons computed in earlier steps.

2. Clever bookkeeping of already computed alignments, polygons, and edges.

This gives a time complexity of O(R² + Rmn), which is O(R + mn) per polygon.

Further considerations show that often the number of polygons R is of the order mn, so that the time to find each polygon is proportional to just the time for a single alignment computation.

For the (d, e) parameter space, this last assertion R ∈ O(mn) can be seen as follows: Associate with each alignment A the triple (S(A), g(A), b(A)), where, as before, S(A) is the score of alignment A and g(A) and b(A) are the numbers of gaps and blanks in A, respectively. If there exists another alignment A′ with g(A′) = g(A) and b(A′) = b(A) but S(A′) > S(A), then this holds for all choices of (d, e), since the two scores then differ only in their substitution part, which does not depend on d and e; hence A can never be optimal at any point in (d, e)-space. Thus, among all alignments sharing the same values of g(A) and b(A), only one can be optimal at some point in the (d, e)-space. Moreover, there can be at most m + n blanks and max{m, n} + 1 gaps, so the number of relevant (g, b) combinations, and hence R, is in O(mn).

G.2.7 Parametric Alignment in Practice

Gusfield and Stelling (1996) developed a tool called XPARAL that implements the above-mentioned parametric alignment approach. An example can be seen in Figure G.2 on

page 178. The program can be downloaded at http://www.cs.ucdavis.edu/~gusfield/xparall/.

APPENDIX H

Multiple Alignment in Practice: Mostly Progressive

Contents of this chapter: Progressive alignment, guide tree, aligning two align- ments, segment-based alignment, segment identification, diagonal segments, se- lection and assembly, software proposition.

The progressive alignment method is a fast heuristic multiple alignment method that is motivated by the tree alignment approach. The most popular multiple alignment programs follow this strategy.

H.1 Progressive Alignment

The basic idea of progressive alignment is that the multiple alignment is computed in a progressive fashion. In its simplest version, the given sequences s1, s2, . . . , sk are added one after the other to the growing multiple alignment, i.e., first an alignment of s1 and s2 is computed, then s3 is added, then s4, and so on. In order to proceed this way, a method is needed to align a sequence to an already given alignment. Obviously, this is just a special case of aligning two alignments to each other, and in Section H.1.1 we discuss a simple algorithm to do this. In addition, the order in which the sequences are added to the growing alignment can be determined more freely than just following the input order. Often, the most similar sequences are aligned first in order to start with a well-supported, error-free alignment. A more advanced version of progressive alignment that is motivated by the tree alignment approach described in the previous chapter is the progressive alignment along the branches of an alignment guide tree. Like in tree alignment, a phylogenetic tree is given that carries the given sequences at its leaves. However, unlike in tree alignment, no global objective function is optimized, but instead multiple alignments are assigned to the internal nodes in

a greedy fashion, from the leaves towards the root of the tree. Figure H.1 illustrates this procedure.

Figure H.1: Progressive alignment along the branches of a phylogenetic tree. Sequences s1 and s2 are aligned, giving alignment A(1,2). Sequences s4 and s5 give A(4,5), which is then aligned with s3, giving A(3,4,5). Finally, aligning A(1,2) and A(3,4,5) gives the multiple alignment of all sequences, A(1,2,3,4,5).

Benefits of the progressive approach are:

• The method is more efficient than the sum-of-pairs multiple alignment. Most algorithms following this strategy have a quadratic time complexity O(n²k²).

• Since the sequences near each other in the guide tree are supposed to be similar, in the early stages of the algorithm alignments will be calculated where errors are unlikely. This will reduce the overall error rate.

• Motifs that are family specific will be recognized early, and so they will not be obscured by errors made when aligning remote motifs.

However, there are also a few potential disadvantages:

• Early errors cannot be revoked, even if further information becomes available in a later step of the overall procedure, see Figure H.2. Feng and Doolittle (1987) coined the term "once a gap, always a gap" to describe this effect.

• Because of its procedural definition, the progressive alignment approach does not optimize a well-founded global objective function. This means that it is difficult to evaluate the quality of the result, since there is not a single value that is to be maximized or minimized and can be compared to the result of heuristic approaches.

• The method relies on the alignment guide tree. The tree must be known (or computed) before the method can start, and an error in the tree can make a large difference in the resulting alignment. Since multiple alignments are often used as a basis for the construction of phylogenetic trees, we have a typical "chicken and egg" problem here, and in phylogenetic analyses one should be especially careful not to obtain trivial results with progressive alignments.


Figure H.2: In a strict bottom-up progressive computation, it cannot be decided at the time of computing the alignment denoted with the asterisk (∗) whether it should be (A−CG / ATTG) or (AC−G / ATTG). Only the rest of the tree indicates that the second variant is probably the correct one.

H.1.1 Aligning Two Alignments

An important subprocedure of the progressive alignment method is to align two existing multiple alignments.

Since the two alignments are fixed, this is a pairwise alignment procedure, with the only extension that the two entities to be aligned are not sequences of letters but sequences of alignment columns.

Therefore, it is necessary to provide a score for the alignment of two alignment columns. One way to define such an extended score is to add up all pairwise (letter-letter) scores between the two columns.

For example, consider the two alignment columns

(A, G, T, −)ᵀ (a column of an alignment with four rows)    and    (A, G)ᵀ (a column of an alignment with two rows).

In a unit cost scenario with match cost cost(c, c) = 0 and mismatch and indel cost cost(c, c′) = 1 for c ≠ c′, c, c′ ∈ Σ ∪ {−}, an alignment of these two columns would be scored cost(A, A) + cost(A, G) + cost(G, A) + cost(G, G) + cost(T, A) + cost(T, G) + cost(−, A) + cost(−, G) = 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 = 6.

In general, the cost of aligning a column A_i of an alignment A with k_A rows and a column B_j of an alignment B with k_B rows is

$$\mathrm{cost}(A_i, B_j) = \sum_{x \in \{1,\dots,k_A\}} \; \sum_{y \in \{1,\dots,k_B\}} \mathrm{cost}(A_i[x], B_j[y]).$$

Based on such a column-column score, the dynamic programming algorithm can be performed as in the case of the alignment for two sequences.


The following example illustrates the procedure for the alignment of

A1 = (TAG / G−C)    and    A2 = (ATCAG / AGC−G),

using the unit cost model described above.

              −   (A,A) (T,G) (C,C) (A,−) (G,G)
      −       0     4     8    12    14    18
    (T,G)     4     4     6    10    12    16
    (A,−)     6     6     8     …
    (G,C)    10     …
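A minimal Python sketch of this dynamic program (unit costs and linear gap treatment only; the function names are ours, not those of any particular tool) reproduces the cost matrix above.

```python
def column_cost(col_a, col_b):
    """Sum-of-pairs unit cost between two alignment columns (tuples of characters,
    '-' for a blank): every pair contributes cost 1 unless the two entries are equal."""
    return sum(0 if x == y else 1 for x in col_a for y in col_b)

def align_alignments(A, B):
    """Global DP over the columns of two alignments A and B (lists of equal-length
    row strings), as in Section H.1.1; returns the full cost matrix."""
    cols_a = list(zip(*A))            # columns of A, e.g. ('T', 'G')
    cols_b = list(zip(*B))
    gap_a = ('-',) * len(A)           # an all-gap column on A's side
    gap_b = ('-',) * len(B)
    n, m = len(cols_a), len(cols_b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + column_cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + column_cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1]),
                D[i - 1][j] + column_cost(cols_a[i - 1], gap_b),
                D[i][j - 1] + column_cost(gap_a, cols_b[j - 1]),
            )
    return D

A1 = ["TAG", "G-C"]
A2 = ["ATCAG", "AGC-G"]
print(align_alignments(A1, A2))
# first rows: [0, 4, 8, 12, 14, 18], [4, 4, 6, 10, 12, 16], ... as in the table above
```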

The total time complexity of the algorithm is O(n_A n_B k_A k_B), where n_A and n_B are the alignment lengths and, as above, k_A and k_B are the numbers of rows in the alignments A and B, respectively. However, it can be reduced to O(n_A n_B) if the alphabet Σ = (c_1, c_2, . . . , c_E) is of constant size E and an alignment column A_i is stored as a vector (a_{c_1}, a_{c_2}, . . . , a_{c_E}, a_–), where a_c is the number of occurrences of character c in A_i, such that the cost of aligning two columns A_i = (a_{c_1}, . . . , a_{c_E}, a_–) and B_j = (b_{c_1}, . . . , b_{c_E}, b_–) can be written as

$$\mathrm{cost}(A_i, B_j) = \sum_{c, c' \in \Sigma\cup\{-\}} a_c \cdot b_{c'} \cdot \mathrm{cost}(c, c').$$
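A small sketch of this count-vector (profile) variant of the column cost, again with hypothetical helper names; per cell it needs only O(|Σ ∪ {−}|²) operations, independent of k_A and k_B.

```python
from collections import Counter

def column_profile(col, alphabet="ACGT-"):
    """Count vector (a_c) over the alphabet for one alignment column."""
    counts = Counter(col)
    return [counts.get(c, 0) for c in alphabet]

def profile_cost(prof_a, prof_b, cost, alphabet="ACGT-"):
    """cost(A_i, B_j) computed from the two count vectors."""
    return sum(
        prof_a[x] * prof_b[y] * cost(alphabet[x], alphabet[y])
        for x in range(len(alphabet))
        for y in range(len(alphabet))
    )

unit = lambda c, d: 0 if c == d else 1
print(profile_cost(column_profile("TG"), column_profile("AA"), unit))  # 4, as in the table above
```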

Adding affine gap costs is possible, but the exact treatment is rather complicated, in fact NP-hard (Kececioglu and Starrett, 2004). That is why usually heuristic approaches are used for affine gap costs in the alignment of two alignments.

H.2 Segment-Based Alignment

The global multiple alignment methods mentioned so far, like the progressive methods or the simultaneous method DCA, have properties that are not necessarily desirable in all applications.

In particular, the alignment depends on many parameters, like the cost function or substitution matrix, gap penalties, and a scoring function (e.g. sum of pairs) for the multiple alignment. Apart from the problem that these parameters need to be chosen in advance, the global methods have the well-known weakness that local similarities may be aligned in a wrong way if they are not significant in a global context.


For example, the following alignment shows a potential misalignment that may be caused by the choice of the affine gap penalties:

A = (AFATCATCA / ACAT−−−−A).

If instead of individual positions, whole units of local similarities are compared and used to build a global multiple alignment, the result will depend much less on the particular alignment parameters, and instead reflect more the local similarities in the data, like in the following example:

A′ = (AFATCATCA / A−−−CAT−A).

This is the idea of segment-based sequence alignment.

H.2.1 Segment Identification

The first step in a segment-based alignment approach is to identify the segments. There are different possibilities to obtain segments.

The possibly simplest strategy, followed by the program TWOALIGN (Abdeddaïm, 1997), is to use local alignments as computed by the Smith-Waterman algorithm. A disadvantage of this approach is that the local alignment computation also requires the usual alignment parameters like score function and gap penalties.

An alternative approach is to use gap-free local alignments. In this case, the segments correspond to diagonal segments of the edit matrix. Since gap-free alignments can be computed faster than gapped alignments, more time can be spent on the computation of their score. The approach followed by Morgenstern et al. (Morgenstern et al., 1996; Morgenstern, 1999) in the DIALIGN method is to use a statistical objective function that assigns the score P_D(l_D, s_D) to a diagonal D of length l_D with s_D matches.

This score is the probability that a random segment of length l_D has at least s_D matches:

$$P_D(l_D, s_D) = \sum_{i=s_D}^{l_D} \binom{l_D}{i}\, p^i\, (1-p)^{l_D - i},$$

where p = 0.25 for nucleotides and p = 0.05 for amino acids. The weight w_D of a diagonal D is then defined as the negative logarithm of its score,

$$w_D = -\ln(P_D).$$

(In more statistical terms, the score would be called the p-value of the diagonal, and the weight would be called the score.)
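A small Python sketch of this diagonal score and weight (the function names are ours; DIALIGN's actual implementation may differ):

```python
from math import comb, log

def diagonal_pvalue(l, s, p):
    """P_D(l_D, s_D): probability that a random segment of length l has at least s matches."""
    return sum(comb(l, i) * p**i * (1 - p)**(l - i) for i in range(s, l + 1))

def diagonal_weight(l, s, p=0.25):
    """w_D = -ln(P_D); p = 0.25 for nucleotides, 0.05 for amino acids."""
    return -log(diagonal_pvalue(l, s, p))

print(diagonal_weight(10, 8))   # weight of a length-10 DNA diagonal with 8 matches
```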

The time to compute all diagonals is O(n³), but this can be reduced to O(n²) if the maximal length of a diagonal segment is bounded by a constant.


H.2.2 Segment Selection and Assembly

After a set of segments is identified, it is not necessarily possible to assemble them into a global alignment, since they might be inconsistent, i.e. the inclusion of some of the segments does not allow the inclusion of other segments into the alignment. In the case of two sequences, this is the case if and only if two segments "cross", as the two segments "ON" and "ANA" shown in Figure H.3.

Figure H.3: Inconsistent pairwise alignment segments.

More generally, a set of segments of multiple sequences s1, s2, . . . , sk is inconsistent if there exists no multiple alignment of these sequences that induces all these segments.

Several methods exist to select from a set of segments a subset of consistent segments. For example, the methods TWOALIGN and DIALIGN employ a greedy algorithm that considers one maximum scoring pairwise local alignment after the other (in order of decreasing alignment weight) and, if it is consistent with the previously assembled segments, adds it to the solution.

Given a set of consistent segments, in general there will still remain unaligned characters that are not contained in any of the consistent segments. In order to arrive at a global multiple alignment, it is desirable to also add these characters to the alignment. Different strategies can be followed. The DIALIGN method simply fills the remaining regions with gaps, leaving those characters unaligned that are not contained in any segment. Another possibility is to re-align the regions between the segments. This is not so easy, though, because the region "between the segments" is not clearly defined. In any case, a global alignment is to be computed that respects the segments as fixed anchors, while freedom is still given for the remaining characters, as long as the resulting alignment is consistent with the segments. It has been suggested to align the sequences progressively (Myers, 1996) or simultaneously (Sammeth et al., 2003a).

H.3 Software for Progressive and Segment-Based Alignment

H.3.1 The Program Clustal W

The series of Clustal programs was developed by Julie Thompson and Des Higgins. The initial version was Clustal V (Higgins and Sharp, 1992), followed by Clustal W (Thompson

et al., 1994). Later, an X windows interface was introduced, the package now being called Clustal X (Thompson et al., 1997). The newest member of this family is Clustal Ω (Sievers et al., 2011).

In principle, the Clustal algorithm follows quite precisely the idea of progressive multiple alignment along a guide tree as described above, although the actual implementation is enhanced by several features regarding the optimized computation of a reasonable guide tree and the automatic selection of scoring schemes, which we do not discuss here.

The steps of the algorithm are the following:

1. By pairwise comparison of all input sequences s1, s2, . . . , sk, compute all pairwise optimal alignment similarity scores s(si, sj).

2. From these scores, compute an alignment guide tree using the neighbor-joining algorithm (Saitou and Nei, 1987).

3. Finally, progressively compute the multiple alignment, guided by the branching order of the tree.

Clustal W includes much expert knowledge (optimized use of different protein scoring matrices, affine gap costs, etc.), so that the alignments are not only quickly computed but also of high quality, often better than those just optimized under some theoretical objective function. This and its easy-to-use interface are probably the main reasons for the great success of the program.

H.3.2 T-COFFEE

Another method to compute high-quality multiple alignments that is becoming more and more popular is the program T-COFFEE (Notredame et al., 2000). The procedure consists of three steps.

In the first step, a primary library of local and/or global alignments is computed. These alignments can be created in any way. Sequences can be contained in several alignments, and the different alignments do not need to be consistent with each other. The primary library can consist, for example, of optimal pairwise global alignments, optimal pairwise local alignments, suboptimal local alignments, heuristic multiple alignments, possibly several ones computed with different scoring schemes, etc. In the default setting, T-COFFEE uses Clustal W to create one global multiple alignment and also to compute global pairwise alignments for all pairs of sequences.

Then, in the second phase, the alignments from the primary library are combined to produce a position-specific library that records, for each pair of aligned sequence positions from the primary library, a weight, i.e., the number of alignments of the primary library that support the pairing of these two positions. An advantage is that in this extension phase no substitution matrix is used, and so different scoring schemes from the construction of the primary library will not lose their influence on the final result.

In the third phase, ideally one would like to compute a maximum weight alignment (Kececioglu, 1993) from the (weighted) alignments of the extended library. Unfortunately this is a computationally hard problem. This is why a heuristic is used that is similar to the progressive alignment strategy described in the previous section.

H.3.3 DIALIGN

DIALIGN is a practical implementation of segment-based alignment, enhanced in several ways. By taking into account the information from various sequences during the selection of consistent segments, DIALIGN uses an optimized objective function in order to enhance the score of diagonals with a weak, but consistent signal over many sequences. The overlap weight of a diagonal D is defined as

$$\mathrm{olw}(D) = w_D + \sum_{E \in \mathcal{D}} \tilde{w}(D, E),$$

where 𝒟 is the set of all diagonals and w̃(D1, D2) := w_{D3} is the weight of the (implied) overlap D3 of the two diagonals D1 and D2.

For coding DNA sequences, it is possible that the program automatically translates the codons into amino acids and then scores the diagonal with the amino acid scoring scheme.

In its second version, DIALIGN 2.0, the statistical score for diagonals was changed in order to upweight longer diagonals. Also, the algorithm for consistency checking and greedy addition of diagonals into the growing multiple alignment has been improved several times.

H.3.4 MUSCLE

MUSCLE is public domain multiple alignment software for protein and nucleotide sequences. MUSCLE stands for Multiple Sequence Comparison by Log-Expectation. It was designed by Robert C. Edgar (Edgar, 2004a,b) and can be found at http://www.drive5.com/muscle/. MUSCLE performs an iterated progressive alignment strategy and works in three stages. At the completion of each stage, a multiple alignment is available and the algorithm can be terminated.

Stage 1: Draft progressive. The first stage builds a progressive alignment, similar to Clustal.

• Similarity measure: The similarity of each pair of input sequences is computed, either using k-mer counting or by constructing a global alignment of the pair and determining the fractional identity.

• Distance estimate: A triangular distance matrix D1 is computed from the pairwise similarities.

• Tree construction: A tree T1 is constructed from D1 using UPGMA or neighbor-joining, and a root is identified.


• Progressive alignment: A progressive alignment is built by a post-order traversal of T1, yielding a multiple alignment MSA1 of all input sequences at the root.

Stage 2: Improved progressive. The second stage attempts to improve the tree and builds a new progressive alignment according to this tree. This stage may be iterated. The main source of error in the draft progressive stage is the approximate k-mer distance measure, which results in a suboptimal tree. MUSCLE therefore re-estimates the tree using the Kimura distance, which is more accurate but requires an alignment.

• Similarity measure: The similarity of each pair of sequences is computed using fractional identity computed from their mutual alignment in the current multiple alignment.

• Tree construction: A tree T2 is constructed by computing a Kimura distance matrix D2 (Kimura distance for each pair of input sequences from MSA1) and applying a clustering method (UPGMA) to this matrix.

• Tree comparison: Compare T1 and T2, identifying the set of internal nodes for which the branching order has changed. If Stage 2 has executed more than once and the number of changed nodes has not decreased, the process of improving the tree is considered to have converged and iteration terminates.

• Progressive alignment: A new progressive alignment is built. The existing alignment is retained for each subtree whose branching order is unchanged; new alignments are created for the (possibly empty) set of changed nodes. When the alignment at the root (MSA2) is completed, the algorithm may terminate, repeat this stage, or go to Stage 3.

Stage 3: Refinement. The third stage performs iterative refinement using a variant of tree-dependent restricted partitioning (Hirosawa et al., 1995).

• Choice of bipartition: An edge is deleted from T2, dividing the sequences into two disjoint subsets (a bipartition); edges are visited in a bottom-up traversal.

• Profile extraction: The multiple alignment of each subset is extracted from the current multiple alignment. Columns containing no characters (i.e., indels only) are discarded.

• Re-alignment: A new multiple alignment is produced by re-aligning the two multiple alignments to each other using the technique described in Section H.1.1 on page 187 (Aligning Two Alignments).

• Accept/reject: The SP-score of the multiple alignment implied by the new alignment is computed, and

$$\mathrm{MSA}_2 = \begin{cases}\mathrm{MSA}_2\ \text{(accept)}, & \text{if } S_{SP}(\mathrm{MSA}_2) > S_{SP}(\mathrm{MSA}_1),\\ \mathrm{MSA}_1\ \text{(discard)}, & \text{otherwise.}\end{cases}$$

Stage 3 is repeated until convergence or until a user-defined limit is reached. Visiting edges in order of decreasing distance from the root has the effect of first realigning individual sequences, then closely related groups.


We refer to the first two stages alone as MUSCLE-p, which produces MSA2. MUSCLE-p has time complexity O(N²L + NL²) and space complexity O(N² + NL + L²). Refinement adds an O(N³L) term to the time complexity.

H.3.5 QAlign

A comfortable system that includes several alignment algorithms and a flexible alignment editor is QAlign (Sammeth et al., 2003b), or rather its latest version, QAlign2 (panta rhei). It can be found at http://gi.cebitec.uni-bielefeld.de/QAlign.

Bibliography

S. Abdeddaïm. On incremental computation of transitive closure and greedy alignment. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching, CPM 1997, volume 1264 of LNCS, pages 167–179, 1997.

D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Verlag, 2008.

S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic Local Alignment Search Tool (BLAST). Journal of Molecular Biology, 215:403–410, 1990.

S. F. Altschul. Gap costs for multiple sequence alignment. J. Theor. Biol., 138:297–309, 1989.

S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, Sep 1997.

A. N. Arslan, O. Egecioglu, and P. A. Pevzner. A new approach to sequence comparison: normalized sequence alignment. Bioinformatics, 17(4):327–337, 2001.

V. Bafna, E. L. Lawler, and P. A. Pevzner. Approximation algorithms for multiple sequence alignment. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, CPM 1994, volume 807 of LNCS, pages 43–53, 1994.

S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron. q-gram based database searching using a suffix array (QUASAR). In Proc. of the Third Annual International Conference on Computational Molecular Biology, RECOMB 1999, pages 77–83, 1999.

H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48(5):1073–1082, 1988.


P. Chain, S. Kurtz, E. Ohlebusch, and T. Slezak. An applications-focused review of comparative genomics tools: Capabilities, limitations and future challenges. Briefings in Bioinformatics, 4(2):105–123, 2003.

J.-M. Claverie and C. Notredame. Bioinformatics for Dummies. John Wiley & Sons (For Dummies series), 2nd edition, 2007.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.

A. C. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna. Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Res., 14(7):1394–1403, 2004.

A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. Alignment of whole genomes. Nucleic Acids Res., 27(11):2369–2376, 1999.

A. L. Delcher, A. Phillippy, J. Carlton, and S. L. Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30(11):2478–2483, 2002.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.

R. C. Edgar. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5:113, Aug 2004a.

R. C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32(5):1792–1797, 2004b.

D.-F. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25:351–360, 1987.

W. M. Fitch. Toward defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool., 20(4):406–416, 1971.

Y. Frid and D. Gusfield. A simple, practical and complete O(n³/log n)-time algorithm for RNA folding using the Four-Russians speedup. In Proceedings of WABI 2009, volume 5724 of LNBI, pages 97–107, 2009.

O. Gascuel, editor. Mathematics of Evolution and Phylogeny. Oxford University Press, 2005.

O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162: 705–708, 1982.

A. K. Gupta. On a ”square” functional equation. Pacific Journal of Mathematics, 50(2): 449–454, 1974.

S. K. Gupta, J. D. Kececioglu, and A. A. Schäffer. Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Comp. Biol., 2(3):459–472, 1995.

D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19–28, 1991.

D. Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull. Math. Biol., 55(1):141–154, 1993.


D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.

D. Gusfield and P. Stelling. Parametric and inverse-parametric sequence alignment with xparal. Methods Enzymol, 266:481–494, 1996.

D. G. Higgins and P. M. Sharp. Clustal V: Improved software for multiple sequence alignment. CABIOS, 8:189–191, 1992.

M. Hirosawa, Y. Totoki, M. Hoshida, and M. Ishikawa. Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci, 11(1):13–18, Feb 1995.

D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, 1975.

M. Höhl, S. Kurtz, and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics, 18(Suppl. 1):312–320, 2002. (Proceedings of ISMB 2002).

X. Huang and W. Miller. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12:337–357, 1991.

T. J. P. Hubbard, A. M. Lesk, and A. Tramontano. Gathering them into the fold. Nature Structural Biology, 4:313, 1996.

T. Jiang, E. L. Lawler, and L. Wang. Aligning sequences via an evolutionary tree: Complexity and approximation. In Conf. Proc. 26th Annu. ACM Symp. Theory Comput., STOC 1994, pages 760–769, 1994.

J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Proceedings of the 13th International Conference on Automata, Languages and Programming (ICALP), volume 2719 of LNCS, pages 943–955, 2003.

R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, trees and arrays. In Conf. Proc. 4th Annu. ACM Symp. Theory Comput., STOC 1972, pages 125–136, 1972.

T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM), volume 2089 of LNCS, pages 181–192, 2001.

J. Kececioglu. The maximum weight trace problem in multiple sequence alignment. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, CPM 1993, volume 684 of LNCS, pages 106–119, 1993.

J. Kececioglu and D. Starrett. Aligning alignments exactly. In Proc. of the Eighth Annual International Conference on Computational Molecular Biology, RECOMB 2004, pages 85–96, 2004.

W. J. Kent. Blat–the blast-like alignment tool. Genome Res, 12(4):656–664, Apr 2002.

B. Knudsen. Optimal multiple parsimony alignment with affine gap cost using a phylogenetic tree. In Proceedings of the Third International Workshop on Algorithms in Bioinformatics, WABI 2003, volume 2812 of LNBI, pages 433–446, 2003.


S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg. Versatile and open software for comparing large genomes. Genome Biol., 5:R 12, 2004.

J. Kärkkäinen. Fast BWT in small space by blockwise suffix sorting. Theor. Comput. Sci., 387(3):249–257, 2007.

B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10:R25, 2009.

H. T. Laquer. Asymptotic limits for a two-dimensional recursion. Stud. Appl. Math., 64: 271–277, 1981.

H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, 2010.

U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the first annual ACM-SIAM Symposium on Discrete Algorithms, pages 319–327. SIAM, January 1990.

E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2): 262–272, 1976.

B. Morgenstern. DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15(3):211–218, 1999.

B. Morgenstern, A. W. M. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93(22): 12098–12103, 1996.

D. W. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, 2nd edition, 2004.

T. Müller, S. Rahmann, and M. Rehmsmeier. Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics, 17(Suppl 1):182–189, 2001.

E. W. Myers. Approximate matching of network expressions with spacers. J. Comp. Biol., 3(1):33–51, 1996.

E. W. Myers and W. Miller. Optimal alignments in linear space. Comput Appl Biosci, 4(1): 11–17, Mar 1988.

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3):443–453, 1970.

C. Notredame, D. G. Higgins, and J. Heringa. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302:205–217, 2000.

R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA, 77:6903–6913, 1980.


K. Rasmussen, J. Stoye, and E. W. Myers. Efficient q-gram filters for finding all ε-matches over a given length. J. Comp. Biol., 13(2):296–308, 2006.

N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406–425, 1987.

M. Sammeth, B. Morgenstern, and J. Stoye. Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics, 19(Suppl. 2):ii189–ii195, 2003a. (Proceedings of ECCB 2003).

M. Sammeth, J. Rothgänger, W. Esser, J. Albert, J. Stoye, and D. Harmsen. QAlign: Quality-based multiple alignments with dynamic phylogenetic analysis. Bioinformatics, 19(12):1592–1593, 2003b.

D. Sankoff. Minimal mutation trees of sequences. SIAM J. Appl. Math., 28(1):35–42, 1975.

B. Schwikowski and M. Vingron. The deferred path heuristic for the generalized tree alignment problem. J. Comp. Biol., 4(3):415–431, 1997.

P. H. Sellers. Theory and computation of evolutionary distances: Pattern recognition. J. Algorithms, 1:359–373, 1980.

F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Söding, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol., 7(1), 2011.

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147(1):195–197, Mar 1981.

J. Stoye. Multiple sequence alignment with the divide-and-conquer method. Gene, 211(2): GC45–GC56, 1998.

W. R. Taylor and D. T. Jones. Deriving an amino acid distance matrix. J Theor Biol, 164 (1):65–83, Sep 1993.

J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22(22):4673–4680, 1994.

J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. The ClustalX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res., 24:4876–4882, 1997.

E. Ukkonen. Finding approximate patterns in strings. J. Algorithms, 6:132–137, 1985.

E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995.

M. Vingron and A. von Haeseler. Towards integration of multiple alignment and phylogenetic tree construction. J. Comp. Biol., 4(1):23–34, 1997.

L. Wang and T. Jiang. On the complexity of multiple sequence alignment. J. Comp. Biol., 1(4):337–348, 1994.


L. Wang, T. Jiang, and E. L. Lawler. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica, 16:302–315, 1996.

M. S. Waterman and M. Eggert. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J. Mol. Biol., 197:723–728, 1987.

M. S. Waterman, T. F. Smith, and W. A. Beyer. Some biological sequence metrics. Adv. Math., 20:367–387, 1976.

M. S. Waterman, S. Tavaré, and R. C. Deonier. Computational Genome Analysis: An Introduction. Springer, 2nd edition, 2005.
