Bioinformatics I: Sequence Analysis and Phylogenetics

Lecture Notes
Institute of Bioinformatics, Johannes Kepler University Linz

Bioinformatics I
Sequence Analysis and Phylogenetics
Winter Semester 2016/2017

by Sepp Hochreiter

Institute of Bioinformatics
Johannes Kepler University Linz
A-4040 Linz, Austria
Tel. +43 732 2468 4520
Fax +43 732 2468 4539
http://www.bioinf.jku.at

© 2008 Sepp Hochreiter

This material, no matter whether in printed or electronic form, may be used for personal and educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the author.

Legend

(!): explained later in the text, forward reference
italic: important term (in most cases explained)

Literature

D. W. Mount, Bioinformatics: Sequence and Genome Analysis, CSHL Press, 2001.
D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge Univ. Press, 1999.
R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge Univ. Press, 1998.
M. Waterman, Introduction to Computational Biology, Chapman & Hall, 1995.
Setubal and Meidanis, Introduction to Computational Molecular Biology, PWS Publishing, 1997.
Pevzner, Computational Molecular Biology, MIT Press, 2000.
J. Felsenstein, Inferring Phylogenies, Sinauer, 2004.
W. Ewens, G. Grant, Statistical Methods in Bioinformatics, Springer, 2001.
M. Nei, S. Kumar, Molecular Evolution and Phylogenetics, Oxford, 2000.
Blast: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Contents

1 Biological Basics
  1.1 The Cell
  1.2 Central Dogma of Molecular Biology
  1.3 DNA
  1.4 RNA
  1.5 Transcription
    1.5.1 Initiation
    1.5.2 Elongation
    1.5.3 Termination
  1.6 Introns, Exons, and Splicing
  1.7 Amino Acids
  1.8 Genetic Code
  1.9 Translation
    1.9.1 Initiation
    1.9.2 Elongation
    1.9.3 Termination
  1.10 Folding

2 Bioinformatics Resources
  2.1 Data Bases
  2.2 Software
  2.3 Articles

3 Pairwise Alignment
  3.1 Motivation
  3.2 Sequence Similarities and Scoring
    3.2.1 Identity Matrix
    3.2.2 PAM Matrices
    3.2.3 BLOSUM Matrices
    3.2.4 Gap Penalties
  3.3 Alignment Algorithms
    3.3.1 Global Alignment: Needleman-Wunsch
      3.3.1.1 Linear Gap Penalty
      3.3.1.2 Affine Gap Penalty
      3.3.1.3 KBand Global Alignment
    3.3.2 Local Alignment: Smith-Waterman
    3.3.3 Fast Approximations: FASTA, BLAST and BLAT
      3.3.3.1 FASTA
      3.3.3.2 BLAST
      3.3.3.3 BLAT
  3.4 Alignment Significance
    3.4.1 Significance of HSPs
    3.4.2 Significance of Perfect Matches

4 Multiple Alignment
  4.1 Motivation
  4.2 Multiple Sequence Similarities and Scoring
    4.2.1 Consensus and Entropy Score
    4.2.2 Tree and Star Score
    4.2.3 Weighted Sum of Pairs Score
  4.3 Multiple Alignment Algorithms
    4.3.1 Exact Methods
    4.3.2 Progressive Algorithms
      4.3.2.1 ClustalW
      4.3.2.2 TCoffee
    4.3.3 Other Multiple Alignment Algorithms
      4.3.3.1 Center Star Alignment
      4.3.3.2 Motif- and Profile-based Methods
      4.3.3.3 Probabilistic and Model-based Methods
      4.3.3.4 Divide-and-conquer Algorithms
  4.4 Profiles and Position Specific Scoring Matrices

5 Phylogenetics
  5.1 Motivation
    5.1.1 Tree of Life
    5.1.2 Molecular Phylogenies
    5.1.3 Methods
  5.2 Maximum Parsimony Methods
    5.2.1 Tree Length
    5.2.2 Tree Search
      5.2.2.1 Branch and Bound
      5.2.2.2 Heuristics for Tree Search
        5.2.2.2.1 Stepwise Addition Algorithm
        5.2.2.2.2 Branch Swapping
        5.2.2.2.3 Branch and Bound Like
    5.2.3 Weighted Parsimony and Bootstrapping
    5.2.4 Inconsistency of Maximum Parsimony
  5.3 Distance-based Methods
    5.3.1 UPGMA
    5.3.2 Least Squares
    5.3.3 Minimum Evolution
    5.3.4 Neighbor Joining
    5.3.5 Distance Measures
      5.3.5.1 Jukes Cantor
      5.3.5.2 Kimura
      5.3.5.3 Felsenstein / Tajima-Nei
      5.3.5.4 Tamura
      5.3.5.5 Hasegawa (HKY)
      5.3.5.6 Tamura-Nei
  5.4 Maximum Likelihood Methods
  5.5 Examples

A Amino Acid Characteristics

B A*-Algorithm

C Examples
  C.1 Pairwise Alignment
    C.1.1 PAM Matrices
    C.1.2 BLOSUM Matrices
    C.1.3 Global Alignment: Needleman-Wunsch
      C.1.3.1 Linear Gap Penalty
      C.1.3.2 Affine Gap Penalty
    C.1.4 Local Alignment: Smith-Waterman
      C.1.4.1 Linear Gap Penalty
      C.1.4.2 Affine Gap Penalty
  C.2 Phylogenetics
    C.2.1 UPGMA
    C.2.2 Neighbor Joining

List of Figures

1.1 Prokaryotic cells of bacterium and cyanophyte (photosynthetic bacteria).
1.2 Eukaryotic cell of a plant.
1.3 Cartoon of the "human genome project".
1.4 Central dogma is depicted.
1.5 The deoxyribonucleic acid (DNA) is depicted.
1.6 The 5 nucleotides.
1.7 The hydrogen bonds between base pairs.
1.8 The base pairs in the double helix.
1.9 The DNA is depicted in detail.
1.10 The storage of the DNA in the nucleus.
1.11 The storage of the DNA in the nucleus as cartoon.
1.12 The DNA is right-handed.
1.13 The difference between RNA and DNA is depicted.
