Downloaded from Downloads/Clades
Total Page:16
File Type:pdf, Size:1020Kb
UC Berkeley UC Berkeley Electronic Theses and Dissertations Title Statistical phylogenetic methods with applications to virus evolution Permalink https://escholarship.org/uc/item/362312kp Author Westesson, Oscar Publication Date 2012 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California Statistical phylogenetic methods with applications to virus evolution by Oscar Westesson A dissertation submitted in partial satisfaction of the requirements for the degree of Joint Doctor of Philosophy with University of California, San Francisco in Bioengineering in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Ian Holmes, Chair Professor Rasmus Nielsen Professor Joe DeRisi Fall 2012 Statistical phylogenetic methods with applications to virus evolution Copyright 2012 by Oscar Westesson 1 Abstract Statistical phylogenetic methods with applications to virus evolution by Oscar Westesson Joint Doctor of Philosophy with University of California, San Francisco in Bioengineering University of California, Berkeley Ian Holmes, Chair This thesis explores methods for computational comparative modeling of genetic sequences. The framework within which this modeling is undertaken is that of sequence alignments and associated phylogenetic trees. The first part explores methods for building ancestral sequence alignments making explicit use of phylogenetic likelihood functions. New capabilities of an existing MCMC alignment sampler are discussed in detail, and the sampler is used to analyze a set of HIV/SIV gp120 proteins. An approximate maximum-likelihood alignment method is presented, first in a tutorial-style format and later in precise mathematical terms. An implementation of this method is evaluated alongside leading alignment programs. The second part describes methods utilizing multiple sequence alignments. First, mutation rate is used to predict positional mutational sensitivities for a protein. Second, the flexible, automated model-specification capabilities of the XRate software are presented. The final chapter presents recHMM, a method to detect recombination among sequence by use of a phylogenetic hidden Markov model with a tree in each hidden state. i F¨ormin farmor och farfar, som alltid undrade varf¨orjag aldrig blev doktor. ii Acknowledgments This dissertation contains much of the work I did while in the lab of Ian Holmes at UC Berkeley. I am deeply indebted to Ian for taking me on as a student and providing a quiet example of what it means to be a modern computational scientist. The folks in and around the lab during my time there (anb before) were absolutely essential, each in their own ways, to my growth as a scientist and person. Mitch Skinner, Robert Bradley, Lars Barquist, Andrew Uzilov, and Alex James Hughes: thanks for the help, lunches, and pranks. Members of the Andino and DeRisi labs helped deepen my interest in virology and keep my ideas grounded in biology: Raul Andino, Joe DeRisi, Charles Runckel, Michael Schulte, and Cecily Burrill. Without the following teachers along the way I would perhaps never have come to love pure and applied maths as much as I do: Rich \R" Kelly, Naomi Jochnowitz, and Bernd Sturmfels. The very tricky transfer from University of Rochester to UC Berkeley would likely not have happened were it not for Maritza Aguilar, an extremely helpful and thorough admissions counselor. iii Contents Contents iii List of Figures v List of Tables viii 1 Introduction 1 I Methods for multiple sequence alignment 12 2 Approximate alignment with transducers: theory 13 2.1 Background . 14 2.2 Informal tutorial on transducer composition . 17 2.3 Formal definitions . 66 2.4 Conclusions . 97 2.5 Methods . 98 3 Approximate alignment with transducers: simulation-based evaluation 100 3.1 Background . 101 3.2 Results . 103 3.3 Discussion . 107 3.4 Methods . 109 4 HandAlign: MCMC for exact Bayesian inference of alignment, tree, and parameters 117 4.1 Background . 118 4.2 Sampling alignments, trees, and parameters . 118 4.3 Capabilities . 119 iv II Methods utilizing a multiple sequence alignment 123 5 Predicting the functional effects of polymorphisms using mutation rate 124 5.1 Background . 125 5.2 Results . 126 5.3 Discussion . 127 5.4 Materials and methods . 130 6 Modeling genomic features using phylo-grammars 133 6.1 Background . 134 6.2 Methods . 136 6.3 Results and discussion . 138 7 RecHMM: Detecting phylogenetic recombination 153 7.1 Background . 154 7.2 Results . 157 7.3 Discussion . 168 7.4 Methods . 171 Bibliography 177 III Appendices 192 A Experimental connections 193 A.1 Recombination map via high-throughput sequencing in poliovirus . 195 A.2 Probing RNA secondary structure via SHAPE in poliovirus . 198 A.3 Reconstructing an ancestral Adeno-associated virus sequence for synthesis . 201 B ProtPal simulation study and OPTIC data analysis 208 C Additional phenotype prediction figures 220 D Table of DART-Scheme functions 226 E Glossary of XRate terminology 229 F RecHMM simulation and methodological details 234 F.1 Additional simulation results . 234 F.2 Methodological details . 239 v List of Figures 1.1 Outline of document . 3 1.2 Pairwise alignment . 4 1.3 Multiple alignment . 4 1.4 Ancestral alignment . 5 2.1 Tutorial: cs-mf-liv-tree . 18 2.2 Tutorial: cs-mf-liv-machines . 19 2.3 Tutorial: mf-generator . 20 2.4 Tutorial: liv-small . 20 2.5 Tutorial: fanned-emission . 22 2.6 Tutorial: condensed-emission . 23 2.7 Tutorial: legend . 24 2.8 Tutorial: transitions . 25 2.9 Tutorial: moore-mf-generator . 26 2.10 Tutorial: liv-labeled . 27 2.11 Tutorial: mf-labeled . 27 2.12 Tutorial: cs-labeled . 27 2.13 Tutorial: null-model . 28 2.14 Tutorial: substituter . 28 2.15 Tutorial: identity . 29 2.16 Tutorial: tkf91 . 30 2.17 Tutorial: tkf91-labeled . 31 2.18 Tutorial: tkf91-root . 32 2.19 Tutorial: protpal . 32 2.20 Tutorial: protpal-mix2 . 33 2.21 Tutorial: triplet . 34 2.22 Tutorial: substituter2 . 35 2.23 Tutorial: substituter-substituter2 . 35 2.24 Tutorial: tkf91-tkf91 . 36 2.25 Tutorial: mf-substituter . 39 2.26 Tutorial: mf-tkf91 . 39 2.27 Tutorial: substituter-mf . 41 vi 2.28 Tutorial: substituter-cs . 41 2.29 Tutorial: tkf91-mf . 41 2.30 Tutorial: mf-substituter-cs . 43 2.31 Tutorial: mf-tkf91-liv . 44 2.32 Tutorial: fork-subcs-submf . 47 2.33 Tutorial: root-fork-subcs-submf . 48 2.34 Tutorial: fork-tkf91liv-tkf91mf . 50 2.35 Tutorial: root-fork-tkf91liv-tkf91mf . 52 2.36 Tutorial: viterbi-root-fork-tkf91liv-tkf91mf . 53 2.37 Tutorial: viterbi-fork-tkf91liv-tkf91mf . 54 2.38 Tutorial: viterbi-profile . 54 2.39 Tutorial: forward2-root-fork-tkf91liv-tkf91mf . 55 2.40 Tutorial: forward2-fork-tkf91liv-tkf91mf . 56 2.41 Tutorial: forward2-profile . 56 2.42 Tutorial: fork3-tkf91liv-tkf91mf-tkf91cs . 58 2.43 Tutorial: fork-tkf91viterbi-tkf91cs . 59 2.44 Tutorial: fork-tkf91forward2-tkf91cs . 60 2.45 Simulation study results . 62 2.46 Tutorial: fork-tkf91-tkf91 . 63 2.47 Tutorial: root-fork-tkf91-tkf91 . 64 2.48 Empirical runtime analysis . 94 3.1 Indel rate estimation error distributions . 112 3.2 Indel rate estimation errors separated by simulated indel rate . 113 3.3 Effects of gap attraction . 114 3.4 Indel rate distribution in OPTIC database . 115 3.5 Sampled paths through state graph . 116 4.1 An alignment summarizing a handalign MCMC trace . ..