Bayesian Ancestral Reconstruction for Bat Echolocation

by Joseph Patrick Meagher

Thesis submitted to the University of Warwick for the degree of Doctor of Philosophy

Statistics
June 2020

Contents

List of Tables
List of Figures
Acknowledgments
Declarations
Abstract
Abbreviations
Chapter 1 Introduction
Chapter 2 Literature Review
  2.1 Some Background on Bats
  2.2 The Phylogenetic Comparative Method
  2.3 Statistical Models for Data
    2.3.1 Gaussian Processes
    2.3.2 The Phylogenetic Gaussian Process Framework
    2.3.3 Latent Variable Models and Factor Analysis
    2.3.4 Functional Data Analysis
    2.3.5 Acoustic Signal Processing
  2.4 Bayesian Inference
    2.4.1 MCMC Methods for Parameter Inference
    2.4.2 MCMC Methods for Gaussian Processes
    2.4.3 Estimating Model Evidence
Chapter 3 A Phylogenetic Latent Variable Model for Function-valued Traits
  3.1 Introduction
  3.2 Methods
    3.2.1 The Phylogeny: A Graphical Model for Shared Ancestry
    3.2.2 A Phylogenetic Latent Variable Model for Function-valued Traits
    3.2.3 Efficient Computation of the Model Likelihood
    3.2.4 Prior Specification
    3.2.5 Posterior Inference and Model Selection
    3.2.6 Ancestral Reconstruction
  3.3 Results for a Synthetic Example
  3.4 Discussion
Chapter 4 A Generalised Phylogenetic Latent Variable Model
  4.1 Introduction
  4.2 Methods
    4.2.1 Data Augmentation
    4.2.2 A Generalised Phylogenetic Latent Variable Model
    4.2.3 Approximate Posterior Inference
    4.2.4 Ancestral Reconstruction
  4.3 Results for a Synthetic Example
    4.3.1 Model Fitting
    4.3.2 Ancestral Reconstruction
    4.3.3 Parameter Inference
  4.4 Discussion
Chapter 5 Ancestral Reconstruction of the Bat Echolocation Call
  5.1 Introduction
  5.2 A Harmonic Model for Bat Echolocation
    5.2.1 Prior Specification
    5.2.2 Maximum-a-Posteriori Inference
    5.2.3 Fitting the Harmonic Model
    5.2.4 A Brief Discussion of the Harmonic Model
  5.3 Echolocation Call Reconstruction
    5.3.1 Echolocation Call Data
    5.3.2 Bat Phylogeny
    5.3.3 Echolocation Call Features
    5.3.4 Ancestral Reconstruction
  5.4 Discussion
Chapter 6 Final Remarks
Appendix A Tree Traversal Algorithms
  A.1 Pruned Likelihood Calculation
  A.2 Pruned Conditional Distribution
Appendix B Derivations for Variational Inference
  B.1 Co-ordinate Ascent Variational Inference Updates
  B.2 The Evidence Lower Bound
  B.3 Predictive Distribution
Appendix C Alternative generalised PLVMs

List of Tables

3.1 Fixed Parameter and Hyper-Parameter Values for Synthetic Data
3.2 Bayes Factors for Model Comparison
4.1 Fixed Hyper-Parameter Values for Synthetic Phylogenetic Gaussian Processes
5.1 Mexican Bat Echolocation Call Dataset
5.2 MAP estimates and intervals of 90% posterior density for phylogenetic hyper-parameters of the V-PLVM.

List of Figures

1.1 The Ancestral Bat Call
2.1 Grouping Organisms over Phylogenies: The definition of monophyletic, paraphyletic, and polyphyletic groups
2.2 The Diversity of Bat Echolocation Calls: Selected call spectrograms
2.3 Gaussian Process Regression: Illustration of prior and posterior samples from a Gaussian process
2.4 Comparison of Matérn kernels: Illustration of Matérn covariance functions and samples from the process for different values of the smoothing parameter ν
2.5 Positions on a Phylogeny
2.6 Registration of Functional Data: A toy example
2.7 Interpolation between Signal Characterisations: A comparison of instantaneous frequency and spectrogram representations
3.1 A Taxon-level Phylogeny
3.2 A Phylogeny for Repeated Measurements
3.3 The Phylogenetic Latent Variable Model: A graphical representation
3.4 Loadings for a Synthetic Example
3.5 Phylogeny and Trait Observations for a Synthetic Example
3.6 Ancestral Reconstruction of a synthetic Function-Valued Trait
3.7 Sampled Posterior Loading
3.8 Sampled Posterior Phylogenetic Hyper-parameters
3.9 Sampled Posterior Parameters
4.1 The Generalised Phylogenetic Latent Variable Model: A graphical representation
4.2 A Synthetic Collection of Traits on a Phylogeny
4.3 The log Evidence Lower Bound for Generalised PLVMs
4.4 Root Ancestral Distribution: The V-PLVM
4.5 Inferred Loading: The V-PLVM
4.6 Inferred Phylogenetic Hyper-parameters: The V-PLVM
4.7 Root Ancestral Distribution: The R-PLVM
5.1 A Harmonic Model for Bat Echolocation: A graphical representation
5.2 Fitted Harmonic Models: A selection of bat echolocation calls
5.3 A Phylogeny for Sampled Mexican Bats
5.4 Corrected Fundamental Frequency Curves: A selection of bat echolocation calls
5.5 Time Registration of Fundamental Frequency Curves: Pteronotus parnellii
5.6 The log Evidence Lower Bound: Models for the evolution of bat echolocation calls
5.7 Echolocation in Bats Most Recent Common Ancestor
5.8 The Evolution of Bat Echolocation
5.9 Inferred Loadings: A model for the evolution of bat echolocation
A.1 A Toy Phylogeny
C.1 Root Ancestral Distribution: The R-PLVM
C.2 Root Ancestral Distribution: The P-PLVM
C.3 Root Ancestral Distribution: The I-PLVM

Acknowledgments

Thank you to my supervisors on this project. To Mark Girolami, for offering me the opportunity to undertake this programme of doctoral research; to Kate Jones, for providing such a motivating problem, and to Theo Damoulas, whose mentorship made this work possible.

Thank you to both the EPSRC for their generous funding of this project and the Department of Statistics at the University of Warwick for hosting me throughout.

Thank you to my friends, old and new, who have been with me on this journey. In particular, I want to thank Arthur, five years is a long time to spend under the same roof, but you made it easy.

Thank you to my family; to my grandparents, for whom my admiration only increases with the passage of time; to Ailshe and Liam, may you learn from the mistakes of your brother, whatever you may judge them to be; to my mother, whose constant self-development is as much motivation as it is inspiration; and to my father, for demonstrating the value of consistent enthusiasm. I can never truly thank you for all I've been given in this life, but only do my best to make the most of it.

And to Suzy, on to the next chapter.

Declarations

This thesis is submitted to the University of Warwick in support of my application for the degree of Doctor of Philosophy. I declare that it has been composed by myself and that the work contained herein is my own except where explicitly stated otherwise. This work has been completed wholly while in candidature for a research degree at the University of Warwick and has not been submitted for any other degree or professional qualification.

Aspects of this research have been published in:

• JP Meagher, T Damoulas, KE Jones, and M Girolami. Discussion of "The statistical analysis of acoustic phonetic data: exploring differences between spoken Romance languages", by Davide Pigoli, Pantelis Z Hadjipantelis, John S Coleman, and John AD Aston. Journal of the Royal Statistical Society: Series C (Applied Statistics), 67(5):1103-1145, 2018.
• JP Meagher, T Damoulas, KE Jones, and M Girolami. Phylogenetic Gaussian processes for bat echolocation. Statistical Data Science, pages 111-122, 2018. doi:10.1142/q0159.

Abstract

Ancestral reconstruction can be understood as an interpolation between measured characteristics of existing populations to those of their common ancestors. Doing so provides an insight into the characteristics of organisms that lived millions of years ago. Such reconstructions are inherently uncertain, making this an ideal application area for Bayesian statistics. As such, Gaussian processes serve as a basis for many probabilistic models for trait evolution, which assume that measured characteristics, or some transformation of those characteristics, are jointly Gaussian distributed. While these models do provide a theoretical basis for uncertainty quantification in ancestral reconstruction, practical approaches to their implementation have proven challenging. In this thesis, novel Bayesian methods for ancestral reconstruction are developed and applied to bat echolocation calls. This work proposes the first fully Bayesian approach to inference within the Phylogenetic Gaussian Process Regression framework for Function-Valued Traits, producing an ancestral reconstruction for which any uncertainty in this model may be quantified. The framework is then generalised to collections of discrete and continuous traits, and an efficient approximate Bayesian inference scheme proposed, representing the first application of Variational inference techniques to the problem of ancestral reconstruction. This efficient approach is then applied to the reconstruction of bat echolocation calls, providing new insights into the developmental pathways of this remarkable characteristic. It is the complexity of bat echolocation that motivates the proposed approach to evolutionary inference, however, the resulting statistical methods are broadly applicable within the field of Evolutionary Biology.

Abbreviations

AM    Adaptive Metropolis
ARD   Automatic Relevance Determination
ASIS  Ancillarity-Sufficiency Interweaving Strategy

