Estimation of Gene Insertion/Deletion Rates with Missing Data

INVESTIGATION

Utkarsh J. Dang,*,1 Alison M. Devault,† Tatum D. Mortimer,‡ Caitlin S. Pepperell,‡ Hendrik N. Poinar,§ and G. Brian Golding**,2

*Departments of Biology and Mathematics and Statistics, McMaster University, Hamilton, Ontario L8S-4L8, Canada, †MYcroarray, Ann Arbor, Michigan 48105, ‡Departments of Medicine and Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin 53705, and §Department of Anthropology and **Department of Biology, McMaster University, Hamilton, Ontario L8S-4K1, Canada

ABSTRACT Lateral gene transfer is an important mechanism for evolution among bacteria. Here, genome-wide gene insertion and deletion rates are modeled in a maximum-likelihood framework with the additional flexibility of modeling potential missing data. The performance of the models is illustrated using simulations and a data set on gene family phyletic patterns from Gardnerella vaginalis that includes an ancient taxon. A novel application involving pseudogenization/genome reduction magnitudes is also illustrated, using gene family data from Mycobacterium spp. Finally, an R package called indelmiss is available from the Comprehensive R Archive Network at https://cran.r-project.org/package=indelmiss, with support documentation and examples.

KEYWORDS gene insertion/deletion; indel rates; maximum likelihood; unobserved data

Copyright © 2016 by the Genetics Society of America
doi: 10.1534/genetics.116.191973
Manuscript received May 24, 2016; accepted for publication August 17, 2016; published Early Online August 25, 2016.
1Present address: Department of Mathematical Sciences, Binghamton University, State University of New York, Binghamton, NY 13902.
2Corresponding author: Department of Biology, McMaster University, 1280 Main St. West, Hamilton, ON L8S-4K1, Canada. E-mail: [email protected]
Genetics, Vol. 204, 513–529, October 2016

Lateral gene transfer is an important, yet traditionally underestimated, mechanism for microbial evolution (McDaniel et al. 2010; Treangen and Rocha 2011). Whole gene insertions/deletions, referred to as indels here in the context of lateral gene transfer, can be deduced from examining gene presence/absence patterns on a phylogenetic tree of closely related taxa. Systematic investigation of the rates of such indels can be done via several methods. Parsimony methods can be used (Hao and Golding 2004); however, these are known to underestimate the number of events in phylogeny reconstruction (Felsenstein 2004). Sequence characteristics such as codon usage bias and G+C content have also been investigated in the past, but these are not always reliable (Koski and Golding 2001). Alternatively, phylogenies can also be constructed for individual genes, and a comparison of trees among individual genes can yield insights on the acquisition of foreign genes.

Maximum-likelihood techniques have previously been used to estimate gene indel rates (Hao and Golding 2006; Marri et al. 2006; Cohen and Pupko 2010). Traditionally, such likelihood-based analyses have required that the closely related sequences being investigated have complete genome sequences available. This ensures that no genome rearrangement masks a homolog (Hao and Golding 2006). Here, likelihood-based models are investigated that can also account for potentially unobserved or missing data.

The term "missing" here is used in a loose and informal sense. It is meant to measure the degree to which unobserved data may nevertheless contribute to a taxon's data set, given the inferred rates from related taxa. Two different kinds of missing data are used here for illustration purposes.

In the first example, consider a taxon or a few taxa that have only subsets of their genome sampled. This could be due to genome degradation, errors in sequencing, errors in assembly, or incomplete next-generation sequencing (NGS) studies. In bacterial evolution, genes are continually being inserted or deleted, and the goal here is to estimate how much of the data has been missed within this background of continuous gene insertion/deletion.

As a second example, consider an intracellular pathogenic bacterium. It is well known that such species will adapt to their host by deleting unnecessary genes. Here, the missing data allude to the magnitude of genome reduction beyond the normal levels of genome flux. The goal, in this case, is to estimate this reduction while simultaneously estimating phylogenetic insertion/deletion rates. Not accounting for these missing data will bias estimates of indel rates.

As an illustration, see Figure 1. Here, sequences for coding genes (shaded rectangles) are available for five closely related taxa. Unexpectedly, the data available for the third taxon seem to differ from those for the other taxa, and this taxon appears to be missing some genes that are present in the others. If these are closely related taxa, they should have approximately similar amounts of coding information. Hence, the data recorded for the third taxon seem to be unusual in comparison to related taxa. Indeed, it seems that more deletion has occurred in this taxon relative to the others. In this case, assuming that the third taxon has missing data as discussed above (via either of the two scenarios), modeling of the insertion and deletion rates for genes for all five of the taxa directly would lead to an overestimate of the deletion rate for the entire clade. However, accounting for this unexpected event would provide better estimates for the insertion/deletion rates and at the same time give an estimate of the proportion of missing data. Such a methodology would permit the separation of the confounded effects of missing data from normal gene gain and loss over time. The method cannot determine the reason that the data are missing but can estimate the magnitude. Note that while methods exist for handling missing data, ambiguous states, and sequencing error for nucleotides (Felsenstein 2004; Kuhner and McGill 2014; Yang 2014), this article is the first to propose dealing with missing data using such models on gene family membership data and to illustrate their performance.

Figure 1 An illustration of the scenarios being modeled. The shaded bars on the right indicate gene content (presence/absence) within the genomes of five species related according to the phylogeny given on the left. The third species is missing some gene blocks that are present in the other species.

Table 1 The proportion ($\delta_i$) of data that is unexpectedly missing in the data for species i compared to closely related taxa on the phylogenetic tree

            Observed
True        "0"           "1"
"0"         1             0
"1"         $\delta_i$    $1 - \delta_i$

As evolutionary rates can vary among different clades or lineages on a tree, the analyses presented also include results from models that relax the assumption of homogeneity of gene insertion and deletion rates across all branches on the phylogeny. Such models can yield unique estimates of insertion and deletion rates for specific clades (or branch and node groupings) chosen based on evolutionary time or prior information. An R package called indelmiss (insertion deletion analysis while accounting for missing data) is provided that allows for efficient fitting of all models discussed (Appendix D).

The rest of this article is structured as follows. Materials and Methods includes details on the likelihood calculations and the formulation of a model that incorporates missing data. Results illustrates model performance using simulations and data based on gene phyletic patterns from Gardnerella vaginalis and Mycobacterium spp. Finally, some conclusions and ideas for future work are discussed in the Discussion section.

Materials and Methods

To model gene evolution, a two-state (presence or absence) continuous-time Markov chain is used. Genes are assumed to be inserted or deleted independently of other genes and at constant rates. To eliminate the problem of paralogs, only the presence or absence of gene families is considered in the fashion of Hao and Golding (2004, 2006). Any paralogs are clustered as a single gene family and only one member of a family is retained. The criteria for being considered as belonging to a gene family are given in Results. For the Markov chain, an operational taxonomic unit (OTU) having a gene family present or absent is represented by a 1 or a 0.

Let the instantaneous rates of insertion and deletion be $\nu$ and $\mu$, respectively. Then, the rate matrix $Q$ can be written as

$$Q = \begin{pmatrix} -\mu & \mu \\ \nu & -\nu \end{pmatrix},$$

where the rows (and columns) represent presence and absence in the current (future) state, respectively. The transition probability matrix, which gives the probabilities of a transition from one state to the next over time $t$, can be easily derived (cf. Hao and Golding 2006),

$$\frac{1}{\mu + \nu} \times \begin{pmatrix} \nu + \mu \exp(-(\mu + \nu)t) & \mu - \mu \exp(-(\mu + \nu)t) \\ \nu - \nu \exp(-(\mu + \nu)t) & \mu + \nu \exp(-(\mu + \nu)t) \end{pmatrix},$$

where, as for $Q$, the rows (and columns) represent presence and absence in the current (future) state, respectively. For example, the probability of gene presence in a descendant OTU ($P_d$) given that it was also present in the ancestral OTU ($P_a$) is given by

$$p(P_d \mid P_a, t) = p_{P_a P_d} = (\mu + \nu)^{-1} \times \left(\nu + \mu \exp(-(\mu + \nu)t)\right).$$
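The transition probabilities and the missing-data proportion of Table 1 can be combined in a few lines of code. The following is a minimal sketch, not the indelmiss implementation: the function names (`transition_probs`, `tip_likelihood`, `branch_likelihood`) are hypothetical, and the way the tip vector is built from Table 1 is an assumption about how the missing-data proportion would enter a likelihood calculation along a single branch.

```python
import math


def transition_probs(mu, nu, t):
    """Return the 2x2 transition probability matrix over time t.

    States are ordered (present, absent), as in the rate matrix Q.
    mu is the deletion rate (present -> absent) and nu is the
    insertion rate (absent -> present).
    """
    total = mu + nu
    decay = math.exp(-total * t)
    return [
        [(nu + mu * decay) / total, (mu - mu * decay) / total],
        [(nu - nu * decay) / total, (mu + nu * decay) / total],
    ]


def tip_likelihood(observed, delta=0.0):
    """Likelihood of an observed tip state for each true state.

    Returns [L(true present), L(true absent)]. Assumption based on
    Table 1: a truly present family is recorded as absent with
    probability delta, and a truly absent family is never recorded
    as present.
    """
    if observed == 1:
        return [1.0 - delta, 0.0]
    return [delta, 1.0]


def branch_likelihood(ancestral_state, observed, mu, nu, t, delta=0.0):
    """P(observed tip state | ancestral state) along one branch of length t."""
    row = transition_probs(mu, nu, t)[0 if ancestral_state == 1 else 1]
    tip = tip_likelihood(observed, delta)
    return row[0] * tip[0] + row[1] * tip[1]


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the article's data.
    print(transition_probs(mu=1.0, nu=1.0, t=0.5))
    print(branch_likelihood(1, 0, mu=1.0, nu=1.0, t=0.5, delta=0.0))
    print(branch_likelihood(1, 0, mu=1.0, nu=1.0, t=0.5, delta=0.2))
```

With a positive missing-data proportion, an observed absence at the tip becomes more probable for the same rates. This is the sense in which ignoring missing data would inflate the estimated deletion rate.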
Table 2 Mean estimates for indel rates and proportion of missing data along with the ranges across 100 runs for simulation set 1

                                                Recovered
Expected        Model 1              Model 2              Model 3              Model 4
$\mu = 1$       1.00 (0.94, 1.08)    1.00 (0.93, 1.06)    1.00 (0.93, 1.08)    1.00 (0.93, 1.06)
$\nu = 1$       1.00 (0.94, 1.08)    1.00 (0.93, 1.06)    1.01 (0.90, 1.39)    1.00 (0.90, 1.39)
$\delta = 0$    -                    0.00 (0.00, 0.02)    -                    0.01 (0.00, 0.03)
Best (AIC)      76                   8                    14                   2
Best (BIC)      100                  0                    0                    0

No missing data were simulated but possible missing data were estimated for tip 1 (of six tips).
