Indel Information Eliminates Trivial Sequence Alignment in Maximum Likelihood Phylogenetic Analysis

Cladistics Cladistics 28 (2012) 514–528 10.1111/j.1096-0031.2012.00402.x Indel information eliminates trivial sequence alignment in maximum likelihood phylogenetic analysis John S.S. Dentona,b,* and Ward C. Wheelerb,c aDivision of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA; bRichard Gilder Graduate School, American Museum of Natural History, New York, NY 10024, USA; cDivision of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA Accepted 12 March 2012 Abstract Although there has been a recent proliferation in maximum-likelihood (ML)-based tree estimation methods based on a fixed sequence alignment (MSA), little research has been done on incorporating indel information in this traditional framework. We show, using a simple model on a single character example, that a trivial alignment of a different form than that previously identified for parsimony is optimal in ML under standard assumptions treating indels as ‘‘missing’’ data, but that it is not optimal when indels are incorporated into the character alphabet. We show that the optimality of the trivial alignment is not an artefact of simplified theory assumptions by demonstrating that trivial alignment likelihoods of five different multiple sequence alignment datasets exhibit this phenomenon. These results demonstrate the need for use of indel information in likelihood analysis on fixed MSAs, and suggest that caution must be exercised when drawing conclusions from software implementations claiming improvements in likelihood scores under an indels-as-missing assumption. Ó The Willi Hennig Society 2012. Maximum likelihood (ML) has become a popular the increasing sophistication of search heuristics in ML optimality criterion for inferring evolutionary trees software (e.g. Jobb et al., 2004; Guindon et al., 2005, following its introduction by Fisher (1912), its 2010; Stamatakis, 2006; Zwickl, 2006). initial application to nucleotide data (Neyman, 1971; However, despite this increase in sophistication, the Felsenstein, 1981), and its popularization through early fundamental procedure remains unchanged. A gap cost, software releases (Felsenstein, 1989; Swofford, 2002). whether non-affine (Waterman et al., 1976), affine ML is a parametric method that requires estimation of (Gotoh, 1982; Edgar, 2004; Katoh et al., 2005; Larkin the entries of a stochastic character transition matrix as et al., 2007), logarithmic (Mott, 1999), or log-affine well as branch lengths on a phylogenetic tree. Due to the (Cartwright, 2007) is employed during multiple sequence large number of values requiring independent optimiza- alignment via progressive algorithms, but most often not tion in an ML framework, computation times for during topology search. The pattern of the inserted likelihood tree searches are much longer than those for indels (‘‘missing’’ data, in a phylogenetic context) may parsimony, and ML analyses have been made more time- influence the topology found as optimal by either consuming by increased dataset sizes, greater numbers of increasing the number of most parsimonious trees partitions, and by increased model complexity. Yet (Wheeler, 1994; Wiens, 1998, 2003a,b; Kearney, 2002) despite these additional computational burdens, likeli- or by flattening the likelihood surface. When indels are hood analyses have become increasingly tractable in the treated as missing data in the character alphabet on a past decade with improvements in both computing speed fixed alignment, adding more indels may improve the and public access to multiprocessor clusters, as well as in optimality score to the point where a ‘‘trivial’’ alignment, in which sequences are aligned completely out of *Corresponding author: phase, is mathematically optimal, i.e. having a parsi- E-mail address: [email protected] mony cost of zero (Wheeler and Gladstein, 1994; Ó The Willi Hennig Society 2012 J.S.S. Denton and W.C. Wheeler / Cladistics 28 (2012) 514–528 515 Wheeler et al., 1995; Giribet and Wheeler, 1999). Trivial LðH; T Þ/PrðDjH; T Þð1Þ alignments are characterized by two undesirable prop- erties: (a) they have zero cost (in a parsimony frame- There are two ends of the spectrum of general work) despite contributing no biologically justified approaches to this problem. In the broad formulation, homology statements, and (b) they are invariant with a given alignment of the data (ai) in the space of respect to tree topology. Although the assumption of alignments (A) can be treated as a random variable and treating indels as ‘‘missing’’ data in the character marginalized, in which case X alphabet remains standard practice in phylogenetic LðH; T Þ¼ PrðD; a jH; T Þð2Þ analyses, little work has been done to investigate the i a 2 A conceptual implications of this assumption under the i likelihood criterion. and therefore Several methods for incorporating indel information X into phylogenetic analyses exist, among them likelihood- LðH; T Þ¼ PrðDjai; H; T ÞPrðaijH; T Þð3Þ based sequence alignment (Bishop and Thompson, 1986; ai 2 A Thorne et al., 1991, 1992); simultaneous alignment and by the chain rule for probability distributions. The first topology inference by Bayesian criteria (Redelings and term in the right hand side of the above equation treats Suchard, 2005, 2007; Suchard and Redelings, 2006), by only substitutions among alphabet characters, whereas likelihood criteria (Wheeler, 2006), and by parsimony the second term accounts for differences in alignment (Wheeler, 1996; Varoń et al., 2010) criteria; and by probability due to differing numbers of insertion and incorporation of indels into the character alphabet for deletion events. Because insertion and deletion events are topology search on a fixed alignment (McGuire et al., accounted for by the second term in this formulation, 2001a; Young and Healy, 2003; Rivas and Eddy, 2008). they do not have meaning as an entry in the character Utilization of indel information on a static MSA alphabet of the continuous-time stochastic process. eliminates trivial alignments because it defines indels By contrast, the more widely-applied and traditional on a direct transformation path, thereby maintaining formulation of the problem is that of utilizing eqn (1), metricity of the substitution cost matrix (Wheeler, 1993). with several additional assumptions. First, in this Additionally, incorporating indel information has been approach, a heuristic MSA is taken as a fixed parameter observed to contribute to phylogenetic resolution and value for the alignment A, from which the tree topology node support (Whiting et al., 1997; Egan and Crandall, is inferred by search algorithms. Hence, under this ‘‘two- 2008; Simmons et al., 2008; Dwivedi and Gadagkar, step’’ formulation, the second term in the right-hand side 2009; Dessimoz and Gil, 2010; Pas¢ko et al., 2011). of eqn (3) is omitted entirely. Second, under this tradi- However, the use of indel information is limited in tional approach, the conditional state transition proba- application outside parsimony by computational con- bilities p(j|i, t) for an alphabet size K in a K · K transition straints, and so the limited number of existing model- matrix P are derived by matrix exponentiation from an based methods for use on nucleotide data apply the instantaneous rate matrix Q having the property that assumption of atomistic indel events to make analyses X tractable. Although the utility of atomistic indels has Qði; jÞ¼0 ð4Þ been questioned in several studies under parsimony j (Simmons and Ochoterena, 2000, for example), to date no model-based software implementations exist that Because this common formulation requires all changes provide non-atomistic alternatives with heuristics that to be accounted for by substitutions among characters, operate in polynomial time. It is therefore imperative insertion and deletion events have meaning as character that such implementations be developed and made values. To account for insertions and deletions in this accessible so that the behaviours of the assumptions framework, the K · K transition matrix is augmented to can be studied. a(K +1)· (K + 1) matrix to incorporate insertion and deletion events. Third, under the traditional two- step approach, the continuous-time stochastic process is Context of the ‘‘indel problem’’ assumed to be globally stationary, hence the equilibrium character frequencies pK, obtained in the limit t fi¥ Parameterizing insertions and deletions is a more for the entries of P are constant across the tree topology. complex problem than determining the best-fit model Fourth, the stochastic process is assumed to be globally of nucleotide substitution. In likelihood analysis of reversible, hence nucleotide data, we wish to estimate the likelihood piQði; jÞ¼pjQðj; iÞð5Þ of sequence data (D) given a tree topology (T ) and a stochastic model of evolution (Q) of any alphabet and so insertions and deletions are estimated as a single size: (‘‘indel’’) parameter. Lastly, the stochastic process is 516 J.S.S. Denton and W.C. Wheeler / Cladistics 28 (2012) 514–528 assumed to be globally homogeneous, in that the same death process complementary to the substitution process entries of P apply across all branches of the topology. requires estimation of insertion and deletion parameters The main differences between the broad approach and at internal nodes. Hence, the process inherits the the two-step approach discussed here go beyond onto- problems of ancestral state reconstruction twice, once logical formulation of character states. The broad for the

Indel Information Eliminates Trivial Sequence Alignment in Maximum Likelihood Phylogenetic Analysis

Frameshift Indels Generate Highly Immunogenic Tumor Neoantigens Tumor-Specifi C Neoantigens Are the Targets of T Cells in the Neoantigens

Genetic Analysis of the Transition from Wild to Domesticated Cotton (G

Small Variants Frequently Asked Questions (FAQ) Updated September 2011

Mutational Landscape of Spontaneous Base Substitutions and Small Indels in Experimental Caenorhabditis Elegans Populations of Differing Size

Genetic Testing: Background and Policy Issues

Indelible Markers the Recruitment of Modified Histones by the RITS Complex

Pervasive Indels and Their Evolutionary Dynamics After The

The Origin, Evolution, and Functional Impact of Short Insertion–Deletion Variants Identified in 179 Human Genomes

The Genomics Era: the Future of Genetics in Medicine - Glossary

Integrated Analysis of Gene Expression, SNP, Indel, and CNV

Comprehensive Analysis of Indels in Whole-Genome Microsatellite Regions and Microsatellite Instability Across 21 Cancer Types

Characterization of Point Mutations in the Same Arginine Codon in Three Unrelated Patients with Ornithine Transcarbamylase Deficiency