HIGHLIGHTED ARTICLE | INVESTIGATION

A Model of Evolution by Finite-State, Continuous-Time Machines

Ian Holmes1 Department of Bioengineering, University of California, Berkeley, California 94720 ORCID ID: 0000-0001-7639-5369 (I.H.)

ABSTRACT We introduce a systematic method of approximating finite-time transition probabilities for continuous-time - models on sequences. The method uses automata theory to describe the action of an infinitesimal evolutionary generator on a probability distribution over alignments, where both the generator and the alignment distribution can be represented by pair hidden Markov models (HMMs). In general, combining HMMs in this way induces a multiplication of their state spaces; to control this, we introduce a coarse-graining operation to keep the state space at a constant size. This leads naturally to ordinary differential equations for the evolution of the transition probabilities of the approximating pair HMM. The TKF91 model emerges as an exact solution to these equations for the special case of single-residue . For the more general case of multiple-residue indels, the equations can be solved by numerical integration. Using simulated data, we show that the resulting distribution over alignments, when compared to previous approximations, is a better fit over a broader range of parameters. We also propose a related approach to develop differential equations for sufficient statistics to estimate the underlying instantaneous indel rates by expectation maximization. Our code and data are available at https://github.com/ihh/trajectory-likelihood.

KEYWORDS automata; hidden Markov models; indels; Markov processes; ; phylogenetics

N molecular evolution, the equations of motion describe Materials and Methods), report on a simulation-based evalu- Icontinuous-time Markov processes on discrete or ation (in the Results), and discuss the implications of our amino acid sequences. For substitution processes, these equa- results (in the Discussion). tions are reasonably well understood, but insertions and Alignments as Summaries of Indel Histories deletions (indels) have proved less tractable. This paper presents a new approach to analysis of indel Our motivating goal is to calculate probabilities of sequence processes. In this introduction, we first discuss core bioinfor- alignments, assuming an underlying instantaneous rate model matics concepts such as alignments, define a continuous-time of indel events. We will mostly consider alignments of two Markov process for indels, and review previously published sequences that we will refer to as the “ancestor” and the “de- approximations to the finite-time alignment distributions of scendant,” where the likelihood function takes the form this process, using hidden Markov models (HMMs). In the PðÞdescendant; alignmentjancestor; Q; t ,whereQ represents remaining sections we describe our new method (in the model parameters (e.g., rates) and t is a time param- eter. Common uses of this likelihood function include perform- Copyright © 2020 Holmes ing sequence alignment (for downstream inference based on doi: https://doi.org/10.1534/genetics.120.303630 homology, such as protein structure prediction), finding maxi- Manuscript received July 1, 2020; accepted for publication September 22, 2020; published Early Online October 5, 2020. mum-likelihood estimates of the time parameter t (for example, Available freely online through the author-supported open access option. as part of phylogenetic inference of ancestral relationships), and This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which comparing different models or parameterizations Q (for exam- permits unrestricted use, distribution, and reproduction in any medium, provided the ple, to measure the rate of evolution in sequences, or to annotate original work is properly cited. Supplemental material available at figshare: https://doi.org/10.25386/genetics. conserved regions). 13040585. We seek to derive this pairwise alignment likelihood di- Available freely online. 1Address for correspondence: Stanley Hall, Department of Bioengineering, University of rectly from an instantaneous model of sequence mutation; California, Berkeley, California 94720. E-mail: [email protected] that is, a continuous-time Markov chain whose state space is

Genetics, Vol. 216, 1187–1204 December 2020 1187 the set of all possible DNA or protein sequences. For prac- uncertain alignments. For example, the alignments in 1F are tical purposes, we often need to summarize paths through plausible alternatives to 1A; in 1G, the ordering of insertions this process, and it is worth distinguishing between differ- and deletions may be considered irrelevant for many purpo- ent ways of doing so. We will use three progressively de- ses; and the placement of the second gap in 1H admits some tailed descriptions of the evolutionary path which we refer uncertainty. to as alignments, trajectories, and histories (Figure 1), as If the alignment likelihood can be represented as a path described below. probability through a pair HMM, F, then we can perform this sum over alignments using the forward algorithm (Durbin Alignments et al. 1998), writing the result as A pairwise alignment consists of the observed initial and F ; Q; final state of the process (the ancestral and descendant XY ¼ PðÞdescendant ¼ Yjancestor ¼ X t ¼ PðÞdescendant ¼ Y; path ¼ fjancestor ¼ X; Q; t : sequences), with gap characters to show which residues are f descended from which. An example alignment is shown in Figure 1A. Most of our discussion will be at this level of This paper focuses on the probability distribution of align- summarization. ment gaps. In general, when we refer to a gap, we will mean Trajectories a run of adjacent indels in any order, as in 1G. Because of the possibility of overlapping indel events, as in 1C, these gaps can A trajectory includes all the intermediate sequences from arise in a number of different ways. ancestor to descendant. Transitions between the interme- diate sequences correspond to instantaneous changes. A The general geometric indel model trajectory uniquely implies an alignment, but there are fi many trajectories consistent with each alignment: the most Ourstartingpointforde ning an evolutionary process is the plausible trajectory for alignment 1A is shown in 1B, but the pointsubstitutionmodel,appliedtoasequence.Insuchamodel, R longer trajectories in 1C and 1D are also consistent. We will each residue evolves according to a substitution rate matrix , ’ refer to this level of summarization when discussing some such as Kimura s two-parameter model for DNA (Kimura 1980) ’ previous methods for indel analysis. or Dayhoff s PAM model for proteins (Dayhoff et al. 1978). We generalize this by allowing instantaneous insertion and Histories deletion events as well as point substitution events. We do not A history consists of a trajectory fully annotated with the time want to be forced always to count the insertion of multiple of each indel event. This is the most detailed description, being adjacent residues as separate events (as in 1D), since this leads a complete specification of the path of the stochastic process. to inferential artifacts such as trajectories with too many events, For each nontrivial trajectory,there is a continuum of possible alignments with scattered gaps, rates that are too fast, or times histories. For example, history 1E is consistent with trajectory that are too long. Consequently, our model should allow events 1B, with an event time u in the range 0 # u # t. We will not that insert or delete multiple residues instantaneously (as in 1B), refer much to this level as it contains more information than with the indel length being a random variable. we usually care about. For simplicity, we want to keep the number of parameters Reflecting this hierarchy of summarization, we can write minimal, so we specify only the mean lengths of insertion and deletion events. The maximum entropy distribution for this pa- PðÞdescendantj ancestor; Q; t rameterization is the geometric distribution. Thus, the probability n21 X that a given event involves n residues is x ðÞ1 2 x for an in- 2 ¼ PðÞdescendant; alignmentj ancestor; Q; t sertion, and yn 1ðÞ1 2 y foradeletion,withmeanlengths alignment 1=ðÞ1 2 x and 1=ðÞ1 2 y . If the rate of insertions is l and the X X rate of deletions is m, then, at any given site, an event that inserts n21 ¼ Pðdescendant; alignment; n residues has rate lx ðÞ1 2 x 3 PIðÞ1 ...In ,whereI1 ...In alignment trajectory represents the actual n residues that were inserted, while an event that deletes n residues has rate myn21ðÞ1 2 y .So,forex- trajectoryjÞ ancestor; Q; t ample, the instantaneous event EALGVK/EALGKLGVK in the X X Z Z Z t u1 u2 history shown in Figure 1E, which occurs during the time interval ¼ du du du ... 1 2 3 [u, u + du) and inserts the three residues GKL,hasinstanta- alignment trajectory 0 0 0 neous rate lx2ðÞ1 2 x 3 PðÞGKL and infinitesimal probability PðÞdescendant; alignment; trajectory; historyj ancestor; Q; t ; lx2ðÞ1 2 x PðÞGKL du. The inserted residues are indepen- dently drawn from the stationaryQ distribution of the substitution n ... model, so PI ...I r where rR =0.Thus, where (u1, u2,u3 ) represents all the event times in a history. ðÞ¼1 n k¼1 Ik Note that the top-level summation is over alignments. PðÞGKL ¼ r Gr KrL. By contrast, deletion rates are completely Many scenarios demand that we marginalize ambiguous or independent of the residues being deleted.

1188 I. Holmes Figure 1 Three views of evolutionary processes—alignments, trajectories, and histories—represent different levels of summarization. Alignments include no information about intermediate events except the positions of homologous residues; trajectories include intermediate sequences and the transition events between them, but not the times at which those events occurred; histories include transition events and times. Panels are illustrated using examples from the HOMSTRAD database (Mizuguchi et al. 1998); PDB identifiers are shown. (A) Part of an alignment of two proteins from PDB. (B) A single-event trajectory consistent with alignment A. (C) Two different two-event trajectories consistent with A. (D) A three-event trajectory consistent with A. (E) A history consistent with trajectory (B), in which the single event occurs between times u and u 1 du. (F) Two alternate alignments of the sequences in alignment A. (G) Several equivalent alignments containing adjacent insertions and deletions that have been rearranged in different ways. (H) An alignment that can only be explained by a model incorporating both substitution and indel events. (I) The gap profile of alignment H, and its associated M/I/D column types. (J) A multiple alignment whose misaligned gap boundaries do not seem to support the TKF92 model’s assumption that multiresidue gaps arise from indivisible sequence fragments.

To summarize, the parameters of our indel model are Derivation of alignment likelihoods from Q ¼ðl; m; x; y; RÞ consisting of indel rates (l, m) and indel indel processes length parameters (x, y), together with a substitution rate In the previous section, we described the GGI model with R matrix . We call this model the general geometric indel instantaneous rates (l, m) and extension probabilities (x,y). (GGI) model, following De Maio (2020). The GGI model is We now review previous approaches to calculating alignment the simplest continuous-time Markov chain over sequences gap likelihoods under this model and related models. that is local, allows multiresidue indels, and does not enforce These methods include the pair HMMs we evaluate in reversibility. We may contrast the locality with the Poisson this paper: TKF91 (Thorne et al. 1991), TKF92 (Thorne indel process, where the indel rate per site varies inversely et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja with the total sequence length (Bouchard-Côté and Jordan and Goldman 2005), RS07 (Redelings and Suchard 2007), 2013). As for multiresidue indels, other models such as LAHP19 (Levy Karin et al. 2019), and DM20 (De Maio 2020). TKF92 do allow this, but they do so by introducing unobserv- All of these methods exploit the property of the GGI model able auxiliary information into the state space; specifically, that the indel and substitution processes are independent of one TKF92 introduces fragment boundaries. We can further con- another. A pairwise alignment (1H) has a gap profile (1I) that is strain the parameters in various ways if desired; for example, like a residue-masked silhouette of the alignment, comprising by insisting that the model be reversible ½lyðÞ¼1 2 x mxðÞ1 2 y , three types of column: matches (M), in which ancestral and as with the long indel model of Miklós et al. (2004); or by requiring descendant residues are aligned, and insertions (I) and deletions perfect symmetry between insertions and deletions (l = m and x = (D), in which either ancestor ordescendantcontainsagap.We y), as in the simulations of De Maio (2020); or by restricting indels can factorize the alignment likelihood into a term for the gap to single residues (x = y = 0), so all trajectories look like Figure 1D, profile (written as a sequence of M’s, I’s, and D’s) and a condi- as with the TKF91 model of Thorne et al. (1991). tionally independent set of terms for the actual residue content:

W 2 DPSN K ERH P alignment ¼ j ancestor ¼ WDPSNKERH IGDPS22YPH ¼ PðÞgap profile ¼ MIMMMDDMMMj length of ancestor ¼ 9 3 3 3 3 3 3 3 3 : MWI rG MDD MPP MSS MEY MRP MHH

Finite-State, Continuous-Time Machines 1189 Figure 2 Three state machines for modeling

indels in alignments. Match states (sM) are or- ange, insert states (sI) are green, delete states (sD) are red, and null states (sN) are uncolored. Transitions are colored by destination state. States for the machines in A–C are further de- scribed in Tables 1, 3, and 4. Our approach is to approximate C with a machine of the same form as A. (A) Machine FðÞt ,defined under Three-state HMM and in Table 3, models align- ments at divergence time t. (B) Machine GðÞDt , defined under Infinitesimal-time machine and in Table 1, models the infinitesimal evolution over time Dt. (C) Machine FðÞt GðÞDt ,defined under Rate of change of expected transition counts and in Table 4, models alignments at diver- gence time t + Dt. It is the machine product of FðÞt and GðÞDt : each state has the form XY where X is an F state and Y is a G state. Uppercase is used to indicate that a component machine makes a transition when the com- pound state is entered. So, for example, when FG makes the transition mI / MM, the transi- tion weight is the product of a (for F’s self-loop- ing M / M transition) and 1 2 x (for G’s I / M transition). However, if then makes the transition MM / Dm, the transition weight is just c (for F’sM/ D transition), since G stays in the M state without making a transition. This structure arises from simple rules for transition synchronization in multiplied machines (Westesson et al. 2011, 2012).

Here, MXY is the probability that a descendant residue is Y, the pair HMM, FðÞt , and the GGI model’sratematrixover conditional on the ancestor being X; while rY is the probability sequences, ℝ. that an inserted residue is Y. Since deletion events are residue- We now review previous work in this area. blind in the GGI model, and we have already conditioned on the ancestral sequence, we do not to include terms for the proba- TKF91: The first approach, TKF91, addresses a restricted bility that the two deleted ancestral residues are N and K; the version of the GGI model allowing only single-residue indels. gap profile tells us what positions the deleted residues were at, This reduces to a linear birth-death process, which can be and that is enough. solved exactly (Thorne et al. 1991). The probability distribu- This decomposition of indel and substitution probabilities tion over alignments can be represented as a pair HMM is naturally expressed in terms of a pair HMM with M, I, and D (Holmes and Bruno 2001). Being exactly solvable, TKF91 states. We can think of the gap profile term as the probability has become the canonical example of an indel model. However, of the state path through the HMM, while the substitution as noted previously, it leads to systematic biases during inference, terms correspond to the emission probabilities from those imputing trajectories with too many events, as in Figure 1D. states. The emission part is well understood (Thorne et al. 1991): M and r canbelinkedtoanunderlyingpointsub- TKF92: Attempting to address the deficiencies of TKF91, the stitution rate matrix R [by the matrix exponential M =exp(- TKF92 model (Thorne et al. 1992) posits a similar birth-death Rt) and the stationary distribution rR = 0]. Our focus is on process, but on indivisible multiresidue fragments instead of sin- the likelihood of the gap profile: we seek a similar relation- gle residues. Each fragment contains a random, geometrically ship FðÞ’t expðÞℝt between the transition probabilities of distributed number of residues. TKF92 has a closed-form pair

1190 I. Holmes Table 1 Interpretation of states in machine GðÞDt (Figure 2B, defined in Infinitesimal-time machine); here, v in; v out 2 V represent input and output tokens from the residue alphabet

State Name Class On entry Input Output P (vout) 1Ms Reads v from input, writes v to output v v expðÞRt M in out in out vinvout s v — v r 2I I Writes out to output out vout 3D sD Reads vin from input vin ——

HMM solution, rather like TKF91, but with the introduction of shows that the exact solution is, in fact, an infinite-state pair a new parameter corresponding to the mean fragment length. HMM, so a smaller pair HMM may be a reasonable approximation. However, TKF92 is also somewhat unrealistic in practice. The idea that multiresidue gaps arise from unbreakable fragments is artifi- LAHP19: A purely simulation-based approach to estimating cial, as can be illustrated with reference to the multiple sequence the gap probabilities of the GGI model has recently been alignment of Figure 1J. The gap boundaries in (1J) do not align, described (Levy Karin et al. 2019). In the limit of an infinite and this is not uncommon: empirically, there is no evidence that number of random trials, this approach is exact. We use such TKF92’s indivisible fragments are real. In practice, when using simulations as a gold standard to evaluate other approxima- TKF92, it is common to simply marginalize over the fragment tions. However, the number of trials required to sample rare boundaries, effectively treating TKF92 as an ad hoc approximation outcomes (i.e., long gaps, particularly those involving multi- to the GGI model. Therefore, by defining a suitable mapping be- ple-event trajectories) is large, and the simulations become tween TKF92’s fragment parameters and the GGI model’sindels, computationally expensive with longer sequences. The per- we can evaluate it on this basis, as an approximation to GGI. formance and sampling limitations of this approach are fur- ther discussed in the Results. LG05 and RS07: Similarly, the LG05 pair HMM used in the PRANK program (Löytynoja and Goldman 2005) and the RS07 DM20: The DM20 method is a recent breakthrough in ap- pair HMM used in BAliPhy (Redelings and Suchard 2007) in- proximating the GGI model (De Maio 2020). Starting from troduce fragment length parameters that can be related (with the assumption that the alignment likelihood can be approx- some hand-waving) to the indel length parameters of GGI. In imated by a product of geometric distributions over insertion this paper, we evaluate TKF91, TKF92, LG05, and RS07 as and deletion lengths, De Maio derived ordinary differential approximations to GGI, but we do not evaluate some other equations (ODEs) for the evolution of the mean lengths of indel models that are a little harder to reconcile with GGI these distributions, yielding transition probabilities for the pair because of extra parameters (Rivas and Eddy 2015) or incom- HMM. DM20 is a more accurate approximation to the multi- patible assumptions (Bouchard-Côté and Jordan 2013). residue indel process than all previous attempts, although it has limitations: it does not allow deletions to directly follow MLH04: The MLH04 approach, developed by Miklós et al. insertions in the alignment (thus limiting its ability to model (2004), computes lower-bound likelihoods for alignment gaps covariation between insertion and deletion lengths), it is in- by considering short trajectories like those in Figure 1, B–D, exact for the special case of the TKF91 model, and it requires integrating out the event times from the corresponding histories laborious manual derivation of the underlying ODEs. to find a likelihood for each such trajectory. To calculate the likelihood of an alignment gap, MLH does a brute-force exhaus- H20: The H20 method, developed in this paper, builds on tive enumeration of all consistent trajectories, up to a given DM20 to develop a systematic differential calculus for number of indel events and a maximum gap length. Under finding HMM-based approximate solutions of continuous- theassumptionofaninfinite sequence, the resulting distribu- time Markov processes on strings that are “local” in the tion is technically still a pair HMM, albeit one with an infinite sense that the infinitesimal generator is a pair HMM. Our number of states (corresponding to every possible size of gap). approach addresses the limitations of DM20, identified in As our simulations in the Results section demonstrate, this ap- the previous paragraph. It does allow deletions to follow proach is extremely slow, and effectively impossible for trajec- insertions, so as to better account for covariation between tories with more than three overlapping indel events; however, insertion and deletion gap sizes. The TKF91 model emerges for very short evolutionary times, MLH04 remains the most as a special case: the closed-form solutions to TKF91 are also accurate approximation to GGI, short of direct simulation. exact solutions to our model. Finally, although our equations In the special cases of TKF91 and TKF92, the alignment gap can be derived without computational assistance, the anal- lengthsaregeometricallydistributed.Thisisnotnecessarilytrue ysis is greatly simplified by the use of symbolic algebra pack- in general for the GGI model (Rivas and Eddy 2015): alignment ages, both for the manipulation of equations, for which we gap lengths are not geometrically distributed even though the used Mathematica (Wolfram Research, Inc.) (version 2020), underlying indel event lengths are. Thus, a simple three-state and for the manipulation of state machines, for which we pair HMM—whose gap lengths are geometrically distributed— used our recently published software Machine Boss (Silvestre- cannot be an exact solution to GGI. Nevertheless, MLH04 Ryan et al. 2020).

Finite-State, Continuous-Time Machines 1191 Figure 3 Versions of the first two pair HMMs of Figure 2 that include start and end states, and so can be used for finite sequences. As in Figure 2, match states (sM) are orange, insert states (sI) are green, delete states (sD) are red, and null states (sN) are uncolored. (A) A version of GðDt) with start and end states, where deletion rates at the end of the sequence are elevated by a factor 1=ð1 yÞ so that the total rightward deletion rate at any residue is m, independent of its distance from the end. (B) A version of FðtÞ with start and end states.

The central idea of our approach is that the application of where V is the residue alphabet (e.g., or amino the infinitesimal generator to the approximating HMM gen- acids), V* is the set of all sequences over that alphabet (in- erates a more complicated HMM that, by a suitable coarse- cluding the empty sequence e), VN is the set of all sequences fi graining operation, can be mapped back to the simpler struc- of nite length N, B is the deleted sequence, C is the inserted sequence, and A; D V* are flanking sequences (we will ture of the approximating HMM. By matching the expected 2 mostly be considering the infinite-sequence approximation, transition usages of these HMMs, we derive ODEs for the where X; Y; A; D 6¼ e). transition probabilities of the approximator. Our approach is Suppose that cðÞ2t V* is a sequence evolving under the justified by improved results in simulations, yielding greater GGI model. Consider GðÞDt , the pair HMM defined in Figure accuracy and generality than all previous approaches to this 2B and Table 1. Assuming c (t)isinfinitely long, the forward problem, including DM20 (which can be seen as a restricted algorithm for GðÞDt computes the conditional distribution version of our method). Our approach is further justified by over an instant of evolutionary time: the emergence of the TKF91 model as an exact special case, without the need to introduce any additional latent variables G D D ; Q; D ðÞt XY ¼ PðÞþcðt þ tÞ¼YjcðÞ¼t X t oðÞt such as fragment boundaries. ℝD D ¼ expðÞt XY þoðÞt While we focus here on the multiresidue indel process, the ¼ I þ ℝXY Dt þ oðÞDt ; generality of the infinitesimal automata suggests that other local evolutionary models, such as those allowing neighbor- where I is the identity matrix over sequences (IXY ¼ 1ifX = dependent substitution and indel rates, might also be pro- Y,0ifX 6¼ Y). ductively analyzed using this approach. Our approach is to find a pair HMM, FðÞt (Fig- The sequence rate matrix and the infinitesimal-time ma- ure 2A), that approximates the matrix exponential t=Dt chine: We now give a concise preview of the approach de- FðÞ’t expðÞ¼ℝt limDt/0ðÞGðÞDt , by mapping the ma- scribed in detail in the Materials and Methods. chine product FðÞt GðÞDt (Figure 2C) back onto FðÞt þ Dt . ℝ The rate matrix of the GGI model is, for two sequences We match expected transition counts between classes of F / / ðM IDN XÞ: states in FðÞt GðÞDt to their representative transitions in

8 P QN > 2 > lxN 1ðÞ1 2 x r where X ¼ AD and Y ¼ ACD and C 2 VN > Ck > A;C;D k¼1 > P < myN21ðÞ1 2 y where X ¼ ABD and Y ¼ AD and B 2 VN ℝ ; ; XY¼> A BPD > R where X ¼ ABD and Y ¼ ACD and B; C 2 V; B 6¼ C > BC > A;B;PC;D > * :> 2 ℝXZ where X ¼ Y and Z 2 V Z6¼X

1192 I. Holmes Table 2 Glossary of mathematical notation, terminology, and abbreviations used in this paper

Term Meaning Defined in History Realization of a continuous-time process, including event times Introduction, Figure 1 Trajectory Summary of a history that includes only events but not times Introduction, Figure 1 Alignment Summary of a trajectory that shows homologous residues Introduction, Figure 1 Pair HMM A hidden Markov model for pairwise alignments Durbin et al. (1998) GGI The general geometric indel model Introduction TKF91 The links model, a special case of GGI for single-residue indels Thorne et al. (1991) TKF92 Sequel to the TKF91 model allowing multiresidue indels Thorne et al. (1992) MLH04 Approximation to GGI that enumerates short trajectories Miklós et al. (2004) LG05 Pair HMM used by PRANK alignment software Löytynoja and Goldman (2005) RS07 Pair HMM used by BAliPhy alignment software Redelings and Suchard (2007) LAHP19 Approximation to GGI that estimates gap probabilities by simulation Levy Karin et al. (2019) DM20 The cumulative indel model, a direct precursor to this work De Maio (2020) l, m Rate of beginning an insertion or deletion in the GGI model Introduction x, y Probability of extending an insertion or deletion in the GGI model Introduction R Substitution rate matrix in the GGI model Introduction Q The set of GGI model parameters (l, m, x, y, R) Introduction t Evolutionary time separating ancestor and descendant sequences Introduction Dt A very small amount of evolutionary time Introduction M Substitution probability matrix defined by M = exp (Rt) Introduction r Probability distribution over inserted residues defined by rR =0 Introduction V The residue alphabet, e.g., nucleotides or amino acids Introduction VN,V* Sets of sequences over V (N denotes fixed length, * denotes any length) Introduction e The empty sequence Introduction ℝ The rate matrix over V* for the GGI model Introduction M A generic pair HMM, a probabilistic input-output machine Materials and Methods K Number of states in the machine Materials and Methods sM, sI, sD, sN Sets of Match-, Insert-, Delete-, and Null-type states Materials and Methods ​ ​ sIDN, sMIDN, etc. Unions of state classes, e.g., sIDN ¼ sI [ sD [ sN Materials and Methods f A state path through a pair HMM Materials and Methods Q Transition probability matrix for a pair HMM Materials and Methods Q½M Transition matrix for a particular machine M Materials and Methods FðÞX/Y/Z The set of all paths starting in sX, possibly passing through states in sY, Materials and Methods and ending in sZ FðÞM/IDN/M The set of all paths that begin and end in the Match state, but do not Materials and Methods otherwise use it for the intervening steps ðÞX/Y J A matrix that contains 1’s for entries corresponding to sX / sY Materials and Methods transitions, and 0’s for other entries ðÞX/Y Q A matrix that contains probabilities for sX / sY transitions, and 0’s for Materials and Methods other entries s fi ∘ [ The pointwise matrix product, de ned by ðÞA B ij Aij Bij Materials and Methods SXðÞf The number of sX states in path f Materials and Methods TXYðÞf The number of sX / sY transitions in path f Materials and Methods ... F / / M EfjM½ An expectation over ðÞM IDN M for machine, . Materials and Methods U, V, W Geometric series sums involving the transition matrix Q Equation 1 FðÞt A pair HMM approximation to finite-time alignments under the GGI Figure 2A and 1 model. Sometimes abbreviated to F a, b, c, f, g, h, p, q, r Transition probabilities of FðÞt . All are functions of t Equation 3 and Equation 4 SX Expectation of SX over FðÞM/IDN/M Equation 2, Equation 4, and Equation 7 T XY Expectation of TXY over FðÞM/IDN/M Equation 2, Equation 5, and Equation 6 GðÞDt A pair HMM for alignments at infinitesimal time intervals under the GGI Figure 2B and Table 1 model. Sometimes abbreviated to G FðÞt GðÞDt Automata product of FðÞt and GðÞDt . Abbreviated to FG Figure 2C and Table 4 0 A K 3 K matrix of zeroes 1 A K 3 K matrix of ones

FðÞt þ Dt , and take the limit Dt / 0 to derive differential total rightward deletion rate starting at any residue is m, / equations for the transition weights of A0 ¼ QðMIDN YÞ. regardless of its distance from the end. We can then define A note on finite sequences: to model these we can set GðÞDt as in Figure 3A. Imposing reversibility on finite- N21 N ℝAB;A ¼ my for B 2 V , dropping the 1 2 y term for dele- sequence models takes slightly more care (Miklós et al. tions that remove the end of the sequence. This ensures the 2004).

Finite-State, Continuous-Time Machines 1193 Table 3 Interpretation of states in machine FðÞt (Figure 2A, defined in Three-state HMM); here, v in; v out 2 V represent input and output tokens from the residue alphabet

State Name Class On entry Input Output P (vout) 1Ms Reads v from input, writes v to output v v expðÞR Dt M in out in out vinvout s v — v r 2I I Writes out to output out vout 3D sD Reads vin from input vin ——

Materials and Methods differential equations for the transition probabilities of FðÞt . Finally, we show that this reduces correctly to the TKF91 As noted in the Introduction, our approach makes use of pair model when x = y =0. HMMs. We assume some familiarity with these models, Table 2 gives a glossary of mathematical terms used which are standard in bioinformatics. A tutorial introduction throughout this section. In the Supplemental Material, we can be found in Durbin et al. (1998). give some additional lemmas and a conjecture relating this The pair HMMs we will use are normalized for the condi- work to the expectation-maximization algorithm for param- tional probability PðÞdescendantj ancestor [rather than for the eter estimation. joint distribution PðÞancestor; descendant as is often seen in the bioinformatics literature]. Such conditionally normalized Expected transition usage pair HMMs are sometimes called input-output automata, or Suppose that we have a pair HMM, M, with K states that can transducers. Rather than simultaneously generating two out- be partitioned into match sM, insert sI, delete sD, and null sN put sequences, these state machines read a specified input states. As a shorthand we will write sID for sI [ sD, and so on. sequence and generate a probabilistic output. Following the Thus sMIDN is the complete set of K states. convention of Durbin et al. (1998), these pair HMMs have We will be considering models with only one match state, Match (input-output), Insert (output-only), and Delete (in- which, by convention, will always be the first state, so put-only) states corresponding to the M, I, and D columns in sM ¼ fg1 a pairwise alignment (see e.g., Figure 1I), as well as Null (N) ... Let f ¼ ðÞf1 fn denote a state path. The transition prob- states that do not input or output anything. Q ability matrix is with elements Qij ¼ P fkþ1 ¼ jjfk ¼ i . A key result enabling our approach is that two automata For X; Y; Z4fgM; I; D; N , let FðÞX/Y/Z denote the A; B A ðÞcan be multiplied together: the output of is fed into set of state paths with the following properties: the input of B. The serial operation of both machines can be represented by a single compound machine AB, constructed The path begins in an sX state; algorithmically from the two component machines. The al- The path ends in an sZ state; gorithm takes a Cartesian product of the two machines’ state For paths with more than just a begin and end state, the spaces, then synchronizes transitions in this joint space so intermediate states are all sY states. A that each output-writing transition of coincides with an Let JðÞX/Y be a matrix that selects transitions from s to s B X Y, input-reading transition of , with some ordering of updates so that indels are not double-counted. The algorithms for X/Y 1 i 2 s ; j 2 s JðÞ¼ X Y ; doing these multiplications are published (Westesson et al. ij 0 otherwise 2011, 2012), and software implementations are available ðÞX/Y ðÞX/Y (Silvestre-Ryan et al. 2020). and let Q [ J ∘Q, where ∘ is the pointwise product, For our purposes, the only machine-multiplication that defined as follows: if A, B are two matrices of the same size, ∘ [ we need is the one shown in Figure 2. A three-state ma- then ðÞA B ij AijBij. chine representing a distribution over finite-time pairwise alignments [FðÞt , Figure 2A] is multiplied by another three- A concrete example is the machine FðÞt in Figure 2A, which state machine representing the action of the GGI model over has sM = {1}, sI = {2}, sD = {3}, and sN ¼ ∅. Thus, for an infinitesimal time interval [GðÞDt , Figure 2B], yielding example, sMIDN = {1, 2, 3} and sIDN = {2,3}. Some exam- a nine-state machine [FðÞt GðÞDt , Figure 2C]. ples of state paths in FðÞM/IDN/M areðÞ 1; 1 , ; ; ; ; ; ; ; ; ; In the following sections, we show that FðÞt GðÞDt can be ðÞ1 2 2 2 1 , andðÞ 1 3 3 2 3 10. 1 systematically mapped back to FðÞt þ Dt by a coarse-graining abc fi Q @ A operation that involves nding the expected number of tran- The transition0 matrix is1 ¼ fgh. The matrix sitions of each type (I/I, I/D, etc.) in walks through the 011 pqr JðÞMID/ID @ A state space that begin and end in the M state. In the following ¼ 011 selects0 transitions1 into the I sections, we describe how to calculate these expectations, 011 0 bc / with expository examples relating to FðÞt , which we then and D states, so QðÞMID ID ¼ @ 0 ghA. define in detail. We then apply this to map FðÞt GðÞDt back 0 qr to FðÞt þ Dt , and, by taking the limit Dt / 0, derive

1194 I. Holmes Consider a random walk f 2 FðÞM/IDN/M that Let SX ðÞf be the number of X states in f, excluding the beginsandendsinthematchstate,passingonlythrough final state. Thus, ; : nonmatch states in between. Let TXYðÞ¼f jfðÞi j . ; ... F / / SMðÞ¼f 1 j i fi fj 2 ðX N Y Þgj count the number of transi- tions from X states to Y states if null states are removed from the P walk. In other words this is the number of subpaths of f that go SX ðÞ¼f TXYðÞf Y2MP;I;D;N from a state in s , via zero or more null states, to a state in s . X Y ¼ TYXðÞf : Y2M;I;D;N Continuing with the example of machine FðÞt in Figure 2A, the state path (1, 2, 2, 2, 1) has state types (M, I, I, I, M) Three-state HMM and thus transitions (MI, II, II, IM), so the transition counts Consider the machine FðÞt shown in Figure 2A with are T = T = 1 and T = 2. For an example of a ma- MI IM II s ¼ fg1 , s ¼ fg2 , s ¼ fg3 , and chine with more complex structure including null states, M I D F G D 0 1 consider ðÞt ðÞt of Figure 2C. A state path atðÞ btðÞ ctðÞ ; ; ; ; ; ; ðÞ1 2 3 9 9 5 1 through this machine has state types (M, I, I, Q½¼FðÞt @ ftðÞ gtðÞ htðÞA; N, N, D, M). When we remove the null states, this becomes (M, I, I, D, ptðÞ qtðÞ rtðÞ M) and so the transitions are (MI, II, ID, DM). Thus the transition counts are TMI ¼ TII ¼ TID ¼ TDM ¼ 1. with a + b + c =1,f + g + h, and p + q + r = 1. Table 3 gives the emission probabilities.

To find the expected value of TXY in walks that begin and Here, t will play the role of a time parameter. end in the match state, we break down paths into three Let segments: M to X (via insert, delete, and null states), X to Y (via null states), and Y to M (via insert, delete, and null TXYðÞt ¼ EfjFðÞt ½TXYðÞf states). The first and last segments are only required if M 6¼ X and Y 6¼ M. The corresponding sets of state paths are SXðÞ¼t EfjFðÞt ½SXðÞf FðÞM/IDN/X , FðÞX/N/Y , and FðÞY/IDN/M . X To sum over all paths in a given set FðÞX/Y/Z ,wecan ¼ T ðÞt PN 2 XY B Ak 12A 1 ; ; ; use the geometric series formula ¼ k¼0 ¼ ðÞ, Y2M I D N 1 3 A QðÞY/Y where is the K K identity matrix. Setting ¼ , X / the effective X Z transition probabilities are the ¼ TYXðÞt ðÞX/Z ðÞX/Y ðÞY/Z nonzero entries of C ¼ Q þ Q BQ .We Y2M;I;D;N can further simplify the formulae, for example by C JðÞX/Z ∘ Q A9BQ JðÞX/Z ∘ B9Q noting that ¼ ðÞ¼þ ðÞwhere SMðÞ¼t 1; (2) / 2 A9 ¼ QðÞMIDN Y and B9 ¼ ðÞ12A9 1. Using the methods of the previous paragraphs, the expec- where the expectations are as defined in (1) [throughout this paper, F / / tation of TXY is such expectations are over f 2 ðÞM IDN M ]. Evidently, ; ; ; EfjM½TXYðÞf P atðÞ¼TMMðÞt btðÞ¼TMIðÞt ctðÞ¼TMDðÞt ¼ PðÞf T ðÞf XY = ; = ; = ; f2FðÞM/IDN/M !ftðÞ¼TIMðÞt SIðÞt gtðÞ¼TIIðÞt SIðÞt htðÞ¼TIDðÞt SIðÞt PN PN / i / / j = ; = ; = : ¼ QðÞMIDN IDN JðÞX Y ∘ QðÞMIDN N Q ptðÞ¼TDMðÞt SDðÞt qtðÞ¼TDIðÞt SDðÞt rtðÞ¼TDDðÞt SDðÞt i¼0 j¼0 (3) N X k QðÞIDN/MIDN By (1), 11 k¼0 ðbð1 2 rÞþcqÞðfð1 2 rÞþhpÞ / SIðtÞ¼ ¼ UJðÞX Y ∘ðÞVQ W ; (1) ðð12gÞð12rÞ2hgÞ2 11 where ðcð1 2 gÞþbhÞðpð1 2 gÞþfqÞ SDðtÞ¼ : (4) 12g 12r 2hq 2 2 ðð Þð Þ Þ / 1 U ¼ 12QðÞMIDN IDN The essence of our approach is to use transducer compo- 21 ðÞMIDN/N V ¼ 12Q sition to study infinitesimal increments in TXYðÞt , and thereby 21 obtain differential equations that can be solved to find these W 2QðÞIDN/MIDN : ¼ 1 parameters.

Finite-State, Continuous-Time Machines 1195 Table 4 Interpretation of states in the composite machine FðÞt GðÞDt (Figure 2C, defined in Rate of change of expected transition counts)

State Name Class On entry Input Output P (vout) 1MMs F reads input v , writes v to G, enters M state v v expðÞRðÞt þ Dt M in thru in out vinvout G reads vthru from F, writes output vout, enters M state s F — v r 2mI I stays in M state (no transition) out vout G writes output vout, enters I state s F v G — v r 3IM I writes thru to , enters I state out vout G reads vthru from F, writes output vout, enters M state s F — v r 4iI I stays in I state (no transition) out vout G writes output vout, enters I state 5MDsD F reads input vin, writes vthru to G, enters M state vin —— G reads vthru from F, enters D state 6DmsD F reads input vin, enters D state vin —— G stays in M state (no transition)

7DdsD F reads input vin, enters D state vin —— G stays in D state (no transition)

8DisD F reads input vin, enters D state vin —— G stays in I state (no transition)

9IDsD F writes vthru to G, enters I state —— — G reads vthru from F, enters D state

Here, v in; v out; v thru 2 V represent input, output, and pass-through tokens from the residue alphabet. Each state has the form XY where X is an F state and Y is a Gstate. Each transition of FG can involve an F-transition, a G-transition, or both. Uppercase (XY) is used to indicate that a component machine makes a transition when the compound state is entered; lowercase (xy) indicates the component machine makes no transition. Thus, transitions intofg MM; IM; MD; Dm; Dd; Di; ID involve an F-transition; transitions intofg MM; mI; IM; iI; MD; ID involve a G-transition. This structure arises from simple rules for transition synchronization in multiplied machines (Westesson et al. 2011, 2012). By these rules, G can only make an input-reading transition when F makes an output-writing transition, and vice versa. So, for example, when FG makes the transition mI / MM, what happens internally is that F makes a self-looping M / M transition while G makes an I / M transition, and an (unobserved) token vthru is passed through from F to G. However, if FG then makes the transition MM / Dm, internally F makes a M / D transition without outputting anything, so G just stays in the M state without making a transition.

Infinitesimal-time machine Rate of change of expected transition counts The infinitesimal transducer GðÞDt of Figure 2B has states Composing FðÞt (Figure 2A) with GðÞDt (Figure 2B) yields sM ¼ fg1 , sI ¼ fg2 , sD ¼ fg3 , and transition matrix FðÞt GðÞDt , the machine of Figure 2C. ; ; 0 1 This machine has states sM ¼ fg1 , sI ¼ fg2 3 4 , ; ; ; 1 2 ðÞl þ m Dt lDt mDt sD ¼ fg5 6 7 8 , sN ¼ fg9 , and transition matrix Q½¼GðÞDt @ 1 2 xx0 A: 1 2 y 0 y Table 4 describes the interpretation of each state, and its emission probability distribution. See Table 1 for emission probabilities. This describes the GGI The transducer composition FðÞt 3 GðÞDt was performed using model as defined in the introduction. The model parameters the automata algebra program Machine Boss (Silvestre-Ryan et al. are the insertion and deletion rates (l, m) and extension 2020). A general procedure for doing this for any two machines probabilities (x, y), and the time interval Dt 1=ðÞl þ m . A; B involves taking the Cartesian product of the two machines’

0 1 aðÞ1 2 ðÞl þ m Dt lDtbðÞ1 2 ðÞl þ m Dt 0 amDtc00bmDt B C B aðÞ1 2 x xbðÞ1 2 x 0000c 0 C B C B fðÞ1 2 ðÞl þ m Dt 0 gðÞ1 2 ðÞl þ m Dt lDtfmDth00gmDt C B C B fðÞ1 2 x 0 gðÞ1 2 x x 000h 0 C B C Q½FðÞt GðÞ¼Dt B aðÞ1 2 y 0 bðÞ1 2 y 0 ay 0 c 0 by C: B C B pðÞ1 2 ðÞl þ m Dt 0 qðÞ1 2 ðÞl þ m Dt 0 pmDtr00qmDt C B C B pðÞ1 2 y 0 qðÞ1 2 y 0 py 0 r 0 qy C @ A pðÞ1 2 x 0 qðÞ1 2 x 0000r 0 fðÞ1 2 y 0 gðÞ1 2 y 0 fy 0 h 0 gy

1196 I. Holmes Figure 4 Relative entropies of simulated gap length distributions P (SI, SD) to the predictions of various approximate methods. The approximation methods are TKF91 (Thorne et al. 1991), TKF92 (Thorne et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja and Goldman 2005), RS07 (Redelings and Suchard 2007), and DM20 (De Maio 2020), reviewed in the Introduction; and H20 (the present method), defined in the Materials and Methods. The simulation procedure is defined in the Results. Starting from a parameter setting (l = m =1,x = y = 0.5) representative of indel lengths in protein structural alignments, the panels show the following parameter sweeps: (A) varying the time parameter t over a range of scales; (B and C) varying the indel extension probabilities x, y separately, thus exploring irreversible models with insertion-deletion asymmetry; (D) varying x and y jointly while holding x = y, exploring reversible models with differing indel lengths; and (E and F) varying the indel rate parameters l,m separately. These are the same experiments (and ordering thereof) shown in Figure 6.

2 state spaces and then synchronizing their transitions so that the d EfjFðÞtþDt ½TXYðÞf EfjFðÞt ½TXYðÞf A B M TXYðÞ¼t lim output of drives the input of . This ensures that, if XY dt Dt/0 Dt represents the result of the forward algorithm for machine 2 M EfjFðÞt GðÞDt ½TXYðÞf EfjFðÞt ½TXYðÞf (withP null states eliminated) and sequences X,Y, then ¼ lim : AB A B Dt/0 Dt ðÞXZ¼ XY YZ. More details on these operations can be Y found elsewhere (Silvestre-Ryan et al. 2020; Westesson et al. Expanding (1) to first order in in Dt and then taking the limit 2011, 2012) and their information-theoretic and linguistic roots Dt / 0, we arrive, using Mathematica (Wolfram Research, in Mohri et al. (2002). Inc.) for the symbolic algebra, at the following equations for We now make the approximation FðÞt GðÞDt FðÞt þ Dt , the expected transition counts: which is to say that the nine-state machine of Figure 2C can d bf 1 2 y be approximated by the three-state machine of Figure 2A by ðÞ2 TMMðÞ¼t m 2 ðÞl þ m a infinitesimally increasing the time parameter of the sim- dt 1 gy pler machine. This will not, in general, be exact (with the d bðÞ1 2 g T ðÞ¼t 2 m þ lðÞ1 2 b exception of the TKF91 model, discussed in the next sec- dt MI 1 2 gy tion). However, by mapping states (and hence transitions) d fðÞ1 2 g ðÞbðÞþ1 2 r cq of FG back to F,andsettingF’s transition probabilities pro- T ðÞ¼t la 2 m dt IM ðÞ1 2 gy ðÞfðÞþ1 2 r hp portional to the expected number of times the correspond- ing transitions are used in FG,wefind a maximum- d ðÞ1 2 g ðÞbðÞþ1 2 r 2 hq cgq T ðÞ¼t m ; (5) likelihood fit. dt DI ðÞ1 2 gy ðÞfðÞþ1 2 r hp The expected transition counts evolve via the coupled differential equations with boundary condition

Finite-State, Continuous-Time Machines 1197 Figure 5 Moments of the gap length distribution P (SI,SD) as revealed by simulation, compared to the predictions of various approximate methods. The approximation methods are TKF91 (Thorne et al. 1991), TKF92 (Thorne et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja and Goldman 2005), RS07 (Redelings and Suchard 2007), and DM20 (De Maio 2020), reviewed in the Introduction; and H20 (the present method), defined in the Materials and Methods. The simulation procedure is defined in the Results. The pa- rameter range explored is l = m =1,x = y = 0.5, and 227 # t # 21, corresponding to panel A of Figures 4 and 6. Panel A shows the expected insertion length as a function of time, and panel B shows the probability that there is no gap between adjacent ancestral residues.

1198 I. Holmes Figure 6 Covariance between the numbers of inserted (SI) and deleted (SD) residues in an alignment gap, as revealed by simulation and predicted by various approximate methods. The approximation methods are TKF91 (Thorne et al. 1991), TKF92 (Thorne et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja and Goldman 2005), RS07 (Redelings and Suchard 2007), and DM20 (De Maio 2020), reviewed in the Introduction; and H20 (the present method), definedintheMaterials and Methods. The simulation procedure is definedintheResults. Starting from a parameter setting (l = m =1,x = y = 0.5) representative of indel lengths in protein structural alignments, the panels show the following parameter sweeps: (A) varying the time parameter t over a range of scales; (B and C) varying the indel extension probabilities x, y separately, thus exploring irreversible models with insertion-deletion asymmetry; (D) varying x and y jointly while holding x = y,exploring reversible models with differing indel lengths; and (E and F) varying the indel rate parameters l, m separately. These are the same experiments (and ordering thereof) shown in Figure 4. 1 X ¼ Y ¼ M d l T ðÞ¼0 S ðÞ¼t 1 þ S ðÞÞt XY 0 otherwise dt I 1 2 x I d m SDðÞ¼t 1 þ SDðÞÞt ; The parameters (a, b, c, f, g, h, p, q, r) are defined by (3) for dt 1 2 y t . 0, condition at t = 0, where aðÞ¼0 1, fðÞ¼0 1 2 x, gðÞ¼0 x, pðÞ¼0 1 2 y, rðÞ¼0 y, and bðÞ¼0 cðÞ¼0 which, generalizing De Maio (2020), have the closed-form hðÞ¼0 qðÞ¼0 0. solution The remaining counts are obtained from (2): lt 2 SIðÞ¼t exp 2 1 TMDðÞ¼t 1 2 TMMðÞt 2 TMIðÞt 1 x 2 2 TIIðÞ¼t SIðÞt TMIðÞt TDIðÞt mt S ðÞt ¼ exp 2 1: (7) D 1 2 y TIDðÞt ¼ TMIðÞt þ TDIðÞt 2 TIMðÞt

TDMðÞ¼t 1 2 TMMðÞt 2 TIMðÞt

TDDðÞ¼t SDðÞþt TMMðÞþt TIMðÞt 2 TDIðÞt 2 1: (6) TKF91 model When x = y = 0, our model reduces to the TKF91 model The expected state occupancies are governed by the fol- (Thorne et al. 1991). Holmes and Bruno (2001) showed lowing equations: that the solution to the TKF91 model can be expressed

Finite-State, Continuous-Time Machines 1199 Figure 7 The running time of the MLH04 method is prohibitive and increases steeply for long gaps, since it requires the explicit enu- meration of all intermediate indel states in the trajectory (Miklós et al. 2004). In this plot, each data point is an average of 100– 1000 repetitions on a late-2014 iMac (4GHz quad-core Intel i7 CPU, 32GB 1600MHz DDR3 RAM). The running times of DM20 and H20 are negligible in comparison (, 1ms).

as a pair HMM of the form shown in Figure 2A, with A version of G with start and end states is shown in Figure parameters 3A. This can be carried throughout the analysis by also specify- ing start and end states for F and deriving differential equations 2 ; ; 2 2 ; atðÞ¼ðÞ1 b a btðÞ¼b ctðÞ¼ðÞ1 b ðÞ1 a for the transitions involving these states. Since this complicates 2 ; ; 2 2 ; ftðÞ¼ ðÞ1 b a gtðÞ¼ b htðÞ¼ ðÞ1 b ðÞ1 a the formulae considerably, we have omitted it. Instead, we pro- ptðÞ¼ðÞ1 2 g a; qtðÞ¼g; rtðÞ¼ðÞ1 2 g ðÞ1 2 a ; pose a heuristic modification of F that includes ad hoc transi- where tions from/to start and end states, shown in Figure 3B. Parameterizing the model aðÞt; l; m ¼ expðÞ2mt 2 2 2 In principle, the likelihood function is sufficient to parame- bðÞ¼t; l; m lðÞexpðÞlt expðÞmt mexpðÞ2lt 2 lexpðÞ2mt terize the model: we can compute the gradient numerically to ; ; 2 mb : gðÞ¼t l m 1 lðÞ1 2 a locate the maximum-likelihood parameters, or use Markov Chain Monte Carlo sampling to find the posterior. It can readily be verified that this is an exact solution to In practice, it might be more efficient to use an expectation- Equation (3) through Equation (7), when x = y = 0. Thus, maximization algorithm tailored to this model. In the Sup- our model reduces exactly to the TKF91 model when indels plemental Material, we conjecture that a simple expectation- involve only single residues. In this case, the equivalence maximization algorithm does exist for this model, and we FðÞ¼t þ Dt FðÞt GðÞDt is exact in the limit Dt / 0. outline one way it might be arrived at. Using the model for alignment Data availability statement To apply this model to sequence alignment, we need to specify Our code implementing this model is available under an open- a start and end state for G, rather than implicitly assuming source license at https://github.com/ihh/trajectory-likelihood/ infinite-length sequences as we have done up to this point. tree/benchmark.

1200 I. Holmes File S1 (gzipped tarball) contains JSON files describing the deletions become longer (which happens as t or y get large). state machines in Figure 2, a Makefile and short script for Because of the OðÞL time complexity of finding a position in manipulating these state machines, a Mathematica notebook a sequence and then inserting or deleting elements, the sim- deriving the equations in this paper, and a text file listing the ulation time is O NL2t yielding OðÞNLt indel events. Thus, contents of the tarball. Supplemental material available at when increasing L, we also reduced N. The time required for figshare: https://doi.org/10.25386/genetics.13040585. simulation handily dominated the total CPU time taken by the experiment, in almost all cases; the longest-running data points of Figure 4A each required over 10 min on our late- Results 2014 iMac (4GHz quad-core Intel i7 CPU, 32GB 1600MHz We implemented the H20 likelihood calculations described in DDR3 RAM). Even with these run times, the observed counts the Materials and Methods and those of the models (TKF91, were zero for a majority of the (SI,SD) tuples in many cases. TKF92, MLH04, LG05, RS07, and DM20) reviewed in the This illustrates a fundamental problem with the purely sim- Introduction. We also implemented a simulator for the un- ulation-based LAHP19 approach; running it enough times to derlying indel process, similarly to LAHP19 but using a suc- sample all cases is impractical. This may be one reason why cinct (run-length encoded) representation of the alignment. the authors of LAHP19 limited the maximum gap length G to We performed simulations at various parameter settings and 50 residues (Levy Karin et al. 2019). A potential solution to calculated summary statistics for all methods, including the this problem would be to use a limited sample to fit a para- relative entropy from the simulated to approximate joint dis- metric model, although we have not explored that approach. tribution P (SI, SD) and various marginals and moments of Of course, other indel simulation programs may run faster this distribution. Our implementations and simulation results than ours. are available at https://github.com/ihh/trajectory-likelihood/ For the MLH04 method, we limited G to 30 in all cases, tree/benchmark. since it takes impractically long to calculate likelihoods of In Figures 4–6, we show summary statistics for several longer gaps. This is illustrated by Figure 7, which plots the sweeps of the GGI model parameters (l, m, x, y, t) in the runtime of MLH04 as a function of gap length. To calculate ranges 227 # t # 21,0# x # 0.7, 0 # y # 0.65, 0 # x # likelihoods of gap lengths up to 30 residues at a particular 0.8 with y =x,0# x # 0.7, 0 , l # 1, and 0 , m # 1. All parameter setting, MLH04 takes roughly 10 sec on current sweeps are based around the point l = m =1,x = y = t = 0.5, desktop hardware, which is vastly slower than the microsec- which was chosen to be indel-symmetric (l = m and x = y) ond-scale runtime of all other methods (with the exception of and representative of amino acid indel lengths: the setting direct simulation, which requires many repetitions to achieve y ’ 0:5 is consistent with a previous maximum-likelihood statistical accuracy). Because we only considered shorter estimate based on indel lengths in the HOMSTRAD database gaps for MLH04, when calculating the relative entropy from (Miklós et al. 2004). These parameter sweeps have the effect the simulated gap length distribution, we used the truncated of exploring each parameter individually, as well as (in the distribution PSðÞI; SDjSI # G; SD # G to avoid infinities that case of the x-sweep where y = x) varying the length of inser- would otherwise occur due to MLH04 assigning zero proba- tions and deletions simultaneously. bility to longer gaps. To map the GGI model parameters onto those of the models Figure 4 shows the relative entropy (Kullback–Leibler di- being evaluated, we used the mappings defined by De Maio vergence) between the simulated and various approximated (2020), with some adjustments to allow for cases where distributions for the six parameter sweeps: t, x, y (separately insertions and deletions were asymmetric, which were not and together), l, and m. Starting with the time sweep in reported by De Maio. These mappings are defined in the Figure 4A, we see a pattern that was broadly repeated in time Supplemental Material. Our criteria for evaluating a model sweeps across other parameterizations (data not shown): at was based on the obviousness of these mappings; so, for very short times, the MLH04 trajectory-enumerating approx- example, we did not include models significantly more pa- imation is the best fit to the simulated distribution (with the rameter-rich than GGI (Rivas and Eddy 2015), where to de- proviso that it can only handle short gaps, and takes signifi- fine a mapping would have required so many choices as to cant time to compute). At these low times, the DM20 and have created a new model. H20 approximations are almost indistinguishable, and are In each experiment we performed N simulations on a se- the second-best fit; TKF92 is the next best after that, followed quence of length L and estimated gap sizes up to G residues, by the PRANK and BAliPhy HMMs. However, when t gets discounting gaps at the end of the sequence (which have large enough (in Figure 4A it occurs around t * 0:4), the di- different statistics). In most cases these settings were L = vergence of MLH04 shoots up, to the point where it quickly 103 and G =102, with N =107 for t , 225, N =106 for becomes the worst or second-worst fit. This is presumably 225 # t , 224, and N =105 for 224 # t , 1. For t $ 1, and because, at higher t, there is a significant probability of hav- also for y . 0.6, we set L =105, G =103, and N =102. The ing more than three events in the trajectory (MLH04 is lim- higher N at low t was to ensure sufficient sampling of infre- ited to three events due to the combinatorial complexity of quent events over short evolutionary timespans. The higher enumerating longer trajectories). This leaves DM20 and H20 values of (G, L) at high (t, y) were to mitigate end effects as as the best methods. Shortly after this point, DM20 and H20

Finite-State, Continuous-Time Machines 1201 start to separate, so that H20 has a slight edge over DM20. but diverge as those rates increase, with H20 performing Meanwhile, LG05 (the PRANK HMM) is not defined for all better than DM20. parameterizations, since it contains probabilities propor- To summarize Figure 4: at all parameterizations, H20 is 2 2 2 lt tional to 1 2d where d ¼ 1 exp 1 2 x . Thus, at high more accurate than DM20, and is the most accurate of all the enough t or x, we have d . 0.5 and so the probabilities be- three-state pair HMMs. H20 is outperformed only by MLH04, come negative. The RS07 HMM (used by BAliPhy) remains and then only at very low values of t and for very short gaps defined but becomes inaccurate at high t. It should be noted (taking a very long time to compute). In the parameter range that the time values mt . 1 2 y correspond to a scenario where most alignment occurs (mt 1), DM20 and H20 are where the sequence is saturated with deletions, which may roughly equivalent; they diverge as gaps become very fre- beoflessrelevancetomanyapplications in alignment and phy- quent. Of the closed-form approximations, TKF92 is most logenetics. However, it is at the borderline of this region where accurate, but is significantly less reliable than H20. the differences between the methods are most pronounced. To understand these differences better, we examined Seeking to explore these differences further, we varied x moments of the simulated and approximate distributions. and y in Figure 4, B and C. The general trend is consistent: These are plotted in Figures 5 and 6. H20 and DM20 are the most accurate methods, LG05 is not Figure 5A plots the expected insertion length, focusing on defined for most of this parameter regime, and the other the t-parameter sweep (similar trends were apparent in the methods cluster in a pack, affected to varying degrees by other sweeps). All the methods that find explicit closed-form the parameter sweep. TKF92 (which models multiresidue formulae for the expected insertion length as a continuous- indels) is consistently a better fit than TKF91 (which does timeprocess(whichistosayTKF91,TKF92,DM20,andH20) not), as might be expected. Both TKF91 and TKF92 explic- show an exact fit to the simulated distribution, while RS07 itly assume reversibility between insertions and deletions, deviates more significantly,and LG05 as previously noted is only and appear to be quite strongly affected by deviations from defined for part of the time range. MLH04 underestimates the symmetry. expected insertion length significantly after t * 223,againpre- When x and y are varied together, as in Figure 4D, we see sumably because MLH04 is a poor approximation when there quite different behavior across the different approximation are multiple expected indel events per observed alignment gap. methods. In the special case x = y = 0, when indel events can Figure 5B plots the probability that adjacent ancestral res- include only a single residue, the GGI model is essentially idues have no gap between them, P(SI = SD = 0). As with identical to TKF91. Unsurprisingly TKF91, TKF92, and Figure 5A, this is plotted as a function of time; and once H20—which all admit exact solutions in this special case— again, DM20 and H20 closely match the simulation, while have effectively zero relative entropy.By contrast, the MLH04 RS07 is an overestimate. However, for this statistic (unlike model performs very poorly when x = y = 0, since this pa- the expected insertion length), MLH04 and (where defined) rameterization requires at least K separate events to explain LG05 are as accurate as DM20 and H20, while TKF91 under- a gap of length K, and so MLH04 assigns zero probability to estimates the probability and TKF92 overestimates it. any gap of length $3, yielding an infinite Kullback–Leibler To summarize the results plotted in Figure 5, the expected divergence from the simulated distribution. As soon as x and insertion length and empty-gap probability are uninforma- y become nonzero, the MLH04 divergence becomes finite tive as to the differences between DM20 and H20: both meth- again, as (from the other direction) do those of TKF91, ods seem to get these statistics right. However, noting that TKF92, and H20. All relative entropies continue to rise as the main difference between the DM20 and H20 is that DM20 the mean indel length increases, MLH04 rising most slowly. does not allow transitions back and forth between the I and D Eventually, H20’s error approaches DM20’s error from below. states, instead requiring deletions to precede insertions, we At x * 0:78, the RS07 probabilities—which drop off rapidly might expect the covariance between insertion and deletion with increasing gap length—are rounded to zero by floating- lengths to be a better diagnostic. point precision errors, including for some gap lengths that are Figure 6 plots the insertion-deletion covariance for all pa- reached by the simulation, leading to infinities in the relative rameter sweeps, using the same ordering of subfigures (one entropy for RS07. per parameter sweep) as the relative entropy plots in Figure Varying l and m (Figures 4, E and F) reveals that H20, 4. What we see in these plots is that the relative accuracy of DM20, TKF92, and MLH04 rise monotonically in inaccuracy the covariances for DM20 and H20 closely tracks the relative as these rate parameters increase from zero. TKF91 is a worse entropies of their gap length distributions. In the time sweep, approximation than all these methods when l =0orm =0, DM20 and H20 perform identically at low times, but gradu- but stays mostly flat as they are increased, to the point ally diverge; in the x and y sweeps, they are separated where it eventually beats MLH04 as an approximation. throughout the parameter range; and in the l and m sweeps, RS07’s inaccuracy decreases monotonically with l,buthas they are close when the rate parameter is zero, but then di- aminimumasafunctionofm. LG05, as with the other verge rapidly. The most significant single factor determining sweeps, performs weakly and is only defined for part of the covariance between insertion length SI and deletion length the parameter regime. One notable point is that H20 and SD is the joint probability P(SI = SD = 0) that both are zero; DM20 have almost identical accuracy when l =0orm =0, however, Figure 5B shows that both DM20 and H20 basically

1202 I. Holmes get this probability correct. The most obvious remaining factor significant; further, our approach puts the process of deriving that might explain the different covariances is the absence of an the approximation on a more systematic footing. In the Sup- I / D transition in the DM20 pair HMM, which De Maio ex- plemental Material, we also conjecture similar ODEs for the plicitly suggested might be a potential limitation on the accuracy posterior expectations of the sufficient statistics that would be of DM20 (De Maio 2020). Thus, we propose this as an expla- required to fit this model by expectation maximization, al- nation for the improved performance of H20, while noting that though we have not yet tested this approach. this improvement is relatively small. Point substitution models are the foundation of likelihood As for the covariance of the other methods, the results in phylogenetics. There is, additionally, a substantial literature Figure 6 are broadly consistent with Figure 4, although they combining such models with HMMs and stochastic context- do not provide as a coherent single explanation of the differ- free grammars, for the purposes of genome annotation and ences between methods, as is the case when comparing other sequence analysis. Indels are a potential annotation DM20 and H20. Two points are worth remarking on. First, signal: phase-preserving indels, for example, are a common in Figure 6A, the covariance of MLH04 dips sharply around signature of protein-coding selection. As well as being useful * 23 t 2 , and in fact eventually becomes negative (although it tools to represent and (approximately) solve such models, is not shown on this plot, due to the logarithmic y-axis). The automata theory can be used to build sampling kernels ’ negative covariance can be explained by MLH04 s restriction (Redelings and Suchard 2007) and reconstruct ancestral fi to a small nite number of indel events in any given trajec- sequences (Westesson et al. 2012). tory, which implies that every insertion event is one event Our emphasis on the GGI model, a continuous-time Markov that cannot be a deletion, and vice versa. The other point process defined on sequences of residues, somewhat disadvan- worth noting is that, as is the case with the relative entropies, tages models like TKF92, which technically defines a process on Figure 6D clearly shows that TKF91, TKF92, and H20 are an sequences of multiresidue fragments. We have argued that there is fi exact t to the simulated data when x = y = 0, which reduces no evidence such indivisible fragments really exist, so instead we the GGI model to the TKF91 model. evaluated TKF92as an approximation to theGGI model.However, We can summarize all the simulation results as follows. In the routine usage of amino acid fragment models to predict virtually all cases, H20 is the most accurate of pair HMM protein tertiary structure suggests a valid counterargument: such methods, outperformed only by MLH04 when the rate of models may usefully capture some aspects of selection. Further, indels is slow enough that thereisnegligibleprobability TKF92 can be generalized in other ways, allowing for richer that an observed alignment gap can be explained by mul- models of fragment mutation; for example, to model the evolution tiple overlapping indel events. DM20 is very close behind of RNA structure. In this context, it is promising that our method H20; the differences between DM20 and H20 appear to be recovers TKF91 (and therefore TKF92) as special cases. explainable in terms of H20’s better modeling of covaria- Input-output automata are well suited to modeling indels in tion between insertion and deletion lengths, which is prob- statistical phylogenetics. It seems possible that the method of this ably attributable to an additional I / D transition in the paper might be applied to other instantaneous rate models of H20 pair HMM. local evolution where the infinitesimal generator can be repre- sented as an HMM. It is tempting to speculate that a similar Discussion approachmayalsobeproductively applied to stochastic context- We have shown that an evolutionary model which can be free grammars, so as to analyze RNA; or to model literary texts, represented infinitesimally as an HMM can be formally con- phonemes, vocabularies, music, source code, bird song, or other nected to a pair HMM that approximates its finite-time solu- alignable sequences that evolve by local edits over time. tion. This may be viewed as an automata-theoretic framing of Acknowledgment the Chapman–Kolmogorov equation. Ours is a coarse-graining technique. It is generally the case During review, this text benefitted greatly from suggestions that composing two state machines will yield a more complex of Jotun Hein, Nicola De Maio, and Elena Rivas. This work machine, since the composite state space is the Cartesian was funded by NIH grant HG004483. product of the components. We approximate this more com- plex machine with a renormalizing operation that eliminates Literature Cited null states and maps the remaining states of each type back to Bouchard-Côté, A., and M. I. Jordan, 2013 Evolutionary inference a single representative state in the approximator. via the Poisson indel process.Proc. Natl. Acad. Sci. USA110: We used this approach to derive ODEs for the transition 1160–1166. https://doi.org/10.1073/pnas.1220450110 probabilities of a minimal (three-state) pair HMM that ap- Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt, 1978 A model proximates a continuous-time indel process with geometri- of evolutionary change in proteins. In Atlas of Protein Sequence – cally distributed indel lengths. We have implemented numerical and Structure, edited by M. O. Dayhoff, 5 (Suppl. 3), pp. 345 352, National Biomedical Research Foundation, Washington, DC. solutions to these equations, and demonstrated that they De Maio, N., 2020 The cumulative indel model: fast and accurate outperform the previous best methods. The improvement in statistical evolutionary alignment. Syst. Biol. syaa050. https:// accuracy over DM20, the closest method, is small but doi.org/10.1093/sysbio/syaa050

Finite-State, Continuous-Time Machines 1203 Durbin, R., S. Eddy, A. Krogh, and G. Mitchison, 1998 Biological Redelings, B. D., and M. A. Suchard, 2007 Incorporating indel Sequence Analysis: Probabilistic Models of Proteins and Nucleic information into phylogeny estimation for rapidly emerging Acids, Cambridge University Press, Cambridge, UK. https:// pathogens. BMC Evol. Biol. 7: 40. https://doi.org/10.1186/1471- doi.org/10.1017/CBO9780511790492 2148-7-40 Holmes, I., and W. J.Bruno, 2001 Evolutionary HMMs: a Bayesian Rivas, E., and S. R. Eddy, 2015 Parameterizing sequence align- approach to multiple alignment. Bioinformatics 17: 803–820. ment with an explicit evolutionary model. BMC Bioinfor- https://doi.org/10.1093/bioinformatics/17.9.803 matics16: 406. https://doi.org/10.1186/s12859-015-0832-5 Kimura, M., 1980 A simple method for estimating evolutionary Silvestre-Ryan, J., Y. Wang, M. Sharma, S. Lin, Y. Shenet al., rates of base substitutions through comparative studies of nu- 2020 Machine Boss: rapid prototyping of bioinformatic automata. cleotide sequences. J. Mol. Evol. 16: 111–120. https://doi.org/ Bioinformatics btaa633. https://doi.org/10.1093/bioinformatics/ 10.1007/BF01731581 btaa633 Levy Karin, E., H. Ashkenazy, J. Hein, and T. Pupko, 2019 A sim- Thorne, J. L., H. Kishino, and J. Felsenstein, 1991 An evolutionary ulation-based approach to statistical alignment. Syst. Biol. 68: model for maximum likelihood alignment of DNA sequences. 252–266. https://doi.org/10.1093/sysbio/syy059 J. Mol. Evol. 33: 114–124. https://doi.org/10.1007/BF02193625 Löytynoja, A., and N. Goldman, 2005 An algorithm for progressive mul- Thorne,J.L.,H.Kishino,andJ.Felsenstein, 1992 Inching toward tiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. USA reality: an improved likelihood model of sequence evolution. 102: 10557–10562. https://doi.org/10.1073/pnas.0409137102 J. Mol. Evol.34: 3–16. https://doi.org/10.1007/BF00163848 Miklós, I., G. Lunter, and I. Holmes, 2004 A “long indel” model for Westesson, O., G. Lunter, B. Paten, and I. Holmes, 2011 Phylogenetic evolutionary sequence alignment.Mol. Biol. Evol. 21: 529–540. automata, pruning, and multiple alignment. arXiv doi: 10.1103/ https://doi.org/10.1093/molbev/msh043 4347v3 (Preprint posted October 23, 2014). Mizuguchi, K., C. M. Deane, T. L. Blundell, and J. P. Overington, Westesson, O., G.Lunter, B. Paten, and I.Holmes, 2012 Accurate 1998 HOMSTRAD: a database of protein structure alignments reconstruction of insertion-deletion histories by statistical phy- for homologous families. Protein Sci. 7: 2469–2471. https:// logenetics. PLoS One7: e34572. https://doi.org/10.1371/jour- doi.org/10.1002/pro.5560071126 nal.pone.0034572 Mohri, M., F. Pereira, and M. Riley, 2002 Weighted finite-state Wolfram Research, Inc. Mathematica version 12.1.0.0, March 2020. transducers in speech recognition. Comput. Speech Lang. 16: 69–88. https://doi.org/10.1006/csla.2001.0184 Communicating editor: M. Beaumont

1204 I. Holmes