Are Profile Hidden Markov Models Identifiable? Srilakshmi Pattabiraman Tandy Warnow Department of Electrical and Computer Engineering Department of Computer Science University of Illinois at Urbana-Champaign University of Illinois at Urbana-Champaign Urbana, Illinois Urbana, Illinois [email protected] [email protected] ABSTRACT 1 INTRODUCTION Profile Hidden Markov Models (HMMs) are graphical models that Profile Hidden Markov Models (HMMs) are arguably themost can be used to produce finite length sequences from a distribution. common statistical models in bioinformatics. Originally introduced In fact, although they were only introduced for bioinformatics 25 by Haussler and colleagues in [10, 12], and then expanded later years ago (by Haussler et al., Hawaii International Conference on in many subsequent texts [4–6, 9, 11, 21, 25], profile HMMs are Systems Science 1993), they are arguably the most commonly used now used in many analytical steps in biological sequence analysis statistical model in bioinformatics, with multiple applications, in- [15, 17–19, 22]. cluding protein structure and function prediction, classifications Profile Hidden Markov models are graphical models with match of novel proteins into existing protein families and superfamilies, states, insertion states, and deletion states; and the match and in- metagenomics, and multiple sequence alignment. The standard use sertion states emit letters from an underlying alphabet Σ (i.e., Σ of profile HMMs in bioinformatics has two steps: first a profile may be the 20 amino acids, the four nucleotides, or some other HMM is built for a collection of molecular sequences (which may set of symbols). In the standard form presented in [4] (widely in not be in a multiple sequence alignment), and then the profile HMM use in bioinformatics applications), each profile Hidden Markov is used in some subsequent analysis of new molecular sequences. model has a single start state and a single end state, and every path The construction of the profile thus is itself a statistical estimation through the model produces a string from Σ∗. The topology of this problem, since any given set of sequences might potentially fit standard model as seen in Figure 1 shows directed edges between more than one model well. Hence a basic question about profile certain pairs of states, and each such directed edge has a non-zero HMMs is whether they are statistically identifiable, which means transition probability. that no two profile HMMs can produce the same distribution on In this paper, we address the question of statistical identifiability finite length sequences. Indeed, statistical identifiability is afun- of profile Hidden Markov models, which in essence asks whether damental aspect of any statistical model, and yet it is not known the model is reconstructible given the probability distribution it whether profile HMMs are statistically identifiable. In this paper, we defines23 [ ]. Thus, if there are two sets of parameters of the model report on preliminary results towards characterizing the statistical that generate the same joint distribution, then the model is not identifiability of profile HMMs in one of the standard forms used identifiable. Note that if a model is not identifiable, then itisim- in bioinformatics. possible for any algorithm designed to estimate the model from a finite dataset to be statistically consistent: that is, it is not possible CCS CONCEPTS for the method to converge in probability to the true model with • Applied computing → Molecular sequence analysis; increasing amounts of data. Statistical identifiability is a basic property of statistical mod- KEYWORDS els, and is the subject of rigorous study [1–3, 7, 13, 16, 20]. Indeed, the importance of identifiability is evident in the following quotes: Profile Hidden Markov Models, statistical identifiability, molecular “Unidentifiable models are pathological, usually due to conceptual sequence analysis error in model formulation” [24] and “Many statisticians frown ACM Reference Format: on the use of under-identified models: if a parameter is not iden- Srilakshmi Pattabiraman and Tandy Warnow. 2018. Are Profile Hidden tifiable, two or more values are indistinguishable, no matter how Markov Models Identifiable?. In ACM-BCB ’18: 9th ACM International Con- much data you have” [8]. However, to the best of our knowledge, ference on Bioinformatics, Computational Biology, and Health Informatics, nothing has yet been established about the statistical identifiability August 29–September 1, 2018, Washington, DC, USA. ACM, New York, NY, of profile Hidden Markov Models (HMMs), although the question USA, 9 pages. https://doi.org/10.1145/3233547.3233563 of identifiability of parameters in HMMs more generally has also been specifically addressed [14]. Permission to make digital or hard copies of all or part of this work for personal or In this paper, we partially characterize the conditions under classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation which profile HMMs are statistically identifiable. Our study in- on the first page. Copyrights for components of this work owned by others than ACM cludes a characterization of identifiable profile HMMs when no must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, deletion states are permitted but also shows two profile HMMs to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. in the standard format that define the same probability distribu- ACM-BCB’18, August 29–September 1, 2018, Washington, DC, USA tion. Hence, we show that profile HMMs are not identifiable. We © 2018 Association for Computing Machinery. conclude our study with a discussion of the implications of this ACM ISBN 978-1-4503-5794-4/18/08. https://doi.org/10.1145/3233547.3233563 research and future directions. 2 RESULTS this is the only path with j + 1 edges that begins at the start state 2.1 Preliminary material and notation and ends at Ij . The question we address in this paper is whether profile HMMs Theorem 2.1. Consider a standard profile HMM topology with (in this standard format, as described in Figure 1) are statistically n match states and no deletion states. Then, the model is identifi- identifiable. We present profile HMMs for modeling collections of able if and only if no match state has the same emission probability DNA sequences (i.e., strings over {A,C,T ,G}), which is one of their distribution as the insertion states. uses; however, the results we present here are independent of the choice of alphabet. As shown in Figure 1, the standard topology Proof. ⇐: We begin by proving that if no match state has the profile HMM with n match states has a single begin state “begin” same distribution as the insertions states, then the model is iden- and a single end state “end”; every path through the profile HMM tifiable. Note that when there are no deletion states, the lengthof thus begins and ends at these states. The standard profile HMM also the shortest sequence with non-zero probability of being generated has n match states, n+1 insertion states, and n deletion states. Every is the number of match states. Hence, given the distribution of sequences defined by a profile HMM that has no deletion states, match state Mj (with j ∈ [n]) emits a letter A,T,G, or C according to we immediately know the number of match states, and hence also some fixed but unknown probability distribution Pj . All insertion ′ the topology. We will show that we can use the topology of the states I ′ (with j ∈ {0} ∪ [n]) emit a letter with the same known j profile HMM to compute all the numerical parameters, and hence distribution Pins . Note therefore that the emission probabilities define the entire model, once we are given the distribution of strings can be different for different match states, but all insertion states defined by the model. have the same emission probabilities. Finally, the deletion states So let the length of the shortest sequence (with non-zero proba- are “silent” (i.e., they do not emit any letters), and are denoted bility) be n. We provide the proof of identifiability for the case where by Dj . The probability of transition from one state to another is all nucleotides have equal probability of being generated at the in- represented by the positive values on the edges (also referred to as sertion states (i.e., P [A] = P [C] = P [T ] = P [C] = 1 ). edge weights); hence, the sum of the weights on the edges leaving ins ins ins ins 4 For the more general case where the emission probabilities at the any given node is 1.0. Note that under this standard profile HMM, insertion states are different, the proof is a simple modification of once you know the number of match states you also know the the one provided below. entire topology. We now show how to compute the emission probabilities zi , We introduce some notation to simplify the rest of the exposition. A zi , zi , and zi . Note that the probability that a string of length n We let xi denote the transition probability from Mi−1 to Mi (with T G C th M0 denoting the start state and Mn+1 denoting the end state) and generated by this model has an A in the i position is given by i yi denote the transition probability from Ii−1 to Mi . We let zY n i Ö denote the emission probability of letter Y from match state Mi , i.e., p?[i−1]A?[n−i] = zA xj (1) [ | ] i [ ] P Y match state = i = zY . We use Pins j to denote the emission j=1 j ∈ {A,C,T,G} probability of letter at the insertion states, and and hence constrain all insertion states to have the same emission probability i p?[i−1]A?[n−i] distribution. zA = Í . (2) Let ∗ denote an arbitrary length string. Thus, A∗ denotes all X ∈{A,T,G,C } p?[i−1]X ?[n−i] sequences that begin with A. Let ? denote an arbitrary letter, and The next equation follows since every path that emits an A as let ?[k] denote k contiguous arbitrary letters. Thus, ?A∗ denotes all the first letter either goes through the first match state orthrough sequences whose second letter is A. Let pS denote P[ sequence S], the first insertion state: the probability of the model emitting sequence S, and let p denote S 1 1 p ∗ = x z + (1 − x ) . (3) the probability of emitting all sequences in the set S. We drop the A 1 A 1 4 stylized notation when the set S is clear from context. We will refer to this equation as the 0th system; note that it is a 1 linear system in one variable, x1. Furthermore, if not all zX (for 1 2.2 No deletion nodes X ∈ {A,C,T,G}) are equal to 4 then there is a unique solution 1 Here, we consider the standard profile HMM topology with the for x1; however, if all are equal to 4 then every value for x1 is a probability of transitioning to any deletion node being 0. In other solution. Also, the same equations hold where A is replaced by the words, we consider a profile HMM topology without deletion nodes, other nucleotides. as shown in Figure 2. Recall that yi is the transition probability from Ii−1 to Mi . Con- sider the probability of a string that has A as its second letter. Consider the path that begins at the start state and ends at Mi Equations (4) and (5) below (both straightforward to establish) will and that only goes through match states; the probability of picking st (match:i) Îi be referred to jointly as the “1 system”: that path is denoted by p , and is easily seen to be xk . k=1     Note that this is the only path with i edges that begins at the start 2 1 1 1 p?A∗ = x1 x2z + (1 − x2) + (1 − x1) y1z + (1 − y1) (4) state and ends at Mi . Similarly, the probability of picking the path A 4 A 4 from the start state to Ij that passes only through match states is  1   1  ( ) j−1 2 1 p insrt:j Î x ·( − x ) p?T ∗ = x1 x2zT + (1 − x2) + (1 − x1) y1zT + (1 − y1) (5) denoted by and is equal to k=1 k 1 j . As before, 4 4 Figure 1: The topology of the standard profile Hidden Markov Model (according to [4]) with n match states. Note that only certain pairs of nodes are connected by edges; every such edge has strictly positive transition probability, and the sum of the transition probabilities on the edges leaving any single node is 1. The match states (denoted by M) and insertion states (denoted by I) emit letters from an underlying alphabet Σ, and hence have associated emission probabilities for each letter in Σ. The deletion states (denoted by D) are silent and do not emit anything. Each such profile HMM is a generative model, since every path from the start state to the end state produces a string from Σ∗. Hence each profile HMM defines a probability distribution on Σ∗.

The 1st system of equations given by Equations (4) and (5) is linear When M(1) is singular, we append the system with another equa- in (x2,y1) as long as Equation (3) is solved. The system can be tion linear in (x2,y1). To that end, the probability of generating written as p(1) = M(1)w(1) where sequences of the form AA∗ is given by:  ( 2 − / )( − )( 1 − / ) 1 1 (1) x1 zA 1 4 1 x1 zA 1 4 p = x z1 x z2 + x z1 (1 − x ) + (1 − x ) y z1 M = 2 1 , (6) AA∗ 1 A 2 A 1 A 2 1 1 A x1(z − 1/4)(1 − x1)(z − 1/4) 4 4 T T 1 1 +(1 − x ) (1 − y ) , (9) x  1 1 w(1) = 2 , and (7) 4 4 y1 Rearranging,   p ∗ − (1/4)     p(1) = ?A . (8) 1 2 1 1 1 1 1 1 − ( / ) pAA∗ = x1z z − x2 + (1 − x1) z − y1 + x1z p?T ∗ 1 4 A A 4 4 A 4 A 4 Without loss of generality, let’s assume z1 , 0, and z1 , z1 , 1 ; 1 A A T 4 +(1 − x )( )2 (10) (1) (1) 1 4 this implies that M12 , M22 are always non-zero. When any of the (1) (1)′ (1)′ (1) other entries of M are zero, y1 is trivially obtained. Furthermore, Consider the system p = M w formed by appending Equa- 2 1 st using the equation for p?X ∗ for the letter X such that zX , 4 , tion (10) to the 1 system of equations: x2 can be computed. Thus, the only case left to be considered is ′  ( 2 − / )( − )( 1 − / )  1 1 2 2 (1) x1 zA 1 4 1 x1 zA 1 4 when z , z , z , z , 0. However, x2,y1 can be computed using M = 1 2 1 , (11) A T A T x1z (z − 1/4)(1 − x1)(1/4)(z − 1/4) Equations (4) and (5) when M(1) is invertible, which holds when A A A 1 1   z − 1/4 z − 1/4 (1) x2 T , A . w = , and (12) 2 − / 2 − / y1 zT 1 4 zA 1 4 Figure 2: The standard profile HMM topology with n match states and no deletion nodes.

  ′ p ∗ − (1/4) th p(1) = ?A . (13) Thus, the (m − 1) system is given by Equations (15) and (16): p − x z1 (1/4) − (1 − x )(1/4)2 AA∗ 1 A 1   (match:m−1) m 1 p [m−1] = дm(A) + p xmz + (1 − xm) ? A∗ A 4   (insrt:m−2) m−1 1 + p ym− z + (1 − ym− ) , 1 A 1 4 (15) ′ M(1) is invertible if z1 , z2 , 1 , which is the assumption that   A A 4 (match:m−1) m 1 we began with. Thus, the appended system can be used to compute p?[m−1]T ∗ = дm(T ) + p xmzT + (1 − xm) st 4 x2,y1 whenever the 1 system is rank deficient.   We let д (B) denote the probability of generating a string s (insrt:m−2) m−1 1 m + p ym−1zT + (1 − ym−1) whose mth letter is B (where B ∈ {A,C,T ,G}), but subject to the 4 (16) constraints that (a) s[m] is not generated by Mm or Im−1, and (b) [ − ] ∈ { } s m 1 is not generated by Im−2. Then, for all B A,C,T,G , Thus, the (m − 1)th system of equations is linear in variables p [m−1] (the probability that a randomly generated string s has (m−1) ? B∗ xm,ym−1. Furthermore, the matrix M associated with Equa- s[m] = B) satisfies: tions (15) and (16), when written as p(m−1) = M(m−1)w(m−1), where (m−1) T w = [xm ym−1] , is given by:  m m−1  x − (z − 1/4)(1 − x − )(z − 1/4) M(m−1) = c m 1 A m 1 A , (17) 0 ( m − / )( − )( m−1 − / ) xm−1 zT 1 4 1 xm−1 zT 1 4  1  Îm−2  (match:m−1) m where c0 = xi . p?[m−1]B∗ = дm(B) + p xmzB + (1 − xm) i=1 4 (m−1)   When M is not invertible, consider the strings that have A (insrt:m−2) m−1 1 in the (m−1)th andmth positions. Thus, when the equation obtained + p ym−1zB + (1 − ym−1) . 4 by expressing the probability of generating the string ?[m−2]AA∗ in (14) terms of the transition probabilities and the emission probabilities ′ is appended to the system, the new matrix M(m−1) associated with ′ ′ the system p(m−1) = M(m−1) w(m−1) is given by can be computed uniquely for the the standard profile HMM with  m m−1  one match state. ′ x − (z − 1/4)(1 − x − )(z − 1/4) M(m−1) = m 1 A m 1 A . m−1( m − / )( − )( / )( m−1 − / ) xm−1zA zA 1 4 1 xm−1 1 4 zA 1 4 Theorem 2.2. The standard profile HMM topology with one match (18) state is non-identifiable. Following the argument presented for the 1st system, we con- Proof. Consider the two models as shown in Figure 5. The clude that w(m−1) can be computed. emission distribution at the match state is the same across the To find yn, consider all sequences of length n + 1 . models and is equal to {zA = a, zT = t, zG = д, zC = c}. The emis- n n sion distribution at both the insertion states is equal to Pins [A] = Õ ©Ö ª 1 ′ p[?]n+1 = (1 − xi )yi xj /xi . (19) Pins [C] = Pins [T ] = Pins [C] = . Let α ,i ∈ {1,..., 12}, denote ­ ® 4 i ′′ i=1 j=1 ∈ { } « ¬ the transmission probabilities for Model 1 and αi ,i 1,..., 12 , Thus, yn is obtained from this equation. denote the transmission probabilities for Model 2, both as shown in We point out that the letters A,T are representatives. In general, Figure 5. Let X1X2 ... Xk be an arbitrary DNA sequence of length we pick the letters that give unique solutions to the systems of k; we will show that the two models emit this sequence with the ′ linear equations that are obtained in the proof. Hence, we have same probability. Let p denote the probability with which X1X2 ...Xk ′′ proved that if each of the match states is different (in distribution) Model 1 emits the sequence X1X2 ... X ; p denotes the k X1X2 ...Xk from the insertion states, then the model is identifiable. the probability with which Model 2 emits the same sequence. When ⇒ : We now prove the other direction. We show that if the emis- k = 1 we obtain: sion probabilities for a match state are identical (in distribution) to ′′ ′ 1 ′′ ′′ ′′ ′ ′ ′ ′′ ′′ ′′ ′ ′ ′ the insertion states, then the profile HMM is not identifiable. Specif- p − p = (α α α − α α α + α α α − α α α ) = 0. X1 X1 4 3 10 11 3 10 11 9 12 6 9 12 6 ically, we show (Figure 3) two different profile Hidden Markov (23) models (each with a single match state) where the emission prob- When k = 2 we obtain: ability distribution for the match state is identical to that of the ′′ ′ p − p insertion states, and for which the two models define the same X1X2 X1X2 2 distribution on strings. In both models shown in Figure 3,  1  ′′ ′′ ′′ ′ ′ ′ ′′ ′′ ′′ ′ ′ ′ = (0.3(α α α − α α α ) + 0.6(α α α − α α α ) 1 4 3 10 11 3 10 11 3 10 12 3 10 12 p = x1 x2, (20) A 4 ′′ ′′ ′′ ′ ′ ′ + 0.4(α9 α12α6 − α9α12α6)) 1 1 1 1 = 0. (24) pAA = (1 − x1) y1 x2 + x1 (1 − x2) y2, (21) 4 4 4 4 Finally, for k ≥ 3 we obtain: and ′′ ′ pX X ...X − pX X ...X 1 1 n−2 1 1 2 k 1 2 k p [n] = x ( − x ) ( − y ) y ′′ ′′ ′′ ′ ′ ′ ′′ ′′ ′′ ′ ′ ′ A 1 1 2 1 2 n−2 2 k−1 k−1 4 4 4 = (α3 α10α11 − α3α10α11)0.3 + (α9 α12α6 − α9α12α6)0.4 1 n−2 1 1 k + (1 − x1) (1 − y1) y1 x2  1  ′′ ′′ ′′ ′ ′ ′ Õ 4 4n−2 4 + (α α α − α α α ) 0.3n1 0.4n2 4 3 10 12 3 10 12 Õ 1 n 1 1 1 n 1 n +n =k−2 + (1 − x1) (1 − y1) 1 y1 (1 − x2) (1 − y2) 2 y2, 1 2 4 4n1 4 4 4n2 9 9 n1+n2=n−3 = (0.4k−1 − 0.3k−1) − (0.4k−1 − 0.3k−1) = 0, (25) (22) 400 400 □ where n ≥ 3 and n1,n2 ≥ 0 in Equation (22). Thus, the profile HMM with one match state whose emission probability distribution is Theorem 2.3. Consider the standard profile HMM topology with identical to that of the insertion states is not identifiable. one match state. If the model topology is given, then some (but perhaps This proof can be extended to show that a profile HMM with no not all) of the transition probabilities can be identified if the emission deletion nodes and arbitrary number of match states that has at probabilities at the match state are not equal to that of the insertion least one match state whose distribution is identical to that of the states. insertion states is not identifiable. Consider a profile HMM with n Proof. Consider the standard profile HMM topology with one match states as depicted by Model 1 in Figure 4. Without loss of match state as shown in Figure 6. We will show that under the generality, we assume that match state M has the same distribution 3 assumption of the theorem, α , α , α , α , and α can be computed as that of the insertion states. Note that the highlighted region is 2 4 6 7 8 uniquely. Further, if α , α , α α , α α , then α and the emis- exactly the toy example that we described above, so that Model 1 2 6 3 5 7 1 1 sion probabilities at the match state can be determined. We provide and Model 2 have identical sequence distributions. □ the proof for the case wherein Pins [A] = Pins [C] = Pins [T ] = P [C] = 1 2.3 The standard profile HMM with one match ins 4 . For the more general case where the emission proba- bilities of the letters at the insertion states are different, the proof is state a modification of the one provided. Let αi ,i ∈ {1,..., 12}, denote We begin with a proof of non-identifiability of the standard profile the transmission probabilities as shown in Figure 6. Consider a HMM with one match state. We then identify the parameters that sequence of length one. The letter is emitted either by insertion Figure 3: Two profile HMMs that have the same sequence distribution.

Figure 4: Two profile HMMs with n match states (but no deletion states) that have the same sequence distribution. Figure 5: Two standard profile HMMs with one match state that define the same distribution on sequences, establishing that profile HMMs in the standard format are not identifiable (see Theorem 2.2).

states I0 or I1, or by the match state M1. Thus the probability of Therefore, emitting the letter B is given by 1 1 1 pAA − pTA = α1(zA − zT )α4 α6 , 0, (32) 1 1 1 4 pB = α3 α10α11 + α1z α2 + α9α12 α6. (26) 4 B 4 1 1 1 pAA − pAT = α3 α5(zA − zT )α2 , 0. (33) 1 1 4 Assume without loss of generality that zA , zT . Therefore, We now consider sequences that begin with two letters B1B2. 1 1 pA − pT = α1(zA − zT )α2 , 0. (27) 1 1 1 1 1 1 1 1 pB B ∗ = α1z α4 + α3 α7 + α3 α5z + α3 α10α12 We now consider sequences that begin with a particular letter B. 1 2 B1 4 4 4 4 B2 4 4 Again, the first letter is generated either by insertion states I0 or I1, 1 1 +α9α12 α8 . (34) or by the match state M1. Thus, 4 4

1 1 1 Therefore, pB∗ = α1zB + α3 + α9α12 . (28) 4 4 1 1 1 pAA∗ − pTA∗ = α1(z − z ) α4 , 0. (35) Therefore, A T 4 1 1 Dividing (32) by (35), we find pA∗ − pT ∗ = α1(zA − zT ) , 0. (29) pAA − pTA Dividing (27) by (29), we find α6 = . (36) pAA∗ − pTA∗ pA − pT α2 = . (30) Thus, α8 = 1 − α6 is also computed. We now consider all sequences pA∗ − pT ∗ of form B1B2B3. Thus, α = 1 − α is also computed. Consider all sequences of the 4 2 1 1 form B B . p − p = α α α (z1 − z1 )α (37) 1 2 AAA AAT 3 4 7 4 5 A T 2 1 1 1 1 1 1 pB B = α1z α4 α6 + α3 α7 α10α11 + α3 α5z α2 (31) Dividing (37) by (33), we get 1 2 B1 4 4 4 4 B2 1 1 1 1 pAAA − pAAT +α3 α10α12 α6 + α9α12 α8 α6. α7 = 4 (38) 4 4 4 4 pAA − pAT Figure 6: The standard profile HMM with one match state.

Dividing (33) by (27), we find that that α1α7 , α3α5 and α2 , α3. Since α1 is unique, we can compute pAA − pAT the emission distribution. α2α3 = α14 (39) Equations (47), (48), and (49) are obtained from (27). pA − pT 1 1 Let pϵ denote the probability of not emitting any letter. To find pA − pT = α1(zA − zT )α2, (47) α1, α9, α11, consider the following equations: 1 1 pA − pG = α1(zA − zG )α2, (48) p? = α1α2 + α3α10α11 + α9α12α6 (40) 1 1 pA − pC = α1(zA − zC )α2, (49) p 2 = α α α + α α α α + α α α + α α α α + α α α α [?] 1 4 6 3 7 10 11 3 5 2 3 10 12 6 9 12 8 6 1 = z1 + z1 + z1 + z1 . (50) (41) A T G C

α10 = 1 − α5 − α7 (42) Equations (47), (48), (49), and (50) together are a linear system of 4 equations with 4 unknowns, and can be expressed as Mz1 = p α12 = 1 − α11 (43) where pϵ α = (44) 1 −1 0 0  11 α   9 1 0 −1 0  M =   , (51) Substituting equations (39), (42), (43) and (44) in (40) and (41), we 1 0 0 −1   obtain the following: 1 1 1 1    pϵ 1 p? = α1α2 + (α3(1 − α7) − γα1) + (α9 − pϵ )α6 (45) z  α9  A z1  pϵ z1 =  T  , and (52) p 2 = α α α + α (α (1 − α ) − γα ) + α α (α − pϵ ) 1 [?] 4 6 1 7 3 7 1 6 8 9 zG  α9  1    z  pϵ  C  + α6 1 − (α3(1 − α7) − γα1) + γα2α1 (46) α9 (pA − pT )/(α1α2)   pAA−pAT (pA − pG )/(α1α2) where γ = 4 p −p . Thus, equations (45), (46) together with the p =   (53) A T (pA − pC )/(α1α2) equation α1 +α3 +α9 = 1 form a system of three equations in three    (1)/(α1α2).  variables (α1, α3, and α9). Wolfram|Alpha returns a unique solution   1 for α1 and two pairs of solutions for (α3, α9) under the condition Since M is invertible, z can be obtained. □ 2.4 Estimating parameters from finite data 4 ACKNOWLEDGMENTS Identifiability results establish what can be known from the true This research was supported by National Science Foundation grants distribution, but do not directly imply that a statistically consistent ABI-1458652 and III:AF:1513629 to TW. This research began as a method is possible. Here we describe how to estimate what can final project by the first author for the course Computer Science be estimated from data for the standard model, modified so that 581: Algorithmic Computational Genomics, taught by the second there are no deletion nodes, and discuss the amount of data that are author at the University of Illinois at Urbana-Champaign in Spring needed to estimate the true topology and the numeric parameters 2018. (within some error threshold) with high probability. One could leverage the ideas used in our proof techniques to REFERENCES reconstruct the model using empirical joint distributions obtained [1] Frederick A Matsen, Elchanan Mossel, and Mike Steel. 2008. Mixed-Up Trees: The Structure of Phylogenetic Mixtures. Bulletin of mathematical biology 70 from the data. However, since the number of paths doubles from one (2008), 1115–39. Issue 4. system of equations to the next, such an approach is not efficient. [2] Elizabeth S Allman, Catherine Matias, and John A Rhodes. 2009. Identifiability of Yet, some parameters of the model can still be estimated efficiently parameters in latent structure models with many observed variables. The Annals of Statistics (2009), 3099–3132. using our techniques. For example, the number of match states, and [3] Joseph T Chang. 1996. Full reconstruction of Markov models on evolutionary therefore the topology can be estimated from the shortest string trees: identifiability and consistency. Mathematical biosciences 137, 1 (1996), produced. 51–73. [4] Richard Durbin, Sean R Eddy, , and Graeme Mitchison. 1998. Biolog- Suppose we had N independent sequences that were generated ical sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge by a specific profile HMM with n match states and no deletion university press. n [5] Ingo Ebersberger, Sascha Strauss, and Arndt von Haeseler. 2009. HaMStR: profile states. The probability of not observing any sequence of length is hidden Markov model based search for orthologs in ESTs. BMC evolutionary given by biology 9, 1 (2009), 157. [6] Sean R. Eddy. 1998. Profile hidden Markov models. Bioinformatics (Oxford, n !N England) 14, 9 (1998), 755–763. Ö P[all sequences have length >n] = 1 − x [7] Steven N Evans and Philip B Stark. 2002. Inverse problems as statistics. Inverse i problems 18, 4 (2002), R55. i=1 [8] David Freedman. 2005. Statistical Models: Theory and Practice. Cambridge Uni- ≤ − n N versity Press. 1 xmin [9] Torben Friedrich, Birgit Pils, Thomas Dandekar, Jörg Schultz, and Tobias Müller. ≤ exp{−xn N }, (54) 2006. Modelling interaction sites in protein domains with interaction profile min hidden Markov models. Bioinformatics 22, 23 (2006), 2851–2857. [10] , Anders Krogh, I. Saira Mian, and Kimmen Sjölander. 1993. Protein where xmin = min1≤i ≤n xi . The probability of error decays expo- Modeling using Hidden Markov Models: Analysis of Globins. In Proceedings of nentially with the number of sequences. Thus, if the transition the Twenty-sixth Hawaii International Conference on System Sciences. probabilities from one match state to the next were all bounded [11] Timo Koski. 2001. Hidden Markov models for bioinformatics. Vol. 2. Springer ′   Science & Business Media. 1 1 [12] Anders Krogh, Michael Brown, I Saira Mian, Kimmen Sjölander, and David from below, then a finite number N = x n log δ of indepen- min Haussler. 1994. Hidden Markov models in computational biology: Applications dently generated sequences are sufficient for reconstructing the to protein modeling. Journal of molecular biology 235, 5 (1994), 1501–1531. topology with confidence at least 1 − δ. [13] Colby Long and Laura Kubatko. 2017. Identifiability and reconstructibility of species phylogenies under a modified coalescent. arXiv preprint arXiv:1701.06871 Other parameters such as emission probabilities of the match (2017). state, and a constant number of transition probabilities xi ’s and yi ’s [14] Rachel J MacKAY. 2002. Estimating the order of a hidden Markov model. Canadian can also be computed efficiently from the empirical distributions of Journal of Statistics 30, 4 (2002), 573–589. [15] Siavash Mirarab, Nam phuong Nguyen, and Tandy Warnow. 2012. SEPP: SATé- sequences,- and their errors can be bounded. enabled phylogenetic placement. In Pacific Symposium on Biocomputing. 247–58. [16] Elchanan Mossel and Sebastien Roch. 2012. Phylogenetic mixtures: concentration of measure in the large-tree limit. The Annals of Applied Probability 22, 6 (2012), 3 CONCLUSION 2429–2459. In this text, we made the first strides towards completely charac- [17] Nam-phuong Nguyen, Siavash Mirarab, Keerthana Kumar, and Tandy Warnow. 2015. Ultra-large alignments using phylogeny aware profiles. Genome Biology terizing the identifiability of profile hidden Markov models. We 16, 124 (2015). https://doi.org/10.1186/s13059-015-0688-z A preliminary version analyzed identifiability for the case where there are no deletion appeared in the Proceedings RECOMB 2015. states, but otherwise all the properties of the standard model hold. [18] Nam-phuong Nguyen, Siavash Mirarab, Bo Liu, Mihai Pop, and Tandy Warnow. 2014. TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics For this case, Theorem 2.1 shows that the model is identifiable if 30, 24 (2014), 3548–3555. https://doi.org/10.1093/bioinformatics/btu721 and only if no match state has the same emission probability dis- [19] Nam-phuong Nguyen, Michael Nute, Siavash Mirarab, and Tandy Warnow. 2016. HIPPI: highly accurate protein family classification with ensembles of hidden tribution as the insertion states. Further, we analyzed the question Markov models. BMC Bioinformatics 17 (Suppl 10) (2016), 765. Special issue for of identifiability for the special case of only one match stateun- RECOMB-CG 2016. der the standard topology, and proved that it is not identifiable. In [20] Judea Pearl and Michael Tarsi. 1986. Structuring causal trees. Journal of Com- plexity 2, 1 (1986), 60–77. particular, we presented two models with different transition prob- [21] Benjamin Schuster-Böckler and . 2007. An introduction to hidden abilities and showed that the probability of emitting any particular Markov models. Current protocols in bioinformatics 18, 1 (2007), A.3A.1–A.3A.9. sequence is the same for the two models. This in turn implies that [22] Kimmen Sjölander. 2004. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 20, 2 (2004), 170–179. the standard profile HMM is non-identifiable. For the model with [23] A. W. van der Vaart. 1998. Asymptotic Statistics. Cambridge University Press. the standard topology and one match state, we also identified the https://doi.org/10.1017/CBO9780511802256 [24] Ziheng Yang. 2006. Computational Molecular Evolution. Oxford University Press. parameters that can be computed uniquely. Characterizing partial [25] Zemin Zhang and William I Wood. 2003. A profile hidden Markov model for identifiability for the standard topology with an unknown number signal peptides generated by HMMER. Bioinformatics 19, 2 (2003), 307–308. of match states is still open.