Are Profile Hidden Markov Models Identifiable?
Total Page:16
File Type:pdf, Size:1020Kb
Are Profile Hidden Markov Models Identifiable? Srilakshmi Pattabiraman Tandy Warnow Department of Electrical and Computer Engineering Department of Computer Science University of Illinois at Urbana-Champaign University of Illinois at Urbana-Champaign Urbana, Illinois Urbana, Illinois [email protected] [email protected] ABSTRACT 1 INTRODUCTION Profile Hidden Markov Models (HMMs) are graphical models that Profile Hidden Markov Models (HMMs) are arguably themost can be used to produce finite length sequences from a distribution. common statistical models in bioinformatics. Originally introduced In fact, although they were only introduced for bioinformatics 25 by Haussler and colleagues in [10, 12], and then expanded later years ago (by Haussler et al., Hawaii International Conference on in many subsequent texts [4–6, 9, 11, 21, 25], profile HMMs are Systems Science 1993), they are arguably the most commonly used now used in many analytical steps in biological sequence analysis statistical model in bioinformatics, with multiple applications, in- [15, 17–19, 22]. cluding protein structure and function prediction, classifications Profile Hidden Markov models are graphical models with match of novel proteins into existing protein families and superfamilies, states, insertion states, and deletion states; and the match and in- metagenomics, and multiple sequence alignment. The standard use sertion states emit letters from an underlying alphabet Σ (i.e., Σ of profile HMMs in bioinformatics has two steps: first a profile may be the 20 amino acids, the four nucleotides, or some other HMM is built for a collection of molecular sequences (which may set of symbols). In the standard form presented in [4] (widely in not be in a multiple sequence alignment), and then the profile HMM use in bioinformatics applications), each profile Hidden Markov is used in some subsequent analysis of new molecular sequences. model has a single start state and a single end state, and every path The construction of the profile thus is itself a statistical estimation through the model produces a string from Σ∗. The topology of this problem, since any given set of sequences might potentially fit standard model as seen in Figure 1 shows directed edges between more than one model well. Hence a basic question about profile certain pairs of states, and each such directed edge has a non-zero HMMs is whether they are statistically identifiable, which means transition probability. that no two profile HMMs can produce the same distribution on In this paper, we address the question of statistical identifiability finite length sequences. Indeed, statistical identifiability is afun- of profile Hidden Markov models, which in essence asks whether damental aspect of any statistical model, and yet it is not known the model is reconstructible given the probability distribution it whether profile HMMs are statistically identifiable. In this paper, we defines [23]. Thus, if there are two sets of parameters of the model report on preliminary results towards characterizing the statistical that generate the same joint distribution, then the model is not identifiability of profile HMMs in one of the standard forms used identifiable. Note that if a model is not identifiable, then itisim- in bioinformatics. possible for any algorithm designed to estimate the model from a finite dataset to be statistically consistent: that is, it is not possible CCS CONCEPTS for the method to converge in probability to the true model with • Applied computing → Molecular sequence analysis; increasing amounts of data. Statistical identifiability is a basic property of statistical mod- KEYWORDS els, and is the subject of rigorous study [1–3, 7, 13, 16, 20]. Indeed, the importance of identifiability is evident in the following quotes: Profile Hidden Markov Models, statistical identifiability, molecular “Unidentifiable models are pathological, usually due to conceptual sequence analysis error in model formulation” [24] and “Many statisticians frown ACM Reference Format: on the use of under-identified models: if a parameter is not iden- Srilakshmi Pattabiraman and Tandy Warnow. 2018. Are Profile Hidden tifiable, two or more values are indistinguishable, no matter how Markov Models Identifiable?. In ACM-BCB ’18: 9th ACM International Con- much data you have” [8]. However, to the best of our knowledge, ference on Bioinformatics, Computational Biology, and Health Informatics, nothing has yet been established about the statistical identifiability August 29–September 1, 2018, Washington, DC, USA. ACM, New York, NY, of profile Hidden Markov Models (HMMs), although the question USA, 9 pages. https://doi.org/10.1145/3233547.3233563 of identifiability of parameters in HMMs more generally has also been specifically addressed [14]. Permission to make digital or hard copies of all or part of this work for personal or In this paper, we partially characterize the conditions under classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation which profile HMMs are statistically identifiable. Our study in- on the first page. Copyrights for components of this work owned by others than ACM cludes a characterization of identifiable profile HMMs when no must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, deletion states are permitted but also shows two profile HMMs to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. in the standard format that define the same probability distribu- ACM-BCB’18, August 29–September 1, 2018, Washington, DC, USA tion. Hence, we show that profile HMMs are not identifiable. We © 2018 Association for Computing Machinery. conclude our study with a discussion of the implications of this ACM ISBN 978-1-4503-5794-4/18/08. https://doi.org/10.1145/3233547.3233563 research and future directions. 2 RESULTS this is the only path with j + 1 edges that begins at the start state 2.1 Preliminary material and notation and ends at Ij . The question we address in this paper is whether profile HMMs Theorem 2.1. Consider a standard profile HMM topology with (in this standard format, as described in Figure 1) are statistically n match states and no deletion states. Then, the model is identifi- identifiable. We present profile HMMs for modeling collections of able if and only if no match state has the same emission probability DNA sequences (i.e., strings over fA;C;T ;Gg), which is one of their distribution as the insertion states. uses; however, the results we present here are independent of the choice of alphabet. As shown in Figure 1, the standard topology Proof. (: We begin by proving that if no match state has the profile HMM with n match states has a single begin state “begin” same distribution as the insertions states, then the model is iden- and a single end state “end”; every path through the profile HMM tifiable. Note that when there are no deletion states, the lengthof thus begins and ends at these states. The standard profile HMM also the shortest sequence with non-zero probability of being generated has n match states, n+1 insertion states, and n deletion states. Every is the number of match states. Hence, given the distribution of sequences defined by a profile HMM that has no deletion states, match state Mj (with j 2 »n¼) emits a letter A;T;G; or C according to we immediately know the number of match states, and hence also some fixed but unknown probability distribution Pj . All insertion 0 the topology. We will show that we can use the topology of the states I 0 (with j 2 f0g [ »n¼) emit a letter with the same known j profile HMM to compute all the numerical parameters, and hence distribution Pins . Note therefore that the emission probabilities define the entire model, once we are given the distribution of strings can be different for different match states, but all insertion states defined by the model. have the same emission probabilities. Finally, the deletion states So let the length of the shortest sequence (with non-zero proba- are “silent” (i.e., they do not emit any letters), and are denoted bility) be n. We provide the proof of identifiability for the case where by Dj . The probability of transition from one state to another is all nucleotides have equal probability of being generated at the in- represented by the positive values on the edges (also referred to as sertion states (i.e., P »A¼ = P »C¼ = P »T ¼ = P »C¼ = 1 ). edge weights); hence, the sum of the weights on the edges leaving ins ins ins ins 4 For the more general case where the emission probabilities at the any given node is 1:0. Note that under this standard profile HMM, insertion states are different, the proof is a simple modification of once you know the number of match states you also know the the one provided below. entire topology. We now show how to compute the emission probabilities zi , We introduce some notation to simplify the rest of the exposition. A zi , zi , and zi . Note that the probability that a string of length n We let xi denote the transition probability from Mi−1 to Mi (with T G C th M0 denoting the start state and Mn+1 denoting the end state) and generated by this model has an A in the i position is given by i yi denote the transition probability from Ii−1 to Mi . We let zY n i Ö denote the emission probability of letter Y from match state Mi , i.e., p?»i−1¼A?»n−i¼ = zA xj (1) » j ¼ i » ¼ P Y match state = i = zY . We use Pins j to denote the emission j=1 j 2 fA;C;T;Gg probability of letter at the insertion states, and and hence constrain all insertion states to have the same emission probability i p?»i−1¼A?»n−i¼ distribution.