Probabilistic Finite-State Machines—Part II

Enrique Vidal, Member, IEEE Computer Society, Frank Thollard, Colin de la Higuera, Francisco Casacuberta, Member, IEEE Computer Society, and Rafael C. Carrasco

Abstract—Probabilistic finite-state machines are used today in a variety of areas in pattern recognition or in fields to which pattern recognition is linked. In Part I of this paper, we surveyed these objects and studied their properties. In this Part II, we study the relations between probabilistic finite-state automata and other well-known devices that generate strings like hidden Markov models and n-grams and provide theorems, algorithms, and properties that represent a current state of the art of these objects.

Index Terms—Automata, classes defined by grammars or automata, machine learning, language acquisition, language models, language parsing and understanding, machine translation, speech recognition and synthesis, structural pattern recognition, syntactic pattern recognition.

1 INTRODUCTION

In the first part [1] of this survey, we introduced probabilistic finite-state automata (PFA), their deterministic counterparts (DPFA), and the properties of the distributions these objects can generate. Topology was also discussed, as were consistency and equivalence issues.

In this second part, we describe additional features that are of use to those wishing to work with PFA or DPFA. As mentioned before, there are many other finite-state machines that describe distributions. Section 2 is entirely devoted to comparing them with PFA and DPFA. The comparison will be algorithmic: Techniques (when they exist) for transforming one model into another, equivalent in the sense that the same distribution is represented, will be provided. We will study n-grams along with stochastic local languages in Section 2.1 and HMMs in Section 2.3. In addition, in Section 2.2, we will present a probabilistic extension of the classical morphism theorem that relates local languages with regular languages in general.

Once most of the issues concerning the task of dealing with existing PFA have been examined, we turn to the problem of building these models, presumably from samples. First, we address the case where the underlying automaton structure is known; then, we deal (Section 3.1) with that of estimating the parameters of the model [2], [3], [4], [5], [6], [7]. The case where the model structure is not known enters the field of machine learning, and a variety of learning algorithms has been used. Their description, proofs, and a comparison of their qualities and drawbacks would deserve a detailed survey in itself. We provide, in Section 3.2, an overview of the main methods, but we do not describe them thoroughly. We hope the bibliographical entries we provide, including the recent review which appears in [8], will be of use to the investigator who requires further reading on this subject. Smoothing [9], [10], [11] (in Section 3.3) is also becoming a standard issue.

A number of results do not fall into any of these main questions. Section 4 will be a pot-pourri, presenting alternative results, open problems, and new trends. Among these, more complex models such as stochastic transducers (in Section 4.1), probabilistic context-free grammars [12] (in Section 4.2), or probabilistic tree automata [13], [14], [15] (in Section 4.3) are gaining importance when coping with increasingly structured data.

The proofs of some of the propositions and theorems are left to the corresponding appendices.

As with all surveys, this one is incomplete. In our particular case, completeness is particularly difficult to achieve due to the enormous and increasing number of very different fields where these objects have been used. We would like to apologize in advance to all those whose work on this subject we have not recalled.

2 OTHER FINITE-STATE MODELS

Apart from the various types of PFA, a variety of alternative models have been proposed in the literature to generate or model probability distributions on the strings over an alphabet.

Many different definitions of probabilistic automata exist. Some assume probabilities on states, others on transitions. But the deep question is "which is the distinctive feature of the probability distribution defined?" All the models describe discrete probability distributions. Many of them aim at predicting the next symbol in a string, thereby describing probability distributions over each Σ^n, ∀n > 0. We will concentrate here on models where parsing will be done from left to right, one symbol at a time. As a consequence, the term predicting history will correspond to the amount of information one needs from the prefix to compute the next-symbol probability. Multiplying these next-symbol probabilities is called the chain rule, which will be discussed in Section 2.1.

Among the models proposed so far, some are based on acyclic automata [1], [16], [17], [18], [19]. Therefore, the corresponding probability distributions are defined on finite sets of strings, rather than on Σ*. In [18], automata that define probability distributions over Σ^n, for some fixed n > 0, are introduced. This kind of model can be used to represent, for instance, logic circuits, where the value of n can be defined in advance. A main restriction of this model is that it cannot be used to compare probabilities of strings of different lengths. Ron et al. [19] define other probabilistic acyclic deterministic automata and apply them to optical character recognition.

Another kind of model describes a probability distribution over Σ*; that is, over an infinite number of strings. Stolcke and Omohundro [20] use other types of automata that are equivalent to our definition of DPFA. Many probabilistic automata, such as those discussed here, the HMM, and the Markov chain (also known as the n-gram model), also belong to this class. We give here an overview of some of the most relevant of these models.

2.1 N-Grams and Stochastic k-Testable Automata

2.1.1 N-Gram Models
N-grams are traditionally presented as an approximation to a distribution of strings of fixed length. For a string x of length m, the chain rule is used to (exactly) decompose the probability of x as [21]:

\Pr(x) = \Pr(x_1) \prod_{l=2}^{m} \Pr(x_l \mid x_1, \ldots, x_{l-1}).    (1)

The n-gram approximation makes the assumption that the probability of a symbol depends only on the n − 1 previous symbols; that is:

\Pr(x) \approx \prod_{l=1}^{m} \Pr(x_l \mid x_{l-n+1}, \ldots, x_{l-1}).

As (1), this approximation also defines a probability distribution over Σ^m. Nevertheless, for practical reasons, it is often interesting to extend it to define a probability distribution over Σ*. To this end, the set of events, Σ, which are predicted by the n − 1 previous symbols, is extended by considering an additional end-of-string event (denoted by "#"), with probability Pr(# | x_{m−n+2}, ..., x_m). As a result, the probability of any string x ∈ Σ* is approximated as:

\Pr(x) \approx \left( \prod_{l=1}^{|x|} \Pr(x_l \mid x_{l-n+1}, \ldots, x_{l-1}) \right) \Pr(\# \mid x_{|x|-n+2}, \ldots, x_{|x|}).    (2)

By making use of our convention that a string such as x_i ... x_j denotes λ if i > j, this approximation accounts for the empty string. In fact, if x = λ, the right-hand side of (2) is 1 · Pr(# | λ), which may take values greater than 0. The resulting approximation will be referred to as the "extended n-gram model." The parameters of this model are estimates of Pr(a | z), a ∈ Σ ∪ {#}, z ∈ Σ^{<n}. For any distribution D_n described by an extended n-gram and any n′ > n, there is an extended n′-gram which describes a distribution D_{n′} such that D_n = D_{n′}. In other words, there is a natural hierarchy of classes of n-grams, where the classes with more expressive power are those with larger n. The simplest interesting class in this hierarchy is the class for n = 2, or bigrams. This class is interesting for its generative power in the sense discussed later (Section 2.2).

On the other hand, perhaps the most interesting feature of n-grams is that they are easily learnable from training data. All the parameters of an n-gram model can be maximum-likelihood estimated by just counting the relative frequency of the relevant events in the training strings [21]. If S is a training sample, P_n(a | z) is estimated as f(za)/f(z), where f(·) is the number of occurrences of the corresponding event in S. Let ∏_{x∈S} Pr_{D_n}(x) be the likelihood with which an extended n-gram generates S. Then, for m = max_{x∈S} |x|, D_S = D_m; that is, for sufficiently large n the extended n-gram estimated from S "degenerates" into the empirical distribution D_S of the training sample.
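As a simple illustration of this maximum-likelihood estimation, the following sketch (ours, not part of the original text; all identifiers are illustrative) trains an extended bigram (n = 2) by relative-frequency counting and evaluates (2) on a string, using "#" for the end-of-string event and the empty history for the start of a string.

from collections import defaultdict

def train_extended_bigram(sample):
    """ML estimation of an extended 2-gram: P(a | z) = f(za) / f(z),
    where z is the previous symbol (the empty history at string start)
    and '#' is the end-of-string event."""
    ctx = defaultdict(lambda: defaultdict(int))   # ctx[z][a] = f(za)
    for x in sample:
        prev = ''                                 # empty history (lambda)
        for a in x:
            ctx[prev][a] += 1
            prev = a
        ctx[prev]['#'] += 1                       # end-of-string event
    return {z: {a: c / sum(cnt.values()) for a, c in cnt.items()}
            for z, cnt in ctx.items()}

def prob(model, x):
    """Probability of x under the extended bigram, following (2)."""
    p, prev = 1.0, ''
    for a in x:
        p *= model.get(prev, {}).get(a, 0.0)
        prev = a
    return p * model.get(prev, {}).get('#', 0.0)

S = ["ab", "abb", "a", "abbb"]
m = train_extended_bigram(S)
print(prob(m, "abb"))   # 1.0 * 0.75 * 0.5 * 0.5 = 0.1875

Events that never occur in S receive probability zero in this estimate, which is precisely the problem addressed by the smoothing techniques of Section 3.3.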

2.1.2 Stochastic k-Testable Automata
A k-testable language is characterized by a set of substrings of length k that are allowed to appear in the strings of the language.³ A straightforward probabilistic extension adequately assigns probabilities to these substrings, thereby establishing a direct relation with n-grams. For the sake of brevity, we will only present the details for 2-testable distributions, also called stochastic local languages.

3. In the traditional literature, a k-testable automaton (k-TA) is (more properly) referred to as a k-testable automaton in the strict sense (k-TSA) [23], [24]. In these references, the name k-testable automaton is reserved for more powerful models which are defined as Boolean compositions of k-TSA. A stochastic extension of k-TSA would lead to models which, in some cases, can be seen as mixtures of stochastic k-TSA.

Definition 1. A stochastic local language (or 2-testable stochastic language) is defined by a four-tuple Z = ⟨Σ, P_I, P_F, P_T⟩, where Σ is the alphabet and P_I, P_F: Σ → [0, 1] and P_T: Σ × Σ → [0, 1] are, respectively, initial, final, and symbol transition probability functions. P_I(a) is the probability that a ∈ Σ is a starting symbol of the strings in the language and, ∀a ∈ Σ, P_T(a′, a) is the probability that a follows a′, while P_F(a′) is the probability that no other symbol follows a′ (i.e., a′ is the last symbol) in the strings of the language.

As in the case of n-grams, this model can be easily extended to allow the generation of empty strings. To this end, P_F can be redefined as P_F: Σ ∪ {λ} → [0, 1], interpreting P_F(λ) as the probability of the empty string, according to the following normalization conditions:

\sum_{a \in \Sigma} P_I(a) + P_F(\lambda) = 1,
\sum_{a \in \Sigma} P_T(a', a) + P_F(a') = 1 \quad \forall a' \in \Sigma.

Disregarding possible "degenerate" cases (similar to those of extended n-grams discussed above), the model Z is consistent; i.e., it defines a probability distribution D_Z on Σ* as:

\Pr_Z(x) = \begin{cases} P_F(\lambda) & \text{if } x = \lambda, \\ P_I(x_1) \prod_{i=2}^{|x|} P_T(x_{i-1}, x_i)\, P_F(x_{|x|}) & \text{if } x \in \Sigma^{+}. \end{cases}    (4)

Given an extended 2-gram with conditional probabilities P_2(· | ·), an equivalent stochastic local language Z = ⟨Σ, P_I, P_F, P_T⟩ is obtained by setting:

P_I(a) = P_2(a \mid \lambda), \; \forall a \in \Sigma; \qquad P_F(a) = P_2(\# \mid a), \; \forall a \in \Sigma \cup \{\lambda\}; \qquad P_T(a', a) = P_2(a \mid a'), \; \forall a, a' \in \Sigma.

In turn, the distribution of a stochastic local language Z is generated by the DPFA A = ⟨Q, Σ, δ, q_0, F, P⟩ with Q = Σ ∪ {λ}, q_0 = λ, and

\delta = \{(\lambda, a, a) \mid a \in \Sigma,\ P_I(a) > 0\} \cup \{(a'', a, a) \mid a, a'' \in \Sigma,\ P_T(a'', a) > 0\},    (5)

where, ∀a, a″ ∈ Σ:

P(a'', a, a) = P_T(a'', a), \quad P(\lambda, a, a) = P_I(a), \quad F(a) = P_F(a), \quad F(\lambda) = P_F(\lambda).

An example of this construction is shown in Fig. 3 (middle), corresponding to Example 2 below. Definition 1, Proposition 1, and (5) can be easily extended to show the equivalence of extended k-grams and k-TSA for any finite k.

As in the case of n-grams, k-TSA can be easily learned from training data [22]. Given the equivalence with extended n-grams, k-TSA exhibit the same properties for varying values of k. In particular, in this case, the m-TSA obtained from a training sample S for m = max_{x∈S} |x| is an acyclic DPFA which is identical to the probabilistic prefix tree automaton representation of S.

2.1.3 N-Grams and k-TSA Are Less Powerful than DPFA
We now show that extended n-grams or stochastic k-testable automata do not have as many modeling capabilities as DPFA have.

Proposition 2. There are regular deterministic distributions that cannot be modeled by a k-TSA or extended k-gram, for any finite k.

This is a direct consequence of the fact that every regular language is the support of at least one stochastic regular language, and there are regular languages which are not k-testable. The following example illustrates this lack of modeling power of extended n-grams or k-TSA.

Example 1. Let Σ = {a, b, c, d, e} and let D be a probability distribution over Σ* defined as:

\Pr_{\mathcal D}(x) = \begin{cases} 1/2^{i+1} & \text{if } x = ab^{i}c \ \lor\ x = db^{i}e, \; i > 0, \\ 0 & \text{otherwise.} \end{cases}

This distribution can be exactly generated by the DPFA of Fig. 1, but it cannot be properly approached by any k-TSA for any given k.

Fig. 1. A DPFA which generates a regular deterministic distribution that cannot be modeled by any k-TSA or n-gram.

The best k-TSA approximation of D, D_k, is:

\Pr_{\mathcal D_k}(x) = 1/2^{i+1}, \quad \Pr_{\mathcal D_k}(x') = 0, \qquad \forall i \le k-2,
\Pr_{\mathcal D_k}(x) = \Pr_{\mathcal D_k}(x') = 1/2^{i+2}, \qquad \forall i > k-2,

for any string x of the form ab^i c or db^i e, and x′ of the form db^i c or ab^i e.

In other words, using probabilistic k-testable automata or extended k-grams, only the probabilities of the strings up to length k can be approached while, in this example, the error ratio⁴ for longer strings will be at least 1/2 (or larger if k-TSA probabilities are estimated from a finite set of training data). As a result, for all finite values of k, the logarithmic distance d_log(D, D_k) is infinite. This can be seen as a probabilistic manifestation of the well-known over/undergeneralization behavior of conventional k-testable automata [25].

4. The error-ratio for a string x is the quotient between the true and the approximated probabilities for x.

Fig. 2. Error-ratio of the probabilities provided by different k-testable automata that best approach the stochastic language of Example 2, with respect to the true probability of strings in this language.

2.2 Stochastic Morphism Theorem
In classical formal language theory, the morphism theorem [26] is a useful tool to overcome the intrinsic limitations of k-testable models and to effectively achieve the full modeling capabilities of regular languages in general. Thanks to this theorem, the simple class of 2-testable languages becomes a "base set" from which all the regular languages can be generated.

However, no similar tool existed so far for the corresponding stochastic distributions. This section extends the standard construction used in the proof of the morphism theorem so that a similar proposition can be proved for stochastic regular languages.

Theorem 3 (Stochastic morphism theorem). Let Σ be a finite alphabet and D a stochastic regular language on Σ*. There exists then a finite alphabet Σ′, a letter-to-letter morphism h: Σ′* → Σ*, and a stochastic local language D_2 over Σ′, such that D = h(D_2), i.e.,

\forall x \in \Sigma^{*}, \quad \Pr_{\mathcal D}(x) = \Pr_{\mathcal D_2}(h^{-1}(x)) = \sum_{y \in h^{-1}(x)} \Pr_{\mathcal D_2}(y),    (6)

where h^{-1}(x) = {y ∈ Σ′* | x = h(y)}.

The proof of this proposition is in the Appendix. The following example illustrates the construction used in this proof and how to obtain exact 2-TSA-based models for given, possibly nondeterministic, stochastic regular languages.

Example 2. Consider the following distribution D over Σ = {a, b}:

\Pr(x) = \begin{cases} \Pr(i) & \text{if } x = ab^{i}, \; i \ge 0, \\ 0 & \text{otherwise,} \end{cases}

with Pr(i) = p_1 (1 − p_2) p_2^i + (1 − p_1)(1 − p_3) p_3^i and p_1 = 0.5, p_2 = 0.7, and p_3 = 0.9.

This distribution (which is similar to that used in Part I [1] to prove that the mean of two deterministic distributions may not be deterministic) is exactly generated by the PFA shown in Fig. 3 (left). From a purely structural point of view, the strings of the language underlying this distribution constitute a very simple local language that can be exactly generated by a trivial 2-testable automaton. However, from a probabilistic point of view, D is not regular deterministic, nor by any means local. In fact, it cannot be approached with arbitrary precision by any k-TSA, for any finite value of k. The best approximations for k = 2, 3, 4, 5 produce error-ratios greater than 2 for strings longer than 35, 40, 45, and 52, respectively, as is shown in Fig. 2. In fact, the logarithmic distance between the true and k-TSA-approximated distributions is infinite for any finite k. Nevertheless, the construction given by the stochastic morphism theorem yields a stochastic finite-state automaton that exactly generates D.

Using the construction of the proof of the stochastic morphism theorem, a 2-TSA, Z = ⟨Σ′, P_I, P_F, P_T⟩, is built from the PFA A′ = ⟨Q, Σ, δ, q_0, F, P⟩ shown in Fig. 3 (left) as follows:

\Sigma' = \{a_2, a_3, b_2, b_3\};
P_I(a_2) = P_I(a_3) = P(1, a, 2) = P(1, a, 3) = 0.5;
P_F(a_2) = P_F(b_2) = F(2) = 0.3;    (7)
P_F(a_3) = P_F(b_3) = F(3) = 0.1;
P_T(a_2, b_2) = P_T(b_2, b_2) = P(2, b, 2) = 0.7;
P_T(a_3, b_3) = P_T(b_3, b_3) = P(3, b, 3) = 0.9;

all the other values of P_I, P_F, and P_T are zero.

The corresponding 2-TSA is shown in Fig. 3 (middle). Applying the morphism h (i.e., dropping subindexes) to this automaton yields the PFA A shown in Fig. 3 (right). For any string x of the form ab^i, we have:

\Pr_{\mathcal A}(x) = 0.5 \cdot 0.3 \cdot 0.7^{i} + 0.5 \cdot 0.1 \cdot 0.9^{i}, \qquad \forall i \ge 0,

which is exactly the original distribution, D.
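The numbers in Example 2 are easy to check mechanically. The following sketch (ours; not part of the original text) computes Pr_A(ab^i) for the PFA A of Fig. 3 (right) with the standard forward recursion and compares it with the closed form 0.5·0.3·0.7^i + 0.5·0.1·0.9^i; the state names 1, 2, 3 follow Fig. 3 (left), and the transition values are those of (7) with the subindexes dropped.

# PFA A of Fig. 3 (right): states 1 (initial), 2 and 3; transitions
# (1,a,2,0.5), (1,a,3,0.5), (2,b,2,0.7), (3,b,3,0.9); F(2)=0.3, F(3)=0.1.
TRANS = {(1, 'a', 2): 0.5, (1, 'a', 3): 0.5,
         (2, 'b', 2): 0.7, (3, 'b', 3): 0.9}
FINAL = {1: 0.0, 2: 0.3, 3: 0.1}
INIT = {1: 1.0, 2: 0.0, 3: 0.0}

def pfa_prob(x):
    """Sum over all paths (forward algorithm): I(q0) * prod(P) * F(q_end)."""
    fwd = dict(INIT)
    for sym in x:
        nxt = {q: 0.0 for q in fwd}
        for (q, a, qp), p in TRANS.items():
            if a == sym:
                nxt[qp] += fwd[q] * p
        fwd = nxt
    return sum(fwd[q] * FINAL[q] for q in fwd)

for i in range(5):
    x = 'a' + 'b' * i
    closed_form = 0.5 * 0.3 * 0.7**i + 0.5 * 0.1 * 0.9**i
    print(i, round(pfa_prob(x), 6), round(closed_form, 6))

For every i, the two values coincide, as stated in Example 2.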

Fig. 3. Left: Finite-state stochastic automaton which generates the stochastic language of Example 2. Middle and right: Automata obtained through the construction used in the proof of the stochastic morphism theorem.

2.3 Hidden Markov Models
Nowadays, hidden Markov models (HMMs) are basic components of the most successful natural language processing tasks, including speech [21], [27], [28] and handwritten text recognition [29], [30], speech translation [31], [32], and shallow parsing [33], to name but a few. HMMs have also proved useful in many other pattern recognition and computer vision tasks, including shape recognition, face and gesture recognition, tracking, image database retrieval, and medical image analysis [34], [35], and in other less conventional applications such as financial returns modeling [36].

There exist many variants of Markov models, including differences as to whether the symbols are emitted at the states or at the transitions. See, for example, [21], [27], [28], [37].

Definition 2. An HMM is a 6-tuple M = ⟨Q, Σ, I, F, T, E⟩, where
. Q is a finite set of states,
. Σ is a finite alphabet of symbols,
. T: (Q − {q_f}) × Q → ℝ⁺ is a state-to-state transition probability function,
. I: Q − {q_f} → ℝ⁺ is an initial state probability function,
. E: (Q − {q_f}) × Σ → ℝ⁺ is a state-based symbol emission probability function,
. q_f ∈ Q is a special (final) state,
subject to the following normalization conditions:

\sum_{q \in Q - \{q_f\}} I(q) = 1,
\sum_{q' \in Q} T(q, q') = 1, \quad \forall q \in Q - \{q_f\},
\sum_{a \in \Sigma} E(q, a) = 1, \quad \forall q \in Q - \{q_f\}.

We will say that the model M generates (or emits) a sequence x = x_1 ... x_k with probability Pr_M(x). This is defined in two steps. First, let θ be a valid path for x, i.e., a sequence (s_1, s_2, ..., s_{k+1}) of states, with s_{k+1} = q_f. The probability of θ is

\Pr_{\mathcal M}(\theta) = I(s_1) \prod_{2 \le j \le k+1} T(s_{j-1}, s_j)

and the probability of generating x through θ is

\Pr_{\mathcal M}(x \mid \theta) = \prod_{1 \le j \le k} E(s_j, x_j).

Then, if Θ_M(x) is the set of all valid paths for x, the probability that M generates x is:

\Pr_{\mathcal M}(x) = \sum_{\theta \in \Theta_{\mathcal M}(x)} \Pr_{\mathcal M}(x \mid \theta)\, \Pr_{\mathcal M}(\theta).

It should be noticed that the above model cannot emit the empty string. Moreover, as in the case of PFA, some HMMs can be deficient. Discarding these degenerate cases, it can easily be seen that Σ_{x∈Σ⁺} Pr_M(x) = 1. Correspondingly, an HMM M defines a probability distribution D_M on Σ⁺.

In some definitions, states are allowed to remain silent, or the final state q_f is not included in the definition of an HMM. As in the case of n-grams, this latter type of model defines a probability distribution on Σ^n for each n, rather than on Σ⁺ [37].

Some relations between HMMs and PFA are established by the following propositions:

Proposition 4. Given a PFA A with m transitions and Pr_A(λ) = 0, there exists an HMM M with at most m states such that D_M = D_A.

Proposition 5. Given an HMM M with n states, there exists a PFA A with at most n states such that D_A = D_M.

In order to have a self-contained article, the proofs of Propositions 4 and 5 are given in the Appendix (Sections A.2 and A.3). They nonetheless also appear in [8], using a slightly different method regarding Proposition 4.
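The construction behind Proposition 5 (detailed in Appendix A.3) is simple enough to sketch directly. The following fragment (ours; the data layout is illustrative) builds the PFA transitions as P(q, a, q′) = E(q, a) · T(q, q′), keeps the initial probabilities of the non-final states, and makes q_f the only final state.

def hmm_to_pfa(Q, I, T, E, qf):
    """Sketch of the construction in Appendix A.3: from an HMM
    <Q, Sigma, I, F, T, E> build a PFA defining the same distribution.
    I[q], T[(q, q')] and E[(q, a)] hold the HMM probabilities; qf is the
    special final state (which emits nothing)."""
    init = {q: (I.get(q, 0.0) if q != qf else 0.0) for q in Q}
    final = {q: (1.0 if q == qf else 0.0) for q in Q}
    trans = {}
    for q in Q:
        if q == qf:
            continue
        for qp in Q:
            for (qq, a), e in E.items():
                if qq == q and e > 0 and T.get((q, qp), 0.0) > 0:
                    # P(q, a, q') = E(q, a) * T(q, q')
                    trans[(q, a, qp)] = e * T[(q, qp)]
    return init, trans, final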

3 LEARNING PROBABILISTIC AUTOMATA

Over the years, researchers have attempted to learn, infer, identify, or approximate PFA from a given set of data. This task, often called language modeling [38], is seen as essential when considering pattern recognition [27], machine learning [39], computational linguistics [40], or biology [14]. The general goal is to construct a PFA (or some alternative device) given data assumed to have been generated from this device, and perhaps partial knowledge of the underlying structure of the PFA. A recent review on probabilistic automata learning appears in [8]. Here, only a quick, in most cases complementary, review, along with a set of relevant references, will be presented. We will distinguish here between the estimation of the probabilities given an automaton structure and the identification of the structure itself.

3.1 Estimating PFA Probabilities
The simplest setting of this problem arises when the underlying structure corresponds to an n-gram or a k-TSA. In this case, the estimation of the parameters is as simple as the identification of the structure [21], [22].

We assume more generally that the structural components Σ, Q, and δ of a PFA A are given. Let S be a finite sample of training strings drawn from a regular distribution D. The problem is to estimate the probability parameters I, P, F of A in such a way that D_A approaches D.

Maximum likelihood (ML) is one of the most widely adopted criteria for this estimation:

(\hat{I}, \hat{P}, \hat{F}) = \operatorname*{argmax}_{I, P, F} \prod_{x \in S} \Pr_{\mathcal A}(x).    (8)

Maximizing the likelihood is equivalent to minimizing the empirical cross entropy X̂(S, D_A) (see Section 6 of [1]). It can be seen that, if D is generated by some PFA A′ with the same structural components as A, optimizing this criterion guarantees that D_A approaches D as the size of S goes to infinity [41].

The optimization problem (8) is quite simple if the given automaton is deterministic [42]. Let ⟨Q, Σ, δ, q_0, F, P⟩ be the given DPFA whose parameters F and P are to be estimated. For all q ∈ Q, a ML estimation of the probability of the transition P(q, a, q′) is obtained by just counting the number of times this transition is used in the deterministic derivations of the strings in S and normalizing this count by the frequency of use of the state q. Similarly, the final-state probability F(q) is obtained as the relative frequency of state q being final through the parsing of S. Probabilistic parameters of nonambiguous PFA or λ-PFA can also be easily ML-estimated in the same way.

However, for general (nondeterministic, ambiguous) PFA or λ-PFA, multiple derivations are possible for each string in S and things become more complicated. If the values of I, P, and F of A are constrained to be in ℚ⁺, the decisional version of this problem is clearly in NP, and the conjecture is that this problem is at least NP-Hard. In practice, only locally optimal solutions to the optimization (8) are possible.

As discussed in [5], the most widely used algorithmic solution to (8) is the well-known expectation-maximization (EM) Baum-Welch algorithm [2], [3], [6]. It iteratively updates the probabilistic parameters (I, F, and P) in such a way that the likelihood of the sample is guaranteed not to decrease after each iteration. The parameter updating is based on the forward and backward dynamic programming recurrences to compute the probability of a string discussed in Section 3 of [1]. Therefore, the method is often referred to as backward-forward reestimation. The time and space complexities of each Baum-Welch iteration are O(M · N) and O(K · L + M), respectively, where M = |δ| (number of transitions), K = |Q| (number of states), N = ||S|| (number of symbols in the sample), and L = max_{x∈S} |x| (length of the longest training string) [5].

Using the optimal path (Viterbi) approximation rather than the true (forward) probability (see [1], Sections 3.2 and 3.1, respectively) in the function to be optimized (8), a simpler algorithm is obtained, called the Viterbi reestimation algorithm. This is discussed in [5], while reestimation algorithms for criteria other than ML can be found in [7], [43], [44], [45].

Baum-Welch and Viterbi reestimation techniques adequately cope with the multiple-derivations problem of ambiguous PFA. Nevertheless, they can also be applied to the simpler case of nonambiguous PFA and, in particular, the deterministic PFA discussed above. In these cases, the following properties hold:

Proposition 6. For nonambiguous PFA (and for DPFA in particular),
1. the Baum-Welch and the Viterbi reestimation algorithms produce the same solution,
2. the Viterbi reestimation algorithm stops after only one iteration, and
3. the solution is unique (global maximum of (8)).
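The counting procedure for DPFA described above fits in a few lines. The sketch below (ours, not from the paper) assumes the deterministic structure is given as a mapping delta[(q, a)] = q′ that accepts every training string, and returns the relative-frequency estimates of P and F.

from collections import defaultdict

def estimate_dpfa(delta, q0, sample):
    """ML estimation of P and F for a DPFA with known structure:
    count how often each transition is used in the (unique) parse of
    each training string and how often each state ends a string, then
    normalize by the number of times the state had to emit or stop."""
    use = defaultdict(int)     # use[(q, a)]: times the transition left q with a
    end = defaultdict(int)     # end[q]: times a training string ended in q
    visits = defaultdict(int)  # visits[q]: times a decision was taken at q
    for x in sample:
        q = q0
        for a in x:
            visits[q] += 1
            use[(q, a)] += 1
            q = delta[(q, a)]          # deterministic next state
        visits[q] += 1
        end[q] += 1
    P = {(q, a): c / visits[q] for (q, a), c in use.items()}
    F = {q: c / visits[q] for q, c in end.items()}
    return P, F

By construction, for every visited state the estimated outgoing transition probabilities and the final probability sum to one.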
3.2 Learning the Structure
We will first informally present the most classic learning paradigms and discuss their advantages and drawbacks. We will then present the different results of learning.

3.2.1 Learning Paradigms
In the first learning paradigm, proposed by Gold [46], [47], there is an infinite source of examples that are generated following the distribution induced by a hidden target. The learning algorithm is expected to return some hypothesis after each new example, and we will say that the class is identifiable in the limit with probability one if, whatever the target, the algorithm identifies it (i.e., there is a point from which the hypothesis is equivalent to the target) with probability one.

The main drawbacks of this paradigm are:
. it does not entail complexity constraints,
. we usually do not know if the amount of data needed by the algorithm has been reached, and
. an algorithm can be proven to identify in the limit and might still return arbitrarily bad answers if the required amount of data is not provided.

Despite these drawbacks, the identification-in-the-limit paradigm can be seen as a necessary condition for learning a given class of models. If this condition is not met, that means that some target is not learnable.

A second learning paradigm was proposed by Valiant and extended later [48], [49], [50], [51], [52]. This paradigm, called probably approximately correct (PAC) learning, requires that the learner return a good approximation of the target with high probability. The words good and high are formalized in a probabilistic framework and are a function of the amount of data provided.

These frameworks have been adapted to the cases where the target concept is a probabilistic model [19], [53], [54], [55], [56], [57], [58].

Finally, another framework comes from traditional methods for HMM estimation. In this framework, the structure of the model is somehow parameterized and learning is seen as a problem of parameter estimation. In the most general statement of this problem for PFA, only the alphabet (of size n) and the number of states (m) are given, and the problem is to estimate the probabilities of all the n · m² possible transitions. As discussed in Section 3.1, the Baum-Welch (or the Viterbi) algorithm can be used for a locally optimal estimation of these parameters. However, given the very large number of parameters, this general method has seldom proved useful in practice. Related approaches where the number of parameters to estimate is explicitly constrained are discussed in [8].
3.2.2 What Can Be Learned?
This section addresses previous work related to the learning of probabilistic finite-state automata. The first results came from Horning [53], who showed that any recursively enumerable class of languages can be identified in the limit with probability one. The problem of the proof—among others of the same spirit [54], [55]—is that it does not provide us with a reasonable algorithm to perform the task.

A more constructive proof, relying on a reasonable algorithm, was proposed in [57]: Identification in the limit of DPFA is shown. This proof is improved in [59] with results concerning the identification of rational random variables.

Work has also been done in the Probably Approximately Correct (PAC) learning paradigm. The results are rather different depending on the object we want to infer and/or what we know about it. Actually, Abe and Warmuth [17] showed that nondeterministic acyclic automata that define a probability distribution over Σ^n, with n and Σ known, can be approximated in polynomial time. Moreover, they showed that learnability is not polynomial in the size of the vocabulary. Kearns et al. [18] showed that an algorithm that aims at learning a probabilistic function cannot reach its goal⁵ if the probability distribution can be generated by a DPFA over {0, 1}^n. Thus, knowing the class of the object we want to infer helps the inference a lot, since the objects dealt with in [17] are more complex than the ones addressed in [18]. Following this idea, Ron et al. [19] proposed a practical algorithm that converges in a PAC-like framework and that infers a restricted class of acyclic automata. More recently, Clark and Thollard [58] showed that the result holds for cyclic automata as soon as a bound on the expected length of the generated strings is known.

5. Actually, the authors showed that this problem was as hard as learning parity functions in a noisy setting for the nonprobabilistic PAC framework. This problem is generally believed to be intractable.

3.2.3 Some Algorithms
If we restrict ourselves to the class of n-gram or k-TSA distributions, as previously mentioned, learning both the structure and the probabilities of n-grams or k-TSA is simple and already very well known [21], [22]. For more general PFA, another strategy can be followed: First, the probabilistic prefix tree automaton (PPTA), which models the given training data with maximum likelihood, is constructed. This PPTA is then generalized using state-merging operations. This is usually called the state-merging strategy.

Following this strategy, Carrasco and Oncina [60] proposed the ALERGIA algorithm for DPFA learning. Stolcke and Omohundro [20] proposed another learning algorithm that infers DPFA based on Bayesian learning. Ron et al. [19] reduced the class of languages to be learned and provided another state-merging algorithm, and Thollard et al. [61] proposed the MDI algorithm under the same framework. MDI has been shown to outperform ALERGIA on a natural language modeling task [61] and on shallow parsing [62]. A recent variant of ALERGIA was proposed in [63] and evaluated on a natural language modeling task. A modification of this algorithm was also used in [64] to discover the underlying model in structured text collections.
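The probabilistic prefix tree automaton that seeds the state-merging strategy is easy to build. The sketch below (ours; identifiers are illustrative) uses one state per prefix of the training strings and relative-frequency estimates; for each state, the outgoing transition probabilities and the final probability sum to one, so the PPTA assigns the training data its maximum-likelihood probability.

from collections import defaultdict

def build_ppta(sample):
    """Probabilistic prefix tree automaton: one state per prefix of the
    training strings, with maximum-likelihood transition and final
    probabilities given by relative frequencies."""
    count = defaultdict(int)   # count[p]: strings having p as a prefix
    ends = defaultdict(int)    # ends[p]: strings equal to p
    for x in sample:
        for i in range(len(x) + 1):
            count[x[:i]] += 1
        ends[x] += 1
    trans = {}                 # (prefix, a) -> (child prefix, probability)
    for p in count:
        symbols = set(x[len(p)] for x in sample
                      if x.startswith(p) and len(x) > len(p))
        for a in symbols:
            trans[(p, a)] = (p + a, count[p + a] / count[p])
    final = {p: ends[p] / count[p] for p in count}
    return trans, final

State-merging algorithms such as ALERGIA or MDI then repeatedly test pairs of PPTA states for compatibility and merge them to generalize beyond the sample.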
3.2.4 Other Learning Approaches
While not a learning algorithm in itself, a (heuristic) general learning scheme which is worth mentioning can be derived from the stochastic morphism theorem shown in Section 2.2. In fact, the use of the conventional morphism theorem [26] was already proposed in [65] to develop a general methodology for learning general regular languages, called "morphic generator grammatical inference" (MGGI). The basic idea of MGGI was to rename the symbols of the given alphabet in such a manner that the syntactic restrictions which are desirable in the target language can be described by simple local languages. MGGI constitutes an interesting engineering tool which has proved very useful in practical applications [25], [65].

We briefly discuss here how the stochastic morphism theorem can be used to obtain a stochastic extension of this methodology, which will be called stochastic MGGI (SMGGI).

Let S be a finite sample of training sentences over Σ and let Σ′ be the alphabet required to implement an adequate renaming function g: S → Σ′*. Let h: Σ′* → Σ* be a letter-to-letter morphism; typically, one such that h(g(S)) = S. Then, a 2-TSA model can be obtained and the corresponding transition and final-state probabilities maximum-likelihood estimated from g(S) using conventional bigram learning or the 2-TSI algorithm [22].

Let D_2(g(S)) be the stochastic local language generated by this model. The final outcome of SMGGI is then defined as the regular distribution D = h(D_2(g(S))); that is:

\forall x \in \Sigma^{*}, \quad \Pr_{\mathcal D}(x) = \sum_{y \in h^{-1}(x)} \Pr_{\mathcal D_2(g(S))}(y),    (9)

where h^{-1}(x) = {y ∈ Σ′* : x = h(y)}.

From a practical point of view, the morphism h is just applied to the terminal symbols of the 2-TSA generating D_2(g(S)). While this automaton (defined over Σ′) has a deterministic structure and is therefore unambiguous, after applying h, the resulting automaton is often ambiguous, thus precluding a simple maximum-likelihood estimation of the corresponding transition and final-state probabilities. Nevertheless, (9) allows us to directly use the 2-TSA probabilities with the guarantee that they constitute a proper estimation for the possibly ambiguous resulting automaton.

3.3 Smoothing Issues
The goal of smoothing is estimating the probability of events that have never been seen in the available training data. From the theoretical point of view, smoothing must be taken into account since estimates must behave well on the whole set Σ*. From the practical point of view, we saw that the probability of a sequence is computed as a product of probabilities associated with the symbols. Smoothing is necessary to distinguish a very probable sequence containing a single unknown symbol (e.g., in natural language modeling this can be a sentence with an unknown proper noun) from a sequence composed of impossible concatenations of symbols.

Even though some work has been done in order to theoretically justify some smoothing techniques—e.g., the Good-Turing estimator [39], [66]—smoothing has mainly been considered from the practical point of view. The main line of research considers the n-gram model as the base model and a back-off strategy as the smoothing technique [10], [38], [67], [68], [69]. In the back-off strategy, another model (usually a more general one) is used in order to estimate the probability of a sequence; for example, if there is no trigram to estimate a conditional probability, a bigram can be used to do it. In order to guarantee an overall consistent model, several variants have been considered. After the backing-off, the trigram can again be used to estimate the probabilities.

Smoothing PFA is a harder problem. Even if we can think about backing-off to simpler and more general models, it is not easy to use the PFA to continue the parsing after the backing-off. A first strategy consists in backing-off to a unigram and finishing the parsing in the unigram [70] itself. A more clever strategy is proposed by Llorens et al. [71], who use a (recursively smoothed) n-gram as a back-off model. The history of each PFA state is computed in order to associate it with the adequate n-gram state(s). Parsing can then go back and forth through the full hierarchy of PFA and m-gram states, 0 < m ≤ n.

4 PROBABILISTIC EXTENSIONS
A number of natural extensions of the PFA and DPFA have been proposed. We mention in the sequel some of the most important ones. These include probabilistic finite-state transducers and stochastic finite-state tree automata. These models are related to the more general stochastic context-free grammars, for which a short account is also given.

4.1 Probabilistic Finite-State Transducers
Stochastic finite-state transducers (SFSTs) are similar to PFA but, in this case, two different alphabets are involved: a source (Σ) and a target (Δ) alphabet. Each transition in an SFST has attached a source symbol and a (possibly empty) string of target symbols.

Different types of SFSTs have been applied with success in some areas of machine translation and pattern recognition [77], [78], [79], [80], [81], [82], [83]. On the other hand, in [40], [84], [85], weighted finite-state transducers are introduced. Another (context-free) generalization, head transducer models, was proposed in [86], [87].

An SFST is defined as an extension of PFA: T = ⟨Q, Σ, Δ, δ, I, F, P⟩, where Q is a finite set of states; Σ and Δ are the source and target alphabets, respectively; δ ⊆ Q × Σ × Δ* × Q is a set of transitions; I: Q → ℝ⁺ and F: Q → ℝ⁺ are the initial and final-state probabilities, respectively; and P: δ → ℝ⁺ are the transition probabilities, subject to the following normalization constraints:

\sum_{q \in Q} I(q) = 1,
\forall q \in Q, \quad F(q) + \sum_{a \in \Sigma,\; q' \in Q,\; y \in \Delta^{*}} P(q, a, y, q') = 1.
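To make the definition concrete, here is a toy sketch (ours; the transition values are arbitrary, and the joint probability is computed with the path-sum semantics that is assumed to carry over from PFA in the natural way). It represents an SFST as a dictionary of transitions (q, a, y, q′) → P and computes the joint probability of a source/target pair by enumerating the transition sequences that consume the source string and emit the target string.

# Toy SFST: each transition reads one source symbol a and emits a
# (possibly empty) target string y.  The normalization constraints above
# hold: F(0) + 0.5 + 0.5 = 1 and F(1) + 0.6 = 1.
T = {(0, 'a', 'A', 1): 0.5, (0, 'a', '', 1): 0.5,
     (1, 'b', 'B', 1): 0.6}
I = {0: 1.0, 1: 0.0}
F = {0: 0.0, 1: 0.4}

def joint(x, t):
    """Sum, over transition sequences consuming x and emitting t, of
    I(q0) * product of transition probabilities * F(final state)."""
    def rec(q, i, out):
        total = 0.0
        if i == len(x) and out == t:
            total += F[q]
        if i < len(x):
            for (s, a, y, sp), p in T.items():
                if s == q and a == x[i] and t.startswith(out + y):
                    total += p * rec(sp, i + 1, out + y)
        return total
    return sum(I[q] * rec(q, 0, '') for q in I)

print(joint('ab', 'AB'))   # 0.5 * 0.6 * 0.4 = 0.12
print(joint('a', ''))      # 0.5 * 0.4 = 0.2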

0 PrT ðt; xÞ > 0 and PrT ðt ;xÞ > 0), the translation problem is There are also extensions of stochastic context-free polynomial. Moreover, if T is simply nonambiguous (8x 2 ? grammars for translation: stochastic syntax-directed transla- and 8t 2 ? there are not two different sequences of states tion schemata [104] and head transducer models were proposed that deal with ðx; tÞ with probability greater then zero), the in [86], [87]. translation problem is also polynomial. In these two cases, the computation can be carried out using an adequate 4.3 Stochastic Finite-State Tree Automata version of the Viterbi algorithm. Finally, if T is subsequential, Stochastic models that assign a probability to a tree can be or just deterministic with respect to the input symbol, the useful, for instance, in natural language modeling to select stochastic translation problem is also polynomial, though, in the best parse tree for a sentence and resolve structural this case, the computational cost is OðjxjÞ, independent of ambiguity. For this purpose, finite-state automata that the size of T . operate on trees can be defined [15]. In contrast to the case The components of an SFST (states, transitions, and the of strings, where the automaton computes a state for every probabilities associated to the transitions) can be learned prefix, a frontier-to-root tree automaton processes the tree from training pairs in a single process or in a two-step bottom-up and state is computed for every subtree. The process. In the latter case, first the structural component is result depends on both the node label and the states learned and next the probabilistic components are estimated obtained after the node subtrees. Therefore, a collection of from training samples. The GIATI (Grammatical Inference and transition functions, one for each possible number of Alignments for Translator Inference)7 is a technique of the first subtrees, is needed. This probabilistic extension defines a T type [81], [89], while OSTIA (Onward Subsequential Transducer probability distribution over the set of labeled trees. Inference Algorithm) and OMEGA (OSTIA Modified for Employ- A probabilistic finite-state tree automaton (PTA) is defined ing Guarantees and Alignments) are techniques for learning as hQ; ; ;P;i, where the structural component of a SFST [79], [80]. Only a few . Q is a finite set of states, other techniques exist to infer finite-state transducers [77], . is the alphabet, [90], [91], [92]. To estimate the probabilistic component in . ¼f ; ; ...; g is a collection of transition sets the two-step approaches, maximum likelihood or other criteria 0 1 M Q Qm, can be used [7], [45], [93]. One of the main problems m . P is a collection of functions P ¼fp ;p ;p ; ...;p g associated with the learning process is the modeling of 0 1 2 M of the type p : !½0; 1, and events not seen in the training set. As previously discussed m m . are the root probabilities : Q !½0; 1. for PFA, this problem can be tackled by using smoothing techniques; either in the estimation of the probabilistic The required normalizations are components of the SFSTs [94] or within of the process of X learning both components [81]. ðqÞ¼1; ð11Þ q2Q 4.2 Stochastic Context-Free Grammars q 2 Q Stochastic context-free grammars are the natural extension of and, for all , probabilistic finite-state transducers. 
These models are X XM X defined as a tuple hQ; ;S;R;Pi, where Q is a set of pmðq; a; i1; ...;imÞ¼1: ð12Þ nonterminal symbols, is an finite alphabet, S 2 Q is the a2 m¼0 i1;...;im2Q: ðq;a;i1;...;imÞ2m initial symbol, R is a set of rules A ! ! with ! 2ðQ [ Þ?, þ and P : R ! IR Pis the set of probabilities attached to the The probability of a tree t in the stochastic language ? A rules such that !2ðQ[Þ PðA ! !Þ¼1 for all A 2 Q. generated by is defined as In general, parsing strings with these models is in Oðn3Þ X (although quadratic algorithms can be designed for special pðt j AÞ¼ ðqÞðq; tÞ; ð13Þ types of stochastic context-free grammars) [4], [5]. Approx- q2Q imations to stochastic context-free grammars using prob- where ðq; tÞ is recursively defined as: abilistic finite-state automata have been proposed in [95], 8 [96]. Algorithms for the estimation of the probabilities > p0ðq; aÞ if t ¼ a 2 ; attached to the rules are basically the inside-outside algorithm > > X [4], [97] and a Viterbi-like algorithm [98]. The relation > > pmðq; a; ðt1Þ; ...;ðtmÞÞ between the probability of the optimal path of states and <> i1;...;im2Q: the probability of generating a string has been studied in ðq; tÞ¼ ðq;a;i1;...;imÞ2m ð14Þ > [99]. The structure of stochastic context-free grammars (the > ði1;t1Þðim;tmÞ; > nonterminal symbols and the rules) can currently be > > learned from examples [100], [101], [102] in very limited :> if t ¼ aðt1 tmÞ2T ; settings only (when grammars are even linear). An alternative 0 otherwise: line is to learn the context-free grammar from the examples As in the case of PFA, it is possible to define deterministic and by ignoring the distribution: Typically, Sakakibara’s PTA as those where the set fq 2 Q : ðq; a; i ; ...;i Þ2 g reversible grammars [103] have been used for this purpose; 1 m m has size at most 1 for all a 2 , for all m 0, and for all then, the inside-outside algorithm is used to estimate the i ; ...;i 2 Q. In such a case, a minimal automaton can be probabilities. 1 m defined and it can be identified from samples [15].

7. In earlier papers, this technique was called MGGI (Morphic Generator In contrast, the consistency of probabilistic tree automata Transducer Inference). is not guaranteed by (11) and (12) even in the absence of VIDAL ET AL.: PROBABILISTIC FINITE-STATE MACHINES—PART II 1035 useless states. Consistency requires that the spectral radius of 6. Smoothing is a crucial issue for language modeling the production matrix defined below is strictly smaller (see Section 3.3). Good smoothing techniques for than 1 [42]: PFA and DPFA would surely improve the modeling capacities of these models and it can be conjectured X XM X that they might perform better than standard ij ¼ pmði; a; i1;i2; ...;imÞ techniques. a2 m¼1 i1;i2;...;im2Q: ð15Þ 7. ði;a;i1;...;imÞ2m Testing the closeness of two distributions from samples is also an issue that matters: Whether to ð1ðj; i Þþþ1ðj; i ÞÞ; 1 m be able to use larger data sets for learning or to be where 1ði; jÞ is Kronecker’s delta defined before. able to decide merging in learning algorithms, one wishes to be able to have a simple test to decide if two samples come from the same (or sufficiently ONCLUSION 5C similar) distribution or not. We have, in this paper, proposed a survey of the properties 8. Following [88], we recall that the problem of deciding concerning deterministic and nondeterministic probabilistic whether the probability of the most probable string is finite-state automata. A certain number of results have been more than a given fraction is NP-hard. It is not proved and others can be fairly straightforwardly derived known if the problem belongs to NP. from them. On the other hand, we have left many questions Obviously, there are many topics related with PFA that require further research efforts and only few are mentioned not answered in this work. They correspond to problems here. To mention but one of these topics, probabilistic (finite that, to our knowledge, are open or, even in a more or context-free) transducers are increasingly becoming extensive way, to research lines that should be followed. important devices, where only a few techniques are known Here are some of these: to infer finite-state transducers from training pairs or to smooth probabilistic finite-state transducers when the 1. We studied in the section concerning topology of training pairs are scarce. part I [1] the questions of computing the distances Solving some of the above problems, and in a more between two distributions represented by PFA.In general way, better understanding how PFA and DPFA the case where the PFA are DPFA the computation of work would necessarily increase their importance and the L2 distance and of the Kullback-Leibler diver- relevance in a number of fields and, specifically, those that gence can take polynomial time, but what about the are related to pattern recognition. L1, L1, and logarithmic distances? 2. In the same trend, it is reasonably clear that, if at PPENDIX least one of the distributions is represented by a PFA, A the problem of computing or even approximating A.1 Proof of Theorem 3 the L (or L )isNP-hard. What happens for the 1 1 Theorem 3 (Stochastic morphism theorem). Let be a finite other distances? The approximation problem can be alphabet and D be a stochastic regular language on ?. There defined as follows: Given an integer m decide if exists then a finite alphabet 0, a letter-to-letter morphism dðD; D0Þ < 1 . m h :0? ! ?, and a stochastic local language over 0, D , 3. 
In [105], the question of computing the weight of a 2 such that D¼hðD2Þ, i.e., language inside another (or following a regular X distribution)P is raised. Technically, it requires com- ? 1 8x 2 ; PrDðxÞ¼PrD ðh ðxÞÞ ¼ PrD ðyÞ; ð16Þ puting PrBðwÞ, where A is a DFA and B is a 2 2 w2LA y2h1ðxÞ DPFA. Techniques for special cases are proposed in ? [105], but the general question is not solved. The where h1ðxÞ¼fy 2 0 j x ¼ hðyÞg. problem is clearly polynomially solvable; the pro- Proof. By Proposition 11 of [1], D can be generated by a PFA blem is that of finding a fast algorithm. with a single initial state. Let A¼ be 4. The equivalence of HMM has been studied in [106], 0 0 such a PFA. Let ¼faqjðq ;a;qÞ2g and define a letter- where it is claimed that it can be tested in 0 to-letter morphism h : ! by hðaqÞ¼a. Next, define polynomial time. When considering the results from 0 a stochastic local language, D2,over by Z ¼ our Section 2.3, it should be possible to adapt the 0 ð ;PI ;PF ;PT Þ, where proof in order to obtain an equivalent result for PFA. 0 0 5. We have provided a number of results on distances PI ðaqÞ¼Pðq0;a;qÞ;PF ðaqÞ¼FðqÞ;PT ða q0 ;aqÞ¼Pðq ;a;qÞ: in the section concerning distances of part I [1]. Yet, a ð17Þ comparison of these distances and how they relate to learning processes would be of clear interest. From Now, let x ¼ x1 ...xn be a nonempty string over , the theoretical point of view, in a probabilistic PAC with PrDðxÞ > 0. Then, at least one valid path exists for x learning framework, the error function used is in A. Let be one of these paths, with s0 ¼ q0: usually the Kullback-Leibler divergence [17], [18], ¼ðs ;x ;s Þ ...ðs ;x ;s Þ: [19],[56],[58].Aswementioned,manyother 0 1 1 n1 n n 0 measures exist and it should be interesting to study Associated with , define a string y over as: learnability results while changing the similarity y ¼ y1 ...yn ¼ x1s ...xns : measure. 1 n 1036 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 7, JULY 2005

? Let Y be the set of strings in 0 associated with all the Proof. Let M¼hQ; ; I; F; T; Ei be an HMM. We create an valid paths for x in A. Note that, for each y 2 Y , there is a equivalent PFA A0 ¼hQ; ;I;;F;Pi as follows: unique path for x and vice versa. Note also that x ¼ hðyÞ. Q Q; Therefore, Y ¼ h1ðxÞ. ¼ If x ¼ , it has a unique degenerate path consisting IðqÞ¼IðqÞ; for all q 2 Q nfqf g; and Iðqf Þ¼0; 0 0 only in q0; that is, Y ¼fg and PrD2 ðÞ¼Fðq0Þ¼PrDðÞ. ¼fðq; a; q Þ :Tðq; q Þ 6¼ 0 and Eðq; aÞ 6¼ 0g; Otherwise, from (4) and (17), the probability of every FðqÞ¼0 for all q 2 Q nfqf g; and Fðqf Þ¼1; y 2 Y is: Pðq; a; q0Þ¼Eðq; aÞTðq; q0Þ: Yn ? For each x ¼ x1x x 2 with PrMðxÞ 6¼ 0, there is at least PrD2 ðyÞ¼PI ðx1s1 Þ PT ðxi1si1 ;xisi ÞPF ðxnsn Þ j j i¼2 a sequence of states ðs1; ...;sjxj; qf Þ that generates with x Yn probability: ¼ Pðs0;x1;s1Þ Pðsi1;xi;siÞFðsnÞ; i¼2 Iðs1ÞEðs1;x1ÞTðs1;s2ÞTðsjxj1;sjxjÞ

which, according to (1) in Section 2.6 of [1] (and noting Eðsjxj;xjxjÞTðsjxj; qf Þ: that, in our PFA, Iðq0Þ¼1), is the probability of the path And, in A0, for x in A y is associated with. Finally, following (2) in Section 2.6 of [1] (that gives Iðs1ÞPðs1;x1;s2ÞPðsjxj;xjxj; qf Þ: the probability of generating a string), For each path in M, there is one and only one path X 0 in A . Moreover, by construction, Iðs1Þ¼Iðs1Þ and PrD ðyÞ¼PrAðxÞ8x :PrDðxÞ > 0: 0 0 2 Pðq; a; q Þ¼Eðq; aÞTðq; q Þ; therefore, DA0 ¼DM. Final- y2Y ly, by Proposition 11 of [1], we can build a PFA A, with at most jQj¼n states, such that D 0 ¼D . tu OnP the other hand, if PrDðxÞ¼0, then Y ¼;, leading A A 1 to y2Y PrD2 ðyÞ¼0. Therefore, since Y ¼ h ðxÞ,we have hðD2Þ¼D. tu ACKNOWLEDGMENTS This proof is a probabilistic generalization of the proof The authors wish to thank the anonymous reviewers for their for the classical morphism theorem [26]. Given the none- careful reading and in-depth criticisms and suggestions. quivalence of PFA and DPFA, the present construction This work has been partially supported by the Spanish required the use of nondeterministic and possibly ambig- project TIC2003-08681-C02 and the IST Programme of the uous finite-state automata. European Community, under the PASCAL Network of A.2 Proof of Proposition 4 Excellence, IST-2002-506778. This publication only reflects Proposition 4. Given a PFA A with m transitions and the authors’ views.

PrAðÞ¼0, there exists a HMM M with at most m states, such that DM ¼DA. REFERENCES Proof. Let A¼hQ; ;;I;F;Pi be a PFA. We create an [1] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R.C. Carrasco, “Probabilistic Finite-State Automata—Part I,” IEEE Trans. equivalent HMM M¼hQ; ; I; F; T; Ei as follows: Pattern Analysis and Machine Intelligence, special issue on syntactic and structural pattern recognition, vol 27, no. 7, pp. 1013-1025, July . Q ¼ Q Q, P 2005. 0 0 0 [2] L.E. Baum, “An Inequality and Associated Maximization Techni- . Iðq; q Þ¼IðqÞ a2P Pðq; a; q Þ for all ðq; q Þ2Q, . T q; q0 ; q0;q00 P q0;a;q00 T q; q0 ; q que in Statistical Estimation for Probabilistic Functions of Markov ðð Þ ð ÞÞ¼ a2 ð Þand ðð Þ f Þ Processes,” Inequalities, vol. 3, pp. 1-8, 1972. ¼ Fðq0Þ, and 0 [3] C.F.J. Wu, “On the Convergence Properties of the EM Algorithm,” . Eððq; q0Þ;aÞ¼PPðq;a;q Þ if Pðq; a; q0Þ 6¼ 0. Annals of Statistics, vol. 11, no. 1, pp. 95-103, 1983. Pðq;b;q0Þ b2 [4] F. Casacuberta, “Statistical Estimation of Stochastic Context-Free ? For each x ¼ x1xjxj 2 with PrAðxÞ 6¼ 0, there is at Grammars,” Pattern Recognition Letters, vol. 16, pp. 565-573, 1995. [5] F. Casacuberta, “Growth Transformations for Probabilistic Func- least a sequence of states ðs0; ...;sjxjÞ that generates x tions of Stochastic Grammars,” Int’l J. Pattern Recognition and with probability: Artificial Intelligence, vol. 10, no. 3, pp. 183-201, 1996. [6] G.J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. Iðs0ÞPðs0;x1;s1ÞPðsjxj1;xjxj;sjxj1;sjxjÞFðsjxjÞ: Wiley, 1997. [7] D. Pico´ and F. Casacuberta, “Some Statistical-Estimation Methods And, in M, for Stochastic Finite-State Transducers,” Machine Learning J., vol. 44, no. 1, pp. 121-141, 2001. Iðs0;s1ÞEððs0;s1Þ;x1ÞTððs0;s1Þ; ðs1;s2ÞÞ ... [8] P. Dupont, F. Denis, and Y. Esposito, “Links between Probabilistic E s ;s ;x T s ;s ; q : Automata and Hidden Markov Models: Probability Distributions, ðð jxj1 jxjÞ jxjÞ ðð jxj1 jxjÞ f Þ Learning Models and Induction Algorithms,” Pattern Recognition, 2004. For each path in A, there is one and only one path in [9] I.H. Witten and T.C. Bell, “The Zero Frequency Problem: HMM, so the theorem holds. tu Estimating the Probabilities of Novel Events in Adaptive Test Compression,” IEEE Trans. Information Theory, vol. 37, no. 4, A.3 Proof of Proposition 5 pp. 1085-1094, 1991. n [10] H. Ney, S. Martin, and F. Wessel, Corpus-Based Statiscal Methods in Proposition 5. Given an HMM M with states, there exists a Speech and Language Processing, S. Young and G. Bloothooft, eds., PFA A with at most n states such that DA ¼DM. pp. 174-207, Kluwer Academic Publishers, 1997. VIDAL ET AL.: PROBABILISTIC FINITE-STATE MACHINES—PART II 1037

[11] P. Dupont and J.-C. Amengual, “Smoothing Probabilistic Auto- [36] Y. Bengio, V.-P. Lauzon, and R. Ducharme, “Experiments on the mata: An Error-Correcting Approach,” Proc. Fifth Int’l Colloquium Application of IOHMMs to Model Financial Returns Series,” IEEE Grammatical Inference: Algorithms and Applications, pp. 51-56, 2000. Trans. Neural Networks, vol. 12, no. 1, pp. 113-123, 2001. [12] Y. Sakakibara, M. Brown, R. Hughley, I. Mian, K. Sjolander, [37] F. Casacuberta, “Some Relations among Stochastic Finite State R. Underwood, and D. Haussler, “Stochastic Context-Free Networks Used in Automatic Speech Recognition,” IEEE Trans. Grammars for tRNA Modeling,” Nuclear Acids Research, vol. 22, Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 691-695, pp. 5112-5120, 1994. July 1990. [13] T. Kammeyer and R.K. Belew, “Stochastic Context-Free Grammar [38] J. Goodman, “A Bit of Progress in Language Modeling,” technical Induction with a Genetic Algorithm Using Local Search,” report, Microsoft Research, 2001. Foundations of Genetic Algorithms IV, R.K. Belew and M. Vose, [39] D. McAllester and R.E. Schapire, “On the Convergence Rate of eds., 1996. Good-Turing Estimators,” Proc. 13th Ann. Conf. Computer Learning [14] N. Abe and H. Mamitsuka, “Predicting Protein Secondary Theory, pp. 1-6, 2000. Structure Using Stochastic Tree Grammars,” Machine Learning J., [40] M. Mohri, F. Pereira, and M. Riley, “The Design Principles of a vol. 29, pp. 275-301, 1997. Weighted Finite-State Transducer Library,” Theoretical Computer [15] R.C. Carrasco, J. Oncina, and J. Calera-Rubio, “Stochastic Science, vol. 231, pp. 17-32, 2000. Inference of Regular Tree Languages,” Machine Learning J., [41] R. Chaudhuri and S. Rao, “Approximating Grammar Probabil- vol. 44, no. 1, pp. 185-197, 2001. ities: Solution to a Conjecture,” J. Assoc. Computing Machinery, [16] M. Kearns and L. Valiant, “Cryptographic Limitations on vol. 33, no. 4, pp. 702-705, 1986. Learning Boolean Formulae and Finite Automata,” Proc. 21st [42] C.S. Wetherell, “Probabilistic Languages: A Review and Some ACM Symp. Theory of Computing, pp. 433-444, 1989. Open Questions,” Computing Surveys, vol. 12, no. 4, 1980. [17] N. Abe and M. Warmuth, “On the Computational Complexity of [43] Approximating Distributions by Probabilistic Automata,” Machine F. Casacuberta, “Probabilistic Estimation of Stochastic Regular Learning J., vol. 9, pp. 205-260, 1992. Syntax-Directed Translation Schemes,” Proc. Spanish Symp. [18] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R.E. Schapire, and Pattern Recognition and Image Analysis, R. Moreno, ed., pp. 201- L. Sellie, “On the Learnability of Discrete Distributions,” Proc. 25th 297, 1995. Ann. ACM Symp. Theory of Computing, pp. 273-282, 1994. [44] F. Casacuberta, “Maximum Mutual Information and Conditional [19] D. Ron, Y. Singer, and N. Tishby, “On the Learnability and Usage Maximum Likelihood Estimation of Stochastic Regular Syntax- of Acyclic Probabilistic Finite Automata,” Proc. Conf. Learning Directed Translation Schemes,” Proc. Third Int’l Colloquium Theory, pp. 31-40, 1995. Grammatical Inference: Learning Syntax from Sentences, pp. 282-291, [20] A. Stolcke and S. Omohundro, “Inducing Probabilistic Grammars 1996. by Bayesian Model Merging,” Proc. Second Int’l Colloquium [45] D. Pico´ and F. Casacuberta, “A Statistical-Estimation Method for Grammatical Inference and Applications, pp. 106-118, 1994. Stochastic Finite-State Transducers Based on Entropy Measures,” [21] F. 

Enrique Vidal received the Doctor en Ciencias Físicas degree in 1985 from the Universidad de Valencia, Spain. From 1978 to 1986, he was with this university, serving in computer system programming and teaching positions. In the same period, he coordinated a research group in the fields of pattern recognition and automatic speech recognition. In 1986, he joined the Departamento de Sistemas Informáticos y Computación of the Universidad Politécnica de Valencia (UPV), where he served as a full professor of the Facultad de Informática. In 1995, he joined the Instituto Tecnológico de Informática, where he has been coordinating several projects on Pattern Recognition and Machine Translation. He is co-leader of the Pattern Recognition and Human Language Technology group of the UPV. His current fields of interest include statistical and syntactic pattern recognition and their applications to language, speech, and image processing. In these fields, he has published more than 100 papers in journals, conference proceedings, and books. Dr. Vidal is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI), the International Association for Pattern Recognition (IAPR), and the IEEE Computer Society.

Franck Thollard received the master's degree in computer science in 1995 from the University of Montpellier, France, and the PhD degree in computer science from the University of Saint-Etienne in 2000. He worked at the University of Tuebingen (Germany) and at the University of Geneva (Switzerland) under the Learning Computational Grammar European Project. Since 2002, he has been working as a lecturer with the EURISE research team. His current research interests include machine learning and its application to natural language processing (e.g., language modeling and parsing).

Colin de la Higuera received the master and PhD degrees in computer science from the University of Bordeaux, France, in 1985 and 1989, respectively. He worked as a Maître de Conférences (senior lecturer) from 1989 to 1997 at Montpellier University and, since 1997, has been a professor at Saint-Etienne University, where he is director of the EURISE research team. His main research theme is grammatical inference, and he has been serving as chairman of the ICGI (International Community in Grammatical Inference) since 2002.

Francisco Casacuberta received the master and PhD degrees in physics from the University of Valencia, Spain, in 1976 and 1981, respectively. From 1976 to 1979, he worked with the Department of Electricity and Electronics at the University of Valencia as an FPI fellow. From 1980 to 1986, he was with the Computing Center of the University of Valencia. Since 1980, he has been with the Department of Information Systems and Computation of the Polytechnic University of Valencia, first as an associate professor and, since 1990, as a full professor. Since 1981, he has been an active member of a research group in the fields of automatic speech recognition and machine translation (Pattern Recognition and Human Language Technology group). Dr. Casacuberta is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI), which is an affiliate society of the IAPR, the IEEE Computer Society, and the Spanish Association for Artificial Intelligence (AEPIA). His current research interests include syntactic pattern recognition, statistical pattern recognition, machine translation, speech recognition, and machine learning.

Rafael C. Carrasco received a degree in physics from the University of Valencia in 1987, the PhD degree in theoretical physics from the University of Valencia in 1991, and another PhD degree in computer science from the University of Alicante in 1997. In 1992, he joined the Departamento de Lenguajes y Sistemas Informáticos at the University of Alicante as a professor teaching formal languages and automata theory, algorithmics, and markup languages. Since 2002, he has led the technology section of the Miguel de Cervantes digital library (http://www.cervantesvirtual.com). His research interests include grammar induction from stochastic samples, probabilistic automata, recurrent neural networks and rule encoding, markup languages and digital libraries, finite-state methods in automatic translation, and computer simulation of photonuclear reactions.