
An Introduction to Hidden Markov Models

L. R. Rabiner and B. H. Juang

The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.

…appropriate excitation. The easiest way then to address the time-varying nature of the process is to view it as a direct concatenation of these smaller "short time" segments, each such segment being individually represented by a linear system model. In other words, the overall model is a synchronous sequence of symbols where each of the symbols is a linear system model representing a short segment of the process. In a sense this type of approach models the observed signal using representative tokens of the signal itself (or some suitably averaged set of such signals if we have multiple observations).

Time-varying processes

Modeling time-varying processes with the above approach assumes that every such short-time segment of observation is a unit with a prechosen duration. In general, however, there doesn't exist a precise procedure to decide what the unit duration should be so that both the time-invariant assumption holds, and the short-time linear system models (as well as concatenation of the models) are meaningful. In most physical systems, the duration of a short-time segment is determined empirically. In many processes, of course, one would neither expect the properties of the process to change synchronously with every unit analysis duration, nor observe drastic changes from each unit to the next except at certain instances. Making no further assumptions about the relationship between adjacent short-time models, and treating temporal variations, small or large, as "typical" phenomena in the observed signal, are key features in the above direct concatenation technique. This template approach to signal modeling has proven to be quite useful and has been the basis of a wide variety of speech recognition systems.

There are good reasons to suspect, at this point, that the above approach, while useful, may not be the most efficient (in terms of computation, storage, parameters, etc.) technique as far as representation is concerned. Many real world processes seem to manifest a rather sequentially changing behavior; the properties of the process are usually held pretty steadily, except for minor fluctuations, for a certain period of time (or a number of the above-mentioned duration units), and then, at certain instances, change (gradually or rapidly) to another set of properties. The opportunity for more efficient modeling can be exploited if we can first identify these periods of rather steady behavior, and then are willing to assume that the temporal variations within each of these steady periods are, in a sense, statistical. A more efficient representation may then be obtained by using a common short time model for each of the steady, or well-behaved parts of the signal, along with some characterization of how one such period evolves to the next. This is how hidden Markov models (HMM) come about. Clearly, three problems have to be addressed: 1) how these steadily or distinctively behaving periods can be identified, 2) how the "sequentially" evolving nature of these periods can be characterized, and 3) what typical or common short time model should be chosen for each of these periods. Hidden Markov models successfully treat these problems under a probabilistic or statistical framework.

It is thus the purpose of this paper to explain what a hidden Markov model is, why it is appropriate for certain types of problems, and how it can be used in practice. In the next section, we illustrate hidden Markov models via some simple coin toss examples and outline the three fundamental problems associated with the modeling technique. We then discuss how these problems can be solved in Section III. We will not direct our general discussion to any one particular problem, but at the end of this paper we illustrate how HMM's are used via a couple of examples in speech recognition.

DEFINITION OF A HIDDEN MARKOV MODEL

An HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols. We illustrate HMM's with the following coin toss example.

Coin toss example

To understand the concept of the HMM, consider the following simplified example. You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment. The other person will not tell you anything about what he is doing exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden coin tossing experiments is performed, and you only observe the results of the coin tosses, i.e.

O = O_1 O_2 O_3 ⋯ O_T

where H stands for heads and T stands for tails (each O_t is either H or T).

Given the above experiment, the problem is how do we build an HMM to explain the observed sequence of heads and tails. One possible model is shown in Fig. 1a. We call this the "1-fair coin" model. There are two states in the model, but each state is uniquely associated with either heads (state 1) or tails (state 2). Hence this model is not hidden because the observation sequence uniquely defines the state. The model represents a "fair coin" because the probability of generating a head (or a tail) following a head (or a tail) is 0.5; hence there is no bias in the current observation. This is a degenerate example and shows how independent trials, like the tossing of a fair coin, can be interpreted as a set of sequential events. Of course, if the person behind the barrier is, in fact, tossing a single fair coin, this model should explain the outcomes very well.

A second possible HMM for explaining the observed sequence of coin toss outcomes is given in Fig. 1b. We call this model the "2-fair coin" model. There are again 2 states in the model, but neither state is uniquely associated with
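Since Figs. 1a and 1b are not reproduced here, the following sketch (in Python/NumPy, our own illustration rather than anything from the paper) writes the two coin-toss models down as parameter sets (π, A, B). The 1-fair-coin entries follow directly from the description above; the 2-fair-coin transition values are assumptions chosen only to make the example concrete. Symbol 0 stands for heads and symbol 1 for tails.

```python
import numpy as np

# "1-fair coin" model of Fig. 1a: two states, each tied to one outcome
# (state 0 always emits heads, state 1 always emits tails), and every
# transition has probability 0.5.  Nothing is hidden: each observation
# uniquely identifies the state.
pi_1coin = np.array([0.5, 0.5])
A_1coin = np.array([[0.5, 0.5],
                    [0.5, 0.5]])
B_1coin = np.array([[1.0, 0.0],   # state 0: P(heads) = 1
                    [0.0, 1.0]])  # state 1: P(tails) = 1

# "2-fair coin" model of Fig. 1b: two hidden states, each a fair coin,
# so both states emit heads and tails with probability 0.5.  The
# transition values below are illustrative assumptions, since the
# figure is not reproduced here.
pi_2coin = np.array([0.5, 0.5])
A_2coin = np.array([[0.8, 0.2],
                    [0.3, 0.7]])
B_2coin = np.array([[0.5, 0.5],
                    [0.5, 0.5]])
```

Note that with both coins fair, the 2-fair-coin model assigns every length-T sequence probability 2^(-T) regardless of A, so it is statistically indistinguishable from the 1-fair-coin model; the hidden states only begin to matter once the coins are biased.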

…observation probability distributions which, of course, represent random variables or stochastic processes. Using the model, an observation sequence, O = O_1, O_2, …, O_T, is generated as follows:

1) Choose an initial state, i_1, according to the initial state distribution, π.
2) Set t = 1.
3) Choose O_t according to b_{i_t}(k), the symbol probability distribution in state i_t.
4) Choose i_{t+1} according to {a_{i_t i_{t+1}}}, the state-transition probability distribution for state i_t.
5) Set t = t + 1; return to step 3 if t < T; otherwise, terminate the procedure.
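As a minimal sketch (again our own NumPy illustration, with states and symbols as 0-indexed integers rather than the 1-indexed notation of the text), the five steps can be implemented as:

```python
import numpy as np

def generate(pi, A, B, T, rng=None):
    """Sample an observation sequence O_1 .. O_T from a discrete HMM
    (pi, A, B) by the five-step procedure above."""
    if rng is None:
        rng = np.random.default_rng()
    N, M = B.shape                           # N states, M observation symbols
    state = rng.choice(N, p=pi)              # step 1: draw i_1 from pi
    obs = np.empty(T, dtype=int)
    for t in range(T):                       # steps 2 and 5: t = 1, ..., T
        obs[t] = rng.choice(M, p=B[state])   # step 3: draw O_t from b_{i_t}
        state = rng.choice(N, p=A[state])    # step 4: draw i_{t+1} from row i_t of A
    return obs
```

Calling generate(pi_2coin, A_2coin, B_2coin, T=20) with the coin-toss parameters sketched earlier yields a heads/tails sequence; the state sequence is deliberately not returned, mirroring the observer's position behind the barrier.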


The third problem is how we adjust the model parameters λ = (A, B, π) to maximize the probability of the observation sequence given the model. This is the most difficult of the three problems we have discussed. There is no known way to solve for a maximum likelihood model analytically. Therefore an iterative procedure, such as the Baum-Welch method, or gradient techniques for optimization, must be used. Here we will only discuss the iterative procedure. It appears that with this procedure, the physical meaning of various parameter estimates can be easily visualized. To describe how we (re)estimate HMM parameters, we first define ξ_t(i, j) as

ξ_t(i, j) = Pr(i_t = q_i, i_{t+1} = q_j | O, λ)

i.e., the probability of a path being in state q_i at time t and making a transition to state q_j at time t + 1, given the observation sequence and the model. From Fig. 5 it should be clear that we can write ξ_t(i, j) as

ξ_t(i, j) = [α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j)] / Pr(O | λ)

In the above, α_t(i) accounts for the first t observations, ending in state q_i at time t; the term a_{ij} b_j(O_{t+1}) accounts for the transition to state q_j at time t + 1 with the occurrence of symbol O_{t+1}; and the term β_{t+1}(j) accounts for the remainder of the observation sequence.

…durational information is often represented in a normalized form for word models (since the word boundary is essentially known), in the form

p_j(ℓ/T) = probability of being in state j for exactly (ℓ/T) of the word,

where T is the number of frames in the word…

…the dynamic range of Pr(O, I | λ) is usually very large and max_I Pr(O, I | λ) is usually the only significant term in the summation for Pr(O | λ). Therefore, in such cases, either the forward-backward procedure or the Viterbi algorithm works equally well in the word recognition task.
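To make the reestimation concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper, and without the numerical scaling any practical implementation needs) of the forward and backward variables, the ξ_t(i, j) computation above, the Baum-Welch update of the transition matrix that follows from it, and a Viterbi-style scorer for max_I Pr(O, I | λ):

```python
import numpy as np

def forward_backward_xi(pi, A, B, obs):
    """Compute alpha, beta, Pr(O | lambda), and xi_t(i,j) for a
    discrete HMM.  Unscaled sketch: for long sequences alpha and
    beta underflow, and the usual scaling is required in practice."""
    N, T = len(pi), len(obs)

    # Forward variable: alpha_t(i) = Pr(O_1..O_t, state q_i at t | lambda).
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward variable: beta_t(i) = Pr(O_{t+1}..O_T | state q_i at t, lambda).
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    prob_O = alpha[-1].sum()                 # Pr(O | lambda)

    # xi_t(i,j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / Pr(O | lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])
    xi /= prob_O
    return alpha, beta, prob_O, xi

def reestimate_A(xi):
    """Baum-Welch update: a_ij = (expected number of transitions from
    q_i to q_j) / (expected number of transitions out of q_i)."""
    gamma = xi.sum(axis=2)                   # gamma_t(i), for t = 1 .. T-1
    return xi.sum(axis=0) / gamma.sum(axis=0)[:, None]

def viterbi_score(pi, A, B, obs):
    """max over state sequences I of Pr(O, I | lambda), via dynamic
    programming on delta_t(j) = max_i delta_{t-1}(i) a_ij b_j(O_t)."""
    delta = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        delta = (delta[:, None] * A).max(axis=0) * B[:, obs[t]]
    return delta.max()
```

On a word model, comparing viterbi_score(pi, A, B, obs) with the prob_O returned by forward_backward_xi illustrates the observation above: the single best path usually accounts for almost all of Pr(O | λ).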
