
Evolving the Structure of Hidden Markov Models

Kyoung-Jae Won, Adam Prügel-Bennett, and Anders Krogh

Abstract—A genetic algorithm (GA) is proposed for finding the structure of hidden Markov models (HMMs) used for biological sequence analysis. The GA is designed to preserve biologically meaningful building blocks. The search through the space of HMM structures is combined with optimization of the emission and transition probabilities using the classic Baum–Welch algorithm. The system is tested on the problem of finding the promoter and coding region of C. jejuni. The resulting HMM has a superior discrimination ability to a handcrafted model that has been published in the literature.

Index Terms—Biological sequence analysis, genetic algorithm (GA), hidden Markov model (HMM), hybrid algorithm, machine learning.

Manuscript received July 19, 2004; revised November 23, 2004. K.-J. Won and A. Prügel-Bennett are with the School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K. (e-mail: [email protected]). A. Krogh is with the Bioinformatics Centre, University of Copenhagen, Copenhagen 2100, Denmark. Digital Object Identifier 10.1109/TEVC.2005.851271

I. INTRODUCTION

HIDDEN Markov models (HMMs) are probabilistic finite-state machines used to find structures in sequential data. An HMM is defined by the set of states, the transition probabilities between states, and a table of emission probabilities associated with each state for all possible symbols that occur in the sequence. They were developed for use in speech applications, where they remain the dominant machine learning technique [1]. Over the past decade, they have also become an important tool in bioinformatics [2]. Their great attraction is that they allow domain information to be built into their structure while allowing fine details to be learned from the data through adjusting the transition and emission probabilities. The design of HMM architectures has predominately been the province of the domain expert.

Automatic discovery of the HMM structure has some significant attractions. By eliminating the expert, it allows the data to "speak for itself." This opens the possibility of finding completely novel structures, unfettered by theoretical prejudice. In addition, automation of the design process allows many more structures to be tested than is possible if every structure has to be designed by hand. However, automation also comes at a cost. Imposing a structure on the HMM limits the possible outcome (probability distribution for sequences) that can be learned. In the language of machine learning, imposing a structure reduces the learning capacity of the HMM. This has the benefit of reducing the estimation errors for the learned parameters given the limited amount of training data. Provided that the structure being imposed faithfully captures some biological constraints, it should not significantly increase the approximation error. As the generalization error is the sum of the estimation and approximation error, imposing a meaningful structure on the HMM should give good generalization performance. By automatically optimizing the HMM architecture, we are potentially throwing away the advantage over other machine learning techniques, namely their amenability to incorporate biological information in their structure. That is, we risk finding a model that "overfits" the data.

The aim of the research presented in this paper is to utilize the flexibility provided by genetic algorithms (GAs) to gain the advantage of automatic structure discovery, while retaining some of the benefits of a hand-designed architecture. That is, by choosing the representation and genetic operators, we attempt to bias the search toward biologically plausible HMM architectures. In addition, we can incorporate the Baum–Welch algorithm, which is traditionally used to optimize the emission and transition probabilities, as part of the GA. GAs appear to be very well suited to this application. The optimization of HMM architectures is a discrete optimization problem which is easy to implement in a GA. We can simultaneously optimize the continuous probabilities by hybridizing the GA with the Baum–Welch algorithm. Furthermore, GAs allow us to tailor the search operators to bias the search toward biologically plausible structures. This would be far harder to accomplish in a technique such as simulated annealing, where the freedom to choose the move set is often constrained by the wish to maintain detailed balance or some similar condition.

HMMs have received little attention from the evolutionary computing community. We are aware of only two other groups that have used a GA to optimize HMMs, and neither has published in evolutionary computing journals [3], [4] (we discuss the relation of this work to our own in the next section). This is, perhaps, surprising considering the large amount of work on using GAs to optimize the structure of neural networks (see, for example, the review [5]) and more recent work on graphical models [6], [7]. We believe that GAs may be an important tool for evolving HMM architectures. Furthermore, there may be lessons about guiding the search in GAs to be learned from this application.

The rest of this paper is organized as follows. In the next section, we give a brief review of HMMs. We include this section to make the paper self-contained for readers who are not familiar with HMMs. It also serves to define the notation we use later on in this paper. More detailed pedagogical reviews are given in [1] and [2]. In Section III, we describe our GA for optimizing HMMs. Section IV describes the experiments to test our GA. We conclude in Section V.


II. BACKGROUND

A. HMMs

An HMM is a learning machine that assigns a probability $P(x|\Theta)$ to an observed sequence of symbols. A sequence $x = x_1 x_2 \cdots x_L$ consists of symbols belonging to some alphabet $\mathcal{A}$. In biological sequence analysis, the alphabet might be, for example, the set of four possible nucleotides in DNA, "A," "C," "G," and "T," or the set of 20 amino acids that are the building blocks of proteins. We denote the set of parameters that define an HMM by $\Theta$. Given a sequence $x$, an HMM returns a "probability" $P(x|\Theta)$, where

$$\sum_{x \in \mathcal{A}^L} P(x|\Theta) = 1$$

so it is a probability distribution over sequences of length $L$ (we use $\mathcal{A}^L$ to denote the set of all sequences of length $L$). To understand the meaning of this probability, we can imagine some process of interest (e.g., molecular evolution) that generates a set of sequences of length $L$ with probability $Q(x)$. Our aim is to find an HMM such that $P(x|\Theta)$ is as close as possible to $Q(x)$. Of course, we usually do not know $Q(x)$. Rather, we have some training examples consisting of a set of sequences. We can then use the maximum likelihood principle to estimate the HMM, which corresponds to maximizing $P(x|\Theta)$ with respect to $\Theta$ (the probability is known as the likelihood, when considered a function of $\Theta$).

Often, we are interested in how well a sequence fits the model. To do this, we consider the log-odds of a sequence

$$\text{log-odds for sequence } x = \log \frac{P(x|\Theta)}{(1/|\mathcal{A}|)^{L}} \qquad (1)$$

where $|\mathcal{A}|$ denotes the cardinality of the set of symbols $\mathcal{A}$. The log-odds are positive if the sequence is more likely than a random sequence. To use an HMM for classification, we can set a threshold for the log-odds of a new sequence to belong to the same class as the training data.

The HMM is a probabilistic finite-state machine which can be represented as a directed graph in which the nodes correspond to states and the edges correspond to possible transitions between states. A transition probability is associated with each edge, with the constraint that the transition probabilities for edges exiting a node must sum to one. A node may have a transition to itself. In addition, there is an emission probability table associated with each state which encodes the probability of each symbol being "emitted," given that the machine is in that state. We define one state as the start state, which does not emit a symbol but has transitions to the other states. To compute probabilities from our HMM, we consider an "event" to be a path through the graph where we emit a symbol every time we enter a state. The probability of the event is equal to the probability of the path times the probability of emitting that sequence of observed symbols, given that we have taken that path through the finite-state machine. The probability of a sequence is then found by averaging over all possible paths.

To formalize the HMM, we denote the set of states by $\mathcal{S}$, the transition probabilities from state $k$ to $l$ by $a_{kl}$, and the probability of emitting a symbol $b$, given that we are in a state $k$, by $e_k(b)$. Let $\pi = \pi_1 \pi_2 \cdots \pi_L$ be a sequence of states; then the likelihood of a sequence is given by

$$P(x|\Theta) = \sum_{\pi} P(x, \pi | \Theta) \qquad (2)$$

and

$$P(x, \pi | \Theta) = \prod_{i=1}^{L} a_{\pi_{i-1}\pi_i}\, e_{\pi_i}(x_i). \qquad (3)$$

Here, $\pi_0$ denotes the initial state. Naively, the computation of the likelihood seems to grow exponentially with the length of the sequence. However, all the computations we need can be computed efficiently using dynamic programming techniques. We can compute the likelihood using the "forward" algorithm. The forward variable is defined as

$$f_k(i) = P(x_1 \cdots x_i,\ \pi_i = k \,|\, \Theta). \qquad (4)$$

Starting from $f_k(1) = a_{0k}\, e_k(x_1)$, we can find $f_k(i)$ for all states $k$ for successive times $i$ using the recursion

$$f_l(i) = e_l(x_i) \sum_{k} f_k(i-1)\, a_{kl}. \qquad (5)$$

This follows from the Markovian nature of the model. When we have found $f_k(L)$ we can compute the likelihood by marginalizing out the final state

$$P(x|\Theta) = \sum_{k} f_k(L). \qquad (6)$$

Having an absorbing state (with no outgoing transitions) in the model would result in a probability distribution over sequences of all possible lengths, but we use the first formulation in the remainder of the theoretical part of the paper. There exists an analogous backward algorithm that can also be used to compute the likelihood. We define the backward variable to be the probability of matching the sequence $x_{i+1} \cdots x_L$, given that we are in state $k$ at time $i$

$$b_k(i) = P(x_{i+1} \cdots x_L \,|\, \pi_i = k, \Theta). \qquad (7)$$

Again, this can be obtained recursively using

$$b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \qquad (8)$$

with initial condition $b_k(L) = 1$ for all $k$. The likelihood is given by

$$P(x|\Theta) = \sum_{k} a_{0k}\, e_k(x_1)\, b_k(1).$$

More importantly, the backward variable can be used in combination with the forward variable to compute important quantities needed for parameter estimation.
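To make the forward recursion (4)–(6) and the log-odds score (1) concrete, the following is a minimal Python sketch (our own illustration, not the authors' code), assuming sequences are encoded as integer symbol indices. The recursion is unscaled, so for long sequences one would rescale or work in log space to avoid underflow.

```python
import numpy as np

def forward_likelihood(x, a0, A, E):
    """Forward algorithm: the variables f_k(i) of (4)-(6).

    x  : sequence of integer symbol indices, length L
    a0 : (K,)   initial transition probabilities a_{0k}
    A  : (K, K) transition matrix, A[k, l] = a_{kl}
    E  : (K, M) emission matrix, E[k, b] = e_k(b)
    Returns the likelihood P(x | Theta).
    """
    f = a0 * E[:, x[0]]                 # f_k(1) = a_{0k} e_k(x_1)
    for symbol in x[1:]:
        f = E[:, symbol] * (f @ A)      # recursion (5)
    return f.sum()                      # marginalize the final state, (6)

def log_odds(x, a0, A, E, alphabet_size=4):
    """Log-odds (1) against a uniform random-sequence null model."""
    return np.log(forward_likelihood(x, a0, A, E)) + len(x) * np.log(alphabet_size)
```

For example, with DNA encoded as A=0, C=1, G=2, T=3, `log_odds([0, 3, 2], a0, A, E)` scores the sequence "ATG" against a uniform background.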

B. HMM Parameter Estimation

The continuous parameters of the HMM can be estimated to maximize the model likelihood on the given sequences. This is achieved using the Baum–Welch algorithm, which is an example of an expectation maximization (EM) algorithm [8]. Because the state sequences are not directly observable, statistical information is used to adjust the parameters. The model parameters that maximize the likelihood value are calculated iteratively. The update rule of the transition probabilities is

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad (9)$$

where $A_{kl}$ is the number of transitions from state $k$ to state $l$ summed over the sequence, that is

$$A_{kl} = \sum_{i} P(\pi_i = k,\ \pi_{i+1} = l \,|\, x, \Theta). \qquad (10)$$

The update rule of the emission probabilities can be obtained as

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} \qquad (11)$$

where $E_k(b)$ is the number of times the symbol $b$ is emitted when in state $k$. Equations (9) and (11) are self-consistent equations which are satisfied when the likelihood for the training data is locally maximal. We satisfy the equations by iteratively updating $a_{kl}$ and $e_k(b)$ according to the expected values for $A_{kl}$ and $E_k(b)$, which are computed using the forward and backward algorithms.

If we train on a limited amount of data, this estimation of the emission probabilities is liable to overfit the data. For example, we may not have seen some symbol $b$ emitted at time $i$ in our training data. If we build an HMM with a zero probability of emitting symbol $b$ at this time, we would perfectly fit our training data, but we may be reading more into a finite sample of data than we should be. To avoid excessive overfitting, we can add some "pseudocounts," $r_{kl}$ to our estimate for $A_{kl}$ and $s_k(b)$ to our estimate for $E_k(b)$. Our estimated transition and emission probabilities are given by

$$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} \left( A_{kl'} + r_{kl'} \right)} \qquad (12)$$

and

$$e_k(b) = \frac{E_k(b) + s_k(b)}{\sum_{b'} \left( E_k(b') + s_k(b') \right)}. \qquad (13)$$

Although this appears to be an ad hoc fix, it can be motivated from a Bayesian perspective. We can consider the training set to be a sample from a multinomial distribution and assume a Dirichlet distribution for the prior probability; then the pseudocounts drop out as the coefficients of the prior [9].

In the Baum–Welch algorithm, unknown transition and emission frequencies are replaced by their expected values and the parameters re-estimated using the forward and the backward variables. The parameters are updated according to

$$A_{kl} = \frac{1}{P(x|\Theta)} \sum_{i} f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \qquad (14)$$

$$E_k(b) = \frac{1}{P(x|\Theta)} \sum_{i :\, x_i = b} f_k(i)\, b_k(i). \qquad (15)$$

The Baum–Welch algorithm acts as a local search algorithm and is liable to become trapped at a local optimum. For more details on the training of HMMs, see, e.g., [1], [2], and [10].

There are several ways of finding a state sequence in a given observation sequence. The posterior decoding method is to choose the states which are individually most likely

$$\hat{\pi}_i = \arg\max_{k} P(\pi_i = k \,|\, x, \Theta). \qquad (16)$$

From (4) and (7), we get

$$P(\pi_i = k \,|\, x, \Theta) = \frac{f_k(i)\, b_k(i)}{P(x|\Theta)}. \qquad (17)$$

This can be evaluated using the forward and backward algorithms.
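The following sketch shows how the expected counts (14)–(15) and the pseudocount updates (12)–(13) could be combined into a single re-estimation step over a set of training sequences. It is our own unscaled illustration (constant pseudocounts, initial-state probabilities not re-estimated), not the authors' implementation.

```python
import numpy as np

def forward_backward(x, a0, A, E):
    """Unscaled forward and backward variables, recursions (5) and (8)."""
    L, K = len(x), A.shape[0]
    f, b = np.zeros((L, K)), np.ones((L, K))
    f[0] = a0 * E[:, x[0]]
    for i in range(1, L):
        f[i] = E[:, x[i]] * (f[i - 1] @ A)
    for i in range(L - 2, -1, -1):
        b[i] = A @ (E[:, x[i + 1]] * b[i + 1])
    return f, b, f[-1].sum()                         # likelihood from (6)

def baum_welch_step(seqs, a0, A, E, r=1.0, s=1.0):
    """One Baum-Welch re-estimation with constant pseudocounts r and s."""
    K, M = E.shape
    A_count = np.full((K, K), float(r))              # expected A_{kl} plus r_{kl}
    E_count = np.full((K, M), float(s))              # expected E_k(b) plus s_k(b)
    for x in seqs:
        f, b, px = forward_backward(x, a0, A, E)
        for i in range(len(x) - 1):                  # expected transition counts (14)
            A_count += np.outer(f[i], E[:, x[i + 1]] * b[i + 1]) * A / px
        for i, sym in enumerate(x):                  # expected emission counts (15)
            E_count[:, sym] += f[i] * b[i] / px
    new_A = A_count / A_count.sum(axis=1, keepdims=True)   # update (12)
    new_E = E_count / E_count.sum(axis=1, keepdims=True)   # update (13)
    return new_A, new_E
```

In practice the recursions are rescaled (or computed in log space) to avoid underflow, and `baum_welch_step` is iterated until the likelihood stops improving, yielding the local optimum mentioned above.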

Fig. 1. Several types of HMM topologies: (a) ergodic model, (b) self-loop model, and (c) left–right model.

C. Topologies of HMMs

The effectiveness of an HMM depends on the number of states and the structure of the graph connecting the states. Traditionally, prior knowledge about a problem is encoded in this structure. In choosing a structure, there is a tradeoff between simplicity and complexity. A too simple model is unlikely to be able to generate the data set with a high likelihood, while a too complex model will easily learn the data set, but is unlikely to generalize well. The complexity depends on the number of states and the structure. The most complex structures are completely connected graphs, while the simplest are linear chains. Fig. 1 shows several types of HMMs. In these examples, we only allow one symbol to be emitted from each state (in general, one state could emit any symbol from the alphabet, $\mathcal{A}$, with a probability given by the emission probabilities). The fully connected model [Fig. 1(a)] can generate any sequence. The self-loop model of Fig. 1(b) can generate sequences such as AAATTTGCCC, while the left–right model can generate the same sequence as the self-loop model but can also jump states to produce sequences such as AATTCCC.

A common method to decrease the model complexity is to "tie" states so they have the same emission and/or transition probabilities [11]. States are tied if we believe that they model similar parts of a sequence. This reduces the number of free parameters that need to be learned and thus reduces the problem of overfitting. Biological sequences often have approximately the same biological functionality even though they are slightly different in their characters. A way to incorporate this variant feature of the biological string is to tie parameters. Parameter tying is used in many HMM applications such as TMHMM [12].
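As a concrete, purely hypothetical illustration of tying: states can index a shared emission table, and the expected emission counts of tied states are pooled before re-estimation. The state-to-group map and the numbers below are invented for the example.

```python
import numpy as np

# State -> shared emission group; states 1-3 are tied and share one distribution.
tie_group = np.array([0, 1, 1, 1, 2])
shared_E = np.array([            # rows: groups, columns: A, C, G, T
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.10, 0.10, 0.70],
])

def emission(state, symbol):
    """Emission probability e_state(symbol), looked up through the tie map."""
    return shared_E[tie_group[state], symbol]

def reestimate_tied(per_state_counts):
    """Pool expected emission counts of tied states, then normalize as in (11)."""
    pooled = np.zeros_like(shared_E)
    np.add.at(pooled, tie_group, per_state_counts)
    return pooled / pooled.sum(axis=1, keepdims=True)
```

Transition probabilities can be tied in exactly the same way, which is how the tied blocks of Section III behave.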
Fig. 2. HMM represented as a dynamical Bayesian network.

D. Previous Work

There has been relatively little work in estimating the HMM structure without any prior knowledge. In speech recognition, state merging methods have been used to find a compact HMM structure [13]. In this approach, an HMM architecture is built from a large initial model, which is shrunk by merging states. In the bioinformatics field, there has been an attempt to find HMM structures using a local search that inserts or deletes states [14]. They used statistical information about the states to decide the point at which to insert new states. These attempts have not been very successful for designing the architecture for complex problems.

The use of GAs to find an HMM structure has been tried previously in speech recognition [4]. They considered a five-state model and evolved strings which represented the transitions and emissions of an HMM. We are aware of one other group that has used GAs for optimizing HMM structures for biological sequence analysis [3]. An earlier attempt to hybridize GAs with HMMs is described in [15]. In both cases, GAs were used to insert and delete states and swap transitions. They could deal with complex biological sequences, but the models produced are hard to interpret because they allowed unlimited transitions between states. In a previous paper, we attempted to build a humanly interpretable HMM structure using a GA [16]. We used a similar block structure to that described here, but the linking between blocks was very limited. The structures we found were too limited for more general applications.

HMMs are, in fact, a special subclass of graphical models [17]. GAs have been used to find the structure of graphical models (see, for example, [6], [7]). However, the structure being learned in graphical models is the dependency graph. The dependency graph for an HMM is shown in Fig. 2. The structure of the HMM determines the probability model between the states rather than the dependency graph. Nevertheless, both HMMs and graphical models aim to find a graph topology that accurately represents a probabilistic process. Consequently, there may be a crossover of ideas between these two domains.

E. Biological Background

The block models described in this paper are motivated by applications of HMMs in biological sequence analysis. Biological sequences (DNA or protein) often contain "motifs," which are more or less conserved words with more or less homogeneous intervening sequence, which is characterized by the composition of letters (amino acids or nucleotides). The motifs might, for instance, correspond to binding sites for other molecules. Such a sequence can be modeled by an HMM containing submodels for the motifs (linear chains of states) and models for the intervening sequences, each of which can be a single state or multiple emission-tied states, if a length distribution is modeled. Other types of sequences change between various types of homogeneous sequences. An example is membrane proteins, which contain membrane helices 20–30 amino acids long that are dominated by hydrophobic amino acids, and intervening sequence that is typically more hydrophilic [12]. Such sequences can be modeled with a block of tied states, one block for each type of sequence. Sometimes sequences contain periodic patterns. The region of a gene that codes for a protein is made up of codons, which are nucleotide triplets, each of which codes for one amino acid. The first codon is a special start codon (often the three bases ATG) and the last codon (which actually does not code for an amino acid) is a stop codon (TAA, TGA, or TAG). This gives rise to a three-periodic pattern, which can be modeled with a model like the one shown in Fig. 3.

The block HMM has been tested on genomic DNA sequences from Campylobacter jejuni (hereafter C. jejuni). C. jejuni is an important human intestinal pathogen. Despite intensive study, much is still not known about how to control and intervene in the disease [18]. A better understanding of gene organization, function, and regulation in C. jejuni is desirable to provide possible control strategies.

One test is on modeling the coding region of genes as described above, and the other is on modeling the promoter. A promoter is a region of DNA located upstream from the transcription start site of a gene. It contains binding sites for RNA polymerase and regions that regulate the rate of transcription. Predicting the promoter region has been one of the main challenges in bioinformatics because the promoter region regulates the expression of the gene. There is still no general computing algorithm for finding promoters with high precision.

In many bacteria, there is a conserved sequence of TTGACA at around 35 base pairs (bp) before the transcription start site (position "−35") and TAtAAT (where "t" indicates either T or A) at around the −10 region, called the TATA box [19]. A ribosome binding site is located between the transcription start site and the start codon of the gene, and it is typically AAGGA. In the promoter region of C. jejuni, the TTGACA region is weakly conserved, but a T-rich domain is common upstream of the TATA box [20]. The C. jejuni genome contains more Ts and As than Gs and Cs. Although the ribosomal binding sites can often be spotted, the sequence is not always in the same position in relation to the coding region. In some cases, the sequence AAGGA is highly mutated. It is difficult for nonexperts to figure out which part of the sequence is the TATA box because most of the promoter region is composed of Ts and As.

Petersen et al. suggested an HMM architecture for this promoter region [21]. Their model includes the TATA box (TAtAAT) and a ribosomal binding site (AAGGA). During the testing of various models, a periodic pattern was discovered upstream of the TATA box, and this was included in the model. This model is shown in Fig. 4. The forward transitions are used to represent sequences with varying length. The states in the background region are tied together to represent nonspecific sequences with a small number of states.

Fig. 3. Simple HMM for the coding region of a gene.

Fig. 4. Model for predicting the promoter region of C. jejuni, adapted from [21] and modified. The original structure was designed using biological knowledge.

Fig. 5. HMM blocks that compose the whole HMM structure: (a) linear block, (b) self-loop block (tying is optional), and (c) forward-jump block (tying is optional).

Fig. 6. Example of an HMM composed of blocks (block HMM). Three blocks are used in the model and all blocks are fully connected to each other.

III. BLOCK HMM

A. Biological Block Model

To constrain the search of HMM topologies to biologically meaningful structures, we represent the HMM structure as a number of blocks. The blocks we consider are one of three basic structures that are frequently used in biological sequence analysis. These are: linear, self-loop, and forward blocks. The self-loop and forward blocks can be either tied (we follow the convention of shading tied blocks) or untied. That is, all the emission and transition probabilities are set equal. A forward block can have transitions from the first state to the later states in the chain. Examples of these block structures are shown in Fig. 5. The linear block can model a particular pattern of a sequence, the so-called conserved regions. The self-loop block can be used to model a sequence of any length, while the forward-jump block can be used to represent varying-length subsequences up to some fixed length. Tying can be switched on and off in the self-loop and forward blocks.

The blocks are fully linked together to form the whole HMM architecture. This is illustrated in Fig. 6. The last state of a block has transitions to the first states of all the blocks. The resulting HMM can be thought of as a fully connected graph consisting of "super vertices" made up of blocks whose internal states are not fully connected. We call this structure a block HMM. Special patterns like periodic signals can be generated with a path between blocks.

B. Genetic Operators for Block HMM

Each block is represented by a pair. The first element defines the length of the block, while the second element gives the type (a, b, or c corresponding to linear, self-loop, or forward-jump block, respectively). The type also specifies whether the states are tied or untied (t or u) and, in the case of forward blocks, the number of forward connections. The full HMM is represented as a string of pairs as shown in Fig. 7. For example, the HMM in Fig. 6 would be represented by "((3,a),(2,bu),(3,ct1))." As the blocks are equivalent in their connectivity, there is no information in the ordering of the blocks.

In crossover, two parent strings are chosen at random. Some number of randomly chosen blocks are then swapped to create two children. The children then replace the parents. When we swap blocks, the transition probabilities leaving the block are kept unchanged. Since the position of the blocks does not carry any meaning, we do not impose any constraint on which blocks are swapped. The number of blocks is kept fixed. However, as the blocks can have variable lengths, the number of states is not fixed. We also allow blocks consisting of no states (zero blocks), which effectively allows us to have a variable number of blocks up to some maximum. The evolution of a variable-size structure is similar to the situation common in many applications of genetic programming. This scheme in a way emulates natural evolution, which can cross over DNA sequences with different lengths. We chose a generational GA as a way to present the DNA sequences crossed over as a block.
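To make the block encoding concrete, the sketch below is our own reading of the representation (helper names are ours, not the paper's). It expands a list of (length, type) pairs such as [(3, 'a'), (2, 'bu'), (3, 'ct1')] into a 0/1 transition mask: states are chained inside each block according to the block type, and the last state of every block is connected to the first state of every block.

```python
import numpy as np

def block_hmm_mask(blocks):
    """Allowed-transition mask for a block HMM given (length, type) pairs.
    Types: 'a' linear, 'b?' self-loop, 'c?N' forward-jump with N jumps;
    the tied/untied flag affects parameter sharing only, not connectivity."""
    n_states = sum(length for length, _ in blocks)
    mask = np.zeros((n_states, n_states), dtype=int)
    firsts, lasts, offset = [], [], 0
    for length, kind in blocks:
        if length == 0:                          # zero block: contributes no states
            continue
        idx = list(range(offset, offset + length))
        firsts.append(idx[0])
        lasts.append(idx[-1])
        for i in range(length - 1):              # chain within the block
            mask[idx[i], idx[i + 1]] = 1
        if kind.startswith('b'):                 # self-loop on every state
            for i in idx:
                mask[i, i] = 1
        if kind.startswith('c'):                 # forward jumps from the first state
            n_jumps = int(kind[2:]) if len(kind) > 2 else 1
            for j in range(1, n_jumps + 1):
                mask[idx[0], idx[min(1 + j, length - 1)]] = 1
        offset += length
    for last in lasts:                           # blocks fully connected:
        for first in firsts:                     # last state -> first state of every block
            mask[last, first] = 1
    return mask

example = block_hmm_mask([(3, 'a'), (2, 'bu'), (3, 'ct1')])   # the model of Fig. 6
```

The actual transition and emission probabilities on the allowed edges would then be trained by Baum–Welch, as described in Section II-B.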

Fig. 7. String representation of the block HMM. Information on the lengths and types of blocks is stored.

Fig. 8. Crossover in the block HMM. Crossover swaps HMM states without breaking the properties of the HMM blocks.

Fig. 8 shows an example of the crossover scheme. The last block of the first child crosses with the first block of the second child. To simplify the diagram, transitions between blocks are not shown here. Under the crossover scheme, the properties of the interpretable blocks are not broken. This allows us to exchange meaningful blocks without causing too much disruption.

Mutations can take place in any block of the HMM. There is a variety of different mutations that we allow for. Mutations can change the length of a block. For a forward-jump block, mutations can change the number of transitions. For example, in the case of a four-state forward-jump block, there are six different types of mutations possible. These are illustrated in Fig. 9. The mutation can add [Fig. 9(a)] or delete [Fig. 9(b)] a transition inside a block. To prevent losing the property of the block, deletion of a transition is not allowed when there are only two transitions (to the second state and the last state). In the same way, adding a transition does not take place when the first state of the block has transitions to all the other states. In Fig. 9(c) and (d), a state is deleted from a block. The outcome depends on which state is deleted. In Fig. 9(e) and (f), a state is added to the block. Again, the outcome depends on which state is added. In the case of linear and self-loop blocks, there is only one way to add and delete a state. These six different types of mutation supply the block HMM with sufficient variation without changing the properties of the block.

In addition to changing the length of the block and the transitions, we also allow another form of mutation, called type mutation, that changes the type of the block. For self-loop and forward-jump blocks, we can mutate between tied and untied versions. We can also mutate the type altogether. Mutations to a block of zero length are also allowed.
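A rough sketch of these operators on the tuple representation follows. The function and constant names are ours, and the forward-jump transition-count mutation is omitted for brevity.

```python
import random

# A genome is a list of (length, type) blocks, e.g. [(3, 'a'), (2, 'bu'), (3, 'ct1')].
BLOCK_TYPES = ['a', 'bu', 'bt', 'cu1', 'ct1']

def crossover(parent1, parent2, n_swap=1):
    """Swap randomly chosen blocks between two parents; block positions carry
    no meaning, so any pair of positions may be exchanged."""
    child1, child2 = list(parent1), list(parent2)
    for _ in range(n_swap):
        i, j = random.randrange(len(child1)), random.randrange(len(child2))
        child1[i], child2[j] = child2[j], child1[i]
    return child1, child2

def mutate_length(genome):
    """Grow or shrink one randomly chosen block by one state (down to zero)."""
    g = list(genome)
    i = random.randrange(len(g))
    length, kind = g[i]
    g[i] = (max(0, length + random.choice([-1, 1])), kind)
    return g

def mutate_type(genome):
    """Type mutation: change the type of one block, e.g. tied <-> untied."""
    g = list(genome)
    i = random.randrange(len(g))
    length, _ = g[i]
    g[i] = (length, random.choice(BLOCK_TYPES))
    return g
```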

TABLE I BLOCK HMM PARAMETERS USED IN EXPERIMENT

Fig. 9. Six possible types of mutations for a four-state forward-jump block.

Fig. 10. Result of the block HMM with two blocks: (a) (ATG)+ and (b) (AAGATGAGGACG)+.

C. Fitness Evaluation

Very complex models are likely to fit the data well but to generalize badly. We, therefore, need a way to gauge the generalization ability. To achieve this, we split the training data into two; one half is used as a Baum–Welch training set and the other half as a fitness evaluation set. We use the training set to find the transition and emission probabilities using the Baum–Welch algorithm and the evaluation set to measure the fitnesses used in selecting members of the population. Although we could potentially overfit on the evaluation set, this does not seem to be a problem in practice. One explanation for this is that overly complex models will overfit the training data when we run the Baum–Welch algorithm. They then perform badly on the fitness evaluation set. Because we use the Baum–Welch algorithm at every iteration, overly complex models will always be disadvantaged.

We take as fitness values the reciprocal of the negative log-likelihood

$$F_n = \frac{1}{-\log P(\text{data} \,|\, \Theta_n)} \qquad (18)$$

where $n$ indicates the different members of the population. A member of the population is selected with a Boltzmann probability

$$p_n = \frac{\exp\!\left(\beta F_n / \sigma_F\right)}{\sum_{m} \exp\!\left(\beta F_m / \sigma_F\right)} \qquad (19)$$

where $\sigma_F$ is the standard deviation of the distribution of fitnesses in the population. The parameter $\beta$ controls the selection strength. Stochastic universal sampling is used to reduce genetic drift in selection [22].
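A small sketch of (18), (19), and stochastic universal sampling [22] is given below. The exact normalization of (19) is not fully recoverable from the text, so this follows the usual Boltzmann-selection form, with the fitness scaled by its population standard deviation and $\beta$ the selection strength.

```python
import numpy as np

def fitness(neg_log_likelihoods):
    """Fitness (18): reciprocal of the negative log-likelihood on the evaluation set."""
    return 1.0 / np.asarray(neg_log_likelihoods, dtype=float)

def boltzmann_probabilities(F, beta=1.0):
    """Selection probabilities in the spirit of (19). Subtracting the mean does
    not change the probabilities; it only improves numerical stability."""
    z = beta * (F - F.mean()) / (F.std() + 1e-12)
    weights = np.exp(z)
    return weights / weights.sum()

def stochastic_universal_sampling(p, n):
    """Draw n indices with evenly spaced pointers to reduce genetic drift [22]."""
    cumulative = np.cumsum(p)
    start = np.random.uniform(0.0, 1.0 / n)
    pointers = start + np.arange(n) / n
    return np.searchsorted(cumulative, pointers)
```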

D. Training Procedure

To test our HMM, we split our data into a training and a test set. The training data was further split into a Baum–Welch training set and an evaluation set, as described above. The number of blocks is chosen at the beginning of the training and kept fixed. The length and type of blocks was randomly chosen so that there were on average the same number of linear, self-loop, forward-jump, and zero blocks.

At each iteration of the block HMM, we applied the Baum–Welch training algorithm to all individuals. We then evaluated the fitness of the population and performed selection (with selection strength $\beta$), mutation, and crossover. We continued until there was no significant change in the structure of the HMM.

Fig. 11. Behavior of the block HMM GA shown as a function of the number of iterations for four runs.

IV. EXPERIMENT ON BLOCK HMM

A. Experiment With Artificial Data

To test the performance of the block HMM, we conducted three experiments with artificial data. The first two experiments were to find an HMM to represent data generated from the languages (ATG)+ and (AAGATGAGGACG)+, where "+" means any number of repetitions. We used a population composed of HMMs with two blocks. Table I shows the parameters used in these toy experiments.
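The overall hybrid loop of this training procedure can be summarized as below. This skeleton is ours, not the authors' code: train_fn and score_fn stand in for a Baum–Welch training routine on the training half and a negative log-likelihood evaluation on the held-out half, and the operators are those sketched earlier.

```python
import random
import numpy as np

def evolve_block_hmm(init_population, train_fn, score_fn,
                     crossover, mutate_length, mutate_type,
                     select, n_iterations=100):
    """Generational GA hybridized with Baum-Welch (Section III-D sketch).
    train_fn(genome) -> trained model; score_fn(model) -> negative log-likelihood;
    select(F, n) -> n selected indices (e.g. Boltzmann + SUS)."""
    population = list(init_population)
    for _ in range(n_iterations):
        models = [train_fn(genome) for genome in population]      # local optimization
        F = 1.0 / np.array([score_fn(m) for m in models])         # fitness (18)
        population = [population[i] for i in select(F, len(population))]
        i, j = random.sample(range(len(population)), 2)           # one crossover
        population[i], population[j] = crossover(population[i], population[j])
        k = random.randrange(len(population))                     # one length/transition mutation
        population[k] = mutate_length(population[k])
        k = random.randrange(len(population))                     # one type mutation
        population[k] = mutate_type(population[k])
    return population
```

Stopping when the structure no longer changes, as described above, can be added by comparing the best genome across iterations instead of using a fixed n_iterations.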

Fig. 12. Result of the block HMM for (AAGATGAGGACG)+(ATGC)+: (a) with two blocks, (b) with three blocks (case 1), (c) with three blocks (case 2), and (d) with three blocks (case 3).

At each iteration, we randomly chose two individuals and performed crossover to create two children, which replace the two parent strings; we also chose one individual in which we mutated either the length or the number of transitions of a randomly chosen block, and one individual in which we mutated the type of a randomly chosen block (in later experiments we performed more crossovers and mutations at each iteration). For this simple problem, we did not allow tying. The solutions to these two problems found by the GA are illustrated in Fig. 10. The resulting HMMs are reasonable solutions to the problem, although probably not those that a human would have come up with.

The third test we carried out was to find an HMM to recognize the language (AAGATGAGGACG)+(ATGC)+. The first half of the sequence is the same as that used in the previous experiment, while the second half is a repetition of four symbols. The number of iterations used in this experiment is 600. Fig. 11 shows a graph of the maximum log-likelihood value versus the iteration for four different runs of the GA. In case 3 of the three-block model, the GA has still not found a particularly good HMM after 600 iterations.

The best solutions found by the GA among the four runs are shown in Fig. 12. Note that we have not shown transitions which the Baum–Welch algorithm has driven to zero. In the first run [Fig. 12(a)], we initiated the GA with two blocks, while in the next three runs [Fig. 12(b)–(d)] we used a three-block model.

These toy examples show that a GA is capable of finding reasonable structures that can be readily interpreted by humans. However, the models are not the simplest that could solve the problem, nor do they always have an optimal structure. Nevertheless, they demonstrate that the GA can find reasonably parsimonious solutions which give a good approximation of the true likelihood. To test the performance of the block HMM on more complex sequences, we used biological sequences in the following experiment.

B. Coding Region Model of C. jejuni

To investigate the block HMM's ability to find an HMM structure for biological sequences, we performed an experiment with 200 sequences from the coding regions of C. jejuni. The sequence data comprise a start codon (ATG) at the beginning of the sequence, some number of codons, and a stop codon at the end. The length of these sequences ranges from around 200 to 2000. Fig. 3 shows a typical HMM that could be used to represent the coding region [23]. Of the 200 sequences available, 150 sequences are used for training and 50 sequences for evaluation. From looking at the data alone, it is almost impossible for nonspecialists to see that the data consist of codons.

Fig. 13. Result of the block HMM. The GA found loops of three states.

TABLE II. BLOCK HMM PARAMETERS USED IN THE EXPERIMENT WITH BIOLOGICAL SEQUENCES

We conducted the experiment twice, once using four blocks and once with three blocks. The initial block lengths range between three and seven. The GA parameters used in the simulations are shown in Table II.

Fig. 13 shows the resulting structures of the block HMM found using the GA. In Fig. 13(a), the second state of the first block is not used. In Fig. 13(b), the emission probabilities of the states between 7 and 15 are tied. Although these are not the most parsimonious solutions, they do seem to have identified the triplet nature of the sequences.

C. Promoter Model of C. jejuni

The block HMM was also applied to the promoter of C. jejuni and compared to the results in [21]. We tested whether the block HMM could find the biologically meaningful regions without any prior knowledge. The GA parameters are the same as those used in the previous experiment. Of the 175 sequences available, 132 sequences are used for Baum–Welch training and 43 sequences are used for the fitness evaluation. We conducted simulations using nine, eight, and seven blocks.

The best structures found by the GA are shown in Fig. 14. The block HMM could find the "AAGGA" and "TAtAAT" regions with nine blocks [Fig. 14(a)] and eight blocks [Fig. 14(b)]. In addition, it found the presence of a semi-conserved TGx upstream of the TATA box, which is characteristic of the C. jejuni promoter region [21]. However, when the number of blocks is seven [Fig. 14(c)], the block HMM could find only the AAGGA sequence. The TATA box was buried inside other states. Interestingly, the nine-block model [Fig. 14(a)] also found the ten-base periodicity just before the TATA box, which was discovered in [21]. In Fig. 14, the emission probabilities of the states in the shaded cells are tied.

To perform a quantitative test of the generalization performance of the structures found in the previous experiment, we conducted a discrimination test. In order to collect a sufficient amount of data, we did a fivefold cross-validation experiment. This replicates a test performed by Petersen et al. [21]. As the structures generated in the previous test (shown in Fig. 14) were found using all the training data, we retrained all the emission and transition probabilities starting from random values. However, to retain the main structure, we removed all transition probabilities less than 0.002; to maintain the positions of the TATA box and ribosomal binding site, we introduced pseudocounts in these conserved regions. For example, if the emission probabilities of one state are A:0.9, T:0.1, G:0.05, C:0.05, then the pseudocounts become 90, 10, 5, and 5 for A, T, G, and C, respectively. This ensures that these conserved regions occur in the same place when we retrain the model in the cross-validation procedure. We used the Baum–Welch algorithm to retrain the weights using 140 out of the 175 sequences as training data. The remaining 35 sequences were used as test data.

To perform a discrimination test, we require a background sequence. To obtain this, we use a 500,000 bp sequence generated by a third-order Markov chain that had been trained on the C. jejuni genome. To test the discrimination, we set a log-odds threshold so that there were ten or fewer false positives and then measured the number of true and false positives. Table III shows the total number of false positives (FP) and true positives (TP) summed over all of the fivefold cross-validation tests.

We repeated this test on ten different HMM structures found by using a GA. In all but one case, we obtained approximately seven to nine false positives and a true-positive rate ranging from 70% to 76%. Only in one simulation out of the ten did the GA fail to find a TATA box. These results are superior to those published for a handcrafted HMM [21], although we must be slightly cautious in interpreting our results, as the structures were trained using the same data that they were tested on. We were forced to do this because the dataset was small. We did take the precaution of retraining all the weights to reduce the risk of biasing our results, but the pseudocounts were influenced by the whole data set. Although the results are not conclusive, they are suggestive.

V. CONCLUSION

In this paper, we have described a GA which evolves the structure of HMMs in a biologically constrained way. We have performed a number of tests to illustrate that the resulting HMMs are relatively easy to interpret. Furthermore, on the problem of the promoter model of C. jejuni, the results are competitive with an expert-designed HMM.
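A sketch of how such a discrimination test can be scored is given below (our own outline, with hypothetical names): estimate a third-order Markov background from the genome, score candidate sequences by the log-odds of the HMM against this background, and pick the threshold from the background scores so that at most a fixed number of false positives remain.

```python
import numpy as np
from collections import defaultdict

def third_order_markov(genome, alphabet='ACGT'):
    """Estimate P(x_i | x_{i-3}..x_{i-1}) with +1 pseudocounts."""
    index = {c: i for i, c in enumerate(alphabet)}
    counts = defaultdict(lambda: np.ones(len(alphabet)))
    for i in range(3, len(genome)):
        counts[genome[i - 3:i]][index[genome[i]]] += 1
    return {ctx: c / c.sum() for ctx, c in counts.items()}, index

def background_log_prob(seq, model, index):
    """Log-probability of seq under the third-order background model."""
    return sum(np.log(model[seq[i - 3:i]][index[seq[i]]])
               for i in range(3, len(seq)) if seq[i - 3:i] in model)

def threshold_for_false_positives(pos_log_odds, neg_log_odds, max_fp=10):
    """Choose a log-odds threshold allowing at most max_fp background hits,
    and report the resulting true/false positive counts."""
    neg_sorted = np.sort(neg_log_odds)[::-1]
    cut = neg_sorted[min(max_fp, len(neg_sorted)) - 1]
    tp = int(np.sum(np.asarray(pos_log_odds) > cut))
    fp = int(np.sum(np.asarray(neg_log_odds) > cut))
    return cut, tp, fp
```

Here pos_log_odds and neg_log_odds would be the HMM log-likelihoods of promoter and background windows minus their background log-probabilities.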

Fig. 14. Best structures found by the GA for (a) nine, (b) eight, and (c) seven blocks. The "AAGGA" sequence is found in every simulation and the "TAtAAT" sequence is found in (a) and (b). Each cell represents a state. Emissions of shaded cells are tied.

TABLE III. RESULT OF THE BLOCK HMM

This is quite remarkable given that our GA had no prior knowledge of conserved regions such as the TATA box and ribosome binding site, as well as other structures which have been acquired by experts over many years. These are promising results from early work. There is a lot more that can be investigated. For example, in order to reduce overfitting, we have split our training data into a Baum–Welch training set and an evaluation set used in selecting members of the population. However, we could still overfit the evaluation set. One method to control this is to include a regularization term in the fitness that punishes more complex models. Such a scheme has been used by Yada et al. [3]. However, in early experiments we found that this involved setting a parameter, which was difficult to do with the limited amount of data. Nevertheless, this may be worth revisiting, particularly as we try to evolve more complex models. Another method to regulate the HMM model complexity is to use the minimum description length (MDL) principle [24]. An MDL-based scoring metric can be used to evaluate the fitness of the model, balancing the complexity of the HMM with the degree of accuracy. This is a possible future area of study.

The blocks introduced in this paper were adopted from frequently used HMM topologies. Besides these three blocks, other topologies have been constructed in bioinformatics depending on the application [2], and other possible block models could be considered. However, the three blocks used in this paper have proven to be sufficiently powerful in this application. We believe that they are likely to be able to model many different types of sequences.

Unrestricted crossover is likely to be very disruptive in this application. To prevent this, we only allow crossover to occur between blocks. Since the blocks are connected to every other block, the structure is much less disrupted. Within a block, the structure is evolved using mutation. This is analogous to the evolution of DNA in nature, where crossover can exchange genes of different lengths.

Evolving HMM structures is a time-consuming process since training and evaluating the likelihoods for long sequences takes many operations. At present, this is a weakness for automatic structure finding. However, as computers get faster and the number of applications grows with the exponential rise in the amount of sequencing data, we expect that automatic HMM structure search will become increasingly important. GAs seem to be ideally placed to play an important part in this development.

Data Availability: The biological data used in the tests discussed in this paper can be obtained from [25]. In addition, this site contains all the details of the HMM structures found by the GA.

ACKNOWLEDGMENT

The authors would like to thank L. Petersen for her help with the sequence data and the model.

REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[2] R. M. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[3] T. Yada, M. Ishikawa, H. Tanaka, and K. Asai, "DNA sequence analysis using hidden Markov model and genetic algorithm," Genome Inform., vol. 5, pp. 178–179, 1994.
[4] C. W. Chaw, S. Kwong, C. K. Diu, and W. R. Fahrner, "Optimization of HMM by a genetic algorithm," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1997, pp. 1727–1730.
[5] X. Yao, "Evolving artificial neural networks," Proc. IEEE, vol. 87, no. 9, pp. 1423–1447, Sep. 1999.
[6] I. Poli and A. Roverato, "A genetic algorithm for graphical model selection," J. Italian Statist. Soc., vol. 2, pp. 197–208, 1998.
[7] I. Inza, P. Larrañaga, and B. Sierra, "Feature subset selection by Bayesian networks: A comparison with genetic and sequential algorithms," Int. J. Approximate Reasoning, vol. 27, no. 2, pp. 143–164, 2001.
[8] A. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Roy. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
[9] K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian, and D. Haussler, "Dirichlet mixture: A method for improved detection of weak but significant protein sequence homology," Proc. Comput. Appl. Biol. Sci. (CABIOS), vol. 12, pp. 327–345, 1996.
[10] A. Krogh and S. K. Riis, "Hidden neural networks," Neural Computation, vol. 11, pp. 541–563, 1999.
[11] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 179–190, 1983.
[12] A. Krogh, B. Larsson, G. von Heijne, and E. Sonnhammer, "Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes," J. Molecular Biol., vol. 305, no. 3, pp. 567–580, 2003.
[13] A. Stolcke, "Bayesian learning of probabilistic language models," Ph.D. dissertation, Univ. California, Berkeley, CA, 1994.
[14] Y. Fujiwara, M. Asogawa, and A. Konagaya, "Motif extraction using an improved iterative duplication method for HMM topology learning," in Proc. Pacific Symp. Biocomputing, 1995, pp. 713–714.
[15] K.-J. Won, A. Prügel-Bennett, and A. Krogh, "Training HMM structure with genetic algorithm for biological sequence analysis," Bioinformatics, vol. 20, pp. 3613–3619, 2004.
[16] ——, "The block hidden Markov model for biological sequence analysis," Lecture Notes Artificial Intell., vol. 3213, pp. 64–70, 2004.
[17] M. I. Jordan, Learning in Graphical Models. Cambridge, MA: MIT Press, 1999.
[18] C. R. Friedman, J. Neimann, H. C. Wegener, and R. V. Tauxe, Epidemiology of Campylobacter jejuni Infection in the United States and Other Industrialized Nations. Washington, DC: ASM, 2000, pp. 121–139.
[19] J. M. Berg, J. L. Tymoczko, and L. Stryer, Biochemistry, 5th ed. San Francisco, CA: Freeman, 2002.
[20] M. M. Wosten, M. Boeve, M. G. Koot, A. C. van Nuene, and B. A. van der Zeijst, "Identification of Campylobacter jejuni promoter sequence," J. Bacteriol., vol. 180, pp. 594–599, 1998.
[21] L. Petersen, T. S. Larsen, D. W. Ussery, S. L. W. On, and A. Krogh, "RpoD promoters in Campylobacter jejuni exhibit a strong periodic signal instead of a −35 box," J. Molecular Biol., vol. 326, no. 5, pp. 1361–1372, 2003.
[22] J. E. Baker, "Reducing bias and inefficiency in the selection algorithm," in Proc. 2nd Int. Conf. Genetic Algorithms, Hillsdale, NJ, 1987, pp. 14–21.
[23] A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences, ser. Computational Methods in Molecular Biology. Amsterdam, The Netherlands: Elsevier, 1998, ch. 4, pp. 45–63.
[24] J. Rissanen, Stochastic Complexity in Statistical Inquiry. New York: World Scientific, 1999.
[25] http://www.ecs.soton.ac.uk/~kjw02r/data.htm

Kyoung-Jae Won received the B.S. and M.S. degrees in electronics from Chung-Ang University, Korea, in 1996 and 1998, respectively. Currently, he is working towards the Ph.D. degree at the School of Electronics and Computer Science, Southampton University, Southampton, U.K.
Under the volunteer program of the Korean government, he worked at the College of Science, National University of Hanoi, Vietnam, from 1999 to 2001. His research interests include the areas of bioinformatics, hidden Markov models, and evolutionary algorithms.

Adam Prügel-Bennett received the B.Sc. degree in physics from Southampton University, Southampton, U.K., and the Ph.D. degree in theoretical physics from Edinburgh University, Edinburgh, U.K.
He worked in research jobs in Oxford, U.K., Paris, France, Manchester, U.K., Copenhagen, Denmark, and Dresden, Germany, before finally returning to the School of Electronics and Computer Science, Southampton University, as a Senior Lecturer. His research interests include evolutionary algorithms, machine learning, and computer vision.

Anders Krogh received the Ph.D. degree in theoretical physics from the Niels Bohr Institute, Copenhagen, Denmark, in 1991.
He worked at the University of California, the Sanger Centre, Cambridge, U.K., the Center for Biological Sequence Analysis, Technical University of Denmark, and other places before joining the University of Copenhagen in 2002, where he heads the Bioinformatics Centre. Originally, he studied neural networks theoretically, but he is now mainly working on probabilistic methods for biological sequence analysis, including the application of hidden Markov models in gene finding and protein modeling.