
From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Protein Modeling with Hybrid Hidden Markov Model/Neural Network Architectures

Pierre Baldi
Division of Biology and JPL
California Institute of Technology
Pasadena, CA 91125
[email protected]
(818) 354-9038, (818) 393-5013 FAX

Yves Chauvin
Net-ID, Inc.
601 Minnesota
San Francisco, CA 94107
[email protected]
(415) 647-9402, (415) 647-2758 FAX

Abstract

Hidden Markov Models (HMMs) are useful in a number of tasks in computational molecular biology, and in particular to model and align protein families. We argue that HMMs are somewhat optimal within a certain modeling hierarchy. Single first order HMMs, however, have two potential limitations: a large number of unstructured parameters, and a built-in inability to deal with long-range dependencies. Hybrid HMM/Neural Network (NN) architectures attempt to overcome these limitations. In hybrid HMM/NN, the HMM parameters are computed by a NN. This provides a reparametrization that allows for flexible control of model complexity, and incorporation of constraints. The approach is tested on the immunoglobulin family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs. To capture dependencies, however, one must resort to a larger hybrid model class, where the data is modeled by multiple HMMs. The parameters of the HMMs, and their modulation as a function of input or context, is again calculated by a NN.

Introduction

Many problems in computational molecular biology can be cast in terms of statistical and formal languages ((Searls 1992)). While sequence data is increasingly abundant, the underlying biological phenomena are still often poorly understood. This creates a favorable situation for approaches where grammars are learnt from the data. In particular, Hidden Markov Models (HMMs), which are equivalent to stochastic regular grammars, and the associated learning algorithms, have been extensively used to model protein families and DNA coding regions ((Baldi et al. 1994), (Krogh et al. 1994), (Baldi & Chauvin 1994a), (Baldi et al. 1995), (Krogh, Mian, & Haussler 1994)). Likewise, Stochastic Context Free Grammars (SCFGs) have been used to model RNA ((Sakakibara et al. 1994)).

Although very powerful, HMMs in molecular biology have at least two severe limitations: (1) a large number of unstructured parameters; (2) a built-in inability to handle certain dependencies. In the case of protein families, for instance, a typical HMM has several thousand free parameters. In the early stages of genome projects, the number of sequences available for training in a given family is very variable and can range from 0, or just a few, for the least known families, to a thousand or so. Proteins also fold into complex 3D shapes, essential to their function. Subtle long range dependencies in their polypeptide chains exist that are not immediately apparent in the primary sequences alone. These cannot be captured by a simple first order Markov process.

Here, we develop a new class of models and learning algorithms to circumvent these problems, in a context of protein modeling. We call these models hybrid HMM/Neural Network (HMM/NN) architectures(1). The same ideas can of course be applied to other domains and are introduced, in a more general setting, in ((Baldi & Chauvin 1995)). There are two basic ideas behind HMM/NN architectures. The first is to calculate the parameters of a HMM, via one or several NNs, in order to control the structure and complexity of the model. The second is to model the data with several HMMs, rather than a single model, and to use the previous NNs to shift among models, as a function of input or context, in order to capture dependencies. The main focus of this paper is on demonstrating the first idea. The second idea is briefly discussed at the end, and related simulations are currently in progress.

In the next section, we review HMMs and how they have been applied to protein families. In section 3, we discuss some of the limitations and optimality of such models. In section 4, we introduce a simple class of HMM/NN hybrid architectures with the corresponding learning algorithms. In section 5, we present simulation results using the immunoglobulin family. More general HMM/NN architectures are discussed at the end, together with other extensions and related work.

(1) HMM/NN architectures were first described at a NIPS workshop (Vail, CO) and at the International Symposium on Fifth Generation Computer Systems (Tokyo, Japan), both in December 1994.

[Figure 1: Example of HMM architecture used in protein modeling. S is the start state, E the end state. di, mi and ii denote delete, main and insert states respectively.]

HMMs of Protein Families

A first order discrete HMM can be viewed as a stochastic production system defined by a set of states S, an alphabet A of m symbols, a probability transition matrix T = (t_ij), and a probability emission matrix E = (e_iX). The system randomly evolves from state to state, while randomly emitting symbols from the alphabet. When the system is in a given state i, it has a probability t_ij of moving to state j, and a probability e_iX of emitting symbol X. As in the application of HMMs to speech recognition, a family of proteins can be seen as a set of different utterances of the same word, generated by a common underlying HMM with a left-right architecture (Fig. 1). The alphabet has m = 20 symbols, one for each amino acid (m = 4 for DNA or RNA models, one symbol per nucleotide). In addition to the start and end state, there are three classes of states: the main states, the delete states and the insert states, with S = {start, m1, ..., mN, i1, ..., iN+1, d1, ..., dN+1, end}. N is the length of the model, typically equal to the average length of the sequences in the family. The delete states are mute. The linear sequence of state transitions start -> m1 -> m2 -> ... -> mN -> end is the backbone of the model. The self-loop on the insert states allows for multiple insertions.

Given a sample of K training sequences O1, ..., OK from a protein family, the parameters of a HMM can be iteratively modified to optimize the data fit, according to some measure, usually based on the likelihood of the data according to the model. Since the sequences can be considered as independent, the overall likelihood is equal to the product of the individual likelihoods. Two target functions commonly used for training are the negative log-likelihood:

Q = - Σ_{k=1..K} Q_k = - Σ_{k=1..K} ln P(O_k),   (2.1)

and the negative log-likelihood based on the optimal paths:

Q = - Σ_{k=1..K} Q_k = - Σ_{k=1..K} ln P(π(O_k)),   (2.2)

where π(O) is the most likely HMM production path for sequence O. π(O) can be computed efficiently by dynamic programming. Depending on the situation, the Viterbi path approach can be considered as an approximation to the full maximum likelihood, or as an algorithm in its own right. This is the case in protein modeling where, as described below, the optimal paths play a particularly important role. When priors on the parameters are included, one can also add regulariser terms to the objective functions for MAP (Maximum A Posteriori) estimation. Different algorithms are available for HMM training, including the classical Baum-Welch or EM (Expectation-Maximization) algorithm, and different forms of gradient descent and other GEM (Generalized EM) algorithms ((Dempster, Laird, & Rubin 1977), (Rabiner 1989), (Baldi & Chauvin 1994b)).

Regardless of the training method, once a HMM has been successfully trained on a family of primary sequences, it constitutes a model of the entire family and can be used in a number of different tasks. First, for any given sequence, we can compute its likelihood according to the model, and also its most likely path. A multiple alignment results immediately from aligning all the optimal paths of the sequences in the family. The model can also be used for discrimination tests and data base searches ((Krogh et al. 1994), (Baldi & Chauvin 1994a)), by comparing the likelihood of any sequence to the likelihoods of the sequences in the family. So far, HMMs have been successfully applied to several protein families including globins, immunoglobulins, kinases, G-protein-coupled receptors (GPCRs), EF hand, aspartic acid proteases and HIV membrane proteins. In all these cases, the HMM models have been able to perform well on all previous tasks yielding, for instance, multiple alignments that are comparable to those derived by human experts and published in the literature.
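
For concreteness, here is a sketch (not from the paper) of how the optimal path π(O) and the path-based negative log-likelihood of (2.2) can be computed by dynamic programming for a plain first order HMM. The function name viterbi and the array layout are assumptions, and the silent delete states of the Fig. 1 architecture would need extra handling that is omitted here.

```python
import numpy as np

def viterbi(obs, T, E, start):
    """Most likely state path for an observation sequence under a first
    order HMM with transition matrix T[i, j], emission matrix E[i, x]
    and initial distribution start[i].  Works in log space to avoid
    numerical underflow on long sequences."""
    logT, logE = np.log(T), np.log(E)
    n_states, L = T.shape[0], len(obs)
    delta = np.full((L, n_states), -np.inf)   # best log-score ending in each state
    psi = np.zeros((L, n_states), dtype=int)  # back-pointers
    delta[0] = np.log(start) + logE[:, obs[0]]
    for t in range(1, L):
        scores = delta[t - 1][:, None] + logT  # scores[i, j]: move from state i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logE[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(L - 1, 0, -1):              # trace the back-pointers
        path.append(int(psi[t, path[-1]]))
    # second return value is -ln P(pi(O)), the per-sequence term in (2.2)
    return path[::-1], float(-delta[-1].max())
```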

Limitations and Optimality of HMMs

In spite of their success, HMMs for biological sequences have two weaknesses. First, they have a large number of unstructured parameters. In the case of protein models, the architecture of Fig. 1 has a total of approximately 49N parameters (40N emission parameters and 9N transition parameters). For a typical protein family, N is of the order of a few hundreds, resulting immediately in models with over 10,000 parameters. This can be a problem, especially in situations where only a few sequences are available for training. It should be noted, however, that a single sequence should not be counted as a single training example. Each letter, and each succession of letters, in the sequence should be considered as a "training example" for the HMM parameters. Thus a typical sequence provides of the order of 2N constraints, and 25 sequences or so provide a number of examples in the same range as the number of HMM parameters.

In our experience, we have derived good multiple alignments with sometimes as little as 35 sequences in a family. We also conjecture, although this has not been tested, that a HMM as in Fig. 1, trained with only two sequences and the proper regularisation, should be able to yield optimal pairwise alignments. More generally, it has been noticed several times in the connectionist literature that certain models are "well adapted" to certain tasks, precisely in the sense that little overfitting is observed, even with an unfavorable ratio of number of parameters to number of training examples. We believe that HMMs are well adapted for protein modeling and, in some sense to be discussed below, they are optimal. But the problem remains that some improvement should be possible in situations where little training data is available. Furthermore, the HMM parameters have no structure, and no explicit relations among them. The hybrid HMM/NN architectures described in the next section provide a solution to these problems.

A second limitation of first order HMMs is their inability to deal with long range dependencies. Because proteins have complex 3 dimensional shapes and long range interactions between their residues, it may seem surprising that good models can be derived using simple first order Markov processes. One partial explanation for this is that HMMs can capture those effects of long range interactions that manifest themselves in a more or less constant fashion across a family of sequences. For instance, suppose that, as a result of a particular folding structure, two distant regions of a protein have a predominantly hydrophobic composition. Then this pattern is present in all the members of the family, and will be learnable by a HMM. On the other hand, a variable long range interaction such as "a residue X at position i implies a residue f(X) at position j" cannot be captured by a first order HMM, as soon as f is sufficiently complex. [Note that a HMM is still capable of capturing certain variable long range interactions. For instance, assume that the sequences in the family have either a fixed residue X at position i, and a corresponding Y at position j, or a fixed X' at position i with a corresponding Y' at position j. Then these 2 sub-classes of sequences in the family could be associated with 2 types of paths in the HMM where, for instance, X - Y are emitted from main states and X' - Y' are emitted from insert states.] Although these dependencies are important, their effects do not seem to have hampered the HMM approach. Indeed, consider for instance the standard case of a data base search with a HMM trained on a protein family. To fool the model, sequences would have to exist having all the same first order properties associated with all the emission vectors (i.e. the right statistical composition at each position) similar to the sequences in the family, but with a X' - Y or X - Y' association, rather than a X - Y or X' - Y'. This is highly unlikely, and current experimental evidence shows that HMM performance in sequence data base mining is excellent.

A slightly different point of view is to consider, for a given protein family, a hierarchy of models. All the models have the same structure, and are entirely defined by a fixed emission distribution vector at each position. These models can also be seen as factorial distributions over the space of sequences. The first model one can derive is when the emission vector is constant at each position, and uniform. This yields a model for uniform sequences of amino acids, and is therefore a very poor model for any specific family. A slightly better model is when the emission vector is constant at all positions, but equal to the average amino acid composition of the family. This is the standard model that is used for comparison in many statistical discrimination tests. Any good model must fare well against this one. Finally, the best model within this class is the one where the emission vectors at each position are fixed, but different from one position to the other, and equal to the optimal column composition of a good multiple alignment. This latter model is essentially equivalent to a good multiple alignment, since one can be derived from the other. Likewise, it is also essentially equivalent to a well trained HMM, since a good alignment can be derived from the HMM, and vice versa, the parameters of the HMM can be estimated from a good alignment. Therefore, in this sense, HMMs are optimal within this limited hierarchy of models that assign a fixed distribution vector at each position. It is impossible to capture dependencies of the form X-Y and X'-Y' within this hierarchy, since this would require, in its more general form, variable emission vectors at the corresponding positions, together with a mechanism to link them in the proper way.

We now turn to the HMM/NN hybrid architectures and how they can address, in their most simple form, the problems of parameter complexity and structure. We will come back to the problems of long range dependencies at the end.

HMM/NN Architectures

In ((Baldi & Chauvin 1994b)), it was noticed that a useful reparametrization of the HMM parameters consists of

t_ij = exp(w_ij) / Σ_k exp(w_ik)   and   e_iX = exp(v_iX) / Σ_Y exp(v_iY)   (4.1)

with w_ij and v_iX as the new set of variables. This reparametrization has two advantages: (1) modification of the w's and v's automatically preserves the normalisation constraints on the original emission and transition probability distributions; (2) transition and emission probabilities can never reach the absorbing value 0.
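
A minimal numerical illustration of the reparametrization (4.1), assuming nothing beyond standard NumPy (the array sizes and variable names are arbitrary): each row of unconstrained variables is pushed through a normalized exponential, so any update of the w's and v's automatically yields valid, strictly positive transition and emission distributions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=axis, keepdims=True)

# Unconstrained variables w (transitions) and v (emissions); the actual
# HMM parameters are recovered through normalized exponentials as in (4.1).
n_states, n_targets, alphabet = 10, 3, 20
rng = np.random.default_rng(0)
w = rng.normal(size=(n_states, n_targets))   # w_ij
v = rng.normal(size=(n_states, alphabet))    # v_iX

t = softmax(w)   # each row sums to 1 and stays strictly positive
e = softmax(v)   # same for the emission distributions

# Gradient steps on w and v keep t and e valid probability distributions.
assert np.allclose(t.sum(axis=1), 1.0) and np.allclose(e.sum(axis=1), 1.0)
```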

In the reparametrisation of (4.1), we can consider that each one of the HMM parameters is calculated by a small NN, with one on/off input, no hidden layers, and 20 softmax (or normalized exponential) output units for the emissions (resp. 3 for the transitions) (Fig. 2a). The connections between the input and the outputs are the v_iX. This can be generalized immediately by having arbitrarily complex NNs for the computation of the HMM parameters. The NNs associated with different states can also be linked with one or several common hidden layers. In general, we can consider that there is one global NN connecting all the HMM states to their parameters. The architecture of the network should be dictated by the problem at hand. In the case of a discrete alphabet however, such as for proteins, the emission of each state is a multinomial distribution, and therefore the output of the corresponding network should consist of M softmax units. For simplicity, in the rest of this article we discuss emission parameters only, but the approach extends immediately to transition parameters as well. As a concrete example, consider the hybrid HMM/NN architecture of Fig. 2b consisting of:

1. Input layer: one unit for each state i. At each time, all units are off, except one which is on. If unit i is set to 1, the network computes e_iX, the emission distribution of state i.

2. Hidden layer: H hidden units indexed by h (H < M), each with transfer function f_h (logistic by default) and bias b_h.

3. Output layer: M softmax units or weighted exponentials, indexed by X, with bias b_X.

4. Connections: α = (α_hi) connects input position i to hidden unit h. β = (β_Xh) connects hidden unit h to output unit X.

For input i, the activity of hidden unit h in the hidden layer is given by

f_h(α_hi + b_h).   (4.2)

The corresponding activity in the output layer is

e_iX = exp(Σ_h β_Xh f_h(α_hi + b_h) + b_X) / Σ_Y exp(Σ_h β_Yh f_h(α_hi + b_h) + b_Y).   (4.3)
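
The forward pass of the Fig. 2b network can be sketched as follows (a hedged illustration, not the authors' code; the helper name emissions, the toy sizes and the zero biases are assumptions). It evaluates (4.2) and (4.3) for all states at once, and prints the rough parameter count H(N + M) plus biases against the N·M emission parameters of the unstructured HMM, the comparison made in the remarks below.

```python
import numpy as np

def emissions(alpha, b_h, beta, b_x):
    """Emission distributions e_iX of the Fig. 2b network for every state.
    alpha: (H, N) input-to-hidden weights, b_h: (H,) hidden biases,
    beta:  (M, H) hidden-to-output weights, b_x: (M,) output biases.
    With a one-hot input for state i, the hidden activity is the logistic
    of alpha[:, i] + b_h as in (4.2), and the output is the softmax of
    beta @ hidden + b_x as in (4.3)."""
    hidden = 1.0 / (1.0 + np.exp(-(alpha + b_h[:, None])))   # (H, N)
    logits = beta @ hidden + b_x[:, None]                    # (M, N)
    logits -= logits.max(axis=0, keepdims=True)
    e = np.exp(logits)
    return (e / e.sum(axis=0, keepdims=True)).T              # (N, M)

# Toy sizes: N states, M = 20 amino acids, H hidden units.
N, M, H = 117, 20, 2
rng = np.random.default_rng(1)
e = emissions(rng.normal(size=(H, N)), np.zeros(H),
              np.ones((M, H)), np.zeros(M))
# Parameter count: H*(N + M) + H + M weights and biases, versus the
# N*M emission parameters of the unstructured HMM of Fig. 1.
print(e.shape, H * (N + M) + H + M, N * M)
```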

[Figure 2: (a) Schematic representation of the simple HMM/NN hybrid architecture used in (4.1). Each HMM state has its own independent NN. Here, the NNs are extremely simple, with no hidden layer, and an output layer of softmax units that compute the HMM emission and transition parameters. Only output emissions are represented for simplicity. (b) Schematic representation of a HMM/NN architecture where the NNs associated with different states (or different groups of states) are connected via one or several hidden layers.]

A number of points are worth noticing:

- The HMM states can be partitioned into different groups, with different networks for different groups. In the limit of one network per state, with no hidden layers (or with H = M hidden units), one obtains the architecture used in (Baldi et al. 1994), where the HMM parameters are reparametrised as normalised exponentials (Fig. 2a). One can use different NNs for insert states and for main states, or for different groups of states along the protein sequence corresponding for instance to different regions (hydrophobic, hydrophilic, alpha-helical, etc.) if these are known.

- HMM parameter reduction can easily be achieved using small hidden layers with H hidden units, and H small compared to N or M. In the example of Fig. 2b, with H hidden units and considering only main states, the number of parameters is H(N + M) in the HMM/NN architecture, versus NM in the corresponding simple HMM. For protein models, this yields roughly HN parameters for the HMM/NN architecture, versus 20N for the simple HMM.

- The number of parameters can be adaptively adjusted to variable training set sizes, merely by changing the number of hidden units. This is useful in environments with large variations in data base sizes, as in current molecular biology applications. The total number of protein families is believed to be on the order of a thousand. One can envision building a library of HMMs, one model per family, and update the library as the data bases grow.

- Because the number of parameters can be significantly reduced, training of hybrid architectures, along the lines described below, is also faster in general.

- The entire bag of well-known connectionist tricks can be brought to bear on these architectures, including: higher order networks, radial basis functions and other transfer functions, multiple hidden layers, sparse connectivity, weight sharing, weight decay, gaussian and other priors, hyperparameters and regularization, to name only the most commonly used. Many sensible initializations and structures can be implemented in a flexible way. For instance, by allocating different numbers of hidden units to different subsets of emissions or transitions, it is easy to favor certain classes of paths in the models, when needed. For instance, in the HMM of Fig. 1, one must in general introduce a bias favoring main states over insert states, prior to any learning. It is easy also to tie different regions of a protein that may have similar properties by weight sharing, and other types of long range correlations, if these are known in advance.

- By setting the output bias to the proper values, the model can be initialized to the average composition of the training sequences, or any other useful distribution.

- Classical prior information in the form of substitution matrices is also easily incorporated. Substitution matrices (for instance (Altschul 1991)) can be computed from data bases, and essentially produce a background probability matrix P = (P_XY), where P_XY is the probability that X be changed into Y over a certain evolutionary time. P can be implemented as a linear transformation in the emission NN.

- Finally, by looking at the structure of the weights and the activity of the hidden units, it may be possible to detect certain patterns in the data.

With hybrid HMM/NN architectures, in general the M step of the EM algorithm cannot be carried analytically. We have derived two simple GEM training algorithms for HMM/NN architectures, both essentially gradient descent on the target functions (2.1) and (2.2), along the lines discussed in ((Baldi & Chauvin 1994b)). They can easily be modified to accommodate different target functions, such as MAP optimisation with inclusion of priors. In these learning algorithms, the HMM dynamic programming and the NN back-propagation are intimately interleaved, and learning can be on-line or off-line. Here we give the on-line equations (batch equations can be derived similarly) for one of the algorithms (detailed derivations can be found in (Baldi & Chauvin 1995)). For each sequence O, and for each state i on the Viterbi path π = π(O), the Viterbi on-line learning equations are given by

Δβ_Xh = η (T_iX - e_iX) f_h(α_hi + b_h)
Δb_X = η (T_iX - e_iX)
Δα_hj = δ_ij η f'_h(α_hi + b_h) Σ_Y β_Yh (T_iY - e_iY)   (4.4)

for (i, X) ∈ π(O), with T_iX = 1, and T_iY = 0 for Y ≠ X.
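
One plausible realization of (4.4) in code (the names, and the analogous hidden-bias update, are assumptions not spelled out in the paper): for a position of the Viterbi path where state i emitted symbol X, the emission network receives the one-hot input for state i and its weights are nudged by the softmax error T_i - e_i.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def viterbi_online_step(alpha, b_h, beta, b_x, i, X, eta=0.1):
    """One on-line update in the spirit of (4.4): state i lies on the
    Viterbi path of the current sequence and emitted symbol X there.
    Shapes as in the earlier sketch: alpha (H, N), beta (M, H)."""
    h = sigmoid(alpha[:, i] + b_h)                  # hidden activities, as in (4.2)
    logits = beta @ h + b_x
    e_i = np.exp(logits - logits.max())
    e_i /= e_i.sum()                                # emission vector e_i, as in (4.3)
    target = np.zeros_like(e_i)
    target[X] = 1.0                                 # T_iX = 1, T_iY = 0 otherwise
    err = target - e_i
    grad_h = (beta.T @ err) * h * (1.0 - h)         # back-propagated error, f' = f(1-f)
    beta += eta * np.outer(err, h)                  # Δβ_Xh = η (T_iX - e_iX) f_h
    b_x += eta * err                                # Δb_X  = η (T_iX - e_iX)
    alpha[:, i] += eta * grad_h                     # Δα_hi; only column i is touched
    b_h += eta * grad_h                             # hidden biases, updated analogously
```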

[Figure 3: Multiple alignment of 20 immunoglobulin V region sequences from the training (T) and validation (V) sets, identified by their PIR codes (F37262, B27563, C30560, GIHUDW, S09711, B36006, F36005, A36194, A31485, D33548, AVMSJ5, D30560, SI1239, GIMSAA, I27888, PL0118, PL0122, A33989, A30502, PH0097), shown in three blocks of positions.]

Simulation Results

Here we demonstrate a simple application of the principles behind HMM/NN hybrid architectures on the immunoglobulin protein family. Immunoglobulins, or antibodies, are proteins produced by B cells that bind with specificity to foreign antigens in order to neutralize them, or target their destruction by other effector cells. The various classes of immunoglobulins are defined by pairs of light and heavy chains that are held together principally by disulphide bonds. Each light and heavy chain molecule contains one variable (V) region, and one (light) or several (heavy) constant regions. The V regions differ among immunoglobulins and provide the specificity of the antigen recognition. About one third of the amino acids of the V regions form the hypervariable sites, responsible for the great diversity characteristic of the vertebrate immune response. Our data base is the same as the one used in ((Baldi et al. 1994)) and consists of human and mouse heavy chain immunoglobulin V region sequences from the Protein Identification Resources (PIR) data base. It corresponds to 224 sequences, with minimum length 90, average length 117, and maximum length 254.

For the immunoglobulin V regions, our original results ((Baldi et al. 1994)) were obtained by training a simple HMM, similar to the one in Fig. 1, that contained a total of 52N + 23 = 6107 adjustable parameters. Here we train a hybrid HMM/NN architecture with the following characteristics. The basic model is a HMM with the architecture of Fig. 1. All the main state emissions are calculated by a common NN with 2 hidden units. Likewise, all the insert state emissions are calculated by a common NN with one hidden unit only. Each state transition distribution is calculated by a different softmax network (normalized exponential reparametrization), as in our previous work. So the total number of parameters of this HMM/NN architecture, neglecting edge effects, is 1507 (roughly 117 x 3 x 3 = 1053 for the transitions, and 117 x 3 + 3 + 3 x 20 + 40 = 454 for the emissions, including biases). This architecture is not at all optimised: for instance, we suspect we could have significantly reduced the number of transition parameters. Our goal at this time is not to find the best possible HMM/NN architecture, but to demonstrate the general principles, and test the learning algorithm. We have also trained a number of similar hybrid architectures with a larger number of hidden units, up to four, both for the main and insert states. Here we report only the results derived with the smallest architecture.

The hybrid architecture is then trained on line using (4.4), and the same training set as in our original experiments. There the emission and transition parameters were initialized uniformly. Here we initialize all the weights from the input to the hidden layer with independent gaussians, with mean 0 and standard deviation 1. All the weights from the hidden to the output layer are initialized to 1. This yields a uniform emission probability distribution on all the emitting states(2). Notice also that if all the weights are initialized to 1, including those from input to hidden layer, then the hidden units cannot differentiate from each other. The transition probability parameters out of the insert or delete states are initialized uniformly to 1/3. We introduce, however, a small bias along the backbone, in the form of a Dirichlet prior (see (Krogh et al. 1994), (Baldi et al. 1995)) that favors main to main transitions. This prior is equivalent to introducing a regularisation term in the objective function, equal to the logarithm of the backbone path. The regularisation constant is set to 0.01, and the learning rate to 0.1.

(2) With Viterbi learning, this is probably better than a non-uniform initialization, such as the average composition. A non-uniform initialization may introduce distortions in the Viterbi paths.
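
The initialization just described can be written down directly. The following sketch is an assumption-laden illustration (the bias values and names are not specified in the paper); it checks that gaussian input-to-hidden weights combined with all-ones hidden-to-output weights indeed give a uniform emission distribution at every state.

```python
import numpy as np

def init_hybrid(N=117, M=20, H=2, seed=0):
    """Initialization following the scheme described above: input-to-hidden
    weights drawn from independent standard gaussians, hidden-to-output
    weights set to 1; biases are assumed to start at zero."""
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal((H, N))   # input -> hidden, N(0, 1)
    b_h = np.zeros(H)
    beta = np.ones((M, H))                # hidden -> output, all ones
    b_x = np.zeros(M)
    return alpha, b_h, beta, b_x

alpha, b_h, beta, b_x = init_hybrid()
# With beta all equal, every output unit receives the same net input for any
# state, so the softmax emission distribution starts out uniform (1/20).
h = 1.0 / (1.0 + np.exp(-(alpha + b_h[:, None])))
logits = beta @ h + b_x[:, None]
e = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
assert np.allclose(e, 1.0 / 20)
```
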
In Fig. 3, we display the multiple alignment of 20 immunoglobulin sequences, selected randomly from both the training (T) and validation (V) sets, after 10 epochs. The multiple alignment is very stable between 5 and 10 epochs. Lower case letters correspond to emissions from insert states. This alignment is far from perfect, but roughly comparable to the multiple alignment previously derived with a simple HMM having more than four times as many parameters. The algorithm has been able to detect all the main regions of highly conserved residues. Most importantly, the cysteine residues (C) towards the beginning and the end of the region, which are responsible for the disulphide bonds that hold the chains, are perfectly aligned. The only exception is the last sequence (PH0097), which has a serine (S) residue in its terminal portion. This is a rare but recognized exception to the conservation of this position. Some of the sequences in the family have a "header" (transport signal peptide) whereas the others do not. We did not remove the headers prior to training. The model is capable of detecting and accommodating these headers by treating them as initial inserts, as can be seen from the alignment of three of the sequences. This multiple alignment, however, contains a number of problems related to the overuse of gaps and insert states, especially in the hypervariable regions, for instance at positions 30-35 and 50-55. These problems should be eliminated with a more careful selection of hybrid architecture. In ((Baldi & Chauvin 1995)), we display the activity of the two hidden units associated with each main state. For most states, at least one of the activities is saturated. The activities associated with the cysteine residues responsible for the disulphide bridges (main states 22 and 94) are all saturated, and in the same corner (-1,+1).

Discussion

The concept of hybrid HMM/NN architecture has been demonstrated, by providing a simple model of the immunoglobulin family. Furthermore, integrated learning algorithms have been described where the HMM dynamic programming and the NN backpropagation are intimately interwoven. The specific architecture used in the simulations is by no means optimised for the task of protein modeling. Our intention here is only to demonstrate the principles, and test the soundness of the learning algorithms. The architecture we have described, and its many possible variations, solve the problem of having a large number of unstructured parameters. The NN component of the hybrid architecture calculates the HMM parameters. This component can be tailored to accommodate all kinds of constraints and priors, in a very flexible way.

A HMM defines a probability distribution over the space of all possible sequences. Only a very small fraction of distributions can be realized by reasonable size HMMs(3). HMMs, or the equivalent multiple alignments, essentially generate the manifold of factorial distributions. In this sense, a HMM already provides a compact representation of a distribution over the space of all possible sequences. A given family of proteins defines also a distribution D over the space of all possible amino acid chains. Thus our problem can also be viewed as an attempt to approximate D with a factorial distribution F. A properly trained HMM defines a close to optimal factorial approximation F. We have seen that for many practical purposes, and in particular for data base mining, we can expect factorial approximations to perform very well. HMM/NN architectures provide a powerful means for further refinements and compression of the HMM parametrisation and the flexible incorporation of constraints. Since part of the problem is compressing and extracting information, one could also view or complement the present approach in terms of the MDL (Minimum Description Length) principle. In particular, costs for the NN hidden units could also be introduced in the objective function. But no matter how complex the NN reparametrization and the objective function, the basic probabilistic model for the data remains so far a single HMM.

(3) Any distribution can be represented by a single exponential size HMM, with a start state connected to different sequences of deterministic states, one for each possible alphabet sequence, with a transition probability equal to the probability of the sequence itself.

There may exist situations, however, in molecular biology or other domains, where one must deal with subtle dependencies in the data, such as the "X - Y / X' - Y'" situation. Such correlations cannot be captured by a single HMM(4). Therefore, to capture variable dependencies one must resort to a larger class of models. An obvious candidate is higher order Markov models, but unfortunately these become rapidly untractable. If one is to stay close to the first order HMM formalism, then to handle the present simple example one needs four emission vectors, instead of two as in a single HMM: a vector with a high probability of emitting X (resp. Y) and a vector with a high probability of emitting X' (resp. Y'), at position i (resp. position j). In addition, one needs a mechanism to link these emission vectors in the proper way as a function of context, or input. Thus the data must be modeled by multiple HMMs, or by a single HMM that can be modulated as a function of context. Again the parameters of the multiple HMMs, or of the modulated HMM, can be calculated by a NN, giving rise to a more general class of hybrid HMM architectures.

(4) They are also related to the problems of classification and self-organisation since in this example there are clearly two distinct sub-families of sequences.

In these more general hybrid architectures, one input stream into the NN component originates from the HMM states, as in the previous sections. There is however also a second stream representing the input or context. The choice of input or context can assume many different forms and is problem dependent. In some cases, it can be equal to the entire current observation sequence O. The NN component must then gate the HMM parameters and decide whether they should be in the X - Y class or the X' - Y' class. Local connectivity in the NN can also ensure that only local context be taken into consideration. Other inputs are however possible, over different alphabets. An obvious candidate in protein modeling tasks would be the secondary structure of the protein (alpha helices, beta sheets and coils). Continuous input (and/or output) alphabets are also possible, as in ((MacKay 1994)) where a small vector of real numbers is used to reparametrize the manifold of distributions over all possible sequences.

Mixture of experts ideas, as in ((Jacobs et al. 1991)), can also be used to design such architectures in different ways. Different HMM experts can be assigned to different portions of the data. For instance, in a protein family modeling task, a different expert can be assigned to each sub-class within the family. Another possibility is when the emission vector of any state results from a convex linear combination of basic emission expert vectors, gated by the NN. Again in the protein modeling task, there could be an expert for hydrophobic regions, one for hydrophilic regions and so on. The proper mixture or gating among experts would again be calculated by the NN component of the architecture. In many cases, the learning algorithms described here can be applied directly to these more general HMM/NN hybrid architectures. After all, one can still calculate the likelihood of a given sequence and then differentiate with respect to the NN parameters. In some cases, one may have to add some form of competitive learning among experts. A more detailed analysis of the most general HMM/NN hybrid architectures is given in ((Baldi & Chauvin 1995)).
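
A minimal sketch of the convex-combination idea (all names, shapes and the choice of context encoding are assumptions; the paper does not specify an implementation): a small gating network turns a context vector into mixture weights over a fixed set of expert emission distributions, so the gated emission vector is itself a valid distribution.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=axis, keepdims=True)

def gated_emission(experts, gate_w, context):
    """Convex linear combination of expert emission vectors gated by a NN.
    experts: (n_experts, M) matrix of expert emission distributions,
    gate_w:  (n_experts, d) gating weights,
    context: d-dimensional input (e.g. an encoding of local sequence or
    secondary structure)."""
    mixture = softmax(gate_w @ context)   # convex weights over the experts
    return mixture @ experts              # still a distribution over the M symbols

M, d = 20, 8
rng = np.random.default_rng(2)
experts = softmax(rng.normal(size=(2, M)))   # e.g. hydrophobic / hydrophilic experts
e = gated_emission(experts, rng.normal(size=(2, d)), rng.normal(size=d))
assert np.isclose(e.sum(), 1.0)
```
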
The ideas presented here are of course not limited to HMMs, or to protein or DNA modeling. They can be viewed in a more general framework of hierarchical modeling, where first a parametrised probabilistic model is constructed for the data, and then the parameters of the model are calculated, and possibly modulated, as a function of input or context, by one or several other NNs (or any other flexible reparametrisation). It is well known, for instance, that HMMs are equivalent to stochastic regular grammars. The next level in the language hierarchy is stochastic context free grammars (SCFGs). One can then immediately introduce hybrid SCFG/NN architectures. It would be interesting to extend the results in ((Sakakibara et al. 1994)) using hybrid SCFG/NN. Finding optimal architectures for molecular biology applications and other domains, and developing a better understanding of how probabilistic models should be NN-modulated, as a function of input or context, are some of the current challenges for hybrid approaches.

Acknowledgement

The work of PB is supported by grants from the ONR, the AFOSR, and a Lew Allen award at JPL. The work of YC is supported by grant number R43 LM05780 from the National Library of Medicine. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the National Library of Medicine.

References

Altschul, S. 1991. Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology 219:1-11.

Baldi, P., and Chauvin, Y. 1994a. Hidden Markov models of the G-protein-coupled receptor family. Journal of Computational Biology 1(4):311-335.

Baldi, P., and Chauvin, Y. 1994b. Smooth on-line learning algorithms for hidden Markov models. Neural Computation 6(2):305-316.

Baldi, P., and Chauvin, Y. 1995. Hierarchical hybrid modeling, HMM/NN architectures, and protein applications. Submitted.

Baldi, P.; Chauvin, Y.; Hunkapiller, T.; and McClure, M. 1994. Hidden Markov models of biological primary sequence information. PNAS USA 91(3):1059-1063.

Baldi, P.; Brunak, S.; Chauvin, Y.; Engelbrecht, J.; and Krogh, A. 1995. Hidden Markov models for human genes. Caltech Technical Report.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39:1-22.

Jacobs, R.; Jordan, M.; Nowlan, S.; and Hinton, G. 1991. Adaptive mixtures of local experts. Neural Computation 3:79-87.

Krogh, A.; Brown, M.; Mian, I. S.; Sjolander, K.; and Haussler, D. 1994. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology 235:1501-1531.

Krogh, A.; Mian, I. S.; and Haussler, D. 1994. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research 22:4768-4778.

MacKay, D. 1994. Bayesian neural networks and density networks. In Proceedings of the Workshop on Neutron Scattering Data Analysis and Proceedings of the 1994 MaxEnt Conference, Cambridge (UK).

Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257-286.

Sakakibara, Y.; Brown, M.; Hughey, R.; Mian, I. S.; Sjolander, K.; Underwood, R. C.; and Haussler, D. 1994. The application of stochastic context-free grammars to folding, aligning and modeling homologous RNA sequences. UCSC Technical Report UCSC-CRL-94-14.

Searls, D. B. 1992. The linguistics of DNA. American Scientist 80:579-591.
