April 6, 2005 8:23 WSPC/185-JBCB 00107

Journal of Bioinformatics and Computational Biology
Vol. 3, No. 2 (2005) 491–526
© Imperial College Press

HIDDEN MARKOV MODELS, GRAMMARS, AND BIOLOGY: A TUTORIAL

SHIBAJI MUKHERJEE
Association for Studies in Computational Biology
Kolkata 700 018, India
[email protected]

SUSHMITA MITRA∗
Machine Intelligence Unit, Indian Statistical Institute
Kolkata 700 108, India
[email protected]

∗Corresponding author.

Received 23 April 2004
1st Revision 2 September 2004
2nd Revision 20 December 2004
3rd Revision 5 January 2005
Accepted 6 January 2005

Biological sequences and structures have been modelled using various machine learning techniques and abstract mathematical concepts. This article surveys methods that use Hidden Markov Models and functional grammars for this purpose. We provide a formal introduction to Hidden Markov Models and grammars, emphasizing a comprehensive mathematical description of the methods and their natural continuity. The basic algorithms and their application to analyzing biological sequences and modelling structures of bio-molecules like proteins and nucleic acids are discussed. The different approaches are compared, and possible areas of work and problems are highlighted. Related databases and software available on the internet are also mentioned.

Keywords: Computational biology; machine learning; Hidden Markov Model; stochastic grammars; biological structures.

1. Introduction

The Hidden Markov Model (HMM) is an important methodology for modelling protein structures and for sequence analysis.28 It mostly involves modelling local interactions. Functional grammars provide another important technique, typically used for modelling non-local interactions, as in nucleic acids.71 Higher order grammars, like graph grammars, have also been applied to biological problems, mostly to model cellular and filamentous structures.25 Other mathematical structures like