A Jumping Profile HMM for Remote Protein Homology Detection

Anne-Kathrin Schultz and Mario Stanke
Institut für Mikrobiologie und Genetik, Abteilung Bioinformatik, Universität Göttingen, Germany
Contact: {aschult2, mstanke}@gwdg.de

Abstract

We address the problem of finding new members of a given protein family in a database of protein sequences. Given a multiple sequence alignment (MSA) of the sequences in the protein family, we would like to score each candidate sequence in the database with respect to how likely it is that it belongs to the family. Successful methods for this task are profile Hidden Markov Models (HMM), like HMMER [Eddy, 1998] and SAM [Hughey and Krogh, 1996], and a so-called jumping alignment (JALI) [Spang et al., 2002].

We developed a model which can be regarded as a generalization of these two methods: At each position the candidate sequence is either aligned to the whole column of the MSA or to a certain reference sequence. Thus our model catches both the horizontal and vertical information in the MSA. It is called a jumping profile HMM.

Successful Known Methods: Profile HMMs and JALI

A profile HMM uses mainly a summary of the columns of the MSA to align a candidate sequence to the MSA and to define a probability that measures how well the candidate protein fits into the protein family. It contains information about conserved key residues shared by most or all sequences of the family. The penalty for a deletion or an insertion in the candidate sequence depends on its position in the alignment to the MSA.

JALI is a jumping alignment of the candidate sequence to the MSA of the protein family. At each position the candidate sequence is aligned to one reference sequence of the MSA using a scoring matrix. This reference sequence can change within the alignment. Such a change of the reference sequence is called a jump and is penalized by subtracting a constant jump cost from the score. The jumping alignment method can catch dependencies between residues at different positions of the sequence and appears to be more robust when the MSA of the family is partially wrong.

Our Generalization: Jumping Profile HMM

We are given an MSA of k rows and a candidate sequence. At each position the candidate sequence is either aligned to the whole column of the MSA or to a certain reference sequence: We say that we are in the column mode or in a row mode of the HMM.

• Column mode: (red part of Figure 1) As in a profile HMM, each consensus column of the MSA is modeled by three states: match (M), insert (I) and delete (D). Match states model the distribution of residues in this column: they emit the amino acids with a probability which depends on all residues in this column.

• Row modes: (first k rows of Figure 1) Each sequence of the alignment corresponds to a row mode, i.e. for each amino acid in the alignment we have three states: match (M), insert (I) and delete (D). The emission probabilities of the amino acids in the match states depend only on the residue found at this specific position in the alignment.

Insert and delete states allow for insertion or deletion of one or more residues at a certain position in the alignment. The states are connected by transitions to which transition probabilities are assigned. To estimate the emission and transition probabilities we use Dirichlet mixtures [Sjölander et al., 1996], [Wistrand and Sonnhammer, 2004] for all states.

Changes between the column mode and row modes and changes between different row modes are possible and penalized. They allow the candidate sequence to be aligned partially to the columns of the MSA and partially to a reference sequence, which may change within the alignment. These transitions are called jumps and are shown just for three example states: $M_{1,2}$ (cyan), $I_{1,n_1-1}$ (green) and $D_{1,n_1-1}$ (blue) (Figure 1).

Aligning a sequence to the MSA with respect to the jumping profile HMM corresponds to finding the most probable path through the model that emits the sequence.

A jumping profile HMM is a generalization of profile HMMs and jumping alignments in the following sense. If we only take the column mode of our model we get a profile HMM. Leaving out the column mode, our model is a probabilistic analogue of JALI. Jumps in JALI correspond to transitions between the states of the different row modes.
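The idea of a most probable path that switches between the column mode and the row modes can be illustrated with a minimal Python sketch. It is not the authors' implementation and simplifies heavily: only match states are modeled (no insert or delete states), the candidate is assumed to have exactly as many residues as the MSA has columns, emissions use simple pseudocounts instead of Dirichlet mixtures, and all names and parameters (`column_emission`, `row_emission`, `viterbi_modes`, `match_prob`, `jump_prob`) are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def column_emission(column, pseudocount=1.0):
    """Column mode: distribution estimated from all residues of one MSA column.

    Assumption: plain pseudocounts stand in for Dirichlet mixtures.
    """
    counts = {a: pseudocount for a in ALPHABET}
    for residue in column:
        if residue in counts:  # gap characters such as '-' are skipped
            counts[residue] += 1.0
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def row_emission(residue, match_prob=0.8):
    """Row mode: distribution that depends only on the single aligned residue."""
    if residue not in ALPHABET:  # gap in the reference row: uniform fallback
        return {a: 1.0 / len(ALPHABET) for a in ALPHABET}
    rest = (1.0 - match_prob) / (len(ALPHABET) - 1)
    return {a: (match_prob if a == residue else rest) for a in ALPHABET}


def viterbi_modes(msa, candidate, jump_prob=0.05):
    """Return (log probability, mode path) of the most probable mode sequence.

    Modes 0..k-1 are the row modes, mode k is the column mode.
    The candidate may only contain letters from ALPHABET.
    """
    k, length = len(msa), len(msa[0])
    assert len(candidate) == length, "sketch assumes an ungapped candidate"
    n_modes = k + 1
    stay = math.log(1.0 - jump_prob)
    jump = math.log(jump_prob / (n_modes - 1))  # jump mass shared by other modes

    def emit(mode, pos):
        if mode == k:
            dist = column_emission([row[pos] for row in msa])
        else:
            dist = row_emission(msa[mode][pos])
        return math.log(dist[candidate[pos]])

    # Assumption: every mode is equally likely at the first position.
    score = [emit(m, 0) - math.log(n_modes) for m in range(n_modes)]
    back = []
    for pos in range(1, length):
        new_score, pointers = [], []
        for m in range(n_modes):
            best_prev = max(
                range(n_modes),
                key=lambda p: score[p] + (stay if p == m else jump),
            )
            trans = stay if best_prev == m else jump
            new_score.append(score[best_prev] + trans + emit(m, pos))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path back from the best final mode.
    mode = max(range(n_modes), key=lambda m: score[m])
    best = score[mode]
    path = [mode]
    for pointers in reversed(back):
        mode = pointers[mode]
        path.append(mode)
    path.reverse()
    return best, path


best, path = viterbi_modes(["AAAACCCC", "GGGGHHHH"], "AAAAHHHH")
print(path)
```

In this toy input the candidate agrees with the first row in its first half and with the second row in its second half, so the best path contains one jump between the two row modes. A full model would add the insert and delete states and position-specific transition probabilities described above.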

Jumping Profile HMM - Features

• The model catches both the horizontal and vertical information in the MSA. It considers information about conserved columns of the alignment as well as information about conserved sequence patterns common to only some of the sequences.

• It can find both sequences which are similar to the profile of a protein family and sequences which are only partially similar to one sequence of the family.

• It automatically includes a pairwise comparison to each sequence of the family. Thus it is less impaired by a bad or wrong MSA than a profile HMM.

Figure 1: Architecture of a jumping profile Hidden Markov Model. Between the begin state B and the end state E, the row mode of each sequence $i = 1, \dots, k$ has states $M_{i,j}$, $I_{i,j}$, $D_{i,j}$ ($j = 1, \dots, n_i$), and the column mode has states $M_j$, $I_j$, $D_j$ ($j = 1, \dots, m$).
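The state layout of Figure 1 can be enumerated with a short Python sketch. It assumes the indexing that can be read off the figure labels (states indexed 1 to n_i in row i and 1 to m in the column mode, plus B and E); the function name `jumping_profile_states` and the naming scheme of the states are hypothetical, and transitions and jumps are not modeled here.

```python
def jumping_profile_states(row_lengths, n_columns):
    """List state names: B, all row-mode states, all column-mode states, E.

    row_lengths[i-1] is n_i, the number of residues in row i of the MSA;
    n_columns is m, the number of consensus columns.
    """
    states = ["B"]
    for i, n_i in enumerate(row_lengths, start=1):
        for j in range(1, n_i + 1):
            states += [f"M{i},{j}", f"I{i},{j}", f"D{i},{j}"]  # row mode i
    for j in range(1, n_columns + 1):
        states += [f"M{j}", f"I{j}", f"D{j}"]  # column mode
    states.append("E")
    return states


states = jumping_profile_states(row_lengths=[4, 3], n_columns=5)
print(len(states))  # 3 * (4 + 3) + 3 * 5 + 2 = 38
```

Under these assumptions the number of states grows with the total number of residues in the MSA, not only with the number of consensus columns as in a plain profile HMM.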

References

[Eddy, 1998] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

[Hughey and Krogh, 1996] Richard Hughey and Anders Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Comput. Appl. Biosci., 12(2):95–107, 1996.

[Sjölander et al., 1996] Kimmen Sjölander, Kevin Karplus, Michael Brown, Richard Hughey, Anders Krogh, I. Saira Mian, and David Haussler. Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Comput. Appl. Biosci., 12:327–345, 1996.

[Spang et al., 2002] Rainer Spang, Marc Rehmsmeier, and Jens Stoye. A Novel Approach to Remote Homology Detection: Jumping Alignments. J. Comp. Biol., 9(5):747–760, 2002.

[Wistrand and Sonnhammer, 2004] Markus Wistrand and Erik L.L. Sonnhammer. Transition Priors for Protein Hidden Markov Models: An Empirical Study towards Maximum Discrimination. J. Comp. Biol., 11(1):181–193, 2004.