A Jumping Profile HMM for Remote Protein Homology Detection

A Jumping Profile HMM for Remote Protein Homology Detection Anne-Kathrin Schultz and Mario Stanke Institut fur¨ Mikrobiologie und Genetik, Abteilung Bioinformatik, Universität Göttingen, Germany contact: faschult2, [email protected] Abstract Our Generalization: Jumping Profile HMM We address the problem of finding new members of a given protein family in a database of protein sequences. We are given a MSA of k rows and a candidate sequence. At each position the candidate sequence is either Given a multiple sequence alignment (MSA) of the sequences in the protein family, we would like to score each aligned to the whole column of the MSA or to a certain reference sequence: We say that we are in the column candidate sequence in the database with respect to how likely it is that it belongs to the family. Successful mode or in a row mode of the HMM. methods for this task are profile Hidden Markov Models (HMM), like HMMER [Eddy, 1998] and SAM [Hughey • Column mode: (red part of Figure 1) and Krogh, 1996], and a so-called jumping alignment (JALI) [Spang et al., 2002]. As in a profile HMM each consensus column of the MSA is modeled by three states: match (M), insert (I) and delete (D).Match states model the distribution of residues in this column, they emit the amino acids We developed a Hidden Markov Model which can be regarded as a generalization of these two methods: At each with a probability which depends on all residues in this column. position the candidate sequence is either aligned to the whole column of the MSA or to a certain reference sequence. Thus our model catches both the horizontal and vertical information in the MSA. It is called a • Row modes: (first k rows of Figure 1) jumping profile HMM. Each sequence of the alignment corresponds to a row mode, i.e. for each amino acid in the alignment we have three states: match (M), insert (I) and delete (D). The emission probabilities of the amino acids in the match states only depend on the residue found at this specific position in the alignment. Insert and delete states allow for insertion or deletion of one or more residues at a certain position in the alignment. The states are connected by transitions to which transition probabilities are assigned. To estimate the emission and transition probabilities we use Dirichlet mixtures [Sjölander et al., 1996], [Wistrand Successful Known Methods: Profile HMMs and JALI and Sonnhammer, 2004] for all states. Changes between the column mode and row modes and changes between different row modes are possible and penalized. They allow to align the candidate sequence partially to the columns of the MSA and partially to a reference sequence, which may change within the alignment. A profile HMM uses mainly a summary of the columns of the MSA to align a candidate sequence to the These transitions are called jumps and are shown just for three example states: M1 2 (cyan), I1 1 (green) MSA and to define a probability that measures how well the candidate protein fits into the protein family. It ; ;n1− and D1 1 (blue) (Figure 1). contains information about conserved key residues shared by most or all sequences of the family. The penalty ;n1− for a deletion or an insertion in the candidate sequence depends on its position in the alignment to the MSA. Aligning a sequence to the MSA with respect to the jumping profile HMM corresponds to finding the most probable path through the model that emits the sequence. JALI is a jumping alignment of the candidate sequence to the MSA of the protein family. At each position the candidate sequence is aligned to one reference sequence of the MSA using a scoring matrix. This reference A jumping profile HMM is a generalization of profile HMMs and jumping alignments in the following sense. If sequence can change within the alignment. Such a change of the reference sequence is called a jump and we only take the column mode of our model we get a profile HMM. Leaving out the column mode our model is penalized by subtracting a constant jump cost from the score. The jumping alignment method can catch is a probabilistic analogon of JALI. Jumps in JALI correspond to transitions between the states of the different dependencies between residues at different positions of the sequence and appears to be more robust when the row modes. MSA of the family is partially wrong. ... ... I I I I 1n 11 12 1n 1−1 1 ... ... M11 M12 M13 ... ... M 1n1 ... ... D D D1n −1 D 11 12 1 1n1 . B ... ... ... ... E I Ikn −1 I k1 k knk Jumping Profile HMM - Features M M M ... ... ... ... M k1 k2 k3 knk • The model catches both the horizontal and vertical information ... ... D Dk1 Dk2 Dk3 knk in the MSA. It considers information about conserved columns of the alignment as well as information about conserved sequence ... ... ... ... patterns common to only some of the sequences. I1 Im • It can both find sequences which are similar to the profile of a ... ... ... ... protein family and sequences which are only partially similar to M1 M2 M3 Mm one sequence of the family. D ... ... • It automatically includes a pairwise comparison to each sequence 1 D2 D3 Dm of the family. Thus it is less impaired by a bad or wrong MSA as a profile HMM. Figure 1: Architecture of a jumping profile Hidden Markov Model References [Eddy, 1998] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755{763, 1998. [Hughey and Krogh, 1996] Richard Hughey and Anders Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Comput. Appl. Biosci., 12(2):95{107, 1996. [Sjölander et al., 1996] Kimmen Sjölander, Kevin Karplus, Michael Brown, Richard Hughey, Anders Krogh, I. Saira Mian, and David Haussler. Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Comput. Applic. Biosci., 12:327{345, 1996. [Spang et al., 2002] Rainer Spang, Marc Rehmsmeier, and Jens Stoye. A Novel Approach to Remote Homology Detection: Jumping Alignments. J. Comp. Biol., 9(5):747{760, 2002. [Wistrand and Sonnhammer, 2004] Markus Wistrand and Erik L.L. Sonnhammer. Transition Priors for Protein Hidden Markov Models: An Empirical Study towards Maximum Discrimination. J. Comp. Biol., 11(1):181{193, 2004..

A Jumping Profile HMM for Remote Protein Homology Detection

Curriculum Vitae – Prof. Anders Krogh Personal Information

Predicting Transmembrane Topology and Signal Peptides with Hidden Markov Models

Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids

Tporthmm : Predicting the Substrate Class Of

Dear Delegates,History of Productive Scientiﬁc Discussions of New Challenging Ideas and Participants Contributing from a Wide Range of Interdisciplinary ﬁelds

I S C B N E W S L E T T

Reading Genomes Bit by Bit

Predicting Cellular Localization and Membrane Topology

Mrub 1325, Mrub 1326, Mrub 1327, And

Scientific Program

University of Bologna International Master Course in Bioinformatics Interdepartmental Center “Luigi Galvani” Department of Pharmacy and Biotechnology

Curriculum Vitae of Piero Fariselli