Bioinformatics Explained: HMMER September 12, 2007

CLC bio, Gustav Wieds Vej 10, 8000 Aarhus C, Denmark
Telephone: +45 70 22 55 09
Fax: +45 70 22 55 19
www.clcbio.com
[email protected]

Similarity searches

Database searching is widely used in bioinformatics, and there are a number of different ways to perform, for example, protein database searches. Alignment algorithms like BLAST [Altschul et al., 1990] and Smith-Waterman [Smith and Waterman, 1981] compare two sequences and determine their similarity by assigning a single score to each substitution of one amino acid with another, using standard substitution matrices and gap penalty scores. These sequence-based pairwise comparisons calculate the similarity between two sequences in order to identify significant matches. When two sequences are similar at a significant level, it indicates shared biological properties such as common evolutionary origin, similar molecular structure, and similar functionality.

Because specific positions and specific amino acids do not necessarily have the same conservation patterns in different contexts, comparing protein sequences using standard substitution matrices is a very simplistic way of searching for similarity, and it may be better to search for family or domain similarity than for sequence similarity. It may be more beneficial to search for similarity using substitution scores that reflect the amino acid frequencies at individual positions across many sequences in a domain, rather than standard substitution scores that reflect only one amino acid being replaced with another, one by one along the sequences searched.

Profile hidden Markov models (profile HMMs)

"A hidden Markov model describes a probability distribution over a potentially infinite number of sequences" [Eddy, 1998]. The HMM can be said to be a model generating sequences.
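The idea of a model that both generates sequences and scores them position by position can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm used by any particular package: it has match columns only (no insert or delete states), the toy alignment is invented, the pseudocount of 1 and the uniform background distribution are assumptions, and the query must have the same length as the alignment.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy alignment, invented for illustration (no gaps).
ALIGNMENT = ["GKVA", "GKLA", "GRVA", "GKVS"]


def column_probabilities(alignment, pseudocount=1.0):
    """Estimate residue probabilities for each alignment column.

    A pseudocount is added to every residue so that residues unseen in a
    column still get a small, non-zero probability.
    """
    columns = []
    for i in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            counts[seq[i]] += 1
        total = sum(counts.values())
        columns.append({aa: c / total for aa, c in counts.items()})
    return columns


def log_odds_score(sequence, columns):
    """Sum log2(model probability / background probability) over positions."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    return sum(
        math.log2(columns[i][aa] / background) for i, aa in enumerate(sequence)
    )


def emit_sequence(columns, rng):
    """Sample one sequence from the model, one residue per column."""
    return "".join(
        rng.choices(list(col.keys()), weights=list(col.values()))[0]
        for col in columns
    )


columns = column_probabilities(ALIGNMENT)
print(log_odds_score("GKVA", columns))  # family-like sequence, positive score
print(log_odds_score("WWWW", columns))  # unrelated sequence, negative score
print(emit_sequence(columns, random.Random(0)))
```

A sequence resembling the alignment scores above zero because its residues are more probable under the model than under the background, while an unrelated sequence scores below zero. A real profile HMM additionally models insertions and deletions with position-specific transition probabilities.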
The profile HMMs improve the search for distantly related sequences by turning a multiple sequence alignment into a probability-based, position-specific scoring system [Eddy, 1998]. A profile HMM contains match, insert and delete states, which are used for modeling a sequence family. Each state in the model has probability distributions and each transition has a probability. So, if an amino acid is commonly found at a particular position in the multiple sequence alignment, it gets a higher score at that position. It is also possible to assign scores to insertions and deletions at specific positions. A sequence is compared to the model by assigning the sequence residues to the states in the HMM. The resulting score reflects the probability that the sequence is related to the given model, and it is used to calculate an E-value for the match.

HMMs were introduced to the field of computational biology in the late 1980s, and HMMs for use as profile models were introduced by Krogh et al. [Krogh et al., 1994] in the mid 1990s [Eddy, 1998]. Examples of the use of HMMs within biology are gene finding, genetic linkage mapping and protein secondary structure prediction.

The idea of using profile HMMs for database searching is to compare a sequence to a statistical model describing a family or pattern of sequences, in contrast to a simple comparison of the single amino acids of two sequences. By comparing a sequence to a statistical model you can get some extra information. For instance:

• some sites may be conserved for specific residues while other sites show considerable variation
• some sites may be deleted without affecting functionality while other sites may not
• insertions may be acceptable at some sites but not at others

Building upon this information, it may be easier to see whether a sequence and a specific family are related. Distant relationships between sequences are also more likely to be identified when using statistical models instead of standard substitution matrices.

Pfam database

Profile HMM libraries are needed to search a query sequence for known domains and to assess the relatedness of the sequence to a protein family sharing, for example, functionality. One of the most comprehensive profile HMM libraries is the publicly available Pfam database (protein family database).

The Pfam database consists of a multiple alignment for each protein family, which has been used as the basis for building a profile HMM. Researchers at the Sanger Center have released this collection [Bateman et al., 2002], and the database currently represents 9318 protein families, covering 74% of proteins (July 2007) [PFAM,].

Figure 1: A part of an alignment for the Globin family from the Pfam website

Pfam is a classification of protein families according to families, domains, repeats and motifs. A family is the default class of proteins related to each other. The families in Pfam are all represented by a seed alignment, which contains a representative number of family members, and a full alignment containing all family members. Full family alignments contain up to 2500 sequences. Domains represent elements of structure or sequence which may be identified and relevant in different protein contexts. Repeats and motifs describe short parts of sequence [Bateman et al., 2002].

The Pfam database comes in two variants, Pfam-A and Pfam-B.
Pfam-A is a well-annotated database, which is curated by hand and thus contains high quality data. Pfam-B is an automatically generated database of lower quality. Pfam-B is intended to incorporate domains not already represented in Pfam-A [PFAM,]. Both databases come in two variants: a fragment database (fs), which allows partial matches to a domain to be found (e.g. a match to half a globin domain), and a full domain database (ls), which only allows matches to full domains. The full domain database is more specific than the fragment database and is based only on global HMM models [Bateman et al., 2002].

The Pfam database can be accessed from http://pfam.sanger.ac.uk (UK) or http://pfam.wustl.edu/ (US).

HMMER package

There are several software implementations using profile HMMs in computational biology, one of the most popular being HMMER [Eddy, 2003].

HMMER is a software implementation of profile HMMs for biological sequence analysis. A sequence is compared to a profile HMM by assigning the sequence residues to the states in the HMM, and the resulting score reflects the probability that the sequence is related to the given model. E-values for the match are calculated from this probability. The implementation of profile HMMs in the HMMER package contains programs for construction and use of position-specific scoring matrices. HMMER was written by Sean Eddy and colleagues and was first released in 1995 [Eddy, 2003]. The HMMER package is accessible from http://hmmer.janelia.org.

Programs in HMMER

Currently, the HMMER package contains nine programs. Two of these are programs for database searching:

• hmmpfam Search an HMM database for matches to a query sequence.
• hmmsearch Search a sequence database for matches to a single profile HMM.

The other programs in the package are:

• hmmalign Align sequences to an existing model.
• hmmbuild Build a model from a multiple sequence alignment.
• hmmcalibrate Take an HMM and empirically determine parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).
• hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary format, and "best effort" emulation of GCG profiles.
• hmmemit Emit sequences probabilistically from a profile HMM.
• hmmfetch Get a single model from an HMM database.
• hmmindex Index an HMM database.

[Eddy, 2003]

When using the Pfam database, a researcher would normally only have to use the two search programs, since the database has already been built. Researchers seeking to construct their own profile HMMs should use the hmmalign, hmmbuild and hmmcalibrate programs.

Examples of HMMER usage

This section gives some examples of how to use the two database search programs, hmmpfam and hmmsearch.

The protein leghemoglobin is an oxygen-binding plant globin and a member of the globin family. The first example, using hmmpfam, shows how leghemoglobin 1 from a bean (Swiss-Prot accession number P02232, lgb1_vicfa) is recognized as related to the family. The hmmsearch program shows whether any sequence in a given database matches an HMM, that is, a protein family. In the second example, hmmsearch is used to identify members of the globin protein family among 1000 sequences from Swiss-Prot.

hmmpfam

The command line version of hmmpfam has two required parameters: the first is the profile HMM database file and the second is a file with one or more sequences. hmmpfam also accepts a number of optional parameters, mainly for adjusting the cut-offs that determine which matches are reported.
Here is the example run (not all the output is shown, see the appendix for the full output):

localhost:~...hmmer% hmmpfam Pfam_fs.bin lgb1_vicfa.fasta
hmmpfam - search one or more sequences against HMM database
HMMER 2.3.2 (Oct 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
------------------------------------
HMM file:                 Pfam_fs.bin
Sequence file:            lgb1_vicfa.fasta
--------------------------------

Query sequence: P02232|LGB1_VICFA
Accession:      [none]
Description:    Leghemoglobin-1 - Vicia faba (Broad bean)

Scores for sequence family classification (score includes all domains):
Model       Description                             Score    E-value  N
--------    -----------                             -----    ------- ---
Globin      Globin                                   75.5    2.6e-21   1
Herpes_UL42 DNA polymerase processivity factor (UL    1.3        7.8   1
PPTA        Protein prenyltransferase alpha subuni    2.8        8.2   1
..
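The family classification table in the output above can be post-processed with a short script. The following is a minimal sketch in Python, assuming each table row has the layout seen in this excerpt: a model name without spaces, a free-text description (possibly truncated), and then the score, E-value and number of domains as the last three fields. The `FamilyHit` type, the `parse_hit_line` helper and the E-value cut-off of 1e-3 are hypothetical names and values chosen for illustration; they are not part of HMMER.

```python
from typing import NamedTuple


class FamilyHit(NamedTuple):
    model: str
    description: str
    score: float
    evalue: float
    n_domains: int


def parse_hit_line(line: str) -> FamilyHit:
    """Parse one row of the 'Scores for sequence family classification' table."""
    # The last three whitespace-separated fields are Score, E-value and N.
    head, score, evalue, n = line.rsplit(None, 3)
    # The first field is the model name; the rest is the description.
    parts = head.split(None, 1)
    model = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    return FamilyHit(model, description, float(score), float(evalue), int(n))


# Rows copied from the example run above.
ROWS = [
    "Globin      Globin                                   75.5    2.6e-21   1",
    "Herpes_UL42 DNA polymerase processivity factor (UL    1.3        7.8   1",
    "PPTA        Protein prenyltransferase alpha subuni    2.8        8.2   1",
]

hits = [parse_hit_line(row) for row in ROWS]

# Hypothetical cut-off, chosen for illustration only.
EVALUE_CUTOFF = 1e-3
significant = [hit.model for hit in hits if hit.evalue <= EVALUE_CUTOFF]
print(significant)  # prints ['Globin']
```

With this cut-off only the Globin model is kept, which matches the reading of the table: the Globin hit has a very small E-value, while the other two rows have E-values close to 10 and are what would be expected by chance.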
