STRUCTFAST: Protein Sequence Remote Homology Detection and Alignment Using Novel Dynamic Programming and Profile-Profile Scoring
Total Page:16
File Type:pdf, Size:1020Kb
PROTEINS: Structure, Function, and Bioinformatics 64:960–967 (2006) STRUCTFAST: Protein Sequence Remote Homology Detection and Alignment Using Novel Dynamic Programming and Profile–Profile Scoring Derek A. Debe,1 Joseph F. Danzer,1 William A. Goddard,2 and Aleksandar Poleksic3* 1Eidogen-Sertanty Inc., San Diego, California 92121 2Materials and Process Simulation Center, California Institute of Technology, Pasadena, California 91106 3Department of Computer Science, University of Northern Iowa, Cedar Falls, Iowa 50614 ABSTRACT STRUCTFAST is a novel profile— probabilistic model (HMM) in place of a position specific profile alignment algorithm capable of detecting score matrix. weak similarities between protein sequences. The The new generation of profile–profile alignment meth- increased sensitivity and accuracy of the STRUCT- ods utilizes multiple sequence alignment profiles in place FAST method are achieved through several unique of the query and template sequence.5,6 Profile–profile features. First, the algorithm utilizes a novel dy- algorithms enjoy additional enhancements in sensitivity, namic programming engine capable of incorporat- often recognizing sequences that share less than 15% ing important information from a structural family identity, as evidenced by large-scale benchmarking experi- directly into the alignment process. Second, the ments such as LiveBench,7 CAFASP,8 and CASP.9–11 algorithm employs a rigorous analytical formula for Despite the success of profile–profile approaches, many profile–profile scoring to overcome the limitations aspects of their development remain ad hoc. For example, of ad hoc scoring functions that require adjustable most of the existing profile–profile algorithms fail to parameter training. Third, the algorithm employs provide a rigorous, probabilistic treatment for the column– Convergent Island Statistics (CIS) to compute the column matching (as opposed to sequence–profile methods statistical significance of alignment scores indepen- such as PSI-BLAST and HMMER, where the residue- dently for each pair of sequences. STRUCTFAST column scores are treated as log-odd scores). The lack of a routinely produces alignments that meet or exceed rigorous framework is universally true for profile–profile the quality obtained by an expert human homology methods that incorporate other terms in the scoring func- modeler, as evidenced by its performance in the tion, such as position specific gap penalties and secondary latest CAFASP4 and CASP6 blind prediction bench- structure information. To work well across the protein mark experiments. Proteins 2006;64:960–967. universe, these methods implement various ad hoc adjust- © 2006 Wiley-Liss, Inc. ments when scoring pairs of profiles, such as the normaliza- tion of the score matrix and score shifts.5,6,12 This natu- Key words: protein structure; homology modeling; rally introduces a performance bias toward the test sets comparative modeling; alignment algo- used during parameter adjustment. rithms; alignment statistics Estimating alignment score significance in profile– profile methods is also a challenging task. A rigorous INTRODUCTION measure of protein sequence similarity should reflect the There are many different techniques for comparing and difference between the quality of the best alignment of the aligning protein sequences. Over the years, dynamic pro- two sequences and the quality of the best alignment gramming algorithms have dominated the field of protein between random sequences of the same lengths and compo- sequence comparison. The widely applied BLAST1 and sitions. For ungapped local sequence–sequence align- 2 FASTA algorithms utilize dynamic programming in con- ments in the asymptotic limit of long sequences, it is well junction with a sequence–sequence scoring function that established that the alignment scores follow an extreme evaluates the similarities between the amino acids of the value distribution described by two parameters and query and template sequence. The well-known PSI- K.13,14 For profile-based algorithms, with or without gaps, 3 BLAST algorithm is an example of a sequence–profile it has been conjectured that the score distribution is still of method that replaces the query sequence with a profile of the Gumbel form. PSI-BLAST rescales the position specific sequences from the query protein family. PSI-BLAST iteratively collects sequences from a sequence database to build a position-specific scoring matrix (PSSM). The PSSM *Correspondence to: Aleksandar Poleksic, Department of Computer Science, University of Northern Iowa, Cedar Falls, IA 50614. E-mail: is then used to search the sequence database for new [email protected] homologs, which are used to construct a new position Received 8 February 2006; Revised 15 March 2006; Accepted 15 specific score matrix. This process is repeated until no new March 2006 sequences are found. The hidden Markov model-based Published online 19 June 2006 in Wiley InterScience approaches, such as SAMT024 or HMMER use an explicit (www.interscience.wiley.com). DOI: 10.1002/prot.21049 © 2006 WILEY-LISS, INC. STRUCTFAST 961 scoring matrix (PSSM) so that the scale of the raw scores input to PSI-BLAST is a multiple alignment consisting of corresponds to that of the standard substitution matrix. the PDB structures in the template’s Dali structural The assumption is that the random PSI-BLAST alignment alignment.17 This is a standard technique often used to scores, obtained by using a rescaled PSSM, will follow the increase the sensitivity of the database search. Extreme Value Distribution of gapped sequence–sequence alignment scores. Thus, PSI-BLAST is able to quickly and Profile–Profile Scoring Function precisely estimate the score statistics, without further In PSI-BLAST, the score for aligning residue R to profile random simulations. Due to the increased algorithm com- column C is equivalent to plexity, score normalization in profile–profile methods is a much more challenging problem. This is particularly true Pc͑R͒ for algorithms that incorporate various constraints into s͑C,R͒ ϭ log , (1) PB͑R͒ the alignment process, such as secondary structure or complex gap treatments. For this reason, some profile– where PC(R) is the probability of observing letter R in profile alignment methods use Z-score statistics to mea- column C (e.g., its target frequency) and PB(R)isthe sure the alignment score significance.5,12 COMPASS15 overall probability of the residue R (background fre- provides a nice, PSI-BLAST-like statistical approach to quency). A physical analogy of this scoring scheme consists computing E-values. However, the applications of this of throwing weighted 20-sided dice. PC(R) is the probabil- technique are limited to the specific scoring system used in ity of rolling R on the die C, where the area of each of the the COMPASS algorithm. sides on die C is proportional to the frequencies of each In this article we introduce STRUCTFAST (STructure amino acid in the template profile. PB(R) is the probability Realization Utilizing Cogent Tips From Aligned Struc- of rolling R on the “background die” B, where the area of tural Templates), a novel, fully automated, profile–profile each of sides on this die are proportional to the background database search algorithm. The query sequence profiles in amino acid frequencies. STRUCTFAST are generated with a modified version of The same approach and analogy can be applied to the PSI-BLAST algorithm. A database of profiles for profile–profile scoring.15 For profile–profile scoring, in- template representatives from the Protein Data Bank stead of computing the probability of a single event (PDB) are generated in a similar manner, but are aug- (corresponding to a single residue R), we need to compute mented with information from structure–structure align- the probability of a series of events (corresponding to the ments derived from the template protein’s structural second profile column C2). Thus, in the context of profile– family. A query profile is then aligned and scored against profile scoring, Equation (1) has the following form: the library of structural profile templates and the align- ͑ ͒ ments are ranked by the significance of their scores, as PC1 C2 s͑C1,C2͒ ϭ log (2) determined by the previously published Convergent Is- PB͑C2͒ land Statistics method.16 A standard manipulation can be employed to make MATERIALS AND METHODS scoring function 2 symmetric with respect to C1 and C2: Profile Construction score͑C1,C2͒ ϭ s͑C1,C2͒ ϩ s͑C2,C1͒ (3) STRUCTFAST uses an internally modified version of PSI-BLAST to compare a protein sequence of interest The COMPASS algorithm implements a variant of Equa- against the NCBI’s nonredundant nr database. After 10 tion (3) along with fixed gap penalties. In COMPASS, the PSI-BLAST iterations, the algorithm parses the check- contribution of each score s(C1,C2) and s(C2,C1) is weighted point file to obtain the probabilities (target frequencies) pi, and the weights are optimized by running the algorithm on where i ϭ 1,2, . .20 for the 20 different amino acids at large validation sets. each sequence position. In addition, our version of PSI- The STRUCTFAST scoring scheme employs a direct, BLAST reports, for each sequence position, weighted nonweighted calculation of Equation (3). STRUCTFAST ϭ frequencies rk, where k 1,2 . 20 for the 20 different scores depend not only on the similarity between the amino acids (these are called “observed residue frequen- profile columns, but also on the “amount of confidence” cies fk” in PSI-BLAST) as well as the mean number of (called “thickness” in the