Protein Secondary Structure Prediction Based on Position-Specific Scoring Matrices

Article No. jmbi.1999.3091, available online at http://www.idealibrary.com
J. Mol. Biol. (1999) 292, 195-202

COMMUNICATION

David T. Jones

Department of Biological Sciences, University of Warwick, Coventry CV4 7AL, United Kingdom

E-mail address of the corresponding author: [email protected]

0022-2836/99/370195-8 $30.00/0 © 1999 Academic Press

A two-stage neural network has been used to predict protein secondary structure based on the position-specific scoring matrices generated by PSI-BLAST. Despite the simplicity and convenience of the approach used, the results are found to be superior to those produced by other methods, including the popular PHD method, according to our own benchmarking results and the results from the recent Critical Assessment of Techniques for Protein Structure Prediction experiment (CASP3), where the method was evaluated by stringent blind testing. Using a new testing set based on a set of 187 unique folds, and three-way cross-validation based on structural similarity criteria rather than the sequence similarity criteria used previously (no similar folds were present in both the testing and training sets), the method presented here (PSIPRED) achieved an average Q3 score of between 76.5% and 78.3%, depending on the precise definition of observed secondary structure used, which is the highest published score for any method to date. Given the success of the method in CASP3, it is reasonable to be confident that the evaluation presented here gives a fair indication of the performance of the method in general.

Keywords: protein structure prediction; secondary structure; protein folding; sequence analysis; neural network

As a result of the influx of sequence data from the numerous genome sequencing projects, interest has never been greater in methods for predicting protein structure from amino acid sequence. At present, the prediction of an unknown protein structure by comparative modelling (e.g. Sali, 1995) is by far the most reliable technique, but only when a template protein structure can be found with a very high degree of sequence similarity to the target protein. In the absence of a suitable homologous template structure with which to build a model for a given sequence, fold recognition methods now provide another option for constructing useful tertiary structural models (e.g. Bowie et al., 1991; Jones et al., 1992; Lemer et al., 1995). Beyond methods based on recognizing similarities between proteins are ab initio tertiary methods, which attempt to predict the structure of a protein without reference to a template structure. Despite some recent progress in ab initio tertiary protein structure prediction (Jones, 1997), by far the most commonly used ab initio prediction methods are aimed at the prediction of secondary structural elements in proteins (e.g. Lim, 1974; Chou & Fasman, 1974; Garnier et al., 1978; Zvelebil et al., 1987; Rost & Sander, 1993; Geourjon & Deleage, 1995; Salamov & Solovyev, 1995; Frishman & Argos, 1996; King & Sternberg, 1996). Secondary structure prediction methods are not often used alone, but are instead often used to provide constraints for tertiary structure prediction methods or as part of fold recognition methods (e.g. Russell et al., 1996; Rost, 1997).

Early methods for secondary structure prediction were based on either simple stereochemical principles (Lim, 1974) or statistics (Chou & Fasman, 1974; Garnier et al., 1978). The GOR method (Garnier et al., 1978) has been particularly popular due to the simplicity of implementing the method in software. Increasingly, however, rather than a single sequence a whole family of related sequences is available for analysis. By constructing a multiple sequence alignment, additional information may be obtained from the observed patterns in sequence variability, and the location of insertions and deletions. Probably the earliest attempts at using such multiple sequence information for secondary structure prediction were the successful prediction of the secondary structure (and from this the fold) for the alpha-subunit of tryptophan synthase by Niermann et al. (1987), and a general method published by Zvelebil et al. (1987). It is fair to say, however, that the use of multiple sequence data was perhaps popularised by the later work by Benner & Gerloff (1991) on the successful secondary structure prediction for the cAMP-dependent kinases. The main source of information in this approach to secondary structure prediction is obtained by observing that the most conserved regions of a protein sequence are those regions which are either functionally important, and/or buried in the protein core. Conversely, the more variable regions can be fairly confidently assumed to be on the surface of the protein, where few constraints are imposed on the type of amino acid residues observed, apart from a bias towards hydrophilic amino acid residues. By clustering the sequences in an aligned family, and assessing the degree of sequence variability observed between very similar pairs, Benner & Gerloff demonstrated that the degree of solvent accessibility of an amino acid residue can be predicted with reasonable accuracy. Secondary structure can then be predicted by comparing the accessibility patterns generally associated with specific secondary structures when packed against a hydrophobic protein core.

Despite the apparent power of the manual approach described above, it is clearly beneficial to attempt to incorporate these ideas into an automatic method so that a large number of accurate predictions can be generated routinely. The PHD method by Rost & Sander (1993) uses a set of feed-forward neural networks trained by back-propagation (Rumelhart et al., 1986) to replace the "human expert" components of the Benner & Gerloff approach, and has since become the de facto standard secondary structure prediction method. The method described here also makes use of neural networks, but is greatly simplified. Despite the simplification, the method achieves a very high degree of prediction accuracy, being the most accurate method evaluated in the recent third CASP experiment (Moult et al., 1997), and can be easily implemented and run on any common computer system.

Method

The prediction method (illustrated in Figure 1) is split into three stages: generation of a sequence profile, prediction of initial secondary structure, and finally the filtering of the predicted structure.

Generation of sequence profiles

The main design goal of this prediction method was to make the entire system easily ported to any workstation. This aim encompasses both the generation of sequence profiles and the actual prediction of secondary structure. Standard approaches to generating sequence profiles are cumbersome and time-consuming. For example, the PHD server (Rost & Sander, 1993) makes use of a large multiprocessor computer system to generate multiple sequence alignments in a timely fashion, and it is therefore difficult to move the whole PHD prediction server to another site. Furthermore, the prediction accuracy of methods based on multiple sequence alignments has been found to correlate with the degree of divergence present in the aligned set of sequences. Alignments which incorporate sequences with significant yet low sequence similarity to the target protein produce more accurate predictions than those which incorporate sequences which are very closely related to the target. Recently a new method for very sensitive sequence comparison based on the new gapped version of BLAST has been published (Altschul et al., 1997). With suitable choices of parameters and filtering of the search data banks, PSI-BLAST greatly outperforms a standard Smith-Waterman (Smith & Waterman, 1981) search in its ability to detect distant homologues of a query sequence. In addition to this, PSI-BLAST generates sequence profiles as part of the search process, and here we explore the idea of using these intermediate PSI-BLAST profiles as a direct input to a secondary structure prediction method rather than extracting the sequences and producing an explicit multiple sequence alignment as a separate step. By using the PSI-BLAST profiles directly, the very time-consuming multiple sequence alignment stage is eliminated, and this leads to a radical reduction in the overall time taken to go from target sequence to predicted secondary structure. On a Silicon Graphics Origin 200 server, the entire prediction process takes only two minutes.

Although PSI-BLAST is a very powerful sequence searching method, it is prone to failure for a number of reasons. The iterative nature of the PSI-BLAST algorithm makes it very sensitive to biases in the sequence data banks. In particular, PSI-BLAST is very prone to erroneously incorporating repetitive sequences into the intermediate profiles. As soon as one or two of these pathological sequences are incorporated, the whole process goes astray, with completely random sequences being matched with apparent high confidence. In order to maximise the effectiveness of PSI-BLAST in producing very sensitive profiles, a custom sequence data bank was constructed for the present application. Firstly, a large non-redundant protein sequence data bank was compiled by extracting non-identical sequences from a number of publicly available data banks. This data bank, which currently contains around 340,000 sequences, is then filtered with the SEG program (Wootton & Federhen, 1993) to remove regions with very low information content. A custom program is used to further filter the data bank in order to remove transmembrane segments (Jones
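
The filtering of regions with very low information content, described above, can be illustrated with a short sketch. The Python example below is a simplified stand-in, not the SEG algorithm itself and not the custom program used in this work: it masks any fixed-length window whose residue composition has low Shannon entropy. The function names, the window length of 12 residues and the threshold of 2.2 bits are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace residues inside low-entropy windows with 'X'.

    The window length and threshold are illustrative assumptions,
    not parameters taken from the paper or from SEG.
    """
    masked = list(seq)
    # Slide a fixed-length window along the sequence; sequences shorter
    # than the window are returned unchanged.
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)
```

Under these assumed settings, a sequence with a long single-residue repeat has the repeat replaced by `X`, while a compositionally diverse region is left intact, so that repetitive segments of this kind could not seed spurious matches in a later iterative search.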
