Improved Prediction of Signal Peptides: SignalP 3.0
doi:10.1016/j.jmb.2004.05.028 J. Mol. Biol. (2004) 340, 783–795

Jannick Dyrløv Bendtsen1, Henrik Nielsen1, Gunnar von Heijne2 and Søren Brunak1*

1Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark

2Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden

We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, combined with a thorough error-correction of a new data set, have improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups, eukaryotes, Gram-negative and Gram-positive bacteria. The accuracy of cleavage site prediction has increased in the range 6–17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false-positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has been benchmarked against other available methods. Predictions can be made at the publicly available web server http://www.cbs.dtu.dk/services/SignalP/

© 2004 Elsevier Ltd. All rights reserved.
Keywords: signal peptide; signal peptidase I; neural network; hidden Markov model; SignalP

*Corresponding author

Abbreviations used: HMM, hidden Markov model; NN, neural network.

E-mail address of the corresponding author: [email protected]

Introduction

Numerous attempts to predict the correct subcellular location of proteins using machine learning techniques have been developed.1–9 Computational methods for prediction of N-terminal signal peptides were published around 20 years ago, initially using a weight matrix approach.1,2 Development of prediction methods shifted to machine learning algorithms in the mid 1990s,10,11 with a significant increase in performance.12 SignalP, one of the currently most used methods, predicts the presence of signal peptidase I cleavage sites. For signal peptidase II cleavage sites found in lipoproteins, the LipoP predictor has been constructed.13 SignalP produces both classification and cleavage site assignment, while most of the other methods classify proteins as secretory or non-secretory.

A consistent assessment of the predictive performance requires a reliable benchmark dataset. This is particularly important in this area, where the predictive performance is approaching the performance calculated from interpretation of experimental data, which is not always perfect. Incorrect annotation of signal peptide cleavage sites in the databases stems from trivial database errors, and from peptide sequencing, where it may be hard to control the level of post-processing of the protein by other peptidases after the signal peptidase I has made its initial cleavage. Such post-processing typically leads to cleavage site assignments shifted downstream relative to the true signal peptidase I cleavage site.

In the process of training the new version of SignalP we have generated a new, thoroughly curated dataset based on the extraction and redundancy reduction method published earlier.14 Other methods were used for cleaning the new dataset, and we found a surprisingly high error rate in Swiss-Prot, where, for example, of the order of 7% of the Gram-positive entries had either wrong cleavage site position and/or wrong annotation of the experimental evidence. Also, we found many errors in a previously used benchmark set (stemming from automatic extraction from Swiss-Prot),12 and it appears that some programs are in fact better than the performance reported (predictions are correct, while feature annotation is incorrect). For comparison, we made use of this independent benchmark dataset that was used initially for evaluation of five different signal peptide predictors.12

In the new version of SignalP we have introduced novel amino acid composition units as well as sequence position units in the neural network input layer in order to obtain better performance. Moreover, we have changed the window sizes slightly compared to the previous version. We have used fivefold cross-validation tests for direct comparison to the previous version of SignalP.10 In the previous version of SignalP a combination score, Y, was created from the cleavage site score, C, and the signal peptide score, S, and used to obtain a better prediction of the position of the cleavage site. In the new version, we also use the C-score to obtain a better discrimination between secreted and non-secreted sequences, and have constructed a new D-score for this classification task. The architecture of the hidden Markov model (HMM) SignalP has not changed, but the models have been retrained on the new data set, and have increased their performance significantly.

Results and Discussion

Generation of data sets

As the predictive performance of the earlier SignalP method was quite high, assessment of potential improvements is critically dependent on the quality of the data annotation. We generated a new positive signal peptide data set from Swiss-Prot15 release 40.0, retaining the negative dataset extracted from the previous work. The method for redundancy reduction was the same as in the previous work14, and was based on the reduction principle developed by Hobohm et al.16 Our final positive signal peptide datasets contain 1192, 334 and 153 sequences for eukaryotes, Gram-negative and Gram-positive bacteria, respectively.

In the previous work, we found many errors by detailed inspection of hard-to-learn examples during training and wrongly predicted examples. Nevertheless, we were quite sure that even after careful examination in this manner, the dataset would probably still contain errors obtained from incorrect database annotation and wrongly interpreted laboratory results.

Therefore, we developed a new feature-based approach where abnormal examples can be detected by inspecting rare amino acid occurrences and outlier physical–chemical properties of signal peptides. In the following, we show that the isoelectric point of signal peptides can help in finding possible annotation errors and other errors, where these errors may be due to the fact that some (long) signal peptides annotated in Swiss-Prot actually include probable propeptides. In such cases, convertase cleavage sites are mixed together with signal peptidase I cleavage sites.

Removal of spurious cleavage site residues

Experimental assessment of the effect of certain amino acids in the cleavage site region has shown that rare residues do not allow for efficient cleavage.17,18 Examination of amino acids around the signal peptidase I cleavage site in the data set revealed a number of sequences containing amino acids which appear at the cleavage site very rarely. In the eukaryotic dataset we found and removed seven sequences containing lysine (K) and 13 sequences containing arginine (R) at the −1 position. All sequences with either lysine or arginine at position −1 were investigated manually. All of them, except one, had a predicted cleavage site upstream of the annotated one. Most of these sequences probably undergo N-terminal maturation by different proteases, either in the trans Golgi network (TGN) or after release from the cell as mentioned below in the section on propeptide analysis. In one clear case we found an obvious error in the Swiss-Prot entry NPAB_LOCMI. According to the annotation, the cleavage site is located between residues 24 and 25 (arginine in position −1), but in the original paper the authors identified the cleavage to occur between amino acid residues 22 and 23. In this case, the two residues, ER, are removed by a dipeptidase.19

Furthermore, we removed sequences where other amino acids appeared at position −1 in very few of the sequences. For the eukaryotic dataset, the only residues allowed at position −1 were alanine (A), cysteine (C), glycine (G), leucine (L), proline (P), glutamine (Q), serine (S) and threonine (T). By allowing only the latter amino acids we might have removed a few true, unusual sequences. For instance, tyrosine (Y) and histidine (H) at position −1 were found only in one case each in the entire eukaryotic dataset. We removed eight sequences with aspartic acid (D) and eight with phenylalanine (F), seven each with glutamic acid (E) and asparagine (N), respectively. Five with methionine (M), three containing isoleucine (I) and two sequences containing tryptophan (W) at position −1 were removed. Some of these are in fact provable errors, in one of the aspartic acid examples, CLUS_BOVIN,20 the N-terminal peptide sequencing in the published paper reports the cleavage as MKTLLLLMGLLLSWESGWA—-ISDKELQEMST…, while Swiss-Prot annotates the sequence as being cleaved between D and K, thereby changing a common position −1 amino acid, alanine, into a rare one. Interestingly, SignalP predicts the cleavage site as reported in the published paper.

For Gram-positive and Gram-negative bacteria, only four residues were allowed at position −1. These residues were alanine (A), glycine (G), serine (S) and threonine (T).17,18 For the Gram-positive dataset, this approach removed four sequences containing arginine (R), three containing valine (V), two containing lysine (K) and one sequence each of glutamic acid (E), leucine (L), asparagine (N), glutamine (Q), threonine (T) and tyrosine (Y).

IAA2_STRGS is not verified experimentally. It is predicted to have a cleavage site at position 26 (SignalP) or 24 (PSORT). Calculation of pI using the SignalP predicted signal peptide length gave a