doi:10.1016/j.jmb.2004.05.028 J. Mol. Biol. (2004) 340, 783–795

Improved Prediction of Signal Peptides: SignalP 3.0

Jannick Dyrløv Bendtsen1, Henrik Nielsen1, Gunnar von Heijne2 and Søren Brunak1*

1Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark

2Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden

*Corresponding author

We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, combined with a thorough error-correction of a new data set, has improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups, eukaryotes, Gram-negative and Gram-positive bacteria. The accuracy of cleavage site prediction has increased in the range 6–17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false-positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has been benchmarked against other available methods. Predictions can be made at the publicly available web server http://www.cbs.dtu.dk/services/SignalP/

© 2004 Elsevier Ltd. All rights reserved.

Keywords: signal peptide; signal peptidase I; neural network; hidden Markov model; SignalP

Abbreviations used: HMM, hidden Markov model; NN, neural network.

E-mail address of the corresponding author: [email protected]

Introduction

Numerous methods that predict the subcellular location of proteins using machine learning techniques have been developed.1–9 Computational methods for prediction of N-terminal signal peptides were published around 20 years ago, initially using a weight matrix approach.1,2 Development of prediction methods shifted to machine learning algorithms in the mid 1990s,10,11 with a significant increase in performance.12 SignalP, one of the currently most used methods, predicts the presence of signal peptidase I cleavage sites. For signal peptidase II cleavage sites found in lipoproteins, the LipoP predictor has been constructed.13 SignalP produces both classification and cleavage site assignment, while most of the other methods classify proteins as secretory or non-secretory.

A consistent assessment of the predictive performance requires a reliable benchmark dataset. This is particularly important in this area, where the predictive performance is approaching the performance calculated from interpretation of experimental data, which is not always perfect. Incorrect annotation of signal peptide cleavage sites in the databases stems from trivial database errors, and from peptide sequencing, where it may be hard to control the level of post-processing of the protein by other peptidases after the signal peptidase I has made its initial cleavage. Such post-processing typically leads to cleavage site assignments shifted downstream relative to the true signal peptidase I cleavage site.

In the process of training the new version of SignalP we have generated a new, thoroughly curated dataset based on the extraction and redundancy reduction method published earlier.14 Other methods were used for cleaning the new dataset, and we found a surprisingly high error rate in Swiss-Prot, where, for example, of the order of 7% of the Gram-positive entries had either wrong cleavage site position and/or wrong annotation of the experimental evidence. Also, we found many errors in a previously used benchmark set (stemming from automatic extraction from Swiss-Prot),12 and it appears that some programs are in fact better than the performance reported (predictions are correct, while feature annotation is incorrect). For comparison, we made use of this independent benchmark dataset that was used initially for evaluation of five different signal peptide predictors.12

In the new version of SignalP we have introduced novel amino acid composition units as well as sequence position units in the neural network input layer in order to obtain better performance. Moreover, we have changed the window sizes slightly compared to the previous version. We have used fivefold cross-validation tests for direct comparison to the previous version of SignalP.10 In the previous version of SignalP a combination score, Y, was created from the cleavage site score, C, and the signal peptide score, S, and used to obtain a better prediction of the position of the cleavage site. In the new version, we also use the C-score to obtain a better discrimination between secreted and non-secreted sequences, and have constructed a new D-score for this classification task. The architecture of the hidden Markov model (HMM) SignalP has not changed, but the models have been retrained on the new data set, and have increased their performance significantly.

Results and Discussion

Generation of data sets

As the predictive performance of the earlier SignalP method was quite high, assessment of potential improvements is critically dependent on the quality of the data annotation. We generated a new positive signal peptide data set from Swiss-Prot15 release 40.0, retaining the negative dataset extracted from the previous work. The method for redundancy reduction was the same as in the previous work,14 and was based on the reduction principle developed by Hobohm et al.16 Our final positive signal peptide datasets contain 1192, 334 and 153 sequences for eukaryotes, Gram-negative and Gram-positive bacteria, respectively.

In the previous work, we found many errors by detailed inspection of hard-to-learn examples during training and wrongly predicted examples. Nevertheless, we were quite sure that even after careful examination in this manner, the dataset would probably still contain errors obtained from incorrect database annotation and wrongly interpreted laboratory results.

Therefore, we developed a new feature-based approach where abnormal examples can be detected by inspecting rare amino acid occurrences and outlier physical–chemical properties of signal peptides. In the following, we show that the isoelectric point of signal peptides can help in finding possible annotation errors and other errors, where these errors may be due to the fact that some (long) signal peptides annotated in Swiss-Prot actually include probable propeptides. In such cases, convertase cleavage sites are mixed together with signal peptidase I cleavage sites.

Removal of spurious cleavage site residues

Experimental assessment of the effect of certain amino acids in the cleavage site region has shown that rare residues do not allow for efficient cleavage.17,18 Examination of amino acids around the signal peptidase I cleavage site in the data set revealed a number of sequences containing amino acids which appear at the cleavage site very rarely. In the eukaryotic dataset we found and removed seven sequences containing lysine (K) and 13 sequences containing arginine (R) at the −1 position. All sequences with either lysine or arginine at position −1 were investigated manually. All of them, except one, had a predicted cleavage site upstream of the annotated one. Most of these sequences probably undergo N-terminal maturation by different proteases, either in the trans Golgi network (TGN) or after release from the cell as mentioned below in the section on propeptide analysis. In one clear case we found an obvious error in the Swiss-Prot entry NPAB_LOCMI. According to the annotation, the cleavage site is located between residues 24 and 25 (arginine in position −1), but in the original paper the authors identified the cleavage to occur between amino acid residues 22 and 23. In this case, the two residues, ER, are removed by a dipeptidase.19

Furthermore, we removed sequences where other amino acids appeared at position −1 in very few of the sequences. For the eukaryotic dataset, the only residues allowed at position −1 were alanine (A), cysteine (C), glycine (G), leucine (L), proline (P), glutamine (Q), serine (S) and threonine (T). By allowing only the latter amino acids we might have removed a few true, unusual sequences. For instance, tyrosine (Y) and histidine (H) at position −1 were found only in one case each in the entire eukaryotic dataset. We removed eight sequences with aspartic acid (D) and eight with phenylalanine (F), and seven each with glutamic acid (E) and asparagine (N). Five with methionine (M), three containing isoleucine (I) and two sequences containing tryptophan (W) at position −1 were removed. Some of these are in fact provable errors. In one of the aspartic acid examples, CLUS_BOVIN,20 the N-terminal peptide sequencing in the published paper reports the cleavage as MKTLLLLMGLLLSWESGWA—-ISDK ELQEMST…, while Swiss-Prot annotates the sequence as being cleaved between D and K, thereby changing a common position −1 amino acid, alanine, into a rare one. Interestingly, SignalP predicts the cleavage site as reported in the published paper.

For Gram-positive and Gram-negative bacteria, only four residues were allowed at position −1. These residues were alanine (A), glycine (G), serine (S) and threonine (T).17,18 For the Gram-positive dataset, this approach removed four sequences containing arginine (R), three containing valine (V), two containing lysine (K) and one sequence each of glutamic acid (E), leucine (L), asparagine (N), glutamine (Q), threonine (T) and tyrosine (Y). In the Gram-negative dataset, we removed two sequences containing valine (V) at position −1 and one sequence for each of the following amino acids: glutamic acid (E), lysine (K), leucine (L), asparagine (N), glutamine (Q).

Isoelectric point calculations

Previous studies have shown differences in amino acid composition between signal peptide and mature protein.21,22 Thus, we examined to what extent the isoelectric point (pI) could be used as a unique feature of signal peptides.

We calculated the pI for all signal peptides and the corresponding mature proteins in the dataset and presented this in three scatter plots (Figure 1). In the scatter plot for Gram-positive bacteria, two very distinct clusters appear. Only three signal peptide outliers were found, and manual inspection of the corresponding Swiss-Prot entries showed that these proteins most likely were either not carrying signal peptides, or were annotated wrongly.

These outliers, having pI values below 8, had the following Swiss-Prot IDs: CWLA_BACSP, IAA2_STRGS and COTT_BACSU. The three entries have annotated signal peptides, but it is doubtful whether the annotation is correct. According to the prediction from SignalP and PSORT, CWLA_BACSP does not carry a signal peptide. CWLA_BACSP was described in the published paper as a “putative” signal peptide23 and later it was indicated that cwlA is part of an ancestral prophage, still in the Bacillus subtilis genome.24 All phage and virus sequences were initially removed from the SignalP training set, which could result in the negative prediction for this prophage sequence.

The cleavage site in the alpha-amylase inhibitor IAA2_STRGS is not verified experimentally. It is predicted to have a cleavage site at position 26 (SignalP) or 24 (PSORT). Calculation of pI using the SignalP predicted signal peptide length gave a new result of 8.66, closer to the average for Gram-positive bacteria. The published paper proposes two other cleavage site positions, but these have not been verified experimentally.25

The last entry, COTT_BACSU, is a spore coat protein from B. subtilis26,27 and no BLAST homologs in Swiss-Prot were found to contain an experimentally verified signal peptide. CotT is processed proteolytically from a 10 kDa precursor protein and is localized to the spore coat, where it controls the assembly. By N-terminal sequencing, the N terminus of the mature and processed protein was identified, although nowhere in the two published papers is an SPase I cleavage site indicated, and no signal peptide is mentioned.26,27 With the current knowledge about spore coats, spore coat assembly does not involve translocation of coat protein across any membrane.28–30 Hence, it is very unlikely for CotT to carry an N-terminal signal peptide as annotated in Swiss-Prot.

The average isoelectric point of signal peptides and mature proteins in the entire Gram-positive dataset was 10.59 and 6.24, respectively. This is consistent with the fact that Gram-positive bacteria are known to have the longest signal peptides, which carry more basic residues (K/R) in the n-region than those of Gram-negatives and eukaryotes.11

When inspecting the scatter plot for Gram-negative bacteria, we find the same overall clustering as observed for the Gram-positive bacteria, although not as distinct. Here, the major group of signal peptides have pI values between 8 and 13, although the variation is larger than in the Gram-positive scatter plot. A few sequence entries with acidic signal peptides were investigated in detail. Sequence entry SFMA_ECOLI, having a pI of 4.78, was found to be an obvious erroneous annotation in Swiss-Prot. This entry had an annotated cleavage site at position 22, but a predicted cleavage

Figure 1. Isoelectric point calculations. Calculations of isoelectric point of signal peptide and mature protein, indicated by s and m, respectively. Clusters of outlier examples for bacteria are indicated on the two plots.
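As a rough illustration of the kind of pI comparison summarized in Figure 1, the sketch below estimates the isoelectric point of a peptide by bisection on its Henderson–Hasselbalch net charge. The pKa values are a generic, textbook-style set chosen for illustration and are not necessarily those used by the authors; the function names and the two example sequences are hypothetical.

```python
# Illustrative pKa values (assumed; the text does not state which set was used).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERM = 9.0
PKA_C_TERM = 2.0


def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))    # free N terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))   # free C terminus
    for aa in seq.upper():
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:   # still positively charged: pI is higher
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical sequences: a basic signal-peptide-like segment and an acidic one.
signal_like = "MKKRSLLLALALLSAFA"
mature_like = "DEEDLSAEEG"
pi_signal = isoelectric_point(signal_like)
pi_mature = isoelectric_point(mature_like)
```

Because the net charge decreases monotonically with pH, bisection is sufficient here. An annotated signal peptide whose estimate falls well below those of the other signal peptides from the same organism group would then be a candidate for manual inspection.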

Figure 2. Alternative start codon assignment. The graphical output from SignalP strongly indicates erroneous annotation of the signal peptide from Swiss-Prot entry SFMA_ECOLI. Further investigation showed a wrong annotation of the start codon (see the text for details). C, S, and Y-score indicate cleavage site, “signal peptide-ness” and combined cleavage site predictions, respectively.

site at position 34. As seen from Figure 2, we found an internal methionine residue at position 12. Since the signal peptide-ness is very low until position 12, we assumed that this was an incorrectly annotated start codon. If the initial 11 amino acid residues until the internal methionine residue were removed, SignalP correctly predicted the cleavage to be at position 22 and the pI of the signal peptide increased from 4.78 to 9.99. Indeed, in release 41.0 of Swiss-Prot, this entry was corrected and the signal peptide marked “POTENTIAL”.

For eukaryotes, on the other hand, we were not able to distinguish the pI of the signal peptide and the mature protein. Eukaryotes have the shortest signal peptides and the amount of basic residues is much smaller than for bacteria.

Propeptide or signal peptide?

For the eukaryotic data, we examined whether annotated signal peptides could possibly include propeptides. In secreted proteins, propeptides are often found immediately downstream of the signal peptidase I cleavage site and their cleavage site is defined by a conserved set of basic amino acids. Propeptides can be hard to detect by N-terminal Edman degradation, as the propeptides are cleaved off in the TGN before the release of the mature protein to the surroundings.31

We used a new propeptide predictor, ProP, to predict propeptide cleavage sites32 in the eukaryotic dataset. In ten sequences we found a predicted cleavage site for a propeptide at the same position where a signal peptidase I cleavage site was annotated in Swiss-Prot. In all ten cases, SignalP predicted a shorter signal peptide than annotated, thus making room for a short propeptide between the predicted signal peptide and the mature protein. The ten sequences, AMYH_SACFI, CRYP_CRYPA, FINC_RAT, GUX2_TRIRE, LIGC_TRAVE, MDLA_PENCA, RNMG_ASPRE, RNT1_ASPOR, XYN2_TRIRE and XYNA_THELA, were reassigned according to the prediction of SignalP version 2.0. This is an exceptional case where we tend to rate the computational analysis higher than experimental evidence, which must be considered weak, as the propeptide processing takes place before the proteins have been subjected to experimental, N-terminal peptide sequencing. After the signal peptide had been reassigned in these cases, we got marginally higher correlation coefficients when retraining the neural network on the reassigned data set (data not shown).

Optimization of window sizes

As in the earlier SignalP approach, the signal peptide discrimination and the signal peptidase I cleavage site prediction were handled using two different types of neural networks.10,33

We used a brute-force approach to optimize the window sizes for the neural networks by calculating single-position correlation coefficients for all possible combinations of symmetric and asymmetric windows. Using this approach, we trained approximately 6500 neural networks for window optimization for a single organism group. This was done also for different combinations where amino acid composition and position information were included in the input to the network or not, leading to approximately 27,000 neural networks being tested in all.

For eukaryotes, these data are shown in Figure 3. It is clear that optimal signal peptide discrimination prediction requires symmetric (or nearly symmetric) windows, whereas cleavage site training needs asymmetric windows with more positions upstream of the cleavage site included in the input to the network. The optimal window size for cleavage site prediction for the eukaryote network included 20 positions upstream and four positions downstream of the cleavage site. The

Figure 3. Window optimization. These plots show single-position level correlation coefficients for all combinations of window sizes for the signal peptide cleavage and discrimination networks used for eukaryotic signal peptide prediction. The optimal window size for cleavage site for the eukaryotic network included 20 positions to the left and four positions to the right of the cleavage site. For reasons of computational efficiency, we have selected a discrimination network with a symmetric window of 27 amino acid residues, although networks with larger windows have slightly higher single-position level correlation coefficients.

window sizes for the Gram-positive networks were retained as found previously,10 whereas the Gram-negative cleavage site network included one more position downstream of the cleavage site, resulting in a window of 11 positions upstream and three positions downstream of the cleavage site. The eukaryote discrimination network performs best when using a symmetric window of 27 positions. For both Gram-positive and Gram-negative bacteria the discrimination network is based on a symmetric window of 19 positions. This brute-force approach changed the optimal window sizes of the cleavage site network slightly from those used in SignalP 2.0.10,33

Network performance

We have evaluated the performance of SignalP version 3.0 using the same performance measures as used for the previous two versions of SignalP (see Table 1). The performance values were calculated using fivefold cross-validation, i.e. testing on sequences not present in the training set (all data split into five subsets of approximately the same size). The most significant performance increase was obtained for the cleavage site prediction as seen in Table 1. A performance increase of 6–17% for all three organism classes was obtained. We were able to optimize the signal peptide discrimination performance by introducing a new score, termed the D-score, replacing the earlier used mean S-score quantifying the “signal peptide-ness” of a given sequence segment. In the earlier versions of SignalP, the scores from the two types of networks were combined for cleavage site assignment, and not for the task of discrimination. In the new version 3, the D-score is calculated as the average of the mean S-score and the maximal Y-score, and the two types of networks are then used for both purposes (see Materials and Methods for details).

Improvement by position information and composition features

In order to improve the performance of the neural network version of SignalP, we introduced two new features into the network input: information about the position of the sliding window as well as information on the amino acid

Table 1. Performances of three different SignalP versions

                 Cleavage site (Y-score)     Discrimination (SP/non-SP)
Version          Euk     Gram−   Gram+       Euk     Gram−   Gram+
SignalP 1 NN     70.2    79.3    67.9        0.97    0.88    0.96
SignalP 2 NN     72.4    83.4    67.4        0.97    0.90    0.96
SignalP 2 HMM    69.5    81.4    64.5        0.94    0.93    0.96
SignalP 3 NN     79.0    92.5    85.0        0.98    0.95    0.98
SignalP 3 HMM    75.7    90.2    81.6        0.94    0.94    0.98

The most significant improvement was for the cleavage site predictions. Cleavage site performances are presented as % and discrimination values (based on D-score) as correlation coefficients. NN and HMM indicate neural network and hidden Markov model, respectively. Results are based on fivefold cross-validation for all SignalP versions.
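The 6–17% range quoted for the cleavage site improvement can be checked against Table 1 if it is read as a difference in percentage points between version 2 and version 3. That reading is an interpretation rather than something the text states, and the variable and function names below are hypothetical.

```python
# Cleavage site performance (%) from Table 1, ordered as (Euk, Gram-, Gram+).
table1 = {
    "SignalP 2 NN": (72.4, 83.4, 67.4),
    "SignalP 2 HMM": (69.5, 81.4, 64.5),
    "SignalP 3 NN": (79.0, 92.5, 85.0),
    "SignalP 3 HMM": (75.7, 90.2, 81.6),
}
groups = ("Euk", "Gram-", "Gram+")


def gains(model: str) -> dict:
    """Percentage-point gain of version 3 over version 2 for one model type."""
    old = table1[f"SignalP 2 {model}"]
    new = table1[f"SignalP 3 {model}"]
    return {g: round(n - o, 1) for g, o, n in zip(groups, old, new)}


nn_gains = gains("NN")    # {'Euk': 6.6, 'Gram-': 9.1, 'Gram+': 17.6}
hmm_gains = gains("HMM")  # {'Euk': 6.2, 'Gram-': 8.8, 'Gram+': 17.1}
```

Under this reading, both the neural network and the HMM gains fall between roughly 6 and 17.6 percentage points, with the largest gain for Gram-positive bacteria.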

Figure 4. Improvement of the neural network by introducing length and composition features. Position of the sliding window in the neural network input increased cleavage site prediction performance slightly (left panel). Amino acid composition information together with information of the position of the sliding window improved the discrimination network significantly, as seen in the right-hand panel. The performance improvement was evaluated as single-position level correlations during training on the individual networks for cleavage and discrimination, respectively.
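To make the two added input features concrete, here is a minimal sketch of how a sliding-window input could be extended with a position unit and whole-sequence composition units. The sparse (one-hot) window encoding, the scaling of the position unit by a hypothetical `max_len`, and all function names are assumptions made for illustration; this excerpt does not specify the encoding actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa: str) -> list:
    """21 units: one per amino acid plus one for padding or unknown residues."""
    vec = [0.0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(aa)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1.0
    return vec


def composition(seq: str) -> list:
    """Relative frequency of each of the 20 amino acids in the whole sequence."""
    n = max(len(seq), 1)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def encode_position(seq: str, pos: int, upstream: int, downstream: int,
                    max_len: int = 70) -> list:
    """Input vector for the window around 0-based position `pos`.

    The window covers `upstream` residues before `pos` and `downstream`
    residues starting at `pos`; positions outside the sequence are padded.
    """
    window = []
    for i in range(pos - upstream, pos + downstream):
        aa = seq[i] if 0 <= i < len(seq) else "-"
        window.extend(one_hot(aa))
    position_unit = [min(pos / max_len, 1.0)]   # assumed scaling to [0, 1]
    return window + position_unit + composition(seq)


# Hypothetical sequence; asymmetric 20 + 4 window as reported for eukaryotes.
example = "MKALLLLALLASAWAQDKEL"
vector = encode_position(example, pos=15, upstream=20, downstream=4)
```

With the 20 + 4 window, this particular layout yields 24 × 21 window units, one position unit and 20 composition units. The composition units are identical for every window position of a given sequence, which fits the description of composition as a property of the entire sequence.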

composition of the entire sequence. This information was encoded by additional input units in the neural network. The new position information units were found to be important for both the cleavage site and discrimination networks, whereas the amino acid composition information improved only the discrimination network. The idea of including compositional information is based on the observation that the compositions of secreted and non-secreted proteins differ.21,22

The average length of signal peptides ranges from 22 (eukaryotes) and 24 (Gram-negatives) to 32 amino acid residues for Gram-positives, and the new network encoding the position of the sliding window uses these averages to penalize prediction of extremely long or short signal peptides. Therefore, twin arginine signal peptides often receive a D-score below the threshold, as they tend to be quite long (average 37 amino acid residues).34,35 This means that a few cases of ordinary signal peptides with extreme length are not predicted correctly by the neural networks. The HMM penalizes long signal peptides in its structure, and similarly the SignalP3 HMM is not able to predict these cases correctly. One example36 is NUC_STAAU, with a 63 residue signal peptide that is not predicted correctly by any of the SignalP3 models. SignalP3 does not always fail to predict long signal peptides correctly, e.g. the 56 residue signal peptide of CYGD_BOVIN37 is handled correctly by the neural network version, both in terms of cleavage site and discrimination. However, great care should be taken when interpreting the scores for long potential signal peptides.

From Figure 4, the importance of the new approach where position and amino acid composition information is included can be assessed. Including information about the position of the sliding window during training increased the neural network cleavage site prediction performance slightly (left-hand panel of the Figure). Composition information did not increase the performance of the cleavage site prediction; therefore, it is excluded from the left-hand panel in Figure 4. But composition information did increase the performance of the discrimination network slightly (right-hand panel of the Figure), whereas information about the position of the sliding window together with composition increased the discrimination significantly (right-hand panel). Another improvement of the discrimination stems from the new D-score (see Table 2). The final prediction method uses both position and composition information.

Effect of the new discrimination score

In SignalP version 3.0 we have introduced a new discrimination score for the neural network, termed the D-score. On the basis of the mean S-score and maximal Y-score it was found to give increased discriminative performance over the mean S-score, used in SignalP version 2.0. In Table 2, the D-score shows superior performance over the mean S-score for the novel part of the benchmark set defined by Menne et al.12 (see below). The above-mentioned 56 residue signal peptide in CYGD_BOVIN is an example of where the D-score leads to a correct classification, while the mean S-score is below the threshold. In this case, the strong cleavage site score adds to a weaker signal peptide-ness in the C-terminal part of the leader sequence.

Table 2. D-score outperforms the mean S-score for discrimination of signal peptide versus non-signal peptide

Dataset       Sensitivity    Specificity    Accuracy       cc
Eukaryotes    0.99 (0.98)    0.85 (0.84)    0.93 (0.93)    0.87 (0.86)
Gram−         0.94 (0.93)    0.88 (0.81)    0.95 (0.93)    0.88 (0.82)
Gram+         0.98 (0.98)    0.98 (0.98)    0.98 (0.98)    0.96 (0.95)

Using the novel part of the Menne test set,12 we tested the D-score for discrimination compared to the mean S-score. The mean S-score performances are shown in parentheses.
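The D-score is described in the text as the average of the mean S-score and the maximal Y-score. A minimal sketch follows. It assumes that per-position S- and Y-scores are already available and that the mean S-score is taken over the positions preceding the maximal Y-score; the exact averaging range and the decision thresholds are given in Materials and Methods, which is not part of this excerpt, so the threshold of 0.5 used in the example is purely hypothetical.

```python
def d_score(s_scores: list, y_scores: list) -> float:
    """Average of the mean S-score and the maximal Y-score.

    Assumption: the mean S-score is computed over the positions before the
    position with the highest Y-score (the predicted signal peptide).
    """
    if not s_scores or len(s_scores) != len(y_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    cut = max(range(len(y_scores)), key=y_scores.__getitem__)
    max_y = y_scores[cut]
    signal_part = s_scores[:cut] or s_scores[:1]   # guard against cut == 0
    mean_s = sum(signal_part) / len(signal_part)
    return (mean_s + max_y) / 2


# Hypothetical scores: weak signal peptide-ness but a strong cleavage signal.
s = [0.5, 0.4, 0.4, 0.3, 0.1, 0.1]
y = [0.1, 0.1, 0.2, 0.3, 0.7, 0.1]
threshold = 0.5                      # hypothetical cutoff for illustration only
mean_s_only = sum(s[:4]) / 4         # below the hypothetical threshold
combined = d_score(s, y)             # lifted above it by the maximal Y-score
```

This mirrors the CYGD_BOVIN case discussed above: a mean S-score that would fall below a threshold on its own can be lifted above it by a strong cleavage site signal.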

Performance comparison to other prediction methods

As described in a recent review of signal peptide prediction methods, it is hard to find an ideal benchmark set, as methods have been frozen at different times.12 The data used to train a method is, in general, “easier” than genuine test sequences that are novel to a particular method. Since we have used a more recent version of Swiss-Prot than did Menne et al. in their assessment,12 we have merely retained Menne set sequences that are not present in the SignalP version 3.0 training set. In this manner, we do not give an advantage to SignalP, as some of these sequences possibly have been included in the training set for other methods.

We did not test the performance of the weight matrix-based methods SigCleave or SPScan, as the earlier report shows that these are outperformed by machine learning methods.12 SigCleave is based on von Heijne’s weight matrix2 from 1986. SPScan is also based on the weight matrix from von Heijne, but in addition to this it uses McGeoch’s criteria for a minimal, acceptable signal peptide.1

We have tested other methods, one problem being that they do not necessarily predict the same organism classes, e.g. the PSORT-B method8 predicts only on Gram-negative data, and not on the two other SignalP organism classes. The comparative results are given in Table 3. For the PSORT-II method,38,39 which predicts on eukaryotic sequences, the subcellular localization classes endoplasmic reticulum (ER), extracellular and Golgi were merged into one category of secretory proteins, whereas the rest, cytoplasmic, mitochondrial, nuclear, peroxisomal and vacuolar, were merged into a single “non-secretory” category. The performance reported in the published paper is 57% correct for all categories. Table 3 shows that SignalP3 outperforms PSORT-II on this particular set with a significant margin. PSORT-II does not assign cleavage sites, and we have therefore compared only the discrimination performance. We believe that the minor decrease in discrimination performance of SignalP3 on this set, when compared to the cross-validation performance reported in Table 1, is a result of errors in the Menne set (originating from Swiss-Prot) together with its redundancy (see below) but, more importantly, the presence of transmembrane helices within the first 60 residues in more than 10% of the novel negative test sequences from this set (when analyzed by TMHMM).40

The new version of PSORT (PSORT-B) has been trained on five subcellular localization classes in Gram-negative bacteria and was reported to obtain a 97% specificity and 75% sensitivity.8 PSORT-B was optimized for specificity over sensitivity. Another recent method, SubLoc,5 predicts three subcellular compartments for prokaryotes and four compartments for eukaryotes. For SubLoc, the total prediction accuracy was reported to be 91.4% for the three subcellular locations in prokaryotes and 79.4% for the four locations in eukaryotes. Neither of the latter two methods, PSORT-B and SubLoc, reports a predicted cleavage site; both are designed to perform only discrimination. Also, neither of these methods was compared to SignalP version 2.0.

Of the 289 negative test set sequences for Gram-negative bacteria in the novel part of the Menne set, 191 received an “unknown” classification by PSORT-B, and similarly 22 sequences in the positive test set were classified as unknown. The unknown sequences were discarded in the calculation of the PSORT-B performance shown in Table 3. PSORT-B classifies 55% of the submitted sequences as unknown. From Table 3 it can be seen that SignalP3-NN is significantly better than both SubLoc and PSORT-B on the novel part of the Menne set, even when excluding these unknown sequences. SignalP3-HMM has a similar performance on this set. All PSORT-B categories except cytoplasmic were regarded as secretory. In comparison to SignalP3 we have merged the two categories periplasmic and extracellular predicted from SubLoc to one category for secretory proteins.

The original version of PSORT was used for predicting signal peptides in Gram-positive bacteria.3 We merged the output categories of

Table 3. Performance measures for signal peptide discrimination

Data set      Method         Sensitivity    Specificity    Accuracy    cc
Eukaryotes    SignalP3-NN    0.99           0.85           0.93        0.87
Eukaryotes    PSORT-II       0.65           0.75           0.80        0.56
Eukaryotes    SubLoc         0.58           0.70           0.77        0.47
Gram−         SignalP3-NN    0.92           0.88           0.95        0.87
Gram−         PSORT-B        0.99           0.64           0.75        0.58
Gram−         SubLoc         0.90           0.79           0.91        0.78
Gram+         SignalP3-NN    0.95           0.93           0.97        0.92
Gram+         PSORT          0.86           0.80           0.91        0.77
Gram+         SubLoc         0.82           0.92           0.86        0.76

Using the novel part of the Menne et al. test set,12 we obtained the results shown. Note that the values for PSORT-B are calculated on the part of the data set where PSORT-B produces a classification. Around 55% of the sequences were classified as Unknown, and the actual performance is therefore much lower than indicated here. For a given organism class, the relevant version of PSORT has been used to make the predictions and calculate the performance.

“cleaved signal peptide” and “uncleaved signal peptide” into one category, “secretory”. Sequences with a negative N-terminal signal peptide prediction were regarded as cytoplasmic. Again, the performance of SignalP3 is higher than PSORT. As the amount of data used to train this version of PSORT was quite small, the performance is surprisingly good.

Sigfind,41 another (eukaryotic only) method based on neural networks, has a limitation of four sequences per host per day. We have submitted 50 randomly chosen negative and 50 randomly chosen positive test sequences from the novel Menne set. Sigfind reported two false positives within this negative test set and one false negative in the positive test set. When running the same sequences on SignalP3, we obtained no false positives, but the same false negative as Sigfind.

Manual inspection of the one false negative prediction of APL_HUMAN by both Sigfind and SignalP revealed that this particular Swiss-Prot entry has been updated (with new identifier APL1_HUMAN) after the development of the Menne set, which was based on release 38.0 of Swiss-Prot. The sequence has now been extended by 15 amino acid residues at the N terminus, which results in a 27 residue signal peptide and not a 12 residue signal peptide as reported earlier. Taking this change into consideration, both Sigfind and SignalP3 correctly classify this protein as being secreted.

For this limited part of the novel test set by Menne, Sigfind obtained a correlation coefficient of 0.96, compared to the perfect correlation coefficient of 1.0 for SignalP3.

Wondering about the true performance of Sigfind, we chose to submit eukaryotic secretory, cytoplasmic and nuclear protein sequences initially created in Swiss-Prot release 42.0. Neither SignalP nor the Sigfind method has been trained or tested on these new sequences. We were able to extract 54 signal peptide-containing sequences, 86 cytoplasmic, and 119 nuclear sequences. When submitting those, Sigfind correctly classifies all new signal peptide-containing sequences as secretory, but classifies four of the 86 cytoplasmic and five of the nuclear sequences as secretory (false positives). SignalP3 classifies all new eukaryotic secretory and cytoplasmic proteins correctly, but makes two false positive predictions for the nuclear sequences. For discrimination of secretory and non-secretory proteins newly entered into Swiss-Prot, the Sigfind method obtains a correlation coefficient of 0.91, whereas SignalP again obtains a better correlation of 0.98. It appears that the Sigfind method quite strongly overpredicts signal peptide-containing sequences, and this means that on a normal dataset (either the one used to train SignalP or a full proteome), where the non-secretory proteins greatly outnumber the secretory proteins, the actual performance in terms of specificity will be much lower than on this more balanced set.

A neural network-based method (NNPSL) for prediction of subcellular localization in eukaryotes and prokaryotes was published a few years back.4 Unfortunately, the online prediction method is capable of handling only a couple of sequences per submission, which made it hard to compare to SignalP.

Another prediction method, SPEPlip,7 was published recently, but we were not able to perform a comparison as the server did not function during the four weeks in which we checked it. However, we are quite skeptical about the generalization ability of this method, as it was trained and tested on the full version of the highly redundant Menne set. The set has not been redundancy reduced before training and testing of SPEPlip.7 Consequently, the test part of the dataset will contain many sequences highly similar to training set sequences. For such sequences with high similarity, the cleavage site position can easily be found by alignment, and the inclusion of such sequences leads to a significant overestimation of the predictive performance.

A recently published method for cleavage site prediction (not available for test) based on support vector machines6 reports a performance increase of 47% in terms of true positive predictions at a false positive rate of 3% when compared to the original weight matrix method described by von Heijne.2 Unfortunately, the support vector machine method was not compared to SignalP. The SVM method finds 68% true positive cleavage sites at a false positive rate of 3%. This method does not distinguish between eukaryotic and bacterial sequences.

Very recently, a new method, Phobius, designed to improve transmembrane helix topology predictions by integrating topology and signal peptide predictions, became available.42 Often the first transmembrane helix can be mistaken for a signal peptide and vice versa. This method was trained on data collected from Swiss-Prot release 41. While the performance values are not easy to compare as, e.g. the negative data set has been extracted from PDB using different similarity criteria than those used to develop SignalP,14 we made an evaluation of Phobius, using the same novel sequences from release 42 of Swiss-Prot, as used in the test of Sigfind described above (note that the comparison between Phobius and SignalP in the published paper was made using the old SignalP 2.0 version). Out of 205 negative test examples, Phobius generated four false positive predictions, whereas SignalP generated two false positive predictions. Both methods were able to correctly classify all signal peptide-containing sequences. For discrimination, this results in a correlation coefficient of 0.96 for Phobius and 0.98 for SignalP. As reported in the Phobius paper, the cleavage site prediction accuracy is below the accuracy of the SignalP method. For this set from Swiss-Prot release 42, Phobius could predict the position of the cleavage site correctly in 75% of the sequences, while SignalP version 3.0 is able to predict the cleavage site position correctly in 87% of the sequences. When tested on a very small set of 11 novel, experimentally verified Gram-positive and Gram-negative sequences from Swiss-Prot release 42, we found that Phobius predicted 64% of those correctly, while SignalP3 was correct in 82% of the cases. Thus, for these novel sequences found in the Swiss-Prot database, SignalP performs better, both in terms of discrimination and cleavage site prediction. Nevertheless, Phobius is indeed superior to SignalP when it comes to prediction of transmembrane helices close to the N terminus, which are easier to confuse with signal peptides.

Misuse of SignalP

We have noticed that users of SignalP in some cases interpret a positive prediction as meaning that the protein is extracellular. As many proteins with signal peptides are retained, e.g. in ER/Golgi, this is not always the case. Eukaryotic proteins may be retained in the ER if the protein holds an “ER retention signal”, which is found at the C terminus of the mature protein. SignalP does not take such signals into account. It is hard to assess how often wrong interpretations are made due to lack of experimental data, but for the dataset used to train SignalP3 we found 11 cases with retention signals (based on Swiss-Prot annotation). The true level of retention is presumably higher.

Another, rarer type of misuse occurs when negative predictions are interpreted incorrectly. A negative classification by SignalP does not necessarily imply that the protein is indeed a non-secreted protein, as some proteins enter the extracellular space by non-classical and leaderless pathways. We have dealt with the issue of non-classical secretion elsewhere,43 and we have developed the SecretomeP server for this purpose (see below).

Conclusion

We present new versions of SignalP, based on an expanded, highly curated dataset. The architecture of the HMM-based version was unchanged, while the neural network scheme was improved by including information about the amino acid composition of the precursor protein as well as the position of the sliding window. Furthermore, we optimized the window sizes by testing all possible combinations of asymmetric and symmetric input windows up to a total input of 51 amino acid residues. These were changed slightly compared to the earlier SignalP version.

For all organism groups, we obtained an improvement of the cleavage site predictions based on the maximal Y-score as defined earlier.10 Discrimination between signal peptide-containing sequences and non-secretory sequences was improved by introducing a new score, the D-score, replacing the mean S-score used for discrimination in earlier SignalP versions.

On an independent test set (limited to sequences not used for SignalP3 construction) we achieved better sensitivity, specificity, accuracy and correlation coefficient for the cleavage site predictions for all three organism groups. Moreover, we obtained better signal peptide discrimination in most cases.

The 3.0 version of SignalP that we present here shows in all cases a performance increase over version 2.0 (see Table 1). This was observed both for the cleavage site prediction and the signal peptide discrimination. The improved performance of discrimination predictions for the new version of SignalP is partly due to the introduction of composition units in the neural network input layer.

The improved performance obtained when simultaneously using position and composition information proves that these features are indeed correlated. We were not able to figure out in detail how the cleavage site position and the protein composition actually are correlated, but when inspecting the neural network composition input unit weights, it became clear that the networks readily learn how secreted proteins differ in composition from non-secretory proteins. Figure 5 shows the over-representation and the under-representation of the 20 amino acid residues in secretory proteins from Gram-negative bacteria, and the sizes of the corresponding weights connecting the input composition units and the hidden units. With the exception of methionine, the sign of these weights reflects directly the over-representation or under-representation of the amino acids, explaining why this additional input is able to improve the predictive performance of SignalP3. The amino acid bias in the data is in agreement with earlier observations.21 The picture was similar for eukaryotes and Gram-positive bacteria (data not shown).

Figure 5. Amino acid composition information. Over- and under-representation of amino acids in the Gram-negative training set and corresponding weights from the composition units in the network to the hidden units. The unit for neural network weights is arbitrary. Negative numbers indicate over-representation of the particular amino acid in secretory proteins.

The prediction quality of the new method was significantly better than other methods performing classification. SignalP3 was compared to other methods, with the exception of older weight matrix approaches. It should be noted that the independent test set we have used (from Menne et al.12) has not been cleaned in the same thorough manner as the training set. We are aware of several annotation errors in the test set that have not been removed, and therefore the performance of all methods is, in practice, better than estimated from this set.

As the discrimination task is quite close to being correct for most sequences, the most prominent problem in signal peptide prediction is the prediction of the correct cleavage site. We have thoroughly investigated the data set for cleavage site annotation errors and many of these were indeed found. One source of error of signal peptide annotation is the neglect of maturation proteases that act on the protein after the SPase I cleavage. By cleaning for obvious cleavage site annotation errors, we improved the cleavage site performance without retraining. This means that the old versions indeed had a better performance than assessed by the old incorrect data.

The method described here† and the SecretomeP server mentioned above can be found on the web.‡

† http://www.cbs.dtu.dk/services/SignalP/
‡ http://www.cbs.dtu.dk/services/SecretomeP/

Materials and Methods

Data set extraction

All sequence data were extracted from Swiss-Prot15 release 40.0. A total of 12,975 entries with the keyword SIGNAL were found. The dataset was split into three species-specific groups: eukaryotes, Gram-negative prokaryotes and Gram-positive prokaryotes. We excluded all archaeal sequences. Non-experimentally verified signal peptides that had POTENTIAL or HYPOTHETICAL stated in the keyword line were removed. Furthermore, any phage, viral or eukaryote organelle-encoded proteins were excluded. Lipoproteins were removed. Similarly, we excluded entries with more than one cleavage site and entries with non-verified N terminus if indicated in the keyword line. This reduced the number of signal peptide-carrying sequences in the dataset to 3902.

In order to remove any bias in the dataset, the set was redundancy reduced by the scheme developed previously, excluding pairs of sequences that were functionally homologous.14 Subsequently, no two sequences in the dataset have more than 17 (eukaryotes) or 21 (prokaryotes) identical amino acid residues in a local alignment. Approximately half of the remaining sequences were thus removed. This scheme for redundancy reduction ensures that the method does not transfer functional information to one sequence (from a set of experimentally characterized examples) by mere sequence similarity in the usual sense as detected by alignment. The SignalP dataset preparation paper14 determines how dissimilar two sequences should be in order to prompt the application of a “discrimination” method, rather than alignment, for the specific case of signal peptide prediction. Thus, the performance values reported here correspond to a situation where the inference cannot be made reliably by alignment.

A few of the remaining sequences were removed from the dataset due to their extreme length. We removed sequences with signal peptides with fewer than 15 amino acid residues for all three organism groups, sequences with more than 45 residues for eukaryotes and Gram-negatives and 50 residues for Gram-positive. We did not specifically remove Tat signal peptides, but some of them were removed due to their length.

For discrimination of secretory proteins versus cytoplasmic proteins we used the old dataset from SignalP version 2.0 of cytoplasmic, nuclear and signal anchor sequences, as we saw no reason to doubt these. We found and removed one error, though.

Additionally, we cleaned the redundancy reduced data set by additional techniques as described below. Our final eukaryotic data set contains 1192 secretory, 990 nuclear, 459 cytoplasmic and 67 signal anchor sequences. The dataset for Gram-negative bacteria contains 334 secretory and 358 cytoplasmic sequences. Finally, the dataset for Gram-positive bacteria contains 153 secretory and 151 cytoplasmic sequences.

Further cleaning of the extracted data set

Propeptides

Recently, a eukaryotic propeptide convertase predictor, ProP, was developed.32 In conjunction with SignalP version 2.0, we reassigned sequences from the eukaryotic dataset, which seemed to include a propeptide annotated as a signal peptide (see below).

Spurious cleavage site residues

In addition to cleaning the eukaryotic data set for potential propeptide cleavage sites, we removed any sequence that contained the basic residues lysine (K) or arginine (R) at position −1. Furthermore, we removed all sequences that had residues in position −1 occurring only in a small minority of sequences. For the eukaryotic dataset, the only residues allowed at position −1 were alanine (A), cysteine (C), glycine (G), leucine (L), proline (P), glutamine (Q), serine (S) and threonine (T). For Gram-positive and Gram-negative bacteria, only alanine (A), glycine (G), serine (S) and threonine (T) were allowed at position −1, according to the cleavage site.
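The length and position −1 criteria described above can be expressed as a simple filter. The following is a minimal sketch under stated assumptions, not the code used to build the dataset: the names `keep_entry`, `ALLOWED_MINUS1`, `MIN_LEN` and `MAX_LEN` are hypothetical, the length limits are taken to be inclusive, and position −1 is taken to be the last residue of the annotated signal peptide.

```python
# Hypothetical sketch of the signal peptide length and position -1 filters.

# Residues allowed at position -1 (last residue of the signal peptide).
ALLOWED_MINUS1 = {
    "eukaryote": set("ACGLPQST"),
    "gram_negative": set("AGST"),
    "gram_positive": set("AGST"),
}

# Signal peptide length limits (assumed inclusive).
MIN_LEN = 15
MAX_LEN = {"eukaryote": 45, "gram_negative": 45, "gram_positive": 50}


def keep_entry(sequence: str, sp_length: int, group: str) -> bool:
    """Return True if an entry passes both cleaning filters.

    sequence  -- precursor protein sequence in one-letter code
    sp_length -- annotated signal peptide length; the cleavage site lies
                 between sequence[sp_length - 1] and sequence[sp_length]
    group     -- "eukaryote", "gram_negative" or "gram_positive"
    """
    # Length filter: drop extremely short or long signal peptides.
    if not MIN_LEN <= sp_length <= MAX_LEN[group]:
        return False
    # Guard against annotations that run past the end of the sequence.
    if sp_length > len(sequence):
        return False
    # Position -1 filter: K and R are excluded because they are absent
    # from every allowed set.
    return sequence[sp_length - 1] in ALLOWED_MINUS1[group]


# Example with a made-up 15-residue signal peptide ending in alanine.
example = "M" + "L" * 13 + "A" + "QVTE"
print(keep_entry(example, 15, "eukaryote"))
```

Keeping the allowed residues in one table per organism group makes the difference between the eukaryotic and bacterial criteria explicit and easy to adjust.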

Database annotation errors

After initial training of our method, the dataset was cleaned manually for errors that could be related to annotation errors. These erroneous annotations were identified as predictions made by the neural network that did not correspond with the annotation from Swiss-Prot. Only errors that could be identified manually and confirmed from the corresponding papers were removed or reannotated.

The sequence entries for Gram-negative bacteria that were either reannotated or removed from the training set were: PGL2_ERWCA, YBCL_ECOLI, OMLA_PSEAE, CBPG_PSES6, BLP2_PSEAE, GUNC_PSEFL, HLYB_PROMI, FPTA_PSEAE, and CY1_PARDE.

The sequence entries for Gram-positive bacteria that were either reannotated or removed from the training set were: XYNC_STRLI, CHOD_STRSQ, HYSA_PROAC, BLAF_MYCFO, AGAR_STRCO, CHOD_BREST, CHOD_STRSQ, IMD_ARTGO, ALDC_BACBR, BLAC_STRAU, GUNG_CLOTM, GUNH_CLOTM, TACY_STRPY, TEE6_STRPY, THI1_PANTH, XYN1_BACST, AMY_BACSU and CHI1_BACCI. This approach has not been carried out on the eukaryote training set. All errors found in the training sets have been reported to Swiss-Prot.

Independent test set

During previous work for an evaluation of signal peptide predictors, an independent test set was created.12 The positive test set was divided into three groupings corresponding to eukaryotes, Gram-negative and Gram-positive bacteria. Sequences already found in the SignalP3 training set were removed to prevent an artificially high performance measure for the SignalP method when testing. This resulted in 557 eukaryotic, 100 Gram-negative and 42 Gram-positive sequences that were used as a positive test set. In the eukaryotic negative test set, sequences carrying mitochondrial transit peptides were removed. The negative test set consisted of 1056 eukaryotic, 289 Gram-negative and 129 Gram-positive sequences.

The three negative test sets were investigated using TMHMM40 for the presence of transmembrane helices within the first 60 amino acid residues of each sequence. For the 129 Gram-positive cytoplasmic sequences we found 12 sequences with one transmembrane helix within the first 60 residues. Out of the 289 Gram-negative cytoplasmic sequences, 32 carry a (predicted) transmembrane helix within the first 60 residues, and 123 out of 1056 eukaryotic cytoplasmic sequences carry a helix within the first 60 residues of each sequence.

Neural network architecture

We used two different neural networks for coping with the signal peptide prediction problem: one network for recognition of the cleavage site, and one network for determining whether a given amino acid belongs to the signal peptide. A more thorough description of the neural network has been presented elsewhere.33

Furthermore, we improved the neural network by introducing new input features. These features were position of the sliding window as a parameter together with the amino acid composition of the entire sequence. Position of the sliding window was used as an input neuron in the neural network for cleavage site prediction and for the discrimination of signal peptides. Composition neurons were introduced only as additional neurons into the neural network for discrimination of signal peptides versus non-signal peptides, as they did not improve the cleavage site prediction.

Also, we optimized the window sizes by testing all combinations of symmetric and asymmetric windows varying from three to 51 positions.

Measuring prediction performance

Performance of the prediction method was measured on different datasets. Firstly, we used the same measure of performance as used in earlier versions of SignalP, i.e. cross-validation on the training set, meaning that the dataset was split into five equally sized parts and trained on four and tested on one for a total of five times until all sequences had been used for training and testing, respectively.

Performance of the prediction method on the novel part of the Menne et al. test set12 was measured as the sensitivity, giving the fraction of positive examples truly predicted as positive:

$$\mathrm{Sensitivity} = \frac{tp}{tp + fn} \tag{1}$$

The specificity gives the fraction of all positive predictions that are true positives:

$$\mathrm{Specificity} = \frac{tp}{tp + fp} \tag{2}$$

Accuracy gives the fraction of all true predictions, both true positive and true negative:

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{3}$$

The Matthews correlation coefficient (cc) is defined as:44

$$cc = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fn)(tp + fp)(tn + fp)(tn + fn)}} \tag{4}$$

and was used in calculation of optimal network performance during the training.

Here, tp = true positive, tn = true negative, fp = false positive and fn = false negative.

New discrimination score

Previous versions of SignalP showed the best discrimination of signal peptides versus non-signal peptides, by using the mean S-score calculated as the average of the S-score in the predicted signal peptide region. We have implemented a new score, called the D-score, for better discrimination. The D-score is simply an average of the mean S-score and the maximal Y-score, where the Y-score is defined as:

$$Y_i = \sqrt{C_i \, \Delta_d S_i} \tag{5}$$

where $\Delta_d S_i$ is the difference between the average S-score of d positions before and d positions after position i:

$$\Delta_d S_i = \frac{1}{d}\left(\sum_{j=1}^{d} S_{i-j} - \sum_{j=0}^{d-1} S_{i+j}\right) \tag{6}$$

Acknowledgements

We thank Anders Krogh for use of his HMM software. This work was supported by grants from the Danish National Research Foundation, the Danish Natural Science Research Council, the Danish Center for Scientific Computing, and by a grant from Novozymes A/S (to J.D.B.).

References

1. McGeoch, D. J. (1985). On the predictive recognition of signal peptide sequences. Virus Res. 3, 271–286.
2. von Heijne, G. (1986). A new method for predicting signal sequence cleavage sites. Nucl. Acids Res. 14, 4683–4690.
3. Nakai, K. & Kanehisa, M. (1991). Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Struct. Funct. Genet. 11, 95–110.
4. Reinhardt, A. & Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res. 26, 2230–2236.
5. Hua, S. & Sun, Z. (2001). Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721–728.
6. Vert, J. P. (2002). Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings. In Proceedings of the Pacific Symposium on Biocomputing, World Scientific, Singapore pp. 649–660.
7. Fariselli, P., Finocchiaro, G. & Casadio, R. (2003). SPEPlip: the detection of signal peptide and lipoprotein cleavage sites. Bioinformatics, 19, 2498–2499.
8. Gardy, J. L., Spencer, C., Wang, K., Ester, M., Tusnady, G. E., Simon, I. et al. (2003). PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucl. Acids Res. 31, 3613–3617.
9. Zhang, Z. & Wood, W. I. (2003). A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics, 19, 307–308.
10. Nielsen, H., Brunak, S., Engelbrecht, J. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10, 1–6.
11. Nielsen, H. & Krogh, A. (1998). Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Cong. Intell. Syst. Mol. Biol. 6, 122–130.
12. Menne, K. M., Hermjakob, H. & Apweiler, R. (2000). A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics, 16, 741–742.
13. Juncker, A. S., Willenbrock, H., Von Heijne, G., Brunak, S., Nielsen, H. & Krogh, A. (2003). Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. 12, 1652–1662.
14. Nielsen, H., Engelbrecht, J., von Heijne, G. & Brunak, S. (1996). Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins: Struct. Funct. Genet. 24, 165–177.
15. Bairoch, A. & Apweiler, R. (2000). The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res. 28, 45–48.
16. Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409–417.
17. Karamyshev, A. L., Karamysheva, Z. N., Kajava, A. V., Ksenzenko, V. N. & Nesmeyanova, M. A. (1998). Processing of Escherichia coli alkaline phosphatase: role of the primary structure of the signal peptide cleavage region. J. Mol. Biol. 277, 859–870.
18. Paetzel, M., Karla, A., Strynadka, N. C. & Dalbey, R. E. (2002). Signal peptidases. Chem. Rev. 102, 4549–4580.
19. Lagueux, M., Kromer, E. & Girardie, J. (1992). Cloning of a Locusta cDNA encoding neuroparsin A. Insect Biochem. Mol. Biol. 22, 511–516.
20. Palmer, D. J. & Christie, D. L. (1990). The primary structure of glycoprotein III from bovine adrenal medullary chromaffin granules. Sequence similarity with human serum protein-40,40 and rat Sertoli cell glycoprotein. J. Biol. Chem. 265, 6617–6623.
21. Cedano, J., Aloy, P., Perez-Pons, J. A. & Querol, E. (1997). Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266, 594–600.
22. Chou, K. C. (2001). Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Genet. 43, 246–255.
23. Potvin, C., Leclerc, D., Tremblay, G., Asselin, A. & Bellemare, G. (1988). Cloning, sequencing and expression of a Bacillus bacteriolytic in Escherichia coli. Mol. Gen. Genet. 214, 241–248.
24. Takemaru, K., Mizuno, M., Sato, T., Takeuchi, M. & Kobayashi, Y. (1995). Complete nucleotide sequence of a skin element excised by DNA rearrangement during sporulation in Bacillus subtilis. Microbiology, 141, 323–327.
25. Nagaso, H., Saito, S., Saito, H. & Takahashi, H. (1988). Nucleotide sequence and expression of a Streptomyces griseosporeus proteinaceous alpha-amylase inhibitor (HaimII) gene. J. Bacteriol. 170, 4451–4457.
26. Aronson, A. I., Song, H. Y. & Bourne, N. (1989). Gene structure and precursor processing of a novel Bacillus subtilis spore coat protein. Mol. Microbiol. 3, 437–444.
27. Bourne, N., FitzJames, P. C. & Aronson, A. I. (1991). Structural and germination defects of Bacillus subtilis spores with altered contents of a spore coat protein. J. Bacteriol. 173, 6618–6625.
28. Driks, A. (1999). Bacillus subtilis spore coat. Microbiol. Mol. Biol. Rev. 63, 1–20.
29. Driks, A. (2002). Maximum shields: the assembly and function of the bacterial spore coat. Trends. Microbiol. 10, 251–254.
30. Chada, V. G., Sanstad, E. A., Wang, R. & Driks, A. (2003). Morphogenesis of bacillus spore surfaces. J. Bacteriol. 185, 6255–6261.
31. Thomas, G. (2002). at the cutting edge: from protein traffic to embryogenesis and disease. Nature Rev. Mol. Cell Biol. 3, 753–766.
32. Duckert, P., Brunak, S. & Blom, N. (2004). Prediction of cleavage sites. Protein Eng. Des. Sel. 17, 107–112.
33. Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural Syst. 8, 581–599.
34. Berks, B. C., Sargent, F. & Palmer, T. (2000). The Tat protein export pathway. Mol. Microbiol. 35, 260–274.
35. Palmer, T. & Berks, B. C. (2003). Moving folded proteins across the bacterial cell membrane. Microbiology, 149, 547–556.
36. Miller, J. R., Kovacevic, S. & Veal, L. E. (1987). Secretion and processing of staphylococcal nuclease by Bacillus subtilis. J. Bacteriol. 169, 3508–3514.
37. Kristensen, T., Ogata, R. T., Chung, L. P., Reid, K. B. & Tack, B. F. (1987). cDNA structure of murine C4b-binding protein, a regulatory component of the serum . Biochemistry, 26, 4668–4674.
38. Horton, P. & Nakai, K. (1997). Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5, 147–152.
39. Nakai, K. & Horton, P. (1999). PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci. 24, 34–36.
40. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580.
41. Reczko, M., Fiziev, P., Staub, E. & Hatzigeorgiou, A. (2002). Finding signal peptides in human protein sequences using recurrent neural networks. In Algorithms in Bioinformatics WABI 2002, Springer, Heidelberg pp. 60–67.
42. Käll, L., Krogh, A. & Sonnhammer, E. L. (2004). A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 338, 1027–1036.
43. Bendtsen, J. D., Jensen, L. J., Blom, N., von Heijne, G. & Brunak, S. (2004). Feature based prediction of non-classical protein secretion. Protein Eng. Des. Sel. In the press.
44. Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.

Edited by F. E. Cohen

(Received 4 March 2004; received in revised form 17 May 2004; accepted 17 May 2004)