Accurate and Automated Classification of Protein Secondary Structure with PsiCSI

LING-HONG HUNG AND RAM SAMUDRALA
Computational Genomics, Department of Microbiology, University of Washington, Seattle, Washington 98109, USA
(RECEIVED July 2, 2002; FINAL REVISION October 18, 2002; ACCEPTED October 31, 2002)

Abstract

PsiCSI is a highly accurate and automated method of assigning secondary structure from NMR data, which is a useful intermediate step in the determination of tertiary structures. The method combines information from chemical shifts and protein sequence using three layers of neural networks. Training and testing were performed on a suite of 92 proteins (9437 residues) with known secondary and tertiary structure. Using a stringent cross-validation procedure in which the target and homologous proteins were removed from the databases used for training the neural networks, an average 89% Q3 accuracy (per residue) was observed. This is an increase of 6.2% and 5.5% (representing 36% and 33% fewer errors) over methods that use chemical shifts (CSI) or sequence information (Psipred) alone. In addition, PsiCSI improves upon the translation of chemical shift information to secondary structure (Q3 = 87.4%) and is able to use sequence information as an effective substitute for sparse NMR data (Q3 = 86.9% without 13C shifts and Q3 = 86.8% with only Hα shifts available). Finally, errors made by PsiCSI almost exclusively involve the interchange of helix or strand with coil and not helix with strand (<2.5 occurrences per 10000 residues). The automation, increased accuracy, absence of gross errors, and robustness with regard to sparse data make PsiCSI ideal for high-throughput applications, and should improve the effectiveness of hybrid NMR/de novo structure determination methods. A Web server is available for users to submit data and have the assignment returned.
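As a worked check of the "fewer errors" figures in the abstract, the error rate can be taken as 100% minus Q3. The CSI and Psipred accuracies used here (82.8% and 83.5%) are simply 89.0% minus the stated gains of 6.2% and 5.5%; treating the relative error reduction this way is an assumption about how the percentages were derived, although it reproduces them.

```latex
\[
\frac{(100 - 82.8) - (100 - 89.0)}{100 - 82.8} = \frac{6.2}{17.2} \approx 0.36
\qquad
\frac{(100 - 83.5) - (100 - 89.0)}{100 - 83.5} = \frac{5.5}{16.5} \approx 0.33
\]
```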
Keywords: NMR; chemical shifts; secondary structure; neural networks

Supplemental material: See www.proteinscience.org.

Reprint requests to Ram Samudrala, Computational Genomics, Department of Microbiology, University of Washington, Box 357242, Rosen Building, 960 Republican St., Seattle, WA 98109, USA; e-mail: [email protected]; fax: (206) 732-6055.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0222303.
Protein Science (2003), 12:288–295. Published by Cold Spring Harbor Laboratory Press. Copyright © 2003 The Protein Society

The flood of data from the genomic sequencing projects has inspired structural genomic projects aimed at determining all of the possible protein folds (Burley 2000; Brenner 2001). Although the major methodology being used in these projects is X-ray crystallography, NMR is also being developed as an alternative for high-throughput applications (Montelione 2001). One of the major bottlenecks in NMR structure determinations is in the interpretation and analysis of the spectral data, which, with the possible exception of chemical-shift assignment (Bailey-Kellogg et al. 2000; Moseley et al. 2001), still requires considerable human intervention. One promising approach to this problem is to couple theoretical simulations with NMR methods to reduce the amount of data, effort, and time required to determine the fold of a protein (Delaglio et al. 2000; Rohl and Baker 2002). Automated and accurate secondary structure assignments are necessary for these methods to be effective.

Secondary structure from chemical shifts (CSI)

The first step of any NMR structure determination is the assignment of chemical shifts (CSI). Because this is also the only step that has been partially automated, a considerable amount of effort has been expended in translating chemical shifts into structural information (Wishart et al. 1992; Wishart and Sykes 1994; Cornilescu et al. 1999; Bonvin et al. 2001). There is a fairly simple, if noisy, relationship between secondary structure and the chemical shifts of certain nuclei (Spera and Bax 1991; Wishart et al. 1991). For example, Hα chemical shifts are higher than average (downfield) in extended structures and lower than average (upfield) in helices. The same is true for 15N and Cβ shifts, whereas the opposite relationship holds for C and Cα shifts. To exploit this information, CSI (Chemical Shift Index) (Wishart et al. 1992; Wishart and Sykes 1994) assigned three indices, −1, 0, and 1, depending on whether the chemical shift was near the average value or at one of the extremes. Consecutive occurrences of like indices were used to identify the presence of secondary structure. To further increase accuracy, a jury system averaged assignments from multiple chemical shifts (C, Cα, Cβ, and Hα) to arrive at a consensus assignment.

Secondary structure from sequence (Psipred)

Early secondary structure prediction methods relied upon database (Chou and Fasman 1974; Garnier et al. 1978) or theoretically derived propensities (Lim 1974) for residue types to be in the three secondary structure states, with Q3 accuracies (i.e., the percentage of the total number of residues correctly assigned to the three secondary structure states) in the 60% range. The current generation of methods exploits the information from multiple alignments to further enhance the accuracy (Krogh et al. 1994; Rost 1996; Jones 1999), which now approaches 80%. The use of neural nets, as in PHD (Rost 1996) and Psipred (Jones 1999), to interpret the large amount of data has also been instrumental in increasing the accuracy. One of the most accurate methods, Psipred, uses neural nets to convert PsiBlast (Altschul et al. 1997) profile data to secondary structure propensities. A second set of neural nets then takes into account local interactions to smooth the resulting secondary structure predictions and further increase accuracy.

Secondary structure from chemical shifts and sequence (PsiCSI)

PsiCSI combines both the chemical shift-based and sequence-based methods to further increase the accuracy of secondary structure assignments. It is also designed to best utilize whatever data is available. PsiCSI begins by refining the CSI methodology. Rather than three indices, three separate potentials ranging from 0 to 1 are assigned to reflect the relative likelihood of a given chemical-shift value being associated with a given secondary structure state. Like CSI, PsiCSI reduces noise by polling nearby shifts. PsiCSI examines a small window of shifts (three residues) centered around the residue in question. Potentials derived from these shifts, along with the estimated residue-dependent reliabilities (i.e., probability of the assignment being correct) of these potentials, are fed into a first layer of neural networks to derive a second set of refined potentials. Like CSI, multiple shifts are used to further increase accuracy. Additional information from 15N shifts and from Psipred predictions is also used. Rather than utilizing a simple jury system, PsiCSI trains a second layer of neural networks. Every possible combination of the available data for the residue (i.e., refined potentials from the first layer of networks and Psipred potentials) is fed into separate neural nets. Reliabilities for each combination are estimated and the best performing combination (for that residue type) is used to provide potentials for the next layer of neural networks. Finally, as with Psipred, the last neural net takes into account local interactions. This is similar to the first layer of neural nets used to average out chemical shift noise. However, because the accuracy of the inputs at this stage is much higher, it is possible to utilize a much larger window (17 vs. 3 residues) to take into account more subtle interactions between distant residues. The most reliable outputs from the second layer along with estimated reliabilities are fed into this final neural net to ultimately obtain the PsiCSI prediction.

Results and Discussion

PsiCSI significantly improves upon existing methods

PsiCSI achieves a Q3 accuracy of 89% (per residue), which is a significant improvement over the 82.8% (z > 12) accuracy observed for CSI and the 83.5% (z > 11) accuracy observed for Psipred. The CSI accuracy observed for our dataset differs from the originally published accuracy of 92%. However, this figure was obtained using a small sample of proteins on the basis of a combination of subjective identification of secondary structures from NMR data (not structures) and on crystal structures. In addition, the dataset was not jackknifed: 8 of the 16 proteins used to evaluate CSI were also part of the set of 12 proteins used to determine the indices. The observed accuracy of Psipred also differs from the stated accuracy (80%) by more than can be accounted for by random chance (z > 8). The test set may include proteins and/or homologs to proteins used to train Psipred's neural networks, which could account for the higher accuracy. However, the accuracy of the chemical shift alone version of PsiCSI (87.4%) indicates that the high accuracy of PsiCSI is not contingent upon the unusual accuracy of Psipred on the test set.

The distributions of Q3 accuracies of PsiCSI, CSI, and Psipred are shown in Figure 1. The distribution of PsiCSI accuracies is very tight, reflecting the consistency of the method. Some of the less accurate results come from large regions of coil being assigned as helix or extended (see Electronic Supplemental Material). It is possible that PsiCSI is detecting some residual structure in these regions. PsiCSI does better than CSI or Psipred in the majority of cases, as is expected from the average per-residue increase in accuracy

Table 1. Accuracy and reliability of PsiCSI, Psipred, and CSI

Method                 Overall    Accuracy (%)a         Reliability (%)b
                       (Q3) %     H      E      C       H      E      C
PsiCSI                 89.0       91.8   84.4   89.1    93.3   84.0   88.3
PsiCSI (shifts only)   87.4       90.5   80.2   88.3    92.1   82.9   86.2
PsiCSI (no 13C)        86.9       87.0   82.6   88.7    90.6   86.0   85.1
PsiCSI (no 13C/15N)    86.8       87.0   81.6   88.7    90.6   85.9   84.4
Psipred v2.3           83.5       88.7   79.3   81.8    83.1   77.2   86.6
CSI (consensus)        82.8       86.9   80.7   81.0    91.4   71.3   82.7

a Percentage of correct assignments of state/total number of residues that are actually in that state.
