Refining Neural Network Predictions for Helical Transmembraneproteins by Dynamic Programming
Total Page:16
File Type:pdf, Size:1020Kb
From: ISMB-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. Refining neural network predictions for helical transmembraneproteins by dynamic programming BurkhardRost 0, Rita Casadio, and Piero Fariselli EMBL,69012 Heidelberg, Germany Lab. of Biophysics, Dep. of Biology EBI, Hinxton, Cambridge CB101RQ, U.K. Univ. of Bologna, 40126 Bologna, Italy Rost @embl-heidelberg.de Casadio @kaiser.alma.unibo.it Abstract time. Currently, the gap between the number of known For transmembrane proteins experimental determina- protein sequences and protein three-dimensional (3D) tion of three-dimensional structure is problematic. structures is still increasing (Rost 1995). However, However, membraneproteins have important impact experimental structure determination is becomingmore of for molecular biology in general, and for drug design a routine (Lattman 1994). Thus, within the next decades in particular. Thus, prediction method are needed. we may know the three-dimensional (3D) structure for Here we introduce a method that started from the most proteins. Even in such an optimistic scenario, output of the profile-based neural network system experimental information is likely to remain scarce for PHDhtm(Rost, et al. 1995). Instead of choosing the integral membraneproteins. These challenge experimental neural network output unit with maximalvalue as pre- structure determination: X-ray crystallography is diction, we implemented a dynamic programming-like problematic as the hydrophobic molecules hardly refinement procedure that aimed at producing the best crystallise; and for nuclear magnetic resonance (NMR) model for all transmembranehelices compatible with spectroscopy transmembraneproteins are usually too long. the neural network output. The refined prediction was Today, 3D structure is known for only six membrane used successfully to predict transmembranetopology proteins (Rost, et al. 1995). Cantheory predict structural based on an empirical rule for the charge difference aspects for integral membraneproteins? between extra- and intra-cytoplasmic regions (posi- Theoretical predictions for helical transmembrane tive-inside rule). Preliminary results suggest that the proteins. The prediction of structural aspects is simpler for refinement was clearly superior to the initial neural membraneproteins than for globular proteins as the lipid network system; and that the method predicted all bilayer imposes strong constraints on the degrees of transmembrane helices correctly for more proteins freedom of 3D structure (Taylor, et al. 1994). Most than a previously applied empirical filter. The resul- prediction tools focus on helical transmembraneproteins ting accuracy in predicting topology was better than for which some experimental information about 3D 80%. Although a more thorough evaluation of the structure is available (Manoil & Beckwith 1986, method on a larger data set will be required, the Hennessey & Broome-Smith 1993). Methods have beer. results comparedfavourably with alternative methods. designed to predict the locations of transmembranehelices The results reflected the strength of the refinement procedure which was the successful incorporation of global information: whereas the residue preferences output by the neural network were derived from topology: out extra- stretches of 17 adjacent residues, the refinement procedure involved constraints on the level of the entire protein. Introduction ~r’ ~ intra- N cytoplasmic Given the rapid advance of large-scale gene-sequencing topology: in projects (Fleischmann, et al. 1995), most protein sequences of key organismswill probably be knownin a few years’ Fig. 1. Definition of topology. In one of the two -~ Correspondingauthor. knownclasses of membraneproteins, typically apolar Abbreviations: 3D, three-dimensional; HTM,transmembrane helices are embeddedin the lipid bilayer oriented helix (also abbreviated by H in figures, L used for non- perpendicular to the membranesurface. The helices can transmembraneregions); PDB,Protein Data Bankof experimen- be regarded as moreor less rigid cylinders. "Topology’ tally determined3D structures of proteins; PHDhtrn,Profile is defined as out whenthe N-term(first residue) starts based neural network prediction of helical transmembrane on the extra-cytoplasmicside and as/n if it starts on the regions; SWISS-PROT,data base of knownprotein sequences. intra-cytoplasmic side. 192 ISMB--96 (Kyte & Doolittle 1982, Engelman,et al. 1986, von Heijne accuracy), or would it suffice to predict more trans- 1986, von Heijne 1992, Edelman1993, Jones, et al. 1994, membranehelices correctly (per-segment accuracy; (Rost, Persson & Argos 1994, Donnelly & Findlay 1995), and the et al. 1994, Rost, et al. 1995)? orientation of transmembrane helices with respect to Here we introduce a conceptually simple method that themembrane(dubbed topology, Fig. 1; (von Heijne 1986, started from the output of the profile-based neural network Hartmann, et al. 1989, von Heijne 1989, Sipos &von system PHDhtm(Fig. 2). The preferences for residues Heijne 1993, Jones, et al. 1994, Casadio & Fariselli 1996). be in transmembrane helices output from the neural If the transmembrane helix locations and topology are network system were post-processed by a dynamic known with sufficient accuracy, 3D structure can be programming-like algorithm. A similar algorithm has been successfully predicted for the membrane spanning used before to improve statistics-based predictions for segments by an exhaustive search of the entire structure transmembranehelices (Jones, et al. 1994). The algorithm space (Taylor, et al. 1994). constitutes an examplefor a problem-specific improvement Further improvementof prediction accuracy necessary. ?. of neural network output. The resulting predictions were There are two incentives to aim at improving the accuracy used to predict topology based on charge differences of predicting transmembrane helix locations. (1) between extra- and intra-cytoplasmic residues (Fig. 1). general, current prediction methods are probably not The method is described in detail; and some preliminary accurate enoughto be used for the 3D prediction proposed results are given. by Taylor and colleagues (Taylor, et al. 1994). (2) simple meansto predict topology (Fig. 1) is the positive- inside rule: positively charged residues are abundant in Materials and methods periplasmic loops of membraneproteins (yon Heijne 1986, von Heijne & Gavel 1988, Hartmann, et al. 1989, Boyd & Beckwith 1990, von Heijne 1992). To successfully apply Database and evaluation of method this rule, a correct prediction of the non-transmembrane Selection of proteins. Webased our analyses on a set of 68 regions is required. Woulda better prediction of topology proteins for which experimental information about the require to predict moreaccurately each residue (per-residue locations of transmembrane helices is annotated in the SWISS-PROTdatabase (von Heijne 1992, Bairoch Boeckmann 1994). The locations of transmembrane helices used are listed in our previous work (Rost, et al. 1995). The observed topology was taken from SWISS- Input: helical membraneprotein sequenc~ PROTor (Jones, et al. 1994)(Jones, et al. 1994; names of unknownhelix location /} listed in Table 2). Note that the set was the same as used previously (Rost, et al. 1995), except for melittin, that was searchI similar | ~ predict ] excludedfor the purpose of topology prediction. sequences Cross-validation test. For the prediction of I transmembranepropensities by the neural network system BLAST MaxHom PHDhtm~- PHDhtm (PHDhtm;(Rost, et al. 1995), the set of 68 transmembrane filter proteins (Table 2) was divided into 55 proteins used for ntermedtate output: preferences for each resi~ training the networks and 13 used for deriving the propensities (test set). This was repeated five times (five- due to be in a transmembranehelix,,) fold cross-validation) such that each protein was in one test set. The sets were chosen such that no protein in the multiple alignments used for training the networks had Fig. 2. Neural network predictions for transmem- more than 25%pairwise sequence identity to any protein in brahe helix preferences (PHDhtm).Input was a the multiple alignments of the proteins for which the protein sequence; output were the locations of propensities were predicted. The cross-validation transmembrane helices. (1) Similar sequences were procedure yields estimates for prediction accuracy that are searched for in SWISS-PROT(Bairoch & Boeckmann likely to hold for proteins of yet unknowntopology (Rost 1994) by the fast alignment algorithm BLAST(Altschul, & Sander 1993, Rost & Sander 1995, Rost & Sander et al. 1990)). (2) The resulting sequenceswere 1996). aligned by the multiple alignment algorithm MAXHOM Measuring prediction accuracy. We compiled results (Sander & Schneider 1991). (3) Profiles derived for per-residue scores capturing the accuracy of predicting the alignment were input into the neural network system HTMplacement (Table 1); and per-segment scores PHDhtm(Rost, et al. 1995) . (4) The final output capturing the accuracy of predicting entire HTM’s PHDhtmwere the locations of transmembranehelices correctly. Due to the length of HTM’sand the prediction assigned according to the output unit with maximal accuracy, the definition of segment-based scores was value (winner-takes-all). straightforward (in contrast to the case of secondary structure predictions for globular proteins (Rost, et al. Rost 193 1994)). We regarded a transmembrane helix