Predicting the Beta-Helix Fold from Protein Sequence Data
Total Page:16
File Type:pdf, Size:1020Kb
JOURNAL OFCOMPUTATIONAL BIOLOGY Volume9, Number 2,2002 ©MaryAnn Liebert,Inc. Pp. 261–276 Predicting the Beta-Helix Fold from Protein Sequence Data LENORE COWEN, 1 PHIL BRADLEY, 2 MATTHEW MENKE, 2 JONATHAN KING, 3 andBONNIE BERGER 3 ABSTRACT Amethodis presented thatuses b-strand interactions topredict the parallel right-handed b-helix super-secondarystructural motifin protein sequences. Aprogramcalled BetaWrap implements this methodand is shownto score known b-helices abovenon- b-helices in the ProteinData Bank in cross-validation.It is demonstratedthat BetaWrap learns eachof the seven knownSCOP b-helix families, whentrained primarily on b-structures thatare not b-helices, togetherwith structural features ofknown b-helices fromoutside the family. BetaWrapalso predicts manybacterial proteins ofunknown structure tobe b-helices; in particular,these proteins serve as virulence factors,adhesins, andtoxins in bacterial patho- genesis andinclude cell surface proteins fromChlamydia and the intestinal bacterium He- licobacter pylori. The computationalmethod used here maygeneralize toother b-structures forwhich strand topologyand pro les ofresidue accessibility arewell conserved. Key words: parallel right-handedbeta helix,protein folding, protein motif recognition,statistical prediction,bacterial pathogenesis. 1.INTRODUCTION hispaper presents the rst computationalmethod that successfully predicts membership in a ¯- Tstructuralsuper-secondary fold family based on a protein’s aminoacid sequence. W eintroducethe BetaWrapprogram, which recognizes sequences whose folds belong to the parallel right-handed ¯-helix superfamily.There currently exist 12 known examples of theright-handed ¯-helixwhose crystal structures havebeen solved in the Protein Data Bank (PDB) (Berman et al.,2000).The SCOP (Murzin et al., 1995)classi cationsystem (version 1.53) places these known ¯-helicesinto seven different families and thoughthere are core structural similarities that categorize the ¯-helix,there is little sequence homology acrossstructures in different families (except the pectate and pectin lyases). Thus even multiple-alignment sequence-basedmethods such as PSI-BLAST (Altschul et al.,1997)cannot solve the ¯-helixrecognition problemacross many of thedifferent SCOP families.Threading methods (Sippl and W eitckus,1992; Jones et al.,1992;Bryant, 1996; Kelley et al.,2000;Sternberg et al.,1999)and hidden Markov methods such 1Departmentof EECS, TuftsUniversity, Medford, MA 02155. 2Departmentof Mathematics and Laboratory for Computer Science, Massachusetts Institute of T echnology,Cam- bridge,MA 02139. 3Departmentof Biology, Massachusetts Institute of T echnology,Cambridge, MA 02139. 261 262 COWEN ET AL. asHMMER (Eddy, 1998) also do not solve the ¯-helixrecognition problem across the different SCOP families;see Section 5. Ithas been known for sometime that amino acid residues that are close in space in a foldedthree- dimensionalprotein structure can exhibit marked statistical preferences. For example,consider the coiled- coil ®-helicalprotein motifs. The backbone coils in aprocessiveturn, giving a structuralrepeat every seven residues.A residueis closest in space to the residues that appear 1, 3, and 4awayin the sequence, C C C andstrong pairwise statistical correlations between residues appearing in these positions in the coiled coilwere identi ed by Berger et al. (1995).Because residues a xeddistance in sequence exhibit these correlationsin coiled coils Berger et al. (Berger et al., 1995; Wolf et al.,1997;Singh et al.,1999;Berger andSingh, 1997; Singh et al.,1998),could predict the coiled-coil fold from theamino acid sequence alone, bysliding a 30-residuewindow along the sequence and tabulating pairwise frequencies in these positions, ascompared to the pairwise frequencies in positive and negative examples of knowncoiled coils. Regions of ®-helixsecondary structure (with a slightlyless regular repeat) have also been successfully predicted bythese methods, again based on the fact that residues close in space are also close in sequence (Garnier et al., 1996). In ¯-structuralmotifs, as was notedby Lifson,Sanders and others (Lifson and Sanders, 1980; Hubbard andPark, 1995; Zhu and Braun, 1999), amino acid residues that are close in space also exhibit marked statisticalpreferences. These preferences have proven dif cultto exploit, however, because residues in stacking ¯-strandsthat are closein 3D and may be instrumental in the fold can be very far away(and a variablenumber of aminoacid residues apart) in the 1D sequence. Thus, in theabsence of a relatedsolved 3Dstructure or strong sequence homology with a solved3D structure, it seems quite dif cult to nd the importantcorrelations that could drive the fold. Even the best (local) secondary structure predictors such asPHD (Rostand Sander, 1993) and PSI-Pred (Jones,1999a,b) are better at correctly placing ®-helices than ¯-strands(Rost and Sander, 1993). Wechoseto target the ¯-helixas a good rst ¯-structuralmotif to attempt to predict computationally becauseit has a topologythat makes prediction of the interacting residues in the ¯-sheetsmore tractable, namelya long,processive fold. The fold is characterized by a repeatingpattern of parallel ¯-strands in atriangularprism shape (Y oder et al.,1993)(Fig. 1). The cross-section, or rung, of a ¯-helixconsists of three ¯-strandsconnected by variable-length turn regions (Fig. 2); the backbone folds up in a helical fashionwith ¯-strandsfrom adjacentrungs stacking on top of each other in a parallelorientation. While the known ¯-helicesvary in the number of complete rungs and in the lengths of the turn regions, the ¯-strandportions of the rungs have patterns of pleating and hydrogen bonding that are wellconserved acrossthe superfamily (Jenkins et al.,1998).Examination of the solved ¯-helixstructures (Y oderand Jurnak,1995; Kreisberg et al.,2000)together with the analysis of mutants defective in the folding of ¯-helices(Kreisberg et al.,2000)suggested that the interaction of thestrand side-chains in theburied core FIG. 1. Sideview of X-raycrystal structure of Pectatelyase C from Erwiniachrysanthemi (Rostand Sander, 1994), residues102-258; ¯-sheetB1 is shown in light gray, B2 in medium gray, and B3 in black. PREDICTING THEBET A-HELIX FOLD 263 FIG. 2. Topviewof a singlerung of a ¯-helix(residues 242– 263) of Fig.1, parsedby the algorithm into ¯-strands B1,B2, B3 and the intervening turns T1, T2, and T3. Residues parsed as ¯-strandare numbered and as turns are lettered.The alternating pattern of the strands before and after T2 is conserved across the superfamily. were criticaldeterminants of the fold. W eincorporatedthis interior packing emphasis in the development oftheprogram. Themethod we employhas two phases, a wrappingphase and a scoringphase. The wrapping phase canbest be understoodwith an analogywith old historical ciphers. As documentedas far backas Ancient Greece,secret coded messages were oncesent as a stripof letters,that, when wrapped around a tubeof the appropriatediameter, would align vertically to spell out words of the message. Wrapped incorrectly, they wouldspell out nonsense (see Fig.3). In the wrapping phase, we domuch the same thing with an amino acidsequence; however, the complexity is increased by the fact that there are loops of various lengths, sothe sequence does not wrap tight.The algorithm tries to wrap theunknown sequence in all plausible waysand see if anyof these wraps make“ sense.”The“ sense”is determined by the scoring phase whose principalcomponent is based, just as in the coiled-coil algorithm, on examining pairwise frequencies of speci cpairsof amino acids. Instead of these pairs being a xeddistance apart in sequence, they now arethe pairs that stack in the 3D wrapping (and may, in fact, be far awayand a variabledistance apart FIG. 3. Anancient cipher. The strip of letters when wrapped around the dowel can be made to spell out vertically thewords “ showcase the word that you see.” Our method wraps up a sequence(with a variablewidth dowel) in all possibleways to see if it can attain plausible vertical alignments. 264 COWEN ET AL. insequence.) Thus, our method can be seen as a spatialgeneralization of the PairCoil method of Berger et al. (1995). Our BetaWrapprogram scores the known ¯-helicesahead of all the non- ¯-helixproteins in a stringent cross-validationperformed against a nonredundantversion of thePDB. The ¯-helixsuperfamily is divided inthe SCOP (Murzin et al.,1995)database into seven families of closely related proteins. 1 Therefore,a sevenfoldcross-validation was performed,where all the proteins in the same family were leftout of the trainingset in each experiment. BetaWrap was ableto identify known ¯-helixproteins from onefamily, whenonly trained on ¯-helixproteins from adifferentSCOP family.In addition, the program makes reasonablygood predictions of thealignment between sequence and structure in the known structures. Weremarkthat even though BetaWrap has access to the other six ¯-helixfamilies in each cross- validation,it makes very parsimonious use of their sequence information. The ¯-helicesin the training setyield the structural template, and also set cutoffs for thegap lengths between ¯-strands,and a lter for ®-helicalcontent. The stacking preferences that are at thecore of themethod are initially built from a databaseof amphipathic ¯-structuresthat are not ¯-helices(see Section3). Incomparison, we ranthe iterative sequence-based method PSI-BLAST (Altschul et al., 1990), the publiclyavailable threading program Threader