Improving the Performance of Rosetta Using Multiple Sequence Alignment
Total Page:16
File Type:pdf, Size:1020Kb
PROTEINS:Structure,Function,andGenetics43:1–11(2001) ImprovingthePerformanceofRosettaUsingMultiple SequenceAlignmentInformationandGlobalMeasures ofHydrophobicCoreFormation RichardBonneau,1 CharlieE.M.Strauss,2 andDavidBaker1* 1DepartmentofBiochemistry,Box357350,UniversityofWashington,Seattle,Washington 2BiosciencesDivision,LosAlamosNationalLaboratory,LosAlamos,NewMexico ABSTRACT Thisstudyexplorestheuseofmul- alwayssharethesamefold.8–11 Eachsequencealignment tiplesequencealignment(MSA)informationand canthereforebethoughtofasrepresentingasinglefold, globalmeasuresofhydrophobiccoreformationfor witheachpositionintheMSArepresentingthepreference improvingtheRosettaabinitioproteinstructure fordifferentaminoacidsandgapsatthecorresponding predictionmethod.Themosteffectiveuseofthe positioninthefold.BadretdinovandFinkelsteinand MSAinformationisachievedbycarryingoutinde- colleaguesusedthetheoreticalframeworkoftherandom pendentfoldingsimulationsforasubsetofthe energymodeltoarguethatthedecoydiscrimination homologoussequencesintheMSAandthenidentify- problemisnotcurrentlypossibleunlesstheinformationin ingthefreeenergyminimacommontoallfolded theMSAisusedtosmooththeenergylandscape.12,13 sequencesviasimultaneousclusteringoftheinde- Usingathree-dimensional(3D)cubiclatticemodelfora pendentfoldingruns.Globalmeasuresofhydropho- polypeptide,theseinvestigatorsdemonstratedthataverag- biccoreformation,usingellipsoidalratherthan ingascoringfunctioncontainingrandomerrorsover sphericalrepresentationsofthehydrophobiccore, severalhomologoussequencesallowedthecorrectfoldto arefoundtobeusefulinremovingnon-nativeconfor- beseparatedfromtherestoftheaveragedenergydistribu- mationsbeforeclusteranalysis.Throughthiscombi- tion,inspiteoftherandomerrorsintroducedintothe nationofMSAinformationandglobalmeasuresof 14,15 proteincoreformation,wesignificantlyincrease energyfunction.Keasaretal. showedthatcoupling theperformanceofRosettaonachallengingtestset. thefoldingofseveralhomologoussequencesonatetrahe- Proteins2001;43:1–11.©2001Wiley-Liss,Inc. drallatticesignificantlyimprovedtheperformanceoftheir MonteCarloroutinefora36-residuepeptidehormone.In Keywords:abinitiostructureprediction;homol- thisstudy,wetestfourmethodsforusingMSAinforma- ogy;Rosetta tiontoassistinabinitiostructureprediction.Themethod wefindmosteffectiveallowseachhomologoussequenceto INTRODUCTION foldindependentlyofotheralignedsequences,onlyusing Recentblindtests(CASPIII)havedemonstratedsignifi- theinformationintheMSAaftertheindependenthomolo- cantprogressinthefieldofabinitioproteinfoldpredic- gousfoldingrunsarecompleted. tion.1,2 Althoughencouraging,thisprogressstillleaves Oneofthemajorassumptionsbehindmanyabinitio thefieldwithoutmethodsthatcanreliablypredictprotein foldingpotentialscurrentlyusedisthatthefreeenergyof tertiarystructureintheabsenceofhomologytoasequence aconformationcanbedescribedasasumofseveral whosestructureisknown.Oneofthemajorhurdlesthat pair-additivetermsmeanttodescribespecificinteractions mustbeovercomeinthedevelopmentofconsistently presentinthemolecule.Therearemanycases,however, reliableabinitioprotocolsisthedifficultyofdiscriminat- forwhichthisfundamentalassumptionislikelytobein ingnear-nativemodelsfromincorrectmodels.3–6 The error,especiallycasesinvolvingentropies.16,17 Solvation methodourgrouphasdeveloped,Rosetta,isableto freeenergiesarelargelydominatedbyentropictermsand generatelowroot-mean-squaredeviation(RMSD)struc- arethereforenotwelldescribedbypairadditiveterms. ϭ turesformostsmallproteins(good 3–7.5ÅRMSD, Thenearlyubiquitouspresenceofwell-formedsingle ϭϽ small 100residues),butitisnotalwayspossibleto hydrophobiccoresinsmallproteinssuggestsusingmea- recognizethesestructuresamidstthelargerpopulationof suresthatexplicitlymonitorhydrophobiccoreforma- 7 incorrectdecoys. Thisworkdealswiththisproblemof tion.4,18,19 Wehavedevelopedthreeglobalfeaturesthat recognitionviatwomainapproaches:theuseofmultiple togetherscreenfornon-nativecorepackingandtopology. sequencealignment(MSA)informationandtheuseof Unlikepreviousglobalfeaturesdesignedtorecognize globalmeasuresofhydrophobiccoreformation. well-formedhydrophobiccores,ourfeaturesuseanellipsoi- MSAscontainagreatdealofinformationnotavailable topredictionmethodsthatuseonlyasinglesequence. Centraltothemainmethoddescribedinthisarticleisthe *CorrespondenceshouldbeaddressedtoDB;e-mail:dabaker@ empiricalobservationthattwosequencessharing25% u.washington.edu sequenceidentity,formorethan60aminoacids,almost ©2001WILEY-LISS,INC. 2 R. BONNEAU ET AL. dal rather than a spherical representation, allowing them with paired  strands and buried hydrophobic residues. A to recognize a broader range of native core arrangements. total of 1,000 independent simulations are carried out In the present study, we demonstrate that recognition of (starting from different random number seeds) for each native-like structures produced by Rosetta can be en- query sequence, and the resulting structures are clustered hanced using MSA information and our measures of as described in Materials and Methods and as previously hydrophobic core assembly. Using a combination of these reported.21 The most reliable selection method, before this two approaches has allowed us to produce good models for study, was simply to choose the centers of the largest 15 of the 18 query sequences used to test the methods clusters as the highest confidence models.22 These cluster described in this work. Most of this increase in perfor- centers are then rank-ordered according to the size of the mance is the result of our novel use of MSA information. clusters they represent, with the cluster centers represent- The global features are used to filter the decoy populations ing the largest clusters representing the highest confi- before the analysis of the homologous decoy populations dence models. This protocol produces good models with via our clustering procedure, ridding the analysis of all rank 25th or better for 5 of the 18 queries folded in this recognizably misfolded structures, and thus offer a smaller study. In the larger set, the performance is somewhat but significant improvement in Rosetta’s performance. better (as the 18-protein set was chosen to be a challenging set) and ϳ40% of small proteins (Ͻ100 residues) produce RESULTS AND DISCUSSION good models with the protocol described above. Before Test Set Selection clustering, most structures produced by Rosetta are incor- Initial tests of Rosetta were performed on a test set of 70 rect (i.e., good structures account for less than 10% of the proteins less than 120 amino acids in length as described conformations produced); for this reason, we refer to elsewhere (Simons et al., J Mol Biol, in press). For testing conformations generated by Rosetta as decoys. The prob- more computationally intensive methods, 18 query se- lem of discriminating between good and bad decoys in quences were chosen from the larger set of 70. The smaller Rosetta populations is the primary problem addressed in test set includes five positive controls, for which good this work. models were generated using Rosetta and selected with our clustering routine. Ten of the 18 structures folded were The Use of Global Features for Filtering Large structures for which Rosetta could produce good models Sets of Decoys Ͻ ( 7 Å RMSD) that were not identified by clustering One of the problems we encountered during CASPIII, (presumably because they occurred too rarely in the over- and during the tests of Rosetta immediately after CAS- all population). The remaining three were cases for which PIII, was that the sequence dependent terms in Rosetta’s Rosetta did not produce any good models. Any improve- energy function (the environment score and the pair score) ment of the method must rescue some of the 13 cases for allowed topologies that formed multiple small hydrophobic which no good models were selected while preserving the cores, as opposed to a single unified hydrophobic core. For performance of the positive controls. The set contains 10 proteins of the size attempted, the probability of having ␣   ␣ / ,5 , and 3 class proteins with a mean length of 79 multiple domains is low,23 and most hydrophobic cores can residues. Because the main improvement to the method be roughly described by a single ellipsoid. In an effort to involved homologous sequence information, the smaller address this problem and other commonly encountered test set includes only sequences for which at least two errors in decoys generated by Rosetta, we have developed sequences, not highly correlated to the query or each other, three measures of hydrophobic core formation loosely were aligned. based on a micelle model of proteins, which requires that hydrophobic residues partition to the interior of a protein The Rosetta Method and that the core is uniformly surrounded by backbone The basic method used to generate and select models has and hydrophilic residues. These features are evaluated been previously described but will be reviewed briefly using a method that partitions conformations into inner, because of its importance as a starting point for the middle, and outer ellipsoidal spaces for the purpose of following discussion.7,20 One of the fundamental assump- recognizing single well-formed hydrophobic cores, as illus- tions underlying Rosetta is that the distribution of confor- trated in Figure 1. The first feature, the core score, mations sampled for a given nine residue segment of the measures the extent to which hydrophobic side-chains chain is reasonably well approximated by the distribution partition to the central region