100000 Protein Structures for the Biologist
Total Page:16
File Type:pdf, Size:1020Kb
© 1998 Nature America Inc. • http://structbio.nature.com meeting review 100,000 protein structures for the biologist Andrej Sali˘ Structural genomics promises to deliver experimentally determined three-dimensional structures for many thousands of protein domains. These domains will be carefully selected, so that the methods of fold assignment and comparative protein structure modeling will result in useful models for most other protein sequences. The impact on biology will be dramatic. Recently, some 200 structural biologists archiving and annotation of the new struc- on the order of 10,000. As an alternative, it from both academia and industry gath- tures. Several pilot projects in structural has also been suggested that 100,000 protein ered in Avalon, New Jersey to explore the genomics that implement all of these steps structures need to be determined by experi- science and organization of the new field are already underway (Table 1). It is feasible ment (David Eisenberg, UCLA); this would of structural genomics1. The aim of the to determine at least 10,000 protein struc- allow calculation of models with higher structural genomics project is to deliver tures within the next five years. No new dis- accuracy than is possible with only 10,000 structural information about most pro- coveries or methods are needed for this known structures. After a list of the 30-seq teins. While it is not feasible to determine task. The average cost per structure is families is obtained, the representatives will the structure of every protein by experi- expected to decrease significantly from be picked and prioritized. A variety of dif- ment, useful models can be obtained by ~$200,000 per structure to perhaps less ferent criteria likely will be used for this fold assignment and comparative model- than $20,000 per structure. These savings task; for example, size of the family, biologi- ing for those protein sequences that are will be achieved by the economies of scale, cal knowledge about the family, distribu- related to at least one known protein new or improved methods, and their paral- tion of the members among various structure. Structure determination of lelization in most of the steps in the process. organisms, the pharmaceutical relevance, 10,000 properly chosen proteins should the likelihood of a successful structure http://structbio.nature.com • result in useful three-dimensional models Target selection determination, and so forth . for hundreds of thousands of other pro- The targets for structure determination A preliminary analysis suggests that tein sequences. In other words, structural generally will be individual domains rather among recent protein structures deter- genomics will put each protein within a than multi-domain proteins (Chris Sander, mined in the hope of discovering a new comparative modeling distance of a Millenium Information). Such targets will fold, more than two-thirds actually have known protein structure (Eaton Lattman, be easier to determine by X-ray crystallog- exhibited an already known fold (Steven Johns Hopkins University). raphy or NMR spectroscopy than the more Brenner, Stanford). While this unveils Knowing the structures of biological flexible multi-domain proteins, although it weaknesses in the fold assignment meth- macromolecules is almost always useful would be beneficial to determine whole ods, it is not a waste of resources. In prac- for predicting, interpreting, modifying proteins whenever possible. Target selection tical terms, the structure of a protein is 1998 Nature America Inc. © and designing their functions. The high- will begin by pairwise comparison of all worth determining if we do not know it, throughput, coordinated structure deter- protein sequences, followed by clustering for whatever reason. In fact, it may even be mination efforts should be more efficient into groups of domains. The targets for the an advantage when the newly determined for obtaining 10,000 structures than the structural genomics project will be the rep- protein is related to other known struc- current distributed and largely uncoordi- resentatives of the groups without struc- tures because such structure is likely to be nated efforts in individual structural biol- turally defined members. The groups of more informative about its function than ogy laboratories. Structural genomics domains should contain members that in the case of structural ‘orphans’. will build on the human genome project share at least ~30% sequence identity (30- Given a fixed number of experimentally and will be integrated with the functional seq families), rather than those that corre- determined protein structures, the quality genomics project. Here, I review the spond to superfamiles or fold families. and the number of models for the rest of recent meeting in Avalon. Due to space This relatively high granularity is needed the proteins will be influenced strongly by restrictions, the review omits many of the because of the limitations in the current the methods for fold assignment and presentations. However, a complete list of methods for comparative modeling and for comparative protein structure modeling. speakers may be found on the meeting inferring function from structure. Errors in The impact of expected improvements in web site1. the comparative models become relatively these methods may be equivalent to sever- The structural genomics project will small as the sequence identity increases al years of the experimental protein struc- deliver improved methods and a process above 30% (Andrej Sali,˘ Rockefeller ture output (Andrej Sali).˘ A number of for high-throughput protein structure University) and function is generally con- different approaches to fold assignment determination. The process involves: served within families (John Moult, Center have been described, applied, and evaluat- (i) selection of the target proteins or for Advanced Research in Biotechnology, ed (Ron Levy, Rutgers University; George domains, (ii) cloning, expression, and Rockville, Maryland); Christine Orengo, Rose, John Hopkins University; Barry purification of the targets, (iii) crystalliza- University College, London). Given the Honig, Columbia University; Manfred tion and structure determination by X-ray clustering into the 30-seq families, the total Sippl, CAME, Salzburg; David Eisenberg; crystallography or by nuclear magnetic res- number of targets for experimental struc- Eugene Koonin, NCBI; Steven Brenner). onance (NMR) spectroscopy, and (iv) ture determination has been estimated to be The methods that rely on multiple nature structural biology • volume 5 number 12 • december 1998 1029 © 1998 Nature America Inc. • http://structbio.nature.com meeting review sequence information, such as PSI- selenomethionyl proteins2. It is now possi- partial deuteration of protein samples, BLAST, appear surprisingly sensitive in ble to sustain a data collection throughput new magnets (up to 1 GHz), supercon- detecting remote relationships (Liisa of four structures per day on a single undu- ducting probes, the TROSY detection Holm, EMBL-EBI, Cambridge; Eugene lator beam line. Assuming 180 operational method, triple resonance methods for res- Koonin; Steven Brenner). days per year and that only one out of three onance assignment, spatial information data sets results in a useful final structure, from residual dipolar coupling, and soft- Cloning, expression and purification this gives a conservative estimate of ware for resonance assignments and struc- Obtaining protein samples for crystal- approximately 360 structures per year per ture determination. For example, lization trials and NMR measurements single undulator beam line. This means Montelione and colleagues have devel- likely will be the bottleneck in the struc- that only several dedicated undulated beam oped the program AUTOASSIGN to per- tural genomics process. Although the lines are needed over the course of five form automated backbone assignments in current methods are sufficiently power- years to achieve the goal of determining a few minutes of CPU time for proteins ful to begin the project, advances are 10,000 new protein structures. smaller than 200 residues. This tool, inte- expected in parallelization and automa- Tom Terwilliger (LANL) has developed grated with the program AUTO_STRUC- tion of the existing methods for cloning, the SOLVE computer program for TURE for automatic analysis of NOESY expression, and purification of proteins. obtaining electron density maps of pro- data and structure generation, decreased In addition, entirely new technologies teins directly from multiwavelength or the time needed for structure refinement may arise, such as an in vitro system that multiple isomorphous replacement X- of a Z-domain from several months to already has allowed researchers to pro- ray diffraction data. An optimization of a several hours. If the NMR data collection duce proteins up to concentrations of scoring function replaces a complex rule- for a 120 residue protein takes 4–5 weeks 6mgml–1 (Shigeyuki Yokoyama, Tokyo based process followed manually by crys- on a 600 MHz spectrometer and if auto- University). tallographers. This system has already mated data processing takes 1 day, 120 allowed crystallographers to obtain structures can be determined in 1 year by Crystallization and X-ray images of proteins within hours of the 12 machines (Montelione), approximately crystallography end of X-ray data collection. The proce- at a cost of using one synchrotron beam Stephen Burley (Rockefeller University) dure