© 1998 Nature America Inc. • http://structbio.nature.com

meeting review

100,000 structures for the biologist

Andrej Šali

Structural genomics promises to deliver experimentally determined three-dimensional structures for many thousands of protein domains. These domains will be carefully selected, so that the methods of fold assignment and comparative modeling will result in useful models for most other protein sequences. The impact on biology will be dramatic.

Recently, some 200 structural biologists from both academia and industry gathered in Avalon, New Jersey to explore the science and organization of the new field of structural genomics¹. The aim of the structural genomics project is to deliver structural information about most proteins. While it is not feasible to determine the structure of every protein by experiment, useful models can be obtained by fold assignment and comparative modeling for those protein sequences that are related to at least one known protein structure. Structure determination of 10,000 properly chosen proteins should result in useful three-dimensional models for hundreds of thousands of other protein sequences. In other words, structural genomics will put each protein within a comparative modeling distance of a known protein structure (Eaton Lattman, Johns Hopkins University).

Knowing the structures of biological macromolecules is almost always useful for predicting, interpreting, modifying and designing their functions. The high-throughput, coordinated structure determination efforts should be more efficient for obtaining 10,000 structures than the current distributed and largely uncoordinated efforts in individual structural biology laboratories. Structural genomics will build on the human genome project and will be integrated with the functional genomics project. Here, I review the recent meeting in Avalon. Due to space restrictions, the review omits many of the presentations. However, a complete list of speakers may be found on the meeting web site¹.

The structural genomics project will deliver improved methods and a process for high-throughput protein structure determination. The process involves: (i) selection of the target proteins or domains, (ii) cloning, expression, and purification of the targets, (iii) crystallization and structure determination by X-ray crystallography or by nuclear magnetic resonance (NMR) spectroscopy, and (iv) archiving and annotation of the new structures. Several pilot projects in structural genomics that implement all of these steps are already underway (Table 1). It is feasible to determine at least 10,000 protein structures within the next five years. No new discoveries or methods are needed for this task. The average cost per structure is expected to decrease significantly from ~$200,000 per structure to perhaps less than $20,000 per structure. These savings will be achieved by the economies of scale, new or improved methods, and their parallelization in most of the steps in the process.

Target selection

The targets for structure determination generally will be individual domains rather than multi-domain proteins (, Millenium Information). Such targets will be easier to determine by X-ray crystallography or NMR spectroscopy than the more flexible multi-domain proteins, although it would be beneficial to determine whole proteins whenever possible. Target selection will begin by pairwise comparison of all protein sequences, followed by clustering into groups of domains. The targets for the structural genomics project will be the representatives of the groups without structurally defined members. The groups of domains should contain members that share at least ~30% sequence identity (30-seq families), rather than those that correspond to superfamilies or fold families. This relatively high granularity is needed because of the limitations in the current methods for comparative modeling and for inferring function from structure. Errors in the comparative models become relatively small as the sequence identity increases above 30% (Andrej Šali, Rockefeller University) and function is generally conserved within families (John Moult, Center for Advanced Research in Biotechnology, Rockville, Maryland; Christine Orengo, University College, London). Given the clustering into the 30-seq families, the total number of targets for experimental structure determination has been estimated to be on the order of 10,000. As an alternative, it has also been suggested that 100,000 protein structures need to be determined by experiment (, UCLA); this would allow calculation of models with higher accuracy than is possible with only 10,000 known structures. After a list of the 30-seq families is obtained, the representatives will be picked and prioritized. A variety of different criteria likely will be used for this task; for example, size of the family, biological knowledge about the family, distribution of the members among various organisms, the pharmaceutical relevance, the likelihood of a successful structure determination, and so forth.

A preliminary analysis suggests that among recent protein structures determined in the hope of discovering a new fold, more than two-thirds actually have exhibited an already known fold (Steven Brenner, Stanford). While this unveils weaknesses in the fold assignment methods, it is not a waste of resources. In practical terms, the structure of a protein is worth determining if we do not know it, for whatever reason. In fact, it may even be an advantage when the newly determined protein is related to other known structures because such a structure is likely to be more informative about its function than in the case of structural ‘orphans’.

Given a fixed number of experimentally determined protein structures, the quality and the number of models for the rest of the proteins will be influenced strongly by the methods for fold assignment and comparative protein structure modeling. The impact of expected improvements in these methods may be equivalent to several years of the experimental protein structure output (Andrej Šali). A number of different approaches to fold assignment have been described, applied, and evaluated (Ron Levy, Rutgers University; George Rose, Johns Hopkins University; Barry Honig, Columbia University; Manfred Sippl, CAME, Salzburg; David Eisenberg; Eugene Koonin, NCBI; Steven Brenner). The methods that rely on multiple
sequence information, such as PSI-BLAST, appear surprisingly sensitive in detecting remote relationships (Liisa Holm, EMBL-EBI, Cambridge; Eugene Koonin; Steven Brenner).

Cloning, expression and purification

Obtaining protein samples for crystallization trials and NMR measurements likely will be the bottleneck in the structural genomics process. Although the current methods are sufficiently powerful to begin the project, advances are expected in parallelization and automation of the existing methods for cloning, expression, and purification of proteins. In addition, entirely new technologies may arise, such as an in vitro system that already has allowed researchers to produce proteins up to concentrations of 6 mg ml⁻¹ (Shigeyuki Yokoyama, Tokyo University).

Crystallization and X-ray crystallography

Stephen Burley (Rockefeller University) described how to test protein samples for their propensity to form useful crystals. Crystals have a higher chance of being formed when the protein species in solution is conformationally homogeneous. This can be evaluated by dynamic light scattering, as well as by limited proteolysis combined with mass spectrometry. For example, candidates that are monodisperse under standard aqueous conditions have a probability of at least 75% to yield crystals suitable for X-ray studies, in comparison to 10% for polydisperse samples.

Wayne Hendrickson (Columbia University) described several pivotal advances that have matured only recently and form the basis of crystallographic analysis at a markedly accelerated level. These include: undulator sources at synchrotrons, CCD detectors, cryo-crystallography, MAD phasing analysis, and selenomethionyl proteins². It is now possible to sustain a data collection throughput of four structures per day on a single undulator beam line. Assuming 180 operational days per year and that only one out of three data sets results in a useful final structure, this gives a conservative estimate of approximately 360 structures per year per single undulator beam line. This means that only several dedicated undulator beam lines are needed over the course of five years to achieve the goal of determining 10,000 new protein structures.

Tom Terwilliger (LANL) has developed the SOLVE computer program for obtaining electron density maps of proteins directly from multiwavelength or multiple isomorphous replacement X-ray diffraction data. An optimization of a scoring function replaces a complex rule-based process followed manually by crystallographers. This system has already allowed crystallographers to obtain images of proteins within hours of the end of X-ray data collection. The procedure has been successfully applied to MAD data with up to 26 selenium sites per derivative.

Sung-Hou Kim (University of California, Berkeley) described results from the first structural genomics pilot project, focusing on the Methanococcus jannaschii genome (Table 1). He presented the structures of two proteins, an ATP-dependent molecular switch with a new ATP binding motif, and a nucleotide pyrophosphatase, with a new fold. For each protein, the type of ligand bound to it, and thus its biochemical function, were predicted from the protein structure alone. The predictions were subsequently confirmed by direct experiment. These results support the idea that protein function can be anticipated from structural information alone and thus that high-throughput structure determination will make a valuable contribution to biology.

Table 1 List of pilot projects in structural genomics¹

Pilot project organizers          Organism
S. Burley, A. Šali, J. Sussman    S. cerevisiae
A. Edwards                        M. thermoautotrophicum
D. Eisenberg, T. Terwilliger      P. aerophilum
S.-H. Kim                         M. jannaschii
G. Montelione, S. Anderson        Metazoa
J. Moult                          H. influenzae
S. Yokoyama                       T. thermophilus HB8

¹These projects were described at the meeting. Each project generally involves several independent laboratories. Only speakers at the meeting are listed in the first column. Their addresses can be found at http://lion.cabm.rutgers.edu/bioinformatics_meeting/.

Nuclear magnetic resonance spectroscopy

Kurt Wüthrich (ETH, Zürich), Gaetano Montelione and Gerhard Wagner (Harvard Medical School) described recent and prospective advances in NMR spectroscopy that will allow structure determination of larger proteins as well as increase the accuracy and speed of the analysis. These advances include partial deuteration of protein samples, new magnets (up to 1 GHz), superconducting probes, the TROSY detection method, triple resonance methods for resonance assignment, spatial information from residual dipolar coupling, and software for resonance assignments and structure determination. For example, Montelione and colleagues have developed the program AUTOASSIGN to perform automated backbone assignments in a few minutes of CPU time for proteins smaller than 200 residues. This tool, integrated with the program AUTO_STRUCTURE for automatic analysis of NOESY data and structure generation, decreased the time needed for structure refinement of a Z-domain from several months to several hours. If the NMR data collection for a 120 residue protein takes 4–5 weeks on a 600 MHz spectrometer and if automated data processing takes 1 day, 120 structures can be determined in 1 year by 12 machines (Montelione), approximately at a cost of using one synchrotron beam line. By using higher field 800 MHz to 1 GHz magnets and other sensitivity-enhancement methods, this throughput could be significantly increased.

Anticipating the usefulness of NMR for structural genomics, Shigeyuki Yokoyama of Tokyo University announced the Japanese Genomic Sciences Center that will include a large NMR facility. The center will also support computational analysis of the human genome, sequencing and mapping of the mouse genome, and collaborations with high-throughput protein crystallography efforts elsewhere in Japan. The center will have 20 NMR instruments by 2001 and will be open to international collaboration. A 1 GHz magnet is being constructed in collaboration with the Tsukuba laboratory, the engineers of the magnets used for the bullet train.

NMR spectroscopy has become increasingly more useful and efficient. However, given the synchrotrons, X-ray crystallography is likely to retain an edge in throughput, accuracy, maximal protein size, and possibly cost per protein structure determination. Nevertheless, NMR spectroscopy will be indispensable for maximizing the benefits of structural genomics because of its unique ability to obtain information about the internal dynamics of a protein on multiple time scales as well as about its ligand binding properties. For example, Kurt Wüthrich described the importance of flexible tails for the function of several proteins. Another example is the routine use of chemical shift perturbation to enumerate
binding sites on a protein for a thousand ligands per day from a library of 15,000 compounds, with up to 1 mM sensitivity in the dissociation constant (Steven Fesik, Abbott Laboratories). Also potentially feasible is NMR structure determination of membrane proteins, which are exceptionally difficult to crystallize.

Databases

Several different databases will be useful in facilitating the structural genomics project (Table 2). These databases will be used for target selection, coordination of the experimental efforts, archiving of the new structures, dissemination of the results, and for performing new research with the old and new data. Helen Berman of Rutgers University described a new macromolecular structure database that will replace the Brookhaven Protein Data Bank³. Liisa Holm, Christine Orengo, and Steven Brenner described the FSSP, CATH, and SCOP databases, respectively. These databases classify protein structures in a hierarchical manner, generally consisting of families (proteins that share at least ~30% sequence identity), superfamilies (proteins that are probably related by divergent evolution but can have very little sequence similarity), and fold families (proteins that share similar folds). Orengo pointed out a great need for an exhaustive, systematic and computer friendly database of protein functional annotations. Such a database is necessary for the development and use of the methods that correlate structure to function.

Andrej Šali described ModBase, a large database of comparative protein structure models. This database is calculated with the program Modeller and currently contains protein models for five genomes. It will continue to grow with the availability of new protein structures and sequences, as well as the improvements in the modeling software. Steven Brenner introduced PRESAGE, a database that stands to become the central nervous system for coordinating the efforts in structural genomics. PRESAGE includes entries for proteins; each entry contains information about the protein’s homologs, current status of structure determination, links to three-dimensional models, and so forth. The entries in PRESAGE will be contributed by the community at large. Biologists will use it as a portal for accessing the most up-to-date information about proteins of interest. For example, PRESAGE would have eliminated the duplicated structure determination of elongation factor 5A by two structural genomics pilot projects (Sung-Hou Kim; Tom Terwilliger).

Table 2 Structural genomics web sites

Pilot Projects
DOE               http://proi3.lanl.gov/structural_genomics/
New York          http://genome5.bio.bnl.gov/Proteome/
CARB/TIGR         http://s2f.carb.nist.gov

Databases
NCBI              http://www.ncbi.nlm.nih.gov/
PDB               http://www.pdb.bnl.gov/
MSD               http://www.rcsb.org/
DALI              http://www2.ebi.ac.uk/dali/
CATH              http://www.biochem.ucl.ac.uk/bsm/cath/
SCOP              http://scop.mrc-lmb.cam.ac.uk/scop/
PRESAGE           http://presage.stanford.edu
ModBase           http://guitar.rockefeller.edu/modbase/
GeneCensus        http://bioinfo.mbb.yale.edu/genome

Fold assignment
PhD               http://www.embl-heidelberg.de/predictprotein/predictprotein.html
THREADER          http://globin.bio.warwick.ac.uk/~jones/threader.html
123D              http://www-lmmb.ncifcrf.gov/~nicka/123D.html
UCLA-DOE          http://www.doe-mbi.ucla.edu/people/frsvr/frsvr.html
PROFIT            http://lore.came.sbg.ac.at/

Comparative modeling
COMPOSER          http://www-cryst.bioc.cam.ac.uk/
CONGEN            http://www.cabm.rutgers.edu/~bruc
DRAGON            http://www.nimr.mrc.ac.uk/~mathbio/a-aszodi/dragon.html
Modeller          http://guitar.rockefeller.edu/modeller/modeller.html
PrISM             http://honiglab.cpmc.columbia.edu/
SWISS-MODEL       http://www.expasy.ch/swissmod/SWISS-MODEL.html
WHAT IF           http://www.sander.embl-heidelberg.de/vriend/

Miscellaneous
Target selection  http://structuralgenomics.org/
Funding NIH       http://www.nih.gov/nigms/
Funding NSF       http://www.nsf.gov/home/grants.htm
Pedro’s list      http://www.public.iastate.edu/~pedro/research_tools.html
Roberto’s list    http://guitar.rockefeller.edu/~roberto/tools/tools.html

Using the structures

Development of the databases and software will be necessary to integrate raw structural information and structure-derived results with other information, such as genetic and biochemical analysis, gene expression patterns, mutational analysis, binding data from protein chips, interaction data from two-hybrid systems, metabolic pathway information, and so forth (Chris Sander). These tools will then allow researchers to address efficiently a host of new questions. Some such studies, illustrating but the tip of the iceberg, have already been performed and were described at the meeting. For example, questions about the distribution of different structures among the genomes, and the implications for models of evolution, have been asked by Mark Gerstein (Yale University) and Eugene Koonin. In addition, David Eisenberg focused on understanding the extreme thermostability of proteins in thermophilic organisms in terms of the amino acid composition and structural features of the proteins. The study was based on a comparison of proteins that have homologs in both thermophilic and mesophilic organisms and which have homologs of known structure. It appears that thermostability of proteins originates in part from salt bridges and from deletions of loop regions.

Jeff Skolnick (Scripps Research Institute) described a method for predicting protein function based on structure prediction and three-dimensional motif identification. This method could push the detection of protein function much further into the twilight zone of sequence identity. First, the tertiary structures of the sequences of interest are predicted either ab initio or by threading. Then, the resulting models are screened using a library of three-dimensional descriptors of protein active sites, termed ‘fuzzy functional forms’. If the geometry and residue types in the predicted structures match a three-dimensional motif, then the protein is
predicted to have the specified molecular function. The idea and its power relative to local sequence pattern searches were illustrated by searching for proteins with a disulfide oxidoreductase active site in three bacterial genomes.

Impact of structural genomics

Suggestions about how best to proceed with the structural genomics project were discussed. Ideas ranging from continuing ‘research as usual’ to establishing large centers for structural genomics were explored. The pilot projects will help shape the full scale effort. It is likely that relatively large centers will provide the infrastructure for X-ray crystallography and NMR spectroscopy, allow the economy of scale, foster collaboration, and also ensure that the technical and repetitive jobs are not performed by students and postdocs in the academic laboratories. Centers will make it possible for academic and pharmaceutical researchers to obtain protein samples and research information, such as purification protocols and crystallization conditions. Such an infrastructure will enable the individual laboratories to focus on more challenging research problems, including functional characterization of their favorite proteins, structure determination of disease targets, large proteins and protein complexes, study of post-translational modifications, and systemic questions involving networks of proteins. Structural biologists in the pharmaceutical industry will increase their impact on the drug discovery process because they will be in a position to determine the structures of many more protein–ligand complexes, at a lower cost and faster, using the structures, samples and protocols from the high-throughput structure determination. The impact of structural genomics on the structural and functional characterization of proteins, as well as on our understanding of the machinery of life and its evolution, will surely result in a profusion of new drug targets and better drugs. Structural genomics will refocus biology, just as the advent of high throughput DNA sequencing machines freed researchers to focus on more elaborate experiments rather than on accumulating sequence data.

Andrej Šali is with the Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.

1. Structure-based functional genomics, October 4–7, Avalon, New Jersey (1998). http://lion.cabm.rutgers.edu/bioinformatics_meeting/. The meeting was organized by Gaetano Montelione, Stephen Anderson, Edward Arnold, and Ann Stock of the Center for Advanced Biotechnology and Medicine at Rutgers University (CABM), with the backing of the New Jersey Commission on Science and Technology, CABM, the Merck Genome Research Institute, and the Burroughs Wellcome Fund.
2. Synchrotron supplement. Nature Struct. Biol. 5, 614–656 (1998).
3. Editorial, Nature Struct. Biol. 5, 925–926 (1998).
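
As a rough check on the throughput estimates quoted in the crystallography and NMR sections of the review, the arithmetic can be sketched in a few lines of Python. This is only an illustrative calculation under stated assumptions: the function names are hypothetical, and the NMR estimate assumes that every spectrometer runs all 365 days of the year, which the review does not state. Taken at face value, four data sets per day, 180 operational days and one useful structure per three data sets give 240 structures per beam line per year; the quoted figure of about 360 would correspond to one useful structure per two data sets. Either way, fewer than ten beam lines would cover 10,000 structures in five years.

```python
import math


def structures_per_year(datasets_per_day: int, operational_days: int,
                        useful_one_in: int) -> float:
    """Useful structures per beam line per year.

    useful_one_in = 3 means one data set in three gives a final structure.
    """
    return datasets_per_day * operational_days / useful_one_in


def beamlines_needed(target_structures: int, years: int,
                     per_line_per_year: float) -> int:
    """Smallest whole number of beam lines that reaches the target in time."""
    return math.ceil(target_structures / (years * per_line_per_year))


def nmr_structures_per_year(machines: int, collection_weeks: float,
                            processing_days: float) -> float:
    """Structures per year for a pool of spectrometers.

    Assumption (not from the review): machines run 365 days a year and
    data processing is done serially after each data collection.
    """
    days_per_structure = collection_weeks * 7 + processing_days
    return machines * 365 / days_per_structure


if __name__ == "__main__":
    # Inputs as stated in the review: 4 data sets/day, 180 days, 1 in 3 useful.
    as_stated = structures_per_year(4, 180, 3)
    print("per beam line, inputs as stated:", as_stated)
    print("beam lines for 10,000 in 5 years:",
          beamlines_needed(10_000, 5, as_stated))
    # The quoted estimate of ~360 per year, used directly.
    print("beam lines at 360 per year:", beamlines_needed(10_000, 5, 360))
    # NMR: 12 machines, 5 weeks of collection plus 1 day of processing.
    print("NMR structures per year:",
          int(nmr_structures_per_year(12, 5, 1)))
```

With five weeks of data collection per protein, the NMR estimate lands close to the 120 structures per year quoted for 12 machines; with four weeks it would be somewhat higher.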

picture story


Fight the flu

Every year, we are reminded of the power of the influenza virus, especially when severe and deadly outbreaks occur, such as last season’s Hong Kong ‘bird flu’ in which 6 of the 18 confirmed cases in humans were fatal. What makes one influenza virus more virulent than another? In the case of the Hong Kong virus, one factor was probably a variation in its hemagglutinin (HA) precursor protein, which had a five-residue insertion. Hemagglutinin, a glycoprotein that is held in the viral membrane by a C-terminal transmembrane sequence, is required for viral membrane fusion during infection. Two events must occur for hemagglutinin to become active: (i) the precursor protein, HA0, must be cleaved to create two disulfide-linked subunits HA1 and HA2 and (ii) cleaved HA must undergo a low pH-induced conformational change in endosomes. The new N-terminus of HA2 is the fusion peptide that becomes embedded in the target membrane, leading to infection. The pH-induced conformational change projects the fusion peptide to the end of a long coiled coil, placing it close to the HA C-terminal transmembrane domain and thus bringing the viral and the target membranes into close proximity. The five-residue insertion in the Hong Kong HA0 hemagglutinin variant occurred at the HA0 cleavage site. Why would such a mutant lead to higher virulence? Researchers have found that HA0 variants with insertions in this region are more susceptible to proteases, leading to a greater proportion of cleaved HA molecules and hence a higher infectivity.

Hemagglutinin is a logical drug target because of its key role in the success of an infection. Researchers could aim to develop drugs that would block the cleavage step or prevent the conformational change. Useful information for such endeavors is provided by a recent structure of the HA0 precursor (Chen, J. et al. Cell 95, 409–417; 1998), which shows the conformation of the intact cleavage site. A comparison with the previously determined cleaved HA structure suggests that proteolysis may be playing a direct role in setting the low pH trigger for the conformational change, in addition to playing more obvious roles in exposing the fusion peptide and giving conformational flexibility to the protein. In the HA0 precursor, the cleavage site (left image, between the dots) is exposed on a surface loop. After proteolysis, the new N-terminus of HA2 fills a hole in the protein (arrow), burying several ionizable Asp and His residues that were previously exposed in the cavity (right image). In theory, this could set a sensitive trigger: protonation of the His residue at low pH could severely destabilize the structure and lead to the dramatic conformational change that is necessary for infection. TS

Images adapted from Chen et al.

1032 nature structural biology • volume 5 number 12 • december 1998