<<

Life with 6000 Genes Author(s): A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin and S. G. Oliver Source: Science, Vol. 274, No. 5287, Genome Issue (Oct. 25, 1996), pp. 546+563-567 Published by: American Association for the Advancement of Science Stable URL: https://www.jstor.org/stable/2899628 Accessed: 28-08-2019 13:31 UTC

REFERENCES Linked references are available on JSTOR for this article: https://www.jstor.org/stable/2899628?seq=1&cid=pdf-reference#references_tab_contents You may need to log in to JSTOR to access the linked references.

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at https://about.jstor.org/terms

American Association for the Advancement of Science is collaborating with JSTOR to digitize, preserve and extend access to Science

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms by FISH across the whole genome. A subset of the 37. G. D. Billingsleyet al., Am. J. Hum. Genet. 52, 343 (1994); L. M. Mulliganet al., Nature 363, 458 (1993); results were selected in which there was no evi- (1993); S. S. Schneider et al., Proc. Natl. Acad. Sci. P. Edery et al., ibid. 367, 378 (1994). dence of the YAC having been chimeric, and the U.S.A. 92, 3147 (1995). 42. The potential effect of a comprehensive gene map is position was resolved to a cytogenetic interval (as 38. R. Sherrington et al., Nature 375, 754 (1995); D. well illustrated by the fact that 82% of genes that opposed to fractional length only). A G6n6thon Levitanand 1.Greenwald, ibid. 377, 351 (1995); S. A. have been positionally cloned to date are marker was given for each of these YACs, which Hahn et al., Science 271, 350 (1996). represented by one or more ESTs in GenBank (http:/ allowed the position in centimorgans to be 39. S. Tugendreich et al., Hum. Mol. Genet. 3, 1509 /www.ncbi.nlm.nih.gov/dbEST/dbESTL genes/). determined. Because this study contained no data (1994); D. E. Bassett Jr., M. S. Boguski, P. Hieter, 43. K. H. Fasman, A. J. Cuticchia, D. T. Kingsbury,Nu- for chromosomes 19, 21, and 22, an alternative cleic Acids Res. 22, 3462 (1994). strategy was used: Genes that are nonrecombinant Nature 379, 589 (1996); P. Hieter, D. E. Bassett Jr., with respect to G6n6thon markers [table 5 in C. Dib D. Valle, Nature Genet. 13, 253 (1996). 44. M. Bonaldo et al., Genome Res. 6, 791 (1996). et al., Nature 380 (suppl.), iii (1996)] and have 40. The scoring systems were amino acid substitution 45. Consensus sequences from at least three overlap- well-known cytogenetic locations [Online Mende- matrices based on the PAM (point accepted muta- ping ESTs were generated from 294 UniGene clus- lian Inheritance in Man (OMIM) at http://www. tion) model of evolutionary distance. Pam matrices ters. Of the 230 (78%) redesigned primer pairs, 188 ncbi.nlm.nih.gov,] were used to establish the may be generated for any number of PAMs by ex- (82%) of these yielded successful PCR assays. cross-reference. trapolation of observed mutation frequencies [M. 0. 46. We thank M. 0. Anderson, A. J. Collymore, D. F. 28. N. E. Morton, Proc. Natl. Acad. Sci. U.S.A. 88, 7474 Dayhoff et al., in Atlas of Protein Sequence and Courtney, R. Devine, D. Gray, L. T. Horton Jr., V. (1991). Structure, M. 0. Dayhoff, Ed. (National Biomedical Kouyoumjian, J. Tam, W. Ye, and 1. S. Zemsteva 29. Detailed instructions and examples are provided on Research Foundation, Washington, DC, 1978), vol. from the Whitehead Institute for technical assist- the Web site. 5, suppl. 3, pp. 345-352; S. F. Altschul,J. Mol. Evol. ance. We thank W. Miller,E. Myers, D. J. Lipman, 36, 290 (1993)]. PAM matrices were customized for 30. G. D. Schuler, J. A. Epstein, H. Ohkawa, J. A. Kans, and A. Schaffer for essential contributions toward scoring matches between sequences in each of the Methods Enzymol. 266,141 (1996). the development of UniGene. Supported by NIH 15 species pairs; for the pool of remaining proteins 31. Five additional assays yielded ambiguous results as awards HG00098 to E.S.L., HG00206 to R.M.M., ("Other organisms" in Table 4), the PAM120 matrix a result of the presence of an interferingmouse band HG00835 to J.M.S., and HGO0151 to T.C.M., and was used because it has been shown to be good for or poor amplification. by the Whitehead Institutefor Biomedical Research general-purpose searching [S. F. Altschul, J. Mol. 32. This could be the result of either a trivialprimer tube and the Wellcome Trust. T.J.H. is a recipient of a Biol. 219, 555 (1991)]. The BLASTX program labeling erroror a sequence clustering errorin which [W. ClinicianScientist Award from the Medical Research Gish and D. J. States, Nature Genet. 3, 266 (1993)] two separate genes were erroneously assigned to Council of Canada. D.C.P. is an assistant investiga- takes a inucleotide sequence query (EST or gene), the same UniGene entry. tor of the Howard Hughes Medical Institute. The translates it into all six conceptual ORFs, and then 33. Typing errors in a framework marker would allow a Stanford Human Genome Center and the Whitehead close EST to map with significant lod scores (loga- compares these with protein sequences in the data- Institute-MITGenome Center are thankful for the rithmof the odds ratio for linkage) to a correct chro- base; TBLASTN performs a similar function but in- oligonucleotides purchased with funds donated by mosome location but would tend to localize the stead searches a protein query sequence against marker in a distant bin, in order to minimize "double- six-frame translations of each entry in a nucleotide Sandoz Pharmaceuticals. G6n6thon is supported by breaks" caused by the erroneous typings of the . Searches were performed with the Association Francaise contre les Myopathies and framework marker. E = 1e-6 and E2 = 1e-5 as the primaryand second- the Groupement d'Etudes sur le Genome. The 34. J. M. Craig and W. A. Bickmore, Bioessays 15, 349 ary expectation parameters. Sanger Centre, G6n6thon, and Oxford are grateful (1993). 41. M. A. Pericak-Vance et al., Am. J. Hum. Genet. 48, for support from the European Union EVRHESTpro- 35. V. Romano et al., Cytogenet. Cell Genet. 48, 148 1034 (1991); W. J. Strittmatter et al., Proc. Natl. gramme. The Human Genome Organizationand the (1988). Acad. Sc. U.S.A. 90, 1977 (1993); R. Fishel et al., Wellcome Trust sponsored a series of meetings from 36. S. Bahram, M. Bresnahan, D. E. Geraghty, T. Spies, Cell 75, 1027 (1993); N. Papadopoulos et al., Sci- October 1994 to November 1995 without which this Froc. Natl. Acad. Sci. U.S.A. 91, 6259 (1994). ence 263, 1625 (1994); R. Shiang et al., Cell 78, 335 collaboration would not have been possible.

physicalenvironment. S. cerevisiaehas a life Life with 6000 Genes cycle that is ideally suited to classical ge- netic analysis,and this has permittedcon- A. Goffeau,*B. G. Barrell,H. Bussey, R. W. Davis, B. Dujon, struction of a detailed genetic map that H. Feldmann, F. Galibert,J. D. Hoheisel, C. Jacq, M. Johnston, defines the haploidset of 16 chromosomes. Moreover,very efficient techniques have E. J. Louis, H. W. Mewes, Y. Murakami,P. Philippsen, been developedthat permitany of the 6000 H. Tettelin, S. G. Oliver genes to be replacedwith a mutantallele, or completelydeleted from the genome, with The genome of the yeast Saccharomyces cerevisiae has been completely sequenced absolute accuracy(17-19). The conmbina- through a worldwide collaboration. The sequence of 12,068 kilobases defines 5885 tion of a largenumber of chromosomesand potential protein-encoding genes, approximately 140 genes specifying ribosomal RNA, a small genome size meant that it was pos- 40 genes for small nuclear RNA molecules, and 275 transfer RNA genes. In addition, the sible to divide responsibilities complete sequence provides information about the higher order organization of yeast's conveniently among the differentinterna- 16 chromosomes and allows some insight into their evolutionary history. The genome tional groupsinvolved in the project. shows a considerable amount of apparent genetic redundancy, and one of the major problems to be tackled during the next stage of the yeast genome project is to elucidate Old Questions and New Answers the biological functions of all of these genes. The genome.At the beginning of the se- quencing project, perhaps1000 genes en- codingeither RNA or proteinproducts had The genome of the yeast Saccharomyces nucleotideand proteinsequence data from been defined by genetic analysis(20). The cerevisiaehas been completely sequenced each of the 16 yeast chromosomes(1-16) complete genome sequence defines some through an internationaleffort involving have been established(Table 1). 5885 open readingframes (ORFs) that are some 600 scientistsin Europe,North Amer- The position of S. cerevisiaeas a model likely to specify protein products in the ica, andJapan. It is the largestgenome to be eukaryoteowes much to its intrinsicadvan- yeastcell. This meansthat a protein-encod- completelysequenced so far (a recordthat tages as an experimentalsystem. It is a ing gene is foundfor every2 kb of the yeast we hope will soon be bettered) and is the unicellular organism that (unlike many genome, with almost 70% of the total se- first complete genome sequence of a eu- morecomplex eukaryotes) can be grownon quence consistingof ORFs (21). The yeast karyote.A numberof public data libraries definedmedia, which gives the experiment- genome is much more compactthan those compiling the mapping information and er complete control over its chemical and of its morecomplex relatives in the eukary-

546 SCIENCE * VOL.274 * 25 OCTOBER1996

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms nm ARTICLES otic world.By contrast,the genoine of the Schizosaccharomycespombe indicate that the stratedthat the relative incidence of dou- nematode worm contains a potential pro- density of protein-encodinggenes is ap- ble-strand breaks, which are thought to tein-encodinggene every 6 kb (22), and in proximatelyone per 2.3 kb (23). The dif- initiate genetic recombination in yeast the humangenome, some 30 kb or more of ference between these two yeast genomes (29), correlatesdirectly with the GC-rich sequence must be examined in order to can be ascribedto the paucityof intronsin regions of this chromosome. uncoversuch a gene. Analysisof the yeast S. cerevisiae.In the fission yeast, approxi- The four smallest chromosomes(I, III, genomereveals the existenceof 6275 ORFs mately 40% of genes contain introns (21), VI, and IX) exhibit averagerecombination that theoretically could encode proteins whereasonly 4% of protein-encodinggenes frequenciessome 1.3 to to 1.8 times greater longer than 99 amino acids. However,390 in S. cerevisiaeare similarly interrupted (Ta- than the averagefor the genomeas a whole. ORFs are unlikely to be translated into ble 2). The Saccharomycesgenes that do Kaback(30) has suggestedthat high levels proteins.Thus, only 5885 protein-encoding contain introns [notably, those encoding of recombinationhave been selectedfor on genes are believed to exist. In addition,the ribosomalproteins (24)] usuallyhave only these very small chromosomesto ensureat yeast genome containssome 140 ribosomal one small intron close to the start of the least one crossoverper meiosis,and so per- RNA genes in a large tandem array on codingsequence (often interruptingthe ini- mit them to segregatecorrectly. It is known chromosomeXII and 40 genes encoding tiator codon) (25). It has even been sug- that artificial chromosomes (or chromo- small nuclear RNAs (snRNAs) scattered gested (26) that manyyeast genes represent somefragments) of approximuately150 kb in throughoutthe 16 chromosomes;275 trans- cDNA copies that have been generatedby size are mitoticallyunstable (31, 32), which fer RNA (tRNA) genes (belonging to 43 the action of reversetranscriptases specified raisesthe relatedquestions of whetherthere families) are also widely distributed.Table by retrotransposons(Ty elements). is a minimalsize for yeastchromosomes and 2, which providesdetails of the distribution The chromosomes.A complete genome how the smallest chromosomes have of genes and other sequence elements sequence providesmore informationthan achieved their currentsize (see Table 2). amongyeast's 16 chromosomes,shows that the sum of all the genes (or ORFs) that it The organizationof chromosomeI is very the genome has been completely se- contains. In particular,it permits an in- unusual:The 31 kb at each of its ends are quenced, with the exception of a set of vestigation of the higher order organiza- very gene-poor,and Busseyet al. (5) have identicalgenes repeatedin tandem. tion of the S. cerevisiaegenome. An ex- suggestedthat these terminaldomains may The compact natureof the S. cerevisiae ample is the long-rangevariation in base act as "fillers"to increase the size, and genomeis remarkableeven when compared composition. Many yeast chromosomes hence the stability, of this smallest yeast with the genomesof other yeastsand fungi. consist of alternating large domains of chromosome. Currentdata from the systematicsequence GC-rich and GC-poor DNA (21, 27), Genetic redundancyis the rule at the analysisof the genome of the fission yeast generallycorrelating with the variation in ends of yeast chromosomes.For instance, gene density along these chromosomes.In the two terminaldomains of chromosome A. Goffeau and-H. Tettelin, Universit6Catholique de Lou- the case of chromosome III, it has been III show considerablenucleotide sequence vain, Unit6 de Biochimie Physiologique, Place Croix du Sud, 2/20,1348 Louvain-la-Neuve, Belgium. demonstratedthat the periodicity in base homologyboth to one another and to the B. G. Barrell,Sanger Centre, HinxtonHall, Hinxton, Cam- composition is paralleledby a variation in terminaldomains of other chromosomes(V bridge CB10 1SA, UK. recombinationfrequency along the chro- and XI). The right terminalregion of chro- H. Bussey, Departmentof , McGillUniversity, 1205 mosome arms, with the GC-rich peaks mosome I is duplicatedat the left end of Docteur Penfield Avenue, Montreal,H3A 1 Bi, Canada. R. W. Davis, Department of Biochemistry, Stanford Uni- coinciding with regions of high recombi- chromosomeI (5) and at the right end of versity, Beckman Center, Room B400, Stanford, CA nation in the middle of each arm of the chromosomeVIII (3). The sugarfermenta- 94305-5307, USA. chromosome and AT-rich troughs coin- tion genes MAL,SUC, and MELall have a B. Dujon, Unit6 de G6n6tique Mol6culaire des Levures (URAl 149 CNRS and UPR927 Universit6Pierre et Marie ciding with the recombination-poorcen- numberof telomere-associatedcopies, not Curie), InstitutPasteur, 25 Rue du Docteur Roux, 75724 tromeric and telomeric sequences. Sim- all of which are expressed (33-35). An Paris-Cedex 15, France. chen and co-workers (28) have demon- interestingfeature of the distributionof the H. Feldmann, Universitat MOnchen, InstitutfOr Physiolo- gische Chemie, Schillerstrasse, 44, 80336 MOnchen, Germany. F. Galibert, Laboratoire de Biochimie Moleculaire, UPR Table 1. Finding yeast genome information on the Internet. A fuller description of these and other 41-Facult6 de M6decine, CNRS, 2 Avenue du Profes- resources may be found in (86). seur L6on Bernard, 35043 Rennes Cedex, France. J. D. Hoheisel, Molecular Genetic Genome Analysis, Internetaddresses Deutsches Krebsforschungszentrum, Im Neuenheimer Feld, 506, 69120 Heidelberg, Germany. FTP sites for the complete S. cerevisiae genome sequence C. Ecole Normale Su- Jacq, G6n6tique Moleculaire, ftp.mips.embnet.org (directory/yeast) perieure, CNRS URA 1302, 46 Rue d'Ulm, 75230 Paris Cedex 05, France. ftp.ebi.ac.uk (directory/pub/databases/yeast) M. Johnston, Department of , Box 8232, Wash- genome-ftp.stanford.edu (directory/yeast/genome-seq) ington UniversityMedical School, 4566 Scott Avenue, St. Other S. cerevisiae data libraries Louis, MO 631 10, USA. MIPS, Martinsried,Germany (http://www.mips.biochem.mpg.de/yeast/) E. J. Louis, Yeast Genetics, Institute of Molecular Medi- Sanger Center, Hinxton, UK (http://www.sanger.ac.uk/yeast/home.html) cine, John Radcliffe Hospital, Oxford OX3 9DU, UK. Saccharomyces Genome Database (SGD), Stanford University,USA (http://genome- H. W. Mewes, Max-Planck-InstitutfOr Biochemie, Martins- www.stanford.edu/Saccharomyces/) red Institutefor Protein Sciences (MIPS),Am Klopferspitz 18A, 82152 Martinsried,Germany. SWISS-PROT, Universityof Geneva, Switzerland (http://expasy.hcuge.ch/sprot/sp-docu.html) Y. Murakami,Tsukuba Life Science Center, Division of Yeast Protein Database (YPD), Proteome Inc., Beverly, MA, USA Human Genome Research, RIKEN, Koyaday Tsukuba (http://www.proteome.com/YPDhome.html) Science City 3-1-1, 305 Ibaraki,Japan. GeneQuiz, European Laboratory,Heidelberg, Germany P. Philippsen, Institutefor Applied Microbiology,Universi- (http://www.embl-heidelberg.de/-genequiz/) tat Basel, Klingelbergstrasse70, 4056 Basel, Switzerland. XREFdb, National Center for Biological Information,Baltimore, MD, USA S. G. Oliver, Department of Biochemistry and Applied (http://www.ncbi.nlm.nih.gov/XREFdb/) Molecular Biology, University of Manchester Institute of (Editor's note: For readers who would like a user-friendlyguide to the yeast databases, please see Science and Technology (UMIST), Post Office Box 88, Manchester M60 1QD, UK. the special feature at the Science Web site http://www.sciencemag.org/science/feature/data/ genomebase.htm) *Towhom correspondence should be addressed.

SCIENCE * VOL.274 * 25 OCTOBER1996 563

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms MEL genes is that they are only found activities of retrotransposonsor retroviruses. ogy criteria.However, such assignmentsof- associatedwith one telomereof a chromo- Whatever the merits of such specula- ten provide only a general descriptionof some and not both (36). This raises the tion, it is evident that many of the poly- the biochemicalfunction of the predicted intriguing possibility that yeast chromo- morphisms observed between homologous protein products(such as "proteinkinase" somes might have an intrinsicpolarity be- chromosomes in different strains of S. cer- or "transcriptionfactor") but provide no yond our arbitrarylabeling of left and right evisiae are due to transposition events or indicationas to their biologicalrole. Thus, arms. recombination between transposons or although computational approachespro- The left telomereof chromosomeIII has their LTRs or both (46). Fortunately for vide valuable guides to experimentation, some special characteristics.Like all yeast genome analysis, and perhaps for yeast it- they do not obviate the need to carryout telomeres,it contains a repeatedsequence self, these spontaneous transposition events real experimentsto determineprotein func- elementcalled X (37, 38). It also containsa do not appear to occur randomly along the tion [see (55)]. pseudo-Xelement at an internalsite about length of individual chromosomes. The An attemptto classifyyeast proteins ac- 4 kb fromthe trueX, and Voytasand Boeke yeast genome that was sequenced contains cordingto their function as conservatively (39) have suggested that the two X se- 52 complete Ty elements as well as 264 solo predicted by such computer analyses has quencesrepresent the long terminalrepeats LTRs or other remnants that are the foot- been carriedout by MIPS (56). The yeast (LTRs) of a new class of yeast transposon prints of previous transposition events. The cell devotes 11%of its proteomneto metab- calledTy5. Transposonsare often foundon majority of the Ty2 elements (11 out of 13) olism;3% to energyproduction and storage; the healed ends of brokenchromosomes in are found in sites that show evidence of 3% to DNA replication,repair, and recom- Drosophila(40, 43). The yeast genome se- previous transposition activity ("old" sites); bination; 7% to transcription;and 6% to quence reveals that all 19 telomere-associ- only about half (16 out of 33) of the Tyl translation.A total of 430 proteinsare in- ated highly conservedrepeats called Y' el- elements are found in "new" sites (the ma- volved in intracellulartrafficking or protein ements contain an ORF whose predicted jority flanking tRNA genes). Thus, yeast targeting,and 250 proteinshave structural proteinproduct is reminiscentof the RNA transposons appear to insert preferentially roles.Nearly 200 transcriptionfactors have helicases(42, 43). No Y' helicaseORFs are into specific chromosomal regions that may been identified(57), as well as 250 primary found on chromosomes I, III, and XI be termed transposition hot spots (46-53). and secondarytransporters (58). However, (though there are small partsof Y's on III The proteome. The term "proteome" has these statisticsrefer only to yeast proteins and XI near the actualtelomeres) and a few been coined to describe the complete set of for which significanthomologs were found. Y's fromtandem arrays at chromosomesXII proteins that a living cell is capable of Another approachthat is expectedto be R and IV R have not been sequenced.The synthesizing (54). The completion of the greatlyfacilitated by the availabilityof all functionsof these ORFsare unknown;how- yeast genome sequence means that, for the yeast proteinsequences is two-dimensional ever, they may have formedparts of trans- first time, the complete proteome of a eu- gel electrophoreticanalysis, which permits posableelements in the past (41, 44). Be- karyotic cell is accessible. Computer analy- the resolution of more than 2000 soluble causethe synthesisof new telomericrepeats sis of the yeast proteome allows classifica- protein species (59). Unfortunately,many by telomerase[see (45)] is essentiallya re- tion of about 50% of the proteins on the of these membraneproteins are not re- verse transcriptionprocess, it may be that basis of their amino acid sequence similarity solved, and the reproducibilityof the two- currentmechanisms of telomerebiogenesis with other proteins of known function, with dimensionalelectrophoretograms from one in manyeukaryotes had their originsin the the use of simple and conservative homol- laboratoryto another is still poor. Identifi-

Table 2. Distributionof genes and other sequence elements. Questionable proteins are defined in (2). Hypothetical proteins are the difference between all proteins predicted as ORFs and the questionable proteins. Introns include both experimentally verified examples and those predicted by the EXPLORA program. UTR, untranslated but transcribed regions.

Chromosome number Elements I 11 III IV V VI VIl Vil IX X Xl XII XIII XIV XV XVI Total

Sequenced length (kb) 230 813 315 1,532 577 270 1,091 563 440 745 667 1,078 924 784 1,091 948 12,068 Nonsequenced identical repeats Name of unit ENA2 and Y' tel CUPi rDNAand Y' Length of unit (kb) 4 and 7 <1 2 9 and 7 Number of units 2 and 2 1 13 ?140 and 2 Length of repeats (kb) 8 and 14 <1 26 1,260 and 14 1,321 Total length (kb) 230 813 315 1,554 577 271 1,091 589 440 745 667 2,352 924 784 1,091 948 13,389 ORFs (n) 110 422 172 812 291 135 572 288 231 387 334 547 487 421 569 497 6,275 Questionable proteins (n) 3 30 12 65 13 5 57 12 1 1 29 20 41 30 23 3 36 390 Hypothetical proteins (n) 107 392 160 747 278 130 515 276 220 358 314 506 457 398 566 461 5,885 Intronsin ORFs (n) 4 18 4 30 13 5 15 15 8 13 11 17 19 15 15 18 220 Intronsin UTR (n) 0 2 0 1 2 0 5 0 0 0 0 3 0 2 0 0 15 IntactTy1 (n) 1 2 0 6 1 0 4 1 0 2 0 4 4 2 2 4 33 IntactTy2 (n) 0 1 1 3 1 1 1 0 0 0 0 2 0 1 2 0 13 IntactTy3 (n) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 2 IntactTy4 (n) 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 3 IntactTy5 (n) 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 tRNAgenes(n) 2 13 10 27 20 10 36 11 10 24 16 22 21 16 20 17 275 snRNAgenes (n) 1 1 2 1 2 0 3 1 1 4 1 3 8 3 7 2 40

564 SCIENCE * VOL. 274 * 25 OCTOBER 1996

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms m ARTICLES cation of the proteinscorresponding to giv- tions is most readilyseen in the pericentric good example of evolution through gene en spots by NH2-terminalprotein sequenc- regions and in the central portions of a duplication,but the situation is even more ing has been slow and only about 200 as- numberof chromosomearms. Although re- complicatedas a thirdcitrate synthase gene signments have been made so far. The dundancydoes occur close to the ends of (CIT3) has been discovered on chromo- availabilityof the predictedamino acid se- chromosomes(indeed, the subtelomericre- someXVI (68, 69). ChromosomesIV and II quencefor all yeastproteins should permit a gions are a major repositoryof redundant sharethe longestCHR, comprisinga pairof comprehensive analysis of the proteome sequences), exchanges between such re- pericentricregions of 120 kb and 170 kb, with the use of rapid and accurate mass gions are probablytoo frequent and too respectively,that share 18 pairsof homolo- spectrometrictechniques, so that it should recent to help us discernthe overallhistory gousgenes (13 ORFsand five tRNA genes). soon become routine to identify all yeast of the yeast genome. The genome has continuedto evolve since proteinsproduced under a given set of phys- In S. cerevisiae,simple direct repeat clus- this ancient duplicationoccurred: The in- iologicalconditions, or those that are qual- ters (64) take severalforms, the most typi- sertionor deletionof geneshas occurred,Ty itativelyor quantitativelymodified as a re- cal being dispersedfamilies with relatedbut elements and introns have been lost and sult of the deletion of a specific gene. A nonidentical genes scattered singly over gained between the two sets of sequences, new kind of mapwill then emerge-that of manychromosomes. The largestsuch family and pseudogeneshave been generated.In the direct and indirect interactionsamong comprisesthe 23 PAU genes, which specify all, at least 10 CHRs (sharedwith chromo- all of the membersof the yeastproteome. A the so-called seripauperines(65), a set of somes II, V, VIII, XII, and XIII) can be complete understandingof life at the mo- almostidentical serine-poor proteins of un- recognizedon chromosome IV. None of lecular level cannot be achieved without known function whose ORFs show a very them is found in the central region of the such knowledge. high codon bias and an NH2-terminalsig- chromosome,which, on the other hand, Another consequence of defining the nal sequence(66). The PAU genes, like the contains most (7 out of 9) of chromosome yeast proteome is the uncoveringof gene sugarfermentation genes discussedabove, IV'scomplement of Ty elements (Table 2); productswhose existence was hitherto in reside in the subtelomericregions. Other these may be the cause of the genetic plas- doubt. For instance, early views that yeast dispersedgene families show no obvious ticity of this region. chromosomeswere atypical led to the asser- chromosomalpositioning (such as the 15 The example of the citrate synthase tion that yeast does not have an HI his- membersof the PMT and KRE2families, genessuggests that muchof the redundancy tone. In reality,yeast chromosomes contain which encode enzymes involved in the in the yeast genome may be more apparent the full repertoireof eukaryotichistones, mannosylationof cell wall proteins).Clus- than real.In this case, it wasour knowledge including HI, whose gene was found on tered gene familiesare less common, but a of the rulesof proteintargeting in yeastthat chromosomeXVI (60). Another exampleis largefamily of this type occurson chromo- allowedus to discernthat these genes play the discoveryof a yeastgamma tubulin gene some I, where six relatedbut nonidentical differentphysiological roles. It is likely that on chromosomeXII (14); this gene had ORFs (YAR023 throughYAR033) specify a large number of apparentlyredundant previouslyeluded yeast geneticists despite membraneproteins of unknown function yeast genes are requiredto deal with phys- intensive effortsby severalresearch groups. (5). There are 16 membersof this familyon iological challenges that are not encoun- It has been an article of faith for some six chromosomes. Somne are clustered tered in the laboratoryenvironment but time that a full understandingof the yeast (YHLO42through YHLO46 on chromosome that yeast commonlyencounters in its nat- proteome is a prerequisitefor understand- VIII), others are scatteredsingly (YCRO07 uralhabitat of the rottingfig (70) or grape. ing the more complex human proteome; on chromosomeIII), and still others are Our abilityto imaginethese conditionsand this has becomereality with the availability located in subtelomericregions (YBR302 recreatethem in the laboratoryis severely of the complete yeast genome sequence. on chromosomeII, YKL219 on chromo- compromisedby our lack of knowledgeof Nearly half of the proteins known to be some XI, and YHLO48on chromosome the ecology or naturalhistory of S. cerevi- defective in human heritablediseases (61) VIII).Functional analysis of such a complex siae. Indeed,only very recentlyhas it been show some amino acid sequence similarity family poses a challenge but one that re- clearlyestablished that S. cerevisiaeis found to yeast proteins (62). Although it is evi- mains within the capabilityof yeast gene on the surfaceof the grapesused to make dent that the human genome will specify disruptiontechnology. wine (71). many proteins that are not found in the An analysisof the numerouscluster ho- yeast proteome,it is reasonableto suggest mology regions (CHRs) revealed by the What's Next? that the majorityof the yeast proteinshave yeast genome sequencehas led to a better human homologs. If so, these human pro- understandingof genome evolution. CHRs For yeast research.New graduatestudents teins could be classifiedon the basisof their are large regions in which homologous are alreadywondering how we all managed structural or functional equivalence to genes are arrangedin the same order,with in the "darkages" before the sequencewas membersof the yeast proteome. the same relative transcriptionalorienta- completed. We must now tackle a much Genomeevolution. The existence of sets tions, on two or more chromosomes.Early larger challenge, that of elucidating the of two or more genes encoding proteins reportsof CHRs involved a 7.5-kb region function of all of the novel genes revealed with identicalor verysimilar sequences (re- on chromosomesV and X (67) and a 15-kb by that sequence. As with the sequencing dundancy)provides the rawmaterial for the regionfrom chromosomes XIV and III (68). project itself, functional analysis will re- evolution of novel functions (63). Under- The latter contains four ORFs that have quire a worldwideeffort. In Europe,a new standing the true nature of redundancyis similarlyordered homologs in the centro- research network called EUROFAN [for one of the majorchallenges in the quest to meric regions of both chromosomes.One European Functional Analysis Network elucidate the biologicalrole of every gene homologouspair consists of two genes, each (72)] has been establishedto undertakethe in the S. cerevisiaegenome (55). An anal- of which encodes citrate synthase.Howev- systematicanalysis of the functionof novel ysis of the complete genome sequencesug- er, one (CIT2 on chromosomeIII) encodes yeast genes. Parallelactivities are underway gests that it may have undergoneduplica- the peroxisomalenzyme, whereas the other in Germany, Canada, and Japan. In the tion events at some point in its evolution- (CIT1 on chromosomeXIV) specifies the United States, the National Institutes of ary history. The evidence for such duplica- mitochondrial enzyme. This is probably a Health has recently sent out a request for

SCIENCE * VOL. 274 * 25 OCTOBER 1996 565

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms applications for "Large-ScaleFunctional For genome sizes between 2 and 6 Mb, parts(84). This gives considerablehope for Analysisof the YeastGenome." Clearly, the sequencing becomes much more complex the rapidanalysis of the genomesof a large yeast researchcommunity is mobilizingfor and expensive.The productionof a library numberof medicallyand economicallyim- the next phase of the campaignto under- comprisinga contiguousset of DNA clones portant fungi through the use of the S. standhow a simpleeukaryotic cell works. (ratherthan the generationand assemblyof cerevisiaegenome sequence as a paradigm. In all of this, a common approach is the sequence itself) becomes the limiting However,this optimismis temperedby the emerging for the deletion of individual factor. In the absence of such a library, lack of apparentsynteny between the S. genes by a polymerase chain reaction long-range PCR amplification or direct cerevisiaeand S. pombegenomes. This is (PCR)-mediated gene replacement tech- PCR sequencing, or both, must be em- perhapsnot surprising,as the two species nique (18, 19). This approach,which relies ployed. The sequencingof genomes larger probablydiverged from a common ancestor on the greatefficiency and accuracyof mi- than 6 Mb typicallyrequires the time-con- some 1000 million yearsago (85). totic recombinationin S. cerevisiae,results suming construction of clone librariesin The systematic sequencing of larger in the precise deletion of the entire gene cosmidsor other high-capacityvectors and model genomes, most notably those of the and is economical enough to enable pro- the usuallytedious filling-in of the unavoid- fruit fly Drosophilamelanogaster and the di- ducton of the complete set of 6000 single- able, sometimesnumerous, and occasionally cotyledonousplant Arabidopsisthaliana, has deletion mutants. This would be a major intractable,gaps in clone coverage. Plans now begun. It is doubtfulwhether either of resourcefor the scientific community,not for the determinationof medium-sizedge- them will be completed in this century, only for the functionalanalysis of the yeast nome sequences (10 to 100 Mb) almost given that <3% of these 100- to 130-Mb genome itself but also in permitting"func- alwaysunderestimate the costs of these es- genomeshas been systematicallysequenced tional mapping"of the genomes of higher sential steps. The existence of two comple- so far. While we await the completionof organismsonto that of yeast.Because of the mentary,well-organized, and almostgapless the human genome sequence sometime redundancyproblem, and to enable the librariesof yeast DNA in cosmid vectors around the year 2005, there is danger of study of gene interactions,it will also be (78, 79) wasa m-ajorfactor in the unexpect- dispersing sequencing power among too necessary to construct multiply deleted ed speedat which the full genomesequence many model genomes. Instead, it may be strains;easy methods to achieve this are was obtained. After the pilot exercise of desirableto direct sequencingcapacity to- already in hand (73, 74). All these ap- chromosomeIII (1), it took only 4 yearsto wardeukaryotic parasites (such as Plasmodi- proachesshould make S. cerevisiaethe eu- complete the remaining 11.8 Mb of the um falciparum,Trypanosoma cruzei, Schisto- karyoteof choice for the studyof functions yeast genome. During 1995 alone, more sona mansoni, and Leishmaniadonovani) commonto all eukaryoticcells, by reversing than 6 Mb of final contiguousyeast genom- that plaguemillions of people in developing the traditionalpath of genetic researchto ic sequence were obtained. We are confi- countries.These genomesare only of inter- one in which the studyof the gene (or DNA dent that it will soon become routine to mediate size (30 to 300 Mb) and thus are sequence)leads to an understandingof bio- complete a 10-Mbcontig in a yearfor <$5 achievable objects for sequence analysis, logical function, rather than a change in million.Nearly complete cosmid libraries of providedthat funding is increasedfrom its function leading to the identificationof a the 15-Mb genome of the fission yeast S. presentmodest levels. On a worldscale, the gene. pombeare available (80, 81), allowing its cost-benefit equation for such projects is Forother genomes. Before the release,on sequencingto proceed rapidly.If financial overwhelminglypositive. 24 April 1996, of the complete yeast ge- supportis sustained,S. pombeshould be one nome sequence,two complete bacterialge- of the next two eukaryotesto have their Pride and Productivity nomes had been made public: the 1.8-Mb genome sequences completed [along with sequenceof Haemophilusinfluenzae (75) and the nematodeC. elegans(22), probablyin Two contrastingstrategies for gatheringge- the 0.6-Mb sequence of Mycoplasmageni- 1998]. nome data have emerged,both of which talium(76). Another prokaryoticgenome, Now that the complete sequence of a have been appliedto sequencingthe yeast that of Methanococcusjannaschii (1.7 Mb), laboratorystrain of'S. cerevisiaehas been genome: the "factory"and the "network" was subsequentlyreleased (77). The se- obtained, the complete genome sequences approaches.In the former,sequencing was quencesof severalother bacterialgenomes of other yeastsof industrialor medical im- automatedas far as possibleand was carried [Helicobacterpylori, Methanobacterium ther- portanceare within our reach.Such knowl- out in large sequencingcenters by highly moautotrophicum(1.7 Mb), Mycoplasma edge shouldconsiderably accelerate the de- specializedscientists and technicians who pneumoniae(0.8 Mb), and Synechocystissp. velopment of more productivestrains and may never have seen a yeast outside of a (3.6 Mb)] have apparentlybeen completed the search for badly needed antifungal bottle of doublyfermented beer. Their daily but were not publiclyavailable at the time drugs.Unfortunately, the sequence of the notebook, put on the World Wide Web, this paperwent to press.The sequenceof at important human pathogen Candidaalbi- was fully accessible to the scientific com- least two dozen other prokaryoticgenomes cans is proceedingslowly, with limited in- munityand was progressivelycorrected and (mostly extremophileswith genome sizes dustrialsupport (82, 83). Completegenome completed when new informationbecame below 2 Mb) is underway.The shotgun sequencing may be unnecessarywhen a available.In the networkapproach, by con- sequencingof small bacterialgenomes can yeast or fungal genome displaysconsider- trast, yeast genome sequencing was per- be completedin less than 6 monthsat a cost able synteny (conservationof gene order) formed in small laboratoriesby scientists of <$0.50 per base pair (75, 77). It is not with that of S. cerevisiae.For instance, re- and studentsdeeply committed to the study easy to determine whether such estimates cent studies on Ashbyagossypii (a filamen- of particularaspects of yeast molecularbi- representfull or marginalcosts; neverthe- tous fungus that is a pathogen of cotton ology.These scientistshad a specialinterest less, we can expect many more small ge- plants)have revealedthat most of its ORFs in the interpretationof the data and made nomes to be completed soon. It would be show homologyto those of S. cerevisiaeand public only verifiedand (they hoped) final unfortunate if some of these sequences, that at least a quarterof the clones in an A. data, using the same standardsas for their manyof which will be determinedthrough gossypii genomic bank contain pairs or normalpublications. the use of private funds, are not made public groups of genes in the same order or relative In practice, all intermediate forms of in S timpIu fs.sicfn. orientation as thzeir 5. cerevisiae counter- approach between these two cultural ex-

566 SCIENCE * VOL. 274 * 25 OCTOBER 1996

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms ARTICLES tremes have been employed in the yeast ries. Increasingly, large-scale sequencing 39. D. F. Voytas and J. D. Boeke, Nature 358, 717 (1992). 40. 0. Danilevskaya, A. Lofsky, M.-L Pardue, Mol. Cell. genome sequencing project. The network will become the province of the sequencing Biol. 35S, 83 (1992). system (to the surpriseof some) worked centers, with the small laboratories being 41. H. Biessmann etal., ibid. 12, 3910 (1992). very well: 55% of the total genome se- enlisted to sort out problem regions where 42. F. Foury, personal communication. quencewas determined by a EuropeanNet- their specialist knowledge of the organism 43. E. J. Louis, Yeast 11, 1553 (1995). 44. E. J. Louis and J. E. Haber, Genetics 131, 559 work in which a total of 92 laboratories involved may be of assistance. Enthusiasm, (1992). were involved over the course of the determination, and cooperation (forces that 45. C. W Grieder,in TheEukaiyotic Genome: Organisa- project. The other 45% was obtained by are indispensable under pioneering circum- tion and Regulation, P. M. A. Broda, S. G. Oliver, P. F. G. Sims, Eds. (Cambridge Univ. Press, Cam- five medium- to large-sized sequencing cen- stances) drove this enterprise; we expect bridge, 1993), pp. 31-42. ters. Over the 6 yearsof its life, the Euro- these forces will continue to propel us 46. B. L. Wicksteed et al., Yeast 10, 39 (1992). pean Network's performance improved through the next phase of the project. 47. A. Eigel and H. Feldmann, EMBO J. 1, 1245 (1982). steadily.Its 300-kb fragmentof the "inter- 48. J. Gafner, E. M. De Robertis, P. Philippsen, ibid. 2, 583 (1983). national"chromosome XVI was sequenced REFERENCESAND NOTES 49. J. R. Warmingtonet al., Nucleic Acids Res. 14, 3475 and made publicly available as rapidlyas (1986). 1. S. G. Oliveret al., NatLire357, 38 (1992). 50. J. R. Warmington, R. P. Green, C. S. Newlon, S. G. were the fragmentsfrom the three other 2. B. Dujon et al., ibid. 369, 371 (1994). Oliver,ibid. 15, 8963 (1987). 3. M. Johnston et al., Science 265, 2077 (1994). partners.The sequencequality produced by 51. J. Hauber, R. Stucka, R. Krieg, H. Feldmann, ibid. 4. H. Feldmann et al., EMBO J. 13, 5795 (1994). the two approacheswas similar:A large 16, 10623 (1988). 5. H. Bussey et al., Proc. Natl. Acad. Sci. U.S.A. 92, 52. H. LochmOller,R. Stucka, H. Feldmann, Curr.Genet. fragmentof chromosomeXII (170 kb) was 3809 (1995). 16, 247 (1989). deliberatelysequenced by both a largecen- 6. Y. Murakamiet al., Nature Genet. 10, 261 (1995). 53. H. Ji et al., Cell 73, 1007 (1993). 7. F. Galibertet al., EMBO J. 15, 2031 (1996). ter and the EuropeanNetwork; both groups 54. M. R. Wilkinset al., Bio/Technology 14, 61 (1996). 8. B. Barrellet a/., in preparation. had an extremelylow errorfrequency (one 55. S. G. Oliver,Nature 379, 597 (1996). 9. B. Barrellet a., in preparation. 56. H. W. Mewes, personal communication. to two sequencingerrors per 100 kb). It is 10. H. Bussey et a/, in preparation. 57. V. V. Svetlov and T. G. Cooper, Yeast 11, 1439 11. F. Dietrichet a/, in preparation. estimatedthat an averageof threeerrors per (1995). 12. B. Dujon et a/, in preparation. 10 kb remain in the publishedversion of 58. B. Nelissen, P. Mordant, J. L. Jonniaux, R. De 13. C. Jacq et al., in preparation. Wachter, A. Goffeau, FEBS Lett. 377, 32 (1995). the yeast genome sequence. 14. M. Johnston et a., in preparation. 59. S. Sagliocco et a/., Yeast, in press. There are at least three reasonsfor the 15. P. Philippsen et a., in preparation. 60. S. C. Ushinsky et ibid., in press. 16. H. Tettelin et a., in preparation. al., successof the network.The first is the use 61. http://www3.ncbinim.nih.gov/OMIM/ 17. R. J. Rothstein, Methods Enzymol. 101, 202 (1983). of moderninformatics technology and the 62. D. E. Bassett, M. S. Boguski, P. Hieter, Nature 379, 18. A. Baudin, 0. Ozier-Kalogeropoulos, A. Denouel, F. 589 Internet to coordinatethe acquisitionand Lacroute, C. Cullin, Nucleic Acids Res. 21, 3329 (1996); http://www.ncbi.nlm.gov/XREFdb/ 63. S. Ohno,Evolution Through Gene Duplication(Allen of as well to (1993). analysis data, as supportthe and Unwin, London, 1970). general managementof the network.The 19. A. Wach, A. Brachat, R. Pohlmann, P. Philippsen, Yeast 10,1793 (1994). 64. M. V. Olson,in TheMolecular Biology of the Yeast second is that severalsmall laboratoriesbe- 20. R. K. Mortimer,C. R. Contopoulou, J. S. King, ibid. Saccharomyces:Genome Dynamics, Protein Syn- thesis, and Energetics, J. R. Broach, J. R. Pringle, E. came extremelyefficient; the most produc- 8, 817 (1992). W. Jones, Eds. (Cold Spring Harbor Laboratory 21. B. Dujon, Trends Genet. 12, 263 (1996). tive reached,200 kb of finished sequence Press, Cold Spring Harbor, NY, 1991), pp.1-39. 22. J. Hodgkin, R. H. A. Plasterk, R. H. Waterston, Sci- 65. M. Viswanathan et a/., Gene 148, 149 (1994). per yearusing only two or three people and ence 270, 410 (1994). B. A. in almostno automation.The third(and most 23. S. Bowman and B. Barrell, http://www.sanger. 66. Purnelle and Goffeau, Yeast, press. ac.uk/yeast/pombe.html 67. L. Melnick and F. Sherman, J. Mol. Biol. 233, 372 important)reason is that an enthusiastic (1993). and competitive team spirit was built up 24. R. J. Planta, P. M. Goncalves, W. H. Mager, Bio- chem. Cell Biol. 73, 825 (1996). 68. D. Lalo, S. Stettler, S. Mariotte, P. P. Slonimski, P. among the small sequencersas, month by 25. B. C. Rymond and M. Rosbash, in The Molecular Thuriaux,C. R. Acad. Sci. Paris 316, 367 (1993). month, they watched the data accumulate Biology of the Yeast Saccharomyces: Gene Expres- 69. A.-M. B6cam, Y. Jia, P. P. Slonimski, C. J. Herbert, Yeast 11, S300 (1995). exponentiallytoward completion of the ge- sion, E. W. Jones, J. R. Pringle, J. R. Broach, Eds. (Cold Spring Harbor Laboratory Press, Cold Spring 70. R. K. Mortimerand J. R. Johnston, Genetics 113, 35 nome. The generalfeeling among the net- Harbor, NY, 1992), pp.143-192. (1989). work'sparticipants was that their member- 26. G. R. Fink, Cell 49, 5 (1987). 71. R. K. Mortimer,T. Torok, P. Romano, G. Suzzi, M. ship conferred considerable benefits on 27. P. M. Sharp and A. T. Lloyd, Nucleic Acids Res. 21, Polsinelli, Yeast 11, S574 (1995). 179 (1993). 72. S. G. Oliver, Trends Genet. 12, 241 (1996). themselvesand on the scientific communi- 28. D. Zenvirthet al., EMBO J. 11, 3441 (1992). 73. F. Langle-Rouaultand E. Jacobs, NucleicAcids Res. ty as whole. 29. B. De Massy, V. Rocco, A. Nicolas, ibid. 14, 4589 23, 3079 (1995). Whetherthey workedin largecenters or (1995). 74. U. Guldener, S. Heck, T. Fiedler,J. Beinhauer, J. H. ibid. 24, 2519 (1996). small laboratories,most of the 600 or so .30. D. B. Kaback, H. Y. Steensma, P. de Jonge, Proc. Hegemann, Natl. Acad. Sci. U.S.A. 86, 3694 (1989). 75. R. D. Fleischmann et a/., Science 269, 496 (1995). scientists involved in sequencingthe yeast 31. P. Hieter, C. Mann, M. Snyder, R. W. Davis, Cell 40, 76. C. M. Frazeret al., ibid. 270, 397 (1995). genome share the feeling that the world- 381 (1985). 77. C. Bult et a!., ibid. 273, 1058 (1996). wide ties created by this venture are of 32. A. W. Murray,N. P. Schultes, J. W Szostak, ibid. 45, 78. L. Riles et a/., Genetics 134, 81 (1993). 529 (1986). 79. A. Thierry,L. Gaillon, F. Galibert, B. Dujon Yeast 11, inestimable value to the future of yeast 33. H. Y. Steensma, J. C. Crowley, D. B. Kaback, Mol. 121 (1995). research.In Europeespecially, a corporate Cell. Biol. 7, 2894 (1987). 80. J. D. Hoheisel et al., Cell 73, 109 (1993). spirithas been engenderedthat will permit 34. M. J Charron, E. Read, S. R. Haut, C. A. Michels, 81. T. Mizukamiet al., ibid., p. 121. Genetics 122, 307 (1989). 82. W. S. Chu, B. B. Magee, P. T. Magee. J. Bacteriol. the sharingof data and ideas that will be 35. M. Carlson, J. L. Celenza, F. J. Eng, Mol. Cell. Biol. 5, 175, 6637 (1993). requiredto meet the challengeof decipher- 410 (1985). 83. http://alces.med.umn.edu/Candida.html ing the functions and interactionsof the 36. G. l. Naumov, E. S. Naumova, E. J. Louis, Yeast 11, 84. R. Altman-J6hland P. Philippsen, Molec. Gen. Gen- novel genes. Nevertheless, it is doubtful 481 (1995). et. 250, 69 (1996). 37. C. Chan and B.-K. Tye, Proc. Natl. Acad. Sci. U.S.A. 85. P. Russell and P. Nurse, Cell 45, 781 (1986). that in the futuregenome sequencingwill 77, 6329 (1980). 86. S. Walsh and B. Barrell, Trends Genet. 12, 276 continue to involve many small laborato- 38. A. J. Linkand M. V. Olson, Genetics 1 27,681 (1991). (1996).

SCIENCE * VOL. 274 * 25 OCTOBER 1996 567

This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms