Analysis of the genome sequence of the flowering plant Arabidopsis thaliana

The Arabidopsis Genome Initiative*

* Authorship of this paper should be cited as 'The Arabidopsis Genome Initiative'. A full list of contributors appears at the end of this paper.

The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the 125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000 families, similar to the functional diversity of Drosophila and Caenorhabditis elegans, the other sequenced multicellular eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

The plant and animal kingdoms evolved independently from unicellular eukaryotes and represent highly contrasting life forms. The genome sequences of C. elegans1 and Drosophila2 reveal that metazoans share a great deal of genetic information required for developmental and physiological processes, but these genome sequences represent a limited survey of multicellular organisms. Flowering plants have unique organizational and physiological properties in addition to ancestral features conserved between plants and animals. The genome sequence of a plant provides a means for understanding the genetic basis of differences between plants and other eukaryotes, and provides the foundation for detailed functional characterization of plant genes.

Arabidopsis thaliana has many advantages for genome analysis, including a short generation time, small size, large number of offspring, and a relatively small nuclear genome. These advantages promoted the growth of a scientific community that has investigated the biological processes of Arabidopsis and has characterized many genes3. To support these activities, an international collaboration (the Arabidopsis Genome Initiative, AGI) began sequencing the genome in 1996. The sequences of chromosomes 2 and 4 have been reported4,5, and the accompanying Letters describe the sequences of chromosomes 1 (ref. 6), 3 (ref. 7) and 5 (ref. 8).

Here we report analysis of the completed Arabidopsis genome sequence, including annotation of predicted genes and assignment of functional categories. We also describe chromosome dynamics and architecture, the distribution of transposable elements and other repeats, the extent of lateral gene transfer from organelles, and the comparison of the genome sequence and structure to that of other Arabidopsis accessions (distinctive lines maintained by single-seed descent) and plant species. This report is the summation of work by experts interested in many biological processes selected to illuminate plant-specific functions including defence, photomorphogenesis, gene regulation, development, metabolism, transport and DNA repair.

The identification of many new members of receptor families, cellular components for plant-specific functions, genes of bacterial origin whose functions are now integrated with typical eukaryotic components, independent evolution of several families of transcription factors, and suggestions of as yet uncharacterized metabolic pathways are a few more highlights of this work. The implications of these discoveries are not only relevant for plant biologists, but will also affect agricultural science, evolutionary biology, bioinformatics, combinatorial chemistry, functional and comparative genomics, and molecular medicine.

Overview of sequencing strategy

We used large-insert bacterial artificial chromosome (BAC), phage (P1) and transformation-competent artificial chromosome (TAC) libraries9–12 as the primary substrates for sequencing. Early stages of genome sequencing used 79 cosmid clones. Physical maps of the genome of accession Columbia were assembled by restriction fragment 'fingerprint' analysis of BAC clones13, by hybridization14 or polymerase chain reaction (PCR)15 of sequence-tagged sites and by hybridization and Southern blotting16. The resulting maps were integrated (http://nucleus/cshl.org/arabmaps/) with the genetic map and provided a foundation for assembling sets of contigs into sequence-ready tiling paths. End sequence (http://www.tigr.org/tdb/at/abe/bac_end_search.html) of 47,788 BAC clones was used to extend contigs from BACs anchored by marker content and to integrate contigs.

Ten contigs representing the chromosome arms and centromeric heterochromatin were assembled from 1,569 BAC, TAC, cosmid and P1 clones (average insert size 100 kilobases (kb)). Twenty-two PCR products were amplified directly from genomic DNA and sequenced to link regions not covered by cloned DNA or to optimize the minimal tiling path. Telomere sequence was obtained from specific yeast artificial chromosome (YAC) and phage clones, and from inverse polymerase chain reaction (IPCR) products derived from genomic DNA. Clone fingerprints, together with BAC end sequences, were generally adequate for selection of clones for sequencing over most of the genome. In the centromeric regions, these physical mapping methods were supplemented with genetic mapping to identify contig positions and orientation17.

Selected clones were sequenced on both strands and assembled using standard techniques. Comparison of independently derived sequence of overlapping regions and independent reassembly of sequenced clones revealed accuracy rates between 99.99 and 99.999%. Over half of the sequence differences were between genomic and BAC clone sequence. All available sequenced genetic markers were integrated into sequence assemblies to verify sequence contigs4–8. The total length of sequenced regions, which extend from either the telomeres or ribosomal DNA repeats to the 180-base-pair

796 © 2000 Macmillan Magazines Ltd NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com articles

(bp) centromeric repeats, is 115,409,949 bp (Table 1). Estimates of the unsequenced centromeric and rDNA repeat regions measure roughly 10 megabases (Mb), yielding a genome size of about 125 Mb, in the range of the 50–150 Mb haploid content estimated by different methods18. In general, features such as gene density, expression levels and repeat distribution are very consistent across the five chromosomes (Fig. 1), and these are described in detail in reports on individual chromosomes4–8 and in the analysis of centromere, telomere and rDNA sequences.

We used tRNAscan-SE 1.21 (ref. 19) and manual inspection to identify 589 cytoplasmic transfer RNAs, 27 organelle-derived tRNAs and 13 pseudogenes, more than in any other genome sequenced to date. All 46 tRNA families needed to decode all possible 61 codons were found, defining the completeness of the functional set. Several highly amplified families of tRNAs were found on the same strand6; excluding these, each amino acid is decoded by 10–41 tRNAs.

The spliceosomal RNAs (U1, U2, U4, U5, U6) have all been experimentally identified in Arabidopsis. The previously identified sequences for all were found in the genome, except for U5 where the most similar counterpart was 92% identical. Between 10 and 16 copies of each small nuclear RNA (snRNA) were found across all chromosomes, dispersed as singletons or in small groups.

The small nucleolar RNAs (snoRNAs) consist of two subfamilies, the C/D box snoRNAs, which include 36 Arabidopsis genes, and the H/ACA box snoRNAs, for which no members have been identified in Arabidopsis. U3 is the most numerous of the C/D box snoRNAs, with eight copies found in the genome. We identified forty-five additional C/D box snoRNAs using software (www..wustl.edu/snoRNAdb/) that detects snoRNAs that guide ribose methylation of ribosomal RNA.

A combination of algorithms, all optimized with parameters based on known Arabidopsis gene structures, was used to define gene structure. We used similarities to known protein and expressed sequence tag (EST) sequence to refine gene models. Eighty per cent of the gene structures predicted by the three centres involved were completely consistent, 93% of ESTs matched gene models, and less than 1% of ESTs matched predicted non-coding regions, indicating

[Figure 1: scale bar, 100 kb. Chromosomes 1 to 5 (29.1, 19.6, 23.2, 17.5 and 26.0 Mb), each with tracks for Genes, ESTs, TEs, MT/CP and RNAs. Pseudo-colour spectra run from high density to low density.]

Figure 1 Representation of the Arabidopsis chromosomes. Each chromosome is represented as a coloured bar. Sequenced portions are red, telomeric and centromeric regions are light blue, heterochromatic knobs are shown black and the rDNA repeat regions are magenta. The unsequenced telomeres 2N and 4N are depicted with dashed lines. Telomeres are not drawn to scale. Images of DAPI-stained chromosomes were kindly supplied by P. Fransz. The frequency of features was given pseudo-colour assignments, from red (high density) to deep blue (low density). Gene density ('Genes') ranged from 38 per 100 kb to 1 gene per 100 kb; expressed sequence tag matches ('ESTs') ranged from more than 200 per 100 kb to 1 per 100 kb. Transposable element densities ('TEs') ranged from 33 per 100 kb to 1 per 100 kb. Mitochondrial and chloroplast insertions ('MT/CP') were assigned black and green tick marks, respectively. Transfer RNAs and small nucleolar RNAs ('RNAs') were assigned black and red tick marks, respectively.

that most potential genes were identified. The sensitivity and selectivity of the gene prediction software used in this report has been comprehensively and independently assessed20.

The 25,498 predicted genes (Table 1) constitute the largest gene set published to date: C. elegans1 has 19,099 genes and Drosophila2 13,601 genes. Arabidopsis and C. elegans have similar gene density, whereas Drosophila has a lower gene density; Arabidopsis also has a significantly greater extent of tandem gene duplications and segmental duplications, which may account for its larger gene set.

The rDNA repeat regions on chromosomes 2 and 4 were not sequenced because of their known repetitive structure and content. The centromeric regions are not completely sequenced owing to large blocks of monotonic repeats such as 5S rDNA and 180-bp repeats. The sequence continues to be extended further into centromeric and other regions of complex sequence.

Characterization of the coding regions

To assess the similarities and differences of the Arabidopsis gene complement compared with other sequenced eukaryotic genomes, we assigned functional categories to the complete set of Arabidopsis genes. For chromosome 4 genes and the yeast genome, predicted functions were previously manually assigned5,21. All other predicted proteins were automatically assigned to these functional categories22, assuming that conserved sequences reflect common functional relationships.

The functions of 69% of the genes were classified according to sequence similarity to proteins of known function in all organisms; only 9% of the genes have been characterized experimentally (Fig. 2a). Generally similar proportions of gene products were predicted to be targeted to the secretory pathway and mitochondria in Arabidopsis and yeast, and up to 14% of the gene products are

Table 1 Summary statistics of the Arabidopsis genome

(a) The DNA molecules

Feature                        Chr. 1       Chr. 2       Chr. 3       Chr. 4       Chr. 5       Σ
Length (bp)                    29,105,111   19,646,945   23,172,617   17,549,867   25,953,409   115,409,949
Top arm (bp)                   14,449,213   3,607,091    13,590,268   3,052,108    11,132,192
Bottom arm (bp)                14,655,898   16,039,854   9,582,349    14,497,759   14,803,217
Base composition (%GC)
  Overall                      33.4         35.5         35.4         35.5         34.5
  Coding                       44.0         44.0         44.3         44.1         44.1
  Non-coding                   32.4         32.9         33.0         32.8         32.5
Number of genes                6,543        4,036        5,220        3,825        5,874        25,498
Gene density (kb per gene)     4.0          4.9          4.5          4.6          4.4
Average gene length (bp)       2,078        1,949        1,925        2,138        1,974
Average peptide length (bp)    446          421          424          448          429
Exons
  Number                       35,482       19,631       26,570       20,073       31,226       132,982
  Total length (bp)            8,772,559    5,100,288    6,654,507    5,150,883    7,571,013    33,249,250
  Average per gene             5.4          4.9          5.1          5.2          5.3
  Average size (bp)            247          259          250          256          242
Introns
  Number                       28,939       15,595       21,350       16,248       25,352       107,484
  Total length (bp)            4,828,766    2,768,430    3,397,531    3,030,649    4,030,045    18,055,421
  Average size (bp)            168          177          159          186          159
Number of genes with ESTs (%)  60.8         56.9         59.8         61.4         61.4
Number of ESTs                 30,522       14,989       20,732       16,605       22,885       105,733

(b) The proteome

Classification/function                    Chr. 1          Chr. 2          Chr. 3          Chr. 4          Chr. 5          Σ
Total proteins                             6,543           4,036           5,220           3,825           5,874           25,498
With INTERPRO domains                      4,194 (64.1%)   1,205 (29.9%)   2,989 (57.8%)   1,545 (40.4%)   3,136 (53.4%)   13,069 (51.3%)
Genes containing at least one TM domain    2,334 (35.7%)   1,322 (32.8%)   1,615 (30.9%)   1,402 (36.7%)   1,940 (33.0%)   8,613 (33.8%)
Genes containing at least one SCOP domain  2,513 (38.4%)   1,424 (35.3%)   1,664 (31.9%)   1,304 (34.1%)   2,121 (36.1%)   9,026 (35.4%)
With putative signal peptides
  Secretory pathway                        1,242 (19.0%)   675 (16.7%)     877 (17.0%)     659 (17.2%)     1,014 (17.3%)   4,467 (17.6%)
    >0.95 specificity                      1,146 (17.5%)   632 (15.7%)     813 (15.7%)     632 (16.5%)     964 (16.4%)     4,167 (16.4%)
  Chloroplast                              866 (13.2%)     535 (13.2%)     754 (14.6%)     532 (13.9%)     887 (15.1%)     3,574 (14.0%)
    >0.95 specificity                      602 (9.2%)      290 (7.2%)      420 (8.1%)      298 (7.8%)      475 (8.1%)      2,085 (8.2%)
  Mitochondria                             901 (13.8%)     425 (10.5%)     554 (10.7%)     390 (10.2%)     627 (10.7%)     2,897 (11.4%)
    >0.95 specificity                      113 (1.7%)      49 (1.2%)       63 (1.2%)       59 (1.5%)       65 (1.1%)       349 (1.4%)
Functional classification
  Cellular metabolism                      1,188 (22.7%)   620 (23.3%)     745 (22.8%)     588 (22.9%)     868 (21.1%)     4,009 (22.5%)
  Transcription                            880 (16.8%)     474 (17.8%)     566 (17.3%)     335 (13.1%)     763 (18.6%)     3,018 (16.9%)
  Plant defence                            640 (12.2%)     276 (10.4%)     354 (10.8%)     295 (11.5%)     490 (11.9%)     2,055 (11.5%)
  Signalling                               573 (11.0%)     296 (11.1%)     356 (10.9%)     210 (8.2%)      420 (10.2%)     1,855 (10.4%)
  Growth                                   542 (10.4%)     263 (9.9%)      357 (10.9%)     448 (17.5%)     469 (11.4%)     2,079 (11.7%)
  Protein fate                             520 (9.9%)      273 (10.2%)     314 (9.6%)      264 (10.3%)     395 (9.6%)      1,766 (9.9%)
  Intracellular transport                  435 (8.3%)      214 (8.9%)      269 (8.2%)      220 (8.6%)      334 (8.1%)      1,472 (8.3%)
  Transport                                236 (4.5%)      139 (5.2%)      155 (4.7%)      113 (4.4%)      206 (5.0%)      849 (4.8%)
  Protein synthesis                        216 (4.1%)      111 (4.2%)      148 (4.5%)      90 (3.5%)       165 (4.0%)      730 (4.1%)
  Total                                    5,230           2,666           3,264           2,563           4,110           17,833

The features of Arabidopsis chromosomes 1–5 and the complete nuclear genome are listed. Specialized searches used the following programs and databases: INTERPRO23; transmembrane (TM) domains by ALOM2 (unpublished); SCOP domain database121; functional classification by the PEDANT analysis system22. Signal peptide prediction (secretory pathway, targeted to chloroplast or mitochondria) was performed using TargetP122 and http://www.cbs.dtu.dk/services/TargetP/. * Default value.
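As a worked check on Table 1, the 'Average per gene' row can be recomputed from the per-chromosome counts. The sketch below is only illustrative: it assumes that the counts whose genome-wide total is 132,982 are exon counts (an interpretation of the table layout, supported by the fact that this total exceeds the 107,484 total of the following block by exactly the 25,498 genes), and the variable and function names are hypothetical.

```python
# Values transcribed from Table 1(a). Treating the first "Number" row as
# exon counts is an assumption about the table layout, not stated in the text.
genes = {"Chr. 1": 6543, "Chr. 2": 4036, "Chr. 3": 5220, "Chr. 4": 3825, "Chr. 5": 5874}
exons = {"Chr. 1": 35482, "Chr. 2": 19631, "Chr. 3": 26570, "Chr. 4": 20073, "Chr. 5": 31226}


def exons_per_gene(exon_counts, gene_counts):
    """Return the mean number of exons per gene for each chromosome, to one decimal."""
    return {chrom: round(exon_counts[chrom] / gene_counts[chrom], 1) for chrom in gene_counts}


per_chromosome = exons_per_gene(exons, genes)
total_genes = sum(genes.values())
total_exons = sum(exons.values())

for chrom, value in per_chromosome.items():
    print(chrom, value)
print("genes:", total_genes, "exons:", total_exons)
```

Under that assumption the recomputed ratios agree with the 'Average per gene' row, and the column sums agree with the genome-wide totals given in the table.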

likely to be targeted to the chloroplast (Table 1). The significant proportion of genes with predicted functions involved in metabolism, gene regulation and defence is consistent with previous analyses5. Roughly 30% of the 25,498 predicted gene products (Fig. 2a), comprising both plant-specific proteins and proteins with similarity to genes of unknown function from other organisms, could not be assigned to functional categories.

To compare the functional categories in more detail, we compared data from the complete genomes of Escherichia coli23, Synechocystis sp.24, Saccharomyces cerevisiae21, C. elegans1 and Drosophila2, and a non-redundant protein set of Homo sapiens, with the Arabidopsis genome data (Fig. 2b), using a stringent BLASTP threshold value of E < 10^-30. The proportion of Arabidopsis proteins having related counterparts in eukaryotic genomes varies by a factor of 2 to 3 depending on the functional category. Only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes, reflecting the independent evolution of many plant transcription factors. In contrast, 48–60% of genes involved in protein synthesis have counterparts in the other eukaryotic genomes, reflecting highly conserved gene functions. The relatively high proportion of matches between Arabidopsis and bacterial proteins in the categories 'metabolism' and 'energy' reflects both the acquisition of bacterial genes from the ancestor of the plastid and high conservation of sequences across all species. Finally, a comparison between unicellular and multicellular eukaryotes indicates that Arabidopsis genes involved in cellular communication and signal transduction have more counterparts in multicellular eukaryotes than in yeast, reflecting the need for sets of genes for communication in multicellular organisms.

Pronounced redundancy in the Arabidopsis genome is evident in segmental duplications and tandem arrays, and many other genes with high levels of sequence conservation are also scattered over the genome. Sequence similarity exceeding a BLASTP value E < 10^-20 and extending over at least 80% of the protein length was used as the criterion to identify protein families (Table 2). A total of 11,601 protein types were identified. Thirty-five per cent of the predicted proteins are unique in the genome, and the proportion of proteins belonging to families of more than five members is substantially higher in Arabidopsis (37.4%) than in Drosophila (12.1%) or

[Figure 2: a, proportions of genes by functional category: Metabolism; Transcription; Cell growth, cell division and DNA synthesis; Cell rescue, defence, cell death, ageing; Cellular communication/signal transduction; Protein destination; Intracellular transport; Cellular biogenesis; Transport facilitation; Energy; Protein synthesis; Ionic homeostasis; Unclassified; Classification not yet clear-cut. b, bar chart over the same categories (y axis, 0 to 0.7) with series for E. coli, Synechocystis, S. cerevisiae, C. elegans and Drosophila.]

Figure 2 Functional analysis of Arabidopsis genes. a, Proportion of predicted Arabidopsis genes in different functional categories. b, Comparison of functional categories between organisms. Subsets of the Arabidopsis proteome containing all proteins that fall into a common functional class were assembled. Each subset was searched against the complete set of translations from Escherichia coli, Synechocystis sp. PCC6803, Saccharomyces cerevisiae, Drosophila, C. elegans and a Homo sapiens non-redundant protein database. The percentage of Arabidopsis proteins in a particular subset that had a BLASTP match with E ≤ 10^-30 to the respective reference genome is shown. This reflects the measure of sequence conservation of proteins within this particular functional category between Arabidopsis and the respective reference genome. y axis, 0.1 = 10%.
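The quantity plotted in Fig. 2b can be expressed as a simple proportion. The following is a minimal sketch, assuming the best BLASTP E-value per protein has already been extracted for one functional subset and one reference genome; the function name, the input layout and the toy values are hypothetical and are not taken from the analysis itself.

```python
def fraction_with_match(best_evalues, threshold=1e-30):
    """Fraction of proteins in a subset whose best hit has E <= threshold.

    best_evalues holds one entry per protein: the best E-value against the
    reference genome, or None when no hit was reported.
    """
    if not best_evalues:
        return 0.0
    hits = sum(1 for e in best_evalues if e is not None and e <= threshold)
    return hits / len(best_evalues)


# Made-up E-values for five proteins in one functional class.
toy_subset = [1e-45, 3e-31, 2e-12, None, 1e-30]
share = fraction_with_match(toy_subset)
print(f"{share:.1%} of the toy subset has a match at E <= 1e-30")
```

On the figure's scale, a value of 0.1 returned by such a function corresponds to 10% of the proteins in that functional class.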


Table 2 Proportion of genes in different organisms present as either singletons or in paralogous families

                   No. of singletons and                Gene families containing
Organism           distinct gene families   Unique     2 members  3 members  4 members  5 members  >5 members
H. influenzae      1,587                    88.8%      6.8%       2.3%       0.7%       0.0%       1.4%
S. cerevisiae      5,105                    71.4%      13.8%      3.5%       2.2%       0.7%       8.4%
D. melanogaster    10,736                   72.5%      8.5%       3.4%       1.9%       1.6%       12.1%
C. elegans         14,177                   55.2%      12.0%      4.5%       2.7%       1.6%       24.0%
Arabidopsis        11,601                   35.0%      12.5%      7.0%       4.4%       3.6%       37.4%

The number of genes in the genomes of Haemophilus influenzae, S. cerevisiae, Drosophila, C. elegans and Arabidopsis that are present either as singletons or in gene families with two or more members are listed. To be grouped in a gene family, two genes had to show similarity exceeding a BLASTP value E < 10^-20 and a FASTA alignment over at least 80% of the protein length. In column 1, the number of genes that are unique plus the number of gene families are listed. Columns 2 to 6 give the percentage of genes present as singletons or in gene families of n members.
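The grouping criterion in the Table 2 legend (E < 10^-20 and alignment over at least 80% of the protein length) can be illustrated with a small sketch. The text does not say how pairwise matches were merged into families, so the single-linkage merging below (via a union-find structure) is an assumption, and the function name, input tuples and toy data are hypothetical.

```python
def group_families(genes, hits, max_evalue=1e-20, min_coverage=0.8):
    """Group genes into families from pairwise hits.

    hits is an iterable of (gene_a, gene_b, evalue, coverage) tuples, where
    coverage is the aligned fraction of the protein length (0 to 1).
    Single-linkage merging is an assumption, not the published method.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Walk to the representative, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, evalue, coverage in hits:
        if evalue < max_evalue and coverage >= min_coverage:
            parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    # Largest families first; ties broken by member names for stable output.
    return sorted(families.values(), key=lambda members: (-len(members), members))


toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = [
    ("g1", "g2", 1e-50, 0.95),  # passes both criteria
    ("g2", "g3", 1e-25, 0.85),  # passes both criteria
    ("g4", "g5", 1e-40, 0.50),  # fails the coverage criterion
    ("g3", "g4", 1e-10, 0.90),  # fails the E-value criterion
]
families = group_families(toy_genes, toy_hits)
print(families)
```

In this toy input the five genes form three 'types' in the sense of column 1 of Table 2: one family of three members and two singletons.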

C. elegans (24.0%). The absolute number of Arabidopsis gene families and singletons (types) is in the same range as the other multicellular eukaryotes, indicating that a proteome of 11,000–15,000 types is sufficient for a wide diversity of multicellular life. The proportion of gene families with more than two members is considerably more pronounced in Arabidopsis than in other eukaryotes (Fig. 3). As segmental duplication is responsible for 6,303 gene duplications (see below), the extent of tandem gene duplications accounts for a significant proportion of the increased family size. These features of the Arabidopsis, and presumably other plant genomes, may indicate more relaxed constraints on genome size in plants, or a more prominent role of unequal crossing over to generate new gene copies.

Conserved protein domains revealed more informative differences through INTERPRO25 analysis of the predicted gene products from Arabidopsis, S. cerevisiae, C. elegans and Drosophila. Statistically over-represented domains, and those that are absent from the Arabidopsis genome, indicate domains that may have been gained or lost during the evolution of plants (Supplementary Information Table 1). Proteins containing the Pro-Pro-Arg repeat, which is involved in RNA stabilization and RNA processing, are over-represented as compared to yeast, fly and worm; 400 proteins containing this signature were detected in Arabidopsis compared with only 10 in total in yeast, Drosophila and C. elegans. Protein kinases and associated domains, 169 proteins containing a disease resistance protein signature, and the Toll/IL-1R (TIR) domain, a component of pathogen recognition molecules26, are also relatively abundant. This suggests that pathways transducing signals in response to pathogens and diverse environmental cues are more abundant in plants than in other organisms.

The RING zinc finger domain is relatively over-represented in Arabidopsis compared with yeast, Drosophila and C. elegans, whereas the F-box domain is over-represented as compared with yeast and Drosophila only. These domains are involved in targeting proteins to the proteasome27 and ubiquitinylation28 pathways of protein degradation, respectively. In plants many processes such as hormone and defence responses, light signalling, and circadian rhythms and pattern formation use F-box function to direct negative regulators to the ubiquitin degradation pathway. This mode of regulation appears to be more prevalent in plants and may account for a higher representation of the F box than in Drosophila and for the over-representation of the ubiquitin domain in the Arabidopsis genome. RING finger domain proteins in general have a role in ubiquitin protein ligases, indicating that proteasome-mediated degradation is a more widespread mode of regulation in plants than in other kingdoms.

Most functions identified by protein domains are conserved in similar proportions in the Arabidopsis, S. cerevisiae, Drosophila and C. elegans genomes, pointing to many ubiquitous eukaryotic pathways. These are illustrated by comparing the list of human disease genes29 to the complete Arabidopsis gene set using BLASTP. Out of 289 human disease genes, 139 (48%) had hits in Arabidopsis using a BLASTP threshold E < 10^-10. Sixty-nine (24%) exceeded an E < 10^-40 threshold, and 26 (9.3%) had scores better than E < 10^-100 (Table 3). There are at least 17 human disease genes more similar to Arabidopsis genes than yeast, Drosophila or C. elegans genes (Table 3).

This analysis shows that, although numerous families of proteins are shared between all eukaryotes, plants contain roughly 150 unique protein families. These include transcription factors, structural proteins, enzymes and proteins of unknown function. Members of the families of genes common to all eukaryotes have undergone substantial increases or decreases in their size in Arabidopsis. Finally, the transfer of a relatively small number of cyanobacteria-related genes from a putative endosymbiotic ancestor of the plastid has added to the diversity of protein structures found in plants.

Genome organization and duplication

The Arabidopsis genome sequence provides a complete view of chromosomal organization and clues to its evolutionary history. Gene families organized in tandem arrays of two or more units have been described in C. elegans1 and Drosophila2. Analysis of the Arabidopsis genome revealed 1,528 tandem arrays containing 4,140 individual genes, with arrays ranging up to 23 adjacent members (Fig. 3). Thus 17% of all genes of Arabidopsis are arranged in tandem arrays.

[Figure 3: histogram. x axis, number of tandemly repeated genes per gene array; y axis, number of arrays (0 to 1,200). Bar values by array size: 2, 1052; 3, 249; 4, 108; 5, 57; 6, 36; 7, 20; 8, 18; 9, 17; 10, 15; 11–15, 6; 16–20, 2; 21–23, 2.]

Figure 3 Distribution of tandemly repeated gene arrays in the Arabidopsis genome. Tandemly repeated gene arrays were identified using the BLASTP program with a threshold of E < 10^-20. One unrelated gene among cluster members was tolerated. The histogram gives the number of clusters in the genome containing 2 to n similar gene units in tandem.

Large segmental duplications were identified either by directly aligning chromosomal sequences or by aligning proteins and searching for tracts of conserved gene order. All five chromosomes were aligned to each other in both orientations using MUMmer30, and the results were filtered to identify all segments at least 1,000 bp in length with at least 50% identity (Supplementary Information Fig. 1). These revealed 24 large duplicated segments of 100 kb or larger, comprising 65.6 Mb or 58% of the genome. The only duplicated segment in the centromeric regions was a 375-kb segment on chromosome 4. Many duplications appear to have undergone further shuffling, such as local inversions after the duplication event.

We used TBLASTX5 to identify collinear clusters of genes residing in large duplicated chromosomal segments. The duplicated regions encompass 67.9 Mb, 60% of the genome, slightly more than was

found in the DNA-based alignment (Fig. 4), and these data extend earlier findings4,5,31. The extent of sequence conservation of the duplicated genes varies greatly, with 6,303 (37%) of the 17,193 genes in the segments classified as highly conserved (E < 10^-30) and a further 1,705 (10%) showing less significant similarity up to E < 10^-5. The proportion of homologous genes in each duplicated segment also varies widely, between 20% and 47% for the highly conserved class of genes. In many cases, the number of copies of a gene and its counterpart differ (for example, one copy on one chromosome and multiple copies on the other; see Supplementary Information Fig. 2); this could be due to either tandem duplication or gene loss after the segmental duplication.

What does the duplication in the Arabidopsis genome tell us about the ancestry of the species? Polyploidy occurs widely in plants and is proposed to be a key factor in plant evolution32. As the majority of the Arabidopsis genome is represented in duplicated (but not triplicated) segments, it appears most likely that Arabidopsis, like maize, had a tetraploid ancestor33. A comparative sequence analysis of Arabidopsis and tomato estimated that a duplication occurred ~112 Myr ago to form a tetraploid34. The degrees of conservation of the duplicated segments might be due to divergence from an ancestral autotetraploid form, or might reflect differences present in an allotetraploid ancestor. It is also possible, however, that several independent segmental duplication events took place instead of tetraploid formation and stabilization.

The diploid genetics of Arabidopsis and the extensive divergence of the duplicated segments have masked its evolutionary history. The determination of Arabidopsis gene functions must therefore be pursued with the potential for functional redundancy taken into account. The long period of time over which genome stabilization has occurred has, however, provided ample opportunity for the divergence of the functions of genes that arose from duplications.

Comparative analysis of Arabidopsis accessions

Comparing the multiple accessions of Arabidopsis allows us to identify commonly occurring changes in genome microstructure. It also enables the development of new molecular markers for genetic mapping. High rates of polymorphism between Arabidopsis accessions, including both DNA sequence and copy number of tandem arrays, are prevalent at loci involved in disease resistance35. This has been observed for other plant species, and such loci are thought to serve as templates for illegitimate recombination to create new pathogen response specificities36. We carried out a comparative analysis between 82 Mb of the genome sequence of Arabidopsis accession Columbia (Col-0) and 92.1 Mb of non-redundant low-pass (twofold redundant) sequence data of the genomic DNA of accession Landsberg erecta (Ler). We identified two classes of differences between the sequences: single nucleotide polymorphisms (SNPs), and insertion–deletions (InDels). As we used high stringency criteria, our results represent a minimum estimate of numbers of polymorphisms between the two genomes.

In total, we detected 25,274 SNPs, representing an average density of 1 SNP per 3.3 kb. Transitions (A/T–G/C) represented 52.1% of the SNPs, and transversions accounted for the remainder: 17.3% for A/T–T/A, 22.7% for A/T–C/G and 7.9% for C/G–G/C. In total, we detected 14,570 InDels at an average spacing of 6.1 kb. They ranged from 2 bp to over 38 kilobase-pairs, although 95% were smaller than 50 bp. Only 10% of the InDels were co-located with simple sequence repeats identified with the program Sputnik. An analysis of 416 relative insertions greater than 250 bp in Col-0 showed that 30% matched transposon-related proteins, indicating that a substantial proportion of the large InDels are the result of transposon insertion or excision. Many InDels contained entire active genes not related to transposons. Half of such genes absent from corresponding positions in the Col-0 sequence were found elsewhere on the genome of Ler. This indicates that genes have been transferred to new genomic locations.

Gene structures are often affected by small InDels and SNPs. The positions of SNPs and InDels were mapped relative to 87,427 exons and 70,379 introns annotated in the Col-0 sequence. SNPs were found in exons, introns and intergenic regions at frequencies of 1 SNP per 3.1, 2.2 and 3.5 kb, respectively. The frequencies for InDels were 1 per 9.3, 3.1 and 4.3 kb, respectively. Polymorphisms were detected in 7% of exons, and alter the spliced sequences of 25% of the predicted genes. For InDels in exons, insertion lengths divisible by three are prevalent for small insertions (< 50 bp), indicating that many proteins can withstand small insertions or deletions of amino acids without loss of function.

Our analyses show that sequence polymorphisms between accessions of Arabidopsis are common, and that they occur in both coding and non-coding regions. We found evidence for the relocation of genes in the genome, and for changes in the complement of transposable elements. The data presented here are available at http://www.arabidopsis.org/cereon/.
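The distinction between transitions and transversions used in the SNP counts above can be made concrete with a short sketch. It relies only on the standard definition (purine to purine or pyrimidine to pyrimidine is a transition; any purine/pyrimidine exchange is a transversion). The function name and the dictionary of reported shares are illustrative; the percentages are those quoted in the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Shares of SNP classes as quoted in the text (per cent of all SNPs).
reported_share = {
    "A/T-G/C (transitions)": 52.1,
    "A/T-T/A": 17.3,
    "A/T-C/G": 22.7,
    "C/G-G/C": 7.9,
}
total_share = round(sum(reported_share.values()), 1)

print(classify_snp("A", "G"), classify_snp("A", "T"), total_share)
```

The four reported classes account for all SNPs, so the three transversion classes together make up the remaining 47.9%.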

[Figure 4: chromosome diagram with scale marks at 5, 10, 15, 20, 25 and 30 Mb.]

Figure 4 Segmentally duplicated regions in the Arabidopsis genome. Individual chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top), centromeres are marked black. Coloured bands connect corresponding duplicated segments. Similarity between the rDNA repeats are excluded. Duplicated segments in reversed orientation are connected with twisted coloured bands. The scale is in megabases.

NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com © 2000 Macmillan Magazines Ltd 801 articles

Comparison of Arabidopsis and other plant genera

Comparative genetic mapping can reveal extensive conservation of genome organization between closely related species37,38. The comparative analysis of plant genome microstructure reveals much about the evolution of plant genomes and provides unprecedented opportunities for crop improvement by establishing the detailed structures of, and relationships between, the genomes of crops and Arabidopsis.

The frequent occurrence of tandem gene duplications and the apparent deletion of single genes, or small groups of adjacent genes, from duplicated regions suggests that unequal crossing over may be a key mechanism affecting the evolution of plant genome microstructure. However, the segmental inversions and gene translocations in the genomes of both rice and B. oleracea that are not found in Arabidopsis indicate that additional mechanisms may be involved40.

The ancestral lineages of Arabidopsis and the Brassica (cabbage and mustard) genera diverged 12.2–19.2 Myr ago40. Brassica genes show a high level of nucleotide conservation with their Arabidopsis orthologues, typically more than 85% in coding regions40. The structure of Brassica genomes resembles that of Arabidopsis, but with extensive triplication and rearrangement41, and extensive divergence of microstructure (Supplementary Information Fig. 3). The divergence between the genomes of Arabidopsis and Brassica oleracea is in striking contrast to that observed between Arabidopsis and C. rubella, although the time since divergence is only twofold greater. This accelerated rate of change in triplicated segments of the genome of B. oleracea indicates that polyploidy fosters rapid chromosomal evolution.

The Arabidopsis and tomato lineages diverged roughly 150 Myr ago, and comparative sequence analysis of segments of their genomes has revealed complex relationships34. Four regions of the Arabidopsis genome are related to each other and to one region in the tomato genome, suggesting that two rounds of duplication may have occurred in the Arabidopsis lineage. The extensive duplication described here supports the proposal that the more recent of these duplications, estimated to have occurred ~112 Myr ago, was the result of a polyploidization event. The lineages of Arabidopsis and rice diverged ~200 Myr ago42. Three regions of the genome of Arabidopsis were related to each other and to one region in the rice genome, providing further evidence for multiple duplication events43,44. The lineages leading to Arabidopsis and Capsella rubella (shepherd's purse) diverged between 6.2 and 9.8 Myr ago, and the gene content and genome organization of C. rubella are very similar to those of Arabidopsis39, including the large-scale duplications. Alignment of Arabidopsis complementary DNA and EST sequences with genomic DNA sequences of Arabidopsis and C. rubella showed conservation of exon length and intron positions. Coding sequences predicted from these alignments differed from the annotated Arabidopsis gene sequences in two out of five cases.

Integration of the three genomes in the plant cell

The three genomes in the plant cell, those of the nucleus, the plastids (chloroplasts) and the mitochondria, differ markedly in gene number, organization and stability. Plastid genes are densely packed in an order highly conserved in all plants45, whereas mitochondrial genes46 are widely dispersed and subjected to extensive recombination.

Organellar genomes are remnants of independent organisms: plastids are derived from the cyanobacterial lineage and mitochondria from the α-Proteobacteria. The remaining genes in plastids include those that encode subunits of the photosystem and the electron transport chain, whereas the genes in mitochondria encode essential subunits of the respiratory chain. Both organelles contain sets of specific membrane proteins that, together with housekeeping proteins, account for 61% of the genes in the chloroplast and 88% in the mitochondrion (Table 4). The remainder are involved in transcription and translation.

The number of proteins encoded in the nucleus likely to be found

Table 3 Arabidopsis genes with similarities to human disease genes

Human disease gene              E value         Gene code     Arabidopsis hit
Darier–White, SERCA             5.9 × 10^-272   T27I1_16      Putative calcium ATPase
Xeroderma Pigmentosum, D-XPD    7.2 × 10^-228   F15K9_19      Putative DNA repair protein
Xeroderma pigment, B-ERCC3      9.6 × 10^-214   AT5g41360     DNA excision repair cross-complementing protein
Hyperinsulinism, ABCC8          7.1 × 10^-188   F20D22_11     Multidrug resistance protein
Renal tubul. acidosis, ATP6B1   1.0 × 10^-182   AT4g38510     Probable H+-transporting ATPase
HDL deficiency 1, ABCA1         2.4 × 10^-181   At2g41700     Putative ABC transporter
Wilson, ATP7B                   7.6 × 10^-181   AT5g44790     ATP-dependent copper transporter
Immunodeficiency, DNA Ligase 1  8.2 × 10^-172   T6D22_10      DNA ligase
Stargardt's, ABCA4              2.8 × 10^-168   At2g41700     Putative ABC transporter
Ataxia telangiectasia, ATM      3.1 × 10^-168   AT3g48190     Ataxia telangiectasia mutated protein AtATM
Niemann–Pick, NPC1              1.2 × 10^-166   F7F22_1       Niemann–Pick C disease protein-like protein
Menkes, ATP7A                   1.1 × 10^-153   F2K11_17      ATP-dependent copper transporter, putative
HNPCC*, MLH1                    1.5 × 10^-150   AT4g09140     MLH1 protein
Deafness, hereditary, MYO15     2.7 × 10^-150   At2g31900     Putative unconventional myosin
Fam, cardiac myopathy, MYH7     6.5 × 10^-147   T1G11_14      Putative myosin heavy chain
Xeroderma Pigmentosum, F-XPF    1.4 × 10^-146   AT5g41150     Repair endonuclease (gb|AAF01274.1)
G6PD deficiency, G6PD           7.6 × 10^-137   AT5g40760     Glucose-6-phosphate dehydrogenase
Cystic fibrosis, ABCC7          2.3 × 10^-135   AT3g62700     ABC transporter-like protein
Glycerol kinase defic, GK       7.9 × 10^-135   T21F11_21     Putative glycerol kinase
HNPCC, MSH3                     6.6 × 10^-134   AT4g25540     Putative DNA mismatch repair protein
HNPCC, PMS2                     5.1 × 10^-128   AT4g02460     No title
Zellweger, PEX1                 4.1 × 10^-125   AT5g08470     Putative protein
HNPCC, MSH6                     9.6 × 10^-122   AT4g02070     G/T DNA mismatch repair
Bloom, BLM                      4.4 × 10^-109   T19D16_15     DNA isolog
Finnish amyloidosis, GSN        2.2 × 10^-107   AT5g57320     Villin
Chediak–Higashi, CHS1           5.8 × 10^-99    F10O3_11      Putative transport protein
Xeroderma Pigmentosum, G-XPG    7.1 × 10^-89    AT3g28030     Hypothetical protein
Bare lymphocyte, ABCB3          1.3 × 10^-84    AT5g39040     ABC transporter-like protein
Citrullinemia, type I, ASS      3.2 × 10^-83    AT4g24830     Argininosuccinate synthase-like protein
Coffin–Lowry, RPS6KA3           5.2 × 10^-81    AT3g08720     Putative ribosomal-protein S6 kinase (ATPK19)
Keratoderma, KRT9               8.5 × 10^-81    AT3g17050     Unknown protein
Myotonic dystrophy, DM1         1.4 × 10^-76    At2g20470     Putative protein kinase
Bartter's, SLC12A1              1.6 × 10^-75    F26G16_9      Cation-chloride co-transporter, putative
Dents, CLCN5                    3.3 × 10^-74    AT5g26240     CLC-d chloride channel protein
Diaphanous 1, DAPH1             1.9 × 10^-73    68069_m00158  Hypothetical protein
AKT2                            6.9 × 10^-72    AT3g08730     Putative ribosomal-protein S6 kinase (ATPK6)
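The E values in Table 3 span about 200 orders of magnitude, and the smallest approach the lower limit of double-precision floating point (roughly 10^-308). When such values are ranked or plotted, it can be convenient to work with their negative base-10 logarithm instead. The sketch below assumes each E value is stored as a separate mantissa and exponent; the helper name and the choice of three rows are illustrative only.

```python
import math


def neg_log10_e(mantissa: float, exponent: int) -> float:
    """Return -log10(E) for an E value written as mantissa x 10^exponent."""
    if mantissa <= 0:
        raise ValueError("mantissa must be positive")
    # log10(m * 10^k) = log10(m) + k, so the tiny number itself is never formed.
    return -(math.log10(mantissa) + exponent)


# Three rows copied from Table 3: (human gene, mantissa, exponent).
rows = [
    ("SERCA", 5.9, -272),
    ("ATP7B", 7.6, -181),
    ("AKT2", 6.9, -72),
]

scores = {gene: neg_log10_e(m, k) for gene, m, k in rows}
for gene, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{gene}: -log10(E) = {score:.2f}")
```

A larger score corresponds to a smaller E value, so sorting by descending score reproduces the top-to-bottom order of the table.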

in organelles was predicted using default settings on TargetP (Table 1). Many nuclear gene products that are targeted to either (or both) organelles were originally encoded in the organelle genomes and were transferred to the nuclear genome during evolutionary history. A large number also appear to be of eukaryotic origin, with functions such as protein import components, which were probably not required by the free-living ancestors of the endosymbionts.

To identify nuclear genes of possible organellar ancestry, we compared all predicted Arabidopsis proteins to all proteins from completed genomes including those from plastids and mitochondria (Supplementary Information Table 2). This search identified proteins encoded by the Arabidopsis nuclear genome that are most similar to proteins encoded by other species' organelle genomes (14 mitochondrial and 44 plastid). These represent organelle-to-nuclear gene transfers that have occurred sometime after the divergence of the organelle-containing lineages47. There is a great excess of nuclear encoded proteins most similar to proteins from the cyanobacterium Synechocystis (Supplementary Information Fig. 4; 806 Arabidopsis predicted proteins matching 404 different Synechocystis proteins, providing further evidence of a genome duplication). These 806 Arabidopsis predicted proteins, and many others of greatly diverse function, are possibly of plastid descent. Through searches against proteins from other cyanobacteria (with incompletely sequenced genomes), we identified 69 additional genes of possibly plastid descent. Only 25% of these putatively plastid-derived proteins displayed a target peptide predicted by TargetP, indicating potential cytoplasmic functions for most of these genes.

The difference between predicted plastid-targeted and predicted plastid-derived genes indicates that there is a probable overestimation by ab initio targeting prediction methods and a lack of resolution with respect to destination organelles, the possible extensive divergence of some endosymbiont-derived genes in the nuclear genome, the co-opting of nuclear genes for targeting to organelles, and cytoplasmic functions for cyanobacteria-derived proteins. Clearly more refined tools and extensive experimentation are required to catalogue plastid proteins.

The transfer of genes between genomes still continues (Supplementary Information Table 3). Plastid DNA insertions in the nucleus (17 insertions totalling 11 kb) contain full-length genes encoding proteins or tRNAs, fragments of genes and an intron as well as intergenic regions. Subsequent reshuffling in the nucleus is illustrated by the atpH gene, which was originally transferred completely, but is now in two pieces separated by 2 kb. The 13 small mitochondrial DNA insertions total 7 kb in addition to the large insertion close to the centromere of chromosome 2 (ref. 3). The high level of recombination in the mitochondrial genome may account for these events.

Table 4 General features of genes encoded by the three genomes in Arabidopsis

                                      Nucleus/cytoplasm       Plastid     Mitochondria
Genome size                           125 Mb                  154 kb      367 kb
Genome equivalent/cell                2                       560         26
Duplication                           60%                     17%         10%
Number of protein genes               25,498                  79          58
Gene order                            Variable, but syntenic  Conserved   Variable
Density (kb per protein gene)         4.5                     1.2         6.25
Average coding length                 1,900 nt                900 nt      860 nt
Genes with introns                    79%                     18.4%       12%
Genes/pseudogenes                     1/0.03                  1/0         1/0.2–0.5
Transposons (% of total genome size)  14%                     0%          4%

Transposable elements

Transposons, which were originally identified in maize by Barbara McClintock, have been found in all eukaryotes and prokaryotes. A subset of transposons replicate through an RNA intermediate (class I), whereas others move directly through a DNA form (class II). Transposons are further classified by similarity either between their mobility genes or between their terminal and/or internal motifs, as well as by the size and sequence of their target site. Internally deleted elements can often be mobilized in trans by fully functional elements.

Transposons in Arabidopsis account for at least 10% of the genome, or about one-fifth of the intergenic DNA. The Arabidopsis genome has a wealth of class I (2,109) and II (2,203) elements, including several new groups (1,209 elements; Supplementary Information Table 4). Mobile histories for many elements were obtained by identifying regions of the genome with significant similarity to 'empty' target sites (RESites), thus providing high-resolution information concerning the termini and target site duplications48,49. These regions were readily detected because of the propensity of transposons to integrate into repeats and because of duplications in the genome sequence. In several cases, genes appear to have been included as 'passengers' in transposable units48. In some cases, shared sequence similarity, coding capacity and RESites attest to recent activity of transposable elements in the Arabidopsis genome. Only about 4% of the complete elements identified correspond to an EST, however, suggesting that most are not transcribed.

Transposable elements found in many other plant genomes are well represented in Arabidopsis, including copia- and gypsy-like long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), hobo/Activator/Tam3 (hAT)-like elements, CACTA-like elements and miniature inverted-repeat transposable elements (MITEs). Although usually small in size, some larger Tourist-like MITEs contain open reading frames (ORFs) with similarity to the transposases of bacterial insertion sequences48. Basho and many Mutator-like elements (MULEs), first discovered in the Arabidopsis sequence, represent structurally unique transposons48–50. Basho elements have a target site preference for mononucleotide 'A' and wide distribution among plants48,51. MULEs exhibit a high level of sequence diversity and members of most groups lack long terminal inverted repeats (TIRs). Phylogenetic analysis of the Arabidopsis MURA-like transposases suggests that TIR-containing MULEs are more closely related to one another than to MULEs lacking TIRs49,52.

For many plants with large genomes, class I retrotransposons contribute most of the nucleotide content53. In the small Arabidopsis genome, class I elements are less abundant and primarily occupy the centromere. In contrast, Basho elements and class II transposons such as MITEs and MULEs predominate on the periphery of pericentromeric domains (Fig. 5). In class II transposons, MULEs and CACTA elements are clustered near centromeres and heterochromatic knobs, whereas MITEs and hAT elements have a less pronounced bias. The distribution pattern of transposable elements observed in Arabidopsis may reflect different types of pericentromeric heterochromatin regions and may be similar to those found in animals. Numerous centromeric satellite repeats are located between each chromosome arm and have not yet been sequenced, but are represented in part by unanchored BAC contigs (R. Martienssen and M. Marra, unpublished data). End sequence suggests that these domains contain many more class I than class II elements, consistent with the distribution reported here (K. Lemcke and R. Martienssen, unpublished data). We do not know the significance of the apparent paucity of elements in telomeric regions and in the region flanking the rDNA repeats on chromosome 4 (but not on chromosome 2).

Overall, transposon-rich regions are relatively gene-poor and have lower rates of recombination and EST matches, indicating a correlation between low gene expression, high transposon density and low recombination51. The role of transposons in genome

organization and chromosome structure can now be addressed in a model organism known to undergo DNA methylation and other forms of chromatin modification thought to regulate transposition52.

Figure 5 Distribution of class I, II and Basho transposons in Arabidopsis chromosomes. The frequencies of class I retroelements (green), class II DNA transposons (blue) and Basho elements (purple) are shown at 100-kb intervals along the five chromosomes (a–e) of Arabidopsis. Each panel plots frequency against position (Mb).

rDNA, telomeres and centromeres

Nucleolar organizers (NORs) contain arrays of unit repeats encoding the 18S, 5.8S and 25S ribosomal RNA genes and are transcribed by RNA polymerase I. Together with 5S RNA, which is transcribed by RNA polymerase III, these rRNAs form the structural and catalytic cores of cytoplasmic ribosomes. In Arabidopsis, the NORs juxtapose the telomeres of chromosomes 2 and 4, and comprise uninterrupted 18S, 5.8S and 25S units all orientated on the chromosomes in the same direction54. In contrast, the 5S rRNA genes are localized to heterogeneous arrays in the centromeric regions of chromosomes 3, 4 and 5 (ref. 55; and Fig. 6). Both NORs are roughly 3.5–4.0 megabase-pairs and comprise ~350–400 highly methylated rRNA gene units, each ~10 kb (ref. 54). The sequence between the euchromatic arms and NORs has been determined. Elsewhere in the genome, only one other 18S, 5.8S, 25S rRNA gene unit was identified in centromere 3. Although minor variations in sequence length and composition occur in the NOR repeats, these variants are highly clustered, supporting a model of sequence maintenance through concerted evolution55.

Arabidopsis telomeres are composed of CCCTAAA repeats and average ~2–3 kb (ref. 56). For TEL4N (telomere 4 North), consensus repeats are adjacent to the NOR; the remaining telomeres are typically separated from coding sequences by repetitive subtelomeric regions measuring less than 4 kb. Imperfect telomere-like arrays of up to 24 kb are found elsewhere in the genome, particularly near centromeres. These arrays might affect the expression of nearby genes and may have resulted from ancient rearrangements, such as inversions of the chromosome arms.

Centromere DNA mediates chromosome attachment to the meiotic and mitotic spindles and often forms dense heterochromatin. Genetic mapping of the regions that confer centromere function provided the markers necessary to precisely place BAC clones at individual centromeres17; 69 clones were targeted for sequencing, resulting in over 5 Mb of DNA sequence from the centromeric regions. The unsequenced regions of centromeres are composed primarily of long, homogeneous arrays that were characterized previously with physical57 and genetic mapping17 and contain over 3 Mb of repetitive arrays, including the 180-bp repeats and 5S rDNA51 (Fig. 6).

Arabidopsis centromeres, like those of many higher eukaryotes, contain numerous repetitive elements including retroelements, transposons, microsatellites and middle repetitive DNA17. These repeats are rare in the euchromatic arms and often most abundant in pericentromeric DNA. The repeats, affinity for DNA-binding dyes, dense methylation patterns and inhibition of homologous recombination indicate that the centromeric regions are highly heterochromatic, and such regions are generally viewed as very poor environments for gene expression. Unexpectedly, we found at least 47 expressed genes encoded in the genetically defined centromeres of Arabidopsis (http://preuss.bsd.uchicago.edu/arabidopsis.genome.html). In several cases, these genes reside on islands of unique sequence flanked by repetitive arrays, such as 180-bp or 5S rDNA repeats. Among the genes encoded in the centromeres are members of 11 of the 16 functional categories that comprise the proteome. The centromeres are not subject to recombination; consequently, genes residing in these regions probably exhibit unique patterns of molecular evolution.

The function of higher eukaryotic centromeres may be specified by proteins that bind to centromere DNA, by epigenetic modifications, or by secondary or higher order structures. A pairwise comparison of the non-repetitive portions of all five centromeres showed they share limited (1–7%) sequence similarity. Forty-one families of small, conserved centromere sequences (AtCCS, see http://preuss.bsd.uchicago.edu/arabidopsis.genome.html) are enriched in the centromeric and pericentromeric regions and differ from sequences found in the centromeres of other eukaryotes. Molecular and genetic assays will be required to determine whether these conserved motifs nucleate Arabidopsis centromere activity. Apart from the AtCCS sequences, most centromere DNA is not shared between chromosomes, complicating efforts to derive clear evolutionary relationships. In contrast, genetic and cytological assays indicate that homologous centromeres are highly conserved among Arabidopsis accessions, albeit subject to rearrangements such as inversions to form knobs5,58,59 and insertions4. Further investigation of centromere DNA promises to yield information on the evolutionary forces that act in regions of limited recombination, as well as an improved understanding of the role of DNA sequence patterns in chromosome segregation.

Membrane transport

Transporters in the plasma and intracellular membranes of Arabidopsis are responsible for the acquisition, redistribution and compartmentalization of organic nutrients and inorganic ions, as well as for the efflux of toxic compounds and metabolic end products, energy and signal transduction, and turgor generation. Previous genomic analyses of membrane transport systems in S. cerevisiae and C. elegans led to the identification of over 100 distinct families of membrane transporters60,61. We compared membrane transport processes between Arabidopsis, animals, fungi and prokaryotes, and identified over 600 predicted membrane transport systems in Arabidopsis (http://www-biology.ucsd.edu/~ipaulsen/transport/), a similar number to that of C. elegans


(~700 transporters) and over twofold greater than either S. cerevisiae or E. coli (~300 transporters).

We compared the transporter complement of Arabidopsis, C. elegans and S. cerevisiae in terms of energy coupling mechanisms (Fig. 7a). Unlike animals, which use a sodium ion P-type ATPase pump to generate an electrochemical gradient across the plasma membrane, plants and fungi use a proton P-type ATPase pump to form a large membrane potential (−250 mV)62. Consequently, plant secondary transporters are typically coupled to protons rather than to sodium63. Compared with C. elegans, Arabidopsis has a surprisingly high percentage of primary ATP-dependent transporters (12% and 21% of transporters, respectively), reflecting increased numbers of P-type ATPases involved in metal ion transport and ABC ATPases proposed to be involved in sequestering unusual metabolites and drugs in the vacuole or in other intracellular compartments. These processes may be necessary for pathogen defence and nutrient storage.

About 15% of the transporters in Arabidopsis are channel proteins, five times more than in any single-celled organism but half the number in C. elegans (Fig. 7b). Almost half of the Arabidopsis channel proteins are aquaporins, and Arabidopsis has 10-fold more major intrinsic protein (MIP) family water channels than any other sequenced organism. This abundance emphasizes the importance of hydraulics in a wide range of plant processes, including sugar and nutrient transport into and out of the vasculature, opening of stomatal apertures, cell elongation and epinastic movements of leaves and stems. Although Arabidopsis has a diverse range of metal cation transporters, C. elegans has more, many of which function in cell–cell signalling and nerve signal transduction. Arabidopsis also possesses transporters for inorganic anions such as phosphate, sulphate, nitrate and chloride, as well as for metal cation channels that serve in signal transduction or cell homeostasis.

Compared with other sequenced organisms, Arabidopsis has 10-fold more predicted peptide transporters, primarily of the proton-dependent oligopeptide transport (POT) family, emphasizing the importance of peptide transport or indicating that there is broader substrate specificity than previously realized. There are nearly 1,000 Arabidopsis genes encoding Ser/Thr protein kinases, suggesting that peptides may have an important role in plant signalling64.

Virtually no transporters for carboxylates, such as lactate and pyruvate, were identified in the Arabidopsis genome. About 12% of the transporters were predicted to be sugar transporters, mostly consisting of paralogues of the MFS family of hexose transporters. Notably, S. cerevisiae, C. elegans and most prokaryotes use APC family transporters as their principal means of amino-acid transport, but Arabidopsis appears to rely primarily on the AAAP family of amino-acid and auxin transporters. More than 10% of the transporters in Arabidopsis are homologous to drug efflux pumps; these probably represent transporters involved in the sequestration into vacuoles of xenobiotics, secondary metabolites, and breakdown products of chlorophyll.

Surprisingly, Arabidopsis has close homologues of the human ABC TAP transporters of antigenic peptides for presentation to the major histocompatibility complex (MHC). In Arabidopsis, these transporters may be involved in peptide efflux, or more speculatively, in some form of cell-recognition response. Arabidopsis also has 10-fold more members of the multi-drug and toxin extrusion (MATE) family than any other sequenced organism; in bacteria, these transporters function as drug efflux pumps. Curiously, Arabidopsis has several homologues of the Drosophila RND transporter family Patched protein, which functions in segment polarity, and more than ten homologues of the Drosophila ABC family eye pigment transporters. In plants, these are presumably involved in the intracellular sequestration of secondary metabolites.

DNA repair and recombination

DNA repair and recombination pathways have many functions in different species such as maintaining genomic integrity, regulating mutation rates, chromosome segregation and recombination, genetic exchange within and between populations, and immune system development. Comparing the Arabidopsis genome with other species65 indicates that Arabidopsis has a similar set of DNA repair and recombination (RAR) genes to most other eukaryotes. The pathways represented include photoreactivation, DNA ligation, non-homologous end joining, base excision repair, mismatch excision repair, nucleotide excision repair and many aspects of DNA recombination (Supplementary Information Table 5). The Arabidopsis RAR genes include homologues of many DNA repair genes that are defective in different human diseases (for example, hereditary breast cancer and non-polyposis colon cancer, xeroderma pigmentosum and Cockayne's syndrome).

One feature that sets Arabidopsis apart from other eukaryotes is the presence of additional homologues of many RAR genes. This is seen for almost every major class of DNA repair, including recombination (four RecA), DNA ligation (four DNA ligase I), photoreactivation (one class II and five class I photolyase homologues) and nucleotide excision repair (six RPA1, two RPA2, two Rad25, three TFB1 and four Rad23). This is most striking for genes with probable roles in base excision repair. Arabidopsis encodes 16 homologues of DNA base glycosylases (enzymes that

Figure 6 Predicted centromere composition. Genetically defined centromere boundaries are indicated by filled circles; fully and partially assembled BAC sequences are represented by solid and dashed black lines, respectively. Estimates of repeat sizes within the centromeres were derived from consideration of repeat copy number, physical mapping and cytogenetic assays. Panels CEN1 to CEN5 show the BAC clones placed across each centromere; key entries: 180 bp, 160 bp, Mitochondrial, 5S rDNA; scale bar, 100 kb.

recognize abnormal DNA bases and cleave them from the sugar-phosphate backbone), more than any other species known. This includes several homologues of each of three families of alkylation damage base glycosylases: two of the S. cerevisiae MPG; six of the E. coli TagI; and two of the E. coli AlkA. Arabidopsis also encodes three homologues of the apurinic-apyrimidinic (AP) endonuclease Xth. AP endonucleases continue the base excision repair started by glycosylases by cleaving the DNA backbone at abasic sites.

Evolutionary analysis indicates that some of the extra copies of RAR genes in Arabidopsis originated through relatively recent gene duplications, because many of the sets of genes are more closely related to each other than to their homologues in any other species. As duplication is frequently accompanied by functional divergence, the duplicate (paralogous) genes may have different repair specificities or may have evolved functions that are outside RAR functions (as is the case for two of the five class I photolyase homologues, which function as blue-light receptors). In most cases, it is not known whether the paralogous gene copies have different functions. The presence of multiple paralogues might also allow functional redundancy or a greater repair or recombination capacity.

The multiplicity of RAR genes in Arabidopsis is also partly due to the transfer of genes from the organellar genomes to the nucleus. Repair gene homologues that appear to be of chloroplast origin (Supplementary Information Tables 2 and 5) include the recombination proteins RecA, RecG and SMS, two class I photolyase homologues, Fpg, two MutS2 proteins, and the transcription-repair coupling factor Mfd. Two of these (RecA and Fpg) are involved in RAR functions in the plastid, suggesting that the others may be as well. The finding of an Mfd orthologue of cyanobacterial descent is surprising. In E. coli, Mfd couples nucleotide excision repair carried out by UvrABC to transcription, leading to the rapid repair of DNA damage on the transcribed strand of transcribed genes66. The absence of orthologues of UvrABC in Arabidopsis renders the function of Mfd difficult to predict. The presence of Mfd but not UvrABC has been reported for only one other species, a bacterial endosymbiont of the pea aphid.

Other nuclear-encoded Arabidopsis DNA repair gene homologues are evolutionarily related to genes from α-Proteobacteria, and thus may be of mitochondrial descent. In particular, the six homologues of the alkyl-base glycosylase TagI appear to be the result of a large expansion in plants after transfer from the mitochondrial genome. Whether any of these TagI homologues function in the repair and maintenance of mitochondrial DNA has not been determined. More detailed phylogenetic analysis may reveal additional Arabidopsis RAR genes to be of organellar ancestry.

There are some notable absences of proteins important for RAR in other species, including alkyltransferases, MSH4, RPA3 and many components of TFIIH (TFB2, TFB3, TFB4, CCL1, Kin28). Nevertheless, Arabidopsis shows many similarities to the set of DNA repair genes found in other eukaryotes, and therefore offers an experimental system for determining the functions of many of these proteins, in part through characterization of mutants defective in DNA repair67.

Gene regulation

Eukaryotic gene expression involves many nuclear proteins that modulate chromatin structure, contribute to the basal transcription machinery, or mediate gene regulation in response to developmental, environmental or metabolic cues. As predicted by sequence similarity, more than 3,000 such proteins may be encoded by the Arabidopsis genome, suggesting that it has a comparable complexity of gene regulation to other eukaryotes. Arabidopsis has an additional level of gene regulation, however, with DNA methylation potentially mediating gene silencing and parental imprinting.

Plants have evolved several variations on chromatin remodelling proteins, such as the family of HD2 histone deacetylases68. Although Arabidopsis possesses the usual number of SNF2-type chromatin remodelling ATPases, which regulate the expression of nearly all genes, there are significant structural differences between yeast and metazoan SNF2-type genes and their orthologues in Arabidopsis. DDM1, a member of the SNF2 superfamily, and MOM1, a gene with similarity to the SNF2 family, are involved in transcriptional gene silencing in Arabidopsis. MOM1 has no clear orthologue in fungal or metazoan genomes.

Consistent with its methylated DNA, Arabidopsis possesses eight DNA methyltransferases (DMTs). Two of the three types are orthologous to mammalian DMT69 whereas one, chromomethyltransferase70, is unique to plants. No DMTs are found in yeast or C. elegans, although two DMT-like genes are found in Drosophila71. Arabidopsis also encodes eight proteins with methyl-DNA-binding domains (MBDs). Despite lacking methylated DNA, Drosophila encodes four MBD proteins and C. elegans has two. These differences in chromatin components are likely to reflect important differences in chromatin-based regulatory control of gene expression in eukaryotes (Supplementary Information Table 6; http://Ag.Arizona.Edu/chromatin/chromatin.html).

The Arabidopsis genome encodes transcription machinery for the three nuclear DNA-dependent RNA polymerase systems typical of eukaryotes (Supplementary Information Table 6). Transcription by RNA polymerases II and III appears to involve the same machinery as is used in other eukaryotes; however, most transcription factors for RNA polymerase I are not readily identified. Only two polymerase I regulators (other than polymerase subunits and TATA-binding protein) are apparent in Arabidopsis, namely homologues of yeast RRN3 and mouse TTF-1. All eukaryotes examined to date have distinct genes for the largest and second-largest subunits of polymerase I, II and III. Unexpectedly, Arabidopsis has two genes encoding a fourth class of largest subunit and second-largest subunit (Supplementary Information Fig. 5). It will be interesting to determine whether the atypical subunits comprise a polymerase that has a plant-specific function. Four genes encoding single-subunit plastid or mitochondrial RNA polymerases have been identified in Arabidopsis (Supplementary Information Table 6). Genes for the bacterial β-, β′- and α-subunits of RNA polymerase are also present, as are homologues of various σ-factors, and these proteins may regulate chloroplast gene expression. Mutations in the Sde-1 gene, encoding RNA-dependent RNA polymerase (RdRp), lead to defective post-transcriptional gene silencing72. We also identified five more closely related RdRp genes.

Our analysis, using both similarity searches and domain matches, has identified 1,709 proteins with significant similarity to known classes of plant transcription factors classified by conserved DNA-binding domains. This analysis used a consistent conservative threshold that probably underestimates the size of families of diverse sequence. This class of protein is the least conserved among all classes of known proteins, showing only 8–23% similarity to transcription factors in other eukaryotes (Fig. 2b). This reduced similarity is due to the absence of certain classes of transcription factors in Arabidopsis and large numbers of plant-specific transcription factors. We did not detect any members of several widespread families of transcription factors, such as the REL (Rel-like DNA-binding domain) homology region proteins, nuclear steroid receptors and forkhead-winged helix and POU (Pit-1, Oct- and Unc-86) domain families of developmental regulators. Conversely, of 29 classes of Arabidopsis transcription factors, 16 appear to be unique to plants (Supplementary Information Table 6). Several of these, such as the AP2/EREBP-RAV, NAC and ARF-AUX/IAA families, contain unique DNA-binding domains, whereas others contain plant-specific variants of more widespread domains, such as the DOF and WRKY zinc-finger families and the two-repeat MYB family.

Functional redundancy among members of large families of closely related transcription factors in Arabidopsis is a significant potential barrier to their characterization73. For example, in the

SHATTERPROOF and SEPALLATA families of MADS box transcription factors, all genes must be defective to produce visible mutant phenotypes74,75. These functionally redundant genes are found on the segmental duplications described above. Our analyses, together with the significant sequence similarity found in large families of transcription factors such as the R2R3-repeat MYB and WRKY families, suggest that strategies involving overexpression will be important in determining the functions of members of transcription factor families.

Arabidopsis has two or over three times more transcription factors than identified in Drosophila29 or C. elegans1, respectively. The significantly greater extent of segmental chromosomal and local tandem duplications in the Arabidopsis genome generates larger gene families, including transcription factors. The partly overlapping functions defined for a few transcription factors are also likely to be much more widespread, implicating many sequence-related transcription factors in the same cellular processes. Finally, the expanded number of genes involved in metabolism, defence and environmental interaction in Arabidopsis (Fig. 2a), which have few counterparts in Drosophila and C. elegans, all require additional numbers and classes of transcription factors to integrate gene function in response to a vast range of developmental and environmental cues.

Cellular organization

Plant cells differ from animal cells in many features such as plastids, vacuoles, Golgi organization, cytoskeletal arrays, plasmodesmata linking cytoplasms of neighbouring cells, and a rigid polysaccharide-rich extracellular matrix, the cell wall. Because the cell wall maintains the position of a cell relative to its neighbours, both changes in cell shape and organized cell divisions, involving cytoskeleton reorganization and membrane vesicle targeting, have major roles in plant development. Plant cytokinesis is also unique in that the partitioning membrane is formed de novo by vesicle fusion. We compared the Arabidopsis genome with those of C. elegans, Drosophila and yeast to glimpse the genetic basis of plant-cell-specific features.

Figure 7 Comparison of the transport capabilities of Arabidopsis, C. elegans and S. cerevisiae. Pie charts show the percentage of transporters in each organism according to bioenergetics (a) and substrate specificity (b). Legend categories in a: channels; secondary transport; primary transport; uncharacterized. Legend categories in b: cations (inorganic); anions (inorganic); water; vitamins and cofactors; drugs and toxins; macromolecules; amines, amides and polyamines; peptides; bases and derivatives; sugars and derivatives; carboxylates; amino acids; unknown.

The principal components of the plant cytoskeleton are microtubules (MTs) and actin filaments (AFs); intermediate filaments (IFs) have not been described in plants. Arabidopsis appears to lack genes for cytokeratin or vimentin, the main components of animal IFs, but has several variants of actin, α- and β-tubulin. The Arabidopsis genome also encodes homologues of chaperones that mediate the folding of tubulin and actin polypeptides in yeast and animal cells, such as the prefoldin and cytosolic chaperonin complexes and tubulin-folding cofactors. The dynamic stability of MTs and AFs is influenced by MT-associated proteins and actin-binding proteins, respectively, several of which are encoded by Arabidopsis genes. These include the MT-severing ATPase katanin, AF-cross-linking/bundling proteins, such as fimbrins and villins, and AF-disassembling proteins, such as profilin and actin-depolymerizing factor/cofilin. The Arabidopsis proteome appears to lack homologues of proteins that, in animal cells, link the actin cytoskeleton across the plasma membrane to the extracellular matrix, such as integrin, talin, spectrin, α-actinin, vitronectin or vinculin. This apparent lack of 'anchorage' proteins is consistent with the different composition of the cell wall and with a prominence of cortical MTs at the expense of cortical AFs in plant cells.

Plant-specific cytoskeletal arrays include interphase cortical MTs mediating cell shape, the preprophase band marking the cortical site of cell division, and the phragmoplast assisting in cytokinesis76. Although plant cells lack structural counterparts of the yeast spindle pole body and the animal centrosome, Arabidopsis has homologues of core components of the MT-nucleating γ-tubulin ring complex, such as γ-tubulin, Spc97/hGCP2 and Spc98/hGCP3. Arabidopsis has numerous motor molecules, both kinesins and dyneins with associated dynactin complex proteins, which are presumably involved in the dynamic organization of MTs and in transporting cargo along MT tracks. There are also myosin motors that may be involved in AF-supported organelle trafficking. Essential features of the eukaryotic cytoskeleton appear to be conserved in Arabidopsis.

The Arabidopsis genome encodes homologues of proteins involved in vesicle budding, including several ARFs and ARF-related small G-proteins, large but not small ARF GEFs (adenosine ribosylation factor on guanine nucleotide exchange factor), adapter proteins, and coat proteins of the COP and non-COP types. Arabidopsis also has homologues of proteins involved in vesicle docking and fusion, including SNAP receptors (SNAREs), N-ethylmaleimide-sensitive factor (NSF) and Cdc48-related ATPases, accessory proteins such as Sec1 and soluble NSF attachment protein (SNAP), and Rab-type GTPases. The large number of Arabidopsis SNAREs can be grouped by sequence similarity to yeast and animal counterparts involved in specific trafficking pathways, and some have been localized to the trans-Golgi and the pre-vacuolar pathway77. Arabidopsis also has a receptor for retention of proteins in the endoplasmic reticulum, a cargo receptor for transport to the vacuole and several phragmoplastins related to animal dynamin GTPases. Thus, plant cells appear to use the same basic machinery for vesicle trafficking as yeast and animal cells.

Animal cells possess many functionally diverse small G-proteins of the Ras superfamily involved in signal transduction, AF reorganization, vesicle fusion and other processes. Surprisingly, Arabidopsis appears to lack genes for G-proteins of the Ras, Rho, Rac and Cdc42 subfamilies but has many Rab-type G-proteins involved in vesicle fusion and several Rop-type G-proteins, one of which has a role in actin organization of the tip-growing pollen tube78. The significance of this divergent amplification of different subfamilies of small G-proteins in plants and animals remains to be determined.

Arabidopsis possesses cyclin-dependent kinases (CDKs), including a plant-specific Cdc2b kinase expressed in a cell-cycle-dependent manner, several cyclin subtypes, including a D-type cyclin that
mediates [...]-stimulated cell-cycle progression79, a retinoblastoma-related protein and components of the ubiquitin-dependent proteolytic pathway of cyclin degradation. In yeast and animal cells, chromosome condensation is mediated by condensins, sister chromatids are held together by cohesins such as Scc1, and metaphase–anaphase transition is triggered by separin/Esp1 endopeptidase proteolysis of Scc1 on APC-mediated degradation of its inhibitor, securin/Pds1. Related proteins are encoded by the Arabidopsis genome. Thus, the basic machinery of cell-cycle progression, genome duplication and segregation appears to be conserved in plants. By contrast, entry into M phase, M-phase progression and cytokinesis seem to be modified in plant cells. Arabidopsis does not appear to have homologues of Cdc25 phosphatase, which activates Cdc2 kinase at the onset of mitosis, or of polo kinase, which regulates M-phase progression in yeast and animals. Conversely, plant-specific mitogen-activated protein (MAP) kinases appear to be involved in cytokinesis.

Cytokinesis partitions the cytoplasm of the dividing cell. Yeast and animal cells expand the membrane from the surface towards the centre in a cleavage process supported by septins and a contractile ring of actin and type II myosin. By contrast, plant cytokinesis starts in the centre of the division plane and progresses laterally. A transient membrane compartment, the cell plate, is formed de novo by fusion of Golgi-derived vesicles trafficking along the phragmoplast MTs80. Consistent with the unique mode of plant cytokinesis, Arabidopsis appears to lack genes for septins and type II myosin. Conversely, cell-plate formation requires a cytokinesis-specific syntaxin that has no close homologue in yeast and animals. Although syntaxin-mediated membrane fusion occurs in animal cytokinesis and cellularization, the vesicles are delivered to the base of the cleavage furrow. Thus, the plant-specific mechanism of cell division is linked to conserved eukaryotic cell-cycle machinery.

Two main conclusions are suggested by this comparative analysis. First, Arabidopsis and eukaryotic cells have common features related to intracellular activities, such as vesicle trafficking, cytoskeleton and cell cycle. Second, evolutionarily divergent features, such as organization of the cytoskeleton and cytokinesis, appear to relate to the plant cell wall.

Development

The regulation of development in Arabidopsis, as in animals, involves cell–cell communication, hierarchies of transcription factors, and the regulation of chromatin state; however, there is no reason to suppose that the complex multicellular states of plant and animal development have evolved by elaborating the same general processes during the 1.6 billion years since the last common unicellular ancestor of plants and animals81,82. Our genome analyses reflect the long, independent evolution of many processes contributing to development in the two kingdoms.

Plants and animals have converged on similar processes of pattern formation, but have used and expanded different transcription factor families as key causal regulators. For example, segmentation in insects and differentiation along the anterior–posterior and limb axes in [...] both involve the spatially specific activation of a series of homeobox gene family members. The pattern of activation is causal in the later differentiation of body and limb axis regions. In plants the pattern of floral whorls (sepals, petals, stamens, carpels) is also established by the spatially specific activation of members of a family of transcription factors, but in this instance the family is the MADS box family. Plants also have homeobox genes and animals have MADS box genes, implying that each lineage invented separately its mechanism of spatial pattern formation, while converging on actions and interactions of transcription factors as the mechanism. Other examples show even greater divergence of plant and animal developmental control. Examples are the AP2/EREBP and NAC families of transcription factors, which have important roles in flower and meristem development; both families are so far found only in plants (Supplementary Information Table 6).

A similar story can be told for cell–cell communication. Plants do not seem to have receptor tyrosine kinases, but the Arabidopsis genome has at least 340 genes for receptor Ser/Thr kinases, belonging to many different families, defined by their putative extracellular domains (Supplementary Information Table 7). Several families have members with known functions in cell–cell communication, such as the CLV1 receptor involved in cell signalling, the S-glycoprotein homologues involved in signalling from pollen to stigma in self-incompatible Brassica species, and the BRI1 receptor necessary for brassinosteroid signalling83. Animals also have receptor Ser/Thr kinases, such as the transforming growth factor-β (TGF-β) receptors, but these act through SMAD proteins that are absent from Arabidopsis. The leucine-rich repeat (LRR) family of Arabidopsis receptor kinases shares its extracellular domain with many animal and fungal proteins that do not have associated kinase domains, and there are at least 122 Arabidopsis genes that code for LRR proteins without a kinase domain. Other Arabidopsis receptor kinase families have extracellular domains that are unfamiliar in animals. Thus, evolution is modular, and the plant and animal lineages have expanded different families of receptor kinases for a similar set of developmental processes.

Several Arabidopsis genes of developmental importance appear to be derived from a cyanobacteria-like genome (Supplementary Information Table 2), with no close relationship to any animal or fungal protein. One salient example is the family of ethylene receptors; another gene family of apparent chloroplast origin is the phytochromes, light receptors involved in many developmental decisions (see below). Whereas the land plant phytochromes show clear homology to the cyanobacterial light receptors, which are typical prokaryotic histidine kinases, the plant phytochromes are histidine kinase paralogues with Ser/Thr specificity84. Similarly to the ethylene receptors, the proteins that act downstream of plant phytochrome signalling are not found in cyanobacteria, and thus it appears that a bacterial light receptor entered the plant genome through horizontal transfer, altered its enzymatic activity, and became linked to a eukaryotic signal transduction pathway. This infusion of genes from a cyanobacterial endosymbiont shows that plants have a richer heritage of ancestral genes than animals, and unique developmental processes that derive from horizontal gene transfer.

Signal transduction

Being generally sessile organisms, plants have to respond to local environmental conditions by changing their physiology or redirecting their growth. Signals from the environment include light and pathogen attack, temperature, water, nutrients, touch and gravity. In addition to local cellular responses, some stimuli are communicated across the plant body, with plant hormones and peptides acting as secondary messengers. Some hormones, such as auxin, are taken up into the cell, whereas others, such as ethylene and brassinosteroids, and the peptide CLV3, act as ligands for receptor kinases on the plasma membrane. No matter where the signal is perceived by the cell, it is transduced to the nucleus, resulting in altered patterns of gene expression.

Comparative genome analysis between Arabidopsis, C. elegans and Drosophila supports the idea that plants have evolved their own pathways of signal transduction85. None of the components of the widely adopted signalling pathways found in vertebrates, flies or worms, such as Wingless/Wnt, Hedgehog, Notch/lin12, JAK/STAT, TGF-β/SMADs, receptor tyrosine kinase/Ras or the nuclear steroid hormone receptors, is found in Arabidopsis. By contrast, brassinosteroids are ligands of the BRI1 Ser/Thr kinase, a member of the largest recognizable class of transmembrane sensors encoded by 340 receptor-like kinase (RLK) genes in the Arabidopsis genome (Supplementary Information Table 7). With a few notable exceptions, such as CLV1, the types of ligands sensed by RLKs are
completely unknown, providing an enormous future challenge for plant biologists. G-protein-coupled receptors (GPCRs)/seven-transmembrane proteins are an abundant class of proteins in mammalian genomes, instrumental in signal transduction. INTERPRO detected 27 GPCR-related domains in Arabidopsis (Supplementary Information Table 1), although there is no direct experimental evidence for these. Arabidopsis contains a family of 18 seven-transmembrane proteins of the mildew resistance (Mlo) class, several of which are involved in defence responses. Notably, only single Gα (GPA1) and Gβ (AGB1) subunits are found in Arabidopsis, both previously known86.

Although cyclic GMP has been proposed to be involved in signal transduction in Arabidopsis87, a protein containing a guanylate cyclase domain was not identified in our analyses. Nevertheless, cyclic nucleotide-binding domains were detected in various proteins, indicating that cNMPs may have a role in plant signal transduction. Thus, although cNMP-binding domains appear to have been conserved during evolution, cNMP synthesis in Arabidopsis may have evolved independently.

We were unable to identify a protein with significant similarity to known Gγ subunits, but recent biochemical studies suggest that a protein with this functional capacity is likely to be present in plant cells (H. Ma, personal communication). Therefore, there is potential for the formation of only a single heterotrimeric G-protein complex; however, its functional interaction with any of the potential GPCR-related proteins remains to be determined.

Modules of cellular signal pathways from bacteria and animals have been combined and new cascades have been innovated in plants. A pertinent example is the response to the gaseous plant hormone ethylene88. Ethylene is perceived and its signal transmitted by a family of receptors related to bacterial-type two-component histidine kinases (HKs). In bacteria, yeast and plants, these proteins sense many extracellular signals and function in a His-to-Asp phosphorelay network89. In turn, these proteins physically interact with the genetically downstream protein CTR1, a Raf/MAPKKK-related kinase, revealing the juxtaposition of bacterial-type two-component receptors and animal-type MAP kinase cascades. Unlike animals, however, Arabidopsis does not seem to have a Ras protein to activate the MAP kinase cascade. MAP kinases are found in abundance in Arabidopsis: we identified ~20, a higher number than in any other [...]. As potentially counteracting components, we found ~70 putative PP2C protein phosphatases. Although this group is largely uncharacterized functionally, several members are related to ABI1/ABI2, key negative regulators in the signalling pathway for the plant hormone abscisic acid. Additional components of the His-to-Asp phosphorelay system were also found in Arabidopsis, including authentic response regulators (ARRs), pseudoresponse regulators (PRRs) and phosphotransfer intermediate protein (HPt)90. We found 11 HKs in the proteome (3 new), 16 RRs (2 new) and 8 PRRs (2 new). The biological roles of most ARRs, PRRs and HPts are largely unknown, but several have been found to have diverse functions in plants, including transcriptional activation in response to the plant hormone cytokinin91, and as components of the circadian clock92.

Plants seem to have evolved unique signalling pathways by combining a conserved MAP kinase cascade module with new receptor types. In many cases, however, the ligands are unknown. Conversely, some known signalling molecules, such as auxin, are still in search of a receptor. Auxin signalling may represent yet another plant-specific mode of signalling, with protein degradation through the ubiquitin-proteasome pathway preceding altered gene expression. With many Arabidopsis genes encoding components of the ubiquitin-proteasome pathway, elimination of negative regulators may be a more widespread phenomenon in plant signalling.

Recognizing and responding to pathogens

Plants are constantly exposed to pests, parasites and pathogens and have evolved many defences. In mammals, polymorphism for parasite recognition encoded in the MHC genes contributes to resistance. In plants, disease resistance (R) genes that confer parasite recognition are also extremely polymorphic. This polymorphism has been proposed to restrict parasites, and its absence may explain the breakdown of resistance in crop monocultures93. In contrast to MHC genes, plant resistance genes are found at several loci, and the complete genome sequence enables analysis of their complement and structure. Parasite recognition by resistance genes triggers defence mechanisms through various signalling molecules, such as protein kinases and adapter proteins, ion fluxes, reactive oxygen intermediates and nitric oxide. These halt pathogen colonization through transcriptional activation of defence genes and a form of cell death called the hypersensitive response94. The Arabidopsis genome contains diverse resistance genes distributed at many loci, along with components of signalling pathways, and many other genes whose role in disease resistance has been inferred from mutant phenotypes.

Most resistance genes encode intracellular proteins with a nucleotide-binding (NB) site typical of small G proteins, and carboxy-terminal LRRs95. Their amino termini either carry a TIR domain, or a putative coiled coil (CC). There are 85 TIR–NB–LRR resistance genes at 64 loci, and 36 CC–NB–LRR resistance genes at 30 loci. Some NB–LRR resistance genes express neither obvious TIR nor CC domains at their N termini. This potential class is present seven times, at six loci. There are 15 truncated TIR–NB genes that lack an LRR at 10 loci, often adjacent to full TIR–NB–LRR genes. There are also six CC–NB genes, at five loci. These truncated products may function in resistance. Intriguingly, two TIR–NB–LRR genes carry a WRKY domain, found in transcription factors that are implicated in plant defence, and one of these also encodes a protein kinase domain.

Resistance gene evolution may involve duplication and divergence of linked gene families36; however, most (46) resistance genes are singletons; 50 are in pairs, 21 are in 7 clusters of 3 family members, with single clusters of 4, 5, 7, 8 and 9 members, respectively. Of the non-singletons, ~60% of pairs are in direct repeats, and ~40% are in inverted repeats. Resistance genes are unevenly distributed between chromosomes, with 49 on chromosome 1; 2 on chromosome 2; 16 on chromosome 3; 28 on chromosome 4; and 55 on chromosome 5.

In other plant species, resistance genes encode both transmembrane receptors for secreted pathogen products and protein kinases, and some other classes are also found. The Cf genes in tomato encode extracellular LRRs with a transmembrane domain and short cytoplasmic domain. Mutation in an Arabidopsis homologue, CLAVATA2, results in enlarged meristems, but to date no resistance function has been assigned to the 30 Arabidopsis CLV2 homologues. CLAVATA1, a transmembrane LRR kinase, is also required for meristem function. Xa21, a rice LRR-kinase, confers Xanthomonas resistance, and the Arabidopsis FLS2 LRR kinase confers recognition of flagellin. It has been proposed that CLV1 and CLV2 function as a heterodimer; perhaps this is also true for Xa21, FLS2 and Cf proteins. There are 174 LRR transmembrane kinases in Arabidopsis, with only FLS2 assigned a role in resistance. A unique resistance gene, beet Hs1pro-1, which confers nematode resistance, has two Arabidopsis homologues.

The tomato Pto Ser/Thr kinase acts as a resistance protein in conjunction with an NB–LRR protein, so similar kinases might do the same for Arabidopsis NB–LRR proteins. There are 860 Ser/Thr kinases in the Arabidopsis sequence. Fifteen of these share 50% identity over the Pto-aligned region. The Toll pathway in Drosophila and mammals regulates innate immune responses through LRR/TIR domain receptors that recognize bacterial lipopolysaccharides96. Pto is highly homologous to Drosophila PELLE and mammalian IRAK protein kinases that mediate the TIR pathway.
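
The resistance-gene counts quoted in this section can be cross-checked with simple arithmetic. The sketch below is only an illustration: the numbers are transcribed from the text, while the dictionary names and the grouping into three tallies are assumptions made here, not part of the original analysis. The text does not state whether the structural-class counts and the cluster and chromosome counts describe exactly the same gene set, so a small difference between the totals should not be over-interpreted.

```python
# Counts transcribed from the text; the dictionary names are illustrative only.

# Genes per structural class (TIR/CC N-terminus, with or without LRR).
by_class = {
    "TIR-NB-LRR": 85,
    "CC-NB-LRR": 36,
    "NB-LRR without TIR or CC": 7,
    "TIR-NB (truncated)": 15,
    "CC-NB (truncated)": 6,
}

# Keys are cluster sizes; values are the number of genes found in
# clusters of that size (46 singletons, 50 genes in pairs, 7 clusters
# of 3, and one cluster each of 4, 5, 7, 8 and 9 members).
by_cluster_size = {1: 46, 2: 50, 3: 7 * 3, 4: 4, 5: 5, 7: 7, 8: 8, 9: 9}

# Genes per chromosome.
by_chromosome = {1: 49, 2: 2, 3: 16, 4: 28, 5: 55}


def total(counts):
    """Sum the gene counts in one tally."""
    return sum(counts.values())


totals = {
    "by_class": total(by_class),
    "by_cluster_size": total(by_cluster_size),
    "by_chromosome": total(by_chromosome),
}

for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions the cluster and chromosome tallies both sum to 150 genes, whereas the five structural classes sum to 149; the source gives no explanation for the one-gene difference.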

Additional genes have been defined that are required for resistance by our analysis of the genome sequence. The ndr1 mutation defines a gene required by the CC–NB–LRR genes RPS2 and RPM1. NDR1 is 1 of 28 Arabidopsis genes that are similar both to each other and to the HIN1 gene that is transcriptionally induced early during the hypersensitive response. EDS1 is a gene required for TIR–NB–LRR function, and like PAD4, encodes a protein with a putative lipase motif. EDS1, PAD4 and a third gene comprise the EDS1/PAD4 family. The NPR1/NIM1/SAI1 gene is required for systemic acquired resistance, and we found five additional NPR1 homologues. Recessive mutations at both the barley Mlo and Arabidopsis LSD1 loci confer broad-spectrum resistance and derepress a cell-death program. There are at least 18 Mlo family members that resemble heterotrimeric GPCRs in Arabidopsis, and only two LSD1 homologues.

One of the earliest responses to pathogen recognition is the production of reactive oxygen intermediates. This involves a specialized oxidase protein that transfers an electron across the plasma membrane to make superoxide. Arabidopsis encodes eight apparently functional gp91 homologues, called Atrboh genes. Unlike gp91, they all carry an ~300 amino-acid N-terminal extension carrying an EF-hand Ca2+-binding domain. In mammals, activation of the respiratory oxidative burst complex in the neutrophil, which includes gp91, requires the action of Rac proteins. As no Rac or Ras proteins are found in Arabidopsis, members of the large rop family of G proteins may carry this out. Similarly, we did not detect any Arabidopsis homologues of other mammalian respiratory burst oxidase components (p22, p47, p67, p40).

There are no clear homologues of many mammalian defence and cell-death control genes. Although nitric oxide production is involved in plant defence, there is no obvious homologue of nitric oxide synthase. Also absent are apparent homologues of the REL domain transcription factors involved in innate immunity in both Drosophila and mammals. We found no similarity to proteins involved in regulating apoptosis in animal cells, such as classical caspases, bcl2/ced9 and baculovirus p35. There are, however, 36 cysteine proteases. There are also eight homologues of a newly defined metacaspase family97, two of which, along with LSD1, have a clear GATA-type zinc-finger.

Photomorphogenesis and photosynthesis

Because nearly all plants are sessile and most depend on photosynthesis, they have evolved unique ways of responding to light. Light serves as an energy source, as well as a trigger and modulator of complex developmental pathways, including those regulated by the circadian clock. Light is especially important during emergence, where it stimulates chlorophyll production, development, cotyledon expansion, chloroplast biogenesis and the coordinated induction of many nuclear- and chloroplast-encoded genes, while at the same time inhibiting stem growth. The goal of this process, called photomorphogenesis, is the establishment of a body plan that allows the plant to be an efficient photosynthetic machine under varying light conditions98. The signal transduction cascade leading to light-induced responses begins with the activation of photoreceptors. Next, the light signal is transduced via positively and negatively acting nuclear and cytoplasmic proteins, causing activation or derepression of nuclear and chloroplast-encoded photosynthetic genes and enabling the plant to establish optimal photoautotrophic growth. Although genetic and biochemical studies have defined many of the components in this process, the genome sequence provides an opportunity to identify comprehensively Arabidopsis genes involved in photomorphogenesis and the establishment of photoautotrophic growth. We identified at least 100 candidate genes involved in light perception and signalling, and 139 nuclear-encoded genes that potentially function in photosynthesis.

The roles have been described of only 35 of the 100 candidate photomorphogenic genes (Supplementary Information Table 8). All of the light photoreceptors had been discovered previously, including five red/far-red absorbing phytochromes (PHYA-E), two blue/UV-A absorbing cryptochromes (CRY1 and CRY2), one blue-absorbing phototropin (NPH1) and one NPH1-like (or NPL1). In contrast, we uncovered many new proteins similar to the regulators COP/DET/FUS, PKS1, PIF3, NDPK2, SPA1, FAR1, GIGANTEA, FIN219, HY5, CCA1, ATHB-2, ZEITLUPE, FKF1, LKP1, NPH3 and RPT2.

Both the phytochromes and NPH1 contain chromophores for light sensing coupled to kinase domains for signal transmission. Phytochromes have an N-terminal chromophore-binding domain, two PAS domains, and a C-terminal Ser/Thr kinase domain99, whereas NPH1 has two LOV domains (members of the PAS domain superfamily) for flavin mononucleotide binding and a C-terminal Ser/Thr kinase domain100. PAS domains potentially sense changes in light, redox potential and oxygen energy levels, as well as mediating protein–protein interactions99,100. We searched for uncharacterized proteins with the combination of a kinase domain and either a phytochrome chromophore-binding site or PAS domains. Although we found no new phytochrome-like genes, we did identify four predicted proteins that contain PAS and kinase domains (Supplementary Information Fig. 6). These proteins share 80% amino-acid identity, but, unlike NPH1 and NPL1, have only one PAS domain. The combination of potential signal sensing and transmitting domains makes it tempting to speculate that these proteins may be receptors for light or other signals.

Our screen included searches for components of photosynthetic reaction centres and light-harvesting complexes, enzymes involved in CO2 fixation and enzymes in pigment biosynthesis. We identified 11 core proteins of photosystem I, including the eukaryotic-specific components PsaG and PsaH101, and 8 photosystem II proteins, including a single member (psbW) of the photosystem II core. We also found 26 proteins similar to the chlorophyll-a/b binding proteins (8 Lhca and 18 Lhcb). Of the seven subunits of the cytochrome b6f complex (PetA–D, PetG, PetL, PetM), only one (PetC) was found in the nuclear genome, whereas the remainder are probably encoded in the chloroplast. Similarly, of the nine subunits of the chloroplast ATP synthase complex, three are encoded in the nucleus, including the II-, γ- and δ-subunits; the remaining subunits (I, III, IV, α, β, ε) are encoded in the chloroplast102. Ten genes were related to the soluble components of the electron transfer chain, including two plastocyanins, five ferredoxins and three ferredoxin/NADP oxidoreductases. Forty genes are predicted to have a role in CO2 fixation, including all of the enzymes in the Calvin–Benson cycle. For pigment biosynthesis, 16 genes in chlorophyll biosynthesis and 31 genes in carotenoid biosynthesis were found (Supplementary Information Table 8). Our analyses have identified several potential components of the light perception pathway, and have revealed the complex distribution of components of the photosynthetic apparatus between nuclear and plastid genomes.

Metabolism

Arabidopsis is an autotrophic organism that needs only minerals, light, water and air to grow. Consequently, a large proportion of the genome encodes enzymes that support metabolic processes, such as photosynthesis, respiration, intermediary metabolism, mineral acquisition, and the synthesis of lipids, fatty acids, amino acids, and cofactors103. With respect to these processes, Arabidopsis appears to contain a complement of genes similar to those in the photoautotrophic cyanobacterium Synechocystis45, but, whereas Synechocystis generally has a single gene encoding an enzyme, Arabidopsis frequently has many. For example, Arabidopsis has at least seven genes for the glycolytic enzyme pyruvate kinase, with an
additional five for pyruvate kinase-like proteins. Whatever the reason for this high level of redundancy, it varies from gene to gene in the same pathway; the 11 enzymes of glycolysis are encoded by up to 51 genes that are present in as few as one or as many as eight copies. Similarly, of the 59 genes encoding proteins involved in glycerolipid metabolism, 39 are represented by more than one gene104. Genome duplication and expansion of gene families by tandem duplication have contributed to this diversity.

This high degree of apparent structural redundancy does not necessarily imply functional redundancy. For instance, although there are seven genes for serine hydroxymethyltransferase, a mutation in the gene for the mitochondrial form completely blocks the photorespiratory pathway105. Although there are 12 genes for cellulose synthase, mutations in at least 2 of the 12 confer distinct phenotypes because of tissue-specific gene expression106.

The metabolome of Arabidopsis differs from that of cyanobacteria, or of any other organism sequenced to date, by the presence of many genes encoding enzymes for pathways that are unique to vascular plants. In particular, although relatively little is known about the enzymology of cell-wall metabolism, more than 420 genes could be assigned probable roles in pathways responsible for the synthesis and modification of cell-wall polymers. Twelve genes encode cellulose synthase, and 29 other genes encode 6 families of structurally related enzymes thought to synthesize other major polysaccharides106. Roughly 52 genes encode polygalacturonases, 20 encode pectate lyases and 79 encode pectin esterases, indicating a massive investment in modifying pectin. Similarly, the presence of 39 β-1,3-glucanases, 20 endoxyloglucan transglycosylases, 50 cellulases and other hydrolases, and 23 expansins reflects the importance of wall remodelling during growth of plant cells. Excluding ascorbate and glutathione peroxidases, there are 69 genes with significant similarity to known peroxidases and 15 laccases (diphenol oxidases). Their presence in such abundance indicates the importance of oxidative processes in the synthesis of lignin, suberin and other cell-wall polymers.

The high degree of apparent redundancy in the genes for cell-wall metabolism might reflect differences in substrate specificity by some of the enzymes. It is already known that cell types have different wall compositions, which may require that the relevant enzymes be subject to cell-type-specific transcriptional regulation. Of the 40 or so cell types that plants make, almost all can be identified by unique features of their cell wall107. A large number of genes involved in wall metabolism have yet to be defined. Although more than 60 genes for glycosyltransferases can be found in the genome sequence, most of these are probably involved in protein glycosylation or metabolite catabolism and do not seem to be adequate to account for the complexity of the wall. For instance, at least 21 enzymes are required just to produce the linkages of the pectic polysaccharide RGII, and none of these enzymes has been identified at present. Thus, if these and related enzymes involved in the synthesis of other cell-wall polymers are also represented by multiple genes, a substantial number of the genes of currently unknown function may be involved in cell-wall metabolism.

Higher plants collectively synthesize more than 100,000 secondary metabolites. Because flowering plants are thought to have similar numbers of genes, it is apparent that a great deal of enzyme creation took place during the evolution of higher plants. An important factor in the rapid evolution of metabolic complexity is the large family of cytochrome P450s that are evident in Arabidopsis (Supplementary Information Table 1). These enzymes represent a superfamily of haem-containing proteins, most of which catalyse NADPH- and O2-dependent hydroxylation reactions. Plant P450s participate in myriad biochemical pathways including those devoted to the synthesis of plant products, such as phenylpropanoids, alkaloids, terpenoids, lipids, cyanogenic glycosides and glucosinolates, and plant growth regulators, such as gibberellins, jasmonic acid and brassinosteroids. Whereas Arabidopsis has ~286 P450 genes, Drosophila has 94, C. elegans has 73 and yeast has only 3. This low number in yeast indicates that there are few reactions of basic metabolism that are catalysed by P450s. It seems likely that many animal P450s are involved in detoxification of compounds from food plant sources. The role of endogenous enzymes is poorly understood; only a few dozen P450 enzymes from plants have been characterized to any extent. The discrepancy between the number of known P450-catalysed reactions and the number of genes suggests that Arabidopsis produces a relatively large number of metabolites that have yet to be identified.

In addition to the large number of cytochrome P450s, Arabidopsis has many other genes that suggest the existence of pathways or processes that are not currently known. For instance, the presence of 19 genes with similarity to anthranilate N-hydroxycinnamoyl/benzoyl transferase is currently inexplicable. This enzyme is involved in the synthesis of dianthramide phytoalexins in Caryophyllaceae and Gramineae. No phytoalexins of this class have been described in Arabidopsis as yet. Similarly, the presence of 12 genes with sequence similarity to the berberine bridge enzyme ((S)-reticuline:oxygen oxidoreductase (methylene-bridge-forming); EC 1.5.3.9), and 13 genes with similarity to tropinone reductase, suggests that Arabidopsis may have the ability to produce alkaloids. In other plants, the berberine bridge enzyme transforms reticuline into scoulerine, a biosynthetic precursor to a multitude of species-specific protopine, protoberberine and benzophenanthridine alkaloids. The discovery of these and many other intriguing genes in the Arabidopsis genome has created a wealth of new opportunities to understand the metabolic and structural diversity of higher plants.

Concluding remarks
The twentieth century began with the rediscovery of Mendel's rules of inheritance in pea108, and it ends with the elucidation of the complete genetic complement of a model plant, Arabidopsis. The analysis of the completed sequence of a flowering plant reported here provides insights into the genetic basis of the similarities and differences of diverse multicellular organisms. It also creates the potential for direct and efficient access to a much deeper understanding of plant development and environmental responses, and permits the structure and dynamics of plant genomes to be assessed and understood.

Arabidopsis, C. elegans and Drosophila have a similar range of 11,000–15,000 different types of proteins, suggesting this is the minimal complexity required by extremely diverse multicellular eukaryotes to execute development and respond to their environment. We account for the larger number of gene copies in Arabidopsis compared with these other sequenced eukaryotes with two possible explanations. First, independent amplification of individual genes has generated tandem and dispersed gene families to a greater extent in Arabidopsis, and unequal crossing over may be the predominant mechanism involved. Second, ancestral duplication of the entire genome and subsequent rearrangements have resulted in segmental duplications. The pattern of these duplications suggests an ancient polyploidy event, and mutant analysis indicates that at least some of the many duplicate genes are functionally redundant. Their occurrence in a functionally diploid genetic model came as a surprise, and is reminiscent of the situation in maize, an ancient segmental allotetraploid. The remarkable degree of genome plasticity revealed in the large-scale duplications may be needed to provide new functions, as alternative promoters and alternative splicing appear to be less widely used in plants than they are in animals. Apart from duplicated segments, the overall chromosome structure of Arabidopsis closely resembles that of Drosophila; transposons and other repetitive sequences are concentrated in the heterochromatic regions surrounding the centromere,

whereas the euchromatic arms are largely devoid of repetitive sequences. Conversely, most protein-coding genes reside in the euchromatin, although a number of expressed genes have been identified in centromeric regions. Finally, Arabidopsis is the first methylated eukaryotic genome to be sequenced, and will be invaluable in the study of epigenetic inheritance and gene regulation.

Unlike most animals, plants generally do not move, they can perpetuate indefinitely, they reproduce through an extended haploid phase, and they synthesize all their metabolites. Our comparison of Arabidopsis, bacterial, fungal and animal genomes starts to define the genetic basis for these differences between plants and other life forms. Basic intracellular processes, such as translation or vesicle trafficking, appear to be conserved across kingdoms, reflecting a common eukaryotic heritage. More elaborate intercellular processes, including physiology and development, use different sets of components. For example, membrane channels, transporters and signalling components are very different in plants and animals, and the large number of transcription factors unique to plants contrasts with the conservation of many chromatin proteins across the three eukaryotic kingdoms. Unexpected differences between seemingly similar processes include the absence of intracellular regulators of cell division (Cdc25) and apoptosis (Bcl-2). On the other hand, DNA repair appears more highly conserved between plants and mammals than within the animal kingdom, perhaps reflecting common factors such as DNA methylation. Our analysis also shows that many genes of the endosymbiotic ancestor of the plastid have been transferred to the nucleus, and the products of this rich prokaryotic heritage contribute to diverse functions such as photoautotrophic growth and signalling.

The sequence reported here changes the fundamental nature of plant genetic analysis. Forward genetics is greatly simplified as mutations are more conveniently isolated molecularly, but at the same time extensive gene duplications mean that functional redundancy must be taken into account. At a biochemical level, the specificity conferred by nucleotide sequence, and the completeness of the survey allow complex mixtures of RNA and protein to be resolved into their individual components using micro-arrays and proteomics. This specificity can also be used in the parallel analysis of genome-wide polymorphisms and quantitative traits in natural populations109. Looking ahead, the challenge of determining the function of the large set of predicted genes, many of which are plant-specific, is now a clear priority, and multinational programs have been initiated to accomplish this goal using site-selected mutagenesis among the necessary tools110. Finally, productive paths of crop improvement, based on enhanced knowledge of Arabidopsis gene function, will help meet the challenge of sustaining our food supply in the coming years.

Note added in proof: at the time of publication 17 centromeric BACs and 5 sequence gaps in chromosome arms are being sequenced.

Methods
The three centres used similar annotation approaches involving in silico gene-finding methods, comparison to EST and protein databases, and manual reconciliation of that data. Gene finding involved three steps: (1) analysis of BAC sequences using a computational gene finder; (2) alignment of the sequence to the protein and EST databases; (3) assignment of functions to each of the genes. Genscan111, GeneMark.HMM112, Xgrail113, Genefinder (P. Green, unpublished software) and GlimmerA114 were used to analyse BAC sequences. All of these systems were specially trained for Arabidopsis genes. Splice sites were predicted using NetGene2 (ref. 115), Splice Predictor116 and GeneSplicer (M. Pertea and S. Salzberg, unpublished software). For the second step, BACs were aligned to ESTs and to the Arabidopsis gene index117 using programs such as DDS/GAP2 (ref. 118) or BLASTN119. Segmental duplications were analysed and displayed using a modified version of DIALIGN2 (ref. 120).

Received 20 October; accepted 15 November 2000.

1. The C. elegans Sequencing Consortium. Sequence and analysis of the genome of C. elegans. Science 282, 2012–2018 (1998).
2. Adams, M. D. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
3. Meinke, D. W., Cherry, J. M., Dean, C., Rounsley, S. D. & Koornneef, M. Arabidopsis thaliana: a model plant for genome analysis. Science 282, 662–665 (1998).
4. Lin, X. et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402, 761–768 (1999).
5. Mayer, K. et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402, 769–777 (1999).
6. Theologis, A. et al. Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature 408, 816–820 (2000).
7. Salanoubat, M. et al. Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. Nature 408, 820–822 (2000).
8. Tabata, S. et al. Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature 408, 820–822 (2000).
9. Choi, S. D., Creelman, R., Mullet, J. & Wing, R. A. Construction and characterisation of a bacterial artificial chromosome library from Arabidopsis thaliana. Weeds World 2, 17–20 (1995).
10. Mozo, T., Fischer, S., Shizuya, H. & Altmann, T. Construction and characterization of the IGF Arabidopsis BAC library. Mol. Gen. Genet. 258, 562–570 (1998).
11. Lui, Y.-G., Mitsukawa, N., Vazquez-Tello, A. & Whittier, R. F. Generation of a high-quality P1 library of Arabidopsis suitable for chromosome walking. Plant J. 7, 351–358 (1995).
12. Lui, Y.-G. et al. Complementation of plant mutants with large genomic DNA fragments by a transformation-competent artificial chromosome vector accelerates positional cloning. Proc. Natl Acad. Sci. USA 96, 6535–6540 (1999).
13. Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22, 265–270 (1999).
14. Mozo, T. et al. A complete BAC-based physical map of the Arabidopsis thaliana genome. Nature Genet. 22, 271–275 (1999).
15. Sato, S. et al. Structural analysis of Arabidopsis thaliana chromosome 5. I. Sequence features of the 1.6 Mb regions covered by twenty physically assigned P1 clones. DNA Res. 4, 215–230 (1997).
16. Bent, E., Johnson, S. & Bancroft, I. BAC representation of two low-copy regions of the genome of Arabidopsis thaliana. Plant J. 13, 849–855 (1998).
17. Copenhaver, G. P. et al. Genetic definition and sequence analysis of Arabidopsis centromeres. Science 286, 2468–2474 (1999).
18. Meyerowitz, E. M. & Somerville, C. R. Arabidopsis (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 1994).
19. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
20. Pavy, N. et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. BioInformatics 15, 887–900 (1999).
21. Mewes, H. W. et al. Overview of the yeast genome. Nature 387 (Suppl.) 7–65 (1997).
22. Frishman, D. et al. Functional and structural genomics using PEDANT. BioInformatics (in the press).
23. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462 (1997).
24. Kotani, H. & Tabata, S. Lessons from the sequencing of the genome of a unicellular cyanobacterium, Synechocystis sp. PCC6803. Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 151–171 (1998).
25. Apweiler, R. et al. INTERPRO (http://www.ebi.ac.uk/interpro/). Collaborative Computer Project 11 Newsletter no. 10 (Cambridge, 2000).
26. Bent, A. F. et al. RPS2 of Arabidopsis thaliana a leucine-rich repeat class of genes. Science 265, 1856–1860 (1994).
27. Skowyra, D. et al. F box proteins are receptors that recruit phosphorylated substrates to the SCF ubiquitin-ligase complex. Cell 91, 209–219 (1997).
28. Joazeiro, C. A. P. & Weissman, A. M. RING finger proteins: mediators of ubiquitin ligase activity. Cell 102, 549–552 (2000).
29. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
30. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
31. Blanc, G. et al. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12, 1093–1102 (2000).
32. Wendel, J. F. Genome evolution in polyploids. Plant Mol. Biol. 42, 225–249 (2000).
33. Gaut, B. S. & Doebley, J. F. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc. Natl Acad. Sci. USA 94, 6809–6814 (1997).
34. Ku, H.-M., Vision, T., Liu, J. & Tanksley, S. D. Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl Acad. Sci. USA 97, 9121–9126 (2000).
35. Noel, L. et al. Pronounced intraspecific haplotype divergence at the RPP5 complex disease resistance locus of Arabidopsis. Plant Cell 11, 2099–2111 (1999).
36. Ellis, J., Dodds, P. & Pryor, T. Structure, function, and evolution of plant disease resistance genes. Trends Plant Sci. 3, 278–284 (2000).
37. Tanksley, S. D. et al. High density molecular linkage maps of the tomato and potato genomes. Genetics 132, 1141–1160 (1992).
38. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
39. Acarkan, A., Rossberg, M., Koch, M. & Schmidt, R. Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J. 23, 55–62 (2000).
40. Cavell, A., Lydiate, D., Parkin, I., Dean, C. & Trick, M. A 30 centimorgan segment of Arabidopsis thaliana chromosome 4 has six collinear homologues within the Brassica napus genome. Genome 41, 62–69 (1998).
41. O'Neill, C. & Bancroft, I. Comparative physical mapping of segments of the genome of Brassica oleracea var alboglabra that are homologous to sequenced regions of the chromosomes 4 and 5 of Arabidopsis thaliana. Plant J. 23, 233–243 (2000).
42. Wolfe, K. H., Gouy, M., Yang, Y.-W., Sharp, P. M. & Li, W.-H. Date of the monocot-dicot divergence estimated from the chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA 86, 6201–6205 (1989).
43. van Dodeweerd, A.-M. et al. Identification and analysis of homologous segments of the genomes of rice and Arabidopsis thaliana. Genome 42, 887–892 (1999).
44. Mayer, K. Sequence level analysis of homologous segments of the genomes of rice and Arabidopsis thaliana. Genome Res. (submitted).
45. Sato, S. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6, 283–290 (1999).
46. Unseld, M., Marienfeld, J., Brandt, P. & Brennicke, A. The mitochondrial genome in Arabidopsis


thaliana contains 57 genes in 366,924 nucleotides. Nature Genet. 15, 57–61 (1997).
47. Palmer, J. D. et al. Dynamic evolution of plant mitochondrial genomes: mobile genes and introns and highly variable mutation rates. Proc. Natl Acad. Sci. USA 97, 6960–6966 (2000).
48. Le, Q.-H. et al. Transposon diversity in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 97, 7376–7381 (2000).
49. Yu, Z., Wright, S. & Bureau, T. Mutator-like elements (MULEs) in Arabidopsis thaliana: Structure, diversity and evolution. Genetics (in the press).
50. Feschotte, C. & Mouches, C. Evidence that a family of miniature inverted-repeat transposable elements (MITEs) from the Arabidopsis thaliana genome has arisen from a pogo-like DNA transposon. Mol. Biol. Evol. 17, 730–737 (2000).
51. Martienssen, R. Transposons, DNA methylation and gene control. Trends Genet. 14, 263–264 (1998).
52. Singer, T., Yordan, C. & Martienssen, R. Robertson's Mutator transposons in Arabidopsis are regulated by the chromatin-remodeling gene Decrease in DNA Methylation (DDM1). Genes Dev. (in the press).
53. SanMiguel, P. et al. Nested retrotransposons in the intergenic regions of the maize genome. Science 274, 765–768 (1996).
54. Copenhaver, G. P. & Pikaard, C. S. Two-dimensional RFLP analyses reveal megabase-sized clusters of rRNA gene variants in Arabidopsis thaliana, suggesting local spreading of variants as the mode for gene homogenization during concerted evolution. Plant J. 9, 273–282 (1996).
55. Fransz, P. et al. Cytogenetics for the model system Arabidopsis thaliana. Plant J. 13, 867–876 (1998).
56. Richards, E. J. & Ausubel, F. M. Isolation of a higher eukaryotic telomere from Arabidopsis thaliana. Cell 53, 127–136 (1988).
57. Round, E. K., Flowers, S. K. & Richards, E. J. Arabidopsis thaliana centromere regions: genetic map positions and repetitive DNA structure. Genome Res. 7, 1045–1053 (1997).
58. The CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium. The complete sequence of a heterochromatic island from a higher eukaryote. Cell 100, 377–386 (2000).
59. Fransz, P. F. et al. Integrated cytogenetic map of chromosome arm 4S of A. thaliana: Structural organization of heterochromatic knob and centromere region. Cell 100, 367–376 (2000).
60. Paulsen, I. T., Nguyen, L., Sliwinski, M. K., Rabus, R. & Saier, M. H. Jr Microbial genome analyses: comparative transport capabilities in eighteen prokaryotes. J. Mol. Biol. 301, 75–101 (2000).
61. Paulsen, I. T., Sliwinski, M. K., Nelissen, B., Goffeau, A. & Saier, M. H. Jr Unified inventory of established and putative transporters encoded within the complete genome of Saccharomyces cerevisiae. FEBS Lett. 430, 116–125 (1998).
62. Hirsch, R. E., Lewis, B. D., Spalding, E. P. & Sussman, M. R. A role for the AKT1 potassium channel in plant nutrition. Science 280, 918–921 (1998).
63. Slayman, C. L. & Slayman, C. W. Depolarization of the plasma membrane of Neurospora during active transport of glucose: evidence for a proton-dependent cotransport system. Proc. Natl Acad. Sci. USA 71, 1035–1939 (1974).
64. Ryan, C. A. & Pearce, G. Systemin: a polypeptide signal for plant defensive genes. Annu. Rev. Cell. Dev. Biol. 14, 1–17 (1998).
65. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435, 171–213 (1999).
66. Selby, C. P. & Sancar, A. Structure and function of transcription-repair coupling factor. Structural domains and binding properties. J. Biol. Chem. 270, 4882–4889 (1995).
67. Britt, A. B. Molecular genetics of DNA repair in higher plants. Trends Plant Sci. 4, 20–25 (1999).
68. Dangl, M. Response to Aravind, L. & Koonin, E. V. Second Family of Histone Deacetylases. Science 280, 1167 (1998).
69. Cao, X. et al. Conserved plant genes with similarity to mammalian de novo DNA methyltransferases. Proc. Natl Acad. Sci. USA 97, 4979–4984 (2000).
70. Henikoff, S. & Comai, L. A DNA methyltransferase homologue with a chromodomain exists in multiple polymorphic forms in Arabidopsis. Genetics 149, 307–318 (1998).
71. Hung, M.-S. et al. Drosophila proteins related to vertebrate DNA (5-cytosine) methyltransferases. Proc. Natl Acad. Sci. USA 96, 11940–11945 (1999).
72. Dalmay, T., Hamilton, A. J., Rudd, S., Angell, S. & Baulcombe, D. C. An RNA-dependent-RNA polymerase in Arabidopsis is required for post transcriptional gene silencing mediated by a transgene but not by a virus. Cell 101, 543–553 (2000).
73. Riechmann, J. L. & Ratcliffe, O. J. A genomic perspective on plant transcription factors. Curr. Opin. Plant Biol. 3, 423–434 (2000).
74. Liljegren, S. J. et al. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 404, 766–770 (2000).
75. Pelaz, S. et al. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405, 200–203 (2000).
76. Canaday, J., Stoppin-Mellet, V., Mutterer, J., Lambert, A. M. & Schmit, A. C. Higher plant cells: gamma-tubulin and microtubule nucleation in the absence of centrosomes. Microsc. Res. Technol. 49, 487–495 (2000).
77. Bassham, D. C. & Raikhel, N. V. Unique features of the plant vacuolar sorting machinery. Curr. Opin. Cell Biol. 12, 491–495 (2000).
78. Zheng, Z. L. & Yang, Z. The Rop GTPase switch turns on polar growth in pollen. Trends Plant Sci. 5, 298–303 (2000).
79. den Boer, B. G. & Murray, J. A. Triggering the cell cycle in plants. Trends Cell Biol. 10, 245–250 (2000).
80. Heese, M., Mayer, U. & Jurgens, G. Cytokinesis in flowering plants: cellular process and developmental integration. Curr. Opin. Plant Biol. 1, 486–491 (1998).
81. Meyerowitz, E. M. Plants, animals, and the logic of development. Trends Genet. 15, M65–M68 (1999).
82. Wang, D. Y. C. et al. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. R. Soc. Lond. B Bio. 266, 63–171 (1999).
83. Torii, K. Receptor kinase activation and signal transduction in plants: an emerging picture. Curr. Opin. Plant Biol. 3, 362–367 (2000).
84. Yeh, K. C. & Lagarias, J. C. Eukaryotic phytochromes: Light-regulated serine/threonine protein kinases with histidine kinase ancestry. Proc. Natl Acad. Sci. USA 95, 13976–13981 (1998).
85. McCarty, D. R. & Chory, J. Conservation and innovation in plant signaling pathways. Cell 103, 201–211 (2000).
86. Weiss, C. A., Garnaat, C., Mukai, K., Hu, Y. & Ma, H. of cDNAs from maize and Arabidopsis encoding a G protein beta subunit. Proc. Natl Acad. Sci. USA 91, 9554–9558 (1994).
87. Bowler, C. et al. Cyclic GMP and calcium mediate phytochrome phototransduction. Cell 77, 73–81 (1994).
88. Stepanova, A. & Ecker, J. R. Ethylene signaling: from mutants to molecules. Curr. Opin. Plant Biol. 3, 353–360 (2000).
89. Urao, T., Yamaguchi-Shinozaki, K. & Shinozaki, K. Two-component systems in plant signal transduction. Trends Plant Sci. 5, 67–74 (2000).
90. Makino, S. et al. Genes encoding pseudo-response regulators: Insight into His-to-Asp phosphorelay and circadian rhythm in Arabidopsis thaliana. Plant Cell Physiol. 41, 791–803 (2000).
91. D'Agostino, I. B. & Kieber, J. J. Phosphorelay signal transduction: the emerging family of plant response regulators. Trends Biol. Sci. 24, 452–456 (1999).
92. Strayer, C. et al. Cloning of the Arabidopsis clock gene TOC1, an autoregulatory response regulator homologue. Science 289, 768–771 (2000).
93. Stahl, E. A. & Bishop, J. G. Plant-Pathogen arms races at the molecular level. Curr. Opin. Plant Biol. 3, 299–304 (2000).
94. McDowell, J. M. & Dangl, J. L. Signal transduction in the plant innate immune response. Trends Biochem. Sci. 25, 79–82 (2000).
95. Van der Biezen, E. A. & Jones, J. D. Plant disease-resistance proteins and the gene-for-gene concept. Trends Biochem. Sci. 23, 454–456 (1998).
96. Belvin, M. P. & Anderson, K. V. A conserved signaling pathway: the Drosophila toll-dorsal pathway. Annu. Rev. Cell. Dev. Biol. 12, 393–416 (1996).
97. Uren, A. G. et al. Identification of paracaspases and metacaspases: Two ancient families of caspase-like proteins, one of which plays a key role in MALT lymphoma. Mol. Cell 6, 961–967 (2000).
98. Fankhauser, C. & Chory, J. Light control of plant development. Annu. Rev. Cell. Dev. Biol. 13, 203–229 (1997).
99. Briggs, W. R. & Huala, E. Blue-light photoreceptors in higher plants. Annu. Rev. Cell. Dev. Biol. 15, 33–62 (1999).
100. Christie, J. M., Salomon, M., Nozue, K., Wada, M. & Briggs, W. R. LOV (light, oxygen, or voltage) domains of the blue-light photoreceptor phototropin (nph1): binding sites for the chromophore flavin mononucleotide. Proc. Natl Acad. Sci. USA 96, 8779–8783 (1999).
101. Golbeck, J. H. Structure and function of photosystem I. Annu. Rev. Plant Physiol. Plant Mol. Biol. 43, 293–324 (1992).
102. Maier, R. M., Neckermann, K., Igloi, G. L. & Kossel, H. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. J. Mol. Biol. 251, 614–628 (1995).
103. Buchanan, B. B., Gruissem, W. & Jones, R. L. in Biochemistry and Molecular Biology of Plants 1367 (Am. Soc. Plant Physiol., Rockville, Maryland, 2000).
104. Mekhedov, S., Martínez de Ilárduya, O. & Ohlrogge, J. Toward a functional catalog of the plant genome. A survey of genes for lipid biosynthesis. Plant Physiol. 122, 389–401 (2000).
105. Somerville, C. R. & Ogren, W. L. Photorespiration deficient mutants of Arabidopsis thaliana lacking mitochondrial serine transhydroxymethylase activity. Plant Physiol. 67, 666–671 (1981).
106. Richmond, T. & Somerville, C. R. The cellulose synthase superfamily. Plant Physiol. 124, 495–499 (1999).
107. Carpita, N. & Vergara, C. A recipe for cellulose. Science 279, 672–673 (1998).
108. De Vries, H. Sur la loi de disjonction des hybrides. C. R. Acad. Sci. Paris 130, 845–847 (1900).
109. Alonso-Blanco, C. & Koornneef, M. Naturally occurring variation in Arabidopsis: an underexploited resource for plant genetics. Trends Plant Sci. 5, 1360–1385 (1999).
110. Chory, J. Functional genomics and the virtual plant. A blueprint for understanding how plants are built and how to improve them. Plant Physiol. 123, 423–425 (2000).
111. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
112. Lukashin, A. V. & Borodovsky, M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107–1115 (1998).
113. Uberbacher, E. C. & Mural, R. J. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA 88, 11261–11265 (1991).
114. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J. & Tettelin, H. Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31 (1999).
115. Hebsgaard, S. M. et al. Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information. Nucleic Acids Res. 24, 3439–3452 (1996).
116. Brendel, V. & Kleffe, J. Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Res. 26, 4748–4757 (1998).
117. Quackenbush, J., Liang, F., Holt, I., Pertea, G. & Upton, J. The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res. 28, 141–145 (2000).
118. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997).
119. Altschul, S. F. et al. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
120. Morgenstern, B. DIALIGN2: improvement of the segment-to-segment approach to multiple sequence alignment. BioInformatics 15, 211–218 (1999).
121. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).
122. Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005–1016 (2000).

Supplementary information is available on Nature's World-Wide Web site (http://www.nature.com) or as paper copy from the London editorial office of Nature.

Acknowledgements
This work was supported by the National Science Foundation (NSF) Cooperative Agreements (funded by the NSF, the US Department of Agriculture (USDA) and the US Department of Energy (DOE)), the Kazusa DNA Research Institute Foundation, and by the European Commission. Additional support from the USDA, Ministère de la Recherche, GSF-Forschungszentrum f. Umwelt u. Gesundheit, BMBF (Bundesministerium f. Bildung, Forschung und Technologie), the BBSRC (Biotechnology and Biological

NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com © 2000 Macmillan Magazines Ltd 813 articles

Sciences Research Council) and the Plant Research International, Wageningen, is also The Arabidopsis Genome Initiative gratefully acknowledged. The authors wish to thank E. Magnien, D. Nasser and J. D. Three groups contributed to the work reported here. The Genome Sequencing groups, Watson for their continual support and encouragement. arranged here in order of sequence contribution, sequenced and annotated assigned chromosomal regions. The Genome Analysis group carried out the analyses described. Correspondence and requests for materials should be addressed to The Arabidopsis The Contributing Authors interpreted the genome analyses, incorporating other data and Genome Initiative (e-mail: [email protected] or [email protected]). analyses, with respect to selected biological topics.

Genome Sequencing Groups

Samir Kaul, Hean L. Koo, Jennifer Jenkins, Michael Rizzo, Timothy Rooney, Luke J. Tallon, Tamara Feldblyum, William Nierman, Maria-Ines Benito, Xiaoying Lin, Christopher D. Town, J. Craig Venter & Claire M. Fraser The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA

Satoshi Tabata, Yasukazu Nakamura, Takakazu Kaneko, Shusei Sato, Erika Asamizu, Tomohiko Kato, Hirokazu Kotani & Shigemi Sasamoto Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292, Japan

Joseph R. Ecker1*†, Athanasios Theologis2*, Nancy A. Federspiel3*†, Curtis J. Palm3, Brian I. Osborne2, Paul Shinn1, Aaron B. Conway3, Valentina S. Vysotskaia2, Ken Dewar1, Lane Conn3, Catherine A. Lenz2, Christopher J. Kim1, Nancy F. Hansen3, Shirley X. Liu2, Eugen Buehler1, Hootan Altafi3, Hitomi Sakano2, Patrick Dunn1, Bao Lam3, Paul K. Pham2, Qimin Chao1, Michelle Nguyen3, Guixia Yu2, Huaming Chen1, Audrey Southwick3, Jeong Mi Lee2, Molly Miranda3, Mitsue J. Toriumi2 & Ronald W. Davis3 1, Plant Science Institute, Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; 2, Plant Gene Expression Center/USDA-U.C. Berkeley, 800 Buchanan Street, Albany, California 94710, USA; 3, Stanford Genome Technology Center, 855 California Avenue, Palo Alto, California 94304, USA. * These authors contributed equally to this work. † Present addresses: The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA (J.R.E.); Exelixis, Inc., 170 Harborway, P.O. Box 511, South San Francisco, California 94083-0511, USA (N.A.F.)

European Union Chromosome 4 and 5 Sequencing Consortium: R. Wambutt1, G. Murphy2, A. Düsterhöft3, W. Stiekema4, T. Pohl5, K.-D. Entian6, N. Terryn7 & G. Volckaert8 1, AGOWA GmbH, Glienicker Weg 185, D-12489 Berlin, Germany; 2, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 3, QIAGEN GmbH, Max-Volmer-Str. 4, D-40724 Hilden, Germany; 4, Greenomics, Plant Research International, Droevendaalsesteeg 1, NL-6700 AA Wageningen, The Netherlands; 5, GATC GmbH, Fritz-Arnold Strasse 23, D-78467 Konstanz, Germany; 6, SRD GmbH, Oberurseler Str. 43, Oberursel 61440, Germany; 7, Department for Plant Genetics, (VIB), University of Gent, K.L. Ledeganckstraat 35, B-9000 Gent, Belgium; 8, Katholieke Universiteit Leuven, Laboratory of Gene Technology, Kardinaal Mercierlaan 92, B-3001 Leuven, Belgium

European Union Chromosome 3 Sequencing Consortium: M. Salanoubat1, N. Choisne1, M. Rieger2, W. Ansorge3, M. Unseld4, B. Fartmann5, G. Valle6, F. Artiguenave1, J. Weissenbach1 & F. Quetier1 1, Genoscope and CNRS FRE2231, 2 rue G. Crémieux, 91057 Evry Cedex, France; 2, Genotype GmbH, Angelhofweg 39, D-69259 Wilhelmsfeld, Germany; 3, European Molecular Biology Laboratory, Biochemical Instrumentation Program, Meyerhofstr. 1, D-69117 Heidelberg, Germany; 4, LION Bioscience AG, Im Neuenheimer Feld 515-517, 69120 Heidelberg, Germany; 5, MWG-Biotech AG, Anzinger Strasse 7a, 85560 Ebersberg, Germany; 6, CRIBI, Università di Padova, via G. Colombo 3, Padova 35131, Italy

The Cold Spring Harbor and Washington University Genome Sequencing Center Consortium: Richard K. Wilson1, Melissa de la Bastide2, M. Sekhon1, Emily Huang2, Lori Spiegel2, Lidia Gnoj2, K. Pepin1, J. Murray1, D. Johnson1, Kristina Habermann2, Neilay Dedhia2, Larry Parnell2, Raymond Preston2, L. Hillier1, Ellson Chen3, M. Marra2, Robert Martienssen4 & W. Richard McCombie2 1, Washington University Genome Sequencing Center, Washington University in St Louis School of Medicine, 4444 Forest Park Blvd., St. Louis, Missouri 63108 USA; 2, Lita Annenberg Hazen Genome Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; 3, Celera Genomics, 850 Lincoln Center Drive, Foster City, California 94494, USA; 4, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA

Genome Analysis Group

Klaus Mayer1*, Owen White2*, Michael Bevan3, Kai Lemcke1, Todd H. Creasy2, Cord Bielke2, Brian Haas1, Dirk Haase1, Rama Maiti2, Stephen Rudd1, Jeremy Peterson2, Heiko Schoof1, Dimitrij Frishman1, Burkhard Morgenstern1, Paulo Zaccaria1, Maria Ermolaeva2, Mihaela Pertea2, John Quackenbush2, Natalia Volfovsky2, Dongying Wu2, Todd M. Lowe4, Steven L. Salzberg2 & Hans-Werner Mewes1 1, GSF-Forschungszentrum f. Umwelt u. Gesundheit, Munich Information Center for Protein Sequences, am Max-Planck-Institut f. Biochemie, Am Klopferspitz 18a, D-82152, Germany; 2, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA; 3, Molecular Genetics Department, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 4, Dept Genetics, Stanford University Medical School, Stanford, California 94305-5120, USA. * These authors contributed equally to this work

Contributing Authors

Comparative analysis of the genomes of A. thaliana accessions. S. Rounsley, D. Bush, S. Subramaniam, I. Levin & S. Norris Cereon Genomics LLC, 45 Sidney St, Cambridge, Massachusetts 02139, USA

Comparative analysis of the genomes of A. thaliana and other genera. R. Schmidt1, A. Acarkan1 & I. Bancroft2 1, Max-Delbrück-Laboratorium in der Max-Planck-Gesellschaft, Carl-von-Linné-Weg 10, 50829 Cologne, Germany; 2, Brassicas and Oilseeds Research Department, John Innes Centre, Norwich NR4 7UJ, UK


Integration of the three genomes in the plant cell: the extent of protein and nucleic acid traffic between nucleus, plastids and mitochondria. F. Quetier1, A. Brennicke2 & J. A. Eisen3. 1, Genoscope, Centre Nationale de Sequencage, 2 rue Gaston Cremieux, CP 5706, 91057 Evry Cedex, France; 2, Molekulare Botanik, Universität Ulm, 89069 Ulm, Germany; 3, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA

Transposable elements. T. Bureau1, B.-A. Legault1, Q.-H. Le1, N. Agrawal1, Z. Yu1 & R. Martienssen2 1, McGill University, Dept of Biology, 1205 rue Dr Penfield, Montreal, Quebec, H3A 1B1, Canada; 2, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA

rDNA, telomeres and centromeres. G. P. Copenhaver1, S. Luo1, C. S. Pikaard2 & D. Preuss1 1, Howard Hughes Medical Institute, The University of Chicago, 1103 East 57th Street, Chicago, Illinois, USA; 2, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA

Membrane transport. I. T. Paulsen1 & M. Sussman2 1, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA; 2, University of Wisconsin Biotechnology Center, 425 Henry Mall, Madison, Wisconsin 53706, USA

DNA repair and recombination. A. B. Britt1 & J. A. Eisen2 1, Section of Plant Biology, University of California, Davis, California 95616, USA; 2, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA

Gene regulation. D. A. Selinger1, R. Pandey1, D. W. Mount2, V. L. Chandler1, R. A. Jorgensen1 & C. Pikaard3 1, Department of Plant Sciences, University of Arizona, 303 Forbes Hall; and 2, Department of Molecular and Cellular Biology, University of Arizona, Tucson, Arizona 85721, USA; 3, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA

Cellular organization. G. Juergens Entwicklungsgenetik, ZMBP-Centre for Plant Molecular Biology, auf der Morgenstelle 1, Tuebingen D-72076, Germany

Development. E. M. Meyerowitz. Division of Biology, California Institute of Technology, Pasadena, California 91125, USA

Signal transduction. J. R. Ecker1 & A. Theologis2. 1, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA; 2, Plant Gene Expression Center/USDA-UC Berkeley, 800 Buchanan Street, Albany, California 94710, USA

Recognition of and response to pathogens. J. Dangl1 & J. D. G. Jones2 1, Biology Department, Coker Hall, University of North Carolina, Chapel Hill, North Carolina 27599, USA; 2, Sainsbury Laboratory, John Innes Centre, Colney Lane, Norwich NR4 7UJ, UK

Photomorphogenesis and photosynthesis. M. Chen & J. Chory Howard Hughes Medical Institute and Plant Biology Laboratory, The Salk Institute, 10010 North Torrey Pines Road, La Jolla, California 92037, USA

Metabolism. C. Somerville Carnegie Institution, 260 Panama Street, Stanford, California 94305, USA

NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com © 2000 Macmillan Magazines Ltd 815

Vol 436|11 August 2005|doi:10.1038/nature03895 ARTICLES

The map-based sequence of the rice genome

International Rice Genome Sequencing Project*

Rice, one of the world’s most important food plants, has important syntenic relationships with the other cereal species and is a model plant for the grasses. Here we present a map-based, finished quality sequence that covers 95% of the 389 Mb genome, including virtually all of the euchromatin and two complete centromeres. A total of 37,544 non- transposable-element-related protein-coding genes were identified, of which 71% had a putative homologue in Arabidopsis. In a reciprocal analysis, 90% of the Arabidopsis proteins had a putative homologue in the predicted rice proteome. Twenty-nine per cent of the 37,544 predicted genes appear in clustered gene families. The number and classes of transposable elements found in the rice genome are consistent with the expansion of syntenic regions in the maize and sorghum genomes. We find evidence for widespread and recurrent gene transfer from the organelles to the nuclear chromosomes. The map-based sequence has proven useful for the identification of genes underlying agronomic traits. The additional single-nucleotide polymorphisms and simple sequence repeats identified in our study should accelerate improvements in rice production.
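The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only: it uses the rounded values quoted above (389 Mb, 95%, 37,544 genes, 71% and 29%), and the function and key names are assumptions introduced here, so the results differ slightly from the exact counts reported in the body of the paper.

```python
def abstract_summary(genome_mb: float, coverage: float, genes: int,
                     frac_arabidopsis: float, frac_clustered: float) -> dict:
    """Derive absolute numbers from the rounded figures in the abstract."""
    sequenced_mb = genome_mb * coverage
    return {
        # Finished sequence implied by genome size and coverage (Mb).
        "sequenced_mb": sequenced_mb,
        # Average spacing of predicted genes over the finished sequence (kb).
        "kb_per_gene": sequenced_mb * 1_000 / genes,
        # Predicted genes with a putative Arabidopsis homologue.
        "genes_with_arabidopsis_homologue": round(genes * frac_arabidopsis),
        # Predicted genes that fall in clustered gene families.
        "genes_in_clustered_families": round(genes * frac_clustered),
    }


summary = abstract_summary(389, 0.95, 37_544, 0.71, 0.29)
for name, value in summary.items():
    if isinstance(value, float):
        print(f"{name}: {value:,.2f}")
    else:
        print(f"{name}: {value:,}")
```

With these rounded inputs the finished sequence comes out near 370 Mb and the gene spacing near one gene per 9.8 kb, close to the values stated in the text.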

Rice (Oryza sativa L.) is the most important food crop in the world and feeds over half of the global population. As the first step in a systematic and complete functional characterization of the rice genome, the International Rice Genome Sequencing Project (IRGSP) has generated and analysed a highly accurate finished sequence of the rice genome that is anchored to the genetic map.

Our analysis has revealed several salient features of the rice genome:
• We provide evidence for a genome size of 389 Mb. This size estimation is ~260 Mb larger than the fully sequenced dicot plant model Arabidopsis thaliana. We generated 370 Mb of finished sequence, representing 95% coverage of the genome and virtually all of the euchromatic regions.
• A total of 37,544 non-transposable-element-related protein-coding sequences were detected, compared with ~28,000–29,000 in Arabidopsis, with a lower gene density of one gene per 9.9 kb in rice. A total of 2,859 genes seem to be unique to rice and the other cereals, some of which might differentiate monocot and dicot lineages.
• Gene knockouts are useful tools for determining gene function and relating genes to phenotypes. We identified 11,487 Tos17 retrotransposon insertion sites, of which 3,243 are in genes.
• Between 0.38 and 0.43% of the nuclear genome contains organellar DNA fragments, representing repeated and ongoing transfer of organellar DNA to the nuclear genome.
• The transposon content of rice is at least 35% and is populated by representatives from all known transposon superfamilies.
• We have identified 80,127 polymorphic sites that distinguish between two cultivated rice subspecies, japonica and indica, resulting in a high-resolution genetic map for rice. Single-nucleotide polymorphism (SNP) frequency varies from 0.53 to 0.78%, which is 20 times the frequency observed between the Columbia and Landsberg erecta ecotypes of Arabidopsis.
• A comparison between the IRGSP genome sequence and the 6.3× indica and 6× japonica whole-genome shotgun sequence assemblies revealed that the draft sequences provided coverage of 69% by indica and 78% by japonica relative to the map-based sequence.

Rice has played a central role in human nutrition and culture for the past 10,000 years. It has been estimated that world rice production must increase by 30% over the next 20 years to meet projected demands from population increase and economic development1. Rice grown on the most productive irrigated land has achieved nearly maximum production with current strains1. Environmental degradation, including pollution, increase in night time temperature due to global warming2, reductions in suitable arable land, water, labour and energy-dependent fertilizer provide additional constraints. These factors make steps to maximize rice productivity particularly important. Increasing yield potential and yield stability will come from a combination of biotechnology and improved conventional breeding. Both will be dependent on a high-quality rice genome sequence.

Rice benefits from having the smallest genome of the major cereals, dense genetic maps and relative ease of genetic transformation3. The discovery of extensive genome colinearity among the Poaceae4 has established rice as the model organism for the cereal grasses. These properties, along with the finished sequence and other tools under development, set the stage for a complete functional characterization of the rice genome.

The International Rice Genome Sequencing Project

The IRGSP, formally established in 1998, pooled the resources of sequencing groups in ten nations to obtain a complete finished quality sequence of the rice genome (Oryza sativa L. ssp. japonica cv. Nipponbare). Finished quality sequence is defined as containing less than one error in 10,000 nucleotides, having resolved ambiguities, and having made all state-of-the-art attempts to close gaps. The IRGSP released a high-quality map-based draft sequence in

*Lists of participants and affiliations appear at the end of the paper. © 2005 Nature Publishing Group


December 2002. Three completely sequenced chromosomes have been published5–7, as well as two completely sequenced centromeres8–10. As the IRGSP subscribed to an immediate-release policy, high-quality map-based sequence has been public for some time. This has permitted rice geneticists to identify several genes underlying traits, and revealed very large and previously unknown segmental duplications that comprise 60% of the genome11–13. The public sequence has also revealed new details about the syntenic relationships and gene mobility between rice, maize and sorghum13–15.

Physical maps, sequencing and coverage

The IRGSP sequenced the genome of a single inbred cultivar, Oryza sativa ssp. japonica cv. Nipponbare, and adopted a hierarchical clone-by-clone method using bacterial and P1 artificial chromosome clones (BACs and PACs, respectively). This strategy used a high-density genetic map16, expressed-sequence tags (ESTs)17, yeast artificial chromosome (YAC)- and BAC-based physical maps18–20, BAC-end sequences21 and two draft sequences22,23. A total of 3,401 BAC/PAC clones (Table 1) were sequenced to approximately tenfold sequence coverage, assembled, ordered and finished to a sequence quality of less than one error per 10,000 bases. A majority of physical gaps in the BAC/PAC tiling path were bridged using a variety of substrates, including PCR fragments, 10-kb plasmids and 40-kb fosmid clones. A total of 62 unsequenced physical gaps, including nine centromere and 17 telomere gaps, remain on the 12 chromosomes (Table 2). Chromosome arm and telomere gaps were measured, and the nine centromere gaps were estimated on the basis of CentO satellite DNA content. The remaining gaps are estimated to total 18.1 Mb.

Ninety-seven percent of the BAC/PACs and gap sequences (3,360) have been submitted as finished quality in the PLN division of GenBank/DDBJ/EMBL. These and the remaining draft-sequenced clones were used to construct pseudomolecules representing the 12 chromosomes of rice (Fig. 1). The total nucleotide sequence of the 12 pseudomolecules is 370,733,456 bp, with an N-average continuous sequence length of 6.9 Mb (see Table 1 for a definition of N-average length). Sequence quality was assessed by comparing 1.2 Mb of overlapping sequence produced by different laboratories. The overall accuracy was calculated as 99.99% (Supplementary Table 2). The statistics of sequenced PAC/BAC clones and pseudomolecules for each chromosome are shown in Table 1.

The genome size of rice (O. sativa ssp. japonica cv. Nipponbare) was reported to have a haploid nuclear DNA content of 394 Mb on the basis of flow cytometry24, and 403 Mb on the basis of lengths of anchored BAC contigs and estimates of gap sizes20. Table 2 shows the calculated size for each chromosome and the estimated coverage. Adding the estimated length of the gaps to the sum of the non-overlapping sequence, the total length of the rice nuclear genome was calculated to be 388.8 Mb. Therefore, the pseudomolecules are expected to cover 95.3% of the entire genome and an estimated 98.9% of the euchromatin. An independent measure of genome coverage represented by the pseudomolecules was obtained by searching for unique EST markers19; of 8,440 ESTs, 8,391 (99.4%) were identified in the pseudomolecules.

Centromere location

Typical eukaryotic centromeres contain repetitive sequences, including satellite DNA at the centre and retrotransposons and transposons in the flanking regions. All rice centromeres contain the highly repetitive 155–165 bp CentO satellite DNA, together with centromere-specific retrotransposons25,26. The CentO satellites are located within the functional domain of the rice centromere10,26. Complete sequencing of the centromeres of rice chromosomes 4 and 8 revealed that they consist of 59 kb and 69 kb of clustered CentO repeats (respectively)8–10, tandemly arrayed head-to-tail within the clusters. Numerous retrotransposons, including the centromere-specific RIRE7, are found between and around the CentO repeats. CentO clusters show differences in length and orientation for the two centromeres.

BLASTN analysis of the pseudomolecules indicated that about 0.9 Mb of CentO repeats (corresponding to more than 5,800 copies of the satellite) were sequenced and found to be associated with centromere-specific retroelements. Locations of all CentO sequences correspond to genetically identified centromere regions (Supplementary Table 3). Our pseudomolecules cover the centromere regions on chromosomes 4, 5 and 8, and portions of the centromeres on the remaining chromosomes (Fig. 1).

Gene content, expression and distribution

We masked the pseudomolecules for repetitive sequences and used the ab initio gene finder FGENESH to identify only non-transposable-element-related genes. A total of 37,544 non-transposable-element protein-coding sequences were predicted, resulting in a density of one gene per 9.9 kb (Supplementary Tables 4 and 5). As the ability to identify unannotated and transposable-element-related genes improves, the true protein-coding gene number in rice will doubtless be revised.

Full-length complementary DNA sequences are available for rice27, and provide a powerful resource for improving gene model structure derived from ab initio gene finders28. Of the 37,544 non-transposable-element-related FGENESH models, 17,016 could be supported by a total of 25,636 full-length cDNAs (Supplementary Table 6). A total of 22,840 (61%) genes had a high identity match with a rice EST or full-length cDNA. On average, about 10.7 EST sequences were present for each expressed rice gene. A total of 2,927 genes aligned well with ESTs from other cereal species, and 330 of these genes matched only with a non-rice cereal EST (Supplementary Fig. 1). Except for the short arms of chromosomes 4, 9 and 10, which are known to be highly heterochromatic, the density of expressed genes is greater on the distal portions of the chromosome arms compared with the regions around the centromeres (Supplementary Fig. 2).

A total of 19,675 proteins had matches with entries in the Swiss-Prot database; of these, 4,500 had no expression support. Domain searches revealed a minimum of one motif or domain present in 63% of the predicted proteins, with a total of 3,328 different domains present in the predicted rice proteome. The five most abundant domains were associated with protein kinases (Supplementary Table 7). Fifty-one per cent of the predicted proteins could be associated with a biological process (Supplementary Fig. 3a), with metabolism (29.1%) and cellular physiological processes (11.9%) representing the two most abundant classes.

Approximately 71% (26,837) of the predicted rice proteins have a homologue in the Arabidopsis proteome (Supplementary Fig. 4). In a reciprocal search, 89.8% (26,004) of the proteins from the Arabidopsis genome have a homologue in the rice proteome. Of the 23,170 rice genes with rice EST, cereal EST, or full-length cDNA support, 20,311 (88%) have a homologue in Arabidopsis. Fewer putative homologues were found in other model species: 38.1% in Drosophila, 40.8% in human, 36.5% in , 30.2% in yeast, 17.6% in Synechocystis and 10.2% in Escherichia coli.

There are profound differences in plant architecture and biochemistry between monocotyledonous and dicotyledonous angiosperms. Only 2,859 rice genes with evidence of transcription lack homologues in the Arabidopsis genome. We investigated these to learn what functions they encoded. The vast majority had no matches, or most closely matched unknown or hypothetical proteins. The grasses have a class of seed storage proteins called prolamins that is not found in dicots. There are also families of hormone response proteins and defence proteins, such as proteinase inhibitors, chitinases, pathogenesis-related proteins and seed allergens, many of which are tandemly repeated (Supplementary Table 8). Nevertheless, with a large number of proteins of unknown function, the most interesting

NATURE|Vol 436|11 August 2005 ARTICLES differences between the genome content of these two groups of indica34. A single 5S cluster is present on the short arm of chromo- angiosperms remain to be discovered. some 11 in the vicinity of the centromere36, and encompasses Tos17 is an endogenous copia-like retrotransposon in rice that is 0.25 Mb. inactive under normal growth conditions. In , it A total of 763 transfer RNA genes, including 14 tRNA pseudogenes becomes activated, transposes and is stably inherited when the were detected in the 12 pseudomolecules. In comparison, a total of plant is regenerated29. There are only two copies of Tos17 in the 611 tRNA genes were detected in Arabidopsis32. Supplementary Fig. 7 rice cultivar Nipponbare. These features, together with its preferen- shows the distribution of these tRNA genes in each chromosome. tial insertion into gene-rich regions, make Tos17 uniquely suitable for Chromosome 4 has a single tRNA cluster6, and chromosome 10 has the functional analysis of rice genes by gene disruption. About 50,000 two large clusters derived from inserted chloroplast DNA7. Except for Tos17-insertion lines carrying 500,000 insertions have been pro- regions of intermediate density on chromosomes 1, 2, 8 and 12, there duced30. A total of 11,487 target loci were mapped on the 12 seem to be no other large clusters. pseudomolecules (Supplementary Fig. 5), with at least one insertion (miRNAs), a class of eukaryotic non-coding RNAs, detected in 3,243 genes. The density of Tos17 insertions is higher in are believed to regulate gene expression by interacting with the target euchromatic regions of the genome30, in contrast to the distribution messenger RNA37. miRNAs have been predicted from Arabidopsis38 of high-copy retrotransposons, which are more frequently found in and rice39, and we mapped 158 miRNAs onto the rice pseudomole- pericentromeric regions. A similar target site preference has been cules (Supplementary Table 10). 
Among other non-coding RNAs, we reported for T-DNA insertions in Arabidopsis31. identified 215 small nucleolar RNA (snoRNA) and 93 spliceosomal RNA genes, both showing biased chromosomal distributions, in the Tandem gene families rice genome (Supplementary Table 11). One surprising outcome of the Arabidopsis genome analysis was the large percentage (17%) of genes arranged in tandem repeats32. When Organellar insertions in the nuclear genome performing a similar analysis with rice, the percentage was compar- Mitochondria and chloroplasts originated from alpha-proteobac- able (14%). However, manual curation on rice chromosome 10 teria and cyanobacteria endosymbionts. A continuous transfer of showed one gene family encoding a glycine-rich protein with 27 organellar DNA to the nucleus has resulted in the presence of copies and one encoding a TRAF/BTB domain protein with 48 chloroplast and mitochondrial DNA inserted in the nuclear chromo- copies33. These tandemly repeated families are interrupted with somes. Although the endosymbionts probably contained genomes of other genes and are not included in strictly defined tandem repeats. several Mb at the time they were internalized, the organellar genomes We therefore screened for all tandemly arranged genes in 5-Mb diminished so that the present size of the mitochondrial genome is intervals. Using these criteria, 29% of the genes (10,837) are ampli- less than 600 kb, and that of the chloroplast is only 150 kb. Homology fied at least once in tandem, and 153 rice gene arrays contained 10– searches detected 421–453 chloroplast insertions and 909–1,191 134 members (Supplementary Fig. 6). Sixty five per cent of the mitochondrial insertions, depending upon the stringency adopted tandem arrays with over 27 members, and 33% of all the arrays with (Supplementary Fig. 8 and Supplementary Table 12). 
Thus, chlor- over 10 members, contain protein kinase domains (Supplementary oplast and mitochondrial insertions contribute 0.20–0.24% and Table 9). 0.18–0.19% of the nuclear genome of rice, respectively, and corre- spond to 5.3 chloroplast and 1.3 mitochondrial genome equivalents. Non-coding RNA genes The distribution of chloroplast and mitochondrial insertions over The nucleolar organizer, consisting of 17S–5.8S–25S ribosomal DNA the 12 chromosomes indicates that mitochondrial and chloroplast coding units, is found at the telomeric end of the short arm of transfers occurred independently. Two chromosomes harbour more chromosome 9 (ref. 34) in O. sativa ssp. japonica, and is estimated to insertions than the others (Supplementary Fig. 8 and Supplementary comprise 7 Mb (ref. 35). A second 17S–5.8S–25S rDNA locus is Table 12), with chromosome 12 containing nearly 1% mitochondrial found at the end of the short arm of chromosome 10 in O. sativa ssp. DNA and chromosome 10 containing approximately 0.8% chlor-

Table 1 | Classification and distribution of sequenced PAC and BAC clones* on the 12 rice chromosomes Chr Sequencing laboratory† PAC BAC OSJNBa/b OJ OSJNO Others‡ Total§ Pseudomolecule (bp) N-average lengthk (bp) Accession no. 1 RGP, KRGRP 251 77 42 23 4 0 397 43,260,640 9,688,259 AP008207 2 RGP, JIC 117 16 80 142 4 0 359 35,954,074 7,793,366 AP008208 3 ACWW, TIGR 1 8 263 47 1 10 330 36,189,985 5,196,992 AP008209 4 NCGR 2 7 275 7 0 0 291 35,489,479 1,427,419 AP008210 5 ASPGC 67 11 113 87 0 0 278 29,733,216 3,086,418 AP008211 6 RGP 169 20 78 14 0 0 281 30,731,386 8,669,608 AP008212 7 RGP 102 19 68 97 0 0 286 29,643,843 14,923,781 AP008213 8 RGP 113 23 56 83 2 0 277 28,434,680 14,872,702 AP008214 9 RGP, KRGRP, BIOTEC, BRIGI 72 24 72 50 5 0 223 22,692,709 5,219,517 AP008215 10 ACWW, TIGR, PGIR 1 5 172 6 0 21 205 22,683,701 2,124,647 AP008216 11 ACWW, TIGR, IIRGS, PGIR, Genoscope 10 6 236 3 2 1 258 28,357,783 1,087,274 AP008217 12 Genoscope 2 6 179 79 0 2 268 27,561,960 7,600,514 AP008218 Total 907 222 1634 638 18 34 3453 370,733,456 6,928,182 Chr, chromosome. *PAC, Rice Genome Research Program PAC; BAC, Rice Genome Research Program BAC; OSJNBa/b, Clemson University Genomics Institute BAC; OJ, BAC; OSJNO, Arizona Genomics Institute fosmid (http://www.genome.arizona.edu/orders/direct.html?library ¼ OSJNOa); Others, artificial gap-filling clones designated as OSJNA and OJA. †ACWW (Arizona Genomics Institute, Cold Spring Harbor Laboratory, Washington University Genome Sequencing Center, University of Wisconcin) Rice Genome Sequencing Consortium; ASPGC, Academia Sinica Plant Genome Center; BIOTEC, National Center for Genetic Engineering and Biotechnology; BRIGI, Brazilian Rice Genome Initiative; IIRGS, Indian Initiative for Rice Genome Sequencing; JIC, John Innes Centre; KRGRP, Korea Rice Genome Research Program; NCGR, National Center for Gene Research; PGIR, Plant Genome Initiative at Rutgers; RGP, Rice Genome Research Program; TIGR, The Institute for Genomic Research. 
‡Constructs derived by joining (mostly from the clone gap regions) sequence from PCR fragments, Monsanto or sequences and the neighbouring clone sequences. §A total of 2,494 BAC and 907 PAC clones were used for draft and finished sequencing. Monsanto draft-sequenced BACs underlie 638 finished clones. The Syngenta draft sequence contributed to the assemblies of 140 IRGSP clone sequences. Thirty-four sequence submissions are artificial constructs derived by joining a regional sequence (mostly from the clone gap regions) from PCR fragments, Monsanto or Syngenta sequences with the neighbouring clone sequences. This also includes 93 clones submitted as phase 1 or phase 2 to the HTG section of GenBank. kN-average length: the average length of a contiguous segment (without sequence or physical gaps) containing a randomly chosen nucleotide. 795 © 2005 Nature Publishing Group

ARTICLES NATURE|Vol 436|11 August 2005
oplast DNA. It is clear that several successive transfer events have occurred, as insertions of less than 10 kb have heterogeneous identities. The longest insertions, however, systematically show >98.5% identity to organellar DNA (Supplementary Table 13), indicating recent insertions for both chloroplast and mitochondrial genomes.

Transposable elements
The rice genome is populated by representatives from all known transposon superfamilies, including elements that cannot be easily classified into either class I or II (ref. 40). Previous estimates of the transposon content in the rice genome range from 10 to 25% (refs 21, 40). However, the increased availability of transposon query sequences and the use of profile hidden Markov models allow the identification of more divergent elements (ref. 41) and indicate that the transposon content of the O. sativa ssp. japonica genome is at least 35% (Table 3). Chromosomes 8 and 12 have the highest transposon content (38.0% and 38.3%, respectively), and chromosomes 1 (31.0%), 2 (29.8%) and 3 (29.0%) have the lowest proportion of transposons. Conversely, elements belonging to the IS5/Tourist and IS630/Tc1/mariner superfamilies, which are generally correlated with gene density, are prevalent on the first three chromosomes and least frequent on chromosomes 4 and 12.

Class II elements, characterized by terminal inverted-repeats and including the hAT, CACTA, IS256/Mutator, IS5/Tourist, and IS630/Tc1/mariner superfamilies, outnumber class I elements, which include long terminal-repeat (LTR) retrotransposons (Ty1/copia, Ty3/gypsy and TRIM) and non-LTR retrotransposons (LINEs and SINEs, or long- and short-interspersed nucleotide elements, respectively), by more than twofold (Table 3). However, the nucleotide contribution of class I is greater than that of class II, due mostly to the large size of LTR retrotransposons and the small size of IS5/Tourist and IS630/Tc1/mariner elements. The inverse is the case for maize, for which class I elements outnumber class II elements (ref. 42). Given their larger sizes, differential amplification of LTR elements in maize compared with rice is consistent with the genomic expansion found between orthologous regions of rice and maize (refs 15, 33).

Most class I elements are concentrated in gene-poor, heterochromatic regions such as the centromeric and pericentromeric regions (Supplementary Table 14). In contrast, members of some transposon superfamilies, including IS5/Tourist, IS630/Tc1/mariner and LINEs, have a significant positive correlation with both recombination rate and gene density. There is an effect of average element length associated with these patterns: short elements generally show a positive correlation with recombination rate and gene density, and are under-represented in the centromere regions, whereas larger elements have higher centromeric and pericentromeric abundance.

Intraspecific sequence polymorphism
Map-based cloning to identify genes that are associated with agronomic traits is dependent on having a high frequency of polymorphic markers to order recombination events. In rice, most of the segregating populations are generated from crosses between the two major subspecies of cultivated rice, Oryza sativa ssp. japonica and O. sativa ssp. indica. Although several studies on the polymorphisms detected between japonica and indica subspecies have been reported (refs 6, 43, 44), the analysis reported here uses an approach that ensures comparison of orthologous sequences. O. sativa ssp. indica cv. Kasalath and O. sativa ssp. japonica cv. Nipponbare are the parents of the most densely mapped rice population (ref. 16). BAC-end sequences were obtained from a Kasalath BAC library of 47,194 clones. Only high-quality, single-copy sequences were mapped to the Nipponbare pseudomolecules, and only paired inverted sequences that mapped within 200 kb were considered. A total of 26,632 paired Kasalath BAC-end sequences were mapped to the 12 rice pseudomolecules (Supplementary Table 15). Kasalath BAC clones spanned 308 Mb or 79% of the Nipponbare genome. Sequence alignments with a PHRED quality value of 30 covered 12,319,100 bp (3%) of the total rice genome. A total of 80,127 sites differed in the corresponding regions in Nipponbare and Kasalath. The frequency of SNPs varied between chromosomes (0.53–0.78%). Insertions and deletions were also detected. The ratio of small insertion/deletion site nucleotides (1–14 bases) against the alignment length (0.20–0.27%) was similar among the different chromosomes, and there was no preference for the direction of insertions or deletions. The main patterns of base substitutions observed between Nipponbare and Kasalath are shown in Supplementary Table 16. Transitions (70%) were the most prominent substitutions; this is a substantially higher fraction than found between Arabidopsis ecotypes Columbia and Landsberg erecta (ref. 32).

Class 1 simple sequence repeats in the rice genome
Class 1 simple sequence repeats (SSRs) are perfect repeats >20 nucleotides in length (ref. 45) that behave as hypervariable loci, providing a rich source of markers for use in genetics and breeding. A total of 18,828 Class 1 di-, tri- and tetra-nucleotide SSRs, representing 47 distinctive motif families, were identified and annotated on the rice genome (Supplementary Fig. 9). Supplementary Table 17 provides information about the physical positions of all Class 1 SSRs in relation to widely used restriction-fragment length polymorphisms (RFLPs; refs 16, 46) and previously published SSRs (ref. 45). There was an average of 51 hypervariable SSRs per Mb, with the highest density of markers occurring on chromosome 3 (55.8 SSR Mb⁻¹) and the lowest occurring on chromosome 4 (41.0 SSR Mb⁻¹). A summary of information about the Class 1 SSRs identified in the rice pseudomolecules appears

Table 2 | Size of each chromosome based on sequence data and estimated gaps

Chr   Sequenced     Gaps on arm regions    Telomeric     Centromeric   rDNA‡    Total     Coverage§   Coverage‖
      bases (bp)    No.    Length (Mb)     gaps* (Mb)    gap† (Mb)     (Mb)     (Mb)      (%)         (%)
1     43,260,640    5      0.33            0.06          1.40          –        45.05     99.1        96.0
2     35,954,074    3      0.10            0.01          0.72          –        36.78     99.7        97.7
3     36,189,985    4      0.96            0.04          0.18          –        37.37     97.3        96.8
4     35,489,479    3      0.46            0.20          –             –        36.15     98.7        98.2
5     29,733,216    6      0.22            0.05          –             –        30.00     99.3        99.1
6     30,731,386    1      0.02            0.03          0.82          –        31.60     99.8        97.2
7     29,643,843    1      0.31            0.01          0.32          –        30.28     98.9        97.9
8     28,434,680    1      0.09            0.05          –             –        28.57     99.7        99.5
9     22,692,709    4      0.13            0.14          0.62          6.95     30.53     98.8        74.3
10    22,683,701    4      0.68            0.13          0.47          –        23.96     96.6        94.7
11    28,357,783    4      0.21            0.04          1.90          0.25     30.76     99.1        92.2
12    27,561,960    0      0.00            0.05          0.16          –        27.77     99.8        99.2
All   370,733,456   36     3.51            0.81          6.59          7.20     388.82    98.9        95.3

*Estimated length including the telomeres, calculated with the average value of 3.2 kb for each chromosome (ref. 24).
†Estimated length of centromere-specific CentO repeats on each chromosome (ref. 26).
‡Represents the estimated length of the 17S–5.8S–25S rDNA cluster on Chr 9 (ref. 35) and the 5S cluster on Chr 11 (ref. 24).
§Coverage of the pseudomolecules for the euchromatic regions in each chromosome.
‖Coverage of the pseudomolecules over the full length of each chromosome.
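The arithmetic behind Table 2 can be checked with a few lines of Python. This sketch assumes, on the basis of the tabulated values rather than any statement in the paper, that the Total column is the sequenced bases plus all estimated gaps, and that the full-length coverage is the sequenced bases divided by that total. The function names are hypothetical; the numbers are copied from the Chr 1 and Chr 9 rows.

```python
def total_length_mb(sequenced_bp, gaps_mb):
    """Estimated chromosome length in Mb: sequenced bases plus estimated gaps."""
    return sequenced_bp / 1e6 + sum(gaps_mb)


def full_length_coverage_pct(sequenced_bp, gaps_mb):
    """Sequenced fraction of the estimated full chromosome length, in per cent."""
    return 100.0 * (sequenced_bp / 1e6) / total_length_mb(sequenced_bp, gaps_mb)


# Gap order: arm regions, telomeric, centromeric, rDNA (all in Mb, from Table 2).
rows = {
    "Chr 1": (43_260_640, [0.33, 0.06, 1.40, 0.0]),
    "Chr 9": (22_692_709, [0.13, 0.14, 0.62, 6.95]),
}

for name, (bases, gaps) in rows.items():
    total = round(total_length_mb(bases, gaps), 2)
    coverage = round(full_length_coverage_pct(bases, gaps), 1)
    print(name, total, coverage)
```

Under these assumptions the two rows reproduce the tabulated totals (45.05 and 30.53 Mb) and full-length coverages (96.0% and 74.3%). The low figure for chromosome 9 follows from its large estimated rDNA cluster.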


Figure 1 | Maps of the twelve rice chromosomes. For each chromosome (Chr 1–12), the genetic map is shown on the left and the PAC/BAC contigs on the right. The position of markers flanking the PAC/BAC contigs (green) is indicated on the genetic map. Physical gaps are shown in white and the nucleolar organizer on chromosome 9 is represented with a dotted green line. Constrictions in the genetic maps and arrowheads to the right of physical maps represent the chromosomal positions of centromeres for which rice CentO satellites are sequenced. The maps are scaled to genetic distances in centimorgans (cM) and the physical maps are depicted in relative physical lengths. Please refer to Table 2 for estimated lengths of the chromosomes.

in Supplementary Table 18. Several thousand of these SSRs have already been shown to amplify well and be polymorphic in a panel of diverse cultivars (ref. 45), and thus are of immediate use for genetic analysis.

Genome-wide comparison of draft versus finished sequences
Two whole-genome shotgun assemblies of draft-quality rice sequence have been published (refs 23, 47), and reassemblies of both have just appeared (ref. 48). One of these is an assembly of 6.28× coverage of O. sativa ssp. indica cv. 93-11. The second sequence is a ~6× coverage of O. sativa ssp. japonica cv. Nipponbare (refs 23, 48). These assemblies predict genome sizes of 433 Mb for japonica and 466 Mb for indica, which differ from our estimation of a 389 Mb japonica genome. Contigs from the whole-genome shotgun assembly of 93-11 and Nipponbare (ref. 48) were aligned with the IRGSP pseudomolecules. Non-redundant coverage of the pseudomolecules by the indica assembly varied from 78% for chromosome 3 to 59% for chromosome 12, with an overall coverage of 69% (Supplementary Table 19). When genes supported by full-length cDNA coverage were aligned to the covered regions, we found that 68.3% were completely covered by the indica sequences. The average size of the indica contigs is 8.2 kb, so it is not surprising that many did not completely cover the gene models defined here. The coverage of the Nipponbare whole-genome shotgun assembly varied from 68 to 82%, with an overall coverage of 78% of the genome and of 75.3% of the full-length cDNA-supported gene models.

We undertook a detailed comparison of the first Mb of these assemblies on 1S (the short arm of chromosome 1) with the IRGSP chromosome 1 (Supplementary Fig. 10 and Supplementary Table 20). The numbers from this comparison agree with the whole-genome comparison described above. In addition, we observed that a substantial portion of the contigs from each assembly were non-homologous, misaligned or provided duplicate coverage. Indeed, the whole-genome shotgun assembly differed by 0.05% base-pair mismatches for the two aligned regions from the same Nipponbare cultivar. The two assemblies were further examined for the presence of the CentO sequence (Supplementary Table 21). Sixty-eight per cent of the copies observed in the 93-11 assembly and 32% of the CentO-containing contigs in the whole-genome shotgun Nipponbare assembly were found outside the centromeric regions. In contrast, the CentO repeats were restricted to the centromeric regions in the IRGSP pseudomolecules. It is unlikely that there are dispersed centromeres in indica rice; misassembly of the whole-genome shotgun sequences is a more likely explanation for dispersed CentO repeats. These observations indicate that the draft sequences, although providing a useful preliminary survey of the genome, might not be adequate for gene annotation, functional genomics or the identification of genes underlying agronomic traits.

Concluding remarks
The attainment of a complete and accurate map-based sequence for rice is compelling. We now have a blueprint for all of the rice chromosomes. We know, with a high level of confidence, the distribution and location of all the main components: the genes, repetitive sequences and centromeres. Substantial portions of the map-based sequence have been in public databases for some time, and the availability of provisional rice pseudomolecules based on this sequence has provided the scientific community with numerous opportunities to evaluate the genome, as indicated by the number of publications in rice biology and genetics over the past few years. Furthermore, the wealth of SNP and SSR information provided here

and elsewhere will accelerate marker-assisted breeding and positional cloning, facilitating advances in rice improvement.

The syntenic relationships between rice and the cereal grasses have long been recognized (ref. 4). Comparing genome organization, genes and intergenic regions between cereal species will permit identification of regions that are highly conserved or rapidly evolving. Such regions are expected to yield crucial insights into genome evolution, speciation and domestication.

METHODS
Physical map and sequencing. Nine genomic libraries from Oryza sativa ssp. japonica cultivar Nipponbare were used to establish the physical map of rice chromosomes by polymerase chain reaction (PCR) screening (ref. 19), fingerprinting (ref. 20) and end-sequencing (ref. 21). The PAC, BAC and fosmid clones on the physical map were subjected to random shearing and shotgun sequencing to tenfold redundancy, using both universal primers and the dye-terminator or dye-primer methods. The sequences were assembled using PHRED (http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm) and PHRAP (http://www.genome.washington.edu/UWGC/analysistools/Phrap.cfm) software packages or using the TIGR Assembler (http://www.tigr.org/software/assembler/). Sequence gaps were resolved by full sequencing of gap-bridge clones, PCR fragments or direct sequencing of BACs. Sequence ambiguities (indicated by PHRAP scores less than 30) were resolved by confirming the sequence data using alternative chemistries or different polymerases. We empirically determined that a PHRAP score of 30 or above exceeds the standard of less than one error in 10,000 bp. BAC and PAC assemblies were tested for accuracy by comparing computationally derived fingerprint patterns with experimentally determined patterns of restriction enzyme digests. Sequence quality was also evaluated by comparing independently obtained overlapping sequences.

Small physical gaps were filled by long-range PCR. Remaining physical gaps were measured using fluorescence in situ hybridization analysis. We used the length of CentO arrays (ref. 26) to estimate the size of each of the remaining centromere gaps.

Annotation and bioinformatics. Gene models were predicted using FGENESH (http://www.softberry.com/berry.phtml?topic=fgenesh) using the monocot trained matrix on the native and repeat-masked pseudomolecules. Gene models with incomplete open reading frames, those encoding proteins of less than 50 amino acids, or those corresponding to organellar DNA were omitted from the final set. The coordinates of transposable elements, excluding MITEs (miniature inverted-repeat transposable elements), were used to mask the pseudomolecules. Conserved domain/motif searches and association with gene ontologies were performed using InterproScan (http://www.ebi.ac.uk/InterProScan/) in combination with the Interpro2Go program. For biological processes, the number of detected domains was re-calculated as number of non-redundant proteins.

The predicted rice proteome was searched using BLASTP against the proteomes of several model species for which a complete genome sequence and deduced protein set was available. Each rice chromosome was searched against the TIGR rice gene index (http://www.tigr.org/tdb/tgi/ogi/) and against gene index entries that aligned to gene models corresponding to expressed genes. In addition, five cereal gene indices (http://www.tigr.org/tdb/tgi/) were searched against the rice chromosomes, and gene index matches were recorded. We searched the Oryza sativa ssp. japonica cv. Nipponbare collection of full-length cDNAs (ftp://cdna01..affrc.go.jp/pub/data/), after first removing the transposable-element-related sequences, against the FGENESH models.

Gene models with rice full-length cDNA, EST or cereal EST matches but without identifiable homologues in the Arabidopsis genome were searched for conserved domains/motifs using InterproScan, and for homologues in the Swiss-Prot database (http://us.expasy.org/sprot/) using BLASTP. All proteins with positive matches were further compared with the nr database (http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases), using BLASTP to eliminate truncated proteins and those with matches to other dicots.

Tandem gene families. The rice genome was subjected to a BLASTP search as previously described (ref. 32). The search was also performed by permitting more than one unrelated gene within the arrays, and the limit of the search was set to 5-Mb intervals to exclude large chromosomal duplications.

Non-coding RNAs. Transfer-RNA genes were detected by the program tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/). The miRNA registry in the Rfam database (http://www.sanger.ac.uk/Software/Rfam/) was used as a reference database for miRNAs. In addition, experimentally validated miRNAs of other species, excluding Arabidopsis miRNAs, were used for BLASTN queries against the pseudomolecules. Spliceosomal and snoRNAs were retrieved from the Rfam database and used for queries. BLASTN was used to find the location of snoRNAs and spliceosomal RNAs in the pseudomolecules.

Organellar insertions. Oryza sativa ssp. japonica Nipponbare chloroplast (GenBank NC_001320) and mitochondrial (GenBank BA000029) sequences were aligned with the pseudomolecules using BLASTN and MUMmer (ref. 49).

Transposable elements. The TIGR Oryza Repeat Database, together with other published and unpublished rice transposable element sequences, was used to create RTEdb (a rice transposable element database; ref. 50) and determine transposable element coordinates on the rice pseudomolecules. In the case of hAT, IS256/Mutator, IS5/Tourist and IS630/Tc1/mariner elements, family-specific profile hidden Markov models were applied using HMMER (ref. 41; http://hmmer.wustl.edu/). The remaining superfamilies were annotated using RepeatMasker (http://www.repeatmasker.org/).

Tos17 insertions. Flanking sequences of transposed copies of 6,278 Tos17 insertion lines were isolated by modified thermal asymmetric interlaced (TAIL)-PCR and suppression PCR, and screened against the pseudomolecule sequences.

SNP discovery. BAC clones from an O. sativa ssp. indica var. Kasalath BAC library were end-sequenced. Sequence reads were omitted if they contained more than 50% nucleotides of low quality or high similarity to known repeats. The remaining sequences were subjected to BLASTN analysis against the pseudomolecules. Gaps within the alignments were classified as small insertions/deletions.

SSR loci. The Simple Sequence Repeat Identification Tool (http://www.gramene.org/) was used to identify simple sequence repeat motifs, and the physical position of all Class 1 SSRs was recorded. The copy number of SSR markers was estimated using electronic (e)-PCR to determine the number of independent hits of primer pairs on the pseudomolecules.

Whole-genome shotgun assembly analysis. Contigs from the BGI 6.28× whole genome assembly of O. sativa ssp. indica 93-11 (GenBank/DDBJ/EMBL accession number AAAA02000001–AAAA02050231) and the Syngenta 6× whole genome assembly of O. sativa ssp. japonica cv. Nipponbare (AACV01000001–AACV01035047; ref. 48) were aligned with the pseudomolecules using MUMmer (ref. 49). The number of IRGSP Nipponbare full-length cDNA-supported gene models completely covered by the aligned contigs was tabulated. The 155-bp CentO consensus sequence was used for BLAST analysis against the 93-11 and Nipponbare whole-genome shotgun contigs, and the coordinates of the positive hits recorded. Locations of centromeres for each indica chromosome were obtained with the CentO sequence positions on the IRGSP pseudomolecule of the corresponding chromosome. A detailed comparison of the BGI-assembled and -mapped Syngenta contigs (AACV01000001–AACV01000070) and the 93-11 contigs (AAAA02000001–AAAA02000093) was obtained by BLAST analysis against the IRGSP chromosome 1 pseudomolecule.

Detailed procedures for the analyses described above can be found in the Supplementary Information.

Table 3 | Transposons in the rice genome

                      Copy no. (×10³)   Coverage (kb)   Fraction of genome (%)
Class I
LINEs                 9.6               4161.3          1.12
SINEs                 1.8               209.9           0.06
Ty1/copia             11.6              14266.7         3.85
Ty3/gypsy             23.5              40363.3         10.90
Other class I         15.4              12733.3         3.43
Total class I         61.9              71734.4         19.35
Class II
hAT                   1.1               1405.9          0.38
CACTA                 10.8              9987.3          2.69
IS630/Tc1/mariner     67.0              8388.3          2.26
IS256/Mutator         8.8               13485.7         3.64
IS5/Tourist           57.9              12095.8         3.26
Other class II        18.2              2703.6          0.73
Total class II        163.8             48066.6         12.96
Other TEs             23.6              6797.7          1.80
Total TEs             249.3             129019.3*       34.79

TE, transposable element.
*Total length; corrected for 2420.7 kb in overlaps of multiple, non-nested elements.

Received 29 December 2004; accepted 25 May 2005.

1. Peng, S., Cassman, K. G., Virmani, S. S., Sheehy, J. & Khush, G. S. Yield potential trends of tropical rice since the release of IR8 and the challenge of increasing rice yield potential. Crop Sci. 39, 1552–1559 (1999).
2. Peng, S. et al. Rice yields decline with higher night temperature from global warming. Proc. Natl Acad. Sci. USA 101, 9971–9975 (2004).


3. Sasaki, T. & Burr, B. International Rice Genome Sequencing Project: the effort to completely sequence the rice genome. Curr. Opin. Plant Biol. 3, 138–141 (2000).
4. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Cereal genome evolution: Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
5. Sasaki, T. et al. The genome sequence and structure of rice chromosome 1. Nature 420, 312–316 (2002).
6. Feng, Q. et al. Sequence and analysis of rice chromosome 4. Nature 420, 316–320 (2002).
7. Rice Chromosome 10 Sequencing Consortium. In-depth view of structure, activity, and evolution of rice chromosome 10. Science 300, 1566–1569 (2003).
8. Wu, J. et al. Composition and structure of the centromeric region of rice chromosome 8. Plant Cell 16, 967–976 (2004).
9. Zhang, Y. et al. Structural features of the rice chromosome 4 centromere. Nucleic Acids Res. 32, 2023–2030 (2004).
10. Nagaki, K. et al. Sequencing of a rice centromere uncovers active genes. Nature Genet. 36, 138–145 (2004).
11. Guyot, R. & Keller, B. Ancestral genome duplication in rice. Genome 47, 610–614 (2004).
12. Simillion, C., Vandepoele, K., Saeys, Y. & Van de Peer, Y. Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res. 14, 1095–1106 (2004).
13. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
14. Salse, J., Piegu, B., Cooke, R. & Delseny, M. New in silico insight into the synteny between rice (Oryza sativa L.) and maize (Zea mays L.) highlights reshuffling and identifies new duplications in the rice genome. Plant J. 38, 396–409 (2004).
15. Lai, J. et al. Gene loss and movement in the maize genome. Genome Res. 14, 1924–1931 (2004).
16. Harushima, Y. et al. A high-density rice genetic linkage map with 2275 markers using a single F2 population. Genetics 148, 479–494 (1998).
17. Yamamoto, K. & Sasaki, T. Large-scale EST sequencing in rice. Plant Mol. Biol. 35, 135–144 (1997).
18. Saji, S. et al. A physical map with yeast artificial chromosome (YAC) clones covering 63% of the 12 rice chromosomes. Genome 44, 32–37 (2001).
19. Wu, J. et al. A comprehensive rice transcript map containing 6591 expressed sequence tag sites. Plant Cell 14, 525–535 (2002).
20. Chen, M. et al. An integrated physical and genetic map of the rice genome. Plant Cell 14, 537–545 (2002).
21. Mao, L. et al. Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res. 10, 982–990 (2000).
22. Barry, G. F. The use of the Monsanto draft rice genome sequence in research. Plant Physiol. 125, 1164–1165 (2001).
23. Goff, S. A. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100 (2002).
24. Ohmido, N., Kijima, K., Akiyama, Y., de Jong, J. H. & Fukui, K. Quantification of total genomic DNA and selected repetitive sequences reveals concurrent changes in different DNA families in indica and japonica rice. Mol. Gen. Genet. 263, 388–394 (2000).
25. Dong, F. et al. Rice (Oryza sativa) centromeric regions consist of complex DNA. Proc. Natl Acad. Sci. USA 95, 8135–8140 (1998).
26. Cheng, Z. et al. Functional rice centromeres are marked by a satellite repeat and a centromere-specific retrotransposon. Plant Cell 14, 1691–1704 (2002).
27. Kikuchi, S. et al. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301, 376–379 (2003).
28. Castelli, V. et al. Whole genome sequence comparisons and “full-length” cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res. 14, 406–413 (2004).
29. Hirochika, H., Sugimoto, K., Otsuki, Y., Tsugawa, H. & Kanda, M. Retrotransposons of rice involved in mutations induced by tissue culture. Proc. Natl Acad. Sci. USA 93, 7783–7788 (1996).
30. Miyao, A. et al. Target site specificity of the Tos17 retrotransposon shows a preference for insertion within genes and against insertion in retrotransposon-rich regions of the genome. Plant Cell 15, 1771–1780 (2003).
31. Alonso, J. M. et al. Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301, 653–657 (2003).
32. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
33. Song, R., Llaca, V. & Messing, J. Mosaic organization of orthologous sequences in grass genomes. Genome Res. 12, 1549–1555 (2002).
34. Shishido, R., Sano, Y. & Fukui, K. Ribosomal : an exception to the conservation of gene order in rice genomes. Mol. Gen. Genet. 263, 586–591 (2000).
35. Oono, K. & Sugiura, M. Heterogeneity of the ribosomal RNA gene clusters in rice. Chromosoma 76, 85–89 (1980).
36. Kamisugi, Y. et al. Physical mapping of the 5S ribosomal RNA genes on rice chromosome 11. Mol. Gen. Genet. 245, 133–138 (1994).
37. Bartel, D. P. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell 116, 281–297 (2004).
38. Wang, X. J., Reyes, J. L., Chua, N. H. & Gaasterland, T. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 5, R65 (2004).
39. Wang, J. F., Zhou, H., Chen, Y. Q., Luo, Q. J. & Qu, L. H. Identification of 20 microRNAs from Oryza sativa. Nucleic Acids Res. 32, 1688–1695 (2004).
40. Turcotte, K., Srinivasan, S. & Bureau, T. Survey of transposable elements from rice genomic sequences. Plant J. 25, 169–179 (2001).
41. Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
42. Messing, J. et al. Sequence composition and genome organization of maize. Proc. Natl Acad. Sci. USA 101, 14349–14354 (2004).
43. Shen, Y. J. et al. Development of genome-wide DNA polymorphism database for map-based cloning of rice genes. Plant Physiol. 135, 1198–1205 (2004).
44. Feltus, F. A. et al. An SNP resource for rice genetics and breeding based on subspecies indica and japonica genome alignments. Genome Res. 14, 1812–1819 (2004).
45. McCouch, S. R. et al. Development and mapping of 2240 new SSR markers for rice (Oryza sativa L.). DNA Res. 9, 257–279 (2002).
46. Causse, M. A. et al. Saturated molecular map of the rice genome based on an interspecific backcross population. Genetics 138, 1251–1274 (1994).
47. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002).
48. Yu, J. et al. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3, e38 (2005).
49. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
50. Juretic, N., Bureau, T. E. & Bruskiewich, R. M. Transposable element annotation of the rice genome. Bioinformatics 20, 155–160 (2004).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements Work at the RGP was supported by the Ministry of Agriculture, Forestry and Fisheries of Japan. Work at TIGR was supported by grants to C.R.B. from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative, the National Science Foundation and the US Department of Energy. Work at the NCGR was supported by the Chinese Ministry of Science and Technology, the Chinese Academy of Sciences, the Shanghai Municipal Commission of Science and Technology, and the National Natural Science Foundation of China. Work at Genoscope was supported by le Ministère de la Recherche, France. Funding for the work at the AGI and AGCoL was provided by grants to R.A.W. and C.S. from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative, the National Science Foundation, the US Department of Energy and the Rockefeller Foundation. Work at CSHL was supported by grants from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative and from the National Science Foundation. Work at the ASPGC was supported by Academia Sinica, National Science Council, Council of Agriculture, and Institute of , Academia Sinica. The IIRGS acknowledges the Department of Biotechnology, Government of India, for financial assistance and the Indian Council of Agricultural Research, New Delhi, for support. Work at Rice Gene Discovery was supported by BIOTECH and the Princess Sirindhorn’s Plant Germplasm Conservation Initiative Program. Work at PGIR was supported by . The BRIGI was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Financiadora de Estudos e Projetos - Ministério de Ciência e Tecnologia (FINEP-MCT), Fundação de Amparo a Pesquisa do Rio Grande do Sul (FAPERGS) and Universidade Federal de Pelotas (UFPel). Work at McGill and York Universities was supported by the National Science and Engineering Research Council of Canada and the Canadian International Development Agency. Funding for H.H. at the National Institute of Agrobiological Sciences was from the Ministry of Agriculture, Forestry, and Fisheries of Japan, and the Program for Promotion of Basic Research Activities for Innovative Biosciences. Funding at Brookhaven National Laboratory was from The Rockefeller Foundation and the Office of Basic Energy Science of the United States Department of Energy. We would like to thank G. Barry and S. Goff for their help in negotiating agreements that permitted the sharing of materials and sequence with the IRGSP. We also acknowledge the work of G. Barry, S. Goff and their colleagues in facilitating the transfer of sequence information and supporting data.

Author Information The genomic sequence is available under accession numbers AP008207–AP008218 in international databases (DDBJ, GenBank and EMBL). Reprints and permissions information is available at npg.nature.com/reprintsandpermissions. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to Takuji Sasaki ([email protected]).



International Rice Genome Sequencing Project (Participants are arranged by area of contribution and then by institution.) Physical Maps and Sequencing: Rice Genome Research Program (RGP) Takashi Matsumoto1, Jianzhong Wu1, Hiroyuki Kanamori1, Yuichi Katayose1, Masaki Fujisawa1, Nobukazu Namiki1, Hiroshi Mizuno1, Kimiko Yamamoto1, Baltazar A. Antonio1, Tomoya Baba1, Katsumi Sakata1, Yoshiaki Nagamura1, Hiroyoshi Aoki1, Koji Arikawa1, Kohei Arita1, Takahito Bito1, Yoshino Chiden1, Nahoko Fujitsuka1, Rie Fukunaka1, Masao Hamada1, Chizuko Harada1, Akiko Hayashi1, Saori Hijishita1, Mikiko Honda1, Satomi Hosokawa1, Yoko Ichikawa1, Atsuko Idonuma1, Masumi Iijima1, Michiko Ikeda1, Maiko Ikeno1, Kazue Ito1, Sachie Ito1, Tomoko Ito1, Yuichi Ito1, Yukiyo Ito1, Aki Iwabuchi1, Kozue Kamiya1, Wataru Karasawa1, Kanako Kurita1, Satoshi Katagiri1, Ari Kikuta1, Harumi Kobayashi1, Noriko Kobayashi1, Kayo Machita1, Tomoko Maehara1, Masatoshi Masukawa1, Tatsumi Mizubayashi1, Yoshiyuki Mukai1, Hideki Nagasaki1, Yuko Nagata1, Shinji Naito1, Marina Nakashima1, Yuko Nakama1, Yumi Nakamichi1, Mari Nakamura1, Ayano Meguro1, Manami Negishi1, Isamu Ohta1, Tomoya Ohta1, Masako Okamoto1, Nozomi Ono1, Shoko Saji1, Miyuki Sakaguchi1, Kumiko Sakai1, Michie Shibata1, Takanori Shimokawa1, Jianyu Song1, Yuka Takazaki1, Kimihiro Terasawa1, Mika Tsugane1, Kumiko Tsuji1, Shigenori Ueda1, Kazunori Waki1, Harumi Yamagata1, Mayu Yamamoto1, Shinichi Yamamoto1, Hiroko Yamane1, Shoji Yoshiki1, Rie Yoshihara1, Kazuko Yukawa1, Huisun Zhong1, Masahiro Yano1, Takuji Sasaki (Principal Investigator)1; The Institute for Genomic Research (TIGR) Qiaoping Yuan2, Shu Ouyang2, Jia Liu2, Kristine M. 
Jones2, Kristen Gansberger2, Kelly Moffat2, Jessica Hill2, Jayati Bera2, Douglas Fadrosh2, Shaohua Jin2, Shivani Johri2, Mary Kim2, Larry Overton2, Matthew Reardon2, Tamara Tsitrin2, Hue Vuong2, Bruce Weaver2, Anne Ciecko2, Luke Tallon2, Jacqueline Jackson2, Grace Pai2, Susan Van Aken2, Terry Utterback2, Steve Reidmuller2, Tamara Feldblyum2, Joseph Hsiao2, Victoria Zismann2, Stacey Iobst2, Aymeric R. de Vazeille2, C. Robin Buell (Principal Investigator)2; National Center for Gene Research Chinese Academy of Sciences (NCGR) Kai Ying3, Ying Li3, Tingting Lu3, Yuchen Huang3, Qiang Zhao3, Qi Feng3, Lei Zhang3, Jingjie Zhu3, Qijun Weng3, Jie Mu3, Yiqi Lu3, Danlin Fan3, Yilei Liu3, Jianping Guan3, Yujun Zhang3, Shuliang Yu3, Xiaohui Liu3, Yu Zhang3, Guofan Hong3, Bin Han (Principal Investigator)3; Genoscope Nathalie Choisne4, Nadia Demange4, Gisela Orjeda4, Sylvie Samain4, Laurence Cattolico4, Eric Pelletier4, Arnaud Couloux4, Beatrice Segurens4, Patrick Wincker4, Angelique D’Hont5, Claude Scarpelli4, Jean Weissenbach4, Marcel Salanoubat4, Francis Quetier (Principal Investigator)4; Arizona Genomics Institute (AGI) and Arizona Genomics Computational Laboratory (AGCol) Yeisoo Yu6, Hye Ran Kim6, Teri Rambo6, Jennifer Currie6, Kristi Collura6, Meizhong Luo6, Tae-Jin Yang6, Jetty S. S. Ammiraju6, Friedrich Engler6, Carol Soderlund6, Rod A. Wing (Principal Investigator)6; Cold Spring Harbor Laboratory (CSHL) Lance E. Palmer7, Melissa de la Bastide7, Lori Spiegel7, Lidia Nascimento7, Theresa Zutavern7, Andrew O’Shaughnessy7, Sujit Dike7, Neilay Dedhia7, Raymond Preston7, Vivekanand Balija7, W. 
Richard McCombie (Principal Investigator)7; Academia Sinica Plant Genome Center (ASPGC) Teh-Yuan Chow8, Hong-Hwa Chen9, Mei-Chu Chung8, Ching-San Chen8, Jei-Fu Shaw8, Hong-Pang Wu8, Kwang-Jen Hsiao10, Ya-Ting Chao8, Mu-kuei Chu8, Chia-Hsiung Cheng8, Ai-Ling Hour8, Pei-Fang Lee8, Shu-Jen Lin8, Yao-Cheng Lin8, John-Yu Liou8, Shu-Mei Liu8, Yue-Ie Hsing (Principal Investigator)8; Indian Initiative for Rice Genome Sequencing (IIRGS), University of Delhi South Campus (UDSC) S. Raghuvanshi11, A. Mohanty11, A. K. Bharti11,13, A. Gaur11, V. Gupta11, D. Kumar11, V. Ravi11, S. Vij11, A. Kapur11, Parul Khurana11, Paramjit Khurana11, J. P. Khurana11, A. K. Tyagi (Principal Investigator)11; Indian Initiative for Rice Genome Sequencing (IIRGS), Indian Agricultural Research Institute (IARI) K. Gaikwad12, A. Singh12, V. Dalal12, S. Srivastava12, A. Dixit12, A. K. Pal12, I. A. Ghazi12, M. Yadav12, A. Pandit12, A. Bhargava12, K. Sureshbabu12, K. Batra12, T. R. Sharma12, T. Mohapatra12, N. K. Singh (Principal Investigator)12; Plant Genome Initiative at Rutgers (PGIR) Joachim Messing (Principal Investigator)13, Amy Bronzino Nelson13, Galina Fuks13, Steve Kavchok13, Gladys Keizer13, Eric Linton13, Victor Llaca13, Rentao Song13, Bahattin Tanyolac13, Steve Young13; Korea Rice Genome Research Program (KRGRP) Kim Ho-Il14, Jang Ho Hahn (Principal Investigator)14; National Center for Genetic Engineering and Biotechnology (BIOTEC) G. Sangsakoo15, A. Vanavichit (Principal Investigator)15; Brazilian Rice Genome Initiative (BRIGI) Luiz Anderson Teixeira de Mattos16, Paulo Dejalma Zimmer16, Gaspar Malone16, Odir Dellagostin16, Antonio Costa de Oliveira (Principal Investigator)16; John Innes Centre (JIC) Michael Bevan17, Ian Bancroft17; Washington University School of Medicine Genome Sequencing Center Pat Minx18, Cordum18, Richard Wilson18; University of Wisconsin–Madison Zhukuan Cheng19, Weiwei Jin19, Jiming Jiang19, Sally Ann Leong20

Annotation and Analysis: Hisakazu Iwama21, Takashi Gojobori21,22, Takeshi Itoh22,23, Yoshihito Niimura24, Yasuyuki Fujii25, Takuya Habara25, Hiroaki Sakai23,25, Yoshiharu Sato22, Greg Wilson26, Kiran Kumar27, Susan McCouch26, Nikoleta Juretic28, Douglas Hoen28, Stephen Wright29, Richard Bruskiewich30, Thomas Bureau28, Akio Miyao23, Hirohiko Hirochika23, Tomotaro Nishikawa23, Koh-ichi Kadowaki23 & Masahiro Sugiura31

Coordination: Benjamin Burr32

Affiliations for participants: 1National Institute of Agrobiological Sciences/Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, 2-1-2 Kannondai, Tsukuba, Ibaraki 305-8602, Japan. 2The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA. 3Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (CAS), 500 Caobao Road, Shanghai 200233, China. 4Centre National de Séquençage, INRA-URGV, and CNRS UMR-8030, 2, rue Gaston Crémieux, CP 5706, 91057 EVRY Cedex, France. 5UMR PIA, Cirad-Amis, TA40-03 avenue Agropolis, 34398 Montpellier Cedex 05, France. 6Department of Plant Sciences, BIO5 Institute, The University of Arizona, Tucson, Arizona 85721, USA. 7Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11723, USA. 8Institute of Botany, Academia Sinica, 128, Sec. 2, Yen-Chiu-Yuan Rd, Nankang, Taipei 11529, Taiwan. 9National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan. 10National Yang-Ming University, 155, Sec. 2, Li-Nong St, Peitou, Taipei 112, Taiwan. 11Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110021, India. 12National Research Centre on Plant Biotechnology, Indian Agricultural Research Institute, New Delhi 110012, India. 13Waksman Institute, Rutgers University, Piscataway, New Jersey 08854, USA. 14National Institute of Agricultural Science and Technology, RDA, Suwon, 441-707 Republic of Korea. 15Rice Gene Discovery Unit, Kasetsart University, Nakron Pathom 73140, Thailand. 16Centro de Genomica e Fitomelhoramento, UFPel, Pelotas, RS, l 96001-970, Brazil. 17John Innes Centre, Norwich Research Park, Colney, Norwich NR4 7UH, UK. 18Washington University Genome Sequencing Center, 3333 Forest Park Boulevard, St. Louis, Missouri 63108, USA. 19University of Wisconsin, Department of Horticulture, Madison, Wisconsin 53706, USA. 20University of Wisconsin, Department of , Madison, Wisconsin 53706, USA.
21Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan. 22Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan. 23National Institute of Agrobiological Sciences, Tsukuba, Ibaraki 305-8602, Japan. 24Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-8510, Japan. 25Japan Biological Information Research Center, Japan Biological Informatics Consortium, Koto-ku, Tokyo 135-0064, Japan. 26Plant Breeding Dept, Cornell University, Ithaca, New York 14850-1901, USA. 27Cold Spring Harbor Laboratory, PO Box 100, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA. 28Department of Biology, McGill University, 1205 Dr Penfield Avenue, Montreal, Quebec H3A 1B1, Canada. 29Department of Biology, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada. 30Biometrics and Bioinformatics Unit, International Rice Research Institute, DAPO Box 7777, Metro Manila, Philippines. 31Graduate School of Natural Sciences, Nagoya City University, Nagoya 467-8501, Japan. 32Biology Department, Brookhaven National Laboratory, Upton, New York 11973, USA.

800 © 2005 Nature Publishing Group RESEARCH ARTICLES

1596 15 SEPTEMBER 2006 VOL 313 SCIENCE www.sciencemag.org

The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray)

G. A. Tuskan,1,3* S. DiFazio,1,4† S. Jansson,5† J. Bohlmann,6† I. Grigoriev,9† U. Hellsten,9† N. Putnam,9† S. Ralph,6† S. Rombauts,10† A. Salamov,9† J. Schein,11† L. Sterck,10† A. Aerts,9 R. R. Bhalerao,5 R. P. Bhalerao,12 D. Blaudez,13 W. Boerjan,10 A. Brun,13 A. Brunner,14 V. Busov,15 M. Campbell,16 J. Carlson,17 M. Chalot,13 J. Chapman,9 G.-L. Chen,2 D. Cooper,6 P. M. Coutinho,19 J. Couturier,13 S. Covert,20 Q. Cronk,7 R. Cunningham,1 J. Davis,22 S. Degroeve,10 A. Déjardin,23 C. dePamphilis,18 J. Detter,9 B. Dirks,24 I. Dubchak,9,25 S. Duplessis,13 J. Ehlting,7 B. Ellis,6 K. Gendler,26 D. Goodstein,9 M. Gribskov,27 J. Grimwood,28 A. Groover,29 L. Gunter,1 B. Hamberger,7 B. Heinze,30 Y. Helariutta,12,31,33 B. Henrissat,19 D. Holligan,21 R. Holt,11 W. Huang,9 N. Islam-Faridi,34 S. Jones,11 M. Jones-Rhoades,35 R. Jorgensen,26 C. Joshi,15 J. Kangasjärvi,32 J. Karlsson,5 C. Kelleher,6 R. Kirkpatrick,11 M. Kirst,22 A. Kohler,13 U. Kalluri,1 F. Larimer,2 J. Leebens-Mack,21 J.-C. Leplé,23 P. Locascio,2 Y. Lou,9 S. Lucas,9 F. Martin,13 B. Montanini,13 C. Napoli,26 D. R. Nelson,36 C. Nelson,37 K. Nieminen,31 O. Nilsson,12 V. Pereda,13 G. Peter,22 R. Philippe,6 G. Pilate,23 A. Poliakov,25 J. Razumovskaya,2 P. Richardson,9 C. Rinaldi,13 K. Ritland,8 P. Rouzé,10 D. Ryaboy,25 J. Schmutz,28 J. Schrader,38 B. Segerman,5 H. Shin,11 A. Siddiqui,11 F. Sterky,39 A. Terry,9 C.-J. Tsai,15 E. Uberbacher,2 P. Unneberg,39 J. Vahala,32 K. Wall,18 S. Wessler,21 G. Yang,21 T. Yin,1 C. Douglas,7‡ M. Marra,11‡ G. Sandberg,12‡ Y. Van de Peer,10‡ D. Rokhsar9,24‡

1Environmental Sciences Division, 2Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA. 3Plant Sciences Department, University of Tennessee, TN 37996, USA. 4Department of Biology, West Virginia University, Morgantown, WV 26506, USA. 5Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE-901 87, Umeå, Sweden. 6Laboratories, 7Department of Botany, 8Department of Forest Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. 9U.S. Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA. 10Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Ghent University, B-9052 Ghent, Belgium. 11Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, BC V5Z 4S6, Canada. 12Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden. 13Tree-Microbe Interactions Unit, Institut National de la Recherche Agronomique (INRA)–Université Henri Poincaré, INRA-Nancy, 54280 Champenoux, France. 14Department of Forestry, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA. 15Biotechnology Research Center, School of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI 49931, USA. 16Department of Cell and Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, Ontario, M5S 3B2 Canada. 17School of Forest Resources and Huck Institutes of the Life Sciences, 18Department of Biology, Institute of Molecular Evolutionary Genetics, and Huck Institutes of Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA. 19Architecture et Fonction des Macromolécules Biologiques, UMR6098, CNRS and Universities of Aix-Marseille I and II, case 932, 163 avenue de Luminy, 13288 Marseille, France. 20Warnell School of Forest Resources, 21Department of Plant Biology, University of Georgia, Athens, GA 30602, USA. 22School of Forest Resources and Conservation, Genetics Institute, and Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32611, USA. 23INRA-Orléans, Unit of Forest Improvement, Genetics and Physiology, 45166 Olivet Cedex, France. 24Center for Integrative Genomics, University of California, Berkeley, CA 94720, USA. 25Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. 26Department of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA. 27Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA. 28The Stanford Center and the Department of Genetics, Stanford University School of Medicine, Palo Alto, CA 94305, USA. 29Institute of Forest Genetics, United States Department of Agriculture, Forest Service, Davis, CA 95616, USA. 30Federal Research Centre for Forests, Hauptstrasse 7, A-1140 Vienna, Austria. 31Plant Molecular Biology Laboratory, Institute of Biotechnology, 32Department of Biological and Environmental Sciences, University of Helsinki, FI-00014 Helsinki, Finland. 33Department of Biology, 200014, University of Turku, FI-20014 Turku, Finland. 34Southern Institute of Forest Genetics, United States Department of Agriculture, Forest Service and Department of Forest Science, Texas A&M University, College Station, TX 77843, USA. 35Whitehead Institute for Biomedical Research and Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142, USA. 36Department of Molecular Sciences and Center of Excellence in Genomics and Bioinformatics, University of Tennessee, Memphis, TN 38163, USA. 37Southern Institute of Forest Genetics, United States Department of Agriculture, Forest Service, Saucier, MS 39574, USA. 38Developmental Genetics, University of Tübingen, D-72076 Tübingen, Germany. 39Department of Biotechnology, KTH, AlbaNova University Center, SE-106 91 Stockholm, Sweden.

*To whom correspondence should be addressed. E-mail: [email protected]
†These authors contributed equally to this work as second authors.
‡These authors contributed equally to this work as senior authors.

We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.

Forests cover 30% (about 3.8 billion ha) of Earth's terrestrial surface, harbor substantial biodiversity, and provide humanity with benefits such as clean air and water, lumber, fiber, and fuels. Worldwide, one-quarter of all industrial feedstocks have their origins in forest-based resources (1). Large and long-lived forest trees grow in extensive wild populations across continents, and they have evolved under selective pressures unlike those of annual herbaceous plants. Their growth and development involves extensive secondary growth, coordinated signaling and distribution of water and nutrients over great distances, and strategic storage and redistribution of metabolites in concordance with interannual climatic cycles. Their need to survive and thrive in fixed locations over centuries under continually changing physical and biotic stresses also sets them apart from short-lived plants. Many of the features that distinguish trees from other organisms, especially their large sizes and long-generation times, present challenges to the study of the cellular and molecular mechanisms that underlie their unique biology. To enable and facilitate such investigations in a relatively well-studied model tree, we describe here the draft genome of black cottonwood, Populus trichocarpa (Torr. & Gray), and compare it to other sequenced plant genomes.

P. trichocarpa was selected as the model forest species for genome sequencing not only because of its modest genome size but also because of its rapid growth, relative ease of experimental manipulation, and range of available genetic tools (2, 3). The genus is phenotypically diverse, and interspecific hybrids facilitate the genetic mapping of economically important traits related to growth rate, stature, wood properties, and paper quality. Dozens of quantitative trait loci have already been mapped (4), and methods of genetic transformation have been developed (5). Under appropriate conditions, Populus can reach reproductive maturity in as few as 4 to 6 years, permitting selective breeding for large-scale sustainable plantation forestry. Finally, rapid growth of trees coupled with thermochemical or biochemical conversion of the lignocellulosic portion of the plant has the potential to provide a renewable energy resource with a concomitant reduction of greenhouse gases (6–8).

Sequencing and Assembly

A single female genotype, "Nisqually 1," was selected and used in a whole-genome shotgun sequence and assembly strategy (9). Roughly 7.6 million end-reads representing 4.2 billion high-quality (i.e., Q20 or higher) base pairs were assembled into 2447 major scaffolds containing an estimated 410 megabases (Mb) of genomic DNA (tables S1 and S2). On the basis of the depth of coverage of major scaffolds (~7.5 depth) and the total amount of nonorganellar shotgun sequence that was generated, the Populus genome size was estimated to be 485 ± 10 Mb (±SD), in rough agreement with previous cytogenetic estimates of about 550 Mb (10). The near completeness of the shotgun assembly in protein-coding regions is supported by the identification of more than 95% of known Populus cDNA in the assembly.

The ~75 Mb of unassembled genomic sequence is consistent with cytogenetic evidence that ~30% of the genome is heterochromatic (9). The amount of euchromatin contained within the Populus genome was estimated in parallel by subtraction on the basis of direct measurements of 4′,6′-diamidino-2-phenylindole–stained prophase and metaphase chromosomes (fig. S4). On average, 69.5 ± 0.3% of the genome consisted of euchromatin, with a significantly lower proportion of euchromatin in linkage group I (LGI) (66.4 ± 1.1%) compared with the other 18 chromosomes (69.7 ± 0.03%, P ≤ 0.05). In contrast, Arabidopsis chromosomes contain roughly 93% euchromatin (11). The unassembled shotgun sequences were derived from variants of organellar DNA, including recent nuclear translocations; highly repetitive genomic DNA; haplotypic segments that were redundant with short subsegments of the major scaffolds (separated as a result of extensive sequence polymorphism and allelic variants); and contaminants of the template DNA, such as endophytic microbes inhabiting the leaf and tissues used for template preparation (12) (fig. S1 and table S3). The end-reads corresponding to chloroplast (fig. S5) and mitochondrial genomes were assembled into circular genomes of 157 and 803 kb, respectively (9).

We anchored the 410 Mb of assembled scaffolds to a sequence-tagged genetic map (fig. S3). In total, 356 microsatellite markers were used to assign 155 scaffolds (335 Mb of sequence) to the 19 P. trichocarpa chromosome-scale linkage groups (13). The vast majority (91%) of the mapped microsatellite markers were colinear with the sequence assembly. At the extremes, the smallest chromosome, LGIX [79 centimorgans (cM)], is covered by two scaffolds containing 12.5 Mb of assembled sequence, whereas the largest chromosome, LGI (265 cM), contains 21 scaffolds representing 35.5 Mb (fig. S3). We also generated a physical map based on bacterial artificial chromosome (BAC) fingerprint contigs using a Nisqually-1 BAC library representing an estimated 9.5-fold genome coverage (fig. S2). Paired BAC-end sequences from most of the physical map were linked to the large-scale assembly, permitting 2460 of the physical map contigs to be positioned on the genome assembly. Combining the genetic and physical map, nearly 385 Mb of the 410 Mb of assembled sequence are placed on a linkage group.

Unlike Arabidopsis, where predominantly self-fertilizing ecotypes maintain low levels of allelic polymorphism, Populus species are predominantly dioecious, which results in obligate outcrossing. This compulsory outcrossing, along with wind pollination and wind-dispersed plumose seeds, results in high levels of gene flow and high levels of heterozygosity (that is, within individual genetic polymorphisms). Within the heterozygous Nisqually-1 genome, we identified 1,241,251 single-nucleotide polymorphisms (SNPs) or small insertion/deletion polymorphisms (indels) for an overall rate of approximately 2.6 polymorphisms per kilobase. Of these polymorphisms, the overwhelming majority (83%) occurred in noncoding portions of the genome (Table 1). Short indels and SNPs within exons resulted in frameshifts and nonsense stop codons within predicted exons, respectively, suggesting that null alleles of these genes exist in one of the haplotypes. Some of the polymorphisms may be artifacts from the assembly process, although these errors were minimized by using stringent criteria for SNP identification (9).

Table 1. Characterization of polymorphisms according to their positions relative to predicted coding sequences, introns, and untranslated regions (UTRs). Rate shows the percentage of potential sites of each class that were polymorphic. Most indels within exons resulted in frame shifts, but we could not quantify this due to difficulties with assembly and sequencing of regions containing indels. Nonsense mutations created stop codons within predicted exons.

Source            Number of loci   Rate (%)
Noncoding              1,027,322       0.32
Intron                   141,199       0.25
3′UTR                      6,731       0.25
5′UTR                      3,306       0.24
Exon                      62,656       0.14
Within exons:
  Indels                   2,722       0.01
  Nonsense                   926       0.02
  Nonsynonymous           32,207       0.10

Gene Annotation

We tentatively identified a first-draft reference set of 45,555 protein-coding gene loci in the Populus nuclear genome (www.jgi.doe.gov/poplar) using a variety of ab initio, homology-based, and expressed sequence tag (EST)–based methods (14–17) (table S5). Similarly, 101 and 52 genes were annotated in the chloroplast and mitochondrial genomes, respectively (9). To aid the annotation process, 4664 full-length sequences, from full-length enriched cDNA libraries from Nisqually 1, were generated and used in training the gene-calling algorithms. Before gene prediction, repetitive sequences were characterized (fig. S15 and table S14) and masked; additional putative transposable elements were identified and subsequently removed from the reference gene set (9). Given the current draft nature of the genome, we expect that the gene set in Populus will continue to be refined.

Fig. 1. Whole-genome oligonucleotide microarray expression data for all predicted gene models in P. trichocarpa. Values represent the proportion of genes expressed above negative controls at a 5% false discovery rate. The x axis represents the subsets of predicted genes that were analyzed for the annotated and promoted P. trichocarpa gene set (42,373 genes), chloroplast gene set (49 genes), mitochondria gene set (49 genes), annotated, nonpromoted gene set (10,875 genes), and microRNAs (48 miRNAs).

About 89% of the predicted gene models had homology [expectation (E) value ≤ 1 × 10^-8] to the nonredundant (NR) set of proteins from the National Center for Biotechnology Information, including 60% with extensive homology that spans 75% of both model and NR protein lengths. Nearly 12% (5248) of the predicted Populus genes had no detectable similarity to Arabidopsis genes (E value ≤ 1 × 10^-3); conversely, in the more refined Arabidopsis set, only 9% (2321) of the predicted genes had no similarity to the Populus reference set. Of the 5248 Populus genes without Arabidopsis similarity, 1883 have expression evidence from the manually curated Populus EST data set, and of these, 274 have no hits (E value ≥ 1 × 10^-3) to the NR database (9). Whole-genome oligonucleotide microarray analysis provided evidence of tissue-based expression for 53% of the reference gene models (Fig. 1). In addition, a signal was detected from 20% of genes that were initially annotated and excluded from the reference set, suggesting that as many as 4000 additional genes (or gene fragments) may be present. Within the reference gene set, we identified 13,019 pairs of orthologs between
genes in Populus and Arabidopsis using the best bidirectional Basic Local Alignment Search Tool (BLAST) hits, with average mutual coverage of these alignments equal to 93%; 11,654 pairs of orthologs had greater than 90% alignment of gene lengths, whereas only 156 genes had less than 50% coverage. As of 1 June 2006, ~10% (4378) gene models have been manually validated and curated.

Genome Organization

Genome duplication in the Salicaceae. Populus and Arabidopsis lineages diverged about 100 to 120 million years ago (Ma). Analysis of the Populus genome provided evidence of a more recent duplication event that affected roughly 92% of the Populus genome. Nearly 8000 pairs of paralogous genes of similar age (excluding tandem or local duplications) were identified (Fig. 2). The relative age of the duplicate genes was estimated by the accumulated nucleotide divergence at fourfold synonymous third-codon transversion position (4DTV) values. A sharp peak in 4DTV values, corrected for multiple substitutions, representing a burst of gene duplication, is evident at 0.0916 ± 0.0004 (Fig. 3A). Comparison of 1825 Populus and Salix orthologous genes derived from Salix EST suggests that both genera share this whole-genome duplication event (Fig. 3B). Moreover, the parallel karyotypes and collinear genetic maps (18) of Salix and Populus also support the conclusion that both lineages share the same large-scale genome history.

If we naively calibrated the molecular clock using synonymous rates observed in the Brassicaceae (19) or derived from the Arabidopsis-Oryza divergence (20), we would conclude that the genome duplication in Populus is very recent [8 to 13 Ma, as reported by Sterk (21)]. Yet the fossil record shows that the Populus and Salix lineages diverged 60 to 65 Ma (22–25). Thus, the molecular clock in Populus must be ticking at only

Fig. 2. Chromosome-level reorganization of the most recent genome-wide duplication event in Populus. Common colors refer to homologous genome blocks, presumed to have arisen from the salicoid-specific genome duplication 65 Ma, shared by two chromosomes. Chromosomes are indicated by their linkage group number (I to XIX). The diagram to the left uses the same color coding and further illustrates the chimeric nature of most linkage groups.
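The ortholog pairs reported in this article were identified from best bidirectional BLAST hits. The following is only a minimal sketch of that general idea, not the authors' pipeline: the `Hit` record, the use of bit score as the ranking criterion, and the gene identifiers are all illustrative assumptions, and real analyses typically add coverage and E-value filters.

```python
from collections import namedtuple

# Hypothetical, simplified BLAST hit record (assumption: ranked by bit score).
Hit = namedtuple("Hit", ["query", "subject", "bitscore"])


def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring hit per query."""
    best = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Toy data with made-up gene names ("Pt*" for Populus, "At*" for Arabidopsis).
toy_a_vs_b = [Hit("Pt1", "At1", 200.0), Hit("Pt1", "At2", 150.0),
              Hit("Pt2", "At2", 90.0)]
toy_b_vs_a = [Hit("At1", "Pt1", 210.0), Hit("At2", "Pt1", 140.0)]

pairs = reciprocal_best_hits(toy_a_vs_b, toy_b_vs_a)
print(pairs)
```

In the toy data, Pt2's best hit is At2, but At2's best hit is Pt1, so only the Pt1/At1 pair is reciprocal.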

Fig. 3. (A) The 4DTV metrics for paralogous gene pairs in Populus-Populus and Populus-Arabidopsis. Three separate genome-wide duplication events are detectable, with the most recent event contained within the Salicaceae and the middle event apparently shared among the Eurosids. (B) Percent identity distributions for mutual best EST hit to Populus trichocarpa CDS.
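As a rough illustration of the 4DTV distance plotted in Fig. 3A, the sketch below counts transversions at fourfold-degenerate third-codon positions in two aligned, in-frame coding sequences. It rests on assumptions not stated in the article: the standard genetic code, a codon-level alignment without gaps, and the commonly used correction -0.5 ln(1 - 2p) for multiple substitutions. The function names are hypothetical.

```python
import math

# First two bases of codon families whose third position is fourfold
# degenerate under the standard genetic code (Leu, Val, Ser, Pro, Thr,
# Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def fourfold_tv(seq_a, seq_b):
    """Return (transversions, sites) over shared fourfold-degenerate codons."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = tv = 0
    for i in range(0, len(seq_a) - 2, 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        # Both codons must belong to the same fourfold-degenerate family.
        if ca[:2] != cb[:2] or ca[:2] not in FOURFOLD_PREFIXES:
            continue
        x, y = ca[2], cb[2]
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (x in PURINES) != (y in PURINES):
            tv += 1
    return tv, sites


def corrected_4dtv(tv, sites):
    """Apply the assumed correction -0.5 * ln(1 - 2p), with p = tv / sites."""
    if sites == 0:
        raise ValueError("no fourfold-degenerate sites to compare")
    p = tv / sites
    if p >= 0.5:
        raise ValueError("transversion fraction too high to correct")
    return -0.5 * math.log(1.0 - 2.0 * p)


# Toy aligned codons: CTA/CTC (transversion), GTC/GTC, GGA/GGG (transition),
# ACG/ACG.
tv, sites = fourfold_tv("CTAGTCGGAACG", "CTCGTCGGGACG")
distance = corrected_4dtv(tv, sites)
```

Only the purine/pyrimidine swap at the first codon counts as a transversion here; the A/G difference in the third codon is a transition and is ignored.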

1598 15 SEPTEMBER 2006 VOL 313 SCIENCE www.sciencemag.org RESEARCH ARTICLES one-sixth the estimated rate for Arabidopsis (that is, also shows that a large fraction of the Populus dependent oligopeptide transporter (POT) and 8 to 13 Ma divided by 60 to 65 Ma). Qualitatively genome falls in a set of duplicated segments oligopeptide transporter (OPT) families, 129 ver- similar slowing of the molecular clock is found in anchored by gene pairs with 4DTV at 0.364 T sus 61, and potassium transporter, 30 versus 13, the Populus chloroplast and mitochondrial 0.001, representing the residue of a more ancient, respectively]. genomes (9). Because Populus is a long-lived large-scale, apparently synchronous duplication Some domains were underrepresented in vegetatively propagated species, it has the potential event (Fig. 3A). This relatively older duplication Populus compared with Arabidopsis.Forexam- to successfully contribute gametes to multiple event covers about 59% of the Populus genome ple, the F-box domain was twice as prevalent in generations. A single Populus genotype can with 16% of genes in these segments present in Arabidopsis as in Populus (624 versus 303, persist as a clone on the landscape for millennia two copies. Because this duplication preceded respectively). The F-box domain is involved in (26), and we propose that recurrent contributions and is therefore superimposed upon the salicoid diverse and complex interactions involving of ‘‘ancient gametes’’ from very old individuals event, each genomic region is potentially protein degradation through the ubiquitin-26S could account for the markedly reduced rate of covered by four such segments. Similarly, the proteosome pathway (33). Many of the ubiquitin- sequence evolution. 
As a result of the slowing of Arabidopsis genome experienced an older associated domains are underrepresented in the molecular clock, the Populus genome most ‘‘beta’’ duplication that preceded the Brassica- Populus compared with Arabidopsis (for exam- likely resembles the ancestral eurosid genome. ceae-specific ‘‘alpha’’ event (28–32). ple, the Ulp1 protease family and the C-terminal To test whether the burst of gene creation 60 to We next asked whether the Arabidopsis catalytic domain, 10 versus 63, respectively). 65 Ma was due to a single whole-genome event or ‘‘beta’’ (30, 32)andPopulus 4DTV È0.36 du- Moreover, the RING-finger domains are nearly to independent but near-synchronous gene dupli- plication events were (i) independent genome- equally present in both genomes (503 versus cation events, we used a variant of the algorithm of wide duplications that occurred after the split from 407, respectively), suggesting that protein deg-

Hokamp et al.(27)toidentifysegmentsof the last common eurosid ancestor (H1)or(ii)a radation pathways in the two organisms are conserved synteny within the Populus genome. single shared duplication event that occurred in an metabolically divergent. The longest conserved syntenic block from the ancestral lineage (i.e., before the divergence of The common eurosid gene set. The Populus È 4DTV 0.09 epoch spanned 765 pairs of eurosid lineages I and II) (H2). These two and Arabidopsis gene sets were compared to infer paralogous genes. In total, 32,577 genes were hypotheses have very different implications for the conserved gene complement of their common contained within syntenic blocks from the salicoid the interpretation of homology between Populus eurosid ancestor, integrating information from epoch; half of these genes were contained in and Arabidopsis. Under H1,eachgenomic nucleotide divergence, synteny, and mutual best segments longer than 142 paralogous pairs. The segment in one species is homologous to four BLAST-hit analysis (9). The ancestral eurosid on February 28, 2010 same algorithm, when applied to randomly segments in the other; whereas under H2,each genome contained at least 11,666 protein-coding shuffled genes, typically yields duplicate blocks segment is homologous to only two segments in genes, along with an undetermined number that with fewer than 8 to 9 genes, indicating that the the other species. These hypotheses were tested were either lost in one or both of the lineages or Populus gene duplications occurred as a single by comparing the relative distances between gene whose homology could not be detected. These genome-wide event. We refer to this duplication pairs sampled within and between Populus and ancestral genes were the progenitors of gene event as the ‘‘salicoid’’ duplication event. 
Nearly every mapped segment of the Populus genome had a parallel "paralogous" segment elsewhere in the genome as a result of the salicoid event (Fig. 2). The pinwheel patterns can be understood as a whole-genome duplication followed by a series of reciprocal tandem terminal fusions between two separate sets of four chromosomes each: the first involving LGII, V, VII, and XIV and the second involving LGI, XI, IV, and IX. In addition, several chromosomes appear to have experienced minor reorganizational exchanges. Furthermore, LGI appears to be the result of multiple rearrangements involving three major tandem fusions. These results suggest that the progenitor of Populus had a base chromosome number of 10. After the whole-genome duplication event, this base chromosome number experienced a genome-wide reorganization and diploidization of the duplicated chromosomes into four pairs of complete paralogous chromosomes (LGVI, VIII, X, XII, XIII, XV, XVI, XVIII, and XIX); two sets of four chromosomes, each containing a terminal translocation (LGI, II, IV, V, VII, IX, and XI); and one chromosome containing three terminally joined chromosomes (LGIII with I or XVII with VII). The colinearity of genetic maps among multiple Populus species suggests that the genome reorganization occurred before the evolution of the modern taxa of Populus.

Genome duplication in a common ancestor of Populus and Arabidopsis. The distribution of 4DTV values for paralogous pairs of genes

Arabidopsis. H2 was generally supported (9), but we could not reject H1. We can only conclude that the Populus genome duplication occurred very close to the time of divergence of the eurosid I and II lineages (9), with slight support for a shared duplication. This coincident timing raises the possibility of a causal link between this duplication and rapid diversification early in eurosid (and perhaps core eudicot) history. We refer to this older Populus/Arabidopsis duplication event as the "eurosid" duplication event. We note that the salicoid duplication occurred independently of the eurosid duplication observed in the Arabidopsis genome.

Gene Content

Although Populus has substantially more protein-coding genes than Arabidopsis, the relative frequency of domains represented in protein databases (Prints, Prosite, Pfam, ProDom, and SMART) in the two genomes is similar (9). However, the most common domains occur in Populus compared with Arabidopsis in a ratio ranging from 1.4:1 to 1.8:1. Noteworthy outliers in Populus include genes and gene domains associated with disease and insect resistance (such as, in Populus versus Arabidopsis, respectively: leucine-rich repeats, 1271 versus 527; NB-ARC domain, 302 versus 141; and thaumatin, 55 versus 24), meristem development (such as NAC transcription factors, 157 versus 100, respectively), and metabolite and nutrient transport [such as oligopeptide transporter of the proton-

families of typically one to four descendents in each of the complete plant genomes and account for 28,257 Populus and 17,521 Arabidopsis genes. Gene family lists are accessible at www.phytozome.net. The gene predictions in these two genomes that could not be accounted for in the eurosid clusters were often fragmentary or difficult to categorize, and we could not confidently assign orthology to them. They may include previously unidentified or rapidly evolving genes in the Populus and/or Arabidopsis lineages, as well as poorly predicted genes.

Noncoding RNAs. Based on a series of publicly available RNA detection algorithms (34), including tRNAScan-SE, INFERNAL, and snoScan, we identified 817 putative tRNAs; 22 U1, 26 U2, 6 U4, 23 U5, and 11 U6 spliceosomal small nuclear RNAs (snRNAs); 339 putative C/D small nucleolar RNAs (snoRNAs); and 88 predicted H/ACA snoRNAs in the Populus assembly. All 57 possible anticodon tRNAs were found. One selenocysteine tRNA was detected and two possible suppressor tRNAs (anticodons that bind stop codons) were also identified. Populus has nearly 1.3 times as many tRNA genes as Arabidopsis. In contrast to Arabidopsis (fig. S7A), the copy number of tRNA in Populus was significantly and positively correlated with amino acid occurrence in predicted gene models (fig. S7B). The ratio of the number of snRNAs in Populus compared with the number in Arabidopsis is 1.3 to 1.0, yet U1, U2, and U5 are overrepresented in Populus, whereas U4 is underrepresented. Further-

www.sciencemag.org SCIENCE VOL 313 15 SEPTEMBER 2006 1599 RESEARCH ARTICLES

more, U14 was not detected in Arabidopsis. The snRNAs and snoRNAs have not been experimentally verified in Populus.

There are 169 identified microRNA (miRNA) genes representing 21 families in Populus (table S7). In Arabidopsis, these 21 families contain 91 miRNA genes, representing a 1.9X expansion in Populus, primarily in miR169 and miR159/319. All 21 miRNA families have regulatory targets that appear to be conserved among Arabidopsis and Populus (table S8). Similar to the miRNA genes themselves, the number of predicted targets for these miRNA is expanded in Populus (147) compared with Arabidopsis (89). Similarly, the genes that mediate RNA interference (RNAi) are also overrepresented in Populus (21) compared to Arabidopsis (11) [e.g., AGO1 class, 7 versus 3; RNA helicase 2 versus 1; HEN, 2 versus 1; HYL1-like (double-stranded RNA binding proteins), 9 versus 5, respectively].

Tandem duplications. In Populus there were 1518 tandemly duplicated arrays of two or more genes based on a Smith-Waterman alignment E value ≤ 10^-25 and a 100-kb window. The total number of genes in such arrays was 4839 and the total length of tandemly duplicated segments in Populus was 47.9 Mb, or 15.6% of the genome (fig. S8). By the same criteria, there are 1366 tandemly duplicated segments in Arabidopsis, covering 32.4 Mb, or 27% of the genome. By far the most common number of genes within a single array was two, with 958 such arrays in Populus and 805 in Arabidopsis. Arabidopsis had a larger number of arrays containing six or more genes than did Populus. Tandem duplications thus appear to be relatively more common in Arabidopsis than in Populus. This may in part be due to difficulties in assembling tandem repeats from a whole-genome shotgun sequencing approach, particularly when tandemly duplicated genes are highly conserved. Alternatively, the Populus genome may be undergoing rearrangements at a slower rate than the Arabidopsis genome, which is consistent with our observations of reduced chromosomal rearrangements and slower nucleotide substitution rates in Populus.

In some cases, genes were highly duplicated in both species, and some tandem duplications predated the Populus-Arabidopsis split (9). The largest number of tandem repeats in Populus in a single array was 24 and contained genes with high homology to S locus–specific glycoproteins. Genes of this class also occur as tandem repeats in Arabidopsis, with the largest segments containing 14 tandem duplicates on chromosome 1. One of the InterPro domains in this protein, IPR008271, a serine/threonine protein kinase active site, was the most frequent domain in tandemly repeated genes in both species (fig. S8). Other common domains in both species were the leucine-rich repeat (IPR007090, primarily from tandem repeats of disease resistance genes), the pentatricopeptide repeat RNA-binding proteins (IPR002885), and the uridine diphosphate (UDP)-glucuronosyl/UDP-glucosyltransferase domain (IPR002213) (table S9).

In contrast, some genes were highly expanded in tandem duplicates in one genome and not in the other (fig. S8). For example, one of the most frequent classes of tandemly duplicated genes in Arabidopsis was F-box genes, with a total of 342 involved in tandem duplications, the largest segment of which contained 24 F-box genes. Populus contains only 37 F-box genes in tandem duplications, with the largest segment containing only 3 genes.

Postduplication Gene Fate

Functional expression divergence. In Populus, 20 of the 66 salicoid-event duplicate gene pairs contained in 19 Populus EST libraries (2.3% of the total) showed differential expression (9) [displaying significant deviation in EST frequencies per library (Fig. 4)]. Out of 18 eurosid-event duplicate gene pairs (2.7% of the total), 11 also displayed significant deviation in EST frequencies per library. Many of the duplicate gene pairs that displayed significant overrepresentation in one or more of the 19 sampled libraries were involved in protein-protein interactions (such as annexin) or protein folding (such as cyclophilins). In the eurosid set, there was a greater divergence in the best BLAST hit among pairwise sets of genes. These results support the premise of functional expression divergence among some duplicated gene pairs in Populus.

Fig. 4. Kolmogorov-Smirnov (K-S) test for differential expression for 5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase genes [for descriptions of the EST data set, see Sterky et al. (79)]. Results suggest that the duplicated genes in Populus are differentially expressed in alternate tissues. Tissue types include: cambial zone (1), young leaves (2), buds (3), tension wood (4), senescing leaves (5), apical shoot (6), dormant cambium (7), active cambium (8), cold stressed leaves (9), roots (10), bark (11), shoot meristem (12), male catkins (13), dormant buds (14), female catkins (15), petioles (16), wood cell death (17), imbibed seeds (18) and infected leaves (19).

To further test for variation in gene expression among duplicated genes, we examined whole-genome oligonucleotide microarray data containing the 45,555 promoted genes (9). There was significantly lower differential expression in the salicoid duplicated pairs of genes (mean = 5%) relative to eurosid duplications (mean = 11%), again suggesting that differential expression patterns for retained paralogous gene pairs is an ongoing process that has had more time to occur in eurosid pairs (Fig. 5). This difference could also be due to absolute expression level, which may vary systematically between the two duplication events. Moreover, differential expression was more evident in the wood-forming organs. Almost 14 and 13% (2632 pairs of genes) of eurosid duplicated genes in the nodes and internodes, respectively, displayed differential expression, compared with 8% or less in and young leaves (Fig. 5).

Single-nucleotide polymorphisms. Populus is a highly polymorphic taxon and substantial numbers of SNPs are present even within a single individual (Table 1). The ratio of nonsynonymous to synonymous substitution rate (ω = dN/dS) was calculated as an index of selective constraints for alleles of individual genes (9). The overall average dN across all genes was 0.0014, whereas the dS value was 0.0035, for a total ω of 0.40, suggesting that the majority of coding regions in the Populus genome are subject to purifying selection. There was a significant, negative correlation between ω and the 4DTV distance to the most closely related paralog (r = -0.034, P = 0.028), which is consistent with the expectation of higher levels of nonsynonymous polymorphism in recently duplicated genes as a result of functional redundancy (20, 35). Similarly, genes with recent tandem duplicates (4DTV ≤ 0.2) had significantly higher ω than did genes with no recent tandem duplicates (Wilcoxon rank sum Z = 8.65, P ≤ 0.0001) (table S10).

The results for tandemly duplicated genes were consistent with expectations for accelerated evolution of duplicated genes (20). However, this expectation was not upheld for paralogous pairs of genes from the whole-genome duplication events. Relative rates of nonsynonymous substitution were actually lower for genes with paralogs from the salicoid and eurosid whole-genome duplication events than for genes with no paralogs (table S11). One possible explanation for this

discrepancy is that the apparent single-copy genes have a corresponding overrepresentation of rapidly evolving pseudogenes. However, this does not appear to be the case, as demonstrated by an analysis of gene size, synonymous substitution rate, and minimum genetic distance to the closest paralog as covariates in an analysis of variance with ω as the response variable (table S11). Therefore, genes with no paralogs from the salicoid and eurosid duplication events seem to be under lower selective constraints, and purifying selection is apparently stronger for genes with paralogs retained from the whole-genome duplications. Chapman et al. (36) have recently proposed the concept of functional buffering to account for similar reduction in detected mutations in paralogs from whole-genome duplications in Arabidopsis and Oryza. The vegetative propagation habit of Populus may also favor the conservation of nucleotide sequences among duplicated genes, in that complementation among duplicate pairs of genes would minimize loss of gene function associated with the accumulation of deleterious somatic mutations.

Gene family evolution. The expansion of several gene families has contributed to the evolution of Populus biology.

Lignocellulosic wall formation. Among the processes unique to tree biology, one of the most obvious is the yearly development of secondary xylem from the vascular cambium. We identified Populus orthologs of the approximately 20 Arabidopsis genes and gene families involved in or associated with cellulose biosynthesis. The Populus genome has 93 cellulose synthesis–related genes compared with 78 in Arabidopsis. The Arabidopsis genome encodes 10 CesA genes belonging to six classes known to participate in cellulose microfibril biosynthesis (37). Populus has 18 CesA genes (38), including duplicate copies of CesA7 and CesA8 homologs. Populus homologs of Arabidopsis CesA4, CesA7, and CesA8 are coexpressed during xylem development and tension wood formation (39). Furthermore, one pair of CesA genes appears unique to Populus, with no homologs found in Arabidopsis (40). Many other types of genes associated with cellulose biosynthesis, such as KOR, SuSY, COBRA, and FRA2, occur in duplicate pairs in Populus relative to single-copy Arabidopsis genes (39). For example, COBRA, a regulator of cellulose biogenesis (41), is a single-copy gene in Arabidopsis, but in Populus there are four copies.

The repertoire of acknowledged hemicellulose biosynthetic genes in Populus is generally similar to that in Arabidopsis. However, Populus has more genes encoding α-L-fucosidases and fewer genes encoding α-L-fucosyltransferases than does Arabidopsis, which is consistent with the lower xyloglucan fucose content (42) in Populus relative to Arabidopsis.

Lignin, the second most abundant cell wall polymer after cellulose, is a complex polymer of monolignols (hydroxycinnamyl alcohols) that encrusts and interacts with the cellulose/hemicellulose matrix of the secondary cell wall (43). The full set of 34 Populus phenylpropanoid and lignin biosynthetic genes (table S13) was identified by sequence alignment to the known Arabidopsis phenylpropanoid and lignin genes (44, 45). The size of the Populus gene families that encode these enzymes is generally larger than in Arabidopsis (34 versus 18, respectively). The only exception is cinnamyl alcohol dehydrogenase (CAD), which is encoded by a single gene in Populus and two genes in Arabidopsis (Fig. 6C); CAD is also encoded by only a single gene in Pinus taeda (46, 47). Two lignin-related Populus C4H genes are strongly coexpressed in tissues related to wood formation, whereas the three Populus C3H genes show reciprocally exclusive expression patterns (48) (Fig. 6, A and B).

Secondary metabolism. Populus trees produce a broad array of nonstructural, carbon-rich secondary metabolites that exhibit wide variation in abundance, stress inducibility, and effects on tree growth and host-pest interactions (49–53). Shikimate-phenylpropanoid–derived phenolic esters, phenolic glycosides, and condensed tannins and their flavonoid precursors comprise the largest classes of these metabolites. Phenolic glycosides and condensed tannins alone can constitute up to 35% leaf dry weight and are abundant in buds, bark, and roots of Populus (50, 54, 55).

The flavonoid biosynthetic genes are well annotated in Arabidopsis (56) and almost all (with the exception of flavonol synthase) are encoded by single-copy genes. In contrast, all but three such enzymes (chalcone isomerase, flavonoid 3′-hydroxylase, and flavanone 3-hydroxylase) are encoded by multiple genes in Populus (53). For example, the chalcone synthase, controlling the committed step to flavonoid biosynthesis, has expanded to at least six genes in Populus. In addition, Populus contains two genes each for flavone synthase II (cytochrome P450 accession number CYP98B) and flavonoid 3′,5′-hydroxylase (CYP75A12 and CYP75A13), both of which are absent in Arabidopsis. Furthermore, three Populus genes encode leucoanthocyanidin reductase, required for the synthesis of condensed tannin precursor 2,3-trans-flavan-3-ols, a stereochemical configuration also lacking in Arabidopsis (57). In contrast to the 32 terpenoid synthase (TPS) genes of secondary metabolism identified in the Arabidopsis genome (58), the Populus genome contains at least 47 TPS genes, suggesting a wide-ranging capacity for the formation of terpenoid secondary metabolites.

A number of phenylpropanoid-like enzymes have been annotated in the Arabidopsis genome (44, 45, 59–61). One example is the family encoding CAD. In addition to the single Populus CAD gene involved in lignin biosynthesis, several other clades of CAD-like (CADL) genes are present, most of which fall within larger subfamilies containing enzymes related to multifunctional alcohol dehydrogenases (Fig. 6). This comparative analysis makes it clear that there has been selective expansion and retention of Populus CADL gene families. For example, Populus contains seven CADL genes (PoptrCADL1 to PoptrCADL7; Fig. 6C) encoding enzymes related to the Arabidopsis BAD1 and BAD2 enzymes with apparent benzyl alcohol dehydrogenase activities (62). BAD1 and BAD2 are known to be pathogen inducible, suggesting that this group of Populus genes, including the Populus SAD gene, previously characterized as encoding a sinapaldehyde-specific CAD enzyme (63), may be involved in chemical defense.

Fig. 5. Proportion of eurosid and salicoid duplicated gene sets differentially expressed in stems (nodes and internode), leaves (young and mature), and whole roots. Samples from four biological replicates collected from the reference genotype Nisqually 1 were individually hybridized to whole-genome oligonucleotide microarrays containing three 60-oligomer oligonucleotide probes for each gene. Differential expression between duplicated genes was evaluated in t tests and declared significant at a 5% false discovery rate (9).

Disease resistance. The likelihood that a perennial plant will encounter a pathogen or herbivore before reproduction is near unity. The long-generation intervals for trees make it difficult for such plants to match the evolutionary rates of a microbial or insect pest. Aside from the formation of thickened cell walls and the synthesis of secondary metabolites that constitute a first line of defense against microbial and insect pests, plants use a variety of disease-resistance (R) genes.

The largest class of characterized R genes encodes intracellular proteins that contain a


nucleotide-binding site (NBS) and carboxy-terminal leucine-rich-repeats (LRR) (64). The NBS-coding family is one of the largest in Populus, with 399 members, approximately twice as high as in Arabidopsis. The NBS family can be divided into multiple subfamilies with distinct domain organizations, including 64 TIR-NBS-LRR genes, 10 truncated TIR-NBS that lack an LRR, 233 non–TIR-NBS-LRR genes, and 17 unusual TIR-NBS–containing genes that have not been identified previously in Arabidopsis (TNLT, TNLN, or TCNL domains) (Table 2). Five gene models coding for TNL proteins contained a predicted N-terminal nuclear localization signal (65). The number of non–TIR-NBS-LRR genes in Populus is also much higher than that in Arabidopsis (209 versus 57, respectively). Notably, 40 non–TIR-NBS genes, not found in Arabidopsis, carry an N-terminal BED DNA-binding zinc finger domain that was also found in the Oryza Xa1 gene. These findings suggest that domain cooption occurred in Populus.

Most NBS-LRR (about 65%) in Populus occur as singletons or in tandem duplications, and the distribution of pairwise genetic distances among these genes suggests a recent expansion of this family. That is, only 10% of the NBS-LRR genes are associated with the eurosid and salicoid duplication events, compared with 55% of the extracellular LRR receptor-like kinase genes (for example, fig. S10).

Several conserved signaling components such as RAR1, EDS1, PAD4, and NPR1, known to be recruited by R genes, also contain multiple homologs in Populus. For example, two copies of the PAD4 gene, which functions upstream of salicylic acid accumulation, and five copies of the NPR1 gene, an important regulator of responses downstream of salicylic acid, are found in Populus. Nearly all genes known to control disease resistance signaling in Arabidopsis have putative orthologs in Populus. Populus has a larger number of β-1,3-glucanase and chitinase genes than does Arabidopsis (131 versus 73, respectively). In summary, the structural and genetic diversity that exists among R genes and their signaling components in Populus is remarkable and suggests that unlike the rest of the genome, contemporary diversifying selection has played an important role in the evolution of disease resistance genes in Populus. Such diversification suggests that enhanced ability to detect and respond to biotic challenges through R gene–mediated signaling may be critical over a decades-long life span of this genus.

Fig. 6. Phylogenetic analysis of gene families in Populus, Arabidopsis, and Oryza encoding selected lignin biosynthetic and related enzymes. (A) Cinnamate-4-hydroxylase (C4H) gene family. (B) 4-coumaroyl-shikimate/quinate-3-hydroxylase (C3H) gene family. (C) Cinnamyl alcohol dehydrogenase (CAD) and related multifunctional alcohol dehydrogenase gene family. Arabidopsis gene names are the same as those in Ehlting et al. (80). Populus and Oryza gene names were arbitrarily assigned; corresponding gene models are listed in table S13. Genes encoding enzymes for which biochemical data are available are highlighted with a green flash. Yellow circles indicate monospecific clusters of gene family members.

Table 2. Numbers of genes that encode domains similar to plant R proteins in Populus, Arabidopsis (81), and Oryza (82). *, BED finger and/or DUF1544 domain; CC, coiled coil; –, not detected.

Predicted protein domains   Letter code   Populus   Arabidopsis   Oryza
TIR-NBS                     TN            10        21            –
TIR-NBS-LRR                 TNL           64        83            –
TIR-NBS-LRR-TIR             TNLT          13        –             –
TIR-NBS-LRR-NBS             TNLN          1         –             –
NBS-LRR-TIR                 NLT           1         –             –
TIR-CC-NBS-LRR              TCNL          2         –             –
CC-NBS                      CN            19        4             7
CC-NBS-LRR                  CNL           119       51            159
BED/DUF1544*-NBS            BN            5         –             –
NBS-BED/DUF1544*            NB            1         –             –
BED/DUF1544*-NBS-LRR        BNL           24        –             –
NBS-LRR                     NL            90        6             40
NBS                         N             49        1             45
Others                      –             –         41            284
Total NBS genes                           398       207           535

Membrane transporters. Attributes of Populus biology such as massive interannual, seasonal, and diurnal metabolic shifts and redeployment of carbon and nitrogen may require an elaborate array of transporters. Investigation of gene families coding for transporter proteins (http://plantst.genomics.purdue.edu/) in the Populus genome revealed a general expansion relative to Arabidopsis (1722 versus 959, in Populus versus Arabidopsis, respectively) (table S12). Five gene families, coding for adenosine 5′-triphosphate–binding cassette proteins (ABC transporters, 226 gene models), major facilitator superfamily proteins (187 genes), drug/metabolite transporters (108 genes), amino acid/auxin permeases (95 genes), and POT transporters (90 genes), accounted for more than 40% of the total number of transporter gene models (fig. S14). Some large families such as those encoding POT (4.3X relative to Arabidopsis), glutamate-gated ion channels (3.7X), potassium uptake permeases (2.3X), and ABC transporters (1.9X) are expanded in Populus. We identified a subfamily of five putative aquaporins, lacking in the Arabidopsis. Populus also harbors seven transmembrane receptor genes that have previously only been found in fungi, and two genes, identified as mycorrhizal-specific phosphate transporters, that confirm that the mycorrhizal symbiosis may have an impact on the mineral nutrition of this long-lived species. This expanded inventory of transporters could conceivably play a role in adaptation to nutrient-limited forest soils, long-distance transport and storage of water and metabolites, secretion and movement of secondary metabolites, and/or mediation of resistance to pathogen-produced secondary metabolites or other toxic compounds.

Phytohormones. Both physiological and molecular studies have indicated the importance of

hormonal regulation underlying plant development. Auxin, gibberellin, cytokinin, and ethylene responses are of particular interest in tree biology.

Many auxin responses (66–71) are controlled by auxin response factor (ARF) transcription factors, which work together with cognate AUX/IAA proteins to regulate auxin-responsive target genes (72, 73). A phylogenetic analysis using the known and predicted ARF protein sequences showed that Populus and Arabidopsis ARF gene families have expanded independently since they diverged from their common ancestor. Six duplicate ARF genes in Populus encode paralogs of ARF genes that are single-copy Arabidopsis genes, including ARF5 (MONOPTEROS), an important gene required for auxin-mediated signal transduction and xylem development. Furthermore, five Arabidopsis ARF genes have four or more predicted Populus ARF gene paralogs. In contrast to ARF genes, Populus does not contain a notably expanded repertoire of AUX/IAA genes relative to Arabidopsis (35 versus 29, respectively) (74). Interestingly, there is a group of four Arabidopsis AUX/IAA genes with no apparent Populus orthologs, suggesting Arabidopsis-specific functions.

Gibberellins (GAs) are thought to regulate multiple processes during wood and root development, including xylem fiber length (75). Among all gibberellin biosynthesis and signaling genes, the Populus GA20-oxidase gene family is the only family with approximately two times the number of genes relative to Arabidopsis, indicating that most of the duplicated genes that arose from the salicoid duplication event have been lost. GA20-oxidase appears to control flux in the biosynthetic pathway leading to the bioactive gibberellins GA1 and GA4. The higher complement of GA20-oxidase genes may have biological importance in Populus with respect to secondary xylem and fiber cell development.

Cytokinins are thought to control the identity and proliferation of cell types relevant for wood formation as well as general cell division (67). The total number of members in gene families encoding cytokinin homeostasis related isopentenyl transferases (IPT) and cytokinin oxidases is roughly similar between Populus and Arabidopsis, although there appears to be lineage-specific expansion of IPT subfamilies. The cytokinin signal transduction pathway represents a two-component phosphorelay system, in which a two-component hybrid receptor initiates a phosphotransfer by means of histidine-containing phosphotransmitters (HPt) to phospho-accepting response regulators (RR). One family of genes, encoding the two-component receptors (such as CKI1), is notably expanded in Populus (four versus one in Populus and Arabidopsis, respectively) (76). Gene families coding for recently identified pseudo-HPt and atypical RR are overrepresented in Populus relative to Arabidopsis (2.5- and 4.0-fold increase in Populus, respectively). Both of these gene families have been implicated in the negative regulation of cytokinin signaling (67, 77), which is consistent with the idea of increased complexity in regulation of cytokinin signal transduction in Populus.

Populus and Arabidopsis genomes contain almost identical numbers of genes for the three enzymes of ethylene biosynthesis, whereas the number of genes for proteins involved in ethylene perception and signaling is higher in Populus. For example, Populus has seven predicted genes for ethylene receptor proteins and Arabidopsis has five; the constitutive triple response kinase that acts just downstream of the receptor is encoded by four genes in Populus and only one in Arabidopsis (78). The number of ethylene-responsive element binding factor (ERF) proteins (a subfamily of the AP2/ERF family) is higher in Populus than in Arabidopsis (172 versus 122, respectively). The increased variation in the number of ERF transcription factors may be involved in the ethylene-dependent processes specific to trees, such as tension wood formation (68) and the establishment of dormancy (71).

Conclusions

Our initial analyses provide a flavor of the opportunities for comparative plant genomics made possible by the generation of the Populus genome sequence. A complex history of whole-genome duplications, chromosomal rearrangements and tandem duplications has shaped the genome that we observe today. The differences in gene content between Populus and Arabidopsis have provided some tantalizing insights into the possible molecular bases of their strongly contrasting life histories, although factors unrelated to gene content (such as regulatory elements, miRNAs, posttranslational modification, or epigenetic modifications) may ultimately be of equal or greater importance. With the sequence of Populus, researchers can now go beyond what could be learned from Arabidopsis alone and explore hypotheses linking genome sequence features to wood development, nutrient and water movement, crown development, and disease resistance in perennial plants. The availability of the Populus genome sequence will enable continuing comparative genomics studies among species that will shed new light on genome reorganization and gene family evolution. Furthermore, the genetics and population biology of Populus make it an immense source of allelic variation. Because Populus is an obligate outcrossing species, recessive alleles tend to be maintained in a heterozygous state. Informatics tools enabled by the sequence, assembly, and annotation of the Populus genome will facilitate the characterization of allelic variation in wild Populus populations adapted to a wide range of environmental conditions and gradients over large portions of the Northern Hemisphere. Such variants represent a rich reservoir of molecular resources useful in biotechnological applications, development of alternative energy sources, and mitigation of anthropogenic environmental problems. Finally, the keystone role of Populus in many ecosystems provides the first opportunity for the application of genomics approaches to questions with ecosystem-scale implications.

References and Notes

1. Food and Agricultural Organization of the United Nations, State of the World’s Forests 2003 (FAO, Rome, 2003).
2. R. F. Stettler, H. D. Bradshaw Jr., in Biology of Populus and Its Implications for Management and Conservation, R. F. Stettler, H. D. Bradshaw Jr., P. E. Heilman, T. M. Hinckley, Eds. (NRC Research Press, Ottawa, 1996), pp. 1–7.
3. G. A. Tuskan, S. P. DiFazio, T. Teichmann, Plant Biol. 6, 2 (2004).
4. T. M. Yin, S. P. DiFazio, L. E. Gunter, D. Riemenschneider, G. A. Tuskan, Theor. Appl. Genet. 109, 451 (2004).
5. R. Meilan, C. Ma, in Protocols, vol. 344 of Methods in Molecular Biology, K. Wang, Ed. (Humana Press, Totowa, NJ, 2006), pp. 143–151.
6. G. A. Tuskan, Biomass Bioenerg. 14, 307 (1998).
7. G. A. Tuskan, M. Walsh, For. Chron. 77, 259 (2001).
8. S. Wullschleger et al., Can. J. For. Res. 35, 1779 (2005).
9. Materials and methods are available as supporting material on Science Online.
10. H. D. Bradshaw, R. F. Stettler, Theor. Appl. Genet. 86, 301 (1993).
11. M. Koornneef, P. Fransz, H. de Jong, Chromosome Res. 11, 183 (2003).
12. O. Santamaria, J. J. Diez, For. Pathol. 35, 95 (2005).
13. G. A. Tuskan et al., Can. J. For. Res. 34, 85 (2004).
14. A. A. Salamov, V. V. Solovyev, Genome Res. 10, 516 (2000).
15. E. Birney, R. Durbin, Genome Res. 10, 547 (2000).
16. T. Schiex, A. Moisan, P. Rouzé, in : Selected Papers from JOBIM’2000, number 2066 in LNCS (Springer-Verlag, Heidelberg, Germany, 2001), pp. 118–133.
17. Y. Xu, E. C. Uberbacher, J. Comput. Biol. 4, 325 (1997).
18. S. J. Hanley, M. D. Mallott, A. Karp, Tree Genet. Genomes, in press.
19. M. A. Koch, B. Haubold, T. Mitchell-Olds, Mol. Biol. Evol. 17, 1483 (2000).
20. M. Lynch, J. S. Conery, Science 290, 1151 (2000).
21. L. Sterck et al., New Phytol. 167, 165 (2005).
22. L. A. Dode, Bull. Soc. Hist. Nat. Autun 18, 161 (1905).
23. R. Regnier, in Revue des Sociétés Savantes de Normandie (Rouen, France, 1956), vol. 1, pp. 1–36.
24. M. E. Collinson, Proc. R. Soc. Edinburgh B Bio. Sci. 98, 155 (1992).
25. J. E. Eckenwalder, in Biology of Populus and Its Implications for Management and Conservation, R. F. Stettler, H. D. Bradshaw Jr., P. E. Heilman, T. M. Hinckley, Eds. (NRC Research Press, Ottawa, 1996), chap. 1.
26. J. B. Mitton, M. C. Grant, Bioscience 46, 25 (1996).
27. K. Hokamp, A. McLysaght, K. H. Wolfe, J. Struct. Funct. Genomics 3, 95 (2003).
28. J. E. Bowers, B. A. Chapman, J. K. Rong, A. H. Paterson, Nature 422, 433 (2003).
29. L. M. Zahn, J. Leebens-Mack, C. W. dePamphilis, H. Ma, G. Theissen, J. Hered. 96, 225 (2005).
30. S. De Bodt, S. Maere, Y. Van de Peer, Trends Ecol. Evol. 20, 591 (2005).
31. K. L. Adams, J. F. Wendel, Trends Genet. 21, 539 (2005).
32. G. Blanc, K. Hokamp, K. H. Wolfe, Genome Res. 13, 137 (2003).
33. B. A. Schulman et al., Nature 408, 381 (2000).
34. S. Griffiths-Jones et al., Nucleic Acids Res. 33, D121 (2005).
35. S. Lockton, B. S. Gaut, Trends Genet. 21, 60 (2005).
36. B. A. Chapman, J. E. Bowers, F. A. Feltus, A. H. Paterson, Proc. Natl. Acad. Sci. U.S.A. 103, 2730 (2006).
37. T. A. Richmond, C. R. Somerville, Plant Physiol. 124, 495 (2000).
38. S. Djerbi, M. Lindskog, L. Arvestad, F. Sterky, T. T. Teeri, Planta 221, 739 (2005).


39. C. P. Joshi et al., New Phytol. 164, 53 (2004).
40. A. Samuga, C. P. Joshi, Gene 334, 73 (2004).
41. F. Roudier et al., Plant Cell 17, 1749 (2005).
42. R. M. Perrin et al., Science 284, 1976 (1999).
43. R. W. Whetten, J. J. Mackay, R. R. Sederoff, Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 585 (1998).
44. J. Ehlting et al., Plant J. 42, 618 (2005).
45. J. Raes, A. Rohde, J. H. Christensen, Y. Van de Peer, W. Boerjan, Plant Physiol. 133, 1051 (2003).
46. D. M. O’Malley, S. Porter, R. R. Sederoff, Plant Physiol. 98, 1364 (1992).
47. J. J. Mackay, W. W. Liu, R. Whetten, R. R. Sederoff, D. M. Omalley, Mol. Gen. Genet. 247, 537 (1995).
48. J. Schrader et al., Plant Cell 16, 2278 (2004).
49. S. Whitham, S. McCormick, B. Baker, Proc. Natl. Acad. Sci. U.S.A. 93, 8776 (1996).
50. G. M. Gebre, T. J. Tschaplinski, G. A. Tuskan, D. E. Todd, Tree Physiol. 18, 645 (1998).
51. G. Arimura, D. P. W. Huber, J. Bohlmann, Plant J. 37, 603 (2004).
52. D. J. Peters, C. P. Constabel, Plant J. 32, 701 (2002).
53. C.-J. Tsai, S. A. Harding, T. J. Tschaplinski, R. L. Lindroth, Y. Yuan, New Phytol. 172, 47 (2006).
54. M. M. De Sá, R. Subramaniam, F. E. Williams, C. J. Douglas, Plant Physiol. 98, 728 (1992).
55. R. L. Lindroth, S. Y. Hwang, Biochem. Syst. Ecol. 24, 357 (1996).
56. B. Winkel-Shirley, Curr. Opin. Plant Biol. 5, 218 (2002).
57. G. J. Tanner et al., J. Biol. Chem. 278, 31647 (2003).
58. S. Aubourg, A. Lecharny, J. Bohlmann, Mol. Genet. Genomics 267, 730 (2002).
59. M. A. Costa et al., Phytochemistry 64, 1097 (2003).
60. D. Cukovic, J. Ehlting, J. A. VanZiffle, C. J. Douglas, Biol. Chem. 382, 645 (2001).
61. J. M. Shockey, M. S. Fulda, J. Browse, Plant Physiol. 132, 1065 (2003).
62. I. E. Somssich, P. Wernert, S. Kiedrowski, K. Hahlbrock, Proc. Natl. Acad. Sci. U.S.A. 93, 14199 (1996).
63. L. Li et al., Plant Cell 13, 1567 (2001).
64. B. C. Meyers, S. Kaushik, R. S. Nandety, Curr. Opin. Plant Biol. 8, 129 (2005).
65. L. Deslandes et al., Proc. Natl. Acad. Sci. U.S.A. 99, 2404 (2002).
66. E. J. Mellerowicz, M. Baucher, B. Sundberg, W. Boerjan, Plant Mol. Biol. 47, 239 (2001).
67. A. P. Mähönen et al., Science 311, 94 (2006).
68. S. Andersson-Gunneras et al., Plant J. 34, 339 (2003).
69. J. M. Hellgren, K. Olofsson, B. Sundberg, Plant Physiol. 135, 212 (2004).
70. M. G. Cline, K. Dong-Il, Ann. Bot. (London) 90, 417 (2002).
71. R. Ruonala, P. Rinne, M. Baghour, H. Tuominen, J. Kangasjärvi, Plant J., in press.
72. R. Moyle et al., Plant J. 31, 675 (2002).
73. D. Weijers et al., EMBO J. 24, 1874 (2005).
74. G. Hagen, T. Guilfoyle, Plant Mol. Biol. 49, 373 (2002).
75. M. E. Eriksson, M. Israelsson, O. Olsson, T. Moritz, Nat. Biotechnol. 18, 784 (2000).
76. T. Kakimoto, Science 274, 982 (1996).
77. T. Kiba, K. Aoki, H. Sakakibara, T. Mizuno, Plant Cell Physiol. 45, 1063 (2004).
78. T. Nakano, K. Suzuki, T. Fujimura, H. Shinshi, Plant Physiol. 140, 411 (2006).
79. F. Sterky et al., Proc. Natl. Acad. Sci. U.S.A. 101, 13951 (2004).
80. J. Ehlting et al., Plant J. 42, 618 (2005).
81. B. C. Meyers, S. Kaushik, R. S. Nandety, Curr. Opin. Plant Biol. 8, 129 (2005).
82. M. W. Jones-Rhoades, D. P. Bartel, Mol. Cell 14, 787 (2004).
83. We thank the U.S. Department of Energy, Office of Science for supporting the sequencing and assembly portion of this study; Genome Canada and the Province of British Columbia for providing support for the BAC end, BAC genotyping, and full-length cDNA portions of this study; the Umeå University and the Royal Technological Institute (KTH) in Stockholm for supporting the EST assembly and annotation portion of this study; the membership of the International Populus Genome Consortium for supplying genetic and genomics resources used in the assembly and annotation of the genome; the NSF Plant Genome Program for supporting the development of Web-based tools; T. H. D. Bradshaw and R. Stettler for input and reviews on draft copies of the manuscript; J. M. Tuskan for guidance and input during the analysis and writing of the manuscript; and the anonymous reviewers who provided critical input and recommendations on the manuscript. GenBank Accession Number: AARH00000000.

Supporting Online Material
www.sciencemag.org/cgi/content/full/313/5793/1596/DC1
Materials and Methods
Figs. S1 to S15
Tables S1 to S14
References

13 April 2006; accepted 9 August 2006
10.1126/science.1128691

Opposing Activities Protect Against Age-Onset Proteotoxicity

Ehud Cohen,1* Jan Bieschke,2* Rhonda M. Perciavalle,1 Jeffery W. Kelly,2 Andrew Dillin1†

Aberrant protein aggregation is a common feature of late-onset neurodegenerative diseases, including Alzheimer’s disease, which is associated with the misassembly of the Aβ1-42 peptide. Aggregation-mediated Aβ1-42 toxicity was reduced in Caenorhabditis elegans when aging was slowed by decreased insulin/insulin growth factor–1–like signaling (IIS). The downstream transcription factors, heat shock factor 1, and DAF-16 regulate opposing disaggregation and aggregation activities to promote cellular survival in response to constitutive toxic protein aggregation. Because the IIS pathway is central to the regulation of longevity and youthfulness in worms, flies, and mammals, these results suggest a mechanistic link between the aging process and aggregation-mediated proteotoxicity.

Late-onset human neurodegenerative diseases including Alzheimer’s (AD), Huntington’s, and Parkinson’s diseases are genetically and pathologically linked to aberrant protein aggregation (1, 2). In AD, formation of aggregation-prone peptides, particularly Aβ1-42, by endoproteolysis of the amyloid precursor protein (APP) is associated with the disease through an unknown mechanism (3, 4). Whether intracellular accumulation or extracellular deposition of Aβ1-42 initiates the pathological process is a key unanswered question (5). Typically, individuals who carry AD-linked mutations present with clinical symptoms during their fifth or sixth decade, whereas sporadic cases appear after the seventh decade. Why aggregation-mediated toxicity emerges late in life and whether it is mechanistically linked to the aging process remain unclear.

Perhaps the most prominent pathway that regulates life span and youthfulness in worms, flies, and mammals is the insulin/insulin growth factor (IGF)–1–like signaling (IIS) pathway (6). In the nematode Caenorhabditis elegans, the sole insulin/IGF-1 receptor, DAF-2 (7), initiates the transduction of a signal that causes the phosphorylation of the FOXO transcription factor, DAF-16 (8, 9), preventing its translocation to the nucleus (10). This negative regulation of DAF-16 compromises expression of its target genes, decreases stress resistance, and shortens the worm’s life span. Thus, inhibition of daf-2 expression creates long-lived, youthful, stress-resistant worms (11). Similarly, suppression of the mouse DAF-2 ortholog, IGF1-R, creates long-lived mice (12). Recent studies indicate that, in worms, life-span extension due to reduced daf-2 activity is also dependent upon heat shock factor 1 (HSF-1). Moreover, increased expression of hsf-1 extends worm life span in a daf-16–dependent manner (13). That the DAF-16 and HSF-1 transcriptomes result in the expression of numerous chaperones (13, 14) suggests that the integrity of protein folding could play a key role in life-span determination and the amelioration of aggregation-associated proteotoxicity. Indeed, amelioration of Huntington-associated proteotoxicity by slowing the aging process in worms has been reported (13, 15, 16).

Reduced IIS activity lowers Aβ1-42 toxicity. One hypothesis to explain late-onset aggregation-associated toxicity posits that the deposition of toxic aggregates is a stochastic process, governed by a nucleated polymerization and requiring many years to initiate disease. Alternatively, aging could enable constitutive aggregation to become toxic as a result of declining detoxification activities. To distinguish between these two possibilities, we asked what role the aging process plays in Aβ1-42 aggregation-mediated toxicity in a C. elegans model featuring intracellular Aβ1-42 expression (17). If Aβ1-42 toxicity results from a non-age-related nucleated polymerization, animals that express Aβ1-42 and whose life span has been extended would be expected to succumb to Aβ1-42 toxicity at the same rate as those with a natural life span. However, if the aging process plays a role in detoxifying an ongoing protein aggregation process, alteration of the aging program

1Molecular and Cell Biology Laboratory, Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA. 2Department of Chemistry and Skaggs Institute of Chemical Biology, Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA.
*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: [email protected]

Vol 449 | 27 September 2007 | doi:10.1038/nature06148 LETTERS

The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla

The French–Italian Public Consortium for Grapevine Genome Characterization*

The analysis of the first plant genomes provided unexpected evidence for genome duplication events in species that had previously been considered as true diploids on the basis of their genetics1–3. These polyploidization events may have had important consequences in plant evolution, in particular for species radiation and adaptation and for the modulation of functional capacities4–10. Here we report a high-quality draft of the genome sequence of grapevine (Vitis vinifera) obtained from a highly homozygous genotype. The draft sequence of the grapevine genome is the fourth one produced so far for flowering plants, the second for a woody species and the first for a fruit crop (cultivated for both fruit and beverage). Grapevine was selected because of its important place in the cultural heritage of humanity beginning during the Neolithic period11. Several large expansions of gene families with roles in aromatic features are observed. The grapevine genome has not undergone recent genome duplication, thus enabling the discovery of ancestral traits and features of the genetic organization of flowering plants. This analysis reveals the contribution of three ancestral genomes to the grapevine haploid content. This ancestral arrangement is common to many dicotyledonous plants but is absent from the genome of rice, which is a monocotyledon. Furthermore, we explain the chronology of previously described whole-genome duplication events in the evolution of flowering plants.

All grapevine varieties are highly heterozygous; preliminary data showed that there was as much as 13% sequence divergence between alleles, which would hinder reliable contig assembly when a whole-genome shotgun strategy was used for sequencing. Our consortium therefore selected the grapevine PN40024 genotype for sequencing. This line, originally derived from Pinot Noir, has been bred close to full homozygosity (estimated at about 93%) by successive selfings, permitting a high-quality whole-genome shotgun assembly.

A total of 6.2 million end-reads were produced by our consortium, representing an 8.4-fold coverage of the genome. Within the assembly, performed with Arachne12, 316 supercontigs represent putative allelic haplotypes that constitute 11.6 million bases (Mb). These values are in good fit with the 7% residual heterozygosity of PN40024 assessed by using genetic markers. When considering only one of the haplotypes in each heterozygous region, the assembly (Table 1a) consists of 19,577 contigs (N50 = 65.9 kilobases (kb), where N50 corresponds to the size of the shorter supercontig or contig in a subset representing half of the assembly size) and 3,514 supercontigs (N50 = 2.07 Mb) totalling 487 Mb. This value is close to the 475 Mb previously reported for the grapevine genome size13.

Using a set of 409 molecular markers from the reference grapevine map14, 69% of the assembled 487 Mb, arranged into 45 ultracontigs

Table 1 | Global statistics on the genome of Vitis vinifera

(a) Assembly
Assembly | Status | Number | N50 (kb) | Longest (kb) | Size (Mb) | Percentage of the assembly
Contigs | All | 19,577 | 65.9 | 557 | 467.5 | –
Supercontigs | All | 3,514 | 2,065 | 12,675 | 487.1 | 100
Supercontigs | Anchored on chromosomes | 191 | 3,189 | 12,675 | 335.6 | 68.9
Supercontigs | Anchored on chromosomes and oriented | 143 | 3,827 | 12,675 | 296.9 | 60.9

(b) Annotation
Annotation | Number | Median size (bp) | Total length (Mb) | Percentage of the genome | %GC
Gene | 30,434 | 3,399 | 225.6 | 46.3 | 36.2
Exons CDS | 149,351 | 130 | 33.6 | 6.9 | 44.5
Introns CDS | 118,917 | 213 | 178.6 | 36.7 | 34.7
Intergenic | 30,453 | 3,544 | 261.5 | 34.7 | 33.0
tRNA* | 600 | 73 | 0.04 | NS | 43.0
miRNA† | 164 | 103.5 | 0.002 | NS | 35.9

(c) Orthology
Orthology | Number of orthologous proteins | Mean identity (%)
P. trichocarpa | 12,996 | 72.7
A. thaliana | 11,404 | 65.5
O. sativa | 9,731 | 59.8
Common to eudicotyledons‡ | 10,547 |
Common to Magnoliophyta§ | 8,121 |

* Transfer RNA (tRNA) values were computed on exons.
† Micro RNAs (miRNAs) are members of known conserved miRNA families.
‡ Eudicotyledons are represented by P. trichocarpa and A. thaliana.
§ Magnoliophyta (most flowering plants) are represented by P. trichocarpa, A. thaliana and O. sativa.
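The N50 statistic reported in Table 1a is defined in the text as the size of the shortest contig or supercontig in the subset of largest sequences that together represent half of the assembly size. The following Python sketch shows one way to compute it. The function name `n50` and the toy lengths are assumptions made for illustration only; they are not taken from the grapevine assembly or from Arachne.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches at least half of the total; the length of the last
    sequence taken is the N50.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, arbitrary units): total 300, half 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 80 + 70 = 150 reaches half the total, so N50 is 70
```

Applied to real contig or supercontig lengths in kilobases, a function of this kind would be expected to give values comparable to the N50 column of Table 1a, provided the same definition and the same set of sequences are used.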

*A list of participants and their affiliations appears at the end of the paper.
© 2007 Nature Publishing Group

and 51 single supercontigs, were anchored along the 19 linkage groups. Thirty-seven ultracontigs and 22 single supercontigs were oriented, representing 61% of the genome assembly (Supplementary Tables 2 and 3).

This assembly has been annotated by using a combination of evidence. The major features of the genome annotation are presented in Table 1b. The 8.4-fold draft sequence of the grapevine genome contains a set of 30,434 protein-coding genes (an average of 372 codons and 5 exons per gene). This value is considerably lower than the 45,555 protein-coding genes reported for the poplar (Populus trichocarpa) genome, which has a similar size, at 485 Mb (ref. 1), and even lower than the 37,544 protein-coding genes identified in the 389 Mb of the rice genome2.

Three different approaches revealed that 41.4% (average value) of the grapevine genome is composed of repetitive/transposable elements (TEs), a slightly higher proportion than that identified in the rice genome, which has a somewhat smaller size2. The distribution of repeats and TEs along the chromosomes is quite uneven (see below). All classes and superfamilies of TEs are represented in the grapevine genome, with a large prevalence of class I elements over class II and helitrons (rolling-circle transposons) (Supplementary Table 7). An analysis of the distribution of the repetitive elements in the different fractions of the grapevine genome based on the current annotation shows that introns are quite rich in repeats and TEs (data not shown). In addition, 12.4% of the intron sequence contains transposons as determined using our set of manually annotated elements, most of which (75%) correspond to LINE (long interspersed element) retrotransposons, which therefore seem to have contributed specifically to the intron size observed in grapevine (Supplementary Table 8).

In eukaryotes with large genomes, the coding and repeated elements are distributed over the chromosomes and may be more or less interlaced, hence defining gene-poor and gene-rich regions. It has previously been noticed that the distribution of the genes along the chromosomes of rice and Arabidopsis thaliana is fairly homogeneous2,3. In contrast, we observe large regions that alternate between high and low gene density in V. vinifera (Supplementary Figs 2 and 3). As expected, the density of TEs reflects a pattern substantially complementary to gene density. We observe a similar characteristic in the genome sequence of poplar, therefore indicating a dynamic for the invasion of TEs that is shared with the grapevine (Supplementary Fig. 3).

A striking feature of the grapevine proteome lies in the existence of large families related to wine characteristics, which have a higher gene copy number than in the other sequenced plants. Stilbene synthases (STSs) drive the synthesis of resveratrol, the grapevine phytoalexin that has been linked to the health benefits associated with moderate consumption of red wine15,16. The family of genes encoding STSs has a noticeable expansion: 43 genes have been identified. Of these, 20 have previously been shown to be expressed after infection by Plasmopara viticola, thus confirming that they are likely to be functional. The terpene synthases (TPSs) drive the synthesis of terpenoids; these secondary metabolites are major components of resins, essential oils and aromas (their relative abundance is directly correlated with the aromatic features of wines17) and are involved in plant–environment interactions. In comparison with the 30–40 genes of this family in Arabidopsis, rice and poplar, the grapevine TPS family is more than twice as large, with 89 functional genes and 27 pseudogenes. Classification based on known plant homologues reveals that the subclass of putative monoterpene synthases represents only 15% of the Arabidopsis TPS family18 whereas this subclass represents 40% of the grapevine TPS family. This result suggests a high diversification of grapevine monoterpene synthases that specifically produce C10 terpenoids present in aroma (such as geraniol, linalool, cineole and α-terpineol). Furthermore, the grapevine genome annotation has also revealed genes encoding homologues to the two forms of geranyl diphosphate synthases (GPPSs), the enzymes that produce the substrate for monoterpene synthases: both the homodimeric GPPS and the heterodimeric form are present; the latter is present only in plants such as Mentha piperita and Clarkia breweri, which produce large quantities of monoterpenes19. Most of the STS and TPS genes occur as 20 clusters, including up to 33 paralogous genes located in a 680-kb stretch.

Because global duplication events seem to be frequent in plant evolution20, we searched the genome of V. vinifera for paralogous regions by using protein sequence similarity. Paralogous regions are defined as chromosome fragments in which homologous genes are present in clusters. Statistical analysis21 of these clusters reveals that 94.5% have high probability of being paralogous (P < 10⁻⁴; Supplementary Table 11). Most Vitis gene regions have two different paralogous regions, which we have grouped together as triplets (Supplementary Fig. 5; coverage details in Supplementary Table 10). We conclude that the present-day grapevine haploid genome originated from the contribution of three ancestral genomes. It is yet to be demonstrated whether this content came from a true hexaploidization event or through successive genome duplications. The resulting plant had a diploid content that corresponds to the three full diploid contents of the three ancestors; it may therefore be described as a ‘palaeo-hexaploid’ organism. A number of rearrangements have affected the original three complements after the formation of the palaeo-hexaploid state. However, the gene order has been sufficiently conserved to permit the alignment of most regions with their two siblings.

We explored the time of formation of the palaeo-hexaploid arrangement by comparing grapevine gene regions with those of other completely sequenced plant genomes. If the palaeo-hexaploid complement is present in another species, it should result in a one-for-one pairing of gene regions between the two species considered. In contrast, if another species’ genome evolved before palaeo-hexaploid formation, it should result in a one-to-three relationship between the other species and the grapevine genome. The available genome sequences were those of poplar1, Arabidopsis3 and rice (Oryza sativa2), of which poplar is considered to be most closely related to grapevine. All clusters constructed between the orthologues in the three comparisons have P < 10⁻⁴ (Table 1c). When the gene order in poplar is compared with that in grapevine, there are two clear distributions. First, the grapevine regions align with two poplar segments, as would be expected from a recent whole-genome duplication (WGD) in the poplar lineage1. Second, each of the three grapevine regions that form a homologous triplet recognizes different pairs of poplar segments (Fig. 1a and Supplementary Fig. 6). This shows that the palaeo-hexaploidy observed in grapevine was already present in its common ancestor with poplar.

Poplar belongs to the Eurosid I clade. The sister clade to Eurosid I is that of Eurosid II, which contains the model species Arabidopsis. Its gene order was compared with that in the grapevine genome. Two distributions appear: first, most grapevine regions correspond to four Arabidopsis segments (Supplementary Fig. 7); second, each component of a triplicated group in grapevine recognizes four different regions in Arabidopsis (Fig. 1b). This shows that the grapevine palaeo-hexaploidy was present in the common ancestor to Arabidopsis and grapevine, and therefore that it is a trait common to all Eurosids. This is confirmed by the homology level distribution between paralogues of the grapevine, indicating a lower conservation than between Vitis/Arabidopsis orthologues (Supplementary Fig. 4). The Eurosid group contains many economically important flowering plants such as legumes, cotton and . Our present results establish these species as having a palaeo-hexaploid common ancestor. The grapevine/Arabidopsis comparison also reveals that the Arabidopsis lineage underwent two WGDs after its separation from the Eurosid I clade21–24. This contradicts some models based on more indirect evidence that placed the most ancient of these two duplications at the base of the Eurosid group, or even earlier4,20–22. Some studies had also suggested a possible third duplication event in the distant past of the Arabidopsis lineage, potentially at the base of the angiosperm radiation. The controversy about this third event is now resolved by the Vitis genome comparisons: this event corresponds to the palaeo-hexaploidy formation that remains evident in the grapevine genome but has been difficult to characterize in Arabidopsis and poplar because of the more recent WGDs. In particular, the Arabidopsis genome lineage has undergone many rearrangements and chromosome fusions such that the ancestral gene order is particularly difficult to deduce from this species (Fig. 2).

Figure 1 | Comparison between three paralogous Vitis genomic regions and their orthologues in P. trichocarpa, A. thaliana and O. sativa. Orthologous gene pairs are joined with a different colour for each of the three paralogous grapevine chromosomes 6 (green), 8 (blue) and 13 (red). a, Orthologous regions in the poplar genome are different for each of the three Vitis chromosomes, showing that the triplication predates the poplar/Vitis separation. One Vitis region recognizes two poplar segments because of a WGD in the poplar lineage after the separation. b, Orthologous regions with Arabidopsis are different for each of the three Vitis chromosomes. This shows that the Arabidopsis/Vitis ancestor had the same palaeo-hexaploid content. One Vitis region corresponds to four Arabidopsis segments, indicating the presence of two WGDs in the Arabidopsis lineage after separation from the Vitis lineage. c, Orthologous regions in rice are the same for the three paralogous chromosomes. This indicates that the triplication was not present in the common ancestor of monocotyledons and dicotyledons. The presence in rice of different homologous blocks is due to global duplications in the rice lineage after divergence from dicotyledons.

Grapevines, like Arabidopsis and poplar, are dicotyledonous plants that diverged from monocotyledons about 130–240 Myr ago25,26. Because rice is a monocotyledon, we assessed the presence or absence of palaeo-hexaploidy in its genome sequence. The observed pattern is the opposite of that seen for Arabidopsis and poplar: constituents of a grapevine triplet are generally orthologous to the same group of rice regions (Fig. 1c and Supplementary Fig. 11). Because rice and grapevine are phylogenetically distant, it is more difficult to detect relations of orthology across the two whole genomes: rearrangements, duplication and gene loss have affected the gene orders differently in the two lineages (Supplementary Fig. 10). Even with this limitation, we observed numerous cases of one-to-three relationships between rice and grapevine (Supplementary Figs 8, 9 and 11); 23% of orthologous blocks include the paralogous regions that originate from the grapevine palaeo-hexaploidy. For Arabidopsis, this number is as low as 1.4% (this difference is significant at 5%: χ² = 8.9; Supplementary Table 12), despite the fact that the Arabidopsis genome has suffered many gene losses since its two WGDs. These gene losses would be expected to obscure the orthologous relations with the grapevine genome, but they are clearly insufficient to explain the high number of one-to-three relationships observed in the rice–grapevine comparison. The most probable explanation for this excess is that the rice ancestor did not exhibit the palaeo-hexaploidy observed in the grapevine, poplar and Arabidopsis.

Figure 2 | Schematic representation of paralogous regions derived from the three ancestral genomes in the karyotypes of V. vinifera, P. trichocarpa and A. thaliana. Each colour corresponds to a syntenic region between the three ancestral genomes that were defined by their occurrence as linked clusters in grapevine, independently of intrachromosomal rearrangements. The V. vinifera genome (a) is by far the closest to the ancestral arrangement, whereas that of Arabidopsis (c) is thoroughly rearranged, and P. trichocarpa (b) presents an intermediate situation. The seven colours probably correspond to linkage groups at the time of the palaeo-hexaploid ancestor.

These findings are summarized in Fig. 3: the triplicated arrangement is apparent after the separation of the monocotyledons and dicotyledons and before the spread of the Eurosid clade. Future genome sequencing projects for other clades of dicotyledons, such as Solanaceae or basal eudicots, will help in situating the triplication event more precisely, and eventually in establishing its precise nature (hexaploidization or genome duplications at distant times).

Public access to the grapevine genome sequence will help in the identification of genes underlying the agricultural characteristics of this species, including domestication traits. A selective amplification of genes belonging to the metabolic pathways of terpenes and tannins has occurred in the grapevine genome, in contrast with other plant genomes. This suggests that it may become possible to trace the diversity of wine flavours down to the genome level. Grapevine is also a crop that is highly susceptible to a large diversity of pathogens including powdery mildew, oidium and Pierce disease. Other Vitis species such as V. riparia or V. cinerea, which are known to be resistant to several of these pathogens, are interfertile with V. vinifera and can be used for the introduction of resistance traits by advanced backcrosses27 or by gene transfer. Access to the Vitis sequence and the exploitation of synteny will speed up this process of introgression of pathogen resistance traits. As a consequence of this, it is hoped that it will also prompt a strong decrease in pesticide use.

The high quality of the assembly, due mainly to the highly homozygous nature of the PN40024 line, enables the discovery of three ancestral genomes constituting the diploid content of grapevine. The Greek historian Thucydides wrote that Mediterranean people began to emerge from ignorance when they learnt to cultivate olives and grapes. This first characterization of the grapevine genome, with its indication of a palaeo-hexaploid ancestral genome for many dicotyledonous plants, addresses fundamental questions related to the origin and importance of this event in the history of flowering plants. Future work may help in correlating the differential fates of the three gene complements with phenotypic traits of dicotyledonous species.

METHODS SUMMARY
Gene annotation. Protein-coding genes were predicted by combining ab initio models, V. vinifera complementary DNA alignments, and alignments of proteins and genomic DNA from other species. The integration of the data was performed with GAZE28. Details are given in Supplementary Information.
Paralogous and orthologous gene sets. Statistical testing of homologous regions was performed as described in ref. 21.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.

Received 5 April; accepted 7 August 2007. Published online 26 August 2007.
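The reasoning used in the main text to date the triplication can be sketched in Python as follows. If each member of a grapevine triplet has its own set of orthologous segments in another species, the triplication is taken to predate the split from that species (the poplar and Arabidopsis pattern); if all three members map to the same segments, the other lineage did not carry it (the rice pattern). This is only a minimal illustration under stated assumptions: the function `classify_triplet`, the labels it returns and the segment identifiers are hypothetical, and the actual analysis relied on the statistical cluster testing described in ref. 21 rather than on exact set comparisons.

```python
from itertools import combinations


def classify_triplet(orthologs_by_region):
    """Classify how one grapevine triplet maps onto another genome.

    orthologs_by_region maps each of the three paralogous grapevine
    regions to the set of segment identifiers it is orthologous to in
    the other species.
    """
    segment_sets = list(orthologs_by_region.values())
    if len(segment_sets) != 3:
        raise ValueError("expected exactly three paralogous regions")
    if not all(segment_sets):
        return "unresolved"  # at least one region has no detected orthologue
    pairs = list(combinations(segment_sets, 2))
    if all(a.isdisjoint(b) for a, b in pairs):
        return "distinct"  # each region has its own orthologous segments
    if all(a == b for a, b in pairs):
        return "shared"  # all three regions map to the same segments
    return "mixed"


# Hypothetical segment labels, for illustration only.
poplar_like = {"chr6": {"P1", "P2"}, "chr8": {"P3", "P4"}, "chr13": {"P5", "P6"}}
rice_like = {"chr6": {"R1", "R2"}, "chr8": {"R1", "R2"}, "chr13": {"R1", "R2"}}

print(classify_triplet(poplar_like))  # distinct
print(classify_triplet(rice_like))    # shared
```

In real data, gene loss and rearrangement would make many triplets fall into the "mixed" or "unresolved" categories, which is why the text compares proportions across genomes instead of relying on individual triplets.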

Figure 3 | Positions of the polyploidization events in the evolution of plants with a sequenced genome. Each star indicates a WGD (tetraploidization) event on that branch. The question mark indicates that ancient events are visible in the rice genome that would require other monocotyledon genome sequences to be resolved. The formation of the palaeo-hexaploid ancestral genome occurred after divergence from monocotyledons and before the radiation of the Eurosids.

1. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
2. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
3. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
4. De Bodt, S., Maere, S. & Van de Peer, Y. Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20, 591–597 (2005).
5. Scannell, D. R., Byrne, K. P., Gordon, J. L., Wong, S. & Wolfe, K. H. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440, 341–345 (2006).
6. Jaillon, O. et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004).
7. Aury, J. M. et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444, 171–178 (2006).
8. Maere, S. et al. Modeling gene and genome duplications in eukaryotes. Proc. Natl Acad. Sci. USA 102, 5454–5459 (2005).
9. Blanc, G. & Wolfe, K. H. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16, 1679–1691 (2004).
10. Seoighe, C. & Gehring, C. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet. 20, 461–464 (2004).
11. McGovern, P. E., Hartung, U., Badler, V., Glusker, D. L. & Exner, L. J. The beginnings of wine making and viniculture in the ancient Near East and Egypt. Expedition 39, 3–21 (1997).
12. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
13. Lodhi, M. A., Daly, M. J., Ye, G. N., Weeden, N. F. & Reisch, B. I. A molecular marker based linkage map of Vitis. Genome 38, 786–794 (1995).
14. Doligez, A. et al. An integrated SSR map of grapevine based on five mapping populations. Theor. Appl. Genet. 113, 369–382 (2006).
15. Baur, J. A. et al. Resveratrol improves health and survival of mice on a high-calorie diet. Nature 444, 337–342 (2006).
16. Baur, J. A. & Sinclair, D. A. Therapeutic potential of resveratrol: the in vivo evidence. Nature Rev. Drug Discov. 5, 493–506 (2006).
17. Mateo, J. J. & Jimenez, M. Monoterpenes in grape juice and wines. J. Chromatogr. A 881, 557–567 (2000).
18. Aubourg, S., Lecharny, A. & Bohlmann, J. Genomic analysis of the terpenoid synthase (AtTPS) gene family of Arabidopsis thaliana. Mol. Genet. Genomics 267, 730–745 (2002).
19. Tholl, D. et al. Formation of monoterpenes in Antirrhinum majus and Clarkia breweri flowers involves heterodimeric geranyl diphosphate synthases. Plant Cell 16, 977–992 (2004).
20. Adams, K. L. & Wendel, J. F. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8, 135–141 (2005).
21. Simillion, C., Vandepoele, K., Van Montagu, M. C., Zabeau, M. & Van de Peer, Y. The hidden duplication past of Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 99, 13627–13632 (2002).
22. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
23. Vision, T. J., Brown, D. G. & Tanksley, S. D. The origins of genomic duplications in Arabidopsis. Science 290, 2114–2117 (2000).
24. Blanc, G., Hokamp, K. & Wolfe, K. H. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 13, 137–144 (2003).
25. Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. & Li, W. H. Date of the monocot–dicot divergence estimated from chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA 86, 6201–6205 (1989).
26. Crane, P. R., Friis, E. M. & Pedersen, K. R. The origin and early diversification of angiosperms. Nature 374, 27–33 (1995).
27. Eshed, Y. & Zamir, D. An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics 141, 1147–1162 (1995).
28. Howe, K. L., Chothia, T. & Durbin, R. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res. 12, 1418–1427 (2002).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

was financially supported by Consortium National de Recherche en Génomique, Agence Nationale de la Recherche, INRA, and by MiPAF (VIGNA-CRA), Friuli Innovazione, Università di Udine, Federazione BCC, Fondazione CRUP, Fondazione Carigo, Fondazione CRT, Vivai Cooperativi Rauscedo, Eurotech, Livio Felluga, Marco Felluga, Venica e Venica, Le Vigne di Zamò (IGA). We thank S. Cure for correcting the manuscript; F. Câmara and R. Guigo for the calibration of the GeneID gene prediction software, and the Centre Informatique National de l’Enseignement Supérieur for computing resources.

Author Information The final assembly and annotation are deposited in the EMBL/Genbank/DDBJ databases under accession numbers CU459218–CU462737 (for all scaffolds) and CU462738–CU462772 (for chromosome reconstitutions and unanchored scaffolds). An annotation browser and further information on the project are available from http://www.genoscope.cns.fr/vitis, http://www.vitisgenome.it/ and http://www.appliedgenomics.org/. Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to P.W. ([email protected]).

The French-Italian Public Consortium for Grapevine Genome Characterization
Olivier Jaillon1*, Jean-Marc Aury1*, Benjamin Noel1, Alberto Policriti2,3, Christian Clepet4, Alberto Casagrande2,5, Nathalie Choisne1,4, Sébastien Aubourg4, Nicola Vitulo6,15, Claire Jubin1, Alessandro Vezzi6,15, Fabrice Legeai7, Philippe Hugueney8, Corinne Dasilva1, David Horner9,15, Erica Mica9,15, Delphine Jublot4, Julie Poulain1, Clémence Bruyère4, Alain Billault1, Béatrice Segurens1, Michel Gouyvenoux1, Edgardo Ugarte1, Federica Cattonaro2, Véronique Anthouard1, Virginie Vico1, Cristian Del Fabbro2,3, Michaël Alaux7, Gabriele Di Gaspero2,5, Vincent Dumas8, Nicoletta Felice2,5, Sophie Paillard4, Irena Juman2,5, Marco Moroldo4, Simone Scalabrin2,3, Aurélie Canaguier4, Isabelle Le Clainche4, Giorgio Malacrida6,15, Eléonore Durand7, Graziano Pesole10,11,15, Valérie Laucou12, Philippe Chatelet13, Didier Merdinoglu8, Massimo Delledonne14,15, Mario Pezzotti15,16, Alain Lecharny4, Claude Scarpelli1, François Artiguenave1, M. Enrico Pè9,15, Giorgio Valle6,15, Michele Morgante2,5, Michel Caboche4, Anne-Françoise Adam-Blondon4, Jean Weissenbach1, Francis Quétier1 & Patrick Wincker1
*These authors contributed equally to this work.

Affiliations for participants: 1Genoscope (CEA) and UMR 8030 CNRS-Genoscope-Université d’Evry, 2 rue Gaston Crémieux, BP5706, 91057 Evry, France. 2Istituto di Genomica Applicata, Parco Scientifico e Tecnologico di Udine, Via Linussio 51, 33100 Udine, Italy. 3Dipartimento di Matematica ed Informatica, Università degli Studi di Udine, via delle Scienze 208, 33100 Udine, Italy. 4URGV, UMR INRA 1165, CNRS-Université d’Evry Genomique Végétale, 2 rue Gaston Crémieux, BP5708, 91057 Evry cedex, France. 5Dipartimento di Scienze Agrarie ed Ambientali, Università degli Studi di Udine, via delle Scienze 208, 33100 Udine, Italy. 6CRIBI, Università degli Studi di Padova, viale G. Colombo 3, 35121 Padova, Italy. 7URGI, UR1164 Génomique Info, 523, Place des Terrasses, 91034 Evry Cedex, France. 8UMR INRA 1131, Université de Strasbourg, Santé de la Vigne et Qualité du Vin, 28 rue de Herrlisheim, BP20507, 68021 Colmar, France. 9Dipartimento di Scienze Biomolecolari e Biotecnologie, Università degli Studi di Milano, via Celoria 26, 20133 Milano, Italy. 10Dipartimento di Biochimica e Biologia Molecolare, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy. 11Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, via Amendola 122/D, 70125 Bari, Italy. 12UMR INRA 1097, IRD-Montpellier SupAgro-Univ. Montpellier II, Diversité et Adaptation des Plantes Cultivées, 2 Place Pierre Viala, 34060 Montpellier Cedex 1, France. 13UMR INRA 1098, IRD-Montpellier SupAgro-CIRAD, Développement et Amélioration des Plantes, 2 Place Pierre Viala, 34060 Montpellier Cedex 1, France.
14Dipartimento Scientifico e Tecnologico, Universita` degli Studi di Verona Strada Le Acknowledgements The sequencing of the grapevine genome was launched and Grazie 15 – Ca’ Vignal, 37134 Verona, Italy. 15Dipartimento di Scienze, Tecnologie e carried out after a scientific cooperation agreement between the Ministry of Mercati della Vite e del Vino, Universita` degli Studi di Verona, via della Pieve, 70 37029 S. Agriculture in France and the Ministry of Agriculture in Italy, involving l’Institut Floriano (VR), Italy. 16VIGNA-CRA Initiative; Consorzio Interuniversitario Nazionale per National de la Recherche Agronomique (INRA), Consiglio per la Ricerca e la Biologia Molecolare delle Piante, c/o Universita` degli Studi di Siena, via Banchi di Sotto Sperimentazione in Agricoltura (CRA) and Friuli Venezia Giulia Region. This work 55, 53100 Siena, Italy.

© 2007 Nature Publishing Group doi:10.1038/nature06148

METHODS

Genome sequencing. The V. vinifera PN40024 genome was sequenced with the use of a whole-genome shotgun strategy. All data were generated by paired-end sequencing of cloned inserts using Sanger technology on ABI3730xl sequencers. Supplementary Table 2 gives the number of reads obtained per library.

Genome assembly and chromosome anchoring. All reads were assembled with Arachne12. We obtained 20,784 contigs that were linked into 3,830 supercontigs of more than 2 kb. The contig N50 was 64 kb, and the supercontig N50 was 1.9 Mb. The total supercontig size was 498 Mb, remarkably close to the expected size of 475 Mb. This indicates that the PN40024 has retained few heterozygous regions. Remaining heterozygosity was assessed by aligning all supercontigs with each other. We first selected the supercontigs more than 30 kb in size that were covered over more than 40% of their length by another supercontig with more than 95% identity. After visual inspection of the alignments, we added to this list the supercontigs more than 10 kb in size that aligned at more than 40% of their length with supercontigs identified previously. All potential cases were then inspected visually to discard potential heterozygous regions (aligning relatively homogeneously across their complete length) and retained repeated regions (with more heterogeneous alignments). This treatment identified 11 Mb of potentially allelic supercontigs. We confirmed that in most cases their coverage was about half the average of the homozygous supercontigs. Only one supercontig of each allelic pair was therefore conserved in the final assembly, which consists of 3,514 supercontigs (N50 = 2 Mb) containing 19,577 contigs (N50 = 66 kb), totalling 487 Mb. If the haploid genome size of 475 Mb is considered correct, then our final assembly contains only about 12 Mb of remaining heterozygosity, or 2.6%.

A set of 30,151 bacterial artificial chromosome (BAC) fingerprints of the BAC clones of a Cabernet–Sauvignon library29 were assembled into 1,763 contigs with FPC30, v. 8. In parallel, 1,981 markers were anchored on a subset of BAC clones31, among which 388 markers mapped onto the genetic map, and 77,237 BAC end sequences were obtained31. Blat32 alignments (90% identity on 80% of the length, fewer than five hits) were performed with BAC end sequences on the 3,830 supercontigs of sequences with lengths over 2 kb. The results were then filtered with homemade Perl scripts to keep only the occurrences in which two paired ends were matching at a distance of less than 300 kb and with a consistent orientation. Two supercontigs were considered linked to each other if two BAC links could be found or one BAC link and a BAC contig link. A total number of 111 ultracontigs were constructed with this procedure.

Genome annotation. Several resources were used to build V. vinifera gene models automatically with GAZE28. We used predictions of repetitive regions by repeatscout33, conserved coding regions predicted by the exofish method34,35, genewise36 alignments of proteins from Uniprot37, Geneid38 and Snap39 ab initio gene predictions, and alignments of several cDNA resources (Supplementary Information).

A weight was assigned to each resource to further reflect its reliability and accuracy in predicting gene models. This weight acts as a multiplier for the score of each information source, before being processed by GAZE. When applied to the entire assembled sequence, GAZE predicted 30,434 gene models.

Paralogous and orthologous gene sets. We identified orthologous genes in six pairs of genomes from four species: A. thaliana, O. sativa, P. trichocarpa and V. vinifera. Each pair of predicted gene sets was aligned with the Smith–Waterman algorithm, and alignments with a score higher than 300 (BLOSUM62; gapo = 10, gape = 1) were retained. Two genes, A from genome GA and B from genome GB, were considered orthologues if B was the best match for gene A in GB and A was the best match for B in GA.

For each orthologous gene set with V. vinifera, clusters of orthologous genes were generated. A single linkage clustering with a euclidean distance was used to group genes. The distances were calculated with the gene index in each chromosome rather than the genomic position. The minimal distance between two orthologous genes was adapted in accordance with the selected genomes. Finally, we retained only clusters that were composed of at least six genes for Arabidopsis and O. sativa, and eight genes for P. trichocarpa (Supplementary Table 10).

To validate the clustering quality we used a method described previously21. For each cluster we computed the probability of finding this cluster in the gene homology matrix (Supplementary Table 11). This matrix was constructed from two compared chromosomes with genes numbered according to their position on each chromosome, with no reference to physical distances.

Paralogous genes were computed by comparing all-against-all of V. vinifera proteins by using blastp, and alignments with an expected value of less than 0.1 were retained and realigned with the Smith–Waterman algorithm40. Two genes A and B were considered paralogues if B was the best match for gene A and A was the best match for B. Moreover, clusters of paralogous genes were constructed in the same fashion as orthologous clusters (Supplementary Table 10).

29. Adam-Blondon, A. F. et al. Construction and characterization of BAC libraries from major grapevine cultivars. Theor. Appl. Genet. 110, 1363–1371 (2005).
30. Soderlund, C., Humphray, S., Dunham, A. & French, L. Contigs built with fingerprints, markers, and FPC V4.7. Genome Res. 10, 1772–1787 (2000).
31. Lamoureux, D. et al. Anchoring of a large set of markers onto a BAC library for the development of a draft physical map of the grapevine genome. Theor. Appl. Genet. 113, 344–356 (2006).
32. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
33. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 (Suppl. 1), i351–i358 (2005).
34. Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235–238 (2000).
35. Jaillon, O. et al. Genome-wide analyses based on comparative genomics. Cold Spring Harb. Symp. Quant. Biol. 68, 275–282 (2003).
36. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
37. Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, D154–D159 (2005).
38. Parra, G., Blanco, E. & Guigo, R. GeneID in Drosophila. Genome Res. 10, 511–515 (2000).
39. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
40. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
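The reciprocal best-match rule used in the Methods to call orthologues (B is the best match for A, and A is the best match for B, among alignments scoring above 300) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the tuple layout and the toy scores are hypothetical, and the Smith–Waterman alignments are assumed to have been computed beforehand.

```python
from typing import Dict, Iterable, List, Tuple

# (query gene, subject gene, alignment score); layout is an assumption.
Hit = Tuple[str, str, float]


def best_matches(hits: Iterable[Hit], min_score: float = 300.0) -> Dict[str, str]:
    """Return the highest-scoring subject for each query, ignoring weak hits."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if score <= min_score:  # only scores higher than the threshold are kept
            continue
        # On ties the first hit seen is kept (an arbitrary choice for this sketch).
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit],
                         min_score: float = 300.0) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best match and a is b's best match."""
    best_ab = best_matches(a_vs_b, min_score)
    best_ba = best_matches(b_vs_a, min_score)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores, for illustration only.
a_vs_b = [("A1", "B1", 900), ("A1", "B2", 450), ("A2", "B2", 500),
          ("A3", "B1", 700), ("A4", "B3", 250)]
b_vs_a = [("B1", "A1", 900), ("B2", "A2", 500), ("B2", "A1", 450),
          ("B1", "A3", 700), ("B3", "A4", 250)]

orthologues = reciprocal_best_hits(a_vs_b, b_vs_a)
print(orthologues)  # [('A1', 'B1'), ('A2', 'B2')]
```

In the toy data, A3's best match is B1, but B1's best match is A1, so the pair is rejected; A4 and B3 match each other but fall below the score threshold.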

Vol 452 | 24 April 2008 | doi:10.1038/nature06856 LETTERS

The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus)

Ray Ming1,2*, Shaobin Hou3*, Yun Feng4,5*, Qingyi Yu1*, Alexandre Dionne-Laporte3, Jimmy H. Saw3, Pavel Senin3, Wei Wang4,6, Benjamin V. Ly3, Kanako L. T. Lewis3, Steven L. Salzberg7, Lu Feng4,5,6, Meghan R. Jones1, Rachel L. Skelton1, Jan E. Murray1,2, Cuixia Chen2, Wubin Qian4, Junguo Shen5, Peng Du5, Moriah Eustice1,8, Eric Tong1, Haibao Tang9, Eric Lyons10, Robert E. Paull11, Todd P. Michael12, Kerr Wall13, Danny W. Rice14, Henrik Albert15, Ming-Li Wang1, Yun J. Zhu1, Michael Schatz7, Niranjan Nagarajan7, Ricelle A. Acob1,8, Peizhu Guan1,8, Andrea Blas1,8, Ching Man Wai1,11, Christine M. Ackerman1, Yan Ren4, Chao Liu4, Jianmei Wang4, Jianping Wang2, Jong-Kuk Na2, Eugene V. Shakirov16, Brian Haas17, Jyothi Thimmapuram18, David Nelson19, Xiyin Wang9, John E. Bowers9, Andrea R. Gschwend2, Arthur L. Delcher7, Ratnesh Singh1,8, Jon Y. Suzuki15, Savarni Tripathi15, Kabi Neupane20, Hairong Wei21, Beth Irikura11, Maya Paidi1,8, Ning Jiang22, Wenli Zhang23, Gernot Presting8, Aaron Windsor24, Rafael Navajas-Pérez9, Manuel J. Torres9, F. Alex Feltus9, Brad Porter8, Yingjun Li2, A. Max Burroughs7, Ming-Cheng Luo25, Lei Liu18, David A. Christopher8, Stephen M. Mount7,26, Paul H. Moore15, Tak Sugimura27, Jiming Jiang23, Mary A. Schuler28, Vikki Friedman29, Thomas Mitchell-Olds24, Dorothy E. Shippen16, Claude W. dePamphilis13, Jeffrey D. Palmer14, Michael Freeling10, Andrew H. Paterson9, Dennis Gonsalves15, Lei Wang4,5,6 & Maqsudul Alam3,30

Papaya, a fruit crop cultivated in tropical and subtropical regions, is known for its nutritional benefits and medicinal applications. Here we report a 3× draft genome sequence of 'SunUp' papaya, the first commercial virus-resistant transgenic fruit tree1 to be sequenced. The papaya genome is three times the size of the Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the five sequenced genomes suggests a minimal angiosperm gene set of 13,311. A lack of recent genome duplication, atypical of other angiosperm genomes sequenced so far2–5, may account for the smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular functional groups suggest roles in the evolution of tree-like habit, deposition and remobilization of starch reserves, attraction of seed dispersal agents, and adaptation to tropical daylengths. Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I recognition sites. Papaya offers numerous advantages as a system for fruit-tree functional genomics, and this draft genome sequence provides the foundation for revealing the basis of Carica's distinguishing morpho-physiological, medicinal and nutritional properties.

Papaya is an exceptionally promising system for the exploration of tropical-tree genomes and fruit-tree genomics. It has a relatively small genome of 372 megabases (Mb)6, diploid inheritance with nine pairs of chromosomes, a well-established transformation system7, a short generation time (9–15 months), continuous flowering throughout the year and a primitive sex-chromosome system8. It is a member of the Brassicales, sharing a common ancestor with Arabidopsis about 72 million years ago9. Papaya is ranked first on nutritional scores among 38 common fruits, based on the percentage of the United States Recommended Daily Allowance for vitamin A, vitamin C, potassium, folate, niacin, thiamine, riboflavin, iron and calcium, plus fibre. Consumption of its fruit is recommended for preventing vitamin A deficiency, a cause of childhood blindness in tropical and subtropical developing countries. The fruit, stems, leaves and roots of papaya are used in a wide range of medical applications, including production of papain, a valuable proteolytic enzyme.

1Hawaii Agriculture Research Center, Aiea, Hawaii 96701, USA. 2Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 3Advanced Studies in Genomics, Proteomics and Bioinformatics, University of Hawaii, Honolulu, Hawaii 96822, USA. 4TEDA School of Biological Sciences and Biotechnology, Nankai University, Tianjin Economic-Technological Development Area, Tianjin 300457, China. 5Tianjin Research Center for Functional Genomics and Biochip, Tianjin Economic-Technological Development Area, Tianjin 300457, China. 6Key Laboratory of Molecular Microbiology and Technology of the Ministry of Education, College of Life Sciences, Nankai University, Tianjin 300071, China. 7Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742, USA. 8Department of Molecular Bioscience and Bioengineering, University of Hawaii, Honolulu, Hawaii 96822, USA. 9Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA. 10Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA. 11Department of Tropical Plant and Soil Sciences, University of Hawaii, Honolulu, Hawaii 96822, USA. 12Waksman Institute of Microbiology and Department of Plant Biology and Pathology, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA. 13Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 14Department of Biology, Indiana University, Bloomington, Indiana 47405, USA. 15USDA-ARS, Pacific Basin Agricultural Research Center, Hilo, Hawaii 96720, USA. 16Department of Biochemistry and Biophysics, 2128 TAMU, Texas A&M University, College Station, Texas 77843, USA. 17The Institute for Genomic Research, Rockville, Maryland 20850, USA. 18W.M. Keck Center for Comparative and Functional Genomics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 19Department of Molecular Sciences, University of Tennessee, Memphis, Tennessee 38163, USA. 20Leeward Community College, University of Hawaii, Pearl City, Hawaii 96782, USA. 21Wicell Research Institute, Madison, Wisconsin 53707, USA. 22Department of Horticulture, Michigan State University, East Lansing, Michigan 48824, USA. 23Department of Horticulture, University of Wisconsin, Madison, Wisconsin 53706, USA. 24Department of Biology, Duke University, Durham, North Carolina 27708, USA. 25Department of Plant Sciences, University of California, Davis, California 95616, USA. 26Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland 20742, USA. 27Maui High Performance Computing Center, Kihei, Hawaii 96753, USA. 28Departments of Cell and Developmental Biology, Biochemistry and Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 29Applied Biosystems, 850 Lincoln Centre Drive, Foster City, California 94404, USA. 30Department of Microbiology, University of Hawaii, Honolulu, Hawaii 96822, USA.

*These authors contributed equally to this work.

© 2008 Nature Publishing Group

A total of 2.8 million whole-genome shotgun (WGS) sequencing reads were generated from a female plant of transgenic cultivar SunUp, which was developed through transformation of Sunset that had undergone more than 25 generations of inbreeding10. The estimated residual heterozygosity of SunUp is 0.06% (Supplementary Note 1). After excluding low-quality and organellar reads, 1.6 million high-quality reads were assembled into contigs containing 271 Mb and scaffolds spanning 370 Mb including embedded gaps (Supplementary Tables 1 and 2). Of 16,362 unigenes derived from expressed sequence tags (ESTs), 15,064 (92.1%) matched this assembly. Paired-end reads from 34,065 bacterial artificial chromosome (BAC) clones provided alignment to a fingerprinted contig (FPC)-based physical map (Supplementary Note 2). Among 706 BAC end and WGS sequence-derived simple sequence repeats on the genetic map, 652 (92.4%) could be used to anchor 167 Mb of contigs or 235 Mb of scaffolds, to the 12 papaya linkage groups in the current genetic map (Supplementary Fig. 1).

Papaya chromosomes at the pachytene stage of meiosis are generally stained lightly by 4′,6-diamidino-2-phenylindole (DAPI), revealing that the papaya genome is largely euchromatic. However, highly condensed heterochromatin knobs were observed on most chromosomes (Supplementary Fig. 2), concentrated in the centromeric and pericentromeric regions. The lengths of the pachytene bivalents that are heavily stained only account for approximately 17% of the genome. However, these cytologically distinct and highly condensed heterochromatic regions could represent 30–35% of the genomic DNA11. A large portion of the heterochromatic DNA was probably not covered by the WGS sequence. The 271 Mb of contig sequence should represent about 75% of the papaya genome and more than 90% of the euchromatic regions, which is similar to the 92.1% of the EST and 92.4% of genetic markers covered by the assembled genome and the theoretical 95% coverage by 3× WGS sequence12.

Gene annotation was carried out using the TIGR Eukaryotic Annotation Pipeline. The assembled genome was masked based on similarity to known repeat elements in RepBase and the TIGR Plant Repeat Database, plus a de novo papaya repeat database (see Methods). Ab initio gene predictions were combined with spliced alignments of proteins and transcripts to produce a reference gene set of 28,629 gene models (Supplementary Table 3). A total of 21,784 (76.1%) of the predicted papaya genes with average length of 1,057 base pairs (bp) have similarity to proteins in the non-redundant database from the National Center for Biotechnology Information, with 9,760 (44.8%) of these supported by papaya unigenes. Among 6,845 genes with average length 309 bp that had no hits to the non-redundant proteins, only 515 (7.5%) were supported by papaya unigenes, implying that the number of predicted papaya-specific genes was inflated. If the 515 genes with unigene support represent 44.8% of the total, then 1,150 predicted papaya-specific genes may be real, and the number of predicted genes in the assembled papaya genome would be 22,934. Considering the assembled genome covers 92.1% of the unigenes and 92.4% of the mapped genetic markers, the number of predicted genes in the papaya genome could be 7.9% higher, or 24,746, about 11–20% less than Arabidopsis (based on either the 27,873 protein coding and RNA genes, or including the 3,241 novel genes)2,13, 34% less than rice3, 46% less than poplar4 and 19% less than grape5 (Table 1).

Comparison of the papaya genome with that of Arabidopsis sheds new light on angiosperm evolutionary history in several ways. Considering only the 200 longest papaya scaffolds, we found 121 co-linear blocks. The papaya blocks range in size from 1.36 Mb containing 181 genes to 0.16 Mb containing 19 genes (a statistical, rather than a biological, lower limit); the corresponding Arabidopsis regions range from 0.69 Mb containing 163 genes to 60 kilobases (kb) containing 18 genes. Across the 121 papaya segments for which co-linearity can be detected, 26 show primary correspondence (that is, excluding the effects of ancient triplication detailed below) to only one Arabidopsis segment, 41 to two, 21 to three, 30 to four, and only 3 to more than four.

The fact that many papaya segments show co-linearity with two to four Arabidopsis segments (Fig. 1, and Supplementary Figs 3 and 4) is most parsimoniously explained if either one or two genome duplications have affected the Arabidopsis lineage since its divergence from papaya. Although it was suspected that the most recent Arabidopsis genome duplication, α14, might affect only a subset of the Brassicales15, previous phylogenetic dating of these events15 had suggested that the more ancient β-duplication occurred early in the eudicot radiation, well before the Arabidopsis–Carica divergence. This incongruity is under investigation.

In contrast, individual Arabidopsis genome segments correspond to only one papaya segment, indicating that no genome duplication has occurred in the papaya lineage since its divergence from Arabidopsis about 72 million years ago5. The lack of relatively recent papaya genome doubling is further supported by an L-shaped distribution of intra-EST correspondence for papaya (not shown). However, multiple genome/subgenome alignments (see Supplementary Methods) reveal evidence in papaya of the ancient 'γ' genome duplication shared with Arabidopsis and poplar that is postulated to have occurred near the origin of angiosperms14. Indeed, both papaya (with no subsequent duplication) and poplar (with a relatively low rate of duplicate gene loss) suggest that γ was not a duplication but a triplication (Fig. 1), with triplicated patterns evident for about 25% of the 247 Mb comprising the 200 largest papaya scaffolds.

Figure 1 | Alignment of co-linear regions from Arabidopsis (green), papaya (magenta), poplar (blue) and grape (red). 'Vv chr16r' is an unordered ultracontig that has been assigned to grape chromosome 16. Triangles represent individual genes with transcriptional orientations. Several Arabidopsis regions belong to previously identified duplication segments (α3, α11, α20, β6, γ7, shown to the right)23. The whole syntenic alignment supports four distinct whole-genome duplication events: α, β within the Arabidopsis lineage, an independent duplication in poplar, and γ which is shared by all four eudicot genomes. Co-linear regions can be grouped into three γ sub-genomes based on Camin–Sokal parsimony criteria.

Table 1 | Statistics of sequenced plant genomes
Statistic | Carica papaya | Arabidopsis thaliana | Populus trichocarpa | Oryza sativa (japonica) | Vitis vinifera
Size (Mbp) | 372 | 125 | 485 | 389 | 487
Number of chromosomes | 9 | 5 | 19 | 12 | 19
G+C content total (%) | 35.3 | 35.0 | 33.3 | 43.0 | 36.2
Gene number | 24,746 | 31,114* | 45,555 | 37,544 | 30,434
Average gene length (bp per gene) | 2,373 | 2,232 | 2,300 | 2,821 | 3,399
Average intron length (bp) | 479 | 165 | 379 | 412 | 213
Transposons (%) | 51.9 | 14 | 42 | 34.8 | 41.4
* The gene number of Arabidopsis is based on the 27,873 protein-coding and RNA genes from The Arabidopsis Information Resource website (http://www.arabidopsis.org/portals/genAnnotation/genome_snapshot.jsp) and recently published 3,241 novel genes6.
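The papaya gene-number estimate quoted in the text (1,150 likely real papaya-specific genes, 22,934 genes in the assembly, 24,746 in the whole genome) can be retraced step by step. The short Python sketch below only reproduces that arithmetic from the figures given in the text and Table 1; the variable names are illustrative, and the use of the rounded rates 44.8% and 7.9% follows the text rather than any stated procedure.

```python
# Figures quoted in the text.
genes_with_hits = 21_784        # predicted genes similar to known proteins
supported_no_hit = 515          # unigene-supported genes without a database hit
support_rate = 0.448            # 9,760 of 21,784 genes with hits have unigene support
uncovered_fraction = 0.079      # 100% minus the 92.1% of unigenes found in the assembly

# If 515 supported genes are 44.8% of the real papaya-specific genes:
real_specific = round(supported_no_hit / support_rate)

# Genes in the assembled sequence, then scaled up to the whole genome.
assembled_total = genes_with_hits + real_specific
genome_total = round(assembled_total * (1 + uncovered_fraction))


def percent_fewer(papaya: int, other: int) -> int:
    """Percentage by which the papaya gene number falls short of another genome."""
    return round((1 - papaya / other) * 100)


# Gene numbers of the other genomes, from the text and Table 1.
comparison = {
    "Arabidopsis (27,873 genes)": percent_fewer(genome_total, 27_873),
    "Arabidopsis (31,114 genes)": percent_fewer(genome_total, 31_114),
    "rice": percent_fewer(genome_total, 37_544),
    "poplar": percent_fewer(genome_total, 45_555),
    "grape": percent_fewer(genome_total, 30_434),
}

print(real_specific, assembled_total, genome_total)  # 1150 22934 24746
print(comparison)
```

The percentages come out at 11 and 20 for the two Arabidopsis counts, 34 for rice, 46 for poplar and 19 for grape, matching the ranges stated in the text.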

This is most probably an underestimate that will increase as papaya contiguity is improved. Triplication in papaya and poplar corresponds closely to the triplication suggested by an independent analysis of the grape genome5.

A few hundred papaya chromosomal segments were aligned using BLASTZ to their one to four syntenic regions in Arabidopsis, and the results examined visually using the Genome Evolution (GEvo) viewer16. The orthologous region of grape was also included5, making the alignment a six-way comparison. One example is given in Supplementary Fig. 5: a 500 kb segment of papaya, its four 60 kb syntenic, orthologous Arabidopsis segments and the 400 kb orthologous segment of grape.

For the homologous Arabidopsis segments that are discernibly co-linear (by MC-SCANNER) to the 200 longest papaya scaffolds, 34.8% of Arabidopsis genes in any one segment correspond to a papaya gene, whereas only 24.8% of papaya genes in any one segment correspond to an Arabidopsis gene. Moreover, the Arabidopsis homologous segments contain fewer genes, on average only about 57.9% of the number in their papaya counterparts.

Papaya provides a useful outgroup necessary to detect subfunctionalization. Supplementary Fig. 6 is a GEvo screenshot of a blastn alignment illustrating subfunctionalization of conserved non-coding sequences (CNSs)17 upstream of two syntenic, duplicate Arabidopsis genes and their single papaya orthologous gene. The α-duplicated genomes within Arabidopsis are perfect for CNS discovery18.

Comparative analysis of the papaya and Arabidopsis 5′ untranslated regions showed that only 14% of orthologous promoter pairs exhibit significantly higher levels of sequence identity than random comparisons (Supplementary Figs 7 and 8). Although some highly conserved promoters show substantial conservation across much of their length, sequence similarity for most orthologous papaya promoters is indistinguishable from background.

Global analysis of all inferred protein models from papaya, Arabidopsis, poplar, grape and rice clusters the 208,901 non-redundant protein sequences into 39,706 similarity groups, or 'tribes'19, 11,851 of which contain two or more genes (see Supplementary Methods). Tribes with multiple genes in a species typically correspond to families or subfamilies of genes; however, tribes may also contain just one gene ('singleton tribes'). In papaya, 25,312 gene models were classified into 12,958 tribes, 5,669 of which were specific to papaya (Supplementary Table 4). Of the papaya-specific tribes, 5,314 were singleton tribes. EST support was markedly lower for genes in papaya-specific tribes (below 14%) than in tribes that included genes from at least one other taxon (72.4%).

To investigate the smaller number of genes in papaya, we compared tribe membership from each of the five sequenced angiosperm species (Supplementary Table 5). Among the 6,726 tribes that contain genes from both Arabidopsis and papaya, 3,595 contain equal numbers of genes from both species. However, tribes with more Arabidopsis genes outnumber those with more papaya genes by more than 2:1 (2,153:979). The trend of smaller number of papaya genes is widespread across tribes of all sizes and major functional categories (Supplementary Table 6 and Supplementary Fig. 9).

We then examined membership in the 815 tribes with members identified as being likely transcription factors in the Arabidopsis transcription factor database (http://arabidopsis.med.ohio-state.edu/AtTFDB/). This set includes 2,897 genes in Arabidopsis and 2,438 in papaya (a ratio of 1.19:1). The details of tribe membership are illustrated for 25 exemplar families and superfamilies (Fig. 2), where most transcription-factor tribes have fewer genes in papaya than Arabidopsis. Some transcription-factor tribes had more genes in papaya, specifically RWP-RK, MADS-box, Scarecrow, TCP and Jumonji gene families. Interestingly, the difference in MADS protein family size appears to be due to expanded numbers for half of the 36 MADS tribes. The other 18 MADS tribes had fewer papaya genes, including 14 that were not found in papaya.

[Figure 2: bar chart; y axis, number of genes (0–300); bars for Arabidopsis and papaya; x axis, transcription-factor families.]

Figure 2 | Comparison of gene numbers in transcription-factor tribe or related tribes from Arabidopsis and papaya. Most transcription factors are represented by fewer genes in papaya than Arabidopsis. Transcription-factor names are given, with values after the names corresponding to: number of tribes with genes assigned to transcription factor group, number of tribes with smaller counts in papaya than Arabidopsis, number of tribes with equal counts in papaya and Arabidopsis, number of tribes with larger counts in papaya, and number of tribes with zero members in papaya. Supporting data are provided in Supplementary Table 8.

Assuming that a generalized angiosperm could potentially require only the types and minimal numbers of genes that are shared among divergent plant species, we examined each of the tribes shared among the five angiosperms with sequenced genomes. The number of genes required in a minimal angiosperm genome is based on the observed minimum number of genes across each of the shared tribes (Table 2). When the smallest observed number is taken for each evolutionarily conserved tribe, a minimal angiosperm genome of 13,311 genes is estimated. Papaya has the smallest number of genes for more tribes than any other sequenced taxon (4,515, or 76% of 5,925 shared tribes), reinforcing the notion that papaya has fewer genes than any angiosperm sequenced so far.

Only 55 nucleotide-binding site (NBS)-containing R genes were identified in papaya; about 28% of the 200 NBS genes in Arabidopsis20 and less than 10% of the 600 NBS genes in rice21. Resistance proteins also have a carboxy-terminal leucine-rich repeat (LRR) domain. These NBS-containing R-gene families can be subdivided into three classes: NBS–LRR, toll interleukin receptor (TIR)–NBS–LRR, and coiled-coil (CC)–NBS–LRR on the basis of their amino-terminal region. Papaya NBS–LRR outnumbered both TIR–NBS–LRR and CC–NBS–LRR genes, in contrast to both poplar (with more CC–NBS–LRR genes4) and Arabidopsis (with more TIR–NBS–LRR). More than 50% of the NBS-type R genes were clustered in about eight scaffolds, indicating that resistance gene evolution may involve duplication and divergence of linked gene families.

Table 2 | Deduced potential minimal angiosperm gene number based on species with smallest number of genes for each tribe

                                      Carica   Arabidopsis  Populus      Oryza sativa  Vitis     Shared  Minimal gene
                                      papaya   thaliana     trichocarpa  (japonica)    vinifera  tribes  number
Shared tribes with minimum            4,515    3,597        1,548        3,657         3,597     5,925   13,331
Number of unique tribes               5,708    2,950        6,338        13,003        3,567
Number of conserved tribes lost or
  missing from each species           405      113          28           429           175
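The minimal-gene-number estimate behind Table 2 can be illustrated with a minimal sketch. The tribe names and counts below are invented toy data, not values from the paper, and the function names are hypothetical. The sketch assumes that a tribe is "shared" when every one of the five species has at least one gene in it, and that a tie for the smallest count is credited to every tied species (the per-species counts in Table 2 add up to more than 5,925, which is consistent with this reading but does not confirm it).

```python
from typing import Dict, List

SPECIES: List[str] = ["papaya", "arabidopsis", "poplar", "rice", "grape"]

# Toy gene counts per tribe and species (illustrative only).
TOY_TRIBES: Dict[str, Dict[str, int]] = {
    "tribe_A": {"papaya": 1, "arabidopsis": 3, "poplar": 4, "rice": 2, "grape": 2},
    "tribe_B": {"papaya": 2, "arabidopsis": 2, "poplar": 5, "rice": 3, "grape": 2},
    "tribe_C": {"papaya": 4, "arabidopsis": 1, "poplar": 2, "rice": 1, "grape": 3},
    # tribe_D has no Arabidopsis member, so it is not a shared tribe.
    "tribe_D": {"papaya": 2, "arabidopsis": 0, "poplar": 1, "rice": 1, "grape": 1},
}


def shared_tribes(tribes: Dict[str, Dict[str, int]],
                  species: List[str]) -> Dict[str, Dict[str, int]]:
    """Keep only tribes in which every species has at least one gene."""
    return {
        name: counts
        for name, counts in tribes.items()
        if all(counts.get(sp, 0) > 0 for sp in species)
    }


def minimal_gene_number(tribes: Dict[str, Dict[str, int]],
                        species: List[str]) -> int:
    """Sum the smallest per-species gene count over all shared tribes."""
    return sum(
        min(counts[sp] for sp in species)
        for counts in shared_tribes(tribes, species).values()
    )


def tribes_with_minimum(tribes: Dict[str, Dict[str, int]],
                        species: List[str]) -> Dict[str, int]:
    """For each species, count shared tribes where it has the smallest count.

    Ties are credited to every tied species (an assumption of this sketch).
    """
    tally = {sp: 0 for sp in species}
    for counts in shared_tribes(tribes, species).values():
        smallest = min(counts[sp] for sp in species)
        for sp in species:
            if counts[sp] == smallest:
                tally[sp] += 1
    return tally


if __name__ == "__main__":
    print(minimal_gene_number(TOY_TRIBES, SPECIES))   # 1 + 2 + 1 = 4
    print(tribes_with_minimum(TOY_TRIBES, SPECIES))
```

On the toy data the three shared tribes contribute 1, 2 and 1 genes, giving a minimal set of 4 genes; papaya holds the minimum in two of the three shared tribes.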

993 © 2008 Nature Publishing Group LETTERS NATURE | Vol 452 | 24 April 2008

Homologues for genes involved in cellulose biosynthesis are present in papaya and Arabidopsis, with more cellulose synthase genes in poplar, perhaps associated with wood formation. Papaya has at least 32 putative β-glucosyl transferase (GT1) genes compared with 121 in Arabidopsis identified using sequence alignment. A total of 38 and 40 cellulose synthase-related genes (GT2) were identified in papaya using the 48 poplar and 31 Arabidopsis genes as queries, respectively. These genes include 11 cellulose synthase (CesA) genes, the same number as in Arabidopsis but 7 fewer than in poplar. Putative cellulose orientation genes (COBRA) were more abundant in Arabidopsis (12) than in papaya (8).

Papaya also has a similar complement though fewer genes for cell-wall synthesis than Arabidopsis. Papaya and Arabidopsis, respectively, have 6 and 12 callose synthase genes (GT2); 15 and 15 xyloglucan α-1,2-fucosyl transferases (GT37); 5 and 7 β-glucuronic acid transferases in families GT43 and GT47; and 27 and 42 in GT8 that includes the galacturonosyl transferases, associated with pectin synthesis.

The cell wall of plants is capable of both plastic and elastic extension, and controls the rate and direction of cell expansion22. Despite fewer whole-genome duplications, papaya has a similar number of putative expansin A genes (24) as Arabidopsis (26) and poplar (27), and more expansin B genes (10) than Arabidopsis (6) and poplar (3).

In contrast to expansion-related genes, papaya has on average about 25% fewer cell-wall degradation genes than Arabidopsis, in some cases far fewer. For example, papaya and Arabidopsis, respectively, have 4 and 12 endoxylanase-like genes in glycoside hydrolase family 10 (GH10); 29 and 67 pectin methyl esterases (carbohydrate esterase family 8); 28 and 69 polygalacturonases (GH28); 15 and 49 xyloglucan endotransglycosylase/hydrolases (GH16); 18 and 25 β-1,4-endoglucanases (GH9); 42 and 91 β-1,3-glucanases (GH17); and 15 and 27 pectin lyases (PL1).

A semi-woody giant herb that accumulates lignin in the cell wall at an intermediate level between Arabidopsis and poplar, papaya generally has intermediate numbers of lignin synthetic genes, fewer than poplar but more than Arabidopsis despite fewer opportunities for duplication in papaya. Poplar, papaya and Arabidopsis have 37, 30 and 18 candidate genes for the lignin synthesis pathway, respectively4,23, with papaya having an intermediate number of genes for the PAL, C4H, 4CL and HCT gene families, and only one COMT and two C3H genes. In contrast, poplar has three C3H genes, which are presumed to convert p-coumaroyl quinic acid to caffeoyl shikimic acid, whereas there are two in papaya and one in Arabidopsis. Papaya, Arabidopsis and poplar each have two genes in the family CCoAOMT, which are presumed to convert caffeic acid to ferulic acid4. Compared with these other plants, papaya has the fewest genes in the CCR gene family (1 gene) and the most in the F5H (4 genes) and CAD gene families (18 genes), which all mediate later steps of the lignin biosynthesis pathway.

More starch-associated genes in papaya, a perennial, may be due to a greater need for storage in leaves, stem and developing fruit than in Arabidopsis, an ephemeral that stores oil in the seed. Papaya and Arabidopsis, respectively, have 13 and 6 putative starch synthase (GT5) genes; 8 and 3 starch branching genes; 6 and 3 isoamylases (GH13); and 12 and 9 β-amylases (GH14). Early unloading of fruit sugar in papaya is probably symplastic24, with five genes for sucrose synthase/sucrose phosphate synthase (GT4); seven are reported for Arabidopsis. Five acid invertase (GH32) sequences were found in papaya whereas 11 have been reported in Arabidopsis. Papaya has at least seven putative neutral invertase (GH32) genes; Arabidopsis has six. Wall-associated kinases (WAK) are thought to be involved in the regulation of vacuolar invertases, with 17 in Arabidopsis and only 10 in papaya. Arabidopsis and papaya have 14 and 7 hexose transporters, respectively. The greater number of genes for sugar accumulation in Arabidopsis may reflect recent genome duplications.

Papaya has undergone particularly striking amplification of genes involved in volatile development. Papaya and Arabidopsis, respectively, have 18 and 8 genes for cinnamyl alcohol dehydrogenase; 2 and 1 genes for cinnamate-4-hydroxylase; 9 and 3 genes for phenylalanine ammonia lyase; and 24 and 3 limonene cyclase genes.

Papaya ripening is climacteric, with the rise in ethylene production occurring at the same time as the respiratory increase25. Papaya and Arabidopsis, respectively, have similar numbers of genes involved in ethylene synthesis, with four each for S-adenosyl methionine synthase (SAM synthase); 8 and 13 for aminocyclopropane carboxylic acid (ACC) synthase (ACS); 8 and 12 for ACC oxidase (ACO); and 42 and 64 for ethylene-responsive binding factors (AP2/ERF).

Because papaya grows in tropical climates where daily light/dark cycles do not change much over the year, we can ask if more or fewer light/circadian genes are required to synchronize with the environment. In fact, there are fewer light/clock genes in the papaya genome (49% and 34% of poplar and Arabidopsis, respectively; Supplementary Table 7). However, among the core circadian clock genes, the pseudo-response regulators (PRRs; Supplementary Fig. 10) have expanded in poplar compared with Arabidopsis, and the papaya PRR7 cluster has seemingly duplicated with the recent poplar salicoid-specific genome duplication4 (Supplementary Fig. 11). Against the backdrop of fewer overall genes, the parallel expansion of the PRRs is consistent with circadian timing being important in papaya.

The PAS–FBOX–KELCH genes control light signalling and flowering time; however, the only papaya orthologue (ZTL) lacks an obvious KELCH domain compared with Arabidopsis and poplar, which have five and one KELCH domains, respectively (Supplementary Fig. 10). In fact, the papaya genome contains fewer KELCH domains (37 compared with 130 and 74 in Arabidopsis and poplar, respectively). In contrast, there are three constitutive photomorphogenic 1 (COP1) paralogues in the papaya genome compared with only one in Arabidopsis (Supplementary Tables 7 and 8). A similar expansion has been noted in moss (), which has nine COP1 paralogues that are hypothesized to aid in tolerance to ultraviolet light (Supplementary Fig. 12)26. Both KELCH domains and the WD-40 of the COP1 family form β-propellers and play a role in light-mediated ubiquitination. There is not a general expansion of WD-40 genes in papaya (173 compared with 227 in Arabidopsis). Perhaps papaya has developed an alternative way of integrating light or timing information specific to day-neutral plants, such as a strict adherence to the diel light/dark cycle that is better served by the COP-mediated system.

Sex determination in papaya is controlled by a pair of primitive sex chromosomes, with a small male-specific region of the Y chromosome (MSY)8. The physical map of the MSY is currently estimated by chromosome walking to span about 8 Mb (ref. 27). Two scaffolds in the current female-genome sequence align to the physical map based on BAC end sequences, spanning 4.5 Mb and including 254 predicted protein-encoding genes, of which 75 (29.5%) have EST support (Supplementary Table 9 and Supplementary Fig. 13). If adjusted for the percentage of unigene validation for other genes (48.0%), the estimated number of genes in the X-specific region would be 156. The average gene density would be one gene per 19.5 kb, lower than the estimated genome average of one gene per 14.3 kb. By contrast, among seven completely sequenced MSY BACs totalling 1.2 Mb, a total of four expressed genes were found on two of the BACs14,28. The somewhat lower-than-average gene density in the X-specific scaffolds is accompanied by more repetitive DNA (58.3%) than the genome-wide average, perhaps because this region is near the centromere28. Re-analysis of the repetitive DNA content of the MSY BACs, to include the new papaya-specific repeat families identified herein, increased the average repeat sequence to 85.6%, with 54.1% Gypsy and 1.9% Copia retro-elements (Supplementary Table 10). This compares with an earlier estimate of 17.9% using the Arabidopsis repeat database alone28.
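The EST-support adjustment used for the X-specific scaffolds can be checked with a short sketch. The function names are hypothetical, and the sketch assumes the adjustment is a simple division of the EST-supported gene count by the genome-wide unigene validation rate; the paper does not spell out the calculation.

```python
def est_supported_fraction(genes_with_est: int, predicted_genes: int) -> float:
    """Fraction of predicted genes that have EST support."""
    return genes_with_est / predicted_genes


def adjusted_gene_estimate(genes_with_est: int, validation_rate: float) -> int:
    """Scale the EST-supported gene count by the genome-wide validation rate.

    Assumption: if only `validation_rate` of real genes are expected to have
    EST support, the number of real genes is about genes_with_est / rate.
    """
    return round(genes_with_est / validation_rate)


if __name__ == "__main__":
    # Figures quoted in the text: 254 predicted genes, 75 with EST support,
    # and 48.0% unigene validation for other genes.
    print(f"{est_supported_fraction(75, 254):.1%}")  # 29.5%
    print(adjusted_gene_estimate(75, 0.48))           # 156
```

Both values agree with the text: 75 of 254 genes is 29.5%, and 75 / 0.48 rounds to 156 genes.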

The SunUp genome has presented an opportunity to analyse transgene insertion sites critically. Southern blot analysis was key in the initial identification of transgenic insertion fragments and was performed with probes spanning the entire 19,567-bp transformation vector used for bombardment (Supplementary Fig. 14). Among the identified inserts were the functional coat-protein transgene conferring resistance to papaya ringspot virus, which was found in an intact 9,789-bp fragment of the transformation plasmid, and a 1,533-bp fragment composed of a truncated, non-functional tetA gene and flanking vector backbone sequence. The structures of the coat-protein transgene and tetA region insertion sites were determined from cloned sequences. Southern analysis also confirmed a 290-bp non-functional fragment of the nptII gene originally identified by WGS sequence analysis (Supplementary Fig. 15). Five of the six flanking sequences of the three insertions are nuclear DNA copies of papaya chloroplast DNA fragments. The integration of the transgenes into chloroplast DNA-like sequences may be related to the observation that transgenes produced either by Agrobacterium-mediated or biolistic transformation are often inserted in AT-rich DNA29, as is the chloroplast DNA of papaya and other land plants. Four of the six insert junctions have sequences that match topoisomerase I recognition sites, which are associated with breakpoints in genomic DNA transgene insertion sites and transgene rearrangements29. The presence of these inserts was confirmed by high-throughput MUMmer30 analysis for each region of the transformation vector. Evidence for the presence of other transgene inserts is not conclusive (Supplementary Note 3).

Its lower overall gene number notwithstanding, striking variations in gene number within particular functional groups, superimposed on the average approximate 20% reduction in papaya gene number relative to Arabidopsis, may be related to key features of papaya morphological evolution. Despite a closer evolutionary relationship to Arabidopsis, papaya shares with poplar an increased number of genes associated with cell expansion, consistent with larger plant size; and lignin biosynthesis, consistent with the of tree-like habit. Amplification of starch-synthesis genes in papaya relative to Arabidopsis is consistent with a greater need for storage in leaves, stem and developing fruit of this perennial. Tremendous amplification in papaya of genes related to volatile development implies strong natural selection for enhanced attractants that may be key to fruit (seed) dispersal by animals and which may also have attracted the attention of aboriginal peoples. This also foreshadows what we might expect to discover in the genomes of other fragrant-fruited trees, as well as plants with striking fragrance of leaves (herbs), flowers or other organs.

Arguably, the sequencing of the genome of SunUp papaya makes it the best-characterized commercial transgenic crop. Because papaya ringspot virus is widespread in nearly all papaya-growing regions, SunUp could serve as a transgenic germplasm source that could be used to breed suitable cultivars resistant to the virus in various parts of the world. The characterization of the precise transgenic modifications in SunUp papaya should also serve to lower regulatory barriers currently in place in some countries.

METHODS SUMMARY
Gene annotation. Papaya unigenes from complementary DNA were aligned to the unmasked genome assembly, which was then used in training ab initio gene prediction software. Spliced alignments of proteins from the plant division of GenBank, and transcripts from related angiosperms, were generated. Gene predictions were combined with spliced alignments of proteins and transcripts to produce a reference gene set. Detailed descriptions are given in Methods.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.

Received 6 September 2007; accepted 22 February 2008.

1. Gonsalves, D. Control of papaya ringspot virus in papaya: a case study. Annu. Rev. Phytopathol. 36, 415–437 (1998).
2. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
3. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
4. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
5. Jaillon, C. O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
6. Arumuganathan, K. & Earle, E. D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
7. Fitch, M. M. M., Manshardt, R. M., Gonsalves, D., Slightom, J. L. & Sanford, J. C. Virus resistant papaya plants derived from tissues bombarded with the coat protein gene of papaya ringspot virus. Bio/technology 10, 1466–1472 (1992).
8. Liu, Z. et al. A primitive Y chromosome in papaya marks incipient sex chromosome evolution. Nature 427, 348–352 (2004).
9. Wikström, N., Savolainen, V. & Chase, M. W. Evolution of the angiosperms: calibrating the family tree. Proc. R. Soc. Lond. B 268, 2211–2220 (2001).
10. Storey, W. B. Papaya. in Outlines of Perennial Crop Breeding in the Tropics (eds Ferwerda, F. P. and Wit, F.) 389–408 (H. Veenman & Zonen, Wageningen, 1969).
11. Li, L. et al. Genome-wide transcription analyses in rice using tiling microarrays. Nature Genet. 38, 124–129 (2006).
12. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
13. Hanada, K., Zhang, X., Borevitz, J. O., Li, W.-H. & Shiu, S.-H. A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res. 17, 632–640 (2007).
14. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
15. Schranz, M. E. & Mitchell-Olds, T. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18, 1152–1165 (2006).
16. Lyons, E. & Freeling, M. How to usefully compare homologous plant genes and chromosomes as DNA sequence. Plant J. 53, 661–673 (2008).
17. Inada, D. C. et al. Conserved noncoding sequences in the grasses. Genome Res. 13, 2030–2041 (2003).
18. Thomas, B. C., Rapaka, L., Lyons, E., Pedersen, B. & Freeling, M. Arabidopsis intragenomic conserved noncoding sequence. Proc. Natl Acad. Sci. USA 104, 3348–3353 (2007).
19. Wall, P. K. et al. PlantTribes: a gene and gene family resource for comparative genomics in plants. Nucleic Acids Res. 36, D970–D976 (2008).
20. Meyers, B. C., Morgante, M. & Michelmore, R. W. TIR-X and TIR-NBS proteins: two new families related to disease resistance TIR-NBS-LRR proteins encoded in Arabidopsis and other plant genomes. Plant J. 32, 77–92 (2002).
21. Zhou, T. et al. Genome-wide identification of NBS genes in japonica rice reveals significant expansion of divergent non-TIR NBS-LRR genes. Mol. Genet. Genomics 271, 402–415 (2004).
22. Fry, S. C. Primary cell wall metabolism: tracking the careers of wall polymers in living plant cells. New Phytol. 161, 641–675 (2004).
23. Ehlting, J. et al. Global transcript profiling of primary stems from Arabidopsis thaliana identifies candidate genes for missing links in lignin biosynthesis and transcriptional regulators of fiber differentiation. Plant J. 42, 618–640 (2005).
24. Zhou, L. L. & Paull, R. E. Sucrose metabolism during papaya (Carica papaya) fruit growth and ripening. J. Am. Soc. Hortic. Sci. 126, 351–357 (2001).
25. Paull, R. E. & Chen, N. J. Postharvest variation in cell wall-degrading enzymes of papaya (Carica papaya L.) during fruit ripening. Plant Physiol. 72, 382–385 (1983).
26. Richardt, S., Lang, D., Reski, R., Frank, W. & Rensing, S. A. PlanTAPDB, a phylogeny-based resource of plant transcription-associated proteins. Plant Physiol. 143, 1452–1466 (2007).
27. Yu, Q. et al. Low X/Y divergence of four pairs of papaya sex-linked genes. Plant J. 53, 124–132 (2008).
28. Yu, Q. et al. Chromosomal location and gene paucity of the male specific region on papaya Y chromosome. Mol. Genet. Genomics 278, 177–185 (2007).
29. Sawasaki, T., Takahashi, M., Goshima, N. & Morikawa, H. Structures of transgene loci in transgenic Arabidopsis plants obtained by particle bombardment: junction regions can bind to nuclear matrices. Gene 218, 27–35 (1998).
30. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank X. Wan, J. Saito and A. Young at the University of Hawaii for technical assistance; C. Detter at the DOE Joint Genome Institute; F. MacKenzie, O. Veatch and T. Uhm at the Hawaii Agriculture Research Center; L. Li, W. Teng, Y. Wu, Y. Yang, C. Zhou, N. Wang, P. Wang and D. Fei at the Tianjin Biochip Corporation, Tianjin Economic-Technological Development Area, Tianjin; and R. Herdes, L. Diebold, R. Kim, A. Hernandez, S. Ali and L. Bynum at the University of Illinois at Urbana-Champaign. This papaya genome-sequencing project was given support by the University of Hawaii and the US Department of Defense grant number W81XWH0520013 to M.A., the Maui High Performance Computing Center to M.A., the Hawaii Agriculture Research Center to R.M. and Q.Y., and Nankai University, China, to L.W. Other support to the papaya genome project included the United States Department of Agriculture T-STAR program; a United States Department of Agriculture–Agricultural Research Service cooperative agreement (CA 58-3020-8-134) with the Hawaii Agriculture Research Center; the University of Illinois; the National Science Foundation Plant Genome Research Program; and Tianjin Municipal Special Fund for Science and Technology Innovation Grant 05FZZDSH00800. We thank P. Englert, former chancellor of the University of Hawaii, for initial infrastructure support of the research.

Author Information The papaya WGS sequence is deposited at DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank under accession number ABIM00000000. The version described in this paper is the first version, ABIM01000000. The GenBank accession numbers of the papaya ESTs are EX227656–EX303501. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to M.A. ([email protected]) or L.W. ([email protected]).

996 © 2008 Nature Publishing Group doi:10.1038/nature06856

METHODS
Genome assembly. The Genome sequence was assembled by Arachne31. WGS reads and BAC end reads were trimmed by LUCY and screened for organellar sequences32. Two approaches were applied to screening and removing reads of presumably organellar origin to alleviate the load in assembling highly repetitive regions by WGS assembly software. The first approach was an iterative process, in which reads were assembled, contigs matching with organellar genomes identified, constituent reads removed, and the process repeated by two or three more rounds. This approach produced the read sets for the released assemblies Stripped3 and Stripped4. The second approach was to remove plasmid clones and BAC clones of presumably organellar origin by identifying clones with both end reads matching entirely with organellar genomes, with physical map information an amendment to the identification of BAC clones. Two rounds of iterative screening based on pairing information of assembled and unplaced reads were added to the second approach to generate the read set for the released Papaya1.0 assembly.

The sequence error rates were estimated by aligning assembled shotgun sequences with two finished BACs (GenBank accession numbers EF661023 and EF661026). The error rate of the assembly at 3× coverage or deeper (74.2% of assembled sequences) was less than 0.01% based on average quality values of 20 or greater in trimmed sequence. The error rate at 2× coverage (16.3%) was 0.37%. The error rate at 1× coverage (9.5%) was approximately 0.75%, because these sequences are at the ends of the contigs (and sequence reads) where the sequence quality declined.

Genome annotation. Gene annotation was conducted following the TIGR Eukaryotic Annotation Pipeline. Repeat sequences were identified in the assembled genome and masked by RepeatMasker, RepeatScout and TransposonPSI, based on known repeat elements in RepBase databases and TIGR Plant Repeat Databases, and the papaya novel repeat database constructed in this study33,34. Program to Assemble Spliced Alignments (PASA)35 was used to generate spliced alignments of papaya unigenes to the unmasked assembly, which was then used in training ab initio gene prediction software Augustus, GlimmerHMM and SNAP36–38. Ab initio gene prediction software Fgenesh, Genscan and TWINSCAN were trained on Arabidopsis39–41. Spliced alignments of proteins from the plant division of GenBank and transcripts from related angiosperms (Arabidopsis thaliana, Glycine max, Gossypium hirsutum, Medicago truncatula, Nicotiana tabacum, Oryza sativa, Zea mays) were generated by the Analysis and Annotation Tool (AAT)42. Spliced alignment of proteins from the Pfam database were generated using GeneWise43,44. Gene predictions generated by Augustus, Fgenesh, Genscan, GlimmerHMM, SNAP and TWINSCAN were combined with spliced alignments of proteins and transcripts to produce a reference gene set using the evidence-based combiner EVidenceModeler (EVM)45. Protein domains were predicted using InterProScan against protein databases (PRINTS, Pfam, ProDom, PROSITE, SMART)46–50.

Construction of papaya repeat database. We used a combination of homology-based and de novo methods to identify signatures of transposable elements in the papaya genome. We used RepeatMasker (http://www.repeatmasker.org) in combination with a custom-built library of plant repeat elements for our initial classification of transposable elements. The customized library was generated by combining plant repeats from Repbase and plant repeat databases from TIGR (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats)33. Repeat elements identified as ribosomal RNA sequences in the TIGR databases match a large fraction of the papaya genome (about 3%). Ribosomal RNAs were identified separately, and therefore were excluded from our repeat library, leaving a database of 76,924 repeat sequences that were used to search the papaya genome.

Homology-based methods are limited to finding elements that have not diverged too greatly from known repeats. Because databases of known transposable elements are necessarily incomplete, we used additional de novo methods to search for repeat elements in papaya contigs. For this, we applied two recently developed repeat-finding tools, PILER and RepeatScout to the complete set of contigs from the papaya genome34,51. PILER was able to find 428 repeat families whereas RepeatScout found 6,596 repeat sequences.

The repeat families obtained from PILER and RepeatScout were annotated using a combination of manual curation (786 repeat families) and automated analysis. For the automated annotation, the combined data set from PILER and RepeatScout was made non-redundant (using CD-HIT at the 90% similarity level), leaving behind 6,240 repeat families52. As a post-processing step, we selected only those families that had at least ten good (E value < 1 × 10^-20) BLAST matches to papaya contigs. The resulting data set contained 2,198 repeat families in the papaya genome. BLAST searches against non-redundant and PTREP (http://wheat.pw.usda.gov/ITMI/Repeats) were then used to identify repeat families matching genes associated with transposons and retrotransposons. This procedure discovered an additional 103 repeat families that could be annotated as being retrotransposons. The combined database of 889 annotated papaya-specific transposable-element sequences was used in addition to the database of known repeats to annotate the papaya genome. The remaining, unannotated repeat families (1,455 sequences with no matches to known genes) were then used to estimate the additional repeat content of the genome.

31. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
32. Chou, H. H. & Holmes, M. H. DNA sequence quality trimming and vector removal. Bioinformatics 17, 1093–1104 (2001).
33. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker (Release Open-3.1.3, 2006).
34. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 (suppl.), i351–i358 (2005).
35. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
36. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (suppl.), ii215–ii225 (2003).
37. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
38. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
39. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
40. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
41. Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene structure prediction. Bioinformatics 17 (suppl. 1), S140–S148 (2001).
42. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997).
43. Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic Acids Res. 34 (Database issue), D247–D251 (2006).
44. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
45. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7.1–R7.19 (2008).
46. Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120 (2005).
47. Attwood, T. K. et al. PRINTS and its automatic supplement, prePRINTs. Nucleic Acids Res. 31, 400–402 (2003).
48. Bru, C. et al. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33 (Database issue), D212–D215 (2005).
49. Hulo, N. et al. The PROSITE database. Nucleic Acids Res. 34 (Database issue), D227–D230 (2006).
50. Letunic, I. et al. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34 (Database issue), D257–D260 (2006).
51. Edgar, R. C. & Myers, E. W. PILER: Identification and classification of genomic repeats. Bioinformatics 21 (suppl.), i152–i158 (2005).
52. Li, W. & Godzik, A. CD-HIT: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, i1658–i1659 (2006).

Vol 457 | 29 January 2009 | doi:10.1038/nature07723 ARTICLES

The Sorghum bicolor genome and the diversification of grasses

Andrew H. Paterson1, John E. Bowers1, Rémy Bruggmann2, Inna Dubchak3, Jane Grimwood4, Heidrun Gundlach5, Georg Haberer5, Uffe Hellsten3, Therese Mitros6, Alexander Poliakov3, Jeremy Schmutz4, Manuel Spannagl5, Haibao Tang1, Xiyin Wang1,7, Thomas Wicker8, Arvind K. Bharti2, Jarrod Chapman3, F. Alex Feltus1,9, Udo Gowik10, Igor V. Grigoriev3, Eric Lyons11, Christopher A. Maher12, Mihaela Martis5, Apurva Narechania12, Robert P. Otillar3, Bryan W. Penning13, Asaf A. Salamov3, Yu Wang5, Lifang Zhang12, Nicholas C. Carpita14, Michael Freeling11, Alan R. Gingle1, C. Thomas Hash15, Beat Keller8, Patricia Klein16, Stephen Kresovich17, Maureen C. McCann13, Ray Ming18, Daniel G. Peterson1,19, Mehboob-ur-Rahman1,20, Doreen Ware12,21, Peter Westhoff10, Klaus F. X. Mayer5, Joachim Messing2 & Daniel S. Rokhsar3,4

Sorghum, an African grass related to sugar cane and maize, is grown for food, feed, fibre and fuel. We present an initial analysis of the ~730-megabase Sorghum bicolor (L.) Moench genome, placing ~98% of genes in their chromosomal context using whole-genome shotgun sequence validated by genetic, physical and syntenic information. is largely confined to about one-third of the sorghum genome with gene order and density similar to those of rice. Retrotransposon accumulation in recombinationally recalcitrant heterochromatin explains the ~75% larger genome size of sorghum compared with rice. Although gene and repetitive DNA distributions have been preserved since palaeopolyploidization ~70 million years ago, most duplicated gene sets lost one member before the sorghum–rice divergence. Concerted evolution makes one duplicated chromosomal segment appear to be only a few million years old. About 24% of genes are grass-specific and 7% are sorghum-specific. Recent gene and microRNA duplications may contribute to sorghum’s drought tolerance.

The Saccharinae plants include some of the most efficient biomass accumulators, providing food and fuel from starch (sorghum) and sugar (sorghum and Saccharum, sugar cane), and have potential for use as cellulosic biofuel crops (sorghum, sugar cane, Miscanthus). Of singular importance to Saccharinae productivity is C4 photosynthesis, comprising biochemical and morphological specializations that increase net carbon assimilation at high temperatures¹. Despite their common photosynthetic strategy, the Saccharinae show much morphological and genomic variation (Supplementary Fig. 1).

Its small genome (~730 Mb) makes sorghum an attractive model for functional genomics of Saccharinae and other C4 grasses. Rice, the first fully sequenced cereal genome, is more representative of C3 photosynthetic grasses. Drought tolerance makes sorghum especially important in dry regions such as northeast Africa (its centre of diversity) and the southern plains of the United States. Genetic variation in the partitioning of carbon into sugar stores versus cell wall mass, and in perenniality and associated features such as tillering and stalk reserve retention², makes sorghum an attractive system for the study of traits important in perennial cellulosic biomass crops. Its high level of inbreeding makes it an attractive association genetics system³. Transgenic approaches to sorghum improvement are constrained by high gene flow to weedy relatives⁴, making knowledge of its intrinsic genetic potential all the more important.

Reconstructing a repeat-rich genome from shotgun sequences

Preferred approaches to sequencing entire genomes are currently to apply shotgun sequencing⁵ either to a minimum ‘tiling path’ of genomic clones, or to genomic DNA directly. The latter approach, whole-genome shotgun (WGS) sequencing, is widely used for mammalian genomes, being fast, relatively economical and reducing cloning bias. However, its applicability has been questioned for repetitive DNA-rich plant genomes⁶.

Despite a repeat content of ~61%, a high-quality genome sequence was assembled from homozygous sorghum genotype BTx623 by using WGS and incorporating the following: (1) ~8.5 genome equivalents of paired-end reads⁷ from genomic libraries spanning a ~100-fold range of insert sizes (Supplementary Table 1), resolving many repetitive regions; and (2) high-quality read length averaging 723 bp, facilitating assembly. Comparison with 27 finished bacterial artificial chromosomes (BACs) showed the WGS assembly to be >98.46% complete and accurate to <1 error per 10 kb (Supplementary Note 2.5).
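As a rough illustration of what ~8.5 genome equivalents means at a 723-bp average read length, the sketch below estimates the number of reads implied by those two figures. It assumes that genome equivalents are counted against the ~730-Mb genome size and that every read contributes the average length; the text does not report the read count, so the result is only an order-of-magnitude check (on the order of 8 to 9 million reads under these assumptions). The function name is hypothetical.

```python
def reads_for_coverage(genome_size_bp, coverage, mean_read_len_bp):
    """Number of reads whose combined length equals coverage x genome size."""
    return coverage * genome_size_bp / mean_read_len_bp


# Values quoted in the text; treating them as exact is an assumption.
GENOME_SIZE_BP = 730e6      # ~730 Mb sorghum genome
COVERAGE = 8.5              # ~8.5 genome equivalents
MEAN_READ_LEN_BP = 723      # average high-quality read length

n_reads = reads_for_coverage(GENOME_SIZE_BP, COVERAGE, MEAN_READ_LEN_BP)
print(f"approx. {n_reads / 1e6:.1f} million reads")
```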

¹Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA. ²Waksman Institute for Microbiology, Rutgers University, Piscataway, New Jersey 08854, USA. ³DOE Joint Genome Institute, Walnut Creek, California 94598, USA. ⁴Stanford Human Genome Center, Stanford University, Palo Alto, California 94304, USA. ⁵MIPS/IBIS, Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany. ⁶Center for Integrative Genomics, University of California, Berkeley, California 94720, USA. ⁷College of Sciences, Hebei Polytechnic University, Tangshan, Hebei 063000, China. ⁸Institute of Plant Biology, University of Zurich, Zollikerstrasse 107, 8008 Zurich, Switzerland. ⁹Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina 29631, USA. ¹⁰Institut für Entwicklungs- und Molekularbiologie der Pflanzen, Heinrich-Heine-Universität, Universitätsstrasse 1, D-40225 Düsseldorf, Germany. ¹¹Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA. ¹²Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA. ¹³Department of Biological Sciences, ¹⁴Department of Botany and Plant Pathology, Purdue University, West Lafayette, Indiana 47907, USA. ¹⁵International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru 502 324, India. ¹⁶Department of Horticulture and Institute for Plant Genomics and Biotechnology, Texas A&M University, College Station, Texas 77843, USA. ¹⁷Institute for Genomic Diversity, Cornell University, Ithaca, New York 14853, USA. ¹⁸Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. ¹⁹Mississippi Genome Exploration Laboratory, Mississippi State University, Starkville, Mississippi 39762, USA. ²⁰National Institute for Biotechnology & Genetic Engineering (NIBGE), Faisalabad, Pakistan. ²¹USDA NAA Robert Holley Center for Agriculture and Health, Ithaca, New York 14853, USA.
551 © 2009 Macmillan Publishers Limited. All rights reserved ARTICLES NATURE | Vol 457 | 29 January 2009

Comparison with a high-density genetic map⁸, a ‘finger-print contig’ (FPC)-based physical map⁹, and the rice sequence⁶ improved the sorghum WGS assembly (Supplementary Notes 1 and 2). Among the 201 largest scaffolds (spanning 678.9 Mb, 97.3% of the assembly), 28 showed discrepancies with two or more of these lines of evidence (Supplementary Note 2.6), often near repetitive elements. After breaking the assembly at the points of discrepancy, the resulting 229 scaffolds have an N50 (number of scaffolds that collectively cover at least 50% of the assembly) of 35 and L50 (length of the shortest scaffold among those that collectively cover 50% of the assembly) of 7.0 Mb. A total of 38 (2%) of 1,869 FPC contigs⁹ were deemed erroneous, containing >5 BAC ends that fell into different sequence scaffolds.

A total of 127 scaffolds containing 625.7 Mb (89.7%) of DNA and 1,476 FPC contigs could be assigned to chromosomal locations and oriented. Fifteen out of twenty chromosome ends terminated in telomeric repeats. The other 102 scaffolds were generally smaller (53.2 Mb, 7.6%), with 85 (83%) containing far greater-than-average abundance of the Cen38 (ref. 10) centromeric repeat, and with only 374 predicted genes. These 102 scaffolds merged only 193 FPC contigs, presumably due to the greater abundance of repeats that are recalcitrant to clone-based physical mapping⁹ and may be omitted in BAC-by-BAC approaches¹¹.

Genome size evolution and its causes

The ~75% larger quantity of DNA in the genome of sorghum compared with rice is mostly heterochromatin. Alignment to genetic⁸ and cytological maps¹² suggests that sorghum and rice have similar quantities of euchromatin (252 and 309 Mb, respectively; Supplementary Table 7), accounting for 97–98% of recombination (1,025.2 cM and 1,496.5 cM, respectively) and 75.4–94.2% of genes in the respective cereals, with largely collinear gene order⁹. In contrast, sorghum heterochromatin occupies at least 460 Mb (62%), far more than in rice (63 Mb, 15%). The ~3× genome expansion in maize since its divergence from sorghum¹³ has been more dispersed: recombinogenic DNA has grown 4.5× to ~1,382 Mb, much more than can be explained by genome duplication¹⁴.

The net size expansion of the sorghum genome relative to rice largely involved long terminal repeat (LTR) retrotransposons. The sorghum genome contains 55% retrotransposons, intermediate between the larger maize genome (79%) and smaller rice genome (26%). Sorghum more closely resembles rice in having a higher ratio of gypsy-like to copia-like elements (3.7 to 1 and 4.9 to 1) than maize (1.6 to 1: Supplementary Table 10).

Although recent retroelement activity is widely distributed across the sorghum genome, turnover is rapid (as in other cereals¹⁵) with pericentromeric elements persisting longer. Young LTR retrotransposon insertions (<0.01 million years (Myr) ago) appear randomly distributed along chromosomes, suggesting that they are preferentially eliminated from gene-rich regions⁹ but accumulate in gene-poor regions (Fig. 1; see also Supplementary Note 3.1). Insertion times suggest a major wave of retrotransposition <1 Myr ago, after a smaller wave 1–2 Myr ago (Supplementary Fig. 2).

CACTA-like elements, the predominant sorghum DNA transposons (4.7% of the genome), seem to relocate genes and gene fragments, as do rice ‘Pack-MULEs’¹⁶ and maize helitrons¹⁷. Many sorghum CACTA elements are non-autonomous deletion derivatives in which transposon genes have been replaced with non-transposon DNA including exons from one or more cellular genes as exemplified for family G118 (Fig. 2). Among 13,775 CACTA elements identified (Supplementary Note 3.4), 200 encode no transposon proteins but contain at least one cellular gene fragment.

In total, DNA transposons constitute 7.5% of the sorghum genome, intermediate between maize (2.7%) and rice (13.7%; Supplementary Table 10). Miniature inverted-repeat transposable elements, 1.7% of the genome, are associated with genes (Fig. 1; see also Supplementary Note 3) as in other cereals⁶. Helitrons, ~0.8% of the genome, nearly all lack helicase in sorghum as in maize¹⁷, but carry fewer gene fragments in sorghum than maize (Supplementary Note 3.5). Organellar DNA insertion has contributed only 0.085% to the sorghum nuclear genome, far less than the 0.53% of rice (Supplementary Note 2.7).

The gene complement of sorghum

Among 34,496 sorghum gene models, we found ~27,640 bona fide protein-coding genes by combining homology-based and ab initio gene prediction methods with expressed sequences from sorghum,

Figure 1 | Genomic landscape of sorghum chromosomes 3 and 9. Area charts quantify retrotransposons (55%), genes (6% exons, 8% introns), DNA transposons (7%) and centromeric repeats (2%). Lines between chromosomes 3 and 9 connect collinear duplicated genes. Heat-map tracks detail the distribution of selected elements. Figures for all sorghum chromosomes are in Supplementary Note 3. Cen38, sorghum-specific centromeric repeat¹⁰; RTs, retrotransposons (class I); LTR-RTs, long terminal repeat retrotransposons; DNA-TEs, DNA transposons (class II).
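The scaffold statistics quoted in the text use a specific convention: N50 is a count of scaffolds and L50 is a length. The minimal sketch below computes both values for a list of scaffold lengths under exactly that convention. The function name and the toy lengths are hypothetical and are not the sorghum scaffold sizes. Many assembly tools use the opposite naming (N50 as a length, L50 as a count), so it is worth checking which definition a given report follows.

```python
def scaffold_n50_l50(lengths):
    """Return (n50_count, l50_length) under the convention used in the text.

    n50_count:  number of largest scaffolds that collectively cover at
                least 50% of the total assembly length.
    l50_length: length of the shortest scaffold in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return count, length
    raise ValueError("no scaffold lengths supplied")


# Toy example (hypothetical lengths in Mb, total 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50_count, l50_length = scaffold_n50_l50(toy_lengths)
print(n50_count, l50_length)
```

In the toy example the two largest scaffolds (80 + 70) reach half of the 300-Mb total, so the count is 2 and the length is 70.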


maize and sugar cane (Supplementary Note 4). Evidence for alternative splicing is found in 1,491 loci.

Figure 2 | CACTA element deletion derivatives that carry gene fragments. CACTA family G118 has only one complete and presumably autonomous ‘mother’ element (9,043 bp). Among 18 deletion derivatives, only the terminal 500–2,500 bp are conserved, with 8 carrying gene fragments internally. One relatively homogeneous subgroup (106, 111 and 112) presumably arose recently, whereas other derivatives are unique. The locations of the hits to known rice proteins are indicated as coloured boxes. The descriptions of the foreign gene fragments are indicated underneath the boxes. HP, hypothetical protein.

Another 5,197 gene models are usually shorter than the bona fide genes (often <150 amino acids); have few exons (often one) and no expressed sequence tag (EST) support (compared with 85% for bona fide genes); are more diverged from rice genes; and are often found in large families with ‘hypothetical’, ‘uncharacterized’ and/or retroelement-associated annotations, despite repeat masking (Supplementary Note 4). A high concentration in pericentromeric regions where bona fide genes are scarce (Fig. 1) suggests that many of these low-confidence gene models are retroelement-derived. We also identified 727 processed pseudogenes and 932 models containing domains known only from transposons.

The exon size distributions of orthologous sorghum and rice genes agree closely, and intron position and phase show >98% concordance (Supplementary Note 5). Intron size has been conserved between sorghum and rice, although it has increased in maize owing to transpositions¹⁸.

Most paralogues in sorghum are proximally duplicated, including 5,303 genes in 1,947 families of ≥2 genes (Supplementary Note 4.3). The longest tandem gene array is 15 cytochrome P450 genes. Other sorghum-specific tandem gene expansions include haloacid dehalogenase-like hydrolases (PF00702), FNIP repeats (PF05725), and male sterility proteins (PF03015).

We confirmed the genomic locations of 67 known sorghum microRNAs (miRNAs) and identified 82 additional miRNAs (Supplementary Note 4.4). Five clusters located within 500 bp of each other represent putative polycistronic miRNAs, similar to those in Arabidopsis and Oryza. Natural antisense miRNA precursors (nat-miRNAs) of family miR444 (ref. 19) have been identified in three copies.

Comparative gene inventories of angiosperms

The number and sizes of sorghum gene families are similar to those of Arabidopsis, rice and poplar (Fig. 3 and Supplementary Note 4.6). A total of 9,503 (58%) sorghum gene families were shared among all four species and 15,225 (93%) with at least one other species. Nearly 94% (25,875) of high-confidence sorghum genes have orthologues in rice, Arabidopsis and/or poplar, and together these gene complements define 11,502 ancestral angiosperm gene families represented in at least one contemporary grass and rosid genome. However, 3,983 (24%) gene families have members only in the grasses sorghum and rice; 1,153 (7%) appear to be unique to sorghum.

Pfam domains that are over-represented, under-represented or even absent in sorghum relative to rice, poplar and Arabidopsis may reflect biological peculiarities specific to the Sorghum lineage (Supplementary Table 20). Domains over-represented in sorghum are usually present in the other organisms, a notable exception being the α-kafirin domain that accounts for most seed storage protein and corresponds to maize zeins²⁰ but which is absent from rice.

Nucleotide-binding-site–leucine-rich-repeat (NBS-LRR) containing proteins associated with the plant immune system are only about half as frequent in sorghum as in rice. A search with 12 NBS domains from published rice, maize, wheat and Arabidopsis gene sequences revealed 211 NBS-LRR coding genes in sorghum, 410 in rice and 149 in Arabidopsis²¹. Sorghum NBS-LRR genes mostly encode the CC type of N-terminal domains. Only two sorghum genes (Sb02g005860 and Sb02g036630) contain the TIR domain, and neither contains an NBS domain. NBS-LRR genes are most abundant on sorghum chromosome 5 (62), and its rice homologue (chromosome 11, 106). Enrichment of NBS-LRR genes in these corresponding genomic regions suggests conservation of R gene location, in contrast to a proposal that R gene movement may be advantageous²².

Evolution of distinctive pathways and processes

The evolution of C4 photosynthesis in the Sorghum lineage involved redirection of C3 progenitor genes as well as recruitment and functional divergence of both ancient and recent gene duplicates. The sole sorghum C4 pyruvate orthophosphate dikinase (ppdk) and the phosphoenolpyruvate carboxylase kinase (ppck) gene and its two isoforms (produced by the whole genome duplication) have only single orthologues in rice. Additional duplicates formed in maize after the sorghum–maize split (Zmppck2 and Zmppck3). The C4 NADP-dependent malic enzyme (me) gene has an adjacent isoform but each corresponds to a different maize homologue, suggesting tandem duplication before the sorghum–maize split. The C4 malate dehydrogenase (mdh) gene and its isoform are also adjacent, but share 97% amino acid similarity and correspond to the single known maize Mdh gene, suggesting tandem duplication in sorghum after its split with maize. The rice Me and Mdh genes are single

copy, suggesting duplication and recruitment to the C4 pathway after the Panicoideae–Oryzoideae divergence (Supplementary Note 9).

Figure 3 | Orthologous gene families between sorghum, Arabidopsis, rice and poplar. The numbers of gene families (clusters) and the total numbers of clustered genes are indicated for each species and species intersection.

The sorghum sequence reinforces inferences previously based only on rice, about how different grass and dicotyledon gene inventories relate to their respective types of cell walls²³,²⁴. In grasses, cellulose microfibrils coated with mixed-linkage (1→3),(1→4)-β-D-glucans are interlaced with glucuronoarabinoxylans and an extensive complex of phenylpropanoids²⁵. The sorghum sequence largely corroborates differences between dicotyledons and rice in the distribution of cell wall biogenesis genes (Supplementary Note 10). For example, the CesA/Csl superfamily and callose synthases have either diverged to form new subgroups or functionally non-essential subgroups were selectively lost, such as CslB and CslG lost from the grasses, and CslF and CslH lost from species with dicotyledon-like cell walls²⁶. The previously rice-unique CslF and CslH genes are present in sorghum. Arabidopsis contains a single group F GT31 gene, whereas sorghum and rice contain six and ten, respectively.

The characteristic adaptation of sorghum to drought may be partly related to expansion of one miRNA and several gene families. Rice miRNA 169g, upregulated during drought stress²⁷, has five sorghum homologues (sbi-MIR169c, sbi-MIR169d, sbi-MIR169.p2, sbi-MIR169.p6 and sbi-MIR169.p7). The computationally predicted target of the sbi-MIR169 subfamily comprises members of the plant nuclear factor Y (NF-Y) B transcription factor family, linked to improved performance under drought in Arabidopsis and maize²⁸. Cytochrome P450 domain-containing genes, often involved in scavenging toxins such as those accumulated in response to stress, are abundant in sorghum with 326 versus 228 in rice. Expansins, enzymes that break hydrogen bonds and are responsible for a variety of growth responses that could be linked to the durability of sorghum, occur in 82 copies in sorghum versus 58 in rice and 40 each in Arabidopsis and poplar.

Duplication and diversification of cereal genomes

Whole-genome duplication in a common ancestor of cereals is reflected in sorghum and rice gene ‘quartets’ (Fig. 4). A total of 19,929 (57.8%) sorghum gene models were in blocks collinear with rice (Supplementary Note 6). After the shared whole-genome duplication, only one copy was retained for 13,667 (68.6%) collinear genes with 13,526 (99%) being orthologous in rice–sorghum, indicating that most gene losses predate taxon divergence. Both sorghum and rice retained both copies of 4,912 (14.2%) genes, whereas sorghum lost one copy of 1,070 (3.1%) and rice lost one copy of 634 (1.8%). These patterns are likely to be predictive of other grass genomes, as the major grass lineages diverged from a common ancestor at about the same time²⁹ (see also Supplementary Note 7).

Although most post-duplication gene loss happened in a common cereal ancestor, some lineage-specific patterns occur. A total of 2 and 10 protein functional (Pfam) domains showed enrichment for duplicates and singletons (respectively) in sorghum but not rice (Supplementary Note 6.1). Because the sorghum–rice divergence is thought to have happened 20 Myr or more after genome duplication²⁹, this suggests that even long-term gene loss differentially affects gene functional groups.

One genomic region has been subject to a high level of concerted evolution. It was previously suggested that rice chromosomes 11 and 12 share a ~5–7-Myr-old segmental duplication³⁰⁻³². We found a duplicated segment in the corresponding regions of sorghum chromosomes 5 and 8 (Fig. 5). Sorghum–sorghum and rice–rice paralogues from this region show rates of synonymous DNA substitution (Ks) of 0.44 and 0.22, respectively, consistent with only 34 and 17 Myr of divergence. However, the Ks value of sorghum–rice orthologues is 0.63, similar to the respective genome-wide averages (0.81, 0.87). We hypothesize that the apparent segmental duplication actually resulted from the pan-cereal whole-genome duplication and became differentiated from the remainder of the chromosome(s) owing to concerted evolution acting independently in sorghum, rice and perhaps other cereals. Gene conversion and illegitimate recombination are more frequent in the rice 11–12 region than elsewhere in the genome³³. Physical and genetic maps suggest shared terminal segments of the corresponding chromosomes in wheat (4, 5)³⁴, foxtail millet (VII, VIII) and pearl millet (linkage groups 1, 4)³⁵.

Synthesis and implications

Comparison of the sorghum, rice and other genomes clarifies the grass gene set. Pairs of orthologous sorghum and rice genes combined with recent paralogous duplications define 19,542 conserved grass gene families, each representing one gene in the sorghum–rice common ancestor. Our sorghum gene count is similar to that in a manually curated rice annotation (RAP2)³⁶, but this similarity masks some differences. About 2,054 syntenic orthologues shared by our sorghum annotation and the TIGR5 (ref. 37) rice annotation are absent from RAP2. Conversely, ~12,000 TIGR5 annotations may be transposable elements or pseudogenes, comprising large families of hypothetical genes in both sorghum and rice RAP2, often with short exons, few introns and limited EST support. Phylogenetically incongruent cases of apparent gene loss (for example, genes shared by Arabidopsis and sorghum but not rice: Fig. 3) may also suggest sequence gaps or misannotations.

Grass genome architecture may reflect euchromatin-specific effects of recombination and selection, superimposed on non-adaptive processes of mutation and genetic drift that apply to all genomic regions³⁸. Patterns of gene and repetitive DNA organization remain correlated in homologous chromosomes duplicated 70 Myr ago (Fig. 1), despite extensive turnover of specific repetitive elements. Synteny is highest and retroelement abundance lowest in distal chromosomal regions. More rapid retroelement removal from gene-rich euchromatin that frequently recombines than from heterochromatin that rarely recombines supports the hypothesis that recombination may preserve gene structure, order and/or spacing by exposing new insertions to selection⁹. Less euchromatin–heterochromatin polarization in maize, where retrotransposon persistence in euchromatin seems more frequent, may reflect variation in grass genome architecture or perhaps a lingering consequence of more recent genome duplication³⁹.

Identification of conserved DNA sequences may help us to understand essential genes and binding sites that define grasses. Progress in sequencing Brachypodium distachyon⁴⁰ sets the stage for panicoid–oryzoid–pooid phylogenetic triangulation of genomic changes, as well as association of some such changes with phenotypes ranging from molecular (gene expression patterns) to morphological. The divergence between sorghum, rice and Brachypodium is sufficient to randomize

nonfunctional sequence, facilitating conserved noncoding sequence (CNS) discovery⁴¹,⁴² (Supplementary Fig. 9). More distant comparisons to the dicotyledon Arabidopsis show exon conservation but no CNS (Supplementary Fig. 10). Chloridoid and arundinoid genome sequences are needed to sample the remaining grass lineages, and an outgroup such as Ananas (pineapple) or Musa (banana) would further aid in identifying genes and sequences that define grasses.

The fact that the sorghum genome has not re-duplicated in ~70 Myr²⁹ makes it a valuable outgroup for deducing fates of gene pairs and CNS in grasses that have reduplicated. Single sorghum regions correspond to two regions resulting from maize-specific genome doubling³⁹; gene fractionation is evident (Fig. 4), and subfunctionalization is probable (Supplementary Fig. 10). Sorghum may prove especially valuable for unravelling genome evolution in the more closely related Saccharum–Miscanthus clade: two genome duplications since its divergence from sorghum 8–9 Myr ago⁴³ complicate sugar cane genetics⁴⁴ yet Saccharum BACs show substantially conserved gene order with sorghum (Supplementary Note 11).

Conservation of grass gene structure and order facilitates development of DNA markers to support crop improvement. We identified ~71,000 simple-sequence repeats (SSRs) in sorghum (Supplementary List 1); among a sampling of 212, only 9 (4.2%) map to paralogues of their source locus. Conserved-intron scanning primers (Supplementary List 2) for 6,760 genes provide DNA markers useful across many monocotyledons, particularly valuable for ‘orphan cereals’⁴⁵.

As the first sequenced plant genome of African origin, sorghum adds new dimensions to ethnobotanical research. Of particular interest will be the identification of alleles selected during the earliest stages of sorghum cultivation, which are valuable towards testing the hypothesis that convergent mutations in corresponding genes contributed to independent domestications of divergent cereals⁴⁶. Invigorated sorghum improvement would benefit regions such as the African ‘Sahel’ where drought tolerance makes sorghum a staple for human populations that are increasing by 2.8% per year. Sorghum yield improvement has lagged behind that of other grains, in Africa only gaining a total of 37% (western) to 38% (eastern) from 1961–63 to 2005–07 (Supplementary Note 12).

METHODS SUMMARY

Genome sequencing. Approximately 8.5-fold redundant paired-end shotgun sequencing was performed using standard Sanger methodologies from small (~2–3 kb) and medium (5–8 kb) insert plasmid libraries, one fosmid library (~35 kb inserts), and two BAC libraries (insert size 90 and 108 kb). (Supplementary Note 1.)

Integration of shotgun assembly with genetic and physical maps. The largest 201 scaffolds, all exceeding 39 kb, excluding ‘N’s, and collectively representing 678,902,941 bp (97.3%) of nucleotides, were checked for possible chimaeras

Figure 4 | Alignment of sorghum, rice and maize. Dot plots show intergenomic (gold) and intragenomic (black) alignments. One sorghum–rice quartet showing both orthologous and paralogous (duplicated) regions is magnified. Infrequent gene loss (red; see legend) after sorghum–rice divergence causes ‘special cases’ in which there are paralogues but no orthologues. Each sorghum region corresponds to two duplicated maize regions³⁹, with maize gene loss suggested where sorghum loci only match one of the two. Because maize BACs are mostly unfinished, sorghum loci are aligned to the centres. Note the different scale necessary for maize physical distance. Larger dot plots are in Supplementary Note 6.

Figure 5 | Independent illegitimate recombination in corresponding regions of sorghum and rice. Four homologous rice and sorghum chromosomes (11 and 12 in rice; 5 and 8 in sorghum) are shown, with gene densities plotted. ‘L’ and ‘S’ show long and short arms, respectively. Lines show Ks between homologous gene pairs, and colours are used to show different dates of conversion events.
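The main text quotes Ks values of 0.44 and 0.22 for the sorghum chromosome 5/8 and rice chromosome 11/12 paralogues and describes them as consistent with 34 and 17 Myr of divergence. The sketch below shows the usual molecular-clock conversion, T = Ks / (2r), where the factor 2 accounts for substitutions accumulating along both lineages. The rate r = 6.5 × 10⁻⁹ synonymous substitutions per site per year is an assumption, chosen because it reproduces the quoted ages; the text itself does not state the rate, and the function name is hypothetical.

```python
# Assumed synonymous substitution rate (per site per year); not given in the text.
ASSUMED_RATE = 6.5e-9


def ks_to_myr(ks, rate=ASSUMED_RATE):
    """Convert a synonymous substitution rate Ks into divergence time in Myr.

    Uses T = Ks / (2 * rate): both copies accumulate substitutions
    independently after the duplication or speciation event.
    """
    return ks / (2 * rate) / 1e6


for label, ks in [("sorghum-sorghum paralogues", 0.44),
                  ("rice-rice paralogues", 0.22)]:
    print(f"{label}: Ks = {ks:.2f} -> about {ks_to_myr(ks):.0f} Myr")
```

Because concerted evolution resets Ks between the affected paralogues, ages obtained this way for the region in Figure 5 reflect the most recent conversion events rather than the original duplication, which is the point the main text makes.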

suggested by the sorghum genetic map, sorghum physical map, abrupt changes in gene or repeat density, rice gene order, and coverage by BAC or fosmid clones (Supplementary Note 2).

Repeat analysis. De novo searches for LTR retrotransposons used LTR_STRUC. De novo detection of CACTA-DNA transposons and MITEs used custom programs (Supplementary Note 3). Known repeats were identified by RepeatMasker (Open-3-1-8) (http://www.repeatmasker.org) with mips-REdat_6.2_Poaceae, a compilation of grass repeats including sorghum-specific LTR retrotransposons (http://mips.gsf.de/proj/plant/webapp/recat/). The insertion age of full-length LTR-retrotransposons was determined from the evolutionary distance between 5′ and 3′ soloLTR derived from a ClustalW alignment of the two soloLTRs.

Protein-coding gene annotation. Putative protein-coding loci were identified based on BLAST alignments of rice and Arabidopsis peptides and sorghum and maize ESTs. GenomeScan⁴⁷ was applied using maize-specific parameters. Predicted coding structures were merged with EST data from maize and sorghum using PASA⁴⁸.

Intergenomic and intragenomic alignments. Dot plots used ColinearScan⁴⁹ and multi-alignments used MCScan⁵⁰, applied to RAP2³⁶ (mapped representative models, 29,389 loci) and the sbi1.4 annotation set (34,496 loci). Pairwise BLASTP (E < 1 × 10⁻⁵, top five hits), both within each genome and between the two genomes, was used to retrieve potential anchors. Zea BAC sequences and FPC contig coordinates were downloaded (http://www.maizesequence.org, release 7 January 2008). Zea BACs were searched for potential orthologues of Sorghum coding sequences using translated BLAT with a minimum score of 100.

Received 20 August; accepted 9 December 2008.

1. Hatch, M. D. & Slack, C. R. Photosynthesis by sugar-cane leaves: a new carboxylation reaction and pathway of sugar formation. Biochem. J. 101, 103 (1966).
2. Paterson, A. H. et al. The weediness of wild plants: molecular analysis of genes influencing dispersal and persistence of johnsongrass, Sorghum halepense (L.) Pers. Proc. Natl Acad. Sci. USA 92, 6127–6131 (1995).
3. Hamblin, M. T. et al. Equilibrium processes cannot explain high levels of short- and medium-range linkage disequilibrium in the domesticated grass Sorghum bicolor. Genetics 171, 1247–1256 (2005).
4. Morrell, P. L. et al. Crop-to-weed introgression has impacted allelic composition of johnsongrass populations with and without recent exposure to cultivated sorghum. Mol. Ecol. 14, 2143–2154 (2005).
5. Gardner, R. C. et al. The complete nucleotide sequence of an infectious clone of cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic Acids Res. 9, 2871–2888 (1981).
6. Matsumoto, T. et al. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
7. Vieira, J. & Messing, J. The pUC plasmids, an M13mp7-derived system for insertion mutagenesis and sequencing with synthetic universal primers. Gene 19, 259–268 (1982).
8. Bowers, J. E. et al. A high-density genetic recombination map of sequence-tagged sites for Sorghum, as a framework for comparative structural and evolutionary genomics of tropical grains and grasses. Genetics 165, 367–386 (2003).
9. Bowers, J. E. et al. Comparative physical mapping links conservation of microsynteny to chromosome structure and recombination in grasses. Proc. Natl Acad. Sci. USA 102, 13206–13211 (2005).
10. Miller, J. T. et al. Cloning and characterization of a centromere-specific repetitive DNA element from Sorghum bicolor. Theor. Appl. Genet. 96, 832–839 (1998).
11. Venter, J. C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
12. Kim, J. S. et al. Chromosome identification and nomenclature of Sorghum bicolor. Genetics 169, 1169–1173 (2005).
13. Swigonova, Z. et al. Close split of sorghum and maize genome progenitors. Genome Res. 14, 1916–1923 (2004).
23. Carpita, N. C. & Gibeaut, D. M. Structural models of primary cell walls in flowering plants: consistency of molecular structure with the physical properties of the walls during growth. Plant J. 3, 1–30 (1993).
24. McCann, M. C. & Roberts, K. in The Cytoskeletal Basis of Plant Growth and Form (ed. Lloyd, C. W.) 109–129 (Academic Press, 1991).
25. Carpita, N. C. Structure and biogenesis of the cell walls of grasses. Annu. Rev. Plant Physiol. Plant Mol. Biol. 47, 445–476 (1996).
26. Hazen, S. P. et al. Quantitative trait loci and comparative genomics of cereal cell wall composition. Plant Physiol. 132, 263–271 (2003).
27. Zhao, B. T. et al. Identification of drought-induced microRNAs in rice. Biochem. Biophys. Res. Commun. 354, 585–590 (2007).
28. Nelson, D. E. et al. Plant nuclear factor Y (NF-Y) B subunits confer drought tolerance and lead to improved corn yields on water-limited acres. Proc. Natl Acad. Sci. USA 104, 16450–16455 (2007).
29. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
30. Wang, X. et al. Duplication and DNA segmental loss in rice genome and their implications for diploidization. New Phytol. 165, 937–946 (2005).
31. Yu, J. et al. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3, 266–281 (2005).
32. The Rice Chromosomes 11 and 12 Sequencing Consortia. The sequence of rice chromosomes 11 and 12, rich in disease resistance genes and recent gene duplications. BMC Biol. 3, 20 (2005).
33. Wang, X. et al. Extensive concerted evolution of rice paralogs and the road to regaining independence. Genetics 177, 1753–1763 (2007).
34. Singh, N. K. et al. Single-copy genes define a conserved order between rice and wheat for understanding differences caused by duplication, deletion, and transposition of genes. Funct. Integr. Genomics 7, 17–35 (2007).
35. Devos, K. M., Pittaway, T. S., Reynolds, A. & Gale, M. D. Comparative mapping reveals a complex relationship between the pearl millet genome and those of foxtail millet and rice. Theor. Appl. Genet. 100, 190–198 (2000).
36. Tanaka, T. et al. The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res. 36, D1028–D1033 (2008).
37. Ouyang, S. et al. The TIGR rice genome annotation resource: Improvements and new features. Nucleic Acids Res. 35, D883–D887 (2007).
38. Lynch, M. & Conery, J. S. The origins of genome complexity. Science 302, 1401–1404 (2003).
39. Wei, F. et al. Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet. 3, e123 (2007).
40. Huo, N. et al. The nuclear genome of Brachypodium distachyon: Analysis of BAC end sequences. Funct. Integr. Genomics 8, 135–147 (2007).
41. Margulies, E. H. et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl Acad. Sci. USA 102, 4795–4800 (2005).
42. Eddy, S. R. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3, 95–102 (2005).
43. Jannoo, N. et al. Orthologous comparison in a gene-rich region among grasses reveals stability in the sugarcane polyploid genome. Plant J. 50, 574–585 (2007).
44. Ming, R. et al. Sugarcane improvement through breeding and biotechnology. Plant Breed. Rev. 27, 15–118 (2005).
45. Lohithaswa, H. C. et al. Leveraging the rice genome sequence for comparative genomics in monocots. Theor. Appl. Genet. 115, 237–243 (2007).
46. Paterson, A. H. et al. Convergent domestication of cereal crops by independent mutations at corresponding genetic loci. Science 269, 1714–1718 (1995).
47. Yeh, R.-F., Lim, L. P. & Burge, C. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
48. Haas, B. J. et al. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3, research0029.1–0029.12 (2002).
49. Wang, X. Y. et al. Statistical inference of chromosomal homology based on gene
colinearity and applications to Arabidopsis and rice. BMC Bioinform. 7, 447 (2006). 14. Swigonova, Z. et al. On the tetraploid origin of the maize genome. Comp. Funct. 50. Tang,H.etal.Syntenyandcolinearityinplantgenomes.Science320,486–488(2008). Genomics 5, 281–284 (2004). 15. Swigonova, Z., Bennetzen, J. L. & Messing, J. Structure and evolution of the r/b Supplementary Information is linked to the online version of the paper at chromosomal regions in rice, maize and sorghum. Genetics 169, 891–906 (2005). www.nature.com/nature. 16. Jiang, N. et al. Pack-mule transposable elements mediate gene evolution in plants. Nature 431, 569–573 (2004). Acknowledgements We thank the US Department of Energy Joint Genome 17. Brunner, S. et al. Evolution of DNA sequence nonhomologies among maize Institute Community Sequencing Program, J. Bristow, S. Lucas and the JGI inbreds. Plant Cell 17, 343–360 (2005). production sequencing team for sequencing sorghum; and L. Lin for contributions 18. Haberer, G. et al. Structure and architecture of the maize genome. Plant Physiol. to Fig. 1. We appreciate funding from the US National Science Foundation (NSF 139, 1612–1624 (2005). DBI-9872649, 0115903; MCB-0450260), International Consortium for Sugarcane 19. Lu, C. et al. Genome-wide analysis for discovery of rice microRNAs reveals natural Biotechnology, National Sorghum Producers, and a John Simon Guggenheim antisensemicroRNAs(nat-miRNAs).Proc.NatlAcad.Sci.USA105,4951–4956(2008). Foundation fellowship to A.H.P.; US Department of Energy (DE-FG05-95ER20194) 20. Xu, J.-H. & Messing, J. Organization of the prolamin gene family provides insight to J.M.; German Federal Ministry of Education GABI initiative to MIPS (0313117 and into the evolution of the maize genome and gene duplications in grass species. 0314000C); NSF DBI-0321467 to A.N.; and US Department of Proc. Natl Acad. Sci. USA 105, 14330–14335 (2008). Agriculture-Agricultural Research Service to C.A.M., L.Z. and D.W. 21. Meyers, B. C. 
et al. Genome-wide analysis of NBS-LRR-encoding genes in Arabidopsis. Plant Cell 15, 809–834 (2003). Author Information Reprints and permissions information is available at 22. Leister, D. Tandem and segmental gene duplication and recombination in the www.nature.com/reprints. Correspondence and requests for materials should be evolution of plant disease resistance genes. Trends Genet. 20, 116–122 (2004). addressed to A.H.P. ([email protected]).

© 2009 Macmillan Publishers Limited. All rights reserved

The B73 Maize Genome: Complexity, Diversity, and Dynamics

Patrick S. Schnable,1,2,3,4* Doreen Ware,5,6* Robert S. Fulton,7† Joshua C. Stein,6† Fusheng Wei,8† Shiran Pasternak,6 Chengzhi Liang,6 Jianwei Zhang,8 Lucinda Fulton,7 Tina A. Graves,7 Patrick Minx,7 Amy Denise Reily,7 Laura Courtney,7 Scott S. Kruchowski,7 Chad Tomlinson,7 Cindy Strong,7 Kim Delehaunty,7 Catrina Fronick,7 Bill Courtney,7 Susan M. Rock,7 Eddie Belter,7 Feiyu Du,7 Kyung Kim,7 Rachel M. Abbott,7 Marc Cotton,7 Andy Levy,7 Pamela Marchetto,7 Kerri Ochoa,7 Stephanie M. Jackson,7 Barbara Gillam,7 Weizu Chen,7 Le Yan,7 Jamey Higginbotham,7 Marco Cardenas,7 Jason Waligorski,7 Elizabeth Applebaum,7 Lindsey Phelps,7 Jason Falcone,7 Krishna Kanchi,7 Thynn Thane,7 Adam Scimone,7 Nay Thane,7 Jessica Henke,7 Tom Wang,7 Jessica Ruppert,7 Neha Shah,7 Kelsi Rotter,7 Jennifer Hodges,7 Elizabeth Ingenthron,7 Matt Cordes,7 Sara Kohlberg,7 Jennifer Sgro,7 Brandon Delgado,7 Kelly Mead,7 Asif Chinwalla,7 Shawn Leonard,7 Kevin Crouse,7 Kristi Collura,8 Dave Kudrna,8 Jennifer Currie,8 Ruifeng He,8 Angelina Angelova,8 Shanmugam Rajasekar,8 Teri Mueller,8 Rene Lomeli,8 Gabriel Scara,8 Ara Ko,8 Krista Delaney,8 Marina Wissotski,8 Georgina Lopez,8 David Campos,8 Michele Braidotti,8 Elizabeth Ashley,8 Wolfgang Golser,8 HyeRan Kim,8 Seunghee Lee,8 Jinke Lin,8 Zeljko Dujmic,8 Woojin Kim,8 Jayson Talag,8 Andrea Zuccolo,8 Chuanzhu Fan,8 Aswathy Sebastian,8 Melissa Kramer,6 Lori Spiegel,6 Lidia Nascimento,6 Theresa Zutavern,6 Beth Miller,6 Claude Ambroise,6 Stephanie Muller,6 Will Spooner,6 Apurva Narechania,6 Liya Ren,6 Sharon Wei,6 Sunita Kumari,6 Ben Faga,6 Michael J. Levy,6 Linda McMahan,6 Peter Van Buren,6 Matthew W. Vaughn,6 Kai Ying,3 Cheng-Ting Yeh,1,2 Scott J. Emrich,9,10 Yi Jia,3 Ananth Kalyanaraman,9,11 An-Ping Hsia,1,2 W. Brad Barbazuk,12 Regina S. Baucom,13 Thomas P. Brutnell,14 Nicholas C. Carpita,15 Cristian Chaparro,16 Jer-Ming Chia,6 Jean-Marc Deragon,16 James C. Estill,13,17 Yan Fu,2,4 Jeffrey A. Jeddeloh,18 Yujun Han,13,17 Hyeran Lee,19 Pinghua Li,14 Damon R. Lisch,20 Sanzhen Liu,3 Zhijie Liu,6 Dawn Holligan Nagel,13,17 Maureen C. McCann,21 Phillip SanMiguel,22 Alan M. Myers,23 Dan Nettleton,24 John Nguyen,25 Bryan W. Penning,15,21 Lalit Ponnala,26 Kevin L. Schneider,27 David C. Schwartz,28 Anupma Sharma,27 Carol Soderlund,29 Nathan M. Springer,30 Qi Sun,26 Hao Wang,13,17 Michael Waterman,25 Richard Westerman,22 Thomas K. Wolfgruber,27 Lixing Yang,13 Yeisoo Yu,29 Lifang Zhang,6 Shiguo Zhou,28 Qihui Zhu,13,17 Jeffrey L. Bennetzen,13 R. Kelly Dawe,13,17 Jiming Jiang,19 Ning Jiang,31 Gernot G. Presting,27 Susan R. Wessler,13,17 Srinivas Aluru,1,9,32 Robert A. Martienssen,6 Sandra W. Clifton,7 W. Richard McCombie,6 Rod A. Wing,8 Richard K. Wilson7,33‡

We report an improved draft nucleotide sequence of the 2.3-gigabase genome of maize, an important crop plant and model for biological research. Over 32,000 genes were predicted, of which 99.8% were placed on reference chromosomes. Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome. These were responsible for the capture and amplification of numerous gene fragments and affect the composition, sizes, and positions of centromeres. We also report on the correlation of methylation-poor regions with Mu transposon insertions and recombination, and copy number variants with insertions and/or deletions, as well as how uneven gene losses between duplicated regions were involved in returning an ancient allotetraploid to a genetically diploid state. These analyses inform and set the stage for further investigations to improve our understanding of the domestication and agricultural improvements of maize.

1Center for Plant Genomics, Iowa State University, Ames, IA 50011, USA. 2Department of Agronomy, Iowa State University, Ames, IA 50011, USA. 3Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA. 4Center for Carbon Capturing Crops, Iowa State University, Ames, IA 50011, USA. 5U.S. Department of Agriculture (USDA), North Atlantic Area, Robert Holley Center for Agriculture and Health, Ithaca, NY 14853, USA. 6Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA. 7The Genome Center at Washington University, St. Louis, MO 63108, USA. 8Arizona Genomics Institute, School of Plant Sciences and Department of Ecology and Evolutionary Biology, BIO5 Institute for Collaborative Research, University of Arizona, Tucson, AZ 85721, USA. 9Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA. 10Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA. 11School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA. 12Department of Botany, University of Florida, Gainesville, FL 32611, USA. 13Department of Genetics, University of Georgia, Athens, GA 30602, USA. 14Boyce Thompson Institute, Cornell University, Ithaca, NY 14853, USA. 15Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN 47907, USA. 16Université de Perpignan Via Domitia, CNRS, Perpignan, France. 17Department of Plant Biology, University of Georgia, Athens, GA 30602, USA. 18NimbleGen, Madison, WI 53711, USA. 19Department of Horticulture, University of Wisconsin–Madison, Madison, WI 53706, USA. 20Department of Plant Biology, University of California, Berkeley, CA 94720, USA. 21Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA. 22Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN 47907, USA. 23Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA. 24Department of Statistics, Iowa State University, Ames, IA 50011, USA. 25Departments of Mathematics, Biology, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA. 26Cornell University Computational Biology Service Unit, Cornell University, Ithaca, NY 14850, USA. 27Molecular Biosciences and Bioengineering, University of Hawaii, Honolulu, HI 96822, USA. 28Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI 53706, USA. 29BIO5 Institute for Collaborative Research, University of Arizona, Tucson, AZ 85721, USA. 30Department of Plant Biology, University of Minnesota, St. Paul, MN 55108, USA. 31Department of Horticulture, Michigan State University, East Lansing, MI 48824, USA. 32Indian Institute of Technology, Bombay, India. 33Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA.

*These authors contributed equally to this work.
†These authors contributed equally to data production and analysis.
‡To whom correspondence should be addressed. E-mail: [email protected]

Maize (Zea mays ssp. mays L.) was domesticated over the past ~10,000 years from the grass teosinte in Central America (1) and has been subject to cultivation and selection ever since. Maize is an important model organism for fundamental research into the inheritance and functions of genes, the physical linkage of genes to chromosomes, the mechanistic relation between cytological crossovers and recombination, the origin of the nucleolus, the properties of telomeres, epigenetic silencing, imprinting, and transposition (2). Maize also is an important crop, yielding in the USA alone 12 billion (B = 10⁹) bushels of grain from ~86 million acres with a value of $47 B [2008 data from (3)]. Over the last century, breeders have increased grain yields eightfold (4), in part by harnessing heterosis (hybrid vigor), a universal, but poorly understood, phenomenon that can increase yields of hybrids by 15 to 60% relative to inbred parents (5).

The maize genome has undergone several rounds of genome duplication, including that of a paleopolyploid ancestor ~70 million years ago (mya) (6) and an additional whole-genome duplication event about 5 to 12 mya (7, 8), which distinguishes maize from its close relative, Sorghum bicolor (9). The 10 chromosomes of the maize genome are structurally diverse and have undergone dynamic changes in chromatin composition. The size of the maize genome has expanded dramatically (to 2.3 gigabases) over the last ~3 million years via a proliferation of long terminal repeat retrotransposons (LTR retrotransposons) (10).

We sequenced the maize genome using a minimum tiling path of bacterial artificial chromosomes (BACs) (n = 16,848) and fosmid (n = 63) clones derived from an integrated physical and genetic map (11, 12), augmented by comparisons with an optical map (13). Clones were shotgun sequenced (four- to sixfold coverage), followed by automated and manual sequence improvement (14) of the unique regions only, which resulted in the B73 reference genome version 1 (B73 RefGen_v1).

We identified the full complement of maize transposable elements (TEs) accessible from B73 RefGen_v1, which includes active class II DNA TEs and an abundance of class I RNA TEs (15).

1112 20 NOVEMBER 2009 VOL 326 SCIENCE www.sciencemag.org REPORTS

Almost 85% of the B73 RefGen_v1 consists of TEs (table S2). Indeed, the existence of TEs (16), as well as the first members of the CACTA (Spm/En), hAT (Ac), PIF/Harbinger and Mutator superfamilies, and MITE family (Tourist), were all initially discovered in maize (17). Further, both the existence and unparalleled abundance of LTR retrotransposons in plants were originally discovered in maize (18).

The B73 RefGen_v1 contains 855 families of DNA TEs that make up 8.6% of the genome; most of these (82%) were identified in this study (table S2) (14). The most complex of these superfamilies is Mutator, with dramatic variation in element sequence and size, including 262 Pack-MULEs (Mutator-like elements that contain gene fragments) carrying fragments of 226 nuclear genes. About 40,000 nonredundant Mu insertion sites were amplified from Mu-active lines, sequenced, and mapped to B73 RefGen_v1. The nonuniformly distributed Mu insertion sites colocalize with gene-rich regions of the genome that have the highest rates of meiotic recombination per megabase (Fig. 1) (19). Like Mu, most maize DNA TEs (but not the CACTA elements) were enriched in the gene-rich, recombinationally active chromosome ends (Fig. 1 and fig. S1).

Helitrons, a class of DNA elements believed to transpose by a rolling-circle mechanism (20), are present in plants, animals, and fungi, but are particularly active, variable, and abundant in maize (21). Maize contains eight families of Helitrons with a combined copy number of ~20,000, which are particularly active in gene fragment acquisition (table S2). In maize, we observed that Helitrons are located predominantly within gene-rich regions, whereas, in all previously studied plant and animal genomes, they are enriched in gene-poor regions (22, 23).

LTR retrotransposons compose >75% of the B73 RefGen_v1 and are diverse. Most of the 406 families have fewer than 10 copies. LTR retrotransposons exhibited family-specific, nonuniform distributions along chromosomes, e.g., Copia-like elements are overrepresented in gene-rich euchromatic regions, whereas Gypsy-like elements are overrepresented in gene-poor heterochromatic regions (fig. S1) (24, 25). We observed more than 180 acquisitions of nuclear gene fragments inside LTR retrotransposons (table S2).

Protein-encoding and microRNA (miRNA) (26) genes were predicted from assembled or improved BAC contigs by a combination of evidence-based (27) and ab initio approaches, projected to B73 RefGen_v1, and subsequently filtered to a set of 32,540 protein-encoding and 150 miRNA genes (14) (fig. S2). Exon sizes of maize genes were similar to those of their orthologous genes in rice and sorghum, but maize genes contained more large introns because of insertion of repetitive elements (11, 28) (figs. S3 and S4 and tables S5 and S6). A comparative analysis with rice, sorghum, and Arabidopsis revealed similar numbers of gene families (14) (Fig. 2), of which a core set of 8494 families is shared among all four species, and of the 11,892 maize families, all but 465 are conserved with at least one other species. Species- and lineage-specific families point out potential inconsistencies between annotation projects, but also reflect genuine biological differences in gene inventories.

Because of the stringent criteria used for including genes in the filtered gene set (14), we expected to miss some genes. About 95% of a collection of 63,851 full-length maize cDNAs (fl-cDNAs) (29, 30) mapped to B73 RefGen_v1. On the basis of the ratio of fl-cDNA to supported genes in the filtered set, we estimated that this set accounts for at least 85% of all genes in the B73 RefGen_v1 (14).
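The gene-family and fl-cDNA figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives quantities from the counts stated in the text; the variable names are illustrative, and treating "about 95%" as exactly 0.95 is an assumption made for the sake of the calculation.

```python
# Counts taken from the text above; names are illustrative only.
MAIZE_FAMILIES = 11_892      # gene families found in maize
MAIZE_ONLY_FAMILIES = 465    # families not conserved with any other species
CORE_FAMILIES = 8_494        # families shared by maize, rice, sorghum, Arabidopsis
FL_CDNA_TOTAL = 63_851       # full-length maize cDNAs in the collection
FL_CDNA_MAPPED_FRACTION = 0.95  # assumption: "about 95%" taken as exactly 0.95

# Families shared with at least one of the other three species.
shared_families = MAIZE_FAMILIES - MAIZE_ONLY_FAMILIES

# Share of maize families that belong to the four-species core set.
core_share_percent = 100 * CORE_FAMILIES / MAIZE_FAMILIES

# Approximate number of fl-cDNAs that mapped to B73 RefGen_v1.
fl_cdna_mapped = round(FL_CDNA_TOTAL * FL_CDNA_MAPPED_FRACTION)

print(f"families shared with another species: {shared_families}")
print(f"core share of maize families: {core_share_percent:.1f}%")
print(f"fl-cDNAs mapped (approx.): {fl_cdna_mapped}")
```

Under these assumptions roughly seven in ten maize gene families fall in the four-species core, which is consistent with the statement that the four genomes have similar numbers of gene families.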

Fig. 1. The maize B73 reference genome (B73 RefGen_v1): Concentric circles show aspects of the genome. Chromosome structure (A). Reference chromosomes with physical fingerprint contigs (11) as alternating gray and white bands. Presumed centromeric positions are indicated by red bands (31); enlarged for emphasis. Genetic map (B). Genetic linkage across the genome, on the basis of 6363 genetically and physically mapped markers (14, 19). Mu insertions (C). Genome mappings of nonredundant Mu insertion sites (14, 19). Methyl-filtration reads (D). Enrichment and depletion of methyl filtration. For each nonoverlapping 1-Mb window, read counts were divided by the total number of mapped reads. Repeats (E). Sequence coverage of TEs with RepeatMasker with all identified intact elements in maize. Genes (F). Density of genes in the filtered gene set across the genome, from a gene count per 1-Mb sliding window at 200-kb intervals. Sorghum synteny (G) and rice synteny (H). Syntenic blocks between maize and related cereals on the basis of 27,550 gene orthologs. Underlined blocks indicate alignment in the reverse strand. Homoeology map (I). Oriented homoeologous sites of duplicated gene blocks within maize.

Fig. 2. Venn diagram showing unique and shared gene families between and among the three sequenced grasses (maize, rice, and sorghum) and the dicot, Arabidopsis.
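The two windowing schemes named in the Fig. 1 legend (gene counts per 1-Mb sliding window at 200-kb intervals, and read counts per nonoverlapping 1-Mb window divided by the total number of mapped reads) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 0-based coordinates, assigning each gene to a window by its start position, and truncating the last windows at the chromosome end are all assumptions.

```python
from bisect import bisect_left


def sliding_window_gene_counts(gene_starts, chrom_length,
                               window=1_000_000, step=200_000):
    """Count genes per sliding window along one chromosome.

    Assumption: a gene belongs to a window if its 0-based start
    coordinate falls in [window_start, window_end).
    Returns a list of (window_start, window_end, gene_count).
    """
    starts = sorted(gene_starts)
    windows = []
    pos = 0
    while pos < chrom_length:
        end = min(pos + window, chrom_length)  # truncate at chromosome end
        count = bisect_left(starts, end) - bisect_left(starts, pos)
        windows.append((pos, end, count))
        pos += step
    return windows


def window_read_fractions(read_positions, chrom_length,
                          window=1_000_000, total_mapped=None):
    """Fraction of mapped reads falling in each nonoverlapping window.

    `total_mapped` is the denominator; if omitted, the reads passed in
    are treated as the full set (an assumption for this sketch).
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for p in read_positions:
        counts[p // window] += 1
    total = len(read_positions) if total_mapped is None else total_mapped
    if total == 0:
        return [0.0] * n_windows
    return [c / total for c in counts]


# Toy data on a hypothetical 2-Mb chromosome.
toy_genes = [100_000, 250_000, 1_100_000, 1_900_000]
toy_windows = sliding_window_gene_counts(toy_genes, 2_000_000)
toy_fractions = window_read_fractions(
    [10, 500_000, 1_500_000, 1_999_999], 2_000_000)
```

Because adjacent sliding windows overlap by 800 kb, neighboring density values are strongly correlated; that smooths the gene track, while the nonoverlapping windows used for methyl-filtration reads give independent bins.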


The maximum rate of false-positive gene annotations was estimated by aligning ~112 million RNA-seq reads from various tissues to the filtered gene set (14) (figs. S10 and S11). These experiments provided evidence for the transcription of ~91% of the genes in the filtered gene set (29,541 out of 32,540). Manual annotation of 200 randomly chosen genes from the filtered gene set indicated that only two are likely to be TE-derived. Additional manual annotation of smaller sets of selected genetically well-characterized genes (tables S8 to S10) indicated that the vast majority of genes and proteins predicted in the filtered gene set are mostly correct.

Maize centromeres were found to contain variable amounts of the tandem CentC satellite repeat and centromeric retrotransposon elements of maize (CRMs). On the basis of comparisons to B73 whole-genome shotgun data, we initially identified about half of the genome's CentC content (table S13). We captured additional CentC sequence by draft sequencing 101 centromeric repeat–containing BACs and anchoring them to the genetic and physical maps, thereby localizing all of the centromeres (31). We delineated the functional centromeres on the basis of their centromere-specific histone H3 (CENH3) (32) by using chromatin immunoprecipitation (ChIP) with an antibody against CENH3, followed by sequencing. The centromere regions delineated in this way, although mostly incomplete, correlated with a high density of CentC and CRM1/CRM2/CRM3 repeats, but a number of these repeats also occurred outside of the functional centromeres (fig. S12). The CRM2 subfamily appears to be the centromeric repeat most closely associated with CENH3 in maize, as it is more enriched in the CENH3 chromatin fraction than CentC, CRM1, or CRM3 (table S13).

We traversed two centromeres (2 and 5) in their entirety and determined that they differ in size and CENH3 density (31). Because CRM elements have generated recombinants with distinct periods of activity (33, 34), we were able to demonstrate that the regional centromeres of maize are dynamic loci and that the CENH3 domain shifts over time (31).

To protect genome integrity, TEs are usually transcriptionally silenced (35) in part via the RNA-directed DNA methylation (RdDM) pathway, which requires an RNA-dependent RNA polymerase 2 (RDR2). When the maize homolog of RDR2 (36) is mutated, it alters the accumulation of transcripts from many characterized transposons, but unexpectedly, some TEs are down-regulated by loss of RDR2 function (37).

In most plant genomes, genes are less densely methylated than heterochromatic TEs and other repeats. Consequently, ~2× coverage of the maize genome by methylation-filtered (MF) reads includes portions of ~95% of maize genes (38). Mapping MF reads (39) of maize and sorghum onto their respective genomes revealed species-specific distributions of heterochromatic DNA methylation along the reference chromosomes (fig. S13, A and B). It is noteworthy that, in the sorghum genome, hypomethylated genes are largely excluded from the pericentromeric regions, whereas they are dispersed more widely in maize. Visual comparisons between sorghum and maize (14) revealed high levels of coalignment, including centromeres where centromeric repeats are undermethylated relative to the surrounding heterochromatin (39, 40) (fig. S13C). Thus, the B73 RefGen_v1 yields evidence that heavily methylated regions are more condensed during interphase.

Anchoring the B73 RefGen_v1 to a newly developed genetic map (19) revealed that rates of meiotic recombination per megabase are highest at the ends of the reference chromosomes and very low in the middle half of each chromosome surrounding the centromeres (Fig. 1) (19, 41). Although recombination occurs preferentially in genes (2) and gene density shows a similar distribution (Fig. 1), gene density does not fully explain the nonrandom distribution of recombination events, because a pronounced nonuniform distribution is still observed even when gene density is taken into consideration (19). Instead, epigenetic marks, including hypomethylation and histone modifications, are implicated in guiding both Mu insertion and meiotic recombination (19). Epigenetic processes have also been invoked to explain the observation that genomic imprinting contributes to the expression of thousands of genes in maize hybrids (42).

Maize exhibits extremely high levels of both phenotypic and genetic diversity. This genomic diversity was explored with both resequencing (41) and array-based comparative genomic hybridization between the B73 and Mo17 inbred lines (43). This revealed extensive structural variation, including hundreds of copy number variants (CNVs) and thousands of present-absent variants (PAVs). Many of the PAVs, including an ~2-Mb region on chromosome 6, contain intact, expressed single-copy genes that are present in one inbred genome but absent from the other. These haplotype-specific sequences may contribute to heterosis and the substantial degree of phenotypic variation among maize inbreds (43).

After a whole-genome duplication, the return to a genetically diploid state was associated with numerous chromosomal breakages and fusions, as shown by alignment to the genomes of sorghum and the more distantly related rice (Fig. 1 and fig. S14) (12). In contrast, sorghum has experienced relatively few interchromosomal rearrangements since its lineage split with rice (8); therefore, its chromosomal configuration closely resembles the ancestral state of maize's two subgenomes (12). Cosynteny of maize genes to common reference genes in rice or sorghum defined maize's duplicate regions (fig. S15). Although syntenic blocks cover 1832 Mb (~89% of the genome), individual gene losses were common and resulted in retention of only ~8110 genes as duplicate homoeologs (~25% of total genes; ~30% having orthologs in rice and/or sorghum). On the basis of an analysis of GO (Gene Ontology) terms (14, 44) (table S15), retention of genes as duplicates is not random, e.g., retained duplicates are significantly enriched for transcription factors (>1.5-fold; P value = 7.6 × 10⁻²²) (table S15), as is also the case in rice (44) and Arabidopsis (45). An example of biased retention is the CesA family, in which all 10 ancestral sites were retained as duplicates (fig. S16) (46). Using the sorghum genome to project extant maize regions to ancestral chromosomes (14) revealed a strong bias for gene loss (fractionation) between sister regions (table S16 and fig. S17). Fractionation bias has been observed in other plant lineages and species (47–50).

Sites containing proximately duplicated paralogs tend to exist as single copies, or not at all, at corresponding homoeologous positions (table S18). Of the 1454 proximately duplicated paralogs identified (making up 3614 genes), only 126 (~9%) could be found at homoeologous positions (14). Of the remainder, 279 (19%) had a single paralog at the corresponding homoeologous site, and 1049 (72%) had no homoeologs.

Nearly identical paralogs (NIPs) are genes with pairwise alignments of ≥500 bp, ≥98% identity, and ≥95% coverage with other genes (51). Of maize-filtered genes, 2.5% (828 out of 32,540) were NIPs from 386 families, most of which have only two members (n = 349); the largest has nine members. Almost half (46%) of the NIP pairs had both members physically linked within 200 kb of each other, whereas in most of the remaining cases, the two members were distant from each other or on different chromosomes (fig. S18).

Just as cytogenetic and genetic maps (52) revolutionized research and crop improvement over the last century, the B73 maize reference sequence promises to advance basic research and to facilitate efforts to meet the world's growing needs for food, feed, energy, and industrial feed stocks in an era of global climate change. Findings derived from this genome sequence briefly summarized here are described in more detail in a series of companion papers (11, 13, 19, 22, 24–26, 30, 31, 37, 41–43, 46). Annotation data and browser are available at www.maizegenome.org.

References and Notes
1. J. F. Doebley, B. S. Gaut, B. D. Smith, Cell 127, 1309 (2006).
2. J. L. Bennetzen, S. Hake, Handbook of Maize: Genetics and Genomics (Springer, New York, 2009).
3. C. P. National Corn Growers Association, Table showing corn harvested, yield, production, mya price, and value, 1991–2008; http://ncga.com/corn-production-trends.
4. A. F. Troyer, Crop Sci. 46, 528 (2006).
5. D. N. Duvick, Science 286, 418 (1999).
6. A. H. Paterson, J. E. Bowers, B. A. Chapman, Proc. Natl. Acad. Sci. U.S.A. 101, 9903 (2004).
7. G. Blanc, K. H. Wolfe, Plant Cell 16, 1667 (2004).
8. Z. Swigonova et al., Genome Res. 14, 1916 (2004).
9. A. H. Paterson et al., Nature 457, 551 (2009).
10. P. SanMiguel, B. S. Gaut, A. Tikhonov, Y. Nakajima, J. L. Bennetzen, Nat. Genet. 20, 43 (1998).


11. F. Wei et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000715).
12. F. Wei et al., PLoS Genet. 3, e123 (2007).
13. S. Zhou et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000711).
14. Materials and methods are available as supporting material on Science Online.
15. P. SanMiguel et al., Science 274, 765 (1996).
16. B. McClintock, Cold Spring Harbor Symp. Quant. Biol. 16, 13 (1951).
17. C. Feschotte, N. Jiang, S. R. Wessler, Nat. Rev. Genet. 3, 329 (2002).
18. A. Kumar, J. L. Bennetzen, Annu. Rev. Genet. 33, 479 (1999).
19. S. Liu et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000733).
20. V. V. Kapitonov, J. Jurka, Proc. Natl. Acad. Sci. U.S.A. 98, 8714 (2001).
21. S. Lal, N. Georgelis, L. Hannah, in Handbook of Maize: Genetics and Genomics, J. L. Bennetzen, S. Hake, Eds. (Springer, New York, 2008), pp. 329–339.
22. L. Yang, J. L. Bennetzen, Proc. Natl. Acad. Sci. U.S.A., published online 19 November 2009 (10.1073/pnas.0908008106).
23. L. Yang, J. L. Bennetzen, Proc. Natl. Acad. Sci. U.S.A. 106, 12832 (2009).
24. R. S. Baucom et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000732).
25. F. Wei et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000728).
26. L. Zhang, PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000716).
27. H. Liang, W. H. Li, Mol. Biol. Evol. 26, 1195 (2009).
28. G. Haberer et al., Plant Physiol. 139, 1612 (2005).
29. N. N. Alexandrov et al., Plant Mol. Biol. 69, 179 (2009).
30. C. Soderlund et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000740).
31. T. K. Wolfgruber et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000743).
32. C. X. Zhong et al., Plant Cell 14, 2825 (2002).
33. A. Sharma, G. G. Presting, Mol. Genet. Genomics 279, 133 (2008).
34. A. Sharma, K. L. Schneider, G. G. Presting, Proc. Natl. Acad. Sci. U.S.A. 105, 15470 (2008).
35. D. Lisch, Annu. Rev. Plant Biol. 60, 43 (2009).
36. M. Alleman et al., Nature 442, 295 (2006).
37. Y. Jia et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000737).
38. Y. Fu et al., Proc. Natl. Acad. Sci. U.S.A. 102, 12282 (2005).
39. L. E. Palmer et al., Science 302, 2115 (2003).
40. W. Zhang, H. R. Lee, D. H. Koo, J. Jiang, Plant Cell 20, 25 (2008).
41. M. A. Gore et al., Science 326, 1115 (2009).
42. R. A. Swanson-Wagner et al., Science 326, 1118 (2009).
43. N. M. Springer et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000734).
44. C. G. Tian et al., Yi Chuan Xue Bao 32, 519 (2005).
45. C. Seoighe, C. Gehring, Trends Genet. 20, 461 (2004).
46. B. W. Penning et al., Plant Physiol., published online 19 November 2009 (10.1104/pp.109.136804).
47. H. Shaked, K. Kashkush, H. Ozkan, M. Feldman, A. A. Levy, Plant Cell 13, 1749 (2001).
48. K. Song, P. Lu, K. Tang, T. C. Osborn, Proc. Natl. Acad. Sci. U.S.A. 92, 7719 (1995).
49. J. A. Tate, P. Joshi, K. A. Soltis, P. S. Soltis, D. E. Soltis, BMC Plant Biol. 9, 80 (2009).
50. B. C. Thomas, B. Pedersen, M. Freeling, Genome Res. 16, 934 (2006).
51. S. J. Emrich et al., Genetics 175, 429 (2007).
52. B. McClintock, Science 69, 629 (1929).
53. The Maize Genome Sequencing Project supported by NSF award DBI-0527192 (R.K.W., S.W.C., R.S.F., R.A.W., P.S.S., S.A., L.S., D.W., W.R.M., R.A.M.). The Maize Transposable Element Consortium and the Maize Centromere Consortium supported by NSF awards DBI-0607123 (S.R.W., J.L.B., R.K.D., N.J., P.S.M.) and DBI-0421671 (R.K.D., J.J., G.G.P.). Also supported by NSF grants DBI-0321467 (D.W.), DBI-0321711 (P.S.S.), DBI-0333074 (D.W.), DBI-0501818 (D.C.S.), DBI-0501857 (Y.Y.), DBI-0701736 (T.P.B., Q.S.), DBI-0703273 (R.A.M.), and DBI-0703908 (D.W.), and by USDA National Research Initiative Grants 2005-35301-15715 and 2007-35301-18372 from the USDA Cooperative State Research, Education, and Extension Service (P.S.S.) and from the USDA-ARS (408934 and 413089) to D.W., and from the Office of Science (Biological and Environmental Research), U.S. Department of Energy, grant DE-FG02-08ER64702 to N.C.C. and M.C.M. Sequences of the reference chromosomes have been deposited in GenBank as accession numbers CM000777 to CM000786. RNA-sequence reads have been deposited in the Gene Expression Omnibus (GEO) database (www.ncbi.nlm.nih.gov/geo) as accession numbers GSE16136, GSE16868, and GSE16916. Centromeric sequences have been deposited in the National Center for Biotechnology Information, NIH, Trace Archive as accessions 1757396377 to 1757412600 and 2185189231 to 2185200942.

Supporting Online Material
www.sciencemag.org/cgi/content/full/326/5956/1112/DC1
Materials and Methods
SOM Text
Figs. S1 to S18
Tables S1 to S18
References

1 July 2009; accepted 13 October 2009
10.1126/science.1178534

A First-Generation Haplotype Map of Maize

Michael A. Gore,1,2,3*† Jer-Ming Chia,4* Robert J. Elshire,3 Qi Sun,5 Elhan S. Ersoz,3 Bonnie L. Hurwitz,4‡ Jason A. Peiffer,2 Michael D. McMullen,1,6 George S. Grills,7 Jeffrey Ross-Ibarra,8 Doreen H. Ware,1,4§ Edward S. Buckler1,2,3§

Maize is an important crop species of high genetic diversity. We identified and genotyped several million sequence polymorphisms among 27 diverse maize inbred lines and discovered that the genome was characterized by highly divergent haplotypes and showed 10- to 30-fold variation in recombination rates. Most chromosomes have pericentromeric regions with highly suppressed recombination that appear to have influenced the effectiveness of selection during maize inbred development and may be a major component of heterosis. We found hundreds of selective sweeps and highly differentiated regions that probably contain loci that are key to geographic adaptation. This survey of genetic diversity provides a foundation for uniting breeding efforts across the world and for dissecting complex traits through genome-wide association studies.

Maize (Zea mays L.) is both a model genetic system and an important crop species. Already a critical source of food, fuel, feed, and fiber, the addition of genomic information allows maize to be further improved through plant breeding that exploits its tremendous genetic diversity (1–3). Genome-wide association studies (GWAS) of diverse maize germplasm offer the potential to rapidly resolve complex traits to gene-level resolution, but these studies require a high density of genome-wide markers. To do this, we targeted the 20% of the maize genome that is low-copy (4, 5) on a diverse panel of 27 inbred lines (representative of maize breeding efforts and worldwide diversity), founders of the maize nested association mapping (NAM) population (6), and used sequencing-by-synthesis (SBS) technology with three complementary restriction enzyme–anchored genomic libraries (figs. S1 and S2A) (7).

More than 1 billion SBS reads (>32 gigabases of sequence) were generated, covering ~38% of the total maize genome, albeit at mostly low-coverage levels. We focused on the ~93 million base pairs (Mbp) of low-copy sequence present in 13 or more lines in this study. Roughly 39% of the sequenced low-copy fraction was derived from introns and exons (5), covering 32% of the total genic fraction in the genome. We identified 3.3 million single-nucleotide polymorphisms (SNPs) and indels (table S1) and found that, overall, 1 in every 44 bp was polymorphic (π = 0.0066 per ). In a subset used for the population genetics analyses, the error rate was 1/2570 or 17-fold lower than π (roughly half the errors are paralogy issues). The absolute level of diversity we examined, though high, may be slightly reduced because of difficulties aligning highly divergent sequences and our low power to call

1United States Department of Agriculture–Agriculture Research Service (USDA-ARS). 2Department of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA. 3Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA. 4Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA. 5Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA. 6Division of Plant Sciences, University of Missouri, Columbia, MO 65211, USA. 7Institute for Biotechnology and Life Science Technologies, Cornell University, Ithaca, NY 14853, USA. 8Department of Plant Sciences, University of California, Davis, CA 95616–5294, USA.
*These authors contributed equally to this work.
†Present address: United States Arid-Land Agricultural Research Center, Maricopa, AZ 85138, USA.
‡Present address: Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA.
§To whom correspondence should be addressed. E-mail: [email protected] (D.H.W.); [email protected] (E.S.B.)
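As a quick arithmetic check on the figures quoted in the maize text above, the following minimal Python sketch compares the reported error rate (1/2570) with the reported diversity value (0.0066). The variable names are illustrative assumptions and do not come from the paper.

```python
# Values as reported in the text; variable names are hypothetical.
diversity = 0.0066            # reported diversity estimate
error_rate = 1 / 2570         # reported error rate in the population-genetics subset
polymorphic_density = 1 / 44  # reported: 1 in every 44 bp polymorphic

# Ratio of the diversity estimate to the error rate.
fold_lower = diversity / error_rate

print(f"error rate           = {error_rate:.6f}")
print(f"diversity / error    = {fold_lower:.1f}-fold")
print(f"polymorphic fraction = {polymorphic_density:.4f}")
```

The ratio rounds to the roughly 17-fold difference stated in the text.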

www.sciencemag.org SCIENCE VOL 326 20 NOVEMBER 2009 1115

Articles

The genome of the cucumber, Cucumis sativus L.

Sanwen Huang1,19, Ruiqiang Li2,3,19, Zhonghua Zhang1,19, Li Li2,19, Xingfang Gu1,19, Wei Fan2,19, William J Lucas4,19, Xiaowu Wang1, Bingyan Xie1, Peixiang Ni2, Yuanyuan Ren2, Hongmei Zhu2, Jun Li2, Kui Lin5, Weiwei Jin6, Zhangjun Fei7, Guangcun Li8, Jack Staub9, Andrzej Kilian10, Edwin A G van der Vossen11, Yang Wu5, Jie Guo5, Jun He1, Zhiqi Jia1, Yi Ren1, Geng Tian2, Yao Lu2, Jue Ruan2,12, Wubin Qian2, Mingwei Wang2, Quanfei Huang2, Bo Li2, Zhaoling Xuan2, Jianjun Cao2, Asan2, Zhigang Wu2, Juanbin Zhang2, Qingle Cai2, Yinqi Bai2, Bowen Zhao13, Yonghua Han6, Ying Li1, Xuefeng Li1, Shenhao Wang1, Qiuxiang Shi1, Shiqiang Liu1, Won Kyong Cho14, Jae-Yean Kim14, Yong Xu15, Katarzyna Heller-Uszynska10, Han Miao1, Zhouchao Cheng1, Shengping Zhang1, Jian Wu1, Yuhong Yang1, Houxiang Kang1, Man Li1, Huiqing Liang2, Xiaoli Ren2, Zhongbin Shi2, Ming Wen2, Min Jian2, Hailong Yang2, Guojie Zhang2,12, Zhentao Yang2, Rui Chen2, Shifang Liu2, Jianwen Li2, Lijia Ma2,12, Hui Liu2, Yan Zhou2, Jing Zhao2, Xiaodong Fang2, Guoqing Li2, Lin Fang2, Yingrui Li2,12, Dongyuan Liu2, Hongkun Zheng2,3, Yong Zhang2, Nan Qin2, Zhuo Li2, Guohua Yang2, Shuang Yang2, Lars Bolund2,16, Karsten Kristiansen17, Hancheng Zheng2,18, Shaochuan Li2,18, Xiuqing Zhang2, Huanming Yang2, Jian Wang2, Rifei Sun1, Baoxi Zhang1, Shuzhi Jiang1, Jun Wang2,17, Yongchen Du1 & Songgang Li2

Cucumber is an economically important crop as well as a model system for sex determination studies and plant vascular biology. Here we report the draft genome sequence of Cucumis sativus var. sativus L., assembled using a novel combination of traditional Sanger and next-generation Illumina GA sequencing technologies to obtain 72.2-fold genome coverage. The absence of recent whole-genome duplication, along with the presence of few tandem duplications, explains the small number of genes in the cucumber. Our study establishes that five of the cucumber’s seven chromosomes arose from fusions of ten ancestral chromosomes after divergence from Cucumis melo. The sequenced cucumber genome affords insight into traits such as its sex expression, disease resistance, biosynthesis of cucurbitacin and ‘fresh green’ odor. We also identify 686 gene clusters related to phloem function. The cucumber genome provides a valuable resource for developing elite cultivars and for studying the evolution and function of the plant vascular system.

The botanical family Cucurbitaceae, commonly known as cucurbits and gourds, includes several economically important cultivated plants, such as cucumber (C. sativus L.), melon (C. melo L.), watermelon (Citrullus lanatus (Thunb.) Matsum. & Nakai) and squash and pumpkin (Cucurbita spp.). Agricultural production of cucurbits uses 9 million hectares of land and yields 184 million tons of vegetables, fruits and seeds annually (http://faostat.fao.org). The cucurbit family also displays a rich diversity of sex expression, and the cucumber has served as a primary model system for sex determination studies1. The cucurbits are also model plants for the study of vascular biology, as both xylem and phloem sap can be readily collected for studies of long-distance signaling events2,3.

Despite the agricultural and biological importance of cucurbits, knowledge of their genetics and genome is currently very limited. We have therefore sequenced and assembled the genome of the domestic cucumber, C. sativus var. sativus L.

All previous plant genome sequences have been derived using traditional Sanger technology4–9. The recent development of

1Key Laboratory of Horticultural Crops Genetic Improvement of Ministry of Agriculture, Sino-Dutch Joint Lab of Horticultural Genomics Technology, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China. 2BGI-Shenzhen, Shenzhen, China. 3Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark. 4Department of Plant Biology, College of Biological Sciences, University of California, Davis, California, USA. 5College of Life Sciences, Beijing Normal University, Beijing, China. 6National Maize Improvement Center of China, Key Laboratory of Crop Genetic Improvement and Genome of Ministry of Agriculture, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing, China. 7Boyce Thompson Institute and USDA Robert W. Holley Center for Agriculture and Health, Cornell University, Ithaca, New York, USA. 8High-Tech Research Center, Shandong Academy of Agricultural Sciences, Jinan, China. 9US Department of Agriculture, Agricultural Research Service, Vegetable Crops Research Unit, Department of Horticulture, University of Wisconsin, Madison, Wisconsin, USA. 10Diversity Arrays Technology, Canberra, Australia. 11Wageningen UR Plant Breeding, Wageningen, The Netherlands. 12The Graduate University of Chinese Academy of Sciences, Beijing, China. 13High School Affiliated to Renmin University of China, Beijing, China. 14Division of Applied Life Science (BK21 and WCU program), PMBBRC and EB-NCRC, Gyeongsang National University, Jinju, Republic of Korea. 15National Engineering Research Center for Vegetables, Beijing, China. 16Institute of Human Genetics, University of Aarhus, Aarhus, Denmark. 17Department of Biology, University of Copenhagen, Copenhagen, Denmark. 18South China University of Technology, Guangzhou, China. 19These authors contributed equally to this work. Correspondence should be addressed to Y.D. ([email protected]), S.H. 
([email protected]), Jun Wang ([email protected]) or Songgang Li ([email protected]). Received 6 May; accepted 28 September; published online 1 November 2009; doi:10.1038/ng.475

Nature Genetics volume 41 | number 12 | december 2009 1275 A r t i c l e s

next-generation sequencing technologies has significantly improved sequencing throughput at a markedly reduced cost10. However, an intrinsic characteristic of next-generation technologies is their short read length (~50 bp), which prevents their direct application for de novo assembly of large genomes. When using these new technologies, assembly is typically carried out by mapping these short reads onto a known reference genome11,12. For the cucumber genome, we carried out a novel combination de novo sequencing strategy, taking advantage of the long read and clone length of Sanger technology and, for the first time, the high sequencing depth and low unit cost of Illumina GA technology.

RESULTS

Sequencing and assembly
We selected the ‘Chinese long’ inbred line 9930, which is commonly used in modern cucumber breeding13, for our genome sequencing project. We generated a total of 26.5 billion high-quality base pairs, or 72.2-fold genome coverage, of which the Sanger reads provided 3.9-fold coverage and the Illumina GA reads provided 68.3-fold coverage (Supplementary Table 1). The GA reads ranged in length from 42 to 53 bp.

We compared the assemblies obtained by Sanger reads only, Illumina GA reads only and Sanger plus Illumina reads. The ‘hybrid’ approach achieved markedly longer N50 (the size above which half of the total length of the sequence set can be found) in both contigs and scaffolds, so we used this assembly for further analyses (Table 1 and Supplementary Table 2).

Table 1 Cucumber genome assembly statistics
Assembly               Contig N50a (kb)   Contig total (Mb)   Scaffold N50 (kb)   Scaffold total (Mb)   % sequence anchored on chromosome
Sanger                 2.6                204                 19                  238                   -
Illumina GA            12.5               190                 172                 200                   -
Sanger + Illumina GA   19.8               226.5               1,140               243.5                 72.8%
aN50 refers to the size above which half of the total length of the sequence set can be found.

The total length of the assembled genome was 243.5 Mb, about 30% smaller than the genome size estimated by flow cytometry of isolated nuclei stained with propidium iodide (367 Mb)14 and by K-mer depth distribution of sequenced reads (350 Mb; Supplementary Fig. 1). Several types of satellite sequences were present in the data set, comprising 23.2% of all Sanger reads and 76.2% of unassembled reads (Supplementary Table 3). FISH analysis indicated that these are primarily located in the centromeric and telomeric regions15. The cucumber genome also contains a large number of rRNA sequences, and about 3.3% of the Sanger reads matched 45S rRNA. These results indicated that the majority of the remaining 30% of unassembled regions of the genome are likely to be heterochromatic satellite or rRNA sequences.

The high coverage of the cucumber genome by this assembly was also confirmed using the available EST, fosmid and BAC sequences. The assembly contains 96.8% of the 63,312 cucumber unigenes assembled from ~350,000 Roche 454–sequenced ESTs, 99.3% of the 6,952 NCBI-deposited ESTs of cucumber, 91.2% of the 50,441 NCBI-deposited ESTs of melon and 98.7% of the six finished fosmid and BAC sequences (Supplementary Table 4).

A genetic map was developed using 77 recombinant inbred lines from the intersubspecific cross between Gy14 (a North American processing market–type cucumber cultivar) and PI183967 (an accession of C. sativus var. hardwickii originating from India). The map spans 581 cM and contains 1,885 markers, including 995 microsatellite markers16 and 890 Diversity Arrays Technology markers (marker sequences can be accessed at http://cucumber.genomics.org.cn). Using this map, we were able to anchor 72.8% of the assembled sequences onto the seven chromosomes. Among the 1,885 markers, 1,763 (93.5%) were uniquely aligned and used for constructing the pseudochromosomes. The majority (98.7%) of the markers were collinear with the sequence assembly (Fig. 1a). Comparison of the genetic and physical distances between markers revealed recombination suppression of two 10-Mb regions at either end of chromosome 4, a 20-Mb region on chromosome 5 and an 8-Mb region on chromosome 7. Using high-resolution FISH, we confirmed previously identified segmental inversion16 within the suppression region on chromosome 5 between Gy14 and PI183967 (Fig. 1b), which provides an explanation for recombination suppression in these regions. These regions of recombination suppression are additionally useful for studying cucumber evolution during domestication.

After excluding 16 markers whose genetic positions were ambiguous, we examined the six remaining regions that had conflicts between the genetic map and our assembly. Upon inspection, we found that clone mate-pair information supported our assembly in all of these regions (Supplementary Fig. 2). We also identified no misassembly within the regions covered by the six finished fosmid or BAC sequences (Supplementary Fig. 3). The conflicts may be a result of chromosomal rearrangement that occurred between the sequenced genotype 9930 and the genotypes used to create the mapping population; alternatively, these markers may have been placed incorrectly on the genetic map. Sequencing depth distribution showed that we obtained more than 10× coverage on more than 97.5% of the assembly (Supplementary Fig. 4).

Repetitive sequences and transposons
The cucumber genome contains a large number of transposable elements, but only a few have previously been identified. We therefore constructed repeat libraries using multiple de novo methods and then derived a combined repeat library that contained 1,566 sequences (Supplementary Table 5), of which 469 (29.9%) were manually classified (Supplementary Table 6). We then used this library for repeat annotation of the cucumber genome. We identified a total of 54.4 Mb, which represents ~24% of the genome, as repeats. Among them, 51.5% could be classified based on known repeats. The long terminal repeat (LTR) retrotransposons (gypsy and copia) made up the majority of the transposable element classes and comprised 10.4% of the genome (Supplementary Table 7). The repeats divergence rate (percentage of substitutions in the matching region compared with consensus repeats in constructed libraries) distribution showed a peak at 20%. A fraction of LTR retrotransposons, long interspersed nuclear elements and DNA transposons (composing 2.3%, 0.4% and 0.2% of the genome, respectively) are of relatively recent origin, having a sequence divergence rate of less than 5% (Supplementary Fig. 5).

Gene annotation
We used three gene-prediction methods (cDNA-EST, homology based and ab initio) to identify protein-coding genes and then built a consensus gene set by merging all of the results (Supplementary Fig. 6). We predicted 26,682 genes, with a mean coding sequence size of 1,046 bp and an average of 4.39 exons per gene (Supplementary Table 8). Under an 80% sequence overlap threshold, we found that 26.7% of the genes were supported by models from all three gene prediction methods, 25% had both ab initio prediction and homology-based evidence, and 7.4% had ab initio prediction and cDNA-EST expression evidence; the remaining genes were primarily derived from pure


ab initio prediction, but the majority of these were supported by multiple gene finders (Supplementary Table 9). About 81% of the genes have homologs in the TrEMBL protein database, and 66% can be classified by InterPro. In sum, 82% of the genes have either known homologs or can be functionally classified (Supplementary Table 10). In addition to protein-coding genes, we identified 292 rRNA fragments and 699 tRNA, 238 small nucleolar RNA, 192 small nuclear RNA and 171 miRNA genes in the cucumber genome (Supplementary Table 11).

Figure 1 Integrated genetic and physical map of cucumber. (a) Genetic versus physical distance map of the seven cucumber chromosomes. The genetic map was constructed using a recombinant inbred line mapping population from the intersubspecific cross between Gy14 (domestic cucumber) and PI183967 (wild cucumber). (b) Segmental inversion between Gy14 and PI183967 on cucumber chromosome 5 detected by high-resolution FISH (12-2 and 12-7 denote individual fosmid clones). A low-resolution FISH analysis was also recently reported16. Scale bars represent 1 µm.
[Panels LG1 to LG7: x-axis, physical distance (Mb); y-axis, genetic distance (cM); centromeric regions estimated by FISH are marked.]

On the basis of pairwise protein sequence similarities, we carried out a gene family clustering analysis on all genes in sequenced plants, using rice as an outgroup. The cucumber genes consist of 15,669 families. Of these, 4,362 are cucumber unique families, among which 3,784 are single-gene families (Supplementary Table 12). The EST confirmation rate of these unique single-copy genes was much lower than the average of all predicted genes (33.4% vs. 72.3%, respectively). This category may therefore contain a number of false-positive predictions. In papaya, there are 4,622 unique families, but the actual number of genes is estimated to be 24,746, which is lower than the 28,629 predicted genes7. Thus, the actual number in cucumber should be lower than 26,682 and similar to that in papaya. The smaller average gene family size in cucumber (1.71) and papaya (1.77) supports this conclusion (Fig. 2a).

The cucumber genome contains the smallest number of tandem gene duplications (479) among all the plants we compared, whereas grapevine has the largest number (5,382; Fig. 2a). This may contribute in part to the small number of genes in cucumber.

Absence of recent whole-genome duplication
Whole-genome duplication (WGD) is common in angiosperm plants and produces a tremendous source of raw material for gene genesis. Previous research has revealed a paleohexaploidy (γ) event in the common ancestor of Arabidopsis thaliana and grapevine after the divergence of monocotyledons and dicotyledons6. Subsequently, two WGDs (α and β) occurred in Arabidopsis17 and one (p) in poplar8, whereas no recent WGD occurred in grapevine and papaya. Evidence indicates that rice underwent an ancient WGD18. We carried out a collinear gene-order analysis on the cucumber genome and observed no recent WGD and only a few segmental duplication events (Supplementary Fig. 7). We also used the distance-transversion rate at fourfold degenerate sites (4DTv method) to analyze paralogous gene pairs between syntenic blocks in Arabidopsis and cucumber, respectively. Two peaks (~0.06 and ~0.25) in Arabidopsis support the two recent WGDs (Fig. 2b). In cucumber, the analysis showed ancient duplication events (peak at ~0.60) but did not reveal recent WGD. This lack of recurrent WGD in the small cucumber genome provides an important complement to the grapevine and papaya genomes to study ancestral forms and arrangements of plant genes.

Synteny with flowering plant genomes
Given the similar gene arrangements between cucumber and other plant genomes, we defined syntenic blocks that contained 5,473, 6,525, 9,842, 8,439 and 3,992 cucumber genes collinear to Arabidopsis, papaya, poplar, grapevine and rice, respectively (Supplementary Table 13 and Supplementary Figs. 8–12). The numbers of collinear genes were consistent with the phylogenetic distances of the other plants to cucumber. Within the syntenic blocks, we observed the highest density of collinear genes between cucumber and grapevine (90.5 genes per Mb), followed by papaya (76.1; the low contiguity of genome assembly may have, in part, decreased this value), poplar (68.8), rice (55.6) and Arabidopsis (43.5; Supplementary Table 13). This indicates that Arabidopsis has the most reshuffled or rearranged genome, whereas the genomes of grapevine and papaya are more conserved, probably because they have not undergone WGD since the ancestral paleohexaploidy.

Substantial fusion events involved in chromosomal evolution
Melon and cucumber belong to the same genus, although cucumber has seven chromosomes and melon has 12. Watermelon, their common distant relative, has 11 chromosomes. To investigate cucurbit chromosomal evolution, we compared the melon19 and watermelon genetic maps to the cucumber genome (Fig. 3a). In total, 348 (66.7%) of the 522 melon markers and 136 (58.6%) of the 232 watermelon markers were aligned on the cucumber chromosomes


(Supplementary Table 14). The comparison revealed that there has been no substantial rearrangement of cucumber chromosome 7, which corresponds to melon chromosome 1 and watermelon group 7.

Figure 2 Comparison of cucumber genome with other sequenced plant genomes. (a) Numbers of predicted genes, numbers of tandem duplicated genes and gene family sizes of the six sequenced plant genomes. (b) The 4DTv distribution of duplicate gene pairs in cucumber and Arabidopsis, calculated based on alignment of codons with HKY substitution model.
[Panel a: number of all predicted genes (×1,000), number of tandem duplicated genes (×1,000) and number of genes per family for O. sativa, A. thaliana, C. papaya, P. trichocarpa, C. sativus and V. vinifera. Panel b: x-axis, 4DTv; y-axis, percentage of syntenic blocks; annotated peaks mark the two recent WGD duplications in Arabidopsis and the ancient duplications in cucumber.]

Using watermelon as an outgroup, we found that cucumber chromosomes 1, 2, 3, 5 and 6 were collinear to melon chromosomes 2 and 12, 3 and 5, 4 and 6, 9 and 10, and 8 and 11, respectively, indicating that after speciation these cucumber chromosomes each resulted from a fusion of two ancestral chromosomes. We also found that cucumber chromosome 6 and melon chromosome 3 have a syntenic segment, indicating that interchromosome rearrangement occurred in one of the two genomes after speciation. Cucumber chromosome 4 largely corresponds to melon chromosome 7, although a segment of melon chromosome 8 is syntenic with cucumber chromosome 4 (crossing the centromere). These data indicate that the rearrangement is most likely to have occurred before the divergence of cucumber and melon. In addition to chromosome fusion and interchromosome rearrangements, the comparison revealed the occurrence of several intrachromosome rearrangements (Fig. 3a).

Cucumber-melon microsynteny
To estimate the sequence divergence rate, we compared the four sequenced melon BACs to the cucumber genome (Fig. 3b and Supplementary Fig. 13). There are 56 genes on the melon BACs, 52 of which are collinear with the cucumber genome. The mean sequence similarity over coding regions is 95%. Although the gene region similarity is very high, the repeat content between the two genomes is quite different. New transposable elements were frequently inserted in the intergenic regions of both genomes. Hence, only 54% of the BAC sequences could be aligned onto the cucumber genome, with an average of 88% sequence identity. Nonetheless, the highly conserved gene content and order between the two species make the cucumber genome useful for genetic analysis of melon.

Using the annotated genes in the four melon BACs, we obtained and manually curated eight orthologous families among rice, cucumber, melon, Arabidopsis and papaya. Extrapolating from the age of divergence between Arabidopsis and papaya (54–90 million years ago), we estimated that cucumber and melon diverged about 4–7 million years ago, which is consistent with a previous estimate of 9 ± 3 million years ago20.

Pathogen resistance genes
Only 61 nucleotide-binding site (NBS)-containing resistance (NBS-R) genes have been identified in cucumber, similar to papaya (55)7 but only a fraction of what is found in Arabidopsis (200), poplar (398) and rice (600)8. Distribution of NBS genes on chromosomes

Figure 3 Comparative genomic analysis of cucurbits. (a) Comparative analysis of the melon and watermelon genetic maps with the cucumber sequence map. Cucumber, melon and watermelon have 7, 12 and 11 pairs of chromosomes, respectively. The current version of the watermelon genetic map is organized into 18 genetic groups. (b) Syntenic blocks between the cucumber genome and a melon BAC sequence (GenBank accession code EF188258.1). Genes are indicated by black arrows with the orientation indicated on the sequence. Rectangles, transposable elements; red, retrotransposable elements; blue, DNA transposons; green, unclassified transposable elements. Orthologous sequence regions between the two genomes are shown.
[Panel a: melon, cucumber and watermelon maps. Panel b: melon BAC EF188258 (0 to 90 kb) against cucumber chromosome 1 (15,060 to 15,160 kb).]
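The chromosome correspondences reported in the text can be tabulated to confirm that they are internally consistent. The following is a minimal Python sketch; the dictionary name and layout are illustrative assumptions, and treating the ancestral karyotype as equal to melon's 12 chromosomes is a simplification made only for this check.

```python
# Cucumber chromosome -> collinear melon chromosome(s), as stated in the text.
# The dictionary name and layout are hypothetical.
cucumber_to_melon = {
    1: (2, 12),
    2: (3, 5),
    3: (4, 6),
    4: (7,),   # "largely corresponds" to melon 7 (a melon 8 segment is also syntenic)
    5: (9, 10),
    6: (8, 11),
    7: (1,),
}

# Cucumber chromosomes that correspond to two melon chromosomes (fusions).
fused = sorted(c for c, ms in cucumber_to_melon.items() if len(ms) == 2)

# Every melon chromosome named in the mapping, sorted.
melon_covered = sorted(m for ms in cucumber_to_melon.values() for m in ms)

print("fused cucumber chromosomes:", fused)
print("melon chromosomes accounted for:", melon_covered)
# Each fusion of two ancestral chromosomes lowers the count by one.
print("12 - number of fusions =", 12 - len(fused))
```

Under these assumptions, five fusions involving ten chromosomes reduce 12 to the seven chromosomes of cucumber, and all 12 melon chromosomes appear exactly once in the mapping.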


is nonrandom, with only five genes located on chromosomes 1, 6 and 7 and 20 genes located on chromosome 2 (Supplementary Fig. 14). Three-quarters of the NBS genes are located within 11 clusters, indicating that they evolved through tandem duplications, similar to other known plant genomes.

Figure 4 Lineage-specific expansion of the LOX gene family in the five sequenced dicot genomes and rice genome. The LOX family is divided into two groups, type I and type II. The two tandem duplicated gene clusters are ordered and shown on chromosomes 2 and 4, as well as one unmapped scaffold of the cucumber genome.
[Legend: C. sativus, A. thaliana, C. papaya, P. trichocarpa, V. vinifera, O. sativa. Clusters shown: Chr. 2 (2,190 to 2,250 kb), Chr. 4 (9,525 to 9,625 kb) and Scaffold 337 (0 to 20 kb).]

The lipoxygenase (LOX) pathway has an important role in developmentally and environmentally regulated processes in plants21 and generates short-chain aldehydes and alcohols that are involved in plant defense and pest resistance22. The LOX gene family has been notably expanded in the cucumber genome (23 LOX genes in cucumber, 6 in Arabidopsis, 15 in papaya, 21 in poplar, 18 in grapevine and 15 in rice). Fourteen of the LOX genes are specific to the cucumber lineage. The majority of cucumber LOX genes (19 of 23) are distributed in three clusters, the largest of which contains 11 members that are arranged in tandem (Fig. 4). The other sequenced plant genomes show no obvious LOX clustering, with the exception of grapevine, which has one cluster harboring six copies.

Given that the cucumber has only 61 NBS-R genes, the expanded lipoxygenase pathway might be a complementary mechanism to cope with . In support of this hypothesis, Arabidopsis has more NBS-R genes and fewer LOX genes than does papaya. The volatile (E,Z)-2,6-nonadienal (NDE) gives cucumber its ‘fresh green’ flavor23 and confers resistance to some bacteria and fungi24. Lipoxygenase and one type of hydroperoxide lyase, 9-HPL, synthesize NDE from linolenic acid precursors. Genes encoding enzymes with 9-HPL activity are rarely found in other plants25. However, cucumber contains two tandem HPL genes, one of which has been experimentally confirmed as encoding an enzyme with 9-HPL activity25. The expansion of the LOX gene family and the duplicated HPL genes may be related to the high level of NDE synthesis in cucumber.

Eukaryotic translation initiation factors, particularly the eIF4E and eIF4G families, confer recessive resistance to plant RNA virus infections. An EIF4E gene in melon was found to mediate recessive resistance against melon necrotic spot virus26. In the cucumber genome, three EIF4E and three EIF4G genes have been identified, providing all candidates for known recessive resistance genes against RNA viruses such as zucchini yellow mosaic virus and watermelon mosaic virus27. In some wild melon genotypes, enhanced expression of two glyoxylate aminotransferase genes (At1 and At2) controls the resistance to downy mildew, a devastating foliar disease of cucurbits28. We identified two At homologs in cucumber that could be candidate genes for downy mildew resistance.

Novel biosynthetic pathways
Cucurbitacins are bitter cucurbit triterpenoid compounds that are toxic to most organisms but can attract specialized insects29,30. The presence of cucurbitacin in the cucumber is controlled by a mendelian gene, Bi30. Oxidosqualene cyclase catalyzes the formation of the triterpene carbon framework in plants31. An OSC gene, CPQ, in squash (Cucurbita pepo L.) is the first committed enzyme in the cucurbitacin biosynthesis pathway32. In cucumber, we identified four OSC genes; the CPQ ortholog Csa008595 resides in a genetic interval that defines the Bi gene (Supplementary Fig. 15). Notably, Csa008595 forms a cluster that contains an acyltransferase-encoding gene (Csa008594) and two cytochrome P450–encoding genes (Csa008596 and Csa008597). Three of these (Csa008594, Csa008595 and Csa008597) are coexpressed strongly in cucumber leaf tissue (Supplementary Fig. 16) in a pattern similar to that of the operon-like gene cluster involved in thalianol biosynthesis in Arabidopsis33. This gene cluster may therefore catalyze the stepwise formation of cucurbitacin in cucumber.

Cucumber is a model system for studying sex expression in plants1. Ethylene stimulates femaleness and is considered the sex hormone of cucumber34. We identified 137 cucumber genes that are related to the biosynthetic and signaling pathways of ethylene35,36, but we found no gene family expansion in these pathways compared with other sequenced plant genomes (Supplementary Table 15). Thus, the origin of monoecy in cucumber might involve other evolutionary mechanisms.

The melon gene Cm-ACS7 (ref. 37) and its cucumber ortholog Cs-ACS2 (ref. 38) encode 1-aminocyclopropane-1-carboxylate synthase (ACS), a key regulatory enzyme in the ethylene biosynthetic pathway. Both genes are crucial to the inhibition of male organs and development of the female flower. In situ mRNA hybridization experiments revealed that both Cm-ACS7 and Cs-ACS2 transcripts accumulate only in the pistil and ovule, whereas their Arabidopsis ortholog, AT4G26200 (Supplementary Fig. 17), is expressed only in the roots39. We also identified two ethylene-responsive elements (AWTTCAAA) and one flower meristem identity gene LEAFY-responsive element (CCAATGT) within the Cs-ACS2 and Cm-ACS7 promoter sequences, but these were absent from the promoter of AT4G26200. These findings indicate that the evolution of unisexual flowers in cucurbits may have involved the acquisition of new cis elements of the ACS genes.

To better understand the mechanism of sex determination in cucumber, we sequenced 359,105 EST sequences from near-isogenic unisexual and bisexual flower buds using the 454 pyrosequencing technology. Our analysis revealed that six auxin-related genes (auxin can regulate sex expression by stimulating ethylene production40) and three short-chain dehydrogenase or reductase genes (homologs to the sex determination gene ts2 in maize41) are more highly expressed in unisexual flowers (Supplementary Table 16). This analysis provides an important resource for further study of sex determination in cucumber.

Novel developmental programs
The tendril is a specific climbing tool of vines, such as Vitaceae and Cucurbitaceae. Darwin considered tendrils a key innovation in plant

Nature Genetics volume 41 | number 12 | december 2009 1279 A r t i c l e s

evolution42. In cucumber and grapevine, gibberellic acid regulates tendril formation43,44. In most plants, the transition of GA12-aldehyde to GA12 is catalyzed by cytochrome P450 monooxygenase. In cucurbits, it is also catalyzed by specific GA-7-oxidase genes, which are absent from Arabidopsis45. Cucumber has two GA-7-oxidase genes (Supplementary Table 17). GA-20-oxidase controls key steps leading to bioactive GA1 and GA4, and our data show that the cucumber has three lineage-specific clades (three copies; Supplementary Fig. 18). These specific genes might be associated with the role of gibberellic acid in the regulation of tendril formation. Tendril coiling involves rapid cell wall modification46, and expansins are cell wall–loosening proteins in plants47. We found that, in cucumber, the expansin subfamily EXLA has undergone marked expansion through tandem duplication (eight genes in cucumber, compared with one to three genes in other genomes; Supplementary Fig. 19); this event may have contributed to the development of tendril coiling in cucumber.

Use in plant vascular biology studies
The evolution of the plant vascular system, comprising xylem and phloem tissues, had a pivotal role in the emergence of land plants. The sieve tube system of phloem, the equivalent of the animal arterial system, delivers nutrients and signaling molecules to developing organs2. A BLASTP analysis of 1,209 protein fragments from pumpkin phloem48 identified 800 phloem proteins in the cucumber genome (Supplementary Table 18). Using these cucumber proteins, we conducted orthologous gene family (cluster) analysis (Supplementary Table 19) with their homologs in other vascular plants as well as the nonvascular moss Physcomitrella patens49. In total, we constructed 686 clusters (Table 2). About two-thirds (49 of 75) of the Arabidopsis and half (57 of 120) of the rice phloem proteins identified in previous studies50,51 were included in this data set, indicating the effectiveness of these analyses and the value of this resource for vascular biology studies in plants.

Table 2 Summary of orthologous gene families (clusters) established using cucumber genes homologous to pumpkin phloem proteins

Species          Genes    Gene clusters    Average genes per cluster
P. patensᵃ       1,072    622              1.7
O. sativa        2,458    676              3.6
S. bicolor       2,780    679              4.1
A. thaliana      2,351    682              3.5
C. papaya        1,944    672              2.9
P. trichocarpa   3,454    684              5.1
C. sativusᵇ      1,986    686              2.9
V. vinifera      2,535    668              3.8

ᵃMoss (P. patens) was used as the only outgroup. ᵇFor each cluster, at least one cucumber phloem protein was included.

The vascular and nonvascular plants shared 596 clusters; between monocots and eudicots, there are 648 clusters in common. Phloem protein II (PP2; cluster 2432) genes are present in angiosperms but absent from the moss genome. PP2-like genes are also present in gymnosperms52, indicating their association with the advent of vascular plants. In cucurbits, these genes can increase the size-exclusion limit of plasmodesmata and facilitate cell-to-cell traffic of macromolecules52 and thus are likely to have an essential role in vascular function. The sieve element occlusion proteins (gene cluster 4754), present in all eudicots but absent from mosses and monocots, represent a novel mechanism that evolved for sealing the sieve tube system after wounding53.

The average number of genes in each cluster ranges from 2.9 to 5.1 in the vascular plants, compared to 1.7 in moss (Table 2). The increase of gene numbers per cluster may be associated with the evolution of the plant vascular system. The 16-kDa PP16 cluster (cluster 2599) has an average of 3.7 genes in the vascular plants compared to 2 in moss. The CmPP16 gene in pumpkin is involved in transport of mRNA into the phloem3. The increase in the number of PP16 genes in vascular plants indicates that these new members may be involved in long-distance trafficking of mRNA.

To better understand xylem formation, we compared gene families related to lignin and cellulose biosynthesis between woody and herbaceous plants. The perennial woody plants, poplar and grapevine, have a large number of lignin biosynthesis–related genes (48 and 49, respectively), whereas the semiwoody plant papaya has an intermediate number (39). In contrast, the herbaceous plants Arabidopsis and cucumber have smaller numbers (28 and 26, respectively; Supplementary Table 20). Among these gene families, the number of genes in the cinnamyl alcohol dehydrogenase (CAD) family was consistent with this trend. In poplar and grapevine, homologs for AT4G37980 and AT4G37990 in Arabidopsis, which have low CAD enzymatic activity in vitro and may have only a minor role in lignin formation in this species54, were expanded markedly. In papaya, there is an expansion of homologs for AT1G37970, which lack detectable CAD catalytic activities in vitro but are expressed predominantly in lignin-forming tissues54 (Supplementary Fig. 20). Thus, the expansion of CAD genes may be associated with wood formation. It is also notable that grapevine has the largest PAL gene family, with 15 members, and that poplar and papaya have the largest number of HCT genes, with 7 members. Of the cellulose biosynthesis–related genes, poplar has more CESA and COB genes (18 of each) than do any of the other sequenced dicots (Supplementary Table 20).

DISCUSSION
The sequence of the cucumber genome provides an invaluable new resource for biological research and breeding of cucurbits. The high collinearity between cucumber and melon genomes enables cucumber to serve as a model system in the Cucurbitaceae family for comparative genomics studies in plants. The cucumber genome and related transcriptome analysis can provide insights into the mechanisms underlying sex determination, an important biological process that has been well characterized in cucumber at the phenotypic level. The genome can also advance our knowledge of the evolution and function of the plant vascular system.

We have also shown that, in combination with traditional Sanger sequencing, next-generation DNA sequencing technologies can be used effectively for de novo sequencing of plant genomes, making it possible to carry out rapid and low-cost sequencing for other important plant species.

Methods
Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturegenetics/.

Accession codes. The cucumber genome sequence has been deposited in GenBank with accession code ACHR00000000 (the version described here is the first version, with accession code ACHR01000000).

Note: Supplementary information is available on the Nature Genetics website.

Acknowledgments
We thank L. Goodman for assistance in editing the manuscript and R. Quatrano, L. Kochian, L. Comai, V. Sundaresan, S. Kamoun and S. Renner for critical readings of the manuscript. This work was funded by the Chinese Ministry of Agriculture (948 program), Ministry of Science and Technology (2006DFA32140, 2007CB815701, 2007CB815703 and 2007CB815705) and Ministry of Finance


(1251610601001); the National Natural Science Foundation of China (30871707 and 30725008); the Chinese Academy of Agricultural Sciences (seed grant to S.H.); the Chinese Academy of Science (GJHZ0701-6 and KSCX2-YWN-023); the US Department of Agriculture (National Research Initiative grant 2006-35304-17346 to W.J.L.); the National Science Foundation (grant IOS-07-15513 to W.J.L.); and the Korea Science and Engineering Foundation–Ministry of Education, Science and Technology (WCU R33-10002 and BK21 grants to J.-Y.K.). WKC was partly supported by grants from the Environmental Biotechnology National Core Research Center (R15-2003-012-01003-0) and National Research Laboratory (2009-0066339). This work was also supported by the Shenzhen Municipal and Yantian District Governments and the Society of Entrepreneurs & Ecology. D. Qu and Z. Fang of the Chinese Academy of Agricultural Sciences provided management support for this work.

AUTHOR CONTRIBUTIONS
S.H., Y.D., Jun Wang and Songgang Li managed the project. S.H., Z.Z., W.J.L., X.G. and R.L. designed the analyses. X.G., H.M., L.L., Yuanyuan Ren, G.T., Y. Lu, Z.X., J.C., A., Z.W., J. Zhang, H. Liang, X.R., M.J., Hailong Yang, R.C., Shifang Liu and X.Z. conducted DNA preparation and sequencing. X.W., B.X., K.L., W.J., Guangcun Li, Z.F., J.S., A.K., E.A.G.v.d.V. and Y.X. contributed new reagents and analytic tools. S.H., Z.Z., W.J.L., X.G., R.L., X.W., B.X., K.L., W.J., J.H., Z.J., Yi Ren, Ying Li, X.L., S.W., Q.S., W.K.C., J.-Y.K., K.H.-U., H.M., Z.C., S.Z., J. Wu, Y.Y., H.K., Y.W., J.G., Y.H., M.L., B. Zhao, Shiqiang Liu, W.F., P.N., H. Zhu, Jun Li, J.R., W.Q., M. Wang, Q.H., B.L., Q.C., Y.B., Z.S., M. Wen, G.Z., Z.Y., Jianwen Li, L.M., H. Liu, Y. Zhou, J. Zhao, X.F., Guoqing Li, L.F., Yingrui Li, D.L., Hancheng Zheng and Shaochuan Li conducted the data analyses. S.H., R.L., Z.Z. and W.J.L. wrote the paper. Y.D., R.S., B. Zhang, S.J., G.Y., S.Y., Hongkun Zheng, Y. Zhang, N.Q., Z.L., L.B., K.K., Huanming Yang and Jian Wang revised the paper.

Published online at http://www.nature.com/naturegenetics/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Tanurdzic, M. & Banks, J.A. Sex-determining mechanisms in land plants. Plant Cell 16, S61–S71 (2004).
2. Lough, T.J. & Lucas, W.J. Integrative plant biology: role of phloem long-distance macromolecular trafficking. Annu. Rev. Plant Biol. 57, 203–232 (2006).
3. Xoconostle-Cázares, B. et al. Plant paralog to viral movement protein that potentiates transport of mRNA into the phloem. Science 283, 94–98 (1999).
4. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
5. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
6. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
7. Ming, R. et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991–996 (2008).
8. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
9. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002).
10. Shendure, J., Mitra, R.D., Varma, C. & Church, G.M. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 5, 335–344 (2004).
11. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
12. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008).
13. Staub, J.E., Serquen, F.C., Horejsi, T. & Chen, J.-f. Genetic diversity in cucumber (Cucumis sativus L.): IV. An evaluation of Chinese germplasm. Genet. Resour. Crop Evol. 46, 297–310 (1999).
14. Arumuganathan, K. & Earle, E. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
15. Han, Y.H. et al. Distribution of the tandem repeat sequences and karyotyping in cucumber (Cucumis sativus L.) by fluorescence in situ hybridization. Cytogenet. Genome Res. 122, 80–88 (2008).
16. Ren, Y. et al. An integrated genetic and cytogenetic map of the cucumber genome. PLoS One 4, e5795 (2009).
17. Bowers, J.E., Chapman, B.A., Rong, J. & Paterson, A.H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
18. Yu, J. et al. The genomes of Oryza sativa: a history of duplications. PLoS Biol. 3, e38 (2005).
19. Fernandez-Silva, I. et al. Bin mapping of genomic and EST-derived SSRs in melon (Cucumis melo L.). Theor. Appl. Genet. 118, 139 (2008).
20. Schaefer, H., Heibl, C. & Renner, S.S. Gourds afloat: a dated phylogeny reveals an Asian origin of the gourd family (Cucurbitaceae) and numerous oversea dispersal events. Proc. Biol. Sci. 276, 843–851 (2009).
21. Liavonchanka, A. & Feussner, I. Lipoxygenases: occurrence, functions and catalysis. J. Plant Physiol. 163, 348–357 (2006).
22. Schwab, W., Davidovich-Rikanati, R. & Lewinsohn, E. Biosynthesis of plant-derived flavor compounds. Plant J. 54, 712–732 (2008).
23. Buescher, R.H. & Buescher, R.W. Production and stability of (E,Z)-2,6-nonadienal, the major flavor volatile of cucumbers. J. Food Sci. 66, 357–361 (2001).
24. Cho, M.J., Buescher, R.W., Johnson, M. & Janes, M. Inactivation of pathogenic bacteria by cucumber volatiles (E,Z)-2,6-nonadienal and (E)-2-nonenal. J. Food Prot. 67, 1014–1016 (2004).
25. Matsui, K. et al. Fatty acid 9- and 13-hydroperoxide lyases from cucumber. FEBS Lett. 481, 183–188 (2000).
26. Nieto, C. et al. An eIF4E allele confers resistance to an uncapped and non-polyadenylated RNA virus in melon. Plant J. 48, 452–462 (2006).
27. Wai, T. & Grumet, R. Inheritance of resistance to watermelon mosaic virus in the cucumber line TMG-1: tissue-specific expression and relationship to zucchini yellow mosaic virus resistance. Theor. Appl. Genet. 91, 699–706 (1995).
28. Taler, D., Galperin, M., Benjamin, I., Cohen, Y. & Kenigsbuch, D. Plant eR genes that encode photorespiratory enzymes confer resistance against disease. Plant Cell 16, 172–184 (2004).
29. Balkema-Boomstra, A.G. et al. Role of cucurbitacin C in resistance to spider mite Tetranychus urticae in cucumber Cucumis sativus L. J. Chem. Ecol. 29, 225–235 (2003).
30. Da Costa, C.P. & Jones, C.M. Cucumber beetle resistance and mite susceptibility controlled by the bitter gene in Cucumis sativus L. Science 172, 1145–1146 (1971).
31. Phillips, D.R., Rasbery, J.M., Bartel, B. & Matsuda, S.P. Biosynthetic diversity in plant triterpene cyclization. Curr. Opin. Plant Biol. 9, 305–314 (2006).
32. Shibuya, M., Adachi, S. & Ebizuka, Y. Cucurbitadienol synthase, the first committed enzyme for cucurbitacin biosynthesis, is a distinct enzyme from cycloartenol synthase for phytosterol biosynthesis. Tetrahedron 60, 6995–7003 (2004).
33. Field, B. & Osbourn, A.E. Metabolic diversification–independent assembly of operon-like gene clusters in different plants. Science 320, 543–547 (2008).
34. Rudich, J., Halevy, A.H. & Kedar, N. Ethylene evolution from cucumber plants as related to sex expression. Plant Physiol. 49, 998–999 (1972).
35. Pirrung, M.C. Ethylene biosynthesis from 1-aminocyclopropanecarboxylic acid. Acc. Chem. Res. 32, 711–718 (1999).
36. Stepanova, A.N. & Alonso, J.M. Ethylene signaling pathway. Sci. STKE 2005, cm3 (2005).
37. Boualem, A. et al. A conserved mutation in an ethylene biosynthesis enzyme leads to andromonoecy in melons. Science 321, 836–838 (2008).
38. Li, Z. et al. Molecular isolation of the M gene suggests that a conserved-residue conversion induces the formation of bisexual flowers in cucumber plants. Genetics 182, 1381–1385 (2009).
39. Yamagami, T. et al. Biochemical diversity among the 1-amino-cyclopropane-1-carboxylate synthase isozymes encoded by the Arabidopsis gene family. J. Biol. Chem. 278, 49102–49112 (2003).
40. Takahashi, H. & Jaffe, M.J. Further studies of auxin and ACC induced feminization in the cucumber plant using ethylene inhibitors. Phyton (Buenos Aires) 44, 81–86 (1984).
41. DeLong, A., Calderon-Urrea, A. & Dellaporta, S.L. Sex determination gene TASSELSEED2 of maize encodes a short-chain alcohol dehydrogenase required for stage-specific floral organ abortion. Cell 74, 757–768 (1993).
42. Darwin, C.R. The Movements and Habits of Climbing Plants (Murray, London, 1875).
43. Boss, P.K. & Thomas, M.R. Association of dwarfism and floral induction with a grape ‘green revolution’ mutation. Nature 416, 847–850 (2002).
44. Galun, E. The cucumber tendril – a new test organ for gibberellic acid. Cell. Mol. Life Sci. 15, 184–185 (1959).
45. Lange, T. Cloning gibberellin dioxygenase genes from pumpkin endosperm by heterologous expression of enzyme activities in Escherichia coli. Proc. Natl. Acad. Sci. USA 94, 6553–6558 (1997).
46. Braam, J. In touch: plant responses to mechanical stimuli. New Phytol. 165, 373–389 (2005).
47. Cosgrove, D.J. Loosening of plant cell walls by expansins. Nature 407, 321–326 (2000).
48. Lin, M.-K., Lee, Y.-J., Lough, T.J., Phinney, B.S. & Lucas, W.J. Analysis of the pumpkin phloem proteome provides insights into angiosperm sieve tube function. Mol. Cell. Proteomics 8, 343–356 (2009).
49. Rensing, S.A. et al. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319, 64–69 (2008).
50. Aki, T., Shigyo, M., Nakano, R., Yoneyama, T. & Yanagisawa, S. Nano scale proteomics revealed the presence of regulatory proteins including three FT-Like proteins in phloem and xylem saps from rice. Plant Cell Physiol. 49, 767–790 (2008).
51. Giavalisco, P., Kapitza, K., Kolasa, A., Buhtz, A. & Kehr, J. Towards the proteome of Brassica napus phloem sap. Proteomics 6, 896–909 (2006).
52. Dinant, S. et al. Diversity of the superfamily of phloem lectins (phloem protein 2) in angiosperms. Plant Physiol. 131, 114–128 (2003).
53. Pélissier, H.C., Peters, W.S., Collier, R., van Bel, A.J. & Knoblauch, M. GFP tagging of sieve element occlusion (SEO) proteins results in green fluorescent forisomes. Plant Cell Physiol. 49, 1699–1710 (2008).
54. Kim, S.J. et al. Expression of cinnamyl alcohol dehydrogenases and their putative homologues during Arabidopsis thaliana growth and development: lessons for database annotations. Phytochemistry 68, 1957–1974 (2007).

ONLINE METHODS

Removal of contamination for Sanger reads. Sanger reads were aligned against mitochondrion (assembled by us based on the gene sequences of mitochondria of rice and Arabidopsis), chloroplast (GenBank accession code AJ970307) and satellite (GenBank X03768, X03769, X03770, X69163, AY424361 and AY424362) sequences. Reads with identity >95% were filtered.

De novo assembly of Solexa data. The De Bruijn graph method was used to represent all possible sequences assembled by Solexa reads, with a K-mer as a node and the (K − 1) base overlap between two K-mers as an edge. Some tips and low-coverage K-mers in the graph were removed to reduce sequencing errors and eliminate branches. The De Bruijn graph was then converted to a contiging graph by turning a series of linearly connected K-mers into a pre-contig node. Dijkstra’s algorithm was implemented to detect bubbles, which were then straightforwardly merged into a single path if sequences of the branches were sufficiently similar. By this approach, the repeat regions could be assembled into consensus sequences.

Contigs were next connected by paired reads to form a scaffolding graph. Edges in this graph were connections between contigs, and the edge length was estimated from the insert size of the paired reads. The paired-end information was used step by step, from insert sizes around 200 bp and 500 bp to 2 kb. At each step, two procedures were applied: the repeat-masking method masked the complicated connections around repeat contigs, and the subgraph linearization turned the interleaving contigs into linear structure. This process yielded the final set of Solexa contigs and scaffolds.

Identification of repetitive elements in the cucumber genome. Four de novo software packages, ReAS56, PILER-DF57, RepeatScout58 and LTR_Finder59, were used to search for repeat sequences within the cucumber genome. All repeat sequences with lengths >100 bp and gap ‘N’ <5% constituted the raw transposable element library.

The repeat elements belonging to rRNA and satellite sequences were first filtered using BLASTN (E value ≤ 1 × 10⁻¹⁰, identity ≥ 80%, coverage ≥ 50% and minimal matching length ≥ 100 bp). All-versus-all BLASTN (E value ≤ 1 × 10⁻¹⁰) searches were then conducted iteratively, and the shorter sequences were filtered when two repeats aligned with identity ≥ 80%, coverage ≥ 80% and minimal matching length ≥ 100 bp; this yielded a nonredundant transposable element library. The nonredundant repeats were then searched against the Swiss-Prot protein database to filter the protein-coding genes by BLASTX (E value ≤ 1 × 10⁻⁴, identity ≥ 30%, coverage ≥ 30% and minimal matching length ≥ 30 amino acids). After manual curation, a de novo transposable element library for cucumber was obtained.

Transposable elements in the cucumber genome assembly were identified both at the DNA and protein level. RepeatMasker was applied for DNA-level identification using a custom library (a combination of Repbase, plant repeat database and our cucumber de novo transposable element library). At the protein level, RepeatProteinMask was used to conduct WU-BLASTX searches against the transposable element protein database. Overlapping transposable elements belonging to the same type of repeats were integrated together, whereas those with low scores were removed if they overlapped >80% and belonged to different types.
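The De Bruijn graph representation described under "De novo assembly of Solexa data" can be illustrated with a minimal Python sketch. It builds a graph with K-mers as nodes and (K − 1)-base overlaps as edges, then merges each run of linearly connected K-mers into one pre-contig. This is not the assembler used for the cucumber genome: the function names `build_graph` and `pre_contigs` are hypothetical, and tip removal, low-coverage filtering, bubble merging and reverse-complement handling are all omitted.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins two k-mers adjacent in a read,
    which by construction overlap by k - 1 bases."""
    succ = defaultdict(set)  # k-mer -> set of successor k-mers
    pred = defaultdict(set)  # k-mer -> set of predecessor k-mers
    nodes = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def pre_contigs(nodes, succ, pred):
    """Merge each maximal chain of linearly connected k-mers into one sequence.
    Isolated cycles with no branch point are not reported in this sketch."""
    def starts_chain(node):
        p = pred.get(node, ())
        if len(p) != 1:
            return True
        only_pred = next(iter(p))
        # A node continues a chain only if its single predecessor has a single successor.
        return len(succ[only_pred]) != 1

    contigs = []
    visited = set()
    for node in sorted(nodes):
        if not starts_chain(node):
            continue
        seq, cur = node, node
        visited.add(node)
        while len(succ.get(cur, ())) == 1:
            nxt = next(iter(succ[cur]))
            if len(pred[nxt]) != 1 or nxt in visited:
                break  # branch point or loop: stop extending
            seq += nxt[-1]  # each step adds one new base
            visited.add(nxt)
            cur = nxt
        contigs.append(seq)
    return contigs
```

For two overlapping toy reads, `pre_contigs(*build_graph(["ACGTAC", "GTACGG"], 4))` yields a single merged sequence, whereas reads that diverge after a shared prefix produce one pre-contig for the shared part and one per branch.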

Combination of Sanger reads and Solexa scaffolds. RePS2 (ref. 55) software was used to assemble the Solexa scaffolds and Sanger reads. We counted the depth of each 17-mer in the 3.9× plasmid and fosmid ends to create the 17-mer database, which contained all the depth information of the 17-mers. This database was then used to check all the contigs to identify repeated ones. A contig was defined as a repeat if over 80% of the 17-mers it contained had a depth higher than the threshold. After removing the repeat contigs, the scaffolds were divided into fake paired reads with read length of 600 bp and insert size of 1,700 bp. All segments over 200 bp were put into the second data set, which was then assembled as a unique region. In the same way as the construction of Solexa scaffolds, the plasmid, fosmid and BAC ends were used, step by step, to construct a ‘superscaffold’.

Misassembly checking and gap filling. In the final stage, we used the repeat sequences to fill the gaps in the scaffolds using the following steps. First, we mapped all of the reads that contained paired-end information (Solexa and plasmid reads, as well as fosmid and BAC ends) to the scaffolds, and we used the unique contigs to establish the paired-end relationship between the contigs. Second, we identified repeat contigs with paired ends that uniquely connected two other scaffolded contigs. If the length of the repeat contig and the estimated size of the gap were similar, the gap was filled by this repeat. Any remaining repeat contigs that were not used for gap filling were added into the final set of scaffolds.

Chromosome anchoring along the cucumber genetic map. The marker sequences in the cucumber genetic map were aligned against the scaffold sequences using BLASTN at an E-value cutoff of 1 × 10⁻²⁰. Hits with coverage >30% and identity >90% were considered mapped markers. Based on the mapped markers, the scaffold sequences were anchored on the cucumber chromosomes. During this process, the scaffolds with mapped markers that showed inconsistent genetic positions were manually checked by paired-end relationships; the incorrect scaffold was then split.

FISH analysis. The FISH protocol was described in a previous study16. To better visualize the segmental inversion, we chose chromosome spreads where chromosome 5 appeared in a straight form. Instead of showing all chromosomes16, only chromosome 5 is shown in Figure 1b of this study. In addition, the image was taken in a higher resolution. Scale bars represent 1 µm, as compared to 3 µm previously16. Red and green signals were detected with anti-digoxigenin antibody coupled to rhodamine (Roche) and by anti-avidin antibody conjugated with FITC (Vector Laboratories), respectively.

Gene prediction. Our strategy for gene prediction was to conduct de novo predictions on the repeat-masked genome and then integrate them with spliced alignments of proteins and transcripts to genome sequences using GLEAN60. Cucumber genome sequences were masked by identified repeat sequences with length >500 bp, except for miniature inverted-repeat transposable elements, which are usually found near genes or inside introns. The EST and full-length cDNA sequences of cucumber were processed by PASA61 to train gene prediction software BGF62, GlimmerHMM63 and SNAP64. Augustus65 and Genscan66 software used gene model parameters trained for Arabidopsis. We aligned the protein sequences of five sequenced plants (Arabidopsis, papaya, poplar, grapevine and rice) onto the cucumber genome using TBLASTN, at an E-value cutoff of 1 × 10⁻⁵, and the homologous genome sequences were aligned against the matching proteins using GeneWise67 for accurate spliced alignments. The cDNA and EST sequences of cucumber and melon were aligned against the cucumber genome using BLAT (identity ≥ 0.95, coverage ≥ 0.90) to generate spliced alignments. We also aligned TIGR unigenes68 from Cucurbitales, Fabales and Fagales to the cucumber genome by ATT_gap2 (ref. 69). All of these resources were combined by GLEAN60 to produce the consensus gene sets.

Identification of noncoding RNA genes in the cucumber genome. The tRNA genes were identified by tRNAscan-SE70 with default parameters. The C/D-box small nucleolar RNAs were identified by Snoscan71 using yeast rRNA and yeast methylation sites. Other noncoding RNAs, including miRNA, small nuclear RNA and H/ACA-box small nucleolar RNA, were identified using INFERNAL software by searching against the Rfam72 database with default parameters.

Construction of gene families. We adapted the Treefam73 method to construct gene families for the genes in cucumber, Arabidopsis, papaya, poplar, grapevine and rice (outgroup).

Construction of syntenic blocks. We identified syntenic blocks between two species (A and B) by an automatic clustering algorithm on a dot plot graph, which included five steps. First, markers (gene pairs) were generated between A and B. All protein sequences of A were aligned to all proteins of B using BLASTP (E value < 1 × 10⁻¹⁰ and identity > 20%). The fragmental alignments were conjoined for each gene pair. Those gene pairs with aligned regions covering <50% were filtered. The remaining gene pairs were plotted on the dot graph as markers (points). Second, the Euclidean distance was calculated for each pair. Distances were calculated based on the gene order in each chromosome rather than the genomic position. Third, hierarchical clustering was

determined for all of the points. If the distance between two points was less than the distance cutoff, a link was assigned. The distance cutoff was adapted in accordance with the selected species. Fourth, the quality was estimated for each cluster by calculating the point number (N), average point distance (D) and correlation coefficient (R). A score (S) was calculated to show the overall quality, defined as S = N × sqrt(2)/D × R. Finally, problematic clusters were filtered. Clusters with N < 8 or |R| < 0.5 were filtered out. The clusters caused by tandem duplication were further filtered by determining the slope (L) of the regression line within a range of 0.1 < |L| < 10. This algorithm can also be used to study intraspecies synteny.

4DTv calculation. After the identification of syntenic blocks, the pairwise protein alignments for each gene pair were first constructed with MUSCLE74. The nucleotide alignment was then created according to the protein alignment. 4DTv was then calculated on concatenated nucleotide alignments with HKY substitution models75.

Comparative analysis between cucumber and melon. Cucumber genome sequences were aligned with melon BAC sequences using NUCmer, a program in the MUMmer package76. The delta-filter program was then run with the -1 option to remove complex alignments. Orthologous gene pairs were identified by the reciprocal best method.

The Bayesian relaxed molecular clock approach was used to estimate divergence time using the program MULTIDIVTIME, which was implemented using the Thornian Time Traveler (T3) package. The calibration time (fossil record time) interval (54–90 million years ago) of Capparales was obtained from previous results77,78.

URLs. Arabidopsis thaliana (TIGR Release 5.0), ftp://ftp.tigr.org/pub/data/a_thaliana/ath1; Carica papaya (assembly v1.0, EVidence Modeler genes), http://www.life.uiuc.edu/ming; Populus trichocarpa (assembly release v1.0, annotation v1.1), http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.download.ftp.html; Vitis vinifera (published assembly, annotation v1), http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/; Oryza sativa (assembly International Rice Genome Sequencing Project build 3), http://rgp.dna.affrc.go.jp/IRGSP/download.html; Oryza sativa (GLEAN genes annotated by Beijing Genomics Institute), ftp.genomics.org.cn/pub/ricedb/rice_update_data/GLEAN_genes/IRGSP_japonica/; Physcomitrella patens (assembly release v1.0, annotation v1.1), http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html; Sorghum bicolor (assembly release v1.0, annotation v1.4), http://www.phytozome.net/sorghum; UniGene sequences of Cucurbitales, Fabales and Fagales, http://plantta.jcvi.org/; cucumber marker sequences, http://cucumber.genomics.org.cn; UniProt (Swiss-Prot/TrEMBL) release 14.1, http://www.uniprot.org/downloads; InterPro v18.0, http://www.ebi.ac.uk/interpro/; KEGG release 47, ftp://ftp.genome.jp/pub/kegg/pathway/; Repbase release 13.07, http://www.girinst.org/repbase/index.html; Plant Repeat Databases (TIGR), http://plantrepeats.plantbiology.msu.edu/index.html; Rfam release 9.0, http://rfam.sanger.ac.uk/; Thornian Time Traveler (T3) package, http://abacus.gene.ucl.ac.uk/software.html; RepeatMasker, http://www.repeatmasker.org.

55. Wang, J. et al. RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Res. 12, 824–831 (2002).
56. Li, R. et al. ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLOS Comput. Biol. 1, e43 (2005).
57. Edgar, R.C. & Myers, E.W. PILER: identification and classification of genomic repeats. Bioinformatics 21 Suppl 1, i152–i158 (2005).
58. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1, i351–i358 (2005).
59. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
60. Elsik, C.G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13 (2007).
61. Campbell, M.A., Haas, B.J., Hamilton, J.P., Mount, S.M. & Buell, C.R. Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7, 327 (2006).
62. Li, H. et al. Test data sets and evaluation of gene prediction programs on the rice genome. J Comp Sci Tech 20, 446–453 (2005).
63. Majoros, W.H., Pertea, M. & Salzberg, S.L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
64. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
65. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 Suppl 2, ii215–ii225 (2003).
66. Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
67. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
68. Childs, K.L. et al. The TIGR Plant Transcript Assemblies database. Nucleic Acids Res. 35, D846–D851 (2007).
69. Huang, X., Adams, M.D., Zhou, H. & Kerlavage, A.R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997).
70. Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
71. Lowe, T.M. & Eddy, S.R. A computational screen for methylation guide snoRNAs in yeast. Science 283, 1168–1171 (1999).
72. Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–D124 (2005).
73. Li, H. et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 34, D572–D580 (2006).
74. Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
75. Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985).
76. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
77. Crepet, W.L., Nixon, K.C. & Gandolfo, M.A. Fossil evidence and phylogeny: the age of major angiosperm clades based on mesofossil and macrofossil evidence from Cretaceous deposits. Am. J. Botany 91, 1666–1682 (2004).
78. Wikström, N., Savolainen, V. & Chase, M.W. Evolution of the angiosperms: calibrating the family tree. Proc. Biol. Sci. 268, 2211–2220 (2001).

© 2009 Nature America, Inc. All rights reserved. Nature Genetics, doi:10.1038/ng.475
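As a rough illustration of the cluster quality filter described in the synteny methods above, the following Python sketch computes N, D, R, the regression slope L and the score S for a single cluster and applies the stated cutoffs. The function names are hypothetical, and two readings are assumptions made here rather than details given in the text: "average point distance" is taken as the mean Euclidean distance between consecutive points, and S = N × sqrt(2)/D × R is evaluated left to right.

```python
import math


def cluster_stats(points):
    """Return (N, D, R, L) for a cluster of (x, y) dot-plot points.

    D is ASSUMED to be the mean Euclidean distance between consecutive
    points after sorting; the source only says "average point distance".
    """
    pts = sorted(points)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    if sxx == 0 or syy == 0:
        raise ValueError("degenerate cluster: no variance along one axis")
    d = sum(math.dist(a, b) for a, b in zip(pts, pts[1:])) / (n - 1)
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient R
    slope = sxy / sxx                # least-squares slope L of y on x
    return n, d, r, slope


def cluster_score(n, d, r):
    """Score S = N * sqrt(2) / D * R, evaluated left to right (assumed)."""
    return n * math.sqrt(2) / d * r


def keep_cluster(points, min_points=8, min_abs_r=0.5, slope_bounds=(0.1, 10.0)):
    """Apply the cutoffs from the text: drop N < 8 or |R| < 0.5, and keep
    only clusters whose regression slope satisfies 0.1 < |L| < 10."""
    if len(points) < min_points:
        return False
    try:
        _, _, r, slope = cluster_stats(points)
    except ValueError:
        return False  # all points share one x or one y value
    low, high = slope_bounds
    return abs(r) >= min_abs_r and low < abs(slope) < high
```

Under these assumptions, a perfect diagonal of unit-spaced points has D = sqrt(2) and R = 1, so S reduces to N, which may be why the sqrt(2) factor appears in the score.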

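As an illustration of the 4DTv step described in the methods above, the Python sketch below counts transversions at fourfold-degenerate third codon positions for one codon-aligned gene pair, assuming the standard genetic code. It returns only the uncorrected proportion; the HKY-based correction for multiple substitutions mentioned in the text is not reproduced here, and the function name and input format are illustrative assumptions.

```python
# First two bases of the eight codon families whose third position is
# fourfold degenerate in the standard genetic code
# (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def raw_4dtv(seq_a, seq_b):
    """Uncorrected 4DTv for two in-frame, codon-aligned nucleotide strings.

    A site is counted when both codons share the same fourfold-degenerate
    two-base prefix; it is a transversion when one third base is a purine
    and the other a pyrimidine. Returns NaN if no site qualifies.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    sites = 0
    transversions = 0
    for i in range(0, len(seq_a) - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue  # gaps and non-fourfold codons are skipped
        third_a, third_b = codon_a[2], codon_b[2]
        if third_a not in BASES or third_b not in BASES:
            continue
        sites += 1
        if (third_a in PURINES) != (third_b in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")
```

For a concatenated alignment, the same function can be applied to the joined codon-aligned sequences of all gene pairs in a syntenic block.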
Vol 463 | 14 January 2010 | doi:10.1038/nature08670 ARTICLES

Genome sequence of the palaeopolyploid soybean

Jeremy Schmutz1,2, Steven B. Cannon3, Jessica Schlueter4,5, Jianxin Ma5, Therese Mitros6, William Nelson7, David L. Hyten8, Qijian Song8,9, Jay J. Thelen10, Jianlin Cheng11, Dong Xu11, Uffe Hellsten2, Gregory D. May12, Yeisoo Yu13, Tetsuya Sakurai14, Taishi Umezawa14, Madan K. Bhattacharyya15, Devinder Sandhu16, Babu Valliyodan17, Erika Lindquist2, Myron Peto3, David Grant3, Shengqiang Shu2, David Goodstein2, Kerrie Barry2, Montona Futrell-Griggs5, Brian Abernathy5, Jianchang Du5, Zhixi Tian5, Liucun Zhu5, Navdeep Gill5, Trupti Joshi11, Marc Libault17, Anand Sethuraman1, Xue-Cheng Zhang17, Kazuo Shinozaki14, Henry T. Nguyen17, Rod A. Wing13, Perry Cregan8, James Specht18, Jane Grimwood1,2, Dan Rokhsar2, Gary Stacey10,17, Randy C. Shoemaker3 & Scott A. Jackson5

Soybean (Glycine max) is one of the most important crop plants for seed protein and oil content, and for its capacity to fix atmospheric nitrogen through symbioses with soil-borne microorganisms. We sequenced the 1.1-gigabase genome by a whole-genome shotgun approach and integrated it with physical and high-density genetic maps to create a chromosome-scale draft sequence assembly. We predict 46,430 protein-coding genes, 70% more than Arabidopsis and similar to the poplar genome which, like soybean, is an ancient polyploid (palaeopolyploid). About 78% of the predicted genes occur in chromosome ends, which comprise less than one-half of the genome but account for nearly all of the genetic recombination. Genome duplications occurred at approximately 59 and 13 million years ago, resulting in a highly duplicated genome with nearly 75% of the genes present in multiple copies. The two duplication events were followed by gene diversification and loss, and numerous chromosome rearrangements. An accurate soybean genome sequence will facilitate the identification of the genetic basis of many soybean traits, and accelerate the creation of improved soybean varieties.

Legumes are an important part of world agriculture as they fix atmospheric nitrogen by intimate symbioses with microorganisms. The soybean in particular is important worldwide as a predominant plant source of both animal feed protein and cooking oil. We report here a soybean whole-genome shotgun sequence of Glycine max var. Williams 82, comprised of 950 megabases (Mb) of assembled and anchored sequence (Fig. 1), representing about 85% of the predicted 1,115-Mb genome1 (Supplementary Table 3.1). Most of the genome sequence (Fig. 1) is assembled into 20 chromosome-level pseudomolecules containing 397 sequence scaffolds with ordered positions within the 20 soybean linkage groups. An additional 17.7 Mb is present in 1,148 unanchored sequence scaffolds that are mostly repetitive and contain fewer than 450 predicted genes. Scaffold placements were determined with extensive genetic maps, including 4,991 single nucleotide polymorphisms (SNPs) and 874 simple sequence repeats (SSRs)2–5. All but 20 of the 397 sequence scaffolds are unambiguously oriented on the chromosomes. Unoriented scaffolds are in repetitive regions where there is a paucity of recombination and genetic markers (see Supplementary Information for assembly details).

The soybean genome is the largest whole-genome shotgun-sequenced plant genome so far and compares favourably to all other high-quality draft whole-genome shotgun-sequenced plant genomes (Supplementary Table 4). A total of 8 of the 20 chromosomes have telomeric repeats (TTTAGGG or CCCTAAA) on both of the distal scaffolds and 11 other chromosomes have telomeric repeats on a single arm, for a total of 27 out of 40 chromosome ends captured in sequence scaffolds. Also, internal scaffolds in 19 of 20 chromosomes contain a large block of characteristic 91- or 92-base-pair (bp) centromeric repeats6,7 (Fig. 1). Four chromosome assemblies contain several 91/92-bp blocks; this may be the correct physical placements of these sequences, or may reflect the difficulty in assembling these highly repetitive regions.

Gene composition and repetitive DNA

A striking feature of the soybean genome is that 57% of the genomic sequence occurs in repeat-rich, low-recombination heterochromatic regions surrounding the centromeres. The average ratio of genetic-to-physical distance is 1 cM per 197 kb in euchromatic regions, and 1 cM per 3.5 Mb in heterochromatic regions (see Supplementary Information section 1.8). For reference, these proportions are similar to those in Sorghum, in which 62% of the sequence is heterochromatic, and different than in rice, with 15% in heterochromatin8. In

1HudsonAlpha Genome Sequencing Center, 601 Genome Way, Huntsville, Alabama 35806, USA. 2Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 94598, USA. 3USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, Iowa 50011, USA. 4Department of Bioinformatics and Genomics, 9201 University City Blvd, University of North Carolina at Charlotte, Charlotte, North Carolina 28223, USA. 5Department of Agronomy, Purdue University, 915 W. State Street, West Lafayette, Indiana 47906, USA. 6Center for Integrative Genomics, University of California, Berkeley, California 94720, USA. 7Arizona Genomics Computational Laboratory, BIO5 Institute, 1657 E. Helen Street, The University of Arizona, Tucson, Arizona 85721, USA. 8USDA, ARS, Soybean Genomics and Improvement Laboratory, B006, BARC-West, Beltsville, Maryland 20705, USA. 9Department Plant Science and Landscape Architecture, University of Maryland, College Park, Maryland 20742, USA. 10Division of Biochemistry & Interdisciplinary Plant Group, 109 Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA. 11Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA. 12The National Center for Genome Resources, 2935 Rodeo Park Drive East, Santa Fe, New Mexico 87505, USA. 13Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, Arizona 85721, USA. 14RIKEN Plant Science Center, Yokohama 230-0045, Japan. 15Department of Agronomy, Iowa State University, Ames, Iowa 50011, USA. 16Department of Biology, University of Wisconsin-Stevens Point, Stevens Point, Wisconsin 54481, USA. 17National Center for Soybean Biotechnology, Division of Plant Sciences, University of Missouri, Columbia, Missouri 65211, USA. 18Department of Agronomy and Horticulture, University of Nebraska, Lincoln, Nebraska 68583, USA. ©2010 Macmillan Publishers Limited. All rights reserved

general, these boundaries, determined on the basis of suppressed recombination, correlate with transitions in gene density and transposon density. Ninety-three per cent of the recombination occurs in the repeat-poor, gene-rich euchromatic genomic region that only accounts for 43% of the genome. Nevertheless, 21.6% of the high-confidence genes are found in the repeat- and transposon-rich regions in the chromosome centres.

[Figure 1: tracks for the 20 chromosomes (linkage groups) on a 0–60 Mb axis: Chr1-D1a, Chr2-D1b, Chr3-N, Chr4-C1, Chr5-A1, Chr6-C2, Chr7-M, Chr8-A2, Chr9-K, Chr10-O, Chr11-B1, Chr12-H, Chr13-F, Chr14-B2, Chr15-E, Chr16-J, Chr17-D2, Chr18-G, Chr19-L, Chr20-I. Legend: Genes, DNA transposons, Copia-like retrotransposons, Gypsy-like retrotransposons, Cen91/92.]

Figure 1 | Genomic landscape of the 20 assembled soybean chromosomes. Major DNA components are categorized into genes (blue), DNA transposons (green), Copia-like retrotransposons (yellow), Gypsy-like retrotransposons (cyan) and Cent91/92 (a soybean-specific centromeric repeat (pink)), with respective DNA contents of 18%, 17%, 13%, 30% and 1% of the genome sequence. Unclassified DNA content is coloured grey. Categories were determined for 0.5-Mb windows with a 0.1-Mb shift.

We identified 46,430 high-confidence protein-coding loci in the soybean genome, using a combination of full-length complementary DNAs9, expressed sequence tags, homology and ab initio methods (Supplementary Information section 2). Another ~20,000 loci were predicted with lower confidence; this set is enriched for hypothetical, partial and/or transposon-related sequences, and possesses shorter coding sequences and fewer introns than the high-confidence set. The exon–intron structure of genes shows high conservation among soybean, poplar and grapevine, consistent with a high degree of position and phase conservation found more broadly across angiosperms10. Introns in soybean gene pairs retained in duplicate have a strong tendency to persist. Of 19,775 introns shared by poplar and grapevine (diverged more than 90 million years (Myr) ago11), and hence by the last common ancestor of soybean and grapevine, 19,666 (99.45%) were preserved in both copies in soybean. Of the remaining 0.55%, 78% are absent in both recent soybean copies (that is, lost before the ~13-Myr-ago duplication) and 22% are found only in one paralogue (that is, other copy lost). We find a slower intron loss rate in poplar (0.4%) than in soybean (0.6%) since the last common rosid ancestor, which is consistent with the slower rate of sequence evolution in the poplar lineage thought to be associated with its perennial, clonal habit, global distribution and wind pollination12. Intron size is also highly conserved in recent soybean paralogues, indicating that few insertions and deletions have accumulated within introns over the past 13 Myr.

Of the 46,430 high-confidence loci, 34,073 (73%) are clearly orthologous with one or more sequences in other angiosperms, and can be assigned to 12,253 gene families (Supplementary Table 5). Among pan-angiosperm or pan-rosid gene families that also have members outside the legumes, soybean is particularly enriched (using a Fisher's exact test relative to Arabidopsis) in genes containing NB-ARC (nucleotide-binding-site-APAF1-R-Ced) and LRR (leucine-rich-repeat) domains. These genes are associated with the plant immune system, and are known to be dynamic13. Tandem gene family expansions are common in soybean and include NBS-LRR, F-box, auxin-responsive protein, and other domains commonly found in large gene families in plants. The ages of genes in these tandem families, inferred from intrafamily sequence divergence, indicate that they originated at various times in the evolutionary history of soybean, rather than in a discrete burst.

From protein families in the sequenced angiosperms (http://www.phytozome.net) (Supplementary Table 4), we identified 283 putative legume-specific gene families containing 448 high-confidence soybean genes (Supplementary Information section 2). These gene families include soybean and Medicago representatives, but no representatives from grapevine, poplar, Arabidopsis, papaya, or grass (Sorghum, rice, maize, Brachypodium). The top domains in this set are the AP2 domain, protein kinase domain, cytochrome P450, and PPR repeat. An additional 741 putatively soybean-specific gene families (each consisting of two or more high-confidence soybean genes) may also include legume-specific genes that have not yet been sequenced in the ongoing Medicago sequencing project, or may represent bona fide soybean-specific genes. The top domains in this list include protein kinase and protein tyrosine kinase, AP2, LRR, MYB-like DNA binding domain, cytochrome P450 (the same domains most common in the entire soybean proteome) as well as GDSL-like lipase/acylhydrolase and stress-upregulated Nod19.

A combination of structure-based analyses and homology-based comparisons resulted in identification of 38,581 repetitive elements, covering most types of plant transposable elements. These elements, together with numerous truncated elements and other fragments, make up ~59% of the soybean genome (Supplementary Table 6). Long terminal repeat (LTR) retrotransposons are the most abundant class of transposable elements. The soybean genome contains ~42% LTR retrotransposons, fewer than Sorghum8 and maize14, but higher than rice15. The intact element sizes range from 1 kb to 21 kb, with an average size of 8.7 kb (Supplementary Fig. 2). Of the 510 families containing 14,106 intact elements, 69% are Gypsy-like and the remainder Copia-like. However, most (~78%) of these families are present at low copy numbers, typically fewer than 10 copies. The genome also contains an estimated 18,264 solo LTRs, probably caused by homologous recombination between LTRs from a single element. Nested retrotransposons are common, with 4,552 nested insertion events identified. The copy numbers within each block range from one to six.

The genome consists of ~17% DNA transposons, divided into Tc1/Mariner, hAT, Mutator, PIF/Harbinger, Pong, CACTA superfamilies and Helitrons. Of these superfamilies, those containing more than 65 complete copies, Tc1/Mariner and Pong, comprise ~0.1% of the genome sequence, and seem to have not undergone recent amplification, indicating that they may be inactive and relatively old. Conversely, other families seem to have amplified recently and may still be active, indicated by the high similarity (>98%) of multiple elements.

Multiple whole-genome duplication events

Timing and phylogenetic position. A striking feature of the soybean genome is the extent to which blocks of duplicated genes have been retained. On the basis of previous studies that examined pairwise synonymous distance (Ks values) of paralogues16,17, and targeted sequencing of duplicated regions within the soybean genome18, we expected that large homologous regions would be identified in the genome. Using a pattern-matching search, gene families of sizes from two to six were identified, and Ks values were calculated for these genes,

here displayed as a histogram plot (Fig. 2), which shows two distinct peaks. Similarly, nucleotide diversity for the fourfold synonymous third-codon transversion position, 4dTv, was calculated. Both metrics give a measure of divergence between two genes, but the 4dTv uses a subset of the sites (transitions/transversion) used in the computation of Ks. 31,264 high-confidence soybean genes have recent paralogues with Ks ≈ 0.13 synonymous substitutions per site and 4dTv ≈ 0.0566 synonymous transversions per site (Fig. 3), corresponding to a soybean-lineage-specific palaeotetraploidization. This was probably an allotetraploidy event based on chromosomal evidence19. Of the 46,430 high-confidence genes, 31,264 exist as paralogues and 15,166 have reverted to singletons. We infer that the pre-duplication proto-soybean genome possessed ~30,000 genes: half of (2 × 15,166 + 2 × 15,632) = 30,798. This number is comparable to the modern Arabidopsis gene complement. A second paralogue peak at Ks ≈ 0.59 (4dTv ≈ 0.26) corresponds to the early-legume duplication, which several lines of evidence suggest occurred near the origins of the papilionoid lineage20. The papilionoid origin has been dated to approximately 59 Myr ago21. A third highly diffuse peak is seen when the plot is expanded past a Ks value of 1.5 (data not shown) and most probably corresponds to the 'gamma' event22, shown to be a triplication in Vitis23 and in other angiosperms24.

[Figure 3: histogram. x-axis, 4dTv distance (corrected for multiple substitutions), 0 to 1.4; y-axis, gene pairs on syntenic segments, 0 to 10,000. Series: Poplar–soybean, Grapevine–soybean, Arabidopsis–soybean, Rice–soybean, Medicago–soybean, Soybean–soybean.]

Figure 3 | Distribution of 4dTv distance between syntenically orthologous genes. Segments were found by locating blocks of BLAST hits with significance 1 × 10−18 or better with less than 10 intervening genes between such hits. The 4dTv distance between orthologous genes on these segments is reported.

Owing to the existence of macrofossils in the legumes and allies, the timing of clade origins in the legumes is better established than in other plant families. A fossil-calibrated molecular clock for the legumes places the origin of the legume stem clade and the oldest papilionoid crown clade at 58 to 60 Myr ago21. If the early-legume whole-genome duplication (WGD) occurred outside the papilionoid lineage, as suggested by map evidence from Arachis (an early-diverging genus in the papilionoid clade)20, then the duplication occurred within the narrow window of time between the origin of the legumes and the papilionoid radiation. If the older duplication is assumed to have occurred around 58 Myr ago, then the calculated rate of silent mutations extending back to the duplication would be 5.17 × 10−3, similar to previous estimates of 5.2 × 10−3 (ref. 21). The Glycine-specific duplication is estimated to have occurred ~13 Myr ago, an age consistent with previous estimates16,17.

Structural organization. We identified homologous blocks within the genome using i-ADHoRe25. Using relatively stringent parameters, 442 multiplicons (that is, duplicated segments) were identified within the soybean genome and visualized using Circos26 (Fig. 2). Owing to the multiple rounds of duplication and diploidization in the genome, as well as chromosomal rearrangements, multiplicons (or blocks) between chromosomes can involve more than just two chromosomes. On average, 61.4% of the homologous genes are found in blocks involving only two chromosomes, only 5.63% spanning three chromosomes, and 21.53% traversing four chromosomes. Two notable exceptions to this pattern are chromosome 14, which has 11.8% of its genes retained across three chromosomes, and chromosome 20 with 7.08% of the homologues (gene pairs resulting from genome duplication) retained across four chromosomes. Chromosome 14 seems to be a highly fragmented chromosome with block matches to 14 other chromosomes, the highest number of all chromosomes. Conversely, chromosome 20 is highly homologous to the long arm of chromosome 10, with few matches elsewhere in the genome.

[Figure 2: top panels a–c, circular plots of the 20 chromosomes with lines connecting homologous genes; bottom panel, histogram with x-axis synonymous distance (0 to 1.5) and y-axis pairs (%).]

Figure 2 | Homologous relationships between the 20 soybean chromosomes. The bottom histogram plot shows pairwise Ks values for gene family sizes 2 to 6. Top panels show the 20 chromosomes in a circle with lines connecting homologous genes. Gene-rich regions (euchromatin) of each chromosome are coded a different colour around the circle. Grey represents Ks values of 0.06–0.39, 13-Myr genome duplication; black represents Ks values of 0.40–0.80, 59-Myr genome duplication. These correspond to the grey and black bars in the histogram. a, Chromosomes 1, 11, 17, 7, 10 and 3, which contain centromeric repeat Sb91. b, Chromosomes 19 and 6, which contain both Sb91 and Sb92 centromeric repeats. c, Chromosomes 18, 5, 8, 2, 14, 12, 13, 9, 15, 16, 20 and 4, which contain Sb92.

Retention of homologues across the genome is exceptionally high; blocks retained in two or more chromosomes can be clearly observed (Fig. 2 and Supplementary Figs 5 and 6). The number of homologues (gene pairs) within a block averages 31, although any given block may contain from 6 to 736 homologues. Given that not all genes within a block are retained as homologues (owing to loss of duplicated genes over time (fractionation)), the average number of genes in a block is ~75 genes and ranges from 8 to 1,377 genes.

Repeated duplications in the soybean genome make it possible to determine rates of gene loss following each round of polyploidy. In homologous segments from the 13-Myr-old Glycine duplication, 43.4% of genes have matches in the corresponding region, in contrast to 25.9% in blocks from the early legume duplication. Combining these gene-loss rates with WGD dates of 13 Myr ago and 59 Myr ago, the rate of gene loss has been 4.36% of genes per Myr following the Glycine WGD and 1.28% of genes per Myr following the early-legume

WGD. This differential in gene-loss rates indicates an exponential decay pattern of rapid gene loss after duplication, slowing over time.

Nodulation and oil biosynthesis genes

A unique feature of legumes is their ability to establish nitrogen-fixing symbioses with soil bacteria of the family Rhizobiaceae. Therefore, information on the nodulation functions of the soybean genome is of particular interest. Sequence comparisons with previously identified nodulation genes identified 28 nodulin genes and 24 key regulatory genes, which probably represent true orthologues of known nodulation genes in other legume species (Supplementary section 3 and Supplementary Table 8). Among this list of 52 genes, 32 have at least one highly conserved homologue gene. We hypothesize that these are homologous gene pairs arising from the Glycine WGD (that is, ~13 Myr ago). Further analysis shows that seven soybean nodulin genes produce transcript variants. The exceptional example is nodulin-24 (Glyma14g05690), which seems to produce ten transcript variants (Supplementary Table 8). In total, 25% of the examined nodulin genes produce transcript variants, which is slightly higher than the incidence of alternative splicing in Arabidopsis (~21.8%) and rice (~21.2%)27. However, none of the soybean regulatory nodulation genes produces transcript variants (Supplementary Table 8).

Mining the soybean genome for genes governing metabolic steps in triacylglycerol biosynthesis could prove beneficial in efforts to modify soybean oil composition or content. Genomic analysis of acyl lipid biosynthesis in Arabidopsis revealed 614 genes involved in pathways beginning with plastid acetyl-CoA production for de novo fatty acid synthesis through cuticular wax deposition28. Comparison of these sequences to the soybean genome identified 1,127 putative orthologous and paralogous genes in soybean. This is probably a low estimate owing to the high stringency conditions used for gene mining. The distribution of these genes according to various functional classes of acyl lipid biosynthesis is shown in Table 1.

Table 1 | Putative acyl lipid genes in Arabidopsis and soybean

Function category of acyl lipid genes                      Number in Arabidopsis   Number in soybean
Synthesis of fatty acids in plastids                       46                      75
Synthesis of membrane lipids in plastids                   20                      33
Synthesis of membrane lipids in endomembrane system        56                      117
Metabolism of acyl lipids in mitochondria                  29                      69
Synthesis and storage of oil                               19                      22
Degradation of storage lipids and straight fatty acids     43                      155
Lipid signalling                                           153                     312
Fatty acid elongation and wax and cutin metabolism         73                      70
Miscellaneous                                              175                     274
Total                                                      614                     1,127

Comparing Arabidopsis to soybean, the number of genes involved in storage lipid synthesis, fatty acid elongation and wax/cutin production was similar. For all other subclasses, the soybean genome contained substantially higher numbers of genes. Interestingly, the number of genes involved in lipid signalling, degradation of storage lipids, and membrane lipid synthesis was two- to threefold higher in soybean than in Arabidopsis, indicating that these areas of acyl lipid synthesis are more complex in soybean. The number of genes involved in plastid de novo fatty acid synthesis was 63% higher in soybean compared to Arabidopsis. Many single-gene activities in Arabidopsis are encoded by multigene families in soybean, including ketoacyl-ACP synthase II (12 copies in soybean), malonyl-CoA:ACP malonyltransferase (2 copies), enoyl-ACP reductase (5 copies), acyl-ACP thioesterase FatB (6 copies) and plastid homomeric acetyl-CoA carboxylase (3 copies). Long-chain acyl-CoA synthetases, ER acyltransferases, mitochondrial glycerol-phosphate acyltransferases, and lipoxygenases are all unusually large gene families in soybean, containing as many as 24, 21, 20 and 52 members, respectively. The multigenic nature of these and many other activities involved in acyl lipid metabolism suggests the potential for more complex transcriptional control in soybean compared to Arabidopsis.

Transcription factor diversity

We identified soybean transcription factor genes by sequence comparison to known transcription factor gene families, as well as by searching for known DNA-binding domains. In total, 5,671 putative soybean transcription factor genes, distributed in 63 families, were identified (Fig. 4a and Supplementary Table 9). This number represents 12.2% of the 46,430 predicted soybean protein-coding loci. A similar analysis performed on the Arabidopsis genome identified 2,315 putative Arabidopsis transcription factor genes, representing 7.1% of the 32,825 predicted Arabidopsis protein-coding loci (Fig. 4b). Transcription factor genes are homogeneously distributed across the chromosomes in both soybean and Arabidopsis, with an average relative abundance of 8–10% transcription factor genes on each chromosome. On rare occasions, regions were identified in both genomes that had a relatively low (<5%) or high density (>12%) of transcription factor genes. Among the transcription factor genes identified, 9.5% of soybean genes (538 transcription factor genes) and 8.2% of Arabidopsis genes (190 Arabidopsis transcription factor genes)

are tandemly duplicated. By way of example, only one region in Arabidopsis has more than five duplicated transcription factor genes in tandem (seven ABI3/VP1 genes (At4G31610 to At4G31660)), whereas in soybean several such regions are present (for example, 13 C3H-type 1 (Zn) (Glyma15g19120 to Glyma15g19240); six MYB/HD-like (Glyma06g45520 to Glyma06g45570); and five MADS (Glyma20g27320 to Glyma20g27360); Supplementary Table 8). The overall distribution of soybean transcription factor genes among the various known protein families is very similar between Arabidopsis and soybean (Supplementary Fig. 10a, b). However, some families are relatively sparser or more abundant in soybean, perhaps reflecting differences in biological function. For example, members of the ABI3/VP1 family are 2.2-times more abundant in Arabidopsis, whereas members of the TCP family are 4.4-times more abundant in soybean. In addition, those gene families with fewer members are differentially represented between soybean and Arabidopsis. FHA, HD-Zip (homeodomain/leucine zipper), PLATZ, SRS and TUB transcription factor genes are more abundant in soybean (2.7, 2.9, 4.1, 3, and 4.9 times, respectively) and HTH-ARAC (helix–turn–helix araC/xylS-type) genes were identified exclusively in soybean. In contrast, HSF, HTH-FIS (helix–turn–helix-factor for inversion stimulation), TAZ and U1-type (Zn) genes are present in relatively larger numbers in Arabidopsis (5.4, 4.9, 24.5 and 2.9 times, respectively). Notably, ABI3/VP1, TCP, SRS and Tubby transcription factor genes were shown to have critical roles in plant development (for example, ABI3/VP1 during seed development; TCP, SRS and Tubby affect overall plant development29–33). The differences seen in relative transcription factor gene abundance indicate that regulatory pathways in soybean may differ from those described in Arabidopsis.

Figure 4 | Distribution of soybean (a) and Arabidopsis (b) transcription factor genes in different transcription factor families. Only the distribution of the most representative transcription families is detailed here. AUX-IAA-ARF, indole-3-acetic acid-auxin response factor; BTB/POZ, bric-à-brac tramtrack broad complex/pox viruses and zinc fingers; BZIP, basic leucine zipper; GRAS, (GAI, RGA, SCR); NAC, (NAM, ATAF1/2, CUC2); PHD, plant homeodomain-finger transcription factor; TCP, (TB1, CYC, PCF); TFs, transcription factors; TPR, tetratricopeptide repeat; WRKY, conserved amino acid sequence WRKYGQK at its N-terminal end.

Family                      Soybean (a)   Arabidopsis (b)
ABI3/VP1                    78            71
AP2-EREBP                   381           146
AS2                         92            43
AUX-IAA-ARF                 129           51
bHLH                        393           172
Bromodomain                 57            29
BTB/POZ                     145           98
BZIP                        176           78
C2C2 (Zn) CO-like           72            34
C2C2 (Zn) Dof               82            36
C2C2 (Zn) GATA              62            29
C2H2 (Zn)                   395           173
C3H-type1 (Zn)              147           69
CCAAT                       106           38
CCHC (Zn)                   144           66
GRAS                        130           33
Homeodomain/Homeobox        319           112
Jumonji                     77            21
MADS                        212           109
MYB                         65            24
MYB/HD-like                 726           279
NAC                         208           114
PHD                         222           55
SNF2                        69            33
TCP                         65            6
TPR                         319           65
WRKY                        197           73
ZF-HD                       54            17
Other TFs                   561           241

Impact on agriculture

Hundreds of qualitatively inherited (single gene) traits have been characterized in soybean and many genetically mapped. However, most important crop production traits and those important to seed quality for human health, animal nutrition and biofuel production are quantitatively inherited. The regions of the genome containing DNA sequence affecting these traits are called quantitative trait loci (QTL). QTL mapping studies have been ongoing for more than 90 distinct traits of soybean including plant developmental and reproductive characters, disease resistance, seed quality and nutritional traits. In most cases, the causal functional gene or transcription factor underlying the QTL is unknown. However, the integration of the whole genome sequence with the dense genetic marker map that now exists in soybean2–5 (http://www.Soybase.org) will allow the association of mapped phenotypic effectors with the causal DNA sequence. There are already examples where the availability of the soybean genomic sequence has accelerated these discovery efforts. Having access to the sequence allowed cloning and identification of the rsm1 (raffinose synthase) mutation that can be used to select for

removed from many other domesticated bean species, will allow us to knit together knowledge about traits observed and mapped in all of the beans and relatives. The genome sequence is also an essential framework for vast new experimental information such as tissue-specific expression and whole-genome association data. With knowledge of this genome's billion-plus nucleotides, we approach an understanding of the plant's capacity to turn carbon dioxide, water, and elemental nitrogen and minerals into concentrated energy, protein and nutrients for human and animal use. The genome sequence opens the door to crop improvements that are needed for sustainable human and animal food production, energy production and environmental balance in agriculture worldwide.

METHODS SUMMARY

Seeds from cultivar Williams 82 were grown in a growth chamber for 2 weeks and etiolated for 5 days before harvest. A standard phenol/chloroform leaf extraction was performed. DNA was treated with RNase A and proteinase K and precipitated with ethanol.

All sequencing reads were collected with protocols on ABI 3730XL capillary sequencing machines, a majority at the Joint Genome Institute in Walnut Creek, California.

A total of 15,332,163 sequence reads were assembled using Arachne v.20071016 (ref. 37) to form 3,363 scaffolds covering 969.6 Mb of the soybean genome. The resulting assembly was integrated with the genetic and physical maps previously built for soybean and a newly constructed genetic map to produce 20 chromosome-scale scaffolds covering 937.3 Mb and an additional 1,148 unmapped scaffolds that cover 17.7 Mb of the genome.

Genes were annotated using Fgenesh+38 and GenomeScan39 informed by EST alignments and peptide matches to genome from Arabidopsis, rice and grapevine. Models were reconciled with EST alignments and UTR added using PASA40. Models were filtered for high confidence by penalizing genes which were transposable-element-related, had low sequence entropy, short introns, incomplete start or stop, low C-score, no UniGene hit at 1 × 10−5, or the model was less than 30% the length of its best hit.

LTR retrotransposons were identified by the program LTR_STRUC41, manually inspected to check structure features and classified into distinct families based on the similarities to LTR sequences. DNA transposons were identified using conserved protein domains as queries in TBLASTN42 searches of the genome. Identified elements were used as a custom library for RepeatMasker (current version: open 3.2.8; http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) to detect missed intact elements, truncated elements and fragments.

Virtual suffix trees with six-frame translation were generated using Vmatch43 and then clustered into families. Pairwise alignments between gene family members were performed using ClustalW44. Identification of homologous blocks was performed using i-ADHoRe v2.1 (ref. 25). Visualization of blocks was performed with Circos26.

Received 19 August; accepted 12 November 2009.

1. Arumuganathan, K. & Earle, E. D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
2. Choi, I. Y. et al. A soybean transcript map: gene distribution, haplotype and single-nucleotide polymorphism analysis. Genetics 176, 685–696 (2007).
3. Hyten, D. L. et al.
High-throughput SNP discovery through deep resequencing of a low-stachyose-containing soybean lines that will improve the ability reduced representation library to anchor and orient scaffolds in the soybean of animals and to digest soybeans34. Using a comparative whole genome sequence. BMC Genomics (in the press). 4. Hyten, D. L. et al. A high density integrated genetic linkage map of soybean and the genomics approach between soybean and maize, a single-base muta- development of a 1,536 Universal Soy Linkage Panel for QTL mapping. Crop Sci. tion was found that causes a reduction in phytate production in (in the press). soybean35. Phytate reduction could result in a reduction of a major 5. Song, Q. J. et al. A new integrated genetic linkage map of the soybean. Theor. Appl. environmental runoff contaminant from swine and poultry waste. Genet. 109, 122–128 (2004). Perhaps most exciting for the soybean community, the first resistance 6. Lin, J. Y. et al. Pericentromeric regions of soybean (Glycine max L. Merr.) chromosomes consist of retroelements and tandemly repeated DNA and are gene for the devastating disease Asian soybean rust (ASR) has been structurally and evolutionarily labile. Genetics 170, 1221–1230 (2005). cloned with the aid of the soybean genomic sequence and confirmed 7. Vahedian, M. et al. Genomic organization and evolution of the soybean SB92 with viral-induced gene silencing36. In countries where ASR is well satellite sequence. Plant Mol. Biol. 29, 857–862 (1995). established, soybean yield losses due to the disease can range from 8. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of 36 grasses. Nature 457, 551–556 (2009). 10% to 80% and the development of soybean strains resistant to 9. Umezawa, T. et al. Sequencing and analysis of approximately 40,000 soybean ASR will greatly benefit world soybean production. cDNA clones from a full-length-enriched cDNA library. DNA Res. 
15, 333–346 Soybean, one of the most important global sources of protein and (2008). oil, is now the first legume species with a complete genome sequence. It 10. Roy, S. W. & Penny, D. Patterns of intron loss and gain in plants: intron loss- dominated evolution and genome-wide comparison of O. sativa and A. thaliana. is, therefore, a key reference for the more than 20,000 legume species, Mol. Biol. Evol. 24, 171–181 (2007). and for the remarkable evolutionary innovation of nitrogen-fixing 11. Wang, H. et al. Rosid radiation and the rapid rise of angiosperm-dominated symbiosis. This genome, with a common ancestor only 20 million years forests. Proc. Natl Acad. Sci. USA 106, 3853–3858 (2009). 182 ©2010 Macmillan Publishers Limited. All rights reserved NATURE | Vol 463 | 14 January 2010 ARTICLES

12. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
13. Michelmore, R. & Meyers, B. C. Clusters of resistance genes in plants evolve by divergent selection and a birth-and-death process. Genome Res. 8, 1113–1130 (1998).
14. Bruggmann, R. et al. Uneven chromosome contraction and expansion in the maize genome. Genome Res. 16, 1241–1251 (2006).
15. Ma, J., Devos, K. M. & Bennetzen, J. L. Analyses of LTR-retrotransposon structures reveal recent and rapid genomic DNA loss in rice. Genome Res. 14, 860–869 (2004).
16. Pfeil, B. E., Schlueter, J. A., Shoemaker, R. C. & Doyle, J. J. Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families. Syst. Biol. 54, 441–454 (2005).
17. Schlueter, J. A. et al. Mining EST databases to resolve evolutionary events in major crop species. Genome 47, 868–876 (2004).
18. Schlueter, J. A., Scheffler, B. E., Jackson, S. & Shoemaker, R. C. Fractionation of synteny in a genomic region containing tandemly duplicated genes across Glycine max, Medicago truncatula, and Arabidopsis thaliana. J. Hered. 99, 390–395 (2008).
19. Gill, N. et al. Molecular and chromosomal evidence for allopolyploidy in soybean, Glycine max (L.) Merr. Plant Physiol. 151, 1167–1174 (2009).
20. Bertioli, D. J. et al. An analysis of synteny of Arachis with Lotus and Medicago sheds new light on the structure, stability and evolution of legume genomes. BMC Genomics 10, 45 (2009).
21. Lavin, M., Herendeen, P. S. & Wojciechowski, M. F. Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the tertiary. Syst. Biol. 54, 575–594 (2005).
22. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
23. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
24. Tang, H. et al. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18, 1944–1954 (2008).
25. Simillion, C., Janssens, K., Sterck, L. & Van de Peer, Y. i-ADHoRe 2.0: an improved tool to detect degenerated genomic homology using genomic profiles. Bioinformatics 24, 127–128 (2008).
26. Krzywinski, M. et al. Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
27. Wang, B. B. & Brendel, V. Genomewide comparative analysis of alternative splicing in plants. Proc. Natl Acad. Sci. USA 103, 7175–7180 (2006).
28. Beisson, F. et al. Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol. 132, 681–697 (2003).
29. Fridborg, I., Kuusk, S., Moritz, T. & Sundberg, E. The Arabidopsis dwarf mutant shi exhibits reduced gibberellin responses conferred by overexpression of a new putative zinc finger protein. Plant Cell 11, 1019–1032 (1999).
30. Barkoulas, M., Galinha, C., Grigg, S. P. & Tsiantis, M. From genes to shape: regulatory interactions in leaf development. Curr. Opin. Plant. Biol. 10, 660–666 (2007).
31. Lai, C. P. et al. Molecular analyses of the Arabidopsis TUBBY-like protein gene family. Plant Physiol. 134, 1586–1597 (2004).
32. Herve, C. et al. In vivo interference with AtTCP20 function induces severe plant growth alterations and deregulates the expression of many genes important for development. Plant Physiol. 149, 1462–1477 (2009).
33. Stone, S. L. et al. LEAFY COTYLEDON2 encodes a transcription factor that induces embryo development. Proc. Natl Acad. Sci. USA 98, 11806–11811 (2001).
34. Skoneczka, J., Saghai Maroof, M. A., Shang, C. & Buss, G. R. Identification of candidate gene mutation associated with low stachyose in soybean line PI 200508. Crop Sci. 49, 247–255 (2009).
35. Saghai Maroof, M. A., Glover, N. M., Biyashev, R. M., Buss, G. R. & Grabau, E. A. Genetic basis of the low-phytate trait in the soybean line CX1834. Crop Sci. 49, 69–76 (2009).
36. Meyer, J. D. F. et al. Identification and analyses of candidate genes for Rpp4-mediated resistance to Asian soybean rust in soybean (Glycine max (L.) Merr.). Plant Physiol. 150, 295–307 (2009).
37. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
38. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
39. Yeh, R. F., Lim, L. P. & Burge, C. B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
40. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
41. McCarthy, E. M. & McDonald, J. F. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003).
42. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
43. Beckstette, M., Homann, R., Giegerich, R. & Kurtz, S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 7, 389 (2006).
44. Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank N. Weeks for informatics support and C. Gunter for critical reading of the manuscript. We acknowledge funding from the National Science Foundation (DBI-0421620 to G.S.; DBI-0501877 and 082225 to S.A.J.) and the United Soybean Board.

Author Contributions Sequencing, assembly and integration: J. Schmutz, S.B.C., J. Schlueter, W.N., U.H., E.L., M.P., D. Grant, S.S., D. Goodstein, K.B., A.S., J.G. and D.R. Annotation: J.M., T.M., J.J.T., J.C., D.X., J.D., Z.T., L.Z., N.G., T.J., M.L., X.-C.Z. and G.S. EST sequencing: G.D.M., T.S., T.U., M.B., D.S., B.V., K.S. and H.T.N. Physical mapping: Y.Y., M.F.G., R.A.W. and R.C.S. Genetic mapping: D.H., J. Specht, Q.S. and P.C. Writing/coordination: S.A.J.

Author Information This whole-genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession ACUP00000000. The version described here is the first version, ACUP01000000. Full annotation is available at http://www.phytozome.net. Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. Correspondence and requests for materials should be addressed to S.A.J. ([email protected]).
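The high-confidence filter described in the soybean Methods Summary (penalizing gene models that are transposable-element-related, have low sequence entropy, short introns, an incomplete start or stop, a low C-score, no UniGene hit at 1 × 10^-5, or are shorter than 30% of their best hit) can be pictured as a list of penalty checks applied to each model. The following is a minimal sketch, not the authors' pipeline: the `GeneModel` fields, the `MIN_C_SCORE` value and the rule that any single penalty removes a model are hypothetical, and only the E-value cutoff and the 30% length ratio are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

UNIGENE_EVALUE_CUTOFF = 1e-5  # from the text: UniGene hit at 1 x 10^-5
MIN_LENGTH_RATIO = 0.30       # from the text: model at least 30% of best-hit length
MIN_C_SCORE = 0.5             # ASSUMPTION: the text gives no numeric C-score threshold


@dataclass
class GeneModel:
    """Hypothetical per-model summary; the field names are illustrative only."""
    name: str
    te_related: bool
    low_entropy: bool
    short_introns: bool
    complete_start_stop: bool
    c_score: float
    unigene_evalue: Optional[float]  # None when there is no UniGene hit at all
    length_ratio: Optional[float]    # model length / best-hit length, None if no hit


def penalties(m: GeneModel) -> List[str]:
    """Return the list of reasons for which a model would be penalized."""
    reasons = []
    if m.te_related:
        reasons.append("transposable-element-related")
    if m.low_entropy:
        reasons.append("low sequence entropy")
    if m.short_introns:
        reasons.append("short introns")
    if not m.complete_start_stop:
        reasons.append("incomplete start or stop")
    if m.c_score < MIN_C_SCORE:
        reasons.append("low C-score")
    if m.unigene_evalue is None or m.unigene_evalue > UNIGENE_EVALUE_CUTOFF:
        reasons.append("no UniGene hit at 1e-5")
    if m.length_ratio is not None and m.length_ratio < MIN_LENGTH_RATIO:
        reasons.append("shorter than 30% of best hit")
    return reasons


def is_high_confidence(m: GeneModel) -> bool:
    """ASSUMPTION: a model is kept only if it collects no penalties."""
    return not penalties(m)
```

Returning the reasons rather than a bare boolean makes it easy to tabulate which criterion removes the most models; a weighted score, which the word "penalizing" may equally imply, would be a small change to `is_high_confidence`.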

ARTICLES Vol 463 | 11 February 2010 | doi:10.1038/nature08747

Genome sequencing and analysis of the model grass Brachypodium distachyon

The International Brachypodium Initiative*

Three subfamilies of grasses, the Ehrhartoideae, Panicoideae and Pooideae, provide the bulk of human nutrition and are poised to become major sources of renewable energy. Here we describe the genome sequence of the wild grass Brachypodium distachyon (Brachypodium), which is, to our knowledge, the first member of the Pooideae subfamily to be sequenced. Comparison of the Brachypodium, rice and sorghum genomes shows a precise history of genome evolution across a broad diversity of the grasses, and establishes a template for analysis of the large genomes of economically important pooid grasses such as wheat. The high-quality genome sequence, coupled with ease of cultivation and transformation, small size and rapid life cycle, will help Brachypodium reach its potential as an important model system for developing new energy and food crops.

*A list of participants and their affiliations appears at the end of the paper.

Grasses provide the bulk of human nutrition, and highly productive grasses are promising sources of sustainable energy1. The grass family (Poaceae) comprises over 600 genera and more than 10,000 species that dominate many ecological and agricultural systems2,3. So far, genomic efforts have largely focused on two economically important grass subfamilies, the Ehrhartoideae (rice) and the Panicoideae (maize, sorghum, sugarcane and millets). The rice4 and sorghum5 genome sequences and a detailed physical map of maize6 showed extensive conservation of gene order5,7 and both ancient and relatively recent polyploidization.

Most cool season cereal, forage and turf grasses belong to the Pooideae subfamily, which is also the largest grass subfamily. The genomes of many pooids are characterized by daunting size and complexity. For example, the bread wheat genome is approximately 17,000 megabases (Mb) and contains three independent genomes8. This has prohibited genome-scale comparisons spanning the three most economically important grass subfamilies.

Brachypodium, a member of the Pooideae subfamily, is a wild annual grass endemic to the Mediterranean and Middle East9 that has promise as a model system. This has led to the development of highly efficient transformation10,11, germplasm collections12–14, genetic markers14, a genetic linkage map15, bacterial artificial chromosome (BAC) libraries16,17, physical maps18 (M.F., unpublished observations), mutant collections (http://brachypodium.pw.usda.gov, http://www.brachytag.org), microarrays and databases (http://www.brachybase.org, http://www.phytozome.net, http://www.modelcrop.org, http://mips.helmholtz-muenchen.de/plant/index.jsp) that are facilitating the use of Brachypodium by the research community. The genome sequence described here will allow Brachypodium to act as a powerful functional genomics resource for the grasses. It is also an important advance in grass structural genomics, permitting, for the first time, whole-genome comparisons between members of the three most economically important grass subfamilies.

Genome sequence assembly and annotation

The diploid inbred line Bd21 (ref. 19) was sequenced using whole-genome shotgun sequencing (Supplementary Table 1). The ten largest scaffolds contained 99.6% of all sequenced nucleotides (Supplementary Table 2). Comparison of these ten scaffolds with a genetic map (Supplementary Fig. 1) detected two false joins and created a further seven joins to produce five pseudomolecules that spanned 272 Mb (Supplementary Table 3), within the range measured by flow cytometry20,21. The assembly was confirmed by cytogenetic analysis (Supplementary Fig. 2) and alignment with two physical maps and sequenced BACs (Supplementary Data). More than 98% of expressed sequence tags (ESTs) mapped to the sequence assembly, consistent with a near-complete genome (Supplementary Table 4 and Supplementary Fig. 3). Compared to other grasses, the Brachypodium genome is very compact, with retrotransposons concentrated at the centromeres and syntenic breakpoints (Fig. 1). DNA transposons and derivatives are broadly distributed and primarily associated with gene-rich regions.

Figure 1 | Chromosomal distribution of the main Brachypodium genome features. The abundance and distribution of the following genome elements are shown: complete LTR retroelements (cLTRs); solo-LTRs (sLTRs); potentially autonomous DNA transposons that are not miniature inverted-repeat transposable elements (MITEs) (DNA-TEs); MITEs; gene exons (CDS); gene introns and satellite tandem arrays (STA). Graphs are from 0 to 100 per cent base-pair (%bp) coverage of the respective window. The heat map tracks have different ranges and different maximum (max) pseudocolour levels: STA (0–55, scaled to max 10) %bp; cLTRs (0–36, scaled to max 20) %bp; sLTRs (0–4) %bp; DNA-TEs (0–20) %bp; MITEs (0–22) %bp; CDS (exons) (0–22.3) %bp. The triangles identify syntenic breakpoints.

We analysed small RNA populations from tissues with deep Illumina sequencing, and mapped them onto the genome sequence (Fig. 2a, Supplementary Fig. 4 and Supplementary Table 5). Small RNA reads were most dense in regions of high repeat density, similar to the distribution reported in Arabidopsis22. We identified 413 and 198 21- and 24-nucleotide phased short interfering RNA (siRNA) loci, respectively. Using the same algorithm, the only phased loci identified in Arabidopsis were five of the eight trans-acting siRNA loci, and none was 24-nucleotide phased. The biological functions of these clusters of Brachypodium phased siRNAs, which account for a significant number of small RNAs that map outside repeat regions, are not known at present.

A total of 25,532 protein-coding gene loci was predicted in the v1.0 annotation (Supplementary Information and Supplementary Table 6). This is in the same range as rice (RAP2, 28,236)23 and sorghum (v1.4, 27,640)5, suggesting similar gene numbers across a broad diversity of grasses. Gene models were evaluated using ~10.2 gigabases (Gb) of Illumina RNA-seq data (Supplementary Fig. 5)24. Overall, 92.7% of predicted coding sequences (CDS) were supported by Illumina data (Fig. 2b), demonstrating the high accuracy of the Brachypodium gene predictions. These gene models are available from several databases (such as http://www.brachybase.org, http://www.phytozome.net, http://www.modelcrop.org and http://mips.org).

Between 77 and 84% of gene families (defined according to Supplementary Fig. 6) are shared among the three grass subfamilies represented by Brachypodium, rice and sorghum, reflecting a relatively recent common origin (Fig. 2c). Grass-specific genes include transmembrane receptor protein kinases, glycosyltransferases, peroxidases and P450 proteins (Supplementary Table 7B). The Pooideae-specific gene set contains only 265 gene families (Supplementary Table 7C) comprising 811 genes (1,400 including singletons). Genes enriched in grasses were significantly more likely to be contained in tandem arrays than random genes, demonstrating a prominent role for tandem gene expansion in the evolution of grass-specific genes (Supplementary Fig. 7 and Supplementary Table 8).

Figure 2 | Transcript and gene identification and distribution among three grass subfamilies. a, Genome-wide distribution of small RNA loci and transcripts in the Brachypodium genome. Brachypodium chromosomes (1–5) are shown at the top. Total small RNA reads (black lines) and total small RNA loci (red lines) are shown on the top panel. Histograms plot 21-nucleotide (nt) (blue) or 24-nucleotide (red) small RNA reads normalized for repeated matches to the genome. The phased loci histograms plot the position and phase-score of 21-nucleotide (blue) and 24-nucleotide (red) phased small RNA loci. Repeat-normalized RNA-seq read histograms plot the abundance of reads matching RNA transcripts (green), normalized for ambiguous matches to the genome. b, Transcript coverage over gene features. Perfect match 32-base oligonucleotide Illumina reads were mapped to the Brachypodium v1.0 annotation features using HashMatch (http://mocklerlab-tools.cgrb.oregonstate.edu/). Plots of Illumina coverage were calculated as the percentage of bases along the length of the sequence feature supported by Illumina reads for the indicated gene model features. The bottom and top of the box represent the 25th and 75th percentiles, respectively. The white line is the median and the red diamonds denote the mean. SJS, splice junction site. c, Venn diagram showing the distribution of shared gene families between representatives of Ehrhartoideae (rice RAP2), Panicoideae (sorghum v1.4) and Pooideae (Brachypodium v1.0, and Triticum aestivum and Hordeum vulgare TCs (transcript consensus)/EST sequences). Paralogous gene families were collapsed in these data sets. Panel c labels: rice (Ehrhartoideae), 16,235 families, 20,559 genes in families; sorghum (Panicoideae), 17,608 families, 25,816 genes in families; Brachypodium and wheat/barley (Pooideae), 16,215 families, 20,562 genes in families.

To validate and improve the v1.0 gene models, we manually annotated 2,755 gene models from 97 diverse gene families (Supplementary Tables 9–11) relevant to bioenergy and food crop improvement. We annotated 866 genes involved in cell wall biosynthesis/modification and 948 transcription factors from 16 families25. Only 13% of the gene models required modification and very few pseudogenes were identified, demonstrating the accuracy of the v1.0 annotation.

Phylogenetic trees for 62 gene families were constructed using genes from rice, Arabidopsis, sorghum and poplar. In nearly all cases, Brachypodium genes had a similar distribution to rice and sorghum, demonstrating that Brachypodium is suitably generic for grass functional genomics research (Supplementary Figs 8 and 9). Analysis of the predicted secretome identified substantial differences in the distribution of cell wall metabolism genes between dicots and grasses (Supplementary Tables 12, 13 and Supplementary Fig. 10), consistent with their different cell walls26. Signal peptide probability curves also suggested that start codons were accurately predicted (Supplementary Fig. 11).

Maintaining a small grass genome size

Exhaustive analysis of transposable elements (Supplementary Information and Supplementary Table 14) showed retrotransposon sequences comprise 21.4% of the genome, compared to 26% in rice, 54% in sorghum, and more than 80% in wheat27. Thirteen retroelement sets were younger than 20,000 years, showing a recent activation compared to rice28 (Supplementary Fig. 12), and a further 53 retroelement sets were less than 0.1 million years (Myr) old. A minimum of 17.4 Mb has been lost by long terminal repeat (LTR)–LTR recombination, demonstrating that retroelement expansion is countered by removal through recombination. In contrast, retroelements persist for very long periods of time in the closely related Triticeae28.

DNA transposons comprise 4.77% of the Brachypodium genome, within the range found in other grass genomes5,29. Transcriptome data and structural analysis suggest that many non-autonomous Mariner DTT and Harbinger elements recruit transposases from other families. Two CACTA DTC families (M and N) carried five non-element genes, and the Harbinger U family has amplified a NBS-LRR gene family (Supplementary Figs 13 and 14), adding it to the group of transposable elements implicated in gene mobility30,31. Centromeric regions were characterized by low gene density, characteristic repeats and retroelement clusters (Supplementary Fig. 15). Other repeat classes are described in Supplementary Table 15. Conserved non-coding sequences are described in Supplementary Fig. 16.

Whole-genome comparison of three diverse grass genomes

The evolutionary relationships between Brachypodium, sorghum, rice and wheat were assessed by measuring the mean synonymous substitution rates (Ks) of orthologous gene pairs (Supplementary Information, Supplementary Fig. 17 and Supplementary Table 16), from which divergence times of Brachypodium from wheat 32–39 Myr ago, rice 40–53 Myr ago, and sorghum 45–60 Myr ago (Fig. 3a) were estimated. The Ks of orthologous gene pairs in the intragenomic Brachypodium duplications (Fig. 3b) suggests duplication 56–72 Myr ago, before the diversification of the grasses. This is consistent with previous evolutionary histories inferred from a small number of genes3,32–34.

Paralogous relationships among Brachypodium chromosomes showed six major chromosomal duplications covering 92.1% of the genome (Fig. 3b), representing ancestral whole-genome duplication35. Using the rice and sorghum genome sequences, genetic maps of barley36 and Aegilops tauschii (the D genome donor of hexaploid wheat)37, and bin-mapped wheat ESTs38,39, 21,045 orthologous relationships between Brachypodium, rice, sorghum and Triticeae were identified (Supplementary Information). These identified 59 blocks of collinear genes covering 99.2% of the Brachypodium genome (Fig. 3c–e). The orthologous relationships are consistent with an evolutionary model that shaped five Brachypodium chromosomes from a five-chromosome ancestral genome by a 12-chromosome intermediate involving seven major chromosome fusions39 (Supplementary Fig. 18). These collinear blocks of orthologous genes provide a robust and precise sequence framework for understanding grass genome evolution and aiding the assembly of sequences from other pooid grasses.

We identified 14 major syntenic disruptions between Brachypodium and rice/sorghum that can be explained by nested insertions of entire chromosomes into centromeric regions (Fig. 4a, b)2,37,40. Similar nested insertions in sorghum37 and barley (Fig. 4c, d) were also identified. Centromeric repeats and peaks in retroelements at the junctions of chromosome insertions are footprints of these insertion events (Supplementary Fig. 15C and Fig. 1), as is higher gene density at the former distal regions of the inserted chromosomes (Fig. 1). Notably, the reduction in chromosome number in Brachypodium and wheat occurred independently because none of the chromosome fusions are shared by Brachypodium and the Triticeae37 (Supplementary Fig. 18).
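Divergence times such as those quoted in this section are conventionally obtained from Ks with the molecular-clock relation T = Ks / (2λ), where λ is the synonymous substitution rate per site per year and the factor 2 accounts for both lineages accumulating changes independently. The sketch below illustrates only that arithmetic. It assumes λ = 6.5 × 10^-9, a rate commonly used for grasses; the rate and the Ks peaks actually used by the authors are in their Supplementary Information and are not reproduced here, so the Ks values in the example are hypothetical.

```python
# Molecular-clock sketch: convert a mean Ks value into a divergence time.
# ASSUMPTION: LAMBDA is a commonly cited grass rate, not a value taken from this paper.
LAMBDA = 6.5e-9  # synonymous substitutions per site per year (assumed)


def divergence_time_myr(ks: float, rate: float = LAMBDA) -> float:
    """Return the divergence time in Myr for a mean Ks between two lineages.

    Both lineages accumulate substitutions independently, hence the factor 2.
    """
    if ks < 0 or rate <= 0:
        raise ValueError("ks must be non-negative and rate must be positive")
    return ks / (2.0 * rate) / 1e6


def ks_for_time_myr(time_myr: float, rate: float = LAMBDA) -> float:
    """Inverse relation: the Ks expected after time_myr million years."""
    return 2.0 * rate * time_myr * 1e6


if __name__ == "__main__":
    # Hypothetical Ks peaks, chosen only to illustrate the arithmetic.
    for label, ks in [("pair A", 0.45), ("pair B", 0.60), ("pair C", 0.85)]:
        print(f"{label}: Ks = {ks:.2f} -> {divergence_time_myr(ks):.1f} Myr")
```

Under the assumed rate, the 32–39 Myr range reported for the Brachypodium–wheat split would correspond to Ks values of roughly 0.42 to 0.51; a different rate rescales every estimate proportionally, which is one reason such dates are reported as ranges.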

Figure 3 | Brachypodium genome evolution and synteny between grass subfamilies. a, The distribution maxima of mean synonymous substitution rates (Ks) of Brachypodium, rice, sorghum and wheat orthologous gene pairs (Supplementary Table 16) were used to define the divergence times of these species and the age of interchromosomal duplications in Brachypodium. WGD, whole-genome duplication. The numbers refer to the predicted divergence times measured as Myr ago by the NG or ML methods. b, Diagram showing the six major interchromosomal Brachypodium duplications, defined by 723 paralogous relationships, as coloured bands linking the five chromosomes. c, Identification of chromosome relationships between the Brachypodium, rice and sorghum genomes. Orthologous relationships between the 25,532 protein-coding Brachypodium genes, 7,216 sorghum orthologues (12 syntenic blocks), and 8,533 rice orthologues (12 syntenic blocks) were defined. Sets of collinear orthologous relationships are represented by a coloured band according to each Brachypodium chromosome (blue, chromosome (chr.) 1; yellow, chr. 2; violet, chr. 3; red, chr. 4; green, chr. 5). The white region in each Brachypodium chromosome represents the centromeric region. d, Orthologous gene relationships between Brachypodium and barley and Ae. tauschii were identified using genetically mapped ESTs. 2,516 orthologous relationships defined 12 syntenic blocks. These are shown as coloured bands. e, Orthologous gene relationships between Brachypodium and hexaploid bread wheat defined by 5,003 ESTs mapped to wheat deletion bins. Each set of orthologous relationships is represented by a band that is evenly spread across each deletion interval on the wheat chromosomes.
Figure 4 | A recurring pattern of nested chromosome fusions in grasses. a, The five Brachypodium chromosomes are coloured according to homology with rice chromosomes (Os1–Os12). Chromosomes descended from an ancestral chromosome (A4–A11) through whole-genome duplication are shown in shades of the same colour. Gene density is indicated as a red line above the chromosome maps. Major discontinuities in gene density identify syntenic breakpoints, which are marked by a diamond. White diamonds identify fusion points containing remnant centromeric repeats. b, A pattern of nested insertions of whole chromosomes into centromeric regions explains the observed syntenic break points. Bd5 has not undergone chromosome fusion. c, Examples of nested chromosome insertions in sorghum (Sb) chromosomes 1 and 2. d, Examples of nested chromosome insertions in barley (H chromosomes) inferred from genetic maps. Nested insertions were not identified in other chromosomes, possibly owing to the low resolution of genetic maps.

Comparisons of evolutionary rates between Brachypodium, sorghum, rice and Ae. tauschii demonstrated a substantially higher rate of genome change in Ae. tauschii (Supplementary Table 17). This may be due to retroelement activity that increases syntenic disruptions, as proposed for chromosome 5S later41. Among seven relatively large gene families, four were highly syntenic and two (NBS-LRR and F-box) were almost never found in syntenic order when compared to rice and sorghum (Supplementary Table 18), consistent with the rapid diversification of the NBS-LRR and F-box gene families42.

The short arm of chromosome 5 (Bd5S) has a gene density roughly half of the rest of the genome, high LTR retrotransposon density, the youngest intact Gypsy elements and the lowest solo LTR density. Thus, unlike the rest of the Brachypodium genome, Bd5S is gaining retrotransposons by replication and losing fewer by recombination. Syntenic regions of rice (Os4S) and sorghum (Sb6S) demonstrate maintenance of this high repeat content for ~50–70 Myr (Supplementary Fig. 19)43. Bd5S, Os4S and Sb6S also have the lowest proportion of collinear genes (Fig. 4a and Supplementary Fig. 19). We propose that the chromosome ancestral to Bd5S reached a tipping point in which high retrotransposon density had deleterious effects on genes.

Discussion

As the first genome sequence of a pooid grass, the Brachypodium genome aids genome analysis and gene identification in the large and complex genomes of wheat and barley, two other pooid grasses that are among the world's most important crops. The very high quality of the Brachypodium genome sequence, in combination with those from two other grass subfamilies, enabled reconstruction of chromosome evolution across a broad diversity of grasses. This analysis contributes to our understanding of grass diversification by explaining how the varying chromosome numbers found in the major grass subfamilies derive from an ancestral set of five chromosomes by nested insertions of whole chromosomes into centromeres. The relatively small genome of Brachypodium contains many active retroelement families, but recombination between these keeps genome expansion in check. The short arm of chromosome 5 deviates from the rest of the genome by exhibiting a trend towards genome expansion through increased retroelement numbers and disruption of gene order more typical of the larger genomes of closely related grasses.

Grass crop improvement for sustainable fuel44 and food45 production requires a substantial increase in research in species such as Miscanthus, switchgrass, wheat and cool season forage grasses. These considerations have led to the rapid adoption of Brachypodium as an experimental system for grass research. The similarities in gene content and gene family structure between Brachypodium, rice and sorghum support the value of Brachypodium as a functional genomics model for all grasses. The Brachypodium genome sequence analysis reported here is therefore an important advance towards securing sustainable supplies of food, feed and fuel from new generations of grass crops.

METHODS SUMMARY

Genome sequencing and assembly. Sanger sequencing was used to generate paired-end reads from 3 kb, 8 kb, fosmid (35 kb) and BAC (100 kb) clones to generate 9.4× coverage (Supplementary Table 1). The final assembly of 83 scaffolds covers 271.9 Mb (Supplementary Table 3). Sequence scaffolds were aligned to a genetic map to create pseudomolecules covering each chromosome (Supplementary Figs 1 and 2).

Protein-coding gene annotation. Gene models were derived from weighted consensus prediction from several ab initio gene finders, optimal spliced alignments of ESTs and transcript assemblies, and protein homology. Illumina transcriptome sequence was aligned to predicted genome features to validate exons, splice sites and alternatively spliced transcripts.

Repeats analysis. The MIPS ANGELA pipeline was used to integrate analyses from expert groups. LTR_STRUC and LTR-HARVEST46 were used for de novo retroelement searches.

Received 29 August; accepted 9 December 2009.

1. Somerville, C. The billion-ton biofuels vision. Science 312, 1277 (2006).
2. Kellogg, E. A. Evolutionary history of the grasses. Plant Physiol. 125, 1198–1205 (2001).
3. Gaut, B. S. Evolutionary dynamics of grass genomes. New Phytol. 154, 15–28 (2002).
4. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
5. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009).
6. Wei, F. et al. Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet. 3, e123 (2007).
7. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Cereal genome evolution. Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
8. Salamini, F., Ozkan, H., Brandolini, A., Schafer-Pregl, R. & Martin, W. Genetics and geography of wild cereal domestication in the near east. Nature Rev. Genet. 3, 429–441 (2002).
9. Draper, J. et al. Brachypodium distachyon. A new model system for functional genomics in grasses. Plant Physiol. 127, 1539–1555 (2001).
10. Vain, P. et al. Agrobacterium-mediated transformation of the temperate grass Brachypodium distachyon (genotype Bd21) for T-DNA insertional mutagenesis. Plant Biotechnol. J. 6, 236–245 (2008).
11. Vogel, J. & Hill, T. High-efficiency Agrobacterium-mediated transformation of Brachypodium distachyon inbred line Bd21–3. Plant Cell Rep. 27, 471–478 (2008).
12. Vogel, J. P., Garvin, D. F., Leong, O. M. & Hayden, D. M. Agrobacterium-mediated transformation and inbred line development in the model grass Brachypodium distachyon. Plant Cell Tissue Organ Cult. 84, 100179–100191 (2006).
13. Filiz, E. et al. Molecular, morphological and cytological analysis of diverse Brachypodium distachyon inbred lines. Genome 52, 876–890 (2009).
14. Vogel, J. P. et al. Development of SSR markers and analysis of diversity in Turkish populations of Brachypodium distachyon. BMC Plant Biol. 9, 88 (2009).

15. Garvin, D. F. et al. An SSR-based genetic linkage map of the model grass Brachypodium distachyon. Genome 53, 1–13 (2009).
16. Huo, N. et al. Construction and characterization of two BAC libraries from Brachypodium distachyon, a new model for grass genomics. Genome 49, 1099–1108 (2006).
17. Huo, N. et al. The nuclear genome of Brachypodium distachyon: analysis of BAC end sequences. Funct. Integr. Genomics 8, 135–147 (2008).
18. Gu, Y. Q. et al. A BAC-based physical map of Brachypodium distachyon and its comparative analysis with rice and wheat. BMC Genomics 10, 496 (2009).
19. Garvin, D. F. et al. Development of genetic and genomic research resources for Brachypodium distachyon, a new model system for grass crop research. Crop Sci. 48, S-69–S-84 (2008).
20. Bennett, M. D. & Leitch, I. J. Nuclear DNA amounts in angiosperms: progress, problems and prospects. Ann. Bot. (Lond.) 95, 45–90 (2005).
21. Vogel, J. P. et al. EST sequencing and phylogenetic analysis of the model grass Brachypodium distachyon. Theor. Appl. Genet. 113, 186–195 (2006).
22. Rajagopalan, R., Vaucheret, H., Trejo, J. & Bartel, D. P. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev. 20, 3407–3425 (2006).
23. Tanaka, T. et al. The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res. 36, D1028–D1033 (2008).
24. Fox, S., Filichkin, S. & Mockler, T. Applications of ultra-high-throughput sequencing. Methods Mol. Biol. 553, 79–108 (2009).
25. Gray, J. et al. A recommendation for naming transcription factor proteins in the grasses. Plant Physiol. 149, 4–6 (2009).
26. Vogel, J. Unique aspects of the grass cell wall. Curr. Opin. Plant Biol. 11, 301–307 (2008).
27. Bennetzen, J. L. & Kellogg, E. A. Do plants have a one-way ticket to genomic obesity? Plant Cell 9, 1509–1514 (1997).
28. Wicker, T. & Keller, B. Genome-wide comparative analysis of copia retrotransposons in Triticeae, rice, and Arabidopsis reveals conserved ancient evolutionary lineages and distinct dynamics of individual copia families. Genome Res. 17, 1072–1081 (2007).
29. Wicker, T. et al. Analysis of intraspecies diversity in wheat and barley genomes identifies breakpoints of ancient haplotypes and provides insight into the structure of diploid and hexaploid triticeae gene pools. Plant Physiol. 149, 258–270 (2009).
30. Jiang, N., Bao, Z., Zhang, X., Eddy, S. R. & Wessler, S. R. Pack-MULE transposable elements mediate gene evolution in plants. Nature 431, 569–573 (2004).
31. Morgante, M. et al. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nature Genet. 37, 997–1002 (2005).
32. Grass Phylogeny Working Group. Phylogeny and subfamilial classification of the grasses (Poaceae). Ann. Mo. Bot. Gard. 88, 373–457 (2001).
33. Bossolini, E., Wicker, T., Knobel, P. A. & Keller, B. Comparison of orthologous loci from small grass genomes Brachypodium and rice: implications for wheat genomics and grass genome annotation. Plant J. 49, 704–717 (2007).
34. Charles, M. et al. Sixty million years in evolution of soft grain trait in grasses: emergence of the softness locus in the common ancestor of Pooideae and Ehrhartoideae, after their divergence from Panicoideae. Mol. Biol. Evol. 26, 1651–1661 (2009).
35. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
36. Stein, N. et al. A 1,000-loci transcript map of the barley genome: new anchoring points for integrative grass genomics. Theor. Appl. Genet. 114, 823–839 (2007).
37. Luo, M. C. et al. Genome comparisons reveal a dominant mechanism of chromosome number reduction in grasses and accelerated genome evolution in Triticeae. Proc. Natl Acad. Sci. USA 106, 15780–15785 (2009).
38. Qi, L. L. et al. A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168, 701–712 (2004).
39. Salse, J. et al. Identification and characterization of shared duplications between rice and wheat provide new insight into grass genome evolution. Plant Cell 20, 11–24 (2008).
40. Srinivasachary, Dida M. M., Gale, M. D. & Devos, K. M. Comparative analyses reveal high levels of conserved colinearity between the finger millet and rice genomes. Theor. Appl. Genet. 115, 489–499 (2007).
41. Vicient, C. M., Kalendar, R. & Schulman, A. H. Variability, recombination, and mosaic evolution of the barley BARE-1 retrotransposon. J. Mol. Evol. 61, 275–291 (2005).
42. Meyers, B. C., Kozik, A., Griego, A., Kuang, H. & Michelmore, R. W. Genome-wide analysis of NBS-LRR-encoding genes in Arabidopsis. Plant Cell 15, 809–834 (2003).
43. Ma, J. & Bennetzen, J. L. Rapid recent growth and divergence of rice nuclear genomes. Proc. Natl Acad. Sci. USA 101, 12404–12410 (2004).
44. U.S. Department of Energy Office of Science. Breaking the Biological Barriers to Cellulosic Ethanol: A Joint Research Agenda ⟨http://genomicscience.energy.gov/biofuels/b2bworkshop.shtml⟩ (2006).
45. Food and Agriculture Organization of the United Nations. World Agriculture: Towards 2030/2050 Interim Report. ⟨http://www.fao.org/ES/esd/AT2050web.pdf⟩ (2006).
46. McCarthy, E. M. & McDonald, J. F. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We acknowledge the contributions of the late M. Gale, who identified the importance of conserved gene order in grass genomes. This work was mainly supported by the US Department of Energy Joint Genome Institute Community Sequencing Program project with J.P.V., D.F.G., T.C.M. and M.W.B., a BBSRC grant to M.W.B., an EU Contract Agronomics grant to M.W.B. and K.F.X.M., and GABI Barlex grant to K.F.X.M. Illumina transcriptome sequencing was supported by a DOE Plant Feedstock Genomics for Bioenergy grant and an Oregon State Agricultural Research Foundation grant to T.C.M.; small RNA research was supported by the DOE Plant Feedstock Genomics for Bioenergy grants to P.J.G. and T.C.M.; annotation was supported by a DOE Plant Feedstocks for Genomics Bioenergy grant to J.P.V. A full list of support and acknowledgements is in the Supplementary Information.

Author Information The whole-genome shotgun sequence of Brachypodium distachyon has been deposited at DDBJ/EMBL/GenBank under the accession ADDN00000000. (The version described in this manuscript is the first version, accession ADDN01000000). EST sequences have been deposited with dbEST (accessions 67946317–68053959) and GenBank (accessions GT758162–GT865804). The short read archive accession for RNA-seq data is SRA010177. Reprints and permissions information is available at www.nature.com/reprints. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to J.P.V., D.F.G., T.C.M. or M.W.B.

Author Contributions See list of consortium authors below.

The International Brachypodium Initiative

Principal investigators John P. Vogel1, David F. Garvin2, Todd C. Mockler3, Jeremy Schmutz4, Dan Rokhsar5,6, Michael W. Bevan7;

DNA sequencing and assembly Kerrie Barry5, Susan Lucas5, Miranda Harmon-Smith5, Kathleen Lail5, Hope Tice5, Jeremy Schmutz4 (Leader), Jane Grimwood4, Neil McKenzie7, Michael W. Bevan7;

Pseudomolecule assembly and BAC end sequencing Naxin Huo1, Yong Q. Gu1, Gerard R. Lazo1, Olin D. Anderson1, John P. Vogel1 (Leader), Frank M. You8, Ming-Cheng Luo8, Jan Dvorak8, Jonathan Wright7, Melanie Febrer7, Michael W. Bevan7, Dominika Idziak9, Robert Hasterok9, David F. Garvin2;

Transcriptome sequencing and analysis Erika Lindquist5, Mei Wang5, Samuel E. Fox3, Henry D. Priest3, Sergei A. Filichkin3, Scott A. Givan3, Douglas W. Bryant3, Jeff H. Chang3, Todd C. Mockler3 (Leader), Haiyan Wu10,24, Wei Wu10, An-Ping Hsia10, Patrick S. Schnable10,24, Anantharaman Kalyanaraman11, Brad Barbazuk12, Todd P. Michael13, Samuel P. Hazen14, Jennifer N. Bragg1, Debbie Laudencia-Chingcuanco1, John P. Vogel1, David F. Garvin2, Yiqun Weng15, Neil McKenzie7, Michael W. Bevan7;

Gene analysis and annotation Georg Haberer16, Manuel Spannagl16, Klaus Mayer16 (Leader), Thomas Rattei17, Therese Mitros6, Dan Rokhsar6, Sang-Jik Lee18, Jocelyn K. C. Rose18, Lukas A. Mueller19, Thomas L. York19;

Repeats analysis Thomas Wicker20 (Leader), Jan P. Buchmann20, Jaakko Tanskanen21, Alan H. Schulman21 (Leader), Heidrun Gundlach16, Jonathan Wright7, Michael Bevan7, Antonio Costa de Oliveira22, Luciano da C. Maia22, William Belknap1, Yong Q. Gu1, Ning Jiang23, Jinsheng Lai24, Liucun Zhu25, Jianxin Ma25, Cheng Sun26, Ellen Pritham26;

Comparative genomics Jerome Salse27 (Leader), Florent Murat27, Michael Abrouk27, Georg Haberer16, Manuel Spannagl16, Klaus Mayer16, Remy Bruggmann13, Joachim Messing13, Frank M. You8, Ming-Cheng Luo8, Jan Dvorak8;

Small RNA analysis Noah Fahlgren3, Samuel E. Fox3, Christopher M. Sullivan3, Todd C. Mockler3, James C. Carrington3, Elisabeth J. Chapman3,28, Greg D. May29, Jixian Zhai30, Matthias Ganssmann30, Sai Guna Ranjan Gurazada30, Marcelo German30, Blake C. Meyers30, Pamela J. Green30 (Leader);

Manual annotation and gene family analysis Jennifer N. Bragg1, Ludmila Tyler1,6, Jiajie Wu1,8, Yong Q. Gu1, Gerard R. Lazo1, Debbie Laudencia-Chingcuanco1, James Thomson1, John P. Vogel1 (Leader), Samuel P. Hazen14, Shan Chen14, Henrik V. Scheller31, Jesper Harholt32, Peter Ulvskov32, Samuel E. Fox3, Sergei A. Filichkin3, Noah Fahlgren3, Jeffrey A. Kimbrel3, Jeff H. Chang3, Christopher M. Sullivan3, Elisabeth J. Chapman3,27, James C. Carrington3, Todd C. Mockler3, Laura E. Bartley8,31, Peijian Cao8,31, Ki-Hong Jung8,31†, Manoj K Sharma8,31, Miguel Vega-Sanchez8,31, Pamela Ronald8,31, Christopher D. Dardick33, Stefanie De Bodt34, Wim Verelst34, Dirk Inzé34, Maren Heese35, Arp Schnittger35, Xiaohan Yang36, Udaya C. Kalluri36, Gerald A. Tuskan36, Zhihua Hua37, Richard D. Vierstra37, David F. Garvin3, Yu Cui24, Shuhong Ouyang24, Qixin Sun24, Zhiyong Liu24, Alper Yilmaz38, Erich Grotewold38, Richard Sibout39, Kian Hematy39, Gregory Mouille39, Herman Höfte39, Todd Michael13, Jérome Pelloux40, Devin O’Connor41, James Schnable41, Scott Rowe41, Frank Harmon41, Cynthia L. Cass42, John C. Sedbrook42, Mary E. Byrne7, Sean Walsh7, Janet Higgins7, Michael Bevan7, Pinghua Li19, Thomas Brutnell19, Turgay Unver43, Hikmet Budak43, Harry Belcram44, Mathieu Charles44, Boulos Chalhoub44, Ivan Baxter45

1USDA-ARS Western Regional Research Center, Albany, California 94710, USA.
2USDA-ARS Plant Science Research Unit and University of Minnesota, St Paul, Minnesota 55108, USA.
3Oregon State University, Corvallis, Oregon 97331-4501, USA.
4HudsonAlpha Institute, Huntsville, Alabama 35806, USA.
5US DOE Joint Genome Institute, Walnut Creek, California 94598, USA.
6University of California Berkeley, Berkeley, California 94720, USA.
7John Innes Centre, Norwich NR4 7UJ, UK.
8University of California Davis, Davis, California 95616, USA.
9University of Silesia, 40-032 Katowice, Poland.
10Iowa State University, Ames, Iowa 50011, USA.
11Washington State University, Pullman, Washington 99163, USA.
12University of Florida, Gainesville, Florida 32611, USA.
13Rutgers University, Piscataway, New Jersey 08855-0759, USA.
14University of Massachusetts, Amherst, Massachusetts 01003-9292, USA.
15USDA-ARS Vegetable Crops Research Unit, Horticulture Department, University of Wisconsin, Madison, Wisconsin 53706, USA.
16Helmholtz Zentrum München, D-85764 Neuherberg, Germany.
17Technical University München, 80333 München, Germany.
18Cornell University, Ithaca, New York 14853, USA.
19Boyce Thompson Institute for Plant Research, Ithaca, New York 14853-1801, USA.
20University of Zurich, 8008 Zurich, Switzerland.
21MTT Agrifood Research and University of Helsinki, FIN-00014 Helsinki, Finland.
22Federal University of Pelotas, Pelotas, 96001-970, RS, Brazil.
23Michigan State University, East Lansing, Michigan 48824, USA.
24China Agricultural University, Beijing 10094, China.
25Purdue University, West Lafayette, Indiana 47907, USA.
26The University of Texas, Arlington, Arlington, Texas 76019, USA.
27Institut National de la Recherche Agronomique UMR 1095, 63100 Clermont-Ferrand, France.
28University of California San Diego, La Jolla, California 92093, USA.
29National Centre for Genome Resources, Santa Fe, New Mexico 87505, USA.
30University of Delaware, Newark, Delaware 19716, USA.
31Joint Bioenergy Institute, Emeryville, California 94720, USA.
32University of Copenhagen, Frederiksberg DK-1871, Denmark.
33USDA-ARS Appalachian Fruit Research Station, Kearneysville, West Virginia 25430, USA.
34VIB Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Genetics, Ghent University, Technologiepark 927, 9052 Gent, Belgium.
35Institut de Biologie Moléculaire des Plantes du CNRS, Strasbourg 67084, France.
36BioEnergy Science Center and Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-6422, USA.
37University of Wisconsin-Madison, Madison, Wisconsin 53706, USA.
38The Ohio State University, Columbus, Ohio 43210, USA.
39Institut Jean-Pierre Bourgin, UMR1318, Institut National de la Recherche Agronomique, 78026 Versailles cedex, France.
40Université de Picardie, Amiens 80039, France.
41Plant Gene Expression Center, University of California Berkeley, Albany, California 94710, USA.
42Illinois State University and DOE Great Lakes Bioenergy Research Center, Normal, Illinois 61790, USA.
43Sabanci University, Istanbul 34956, Turkey.
44Unité de Recherche en Génomique Végétale: URGV (INRA-CNRS-UEVE), Evry 91057, France.
45USDA-ARS/Donald Danforth Plant Science Center, St Louis, Missouri 63130, USA.
†Present address: The School of Plant Molecular Systems Biotechnology, Kyung Hee University, Yongin 446-701, Korea.

©2010 Macmillan Publishers Limited. All rights reserved