Genomics 97 (2011) 313–320

Comparative analysis of Gossypium and Vitis genomes indicates genome duplication specific to the Gossypium lineage

Lifeng Lin a, Haibao Tang a, Rosana O. Compton a, Cornelia Lemke a, Lisa K. Rainville a, Xiyin Wang a,b, Junkang Rong a,1, Mukesh Kumar Rana a,c, Andrew H. Paterson a,⁎

a Genome Mapping Laboratory, University of Georgia, Athens, GA 30605, USA
b Center for Genomics and Biocomputing, College of Sciences, Hebei Polytechnic University, Tangshan, Hebei, 063000, China
c NRC on DNA Fingerprinting, NBPGR, Pusa Campus, New Delhi, 110012, India

Article history: Received 17 November 2010. Accepted 15 February 2011. Available online 22 February 2011.

Keywords: Gossypium; Lineage specific genome-wide duplication; Vitis vinifera; Whole genome alignment; Synteny; Collinearity; Gene loss; Gene density

Abstract

Genetic mapping studies have suggested that diploid cotton (Gossypium) might be an ancient polyploid. However, further evidence is lacking due to the complexity of the genome and the lack of sequence resources. Here, we used the grape (Vitis vinifera) genome as an out-group in two different approaches to further explore evidence regarding ancient genome duplication (WGD) event(s) in the diploid Gossypium lineage and its (their) effects: a genome-level alignment analysis and a local-level sequence component analysis. Both studies suggest that at least one round of genome duplication occurred in the Gossypium lineage. Also, gene densities in corresponding regions from Gossypium raimondii, V. vinifera, Arabidopsis thaliana and Carica papaya genomes are similar, despite the huge difference in their genome sizes and the different number of WGDs each genome has experienced. These observations fit the model that differences in plant genome sizes are largely explained by transposon insertions into heterochromatic regions.

© 2011 Elsevier Inc. All rights reserved.

⁎ Corresponding author at: 111 Riverbend Road, Rm 228, Athens, GA 30602-6810, USA. Fax: +1 706 5830160.
E-mail address: [email protected] (A.H. Paterson).
URL: http://www.plantgenome.uga.edu (A.H. Paterson).
1 Current affiliation: School of Agriculture and Food Sciences, Zhejiang Forestry University, Lin'an, Hangzhou, Zhejiang, 311300, China.
0888-7543/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2011.02.007

1. Introduction

Whole genome duplication (WGD) events have been more frequent in the lineages of flowering plant species than in most other taxa. With more plant genomes being sequenced and released and the emergence of new tools for genome comparisons, our understanding of the history of genome duplication and its importance in angiosperm evolution is becoming clearer. An ancient genome triplication event is very likely to have been shared by all eudicots [1,2], and different lineages have experienced additional, more recent WGD events [2,3]. For example, Populus had one round of tetraploidy in the Salicoid lineage [4] and Glycine had two rounds of tetraploidy in the legume lineage [5]. In contrast, Vitis and Carica have had no lineage-specific genome duplication events after the common ancestor of all eudicots [1,6].

WGD profoundly impacts the genomic landscape in many ways [7]. Synthetic polyploids experience abrupt CpG methylation changes after genome doubling [8]. Interchromosomal rearrangements increase after WGD in teleost fish [9]. Duplicated genes created by WGD behave differently from single gene duplications, showing a longer life span before one copy is pseudogenized and/or deleted [10]. In a cross-taxon alignment using a Gossypium raimondii (D-genome cotton) physical map [11], more Gossypium contigs were aligned to the distantly-related Vitis vinifera genome than to the more closely-related Arabidopsis genome [11]. It is possible that the two additional WGD events in the Arabidopsis lineage, along with subsequent gene losses and chromosomal rearrangements, have significantly disrupted the conservation of synteny.

The fact that members of the Gossypium genus have a gametic chromosome number of 13 and several related genera have many species with n=6 has long hinted that a Gossypium ancestor may have experienced a relatively recent WGD [12]. While the history of genome duplication in the Gossypium lineage is not yet clear due to the lack of whole genome sequence, classical cytogenetic analysis, Ks distributions of duplicated gene pairs, and possible homoeologous relationships among multiple chromosomal segments within the Gossypium genome [13–15] all support the hypothesis that Gossypium experienced at least one whole genome duplication event since the triplication shared by most if not all eudicots. However, inferred Gossypium homology to date is based on genetic mapping, which is dependent on the marker density and might lead to some spurious matchings [14]. Additionally, sequence shuffling between the pericentromeric regions may cause false positives as well [14]. Therefore, although ancient lineage-specific WGD in Gossypium has been suggested, definitive proof is still lacking.

In sequenced genomes, one common method to search for evidence of ancient WGD is the “all-against-all” dot plot. In this method, ancient homologous genes (or “anchors”) are identified using BLAST, with runs of syntenic segments reflected by consecutive strings of homologous genes preserved in a linear pattern parallel to the diagonal or anti-diagonal (the latter indicating segmental inversion). Compared to the Ks distribution plot, this approach provides not only structural evidence of ancient duplication events, but also the physical location of the duplicated segment pairs. However, this method is not feasible in species that lack contiguous sequences or information about the relative chromosomal positions of the sequences.

Without whole genome data, local gene loss patterns can also be indicative of the history of WGD [16]. After genome duplication, one homologous gene is thought to be freed from selective pressure, and may adopt new functions (neofunctionalization), share the original gene function with its paralogue (subfunctionalization) or become pseudogenized or removed. Indeed, the majority of duplicated gene copies are lost in just a few million years after polyploidy. If a eudicot genome (such as G. raimondii) has experienced WGD with consequent gene loss after its divergence from V. vinifera, one would predict that many genes would have been lost from their ancestral locations.

To further our understanding of its evolutionary history, we studied the Gossypium genome using two different methods: a whole-genome-level dot plot analysis, and a local-level comparative study of a specific region of Gossypium–Vitis synteny on the basis of two sequenced G. raimondii BACs with a total length of ~184 kb. Both the whole genome dot–plots and local-level sequence comparisons provide new evidence of Gossypium lineage-specific genome duplication after the Vitales– split. Comparison of homologous sequences between the two species also offers new insight into mechanisms of genome size variation.

2. Results

2.1. Gossypium–Vitis whole genome dot plot

Gossypium is a genus consisting of 50 allotetraploid and diploid species. The smallest diploid Gossypium genome, that of G. raimondii, has an estimated genome size of around 880 Mb [17]. The construction of a cotton consensus map with 13 chromosomes, by merging the most saturated AD tetraploid genetic map (constructed from an F2 population of tetraploid cotton G. hirsutum and G. barbadense) and the D genome genetic map (constructed from an F2 population of diploid cotton G. trilobum and G. raimondii), was described in earlier studies [13]. Briefly, there are 333 pairs of loci that were mapped in extensive blocks in different subgenomes. These were used as the basis for merging. The positions of non-shared markers were interpolated between these common anchor markers based on the relative recombinational distance from the nearest anchor marker. The combined genetic map contains 3016 loci distributed in a reduced number of 13 putative ancestral chromosomes, thus providing a marker density higher than that of any previously published map and offering more resolution than using individual maps alone.

In this study, all genetic markers from the Gossypium consensus map were compared and plotted against all Vitis genes. Among 3016 loci on the cotton consensus genetic map, 1865 had identified homologies with a total of 3012 genes in the Vitis genome. These genes/loci formed 5097 pairs and the positions of these pairs were used in creating the genome-wide dot–plot.

We were able to detect >50 blocks of syntenic regions between the Gossypium consensus map and Vitis chromosomes (Fig. 1). It is clear from the dot plot that there is often more than one region in Gossypium that matches the same Vitis region. For example, more than half of Vitis chromosome 18 matches regions on Gossypium consensus chromosome 9 and chromosome 10. Similarly, there are syntenic blocks found between Vitis chromosome 3 and Gossypium consensus chromosomes 8 and 12, and syntenic blocks found between Vitis chromosome 14 and Gossypium consensus chromosomes 1 and 6. Across many regions in the Vitis genome, two or more blocks of Gossypium consensus chromosome fragments are found to be syntenic to the same Vitis chromosome region, and we argue that the duplicated Gossypium regions are likely derived from a whole genome doubling event not shared with Vitis.

The consensus map provided us with improved information about the genome structure in cotton. However, we realize that the syntenic blocks in Fig. 1 appear “fuzzy” because of the uncertainties in the exact order of genetic markers and the construction of the consensus map. For example, some areas on the dot plot show a high density of matches, but lack a clear collinear relationship. In places where we could discern significant collinear relationships, there are still fluctuations around the predicted linear order. We should note that the consensus Gossypium map was constructed by merging the genetic markers from At-, Dt- and D-genome genetic maps. The interpolation of the positions of unshared markers could be problematic in inferring marker orders on a local scale because both maps are relatively low resolution (ca. 1 cM) and because the genetic/physical distance ratio can fluctuate widely (violating the linear assumption). Nonetheless, 1629 Vitis genes and 954 Gossypium loci are found in syntenic blocks, among which 263 Vitis genes and 314 Gossypium loci are found in blocks that show a 2:1 relationship between Gossypium and Vitis. These “duplicated blocks” are distributed across many different chromosomes in Vitis, which strongly indicates that at least one genome-wide duplication event has occurred in the Gossypium lineage since its divergence from Vitis.

We further note that the current analysis is feasible because of the high density of markers in our Gossypium consensus map. Indeed, we also attempted to detect collinearity using its individual components, the AD tetraploid reference map and the D genome genetic map [13], separately. Although there are isolated cases where homoeologous tetraploid Gossypium chromosomes were found to be syntenic to the same Vitis chromosome region, these analyses generally fail to reach the same resolution as the analysis with the consensus map. There are many instances where syntenic blocks detected in the plot using the consensus map were missing in the plot using the individual maps due to lack of data points (Supplemental Figure 1).

2.2. Gossypium BAC sequencing and microsynteny detection

We surveyed three BACs from the D-genome Gossypium physical map [11] using shotgun sequencing. The BACs selected were GR174O23, GR109E22 and GR163B08, in the order arranged by FPC (Fingerprinted Contigs [11,18]). Two sequence contigs were assembled for GR109E22 with sizes of 30,903 bp (GR109E22contig1) and 78,650 bp (GR109E22contig2) respectively. There is still one sequence gap (<3 kb) between the two contigs but they are ordered and oriented with the mate-pair information from the subclones. The assembled lengths for the other two BACs are 97,267 bp for GR174O23 and 134,012 bp for GR163B08. Sequence comparison among the three BACs revealed that GR174O23 overlaps with GR109E22contig1, with a merged sequence 104,965 bp long. No overlaps among other BAC sequence fragments were found.

We created putative cotton gene models based on two different methods: ab initio gene predictions were performed using FGENESH, and a similarity-based method was performed by aligning to cotton EST databases (see Methods). A total of 12 genes were identified in GR109E22 contig2 and an additional 12 genes were identified in the combined fragment of GR109E22 contig1 and GR174O23. The BAC GR163B08 has 19 genes identified by FGENESH, but these either failed to show any corresponding EST sequence or are transposon-related

Fig. 1. Whole genome dot–plots between the Gossypium consensus genetic map and Vitis whole genome gene sequences. Collinearity blocks were identified as described in Methods and are highlighted in red.

(detailed in later sections). Significant synteny was found between two regions on Vitis chromosome 6 and two Gossypium BACs, GR174O23 and GR109E22. Putative gene sequences from GR163B08 correspond to genes in multiple scattered positions on the Vitis genome, and we failed to detect syntenic relationships to any Vitis regions using this BAC.

For ease of analysis, we divided the collinear relationships found between G. raimondii BACs and the V. vinifera genome into two regions. Region 1 contains the consensus sequence combining GR174O23, GR109E22 contig1 and the part of GR109E22 contig2 that is immediately adjacent to contig 1 across the sequencing gap in the BAC. This region contains 8 collinear genes that aligned to the region from 21.8 Mb to 22.3 Mb (Vv6g1599–Vv6g1637) on Vitis chromosome 6. Region 2 contains the remaining portion of GR109E22 contig2, which corresponds to 7.5 Mb to 7.8 Mb (Vv6g0801–Vv6g0829) on Vitis chromosome 6, with 9 genes in collinear order (Fig. 2).

Region 1 and region 2 are contiguous on the Gossypium genome, but are located on separate arms of chromosome 6 in Vitis (Fig. 2). To determine if this rearrangement happened in the Gossypium genome or the Vitis genome, we looked at the corresponding syntenic regions of Arabidopsis. Due to the lineage-specific genome-wide duplications and rearrangements in the Arabidopsis genome, we identified several Arabidopsis genomic locations that showed synteny to the Gossypium BACs. Nonetheless, region 1 and region 2 are found to be syntenic to different Arabidopsis genomic locations that are not adjacent to each other. In addition, the synteny between Vitis and Arabidopsis in these regions does not break at the same point as it does between region 1 and region 2 in the Gossypium genome studied here. We conclude that the genome rearrangement that fuses region 1 and region 2 happened in the Gossypium lineage.

2.3. Ks value between syntenic gene pairs

We further calculated the synonymous substitution rate (Ks) between our predicted Gossypium gene models and syntenic orthologs in Arabidopsis and Vitis (Supplemental Table 1). With a median value of

Fig. 2. Position of orthologous regions of sequenced Gossypium BACs and two regions on Vitis chromosome 6.

~1.8 (of 21 gene pairs), Ks values between Gossypium–Arabidopsis orthologs are much higher than those between Gossypium–Vitis orthologs (median value of ~1.4 out of 18 gene pairs). This trend is unexpected since Arabidopsis is phylogenetically closer to cotton (both taxa belong to the Eurosid II clade) than Vitis is. However, there are significant variations in the substitution rate among different angiosperm lineages [19]. Indeed, among the four sequenced rosid genomes studied in Ref. [20], Arabidopsis has the fastest substitution rate while Vitis has the slowest substitution rate, which could explain this unexpected Ks trend. Also, the substitution rate of genes from a small region such as the one studied here might not be representative of the whole genome. A genome level comparative analysis, once a cotton genome sequence is available, will clarify the evolutionary history of these species.

Table 1
Vitis homologous region of Gossypium sequenced BACs.

Region     Vitis gene number   GR BAC number               Homologous position on GR BACs (approx. kb)
Region 1   Vv6g1599            GR174O23_GR109E22contig1    64
           Vv6g1600            GR174O23_GR109E22contig1    79
           Vv6g1602            GR174O23_GR109E22contig1    84
           Vv6g1615            GR174O23_GR109E22contig1    89
           Vv6g1617            GR174O23_GR109E22contig1    97
           Vv6g1624            GR174O23_GR109E22contig1    105
           Vv6g1625            GR109E22Contig2             2
           Vv6g1637            GR109E22Contig2             16
Region 2   Vv6g0801            GR109E22Contig2             24
           Vv6g0802            GR109E22Contig2             29
           Vv6g0806            GR109E22Contig2             34
           Vv6g0814            GR109E22Contig2             43
           Vv6g0817            GR109E22Contig2             49
           Vv6g0819            GR109E22Contig2             49
           Vv6g0823            GR109E22Contig2             54
           Vv6g0826            GR109E22Contig2             58
           Vv6g0829            GR109E22Contig2             77

2.4. Extensive loss of homologous genes observed in Gossypium sequenced regions

To investigate gene loss in the Gossypium lineage after its split from Vitis, we constructed a putative ancestral gene order and compared it to the homologous regions in Arabidopsis. By comparing genes conserved in collinear arrangements in all four Arabidopsis homologous regions (resulting from the two doublings in the Arabidopsis lineage), we were able to distinguish genes present in ancestral locations from “lineage-specific” insertion or deletion events. Genes found in collinear blocks across taxa were inferred to be in putative ancestral locations; other genes are likely to be lineage-specific gene insertions or deletions. Fig. 3 shows an example using Region 2.

In Region 1, 19 genes were in putative ancestral locations on the Vitis chromosome, of which 8 are still preserved in Gossypium; in Region 2 (Fig. 3), 9 genes are preserved in Gossypium out of 17 in putative ancestral locations in Vitis. In both cases, roughly half the V. vinifera genes in ancestral locations are still found in the syntenic locations in Gossypium.

We further compared the extent of gene loss in the Gossypium regions with the corresponding regions in Carica (no WGDs after its divergence from Vitis) and Arabidopsis (two WGDs after its divergence from Vitis). Gene numbers are similar in Carica and Vitis regions, each containing approximately twice the number of genes found in collinear positions in Gossypium (Table 2). In the collinear Arabidopsis regions, the preserved gene number is significantly lower than that of Gossypium (Table 2), closer to ¼ of the genes in putative ancestral locations.

Many genes in the collinear regions of these genomes do not fit into putative ancestral gene positions. These are likely to be lineage-specific gene insertions or deletions. In particular, in Arabidopsis Region 2 (Fig. 3), seven consecutive genes show no homology in Vitis, Carica or Gossypium, but are found in a collinear block on Vitis chromosome 13, indicating translocation of a large fragment to this region in the Arabidopsis lineage.

2.5. The Vitis homologous region spans a larger physical distance than the corresponding regions of Gossypium and Arabidopsis

Although the V. vinifera genome is only about 55% of the size of the G. raimondii genome [1,17], the syntenic region on Vitis is much larger in physical size than the corresponding Gossypium regions in both cases. Region 1 covers a Vitis genomic region of ~446 kb, and a Gossypium region of 58 kb; region 2 covers a Vitis region of 290 kb and a Gossypium region of 53 kb. In both cases, the Vitis region is 5–10 times as large as the corresponding Gossypium region. Arabidopsis syntenic regions had physical sizes similar to the Gossypium regions (between 22 and 70 kb for both Regions 1 and 2). The size difference between corresponding regions of Gossypium and Vitis could be caused by either expansion in the Vitis genome or condensation in the Gossypium genome, or very likely, both.

Transposons

We analyzed the distribution of transposable elements (Figs. 4 and 5) using RepBase (http://www.girinst.org/) and default parameters. TEs comprise a larger proportion of the sequence in the V. vinifera homologous regions, at 25% and 17% of Region 1 and 2, than the 13% and 7% in G. raimondii. Both DNA transposons and retroelements comprise a larger portion of the Vitis sequences than the Gossypium sequences. The difference in quantity of transposons explains 30% and 18% of the size differences between the compared regions (Fig. 5A).
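The share of the Vitis–Gossypium size difference attributable to transposable elements can be approximated from the rounded region sizes and TE percentages quoted in the Transposons paragraph. The following is a minimal sketch under that assumption; the variable and function names are illustrative and are not taken from the authors' analysis, and because the inputs are rounded the results are expected to approximate, not reproduce, the reported 30% and 18%.

```python
# Region sizes (kb) and TE fractions as quoted in the text (rounded values).
REGIONS = {
    "Region 1": {"vitis_kb": 446, "gossypium_kb": 58, "vitis_te": 0.25, "gossypium_te": 0.13},
    "Region 2": {"vitis_kb": 290, "gossypium_kb": 53, "vitis_te": 0.17, "gossypium_te": 0.07},
}


def te_share_of_size_difference(region):
    """Fraction of the Vitis-minus-Gossypium length difference made up of TE sequence."""
    # Absolute amount of TE sequence (kb) in each genome's region.
    vitis_te_kb = region["vitis_kb"] * region["vitis_te"]
    gossypium_te_kb = region["gossypium_kb"] * region["gossypium_te"]
    # Total length difference (kb) between the two syntenic regions.
    size_difference_kb = region["vitis_kb"] - region["gossypium_kb"]
    return (vitis_te_kb - gossypium_te_kb) / size_difference_kb


shares = {name: te_share_of_size_difference(r) for name, r in REGIONS.items()}
for name, share in shares.items():
    print(f"{name}: TEs account for about {share:.0%} of the size difference")
```

With these rounded inputs the estimate comes out slightly below the reported figure for Region 1 and slightly above it for Region 2, which is the kind of discrepancy rounding would produce; in either case most of the size difference is left unexplained by TEs.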

Fig. 3. Pattern of Gossypium gene loss in Region 2. Genes that showed collinearity across genomes are represented by filled squares; genes not preserved in collinearity (putative lineage specific insertions) are represented by hollow squares. Out of the 17 putative ancestral genes in Vitis, only 9 are still identifiable in Gossypium.
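The retention fractions discussed in Section 2.4 can be recomputed from the counts given in the text and in Table 2. The sketch below is only an illustration: it assumes that the four comma-separated Arabidopsis values in Table 2 are the per-segment counts of preserved genes, and the dictionary and function names are hypothetical.

```python
# Genes in putative ancestral locations in Vitis, per region (from the text).
ANCESTRAL = {"Region 1": 19, "Region 2": 17}

# Collinear genes preserved per homologous segment (from Table 2).
# Gossypium has one sequenced segment per region; Arabidopsis has four.
PRESERVED = {
    "Gossypium": {"Region 1": [8], "Region 2": [9]},
    "Arabidopsis": {"Region 1": [6, 4, 9, 5], "Region 2": [5, 8, 3, 6]},
}


def retained_fractions(taxon, region):
    """Fraction of ancestral genes retained in each homologous segment."""
    total = ANCESTRAL[region]
    return [count / total for count in PRESERVED[taxon][region]]


summary = {}
for taxon in PRESERVED:
    for region in ANCESTRAL:
        fractions = retained_fractions(taxon, region)
        summary[(taxon, region)] = {
            "mean": sum(fractions) / len(fractions),
            "best": max(fractions),
        }
        print(f"{taxon} {region}: mean {summary[(taxon, region)]['mean']:.2f}, "
              f"best segment {summary[(taxon, region)]['best']:.2f}")
```

Under these assumptions the single Gossypium segment retains roughly half of the ancestral genes, the average Arabidopsis segment retains about a third, and the best-preserved Arabidopsis segment comes close to the Gossypium value.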

Table 2
Number of ancestral genes preserved in Gossypium and Arabidopsis in the sequenced BACs

                                              Region 1    Region 2
Size of the region in Vitis                   446 kb      290 kb
Number of orthologous Vitis genes             19          17
Size of the region in Gossypium               ~58 kb      53 kb
Number of orthologous Gossypium genes         8           9
Size of the region in Carica                  ~250 kb     328 kb
Number of orthologous Carica genes            ~20         18
Size of the region in Arabidopsis             22-70 kb    23-40 kb
Number of orthologous Arabidopsis genes       6,4,9,5     5,8,3,6

Gene loss

In addition to the size difference explained by transposable elements, there is still a 3× to 4× difference in the size of the corresponding G. raimondii and V. vinifera sequences (Fig. 4). This variation in physical length of syntenic regions is approximately proportional to the number of genes identified. In region 1, Vitis has 38 genes (446 kb) corresponding to 8 genes in Gossypium (15 kb + 43 kb); in region 2, Vitis has 29 genes (290 kb) corresponding to 9 genes in Gossypium (53 kb). In both cases, the size of the genomic region corresponds to the number of genes identified, i.e. gene densities are relatively constant. By plotting the positions of genes and TEs on these regions from the two genomes (Fig. 5), we found that many “extra” gene sequences in the Vitis regions are indeed retained in ancestral positions, suggesting that they may have been lost in this particular region of Gossypium during diploidization following lineage-specific WGD. This suggests that the missing genes in Gossypium are likely to be present in paralogous (homeologous) regions that have yet to be identified and sequenced.

2.6. A non-syntenic Gossypium BAC is enriched for repetitive DNA

GR163B08, distal to GR109E22 in the same physical BAC contig (Fig. 2), differs markedly from the other two BACs sequenced. Homology searches in GenBank showed that 8 (out of 19) predicted genes on this BAC are retrotransposon related, and the remaining 11 showed either no significant homology to known proteins, or homology to unknown proteins. No collinearity was detected with the Vitis, Carica or Arabidopsis genomes. A total of 11% of the BAC sequence is made up of transposable elements, but unlike the other two BACs, these are almost exclusively (97%) LTR-retrotransposons. The number of tandem repeats found in this BAC is 3 to 8 times higher than in the other two BACs.

GR163B08 is closer to the end of the chromosome than the other sequenced BACs (Lin et al. unpublished) and may be in or near a transition zone from gene-rich euchromatin to the sub-telomeric region. Common features of sub-telomeric regions include the enrichment of tandem repeats and large transposable element insertions [21], consistent with the sequence composition of GR163B08.

3. Discussion

Earlier genome mapping studies suggested that diploid Gossypium might be an ancient polyploid [13]. In this study, we used two different approaches to further investigate this hypothesis. We first performed a whole genome dot–plot analysis using genetic markers in Gossypium against all genes in the sequenced V. vinifera genome. Although a significant improvement over prior studies, the resolution of the dot plot is still constrained by the limited number of informative Gossypium markers. Nonetheless, in many cases one V. vinifera chromosome segment corresponded to at least two Gossypium segments, which strongly suggests at least one round of WGD in the diploid Gossypium lineage. Sequencing of one of the collinear regions revealed genome stratification in Gossypium that fits the expected behavior of duplicated gene loss after WGD events. These findings complement and reinforce earlier published findings using different methods that Gossypium species are ancient polyploids.

Despite the smaller genome size of V. vinifera than G. raimondii, the homologous regions in V. vinifera that we have analyzed are much larger than the G. raimondii regions.
We argue that although TE insertions do play a role in the size differences, diploidization in the Gossypium genome could explain a larger portion of the size difference. The missing genes in the Gossypium regions studied are likely to be retained in paleo-duplicated fragments elsewhere in the Gossypium genome that we have not sampled. Therefore, although gene loss has caused the Gossypium regions studied to be smaller in size than the corresponding Vitis regions, given the similar gene densities it is likely that the overall genome size is not affected much by gene deletion. Earlier studies in other taxa [22] suggest that the genome is composed of two distinctive components, with genes densely packed in euchromatic regions, and the heterochromatic regions being largely repetitive DNA that explains the majority of genome size differences. Therefore, given the similar gene densities in these genomes, the variation of genome sizes is mostly determined by the size of heterochromatin.
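The "similar gene densities" argument can be checked against the gene counts and region sizes quoted in the Gene loss paragraph (38 genes in 446 kb and 29 genes in 290 kb for Vitis; 8 genes in 58 kb and 9 genes in 53 kb for Gossypium). The sketch below is a simple illustration using those rounded figures; the names are hypothetical and the calculation is not taken from the authors' methods.

```python
# (region, genome) -> (gene count, region length in kb), as quoted in the text.
COUNTS = {
    ("Region 1", "Vitis"): (38, 446),
    ("Region 1", "Gossypium"): (8, 58),
    ("Region 2", "Vitis"): (29, 290),
    ("Region 2", "Gossypium"): (9, 53),
}


def genes_per_100kb(gene_count, length_kb):
    """Gene density expressed as genes per 100 kb."""
    return 100.0 * gene_count / length_kb


densities = {key: genes_per_100kb(*value) for key, value in COUNTS.items()}

for region in ("Region 1", "Region 2"):
    vitis = densities[(region, "Vitis")]
    gossypium = densities[(region, "Gossypium")]
    # Ratio of physical lengths (Vitis / Gossypium) for comparison.
    length_ratio = COUNTS[(region, "Vitis")][1] / COUNTS[(region, "Gossypium")][1]
    print(f"{region}: Vitis {vitis:.1f}, Gossypium {gossypium:.1f} genes per 100 kb; "
          f"density ratio {gossypium / vitis:.1f}x, length ratio {length_ratio:.1f}x")
```

On these figures the Gossypium regions are somewhat denser (by a factor below two), whereas the physical lengths differ by a factor of roughly five to eight, so density varies far less than region size does.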

3.1. New evidence supporting a history of WGD in Gossypium

Both cytogenetic studies [15] and intragenomic comparisons of genetic marker positions and use of the current gene/marker order to deduce the ancestral gene order [13,23] previously suggested that the Gossypium lineage experienced at least one WGD. Two new lines of evidence further support this hypothesis and indicate that the Gossypium WGD was subsequent to the triplication affecting most if not all dicots.

Fig. 4. The proportion of transposable elements (A) and genes (B) in Gossypium BAC sequences and the corresponding Vitis homologous regions.

Fig. 5. Schematic view of gene and transposable elements in corresponding Gossypium and Vitis regions. Lines connecting different regions indicate syntenic genes.

Compared to earlier analyses using software packages CrimeStat2 and FISH in the detection of ancient duplicated segments [14], whole-genome dot–plots of mapped Gossypium genes herein reveal fewer segments. However, our analysis imposed the additional requirement that genes be ordered in a collinear manner, reducing the likelihood that corresponding segments are explicable by factors other than genome duplication.

Our study also includes a local-level comparative analysis, starting from a putative ancestral gene order prior to the duplication event, in a “top-down” approach [2]. The local analysis, based on BAC sequences, has the advantage of a more detailed view of local regions than studies using coarsely-mapped Gossypium markers alone. The fraction of retained collinear genes in all Gossypium regions studied is less than 50% but is appreciably higher than that of any one Arabidopsis segment (the latter having experienced two WGD events after its divergence from Vitis). However, across the four Arabidopsis segments (Fig. 3), a total of 17 genes are preserved in collinear locations in at least one segment, roughly double the 9 in the single G. raimondii segment.

This pattern of gene loss in the G. raimondii sequenced region could be the result of a) individual single gene translocation events; b) gene loss after segmental duplication; or c) gene loss after genome duplication. The first scenario is unlikely because, with a similar divergence time from Vitis, the gene content in the Carica homologous region is still very well conserved. Although our comparison here includes only four species, the correlation between gene conservation pattern and number of genome duplications is quite significant. However, it is difficult to differentiate the effect of genome duplication versus segmental duplication using the gene loss pattern alone for this particular region, due to the limited sequences that we sampled in this study. Nevertheless, with the evidence provided in our genome-level dot–plot analysis and other published genome-level comparative mapping results, it is likely that the observed gene loss pattern in our Gossypium BACs is representative of a genome-wide event and that the gene losses were incurred by WGD rather than segmental duplications. In other words, it is likely that the relatively small number of ancestral genes found in the Gossypium BACs is explained by the existence of one or more paleo-duplicated segments in the G. raimondii genome that have not yet been sequenced. Such a segment would be expected to retain collinearity to the same V. vinifera region but based on a gene set that is largely complementary to what is found on the sequenced BACs, accounting for the missing half of the inferred ancestral gene content.

Whether there has been one round or two rounds of WGD in the Gossypium lineage is yet to be determined. Although our local-level gene-loss patterns resemble the effect of one round of WGD, we need to be cautious about our conclusions here for several reasons. First, when inferring the ancestral gene repertoire, we inevitably miss genes that are lost either in V. vinifera or in all Arabidopsis homologous regions, or both. So the real ancestral gene number may be larger than what we infer, and thus the apparent 2:1 ratio of ancestral gene number to G. raimondii preserved gene number may not be significantly different from 4:1 (indicative of two rounds of WGD). Secondly, although on average Arabidopsis thaliana homologous regions have fewer duplicated genes preserved than G. raimondii, the number of duplicated genes preserved in Gossypium is not significantly larger than what is found in the best preserved Arabidopsis homologous region (Table 1, bold numbers) (Fig. 3). The sequencing of the whole genome of G. raimondii (in progress) will provide us with a relatively complete list of Gossypium genes and their arrangements, clarifying the history of genome duplications in Gossypium species.

3.2. Effect of genome duplication on genome size

There is no obvious correlation between the number of WGDs a genome has experienced and the size of its genome. Genomes with a history of WGD vary greatly in genome size. For example, sorghum and rice share a similar history of WGD, while the sorghum genome is ~72% larger (740 Mb vs 430 Mb) [24]. The Arabidopsis genome, with a history of one triplication and two duplications, is one of the most compact genomes in higher plants.

The loss of duplicate genes (or “diploidization”) is common after whole genome duplication events. Over long periods of time, the diploidization process seems to restore plant genomes to a relatively stable gene number although changing the relative abundance of some gene functional groups. For the (albeit small) genomic region that we studied here, gene density of homologous regions in genomes with and without WGD is similar, consistent with the notion that genome size variation is mostly caused by transposon accumulations in heterochromatic regions. Comparative studies between rice and sorghum showed that in genomes where sizes of gene space are very similar, heterochromatin alone causes huge genome size differences [22,24]. In the regions of our study, however, fewer transposon insertions were detected in the Gossypium sequences. This might be because the Gossypium BACs selected came from a gene-rich region. Transposon insertions tend to accumulate in heterochromatic regions such as peri-centromeric or sub-telomeric regions [22,25]. In euchromatic regions, duplicated genes in one paralogous region might be removed along with neighboring sequences after WGD, causing the region to be more compact than unduplicated counterparts in outgroup genomes.

Many studies of genome size evolution focus on the effects of transposable elements, particularly the insertion and deletion patterns of LTR-retrotransposons [25,26]. The rapid expansion of one or a few transposon families could lead to a huge increase in genome size [27–29]. A burst of transposon activity has been described in synthesized polyploids, and retrotransposons alone can cause genome size doubling even without WGD [30]. Our findings here suggest that regardless of the number of WGDs a genome has experienced, the collective size of gene-rich regions in different genomes does not vary much after extensive gene loss, e.g. the sum of the sizes of the four Arabidopsis regions homologous to the Vitis region studied is similar to the size of the Vitis region. This suggests that most WGDs have little long-term impact on the huge genome size differences between plant species.

3.3. Advantage of using the V. vinifera genome in whole genome dot–plots

The V. vinifera genome is an excellent reference for efforts to determine numbers of WGDs in eudicots. The Vitis genome has experienced no WGD since the ancient hexaploidy event shared by most if not all eudicots [1]. Its slow evolutionary rate [2] and its stable karyotype [1] may have left it closer to the ancestral gene order than most other eudicot lineages. Vitis is also a good phylogenetic out-group for comparative analysis of many eudicot species. These attributes are very helpful in elucidating the duplication history of a new genome.

Detecting ancient WGD often requires relatively complete information, i.e. sequences and arrangement of most genes in a genome, in order for collinearity and/or synteny to be discernible after extensive gene loss, single gene duplications and translocations. For Gossypium, which is not yet sequenced and has only ~2000 genes genetically mapped, there are relatively few homologous gene pairs available so far to distinguish paleopolyploidy from background noise [13].

could be generalized to the study of other genomes with well-developed genetic maps but lacking whole genome sequence.

4. Materials and methods

4.1. Genetic map and genome sequences

Gossypium genetic map and marker sequence data were retrieved from a previously published map [13]. Gene peptide sequences and position information for Vitis, Carica and Arabidopsis were all downloaded from the Plant Genome Duplication Database (PGDD: http://chibba.agtec.uga.edu/duplication/). Gossypium mRNA sequences were downloaded from PlantGDB (http://www.plantgdb.org/prj/ESTCluster/progress.php).

4.2. Gossypium–Vitis whole genome dot–plot

Gossypium genetic marker sequences were aligned against Vitis genes using BLASTx, with an E-value cut-off of 1e-10. The top 5 best hits were retained in the BLAST results. The dot–plot was generated using a Python script (http://github.com/tanghaibao/quota-alignment/blob/master/scripts/blast_plot.py). ColinearScan [32] was used to detect collinear blocks. The maximum gap allowed within a syntenic block on a Vitis chromosome was set to 1 Mb, and the maximum genetic distance allowed on the consensus map was set to 10 cM.

4.3. BAC sequencing

The BACs were sequenced following a shotgun protocol. Each BAC DNA sample was sheared using a Hydroshear (GeneMachines) to ensure random fragmentation. Sheared DNA fragments were repaired using the End-it DNA End Repair Kit (Epicenter, Madison, Wisconsin, USA). Fragment sizes of ~4–5 kb were selected on a 1% low melting agarose gel, eluting the DNA from the gel using the Qiagen QIAEX II (Qiagen, Valencia, California, USA) gel extraction system. DNA fragments were then ligated into the PCR-Blunt II-TOPO vector and transformed into DH10B Escherichia coli host cells using an electroporator. The transformed cells were spread onto Q-plates and picked by a Q-bot into 96-well plates. Sequencing was performed on an ABI 3730-XL Sequence Analyzer using the BigDye Terminator v3.1 Cycle Sequencing Kit. Chromatographs were assembled using PhredPhrap. Quality of sequence assemblies was checked using Sequencher V.4.1.4.

4.4. Gene and repetitive element identification from BAC sequences

Genes were identified from BAC sequences using FGENESH (http://linux1.softberry.com/berry.phtml). In Gossypium, the species parameter was set to “Dicot plants”; for Vitis the parameter was set to V. vinifera. Repetitive elements were identified using the RepBase repeat masking service (http://www.girinst.org/), with species set to A. thaliana.

4.5. Local-level collinearity searches
We show that the lack of data points to infer paleopolyploidy by The Vitis peptide sequences were used to BLAST against the BAC intra-genomic comparison can be partially mitigated by using a sequences using tBLASTn, with a cutoff value of 1e-20. The BLAST consensus genetic map (with markers from homeoelogous chromo- results were manually checked for collinearity. For Arabidopsis–Vitis somes in tetraploid cotton interleaved into a consensus order) and genomes synteny, multiple collinearity search and alignment was comparison to an outgroup genome. This approach has two performed using MCScan [20]. advantages: 1. the consensus genetic map approximately doubled the number of Gossypium data points available; 2. using an outgroup 4.6. Calculation of synonymous substitutions (Ks) genome such as Vitis helps to detect “ghost duplication” [31] segments that are not detectable in self-plots due to the loss of one homolog. For homologues inferred from syntenic alignments, we aligned the The dot–plot analysis using Gossypium consensus map and the protein sequences using CLUSTALW [33] and used the protein alignments V. vinifera genome in this study has shown patterns of synteny that to guide coding sequence alignments by PAL2NAL [34].TocalculateKs, we are not detected using Gossypium–Gossypium dot–plots. This method used the Nei–Gojobori method implemented in yn00 program in PAML 320 L. Lin et al. / Genomics 97 (2011) 313–320 package [35]. Python script is used to pipeline all the calculations and [14] J. Rong, J.E. Bowers, S.R. Schulze, V.N. Waghmare, C.J. Rogers, G.J. Pierce, H. Zhang, J.C. Estill, A.H. Paterson, Comparative genomics of Gossypium and Arabidopsis: available at (http://github.com/tanghaibao/bio-pipeline/tree/master/ unraveling the consequences of both ancient and recent polyploidy, Genome Res. synonymous_calculation/). 15 (2005) 1198–1210. Supplementary materials related to this article can be found online [15] O.V. Muravenko, A.R. Fedotov, E.O. Punina, L.I. 
Fedorova, V.G. Grif, A.V. Zelenin, Comparison of chromosome BrdU–Hoechst–Giemsa banding patterns of the A(1) at 10.1016/j.ygeno.2011.02.007. and (AD)(2) genomes of cotton, Genome 41 (1998) 616–625. [16] H.M. Ku, T. Vision, J.P. Liu, S.D. Tanksley, Comparing sequenced segments of the References tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny, Proc. Natl Acad. Sci. USA 97 (2000) 9121–9126. [1] O. Jaillon, J.M. Aury, B. Noel, A. Policriti, C. Clepet, A. Casagrande, N. Choisne, S. [17] B. Hendrix, J.M. Stewart, Estimation of the nuclear DNA content of Gossypium Aubourg, N. Vitulo, C. Jubin, A. Vezzi, F. Legeai, P. Hugueney, C. Dasilva, D. Horner, species, Ann. Bot. (Lond) 95 (2005) 789–797. E. Mica, D. Jublot, J. Poulain, C. Bruyere, A. Billault, B. Segurens, M. Gouyvenoux, E. [18] C. Soderlund, I. Longden, R. Mott, FPC: a system for building contigs from Ugarte, F. Cattonaro, V. Anthouard, V. Vico, C. Del Fabbro, M. Alaux, G. Di Gaspero, restriction fingerprinted clones, Comput. Appl. Biosci. 13 (1997) 523–535. V. Dumas, N. Felice, S. Paillard, I. Juman, M. Moroldo, S. Scalabrin, A. Canaguier, I. Le [19] S.A. Smith, M.J. Donoghue, Rates of molecular evolution are linked to life history in Clainche, G. Malacrida, E. Durand, G. Pesole, V. Laucou, P. Chatelet, D. Merdinoglu, flowering plants, Science 322 (2008) 86–89. M. Delledonne, M. Pezzotti, A. Lecharny, C. Scarpelli, F. Artiguenave, M.E. Pe, G. [20] H. Tang, X. Wang, J.E. Bowers, R. Ming, M. Alam, A.H. Paterson, Unraveling ancient Valle, M. Morgante, M. Caboche, A.F. Adam-Blondon, J. Weissenbach, F. Quetier, P. hexaploidy through multiply-aligned angiosperm gene maps, Genome Res. 18 Wincker, The grapevine genome sequence suggests ancestral hexaploidization in (2008) 1944–1954. major angiosperm phyla, Nature 449 (2007) 463–467. [21] H.F. Kuo, K.M. Olsen, E.J. Richards, Natural variation in a subtelomeric region of [2] H. Tang, J.E. Bowers, X. 
Wang, R. Ming, M. Alam, A.H. Paterson, Synteny and Arabidopsis: implications for the genomic dynamics of a chromosome end, collinearity in plant genomes, Science 320 (2008) 486–488. Genetics 173 (2006) 401–417. [3] J.E. Bowers, B.A. Chapman, J. Rong, A.H. Paterson, Unravelling angiosperm genome [22] J.E. Bowers, M.A. Arias, R. Asher, J.A. Avise, R.T. Ball, G.A. Brewer, R.W. Buss, A.H. evolution by phylogenetic analysis of chromosomal duplication events, Nature Chen, T.M. Edwards, J.C. Estill, H.E. Exum, V.H. Goff, K.L. Herrick, C.L. Steele, S. 422 (2003) 433–438. Karunakaran, G.K. Lafayette, C. Lemke, B.S. Marler, S.L. Masters, J.M. McMillan, L.K. [4] E. Stokstad, Genomics. Poplar tree sequence yields genome double take, Science Nelson, G.A. Newsome, C.C. Nwakanma, R.N. Odeh, C.A. Phelps, E.A. Rarick, C.J. 313 (2006) 1556. Rogers, S.P. Ryan, K.A. Slaughter, C.A. Soderlund, H. Tang, R.A. Wing, A.H. Paterson, [5] J. Schmutz, S.B. Cannon, J. Schlueter, J. Ma, T. Mitros, W. Nelson, D.L. Hyten, Q. Song, J.J. Comparative physical mapping links conservation of microsynteny to chromo- Thelen, J. Cheng, D. Xu, U. Hellsten, G.D. May, Y. Yu, T. Sakurai, T. Umezawa, M.K. some structure and recombination in grasses, Proc. Natl Acad. Sci. USA 102 (2005) Bhattacharyya, D. Sandhu, B. Valliyodan, E. Lindquist, M. Peto, D. Grant, S. Shu, D. 13206–13211. Goodstein, K. Barry, M. Futrell-Griggs, B. Abernathy, J. Du, Z. Tian, L. Zhu, N. Gill, T. [23] A. Desai, P.W. Chee, J. Rong, O.L. May, A.H. Paterson, Chromosome structural changes Joshi, M. Libault, A. Sethuraman, X.C. Zhang, K. Shinozaki, H.T. Nguyen, R.A. Wing, P. in diploid and tetraploid A genomes of Gossypium, Genome 49 (2006) 336–345. Cregan, J. Specht, J. Grimwood, D. Rokhsar, G. Stacey, R.C. Shoemaker, S.A. Jackson, [24] A.H. Paterson, J.E. Bowers, R. Bruggmann, I. Dubchak, J. Grimwood, H. Gundlach, G. Genome sequence of the palaeopolyploid soybean, Nature 463 (2010) 178–183. Haberer, U. Hellsten, T. Mitros, A. Poliakov, J. 
Schmutz, M. Spannagl, H. Tang, X. [6] R. Ming, S. Hou, Y. Feng, Q. Yu, A. Dionne-Laporte, J.H. Saw, P. Senin, W. Wang, B.V. Wang, T. Wicker, A.K. Bharti, J. Chapman, F.A. Feltus, U. Gowik, I.V. Grigoriev, E. Ly, K.L. Lewis, S.L. Salzberg, L. Feng, M.R. Jones, R.L. Skelton, J.E. Murray, C. Chen, W. Lyons, C.A. Maher, M. Martis, A. Narechania, R.P. Otillar, B.W. Penning, A.A. Qian, J. Shen, P. Du, M. Eustice, E. Tong, H. Tang, E. Lyons, R.E. Paull, T.P. Michael, K. Salamov, Y. Wang, L. Zhang, N.C. Carpita, M. Freeling, A.R. Gingle, C.T. Hash, B. Wall, D.W. Rice, H. Albert, M.L. Wang, Y.J. Zhu, M. Schatz, N. Nagarajan, R.A. Acob, Keller, P. Klein, S. Kresovich, M.C. McCann, R. Ming, D.G. Peterson, D. Mehboob ur, P. Guan, A. Blas, C.M. Wai, C.M. Ackerman, Y. Ren, C. Liu, J. Wang, J.K. Na, E.V. D. Ware, P. Westhoff, K.F.X. Mayer, J. Messing, D.S. Rokhsar, The Sorghum bicolor Shakirov, B. Haas, J. Thimmapuram, D. Nelson, X. Wang, J.E. Bowers, A.R. genome and the diversification of grasses, Nature 457 (2009) 551–556. Gschwend, A.L. Delcher, R. Singh, J.Y. Suzuki, S. Tripathi, K. Neupane, H. Wei, B. [25] J.L. Bennetzen, J.X. Ma, K. Devos, Mechanisms of recent genome size variation in Irikura, M. Paidi, N. Jiang, W. Zhang, G. Presting, A. Windsor, R. Navajas-Perez, M.J. flowering plants, Ann. Bot. 95 (2005) 127–132. Torres, F.A. Feltus, B. Porter, Y. Li, A.M. Burroughs, M.C. Luo, L. Liu, D.A. Christopher, [26] J.L. Bennetzen, Mechanisms and rates of genome expansion and contraction in S.M. Mount, P.H. Moore, T. Sugimura, J. Jiang, M.A. Schuler, V. Friedman, T. flowering plants, Genetica 115 (2002) 29–36. Mitchell-Olds, D.E. Shippen, C.W. dePamphilis, J.D. Palmer, M. Freeling, A.H. [27] X. Zhao, R.A. Wing, A.H. Paterson, Cloning and characterization of the majority of Paterson, D. Gonsalves, L. Wang, M. Alam, The draft genome of the transgenic repetitive DNA in cotton (Gossypium L.), Genome 38 (1995) 1177–1188. 
tropical fruit tree papaya (Carica papaya Linnaeus), Nature 452 (2008) 991–996. [28] J.S. Hawkins, H. Kim, J.D. Nason, R.A. Wing, J.F. Wendel, Differential lineage- [7] M. Semon, K.H. Wolfe, Consequences of genome duplication, Curr. Opin. Genet. specific amplification of transposable elements is responsible for genome size Dev. 17 (2007) 505–512. variation in Gossypium, Genome Res. 16 (2006) 1252–1261. [8] L.N. Lukens, J.C. Pires, E. Leon, R. Vogelzang, L. Oslach, T. Osborn, Patterns of sequence [29] X.P. Zhao, Y. Si, R.E. Hanson, C.F. Crane, H.J. Price, D.M. Stelly, J.F. Wendel, A.H. loss and cytosine methylation within a population of newly resynthesized Brassica Paterson, Dispersed repetitive DNA has spread to new genomes since polyploid napus allopolyploids, Plant Physiol. 140 (2006) 336–348. formation in cotton, Genome Res. 8 (1998) 479–492. [9] J.H. Postlethwait, Y.L. Yan, A. Amores, B. Cresko, A. Singer, D. Rubin, Consequences [30] B. Piegu, R. Guyot, N. Picault, A. Roulin, A. Saniyal, H. Kim, K. Collura, D.S. Brar, S. of genome duplication for the evolution of developmental mechanisms in teleost Jackson, R.A. Wing, O. Panaud, Doubling genome size without polyploidization: fish, Integr. Comp. Biol. 45 (2005) 1058. dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, [10] M. Lynch, J.S. Conery, The evolutionary fate and consequences of duplicate genes, a wild relative of rice, Genome Res. 16 (2006) 1262–1269. Science 290 (2000) 1151–1155. [31] C. Simillion, K. Vandepoele, M.C.E. Van Montagu, M. Zabeau, Y. Van de Peer, The [11] L. Lin, G. Pierce, J. Bowers, J. Estill, R. Compton, L. Rainville, C. Kim, C. Lemke, J. hidden duplication past of Arabidopsis thaliana, Proc. Natl Acad. Sci. USA 99 (2002) Rong, H. Tang, X. Wang, M. Braidotti, A. Chen, K. Chicola, K. Collura, E. Epps, W. 13627–13632. Golser, C. Grover, J. Ingles, S. Karunakaran, D. Kudrna, J. Olive, N. Tabassum, E. Um, [32] X.Y. Wang, X.L. Shi, Z. Li, Q.H. Zhu, L. Kong, W. 
Tang, S. Ge, J.C. Luo, Statistical M. Wissotski, Y. Yu, A. Zuccolo, M. ur Rahman, D. Peterson, R. Wing, J. Wendel, A. inference of chromosomal homology based on gene colinearity and applications Paterson, A draft physical map of a D-genome cotton species (Gossypium to Arabidopsis and rice, BMC Bioinform. 7 (2006). raimondii), BMC Genomics 11 (2010) 395. [33]M.A.Larkin,G.Blackshields,N.P.Brown,R.Chenna,P.A.McGettigan,H. [12] J. Rong, A. Paterson, Comparative Genomics of Cotton and Arabidopsis, in: A. McWilliam, F. Valentin, I.M. Wallace, A. Wilm, R. Lopez, J.D. Thompson, T.J. Paterson (Ed.), Genetics and Genomics of Cotton, Springer, New York, 2009. Gibson, D.G. Higgins, Clustal W and Clustal X version 2.0, Bioinformatics 23 [13] J. Rong, C. Abbey, J.E. Bowers, C.L. Brubaker, C. Chang, P.W. Chee, T.A. Delmonte, X. (2007) 2947–2948. Ding, J.J. Garza, B.S. Marler, C.H. Park, G.J. Pierce, K.M. Rainey, V.K. Rastogi, S.R. [34] M. Suyama, D. Torrents, P. Bork, PAL2NAL: robust conversion of protein sequence Schulze, N.L. Trolinder, J.F. Wendel, T.A. Wilkins, T.D. Williams-Coplin, R.A. Wing, R.J. alignments into the corresponding codon alignments, Nucleic Acids Res. 34 Wright, X. Zhao, L. Zhu, A.H. Paterson, A 3347-locus genetic recombination map of (2006) W609–W612. sequence-tagged sites reveals features of genome organization, transmission and [35] Z. Yang, PAML 4: phylogenetic analysis by maximum likelihood, Mol. Biol. Evol. 24 evolution of cotton (Gossypium), Genetics 166 (2004) 389–417. (2007) 1586–1591.
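
The hit-filtering step described in Section 4.2 (BLASTx E-value cut-off of 1e-10, top 5 hits retained) can be illustrated with a minimal Python sketch. This is not the authors' script. It assumes that the BLAST output is in the common 12-column tab-delimited layout (E-value in column 11, bit score in column 12), that hits are retained per query marker, and that "top" means highest bit score; none of these details are stated in the text. The function name `filter_blast_hits` is hypothetical.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-10  # E-value cut-off given in Section 4.2
MAX_HITS = 5           # number of hits retained (assumed to be per query)


def filter_blast_hits(lines, evalue_cutoff=EVALUE_CUTOFF, max_hits=MAX_HITS):
    """Return {query: [(subject, evalue, bitscore), ...]}.

    Hits with an E-value above the cut-off are dropped; the remaining
    hits of each query are ranked by bit score (highest first) and
    truncated to max_hits entries.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Assumed layout: query, subject, 8 alignment columns, evalue, bitscore
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= evalue_cutoff:
            hits[query].append((subject, evalue, bitscore))
    return {
        query: sorted(found, key=lambda hit: -hit[2])[:max_hits]
        for query, found in hits.items()
    }


if __name__ == "__main__":
    # Two made-up hits for one marker; only the first passes the cut-off.
    example = [
        "marker1\tvitisA\t90\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
        "marker1\tvitisB\t60\t100\t30\t0\t1\t100\t1\t100\t1e-5\t45",
    ]
    for query, kept in filter_blast_hits(example).items():
        print(query, kept)
```

The retained (marker, gene) pairs are the kind of data points that a dot–plot or a collinearity search would then take as input.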