An overview of the orchid genome by BAC end sequence analysis

Chia-Chi Hsu1, Yu-Lin Chung1, Yu-Ling Lee1, Tien-Chih Chen1, Yi-Tzu Kuo1, Wen-Chieh Tsai2,3, Yu-Yun Hsiao1, Yun-Wen Chen1, Wen-Luan Wu1,2,3, and Hong-Hwa Chen1,2,3

1Department of Life Sciences, National Cheng Kung University, Tainan 701, Taiwan 2Institute of Tropical Sciences, National Cheng Kung University, Tainan 701, Taiwan 3Orchid Research Center, National Cheng Kung University, Tainan 701, Taiwan.

Abstract

Phalaenopsis orchid is one of the most popular floral crops and is an economically important floriculture plant worldwide. Bacterial artificial chromosome (BAC) end sequences (BESs) can provide the first glimpse of the sequence composition of an unsequenced genome and yield molecular markers useful for genetic mapping and breeding. We used two BAC libraries (BamH I and Hind III) constructed from Phalaenopsis equestris. Pair-end sequences were generated from 2,920 BAC clones with a success rate of 95.7%. A total of 5,535 BESs generated represented 4.5 Mb, about 0.3% of the Phalaenopsis genome. The trimmed sequences ranged from 123 to 1,397 bp, with an average edited read length of 821 bp. These BESs underwent sequence homology searches. In total, 641 BESs (11.6%) were predicted to be protein-coding regions. In addition, 1,272 BESs (23.0%) were identified to contain repetitive DNA. Most of these repetitive DNA sequences were gypsy- and copia-like retrotransposons, with ratios of 41.9% and 12.8%, respectively, and only 10.8% were DNA transposons. Furthermore, 950 potential simple sequence repeats (SSRs) were discovered. Microsynteny analysis revealed that higher numbers of BESs mapped to the whole-genome sequences of poplar, then grape and Arabidopsis, and less to that of rice. Overall, BES analysis can give an overview of the Phalaenopsis genome in terms of the abundance of genes, repetitive DNA, SSR markers, and microsynteny with other plant species. BAC libraries are useful for constructing the physical map of the Phalaenopsis genome, thus improving the whole-genome sequencing of Phalaenopsis.

1 Background

The family of is one of the largest of the flowering , and the number of species may exceed 25,000 [1]. Like all other living organisms, present-day orchids have evolved from ancestral forms as a result of selection pressure and adaptation. They show a wide diversity of epiphytic and terrestrial growth forms and have successfully colonized almost every habitat on earth. The factors promoting the richness of orchid species may be specific interactions between orchid flowers and pollinators [2], sequential and rapid interplay between drift and natural selection [3], obligate orchid–mycorrhizal interactions [4], and epiphytism, the growth form of more than 70% of all orchids, which probably make up two-thirds of the epiphytic flora of the world.

The expanding diversity of the orchid family may have taken place in a shorter time that that of most families, which had already started to diversify in the Mid-Cretaceous [5]. The time of origin of orchids is in dispute, although they have been suggested to originate 80 to 40 million years ago (late Cretaceous to late

Eocene) [6]. Perhaps the only general statement that can be made about the origin of orchids is that most extant groups are probably very young. Recently, the Orchidaceae were dated by a fossil record of an orchid pollinia on the back of its pollinator, a stingless bee, in amber. The most recent common ancestor of extant orchids was deduced to live in the late Cretaceous (76-84 Mya) [7].

Orchids are known for their diversity of specialized reproductive and ecological strategies. For successful reproduction, the formation of labellum and gynostemium (a fused structure of androecium and gynoecium) to facilitate pollination is well documented, and the co-evolution of orchid flowers and their pollinators is also well known [8-9]. In addition, the successful evolutionary progress of orchids may be

2 explained by mature pollen grains packaged as pollinia, pollination-regulated ovary/ovule development, synchronized timing of micro- and mega-gametogenesis for effective fertilization, and the release of thousands or millions of immature embryos

(seeds without endosperm) in a mature capsule [10]. However, despite their unique developmental reproductive biology, specialized pollination and ecological strategies, molecular studies of orchids remain underrepresented relative to that of other species-rich plant families [11].

The genomic sequence resources for orchids are limited. A number of studies have used Sanger sequencing to develop expressed sequence tag (EST) resources for orchids [12-14]. These studies have highlighted the usefulness of cDNA sequencing for discovery of candidate genes for orchid floral development [15-16], floral scent production [13, 17] and flowering time [18] in the absence of a genomic sequence.

However, a comprehensive description of the full complement of genes expressed in orchids remains unavailable.

P. equestris is a diploid plant with an estimated haploid genome size of 1,600 Mb

(3.37 pg/diploid genome) and 2n = 2x = 38 chromosomes, which are small and uniform in size (< 2 μm) [19]. Public databases of floral bud ESTs of P. equestris and P. bellina have been analyzed [13, 16]. These collections provide valuable resources for direct access to the genes of interest [15-17] and for the development of molecular markers for marker-assisted breeding programs or cultivar identification

(unpublished data). However, basic genomic information about the genome organization and structure of Phalaenopsis orchids is lacking.

To understand the sequence content and complexity of the Phalaenopsis genome, an efficient and viable strategy is to construct bacterial artificial chromosome (BAC) libraries and end-sequence randomly selected BAC clones from the BAC libraries.

3 The BAC end sequences (BESs) can be used as a primary scaffold for genome shotgun-sequence assembly and to generate comparative physical maps [20].

Moreover, analysis of BES data can provide an overview of the sequence composition, including gene density, presence of potential transposable elements (TEs) and microsatellites of an unsequenced genome [21-24]. In addition, BESs can yield molecular markers for genomic mapping, cloning of genes and phylogenetic analysis.

Even for the sequenced rice genome, the Oryza Map Alignment Project (OMAP) constructed deep-coverage large-insert BAC libraries from 11 wild and 1 cultivated

African Oryza species (O. glaberrima); the project fingerprinted and end-sequenced the clones from all 12 BAC libraries to construct physical maps of these Oryza species for study of evolution, genome organization, domestication, and gene regulatory networks and to contribute to crop improvement [25-26].

In this study, we analyzed 5,535 BESs from 2 genomic BAC libraries of

Phalaenopsis equestris and focused on simple sequence repeat (SSR) or microsatellite content, repeat element composition, GC content and protein-coding regions. These

BESs offer the first detailed insight into the sequence composition of the P. equestris genome, and the annotated BESs can be useful resources for molecular marker development with P. equestris.

Results and Discussion

BAC end sequencing

DNA samples extracted from 2,920 BAC clones were sequenced from both ends, with a success rate of 95.7%. Of these, 71.4% and 28.6% BACs were from BamH I and Hind III libraries, respectively. After trimming ambiguous sequences, removing of vector sequence and mitochondrial DNA, we generated a collection of 5,535

4 high-quality BESs that included 5,360 paired-end reads (Table 1). The size of these

BESs ranged from 123 to 1,397 bp, with an average of 821 bp, which corresponded to a total length of 4,544,250 bp equivalent to 0.3% of the P. equestris genome (Table 1).

The 5535 BESs can be assembled into 340 contigs (with the average coverage of 2.99) and 4518 singletons (data not shown). For the read-length distribution of the BESs, both 800-899 bp and 900-999 bp were the most abundant for 1,567 BESs (28%) and

1,684 BESs (30%), respectively (Fig. 1). The GC content was 35.95%, close to that of the genomic DNA of P. amabilis BLUME (Sarcanthinae, Vandeae) and 34.09% as estimated by using buoyant density [34], which indicates that the Phalaenopsis genome is AT-rich. Buoyant density was also used to analyze Brassica maculate R.

BR. (Oncidiinae, Vandeae), schombocattleya LINDL. (Epidendrinae,

Epidendrae), and Cymbidium pumilum SWARTZ cv. “Gareth Latangor” (Cymbidiinae,

Vandeae) and revealed their GC contents to be 32.05%, 34.09%, and 32.05%, respectively [34], which suggests that most orchids have AT-rich genomes. All BES sequences have been submitted to GenBank, with the accession numbers

HN176659-HN182163.

Database sequence searches

The P. equestris BESs underwent sequence homology analysis in RepBase and

TIGR plant repeat databases by using RepeatMasker and BLAST for prediction of repeat sequences and potential TEs. A total of 1,272 BESs were homologous to transposable elements and repeats, for 23.0% of the total 5,535 BESs analyzed. These

BESs were RepeatMasked and compared to the NCBI non-redundant protein databases. In total, 641 sequences (11.6%) contained the genic region of protein coding sequences. In addition, 29 BESs (0.5%) matched chloroplasts (Table 1).

5

Analysis of repetitive DNA in BES

The large genome size of P. equestris (1,600 kb) implies a highly repetitive DNA content (i.e., more similar to that of maize than rice). After removal of 29 BESs containing chloroplast and mitochondria sequences, 5,506 BESs were used to identify the repetitive DNA sequences with RepeatMasker and the TIGR plant repeat database.

Similar to other eukaryotic genomes, that of Phalaenopsis contains a significant portion of the repeat sequences and potential TEs: 1,272 BESs contain the potential

TEs, for 23% of the total number of BESs (Table 1). The percentage of BESs with homologous sequences to TEs in Phalaenopsis was higher than that in apple (20.9%)

[23] but lower than that in Citrus clementina (25.4%) [35], carrot (28.3%) [24], and

Musa acuminate (36.6%) [22].

Among the 1,272 BESs with potential TEs, more BESs showed sequence homology to the Class I retrotransposons (963 BESs, 75.7%) than to the Class II DNA transposons (137 BESs, 10.8%), which represented one-seventh of the genome fraction for Class II DNA transposons to Class I retrotransposons (Table 2). The Class

I retrotransposons could be further classified into Ty1/copia (163, 12.8%) and

Ty3/gypsy long-terminal-repeat (LTR) retrotransposons (533, 41.9%), LINE (95),

SINE (0) non-LTR retrotransposons and other unclassified retrotransposons (172)

(Table 2). Clearly, LTR retrotransposons outnumbered non-LTR retrotransposons in terms of the number of matches in BESs, and the number of unclassified retrotransposons was also high. The next most abundant repeat DNA in the BESs was

Class II DNA transposons, containing the Ac/Ds (12), En/Spm (67), Mutator (15),

Tourist, Harbinger, Helitron, Mariner (10), and other unclassified transposons (33). In total, 1,100 BESs contained Class I retrotransposons and Class II DNA transposons.

6 Other repeat sequences identified by RepeatMasker and the TIGR plant repeat database were Miniature inverted-repeat transposable elements (MITE, containing 2

BESs), centromere-related sequences (19), ribosome RNA genes (33), and other unclassified repeat sequences (118), accounting for 172 BESs (13.4%) among those with repetitive sequences (Table 2).

Functional annotation

To identify protein-coding regions, comparison of the RepeatMasked BESs in the

NCBI non-redundant protein databases revealed 641 sequences (11.6%) containing the genic region of protein-coding sequences (Table 1). Of the 641 BESs matching protein coding sequences, the top BLAST match was 252 (39.3%) to V. vinifera, and

106 (16.5%) were matched to O. sativa (Fig. 2). This finding is consistent with the

BLAST results of floral bud ESTs, in which the top match was also V. vinifera, followed by O. sativa [12, 13]. At first glance, it is unreasonable for orchid genes to be related more to a phylogenetically distant dicot species than to a monocot. More orchid sequencing data will be useful for the study of gene family evolution.

The 641 BESs containing protein-coding sequences were BLASTN to our orchid

EST databases [12, 13] and showed that 417 BESs (65.1%) had matches and expressed in orchid (data not shown), but 224 BESs (34.9%) had no matches and no expression in sequencing stages of the orchid EST databases.

Gene density prediction was calculated from the 641 predicted protein-coding sequences in 5,535 BESs, which extended 4,544-kb Phalaenopsis genome sequences and estimated the presence of a gene every 7.1 kb of the Phalaenopsis genome. As compared with other plant genomes, that of banana (M. acuminate) has a gene density of 6.4 kb [22], rice (O. sativa) 6.2 kb per gene [37] and A. thaliana 4.5 kb per gene

7 [38].

For GO annotation, the 641 BESs with protein-coding regions showing homology in the NCBI non-redundant protein database were further analyzed for GO annotation with the categories cellular component, molecular function, and biological process. Among the 641 BESs, 321 were annotated to the cellular component category,

182 to the molecular function, and 329 to the biological process. For the 321 annotated BESs with the function of cellular components, 30 (9.35%) encoded proteins functioning in the chloroplast, 25 in the other membranes (7.79%), and 18 in other cellular components (5.61%). However, more than half of the BESs encoded proteins with unknown cellular components (179 BESs, 55.76%) (Fig. 3a). For the molecular function, the ratios of genes encoding for unknown molecular function, transferase activity, other enzyme activity, and nucleotide binding were 16.48% (30),

15.38% (28), 14.84% (27), and 14.29% (26), respectively (Fig. 3b). Among the 329

BESs with annotation to the biological process, more than half encoded proteins with unknown biology processes (179 BESs, 54.41%). BESs function in other cellular processes, protein metabolism, and transport, with ratios of 11.25% (37), 5.17% (17), and 5.17% (17), respectively (Fig. 3c).

RepeatMasked BESs were also compared to the protein databases of O. sativa

(downloaded from The Rice Annotation Project Database, http://rapdb.dna.affrc.go.jp/download/index.html) and V. v i nif e r a (downloaded from

NCBI V. vinifera protein database, ftp://ftp.ncbi.nih.gov/genomes/Vitis_vinifera/ protein/) by use of BLASTX with E-value <1e-5. Among the 5,506 BESs, 550 matched BESs (9.99%) were homologous to the V. vinifera proteins. Thus, the total coding region of the P. equestris genome might be approximately 159.8 Mb, from the estimated genome size of 1,600 Mb for the P. equestris genome. The total gene

8 content of the P. equestris genome is an estimated ~47,007 according to the assumption of an average gene length of 3.4 kb in V. vinifera [36]. Moreover, 504

Phalaenopsis BESs matched in the rice protein database and accounted for the 9.15% and a 146.5-Mb sequence of the Phalaenopsis genome, which suggests 54,259 genes in Phalaenopsis genome with the assumption of an average gene length of 2.7 kb in

Oryza [39]. The values were compatible to 30,434 protein-coding genes identified in the 487-Mb grapevine genome [36] and 37,544 protein genes in the 389-Mb rice genome [39]. Of note, the distribution of the genes is fairly homogeneous along the chromosomes of rice and Arabidopsis but more heterogeneous in V. vinifera [36].

Cytogenetics results of the P. equestris genome showed that the distribution of heterochromatin is patchy (i.e., the distribution of genes along the chromosomes is heterogeneous for Phalaenopsis orchids [personal communication, Dr. S. B. Chang,

Department of Life Sciences, National Cheng Kung University]). According to the genome size of Phalaenopsis, we assumed that it may have an average gene length and gene distribution similar to that of the Vitis genome (i.e., may have a gene number close to 47,000).

Comparative mapping of orchid BAC ends to other plant genomes for microsynteny identification

To understand the syntenic relationships between orchid and other plant species, orchid BESs were BLAST searched against the whole genome sequences of 4 sequenced genomes, including A. thaliana, rice (O. sativa), poplar (Populus trichocarpa) and grape (V. vinifera). The BAC pair ends with a span within 50-300 kb oriented in the same direction to the same chromosome of the target genome were considered potentially collinear with the target genome. Among them, 142

9 Phalaenopsis BESs contained the most hits to the poplar genome, with 14 end-pair sequences of the BAC clones and 12 BAC end pairs targeting the same poplar chromosome. One BAC end pair had a span within 50-300 kb, which suggests collinearity with the poplar genome (Table 3). A total of 123 BESs, including 12 BAC end pairs, showed significant hits with the grape genome. Among them, 10 end pairs were in the same chromosome. However, the whole genome sequence of grape contains several contigs for each individual chromosome. One of 10 BAC end pairs were located at different contigs in the same chromosomes; presumably they may locate within the range of 50-300 kb (Table 3).

For mapping orchid BESs to the Arabidopsis genome, 94 BESs, including 12 paired ends, had significant hits to the Arabidopsis genome. Eleven BAC end pairs were located on the same chromosomes, yet none was within 50-300 kb (Table 3).

The mapping of orchid BES to the monocot species of rice produced 83 BES hits, including 6 paired ends with significant hits in the rice genome. Six BAC end pairs were located on the same chromosomes, but none were within 50-300 kb (Table 3).

Most of the paired ends mapped onto other plant chromosomes were annotated as ribosome DNA, with 10 of 14 pair ends in poplar, 10 of 12 ends in grape, 10 of 12 ends in Arabidopsis, and all 6 ends in rice. In addition, all the pair ends containing rDNA were mapped to the same chromosomes in the 4 plant species.

From the results of microsynteny mapping to the 4 plant genomes, Phalaenopsis

BESs are suggested to be more similar to the poplar genome than to the grape,

Arabidopsis, and even less so the rice genome. As in the comparative analyses of asparagus (Asparagus officinalis L., ) BACs to onion (Allium cepa L.,

Asparagales) or rice (Poales, a group of orders sister to the Asparagales), no relationships were observed in physical (cDNA–selected BACs) or genetic (M-linked

10 BACs) linkages between asparagus to rice or onion across these regions, suggesting that synteny among rice genome does not extend to a sister order in the monocots, like asparagus [Jakse et al., 2006] or orchids. This result was consistency with our microsynteny result that Phalaenopsis BESs were less similar to rice genome, although both Phalaenopsis and rice belong to the . Whole-genome sequencing for Phalaenopsis will be necessary and useful for providing more insight in genome reorganization and for clarifying of genomic compositional differences between orchids and other plant species.

In addition, we noted that the 29 orchid BESs containing Phalaenopsis chloroplast genome sequences had matches to the 4 genomic sequences, with 25 BESs matched to the grape genome, 24 to the rice genome, 21 to the poplar genome, and 2 to the Arabidopsis genome (Table 4) by using BLASTN with an E-value <1e-30.

Moreover, these BESs also had matches to the chloroplast DNA of the 4 plant genomes, with 28 BESs to grape, 24 to rice, 27 to poplar, and 25 to Arabidopsis chloroplasts by use of the same stringent threshold. Transfer of the chloroplast DNA to the nucleus is well known to result in chloroplast DNA inserted in nuclear chromosomes. In rice, 421-453 chloroplast insertions distribute over the 12 chromosomes and extend 0.18%-0.19% of the rice nuclear genome [39]. Our results showed 12 BES paired ends and 5 unpaired BES ends among the 29 orchid BESs containing Phalaenopsis chloroplast DNA, which suggests that some of these BESs may indeed be located in the Phalaenopsis nuclear genome sequence but not due to chloroplast contamination. The whole genome sequencing of Phalaenopsis is urgently needed to confirm this.

Conclusions

11 Phalaenopsis BAC end sequences can provide the first insights into the composition of the Phalaenopsis genome in terms of GC content, transposable elements, protein-coding regions, SSR discovery and potential microsynteny between

Phalaenopsis and other plant species. The similarity of the homologous protein sequences between Phalaenopsis and grape, as well as the microsyntenay between

Phalaenopsis and poplar, are interesting and should be confirmed by large-scale BAC end sequencing. BESs analysis can also be a pioneer for sequence analysis of

Phalaenopsis BAC libraries, which can be used for fingerprinting contigs to construct a physical map of the Phalaenopsis genome.

Methods

Orchid BAC clones and BAC-end sequencing

Two large-insert bacterial artificial chromosome (BAC) libraries were used for end-sequencing in this study. One library was constructed from a partial Hind III digestion of genomic DNA of the diploid Taiwan native species, P. equestris, and consists of 100,992 clones with an average insert size of 100 kb. The other library, constructed from the partial BamH I-digested DNA, consists of 33,428 clones with an average insert size of 111 kb. The 2 libraries represent approximately 8.4 equivalents of the wild Phalaenopsis orchid haploid genome (unpublished data).

BAC clones, randomly chosen from 96-well microplates, were inoculated in

96-well deep plates containing 1.5 ml of 2x LB medium plus 12.5 μg/ml chloramphenicol. Plates were incubated at 37°C with continuous shaking at 100 rpm for 20-24 h. BAC end sequencing involved use of Applied Biosystems (ABI) Big Dye terminator chemistry and analysis on ABI 3730 machines by the Sequencing Core

Facility of National Yang Ming University Genome Center (YMGC) and by the

12 Arizona Genomics Institute (AGI).

The BESs were base-called and processed by CodonCode Aligner, which integrates the PHRED program [27], with the default parameters of “maximize region with error rate < 0.1” to trim the bases from the start and end of the sequences and

“move all sequences shorter than 25 bases to trash” and “move all sequences with fewer than 50 Phred 20 bases to trash” after end clipping. Vector sequences were trimmed by using Sequencher V4.1 (Ann Arbor, MI) with the pIndigoBAC5 DNA sequence. BESs < 100 bp were discarded, and 5,535 BESs remained. BESs were searched with BLASTN against chloroplast DNA from P. aphrodite subsp. formosana

(accession no.: NC_007499) [28] and mitochondrial DNA from O. sativa (japonica) cultivar Nipponbare (accession no.: DQ167400) [29] by using a stringent threshold of

E-value 1e-50.

Analysis of repetitive sequences

The BESs were used to analyze repetitive sequences by use of RepeatMasker [30] with the same default condition as that used for Arabidopsis, rice, and maize sections of the RepBase Update databases [31] and by using BLASTN and TBLASTX to search the TIGR plant repeat database (downloaded on Sep. 5, 2009) [32] with

E-value <1e-10 and < 1e-5, respectively. Repetitive sequences were annotated by use of the RepeatMasker default setting or classified by TIGR codes for plant repetitive sequences.

Functional annotation of the Phalaenopsis BES

The 5,506 RepeatMasked BESs were further analyzed for protein-coding regions with a BLASTX search of the NCBI non-redundant protein databases, with E-value

13 <1e-5. The protein-coding BESs were BLASTN to our orchid EST databases [12, 13] at an E-value <1e-20. The BESs containing protein-coding regions homologous to

NCBI non-redundant protein databases were further analyzed for their Gene Ontology

(GO) annotation with a BLASTX search of the Arabidopsis genome annotation database [18] (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/), with E-value <1e-7. Categories were assigned on the basis of biological, functional and molecular annotation available from GO (http://www.geneontology.org/).

Microsynteny between P. equestris and A. thaliana, O. sativa, P. trichocarpa, and V. vinifera

The Phalaenopsis BESs not RepeatMasked were compared with the genome sequences of A. thaliana, O. sativa, P. trichocarpa, and V. vinifera (downloaded from

NCBI database ftp://ftp.ncbi.nih.gov/genomes/ on Feb. 27, 2010) by a BLASTN search with E-value <1e-10. To identify comparative-tile BACs from the

Phalaenopsis library that showed microsynteny with the reference genomes as shown in a previous study with Musa genomic analysis [22], searches against the

Phalaenopsis genomic sequence were parsed for the pair of BES for which both ends had the highest significant matches to a stretch of A. thaliana, O. sativa, P. trichocarpa, or V. vinifera sequences and where the 2 regions on the Phalaenopsis genome were between 50 and 300 kb apart.

List of abbreviations

BAC: bacterial artificial chromosome; BESs: BAC end sequences; FPCs:

Fingerprinting contigs; GO: Gene Ontology; MISA: MIcroSAtellite identification tool;

SSRs: simple sequence repeats

14

Authors’ contributions

CCH performed the bioinformatic analysis for genes, transposable elements, and microsynteny. YLC, YLL, and YTK performed the SSR prediction and verification.

TCC and WLW constructed the Phalaenopsis BAC library for sequencing. WCT,

YYH, YWC, WLW and HHC participated in the design of the study and further analysis. CCH, YLC, YLL, WLW, and HHC drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgments

We thank Dr. Michel Delseny (Laboratory of Plant Genome and Plant Physiology,

University of Perpignan, France) for critically reading the manuscript and for helpful discussion of the paper. We acknowledge the technical services provided by the

Sequencing Core Facility at the National Yang-Ming University Genome Research

Center (YMGC, Taipei, Taiwan). The Sequencing Core Facility is supported by

National Research Program for Genomic Medicine (NRPGM), National Science

Council, Taiwan. We also acknowledge the technical services provided by the

Arizona Genomics Institute (AGI) DNA Sequencing Center and the AGI Physical

Mapping Center, Arizona, USA. This work was supported by grant G98-B303 from the Council of Agriculture, Taiwan.

References

1. Atwood JT: The size of Orchidaceae and the systematic distribution of epiphytic orchids. Selbyana 1986, 9:16. 2. Cozzolino S, Widmer A: Orchid diversity: an evolutionary consequence of deception? Trends Ecol Evol 2005, 20:487-494.

15 3. Tremblay RL, Ackerman JD, Zimmerman JK, Calvo RN: Variation in sexual reproduction in orchids and its evolutionary consequences: a spasmodic journey to diversification. Biol J Linn Soc 2005, 84:1-54. 4. Otero JT, Flanagan NS: Orchid diversity--beyond deception. Trends Ecol Evol 2006, 21:64-65. 5. Crane PR, Friis EM, Pedersen KR: The origin and early diversification of angiosperm. Nature 1995, 374:27-33. 6. Dressler RL, Dressler K: Phylogeny and classification of the orchid family. Cambridge Univ. Press, Cambridge 1993. 7. Ramirez SR, Gravendeel B, Singer RB, Marshall CR, Pierce NE: Dating the origin of the Orchidaceae from a fossil orchid with its pollinator. Nature 2007, 448:1042-1045. 8. Yu H, Goh CJ: Molecular genetics of reproductive biology in orchids. Plant Physiol 2001, 127:1390-1393. 9. Schiestl FP, Peakall R, Mant JG, Ibarra F, Schulz C, Franke S, Francke W: The chemistry of sexual deception in an orchid-wasp pollination system. Science 2003, 302:437-438. 10. Tsai WC, Hsiao YY, Pan ZJ, Kuoh CS, Chen WH, Chen HH: The role of ethylene in orchid ovule development. Plant Sci 2008, 175:98-105. 11. Peakall R: Speciation in the Orchidaceae: confronting the challenges. Mol Ecol 2007, 16:2834-2837. 12. Tsai WC, Hsiao YY, Lee SH, Tung CW, Wang DP, Wang HC, Chen WH, Chen HH: Expression analysis of the ESTs derived from the flower buds of Phalaenopsis equestris. Plant Sci 2006, 170:426-432. 13. Hsiao YY, Tsai WC, Kuoh CS, Huang TH, Wang HC, Wu TS, Leu YL, Chen WH, Chen HH: Comparison of transcripts in Phalaenopsis bellina and Phalaenopsis equestris (Orchidaceae) flowers to deduce monoterpene biosynthesis pathway. BMC Plant Biol 2006, 6:14. 14. Tan J, Wang HL, Yeh KW: Analysis of organ-specific, expressed genes in Oncidium orchid by subtractive expressed sequence tags library. Biotechnol Lett 2005, 27:1517-1528. 15. Tsai WC, Kuoh CS, Chuang MH, Chen WH, Chen HH: Four DEF-like MADS box genes displayed distinct floral morphogenetic roles in Phalaenopsis orchid. Plant Cell Physiol 2004, 45:831-844. 16. Tsai WC, Lee PF, Chen HI, Hsiao YY, Wei WJ, Pan ZJ, Chuang MH, Kuoh CS, Chen WH, Chen HH: PeMADS6, a GLOBOSA/PISTILLATA-like gene in Phalaenopsis equestris involved in petaloid formation, and correlated with flower longevity and ovary development. Plant Cell Physiol 2005, 46:1125-1139. 17. Hsiao YY, Jeng MF, Tsai WC, Chuang YC, Li CY, Wu TS, Kuoh CS, Chen WH, Chen HH: A novel homodimeric geranyl diphosphate synthase from the orchid Phalaenopsis bellina lacking a DD(X)2-4D motif. Plant J 2008, 55:719-733. 18. Wang CY, Chiou CY, Wang HL, Krishnamurthy R, Venkatagiri S, Tan J, Yeh KW: Carbohydrate mobilization and gene regulatory profile in the pseudobulb of Oncidium orchid during the flowering process. Planta 2008, 227:1063-1077. 19. Kao YY, Chang SB, Lin TY, Hsieh CH, Chen YH, Chen WH, Chen CC: Differential accumulation of heterochromatin as a cause for karyotype variation in Phalaenopsis orchids. Ann Bot-London 2001, 87:387-395. 20. Shultz JL, Kazi S, Bashir R, Afzal JA, Lightfoot DA: The development of

16 BAC-end sequence-based microsatellite markers and placement in the physical and genetic maps of soybean. Theor Appl Genet 2007, 114:1081-1090. 21. Lai CWJ, Yu QY, Hou SB, Skelton RL, Jones MR, Lewis KLT, Murray J, Eustice M, Guan PZ, Agbayani R et al: Analysis of papaya BAC end sequences reveals first insights into the organization of a fruit tree genome. Mol Genet Genomics 2006, 276:1-12. 22. Cheung F, Town CD: A BAC end view of the Musa acuminata genome. BMC Plant Biol 2007, 7:29. 23. Han Y, Korban SS: An overview of the apple genome through BAC end sequence analysis. Plant Mol Biol 2008, 67:581-588. 24. Cavagnaro PF, Chung SM, Szklarczyk M, Grzebelus D, Senalik D, Atkins AE, Simon PW: Characterization of a deep-coverage carrot (Daucus carota L.) BAC library and initial analysis of BAC-end sequences. Mol Genet Genomics 2009, 281:273-288. 25. Wing RA, Ammiraju JS, Luo M, Kim H, Yu Y, Kudrna D, Goicoechea JL, Wang W, Nelson W, Rao K et al: The Oryza map alignment project: the golden path to unlocking the genetic potential of wild rice species. Plant Mol Biol 2005, 59:53-62. 26. Ammiraju JS, Luo M, Goicoechea JL, Wang W, Kudrna D, Mueller C, Talag J, Kim H, Sisneros NB, Blackmon B et al: The Oryza bacterial artificial chromosome library resource: construction and analysis of 12 deep-coverage large-insert BAC libraries that represent the 10 genome types of the genus Oryza. Genome Res 2006, 16:140-147. 27. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998, 8:175-185. 28. Chang CC, Lin HC, Lin IP, Chow TY, Chen HH, Chen WH, Cheng CH, Lin CY, Liu SM, Chaw SM: The chloroplast genome of Phalaenopsis aphrodite (Orchidaceae): comparative analysis of evolutionary rate with that of grasses and its phylogenetic implications. Mol Biol Evol 2006, 23:279-291. 29. Tian X, Zheng J, Yu J: The rice mitochondrial genomes and their variations. Plant Physiol 2006, 140:401-410. 30. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. In: Computer Program. 1996. 31. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110:462-467. 32. Ouyang S, Buell CR: The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res 2004, 32:D360-363. 33. Thiel T, Michalek W, Varshney RK, Graner A: Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet 2003, 106:411-422. 34. Capesius I, Nagl W: Molecular and cytological characteristics of nuclear DNA and chromatin for agniosperm systematics: DNA diversification in the evolution of four orchids. Plant Syst Evol 1978, 129:143-166. 35. Terol J, Naranjo MA, Ollitrault P, Talon M: Development of genomic resources for Citrus clementina: characterization of three deep-coverage BAC libraries and analysis of 46,000 BAC end sequences. BMC Genomics 2008, 9:423. 36. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C et al: The grapevine genome sequence suggests

17 ancestral hexaploidization in major angiosperm phyla. Nature 2007, 449:463-467. 37. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, Lin H, Hamilton J, Haas B, Sultana R, Cheung F et al: The institute for genomic research Osa1 rice genome annotation database. Plant Physiol 2005, 138:18-26. 38. The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408:796-815. 39. International Rice Genome Sequencing Project: The map-based sequence of the rice genome. Nature 2005, 436:793-800.

18 Figure legends

Figure 1. Size distribution of Phalaenopsis equestris bacterial artificial chromosome (BAC) end sequences. The trimmed sequences ranged from 123 to 1,397 bp, with an average edited read length of 821 bp.

Figure 2. Number of P. equestris BAC end sequences (BESs) containing hits in the NCBI database. Of the protein matches, the top BLASTX match was to V. vinifera.

Figure 3. Gene Ontology (GO) analysis of the P. equestris BESs. (a) Cellular component, (b) molecular function, (c) biological process. Among the 641 BESs containing the protein-coding regions, 321 were annotated to the cellular component category, 182 to the molecular function, and 329 to the biological process.

Figure 4. Frequency of AT-rich repeat motif in the nuclear genomes of orchid and selected plant species.

19

Table 1. Statistical analysis of Phalaenopsis equestris bacterial artificial chromosome (BAC) end sequences (BESs). Total number of BESs 5,535 No. of paired BESs 5,360 No. of non-paired BESs 175 Total length (bp) 4,544,250 Minimum length (bp) 123 Maximum length (bp) 1,397 Average length (bp) 821 35.95%[ROUND GC content UP?] Sequence composition Potential transposable elements (%) 1,272 (23.0) Simple sequence repeats (%) 950 (17.2) Protein coding regions (%) 641 (11.6) Chloroplast (%) 29 (0.5) Unknown genomic sequences (%) 2,643 (47.7)

20

Table 2. Number of bacterial artificial chromosome (BAC) end sequences (BESs) containing known repetitive DNA in the Phalaenopsis genome % of BESs with Class, subclass, group No. of BESs repetitive DNA Class I retrotransposons 963 75.7 Ty1-copia 163 12.8 Ty3-gypsy 533 41.9 LINE 95 7.5 SINE 0 0.0 Unclassified retrotransposons 172 13.5 Class II DNA transposons 137 10.8 Ac/Ds 12 0.9 CACTA, En/Spm 67 5.3 Mutator (MULE) 15 1.2 Tourist/Harbinger/Helitron/Mariner 10 0.8 Unclassified Transposons 33 2.6 Miniature inverted-repeat transposable 2 0.2 elements Centromere 19 1.5 rRNA 33 2.6 Unclassified 118 9.3 Total 1272 100

21

Table 3. Microsynteny between Phalaeonpsis and A. thaliana, O. sativa, P. trichocarpa, and V. v i n i f e r a

A. thaliana O. sativa P. trichocarpa V. vinifera

No. of hits 94 83 142 123 Pair ends 12 6 14 12 The same chromosome 11 6 12 10 50- to 300-kb sequence 0 0 1 1a a The whole genome sequence of V. v i n if e r a shows that several contigs are present in each individual chromosome. We cannot determine the exact length between the BAC pair ends on the same chromosome of the grape genome.

22

Table 4. Number of bacterial artificial chromosome (BAC) end sequences (BESs) containing Phalaenopsis chloroplast DNA with hits to the nuclear and chloroplast DNA of A. thaliana, O. sativa, P. trichocarpa, and V. v i n i f e r a A. thaliana O. sativa P. trichocarpa V. vinifera Nuclear DNA 2 24 21 25 Chloroplast DNA 25 24 27 28

23

2000

1600

r 1200

numbe 800

400

0

99 99 99 99 99 99 9 299 9 13 0~1 0~ 0~7 0~8 0~9 10 20 300~399400~499500~5 600~6 70 80 90 1000~10991100~11991200~12991300~ reading length of BAC-end sequences (bp)

Hsu et al. Figure 1

24

Hsu et al. Figure 2

25 (a)

(b)

(c)

Hsu et al. Figure 3

26