Gene discovery using next-generation pyrosequencing to develop ESTs for orchids

Yu-Yun Hsiao1,2, Yun-Wen Chen3, Shi-Ching Huang1, Zhao-Jun Pan1, Wen-Huei Chen2, Wen-Chieh Tsai2,3* and Hong-Hwa Chen1,2,3*

1Department of Life Sciences, 2Orchid Research Center, 3Institute of Tropical Sciences, National Cheng Kung University, Tainan 701, Taiwan

Summary

Orchids represent one of the most diversified groups of angiosperms, but few genomic resources are available for these non-model plants. In addition to its ecological significance, Phalaenopsis is economically important to the floriculture industry worldwide. We applied massively parallel pyrosequencing to explore the transcriptional complexity of Phalaenopsis orchids and present a global characterization of the Phalaenopsis transcriptome using 454 pyrosequencing. To maximize sequence diversity, we pooled RNA extracted from ten samples, including different tissues, various developmental stages, and biotic or abiotic stressed plants (Fig. 1; Table 1). A total of 206,960 expressed sequence tags (ESTs) with an average read length of 228 bp were obtained. These reads were assembled into 8,233 contigs and 34,630 singletons (Tables 2-4). The unigenes were searched against the NCBI non-redundant (NR) protein database (Fig. 2). Based on sequence similarity with known proteins, these analyses identified 22,234 different genes (E-value cutoff, 1e-7; Table 5). Assembled sequences were annotated with Gene Ontology (Fig. 3), Gene Family and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Fig. 4; Table 6). Among these annotations, over 780 unigenes encoding putative transcription factors were identified (Fig. 5).

Pyrosequencing is an effective approach to identify a large set of unigenes from Phalaenopsis. The informative EST dataset developed in this study constitutes a much-needed resource for the discovery of genes involved in various biological processes in Phalaenopsis and other orchid species. These transcribed sequences will narrow the gap between approaches based on model organisms with plentiful genomic resources and species that are important for ecological and evolutionary studies.

Functional annotation of novel transcripts

Fig. 3. Phalaenopsis sequences were classified into sub-categories within the Biological Processes GO category (A), Molecular Functions GO category (B), and Cellular Components GO category (C). [Pie charts]

Table 5. Highly abundant transcripts detected in Phalaenopsis

Putative function                      Organism                 Number of component reads   E-value
Putative P450-like protein precursor   Zea mays                 4705                        5.00E-20
Triple gene block 3                    Cymbidium mosaic virus   4100                        3.00E-22
Cytochrome P450 monooxygenase          Sorghum bicolor          2653                        2.00E-08
LLA-1378                               Lilium longiflorum       2269                        3.00E-28
Hypothetical protein                   Sorghum bicolor          1668                        3.00E-15
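The E-value cutoff mentioned in the Summary can be illustrated with a minimal sketch. It assumes the BLASTX results are available in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11); the poster does not state which output format was used. The function name `best_hits` and the example records are hypothetical and serve only as illustration, and treating the cutoff as inclusive is also an assumption.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-7  # cutoff quoted in the Summary


def best_hits(tabular_blast, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue)} with the lowest E-value per query.

    Assumes 12-column tabular BLAST output; hits whose E-value is above
    the cutoff are discarded (the cutoff itself is treated as passing).
    """
    best = {}
    for row in csv.reader(tabular_blast, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Invented records for illustration only; they are not data from this study.
example = io.StringIO(
    "unigene_1\tprotA\t85.0\t120\t18\t0\t1\t360\t5\t124\t5e-20\t98.2\n"
    "unigene_1\tprotB\t60.0\t110\t44\t0\t1\t330\t9\t118\t2e-08\t55.1\n"
    "unigene_2\tprotC\t41.0\t90\t53\t1\t4\t270\t20\t109\t3e-05\t44.7\n"
    "unigene_3\tprotD\t55.0\t100\t45\t0\t1\t300\t2\t101\t1e-07\t52.0\n"
)
hits = best_hits(example)
for query, (subject, evalue) in sorted(hits.items()):
    print(query, subject, evalue)
```

In this sketch `unigene_2` is dropped because its only hit is weaker than the cutoff, and `unigene_1` keeps the stronger of its two hits.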

Materials

Fig. 1. (A) P. bellina, (B) P. aphrodite subsp. formosana, (C) P. equestris

Table 1. Samples used for transcriptome analysis

Species                  Tissues
Phalaenopsis equestris   inflorescence
                         flower bud
                         root
                         young leaf
                         old leaf
                         cold stressed leaf
                         chrysanthemi-infected leaf
Phalaenopsis aphrodite   protocorm
                         cold night temperature-induced spike
Phalaenopsis bellina     day 5 post anthesis flower

Gene families and Pathways

Table 6. Unigenes mapped in KEGG Pathways

KEGG Pathways / Sub-pathways                     Number of Unigenes   Number of reads
Metabolism                                       6269                 43325
  Glycan Biosynthesis and Metabolism             84                   137
  Xenobiotics Biodegradation and Metabolism      22                   95
  Metabolism of Other Amino Acids                162                  675
  Biosynthesis of Polyketides and Terpenoids     146                  596
  Carbohydrate Metabolism                        1229                 7957
  Overview                                       2360                 19261
  Biosynthesis of Other Secondary Metabolites    179                  1201
  Lipid Metabolism                               450                  1860
  Nucleotide Metabolism                          236                  678
  Metabolism of Cofactors and Vitamins           188                  707
  Amino Acid Metabolism                          639                  2998
  Energy Metabolism                              574                  7160
Genetic Information Processing                   1078                 6944
  Replication and Repair                         149                  384
  Transcription                                  301                  1962
  Folding, Sorting and Degradation               329                  1674
  Translation                                    299                  2951
Organismal Systems                               204                  828
Cellular Processes                               213                  2167
Environmental Information Processing             121                  460

Fig. 4. Graphic representation of the terpenoid backbone biosynthesis pathway, including the cytosolic mevalonate pathway and the plastidial MEP pathway, in Phalaenopsis orchids. The red numbers represent the ESTs homologous to the genes involved in the terpenoid backbone biosynthesis pathway. The first red number in each bracket indicates the number of unigenes corresponding to the catalytic gene in the pathway, and the second number represents the number of reads constituting those unigenes. [Pathway diagram with bracketed labels such as (9/62)]
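The bracket notation described in the Fig. 4 caption (unigenes first, reads second) can be read programmatically. The following is a minimal sketch; the names `PathwayLabel` and `parse_labels` are hypothetical, and the three labels are taken from the figure without their enzyme assignments, which are not recoverable from the text.

```python
import re
from typing import NamedTuple


class PathwayLabel(NamedTuple):
    unigenes: int  # first number in the bracket
    reads: int     # second number in the bracket


# Matches labels of the form "(unigenes/reads)", e.g. "(9/62)".
LABEL_RE = re.compile(r"\((\d+)/(\d+)\)")


def parse_labels(text):
    """Extract every (unigenes/reads) label found in a string."""
    return [PathwayLabel(int(u), int(r)) for u, r in LABEL_RE.findall(text)]


labels = parse_labels("(4/4) (9/62) (2/30)")
for label in labels:
    # Reads per unigene is a rough, illustrative indicator of transcript abundance.
    print(label.unigenes, label.reads, round(label.reads / label.unigenes, 1))
```

A label such as (9/62) therefore means nine unigenes supported by 62 reads for that pathway step.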

Results

Sequencing and assembly of 454 pyrosequenced ESTs

Table 2. Summary of Phalaenopsis EST data

Total Bases                    42,034,787
High-quality Reads             206,960
Average Read Length            228
Number of Contigs              8,233
Average Contig Length          364
Range Contig Length            72 to 4234
Number of Reads in Contigs     172,330
Number of Singletons           34,630
Number of Unigene Sequences    42,863

Table 3. Length distribution of assembled contigs and singletons

Nucleotides length (bp)   Contigs    Singletons
50-99                     24         4,730
100-199                   466        7,410
200-299                   4,422      22,344
300-399                   1,177      146
400-499                   791        1
500-599                   458        0
600-699                   321        0
700-799                   194        0
800-899                   109        0
900-999                   81         0
1,000-1,499               147        0
1,500-1,999               36         0
>2,000                    8          0
Total                     8,233      34,630
Maximum length            4,234 bp   416 bp
Average length            364 bp     201 bp

Table 4. Summary of component reads per assembly

Number of reads   Number of contigs
2 to 10           6,441
11 to 20          1,021
21-30             307
31-40             143
41-50             92
51-100            130
101-150           32
151-200           12
> 200             55

Fig. 1. Comparison of sequences generated by 454 sequencing with Phalaenopsis ESTs published in the NCBI EST database, using BLASTN. [Bar chart; x-axis: E-value, y-axis: Number]

Fig. 2. The number of sequences with hits against the NR database using BLASTX. [Bar chart; x-axis: E-value, y-axis: Number]

Transcription factors

Fig. 5. Number of ESTs related to transcription factors in each transcription factor family. A total of 2,424 putative Oryza sativa subsp. japonica transcription factors were searched against the Phalaenopsis transcriptome, and the target ESTs were then classified into the corresponding transcription factor families. Black bars represent the number of reads in each family. Shaded bars represent the number of unigenes in each family. [Bar chart by transcription factor family]

Conclusions

1. We applied next-generation sequencing to facilitate transcriptome analysis of orchids, which present important biological questions but lack a fully sequenced genome.
2. The findings reported here represent substantial contributions to the publicly accessible expressed sequences for the orchid family. With P. equestris whole-genome sequencing in progress, this large number of ESTs will be a valuable resource for researchers, allowing correction of assemblies, annotation, and construction of gene models that establish accurate exon-intron boundaries.

References

Hsiao YY, Tsai WC, Kuoh CS, Huang TH, Wang HC, Leu YL, Wu TS, Chen WH, Chen HH: Comparison of transcripts in Phalaenopsis bellina and Phalaenopsis equestris (Orchidaceae) flowers to deduce the monoterpene biosynthesis pathway. BMC Plant Biology 2006, 6:14.
Tsai WC, Hsiao YY, Lee SH, Tung CW, Wang DP, Wang HC, Chen WH, Chen HH: Expression analysis of the ESTs derived from the flower buds of Phalaenopsis equestris. Plant Sci 2006, 170:426-432.
Fu CH, Chen YW, Hsiao YY, Pan ZJ, Liu ZJ, Huang YM, Tsai WC, Chen HH: OrchidBase: A collection of sequences of transcriptome derived from orchids. Plant Cell Physiol 2011, In Press.