Transcriptome analysis of drought-tolerant CAM plants, Agave deserti and Agave tequilana

Stephen M. Gross1,2, Jeffrey A. Martin1,2, June Simpson3, Zhong Wang1,2, and Axel Visel1,2

1. DOE Joint Genome Institute, Walnut Creek, CA
2. Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA
3. CINVESTAV, Irapuato, MX

ABSTRACT

Agaves are succulent monocotyledonous plants native to hot and arid environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis) and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits. Here, we present comprehensive, high-quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, we use RNA-seq data to gain insights into biological functions along the A. deserti juvenile leaf proximal-distal axis. Our work presents a foundation for further investigation of agave biology and their improvement for bioenergy development.

CAM PHOTOSYNTHESIS, ARID ENVIRONMENTS, AND BIOENERGY

Agave species are adapted to their native habitat in arid regions of Mexico and the United States. Agave thus holds promise as a biofuel feedstock [1,2], capable of growing on marginal lands where other proposed bioenergy plants cannot. The ability of agaves to withstand hot and arid conditions relies upon crassulacean acid metabolism (CAM), a specialized form of photosynthesis that allows agaves to keep leaf stomata (pores) closed during the hot day, minimizing water loss through evapotranspiration.

FIGURE 1: Agaves and CAM biology
(A) Agave tequilana cultivated in Mexico.
(B) Semi-arid regions of the United States (brown) are unsuitable for cultivation of other bioenergy plants, which require more temperate regions (green). Most Agave species are adapted to semi-arid regions in Mexico and the extreme southwestern USA (purple).
(C) Crassulacean Acid Metabolism (CAM). CO2 enters plant cells at night, joins with a 3-carbon molecule (C3) and is stored in the vacuole as a 4-carbon molecule (C4). During the day, C4 molecules diffuse out of the vacuole, and CO2 is released and assimilated into sugar in the chloroplast.

                 Inputs                                      Outputs
Feedstock        Water        Drought     Nitrogen           Dry biomass        Ethanol
                 (cm yr^-1)   tolerance   (kg ha^-1 yr^-1)   (Mg ha^-1 yr^-1)   (liters yr^-1)
Corn grain       50–80        low         90–120             7–10               2900
Corn stover                                                  3–6                900
Miscanthus       75–120       low         0–15               15–40              4600–12,400
Poplar coppice   70–105       moderate    0–50               5–11               1500–3400
Agave spp.       30–80        high        0–12               10–34              3000–10,500

Comparison of inputs (water and nitrogen) and outputs (biomass and ethanol) of agaves and other biofuel feedstock species. Though agaves are harvested at several years of age, their annualized growth rate is on par with Miscanthus. Table is modified from reference [2].

AGAVE TRANSCRIPTOME ASSEMBLIES FROM DEEP RNA-seq

To provide sequence resources for the Agave research community, we built de novo transcriptomes of Agave tequilana and Agave deserti from deep Illumina RNA-seq data. Sequences were assembled by Rnnotator [3], a de novo transcriptome assembly pipeline.

FIGURE 2: A. tequilana, A. deserti, and their respective transcriptomes
(A) Cultivated A. tequilana in Jalisco, Mexico.
(B) A. deserti (foreground) in natural habitat, Riverside County, California, USA.
(C) Plot of the fraction of unique 25-mers over indicated read depth (log2 scale).
(D) Density plot of GC content of agave transcript contigs vs. contigs from contamination and commensal organisms.
(E) Density plots of A. deserti and A. tequilana transcript lengths. Note log scale. Peaks at 150 and 250 nt represent single reads or paired-end reads, respectively, that were not assembled into larger contigs.
(F) Density plot of locus RPKM values for coding (dark shading) and non-coding (light shading) loci.
[Figure 2 panel labels: (C) x-axis: number of reads (2^10 to 2^30); y-axis: probability of observing a unique 25-mer. (D) x-axis: contig GC content; y-axis: density; species dataset: A. tequilana contigs, A. deserti contigs, A. tequilana removed contigs, A. deserti removed contigs. (E) x-axis: transcript length (nt); y-axis: density. (F) x-axis: locus RPKM; locus coding potential: coding, non-coding.]

OVERVIEW OF AGAVE TRANSCRIPTOME ASSEMBLIES

Species        Total        No. of    No. transcript   N50 length   Sum assembled   No. protein-coding
               sequencing   loci      contigs                       length          loci
A. tequilana   293.5 Gbp    139,525   204,530          1387 bp      204.9 Mbp       34,870
A. deserti     184.7 Gbp    88,718    128,869          1323 bp      125.0 Mbp       35,086

COMPARISON OF AGAVE DE NOVO ASSEMBLIES

Analysis of assembled contigs suggests the Agave de novo assemblies are comprehensive and accurate.

FIGURE 3: Comparison of the de novo Agave transcriptome assemblies
(A) Comparisons of the A. tequilana de novo Rnnotator assembly to error-corrected Pacific Biosciences subreads, 82 GenBank A. tequilana sequences, and an additional A. tequilana dataset from McKain et al. 2012 [4].
(B) Comparisons between the A. tequilana and A. deserti de novo Rnnotator assemblies.
(C) Histograms of the fraction of aligned sequence lengths between A. deserti and A. tequilana.
Symbol || separates query sequence dataset from subject sequence dataset. Total number of sequences (n) is noted in each bar chart; total number of sequences in each alignment class is noted above the bar.
[Figure 3 panel labels: (A) PacBio subreads || Rnnotator (n = 4767), GenBank || Rnnotator (n = 82), McKain et al. || Rnnotator (n = 12,972). (B) A. tequilana || A. deserti (n = 204,530), A. deserti || A. tequilana (n = 128,959). Class descriptions: containment, overlap, sequences unaligned. Alignment properties: indels, no indels, unaligned. (A, B) y-axis: number of transcripts. (C) x-axis: fraction of transcript aligning to the other species' transcript; y-axis: number of transcripts (thousands).]

PROTEOMIC ANALYSES SUPPORT COMPREHENSIVE AGAVE TRANSCRIPTOME ASSEMBLIES

Proteome comparisons between Agave species and additional monocot species suggest the majority of Agave proteins are conserved across taxa. We can also identify protein families specific to agaves.

FIGURE 4: Proteomic comparison of agaves to other plant species
(A) Venn diagram of BLASTP-based one-to-one reciprocal best hit proteins shared between A. deserti and A. tequilana.
(B) Venn diagram of OrthoMCL-defined protein families shared between agaves.
(C) Edwards-Venn diagram of OrthoMCL-defined plant orthologous-group protein families (Plant OGs) shared between agave and 4 additional monocotyledonous plant species. Shape and color used for each species is at the right with the total number of Plant OGs within each species.
[Figure 4 panel labels: (A) BLAST one-to-one RBH protein comparison; values shown: 20,161; 14,709; 20,377. (B) OrthoMCL protein family comparison; values shown: 2835; 13,388; 2948. (C) OrthoMCL Plant OGs for A. deserti, A. tequilana, B. distachyon, O. sativa, S. bicolor, and Z. mays.]

PROFILING OF THE A. DESERTI LEAF HIGHLIGHTS REGIONS CRITICAL TO DEVELOPMENT AND PHOTOSYNTHESIS

Agaves spend the majority of their lives as compact rosettes, thus leaves are important organs in which to study Agave developmental and bioenergetic processes.

FIGURE 5: Transcriptomic analysis of the A. deserti leaf proximal-distal axis
(A) One of the A. deserti leaves used for analysis, indicating proximal-distal (PD) sections 1–4, from proximal (base) to distal (tip).
(B) Six major K-means clusters of gene expression along the PD axis. Clusters are manually grouped by highest expression in proximal, medial, or distal tissues. Blue lines connect mean z-scaled RPKM values, shaded areas represent the 25th and 75th percentiles, red lines indicate standard error at each mean. Green text beneath each cluster denotes the description of the most significantly enriched GO term in each cluster.
(C, D) Heatmaps of composite gene expression for indicated biological processes along the leaf PD axis.
[Figure 5 panel labels: (B) 79645 of 88718 loci clustered. Cluster A (22249 loci): cellular protein modification; Cluster B (12063 loci): vesicle-mediated transport; Cluster C (11789 loci): DNA metabolism; Cluster D (7539 loci): regulation of gene expression; Cluster E (17579 loci): translation; Cluster F (8426 loci): photosynthesis. x-axis: PD sections 1–4; y-axis: z-scaled RPKM. (C) Photosynthesis and cell wall & stomata development processes. (D) Transcription factor families (KNOX, MADS-box, GRAS, YABBY, MYB, bHLH, Zn finger) and hormones (auxin, ABA, BR, CK, GA, ETH). Color scale: relative composite RPKM value normalized to section with highest expression.]

ACKNOWLEDGEMENTS AND CITATIONS

This work performed at the U.S. Department of Energy Joint Genome Institute was supported in part by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH112.

[1] Davis, A. S. et al. The global potential for Agave as a biofuel feedstock. GCB Bioenergy 3, 68–78 (2011).
[2] Somerville, C. et al. Feedstocks for lignocellulosic biofuels. Science 329, 790–792 (2010).
[3] Martin, J. et al. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-seq reads. BMC Genomics 11, 663 (2010).
[4] McKain, M. et al. Phylogenomic analysis of transcriptome data elucidates co-occurrence of a paleopolyploid event and the origin of bimodal karyotypes in Agavoideae (Asparagaceae). Am J Bot 99(2), 397–406 (2012).
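
The overview table reports an N50 length for each assembly. The poster does not say how those values were computed, so the following is only a minimal sketch under the usual definition of N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the summed assembly length. The function name `n50` and the toy contig lengths are hypothetical and are not data from this work.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Assumes the common definition: sort contigs from longest to shortest
    and report the length at which the running sum first reaches half of
    the total assembled length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths for illustration only (not poster data).
toy_lengths = [150, 250, 900, 1300, 1400, 2100]
print(n50(toy_lengths))
```

In practice the lengths would be read from the assembled contig FASTA file. Many short unassembled reads, such as the 150 and 250 nt peaks noted in Figure 2E, add to the total length and can therefore lower the N50.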