bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 The draft genome sequence of herbaceous diploid

2 Raddia distichophylla

3 4 Wei Li,*,1 Cong Shi,†,1 Kui Li,†,1 Qun-jie Zhang,* Yan Tong,† Yun Zhang,† Jun Wang,‡ 5 Lynn Clark, §,2 and Li-zhi Gao*,† 2 6 7 *Institution of Genomics and Bioinformatics, South China Agricultural University, 8 Guangzhou 510642, China, † Germplasm and Genomics Center, Germplasm Bank 9 of Wild Species in Southwestern China, Kunming Institute of Botany, Chinese Academy 10 of Sciences, Kunming 650204, China, ‡Southwest China Forestry University, Kunming 11 650224, China, and §Ecology, Evolution and Organismal Biology, Iowa State University, 12 251 Bessey Hall, Ames, 50011-1020, IA, USA. 13 14 1These authors contributed equally to this work. 15 2Correspond author: E-mail address: [email protected] & [email protected] 16

1 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

17 ABSTRACT

18 are important non-timber forest widely distributed in the tropical and 19 subtropical regions of Asia, Africa, America, and Pacific islands. They comprise the 20 Bambusoideae in the grass family (), including approximately 1,700 described 21 species in 127 genera. In spite of the widespread uses of bamboo for food, construction 22 and bioenergy, the gene repertoire of bamboo still remains largely unexplored. Raddia 23 distichophylla (Schrad. ex Nees) Chase, belonging to the tribe (Bambusoideae, 24 Poaceae), is diploid herbaceous bamboo with only slightly lignified stems. In this study, 25 we report a draft genome assembly of the approximately ~589 Mb whole-genome 26 sequence of R. distichophylla with a contig N50 length of 86.36 Kb. Repeated sequences 27 account for ~49.08% of the genome, of which LTR retrotransposons occupy ~35.99% of 28 whole genome. A total of 30,763 protein-coding genes were annotated in the R. 29 distichophylla genome with an average transcript size of 2,887 bp. Access to this 30 herbaceous bamboo genome sequence will provide novel insights into biochemistry, 31 molecular-assisted breeding programs and germplasm conservation for bamboo species 32 world-wide.

33

34 KEYWORDS 35 Bamboos, Raddia distichophylla, whole-genome sequence

36

37

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

38 INTRODUCTION

39 Bamboos are important non-timber forest plants with a wide native geographic 40 distribution in tropical, subtropical and temperate regions except for Europe and 41 Antarctica (BambooPhylogenyGroup, 2012). Bamboos are of notable economic and 42 cultural significance worldwide, and can be used as food, bioenergy and building 43 materials. Bamboos comprise the Bambusoideae in grass family (Poaceae), including 44 approximately 1,700 described species in 127 genera (Clark and Oliveira, 2018; Soreng 45 et al., 2017; Vorontsova et al., 2016). Molecular phylogenetic analysis shows that 46 Bambusoideae falls into the Bambusoideae-Oryzoideae-Pooideae (BOP) clade and is 47 phylogenetically sister to Pooideae (Saarela et al., 2018). Bambusoideae may be divided 48 into two morphologically distinct growth forms: woody bamboos and herbaceous 49 bamboos (tribe Olyreae); woody bamboos can be further divided into two lineages: 50 temperate woody (Arundinarieae) and tropical woody (Bambuseae) 51 (BambooPhylogenyGroup, 2012; Kelchner and Group, 2013; Sungkaew et al., 2009). 52 Although the phylogenetic relationship between the two woody bamboo lineages is 53 still controversial (BambooPhylogenyGroup, 2012; Bouchenak-Khelladi et al., 2008; 54 Clark et al., 2015; GrassPhylogenyWorkingGroup et al., 2001; Sungkaew et al., 2009; 55 Triplett et al., 2014; Wysocki et al., 2016), their shared phenology and sexual systems are 56 suggestive of common ancestry (Wysocki et al., 2016). Woody bamboos form highly 57 lignified, usually hollow culms, complex rhizome systems, specialized culm leaves, 58 complex vegetative branching, and outer ligules on the foliage leaves 59 (BambooPhylogenyGroup, 2012; Clark et al., 2015). Woody bamboos possess bisexual 60 flowers and flower at intervals between 7 and 120 years usually followed by a die-off 61 (BambooPhylogenyGroup, 2012; Dransfield and Widjaja, 1995; Judziewicz et al., 1999). 62 All woody bamboos with known chromosome counts are uniformly polyploidy 63 (Soderstrom, 1981). 64 The tribe Olyreae comprises 22 genera and 124 described species native to tropical 65 America except Buergersiochloa and latifolia (BambooPhylogenyGroup, 2012; 66 Clark et al., 2015). Herbaceous bamboos are characterized by usually weakly developed 67 rhizomes and less lignification in the culms. Culm leaves and foliage leaves with the 68 outer ligule are absent in herbaceous bamboos. In contrast to woody bamboos,

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

69 herbaceous bamboos have at least functionally unisexual flowers and they flower 70 annually or seasonally for extended periods (Clark et al., 2015; Gaut et al., 1997; 71 Kelchner and Group, 2013; Oliveira et al., 2014; Wysocki et al., 2015). The tribe is 72 fundamentally diploid, but chromosome counts indicating tetraploidy or hexaploidy or 73 possibly even octoploidy (in Eremitis ) are available (Judziewicz et al., 1999; 74 Soderstrom, 1981). 75 Since the first comparative DNA sequence analysis of bamboos by Kelchner and 76 Clark (Kelchner and Clark, 1997), a number of studies have been further carried out 77 combined various biotechnologies (Das et al., 2005; Gui et al., 2010; Oliveira et al., 2014; 78 Peng et al., 2013; Sharma et al., 2008; Sungkaew et al., 2009; Wysocki et al., 2016; 79 Zhang et al., 2011). Peng et al. (Peng et al., 2013) reported the first draft genome of 80 tetraploid moso bamboo (Phyllostachys edulis). Recently, Guo et al. (Guo et al., 2019) 81 have released four draft genomes of bamboo lineages, including O. latifolia, R. 82 guianensis, Guadua angustifolia, and Bonia amplexicaulis. The lack of a high-quality 83 genome sequence for a diploid bamboo has been an impediment to our understanding of 84 bamboo biology and evolution. 85 R. distichophylla (Schrad. ex Nees) Chase, a diploid herbaceous bamboo, is almost 86 completely restricted to the forests of eastern (Oliveira et al., 2014). In this study, 87 we generate the complete genome sequence of R. distichophylla through the Illumina 88 sequencing platform. The availability of the fully sequenced and annotated genome will 89 provide functional, ecological and evolutionary insights into the bamboo species.

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

90 METHODS AND MATERIALS

91 Sample collection, total DNA and RNA extraction and sequencing 92 The source plant was an individual of R. distichophylla grown in cultivation at the R.W. 93 Pohl Conservatory, Iowa State University. Fresh and healthy leaves were harvested and 94 immediately frozen in liquid nitrogen, followed by storage at -80°C in the laboratory 95 prior to DNA extraction. A modified CTAB method (Porebski et al., 1997) was used to 96 extract high-quality genomic DNA. The quantity and quality of the extracted DNA were 97 examined using a NanoDrop D-1000 spectrophotometer (NanoDrop Technologies, 98 Wilmington, DE) and electrophoresis on a 0.8% agarose gel, respectively. A total of 9 99 paired-end sequencing libraries, spanning 180, 300, 500, 2 000, 5 000, 10 000 and 20 000 100 bp, were prepared and sequenced on Illumina HiSeq 2000 platform. 101 Total RNA was extracted from four tissues (root, stem, young leaf and female 102 inflorescence), using a Water Saturated Phenol method. RNA libraries were built using 103 the Illumina RNA-Seq kit (mRNA-Seq Sample Prep Kit P/N 1004814). The extracted 104 RNA was quantified using NanoDrop-1000 UV-VIS spectrophotometer (NanoDrop), and 105 RNA integrity was checked using Agilent 2100 Bioanalyzer (Agilent Technologies, Palo 106 Alto, CA, USA). For each tissue, only total RNAs with a total amount ≥ 15 μg with a 107 concentration ≥ 400 ng/μl, RNA integrity number (RIN) ≥ 7, and rRNA ratio ≥ 1.4 were 108 used for constructing cDNA library according to the manufacturer’s instruction (Illumina, 109 USA). The libraries were then sequenced (100 nt, pair-end) with Illumina HiSeq 2000 110 platform. 111 112 De novo assembly of the R. distichophylla genome 113 Two orthogonal methods were used to estimate the genome size of R. distichophylla, 114 including k-mer frequency distribution and flow cytometric analysis. Firstly, we 115 generated the 17-mer occurrence distribution of sequencing reads using GCE v1.0.0 (Liu 116 et al., 2013), and the genome size was then calculated with the equation NLK(   1) 117 G  , where G represents the genome size; N is the total number of reads LD 118 used; L is the average length of reads; K is set to 17. Secondly, the genome size was 119 further estimated and validated using flow cytometry analysis. We employed the rice

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

120 cultivar Nipponbare as an inner standard with an estimated genome size of 389 Mb 121 (IRGSP., 2005). 122 Paired-end sequencing reads were processed to remove adaptor and low quality 123 sequences using Trimmomatic v0.33 (Bolger et al., 2014). Reads were retained only if 124 both paired reads passed quality control filtering. We assembled the WGS sequencing 125 reads using Platanus v1.2.1 software (Rei Kajitani, 2014), which is optimized for highly 126 heterozygous diploid genomes. First, the high-quality paired-end Illumina reads from

127 short-insert size libraries (≤500 bp) were assembled into contig sequences using

128 Platanus. The assembled contigs were then scaffolded using SSPACE v3.0 (Boetzer et al., 129 2011) using Illumina mate pair data. GapCloser v1.12 (Li et al., 2010) was used to fill 130 gaps within the scaffolds. 131 The three methods were employed to assess the quality and completeness of the R. 132 distichophylla genome. First, high-quality reads from short-insert size libraries were 133 mapped to the genome assembly using BWA v0.7.15 (Li and Durbin, 2009). Second, the 134 genome assembly was checked with benchmarking universal single-copy orthologs 135 (BUSCO) (Simao et al., 2015). Third, the RNA sequencing reads generated in this study 136 were assembled using Trinity v r20131110 (Grabherr et al., 2011), the assembled 137 transcripts were then aligned back to the assembled genome using GMAP v2014-10-2 138 (Wu and Watanabe, 2005) at 30% coverage and 80% identity thresholds. 139 140 Annotation of repetitive sequences and non-coding RNA genes 141 We used a combination of homology-based and de novo approaches to identity the 142 repetitive sequences in the R. distichophylla genome. RepeatModeler v1.0.10 143 (Tarailo-Graovac and Chen, 2009), which included two de novo repeat finding programs, 144 RECON (Bao and Eddy, 2002) and RepeatScout (Price et al., 2005), was used for the 145 construction of the repeat library. This produced library, along with the Poaceae repeat 146 library, were used as the reference database for RepeatMasker (Tarailo-Graovac and Chen, 147 2009). Simple sequence repeats (SSRs) were identified in the genome sequence using the 148 MISA perl script (Thiel et al., 2003) with the default settings: monomer (one nucleotide,

149 n ≥ 12), dimer (two nucleotides, n ≥ 6), trimer (three nucleotides, n ≥ 4), tetramer

150 (four nucleotides, n ≥ 3), pentamer (five nucleotides, n ≥ 3), and hexamer (six

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

151 nucleotidess, n ≥ 3).

152 Non-coding RNAs genes play important roles in many cellular processes. The five 153 different types of non-coding RNA genes, namely transfer RNA (tRNA) genes, ribosomal 154 RNA (rRNA) genes, small nucleolar RNA (snoRNAs) genes, small nuclear RNA 155 (snRNAs) genes and microRNA (miRNAs) genes, were predicted using various de novo 156 and homology search methods. We used tRNAscan-SE algorithms (version 1.23) (Lowe 157 and Eddy, 1997) with default parameters to identify the tRNA genes. The rRNA genes 158 (8S, 18S, and 28S), which is the RNA component of the ribosome, were predicted by 159 using RNAmmer algorithms (v1.2) (Lagesen et al., 2007) with default parameters. The 160 snoRNA genes were annotated using snoScan v1.0 (Lowe and Eddy, 1999) with the yeast 161 rRNA methylation sites and yeast rRNA sequences provided by the snoScan distribution. 162 The snRNA genes were identified by INFERNAL software (v1.1.2) (Nawrocki et al., 163 2009) against the Rfam database (release 9.1) with default parameters. 164 165 Genome annotation 166 The gene prediction pipeline combined the de novo method, the homology-based method 167 and the EST-aided method. Augustus v2.5.5 (Stanke et al., 2004) and Fgenesh (Salamov 168 and Solovyev, 2000) were used to perform the de novo prediction. The protein sequences 169 of moso bamboo, stiff brome, barley, maize, Oropetium thomaeum, foxtail millet, rice 170 and sorghum were mapped to the genome using Exonerate (Slater and Birney, 2005). To 171 further aid the gene annotation, Illumina RNA-seq reads were assembled using the 172 Trinity software (v20131110) (Grabherr et al., 2011). The resulting transcripts were then 173 aligned to the soft-masked genome assembly using GMAP v2014-10-2 (Wu and 174 Watanabe, 2005) and BLAT v35 (Kent, 2002). The potential gene structures were derived 175 using PASA v20130907 (Program to Assemble Spliced Alignments) (Haas et al., 2003). 176 All gene models produced by the de novo, homology-based and EST-aided methods were 177 integrated using GLEAN (Elsik et al., 2007). 178 The predicted genes were searched against Swiss-Prot database (Boeckmann et al., 179 2003) using BLASTP (e-value cutoff of 10−5). The motifs and domains within gene 180 models were identified by InterProScan (Jones et al., 2014). Gene Ontology terms and 181 KEGG pathway for each gene were retrieved from the corresponding InterPro entry.

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

182 Gene functions were also assigned to TrEMBL (Boeckmann et al., 2003) database using 183 BLASP with an e-value threshold of 10−5. 184 185 Data availability 186 All sequencing reads have been deposited in the NCBI Sequence Read Archive 187 SRR8759078 to SRR8759084 (2019) and BIG Genome Sequence Archive CRR049770 188 to CRR049776 (2019).The assembled genome sequence is available at the NCBI and 189 BIG Genome Warehouse under accession number SPJY00000000 and 190 GWHAAKD00000000, respectively. Table S1 shows the whole genome sequencing 191 (WGS) reads used to assemble the R. distichophylla genome. Table S2 shows the 192 summary of RNA sequencing (RNA-Seq) of R. distichophylla. Table S3 shows the 193 estimation of genome size based on K-mer analysis. Table S4 shows the summary of 194 genome assembly. Table S5 shows the validation of the R. distichophylla genome 195 assembly using reads mapping BUSCO, and transcript alignments. Table S6 shows the 196 statistics of repeat sequences in the R. distichophylla and moso bamboo genomes. Table 197 S7 shows the summary of types and number of simple sequence repeats in the R. 198 distichophylla and moso bamboo genomes. Table S8 shows the non-coding RNA genes in 199 the R. distichophylla genome. Table S9 shows the statistics of predicted protein-coding 200 genes in the R. distichophylla genome. Table S10 shows the functional annotation of the 201 R. distichophylla protein-coding genes. 202

203 RESULTS AND DISCUSSION

204 We performed whole-genome sequencing with the Illumina sequencing platform. A total 205 of 253.94 Gb short sequencing reads were generated (~272.89-fold coverage) (Table S1). 206 A total of 21.99 Gb RNA-seq data was obtained from root, stem, young leaf and female 207 inflorescence (Table S2). Based on the K-mer analysis, we estimated the genome size of 208 R. distichophylla to be ~608 Mb (Table S3, Fig. 1). Flow cytometry analysis estimated 209 the genome size of R. distichophylla to be ~589 Mb, which is close to the obtained result 210 from the k-mer analysis. The final assembly amounted to ~580.85 Mb, 95.56% of the 211 estimated genome size. The N50 lengths of the assembled contigs and scaffolds were 212 ~86.36 Kb and ~1.81 Mb, respectively (Table 1, Table S4).

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

213 To validate the genome assembly quality, we first mapped ~211 M of high-quality 214 reads to the genome sequences. Our results revealed that nearly 89.25% Illumina reads 215 were mapped to the genome assembly (Table S5); second, BUSCO was used to assess 216 the completeness of the genome assembly. The percentage of completeness for our 217 assembly was 92.08 in the Embryophyta lineage (Table S5); and finally, we mapped the 218 assembled transcripts to the genome sequences. Approximately 78.04% of the transcripts 219 could be mapped to the genome (Table S5). 220 The annotation of repeat sequences showed that approximately 49.08% of the R. 221 distichophylla genome consists of transposable elements (TEs), lower than the amount 222 (63.15%) annotated in the moso bamboo genome (Peng et al., 2013) with the same 223 methods (Table S6, Fig. 2). LTR retrotransposons were the most abundant TE type, 224 occupying roughly 35.99% of the R. distichophylla genome. In total, 220,737 and 225 496,819 SSRs were found in the R. distichophylla and moso bamboo genome, 226 respectively, with trimer and tetramer as the most abundant SSR types (Table S7). 227 Among the trimer motifs, (CCG/GGC)n were the predominant repeat in R. distichophylla, 228 whereas (CCG/GGC)n, (AGG/CCT)n and (AAG/CCT)n showed a similar proportions in 229 the moso bamboo genome (Fig. 3). The identified SSRs will provide valuable molecular 230 resources for germplasm characterization and genomics-based breeding programs. In 231 total we identified 727 tRNA genes, 90 rRNA genes, 242 snoRNA genes, 127 snRNA 232 genes, and 256 miRNA genes, respectively (Table 1, Table S8). 233 In combination with ab initio prediction, protein and expressed sequence tags (ESTs) 234 alignments, we generated a gene set consisted of 30,763 protein-coding genes (Table 1, 235 Table S9), with an average length of 2,887 bp and an average coding sequence length of 236 1,099 bp (Table S9, Fig. 4). Among these genes, 88.85% had significant similarities to 237 sequences in the public databases (Table S10). 238

239 ACKNOWLEDGEMENTS

240 This work was supported by Yunnan Innovation Team Project and the start-up grant from 241 South China Agricultural University (to L. G.). 242 243

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

244 LITERATURE CITED 245 BambooPhylogenyGroup (2012). An updated tribal and subtribal classification for the Bambusoideae 246 (Poaceae). In Proceedings of the 9th World Bamboo Congress, pp. 10-15. 247 Bao, Z., and Eddy, S.R. (2002). Automated de novo identification of repeat sequence families in sequenced 248 genomes. Genome Res 12, 1269-1276. 249 Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., 250 Michoud, K., O'Donovan, C., Phan, I., et al. (2003). The SWISS-PROT protein knowledgebase and its 251 supplement TrEMBL in 2003. Nucleic Acids Res 31, 365-370. 252 Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D., and Pirovano, W. (2011). Scaffolding pre-assembled 253 contigs using SSPACE. Bioinformatics 27, 578-579. 254 Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence 255 data. Bioinformatics 30, 2114-2120. 256 Bouchenak-Khelladi, Y., Salamin, N., Savolainen, V., Forest, F., van der Bank, M., Chase, M.W., and 257 Hodkinson, T.R. (2008). Large multi-gene phylogenetic trees of the grasses (Poaceae): progress towards 258 complete tribal and generic level sampling. Molecular phylogenetics and evolution 47, 488-505. 259 Clark, L., Londoño, X., and Ruiz-Sanchez, E. (2015). Bamboo and habitat. In Bamboo 260 (Springer), pp. 1-30. 261 Clark, L., and Oliveira, R. (2018). Diversity and evolution of the New World bamboos (Poaceae: 262 Bambusoideae: Bambuseae, Olyreae). In Proceedings of the 11th World Bamboo Congress, Xalapa, 263 Mexico, pp. 35-47. 264 Das, M., Bhattacharya, S., and Pal, A. (2005). Generation and characterization of SCARs by cloning and 265 sequencing of RAPD products: a strategy for species-specific marker development in bamboo. Annals of 266 botany 95, 835-841. 267 Dransfield, S., and Widjaja, E.A. (1995). Plant resources of south-east Asia. No. 7: Bamboos. 268 Elsik, C.G., Mackey, A.J., Reese, J.T., Milshina, N.V., Roos, D.S., and Weinstock, G.M. (2007). Creating a 269 honey bee consensus gene set. Genome Biol 8, R13. 270 Gaut, B.S., Clark, L.G., Wendel, J.F., and Muse, S.V. (1997). Comparisons of the molecular evolutionary 271 process at rbcL and ndhF in the grass family (Poaceae). Molecular Biology and Evolution 14, 769-777. 272 Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., 273 Raychowdhury, R., Zeng, Q., et al. (2011). Full-length transcriptome assembly from RNA-Seq data without 274 a reference genome. Nat Biotechnol 29, 644-652. 275 GrassPhylogenyWorkingGroup, Barker, N.P., Clark, L.G., Davis, J.I., Duvall, M.R., Guala, G.F., Hsiao, C., 276 Kellogg, E.A., Linder, H.P., and Mason-Gamer, R.J. (2001). Phylogeny and subfamilial classification of the 277 grasses (Poaceae). Annals of the Missouri Botanical Garden, 373-457. 278 Gui, Y.J., Zhou, Y., Wang, Y., Wang, S., Wang, S.Y., Hu, Y., Bo, S.P., Chen, H., Zhou, C.P., and Ma, N.X. 279 (2010). Insights into the bamboo genome: syntenic relationships to rice and sorghum. Journal of integrative 280 plant biology 52, 1008-1015.

10 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

281 Guo, Z.-H., Ma, P.-F., Yang, G.-Q., Hu, J.-Y., Liu, Y.-L., Xia, E.-H., Zhong, M.-C., Zhao, L., Sun, G.-L., and 282 Xu, Y.-X. (2019). Genome sequences provide insights into the reticulate origin and unique traits of woody 283 bamboos. Molecular plant. 284 Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, 285 C.M., Rusch, D.B., Town, C.D., et al. (2003). Improving the Arabidopsis genome annotation using 286 maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654-5666. 287 IRGSP. (2005). The map-based sequence of the rice genome. Nature 436, 793-800. 288 Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., 289 Nuka, G., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 290 1236-1240. 291 Judziewicz, E.J., Clark, L.G., Londoño, X., and Stern, M.J. (1999). American bamboos (Smithsonian 292 Institution Press). 293 Kelchner, S.A., and Clark, L.G. (1997). Molecular Evolution and Phylogenetic Utility of the 294 Chloroplastrpl16Intron inChusqueaand the Bambusoideae (Poaceae). Molecular Phylogenetics and 295 Evolution 8, 385-397. 296 Kelchner, S.A., and Group, B.P. (2013). Higher level phylogenetic relationships within the bamboos 297 (Poaceae: Bambusoideae) based on five plastid markers. Molecular Phylogenetics and Evolution 67, 298 404-413. 299 Kent, W.J. (2002). BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664. 300 Lagesen, K., Hallin, P., Rodland, E.A., Staerfeldt, H.H., Rognes, T., and Ussery, D.W. (2007). RNAmmer: 301 consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35, 3100-3108. 302 Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. 303 Bioinformatics 25, 1754-1760. 304 Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., et al. (2010). The 305 sequence and de novo assembly of the giant panda genome. Nature 463, 311-317. 306 Liu, B., Shi, Y., Yuan, J., Hu, X., and Wei, F. (2013). Estimation of genomic characteristics by analyzing 307 k-mer frequency in de novo genome projects. Quantitative Biology 35, 62-67. 308 Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA 309 genes in genomic sequence. Nucleic Acids Res 25, 955-964. 310 Lowe, T.M., and Eddy, S.R. (1999). A computational screen for methylation guide snoRNAs in yeast. 311 Science 283, 1168-1171. 312 Nawrocki, E.P., Kolbe, D.L., and Eddy, S.R. (2009). Infernal 1.0: inference of RNA alignments. 313 Bioinformatics 25, 1335. 314 Oliveira, R.P., Clark, L.G., Schnadelbach, A.S., Monteiro, S.H., Borba, E.L., Longhi-Wagner, H.M., and 315 van den Berg, C. (2014). A molecular phylogeny of Raddia and its allies within the tribe Olyreae (Poaceae, 316 Bambusoideae) based on noncoding plastid and nuclear spacers. Molecular Phylogenetics and Evolution 78, 317 105-117.

11 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

318 Peng, Z., Lu, Y., Li, L., Zhao, Q., Feng, Q., Gao, Z., Lu, H., Hu, T., Yao, N., Liu, K., et al. (2013). The draft 319 genome of the fast-growing non-timber forest species moso bamboo (Phyllostachys heterocycla). Nat 320 Genet 45, 456-461, 461e451-452. 321 Porebski, S., Bailey, L.G., and Baum, B.R. (1997). Modification of a CTAB DNA extraction protocol for 322 plants containing high polysaccharide and polyphenol components. Plant Molecular Biology Reporter 15, 323 8-15. 324 Price, A.L., Jones, N.C., and Pevzner, P.A. (2005). De novo identification of repeat families in large 325 genomes. Bioinformatics 21 Suppl 1, i351-358. 326 Rei Kajitani, K.T., Hideki Noguchi, Atsushi Toyoda, Yoshitoshi Ogura, Miki Okuno, Mitsuru Yabana, 327 Masayuki Harada, Eiji Nagayasu, Haruhiko Maruyama, Yuji Kohara, Asao Fujiyama, Tetsuya Hayashi, 328 Takehiko Itoh (2014). Efficient de novo assembly of highly heterozygous genomes from whole-genome 329 shotgun short reads. Genome Research 24, 1384-1395. 330 Saarela, J.M., Burke, S.V., Wysocki, W.P., Barrett, M.D., Clark, L.G., Craine, J.M., Peterson, P.M., Soreng, 331 R.J., Vorontsova, M.S., and Duvall, M.R. (2018). A 250 plastome phylogeny of the grass family (Poaceae): 332 topological support under different data partitions. PeerJ 6, e4299. 333 Salamov, A.A., and Solovyev, V.V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome 334 Res 10, 516-522. 335 Sharma, R., Gupta, P., Sharma, V., Sood, A., Mohapatra, T., and Ahuja, P.S. (2008). Evaluation of rice and 336 sugarcane SSR markers for phylogenetic and genetic diversity analyses in bamboo. Genome 51, 91-103. 337 Simao, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov, E.M. (2015). BUSCO: 338 assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 339 3210-3212. 340 Slater, G.S., and Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. 341 BMC Bioinformatics 6, 31. 342 Soderstrom, T.R. (1981). Some Evolutionary Trends in the Bambusoideae (Poaceae). Annals of the 343 Missouri Botanical Garden 68, 15-47. 344 Soreng, R.J., Peterson, P.M., Romaschenko, K., Davidse, G., Teisher, J.K., Clark, L.G., Barberá, P., 345 Gillespie, L.J., and Zuloaga, F.O. (2017). A worldwide phylogenetic classification of the Poaceae 346 (Gramineae) II: An update and a comparison of two 2015 classifications. Journal of Systematics and 347 Evolution 55, 259-290. 348 Stanke, M., Steinkamp, R., Waack, S., and Morgenstern, B. (2004). AUGUSTUS: a web server for gene 349 finding in eukaryotes. Nucleic Acids Res 32, W309-312. 350 Sungkaew, S., Stapleton, C.M., Salamin, N., and Hodkinson, T.R. (2009). Non-monophyly of the woody 351 bamboos (Bambuseae; Poaceae): a multi-gene region phylogenetic analysis of Bambusoideae ss. Journal of 352 plant research 122, 95. 353 Tarailo-Graovac, M., and Chen, N. (2009). Using RepeatMasker to identify repetitive elements in genomic 354 sequences. Curr Protoc Bioinformatics Chapter 4, Unit 4.10.

12 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

355 Thiel, T., Michalek, W., Varshney, R.K., and Graner, A. (2003). Exploiting EST databases for the 356 development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor 357 Appl Genet 106, 411-422. 358 Triplett, J.K., Clark, L.G., Fisher, A.E., and Wen, J. (2014). Independent allopolyploidization events 359 preceded speciation in the temperate and tropical woody bamboos. New Phytologist 204, 66-73. 360 Vorontsova, M., G. Clark, L., Dransfield, J., Govaerts, R., and Baker, W. (2016). World Checklist of 361 Bamboos and Rattans. 362 Wu, T.D., and Watanabe, C.K. (2005). GMAP: a genomic mapping and alignment program for mRNA and 363 EST sequences. Bioinformatics 21, 1859-1875. 364 Wysocki, W.P., Clark, L.G., Attigala, L., Ruiz-Sanchez, E., and Duvall, M.R. (2015). Evolution of the 365 bamboos (Bambusoideae; Poaceae): a full plastome phylogenomic analysis. BMC evolutionary biology 15, 366 50. 367 Wysocki, W.P., Ruiz-Sanchez, E., Yin, Y., and Duvall, M.R. (2016). The floral transcriptomes of four 368 bamboo species (Bambusoideae; Poaceae): support for common ancestry among woody bamboos. BMC 369 genomics 17, 384. 370 Zhang, Y.-J., Ma, P.-F., and Li, D.-Z. (2011). High-throughput sequencing of six bamboo chloroplast 371 genomes: phylogenetic implications for temperate woody bamboos (Poaceae: Bambusoideae). PloS one 6, 372 e20596. 373 374

13 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

375 Tables

376

377 Table 1. Summary of the genome assembly and annotation of R. distichophylla. 378 Assembly Estimated genome size (Mb) 608 Assembled sequence length (Mb) 581 Scaffold N50 (Mb) 1.81 Contig N50 (Kb) 86.36 Annotation Number of predicted protein-coding genes 30,763 Average gene length (bp) 2,886.93 tRNAs 727 rRNAs 90 snoRNAs 242 snRNAs 127 miRNAs 256 Transposable elements (%) 49.08

379

380 381

14 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

382 Figure Legends 383 Figure 1. The 17-mer distribution of sequencing reads from R. distichophylla. The 384 occurrence of 17-mer was calculated using GCE based on the sequencing data from short 385 insert size libraries (insert size ≤ 500 bp) of R. distichophylla. The sharp peak on the left 386 with low depths represents the essentially random sequencing errors. The middle and 387 right peaks indicate the heterozygous and homozygous peaks, the depths of which are 52 388 and 103, respectively. 389 390 Figure 2. Evolutionary history of TE super-families in the R. distichophylla and 391 moso bamboo genomes. A) The occupied TE lengths in R. distichophylla genome; B) 392 the occupied TE lengths in the moso bamboo genome. The divergence was calculated 393 between TEs and the consensus sequences generated by RepeatMasker. 394 395 Figure 3. Distribution of different SSR motifs in the R. distichophylla and moso 396 bamboo genomes. The outer circle represents moso bamboo and the inner circle denotes 397 R. distichophylla. A) monomer distribution in the two genomes; B) dimer distribution in 398 the two genomes; C) trimer distribution in the two genomes. 399 400 Figure 4. Comparisons of gene features among R. distichophylla and three other 401 plant species. Gene features include mRNA length, CDS length, exon length, intron 402 length and exon number. No obvious length differences of gene features were observed 403 among these three plant species (Setaria italica, Brachypodium distachyon and Oryza 404 sativa). 405

15 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Tables Supplementary Table 1. Whole genome sequencing (WGS) reads used to assemble the R. distichophylla genome.

Insert Read Raw Sequence Data Lane Clean Size Length Data Coverage Type Information Data (Gb) (bp) (bp) (Gb) (×)* Total 253.94 165.91 272.89 Paired-ends 152.78 115.56 190.07 P01 180 100 74.81 58.35 95.97 P02 300 100 53.82 39.41 64.82 P03 500 100 24.15 17.80 29.28 Mate pair 101.16 50.35 82.82 M01 2000 100 8.54 7.92 13.03 M02 2000 125 12.64 11.67 19.19 M03 5000 100 15.03 11.92 19.61 M04 5000 125 3.82 3.38 5.56 M05 10000 90 31.56 8.06 13.26 M06 20000 90 29.57 7.40 12.17

* The estimated genome size was ~608 Mb.

1 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 2. Summary of RNA sequencing (RNA-Seq) of R. distichophylla.

Read RNA Number of Clean Data Length Source Tissues Pair-end Reads (bp) (bp) Stem 100 26,533,101 5,306,620,200 Root 100 27,651,149 5,530,229,800 Female inflorescence 100 28,030,510 5,606,102,000 Leaf 100 27,717,564 5,543,512,800 Total 109,932,324 21,986,464,800

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 3. Estimation of genome size of R. distichophylla based on K-mer analysis.

Genome size K-mer K-mer number K-mer depth (Mb) 17 57,760,011,267 95 608

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 4. Summary of genome assembly for R. distichophylla.

Assembly Features Values Library Number 9 Sequence Coverage (×) 272.89 180 bp 95.97 300 bp 64.82 500 bp 29.28 2 Kb 32.22 5 Kb 25.17 10 Kb 13.26 20 Kb 12.17 Assembled Length (bp) 580,854,656 Scaffold N50 (bp) 1,806,674 Scaffold N90 (bp) 3,687 Contig N50 (bp) 86,364 Contig N90 (bp) 3,477 Scaffold Number 38,269 Contig Number 45,206 Longest Scaffold (bp) 7,779,908 Longest Contig (bp) 650,854 >1Mb Scaffold Length (bp) 397,604,329 0.1-1 Mb 70,036,361 10-100 Kb 24,602,676 1-10 Kb 88,536,026 1 bp-1 Kb 75,264 Estimated Length (Mb) 608 No gap Length (bp) 563,901,200 Gap Number 6,937 Gap Length (bp) 16,953,456 Coverage (%) 95.56 GC Content (%) 44.97

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 5. Validation of the R. distichophylla genome assembly using reads mapping BUSCO, and transcript alignments.

Total Mapped Rate (%) Reads mapping Short libraries 211,270,340 188,569,033 89.25 BUSCOs from Embryophyta lineage* Complete 1,440 1,326 92.08 Duplicated 1,440 44 3.06 Fragmented 1,440 17 1.18 Missing 1,440 53 3.68 Transcripts assembled in this study** Unigenes 131,575 102,676 78.04

* The 1,440 BUSCO conserved genes used were collected from Embryophyta lineage; ** Hits with Coverage >= 30% and Identity >= 80%.

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 6. Statistics of repeat sequences in the R. distichophylla and moso bamboo genomes.

R. distichophylla Moso bamboo Length Percentage Length Percentage Type (Mb) (%) (Mb) (%) Total DNA TEs 25.78 4.44 157.02 7.65 CMC-EnSpm 9.19 1.58 1.54 0.08

hAT 5.77 0.99 50.92 2.48 MULE-MuDR 6.86 1.18 28.14 1.37 PIF-Harbinger 1.66 0.29 59.68 2.91 TcMar-Stowaway 1.28 0.22 6.93 0.34 Other 0.64 0.11 1.86 0.09 Helitron 0.37 0.06 7.94 0.39 Total No-LTR RT 8.73 1.50 21.97 1.07 LINE 7.10 1.22 21.59 1.05 SINE 1.63 0.28 0.38 0.02 Total LTR RTs 209.02 35.99 1014.83 49.46 Copia 47.45 8.17 344.60 16.80 Gypsy 76.77 13.22 483.77 23.58 Other 84.80 14.60 186.46 9.09 Other Repeats 41.57 7.16 101.85 4.96 Satellite 0.35 0.06 1.29 0.06 Simple repeat 6.04 1.04 10.50 0.51 Low complexity 1.12 0.19 2.03 0.10 Unknown 34.06 5.86 88.03 4.29 Total Repeat Elements 285.10 49.08 1295.67 63.15

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 7. Summary of types and number of simple sequence repeats in the R. distichophylla and moso bamboo genomes.

Repeat Number Proportion Total length Average Species type (%) (Kb) length (bp)

Monomer 15,360 6.96 213.98 14

Dimer 35,195 15.94 706.01 20

Trimer 72,626 32.90 1063.37 15

R. distichophylla Tetramer 58,046 26.30 791.80 14

Pentamer 26,823 12.15 442.42 16

Hexamer 12,687 5.75 250.85 20 Total 220,737 100 3468.42 16

Monomer 33,840 6.81 565.86 17 Dimer 112,350 22.61 2458.49 22 Trimer 129,860 26.14 1715.58 13 Moso bamboo Tetramer 150,447 30.28 1963.37 13 Pentamer 45,721 9.20 726.52 16 Hexamer 24,601 4.95 505.93 21 Total 496,819 100 7935.75 16

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 8. Non-coding RNA genes in the R. distichophylla genome.

ncRNA Type Loci# Average length Total length % of genome (bp) (bp) miRNA 256 134.8 34496 0.0057 tRNA 727 74.3 53980 0.0089 rRNA (8S) 62 113.6 7043 0.0012 rRNA (18S) 12 1795.2 21542 0.0035 rRNA (28S) 16 4660.9 74574 0.0123 SnoRNA 242 108.3 26207 0.0043 snRNA 127 136.4 17314 0.0028

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 9. Statistics of predicted protein-coding genes in the R. distichophylla genome*.

Gene set Average Average Average Average Average Num exon intron gene length CDS length exon per ber length length (bp) (bp) gene (bp) (bp)

De novo AUGUSTUS 66,10 2310.59 995.44 4.36 228.38 391.56 Fgenesh 334 ,26 2748.16 1036.74 4.49 231.15 491.06 Homolog Moso bamboo 430 ,21 2617.29 981.46 3.82 256.82 579.76 heterocyclaStiff brome 131 ,36 2633.39 1055.34 4.08 258.87 512.91 Barley 238 ,50 1884.43 838.36 3.21 261.09 473.11 Maize 432 ,21 2438.14 982.46 3.83 256.51 514.37 O. thomaeum 229 ,54 2691.81 916.39 3.85 238.30 623.93 Foxtail millet 133 ,12 2509.40 1019.58 3.94 258.84 506.90 Rice 932 ,29 2565.87 1016.27 3.93 258.89 529.68 Sorghum 831 ,29 2613.01 1043.01 4.01 259.87 520.99 RNA-Seq 1110 ,6 2385.95 782.61 2.69 290.77 947.85 GLEAN 5330 ,76 2886.93 1099.03 4.46 246.38 516.62 Final set 330,76 2886.93 1099.03 4.46 246.38 516.62 3

* UTRs are not included in transcripts.

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.27.064089; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 10. Functional annotation of the R. distichophylla protein-coding genes.

Type Number Percentage (%) 30,763 100 Total

Annotated InterPro 25,506 82.91

a GO 15,226 49.49

KEGG a 18,580 60.39

b Swiss-Prot 20,272 65.89

TrEMBL 27,191 88.38

Un-annotated 3,429 11.15

a Directly retrieved from InterPro entry; b Blast hits with e-value ≤ 1e-5 were retained.

10