Journal of Genetics (2019) 98:90 © Indian Academy of Sciences https://doi.org/10.1007/s12041-019-1136-8

RESEARCH NOTE

The evolution study on Oryza rufipogon. dw by whole-genome sequencing

JILIN WANG, SONG YAN, SHIYOU LUO, WEI DENG, XIANHUA SHEN, DAZHOU CHEN∗ and HONGPING CHEN∗

Rice Research Institute, Jiangxi Academy of Agricultural Sciences, Nanchang 330200, People’s Republic of China *For correspondence. E-mail: Dazhou Chen, [email protected]; Hongping Chen, [email protected].

Received 26 December 2018; revised 23 May 2019; accepted 17 June 2019; published online 5 September 2019

Abstract. Oryza rufipogon. dw was first discovered at Dongxiang, Jiangxi in 1978. It is recognized as a rich genetic resource, with traits including cold and insect resistance. A total of 100.15 Gb of raw data was obtained from seven paired-end libraries on the Illumina HiSeq4000 platform. A draft genome assembly of O. rufipogon. dw was then generated, with a final size of 422.7 Mb, a contig N50 of 15 kb and a scaffold N50 of 296.2 kb. The assembled genome was larger than the genome size estimated by k-mer analysis (413 Mb). Repeat sequences accounted for 40.09% of the entire genome, and 32,521 protein-coding genes, with an average of 4.59 exons per gene, were annotated in five databases. In a Bayesian phylogenetic analysis of 1460 single-copy genes, O. rufipogon. dw was closest to O. rufipogon. Its divergence was estimated at ∼0.3 million years ago (Mya) from O. rufipogon and ∼0.6 Mya from O. sativa. The draft genome of O. rufipogon. dw provides an essential resource for studying its origin and evolution.

Keywords. genome assembly; gene annotation; phylogenetic; evolution; Oryza rufipogon. dw.

Introduction

Rice accounts for more than half of human food sources worldwide, with Asia as the main growing area (Krishnan et al. 2014). The genus Oryza has attracted a great deal of scientific attention, which has led to many genome sequencing studies in recent years (Sakai et al. 2014). Oryza sativa, the commonly cultivated rice, is grown all over the world and is one of the most important grains in human food (Huang et al. 2012). Some studies have shown that O. sativa was domesticated in East Asia from O. rufipogon. However, nearly 30–40% of the genetic variation was lost in the process of domestication (Sun et al. 2001). These findings suggest that wild germplasm is the most precious source of genetic variation (Kovach et al. 2007). These useful variations can be exploited quickly by genomic technology and with the use of transgenic technology (Kovach et al. 2007).

O. rufipogon. dw was first discovered at Dongxiang, Jiangxi in 1978 (Zhang et al. 2016). It is considered the northernmost wild rice population in the world (28°14′N), with abundant genetic traits for cultivated rice improvement, such as cold resistance, insect resistance, and tolerance to biotic and abiotic stress (Zhang et al. 2006). It is also a valuable resource for fundamental research on genetic diversity, and has the advantages of high yield and heterosis. Comparative genomics provides new insights into the evolution of genes and genomes (Bennetzen 2007). Genome complexity arises from duplication, sequence rearrangement and transposable elements (Bennetzen 2007). Even closely related species, such as those within Arabidopsis or Oryza, show significant variation in genome size, gene collinearity and gene number (Hu et al. 2011).

Jilin Wang and Song Yan contributed equally to this work.

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s12041-019-1136-8) contains supplementary material, which is available to authorized users.


This study presents a draft assembly and annotation of the O. rufipogon. dw genome, and details its divergence time and its genome collinearity with other rice species. From the analysis of the functional genes, phylogeny and unique gene families of O. rufipogon. dw, our study provides an understanding of its domestication process and evolutionary status, enriches population diversity, and offers a new resource for rice breeding.

Materials and methods

Plant materials preparation

O. rufipogon. dw was provided by the academy of agricultural sciences in Nanchang, Jiangxi, China. The seeds were grown in a controlled growth chamber at 30°C/26°C under 14 h/10 h light/dark conditions. Genomic DNA was extracted from healthy, fresh leaves using a DNeasy Plant Mini kit (Qiagen, USA). DNA was quantified with a Nanodrop ND1000 spectrophotometer, and DNA quality was checked by 0.8% agarose gel electrophoresis.

DNA sequencing and genome size estimation

Seven paired-end (PE) libraries with insert lengths of 200, 500 and 800 bp and 2, 5, 10 and 20 kb were constructed from nuclear DNA according to the manufacturer's instructions (Illumina, San Diego, USA) and sequenced on the Illumina HiSeq4000 platform (Li et al. 2009). For genome size estimation, we used PE reads with a k-mer size of 17, and the k-mer distribution was investigated using Jellyfish v1.1.6.31 (Marcais and Kingsford 2011).

Assembly of the O. rufipogon. dw genome sequences

After data filtering and error correction, the clean reads were assembled using SOAPdenovo v1.05 with a minimum contig length of 1 kb. Assembled contigs were analysed against the nucleotide database of NCBI (http://www.ncbi.nlm.nih.gov/), and the genome DNA sequences were published to represent the completion of genome sequencing. All clean reads were mapped to the genome with SOAPaligner to calculate the average depth.

Gene prediction and annotation

Repetitive sequences were annotated by three methods: RepBase TEs, TE proteins and de novo. RepBase TEs and TE proteins are transposon components based on the RepBase library and annotated with the RepeatMasker and RepeatProteinMask software. De novo refers to a library of repetitive sequences predicted by the de novo prediction tools LTR-FINDER and RepeatScout (Price et al. 2005), which was then applied to the genome using RepeatMasker. The combined data are the result of integrating the three methods and removing overlaps.

We analysed and annotated four types of ncRNA: miRNA, tRNA, rRNA and snRNA. We performed homology searches and detections throughout the whole-genome sequence. For tRNA prediction, we used tRNAscan-SE (v.1.23) (Lowe and Eddy 1997). The snRNAs and miRNAs were predicted by alignment using BLASTN (Griffiths-Jones et al. 2005). BLASTN was also used for rRNA alignment (Stanke et al. 2006). After filtering, all the clean reads were annotated against five databases, SwissProt, GO, COG, KEGG and NR (Kent 2002), at an e-value threshold of ≤ 1e−5.

Phylogenetic analysis

We used the Bayes method to construct a phylogenetic tree from a global alignment of the 1460 single-copy genes, produced with PRANK software. The first phase site in each single-copy gene family was used to estimate the molecular clock (substitution rate) and the divergence times among species.

Results

Genomic DNA sequencing and genome size estimation

A total of 100.15 Gb of raw data was generated for the O. rufipogon. dw genome on the Illumina HiSeq4000 platform from seven PE libraries with insert lengths of 200 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb and 20 kb (NCBI accession number, PRJNA543004). After filtering, a total of 55,743,512,338 clean reads were retained for further study (table 1 in electronic supplementary material at http://www.ias.ac.in/jgenet/). We used k-mer analysis to estimate genome size and heterozygosity. The genome size of O. rufipogon. dw was estimated at 412 Mb from 11.07 G of high-quality whole-genome shotgun sequence data, according to the formula genome size = k-mer number / peak depth (k = 17) (figure 1 in electronic supplementary material).

Genome assembly

The final assembly had a genome size of 422.7 Mb, with a contig N50 of 15,026 bp and a scaffold N50 of 296,224 bp (table 2 in electronic supplementary material).
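As a rough illustration of the two summary statistics reported here, the sketch below computes an N50 from a list of sequence lengths and applies the k-mer formula genome size = k-mer number / peak depth. The function names and the toy inputs are illustrative assumptions, not code or data from this study. The last line only rearranges the formula with the reported 11.07 G and 412 Mb figures, under the assumption that 11.07 G can stand in for the k-mer number.

```python
def n50(lengths):
    """Return the N50 of a set of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def kmer_genome_size(kmer_number, peak_depth):
    """Estimate genome size as k-mer number divided by peak k-mer depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_number / peak_depth


if __name__ == "__main__":
    # Toy contig lengths (hypothetical): total 300, half 150.
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]
    print(n50(toy_contigs))  # 70

    # Toy k-mer count and peak depth (hypothetical values).
    print(kmer_genome_size(4000, 20))  # 200.0

    # Rearranged formula: depth implied by the reported figures,
    # assuming 11.07 G approximates the k-mer number (an assumption).
    print(round(11.07e9 / 412e6, 2))
```

Sorting the lengths in descending order and stopping at the first length whose cumulative sum reaches half the total keeps the N50 calculation to a single pass after the sort.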

The gene number and scaffolds are shown in table 3 in electronic supplementary material. The assembled genome was larger than the genome size estimated by k-mer analysis (413 Mb). Raw reads of the O. rufipogon. dw genome were compared with raw reads of the O. rufipogon Griff. genome (NCBI accession number, SRP070627) (Zhang et al. 2016) using BWA software; the mapping rate was 85.41%, comprising 73.47% uniquely mapped and 11.94% repeat-mapped sequences (table 4 in electronic supplementary material). The gene map of the O. rufipogon. dw genome is shown in figure 1. The concentric circles reflecting different features were drawn using the Circos program.

Figure 1. The O. rufipogon. dw genome. Five concentric circles, from the inner to the outer, show (a) donors and acceptors of segmental duplications on rice chromosomes, connected by gray lines; (b) transposable element (TE) density heat map; (c) gene density heat map; (d) GC content distribution linear map; and (e) chromosomes. The chromosomes are scaled in units of 5 Mb. The statistical unit for GC content, gene density and TE density is 1 Mb.

Repeat sequences and noncoding RNA annotation

A combination of RepeatMasker and de novo analysis identified 169,457,333 bp of repeat sequences. Among them, tandem repeats account
for 3% of the whole genome (table 5 in electronic supplementary material). The TEs account for 40.09% of the whole genome, including 18.99% DNA repeats, 2.76% long interspersed nuclear elements (LINEs), 21.38% long terminal repeats (LTRs), 0.39% short interspersed nuclear elements (SINEs) and 1.75% unknown repeat types. The novel repeats account for 1.75% of the Illumina-assembled genome sequences (table 6 in electronic supplementary material). In addition, we identified 182 microRNA (miRNA), 452 transfer RNA (tRNA), 71 ribosomal RNA (rRNA) and 388 small nuclear RNA (snRNA). There were 719 of 5.8S rRNA, 1623 of 18S rRNA, 1644 of 28S rRNA and 5260 of 5S rRNA, respectively (table 7 in electronic supplementary material).

Gene annotation

A total of 32,678 protein-coding genes were identified, of which 32,521 (99.5%) were annotated in five databases, InterPro, GO, KEGG, Swissprot and TrEMBL; the number of unannotated genes was 158 (table 8 in electronic supplementary material). The distributions of mRNA length, CDS length, exon length, intron length and exon number of O. rufipogon. dw were at an intermediate level compared with six Oryza species: O. sativa, O. barthii, O. glaberrima, O. longistaminata, O. punctata and O. rufipogon (figure 2 in electronic supplementary material).

Gene family identification and phylogenetic tree construction

The homology between O. rufipogon. dw and other rice species was examined via gene families. Sequence information from 12 species of the genus Oryza and two representative species (Zea mays and Sorghum bicolor) was used to cluster genes into 24,646 gene families. In addition, the dicotyledonous Arabidopsis thaliana was used as an outgroup. Only 23 gene families are unique to O. rufipogon. dw, whereas the two cultivated rice species O. sativa and O. glaberrima have 344 and 122 unique gene families, respectively. Z. mays and A. thaliana, which lie outside the genus Oryza, have the largest numbers of unique gene families (table 9; figure 3 in electronic supplementary material). In the phylogenetic tree built from 1460 single-copy genes by the Bayes method, O. rufipogon. dw was closest to O. rufipogon. The divergence of the wild rice species O. rufipogon. dw was estimated at ∼0.3 Mya from O. rufipogon and ∼0.6 Mya from O. sativa (figure 2).

Figure 2. Estimated divergence times of 15 species. Each branch length represents the rate of neutral evolution. The blue numbers indicate the estimated age of divergence after correction (Mya).

Discussion

Oryza is a suitable genus for comparative genomics because its species have small genomes, mostly within 430 Mb (Huang et al. 2012), and high transformation efficiency (Tyagi and Mohanty 2000). Because gene sequences are highly conserved among cereal crops and molecular data are abundant (Wang et al. 2014), rice is regarded as an ideal model plant for monocotyledons. It provides important clues to the evolution of Gramineae genomes and also contributes to the study of other Gramineae genomes. Understanding genome and gene evolution requires comparisons within and between species (Yan et al. 2015). Our phylogenetic analysis showed that O. rufipogon. dw was most closely related to O. rufipogon, whereas the distance between O. rufipogon. dw and O. glaberrima was the farthest. These results were in line with their collinearities. Previous studies have reported that as phylogenetic distance increases, the degree of gene collinearity in plant genomes tends to decrease (Tang et al. 2008).

Acknowledgements

The authors wish to thank three platforms: National Engineering Laboratory of Rice, National Improvement Center of rice (Nanchang), and Ministry of Wild Rice Scientific Observation and Experimental Station in Dongxiang, Jiangxi Province. This work was supported by Breeding Project of New Insect-resistant Rice Varieties in Rice Areas in the middle and lower reaches of the Yangtze river (2016zx08001001-001-005); Jiangxi Provincial Agricultural Science and Technology Innovation Project (JXTCX2015001); Jiangxi Province major science and technology special projects (20114BA03101); Jiangxi natural science funds (2010GQN0090); Area Funds of National natural science funds (3166041).

References

Bennetzen J. L. 2007 Patterns in grass genome evolution. Curr. Opin. Plant Biol. 10, 176–181.
Griffiths-Jones S., Moxon S., Marshall M., Khanna A., Eddy S. R. and Bateman A. 2005 Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, 121–124.
Hu T. T., Pattyn P., Bakker E. G., Cao J., Cheng J. F., Clark R. M. et al. 2011 The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 43, 476–481.
Huang X., Kurata N., Wei X., Wang Z. X., Wang A., Zhao Q. et al. 2012 A map of rice genome variation reveals the origin of cultivated rice. Nature 490, 497–501.
Kent W. J. 2002 BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664.
Kovach M. J., Sweeney M. T. and McCouch S. R. 2007 New insights into the history of rice domestication. Trends Genet. 23, 578–587.
Krishnan S. G., Waters D. L. and Henry R. J. 2014 Australian wild rice reveals pre-domestication origin of polymorphism deserts in rice genome. PLoS One 9, e98843.
Li R., Fan W., Tian G., Zhu H., He L., Cai J. et al. 2009 The sequence and de novo assembly of the giant panda genome. Nature 463, 311.
Lowe T. M. and Eddy S. R. 1997 tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.
Marcais G. and Kingsford C. 2011 A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770.
Price A. L., Jones N. C. and Pevzner P. A. 2005 De novo identification of repeat families in large genomes. Bioinformatics 21, 351–358.
Sakai H., Kanamori H., Araikichise Y., Shibatahatta M., Ebana K., Oono Y. et al. 2014 Construction of pseudomolecule sequences of the aus rice cultivar Kasalath for comparative genomics of Asian cultivated rice. DNA Res. 21, 1131–1142.
Stanke M., Keller O., Gunduz I., Hayes A., Waack S. and Morgenstern B. B. 2006 AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, 435–439.
Sun C. Q., Wang X. K., Li Z. C., Yoshimura A. and Iwata N. 2001 Comparison of the genetic diversity of common wild rice (Oryza rufipogon Griff.) and cultivated rice (O. sativa L.) using RFLP markers. Theor. Appl. Genet. 102, 157–162.
Tang H., Bowers J. E., Wang X., Ming R., Alam M. and Paterson A. H. 2008 Synteny and collinearity in plant genomes. Science 320, 486–488.
Tyagi A. K. and Mohanty A. 2000 Rice transformation for crop improvement and functional genomics. Plant Sci. 158, 1–18.
Wang M., Yu Y., Haberer G., Marri P. R., Fan C., Goicoechea J. L. et al. 2014 The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nat. Genet. 46, 982–988.
Yan L., Wang X., Liu H., Tian Y., Lian J., Yang R. et al. 2015 The Genome of Dendrobium officinale illuminates the biology of the important traditional Chinese orchid herb. Mol. Plant. 8, 922–934.
Zhang F., Xu T., Mao L., Yan S., Chen X., Wu Z. et al. 2016 Genome-wide analysis of Dongxiang wild rice (Oryza rufipogon Griff.) to investigate lost/acquired genes during rice domestication. BMC Plant Biol. 16, 103.
Zhang X., Zhou S., Fu Y., Su Z., Wang X. and Sun C. 2006 Identification of a drought tolerant introgression line derived from Dongxiang common wild rice (O. rufipogon Griff.). Plant Mol. Biol. 62, 247–259.

Corresponding editor: Manoj Prasad