First High-Quality Reference Genome of Amphicarpaea Edgeworthii

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. First high-quality reference genome of Amphicarpaea edgeworthii Tingting Song1+, Mengyan Zhou2+, Yuying Yuan1, Jinqiu Yu 1, Hua Cai1, Jiawei Li1, Yajun Chen1, Yan Bai2, Gang Zhou2 and Guowen Cui1* 1 Northeast Agricultural University, 150030, Harbin, China 2 Novogene Bioinformatics Institute, 100083 Beijing, China + These authors contributed equally * Corresponding author: Guowen Cui, Northeast Agricultural University, Harbin, No. 600 Changjiang Road, Harbin, 150030, China. Email: [email protected] 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Abstract Amphicarpaea edgeworthii, an annual twining herb, is a widely distributed species and an ideal model for studying complex flowering types and evolutionary mechanisms of species. Herein, we generated a high-quality assembly of A. edgeworthii by using a combination of PacBio, 10× Genomics libraries, and Hi-C mapping technologies. The final 11 chromosome-level scaffolds covered 90.61% of the estimated genome (343.78 Mb), which is the first chromosome-scale assembled genome of an amphicarpic plant. These data will be beneficial for the discovery of genes that control major agronomic traits, spur genetic improvement of and functional genetic studies in legumes, and supply comparative genetic resources for other amphicarpic plants. 2 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Introduction In nature, the distribution of key resources required for plant growth is often uneven. Plants growing in unstable habitats, with limited supplies of mineral nutrients, water, or light, frequent soil interferences, and large environmental fluctuations, undergo adaptive evolution to improve their survival(Jackson and Caldwell, 1993; Pearcy and Caldwell, 1994). Some plant species that bear 2 or more heteromorphic flowers also bear heteromorphic fruits (seeds). Amphicarpy is a phenomenon in which a plant produces both aerial and subterranean flowers and simultaneously bears both aerial and subterranean fruits (seeds) on aerial- and subterranean- stems, respectively (Cheplick, 1987; Koontz et al., 2017; Schnee and Waller, 1986). This phenomenon is observed in at least 67 herbaceous species (31 in fabaceae) in 39 genera and 13 families of angiosperms, as reported by Zhang et al(Zhang et al., 2020a). Amphicarpy is an important part of plant adaptive evolution, in which angiosperms generally display a special type of fruiting pattern and different fruit (seed) types also exhibit various dormancy and morphological features. This type of fruiting mode is crucial for the ecological adaptation of plants that have evolved through natural selection because it reduces competition among siblings within the population, maintains and increases the population size in situ, and increases the adaptability and evolutionary plasticity of the species(Hidalgo et al., 2016; Sadeh et al., 2009). Amphicarpaea edgeworthii, an annual twining herb, belongs to the fabaceae family, which is a large and economically valuable family of flowering plants(Zhang et al., 2017; Zhang et al., 2006). In this plant species, 3 types of flowers (fruits) can 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. grow in the same plant, and aerial chasmogamous flowers are produced only during summer( Zhang et al., 2006; Zhang et al., 2005). This species may offer an attractive model for examining gene regulatory networks that control chasmogamous and cleistogamous flowering in plants. However, the mechanism of flower development in amphicarpic plants, particularly in legumes, is sparsely known. The present study was an attempt to enhance our understanding on the reproductive biology and the ultimate evolutionary mechanism in amphicarpic plants. We performed whole-genome sequencing of A. edgeworthii, which is the first plant in the legume family and also the first amphicarpic species subjected to sequencing. This reference genome represents a precious foundation for further understanding on agronomics and molecular breeding in A. edgeworthii. 4 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Result A. edgeworthii has a diploid genome (2n = 2x = 22) (Supplementary figure 1). We estimated the genome size of A. edgeworthii to be 360.91 Mb using 17-mer(Supplementary Table 1 and figure 2). We sequenced the genome of A. edgeworthii by using a combination of PacBio, Illumina, and 10× Genomics libraries that resulted in the generation of a 343.78-Mbp genome (contig N50 length = 1.44 Mb, scaffold N50 length = 2.4 Mb; Table 1 and Supplementary Table 2, 3 and 4). finally, we assembled a chromosome-level genome by using Hi-C technology. We used a total of 5.27 million reads from Hi-C libraries and mapped approximately 90.61% of the assembled sequences to 11 pseudochromosomes, with the longest scaffold length of 32.05 Mb (Table 1 and figure 1). Results indicate that the A. edgeworthii genome was adequately covered by the assembly. We evaluated the completeness of the genome assembly by mapping the Illumina paired-end reads to our assembly through Burrows–Wheeler Alignor (BWA) (Li and Durbin, 2009) with 98.70 % of mapping rate and 94.04 % of coverage (Supplementary Table 5). Then, we used both Core Eukaryotic Gene Mapping Approach (CEGMA) (Parra et al., 2007) and Benchmarking Universal Single-Copy Orthologs (BUSCO) (Simão et al., 2015) to access the integrity of the assembly. In the CEGMA assessment, 238 (95.97%) of 248 core eukaryotic genes were assembled (Supplementary Table 6). furthermore, 93.4% complete single-copy BUSCOs were detected, which indicates that the assembly was complete (Supplementary Table 7). Overall, the results indicate that a high-quality assembly was generated. 5 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Repeat sequences comprise 51.28% of the assembled genome, with transposable elements (TEs) being the major component (Supplementary Table 8). Among the TEs, long terminal repeats (LTRs) were the major component (29.32%) (Supplementary Table 9). We combined de novo prediction, homology search, and mRNA-seq assisted prediction to predict genes in the A. edgeworthii genome, and we obtained 28,372 protein-coding genes (97.2% were annotated) (Supplementary Tables 10 and 11). Additionally, we identified 2,260 non-coding RNAs, including 471 miRNAs, 701 transfer RNAs, 266 ribosomal RNAs, and 822 small nuclear RNAs (Supplementary Table 12). Table 1 Statistics of the A. edgeworthii genome assembly Total assembly size (Mb) 343.78 Total number of contigs 1,475 Total number of scaffolds 1,082 Contig N50 length (Mb) 1.44 Maximum contig length (Mb) 7.65 Maximum scaffold length (Mb) 32.05 Scaffold N50 length (Mb) 28.47 Scaffold N90 length (Mb) 23.07 GC content (%) 32.04 Gene number 28,372 Repeat content (%) 51.28 6 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Discussion In this study, we constructed a high-quality chromosome-level genome assembly for A. edgeworthii by combining the long-read sequences from PacBio with highly accurate short reads from Illumina sequencing and using Hi-C technology for super-scaffolding. The assembly of A. edgeworthii adds to the growing genomic information on the fabaceae family. As the first species in the Amphicarpaea genera of the fabaceae family to be sequenced, A. edgeworthii exhibits a range of specific biological features, such as 3 types of flowers and 3 types of fruit (seed) simultaneously in the same plant. It has maintained an essential evolutionary position in the tree of life(Goodwillie et al., 2005; Silvertown et al., 2001). According to our analysis, these genomic data will serve as valuable resources for future genomic studies on amphicarpic plants and molecular breeding of soybean. To date, no genomic data for the amphicarpic plants are available. Therefore, the sequencing of the entire A. edgeworthii genome will facilitate future investigations on the phylogenetic relationships of the flowering (seed) plants. The accessibility of the A. edgeworthii genome sequence makes feasible the investigation of deep phylogenetic questions on angiosperms and the determination of genome evolution signatures and the genetic basis of interesting traits. 7 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Method Plant material and genome survey for genome sequencing, we collected fresh young leaves of A. edgeworthii species distributed in Heilongjiang Province (45.80°N, 126.53°E), China. The karyotype analysis of the plant species revealed a karyotype of 2n = 2x = 22, with uniform and small chromosomes(Wolny et al., 2013) (Supplementary figure 1).

First High-Quality Reference Genome of Amphicarpaea Edgeworthii

Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

Microbial Gene Identification Using Interpolated Markov Models Steven L

Steven L. Salzberg

By Katherine E. L. Cox a Dissertation Submitted to Johns Hopkins University in Conformity with the Requirements for the Degree O

Improved Microbial Gene Identification with GLIMMER

The Effect of Machine Learning Algorithms on Metagenomics Gene

Genome Sequencing and Annotation

Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering

MCB 1200/1201, Virus Hunting

Machine Learning for Metagenomics: Methods and Tools

Glimmer-MG Release Notes Version 0.1

Information Theory, Graph Theory and Bayesian Statistics Based Improved and Robust Methods in Genome Assembly