bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Chromosome-level genome assembly of Japanese (Castanea crenata Sieb. et Zucc.)

2 reveals conserved chromosomal segments in woody

3

4 Kenta Shirasawa1*, Sogo Nishio2, Shingo Terakami2, Roberto Botta3, Daniela Torello Marinoni3,

5 Sachiko Isobe1

6

7 1Kazusa DNA Research Institute, Chiba 292-0818, Japan

8 2Institute of Fruit and Tea Science, NARO, Ibaraki 305-8605, Japan

9 3Dipartimento di Scienze Agrarie, Forestali e Alimentari, Università di Torino, Grugliasco 10095,

10 Italy

11

12 *To whom correspondence should be addressed:

13 Kenta Shirasawa

14 2-6-7 Kazusa-Kamatari, Kisarazu, Chiba 292-0818, Japan

15 Tel.: +81 438 52 3935

16 Fax: +81 438 52 3934

17 E-mail: [email protected]

18

19 Running title: The genome sequence of Japanese chestnut

20

1 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Abstract

2 Japanese chestnut (Castanea crenata Sieb. et Zucc.), unlike other Castanea species, is resistant to

3 most diseases and wasps. However, genomic data of Japanese chestnut that could be used to

4 determine its biotic stress resistance mechanisms have not been reported to date. In this study, we

5 employed long-read sequencing and genetic mapping to generate genome sequences of Japanese

6 chestnut at the chromosome level. Long reads (47.7 Gb; 71.6× genome coverage) were assembled

7 into 781 contigs, with a total length of 721.2 Mb and a contig N50 length of 1.6 Mb. Genome

8 sequences were anchored to the chestnut genetic map, comprising 14,973 single nucleotide

9 polymorphisms (SNPs) and covering 1,807.8 cM map distance, to establish a chromosome-level

10 genome assembly (683.8 Mb), with 69,980 potential protein-encoding genes and 425.5 Mb repetitive

11 sequences. Furthermore, comparative genome structure analysis revealed that Japanese chestnut

12 shares conserved chromosomal segments with woody , but not with herbaceous plants, of

13 rosids. Overall, the genome sequence data of Japanese chestnut generated in this study is expected to

14 enhance not only its genetics and genomics but also the evolutionary genomics of woody rosids.

15

16 Keywords: chestnut, genetic mapping, genome synteny, rosids, whole-genome sequence

17

18 1. Introduction

19 Chestnut is naturally distributed in temperate regions in the northern hemisphere, and plays a critical

20 role in maintaining the landscape and forest ecosystem1. Seeds or nuts of chestnut provide

21 important nutrition for animals as well as humans, while their wood is used as a source of timber and

22 tannins. The genus Castanea (2n = 2x = 24) includes four cultivated species: Japanese chestnut (C.

23 crenata Sieb. et Zucc), (C. dentata [Marshall] Borkh.), Chinese chestnut (C.

24 mollissima Bl.), and European chestnut (C. sativa Mill.). Japanese chestnut is commercially

2 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 cultivated in Japan and , while the remaining three species are commercially grown in the

2 USA, , and Europe, respectively. Seguin chestnut (C. seguinii Dode.) and Henry chestnut (C.

3 henryi Rehd. et Wils.), found in China, and Allegheny chinkapin (C. pumila Mill.), which grows in

4 the USA, are recognized as wild species.

5 In the early 19th century, American chestnut was devastated by , caused by an

6 exotic pathogen, namely, Cryphonectria parasitica2. European chestnut has also been damaged by

7 chestnut blight and ink disease, caused by C. parasitica and , respectively3.

8 In addition, chestnut gall wasp () was accidentally introduced into Europe,

9 resulting in great damage to chestnut production4. Because Japanese chestnut lines as well as

10 Chinese chestnut are generally resistant to diseases and wasps, resistance genes have been

11 transferred from Japanese chestnut to susceptible species in several chestnut breeding programs3.

12 Genome sequence information is useful for not only accelerating breeding programs but also

13 understanding the phylogenetic relationships and revealing the evolutionary history of a given

14 species through comparative genome structure analysis. To the best of our knowledge, genome

15 sequences of three Chinese chestnut lines and one Henry chestnut line have been reported to date5-8.

16 However, whole-genome sequence of Japanese chestnut is not yet publicly available. Advances in

17 long-read sequencing technology have enabled the construction of a mega-base-scale genome

18 assembly of haploid, diploid, and polyploid species9. Long contig sequences generated by long-read

19 sequencing could be further concatenated at the chromosome level with high-throughput

20 chromosome conformation analyses, optical mapping, and/or genetic mapping to establish

21 pseudomolecule sequences10. In this study, using long-read sequencing and genetic mapping, we

22 established chromosome-level pseudomolecule sequences of the Japanese chestnut variety ‘Ginyose’,

23 which has contributed to the spread of chestnut cultivation in Japan owing to its large size and

24 excellent appearance. Despite its long history of cultivation that dates back to the 17th century,

3 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 ‘Ginyose’ is still a major cultivated variety in Japan, accounting for approximately 15% of the

2 national chestnut production. Genetic structure and parentage analysis shows that ‘Ginyose’ is the

3 predecessor of many Japanese cultivars11.

4

5 2. Materials and methods

6 2.1 materials and DNA extraction

7 A tree of the Japanese chestnut ‘Ginyose’ planted at the Institute of Fruit Tree and Tea

8 Science, NARO (Tsukuba, Japan) was used for genome sequencing. Genomic DNA was extracted

9 from tree with Genomic-tips (Qiagen , Hilden, Germany).

10

11 2.2 Genome sequencing and analysis

12 Software tools used for data analyses are listed in Supplementary Table S1. Genomic DNA libraries

13 for short- and long-read sequencing were generated with the TruSeq DNA PCR-Free Kit (Illumina,

14 San Diego, CA, USA) and the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA,

15 USA), respectively. Short-read sequence data were obtained using the HiSeq2000 platform

16 (Illumina). Genome size was estimated using Jellyfish after removing adaptor sequences

17 (AGATCGGAAGAGC) and reads obtained from organelle genomes (GenBank accession numbers:

18 HQ336406 and MN199236). Long-read sequence data were obtained using the Sequel system

19 (PacBio), and primary contigs and alternate contigs, which were generated from one allele and the

20 other, respectively, of diploid genomes, were assembled with Falcon. Then, haplotype sequences

21 were resolved to generate haplotigs with Falcon_Unzip. Sequence errors in the contigs were

22 corrected twice using long reads with ARROW, and potential haplotype duplications in the primary

23 contigs were removed with Purge_Dups. Contig sequences potentially contaminated from organelle

24 genomes (GenBank accession numbers: HQ336406 and MN199236), which showed sequence

4 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 similarity of > 90% with Minimap2, were removed. Assembly completeness was evaluated with the

2 embryophyta_odb10 data using Benchmarking Universal Single-Copy Orthologs (BUSCO).

3

4 2.3 Genetic map-based chromosome-level sequence construction

5 Reads obtained by double digest restriction site-associated DNA sequencing (ddRAD-Seq) of an F1

6 mapping population (n = 185), derived from a cross between ‘Bouche de Bétizac’ (C. sativa × C.

7 crenata) and ‘Madonna’ (C. sativa) 4, were mapped onto the primary contigs as a reference using

8 Bowtie2, and sequence variants were detected with BCFtools. High-confidence single nucleotide

9 polymorphisms (SNPs) were selected with VCFtools (parameters of --minDP 5 --minQ 999

10 --max-missing 0.5 --maf 0.2 --max-maf 0.8) and subjected to linkage analysis with Lep-Map3. The

11 nomenclature and direction of the linkage groups were based on the map reported by Torello

12 Marinoni et al.4 To construct pseudomolecule sequences at the chromosome level, primary contigs

13 were assigned to the resultant genetic map with ALLMAPS.

14

15 2.4 Repeat sequence analysis

16 Repetitive sequences in the assembly were identified with RepeatMasker using repeat sequences

17 registered in Repbase and a de novo repeat library built with RepeatModeler. Repeat elements were

18 classified into nine types with RepeatMasker: short interspersed nuclear elements (SINEs), long

19 interspersed nuclear elements (LINEs), long terminal repeat (LTR) elements, DNA elements, small

20 RNA, satellites, simple repeats, low complexity repeats, and unclassified.

21

22 2.5 Gene prediction and annotation

23 Protein-coding genes were predicted using a MAKER pipeline, based on peptide sequences

24 predicted in the C. mollissima genome (GWHANWH00000000, n = 33,597)7 and transcriptome

5 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 sequences of three Castanea species2, C. crenata (cci_ccn, n = 13,451), C. dentata (AC454_v3, n =

2 45,288), and C. sativa (csi_csn, n = 12,771), released in the Hardwood Genomics Project

3 (https://doi.org/10.25504/FAIRsharing.srgkaf). Short gene sequences (< 300 bp), genes in repeat

4 sequences, and genes with annotation edit distance (AED) > 0.5, which is proposed as a threshold

5 for good annotations in the MAKER protocol, were removed to facilitate the selection of

6 high-confidence genes. Functional annotation of the predicted genes was performed with

7 Hayai-Annotation Plants.

8

9 2.6 Comparative genome structure analysis

10 The genome structure of Japanese chestnut cultivar ‘Ginyose’ was aligned to the genome sequence

11 of Chinese chestnut (C. mollissima), as a reference, with Unimap. In addition, publicly available

12 chromosome-level sequences of 114 plants10, which, in accordance with Angiosperm Phylogeny

13 Group (APG) classification IV12, consisted of 20 monocots, 93 (including 66 rosids and 23

14 asterids), and one member of Nymphaea, were also employed as references. Dot-plot charts based on

15 the alignments were drawn with D-Genies.

16

17 3. Results and data description

18 3.1 Genome sequence and assembly

19 In accordance with k-mer frequency analysis of 63.3 Gb short reads, the genome size of C. crenata

20 was estimated at 665.9 Mb (Figure 1). A total of 2.7 million long reads (47.7 Gb, 71.6x genome

21 coverage, N50 = 29.8 kb) obtained from four single-molecule real-time (SMRT) cells were

22 assembled into two haplotype-resolved sequences: primary contigs and haplotigs. The resultant

23 assemblies of primary contigs (CCR_r1.0) and haplotigs (CCR_r1.0.haplotigs) consisted of 781 and

24 3,355 sequences, respectively (Table 1). Total lengths of primary contigs (CCR_r1.0) and haplotigs

6 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 (CCR_r1.0.haplotigs) were 721 Mb (N50 = 1.6 Mb) and 629.2 Mb (N50 = 275.8 kb), respectively.

2 The completeness of primary contigs indicated that 96.6% of sequences were complete BUSCOs

3 (Table 1).

4

5 3.2 Pseudomolecule sequence construction

6 To construct a genetic map of C. crenata, we used ddRAD-Seq data of a mapping population

7 reported prreviously4. A total of 903 million high-quality ddRAD-Seq reads obtained from the

8 mapping population and parental lines were mapped onto the primary contigs, with an average

9 mapping rate of 86.5%. SNPs were detected and filtered. Consequently, 15,364 high-quality SNPs

10 were identified, of which 14,973 were grouped into 12 linkage groups, which corresponds to the

11 number of chromosomes in C. crenata. Then, SNP order and the map distance between adjacent

12 SNPs were calculated (Table 2, Supplementary Table S2). The nomenclature of linkage groups was

13 based on our previous study4. A total of 575 primary contigs (683.8 Mb) were assigned to this

14 genetic map and connected with 100 Ns to establish pseudomolecule sequences, while the remaining

15 206 contigs (37.4 Mb) were not assigned to any linkage groups (Table 2).

16

17 3.3 Repeat sequence analysis and gene prediction

18 Repetitive sequences (425.5 Mb) accounted for 59.0% of the Japanese chestnut genome assembly

19 (Table 3). LTR retroelements were the most abundant repeat sequences (172.7 Mb, 24.0%), followed

20 by unclassified repeat sequences (154.0 Mb, 21.3%), i.e., those unavailable in public databases.

21 A total of 195,950 potential protein-coding sequences were identified in the Japanese chestnut

22 genome assembly, based on ab initio prediction and amino acid sequence homology among four

23 Castanea species: C. crenata, C. dentata, C. mollissima, and C. sativa. Subsequently, protein-coding

24 sequences were filtered to remove 89,194 genes with AED > 0.5, 13,389 protein-coding sequences

7 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 in repeat sequences, 23,386 short sequences (< 300 bp), and one redundant sequence. Finally, 69,980

2 sequences were identified as high-confidence genes, including 83.6% complete BUSCOs. Functional

3 annotation analysis showed that of the 69,980 predicted genes, 4,370, 7,803, and 5,219 sequences

4 were assigned to Gene Ontology slim terms in the biological process, cellular component, and

5 molecular function categories, respectively, and 1,267 genes had enzyme commission numbers

6 (Supplementary Table S3).

7 Next, peptide sequences of C. mollissima and transcriptome sequences of C. crenata, C. dentata,

8 and C. sativa were aligned to the genome sequence of Japanese chestnut obtained in this study. A

9 total of 24,561 genes predicted in the Japanese chestnut genome sequence were covered by

10 transcriptome sequences from at least one of the three Castanea species (Figure 2), while the

11 remaining 28,773 genes did not correspond to transcriptome sequences from any Castanea species.

12

13 3.4 Comparative genome structure analysis

14 The genome structure of Japanese chestnut was well conserved with that of Chinese chestnut (Figure

15 3), as expected. Interestingly, the C. crenata genome showed local synteny with the genomes of at

16 least 14 plant species (Acer yangbiense, Citrus sinensis, Fragaria vesca, Malus × domestica,

17 Manihot esculenta, Populus trichocarpa, Prunus avium, Prunus dulcis, Prunus mume, Prunus

18 persica, Pyrus betulifolia, Theobroma cacao, Vitis vinifera, and Ziziphus jujuba) (Figure 3), all of

19 which represent woody tree species belonging to Malpighiales, Malvales, , Sapindales, and

20 Vitales in rosids12.

21

22 4. Conclusion and future perspectives

23 Here, we report the genome sequence of Japanese chestnut for the first time. Due to the

24 long-sequencing technology and genetic mapping with high-density SNP loci identified using

8 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 previously published ddRAD-Seq data, 683.8 Mb (94.8%) of the assembled contig sequences (721.2

2 Mb) were assigned to chromosomes for establishing pseudomolecule sequences for Japanese

3 chestnut (Table 2). Genes predicted based on the pseudomolecule sequences of Japanese chestnut

4 included 96.9% of the core gene set of Embryophyta (Table 2), suggesting that most of all genes

5 were represented in the Japanese chestnut genome assembly. The coverage of pseudomolecule

6 sequences in Japanese chestnut (683.8 Mb) was comparable to that in Chinese chestnut (12

7 pseudomolecule sequences, 667 Mb). Therefore, the chromosome-level genome assembly of

8 Japanese chestnut generated in this study could be used as a standard reference for genetics,

9 genomics, and breeding of Japanese chestnut and its relatives.

10 Comparative genome structure analysis showed that the genome structure of Japanese chestnut

11 was quite similar to that of Chinese chestnut (Figure 3). This result suggests that agronomically

12 important genetic loci controlling nut weight and harvesting time13-15 could be transferred from

13 Chinese chestnut to Japanese chestnut and vice versa. Chromosomal segments in the Japanese

14 chestnut genome shared synteny with the peach genome16 (Figure 3); this relationship has also been

15 reported in Chinese chestnut. More interestingly, regions in the Japanese chestnut genome showed

16 sequence similarity with the genomes of Malpighiales, Malvales, Rosales, Sapindales, and Vitales

17 belonging to woody rosids (Figure 3), but not with Brassicales, Cucurbitales, and Fabales

18 belonging to herbaceous rosids. Genome structure conservation among rosids, e.g., Rosales (sweet

19 cherry, apple, and, Japanese fig) and Malvales (cacao), have been suggested in our previous

20 investigation of 114 plant species10. This result suggests that woody members of rosids share a

21 closely related genome structure with each other, although this relationship contradicts that

22 suggested by the APG classification IV12. Indeed, none of the plant species in Cucurbitales, which is

23 suggested to be the closest order to according to APG IV12, showed genome structure

24 similarity to Japanese chestnut. Thus, comparative genome structure analysis provides new insights

9 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 into the evolutionary history of angiosperm.

2 The genome sequence data generated in this study would enhance the genetics and genomics of

3 Japanese chestnut, in which classical molecular markers, such as simple sequence repeats,

4 previously played a key role in 1) the identification of genetic loci affecting agronomically important

5 traits4,15,17; 2) chestnut breeding programs with marker-assisted selection18; and 3) the assessment of

6 the genetic structure of cultivated and wild chestnut populations19. Furthermore, the highly

7 conserved genome sequence and structure of woody members of rosids might help us better

8 understand their evolutionary history and facilitate the identification of common genetic mechanisms

9 affecting agronomic traits across various orders, families, species, and genera.

10

11 Acknowledgments

12 We thank Y. Kishida, C. Minami, H. Tsuruoka, and A. Watanabe (Kazusa DNA Research Institute)

13 for providing technical assistance. This work was supported in part by the JSPS KAKENHI

14 (20K15524) and the Kazusa DNA Research Institute Foundation.

15

16 Data availability

17 Raw sequence reads were deposited into the Sequence Read Archive (SRA) database of the DNA

18 Data Bank of Japan (DDBJ) under accession number DRA012289. Assembled sequences are

19 available at DDBJ under accession numbers BPMU01000001–BPMU01000781. Genome

20 information generated in this study is available at Plant GARDEN (https://plantgarden.jp).

21

22 Conflict of interest

23 None declared.

24

10 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Supplementary data

2 Supplementary Table S1 Software tools used for genome assembly and gene prediction.

3 Supplementary Table S2 Genetic map and anchored genome sequences.

4 Supplementary Table S3 IDs and annotations of predicted genes.

5

11 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 References

2 1. Bounous, G. and Marinoni, D. T. 2004, Chestnut: Botany, Horticulture, and 3 Utilization. Horticultural Reviews, pp. 291-347. 4 2. Barakat, A., DiLoreto, D. S., Zhang, Y., et al. 2009, Comparison of the 5 transcriptomes of American chestnut (Castanea dentata) and Chinese chestnut 6 () in response to the chestnut blight infection. BMC Plant Biol, 7 9, 51. 8 3. Pereira-Lorenzo, S., Ballester, A., Corredoira, E., et al. 2012, Chestnut. In: 9 Badenes, M. L. and Byrne, D. H. (eds), Fruit Breeding, Springer US, Boston, MA, 10 pp. 729-769. 11 4. Torello Marinoni, D., Nishio, S., Valentini, N., et al. 2020, Development of 12 High-Density Genetic Linkage Maps and Identification of Loci for Chestnut Gall 13 Wasp Resistance in Castanea spp. Plants (Basel), 9. 14 5. Staton, M., Addo-Quaye, C., Cannon, N., et al. 2020, A reference genome assembly 15 and adaptive trait analysis of Castanea mollissima ‘Vanuxem,’ a source of 16 resistance to chestnut blight in restoration breeding. Tree Genetics & Genomes, 17 16. 18 6. Sun, Y., Lu, Z., Zhu, X. and Ma, H. 2020, Genomic basis of homoploid hybrid 19 speciation within chestnut trees. Nat Commun, 11, 3375. 20 7. Wang, J., Tian, S., Sun, X., et al. 2020, Construction of Pseudomolecules for the 21 Chinese Chestnut (Castanea mollissima) Genome. G3 (Bethesda), 10, 3565-3574. 22 8. Xing, Y., Liu, Y., Zhang, Q., et al. 2019, Hybrid de novo genome assembly of Chinese 23 chestnut (Castanea mollissima). Gigascience, 8. 24 9. Sato, K., Abe, F., Mascher, M., et al. 2021, Chromosome-scale genome assembly of 25 the transformation-amenable common wheat cultivar ‘Fielder’. DNA Research. 26 10. Shirasawa, K., Harada, D., Hirakawa, H., Isobe, S. and Kole, C. 2021, 27 Chromosome-level de novo genome assemblies of over 100 plant species. Breeding 28 Science, 71, 117-124. 29 11. Nishio, S., Iketani, H., Fujii, H., et al. 2014, Use of population structure and 30 parentage analyses to elucidate the spread of native of Japanese chestnut. 31 Tree Genetics & Genomes, 10, 1171-1180. 32 12. The Angiosperm Phylogeny Group 2016, An update of the Angiosperm Phylogeny Group 33 classification for the orders and families of flowering plants: APG IV. Botanical 34 Journal of the Linnean Society, 181, 1-20. 35 13. Ji, F., Wei, W., Liu, Y., et al. 2018, Construction of a SNP-Based High-Density 36 Genetic Map Using Genotyping by Sequencing (GBS) and QTL Analysis of Nut Traits

12 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 in Chinese Chestnut (Castanea mollissima Blume). Front Plant Sci, 9, 816. 2 14. Nishio, S., Hayashi, T., Yamamoto, T., et al. 2018, Bayesian genome-wide 3 association study of nut traits in Japanese chestnut. Molecular Breeding, 38. 4 15. Nishio, S., Terakami, S., Matsumoto, T., et al. 2018, Identification of QTLs for 5 Agronomic Traits in the Japanese Chestnut (Castanea crenata Sieb. et Zucc.) 6 Breeding. The Horticulture Journal, 87, 43-54. 7 16. Kubisiak, T. L., Nelson, C. D., Staton, M. E., et al. 2012, A transcriptome-based 8 genetic map of Chinese chestnut (Castanea mollissima) and identification of 9 regions of segmental homology with peach (Prunus persica). Tree Genetics & Genomes, 10 9, 557-571. 11 17. Nishio, S., Takada, N., Yamamoto, T., et al. 2013, Mapping and pedigree analysis 12 of the gene that controls the easy peel pellicle trait in Japanese chestnut 13 (Castanea crenata Sieb. et Zucc.). Tree Genetics & Genomes, 9, 723-730. 14 18. Bernatzky, R. and Mulcahy, D. L. 1992, Marker-aided selection in a backcross 15 breeding program for resistance to chestnut blight in the American chestnut. 16 Canadian Journal of Forest Research, 22, 1031-1035. 17 19. Nishio, S., Takada, N., Terakami, S., et al. 2021, Genetic structure analysis of 18 cultivated and wild chestnut populations reveals gene flow from cultivars to 19 natural stands. Sci Rep, 11, 240.

20

13 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Table 1 Statistics of the contig sequences of Japanese chestnut

CCR_r1.0 CCR_r1.0.haplotig

Total contig size (bases) 721,168,657 629,166,897

Number of contigs 781 3,355

Contig N50 length (bases) 1,595,543 275,768

Longest contig size (bases) 13,981,633 1,688,135

Gap (bases) 0 0

Complete BUSCOs 96.6 n.a.

Single-copy BUSCOs 91.9 n.a.

Duplicated BUSCOs 4.7 n.a.

Fragmented BUSCOs 0.9 n.a.

Missing BUSCOs 2.5 n.a.

#Genes 69,980 n.a.

2 n.a.: not analyzed.

3

14 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Table 2 Statistics of Japanese chestnut pseudomolecule sequences (CCR_r1.0.pmol)

Genetic Linkage No. of No. of No. of distance % Total length (bp) % group SNPs bins contigs (cM)

CCR1.0A 2,483 173.7 284 59 7.6 89,228,932 12.4

CCR1.0B 1,514 172.2 202 53 6.8 53,118,407 7.4

CCR1.0C 1,232 152.6 204 44 5.6 56,655,719 7.9

CCR1.0D 1,083 137.5 161 57 7.3 52,875,950 7.3

CCR1.0E 1,342 165.2 228 61 7.8 71,160,265 9.9

CCR1.0F 1,339 132.3 154 42 5.4 61,824,531 8.6

CCR1.0G 924 137.2 153 44 5.6 44,683,286 6.2

CCR1.0H 816 138.9 170 34 4.4 53,252,922 7.4

CCR1.0I 1,199 157.9 168 42 5.4 45,426,588 6.3

CCR1.0J 1,036 148 157 32 4.1 40,655,448 5.6

CCR1.0K 1,026 144.9 196 62 7.9 67,434,223 9.4

CCR1.0L 979 147.4 140 45 5.8 47,439,869 6.6

Total 14,973 1,808 2,217 575 73.6 683,756,140 94.8

2

15 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Table 3 Repetitive sequences in the Japanese chestnut genome

Physical length Repeat type No. of elements % (bp)

SINEs 3,222 349,743 0.0

LINEs 58,824 28,244,400 3.9

LTR elements 173,629 172,741,932 24.0

DNA transposons 206,227 47,812,722 6.6

Unclassified 672,305 153,965,018 21.3

Small RNA 1,548 851,212 0.1

Satellites 1,888 1,522,847 0.2

Simple repeats 308,508 12,661,150 1.8

Low complexity 43,949 2,095,795 0.3

2

16 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Figure legends

2 Figure 1 Estimation of the genome size of Japanese chestnut (Castanea crenata), based on k-mer

3 analysis (k = 17) with the given multiplicity values.

4

5 Figure 2 Number of genes in the genomes of Japanese chestnut and Chinese chestnut supported by

6 members of the family.

7 Predicted genes in the genomes of Japanese chestnut (CCR_r1.0) and Chinese chestnut

8 (GWHANWH00000000), and the transcriptome sequences of American chestnut (AC454_v3),

9 European chestnut (csi_csn), and Japanese chestnut (cci_ccn).

10

11 Figure 3 Comparative analysis of the genome sequence and structure of Japanese chestnut and other

12 woody members of rosids.

13 Dots indicate genome sequence and structure similarity among Castanea mollissima, Acer

14 yangbiense, Citrus sinensis, Fragaria vesca, Malus × domestica, Manihot esculenta, Populus

15 trichocarpa, Prunus persica, Pyrus betulifolia, Theobroma cacao, Vitis vinifera, and Ziziphus jujuba.

16

17 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

7

6

5

4

3

2

1 Number of distinct k−mers (million) 0

0 50 100 150 200 250

k−mer multiplicity bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

C. crenata (CCR_r1.0)

28,773 (41.1%) C. mollissima C. sativa (GWHANWH00000000) (csi_csn) 16,646 432 (23.8%) (0.6%) C. dentata C. crenata (AC454_v3) 656 (cci_ccn) 7,067 (0.9%) 143 (10.1%) (0.2%) 1,824 584 2,722 (2.6%) (0.8%) 354 (3.9%) (0.5%)

292 6,792 708 (0.4%) (9.7%) (1.0%)

512 2,089 (0.7%) (3.0%)

386 (0.6%) bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454274; this version posted July 29, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Castanea mollissima Acer yangbiense Citrus sinensis Fragaria vesca Malus x domestica Manihot esculenta 1234567891011121234567891011121312345678912 3 4 5 6 712 345 67891011121314151617 1 2 3 4 5 6 7 8 9 101112131415161718 Castanea crenata L

615.4 M K J 547 M I

478.7 M H G 410.3 M F 341.9 M E 273.5 M D 205.1 M C

136.8 M B

68.4 M A

66.7 M 133.4 M 200.2 M 266.9 M 333.6 M 400.3 M 467 M 533.7 M 600.5 M 64.6 M 129.2 M 193.9 M 258.5 M 323.1 M 387.7 M 452.3 M 517 M 581.6 M 23.9 M 47.8 M 71.7 M 95.6 M 119.5 M 143.4 M 167.3 M 191.2 M 215.1 M 21 M 41.9 M 62.9 M 83.9 M 104.9 M 125.8 M 146.8 M 167.8 M 188.7 M 65.3 M 130.5 M 195.8 M 261 M 326.3 M 391.5 M 456.8 M 522 M 587.3 M 51.8 M 103.7 M 155.5 M 207.4 M 259.2 M 311.1 M 362.9 M 414.8 M 466.6 M

Populus trichocarpa Prunus persica Pyrus betulifolia Theobroma cacao Vitis vinifera Ziziphus jujuba 1 2 3 4 5 6 7 8 9 10 111213141516171819 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 101112 13 14 15 1617 18 19 1 2 3 4 5 6 7 8 9 10 11 12 Castanea crenata L

615.4 M K J 547 M I

478.7 M H G 410.3 M F 341.9 M E 273.5 M D 205.1 M C

136.8 M B

68.4 M A

38.9 M 77.8 M 116.8 M 155.7 M 194.6 M 233.5 M 272.4 M 311.4 M 350.3 M 22.6 M 45.1 M 67.7 M 90.3 M 112.8 M 135.4 M 158 M 180.6 M 203.1 M 50.1 M 100.1 M 150.2 M 200.2 M 250.3 M 300.4 M 350.4 M 400.5 M 450.6 M 31.4 M 62.8 M 94.3 M 125.7 M 157.1 M 188.5 M 219.9 M 251.4 M 282.8 M 42.6 M 85.2 M 127.9 M 170.5 M 213.1 M 255.7 M 298.3 M 340.9 M 383.6 M 32.2 M 64.3 M 96.5 M 128.6 M 160.8 M 192.9 M 225.1 M 257.3 M 289.4 M