THE CHINESE CHESTNUT (CASTANEA MOLLISSIMA) GENOME

Carlson, JE1; Staton, ME2; Addo-Quaye, C3,1; Cannon, N1; Fan, S4; Nelson, CD4,7; Henry, N2; Yu, J2; Huff, M2; Zhebentyayeva, T1; Wu, D1; Conrad, A4; Ficklin, S5; Saski, C5; Mandal, M6; Islam-Faridi, N7; Zembower, N1; Drautz, D8; Schuster, SC8; Swale, T10; Sun, Y11; Westbrook, J9; Holliday, J6; Abbott, AG1,4; Hebard, FV9. 2018.

1Schatz Center for Tree Molecular Genetics, Pennsylvania State University, University Park, PA, USA 16802 ([email protected]); 2University of Tennessee, Knoxville, TN, USA 37996; 3Division of Natural Sciences and Mathematics, Lewis-Clark State College, Lewiston, ID, USA 83501; 4Forest Health Research and Education Center, University of Kentucky, Lexington, KY, USA 40506; 5Clemson University, Clemson, SC, USA 29634; 6Virginia Polytechnic University, Blacksburg, VA, USA 24061; 7USDA Forest Service, Saucier, MS, USA 39574; 8Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore 639798; 9The American Chestnut Foundation, Meadowview, VA, USA 24361; 10Dovetail Genomics LLC, 5790 Delaware Ave, Santa Cruz, CA 95060; 11Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming, 650223

SUMMARY: Chinese chestnut (Castanea mollissima) genotypes are serving as the source of genes for developing resistance to chestnut blight (Cryphonectria parasitica) in American chestnut. We sequenced the genome of Chinese chestnut to gain a better understanding of the genetic basis of blight resistance and to provide tools to assist in selecting for the blight resistance QTL, and against the C. mollissima genetic background, in The American Chestnut Foundation's (TACF) back-cross breeding program. A draft genome (v.1.1) covering approximately 90% of the genome of TACF's Chinese chestnut 'Vanuxem' was released to the public in January 2014 (www.hardwoodgenomics.org). In addition, recombinant DNA clones covering the three major blight resistance QTL in the Vanuxem cultivar were sequenced to great depth.
Hundreds of genes were identified in the resistance QTLs, including 15 known "defense response" genes. The v.1.1 draft genome has served as the reference for genome-wide genetic diversity studies among and within chestnut species and for preliminary studies on the progress of back-cross breeding at the BC3 generation. We will report progress towards a new version (v.2.0) of the Chinese chestnut genome featuring chromosome-scale sequences, taking two approaches. In the first approach, very long sequence reads from PacBio technology were used to produce longer genome fragments, of which over four thousand were placed in proper order along the 12 genetic linkage groups. In the second, a new genome assembly approach, Dovetail Genomics is using chromatin-interaction sequence data to assemble more continuous sequence for each of the 12 chromosome pairs in the 'Mahogany' chestnut genotype. This work was supported by The Forest Health Initiative (www.foresthealthinitiative.org), The American Chestnut Foundation, the USDA AFRI program, and The Schatz Center for Tree Molecular Genetics at Penn State.

Goals:
1) A complete genome sequence for Castanea mollissima
2) Identify genes for Cryphonectria resistance
3) Provide tools for restoration of C. dentata (American chestnut)

Approach:
1) Generate deep genomic DNA sequence for a Chinese chestnut genotype
2) Assemble the genome into scaffolds >1 Mb in length
3) Anchor scaffolds to the chestnut genetic linkage map and physical map
4) Predict gene models, including gene expression data
5) Annotate putative gene functions and repetitive DNA content
6) Provide public access through a genome browser and database
7) "Resequence" additional chestnut genomes for diversity analysis

Results Summary:
• C. mollissima cv. Vanuxem tree selected from TACF breeding program as reference genotype
• Total of 60.3 Gb (75X depth) gDNA sequence (454 shotgun reads plus 3 Kb, 8 Kb, 100 Kb paired-end)
• Version 1 assembly = 323,611 contigs, 41,270 scaffolds (724 Mb; N50 51.8 Kb), 36,478 gene models
• Assembly V2a = 60,546 contigs and 14,358 scaffolds (760 Mb; N50 2.75 Mb), after gap-closing
• Assembly V2b = 12,684 contigs (783.4 Mb) after merging and gap-closing with PacBio long reads
• First pseudo-chromosome build = 4,284 largest contigs of V2b anchored on 12 linkage groups
• V3 chromosome build = 4,314 of 12,684 contigs anchored to reference map; 452.3 Mb assembled; 29,470 gene models identified with chestnut RNASeq data and Quercus robur gene models
• Independent genome assembly for Chinese chestnut Mahogany cv. completed with new genome assembly approach using chromatin-interaction ("HiC") data:
  • 12 scaffolds ranging from 31 Mb to 69 Mb; total of 511.8 Mb assembled
  • 70% of estimated 734 Mb genome assembled; most repetitive DNA probably not included
  • All 12 Linkage Groups in reference map covered; 88% of predicted single copy genes found
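The proportions quoted in the results summary can be re-derived from the reported totals. The sketch below is illustrative only: the helper name `percent` is hypothetical, and pairing the 452.3 Mb anchored sequence with the 783.4 Mb V2b assembly is an assumption about the appropriate denominator, not something the poster states.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole

# Figures taken from the results summary (Mb unless noted).
mahogany_assembled_mb = 511.8
mahogany_estimated_genome_mb = 734.0
v3_anchored_contigs = 4314
v2b_total_contigs = 12684
v3_anchored_mb = 452.3
v2b_assembled_mb = 783.4  # assumed denominator for the anchored fraction

# Fraction of the estimated Mahogany genome captured by the HiC assembly.
mahogany_fraction = percent(mahogany_assembled_mb, mahogany_estimated_genome_mb)

# Share of V2b contigs, and of V2b sequence, placed on the reference map.
anchored_contig_fraction = percent(v3_anchored_contigs, v2b_total_contigs)
anchored_sequence_fraction = percent(v3_anchored_mb, v2b_assembled_mb)

print(f"Mahogany assembled: {mahogany_fraction:.1f}% of estimated genome")
print(f"V3 anchored contigs: {anchored_contig_fraction:.1f}% of V2b contigs")
print(f"V3 anchored sequence: {anchored_sequence_fraction:.1f}% of V2b sequence")
```

The first value rounds to the 70% reported for the Mahogany assembly; the other two are derived here for orientation and do not appear on the poster.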

Detailed Information on Chinese Chestnut Genome Assemblies

Panels: Vanuxem Version 1 Sequencing and Assembly Statistics; Vanuxem Version 2 Chromosome Assembly Statistics; Vanuxem Version 3 Chromosome Assembly Statistics; Mahogany Version 1 Chromosome Sequencing and Assembly Statistics
* after gap closing (CLCBio/Geneious) and mate-pair scaffolding (SSPACE)

Vanuxem Version 3 Chromosome Assembly Statistics: Assembly 3.0 (new pseudo-chromosome assembly)
De novo assembly:
• 12,684 contigs (up to 1.1 Mb)
• 783.4 Mb assembled
• 98% of estimated genome length assembled
Gene models:
• 29,470 gene models identified, with expression evidence
• Gene models being functionally annotated
Pseudo-chromosomes:
• 4,314 contigs anchored to reference genetic linkage map (1322 SNP markers)
• 12 pseudo-chromosomes produced
• Pseudo-chromosome lengths currently range from 28.3 Mb to 64.4 Mb, totaling 452.3 Mb
Validations:
• Genetic markers from new high density maps included and co-linear

Mahogany Version 1 Chromosome Sequencing and Assembly Statistics

Illumina Sequencing and De Novo Assembly
Assembly stats:
• 164,360,173 reads at 250 bp
• 535,528,520 pe reads at 130 bp
• best fit: 55 k-mer size, 39% heterozygosity, 734 Mb genome size
• 152,840 contigs (N50 10.9 Kb)
• 51,675 scaffolds (N50 19.6 Kb)
• L50/N50 = 7,299 scaffolds; 20 Kb
• L90/N90 = 28,550 scaffolds; 4 Kb
• 507.1 Mb assembled (69% total)
Gene models: Of 303 expected single copy genes:
• 195 found single copy (64%)
• 25 duplicated (8%)
• 38 fragmented (8%)
• 45 missed (19%)
• Total # genes not determined
Pseudo-chromosomes:
• not attempted

Assembly with "Chicago" chromatin data
Assembly stats:
• 193M read pairs; 2x150 bp
• contigs N50 11.01 Kb
• 3,848 scaffolds
• L50/N50: 67 scaffolds; 2.1 Mb
• L90/N90: 295 scaffolds; 401 Kb
• 511.18 Mb assembled (69% total; 85% of non-repeat DNA)
Gene models: Of 303 single copy genes:
• 226 found single copy (75%)
• 26 duplicated (8.5%)
• 16 fragmented (5%)
• 35 missed (12%)
• Total # genes not determined
Pseudo-chromosomes:
• not attempted

Assembly with "HiC" chromatin data
Assembly stats:
• 3,180 scaffolds total
• L50/N50: 5 scaffolds; 43.Mb
• L90/N90: 11 scaffolds; 32.8 Mb
• Longest scaffold 68,930,353 bp
• 511.84 Mb assembled (70% total; 85% of non-repetitive DNA)
• 149,065 gaps, covering 3.46%
Gene models: Of 303 single copy genes:
• 227 found single copy (75%)
• 25 duplicated (8%)
• 14 fragmented (5%)
• 37 missed (12%)
• Total # genes not determined
Pseudo-chromosomes:
• 12 chromosome-length sequences, covering complete reference genetic linkage map, but missing repeat DNA
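The assembly panels report contiguity as L50/N50 and L90/N90 pairs. As a reminder of how these are conventionally computed, here is a minimal sketch: sort sequences from longest to shortest and walk down the list until the running total reaches the chosen share of the assembly. The function name `nx_lx` and the toy scaffold lengths are hypothetical; they are not taken from the chestnut assemblies.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the sequence at which the running total of
    lengths, sorted longest first, reaches `percent` of the assembly.
    Lx is how many sequences were needed to get there.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise AssertionError("unreachable: the full list always reaches the threshold")

# Toy scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]

n50, l50 = nx_lx(toy_scaffolds, percent=50)
n90, l90 = nx_lx(toy_scaffolds, percent=90)
print(f"L50/N50: {l50} scaffolds; {n50}")
print(f"L90/N90: {l90} scaffolds; {n90}")
```

Read this way, the HiC panel's "L90/N90: 11 scaffolds; 32.8 Mb" means the 11 longest scaffolds hold 90% of the assembled sequence and the shortest of those is 32.8 Mb.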

Blight-Resistance QTL Results

Browser for Blight-Resistance QTL: http://www.hardwoodgenomics.org

Number of Genes Identified in the Chestnut Blight Resistance QTLs
QTL | LG | # Genes | # bases / QTL | # Stress-response genes | # Genes / Mbases
cbr1 | B | 994 | 10.14 Mb | 151 | 98
cbr2 | F | 548 | 5.13 Mb | 137 | 58
cbr3 | G | 410 | 4.51 Mb | 139 | 38

15 best candidate disease resistance genes in Chinese chestnut blight-resistance QTL
Seq. Name | Seq. Description | Closest matching NCBI nr
cbr1_scaffold114-gene-0.3-mRNA-1 | transcription factor tga1 | Transcription factor TGA1 (Vitis vinifera)
cbr1_scaffold134-gene-0.0-mRNA-1 | cc-nbs-lrr resistance protein | Putative disease resistance protein RGA3 (Vitis vinifera)
cbr1_scaffold16-gene-0.12-mRNA-1 | rna recognition motif-containing protein | PREDICTED: DAZ-associated protein 1-like (Vitis vinifera)
cbr1_scaffold17-gene-0.29-mRNA-1 | beta-hydroxyacyl-acp dehydratase | predicted protein (Populus trichocarpa)
cbr1_scaffold28-gene-0.12-mRNA-1 | transcription factor tga1 | TGA transcription factor 1 (Populus tremula x Populus alba)
cbr1_scaffold32-gene-0.28-mRNA-1 | 14-3-3-like protein gf14 lambda | hypothetical protein ARALYDRAFT_496774 (Arabidopsis lyrata)
cbr1_scaffold4-gene-0.38-mRNA-1 | multicatalytic endopeptidase complex | proteasome subunit alpha type-7 (Vitis vinifera)
cbr1_scaffold61-gene-0.11-mRNA-1 | disease resistance protein at4g27190-like | PREDICTED: disease resistance protein At4g27190-like
cbr2_scaffold29-gene-0.6-mRNA-1 | cc-nbs-lrr resistance protein | cc-nbs-lrr resistance protein (Populus trichocarpa)
cbr2_scaffold34-gene-0.3-mRNA-1 | feronia receptor-like kinase | /-protein kinase PBS1, putative
cbr2_scaffold3-gene-0.42-mRNA-1 | Protein | PREDICTED: MLO protein homolog 1-like ( max)
cbr2_scaffold5-gene-0.9-mRNA-1 | transferring glycosyl | transferase, transferring glycosyl groups, putative
cbr3_scaffold1-gene-1.1-mRNA-1 | histone- n-methyltransferase ashh2-like | PREDICTED: uncharacterized protein LOC100245350 (Vitis vinifera)
cbr3_scaffold1-gene-1.19-mRNA-1 | set domain protein | PREDICTED: uncharacterized protein LOC100245350 (Vitis vinifera)
cbr3_scaffold28-gene-0.8-mRNA-1 | cysteine proteinase rd19a | Cysteine proteinase RD19a (Arabidopsis thaliana)
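The candidate gene identifiers encode the QTL and scaffold they come from, so the list can be tallied per QTL directly from the sequence names. The sketch below is a minimal illustration: the regular expression and the field names (`qtl`, `scaffold`, `gene`, `mrna`) reflect one reading of the identifier layout and are assumptions, not a documented naming scheme.

```python
import re
from collections import Counter

# The 15 candidate sequence names as listed in the QTL results.
CANDIDATE_IDS = [
    "cbr1_scaffold114-gene-0.3-mRNA-1",
    "cbr1_scaffold134-gene-0.0-mRNA-1",
    "cbr1_scaffold16-gene-0.12-mRNA-1",
    "cbr1_scaffold17-gene-0.29-mRNA-1",
    "cbr1_scaffold28-gene-0.12-mRNA-1",
    "cbr1_scaffold32-gene-0.28-mRNA-1",
    "cbr1_scaffold4-gene-0.38-mRNA-1",
    "cbr1_scaffold61-gene-0.11-mRNA-1",
    "cbr2_scaffold29-gene-0.6-mRNA-1",
    "cbr2_scaffold34-gene-0.3-mRNA-1",
    "cbr2_scaffold3-gene-0.42-mRNA-1",
    "cbr2_scaffold5-gene-0.9-mRNA-1",
    "cbr3_scaffold1-gene-1.1-mRNA-1",
    "cbr3_scaffold1-gene-1.19-mRNA-1",
    "cbr3_scaffold28-gene-0.8-mRNA-1",
]

# Assumed layout: <QTL>_scaffold<N>-gene-<X.Y>-mRNA-<K>
ID_PATTERN = re.compile(r"^(cbr\d+)_scaffold(\d+)-gene-(\d+\.\d+)-mRNA-(\d+)$")

def parse_candidate_id(seq_name):
    """Split a sequence name into its assumed components."""
    match = ID_PATTERN.match(seq_name)
    if match is None:
        raise ValueError(f"unrecognised sequence name: {seq_name!r}")
    qtl, scaffold, gene, mrna = match.groups()
    return {"qtl": qtl, "scaffold": int(scaffold), "gene": gene, "mrna": int(mrna)}

# Count candidate genes per QTL.
candidates_per_qtl = Counter(parse_candidate_id(name)["qtl"] for name in CANDIDATE_IDS)

for qtl in sorted(candidates_per_qtl):
    print(qtl, candidates_per_qtl[qtl])
```

Keeping the gene field as a string preserves the distinction between values such as 1.1 and 1.19, which would be easy to blur if parsed as floats.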

Mahogany Genome Analyses

Mapping of Mahogany and peach genomes. A. all chromosome sequences

Fred Hebard and Laura Georgi with a clone of the reference genome tree ‘Vanuxem’ at Meadowview

Acknowledgments: This project was supported by The Forest Health Initiative (www.foresthealthinitiative.org), The American Chestnut Foundation, the USDA AFRI program, and The Schatz Center for Tree Molecular Genetics at Penn State University.