Tree Genetics & Genomes (2018) 14: 51 https://doi.org/10.1007/s11295-018-1263-z

ORIGINAL ARTICLE

Resources for studies of iron walnut (Juglans sigillata) gene expression, genetic diversity, and evolution

Xiaojia Feng1 & Xiaoying Yuan1 & Yiwei Sun1 & Yiheng Hu1 & Saman Zulfiqar1 & Xianheng Ouyang1 & Meng Dang1 & Huijuan Zhou1 & Keith Woeste2 & Peng Zhao1

Received: 11 December 2017 / Revised: 6 June 2018 / Accepted: 11 June 2018 / Published online: 23 June 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract Iron walnut (Juglans sigillata Dode) is a temperate deciduous tree indigenous to China. It is distributed mainly in southwestern China, where it is valued for its wood and nuts. Transcriptomic and genomic data for the species are limited. Our goal was to assemble the whole chloroplast genome of J. sigillata, to use transcriptome information from RNA-Seq to understand the gene space in J. sigillata, and to develop polymorphic simple sequence repeats (SSRs, microsatellites) useful for understanding the species' population genetics. The chloroplast genome consisted of a large single copy (LSC) region of 89,872 bp, inverted repeat (IR) regions totaling 52,072 bp, and a small single copy (SSC) region of 18,406 bp. The chloroplast genome contained 137 annotated genes, with 71 unique coding regions and eight coding regions that were repeated in the inverted repeat. De novo assembly of the transcriptome yielded 83,112 unigenes with an average length of 686.9 bp. A search against the Gene Ontology (GO) database identified 19,718 unigenes. We evaluated transcriptome-derived microsatellite markers and chloroplast sequence polymorphisms in 48 J. sigillata individuals from three populations and 66 individuals from five other Chinese walnut (Juglans) species. We found 20 expressed sequence tag-SSRs and four loci in the chloroplast that were polymorphic in J. sigillata. The number of alleles per locus ranged from 3 to 10. The whole chloroplast genome and these 24 informative loci will be useful for studies of population genetics, diversity, and genetic structure, and they should benefit future breeding studies of this walnut species.

Keywords Juglans sigillata . Microsatellites . Transcriptome . Chloroplast . Iron . Walnut

Xiaojia Feng and Xiaoying Yuan contributed equally to this work.

Communicated by A. M. Dandekar

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11295-018-1263-z) contains supplementary material, which is available to authorized users.

* Peng Zhao

1 Key Laboratory of Resource Biology and Biotechnology in Western China, Ministry of Education, College of Life Sciences, Northwest University, Xi'an 710069, Shaanxi, China

2 USDA Forest Service Hardwood Tree Improvement and Regeneration Center (HTIRC), Department of Forestry and Natural Resources, Purdue University, 715 West State Street, West Lafayette, IN 47907, USA

Introduction

Walnuts (Juglans) have high economic value due to their high-quality wood, nutritious nuts, and ecological importance (Manning 1978; Han et al. 2016; Pollegioni et al. 2014; Wang et al. 2015). The iron walnut (Juglans sigillata) is a thick-shelled walnut that grows in southwestern China. It is closely related to (and possibly an ecotype of) the widely cultivated common walnut (J. regia) (Pollegioni et al. 2014; Wang et al. 2015; Lu 1982; Gunn et al. 2010). Iron walnut is one of the most desirable and economically valuable hardwood timber tree species in southwestern China, and its fruits, which are edible, have religious and cultural importance (Gunn et al. 2010; Weckerle et al. 2005). To date, most studies of J. sigillata have focused on its genetic diversity and the nutritional value and chemical composition of its nut and other tissues (Wang et al. 2015; Aradhya et al. 2004; Gu et al. 2015). Despite this species' value, existing genomic resources for J. sigillata include only 17 nucleotide sequences and 181 proteins in GenBank (as of April 2017). There are no microsatellites developed from J. sigillata genomic DNA, so genetic studies of J. sigillata (Wang et al. 2015; Gunn et al. 2010) have been limited by the relatively small number of polymorphic markers available, and there are no ESTs (expressed sequence tags) available in GenBank for J. sigillata.

Due to the large effort required to sequence eukaryotic genomes, only a few genomes were fully sequenced before 2014 (Wei et al. 2015). Since then, however, genomic information for a large number of non-model species has become available, mostly because of rapid changes in sequencing technology (Yuan et al. 2017). A draft genome for common walnut (J. regia) is significant because Juglans regia is closely related to iron walnut, as previously mentioned (Martínez-García et al. 2016). There are five complete chloroplast genomes of Juglans species sequenced to date; these are also important genetic resources (Hu et al. 2017). In angiosperms, chloroplast genomes are predominantly maternally inherited and thus can reveal maternal lineages (Birky 1995; Greiner et al. 2015). Chloroplast polymorphisms [indels, single nucleotide polymorphisms (SNPs), and microsatellites] are frequently targeted because they are more easily isolated than nuclear microsatellites and are highly efficient markers for population genetics and studies of evolutionary history (Provana et al. 2001; Wheeler et al. 2014).

Transcriptomic data have often been used for gene discovery and simple sequence repeat (SSR) marker development (Zhang et al. 2012; Jiang et al. 2013). Because some EST-SSRs can be transferred to related taxa, they have value for understanding evolution, for population genetic studies, and for comparative mapping (Ellis and Burke 2007; Bodénès et al. 2012). Aside from the development of EST-SSRs, RNA-Seq also allows the potential recovery of rare transcripts and splicing variants and the determination of gene expression levels in cells or tissues of interest (Kaur et al. 2011; Zhou et al. 2014).

The goals of this study were to determine the complete sequence of the chloroplast genome of J. sigillata, prepare a preliminary characterization of the transcriptome of J. sigillata, and use the complete chloroplast genome and preliminary transcriptome to identify a suite of chloroplast polymorphisms and EST-SSRs for population genetic analysis. A further goal was to demonstrate the utility of the polymorphisms by characterizing three populations of J. sigillata.

Materials and methods

Sample collections, DNA extraction, and RNA extraction

Fresh leaves, buds, and immature fruits of J. sigillata were collected in June 2015 from a single, mature, healthy-appearing tree growing in province and immediately frozen in liquid nitrogen prior to storage at −80 °C. Tissues from this tree were used for RNA extraction. Leaves used for DNA extraction were dried with silica gel, and the DNA was extracted following the methods of Doyle and Doyle (1987) and stored at −20 °C at the Evolutionary Lab, Northwest University, Xi'an, China. All DNA sequences were obtained using Sanger sequencing technology and an ABI 3730. Methods of RNA extraction, quantification, and processing for sequencing were essentially as described previously (Hu et al. 2016). To obtain an overview of the J. sigillata transcriptome, the RNA from different tissues (fresh leaves, vegetative buds, and immature fruits) was extracted separately. Total RNA was extracted using a Plant RNA Kit (OMEGA Bio-Tek, Norcross, GA, USA). RNA degradation and contamination were monitored on 1% agarose gels. RNA purity was assayed using the NanoPhotometer spectrophotometer (IMPLEN, Westlake Village, CA, USA), and RNA concentration was measured using the Qubit RNA Assay Kit and a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). RNA integrity was assessed using the RNA Nano 6000 Assay Kit and Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA). After RNA quantification, we pooled equivalent amounts of all RNA from fresh leaves, vegetative buds, and immature fruits for sequencing.

RNA-Seq library preparation for transcriptome sequencing and gene annotation

Libraries for RNA-Seq were produced using an NEBNext Ultra RNA Library Prep Kit (NEB, Beverly, MA, USA). Paired-end sequencing was performed on the Illumina HiSeq2500 platform to generate 100 bp reads with default parameters by Novogene Bioinformatics Technology Co., Ltd., Beijing, China (www.novogene.cn). The de novo transcriptome was assembled using default settings in Trinity (Grabherr et al. 2011). Unigenes were annotated using data from the NCBI Gene Ontology (GO) and protein family (Pfam) databases. GO annotations were performed in Blast2GO v2.5 with a cutoff e-value of 1e−6 (Conesa et al. 2005). The transcriptome assembly was deposited at NCBI under the BioProject accession no. PRJNA393481 (Short Read Archive, SRR6902320).

Chloroplast genome sequencing, assembly and gene annotation

Raw reads with sequences shorter than 50 bp or with more than 2% ambiguous bases were removed from the total NGS PE reads using the trim tool of the NGSQC toolkit v2.3.3 (Patel and Jain 2012). After trimming, high-quality PE reads were assembled using MIRA (Chevreux et al. 2004). Functional genes were annotated by alignment with genes of their closest relative (J. regia, NCBI accession KT963008; Hu et al. 2016). Genome annotation was performed using Geneious v10.2.2 software (Kearse et al. 2012). The GC content of coding and non-coding regions was determined based on annotation. The annotated cp genomic sequence was deposited into GenBank (KX424843).

Discovery of SSRs and primer design

We identified microsatellites within assembled unigenes using MISA (http://pgrc.ipk-gatersleben.de/misa/misa.html). A sample of unigenes (N = 70) was selected for primer design out of 7118 sequences containing microsatellites that were not single nucleotide repeats but contained greater than five motif repeats. Primers were designed to amplify the 70 selected loci to produce a range of PCR product sizes from 140 to 280 bp (five from 142 to 150 bp, five 160 bp, five 170 bp, five 180 bp, five 190 bp, five 200 bp, five 210 bp, five 220 bp, five 230 bp, five 240 bp, five 250 bp, five 260 bp, and five 280 bp) from genomic DNA (Table S1).

Nine loci (Table 3) were selected from 67 SSR-containing loci identified in the J. sigillata chloroplast genome, and primers were designed to amplify sequences in those regions (Table S2; Table 1). Genomic DNA from 48 samples of J. sigillata was used for PCR amplification and analysis of polymorphism. Primers were designed using Primer3 (http://primer3.sourceforge.net/releases.php). Primer design parameters were as follows: primer length of 18 to 23 nucleotides with 21 as the optimum, amplicon size of 142 to 280 bp, optimum annealing temperature of 58 °C, and GC content of 50–60% with 50% as the optimum (Table 2).

Amplification conditions and validation of primers

To verify the polymorphism of EST-SSRs and for subsequent population genetic studies, we extracted DNA from 48 leaf samples collected in 2015 from J. sigillata trees in three natural populations in three locations (Weixixian, Yunnan Province, TZ; Nanhuaxian, Yunnan Province, NH; Shuichenxian, Guizhou Province, BN) in China (Table 1). DNA was extracted and handled as described above. After drying, DNA was re-suspended in water, diluted with water to 10 ng/μL, and stored at −20 °C until use. PCR amplification conditions and allele scoring were essentially as described in Hu et al. (2016). All 48 genotypes were examined for polymorphism at all 70 EST-SSR loci and nine cp-SSR loci (Table 3).

Table 1 Sources of Juglans sigillata Dode sampled for this study

Collection site | Population code | Species | Sample size | Longitude (E) | Latitude (N)
Weixixian, Yunnan | TZ | J. sigillata | 16 | 99°24′ | 27°36′
Nanhuaxian, Yunnan | NH | J. sigillata | 16 | 101°16′ | 25°11′
Shuichenxian, Guizhou | BN | J. sigillata | 16 | 105°09′ | 26°13′
Total | | | 48 | |

Microsatellites data analysis

Per locus genetic diversity and population-level diversity were evaluated using the program GenAlEx6.5 (Peakall and Smouse 2012) to determine number of alleles (NA) and observed (HO) and expected (HE) heterozygosity. Genetic differentiation of populations (FST) was tested using the program GENEPOP (Raymon 1995). ARLEQUIN (Excoffier and Lischer 2010) was used to test Hardy-Weinberg equilibrium (HWE). The linkage disequilibrium (LD) for all loci was tested using FSTAT (Goudet 2001). Null alleles were identified using MICRO-CHECKER 2.2.3 (Van Oosterhout et al. 2004). To detect and understand genetic structure, we used STRUCTURE software (version 2.3.4) and STRUCTURE HARVESTER. Results were drawn as bar plots using DISTRUCT (Rosenberg 2004) software. Analysis of molecular variance (AMOVA) was performed in GenAlEx6.5 (Peakall and Smouse 2012), and AMOVA results were used to estimate population genetic differentiation (FST) (Dupanloup et al. 2002).

Chloroplast DNA sequence analysis

To characterize the cp-SSR loci, we used a total of 66 individuals of five Chinese walnut (Juglans) species (18 individuals from three J. sigillata populations, 7 individuals of J. regia, 12 individuals of J. hopeiensis, 12 individuals of J. mandshurica, and 12 individuals of J. cathayensis; for details see Table S3). Polymorphism at each locus within and among species was determined by sequencing (Table 3). PCR amplification was carried out on a SimpliAmp Thermal Cycler (Applied Biosystems, USA) in 20 μL reaction volumes (10 μL 2× PCR Master Mix including 0.1 U Taq polymerase/μL, 500 μM each dNTP, 20 mM Tris-HCl (pH 8.3), 100 mM KCl, 3.0 mM MgCl2 (Tiangen, Beijing, China), 0.5 μL each primer, 2 μL BSA (100 mg/mL), 2 μL of 10 ng/μL DNA). The PCR was programmed for 3 min at 94 °C followed by 35 cycles of 15 s at 93 °C, 1 min at annealing temperature (55 °C), 30 s at 72 °C, and extension of 10 min at 72 °C. After PCR amplification, fragments were sequenced by Sangon Biotech (Shanghai, China). The DNA sequence data were edited and aligned using BioEdit v7.0.9 (Hall 1999). DnaSP v5.0 (Librado and Rozas 2009) was used to calculate the number of segregating sites, number of haplotypes/mitotypes, mean number of pairwise differences (K), nucleotide diversity (π), gene diversity within populations (hS), total gene diversity (hT), and parsimony informative sites. AMOVA based on pairwise differences was performed in Arlequin v. 3.5 (Excoffier and Lischer 2010).

Results

Characterization of the complete chloroplast genome

The J. sigillata chloroplast genome is 160,350 bp in length; its organization and gene contents were typical of angiosperms. Overall GC content was 36.1%. The GC content in LSC, SSC, and IR regions was 35.9, 32.1, and 42.7%, respectively. The J. sigillata chloroplast consisted of a large single copy (LSC) region of 89,872 bp, a small single copy (SSC) region of 18,406 bp, and a pair of inverted repeats (IRa and IRb) of 26,036 bp each. The chloroplast genome contained 129 functional genes, including 88 protein-coding genes, 8 rRNA genes (two 4.5S rRNA, two 5S rRNA, two 16S rRNA, and two 23S rRNA), and 40 tRNA genes (Table 4). ycf1 was found in the junction of the IR regions.

Illumina sequencing and de novo assembly

In total, 4.29 GB of sequence quality reads were generated, resulting in a total of 4.2 GB of clean data for assembly. The mean Q30 percentage (sequencing error rate < 0.05%) was 90.99, and GC percentage was 45.16. Using the software Trinity, de novo assemblies generated 170,811 transcripts including 83,112 unigenes. The length of the transcripts varied from 201 to 15,705 bp, with an average of 1294 bp and an N50 value of 2171 bp. The length of the unigenes varied from 201 to 12,402 bp, with an average of 687 bp and an N50 value of 1147 bp (Table S4). Transcripts longer than 500 bp accounted for about 64.7% of all transcripts; the number between 200 and 500 bp in length was 53,574. Unigenes longer than 2000 bp accounted for about 7.2% (Fig. 1).

Microsatellites (SSR) enrichment

In total, 7118 transcriptome unigenes (sequences) contained SSRs, which were 8.6% of all unigenes (83,112). In these loci, mononucleotide repeats were the most frequent (3425; 48.11%), followed by dinucleotide repeats (2649; 37.21%) and trinucleotide repeats (948; 13.32%), while only 96 ESTs contained tetranucleotide, pentanucleotide, or hexanucleotide repeats (Fig. 2; Table S5). The dominant repeat motif in EST-SSRs was A/T (46.82%), followed by AG/CT (14.26%), GA/TC (14.03%), AT/TA (6.71%), GAA/TTC (1.44%), and AGA/TCT (1.38%) (Fig. 2; Table S6). In total, 66 chloroplast sequence fragments from the whole chloroplast genome contained SSRs. Within these 66, mononucleotide repeats were the most frequent (n = 49; 74.2%), followed by complex nucleotide repeats (n = 8; 12.1%) and dinucleotide repeats (n = 5; 7.6%), while only four sequences contained trinucleotide, tetranucleotide, or pentanucleotide repeats (Table S2). SSRs with ten tandem repeats (n = 24; 35.3%) were the most common, followed by 11 tandem repeats (n = 11; 16.2%). The dominant repeat motif in cp-SSRs was A/T (53, 77.9%) (Table S2).

Gene annotation of J. sigillata transcriptome

Based on transcriptome sequence data, we annotated 19,718 of 83,112 unigene sequences (23.72%) to GO classes (Fig. 3). All unigenes were annotated to three major GO classes: the largest class was biological processes (46,625; 49.5%), followed by a molecular function class (25,260; 26.8%) and a cellular component class (22,259; 23.6%). Metabolic process (11,613; 24.9%) and protein phosphorylation (10,932; 23.4%) were the largest and second largest of 20 categories of biological processes. The least abundant were locomotion (5; 0.01%) and cell killing (3; 0.004%). Within the cellular component class, cell part (7803; 35.0%), organelle (4164; 18.7%), and membrane (2787; 12.5%) were the most abundant among the 14 categories (Fig. 3), and the least abundant were extracellular region part (5; 0.02%), virion part (5; 0.02%), and extracellular matrix part (4; 0.01%). Within the molecular function class, catalytic activity (10,900; 43.5%) and binding (10,750; 42.6%) were the most abundant among the 16 categories, and the least abundant were protein tag (13; 0.008%) and translation regulator activity (9; 0.003%) (Fig. 3).

Functional classification by COG

All unigenes were aligned to the clusters of orthologous groups (COG) database to predict and classify their possible functions. A total of 11,909 sequences were assigned to COG classifications (Fig. 4). Among the 25 COG categories, the cluster for general function (2230; 18.7%) represented the largest group, followed by replication, recombination, and repair (1272; 10.7%); transcription (1153; 9.7%); signal transduction mechanisms (977; 8.2%); posttranslational modification, protein turnover, and chaperones (793; 6.6%); and translation, ribosomal structure, and biogenesis (743; 6.2%), whereas only a few unigenes were assigned to cell motility and nuclear structure (Fig. 4).

Table 2 Characterization of 20 transcriptome-derived microsatellites of Juglans sigillata

Locus name | Putative function of unigene | Repeat motif(s) | Primer sequence (5′-3′) | Tm (°C) | Size range (bp) | Na | HO | HE
JS0354 | Protein phosphatase 2C NADPH-dependent FMN reductase flavodoxin-like | (A)10tagagac(AG)8 | F:CGATGGCAAAATTAGAGGGA R:TGGAATTTCGGGTACACCAT | 53 | 280–288 | 4 | 0.20 | 0.62
JS0357 | Protein kinase domain; protein tyrosine kinase | (GCA)7 | F:AGAAAGGGCAGCAAGAACAA R:CGTATCCTTGCCCTTTGTGT | 58 | 153–162 | 4 | 0.32 | 0.51
JS0470 | PREDICTED: uncharacterized protein LOC104607906 [Nelumbo nucifera] | (T)11(A)10 | F:GCAGGGAATAAGAGTGCCAA R:GGACTCACCAGATCAATGGG | 58 | 295–373 | 4 | 0.65 | 0.58
JS0569 | Integral component of mitochondrial outer membrane (GO:0031307) | (GAGAG)5 | F:CAGAGCGTCACCAAAGTTCA R:CCGTCAAATTCATAACGGCT | 58 | 270–290 | 4 | 0.57 | 0.62
JS0632 | Pollen allergen | (A)10(AT)6 | F:TCCACACAGCTTTACCCTCC R:ACAATGTTGCTCCCAAGGAC | 58 | 157–179 | 10 | 0.18 | 0.70
JS0927 | Transferase activity, transferring phosphorus-containing groups (GO:0016772) | (AT)10 | F:CCTCAATTTCGGTGGAGAAA R:TCACGCATTGATTGGTGATT | 54.4 | 229–248 | 6 | 0.53 | 0.61
JS1225 | Domain of unknown function (DUF4283) | (AT)10 | F:CCTCATTGCATTCTTGCTACC R:CAGCTCCAACAACAGCCATA | 54.4 | 254–278 | 6 | 0.60 | 0.48
JS1294 | Binding (GO:0005488) | (TGCA)5 | F:GGCAGTGGGTTTTGAGAGAG R:GCATTACGAGAAGTACGCAGC | 60.6 | 182–206 | 7 | 0.68 | 0.67
JS1752 | Brf1-like TBP-binding domain | (CT)7ccccctttctatatatc(TA)6 | F:GACAAAGGTTCTGCTGTCTGC R:CAGTCGCGAGTAGGGAAAAG | 57.3 | 204–210 | 4 | 0.43 | 0.57
JS2009 | Taurine catabolism dioxygenase TauD, TfdA family | (T)12(A)12 | F:TGCAAGGTTTGAAACGAAGA R:GCTTGGAAAATACAGAGCCC | 58 | 290–352 | 6 | 0.52 | 0.62
JS4710 | PB1 domain | (TGC)7 | F:TGTAGTACACCTGCCTCCCC R:ACGCATACTACGGGGTTCAG | 60.6 | 210–219 | 4 | 0.53 | 0.68
JS5369 | ATP binding (GO:0005524); signal transduction (GO:0007165); nucleoside-triphosphatase activity (GO:0017111) | (CAT)5caccattt(TCA)5 | F:TGATTCTCCCCTTCGTCATC R:TCTGTCTGCGTAGGTTGTGC | 58 | 154–178 | 3 | 0.56 | 0.79
JS5720 | Domain of unknown function (DUF588) | (AAAG)5 | F:TGCGGCAAACAGATACAACT R:TTCAGCTTGGGCTTGGTAGT | 58 | 176–212 | 8 | 0.70 | 0.78
JS7868 | Chitinase class I | (A)10c(AT)7 | F:CTTGATGATCACGATCCCCT R:TATTGAAAAAGGGCAGCCAC | 57.6 | 259–275 | 8 | 0.66 | 0.80
JS8101 | Reverse transcriptase (RNA-dependent DNA polymerase) | (AGAC)5 | F:CAGCAGCCAAGGACTTAACC R:CGATCGATCAACAAAGTCCA | 58.5 | 182–196 | 4 | 0.30 | 0.49
JS8349 | K12129|2.4769e-145|tcc:TCM_043038|Pseudo response regulator isoform 11; K12129 pseudo-response regulator 7 (A) | (C)10g(TC)8 | F:GGGCTCAGACAATGTCACAA R:GCTAATACGAGACGGAAGCG | 59.6 | 167–187 | 4 | 0.22 | 0.45
JS8980 | Reverse transcriptase-like | (TTC)7 | F:AGAACCCAATCCCATCTCAA R:TCGGTGGAAGGTAATCTTGG | 58 | 119–185 | 7 | 0.52 | 0.79
JS9377 | Myb/SANT-like DNA-binding domain | (AT)10 | F:TGGTATGACTCTGGGCTCAA R:TCTTGTCAACGGCATTCAAG | 57.3 | 218–248 | 7 | 0.62 | 0.79
JS9948 | Pyridoxal-phosphate dependent enzyme | (CTTC)8 | F:GTTACTGGGAAATGCCAAGC R:GAAGAACGACAAGTGCACCA | 58 | 264–288 | 7 | 0.42 | 0.68
JS9772 | Hydroxymethylglutaryl-coenzyme A reductase | (TCC)7 | F:TCCTCCACTCAGAGCCAAAC R:CTGCTCATGTCCATCGCTAA | 58 | 173–190 | 3 | 0.04 | 0.50
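The gap between observed and expected heterozygosity in Table 2 can be summarized per locus as a fixation index, F = 1 − HO/HE. The sketch below is only an illustration using three rows copied from Table 2; the function name is hypothetical, and it is not claimed to reproduce the GenAlEx output used in the study.

```python
# (HO, HE) pairs copied from Table 2 for three example loci.
table2 = {
    "JS0354": (0.20, 0.62),
    "JS0470": (0.65, 0.58),
    "JS9772": (0.04, 0.50),
}


def fixation_index(h_obs, h_exp):
    """Return F = 1 - HO/HE; positive values indicate a heterozygote deficit."""
    if h_exp <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - h_obs / h_exp


f_by_locus = {locus: fixation_index(ho, he) for locus, (ho, he) in table2.items()}

for locus, f in f_by_locus.items():
    print(f"{locus}: F = {f:.2f}")
```

A strongly positive F at a single locus is one of the patterns that null-allele checks such as MICRO-CHECKER are meant to flag, so values like these are best read alongside the HWE tests.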

Table 3 Location of nine regions in the chloroplasts of 66 individuals of five Chinese walnuts. Six were polymorphic in all Juglans species, and four were polymorphic in three populations of iron walnut (Juglans sigillata)

Loci | Repeat motifs | Primer sequence (5′-3′) | Length | Gaps | M | S | P | Hd | π
psbM-trnD* | (AATA) | F:TTCTCCTATCAATCGGGACT R:ATATCACTCGCGTTTCTGTT | 418 | 11/2 | 5/6 | 3/3 | 2/2 | 0.658/0.719 | 0.003/0.003
trnM-CAU | (TTTATA)3 | F:AAAGTCGTCTCCTTGAATCC R:CTCTTTGTCACATCCGTTCT | 596 | 0 | 0 | 0 | 0 | 0 | 0
trnG-trnR | (ATT)5 | F:GCCTCACTTACTGGTATCTG R:TGAACAACAAAGAGCGAGTA | 538 | 0 | 0 | 0 | 0 | 0 | 0
trnE-UUC | (AT)7 | F:TGGGTTGGTCTTGAAACAAT R:GGATTCAAGGAGACGACTTT | 592 | 16/0 | 11/1 | 3/1 | 7/0 | 0.602/0.111 | 0.004/0.0002
rpl2 | (TA)6 | F:CCAAACTTTTCTGGTTCACC R:AGAGGAGGACGGATTATTCA | 649 | 2/0 | 0/0 | 0 | 0 | 0 | 0
ycf3-trnS* | (ATT)5 | F:AAACATCAAATCGCACCATC R:ATGCACACAACACAAAACAG | 516 | 7/1 | 2/0 | 2/0 | 0/0 | 0.030/0.000 | 0.0001/0
trnF-ndhJ | (ATA)4 | F:TTGGCTCAGTTTATCCGAAA R:TCTGTTTTCTGGGTTTGGAA | 469 | 29/0 | 11/0 | 0/0 | 11/0 | 0.536/0 | 0.015/0
psbA-trnK* | (TGAA)3 | F:AAACAGGTTCACGAATACCA R:TGTTTATGCAGCAAGAGAGT | 692 | 7/2 | 4/4 | 1/4 | 3/0 | 0.343/0.111 | 0.0008/0.0007
rpoC1* | (ATAAA)3 | F:CGGCTCAAGTAGTTACATCA R:GGGAAGTAGGGACCTAAAAG | 413 | 36/1 | 1/2 | 1/0 | 0/2 | 0.030/0.621 | 0.00009/0.002

Length = PCR fragment length (bp); Gaps = sites with alignment gaps or missing data; M = total number of mutations; S = singleton variable sites; P = parsimony informative sites; Hd = haplotype (gene) diversity; π = nucleotide diversity. The number to the left of the slash indicates results based on 66 individuals in all five Chinese walnut species; the number to the right of the slash indicates results based on three populations of iron walnut. * indicates that the locus was polymorphic in three iron walnut populations
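As a rough illustration of how the Hd and π columns of Table 3 are defined, the sketch below computes haplotype diversity, the mean number of pairwise differences (K), and nucleotide diversity for a small alignment. The four sequences are hypothetical toy data, and the code assumes a gap-free alignment of equal-length sequences; DnaSP, which was used in the study, handles gaps and missing data in ways this sketch does not.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(seqs)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))


def mean_pairwise_differences(seqs):
    """K: average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s, t)) for s, t in pairs]
    return sum(diffs) / len(pairs)


def nucleotide_diversity(seqs):
    """pi: mean pairwise differences per site (K divided by alignment length)."""
    return mean_pairwise_differences(seqs) / len(seqs[0])


# Hypothetical aligned sequences, not data from the study.
toy = ["ACGT", "ACGT", "ACGA", "TCGA"]

print(f"Hd = {haplotype_diversity(toy):.3f}")
print(f"K  = {mean_pairwise_differences(toy):.3f}")
print(f"pi = {nucleotide_diversity(toy):.3f}")
```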

EST-SSR primer screening and verification

Among the 70 microsatellite loci evaluated, only 20 (28.5%) were successfully amplified and also were polymorphic among the three J. sigillata populations tested (Table 2; Fig. 5). The polymorphic EST-SSRs ranged from dinucleotide repeats to hexanucleotide repeats; trinucleotide repeats were not represented, which was surprising because they were relatively common overall (Fig. 2). The number of alleles (NA) for each locus ranged from 3 to 10, and the mean number of alleles per locus was 5.39. We observed no obvious association between repeat unit size (di- vs penta-nucleotide repeats, for example) and polymorphism or allele number. The observed heterozygosity (HO) and expected heterozygosity (HE) varied from 0.04 to 0.742 and from 0.125 to 0.893, with an average of 0.316 and 0.672, respectively (Table 2). Populations TZ, NH, and BN had 13, 13, and 10 private alleles, respectively (Table S8). A private allele was identified at high frequency (> 0.25) at the TZ sample site at four loci (JS8980, JS0354, JS4710, and JS7868); in NH at JS4710, JS0354, and JS0632 (allele frequencies were 0.457, 0.385, and 0.281, respectively); and in BN at JS0357 and JS5369 (allele frequencies were 0.313 and 0.300, respectively) (Table S7). The 20 polymorphic SSR-containing unigenes were assigned putative biological function based on BLAST search in the NCBI protein databases (Table 2). The unigene (JS0470) matched to "hypothetical protein" or "PREDICTED" results in the database (Table 2).

Seven loci out of the 20 polymorphic EST-SSRs (JS0357, JS0632, JS1294, JS1752, JS7868, JS9772, and JS9948) showed significant departure from Hardy-Weinberg equilibrium (HWE) across all samples (Table S9). Only JS0470 and JS1225 were in HWE in all three populations (Table S9). Bayesian clustering of 48 individuals from three natural populations identified K = 3 as the most likely number of populations for the sampled genotypes (Fig. 6). A secondary peak at K = 6 was observed; the delta K for the secondary peak was less than that for K = 3 (Fig. 6a). The presence of population substructure within sample sites TZ and BN and admixture between NH and BN was possibly indicated at K = 4 to K = 8 (Fig. 6a). The three individuals (3/16 = 18.75%) from population BN were mixed into genetic cluster 2 with 16 individuals (16/16 = 100%) from population NH based on STRUCTURE analysis (K = 3), while a total of five individuals (4/16 = 25.00%) from population BN were mixed into genetic cluster 4 with five individuals (4/16 = 31.25%) from population NH based on STRUCTURE analysis (K = 6) (Table S11). Three populations (TZ, NH, and BN) were also discriminated along the first two coordinates of the principal coordinate analyses, which accounted for 18.64% of the observed variance (Fig. S1).

The microsatellites we developed were polymorphic enough to distinguish among the 48 individuals and to reveal affinities and differences among the three sampled populations (Fig. 6; Table S10). The subpopulation of NH and BN was from a geographically distinct area (Fig. S1).

Table 4 Gene contents in Juglans sigillata chloroplast genomes

Category | Gene group | Gene name
Self-replication | Ribosomal RNA genes | rrn4.5 rrn5^a rrn16 rrn23
Self-replication | Transfer RNA genes | trnA-UGC^a trnC-GCA trnD-GUC trnE-UUC trnF-GAA trnfM-CAU trnG-GCC trnG-UCC trnH-GUG trnI-CAU^a trnI-GAU^a trnK-UUU trnL-CAA^a trnL-UAA trnL-UAG trnM-CAU^a trnN-GUU^a trnP-GGG trnP-UGG trnQ-UUG trnR-ACG^a trnR-UCU trnS-GCU trnS-GGA trnS-UGA trnT-GGU^a trnT-UGU trnV-GAC^a trnV-UAC trnW-CCA trnY-GUA
Self-replication | Small subunit of ribosome | rps2 rps3 rps4 rps7 rps8 rps11 rps12 rps14 rps15 rps16 rps18 rps19
Self-replication | Large subunit of ribosome | rpl2 rpl14 rpl16 rpl20 rpl22 rpl23 rpl32 rpl33 rpl36
Self-replication | DNA-dependent RNA polymerase | rpoA rpoB rpoC1 rpoC2
Self-replication | Translation initiation factor | infA
Genes for photosynthesis | Subunits of photosystem I | psaA psaB psaC psaI psaJ ycf3 ycf4
Genes for photosynthesis | Subunits of photosystem II | psbA psbB psbC psbD psbE psbF psbH psbI psbJ psbK psbL psbM psbN psbT
Genes for photosynthesis | Subunits of cytochrome b/f complex | petA petB petD petG petL petN
Genes for photosynthesis | Subunits of ATP synthase | atpA atpB atpE atpF atpH atpI
Genes for photosynthesis | Subunits of Rubisco | rbcL
Other genes | Maturase | matK
Other genes | Protease | clpP
Other genes | Envelope membrane protein | cemA
Other genes | Subunit of acetyl-CoA-carboxylase | accD
Other genes | C-type cytochrome synthesis gene | ccsA
Other genes | Pseudogene | infA
Genes of unknown function | Conserved open reading frames | ycf1 ycf2 ycf4 ycf15

^a Duplicated genes

Populations NH and BN showed admixture (Fig. 6), and they shared chloroplast haplotype H1 (Fig. 7; Table S3) due to gene flow from pollen and seeds. The moderate to high genetic differentiation among populations (FST TZ vs NH = 0.12, TZ vs BN = 0.10, BN vs NH = 0.13, respectively) of J. sigillata (Table S10, Fig. 6) was reflected in the relatively large numbers of private nuclear alleles present at relatively high frequencies (Table S7). Genetic variation within populations was 73.0% for EST-SSRs. Based on chloroplast polymorphisms, 94% of the variation in chloroplasts was within populations (Table 5).

Polymorphic chloroplast loci and verification

Of nine cp-SSRs evaluated, all could be easily and consistently amplified in all five Juglans species (Table 3). We found high levels of genetic variation in the chloroplasts among the five Chinese walnut species (section Juglans/Dioscaryon, J. regia and J. sigillata; section Cardiocaryon, J. cathayensis, J. mandshurica, and J. hopeiensis) based on sequence polymorphism at six loci (Fig. 7; Fig. S2; Table S3). Among the nine chloroplast loci evaluated, four were polymorphic among the three J. sigillata populations tested, and six (66.7%) were polymorphic among five Chinese walnut (Juglans) species (Table 3; Fig. 7; Fig. S3; Table S2). The intergenic spacer region psbM-trnD was exceptionally polymorphic in iron walnut, with high levels of genetic variation among populations and individuals (Fig. S3). psbM-trnD, trnE-UUC, ycf3-trnS, and rpoC1 showed 7, 1, 1, and 1 single nucleotide polymorphisms (SNPs), respectively (Fig. S3). Polymorphism within the sequence of the chloroplast

Fig. 1 The length distributions of the a transcripts and b unigenes of Juglans sigillata. Both panels show transcript or unigene number on the y-axis and length (nt) on the x-axis in 100-nt bins, with a final bin for sequences longer than 3000 nt

SSR markers constituted three haplotypes in J. sigillata, H1 to H3 (Fig. 7; Table S3). Haplotype H2 was common to J. regia, J. hopeiensis, and J. sigillata (Fig. 7; Table S3, Fig. S2). Haplotypes H7 to H13 were found in J. cathayensis, haplotypes H16, H20, and H21 were found in J. mandshurica, and six haplotypes H2, H14, H15, H16, H17, H18, and H19 were found in J. hopeiensis (Fig. 7; Table S3).
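The haplotype lists in this paragraph can be cross-checked for sharing among species with a small set-based sketch. Only haplotypes named in the text are entered, so the table is incomplete by construction: the full haplotype set of J. regia is not given here, and only H2 is listed for it. The variable names are illustrative.

```python
from collections import defaultdict

# Haplotypes per species as named in the text; J. regia is incomplete
# because only H2 is mentioned for it in this paragraph.
haplotypes_by_species = {
    "J. sigillata": {"H1", "H2", "H3"},
    "J. regia": {"H2"},
    "J. cathayensis": {f"H{i}" for i in range(7, 14)},
    "J. mandshurica": {"H16", "H20", "H21"},
    "J. hopeiensis": {"H2", "H14", "H15", "H16", "H17", "H18", "H19"},
}

# Invert the mapping so each haplotype points to the species carrying it.
species_by_haplotype = defaultdict(set)
for species, haplotypes in haplotypes_by_species.items():
    for hap in haplotypes:
        species_by_haplotype[hap].add(species)

# Keep only haplotypes found in more than one species.
shared = {
    hap: sorted(species)
    for hap, species in species_by_haplotype.items()
    if len(species) > 1
}

for hap in sorted(shared):
    print(hap, "shared by", ", ".join(shared[hap]))
```

On the lists as printed, H2 and H16 are the only haplotypes that appear in more than one species.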

Fig. 2 Characterization of portions of the iron walnut (Juglans sigillata) transcriptome that contained microsatellites (SSRs). a Distribution of EST-SSRs containing dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats. b Frequency with which motif types (ranging from dinucleotide to hexanucleotide) were repeated at a locus from five to 12 times. c Frequency distribution of major SSRs based on main motif type and d distribution of 70 EST-SSR motifs in the J. sigillata transcriptome analyzed in detail in this study. Panels c and d plot Frequency (N) against EST-SSR repeat motif
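As an illustration of how SSR classes like those summarized in Fig. 2 can be tallied, the following is a minimal regular-expression sketch. It is not the detection software used in this study. The motif sizes (2 to 6 nt) and the minimum of five repeats are assumptions taken from the figure caption, and the function name and test sequence are hypothetical.

```python
import re


def find_ssrs(seq, min_repeats=5, motif_sizes=range(2, 7)):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    A hit is a motif of length k (k in motif_sizes) repeated at least
    min_repeats times in a row. Motifs that are themselves repeats of a
    shorter unit (e.g. "AGAG" or "AA") are skipped so each locus is
    reported once, under its shortest motif.
    """
    seq = seq.upper()
    hits = []
    for k in motif_sizes:
        # One captured motif followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            is_compound = any(
                motif == motif[:d] * (k // d)
                for d in range(1, k)
                if k % d == 0
            )
            if is_compound:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Hypothetical sequence with one dinucleotide and one trinucleotide SSR.
example = "CCC" + "AG" * 6 + "AA" + "TTC" * 5 + "GG"
ssrs = find_ssrs(example)
```

Because the scan reports the leftmost match, the reported motif depends on the flanking bases (an AG repeat preceded by G would be reported as GA). Grouping complementary and shifted motifs, as in the AG/CT classes discussed in the text, would be a separate normalization step.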

Fig. 3 Gene Ontology classification of assembled unigenes. The results are summarized in three main categories: Biological process, Cellular component and Molecular function. A total of 19,718 unigenes with BLAST matches to known proteins were assigned to Gene Ontology. Axes: Percentage of genes (left) and Number of genes (right)

Analysis of molecular variance using EST-SSR and cp-SSR loci

Microsatellites clearly distinguished three J. sigillata populations (FST = 0.269, p < 0.001) (Table 5), a result corroborated by analysis of molecular variance (AMOVA) (Table 5). AMOVA of the chloroplast sequence polymorphisms among the three sampled J. sigillata populations showed evidence of genetic differentiation based on polymorphism at four cp-SSR loci (FST = 0.060; p < 0.05) (Table 5).

Discussion

This research presents extensive DNA sequence data for J. sigillata (5.04 GB of raw reads, 4.77 GB of nucleotide sequence), a species with poor representation in public databases. By pooling RNA from leaves, vegetative buds, and fruit, we hoped to obtain a broad sample of expressed sequences (Sloan et al. 2012), although samples from other organs and tissues (roots, bark, catkins) would no doubt have revealed additional unigenes. In the end, our transcriptome quality was high (Q20 of 97.22%), and the average transcript length was 747 bp, similar to results from other species (Jiang et al. 2015; Adal et al. 2015).

Dinucleotide repeats (DNR) were the most frequent among EST-SSRs, aside from single nucleotide repeats. This finding was consistent with results reported previously for common walnut (J. regia) (Zhang et al. 2010; Dang et al. 2016), Manchurian walnut (J. mandshurica) (Hu et al. 2016), Chinese walnut (J. cathayensis) (Dang et al. 2015), Ma walnut (J. hopeiensis) (Hu et al. 2015), and numerous other species (Durand et al. 2010; Arbeiter et al. 2017), but it is not observed in all species (Adal et al. 2015; Wei et al. 2015). Trinucleotide repeats were the most abundant EST-SSR marker for cabbage, red clover, and other plants (Izzah et al. 2014). AG/CT repeats were the most common dinucleotide motif in iron walnut, accounting for 38.2% of the total DNRs, although this was a much lower percentage than observed in J. mandshurica

(58.1%) (Hu et al. 2016), J. regia (Zhang et al. 2010), (AG/CT 7929; 57.7%) (Dang et al. 2016), and J. hopeiensis (Dang et al. 2015) (AG/CT 32.77%).

The four polymorphic chloroplast markers psbM-trnD, trnE-UUC, ycf3-trnS, and rpoC1 showed 7, 1, 1, and 1 single nucleotide polymorphisms (SNPs), respectively, and should prove useful for further iron walnut population genetics and evolutionary history studies. They may also prove useful for genetic studies of other Juglans species (Fig. 7; Table S3). For example, we were surprised to observe 7 haplotypes in 12 individuals of J. cathayensis based on the nine cp-SSR loci examined in this study (Fig. 7; Table S3). The five Chinese Juglans species were divided into two clades corresponding to the two sections (Juglans/Dioscaryon and Cardiocaryon) based on sequence variability at the six cp-SSR loci (Fig. 7; Table S3). The same five Chinese species were divided into two clades corresponding to the two sections with 100% bootstrap support based on whole chloroplast genome sequences of five samples (Hu et al. 2017). Haplotype H1 was found only in J. sigillata (not J. regia), but its position was intermediate between H2 and H4, which were both J. regia chloroplast haplotypes, which probably indicates that H3 is not now or historically was not limited to J. sigillata. The origin of J. sigillata haplotype H1, which was present in population TZ only, is less clear, as it appears to be intermediate between the dominant Dioscaryon (H2) haplotype and two haplotypes found only in J. cathayensis (Cardiocaryon; Fig. 7).

From the large pool of ESTs we identified, we randomly sampled 70 unigenes that contained microsatellites. Of these, 20 (28.6%) could be reproducibly amplified and were polymorphic (Table 2). Some of these EST-SSRs were highly polymorphic in iron walnut. Departures from HWE were common (Table S9) and may have been caused by several factors, the most likely of which include the presence of null alleles, selection, immigration of alleles, and sampling from small populations. Potential explanations for why the remaining 50 EST-SSRs did not amplify (n = 46) or appeared monomorphic (n = 4) are many (Hu et al. 2016), but the most likely reason for non-amplification was polymorphism in the primer binding sites (poor primer binding). Monomorphism at a microsatellite is common; it can be an artifact of insufficient sampling or may reflect the current genetic variation at a particular locus.

The advantages of EST-derived SSRs over those from (non-coding) nuclear genomes are many. EST-SSRs are typically more transferable to related species (Ellis and Burke 2007), and polymorphisms in EST-SSRs may be used as a basis to sample for differences in the expression of genes, indicating adaptive differences among populations (Bradbury et al. 2013). In the particular case of iron walnut, they might be an excellent tool for studying speciation and population genetic differentiation between iron walnut and common walnut.

Fig. 4 Histogram presentation of clusters of orthologous groups (COG) classification (COG Function Classification of Consensus Sequence; Frequency against Function Class). All unigenes were aligned to the COG database to predict and classify possible functions. Out of 27,435 Nr hits, 18,001 sequences were assigned to 26 COG classifications. RNA processing and modification (A); chromatin structure and dynamics (B); energy production and conversion (C); cell cycle control, cell division, chromosome partitioning (D); amino acid transport and metabolism (E); nucleotide transport and metabolism (F); carbohydrate transport and metabolism (G); coenzyme transport and metabolism (H); lipid transport and metabolism (I); translation, ribosomal structure and biogenesis (J); transcription (K); replication, recombination, and repair (L); cell wall/membrane/envelope biogenesis (M); cell motility (N); posttranslational modification, protein turnover, chaperones (O); inorganic ion transport and metabolism (P); secondary metabolites biosynthesis, transport and catabolism (Q); general function prediction only (R); function unknown (S); signal transduction mechanisms (T); intracellular trafficking, secretion, and vesicular transport (U); defense mechanisms (V); extracellular structures (W); unnamed protein (X); nuclear structure (Y); and cytoskeleton (Z)

Fig. 5 PCR products and polymorphic patterns of two EST-SSR markers across 48 individuals of J. sigillata. M indicates size ladders. a, b Panels of PCR products for individuals genotyped using primers for the loci JS8101 and JS5069, respectively. Lanes 1–48 refer to the different individuals of J. sigillata from three populations (see Table 1). Arrows indicate diagnostic product size

Fig. 6 Results of STRUCTURE analysis (Pritchard et al. 2000) and color-coded grouping from K = 3 to 6 as determined using the delta K method of Evanno et al. (2005). a The proportions of cluster memberships for each individual are represented by a vertical bar; samples were grouped according to their population of origin (TZ, NH, BN) as described in Table 1. b Distribution of delta K for K = 2 to 10 to determine the true number of populations (K) as described in Evanno et al. (2005). c Mean log likelihood of the data at varying estimates of K

J. sigillata showed a high level of among-population differentiation (FST = 0.269, p < 0.001) based on EST-SSRs; each of the three populations was more or less equally genetically distant from the other two (Table S10). The differentiation estimate based on four chloroplast SSRs was more modest (FST = 0.060, p < 0.05; Table 5). However, subpopulations of J. sigillata showed the highest genetic differentiation (FST = 0.328, p < 0.001) between cluster1 and cluster6 when K = 6, the highest genetic differentiation (FST = 0.356, p < 0.001) between cluster3 and cluster7 when K = 7, and the highest genetic differentiation (FST = 0.339, p < 0.001) between cluster4 and cluster8 when K = 8 based on STRUCTURE analysis (Table S10). The value of FST we observed was considerably larger than a previous estimate based on four populations of J. sigillata (11.07%, p < 0.0001) (Wang et al. 2015). However, Wang et al. used samples of J. sigillata and J. regia from

Tibet, and their samples probably included J. regia × J. sigillata hybrids. Our samples were from Guizhou and Yunnan provinces. One possible explanation for the relatively high FST we observed is that by using EST-SSRs, our samples captured variation caused by selection (Narasimhamoorthy et al. 2008). The spatial genetic structure of J. sigillata is not sufficiently understood to predict how differences in sample sources may have contributed to differences in estimates of population differentiation.

Fig. 7 The relationships among 21 chloroplast haplotypes (H1 to H21) of Juglans including three populations of iron walnut and four other Juglans species based on Network software. Jr J. regia, Js J. sigillata, Jm J. mandshurica, Jc J. cathayensis, Jh J. hopeiensis

PCA and STRUCTURE analysis both showed that the J. sigillata samples from BN, NH, and TZ represent three populations (Fig. S1; Fig. 6). Our analysis also showed a slight indication that there may be substructure within our samples. We observed a slight increase in delta K at K = 6, which, based on further analysis, may show population substructure at sample sites TZ and BN and admixture between NH and BN (Fig. 6). More extensive sampling would be required to resolve population structure of J. sigillata, including, potentially,

hybrid zones with J. regia. Surprisingly, although walnuts are wind-pollinated, we observed higher levels of differentiation at nuclear loci (EST-SSRs) than we observed based on maternally inherited chloroplast polymorphisms (Table 5). These results probably indicate higher levels of seed flow among populations than pollen flow. The best explanation for this result is that although the populations are isolated by high mountains and steep ravines, seed dispersers (probably humans) are transporting seeds among locations (Table 5; Fig. 6; Fig. S1). Gunn et al. (2010) and Pollegioni et al. (2017) found that human movement of walnut seeds contributed to the genetic structure of (apparently) wild populations (Gunn et al. 2010; Pollegioni et al. 2014).

Table 5 Analysis of molecular variance (AMOVA) results for three populations of J. sigillata using 20 EST-SSR and four polymorphic chloroplast loci

Type of genetic markers   Source               df   SS             Est. var.     P (%)
EST-SSR                   Among populations     2   3,613,273.58    51,994.97    26.9
                          Within populations   93   3,280,183.75   142,797.68    73.1
                          Total                95   6,893,457.33   194,792.64   100
                          FST = 0.269**
Chloroplast loci          Among populations     2          46.18         1.07     6.02
                          Within populations   15         250.17        16.68    93.98
                          Total                17         296.33        17.74   100
                          FST = 0.060*

df degrees of freedom, SS sum of squares, MS mean square, Est. var. variance components, P percentage of molecular variance. *p < 0.05; **p < 0.001

The basis for genetic divergence among J. sigillata populations is most likely isolation by distance combined with regional adaptation. The heavy seeds of Juglans species are usually not dispersed far (except when dispersed by humans), but geographic isolation in Juglans can be overcome by long-distance pollen flow (Victory et al. 2006), which typically minimizes genetic divergence among wild populations in the (Pollegioni et al. 2014; Bai et al. 2010; Hoban et al. 2010). It is possible that the EST-SSRs we used to analyze the population genetics of J. sigillata produced results that reflect selection pressure on the genetic regions in which the SSRs reside. A full understanding of the genetic structure and diversity of iron walnut will require additional sampling and the establishment of common gardens. The EST-SSRs described here, and the other EST-SSRs found within the transcriptome data we generated, could prove useful for understanding local adaptation, gene flow, and genetic structure in Manchurian walnut and other members of the Juglandaceae as well.

The transcriptome of J. sigillata should assist future Juglans genomics studies and especially studies of the relationships between gene expression and walnut development. It may be particularly useful for understanding the genomics of Juglans regia, an important commercial species and iron walnut’s closest relative.

Acknowledgments Mention of a trademark, proprietary product, or vendor does not constitute a guarantee or warranty of the product by the U.S. Department of Agriculture and does not imply its approval to the exclusion of other products or vendors that also may be suitable.

Funding information This study was funded by the National Natural Science Foundation of China (41471038, 31200500), the Program for Excellent Young Academic Backbones funding by Northwest University (338050070), and the Northwest University Training Programs of Innovation and Entrepreneurship for Graduates (Nos. 2016002, 20171037, 2018298).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

Adal AM, Demissie ZA, Mahmoud SS (2015) Identification, validation and cross-species transferability of novel Lavandula EST-SSRs. Planta 241:987–1004
Aradhya MK, Potter D, Simon CJ (2004) Origin, evolution, and biogeography of Juglans: a phylogenetic perspective. Int Walnut Symposium 4:85–94
Arbeiter AB, Hladnik M, Jakše J, Bandelj D (2017) Identification and validation of novel EST-SSR markers in olives. Sci Agric 74:215–225
Bai WN, Liao WJ, Zhang DY (2010) Nuclear and chloroplast DNA phylogeography reveal two refuge areas with asymmetrical gene flow in a temperate walnut tree from East Asia. New Phytol 188:892–901
Birky CW (1995) Uniparental inheritance of mitochondrial and chloroplast genes: mechanisms and evolution. Proc Nat Acad Sci 92:11331–11338
Bodénès C, Chancerel E, Gailing O, Vendramin GG, Bagnoli F, Durand J, Goicoechea PG, Soliani C, Villani F, Mattioni C, Koelewijn HP, Murat F, Salse J, Roussel G, Boury C, Alberto F, Kremer A, Plomion C (2012) Comparative mapping in the Fagaceae and beyond with EST-SSRs. BMC Plant Biol 12:153
Bradbury D, Smithson A, Krauss SL (2013) Signatures of diversifying selection at EST-SSR loci and association with climate in natural Eucalyptus populations. Mol Ecol 22:5112–5129
Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WE, Wetter T, Suhai S (2004) Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res 14:1147–1159
Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21:3674–3676
Dang M, Liu ZX, Chen X, Zhang T, Zhou HJ, Hu YH, Zhao P (2015) Identification, development, and application of 12 polymorphic EST-SSR markers for an endemic Chinese walnut (Juglans cathayensis L.) using next-generation sequencing technology. Biochem Syst Ecol 60:74–80
Dang M, Zhang T, Hu Y, Zhou H, Woeste KE, Zhao P (2016) De novo assembly and characterization of bud, leaf and flowers transcriptome from Juglans regia L. for the identification and characterization of new EST-SSRs. Forests 7:247
Doyle JJ, Doyle JL (1987) A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem Bull 19:11–15
Durand J, Bodénès C, Chancerel E, Frigerio JM, Vendramin G, Sebastiani F, Buonamici A, Gailing O, Koelewijn H, Villani F, Mattioni C, Cherubini M, Goicoechea PG, Herrán A, Ikaran Z (2010) A fast and cost-effective approach to develop and map EST-SSR markers: oak as a case study. BMC Genomics 11:570
Dupanloup I, Schneider S, Excoffier L (2002) A simulated annealing approach to define the genetic structure of populations. Mol Ecol 11:2571–2581
Ellis JR, Burke JM (2007) EST-SSRs as a resource for population genetic analyses. Heredity 99:125–132
Excoffier L, Lischer HEL (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10:564–567
Goudet J (2001) FSTAT (version 2.9.3.): a program to estimate and test gene diversities and fixation indices. www.unil.ch/izea/softwares/fstat.html
Grabherr M, Haas B, Yassour M, Levin J, Thompson D, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, Palma F, Birren B, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652
Greiner S, Sobanski J, Bock R (2015) Why are most organelle genomes transmitted maternally? Cell Mol Biol 37:80–94

Gu M, Chen HP, Zhao MM, Wang X, Yang B, Ren JY, Su GW (2015) Identification of antioxidant peptides released from defatted walnut (Juglans Sigillata Dode) meal proteins with pancreatin. LWT-Food Sci Technol 60:213–220
Gunn BF, Aradhya M, Salick JM, Miller AJ, Yang YP, Liu L, Hai H (2010) Genetic variation in walnuts (Juglans regia and J. sigillata; Juglandaceae): species distinctions, human impacts, and the conservation of agrobiodiversity in Yunnan, China. Am J Bot 97:660–671
Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 41:95–98
Han H, Woeste KE, Hu YH, Dang M, Zhang T, Gao XX, Zhou HJ, Feng XJ, Zhao GF, Zhao P (2016) Genetic diversity and population structure of common walnut (Juglans regia) in China based on EST-SSRs and the nuclear gene phenylalanine ammonia-lyase (PAL). Tree Genet Genomes 12:111
Hoban SM, Borkowski DS, Brosi SL, McCleary T, Thompson LM, Mclachlan JS, Pereira MA, Schlarbaum SE, Romero-severson J (2010) Range-wide distribution of genetic diversity in the North American tree Juglans cinerea: a product of range shifts, not ecological marginality or recent population decline. Mol Ecol 19:4876–4891
Hu YH, Woeste KE, Zhao P (2017) Completion of the chloroplast genomes of five Chinese Juglans and their contribution to chloroplast phylogeny. Front Plant Sci 7:1955
Hu YH, Zhao P, Zhang Q, Wang Y, Gao XX, Zhang T, Zhou HJ, Dang M, Woeste KE (2015) De novo assembly and characterization of transcriptome using Illumina sequencing and development of twenty five microsatellite markers for an endemic tree Juglans hopeiensis Hu in China. Biochem Syst Ecol 63:201–211
Hu Z, Zhang T, Gao XX, Wang Y, Zhang Q, Zhou HJ, Zhao GF, Wang ML, Woeste KE, Zhao P (2016) De novo assembly and characterization of the leaf, bud, and fruit transcriptome from the vulnerable tree Juglans mandshurica for the development of 20 new microsatellite markers using Illumina sequencing. Mol Gen Genomics 291:849–862
Izzah NK, Lee J, Jayakodi M, Perumal S, Jin M, Park B, Ahn K, Yang T (2014) Transcriptome sequencing of two parental lines of cabbage (Brassica oleracea L. var. capitata L.) and construction of an EST-based genetic map. BMC Genomics 15:149
Jiang B, Xie D, Liu W, Peng Q, He X (2013) De novo assembly and characterization of the transcriptome, and development of SSR markers in wax gourd (Benicasa hispida). PLoS One 8:e71054
Jiang Q, Wang F, Tan HW, Li MY, Xu ZS, Tan GF, Xiong S (2015) De novo transcriptome assembly, gene annotation, marker development, and miRNA potential target genes validation under abiotic stresses in Oenanthe javanica. Mol Gen Genomics 290:671–683
Kaur S, Cogan NO, Pembleton LW, Shinozuka M, Savin KW, Materne M, Forster JW (2011) Transcriptome sequencing of lentil based on second-generation technology permits large-scale unigene assembly and SSR marker discovery. BMC Genomics 12:265
Kearse M, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C, Thierer T, Ashton B, Meintjes P, Drummond A (2012) Geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28:1647–1649
Librado P, Rozas J (2009) DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics 25:1451–1452
Lu AM (1982) The geographical dispersal of Juglandaceae. Acta Phytotaxon Sin 20:257–274
Manning WE (1978) The classification within the Juglandaceae. Ann Mo Bot Gard 65:1058–1087
Martínez-García PJ, Crepeau MW, Puiu D, Gonzalez-Ibeas D, Whalen J, Stevens KA, Paul P, Butterfield TS, Britton MT, Regan RL, Chakraborty S, Walawage SL, Vasquez-Gross HA, Cardeno C, Famula RA, Pratt K, Kuruganti S, Aradhya MK, Leslie CA, Dandekar AM, Salzberg SL, Wegrzyn JL, Langley CH, Neale DB (2016) The walnut (Juglans regia) genome sequence reveals diversity in genes coding for the biosynthesis of nonstructural polyphenols. Plant J 87:507–532
Narasimhamoorthy B, Saha MC, Swaller T, Bouton JH (2008) Genetic diversity in switchgrass collections assessed by EST-SSR markers. Bioenergy Res 1:136–146
Patel RK, Jain M (2012) NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 7:e30619
Peakall R, Smouse PE (2012) GenAlEx 6.5: genetic analysis in Excel. Population genetic software for teaching and research - an update. Bioinformatics 28:2537–2539
Pollegioni P, Woeste KE, Chiocchini F, Olimpieri I, Tortolano V, Clark J, Hemery GE, Mapelli S, Malvolti ME (2014) Landscape genetics of Persian walnut (Juglans regia L.) across its Asian range. Tree Genet Genomes 10:1027–1043
Pollegioni P, Woeste K, Chiocchini F, Del Lungo S, Ciolfi M, Olimpieri I, Tortolano V, Clark J, Hemery GE, Mapelli S, Malvolti ME (2017) Rethinking the history of common walnut (Juglans regia L.) in Europe: its origins and human interactions. PLoS One 12:e017254
Provan J, Powell W, Hollingsworth PM (2001) Chloroplast microsatellites: new tools for studies in plant ecology and evolution. Trends Ecol Evol 16:142–147
Raymond M (1995) GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. J Hered 86:248–249
Rosenberg NA (2004) DISTRUCT: a program for the graphical display of population structure. Mol Ecol Resour 4:137–138
Sloan DB, Keller SR, Berardi AE, Sanderson BJ, Karpovich JF, Taylor DR (2012) De novo transcriptome assembly and polymorphism detection in the Silene vulgaris (Caryophyllaceae). Mol Ecol Resour 12:333–343
Yuan Y, Bayer PE, Batley J, Edwards D (2017) Improvements in genomic technologies: application to crop genomics. Trends Biotechnol 35:547–558
Van Oosterhout C, Hutchinson WF, Wills DPM, Shipley P (2004) MICRO-CHECKER: software for identifying and correcting genotyping errors in microsatellite data. Mol Ecol Resour 4:535–538
Victory ER, Glaubitz JC, Rhodes OE Jr, Woeste K (2006) Genetic homogeneity in Juglans nigra (Juglandaceae) at nuclear microsatellites. Am J Bot 93:118–126
Wang H, Pan G, Ma Q, Zhang J, Pei D (2015) The genetic diversity and introgression of Juglans regia and Juglans sigillata in Tibet as revealed by SSR markers. Tree Genet Genomes 11(1)
Weckerle C, Huber F, Yang YP, Sun WB (2005) Walnuts among Shuhi in Shuiluo, eastern : notes on economic plants. Econ Bot 59:287–290
Wei C, Tao X, Li M, He B, Yan L, Zhang Y (2015) De novo transcriptome assembly of Ipomoea nil using Illumina sequencing for gene discovery and SSR marker identification. Mol Gen Genomics 290:1873–1884
Wheeler GL, Dorman HE, Buchanan A, Challagundla L, Wallace LE (2014) A review of the prevalence, utility, and caveats of using chloroplast simple sequence repeats for studies of plant biology. Appl Plant Sci 2:1400059
Zhang J, Liang S, Duan J, Wang J, Chen S, Cheng Z, Zhang Q, Liang X, Li Y (2012) De novo assembly and characterisation of the transcriptome during seed development, and generation of genic-SSR markers in peanut (Arachis hypogaea L.). BMC Genomics 13:90
Zhang R, Zhu AD, Wang XF, Yu J, Zhang HR, Gao JS, Cheng YJ, Deng XX (2010) Development of Juglans regia SSR markers by data mining of the EST database. Plant Mol Biol Report 28:646–653
Zhou ZC, Dong Y, Sun HJ, Yang AF, Chen Z, Gao S, Jiang JW, Guan XY, Jiang B, Wang B (2014) Transcriptome sequencing of sea cucumber (Apostichopus japonicus) and the identification of gene-associated markers. Mol Ecol Resour 14:127–138