The Pennsylvania State University The Graduate School Intercollege Program in Integrative Biosciences

INTRASPECIFIC VARIATION IN GREEN ASH RESPONSE TO AN INVASIVE INSECT

A Dissertation in Bioinformatics and Genomics by Di Wu

© 2018 Di Wu

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2018

The dissertation of Di Wu was reviewed and approved* by the following:

John E. Carlson Professor of Molecular Genetics Dissertation Advisor Chair of Committee

Jesse R. Lasky Assistant Professor of Biology

Majid Foolad Professor of Genetics

Rongling Wu Professor of Public Health Sciences

Cooduvalli Shashikant Associate Professor of Molecular and Developmental Biology Chair of Bioinformatics and Genomics Graduate Program

*Signatures are on file in the Graduate School


ABSTRACT

Green ash (Fraxinus pennsylvanica) is a medium-sized, ecologically and economically valuable tree species native to the eastern and central United States. However, this widely distributed North American species is under severe threat from the rapid invasion of the emerald ash borer (Agrilus planipennis; EAB), an Asian wood-boring beetle. To understand the mechanism of the defense response, transcriptomes were prepared for six green ash genotypes exposed to EAB infestation, using an RNA-seq approach. Mapping these reads to the de novo assembled reference of 107,611 transcript contigs, prepared from 98 Gb of RNA-seq data from multiple tissues and treatments (www.hardwoodgenomics.org/node/68249), enabled differentially expressed genes to be identified between potentially resistant (trees that survived EAB infestations, hereafter simply referred to as “resistant”) and susceptible genotypes, and between control and EAB egg-treated bark samples. The enrichment analysis showed that most of the overrepresented GO terms in the resistant genotypes were related to stress response. In addition, our results indicate that the response process was associated with induced, rather than suppressed, gene expression. Further network analysis revealed putative hub genes that may regulate EAB resistance. In addition, comparison of metabolic pathways between resistant and susceptible groups provided insight into the mechanism of EAB resistance in green ash. To understand more about this serious forest health issue and to assist in green ash protection and restoration, I conducted a genetic diversity study, using SSR markers, with 429 green ash accessions collected from 60 provenances across the species’ natural range. Our results revealed three distinct subgroups of provenances. Northern provenances fell into one group, southern provenances into a second group, and the third subgroup consisted of admixtures of northern and southern genotypes.
We also constructed a DNA marker-based genetic linkage map for a green ash population of full-sib seedlings segregating for EAB resistance, derived from a controlled cross of EAB-resistant and EAB-susceptible green ash parent trees. The mapping population can be used in the future to conduct quantitative trait locus (QTL) analysis to identify genomic regions contributing to differences in susceptibility to EAB. Furthermore, a genetic association study using the range-wide selection of green ash trees in the provenance study identified loci significantly associated with EAB resistance, height, diameter at breast height, budburst and foliage coloration. We identified candidate genes potentially associated with EAB resistance and foliage coloration in green ash. These results will be confirmed using a larger sample size and more genetic markers. I hope that this study will support further research on the basis of the apparently low-frequency natural EAB resistance in green ash and lead to strategies for eventual restoration of the species. This research was supported by a grant to Dr. John Carlson from NSF’s Plant Genome Research Program (IOS-1025974) and by the USDA National Institute of Food and Agriculture Federal Appropriations under Project PEN04532 and Accession number 1000326.

Table of Contents

LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
    Background
    Interspecies variation of EAB resistance
    Green Ash
    Natural variation within green ash
    Common garden trial
    Transcriptomic comparison
    Genetic linkage mapping
    Marker-trait associations
    Objectives
    References
Chapter 2 Transcriptome profile between resistant and susceptible green ash genotypes
    Abstract
    Introduction
    Materials and Methods
        Plant materials
        EAB inoculation treatments and RNA sequencing
        Sequence Read mapping and function annotation
        Metabolic pathway analysis
        Protein-protein interaction network analysis
    Results and Discussion
        Enrichment analysis results
        Pathway Analysis
        Network Analysis
        Putative hub genes
    Conclusions
    References
Chapter 3 Population structure and genetic diversity of green ash (Fraxinus pennsylvanica) assessed with SSR markers
    Abstract
    Introduction
    Materials and Methods
        Plant Materials
        Simple sequence repeats (SSRs) genotyping and data analysis
        Marker informativeness testing
        Population structure and clustering
        Genetic diversity and population differentiation
        Phenotypic data
    Results
        Genotyping and filtering process
        Informativeness of SSR markers
        Hardy–Weinberg equilibrium (HWE) testing
        Additional allele statistics and filtering
        Population structure
        Phenotypic variation among three major green ash clusters
    Discussion
    References
Chapter 4 The first genetic map for Fraxinus pennsylvanica and syntenic relationships with four related species
    Abstract
    Keywords
    Introduction
    Materials and Methods
        Plant material and DNA extraction
        Genotyping-by-sequencing
        GBS Data Processing and SNP discovery
        Simple sequence repeats (SSRs) genotyping and data analysis
        Linkage map construction
        Alignment of scaffolds to corresponding linkage groups
        Comparative analysis with two Asterid and two Rosid species
    Results
        Enzyme selection
        Genome-wide identification of SNPs
        Polymorphism of EST-derived and genomic SSR markers in F1 mapping population
        Construction of genetic linkage maps
        Segregation distortion
        Integration of genetic map with genome assembly
        Comparative analysis of green ash with other species
    Discussion
    References
Chapter 5 Genome-Wide Association Study in Green Ash
    Abstract
    Introduction
    Materials and methods
        Plant Materials
        Phenotypic data preparation and sampling
        Double-digested restriction site associated DNA sequencing (ddRADseq) library preparation and sequencing
        Sequence data processing
        Linkage disequilibrium (LD) analyses
        Population structure
        Genome-wide association study (GWAS)
    Results
        SNPs identification and LD estimation
        Population structure and genetic diversity
        Association mapping
        Identification of candidate genes
    Discussion
    References
Chapter 6 Summary and Discussion
    References

LIST OF FIGURES

Figure 2.1 GO Term enrichment analysis.
Figure 2.2 MapMan overview of metabolite pathways in resistant (left) and susceptible (right) genotypes.
Figure 2.3 Secondary pathways in resistant (left) and susceptible (right) genotypes.
Figure 2.4 Large enzyme pathways in resistant (left) and susceptible (right) genotypes.
Figure 2.5 Network analysis of differentially expressed transcription factors (TFs).
Figure 2.6 Network analysis of differentially expressed hormone-related genes.
Figure 2.7 Network analysis of defense-related genes.
Figure 3.1 Variation at the provenance level projected on the first two correspondence analysis eigenvalues.
Figure 3.2 Plot for detecting the number of K groups that best fits the data from Structure analysis.
Figure 3.3 Results of population structure analysis.
Figure 3.4 Geographic locations of the two groups and one admixed group.
Figure 3.5 Geographic variation between populations of the two clusters.
Figure 3.6 Principal component analysis based on Fst.
Figure 3.7 Principal component analysis based on phenotypic traits.
Figure 3.8 Variation of phenotypic traits plotted against PC1, PC2 and latitudes.
Figure 4.1 Assessment of GBS libraries using three different restriction enzymes.
Figure 4.2 Genetic maps of F. pennsylvanica.
Figure 4.3 Genetic maps of F. pennsylvanica including segregation distorted markers.
Figure 4.4 Circos plot of synteny between green ash genetic map and genome assembly.
Figure 4.5 Circos plots of syntenic relationships between green ash and four other species.
Figure 5.1 Average LD decay across the green ash genome.
Figure 5.2 LD decay rates across the 23 green ash chromosomes.
Figure 5.3 Assessment of K value that best fits the accessions.
Figure 5.4 Population structure of the accessions.
Figure 5.5 Scatter plot of principal component axis one (PC1) and axis two (PC2) based on genotype data of 85 samples.
Figure 5.6 Geographic variation between the three clusters.
Figure 5.7 Kinship heatmap.
Figure 5.8 Quantile-quantile (QQ) plots for GWAS.
Figure 5.9 Manhattan plots.

LIST OF TABLES

Table 2.1 Putative hub transcription factors
Table 2.2 Putative hub genes associated with plant hormone
Table 2.3 Putative hub genes associated with plant defense
Table 3.1 Summary of features of eight new EST-based microsatellite loci.
Table 3.2 Hardy–Weinberg equilibrium (HWE) testing results for 429 green ash accessions across the eight loci
Table 3.3 Genetic variation statistics for the eight SSR markers among 429 green ash accessions
Table 3.4 Genetic diversity statistics across and within clusters averaged over six loci
Table 3.5 Analysis of molecular variance (AMOVA) among three clusters over six loci.
Table 3.6 Genetic variation statistics at six loci within 3 clusters
Table 3.7 Pairwise Fst between pairs of clusters
Table 3.8 Average phenotypic values of three clusters and pairwise comparison
Table 3.9 Pairwise Fst values across four zones detected by winter injuries
Table 3.10 Pairwise Fst values across five clusters detected by cold tolerance.
Table 4.1 Summary of SNPs identified in green ash
Table 4.2 Summary of SSR markers
Table 4.3 Summary of DNA marker information in female and male genetic maps.
Table 4.4 Summary of genetic linkage map of F. pennsylvanica
Table 4.5 Distribution of orthologous loci on LGs of green ash and tomato genome
Table 4.6 Distribution of orthologous loci on LGs of green ash and coffee genome
Table 4.7 Distribution of orthologous loci on LGs of green ash and peach genome
Table 4.8 Distribution of orthologous loci on LGs of green ash and poplar genome
Table 5.1 Summary of identified SNPs.
Table 5.2 Categories of identified SNPs.
Table 5.3 Genetic variation statistics within three clusters
Table 5.4 Pairwise Fst among three clusters
Table 5.5 Significant marker-trait associations.

Chapter 1

Introduction

Background

Forested land is an integral component of the American landscape. Forests provide essential ecosystem services, high-quality timber, and valuable raw materials for composites and biofuels. In addition to the great economic value of forestry and forest products, hardwoods also provide food and shelter for wildlife and play important roles in the environment. However, many forest tree species in North America are under severe threat from the effects of globalization, climate change, environmental stresses, and invasive species.

Ash trees (genus Fraxinus) are economically and ecologically important natural resources throughout North America. The economic value of the billions of ash trees is difficult to estimate, but it could be in the billions of dollars (Kovacs et al., 2010). As with many other tree species, ash species in North America face severe biotic and abiotic stresses, including climate change, environmental stresses, and invasive species. Abiotic stresses are increasing the susceptibility of trees to insect and pathogen attacks, as well as causing increased mortality themselves. The introduction and rapid invasion of the emerald ash borer (EAB) from Asia has resulted in the death of millions of ash trees and is a severe threat to all native ash trees as it spreads across North America (Herms and McCullough, 2014). However, Asian ash species, which co-evolved with EAB, are resistant to it. A common garden experiment established in China to test the EAB resistance of ash species within the natural habitat of EAB (Liu et al., 2003) showed that Asian ash species are much more resistant to EAB attack than North American species regardless of the trial location. This result suggests that the lethality of the insect to ash species in North America is not due solely to the absence of its natural enemies.

Interspecies variation of EAB resistance

Substantial progress has been made in recent years in understanding variation in susceptibility to EAB among ash species. Perhaps most important, Manchurian ash and other Asian ash species are more resistant to EAB than non-Asian species because the Asian species coevolved with EAB. In its natural range, EAB prefers to attack stressed or dying ash trees (Liu et al., 2007), which supports the view that Asian ash species are inherently resistant to EAB. Adult EAB have also been reported to prefer feeding on the foliage of green, black and white ash over that of Manchurian, blue and European ash (Pureswaran and Poland, 2009). Although Pureswaran and Poland (2009) hypothesized that adult EAB feeding is negatively correlated with total emission of host volatiles, as they observed lower levels in green ash relative to Manchurian ash, their study failed to confirm the negative correlation for the other ash species tested. This may indicate that qualitative variation in volatiles is more important for EAB in locating susceptible hosts.

Comparative analyses of phenolic compounds, defensive proteins, nutritional quality and primary metabolites among ash species provide additional insights into resistance mechanisms (Villari et al., 2015). Phenolic compounds are believed to function in plant defense as feeding deterrents, toxins and digestion inhibitors (Usha Rani and Jyothsna, 2010). Recent studies have demonstrated that Manchurian ash has higher concentrations of constitutive bark lignans (Eyles et al., 2007), coumarins (Whitehill et al., 2012), proline, tyramine (Hill et al., 2012) and defensive proteins (Whitehill, 2011) than susceptible ash species. In addition to constitutive defenses, induced resistance among ash species has also been documented. Interestingly, methyl jasmonate treatment of susceptible ash species reduced EAB larval survival and/or growth, and this reduction was associated with increased bark concentrations of verbascoside, lignin and trypsin inhibitors (Whitehill et al., 2014). These studies suggest that susceptible ash species have latent defenses against EAB that are not induced naturally, or not quickly enough, by EAB larval colonization.

As discussed above, most studies of EAB-host interactions have been conducted as comparisons among ash species. However, Koch et al. (2015) reported rare green ash trees that survived intense EAB attack, which shows that it is possible to uncover variation in susceptibility within the species, to understand resistance mechanisms specific to the species, and to breed resistant green ash genotypes. Unlike comparisons among ash species, comparisons within green ash may more reliably identify the resistance mechanisms and candidate genes associated with resistance.

Green Ash

Among ash species, green ash (Fraxinus pennsylvanica) is one of the most widely distributed and one of the most susceptible to EAB. Green ash is an important tree species in North America, both as a component of forest and riparian ecosystems and as a widely planted urban street tree. The natural range of green ash extends from Cape Breton Island and Nova Scotia west to southeastern Alberta; south through central Montana and northeastern Wyoming to southeastern Texas; and east to northwestern Florida and Georgia (Wright, 1965). Because green ash can grow in many types of soil, it is widely used for rehabilitation of disturbed sites. Green ash also provides food and cover for many kinds of wildlife. Green ash wood is used to make specialty items such as tool handles and baseball bats. Furthermore, some studies suggest that, among all major North American ash species, green ash is preferred by EAB (Anulewicz et al., 2007).

Fortunately, intraspecific variation in green ash responses to EAB has been identified (Koch et al., 2015). Mature green ash trees that survived EAB infestations that killed the remainder of their cohort provide valuable material for understanding the potential for development of resistance in North American ash species. According to previous studies, the rare, putatively resistant green ash trees identified in natural stands show various responses to EAB attack. An EAB bioassay experiment in which EAB eggs were applied to stems of greenhouse-grown trees showed that the bark of some resistant genotypes killed significantly larger numbers of EAB eggs than did that of susceptible genotypes. On other resistant genotypes, larvae gained significantly less weight than on susceptible genotypes, and yet other resistant genotypes were less preferred by mature EAB than were susceptible genotypes (Koch et al., 2015). Therefore, it appears that more than one mechanism may be responsible for EAB resistance in the surviving ash trees. Such advances in understanding variation in natural resistance may allow the green ash - EAB system to serve as a model for studies of other phloem-feeding insects and their hosts. It will also be valuable to assess whether genetic variation among natural populations plays a role in EAB resistance or susceptibility levels, and conversely to know how mortality caused by EAB may affect genetic diversity in natural populations of green ash.

Natural variation within green ash

It has been hypothesized that green ash is composed of three or more ecotypes, because distinguishable differences were observed among geographic sources when seedlings were grown under uniform conditions (Wright, 1965). In addition, three ecotypes differing in drought resistance were reported in the Great Plains (Meuli and Shirley, 1937). It is still unknown whether these ecotypes are identical to those from the eastern range of the species. Ying (1971) stated that the border between Nebraska and Kansas splits green ash into northern and southern Great Plains ecotypes, and described the southern ecotype as having a longer growing season and a faster growth rate than the northern ecotype. Nevertheless, none of the ecotypes has been given a Latin varietal or subspecies name, as they cannot be distinguished under natural conditions. These findings suggest that natural variation could be a factor contributing to EAB resistance.

Common garden trials have been the most widely used plantation experiments for uncovering how trees adapt to different environmental conditions through genetic adaptation or phenotypic plasticity. To unravel the intraspecific variation in EAB resistance in green ash, it is necessary to clarify the relationship between natural variation and resistance level, and thus whether natural variation contributes to EAB resistance. Common garden trials are well suited to studying such correlations.

Common garden trial

We are using a common garden field study established at Penn State in the 1970s (Steiner, 1983), with approximately 2,000 ash trees originating from 66 provenances, for which phenotypic data have been recorded for each tree for growth (diameter at breast height; DBH) and for susceptibility to EAB (number of EAB exit holes in the trunk, canopy condition and tree death). To establish the provenance trial, seeds were collected (Steiner, 1983) from two to four green ash mother trees in each of 60 geographically distinct forest stand locations, selected to maximize coverage of the natural range of the species. In addition, seeds of white ash and pumpkin ash were collected from four and two natural stands, respectively. All seeds were planted in a common garden on The Pennsylvania State University’s University Park campus following a randomized complete block design (Steiner, 1983). The few white and pumpkin ash trees among the ~2,000 trees can be used as an out-group in further analyses.

Invasion of the common garden trial by EAB in 2011 has resulted in approximately 99% mortality to date. “Lingering” (surviving) ash trees were identified in the common garden trial following the EAB infestation, which suggests that variation in susceptibility to EAB exists both among and within provenances. Both the provenance trial and previous studies support the hypothesis that there is variation within green ash populations in response to EAB infestation. Understanding the basis of this variation will advance our ability to identify EAB-resistant genotypes and help with green ash restoration in the future.

Transcriptomic comparison

A transcriptome is the complete set of transcripts in a given organism, or the specific subset of transcripts present in a particular cell or tissue type, which can vary with external environmental conditions. In contrast to the genome, which is characterized by its stability, the transcriptome actively changes. Because it reflects the genes that are being actively expressed at a given time, in a given tissue and under a given environmental condition, the transcriptome is essential for interpreting the functional elements of the genome, for revealing the molecular constituents of cells and tissues, and for understanding development and disease.

Microarrays have been developed and used to understand and quantify the transcriptome because they provide a cost-effective way to assess and compare mRNA levels for thousands of genes at once (David et al., 2006). However, microarray methods have several limitations. Microarrays rely upon existing knowledge of genome and/or gene sequences, which limits their application in non-model species. In addition, high background levels due to cross-hybridization and a limited dynamic range of detection, due to both background and saturation of signals, can be issues with microarrays (Okoniewski and Miller, 2006; Royce et al., 2007).

Sequence-based approaches to studying gene expression directly provide access to the underlying gene sequences. Examples of sequencing-based approaches include Sanger sequencing of expressed sequence tags (ESTs), serial analysis of gene expression (SAGE) (Velculescu et al., 1995) and massively parallel signature sequencing (MPSS) (Brenner et al., 2000). However, because most of these assays are based on expensive Sanger sequencing, only a portion of each transcript is analyzed and isoforms are generally indistinguishable from each other. To overcome the limitations of existing Sanger-sequencing approaches, the RNA-seq technique was developed to assess transcriptome profiles using Next Generation Sequencing (NGS) technologies. The RNA-seq method allows the entire transcriptome to be studied in a very high-throughput and quantitative manner (Wang et al., 2009). RNA-seq can be used for gene expression analysis in non-model species without genome sequences. In contrast to microarrays, RNA-seq has been shown to be highly accurate for quantifying expression levels (Nagalakshmi et al., 2008) and highly reproducible (Cloonan et al., 2008).
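As a rough illustration of the kind of quantitative comparison that RNA-seq read counts allow, the sketch below scales raw counts to counts per million (CPM) and computes a log2 fold change between a treated and a control library. The contig names, the read counts and the pseudocount are invented for illustration, and this is not the analysis pipeline used in this dissertation; a real study would use dedicated differential-expression software with replicate-aware statistics.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts by library size so libraries are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2(treated / control) on CPM values; the pseudocount avoids log(0)."""
    cpm_t = counts_per_million(treated)
    cpm_c = counts_per_million(control)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in cpm_t
    }


# Hypothetical read counts for three contigs (made-up numbers).
control = {"contig_A": 100, "contig_B": 400, "contig_C": 500}
treated = {"contig_A": 800, "contig_B": 400, "contig_C": 800}

lfc = log2_fold_change(treated, control)
for gene, value in lfc.items():
    print(f"{gene}: log2FC = {value:.2f}")
```

In this toy example contig_B has the same raw count in both libraries, yet its normalized expression is lower in the treated library because that library was sequenced twice as deeply. This is why counts are normalized before expression levels are compared.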

Genetic linkage mapping

Genetic linkage mapping is used to determine the order of, and relative genetic distance between, genetic markers on a chromosome based on their pattern of inheritance. A genetic map is based on the frequencies of recombination between markers during crossover of homologous chromosomes. Genetic mapping is necessary to identify genomic regions or DNA markers linked to quantitative traits. Genetic maps also provide the basis for map-based cloning of major genes involved in important agronomic traits and for the development of markers for marker-assisted selection. In addition, genetic maps can help to clarify evolutionary relationships between and within species and can assist genome assembly.

To construct genetic maps, various types of genetic markers have been developed and widely used in both plants and animals. Genetic markers can be classified as morphological markers, biochemical markers and DNA markers. Morphological markers are visible phenotypic changes caused by mutations, and they served as the only genetic markers for many years. Biochemical markers (also known as isozyme markers) were first used as molecular markers to detect QTL in maize (Stuber et al., 1987). Among DNA-based markers, restriction fragment length polymorphisms (RFLPs) were first used in human genetic mapping (Botstein et al., 1980) and later in plant genetic mapping (Weber and Helentjaris, 1989). Since then, PCR-based molecular marker techniques have been developed, such as random amplified polymorphic DNA (RAPD) markers (Welsh and McClelland, 1990), amplified fragment length polymorphism (AFLP) markers (Vos et al., 1995) and simple sequence repeats (SSRs) or microsatellites (Hearne et al., 1992). Genetic maps of Eucalyptus grandis and Eucalyptus urophylla were developed using RAPD markers (Grattapaglia and Sederoff, 1994). An AFLP-based genetic map was constructed with nearly complete genome coverage in Pinus taeda (Remington et al., 1999). A genetic map (343 markers) of a hybrid poplar was constructed from a combination of RFLP, sequence-tagged site and RAPD markers (Bradshaw et al., 1994). Among these DNA markers, SSRs have been the most widely used in plants because they are highly informative, experimentally reproducible and transferable among related species (Mason, 2015).

More recently, with the emergence of NGS technologies and many genotyping platforms for sequence-based markers, single nucleotide polymorphisms (SNPs) have rapidly taken center stage in molecular genetics. Using a SNP array, framework maps of E. grandis and E. urophylla were constructed with one marker every 0.45 cM and 0.50 cM on average, respectively (Bartholome et al., 2015). Using the NGS-based genotyping-by-sequencing (GBS) approach, a high-density genetic map of chickpea was constructed with an average inter-marker distance of 0.33 cM (Verma et al., 2015). Using reduced-representation sequencing (RADseq), also based on NGS, a genetic map of Chinese cabbage was constructed with an average interval of 1.72 cM (Huang et al., 2017). Although SNPs are the most abundant and uniformly distributed DNA markers in genomes, individual SNPs are less informative than SSRs, because most SNPs are biallelic whereas SSRs are typically multiallelic. Many studies have demonstrated the effectiveness of integrating SSR and SNP markers for genetic mapping (Li et al., 2015; McCallum et al., 2016; Zhai et al., 2015; Zhou et al., 2014), trait mapping and marker-assisted breeding (Kusi et al., 2018).
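The map distances in centimorgans (cM) quoted above are derived from observed recombination frequencies through a mapping function. As a minimal sketch, the code below estimates a recombination fraction from progeny counts and converts it to cM with the two standard mapping functions, Haldane (no crossover interference) and Kosambi (partial interference). The progeny counts are invented, and the text above does not state which mapping function was used for any of the cited maps.

```python
import math


def recombination_fraction(recombinants, total):
    """Fraction of progeny that are recombinant between two markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinants / total


def haldane_cm(r):
    """Haldane map distance in cM; assumes no crossover interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM; allows for partial interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Hypothetical example: 10 recombinant seedlings among 100 full-sib progeny.
r = recombination_fraction(10, 100)
print(f"r = {r:.2f}")
print(f"Haldane: {haldane_cm(r):.2f} cM")
print(f"Kosambi: {kosambi_cm(r):.2f} cM")
```

For small recombination fractions both functions give distances close to 100 × r, and they diverge as r approaches 0.5, where markers appear unlinked.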

Marker-trait associations

Molecular markers offer opportunities for dissecting complex traits using quantitative trait locus (QTL) mapping (González-Martínez et al., 2006; Paterson et al., 1988; Sax, 1923). The first use of DNA markers to identify QTLs in plants was conducted with an RFLP genetic map for tomato fruit traits (Paterson et al., 1988). Since then, QTL mapping has been widely used in plants for many complex traits, such as biomass, yield and disease resistance. However, one major limitation of QTL mapping is its low resolution: QTL intervals usually span ~5-10 cM, which can include many genes. Within QTL regions, identification of individual genes is time-consuming and costly. In addition, QTL mapping is based on bi-parental crosses, and the QTLs detected are generally specific to that bi-parental population. Therefore, QTLs identified in a specific mapping population are often not useful in natural populations (Sorkheh et al., 2008).

To overcome the limitations of pedigree-based QTL mapping, association mapping, also known as linkage disequilibrium (LD) mapping, has been developed to investigate genotype-phenotype correlations in unrelated individuals. In forest trees, association mapping has been widely used to dissect marker-trait associations because of the availability of large random-mating populations, high levels of nucleotide diversity, rapid decay of LD and reliable phenotypic evaluation in clonally propagated plants. With decreasing sequencing costs and improving data processing, genome-wide association studies (GWAS) are becoming increasingly common for detecting associations with various complex traits in forest trees. In P. taeda, 10 SNPs were identified as significantly associated with pitch canker resistance through association mapping (Quesada et al., 2010). In Populus, association mapping identified 141 significant loci associated with 16 wood traits (Porth et al., 2013).
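Because association mapping depends on LD between markers and causal variants, it can help to see how pairwise LD is commonly quantified. The sketch below computes the r² statistic between two biallelic loci from phased haplotypes, using D = p_AB - p_A p_B and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The haplotype data are invented, and the function name is illustrative; this is not the software used for the LD analyses in this dissertation.

```python
def ld_r_squared(haplotypes):
    """Pairwise LD (r^2) between two biallelic loci.

    haplotypes: sequence of (allele_at_locus_1, allele_at_locus_2) pairs,
    with alleles coded 0 or 1 and phase assumed known.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(h[0] for h in haplotypes) / n   # frequency of allele 1 at locus 1
    p_b = sum(h[1] for h in haplotypes) / n   # frequency of allele 1 at locus 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Hypothetical haplotypes: alleles always travel together (complete LD).
linked = [(1, 1)] * 5 + [(0, 0)] * 5
# Hypothetical haplotypes: all four combinations equally frequent (no LD).
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r_squared(linked))
print(ld_r_squared(unlinked))
```

Plotting r² against the physical or genetic distance between marker pairs gives an LD decay curve. Rapid decay, as noted above for forest trees, means that a significant marker is likely to lie close to the causal variant, but also that dense marker coverage is needed.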

Objectives

To understand and help resolve issues associated with the EAB infestation of green ash, our long-term goal is to identify DNA markers and candidate genes for growth, adaptation and response to biotic and abiotic stresses in green ash using both EST and genome sequence data. The immediate objectives of my research were to use DNA markers and gene expression data to assess genetic diversity and to identify genes associated with EAB resistance and susceptibility in green ash. To identify candidate genes, three different approaches were used: transcriptome comparison, quantitative trait locus (QTL) mapping and association mapping. The results of my thesis will assist disease-resistance tree breeding programs and forest management by providing powerful tools to detect important genes related to EAB resistance, as well as genes for growth and adaptation. This new information should prove important for studies of resistance to boring insects and of environmental stress responses in other tree species currently under threat of extinction or extirpation. This project will also provide new information at the whole-genome level to better understand the genetic variation within green ash. To learn more about the molecular genetics of the response to biotic stresses in ash species, we defined the following specific aims:

Aim 1: As the first step to identify candidate genes for EAB resistance, I focused on comparing the variation in gene expression between resistant and susceptible genotypes during EAB treatment. This involved identification of differentially expressed genes (DEGs) using EAB-treated resistant vs. susceptible genotypes, relative to control conditions, using the reference transcriptome of 107,611 transcripts for green ash, annotated for potential protein function by The Hardwood Genomics Resources Project (directed by PI John Carlson; http://www.hardwoodgenomics.org/organism/Fraxinus/pennsylvanica).

Aim 2: Gene network analysis was used to identify putative hub genes amongst the DEG results. Hub genes annotated as transcription factors, or hormone-related, or defense-related genes were selected as candidates for further analysis.

Aim 3: It is important to know the complex population structure of natural populations to avoid spurious associations in GWAS mapping studies. Thus, another goal of my thesis was to characterize genetic diversity and population structure among the natural stands of green ash using nuclear microsatellite DNA markers. Knowledge of genetic structure may reveal variation within green ash in terms of defense response to EAB. Knowledge of genetic diversity and population structure in green ash also can guide germplasm collections to represent a high proportion of genetic diversity in natural stands and help guide long-term reforestation and preservation projects for the species.

Aim 4: To learn what genome regions (QTL) are responsible for resistance, we developed a genetic linkage map for a green ash F1 population from a controlled cross of EAB-resistant and EAB-susceptible green ash parent trees, using SNP markers from GBS data and microsatellite markers.

Aim 5: To further identify genetic differences between putative EAB-resistant and EAB-susceptible genotypes of green ash, I made use of a set of putatively resistant green ash trees identified within a range-wide provenance trial at Penn State University. A genome-wide association study (GWAS) was conducted using 95 unrelated green ash trees from the 40-year-old provenance trial (Steiner et al., 1988), for which phenotypic data have been recorded from each tree for growth (DBH and height), susceptibility to EAB (canopy health condition), budburst and fall foliage coloration.

References

Anulewicz, A.C., McCullough, D.G., and Cappaert, D.L. (2007). Emerald Ash Borer (Agrilus planipennis) Density and Canopy Dieback in Three North American Ash Species. Arboriculture & Urban Forestry 33, 338-349.
Bartholome, J., Mandrou, E., Mabiala, A., Jenkins, J., Nabihoudine, I., Klopp, C., Schmutz, J., Plomion, C., and Gion, J.M. (2015). High-resolution genetic maps of Eucalyptus improve Eucalyptus grandis genome assembly. The New phytologist 206, 1283-1296.
Botstein, D., White, R.L., Skolnick, M., and Davis, R.W. (1980). Construction of a genetic linkage map in man using restriction fragment length polymorphisms. American journal of human genetics 32, 314.
Bradshaw, H.D., Villar, M., Watson, B.D., Otto, K.G., Stewart, S., and Stettler, R.F. (1994). Molecular genetics of growth and development in Populus. III. A genetic linkage map of a hybrid poplar composed of RFLP, STS, and RAPD markers. Theor Appl Genet 89, 167-178.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., et al. (2000). Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature biotechnology 18, 630-634.
Cloonan, N., Forrest, A.R.R., Kolle, G., Gardiner, B.B.A., Faulkner, G.J., Brown, M.K., Taylor, D.F., Steptoe, A.L., Wani, S., Bethel, G., et al. (2008). Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods 5, 613.
David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W., and Steinmetz, L.M. (2006). A high-resolution map of transcription in the yeast genome. Proceedings of the National Academy of Sciences 103, 5320-5325.
Eyles, A., Jones, W., Riedl, K., Cipollini, D., Schwartz, S., Chan, K., Herms, D.A., and Bonello, P. (2007). Comparative phloem chemistry of Manchurian (Fraxinus mandshurica) and two North American ash species (Fraxinus americana and Fraxinus pennsylvanica). Journal of chemical ecology 33, 1430-1448.
González-Martínez, S.C., Krutovsky, K.V., and Neale, D.B. (2006). Forest-tree population genomics and adaptive evolution. New Phytologist 170, 227-238.
Grattapaglia, D., and Sederoff, R. (1994). Genetic linkage maps of Eucalyptus grandis and Eucalyptus urophylla using a pseudo-testcross: mapping strategy and RAPD markers. Genetics 137, 1121-1137.
Hearne, C.M., Ghosh, S., and Todd, J.A. (1992). Microsatellites for linkage analysis of genetic traits. Trends in Genetics 8, 288-294.
Herms, D.A., and McCullough, D.G. (2014). Emerald ash borer invasion of North America: history, biology, ecology, impacts, and management. Annual review of entomology 59, 13-30.
Hill, A.L., Whitehill, J.G.A., Opiyo, S.O., Phelan, P.L., and Bonello, P. (2012). Nutritional attributes of ash (Fraxinus spp.) outer bark and phloem and their relationships to resistance against the emerald ash borer. Tree Physiology.
Huang, L., Yang, Y., Zhang, F., and Cao, J. (2017). A genome-wide SNP-based genetic map and QTL mapping for agronomic traits in Chinese cabbage. Scientific reports 7, 46305.
Koch, J.L., Carey, D.W., Mason, M.E., Poland, T.M., and Knight, K.S. (2015). Intraspecific variation in Fraxinus pennsylvanica responses to emerald ash borer (Agrilus planipennis). New Forests 46, 995-1011.
Kovacs, K.F., Haight, R.G., McCullough, D.G., Mercader, R.J., Siegert, N.W., and Liebhold, A.M. (2010). Cost of potential emerald ash borer damage in U.S. communities, 2009–2019. Ecological Economics 69, 569-578.
Kusi, F., Padi, F.K., Obeng-Ofori, D., Asante, S.K., Agyare, R.Y., Sugri, I., Timko, M.P., Koebner, R., Huynh, B.L., Santos, J.R.P., et al. (2018). A novel aphid resistance locus in cowpea identified by combining SSR and SNP markers. Plant Breeding 137, 203-209.
Li, C., Bai, G., Chao, S., and Wang, Z. (2015). A High-Density SNP and SSR Consensus Map Reveals Segregation Distortion Regions in Wheat. BioMed Research International 2015, 830618.

Liu, H., Bauer, L.S., Gao, R., Zhao, T., Petrice, T.R., and Haack, R.A. (2003). Exploratory survey for the emerald ash borer, Agrilus planipennis (Coleoptera: Buprestidae), and its natural enemies in China. The Great Lakes Entomologist 36, 191-204.
Liu, H., Bauer, L.S., Miller, D.L., Zhao, T., Gao, R., Song, L., Luan, Q., Jin, R., and Gao, C. (2007). Seasonal abundance of Agrilus planipennis (Coleoptera: Buprestidae) and its natural enemies Oobius agrili (Hymenoptera: Encyrtidae) and Tetrastichus planipennisi (Hymenoptera: Eulophidae) in China. Biological Control 42, 61-71.
Mason, A.S. (2015). SSR genotyping. Methods in molecular biology 1245, 77-89.
McCallum, S., Graham, J., Jorgensen, L., Rowland, L.J., Bassil, N.V., Hancock, J.F., Wheeler, E.J., Vining, K., Poland, J.A., Olmstead, J.W., et al. (2016). Construction of a SNP and SSR linkage map in autotetraploid blueberry using genotyping by sequencing. Mol Breeding 36, 41.
Meuli, L.J., and Shirley, H.L. (1937). The effects of seed origin on drought resistance of green ash in the Prairie-Plains states. Journal of Forestry 35, 1060-1062.
Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., and Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344-1349.
Okoniewski, M.J., and Miller, C.J. (2006). Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 7, 276.
Paterson, A.H., Lander, E.S., Hewitt, J.D., Peterson, S., Lincoln, S.E., and Tanksley, S.D. (1988). Resolution of quantitative traits into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms. Nature 335, 721.
Porth, I., Klapšte, J., Skyba, O., Hannemann, J., McKown, A.D., Guy, R.D., DiFazio, S.P., Muchero, W., Ranjan, P., Tuskan, G.A., et al. (2013). Genome-wide association mapping for wood characteristics in Populus identifies an array of candidate single nucleotide polymorphisms. New Phytologist 200, 710-726.
Pureswaran, D.S., and Poland, T.M. (2009). Host selection and feeding preference of Agrilus planipennis (Coleoptera: Buprestidae) on ash (Fraxinus spp.). Environmental entomology 38, 757-765.
Quesada, T., Gopal, V., Cumbie, W.P., Eckert, A.J., Wegrzyn, J.L., Neale, D.B., Goldfarb, B., Huber, D.A., Casella, G., and Davis, J.M. (2010). Association mapping of quantitative disease resistance in a natural population of loblolly pine (Pinus taeda L.). Genetics 186, 677-686.
Remington, D.L., Whetten, R., Liu, B.-H., and O’Malley, D. (1999). Construction of an AFLP genetic map with nearly complete genome coverage in Pinus taeda. Theor Appl Genet 98, 1279-1292.
Royce, T.E., Rozowsky, J.S., and Gerstein, M.B. (2007). Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nucleic acids research 35, e99-e99.
Sax, K. (1923). The Association of Size Differences with Seed-Coat Pattern and Pigmentation in PHASEOLUS VULGARIS. Genetics 8, 552-560.
Sorkheh, K., Malysheva-Otto, L.V., Wirthensohn, M.G., Tarkesh-Esfahani, S., and Martínez-Gómez, P. (2008). Linkage disequilibrium, genetic association mapping and gene localization in crop plants. Genetics and Molecular Biology 31, 805-814.
Steiner, K.C. (1983). A provenance test of green ash. Proc Northeast Forest Tree Improv Conf 28, 68-76.
Stuber, C.W., Edwards, M.D., and Wendel, J.F. (1987). Molecular Marker-Facilitated Investigations of Quantitative Trait Loci in Maize. II. Factors Influencing Yield and its Component Traits. Crop Science 27, 639-648.
Usha Rani, P., and Jyothsna, Y. (2010). Biochemical and enzymatic changes in rice plants as a mechanism of defense. Acta Physiologiae Plantarum 32, 695-701.
Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. (1995). Serial Analysis of Gene Expression. Science 270, 484-487.

Verma, S., Gupta, S., Bandhiwal, N., Kumar, T., Bharadwaj, C., and Bhatia, S. (2015). High-density linkage map construction and mapping of seed trait QTLs in chickpea (Cicer arietinum L.) using Genotyping-by-Sequencing (GBS). Scientific reports 5, 17512.
Villari, C., Herms, D.A., Whitehill, J.G., Cipollini, D., and Bonello, P. (2015). Progress and gaps in understanding mechanisms of ash tree resistance to emerald ash borer, a model for wood-boring insects that kill angiosperms. The New phytologist.
Vos, P., Hogers, R., Bleeker, M., Reijans, M., Lee, T.v.d., Hornes, M., Friters, A., Pot, J., Paleman, J., and Kuiper, M. (1995). AFLP: a new technique for DNA fingerprinting. Nucleic acids research 23, 4407-4414.
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57-63.
Weber, D., and Helentjaris, T. (1989). Mapping RFLP loci in maize using BA translocations. Genetics 121, 583-590.
Welsh, J., and McClelland, M. (1990). Fingerprinting genomes using PCR with arbitrary primers II Nucl.
Whitehill, J.G., Opiyo, S.O., Koch, J.L., Herms, D.A., Cipollini, D.F., and Bonello, P. (2012). Interspecific comparison of constitutive ash phloem phenolic chemistry reveals compounds unique to Manchurian ash, a species resistant to emerald ash borer. Journal of chemical ecology 38, 499-511.
Whitehill, J.G., Rigsby, C., Cipollini, D., Herms, D.A., and Bonello, P. (2014). Decreased emergence of emerald ash borer from ash treated with methyl jasmonate is associated with induction of general defense traits and the toxic phenolic compound verbascoside. Oecologia 176, 1047-1059.
Whitehill, J.G.A. (2011). INVESTIGATIONS INTO MECHANISMS OF ASH RESISTANCE TO THE EMERALD ASH BORER (The Ohio State University).
Wright, J.W. (1965). Green ash (Fraxinus pennsylvanica Marsh.). In Silvics of forest trees of the United States, H.A. Fowells, ed. (Washington DC: U.S. Department of Agriculture, Agriculture Handbook 271), pp. 185-190.
Zhai, H.-j., Feng, Z.-y., Liu, X.-y., Cheng, X.-j., Peng, H.-r., Yao, Y.-y., Sun, Q.-x., and Ni, Z.-f. (2015). A genetic linkage map with 178 SSR and 1 901 SNP markers constructed using a RIL population in wheat (Triticum aestivum L.). Journal of Integrative Agriculture 14, 1697-1705.
Zhou, X., Xia, Y., Ren, X., Chen, Y., Huang, L., Huang, S., Liao, B., Lei, Y., Yan, L., and Jiang, H. (2014). Construction of a SNP-based genetic linkage map in cultivated peanut based on large scale marker development using next-generation double-digest restriction-site-associated DNA sequencing (ddRADseq). BMC genomics 15, 351.

Chapter 2

Transcriptome profile between resistant and susceptible green ash genotypes

Abstract

Invasion of emerald ash borer (EAB) is one of the major threats faced by North American ash species. Although the majority of native ash species in North America are susceptible to EAB, a low percentage of EAB-resistant green ash trees have been observed. To understand the molecular mechanisms underlying cases of putative EAB resistance in green ash, we conducted a transcriptomic profiling study of host responses in stems to identify differentially expressed genes associated with resistance. GO term enrichment and network analysis showed that key genes involved in the jasmonic acid (JA)-mediated defense response were activated in resistant genotypes but repressed in susceptible genotypes. This result suggests that JA pathways may contribute to EAB resistance in green ash. Our results provide valuable information on gene functions associated with the response to EAB in green ash.

Introduction

Our initial hypothesis was that significant associations should be detectable between EAB resistance and single nucleotide polymorphisms (SNPs) located close to or within genes related to insect defense responses in the lingering and susceptible trees. GWAS are typically conducted with hundreds to thousands of individuals exhibiting traits that vary continuously. Such “normal” phenotypic distributions require far more individuals to detect statistically significant associations between genes or DNA markers and traits. Our secondary hypothesis was that the bimodal trait of EAB resistance vs. EAB susceptibility would be associated with alleles of much stronger phenotypic effect, which could be detected with only the 96 individuals selected for this study. In previous studies, MeJA application was observed to induce EAB resistance that was associated with higher accumulation of verbascoside and lignin and higher trypsin inhibitor activity in green ash (Whitehill et al., 2014). Thus, we can expect to identify SNPs associated with genes in the biosynthesis pathways of verbascoside and lignin as well as with trypsin inhibitor proteins. In addition, upstream transcription factors that regulate these pathways and gene products may be identified. Beyond such inducible genes and pathways, we may also expect to identify genes for constitutive factors that may be responsible for differences between lingering and susceptible trees, such as bark quality phenotypes. These findings will represent an important first step in the study of resistance and susceptibility to EAB and other stem-boring insects in broadleaf trees. The results from this study will need to be confirmed in subsequent studies with larger populations, and through QTL analysis with our full-sib green ash mapping population.

Materials and Methods

Plant materials

Four “lingering” green ash trees (PE19, PE21, PE22 and PE24) and two known EAB-susceptible green ash trees (PE36 and Summit) were selected for analysis of expression patterns of differentially expressed genes (DEGs) by RNA sequencing. All four lingering genotypes were wild selections in or near permanent monitoring plots located in southeast Michigan. The F. pennsylvanica cultivar ‘Summit’ was chosen from Summit Nursery, Stillwater MN. PE36 was a wild selection located in northeast Ohio. The two EAB-susceptible and four EAB-resistant genotypes were grafted and grown in the greenhouse for two years prior to treatment.

EAB inoculation treatments and RNA sequencing

Tissues from the PE19, PE21, PE22, PE24, PE36 and Summit (SUM) green ash genotypes were collected before and after exposure to EAB larvae. The EAB-feeding bioassay, RNA isolations and sequencing were described by Lane et al. (2016). Samples of bark and phloem of each genotype were collected in the summer for use as control tissues. EAB eggs were placed under the stem bark of each grafted plant. Eight weeks after hatching, the EAB larvae were removed and samples of the damaged phloem and bark tissues were collected, snap-frozen in liquid nitrogen and shipped on dry ice to our lab. Total RNA was isolated from the stem tissues of the six genotypes before and after EAB treatment, and cDNA libraries of the 12 samples were constructed using the Illumina TruSeq kit and sequenced on an Illumina HiSeq 2000 as described by Lane et al. (2016).

Sequence read mapping and functional annotation

Raw sequences were trimmed using Trimmomatic version 0.32 (Bolger et al., 2014), and clean reads were mapped to the green ash reference transcriptome using the Trinity program version r20121005 (Grabherr et al., 2011). A total of 107,611 transcript assembly contigs, obtained from the 55 transcriptome libraries (http://www.hardwoodgenomics.org/organism/Fraxinus/pennsylvanica), were used as the reference (Lane et al., 2016). Mapped reads for all identified genes were used for differential expression analysis after normalization using DESeq2 (Love et al., 2014). Gene expression was compared between EAB-treated and control groups. Genes with an absolute log2 fold change ≥ 2 and an adjusted p < 0.01 were considered significantly differentially expressed genes (DEGs) and selected for further analysis. The final set of DEGs was annotated using a BLASTX alignment search against the NCBI non-redundant protein database, with an E-value threshold of 1e-5. Gene function enrichment analysis for all DEGs was conducted using BiNGO (Maere et al., 2005), a Cytoscape plugin that assesses the overrepresentation of ontology terms.
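As an illustration of the DEG selection rule stated above (absolute log2 fold change ≥ 2 and adjusted p < 0.01), the following minimal sketch filters a table of per-gene results and splits the selected genes by direction. It is not the pipeline used in this study: the record layout, the function name `select_degs` and the contig identifiers are hypothetical, and the input is assumed to be a table of log2 fold changes and adjusted p-values already exported from a tool such as DESeq2.

```python
import math
from typing import NamedTuple, Optional


class GeneResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    gene_id: str
    log2_fold_change: float
    padj: Optional[float]  # adjusted p-value; may be missing for filtered genes


def select_degs(results, min_abs_lfc=2.0, max_padj=0.01):
    """Return (upregulated, downregulated) gene IDs passing both thresholds."""
    up, down = [], []
    for r in results:
        # Skip genes without a usable adjusted p-value.
        if r.padj is None or math.isnan(r.padj):
            continue
        if abs(r.log2_fold_change) >= min_abs_lfc and r.padj < max_padj:
            (up if r.log2_fold_change > 0 else down).append(r.gene_id)
    return up, down


# Illustrative values only; these are not results from the study.
example = [
    GeneResult("contig_0001", 3.1, 0.0004),   # passes, upregulated
    GeneResult("contig_0002", -2.0, 0.009),   # passes, downregulated
    GeneResult("contig_0003", 1.9, 0.0001),   # fold change too small
    GeneResult("contig_0004", 4.2, 0.03),     # adjusted p too large
    GeneResult("contig_0005", 2.5, None),     # no adjusted p-value
]
up, down = select_degs(example)
print(up, down)
```

Keeping the up- and down-regulated lists separate matches the way the enrichment analysis is later reported for each direction.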

Metabolic pathway analysis

MapMan (Thimm et al., 2004) was used to analyze the functions of the significant DEGs based on the known pathways of Arabidopsis. MapMan allowed the differences in expression patterns between resistant and susceptible genotypes to be visualized. All DEGs with assigned AGI IDs that met the minimum log2 fold change were used in the pathway-based analysis, with the aim of associating biological functions with DEGs in response to EAB infestation. The log2 fold changes of the DEGs were overlaid onto the Arabidopsis pathways to identify and illustrate the expression patterns.

Protein-protein interaction network analysis

The protein-protein interaction (PPI) network of Arabidopsis was used to identify putative hub genes. DEGs annotated as transcription factors, hormone-related genes or defense-related genes were selected for further analysis. DEGs annotated with the term GO:0003700 were used as queries to identify key transcription factors (TFs). DEGs annotated as hormone-related genes were also used as queries to identify putative central hub hormone-related genes. Jiang et al. (2011) collected a set of 1319 plant hormone-related genes in the updated version of the Arabidopsis Hormone Database (AHD2.0), based on both genetic evidence (mutant and transgenic studies) and evidence from Gene Ontology annotation. DEGs annotated with "response to abiotic or biotic stimulus" and "response to stress" were collected as plant defense-related genes. The selected DEGs were then mapped onto the PPI network of Arabidopsis, and genes with higher node degrees were distinguished as putative hub genes. Cytoscape (version 3.2.0) and its plugins were used to visualize the network and perform network analysis (Cline et al., 2007).
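The idea of ranking genes by node degree can be sketched in a few lines. The example below is a simplified illustration rather than the Cytoscape workflow used here: it treats the PPI network as an undirected edge list, counts distinct interaction partners per gene, and ranks the query DEGs by that count. The function names, the gene labels and the choice to report the top `n` genes are assumptions, since the text does not state a specific degree cutoff.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct interaction partners for each node of an undirected network."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(partners) for node, partners in neighbors.items()}


def top_hubs(edges, query_genes, n=5):
    """Rank query genes present in the network by degree (ties broken by name)."""
    degrees = node_degrees(edges)
    present = [g for g in query_genes if g in degrees]
    ranked = sorted(present, key=lambda g: (-degrees[g], g))
    return [(g, degrees[g]) for g in ranked[:n]]


# Hypothetical interaction list; the duplicated edge is counted once.
edges = [
    ("GENE_A", "GENE_B"), ("GENE_A", "GENE_C"), ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_C"), ("GENE_D", "GENE_E"), ("GENE_A", "GENE_B"),
]
hubs = top_hubs(edges, ["GENE_A", "GENE_B", "GENE_E", "GENE_X"], n=2)
print(hubs)
```

Query genes absent from the network (here `GENE_X`) are simply dropped, which mirrors the fact that only DEGs with an Arabidopsis counterpart in the PPI network can be placed on it.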

Results and Discussion

To assess whether EAB infestation affects expression patterns, and how those patterns differ between resistant and susceptible genotypes, the transcriptomes of stem tissue from four resistant and two susceptible trees were examined using the RNA-seq approach. After cleaning, reads were mapped to the reference transcriptome sequences of F. pennsylvanica. DESeq2 was then used to identify DEGs between resistant and susceptible genotypes. The number of upregulated DEGs was much larger than the number of downregulated DEGs, suggesting that the response process was associated with induction, rather than suppression, of gene expression.

Enrichment analysis results

GO term enrichment analysis was conducted separately for upregulated and downregulated DEGs (Fig. 2.1). Among the up-regulated genes in the resistant group of genotypes, the most significantly enriched GO terms were related to stress response, such as response to stress, response to biotic stimulus, response to endogenous stimulus, and response to external stimulus. These functional categories were also overrepresented in the susceptible group but at a much lower level of significance. In the resistant group, the other significantly enriched processes were “signal transduction” and “metabolic process”, which were also overrepresented at a higher level of significance than in the susceptible group. In terms of cellular components, the “plasma membrane” and “cell wall” categories were also overrepresented in the resistant group. We suggest that up-regulation of genes involved in cell wall remodeling may be one mechanism for strengthening the physical barrier to insect attack, which could also help explain why resistant genotypes are more tolerant of EAB infestation than susceptible genotypes. Overall, the results from the enrichment analysis showed that more genes associated with stress defenses are significantly upregulated in the resistant genotypes than in susceptible plants, which indicates that induction of gene expression may be a key EAB-resistance mechanism. In addition to defense-response differences, the differences observed in metabolic processes between the two groups also suggest that the resistant group responds more efficiently to insect attack than the susceptible genotypes (Fig. 2.1).
The GO term enrichment analysis showed that the carbohydrate metabolic process, secondary metabolic process, lipid metabolic process and catabolic process were all overrepresented in the resistant genotypes. However, only secondary metabolic processes were enriched in the susceptible group. In addition, genes associated with the carbohydrate metabolic process were significantly decreased in expression in the susceptible group. It is accepted that such significant changes in secondary metabolic processes within tissue undergoing insect herbivory are associated with primary metabolism, which enables plants to tolerate herbivory while minimizing impacts on fitness traits (Bolton, 2009; Frost et al., 2008). Energy and resources from primary metabolism have been shown to support secondary metabolism for plant defense (Schwachtje and Baldwin, 2008). Significant down-regulation of genes involved in carbohydrate metabolic pathways in the susceptible group suggests that susceptible genotypes may fail to reduce the negative fitness consequences of EAB attack. On the other hand, the susceptible group is more enriched in “pollen-pistil interaction”, “cellular homeostasis” and “peroxisome” compared to the resistant group, which suggests that the susceptible genotypes were unable to trigger a defense response specifically against EAB attack. Interestingly, we noticed that photosynthesis was significantly downregulated in the resistant group. Evidence is accumulating that reduced photosynthesis is a genetically programmed plant response to both biotic and abiotic stresses (Bilgin et al., 2010). The observed reduction in photosynthesis may therefore reflect earlier perception of the insect attack in resistant genotypes and a more efficient response to attack than in the susceptible group.
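To make the notion of an "overrepresented" GO term concrete, the sketch below computes a one-sided hypergeometric p-value, the kind of over-representation test that tools such as BiNGO offer. It is an illustration under stated assumptions, not a reproduction of the analysis: which test and multiple-testing correction were applied in this study is not specified in the text, and the function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided enrichment p-value P(X >= k).

    N: genes in the annotated reference set
    K: reference genes annotated with the GO term
    n: genes in the DEG list
    k: DEGs annotated with the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the probabilities of drawing k, k+1, ..., min(n, K) annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Hypothetical small example: 10 reference genes, 4 carry the term,
# 3 DEGs were drawn and 2 of them carry the term.
p = hypergeom_enrichment_p(k=2, n=3, K=4, N=10)
print(round(p, 4))
```

In practice, one p-value is computed per GO term and the resulting set is corrected for multiple testing before terms are declared enriched; running the test separately on upregulated and downregulated DEGs gives the two-sided picture described in this section.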

Pathway Analysis

Although more upregulated genes associated with defense response were detected in the resistant group, it is still difficult to determine from gene annotations alone which specific process or pathway is responsible for EAB resistance in green ash. To further uncover details of the metabolic processes involved in EAB resistance, MapMan was used to compare the functional categories of DEGs between resistant and susceptible genotypes (Figs. 2.2-2.4). The MapMan result showed that the major categories of DEGs were related to cell walls, lipids and secondary metabolites (Fig. 2.2). Each square box in the pathway figures represents a DEG, and the grey dots represent metabolic pathways to which none of the DEGs were assigned. In terms of cell wall synthesis, the MapMan result illustrated that most DEGs were up-regulated in the resistant group, whereas more genes were down-regulated in the susceptible group. This suggests that the resistant group responded quickly to insect attack through cell wall remodeling. Our observation supports the idea that cell wall remodeling is one mechanism by which plants protect themselves against insect herbivory, through modified physical barriers. For secondary metabolite pathways, all DEGs associated with phenylpropanoid, simple phenol, lignin and lignan biosynthesis were induced in the resistant group, whereas some of the DEGs in these pathways were decreased in the susceptible group (Fig. 2.3). In addition, non-mevalonate (non-MVA) pathways were down-regulated in the susceptible group. Looking more closely at the differences among large enzyme families can increase our understanding of the variation in resistance within green ash. In general, DEGs associated with large enzyme families, such as cytochrome P450s, peroxidases, phosphatases, oxidases and uridine diphosphate (UDP) glycosyltransferases (UGTs), were mostly induced in the resistant group relative to the susceptible group (Fig. 2.4).
It has been reported that the cytochrome P450 (CYP) superfamily plays important roles in the response to biotic stresses through its involvement in phytoalexin biosynthesis, hormone metabolism and the biosynthesis of other secondary metabolites (Xu, 2015). Peroxidases also function in plant development and defense. Up-regulation of peroxidase has been suggested as a direct response to chinch bug feeding in resistant buffalo grasses (Usha Rani and Jyothsna, 2010). The UGT superfamily is thought to encode enzymes that glycosylate a broad array of aglycones, such as plant hormones, all major classes of plant secondary metabolites, and xenobiotics, and may thereby function in plant defense and stress tolerance (Vogt and Jones, 2000). Our results suggest that the defense response in resistant genotypes involves the induction of more genes than in the susceptible group. Overall, both GO term enrichment and metabolic pathway analyses provided evidence that the resistant group was more efficient than the susceptible group in recognizing EAB attack and in initiating and maintaining defense responses. Given the increase in secondary metabolism in both groups, we can conclude that both resistant and susceptible groups are affected by the insect attack and may turn on defense mechanisms. However, without increased primary metabolism to supply the required energy and resources, the susceptible genotypes have to decrease their growth to save energy for defense. Additionally, cell wall remodeling in the resistant group could reduce or slow down insect herbivory. To uncover why the susceptible group is unable to initiate and/or maintain defense responses, network analysis among the DEGs was used to gain a better understanding of key mechanisms, such as the identification of hub genes playing key regulatory or signaling roles.

Network Analysis

Functional data mining of DEGs was performed based on predetermined protein-protein interactions (PPI) in Arabidopsis. A first functional network was created to represent differentially expressed transcription factors. A second network was generated to represent differentially expressed hormone-related genes, and a third to represent differentially expressed defense-related genes. Within each functional network, three sub-networks were distinguished, representing DEGs unique to the resistant group, DEGs unique to the susceptible group, and DEGs common to both groups. In addition, the top 20 genes related to the query DEGs were included in the network analysis. Within each network, genes with higher node degree were identified as putative hub genes (Figs. 2.5-2.7).

Transcription factors (TFs) are key regulatory proteins in processes such as responses to biotic and abiotic stresses. Among all differentially expressed TFs, those uniquely up- or down-regulated in resistant genotypes are shown in Fig. 2.5A, in which the central hub genes (RRTF1, RHL41, DDF1 and ERF107) are highlighted in yellow. Circles with red borders represent genes with increased expression after EAB treatment, while circles with green borders represent genes with decreased expression after EAB treatment. Fig. 2.5B shows TFs that are uniquely up- or down-regulated in susceptible genotypes after EAB treatment, wherein five hub genes are highlighted in yellow: COL2, COL5, SIGE, STH and CDF3. The TFs that were differentially expressed in both groups after EAB treatment are illustrated in Fig. 2.5C. In this case, the potential hub genes are highlighted with blue borders, including WRKY11, WRKY22, BHLH112 and NAC83.

Like TFs, plant hormones have also been reported to play key roles in plant immunity (Bari and Jones, 2009). To identify key genes related to hormone pathways, DEGs annotated to eight major phytohormone pathways, including abscisic acid, auxin, brassinosteroid, ethylene, gibberellins, jasmonic acid and salicylic acid, were included in the functional network analysis. Fig. 2.6A shows the hormone-related genes differentially expressed only in the resistant group. The color scheme is the same as described above for Fig. 2.5. The putative hub genes are highlighted in yellow, including TIFY10A, ATMYC2, JAZ6, MDR4, RAN1 and TGA1, all of which are up-regulated except TGA1. In contrast, all hub genes identified among the hormone-related DEGs unique to the susceptible group were down-regulated (Fig. 2.6B), including BIM1, ELF3, AOS and AOC3. Fig. 2.6C shows the hormone-related genes that are differentially expressed in both groups after EAB treatment. Blue borders in this panel mark four hub genes, including one up-regulated gene (MP) and three down-regulated genes (ARF19, IAA7 and IAA13).

Furthermore, plant defense-related genes were obtained when the corresponding GO term contained “response to abiotic and biotic stimulus” and “response to stress”. Network analysis was also conducted among the defense-related DEGs to identify hub genes. Fig. 2.7A shows the defense-related DEGs that were differentially expressed only in the resistant group, with the hub genes highlighted in yellow, including MYC2, WRKY53, TIFY10A, TCH4 and RHL41. Genes differentially expressed only in the susceptible group are shown in Fig. 2.7B, including COL5, HSC70-1, CAX3 and CESA3. The genes differentially expressed in both groups are shown in Fig. 2.7C, in which the hub genes are highlighted with blue borders, including HSP81-1, HSPRO2, STZ, WRKY40, WRKY6 and WRKY33.
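The division of each functional network into three sub-networks (resistant-only, susceptible-only and common DEGs) is, at its core, a set partition. The short sketch below shows that step in isolation; the function name is hypothetical and the gene lists are a tiny illustrative subset chosen to match the group memberships reported above, not the full DEG sets.

```python
def partition_degs(resistant_degs, susceptible_degs):
    """Split two DEG lists into group-specific and shared subsets."""
    resistant = set(resistant_degs)
    susceptible = set(susceptible_degs)
    return {
        "resistant_only": resistant - susceptible,
        "susceptible_only": susceptible - resistant,
        "common": resistant & susceptible,
    }


# Illustrative subset: RRTF1 was reported as resistant-specific, COL2 as
# susceptible-specific and WRKY11 as differentially expressed in both groups.
groups = partition_degs(["RRTF1", "WRKY11"], ["COL2", "WRKY11"])
print(sorted(groups["resistant_only"]),
      sorted(groups["susceptible_only"]),
      sorted(groups["common"]))
```

Each of the three subsets can then be mapped onto the PPI network separately, which is what allows hub genes to be attributed to one group or to both.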

Putative hub genes

A number of interesting putative hub genes were identified from the network analysis in both the resistant and susceptible groups. Functional analysis of the hub genes was conducted by including the top 20 genes related to the queries from the network analysis. As shown in Table 2.1, all putative hub TFs are up-regulated except BHLH11 and ANAC08. In Table 2.2, hormone-related hub genes are all up-regulated (except TGA1) in the resistant group but all down-regulated in the susceptible group. Of the four common hormone-related hub genes, only one was up-regulated. Defense-related hub genes are mostly up-regulated, except for one common gene (HSP81-1) and two susceptible-group-specific genes (HSC70-1 and CESA3) (Table 2.3).

The functional annotation results suggest that jasmonic acid pathways may contribute to the EAB resistance in green ash (Table 2.2). Among the eight important plant hormones, jasmonate (JA) has been reported to function as a signaling molecule in plant defenses, particularly against insect attack (Wu and Baldwin, 2010). JA biosynthesis, signal transport and regulation of downstream genes can all contribute to plant defense against insect attack. In the resting state, 80% of JA is in the trans-conformation, but after wounding 80% of JA is found in the more bioactive cis-conformation (Schulze et al., 2006). Active JA is then transported within the plant to regulate the expression of defense-responsive genes both locally and systemically. Among the putative hubs of hormone-related genes in the resistant group, the transcription factor MYC2 and two jasmonate ZIM domain (JAZ) proteins are involved in regulation of JA-responsive genes. The JA-induced JAZ family and MYC2 have been experimentally identified as primary response genes in JA pathways (Chung et al., 2008). With accumulation of JA, the hormone will bind to the Jas motif of JAZ proteins and release MYC2. As a result, the transcription factor MYC2 is then free to initiate expression of the JA-responsive genes. All three up-regulated genes appear to function to increase JA levels in the resistant group. The down-regulation of transcription factor TGA1 also suggests that expression of JA-responsive genes is increased in the resistant group, as TGA factors can form a complex with a key component of the salicylic acid (SA) defense signaling pathway and then suppress the JA-responsive genes (Bari and Jones, 2009). Since the copper-transporting ATPase RAN1 is essential for ethylene signaling (Binder et al., 2010), and two of the hub TFs are ERFs, we can conclude that the ethylene signaling pathways were activated as well in the resistant group. It has been shown that several ERF members play important roles in mediating defense response in Arabidopsis (McGrath et al., 2005). In addition, TCH4 was another hub gene specifically up-regulated in the resistant group, which supports the possibility that cell wall remodeling may contribute to EAB resistance to at least some extent (Xu et al., 1995).

Although up-regulation of hub genes from the resistant group suggests a potential mechanism of EAB resistance, hub genes from the susceptible group can further indicate why susceptible genotypes failed to activate a defense response. Notably, in the susceptible group, all four hubs of hormone-related genes are down-regulated, and two of them (i.e. AOS and AOC) are key components of jasmonic acid biosynthesis. Without the accumulation of active JA, JAZ proteins will bind to the transcription factor MYC2, preventing JA-responsive gene expression. In other words, susceptible genotypes failed to accumulate JA for triggering the JA-responsive defense.

In terms of plant defense against insect attack, volatile chemicals are also important to consider. The resistant genotypes may be able to emit volatiles to keep insects away or to transport defense signals to neighboring plants. Some studies have reported that herbivore-induced plant volatile (HIPV) release requires the JA signaling pathways (Degenhardt et al., 2010; Girling et al., 2008). Based on our analysis, we can conclude that the JA pathway is likely to be associated with both direct and indirect resistance to EAB in green ash.

Conclusions

Among the eight important plant hormones, the jasmonate hormones are known to function in plant defense, particularly against insect attacks, acting as signaling molecules (Wu and Baldwin, 2010). Jasmonate biosynthesis, signal transport and regulation of downstream genes can all affect plant defense against insect attack. In the resting state, 80% of JA is in the trans-conformation, but after wounding 80% of JA is found in the more bioactive cis-conformation (Schulze et al., 2006). The active JA is then transported within the plant and regulates the expression of defense-responsive genes both locally and systemically. Among the putative hubs regulating hormone-related genes in the resistant group, the transcription factor MYC2 and two jasmonate ZIM domain (JAZ) proteins were identified, which are known to be involved in regulation of JA-responsive genes. The JA-induced JAZ protein family and MYC2 have been experimentally identified as primary response genes in JA pathways (Chung et al., 2008). With accumulation of JA, the hormone will bind to the Jas motif of JAZ proteins and release MYC2. As a result, the transcription factor MYC2 is free to initiate expression of the JA-responsive genes. The down-regulation of transcription factor TGA1 also indicates that expression of JA-responsive genes increased in the resistant group, because TGA factors can form a complex with a key component of the SA defense signaling pathway and then suppress the JA-responsive genes (Bari and Jones, 2009). The copper-transporting ATPase RAN1 is essential for ethylene signaling (Binder et al., 2010). Combined with our observation that two of the hub TFs were ERFs, we can conclude that the ethylene signaling pathways were also activated in the resistant group. It has been shown that several ERF members play important roles in mediating defense response in Arabidopsis (McGrath et al., 2005). In addition, TCH4 was another hub gene specifically up-regulated in the resistant group, which supports the possibility that cell wall remodeling may contribute to the EAB resistance, at least to some extent (Xu et al., 1995).

On the other hand, in the susceptible group, all four hubs of hormone-related genes were down-regulated, two of which (i.e. AOS and AOC) are key components of jasmonic acid biosynthesis. Without the accumulation of active JA, JAZ proteins will bind to the transcription factor MYC2, preventing JA-responsive gene expression. In terms of plant defense against insect attack, volatiles are also an important factor to consider. The resistant genotypes may be able to emit volatiles to keep insects away or to transport signals to neighboring plants. Some studies have reported that herbivore-induced plant volatile (HIPV) release requires the JA signaling pathways (Degenhardt et al., 2010; Girling et al., 2008).

References

Bari, R., and Jones, J.D. (2009). Role of plant hormones in plant defence responses. Plant molecular biology 69, 473-488.
Bilgin, D.D., Zavala, J.A., Zhu, J., Clough, S.J., Ort, D.R., and DeLucia, E.H. (2010). Biotic stress globally downregulates photosynthesis genes. Plant, cell & environment 33, 1597-1613.
Binder, B.M., Rodríguez, F.I., and Bleecker, A.B. (2010). The copper transporter RAN1 is essential for biogenesis of ethylene receptors in Arabidopsis. Journal of Biological Chemistry 285, 37263-37270.
Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120.
Bolton, M.D. (2009). Primary metabolism and plant defense--fuel for the fire. Molecular plant-microbe interactions : MPMI 22, 487-497.
Chung, H.S., Koo, A.J.K., Gao, X., Jayanty, S., Thines, B., Jones, A.D., and Howe, G.A. (2008). Regulation and Function of Arabidopsis JASMONATE ZIM-Domain Genes in Response to Wounding and Herbivory. Plant physiology 146, 952-964.
Cline, M.S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., et al. (2007). Integration of biological networks and gene expression data using Cytoscape. Nature protocols 2, 2366-2382.
Degenhardt, D.C., Refi-Hind, S., Stratmann, J.W., and Lincoln, D.E. (2010). Systemin and jasmonic acid regulate constitutive and herbivore-induced systemic volatile emissions in tomato, Solanum lycopersicum. Phytochemistry 71, 2024-2037.
Frost, C.J., Mescher, M.C., Carlson, J.E., and De Moraes, C.M. (2008). Plant defense priming against herbivores: getting ready for a different battle. Plant physiology 146, 818-824.
Girling, R.D., Madison, R., Hassall, M., Poppy, G.M., and Turner, J.G. (2008). Investigations into plant biochemical wound-response pathways involved in the production of aphid-induced plant volatiles. Journal of experimental botany 59, 3077-3085.
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology 29, 644.
Jiang, Z., Liu, X., Peng, Z., Wan, Y., Ji, Y., He, W., Wan, W., Luo, J., and Guo, H. (2011). AHD2.0: an update version of Arabidopsis Hormone Database for plant systematic studies. Nucleic acids research 39, D1123-1129.
Lane, T., Best, T., Zembower, N., Davitt, J., Henry, N., Xu, Y., Koch, J., Liang, H., McGraw, J., Schuster, S., et al. (2016). The green ash transcriptome and identification of genes responding to abiotic and biotic stresses. BMC genomics 17, 702.
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15, 1-21.
Maere, S., Heymans, K., and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448-3449.
McGrath, K.C., Dombrecht, B., Manners, J.M., Schenk, P.M., Edgar, C.I., Maclean, D.J., Scheible, W.-R., Udvardi, M.K., and Kazan, K. (2005). Repressor- and Activator-Type Ethylene Response Factors Functioning in Jasmonate Signaling and Disease Resistance Identified via a Genome-Wide Screen of Arabidopsis Transcription Factor Gene Expression. Plant physiology 139, 949-959.
Schulze, B., Lauchli, R., Sonwa, M.M., Schmidt, A., and Boland, W. (2006). Profiling of structurally labile oxylipins in plants by in situ derivatization with pentafluorobenzyl hydroxylamine. Analytical biochemistry 348, 269-283.
Schwachtje, J., and Baldwin, I.T. (2008). Why Does Herbivore Attack Reconfigure Primary Metabolism? Plant physiology 146, 845-851.
Thimm, O., Blasing, O., Gibon, Y., Nagel, A., Meyer, S., Kruger, P., Selbig, J., Muller, L.A., Rhee, S.Y., and Stitt, M. (2004). MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. The Plant journal : for cell and molecular biology 37, 914-939.
Usha Rani, P., and Jyothsna, Y. (2010). Biochemical and enzymatic changes in rice plants as a mechanism of defense. Acta Physiologiae Plantarum 32, 695-701.
Vogt, T., and Jones, P. (2000). Glycosyltransferases in plant natural product synthesis: characterization of a supergene family. Trends Plant Sci 5, 380-386.
Whitehill, J.G., Rigsby, C., Cipollini, D., Herms, D.A., and Bonello, P. (2014). Decreased emergence of emerald ash borer from ash treated with methyl jasmonate is associated with induction of general defense traits and the toxic phenolic compound verbascoside. Oecologia 176, 1047-1059.
Wu, J., and Baldwin, I.T. (2010). New insights into plant responses to the attack from insect herbivores. Annual review of genetics 44, 1-24.
Xu, J. (2015). The cytochrome P450 superfamily: Key players in plant development and defense. Journal of Integrative Agriculture 14, 1673-1686.
Xu, W., Purugganan, M.M., Polisensky, D.H., Antosiewicz, D.M., Fry, S.C., and Braam, J. (1995). Arabidopsis TCH4, regulated by hormones and the environment, encodes a xyloglucan endotransglycosylase. Plant Cell 7, 1555-1567.

Figure 2.1 GO term enrichment analysis. Enrichment analysis of over-represented GO-slim terms of up- (A) and down- (B) regulated DEGs. The y-axis represents the p-value (i.e. log10 P^-1).

Figure 2.2 MapMan overview of metabolite pathways in resistant (left) and susceptible (right) genotypes.

Figure 2.3 Secondary pathways in resistant (left) and susceptible (right) genotypes.

Figure 2.4 Large enzyme pathways in resistant (left) and susceptible (right) genotypes.

Figure 2.5 Network analysis of differentially expressed transcription factors (TFs). (A): TFs that were only differentially expressed in the resistant group, with hubs highlighted in yellow. (B): TFs only differentially expressed in the susceptible group, with hubs highlighted in yellow. (C): TFs differentially expressed in both resistant and susceptible groups, with hubs highlighted with a blue border. Circles represent the TF queries and diamonds represent the top 20 genes related to the input queries. Red represents up-regulated DEGs and green represents down-regulated DEGs. For the common TFs (C), green circles with red borders represent down-regulated expression in the resistant group but up-regulated expression in the susceptible group, while red circles with green borders represent up-regulated expression in the resistant group but down-regulated expression in the susceptible group; green and red circles represent decreased and increased expression in both groups, respectively; blue borders represent the putative hub genes.

Figure 2.6 Network analysis of differentially expressed hormone-related genes. Hub genes identified from hormone-related genes in resistant group (A) or in susceptible group (B) are highlighted in yellow. Hub genes identified from differentially expressed hormone-related genes in both resistant and susceptible groups (C) are highlighted with blue border.

Figure 2.7 Network analysis of defense-related genes. Hub genes identified from defense-related genes in the resistant group (A) or in the susceptible group (B) are highlighted in yellow. Hub genes identified from differentially expressed defense-related genes in both resistant and susceptible groups (C) are highlighted with a blue border.

Table 2.1 Putative hub transcription factors

| Hubs | Arabidopsis ID | Transcript ID | Gene Name | # of Nodes | Differential Expression (R) | Differential Expression (S) | Description |
|---|---|---|---|---|---|---|---|
| R-Group specific | AT4G34410 | 45058_c0_seq8 | RRTF1 | 27 | Up | NA | Ethylene-responsive transcription factor 109 |
| R-Group specific | AT5G59820 | 23323_c0_seq1 | RHL41 | 25 | Up | NA | Zinc finger protein |
| R-Group specific | AT1G12610 | 49273_c0_seq1 | DDF1 | 22 | Up | NA | Dehydration-responsive element-binding protein |
| R-Group specific | AT1G19210 | 50055_c0_seq1; 47599_c0_seq1 | ERF017 | 22 | Up | NA | Ethylene-responsive transcription factor 17 |
| Common | AT4G31550 | 33833_c0_seq1; 61210_c0_seq1; 51048_c0_seq1 | WRKY11 | 44 | Up | Up | Probable WRKY transcription factor 11 |
| Common | AT4G01250 | 54959_c0_seq1 | WRKY22 | 43 | Up | Up | WRKY transcription factor 22 |
| Common | AT1G61660 | 52100_c0_seq1 | BHLH112 | 41 | Down | Down | Basic helix-loop-helix protein 112 |
| Common | AT5G13180 | 49933_c0_seq2 | ANAC083 | 40 | Down | Down | NAC domain containing protein 83 |
| S-Group specific | AT3G02380 | 55842_c0_seq5 | COL2 | 31 | NA | Up | Zinc finger protein CONSTANS-LIKE 2 |
| S-Group specific | AT5G57660 | 59118_c0_seq3 | COL5 | 31 | NA | Up | Zinc finger protein CONSTANS-LIKE 5 |
| S-Group specific | AT5G24120 | 48790_c0_seq2; 58493_c0_seq1 | SIGE | 31 | NA | Up | RNA polymerase sigma factor E |
| S-Group specific | AT2G31380 | 49752_c0_seq2 | STH | 30 | NA | Up | B-box zinc finger protein 25 |
| S-Group specific | AT3G47500 | 54702_c0_seq2 | CDF3 | 30 | NA | Up | Cyclic dof factor 3 |

Table 2.2 Putative hub genes associated with plant hormone

| Hubs | Arabidopsis ID | Transcript ID | Gene Name | # of Nodes | Differential Expression (R) | Differential Expression (S) | Description |
|---|---|---|---|---|---|---|---|
| R-Group specific | AT2G47000 | 64181_c0_seq3 | MDR4 | 16 | Up | NA | ABC transporter B family member 4 |
| R-Group specific | AT1G32640 | 24423_c0_seq1 | ATMYC2 | 16 | Up | NA | Transcription factor MYC2 |
| R-Group specific | AT1G72450 | 54896_c0_seq2 | JAZ6 | 16 | Up | NA | Jasmonate ZIM domain-containing protein 6 |
| R-Group specific | AT1G19180 | 52463_c0_seq2 | TIFY10A | 17 | Up | NA | Jasmonate ZIM domain-containing protein 1 |
| R-Group specific | AT5G44790 | 58689_c0_seq4 | RAN1 | 17 | Up | NA | Copper-transporting ATPase RAN1 |
| R-Group specific | AT5G65210 | 38288_c0_seq1 | TGA1 | 18 | Down | NA | Transcription factor TGA1 |
| Common | AT1G19220 | 49411_c0_seq14 | ARF19 | 46 | Down | Down | Auxin response factor 19 |
| Common | AT3G23050 | 57202_c0_seq1 | IAA7 | 52 | Down | Down | Auxin-responsive protein IAA7 |
| Common | AT1G19850 | 49510_c0_seq1 | MP | 50 | Up | Up | Auxin response factor 5 |
| Common | AT2G33310 | 54406_c0_seq1 | IAA13 | 63 | Down | Down | Auxin-responsive protein IAA13 |
| S-Group specific | AT2G25930 | 62100_c0_seq2 | ELF3 | 12 | NA | Down | Protein early flowering 3 |
| S-Group specific | AT5G42650 | 46274_c0_seq1 | AOS | 12 | NA | Down | Allene oxide synthase, chloroplastic |
| S-Group specific | AT5G08130 | 64318_c0_seq8 | BIM1 | 12 | NA | Down | Transcription factor BIM1 |
| S-Group specific | AT3G25780 | 42148_c0_seq2 | AOC3 | 13 | NA | Down | Allene oxide cyclase 3, chloroplastic |

Table 2.3 Putative hub genes associated with plant defense

| Hubs | Arabidopsis ID | Transcript ID | Gene Name | # of Nodes | Differential Expression (R) | Differential Expression (S) | Description |
|---|---|---|---|---|---|---|---|
| R-Group specific | AT5G57560 | 60029_c0_seq1 | TCH4 | 52 | Up | NA | Xyloglucan endotransglucosylase/hydrolase protein 22 |
| R-Group specific | AT4G23810 | 45856_c0_seq1 | WRKY53 | 53 | Up | NA | Probable WRKY transcription factor 53 |
| R-Group specific | AT5G59820 | 23323_c0_seq1 | RHL41 | 53 | Up | NA | Zinc finger protein ZAT12 |
| R-Group specific | AT1G19180 | 26412_c0_seq1 | TIFY10A | 54 | Up | NA | Jasmonate ZIM domain-containing protein 1 |
| R-Group specific | AT1G32640 | 24423_c0_seq1 | ATMYC2 | 56 | Up | NA | Transcription factor MYC2 |
| Common | AT5G52640 | 57777_c0_seq2 | HSP81-1 | 70 | Down | Down | Heat shock protein 81-1 |
| Common | AT2G40000 | 14588_c0_seq1; 57197_c0_seq1 | HSPRO2 | 71 | Up | Up | Nematode resistance protein-like HSPRO2 |
| Common | AT1G27730 | 42800_c0_seq1; 46938_c0_seq1 | STZ | 71 | Up | Up | Zinc finger protein ZAT10 |
| Common | AT1G80840 | 54869_c0_seq2 | WRKY40 | 75 | Up | Up | Probable WRKY transcription factor 40 |
| Common | AT1G62300 | 57306_c0_seq1; 32283_c0_seq1; 40929_c0_seq1; 49462_c0_seq2; 52284_c1_seq2; 60056_c0_seq1 | WRKY6 | 78 | Up | Up | WRKY transcription factor 6 |
| Common | AT2G38470 | 25706_c0_seq1; 50968_c0_seq1; 52918_c0_seq1; 54902_c0_seq1; 60267_c0_seq1; 28099_c0_seq1 | WRKY33 | 86 | Up | Up | Probable WRKY transcription factor 33 |
| S-Group specific | AT5G57660 | 53020_c0_seq1 | ATCOL5 | 31 | NA | Up | Zinc finger protein CONSTANS-LIKE 5 |
| S-Group specific | AT5G02500 | 57860_c0_seq1 | HSC70-1 | 32 | NA | Down | Probable mediator of RNA polymerase II transcription subunit 37e |
| S-Group specific | AT3G51860 | 47086_c1_seq1 | CAX3 | 32 | NA | Up | Vacuolar cation/proton exchanger 3 |
| S-Group specific | AT5G05170 | 56269_c0_seq4 | CESA3 | 35 | NA | Down | Cellulose synthase A catalytic subunit 3 |

Chapter 3

Population structure and genetic diversity of green ash (Fraxinus pennsylvanica) assessed with SSR markers

Abstract

Green ash (Fraxinus pennsylvanica Marsh.) is the most widely distributed of about 20 ash species in North America that are believed to be under severe threat from the rapid invasion of emerald ash borer (Agrilus planipennis; EAB), an Asian wood-boring beetle. To assist in green ash protection and restoration, a study was conducted to determine the genetic diversity that existed across the species’ natural range prior to the introduction of EAB, using a range-wide selection of 491 accessions collected from 60 provenances in a long-standing common garden trial. Allelic variation was assessed using six simple sequence repeat (SSR) markers derived from the green ash transcriptome. Bayesian-based cluster analysis revealed two clusters, generally located south and north of 40.5°N in North America, while a third cluster consisted of admixtures of northern and southern genotypes at the intersection. The global Fst value and phenotypic variation also supported two differentiated sub-populations. Some evidence of further sub-groups within each of the major sub-populations was also observed, which hierarchical structure analysis with larger numbers of markers might elucidate. Our study captured the genetic diversity and population structure of natural accessions and will be useful for testing how the recent decline caused by EAB has affected genetic variation in green ash. Our results also have implications for germplasm collections and ex situ conservation sampling.

Keywords

Green ash, genetic diversity, microsatellite marker, population structure

Introduction

Green ash (Fraxinus pennsylvanica Marsh.) is an economically and ecologically important component of North American forests and riparian ecosystems. Green ash is also widely planted in urban settings. The natural range of green ash extends across the eastern half of North America, from Cape Breton Island and Nova Scotia west to southeastern Alberta; south through central Montana and northeastern Wyoming to southeastern Texas; and east to northwestern Florida and Georgia (Little, 1971). It is one of the six or so most widely distributed angiosperm trees in North America. The introduction and rapid invasion of the emerald ash borer (EAB) from Asia has resulted in the death of millions of ash trees and continues to be a severe threat to native ash trees as it spreads across North America (Herms and McCullough, 2014). Furthermore, some studies suggest that among all major North American ash species, green ash is the preferred host for EAB (Anulewicz et al., 2007). Fortunately, intraspecific variation in green ash responses to EAB has been identified (Koch et al., 2015; Lane et al., 2016).

The dramatic decline caused by EAB in recent years will affect the genetic diversity and gene pool of green ash. Koch et al. (2015) reported that only a small number of green ash trees (i.e. ~ 1%) survived the EAB infestation in the studied stands, which suggests a genetic bottleneck for green ash. It is therefore urgent to understand the genetic structure of the species as a precondition to its possible future restoration as an ecologically and economically significant forest tree. However, no studies have been published to date on green ash allelic diversity, despite the severity of the EAB threat. Our goal for this study was thus to analyze genetic diversity and population structure in a provenance collection from natural stands of green ash using nuclear microsatellite markers. For long-term reforestation and preservation of the species, genetic diversity and population structure information will be helpful in ensuring that germplasm collections and new seed orchards represent a high proportion of the genetic diversity originally present in natural stands. Our study will also help in understanding how a recent decline influences the genetic diversity of a species.

Materials and Methods

Plant Materials

Tissue samples were collected from an ash provenance trial established in the 1970s on the Pennsylvania State University campus (Steiner, 1983). To establish the provenance trial, seeds were collected in 1975 from 216 green ash female trees, comprising two to four trees from each of 60 geographically distinct forest stand locations, 57 of which were believed to be natural (autochthonous) populations. The collection was intended to cover the natural range of the species (Steiner, 1983). In addition, seeds of white ash were collected from five natural stands. In 1978, seedlings were planted in a common garden on The Pennsylvania State University campus. The trial followed a randomized complete block design with 4-tree population plots randomized within each of 8 blocks (Steiner, 1983; Steiner et al., 1988). Typically, each of the 4 trees within a population plot was the open-pollinated offspring of a different female parent, and these family identities were maintained throughout the plantation. A total of 593 individuals from the 65 populations were selected for this study of genetic diversity across the natural range of green ash, including trees of white ash to serve as an outgroup. Genomic DNA was extracted from tissues of the 593 accessions using a CTAB-based protocol (Clarke, 2009) optimized for hardwood trees.

Simple sequence repeats (SSRs) genotyping and data analysis

Di-, tri-, and tetra-nucleotide microsatellite repeats were previously identified in a de novo assembly of green ash transcripts (Lane et al., 2016). Eight microsatellite loci from SSR-containing expressed sequence tag contigs (EST-SSRs) were used in this study (http://www.hardwoodgenomics.org/node/68249). Primers for the eight loci were designed using default settings in PRIMER3 (Koressaar and Remm, 2007; Untergasser et al., 2012). Total genomic DNA samples were diluted to 2 ng/µl prior to use in PCR amplification reactions consisting of 5x FIREPol Master Mix and primer mix (forward primers were labeled with fluorescent dyes). The PCR amplification conditions were as follows: 95°C (15 min), then 35 cycles at 94°C (30 s) / 56°C (90 s) / 72°C (90 s), and a final extension at 72°C for 10 min. Pooled PCR reaction mixes were size fractionated using capillary electrophoresis, and peaks were identified with GeneScan software (Applied Biosystems), followed by fragment sizing with GeneMapper v5.0 software (Applied Biosystems).

Marker informativeness testing

COLONY (Jones and Wang, 2010) was first used to identify full- and half-siblings among the accessions. Six pairs of full-siblings were identified, and one individual from each pair was excluded from the dataset before further analysis. GENEPOP v.4.2 (Rousset, 2008) was used to test for Hardy-Weinberg equilibrium (HWE), to estimate the proportion of null alleles (PNA) and the number of alleles per locus, and to determine the pairwise genotypic disequilibrium across the eight loci with Bonferroni-corrected p values. The transcript contig sequences that contained the eight EST-SSRs were aligned to the GenBank non-redundant sequence database (http://www.ncbi.nlm.nih.gov/) using BLASTX. The gene with the best alignment score was used to assign a putative function to each EST-SSR locus used in this study. In addition, the eight loci were tested for their ability to distinguish among the three ash species, as a measure of marker informativeness.
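To make the idea behind the HWE test concrete, the sketch below computes a chi-square goodness-of-fit statistic for one multi-allelic locus. This is an illustration only and is not the procedure used in the study: GENEPOP relies on exact tests, which behave better than a chi-square approximation when many alleles are rare. The function name `hwe_chi_square` and the allele sizes in the example are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_chi_square(genotypes):
    """Chi-square goodness-of-fit statistic for HWE at a single locus.

    genotypes : list of (allele, allele) tuples, one per diploid individual
    Returns (chi2, degrees_of_freedom) with df = k(k-1)/2 for k alleles.
    """
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    # Sort each genotype so that (150, 154) and (154, 150) are the same class.
    observed = Counter(tuple(sorted(g)) for g in genotypes)

    chi2 = 0.0
    alleles = sorted(freqs)
    for a, b in combinations_with_replacement(alleles, 2):
        if a == b:
            expected = n * freqs[a] ** 2             # homozygote: n * p^2
        else:
            expected = 2 * n * freqs[a] * freqs[b]   # heterozygote: 2n * p * q
        chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected
    k = len(alleles)
    return chi2, k * (k - 1) // 2


# Invented example with a strong heterozygote deficit (allele sizes in bp).
sample = [(150, 150)] * 40 + [(150, 154)] * 20 + [(154, 154)] * 40
print(hwe_chi_square(sample))
```

In this invented sample only 20 heterozygotes are observed where 50 would be expected, which is the kind of homozygote excess that a high proportion of null alleles can produce.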

Population structure and clustering

Bayesian admixture analysis in STRUCTURE v.2.3.4 was used to determine the optimal number of clusters of families and populations among the accessions in the provenance trial. Five replicate runs were conducted for each value of K from 1 to 62, with a burn-in period of 50,000 and 100,000 Markov chain Monte Carlo (MCMC) iterations, to select the best K value (Porras-Hurtado et al., 2013; Pritchard et al., 2000). After calculating the best K value, an additional 15 iterations of the analysis were performed for the best K value. Because the STRUCTURE runs were replicated, CLUMPP v.1.1.2 (Jakobsson and Rosenberg, 2007) was used to average the replicated STRUCTURE runs by collating all the replicates into a single matrix. To prepare the CLUMPP input files, STRUCTURE HARVESTER (Earl and vonHoldt, 2011) was used to convert the STRUCTURE output files to the format required by CLUMPP. With the best correspondence of the membership coefficients from CLUMPP, DISTRUCT v.1.1 (Rosenberg, 2004) was then used to provide a graphical display of the results.
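The Results section selects K with the Delta K statistic of Evanno et al. (2005), which is based on the second-order rate of change of the log-likelihood with respect to K. A minimal sketch of that calculation follows, assuming the form that divides the absolute second difference of the mean Ln P(D) across replicates by the standard deviation of Ln P(D) at that K. The function name `delta_k` is hypothetical and the log-likelihood values are invented, not taken from this study.

```python
from statistics import mean, stdev


def delta_k(lnp):
    """Compute Delta K for every K that has both K-1 and K+1 available.

    lnp : dict mapping K -> list of Ln P(D) values from replicate runs
    Returns dict K -> Delta K.
    Note: replicates with identical likelihoods give a zero standard
    deviation and would raise ZeroDivisionError.
    """
    means = {k: mean(values) for k, values in lnp.items()}
    result = {}
    for k in sorted(lnp):
        if k - 1 in means and k + 1 in means:
            # |L(K+1) - 2 L(K) + L(K-1)| on the replicate means
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / stdev(lnp[k])
    return result


# Invented log-likelihoods for illustration only.
runs = {
    1: [-9000.0, -9002.0, -8998.0],
    2: [-8500.0, -8501.0, -8499.0],
    3: [-8450.0, -8460.0, -8440.0],
    4: [-8430.0, -8445.0, -8415.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Because the statistic needs both neighbours, it is undefined for the smallest and largest K tested, so it cannot by itself support K = 1.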

Genetic diversity and population differentiation

We used GenAlEx 6.501 (Peakall and Smouse, 2006) to estimate the genetic statistics, including the number of alleles, observed heterozygosity, expected heterozygosity, population differentiation, count of private alleles, and rare alleles with frequency less than 5%. The “adegenet” package (Jombart, 2008) in R was used to perform a principal component analysis (PCA) for the detection of genetic clusters and to estimate the inbreeding coefficients. Analysis of molecular variance (AMOVA) for the genotyped accessions was conducted using Arlequin 3.5 (Excoffier and Lischer, 2010).
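Several of the statistics listed above have simple textbook definitions. As a sketch under that assumption (it does not reproduce GenAlEx itself and applies no sample-size correction), the following computes observed heterozygosity Ho, expected heterozygosity He = 1 - sum(p_i^2), the fixation index F = 1 - Ho/He, and private alleles for one locus. Function names, population labels and allele sizes are hypothetical.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (Ho, He, F) for one locus.

    genotypes : list of (allele, allele) tuples for diploid individuals
    """
    n = len(genotypes)
    ho = sum(1 for a, b in genotypes if a != b) / n
    counts = Counter(a for g in genotypes for a in g)
    he = 1.0 - sum((c / (2 * n)) ** 2 for c in counts.values())
    f = 1.0 - ho / he if he > 0 else float("nan")
    return ho, he, f


def private_alleles(pop_genotypes):
    """Return dict population -> set of alleles seen only in that population.

    pop_genotypes : dict population -> list of (allele, allele) tuples
    """
    seen = {p: {a for g in gs for a in g} for p, gs in pop_genotypes.items()}
    result = {}
    for pop, alleles in seen.items():
        others = set().union(*(s for q, s in seen.items() if q != pop))
        result[pop] = alleles - others
    return result


# Invented genotypes (allele sizes in bp) for two hypothetical populations.
pops = {
    "north": [(150, 150), (150, 154), (154, 154), (150, 150)],
    "south": [(154, 158), (158, 158), (154, 154), (154, 158)],
}
print(heterozygosity(pops["north"]))
print(private_alleles(pops))
```

A positive F, as in the invented "north" sample, indicates fewer heterozygotes than expected under random mating.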

Phenotypic data

Diameter at breast height (DBH), height, budburst and foliage coloration were included in the study to detect the variation across selected accessions. DBH was measured in 1990, 2009 and 2012, while height was measured in six continuous growing seasons from 1978 to 1983 and then in 1985, 1988 and 1990. Foliage coloration was recorded as the plot mean number of days past Sept. 26, 1979, to peak coloration. Budburst was recorded as the plot average number of days after Apr. 17, 1981, when more than 50% of terminal buds had developing leaves at least 0.5 cm long. The Dunnett modified Tukey-Kramer pairwise comparison test was used to assess the significance of phenotypic differences among multiple comparisons with unbalanced sample sizes and/or unequal variances among groups (Dunnett, 1980; Lau, 2013).

Results

The 593 samples from the 65 populations in our data set represent the broadest existing sample of green ash (Steiner et al., 1988), along with outgroup samples from white ash (F. americana) grown in the same common garden. White ash is a member of the same Fraxinus subsection as green ash (Wallander, 2008). The general goal of our study was to assess both genetic variation and adaptive phenotypic variation within this widely distributed species, to determine whether more in-depth genome-wide association studies might be warranted.

Genotyping and filtering process

Eight EST-based microsatellite markers predicted from transcriptome data for green ash (http://www.hardwoodgenomics.org/node/68249) were tested and identified as informative markers for our study (Table 3.1). All 593 accessions were genotyped at the 8 loci. COLONY was used first to identify full sibs within each provenance. As a result, 6 samples were deleted from the dataset as being full-siblings. Since only 8 loci were genotyped, individuals with any missing data were also removed from the dataset to eliminate bias in the further analysis. In addition, after excluding individuals with missing data, five provenances with fewer than 4 individuals remaining were also removed from the dataset. The filtering steps left 58 green ash provenances and 3 white ash provenances, each with at least 4 individuals, in the dataset. With our filtering criteria, a total of 448 individuals originating from 61 provenances were retained for further analysis.
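The two data-driven filtering steps described above (dropping individuals with any missing genotype, then dropping provenances left with too few individuals) can be sketched as follows. The record layout, the function name `filter_accessions` and the example data are hypothetical, and the threshold of 4 is read from the text as "at least 4 individuals retained". Sibling removal with COLONY is not modeled.

```python
from collections import defaultdict


def filter_accessions(records, min_per_provenance=4):
    """Apply the missing-data and provenance-size filters.

    records : list of dicts with keys 'id', 'provenance' and 'genotypes',
              where 'genotypes' is a list of (allele, allele) tuples and
              None marks a missing call (whole genotype or single allele)
    Returns the list of retained records, in their original order.
    """
    # Step 1: keep only individuals genotyped at every locus.
    complete = [
        r for r in records
        if all(g is not None and None not in g for g in r["genotypes"])
    ]
    # Step 2: keep only provenances that still have enough individuals.
    by_provenance = defaultdict(list)
    for r in complete:
        by_provenance[r["provenance"]].append(r)
    return [
        r for r in complete
        if len(by_provenance[r["provenance"]]) >= min_per_provenance
    ]


# Invented example: P1 keeps 4 of 5 individuals, P2 drops to 3 and is removed.
records = []
for i in range(5):
    calls = [(150, 154), (200, 200)] if i else [(150, 154), None]
    records.append({"id": f"P1_{i}", "provenance": "P1", "genotypes": calls})
for i in range(4):
    calls = [(150, 150), (200, 204)] if i else [(None, 150), (200, 204)]
    records.append({"id": f"P2_{i}", "provenance": "P2", "genotypes": calls})

kept = filter_accessions(records)
print([r["id"] for r in kept])
```

The order of the two steps matters: provenance sizes are counted only after incomplete individuals have been removed, as in the text.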

Informativeness of SSR markers

Although some studies have reported genetic variation in European and Asian ash species (Heuertz et al., 2004; Hu et al., 2008; Petit et al., 2003), information on molecular genetic variation in North American ash species is scant. Our first aim was to test whether the eight loci were informative enough to differentiate between green and white ash species. Correspondence analysis (CA) was carried out to study variation at the provenance level. The resulting CA plot clearly showed all of the green ash stands clustered together, but separated from the three white ash stands (Fig. 3.1). However, the CA plot did not show diversity within each provenance. The CA plot thus confirmed the informativeness of the eight loci for distinguishing between species and for correctly clustering the trees from diverse green ash stands as being more similar. To further investigate the informativeness of the markers for studying genetic variation within green ash, all the white ash individuals were then removed from the dataset, which left 429 accessions from 58 provenances for further analysis of variation within green ash.

Hardy–Weinberg equilibrium (HWE) testing

HWE tests showed that 7 of the 8 loci departed significantly from HWE. An excess of homozygotes best explained the significant departure from HWE for the 7 loci (Table 3.2). This may be characteristic of the 8 loci chosen, which were identified from EST datasets. EST-SSR markers may be less variable in general than genomic SSRs (Hu et al., 2011; Varshney et al., 2005), which may lead to higher observed homozygosity. A high proportion of null alleles is also possible in natural populations, especially in highly heterozygous forest tree species, which might also help to explain the deviations from HWE.

Additional allele statistics and filtering

A total of 171 alleles were amplified from the 8 loci across all 58 provenances, with an average of 21 alleles per locus and a range from 11 alleles in EST-854 to 37 alleles in EST-19414 (Table 3.3). For the EST-18631 and EST-11183 loci, the observed heterozygosities were much lower than the expected values, which may have resulted from a high proportion of null alleles. The proportion of null alleles (PNA) can affect estimates of genetic diversity in natural populations. Null alleles can cause heterozygotes to be miscalled as homozygotes, which inflates homozygote frequencies. Because the SSR locus from transcript EST-11183 had a PNA value > 25%, it was excluded from the dataset for subsequent analyses. Genetic linkage disequilibrium was another important consideration for marker informativeness and selection. The results showed that the SSR locus from EST-4588 is significantly linked with the SSR loci from EST-4511 and EST-854, but no linkage was observed between EST-4511 and EST-854. Such linkage was not expected given that only 8 SSR loci were used, and linkage disequilibrium in forest trees from native stands has been observed to extend less than a few thousand bp (Neale and Ingvarsson, 2008; Neale and Savolainen, 2004; Slavov et al., 2012). EST-4588 was therefore removed from the dataset to eliminate overrepresentation. After results from these two SSR markers (EST-11183 and EST-4588) were removed from the dataset, and individuals with missing data were removed, 491 individuals originating from 60 green ash provenances assayed at 6 loci remained for further analyses.
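The link between a heterozygote deficit and null alleles can be illustrated with a simple estimator. The sketch below uses the first estimator commonly attributed to Brookfield, r = (He - Ho) / (1 + He). This is an assumption made for illustration; the PNA values reported in this chapter come from GENEPOP, whose estimation method may differ. The function name `null_allele_freq` and the heterozygosity values are hypothetical; the 25% cut-off is the one stated in the text.

```python
def null_allele_freq(ho, he):
    """Estimate null allele frequency from a heterozygote deficit.

    ho : observed heterozygosity at the locus
    he : expected heterozygosity at the locus
    Uses r = (He - Ho) / (1 + He); negative estimates (heterozygote
    excess) are reported as 0.0.
    """
    return max(0.0, (he - ho) / (1.0 + he))


PNA_CUTOFF = 0.25  # exclusion threshold stated in the text

# Invented heterozygosities for two hypothetical loci.
loci = {"locus_A": (0.30, 0.80), "locus_B": (0.70, 0.78)}
for name, (ho, he) in loci.items():
    r = null_allele_freq(ho, he)
    status = "exclude" if r > PNA_CUTOFF else "keep"
    print(f"{name}: r = {r:.3f} -> {status}")
```

With these invented values the first locus exceeds the cut-off and the second does not, mirroring the decision rule applied to EST-11183.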

Population structure

The Bayesian STRUCTURE program was used to estimate the optimal number of subpopulations. Delta K, which is based on the second-order rate of change of the likelihood function with respect to K (Evanno et al., 2005), has been widely used in previous publications to choose the number of clusters (K). K values from 1 to 62 were tested, the latter value being the number of provenances plus two. The best-supported K was 2, capturing the majority of the structure present in the data, as shown in Figure 3.2. All individuals could be assigned to one of the two clusters based on their individual membership coefficients. The plot produced from the STRUCTURE output by DISTRUCT (Figure 3.3A) suggests that there are two clusters, with less than 1/3 of the individuals being admixtures of genotypes from the two clusters. Individuals with a Q value (the proportion of an individual’s ancestry derived from a cluster) above 80% for either cluster were placed in that cluster, while the rest of the individuals were classified as admixed (Figure 3.3A). Among the 491 accessions, 149 accessions originating from 18 provenances were assigned to one cluster (blue in Fig. 3.3A) and 191 accessions from 24 provenances were placed in a second cluster (orange in Fig. 3.3A). The remaining 151 accessions from 18 provenances were assigned to the admixed cluster. The two clusters from STRUCTURE are separated geographically at a latitude of 40.5°N (Fig. 3.4) and were therefore renamed the South cluster and the North cluster based on their origins. The locations of the 60 green ash provenances are plotted on the map in Figure 3.5. The colored triangles represent provenances in the North and South clusters, while provenances from the admixed cluster are shown as pie charts, in which the proportion of the genotype contributed by each of the two clusters is shown in the same colors as in Fig. 3.3. The admixed provenances were, as expected, distributed primarily at the interface between the North and South clusters. To further understand past demographic processes, hierarchical structure analysis was performed within each cluster; that is, the three population clusters were analyzed separately by STRUCTURE for further subdivision. The optimized K value was 2 for all three groups, as shown in Fig. 3.3B and C. Thus it appears that each of the 3 main clusters can be further subdivided into two sub-groups of provenances.
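The Delta K criterion and the 80% membership threshold described above can be sketched as follows. This is a simplified illustration that uses mean log-likelihoods across replicate runs; the function names `delta_k` and `assign_cluster` and the log-likelihood values are hypothetical and are not output from this study.

```python
from statistics import mean, stdev

def delta_k(lnp):
    """Delta K for each interior K.

    lnp maps K -> list of ln P(D) values from replicate STRUCTURE runs.
    Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), with L taken here
    as the mean over replicates.
    """
    out = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue  # Delta K is undefined at the smallest and largest K
        second = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
        sd = stdev(lnp[k])
        if sd > 0:
            out[k] = second / sd
    return out

def assign_cluster(q1, threshold=0.8):
    """Assign a group from the ancestry proportion q1 in cluster 1 (K=2)."""
    if q1 > threshold:
        return "cluster 1"
    if 1.0 - q1 > threshold:
        return "cluster 2"
    return "admixed"

# Hypothetical replicate log-likelihoods for K = 1..4
runs = {
    1: [-9000.0, -9002.0, -8998.0],
    2: [-8500.0, -8501.0, -8499.0],
    3: [-8450.0, -8455.0, -8445.0],
    4: [-8430.0, -8440.0, -8420.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
print([assign_cluster(q) for q in (0.95, 0.50, 0.10)])
```

In the toy data the likelihood gain levels off after K=2, so Delta K peaks at K=2; the threshold function then sorts membership coefficients into the two clusters or the admixed group.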

Genetic diversity and population differentiation among the three clusters

Among the 491 genotyped accessions, 128 alleles were amplified and the mean number of alleles per locus was 21.33. The mean observed heterozygosity across the six loci was 0.69, ranging between 0.34 (EST-18631) and 0.84 (EST-19414), while the mean expected heterozygosity was 0.80, ranging between 0.56 (EST-18631) and 0.93 (EST-754). Among the 491 accessions, significant differentiation was detected, with a mean overall Fst of 0.14 (P<0.01). As three subpopulations were detected among the 491 accessions, it is informative to assess how different the North, South and admixed clusters are. The PCA based on pairwise Fst values indicated that provenances from the South cluster tend to be genetically distant from the other two clusters (Fig. 3.6). Generally, the three clusters showed similar levels of genetic variation, with the South cluster showing the highest differentiation (Fst=0.13) (Table 3.4). All Fst values within clusters were statistically significant (P<0.01). The South cluster also showed a higher inbreeding coefficient (Fis=0.28) than the North and admixed clusters, although the differences were not statistically significant. As expected, we observed fewer private alleles in the admixed cluster than in the North and South clusters. AMOVA showed that differentiation among the three clusters contributed 3.61% (P<0.001) of the total variation (Table 3.5). Among the three clusters, with a mean differentiation of 3.33%, EST-4511 accounted for the highest differentiation (9.39%, Fst=0.16). Genetic differentiation across the six loci within clusters is summarized in Table 3.6. All loci showed similar variation within the three clusters, with significant Fst values ranging from 0.08 to 0.18; EST-854 showed the lowest differentiation, within the North cluster (Fst=0.08), while EST-4511 showed the highest differentiation, within the South cluster (Fst=0.18). However, EST-15761 showed reduced genetic variation in the northern subpopulations (Ho=0.65, He=0.76) compared with the southern subpopulations (Ho=0.87, He=0.90). Interestingly, EST-4511 showed a higher Ho (0.79) but a lower He (0.77) within the northern subpopulations than within the southern subpopulations (Ho=0.66, He=0.82), which may also suggest that the locus was under selection in the southern subpopulations. Across the six loci, EST-15761, EST-754, EST-19414 and EST-4511 showed significant differentiation within the southern subpopulations, while all loci except EST-18631 showed significant differentiation within the northern subpopulations. Within the admixed population, significant differentiation was detected at all loci.

In addition, all pairwise Fst values were significant (P<0.01), with the greatest differentiation detected between the South and North clusters (Fst=0.034) and the smallest between the admixed and North clusters (Fst=0.009) (Table 3.7). All loci except EST-854 and EST-15761 showed significant pairwise Fst values among the three clusters.
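To make the meaning of Fst concrete, the sketch below computes a simple Nei-style estimate, Fst = (Ht - Hs)/Ht, from allele frequencies at one locus. This is a simplified estimator with equal population weights and no sample-size correction, not necessarily the estimator behind the values reported here; the function names and the allele frequencies are hypothetical.

```python
def expected_het(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) from an allele -> frequency dict."""
    return 1.0 - sum(p * p for p in freqs.values())

def nei_fst(pop_freqs):
    """Nei-style Fst = (Ht - Hs) / Ht for one locus.

    pop_freqs: list of allele -> frequency dicts, one per population.
    Populations are weighted equally (an assumption of this sketch).
    """
    alleles = set().union(*pop_freqs)
    n = len(pop_freqs)
    mean_freqs = {a: sum(p.get(a, 0.0) for p in pop_freqs) / n for a in alleles}
    hs = sum(expected_het(p) for p in pop_freqs) / n  # mean within-population He
    ht = expected_het(mean_freqs)                     # He of the pooled population
    return (ht - hs) / ht if ht > 0 else 0.0

# Hypothetical allele frequencies in two clusters
north = {"A": 0.7, "B": 0.3}
south = {"A": 0.3, "B": 0.7}
fst_diverged = nei_fst([north, south])
fst_identical = nei_fst([north, north])
print(round(fst_diverged, 3), round(fst_identical, 3))
```

Identical allele frequencies give Fst = 0, and the value increases as the frequencies in the two groups diverge.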

Phenotypic variation among three major green ash clusters

In addition to genetic variation, phenotypic variation across the three primary clusters was also assessed (Table 3.8). Significant differences in growth (i.e. stem diameter and height) were observed overall between the clusters. DBH in 1990 and height in 1978 differed significantly among all three clusters. However, only the admixed cluster always showed significantly greater DBH and height values than the other two clusters. The South cluster showed significantly earlier budburst and later foliage coloration, while the North cluster showed later budburst and earlier foliage coloration. PCA was conducted to visualize the amount of phenotypic variation among the accessions assigned to the North, South and admixed clusters. In general, PCA supported the differentiation between the North and South clusters (Fig. 3.7, 3.8). As expected, the admixed cluster overlapped with the other two groups.

Discussion

Although the organization of genetic diversity in F. pennsylvanica is still unclear, a few studies have previously reported on phenotypic variation. It was previously hypothesized that the border between Nebraska and Kansas split green ash into northern and southern Great Plains varieties or ecotypes (Ying, 1971). Ying (1971) described the southern ecotype as having a longer growing season and a faster growth rate than the northern ecotype. However, none of these putative ecotypes has been given a Latin varietal or subspecies name, as no differentiation can be distinguished in natural conditions. Williams (1984) assessed cold tolerance and winter injury in the natural accessions and then clustered those accessions into different groups. Williams developed five zones across the natural range of green ash based on the mean winter injuries of the 55 provenances. Interestingly, our North cluster, as defined by STRUCTURE, was located in Williams’ zone 1, and our South cluster was located in Williams’ zones 3, 4 and 5, while the majority of our admixed cluster was located within Williams’ zone 2. Pairwise Fst values across Williams’ five zones were also computed based on our genotypic data (Table 3.9). According to the stability of cold tolerance, Williams observed that 3 of the identified clusters were differentiated by differences in overall cold tolerance. However, with our genotypic data, only one cluster was obviously differentiated from the other clusters, showing higher pairwise Fst values (Table 3.10). Our present study constitutes the first report of population genetics in green ash. Our results suggest that population structure exists in green ash, as in most trees with a large range, even though no obvious morphological variation has been reported among its populations.
The initial accession panel of our study (a provenance trial established in 1978) included several families of white ash in addition to the green ash populations of interest, which enabled us to also compare genetic diversity between ash species. The role of the two species as outgroups was also valuable in helping to assess the informativeness of the molecular markers. EST-SSR markers were chosen for this study because of their relatively higher transferability to related species compared with genomic SSRs, which ensured that genetic variation could be compared between the species. Furthermore, for non-model species without a reference genome, EST databases offer a rich and easily accessible source of SSR markers. Because EST-SSR markers are more conserved than genomic SSRs, they may capture lower levels of diversity and may result in an excess of homozygotes compared to genomic SSRs. However, more conserved markers also have a greater chance of success in PCR amplification across species and accessions, which was important for this study. The existence of the provenance trial at Penn State allowed us to examine the population structure of green ash across its large natural range in North America as it existed prior to invasion by the emerald ash borer. In our study, the occurrence of population genetic structure within green ash was evaluated. We identified two groups, one represented by the more northern populations and a second made up of the southern populations. Evidence for the existence of the two geographic clusters, as well as sub-clusters and admixed populations, was provided by both Bayesian and multivariate methods. PCA also supported the presence of two regionally separated groups of populations by projecting the variation onto the first few components. In our present study, the southern subpopulation was the most differentiated from the other subpopulations, with high pairwise Fst values. The greatest pairwise Fst value (0.034, P<0.01) was detected between the North and South population clusters, which suggested that the two clusters were genetically differentiated as well. Interestingly, a slightly higher inbreeding coefficient was observed within the South cluster than within the other clusters, although the difference was not significant. One reason for more inbreeding in southern populations could be the existence of additional clusters within the South group, which was supported by higher population differentiation (Fst). As a result, gene flow could be reduced among those subclusters. Another reason could be that the density and size of green ash populations varied between the seed collection sites. Lower densities of ash trees in forest stands can be expected to result in higher inbreeding coefficients. Our results showed higher diversity within the South cluster than in the admixed and the North clusters.
This suggests a hypothesis of a general south-to-north migration path of green ash, which was probably split into two subgroups by the natural barrier of the Appalachian Mountains. To test this hypothesis, hierarchical structure analysis was conducted to further examine the genetic variation among populations within each cluster separately, which revealed some within-region clustering of populations as well. However, the sub-groups within each of the two main clusters were not separated geographically. Perhaps this subpopulation differentiation is correlated with adaptations to local environments and stressors, which could be revealed through more in-depth hierarchical structure analysis using genome-wide analyses. As is often the case for other widely distributed tree species, green ash showed a low amount of genetic variation among populations but high heterozygosity within populations. Genetic differentiation has been reported between Quercus rubra and Q. ellipsoidalis (Fst=0.05-0.07) and among populations within species (Fst=0.01-0.03) using both genomic SSRs and EST-SSRs (Lind and Gailing, 2013). Their results also showed that genomic SSRs revealed slightly higher differentiation than EST-SSRs did. Another study of Q. rubra reported a mean pairwise Fst value of 0.041, ranging from 0.006 to 0.14, using genomic SSRs (Borkowski et al., 2017). A mean pairwise Fst value of 0.045 was detected for Juglans cinerea (Hoban et al., 2010). An overall Fst value of 0.06 was detected for Salix viminalis, for which pairwise Fst values among the four detected subpopulations ranged from 0.04 to 0.12 (Berlin et al., 2014). In summary, even though only six markers were used in this study, two separate clusters and one admixed cluster of populations were detected, representing a geographically based distribution of genetic variation in green ash. This result warrants further investigation by higher-resolution genome-wide association studies. In this provenance trial, approximately 11 green ash trees (less than 2%) have survived the EAB infestation to date. These, and other “lingering” trees previously reported (Koch et al., 2015), are invaluable material for determining mechanisms of EAB resistance through future transcriptome analysis and genetic association studies. Our study suggests that population structure must be taken into account to reduce false positive associations in future studies that aim to identify SNPs significantly associated with EAB resistance or any other trait. Our study can also help future conservation and breeding programs by guiding the selection of core collections that maximize the capture of genetic diversity in the species. Combined with other traits of interest, such core selections can be optimized to maximize both genetic and phenotypic variation.

References

Anulewicz, A.C., McCullough, D.G., and Cappaert, D.L. (2007). Emerald Ash Borer (Agrilus planipennis) Density and Canopy Dieback in Three North American Ash Species. Arboriculture & Urban Forestry 33, 338-349.
Berlin, S., Trybush, S.O., Fogelqvist, J., Gyllenstrand, N., Hallingbäck, H.R., Åhman, I., Nordh, N.-E., Shield, I., Powers, S.J., Weih, M., et al. (2014). Genetic diversity, population structure and phenotypic variation in European Salix viminalis L. (Salicaceae). Tree Genetics & Genomes 10, 1595-1610.
Borkowski, D.S., Hoban, S.M., Chatwin, W., and Romero-Severson, J. (2017). Rangewide population differentiation and population substructure in Quercus rubra L. Tree Genetics & Genomes 13, 67.
Clarke, J.D. (2009). Cetyltrimethyl Ammonium Bromide (CTAB) DNA Miniprep for Plant DNA Isolation. Cold Spring Harbor Protocols 2009, pdb.prot5177.
Dunnett, C.W. (1980). Pairwise Multiple Comparisons in the Unequal Variance Case. Journal of the American Statistical Association 75, 796-800.
Earl, D.A., and vonHoldt, B.M. (2011). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources 4, 359-361.
Evanno, G., Regnaut, S., and Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14, 2611-2620.
Excoffier, L., and Lischer, H.E. (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10, 564-567.
Herms, D.A., and McCullough, D.G. (2014). Emerald ash borer invasion of North America: history, biology, ecology, impacts, and management. Annual review of entomology 59, 13-30.
Heuertz, M., Hausman, J.-F., Hardy, O.J., Vendramin, G.G., Frascaria-Lacoste, N., and Vekemans, X. (2004). Nuclear microsatellites reveal contrasting patterns of genetic structure between western and southeastern European populations of the common ash (Fraxinus excelsior L). Evolution 58, 976-988.
Hoban, S.M., Borkowski, D.S., Brosi, S.L., McCleary, T.S., Thompson, L.M., McLachlan, J.S., Pereira, M.A., Schlarbaum, S.E., and Romero-Severson, J. (2010). Range-wide distribution of genetic diversity in the North American tree Juglans cinerea: a product of range shifts, not ecological marginality or recent population decline. Mol Ecol 19, 4876-4891.
Hu, J., Wang, L., and Li, J. (2011). Comparison of genomic SSR and EST-SSR markers for estimating genetic diversity in cucumber. Biologia Plantarum 55, 577-580.
Hu, L.-J., Uchiyama, K., Shen, H.-L., Saito, Y., Tsuda, Y., and Ide, Y. (2008). Nuclear DNA microsatellites reveal genetic variation but a lack of phylogeographical structure in an endangered species, Fraxinus mandshurica, across north-east China. Annals of botany 102, 195-205.
Jakobsson, M., and Rosenberg, N.A. (2007). CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23, 1801-1806.
Jombart, T. (2008). adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24, 1403-1405.
Jones, O.R., and Wang, J. (2010). COLONY: a program for parentage and sibship inference from multilocus genotype data. Molecular Ecology Resources 10, 551-555.
Koch, J.L., Carey, D.W., Mason, M.E., Poland, T.M., and Knight, K.S. (2015). Intraspecific variation in Fraxinus pennsylvanica responses to emerald ash borer (Agrilus planipennis). New Forests 46, 995-1011.
Koressaar, T., and Remm, M. (2007). Enhancements and modifications of primer design program Primer3. Bioinformatics 23, 1289-1291.
Lane, T., Best, T., Zembower, N., Davitt, J., Henry, N., Xu, Y., Koch, J., Liang, H., McGraw, J., Schuster, S., et al. (2016). The green ash transcriptome and identification of genes responding to abiotic and biotic stresses. BMC genomics 17, 702.
Lau, M.K. (2013). DTK: Dunnett-Tukey-Kramer Pairwise Multiple Comparison Test Adjusted for Unequal Variances and Unequal Sample Sizes. R package version 35, https://cran.r-project.org/package=DTK.
Lind, J.F., and Gailing, O. (2013). Genetic structure of Quercus rubra L. and Quercus ellipsoidalis E. J. Hill populations at gene-based EST-SSR and nuclear SSR markers. Tree Genetics & Genomes 9, 707-722.
Little, E.L.J. (1971). Atlas of United States trees, volume 1, conifers and important hardwoods. US Department of Agriculture Miscellaneous Publication 1146.
Neale, D.B., and Ingvarsson, P.K. (2008). Population, quantitative and comparative genomics of adaptation in forest trees. Current opinion in plant biology 11, 149-155.
Neale, D.B., and Savolainen, O. (2004). Association genetics of complex traits in conifers. Trends Plant Sci 9, 325-330.
Peakall, R., and Smouse, P.E. (2006). GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular ecology notes 6, 288-295.
Petit, R.J., Aguinagalde, I., de Beaulieu, J.-L., Bittkau, C., Brewer, S., Cheddadi, R., Ennos, R., Fineschi, S., Grivet, D., and Lascoux, M. (2003). Glacial refugia: hotspots but not melting pots of genetic diversity. Science 300, 1563-1565.
Porras-Hurtado, L., Ruiz, Y., Santos, C., Phillips, C., Carracedo, Á., and Lareu, M.V. (2013). An overview of STRUCTURE: applications, parameter settings, and supporting software. Frontiers in Genetics 4, 98.
Pritchard, J.K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945-959.
Rosenberg, N.A. (2004). distruct: a program for the graphical display of population structure. Molecular Ecology Notes 4, 137-138.
Rousset, F. (2008). genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Molecular Ecology Resources 8, 103-106.
Slavov, G.T., DiFazio, S.P., Martin, J., Schackwitz, W., Muchero, W., Rodgers-Melnick, E., Lipphardt, M.F., Pennacchio, C.P., Hellsten, U., Pennacchio, L.A., et al. (2012). Genome resequencing reveals multiscale geographic structure and extensive linkage disequilibrium in the forest tree Populus trichocarpa. The New phytologist 196, 713-725.
Steiner, K.C. (1983). A provenance test of green ash. Proc Northeast Forest Tree Improv Conf 28, 68-76.
Untergasser, A., Cutcutache, I., Koressaar, T., Ye, J., Faircloth, B.C., Remm, M., and Rozen, S.G. (2012). Primer3—new capabilities and interfaces. Nucleic acids research 40, e115-e115.
Varshney, R.K., Graner, A., and Sorrells, M.E. (2005). Genic microsatellite markers in plants: features and applications. Trends in Biotechnology 23, 48-55.
Ying, C.-C. (1971). Response of Fraxinus Pennsylvanica M. provenances to daylength and temperature (MS thesis. University of Nebraska.).


Figure 3.1 Variation at the provenance level projected on the first two correspondence analysis eigenvalues.


Figure 3.2 Plot for detecting the number of K groups that best fits the data from Structure analysis.

[Figure 3.3 panel labels: (A) K=2; (B) K=2, K=2, K=2; (C) K=2, K=4, K=3, K=3, K=2, K=2. The x-axes list provenances by number (1-60) and by provenance ID, e.g. IL_345, PA_45, Ont_529.]

Figure 3.3 Results of population structure analysis. (A) STRUCTURE results plotted by DISTRUCT for genetic variation at K=2 for 491 individuals representing 60 provenances. The colored plot represents the estimates of Q (Q represents the proportion of an individual’s ancestry in a population) with assumed 2 predefined populations. Cluster I (Blue): Pop1-18 (VA_393 ~ DE_450) with Q1 > 80%; Cluster II (Orange): Pop37-60 (ND_437 ~ Alb_214) with Q2 > 80%; Admixed Cluster: Pop19-36 (VA_397 ~ Que_337) with 20% < Q1/Q2 < 80%. (B) Hierarchical structure within each cluster with optimized K=2. (C) Structure analysis with 0.8 membership threshold to assign groups with optimized k value.


Figure 3.4 Geographic locations of the two groups and one admixed group. Orange and blue triangles represent the two distinct groups from STRUCTURE analysis, and the pie charts represent the admixed group, with two colored segments for the two predetermined clusters.


Figure 3.5 Geographic variation between populations of the two clusters. The boundary of two clusters was detected at the latitude of 40.5°N.


Figure 3.6 Principal component analysis based on Fst. The first two components accounted for 56% and 22.2% of the variation, respectively. Blue squares, green triangles and red dots represent provenances assigned to the South cluster, North cluster and admixed cluster, respectively.


Figure 3.7 Principal component analysis based on phenotypic traits. The first two components accounted for 71% and 14.8% of the variation, respectively. Blue squares, green triangles and red dots represent provenances assigned to the South cluster, North cluster and admixed cluster, respectively.


Figure 3.8 Variation of phenotypic traits plotted against PC1, PC2 and latitude. The two components accounted for 71% and 14.8% of the variation, respectively. Red, green and black dots represent provenances assigned to the North, South and admixed clusters, respectively.

Table 3.1 Summary of features of eight new EST-based microsatellite loci.

SSRs Forward Reverse Motif Expected Size range Putative function

Primer Primer size

15761 CCTAGAG GAAAGCA (GT)13 151 136-171 DEAD-box ATP-dependent RNA helicase 24 AAAGGCG CTTTGGC [Sesamum indicum] (Identity: 74%) GAGACC ACCATT 754 GCCTTAG CCTCTCT (CT)15 150 137-205 PREDICTED: protein TIC110, chloroplastic GGTAGGG CTAGCAG [Sesamum indicum] (Identity: 83.8%) GTGAAG CCTTGC 854 CTGAGGC CGTAATG (CAT)5 176 158-194 Putative clathrin assembly protein At5g35200 ATAGGGG GCTTCAC [Sesamum indicum] (Identity: 78%) TAGCTG CACCTT 18631 ATTTTCTC TGACACG (TCA)6 150 134-168 SART-1 family protein DOT2 [Sesamum GCCCTTTC AGTCTTT indicum] (Identify: 71%) TAACG CATCTAT TGC 19414 CCGAGCA CATCTTTT (CT)11 154 153-187 Transcription initiation factor IIB-2-like GTATGTA CTCCCAG [Sesamum indicum] (Identify: 93%) TCCGACA CTCACA 4511 TGGAGGC CGAACTC (TC)15 167 143-185 No significant similarity found TCTCAGG TATAAAT TCTCTC TTGTGTG TTTTTG 4588 GGGGGAG TGCTGAC (TG)12 159 140-180 Uncharacterized protein LOC105173611 GAAAATG TGCACAA isoform X1 [Sesamum indicum] (Identify: AGTTTCTA ACTTATC 81%) A 11183 GCAAAAA CGGCAAA (CCT)10 157 140-160 No significant similarity found TTTGTCAC ACTTAGA TTGAAAG TTGGAA G

Table 3.2 Hardy–Weinberg equilibrium (HWE) testing results for 429 green ash accessions across the eight loci

Locus        Ha: deviation from HWE    Ha: homozygote excess
EST-15761    ***                       ***
EST-754      ***                       ***
EST-854      NS                        NS
EST-18631    ***                       ***
EST-19414    ***                       ***
EST-4511     ***                       ***
EST-4588     ***                       ***
EST-11183    ***                       ***

Signif. codes: 0 ‘***’; NS: not significant; p value adjustment method: Bonferroni.

Table 3.3 Genetic variation statistics for the eight SSR markers among 429 green ash accessions

SSR locus    Nallele   Fis (W&C)   Fis (R&H)   HO     HE     PNA      LD
EST-15761    18        0.1047      0.0585      0.74   0.82   4.42%    --
EST-754      31        0.1660      0.1037      0.78   0.93   7.91%    --
EST-854      11        0.0429      0.0154      0.68   0.71   1.77%    a
EST-18631    16        0.3584      0.3119      0.34   0.84   15.65%   --
EST-19414    37        0.0850      0.0436      0.84   0.91   3.45%    b
EST-4511     21        0.1377      0.0471      0.74   0.86   5.8%     c
EST-4588     25        0.0850      0.0873      0.79   0.87   4.34%    a,c
EST-11183    12        0.5470      0.5632      0.36   0.8    28.66%   b

Nallele: total number of alleles; Fis: inbreeding coefficient; PNA: proportion of null alleles; HO: observed heterozygosity; HE: expected heterozygosity; LD (linkage disequilibrium): not linked (--); markers sharing a letter are linked.


Table 3.4 Genetic diversity statistics across and within clusters averaged over six loci

Cluster    # of prov.   N     Na      Ne     HO     HE     uHe    Fst      Fis    PA     RA
All        60           491   21.33   7.53   0.69   0.80   0.80   0.14**   --     --
North      24           191   17.50   6.05   0.69   0.76   0.76   0.11**   0.26   2      12.5
Admixed    18           151   17.00   6.74   0.70   0.79   0.79   0.11**   0.26   0.83   11
South      18           149   18.00   6.76   0.68   0.80   0.81   0.13**   0.28   1.67   12.5

N: total number of individuals in each cluster; Na: number of different alleles per locus; Ne: number of effective alleles per locus; Ho: observed heterozygosity; He: expected heterozygosity; uHe: unbiased expected heterozygosity; Fst: fixation index; Fis: inbreeding coefficient; PA: count of private alleles; RA: count of rare alleles (Freq. < 5%). Significance: *(p<0.05), **(p<0.01)

Table 3.5 Analysis of molecular variance (AMOVA) among three clusters over six loci.

Source of variation   d.f.   Variance components   Percentage variation   P value
Among clusters        2      0.09                  3.61                   <0.001
Within clusters       979    2.35                  96.39
Total                 982    2.44


Table 3.6 Genetic variation statistics at six loci within 3 clusters

             North                             Admixed                           South
SSR locus    Nallele   HO     HE     Fst       Nallele   HO     HE     Fst       Nallele   HO     HE     Fst
EST-15761    16        0.65   0.76   0.13**    14        0.77   0.81   0.11**    16        0.87   0.90   0.11**
EST-754      27        0.82   0.92   0.11**    25        0.80   0.92   0.11**    26        0.71   0.90   0.12**
EST-854      10        0.67   0.71   0.08**    9         0.70   0.71   0.10**    10        0.69   0.71   0.11
EST-18631    12        0.37   0.48   0.10      14        0.30   0.56   0.15**    10        0.34   0.60   0.11
EST-19414    25        0.85   0.90   0.11**    27        0.85   0.91   0.11**    30        0.80   0.89   0.15**
EST-4511     15        0.79   0.77   0.11**    13        0.78   0.83   0.11**    16        0.66   0.82   0.18**

Nallele: total number of alleles; HO: observed heterozygosity; HE: expected heterozygosity; Fst: fixation index; Significance: *(p<0.05), **(p<0.01)

Table 3.7 Pairwise Fst between pairs of clusters

             Pairwise Fst
SSR locus    North-South   North-Admixed   South-Admixed
EST-15761    0.025**       0.003           0.016**
EST-754      0.022**       0.010**         0.008**
EST-854      0.005*        0.001           0.002
EST-18631    0.043**       0.013**         0.012**
EST-19414    0.018**       0.005**         0.012**
EST-4511     0.088**       0.020**         0.045**
All          0.034**       0.009**         0.016**

Significance: *(p<0.05), **(p<0.01)

Table 3.8 Average phenotypic values of three clusters and pairwise comparison

DBH (cm), mean ± s.d.
Cluster    1990           2009            2012
South      5.67±3.11 a    14.43±6.23 a    15.46±6.53 a
North      6.21±2.87 b    14.66±5.51 a    15.81±5.84 a
Admixed    6.89±3.28 c    16.73±5.86 b    17.89±6.28 b

Height (m), mean ± s.d.
Cluster    1978           1982           1985           1990
South      0.66±0.18 a    1.95±0.59 a    3.15±1.19 a    4.99±2.24 a
North      0.53±0.18 b    1.99±0.61 a    3.26±1.14 a    5.21±1.91 a
Admixed    0.61±0.20 c    2.17±0.65 b    3.60±1.24 b    5.76±2.17 b

Budburst and foliage coloration, mean ± s.d.
Cluster    Budburst        Foliage coloration
South      17.09±6.57 a    26.84±8.07 a
North      20.41±6.52 b    15.48±7.94 b
Admixed    18.96±7.09 c    21.73±7.08 c

In each column, the differences between means followed by the same letter were not significant (significance level: 0.05).

Table 3.9 Pairwise Fst values across four zones detected by winter injuries

Zones   2       3       4
1       0.011   0.024   0.023
2               0.012   0.010
3                       0.006
4

*Zone 5 was not included as only one provenance was located in Zone 5.

Table 3.10 Pairwise Fst values across five clusters detected by cold tolerance.

Clusters   B       C       D       E
A          0.015   0.010   0.019   0.047
B                  0.014   0.015   0.040
C                          0.014   0.037
D                                  0.020

Chapter 4
The first genetic map for Fraxinus pennsylvanica and syntenic relationships with four related species

Abstract

Green ash (Fraxinus pennsylvanica) is an outcrossing, diploid (2n=46) hardwood tree species. Rapid invasion of emerald ash borer (EAB) from Asia is threatening all native ash species in North America. Green ash, the most widely distributed ash species, is being severely affected by EAB infestation, yet few resources for genetic studies and improvement of green ash are available. In this study, a total of 5,712 high-quality single nucleotide polymorphisms (SNPs) were discovered across the entire genome through genotyping-by-sequencing (GBS), using a minimum allele frequency (MAF) of 1%. We also screened hundreds of genomic- and EST-based microsatellite markers (SSRs) from previous de novo assemblies (Lane et al., 2016; Staton et al., 2015). This first linkage map of green ash was constructed from 91 individuals in a full-sib family, combining 2,719 SNP and 84 SSR segregating markers. The SNP and SSR map contains a total of 1201 markers in 23 linkage groups spanning 2008.87 cM, at an average inter-marker distance of 1.67 cM, with a minimum logarithm of odds (LOD) of 6 and a maximum recombination fraction of 0.40. The genetic map showed near-perfect consistency with a draft genome assembly of one of the mapping family parents, and was used to anchor 27 large scaffolds of the draft genome sequence to the 23 linkage groups, accounting for 732 Mb (76%) of the estimated 961 Mb length of the draft genome. Comparative analysis between green ash and two asterid species and two rosid species indicates that synteny has eroded more slowly in woody perennials relative to annuals.

Keywords Green ash, linkage map, single nucleotide polymorphism (SNP), simple sequence repeats (SSRs), Genotyping-by-Sequencing (GBS), synteny

Introduction

Green ash (Fraxinus pennsylvanica Marsh.) is one of the most widely distributed angiosperm trees in North America. This species, along with other native ash species, has been threatened by the rapid invasion of the emerald ash borer (EAB), an insect from Asia. The economic value of ash trees (Fraxinus spp.) is difficult to assess, but as the most widely distributed ash species in North America, green ash may contribute billions of dollars to the economy (Kovacs et al., 2010). Unfortunately, EAB has resulted in the death of millions of ash trees (Herms and McCullough, 2014), which has cost municipalities, property owners, nursery operators and forest product industries hundreds of millions of dollars. However, studies of ash species have been limited to intraspecific transcriptome analysis (Lane et al., 2016) and interspecific metabolite comparisons (Whitehill et al., 2012). Little is yet known about the genetic and molecular mechanisms underlying biological traits in green ash.

Genetic linkage maps are a powerful tool for genomic and genetic research to uncover the basis of trait variation. With a large number of DNA markers, fine mapping of quantitative trait loci (QTL) can support marker-assisted selection for breeding programs. Candidate genes can also be positioned within QTL regions identified by association studies. Combined with transcriptome studies, differentially expressed genes can also be mapped and localized for gene cloning. In addition, linkage maps are used to conduct comparative analyses to detect genomic synteny and to improve and correct chromosome-scale genome assemblies. Linkage maps have been constructed for many non-model species (Moumouni et al., 2015; Raman et al., 2014; Tian et al., 2015), but the resolution of a genetic map depends on the number of markers that segregate in the mapping population.

Recently, innovations in next generation sequencing (NGS) methods for genetic marker discovery, combined with reduced costs, have facilitated the genotyping of thousands of SNPs across hundreds of samples. New NGS methods use restriction enzyme digestion of target genomes to reduce the complexity of the data generated, and include reduced-representation libraries (RRLs) (Van Tassell et al., 2008), restriction-site associated sequencing (RADseq) (Davey and Blaxter, 2010) and genotyping-by-sequencing (GBS) (Elshire et al., 2011). In addition, new genotyping methods such as Illumina’s BeadArray™ technology-based GoldenGate® and Infinium® assays and the Sequenom MassARRAY® (Gabriel et al., 2001) also enable SNP genotyping at the genome-wide level in a cost-effective manner. However, because SNPs are biallelic markers, they are usually less informative than multi-allelic markers such as microsatellites, at which up to four alleles can segregate in a full-sib mapping family of a diploid species. Therefore, SSRs are still robust markers for genetic map development and many other genetic studies, even though the development and genotyping of microsatellite markers can be laborious and time-consuming relative to NGS approaches.

Materials and Methods

Plant material and DNA extraction

An F1 population of 780 individuals was generated from controlled crosses between an EAB-resistant and an EAB-susceptible green ash tree. The maternal parent tree for the cross is located at the Dawes Arboretum in Newark, OH, and the male (pollen) parent tree is located in the Openings Metropark in Toledo, OH. The maternal parent tree PE0048 is considered susceptible to EAB, while the paternal parent tree PE00248 is considered tolerant to EAB, as established by replicated EAB-egg inoculation tests. During growth of the seedlings in the greenhouse at Delaware, Ohio, an ash yellows infestation reduced the family to 543 survivors, which were transferred to a gravel-bed nursery in Columbia, Missouri, for further growth prior to bare-root seedling transfer to the field. For this initial genetic linkage mapping study, leaf samples were collected from 91 of the seedlings, as well as from the parent trees, for genomic DNA extraction. DNA was isolated from frozen leaves using a CTAB method (Clarke, 2009). DNA samples were quantified using a Qubit 2.0 fluorimeter (Invitrogen, Carlsbad, CA, USA) and then checked for molecular weight on a 0.8% agarose gel. The 91 progeny were verified by SSR analysis as authentic full siblings using Colony parentage analysis (Jones and Wang, 2010).

Genotyping-by-sequencing

GBS libraries were constructed and sequenced at the Cornell University Genomic Diversity Facility, using the protocol described by Elshire et al. (2011). Three methylation-sensitive restriction enzymes (i.e. ApeKI, EcoT22I and PstI) were tested prior to selection of PstI as the best for GBS library construction. A compatible set of 96 barcode sequences was included in library construction for multiplex sequencing. The libraries of 95 samples and a blank negative control were pooled and sequenced on one lane of an Illumina NextSeq 500 in single-end mode, running for 86 cycles to obtain a read length of 86 bp.

GBS Data Processing and SNP discovery

The TASSEL-GBS pipeline was employed for SNP discovery and SNP calling (Glaubitz et al., 2014), using a draft de novo genome assembly of green ash consisting of 495,002 contigs (Buggs, unpublished) as a reference. The raw FASTQ files were filtered through the pipeline with a minimum Qscore of 20 across the first 64 bases. Raw reads that contained no N’s and had a perfect match to one of the barcodes plus the subsequent five nucleotides expected to remain from a PstI cut-site (i.e. 5’…CTGCA’G…3’) were retained as good barcoded reads. Identical good barcoded reads were then clustered into tags, while rare tags represented by fewer than 5 reads were excluded from the dataset. Each sequence tag was then aligned to the reference genome, and only tags that aligned to a unique best position in the genome were kept for further processing. SNP discovery was performed for each set of tags that aligned to the same starting genomic position. For each SNP, the allele represented by each tag was then determined along with the observed depth of each allele. The initial filtering step used a minimum allele frequency (MAF) of 0.01 to remove sequencing errors and spurious SNPs. To exclude less reliable SNPs, only SNPs with less than 10% missing genotype data across samples were kept for downstream analysis.
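The two thresholds described above (a MAF of at least 0.01 and less than 10% missing genotype data) can be illustrated with a short sketch. This is not the TASSEL-GBS implementation; the genotype encoding (strings such as "A/G", with None for a missing call), the function names and the example genotypes are assumptions made for illustration only.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among all called (non-missing) genotypes."""
    alleles = Counter()
    for genotype in genotypes:
        if genotype is not None:
            alleles.update(genotype.split("/"))
    total = sum(alleles.values())
    if total == 0 or len(alleles) < 2:
        return 0.0  # no calls, or monomorphic site
    return min(alleles.values()) / total

def missing_rate(genotypes):
    """Fraction of samples without a genotype call."""
    return sum(g is None for g in genotypes) / len(genotypes)

def passes_filters(genotypes, min_maf=0.01, max_missing=0.10):
    """Keep a SNP if MAF >= min_maf and the missing rate is below max_missing."""
    return (minor_allele_frequency(genotypes) >= min_maf
            and missing_rate(genotypes) < max_missing)

# Hypothetical SNP scored in 20 samples, one of them missing.
example_snp = ["A/A"] * 10 + ["A/G"] * 9 + [None]
print(passes_filters(example_snp))
```

In this hypothetical example the missing rate is 1/20 and the G allele accounts for 9 of 38 called alleles, so the SNP is retained.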

Simple sequence repeats (SSRs) genotyping and data analysis

De novo assemblies of the genome and the transcriptome of green ash have been used previously to search for di-, tri-, and tetra-nucleotide microsatellite repeats (Lane et al., 2016). Primers for the SSR loci were designed using default settings in PRIMER3 (Koressaar and Remm, 2007; Untergasser et al., 2012). A total of 419 primer pairs were tested against several individuals within the mapping population. Total genomic DNA samples were normalized to 2 ng/µl and then used for PCR amplification with 5x FIREPol Master Mix and primer mix. To reduce the cost of fluorescent primers, a three-primer system was used, which included a universal M13 oligonucleotide (TGTAAAACGACGGCCAGT) labeled with fluorescent dye, a sequence-specific forward primer with the M13 tail at its 5’ end, and a sequence-specific reverse primer (Schuelke, 2000). The PCR conditions were as follows: 95°C (15 min), then 35 cycles of 94°C (30 s) / 56°C (90 s) / 72°C (90 s), and a final extension at 72°C for 10 min. Pooled PCR products were sized on an ABI 3730xl capillary electrophoresis instrument, with peaks identified by GeneScan (Applied Biosystems) followed by fragment sizing with GeneMapper v5.0 (Applied Biosystems).

Linkage map construction

The allele data for all selected markers were tested for segregation distortion and similarity. Markers that showed significantly distorted segregation (P<0.05) were removed from the dataset. Redundant markers whose genotypic data showed 100% similarity were also removed to improve computational efficiency. Male- and female-specific linkage maps were then constructed individually from the filtered markers with JoinMap 4.0, using the regression algorithm and the Kosambi mapping function and designating a cross-pollinator (CP) population type (Van Ooijen, 2006). In the first run, only markers with up to 5% missing data and categorized as lm x ll, nn x np, ef x eg or ab x cd were used to build the framework maps (LOD score threshold = 6). For maternal maps, the genotype codes from loci with segregation types ef x eg and ab x cd were translated to genotype codes of lm x ll, and for paternal maps to genotype codes of nn x np. In the second run, loci with segregation type hk x hk and less informative markers with 5% to 10% missing data were included to build the maps, while using the marker order obtained from the first run as the Start Order in JoinMap. The genotypes hk, hh and kk were translated to unknown, ll and lm, respectively, for the maternal parent and to unknown, nn and np for the paternal parent. The consensus map was then established by integrating the paternal and maternal maps through shared markers using MergeMap (Wu et al., 2008). The linkage maps were then drawn using MapChart 2.2 (Voorrips, 2002).
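The translation of the genotypes hk, hh and kk into parent-specific codes, as described above, can be sketched as a simple lookup. This is an illustration only, not JoinMap code; the dictionary and function names, and the use of "--" as the placeholder for an unknown genotype, are assumptions.

```python
# Translation stated in the text: hk -> unknown, hh -> ll (maternal) or nn
# (paternal), kk -> lm (maternal) or np (paternal).
UNKNOWN = "--"  # assumed placeholder for an unknown genotype

MATERNAL_RECODE = {"hk": UNKNOWN, "hh": "ll", "kk": "lm"}
PATERNAL_RECODE = {"hk": UNKNOWN, "hh": "nn", "kk": "np"}

def recode_hk_locus(genotypes, parent):
    """Recode the progeny genotypes of one hk-type locus for a parental map."""
    if parent == "maternal":
        table = MATERNAL_RECODE
    elif parent == "paternal":
        table = PATERNAL_RECODE
    else:
        raise ValueError("parent must be 'maternal' or 'paternal'")
    # Any code not in the table (e.g. an already missing call) stays unknown.
    return [table.get(genotype, UNKNOWN) for genotype in genotypes]

# Hypothetical progeny genotypes at a single locus.
progeny = ["hh", "hk", "kk", "hk"]
print(recode_hk_locus(progeny, "maternal"))
print(recode_hk_locus(progeny, "paternal"))
```

Because the heterozygous hk class cannot be attributed to one parent, it carries no information in a single-parent map, which is why such loci are less informative than lm x ll or nn x np loci.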

Alignment of scaffolds to corresponding linkage groups

The draft genome assembly of the green ash PE00248 parent tree spanned 961 Mb distributed across 495,002 scaffolds. Flanking sequences of fifty base pairs on both sides of the mapped markers were used to assign scaffolds of the PE00248 draft genome assembly to the corresponding linkage groups of our intra-specific genetic map, using the Burrows-Wheeler Alignment tool (Li, 2013; Li and Durbin, 2009b). Only uniquely mapped markers were considered for further analysis.

Comparative analysis with two Asterid and two Rosid species

Fifty base pairs of flanking sequence from both sides of the mapped green ash SNP markers were searched against the tomato genome (Phytozome v10.0) and the coffee genome (Coffee Genome Hub) using BLASTN with an e-value threshold of 1e-5. Circos v0.69 (Krzywinski et al., 2009) was then used to plot synteny between the green ash genetic maps and the chromosomes of tomato and coffee (Denoeud et al., 2014). To plot the syntenic regions, cM distances on the genetic maps of green ash were converted to base pairs using an averaged cM/bp value, based on the total linkage length in cM of the map and the estimated total genome size of green ash (i.e. 961 Mb).
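The conversion from genetic to approximate physical coordinates described above amounts to a single genome-wide scaling factor. The sketch below assumes a uniform recombination rate along the genome, which is the simplification implied by an averaged value. The function name is hypothetical, and the map length of 2008.87 cM is the consensus map length given in the abstract, used here only as an example input.

```python
def cm_to_bp(position_cm, total_map_cm, genome_size_bp):
    """Convert a map position in cM to an approximate bp coordinate.

    Uses one genome-wide average (bp per cM), so local variation in
    recombination rate is ignored.
    """
    bp_per_cm = genome_size_bp / total_map_cm
    return round(position_cm * bp_per_cm)

TOTAL_MAP_CM = 2008.87        # consensus map length (example input)
GENOME_SIZE_BP = 961_000_000  # estimated genome size, 961 Mb

# Approximate physical coordinate of a hypothetical marker at 50 cM.
print(cm_to_bp(50.0, TOTAL_MAP_CM, GENOME_SIZE_BP))
```

Coordinates obtained this way are suitable for drawing links in a plot, but they should not be read as true physical positions.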

Results

Enzyme selection

We tested whether one of the commonly used restriction enzymes would generate an appropriate distribution of fragment lengths across the genome of green ash (Fig. 4.1). Fragment size distribution was checked on the Bioanalyzer 2100 HS-DNA chip (Agilent Technologies, Santa Clara, CA, USA) (Fig. 4.1). All three enzymes yielded a large number of fragments ranging between 150 bp and 500 bp, which was suitable for the GBS approach. However, ApeKI and EcoT22I showed some repetitive peaks and a small proportion of fragments greater than 500 bp. The enzyme PstI was thus selected to construct GBS libraries for green ash.

Genome-wide identification of SNPs

To identify genome-wide SNPs in green ash, the 6-base cutter restriction enzyme PstI was used to digest the genome and construct the 96-plex GBS libraries of the F1 population. A total of 63,540 putative SNPs with a MAF of at least 1% were identified with the TASSEL-GBS pipeline. Allowing a maximum of 20% missing genotype data, 5,712 SNPs were retained for map construction. The frequency of SNP occurrence across the genome is summarized in Table 1. In general, the frequency of transitions (63.65%) was higher than that of transversions (35.83%). The most common substitution was A/G (31.86%), while the least common was C/G, accounting for 6.53% of the total detected SNPs. We observed a transition : transversion (Ts/Tv) ratio of 1.78, which was similar to observations for other plant species (Gaur et al., 2015; Pootakham et al., 2015). A set of bi-allelic SNPs was scored as lm x ll, nn x np and hk x hk. With a stringent cutoff of 10% missing data, 2729 high-quality polymorphic SNPs were retained for further analysis. Of these, 727 and 1548 SNPs were polymorphic in the maternal and paternal parents, respectively, while 454 SNPs were polymorphic in both parents.
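The Ts/Tv ratio follows directly from the classification of each biallelic SNP: A/G and C/T substitutions are transitions, and the remaining four allele pairs (A/C, A/T, C/G, G/T) are transversions. The sketch below illustrates the calculation; the function names and the input format (allele pairs written as "A/G") are assumptions, and the short SNP list is invented for the example.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def classify_snp(allele_pair):
    """Return 'transition' or 'transversion' for a biallelic SNP such as 'A/G'."""
    alleles = frozenset(allele_pair.upper().replace("/", ""))
    if len(alleles) != 2 or not alleles <= set("ACGT"):
        raise ValueError(f"not a biallelic SNP: {allele_pair!r}")
    return "transition" if alleles in TRANSITIONS else "transversion"

def ts_tv_ratio(allele_pairs):
    """Ratio of the number of transitions to the number of transversions."""
    classes = [classify_snp(pair) for pair in allele_pairs]
    transversions = classes.count("transversion")
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return classes.count("transition") / transversions

# Invented example: two transitions and one transversion.
print(ts_tv_ratio(["A/G", "C/T", "A/T"]))

# Using the percentages reported in the text.
print(round(63.65 / 35.83, 2))
```

With the reported percentages, 63.65 / 35.83 rounds to 1.78, in agreement with the ratio stated above.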

Polymorphism of EST-derived and genomic SSR markers in F1 mapping population

SSR polymorphism was first assessed in six randomly selected F1 progeny and the two parents using non-fluorescent primers. Among 352 EST-derived and 252 genomic SSR primer pairs, 84 (13.91%) successfully amplified products that were polymorphic between the two mapping parents, including 30 EST-SSR and 54 gSSR markers. The 84 primer pairs are listed in Table S1 with related information such as primer name, motif type, forward and reverse primer sequences, and expected product size. The sequences associated with the 30 EST-SSRs were searched against the GenBank non-redundant (nr) protein database using BLASTX with an e-value of 1e-5. The BLASTX results showed that 19 (63.33%) of the 30 polymorphic EST-SSR loci are markers for known or uncharacterized protein-coding genes (Table S4.1).

Construction of genetic linkage maps

After removing markers that significantly deviated from the expected Mendelian ratio (P ≤ 0.05), a total of 90 samples with genotypic data for 2049 SNP and SSR loci were retained for genetic map construction. To reduce computation time, loci with identical genotypes were eliminated. A set of 1537 high-quality SNP loci along with 75 SSR markers that segregated in the F1 population arising from an intra-specific cross of PE00248 and PE0048 were used to construct an intra-specific linkage map for F. pennsylvanica. The female genetic map consisted of 992 markers mapped to 760 distinct positions spanning 1562.64 cM, with an average marker interval of 2.22 cM, ranging from 1.08 cM in LG9 to 3.97 cM in LG22. Among the 992 markers, 6 markers were later assigned individually to LG21, but it was not possible to compute genetic distances between these markers (Table 4.3), and thus a linkage group could not be generated. As a result, only 22 LGs were constructed in the maternal map. The male genetic map consisted of 755 markers spanning 1744.12 cM, with an average marker interval of 2.87 cM (Table 4.3), ranging from an average paired marker distance of 1.16 cM in LG18 to 4.99 cM in LG21. In total, 230 markers were shared between the two parental maps, and these were used to construct the consensus map by integrating the two parental maps at those loci. Overall, the average male-to-female ratio of map length was 1.12, ranging from 0.36 in LG18 to 2.3 in LG16. The total length of the consensus map was 2008.98 cM, with LG1 (125 cM) being the largest and LG23 (53.93 cM) being the smallest (Fig. 4.2, Table 4.4). The consensus map contains 1201 DNA markers, of which 1126 are SNPs and 75 are SSRs. The number of markers per linkage group varied from 13 (LG21) to 95 (LG9), with an average of 52 markers per linkage group.

Segregation distortion

Chi-square tests were performed to test for deviation from the expected Mendelian segregation ratio. Of the 2731 polymorphic SNPs (<10% missing data), 902 markers (33.02%) showed significant segregation distortion (P ≤ 0.05). Severe distortion (SD) was detected in 604 SNP markers (P ≤ 0.001), while 298 SNP loci were moderately distorted (0.001 < P ≤ 0.05; Table S4.2). Additionally, we observed that LG21 in the maternal map could be successfully generated by including SD markers.
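The chi-square test for segregation distortion can be illustrated for the two expected ratios that arise for biallelic markers in a CP population: 1:1 for loci that are heterozygous in one parent only (lm x ll or nn x np) and 1:2:1 for hk x hk loci. The sketch below is a minimal standard-library version and is not the procedure implemented in JoinMap; the function names and the progeny counts are assumptions. The p-values use the closed forms of the chi-square survival function for 1 and 2 degrees of freedom.

```python
import math

def chi_square_stat(observed, expected_ratio):
    """Pearson chi-square statistic for observed class counts vs. an expected ratio."""
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    stat = 0.0
    for obs, part in zip(observed, expected_ratio):
        expected = n * part / ratio_total
        stat += (obs - expected) ** 2 / expected
    return stat

def chi_square_pvalue(stat, df):
    """Upper-tail p-value; only df = 1 and df = 2 are covered in this sketch."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or df = 2 is handled here")

def classify_distortion(p_value):
    """Apply the thresholds used in the text: P <= 0.001 severe, P <= 0.05 moderate."""
    if p_value <= 0.001:
        return "severe"
    if p_value <= 0.05:
        return "moderate"
    return "none"

def segregation_test(observed, expected_ratio):
    """Return (statistic, p-value, class) for one marker."""
    stat = chi_square_stat(observed, expected_ratio)
    p_value = chi_square_pvalue(stat, len(observed) - 1)
    return stat, p_value, classify_distortion(p_value)

# Hypothetical counts for 90 progeny.
print(segregation_test([60, 30], [1, 1]))         # lm x ll locus, expected 1:1
print(segregation_test([22, 46, 22], [1, 2, 1]))  # hk x hk locus, expected 1:2:1
```

In these invented examples the 60:30 locus gives a statistic of 10.0 with 1 degree of freedom and falls in the moderately distorted class, while the 22:46:22 locus shows no distortion.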

Integration of genetic map with genome assembly

The draft genome assembly consisted of 495,002 scaffolds, spanning 961,215,495 bases. Twenty-seven scaffolds were assigned to the corresponding linkage groups using the mapped 1444 SNPs, accounting for 732 Mb (76% of 961 Mb) of the total length of the draft genome. The sequences of all markers were mapped to the assembly, of which 80 were mapped to multiple locations. Therefore, the 1364 uniquely mapped SNPs anchored an additional 4 scaffolds with a total length of 24 Mb onto the 23 LGs. Our results showed that three scaffolds (i.e. 22, 24 and 27) were anchored to LG13, two scaffolds (17 and 25) mapped to LG9, and scaffolds 15 and 26 corresponded to LG15. The Circos plot (Fig. 4.4) showed near-perfect consistency between the genome assembly and the genetic map. However, 26 (1.9%) of the loci showed mismatches between specific LGs and the corresponding chromosomes of the draft reference genome; for example, one locus in LG2 was aligned to scaffold 14, while all other loci in LG2 were mapped to scaffold 2.

Comparative analysis of green ash with other species

Tomato (Solanum lycopersicum, Sl, 2n=24), coffee (Coffea canephora, Cc, 2n=22) and green ash (Fp) belong to the sister orders Solanales, Gentianales and Lamiales, respectively, within the asterids. Peach (Prunus persica, Pp, 2n=16) and poplar (Populus trichocarpa, Pt, 2n=38) are two species from the rosids, which are distantly related to green ash. The 50 bp flanking sequences on both sides of the 1522 mapped SNP markers (1126 unique locations) were aligned to the genomes of the four species using BLASTN, with an e-value cutoff of 1e-5, revealing that 325 (21.35%), 342 (22.47%), 329 (21.62%) and 239 (15.7%) SNP markers could be mapped to the tomato, coffee, poplar and peach chromosomes, respectively. The extent of synteny between the green ash genetic map and the four species is summarized in Tables 4.5-4.8. When the marker sequences of a query LG and the genome sequence of a target chromosome shared at least 5 loci, the region was considered a potential syntenic block in this study. We constructed circular plots of potential syntenic blocks between each Fp LG and the pseudomolecules of the relevant Sl, Cc, Pp and Pt chromosomes (Fig. 4.5). We observed that the syntenic relationships were inconsistent across the 23 Fp LGs. For example, LG1 showed a unique one-to-one correspondence with the peach genome and a one-to-two correspondence with the poplar genome, but a one-to-three correspondence with the tomato and coffee genomes. As another example, green ash LG12 showed a unique one-to-one correspondence with all four target genomes. On the other hand, LG17 showed better conserved synteny with tomato, as it corresponded with one chromosome of tomato (Sl_Chr7) but with two chromosomes of both coffee (Cc_Chr6 and 7) and peach (Pp_Chr1 and 3).

LG2, 13 and 14 of green ash showed syntenic regions with Pp_Chr1, while LG5, 1, 20, 22 and 3 corresponded to Pp_Chr2, 3, 4, 5 and 7, respectively (Table 4.7). Loci from LG9 were mapped to Pp_Chr1 and 6, and loci from LG17 were mapped to Pp_Chr1 and 3. Synteny of the green ash linkage groups with the poplar genome is more eroded than their synteny with the genomes of peach, coffee and tomato. LG7 and 10 showed a syntenic block with Pt_Chr9. LG12 and 20 showed syntenic blocks with Pt_Chr1 and 11, respectively. The other green ash linkage groups showed synteny with several chromosomes of poplar.

Overall, the green ash map showed higher synteny with the coffee genome than with the other genomes. Of the 23 linkage groups, 11 (LG2, 4-8, 10, 12, 14, 16 and 19) showed syntenic regions with a single chromosome of coffee, while 8 (LG2, 7, 10, 12, 17, 18, 20 and 23) showed syntenic blocks with a single chromosome of tomato. Among the rosid species, 8 LGs and 4 LGs showed one-to-one correspondence with the peach and the poplar genomes, respectively. In addition, 5 LGs showed syntenic blocks only with the coffee genome.
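The criterion used in this section, under which a linkage group and a target chromosome that share at least 5 loci are treated as a potential syntenic block, can be sketched as a count over marker hits. The input format (one pair of linkage group and chromosome per mapped marker) and the function names are assumptions for illustration, and the example hits are invented rather than taken from this study.

```python
from collections import Counter, defaultdict

def potential_syntenic_blocks(marker_hits, min_shared_loci=5):
    """Return {(linkage_group, chromosome): n_loci} for pairs sharing enough loci."""
    counts = Counter(marker_hits)
    return {pair: n for pair, n in counts.items() if n >= min_shared_loci}

def chromosomes_per_linkage_group(blocks):
    """Group block chromosomes by linkage group (one entry means one-to-one)."""
    grouped = defaultdict(list)
    for linkage_group, chromosome in blocks:
        grouped[linkage_group].append(chromosome)
    return {lg: sorted(chroms) for lg, chroms in grouped.items()}

# Invented hits: each tuple is one marker with a significant BLASTN match.
hits = (
    [("LG12", "Cc_Chr2")] * 6     # 6 shared loci -> block
    + [("LG12", "Cc_Chr5")] * 2   # only 2 shared loci -> ignored
    + [("LG1", "Cc_Chr1")] * 5    # exactly 5 shared loci -> block
    + [("LG1", "Cc_Chr4")] * 7    # second block for LG1 -> one-to-two
)

blocks = potential_syntenic_blocks(hits)
print(chromosomes_per_linkage_group(blocks))
```

In this invented example LG12 has a one-to-one correspondence and LG1 a one-to-two correspondence, which is the kind of summary reported for each target genome above.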

Discussion

Development of the GBS technology has made it possible to rapidly obtain genotyping data for thousands of loci within a period of only months. Restriction enzyme selection is a critical step: targeting sites in low-copy genomic regions minimizes reads in repetitive sequences, and the choice of enzyme influences the number and genomic location of the SNPs discovered. We tested the ApeKI (GCWGC), PstI (CTGCAG) and EcoT22I (ATGCAT) enzymes previously used in maize, pine and other plant species. There is a tradeoff between the read coverage of SNPs and the number of SNPs. For our study, we aimed to genotype at least hundreds of loci at good coverage. Hence, we chose the six-cutter enzyme PstI so that the reduced representation libraries would contain fewer repetitive fragments and a large proportion of fragments within the sequencing range of less than 500 bp. As a result, we discovered 63,540 SNPs, which were mapped to 3,973 reference genome scaffolds. However, only 5727 (9%) SNPs showed less than 20% missing data. The large proportion of missing data may have resulted from the high heterozygosity of such an outcrossing species. In future studies, the amount of missing data could be lowered by reducing the level of multiplexing or by constructing optimized two-enzyme (double-digest) RADseq libraries.

Our SNP identification revealed a Ts/Tv ratio of 1.78, a transition bias that has also been observed in many other plant species. Transitions are better tolerated than transversions under natural selection because they are more likely to result in synonymous mutations in protein-coding regions. A Ts/Tv ratio of 2 was detected in selected inbred lines of Indica rice (Subbaiyan et al., 2012). In oil palm, SNPs showed a Ts/Tv ratio of 1.67, using a modified two-enzyme GBS protocol (Pootakham et al., 2015). The ratio between transitions and transversions was 1.42 in sunflower (Celik et al., 2016). In chickpea, a Ts/Tv ratio of 1.74 was observed from SNP discovery (Gaur et al., 2015).

Although SNPs can be obtained in higher numbers than SSRs, bi-allelic SNP markers are often less informative than SSRs. Multi-allelic SSR markers are more powerful for integrating parental maps into an intra-specific map. SSR markers are also more transferable among different populations and species. Therefore, a combination of SNPs and SSRs is ideal for map construction. However, SSR genotyping is limited by its cost and the time it requires. We identified only 84 reliable polymorphic SSR markers from the hundreds of SSRs discovered.

Genetic maps have been reported for many tree species, including Eucalyptus sp. (Freeman et al., 2006), Populus sp. (Cervera et al., 2001), Quercus sp. (Barreneche et al., 1998) and Pinus sp. (Neves et al., 2014). However, no genetic map of the threatened Fraxinus species had been developed until now. We found that a 96-plex GBS protocol worked well for the construction of a high-density linkage map with a compact marker interval (< 2 cM) in a non-model outcrossing species. A linkage map of green ash was constructed with 1201 markers, including 1126 SNP and 75 SSR loci, spanning 2008.87 cM. The map consisted of 23 linkage groups, equal to the haploid chromosome number, ranging from 53.93 to 127.05 cM.
A higher recombination rate was observed in the male parent (1.94 cM/Mb) than in the female parent (1.74 cM/Mb), which is consistent with observations in Arabidopsis (Giraut et al., 2011). Among the 23 linkage groups, the most dramatic differences in map length between the two parents were observed in LG16 and LG18, suggesting that sex-specific patterns may occur along these two chromosomes. The average recombination rate across all the linkage groups is 2.23 cM/Mb, which is comparable to the rates in cacao (1.7 cM/Mb), grape (2.0 cM/Mb), papaya (2.54 cM/Mb) and soybean (2.51 cM/Mb) (Henderson, 2012).

Segregation distortion (SD) is a general phenomenon in plants, but its percentage, degree and genetic effects may vary significantly across species (Dai et al., 2016; Taylor and Ingvarsson, 2003; Zhou et al., 2015). In maize, 18 chromosome regions on 10 chromosomes have been reported to be associated with SD (Lu et al., 2002). Fourteen SD regions have been identified in barley (Li et al., 2010). SD has been suggested as a selection mechanism (Sandler and Novitski, 1957), and it may result from biological and environmental factors, such as chromosome loss (Bradshaw and Stettler, 1994) or gametic and zygotic selection (Liebhard et al., 2003). In this study, 33.02% of the markers showed distortion from the expected segregation ratio. In future studies with an increased population size, we may find different regions associated with SD markers compared to those reported here. Ash yellows killed ~350 highly susceptible genotypes, which may have led to distorted segregation of markers around the genomic regions responsible for ash yellows resistance.

Synteny was retained during genome evolution and duplication events. Genome investigations have shown evidence for whole genome duplication (WGD) across all lineages, and the model organism Arabidopsis thaliana underwent at least three ancient genome duplication events (α, β and γ) over the last 300 million years (Bowers et al., 2003). Both the α and β events occurred within rosids clade II: the α event is shared within the genus Brassica, and the β event occurred within the order Brassicales following the divergence from papaya (Bowers et al., 2003). A whole genome triplication, called γ, was suggested by analyses of the sequenced genomes of poplar (Tuskan et al., 2006), grape (Jaillon et al., 2007) and papaya (Ming et al., 2008), and may have occurred close to the eudicot divergence (Soltis et al., 2009). Furthermore, a recent additional WGD event also occurred in some species, such as poplar (Tuskan et al., 2006), cotton (Wang et al., 2012), the genus Brassica (Lukens et al., 2004; Lysak et al., 2005), and the Cleomaceae (Schranz and Mitchell-Olds, 2006), while a recent genome triplication occurred in the Solanum lineage (The Tomato Genome, 2012). To understand the syntenic relationships between green ash and related species, tomato, coffee, peach and poplar were selected for comparative analyses.
Tomato (Solanales) and coffee (Gentianales) belong to sister orders of green ash (Lamiales); the three diverged from their last common ancestor approximately 82-89 million years ago (Wikström et al., 2001). Generally, tomato and coffee pseudomolecules contain syntenic regions shared with only a couple of green ash LGs, indicating that genome duplication and rearrangement have occurred since the separation of the three species. Synteny analysis among woody tree species and herbs has shown slower rates of synteny erosion during genome divergence in woody perennials (Luo et al., 2015). Here, we also observed that the syntenic relationship between green ash and coffee is more conserved than that between green ash and tomato. Poplar and peach are distantly related to green ash. Even though the rosid and asterid clades diverged approximately 125 million years ago (Wikström et al., 2001), synteny was detected between these very distantly related woody species. Our results suggest that coffee and peach could be used as model species for further comparative analyses.

This research allowed us to construct the first reference map for F. pennsylvanica and to provide detailed genetic marker data for use in future ash breeding programs once QTL are identified. The DNA markers identified in this study can also be utilized for analyses of genetic variation and population structure in the threatened ash species, providing valuable information to conserve genetic diversity in the species for the future. No phenotypic traits were studied with the present map, as the progeny in our mapping population were still seedlings and were planted and exposed to EAB only last year. Over the next couple of years, phenotypic data will be collected. This high-density linkage map could be a useful platform for QTL identification for important traits, including EAB resistance, height and budburst. In addition, the genetic map can be further utilized to map candidate genes identified from our expression profiling analysis. The results from this study can also be extended to future association mapping analysis. Hopefully, our study will help identify resistant seedlings for tree breeding efforts in the US and restore resistant green ash trees to disturbed forest sites, supporting recovery of the habitat and the ecosystem.

References

Barreneche, T., Bodenes, C., Lexer, C., Trontin, J.-F., Fluch, S., Streiff, R., Plomion, C., Roussel, G., Steinkellner, H., Burg, K., et al. (1998). A genetic linkage map of Quercus robur L. (pedunculate oak) based on RAPD, SCAR, microsatellite, minisatellite, isozyme and 5S rDNA markers. Theor Appl Genet 97, 1090-1103.
Bowers, J.E., Chapman, B.A., Rong, J., and Paterson, A.H. (2003). Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433.
Bradshaw, H.D., Jr., and Stettler, R.F. (1994). Molecular genetics of growth and development in Populus. II. Segregation distortion due to genetic load. Theor Appl Genet 89, 551-558.
Celik, I., Bodur, S., Frary, A., and Doganlar, S. (2016). Genome-wide SNP discovery and genetic linkage map construction in sunflower (Helianthus annuus L.) using a genotyping by sequencing (GBS) approach. Mol Breeding 36, 133.
Cervera, M.-T., Storme, V., Ivens, B., Gusmao, J., Liu, B.H., Hostyn, V., Van Slycken, J., Van Montagu, M., and Boerjan, W. (2001). Dense genetic linkage maps of three Populus species (Populus deltoides, P. nigra and P. trichocarpa) based on AFLP and microsatellite markers. Genetics 158, 787-809.
Clarke, J.D. (2009). Cetyltrimethyl Ammonium Bromide (CTAB) DNA Miniprep for Plant DNA Isolation. Cold Spring Harbor Protocols 2009, pdb.prot5177.
Dai, B., Guo, H., Huang, C., Ahmed, M.M., and Lin, Z. (2016). Identification and Characterization of Segregation Distortion Loci on Cotton Chromosome 18. Frontiers in plant science 7, 2037.
Davey, J.W., and Blaxter, M.L. (2010). RADSeq: next-generation population genetics. Briefings in functional genomics 9, 416-423.
Denoeud, F., Carretero-Paulet, L., Dereeper, A., Droc, G., Guyot, R., Pietrella, M., Zheng, C., Alberti, A., Anthony, F., Aprea, G., et al. (2014). The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345, 1181-1184.
Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K., Buckler, E.S., and Mitchell, S.E. (2011). A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. PLoS ONE 6, e19379.
Freeman, J.S., Potts, B.M., Shepherd, M., and Vaillancourt, R.E. (2006). Parental and consensus linkage maps of Eucalyptus globulus using AFLP and microsatellite markers. Silvae Genetica 55, 202-217.
Gabriel, S., Ziaugra, L., and Tabbaa, D. (2001). SNP Genotyping Using the Sequenom MassARRAY iPLEX Platform. In Current Protocols in Human Genetics (John Wiley & Sons, Inc.).
Gaur, R., Jeena, G., Shah, N., Gupta, S., Pradhan, S., Tyagi, A.K., Jain, M., Chattopadhyay, D., and Bhatia, S. (2015). High density linkage mapping of genomic and transcriptomic SNPs for synteny analysis and anchoring the genome sequence of chickpea. 5, 13387.
Giraut, L., Falque, M., Drouaud, J., Pereira, L., Martin, O.C., and Mézard, C. (2011). Genome-Wide Crossover Distribution in Arabidopsis thaliana Meiosis Reveals Sex-Specific Patterns along Chromosomes. PLoS Genetics 7, e1002354.
Glaubitz, J.C., Casstevens, T.M., Lu, F., Harriman, J., Elshire, R.J., Sun, Q., and Buckler, E.S. (2014). TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. PLoS ONE 9, e90346.
Henderson, I.R. (2012). Control of meiotic recombination frequency in plant genomes. Current opinion in plant biology 15, 556-561.
Jaillon, O., Aury, J.M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., et al. (2007). The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463-467.
Jones, O.R., and Wang, J. (2010). COLONY: a program for parentage and sibship inference from multilocus genotype data. Molecular Ecology Resources 10, 551-555.
Koressaar, T., and Remm, M. (2007). Enhancements and modifications of primer design program Primer3. Bioinformatics 23, 1289-1291.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., and Marra, M.A. (2009). Circos: an information aesthetic for comparative genomics. Genome research 19, 1639-1645.
Lane, T., Best, T., Zembower, N., Davitt, J., Henry, N., Xu, Y., Koch, J., Liang, H., McGraw, J., Schuster, S., et al. (2016). The green ash transcriptome and identification of genes responding to abiotic and biotic stresses. BMC genomics 17, 702.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760.
Li, H., Kilian, A., Zhou, M., Wenzl, P., Huttner, E., Mendham, N., McIntyre, L., and Vaillancourt, R.E. (2010). Construction of a high-density composite map and comparative mapping of segregation distortion regions in barley. Molecular genetics and genomics : MGG 284, 319-331.
Liebhard, R., Koller, B., Gianfranceschi, L., and Gessler, C. (2003). Creating a saturated reference map for the apple (Malus x domestica Borkh.) genome. Theor Appl Genet 106, 1497-1508.
Lu, H., Romero-Severson, J., and Bernardo, R. (2002). Chromosomal regions associated with segregation distortion in maize. Theor Appl Genet 105, 622-628.
Lukens, L.N., Quijada, P.A., Udall, J., Pires, J.C., Schranz, M.E., and Osborn, T.C. (2004). Genome redundancy and plasticity within ancient and recent Brassica crop species. Biological Journal of the Linnean Society 82, 665-674.
Luo, M.-C., You, F.M., Li, P., Wang, J.-R., Zhu, T., Dandekar, A.M., Leslie, C.A., Aradhya, M., McGuire, P.E., and Dvorak, J. (2015). Synteny analysis in Rosids with a walnut physical map reveals slow genome evolution in long-lived woody perennials. BMC genomics 16, 707.
Lysak, M.A., Koch, M.A., Pecinka, A., and Schubert, I. (2005). Chromosome triplication found across the tribe Brassiceae. Genome research 15, 516-525.
Ming, R., Hou, S., Feng, Y., Yu, Q., Dionne-Laporte, A., Saw, J.H., Senin, P., Wang, W., Ly, B.V., Lewis, K.L., et al. (2008). The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991-996.
Moumouni, K.H., Kountche, B.A., Jean, M., Hash, C.T., Vigouroux, Y., Haussmann, B.I.G., and Belzile, F. (2015). Construction of a genetic map for pearl millet, Pennisetum glaucum (L.) R. Br., using a genotyping-by-sequencing (GBS) approach. Mol Breeding 35, 5.
Neves, L.G., Davis, J.M., Barbazuk, W.B., and Kirst, M. (2014). A High-Density Gene Map of Loblolly Pine (Pinus taeda L.) Based on Exome Sequence Capture Genotyping. G3: Genes|Genomes|Genetics 4, 29-37.
Pootakham, W., Jomchai, N., Ruang-areerate, P., Shearman, J.R., Sonthirod, C., Sangsrakru, D., Tragoonrung, S., and Tangphatsornruang, S. (2015).
Genome-wide SNP discovery and identification of QTL associated with agronomic traits in oil palm using genotyping-by- sequencing (GBS). Genomics 105, 288-295. Raman, H., Dalton-Morgan, J., Diffey, S., Raman, R., Alamery, S., Edwards, D., and Batley, J. (2014). SNP markers-based map construction and genome-wide linkage analysis in Brassica napus. Plant biotechnology journal 12, 851-860. Sandler, L., and Novitski, E. (1957). Meiotic drive as an evolutionary force. The American Naturalist 91, 105-110.

75 Schranz, M.E., and Mitchell-Olds, T. (2006). Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18, 1152-1165. Schuelke, M. (2000). An economic method for the fluorescent labeling of PCR fragments. Nature biotechnology 18, 233-234. Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., Depamphilis, C.W., Wall, P.K., and Soltis, P.S. (2009). Polyploidy and angiosperm diversification. American journal of botany 96, 336-348. Staton, M., Best, T., Khodwekar, S., Owusu, S., Xu, T., Xu, Y., Jennings, T., Cronn, R., Arumuganathan, A.K., Coggeshall, M., et al. (2015). Preliminary Genomic Characterization of Ten Hardwood Tree Species from Multiplexed Low Coverage Whole Genome Sequencing. PLOS ONE 10, e0145031. Subbaiyan, G.K., Waters, D.L.E., Katiyar, S.K., Sadananda, A.R., Vaddadi, S., and Henry, R.J. (2012). Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole- genome sequencing. Plant biotechnology journal 10, 623-634. Taylor, D.R., and Ingvarsson, P.K. (2003). Common Features of Segregation Distortion in Plants and Animals. Genetica 117, 27-35. The Tomato Genome, C. (2012). The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635. Tian, M., Li, Y., Jing, J., Mu, C., Du, H., Dou, J., Mao, J., Li, X., Jiao, W., Wang, Y., et al. (2015). Construction of a High-Density Genetic Map and Quantitative Trait Locus Mapping in the Sea Cucumber Apostichopus japonicus. Scientific reports 5, 14852. Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., et al. (2006). The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596-1604. Untergasser, A., Cutcutache, I., Koressaar, T., Ye, J., Faircloth, B.C., Remm, M., and Rozen, S.G. (2012). Primer3—new capabilities and interfaces. Nucleic acids research 40, e115-e115. Van Ooijen, J.W. 
(2006). JoinMap 4: Software for the calculation of genetic linkage maps in experimental populations. Kyazma BV, Wageningen, the Netherlands. Van Tassell, C.P., Smith, T.P.L., Matukumalli, L.K., Taylor, J.F., Schnabel, R.D., Lawley, C.T., Haudenschild, C.D., Moore, S.S., Warren, W.C., and Sonstegard, T.S. (2008). SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nature Methods 5, 247. Voorrips, R.E. (2002). MapChart: software for the graphical presentation of linkage maps and QTLs. The Journal of heredity 93, 77-78.

76 Wang, K., Wang, Z., Li, F., Ye, W., Wang, J., Song, G., Yue, Z., Cong, L., Shang, H., Zhu, S., et al. (2012). The draft genome of a diploid cotton Gossypium raimondii. Nature genetics 44, 1098. Whitehill, J.G., Opiyo, S.O., Koch, J.L., Herms, D.A., Cipollini, D.F., and Bonello, P. (2012). Interspecific comparison of constitutive ash phloem phenolic chemistry reveals compounds unique to manchurian ash, a species resistant to emerald ash borer. Journal of chemical ecology 38, 499-511. Wikström, N., Savolainen, V., and Chase, M.W. (2001). Evolution of the angiosperms: calibrating the family tree. Proceedings of the Royal Society of London Series B: Biological Sciences 268, 2211-2220. Wu, Y., Close, T.J., and Lonardi, S. (2008). On the accurate construction of consensus genetic maps. Computational systems bioinformatics Computational Systems Bioinformatics Conference 7, 285-296. Zhou, W., Tang, Z., Hou, J., Hu, N., and Yin, T. (2015). Genetic Map Construction and Detection of Genetic Loci Underlying Segregation Distortion in an Intraspecific Cross of Populus deltoides. PLOS ONE 10, e0126077.

Figure 4.1 Assessment of GBS libraries using three different restriction enzymes. (A) GBS libraries prepared with ApeKI; (B) GBS libraries prepared with EcoT22I; (C) GBS libraries prepared with PstI.

Figure 4.2 Genetic maps of F. pennsylvanica. The intraspecific linkage map of green ash, based on an F1 population derived from one EAB-resistant and one EAB-susceptible genotype and harboring 1,201 loci. SNP markers are shown in black and SSR markers in red.

Figure 4.3 Genetic maps of F. pennsylvanica including markers with segregation distortion. Green bars represent SNP loci showing segregation distortion; black bars represent markers without segregation distortion.

Figure 4.4 Circos plot of synteny between the green ash genetic map and the genome assembly. Green ash linkage groups are color coded and labeled from LG1 to LG23; genome scaffolds are shown in black and labeled from 1 to 27.

Figure 4.5 Circos plots of syntenic relationships between green ash and four other species. Synteny is shown between the 23 F. pennsylvanica linkage groups and (A) the 12 tomato chromosomes, (B) the 11 coffee chromosomes, (C) the 8 peach chromosomes, and (D) the 19 poplar chromosomes, based on GBS-derived SNPs. Green ash linkage groups are color coded and labeled from LG1 to LG23. Pseudomolecules of the four target species are shown in black. Colored bands show links connecting the locations of homologous regions between genomes.

Table 4.1 Summary of SNPs identified in green ash

Category               Number   Percentage
Total number of SNPs   5712     100%
  Bi-allelic           5683     99.49%
  Others               29       0.51%
Transversions
  A/C                  494      8.65%
  A/T                  652      11.41%
  C/G                  373      6.53%
  G/T                  528      9.24%
Transitions
  A/G                  1820     31.86%
  C/T                  1816     31.79%
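
The percentages in Table 4.1 can be re-derived from the counts. The short sketch below does this and also computes a transition/transversion (Ts/Tv) ratio, which the table does not report; the variable and function names are illustrative only, and reading the "others" row as the remainder after bi-allelic SNPs is an assumption.

```python
# SNP substitution counts copied from Table 4.1.
transversions = {"A/C": 494, "A/T": 652, "C/G": 373, "G/T": 528}
transitions = {"A/G": 1820, "C/T": 1816}
total_snps = 5712


def percent(count, total=total_snps):
    """Share of all SNPs, rounded to two decimals as in the table."""
    return round(100 * count / total, 2)


n_ts = sum(transitions.values())   # transitions
n_tv = sum(transversions.values()) # transversions
biallelic = n_ts + n_tv            # matches the table's bi-allelic row
other = total_snps - biallelic     # assumed to be the table's "others" row

ts_tv_ratio = n_ts / n_tv          # not reported in the table

print(f"bi-allelic: {biallelic} ({percent(biallelic)}%)")
print(f"others:     {other} ({percent(other)}%)")
print(f"Ts/Tv:      {ts_tv_ratio:.2f}")
```

The six substitution classes sum to the bi-allelic total, so the remainder is the small set of SNPs in the "others" row.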

Table 4.2 Summary of SSR markers

                                 Segregation type
Type      Tested   Polymorphic   ab x cd   ef x eg   hk x hk   ll x lm   nn x np
EST-SSR   352      30            14        9         1         5         1
gSSR      252      54            25        12        0         7         10

Table 4.3 Summary of DNA marker information in female and male genetic maps.

Female (PE0048)                                Male (PE00248)
                Distinct            Average                    Distinct            Average    Common
LGs    Markers  positions  Length   interval   LGs    Markers  positions  Length   interval   markers  M:F
1      71       55         97.69    1.78       1      46       39         115.76   2.97       2        1.18
2      33       28         66.65    2.38       2      49       42         99.14    2.36       10       1.49
3      54       39         76.49    1.96       3      38       33         102.38   3.1        14       1.34
4      48       40         92.16    2.3        4A     22       16         48.51    3.03       9        0.7
                                               4B     5        4          15.66    3.92       4
5      54       43         103.21   2.4        5      33       27         75.71    2.8        11       0.73
6      43       35         80.91    2.31       6      33       29         92.78    3.2        13       1.15
7      51       40         75.62    1.89       7      42       34         83.77    2.46       14       1.11
8      49       39         78.3     2.01       8      34       30         91.48    3.05       10       1.17
9      97       69         74.72    1.08       9      49       48         85.83    1.79       23       1.15
10     42       31         74.96    2.42       10     41       32         90.52    2.83       12       1.21
11     30       25         64.96    2.6        11     36       31         90.58    2.92       5        1.39
12     43       36         77.53    2.15       12A    28       18         71.01    3.95       10       1.03
                                               12B    6        5          8.63     1.73       1
13     45       33         67.47    2.04       13     39       37         80.29    2.17       13       1.19
14     37       29         72.15    2.49       14     32       27         78.59    2.91       10       1.09
15     36       27         46.06    1.71       15     25       20         85.34    4.27       5        1.85
16     19       15         32.29    2.15       16     42       35         74.27    2.12       7        2.3
17     59       47         67.51    1.44       17     25       25         62.58    2.5        10       0.93
18     49       37         68.47    1.85       18     26       21         24.33    1.16       13       0.36
19     23       20         66.47    3.32       19     22       20         62.36    3.12       8        0.94
20     33       22         67.11    3.05       20     18       14         49.48    3.53       7        0.74
21     6        --         --       --         21     17       14         69.86    4.99       --       --
22     23       15         59.56    3.97       22     25       18         51.55    2.86       10       0.87
23     53       35         52.35    1.5        23     22       16         33.71    2.11       9        0.64
Total  992      760        1562.6   2.22              755      635        1744.1   2.87       230      1.12
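
The derived columns of Table 4.3 appear to follow from the primary ones. As a hedged sketch (the table itself does not state the formulas), the average interval is taken here as map length divided by the number of distinct positions, and M:F as the ratio of male to female map length; both reproduce the reported values for the rows checked below. The helper names are illustrative.

```python
# (map length, distinct positions) for three linkage groups from Table 4.3.
female = {"1": (97.69, 55), "2": (66.65, 28), "3": (76.49, 39)}
male = {"1": (115.76, 39), "2": (99.14, 42), "3": (102.38, 33)}


def average_interval(length, positions):
    """Assumed definition: map length per distinct marker position."""
    return round(length / positions, 2)


def male_to_female(lg):
    """Assumed definition: male map length divided by female map length."""
    return round(male[lg][0] / female[lg][0], 2)


for lg in female:
    f_len, f_pos = female[lg]
    m_len, m_pos = male[lg]
    print(lg,
          average_interval(f_len, f_pos),
          average_interval(m_len, m_pos),
          male_to_female(lg))
```

For linkage groups split in the male map (4A/4B and 12A/12B), the reported M:F values are consistent with summing the two male segments before dividing.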

Table 4.4 Summary of genetic linkage map of F. pennsylvanica

        Mapped markers   Genetic     Average marker   Max gap
LGs     SSRs    SNPs     size (cM)   interval         (cM)
1       7       85       127.05      1.38             11.97
2       5       56       109.81      1.8              9.15
3       2       59       107.22      1.76             8.06
4       4       47       101.17      1.98             6.51
5       2       58       99.99       1.67             10.31
6       3       48       96.01       1.85             6.61
7       5       57       95.36       1.54             10.27
8       3       59       95.25       1.54             6.19
9       5       90       94.07       1.05             9.2
10      3       51       92.03       1.7              8.8
11      4       49       91.83       1.73             18.55
12      2       49       90.44       1.77             11.9
13      3       54       89.03       1.56             7.08
14      5       42       82.16       1.75             8.45
15      1       41       80.26       1.91             11.1
16      5       39       75.3        1.71             8.38
17      3       59       74.67       1.2              9.63
18      3       44       74.56       1.59             8.03
19      2       30       74.29       2.32             9.77
20      2       28       71.28       2.38             11.73
21      3       10       69.87       5.37             21.81
22      0       29       63.29       2.18             6.86
23      3       42       53.93       1.2              8.07
Total   75      1126     2008.87     1.67             9.93

Table 4.5 Distribution of orthologous loci on LGs of green ash and tomato genome

LG     Chr1  Chr2  Chr3  Chr4  Chr5  Chr6  Chr7  Chr8  Chr9  Chr10 Chr11 Chr12 Total
LG1    1     5     5     2     0     5     2     0     0     2     1     4     27
LG2    1     1     0     0     0     2     6     1     0     1     1     2     15
LG3    0     7     1     1     2     5     1     0     0     0     1     3     21
LG4    0     1     1     0     1     0     4     2     1     2     3     1     16
LG5    4     0     4     0     2     1     2     0     3     1     1     1     19
LG6    0     1     1     4     3     0     0     0     1     0     0     0     10
LG7    14    0     0     0     0     0     0     0     1     1     0     0     16
LG8    0     0     1     0     2     0     2     0     2     0     1     2     10
LG9    1     0     4     4     3     3     1     0     4     0     0     0     20
LG10   12    0     0     1     1     0     0     1     1     0     1     0     17
LG11   1     0     0     2     0     2     0     1     3     0     0     0     9
LG12   0     4     3     1     1     0     2     6     1     0     2     2     22
LG13   4     0     2     0     3     0     0     2     0     0     0     0     11
LG14   1     0     0     3     0     3     0     4     3     0     0     1     15
LG15   0     3     0     1     0     1     0     0     1     0     0     0     6
LG16   1     0     0     0     1     1     0     0     3     2     2     0     10
LG17   2     1     2     0     1     2     10    1     0     0     2     2     23
LG18   5     2     0     1     2     1     0     0     0     0     0     0     11
LG19   0     0     0     0     0     0     1     1     0     2     0     1     5
LG20   1     4     0     0     0     0     5     0     0     0     0     2     12
LG21   0     0     0     0     0     0     0     2     0     0     0     0     2
LG22   3     1     0     0     3     1     0     1     2     0     1     0     12
LG23   0     2     3     2     0     2     5     0     0     2     0     0     16
Total  51    32    27    22    25    29    41    22    26    13    16    21    325

* Syntenic regions with at least five shared markers were highlighted in red.
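
The footnote's rule (at least five shared markers) can be applied directly to the counts to list the linkage group and chromosome pairs it refers to. The sketch below uses four rows of Table 4.5; the function name and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Four rows of Table 4.5: orthologous loci shared between green ash linkage
# groups (keys) and tomato chromosomes Chr1..Chr12 (list positions).
counts = {
    "LG7":  [14, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    "LG10": [12, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
    "LG17": [2, 1, 2, 0, 1, 2, 10, 1, 0, 0, 2, 2],
    "LG21": [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
}
MIN_SHARED = 5  # threshold stated in the table footnote


def syntenic_pairs(table, threshold=MIN_SHARED):
    """Return (linkage group, chromosome, count) for cells at or above threshold."""
    pairs = []
    for lg, row in table.items():
        for chrom_number, n in enumerate(row, start=1):
            if n >= threshold:
                pairs.append((lg, f"Chr{chrom_number}", n))
    return pairs


for lg, chrom, n in syntenic_pairs(counts):
    print(f"{lg} - {chrom}: {n} shared loci")
```

The same function applies unchanged to the coffee, peach and poplar tables, since only the number of columns differs.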

Table 4.6 Distribution of orthologous loci on LGs of green ash and coffee genome

LG     Chr1  Chr2  Chr3  Chr4  Chr5  Chr6  Chr7  Chr8  Chr9  Chr10 Chr11 Total
LG1    0     1     1     9     0     2     5     1     0     6     0     25
LG2    3     0     0     0     7     0     0     3     0     1     1     15
LG3    1     1     0     7     0     1     8     1     0     1     0     20
LG4    0     2     0     1     2     5     0     3     2     0     0     15
LG5    3     4     1     1     0     7     0     1     1     1     0     19
LG6    0     0     0     0     0     0     5     0     0     4     0     9
LG7    3     14    0     0     0     0     0     0     0     0     0     17
LG8    1     3     1     0     0     7     2     0     0     1     1     16
LG9    8     4     3     5     0     0     0     1     1     2     7     31
LG10   0     14    1     0     0     0     0     1     0     0     0     16
LG11   3     0     3     0     0     0     0     2     0     1     0     9
LG12   0     2     2     0     3     0     1     7     3     0     1     19
LG13   2     1     0     2     0     1     0     2     0     1     4     13
LG14   5     1     3     0     0     0     1     4     0     4     0     18
LG15   0     1     1     0     0     0     1     0     0     2     0     5
LG16   5     0     1     0     0     2     0     0     0     0     1     9
LG17   2     1     2     4     1     9     5     0     1     0     0     25
LG18   1     2     0     0     2     1     0     1     3     0     2     12
LG19   0     0     0     1     0     7     0     0     0     0     0     8
LG20   0     2     0     1     3     0     0     0     3     0     0     9
LG21   0     0     0     0     0     0     0     2     0     0     0     2
LG22   2     4     0     1     0     0     0     2     2     0     1     12
LG23   1     3     1     4     4     1     0     0     0     3     1     18
Total  40    60    20    36    22    43    28    31    16    27    19    342

* Syntenic regions with at least five shared markers were highlighted in red.

Table 4.7 Distribution of orthologous loci on LGs of green ash and peach genome

LG     Chr1  Chr2  Chr3  Chr4  Chr5  Chr6  Chr7  Chr8  Total
LG1    4     1     10    2     1     2     3     0     23
LG2    5     0     3     4     1     0     0     0     13
LG3    3     0     4     1     0     2     5     0     15
LG4    0     1     3     2     1     1     3     0     11
LG5    2     7     0     0     0     2     1     3     15
LG6    0     0     1     0     0     1     0     0     2
LG7    4     3     0     0     0     1     0     3     11
LG8    2     0     2     0     0     0     2     0     6
LG9    9     1     0     0     2     6     0     2     20
LG10   1     2     1     0     1     2     4     3     14
LG11   4     0     0     0     0     1     1     1     7
LG12   0     3     0     4     5     0     1     1     14
LG13   7     0     0     0     1     3     0     0     11
LG14   9     0     0     0     0     2     0     1     12
LG15   2     0     0     0     0     1     1     0     4
LG16   0     0     0     2     0     4     0     0     6
LG17   5     0     7     0     1     2     3     0     18
LG18   1     1     2     0     0     2     3     1     10
LG19   1     1     0     0     0     0     2     0     4
LG20   0     1     0     8     0     0     0     0     9
LG21   0     0     0     0     0     0     0     0     0
LG22   0     1     0     0     0     1     1     2     5
LG23   0     0     2     3     3     0     0     1     9
Total  59    22    35    26    16    33    30    18    239

* Syntenic regions with at least five shared markers were highlighted in red.

Table 4.8 Distribution of orthologous loci on LGs of green ash and poplar genome

LG     Chr1  Chr2  Chr3  Chr4  Chr5  Chr6  Chr7  Chr8  Chr9  Chr10 Chr11 Chr12 Chr13 Chr14 Chr15 Chr16 Chr17 Chr18 Chr19 Total
LG1    1     0     5     0     4     1     5     2     1     2     1     3     0     2     0     0     2     0     0     29
LG2    2     0     1     0     0     0     0     1     0     1     2     0     1     0     0     0     0     3     2     13
LG3    3     0     3     0     1     1     4     0     0     1     0     2     0     2     1     0     2     1     0     21
LG4    1     0     1     2     0     1     0     0     0     0     2     0     2     1     0     0     0     4     0     14
LG5    0     4     1     1     0     3     0     0     1     1     0     3     1     0     1     1     2     2     1     22
LG6    0     2     0     0     4     0     0     0     0     0     0     0     0     0     0     0     1     0     1     8
LG7    0     0     2     1     0     0     0     4     6     0     0     0     0     1     0     0     0     0     0     14
LG8    1     0     0     0     1     3     1     0     0     4     0     0     1     0     0     0     0     2     0     13
LG9    3     5     0     0     2     0     1     5     5     1     0     3     0     0     3     0     0     0     0     28
LG10   1     0     0     2     1     0     0     0     6     0     0     0     2     3     1     0     1     0     0     17
LG11   1     0     3     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     6
LG12   6     0     2     0     1     0     0     0     0     2     0     3     0     0     0     0     0     0     1     15
LG13   4     1     1     0     0     0     0     0     0     3     0     0     0     1     2     1     0     2     0     15
LG14   4     3     1     0     2     3     0     0     0     1     0     0     1     3     0     1     0     2     0     21
LG15   1     1     0     1     1     2     0     0     0     1     0     0     0     1     0     0     1     0     0     9
LG16   3     0     0     0     0     1     0     0     1     0     0     0     0     1     0     1     0     0     0     7
LG17   2     0     3     1     2     4     0     2     2     1     0     1     1     1     0     0     0     4     0     24
LG18   0     4     0     2     0     0     0     1     2     1     0     0     0     0     0     0     0     0     0     10
LG19   0     0     1     0     0     3     0     0     0     0     0     0     0     0     0     1     0     2     0     7
LG20   2     0     0     1     0     0     0     0     0     2     6     0     0     2     1     0     0     0     0     14
LG21   0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     1
LG22   0     1     1     3     0     0     0     0     1     1     0     0     0     1     1     0     0     0     0     9
LG23   1     0     0     1     0     1     1     0     0     0     0     3     0     0     2     0     0     0     0     9
Total  36    21    25    15    19    24    12    15    25    22    11    18    9     21    12    5     9     22    5     326

APPENDIX – SUPPLEMENTARY TABLES

Table S4.1 Summary of EST-SSR markers Marker Forward Reverse expected # Type Blastx E-value Identities Contig Info Motif Start End Name primer sequence length Repeats

uncharacterize GA1_S1_L001 GCAGA TTGGCG d protein _R1_001_(pair GGAGTT AAATCT 7067 EST-SSR LOC10515715 1.00E-122 64% ed)_trimmed_( 203 ct 15 30 60 TCCGTT TTGAGG 3 [Sesamum paired)_contig ATACA TT indicum] _7067

GA1_S1_L001 TGCCGG TGATTG _R1_001_(pair TTATGA GTGAA 30480 EST-SSR NO Hit ed)_trimmed_( 132 ct 12 47 71 AATCTC GCAGC paired)_contig TCT GATA _30480

uncharacterize GA1_S1_L001 TGAGTG CCTGGA d protein _R1_001_(pair CATTAA CATTCT 31256 EST-SSR LOC10515642 1.00E-07 82% ed)_trimmed_( 234 ga 11 478 500 CTGTGA GTCAG 5 [Sesamum paired)_contig GCAA ATGAA indicum] _31256

chromo GA1_S1_L001 CATTTC TGCCTC domain protein _R1_001_(pair CCTCCC CATCTC 32330 EST-SSR LHP1 isoform 2.00E-65 55% ed)_trimmed_( 216 gaa 8 219 243 TCACTT GAGTTC X1 [Sesamum paired)_contig CA TT indicum] _32330

hypothetical GA1_S1_L001 TTGATG AGCAA protein _R1_001_(pair AAACC TGAAA 33761 EST-SSR PRUPE_1G41 0 78% ed)_trimmed_( 127 ga 12 35 59 ATACG CCACCA 8100 [Prunus paired)_contig AAATCC ATCA persica] _33761

PREDICTED: chlorophyllase GA1_S1_L001 CACAA AGTGG -2, _R1_001_(pair GAGCCT CTTTGG 37766 EST-SSR chloroplastic 5.00E-27 69% ed)_trimmed_( 186 ttc 9 69 96 AAGCA AGGAA [Nicotiana paired)_contig ATTGAA AACA tomentosiform _37766 is] GA1_S1_L001 TGGAG GCGACT _R1_001_(pair ATGCCC AGCGG 38145 EST-SSR NO Hit ed)_trimmed_( 127 tga 7 284 305 AAGAA AAACTT paired)_contig GACT CAT _38145 activator of 90 kDa heat GA1_S1_L001 TCCTCT AGGAT shock protein _R1_001_(pair CTCTTA GGGTTC 41990 EST-SSR ATPase 7.00E-153 81% ed)_trimmed_( 240 ag 12 55 79 AATTCG ATTGCA homolog paired)_contig CCTAAA GAG [Sesamum _41990 indicum] GA1_S1_L001 GTGCA TAGCAC _R1_001_(pair ATAACC GCCTTT 44635 EST-SSR NO Hit ed)_trimmed_( 224 ca 20 259 298 AAGGC GTGGA paired)_contig CAGT AG _44635 GA1_S1_L001 TTAAGG CTTGGT _R1_001_(pair AAGCA GGGCA 45153 EST-SSR NO Hit ed)_trimmed_( 153 ga 14 113 140 CGTCCG AAAGTT paired)_contig AAT ACC _45153

GA1_S1_L001 GCCAA GGACCT _R1_001_(pair ATCGAT CTGCTT 49026 EST-SSR NO Hit ed)_trimmed_( 211 ga 23 144 189 GCAATT TTTGCA paired)_contig TCT TC _49026

91 GA1_S1_L001 CCTGGC TTTGTC _R1_001_(pair CTGTTA CCTTCT 49818 EST-SSR NO Hit ed)_trimmed_( 135 aat 16 336 384 CATGTC TTTGGT paired)_contig CT GAA _49818

GA1_S1_L001 CAGTCT CGATTT _R1_001_(pair GCAAG GAACTC 52216 EST-SSR NO Hit ed)_trimmed_( 187 ga 17 56 90 AGACC TGAGG paired)_contig ACGA AGGA _52216

GA1_S1_L001 GGAGG GGCTGC _R1_001_(pair CCCAA CTCTTT 52729 EST-SSR NO Hit ed)_trimmed_( 266 ac 12 207 231 AATTAA CTTCTC paired)_contig ACCT TT _52729

GA1_S1_L001 TGCCAT CGGAA _R1_001_(pair GCTAA GAAAT 53031 EST-SSR NO Hit ed)_trimmed_( 126 tc 18 172 208 ATTGGT GTTCCC paired)_contig TGA TGTG _53031

GA1_S1_L001 CACAG TGCCTT _R1_001_(pair GGCAC CATCTT 55558 EST-SSR NO Hit ed)_trimmed_( 257 ga 22 226 270 GTCACA TTCTTG paired)_contig AATA TCC _55558 PREDICTED: mitotic spindle GA1_S1_L001 GAGAG checkpoint AAAAC _R1_001_(pair AGAGA protein TCCGTC 55597 EST-SSR 3.00E-29 83% ed)_trimmed_( GAGAG 122 tct 9 231 258 BUBR1 TCCGGA paired)_contig AGGCA [Nicotiana TCT _55597 GAA tomentosiform is]

92 protein GA1_S1_L001 TIC110, GCCTTA CCTCTC _R1_001_(pair chloroplastic GGGTA TCTAGC 754 EST-SSR 0 83% ed)_trimmed_( 150 ct 15 3233 3263 isoform X2 GGGGT AGCCTT paired)_contig [Sesamum GAAG GC _754 indicum] putative clathrin GA1_S1_L001 CTGAG CGTAAT assembly _R1_001_(pair GCATA GGCTTC 854 EST-SSR protein 0 78% ed)_trimmed_( 176 ctg 9 402 428 GGGGT ACCACC At5g35200 paired)_contig AGCTG TT [Sesamum _854 indicum] DEAD-box ATP- GA1_S1_L001 CCTAGA GAAAG dependent _R1_001_(pair GAAAG CACTTT 15761 EST-SSR RNA helicase 4.00E-24 62% ed)_trimmed_( 151 gt 13 525 550 GCGGA GGCAC 24-like protein paired)_contig GACC CATT [Phaseolus _15761 vulgaris] SART-1 GA1_S1_L001 ATTTTC TGACAC family protein _R1_001_(pair TCGCCC GAGTCT 18631 EST-SSR DOT2 8.00E-21 71% ed)_trimmed_( 150 tca 5 421 436 TTTCTA TTCATC [Sesamum paired)_contig ACG TATTGC indicum] _18631

trihelix GA1_S1_L001 CAAGA AACAA transcription _R1_001_(pair TTCAGC CTCAGC 12374 EST-SSR factor GT-2- 2.00E-109 68% ed)_trimmed_( 248 ga 9 1445 1473 ACTGG CACCCA like [Sesamum paired)_contig ACGA CTC indicum] _12374 protein STRUBBELI AGCAC TCACAG G- comp22077_c0 CTAAGC AGGTTG 22077 EST-SSR 6.00E-72 68% 197 gcc 10 627 657 RECEPTOR _seq1 len=926 CGAATC CAGGA FAMILY 3- ATC AGA like [Sesamum

93 indicum]

hypothetical GGGTTT GGCAT EST- protein comp8445_c0_ CAGCC GTTGTG TGG 8445-0- EST-SSR GLYMA_07G 0.72 49% 266 4 459 479 seq1 len=1061 AAATA GGATAT GG 1 052000 CAGTG TCA [Glycine max]

GCGAG TGATGT comp8710_c0_ AAATG TTGGCT 8710 EST-SSR NO Hit 228 GAA 9 255 282 seq1 len=480 GAATC GTGCTT ATTGG AAA

probable transcriptional GCTTCT TTGCTG comp13475_c0 regulator CGAATT GATTGT 13475 EST-SSR 4.00E-99 64% _seq1 149 gca 8 620 644 SLK2 CCCAAC TTGCTG len=1152 [Sesamum TG TT indicum] lysine histidine CCACCG CTCGTA transporter- comp16925_c1 TATTCA TGGAA 16925 EST-SSR like 8 0 83% _seq1 144 tc 9 1788 1804 ACCACC GCAGG [Sesamum len=1878 TC AAAA indicum]

uncharacterize AGTTTC GAGCG d protein comp4773_c0_ TGTGCC ATCATG 4773 EST-SSR LOC10516895 8.00E-162 61% 202 ttc 7 1125 1146 seq1 len=1851 CTTGGT GAAGC 6 [Sesamum TC TGAT indicum]

94 pentatricopepti de repeat- TGGCTT ACCCCT containing >comp7938_c GTTTTG GCACTT 7938 EST-SSR protein 4.00E-18 71% 0_seq1 169 tga 8 232 256 GATTAT CTACTC At1g77405 len=340 TGC CA [Sesamum indicum] PREDICTED: GA1_S1_L001 TGAGA omega- TGCAGC _R1_001_(pair ATGAA amidase, CTTCAC 5708 EST-SSR 0 87% ed)_trimmed_( GCTGA 152 ct 12 1057 1080 chloroplastic- AATAAT paired)_contig GGCAG like [Nicotiana CG _5708 T tabacum] Contig_from_ AAAGT TTCCTG DB775P1:206: GAGCC TCTCCA 12710 gSSR NA NA NA C13PNACXX: 114 ga 9 47 64 AAACA TCTCAA 1:1110:12416: AGAGC CC 12710 Contig_from_ CATCGA TGTTCC DB775P1:206: CCAATG TCCATC 21312 gSSR NA NA NA C13PNACXX: 101 ag 8 49 64 TAAGAT TTCTAC 1:1110:7962:2 GCCG CTTGG 1312 Contig_from_ TTACAA GAATCC DB775P1:206: GGACA CACATA 44114 gSSR NA NA NA C13PNACXX: AGATTC TCGTAT 100 ac 11 97 118 1:1111:15449: GAGTA ATCTAG 44114 GC C Contig_from_ AGGTG AATGA DB775P1:206: AACTCT CGTGTG 52266 gSSR NA NA NA C13PNACXX: TTCTAT 136 ag 11 100 121 CAAGCT 1:1111:20734: CTACAT CC 52266 GG Contig_from_ GACAG TCTGCG DB775P1:206: CAGAA TAAGAT 63122 gSSR NA NA NA C13PNACXX: GAAGC 107 gt 10 69 88 CAATTT 1:1111:19106: ACAAG CCC 63122 G

95 Contig_from_ GCCGA AATGG DB775P1:206: ATTAGG GAGTC 74121 gSSR NA NA NA C13PNACXX: 124 tg 15 68 97 TCAATG AGAAT 1:1111:5849:7 AGCC GAACC 4121 Contig_from_ AGCATT TTCATC DB775P1:206: CAGGG GCTACC 6547 gSSR NA NA NA C13PNACXX: 185 ct 8 97 112 TCCTAA AGCAA 1:1112:17538: CACC GGG 6547 Contig_from_ CGATCC AAAGC DB775P1:206: GAGAG TCAAAC 77309 gSSR NA NA NA C13PNACXX: AATAC 118 ag 15 119 148 ATTGTG 1:1112:16591: AGGTG TCG 77309 G Contig_from_ CACATT TGTGAT DB775P1:206: TGAAAT GATATG 4460 gSSR NA NA NA C13PNACXX: TCAGG 136 ag 8 81 96 GCCTCC 1:1114:2744:4 AGACA GC 460 AGC Contig_from_ GCCTGA CTCTCC DB775P1:206: TGTGTA ACTGCG 5012 gSSR NA NA NA C13PNACXX: 114 ga 11 93 114 ATTCTC TGAGTT 1:1114:9161:5 TTCTGC CC 012 Contig_from_ TGACAT ATGAA DB775P1:206: TTCATC ATGCAC 54462 gSSR NA NA NA C13PNACXX: 100 ct 10 75 94 AGCTGT AAGTC 1:1114:6255:5 GC AGGC 4462 Contig_from_ GTGTGT TGAAG DB775P1:206: ACGTTA CTGGAC 26851 gSSR NA NA NA C13PNACXX: TGATAG 112 ac 8 69 84 TCAGA 1:1116:14823: TAGGG GTTGC 26851 C

96 Contig_from_ CATCAC AGGCCT DB775P1:206: CTCTGT ACATA 49751 gSSR NA NA NA C13PNACXX: TATCTC 111 ga 13 105 130 ATGTAC 1:1116:8058:4 TAAAG TCGC 9751 G Contig_from_ CAGTTT AGCCC DB775P1:206: CCTCAG GAACA 68222 gSSR NA NA NA C13PNACXX: 100 ct 9 70 87 AATCTC CACACT 1:1116:16835: TCTCCC CACG 68222 Contig_from_ GGAAC CCAAG DB775P1:206: AAACA ACTGGT 84454 gSSR NA NA NA C13PNACXX: AGATA 192 gt 8 81 96 CGCCTA 1:1116:7083:8 CCAGA TCG 4454 ATTGC Contig_from_ TCACCT TTTGAA DB775P1:206: AACCC ATGGTG 3979 gSSR NA NA NA C13PNACXX: 167 ct 14 74 101 ACATGC GTTGCC 1:1201:13208: ACG GG 3979 Contig_from_ GTGAC TCGCCA DB775P1:206: ATATTT TCTCTC 10008 gSSR NA NA NA C13PNACXX: ATAGG 120 ta 8 75 90 TACAA 1:1201:6930:1 CACATA GTGC 0008 CGC Contig_from_ CGCCCA TGCTAG DB775P1:206: CTACCA GTACA 21921 gSSR NA NA NA C13PNACXX: AATAA 100 tc 18 44 79 GAGGG 1:1201:13909: GAATTG ACGC 21921 C Contig_from_ TCAGTT AGTAG DB775P1:206: CGTTGA CTGCAG 32290 gSSR NA NA NA C13PNACXX: 123 tc 8 42 57 CAATA GGACA 1:1201:1400:3 GAGC TTGC 2290

97 Contig_from_ TCCAAA AGATTC DB775P1:206: TATCTA ACTTTG 91527 gSSR NA NA NA C13PNACXX: TTTGTG 116 ag 8 65 80 TACACT 1:1201:5868:9 CTTGGC CTCC 1527 C Contig_from_ ATCTGC AGCCA DB775P1:206: AGGTCT ATTCCA 34286 gSSR NA NA NA C13PNACXX: 136 ga 8 91 106 AAGAG GTTTAG 1:1202:6153:3 TGG CAGC 4286 Contig_from_ TCACAT TGTGGC DB775P1:206: GTTATA CTTAAC 30154 gSSR NA NA NA C13PNACXX: 167 tc 10 40 59 CGGAA TTGCAT 1:1203:14310: GATGC TTGG 30154 Contig_from_ AGAAG CGCTCC DB775P1:206: TTGTTG AGTTTC 44284 gSSR NA NA NA C13PNACXX: 103 tc 10 65 84 GGCAT TATGTT 1:1203:17696: GTTGC CAACC 44284 Contig_from_ CGCTAT TGCAGC DB775P1:206: ATCAA AACTTG 50937 gSSR NA NA NA C13PNACXX: 118 ct 17 74 107 GCTCTG AAACTC 1:1203:10399: CGC TCC 50937 Contig_from_ TGGGTT CCTCCT DB775P1:206: GGAATT CATGAT 52464 gSSR NA NA NA C13PNACXX: 138 ta 15 112 141 AAATG TCGGCT 1:1203:8708:5 GGCC CC 2464 Contig_from_ GTCACT CTTGTA DB775P1:206: AGCCA ATAAA 69982 gSSR NA NA NA C13PNACXX: 138 ga 9 74 91 ACATCA CACGC 1:1203:16365: CGC GCGC 69982

98 Contig_from_ ACTGA AGCTCA DB775P1:206: AACAG GTGTAT 21191 gSSR NA NA NA C13PNACXX: TCCTAT 100 ag 10 63 82 TTATAG 1:1204:6867:2 ACTCAC ACGCG 1191 G Contig_from_ GAAAG CTACGT DB775P1:206: TGGGA CTGGA 19536 gSSR NA NA NA C13PNACXX: 162 tg 10 121 140 AGGCA GTCGA 1:1205:14301: AGTGC ACCC 19536 Contig_from_ ACTTTC ACTCGA DB775P1:206: AGAAC GGTTTG 30565 gSSR NA NA NA C13PNACXX: CCTATC 120 tc 12 43 66 TCTTGA 1:1205:14051: CATTTG GG 30565 C Contig_from_ TGGCAC CTCTGC DB775P1:206: ACACA TACCAT 3964 gSSR NA NA NA C13PNACXX: 100 ag 9 39 56 ACACA GCAATT 1:1206:5054:3 CGC TCTTCC 964 Contig_from_ TCTACG ACTACT DB775P1:206: ATTTGG CATTGG 42319 gSSR NA NA NA C13PNACXX: 167 gt 9 131 148 GTGCA CACATG 1:1206:7802:4 ATTGC CG 2319 Contig_from_ TCGGCA TGATGC DB775P1:206: CCGACT ACCAA 45846 gSSR NA NA NA C13PNACXX: 143 tg 12 49 72 TTGTTA GATTTG 1:1206:16613: CC TTCCC 45846 Contig_from_ TGGGTT CACTGA DB775P1:206: CAATA AGACG 59293 gSSR NA NA NA C13PNACXX: 165 ag 12 69 92 AATCCG AAATTT 1:1206:12172: TGTCG CTGCG 59293

99 Contig_from_ CATCCT GTTGAC DB775P1:206: TCCTTT ATCTGC 75887 gSSR NA NA NA C13PNACXX: 144 ca 8 110 125 GCAGA ATGCA 1:1208:20295: CGC GGG 75887 Contig_from_ TGACA ACTACT DB775P1:206: GAGAT TTCGAG 8823 gSSR NA NA NA C13PNACXX: CAACTC 100 ag 13 100 125 TGACA 1:1211:12449: CTGAAC GCCC 8823 G Contig_from_ ATTCAC TTCTTC DB775P1:206: CCATGA TTCGTC 17467 gSSR NA NA NA C13PNACXX: 119 ca 8 74 89 CCCAAC ACACC 1:1212:5522:1 GG GCC 7467 Contig_from_ ATTGGC AACAT DB775P1:206: CTCTCT ACCCAT 65402 gSSR NA NA NA C13PNACXX: 155 gt 13 81 106 GAACG GCACA 1:1212:4048:6 TGC CACG 5402 Contig_from_ ACTTGT GATTCC DB775P1:206: CAGAC ACACG 71768 gSSR NA NA NA C13PNACXX: 100 tc 8 88 103 ACGACT CTAATG 1:1212:10558: CGG ACTTGG 71768 Contig_from_ TCAAGT CTTGCT DB775P1:206: GCAAG CGAATT 44817 gSSR NA NA NA C13PNACXX: 134 tg 10 112 131 GATCTC CCAGTT 1:1213:20629: GGG CCG 44817 Contig_from_ AGTTAC TCCCAA DB775P1:206: AAGGC AGTCA 63878 gSSR NA NA NA C13PNACXX: 122 tc 14 97 124 AAAGC AAGTG 1:1213:11566: TCAGG CTTGC 63878

100 Contig_from_ ACTCTT ACTATA DB775P1:206: GAAAT TAAGG 88395 gSSR NA NA NA C13PNACXX: AAAGA 124 tg 11 93 114 ACGTGT 1:1213:16930: GACGA TCTCCC 88395 GACG Contig_from_ TGTTGA AGGGA DB775P1:206: GGGAC CTATAT 98620 gSSR NA NA NA C13PNACXX: 130 tc 10 69 88 CAGTA GTTTCT 1:1213:11156: ATGCG CAAGC 98620 Contig_from_ ATGGG AAGAG DB775P1:206: AATCA GACAA 14132 gSSR NA NA NA C13PNACXX: 106 tc 10 33 52 AACAA TGCAG 1:1214:1966:1 GGGC GAGCC 4132 Contig_from_ ATTCTC TGCACT DB775P1:206: CCATAA TTAGTC 73591 gSSR NA NA NA C13PNACXX: 148 ac 10 118 137 TGGAC CTTCAG 1:1215:15308: GCC GG 73591 Contig_from_ TACTTT TCTTAT DB775P1:206: GGGTC ACAGA 2919 gSSR NA NA NA C13PNACXX: 139 ac 9 97 114 GAGAT CCGATG 1:1301:8614:2 CGCC TTGCC 919 Contig_from_ TCAGAT TGCATC DB775P1:206: AGCCTA TGTAAA 70658 gSSR NA NA NA C13PNACXX: AGTCTG 178 tg 9 57 74 CCTCAA 1:1302:16965: TGAAC CC 70658 G Contig_from_ ACCCTT ACACAT DB775P1:206: GAATG TTAACA 5266 gSSR NA NA NA C13PNACXX: ATTATA 135 ct 13 108 133 CAACC 1:1106:11082: CTGCTC GACG 5266 C

101 Contig_from_ TGAGA TGCCCA DB775P1:206: AACGG AACAA 59189 gSSR NA NA NA C13PNACXX: 145 ag 12 77 100 GCTGTA ATCTTG 1:1106:8407:5 GAGG CTGG 9189 Contig_from_ TCTTTC TGGTAA DB775P1:206: CCATCA TACGTT 58830 gSSR NA NA NA C13PNACXX: AACTA 100 ta 9 87 104 TAGTCA 1:1104:9833:5 GTCCAT CACC 8830 CC Contig_from_ AACACT TCACAA DB775P1:206: TTAGGT ATGCTC 89687 gSSR NA NA NA C13PNACXX: 103 tg 8 65 80 CTTTCG GATATA 1:1105:13147: AGG CGCC 89687 Contig_from_ GATCCC GTGATA DB775P1:206: ACTCTT GGTTGA 15785 gSSR NA NA NA C13PNACXX: 160 tc 15 77 106 GCTGCA TACATG 1:1106:15859: GG GCGC 15785 Contig_from_ TGATCG AGCCC DB775P1:206: ATCGTT ACCATA 87070 gSSR NA NA NA C13PNACXX: 100 ag 14 78 105 TCTCCA ACTAAC 1:1106:9073:8 TCCC TCTGG 7070 Contig_from_ ACATCT AACTCA DB775P1:206: CATCTC ACCATG 92295 gSSR NA NA NA C13PNACXX: 113 tc 12 66 89 ACTCAG CCCTAC 1:1103:14288: GG CG 92295 Contig_from_ GCATGC AGGGA DB775P1:206: CACCA AGCTG 51037 gSSR NA NA NA C13PNACXX: ATCAA AAACA 191 ac 10 129 148 1:1116:11269: ATTCTA AGAAG 51073 GG C

Table S4.2 Statistics of the distorted markers on the map

              Number of markers      Number of distorted markers
LGs           (distinct positions)   (distinct positions)          Percentage (%)
1             148(112)               24(19)                        16.22
2             104(81)                40(27)                        38.46
3             84(65)                 8(6)                          9.52
4             73(59)                 12(11)                        16.44
5             84(65)                 23(17)                        27.38
6             61(50)                 3(2)                          4.92
7             73(55)                 6(4)                          8.22
8             92(71)                 21(12)                        22.83
9             140(107)               12(8)                         8.57
10            71(53)                 6(4)                          8.45
11            76(61)                 11(9)                         14.47
12            72(53)                 7(5)                          9.72
13            126(89)                61(33)                        48.41
14            64(50)                 8(6)                          12.50
15            57(39)                 6(2)                          10.53
16            66(53)                 13(9)                         19.70
17            57(49)                 2(2)                          3.51
18            83(63)                 19(13)                        22.89
19            44(37)                 14(10)                        31.82
20            44(31)                 3(3)                          6.82
21            69(47)                 51(32)                        73.91
22            43(30)                 5(1)                          11.63
23            102(68)                34(22)                        33.33
Genome wide   1833(1388)             389(257)                      21.22
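
The percentage column of Table S4.2 is the number of distorted markers divided by the number of markers on each linkage group. The sketch below reproduces it for a few rows and, as an assumption about how a marker might be flagged as distorted, adds a chi-square goodness-of-fit statistic against a 1:1 expected ratio; the table itself does not state the test or the expected ratios used, and the genotype counts in the usage lines are invented for illustration.

```python
# (markers, distorted markers) for selected rows of Table S4.2.
table_s4_2 = {
    "1": (148, 24),
    "13": (126, 61),
    "21": (69, 51),
    "Genome wide": (1833, 389),
}


def percent_distorted(markers, distorted):
    """Percentage of markers on a linkage group that are distorted."""
    return round(100 * distorted / markers, 2)


def chi_square_1to1(n_a, n_b):
    """Goodness-of-fit statistic against an expected 1:1 ratio (1 d.f.)."""
    expected = (n_a + n_b) / 2
    return ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected


CRITICAL_P05_DF1 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05

for lg, (markers, distorted) in table_s4_2.items():
    print(lg, percent_distorted(markers, distorted))

# Hypothetical genotype counts for one marker (not data from the study).
statistic = chi_square_1to1(60, 35)
print(statistic > CRITICAL_P05_DF1)
```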

Chapter 5 Genome-Wide Association Study in Green Ash

Abstract

Green ash (Fraxinus pennsylvanica) is under severe threat from an invasive insect, the emerald ash borer. To identify genetic loci controlling traits of interest, we performed the first comprehensive genome-wide association study of insect resistance, height, diameter at breast height, budburst and foliage coloration, using restriction site associated DNA sequencing (RADseq) data generated from 85 green ash accessions. Of the five traits investigated, three exhibited significant association signals. For two of these traits we detected candidate genes, one potentially involved in chlorophyll accumulation and the others involved in bud release and general development. Further, genome-wide LD in the 85 accessions decayed within 440 bp on average, the distance at which r² dropped to half of its highest value.

Introduction

Forests are an integral component of the landscape, providing essential ecosystem services as well as valuable raw materials for composites, biofuel and high-quality timber. Forest trees provide food and shelter for wildlife and play important roles in the environment, including carbon sequestration. However, many forest tree species in North America are in decline due to environmental stresses and invasive species. As next-generation sequencing (NGS) methods have become easier to use, a growing number of forest tree species have been studied. Ash (Fraxinus spp.), among the most widely distributed hardwoods in North America, is under severe threat from climate change, environmental stresses, and invasive species. The rapid invasion of North America by the emerald ash borer (EAB), an insect from Asia, has killed millions of ash trees and threatens all native species of Fraxinus (Herms et al., 2014). Nonetheless, knowledge of the genetic and molecular mechanisms at work in the threatened species remains limited.

Green ash (Fraxinus pennsylvanica) is an important tree species in North America, both as a component of forest and riparian ecosystems and as a widely planted urban street tree. Some studies suggest that, among the major North American ash species, green ash is the one preferred by EAB (Anulewicz et al., 2007). Fortunately, intraspecific variation in green ash responses to EAB has been identified in natural forests (Koch et al., 2015) and in our provenance trial. Mature green ash trees that survived EAB infestations which killed the remainder of their cohort have been identified as putatively EAB-resistant genotypes, and they provide valuable material for understanding the potential for development of resistance in North American ash species. Herein, we used EAB-susceptible and putatively EAB-resistant collections to identify markers and genes associated with the target traits.

Quantitative trait locus (QTL) mapping is widely used to link phenotypic and genotypic data and so explain the genetic basis of variation in complex traits. One major limitation of QTL analysis is that only the allelic diversity segregating between the two initial parents can be assayed (Korte and Farlow, 2013). In other words, a biparental cross cannot capture all of the alleles that contribute to variation at a locus in natural populations. In addition, the effects of detected QTLs can differ among mapping populations and between mapping populations and natural populations. Furthermore, QTL regions tend to be quite large, containing too many genes to be investigated individually as candidate genes (Xu et al., 2017). On the other hand, the reduced cost of sequencing has allowed the detection of tens of thousands of single nucleotide polymorphisms (SNPs) across multiple samples. A genome-wide association study (GWAS) can overcome these limitations of QTL analysis by taking advantage of genome-wide SNPs. GWAS has emerged as a powerful method and has been widely implemented in tree species, including peach (Cao et al., 2016), eucalyptus (Tassinari et al., 2017), poplar (McKown et al., 2014), and citrus (Minamikawa et al., 2017). GWAS in tree species offers advantages over pedigree-based genetic tests because large random-mating populations are available.
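
To make the idea of testing genome-wide SNPs one marker at a time concrete, the following is a minimal sketch and not the analysis pipeline used in this chapter. It assumes genotypes coded as 0/1/2 allele counts, uses a simple score test (n·r², approximately chi-square with 1 d.f. under the null) and a Bonferroni threshold; all function names and the simulated data are hypothetical. Real analyses additionally correct for population structure and kinship, which this sketch omits.

```python
import math
import random


def trend_test_p(genotypes, phenotypes):
    """Single-marker score test: n * r^2 is approximately chi-square (1 d.f.)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_pp = sum((p - mean_p) ** 2 for p in phenotypes)
    if s_gg == 0 or s_pp == 0:  # monomorphic marker or constant trait
        return 1.0
    s_gp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    chi2 = n * s_gp ** 2 / (s_gg * s_pp)
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 d.f.


def gwas_scan(genotype_matrix, phenotypes, alpha=0.05):
    """Test every SNP and keep those passing a Bonferroni threshold."""
    threshold = alpha / len(genotype_matrix)
    hits = []
    for snp_index, genotypes in enumerate(genotype_matrix):
        p_value = trend_test_p(genotypes, phenotypes)
        if p_value < threshold:
            hits.append((snp_index, p_value))
    return hits


# Simulated toy data (not from the study): 95 trees, 200 SNPs, SNP 0 is causal.
random.seed(1)
n_trees, n_snps = 95, 200
geno = [[random.choice([0, 1, 2]) for _ in range(n_trees)] for _ in range(n_snps)]
pheno = [2.0 * g + random.gauss(0, 1) for g in geno[0]]

hits = gwas_scan(geno, pheno)
print(hits)
```

Because every SNP in the panel is tested against the same population sample, the scan is not restricted to the alleles carried by two parents, which is the contrast with QTL mapping drawn above.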

Materials and methods

Plant Materials

The green ash collection comprised 95 accessions selected from a provenance trial of approximately 2,000 green ash trees from 60 provenances, collected from across the entire native range of green ash in North America and planted at Penn State 40 years ago (Steiner et al., 1988). Invasion of the trial by EAB in 2010 has resulted in approximately 95% mortality to date. “Lingering” (surviving) ash trees have been identified in specific half-sib families of 8 of the provenances, suggesting variation in susceptibility to EAB both among and within provenances. For this GWAS, 12 lingering trees from 8 different provenances were selected to represent putatively resistant genotypes and 83 dead trees from 62 provenances were selected to represent susceptible genotypes.

Phenotypic data preparation and sampling

Canopy condition, diameter at breast height (DBH), height, budburst date and foliage coloration were included in the study as traits of interest for detecting significant associations with genomic SNP markers. Canopy condition was scored in six consecutive years beginning in 2010 on a scale from 1 (full, healthy canopy) to 5 (dead canopy). DBH was measured in 1990, 2009 and 2012, while height was measured in six consecutive growing seasons from 1978 to 1983 and then in 1985, 1988 and 1990. Foliage coloration was recorded as the plot mean number of days past Sept. 26, 1979, to peak coloration. Budburst was recorded as the plot average number of days after Apr. 17, 1981, when more than 50% of terminal buds had developing leaves at least 0.5 cm long.

Double-digested restriction site associated DNA sequencing (ddRADseq) library preparation and sequencing

Genomic DNA was extracted from frozen leaves of the selected 95 green ash trees using a CTAB protocol (Clarke, 2009). DNA samples were quantified using a Qubit® 2.0 fluorimeter (Invitrogen, Carlsbad, CA, USA) and then checked for high molecular weight on a 0.8% agarose gel. A combination of a rare-cutting (PstI) and a common-cutting (NlaIII) restriction enzyme was used to produce fragments for sequencing, as in previous RADseq studies (Peterson et al., 2012). Following adapter ligation and cleanup, a gDNA library was constructed for each of the 95 samples and for a blank sample as a negative control, using 96 specific barcodes, and the libraries were then pooled into a 96-plex library. The pooled library was run on one lane of the Illumina HiSeq 4000 platform. In total, we obtained ~966 million paired-end reads of 150 bp each across the 95 individuals.

Sequence data processing

We first checked raw sequence reads for quality and for the presence of an intact barcode and RAD site (i.e. restriction enzyme site) in each read. Only reads that passed this first filtering step were de-multiplexed into individual library FASTA files using Stacks (Catchen et al., 2013). Retained reads were then mapped against a green ash draft genome (unpublished data) using the Burrows-Wheeler Aligner (Li and Durbin, 2009a). Aligned reads from each individual were grouped into RAD loci with a minimum coverage depth (-m) of 5, and polymorphic loci were then identified. SNP genotypes were called at each locus in each individual using the maximum-likelihood statistical model implemented in Stacks. To include a locus in downstream analyses, we required it to be present in at least 80% of the samples (i.e. missing data < 20%) and to have a minor allele frequency above 0.05. We excluded 10 individuals with ≥30% missing data.
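The two locus filters described above (presence in at least 80% of samples and minor allele frequency above 0.05) can be illustrated with a minimal Python sketch. The genotype encoding (0, 1 or 2 copies of the alternate allele, None for missing), the function names and the example loci are assumptions made for illustration only; in the study itself the filtering was performed within Stacks.

```python
def call_rate(genotypes):
    """Fraction of samples with a called (non-missing) genotype at one locus."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from alternate-allele counts (0, 1 or 2)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_locus(genotypes, min_call_rate=0.80, min_maf=0.05):
    """Apply both filters: call rate >= 80% and MAF > 0.05."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)


# Hypothetical loci (None marks a missing genotype).
common_locus = [0, 1, 2, 1, 0, None, 1, 0, 0, 1]        # 90% called, MAF = 1/3
sparse_locus = [0, 0, None, None, None, 0, 1, 0, 0, 0]  # only 70% called
rare_locus = [0] * 19 + [1]                             # MAF = 0.025

for name, locus in [("common", common_locus), ("sparse", sparse_locus), ("rare", rare_locus)]:
    print(name, keep_locus(locus))
```

Only the first hypothetical locus passes both filters; the second fails on missing data and the third on allele frequency.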

Linkage disequilibrium (LD) analyses

Linkage disequilibrium, measured as the squared pairwise correlation coefficient (r2), was estimated for each pair of SNPs using PLINK (Purcell et al., 2007). To compute genome-wide and chromosome-wide LD decay rates, all pairwise r2 values within each chromosome were included. The decay curve of r2 was fitted using a nonlinear regression of pairwise LD on distance (Hill and Weir, 1988; Marroni et al., 2011), and the LD decay distance was defined as the pairwise distance at which r2 dropped to half of its highest value.
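As a sketch of how such a decay curve can be fitted, the Python example below uses the Hill and Weir (1988) expectation of r2 as a function of C = rho × distance, in the form commonly applied in LD decay studies. Several details are assumptions rather than a record of the analysis: the log-spaced grid search stands in for a proper nonlinear regression routine, the sample size of 170 chromosomes assumes 85 diploid individuals, and the "observed" r2 values are synthetic values generated from a hypothetical rho.

```python
def expected_r2(distance_bp, rho, n):
    """Hill and Weir (1988) expectation of r^2 at a given distance.

    rho: population recombination parameter per bp; n: sampled chromosomes.
    """
    c = rho * distance_bp
    return ((10 + c) / ((2 + c) * (11 + c))) * (
        1 + ((3 + c) * (12 + 12 * c + c * c)) / (n * (2 + c) * (11 + c))
    )


def fit_rho(distances, r2_values, n, lo=1e-6, hi=1.0, steps=200):
    """Least-squares estimate of rho on a log-spaced grid (illustrative stand-in
    for nonlinear regression)."""
    best_rho, best_sse = None, float("inf")
    for i in range(steps + 1):
        rho = lo * (hi / lo) ** (i / steps)
        sse = sum((r2 - expected_r2(d, rho, n)) ** 2
                  for d, r2 in zip(distances, r2_values))
        if sse < best_sse:
            best_rho, best_sse = rho, sse
    return best_rho


def half_decay_distance(rho, n, max_bp=1_000_000):
    """Distance at which the fitted r^2 falls to half its value at distance 0,
    found by bisection; assumes the curve crosses that level before max_bp."""
    target = expected_r2(0, rho, n) / 2
    low, high = 0.0, float(max_bp)
    for _ in range(100):
        mid = (low + high) / 2
        if expected_r2(mid, rho, n) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Synthetic example: all values below are hypothetical.
n_chromosomes = 170
true_rho = 0.005
distances = [50, 100, 200, 400, 800, 1600, 3200, 6400]
r2_values = [expected_r2(d, true_rho, n_chromosomes) for d in distances]

rho_hat = fit_rho(distances, r2_values, n_chromosomes)
half_distance = half_decay_distance(rho_hat, n_chromosomes)
print(rho_hat, half_distance)
```

With real data, the inputs would be the pairwise distances and r2 values reported by PLINK, and a dedicated optimizer would normally replace the grid search.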

Population structure

Bayesian admixture analysis in STRUCTURE v.2.3.4 was used to determine the optimal number of population clusters among the 95 individuals. Five independent runs for each K value from 1 to 10, with a burn-in period of 50,000 followed by 100,000 Markov chain Monte Carlo (MCMC) iterations, were conducted to select the best K value (Porras-Hurtado et al., 2013; Pritchard et al., 2000). The best K value was selected from the log probability of the data [ln Pr(X|K)] and plots of ∆K (Evanno et al., 2005). After the best K value was determined, an additional 15 replicate runs were performed at that K. CLUMPP v.1.1.2 (Jakobsson and Rosenberg, 2007) was used to collate the replicated STRUCTURE runs into a single matrix of mean membership coefficients. STRUCTURE HARVESTER (Earl and vonHoldt, 2011) was used to convert the STRUCTURE output files to the input format required by CLUMPP. Using the best-matching membership coefficients from CLUMPP, DISTRUCT v.1.1 (Rosenberg, 2004) was then used to display the results graphically.
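The ∆K statistic of Evanno et al. (2005) can be sketched in a few lines of Python. This is a simplified illustration: it takes the absolute second-order difference of the mean ln Pr(X|K) across replicate runs and divides by the sample standard deviation at that K, which is an assumption about the exact variant used. The log-probability values are invented for the example and are not results from this study.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} for K values that have both neighbours.

    lnp_by_k maps K to a list of ln Pr(X|K) values from replicate runs.
    """
    avg = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in avg or k + 1 not in avg:
            continue  # delta K is undefined at the ends of the K range
        sd = stdev(lnp_by_k[k])
        if sd > 0:
            second_difference = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            result[k] = second_difference / sd
    return result


# Hypothetical ln Pr(X|K) values from three replicate runs per K.
example_runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4450.0, -4460.0, -4440.0],
    4: [-4420.0, -4450.0, -4400.0],
}

dk = delta_k(example_runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In this toy input the likelihood rises sharply from K=1 to K=2 and then flattens, so ∆K peaks at K=2.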

Genome-wide association study (GWAS)

To control for false positive associations arising from population structure, covariates from STRUCTURE were included, in addition to the genetic marker, as fixed effects in a mixed linear model (MLM). Relationships between individuals were accounted for through a kinship matrix. The SUPER (settlement of MLM under progressively exclusive relationship) algorithm was used to compute associations through the R package GAPIT (Lipka et al., 2012; Tang et al., 2016). Compared to the standard MLM method, which generally uses all SNPs or a randomly selected subset to compute the kinship matrix, the SUPER GWAS method has a marker selection algorithm and an exclusion algorithm that increase computational efficiency and statistical power (Wang et al., 2014). Genome-wide SNPs are divided into small bins and each bin is represented by its most significant marker. A maximum likelihood method is then used to optimize the size and number of bins and to finalize the set of SNPs used as pseudo Quantitative Trait Nucleotides (QTNs). In the final association test of each marker, only those pseudo QTNs that are not in LD with the tested marker are used to estimate kinship among individuals, which reduces the confounding between kinship and the tested SNP.
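The first step of the marker selection described above, representing each bin by its most significant marker, can be sketched as follows. This is not GAPIT code: the function name, the fixed bin size and the marker tuples are hypothetical, and the later steps (likelihood-based optimization of bin size and number, and LD-based exclusion of pseudo QTNs) are omitted.

```python
def bin_representatives(markers, bin_size_bp):
    """Keep the marker with the smallest p-value in each genomic bin.

    markers: iterable of (chromosome, position_bp, p_value) tuples.
    """
    best = {}
    for chrom, pos, p_value in markers:
        key = (chrom, pos // bin_size_bp)
        if key not in best or p_value < best[key][2]:
            best[key] = (chrom, pos, p_value)
    return [best[key] for key in sorted(best)]


# Hypothetical markers from an initial genome scan.
scan = [
    (1, 100, 0.50),
    (1, 900, 0.01),
    (1, 1500, 0.20),
    (2, 50, 0.03),
    (2, 700, 0.04),
]

representatives = bin_representatives(scan, bin_size_bp=1000)
print(representatives)
```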

Results

SNPs identification and LD estimation

PstI and NlaIII were used to digest the genome and construct the 96-plex RADseq libraries of the selected accessions. Using Stacks, we identified 28,592 high-quality SNPs with a MAF of at least 5%. Among the 28,592 SNPs, 28,005 mapped to 23 pseudomolecules (Table 5.1). The frequency of SNP types across the genome is summarized in Table 5.2. In general, transitions (60.22%) were more frequent than transversions (39.78%). The most common variation was C/T (30.22%), while the least common was C/G, accounting for 7.35% of the total detected SNPs. We observed a transition:transversion (Ts/Tv) ratio of 1.51, which was similar to the observations for other plant species (Gaur et al., 2015; Pootakham et al., 2015). A set of bi-allelic SNPs was scored as lm x ll, nn x np and hk x hk. With a stringent cutoff of 10% missing data, 2729 high-quality polymorphic SNPs were retained for further analysis. Of these, 727 and 1548 SNPs were polymorphic in the maternal and paternal parents, respectively, while 454 SNPs were polymorphic in both parents. Linkage group 19 had the fewest markers (752 SNPs), with a marker density of one per 34.4 kb, while LG1 had the most markers (2071 SNPs), with a marker density of one per 24.6 kb.

Pairwise estimates of r2 were computed for all SNP pairs up to 100 kb apart on each chromosome, and r2 was used to estimate the extent of LD. The genome-wide LD decay distance for the 85 accessions, at which r2 dropped to half of its highest value, was 440 bp on average. Because the LD decay curve approached an asymptote thereafter, the genome-wide pattern of LD decay is shown only up to a SNP distance of 20 kb (Fig. 5.1). There was an average of 1217 SNPs per chromosome, and the chromosome-wide LD decay distance ranged from 174 to 670 bp (Fig. 5.2). Chromosome 16 showed the shortest LD decay distance (174 bp) and Chromosome 3 the longest (670 bp).

Population structure and genetic diversity

The genetic structure of the selected accessions was estimated using the Bayesian program STRUCTURE. According to STRUCTURE clustering, the optimal number of clusters (K) was 2, which captured the majority of the structure present in the data (Figure 5.3). The STRUCTURE plot (Figure 5.4) suggests that there are two clusters, with some individuals being admixtures of genotypes from the two clusters. Individuals with a Q value (the proportion of an individual’s ancestry from a cluster) above 80% for either cluster were placed in that cluster, while the remaining individuals were classified as admixed (Figure 5.4). Among the 83 accessions from our provenance trial, 46 accessions originating from 27 provenances were assigned to one cluster (orange genotype) and 22 accessions from 18 provenances were placed into a second cluster (blue genotype). The remaining 15 accessions from 11 provenances were assigned to the admixed cluster. In addition, Parent1 and Parent2 of the mapping population were classified into the North group and the admixed group, respectively. Similarly, our previous analysis with microsatellite markers indicated the presence of two groups (renamed here as the South cluster and the North cluster) and one admixed group.

To visualize underlying differences in genetic ancestry, principal component analysis (PCA) was performed on the genotypic data. The first two principal components (PC1 and PC2) represent the main axes of variation in these data and explained 10.84% and 1.63% of the variation, respectively. We created scatter plots of these components to visualize the data (Fig. 5.5). The locations of the 56 green ash provenances are plotted in Figure 5.6. The admixed provenances were, as expected, distributed primarily at the interface between the North and South clusters.
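The 80% Q-value rule used to place individuals into the two clusters or the admixed group can be written as a short Python sketch. The function name, the cluster labels and the example Q values are hypothetical; with K=2 the second ancestry proportion is taken as 1 minus the first.

```python
def assign_group(q1, threshold=0.80):
    """Classify an individual from its ancestry proportion in cluster 1 (K=2)."""
    q2 = 1.0 - q1
    if q1 > threshold:
        return "cluster 1"
    if q2 > threshold:
        return "cluster 2"
    return "admixed"


# Hypothetical Q values for three individuals.
example_q = {"tree_a": 0.95, "tree_b": 0.10, "tree_c": 0.55}
groups = {name: assign_group(q) for name, q in example_q.items()}
print(groups)
```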
Generally, the three clusters showed similar levels of genetic variation, with the admixed cluster showing the highest differentiation (Fst=0.0123) (Table 5.3). The South cluster showed a higher inbreeding coefficient (Fis=0.0799) than the North and admixed clusters. In addition, pairwise Fst values showed the greatest differentiation between the South and North clusters (Fst=0.111) and the smallest between the admixed and North clusters (Fst=0.040) (Table 5.4).
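The inbreeding coefficients in Table 5.3 are consistent with the standard definition Fis = 1 - Ho/He, where Ho and He are the observed and expected heterozygosity. The short Python check below recomputes Fis from the Ho and He values in Table 5.3; the text does not state which software produced the table, so treating this formula as the source of the reported values is an assumption.

```python
def fis(ho, he):
    """Inbreeding coefficient from observed and expected heterozygosity."""
    return 1.0 - ho / he


# (Ho, He) per cluster, taken from Table 5.3.
heterozygosity = {
    "North": (0.2022, 0.2186),
    "Admixed": (0.2436, 0.2485),
    "South": (0.2141, 0.2327),
}

fis_by_cluster = {name: fis(ho, he) for name, (ho, he) in heterozygosity.items()}
for name, value in fis_by_cluster.items():
    print(f"{name}: Fis = {value:.4f}")
```

Rounded to four decimal places, the computed values match the Fis column of Table 5.3 (0.0750, 0.0197 and 0.0799).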

Association mapping

GWAS was carried out for 5 traits in the selected accessions, taking both population structure (Fig. 5.4) and relative kinship (Fig. 5.7) into account. SNPs were considered significant if the false discovery rate (FDR) was smaller than 0.05. We performed marker-trait association analysis using the SUPER GWAS model and detected a total of 15 significant associations for the selected traits (Figures 5.8 and 5.9, Table 5.5). For budburst, we detected 9 significant SNPs (77390_92, 32755_118, 32756_125, 22279_18, 62631_57, 29105_13, 73875_59, 37899_120 and 21997_49), located at 25.64Mb, 34.60Mb, 34.60Mb, 22.86Mb, 4.46Mb, 26.68Mb, 15.50Mb, 17.45Mb and 19.18Mb on chromosomes 7, 2, 2, 23, 10, 17, 18, 8 and 23, respectively. For EAB resistance, we detected 3 GWAS peak SNPs (11179_87, 25645_98 and 60870_103), located at 20.18Mb, 28.45Mb and 18.86Mb on chromosomes 12, 3 and 10, respectively. For leaf coloration, we detected two significant loci, 4493_23 and 21238_22, located at 16.68Mb and 10.62Mb on chromosomes 21 and 23, respectively. For height, 52671_77 was detected as the significant SNP, located at 27.74Mb on chromosome 11. Together, these loci could explain 42.10%, 4.16%, 6.37% and 0.052% of genetic variance for budburst, EAB resistance, leaf coloration in fall, and height, respectively.
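Two multiple-testing criteria appear in this chapter: the FDR cutoff of 0.05 used to declare significant SNPs, and the Bonferroni line drawn in the Manhattan plots (Figure 5.9). The Python sketch below shows how each can be computed. It assumes that the FDR values are Benjamini-Hochberg adjusted p-values, which the text does not state explicitly, and the small p-value list is hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Bonferroni-corrected significance threshold on the -log10(p) scale."""
    return -math.log10(alpha / n_tests)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Threshold for alpha = 0.05 and the 28,592 SNPs reported in this chapter.
threshold = bonferroni_threshold(0.05, 28592)
print(round(threshold, 2))

# Hypothetical p-values to show the FDR adjustment.
example_p = [0.01, 0.04, 0.03, 0.005]
example_fdr = benjamini_hochberg(example_p)
print(example_fdr)
```

With alpha = 0.05 and 28,592 tests, the threshold rounds to 5.76, the value given in the caption of Figure 5.9.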

Identification of candidate genes

Genomic regions spanning two adjacent windows of LD decay centered on each significant SNP were used to identify candidate genes. The transcription factor ASIL2 was identified at SNP 25645_98, which was significantly associated with EAB resistance. An uncharacterized gene was identified at a significant SNP associated with autumn leaf coloration. Four genes were associated with budburst: a pectin esterase inhibitor, a U1 small nuclear ribonucleoprotein, a CYCLOIDEA-like TCP gene and an uncharacterized gene were identified at SNPs 29105_13, 37899_120, 21997_49 and 32755_118, respectively. For the remaining significant SNPs, we did not identify any candidate genes within the LD region.
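The candidate gene search can be illustrated as an interval overlap query. The sketch below interprets "two adjacent windows of LD decay centered on the SNP" as the interval from one LD decay distance upstream to one LD decay distance downstream of the SNP, which is an assumption. The SNP position is that of 25645_98 from Table 5.5, the 440 bp distance is the genome-wide average, and the gene names and coordinates are hypothetical placeholders rather than annotations from the green ash genome.

```python
def genes_near_snp(snp_pos, ld_decay_bp, genes):
    """Return names of genes overlapping [snp_pos - ld_decay_bp, snp_pos + ld_decay_bp].

    genes: iterable of (name, start_bp, end_bp) on the same chromosome as the SNP.
    """
    window_start = snp_pos - ld_decay_bp
    window_end = snp_pos + ld_decay_bp
    return [name for name, start, end in genes
            if start <= window_end and end >= window_start]


# Hypothetical gene models on the chromosome carrying the SNP.
hypothetical_genes = [
    ("geneA", 28_440_000, 28_447_000),
    ("geneB", 28_448_500, 28_450_200),
    ("geneC", 28_449_400, 28_452_000),
]

hits = genes_near_snp(28_448_895, 440, hypothetical_genes)
print(hits)
```

Only the gene that overlaps the 880 bp window is returned, which mirrors why several significant SNPs in this study had no candidate gene within their LD region.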

Discussion

Association mapping has been widely used to identify genetic regions controlling quantitative traits in Arabidopsis (Atwell et al., 2010), rice (Huang et al., 2010a), maize (Li et al., 2013), peach (Cao et al., 2016) and other plant species. For green ash, we report the first GWAS results regarding EAB resistance, height, DBH, budburst and leaf coloration. In this project, we made use of the largest known set of putatively resistant green ash trees, which were identified and closely studied within a range-wide provenance trial at Penn State University. Our present GWAS for EAB resistance detected 3 SNPs on chromosomes 3, 10 and 12. For height, we detected one significant SNP on chromosome 11 at 27.74Mb. For DBH, no significant SNPs were detected. For budburst, we detected 9 significant GWAS SNPs on chromosomes 2, 7, 8, 10, 17, 18 and 23. For leaf coloration, we detected 2 significant SNPs on chromosomes 21 and 23.

A significant SNP for EAB resistance resides in the gene for transcription factor ASIL2, which has been reported to be involved in chlorophyll accumulation in A. thaliana (Willmann et al., 2011). For budburst, a pectin esterase inhibitor, a U1 small nuclear ribonucleoprotein (U1 snRNP) and a CYCLOIDEA-like TCP gene were identified from significant SNPs. U1 snRNP plays an essential role in the early stages of the splicing reaction. One component of U1 snRNP has been reported to control plant development and stress response (de Francisco Amorim et al., 2018). Pectin esterase inhibitors have been implicated in the regulation of cell wall extension, carbohydrate metabolism and fruit development (Camardella et al., 2000). In Arabidopsis, a TCP gene has been reported to act inside buds as an integrator of signals controlling bud outgrowth (Aguilar-Martinez et al., 2007).

In this thesis project, only a limited number of resistant tree samples were available for conducting a GWAS. In addition to the smaller than normal sample size, it was unknown how LD differences relative to previous GWAS reports might influence the resolution and power of the analysis. In general, rapid LD decay is expected in outcrossing species. We observed here that LD decay in green ash (440 bp) is comparable to previous reports for trees and other outcrossers, including loblolly pine (within 800 bp) (Gonzalez-Martinez et al., 2006), European aspen (within 500 bp) (Ingvarsson, 2005) and eucalyptus (within 500 bp) (Thavamanikumar et al., 2011), all of which showed longer LD decay distances than maize (within 100 bp) (Tenaillon et al., 2001).
In contrast, LD can persist over longer distances in self-pollinated species, such as Arabidopsis (250 kb) (Nordborg et al., 2002) and cultivated rice (123 kb) (Huang et al., 2010b), and in cultivated maize (30 kb) (Hufford et al., 2012). In the present study, the rapid LD decay observed in green ash could improve the precision of QTL and candidate gene detection when combined with future analyses of bi-parental mapping populations. At this time, the marker density is not yet high enough for such fine-mapping. In future studies with larger sample sizes and more markers, more precise QTL regions and candidate genes can be identified and confirmed for green ash.

Acknowledgement

We would like to thank Dr. Kim Steiner and his research group for the ash plantation and the phenotypic trait measurements. Height was measured by K. Steiner, M. W. Williams, P. Berrang, J. Barbour, N. Foose, A. H. Perera, T. Kolb, J. Zaczek, J. Ayers and R. Bauman. Fall foliage coloration and date of budburst were recorded by M. W. Williams. Diameter at breast height was measured by D. Andrus and B. Montgomery.

References

Aguilar-Martinez, J.A., Poza-Carrion, C., and Cubas, P. (2007). Arabidopsis BRANCHED1 acts as an integrator of branching signals within axillary buds. Plant Cell 19, 458-472.
Anulewicz, A.C., McCullough, D.G., and Cappaert, D.L. (2007). Emerald Ash Borer (Agrilus planipennis) Density and Canopy Dieback in Three North American Ash Species. Arboriculture & Urban Forestry 33, 338-349.
Atwell, S., Huang, Y.S., Vilhjalmsson, B.J., Willems, G., Horton, M., Li, Y., Meng, D., Platt, A., Tarone, A.M., Hu, T.T., et al. (2010). Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627-631.
Camardella, L., Carratore, V., Ciardiello, M.A., Servillo, L., Balestrieri, C., and Giovane, A. (2000). Kiwi protein inhibitor of pectin methylesterase amino-acid sequence and structural importance of two disulfide bridges. European Journal of Biochemistry 267, 4561-4565.
Cao, K., Zhou, Z., Wang, Q., Guo, J., Zhao, P., Zhu, G., Fang, W., Chen, C., Wang, X., Wang, X., et al. (2016). Genome-wide association study of 12 agronomic traits in peach. Nature Communications 7, 13246.
Catchen, J., Hohenlohe, P.A., Bassham, S., Amores, A., and Cresko, W.A. (2013). Stacks: an analysis tool set for population genomics. Mol Ecol 22, 3124-3140.
Clarke, J.D. (2009). Cetyltrimethyl Ammonium Bromide (CTAB) DNA Miniprep for Plant DNA Isolation. Cold Spring Harbor Protocols 2009, pdb.prot5177.
de Francisco Amorim, M., Willing, E.-M., Francisco-Mangilet, A.G., Droste-Borel, I., Macek, B., Schneeberger, K., and Laubinger, S. (2018). The U1 snRNP subunit LUC7 controls plant development and stress response through alternative splicing regulation. bioRxiv.
Earl, D.A., and vonHoldt, B.M. (2011). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources 4, 359-361.
Evanno, G., Regnaut, S., and Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14, 2611-2620.
Gaur, R., Jeena, G., Shah, N., Gupta, S., Pradhan, S., Tyagi, A.K., Jain, M., Chattopadhyay, D., and Bhatia, S. (2015). High density linkage mapping of genomic and transcriptomic SNPs for synteny analysis and anchoring the genome sequence of chickpea. 5, 13387.
Gonzalez-Martinez, S.C., Ersoz, E., Brown, G.R., Wheeler, N.C., and Neale, D.B. (2006). DNA sequence variation and selection of tag single-nucleotide polymorphisms at candidate genes for drought-stress response in Pinus taeda L. Genetics 172, 1915-1926.
Hill, W.G., and Weir, B.S. (1988). Variances and covariances of squared linkage disequilibria in finite populations. Theoretical Population Biology 33, 54-78.
Huang, X., Sang, T., Zhao, Q., Feng, Q., Zhao, Y., Li, C., Zhu, C., Lu, T., Zhang, Z., and Li, M. (2010a). Genome-wide association studies of 14 agronomic traits in rice landraces. Nature Genetics 42, 961.
Huang, X., Wei, X., Sang, T., Zhao, Q., Feng, Q., Zhao, Y., Li, C., Zhu, C., Lu, T., Zhang, Z., et al. (2010b). Genome-wide association studies of 14 agronomic traits in rice landraces. Nature Genetics 42, 961.
Hufford, M.B., Xu, X., van Heerwaarden, J., Pyhäjärvi, T., Chia, J.-M., Cartwright, R.A., Elshire, R.J., Glaubitz, J.C., Guill, K.E., Kaeppler, S.M., et al. (2012). Comparative population genomics of maize domestication and improvement. Nature Genetics 44, 808.
Ingvarsson, P.K. (2005). Nucleotide polymorphism and linkage disequilibrium within and among natural populations of European aspen (Populus tremula L., Salicaceae). Genetics 169, 945-953.
Jakobsson, M., and Rosenberg, N.A. (2007). CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23, 1801-1806.
Koch, J.L., Carey, D.W., Mason, M.E., Poland, T.M., and Knight, K.S. (2015). Intraspecific variation in Fraxinus pennsylvanica responses to emerald ash borer (Agrilus planipennis). New Forests 46, 995-1011.
Korte, A., and Farlow, A. (2013). The advantages and limitations of trait analysis with GWAS: a review. Plant Methods 9, 29.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760.
Li, H., Peng, Z., Yang, X., Wang, W., Fu, J., Wang, J., Han, Y., Chai, Y., Guo, T., and Yang, N. (2013). Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nature Genetics 45, 43.
Lipka, A.E., Tian, F., Wang, Q., Peiffer, J., Li, M., Bradbury, P.J., Gore, M.A., Buckler, E.S., and Zhang, Z. (2012). GAPIT: genome association and prediction integrated tool. Bioinformatics 28, 2397-2399.
Marroni, F., Pinosio, S., Zaina, G., Fogolari, F., Felice, N., Cattonaro, F., and Morgante, M. (2011). Nucleotide diversity and linkage disequilibrium in Populus nigra cinnamyl alcohol dehydrogenase (CAD4) gene. Tree Genetics & Genomes 7, 1011-1023.
McKown, A.D., Klapste, J., Guy, R.D., Geraldes, A., Porth, I., Hannemann, J., Friedmann, M., Muchero, W., Tuskan, G.A., Ehlting, J., et al. (2014). Genome-wide association implicates numerous genes underlying ecological trait variation in natural populations of Populus trichocarpa. New Phytologist 203, 535-553.
Minamikawa, M.F., Nonaka, K., Kaminuma, E., Kajiya-Kanegae, H., Onogi, A., Goto, S., Yoshioka, T., Imai, A., Hamada, H., Hayashi, T., et al. (2017). Genome-wide association study and genomic prediction in citrus: Potential of genomics-assisted breeding for fruit quality traits. Scientific Reports 7, 4721.
Nordborg, M., Borevitz, J.O., Bergelson, J., Berry, C.C., Chory, J., Hagenblad, J., Kreitman, M., Maloof, J.N., Noyes, T., Oefner, P.J., et al. (2002). The extent of linkage disequilibrium in Arabidopsis thaliana. Nature Genetics 30, 190-193.
Peterson, B.K., Weber, J.N., Kay, E.H., Fisher, H.S., and Hoekstra, H.E. (2012). Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PLOS ONE 7, e37135.
Pootakham, W., Jomchai, N., Ruang-areerate, P., Shearman, J.R., Sonthirod, C., Sangsrakru, D., Tragoonrung, S., and Tangphatsornruang, S. (2015). Genome-wide SNP discovery and identification of QTL associated with agronomic traits in oil palm using genotyping-by-sequencing (GBS). Genomics 105, 288-295.
Porras-Hurtado, L., Ruiz, Y., Santos, C., Phillips, C., Carracedo, Á., and Lareu, M.V. (2013). An overview of STRUCTURE: applications, parameter settings, and supporting software. Frontiers in Genetics 4, 98.
Pritchard, J.K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945-959.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W., Daly, M.J., et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. American Journal of Human Genetics 81, 559-575.
Rosenberg, N.A. (2004). distruct: a program for the graphical display of population structure. Molecular Ecology Notes 4, 137-138.
Tang, Y., Liu, X., Wang, J., Li, M., Wang, Q., Tian, F., Su, Z., Pan, Y., Liu, D., Lipka, A.E., et al. (2016). GAPIT Version 2: An Enhanced Integrated Tool for Genomic Association and Prediction. Plant Genome 9.
Tassinari, R., Vilela, M.D., Fonseca, F., Ferreira, C., Keiko, E., Bonfim, O., and Junior, S. (2017). Regional heritability mapping and genome-wide association identify loci for complex growth, wood and disease resistance traits in Eucalyptus. New Phytologist.
Tenaillon, M.I., Sawkins, M.C., Long, A.D., Gaut, R.L., Doebley, J.F., and Gaut, B.S. (2001). Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.). Proceedings of the National Academy of Sciences of the United States of America 98, 9161-9166.
Thavamanikumar, S., McManus, L.J., Tibbits, J.F.G., and Bossinger, G. (2011). The significance of single nucleotide polymorphisms (SNPs) in Eucalyptus globulus breeding programs. Australian Forestry 74, 23-29.
Wang, Q., Tian, F., Pan, Y., Buckler, E.S., and Zhang, Z. (2014). A SUPER Powerful Method for Genome Wide Association Study. PLOS ONE 9, e107684.
Willmann, M.R., Mehalick, A.J., Packer, R.L., and Jenik, P.D. (2011). MicroRNAs regulate the timing of embryo maturation in Arabidopsis. Plant Physiology 155, 1871-1884.
Xu, Y., Li, P., Yang, Z., and Xu, C. (2017). Genetic mapping of quantitative trait loci in crops. The Crop Journal 5, 175-184.

Figures:

Figure 5.1 Average LD decay across the green ash genome. The blue line represents the regression curve fitted to the LD plot. The red line marks the distance (440 bp) at which r2 dropped to half of its highest value.


Figure 5.2 LD decay rates across the 23 green ash chromosomes. Blue lines represent the regression curves fitted to the LD plot for each of the 23 chromosomes. Red lines mark the distance at which r2 dropped to half of its highest value.


Figure 5.3 Assessment of the K value that best fits the accessions. Values of ∆K for 85 individuals using 28,592 SNP markers; the modal value indicates a true K of two core ancestral populations.


Figure 5.4 Population structure of the accessions. STRUCTURE results plotted by DISTRUCT for genetic variation at K=2 for 85 individuals representing 60 provenances. The colored plot represents the estimates of Q (the proportion of an individual’s ancestry in a population), assuming 2 populations. Cluster I (Blue): Pop1-18 (MO_177 ~ OH_141) with Q1 > 80%; Cluster II (Orange): Pop30-56 (MI_293 ~ Man_513) and Parent2 of the F1 mapping population with Q2 > 80%; Admixed Cluster: Pop19-29 (IL_169 ~ NY_373) and Parent1 of the F1 mapping population with 80% > Q1 > 20%.

Figure 5.5 Scatter plot of principal component axis one (PC1) and axis two (PC2) based on genotype data of 85 samples. The x-axis is PC1, which explained 10.84% of the variation, and the y-axis is PC2, which explained 1.63%. Samples are colored according to the STRUCTURE result (i.e. Q): orange, blue and green represent the North, South and admixed groups, respectively.


Figure 5.6 GPS coordinates illustrating geographic variation among the North, South and admixed clusters.


Figure 5.7 Kinship heatmap. Heatmap and dendrogram of a kinship matrix based on ~28,592 SNPs among 85 genotypes. Rows and columns represent the 85 individuals.


Figure 5.8 Quantile-quantile (QQ) plots for GWAS for (a) budburst, (b) EAB resistance, (c) fall foliage coloration, (d) height. Most P-values were similar to the expected diagonal in the QQ plot, which indicates the appropriateness of the GWAS model.


Figure 5.9 Manhattan plots. Significant loci associated with (a) budburst, (b) EAB resistance, (c) foliage coloration and (d) height were identified. Each dot represents a SNP. The red horizontal line indicates the Bonferroni-corrected significance threshold, -log10(p) = 5.76. SNPs with FDR below 0.05 are circled.


Table 5.1 Summary of identified SNPs.

LGs  Scaffold_ID                    Scaffold_size  SNPs
1    ScuXjxm_494999;HRSCAF=496248   50995270       2071
2    ScuXjxm_494985;HRSCAF=496196   43608880       1649
3    ScuXjxm_494981;HRSCAF=496072   34147523       1173
4    ScuXjxm_125222;HRSCAF=125413   34501153       1394
5    ScuXjxm_494991;HRSCAF=496207   39445854       1751
6    ScuXjxm_494994;HRSCAF=496212   31863345       1194
7    ScuXjxm_495002;HRSCAF=496252   30524050       1234
8    ScuXjxm_494987;HRSCAF=496199   30717653       1216
     ScuXjxm_494988;HRSCAF=496201   26354852       963
9    ScuXjxm_494990;HRSCAF=496204   7793390        524
10   ScuXjxm_494996;HRSCAF=496225   33820380       1284
11   ScuXjxm_494992;HRSCAF=496209   30397008       1219
12   ScuXjxm_257556;HRSCAF=257952   35099959       1403
     ScuXjxm_495000;HRSCAF=496249   22919704       683
     ScuXjxm_494989;HRSCAF=496202   9428771        742
13   ScuXjxm_494977;HRSCAF=495950   2612082        170
14   ScuXjxm_23148;HRSCAF=23188     27998687       970
     ScuXjxm_494986;HRSCAF=496197   26865606       808
15   ScuXjxm_494993;HRSCAF=496211   4250457        222
16   ScuXjxm_494995;HRSCAF=496215   24819257       828
17   ScuXjxm_494984;HRSCAF=496195   31893592       1221
18   ScuXjxm_495001;HRSCAF=496251   25639400       987
19   ScuXjxm_494998;HRSCAF=496232   25897340       752
20   ScuXjxm_274662;HRSCAF=275086   22141284       788
21   ScuXjxm_138335;HRSCAF=138560   27152721       862
22   ScuXjxm_358465;HRSCAF=359033   24885302       919
23   ScuXjxm_494978;HRSCAF=496020   26605849       978
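The marker densities quoted in the Results for LG1 and LG19 follow directly from Table 5.1 as scaffold size divided by SNP count. A short Python check is given below; the function name is illustrative only.

```python
def kb_per_marker(scaffold_size_bp, n_snps):
    """Average spacing between markers, in kb per SNP."""
    return scaffold_size_bp / n_snps / 1000


# Scaffold sizes and SNP counts for LG1 and LG19, taken from Table 5.1.
lg1_density = kb_per_marker(50_995_270, 2071)
lg19_density = kb_per_marker(25_897_340, 752)

print(f"LG1: one SNP per {lg1_density:.1f} kb")
print(f"LG19: one SNP per {lg19_density:.1f} kb")
```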

Table 5.2 Categories of identified SNPs.

Total number of SNPs        28005   100%
Transversion         A/C    2810    10.03%
Transversion         A/T    3467    12.38%
Transversion         C/G    2058    7.35%
Transversion         G/T    2806    10.02%
Transition           A/G    8402    30.00%
Transition           C/T    8462    30.22%
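The transition and transversion totals can be tallied directly from the counts in Table 5.2. The Python sketch below does so and derives the percentages and the Ts/Tv ratio; variable names are illustrative.

```python
# SNP counts by substitution type, taken from Table 5.2.
snp_counts = {
    "A/C": 2810, "A/T": 3467, "C/G": 2058, "G/T": 2806,  # transversions
    "A/G": 8402, "C/T": 8462,                            # transitions
}
TRANSITIONS = {"A/G", "C/T"}

n_transitions = sum(n for kind, n in snp_counts.items() if kind in TRANSITIONS)
n_transversions = sum(n for kind, n in snp_counts.items() if kind not in TRANSITIONS)
total = n_transitions + n_transversions

ts_percent = 100 * n_transitions / total
tv_percent = 100 * n_transversions / total
ts_tv_ratio = n_transitions / n_transversions

print(total, round(ts_percent, 2), round(tv_percent, 2), round(ts_tv_ratio, 2))
```

The counts sum to 28,005 SNPs, with 60.22% transitions and 39.78% transversions, which corresponds to a Ts/Tv ratio of about 1.51.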


Table 5.3 Genetic variation statistics within three clusters

Cluster   Ho      He      Fst     Fis
North     0.2022  0.2186  0.0012  0.0750
Admixed   0.2436  0.2485  0.0123  0.0197
South     0.2141  0.2327  0.0036  0.0799

Table 5.4 Pairwise Fst among three clusters

         Admixed  North
North    0.040    --
South    0.041    0.111

Table 5.5 Significant marker-trait associations.

Traits         Marker     Chr.  Position(bp)  p-value   FDR    Allelic Effect (%)  Candidate gene
EAB            11179_87   12    20,176,672    1.51E-07  0.004  1.336
EAB            25645_98   3     28,448,895    2.95E-06  0.040  1.419               ASIL2(TF)
EAB            60870_103  10    18,859,494    4.97E-06  0.045  1.407
Height_1979    52671_77   11    27,735,951    1.50E-06  0.041  0.052
Foliage Color  4493_23    21    16,676,388    1.06E-06  0.029  -1.976              uncharacterized
Foliage Color  21238_22   23    10,624,963    3.66E-06  0.050  4.394
BudBurst       77390_92   7     25,638,778    1.34E-07  0.002  -5.752
BudBurst       32755_118  2     34,601,448    1.95E-07  0.002  -5.581              Uncharacterized mRNA
BudBurst       32756_125  2     34,601,577    1.95E-07  0.002  -5.581
BudBurst       22279_18   23    22,861,498    1.38E-06  0.009  5.859
BudBurst       62631_57   10    4,456,524     1.82E-06  0.010  6.137
BudBurst       29105_13   17    26,676,916    2.54E-06  0.012  -3.365              Pectin esterase inhibitor
BudBurst       73875_59   18    15,502,133    1.02E-05  0.039  -2.102
BudBurst       37899_120  8     17,452,812    1.55E-05  0.048  4.336               U1 small nuclear ribonucleoprotein
BudBurst       21997_49   23    19,177,204    1.57E-05  0.048  3.388               CYCLOIDEA-like TCP gene

Chapter 6 Summary and Discussion

Ash species in North America, Britain and Europe are under severe threat from climate change, environmental stresses and invasive species. The rapid invasion of North America by the emerald ash borer (EAB), an insect from Asia, has led to EAB being ranked as one of the most damaging insect pests. EAB has killed millions of ash trees and is a severe threat to all native ash species in North America (Herms et al., 2014). Green ash (Fraxinus pennsylvanica) is an important tree species in North America, both as a component of forest and riparian ecosystems and as a widely planted urban street tree. Some studies suggest that, among all major North American ash species, green ash is preferred by EAB (Anulewicz et al., 2007). Furthermore, intraspecific variation in green ash responses to EAB has been identified (Koch et al., 2015). Mature green ash trees that survived EAB infestations that killed the remainder of their cohort provide valuable material for understanding the potential for development of resistance in North American ash species. To understand and help resolve forest health issues such as the EAB infestation, we identified genome-wide DNA markers using NGS data and used those markers to identify candidate genes associated with growth and response to biotic stresses in green ash.

DNA markers, genomic regions and genes associated with important traits can be identified by quantitative trait locus (QTL) mapping, by genome-wide association studies (GWAS) or by transcriptome comparison. Transcriptome comparisons are carried out between individuals or groups that represent contrasting phenotypes to identify genes that are differentially expressed between the groups.
QTL mapping is based on the location of informative DNA markers within families, and its resolution is limited by the number of recombination events that occur in the bi-parental cross. Association mapping, on the other hand, is conducted at the level of a natural population with a larger number of markers, providing better mapping resolution. GWAS has the capacity to identify both major and minor gene associations with traits, while QTL analysis typically identifies the strongest associations with traits such as insect resistance.

In this project, we identified genome-wide DNA markers and used those markers to assess population diversity, structure and marker-trait associations in green ash. We also identified DEGs and hub genes from transcriptome comparisons between EAB-resistant and EAB-susceptible genotypes. Last but not least, we developed the first genetic map for green ash.

In previous studies, MeJA application was observed to induce EAB resistance that was associated with higher accumulation of verbascoside and lignin and higher trypsin inhibitor activity in selected green ash genotypes (Whitehill et al., 2014). Thus, we expected to identify genes in the biosynthesis and/or signaling pathways of JA, as well as in secondary chemical pathways. In addition, upstream transcription factors that regulate these pathways and gene products may be identified. Furthermore, in addition to such inducible genes and pathways, we may also expect to identify genes for constitutive factors that may be responsible for differences between lingering and susceptible trees, such as bark quality phenotypes.

GO term enrichment analysis of genes up-regulated in resistant genotypes showed that stress response, carbohydrate metabolic process, secondary metabolic process, lipid metabolic process and catabolic process were all overrepresented. The susceptible group, on the other hand, was more enriched in “pollen-pistil interactions”, “cellular homeostasis” and “peroxisome”. These differences between the resistant and susceptible groups suggest that a defense response specific to herbivore attack was triggered successfully in EAB-resistant genotypes.

Network analysis of DEGs supports a contribution of jasmonic acid pathways to EAB resistance in green ash. In resistant genotypes, up-regulation of genes encoding the transcription factor MYC2 and two jasmonate ZIM-domain (JAZ) proteins further supports a defense response against EAB mediated by JA accumulation and JA-responsive genes. In the susceptible group, however, two key genes involved in JA biosynthesis (AOS and AOC) were down-regulated, which suggests that susceptible genotypes may fail to accumulate enough JA to trigger JA-responsive defenses.

Transcriptome comparison is commonly used to identify candidate genes and pathways for different resistance phenotypes in forest trees. In Picea glauca, a gene encoding β-glucosidase was identified as a candidate gene contributing to spruce budworm resistance, as up to 1000-fold higher expression was observed in resistant trees (Mageroy et al., 2015). RNA-seq has also been used to study differences in gene expression in leaves of Eucalyptus melliodora trees with resistant versus susceptible phenotypes with respect to insect herbivory (Padovan et al., 2013).
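The GO term over-representation described above can be illustrated with a one-sided hypergeometric test. This is only a sketch under stated assumptions: the enrichment tool and statistical test actually used are not named in this chapter, the function name `hypergeom_enrichment_p` is hypothetical, and all gene counts below are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background set
    K: background genes annotated with the GO term
    n: genes in the test set (e.g. up-regulated DEGs)
    k: test-set genes annotated with the GO term
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the probabilities of observing k or more annotated genes.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total


# Hypothetical counts (assumptions, not results from this study):
# 1000 background genes, 50 annotated "response to stress",
# 100 up-regulated genes, 15 of which carry the annotation.
p = hypergeom_enrichment_p(N=1000, K=50, n=100, k=15)
```

With these made-up counts only about 5 annotated genes would be expected by chance, so observing 15 gives a small p-value. In practice such p-values would still need multiple-testing correction across all GO terms tested.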
It has been shown that genes encoding lipoxygenase (LOX), allene oxide synthase (AOS) and allene oxide cyclase (AOC) were up-regulated in insect-attacked, mechanically wounded and jasmonate-treated trees (Men et al., 2013; Ralph et al., 2006). The transcriptome profiling in this dissertation showed that AOC and AOS were down-regulated in the susceptible group, indicating that the defense response was not activated effectively.

In addition to transcriptome profiling, genetic variation can also help us understand susceptibility to EAB. Because green ash is a non-model species, genome information is limited. Therefore, we assessed the population structure and diversity of natural green ash populations using 8 microsatellite markers. The study supports the existence of population structure. The STRUCTURE analysis revealed two clusters located generally south and north of 40.5°N in North America, while a third cluster consisted of admixtures of northern and southern genotypes at the intersection of the two. The southern cluster was the most differentiated from the other subpopulations, with high pairwise Fst values. As expected, the greatest variation was detected between the North and South population clusters. We also observed higher diversity within the South cluster than in the other two clusters, supporting our hypothesis of a general south-to-north migration path of green ash. In the study, 6 EST-SSR markers successfully distinguished the south, north and admixed clusters.

Genetic diversity serves several important purposes in forest trees. It is an important resource for tree conservation, breeding and improvement programs that aim to develop well-adapted varieties. Rich genetic diversity also ensures the capacity of forest trees to tolerate biotic and abiotic stressors under climate change. Microsatellite markers have been used to detect genetic diversity and population structure in many other tree species. In Populus, genetic diversity and population structure have been assessed with SSR markers (Du et al., 2012; Grewal et al., 2014; Wang et al., 2011; Zhu et al., 2016). In Eucalyptus, genetic diversity has also been assessed with SSR markers within different populations and species (Acuña et al., 2011; Costa et al., 2017; Cupertino et al., 2011). In Quercus, genetic diversity has been characterized in several oak species using microsatellite markers (Lind and Gailing, 2013; Valencia-Cuevas et al., 2014). SSR markers have also revealed genetic diversity in spruce and pine (Verbylaitė et al., 2017).

We constructed the first genetic map of green ash using a combination of SSR and SNP markers. Our study showed that the 96-plex GBS protocol works well for genetic map construction in such a non-model species. The genetic map spanned 2008.87 cM with an average marker interval of 1.67 cM. The map was also used to detect syntenic relationships between green ash and four other species: tomato, coffee, peach and poplar. Our results suggest that coffee and peach could be used as model species for further comparative analysis in green ash. In addition, the genetic map was used to anchor 27 scaffolds of the draft genome assembly and showed near-perfect consistency with the genome assembly.
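The two summary statistics quoted for the genetic map, total length and average marker interval, can be derived from marker positions as sketched below. This assumes the average interval is the total map length divided by the number of adjacent-marker intervals within linkage groups; the convention used by the mapping software for the green ash map is not stated here. The function name `map_summary` and the toy positions are hypothetical.

```python
def map_summary(linkage_groups):
    """Return (total length in cM, mean marker interval in cM).

    linkage_groups maps a linkage group name to its marker positions in cM.
    """
    total_length = 0.0
    n_intervals = 0
    for positions in linkage_groups.values():
        ordered = sorted(positions)
        # Length of a group is the distance between its terminal markers.
        total_length += ordered[-1] - ordered[0]
        # Adjacent markers within a group define one interval each.
        n_intervals += len(ordered) - 1
    return total_length, total_length / n_intervals


# Hypothetical toy map with two linkage groups (not the green ash map).
toy_map = {"LG1": [0.0, 1.5, 4.0, 6.0], "LG2": [0.0, 2.0, 3.0]}
length_cm, mean_interval_cm = map_summary(toy_map)
```

Under the same assumption, a map of 2008.87 cM with a mean interval of 1.67 cM would correspond to roughly 2008.87 / 1.67 ≈ 1,203 marker intervals.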
Last but not least, we made use of a set of putatively resistant green ash trees that have been identified and closely studied within a range-wide provenance trial at Penn State University. We detected 3, 9 and 2 SNP markers significantly associated with EAB resistance, budburst and foliage coloration, respectively, in green ash. Genome-wide LD decay was estimated at 440 bp on average. Despite the limited sample size, our preliminary GWAS identified significant SNPs associated with traits of interest and provided an estimate of LD decay in green ash. However, these results need to be confirmed with a larger population and higher marker density.

These findings represent an important first step in the study of resistance and susceptibility to EAB and other stem-boring insects in broadleaf trees. Beyond confirmation in larger populations, the results should also be tested through QTL analysis with our full-sib green ash mapping population. Our study will assist disease-resistance tree breeding programs and forest management by providing powerful tools to detect important genes related to EAB resistance, as well as genes for growth and adaptation. This new information should prove important for studies of resistance to boring insects and of environmental stress responses in other tree species currently under threat of extinction and extirpation.
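The LD decay estimate mentioned above rests on a pairwise measure of linkage disequilibrium between SNPs. The sketch below computes the commonly used r² statistic for two biallelic SNPs. It assumes phased haplotypes with alleles coded 0 or 1, which is a simplification; the software and settings used for the green ash estimate are not described in this chapter. The function name `ld_r2` and the haplotype data are hypothetical.

```python
def ld_r2(haplotypes):
    """Squared allelic correlation r^2 for two biallelic SNPs.

    haplotypes is a list of (a, b) pairs, one per haplotype,
    with the allele at each SNP coded as 0 or 1.
    Both SNPs must be polymorphic in the sample.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical haplotype samples (assumptions, not study data).
complete_ld = [(1, 1)] * 4 + [(0, 0)] * 4
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

r2_complete = ld_r2(complete_ld)
r2_none = ld_r2(no_ld)
```

An LD decay distance is then usually read from how r² declines with the physical distance between SNP pairs, for example the distance at which average r² drops below a chosen threshold.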

References

Acuña, C., Villalba, P., García, M., Hopp, E., and Marcucci, S. (2011). Functional markers development and genetic diversity analysis in Eucalyptus globulus with emphasis in wood quality candidate genes. BMC Proceedings 5, P154.
Anulewicz, A.C., McCullough, D.G., and Cappaert, D.L. (2007). Emerald Ash Borer (Agrilus planipennis) Density and Canopy Dieback in Three North American Ash Species. Arboriculture & Urban Forestry 33, 338-349.
Costa, J., Vaillancourt, R.E., Steane, D.A., Jones, R.C., and Marques, C. (2017). Microsatellite analysis of population structure in Eucalyptus globulus. Genome 60, 770-777.
Cupertino, F., Leal, J., Correa, R., and Gaiotto, F. (2011). Genetic diversity of Eucalyptus hybrids estimated by genomic and EST microsatellite markers. Biologia Plantarum 55, 379-382.
Du, Q., Wang, B., Wei, Z., Zhang, D., and Li, B. (2012). Genetic diversity and population structure of Chinese white poplar (Populus tomentosa) revealed by SSR markers. Journal of Heredity 103, 853-862.
Grewal, G., Gill, R., Dhillon, G., and Vikal, Y. (2014). Molecular characterisation and genetic diversity analysis of Populus deltoides Bartr. ex Marsh. clones using SSR markers.
Koch, J.L., Carey, D.W., Mason, M.E., Poland, T.M., and Knight, K.S. (2015). Intraspecific variation in Fraxinus pennsylvanica responses to emerald ash borer (Agrilus planipennis). New Forests 46, 995-1011.
Lind, J.F., and Gailing, O. (2013). Genetic structure of Quercus rubra L. and Quercus ellipsoidalis E. J. Hill populations at gene-based EST-SSR and nuclear SSR markers. Tree Genetics & Genomes 9, 707-722.
Liu, H., Bauer, L.S., Gao, R., Zhao, T., Petrice, T.R., and Haack, R.A. (2003). Exploratory survey for the emerald ash borer, Agrilus planipennis (Coleoptera: Buprestidae), and its natural enemies in China. The Great Lakes Entomologist 36, 191-204.
Mageroy, M.H., Parent, G., Germanos, G., Giguère, I., Delvas, N., Maaroufi, H., Bauce, É., Bohlmann, J., and Mackay, J.J. (2015). Expression of the β-glucosidase gene Pgβglu-1 underpins natural resistance of white spruce against spruce budworm. The Plant Journal 81, 68-80.
Men, L., Yan, S., and Liu, G. (2013). De novo characterization of Larix gmelinii (Rupr.) Rupr. transcriptome and analysis of its gene expression induced by jasmonates. BMC Genomics 14, 548.
Padovan, A., Keszei, A., Foley, W.J., and Külheim, C. (2013). Differences in gene expression within a striking phenotypic mosaic Eucalyptus tree that varies in susceptibility to herbivory. BMC Plant Biology 13, 29.
Ralph, S.G., Yueh, H., Friedmann, M., Aeschliman, D., Zeznik, J.A., Nelson, C.C., Butterfield, Y.S., Kirkpatrick, R., Liu, J., Jones, S.J., et al. (2006). Conifer defence against insects: microarray gene expression profiling of Sitka spruce (Picea sitchensis) induced by mechanical wounding or feeding by spruce budworms (Choristoneura occidentalis) or white pine weevils (Pissodes strobi) reveals large-scale changes of the host transcriptome. Plant, Cell & Environment 29, 1545-1570.
Valencia-Cuevas, L., Piñero, D., Mussali-Galante, P., Valencia-Ávalos, S., and Tovar-Sánchez, E. (2014). Effect of a red oak species gradient on genetic structure and diversity of Quercus castanea (Fagaceae) in Mexico. Tree Genetics & Genomes 10, 641-652.
Verbylaitė, R., Pliūra, A., Lygis, V., Suchockas, V., Jankauskienė, J., and Labokas, J. (2017). Genetic Diversity and Its Spatial Distribution in Self-Regenerating Norway Spruce and Scots Pine Stands. Forests 8, 470.
Wang, J., Li, Z., Guo, Q., Ren, G., and Wu, Y. (2011). Genetic variation within and between populations of a desert poplar (Populus euphratica) revealed by SSR markers. Annals of Forest Science 68, 1143.
Whitehill, J.G., Rigsby, C., Cipollini, D., Herms, D.A., and Bonello, P. (2014). Decreased emergence of emerald ash borer from ash treated with methyl jasmonate is associated with induction of general defense traits and the toxic phenolic compound verbascoside. Oecologia 176, 1047-1059.
Zhu, X., Cheng, S., Liao, T., and Kang, X. (2016). Genetic diversity in fragmented populations of Populus talassica inferred from microsatellites: implications for conservation. Genetics and Molecular Research 15.


VITA

Di Wu

Education The Pennsylvania State University University Park, PA Ph.D. candidate in Bioinformatics and Genomics. GPA: 3.64/4 Expected May 2018

The Pennsylvania State University University Park, PA M.S. in Forest Resources management. GPA: 3.46/4 Dec. 2012

Beijing Forestry University Beijing, China B.S. in Biotechnology. GPA: 85.5/100 Jul. 2010

Research Experiences

PhD Researcher | The Pennsylvania State University, Jan. 2013-May 2018
• Discovered genome-wide SNPs across green ash genome through Genotyping-by-Sequencing
• Constructed genetic map of green ash
• Identified candidate genes contributing to disease resistance through RNA-seq analysis
• Explored population structure and variation across natural populations of green ash
• Conducted genome-wide association study to identify SNPs significantly associated with disease resistance

Intern | INRA-Bordeaux, France, Jan. 2015-Jul. 2015
• Identified EST-SSR and SNP markers from exome capture
• Designed primers for SSR and SNP markers
• Developed genetic map of white oak

Master’s Researcher | The Pennsylvania State University, Aug. 2010-Dec. 2012
• Compared expression profiles of three cellulose synthase genes through real-time qPCR

Publications and Presentations
• D. Wu, J. Koch, M.V. Coggeshall and J.E. Carlson. The first genetic linkage map for Fraxinus pennsylvanica and syntenic relationships with four related species. (To be submitted)
• D. Wu, T. Best, K.C. Steiner and J.E. Carlson. Population structure and genetic diversity of green ash (Fraxinus pennsylvanica) assessed by SSR markers. (To be submitted)
• D. Wu, J. Koch, M.V. Coggeshall, N. Zembower and J.E. Carlson. A genetic linkage map for Fraxinus pennsylvanica and synteny analysis with Asterid and Rosid species. International Plant & Animal Genome XXVI conference, San Diego, CA.
• D. Wu and J.E. Carlson. Intraspecific variation in green ash response to an invasive insect. 33rd Mid Atlantic Plant Molecular Biology Society Annual Conference, College Park, MD.