ABSTRACT

JACKSON, COLIN ALAN. Single Nucleotide Polymorphism Discovery and Probe Design Using RNA-Seq and Targeted Capture in Tropical and Subtropical Species (Under the direction of Dr. Juan Jose Acosta and Dr. Fikret Isik).

Mexico and Central America are two primary pockets of diversity for tropical and subtropical pine species. Of the species found there, many are considered threatened or endangered, which in turn has spurred conservation efforts and the establishment of landrace populations for commercial use across South America and Southern Africa. The development of genomic resources in Pinus taeda has created a platform to extend these technologies into tropical . The information gained from these resources will prove to be valuable for the continued improvement of pure and hybrid breeding populations, as well as gene conservation efforts. In this study, we performed targeted genome and gene-based SNP discovery for the development of a high-throughput genome-wide genotyping array for tropical pines. Pooled

RNA-Seq data was generated from shoot tissues of seedlings from five species: Pinus patula,

Pinus tecunumanii, Pinus oocarpa, Pinus greggii and Pinus maximinoi. Pooled RNA was isolated in triplicate from 8 to 16 seedlings from two or more families per species. Paired-end

RNA sequencing generated between 29.4 and 67.7 million raw reads per pool. Reads were trimmed and mapped to each species’ respective transcriptome assembly. SNP detection was performed under standard and strict criteria using a min-alternative-fraction of ≥ 0.05. Using standard criteria, we identified between 687K and 1.3M transcriptome-based SNPs per species, while strict calls produced between 426K and 824K SNPs per species. SNP probe design (using

Affymetrix design criteria) resulted in 113K to 186K candidate probes per species for a total of

803K probes designed. Probes were further assessed for unique vs. repetitive mapping against the v2.01 P. taeda genome assembly. Of the 803K probes that mapped to the genome, 60% to

67% mapped uniquely in the reference genome assembly.

In addition, targeted sequence data for the five species mentioned above, plus Pinus caribaea, was generated using a custom set of 40K capture probes (RAPiD Genomics

Gainesville, FL). About 30K were designed from single copy regions of the v2.01 P. taeda genome and 10K were designed from P. patula and P. tecunumanii transcriptome assemblies.

Across the six species, a total of 81 pooled samples were sequenced. Each pooled sample contained DNA from 4 to 8 trees from a single provenance. The 81 provenances cover the natural range of the six species in Mexico and Central America. Capture-seq analysis generated between 3.1 and 7.7 million reads per sample with 20-30X coverage of each capture region. We identified a total of 1.3 million SNPs suitable for probe design, from which, 562K probes were successfully designed. Between the two datasets, there are ~1.6 million unique locations represented by ~1 million probes. This study shows that RNA-Seq and targeted capture sequencing are valuable technologies in identifying variation within non-model species and produce a high quality dataset for implementation of probe design for use on a commercial genotyping array.

© Copyright 2019 by Colin Alan Jackson

All Rights Reserved Single Nucleotide Polymorphism Discovery and Probe Design Using RNA-Seq and Targeted Capture in Tropical and Subtropical Pine Species

by Colin Alan Jackson

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Master of Science

Forestry and Environmental Resources

Raleigh, North Carolina 2019

APPROVED BY:

______Juan Jose Acosta Fikret Isik Committee Co-Chair Committee Co-Chair

______Gary Hodge Ross Whetten

BIOGRAPHY

Colin Jackson, son of Jeffrey and Carol Jackson, was born on May 12, 1993 in

Texarkana, TX. Colin was raised in Broken Bow, OK where he spent much of his time helping raise livestock and being active outdoors: hunting, fishing, and hiking. He attended Broken

Bow High School and graduated in the class of 2011. After graduating, he attended Oklahoma

State University where he was active in undergraduate research from his first through final year researching the field of microbial ecology. In 2016, Colin graduated with his B.S. in microbiology with an emphasis in cell and molecular biology along with a minor in forestry.

After graduation, Colin worked a year in the forestry industry in the field of tree improvement.

This led him to his path at NCSU working jointly with Camcore and the Cooperative Tree

Improvement Program for his graduate degree.

ii

ACKNOWLEDGMENTS

First, I would like to thank my family and particularly my parents for their support throughout all of my endeavors in life. Without them, there is no way I would be in the position I am in today. Thank you for shaping me into who I am today and know that my successes are equally yours.

Next, I would like to thank my co-advisors Drs. Juan Acosta and Fikret Isik, and committee members Gary Hodge and Ross Whetten, for their guidance and constructive criticisms as I worked through this project. I am also tremendously grateful for my collaborators at the University of Pretoria, Drs. Alexander Myburg and Nanette Christie and

Dr. Jill Wegrzyn at the University of Connecticut. This project has been hugely collaborative and would not have been possible without the direction and resources given by them and others at their institutions.

Lastly, I would like to acknowledge my fellow TIP, Camcore, and FER GSA student counterparts for not only providing advise on the project but also providing a sense of community that helps keep the ever present challenge of work life balance in check.

iii

TABLE OF CONTENTS

LIST OF TABLES ...... v LIST OF FIGURES ...... vi Chapter 1. Pines, Tree Improvement, and Genetics ...... 1 1.1. Introduction ...... 1 1.2. Genetics and Breeding of Pines ...... 4 1.3. Molecular Markers ...... 9 1.4. Summary ...... 18 1.5. Literature Cited ...... 20 Chapter 2. Single Nucleotide Polymorphism Discovery in Tropical and Subtropical Pines Using RNA-Seq ...... 35 2.1. Introduction ...... 35 2.2. Materials and Methods ...... 37 2.2.1. Material ...... 37 2.2.2. RNA Isolation and Sequencing ...... 38 2.2.3. SNP Discovery ...... 39 2.2.4. Probe Design ...... 40 2.2.5. Probe Description ...... 40 2.3. Results ...... 43 2.3.1. RNA-Sequencing and Alignment ...... 43 2.3.2. Variant calling and Probe Design ...... 43 2.3.3. Probe Description ...... 45 2.4. Discussion ...... 46 2.5. Literature Cited ...... 51 Chapter 3. Single Nucleotide Polymorphism Discovery and Probe Design using Targeted Sequencing Data of Tropical and Subtropical Pine Species ...... 69 3.1. Introduction ...... 69 3.2. Methods and Materials ...... 71 3.2.1. Plant Material and DNA Isolation ...... 71 3.2.2. Capture Probe Design and Target Enrichment ...... 72 3.2.3. Quality Control and Alignment ...... 73 3.2.4. Variant Calling and SNP Probe Design ...... 74 3.2.5. SNP and Probe Description ...... 75 3.3. Results ...... 77 3.3.1. Target Capture Sequencing and Alignment ...... 77 3.3.2. Variant Calling and Probe Design ...... 77 3.3.3. SNP and Probe Characterization ...... 78 3.4. Discussion ...... 79 3.5. Literature Cited ...... 84 Chapter 4 Conclusions ...... 99 4.1. Literature Cited ...... 101 APPENDIX ...... 102

iv

LIST OF TABLES

Chapter 2

Table 2.1 Sequencing and alignment metrics generated from RNA sequencing. Sequencing generated millions (M) of reads and billions of bases on average per species. Raw reads underwent quality control (QC) requiring a base quality score of 30 and minimum read length of 50bp. Average read length decreased slightly from the 150bp raw reads generated from sequencing. Average alignment percentage across species was consistent...... 57

Table 2.2 SNP numbers and probe design metrics for SNPs generated under standard parameters. The number of SNPs identified from variant calling generally numbered in the millions with the exception of P. greggii which had approximately 680K identified. Of the SNPs generated, less than one third were successfully converted to a probe. Probes tended to have a high number of unique left and right flanking regions, showing little redundancy within species. .. 58

Table 2.3 Catalogue of the number of transitions (Ts) and transversions (Tv) per species. Ts:Tv ratios were consistent across species and within expected ranges for SNPs generated from transcriptome sequences...... 59

Chapter 3

Table 3.1 Provenance and tree sampling structure for each of the nine subspecies selected for targeted capture. Each provenance represents a pooled sample submitted for sequencing and contained between 4 to 8 trees in the sample. The 81 total provenances represented the natural range of the species...... 89

Table 3.2 Sequencing metrics generated from targeted capture sequencing. Sequencing generated millions of reads and bases on average per species. Raw reads underwent quality control trimming, requiring a base quality score of 30 and minimum read length of 50bp. Average read length decreased slightly from the 150bp raw reads generated from sequencing to ~143bp...... 90

v

LIST OF FIGURES

Chapter 2

Figure 2.1 Work flow for SNP discovery using RNA-seq data. Work flow includes two phases, bioinformatics (blue) and probe design (purple). Software used and file outputs are shown in the ovals underneath the colored boxes...... 60

Figure 2.2 Sharing pattern for each species' left and right flanking regions when allowing for zero, one, or two differences within a flanking region to be considered shared. For each species, a large portion of flanking regions are unique to that species regardless of differences allowed. Allowing for sequence differences increases the number of flanking regions shared by two or three species at a larger proportion than sharing between four or all...... 61

Figure 2.3 Sharing of left flanking regions allowing for zero differences across species. P. maximinoi has the highest number of unique flanking regions while P. tecunumanii and P. oocarpa have a high number of shared flanks between the two as expected from phylogeny ...... 62

Figure 2.4 Principal component analysis of left flanking (a) and right flanking (b) regions when allowing for zero differences in flanking sequence. Correlation between species that are more closely related phylogenetically is strong as indicated by loading angles. The more closely related P. tecunumanii / P. oocarpa and P. patula / P. greggii pairs have angles that are close to one another. The more distantly related P. maximinoi segregates itself...... 63

Figure 2.5 Repetitive mapping of left and right flanking regions for each species when mapped to the v2.01 P. taeda genome assembly. Over 50% of flanking regions in all species map to a unique location within the genome reference...... 64

Figure 2.6 Sharing pattern for each species' left and right flanking regions when mapped to v2.01 of the P. taeda genome assembly. Flanking regions are considered shared if they mapped within two bases of another. A large portion of flanking regions are unique to a given species while very few map to the same location in all species...... 65

Figure 2.7 Distribution of left flanking regions shared mapping locations. Each species contains a large number of regions unique to the individual. Pinus patula / P. greggii and P. tecunumanii / P. oocarpa species combinations sharing mapping locations at a higher proportion between themselves than with other species ...... 66

Figure 2.8 Distribution of right flanking regions shared mapping locations. A relatively small number of regions are shared between four or five species. Pinus maximinoi has the largest number of uniquely mapping regions and species groupings are consistent with that found in the left regions...... 67

vi

Figure 2.9 Principal component analysis of left flanking (a) and right flanking (b) regions shared mapping locations. Correlation between species that are more closely related phylogenetically is strong as indicated by loading angles. The more closely related P. tecunumanii / P. oocarpa and P. patula / P. greggii pairs have angles that are close to one another, while the more distantly related P. maximinoi distant from any other species...... 68

Chapter 3

Figure 3.1 Work flow for SNP discovery using targeted sequencing data. Work flow includes two phases, bioinformatics (blue) and probe design (purple). Software used and file routines are shown in ovals underneath the colored boxes...... 91

Figure 3.2 The distribution of SNPs shared between the Top-down and Bottom-up filtering datasets. Between the two sets of SNPs generated from filtering approximately 403K SNPs were shared between the two. This resulted in approximately 1.36 million unique SNPs selected for probe design...... 92

Figure 3.3 Sharing pattern for SNPs which had probes successfully designed for them by species. SNP sharing patterns among the species are relatively similar with the exception of P. maximinoi which contains a large number of private SNPs. There are a total of just over 25K SNPs that are shared across all species ...... 93

Figure 3.4 Distribution of SNP sharing across species for SNPs with successful probes. SNP sharing patterns among the species are relatively similar with the exception of P. maximinoi which contains a large number of private SNPs when compared to the others ...... 94

Figure 3.5 Sharing pattern for SNPs with successful probes by subspecies. Subspecies exhibited a relatively similar number of private SNPs (~20K) with the exception of P. maximinoi (~120K). All species shared just under 25K SNPs between each other ...... 95

Figure 3.6 Sharing pattern for SNPs with successful probes by provenance. Provenances exhibited a relatively similar number of private within them. Pinus maximinoi provenances had a much larger number of SNPs shared between 2 to 14 provenances than other species...... 96

Figure 3.7 Principal component analysis of SNP sharing for species (a) and subspecies (b). Correlation of SNP sharing at the species level appears to be high as indicated by the narrow loading angles. Pinus maximinoi segregates itself from the other species showing little correlation. Subspecies follow a pattern as expected from phylogeny with subspecies of the same species sharing a similar loading angle. .... 97

Figure 3.8 Repetitive mapping of left and right flanking regions for SNP probes when mapped to the "tropicalized" v2.01 P. taeda genome assembly. Over 75% of

vii

flanking regions in all species map to a unique location within the genome reference...... 98

viii

Chapter 1. Pines, Tree Improvement, and Genetics

1.1 Introduction

The genus Pinus consists of over 100 species and is the largest genus of

(Price et al. 1998). Pine species are widely distributed exclusively in the northern hemisphere, with the one exception being a population of Pinus merkusii, which is found in the southern hemisphere (Mirov 1967). All pines are evergreen trees or shrubs that can be found in climatic regions varying from tropical to boreal and across elevation ranges from sea level to alpine in North and Central America along with most of Europe and Asia. Across this widespread range, there are two primary pockets of diversity, one in Mexico and Central

America and another in (Plomion et al. 2007). Individual species have varying natural ranges, with some such as P. resinosa, P. taeda, and P. contorta having very large geographical ranges, while P. radiata, P. greggii, and P. patula have much smaller natural distributions.

Many of these species have found value in commercial timber markets. Globally, forest plantations are increasing in area. In 2015, the United States and China had 47 percent and 88 percent, respectively, more planted forests than they did in 1990 (FAO 2015). Within these plantations, pines play a key role and provide a significant portion of wood harvested on planted lands. In 2000, planted forests globally consisted of 20 percent Pinus species, with those making up significant percentages of planted lands in Asia (30%), North and Central

America (88%), and Africa (28%) (FAO 2015). Utilization of pines in plantation forestry is common due to their adaptability, diversity, and ability to be planted on a wide variety of sites. Pinus taeda, for instance, has shown the ability to undergo morphological and physiological changes in response to unfavorable conditions such as moderate and long-term

1

drought (Samuelson et al. 2014; Maggard et al. 2016). Additionally, pines can perform well on marginal sites and are, good candidates for afforestation and reforestation efforts (Pinto et al. 2011; López-Tirado and Hidalgo 2016).

Even though the natural ranges of almost all pines are restricted to the northern hemisphere, their high productivity and success in plantations has resulted in the establishment of land race populations of many species south of the equator. A land race can be described as a population of individuals that has become adapted to an environment in which it has been planted (Zobel and Talbert 1984). In many cases, these land races are more productive than naturally occurring species. One of the most prevalent is P. radiata, which has a small natural range off the coast of California and is now one of the most widely planted species of pines across the globe, with populations established in New Zealand,

Australia, and portions of South America (Sutton 1999). In South Africa, numerous species such as P. patula, P. oocarpa, P. tecunumanii, P. caribaea, and others have land race populations established and now play an important role in the country's commercial timber markets (Griess et al. 2016).

Species of tropical and subtropical pines native to Mexico and Central America are of particular interests when it comes to land race populations. The large amount of diversity within and between species has made them candidates for small scale deployments in areas where they may be most adapted. In the case of P. caribaea, there are approximately 1 million hectares of plantations globally, with the majority of that being in Venezuela (Dieters and Nikles 1997). Valued for its rapid growth, P. caribaea provides a cautionary tale when establishing land race populations. When first established in South Africa, it proved to have much less desirable wood quality characteristics than expected (Zobel and Talbert 1984).

2

Pinus greggii has shown widespread adaptability across testing sites at a variety of elevations and latitudes. It has also shown potential to survive on colder dryer sites than P. patula. At this point, P. greggii is almost exclusively planted in South Africa for commercial uses

(Dvorak et al. 2000). Pinus patula has been planted on over 1 million hectares worldwide with much of that in eastern and southern Africa. Despite not being as cold tolerant as P. greggii, it performs well on sites that are less tropical than other species and has good wood quality that is useful for a number of different purposes. Pinus tecunumanii has been established on approximately 10,000 hectares worldwide with most plantations occurring in

Brazil and Colombia (Dvorak et al. 2000). Pinus tecunumanii has shown fast growth and wood quality characteristics when compared to other tropical species and has been the focus of hybridization studies to potentially incorporate growth traits and disease resistance with other species (Kanzler et al. 2012; Hongwane et al. 2018). Two other species P. maximinoi and P. oocarpa have been deployed on a much smaller scale. Although, these species show utility in niche deployments potentially offering advantages through future study (Dvorak et al. 2000).

Ultimately, the demand for high value timber products such as lumber, poles, pulp and paper has driven the increase in plantation area and establishment of such land race populations. In addition to these standard timber products, there has been a growing interest in using pine slash and residuals from harvest for the manufacture of various phenolic compounds, essential oils, pellet for bioenergy and other secondary products (Kelkar et al.

2006; Rodrigues-Corrêa et al. 2012). Even more recently, there has been extensive research into the use of pine as a source of feedstock for lignocellulosic biofuel production (Lu et al.

3

2015; Vaid et al. 2018; Dong et al. 2018). The value of these products combined with the versatility of pines makes these species an important focus of study.

1.2 Genetics and Breeding of Pines

Tree breeding and genetic improvement in pines is in its relative infancy when compared to commercial agricultural crops. The first generation breeding population of loblolly pine (P. taeda) was established from selections of phenotypically superior individuals in naturally occurring stands, in the 1950's (Zobel and Talbert 1984). These selections were used to create full-sib families from breeding orchards and families were subsequently tested in progeny tests. Similar steps have been taken in other species of pine as well in establishing breeding programs (Ruotsalainen 2014). In forest trees, estimation of trait heritability and breeding values are done almost exclusively using phenotypic data from progeny testing and pedigree information through the use of best linear unbiased predictions

(BLUP).

Field testing of progeny is both time consuming and expensive, due to test establishment, maintenance, and measurement. Typically, testing requires relatively large areas of land and a large number of progenies to be tested across a number of sites to make accurate estimations about genetic parameters. It has been shown that accurate estimations for selection can be made on uniform sites at age six for loblolly pine (McKeand 1988). New breeding strategies coupled with efficient testing designs can reduce some labor and cost, but the process remains costly primarily due to the maintenance and duration of the tests. These challenges create an opportunity to leverage molecular data as a way to reduce costs associated with testing.

4

All Pinus species are diploid (2n=24), many with very large genome sizes, ranging from 20-30Gb (De La Torre et al. 2014). This is much larger than other tree genome references such a poplar, (~500Mbp) (Tuskan et al. 2006), eucalyptus (~640Mbp) (Myburg et al. 2014), and have genomes between 133-296 times bigger than Arabidopsis thaliana (The

Arabidopsis Genome Initiative, 2000). At the whole genome level, it has been shown that approximately 14% of the pine genome is made up of low- or single-copy sequences, while

86% are highly repeated sequences (Wakamiya et al. 1993). Within the genus Pinus, one of the most characterized genomes is that of P. taeda which had its draft genome assembled in

2014 (Zimin et al. 2014) and then an improved assembly released in 2017 (Zimin et al.

2017). Although still an incomplete reference assembly, the P. taeda genome provides an attractive platform for the implementation evolutionary, population structure, and diversity studies in other species of pines.

Pine genome mapping efforts began as early as the 1970's and 1980's using isozyme loci that were quite small and limited in number (Rudin 1978; Conkle 1979). With the advent of restriction enzymes, PCR technologies and greater sequencing capabilities, DNA markers were used to expand upon the earlier maps. The first maps to use DNA marker technology were assembled using restriction fragment length polymorphisms (RFLPs) in P. taeda and P. radiata, finding 20 and 22 linkage groups respectively (Devey et al. 1994, 1996). Since then, the use of more advanced sequencing technology has allowed scientists to use single nucleotide polymorphisms (SNPs) and other molecular markers to enhance coverage and detail of these maps (Eckert et al. 2009; Echt et al. 2011; Martínez-García et al. 2013).

Quantitative trait loci (QTLs) involved in many traits important to the improvement of pines have been identified. Many traits important to forestry, such as volume growth,

5

disease resistance, and wood quality are complex quantitative traits with complex inheritance. Quantitative genetics assumes that the genetic control of a complex trait can be attributed to the additive effects of the many loci involved and is then affected by the environment (White et al. 2007). Quantitative trait loci are locations within a genome that are associated with a given quantitative trait. Many different quantitative traits have been mapped, including wood quality traits, such as wood specific gravity and density, pulp yield, cell wall content, and average microfibril angle (Groover et al. 1994; Kumar et al. 2000; Ball

2001; Sewell et al. 2002; Brown et al. 2003). Growth traits have also been of interest. QTL studies assessing many growth traits including average numbers of branch per whorl, average branch diameter, and average whorl spacing (Shepherd et al. 2002a) have been performed.

Studies examining forking (Xiong et al. 2016) and tree height and diameter (Kaya et al.

1999; Weng et al. 2002) have been done as well. Links between qualitative traits and markers have been shown and have led to the discovery of disease resistance markers such as in the case of fusiform rust (Cronartium quercuum f. sp. fusiforme) in P. taeda (Wilcox et al. 1996) and pitch canker (Fusarium circinatum) in P. radiata (Moraga-Suazo et al. 2014).

In contrast to crop and animal species, there seems to be a more limited ability to detect QTLs in pines and other forest trees. Detection of QTLs largely depends on the quality of phenotypic assessment. In many cases, these assessments are challenging because of the high environmental and developmental variation in tree plantations or the small population sizes used in the analyses (Plomion et al. 2007). In addition, the results from Brown et al.

(2003), when compared to other studies in tree species, suggests that in many cases, the amount of variance described by QTLs in trees are overestimated. Even with potentially inflated estimations, the amount of total variation explained by these QTL experiments have

6

been quite low. This suggests that there is a more complex gene structure at work, having fewer major effect loci and many more of low effect. This is not surprising, given the highly complex inheritance patterns suggested for growth and yield traits (Hill et al. 2008).

Additionally, QTL stability across tree age can vary (Lerceteau et al. 2001). This indicates that the loci responsible for characteristics may change depending upon stage of growth or environmental effects being encountered.

Leveraging QTL information for the use of marker assisted selection (MAS) was an attractive proposition during the time of QTL mapping. MAS was of particular interest in forest trees due to their long breeding cycles and was seen as a way to reduce the time associated with them. Evaluation of the technology showed that within-family selection using

MAS was not useful unless used in weakly heritable traits and used with very large numbers of samples along with a high selection intensity (Strauss et al. 1992). They also concluded that there may be value in MAS in interspecies hybrids and hard to measure phenotypic data such as wood quality. It was suggested that MAS could be utilized in commercial breeding programs if families with evidence of major gene effects could be identified. However, the testing capacity of a program would have to increase in order to meet the large number of samples needed to utilize MAS (O’Malley and Mckeand 1994). Ultimately, despite a significant amount of time and resources being put into QTL research, there has been limited to no success in utilizing MAS for breeding in forest trees (Isik 2014).

Despite the issues in using MAS in forest trees, genomic selection, a new frontier in utilizing molecular genetic data is being explored. Using simulated data it was shown that selections can be made with very accurate genomic information (Meuwissen et al. 2001).

Genomic selection is defined as the selection of individuals based on genomic estimated

7

breeding values, which are estimated by solving prediction equations based on single nucleotide polymorphism (SNP) markers (Hayes et al. 2009). Genomic selection is of great interest to tree breeders due to the large number of potentially informative SNPs and the potential to reduce progeny testing efforts and the time taken to complete a breeding cycle.

Genomic selection is being deployed primarily in populations that exhibit high linkage disequilibrium and that are highly structured (Thavamanikumar et al. 2013).

Successful genomic selection programs have been deployed in cattle (Hayes et al. 2009), rice

(Oryza sativa) (Spindel et al. 2015), and others (Oh, Na, & Park, 2017; Zhao et al., 2012).

There has been little work done in pines and other forest trees, however. In forest trees, under simulated data, it has been shown that genomic selection is possible and advantageous compared to traditional BLUP based-selection, given a genotyping density above 10 markers/cM for effective population sizes of 60 and up to 20 markers/cM for populations with an effective size of 100 (Grattapaglia and Resende 2011). These results were then supported in study examining two unrelated elite populations of eucalyptus by matching or exceeding the traditional phenotypic selection method accuracies in all traits (Resende et al.

2012b). A study performed on P. pinaster produced a lower estimates of the correlation of true and estimated breeding values obtained from progeny tests using genomic selection (Isik et al. 2015). In P. taeda, it was shown that selection efficiency per unit time was 53-112% higher using genomic selection and could significantly reduce the time required to complete a breeding cycle (Resende et al. 2012a). These results, along with reduced genotyping costs and readily available molecular markers through next generation sequencing, are making genomic selection the next area of focus in pine and tree breeding.

8

1.3 Molecular Markers

Polymorphisms that are found in populations and reveal neutral sites of variation in

DNA are called molecular markers (Kumar et al. 2009). Molecular markers provide scientists and plant breeders a valuable tool to help understand the architecture of these complex quantitative traits. According to Gupta et al. (2001), there are three classifications of molecular markers: first generation, second generation, and third generation markers. He defines first generation markers as restriction fragment length polymorphisms (RFLPs) and random amplified polymorphic DNAs (RAPDs). He categorizes second generation markers include amplified fragment length polymorphisms (AFLPs) and simple sequence repeats

(SSRs) and the third generation of molecular markers as single nucleotide polymorphisms

(SNPs).

Restriction fragment length polymorphisms (RFLPs) were first used in the construction of the genetic map of humans (Botstein et al. 1980). This technique differentiates organisms by analysis patterns of their cleaved DNA. Utilization of a specific restriction endonuclease allows for comparison of individuals based on the distance between cut sites in their DNA (Kumar et al. 2009). The resulting fragments are then separated via electrophoresis. During electrophoresis, high voltage is added so that the DNA fragments are separated by size and then detected using a camera system. Using software, the similarity of the size of cut DNA to the sample of interest is determined and indicates similarity between the samples.

Despite being, co-dominantly inherited and very reproducible, the use of RFLPs is costly, time consuming, and requires a large quantity of highly purified DNA for analysis

(Agarwal et al. 2008). RFLPs have been extensively used in crop for mapping

9

purposes and to assess diversity of breeding populations (Schön et al. 1994; Kim and Ward

2000). In pines, despite finding a number of markers that could be used for high density mapping, only two such studies were done. One was performed to create a linkage map for loblolly pine based on a third generation outbred population (Devey et al. 1994). The other was to create a linkage map for P. radiata (Devey et al. 1996). Ultimately, these studies were the basis of a comparative mapping project which showed high levels of synteny (similarity of genetic maps possibly due to shared ancestry) between the two species (Devey et al.

1999).

Random amplified polymorphic DNA (RAPD) markers utilize PCR and random primers to detect variation in DNA. The procedure is used to detect nucleotide sequence polymorphisms through the use of a given primer (Williams et al. 1990). Through the use of a short primer of random sequence, genomic DNA is bound and amplified using PCR technology. The resulting amplified strands are stained, run through gel and then viewed under ultraviolet light. The presence or absence of a band indicates differences between the

DNA template strands due to sequence differences in the priming sites. The use of RAPD markers is fairly easy due to the low amount of purified DNA needed for the reaction, easily accessible primers, and no prior knowledge of the genome being studied needed. However, concerns about reproducibility exist between and within labs because of the sensitivity to reaction conditions and the risk of contamination due to the nonspecific nature of primer binding (Rabouam et al. 1999).

Despite the issues with reproducibility, RAPD analysis was optimized in pines using

P. radiata as a template (Devey et al. 1996; Ostrowska et al. 1998). However genomic mapping efforts using RAPDs had been made prior to their optimization. Genetic maps were

10

created using RAPDs for P. pinaster (Plomion et al. 1995), P. elliottii (Nelson et al. 1993), and others(Nelson et al. 1994; Yazdani et al. 1995). RAPD markers have been shown to be useful in highlighting genetic diversity in highly conserved genomes such as the case with P. resinosa (DeVerno and Mosseler 2008) and have been used to assess the genetic variation within endangered and threatened species (Newton et al. 2002; Zhang et al. 2005).

Amplified fragment length polymorphisms (AFLPs) are categorized as a second generation marker and selectively target sequences of DNA through the use of specific restriction enzymes for amplification. Polymorphisms are then detected based on differences in length of the amplified segments on gel. Through the use of two restriction enzymes, genomic DNA is fragmented. Then double stranded adaptors are ligated to the end of the fragments generating a template for PCR reactions downstream. Ligated fragments are amplified using PCR reactions. The amplified DNA is then separated via gel electrophoresis.

AFLPs present an useful combination of RFLP and PCR technologies (Agarwal et al. 2008).

AFLPs are highly reproducible and have the ability to generate large amounts of fragment markers without prior knowledge of genomic sequence. This fact makes them useful in the investigation of non-model organisms which may have limited genomic information available. Two challenges when using AFLPs are the need for highly purified

DNA and the ambiguity of analyzing the bands present in the gel (Kumar et al. 2009). In forest trees AFLPs have shown utility in population structure and genome mapping (Cervera et al. 2000). In pines, linkage maps for P. radiata (Moraga-Suazo et al. 2014), P. taeda

(Remington et al. 1999), and P. edulis (Travis et al. 1998) have been created using AFLPs.

Numerous population structure, diversity, and hybridization studies have been performed

11

using the technique as well (Diaz et al. 2001; Ribeiro et al. 2002; Di and Wang 2013;

Vasilyeva and Semerikov 2014).

Another type of second generation makers are simple sequence repeats (SSRs), also known as microsatellites. SSRs are short segments of repetitive DNA that generally consist of 1-5 nucleotides and are dispersed throughout eukaryotic genomes (Powell et al. 1996).

Microsatellite makers can be found in both non-transcribed and transcribed regions of the genome. In transcribed regions of the genome expressed sequence tags can be mined for SSR markers (Varshney et al. 2005). SSRs tend to be much more informative than AFLPs or

RAPDs since they are co-dominant markers. The ability to segregate based on zygosity is one of the primary draws of SSR technology. In a given population, there are generally several alleles for the same locus. For example, using hypothetical numbers, a given individual in a population may contain 6, 7, 8, or 10 repeats. This type of information can be useful in identifying different individuals within the population (Oliveira et al. 2006), while there are differences in the number of repeats, the upstream and downstream sequences flanking them are generally highly conserved (Jarne and Lagoda 1996).

SSR markers are classified into three categories: perfect, imperfect and complex/compound. These classifications are based on the fragments being repeated. A perfect SSR is where the repeat sequence is unbroken by any additional nucleotides, i.e. the sequence AT repeated 12 time (AT12). A SSR is imperfect when the repeat sequence is broken by other nucleotides, i.e. the sequence AT repeated six times followed by GC and then AT another six times (AT6GC1AT6). A complex/compound SSR is where two or more repetitive sequences are in line with each other, i.e. the sequence AT eight times followed by

GC repeated 10 times (AT8GC10). Additionally, markers are generally classified as

12

SSR/microsatellites if they are between 10 and 30 nucleotides long, while SSRs are classified as minisatellites between 30 and 100 bases long (Weber 1990).

SSRs have been shown to be highly transferable between species (Shepherd et al.

2002b). One such study found 251 and 168 SSRs in unigene sets (transcripts that appear to stem from the same locus) of P. taeda and P. pinaster respectively with the most common repeat type being trinucleotides. For these markers they designed 72 primers and found single band amplification rates in non-source species ranging from 64.6% to 94.2% (Chagné et al.

2004). Transferability has been found in organelle DNA within pines as well. Variation in length of mononucleotide repeats in chloroplast DNA were shown in 11 species of pine

(Powell et al. 1996). Within pines, AT repeats seem to be the most common motif across species (Smith and Devey 1994; Bai et al. 2014).

SSRs are highly polymorphic and can easily be used for population structure experiments (Smith and Devey 1994). Since microsatellites exhibit Mendelian inheritance, it is possible to use them for fingerprinting (Liewlaksaneeyanawin et al. 2004), and evolutionary history of species (Boys et al. 2005). A linkage map for P. taeda has been created using microsatellites as well (Echt et al. 2011). In addition, SSRs have been widely used in other areas of genetics: population genetics, pedigree construction, marker assisted breeding, and genetic conservation.

The use of SSR analysis does have some difficulties however. Microsatellites can have a high mutation rate which can cause incorrect conclusions to be drawn about the genetic history of an individual (Queller et al. 1993). Additionally, homoplasy (the occurrence of similar gene structures without being present in a common ancestor) occur at

SSR loci due to mutation, which can cause a misrepresentation of genetic divergence

13

(Wheeler et al. 2014). Several mutation mechanisms can affect the results received from

DNA SSRs. There are two primary models used to characterize the mutations in SSRs, infinite allele model and stepwise model, each having their own shortcomings (Jarne and

Lagoda 1996).

The third generation of molecular makers are single nucleotide polymorphisms

(SNPs), pronounced "snips". When compared with other molecular markers, SNPs appear with much higher frequency and are far more prevalent (Gupta et al. 2001). SNPs are defined as a single nucleotide change when compared to a reference sequence. There are two types of

SNPs, transitions and transversions. Transitions occur when a purine base is replaced with another purine base. A transversion occurs when a purine base is replaced with a pyrimidine or vice versa (Slonczewski and Foster 2009).Given their abundance, they have become the marker of choice for genetic studies. These markers are also easy to produce and cost efficient. It has been shown that in elite populations of maize, polymorphisms can be as common as 1 per 31 bases in non-coding regions and 1 per 124 bases in coding regions

(Ching et al. 2002). A SNP frequency of 1 SNP per 103 bases using expressed sequence tag

(ETS) data was found in P. pinaster (Dantec et al. 2004). Despite being very common throughout the genome, SNP association with traits tend to occur at a low frequency or account for very little of the genetic variance of a trait. A recent study found approximately

2.8 million SNPs using exome capture which included 34 SNPS and 11 SNP/SNP interactions that were associated with important growth traits in P. taeda (Lu et al. 2017). In another exome capture study using around 67K and 55K SNPs in P. tadea and P. elliottii respectively, extensive similarity between species was shown when looking at linkage disequilibrium and levels of genetic diversity (Acosta et al. 2019).

14

Before the implementation of next generation sequencing (NGS) technologies, SNPs were discovered though a variety of techniques to help cope with the repetitive nature of plant genomes. The development of SNP markers through EST mining followed by PCR validation and the resequencing of unigene amplicons using Sanger's sequencing method were two of the more popular methods for SNP discovery within genic regions (Mammadov et al. 2012). DNA sequencing is the process of identifying the order DNA bases through synthesis of a complementary strand of DNA and determining which nucleotides have been added.

In short, Sanger sequencing involves using a primer to bind next to the region of interest to facilitate the extension of the complementary strand of DNA. The use of a dideoxynucleotide causes the polymerase to terminate replication at various lengths along the amplicon. Each type of dideoxynucleotide was radioactively marked and were detected by autoradiography (Sanger et al. 1977).

SNP mining through the use of ESTs avoids the problems of de novo discovery by using available cDNA libraries from sources such as NCBI. ESTs are mRNA sequences derived from single sequencing reactions performed on randomly selected clones from cDNA libraries (Parkinson and Blaxter 2009). EST sequences are compiled and SNPs identified and filtered using bioinformatics software. The SNPs are then validated through software and re-sequencing. PCR is used to amplify sequences of interest and those that yield well defined bands are the re-sequenced for validation (Dong et al. 2010). EST mining techniques have been used for SNP discovery in both P. pinaster (Dantec et al. 2004;

Chancerel et al. 2011) and P. taeda (Eckert et al. 2009).

15

Next generation sequencing (NGS) technologies, for the most part, utilize sequencing by synthesis (SBS) methods. SBS is where the sequence of DNA is read while the complementary strand is being synthesized. The sequence can be determined by the use of labeled nucleotides which can be differentiated from one another (Goodwin et al. 2016).

Methods used by Illumina for their NGS technology have similarities to Sanger sequencing through the use of a terminator nucleotide. As labeled bases are incorporated, the sequence is then determined. Pyrosequencing uses synthesis reactions to emit light which can then be used to identify the base being added to the complementary DNA strand (Ronaghi et al.

1998). These technologies generally prove shorter read lengths than older technologies but provide a higher number of reads, allowing for greater sequencing depth. This in turn allows for higher confidence in the base call and reduces error (Sims et al. 2014).

In SNP discovery there are two main components, read mapping and SNP calling.

Read mapping is a process that aligns reads generated from sequencing to that of a reference set of DNA. This reference is generally a complete genome reference of a well-characterized organism. However, with NGS technology, references can now be de novo assemblies of more closely related organisms or even the target organism itself (Kumar et al. 2012). The use of read assembly algorithms such as, Bowtie (Langmead et al. 2009) and Burrows-

Wheeler Aligner (BWA) (Li and Durbin 2009) are good tools for handling short-read, high- throughput data. High throughput methods use the DNA sequences from multiple individuals to compare their sequences to a reference. When the sequenced reads have been aligned to the reference, they can then be scanned for potential SNPs.

SNP calling, or variant calling, is the process of locating where the physical positions of the SNPs are located in the genome, based on a reference sequence. Potential SNPs are

16

identified by SNP calling software such as SAMTools (Li 2011), FreeBayes (Garrison &

Marth 2012), or VarScan (Koboldt et al. 2009). These programs use algorithms to scan reads for mismatches that could be potential SNPs. When assessing these mismatches, sequencing depth and read map accuracies are assessed to determine the quality of the SNP and identify it as a possible error (You et al. 2012). These variant callers each are biased towards different types of SNP calling errors, meaning careful consideration must be made when selecting a program based on the data being used (Hwang et al. 2016). This SNP calling process requires a high level of confidence in the quality and accuracy of the reference sequence to prevent false positives. Despite the utility of these variant callers, often errors in read mapping and variant calling are compounded, which requires additional custom screening to be done on the compiled set of called variants.

The ability to identify SNPs is an important tool in human, animal, and plant genetics.

SNPs can be utilized in a number of different ways. In humans SNP identification has been used to identify disease susceptibility and other complex traits (Zheng et al. 2009; Frazer et al. 2009). In plants, SNPs have the ability to produce dense genetic maps (Rafalski 2002) which can be used in marker assisted selection, help understand genome organization, or to identify genotypes. SNPs can also be used to assess linkage disequilibrium in populations, for genome wide association studies, and have shown great potential in genomic selection.

SNP genotyping arrays can be a useful tool in utilizing SNP data for the above mentioned work. SNP genotyping arrays are chips containing a set of selected probe sequences. A probe is a DNA sequence that is complementary to the region of interest.

Although array processes vary between platforms, they all involve the production of a signal in the presence of a targeted marker. Currently, ThermoFisher Scientific’s Axiom technology

17

is being used for the construction of a commercial array for pine species. Under this platform, four steps are used: amplification and fragmentation of DNA samples, hybridization, ligation, and signal amplification (ThermoFisher). This technology hybridizes amplified fragmented DNA to complementary capture probes. Following hybridization, bound target is washed to remove noise and ligated with a multi-color solution probe.

Following ligation, arrays are stained and imaged to quantify marker presence.

In pine, the implementation of SNP array technology is in its infancy when compared to crop and animal species. A high density 9K SNP chip was created for P. pinaster (Plomion et al. 2016), however, it gives a much narrower view than the 57K array created for maize

(Ganal et al. 2011), 60K chip for eucalyptus (Silva-Junior et al. 2015a), or 50K chip in cattle

(Matukumalli et al. 2009). Utilization of very large SNP genotyping arrays have been used for genome-wide association studies in humans and other organisms (McCarthy et al. 2008).

In animals, the use of these large number of markers have been used in genomic selection and have accurately predicted the breeding values in non-phenotyped populations (Erbe et al.

2012). This type of utility could have huge implications in tree improvement and breeding by reducing the labor and time associated with the process.

1.4 Summary

Given their commercial value and the projected increase in demand for wood products, genetic improvement of pine species is an important area of study. Traditional tree improvement methods have been effective in creating faster growing, straighter, and more disease-resistant trees, however, the process is still time consuming and costly. With the advent of next generation sequencing platforms, decreased genotyping costs and more efficient software for analysis, utilization of SNP marker data obtained from sequence data

18

for tree breeding is more accessible than ever before. These technologies have the potential to revolutionize tree breeding and improvement through reducing breeding cycles and increasing the efficiency in estimating genetic parameters.

This study aims to contribute to this effort by performing targeted genome and gene- based SNP discovery for the development of a high-throughput, genome wide, genotyping array for six tropical pine species. Two custom bioinformatics pipelines were developed for

SNP discovery using datasets generated from RNA-sequencing and targeted capture sequencing, generating a high quality set of SNP probes for assessment.

19

1.7 Literature Cited

Acosta JJ, Fahrenkrog AM, Neves LG, et al (2019) Exome re-sequencing reveals

evolutionary history, genomic diversity, and targets of selection in the conifers Pinus

taeda and Pinus elliottii. Genome Biol Evol 11:508–520. doi: 10.1093/gbe/evz016

Agarwal M, Shrivastava N, Padh H (2008) Advances in molecular marker techniques and

their applications in plant sciences. Plant Cell Rep 27:617–631. doi: 10.1007/s00299-

008-0507-z

Bai T-D, Xu L-A, Xu M, Wang Z-R (2014) Characterization of masson pine (Pinus

massoniana Lamb.) microsatellite DNA by 454 genome shotgun sequencing. Tree Genet

Genomes 10:429–437. doi: 10.1007/s11295-013-0684-y

Ball RD (2001) Bayesian Methods for Quantitative Trait Loci Mapping Based on Model

Selection: Approximate Analysis Using the Bayesian Information Criterion. Genetics

159:1351–1364

Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map

in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–31

Boys J, Cherry M, Dayanandan S (2005) Microsatellite analysis reveals genetically distinct

populations of red pine (Pinus resinosa, ). Am J Bot 92:833–841. doi:

10.3732/ajb.92.5.833

Brown GR, Bassoni DL, Gill GP, et al (2003) Identification of Quantitative Trait Loci

Influencing Wood Property Traits in Loblolly Pine (Pinus taeda L.). III. QTL

Verification and Candidate Gene Mapping. Genetics 164:1537–1546

Cervera MT, Remington D, Frigerio J-M, et al (2000) Improved AFLP analysis of tree

species. Can J For Res 30:1608–1616

20

Chagné C, Ramboer A, Collada C, et al (2004) Cross-species transferability and mapping of

genomic and cDNA SSRs in pines. Theor Appl Genet 109:1204–1214. doi:

10.1007/s00122-004-1683-z

Chancerel E, Lepoittevin C, Le Provost G, et al (2011) Development and implementation of

a highly- multiplexed SNP array for genetic mapping in maritime pine and comparative

mapping with loblolly pine. BMC Genomics 12:. doi: 10.1186/1471-2164-12-368

Ching A, Caldwell KS, Jung M, et al (2002) SNP frequency, haplotype structure and linkage

disequilibrium in elite maize inbred lines. BMC Genet 3:1–14. doi:

http://www.biomedcentral.com/1471-2156/3/19

Conkle MT (1979) Isozyme Variation and Linkage in Six Species. In: Isozymes of

North American Forest Trees and Forest Insects. pp 11–17

Dantec L Le, Chagné D, Pot D, et al (2004) Automated SNP detection in expressed sequence

tags: Statistical considerations and application to maritime pine sequences. Plant Mol

Biol 54:461–470. doi: 10.1023/B:PLAN.0000036376.11710.6f

De La Torre AR, Birol I, Bousquet J, et al (2014) Insights into Conifer Giga-Genomes.

PLANT Physiol 166:1724–1732. doi: 10.1104/pp.114.248708

DeVerno LL, Mosseler A (2008) Genetic variation in red pine ( Pinus resinosa ) revealed by

RAPD and RAPD-RFLP analysis. Can J For Res 27:1316–1320. doi: 10.1139/x97-090

Devey ME, Beil JC, Smith DN, et al (1996) A genetic linkage map for Pinus radiata based on

RFLP, RAPD, and microsatellite markers. Theor Appl Genet 92:673–679. doi:

10.1007/BF00226088

Devey ME, Fiddler TA, Liu B-H, et al (1994) An RFLP linkage map for Ioblolly pine based

on a three-generation outbred pedigree. Theor Appl Genet 88:273–278

21

Devey ME, Sewell MM, Uren TL, Neale DB (1999) Comparative mapping in loblolly and

radiata pine using RFLP and microsatellite markers. Theor Appl Genet 99:656–662. doi:

10.1007/s001220051281

Di X-Y, Wang M-B (2013) Genetic diversity and structure of natural Pinus tabulaeformis

populations in North China using amplified fragment length polymorphism (AFLP).

Biochem Syst Ecol 51:269–275. doi: 10.1016/j.bse.2013.09.013

Diaz V, Muniz LM, Ferrer E (2001) Random amplified polymorphic DNA and amplified

fragment length polymorphism assessment of genetic variation in Nicaraguan

populations of Pinus oocarpa. Mol Ecol 10:2593–2603. doi: 10.1046/j.0962-

1083.2001.01390.x

Dieters MJ, Nikles DG (1997) The genetic improvement of caribbean pine (Pinus caribaea

Morelet), building a firm foundation. In: 245h Southern Forest Tree Improvement

Conference. pp 33–52

Dong C, Wang Y, Zhang H, Leu S-Y (2018) Feasibility of high-concentration cellulosic

bioethanol production from undetoxified whole Monterey pine slurry. Bioresour

Technol 250:102–109. doi: 10.1016/J.BIORTECH.2017.11.029

Dong J, Qing-liang Y, Fu-sheng W, Li C (2010) The Mining of Citrus EST-SNP and Its

Application in Cultivar Discrimination. Agric Sci China 9:179–190

Dvorak WS, Gutierrez EA, Hodge GR, et al (2000) Conservation & Testing of Tropical &

Subtropcial Forest Tree Species by the CAMCORE Cooperative. NCSU

Echt CS, Saha S, Krutovsky K V, et al (2011) An annotated genetic map of loblolly pine

based on microsatellite and cDNA markers. BMC Genet 12:1–16. doi: 10.1186/1471-

2156-12-17

22

Eckert AJ, Pande B, Ersoz ES, et al (2009) High-throughput genotyping and mapping of

single nucleotide polymorphisms in loblolly pine (Pinus taeda L.). Tree Genet Genomes

5:225–234. doi: 10.1007/s11295-008-0183-8

Erbe M, Hayes B, Matukumalli L, et al (2012) Improving accuracy of genomic predictions

within and between dairy cattle breeds with imputed high-density single nucleotide

polymorphism panels. J Dairy Sci 95:. doi: 10.3168/jds.2011-5019

FAO (2015) Global Forest Resouces Assessment

Frazer KA, Murray SS, Schork NJ, Topol EJ (2009) Human genetic variation and its

contribution to complex traits. Nat Rev Genet 10:241–251. doi: 10.1038/nrg2554

Ganal MW, Durstewitz G, Polley A, et al (2011) A Large Maize (Zea mays L.) SNP

Genotyping Array: Development and Germplasm Genotyping, and Genetic Mapping to

Compare with the B73 Reference Genome. PLoS One 6:e28334. doi:

10.1371/journal.pone.0028334

Garrison E, Marth G. Haplotype-based variant detection from short-read

sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012

Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-

generation sequencing technologies. Nat Rev Genet 17:333–351. doi:

10.1038/nrg.2016.49

Grattapaglia D, Resende MD V. (2011) Genomic selection in forest tree breeding. Tree

Genet Genomes 7:241–255. doi: 10.1007/s11295-010-0328-4

Griess VC, Uhde B, Ham C, Seifert T (2016) Product diversification in South Africa’s

commercial timber plantations: a way to mitigate investment risk. South For a J For Sci

78:145–150. doi: 10.2989/20702620.2015.1136508

23

Groover A, Devey M, Fiddler T, et al (1994) Identification of quantitative trait loci

influencing wood specific gravity in an outbred pedigree of loblolly pine. Genetics 138:

Gupta PK, Roy JK, Prasad M (2001) Single nucleotide polymorphisms: A new paradigm for

molecular marker technology and DNA polymorphism detection with emphasis on their

use in plants. Curr Sci 80:524–535

Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME (2009) Invited review: Genomic

selection in dairy cattle: Progress and challenges. J Dairy Sci 92:433–443. doi:

10.3168/jds.2008-1646

Hill WG, Goddard ME, Visscher PM (2008) Data and theory point to mainly additive genetic

variance for complex traits. PLoS Genet 4:1–10. doi: 10.1371/journal.pgen.1000008

Hongwane P, Mitchell G, Kanzler A, et al (2018) Alternative pine hybrids and species to

Pinus patula and P. radiata in South Africa and Swaziland. South For. doi:

10.2989/20702620.2017.1393744

Hwang S, Kim E, Lee I, Marcotte EM (2016) Systematic comparison of variant calling

pipelines using gold standard personal exome variants. Sci Rep 5:17875. doi:

10.1038/srep17875

Isik F (2014) Genomic selection in forest tree breeding: The concept and an outlook to the

future. New For. 45:379–401

Isik F, Bartholomé J, Farjat A, et al (2015) Genomic selection in maritime pine. Plant Sci.

doi: 10.1016/j.plantsci.2015.08.006

Jarne P, Lagoda P (1996) Microsatellites, from molecules to populations and back. Trends

Ecol Evol 11:424–429

Kanzler A, Payn K, Nel A (2012) Performance of two Pinus patula hybrids in southern

24

Africa. South For. doi: 10.2989/20702620.2012.683639

Kaya Z, Sewell MM, Neale DB (1999) Identification of quantitative trait loci influencing

annual height- and diameter-increment growth in loblolly pine (Pinus taeda L.). Theor

Appl Genet 98:586–592. doi: 10.1007/s001220051108

Kelkar VM, Geils BW, Becker DR, et al (2006) How to recover more value from small pine

trees: Essential oils and resins. Biomass and Bioenergy 30:316–320. doi:

10.1016/J.BIOMBIOE.2005.07.009

Kim HS, Ward RW (2000) Patterns of RFLP-based genetic diversity in germplasm pools of

common wheat with different geographical or breeding program origins. Euphytica

115:197–208

Koboldt DC, Chen K, Wylie T, et al (2009) VarScan: variant detection in massively parallel

sequencing of individual and pooled samples. Bioinformatics 25:2283–2285. doi:

10.1093/bioinformatics/btp373

Kumar P, Gupta VK, Misra AK, et al (2009) Potential of Molecular Markers in Plant

Biotechnology. South Cross Journals 2:141–162

Kumar S, Banks TW, Cloutier S (2012) SNP discovery through next-generation sequencing

and its applications. Int J Plant Genomics 2012:. doi: 10.1155/2012/831460

Kumar S, Spelman RJ, Garrick DJ, et al (2000) Multiple-marker mapping of wood density

loci in an outbred pedigree of radiata pine. Theor Appl Genet 100:926–933. doi:

10.1007/s001220051372

Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient

alignment of short DNA sequences to the human genome. Genome Biol 10:. doi:

10.1186/gb-2009-10-3-r25

25

Lerceteau E, Szmidt AE, Andersson B (2001) Detection of quantitative trait loci in Pinus

sylvestris L. across years. Euphytica 121:117–122. doi: 10.1023/A:1012076825293

Li H (2011) A statistical framework for SNP calling, mutation discovery, association

mapping and population genetical parameter estimation from sequencing data.

Bioinformatics 27:2987–2993. doi: 10.1093/bioinformatics/btr509

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics 25:1754–1760. doi: 10.1093/bioinformatics/btp324

Liewlaksaneeyanawin C, Ritland CE, El-Kassaby YA, Ritland K (2004) Single-copy,

species-transferable microsatellite markers developed from loblolly pine ESTs. Theor

Appl Genet. doi: 10.1007/s00122-004-1635-7

López-Tirado J, Hidalgo P, (2016) Ecological niche modelling of three Mediterranean pine

species in the south of Spain: a tool for afforestation/ reforestation programs in the

twenty-first century. New For 47:411–429. doi: 10.1007/s11056-015-9523-3

Lu M, Krutovsky K V., Nelson CD, et al (2017) Association genetics of growth and adaptive

traits in loblolly pine (Pinus taeda L.) using whole-exome-discovered polymorphisms.

Tree Genet Genomes. doi: 10.1007/s11295-017-1140-1

Lu X, Withers MR, Seifkar N, et al (2015) Biomass logistics analysis for large scale biofuel

production: Case study of loblolly pine and switchgrass. Bioresour Technol 183:1–9.

doi: 10.1016/J.BIORTECH.2015.02.032

Maggard A, Will R, Wilson D, Meek C (2016) Response of Mid-Rotation Loblolly Pine

(Pinus taeda L.) Physiology and Productivity to Sustained, Moderate Drought on the

Western Edge of the Range. Forests 7:203. doi: 10.3390/f7090203

Mammadov J, Aggarwal R, Buyyarapu R, Kumpatla S (2012) SNP markers and their impact

26

on plant breeding. Int J Plant Genomics 2012:1–11. doi: 10.1155/2012/728398

Martínez-García PJ, Stevens KA, Wegrzyn JL, et al (2013) Combination of multipoint

maximum likelihood (MML) and regression mapping algorithms to construct a high-

density genetic linkage map for loblolly pine (Pinus taeda L.). Tree Genet Genomes

9:1529–1535. doi: 10.1007/s11295-013-0646-4

Matukumalli LK, Lawley CT, Schnabel RD, et al (2009) Development and Characterization

of a High Density SNP Genotyping Assay for Cattle. PLoS One 4:e5350. doi:

10.1371/journal.pone.0005350

McCarthy MI, Abecasis GR, Cardon LR, et al (2008) Genome-wide association studies for

complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9:356–369. doi:

10.1038/nrg2344

McKeand SE (1988) Optimum Age For Family Selection for Growth in Genetic Tests of

Loblolly Pine. For Sci 34:400–411. doi: 10.1093/forestscience/34.2.400

Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of Total Genetic Value Using

Genome-Wide Dense Marker Maps. Genetics 157:1819–1829

Mirov NT (1967) The genus Pinus. Ronald Press Company, New York

Moraga-Suazo P, Orellana L, Quiroga P, et al (2014) Development of a genetic linkage map

for Pinus radiata and detection of pitch canker disease resistance associated QTLs. Trees

28:1823–1835. doi: 10.1007/s00468-014-1090-2

Myburg AA, Grattapaglia D, Tuskan GA, et al (2014) The genome of Eucalyptus grandis.

Nature 510:356–362. doi: 10.1038/nature13308

Nelson CD, Kubisiak TL, Stine M, Nance WL (1994) A genetic linkage map of longleaf pine

(Pinus palustris Mill.) based on random amplified polymorphic DNAs. J Hered 85:433–

27

439. doi: 10.1093/oxfordjournals.jhered.a111497

Nelson CD, Nanee WL, Doudriek RL (1993) A partial genetic linkage map of slash pine

(Pinus elliottg Engelm. var. elliottit) based on random amplified polymorphic DNAs.

Theor Appl Genet 87145:

Newton A, Allnutt T, Dvorak W, et al (2002) Patterns of genetic variation in Pinus

chiapensis, a threatened Mexican pine, detected by RAPD and mitochondrial DNA

RFLP markers. Heredity (Edinb) 89:191–198. doi: 10.1038/sj.hdy.6800113

O’Malley DM, Mckeand SE (1994) MARKER ASSISTED SELECTION FOR BREEDING

VALUE IN FOREST TREES. For Genet 1:207–218

Oh J-D, Na C-S, Park K-D (2017) Validation of selection accuracy for the total number of

piglets born in Landrace pigs using genomic selection. Asian-Australasian J Anim Sci

30:149–153. doi: 10.5713/ajas.16.0394

Oliveira EJ, Pádua JG, Zucchi MI, et al (2006) Origin, evolution and genome distribution of

microsatellites. Genet Mol Biol 29:294–307. doi: 10.1590/S1415-47572006000200018

Ostrowska E, Muralitharan M, Chandler S, et al (1998) Optimized Conditions for Rapd

Analysis in Pinus radiata

Parkinson J, Blaxter M (2009) Expressed Sequence Tags: An Overview. In: Expressed

Sequence Tags (ESTs). Methods in Molecular Biology (Methods and Protocols).

Humana Press, pp 1–12

Pinto JR, Marshall JD, Dumroese RK, et al (2011) Establishment and growth of container

seedlings for reforestation: A function of stocktype and edaphic conditions. For Ecol

Manage 261:1876–1884. doi: 10.1016/J.FORECO.2011.02.010

Plomion C, Bartholomé J, Lesur I, et al (2016) High-density SNP assay development for

28

genetic analysis in maritime pine ( Pinus pinaster). Mol Ecol Resour 16:574–587. doi:

10.1111/1755-0998.12464

Plomion C, Chagné D, Pot D, et al (2007) Pines. In: Forest Trees. Springer Berlin

Heidelberg, Berlin, Heidelberg, pp 29–92

Plomion C, Bahrmant N, Durelt CE, O ’Malley DM (1995) Genetical Society of Great

Britain Genomic mapping in Pinus pinaster (maritime pine) using RAPD and protein

markers. Heredity (Edinb) 74:661–668

Powell W, Machray GC, Provan J (1996) Polymorphism revealed by simple sequence

repeats. Trends Plant Sci 1:215–222. doi: 10.1016/1360-1385(96)86898-1

Price RA, Liston A, Strauss SH (1998) Phylogeny and systematics of Pinus. In: Richardson

DM (ed) Ecology and Biogeography of Pinus. Cambridge University Press, Cambridge,

pp 49–68

Queller D, Strassmann J, Hughes C (1993) Microsatellites and Kinship. Trends Ecol Evol

8:285–288

Rabouam C, Comes AM, Bretagnolle V, et al (1999) Features of DNA fragments obtained by

random amplified polymorphic DNA (RAPD) assays. Mol Ecol 8:493–503. doi:

10.1046/j.1365-294X.1999.00605.x

Rafalski JA (2002) Novel genetic mapping tools in plants: SNPs and LD-based approaches.

Plant Sci 162:329–333

Remington DL, Whetten RW, Liu ’ B-H, et al (1999) Construction of an AFLP genetic map

with nearly complete genome coverage in Pinus taeda. Theor Appl Genet 98:1279-1292.

doi: https://doi.org/10.1007/s001220051

Resende MFR, Muñoz P, Acosta JJ, et al (2012a) Accelerating the domestication of trees

29

using genomic selection: accuracy of prediction models across ages and environments.

New Phytol 193:617–624. doi: 10.1111/j.1469-8137.2011.03895.x

Resende MD V., Resende MFR, Sansaloni CP, et al (2012b) Genomic selection for growth

and wood quality in Eucalyptus: capturing the missing heritability and accelerating

breeding for complex traits in forest trees. New Phytol 194:116–128. doi:

10.1111/j.1469-8137.2011.04038.x

Ribeiro MM, Mariette S, Vendramin GG, et al (2002) Comparison of genetic diversity

estimates within and among populations of maritime pine using chloroplast simple-

sequence repeat and amplified fragment length polymorphism data. Mol Ecol 11:869–

877. doi: 10.1046/j.1365-294X.2002.01490.x

Rodrigues-Corrêa KC da S, de Lima JC, Fett-Neto AG (2012) Pine oleoresin: tapping green

chemicals, biofuels, food protection, and carbon sequestration from multipurpose trees.

Food Energy Secur 1:81–93. doi: 10.1002/fes3.13

Ronaghi M, Uhlen M, Myren P (1998) A Sequencing Method Based on Real-Time

Pyrophosphate. Science (80- ) 281:363–365

Rudin IE (1978) Linkage studies in Pinus sylvestris L. using macro gametophye allozymes.

Silvae Genet 27,:1–12

Ruotsalainen S (2014) Increased forest production through forest tree breeding. Scand J For

Res 29:333–344. doi: 10.1080/02827581.2014.926100

Samuelson LJ, Pell CJ, Stokes TA, et al (2014) Two-year throughfall and fertilization effects

on physiology and growth of loblolly pine in the Georgia Piedmont. For Ecol

Manage 330:29–37. doi: 10.1016/J.FORECO.2014.06.030

Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors.

30

Proc Natl Acad Sci 74:5463–5467

Schön CC, Melchinger AE, Boppenmaier J, et al (1994) RFLP Mapping in Maize:

Quantitative Trait Loci Affecting Testcross Performance of Elite European Flint Lines.

Crop Sci 34:378. doi: 10.2135/cropsci1994.0011183X003400020014x

Sewell MM, Davis MF, Tuskan GA, et al (2002) Identification of QTLs influencing wood

property traits in loblolly pine (Pinus taeda L.). II. Chemical wood properties. Theor

Appl Genet 104:214–222. doi: 10.1007/s001220100697

Shepherd M, Cross M, Dieters MJ, Henry R (2002a) Branch architecture QTL for Pinus

elliottii var. elliottii x Pinus caribaea var. hondurensis hybrids. Ann For Sci 59:617–625.

doi: 10.1051/forest:2002047

Shepherd M, Cross M, Maguire T, et al (2002b) Transpecific microsatellites for hard pines.

Theor Appl Genet 104:819–827. doi: 10.1007/s00122-001-0794-z

Silva-Junior OB, Faria DA, Grattapaglia D (2015) A flexible multi-species genome-wide

60K SNP chip developed from pooled resequencing of 240 Eucalyptus tree genomes

across 12 species. New Phytol 206:1527–1540. doi: 10.1111/nph.13322

Sims D, Sudbery I, Ilott NE, et al (2014) Sequencing depth and coverage: key considerations

in genomic analyses. Nat Rev 15:121–132. doi: 10.1038/nrg3642

Slonczewski J, Foster J (2009) Genes and Genomes. In: Twitchell B (ed) Microbiology an

Evolving Science, Second Edi. New York, pp 218–453

Smith DN, Devey ME (1994) Occurrence and inheritance of microsatellites in Pinus radiata.

Can J For 37:977–983

Spindel J, Begum H, Akdemir D, et al (2015) Genomic Selection and Association Mapping

in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population

31

Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic

Selection in Elite, Tropical Rice Breeding Lines. PLoS Genet 11:1004982. doi:

10.1371/journal

Strauss SH, Lande R, Namkoong G (1992) Limitations of molecular-marker-aided selection

in forest tree breeding. Can J For Res 22:1050–1061. doi: 10.1139/x92-140

Sutton WRJ (Wink) (1999) The need for planted forests and the example of radiata pine.

New For 17:95–110. doi: 10.1023/A:1006567221005

Thavamanikumar S, Southerton SG, Bossinger G, Thumma BR (2013) Dissection of

complex traits in forest trees — opportunities for marker-assisted selection. Tree Genet

Genomes 9:627–639. doi: 10.1007/s11295-013-0594-z

The Arabidopis Genome Initiative (2000) Analysis of the genome sequence of the flowering

plant Arabidopsis thaliana. Nature 408:796-815. doi:

http://dx.doi.org/10.1038/35048692

ThermoFisher Scientific, https://www.thermofisher.com/us/en/home/life-science/microarray-

analysis/agrigenomics-solutions-microarrays-gbs/axiom-genotyping-solution-

agrigenomics.html

Travis SE, Ritland K, Whitham TG, Keim P (1998) A Genetic linkage map of

(Pinus edulis) based on amplified fragment length polymorphisms. Springer-Verlag

Tuskan GA, DiFazio S, Jansson S, et al (2006) The Genome of Black Cottonwood, Populus

trichocarpa (Torr. & Gray). Science (80- ) 313:1596–1604. doi:

10.1126/science.1128691

Vaid S, Nargotra P, Bajaj BK (2018) Consolidated bioprocessing for biofuel-ethanol

production from pine needle biomass. Environ Prog Sustain Energy 37:546–552. doi:

32

10.1002/ep.12691

Varshney RK, Graner A, Sorrells ME (2005) Genic microsatellite markers in plants: features

and applications. Trends Biotechnol 23:48–55. doi: 10.1016/J.TIBTECH.2004.11.005

Vasilyeva G, Semerikov V (2014) Application of amplified fragment length polymorphisms

markers to study the hybridization between Pinus sibirica and P. pumila. Ann For Res

Ann For Res 57:175–180. doi: 10.15287/afr.2014.219

Wakamiya I, Newton RJ, Johnston JS, Price HJ (1993) Genome Size and Environmental

Factors in the Genus Pinus. Am J Bot 80:1235. doi: 10.2307/2445706

Weber JL (1990) Informativeness of human (dC-dA)n · (dG-dT)n polymorphisms. Genomics

7:524–530. doi: 10.1016/0888-7543(90)90195-Z

Weng C, Kubisiak T, Nelson C, Stine M (2002) Mapping quantitative trait loci controlling

early growth in a (longleaf pine × slash pine) × slash pine BC1 family. Theor Appl

Genet 104:852–859. doi: 10.1007/s00122-001-0840-x

Wheeler GL, Dorman HE, Buchanan A, et al (2014) A Review of the Prevalence, Utility, and

Caveats of Using Chloroplast Simple Sequence Repeats for Studies of Plant Biology.

Appl Plant Sci 2:. doi: 10.3732/apps.1400059

White TL, Adams WT, Neale DB (2007) Forest Genetics. CABI Publishing, Cambridge

Wilcox PL, Amerson H V, Kuhlman EG, et al (1996) Detection of a major gene for

resistance to fusiform rust disease in loblolly pine by genomic mapping. Proc Natl Acad

Sci U S A 93:3859–64

Williams JG, Kubelik AR, Livak KJ, et al (1990) DNA polymorphisms amplified by

arbitrary primers are useful as genetic markers. Nucleic Acids Res 18:6531–6535

Xiong JS, McKeand SE, Isik F, et al (2016) Quantitative trait loci influencing forking defects

33

in an outbred pedigree of loblolly pine. BMC Genet 17:. doi: 10.1186/s12863-016-0446-

6

Yazdani R, Yeh FC, Rimsha J (1995) Genomic mapping of Pinus sylvestris (L.) using

random amplified polymorphic DNA markers. For Genet 2:109–116

You N, Murillo G, Su X, et al (2012) SNP calling using genotype model selection on high-

throughput sequencing data. Bioinformatics 28:643–650. doi:

10.1093/bioinformatics/bts001

Zhang Z-Y, Chen Y-Y, Li D-Z (2005) Detection of Low Genetic Variation in a Critically

Endangered Chinese Pine, Pinus squamata, Using RAPD and ISSR Markers. Biochem

Genet 436:. doi: 10.1007/s10528-005-5215-6

Zhao Y, Gowda M, Liu W, et al (2012) Accuracy of genomic selection in European maize

elite breeding populations. Theor Appl Genet 124:769–776. doi: 10.1007/s00122-011-

1745-y

Zheng W, Long J, Gao Y-T, et al (2009) Genome-wide association study identifies a new

breast cancer susceptibility locus at 6q25.1. Nat Genet 41:324–328. doi: 10.1038/ng.318

Zimin A, Stevens KA, Crepeau MW, et al (2014) Sequencing and assembly of the 22-gb

loblolly pine genome. Genetics 196:875–90. doi: 10.1534/genetics.113.159715

Zimin A V, Stevens KA, Crepeau MW, et al (2017) An improved assembly of the loblolly

pine mega-genome using long-read single-molecule sequencing. Gigascience 6:1–4. doi:

10.1093/gigascience/giw016

Zobel B, Talbert J (1984) Applied Forest Tree Improvement, 1st edn. Blackburn Press,

Caldwell

34

Chapter 2. Single Nucleotide Polymorphism Discovery in

Tropical and Subtropical Pines Using RNA-Seq

2.1 Introduction

The genus Pinus is the largest genus of conifers in the world consisting of over 100 individual species (Price et al. 1998). Their ability to adapt, grow and reproduce under a large number of ecological conditions is evidenced by their large natural distribution in the northern hemisphere. Due to this adaptability and variation, pines have come to play a crucial role in sustaining the global forestry market. In 2000, approximately 20% of all trees planted in commercial settings were pines (FAO 2015).

Within the genus, the subsection Australes contains many species that are important to pulp and sawtimber markets. In particular, some species of tropical and subtropical pines that are native to Mexico and Central America, have had landrace populations established in

Africa and South America for commercial applications (Dvorak et al. 2000a; Griess et al.

2016). Of these species, P. patula and P. greggii are successfully planted on colder sites in higher elevations due to their frost tolerance, while species such as P. caribaea, P. maximinoi, are valued for their rapid growth, and P. oocarpa and P. tecunumanii for their wood properties (Dvorak et al. 2000a). The large amount of genetic diversity present within and between species, along with their ability to mate and produce viable offspring, has generated interest in breeders to potentially exploit hybrid crosses for niche commercial deployments, adding to their importance (Gwaze 1999; Hongwane et al. 2018).

Despite traditional breeding and testing methods being effective in consistently producing better trees, the cost and labor associated with the methods are spurring breeding programs to explore the implementation of genomic technologies. Much of the ability to

35

understand the relationship between genotype and phenotype revolves around the ability to effectively identify genomic variation (Piskol et al. 2013). Before next generation sequencing

(NGS) technologies were available, efforts to understand this variation started in the 1980’s through the utilization of isozymes (Conkle 1979; Eckert et al. 1981) and continued through the use of expressed sequence tags (EST), microsatellites and simple sequence repeats (SSR)

(Elsik et al. 2000; Temesgen et al. 2001; Bérubé et al. 2007). ESTs have also been a valuable resource for locating markers within related species (Liewlaksaneeyanawin et al. 2004).

Ultimately, EST mining gave way to other methods such as genotyping by sequencing and other advancements as they were developed.

Currently, with the advent of NGS technologies, the ability to identify and exploit genomic variation has been revolutionized. Whole genome and whole exome sequencing studies have become attainable for large-scale projects looking to identify variation in both human and plant species (Michael and Jackson 2013; Gudbjartsson et al. 2015). Whole genome sequencing would be ideal in most cases looking for genetic variability, however, the implementation of such studies remains costly and computationally prohibitive for most pine species. Given the large size of pine genomes, which are generally more than 20Gb (De

La Torre et al. 2014), and their highly repetitive nature (Wegrzyn et al. 2014), reduced representation sequencing methods are an attractive alternative to whole genome sequencing.

Of these reduced sequencing methods, RNA sequencing (RNA-Seq) is among the most popular.

RNA-Seq is a popular alternative to whole genome studies because of its reduced cost when compared to whole genome studies and its versatility to address many different questions involving variation, gene expression, and alternative splicing (Oshlack et al. 2010;

36

Ozsolak and Milos 2011). More recently, it has been shown to be an effective means of identifying SNPs in transcribed regions of many different species (Thumma et al. 2012; Liu et al. 2014; Telfer et al. 2018). SNPs located in expressed regions are found at a lower frequency and are more likely to have an effect on protein structure and function, making them useful for explaining phenotypic variation. Utilizing RNA-Seq data for variant discovery does pose some unique challenges, primarily due to the complex nature of the transcripts themselves. Intron-exon boundaries present a computational challenge, especially in the absence of a same species or complete reference (De Wit et al. 2015). Therefore, special care must be taken to account for this when selecting SNPs for validation downstream.

In this manuscript, the design and characterization of more than 1.3 million SNP probes designed for five species of tropical or subtropical pines is described. Next generation

RNA sequencing generated between 29 and 67 million raw reads per sequencing pool across the species. Reads were used to perform transcriptome-wide identification of more than 5.3 million SNP markers. The SNP probes designed in this study are to be assessed for their potential use on a commercial genotyping array in a future study with the ultimate goal of the array being used to facilitate genomic selection, population structure, hybrid studies, and more in the field of tree improvement and conservation.

2.2 Materials and Methods

2.2.1 Plant Material

RNA sequence data was generated for five species of pine through a pathogen challenge experiment for pitch canker Fusarium circinatum. The pathogen challenge, RNA extraction and sequencing steps outlined in this project were performed by Erik Visser at the University of

37

Pretoria as part of a previous unrelated study. The species tested were: P. greggii, P. maximinoi,

P. oocarpa, P. patula, and P. tecunumanii. Pinus patula seedlings consisting of a single open pollinated family were sourced from Mondi Forests (Trahar Technology Center, South Africa).

Pinus tecunumanii seedlings consisting of four open pollinated families were sourced from

SAPPI (Shaw Research Center, South Africa). Pinus maximinoi and P. oocarpa seedlings were sourced from York Timbers (South Africa) and P. greggii seedlings were sourced from

Komatiland Forests (South Africa). Seedlings were grown and challenged at the Forestry and

Agriculture Biotechnology Institute (FABI) disease screening facility (University of Pretoria,

South Africa). Pathogen inoculation and testing was performed as described in Visser et al.

(2018, 2015). In short, the top 1cm of shoot tissue from 16 seedlings were pooled and used to create three biological replicates for P. patula and P. tecunumanii, while tissue from 8 seedlings were used per pool for the remaining species. Tissue was harvested from both inoculated and mock-inoculated seedlings at 3 and 7 days post inoculation to create a more robust expression dataset.

2.2.2 RNA Isolation and Sequencing

Isolation of RNA was done utilizing the Norgen Biotek Corp.’s Plant/Fungi RNA

Purification Kit using ground homogenized samples. Extraction was performed using the manufacturer’s recommendation, adding acid washed beads to improve cell lysis. The integrity and purity of extracted samples were assessed using electrophoresis.

Samples from all species from both time points for the inoculated and mock- inoculated samples were sent to Novogene (Novogene Corporation Inc., Chula Vista, CA) for strand specific RNA sequencing using Illumina HiSeq2500. Pinus patula and P. tecunumanii samples were sequenced as described in Visser et al. (2018). Pinus greggii, P.

38

maximinoi, and P. oocarpa samples were sequenced using 300bp insert, paired end 150bp

(PE150) sequencing. Similar to that of sample set three in Visser et al. (2018).

2.2.3 SNP Discovery

The paired end RNA-seq libraries received from Novogene were put through a custom bioinformatics pipeline for SNP discovery (Figure 2.1). Reads were quality checked using FastQC (Andrews 2010). Reads from each library were trimmed and filtered using

Sickle v1.33 (Joshi & Fass 2011), requiring a base quality score of 30 and a minimum read length of 50. The trimmed reads from each library for each species were aligned using

Burrows-Wheeler Aligner’s (BWA) v0.7.15 (Li and Durbin 2009), mem routine. Using standard parameters, reads were aligned back to their respective species’ transcriptome assembly (Visser et al. 2018; Visser et al. 2015;unpublished data). The SAM alignment files generated were converted to BAM files and subsequently sorted using Samtools’ v1.7 (Li et al. 2009) view and sort routines.

Variant detection was done under two sets of parameters, strict and standard, and performed using Freebayes v1.1.0 (Garrison & Marth, 2012). SNP calls were done jointly, using all alignment files from an individual species as an input and utilized the pooled continuous parameter. Under the pooled continuous parameter, all alleles that pass input filters are output, regardless of the genotyping outcome or model. The standard parameters used are as follows: min-mapping-quality = 0, min-base-quality = 15, min-alternate-fraction

= 0.05, min-alternate-count = 3, and min-coverage = 10. The strict parameters were: min- mapping-quality = 20, min-base-quality = 25, min-alternate-fraction = 0.05, min-alternate- count = 5, and min-coverage = 12. The resulting outputs were filtered to include only SNP variants using vcflib v1.0.0-rc1 (Garrison 2016).

39

2.2.4 Probe Design

Successfully called SNPs from each species were then subject to probe design. A probe is defined as the identified SNP plus a number of base pairs of sequence on either side of it. In this study, a target of 35 bases on either or both sides of the SNP was used. Probes were designed for use on ThermoFisher Scientific’s Axiom genotyping array platform.

Custom Perl scripts were used on both strict and standard datasets to extract the 35 bases upstream and downstream of each SNP also referred to as left and right flanking regions respectively. International Union of Pure and Applied Chemistry (IUPAC) codes were inserted into the probe flanking regions to identify proximity to another SNP and the flanks were trimmed to the innermost IUPAC code on both sides. Probes were then filtered to include only flanking regions for SNPs that had either the left, right, or both flanking regions clear of an IUPAC code for a full 35 bases in order to meet technical specifications.

To address the potential issue of probes being located in or around intron/exon boundaries, probes were further filtered. The transcriptome assemblies for each species were translated to amino acid sequences using EMBOSS v6.6.0 (Rice et al. 2000) and aligned to v2.01 of the P. taeda genome using GenomeThreader v1.7.1 (Gremme et al. 2005). The resulting alignment file was used to inform the probes. Probes that were within a region that mapped to the genome and that fell within 5 bases of a splice site were removed. Probes from a region of the transcriptome assembly that did not map to the genome and those that did not fall within 5 bases of a splice site were selected for downstream analysis.

2.2.5 Probe Description

Probes that survived filtering based on the previously mentioned technical specifications were further characterized. This was done to determine their similarity

40

between species and their potential behavior once submitted for scoring for the screening array. Probes for each species underwent three separate characterization steps: 1) the number of times each flanking sequence mapped to the P. taeda genome assembly, 2) sequence similarity of flanking regions between each species, 3) number of flanking sites that map to the same location in the P. taeda genome.

1. Comparison of the flanking sequence similarity between species was done under

three conditions: allowing for zero, one, or two bases different to be considered

shared. A flanking sequence was considered shared between two species if it did

not exceed the number of bases different allowed. For condition one, allowing no

differences (i.e. sequences must be identical to be considered shared), custom R

and SAS scripts were used to create an incidence matrix (with flanking regions as

the rows and species as columns). For conditions two and three, a custom Python

script was used to calculate similar incidence matrices.

2. The characterization of a probe’s flanking site’s ability to map to more than one

location to the P. tadea genome was done utilizing BWA mem (Li and Durbin

2009). A browser extensible data (bed) file was created for both the right and left

flanking regions for each probe using a custom R script and then modified using

Bedtools v2.27 (Quinlan and Hall 2010) slop routine to select the 35 bases either

up and downstream from the SNP, specifying the positions of either the left or

right flanking region. The resulting bed files were converted to fasta format using

Bedtools v2.27.1 getfasta routine and then converted to fastq format via BBmap

v.37.41 (Bushnell, sourceforge.net/projects/bbmap). An arbitrary base quality

score of 35 was used in the fastq generation step. Once formatted into fastq

41

format, probe sequences were aligned back to the genome using BWA mem with

the –a (all alignment) option, which gives all possible alignments for a read

regardless of its mapping score or redundancy.

3. The left and right flanking regions of the probes were assessed for shared

mapping locations using BWA aln/samse (Li and Durbin 2009). For each species

specific probe set, a bed file was generated using a custom R script. Flank specific

bed files were created in the same manner as above, using Bedtools' slop routine

and targeting the 35 bases up or downstream of the SNP location. Flanking

specific bed files were then converted to fasta file format using Bedtools getfasta

and subsequent fasta files were converted to fastq files using BBMap v.37.41

(Bushnell, sourceforge.net/projects/bbmap) and a base quality score of 35 was

assigned to all bases in a read. The resulting fastq files were aligned to the P.

taeda genome using the optimized short read aligner BWA aln/samse routine. The

output alignment files were converted to bed files using Bedtools bamtobed

routine. The species specific bed files were then concatenated to create a “master”

alignment bed file. Using Bedtools’ intersect routine, the species specific bed files

were compared to the “master” file. If mapping locations were within two bases

of each other, the site was considered shared by those flanking regions. An

incidence matrix was constructed for downstream analysis.

Principal component analysis (PCA) was performed to assess probe correlation between species. PCA was performed on the incidence matrices generated for sequence similarity and shared mapping assessments. PCA for sequence similarity was performed using the zero difference allowed matrix for both the left and right flanking regions. A

42

random sample of 50,000 observations from the matrices were selected and used for the analysis.

2.3 Results

2.3.1 RNA-Sequencing and Alignment

RNA-sequencing of the five tropical pine species generated between 6.3 billion and

5.0 billion raw bases from between 33.7 million and 45.4 million paired end reads on average for each species. After quality control, trimming reads for base quality and read length, an average of between 4.1 billion and 5.6 billion bases from between 29.1 million to 40.7 million reads were retained for further analysis. Average read length across the species ranged from 133 to 143 bases. Alignment of each species' quality controlled reads to their respective transcriptome assemblies showed between 73.6% to 78.3% of reads successfully aligned to their reference (Table 2.1). Of the reads that successfully aligned to their transcriptome, more than 90% of reads mapped in their proper pair.

2.3.2 Variant Calling and Probe Design

Initial variant calling for strict and standard parameters generated a list of variants including, complex events, multi-nucleotide polymorphisms, SNPs, and indels. Variant calling was not restricted to SNPs alone per recommended best practices for variant calling using Freebayes (Garrison & Marth, 2012) in order to increase detection power. The number of variants called ranged from 845K to 1.6 million under standard parameters and between

531K to 974K variants under strict per species. The raw variant vcf files were filtered to include only SNP markers, removing complex events, multi-nucleotide polymorphisms, and indels. This filtering produced reduced the dataset of between 686K to 1.1 million and between 426K to 824K SNPs per species for standard and strict calls respectively. Pinus

43

patula had the highest number of SNPs discovered within the species while P. greggii had the lowest number of SNPs generated from sequencing data (Table 2.2).

Strict and standard parameters were used to create a more robust catalogue of SNPs for probe design than using either one set of parameters individually. SNPs generated from strict parameters created a higher confidence catalogue of SNPs of which we expect a larger proportion of true SNPs rather than artifacts from sequencing or alignment errors than standard parameters. However, lower confidence SNPs generated from standard parameter were included for probe design as validation of SNP probes will be performed using a screening genotyping array, allowing for some leniency in the design phase. From the two datasets, between 175K to 301K probes were successfully designed per species (Table 2.2) that were then further characterized. In total 1,319,369 probes were successfully created from the design phase. These probes fell into three categories based on presence of a full 35mer flanking site: left flank only (442K), right flank only (425K), or both (451K).

Of the ~1.3 million SNPs across all species that had probes successfully designed for them, approximately 61% were transitions (C/T or A/G) and 39% were transversions (A/C,

A/T, C/G, or G/T). Across all species, the C/T transition was the most prevalent making up

31% of the total number of observations while A/T transversions were the least observed making up 9% of observations. At the species levels, transition/transversion ratios ranged from 1.46 in P. greggii to 1.59 in P. patula (Table 2.3). Architecture of transition and transversion numbers were consistent across species.

44

2.3.3 Probe Description

The left and right flanking regions of the probes generated were assessed separately for sequence similarity, repetitive mapping, and shared mapping locations. In order to be included in the dataset a flank must have reached the full 35bp length for that flank side.

Sequence similarity assessment of the probe's left and right flanks showed a high proportion of unique flanking regions within each species. As the number of allowed base differences to be considered shared was increased, the proportion of flanking regions shared between two and three species increased while those shared between four or more species remained fairly constant. The sharing had a similar pattern for both right and left flank datasets (Figure 2.2). Looking at all possible combinations of sharing between species, patterns become apparent between the species. For the left flanking regions allowing for zero base differences P. maximinoi had the largest proportion of unique flanks, while sharing between species such as P. tecunumanii and P. oocarpa occurred at a higher proportion than with other species (Figure 2.3). This pattern holds constant across left and right flanks for each number of differences allowed. Diagrams for the other combinations of flanks and differences allowed are found in the Appendix (Appendix 1-5). Principal component analysis

(PCA) was performed on the left and right flanking regions when allowing for zero differences datasets. PCA showed a strong correlation between species that are more closely phylogenetically related (Figure 2.4). Right and left regions produced similar results with

~29% and ~23% of variance explained by component 1 and 2 respectively.

Repetitive mapping analysis of the left and right flanking 35mer regions to the P. taeda genome revealed that over 50% of flanking regions mapped to a unique location for each species (Figure 2.5). Approximately 17% of flanking regions mapped to more than five

45

locations in the genome reference with a maximum number of 511 locations for a single flank found. Alignment percentage for all species' left and right flanking regions was greater than 90%.

Analysis of the shared mapping locations of left and right flanking regions revealed a high proportion of flanks mapping to a unique location in the v2.01 P. taeda genome assembly (Figure 2.6). Pinus maximinoi once again exhibited the largest number of unique locations distinguishing it as an outlier in behavior when compared to the other species. More than 83% of flanks successfully mapped to the reference for all species. When examining all possible combinations of sharing, a relatively small number of mapping locations were shared for all species. Pinus tecunumanii and P. oocarpa tended to have a larger number of shared locations between the species when compared with the others as did P. greggii and P. patula for both left and right flanking datasets (Figures 2.7 & 2.8).

Principal component analysis revealed 29% and 22% of variance associated with shared mapping location were explained by component one and two respectively for both the left and right flanking region. Correlation between pairs of species were shown with P. patula / P. greggii and P. tecunumanii / P. oocarpa pairs having loading angles which are close to one another (Figure 2.9). Left and right flanks paired in a similar manner indicating biological consistency between datasets.

2.4 Discussion

Pines, as important long-term rotation crops, have been researched extensively to develop useful statistical models to estimate genetic parameters using phenotypic data. More recently, implementation of molecular technologies has aimed to improve the rate and efficiency associated with tree improvement. However, when compared with annual rotation

46

crop species, molecular markers and genetic maps are much less developed (Isik 2014). The use of NGS technology has the potential to facilitate whole genome or whole transcriptome level molecular markers to be used in breeding programs. While there have been genome assemblies of several pine species developed (Nystedt et al. 2013; Stevens et al. 2016; Zimin et al. 2017), the complexity and size of pine genomes make whole genome level sequencing efforts difficult when compared to other plant species. Due to the cost and labor associated with whole genome assembly, work to develop a reference has not been undertaken for the species in this study. Therefore, this study focused on using de-novo assembled transcriptome references and NGS RNA-Seq data to identify SNP markers using a custom bioinformatics pipeline.

Challenges associated with using RNA-Seq for create a number of potential avenues for the introduction of errors into an experiment. Despite NGS sequencing technologies have an error rate as low as 0.1% (Glenn, 2011), incorrectly sequenced bases are still a source of false SNPs. Additionally, in plants and especially pine species where repetitive sequences can make up more that 80% of a genome (Wegrzyn et al. 2014), de novo assembly of transcriptomes and subsequent variant calling efforts can be effected by paralogous gene families and multi-mapping reads that may be present (Pepke et al. 2009). Of the strategies available to handle multi-mapping reads, this study utilized the best match method, reporting the alignment with the fewest mismatches and randomly assigning a location between those with equivalent mapping scores.

Assessment of the transition (Ts) vs transversion (Tv) ratio of the study species yielded ratios of between 1.46 to 1.59. Even though Ts:Tv ratios have not been reported in the species present in this study, the numbers are consistent with the ratios found in

47

transcriptomic studies of other tree species. SNPs generated from the analysis apex shoot tissue in eucalyptus has shown a Ts:Tv ratio of 1.62 and foliar analysis of Norway spruce produced a ratio of 1.49 (Singh et al. 2011; Chen et al. 2012). The divergence from the 1:2 ratio expected from random mutations indicates the SNPs called in this study are probably true SNPs and not errors from sequencing or mapping. Bias due to higher rates of C/T mutations because of CpG effects and the deamination of methlycytosines in CpG regions produces a much higher Ts rate and skews the ratio (Vignal et al. 2002; Morton et al. 2006).

Results from this study follow this trend, with C/T transitions being found in the highest proportion, making up 31% of SNPs with successfully designed probes.

A unique challenge associated with this study was to identify SNPs and probes that are potentially shared between species when generated from different reference sequences.

To address this issue, probe flanking regions were used as a surrogate of the SNP they surrounded. Comparative transcriptomic studies have shown reasonable alignment of transcripts of relatively diverged pine species to the P. taeda genome (Baker et al. 2018).

Therefore, probe flanking regions which originated from transcriptome assemblies were mapped to v2.01 P. taeda genome to be used as a common reference and assessment of sequence similarity between probes was performed to gain insight to the potential performance of a probe on the commercial array. Successful alignment of flanking regions to the genome was higher than shown in Baker et al. (2018). However, this is not surprising given the read length being mapped is lower in this study. Very little expansion of pine transcriptomic regions has been shown when compared to other plant species (Kirst et al.

2003) so many of the non-aligning flanking regions could be attributed to the fragmented and incomplete nature of the P. taeda genome rather than absence of sequence.

48

Sequence similarity between probes and multiple locations within a genome has been shown to cause nonspecific binding and subsequent cross reactivity. This cross reactivity can generate false positive signals which will skew downstream analysis. It has been shown that sequences not targeted by probes which have >70-80% similarity with target sequences will affect signals generated for 50mer probes (Kane et al. 2000). The presence of repeat regions in the genome either through paralogous gene families or pseudogenes also has an effect on probe specificity (Chen et al. 2011). This is evident in this study by the >40% of flanking regions mapping to multiple sites within the P. taeda genome.

Applying the logic above to the probe sequences themselves allows for identification of probes that may behave in a similar manner and thus may be redundant. Both the sequence similarity and shared mapping locations assessments performed to assess species sharing and redundancy required greater than 94% similarity in flanking regions to be considered shared, much higher than the 80% mark mentioned above. However, direct comparisons cannot be made given the different probe lengths between studies which would lead to different dynamics in hybridization. While sequence similarity is one factor that affects nonspecific hybridization, others such as location of base differences within the flanking region and if the difference is A/T or C/G can also influence binding (Chen et al. 2011) and were not taken into account in this study.

SNPs generated from this study are expected to follow the phylogenetic pattern found in biology through flanking region similarity as well as shared mapping locations. Lacking a common reference and due to the unknown family sampling in the study, a true population structure analysis was not achievable. Therefore, PCA was performed on the incidence matrices for both sequence similarity and shared mapping to assess if SNPs are being shared

49

between species in a similar pattern as expected from biology. A phylogenetic study of

Mexican and Central American pines using RAPD makers showed relatedness between P. greggii and P. patula species is higher than with other species of tropical pines. The same was shown for P. oocarpa and P. tecunumanii (Dvorak et al. 2000b). Another study supports this representation of phylogeny, and places P. maximinoi as the most distantly related to the study species (Vargas-Mendoza et al. 2011). PCA on both shared mapping and sequence similarity between flanking regions followed this pattern shown in the phylogenetic studies, further indicating the SNPs found in this study are likely true and will have utility on the screening array for population genetics and hybrid studies.

50

2.5 Literature Cited

Andrews, S. (2014). FastQC A Quality Control tool for High Throughput Sequence Data.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Baker EAG, Wegrzyn JL, Sezen UU, et al (2018) Comparative Transcriptomics Among Four

White Pine Species. G3 (Bethesda) 8:1461–1474. doi: 10.1534/g3.118.200257

Bérubé Y, Zhuang J, Rungis D, et al (2007) Characterization of EST-SSRs in loblolly pine

and spruce. Tree Genet Genomes 3:251–259. doi: 10.1007/s11295-006-0061-1

Chen J, Uebbing S, Gyllenstrand N, et al (2012) Sequencing of the needle transcriptome

from Norway spruce (Picea abies Karst L.) reveals lower substitution rates, but similar

selective constraints in gymnosperms and angiosperms. BMC Genomics 13:589. doi:

10.1186/1471-2164-13-589

Chen Y, Choufani S, Ferreira JC, et al (2011) Sequence overlap between autosomal and sex-

linked probes on the Illumina HumanMethylation27 microarray. Genomics 97:214–222.

doi: 10.1016/J.YGENO.2010.12.004

Conkle MT (1979) Isozyme Variation and Linkage in Six Conifer Species. In: Isozymes of

North American Forest Trees and Forest Insects. pp 11–17

De La Torre AR, Birol I, Bousquet J, et al (2014) Insights into Conifer Giga-Genomes.

PLANT Physiol 166:1724–1732. doi: 10.1104/pp.114.248708

De Wit P, Pespeni MH, Palumbi SR (2015) SNP genotyping and population genomics from

expressed sequences - Current advances and future possibilities. Mol. Ecol. 24:2310-

2323. doi: 10.1111/mec.13165

Dvorak WS, Gutierrez EA, Hodge GR, et al (2000a) Conservation & Testing of Tropical &

Subtropcial Forest Tree Species by the CAMCORE Cooperative. NCSU

51

Dvorak WS, Jordon AP, Hodge GP, Romero JL (2000b) Assessing evolutionary relationships

of pines in the Oocarpae and Australes subsections using RAPD markers. New For

20:163–192. doi: 10.1023/A:1006763120982

Eckert RT, Joly RJ, Neale DB (1981) Genetics of isozyme variants and linkage relationships

among allozyme loci in 35 eastern white pine clones. Can J For Res 11:573–579. doi:

10.1182/blood-2011-06-363911

Elsik CG, Minihan VT, Hall SE, et al (2000) Low-copy microsatellite markers for Pinus

taeda L. Genome 43:550–555

FAO (2015) Global Forest Resouces Assessment

Garrison E. (2016) Vcflib, a simple C++ library for parsing and manipulating VCF files.

https://github.com/vcflib/vcflib.

Garrison E, Marth G.(2012) Haplotype-based variant detection from short-read sequencing.

arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012

Glenn TC (2011) Field guide to next-generation DNA sequencers. Mol Ecol Resour 11:759–

769. doi: 10.1111/j.1755-0998.2011.03024.x

Gremme G, Brendel V, Sparks ME, Kurtz S (2005) Engineering a software tool for gene

structure prediction in higher organisms. doi: 10.1016/j.infsof.2005.09.005

Griess VC, Uhde B, Ham C, Seifert T (2016) Product diversification in South Africa’s

commercial timber plantations: a way to mitigate investment risk. South For a J For Sci

78:145–150. doi: 10.2989/20702620.2015.1136508

Gudbjartsson DF, Sulem P, Helgason H, et al (2015) Sequence variants from whole genome

sequencing a large group of Icelanders. Sci Data. doi: 10.1038/sdata.2015.11

Gwaze DP (1999) Performance of some F1 interspecific Pine hybrids in Zimbabwe. For

52

Genet 6:283–289

Hongwane P, Mitchell G, Kanzler A, et al (2018) Alternative pine hybrids and species to

Pinus patula and P. radiata in South Africa and Swaziland. South For. doi:

10.2989/20702620.2017.1393744

Isik F (2014) Genomic selection in forest tree breeding: The concept and an outlook to the

future. New For. 45:379–401

Joshi NA, Fass JN. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool

for FastQ files (Version 1.33) [Software]. Available at

https://github.com/najoshi/sickle.

Kane MD, Jatkoe TA, Stumpf CR, et al (2000) Assessment of the sensitivity and specificity

of oligonucleotide (50mer) microarrays. Nucleic Acids Res 28:4552. doi:

10.1093/NAR/28.22.4552

Kirst M, Johnson AF, Baucom C, et al (2003) Apparent homology of expressed genes from

wood-forming tissues of loblolly pine (Pinus taeda L.) with Arabidopsis thaliana. Proc

Natl Acad Sci U S A 100:7383–7388

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics 25:1754–1760. doi: 10.1093/bioinformatics/btp324

Li H, Handsaker B, Wysoker A, et al (2009) The Sequence Alignment/Map format and

SAMtools. Bioinformatics 25:2078–9. doi: 10.1093/bioinformatics/btp352

Liewlaksaneeyanawin C, Ritland CE, El-Kassaby YA, Ritland K (2004) Single-copy,

species-transferable microsatellite markers developed from loblolly pine ESTs. Theor

Appl Genet. doi: 10.1007/s00122-004-1635-7

Liu JJ, Sniezko RA, Sturrock RN, Chen H (2014) Western white pine SNP discovery and

53

high-throughput genotyping for breeding and conservation applications. BMC Plant

Biol. doi: 10.1186/s12870-014-0380-6

Michael TP, Jackson S (2013) The First 50 Plant Genomes. Plant Genome. doi:

10.3835/plantgenome2013.03.0001in

Morton BR, Bi I V, McMullen MD, Gaut BS (2006) Variation in mutation dynamics across

the maize genome as a function of regional and flanking base composition. Genetics

172:569–77. doi: 10.1534/genetics.105.049916

Nystedt B, Street NR, Wetterbom A, et al (2013) The Norway spruce genome sequence and

conifer genome evolution. Nature 497:579–584. doi: 10.1038/nature12211

Oshlack A, Robinson MD, Young MD (2010) From RNA-seq reads to differential expression

results. Genome Biol. 11:220. doi: https://doi.org/10.1186/gp-2010-11-12-220

Ozsolak F, Milos PM (2011) RNA sequencing: Advances, challenges and opportunities. Nat.

Rev. Genet. 12:87-98. doi: https://doi.org/10.1038/nrg2934

Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat

Methods 6:S22-32. doi: 10.1038/nmeth.1371

Piskol R, Ramaswami G, Li JB (2013) Reliable identification of genomic variants from

RNA-seq data. Am J Hum Genet. doi: 10.1016/j.ajhg.2013.08.008

Price RA, Liston A, Strauss SH (1998) Phylogeny and systematics of Pinus. In: Richardson

DM (ed) Ecology and Biogeography of Pinus. Cambridge University Press, Cambridge,

pp 49–68

Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic

features. Bioinformatics 26:841–842. doi: 10.1093/bioinformatics/btq033

Rice P, Longden I, Bleasby A (2000) EMBOSS: The European Molecular Biology Open

54

Software Suite. Trends Genet 16:276–277

Singh TR, Gupta A, Riju A, et al (2011) Computational identification and analysis of single-

nucleotide polymorphisms and insertions/deletions in expressed sequence tag data of

Eucalyptus. J Genet 90:e34–e38

Stevens KA, Wegrzyn JL, Zimin A, et al (2016) Sequence of the Sugar Pine Megagenome.

Genetics 204:1613–1626. doi: 10.1534/genetics.116.193227

Telfer E, Graham N, Macdonald L, et al (2018) Approaches to variant discovery for conifer

transcriptome sequencing. PLoS One. doi: 10.1371/journal.pone.0205835

Temesgen B, Brown GR, Harry DE, et al (2001) Genetic mapping of express sequence tag

polymorphism (ESTP) markers in loblolly pine (Pinus taeda L.). Theor Appl Genet

102:664–675

Thumma BR, Sharma N, Southerton SG (2012) Transcriptome sequencing of Eucalyptus

camaldulensis seedlings subjected to water stress reveals functional single nucleotide

polymorphisms and genes under selection. BMC Genomics. doi: 10.1186/1471-2164-

13-364

Vargas-Mendoza C, Medina-Jaritz N, Ibarra-Sanchez C, et al (2011) Phylogenetic Analysis

of Mexian Pine Species Based on Three Loci from Different Genomes (Nuclear,

Mitochondiral, and Chloroplast). In: Agboola J (ed) Relevant Perspecites in Global

Environmental Change. InTech, pp 139–154

Vignal A, Milan D, Sancristobal M, Eggen A (2002) A review on SNP and other types of

molecular markers and their use in animal genetics. Genet Sel Evol 34:275–305. doi:

10.1051/gse:2002009

Visser EA, Wegrzyn JL, Myburg AA, Naidoo S (2018) Defence transcriptome assembly and

55

pathogenesis related gene family analysis in Pinus tecunumanii (low elevation). BMC

Genomics 19:1–13. doi: 10.1186/s12864-018-5015-0

Visser EA, Wegrzyn JL, Steenkmap ET, et al (2015) Combined de novo and genome guided

assembly and annotation of the Pinus patula juvenile shoot transcriptome. BMC

Genomics 16:1057. doi: 10.1186/s12864-015-2277-7

Wegrzyn JL, Liechty JD, Stevens KA, et al (2014) Unique Features of the Loblolly Pine

(Pinus taeda L.) Megagenome Revealed Through Sequence Annotation. Genetics

196:891–909. doi: 10.1534/genetics.113.159996

Zimin A V, Stevens KA, Crepeau MW, et al (2017) An improved assembly of the loblolly

pine mega-genome using long-read single-molecule sequencing. Gigascience 6:1–4. doi:

10.1093/gigascience/giw016

56

Table 2.1. Sequencing and alignment metrics generated from RNA sequencing. Sequencing generated millions (M) of reads and billions of bases on average per species. Raw reads underwent quality control (QC) requiring a base quality score of 30 and minimum read length of 50bp. Average read length decreased slightly from the 150bp raw reads generated from sequencing. Average alignment percentage across species was consistent.

Metrics P. greggii P. maximinoi P. oocarpa P. patula P. tecunumanii

Raw Reads 33.7 M 36.0 M 35.8 M 45.9 M 42.9 M Raw Bases 5,100 M 5,400 M 5,400 M 6,400 M 6,000 M QC’d Reads 29.1 M 32.3 M 31.7 M 40.7 M 37.4 M QC’d Bases 4,200 M 4,700 M 4,500 M 5,600 M 5,130 M QC’d Read Length 143bp 145bp 142bp 133bp 133bp Alignment Percent 76.3 78.3 77.5 73.6 77.1

57

Table 2.2. SNP numbers and probe design metrics for SNPs generated under standard parameters. The number of SNPs identified from variant calling generally numbered in the millions with the exception of P. greggii which had approximately 680K identified. Of the

SNPs generated, less than one third were successfully converted to a probe. Probes tended to have a high number of unique left and right flanking regions, showing little redundancy within species.

Metrics P. greggii P. maximinoi P. oocarpa P. patula P. tecunumanii

SNPs identified 680 K 1,110 K 1,030 K 1,390 K 1,110 K Probes designed 175 K 293 K 301 K 290 K 259 K Unique Left Flanks 87 K 185 K 135 K 138 K 109 K Unique Right Flanks 86 K 182 K 132 K 135 K 107 K

58

Table 2.3. Catalogue of the number of transitions (Ts) and transversions (Tv) per species.

Ts:Tv ratios were consistent across species and within expected ranges for SNPs generated from transcriptome sequences.

Model P. greggii P. maximinoi P. oocarpa P. patula P. tecunumanii A/C 17 K 27 K 28 K 28 K 25 K A/G 48 K 82 K 85 K 79 K 72 K A/T 16 K 25 K 25 K 24 K 21 K G/C 18 K 30 K 31 K 29 K 26 K C/T 52 K 89 K 91 K 93 K 81 K G/T 18 K 29 K 30 K 27 K 25K Ts:Tv 1.46 1.54 1.55 1.59 1.57

59

Figure 2.1. Work flow for SNP discovery using RNA-seq data. Work flow includes two phases, bioinformatics (blue) and probe design (purple). Software used and file outputs are shown in the ovals underneath the colored boxes.

60

Bases Different: 0 Bases Different: 1 Bases Different: 2 300,000

250,000

200,000 Flank: Left

150,000

100,000

50,000 Unique 0 Shared: between two species Shared: between three species 300,000 Shared: between four species 250,000 Shared: between all species Flank: Right 200,000

Number of Flanking Regions 150,000

100,000

50,000

0

P. greggii P. patula P. greggii P. patula P. greggii P. patula P. oocarpa P. oocarpa P. oocarpa P. maximinoi P. maximinoi P. maximinoi P. tecunumanii P. tecunumanii P. tecunumanii Species

Figure 2.2. Sharing pattern for each species' left and right flanking regions when allowing for zero, one, or two differences within a flanking region to be considered shared. For each species, a large portion of flanking regions are unique to that species regardless of differences allowed. Allowing for sequence differences increases the number of flanking regions shared by two or three species at a larger proportion than sharing between four or all.

61

P. maximinoi

185002 P. greggii

2890 1906 P. oocarpa 87310 403 964 608 3081 6830 360 322

2208 994 134909 3573 622 7945 374 3338 3180 2903

25408 591 7775 617 11786

109372 518 1976 11716

138222

P. tecunumanii

P. patula

Figure 2.3. Sharing of left flanking regions allowing for zero differences across species. P. maximinoi has the highest number of unique flanking regions while P. tecunumanii and P. oocarpa have a high number of shared flanks between the two as expected from phylogeny.

62

1.0 a) 1.0 b)

● ●

P. patula P. patula

● ● ●

● 0.5 0.5

P. greggii P. greggii

● ● 0.0 0.0 P. maximinoi P. maximinoi PC2 (23.7%) PC2 (23.6%) P.● tecunumanii P. tecunumanii

● ● −0.5 P. oocarpa ● −0.5 P. oocarpa

● ● ● ●

−1.0 −1.0

−1.00 −0.50 0.00 0.75 1.50 −1.00 −0.50 0.00 0.75 1.50 PC1 (28.8%) PC1 (28.7%)

Figure 2.4. Principal component analysis of left flanking (a) and right flanking (b) regions when allowing for zero differences in flanking sequence. Correlation between species that are more closely related phylogenetically is strong as indicated by loading angles. The more closely related P. tecunumanii / P. oocarpa and P. patula / P. greggii pairs have angles that are close to one another. The more distantly related P. maximinoi segregates itself.

63

Flank: Left Flank: Right

100

75

Species

P. greggii P. maximinoi 50 P. oocarpa

Percent P. patula P. tecunumanii

25

0

1 2 3 4 5 >5 1 2 3 4 5 >5 Number of Times Mapped

Figure 2.5. Repetitive mapping of left and right flanking regions for each species when mapped to the v2.01 P. taeda genome assembly. Over 50% of flanking regions in all species map to a unique location within the genome reference.

64

Flank: Left Flank: Right 275000

250000

225000

200000

175000

150000 Unique Shared: between two species 125000 Shared: between three species Shared: between four species 100000 Shared: between all species

75000 Number of Flanking Regions 50000

25000

0

P. greggii P. patula P. greggii P. patula P. oocarpa P. oocarpa P. maximinoi P. maximinoi P. tecunumanii P. tecunumanii Species

Figure 2.6. Sharing pattern for each species' left and right flanking regions when mapped to v2.01 of the P. taeda genome assembly. Flanking regions are considered shared if they mapped within two bases of another. A large portion of flanking regions are unique to a given species while very few map to the same location in all species.

65

P. maximinoi

161663 P. greggii

6499 5832 P. oocarpa 79932 1001 3309 1469 8429 11466 1166 937

4643 2776 127716 7010 1906 13719 1078 9084 7157 6705

41149 2144 17973 2019 20962

104148 1589 6035 20007

114362

P. tecunumanii

P. patula

Figure 2.7. Distribution of left flanking regions shared mapping locations. Each species contains a large number of regions unique to the individual. Pinus patula / P. greggii and P. tecunumanii / P. oocarpa species combinations sharing mapping locations at a higher proportion between themselves than with other species.

66

P. maximinoi

160748 P. greggii

6226 5529 P. oocarpa 80166 952 3332 1369 8133 11166 1033 753

4371 2588 126532 6920 1664 13350 1014 8800 6858 6478

40680 1901 17556 1960 20894

102481 1543 6081 19196

112535

P. tecunumanii

P. patula

Figure 2.8. Distribution of right flanking regions shared mapping locations. A relatively small number of regions are shared between four or five species. Pinus maximinoi has the largest number of uniquely mapping regions and species groupings are consistent with that found in the left regions.

67

1.0 a) 1.0 b)

●● ● ● ● P. greggii P. greggii 0.5 P. patula 0.5 P. patula ●

●● ●

0.0 0.0 ●● P. maximinoi P. maximinoi PC2 (22.1%) P. tecunumanii PC2 (22.1%) P. tecunumanii

−0.5 P. oocarpa −0.5 P. oocarpa

● ●

●● ● ● −1.0 −1.0

−1.00 −0.50 0.00 0.75 1.50 −1.00 −0.50 0.00 0.75 1.50 PC1 (29.1%) PC1 (29.2%)

Figure 2.9. Principal component analysis of left flanking (a) and right flanking (b) regions shared mapping locations. Correlation between species that are more closely related phylogenetically is strong as indicated by loading angles. The more closely related P. tecunumanii / P. oocarpa and P. patula / P. greggii pairs have angles that are close to one another, while the more distantly related P. maximinoi distant from any other species.

68

Chapter 3. Single Nucleotide Polymorphism Discovery and Probe

Design using Targeted Sequencing Data of Tropical and

Subtropical Pine Species

3.1 Introduction

The genus Pinus consists of over 100 species across the globe which are found naturally almost exclusively in the northern hemisphere (Price et al. 1998). Across their global range, there are two primary hotspots of biodiversity; one in China and another in

Mexico and Central America (Plomion et al. 2007). Amongst the tropical and subtropical pine species found in Mexico and Central America, many are considered to be threatened or endangered in their natural range. The threat to their natural populations has in turn spurred conservation efforts and led to the establishment of commercial land race populations for species in South America and in southern Africa (Dvorak et al. 2000a; Griess et al. 2016). Of these tropical and subtropical pine species, six species were selected for this study due to their importance to commercial timber markets and ecological significance: Pinus greggii,

Pinus maximinoi, Pinus oocarpa, Pinus patula, Pinus tecunumanii, and Pinus caribaea.

Being able to better understand the amount of genetic diversity present within these species and leveraging that information is crucial to effectively manage them for commercial use and gene conservation efforts. Utilizing molecular genetic data generated by the use of next generation sequencing technologies shows great promise in helping to expand our knowledge of such species and allow for more efficient management of genetic resources.

Next generation sequencing technologies have revolutionized the field of genetics by facilitating the use of high throughput genotyping to make inferences about population

69

structure and diversity of many different species. Whole genome resequencing studies have been performed on a number of important plant species to date (Tuskan et al. 2006; Myburg et al. 2014). However, these sorts of studies remain cost prohibitive and computationally inefficient for many species with large repetitive genomes such as those found in conifers.

Within conifers, only a handful of whole genome sequencing projects have been undertaken.

Reference genomes of Pinus tadea (Zimin et al. 2017), Picea abies (Nystedt et al. 2013), and

Pinus lambertiana (Stevens et al. 2016) are a few that have been assembled to date. The prohibitive cost and complexity associated with whole genome sequencing has generated methods that can be used to overcome these issues. The utilization of transcriptome and reduced representation sequencing alleviates much of the complexity associated with whole genome sequencing by focusing on coding and regulatory sequences in the genome.

Reduced representation sequencing such as targeted sequence capture have three advantages over sequencing of an entire genome: reduction of sequencing space allows for multiplexing of samples, and targets only areas of the genome that are informative. It also aides in identifying regions of interest across populations using intraspecific genomic assays

(Grover et al. 2012). Targeted sequence capture uses the hybridization of synthetic probes to sheared genomic DNA to selectively target regions of interest in the genome, which are then enriched and sequenced for further study (Vidalis et al. 2018). Hybridization based methods of sequence capture using synthetic designed probes have been successfully exhibited in a number of plant species with complex genome structures: P. taeda (Neves et al. 2013; Lu et al. 2016), P. abies (Vidalis et al. 2018), Pinus albicaulis (Syring et al. 2016), and Picea mariana (Pavy et al. 2016).

70

Additionally, target capture probes that have been developed for P. taeda have shown high transferability across pine species (Neves et al. 2013). This makes the genomic resources developed for P. taeda an attractive platform to extend these technologies into pine species that lack genome reference assemblies of their own, such as tropical pines mentioned before. Information gained from such experiments will prove to be valuable for the continued improvement of pure and hybrid breeding populations, as well as gene conservation efforts, for tropical and subtropical pine species.

The objectives of this study were to 1) utilize targeted sequence capture methods for single nucleotide polymorphic (SNP) loci discovery in six tropical or subtropical pine species, 2) assess SNPs and their probes for their potential use on a commercial high throughput genome wide genotyping array.

3.2 Methods and Materials

3.2.1 Plant Material and DNA Isolation

The population used in this study originated from the Pine Genomic Atlas Study that characterizes the genetic diversity and structure of commercially important pine species in southern Africa (Tii-kuzu, MSc in preparation). A total of six species were selected for this study: P. greggii, P. maximinoi, P. oocarpa, P. patula, P. tecunumanii, and P. caribaea.

Within these species, sub-species were represented for P. greggii (north and south), P. patula

(var. patula and var. longipedunculata), and P. tecunumanii (high and low elevation).

Foliage samples were taken from first generation genetic material established in Camcore field trials across South Africa for the extraction of genomic DNA. For the six species, a total of 81 pooled samples were created. Each pooled sample contained DNA between 4 to 8 trees from different families and represented a single provenance of a given species. A total of 567

71

trees were sampled (Table 3.1). The 81 provenances (Appendix 6) were selected, covering the natural range of the six species in Mexico and Central America in order to create a more robust and diverse dataset.

Genomic DNA extraction was processed by the Forest Molecular Genetics Program’s

DNA Fingerprinting Platform (University of Pretoria, South Africa). DNA was isolated from

50mg of fresh needle tissue using the NucleoSpin® Plant II DNA extraction kit (Machery-

Nagel, Germany) according to the manufacturer’s specifications. DNA was quantified using a MultiSkan™ Go Microplate Spectrophotometer (Thermo Scientific, MA, USA).

Approximately 100ng of genomic DNA per sample was shipped to RAPiD Genomics

(Gainesville, FL, USA) where equimolar pooled samples for each provenance were created for targeted capture sequencing.

3.2.2 Capture Probe Design and Target Enrichment

DNA was mechanically sheared to an average size of 400bp using a Covaris E210 focused ultrasonicator (Covaris, Woburn, MA). Libraries were constructed by repairing the ends of the sheared fragments using the End-It DNA End-Repair kit (Epicentre

Biotechnologies, Madison, WI), producing blunt end fragments. Ligation of a single adenine residue to the 3’ end of the blunt end fragment was performed using 15-U Klenow fragment

(New England Biolabs Inc., Ipswich, MA) and deoxyadenosine triphospate (dATP)

(Promega, Madison, WI). Barcoded adapters that are suited for Illumina Sequencing were ligated to the libraries and the ligated fragments were PCR-amplified using standard cycling protocols (Mamanova et al. 2010). A custom set of 40K target capture probes were developed by RAPiD Genomics to facilitate target capture sequencing. Of these 40K probes,

30K were designed from single copy regions of the v2.01 Pinus taeda genome assembly

72

(Zimin et al. 2017) and 10K were designed from the P. patula and P. tecunumanii transcriptome assemblies (Visser et al. 2015, 2018). A total of 16 barcoded libraries were pooled with equimolar amounts to a total of 500ng of DNA for hybridization to the target capture probes.

For target capture enrichment, the Select XT Target Enrichment System for solution- based target enrichment of Illumina paired-end data by Agilent (Palo Alto, CA, USA) was used. After enrichment, samples were re-amplified for an additional 6-14 cycles. All samples were sequenced using Illumina HiSeq 3000 with paired end 150bp reads. Samples were then de-multiplexed using Illumina’s BCLtofastq.

3.2.3 Quality Control and Alignment

The paired end reads generated from the 40K capture probes underwent processing through a custom SNP discovery pipeline (Figure 3.1). Quality control was performed on the reads using FastQC (Andrews 2010) for visualization. Raw reads that did not meet base quality or length standards were trimmed and filtered using Sickle v1.33 (Joshi & Fass 2011).

Standard procedures used required a base quality score of 30 and a minimum read length of

50bp. Trimmed reads from the 81 libraries were aligned to a modified version of the v2.01 P. taeda genome assembly.

Using SNP calls generated from a pilot study of the current study (unpublished data), v2.01 of the P. taeda genome assembly was modified to create the modified reference used in this study. The modified reference was created by switching the alternative and reference alleles at any location where the alternative allele appeared in more than 60% of the total number of observations. The alternative allele coverage was calculated per SNP across all provenances by [AO/(AO+RO)], where AO is the number of alternative observations and RO

73

is the number of reference observations. Approximately 25K bases were changed and the resulting modified P. taeda genome will be referred to as tropical pine reference genome going forward. Using Burrows-Wheeler Aligner (BWA) v0.7.15 (Li and Durbin 2009), the tropical pine reference genome was indexed using the index routine and trimmed reads aligned using the mem routine under standard parameters. The SAM files generated from

BWA were converted to BAM files and sorted using Samtools v1.7 (Li et al. 2009) view and sort routines respectively.

3.2.4 Variant Calling and SNP Probe Design

Reads that did not align within 500bp on either side of a capture probe location were removed from the datasets prior to variant calling. Reads were filtered using Samtools v1.7 view routine, to include only those reads which had the sam flag 2. Including only reads mapped in their proper pair and within the window of a capture probe were kept for downstream analysis. Variant calling was performed using Freebayes v1.2.0-2-g29c4002

(Garrison & Marth 2012). Variant calls were done jointly; using each of the 81 filtered sorted

BAM files as input and utilized the pooled-discrete and cnv_map parameters to specify ploidy in each of the 81 pooled samples. Additional parameters used include: min- alternative-fraction = 0.01, min-alternative-count = 2, min-coverage = 8, min-mapping- quality = 1, min-base-quality = 20, and report-genotype-likelihood-max.

The raw variants that were called were further filtered. Using vcflib v1.0.0-rc1

(Garrison 2016), indels, multi-nucleotide polymorphisms, and complex events were removed, leaving only SNPs for analysis. This SNP only dataset was filtered using two processes, each filtering step was done independently and did not influence the results of the other. The top down process filtered SNPs across all samples based on the following criteria:

74

an overall alternative allele fraction (AAF) > 0.1 and overall coverage > 100X. The bottom up process used the following criteria: maximum within sample AAF > 0.2 and maximum within sample alternative observation > 4. If a SNP was in either post filter dataset, it was selected for probe design. The bottom up process filtered SNPs within sample, where if any sample for a given SNP met criteria the SNP was kept.

SNPs that passed filtering underwent probe design. A probe in this case is defined as a SNP plus the sequence of a set number of DNA bases on its upstream and downstream flank. Probes were designed using technical specifications outlined for use on ThermoFisher

Scientific’s Axiom genotyping array platform. Technical specifications required at least one region, from a SNP, either upstream or downstream, to be free of another SNP or indel for 35 bases. These upstream and downstream regions around the SNP are also referred to as left and right flanking regions.

A set of custom Perl scripts was used to extract the 35 bases upstream and downstream of each SNP. International Union of Pure and Applied Chemistry (IUPAC) codes were then inserted into the probe flanking regions in order to denote the location of a

SNP within the flanking region. The flanking regions for each SNP probe were then trimmed to the innermost IUPAC code and those probes without at least a right or left flanking region of 35 bases were removed.

3.2.5 SNP and Probe Description

SNPs that had probes successfully designed were further characterized, examining their relatedness between and across species to gain insight on how they may behave when used on the screening array. SNPs were characterized at the species, subspecies and provenance levels through creation of incidence matrices and principal component analysis

75

(PCA). Furthermore, left and right SNP probe flanking regions that reached the full 35mer length were assessed for repetitive mapping to the tropical pine reference genome.

Species, subspecies, and provenance level incidence matrices were created for all

SNPs generated by variant calling. The species and subspecies matrices required a within sample (species or subspecies) AAF > 0.05 and AO > 4 for a SNP to be considered present in the specific species or subspecies. The provenance matrix required a within sample

(provenance) AAF > 0.1 and AO > 2 for a SNP to be considered present within a given provenance. The generated matrices were then subset to include only those SNPs that had a probe successfully designed for it and were used to illustrate how SNPs were shared among the species. Principal component analysis was performed at the species and subspecies levels on a random sample of 50K observations within the incidence matrices to assess concurrence of SNP between species and subspecies.

The repetitive mapping of flanking regions to the tropical pine genome assembly was done utilizing a series of software. Left and right flanking regions were assessed independently. A bed file denoting the SNP location was created using a custom R script for each dataset. Using Bedtools’ v2.27.1 (Quinlan and Hall 2010) slop routine, the bed files were adjusted to target the regions of interest for each dataset, either the 35 bases left or right of the SNP. The bed files were converted to fasta and then fastq file formats using Bedtools v2.27.1 getfasta routine and BBMap v37.41 (Bushnell, sourceforge.net/projects/bbmap) respectively. Arbitrary base scores of 35 were used in the generation of the fastq files. The fastq files for each species were aligned to the tropical pine genome assembly using BWA’s mem routine and the -a parameter, outputting all alignments for a read regardless of score or redundancy.

76

3.3 Results

3.3.1 Target Capture Sequencing and Alignment

Targeted sequencing capture of the six species generated between 4.6 million to 5.3 million raw paired end reads made up of between 690 million to 790 million bases on average per species. After trimming reads to for a minimum base quality score and minimum read length, between 4.0 to 4.6 million trimmed reads consisting of between 590 to 650 million bases were used for downstream analysis (Table 3.2). Average read length ranged from between 141 to 146 bases for each species after trimming. Alignment of the trimmed reads to the tropical pine genome yielded very high alignment percentages of >98% for each species. Filtering of the reads to include only those that mapped in proper pair and within 500 bases on either side of an area targeted for sequencing reduced the number of reads to be used in variant calling by ~57% across all species.

3.3.2 Variant Calling and Probe Design

Variant calling within properly aligned and paired reads using the tropical pine genome as reference generated a list of 4.3 million variants including complex events, indels, multi-nucleotide polymorphisms, and SNPs. Filtering of this set to include only SNP markers yielded a list of 3.4 million raw SNPs. Results from the top down filtering of the raw SNPs created a list of about 418K SNPs while bottom up generated a list of approximately 1.3 million SNPs. A total of 403K SNPs were common across both filtered datasets (Figure 3.2).

Sets were merged to create a list of 1,356,522 SNPs for probe design.

Probe design processes performed on the approximately 1.3 million SNPs selected yielded a total of 562K probes. Of these probes, 228K had only a left flanking region clear of another SNP while 227K were generated with only a right flanking region meeting technical

77

specifications. 107K probes were generated with both a full left and right 35mer flanking sequence clear of another SNP.

3.3.3 SNP and Probe Characterization

SNPs with successful probes were further characterized at the species, subspecies, and provenance level. About 26K SNPs are shared across all study species (Figure 3.3).

Pinus caribaea and P. maximinoi acted as outliers with a much lower proportion of SNPs shared between two, three, or four species, making up 29% and 13% of shared SNPs respectively. Pinus maximinoi contained a much higher proportion of private SNPs (63%) when compared to all other species. Looking at the distribution of all possible sharing combinations across species, P. oocarpa and P. tecunumanii shared SNPs at a slightly greater proportion (13%) with each other than with other species, none of which exceeded 10% sharing. Pinus maximinoi once again exhibits a low proportion of sharing across all other species and has a high number of private SNPs (Figure 3.4). SNP sharing at the subspecies level exhibited the same overall pattern as at the species level. Across subspecies, about 24K

SNPs were shared while about 26K SNPs were shared between eight subspecies. Subspecies of the same species had very similar proportion of SNP sharing (Figure 3.5). Pinus maximinoi exhibited a much higher proportion (63%) of private SNPs than other subspecies.

At the provenance level, a small number of SNPs were present across all 81 provenances or were unique to a single provenance. Pinus maximinoi once again acted as an outlier. When looking at the pattern of sharing across provenances it had a much higher proportion being shared between 2-14 than the other categories (Figure 3.6). All other provenances exhibited a similar pattern between them.

78

Principal component analysis performed on the species and subspecies SNP sharing matrices both showed P. maximinoi behaving very differently than the other species (Figure

3.7). At the species level, 46% and 17% of the variance was explained by component one and two respectively. Loading angles of the species, except for P. maximinoi, were fairly similar indicating correlation between them. Looking at the subspecies level, 48% and 11% of variance was explained by component one and two respectively. Loadings segregated by subspecies within species, grouping P. greggii, P. patula, and P. tecunumanii subspecies together, indicating stronger correlation between closer related species.

Successfully designed probes had both their left and right flanking regions assessed for repetitive mapping to the tropical pine genome reference. The range of times mapped for a flanking region ranged from 1 to >500. However, the majority of left and right flanking regions mapped to a unique location to the genome (Figure 3.8).

3.4 Discussion

Reduced representation sequencing models such as targeted capture sequencing are attractive alternatives to whole genome resequencing studies for non-model organisms that have large, repetitive, and incomplete genome assemblies such as conifers. Despite their massive size, many over 20Gb (De La Torre et al. 2014), coding regions in pine species seem to have not expanded much beyond that of other vascular plants (Kirst et al. 2003). Instead, much of the size seems to stem from the presence of repetitive elements. An estimated 82% of the P. taeda genome is assumed to be repetitive with 62% of the repeats stemming from retrotransposons (Neale et al. 2014) with the presence of pseudogenes contributing although to a lesser known degree (Kovach et al. 2010). Therefore, being able to selectively target single copy coding and regulatory sequences greatly reduces the computation and challenges

79

associated with these repetitive elements. This, coupled with the transferability of enrichment probes between related species (Neves et al. 2013), makes targeted sequencing an ideal method for this study.

This study exhibited one of the challenges associated with implementation of targeted sequencing in incomplete and poorly characterized genomes, that being off target capture.

Off-target capture occurs either through the inadvertent targeting of multi-copy regions in the genome due to mistakes in the assembly the probes were designed from or due to evolutionary divergence between the target and reference species (Grover et al. 2012; Syring et al. 2016). In this study, an off-target capture of greater than 50% was shown across all species when enrichment libraries were mapped to the full tropical pine reference genome.

This is a higher percentage than seen in other studies (Fu et al. 2010; Vidalis et al. 2018); although not unreasonable given most of the target capture probes were designed from a somewhat diverged species' genome assembly.

Investigation into this number by restricting mapping locations to only regions targeted by probes, showed large coverage differences in approximately 2,000 of the 40,000 targeted regions when compared to mapping to the full genome. Even though the target probes were designed to target single copy locations within the genome, it is possible that due to an error in annotation or the fragmented nature of the assembly, the 2K probes targeted one of the many repetitive elements in the genome. If the 2K probes were generated from the transcriptome assemblies of P. patula or P. tecunumanii, it is possible that the probe locations appeared single copy but due to either the divergence of gene structure from P. taeda or the incomplete assemblies for both the transcriptome and genome, they were not recognized as such. These reads that mapped off target were subsequently filtered in an effort

80

to reduce potential sources of false SNPs during the variant calling probe design stages

(Vidalis et al. 2018).

In an effort to increase efficacy of the SNP probes when used on the screening array, the original 3.3 million SNPs were filtered based on allele frequency and depth statistics. The mean pre-filter depth per SNP was 22X when averaged across all individuals and 1700X when summed across individuals. Filtering metrics were based on eight trees per pooled sample. The bottom down filtering assumes good coverage for a SNP in each of the 81 provenance pools is 20X. Using the selected number of alternative observations AO of >4 and alternative fraction AF > 0.2, this equates to at least two of eight trees carrying the alternate allele. This effectively reduces SNPs that may be present in a single tree or that were generated due to sequencing and mapping errors in the downstream analysis. Top down filtering uses global (across species) metrics. Again, assuming a well-covered SNP is represented by 20X coverage in a provenance pool, an estimated 1600X total coverage would be expected if the SNP is present in all provenances. Using the selected AF > 0.1, this would correspond to an AO of 160 and indicate a SNP is present in at least 8 provenance pools. To reduce the stringency, a depth of 100X was selected indicating an AO of 10. Thus, filtering for SNPs that may be present in at least five provenance pools. These filtering criteria should reduce the number of provenance specific SNPs present in the dataset and produce SNPs shared by multiple species.

One of the challenges associated with multispecies and single species fixed genotyping arrays is the presence of ascertainment bias. Ascertainment bias stems from either nonrandom sampling or by SNP discovery protocols that favors those SNPs with higher minor allele frequencies (Lachance and Tishkoff 2013). This bias can become

81

troublesome when assessing population structure and diversity, which depends on allele frequencies and become problematic when genotyping across species (Albrechtsen et al.

2010; Garvin et al. 2010). SNPs called from this study are intrinsically biased towards shared regions within P. taeda genome due to the design of the target capture probes.

Although true ascertainment bias metrics will be assessed with the construction of the

SNP screening array, in an effort to gain insight into this potential problem, SNP sharing at the species, subspecies and provenance level was assessed. This showed a large portion of

SNPs being shared across many species, indicating the potential for ascertainment bias downstream. Additionally, PCA analysis on SNP sharing at the species and subspecies levels showed departure from what is expected from phylogeny. Phylogenetic analysis of many

Mexican and Central American pines using RAPD markers has shown a strong relationship between P. greggii and P. patula species along with P. tecunumanii and P. oocarpa being more closely related. Pinus caribaea seems to be genetically equidistant from the other species in the study (Dvorak et al. 2000b). An additional study places P. maximinoi as a distinct genetic group in relation to the others (Vargas-Mendoza et al. 2011). At the species level, correlation of SNP sharing projected by PCA does not show a high level differentiation between species that are not closely related. Loading angles for all species other than P. maximinoi were fairly similar. At the subspecies level, subspecies of the same species shared almost identical angles. However, there was very little variation in the angles overall. This indicates that SNPs generated from this dataset have the potential to skew the screening array toward SNPs that are commonly shared and may not be useful as a single species identifier.

However, with careful consideration on SNP selection, multispecies SNP chips have exhibited little bias and utility across species (Silva-Junior et al. 2015).

82

Eventually, after scoring of the probes and validation through a screening array, the final outcome of this project will be a multispecies commercial genotyping SNP array. The primary purpose for this array will be to accelerate breeding efforts through genomic selection, its utility in population genetics and hybrid species studies will also provide valuable resources to tree improvement and forest genetics professionals. It is estimated that genomic selection has the potential to increase selection efficiency per year by between 50-

87% when compared to traditional pedigree based methods in scots pine (Calleja-Rodriguez et al. 2019). Recently, genomic selection has shown utility in hybrid populations of eucalyptus (Resende et al. 2017) underscoring the need for both markers that are transferrable across and unique to individual species. Additionally, species and breed specific markers have been shown to be useful in marker-trait association studies (Meuwissen et al.

2013; Pareek et al. 2017). Such SNP markers could prove valuable for the species in this study that have hybrid populations deployed commercially.

83

3.5 Literature Cited

Albrechtsen A, Cilius Nielsen F, Nielsen R (2010) Ascertainment Biases in SNP Chips

Affect Measures of Population Divergence. doi: 10.1093/molbev/msq148

Andrews, S. (2014). FastQC A Quality Control tool for High Throughput Sequence Data.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Calleja-Rodriguez A, Pan J, Funda T, et al (2019) Genomic Prediction accuracies and

abilities for growth and wood quality traits of Scots pine, using genotyping-by-

sequencing (GBS) data. bioRxiv 607648. doi: https://doi.org/10.1101/607648

De La Torre AR, Birol I, Bousquet J, et al (2014) Insights into Conifer Giga-Genomes.

PLANT Physiol 166:1724–1732. doi: 10.1104/pp.114.248708

Dvorak WS, Gutierrez EA, Hodge GR, et al (2000a) Conservation & Testing of Tropical &

Subtropcial Forest Tree Species by the CAMCORE Cooperative. NCSU

Dvorak WS, Jordon AP, Hodge GP, Romero JL (2000b) Assessing evolutionary relationships

of pines in the Oocarpae and Australes subsections using RAPD markers. New For

20:163–192. doi: 10.1023/A:1006763120982

Fu Y, Springer NM, Gerhardt DJ, et al (2010) Repeat subtraction-mediated sequence capture

from a complex genome. Plant J 62:898–909. doi: 10.1111/j.1365-313X.2010.04196.x

Garrison E. (2016) Vcflib, a simple C++ library for parsing and manipulating VCF files.

https://github.com/vcflib/vcflib.

Garrison E, Marth G. (2012) Haplotype-based variant detection from short-read sequencing.

arXiv preprint arXiv:1207.3907 [q-bio.GN]

Garvin MR, Saitoh K, Gharrett AJ (2010) Application of single nucleotide polymorphisms to

non-model species: a technical review. Mol Ecol Resour 10:915–934. doi:

84

10.1111/j.1755-0998.2010.02891.x

Griess VC, Uhde B, Ham C, Seifert T (2016) Product diversification in South Africa’s

commercial timber plantations: a way to mitigate investment risk. South For a J For Sci

78:145–150. doi: 10.2989/20702620.2015.1136508

Grover CE, Salmon A, Wendel JF (2012) Targeted sequence capture as a powerful tool for

evolutionary analysis. Am J Bot 99:312–319. doi: 10.3732/ajb.1100323

Joshi NA, Fass JN. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool

for FastQ files (Version 1.33) [Software]. Available at

https://github.com/najoshi/sickle.

Kirst M, Johnson AF, Baucom C, et al (2003) Apparent homology of expressed genes from

wood-forming tissues of loblolly pine (Pinus taeda L.) with Arabidopsis thaliana. Proc

Natl Acad Sci U S A 100:7383–7388

Kovach A, Wegrzyn JL, Parra G, et al (2010) The Pinus taeda genome is characterized by

diverse and highly diverged repetitive sequences. BMC Genomics 11:

Lachance J, Tishkoff SA (2013) SNP ascertainment bias in population genetic analyses: Why

it is important, and how to correct it. Bioessays 35:780–786. doi:

10.1002/bies.201300014

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics 25:1754–1760. doi: 10.1093/bioinformatics/btp324

Li H, Handsaker B, Wysoker A, et al (2009) The Sequence Alignment/Map format and

SAMtools. Bioinformatics 25:2078–9. doi: 10.1093/bioinformatics/btp352

Lu M, Krutovsky K V., Nelson CD, et al (2016) Exome genotyping, linkage disequilibrium

and population structure in loblolly pine (Pinus taeda L.). BMC Genomics. doi:

85

10.1186/s12864-016-3081-8

Mamanova L, Coffey AJ, Scott CE, et al (2010) Target-enrichment strategies for next-

generation sequencing. Nat Methods 7:111–118. doi: 10.1038/nmeth.1419

Meuwissen T, Hayes B, Goddard M (2013) Accelerating Improvement of Livestock with

Genomic Selection. Annu Rev Anim Biosci 1:221–237. doi: 10.1146/annurev-animal-

031412-103705

Myburg AA, Grattapaglia D, Tuskan GA, et al (2014) The genome of Eucalyptus grandis.

Nature 510:356–362. doi: 10.1038/nature13308

Neale DB, Wegrzyn JL, Stevens KA, et al (2014) Decoding the massive genome of loblolly

pine using haploid DNA and novel assembly strategies. Genome Biol 15:R59. doi:

10.1186/gb-2014-15-3-r59

Neves LG, Davis JM, Barbazuk WB, Kirst M (2013) Whole-exome targeted sequencing of

the uncharacterized pine genome. Plant J 75:146–156. doi: 10.1111/tpj.12193

Nystedt B, Street NR, Wetterbom A, et al (2013) The Norway spruce genome sequence and

conifer genome evolution. Nature 497:579–584. doi: 10.1038/nature12211

Pareek CS, Błaszczyk P, Dziuba P, et al (2017) Single nucleotide polymorphism discovery in

bovine liver using RNA-seq technology. PLoS One 12:1–27. doi:

10.1371/journal.pone.0172687

Pavy N, Gagnon F, Deschênes A, et al (2016) Development of highly reliable in silico SNP

resource and genotyping assay from exome capture and sequencing: An example from

black spruce (Picea mariana). Mol Ecol Resour 16:588–598. doi: 10.1111/1755-

0998.12468

Plomion C, Chagné D, Pot D, et al (2007) Pines. In: Forest Trees. Springer Berlin

86

Heidelberg, Berlin, Heidelberg, pp 29–92

Price RA, Liston A, Strauss SH (1998) Phylogeny and systematics of Pinus. In: Richardson

DM (ed) Ecology and Biogeography of Pinus. Cambridge University Press, Cambridge,

pp 49–68

Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic

features. Bioinformatics 26:841–842. doi: 10.1093/bioinformatics/btq033

Resende MFR, Muñoz P, Acosta JJ, et al (2012) Accelerating the domestication of trees

using genomic selection: accuracy of prediction models across ages and environments.

New Phytol 193:617–624. doi: 10.1111/j.1469-8137.2011.03895.x

Resende RT, Resende MD V, Silva FF, et al (2017) Assessing the expected response to

genomic selection of individuals and families in Eucalyptus breeding with an additive-

dominant model. Heredity (Edinb) 119:245–255. doi: 10.1038/hdy.2017.37

Silva-Junior OB, Faria DA, Grattapaglia D (2015) A flexible multi-species genome-wide

60K SNP chip developed from pooled resequencing of 240 Eucalyptus tree genomes

across 12 species. New Phytol 206:1527–1540. doi: 10.1111/nph.13322

Stevens KA, Wegrzyn JL, Zimin A, et al (2016) Sequence of the Sugar Pine Megagenome.

Genetics 204:1613–1626. doi: 10.1534/genetics.116.193227

Syring J V, Tennessen JA, Jennings TN, et al (2016) Targeted Capture Sequencing in

Whitebark Pine Reveals Range-Wide Demographic and Adaptive Patterns Despite

Challenges of a Large, Repetitive Genome. Front Plant Sci 7:484. doi:

10.3389/fpls.2016.00484

Tuskan GA, DiFazio S, Jansson S, et al (2006) The Genome of Black Cottonwood, Populus

trichocarpa (Torr. & Gray). Science (80- ) 313:1596–1604. doi:

87

10.1126/science.1128691

Vargas-Mendoza C, Medina-Jaritz N, Ibarra-Sanchez C, et al (2011) Phylogenetic Analysis

of Mexian Pine Species Based on Three Loci from Different Genomes (Nuclear,

Mitochondiral, and Chloroplast). In: Agboola J (ed) Relevant Perspecites in Global

Environmental Change. InTech, pp 139–154

Vidalis A, Scofield DG, Neves LG, et al (2018) Design and evaluation of a large sequence-

capture probe set and associated SNPs for diploid and haploid samples of Norway

spruce (Picea abies). bioRxiv 291716. doi: 10.1101/291716

Visser EA, Wegrzyn JL, Myburg AA, Naidoo S (2018) Defence transcriptome assembly and

pathogenesis related gene family analysis in Pinus tecunumanii (low elevation). BMC

Genomics 19:1–13. doi: 10.1186/s12864-018-5015-0

Visser EA, Wegrzyn JL, Steenkmap ET, et al (2015) Combined de novo and genome guided

assembly and annotation of the Pinus patula juvenile shoot transcriptome. BMC

Genomics 16:1057. doi: 10.1186/s12864-015-2277-7

Zimin A V, Stevens KA, Crepeau MW, et al (2017) An improved assembly of the loblolly

pine mega-genome using long-read single-molecule sequencing. Gigascience 6:1–4. doi:

10.1093/gigascience/giw016

88

Table 3.1. Provenance and tree sampling structure for each of the nine subspecies selected for targeted capture. Each provenance represents a pooled sample submitted for sequencing and contained between 4 to 8 trees in the sample. The 81 total provenances represented the natural range of the species.

Provenances Species Trees sampled (pooled samples) P. caribaea 4 24 P. oocarpa 11 75 P. greggii N 10 67 P. greggii S 7 52 P. maximinoi 13 87 P. patula var. patula 7 56 P. patula var. longipedunculata 4 32 P. tecunumnaii HE 16 109 P. tecunumanii LE 9 65 Total 81 567

89

Table 3.2. Sequencing metrics generated from targeted capture sequencing. Sequencing generated millions of reads and bases on average per species. Raw reads underwent quality control trimming, requiring a base quality score of 30 and minimum read length of 50bp.

Average read length decreased slightly from the 150bp raw reads generated from sequencing to ~143bp.

Raw Reads Raw Bases Trimmed Reads Trimmed Bases Species (millions) (millions) (millions) (millions) P. caribaea 4.60 693 4.00 588 P. greggii 5.30 795 4.58 652 P. maximinoi 4.92 737 4.20 598 P. oocarpa 4.83 724 4.12 586 P. patula 5.10 776 4.38 622 P. tecunumanii 5.27 791 4.52 641

90

Figure 3.1. Work flow for SNP discovery using targeted sequencing data. Work flow includes two phases, bioinformatics (blue) and probe design (purple). Software used and file routines are shown in ovals underneath the colored boxes.

91

15310

Top-down: 418565

Bottom-up: 1341242 Total: 1356552

Figure 3.2. The distribution of SNPs shared between the Top-down and Bottom-up filtering datasets. Between the two sets of SNPs generated from filtering approximately 403K SNPs were shared between the two. This resulted in approximately 1.36 million unique SNPs selected for probe design.

92

200,000

175,000

150,000

125,000 Private Shared: between two species

100,000 Shared: between three species Shared: between four species Shared: between five species

Number of SNPs 75,000 Shared: between all species

50,000

25,000

0

P. greggii P. patula P. caribaea P. oocarpa P. maximinoi P. tecunumanii Species

Figure 3.3. Sharing pattern for SNPs which had probes successfully designed for them by species. SNP sharing patterns among the species are relatively similar with the exception of

P. maximinoi which contains a large number of private SNPs. There are a total of just over

25K SNPs that are shared across all species.

93

P. oocarpa 109000

25986

P. caribaea 2474 1507 369 13171 1902 6688 1261 1845 4302 P. greggii 1621 6076 571 480 270 1202 191 23654

67 1953 21642 87 976 129 3656 1964 35826

168 671 261 506 57 26689

44711039 61 102

952 58 413 431 290 1767 161 4763 204 1320 495 15262 2529 120988 29155 2242 268 3082 1389 23345 1069 7941 9718 P. maximinoi P. tecunumanii 8955 2090

20810

P. patula Figure 3.4. Distribution of SNP sharing across species for SNPs with successful probes. SNP sharing patterns among the species are relatively similar with the exception of P. maximinoi which contains a large number of private SNPs when compared to the others.

94

200,000

175,000

150,000

125,000 Private Shared between two species Shared between three species 100,000 Shared between four species Shared between five species Shared between six species 75,000 Shared between seven species Number of SNPs Shared between eight species Shared between all 50,000

25,000

0

P. caribaea P. greggii S P. oocarpa P. greggii N P. maximinoi

P. patula var. long.P. patula var.P. pat. tecunumaniiP.tecunumanii LE HE Species

Figure 3.5. Sharing pattern for SNPs with successful probes by subspecies. Subspecies exhibited a relatively similar number of private SNPs (~20K) with the exception of P. maximinoi (~120K). All species shared just under 25K SNPs between each other.

95

100,000

75,000

Private Shared: 2−14 provenances Shared: 15−27 provenances 50,000 Shared: 28−40 provenances Shared: 41−54 provenances Shared: 55−66 provenances

Number of SNPs Shared: 67−80 provenances Shared: All

25,000

0 P. patula_1 P. patula_2 P. patula_3 P. patula_4 P. patula_5 P. patula_6 P. patula_7 P. patula_8 P. patula_9 P. greggii_1 P. greggii_2 P. greggii_3 P. greggii_4 P. greggii_5 P. greggii_6 P. greggii_7 P. greggii_8 P. greggii_9 P. patula_10 P. patula_11 P. greggii_10 P. greggii_11 P. greggii_12 P. greggii_13 P. greggii_14 P. greggii_15 P. greggii_16 P. greggii_17 P. oocarpa_1 P. oocarpa_2 P. oocarpa_3 P. oocarpa_4 P. oocarpa_5 P. oocarpa_6 P. oocarpa_7 P. oocarpa_8 P. oocarpa_9 P. caribaea_1 P. caribaea_2 P. caribaea_3 P. caribaea_4 P. oocarpa_10 P. oocarpa_11 P. maximinoi_1 P. maximinoi_2 P. maximinoi_3 P. maximinoi_4 P. maximinoi_5 P. maximinoi_6 P. maximinoi_7 P. maximinoi_8 P. maximinoi_9 P. maximinoi_10 P. maximinoi_11 P. maximinoi_12 P. maximinoi_13 P. tecunumanii_1 P. tecunumanii_2 P. tecunumanii_3 P. tecunumanii_4 P. tecunumanii_5 P. tecunumanii_6 P. tecunumanii_7 P. tecunumanii_8 P. tecunumanii_9 P. tecunumanii_10 P. tecunumanii_11 P. tecunumanii_12 P. tecunumanii_13 P. tecunumanii_14 P. tecunumanii_15 P. tecunumanii_16 P. tecunumanii_17 P. tecunumanii_18 P. tecunumanii_19 P. tecunumanii_20 P. tecunumanii_21 P. tecunumanii_22 P. tecunumanii_23 P. tecunumanii_24 P. tecunumanii_25 Species

Figure 3.6. Sharing pattern for SNPs with successful probes by provenance. Provenances exhibited a relatively similar number of private within them. Pinus maximinoi provenances had a much larger number of SNPs shared between 2 to 14 provenances than other species.

96

0.1 a) P. tecunumanii b) ● ● 0.25 P. oocarpa ● ● P. oocarpa P. caribaea 0.0 P. tecunumaniiLE ● P. patula P. greggii P. tecunumaniiHE P. caribaea ● 0.00 ●

P. ●patula v. pat

P. patula v. long

● ● ● P. greggiiN

● P. greggiiS ● ●

● ● −0.5 ● ● PC2 (16.7%) PC2 (11.3%) −0.50 ●

P. maximinoi ● P. maximinoi ● −1.0 −1.00 ●

−0.75 −0.25 0.00 0.25 −0.55 −0.25 0.00 0.25 PC1 (46.1%) PC1 (47.5%)

Figure 3.7. Principal component analysis of SNP sharing for species (a) and subspecies (b).

Correlation of SNP sharing at the species level appears to be high as indicated by the narrow loading angles. Pinus maximinoi segregates itself from the other species showing little correlation. Subspecies follow a pattern as expected from phylogeny with subspecies of the same species sharing a similar loading angle.

97

100

75

Flank 50 Left Right Percent

25

0

1 2 3 4 5 >5 Number of Times Mapped

Figure 3.8. Repetitive mapping of left and right flanking regions for SNP probes when mapped to the "tropicalized" v2.01 P. taeda genome assembly. Over 75% of flanking regions in all species map to a unique location within the genome reference.

98

Chapter 4 Conclusions

Between the two datasets utilized in this study a large number of SNPs were discovered and selected for probe design. A total of 5.3 million SNPs were identified between the five pine species assessed in the RNA-Seq study, from which 1.3 million probes were successfully designed for assessment. For targeted capture sequencing, a total of 3.4 million SNPs were identified within six pine species, from which 562K probes were designed. This generated a total of 1.8 million probes for assessment from the two datasets.

Mapping flanking sites from the RNA-Seq generated probes to the tropical pine reference indicates a total of 1.6 million unique locations targeted between the two datasets by approximately 1 million probes. This shows some overlap between the two SNP calling sets, which is expected due to the selective targeting of single copy coding or regulatory regions in the targeted capture design and the use of RNA-Seq.

Assessment of SNP sharing between species shows that there are a large proportion of SNPs that are unique to a given species for the RNA-Seq dataset and to a lesser extent in the targeted sequencing. Additionally, correlation of SNP sharing between more closely related species indicates that the SNPs identified and probes designed are behaving as expected from biology. This correlation between species could indicate that the SNPs generated from this study are true and not arbitrary calls due to sequencing or alignment errors. This is further supported by the Ts:Tv ratios that are consistent with previously published studies in tree species. The deviation from the 1:2 ratio expected from random events indicates that the SNPs are more likely to be true SNPs or at least following a mutation pattern influenced by biological mechanisms.

99

Ultimately, the assessment of the probes generated in this study through scoring and use on a screening array will provide definitive answers to their utility on a commercial genotyping array. However, results from this study will be valuable in helping to decide on

SNPs that give the appropriate species sharing distribution as well as identifying species specific SNPs to be included on the final version of the array. These sorts of species distribution considerations will be an important part of the design phase moving forward.

Particularly, in the future of this study, there is a need for a set of SNP probes that are shared across species, making the array valuable for populations of each species assessed. However, the need for transferability will ideally be balanced with species specific SNPs that will be valuable in species and hybrid cross identification. Also, as shown in cattle, the need for breed specific markers, or in the case of this study species specific, are important in determining marker trait associations which differentiate populations (Pareek et al. 2017).

Despite the challenges, a number of successful multi-species genotyping arrays have been designed and shown to be effective (Hulse-Kemp et al. 2015; Silva-Junior et al. 2015b).

The completion of the commercial array for these species will provide tree improvement managers, breeders, and scientists a powerful tool to aide in the management and conservation of tropical pine species. The array will provide users with a highly transferrable, reproducible, and computationally friendly technique to assess their populations. This makes the array useful to a wide range of individuals and applications by removing the need for special bioinformatics knowledge or infrastructure, helping usher in the use of molecular breeding technology in commercial pine breeding programs.

100

4.1 Literature Cited

Hulse-Kemp AM, Lemm J, Plieske J, et al (2015) Development of a 63K SNP Array for

Cotton and High-Density Mapping of Intraspecific and Interspecific Populations of

Gossypium spp. doi: 10.1534/g3.115.018416

Pareek CS, Błaszczyk P, Dziuba P, et al (2017) Single nucleotide polymorphism discovery in

bovine liver using RNA-seq technology. PLoS One 12:1–27. doi:

10.1371/journal.pone.0172687

Silva-Junior OB, Faria DA, Grattapaglia D (2015) A flexible multi-species genome-wide

60K SNP chip developed from pooled resequencing of 240 Eucalyptus tree genomes

across 12 species. New Phytol. doi: 10.1111/nph.13322

101

APPENDIX

102

P. maximinoi

184575 P. greggii

6242 4037 P. oocarpa 87070 1244 2990 1909 6577 13723 1487 1330

6646 3186 134732 7216 3204 15943 1539 13391 9609 8744

50894 2429 23378 1905 23633

109153 1600 4267 23486

137990

P. tecunumanii

P. patula

Appendix 1. Sharing of left flanking regions allowing for one difference across species. P. maximinoi has the highest number of unique flanking regions while P. greggii and P. patula have a high number of shared flanks between the two as expected from phylogeny.

103

P. maximinoi

183844 P. greggii

6496 4159 P. oocarpa 86261 1290 3093 2022 6779 13914 1600 1426

6728 3336 133857 7297 3428 16094 1660 13781 9815 8876

51096 2534 23595 1965 23785

108512 1682 4461 23504

137239

P. tecunumanii

P. patula

Appendix 2. Sharing of left flanking regions allowing for two differences across species.

The number of flanking regions only marginally increases when compared with the one difference allowed scenario.

104

P. maximinoi

182154 P. greggii

2753 1796 P. oocarpa 86203 357 941 548 3043 6712 408 321

2047 961 132735 3571 561 7740 370 3211 3104 2845

24963 551 7553 582 11685

107158 444 1980 11193

135617

P. tecunumanii

P. patula

Appendix 3. Sharing of right flanking regions allowing for zero differences across species.

Right flanking regions appear to follow the same distribution as left regions, with many regions unique to each species and few shared across all.

105

P. maximinoi

182839 P. greggii

6429 1807 P. oocarpa 88069 455 1652 2272 7295 15464 760 513

3058 3671 142630 3306 1372 18135 2720 7469 13202 4739

30484 1129 13609 2628 29129

51163 611 4678 11097

141427

P. tecunumanii

P. patula

Appendix 4. Sharing of right flanking regions allowing for one difference across species.

There is an increase in the number of flanking regions shared between species in all categories with species sharing regions as you would expect from biology.

106

P. maximinoi

182250 P. greggii

6668 1839 P. oocarpa 87363 476 1681 2441 7424 15665 793 517

3083 3841 141930 3305 1472 18304 3014 7603 13542 4823

30528 1157 13696 2748 29340

50974 632 4874 11088

140741

P. tecunumanii

P. patula

Appendix 5. Sharing of right flanking regions allowing for two differences across species.

Flanking regions are more commonly shared between P. patula / P. greggii and P.oocarpa /

P.tecunumanii pairs than with other species.

107

Appendix 6. Table of geographic and climatic information for each of the 81 provenances sampled across all species used in targeted

capture sequencing.

Species Provenance Country Lat (N) Long (W) Elevation (m) Rainfall (mm)

P. caribaea Gualjoco Honduras 14° 55' 88° 14' 240 - 355 1200

Santa Cruz de Yojoa Honduras 14° 53' 86° 56' 530 - 720 2758

Limón Honduras 15° 51' 85° 23' 20 - 85 2452

Trincheras Guatemala 15° 27' 89° 03' 150 - 500 2000

P. greggii N El Madroño Mexico 21° 16' 99° 10' 1500 - 1660 1100

Jamé Mexico 25° 21' 100° 37' 2500 - 2590 650

Laguna Seca Mexico 21° 02' 99° 10' 1750 - 1900 820

Las Placetas Mexico 24° 55' 100° 11' 2370 - 2520 750

Los Lirios Mexico 25° 22' 100° 29' 2300 - 2400 650

Jamé Mexico 25° 21' 100° 37' 2500 - 2590 650

Cerro El Potosí Mexico 24° 54' 100° 12' 2430 - 2500 750

Ojo de Agua Mexico 24° 53' 100° 13' 2115 - 2400 750

La Tapona Mexico 24° 37' 100° 10' 2090 - 2350 650

108

P. greggii S Magueyes Mexico -- -- 2250 - 2350 1200

El Madroño Mexico 21° 16' 99° 10' 1500 - 1660 1100

Laguna Atezca Mexico 20° 49' 98° 46' 1250 - 1420 1642

Laguna Seca Mexico 21° 02' 99° 10' 1750 - 1900 820

Valle Verde Mexico 21° 29' 99° 10' 1150 - 1250 1400

San Joaquín Mexico 20° 56' 99° 34' 2130 - 2350 1109

Carrizal Chico Mexico 20° 26' 98° 20' 1360 - 1770 1855

P. maximinoi Cobán Guatemala 15° 28' 90° 24' 1300 - 1440 2075

San Jerónimo Guatemala 15° 03' 90° 15' 1280 - 1860 970

San Juan Sacatepéquez Guatemala 14° 41' 90° 38' 1580 - 2000 1138

Dulce Nombre de Honduras 14° 50' 88° 51' 1100 - 1300 1386 Copán Marcala Honduras 14° 10' 88° 01' 1600 - 1800 1670

Tatumbla Honduras 14° 02' 87° 07' 1400 - 1600 1153

San Jerónimo CH Mexico 17° 03' 92° 08' 940 - 1020 1417

San Jerónimo OA Mexico 16° 10' 97° 00' 1220 - 1480 1350

Candelaria Mexico 16° 00' 96° 31' 1370 - 1480 1117

109

Las Compuertas Mexico 17° 10' 99° 59' 1050 - 1200 1400

El Portillo Honduras 14° 28' 89° 01' 1400 - 1600 1325

Yuscarán Honduras 13° 50' 86° 55' 1500 - 1700 1300

La Lagunilla Guatemala 14° 42' 89° 57' 1540 - 1860 1017

P. patula var. patula Potrero de Monroy Mexico 20° 24' 98° 25' 2320 - 2480 1350

Corralitla Mexico 18° 38' 97° 06' 2000 - 2230 2500

Conrado Castillo Mexico 23° 56' 99° 28' 1500 - 2060 1012

Tlacotla Mexico 19° 40' 98° 05' 2750 - 2915 1097

Pinal de Amoles Mexico 21° 07' 99° 41' 2380 - 2550 1350

Zacualtipán Mexico 20° 39' 98° 40' 1980 - 2200 2047

Llano de las Carmonas Mexico 19° 48' 97° 54' 2530 - 2880 1097

P. patula var. El Manzanal Mexico 16° 06' 96° 33' 2350 - 2660 1348 longipedunculata El Tlacuache Mexico 16° 44' 97° 09' 2300 - 2620 2000

Ixtlán Mexico 17° 24' 96° 27' 2600 - 2870 1750

Santa María Papalo Mexico 17° 49' 96° 48' 2270 - 2720 1100

P. oocarpa Camotán Guatemala 14° 49' 89° 22' 740 - 960 926

110

El Castaño (Bucaral) Guatemala 15° 01' 90° 09' 930 - 1330 900

San Luis Jilotepeque Guatemala 14° 37' 89° 46' 950 - 1010 895

San Sebastian Coatlán Mexico 16° 11' 96° 50' 1750 - 1750 598

El Jícaro Mexico 16° 32' 94° 12' 1000 - 1000 1684

Taretan Mexico 19° 25' 102° 04' 1610 - 1610 1622

Huayacocotla Mexico 20° 30' 98° 25' 1190 - 1410 1711

San Pedro Solteapán Mexico 18° 15' 94° 51' 602 - 602 1812

La Petaca Mexico 23° 25' 105° 48' 1560 - 1710 1155

Chinipas Mexico 27° 18' 108° 35' 1140 - 1780 822

Dipilto Nicaragua 13° 43' 86° 32' 1075 - 1320 --

P. tecunumanii HE Pachoc Guatemala 14° 51' 91° 16' 2000 - 2500 1350

San Jerónimo Guatemala 15° 03' 90° 18' 1620 - 1850 1200

San Lorenzo Guatemala 15° 05' 89° 40' 1900 - 2100 1700

San Vicente Guatemala 15° 05' 90° 07' 1690 - 2200 1700

Celaque Honduras 14° 33' 88° 40' 1540 - 2030 1273

Las Trancas Honduras 14° 07' 87° 49' 2075 - 2185 1579

111

Chanal Mexico 16° 42' 92° 23' 2010 - 2350 1238

Chempil Mexico 16° 45' 92° 25' 2020 - 2220 1146

El Carrizal Mexico 15° 24' 92° 18' 2130 - 2280 2000

Jitotol Mexico 17° 02' 92° 51' 1660 - 1750 1701

Montebello Mexico 16° 06' 91° 45' 1660 - 1750 1909

Napite Mexico 16° 34' 92° 19' 2070 - 2350 1350

Rancho Nuevo Mexico 16° 41' 92° 35' 2280 - 2340 1238

Chiul Guatemala 15° 20' 91° 04' 2440 - 2680 1996

San José Mexico 16° 42' 92° 41' 2245 - 2400 1252

Juquila Mexico 16° 15' 97° 13' 2090-2260 1325

P. tecunumanii LE San Rafael del Norte Nicaragua 13° 14' 86° 08' 910 - 1170 1366

Yucul Nicaragua 12° 55' 85° 47' 1080 - 1330 1394

San Francisco Honduras 14° 57' 86° 07' 900 - 1590 1491

Villa Santa Honduras 14° 12' 86° 17' 800 - 1000 1302

Culmí Honduras 15° 08' 85° 36' 400 - 950 1491

Los Planes Honduras 14° 48' 87° 53' 1100 - 1650 2287

112

Cerro Cusuco Honduras 15° 29' 88° 12' 1350 - 1630 2287

Gualaco Honduras 15° 03' 86° 08' 600 - 800 1491

Cerro la Joya Nicaragua 12° 25' 85° 59' 940 - 1160 1394

113