
Investigating the North American Crepis (Asteraceae) agamic complex for evidence of homoploid hybridization using next-generation sequencing techniques

by

Kathleen McGrath

B.Sc.FSc. Trent University, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Botany)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

September 2014

© Kathleen McGrath, 2014

Abstract

Despite the classic place that the North American Crepis (Asteraceae) agamic complex holds in evolutionary literature, few of the hypotheses about the group presented in Babcock and Stebbins’ 1938 monograph have been tested. In particular, they hypothesized that the seven sexual diploids had strong interfertility barriers that prevented the formation of diploid hybrids. Here I present an analysis of two previously unrecognized diploid morphotypes, which belong to an unresolved clade with Crepis pleurocarpa and Crepis occidentalis based on plastid DNA variation. Morphological traits suggest that both morphotypes may be the product of diploid x diploid hybridization. I gathered nuclear SNP markers using genotyping-by-sequencing (GBS) to assess the origins of these two lineages. I constructed a de novo assembly of the nuclear genome of Crepis monticola to serve as a reference for SNP discovery. Analysis of contig length, number, and coverage indicates that the nuclear genome of Crepis is highly repetitive and shares the features that make other angiosperm genomes challenging to work with. This complexity, as well as technical challenges likely due to partial enzymatic digestion of genomic DNA during GBS library preparation, resulted in only 19 SNPs passing the quality filters. Nonetheless, these 19 markers were used to provide a preliminary assessment of the origins of the novel morphotypes. A signal of mixed ancestry was found for one of these morphotypes, with the majority of its genome being distinct from both C. occidentalis subsp. occidentalis and C. pleurocarpa. The second morphotype is of non-mixed ancestry, most closely resembling Crepis occidentalis. In a separate study, I provide a draft assembly of the Crepis monticola chloroplast genome. I show that gene order and content are unchanged from other members of Asteraceae with the exception of the rpl16 gene, which retains an intron that is reported as lost multiple times in Asteraceae. Results of a data analysis detailing the presence or absence of the first exon of rpl16 in published Asteraceae plastome sequences indicate that most of these supposed losses are errors, pointing to the need for careful examination of plastome assemblies gathered from databases.


Preface

This thesis is original, unpublished, independent work by the author, K. McGrath. All plant material used in this thesis was collected by Chris Sears, with Crepis runcinata samples obtained by Jeannette Whitton. The genotyping-by-sequencing protocol used was modified from protocols created by Greg Baute, Greg Owens, Kristen Nurkowski, David Toews, Miguel Alcaide, Sampath Seneviratne, and Haley Kenyon. The reference based UNIX pipeline used in data analysis was graciously provided by Greg Baute and Greg Owens. Additional scripts for data parsing were provided by David Tack.


Table of contents

Abstract
Preface
Table of contents
List of tables
List of figures
Acknowledgments
Dedication

1. Introduction
1.1. DNA sequencing: then and now
1.2. Next-generation sequencing: limitations and challenges
1.3. Genotyping-by-sequencing
1.4. Angiosperm nuclear genomes
1.5. Thesis chapters summary
1.5.1. Chapter two
1.5.2. Chapter three

2. Investigating the North American Crepis (Asteraceae) agamic complex for evidence of homoploid hybridization
2.1. Introduction
2.1.1. Historical background
2.1.2. Current taxonomic circumscriptions and putative hybrids
2.1.3. Homoploid hybridization
2.1.4. Goals and objectives
2.2. Materials and methods
2.2.1. Plant materials and DNA extractions
2.2.2. Isolation of Crepis monticola DNA for genomic paired-end Illumina sequencing
2.2.3. Genotyping-by-sequencing protocol
2.2.4. Raw sequence analysis and quality assessment
2.2.4.1. Crepis monticola assembly
2.2.4.2. Genotyping-by-sequencing data within reference based pipeline
2.2.4.3. Stacks
2.2.4.4. UNEAK
2.2.5. Crepis monticola De Novo assembly
2.2.6. Alignment of genotyping-by-sequencing data to Crepis monticola contigs
2.2.7. De Novo assembly in STACKS
2.2.7.1. Ustacks
2.2.7.2. Cstacks
2.2.7.3. Sstacks
2.2.7.4. Populations
2.2.8. De Novo assembly in UNEAK
2.2.9.1. Reference based SNP calling
2.2.9.2. De Novo SNP calling
2.2.10. Analysis of SNP data
2.3. Results
2.3.1. Crepis monticola sequencing and contig assembly
2.3.2. Genotyping-by-sequencing read analysis and assembly
2.3.2.1. Sequence analysis
2.3.2.2. De-multiplexing
2.3.3. SNP identification
2.3.4. Data analysis
2.3.4.1. Pair-wise Fsts
2.3.4.2. STRUCTURE analysis
2.3.4.3. Principal component analysis
2.4. Discussion
2.4.1. Crepis monticola genome composition and reference based alignment of GBS data
2.4.2. Genotyping-by-sequencing data
2.4.3. De Novo assembly of GBS data
2.4.4. Data analysis
2.4.5. Conclusions and recommendations

3. Assembling the chloroplast genome of Crepis monticola
3.1. Introduction
3.1.1. Historical background
3.1.2. Next generation sequencing for plastome assembly
3.1.3. Chapter goals and objectives
3.2. Methods
3.2.1. Sample and raw sequence data
3.2.2. Reference based alignment
3.2.3. Annotation of draft Crepis monticola plastome
3.2.4. Pattern of retention and loss of rpl16 exon 1 in Asteraceae
3.3. Results
3.3.1. Plastome size, gene content, order, and organization
3.3.2. Pattern of retention and loss of rpl16 exon 1 in Asteraceae
3.4. Discussion

Bibliography


List of tables

Table 1: List of Crepis collections from within the agamic complex and outgroup species, Crepis runcinata, showing geographical information, morphological species designation (TAXON), ploidy, and number of specimens (N) analyzed. Asterisks indicate species included in the final data set

Table 2: Summary of FastQC Analysis on Crepis monticola

Table 3: Summary statistics of the Crepis monticola nuclear contig assembly

Table 4: Summary of FastQC Analysis on GBS data

Table 5: Summary statistics of the de novo GBS pipeline assemblies (excluding C. runcinata and C. modocensis subsp. rostrata)

Table 6: Finalized SNP set for members of Clade V. SNPs are recorded as standard IUPAC nucleotide codes

Table 7: Pair-wise Fst values for populations in clade V, excluding Crepis occidentalis subsp. occidentalis (4065) and Crepis pleurocarpa (5014). Crepis nuvo subsp. X (4074, 4075), C. nuvo subsp. Y (5006), and C. pleurocarpa (5007) are included

Table 8: Compiled data for the presence of Exon 1 from GenBank published complete chloroplast genome sequences in the Asteraceae. Asterisks indicate presence of feature


List of figures

Figure 1: Sequencing and marker discovery in progenitor species of the Crepis agamic complex using both a reference based and de novo methodology

Figure 2: Number of sequences per sample (logarithmic scale) as determined by Stacks (green), UNEAK (yellow), and in-house scripts for the reference-based

Figure 3: STRUCTURE analysis of SNP data from individuals in clade V with K=2. C. occidentalis subsp. occidentalis (4065), C. nuvo subsp. X (4074, 4075), and C. nuvo subsp. Y (5006) form a single cluster distinct from C. pleurocarpa (5007, 5014)

Figure 4: STRUCTURE analysis of individuals in clade V with K=3. Crepis occidentalis subsp. occidentalis (4065) most closely resembles C. nuvo subsp. X (4075) and C. nuvo subsp. Y (5006), with C. nuvo subsp. X (4074) primarily of a different genetic background. C. pleurocarpa remains a distinct cluster (5007, 5014)

Figure 5: PCA of individuals in clade V. C. occidentalis subsp. occidentalis (4065), C. nuvo subsp. X (4074, 4075), and C. nuvo subsp. Y (5006) form a single cluster distinct from C. pleurocarpa (5007, 5014)

Figure 6: Draft plastome map of Crepis monticola chloroplast. Gene order and content are the same as in reference Lactuca sativa ‘salinas’ (not shown). Thick lines indicate extent of inverted repeats (IRa and IRb). Genes on outside of map are transcribed in clockwise direction, and genes on the inside are transcribed in counter-clockwise direction. Introns are not shown; gene box represents exon and introns together.


Acknowledgments

I would like to thank Chris Sears for his previous work on the Crepis agamic complex and his patience in leading an inexperienced new graduate student in the field. His guidance and encouragement were integral to my project starting off on the right foot. Computer expertise in the form of numerous late night troubleshooting sessions was provided throughout by David Tack. David’s experience with Python was invaluable to me as I learned and applied bioinformatic techniques to my data set. Additional computer aid from “The Gregs”, Greg Baute and Greg Owens, was of vital help, and their willingness to test my understanding of concepts and to act as a sounding board for various challenges was truly appreciated. My committee members, Dr. Keith Adams, Dr. Wayne Maddison, and Dr. Loren Rieseberg, provided invaluable guidance and advice throughout my degree. Unwavering emotional support came from a great number of family and friends, notably my mother, Theresa McGrath, the CDWM ladies, Chelsea Maskos, and Greg Ross. Without these people to lean on, de-stress with, and celebrate with, I would not be where I am or who I am today. My Masters supervisor, Dr. Jeannette Whitton, was a daily inspiration to me and is one of my greatest mentors. She was a truly amazing supervisor, and her commitment to a healthy work-life balance created a positive and productive environment that enabled me to persevere through a difficult set of challenges. Funding for this research was provided by UBC and NSERC.


Dedication

pro scientia


1. Introduction

1.1. DNA sequencing: then and now

Originally introduced in 1977 by Fred Sanger, DNA sequencing quickly became, and has since remained, one of the key molecular techniques in bioscience (Ronaghi et al., 1998). Sanger sequencing takes advantage of terminating dideoxy nucleic acids and fluorescent terminator chemistry to determine base-pair identity, a methodology that has remained largely unchanged for the last 30 years (McGinn & Gut, 2013). The low-throughput nature of Sanger sequencing, a result of template preparation, has always been the main limitation of this method, with a secondary limitation being poor detection of low-frequency variants due to background noise (Morey et al., 2013). Both of these limitations contributed to a practical upper limit on data acquisition and biological interpretation that, until recently, constrained our approach to a number of research questions. A radical shift has occurred in the last decade, however, and new techniques collectively referred to as next-generation sequencing (NGS) technologies are allowing for sequence determination at a much greater throughput and a fraction of the cost. This has rapidly resulted in a move away from Sanger techniques for most applications, and has led to a tipping point in scientific inquiry: previously, research questions were more advanced than technology could address, whereas now technology can not only address these questions but is also leading to the formulation of new ones (Morey et al., 2013).

For non-model organisms, research programs with limited resources, and organisms with traits that make them difficult to study with only Sanger methodologies, the ability of NGS to resolve billions of bases in hundreds of individuals for only a few thousand dollars is revolutionary. NGS is changing the landscape of genetics research dramatically and acting to democratize it, as obtaining this information is no longer out of reach for all but the most well funded labs and organisms (Morey et al., 2013). The genus Crepis (Asteraceae) is one such taxon that has benefited, and will continue to benefit, greatly from increased access to genomic information. Spread across the Northern Hemisphere and Africa, Crepis contains approximately 200 species and has had little investment in genetic data, with the only major phylogenetic work involving sampling of the ITS region of the nuclear genome (Enke & Gemeinholzer, 2008) and investigations of a subset of taxa with chloroplast regions (Whitton, 1994; Holsinger et al., 1999; Sears, 2011). Plagued by difficulties in garden propagation (Babcock & Navashin, 1930; Stebbins, 1950), DNA extraction (Whitton, pers. comm.; Sears, 2011), and genome size estimates using flow cytometry (Sears, 2011), Crepis has stubbornly fought to keep its secrets. NGS offers new hope in accessing genome information from Crepis and, in my project specifically, allows for inferences of the origins of populations and phenotypes that were previously far too fine scale to be resolved.

1.2. Next-generation sequencing: limitations and challenges

While NGS offers an exponential increase in genome data together with a decrease in time and cost, it is not without its challenges. First, while it took thirty years to move from Sanger sequencing to NGS, it has taken only five years to move from a single next generation sequencing platform to at least eleven different platforms integrating no fewer than four different sequencing chemistries (McGinn & Gut, 2013). This Moore’s law phenomenon means that a researcher must weigh the pros and cons of each technology in the context of their question and organism, but with only limited information available that highlights the differences and similarities in approaches, and little understanding of how their organism will perform given these differences. Furthermore, the sheer amount of short-read data generated requires massive computing resources and outpaced the statistical and bioinformatics tools available at the time these technologies started to gain in popularity. Since then, new bioinformatics tools have been in constant development, with the downside of being almost obsolete as soon as they are published (De Wit et al., 2012). This creates a bottleneck where biologists must learn at least some computer science in the form of a scripting language or micro-programming in order to properly filter through their data to arrive at biologically sound conclusions. In addition, as with all new technologies, it takes time for industry-agreed standards to be implemented, and the interim years can often see a “wild west” with publications that lack a rigorous methodological framework (Bild et al., 2014). Biologists unprepared for the massive and complicated data sets these NGS technologies produce may ignore the impact of “batch effects” or apply inappropriate algorithms or configuration parameters (Bild et al., 2014), leading to plausible but wholly inaccurate results.

1.3. Genotyping-by-sequencing

For my thesis, I chose to implement an NGS technique called genotyping-by-sequencing (GBS), which employs restriction endonuclease digestion to create a reduced representation of the genome by sequencing only the regions adjacent to restriction sites (Elshire et al., 2011). This is both a benefit and a potential hindrance: focusing sequencing efforts on these regions reduces data complexity, allowing for high density, accurate SNP discovery (Baird et al., 2008), but digestion success is key to all downstream steps, so enzyme choice and species-specific interactions can greatly decrease the efficacy of the reaction. Though multiple approaches have now been developed to sample reduced genome complexity, GBS is comparatively simple and is currently being used by other researchers in the Biodiversity Research Centre; troubleshooting expertise was therefore available. A challenge of GBS, which will be discussed in detail in Chapter 2, is achieving adequate coverage at sites across multiple individuals (Elshire et al., 2011), an issue that becomes non-trivial as genome size increases.
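The reduced-representation idea behind GBS can be illustrated with a short in-silico digest. The following Python sketch is only an illustration under simplifying assumptions, not the pipeline used in this thesis: it scans a toy sequence for the PstI recognition site (CTGCAG), collects a fixed-length tag immediately downstream of each site on the forward strand only, and reports what fraction of the sequence those tags cover. The function names, the toy sequence, and the 10 bp tag length are hypothetical choices made for the example.

```python
import re

PSTI_SITE = "CTGCAG"  # PstI recognition sequence (the enzyme cuts CTGCA^G)


def find_sites(sequence, site=PSTI_SITE):
    """Return the 0-based start position of every occurrence of the site."""
    # A lookahead is used so that overlapping occurrences are also reported.
    return [m.start() for m in re.finditer(f"(?={site})", sequence.upper())]


def flanking_tags(sequence, tag_length=10, site=PSTI_SITE):
    """Return the tag_length bases immediately downstream of each site."""
    tags = []
    for start in find_sites(sequence, site):
        begin = start + len(site)
        tag = sequence[begin:begin + tag_length].upper()
        if len(tag) == tag_length:  # skip sites too close to the sequence end
            tags.append(tag)
    return tags


def sampled_fraction(sequence, tag_length=10, site=PSTI_SITE):
    """Fraction of bases that fall inside at least one downstream tag."""
    if not sequence:
        return 0.0
    covered = set()
    for start in find_sites(sequence, site):
        begin = start + len(site)
        covered.update(range(begin, min(begin + tag_length, len(sequence))))
    return len(covered) / len(sequence)


if __name__ == "__main__":
    # Hypothetical toy sequence containing two PstI sites.
    toy = "AAAACTGCAGTTTTGGGGCCCCAAAATTTTCTGCAGACGTACGTACGG"
    print(find_sites(toy))
    print(flanking_tags(toy))
    print(round(sampled_fraction(toy), 3))
```

On real data, both sides of each cut site and both strands would be sampled, and incomplete digestion would leave some sites uncut, which is one way uneven coverage across individuals can arise.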

1.4. Angiosperm nuclear genomes

As the technologies implemented will influence the data sets obtained, so too will the genome being sequenced. Angiosperm genomes are notoriously more complex than mammalian genomes, with rampant genome duplication events, hybridization, and redundancy leading to genomes that are large and have limited gene space (Li et al., 2004). The level of redundancy in plant genomes is a result of multiple repeat classes distinct from those resulting from polyploidy: microsatellites, short interspersed nuclear elements (SINEs), DNA transposons, long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs), and segmental duplications (Treangen & Salzberg, 2012), which together can make up as much as 90% of the genome (e.g., in Triticeae; Li et al., 2004). This reduces the number of unique regions sequenced while increasing coverage of information-limited regions and, additionally, can greatly impede the assembly of reads into a reference genome, as the reads are unable to span (and thus resolve) the redundant sections of the genome (Treangen & Salzberg, 2012). For example, the genome size estimate for diploid members of the Crepis agamic complex is ~12 pg, twice the average genome size of known angiosperms (5.39 pg; Bennett & Leitch, Kew, 2012).
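To put a genome of this size in perspective, the arithmetic below converts a DNA mass in picograms to base pairs and estimates the sequencing depth expected from a given number of reads. It is a minimal sketch: the conversion factor of roughly 978 Mbp per picogram is the commonly used approximation, while the read count and read length in the example are hypothetical values chosen for illustration, not results from this study.

```python
MBP_PER_PG = 978  # approximate conversion: 1 pg of DNA is about 978 Mbp


def picograms_to_bp(picograms):
    """Convert a DNA mass in picograms to an approximate number of base pairs."""
    return picograms * MBP_PER_PG * 1_000_000


def expected_coverage(n_reads, read_length, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size_bp


if __name__ == "__main__":
    genome_bp = picograms_to_bp(12)  # the ~12 pg estimate quoted above
    # Hypothetical sequencing run: 100 million reads of 100 bp each.
    depth = expected_coverage(100_000_000, 100, genome_bp)
    print(f"genome size: {genome_bp / 1e9:.2f} Gbp")
    print(f"expected coverage: {depth:.2f}x")
```

The same calculation shows why coverage becomes limiting as genome size grows: for a fixed sequencing yield, doubling the genome size halves the average depth.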

1.5. Thesis chapters summary

1.5.1. Chapter two

In this chapter I present the results from my investigation into the Crepis agamic complex for evidence of a reticulate evolutionary history at the diploid level. It was my aim to use NGS technologies to perform population-wide genotype analyses in order to probe for a genetic signature of hybridization. I show that one lane of standard paired-end sequencing achieves less than 1x nuclear genome coverage, composed of numerous small contigs with an average length of 350 bp. I infer that genome size and redundancy in Crepis contributed directly to this, and that a different sequencing approach focused on long read generation for resolution of repetitive regions would increase contig length and genome coverage. I further present the results of three different bioinformatic pipelines for variant calling in the GBS data set and show that putative hybrids Crepis nuvo subsp. X and C. nuvo subsp. Y are of a genetic background distinct from C. pleurocarpa, with population 4074 of C. nuvo subsp. X showing evidence of introgression of a currently unknown genotype.


1.5.2. Chapter three

In this chapter I present the results of a draft assembly of the plastome of Crepis monticola, which was assembled from a reference alignment to the plastome of Lactuca sativa ‘salinas’ (Timme et al., 2007). This assembly was completed in order to increase the genetic resources available for the genus Crepis and to identify any unique features within the plastome. I show that no major insertions, deletions, or structural changes have occurred in Crepis monticola relative to other chloroplasts in Asteraceae. I further show that Crepis monticola has retained exon 1 and the intron of rpl16, the loss of which is reported in the reference species, Lactuca sativa ‘salinas’. I present the results of a preliminary meta-analysis showing the pattern of loss and retention of exon 1 of rpl16 in published plastome sequences in Asteraceae.


2. Investigating the North American Crepis (Asteraceae) agamic complex for evidence of homoploid hybridization

2.1. Introduction

2.1.1. Historical background

The North American Crepis agamic complex is a unique aggregate of with a geographical range spanning the western of from British Columbia to , and east to . The complex consists of seven evolutionarily distinct diploid species and their polyploid derivatives, and is characterized by the divergent reproductive methods utilized by the diploids versus polyploids. Crepis diploids are outcrossers that reproduce sexually while polyploid Crepis reproduce asexually by apomixis, thereby generating a reproductive barrier between the parents and newly formed agamospecies. This interplay of hybridization, polyploidy, and apomixis in a group of species was first recognized by Babcock and Stebbins (1938), who proposed recognizing the existence of groups sharing these features, coining the term “heteroploid agamic complex” (Smocovitis, 2009). The first to highlight the importance of an agamic complex in addressing questions pertaining to hybridization, species mechanisms, and the processes governing patterns of variation, Babcock and Stebbins (1938) formed numerous hypotheses regarding the origin and spread of members of the Crepis agamic complex, some of which are still being tested today.

Babcock and Stebbins (1938) hypothesized that North American Crepis belong to two groups, the first composed of the obligately sexually reproducing Crepis runcinata, and the second comprising the remaining nine species: two exclusively polyploid taxa, Crepis barbigera and Crepis intermedia, and seven sexual diploids at the base of the array of auto- and allopolyploid apomicts. They further proposed that strong interfertility barriers between the diploid progenitors prevented the formation of diploid hybrids (Smocovitis, 2009).

2.1.2. Current taxonomic circumscriptions and putative hybrids

A recent analysis of the Crepis agamic complex exists in the form of a plastid phylogeny and a detailed analysis of ploidy variation using flow cytometry and morphological variation (Sears, 2011). The plastid phylogeny supports the hypothesis of monophyly of the ten x=11 base chromosome number individuals within the genus, with the sexual obligate C. runcinata as sister to the other nine species, which also form a monophyletic group comprising the diploids and their polyploid derivatives (Sears, 2011). The relationships of the diploid species conformed to a bifurcating tree structure with only a single instance of conflict in the placement of diploid individuals of the same species within the tree. Resolution of relationships was strong, with clear support for the monophyly of groups of populations of five of the seven sexual diploids, while the remaining two (C. occidentalis and C. pleurocarpa) formed a polytomy. It is possible that these two species have evolved from a common ancestor and are not yet divergent enough to be resolved with chloroplast DNA (i.e., they are still sorting ancestral polymorphism). Alternatively, though these two species are morphologically divergent, they may have a history of interbreeding and plastid capture as the two species occur in sympatry in the California center of diversity.

While the individuals belonging to each species largely formed a single monophyletic group, an exception was seen in diploid Crepis modocensis subsp. rostrata collected from central Washington (Sears, unpublished). This population is disjunct from the rest of the species’ range, and only comes in contact with tetraploid Crepis atribarba subsp. originalis. Although this is currently a disjunct population, it is not unreasonable to hypothesize a hybrid origin for these individuals; recent studies of Hieracium have demonstrated the presence of stable homoploid hybrids at sites where the second parental species was either completely absent or extremely rare (Mraz et al., 2011). Although this population of C. modocensis subsp. rostrata is easily identifiable as belonging to this distinctive taxon, plastid data place it in a clade isolated from other populations of C. modocensis subsp. rostrata, and most closely related to C. pleurocarpa diploids originating within the California center of diversity. This last fact is especially interesting as the Cascade Mountain range and the entire state of Oregon separate these two populations geographically. I note, however, that current geographical ranges should not be the main indicator of potential hybrid progenitors, and suggest that investigations of the origins of this unusual disjunct population should consider a broad set of possible phylogenetic and geographical origins.

In the course of his research, Sears also discovered two previously unrecognized diploid morphotypes at Scott Mountain (California, USA), the same geographical region where the C. occidentalis and C. pleurocarpa samples that form a polytomy within the plastid phylogeny originated. These individuals share morphological features of both C. occidentalis and C. pleurocarpa and are strong candidates for a hybrid origin. Furthermore, the populations containing these morphotypes exist in a unique environment with strongly serpentine soil, creating ecological conditions that could aid hybrid stabilization and persistence. Ecological divergence is expected to play a major role in the likelihood of diploid hybrid speciation (Buerkle et al., 2000).

2.1.3. Homoploid hybridization

The identification of unique morphotypes, coupled with discordance in the placement of C. modocensis subsp. rostrata, necessitates re-evaluation of Babcock and Stebbins’ hypothesis that interfertility barriers would preclude the success of diploid hybrids within the complex. The creation of stable, fertile, and reproductively isolated homoploid hybrids is most often thought to occur via the recombinational speciation model (Stebbins, 1959), which states that parental species differing by two or more chromosomal rearrangements can create reduced-fertility F1 offspring that in turn generate recombined, novel, chromosomally balanced genotypes through backcrossing and interbreeding among these F1 individuals (Buerkle et al., 2000; Gross & Rieseberg, 2005). The initial reduction in fertility acts as a barrier to introgression with the parental species, which is further solidified by genetic recombination in subsequent generations. Stebbins’ model requires that the hybridizing species share homology in order for the offspring to be viable and, additionally, success of these putative hybrids will be influenced by extrinsic factors (Stebbins, 1959). The close evolutionary relationships between taxa of the Crepis agamic complex, combined with the homology in chromosome numbers, indicate that recombinational hybrids are possible within the complex, and with the additional influence of natural selection it is feasible that suitably fit offspring would result from diploid x diploid pairings. Ecological selection can greatly impact the success of any newly formed hybrid, with certain environments promoting stable hybrid derivatives; for example, disturbed and open terrain with relaxed competition is thought to favour hybrid establishment (Rieseberg & Wendel, 1993). Spatial modeling experiments by Buerkle et al. (2000; 2003) demonstrated that the availability and colonization of a novel habitat greatly increases the likelihood of survival and is crucial to the origin of a stable homoploid hybrid. The small (600 m²) area of strongly serpentine soil on which C. nuvo subsp. X and C. nuvo subsp. Y were found (Sears, 2011) represents a divergent habitat from that of the surrounding Crepis species and could be acting as an agent of ecological selection, aiding in the successful isolation of these new morphotypes.

2.1.4. Goals and objectives

I propose to contribute to our understanding of the evolutionary origins of the Crepis agamic complex using nuclear markers obtained through next-generation sequencing, specifically GBS. GBS offered the potential to address the following questions:

1. Is there evidence to suggest a reticulate evolutionary history among diploids in the complex as a whole? If so, is this evidence consistent with ongoing hybridization or with the existence of stable homoploid hybrids?

2. Are the disjunct populations of C. modocensis subsp. rostrata from Washington, USA and two unidentified diploid morphotypes (C. nuvo subsp. X and C. nuvo subsp. Y) from Scott Mountain (California, USA) examples of stable homoploid hybrid derivatives, recent hybrids, or divergent lineages of non-hybrid ancestry?

2.2. Materials and methods

2.2.1. Plant materials and DNA extractions

Field collection of leaf samples and ploidy confirmation via flow cytometry were completed by Chris Sears over the course of three field seasons (2007, 2009, 2011) in Washington, Oregon, and California, USA (Sears, 2011). Table 1 summarizes information on collection number, geographic location, morphological species classification, inferred ploidy, and number of samples (additional details in Sears, 2011). A total of 129 samples representing each of the species with diploid members in the complex, plus three outgroup samples of Crepis runcinata, were included in the investigation. Samples were silica-dried on site and stored in a -80 °C freezer (Thermo Scientific Forma900 Series) until needed. Leaf tissue quality of the samples listed in Table 1 was assessed visually using an empirical scale (1: good quality, green tissue, no evidence of disease or pathogen; 2: fair, mostly green tissue with some browning or other evidence of degradation; 3: poor, brown or browning tissue, evidence of disease or pathogen), and 20 mg of the highest quality leaf tissue was obtained from each individual for DNA isolation. Tissue was ground using a reciprocating saw method developed by Alexander et al. (2007) and genomic DNA was extracted using the Illustra DNA Extraction Kit Phytopure for plant DNA extractions (GE Healthcare, USA). This commercial kit was chosen because previously tested protocols failed to yield enough high quality DNA for Illumina-based sequencing. It is well established that members of the Asteraceae produce secondary metabolites and compounds (Panero & Funk, 2008) that can impede DNA extraction and inhibit further enzymatic reactions (Frier, 2005). Previous work with Crepis identified this material as especially challenging to work with (Whitton and Sears, pers. comm.). The resin-based technology of the Phytopure kit effectively removes complex polysaccharides, and the isopropanol-based DNA precipitation further reduces the loss of DNA associated with column-based extractions.

DNA yield was determined using a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, California) and DNA quality was assessed visually by running 3 µL of the extraction on a 1% agarose gel in 0.5X Tris/Borate/EDTA (TBE) at 80 V for 5 minutes followed by 25 minutes at 90 V.

2.2.2. Isolation of Crepis monticola DNA for genomic paired-end Illumina sequencing

To date there is limited genetic information available for the North American agamic complex of Crepis. Originally investigated by Babcock and Stebbins (1938) and more recently by Whitton (1994) and Sears (2011), the complex is known to have a base chromosome number of x = 11 and a diploid nuclear genome content (as estimated from flow cytometry) of approximately 12 pg, with plastid sequence data available for three regions (rpl16, rps16, trnS). In order to obtain preliminary information on gene content and arrangement of the nuclear genome of Crepis, a representative individual (Crepis monticola; 5001-8) from the complex was chosen for standard paired-end Illumina sequencing. An overview illustrating this step in our investigation in relation to the entire project workflow is given in Figure 1. Crepis monticola was chosen for sequencing because all diploid and polyploid samples formed a monophyletic group in the plastid phylogeny (Sears, 2011), and because the lack of morphological intermediates (putative allopolyploids) on the landscape (Sears, pers. comm.) suggested that Crepis monticola would make a good candidate for a non-reticulated diploid genome. A sample of 200 µg of high quality genomic DNA was submitted to the NextGeneration Sequencing Facility at the Biodiversity Research Centre of the University of British Columbia, and standard paired-end library preparation was completed by the facility. This paired-end (100 bp) 1-plex library per flowcell was sequenced using the Illumina HiSeq 2000 platform.

2.2.3. Genotyping-by-sequencing protocol

DNA from the extracted samples was digested with high-fidelity PstI (New England Biolabs, Ipswich, Massachusetts) with the following thermal cycling conditions: 37°C for 90min, 85°C for 20min. Prior to this, digestion success of both PstI and EcoRI, individually and as a double digest, was assessed. Three samples within the complex (4067-1, 4072-3, 5004-6), with positive controls from Townsendia hookeri (Lee, 2012) and Helianthus annuus (Baute, 2012), were digested at enzyme concentrations of 1x, 2x, 5x, and 10x (1U, 2U, 5U, and 10U of PstI, respectively) in a thermal cycler with the following conditions: 37°C for 90min, 85°C for 20min (enzyme deactivation). Digestion success was assessed visually by running 3uL of the product on a 1% agarose gel in 0.5X Tris/Borate/EDTA (TBE) at 80V for 5 minutes followed by 25 minutes at 90V. Based on the results of these tests, library preparation was completed using a single digestion with PstI at a 5x concentration.

A GBS library was prepared as in Elshire et al. (2011) with the following modifications from Baute et al. (unpublished, Rieseberg Lab Protocol, 2013) and McGrath (unpublished, Whitton Lab Protocol, 2013): Additional barcodes were created to allow for multiplexing of all 132 samples. Care was taken to ensure all barcodes differed from one another by at least two base-pairs as recommended in Hohenlohe et al. (2010). Ligations were done on a thermal cycler as follows: 22°C for 60min, 65°C for 10min. Ligated samples were individually cleaned using AMPure XP beads (Agencourt, Beverly, Massachusetts) at a ratio of 1.8x AMPure XP beads: 1x solution volume to ensure the retention of all sizes of DNA in the solution. Samples were then pooled and the volume was reduced by half with vacuum centrifugation. Restriction fragments were amplified in 25uL volumes containing 1uL pooled DNA fragments, 1x Phusion Hi-Fidelity PCR Master Mix with HF buffer (New England Biolabs, Ipswich, Massachusetts), 2.0mM MgCl2, 3% DMSO, and 12.5pmol each of the following primers: (A) 5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’ and (B) 5’-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT-3’, with a reagent spike immediately after the last thermal cycle of: 0.25x Phusion Hi-Fidelity PCR Master Mix with HF buffer, and an additional 12.5pmol each of primers A and B.

Thermal cycling conditions were as follows: initial denaturing at 94°C for 3min, followed by 18 cycles of 94°C for 30sec, 65°C for 30sec, 68°C for 30sec, with a final extension at 68°C for 10min, followed by a 4°C hold until the addition of reagents (spike). Following the reagent spike, samples ran through an additional cycle of 94°C for 3min, 65°C for 2min, 68°C for 12min, followed by a 4°C hold. Eight concurrent PCR reactions were performed to reduce PCR error and bias, as well as to increase yield in the final library. Libraries were cleaned as above and were quantified using the Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, California). Libraries were considered suitable for sequencing if the following criteria were met:
1. The final concentration of the amplified, pooled library was no less than 0.5ng/uL
2. qPCR performed by the NextGeneration Sequencing Facility at the Biodiversity Research Center of the University of British Columbia (Canada) indicated a minimum concentration of 2mM
3. Bioanalyzer fragment size determination performed by the NextGeneration Sequencing Facility at the Biodiversity Research Center of the University of British Columbia (Canada) was between 300-700bp

Paired-end (100bp) sequencing of this 132-plex library per flowcell was completed by the NextGeneration Sequencing Facility at the Biodiversity Research Center of the University of British Columbia (Canada) using the Illumina HiSeq 2000 platform.

Despite stringent standardization of individual DNA concentration in the final pooled library, the amount of raw data obtained fluctuated among samples, and a second GBS library was created as above with 83 samples that had fewer than 1,000,000 sequenced reads assigned to them. The same barcodes were used for each sample, eliminating downstream confusion. Thirteen previously un-sequenced samples from the tetraploid Crepis barbigera were also included in this run for polyploid parentage studies beyond the scope of this thesis. This 96-plex GBS library was sequenced as above and read data from samples included in both runs were subsequently combined.

2.2.4. Raw sequence analysis and quality assessment

All three sequence data sets from the libraries (1-plex Crepis monticola standard, 132-plex GBS, and 96-plex GBS) were initially investigated for per base sequence quality, per sequence quality, per base and per sequence GC content, sequence length distribution, sequence duplication levels, overrepresented sequences, and adapter contamination in the form of Kmer content using the FastQC toolbox (Babraham Bioinformatics, Cambridge, England). This preliminary investigation (see Figure 1) allowed for a fairly comprehensive overview of the actual sequence files without drawing any biological conclusions from, or making alterations to, the data set. While a “good” data set will have high per base and per sequence quality with an unbiased GC content and only a small percentage of duplicated or overrepresented sequences or k-mers, a “poor” data set may fail one or more of these quality indicators.

Subsequent to quality assessment, sequences obtained from the 1-plex Crepis monticola library were processed and assembled into contigs using the CLC Genomics Workbench v.7.0.3. (CLCBio, Boston, Massachusetts), while all raw sequences obtained from the GBS libraries were processed using three distinct bioinformatics pipelines. Two of these, Stacks (Catchen et al., 2013) and Tassel – UNEAK (Lu et al., 2013), are de novo SNP calling pipelines, while the third is an in-house reference based assembly pipeline that utilizes both in-house Perl and Python scripts as well as open-source bioinformatics software. Each methodology will be discussed in turn.

2.2.4.1. Crepis monticola assembly

Sequence data generated for C. monticola was filtered and assembled de novo into contigs using the CLC Genomics Workbench v.7.0.3. (CLCBio, Boston, Massachusetts). Trimming and filtering based on phred quality scores followed standard practices (see Elshire et al., 2011; Hohenlohe et al., 2011) wherein each sequence was assessed and proceeded to the next step only if it met the following criteria:
1. There were no more than 2 N’s in the read that could not be trimmed from the leading and/or trailing end
2. There were no more than 2 nucleotides with a phred score < 30 that could not be trimmed from the leading and/or trailing end
3. Adapters were trimmed off and low quality base-pairs at the leading and trailing end were removed
4. After trimming, sequences were no less than 64bp

If one sequence passed the above filters but its pair did not, I proceeded with the ‘broken’ pair. Due to the stringent quality filtering, I was confident in the information contained in an individual read without confirmation from its pair.
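The four read-level criteria above can be illustrated with a minimal Python sketch. This is not the CLC Genomics Workbench implementation, whose internals are not described here; the function names, the end-trimming rule, and the representation of quality scores as a list of integers are all assumptions made for illustration, and adapter removal is omitted.

```python
def trim_ends(seq, quals, min_q=30):
    """Trim N's and low-quality bases from both ends (hypothetical rule)."""
    start, end = 0, len(seq)
    while start < end and (seq[start] == "N" or quals[start] < min_q):
        start += 1
    while end > start and (seq[end - 1] == "N" or quals[end - 1] < min_q):
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(seq, quals, min_q=30, max_n=2, max_low_q=2, min_len=64):
    """Apply the criteria listed in the text to one read.

    Thresholds mirror the text: at most 2 internal N's, at most 2 internal
    bases with phred < 30, and a trimmed length of at least 64bp.
    """
    seq, quals = trim_ends(seq, quals, min_q)
    if len(seq) < min_len:
        return False
    if seq.count("N") > max_n:
        return False
    if sum(q < min_q for q in quals) > max_low_q:
        return False
    return True
```

Under these assumptions, a read that passes while its mate fails would simply be carried forward on its own, matching the treatment of ‘broken’ pairs described above.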

2.2.4.2. Genotyping-by-sequencing data within reference based pipeline

GBS data were demultiplexed with an in-house Perl script (Baute, G. unpublished) and filtered for adapter contamination and quality with Trimmomatic (Bolger et al., 2014), an open-source program that is flexible, sophisticated, and efficient.

In order to utilize the entirety of information contained in paired-end data in downstream applications that differed in accepted input formats, the reverse reads were augmented to include the appropriate barcode and in this way act as a new ‘forward’ read. Some programs used in the analyses only accept single end data while others were amenable to paired end data. In programs that allowed for only single end reads, alignments would not incorporate the reverse read and thus the information contained in the reverse reads was completely ignored. In augmenting these reverse reads to include a barcode before the restriction enzyme site, the programs were able to recognize these reads and incorporate them appropriately. It should be noted that the sequences retain the information contained in their header regarding their pair information and therefore programs that can properly interpret paired end data are still able to do so after barcode removal. The sequences were then filtered according to the following criteria:
1. Each sequence had a barcode that was removed, followed by the restriction enzyme site ‘TGCA’
2. Sequences had adapters removed using a sophisticated approach allowing for removal of short, partial adapter sequences. ‘Palindrome’ mode exploits the nature of paired end data (read through into the adapter can occur on both forward and reverse strands) and utilizes the reverse complemented portion of the sequence to identify adapter contamination (Bolger et al., 2014)
3. Sequences had a phred score ≥ 30 as determined by a sliding window
4. Wordsize for the sliding window was 5
5. Low quality base-pairs from leading and trailing edges were removed from the sequence
6. After trimming, the sequence was no less than 64bp. This was primarily chosen to be comparable to Lu et al. (2013)
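As a rough illustration of criteria 3, 4, and 6, the following Python sketch scans a read with a five-base window and cuts it at the first window whose mean quality falls below 30. It is a simplification written for this explanation and is not Trimmomatic's code; the function names and the exact cut position are assumptions.

```python
def sliding_window_trim(seq, quals, window=5, min_mean_q=30):
    """Cut the read at the start of the first window whose mean phred
    score drops below min_mean_q (simplified, hypothetical behaviour)."""
    for start in range(len(seq) - window + 1):
        scores = quals[start:start + window]
        if sum(scores) / window < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


def trim_and_check(seq, quals, window=5, min_mean_q=30, min_len=64):
    """Return the trimmed sequence, or None if it is shorter than min_len."""
    trimmed, _ = sliding_window_trim(seq, quals, window, min_mean_q)
    return trimmed if len(trimmed) >= min_len else None
```

Reads shorter than the window are returned unchanged by `sliding_window_trim` and are then removed by the length check.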

2.2.4.3. Stacks

GBS data generated for our 129 diploid samples, and 3 diploid outgroup samples (Figure 1), were processed first using the script ‘process_radtags.pl’ within the Stacks pipeline (Catchen et al., 2013). Filtering was completed according to the following criteria:
1. Sequence reads were demultiplexed according to supplied barcode file (NOTE: Stacks does not currently have the built-in capability to process barcodes of multiple lengths so a work-around was created wherein the raw sequences were processed separately for each barcode length)
2. A built-in barcode rescue parameter also allowed for de-multiplexing of sequences that have an N present in their barcode
3. There were no N’s present in the sequence
4. There were no nucleotides with a phred score < 30
5. Each sequence had a barcode, which was removed, followed by the restriction enzyme site ‘TGCA’
6. As the barcode should be the first base-pairs read, any sequences with adapter contamination would have been immediately discarded
7. Each sequence was truncated to a length of 90 bases. This was done to minimize the number of sequences that were discarded due to low quality scores at the trailing end. The number was chosen based on the FastQC quality assessment.

2.2.4.4. UNEAK

In an effort to compare and contrast the available published pipelines, the GBS data were also run through the UNEAK De Novo Assembly and SNP calling pipeline, a GBS based bioinformatics offshoot of Tassel (Lu et al., 2013). The filtering parameters of the pipeline are less sophisticated than those of Stacks, with more brute force discarding and trimming of sequences. Filtering was completed according to the following criteria:
1. Sequence reads were demultiplexed using a supplied key file containing barcode, taxon, and population information
2. Only exact barcode matches were accepted
3. Each sequence had a barcode that was removed, followed by the restriction enzyme site ‘TGCA’
4. Each sequence was trimmed to 64 bases (not user-controlled)
5. There were no N’s present in the sequence.

In truncating the sequence to 64 bases, any sequences that read into the common adapter were salvaged; however, the filter does not specifically look for adapter sequences or quality scores. While this is not very sophisticated, it is likely to be reasonably effective with the large data sets produced from next-generation sequencing methodologies.

2.2.5. Crepis monticola De Novo assembly

All assemblies completed on the data from the 1-plex Crepis monticola standard Illumina library were done in CLC Genomics Workbench De Novo Assembly (CLCBio, Boston, Massachusetts). In an attempt to produce the most robust, biologically accurate picture of the nuclear genome of C. monticola from the available data (Figure 1), multiple assemblies were completed with differing algorithm settings before settling on a final contig set to be used in downstream analyses.

The CLC De Novo algorithm utilizes de Bruijn graphs (CLCBio, 2012) to determine sequence placement in contig generation. De Bruijn graphs are formed by finding the shortest possible sequence that can be constructed from the sequence reads. This is done through the creation of directed graphs wherein the nodes of the graph are sequences and the directed edges between nodes represent shared information between the two nodes (Pevzner et al., 2001; CLCBio, 2012). In this way, a path through the graphs is found, nodes are collapsed into the shortest possible sequence, and a contig is created. As the edges connecting nodes are composed of sequence shared between the two nodes, choosing an appropriate k-mer size (i.e., number of bases shared between the sequences at the nodes) is critical for the best possible contig resolution. CLC assigns a value of k for k-mer size based on the number of reads in the dataset (as a proxy for genome size) (CLCBio, 2012); a k of 21 was assigned for the C. monticola data. However, choosing the optimal k is a known challenge in genome assembly, and testing multiple k-mer sizes helps to identify the best possible assembly (Chikhi & Medvedev, 2014). In order to confirm the CLC chosen k-mer size, the filtered reads were analyzed in Kmergenie (Chikhi & Medvedev, 2014), and the optimal k for maximizing the number of genomic k-mers for de Bruijn graph utilization was compared to the k determined by CLC.
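The general idea of a de Bruijn graph can be shown with a toy Python sketch. It makes no claim about how CLC builds or traverses its graphs: here each k-mer contributes one edge between its (k-1)-base prefix and suffix, and a contig is read off by following nodes that have a single outgoing edge. The function names and the two example reads are invented for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it precedes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while there is exactly one way forward.

    Only out-degree is checked, so this is a deliberately naive traversal.
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a repeat
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two overlapping toy reads collapse into one contig at k = 4.
toy_graph = de_bruijn(["ACGTGC", "GTGCAA"], k=4)
toy_contig = walk_unbranched(toy_graph, "ACG")
```

With a smaller k, more nodes would gain multiple outgoing edges and the walk would stop earlier, which is the trade-off in k-mer choice that the paragraph above describes.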

Resolution of de Bruijn bubbles caused by systematic sequencing errors was attempted by testing different bubble sizes before settling on a bubble size of 100, equal to the average read length (CLCBio, 2012). Bubbles form when two redundant paths start and end at the same node, meaning that the sequences in each path are largely similar but with polymorphisms present (Zerbino & Birney, 2008). In a repetitive genome, some of these bubbles will be true indicators of genome redundancy and thus it is necessary to attempt to resolve these bubbles and find the right path through the graph. Selection of an appropriate bubble size, similar to k-mer choice, is therefore a trade-off between too small a size, resulting in a large number of broken contigs, and too large a size, resulting in misassemblies (Compeau et al., 2011).

Upon completion of contig generation, individual reads were mapped back onto the contigs with three separate alignment stringencies: 80%, 90%, and 95% sequence identity. At an 80% alignment stringency, for example, 20% of the bases were allowed to differ from the contig at that location. Multiple stringencies were utilized to determine if contigs generated were the result of collapsed paralogous regions in the genome. This is based on the knowledge that paralogous sequences diverge independently following their origin through duplication, and their level of sequence divergence should be greater than divergence among orthologues. Therefore, a lower stringency can collapse paralogues into single contigs and illuminate duplications, while higher stringencies can either confirm uniqueness or point to more recent divergence. The final contig assembly was generated with reads that mapped back with a 90% identity match, as increasing the stringency to 95% failed to resolve additional paralogy.

Contigs were then subjected to a stringent BLAST search (Altschul et al., 1990) to filter out contaminants and plastid regions (Figure 1). A discontiguous megablast with an e-value cutoff of 0.0001 and a word size of 11 was chosen to specifically target sequences that may be slightly more diverged but still show a greater similarity than expected by chance. In this way, an informed decision about whether to discard or retain the contig in question could be made. The nuclear contigs were saved and used downstream as a reference to align and call SNPs from the GBS dataset.

2.2.6. Alignment of genotyping-by-sequencing data to Crepis monticola contigs

The nuclear contigs obtained for C. monticola were prepared for use in BWA, a reference-based aligner capable of processing insertions and deletions with accuracy (Li & Durbin, 2009), and the filtered GBS sequence data was aligned to this reference using default settings for the ‘bwa mem’ aligner (Figure 1). A comprehensive explanation of the default options as well as their performance is given in Li & Durbin (2009).

BWA identified read groups were analyzed for the presence of insertions and deletions in GenomeAnalysisTK (McKenna et al., 2010) before being flagged for realignment with ‘RealignerTargetCreator’ following DePristo et al. (2011) and finally realigned using the GATK ‘IndelRealigner’.

Default settings were selected as stringent post alignment filtering of putative SNPs was employed which will be described in detail below.

2.2.7. De Novo assembly in STACKS

The core of the STACKS pipeline is four distinct programs: ustacks, cstacks, sstacks, and populations which are accessed through a wrapper; ‘denovo_map.pl’ (Catchen et al., 2013, ver.1.10). These programs, respectively, create ‘stacks’ or loci made up of identical reads for each individual, group these loci across individuals in a catalog, determine the allelic state at each locus in each individual, and after conversion into genotypes, calculate values of basic population genetics statistics (Catchen et al., 2013). At each step, I selected multiple parameters to test outcomes and found that the desired level of stringency in creating stacks, warding against paralogous and erroneous loci, and calling alleles with confidence was achieved using the parameters listed below.

2.2.7.1. Ustacks

Illumina platform sequencing has an associated sequencing error rate of approximately 1% (Lou et al., 2013), which can quickly become a concern when dealing with a large data set as the number of reads containing errors can increase to a degree that they are included in downstream analyses (i.e., inclusion of an incorrect base in a read that is present at a copy depth > X). In order to decrease the likelihood of calling false stacks due to sequencing error, a minimum stack depth of 3 was selected. Therefore, a read must be present in at least three copies for inclusion in the next filtering step. Conversely, a maximum distance of 2 nucleotides between stacks from different individuals was chosen. Limiting the sequence divergence to 2 nucleotides between stacks produced stacks that were either exactly identical (i.e., a homozygous locus if within individuals, a non-polymorphic locus if between individuals), 1 base-pair different (a heterozygote if variation is within individuals, a potential SNP if between individuals), or two base-pairs different (multiple putative conclusions). In this way, downstream variant analysis is more straightforward. A transitive reduction of stacks occurs during ustacks as well, wherein pairs of stacks are compared to one another with an allowable base-pair distance of 1. Pairs that are recognized as matching are then merged (Catchen et al., 2011). To supplement stacks with minimal depth, secondary reads (those that differ from a stack by more than 1bp, up to a nucleotide distance of 5) were aligned to the primary stacks. Though these secondary reads were aligned to the primary stacks, haplotypes were not called from them, as will be discussed in more detail in the SNP calling section.
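The effect of the minimum stack depth and the maximum nucleotide distance can be sketched in a few lines of Python. This is a conceptual illustration only and does not reproduce the ustacks algorithm; the function names and the toy reads in the test are hypothetical, and all reads are assumed to have the same length.

```python
from collections import Counter
from itertools import combinations


def build_stacks(reads, min_depth=3):
    """Group identical reads and keep those seen at least min_depth times."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_depth}


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def candidate_pairs(stacks, max_distance=2):
    """List pairs of stacks that differ by at most max_distance bases."""
    pairs = []
    for a, b in combinations(sorted(stacks), 2):
        d = hamming(a, b)
        if d <= max_distance:
            pairs.append((a, b, d))
    return pairs
```

A read seen only twice never forms a stack in this sketch, so an isolated sequencing error would need to recur three times before it could be mistaken for an allele.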

2.2.7.2. Cstacks

The set of consensus loci among individuals is generated using a k-mer search algorithm that compares the stacks across individuals and merges them based on a set criterion (Catchen et al., 2011). Our catalog was generated with the default parameter of allowing 0 mismatches between stacks. This means that the program will fail to capture loci with fixed differences between individuals. For instance, if individual 1 is heterozygous at a locus (A/T) and individual 2 is fixed for (A) or (T), the two loci will be merged into one in the catalog with coverage across two individuals. However, if individual 2 is instead fixed at that position with a (G) or (C), the above setting will result in the two loci failing to merge, and two distinct entries in the catalog, each with a coverage of one individual, will be produced. In retrospect, this step should be modified to allow for the detection of fixed differences between individuals at specific loci; however, as discussed below, I do not believe this will make a discernible difference for my data.

2.2.7.3. Sstacks

The stacks generated in ustacks are compared against the catalog generated in cstacks to determine the allelic state at each locus in each individual (Catchen et al., 2013). This is completed via construction of a k-mer based hash table containing all haplotypes in the catalog. While a stack/locus may only match one catalog locus, a heterozygous locus with two haplotypes (i.e., multiple stacks) can be matched to the same catalog locus (Catchen et al., 2011).

2.2.7.4. Populations

At this point, each individual is assigned a multi-locus genotype based on the loci included in this step. Due to the immense computational requirements of processing such a large number of loci among and within individuals, a ‘blacklist’ of markers to exclude from these analyses can be implemented. This is made up of stacks with exceptional read depth (more than two standard deviations above the average stack depth) and loci that are not variable. Removing these stacks will reduce the number of paralogous and non-informative loci in downstream analyses.
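A blacklist of this kind can be assembled as in the Python sketch below. It is a minimal illustration under stated assumptions rather than the procedure built into Stacks: the function name, the dictionary inputs, and the use of the sample standard deviation are all choices made here for the example.

```python
from statistics import mean, stdev


def build_blacklist(depths, is_variable, n_sd=2.0):
    """Return locus IDs to exclude.

    depths      : dict of locus ID -> stack depth (hypothetical input)
    is_variable : dict of locus ID -> True if the locus is polymorphic
    A locus is excluded if its depth exceeds mean + n_sd * SD, or if it
    shows no variation.
    """
    cutoff = mean(depths.values()) + n_sd * stdev(depths.values())
    too_deep = {locus for locus, d in depths.items() if d > cutoff}
    invariant = {locus for locus, v in is_variable.items() if not v}
    return too_deep | invariant
```

Because a single extreme locus inflates both the mean and the standard deviation, a cutoff of this form is conservative when only a few loci are examined.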

Further data filtering is achieved by specifying the minimum percentage of individuals in a population required to include a locus for that population, the minimum number of populations a locus must be present in to process a locus, the minimum stack depth, the minimum minor allele frequency required to call an allele, and the p-value cutoff required to keep an Fst measurement (Catchen et al., 2013). Trading off between missing data and statistically rigorous sample numbers, a minimum coverage within populations of 70% was selected, with only one population needing the locus, a minimum stack depth of three, and a minimum allele frequency of 0.01. The default p-value cutoff of 0.05 was applied. These criteria ensured adequate coverage across individuals when calling variant sites while allowing for the presence of private alleles in populations.

Variant call format (VCF), Structure, and Phylip output formats were selected and subjected to further filtering to ensure that all variant sites identified (putative SNPs) were true, ancestrally-informative SNPs.

2.2.8. De Novo assembly in UNEAK

All analyses were completed using TASSEL3.0 with the TASSEL5.0 GUI (Lu et al., 2013). A text-delimited key file containing the barcode, sample, and population information, along with format-required columns specifying the flowcell lane number and location in the 96-well plate, was created and is utilized in multiple steps of the pipeline. This pipeline has many similarities to Stacks and is composed of six main plug-ins: UfastqToTagCountPlugin, UmergeTaxaTagCountPlugin, UtagCountToTagPairPlugin, UTagPairToTBTPlugin, UTBTToMapInfoPlugin, and UmapInfoToHapMapPlugin. There are fewer filtering options, however, and as such each plug-in will be discussed only briefly.

After creating a working directory using ‘UcreatWorkingDirPlugin’, and de-multiplexing the data as described in the previous section, sequences were merged into ‘tags’ (equivalent to ‘stacks’ or ‘loci’) with UfastqToTagCountPlugin. As with Stacks, a minimum tag depth of 3 was assigned. Tags belonging to certain taxa are then merged to create a taglist for each sample, as well as a taglist of all tags present in all samples, in UmergeTaxaTagCountPlugin. This is similar to the ustacks step in the Stacks pipeline. UtagCountToTagPairPlugin performs a pairwise alignment of all tags, and tag pairs with a 1bp mismatch are considered putative SNPs. An error tolerance rate (ETR) of 0.03 was assigned, which allows for reciprocal tags to be called with up to 3% sequence divergence. An ETR of 0 would mean that only purely reciprocal (no divergence) tags were called with no sequencing error ever occurring, an unlikely scenario (Lu et al., 2013). It is expected that any false SNPs resulting from this ETR assignment will be quickly filtered out downstream as they will not be present in a high enough number or with enough statistical support. Subsequent to the pairwise alignment, tags that were successfully paired are analyzed in UTagPairToTBTPlugin to determine the tag distribution in all of the taxa and elucidate whether some tags occur more frequently than expected (perhaps due to repetitive sequences or a highly sequenced site). Finally, UTBTToMapInfoPlugin and UmapInfoToHapMapPlugin assign genotypes to each taxon and generate haplotype maps based on a minimum allele frequency of 0.01 (user-specified) and a maximum allele frequency of 0.5.
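The notion of a tag pair with a single mismatch can be illustrated with the Python sketch below. It is not the UNEAK code: the function names are hypothetical, the error tolerance rate is ignored, and the rule of keeping only tags that have exactly one partner is an assumption intended to mimic the "reciprocal" tag pairs mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def one_mismatch_neighbours(tags):
    """Map each tag to the set of equal-length tags differing at one base."""
    neighbours = defaultdict(set)
    for a, b in combinations(tags, 2):
        if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def reciprocal_pairs(tags):
    """Keep only pairs in which each tag's sole 1bp neighbour is the other."""
    neighbours = one_mismatch_neighbours(tags)
    pairs = set()
    for tag, partners in neighbours.items():
        if len(partners) == 1:
            (partner,) = partners
            if len(neighbours[partner]) == 1:
                pairs.add(tuple(sorted((tag, partner))))
    return sorted(pairs)
```

Tags that sit in a chain or cluster of near-identical sequences are dropped by `reciprocal_pairs`, on the reasoning that such clusters are more likely to reflect repeats, paralogs, or errors than a single biallelic locus.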

The HapMaps that are produced in this pipeline can then be filtered according to minimum site coverage among samples, minimum number of populations covered, taxa, minor allele frequency, and traits (Lu et al., 2013). Based on the sequence coverage I had at loci containing putative SNPs, I decided to create a putative SNP table for all samples, for all samples excluding those with exceptionally low coverage across sites (< 30%), and for samples contained within a more limited phylogenetic sample of diversity within the agamic complex (clade V of the plastid phylogeny; Sears, 2011).

2.2.9. SNP calling

2.2.9.1. Reference based SNP calling

Filtering for true SNPs occurred in multiple steps following the alignment of the GBS data to the C. monticola reference contigs.

First, the GATK ‘SelectVariants’ tool (DePristo et al., 2011) identified any sites that differed between the GBS data and the C. monticola reference and had a phred score ≥ 30, and output the information in standard VCF format. These variant sites are putative informative markers. An optional processing step to change the VCF output to tab format was used, as downstream in-house scripts were written with specific input criteria.

Second, every identified variant site was analyzed for coverage, observed heterozygosity, major and minor allele frequencies, base-pair quality, minimum and maximum read depth, and mapping quality score using in-house Perl scripts (modified by McGrath from Owens, unpublished). Initial filtering was first completed with quite stringent requirements: 70% coverage, Hobs < 0.6, minAF > 0.01, q > 30, minRD ≥ 5, maxRD ≤ 100,000, and mapping quality score ≥ 20. It was quickly determined that these parameters would need to be relaxed in order to retrieve more SNPs. Therefore a final SNP table was created that had 70% coverage, Hobs < 0.6, minAF > 0.01, q > 10, minRD ≥ 3, maxRD ≤ 100,000, and mapping quality score ≥ 10.
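The two threshold sets can be written down compactly, as in the following Python sketch. This is not the in-house Perl code; the class and field names are hypothetical, and "70% coverage" is interpreted here as a site being genotyped in at least 70% of samples, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-site summary statistics (hypothetical field names)."""
    coverage: float          # fraction of samples genotyped at the site
    hobs: float              # observed heterozygosity
    minor_af: float          # minor allele frequency
    base_quality: float      # phred-scaled base quality
    min_read_depth: int
    max_read_depth: int
    mapping_quality: float


@dataclass
class Thresholds:
    """Defaults correspond to the initial, stringent filter in the text."""
    min_coverage: float = 0.70
    max_hobs: float = 0.6
    min_minor_af: float = 0.01
    min_base_quality: float = 30
    min_read_depth: int = 5
    max_read_depth: int = 100_000
    min_mapping_quality: float = 20


STRINGENT = Thresholds()
RELAXED = Thresholds(min_base_quality=10, min_read_depth=3, min_mapping_quality=10)


def passes(site, t):
    """True if the site satisfies every threshold in t."""
    return (
        site.coverage >= t.min_coverage
        and site.hobs < t.max_hobs
        and site.minor_af > t.min_minor_af
        and site.base_quality > t.min_base_quality
        and site.min_read_depth >= t.min_read_depth
        and site.max_read_depth <= t.max_read_depth
        and site.mapping_quality >= t.min_mapping_quality
    )
```

Only three thresholds differ between the two sets, so any site rescued by the relaxed filter is one with lower base quality, read depth, or mapping quality than the stringent filter allowed.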

Lastly, linkage disequilibrium (LD) between SNPs located on the same contig was calculated as an indicator of paralogy and incorrect alignment. This is based on the assumption that valid SNPs should be in LD with adjacent markers (Lu et al., 2013). Though LD can decay rapidly and the rate can fluctuate widely between organisms, it is generally regarded that SNPs found within 100bp (i.e., within a single read) are in complete linkage with one another (Catchen et al., 2011). The linkage disequilibrium coefficient, r2, was calculated using the VCFtools v.0.1.12 ‘hap-r2’ option (Danecek et al., 2011) with an ld-window of 2, indicating the maximum number of SNPs between the SNPs being tested. Marker association was assessed from the r2 scores, which range from 0 to 1, with 0 indicating that the markers being compared are not in linkage disequilibrium and 1 indicating complete LD (Mueller, 2004). Variant sites with 0.5 < r2 < 1 underwent further scrutiny in the Integrative Genomics Viewer (Thorvaldsdottir et al., 2012), where visualization of the aligned reads and reference allowed for an empirical assessment of whether the variant site was a true SNP or instead an artifact of paralogy in the genome. Such artifacts occur when a read aligns equally well to multiple regions of the reference genome, or when the correct repetitive region is not present in the reference genome and, as a result, the read aligns based on a best-match principle.
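For reference, the r2 statistic for two biallelic sites can be computed from haplotype frequencies as D squared divided by the product of the four allele frequencies, where D is the difference between the observed and expected frequency of the haplotype carrying both alternate alleles. The Python sketch below is a textbook calculation, not the VCFtools implementation; it assumes phased haplotypes coded as 0/1 pairs, and the function name is hypothetical.

```python
def r_squared(haplotypes):
    """Compute r2 between two biallelic sites.

    haplotypes : list of (a, b) tuples, each allele coded 0 or 1, one
                 tuple per phased haplotype (assumed input format).
    Returns None when either site is monomorphic, since r2 is undefined.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b
    return d * d / denom
```

Two sites whose alleles always co-occur give a value of 1, and sites whose alleles are combined at random give a value of 0.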

2.2.9.2. De Novo SNP calling

Both Stacks and UNEAK generate putative SNPs within their pipeline that must then be filtered to remove all false variant sites (Figure 1). False SNP calls are most frequently the result of PCR error, sequencing error, and erroneous assembly due to unrecognized repetitive regions of the genome (Nielson et al., 2012; Treangen & Salzberg, 2012). In my analysis, both the Stacks and UNEAK methods assigned a putative SNP on the basis of a 1bp mismatch between two stacks or tags belonging to separate individuals, the raw sequence filtering was completed with a high quality and adapter contamination filter, and the required minor allele frequency was conservative. Therefore, many of the common causes of false SNP calls have been reduced or accounted for. However, repetitive regions leading to false SNPs are not a small concern when working with plant genomes (Davey et al., 2011), and therefore both pipelines attempt to account for repetitive sequences. In Stacks, loci with a stack depth greater than 2 standard deviations above the average stack depth are blacklisted. In UNEAK, a network filter is applied that eliminates all tags connected to each other by multiple edges such that the remaining networks are composed of reciprocal tag pairs. This acts in a similar manner to de Bruijn graphs, with tags making up the nodes and the amount of sequence they share in common, the edges. As tag pairs should only differ by 1bp and be unique to one another, complicated networks are frequently a mixture of repeats, paralogs, and error tags (Lu et al., 2013). While the network filter of UNEAK is more sophisticated than the blanket removal based on stack depth of Stacks, neither method satisfactorily takes into account the non-quantitative nature of GBS (Baute, pers. comm.). The number of digested reads present is not an accurate reflection of the number of occurrences of that read in the nuclear genome of these organisms and, as such, a more suitable metric by which to assign uniqueness is observed heterozygosity. This is based on the simple observation that a sexually reproducing diploid organism with two alleles at a locus should not have an Hobs > 0.5 at any given locus. Thus, all putative markers were filtered for heterozygosity following identification within the pipeline.
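A heterozygosity filter of this kind is straightforward to express in code. The Python sketch below is illustrative only and is not the script used here; the function names, the genotype representation (a pair of alleles, or None for missing data), and the default cutoff of 0.5 are assumptions based on the reasoning above.

```python
def observed_heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous.

    genotypes : list of (allele1, allele2) tuples, with None for missing
                calls. Returns None if no genotype was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    hets = sum(1 for a, b in called if a != b)
    return hets / len(called)


def filter_by_hobs(loci, max_hobs=0.5):
    """Keep loci whose observed heterozygosity does not exceed max_hobs."""
    kept = {}
    for name, genotypes in loci.items():
        h = observed_heterozygosity(genotypes)
        if h is not None and h <= max_hobs:
            kept[name] = h
    return kept
```

A locus at which every sample appears heterozygous, as expected when two paralogous copies are collapsed into one locus, has an Hobs of 1 and is removed.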

Finally, the stacks or tags containing the SNPs are blasted (Altschul et al., 1990) to ensure that they are from the nuclear genome and that they are not a contaminant from a different organism. This step is not completed during the raw filtering of the GBS data as it would be prohibitively time consuming as well as unnecessary; only the reads with putative informative variants need be investigated.

2.2.10. Analysis of SNP data

The analyses presented were completed with the finalized SNP set for all individuals of Crepis occidentalis subsp. occidentalis, Crepis pleurocarpa, Crepis nuvo subsp. X, and Crepis nuvo subsp. Y, all members of clade V of the plastid phylogeny in Sears (2011).

To quantify the level of genetic differentiation among populations, pairwise Fst values from the SNPs were calculated using Nei’s distances (1973) within the R package Adegenet (Jombart, 2008). Nei’s population differentiation estimate is analogous to Wright’s Fst and identical when dealing with diploid random-mating populations (Excoffier, 2003 in Kitada et al., 2007). Fst is a common metric for assessing the amount of gene flow between populations and can aid in identifying population divergence.
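As a point of reference for how a Nei-type differentiation estimate is constructed, the Python sketch below computes G_ST = (H_T - H_S) / H_T from allele frequencies. It is not the Adegenet routine: it assumes equal weighting of populations, applies no sample-size correction, and sums heterozygosities across loci before taking the ratio, and the function names are hypothetical.

```python
from statistics import mean


def expected_het(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one allele-frequency vector."""
    return 1 - sum(p * p for p in freqs)


def nei_gst(loci):
    """Multi-locus G_ST for a set of populations.

    loci : list of loci; each locus is a list with one allele-frequency
           vector per population (assumed input format).
    Returns None if total heterozygosity is zero.
    """
    hs_total = 0.0
    ht_total = 0.0
    for locus in loci:
        # H_S: mean within-population expected heterozygosity
        hs_total += mean(expected_het(p) for p in locus)
        # H_T: expected heterozygosity of the pooled (mean) frequencies
        pooled = [mean(col) for col in zip(*locus)]
        ht_total += expected_het(pooled)
    if ht_total == 0:
        return None
    return (ht_total - hs_total) / ht_total
```

For a pairwise comparison, each locus would contain two frequency vectors, one per population; populations fixed for different alleles give a value of 1, and populations with identical frequencies give 0.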

The program STRUCTURE v. 2.3.4 (Pritchard et al., 2000) was used to test for genetic admixture within and among individuals and populations and to investigate the presence of hybrid individuals. STRUCTURE uses a Bayesian clustering approach that estimates the number of distinct genetic groups (K) and uses the posterior probability of each K to assign individuals to clusters (Pritchard et al., 2000). The hypothesis of a hybrid origin for C. nuvo subsp. X and C. nuvo subsp. Y was investigated using the admixture model in STRUCTURE with the assumption of correlated allele frequencies. The data set includes individuals from multiple populations of some species, and Pritchard et al. (2000) found that permitting the allele frequencies to be correlated allowed for proper assignment of individuals in closely related populations. Simulations for each of K=1 through K=6 were performed with a burn-in period of 100,000 replicates followed by 100,000 MCMC iterations, based on the recommendations of Gilbert et al. (2012). Results were consistent over three simulation runs. Delta K was calculated according to Evanno et al. (2005) using Structure Harvester (Earl & vonHoldt, 2012) to determine the number of clusters with the highest statistical support.
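The Delta K statistic can be illustrated with a short sketch. The log-likelihood values below are invented for the example and are not results from this study; the function name is hypothetical, and taking the absolute second difference of the per-K means (rather than of individual runs) is an assumption about how the calculation is arranged.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)).

    lnp_by_k maps each K to the list of ln P(D) values from replicate runs.
    Delta K is undefined for the smallest and largest K tested.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        sd = stdev(lnp_by_k[k])
        if sd > 0:
            result[k] = abs(means[k + 1] - 2 * means[k] + means[k - 1]) / sd
    return result


# Invented ln P(D) values for three replicate runs at K = 1..4
lnp = {
    1: [-500.0, -502.0, -498.0],
    2: [-400.0, -401.0, -399.0],
    3: [-395.0, -397.0, -393.0],
    4: [-394.0, -390.0, -398.0],
}
dk = delta_k(lnp)
best_k = max(dk, key=dk.get)
```

Because the statistic relies on neighbouring values of K, a run of K=1 through K=6 can only evaluate Delta K for K=2 through K=5.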

I also visualized the relationship among samples in clade V using a Principal Components Analysis (PCA) performed within the Adegenet package in R (Jombart, 2008) using the duality diagram class from the ade4 R package (Chessel et al., 2004). PCA simplifies a data set by separating samples along a number of axes, each of which explains a calculated amount of the variance in the data. By plotting the axes along which the variance in the data is maximized, one can visualize the degree of similarity and difference in the data (Ringnér, 2008). Four separate PCAs were initially performed to assess which data scaling option was most appropriate for our limited data set, as PCAs are sensitive to the scaling of the original data, with skewed or highly differentiated data influencing the final PCs (Ringnér, 2008):

1. Data centered (mean subtracted), with missing data treated as the mean value at the given locus
2. Data centered, with missing data given a value of 0 and not considered in the analysis
3. Data not centered, with missing data treated as the mean value at the given locus
4. Data not centered, with missing data given a value of 0 and not considered in the analysis

As centering the data did not alter the clusters formed, only results from non-centered analyses are presented. I also found that with missing data treated as the mean allele frequency at the given locus, the PCA failed to resolve any distinct clusters; results are therefore presented from option 4 (not centered, with missing data given a value of 0). As PCAs depend on the input data, they are affected by factors related to the data but unrelated to population structure, such as sampling locations and amounts of data (Novembre & Stephens, 2008). It is therefore not surprising that, given the limitations of my marker set, assigning a mean value for missing data obscured the underlying biological pattern.


Table 1: List of Crepis collections from within the agamic complex and outgroup species, Crepis runcinata, showing geographical information, morphological species designation (TAXON), ploidy, and number of specimens (N) analyzed. Asterisks indicate populations included in the final data set.

POPULATION TAXON STATE LATITUDE LONGITUDE ELEVATION (m a.s.l.) PLOIDY N

4007 C. atribarba subsp. originalis Washington 47.3065 -120.095 965 2x 6

4009 C. atribarba subsp. originalis Washington 47.3691 -120.077 1757 2x 6

4011 C. atribarba subsp. originalis Washington 47.8358 -120.067 1152 2x 4

4012 C. atribarba subsp. originalis Washington 47.806 -120.136 1540 2x 3

6002 C. atribarba subsp. originalis Washington 47.40843 -118.76122 1719 2x 1

4015 C. modocensis subsp. rostrata Washington 46.9169 -120.595 2478 2x 1

4016 C. modocensis subsp. rostrata Washington 46.9078 -120.645 2556 2x 2

6000 C. modocensis subsp. rostrata Washington 47.42018 -118.75951 1780 2x 1

6002 C. modocensis subsp. rostrata Washington 47.40843 -118.76122 1719 2x 2

6003 C. modocensis subsp. rostrata Washington 47.4045 -118.9959 1402 2x 10

6004 C. modocensis subsp. rostrata Washington 47.58083 -119.33653 1583 2x 2

6005 C. modocensis subsp. rostrata Washington 47.57935 -119.33591 1593 2x 3

4023 C. acuminata subsp. acuminata Oregon 44.4896 -117.307 3540 2x 2

4051 C. acuminata subsp. acuminata Oregon 43.4558 -118.11 4507 2x 1

4055 C. acuminata subsp. acuminata Oregon 42.5763 -119.724 4580 2x 1

4059 C. acuminata subsp. acuminata Oregon 40.8417 -120.621 5256 2x 2


5018 C. acuminata subsp. acuminata California 40.7216 -121.426 4037 2x 2

4074 C. sp. nuvo X* California 41.2781 -122.696 5603 2x 6

4075 C. sp. nuvo X* California 41.2866 -122.693 5790 2x 4

5006 C. sp. nuvo Y* California 41.2869 -122.692 5825 2x 3

5001 C. monticola California 41.9219 -122.603 2926 2x 6

5005 C. monticola California 41.5955 -122.735 3683 2x 2

5007 C. pleurocarpa* California 41.2344 -122.671 3915 2x 9

5008 C. pleurocarpa California 41.2371 -122.658 3449 2x 6

5009 C. pleurocarpa California 41.2134 -122.648 3126 2x 5

5010 C. pleurocarpa California 41.2071 -122.649 3100 2x 4

5011 C. pleurocarpa California 41.129 -122.702 2690 2x 4

5012 C. pleurocarpa California 41.1228 -122.703 2620 2x 3

5014 C. pleurocarpa* California 41.4218 -122.502 3860 2x 2

5015 C. pleurocarpa California 41.426 -122.5 3438 2x 2

5016 C. pleurocarpa California 41.4162 -122.513 4575 2x 3

4065 C. occidentalis subsp. occidentalis* California 39.7357 -120.378 4908 2x 1

5020 C. bakeri subsp. cusickii California 40.5896 -121.062 5642 2x 4

5021 C. bakeri subsp. cusickii California 40.5879 -121.076 5660 2x 4


5022 C. bakeri subsp. cusickii California 40.628 -121.111 5644 2x 4

5023 C. bakeri subsp. cusickii California 41.5646 -120.244 6586 2x 2

6012 C. bakeri subsp. cusickii California 41.56477 -120.24367 6575 2x 4

6016 C. bakeri subsp. cusickii California 41.58269 -120.26546 7514 2x 2

97 C. runcinata – outgroup NA NA NA 2x 1

187 C. runcinata – outgroup NA NA NA 2x 1

199 C. runcinata – outgroup , Canada NA NA NA 2x 1

Total N 132


Figure 1: Sequencing and marker discovery in progenitor species of the Crepis agamic complex using both a reference based and de novo methodology

2.3. Results

2.3.1. Crepis monticola sequencing and contig assembly

Sequencing generated 288M paired reads of Crepis monticola. After filtering, 258.2M sequences remained: 218M paired sequences and 39.9M sequences whose mates were removed by the stringent filtering parameters. Sequence quality was assessed prior to filtering and results are presented in Table 2. Average quality was high, with overrepresented sequences being a result of adapter contamination (as opposed to sequencing bias). Sequence duplication levels (the proportion of reads present more than once) for C. monticola were well within acceptable ranges, with approximately 89% of obtained reads being unique (Table 2).

Table 2: Summary of FastQC Analysis on Crepis monticola

Feature | Value
Average per sequence quality score (phred) | 38
Average per sequence GC content | 42%
Average sequence duplication level | 10.8%
Percentage of overrepresented sequences | 1.97%

Sequences were assembled into 1,238,163 contigs with an N50 of 375 bp (Table 3). N50 is a standard measure of assembly success: 50% of the entire assembly is contained in contigs equal to or greater than this length (Barchi et al., 2011). Of these contigs, 1,214,365 were nuclear in origin. Despite stringent assembly settings, we were unable to resolve the high percentage of broken read pairs (63%). The nuclear contigs became the genomic reference to which the GBS data were aligned.
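The N50 definition can be made concrete with a short sketch. The function name and the example contig lengths are hypothetical; the code simply walks down the contigs from longest to shortest until half of the total assembly length has been accumulated.

```python
def n50(lengths):
    """Return the contig length L such that contigs of length >= L
    together contain at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig")


# Hypothetical contig lengths (bp); total = 20, so half = 10
example_n50 = n50([2, 3, 4, 5, 6])
```

In this toy assembly the two longest contigs (6 and 5 bp) already cover more than half of the 20 bp total, so the N50 is 5.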

Table 3: Summary statistics of the Crepis monticola nuclear contig assembly

Feature | Value
Illumina reads (millions) | 288
Gb of total sequences | 26.6
Total Mb of nuclear content after filtering | 460.9
Average insert size | 190
Contigs (millions) | 1.2
Average contig length (bp) | 376
N50 (bp) | 375
Contig length range (bp, min – max) | 101 – 96,593
Percentage of broken pairs | 63%

2.3.2. Genotyping-by-sequencing read analysis and assembly

2.3.2.1. Sequence analysis

FastQC results relevant to GBS are presented in Table 4. Although other quality metrics are presented for C. monticola in Table 2, these were not included here because the high levels of duplication found make those additional metrics difficult to interpret.

Table 4: Summary of FastQC Analysis on GBS data

Feature | Value
Average per sequence quality score (phred) | 34
Average sequence duplication level | 90.4%

The average sequence duplication level for this data set was 90.4%; thus only about 10% of the sequences represented unique regions of the genome. Almost the entirety of the GBS dataset was composed of two mitochondrial regions (as determined by BLAST) that had been deeply sequenced. This constrained all downstream steps, as it greatly limited the number of informative sequences that could be used in marker discovery.

2.3.2.2. De-multiplexing

Figure 2 describes the sequence distribution among samples after de-multiplexing in each of the three pipelines. De-multiplexing of the data revealed differences in final sequence designation between Stacks, UNEAK, and the in-house method for alignment to the reference contigs. These differences are expected, and can be attributed to the user-chosen options within each program rather than to incorrect de-multiplexing by the pipeline itself. The sequences in Stacks were de-multiplexed without consideration of the PstI cut site. That is, reads in which a barcode was found were included in the final sample distribution regardless of whether or not the barcode was followed by the PstI cut site “TGCA”. This user-introduced limitation artificially inflated the number of sequences per sample and was corrected when implementing the UNEAK and reference-based pipelines. UNEAK has a rigid, simplistic filtering model that removes all reads not meeting the specified criteria, while the reference-based pipeline implements more sophisticated techniques that allow for the rescue of many reads. Thus, Figure 2 reflects the higher proportion of sequences retrieved with this pipeline relative to UNEAK. While the specific number of sequences recovered per sample varies by pipeline, the overall distribution was similar between the techniques, and biases were not pipeline specific but rather a reflection of true sequence-level and biological differences. This will be discussed in more detail below (see Discussion section).
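The effect of the cut-site check on de-multiplexing can be sketched as follows. This is an illustration only and does not reproduce the logic of Stacks, UNEAK, or the in-house scripts: the function name, the barcodes, and the reads are hypothetical, and the cut-site remnant is taken to be “TGCA” as stated in the text.

```python
CUT_SITE = "TGCA"  # PstI remnant expected directly after the barcode


def assign_sample(read, barcodes, require_cut_site=True):
    """Return (sample, remaining_read) or None if the read is rejected.

    barcodes maps barcode sequence -> sample name. Longer barcodes are
    tried first so that a short barcode that is a prefix of a longer one
    does not claim the read prematurely.
    """
    for barcode in sorted(barcodes, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        rest = read[len(barcode):]
        if require_cut_site and not rest.startswith(CUT_SITE):
            continue  # barcode found, but not followed by the cut site
        return barcodes[barcode], rest
    return None


# Hypothetical barcodes and reads
barcodes = {"ACGT": "sample_1", "TTGAC": "sample_2"}
good_read = "ACGTTGCAGGATCC"   # barcode followed by TGCA
bad_read = "ACGTGGGGTTTTAA"    # barcode present, cut site absent

strict_good = assign_sample(good_read, barcodes)
strict_bad = assign_sample(bad_read, barcodes)
lenient_bad = assign_sample(bad_read, barcodes, require_cut_site=False)
```

With the check disabled, the second read is still counted for sample_1, which is the kind of inflation in sequences per sample described for the Stacks run.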

Figure 2: Number of sequences per sample (logarithmic scale) as determined by Stacks (green), UNEAK (yellow), and in-house scripts for the reference-based alignment (blue). The individuals are listed in order of increasing sample identification number along the X axis (label shown for every third individual)


Coverage across samples was highly variable despite measures taken to normalize DNA content during library preparation. Both sample-specific and species-specific biases were observed. For example, the three outgroup samples from C. runcinata had zero amplification success, likely due to poor quality leaf material. In C. pleurocarpa (5007-5013, 5014-5016 in Figure 2) and C. bakeri (5020-5023, 6012, 6016), sequencing success was an order of magnitude higher than average, while in C. modocensis subsp. rostrata (4015-4016, 6000, 6002-6005) it was an order of magnitude lower than average. Despite ample effort, inclusion of C. modocensis subsp. rostrata markers was not possible due to this lack of sequence data; this species was therefore eliminated from further analyses and is not included in the final SNP dataset.

2.3.3. SNP identification

Although only 10% of the GBS reads were utilized for marker discovery, 674,651 and 468,270 tags were discovered in Stacks and UNEAK, respectively (Table 5). The terms “tag”, “stack”, and “locus” are used interchangeably depending on the pipeline being discussed; all refer to a sequence read that was considered for marker discovery. Of these, 19,762 and 43,526 contained putative SNPs, for a total of 27,298 and 43,526 putative SNPs in the respective de novo pipelines (Table 5).

Table 5: Summary statistics of the de novo GBS pipeline assemblies (excluding C. runcinata and C. modocensis subsp. rostrata)

Features | Stacks | UNEAK | Reference aligned
Illumina reads (million) | 344
Total number of tags found | 674,651 | 468,270 | N/A
Number of tags or contigs with putative SNPs | 19,762 | 43,526 | 1,726
Total number of putative SNPs | 27,298 | 43,526 | 33,558
Number of filtered SNPs present in at least 70% of individuals | 0 | 23 | N/A
Number of filtered SNPs present in at least 70% of individuals in clade V (before LD testing in reference aligned pipeline) | 0 | 19 | 296
Number of filtered SNPs present in at least 70% of individuals in clade V (after LD testing in reference aligned pipeline) | N/A | N/A | 0

Unfortunately, when filters were applied, this number dropped drastically. Although 89.4% of the putative SNPs identified by UNEAK passed filters for heterozygosity, linkage, genetic origin, and minor allele frequency, coverage across the samples was extremely shallow; upon filtering for markers that were present in at least 70% of samples, Stacks retained 0 markers while UNEAK retained 23.

Analyses completed using the 23 confirmed SNPs from UNEAK failed to resolve relationships at the species level (results not shown), and the decision was made to restrict analyses to individuals within clade V (Sears, 2011) in an attempt to elucidate the origin of C. nuvo subsp. X and C. nuvo subsp. Y. Members of this clade that were included in the final analyses are identified by an asterisk beside their names in Table 1.

Reference-based alignment of the GBS data to C. monticola was initially promising, with over 290 SNPs passing filters for heterozygosity, minor allele frequency, and coverage (Table 5); however, calculations of the degree of linkage disequilibrium between adjacent SNPs revealed that the variant sites were the result of misalignment of GBS reads (results not shown). This is likely an artifact of having an extremely limited portion of the genome to align to, coupled with the repetitive nature of the genome. I found that the 33,558 putative SNPs originated from only 1,726 of the 1.2M nuclear contigs (Table 5) and that separating true SNPs from the noise was virtually impossible.


2.3.4. Data analysis

The finalized marker set of 19 SNPs for 25 samples belonging to clade V was used for all of the analyses and is presented in Table 6. Standard IUPAC nucleotide codes are utilized.


Table 6: Finalized SNP set for members of Clade V. SNPs are recorded as standard IUPAC nucleotide codes

POPULATION ID TAXON 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
4065-1 C. occidentalis subsp. occidentalis C T G N G T T G R C C C T C C N G C T
4074-10 C. nuvo subsp. X Y Y G N G Y Y R A Y C Y T C C Y R Y T
4074-11 C. nuvo subsp. X C T S R N T T G A Y N Y N C C Y G C N
4074-12 C. nuvo subsp. X C Y S G R Y Y G R C C Y K C C Y G Y T
4074-14 C. nuvo subsp. X Y Y G G A Y Y G A C C Y T Y S C G C N
4074-15 C. nuvo subsp. X Y T G R G T N R R C C C T Y N C G C N
4074-4 C. nuvo subsp. X C T S N G T T R R C C C K C C Y G Y N
4075-2 C. nuvo subsp. X C T S R G T T G A C C Y T N C C G C T
4075-3 C. nuvo subsp. X C T G G R T T G A C C C N C C C G C T
4075-4 C. nuvo subsp. X C T G N N T T G N C N Y N N C N N C N
4075-5 C. nuvo subsp. X C T G G A T T G R C C C T C C Y G C N
5006-1 C. nuvo subsp. Y C T G N N N T G N C N C T N C C N C N
5006-2 C. nuvo subsp. Y C T G G G N T G A C C C T C C C G C N
5006-4 C. nuvo subsp. Y C T G G A T T G R C C C T C S C G C T
5007-1 C. pleurocarpa C C S A G T T R G N T Y T C C N A C T
5007-2 C. pleurocarpa C C G A G T T G G N C Y T C C T A C T
5007-3 C. pleurocarpa C C S G N T N G N N N C T C N N A N T
5007-4 C. pleurocarpa C Y G R G T N G G N C C T C N Y N C Y
5007-5 C. pleurocarpa C C G A G T T G R T C C T C C T A C Y
5007-6 C. pleurocarpa C C G A N T T G G T N C T C S T A C Y
5007-7 C. pleurocarpa C N G N N T N G N N N C T N N N N N T
5007-8 C. pleurocarpa C C S R G T N G R T Y Y T C N T A C T
5007-9 C. pleurocarpa C C N A N T T G G T N C T Y C T A C T
5014-2 C. pleurocarpa N N N A N N N N N N N C N N N N N N T
5014-3 C. pleurocarpa C C N A G T N G G T T C N C N T A C T


2.3.4.1. Pair-wise Fsts

The degree of genetic similarity between populations in clade V, as tested by Nei’s Fst formula, is presented in Table 7. Sample size limitations in Crepis occidentalis subsp. occidentalis (4065) and Crepis pleurocarpa (5014) led to their elimination from this summary statistic.

Table 7: Pair-wise Fst values for populations in clade V, excluding Crepis occidentalis subsp. occidentalis (4065) and Crepis pleurocarpa (5014). Crepis nuvo subsp. X (4074,4075), C. nuvo subsp. Y (5006), and C. pleurocarpa (5007) are included

Population | 4074 | 4075 | 5006 | 5007
4074 | 0 | 0.0778 | 0.0962 | 0.334
4075 | | 0 | 0.041 | 0.470
5006 | | | 0 | 0.486
5007 | | | | 0

2.3.4.2. STRUCTURE analysis

Significance testing as part of the STRUCTURE analysis most strongly supports the presence of two clusters, which clearly delimit C. occidentalis subsp. occidentalis and the two putative new species from all C. pleurocarpa populations (Figure 3). Within C. pleurocarpa, individuals 5007-3, 5007-4, 5007-8, and 5014-2 show some evidence of admixture; however, the low marker density and lack of fixed SNPs prevent a robust conclusion from being drawn.


Figure 3: STRUCTURE analysis of SNP data from individuals in clade V with K=2. C. occidentalis subsp. occidentalis (4065), C. nuvo subsp. X (4074,4075), and C. nuvo subsp. Y (5006) form a single cluster distinct from C. pleurocarpa (5007,5014)

Although K=2 was the most favoured number of clusters, genetic structure is still clearly present at K=3 (Figure 4); the addition of a third cluster separates population 4074 of C. nuvo subsp. X from population 4075 of the same putative species, which shows more genetic similarity with C. nuvo subsp. Y in population 5006 (Figure 4).

Figure 4: STRUCTURE analysis of individuals in clade V with K=3. Crepis occidentalis subsp. occidentalis (4065) most closely resembles C. nuvo subsp. X (4075) and C. nuvo subsp. Y (5006) with C. nuvo subsp. X (4074) primarily of a different genetic background. C. pleurocarpa remains a distinct cluster (5007,5014)

Members of population 4074 appear to share some of their genome with C. occidentalis subsp. occidentalis (4065) and with members of populations 4075 and 5006, while C. pleurocarpa remains distinct. Individuals of C. pleurocarpa that demonstrated some level of genetic introgression at K=2 maintain a mixed genotype due to polymorphic sites.

2.3.4.3. Principal component analysis

Results of the Principal Component Analysis generally agreed with the results of the STRUCTURE analysis, with C. occidentalis subsp. occidentalis, C. nuvo subsp. X, and C. nuvo subsp. Y forming a cluster distinct from C. pleurocarpa (Figure 5). There is no clear third cluster present within the PCA plot of axes 1 and 2.

Figure 5: PCA of individuals in clade V. C. occidentalis subsp. occidentalis (4065), C. nuvo subsp. X (4074,4075), and C. nuvo subsp. Y (5006) form a single cluster distinct from C. pleurocarpa (5007,5014)


2.4. Discussion

2.4.1. Crepis monticola genome composition and reference based alignment of GBS data

The intent of obtaining genome sequence information was to assemble a number of nuclear contigs that could act as a reference genome for the alignment of GBS data and the calling of variant sites. It was expected that sequence coverage would be inadequate to accurately resolve large regions of the genome, and it was not my intention to assemble a draft nuclear genome of Crepis. Rather, given the short (100 bp) GBS reads, it was expected that contigs longer than this could still aid in the resolution of small indels or of repeat regions that did not span kilobases. Given an estimated 2N (diploid) nuclear content of approximately 12 pg (Sears, 2011), which translates into an approximate haploid genome size of 6 Gb, 250M paired reads could achieve a maximum sequence coverage of 4x. Accounting for the high-copy plastid genomes, I predicted that a sequence coverage of between 2x and 3x was possible. In the end, I achieved less than 1x genome coverage, with a small average contig size of 375 bp, which is hardly adequate for production of a high-quality assembly: the sequence redundancy that comes with increasing coverage improves assembly accuracy by reducing the error rate, and is necessary for paralog and ortholog resolution (Schatz et al., 2010). While I was aware that coverage would be lacking due to genome size, the assembly I did achieve provided insight into the composition of the nuclear genome. The 1-fold difference in predicted versus achieved coverage may indicate a high level of redundancy in the genome, as highly similar sequences are collapsed into single contigs. Genome size varies by more than 1000-fold in angiosperms, with all genomes analyzed to date containing a significant component of repeated sequences. Wheat (Triticeae), for instance, was found to consist of 91.6% repetitive elements, with only 2.5% genes for which there is a known function (Li et al., 2004). A similar pattern appears to hold in Crepis.
The failure to generate large contigs may be a straightforward reflection of genome size: the necessary volume of sequence information was simply not available to assemble the sequences into larger contigs. However, my results may also indicate a high level of sequence repetition throughout the genome. Highly similar sequences interfere with the construction of long de Bruijn graphs, as they produce branching and force the assembly algorithm to select a branch to follow or, more conservatively, to break the assembly at these points (Treangen & Salzberg, 2012). Repeats that are longer than the read length create gaps in the assembly (Treangen & Salzberg, 2012) and lead to small, fragmented contigs similar to those seen in our assembly.
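The coverage figures discussed above follow from simple arithmetic, sketched here. The conversion of roughly 978 Mb per picogram of DNA is an assumption introduced for this example (the text simply rounds the haploid genome to 6 Gb); the sequencing totals are those given in Table 3, and the function names are hypothetical.

```python
# Assumption: approximately 978 Mb of DNA per picogram
MB_PER_PG = 978.0


def haploid_genome_mb(two_c_pg):
    """Haploid genome size (Mb) from a diploid (2C) DNA content in pg."""
    return two_c_pg / 2 * MB_PER_PG


def fold_coverage(sequence_mb, genome_mb):
    """Fold coverage = sequenced (or assembled) bases / genome size."""
    return sequence_mb / genome_mb


genome_mb = haploid_genome_mb(12.0)               # about 5,900 Mb, i.e. ~6 Gb
raw_coverage = fold_coverage(26_600, genome_mb)   # 26.6 Gb of raw sequence
assembled_fraction = fold_coverage(460.9, genome_mb)  # 460.9 Mb of nuclear contigs
```

Under these assumptions the raw data correspond to roughly 4.5x the haploid genome, in line with the 4x maximum quoted above, while the assembled nuclear contigs represent less than one tenth of the genome length.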

The presence of a high proportion of repeated sequences in Crepis is further supported by the high percentage of broken read pairs (63%), despite mapping reads back to the generated contigs with rigorous match requirements (up to 95% similarity). This is likely the result of one read mapping to a non-repetitive region in a short contig while its mate either maps more accurately to a different contig or has no appropriate contig to match. Both insert size and orientation of paired reads are utilized by the assembler in deciding the best placement, and if these pairs (or other pairs in the sequence data) do not span a particular repeat, it may be impossible to unambiguously assemble the data (Treangen & Salzberg, 2012). Taken together, these results suggest that the nuclear genome of Crepis is highly repetitive, with a high percentage of regions that are recently diverged and present in high copy number. While further investigation is necessary to determine the nature of the repeated elements, it is likely that they represent transposable elements. Long terminal repeat (LTR) retrotransposons, in particular, represent the most redundant component of plant genomes and are present in high copy number in intergenic regions (Li et al., 2004; Staton et al., 2012).

As a result of the limitations of the nuclear genome contig assembly, and the extreme redundancy of the genome, the reference-based alignment of GBS data for marker discovery among the diploids of the complex was largely unsuccessful. Although sequence repetition in the GBS dataset was high, the unique regions were sampled from across the entire 6 Gb genome, of which we had assembled information representing only 460 Mb. This creates a problem for the alignment program, which must either map reads to the available reference at a lower mapping quality (down to the user-specified cut-off of 4 nucleotides of divergence) than would be possible with a true alignment, or discard the read as unaligned. I found that BWA, with a default setting allowing for four nucleotide differences between read and reference (Li & Durbin, 2009), coupled with our limited reference, resulted in a high percentage of reads being mapped erroneously. Specifically, GBS reads within a 4 bp divergence would align to a reference contig simply because no better, less diverged contig was available, given the “best match” nature of the alignment algorithm. Had I achieved 3x coverage of the genome, all of the regions of the genome sequenced with GBS would have been present in the reference contigs and a better alignment would have been possible. The erroneous mapping reflects both the behaviour of the aligner and true paralogy within the genome. That GBS reads were aligned erroneously was supported by LD scores for putative SNPs, which frequently showed a squared correlation coefficient (r²) of less than 1, indicating that the SNPs were not in linkage disequilibrium with one another as would be expected for markers within recombination breakpoints (Mueller, 2004).
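The logic of the LD check can be sketched as follows. The method actually used to compute r² is not described here, so this example substitutes a common approximation for unphased data, the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of one allele); the function name and the genotype vectors are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between genotype dosages at two SNPs.

    x and y are lists of 0/1/2 allele counts per individual, with None
    for missing calls. Returns None if r2 cannot be computed.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a monomorphic site carries no LD information
    return sxy ** 2 / (sxx * syy)


# Hypothetical adjacent SNPs on one 100 bp read: complete LD is expected
linked = genotype_r2([0, 1, 2, 0, 2], [0, 1, 2, 0, 2])
# Hypothetical SNPs whose genotypes vary independently
unlinked = genotype_r2([0, 0, 2, 2], [0, 2, 0, 2])
```

SNPs that truly sit on the same short read should behave like the first pair; values well below 1 for adjacent sites are consistent with reads from different genomic regions having been aligned to the same contig.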

It is clear that a complete and accurate genome assembly of Crepis (which was not a goal of this project) would require more intensive sequencing and a species-specific sequencing strategy. While a detailed action plan is outside the scope of this project, I offer a few insights intended to increase contig length and limit deep sequencing of redundant areas. Current analytically deduced estimates of the coverage required for successful de novo assembly, which account for errors, ploidy, sequence bias, plastid interference, paralogy, and pseudogeny, are 100-fold (Schatz et al., 2012). It is unlikely that coverage this deep will be achieved using short-read technology alone. Recent work in wheat (Triticeae) has highlighted the use of sequencing data from mate-pair libraries in reducing assembly fragmentation due to genome redundancy (Belova et al., 2013). By generating mate-pair libraries with insert sizes ranging from 2 to 5 kb, Belova et al. (2013) were able to increase the N50 of assembled contigs 7-fold over the 1,500 bp average for paired-end reads. This increase in contig length is promising when assessing the feasibility of assembling the Crepis genome, and a mixed sequencing design that incorporated libraries of multiple fragment sizes could aid in contig resolution and general assembly success.

The ability to target specific gene regions and minimize skewed coverage due to high-copy plastid genomes would also be invaluable in Crepis genome assembly. Plant genomes contain numerous gene families with nearly identical members, which assembly programs frequently collapse, creating a non-representative mosaic assembly that can, at best, make downstream analyses messy or, at worst, lead to false conclusions about the biological question of interest (Schatz et al., 2013). Additionally, coverage bias towards plastid genomes inevitably reduces the available nuclear information. Targeted DNA enrichment strategies increase the representation of regions of interest in a library prior to sequencing, and are based on the assumption that a region is likely to be sequenced in proportion to its relative abundance within the sample (Hedtke et al., 2013). Transcriptome sequencing is one such enrichment strategy, as only expressed genes will be represented, but oligonucleotide probe design and an array-hybridization approach will similarly increase the proportion of the gene region under investigation (Hedtke et al., 2013). A transcriptome sequencing approach that decreases the coverage of the plastid genome via poly-A fractioning, or an array-based nuclear gene enrichment, would be most appropriate for the resolution of difficult regions and the reduction of plastid coverage bias (Albert et al., 2007).

2.4.2. Genotyping-by-sequencing data

My findings were necessarily limited by the quality and quantity of useable sequence information and it is apparent from the extreme level of sequence duplication in the GBS data (Table 4) that the initial digestion step of library preparation was not fully successful.


Imagine that the genome is a book that we are trying to read in 100-letter blocks (sequences) by cutting the book every time we see a specific word, for example “the”. As “the” is a common word, we expect to get many sentence fragments (sequence reads), and each 100-letter sequence is likely to be unique, as most sentence fragments appear only once in the book. Take this one step further and imagine that the book was published by a bad editor: some pages might be repeated at random throughout the book, increasing the number of times a given sequence is seen but still generating a high number of sequences overall, of which the majority are unique and a small subset are duplicated. This is what we expect when we perform a restriction enzyme digestion of genomic DNA, and it is this logic that forms the basis of restriction enzyme cutting in next-generation sequencing protocols (Elshire et al., 2011). However, if the digestion is only marginally successful, then the book does not get cut at every instance of “the”. There are many reasons that this may occur, but the end result is a smaller number of sequences from fewer chapters in the book. In a book with repeated pages this problem is exacerbated, as the sequences obtained are both fewer in number and less likely to be unique.

I believe that the poor GBS results obtained in my study are at least in part due to incomplete digestion of the DNA samples, and that this issue was compounded by the extreme overrepresentation of certain fragments in the amplification step. Known difficulties in obtaining DNA of high quantity and quality from Crepis (Whitton & Sears, pers. comm.) prompted the evaluation of multiple extraction approaches before the Phytopure resin-based extraction was chosen for its superior performance relative to the other protocols. Similar care was taken to prevent partial digestion, with visualization of multiple samples digested with increasing concentrations of PstI. This was not standard in the GBS library preparation protocol; however, I was concerned that the level of digestion visible with gel electrophoresis was not adequate. It was only after consultation with others with GBS experience that the decision was made to go forward with the library preparation. This decision was based on the knowledge that PstI did not cut frequently enough in certain species of Asteraceae to be visible on a gel (Rieseberg, pers. comm.). In retrospect, greater effort to ensure that extracted DNA was clean and free of enzyme inhibitors, and optimization with different enzymes or a combination of enzymes, beyond concentration tests and comparison with EcoRI, should have been used at this step to ensure confidence going forward.

While the GBS protocol as implemented in plants generally recommends the use of PstI, recent results, including mine, indicate that the importance of enzyme choice cannot be overstated (Elshire et al., 2011; Gore et al., 2007; Davey et al., 2011). Criteria for enzyme choice should aim to avoid common repeats, increase the representation of low-copy regions, and modulate the amount of cutting based on the methylation sensitivity of the enzyme and the methylation level of the genome. I chose PstI for this project because it is methylation sensitive (Saintenac et al., 2013) and is therefore most efficient in transcriptionally active euchromatic DNA that is hypomethylated (Isidore et al., 2003). This bias towards gene space makes it a good choice of enzyme for Crepis, and for plants with large genomes in general. Additionally, PstI has a 6 bp recognition site (CTGCAG) and is therefore expected to cut less frequently than enzymes with 4 bp recognition sites (Davey et al., 2011). This is beneficial for large genomes, as it increases the probability of across-sample coverage at a particular site. Finally, similar work had successfully been done in Helianthus sp. (Rieseberg, pers. comm.) and in-house expertise with this enzyme was available. For these reasons, PstI was an appropriate enzyme choice, and if successful digestion had occurred, I believe that greater across-sample coverage would have been obtained and a higher marker density achieved.

This is not to say that protocol optimization is unnecessary; such optimization could include the newer two-enzyme protocols, for example as used in barley and polyploid wheat (Triticeae) (Poland et al., 2012). A two-enzyme approach could greatly increase sequence coverage across the genome and improve the efficacy of GBS-based techniques. Poland et al. (2012) performed a double digestion with a “rare” and a “common” cutter (PstI and MspI, respectively) and ligation of a combined Y adapter that limited amplified fragments to those that consisted of a barcoded forward adapter and a common reverse adapter. In this way, sequences were restricted to those generated from a PstI/MspI cut, and the library produced consisted of uniform fragments (Poland et al., 2012). Comparisons of PstI-MspI to PstI-MseI and PstI-MluI libraries reveal a higher average number of reads per individual and of sequence clusters (stacks/tags) for PstI-MspI (Saintenac et al., 2013), attesting to the increased efficacy of this method.
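The selection of fragments with one rare-cutter end and one common-cutter end can be sketched in silico. The example below is a simplified illustration, not the procedure of Poland et al. (2012): it assumes complete digestion of a single linear sequence, ignores methylation and fragment size selection, and the function names are hypothetical. The cut offsets reflect the known cleavage positions CTGCA^G (PstI) and C^CGG (MspI).

```python
import re

# enzyme name -> (recognition site, cut offset within the site)
SITES = {"PstI": ("CTGCAG", 5), "MspI": ("CCGG", 1)}


def cut_positions(seq: str) -> list:
    """Return sorted (position, enzyme) tuples for every cut in seq."""
    cuts = []
    for enzyme, (site, offset) in SITES.items():
        for match in re.finditer(f"(?={site})", seq):
            cuts.append((match.start() + offset, enzyme))
    return sorted(cuts)


def mixed_end_fragments(seq: str) -> list:
    """Keep only fragments flanked by one PstI cut and one MspI cut."""
    seq = seq.upper()
    cuts = cut_positions(seq)
    fragments = []
    for (start, enz_a), (end, enz_b) in zip(cuts, cuts[1:]):
        if enz_a != enz_b:  # ends produced by different enzymes
            fragments.append(seq[start:end])
    return fragments


if __name__ == "__main__":
    toy = "AAACTGCAGTTTTCCGGAAAACCGGTTTCTGCAGAA"  # made-up sequence
    for fragment in mixed_end_fragments(toy):
        print(fragment)
```

In the toy sequence the fragment between the two MspI sites is discarded, which mirrors how the adapter design excludes fragments with two identical ends from amplification.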

2.4.3. De Novo assembly of GBS data

De novo assembly of the GBS data in Stacks and UNEAK revealed pipeline differences that led to incongruities in the outputs and in the determination of putative markers. The sliding window approach for inferring average read quality (Catchen et al., 2013), coupled with the erroneous disabling of the check for the enzyme cut-site, accounts for the higher number of stacks or tags found in Stacks; however, the almost two-fold difference in the final number of variant sites detected is more difficult to explain (Table 5). At first glance, the more sophisticated approach of Stacks to quality filtering, accounting for paralogy, and haplotype assignment (Catchen et al., 2011) would imply a greater power of detection and assignment; however, this is not what I found. Stacks identified approximately 19,000 stacks containing over 27,000 variant sites, while UNEAK reported over 43,000 tags with variant sites. This may be due to the ustacks algorithm, which attempts to identify and discard stacks of paralogous origin. However, I think it is more likely that the merging of stacks with matching pairs reduced the total number of variant sites inferred. Stack merging occurs via a k-mer search algorithm, which relates pairs of stacks transitively, leading to multiple stacks being merged into one (Catchen et al., 2011). UNEAK, on the other hand, forms pairs based on a single base pair mismatch and applies a network filter to these pairs that allows complicated networks to be discarded, but does not merge these sets of pairs (Lu et al., 2013). A second disparity between the pipelines is related to this behaviour: Stacks can form stacks that contain more than one polymorphic site, and therefore more SNPs than stacks can be identified. In UNEAK, only one nucleotide mismatch is permitted between tag pairs, and thus each tag pair carries at most one variant site (Lu et al., 2013). The two-fold difference in stacks/tags directly influenced downstream filtering, and after accounting for across-sample coverage, Stacks failed to produce a single informative marker (Table 5). UNEAK, however, identified 19 SNPs among individuals of clade V that passed all filtering steps and made up the final marker set used in the STRUCTURE and PCA analyses.
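The pairing behaviour described for UNEAK can be sketched as follows. This is a deliberately simplified illustration and not the UNEAK implementation: it assumes unique tags of equal length, ignores read counts and the error tolerance rate, and treats any tag with more than one single-mismatch neighbour as part of a complicated network to be discarded. All names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def reciprocal_pairs(tags: list) -> list:
    """Return tag pairs that differ by exactly one base and have no
    other single-mismatch neighbours (a crude network filter)."""
    neighbours = {tag: set() for tag in tags}
    for a, b in combinations(tags, 2):
        if len(a) == len(b) and hamming(a, b) == 1:
            neighbours[a].add(b)
            neighbours[b].add(a)
    pairs = []
    for a, partners in neighbours.items():
        if len(partners) == 1:
            (b,) = partners
            # keep the pair only if b's sole neighbour is a; a < b avoids duplicates
            if len(neighbours[b]) == 1 and a < b:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    toy_tags = ["AAAA", "AAAT", "CCCC", "CCCG", "CCGG", "GGGG"]
    print(reciprocal_pairs(toy_tags))
```

In the toy input the chain CCCC, CCCG, CCGG forms a three-tag network and is dropped rather than merged, so each retained pair contributes exactly one candidate SNP. A merging approach would instead collapse such a chain into a single locus with two polymorphic sites.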

The final marker set was limited to members of clade V because sequence duplication due to inefficient digestion exacerbated the large-genome problem of low coverage across samples. In order to move forward with probing the complex for evidence of a reticulate evolutionary history at the diploid level, a reduced sample set had to be used.

2.4.4. Data analysis

Despite having few markers, population structure was apparent in both the STRUCTURE and PCA analyses. Crepis nuvo subsp. X and C. nuvo subsp. Y consistently clustered with C. occidentalis subsp. occidentalis, while C. pleurocarpa formed a separate and distinct group (Figures 3 - 5). STRUCTURE results based on k = 3 revealed that one of the two populations of C. nuvo subsp. X (4074) is distinct, while population 4075 clusters with C. nuvo subsp. Y and C. occidentalis subsp. occidentalis (Figure 4). Population 4075 was originally tentatively assigned to C. occidentalis subsp. conjuncta (Sears, 2011), because it was densely canescent with non-glandular involucral bracts, as occurs in subsp. conjuncta. It was noted at the time, however, that the subspecies of Crepis occidentalis are among the most difficult to assign taxonomically (Sears, 2011). C. occidentalis comprises three morphologically and ecologically distinct subspecies and is highly polymorphic in its morphology. This, along with the STRUCTURE analysis, supports the hypothesis that population 4075 is most closely related to C. occidentalis and C. nuvo subsp. Y. There was no evidence, however, of admixture in members of 4075, indicating that while they may possess unique combinations of morphological traits, they are not likely of hybrid origin. This is in agreement with the PCA, which places members of population 4075 along the same axis as C. occidentalis subsp. occidentalis and C. nuvo subsp. Y (Figure 5). Pairwise Fst estimates (Table 6) for populations 4074, 4075, and C. nuvo subsp. Y agree with STRUCTURE and PCA, with 4074 and 4075 (Fst = 0.0778) having a lower estimated genetic similarity than 4075 and 5006 (Fst = 0.041). The limited sample size for C. occidentalis subsp. occidentalis prevented the estimation of Fst for this taxon, and comparisons with other taxa.
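To make the direction of the Fst comparison concrete (a larger Fst means less genetic similarity), a minimal sketch of a pairwise calculation is given here. It is not the estimator behind Table 6, which is not restated in this section: it assumes the simplest textbook form, Fst = (HT - HS) / HT, for biallelic SNPs with equal population weights and no sample-size correction, combined across loci as a ratio of sums. The function name and the input frequencies are hypothetical.

```python
def fst_two_populations(freqs_a: list, freqs_b: list) -> float:
    """Pairwise Fst from per-SNP allele frequencies in two populations.

    Uses Fst = (HT - HS) / HT summed over loci, where HS is the mean
    expected heterozygosity within populations and HT is the expected
    heterozygosity of the pooled (equally weighted) populations.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_a, freqs_b):
        p_bar = (p1 + p2) / 2
        h_total = 2 * p_bar * (1 - p_bar)
        h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # made-up allele frequencies at a single SNP
    print(fst_two_populations([0.2], [0.8]))
    print(fst_two_populations([0.3], [0.3]))
```

Identical allele frequencies give an Fst of zero, and fixed alternative alleles give an Fst of one. With only 19 SNPs, any such estimate has a wide sampling variance, whichever estimator is used.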

The wide range of morphologies apparent on the landscape can make taxonomic assignment difficult, and it is likely that part of this confusion is due to elements of different subspecies integrating on the landscape and subsequently being incorrectly attributed to formal subspecies circumscriptions. The apparent lack of genetic admixture in C. nuvo subsp. Y individuals, coupled with the sampled genotype they share with C. occidentalis subsp. occidentalis, is discordant with the morphologically based classifications of Sears (2011), who found that members assigned to C. nuvo subsp. Y had few to many glandular hairs on the involucral bracts, fewer and reduced cauline leaves, and deeply pinnatifid basal leaves, and were thus quite distinct from C. occidentalis subsp. occidentalis type specimens. It is therefore possible that C. nuvo subsp. Y represents a previously unrecognized diploid subspecies that has diverged from Crepis occidentalis, but is not the result of a hybridization event. The full genetic diversity present within C. occidentalis could not be captured in our analyses, or in the analyses of Sears (2011), because only a single diploid individual of this species was detected in flow cytometry analyses, despite targeted searches of historical collection sites (Sears, 2011). If more samples are found, it is possible that their inclusion might shed light on the nature of C. nuvo subsp. Y.

Excitingly, a different pattern was present in individuals from population 4074, which had a genetic profile distinct from all other samples. These individuals also all showed a degree of admixture from two other lineages, with approximately 5 - 75% introgression of the C. occidentalis subsp. occidentalis genotype and an additional 1 - 5% introgression from C. pleurocarpa in all but one of the individuals (Figure 5). The majority of their genome is distinct from both C. occidentalis subsp. occidentalis and C. pleurocarpa, and inclusion of more species is required to determine if this is reflective of a new genetic lineage or if this population simply shares genetic ancestry with a member of a clade outside clade V. Individuals from 4074 were described as non-glandular, scapose, and olive/grey tomentulose (Sears, 2011), and it was originally hypothesized that they were diploid hybrids of C. pleurocarpa with members of population 4075. Fst estimates support the hypothesis of introgression from C. pleurocarpa, with population 4074 (Fst = 0.336) showing more genetic similarity to C. pleurocarpa than either 4075 (Fst = 0.470) or C. nuvo subsp. Y (Fst = 0.486) does. While there is some support for a contribution of lineages similar to 4075 and C. pleurocarpa, their contributions alone do not seem to account for the pattern seen in Figure 5. Further investigation through Principal Components Analysis hints at a hybrid relationship as well, with the individual of 4074 showing the highest level of introgression from C. pleurocarpa being placed most intermediate between the two clusters (Figure 5). It should be noted, however, that as both STRUCTURE and PCA are based on statistical clustering, agreement between the analyses may reflect the statistical methodology they share more than the true underlying biology.

The failure to obtain a robust set of ancestry-informative markers greatly limited our ability to draw biological conclusions regarding the evolutionary history of individuals within clade V. Nonetheless, the analysis that I was able to do suggests that the populations with novel phenotypes may represent undescribed diversity within the diploid lineages of the Crepis complex that are worthy of further study.

2.4.5. Conclusions and recommendations

In my project I have confirmed the highly redundant nature of the diploid Crepis genome, uncovered species-specific enzyme inhibition in Crepis modocensis subsp. rostrata, and assessed the efficacy of reduced-representation DNA sequencing libraries for large genomes. These combined challenges negatively impacted my ability to resolve whether there is a signature of past or ongoing hybridization among the diploids of the complex, and more work is required to determine if members of C. nuvo subsp. X, C. nuvo subsp. Y, and the disjunct population of C. modocensis subsp. rostrata represent newly discovered diploid lineages or diploid hybrids.

The lack of genetic resources for non-model organisms has acted as a bottleneck for getting at some of the key questions of interest in evolutionary biology and population genetics. The advent of NGS technologies, however, has opened up a whole new frontier of genome-based research. The ability to sequence billions of bases for hundreds of individuals in a single, cost-effective reaction means that researchers can now address these important evolutionary and ecological questions at an unprecedented scale (Ekblom & Galindo, 2011). While going from data poor to data rich may seem like a researcher’s dream, the variety of biological, computational, and biomolecular challenges that are associated with NGS and complex plant genomes can make it difficult to “see the forest for all of the trees,” and solutions for dealing with such large, difficult datasets are still in their infancy. Population-wide whole genome data requires immense computational resources for memory and processing time (De Wit et al., 2012), with each step of the analysis requiring sophisticated mathematical and statistical techniques to filter through the noise that NGS technologies produce and adequately resolve repetitive sequences (Schatz et al., 2010). Additionally, current analytical approaches often require considerable knowledge of computer scripting and micro-programming, with each analysis step requiring a custom-made script (De Wit et al., 2012). Many of the challenges I have outlined here became central to my project, and it is in light of this that I propose a new sampling and sequencing strategy for addressing the question of homoploid hybridization in Crepis.

GBS techniques sequence many target markers at a low coverage per individual and use markers that are sequenced at sufficient coverage to impute genotypes, for which it is implicit that a different subset of markers will be genotyped in each individual (Davey et al., 2011). In wild populations for which ancestral genotypes are unknown, populations that are recently diverged, and populations with low levels of polymorphism, GBS may fail to uncover the number of markers necessary for resolution of relationships (Davey et al., 2011). It is clear that the combination of these traits in Crepis, in addition to genome size, decreased the efficacy of this technique. The lack of across-sample site coverage was an almost insurmountable problem in marker discovery, and a new approach that would achieve the depth of coverage necessary to resolve features beyond those in high-copy and repetitive targets should be implemented. The absence of a reference genome for Crepis precludes a target-enrichment based strategy, but paired-end RAD-seq techniques that rely on tiled reads across adjacently sheared regions may provide a suitable approach for increasing across-sample coverage (Cronn et al., 2012). Unlike in GBS, each fragment has one end cut at a restriction enzyme recognition site and one end that is randomly sheared. This creates fragments of varying length that can be aligned into contigs up to 500 bp long (Etter & Johnson, 2012) and can therefore assemble into the longer sequences required for un-pedigreed populations where genotypes cannot be imputed across missing sites (Cronn et al., 2012). As PE RAD-seq reduces the proportion of the genome targeted for sequencing via selective PCR, only fragments containing a restriction enzyme site are amplified for sequencing (Davey et al., 2011) and each marker can be sequenced at a higher coverage, allowing markers to be genotyped across numerous individuals. A caveat of this method is that initial fragmentation of the genomic DNA is still completed with a restriction enzyme; thus, species-specific inhibition of digestion may limit the number of unique sequence reads obtained, and optimization at this step would still be necessary in Crepis. The redundant, bloated nature of the nuclear genome will, however, continue to pose a challenge for both de novo assembly of a reference genome and SNP calling, and future project design should reflect this limitation.

Future studies in the Crepis agamic complex would further benefit from deeper sampling: across the range, within known populations, and particularly within members of clade V. While Sears (2011) performed exhaustive collecting of diploids from all known locations, in addition to the surrounding areas, the presence of only one confirmed Crepis occidentalis subsp. occidentalis diploid severely limited the inferences that could be drawn. More individuals would help elucidate a representative C. occidentalis subsp. occidentalis haplotype for comparison with other species in the complex, in particular with population 4075 from C. nuvo subsp. X and with C. nuvo subsp. Y. Continued collection of all species is recommended, with a particular focus within the California center of diversity.


3. Assembling the chloroplast genome of Crepis monticola

3.1. Introduction

3.1.1. Historical background

The study of the Crepis agamic complex has led to significant insights into the process of evolutionary change through hybridization and polyploidy (Babcock, 1924; 1947), and the genus as a whole was once considered “the plant counterpart to Drosophila” as a model organism for genetically based investigations (Smocovitis, 2009). Despite this proclamation, little investigation into the Crepis agamic complex has been carried out, with the exception of Sears (2011), who completed the first comprehensive phylogenetic analysis of the complex, based on three plastid gene regions.

The genus Crepis is a member of the tribe Cichorieae (or Lactuceae) within the Asteraceae, a distinctive group characterized by ligulate flowers and the presence of milky sap, with a nearly global distribution and centers of diversity in north temperate climates and the Mediterranean (Funk et al., 2009). Members of the agamic complex are long-lived perennials with stems arising from a persistent woody caudex and herbage that ranges from glabrous to tomentulose, glandular, or densely canescent (Sears, 2011). Within the Asteraceae, the tribe Cichorieae has long been one of the most difficult to resolve, with initial studies finding relatively weak support for relationships among subtribes (Whitton et al., 1995). It was not until the metatree analysis of Compositae (Asteraceae) by Funk et al. (2009) that significant progress was made in elucidating the relationships throughout this large and complicated tribe. As with many angiosperm groups, including the Asteraceae, our understanding of phylogenetic relationships has advanced greatly since the advent of molecular systematics, with most of the work to date focusing on harnessing variation in the plastid genome.


Angiosperm plastid genomes are simple, circular molecules that have stable plastome architecture, a high level of sequence conservation, little recombination, and a general pattern of maternal inheritance (Pan et al., 2012). Most plastid genomes are organized in a quadripartite structure with two copies of an inverted repeat (IR) separating a large (LSC) and a small (SSC) single-copy region. Gene content and order are highly conserved throughout angiosperms (Jansen et al., 2008), with a typical plastid genome having 101-118 different genes that primarily code for proteins involved in photosynthesis, carbon fixation, and gene expression (Li et al., 2013). Of these, there are generally eighteen genes that contain at least one intron, with three of the eighteen containing two introns. Changes to this pattern are relatively rare, and so these changes are frequently used as phylogenetic markers. Non-coding regions of plastid (and non-plastid) DNA, such as introns, are less functionally constrained than their coding counterparts and, as such, tend to evolve more rapidly (Downie et al., 2000). While gene phylogenies can resolve relationships between distantly related taxa, the higher rate of indels and point mutations in introns allows for finer-scale studies among taxa that are more recently diverged.

3.1.2. Next generation sequencing for plastome assembly

Currently, only partial chloroplast sequences exist for Crepis, and sequencing the entire plastome can yield information about variation within the agamic complex and the clade as a whole. The advent of NGS technologies has provided access to genomic information at a scale that was previously unheard of. The shotgun nature of many of these techniques means that high-copy regions are proportionally represented in the sequences obtained (De Wit et al., 2012), and as chloroplast molecules are present in multiple copies per cell, resolving the plastome of an organism is now possible with little effort or cost. The rapid rise in the number of publicly available plastome sequences [954 seed plants (as of September 14, 2014) up from 134 in 2011 (Jansen & Ruhlman, 2012)] attests to this; within Asteraceae, 69 plastomes representing 10 genera (as of September 14, 2014) have been sequenced. These plastomes represent an exponentially increasing amount of information relative to what has been available for taxonomic assignment within Asteraceae to date, and hint that a shift from gene region phylogenies to whole plastome phylogenies may soon allow for more robust assignments and resolution of relationships within difficult tribes.

3.1.3. Chapter goals and objectives

I propose to contribute to the genetic resources available for the genus Crepis using standard paired-end shotgun sequencing. Specifically, I aim to (1) prepare a draft sequence of the plastome of Crepis monticola using the reference plastome sequence of Lactuca sativa ‘salinas’ (DQ383816.1), and (2) identify any unique features of the Crepis plastome relative to other Asteraceae.

3.2. Methods

3.2.1. Sample and raw sequence data

Plant material for this study consisted of a single individual of Crepis monticola (Sears, collected 2009) from Siskiyou County in California (USA). Isolation of DNA, library preparation, and sequencing are described in chapter 2 (2.2.1, 2.2.2). The initial filtering of reads for quality was completed within CLC Genomics Workbench v.6.0 (CLCBio, Boston, Massachusetts) following the methodology described in chapter 2 (2.2.4.1). Briefly, reads were assessed for quality based on phred scores with a minimum requirement of q=30, and trimming for uncalled bases and adapter contamination was completed. There was no prior determination of putative plastome sequences; all of the high-quality reads obtained in the paired-end sequencing of Crepis monticola were included for possible alignment.

3.2.2. Reference based alignment


The complete chloroplast genome of Lactuca sativa ‘salinas’ (DQ383816.1) (Timme et al., 2007) served as the reference for alignment of trimmed and filtered Crepis monticola sequences. Lactuca sativa was chosen because it is the closest relative with a published plastome (as of October 2013), belonging to the same tribe (Cichorieae) as Crepis.

Sequence alignment was completed in CLC Genomics Workbench v.6.0 (CLCBio, Boston, Massachusetts) using the ‘Map reads to reference’ function. Standard parameters for mismatch, insertion, and deletion costs (2, 3, and 3, respectively) were chosen with a sequence similarity requirement of 95%. I used a length fraction of 50%, meaning that reads were only aligned if at least 50% of the read aligned to the reference. These two parameters function together, resulting in a stringency wherein at least 50% of the read must align, with 95% similarity over the aligned portion. By contrast, a length fraction of 1 with a similarity requirement of 95% would mean that the entire read must have 95% similarity to the consensus. Testing of various combinations did not produce significantly different results.
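The way the length fraction and the similarity requirement combine can be sketched as a simple acceptance rule. This is an illustration of the logic only and is not how CLC Genomics Workbench is implemented internally (the aligner also applies the cost-based scoring described above). The function and parameter names are hypothetical.

```python
def passes_mapping_filter(
    read_length: int,
    aligned_length: int,
    matches: int,
    length_fraction: float = 0.5,
    similarity_fraction: float = 0.95,
) -> bool:
    """Accept a read if enough of it aligns and the aligned part is
    similar enough to the reference."""
    if read_length <= 0 or aligned_length <= 0:
        return False
    long_enough = aligned_length / read_length >= length_fraction
    similar_enough = matches / aligned_length >= similarity_fraction
    return long_enough and similar_enough


if __name__ == "__main__":
    # 100 bp read with 50 bp aligned and 48 matching bases
    print(passes_mapping_filter(100, 50, 48))
    # same read when the whole read is required to align
    print(passes_mapping_filter(100, 50, 48, length_fraction=1.0))
```

Under the settings used here, a read with a divergent or unalignable end can still be placed, which is a reasonable allowance when the reference is from a different genus.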

Because the inverted repeat (IR) is present in two copies in the plastome, reads originating from this region will align with equal stringency to two locations in the reference genome. The aligner maps these reads randomly, and thus a read is equally likely to be mapped to IRa or IRb, regardless of its true origin. Therefore, the generated consensus sequence may have an IR that is not oriented correctly. Manual reverse-complementing of IRa was completed using custom Python scripts (Tack, 2013), and orientation was confirmed using blastn (Altschul et al., 1990), comparing the consensus C. monticola draft chloroplast sequence to the reference L. sativa ‘salinas’ plastome sequence.
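The reverse-complementing step can be sketched in a few lines. The example below is not the custom scripts of Tack (2013), which are not reproduced here. It assumes the consensus is held as a plain string and that the region boundaries are known as 0-based, end-exclusive coordinates; the function names are hypothetical.

```python
# translation table covering both cases and uncalled bases (N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_region(genome: str, start: int, end: int) -> str:
    """Reverse-complement genome[start:end] in place, leaving the
    flanking sequence untouched (0-based, end-exclusive)."""
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


if __name__ == "__main__":
    print(reverse_complement("ATGCN"))
    print(flip_region("AAACCGTT", 3, 6))  # toy sequence, not plastome data
```

Because the overall length is unchanged, downstream coordinates are unaffected, and a subsequent blastn comparison against the reference should report the flipped region on the expected strand.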

3.2.3. Annotation of draft Crepis monticola plastome

Initial genome annotation was performed with CpGAVAS (Liu et al., 2012) and edited as necessary in the Apollo Genome Annotation and Curation Tool (Lewis et al., 2002; Lee et al., 2013). Default parameters in CpGAVAS were chosen after testing found that they generated consistent and accurate results. The default parameters were as follows: a blastn cutoff E-value of 1e-10 with a maximum of 10 included hits produced, reference sequences from all available angiosperm taxa, and a Cove-only tRNA scan with a cutoff reporting score of 15 and a maximum length of tRNA intron plus variable region of 116 bp. Cove is a covariance model-based tRNA discovery tool that provides sensitive and discriminative identification of tRNA sequences and uses probabilistic models to flexibly describe the secondary structure (Eddy & Durbin, 1994). Scores are assigned on a log-odds scale wherein scores above zero indicate a higher match probability (Eddy & Durbin, 1994).

Manual annotation of difficult genes not recognized by CpGAVAS was completed with Apollo v. 1.11.8 (Lewis et al., 2002). Each putative gene was manually assessed in Apollo for the start and end of translation with appropriate codons, in addition to searching for the presence of exons and introns. Confirmation of each of the above was completed in Apollo with the Blastn and Blastx functions.

The circular chloroplast genome map of the annotated draft Crepis monticola plastome was completed in CpGAVAS quickdraw.

3.2.4. Pattern of retention and loss of rpl16 exon 1 in Asteraceae

In order to explore the phylogenetic distribution of the presence and absence of the first exon of the ribosomal L16 gene (see results, 3.3.2), I downloaded plastid sequence data from other Asteraceae and compiled the data to determine the pattern of loss and retention of exon 1 in Asteraceae.

All published, complete chloroplast genomes within Asteraceae were downloaded from GenBank (June - August 2014) and investigated for the presence or absence of the first exon of the ribosomal L16 gene. Fourteen species representing ten genera within Asteraceae have complete chloroplast sequences available. I used the grep function of UNIX (GNU, 2014) to identify the 9 bp sequence of the first exon within these plastomes, as blastn searches with the smallest allowable word size of 7 failed to identify such a small sequence without non-specific matching. Each plastome in which a 9 bp sequence exactly matching that of the first exon of rpl16 was found was analyzed for the location of Exon 1 relative to Exon 2 before being classified as either having or lacking Exon 1.
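An exact-match search of this kind can also be sketched in Python, which makes it straightforward to check both strands and to report coordinates for comparison with the annotated start of Exon 2. The sketch below is an illustrative alternative to the grep search that was actually used. The motif shown is a placeholder, not necessarily the Exon 1 sequence of rpl16, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact(seq: str, motif: str) -> list:
    """Return sorted (0-based position, strand) tuples for every exact
    occurrence of motif on either strand of seq. A palindromic motif
    would be reported once per strand at the same position."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for strand, query in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(query)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(query, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    placeholder_exon1 = "ATGCTTAGT"  # hypothetical 9 bp motif
    toy = "GGG" + "ATGCTTAGT" + "CCCC" + "ACTAAGCAT" + "GG"  # made-up sequence
    print(find_exact(toy, placeholder_exon1))
```

A 9 bp motif is expected by chance roughly once every 262,144 bp (4^9) on a given strand, so in a plastome of about 150 kb most exact hits can be evaluated individually by their position and strand relative to Exon 2.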

3.3. Results

3.3.1. Plastome size, gene content, order, and organization

The draft plastome of Crepis monticola (Figure 6) was 152,715 bp and contained a pair of inverted repeats (IRs) of 24,584 bp (IRa) and 24,797 bp (IRb) separated by a large and a small single-copy (LSC and SSC) region of 83,788 bp and 19,547 bp, respectively. Gene order and content were identical in the reference and C. monticola. There were 81 protein-coding genes in the C. monticola genome, 9 of which were duplicated in the IR; of these 9, two (ycf1 and rps19) were pseudogenized, as determined by premature stop codons. The trans-splicing gene rps12 has its 3’ end duplicated within the IR, contributing to the total of 9. The four rRNA genes were completely contained in the IR. There were 29 unique tRNA genes, of which 7 were in the IR, for a total of 36 copies of tRNA genes in the genome. Of the 81 protein-coding genes, 12 contained a single intron and two contained two introns. The trans-splicing gene rps12 is considered to have one intron in the above summary. It should be noted that while Timme et al. (2007) state the presence of 18 intron-containing genes, they counted duplicated genes twice; therefore, by my counting method, Lactuca sativa ‘salinas’ has 13 intron-containing genes, one fewer than the draft plastome of Crepis monticola. The intron reported as present in Crepis monticola but absent in Lactuca sativa ‘salinas’ is within the protein-coding gene rpl16.


3.3.2. Pattern of retention and loss of rpl16 exon 1 in Asteraceae

The rpl16 gene codes for an essential protein (Zarazaga et al., 2002). During the annotation of C. monticola, I found that the plastome of Lactuca sativa ‘salinas’ had rpl16 annotated as missing both exon 1 and the intron, with translation noted as starting at exon 2. Exon 2 begins with the codon for isoleucine (AT/UC), although the GenBank entry states that methionine (AT/UG) is usually present at this location. Based on the GenBank accession, it appears that AUC is assumed to be acting as a non-standard start codon of bacterial origin.

Further investigation revealed that the 9 bp sequence coding for Exon 1 was in fact present in the reference plastome, and it was simply not included in the annotation. Table 7 shows that of the 14 species investigated, all had Exon 1 present in the raw fasta sequences, while only 3 had correctly annotated GenBank accessions wherein Exon 1 was both in the raw fasta sequence and in the GenBank annotation. There were no cases of the first exon sequence being completely absent or mutated. One accession, for Parthenium argentatum, stated that rpl16 was pseudogenized on the basis of the missing Exon 1; however, a simple blast search followed by visual confirmation revealed the 9 bp exon approximately 1660 bp upstream from the start of Exon 2, before the start codon for the adjacent gene.


Figure 6: Draft plastome map of Crepis monticola chloroplast. Gene order and content are the same as in the reference Lactuca sativa ‘salinas’ (not shown). Thick lines indicate the extent of the inverted repeats (IRa and IRb). Genes on the outside of the map are transcribed in the clockwise direction, and genes on the inside are transcribed in the counter-clockwise direction. Introns are not shown; each gene box represents exons and introns together.


Table 8: Compiled data for the presence of Exon 1 from GenBank published complete chloroplast genome sequences in the Asteraceae. Asterisks indicate presence of feature.

Organism                 | Accession ID | Exon1 present in GenBank annotation | Exon1 present in fasta sequence | Exon1 not present | Listed as pseudogenized
Crepis monticola         |              | *                                   | *                               |                   |
Ageratina adenophora     | JF826503.1   |                                     | *                               |                   |
Artemisia frigida        | JX293720.1   | *                                   | *                               |                   |
Centaurea diffusa        | KJ690264.1   |                                     | *                               |                   |
Chrysanthemum indicum    | NC_020320.1  |                                     | *                               |                   |
Guizotia abyssinica      | EU549769.1   |                                     | *                               |                   |
Helianthus annuus        | NC_007977.1  |                                     | *                               |                   |
Helianthus decapetalus   | KF746356     |                                     | *                               |                   |
Helianthus tuberosus     | NC_023112.1  |                                     | *                               |                   |
Jacobaea vulgaris        | NC_015543.1  |                                     | *                               |                   |
Lactuca sativa           | AP007232.1   | *                                   | *                               |                   |
Lactuca sativa           | NC_007578.1  | *                                   | *                               |                   |
Lactuca sativa ‘salinas’ | DQ383816.1   |                                     | *                               |                   |
Parthenium argentatum    | NC_013553.1  |                                     | *                               |                   | *
Praxelis clematidea      | KF922320.1   |                                     | *                               |                   |


3.4. Discussion

The assembly of a draft plastome sequence for Crepis monticola allows for species comparisons across more gene regions than previously available, and increases the amount of genetic resources available for the genus as a whole. As NGS technologies make it easier than ever to obtain the whole plastome sequence of organisms of interest, a shift from gene phylogenies to genome phylogenies will increase phylogenetic resolution at low taxonomic levels (Parks et al., 2009). The broad genome sampling that whole plastome sequencing allows for is necessary for resolution among recently diverged taxa and for improving the low phylogenetic signal-to-noise ratios that plague rapid radiations (Parks et al., 2009). The noted difficulty of resolving relationships in Cichorieae (Funk et al., 2008) and evidence that implementing a multi-gene strategy will increase resolution in the tribe (Panero & Funk, 2008) lend strong support to the benefits of whole plastome phylogenies.

Currently, few genera within Asteraceae have published whole plastome genetic maps, and I was limited in my choices of a reference genome from which to determine the sequence of the Crepis monticola plastome. While assembling a plastome de novo is possible due to its high-copy nature, it requires additional steps that were deemed unnecessary, because the conservative nature of plastome sequences allows for an accurate reference-based assembly despite species divergence. When choosing an appropriate organism to serve as a reference genome, it is optimal to have an organism from within the same species or genus, as increased divergence will decrease sequence similarity (Hedtke et al., 2013); however, for non-model organisms lacking in genomic resources, this is frequently impossible. In highly divergent species or genomes this may necessitate a de novo approach; however, the conserved nature of chloroplast genomes allowed for confident alignment despite greater divergence between the reference organism and the organism of interest, and I achieved sequence resolution of the entire plastome with a few short, uncalled regions that will be confirmed via Sanger sequencing. The draft nature of this assembly is reflected in the size difference between IRa and IRb. A small (~200 bp) unresolved N run is contributing to the larger size of IRa, and it is expected that Sanger sequencing of the IR boundaries will resolve this conflict. I did not find that this region impeded annotation across the plastome, but note that subsequent to resequencing of the boundaries, a final confirmation and possible alteration to the annotation of exact gene boundaries will be necessary.

While gene content and organization were largely unchanged from other members of Asteraceae, I discovered that the ribosomal L16 gene of Crepis monticola retained an intron sequence that was not identified in the published reference sequence of Lactuca sativa ‘salinas’ nor in the accompanying publication (Timme et al., 2007: DQ383816.1). Further investigation revealed an anomaly between the raw fasta sequence of L. sativa ‘salinas’ and the GenBank record (Timme et al., 2007: DQ383816.1), wherein the raw sequence did in fact contain the 9 bp sequence of the first exon, followed by an intronic region, followed by the second exon of this gene, and the GenBank record failed to reflect this. As the intron of the chloroplast gene rpl16 is frequently selected as a fine-scale phylogenetic marker due to a high rate of sequence change and its large size relative to other plastid introns (Downie et al., 2000), it was of interest to me to determine if the failure to identify short exons was leading to a false inflation of the instances of reported intron loss or retention in rpl16 within Asteraceae.

DNA sequencing has revealed that the rpl16 intron is present across diverse evolutionary lineages and varies considerably in size among sequenced organisms, from 536 bp (Marchantia) to 1,411 bp (duckweed) (Downie et al., 2000). It is a self-splicing group II intron, requiring a minimum of approximately 500 bp for accurate splicing within an open reading frame (Bonen & Vogel, 2001). Because most group II introns are within a few hundred base pairs of this minimum size, their phylogenetic utility is more limited than that of the larger rpl16 intron (Doyle et al., 1995), which is likely to carry more evolutionarily informative mutations, insertions, and deletions. Evidence suggests that this intron has been lost independently at least three times (Campagna & Downie, 1998), including at the base of the rosids (Guisinger et al., 2011); however, these studies have been limited in scope and species inclusion.

My survey of the first exon of rpl16 in published Asteraceae plastome sequences (Table 7) revealed gross underreporting of exon 1 and, despite extensive literature searches, no supporting information documenting the loss of this exon. Short exons are typically difficult for annotation programs to identify, and the standard convention is to flag difficult or putatively incorrect annotations so that these regions can be manually confirmed and annotated (Liu et al., 2012). This difficulty reflects the blastn functionality used in CpGAVAS and similar annotation programs (e.g., DOGMA), in which the standard word size of 7 or 11 (Altschul et al., 1990) may fail to identify a short exon. My analysis suggests that this confirmation against the raw sequence data may not be occurring before sequences are published, which highlights the need for cautious interpretation of publicly available databases that are not peer reviewed. Caution is even more important when inferring gene expression or pseudogenization from sequence information, because the apparent loss of an exon would typically call for further investigation into gene expression. Of the 14 genera investigated, only one record, Parthenium argentatum (NC_013553), classified rpl16 as pseudogenized, although the accompanying literature (Kumar et al., 2009) does not indicate expression-based confirmation of pseudogenization. It is likely that putative pseudogenization was not identified in the other genera investigated due to the presence of a bacterial-origin start codon (AT/UC) at the start of the second exon.
Ribosomal protein L16 is essential to the large ribosomal subunit of bacteria: individuals lacking this protein exhibit defects in peptidyl-tRNA hydrolysis activity, peptidyl transferase activity, binding of aminoacyl-tRNA, and association with the 30S subunit (Zarazaga et al., 2002). Erroneous classification of rpl16 as a pseudogene could therefore have far-reaching consequences.
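
The effect of word size on short-exon detection can be illustrated with a minimal Python sketch. It is not the CpGAVAS or DOGMA implementation, and it simplifies blastn seeding to exact word matching; the 9 bp exon sequence, the flanking sequence, and the function names seed_words and find_exact are hypothetical and chosen only for illustration. A 9 bp query contains no word of length 11, so no seed can be formed at that word size, whereas a direct exact-match scan of the region upstream of exon 2 can still locate it.

```python
def seed_words(query: str, word_size: int) -> list[str]:
    """Return every contiguous word of length word_size contained in query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_exact(subject: str, query: str) -> list[int]:
    """Return 0-based start positions of exact matches of query in subject."""
    hits = []
    start = subject.find(query)
    while start != -1:
        hits.append(start)
        start = subject.find(query, start + 1)
    return hits


# Hypothetical 9 bp exon 1 and made-up flanking sequence (illustration only).
EXON1 = "ATGCTTAGT"
REGION = "TTGACC" + EXON1 + "GTGCGACTTGAAAGA"

for word_size in (7, 11):
    # Number of seed words available to an aligner at this word size.
    print(word_size, len(seed_words(EXON1, word_size)))

# Direct scan of the upstream region for the short exon.
print(find_exact(REGION, EXON1))
```

Under these assumptions the exon offers only three seed words at a word size of 7 and none at 11, which is consistent with the need to confirm short exons manually against the raw sequence.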

The true pattern of loss or retention of the rpl16 intron is currently obscured by erroneous annotation of the rpl16 gene in published plastome sequences, which could lead to underreporting of rpl16 intron retention in Asteraceae. Future studies that include the rpl16 intron should explicitly determine whether an intact (non-mutated) first exon is present in order to accurately assess intron retention. I further recommend that each gene annotation in a plastome be manually confirmed for the presence and location of open reading frames and of exon and intron boundaries prior to publication.

Bibliography

Albert, T. J., Molla, M. N., Muzny, D. M., Nazareth, L., Wheeler, D., Song, X., Richmond, T. A., Middle, C. M., Rodesch, M. J., Packard, C. J., Weinstock, G. M., & Gibbs, R. A. (2007). Direct selection of human genomic loci by microarray hybridization. Nature Methods, 4(11), 903–905.

Alexander, P. J., Rajanikanth, G., Bacon, C., & Bailey, C. D. (2007). Recovery of plant DNA using a reciprocating saw and silica-based columns. Molecular Ecology Notes, 7, 5–7.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.

Baird, N. A, Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A., Cresko, W. A., Johnson, E. A. (2008). Rapid SNP discovery and genetic mapping using sequenced RAD markers. PloS One, 3(10), e3376, 1-7.

Babcock, E.B. 1924. Genetics and plant . Science 59: 327-328.

Babcock, E.B., & M.S. Navashin (1930). The Genus Crepis. Bibliographia Genetica 6: 1–90.

Babcock, E.B. & G.L. Stebbins (1938). The American species of Crepis. Their interrelationships and distribution as affected by polyploidy and apomixis. Washington: Carnegie Institution of Washington (504).

Bonen, L., & Vogel, J. (2001). The ins and outs of group II introns. Trends in Genetics : TIG, 17(6), 322–31.

Chessel, D., Dufour, A. B., & Thioulouse, J. (2004). The ade4 package - I : One-table methods, R news, 4(1), 5–10.

Campagna, M. L., & Downie, S. R. (1998). The intron in chloroplast gene rpl16 is missing from the families Geraniaceae, Goodeniaceae, and Plumbaginaceae. Transactions of the Illinois State Academy of Science, 91, 1–11.

CLCBio. (2012). White Paper: de novo assembly in CLC Assembly. CLCBio. Denmark.

Bennett M. D., & I. J. Leitch, (2012). Plant DNA C-values database. Kew Royal Botanical Gardens. Accessed: August 2014.

Belova, T., Zhan, B., Wright, J., Caccamo, M., Asp, T., Šimková, H., Kent, M., Bendixen, C., Panitz, F., Lien, S., Doležel, J., Olsen, O. A., & Sandve, S. R. (2013). Integration of mate pair sequences to improve shotgun assemblies of flow-sorted chromosome arms of hexaploid wheat. BMC Genomics, 14, 222.

Bild, A. H., Chang, J. T., Johnson, W. E., & S. R. Piccolo (2014). A Field Guide to Genomics Research. PLoS Biology, 12(1), e1001744, 1-6.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 30(15), 2114–2120.

Buerkle, C. A., Morris, R. J., Asmussen, M. A., & L. H. Rieseberg (2000). The likelihood of homoploid hybrid speciation. Heredity, 84(4), 441–51.

Buerkle, C. A., D. E. Wolf, & L. H. Rieseberg (2003). The origin and extinction of species through hybridization. In: Population Viability in Plants: Conservation, Management, and Modeling of Rare Plants. Springer-Verlag. (pp. 117–141).

Catchen, J., Amores, A., Hohenlohe, P. A., Cresko, W., & J. Postlethwait (2011). Stacks: building and genotyping Loci de novo from short-read sequences. G3, 1(3), 171–82.

Catchen, J. M., Hohenlohe, P. A., Bassham, S., Amores, A., & W. A. Cresko (2013). Stacks: an analysis tool set for population genomics. Molecular Ecology, 22(11), 3124–40.

Chikhi, R., & P. Medvedev (2014). Informed and automated k-mer size selection for genome assembly. Bioinformatics (Oxford, England), 30(1), 31–7.

Cronn, R., Knaus, B. J., Liston, A., Maughan, P. J., Parks, M., Syring, J. V, & Udall, J. (2012). Targeted enrichment strategies for next-generation plant biology. American Journal of Botany, 99(2), 291–311.

Compeau, P. E. C., Pevzner, P. A., & G. Tesler (2011). How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 29(11), 987–91.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., & R. Durbin (2011). The variant call format and VCFtools. Bioinformatics (Oxford, England), 27(15), 2156–8.

Davey, J. W., Hohenlohe, P. A., Etter, P. D., Boone, J. Q., Catchen, J. M., & Blaxter, M. L. (2011). Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews. Genetics, 12(7), 499–510.

De Wit, P., Pespeni, M. H., Ladner, J. T., Barshis, D. J., Seneca, F., Jaris, H., Therkildsen, N. O., Morikawa, M., & Palumbi, S. R. (2012). The simple fool’s guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis. Molecular Ecology Resources, 12(6), 1058–67.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498

Downie, S. R., Katz-Downie, D. S., & Watson, M. (2000). A phylogeny of the flowering plant family Apiaceae based on chloroplast DNA rpl16 and rpoC1 intron sequences: Towards a suprageneric classification of subfamily Apioideae. American Journal of Botany, 87(2), 273–292.

Earl, D. A., & vonHoldt, B. M. (2011). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4(2), 359–361.

Ekblom, R., & Galindo, J. (2011). Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity, 107(1), 1–15.

Eddy, S. R., & Durbin, R. (1994). RNA analysis using covariance models. Nucleic Acids Research, 22(11), 2079–2088.

Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A, Kawamoto, K., Buckler, E. S., & S. E. Mitchell (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PloS One, 6(5), e19379, 1-10.

Etter, P. D., & Johnson, E. (2012). Chapter 9 RAD Paired-End Sequencing for Local De Novo Assembly and SNP Discovery in Non-model Organisms. In: Data Production and Analysis in Population Genomics: Methods and Protocols, Methods in Molecular Biology New York, USA., 135–151.

Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8), 2611–20.

Enke, N., & B. Gemeinholzer (2008). Babcock revisited: new insights into generic delimitation and character evolution in Crepis L. (Compositae: Cichorieae) from ITS and matK sequence data. Taxon, 57(3), 756–768.

Funk, V. A., Susanna, A., Stuessy, T. F., & Robinson, H. (2009). Classification of Compositae. In Systematics, Evolution, and Biogeography of Compositae. International Association for Plant Taxonomy, Vienna. (pp. 171–189).

Gilbert, K. J., Andrew, R. L., Bock, D. G., Franklin, M. T., Kane, N. C., Moore, J.-S., Moyers, B., Renaut, S., Rennison, D. J., Veen, T., & Vines, T. H. (2012). Recommendations for utilizing and reporting population genetic analyses: the reproducibility of genetic clustering using the program STRUCTURE. Molecular Ecology, 21(20), 4925–30.

Gore, M., Bradbury, P., Hogers, R., Kirst, M., Verstege, E., van Oeveren, J., Peleman, J., Buckler, E., van Eijk, M. (2007). Evaluation of Target Preparation Methods for Single-Feature Polymorphism Detection in Large Complex Plant Genomes. Crop Science, 47(S2), 135-148.

Gross, B. L., & L. H. Rieseberg (2005). The ecological genetics of homoploid hybrid speciation. The Journal of Heredity, 96(3), 241–52.

Guisinger, M. M., Kuehl, J. V, Boore, J. L., & Jansen, R. K. (2011). Extreme reconfiguration of plastid genomes in the angiosperm family Geraniaceae: rearrangements, repeats, and codon usage. Molecular Biology and Evolution, 28(1), 583–600.

Hedtke, S. M., Morgan, M. J., Cannatella, D. C., & Hillis, D. M. (2013). Targeted enrichment: maximizing orthologous gene comparisons across deep evolutionary time. PloS One, 8(7), e67908,1-10.

Hohenlohe, P. A., Bassham, S., Etter, P. D., Stiffler, N., Johnson, E. A., & W. A. Cresko (2010). Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genetics, 6(2), e1000862,1-23.

Hohenlohe, P. A., Amish, S. J., Catchen, J. M., Allendorf, F. W., & G. Luikart (2011). Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Molecular Ecology Resources, 11, 117–22.

Holsinger, K. E., & Weir, B. S. (2009). Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Reviews Genetics, 10(9), 639–650.

Isidore, E., van Os, H., Andrzejewski, S., Bakker, J., Barrena, I., Bryan, G. J., Bernard, C., van Eck, H., Ghareeb, B., de Jong, W., van Keort, P., Lefebvre, V., Milbourne, D., Ritter, E., van der Voort, J. R., Rouselle-Bourgeois, F., van Vliet, J., Waugh, R. (2003). Toward a marker-dense meiotic map of the potato genome: lessons from linkage group I. Genetics, 165(4), 2107–16.

Jansen, R. K., & Ruhlman, T. A. (2012). Genomics of Chloroplasts and Mitochondria. Respiration, 35, 103–126.

Jansen, R. K., Wojciechowski, M. F., Sanniyasi, E., Lee, S.-B., & Daniell, H. (2008). Complete plastid genome sequence of the chickpea (Cicer arietinum) and the phylogenetic distribution of rps12 and clpP intron losses among legumes (Leguminosae). Molecular Phylogenetics and Evolution, 48(3), 1204–17.

Jombart, T. (2008). adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics (Oxford, England), 24(11), 1403–5.

Kumar, S., Hahn, F. M., McMahan, C. M., Cornish, K. & Whalen, M. C. (2009). Comparative analysis of the complete sequence of the plastid genome of Parthenium argentatum and identification of DNA barcodes to differentiate Parthenium species and lines. BMC Plant Biology. 9, 131-143.

Lewis, S. E., Searle, S. M. J., Harris, N., Gibson, M., Iyer, V., Richter, J., Wiel, C., Bayraktaroglu, L., Birney, E., Crosby, M. A., Kaminker, J. S., Matthews, B. B., Prochnik, S. E., Smith, C. D., Tupy, J. L., Rubin, G. M., Misra, S., Mungall, C. J., & Clamp, M. E. (2002). Apollo: a sequence annotation editor. Genome Biology, 3(12), 1-14.

Lee, E., Helt, G. A., Reese, J. T., Munoz-Torres, M. C., Childers, C. P., Buels, R. M., Stein, L., Holmes, I. H., Elsik, C. G., & Lewis, S. E. (2013). Web Apollo: a web-based genomic annotation editing platform. Genome Biology, 14(8), R93.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 25(14), 1754–60.

Li, W., Zhang, P., Fellers, J. P., Friebe, B., & Gill, B. S. (2004). Sequence composition, organization, and evolution of the core Triticeae genome. The Plant Journal : For Cell and Molecular Biology, 40(4), 500–11

Liu, C., Shi, L., Zhu, Y., Chen, H., Zhang, J., Lin, X., & Guan, X. (2012). CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences. BMC Genomics, 13(1), 715.

Lu, F., Lipka, A. E., Glaubitz, J., Elshire, R., Cherney, J. H., Casler, M. D., Buckler, E., & D. E. Costich (2013). Switchgrass genomic diversity, ploidy, and evolution: novel insights from a network-based SNP discovery protocol. PLoS Genetics, 9(1), e1003215, 1-14.

McGinn, S., & Gut, I. G. (2013). DNA sequencing - spanning the generations. New Biotechnology, 30(4), 366–72.

Morey, M., Fernández-Marmiesse, A., Castiñeiras, D., Fraga, J. M., Couce, M. L., & Cocho, J. A. (2013). A glimpse into past, present, and future DNA sequencing. Molecular Genetics and Metabolism, 110(1-2), 3–24.

Mráz, P., Chrtek, J., & J. Fehrer (2011). Interspecific hybridization in the genus Hieracium s. str.: evidence for bidirectional gene flow and spontaneous allopolyploidization. Plant Systematics and Evolution, 293(1-4), 237–245.

Mueller, J. C. (2004). Linkage disequilibrium for different scales and applications. Briefings in Bioinformatics, 5(4), 355–64.

Nielsen, R., Korneliussen, T., Albrechtsen, A., Li, Y., & Wang, J. (2012). SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data. PloS One, 7(7), e37558, 1-11.

Novembre, J., & Stephens, M. (2008). Interpreting principal component analyses of spatial population genetic variation. Nature Genetics, 40(5), 646–9.

Pan, I.-C., Liao, D.-C., Wu, F.-H., Daniell, H., Singh, N. D., Chang, C., Shich, M.-C., Lin, C.-S. (2012). Complete chloroplast genome sequence of an orchid model plant candidate: Erycina pusilla apply in tropical Oncidium breeding. PloS One, 7(4), e34738, 1-12.

Panero, J. L., & Funk, V. A. (2008). The value of sampling anomalous taxa in phylogenetic studies: major clades of the Asteraceae revealed. Molecular Phylogenetics and Evolution, 47(2), 757–82.

Parks, M., Cronn, R., & Liston, A. (2009). Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biology, 7, 84.

Pevzner, P. A., Tang, H., & M. S. Waterman (2001). An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17), 9748–53.

Poland, J. A., Brown, P. J., Sorrells, M. E., & Jannink, J.-L. (2012). Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. PLoS ONE, 7(2), e32253, 1-8.

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–59.

Rieseberg, L.H. & J. F. Wendel (1993). Introgression and its consequences in plants. In Hybrid zones and the evolutionary process. Toronto, Canada: Oxford University Press. (pp.70-109).

Ringnér, M. (2008). What is principal component analysis ? Nature Biotechnology 26(3), 303–304.

Ronaghi, M., Uhlén, M., & Nyrén, P. (1998). DNA SEQUENCING: A Sequencing Method Based on Real-Time Pyrophosphate. Science, 281(5375), 363–365.

Saintenac, C., Jiang, D., Wang, S., & Akhunov, E. (2013). Sequence-based mapping of the polyploid wheat genome. G3, 3(7), 1105–14.

Schatz, M. C., Delcher, A. L., & Salzberg, S. L. (2010). Assembly of large genomes using second-generation sequencing. Genome Research, 20(9), 1165–1173.

Schatz, M. C., Witkowski, J., & McCombie, W. R. (2012). Current challenges in de novo plant genome sequencing and assembly. Genome Biology, 13(4), 243.

Sears, C. 2011. Systematic investigations into the North American Crepis agamic complex. PhD Thesis, University of British Columbia, CAN.

Smocovitis, V. B. (2009). The “Plant Drosophila”: E. B. Babcock, the Genus Crepis, and the Evolution of a Genetics Research Program at Berkeley, 1915–1947. Historical Studies in the Natural Sciences, 39(3), 1915–1947.

Staton, S. E., Bakken, B. H., Blackman, B. K., Chapman, M. A., Kane, N. C., Tang, S., Ungerer, M. C., Knapp, S. J., Rieseberg, L. H., & Burke, J. M. (2012). The sunflower (Helianthus annuus L.) genome reflects a recent history of biased accumulation of transposable elements. The Plant Journal : For Cell and Molecular Biology, 72(1), 142–53.

Stebbins, G. L. (1950). Variation and evolution in plants. New York: Columbia University Press.

Stebbins, G. L. (2011). The Role of Hybridization in Evolution. Proceedings of the American Philosophical Society, 103(2), 231–249.

Thorvaldsdóttir, H., Robinson, J. T., & J. P. Mesirov (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2), 178–92.

Timme, R., Kuehl, J., Boore, J., & Jansen, R. (2007). A comparative analysis of the Lactuca and Helianthus (Asteraceae) plastid genomes: Identification of divergent regions and categorization of shared repeats. American Journal of Botany, 94(3), 302–312.

Treangen, T. J., & Salzberg, S. L. (2012). Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews. Genetics, 13(1), 36–46.

Whitton J. (1994). Systematic and evolutionary investigation of the North American Crepis agamic complex. PhD Thesis, University of Connecticut, USA.

Whitton, J., Wallace, R.S. & Jansen, R. (1995). Phylogenetic relationships and patterns of character change in the tribe Lactuceae (Asteraceae) based on chloroplast DNA restriction site variation. Canadian Journal of Botany, 73, 1058-1073.

Zarazaga, M., Tenorio, C., Del Campo, R., Ruiz-Larrea, F., & Torres, C. (2002). Mutations in Ribosomal Protein L16 and in 23S rRNA in Enterococcus Strains for Which Evernimicin MICs Differ. Antimicrobial Agents and Chemotherapy, 46(11), 3657–3659.

Zerbino, D. R., & E. Birney (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821–9.
