
ELUCIDATING MONILOPHYTE GENOMICS: HOW POLYPLOIDY, TRANSPOSABLE ELEMENTS, AND THE ALTERNATION OF INDEPENDENT GENERATIONS DRIVE FERN EVOLUTION

By

DANIEL BLAINE MARCHANT

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Daniel Blaine Marchant

To Granddad

ACKNOWLEDGMENTS

I thank my advisors, Doug and Pam Soltis, for their endless support both inside and outside the lab. I couldn’t have wished for a better graduate experience. I thank Emily Sessa, Brad Barbazuk, Matias Kirst, Paul Wolf, and Zhonghua Chen for their insights. I thank my past academic mentors, Andreas Madlung, Betsy Kirkpatrick, and Virginia Walbot, for teaching me to think scientifically, and I thank my past non-academic mentors, Dick Held, Adam Stein, Karen Masters, and my grandfather, for teaching me to think critically. I thank my for their frequent visits and my friends for their frequent distractions. Finally, I thank my funding sources: the National Science Foundation, University of Florida Genetics Institute, Australian Academy of Science, Botanical Society of America, and Integrated Digitized Biocollections.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 PATTERNS OF ABIOTIC NICHE SHIFTS IN ALLOPOLYPLOIDS RELATIVE TO THEIR PROGENITORS
   Background
   Materials and Methods
      Polyploid Selection and Locality Data Collection
      ENM Layers
      Niche Overlap and Breadth
      Statistical Analyses
   Results
   Discussion

3 GENOME EVOLUTION IN
   Background
   Polyploidy
   Transposable Elements
   Horizontal Gene Transfer
   Alternative Splicing
   Summary

4 FINALLY : INSIGHTS INTO THE FIRST HOMOSPOROUS FERN GENOME
   Background
   Materials and Methods
      Tissue Samples
      Library Construction and Sequencing
      Genome Assembly
      Transcriptome Assembly
      Polyploidy
      Repeat Characterization
      Dating Repeat Insertion Events
   Results and Discussion
      Genome Sequencing and Assembly
      Transcriptome Sequencing and Assembly
      Polyploidy
      Repeat Diversity
   Summary

5 GENETIC SPECIFICITY AND EVOLUTION UNDERLYING THE ALTERNATION OF GENERATIONS IN LAND PLANTS
   Background
   Methods
      Tissue Samples
      Library Construction
      Transcriptome Assembly
      Comparative Transcriptomics
      Gene Family Evolution
   Results
      Generation-Specific Patterns of Gene Expression
      Across Land Plants
      Gene Family Evolution Across Land Plants
   Discussion

6 CONCLUSIONS

APPENDIX: SUPPLEMENTARY MATERIALS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

2-1 Niche categorization of polyploids relative to their progenitors.

4-1 Ceratopteris genome assembly statistics.

4-2 Ceratopteris repeat diversity and composition.

5-1 Gene family life stage specificity in a , fern, and conifer.

A-1 Polyploid systems used in this study and the means by which their progenitors were identified.

A-2 Source information for - comparisons.

A-3 Representative and data source for ancestral gene family reconstructions.


LIST OF FIGURES

Figure

2-1 Predicted niches and statistical differentiation of niche contraction in the polyploid complex, tennesseensis.

2-2 Predicted niches and statistical differentiation of niche expansion in the polyploid complex, hesperium.

2-3 Predicted niches and statistical differentiation of niche intermediacy in the polyploid complex, celsa.

2-4 Predicted niches and statistical differentiation of niche novelty in the polyploid complex, Polypodium saximontanum.

3-1 The two types of polyploidy: allopolyploidy and autopolyploidy.

3-2 The phases of genome evolution following polyploidization to diploidization.

3-3 The composition and phylogeny of fern neochrome.

3-4 Alternative splicing event types and rate in Arabidopsis, rice, maize, and humans.

4-1 Paralog-age distribution analyses and associated SiZER plots of three fern species.

4-2 Overlapping density plots of the three paralog-age distribution analyses.

4-3 MAPS analysis across land plants and the associated WGD events (shown as stars).

4-4 Fluorescent in situ hybridizations of Ceratopteris chromosome squashes.

4-5 Repeat composition, genome size, and genome assembly N50 for representative embryophytes.

4-6 Mean repeat lengths for representative embryophytes across different repeat categories.

4-7 LTR RT insertion dates in Ceratopteris richardii based on the CFern v1.1A and BACSubSample assemblies.

5-1 Life stage specificity of isoforms, genes, and gene families in Ceratopteris richardii.

5-2 Gene ontology (GO) annotation frequency of terms related to root, , and flower development, as well as reproduction in sporophytic (blue) and gametophytic (green) transcripts of Ceratopteris richardii.

5-3 Ancestral reconstruction of gene family content and evolution in land plants.

5-4 Number of MADS-box, TCP, TALE, and HD-ZIP transcription factor genes per species and ancestral node across land plants.

5-5 Number of AGO and DCL genes per species and ancestral node across land plants.

5-6 Number of NBS-LRR genes per species and ancestral node across land plants.

A-1 Niche breadth of the polyploids and their diploid progenitors.

A-2 Pairwise niche-overlap scores of each polyploid and its diploid progenitors with gray bars for polyploid-diploid comparisons and black bars for diploid-diploid comparisons.

A-3 Box plots of transcript length in the genome-free UniCFernModels, genome-mapped with ≥50% coverage, and genome-mapped with ≥98% coverage.


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ELUCIDATING MONILOPHYTE GENOMICS: HOW POLYPLOIDY, TRANSPOSABLE ELEMENTS, AND THE ALTERNATION OF INDEPENDENT GENERATIONS DRIVE FERN EVOLUTION

By

Daniel Blaine Marchant

August 2018

Chair: Douglas E. Soltis
Major: Botany

Arguably the most evolutionarily significant lineage on Earth, land plants have shaped and supported terrestrial life for nearly 500 million years. While the morphological and physiological evolution of land plants is readily apparent through comparisons of fossil, ancestral, and modern taxa, the underlying genetic and genomic changes that resulted in or accompanied these physical alterations are much less conspicuous. Here I present my findings on the ecological shifts between polyploids and their diploid progenitors in , review the major processes involved in genome evolution, and delve into the first homosporous fern genome assembly to investigate the roles of polyploidy, transposable elements, and the alternation of independent generations in shaping fern genomes. I find four categories of niche shifts in the North American polyploids: niche expansion, niche contraction, niche intermediacy, and niche novelty. From a genomics perspective, I find evidence for only a single ancient WGD event in the evolutionary history of our model fern species, Ceratopteris richardii, and repeat proportions similar to those of classically large genomes, such as maize. There is substantial life stage specificity in Ceratopteris at the isoform and gene level, but very little in gene families. Ancestral reconstructions of gene families across land plants show that transcription factors and gene families associated with the regulation of development largely diversified after the evolution of flowering plants, while gene families associated with pathogen resistance significantly expanded in seed plants. This research provides a major evolutionary stepping-stone by providing the first homosporous fern genome reference, as well as unique insight into the root processes underlying these massively complicated genomes.


CHAPTER 1
INTRODUCTION

Arguably the most evolutionarily significant lineage on Earth, land plants have shaped and supported terrestrial life for nearly 500 million years via nutrient cycling, gas regulation, niche formation, and countless other ecosystem services and transformations (Costanza & Folke, 1997; Rensing et al., 2008; Morris et al., 2018). Diverging from streptophyte algae, the earliest land plants likely resembled modern-day bryophytes (mosses, liverworts, hornworts), lacking vascular tissue and thus structurally minute while relying heavily on environmental moisture for dispersal, reproduction, and transfer of nutrients (reviewed in Judd et al., 1999). The next critical adaptation in land plants was the evolution of lignified vascular tissue, which allowed the conduction of water and nutrients, provided structural support throughout the plant, and resulted in the evolution of leaves, roots, and stems. Subsequently, seeds evolved as the primary means of dispersal rather than spores, resulting in a major radiation of gymnosperms, which today consist of conifers, cycads, Ginkgo, and gnetophytes. The angiosperms, in turn, are characterized by the evolution of flowers and of ovules enclosed within a protective structure, the carpel, which develops into a fruit following fertilization; together these provided even more effective means of dispersal.

Each of these major radiations within land plants resulted in evolutionary cascades, as other organisms adapted to the novel plant species, whether for food, shelter, or reproduction, and predatory species then evolved to specialize on those organisms. While some aspects of the morphological and physiological evolution of land plants are apparent through comparisons of fossil, ancestral, and modern taxa, the underlying genetic and genomic changes that resulted in or accompanied these evolutionary modifications are much less conspicuous. This dissertation takes a multifaceted approach to land plant evolution, investigating genome evolution and associated processes at a variety of temporal and biological scales.


Polyploidy, or whole-genome duplication (WGD), is hypothesized to be one of the most influential genetic processes in eukaryotic evolution, as it provides an entire “extra” genome upon which evolutionary processes can act in the long term and causes immediate morphological, physiological, and ecological changes in the short term (Soltis & Soltis, 1999a; Chen & Birchler, 2013; Conant et al., 2014; Selmecki et al., 2015). WGD can occur via intraspecific genome duplication (autopolyploidy) or the interspecific amalgamation of genomes through the combined processes of hybridization and genome duplication (allopolyploidy). Chapter 2 (published as Marchant et al., 2016a) investigates the ecological shifts between allopolyploids and their diploid progenitors. Interest in this topic dates to the classic work of Stebbins, Wagner, and Grant, who considered allopolyploids biological and ecological intermediates of their diploid parents, a view that was held throughout most of the 20th century (Stebbins, 1950; Wagner Jr, 1970; Grant, 1981). Contrary to this generalization, however, genetic and genomic investigations of recently formed polyploids, or neopolyploids, have found the genetic amalgamation accompanying allopolyploidy to be highly dynamic, resulting at times in novelty or parental biases, in addition to simple additivity, at the genetic, genomic, and transcriptomic levels (Flagel et al., 2008; Leitch et al., 2008; Hegarty et al., 2008; Aïnouche et al., 2009; Chaudhary et al., 2009; Buggs et al., 2012; Feldman & Levy, 2012; Soltis et al., 2014b; Yoo et al., 2014). Despite this vast progress in understanding the genetic and genomic effects of allopolyploidy, the ecological ramifications of allopolyploidization remained nearly as poorly understood as they were more than half a century ago (Thompson & Lumaret, 1992; Soltis et al., 2010; Madlung, 2013).

To address the ecological changes associated with allopolyploidy, I directly compared the ecogeographies of 13 allopolyploids with those of their diploid progenitors in taxa spanning a wide phylogenetic diversity, from ferns to flowering plants (Marchant et al., 2016a). The data set consisted of nearly 40,000 digitized occurrence data points from herbaria across North America, tangentially demonstrating the scope and ease of research incorporating digitized natural history museum specimen data from online sources such as Integrated Digitized Biocollections (iDigBio). Ecological niche modeling, niche analyses, and ordination analyses were applied to categorize the niche shift between the polyploids and their progenitors.

Chapter 3 (published as Marchant et al., 2016b) reviewed four major mechanisms contributing to plant genomic and genetic diversity. I focused on the prevalence of polyploidy and its resurgence as a focal study subject with advances in sequencing technology and then discussed the role of transposable elements in altering genomes. I reviewed horizontal gene transfer in plants with a focus on the bizarre mitochondrial genome(s) of Amborella trichopoda and the putative gene transfer of neochrome from a hornwort to ferns and then ended with the evolutionary conservation of alternative splicing patterns in flowering plants.

In Chapter 4 I provided the first homosporous fern genome assembly. Ferns are notorious for possessing large genomes and numerous chromosomes. Despite decades of speculation, the processes underlying the large, complicated genomes of ferns are still unclear, largely due to the absence of a sequenced homosporous fern genome (Sessa et al., 2014a). The lack of this crucial resource has not only hindered investigations of the evolutionary processes responsible for the unusual genome characteristics of homosporous ferns, but also impeded synthesis of genome evolution across land plants. Here I addressed the genomic and evolutionary processes by which the large genomes and high chromosome numbers typical of homosporous ferns evolved and have been maintained. Using the model fern species Ceratopteris richardii, I evaluated the possible roles of polyploidy, repeat element composition, and the expansion of transposable elements in shaping fern genome evolution. I directly compared repeat compositions in species spanning the plant tree of life and a variety of genome sizes, as well as both short- and long-read-based assemblies of Ceratopteris. The Ceratopteris genome provided critical insights into the evolutionary genomics and paradoxes of the long-neglected fern clade, in addition to serving as a crucial reference for future investigations into land plant genome composition and dynamics.

Chapter 5 addressed fundamental questions regarding the evolution of land plant life history traits as well as gene family evolution. The interdependence of the haploid gametophyte and diploid sporophyte varies between major lineages of plants. For example, extant non-vascular land plants (i.e. mosses, hornworts, liverworts) have free-living gametophytes upon which the sporophytes develop and are dependent. In contrast, seed plants (i.e. gymnosperms and flowering plants) have free-living sporophytes with highly reduced, dependent gametophytes developing within sporophytic tissue. Ferns differ dramatically from seed plants and bryophytes in that nearly all species (99%) have both an independent, free-living haploid gametophyte and an independent diploid sporophyte life stage. As the sister group to seed plants, ferns serve as the ideal system for investigating both the genetics underlying the transition from independent-gametophyte to independent-sporophyte plant life and the ancestral gene content at this transition. In this chapter, I investigated gene and gene family specificity and evolution in the gametophyte and sporophyte life stages of Ceratopteris richardii. I compared these results with genes recovered from a species with an independent, free-living gametophyte (Physcomitrella patens) and a species with an independent sporophyte (Pinus taeda) and reconstructed the ancestral gene content of all euphyllophytes (ferns and seed plants).


CHAPTER 2
PATTERNS OF ABIOTIC NICHE SHIFTS IN ALLOPOLYPLOIDS RELATIVE TO THEIR PROGENITORS

Background

Polyploidy (whole-genome duplication, WGD), the most extensive form of genomic mutation, is the presence of more than two sets of chromosomes in a cell, arising via intraspecific genome duplication (autopolyploidy) or the interspecific amalgamation of genomes through the combined processes of hybridization and genome duplication (allopolyploidy) (Soltis & Soltis, 1999b, 2012; Flagel & Wendel, 2010; te Beest et al., 2012; Chen & Birchler, 2013; Conant et al., 2014). This complete doubling of an already functional genetic system results in, at a minimum, redundant copies of genes, regulators, and non-coding DNA that can subsequently be eliminated by selection or drift. Conversely, this redundancy can provide the evolutionary fodder for genetic diversification, cascading to novel biochemical pathways and, in some cases, ultimately promoting species radiations (e.g. Jaillon et al., 2007; Soltis et al., 2009; Jiao et al., 2011; Tank et al., 2015). Phylogenetic and genomic analyses have identified key polyploidy events in an array of eukaryotic lineages, including fungi (Albertin & Marullo, 2012), vertebrates (McLysaght et al., 2002), and the entirety of flowering plants (Jiao et al., 2011; Amborella Genome Project, 2013), supporting the role of polyploidy as a primary mechanism to generate life’s vast diversity. While both forms of polyploidy have recently been hypothesized to be equally prevalent (Barker et al., 2016), autopolyploids are often more cryptic relative to their diploid progenitors compared to allopolyploids and, as a result, allopolyploids have been more widely recognized and thoroughly investigated (Clausen et al., 1945; Stebbins, 1950; Soltis et al., 2007, 2014b; Chen et al., 2007).

 This work was previously published in New Phytologist, November 2016. 212:708-718 with a commentary article by Parisod and Broennimann (540-542).


Historically perceived as genetic summations of their progenitors, allopolyploids were considered biological and ecological intermediates of their diploid parents throughout most of the 20th century (Stebbins, 1950; Wagner Jr, 1970; Grant, 1981). This assessment was further supported because most allopolyploids were identified as such based on their morphological intermediacy between two proposed parental diploid species. Contrary to this generalization, genetic and genomic investigations of recently formed polyploids, or neopolyploids, have found the genetic amalgamation accompanying allopolyploidy to be highly dynamic, resulting at times in novelty or parental biases, in addition to simple additivity, at the genetic, genomic, and transcriptomic levels (Flagel et al., 2008; Leitch et al., 2008; Hegarty & Hiscock, 2008; Aïnouche et al., 2009; Chaudhary et al., 2009; Parisod et al., 2009; Buggs et al., 2012; Feldman & Levy, 2012; Soltis et al., 2014a; Yoo et al., 2014). Despite vast progress in understanding the genetic and genomic effects of allopolyploidy, the ecological ramifications of allopolyploidization remain nearly as ambiguous as they were more than half a century ago (Thompson & Lumaret, 1992; Soltis et al., 2010; Madlung, 2013).

While the evidence and theory surrounding sympatric speciation via cladogenesis have been disputed for decades (Mayr, 1963; Turelli et al., 2001; Coyne & Orr, 2004; Gavrilets et al., 2007; Bird et al., 2012), polyploidy provides a clear mechanism for sympatric speciation, as it generally results in immediate post-zygotic reproductive isolation of the neopolyploid population from that of the progenitor(s). Allopolyploid speciation, in contrast to cladogenesis, involves the merger and genomic shuffling of disparate lineages into one lineage, coupled with immediate reproductive isolation. Thus, the processes underlying ecological establishment in these novel species are expected to differ dramatically from those typical of homoploid, cladogenic speciation processes.


A neopolyploid must inhabit an ecological niche different from that of its two parental species or else outcompete one or both parents to become established and persist (Levin, 1975, 2003; Fowler & Levin, 1984). Because the parental species of an allopolyploid are already locally adapted and well established, niche differentiation is considered the norm for the establishment of neopolyploids in stable environments. Yet, the direction and means of this niche shift have long been contentious, with allopolyploids often considered “fill-in” taxa, inhabiting geographic ranges and ecological niches intermediate to those of their progenitors (Grant, 1981; Bayer et al., 1991). Allopolyploids, with their fixed heterozygosity, may be more successful in colonizing harsh habitats than their progenitors, resulting in latitudinal or elevational gradients in polyploid richness (Ehrendorfer, 1980; Abbott & Brochmann, 2003; Brochmann et al., 2004).

Alternatively, environmental vicissitude could reduce parental competition by opening novel niches or ranges, allowing immediate colonization with substantial niche conservatism between the allopolyploid and the progenitors. The frequency of polyploidy has also been hypothesized to be linked in part with glaciation patterns, as polyploids are reportedly more common in deglaciated habitats as a result of secondary contact of closely related species that had become allopatric as a result of glacial barriers and then geographically converged after deglaciation (Stebbins, 1950, 1984).

More recent studies have examined the ecology of polyploids relative to their progenitors at a finer ecological scale by incorporating microclimate, soil, or biotic interactions (Husband & Schemske, 1998; Thompson & Merg, 2008; Ramsey, 2011). These studies have found that autopolyploids can be geographically sympatric with their progenitor, escaping progenitor competition via alterations in phenology, pollinator syndrome, and parasite interactions (Thompson et al., 1997; Segraves & Thompson, 1999; Thompson et al., 2004), while others have specialized on distinct soil types (Ramsey, 2011). Surprisingly, nearly no ecological studies have delved into the fine-scale niche partitioning of allopolyploids relative to their progenitors, except among a few nascent (<50 generations) polyploids, such as Senecio cambrensis and Spartina anglica (Ainouche et al., 2004; Abbott & Lowe, 2004; Ramsey & Ramsey, 2014).

Characterizing niche differentiation between an allopolyploid and its progenitors requires both extensive geographic occurrence information and abiotic and biotic ecological data. Consequently, polyploid-progenitor niche comparisons have largely been qualitative or taxonomically limited (Clausen et al., 1945; Ramsey, 2011; McIntyre, 2012), with only a few ecological niche studies that have expanded beyond single polyploid systems (Bayer et al., 1991; Martin & Husband, 2009; Glennon et al., 2014; Visser & Molofsky, 2015). However, with the expanding digitization of specimen data from natural history museums and the aggregation of these data into readily accessible online portals (e.g. Integrated Digitized Biocollections, www.idigbio.org; Global Biodiversity Information Facility, www.gbif.org), the field of evolutionary ecology is rapidly expanding in breadth and scope, leading to a renaissance in specimen-based investigations at a scale unimaginable only a decade ago (Donoghue & Edwards, 2014; Page et al., 2015). With an estimated 3-4 billion natural history specimens potentially available globally and digitization efforts rapidly increasing, scientists are now able to access hundreds of millions of specimen data points collected from around the world from their own computers.

Ecological niche modeling (ENM), or species distribution modeling, uses georeferenced specimen occurrence presence data and random background sampling to characterize quantitatively the environmental niche of a species or population based on abiotic variables (e.g. precipitation, temperature, soil characteristics, elevation) and subsequently predict both the geographic bounds of taxa and the major abiotic factors influencing their distributions (Warren et al., 2008a; Peterson et al., 2011). Several studies have applied these approaches to individual polyploid species or to species complexes (Glennon et al., 2011; McIntyre, 2012; Theodoridis et al., 2013); however, it must be noted that ENM and associated analyses depend on several assumptions. The first is that the species have reached their distribution limits and are not still expanding, as would be the case for a novel allopolyploid or after recent environmental change. Additionally, ENM does not take biotic interactions into account, basing the predicted distribution solely on the incorporated abiotic layers. Nonetheless, ENM provides a broad understanding of a species’ niche from which testable hypotheses can be formed.

Here we took a more inclusive approach to address how allopolyploids (ten ferns and three angiosperms) differ ecogeographically from their well-documented progenitors. We used ENM and a variety of niche analyses and multivariate techniques to investigate the geographic distribution, niche breadth, and niche overlap of these allopolyploid systems, in an effort to delimit general patterns in niche distribution. We hypothesized that the allopolyploids would occupy completely distinct ecological niches relative to those of their progenitors due to the need for a niche shift away from those of the parents during the early establishment of the polyploid so as to avoid competition (Fowler & Levin, 1984). In addition, we hypothesized that each allopolyploid would occupy a broader niche than those of each of its individual progenitors due to the increased allelic diversity of combining two distinct genomes.

Materials and Methods

Polyploid Selection and Locality Data Collection

We searched the literature for North American endemic allopolyploid species for which the progenitors are well documented via a variety of genetic and/or genomic methodologies (Table A-1). This study was restricted to North America to ensure abundant georeferenced specimen sampling and consistent abiotic layers used for ENM. Invasive and non-native naturalized species were excluded to emphasize natural ecogeographic bounds of the systems. Locality information for each allopolyploid and its respective diploid progenitors was gathered from georeferenced specimen, observational, and literature data extracted from iDigBio, GBIF, Biodiversity Information Serving Our Nation (BISON; http://bison.usgs.ornl.gov), and regional herbaria consortia. Duplicated georeferenced points were removed, and those with a precision less than 1 kilometer were re-georeferenced using GEOLocate v. 3.22 (Rios & Bart, 2010).
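The removal of duplicated georeferenced points described above can be illustrated with a minimal sketch. The record layout (`Occurrence`), the function name, and the choice to treat two records as duplicates when species and rounded coordinates match are all assumptions made for illustration; the text does not state how duplicates were defined.

```python
from typing import Iterable, List, NamedTuple


class Occurrence(NamedTuple):
    """Hypothetical record layout; field names are illustrative only."""
    species: str
    lat: float
    lon: float
    source: str  # label of the portal the record came from


def drop_duplicate_points(records: Iterable[Occurrence],
                          decimals: int = 4) -> List[Occurrence]:
    """Keep the first record for each (species, lat, lon) combination.

    Coordinates are rounded to `decimals` places before comparison
    (an assumed tolerance, not one taken from the study).
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec.species, round(rec.lat, decimals), round(rec.lon, decimals))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


records = [
    Occurrence("Iris versicolor", 44.5000, -72.1000, "portal A"),
    Occurrence("Iris versicolor", 44.5000, -72.1000, "portal B"),  # same point
    Occurrence("Iris versicolor", 45.2000, -70.3000, "portal A"),
]
unique_records = drop_duplicate_points(records)
```

Keeping the first record per key preserves the order in which sources were aggregated, so the same specimen reported by two portals is counted once.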

ENM Layers

We used all 19 environmental layers of the current (1950-present) temperature and precipitation conditions, as well as altitude above sea level, from the BioClim dataset with spatial resolutions of 30 arc-seconds (Hijmans et al., 2005). We also included seven topsoil (0-30 cm) layers from the Unified North American Soil Maps (UNASM) with resolutions of 15 arc-minutes (Liu et al., 2013). While the low resolution of the UNASM layers cannot convey the fine-scale mosaics potentially formed by various soil types, the layers do provide broad soil information that can be incorporated into the ENM; such abiotic variables are not typically incorporated into regional niche models. The BioClim and UNASM layers were checked for pairwise correlations across North America using ENMTools (Warren et al., 2010). Eight of the 19 BioClim layers (Altitude above sea level, Mean Diurnal Range, Temperature Annual Range, Mean Temperature of Wettest Quarter, Mean Temperature of Warmest Quarter, Annual Precipitation, Precipitation Seasonality, and Precipitation of Warmest Quarter) and all seven topsoil layers (sand content, silt content, clay content, gravel content, organic carbon content, pH, and cation exchange capacity) were retained because their pairwise correlation coefficients had absolute values below 0.75. For each polyploid-parental diploid study system, the layers were trimmed around the projected coordinates of the specimens from that system and converted to ASCII format in ArcGIS v. 10 (Environmental Systems Research Institute, Redlands, CA, USA) to encompass the occurrence records. Due to the differing resolutions between the two layer types, the cell size of the UNASM layers was made to match that of the BioClim layers during trimming in ArcGIS using the Raster Analysis tool. Maxent uses randomly sampled background points in the formation of the models; therefore, it is necessary to fit the layers to the natural range of the species so as to avoid the testing of background sample points from an unnatural distribution (Merow et al., 2013). If the ENM expanded outside the trimmed layers, the layers were retrimmed and expanded, and the models were run again to incorporate the predicted range.
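The correlation screening of layers can be sketched as follows. This is a minimal illustration, not the ENMTools procedure: it assumes each layer has been flattened to a list of cell values, uses Pearson's r, and keeps layers greedily in the order given, because the text does not say which member of a correlated pair was dropped. Function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_correlated(layers: Dict[str, Sequence[float]],
                      threshold: float = 0.75) -> List[str]:
    """Greedy screen: keep a layer only if |r| with every kept layer is below threshold.

    The result depends on the order of `layers` (an assumption of this sketch).
    """
    kept: List[str] = []
    for name, values in layers.items():
        if all(abs(pearson(values, layers[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy cell values for three hypothetical layers.
toy_layers = {
    "layer_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "layer_b": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly collinear with layer_a
    "layer_c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with layer_a
}
retained = filter_correlated(toy_layers)
```

In practice the order of candidate layers would be chosen deliberately, for example by placing the variables judged most biologically meaningful first, since a greedy screen favors earlier entries.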

Niche Overlap and Breadth

Ecological niche models were produced for each of the species examined using the aggregated occurrence data and the 15 abiotic layers in Maxent v. 3.3 (Phillips et al., 2006). We ran 10 replicates for each species, using 75% of the occurrence records to calibrate the model and 25% to test it (Phillips et al., 2006). The average predicted model values for each species were used for subsequent analyses. Each model was assessed with the area under the receiver operating characteristic curve (AUC) (Hanley & McNeil, 1982), which represents a measure of the model’s ability to discriminate between suitable and unsuitable area (Anderson & Gonzalez Jr., 2011). The AUC value ranges from zero (cannot discriminate) to one (perfect differentiation of suitable and unsuitable) with 0.5 representing a model that is no better than chance. The predicted models were also compared by eye to distribution literature given in the Flora of North America (www.efloras.org). Models that had poor (< 0.75) AUC scores or that varied substantially from the recorded distributions were discarded.
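The calibration/test split and the AUC statistic can be sketched in a few lines. This is an illustration under stated assumptions rather than Maxent's internal code: the split is a simple seeded shuffle, and AUC is computed in its rank-based form as the probability that a randomly chosen test presence receives a higher suitability score than a randomly chosen background point, with ties counted as one half. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_records(records: Sequence, train_fraction: float = 0.75,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle records with a fixed seed and split into calibration and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def auc(presence_scores: Sequence[float],
        background_scores: Sequence[float]) -> float:
    """Rank-based AUC: P(presence score > background score), ties count 0.5."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# One replicate on toy data: 100 hypothetical record identifiers.
train, test = split_records(range(100))
example_auc = auc([0.9, 0.4], [0.5, 0.1])  # three of four pairs ranked correctly
```

Repeating `split_records` with ten different seeds and averaging the resulting predictions would mirror the ten-replicate design described above; a replicate whose AUC fell below 0.75 would be flagged under the stated cutoff.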

Pairwise niche overlap was calculated between the mean models of the three species per system using Schoener’s D similarity index implemented in ENMTools (Warren et al., 2010; Broennimann et al., 2012). Schoener’s D has been widely used for comparing the habitat suitability of two species’ geographic ranges, with the index ranging from 0, indicating no niche overlap, to 1, indicating identical niches (Schoener, 1968; Warren et al., 2008a). In addition, Levins’ inverse concentration measure of niche breadth for each species was calculated in ENMTools (Levins, 1968; Warren et al., 2010). Niche breadth averages the suitability score per cell based on the ENM results, providing a means of comparing the abiotic coverage of different species.
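Both indices can be written compactly once each model's suitability scores are normalized to sum to one across grid cells. The sketch below uses the standard definitions, D = 1 - 0.5 * sum(|p_x,i - p_y,i|) for Schoener's D and B = 1 / sum(p_i^2) for Levins' inverse concentration; the rescaling of B to the 0-1 range is a common convention and is an assumption here about how the values were reported. Function names are illustrative.

```python
from typing import List, Sequence


def normalize(suitability: Sequence[float]) -> List[float]:
    """Rescale suitability scores so that they sum to 1 across cells."""
    total = sum(suitability)
    return [v / total for v in suitability]


def schoeners_d(suit_x: Sequence[float], suit_y: Sequence[float]) -> float:
    """Schoener's D: 1 minus half the summed absolute differences."""
    px, py = normalize(suit_x), normalize(suit_y)
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(px, py))


def levins_breadth(suitability: Sequence[float]) -> float:
    """Levins' inverse concentration: 1 / sum(p_i^2); ranges from 1 to n cells."""
    p = normalize(suitability)
    return 1.0 / sum(v * v for v in p)


def levins_breadth_standardized(suitability: Sequence[float]) -> float:
    """Breadth rescaled to 0-1 (assumed convention): (B - 1) / (n - 1)."""
    n = len(suitability)
    return (levins_breadth(suitability) - 1.0) / (n - 1)


# Toy four-cell landscapes.
identical = schoeners_d([1, 1, 1, 1], [2, 2, 2, 2])   # same relative suitability
disjoint = schoeners_d([1, 0, 0, 0], [0, 0, 0, 1])    # no shared cells
broad = levins_breadth_standardized([1, 1, 1, 1])     # suitability spread evenly
narrow = levins_breadth_standardized([1, 0, 0, 0])    # suitability in one cell
```

Because both indices work on normalized scores, they compare the relative distribution of suitability across cells rather than its absolute magnitude.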

Statistical Analyses

The 15 abiotic layers were used as loading factors to quantify niche overlap and divergence in multivariate space with principal component analysis (PCA) using JMP v. 12 (SAS Institute, Inc.) (Theodoridis et al., 2013). The abiotic factor values for each georeferenced occurrence were extracted using the “Point sampling tool” plugin in QGIS v. 2.6.1 (QGIS Core Development Team 2014). PCAs were run for each polyploid system with the three species distinguished (Theodoridis et al., 2013).

We categorized four distinct patterns of niche shift: niche contraction, niche expansion, niche intermediacy, and niche novelty. Categorization was based on the results of the PCAs and ENM analyses of niche overlap and breadth. We defined niche contraction as the occupation of a subset of the progenitors’ niches, with niche breadth values below those of both progenitors individually and high niche overlap (Schoener’s D) values with both of the progenitors. High niche overlap values were considered to be any D values above the average progenitor-progenitor overlap value. Niche expansion was defined as the occupation of a niche overlying and expanding beyond the progenitors’ niches, with niche breadth values higher than those of both progenitors and high niche overlap with the progenitors. Niche intermediacy was defined as the polyploid’s occupation of a niche intermediate between the two progenitors, having niche breadth values between those of the progenitors and high niche overlap with both of the progenitors. Finally, niche novelty was the occupation by the polyploid of a niche largely divergent from that of the progenitors in abiotic preference, having a minimal pairwise niche overlap value, below the diploid-diploid average, with both progenitors.
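The four definitions above can be summarized as a simple decision rule. The Python sketch below encodes only the breadth and overlap criteria; the categorization in this study also drew on the PCA results, which are not represented here, so systems with mixed overlap values are returned as "unresolved". The function name, the breadth values in the example, and the use of a single numeric cutoff are all assumptions made for illustration.

```python
def categorize_niche_shift(breadth_poly, breadth_p1, breadth_p2,
                           overlap_p1, overlap_p2, high_overlap_cutoff):
    """Assign a niche shift category from niche breadth and Schoener's D.

    high_overlap_cutoff is the average progenitor-progenitor overlap;
    overlap values above it are treated as high.
    """
    high1 = overlap_p1 > high_overlap_cutoff
    high2 = overlap_p2 > high_overlap_cutoff
    narrow, broad = sorted((breadth_p1, breadth_p2))

    if not high1 and not high2:
        # Low overlap with both progenitors: divergent niche.
        return "novelty"
    if high1 and high2:
        if breadth_poly < narrow:
            return "contraction"
        if breadth_poly > broad:
            return "expansion"
        return "intermediacy"
    # High overlap with only one progenitor: breadth and overlap alone
    # do not decide the category in this simplified rule.
    return "unresolved"


if __name__ == "__main__":
    # Overlap values follow those reported in the text; breadth values
    # are hypothetical placeholders.
    print(categorize_niche_shift(0.05, 0.10, 0.20, 0.77, 0.59, 0.30))
    print(categorize_niche_shift(0.02, 0.10, 0.20, 0.08, 0.04, 0.30))
```

Returning "unresolved" rather than forcing a category makes explicit where the multivariate results would be needed to reach a decision.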

Results

More than 50 allopolyploid species were identified from the literature and checked for well-documented, unambiguous parentage. However, only 14 allopolyploid-progenitor systems met all of our criteria for subsequent analyses. Eleven systems were fern species, and only three were angiosperm species, with their parentage in all cases confirmed by allozyme, restriction site variability, and/or genomic in situ hybridization analyses (Table A-1). The most stringent criterion was that of unambiguous parentage. Presumably because of greater genetic divergence among congeneric diploid species of ferns than angiosperms (Haufler, 1987) and the concomitant greater ability to discern polyploid parentage, more ferns than angiosperms were included in our data set. Although many allopolyploid angiosperm species have been reported from North America (see, e.g. Index to Plant Chromosome Numbers, http://www.tropicos.org/Project/IPCN), the parentage is certain using our stringent criteria for only Iris versicolor, Spiranthes diluvialis, and Stebbinsoseris heterocarpa. In many other cases, the allopolyploids are members of species complexes, and either one or both parents are equivocal.

In total, 39,657 georeferenced occurrence points for the allopolyploid and progenitor species were retrieved from the digitized data collections. After data cleaning for acceptable precision and the removal of duplicates, the total number of occurrence points was reduced to 16,247, ranging from 14 for Cystopteris utahensis to 2,266 for Asplenium platyneuron. The Cystopteris utahensis system was removed following inspection of the predicted model because of the polyploid’s vastly overestimated predicted range compared to the range described in the literature, likely a result of too few georeferenced data points. The remainder of the models had a mean AUC score of 0.92 and distributions similar to those previously recorded in the Flora of North America and were therefore used for this study.

The ecological niche models of the polyploids showed a wide variety of geographical patterns relative to those of their progenitor species. Predicted distributions ranged from complete sympatry in the Cystopteris tennesseensis (Figure 2-1) and Polypodium hesperium (Figure 2-2) systems to intermediacy between the two progenitors in the Dryopteris celsa system (Figure 2-3) and nearly complete allopatry, as in Polypodium saximontanum and its progenitors (Figure 2-4). Similarly, the niche breadth analyses also varied widely, from polyploids occupying breadths three times larger (Polypodium hesperium) to eight times smaller (Polypodium saximontanum) than those of their progenitors (Figure A-1). Polystichum scopulinum and Polypodium hesperium were the only polyploids investigated to encompass niches broader than that of either progenitor, while three of the polyploid-diploid systems had polyploid niche breadths smaller than that of each progenitor.

In seven of the 13 systems, the niche overlap between the polyploid and both progenitors was higher than the overlap between the two diploids (Figure A-2). The Polypodium hesperium and systems both had greater niche overlap between the diploid progenitors than between the polyploids and either progenitor. Across the systems, niche overlap between polyploid and progenitor ranged from 0.04 to 0.77, with a mean of 0.42, compared with 0.03 to 0.66 and a mean of 0.30 between the diploid parents.

Each polyploid-diploid progenitor system was categorized according to its niche shift pattern (Table 2-1). Cystopteris tennesseensis and Stebbinsoseris heterocarpa had abiotic preferences within those of their progenitors and occupied subsets of their progenitors’ distributions, categorizing them as niche contractions (e.g. Figure 2-1). The niches of the polyploids Polypodium hesperium and Polystichum scopulinum overlapped and expanded well beyond those of their progenitors, classifying these systems as niche expansions (e.g. Figure 2-2).

Niche intermediacy was the most common ecogeographic pattern found among the systems investigated here. Dryopteris celsa, Asplenium pinnatifidum, Polypodium calirhiza, Polypodium virginianum, Spiranthes diluvialis, Polystichum californicum, Asplenium bradleyi, and Iris versicolor all occupied niches intermediate to those of their progenitors in terms of breadth and abiotic tolerances and were substantially sympatric with at least one progenitor (e.g. Figure 2-3). Polypodium saximontanum was the sole polyploid-diploid progenitor system that possessed a novel niche, exhibiting entirely different abiotic tolerances compared to P. amorphum and P. sibiricum, as well as allopatric distributions and nearly no niche overlap (D = 0.08, 0.04, respectively) (Figure 2-4).

Discussion

In this study we addressed the fundamental question of where allopolyploids are distributed in ecological niche space relative to their diploid progenitors. Although this question has been explored for decades, the broad phylogenetic scope, vast amount of data, and direct polyploid-progenitor comparisons used here provide a novel dataset and categorization scheme for approaching polyploid-progenitor niche patterns.

Polyploidy has contributed to a considerable proportion of land plant diversity, underlying ~15% and 30% of speciation events in angiosperms and ferns, respectively (Wood et al., 2009). While the frequency of polyploid formation in land plants has been found to be surprisingly high when taking evolutionary time spans into account (Ramsey & Schemske, 1998), the relatively low rate of neopolyploid establishment may reflect the diverse factors that affect young polyploids and hinder survival. Neoallopolyploids arise in areas of overlap between two congeneric species and are parapatric or sympatric with at least one of the progenitors (Levin, 2013). Due to their extremely close phylogenetic relationship and near geographic proximity, polyploids and their progenitors will typically have similar physiological, morphological, and ecological characteristics and preferences. Polyploids are therefore not apt to outcompete their already established progenitors and can only establish if they differentiate ecologically by undergoing a shift in ecology and/or geography from that of their progenitor(s) (Levin, 1975, 2003; Fowler & Levin, 1984). This shift has long been hypothesized to mean a broad geographic change in the distribution of a polyploid, such as towards a more northern region (in the Northern Hemisphere), higher elevation, or into a recently deglaciated habitat, perhaps due to genetic attributes conferred by fixed heterozygosity and/or increased genomic diversity (reviewed in Stebbins, 1950; Levin, 1975, 2000; Grant, 1981; Brochmann et al., 2004). Wider abiotic tolerance (niche breadth) has also been proposed as a means of ecological shifting by allopolyploids, as would be expected from the merger of two distinct genomes and fixed heterozygosity (Ehrendorfer, 1980; Levin, 2000; te Beest et al., 2012). Our study suggests that while extensive niche changes between polyploids and their progenitors can and do occur, subtler niche changes or intermediate geographic shifts are frequent as well.

Perhaps the most surprising of the niche shift patterns unveiled by this study was the degree of niche overlap and geographic sympatry between polyploids and progenitors. Every allopolyploid studied, except Polypodium saximontanum, had a high overlap (D > 0.30) with at least one of its progenitors. In Cystopteris tennesseensis, the polyploid occupied a subset of the abiotic niche and geographical distribution of both of its diploid progenitors, C. protrusa and C. bulbifera, across the eastern United States, where the three species occupy very similar niches (D = 0.77, D = 0.59, respectively). Haufler et al. (1990) proposed that C. tennesseensis was a relatively recent allopolyploid based on its highly sympatric distribution with its progenitors, as well as completely additive enzyme expression with few novel alleles. Although it is clear from its broad geographic distribution that C. tennesseensis is an established allopolyploid, it is possible that the range of this species is still shifting away from those of the progenitors or that the polyploid occupies a niche based on factors unaddressed in the models employed, such as phenology, microclimate, or biotic variables not analyzed here. Niche contraction of the polyploid relative to its diploid parents has also been found in Houstonia longifolia, Houstonia purpurea, Primula halleri, and Melampodium cinereum using similar niche modeling and multivariate analyses to those used here (Glennon et al., 2011, 2014; Theodoridis et al., 2013).

The high amount of niche overlap between the polyploids and their progenitors suggests more subtle ecological factors must be investigated in these systems to clarify how the polyploid has avoided or reduced competition with its already established progenitors. Ecological niche models of diploid and autotetraploid Heuchera cylindrica found that the predicted distributions considerably overlapped; however, closer inspection of the populations via flow cytometry revealed the cytotypes to be nearly completely allopatric (Godsoe et al., 2013). These results can provide evidence for geographic partitioning via competition or imply that factors other than those incorporated in the models are limiting the actual distributions of these two cytotypes. Incorporating the impact of biotic factors on the distributions and niche shift of polyploids and their progenitors, Thompson et al. (2004) noted that sympatric populations of diploid and tetraploid plants of Heuchera grossulariifolia undergo distinct levels of herbivory from the moth Greya politella and differ in pollinator assemblages. Such small alterations in biotic interactions can have profound effects on the general ecology of polyploids relative to their parents, but will not be detected via the approaches employed in this study.

The distributions of both Polypodium hesperium and Polystichum scopulinum were found to considerably overlap those of their progenitors, with similar but wider abiotic niches that expanded well beyond those of the progenitors. While some hypotheses predict broader niche breadth in allopolyploids relative to their progenitors, allowing the inhabitation of harsher niches, the co-occurrence of the progenitors and polyploids is rather unexpected. McIntyre (2012) found similar patterns with hexaploids of Claytonia parviflora and C. perfoliata relative to the diploid cytotypes. The hexaploids had 37% and 16% increases in niche breadth relative to their diploid parents, respectively, as well as high levels of niche overlap and niche similarity. These cases may provide excellent resources for investigating the transgressive properties of polyploidy from an ecological standpoint, a subject of growing importance as the connection between polyploidy and invasiveness strengthens (te Beest et al., 2012; Soltis et al., 2014b).

Niche intermediacy has long been hypothesized to be the obvious outcome of allopolyploid evolution (Stebbins, 1950; Grant, 1981; Bayer et al., 1991), and our results for eight of our 13 polyploid-diploid systems support this “fill-in” hypothesis. In the Dryopteris celsa system, the niche of the polyploid slightly overlapped that of its progenitor, D. ludoviciana, to the north (D = 0.19) and overlapped considerably that of the other progenitor, D. goldiana, to the south (D = 0.47), while the progenitors themselves had negligible overlap (D = 0.03). In terms of abiotic tolerances, D. celsa clearly occupied a niche between the tolerances of the two progenitors (Figure 2-3). Similar parental biases in terms of niche overlap were observed in six of the eight systems categorized as niche intermediacy (Spiranthes diluvialis, Iris versicolor, Stebbinsoseris heterocarpa, Dryopteris celsa, , and Polypodium calirhiza) (Figure A-1). These systems stray from the null hypothesis of a symmetrical niche overlap between the two progenitors that could be presumed from an additive genomic contribution from each progenitor. Nor were the biases associated with the niche breadth of the progenitors, as in the case of Polypodium calirhiza, which was biased towards P. californicum (niche breadth = 0.03) rather than P. glycyrrhiza (niche breadth = 0.09). However, recent genetic and genomic studies of polyploidy have demonstrated that progenitor contributions to a polyploid are rarely equal and can be biased at the genome, gene, or expression level (Leitch et al., 2008; Buggs et al., 2012; Yoo et al., 2014), sometimes due to the progenitor’s means of inheritance (maternal versus paternal). Thus, these ecological results could reveal an underlying progenitor genomic bias in the polyploid, although the incorporation of genomic data is essential.

The proposed importance of niche intermediacy has a long history in the study of polyploids. For example, Clausen et al. (1945) similarly found Madia citrigracilis to be an ecogeographical intermediate of its putative progenitors, M. gracilis and M. citriodora, occupying a highly limited range in the northern Sierra Nevada bordered by its two parents. Clausen et al. (1945) proposed that the polyploid is of relatively recent origin or is perhaps incapable of competing and expanding into the niche of its progenitors, a viewpoint historically associated with allopolyploids as “fill-in” taxa. However, unlike most researchers from that time, Clausen et al. (1945) recognized the wide variability of possible ecogeographical shifts that can occur via polyploidy, as demonstrated by this study.

The final category of niche shifts found in this study is the occupation of an entirely novel niche. Polypodium saximontanum was the only polyploid that clearly demonstrated niche novelty, as it occupied a distribution in the Rocky Mountains of the USA while P. amorphum was found entirely in the Pacific Northwest and P. sibiricum throughout the Canadian subarctic (Figure 2-4). In terms of niche overlap, the polyploid was clearly distinguishable from its progenitors when statistically visualized and had Schoener’s D values below 0.1 for both of the pairwise comparisons with its progenitors (Figure A-2). Altitude above sea level was the most influential factor in determining the predicted niche model of P. saximontanum and was important as a factor loading in the two principal components that together explained 46% of the system’s abiotic variability. Although additional studies should investigate the ecology of this system more closely, altitude above sea level could have played a large role in the ecological differentiation and niche shift of the polyploid from its progenitors.

A similar study of four related polyploid and diploid species of Primula sect. Aleuritia in Europe found that Primula scotica (6x) occupied a distinct climatic niche compared to its progenitors (P. farinosa and P. halleri) with virtually no niche overlap (Theodoridis et al., 2013). The niche of P. scotica was found to have a higher minimum temperature and stable temperature range and a substantially lower niche breadth than its putative progenitors. Factors such as soil pH, glaciation patterns, and/or self-compatibility were proposed to explain the distinct niche of P. scotica. Additional experimentation is needed to discern the ecological factors that caused the niche shift of this polyploid from its progenitors. The autotetraploid Tolmiea menziesii has also been found to occupy a novel niche compared to its diploid progenitor, T. diplomenziesii, putatively a result of differing moisture optima, which is supported by concurrent physiological investigations (Visger et al., 2016). As with each of these polyploid-progenitor systems, physiological, reciprocal transplant, common garden, genetic, and synthetic polyploidization studies could provide enormous insight into the role of polyploidy and the evolution of novelty from the genetic to ecological levels (Soltis et al., 2010; Madlung, 2013).

ENM and associated analyses provide baseline results for understanding where a species or population is distributed and the abiotic factors limiting that distribution; however, these results are entirely limited by the abiotic factors incorporated into the models. While factors such as biotic interactions can drastically differentiate the realized niche from the modeled niche, the resolution of the incorporated abiotic factors can also influence ENMs. For example, Leempoel et al. (2015) found that spatial resolution had a significant effect on the strength and accuracy of their digital elevation model (DEM) derived variables, altering the ecological relevance of these commonly used factors. As abiotic patterns at both the regional and local levels influence an organism’s distribution, future studies of these systems should incorporate multi-scale resolution. Additionally, the ages of the allopolyploids considered here and in other similar studies are entirely unknown; thus, the interplay of the immediate effects of allopolyploidy and subsequent evolution following speciation cannot be determined.

Although our understanding of polyploidy at the genetic and genomic levels has been rapidly increasing over the past decade, that understanding has also been taxonomically biased and largely restricted to a few model systems (Soltis et al., 2014a; Yoo et al., 2014). In contrast, ecological investigations of polyploidy are expanding in taxonomic breadth with the integration of ENM from digitized specimen data. However, there still exists a wide disparity in sampling between the genetic/genomic polyploid “model” systems and that of the ecological “model” systems. We hope that this and similar studies reveal the variety of ecological outcomes that can arise via polyploidy, and we encourage further sampling of polyploid systems for formulating testable hypotheses. Most critical is discerning the immediate effects of polyploidy from subsequent evolution, which could be gleaned via the study of recent natural polyploids (Tragopogon, Spartina, Senecio), as well as synthetic polyploids. With these hypotheses, we can more directly investigate the genetic and physiological effects of polyploidy and subsequent evolution that led to these diverse ecological outcomes.

Table 2-1. Niche categorization of polyploids relative to their progenitors.

Niche Contraction            Niche Expansion          Niche Intermediacy         Niche Novelty
Cystopteris tennesseensis    Polypodium hesperium     Dryopteris celsa           Polypodium saximontanum
Stebbinsoseris heterocarpa   Polystichum scopulinum   Asplenium pinnatifidum
                                                      Polypodium calirhiza
                                                      Polypodium virginianum
                                                      Spiranthes diluvialis
                                                      Polystichum californicum
                                                      Asplenium bradleyi
                                                      Iris versicolor

Figure 2-1. Predicted niches and statistical differentiation of niche contraction in the polyploid complex, Cystopteris tennesseensis. Panels (A), (B), and (C) represent the ecological niche models of , Cystopteris tennesseensis, and , respectively. Increasing color darkness indicates higher predicted niche suitability. (D) is a scatterplot of principal components grouped by the three species, and (E) is a biplot of the relative importance and direction of the abiotic variables (alt: altitude above sea level; bio2: mean diurnal range; bio7: temperature annual range; bio8: mean temperature of wettest quarter; bio10: mean temperature of warmest quarter; bio12: annual precipitation; bio15: precipitation seasonality; bio18: precipitation of warmest quarter; t_sand: sand content; t_silt: silt content; t_clay: clay content; t_gravel: gravel content; t_oc: organic carbon content; t_ph: pH; and t_cec: cation exchange capacity) between the two principal components from (D).

Figure 2-2. Predicted niches and statistical differentiation of niche expansion in the polyploid complex, Polypodium hesperium. Panels (A), (B), and (C) represent the ecological niche models of Polypodium amorphum, Polypodium hesperium, and , respectively. Increasing color darkness indicates higher predicted niche suitability. (D) is a scatterplot of principal components grouped by the three species and (E) is a biplot of the relative importance and direction of the abiotic variables (alt: altitude above sea level; bio2: mean diurnal range; bio7: temperature annual range; bio8: mean temperature of wettest quarter; bio10: mean temperature of warmest quarter; bio12: annual precipitation; bio15: precipitation seasonality; bio18: precipitation of warmest quarter; t_sand: sand content; t_silt: silt content; t_clay: clay content; t_gravel: gravel content; t_oc: organic carbon content; t_ph: pH; and t_cec: cation exchange capacity) among the two principal components from (D).

Figure 2-3. Predicted niches and statistical differentiation of niche intermediacy in the polyploid complex, Dryopteris celsa. Panels (A), (B), and (C) represent the ecological niche models of Dryopteris ludoviciana, Dryopteris celsa, and Dryopteris goldiana, respectively. Increasing color darkness indicates higher predicted niche suitability. (D) is a scatterplot of principal components grouped by the three species and (E) is a biplot of the relative importance and direction of the abiotic variables (alt: altitude above sea level; bio2: mean diurnal range; bio7: temperature annual range; bio8: mean temperature of wettest quarter; bio10: mean temperature of warmest quarter; bio12: annual precipitation; bio15: precipitation seasonality; bio18: precipitation of warmest quarter; t_sand: sand content; t_silt: silt content; t_clay: clay content; t_gravel: gravel content; t_oc: organic carbon content; t_ph: pH; and t_cec: cation exchange capacity) among the two principal components from (D).

Figure 2-4. Predicted niches and statistical differentiation of niche novelty in the polyploid complex, Polypodium saximontanum. Panels (A), (B), and (C) represent the ecological niche models of Polypodium sibiricum, Polypodium saximontanum, and Polypodium amorphum, respectively. Increasing color darkness indicates higher predicted niche suitability. (D) is a scatterplot of principal components grouped by the three species and (E) is a biplot of the relative importance and direction of the abiotic variables (alt: altitude above sea level; bio2: mean diurnal range; bio7: temperature annual range; bio8: mean temperature of wettest quarter; bio10: mean temperature of warmest quarter; bio12: annual precipitation; bio15: precipitation seasonality; bio18: precipitation of warmest quarter; t_sand: sand content; t_silt: silt content; t_clay: clay content; t_gravel: gravel content; t_oc: organic carbon content; t_ph: pH; and t_cec: cation exchange capacity) among the two principal components from (D).

CHAPTER 3 GENOME EVOLUTION IN PLANTS

Background

One of the most diverse and ecologically invaluable groups of life, land plants (Embryophyta) are key to understanding the processes underlying the evolutionary genomics of Earth’s diversity. In size alone, land plant genomes vary dramatically, ranging from ~60 million base pairs in the carnivorous plant Genlisea margaretae to 148 billion base pairs in Paris japonica (Bennett & Leitch, 2012), a roughly 2,400-fold difference. Similarly, the numbers of chromosomes into which plant genomes are packaged also fluctuate drastically, ranging from 2 to 1,440 chromosomes in diploid cells; furthermore, this variation in chromosome number is not always concordant with genome size (Rice et al., 2015). This enormous variation in genome size and chromosome number among green plants has intrigued researchers for decades, especially given that the number of nuclear genes in any organism is consistently 20,000–40,000 (Sterck et al., 2007). Understanding what components, in addition to genes, compose the vast space of a plant genome and the processes that affect the structure and function of a genome is an ongoing major goal of plant biology. Fortunately, recent advances in DNA and RNA sequencing and bioinformatics have permitted many new insights into the genomic foundation of plants.

The publication of the first entire plant genome, Arabidopsis thaliana (Arabidopsis Genome Initiative, 2000) catalyzed investigations at the whole-genome scale, elucidating the mechanisms underlying genetic variation among plants at the nucleotide level (Le et al., 2000; Arabidopsis Genome Initiative, 2000; Vision et al., 2000). While mechanisms responsible for genetic and genomic diversity are constantly emerging, the following processes have been found to play substantial roles in plant genome evolution.

This work was previously published in the Encyclopedia of Life Sciences, September 2016.

Polyploidy

Gene duplication is a major evolutionary force, providing an expanded repertoire of genetic material that may ultimately result in novelty at the level of gene function or expression (Ohno, 1999; Crow & Wagner, 2006). Duplicated genes have four possible fates: a) nonfunctionalization, in which one of the gene copies is physically lost or pseudogenized; b) subfunctionalization, in which the two copies each preserve a portion of the ancestral gene’s function; c) neofunctionalization, in which one gene copy gains a new beneficial function; and d) retention, in which both genes are maintained without alteration (Force et al., 1999; Lynch & Conery, 2000). Gene duplications range from single gene duplicates to tandem duplications spanning multiple genes and regulatory regions to whole-genome duplication (WGD), also known as polyploidy.

WGD can occur via intraspecific genome duplication (autopolyploidy) or through the interspecific combination of genomes via the collective processes of hybridization and genome duplication (allopolyploidy) (Figure 3-1). Just as single gene duplications can provide the genetic material for evolution to generate new functions or expression patterns, WGD provides an entire duplicated genome where the diverse processes of nonfunctionalization, subfunctionalization, and neofunctionalization can occur at individual loci or across entire genetic pathways (Soltis & Soltis, 1999b; Conant et al., 2014; Selmecki et al., 2015). As a result, WGD has been proposed as a major driver of biodiversity throughout eukaryotic life (e.g. vertebrates: McLysaght et al., 2002; fungi: Albertin & Marullo, 2012; teleost fishes: Braasch & Postlethwait, 2012; amphibians: Evans et al., 2012); however, it has been most prevalent and extensively studied in flowering plants (angiosperms) (Soltis et al., 2009; Yoo et al., 2014; Soltis & Soltis, 2016).

Historically, WGD events in plants were inferred from chromosome numbers. This methodology led to estimates that 30-80% of extant flowering plant species were polyploid (Stebbins, 1950; Lewis, 1980). Today, diverse genetic, genomic, and cytogenetic techniques are employed to identify polyploidy events and infer their timing, such as gene family phylogenetic analyses (Jiao et al., 2011), genomic synteny analyses (Tang et al., 2008), and paralog age distribution analyses (Lynch & Conery, 2000; Barker et al., 2009). These analyses have demonstrated that even minute genomes, such as that of the model organism Arabidopsis thaliana, with only five chromosomes per haploid cell and a genome size 1/20th that of the human genome, experienced at least three ancient polyploidy (paleopolyploidy) events (Vision et al., 2000; Barker et al., 2009). Phylogenetic and genomic studies have identified WGD events shortly preceding key radiations (Soltis et al., 2009; Tank et al., 2015), such as that of the (75% of the flowering plants) (Jiao et al., 2012), monocots (Tang et al., 2010), as well as the entirety of flowering plants and seed plants (Jiao et al., 2011). WGD is estimated to be responsible for 15% of all speciation events in flowering plants and 31% in ferns and lycophytes (Wood et al., 2009). Now, rather than asking if WGD occurred in a lineage of plants, the question has become how many WGD events occurred in that lineage (Soltis et al., 2009). Even more essential is understanding how the genomes of plants are altered following the duplication of an already functioning genetic system.

To understand the genomic ramifications of WGD, evolutionary biologists are investigating polyploidy at a variety of temporal scales. Investigations of synthetic as well as naturally occurring recently formed polyploids (<150 years; neopolyploids) have revealed the young polyploid genomes to be highly dynamic. In synthetic lines and natural populations of the recently formed allopolyploids Tragopogon mirus and T. miscellus, major chromosomal rearrangements and meiotic irregularities have been found, including reciprocal and non-reciprocal aneuploidy and quadrivalent chromosome pairing (reviewed in Soltis et al., 2012). Despite only 90 years since formation (~45 generations in these biennials), an extraordinary amount of karyotype variation exists in the natural populations. Similarly, 50 independent, synthetic lines of the allopolyploid Brassica napus were formed from genetically identical gametes and analyzed for genetic and genomic changes following polyploidization (Gaeta et al., 2007; Xiong et al., 2011). After ten generations, the 50 lines displayed varying degrees of chromosomal and genetic variability similar to those in Tragopogon, including reciprocal and non-reciprocal aneuploidy, chromosomal fission and fusion, gene loss, and major rearrangements (Xiong et al., 2011). Hence, available data suggest a period of genomic variability and instability following a WGD event, producing chromosomal and genetic variants upon which evolutionary forces can then act.

Diverse patterns of gene expression may arise in polyploid organisms, furthering their genetic and phenotypic diversity. Additive expression, in which the gene product is the sum of parental expression patterns, is a predicted response for an allopolyploid. Alternatively, gene expression may be biased toward one of the parents, as a result of silencing or a reduction of the expression of the other parent’s genes. Examples of gene conversion, in which the genes from one parent are replaced by the genes of the other parent, are also becoming more prevalent (Wang et al., 2015). In addition, tissue-specific expression changes and completely novel expression patterns relative to those of the parental genomes have been observed (reviewed in Yoo et al., 2014). Genomic reorganization and changes in gene expression can directly and indirectly alter genetic pathways and regulatory mechanisms, resulting in a cascade of genomic repercussions of WGD at the phenotypic level. For example, in just five generations, the synthetic lines of B. napus were highly variable in terms of phenology, reproductive output, and morphology, with one line displaying a dwarfed phenotype (Gaeta et al., 2007).

When explored at deeper timescales, patterns of gene retention and loss following WGD emerge. Most notable is the reversion of house-keeping genes (i.e., genes responsible for DNA repair and organellar proteins), cellular life cycle genes, and photosynthesis genes to singleton status (Blanc & Wolfe, 2004; Paterson et al., 2006; Freeling, 2009; De Smet et al., 2013). These “singleton” genes are highly expressed across tissue types and are conserved in nucleotide sequence across plants. In contrast, genes for transferases, protein kinases, binding proteins, and transcription factors are more commonly retained in duplicate following WGD, permitting gene subfunctionalization and neofunctionalization (Jiao et al., 2011). It has been proposed that WGD events ca. 300 and 200 million years ago provided the genetic fodder for today’s vast diversity of seed plants via duplication and diversification of genes encoding the processes of phenology and seed germination (Jiao et al., 2011).

Evidence for multiple rounds of WGD has been found in genomes as small as that of Arabidopsis (135 million base pairs) through genetic and genomic analyses. Chromosome number and genome size therefore do not necessarily reflect polyploidy. This discrepancy may be a result of diploidization, a process that reduces genome size and chromosome numbers and alters expression patterns and inheritance over time (Figure 3-2). The mechanisms underlying diploidization are still being unraveled. For example, illegitimate recombination and unequal homologous recombination have been found to play important roles in genome size reduction via the loss of repetitive regions (Devos et al., 2002; Bennetzen et al., 2005).

By comparing the effects of polyploidy at a variety of temporal scales and using both genomic and transcriptomic sequence data, researchers are teasing apart the genomic processes that occur over time following WGD. As more genomes across the plant tree of life are sequenced, the trends and patterns of this macroevolutionary mechanism will be clarified, leading to a deeper understanding of the role of WGD in shaping plant genomes.

Transposable Elements

Transposable elements (TEs) greatly affect genome composition and evolution (Feschotte et al., 2002; Oliver & Greene, 2009). TEs are classified into two main categories: Class I, or retrotransposons, which are transcribed from DNA into RNA and then reverse-transcribed back into DNA that is inserted into the genome at a novel site; and Class II, or DNA transposons, in which transposases excise the TE and insert it at a novel location within the genome.

TEs mold and alter genomes via a wide array of direct and indirect mechanisms extending well beyond their capacity to move throughout genomes (reviewed in Feschotte et al., 2002; Feschotte, 2008). TEs can produce entirely novel genes, or neogenes (Brandt et al., 2005). Wang et al. (2006) found over 1000 genes in rice that originated from the recruitment of exons by retrotransposons. Conservative estimates suggest that around 1% of all protein-coding genes in Arabidopsis have some form of TE contribution in the coding region (Lockton & Gaut, 2009).

TEs can create variation within coding or regulatory regions, in some cases upregulating, downregulating, or silencing genes (Brosius, 2003). The transposition of regulatory DNA elements or microRNAs can drastically alter expression patterns or complete genetic networks (Naito et al., 2009). Since genetic variation can deleteriously affect gene networks and pathways, most TEs are rendered inactive via DNA methylation and histone modifications (Matzke & Mosher, 2014). Passively, methylation of TEs can influence adjacent genes as well, inadvertently silencing them. TEs can also lead to duplicated genes, exons, or DNA segments via non-allelic recombination (Jurka, 2004). They can also induce chromosomal rearrangements, leading to the loss of genetic material, through recombination (Zhang & Peterson, 2004).

TEs have been found to greatly affect genome size. In maize, ~210,000 TEs make up over 85% of the 2.5 gigabase (Gb) genome, compared to the ~35,000 protein-coding genes (Schnable et al., 2009). This enormous proportion of TEs is hypothesized to be the primary cause of “genome obesity” in grasses, as genome size correlates with TE content when polyploidy is controlled for (Piegu et al., 2006; Paterson et al., 2009). For example, while the genome of cultivated rice (Oryza sativa) is only 35% TEs, that of wild rice (O. australiensis) is 76% TEs and is twice as large as cultivated rice without the influence of WGD (Piegu et al., 2006).

Compared to most flowering plants, gymnosperms have remarkably large genomes (1C = 17.6 Gb); however, they have much lower chromosome numbers than angiosperms with an average of n = 11.9 and range of 2n = 12 to 66 (compared to 2n = 4 to 986 in angiosperms). The recent sequencing of the Norway spruce (Nystedt et al., 2013), loblolly pine (Zimin et al., 2014), and white spruce (Birol et al., 2013) genomes revealed that the low chromosome numbers of gymnosperms reflect the low frequency of polyploidy in conifers (De La Torre et al., 2014; Li et al., 2015). Instead, their large genomes are largely composed of heterogeneous long terminal-repeat retrotransposons (LTR RTs), specifically of the Ty3/Gypsy superfamily (De La Torre et al., 2014). LTR RTs are hypothesized to have consistently inflated the genomes of the gymnosperms over the last 200 million years via insertion into both the non-coding and coding regions of the genomes, producing extremely long introns relative to those of flowering plants (Nystedt et al., 2013). Whereas flowering plant genomes rapidly discard LTR RTs via unequal recombination (Devos et al., 2002; Bennetzen et al., 2005), the low recombination rate and high fidelity of homologous chromosomes of gymnosperms apparently combine to retain these TEs (Nystedt et al., 2013).

Transposable elements can have profound effects on the genes, genetic regulation, and overall genomic composition of entire lineages. As with any mutation, these effects provide the genetic novelty on which evolutionary processes can act, translating from genetic to physiological and morphological variation.

Horizontal Gene Transfer

While the transfer of genes between organisms via non-sexual processes, or horizontal gene transfer (HGT), has been widely acknowledged in prokaryotic organisms, the occurrence of HGT in plants is relatively underinvestigated. HGT can affect the genome evolution of a lineage by incorporating entirely novel genetic material from a separate lineage. In land plants, HGT is predominantly involved in the transfer of genes between mitochondrial genomes (reviewed in Richardson & Palmer, 2007; Davis & Xi, 2015), or between mitochondrial and nuclear genomes, and cases involving the plastid genome appear to be nearly nonexistent (Richardson & Palmer, 2007). In plants, HGT is typically associated with parasitism, whether host or donor, likely due to the close cellular association needed for such a symbiosis (Kim et al., 2014; Davis & Xi, 2015).

The completely sequenced mitochondrial genome of Amborella trichopoda, sister species to all other extant flowering plants, revealed that it was enormous (3.9 megabases versus 200–2,900 kilobases for most seed plants) and composed of genomic segments from a variety of different lineages, including mosses, green algae, and other flowering plants (Rice et al., 2013). In total, 197 foreign genes were identified in the mitochondrial genome of Amborella, although only 25% of these genes were complete and even fewer appeared truly functional (Rice et al., 2013). Although segmental and single gene transfers were identified, four whole mitochondrial genomes from three green algae and a moss were also clearly recognized, suggesting mitochondrial fusion as a major mechanism of HGT. In Amborella, and likely other land plants, the opportunity to undergo mitochondrial fusion between two separate organisms is hypothesized to be a result of wounding of both plants, opening a cytoplasmic thoroughfare through which the mitochondrial genomes can come into contact and potentially fuse (Stegemann & Bock, 2009; Rice et al., 2013; Gurdon et al., 2016). The totipotency of plant tissue allows the cells containing the novel mitochondrial genome to become stable and eventually give rise to reproductive organs, replicating the genomic novelty beyond the individual.

HGT may also involve the nuclear genome. A recent study showed that a nuclear photoreceptor gene, neochrome, found in a major group of ferns (, , and ) was derived from hornworts, a lineage of bryophytes, through HGT (Li et al., 2014). Neochrome is hypothesized to be a chimera of blue-sensing phototropin and red-sensing phytochrome, permitting an additive range of phototropic responses with a single molecule (Figure 3-3). This ability is hypothesized to have given an adaptive advantage to the ferns in which the gene is found, allowing these ferns to diversify in low-light conditions. The finding emerged from phylogenetic analyses of photoreceptor gene families from across green plants, which placed the gene encoding the fern neochromes among those of the hornworts rather than among other fern photoreceptor genes (Figure 3-3) (Li et al., 2014). These results strongly suggest that the fern neochrome arose through HGT from hornworts to ferns and could have played a highly significant role in the adaptive radiation of the fern lineages that inherited neochrome (Li et al., 2014).

Although HGT has been found in many more plant mitochondrial genomes than nuclear genomes, it is entirely possible that this trend is a result of sampling bias rather than biological reality. Vastly more mitochondrial genomes have been sequenced than nuclear genomes, but as sequencing technology advances and more whole nuclear genome projects emerge alongside phylogenetic analyses, cases such as that of the fern neochrome HGT could become more numerous.

Alternative Splicing

The number of protein-coding genes ranges consistently from ~20,000 to 40,000 across eukaryotic life. To reconcile the variability in eukaryote organismal complexity and lack of variation in gene content, it is now generally accepted that the process of alternative splicing (AS) provides the protein diversity found in more complex organisms. AS arises from the inclusion or exclusion of a gene’s exons and introns following pre-mRNA processing to mRNA, altering the amino acid sequence and function of the protein product following translation (Figure 3-4).

In plants so far examined, AS can be found in 40-70% of genes with multiple exons (Chamala et al., 2015). AS has been found to play roles in abiotic and biotic stress responses, growth and development, phenology, flowering, and photosynthesis (Reddy, 2007; Barbazuk et al., 2008). A recent genome-wide AS comparison of virally infected Brachypodium distachyon individuals and uninfected individuals found that around 600 genes showed altered AS patterns due to the infection, of which over 100 were immune-related genes (Mandadi & Scholthof, 2015). The authors further investigated one of the immune-response genes, SCL33, and found the splicing patterns to be very similar to the Arabidopsis ortholog, suggesting evolutionary conservation of AS patterns.

Sampling from nine species spanning a diversity of lineages from the flowering plant tree of life, Chamala et al. (2015) found that over 27,000 genes had conserved AS patterns among two or more plant species. From these analyses, the authors found that intron retention is the most common AS form among the plants sampled, rather than exon skipping, the most common form of AS in humans and most (Barbazuk et al., 2008). In addition, gene families varied in extent of AS and conservation. Serine/arginine-rich (SR) protein families, which function in spliceosome assembly and regulate AS, had high rates of AS and were largely conserved among the taxa investigated. In contrast, MADS-box protein families had very little AS, although some of those AS events were conserved (Chamala et al., 2015). While detailed investigations of conservation of AS are only now emerging among plant systems, the capacity of AS for expanding a genome’s protein repertoire without increasing genetic variation has undoubtedly played a substantial role in the evolution of plant life.

Summary

Over the past decade, our understanding of the genomic processes underlying the evolution of plants has burgeoned with the rapid advances in sequencing technology and bioinformatics. No longer are studies confined to economically important crops or genetic model organisms. Instead, biologists are now able to incorporate taxonomically divergent plant species into our evolutionary genomics studies, clarifying the importance of mechanisms such as polyploidy, transposable elements, horizontal gene transfer, and alternative splicing in producing life’s diversity. Collectively, these genomes are providing new perspectives on the forces that drive plant genome evolution and contribute to the content, structure, and organization of plant genomes.

Figure 3-1. The two types of polyploidy: allopolyploidy and autopolyploidy. Allopolyploidy is the merger and duplication of genomes from two (or more) different species. Autopolyploidy is genome doubling within a single species.

Figure 3-2. The phases of genome evolution following polyploidization to diploidization. From Soltis et al. (2016). Reproduced by permission of the Botanical Society of America.

Figure 3-3. The composition and phylogeny of fern neochrome. A) The neochrome gene is a chimera of phytochrome and phototropin genes. B) Phylogeny of neochrome and phototropin across green plants. Hypothesized HGT event is labeled. From Li et al. (2014). Reproduced by permission of the National Academy of Sciences.

Figure 3-4. Alternative splicing event types and rate in Arabidopsis, rice, maize, and humans. Adapted from Barbazuk et al. (2008).

CHAPTER 4 FINALLY FERNS: INSIGHTS INTO THE FIRST HOMOSPOROUS FERN GENOME

Background

There are more than 400,000 species of extant land plants, encompassing an array of morphological, physiological, and ecological diversity across the globe (Willis & Bachman, 2017). Accompanying this vast diversity is an extraordinary diversity in genome size (Bennett & Leitch, 2012; Rice et al., 2015), spanning a 2,500-fold change from the bladderwort Genlisea aurea (~60 Mb; Greilhuber et al., 2006) to that of the monocot Paris japonica (150 Gb; Pellicer et al., 2010). How these genomes are chromosomally partitioned also varies immensely, as land plants span a 360-fold change in chromosome number, from 2n = 4 in Haplopappus gracilis, Brachycome dichromosomatica, Zingeria biebersteiniana, and Colpodium versicolor to 2n = 1,260 in the fern Ophioglossum reticulatum, the highest number reported for any eukaryote (Ghatak, 1977; Bennett, 1998; Rice et al., 2015). Understanding the processes underlying this genomic heterogeneity has become a major area of interest among evolutionary biologists. However, sampling biases towards smaller, less complex genomes (e.g., Arabidopsis, 135 Mb, 2n = 10) and crops have pervaded plant genome projects until very recently. Broad-scale analyses of genome sizes and chromosome numbers across land plants suggest that these simpler, smaller genomes are actually atypical of most land plants. Fortunately, recent technological advances have enabled the assembly and analysis of large genomes, such as those of conifers (Birol et al., 2013; Nystedt et al., 2013; Zimin et al., 2014), providing novel insight into the processes underlying genome size and chromosomal composition.

Polyploidy, or whole-genome duplication (WGD), is the traditional explanation for the large genomes and numerous chromosomes found in many plants as it results in the complete doubling of the genome (Stebbins Jr, 1940; Grant, 1981). Just within land plants, phylogenetic and genomic studies have identified WGD events immediately preceding key radiations, such as that of the core eudicots (~70% of the flowering plants) (Jiao et al., 2012), monocots (Tang et al., 2010), and the entirety of flowering plants and seed plants (Jiao et al., 2011; Amborella Genome Project, 2013; but see Ruprecht et al., 2017). Evidence of WGD has also been found throughout eukaryotes (e.g., origin of vertebrates: McLysaght et al., 2002; fungi: Albertin and Marullo, 2012; teleost fishes: Braasch and Postlethwait, 2012; amphibians: Evans et al., 2012). In addition, it was demonstrated that even species with minute genomes, such as that of the carnivorous plant Utricularia gibba, with n = 13 and a genome size of 80 Mb, experienced at least three ancient WGD events in the last 80 million years (Carretero-Paulet et al., 2015). These discrepancies between genome size and chromosomal composition relative to WGD history have altered our understanding of genome evolution, as the question has often changed from whether or not an organism is polyploid, to how many rounds of polyploidy an organism has experienced in its recent evolutionary history. In addition, genome size and chromosome number may be poor indicators of the role of WGD in a species or lineage. Significantly, the question of the role of polyploidy has been puzzling plant biologists for decades in the only major land plant lineage lacking a reference genome, the ferns (Klekowski & Baker, 1966; Klekowski, 1972; Wagner & Wagner, 1980; Haufler & Soltis, 1986; Soltis & Soltis, 1987).

Although a few fern genomes, such as those of the heterosporous water ferns (<1% of fern diversity), are less than 250 Mb (Li et al., 2018), the average homosporous fern genome is 12 Gb, nearly five times larger than that of maize (2.5 Gb). In addition, fern genomes are typically partitioned into substantially more chromosomes than seed plants, with an average haploid chromosome number of 59 compared to 16 in flowering plants or 12 in gymnosperms (Rice et al., 2015; similar to values reported by Klekowski and Baker, 1966). As a result, longstanding hypotheses have proposed that multiple, repeated WGD events were the major causal factor contributing to the high chromosome numbers and large genomes of ferns (Klekowski & Baker, 1966; Wagner & Wagner, 1980; Haufler, 1987).

It was also hypothesized that homosporous ferns undergo intense selection favoring polyploidy to buffer against a putatively high rate of inbreeding that results from their unique life history. This life history includes a free-living haploid gametophyte phase with the potential for intragametophytic selfing (IGS), a process that can produce a completely homozygous diploid phase in a single generation and thus expose any deleterious mutations (Klekowski, 1972).

However, numerous isozyme analyses demonstrated that fern species with the lowest chromosome numbers within a given genus (ranging from n = 27 to 52) were functionally diploid, producing typical diploid numbers of isozyme alleles rather than multiple alleles as seen in truly polyploid species with multiples of these low chromosome numbers (Haufler & Soltis, 1986; Soltis, 1986; Soltis & Soltis, 1988). Despite the lack of multiple isozymes in diploid fern species, multiple copies of chlorophyll a/b-binding protein genes were discovered in the diploid Polystichum munitum, but the duplicated genes were nonfunctional (Pichersky et al., 1990). Furthermore, early population genetic investigations showed that ferns have highly variable mating systems and are typically outcrossing, refuting the hypothesized force (intense inbreeding depression via IGS) driving selection for polyploidy (Soltis & Soltis, 1992). More recently, a genetic linkage map showed that Ceratopteris has one of the highest proportions of duplicated loci among plants (76%) yet lacks large, duplicated blocks that would be indicative of polyploidy (Nakazato et al., 2006). Nonetheless, chromosome count models suggest that 31% of fern speciation events involve WGD, compared to 15% in flowering plants (Wood et al., 2009). Importantly, however, the Wood et al. (2009) estimate refers to relatively recent polyploidy events (neopolyploidy) evident from chromosome numbers rather than paleopolyploidy events in the deep evolutionary history of ferns. While repeated episodes of WGD followed by extensive silencing and rearrangement cannot be discounted as an explanation for the paradoxical genomic, genetic, and chromosomal composition of ferns, alternative mechanisms for the large genomes and high chromosome numbers must be explored.

Most notable of these alternative explanations for the large genomes of ferns is the impact of transposable elements (TEs) on genome size, as TEs can make up the majority of genome space in a variety of plant species and lineages. For example, TEs are responsible for the inflation in genome size between cultivated rice (Oryza sativa, 390 Mb) and wild rice (O. australiensis, 965 Mb) (Piegu et al., 2006). Phylogenetic reconstructions of major TE families in various plant lineages suggest bursts of TE activation result in inflated genome size (Ma & Bennetzen, 2004; Vitte & Bennetzen, 2006; Estep et al., 2013; Bennetzen & Wang, 2014). However, genome inflation does not seem to be a one-way street as unequal homologous recombination can discard repetitive regions, such as those produced by TEs (Devos et al., 2002; Hawkins et al., 2006). The large genomes of conifers (20-30 Gb) were derived via extensive expansion of ancient TEs (especially retrotransposons) and an apparent inability to shed these repetitive regions via unequal recombination, rather than through WGD (Nystedt et al., 2013). While TEs can provide an alternative explanation for the large genome sizes of ferns, as they do for conifers, they do not explain the high chromosome numbers of ferns. Outside of WGD, the most likely possibility is that ferns or even all vascular plants have ancestrally high chromosome numbers; however, aneuploidy or chromosome fission are also possible alternatives (Wagner & Wagner, 1980; Haufler & Soltis, 1986; Soltis & Soltis, 1987).

There are now over 100 flowering plant genomes, four gymnosperm genomes, one lycophyte genome, and multiple bryophyte genomes to provide critical insight into genome evolution in land plants. Surprisingly, no sequenced genome is yet available for homosporous ferns (Sessa et al., 2014b). This major information gap is made more startling when the high species diversity (>10,000 species), significant ecological roles, and economic importance of ferns are considered (Durand & Goldstein, 2001; Ellwood & Foster, 2004; Fayle et al., 2009; Paul et al., 2014; PPG I, 2016; Shukla et al., 2016). Moreover, ferns occupy a pivotal phylogenetic position as sister to seed plants and thus are key for investigating a host of both genomic and non-genomic traits and permitting a synthesis of genome evolution across seed plants.

Here we investigated the genome of the homosporous fern Ceratopteris richardii (11.25 Gb, n = 39), characterizing and classifying TE composition and documenting ancient WGD. With the publication of two heterosporous water fern genomes (Li et al., 2018), these three fern genomes will provide an evolutionary context not just for ferns but also for vascular plants and will permit deductions about ancestral genome characteristics of seed plants and ferns, as in studies of other phylogenetically pivotal lineages (e.g., platypus-mammals: Warren et al., 2008; Amborella-flowering plants: Amborella Genome Project, 2013). Specifically, this Ceratopteris genome provides critical insights into the evolutionary genomics and paradoxes of the genomically long-neglected fern clade, in addition to serving as a crucial reference for future investigations into land plant genome composition and dynamics.

Materials and Methods

Tissue Samples

Ceratopteris richardii (), or C-Fern, is a fast-growing tropical fern, used globally in research laboratories, as well as K-12 and undergraduate biology courses for studying alternation of generations in plants (see www.c-fern.org). Inbred lines and single-gene mutants are commercially available and readily produced. For this study, spores from the Hn-n inbred line were kindly donated by Dr. Leslie Hickok (University of Tennessee). The spores were germinated on Bold's (1957) nutrient media with Nitsch's (1951) micronutrients and grown following the recommended conditions in the C-Fern Manual (www.c-fern.org). Upon germination, we isolated the gametophytes to individual petri dishes and growth media. Given that C. richardii is homosporous, the gametophytes are typically bisexual and produce both antheridia and archegonia. By isolating the gametophytes before sexual maturity, we ensured that any sporophytes that did develop were a product of gametes from a single gametophyte and thus completely homozygous, i.e., doubled haploids. All of the tissue used for sequencing came from one doubled haploid genotype (Voucher: M. Whitten #5841, University of Florida Herbarium).
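The logic behind the doubled haploid can be illustrated with a minimal sketch. It assumes a haploid gametophyte represented as one allele per locus; the locus names, alleles, and function names below are hypothetical and are not part of the study. Because egg and sperm are produced mitotically by the same haploid individual, the resulting sporophyte carries two identical alleles at every locus.

```python
def intragametophytic_selfing(gametophyte):
    """Return the sporophyte genotype formed when egg and sperm both come
    from the same haploid gametophyte (one allele per locus)."""
    # Both gametes are mitotic copies of the same haploid genome,
    # so each locus receives the same allele twice.
    return {locus: (allele, allele) for locus, allele in gametophyte.items()}


def is_fully_homozygous(sporophyte):
    """True if both alleles are identical at every locus."""
    return all(a == b for a, b in sporophyte.values())


# Hypothetical three-locus haploid genotype (illustrative only).
gametophyte = {"locus1": "A", "locus2": "b", "locus3": "C"}
sporophyte = intragametophytic_selfing(gametophyte)
```

Under these assumptions, `is_fully_homozygous(sporophyte)` holds for any input genotype, which is why a single isolated gametophyte suffices to produce a completely homozygous sporophyte.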

Library Construction and Sequencing

We extracted genomic DNA (gDNA) from Ceratopteris using a modified CTAB protocol (Doyle & Doyle, 1987); the extract was quality-checked and quantified using a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA) and NanoDrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). Genomic short-read library preparation and sequencing for Ceratopteris were completed by the University of Florida’s Interdisciplinary Center for Biotechnology Research (UF ICBR). The gDNA was fragmented and size-selected for ~300 basepair (bp) inserts, and the sequencing of 150 bp paired-end (PE) reads was conducted on two runs of the Illumina NextSeq platform (Illumina, San Diego, CA, USA). Mate-pair (MP) libraries (125 bp PE, 8-10 Kbp inserts) were prepared and sequenced at the Duke Genome Sequencing and Analysis Core on two lanes of Illumina HiSeq 2000 (Illumina, San Diego, CA, USA).

To avoid assembly errors inherent to short-read sequencing, we subsampled the Ceratopteris genome by sequencing bacterial artificial chromosome (BAC) clones (Plate CR_Ba #624, Green Plant BAC Library Project, provided by Clemson University Genomics Institute) with long-read sequencing technology. We selected 34 Ceratopteris Hn-n BAC clones to be grown, pooled, purified, and sequenced using the RS II platform (Pacific Biosciences, Menlo Park, CA, USA) at the Genomics Institute. The reads were cleaned and de novo assembled using the Hierarchical Genome Assembly Process (HGAP) in the SMRT Analysis software package (Pacific Biosciences, Menlo Park, CA, USA) to produce the BAC.SubSample assembly.

We used long-read technology to acquire a high-confidence Ceratopteris transcriptome from sporophyte tissue. We extracted total RNA from sexually mature leaf tissue using the RNeasy Plant Mini kit (Qiagen, Hilden, Germany). The total RNA was size-selected for 0.8-2, 2-3, 3-5, and >5 Kbp with the SageELF (Sage Science, Beverly, MA, USA) at the UF ICBR. The libraries were prepared following the SMRTbell library protocol, and each library was sequenced on three PacBio SMRT cells (Pacific Biosciences, Menlo Park, CA, USA) at the UF ICBR.

Genome Assembly

We trimmed the raw genomic PE reads of adapters and then quality-filtered with Trimmomatic (Bolger et al., 2014), while the raw MP reads were trimmed of adapters and separated into MP, PE, and unknown reads with NxTrim (O’Connell et al., 2015). All libraries were quality-checked before and after cleaning with FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/). We divided the cleaned PE reads into 24-mers with Jellyfish (Marçais & Kingsford, 2011) and plotted their frequencies with KAT (Mapleson et al., 2016) to assess environmental contamination, organellar genome content, nuclear genome size, and repeat content. We were able to de novo assemble the complete plastome from the cleaned reads and a seed sequence of the RuBP subunit (Wolf et al., 2015) with NOVOPlasty (Dierckxsens et al., 2016) as well as contigs of the mitochondrial genome using RPS3 from Ceratopteris thalictroides as the seed sequence.
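The idea behind the k-mer frequency analysis can be shown with a short, pure-Python sketch. This is an illustration only and is not how Jellyfish or KAT are implemented: the function names are hypothetical, the reads are held in memory, and k-mers are not collapsed with their reverse complements, which dedicated counters typically offer as an option.

```python
from collections import Counter


def count_kmers(reads, k=24):
    """Count every k-mer of length k across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers observed
    that many times (the frequency histogram that is usually plotted)."""
    return Counter(counts.values())


# Toy example with k = 4 instead of 24 so the result can be checked by hand.
toy_counts = count_kmers(["ACGTACGT"], k=4)
toy_spectrum = kmer_spectrum(toy_counts)
```

In a spectrum of this kind, k-mers from single-copy nuclear sequence are expected to cluster around the sequencing depth, whereas repeats and organellar genomes appear at much higher multiplicities.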

The PE reads were assembled using Meraculous2 (Chapman et al., 2016) with 300 gigabytes of RAM for 10 days and k-mer size of 61 based on the results of KmerGenie (Chikhi & Medvedev, 2014) to produce assembly CFern v1.0. The scaffolds from the CFern v1.0 assembly were further scaffolded with the MP reads using the SSPACE assembler (Boetzer et al., 2010) to produce the final genome assembly, CFern v1.1. To compare the content of CFern v1.1 with the overall content of the cleaned reads, we divided the assembly into 24-mers with Jellyfish (Marçais & Kingsford, 2011) and compared the resulting frequencies to those of the cleaned PE reads using the compare feature of KAT (Mapleson et al., 2016). For subsequent analyses, only scaffolds over 10 Kbp were used (CFern v1.1A).
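The final length filter can be sketched as follows. This is a minimal illustration, assuming the scaffolds are stored in FASTA format; the function names are hypothetical, and the strict "longer than 10,000 bp" comparison is one reading of "over 10 Kbp".

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=10_000):
    """Keep only records whose sequence is strictly longer than min_length."""
    return [(h, s) for h, s in records if len(s) > min_length]


# Toy example with a 5 bp threshold so the result can be checked by hand.
toy_fasta = [">scaffold_a", "ACGT", "AC", ">scaffold_b", "A"]
kept = filter_by_length(read_fasta(toy_fasta), min_length=5)
```

For a real assembly, the lines would come from an open file handle rather than a list, and the default threshold of 10,000 bp would apply.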

Transcriptome Assembly

We cleaned and processed the long reads following the IsoSeq protocol (Gordon et al., 2015) in which the circular consensus sequences (CCS) were acquired from the raw reads and then classified and clustered. Only full-length, high-quality (accuracy >= 99%), polished sequences (IsoSeq.HQ) were used for analysis following the Iterative Clustering and Error correction (ICE)/Quiver algorithm. The IsoSeq.HQ sequences were further collapsed into unique isoforms and genes using both genome-based and sequence-based protocols (see below).

For the genome-based method, the IsoSeq.HQ sequences were mapped to the CFern v1.1A assembly using GMAP (parameters: -f samse -n 0 -z sense_force) (Wu & Watanabe, 2005). The SAM file output was sorted (parameters: -k 3,3 -k 4,4n), and transcripts were collapsed together (Gordon et al., 2015). We used both 98% coverage and 98% identity as our full-length mapping cutoff and then searched for partially mapped transcripts with 50% coverage and 98% identity.
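The two-tier mapping cutoff can be expressed as a small classifier. This is a sketch under stated assumptions, not the collapse script itself: the function name and category labels are hypothetical, coverage and identity are given as fractions, and the cutoffs are treated as inclusive (at least 98% or 50%), which the text does not specify.

```python
def classify_alignment(coverage, identity,
                       full_coverage=0.98, partial_coverage=0.50,
                       min_identity=0.98):
    """Assign a transcript alignment to a mapping category.

    coverage and identity are fractions between 0 and 1.
    """
    if identity < min_identity:
        return "unmapped"          # fails the identity cutoff outright
    if coverage >= full_coverage:
        return "full-length"       # 98% coverage and 98% identity
    if coverage >= partial_coverage:
        return "partial"           # 50% coverage and 98% identity
    return "unmapped"
```

For example, an alignment with 60% coverage and 98.5% identity would be treated as partially mapped, whereas one with 99% coverage but 90% identity would be discarded.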

Due to the fragmented state of the CFern v1.1A assembly, many transcripts did not map. Thus, we also used CD-Hit v4.6.4 (parameters: -c 0.99 -G 0 -aL 0.90 -AL 100 -aS 0.99 -AS 30) (Fu et al., 2012) to cluster and collapse highly similar transcripts into putative isoforms without a reference genome. We then used those sequences with the Coding Genome reconstruction tool (Cogent v2.1; Workman et al., 2018) for genome-free isoform collapse and gene identification. This pipeline divided the sequences into 30-mers and then grouped those k-mers into clusters based on pairwise distances. De Bruijn graphs of the sequences for each cluster were then used to resolve sequencing errors and alternative splicing events and output putative genes. Due to the high accuracy, full-length, de novo nature of IsoSeq and subsequent cleaning protocols, these genes served as our reference transcriptome for C. richardii (referred to as UniCFernModels).

Polyploidy

The UniCFernModels gene set was used in the DupPipe pipeline (Barker et al., 2008, 2010) to estimate the relative age of gene duplications. In short, DupPipe finds duplicate gene pairs and then estimates the divergence of these genes using the number of substitutions per synonymous site (KS). The divergence, as a substitute for timing, of these duplicated genes was plotted in a histogram, and peaks were inferred to represent synchronous gene duplications, indicative of ancient polyploidy events (Lynch & Conery, 2000; Barker et al., 2008). Genes from two other ferns, Equisetum giganteum (Vanneste et al., 2015) and Azolla filiculoides (Li et al., 2018), were similarly analyzed and plotted for comparison.
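The histogram step can be sketched as below. This is an illustration of binning KS values and flagging local maxima only; it does not reproduce DupPipe, and the function names, bin width, upper KS limit, and toy values are all assumptions chosen for the example.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=2.0):
    """Bin KS values of duplicate gene pairs into fixed-width bins."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            # min() guards against floating-point spill into a bin past the end
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


def local_peaks(counts):
    """Return indices of interior bins that exceed the bin to their left
    and are at least as large as the bin to their right."""
    return [i for i in range(1, len(counts) - 1)
            if counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]]


# Hypothetical KS values (not data from this study).
toy_ks = [0.02, 0.03, 0.52, 0.53, 0.54, 0.56, 0.62, 1.13]
toy_counts = ks_histogram(toy_ks)
toy_peaks = local_peaks(toy_counts)
```

Peaks found this way depend on the chosen bin width, which is the subjectivity that a significance-based smoothing analysis is meant to reduce.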

To reduce the subjectivity of smoothing based on varying bin sizes, we analyzed the KS values of these three ferns using the SiZer (Significance of Zero Crossings of the Derivative;

Chaudhuri and Marron, 1999) package in R v3.4.2 (R Core Team, 2013). This analysis determines whether an increase or decrease in a scatterplot or histogram is significant at α = 0.05

and plots the changes along the original x-axis with blue coloration indicating a significant increase, red a significant decrease, purple insignificance, and gray too few data points to determine.

To determine whether the three fern species examined here (Ceratopteris, Equisetum, and

Azolla), spanning over 400 million years since their most recent common ancestor (Testo &

Sundue, 2016), share any ancient polyploidy events, we clustered the predicted proteins of

Ceratopteris, Equisetum, Azolla, Amborella, Selaginella, and Physcomitrella into orthogroups using OrthoFinder (Emms & Kelly, 2015). Only orthogroups with gene representatives from all

6 species were retained. The protein sequences of each orthogroup were aligned with MAFFT

(Katoh & Standley, 2013) and the alignment converted to nucleotide alignments using the pxaa2cdn tool in Phyx (Brown et al., 2017). The alignments were stripped of highly ambiguous

(>90% missing data) columns and gene trees were produced with RAxML using 100 rapid bootstrap searches and the GTRGAMMA model of evolution (Stamatakis, 2014). These gene family trees were entered into the Multi-taxon Paleopolyploidy Search (MAPS) package (Li et al., 2015). This package first filters all of the gene family trees for subtrees that match the known species tree [here (Physcomitrella, (Selaginella, (Amborella, (Equisetum, (Ceratopteris,

Azolla)))))]. It then counts the number of subtrees with gene duplications at a specific node in the species tree relative to the number of available subtrees. A node with a high proportion of gene duplications is presumed to have a shared polyploidy event.
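The counting step described above can be sketched as follows. This is an illustration under stated assumptions rather than the MAPS code itself: each subtree is represented by a hypothetical dictionary mapping the species-tree nodes it covers to a flag indicating whether a gene duplication was inferred at that node, and the node labels are arbitrary.

```python
def duplication_proportions(subtrees):
    """Per-node proportion of available subtrees that show a gene duplication.

    subtrees: iterable of dicts {node_label: bool}; a node appears in a dict
    only if that subtree is informative (available) for the node.
    """
    available = {}
    duplicated = {}
    for subtree in subtrees:
        for node, has_duplication in subtree.items():
            available[node] = available.get(node, 0) + 1
            if has_duplication:
                duplicated[node] = duplicated.get(node, 0) + 1
    return {node: duplicated.get(node, 0) / n for node, n in available.items()}
```

A node whose proportion stands well above that of neighboring nodes would be a candidate for a shared polyploidy event; what counts as "high" is left to the comparison with published MAPS results.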

We also used a cytogenetic approach to assess WGD. We conducted fluorescent in situ hybridization (FISH) using the previously described BAC clones as probes. To produce the probes, the BAC DNA was extracted from the Escherichia coli culture and amplified by rolling circle amplification (RCA) (see Chamala et al., 2013). The RCA product was labeled by nick

translation with Cy5-dUTP and purified with a QIAquick Nucleotide Removal kit (Qiagen,

Venlo, Netherlands).

Root tips for chromosome preparations were collected in the mornings and immediately treated with pressurized nitrous oxide for 1 hour before being fixed in 3:1 ethanol (EtOH):glacial acetic acid overnight at room temperature and transferred to 70% EtOH at -20 °C for long-term storage. The root tips were then treated and chromosome squashes prepared to produce slides for in situ hybridization with the fluorescently labeled probes following Chester et al.

(2012). The BAC FISH images were taken on an AxioImager M2 microscope with an AxioCam

MR camera (Carl Zeiss AG, Oberkochen, Germany).

Repeat Characterization

We took both structural- and homology-based approaches to repeat characterization following the protocol of Campbell et al. (2014). As long terminal repeat retrotransposons (LTR

RTs) comprise a sizable proportion of most plant genomes, we used a variety of tools to characterize these repeats in the CFern v1.1A assembly. Recent LTR RTs were collected based on 90% LTR similarity using LTRharvest (parameters: -minlenltr 100 -maxlenltr 6000 -mindistltr 1500 -maxdistltr 25000 -mintsd 5 -maxtsd 5 -motif tgca -similar 90 -vic 10)

(Ellinghaus et al., 2008) from the GenomeTools package (Gremme et al., 2013). LTRdigest was then used to find elements with polypurine tracts (PPT) or primer binding sites (PBS) using the

Genomic tRNA database (Chan & Lowe, 2015). Those elements were identified and further filtered for false positive elements by removing gappy elements (>50 Ns), recent gene duplications where the flanking regions of the LTRs are alignable, and nested retrotransposon insertions using custom scripts. LTR RTs with nested DNA transposons were also identified by searching DNA transposase protein sequences with BLASTX (Camacho et al., 2009). LTR RT

exemplars were then identified based on 80% identity and 90% coverage from the filtered elements based on the internal sequences of the LTR RTs and then based on the LTR sequences.

Older LTR RTs were similarly collected but with 75% similarity among the LTR sequences and lacking the TGCA motif. To exclude more recent LTR RTs from the older LTR RT library, the younger LTR RT exemplars were used to mask and exclude elements found in the older LTR RT library with RepeatMasker (Smit et al., 2013). The two LTR RT libraries were combined

(allLTR.lib) and used as the reference library to mask the CFern v1.1A assembly with

RepeatMasker (Smit et al., 2013).

The unmasked remainder of the assembly was passed to RepeatModeler to identify de novo repeat families (Smit & Hubley, 2008). The RepeatModeler library and LTR RT library were combined, and unidentified repeats were searched against a transposase database (Kennedy et al., 2011; Smit et al., 2013) using BLASTX and identified to superfamily when possible

(Camacho et al., 2009). To ensure that fragmented plant genes were not included in the final repeat library, we queried all of our repeats with the SwissProt plant protein (Schneider et al.,

2009) and NCBI RefSeq plant protein databases using BLASTX (Camacho et al., 2009). With our clean, final repeat library, we used RepeatMasker to quantify the repeat elements throughout

CFern v1.1A.

To make direct comparisons with other plant genome assemblies of varying sizes, qualities, and lineages, we followed the same repeat annotation protocol for the genomes of

Amborella trichopoda (Amborella Genome Project, 2013), a monocot (Zea mays; Hirsch et al.,

2016), liverwort (Marchantia polymorpha; Bowman et al., 2017), lycophyte (Selaginella moellendorffii; Banks et al., 2011), and moss (Physcomitrella patens; Lang et al., 2018). We also ran the same protocol on the BAC.SubSample assembly separately.

Dating Repeat Insertion Events

We used the highly accurate but conservative LTR_Retriever package (Ou & Jiang,

2018) to identify full-length LTR RTs and date their insertion using both the CFern v1.1A and

BAC.SubSample assemblies. We provided candidate LTR RTs from LTRharvest and

LTR_FINDER using a 90% similarity minimum threshold between LTRs and the presence of the

TGCA motifs. The candidate LTR RTs were filtered, removing non-LTR RT repeat elements or those with large amounts of tandem repeats or gaps. Especially in fragmented genome assemblies, such as CFern v1.1A, these requirements greatly reduce the number of LTR RT candidates but ensure that only full-length LTR RTs are analyzed. Following filtering, the long terminal repeat regions of each transposable element were aligned, and the Jukes-Cantor model was used to estimate the divergence of the two LTR regions because these are noncoding regions after insertion. We used a mutation rate of 6.5 × 10⁻⁹ per site per year to estimate the years since insertion (Amborella Genome Project, 2013). This mutation rate is half that of rice (Ma & Bennetzen, 2004) and a broad estimate; therefore, the insertion times should only be used in reference to the general timing of insertion, rather than as exact dates.
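The dating calculation can be sketched as below. The Jukes-Cantor correction and the rate of 6.5 × 10⁻⁹ substitutions per site per year come from the text; dividing the corrected distance by twice the rate (because both LTRs of an element accumulate substitutions independently after insertion) is the conventional formulation and is assumed here rather than quoted from the LTR_Retriever documentation. The function names are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance K for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(ltr_identity, rate=6.5e-9):
    """Estimate years since insertion from the identity of the two LTRs.

    Assumes T = K / (2 * rate), with K the Jukes-Cantor distance between the
    5' and 3' LTRs and rate in substitutions per site per year.
    """
    k = jukes_cantor_distance(1 - ltr_identity)
    return k / (2 * rate)
```

Under these assumptions, an element whose LTRs are 99% identical would be dated to roughly 0.77 million years, and lower identities yield older estimates.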

Results and Discussion

Genome Sequencing and Assembly

Here we present the first homosporous fern genome, a critical resource for plant and evolutionary biology. The ability of homosporous ferns to undergo intragametophytic selfing (the production of a sporophyte from a single gametophyte) partially simplified the assembly of this difficult genome, as it made the sporophyte completely homozygous so that heterozygosity was not a concern. However, the quality of the CFern v1.1 assembly and the computational resources required to assemble and analyze it reflect the technological difficulties of working with such a large and complicated genome with no reference from a close relative.

With paired-end short-read libraries totaling ~24X coverage from 1.8 billion cleaned reads, we assembled the 11.25-Gb Ceratopteris genome into ~15 million contigs (>100 bp) or

988,403 scaffolds (>1,000 bp) (Table 4-1). We were then able to combine and reduce the number of scaffolds using 8-10 Kbp mate-pair reads (13X coverage), producing a final genome assembly

(CFern v1.1) of 626,576 scaffolds with an N50 of 16 Kbp and total length of 4.25 Gb, representing about 38% of the Ceratopteris genome. The BAC.SubSample assembly only totaled

3 Mb of the Ceratopteris genome (0.03%) but had an N50 of 97,182 bp, providing a small, but more accurate sampling of the 11.25-Gb genome. The GC content of Ceratopteris was 37.7%, very similar to that of both the gymnosperm Norway spruce (37.6%) and the flowering plant

Amborella (37.5%), yet lower than that of maize (46.9%), the liverwort Marchantia (42.0%), or the lycophyte Selaginella (45.3%). Of the six plant genomes directly compared, only the moss

Physcomitrella had a GC content considerably lower than that of Ceratopteris, at 33.7% (Table

2). In addition, we were able to assemble and circularize the 148,753 bp Ceratopteris plastid genome, as well as identify 283 mitochondrial contigs.
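For readers unfamiliar with the assembly statistics quoted above, the sketch below shows one common way to compute N50 and the fraction of a genome represented by an assembly. The function names are hypothetical, and the N50 definition used (the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size) is the conventional one; the assembler's own reporting may differ in edge cases.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def assembled_fraction(assembly_bp, genome_bp):
    """Fraction of the estimated genome size covered by the assembly length."""
    return assembly_bp / genome_bp
```

With the values reported here, `assembled_fraction(4.25e9, 11.25e9)` is about 0.38, matching the stated 38% of the Ceratopteris genome.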

Transcriptome Sequencing and Assembly

From 12 PacBio SMRT cells, we obtained 850,000 reads from which we produced

97,084 full-length, high-quality, cleaned transcripts (IsoSeq.HQ) ranging from 285 to 11,353 bp in length. When mapped onto the CFern v1.1 assembly at 98% identity and 98% coverage, the

IsoSeq.HQ transcripts were collapsed into 4,620 genes and 10,043 isoforms; however, when coverage was reduced to 50%, there were 11,924 genes and 23,278 isoforms. The 2.5-fold increase in identified genes and isoforms via reduced coverage shows that our scaffolds do not span entire genes in the majority of cases. To overcome this fragmentation and provide a set of high-confidence gene models, we implemented the Cogent genome-free protocol (Gordon et al.,

2015) to produce 18,179 gene models (UniCFernModels) from the IsoSeq.HQ transcripts (Figure

A-3).

Polyploidy

To address the decades-long question of how frequent polyploidy is in ferns, we took both sequencing and cytogenetic approaches, assessing three different temporal scopes of evolutionary history. Using paralog-age distribution analyses, we identified 1,800 paralogous gene pairs in the UniCFernModels with a KS value between 0.1 and 2.1. A minor peak around KS

= 0.3 was noticeable; however, such “early” peaks are often a result of small-scale gene duplications, not WGD. In contrast, a single major peak was revealed in the synonymous distance plot of Ceratopteris, similar to those observed in Azolla and Equisetum (Figure 4-1).

Based on the significant transition from increasing to decreasing in the SiZer plot, the

Ceratopteris peak was around KS = 1.1, compared to 0.8 in Azolla and 0.75 in Equisetum, the latter estimate matching the original results by Vanneste et al. (2015) for Equisetum. Inspection of the combined density plots of these three fern species showed considerable overlap among the three peaks (Figure 4-2).

To determine whether these three ferns (Ceratopteris, Azolla, Equisetum) shared a WGD event in their evolutionary history, we used the Multi-taxon Paleopolyploidy Search (MAPS) package by Li et al. (2015). We first recovered 10,182 orthogroups from the clustered amino acid sequences of Ceratopteris, Azolla, Equisetum, Amborella, Selaginella, and Physcomitrella. We isolated 4,836 orthogroups with amino acid sequences from all six species and estimated gene family trees for each orthogroup. Of the subtrees that fit the known fern topology, ((Ceratopteris,

Azolla), Equisetum), 34% supported a gene duplication in the most recent common ancestor

(MRCA) of Ceratopteris and Azolla and 19% of subtrees fitting the ((Ceratopteris/Azolla,

Equisetum), Amborella) topology supported a gene duplication shared across the three fern

species (Figure 4-3), relatively low proportions compared to similar studies with shared WGD events (see Li et al., 2015).

While the previously described sequencing approaches to assessing WGD are appropriate at deeper time scales, both are susceptible to missing more recent WGD events. An “early” peak in a paralog-age distribution analysis may be overlooked as small-scale duplications while

MAPS can only identify WGD events to the MRCA of the next closest taxon analyzed, in this case 280 million years since the divergence of Ceratopteris and Azolla. Our cytogenetic BAC-

FISH approach identifies recent polyploidy by localizing the BAC DNA fragments to more than two chromosomes. If the organism is diploid, only two localizations will be apparent. Similar studies of Nicotiana allopolyploids found that five million years after the WGD event, the two parental genomes were no longer distinguishable due to genome turnover (Lim et al., 2004).

However, our BAC-FISH results further corroborated our sequencing results in demonstrating a lack of WGD, i.e., we detected only two primary localizations of each BAC probe to the

Ceratopteris chromosome preparations (Figure 4-4). In a few cases, secondary localizations were found on multiple chromosomes; however, these are likely a result of repeat elements that are distributed throughout the numerous chromosomes.

These three approaches provided evidence at three distinct temporal scales for investigating WGD in the evolutionary history of Ceratopteris. Despite a genome size five times that of classically “large genome” flowering plants and with eight times more chromosomes than

Arabidopsis, which has undergone at least five WGD events (Amborella Genome Project, 2013), we found evidence for just one ancient WGD event in Ceratopteris. The diploid signal localizations of our BAC-FISH approach refute any recent WGD events that may not have been identified by the paralog-age distribution analyses. The three peaks in the paralog-age

distribution analyses of Ceratopteris, Azolla, and Equisetum overlap and thus could potentially represent a shared event before the divergence of these three ferns (Figure 4-1). However, gene family analysis shows that only a minority of subtrees support shared duplications among these three ferns, suggesting three lineage-specific WGD events rather than a shared event. Based on our analyses and the timing from Vanneste et al. (2015), the WGD of Ceratopteris is likely older than that of Equisetum (92 mya) yet younger than the MRCA of Ceratopteris and Azolla (~280 mya) (Testo & Sundue, 2016) (Figure 4-3).

The key result in these analyses is that evidence for only one ancient WGD was apparent and more recent WGD events were not detected. These results refute hypotheses of frequent recurrent WGD, followed by massive gene silencing and the slow loss of genetic material in ferns (Haufler, 2002, 2014), and instead lend credence to three non-mutually exclusive hypotheses: 1) ferns had ancestrally high chromosome numbers (Soltis & Soltis, 1987), 2) ferns underwent WGD rarely yet were unable or very slow to lose genetic material (Haufler, 1987), and 3) ferns have a high rate of aneuploidy or chromosomal fission. Ancestral reconstructions of chromosome numbers across ferns have demonstrated that the common ancestor of all ferns had a haploid chromosome number of n = 22, while many of the more diverse fern lineages had higher ancestral chromosome numbers, such as n = 30 in Pteridaceae (Clark et al., 2016).

Therefore, to reach a chromosome number of n = 39 in Ceratopteris (or n = 59, the average across all ferns), WGD events could have occurred only sporadically since the divergence of the common ancestor of ferns from that of seed plants 400 million years ago (Testo & Sundue, 2016;

Morris et al., 2018). A third possibility is that the high chromosome counts of ferns are a result of aneuploidy or chromosomal fission. To determine the cause of the high chromosome numbers of ferns more fully, comparative syntenic and phylogenomic analyses will have to be applied

across multiple fern taxa, as any analysis comparing Ceratopteris with seed plants must span over 400 million years since their most recent common ancestor (Morris et al., 2018).

Repeat Diversity

In total, ~42% of the CFern v1.1A assembly was composed of repeat elements. The

Copia LTR retrotransposons were the most prolific with over 800,000 elements making up

16.5% of the assembled genome, followed by the Gypsy LTR RT superfamily with 330,000 elements accounting for 7.5% of the genome (Table 4-2). In comparison, the Class II DNA transposons included members of 17 different superfamilies, yet totaled only 52,000 elements and <1% of the genome. The LINE retrotransposons similarly covered 1.6% of the genome across 64,000 elements. Low-complexity, satellite, and simple repeats all covered <0.5% of the genome.

The repeat content and percent coverage were considerably higher in the long-read

BAC.SubSample assembly (63%). Nearly 26% of the subsample was made up of Gypsy LTR RTs and 21.8% was Copia LTR RTs, while the LINE RTs and DNA transposons represented 3.2% and 0.16%, respectively. Low-complexity and simple repeats made up 0.2% and 2.2% of the

BAC.SubSample, respectively. The mean lengths of all of the repeat types in the

BAC.SubSample, with the exception of the DNA TEs, were more than double those of the CFern v1.1A assembly, and the Copia and Gypsy elements were nearly three times as large in the subsample compared to those of the CFern v1.1A assembly.

Previous read-based, rather than assembly-based, analyses of genome composition estimated that ~75% of the Ceratopteris genome is repetitive (Wolf et al., 2015). Those estimates are probably more accurate as they are not biased by assembly difficulties, such as the inability to assemble repetitive regions beyond the length of the reads. Because the CFern v1.1 assembly comprised only paired-end and mate-pair short reads, it is extremely limited in low-complexity, satellite, and simple repeat assembly, while the read-based analyses estimated that over 25% of the genome consists of simple repeats (Wolf et al., 2015). These limitations can also have a large effect on assembling or spanning larger repeat elements such as transposable elements. This effect is most apparent in comparing the mean lengths of the CFern v1.1A retrotransposons to those of the BAC.SubSample, which was assembled using long-read technology that can span those repetitive regions.

When directly compared to the six other land plant genome assemblies of various sizes and qualities, Ceratopteris had the fourth highest proportion of repeat elements despite more than a 100-fold difference in genome size from the smallest analyzed genome, Selaginella

(Figure 4-5). Ceratopteris and Zea had similar proportions of Copia elements (~17%), much higher than any of the other genomes analyzed. However, Ceratopteris had the second lowest proportion of Gypsy elements, topping only Marchantia (6.5%). In contrast, 35% of the Zea genome is made up of Gypsy LTR RT elements with a mean length of 2,755 bp. Despite higher overall counts of repeat elements, Ceratopteris had some of the lowest mean lengths of the repeat element types among the seven species investigated (Figure 4-6).

We assessed LTR RT richness by comparing recent LTR RT exemplars (i.e., those with

>90% LTR similarity) among the seven species (Table 2). Zea was by far the most diverse with

4,561 distinct LTR RT exemplars, followed by Physcomitrella at 1,217 exemplars. Ceratopteris was low in LTR RT diversity with only 22 exemplars, similar to that of Amborella and

Marchantia, which had 11 and 30, respectively. Ancient (75-90% LTR similarity) LTR RT richness differed greatly from recent LTR RT diversity in Zea and Physcomitrella, which had only 45 and 16 ancient exemplars, respectively (Table 2). Ceratopteris and Amborella each had more ancient than recent LTR RT exemplars with 82 and 55, respectively.

The quality of the genome assembly could have had a large effect on these interspecific comparisons of repeat diversity, number, and size, as well as genome size, as earlier demonstrated with the BAC.SubSample. For example, these six genome assemblies spanned a

10,000-fold difference in N50 lengths between that of Ceratopteris (22 Kb) and maize (217,960

Kb). Thus, in addition to the genome of Ceratopteris being many times larger than that of maize, its assembly is far more fragmented, making the identification of repeat elements more difficult and likely biasing those identified towards smaller lengths.

To investigate LTR RT activation timing, we identified 62 full-length, high-confidence

LTR RT elements in the CFern v1.1A and BAC.SubSample assemblies (Ou & Jiang, 2018). The activation timing of these LTR RTs was relatively uniform over the past 7 million years (Figure

4-7). However, we found considerable differences in the timing of these two assemblies as the majority of the identified LTR RTs in the BAC.SubSample originated within the last million years, while the CFern v1.1A assembly did not identify a single LTR RT within the past million years and instead had largely older (>4 mya) LTR RTs. In addition, we note that the

BAC.SubSample had 28 full-length, high-confidence LTR RTs, while the CFern v1.1A assembly had 34 despite nearly a 1,000-fold greater assembly length. These results suggest that the long-read sequencing of the BAC.SubSample was able to span and properly assemble these repetitive elements, while the short reads of CFern v1.1 could only assemble older, more heterogeneous repetitive elements. The composition comparisons and repeat characterizations between the

CFern v1.1 and BAC.SubSample assemblies illustrate the need for long-read technology with large genomes. While comparisons between sequencing and assembly technologies demonstrate the limitations of short-read sequencing, economic factors must be taken into account in any sequencing project, and here the deep sequencing of an 11.25-Gb genome using long-read

technology would take much longer and cost much more than similar sequencing with short reads.

Summary

Ferns are the second most diverse lineage of vascular plants with over 10,000 species

(Smith et al., 2006). Sister lineage to all seed plants, ferns first appeared approximately 430 million years ago based on fossil-calibrated phylogenies (Morris et al., 2018); however, most extant diversity arose within the last 40-60 million years during the Cenozoic Era (Schuettpelz &

Pryer, 2007; Testo & Sundue, 2016). Despite their vast diversity, pivotal relationship as sister to seed plants, and considerable evolutionary history, ferns have been the final frontier of land plant genomics, as they were, until this study and the recently published Azolla and Salvinia (Li et al.,

2018), the sole lineage lacking a reference nuclear genome.

Here we provide the first draft genome assembly of the 11.25-Gb Ceratopteris genome, as well as a high-confidence set of gene models. Multiple types of analyses of the Ceratopteris genome provide clear evidence of only a single ancient WGD event and suggest that ferns have not undergone recurrent rounds of WGD followed by massive gene silencing as some have proposed (Haufler, 2002, 2014). Instead, this study supports past studies based on transcripts of

Equisetum giganteum (Vanneste et al., 2015), a genetic linkage map of Ceratopteris (Nakazato et al., 2006), as well as ancestral reconstructions (Clark et al., 2016) that WGD events are rare and few in the evolutionary history of ferns despite the presence of many species of recent polyploid origin (Wood et al., 2009). While genomic analyses in flowering plants have shown that even tiny genomes such as Arabidopsis have undergone numerous rounds of polyploidy yet still have a low number of chromosomes, we find that ferns are much less dynamic, having undergone relatively few WGD events and yet having retained a high number of chromosomes.

In contrast to similar analyses of flowering plant genomes which have undergone

“genome obesity”, the repeat content in Ceratopteris based on CFern v1.1 does not appear to differ dramatically from other land plants and thus does not play a substantial role in the differing genome sizes between lineages. In none of the major repeat categories was Ceratopteris substantially higher in proportion, although these results could be biased by differing genome assembly qualities (Figure 4-5). Comparisons with the BAC.SubSample assembly perhaps provide a more accurate representation of the repeat composition of Ceratopteris; however, the small fraction of the genome that the subsample represents could be an outlier and not a fair representation of the genome as a whole. Interestingly, Ceratopteris had a very low diversity of recent LTR RT exemplars when compared to other large genomes such as maize. While this could be indicative of low LTR RT richness and high abundance, given that the counts of the

LTR RTs were considerably higher in Ceratopteris compared to the other genomes, it is also possible that we are unable to identify the majority of full-length LTR RTs due to low scaffold contiguity whereas the repeat composition analysis was able to pick up partial LTR RTs.

Clearly, long-read technology will be necessary to overcome and fully analyze a genome of this size, as short-read sequencing simply cannot span and assemble the repetitive structures found in Ceratopteris. We are just now beginning to unfurl the evolutionary history of ferns.

This study provides a major evolutionary stepping-stone by providing the first homosporous fern reference genome, as well as unique insights into the processes underlying the formation of these massive genomes.

Figure 4-1. Paralog-age distribution analyses and associated SiZer plots of three fern species. Upper panels are KS-based histograms (0.05 bins) of paralogs in Ceratopteris richardii, Azolla filiculoides, and Equisetum giganteum. Lower panels are SiZer plots of the above paralog-age distribution analyses where blue indicates significant (α = 0.05) increases, red significant decreases, purple insignificance, and gray too few data points to determine. The white lines show the effective window widths for each bandwidth. Both upper and lower panels are on the same x-axis.

Figure 4-2. Overlapping density plots of the three paralog-age distribution analyses. Green is Ceratopteris, red is Azolla, and blue is Equisetum.

Figure 4-3. MAPS analysis across land plants and the associated WGD events (shown as stars). The percentages of subtrees that contain gene duplications shared by descendent species at the designated nodes are above the phylogeny. Dates are based on Testo and Sundue (2016) and Morris et al. (2018).

Figure 4-4. Fluorescent in situ hybridizations of Ceratopteris chromosome squashes. The fluorescent probes are of 100-150 Kb DNA fragments from bacterial artificial chromosomes of Ceratopteris. Primary diploid localizations are shown in all four panels, while secondary localizations, most likely of repetitive elements, are apparent in (C).

Figure 4-5. Repeat composition, genome size, and genome assembly N50 for representative embryophytes.

Figure 4-6. Mean repeat lengths for representative embryophytes across different repeat categories.

Figure 4-7. LTR RT insertion dates in Ceratopteris richardii based on the CFern v1.1A and BAC.SubSample assemblies. Insertion dates were inferred from the similarity of long terminal repeat regions of the LTR RTs and a neutral substitution rate of 6.5 × 10⁻⁹ per site per year.

Table 4-1. Ceratopteris genome assembly statistics.

Cytometric genome size: 11.25 Gbp
Chromosome number: 39

Assembly                   Contigs/scaffolds    Total size   N50         % Gaps   % GC
V1.0 Meraculous contigs    15,871,274 contigs   4.21 Gbp     300 bp      0        36
CFern v1.0 (≥1,000 bp)     988,403 scaffolds    2.69 Gbp     3,376 bp    0.5      36
CFern v1.1 (≥1,000 bp)     626,576 scaffolds    4.25 Gb      16,289 bp   37       38
CFern v1.1A (≥10,000 bp)   133,755 scaffolds    2.79 Gb      22,401 bp   44       38
BAC.SubSample              35 scaffolds         3.03 Mb      97,182 bp   0        39

Table 4-2. Ceratopteris repeat diversity and composition.

Class / Superfamily        Count  Length (bp)       %

Retrotransposons: LINE
  Uncategorized             1458       523887    0.02
  RTE-BovB                   421       128815    0.00
  Jockey                     434       127827    0.00
  R1                        2565      1241869    0.04
  RTE-X                     8733      3937956    0.14
  L2                       19047      3652063    0.13
  L1-Tx1                   23635     15522256    0.56
  L1                       47105     20464731    0.73

Retrotransposons: LTR
  Uncategorized            23507      7859618    0.28
  DIRS                       361        45453    0.00
  Pao                       1494       462814    0.02
  Gypsy-Troyka              5331      3393861    0.12
  ERV1                      8083      6191165    0.22
  Gypsy                   329706    207014935    7.42
  Copia                   812470    460237954   16.50

DNA Transposons
  Uncategorized             3289       693293    0.02
  hAT-Tip100                 374       251768    0.01
  CMC-Mirage                 416       153347    0.01
  MULE-MuDR                  425        41133    0.00
  TcMar                      530        83430    0.00
  hAT-hATw                   622       343180    0.01
  Harbinger                  627       224160    0.01
  PiggyBac                  1230       122417    0.00
  Dada                      1981      1130399    0.04
  CMC-EnSpm                 2339      1276711    0.05
  Sola                      2739      1565090    0.06
  hAT                       2774       819873    0.03
  Maverick                  4082      2443709    0.09
  hAT-Ac                    4625      1727566    0.06
  PIF-Harbinger             4982      1214259    0.04
  En-Spm                   10115      5487072    0.20
  hAT-Tag1                 15720      6076071    0.22

Helitron                    4260       812693    0.03

CHAPTER 5 GENETIC SPECIFICITY AND EVOLUTION UNDERLYING THE ALTERNATION OF GENERATIONS IN LAND PLANTS

Background

Recent research has provided many new insights into the genetics underlying the origin and radiation of flowering plants, or Darwin’s “abominable mystery.” Perhaps most enlightening has been the association among major genomic processes (e.g. whole-genome duplications, massive transposable element activations), gene family expansions, and large-scale species radiations (Slotkin et al., 2012; Airoldi & Davies, 2012; Oliver et al., 2013; Soltis et al., 2015;

Tank et al., 2015; Landis et al., 2018). While numerous gene families have undoubtedly played a role in the massive diversification of flowering plants, expansions of certain gene families are hypothesized to have been primary catalysts for species diversification (e.g., Amborella Genome

Project 2013). For example, MADS-box transcription factors are critical regulators in the development of all land plants, but have been most commonly associated with seed plant

(gymnosperm and angiosperm) architecture. As a result, this gene family has been a focus of many phylogenetic, comparative genetics, and developmental genetics studies throughout land plants (Winter et al., 1999; Pelaz et al., 2000; Becker & Theißen, 2003; Chang et al., 2009;

Amborella Genome Project, 2013; Gramzow et al., 2014). Despite the functional importance of many regulatory gene families in angiosperms and gymnosperms, few studies have investigated the role they play and their evolutionary history in the sister group to the seed plants, the ferns.

Combined, seed plants and ferns make up a clade of more than 307,000 described and accepted species, the euphyllophytes (Christenhusz & Byng, 2016; PPG I, 2016). Time-calibrated phylogenies estimate the crown age of euphyllophytes at between 440 and 400 million years ago (mya) (Testo & Sundue, 2016; Morris et al., 2018), although the oldest unequivocal fossils of this clade date only to the Devonian (419-359 mya) (Walker et al., 2013).

Euphyllophytes are characterized by pseudomonopodial growth, or the differentiation between a main axis and side branches, as well as megaphylls, or “true leaves” (reviewed in Judd et al.,

1999). Ferns (in the broad sense, monilophytes) alone comprise a clade of over 10,000 species found in many ecosystems across the globe (Christenhusz & Byng, 2016; PPG I,

2016). Although extant ferns are substantially less diverse than the seed plants and have only a few economically important species, investigations into fern genetics and fern genome evolution can provide unique insights into plant genetics and evolution. In much the same way that

Amborella served as a reference for characterizing the ancestral gene content of early flowering plants, or platypus served as a reference for studying early mammals (Warren et al., 2008b;

Amborella Genome Project, 2013), ferns can help illuminate the evolutionary history not only of seed plants, but also of all euphyllophytes, especially regarding the interplay of genetics and life history traits.

The alternation of generations between a haploid gametophyte and a diploid sporophyte is a central feature of land plants. Unlike animals, in which a diploid organism’s germ line undergoes meiosis to produce haploid gametes, meiosis in a plant produces haploid spores, each of which develops into a multicellular gametophyte. These gametophytes produce male and female gametes, which undergo syngamy to produce a diploid sporophyte and complete the life cycle. However, the level of dominance and interdependence of the haploid gametophyte and diploid sporophyte varies between major lineages of plants. For example, extant non-vascular land plants (i.e. mosses, hornworts, liverworts, collectively “bryophytes” whether or not they form a clade; e.g., Wickett et al., 2014; Gitzendanner et al., 2018) have free-living gametophytes upon which the sporophytes develop and are nutritionally dependent. In contrast, seed plants (i.e. gymnosperms and flowering plants) have free-living sporophytes with highly reduced


gametophytes that are dependent on the sporophyte and develop within sporophytic tissue. Some lycophytes and most ferns (nearly 99% of all species) differ dramatically from seed plants and bryophytes in having both an independent haploid gametophyte and an independent diploid sporophyte life stage. In addition, while all seed plants are heterosporous, producing microspores and megaspores, which develop into sexually differentiated male and female gametophytes

(microgametophytes and megagametophytes) that produce either sperm or eggs, respectively, nearly all ferns are homosporous, producing a single type of spore that develops into a potentially bisexual haploid gametophyte that produces both sperm and egg.

How do plants regulate gene expression in sporophytes versus gametophytes? How much of the genome is restricted to expression in one generation or the other, and for housekeeping genes, are separate gene copies expressed in the two life stages? Because most genetic and genomic investigations of plants have centered on seed plants, our understanding of gene expression patterns and of the genetic pathways operating within these distinct life stages has been largely biased towards heterosporous, independent-sporophyte, dependent-gametophyte systems. Even studies of flowering plants comparing the genetics underlying the two life stages have been restricted to a select few species, including the grass sorghum (Sorghum bicolor)

(McCormick et al., 2018), Arabidopsis thaliana (Honys & Twell, 2004; Johnston et al., 2007), and maize (Zea mays) (Chettoor et al., 2014; Stelpflug et al., 2016), as a result of the difficulties associated with isolating gametophytic from sporophytic tissue in angiosperms. The moss

Physcomitrella patens has served as a genetic model system for non-vascular plants and as a reference for independent-gametophyte, dependent-sporophyte plant species (Perroud et al.,

2018).


As the sister group to seed plants, ferns serve as the ideal system for investigating both the genetics underlying the transition from independent-gametophyte to independent-sporophyte plant life and the ancestral gene content at this transition. Here, we (i) investigated patterns of gene expression in the independent gametophyte and sporophyte life stages of the homosporous model fern Ceratopteris richardii, (ii) compared the results with those for independent-gametophyte and independent-sporophyte species (Physcomitrella patens and Pinus taeda, respectively), and (iii) reconstructed the ancestral gene content of all euphyllophytes, all seed plants, and all ferns.

Methods

Tissue Samples

Ceratopteris richardii (Pteridaceae) is a fast-growing tropical fern, used globally in research laboratories, as well as in K-12 and undergraduate biology courses for studying alternation of generations in plants (see www.c-fern.org). Inbred lines and single-gene mutants are commercially available and readily produced. For this study, spores from the Hn-n inbred line were kindly donated by Dr. Leslie Hickok (University of Tennessee). The spores were germinated on Bold’s (1957) nutrient media with Nitsch’s (1951) micronutrients and grown following the recommended conditions in the C-Fern Manual (www.c-fern.org). Upon germination, we isolated the gametophytes to individual petri dishes and growth media. Given that C. richardii is homosporous, the gametophytes are typically bisexual and produce both antheridia and archegonia. By isolating the gametophytes prior to sexual maturity, we ensured that any sporophytes that did develop were a product of gametes from a single gametophyte, thus completely homozygous, i.e., a double haploid. All of the sporophytic tissue used for sequencing came from one double-haploid genotype (Voucher: M. Whitten #5841, University of Florida


Herbarium). The tissues for transcriptome sequencing of the gametophyte came from Hn-n siblings of this double-haploid individual.

Library Construction

We used both short- and long-read sequencing technology to acquire three Ceratopteris transcriptomes (short-read gametophyte, short-read pooled sporophyte, long-read mature fertile leaf sporophyte). The long-read transcriptome used the Isoform Sequencing (IsoSeq) technology of Pacific Biosciences (PacBio; Menlo Park, CA, USA) to produce full-length transcripts. We extracted total RNA from mature fertile leaf tissue producing sporangia using the RNeasy Plant

Mini kit (Qiagen, Germany). The total RNA was size-selected for 0.8-2, 2-3, 3-5, and >5 Kbp with the SageELF (Sage Science, Beverly, MA, USA) at the University of Florida

Interdisciplinary Center for Biotechnology Research (UF ICBR). The libraries were prepared following the SMRTbell library protocol, and each library was sequenced on three PacBio

SMRT cells (Menlo Park, CA, USA) at the UF ICBR.

For the short-read transcriptomes, we extracted total RNA from (i) ~100 pooled sibling sexually immature gametophytes and (ii) pooled sporophyte tissue (immature fertile leaf, sterile leaf, stem, root). Library preparation (550-bp inserts) and sequencing (300-bp PE) of the two

RNA samples were performed by SciVentures (Singapore, Singapore) using the Illumina MiSeq platform (Illumina, San Diego, CA, USA).

Transcriptome Assembly

We cleaned and processed the long reads following the IsoSeq protocol (Gordon et al.,

2015) in which the circular consensus sequences (CCS) were acquired from the raw reads and then classified and clustered. Only full-length, high-quality, polished sequences (≥2 full-length reads; accuracy ≥99%) were used for analysis, following the Iterative Clustering and Error correction (ICE)/Quiver algorithm.


The raw short reads from the gametophyte and sporophyte transcriptome libraries were trimmed of adapters and quality-filtered using Trimmomatic (Bolger et al., 2014). We visually analyzed the quality of the reads before and after cleaning with FastQC

(www.bioinformatics.babraham.ac.uk/projects/fastqc/) and then assembled the two transcriptomes separately using Trinity de novo RNA-Seq and default settings (Haas et al.,

2013).

Comparative Transcriptomics

We identified the coding DNA sequences (CDS) of the transcripts in each of the three transcriptomes (short-read gametophyte, short-read pooled sporophyte, long-read mature fertile leaf sporophyte) and predicted the amino acid sequences with TransDecoder v5.0.2 (Haas et al.,

2013). The CDS were clustered separately at 95% sequence identity using USEARCH-UCLUST

(Edgar, 2010) as well as all together to produce non-redundant sets of transcripts for each transcriptome and a total non-redundant transcript set for Ceratopteris. In addition, we mapped the CDS of each transcriptome to the CFern v1.1A genome assembly (Marchant et al. in prep) using GMAP (Wu & Watanabe, 2005) at 98% identity and 98% coverage to obtain a relative proportion of transcript and isoform specificity. We also investigated transcript specificity in the two life stages without the reference genome using highly stringent reciprocal BLASTn (E-value

1e-10, 100% identity) (v2.7.1, Camacho et al., 2009) searches of the gametophyte and sporophyte CDS.
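
As a rough illustration of the reference-free comparison described above, the following sketch counts the CDS from one life stage that lack a qualifying hit in the other. It assumes BLASTn tabular output (-outfmt 6, in which the third column is percent identity and the eleventh is the E-value); the function names, the threshold arguments, and the toy records are hypothetical and are not taken from the pipeline used in this study.

```python
def hit_queries(blast_tab_lines, min_identity=100.0, max_evalue=1e-10):
    """Return the query IDs with at least one hit passing both thresholds.

    Expects BLAST tabular lines (-outfmt 6): qseqid, sseqid, pident, ...,
    evalue (column 11), bitscore (column 12).
    """
    passing = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        pident, evalue = float(fields[2]), float(fields[10])
        if pident >= min_identity and evalue <= max_evalue:
            passing.add(fields[0])
    return passing


def specific_fraction(all_ids, blast_tab_lines, **thresholds):
    """Fraction of IDs in all_ids with no qualifying hit in the other set."""
    all_ids = set(all_ids)
    matched = hit_queries(blast_tab_lines, **thresholds) & all_ids
    return (len(all_ids) - len(matched)) / len(all_ids)


# Toy example (made-up IDs): gametophyte CDS searched against sporophyte CDS.
gametophyte_ids = ["g1", "g2", "g3", "g4"]
gam_vs_spo = [
    "g1\ts7\t100.0\t900\t0\t0\t1\t900\t1\t900\t0.0\t1600",
    "g2\ts3\t97.5\t800\t20\t0\t1\t800\t1\t800\t0.0\t1300",  # fails identity
    "g3\ts9\t100.0\t30\t0\t0\t1\t30\t1\t30\t2e-08\t56.5",  # fails E-value
]
gametophyte_specific = specific_fraction(gametophyte_ids, gam_vs_spo)
```

For a reciprocal comparison, the same function would be applied a second time to the sporophyte-versus-gametophyte search results.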

The transcripts and predicted proteins were queried against the manually annotated and reviewed Swiss-Prot database (UniProt Consortium, 2018) with BLASTx and BLASTp (E-value

1e-10) (v2.7.1, Camacho et al., 2009). The predicted proteins were queried against the Pfam

(Finn et al., 2013) and PANTHER (Mi et al., 2005) databases using InterProScan (Finn et al.,


2016). The results were loaded into a Trinotate SQLite database providing functional Gene

Ontology (GO) annotations and top SwissProt, PANTHER, and Pfam hits (Haas et al., 2013).

To directly compare life stage-specificity of gene expression in Ceratopteris with those of other land plants, we acquired raw RNA short-read sequences from gametophyte and sporophyte tissues of a moss (Physcomitrella patens) and sporophyte, microgametophyte

(pollen), and megagametophyte tissues of a conifer (Pinus taeda) via the NCBI Short Read

Archive (SRA) database (Table A-2). These two taxa and data sets were selected because they had distinct sporophyte and gametophyte (megagametophyte and microgametophyte, in the case of Pinus) samples and used sequencing platforms similar to that of Ceratopteris. We trimmed and cleaned the reads with Trimmomatic (Bolger et al., 2014) and assembled them using Trinity

(Haas et al., 2013). As with the Ceratopteris short-read transcriptomes, we translated the assembled transcripts into predicted CDS and amino acid sequences (TransDecoder; Haas et al.,

2013) and clustered the CDS based on 95% nucleotide similarity (USEARCH-UCLUST; Edgar,

2010). We then grouped all of the amino acid sequences, including the short-read, non-redundant sporophyte and gametophyte transcriptomes of Ceratopteris, into orthogroups using OrthoFinder

(Emms & Kelly, 2015) to approximate gene families and analyzed orthogroup overlap among these transcriptomes.
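
The orthogroup overlap analysis can be sketched as a simple set comparison. The example below is a minimal illustration for a species with two sampled life stages; the function name, the data layout (orthogroup ID mapped to the set of stages in which it was detected), and the toy data are assumptions made for illustration only. A conifer with separate micro- and megagametophyte samples would need additional categories.

```python
def stage_specificity(orthogroups):
    """Percent of a species' orthogroups found only in the sporophyte,
    only in the gametophyte, or in both.

    orthogroups maps an orthogroup ID to the set of life stages
    ("sporophyte", "gametophyte") in which a member transcript was found.
    """
    counts = {"sporophyte": 0, "gametophyte": 0, "shared": 0}
    for stages in orthogroups.values():
        if stages == {"sporophyte", "gametophyte"}:
            counts["shared"] += 1
        elif stages == {"sporophyte"}:
            counts["sporophyte"] += 1
        elif stages == {"gametophyte"}:
            counts["gametophyte"] += 1
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}


# Toy data (made up): seven shared orthogroups, two sporophyte-only,
# and one gametophyte-only.
toy = {f"OG{i}": {"sporophyte", "gametophyte"} for i in range(7)}
toy.update({"OG7": {"sporophyte"}, "OG8": {"sporophyte"}, "OG9": {"gametophyte"}})
percentages = stage_specificity(toy)
```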

Gene Family Evolution

Primary transcripts and their predicted amino acid sequences were retrieved from

Phytozome (v. 12.1.5, accessed Nov. 2017) for 12 species spanning the major lineages of land plants (Table A-2). Those representative taxa include four eudicots (Arabidopsis thaliana,

Glycine max, Solanum lycopersicum, Vitis vinifera), two monocots (Oryza sativa, Zea mays), a angiosperm (Amborella trichopoda), a lycophyte (Selaginella moellendorffii), and a moss

(Physcomitrella patens). Sequences from a gymnosperm (Picea abies) were obtained from


ConGenIE (Nystedt et al., 2013), those of a liverwort (Marchantia polymorpha) were retrieved from

MarpolBase (Bowman et al., 2017), and those of a heterosporous water fern (Azolla filiculoides) were obtained from Li et al. (2018). These predicted protein sequences along with the total non-redundant transcript set from Ceratopteris were clustered into putative gene families using

OrthoFinder (Emms & Kelly, 2015).

We identified all single-copy orthogroups with one gene per species for all 13 species, as these are ideal for estimating phylogenetic relationships among species (i.e., the species tree;

Emms & Kelly, 2015). The amino acid sequences were aligned with MAFFT (Katoh & Standley,

2013), stripped of highly ambiguous (>90% missing data) columns, and the alignments concatenated with Phyx (Brown et al., 2017) before estimating the species tree with RAxML using 100 rapid bootstrap searches (Stamatakis, 2014). We inputted this species tree into the program Count (Csűrös, 2010) along with the number of gene representatives from each species per orthogroup to reconstruct the ancestral gene content at each node of our species tree under a

Wagner parsimony framework with a 1.2-gene gain penalty (Amborella Genome Project, 2013;

Li et al., 2018). In association with our transcript annotations and those of Arabidopsis

(Berardini et al., 2015), we identified gene families of interest and estimated gene family phylogenies. We selected gene families associated with developmental regulation, reproduction, and disease resistance.
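
To make the reconstruction criterion concrete, the sketch below applies asymmetric Wagner parsimony to the copy numbers of a single gene family on a small tree. It is a Sankoff-style dynamic program written for illustration and is not the implementation used by Count; the function name, the nested-tuple tree encoding, and the toy tree are hypothetical. The one assumption carried over from the text is the cost scheme: each gained copy costs the gain penalty (1.2 here) and each lost copy costs 1.

```python
def wagner_parsimony(tree, leaf_counts, gain_penalty=1.2):
    """Reconstruct ancestral copy numbers for one gene family.

    tree: nested tuples of unique leaf names, e.g. (("A", "B"), "C").
    leaf_counts: observed copy number per leaf.
    A change from x copies (parent) to y copies (child) costs
    gain_penalty * (y - x) if y > x, else (x - y).
    Returns (minimum total cost, {node: reconstructed copy number}).
    """
    states = range(max(leaf_counts.values()) + 1)

    def edge_cost(parent, child):
        if child > parent:
            return gain_penalty * (child - parent)
        return parent - child

    best_child = {}  # (child node, parent state) -> best state for child

    def up(node):
        """Bottom-up pass: cost[s] = minimum subtree cost if node has s copies."""
        if isinstance(node, str):
            return {s: (0.0 if s == leaf_counts[node] else float("inf"))
                    for s in states}
        cost = {s: 0.0 for s in states}
        for child in node:
            child_cost = up(child)
            for s in states:
                t = min(states, key=lambda c: child_cost[c] + edge_cost(s, c))
                best_child[(child, s)] = t
                cost[s] += child_cost[t] + edge_cost(s, t)
        return cost

    root_cost = up(tree)
    root_state = min(states, key=lambda s: root_cost[s])

    assigned = {}

    def down(node, state):
        """Top-down pass: fix each node to its best state given its parent."""
        assigned[node] = state
        if not isinstance(node, str):
            for child in node:
                down(child, best_child[(child, state)])

    down(tree, root_state)
    return root_cost[root_state], assigned


# Toy tree (made up): two sister taxa with two copies each, outgroup with none.
toy_tree = (("A", "B"), "C")
toy_cost, toy_states = wagner_parsimony(toy_tree, {"A": 2, "B": 2, "C": 0})
```

In this toy case, a gain penalty above 1 makes two ancestral copies followed by a loss in the outgroup cheaper than an ancestral absence followed by gains, which is the kind of bias toward inferred losses that such a penalty introduces.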

Results

Generation-Specific Patterns of Gene Expression

Ceratopteris

Three distinct transcriptomes (short-read gametophyte, short-read pooled sporophyte, long-read mature fertile leaf sporophyte) of Ceratopteris were produced using long- and short-read technology. From 12 PacBio SMRT cells, we obtained 850,000 reads, from which we


produced 97,084 full-length, high-quality, cleaned transcripts. Short-read RNA sequencing resulted in over 29 million clean and trimmed reads per transcriptome. The sporophyte de novo transcriptome assembly had 190,824 transcripts, while the gametophyte de novo assembly had

137,450 transcripts.

From the short-read transcriptomes, we identified and clustered the CDS to produce two non-redundant transcript sets for the sporophyte and gametophyte with 42,841 and 36,427 CDS respectively. The total non-redundant CDS set had 52,894 sequences and 981 of the 1440

(68.1%) universal land plant orthologs. From the short-read gametophyte transcriptome, 11,189 separate isoforms were identified in the CFern v1.1A genome assembly, while 13,448 were mapped from the short-read sporophyte transcriptome (Figure 5-1). Of those, 35% were sporophyte-specific isoforms, and 21% were gametophyte-specific isoforms. Life-stage specificity was lower in genes than in isoforms (29% in the sporophyte and 15% in the gametophyte). Reciprocal BLASTs of the two transcriptomes produced similar values, with 34% of genes sporophyte-specific and 26% of genes gametophyte-specific.

The total numbers of GO term annotations were similar in the sporophyte and gametophyte transcriptomes of Ceratopteris (165,609 and 150,166, respectively). These included

16 GO annotations related to leaf development, 15 to flower development, 29 to root development, and 11 to reproduction, all of which were similarly represented in both the gametophyte and sporophyte transcriptomes (Figure 5-2). The largest differences in GO term representation were in those representing RNA-directed DNA polymerase activity (627 more in the sporophyte), DNA integration (612 more in the sporophyte), and aspartic-type endopeptidase activity (599 more in the sporophyte).


Across Land Plants

We identified 29,725 orthogroups, or putative gene families, in the transcriptomes of a moss (sporophyte and gametophyte), fern (sporophyte and gametophyte), and conifer

(sporophyte, microgametophyte, and megagametophyte). Of those orthogroups, 5,742 had representative genes from all seven transcriptomes, representing a core set of ubiquitous gene families. In terms of putative biological function, these gene families were largely involved in metabolic processes (GO:0044237 cellular metabolic process, GO:0006807 nitrogen compound metabolic process, GO:0043170 macromolecule metabolic process). We found low percentages of life-stage specificity in Ceratopteris gene families (1.9% sporophyte-specific, 1.2% gametophyte-specific), while 5.3% of the gene families in the moss were sporophyte-specific and

4.8% gametophyte-specific (Table 5-1). The conifer had much higher gametophyte specificity

(8.9%), especially when accounting for the two separate gametophytes (2.2% megagametophyte-specific, 4.1% microgametophyte-specific) and relatively low sporophyte specificity (2.3%).

While 24 gene families were unique to both the fern sporophyte and conifer sporophyte and not found in the gametophytes of any of the three species, only four gene families were specific to all three sporophytes (moss, fern, conifer). In addition, 24 gene families were gametophyte-specific in the fern and moss (i.e., not found in the sporophyte of those plants). Of those 24 gene families,

17 were in one of the two conifer gametophyte transcriptomes (megagametophyte or microgametophyte), but 13 were also found in the conifer sporophyte. A single gene family

(PTHR10027: calcium-activated potassium channel alpha chain) was specific to all four gametophyte transcriptomes (moss gametophyte, fern gametophyte, conifer microgametophyte, conifer megagametophyte). Another single gene family was found solely in the fern gametophyte, moss gametophyte, and conifer megagametophyte (PTHR32018: lyase family) and


another shared by the fern gametophyte, moss gametophyte, and conifer microgametophyte

(PTHR35106: carboxypeptidase).

Gene Family Evolution Across Land Plants

We gathered 449,682 genes classified into 16,097 orthogroups from 13 species (Table A-

2) spanning the major lineages of land plants. We identified 29 orthogroups with single gene representatives from each of the 13 species to be used in the species tree estimation and subsequent ancestral gene content reconstruction (Figure 5-3). This phylogeny agreed with previous relationship estimations based on larger taxonomic sampling and both nuclear and plastid genes (Wickett et al., 2014; Gitzendanner et al., 2018).
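
Selecting the single-copy orthogroups used for species tree estimation amounts to a filter on per-species gene counts. The sketch below is a minimal illustration; the function name, the input layout (orthogroup ID mapped to per-species gene counts), and the toy values are assumptions, not the actual OrthoFinder output format or results.

```python
def single_copy_orthogroups(gene_counts, species):
    """Return IDs of orthogroups with exactly one gene in every species.

    gene_counts maps orthogroup ID -> {species: number of genes}.
    Species missing from an orthogroup are treated as having zero genes.
    """
    return sorted(
        og for og, counts in gene_counts.items()
        if all(counts.get(sp, 0) == 1 for sp in species)
    )


# Toy data (made up) for three species.
species = ["Ceratopteris", "Azolla", "Picea"]
toy_counts = {
    "OG0001": {"Ceratopteris": 1, "Azolla": 1, "Picea": 1},  # single-copy
    "OG0002": {"Ceratopteris": 3, "Azolla": 1, "Picea": 1},  # expanded in one
    "OG0003": {"Ceratopteris": 1, "Azolla": 1},              # absent in one
}
selected = single_copy_orthogroups(toy_counts, species)
```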

Gene content varied among transcriptomes of the 13 species. Ceratopteris had genes in

9,569 orthogroups, of which 117 were Ceratopteris-specific and 447 fern-specific (Ceratopteris

+ Azolla). Zea and Oryza had the most orthogroups (11,311 and 10,859), while Picea and

Selaginella had the fewest (7,324 and 7,996). Ceratopteris had only the fifth most orthogroups

(9,569) but gained more orthogroups (776) than any of the other 12 species. In addition,

Ceratopteris had the second most orthogroup expansions (2,071), behind Glycine. Throughout the evolutionary history of land plants, the largest gain in new orthogroups was found in the

MRCA of the monocots (Oryza and Zea) (1,712) followed by the angiosperms (741), and then seed plants (642). The branches leading to the ferns and euphyllophytes were relatively stable in terms of gene family evolution, gaining 542 and 240 new orthogroups, while only losing 243 and

168 gene families, respectively. Similarly, orthogroup expansion and contraction were relatively low along the branches leading to the ferns and euphyllophytes.

MADS-box transcription factors are critical regulators in the development of flowering plants and seed plants. We identified 55 putative MADS-box genes in the transcriptomes of

Ceratopteris. Based on our ancestral reconstructions, MADS-box gene content remained


relatively stable through the evolutionary history of land plants, with 26 in the MRCA of vascular plants, 27 in the MRCA of euphyllophytes, and 30 in the MRCA of ferns (Figure 5-4).

Thus, it has only been relatively recently that the MADS-box family has expanded in multiple distinct lineages to reach the current MADS-box diversity in Solanum (65), Oryza (67), Zea (87),

Arabidopsis (95), and Glycine (173).

Another key gene family in the regulation of plant development and defense response is the TCP transcription factor family. The highest diversity of TCP transcription factors was found in Glycine (56), followed by maize (42) and Solanum (36) (Figure 5-4). Ceratopteris had the fifth fewest TCPs (10) of the species investigated, and just one more than the MRCA of euphyllophytes (9).

We identified genes from both the TALE and HD-ZIP super classes of homeobox transcription factors. Ceratopteris had the same number of TALE genes as Arabidopsis (22), considerably fewer than Glycine (70) or maize (45) (Figure 5-4). Ancestral reconstructions of the

TALE super class suggest a sizeable gene family expansion at the MRCA of the monocots and eudicots (26) relative to the MRCA of vascular plants (12), euphyllophytes (14), seed plants

(14), or all flowering plants (14). In contrast, the HD-ZIP super class was much more variable, ranging from 139 genes in Glycine to 5 in Marchantia. Ceratopteris had 70 HD-ZIP genes, while the

MRCA of euphyllophytes had 33, and that of seed plants had 45.

We also searched for genes associated with RNA silencing (AGO and DCL gene families), as RNA silencing regulates a variety of cellular processes. The highest diversity of

AGO genes was in maize (22), followed by Glycine (20) and Oryza (18) (Figure 5-5).

Marchantia had the lowest AGO diversity (4), followed by Selaginella (6), and Ceratopteris,

Azolla, and Picea all had 7. The MRCA of vascular plants, euphyllophytes, and ferns all had 7


AGO genes, while the MRCA of angiosperms had 13. The highest DCL diversity was found in

Solanum (12), followed by Glycine (10), Oryza (8), and maize (8). The lowest DCL diversities were in Azolla (3), Marchantia (4), Physcomitrella (5), and Picea (5). The MRCA of land plants through the MRCA of euphyllophytes all had 5 DCL genes.

Disease resistance in plants is largely mediated by nucleotide-binding site leucine-rich repeat (NBS-LRR) genes. We identified 10 NBS-LRR orthogroups, in which Oryza had 445 genes, Vitis had 310, and Glycine had 306 (Figure 5-6). In comparison, Ceratopteris and Azolla had 9 and 8 NBS-LRR genes, while Selaginella and Marchantia had 13 and 10, respectively.

Ancestral reconstructions of these gene families found the common ancestor of euphyllophytes to have 16 NBS-LRR genes while 78 were identified in the common ancestor of seed plants.

Discussion

Ferns provide an ideal genetic and genomic reference for understanding land plant evolution as they are the sister group to the hyper-diverse seed plants and also encapsulate intermediate life history traits relative to non-vascular plants and seed plants. By comparing gene expression in the independent sporophyte and gametophyte life stages of Ceratopteris richardii, we were able to investigate isoform, gene, and gene family specificity and function in this homosporous fern species. We also directly compared these results to those of a gametophyte-dominant moss and a sporophyte-dominant conifer. Finally, we incorporated fern gene families into an examination of gene family evolution across land plants as a whole, to investigate the expansion and contraction of specific gene families.

The sporophyte and gametophyte life stages can vary immensely across land plants in both relative size and ability to survive as free-living. For example, bryophyte gametophytes are composed of stem-like and leaf-like structures upon which sporophytes develop. In contrast, the male gametophyte of flowering plants is a mere two or three cells, while the female gametophyte


can range from seven to a few hundred cells. Despite their highly reduced states, angiosperm male gametophyte (microgametophyte; pollen) and female gametophyte (megagametophyte; embryo sac) gene expression profiles using microarrays showed 13,997 distinct genes expressed during pollen development in Arabidopsis, of which 1,355 were microgametophyte-specific

(Honys & Twell, 2004). Only 1,260 genes were identified as expressed during embryo sac development, with 100 being embryo sac-specific (Johnston et al., 2007). More recent investigations have utilized RNA-sequencing technology and found considerably more transcripts than the microarrays. Chettoor et al. (2014) found 734 genes specific to pollen and

1,714 specific to the embryo sac in maize when compared to a variety of sporophytic tissues.

Using the Ceratopteris richardii genome assembly (see Chapter 4), we mapped the transcripts from the separate sporophyte and gametophyte transcriptomes to estimate the relative proportions of isoform and gene specificity for the two life stages. Over 8,000 genes are gametophyte-specific and over 15,000 are sporophyte-specific in Ceratopteris. These values are considerably higher than those predicted for Arabidopsis or maize noted above (Honys & Twell,

2004; Johnston et al., 2007; Chettoor et al., 2014), but this difference is perhaps understandable when the overall size and complexity of the fern gametophyte is taken into account compared to those of angiosperms. As in seed plant gametophytes, the primary function of fern gametophytes is to produce male and female gametes; however, fern gametophytes (including those of

Ceratopteris) typically sustain themselves independently of the sporophyte. The free-living fern gametophyte must photosynthesize, develop rhizoids for anchoring the plant, and produce archegonia and antheridia, yet fern gametophytes lack many of the tissues and complex structures found in sporophytes, including vascular tissue, stomata, a cuticle, roots, leaves, and stems; furthermore, fern gametophytes are almost entirely a single cell thick. Thus, it was


surprising to find so much similarity in the functional annotations of the two life stage transcriptomes (gametophyte vs. sporophyte) of Ceratopteris related to leaf and root development, and even more surprising to see shared annotations for genes for flower development (Figure 5-2). Instead, the majority (56%) of the gametophyte-specific transcripts in

Ceratopteris lacked annotations, suggesting a high prevalence of unexplored genes in fern gametophytes. Therefore, a future grand challenge in fern biology is to characterize these numerous genes of unknown function; this will soon be possible given the availability of an improving reference Ceratopteris genome and the tractability of Ceratopteris as a genetic model (Marchant et al. in prep).

Despite the relatively high proportion of gametophyte- and sporophyte-specific isoforms and genes, overall gene family specificity was <2% for each life cycle stage of Ceratopteris; this value is low compared to that of the moss (5.3% sporophyte-specific, 4.8% gametophyte-specific) and conifer (2.3% sporophyte-specific, 8.9% gametophyte-specific) (Table 5-1).

Significantly, the minimal overlap of life stage-specific gene families among these three taxa

(moss, fern, conifer) suggests that there are almost no sporophyte- or gametophyte-specific gene families despite drastically different morphologies, tissues, and reproductive roles in these two distinct life stages.

Ancestral reconstructions of gene families across bryophytes, lycophytes, ferns, gymnosperms, and flowering plants showed little change in the total number of gene families at the MRCA of euphyllophytes (Figure 5-3). Instead, the largest gain in gene families was in the monocots (74,000 species), and the largest loss of gene families was in the gymnosperms (1,100 species) (Christenhusz & Byng, 2016). Given that total gene family diversity did not reflect the


overall diversity of euphyllophytes, we further investigated gene families directly associated with development and reproduction as well as disease resistance.

Found throughout eukaryotic life, MADS (MINICHROMOSOME MAINTENANCE1,

AGAMOUS, DEFICIENS, SERUM RESPONSE FACTOR)-box transcription factors are genetic regulators of organ development (Theissen et al., 2000; Messenguy & Dubois, 2003; Gramzow

& Theißen, 2015; Thangavel & Nayar, 2018). While most animals and fungi only contain two to six MADS-box genes in total (Gramzow et al., 2010), flowering plants can possess over 100 of these genes which regulate flower, gametophyte, embryo, and seed development (Parenicová et al., 2003; Arora et al., 2007; Gramzow & Theissen, 2010). Of the few studies investigating

MADS-box gene expression in ferns, most of the genes were expressed in both gametophyte and sporophyte life stages (Kwantes et al., 2011; Huang et al., 2014). Similarly, the TCP

(TEOSINTE BRANCHED1, CYCLOIDEA, PROLIFERATING CELL NUCLEAR ANTIGEN

FACTOR) transcription factors regulate leaf architecture, petal and stamen development, shoot branching, and defense response (Li, 2015). The TALE (Three Amino acid Loop Extension) and

HD-ZIP (Homeodomain associated to a leucine zipper) homeobox transcription factors are critical for environmental responses, meristem regulation, and organ development. While the

TALE family can be found throughout eukaryotic life, the HD-ZIP transcription factors are unique to plants (Ariel et al., 2007; Mukherjee et al., 2009). Although Ceratopteris had the highest diversity of these transcription factor families of any non-seed plant, the MRCA of euphyllophytes, ferns, and seed plants all had relatively low diversity of these transcription factor families. Previous analyses of MADS-box gene diversity in Ceratopteris had identified only nine genes in this family (Münster et al., 1997; Hasebe et al., 1998; Kwantes et al., 2011); however, our broad transcriptomic approach revealed these genes in a variety of tissues in both the


gametophyte and sporophyte. Specialization of the MADS-box genes with regard to gametophyte or sporophyte development is the norm for seed plants, but non-seed plants have broader MADS-box expression patterns (Thangavel & Nayar, 2018).

The AGO (Argonaute) and DCL (Dicer-like) gene families are fundamental components of RNA interference (RNAi) pathways, which mediate gene expression and transposable element activation. DCL proteins “dice” long double-stranded RNAs into small RNAs (sRNAs), which then form RNA-induced silencing complexes (RISCs) with associated AGO proteins (Singh et al., 2015). These RISCs regulate meristem development, stress responses, and germ cell development in flowering plants (Singh et al., 2015; Zhai et al., 2015). While DCL diversity fluctuated very little throughout land plant evolution, the AGO family expanded in the MRCA of flowering plants, as well as that of the monocots and eudicots.

The NBS-LRR (nucleotide-binding site leucine-rich repeat) proteins are critical to disease resistance in seed plants; however, their presence in non-seed plants, especially ferns, is largely unknown. Ancestral reconstructions suggest that the NBS-LRR gene family steadily diversified following the origin of the seed plants, as only 16 NBS-LRR genes were found in the MRCA of euphyllophytes compared to 78 in the MRCA of seed plants or 145 in Picea and 445 in Oryza.

While ferns are known for their herbivory resistance and deterrence, to the degree that an insecticide-coding gene from a fern was transferred to cotton (Shukla et al., 2016), the genetic mechanisms underlying their resistance to bacterial and fungal pathogens are relatively unexplored. The low NBS-LRR diversity of Ceratopteris and other non-seed plants could reflect lower selective pressure from pathogens or the use of alternative defense mechanisms.

Ferns possess free-living gametophyte and sporophyte life stages. This feature and their pivotal phylogenetic position as the sister clade to seed plants make them ideal for investigating


features of the alternation of generations and land plant evolution. Here we found 21% and 35% of isoforms and 15% and 29% of genes were gametophyte- and sporophyte-specific, respectively, in the homosporous fern Ceratopteris richardii. Despite these high proportions of specificity, very few gene families (1.2% gametophyte-specific, 1.9% sporophyte-specific) were life stage-specific, especially when compared to those of a gametophyte-independent, sporophyte-dependent moss and sporophyte-independent, gametophyte-dependent conifer.

The integration of data for a homosporous fern into the study of land plant gene family evolution provided insight into the ancestral gene content of euphyllophytes, seed plants, and ferns. These ancestral reconstructions showed that while key regulatory gene families (MADS-box, TCP, TALE, HD-ZIP, AGO, DCL) were present in the MRCA of euphyllophytes and the

MRCA of seed plants, the expansions of these families came well after the switch to sporophyte-independent, gametophyte-dependent life stages and even after the evolution of flowering plants.

Significantly, we found a rapid expansion of the disease resistance NBS-LRR gene family at the

MRCA of seed plants and not in the ferns or MRCA of euphyllophytes. This expansion could be the result of increased seed plant-specific pathogens but raises the question of how non-seed plants defend themselves with so few NBS-LRR genes. These results and associated genetic resources from the homosporous fern Ceratopteris provide considerable insight into land plant evolution and the transition from free-living haploid gametophytes and dependent sporophytes found in non-vascular plants to free-living diploid sporophytes and highly reduced, dependent gametophytes found in all seed plants.


Figure 5-1. Life stage specificity of isoforms, genes, and gene families in Ceratopteris richardii.


Figure 5-2. Gene ontology (GO) annotation frequency of terms related to root, leaf, and flower development, as well as reproduction in sporophytic (blue) and gametophytic (green) transcripts of Ceratopteris richardii.


Figure 5-3. Ancestral reconstruction of gene family content and evolution in land plants. The total number of orthogroups for each taxon and node are in ovals while the number of orthogroups gained, lost, expanded, and contracted are represented in the bar plots.


Figure 5-4. Number of MADS-box, TCP, TALE, and HD-ZIP transcription factor genes per species and ancestral node across land plants.


Figure 5-5. Number of AGO and DCL genes per species and ancestral node across land plants.


Figure 5-6. Number of NBS-LRR genes per species and ancestral node across land plants.


Table 5-1. Gene family life stage specificity in a moss, fern, and conifer.

          Sporophyte   Shared   Gametophyte   Microgametophyte   Megagametophyte
Moss      5.3%         89.9%    4.8%
Fern      1.9%         96.9%    1.2%
Conifer   2.3%         82.5%    8.9%          4.1%               2.2%


CHAPTER 6 CONCLUSIONS

The field of plant evolution has made enormous strides in the past decade. On one side, evolutionary genomics is revolutionizing our understanding of the mechanisms underlying much of biodiversity, most notably the major forms of macromutation (i.e., polyploidy, transposable elements, and chromosomal rearrangements). However, a sizeable sampling bias towards economically important crops and model organisms has precluded contributions from phylogenetically distributed samples, and these studies therefore lack a natural context. In contrast, research on the ecological effects of these macromutations is expanding in taxonomic breadth via the integration of digitized specimen data from natural history museums. As a result, there exists a disconnect between genetic/genomic model plant systems and ecological model plant systems that has hindered the synthesis and interplay of genotype and ecology in relation to processes such as polyploidy (Soltis et al., 2016).

By comparing the ecological niches of allopolyploids to those of their diploid progenitors, I revealed the variety of ecological outcomes that can arise via polyploidy in natural systems. These categorizations will serve as a baseline and standard for further sampling of polyploid systems and for formulating testable hypotheses. Most critical for future investigations is distinguishing the immediate effects of polyploidy from those of subsequent evolution, which could be achieved by studying recent natural polyploids and synthetic polyploids. With these hypotheses, the genetic and physiological effects of polyploidy and subsequent evolution that led to these diverse ecological outcomes can be more directly investigated.

I also provided critical insight into the nuclear genome of a homosporous fern and subsequently into euphyllophyte (ferns and seed plants) genome evolution. Ferns have historically perplexed evolutionary biologists due to their high chromosome numbers and large genomes, and, as a result, numerous hypotheses have arisen to address these mysteries. However, only recently, due to the relative ease of next-generation sequencing (NGS) technologies, has it been possible for these issues to be truly addressed. Here I investigated the genomic and evolutionary processes by which the large genomes and high chromosome numbers typical of homosporous ferns evolved and have been maintained. Using the model fern species Ceratopteris richardii, I evaluated the possible roles of polyploidy (whole-genome duplication, WGD), repeat element composition, and the expansion of transposable elements (TEs) in shaping fern genome evolution. Repeat compositions in species spanning the plant tree of life and a variety of genome sizes were directly compared, as were both short- and long-read-based assemblies of the Ceratopteris genome. Only a single ancient WGD event was found in the evolutionary history of Ceratopteris, and repeat proportions were similar to those of classically large flowering plant genomes, such as maize. As such, alternative hypotheses to WGD must be explored to explain the genomic composition of homosporous ferns. Chromosome-level genome assemblies of multiple fern taxa could provide valuable insights into addressing these hypotheses.

Homosporous ferns are also important for understanding the genetics underlying life history traits in plants, as they have both free-living haploid gametophyte and free-living diploid sporophyte life stages. In contrast, bryophytes all have free-living gametophytes and dependent sporophytes, while seed plants have free-living sporophytes and dependent gametophytes. Across land plants, the genetics underlying these two life stages are largely unexplored. Substantial life-stage specificity was discovered in Ceratopteris at the isoform and gene levels, but very little in gene families. Over 5,000 gene families, representing a core set of ubiquitous gene families, were identified in both the gametophyte and sporophyte stages of a moss, fern, and conifer, but only a few gene families were gametophyte- or sporophyte-specific across these three species.

Finally, the integration of a homosporous fern into land plant gene family evolution provided insights into the ancestral gene content of euphyllophytes, seed plants, and ferns. These reconstructions showed that while key regulatory gene families were present in the ancestral euphyllophytes and seed plants, the expansions of these families came well after the switch to sporophyte-independent, gametophyte-dependent life stages and even after the evolution of flowering plants. Significantly, there was a rapid expansion of the disease resistance NBS-LRR gene family at the MRCA of seed plants that was not in the ferns or ancestral euphyllophytes.

This research provided critical resources (the transcriptomes of independent gametophytes and sporophytes and the first sequenced homosporous fern genome) for future phylogenetic, developmental, and evolutionary studies. In addition, Ceratopteris has been used for decades as a model organism in K-12 and undergraduate biology courses around the world for teaching alternation of generations in plants (http://www.c-fern.org/). These resources can extend that lesson plan to include basic genetics. In no other group of organisms can individuals be irradiated and their mutations tracked as easily as in ferns. Their independent haploid gametophytes and their ability to undergo intragametophytic selfing (IGS) to produce an entirely homozygous sporophyte expose mutant phenotypes in both the gametophyte and sporophyte phases. With a reference genome and transcriptomes, classes will be able to easily identify mutants and understand the underlying genetics, increasing the value of this popular educational model species. This research provides a major stepping-stone for evolutionary studies by supplying the first homosporous fern reference genome, as well as unique insight into the root processes underlying these massively complicated genomes.


APPENDIX SUPPLEMENTARY MATERIALS

Figure A-1. Niche breadth of the polyploids and their diploid progenitors.


Figure A-2. Pairwise niche-overlap scores of each polyploid and its diploid progenitors, with gray bars for polyploid-diploid comparisons and black bars for diploid-diploid comparisons.


Table A-1. Polyploid systems used in this study and the means by which their progenitors were identified. The number of occurrence points for each species is designated in parentheses.

Polyploid                         Progenitors                       Means of Progenitor Identification
Asplenium bradleyi (217)          Asplenium montanum (302)          Allozyme analysis (Werth et al., 1985)
                                  Asplenium platyneuron (2266)
Asplenium pinnatifidum (330)      Asplenium rhizophyllum (1126)     Allozyme analysis (Werth et al., 1985)
                                  Asplenium montanum (302)
Cystopteris tennesseensis (521)   Cystopteris protrusa (951)        Isozyme analysis (Haufler & Windham, 1991)
                                  Cystopteris bulbifera (938)
Dryopteris celsa (147)            Dryopteris goldiana (524)         Isozyme analysis (Werth & Windham, 1991)
                                  Dryopteris ludoviciana (119)
Iris versicolor (726)             Iris virginica (808)              FISH, GISH, Southern hybridization (Lim et al., 2007)
                                  Iris setosa (209)
Polypodium calirhiza (217)        Polypodium californicum (228)     Isozyme analysis (Haufler & Windham, 1991)
                                  Polypodium glycyrrhiza (438)
Polypodium hesperium (198)        Polypodium amorphum (84)          Isozyme analysis (Haufler & Windham, 1991)
                                  Polypodium glycyrrhiza (438)
Polypodium saximontanum (38)      Polypodium amorphum (84)          Isozyme analysis (Haufler & Windham, 1991)
                                  Polypodium sibiricum (28)
Polypodium virginianum (1251)     Polypodium sibiricum (28)         Isozyme analysis (Haufler & Windham, 1991)
                                  Polypodium appalachianum (100)
Polystichum californicum (80)     Polystichum dudleyi (57)          Allozyme and cpDNA RFLP (Soltis et al., 1991)
                                  Polystichum imbricans (529)
Polystichum scopulinum (191)      Polystichum lemmonii (120)        Allozyme and cpDNA RFLP (Soltis et al., 1991)
                                  Polystichum imbricans (529)
Spiranthes diluvialis (55)        Spiranthes magnicamporum (270)    Isozyme analysis (Arft & Ranker, 1998)
                                  Spiranthes romanzoffiana (1201)
Stebbinsoseris heterocarpa (227)  Microseris douglasii (433)        cpDNA and ribosomal DNA restriction site variability (Wallace & Jansen, 1995)
                                  Uropappus lindleyi (1150)


Figure A-3. Box plots of transcript length in the genome-free UniCFernModels, genome-mapped with ≥50% coverage, and genome-mapped with ≥98% coverage.


Table A-2. Source information for gametophyte-sporophyte comparisons.

Species          Tissue (n)             SRA Study    Source
Physcomitrella   Gametophyte (1n)       SRR6257615   Perroud et al. 2017
Physcomitrella   Sporophyte (2n)        SRR1588575   Perroud et al. 2017
Pinus taeda      Megagametophyte (1n)   SRR1200298   Zimin et al. 2017
Pinus taeda      Pollen (1n)            SRR3712438   Zimin et al. 2017
Pinus taeda      Seedling (2n)          SRR3712440   Zimin et al. 2017


Table A-3. Representative species and data source for ancestral gene family reconstructions.

Species                      Lineage            Source
Physcomitrella patens        Moss               Phytozome
Marchantia polymorpha        Liverwort          MarpolBase
Selaginella moellendorffii   Lycophyte          Phytozome
Ceratopteris richardii       Fern               This study
Azolla filiculoides          Fern               Li et al. 2018
Picea abies                  Gymnosperm         ConGenIE
Amborella trichopoda         Basal angiosperm   Phytozome
Oryza sativa                 Monocot            Phytozome
Zea mays                     Monocot            Phytozome
Solanum lycopersicum         Eudicot            Phytozome
Vitis vinifera               Eudicot            Phytozome
Glycine max                  Eudicot            Phytozome
Arabidopsis thaliana         Eudicot            Phytozome


LIST OF REFERENCES

Abbott RJ, Brochmann C. 2003. History and evolution of the arctic flora: in the footsteps of Eric Hulten. Molecular Ecology 12: 299–313.

Abbott RJ, Lowe AJ. 2004. Origins, establishment and evolution of new polyploid species: Senecio cambrensis and S. eboracensis in the British Isles. Biological Journal of the Linnean Society 82: 467–474.

Ainouche ML, Baumel A, Salmon A. 2004. Spartina anglica CE Hubbard: a natural model system for analysing early evolutionary changes that affect allopolyploid genomes. Biological Journal of the Linnean Society 82: 475–484.

Aïnouche ML, Fortune PM, Salmon A, Parisod C, Grandbastien M-A, Fukunaga K, Ricou M, Misset M-T. 2009. Hybridization, polyploidy and invasion: lessons from Spartina (Poaceae). Biological Invasions 11: 1159.

Airoldi CA, Davies B. 2012. Gene Duplication and the Evolution of Plant MADS-box Transcription Factors. Journal of Genetics and Genomics 39: 157–165.

Albertin W, Marullo P. 2012. Polyploidy in fungi: evolution after whole-genome duplication. Proceedings. Biological sciences / The Royal Society 279: 2497–509.

Amborella Genome Project. 2013. The Amborella Genome and the Evolution of Flowering Plants. Science 342.

Anderson RP, Gonzalez Jr. I. 2011. Species-specific tuning increases robustness to sampling bias in models of species distributions: An implementation with Maxent. Ecological Modelling 222: 2796–2811.

Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.

Arft A, Ranker T. 1998. Allopolyploid origin and population genetics of the rare orchid Spiranthes diluvialis. Am. J. Botany 85: 110-.

Ariel FD, Manavella PA, Dezar CA, Chan RL. 2007. The true story of the HD-Zip family. Trends in plant science 12: 419–426.

Arora R, Agarwal P, Ray S, Singh AK, Singh VP, Tyagi AK, Kapoor S. 2007. MADS-box gene family in rice: genome-wide identification, organization and expression profiling during reproductive development and stress. BMC Genomics 8: 242.

Banks JA, Nishiyama T, Hasebe M, Bowman JL, Gribskov M, dePamphilis C, Albert VA, Aono N, Aoyama T, Ambrose BA, et al. 2011. The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants. Science 332: 960–963.

Barbazuk WB, Fu Y, McGinnis KM. 2008. Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome research 18: 1381–92.

Barker MS, Arrigo N, Baniaga AE, Li Z, Levin DA. 2016. On the relative abundance of autopolyploids and allopolyploids. New Phytologist 210: 391–398.

Barker MS, Dlugosch KM, Dinh L, Challa RS, Kane NC, King MG, Rieseberg LH. 2010. EvoPipes.net: bioinformatic tools for ecological and evolutionary genomics. Evolutionary Bioinformatics 6: EBO-S5861.

Barker MS, Kane NC, Matvienko M, Kozik A, Michelmore RW, Knapp SJ, Rieseberg LH. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Molecular Biology and Evolution 25: 2445–2455.

Barker MS, Vogel H, Schranz ME. 2009. Paleopolyploidy in the Brassicales: Analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biology and Evolution.

Bayer RJ, Purdy BG, Lebedyk DG. 1991. Niche differentiation among eight sexual species of Antennaria Gaertner (Asteraceae: Inuleae) and A. rosea, their allopolyploid derivative. Evolutionary trends in plants.

Becker A, Theißen G. 2003. The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Molecular Phylogenetics and Evolution 29: 464–489.

te Beest M, Le Roux JJ, Richardson DM, Brysting AK, Suda J, Kubesová M, Pysek P. 2012. The more the better? The role of polyploidy in facilitating plant invasions. Annals of botany 109: 19–45.

Bennett MD. 1998. Plant genome values: How much do we know? Proceedings of the National Academy of Sciences 95: 2011–2016.

Bennett MD, Leitch IJ. 2012. Plant DNA C-values database (release 6.0, Dec. 2012). [WWW document] URL http://data.kew.org/cvalues/ [accessed 14 October 2014].

Bennetzen JL, Ma J, Devos KM. 2005. Mechanisms of Recent Genome Size Variation in Flowering Plants. Annals of Botany 95: 127–132.

Bennetzen JL, Wang H. 2014. The contributions of transposable elements to the structure, function, and evolution of plant genomes. Annual review of plant biology 65: 505–530.

Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E. 2015. The Arabidopsis information resource: making and mining the “gold standard” annotated reference plant genome. genesis 53: 474–485.

Bird CE, Fernandez-Silva I, Skillings DJ, Toonen RJ. 2012. Sympatric speciation in the post “modern synthesis” era of evolutionary biology. Evolutionary Biology 39: 158–180.


Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA, Yuen MM Saint, Keeling CI, Brand D, Vandervalk BP, et al. 2013. Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics.

Blanc G, Wolfe KH. 2004. Widespread Paleopolyploidy in Model Plant Species Inferred from Age Distributions of Duplicate Genes. The Plant Cell 16: 1667–1678.

Boetzer M, Henkel C V, Jansen HJ, Butler D, Pirovano W. 2010. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27: 578–579.

Bold HC. 1957. Morphology of Plants. 1st ed.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120.

Bowman JL, Kohchi T, Yamato KT, Jenkins J, Shu S, Ishizaki K, Yamaoka S, Nishihama R, Nakamura Y, Berger F. 2017. Insights into land plant evolution garnered from the Marchantia polymorpha genome. Cell 171: 287–304.

Braasch I, Postlethwait JH. 2012. Polyploidy in Fish and the Teleost Genome Duplication. In: Soltis PS, Soltis DE, eds. Polyploidy and Genome Evolution. Berlin, Heidelberg: Springer Berlin Heidelberg, 341–383.

Brandt J, Schrauth S, Veith A-M, Froschauer A, Haneke T, Schultheis C, Gessler M, Leimeister C, Volff J-N. 2005. Transposable elements as a source of genetic innovation: expression and evolution of a family of retrotransposon-derived neogenes in mammals. Gene 345: 101–11.

Brochmann C, Brysting AK, Alsos IG, Borgen L, Grundt HH, Scheen A-C, Elven R. 2004. Polyploidy in arctic plants. Biological Journal of the Linnean Society 82: 521–536.

Broennimann O, Fitzpatrick MC, Pearman PB, Petitpierre B, Pellissier L, Yoccoz NG, Thuiller W, Fortin M-J, Randin C, Zimmermann NE, et al. 2012. Measuring ecological niche overlap from occurrence and spatial environmental data. Global Ecology and Biogeography 21: 481–497.

Brosius J. 2003. Origin and Evolution of New Gene Functions. In: Long M, ed. Dordrecht: Springer Netherlands, 99–116.

Brown JW, Walker JF, Smith SA. 2017. Phyx: phylogenetic tools for unix. Bioinformatics 33: 1886–1888.

Buggs RJA, Chamala S, Wu W, Tate JA, Schnable PS, Soltis DE, Soltis PS, Barbazuk WB. 2012. Rapid, repeated, and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Current biology : CB 22: 248–52.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10: 421.


Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ. 2014. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant physiology 164: 513–524.

Carretero-Paulet L, Librado P, Chang T-H, Ibarra-Laclette E, Herrera-Estrella L, Rozas J, Albert VA. 2015. High Gene Family Turnover Rates and Gene Space Adaptation in the Compact Genome of the Carnivorous Plant Utricularia gibba. Molecular Biology and Evolution 32: 1284–1295.

Chamala S, Chanderbali AS, Der JP, Lan T, Walts B, Albert VA, dePamphilis CW, Leebens-Mack J, Rounsley S, Schuster SC, et al. 2013. Assembly and validation of the genome of the nonmodel basal angiosperm Amborella. Science (New York, N.Y.) 342: 1516–7.

Chamala S, Feng G, Chavarro C, Barbazuk WB. 2015. Genome-Wide Identification of Evolutionarily Conserved Alternative Splicing Events in Flowering Plants. Frontiers in Bioengineering and Biotechnology 3: 33.

Chan PP, Lowe TM. 2015. GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic acids research 44: D184–D189.

Chang Y-Y, Chiu Y-F, Wu J-W, Yang C-H. 2009. Four Orchid (Oncidium Gower Ramsey) AP1/AGL9-like MADS Box Genes Show Novel Expression Patterns and Cause Different Effects on Floral Transition and Formation in Arabidopsis thaliana. Plant and Cell Physiology 50: 1425–1438.

Chapman JA, Ho IY, Goltsman E, Rokhsar DS. 2016. Meraculous2: fast accurate short-read assembly of large polymorphic genomes. arXiv preprint arXiv:1608.01031.

Chaudhary B, Flagel L, Stupar RM, Udall JA, Verma N, Springer NM, Wendel JF. 2009. Reciprocal silencing, transcriptional bias and functional divergence of homeologs in polyploid cotton (Gossypium). Genetics 182: 503–17.

Chaudhuri P, Marron JS. 1999. SiZer for exploration of structures in curves. Journal of the American Statistical Association 94: 807–823.

Chen ZJ, Birchler JA. 2013. Polyploid and hybrid genomics. John Wiley & Sons.

Chen ZJ, Ha M, Soltis D. 2007. Polyploidy: genome obesity and its consequences. The New phytologist 174: 717–20.

Chester M, Gallagher JP, Symonds VV, Cruz da Silva AV, Mavrodiev E V, Leitch AR, Soltis PS, Soltis DE. 2012. Extensive chromosomal variation in a recently formed natural allopolyploid species, Tragopogon miscellus (Asteraceae). Proceedings of the National Academy of Sciences 109: 1176–1181.

Chettoor AM, Givan SA, Cole RA, Coker CT, Unger-Wallace E, Vejlupkova Z, Vollbrecht E, Fowler JE, Evans MMS. 2014. Discovery of novel transcripts and gametophytic functions via RNA-seq analysis of maize gametophytic transcriptomes. Genome biology 15: 414.


Chikhi R, Medvedev P. 2014. Informed and automated k-mer size selection for genome assembly. Bioinformatics 30: 31–37.

Christenhusz MJM, Byng JW. 2016. The number of known plants species in the world and its annual increase. Phytotaxa 261: 201–217.

Clark J, Hidalgo O, Pellicer J, Liu H, Marquardt J, Robert Y, Christenhusz M, Zhang S, Gibby M, Leitch IJ. 2016. Genome evolution of ferns: evidence for relative stasis of genome size across the fern phylogeny. New Phytologist 210: 1072–1082.

Clausen J, Keck DD, Hiesey WM. 1945. Experimental studies on the nature of species. II. Plant evolution through amphiploidy, with examples from the Madiinae.

Conant GC, Birchler JA, Pires JC. 2014. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Current Opinion in Plant Biology 19: 91–98.

Costanza R, Folke C. 1997. Valuing ecosystem services with efficiency, fairness and sustainability as goals. Nature’s services: Societal dependence on natural ecosystems: 49–70.

Coyne JA, Orr HA. 2004. Speciation. Sinauer Associates, Inc.

Crow KD, Wagner GP. 2006. What Is the Role of Genome Duplication in the Evolution of Complexity and Diversity? Molecular Biology and Evolution 23: 887–892.

Csűrös M. 2010. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26: 1910–1912.

Davis CC, Xi Z. 2015. Horizontal gene transfer in parasitic plants. Current Opinion in Plant Biology 26: 14–19.

Devos KM, Brown JKM, Bennetzen JL. 2002. Genome Size Reduction through Illegitimate Recombination Counteracts Genome Expansion in Arabidopsis. Genome Research 12: 1075–1079.

Dierckxsens N, Mardulyn P, Smits G. 2016. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic acids research 45: e18–e18.

Donoghue MJ, Edwards EJ. 2014. Biome Shifts and Niche Evolution in Plants. Annual Review of Ecology, Evolution, and Systematics 45: 547–572.

Doyle J, Doyle JL. 1987. Genomic plant DNA preparation from fresh tissue-CTAB method. Phytochem Bull 19: 11–15.

Durand LZ, Goldstein G. 2001. Photosynthesis, photoinhibition, and nitrogen use efficiency in native and invasive tree ferns in Hawaii. Oecologia 126: 345–354.

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26: 2460–2461.

Ehrendorfer F. 1980. Polyploidy and distribution. In: Polyploidy. Springer, 45–60.

Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC bioinformatics 9: 18.

Ellwood MDF, Foster WA. 2004. Doubling the estimate of invertebrate biomass in a rainforest canopy. Nature 429: 549–51.

Emms DM, Kelly S. 2015. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome biology 16: 157.

Estep MC, DeBarry JD, Bennetzen JL. 2013. The dynamics of LTR retrotransposon accumulation across 25 million years of panicoid grass evolution. Heredity 110: 194–204.

Evans BJ, Pyron RA, Wiens JJ. 2012. Polyploidization and sex chromosome evolution in amphibians. In: Polyploidy and genome evolution. Springer, 385–410.

Fayle TM, Chung AYC, Dumbrell AJ, Eggleton P, Foster WA. 2009. The Effect of Rain Forest Canopy Architecture on the Distribution of Epiphytic Ferns (Asplenium spp.) in Sabah, Malaysia. Biotropica 41: 676–681.

Feldman M, Levy AA. 2012. Genome evolution due to allopolyploidization in wheat. Genetics 192: 763–74.

Feschotte C. 2008. Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9: 397–405.

Feschotte C, Jiang N, Wessler SR. 2002. Plant transposable elements: where genetics meets genomics. Nature reviews. Genetics 3: 329–41.

Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, Chang H-Y, Dosztányi Z, El-Gebali S, Fraser M. 2016. InterPro in 2017—beyond protein family and domain annotations. Nucleic acids research 45: D190–D199.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J. 2013. Pfam: the protein families database. Nucleic acids research 42: D222–D230.

Flagel L, Udall J, Nettleton D, Wendel J. 2008. Duplicate gene expression in allopolyploid Gossypium reveals two temporally distinct phases of expression evolution. BMC biology 6: 16.

Flagel LE, Wendel JF. 2010. Evolutionary rate variation, genomic dominance and duplicate gene expression evolution during allotetraploid cotton speciation. The New phytologist 186: 184–93.

Force A, Lynch M, Pickett FB, Amores A, Yan Y, Postlethwait J. 1999. Preservation of Duplicate Genes by Complementary, Degenerative Mutations. Genetics 151: 1531–1545.

Fowler NL, Levin DA. 1984. Ecological Constraints on the Establishment of a Novel Polyploid in Competition with Its Diploid Progenitor. The American Naturalist 124: 703–711.

Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu Rev Plant Biol 60.

Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150–3152.

Gaeta RT, Pires JC, Iniguez-Luy F, Leon E, Osborn TC. 2007. Genomic Changes in Resynthesized Brassica napus and Their Effect on Gene Expression and Phenotype. The Plant Cell 19: 3403–3417.

Gavrilets S, Vose A, Barluenga M, Salzburger W, Meyer A. 2007. Case studies and mathematical models of ecological speciation. 1. Cichlids in a crater lake. Molecular Ecology 16: 2893–2909.

Ghatak J. 1977. Biosystematic survey of from Shevaroy Hills, south India. Nucleus 20: 105–108.

Gitzendanner MA, Soltis PS, Wong GK, Ruhfel BR, Soltis DE. 2018. Plastid phylogenomic analysis of green plants: a billion years of evolutionary history. American journal of botany 105: 291–301.

Glennon KL, Rissler LJ, Church SA. 2011. Ecogeographic isolation: a reproductive barrier between species and between cytotypes in Houstonia (Rubiaceae). Evolutionary Ecology 26: 909–926.

Glennon KL, Ritchie ME, Segraves KA. 2014. Evidence for shared broad-scale climatic niches of diploid and polyploid plants. Ecology Letters 17: 574–582.

Godsoe W, Larson MA, Glennon KL, Segraves KA. 2013. Polyploidization in Heuchera cylindrica (Saxifragaceae) did not result in a shift in climatic requirements. American journal of botany 100: 496–508.

Gordon SP, Tseng E, Salamov A, Zhang J, Meng X, Zhao Z, Kang D, Underwood J, Grigoriev I V, Figueroa M. 2015. Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PloS one 10: e0132628.

Gramzow L, Ritz MS, Theissen G. 2010. On the origin of MADS-domain transcription factors. Trends Genet 26.

Gramzow L, Theissen G. 2010. A hitchhiker’s guide to the MADS world of plants. Genome Biology 11: 214.

Gramzow L, Theißen G. 2015. Phylogenomics reveals surprising sets of essential and dispensable clades of MIKCc‐group MADS‐box genes in flowering plants. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 324: 353–362.

Gramzow L, Weilandt L, Theißen G. 2014. MADS goes genomic in conifers: towards determining the ancestral set of MADS-box genes in seed plants. Annals of Botany.

Grant V. 1981. Plant Speciation. Columbia University Press: New York.

Greilhuber J, Borsch T, Müller K, Worberg A, Porembski S, Barthlott W. 2006. Smallest angiosperm genomes found in Lentibulariaceae, with chromosomes of bacterial size. Plant biology 8: 770–777.

Gremme G, Steinbiss S, Kurtz S. 2013. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 645–656.

Gurdon C, Svab Z, Feng Y, Kumar D, Maliga P. 2016. Cell-to-cell movement of mitochondria in plants. Proceedings of the National Academy of Sciences 113: 3395–3400.

Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols 8: 1494.

Hanley J, McNeil B. 1982. The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology 143: 29–36.

Hasebe M, Wen C-K, Kato M, Banks JA. 1998. Characterization of MADS homeotic genes in the fern Ceratopteris richardii. Proceedings of the National Academy of Sciences 95: 6222–6227.

Haufler CH. 1987. Electrophoresis is Modifying Our Concepts of Evolution in Homosporous Pteridophytes. American Journal of Botany 74: 953–966.

Haufler CH. 2002. Homospory 2002: An Odyssey of progress in genetics and evolutionary biology. AIBS Bulletin 52: 1081–1093.

Haufler CH. 2014. Ever since Klekowski: testing a set of radical hypotheses revives the genetics of ferns and lycophytes. American journal of botany 101: 2036–2042.

Haufler CH, Soltis DE. 1986. Genetic evidence suggests that homosporous ferns with high chromosome numbers are diploid. Proceedings of the National Academy of Sciences 83: 4389–4393.

Haufler C, Windham M. 1991. New Species of North American Cystopteris and Polypodium, with Comments on Their Reticulate Relationships. American Fern Journal 81: 7–23.

Haufler C, Windham M, Ranker T. 1990. Biosystematic analysis of the Cystopteris tennesseensis () complex. Annals of the Missouri Botanical … 77: 314–329.

Hawkins JS, Kim H, Nason JD, Wing RA, Wendel JF. 2006. Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium. Genome Research 16: 1252–1261.

Hegarty MJ, Barker GL, Brennan AC, Edwards KJ, Abbott RJ, Hiscock SJ. 2008. Changes to gene expression associated with hybrid speciation in plants: further insights from transcriptomic studies in Senecio. Philosophical Transactions of the Royal Society of London B: Biological Sciences 363: 3055–3069.

Hegarty MJ, Hiscock SJ. 2008. Genomic clues to the evolutionary success of polyploid plants. Current Biology 18: R435–R444.

Hijmans RJ, Cameron SE, Parra JL, Jones PG, Jarvis A. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965–1978.

Hirsch C, Hirsch CD, Brohammer AB, Bowman MJ, Soifer I, Barad O, Shem-Tov D, Baruch K, Lu F, Hernandez AG. 2016. Draft assembly of elite inbred line PH207 provides insights into genomic and transcriptome diversity in maize. The Plant Cell: tpc-00353.

Honys D, Twell D. 2004. Transcriptome analysis of haploid male gametophyte development in Arabidopsis. Genome biology 5: R85.

Huang Q, Li W, Fan R, Chang Y. 2014. New MADS-box gene in fern: cloning and expression analysis of DfMADS1 from . PloS one 9: e86349.

Husband BC, Schemske DW. 1998. Cytotype distribution at a diploid-tetraploid contact zone in Chamerion (Epilobium) angustifolium (Onagraceae). American Journal of Botany 85: 1688–1694.

Jaillon O, Aury J-M, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449: 463–7.

Jiao Y, Leebens-Mack J, Ayyampalayam S, Bowers JE, McKain MR, McNeal J, Rolf M, Ruzicka DR, Wafula E, Wickett NJ, et al. 2012. A genome triplication associated with early diversification of the core eudicots. Genome biology 13: R3.

Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100.

Johnston AJ, Meier P, Gheyselinck J, Wuest SEJ, Federer M, Schlagenhauf E, Becker JD, Grossniklaus U. 2007. Genetic subtraction profiling identifies genes essential for Arabidopsis reproduction and reveals interaction between the female gametophyte and the maternal sporophyte. Genome biology 8: R204.

Judd WS, Campbell CS, Kellogg EA, Stevens PF. 1999. Plant systematics. A phylogenetic approach. Sinauer Associates, Sunderland, Mass., USA 464: 3–4.

Jurka J. 2004. Evolutionary impact of human Alu repetitive elements. Current Opinion in Genetics & Development 14: 603–608.

Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution 30: 772–780.

Kennedy RC, Unger MF, Christley S, Collins FH, Madey GR. 2011. An automated homology-based approach for identifying transposable elements. BMC bioinformatics 12: 130.

Kim G, LeBlanc ML, Wafula EK, dePamphilis CW, Westwood JH. 2014. Genomic-scale exchange of mRNA between a parasitic plant and its hosts. Science 345: 808–811.

Klekowski E. 1972. Genetical features of ferns as contrasted with seed plants. Annals of the Missouri Botanical Garden 59: 138–151.

Klekowski E, Baker H. 1966. Evolutionary Significance of Polyploidy in the Pteridophyta. Science 153: 305–307.

Kwantes M, Liebsch D, Verelst W. 2011. How MIKC* MADS-box genes originated and evidence for their conserved function throughout the evolution of vascular plant gametophytes. Molecular biology and evolution 29: 293–302.

De La Torre AR, Birol I, Bousquet J, Ingvarsson PK, Jansson S, Jones SJM, Keeling CI, MacKay J, Nilsson O, Ritland K, et al. 2014. Insights into conifer giga-genomes. Plant physiology 166: 1724–32.

Landis JB, Soltis DE, Li Z, Marx HE, Barker MS, Tank DC, Soltis PS. 2018. Impact of whole‐genome duplication events on diversification rates in angiosperms. American journal of botany.

Lang D, Ullrich KK, Murat F, Fuchs J, Jenkins J, Haas FB, Piednoel M, Gundlach H, Van Bel M, Meyberg R. 2018. The Physcomitrella patens chromosome‐scale assembly reveals moss genome structure and evolution. The Plant Journal 93: 515–533.

Le QH, Wright S, Yu Z, Bureau T. 2000. Transposon Diversity in Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 97: 7376–7381.

Leempoel K, Parisod C, Geiser C, Daprà L, Vittoz P, Joost S. 2015. Very high-resolution digital elevation models: are multi-scale derived variables ecologically relevant? Methods in Ecology and Evolution 6: 1373–1383.

Leitch IJ, Hanson L, Lim KY, Kovarik A, Chase MW, Clarkson JJ, Leitch AR. 2008. The Ups and Downs of Genome Size Evolution in Polyploid Species of Nicotiana (Solanaceae). Annals of Botany 101: 805–814.

Levin DA. 1975. Minority Cytotype Exclusion in Local Plant Populations. Taxon 24: 35–43.

Levin DA. 2000. The origin, expansion, and demise of plant species. Oxford University Press on Demand.

Levin DA. 2003. The ecological transition in speciation. New Phytologist 161: 91–96.

Levin D. 2013. The timetable for allopolyploidy in flowering plants. Annals of botany 112: 1201–8.

Levins R. 1968. Evolution in changing environments: some theoretical explorations. Princeton University Press.

Lewis WH (Ed.). 1980. Polyploidy. Boston, MA: Springer US.

Li S. 2015. The Arabidopsis thaliana TCP transcription factors: a broadening horizon beyond development. Plant signaling & behavior 10: e1044192.

Li Z, Baniaga AE, Sessa EB, Scascitelli M, Graham SW, Rieseberg LH, Barker MS. 2015. Early genome duplications in conifers and other seed plants. Science Advances 1.

Li F-W, Brouwer P, Carretero-Paulet L, Cheng S, de Vries J, Delaux P-M, Eily A, Koppers N, Kuo L-Y, Li Z. 2018. Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nature plants: 1.

Li F-W, Villarreal JC, Kelly S, Rothfels CJ, Melkonian M, Frangedakis E, Ruhsam M, Sigel EM, Der JP, Pittermann J, et al. 2014. Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns. Proceedings of the National Academy of Sciences 111: 6672–6677.

Lim KY, Matyasek R, Kovarik A, Leitch AR. 2004. Genome evolution in allotetraploid Nicotiana. Biological Journal of the Linnean Society 82: 599–606.

Lim KY, Matyasek R, Kovarik A, Leitch A. 2007. Parental origin and genome evolution in the allopolyploid Iris versicolor. Annals of botany 100: 219–24.

Liu S, Wei Y, Post WM, Cook RB, Schaefer K, Thornton MM. 2013. The Unified North American Soil Map and its implication on the soil organic carbon stock in North America. Biogeosciences 10: 2915–2930.

Lockton S, Gaut BS. 2009. The Contribution of Transposable Elements to Expressed Coding Sequence in Arabidopsis thaliana. Journal of Molecular Evolution 68: 80–89.

Lynch M, Conery JS. 2000. The Evolutionary Fate and Consequences of Duplicate Genes. Science 290: 1151–1155.

Ma J, Bennetzen JL. 2004. Rapid recent growth and divergence of rice nuclear genomes. Proceedings of the National Academy of Sciences of the United States of America 101: 12404–12410.

Madlung A. 2013. Polyploidy and its effect on evolutionary success: old questions revisited with new tools. Heredity 110: 99–104.

Mandadi KK, Scholthof K-BG. 2015. Genome-Wide Analysis of Alternative Splicing Landscapes Modulated during Plant-Virus Interactions in Brachypodium distachyon. The Plant Cell 27: 71–85.

Mapleson D, Garcia Accinelli G, Kettleborough G, Wright J, Clavijo BJ. 2016. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33: 574–576.

Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27: 764–770.

Marchant DB, Soltis DE, Soltis PS. 2016a. Patterns of abiotic niche shifts in allopolyploids relative to their progenitors. New Phytologist.

Marchant DB, Soltis DE, Soltis PS. 2016b. Genome Evolution in Plants. eLS.

Martin SL, Husband BC. 2009. Influence of phylogeny and on species ranges of North American angiosperms. Journal of Ecology 97: 913–922.

Matzke MA, Mosher RA. 2014. RNA-directed DNA methylation: an epigenetic pathway of increasing complexity. Nature Reviews Genetics 15: 394–408.

Mayr E. 1963. Species and Evolution.

McCormick RF, Truong SK, Sreedasyam A, Jenkins J, Shu S, Sims D, Kennedy M, Amirebrahimi M, Weers BD, McKinley B. 2018. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant Journal 93: 338–354.

McIntyre PJ. 2012. Polyploidy associated with altered and broader ecological niches in the (Portulacaceae) species complex. American journal of botany 99: 655–62.

McLysaght A, Hokamp K, Wolfe KH. 2002. Extensive genomic duplication during early chordate evolution. Nature genetics 31: 200–4.

Merow C, Smith MJ, Silander JA. 2013. A practical guide to MaxEnt for modeling species’ distributions: what it does, and why inputs and settings matter. Ecography 36: 1058–1069.

Messenguy F, Dubois E. 2003. Role of MADS box proteins and their cofactors in combinatorial control of gene expression and cell development. Gene 316: 1–21.

Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ. 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic acids research 33: D284–D288.

Morris JL, Puttick MN, Clark JW, Edwards D, Kenrick P, Pressel S, Wellman CH, Yang Z, Schneider H, Donoghue PCJ. 2018. The timescale of early land plant evolution. Proceedings of the National Academy of Sciences 115: E2274–E2283.

Mukherjee K, Brocchieri L, Bürglin TR. 2009. A comprehensive classification and evolutionary analysis of plant homeobox genes. Molecular biology and evolution 26: 2775–2794.

Münster T, Pahnke J, Di Rosa A, Kim JT, Martin W, Saedler H, Theissen G. 1997. Floral homeotic genes were recruited from homologous MADS-box genes preexisting in the common ancestor of ferns and seed plants. Proceedings of the National Academy of Sciences 94: 2415–2420.

Naito K, Zhang F, Tsukiyama T, Saito H, Hancock CN, Richardson AO, Okumoto Y, Tanisaka T, Wessler SR. 2009. Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature 461: 1130–1134.

Nakazato T, Jung M-K, Housworth EA, Rieseberg LH, Gastony GJ. 2006. Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii. Genetics 173: 1585–1597.

Nitsch JP. 1951. Growth and development in vitro of excised ovaries. American Journal of Botany: 566–577.

Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, et al. 2013. The Norway spruce genome sequence and conifer genome evolution. Nature 497: 579–584.

O’Connell J, Schulz-Trieglaff O, Carlson E, Hims MM, Gormley NA, Cox AJ. 2015. NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics 31: 2035–2037.

Ohno S. 1999. Gene duplication and the uniqueness of vertebrate genomes circa 1970–1999. In: Seminars in cell & developmental biology. Elsevier, 517–522.

Oliver KR, Greene WK. 2009. Transposable elements: powerful facilitators of evolution. BioEssays 31: 703–714.

Oliver KR, McComb JA, Greene WK. 2013. Transposable Elements: Powerful Contributors to Angiosperm Evolution and Diversity. Genome Biology and Evolution 5: 1886–1901.

Ou S, Jiang N. 2018. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant physiology 176: 1410–1422.

Page LM, MacFadden BJ, Fortes JA, Soltis PS, Riccardi G. 2015. Digitization of Biodiversity Collections Reveals Biggest Data on Biodiversity. BioScience.

Parenicová L, de Folter S, Kieffer M, Horner DS, Favalli C, Busscher J, Cook HE, Ingram RM, Kater MM, Davies B, et al. 2003. Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: new openings to the MADS world. Plant Cell 15.

Parisod C, Salmon A, Zerjal T, Tenaillon M, Grandbastien M-A, Ainouche M. 2009. Rapid structural and epigenetic reorganization near transposable elements in hybrid and allopolyploid genomes in Spartina. The New phytologist 184: 1003–15.

Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al. 2009. The Sorghum bicolor genome and the diversification of grasses. Nature 457: 551–556.

Paterson AH, Chapman BA, Kissinger JC, Bowers JE, Feltus FA, Estill JC. 2006. Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends in Genetics 22: 597–602.

Brouwer P, Bräutigam A, Külahoglu C, Tazelaar AOE, Kurz S, Nierop KGJ, van der Werf A, Weber APM, Schluepmann H. 2014. Azolla domestication towards a biobased economy? New Phytologist 202: 1069–1082.

Pelaz S, Ditta GS, Baumann E, Wisman E, Yanofsky MF. 2000. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405: 200–203.

Pellicer J, Fay MF, Leitch IJ. 2010. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society 164: 10–15.

Perroud P, Haas FB, Hiss M, Ullrich KK, Alboresi A, Amirebrahimi M, Barry K, Bassi R, Bonhomme S, Chen H. 2018. The Physcomitrella patens gene atlas project: large‐scale RNA‐seq based expression data. The Plant Journal.

Peterson AT, Soberón J, Pearson RG, Anderson RP, Martínez-Meyer E, Nakamura M, Araújo MB. 2011. Ecological niches and geographic distributions (MPB-49). Princeton University Press.

Phillips SJ, Anderson RP, Schapire RE. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190: 231–259.

Pichersky E, Soltis D, Soltis P. 1990. Defective chlorophyll a/b-binding protein genes in the genome of a homosporous fern. Proceedings of the National Academy of Sciences 87: 195–199.

Piegu B, Guyot R, Picault N, Roulin A, Saniyal A, Kim H, Collura K, Brar DS, Jackson S, Wing RA, et al. 2006. Doubling genome size without polyploidization: Dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Research 16: 1262–1269.

PPG I. 2016. A community‐derived classification for extant lycophytes and ferns. Journal of Systematics and Evolution 54: 563–603.

R Core Team. 2013. R: A language and environment for statistical computing.

Ramsey J. 2011. Polyploidy and ecological adaptation in wild yarrow. Proceedings of the National Academy of Sciences 108: 7096–7101.

Ramsey J, Ramsey TS. 2014. Ecological studies of polyploidy in the 100 years following its discovery. Phil. Trans. R. Soc. B 369: 20130352.

Ramsey J, Schemske D. 1998. Pathways, mechanisms, and rates of polyploid formation in flowering plants. Annual Review of Ecology and Systematics 29: 467–501.

Reddy ASN. 2007. Alternative Splicing of Pre-Messenger RNAs in Plants in the Genomic Era. Annual Review of Plant Biology 58: 267–294.

Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud P-F, Lindquist EA, Kamisugi Y, et al. 2008. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319: 64–69.

Rice DW, Alverson AJ, Richardson AO, Young GJ, Sanchez-Puerta MV, Munzinger J, Barry K, Boore JL, Zhang Y, dePamphilis CW, et al. 2013. Horizontal Transfer of Entire Genomes via Mitochondrial Fusion in the Angiosperm Amborella. Science 342: 1468–1473.

Rice A, Glick L, Abadi S, Einhorn M, Kopelman NM, Salman-Minkov A, Mayzel J, Chay O, Mayrose I. 2015. The Chromosome Counts Database (CCDB) – a community resource of plant chromosome numbers. New Phytologist 206: 19–26.

Richardson AO, Palmer JD. 2007. Horizontal gene transfer in plants. Journal of Experimental Botany 58: 1–9.

Rios NE, Bart HL. 2010. GEOLocate. Tulane University Museum of Natural History, version 3.

Ruprecht C, Lohaus R, Vanneste K, Mutwil M, Nikoloski Z, Van de Peer Y, Persson S. 2017. Revisiting ancestral polyploidy in plants. Science advances 3: e1603195.

Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al. 2009. The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science 326: 1112–1115.

Schneider M, Lane L, Boutet E, Lieberherr D, Tognolli M, Bougueleret L, Bairoch A. 2009. The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program. Journal of proteomics 72: 567–573.

Schoener T. 1968. The Anolis Lizards of Bimini : Resource Partitioning in a Complex Fauna. Ecology 49: 704–726.

Schuettpelz E, Pryer KM. 2007. Fern phylogeny inferred from 400 leptosporangiate species and three plastid genes. Taxon 56: 1037.

Segraves K, Thompson J. 1999. Plant polyploidy and pollination: floral traits and visits to diploid and tetraploid Heuchera grossulariifolia. Evolution 53: 1114–1127.

Selmecki AM, Maruvka YE, Richmond PA, Guillet M, Shoresh N, Sorenson AL, De S, Kishony R, Michor F, Dowell R, et al. 2015. Polyploidy can drive rapid adaptation in yeast. Nature 519: 349–352.

Sessa EB, Banks JA, Barker MS, Der JP, Duffy AM, Graham SW, Hasebe M, Langdale J, Li F-W, Marchant DB. 2014a. Between two fern genomes. GigaScience 3: 1.

Sessa EB, Banks JA, Barker MS, Der JP, Duffy AM, Graham SW, Hasebe M, Langdale J, Li F-W, Marchant DB, et al. 2014b. Between Two Fern Genomes. GigaScience 3.

Shukla AK, Upadhyay SK, Mishra M, Saurabh S, Singh R, Singh H, Thakur N, Rai P, Pandey P, Hans AL. 2016. Expression of an insecticidal fern protein in cotton protects against whitefly. Nature biotechnology 34: 1046.

Singh RK, Gase K, Baldwin IT, Pandey SP. 2015. Molecular evolution and diversification of the Argonaute family of proteins in plants. BMC plant biology 15: 23.

Slotkin RK, Nuthikattu S, Jiang N. 2012. The Impact of Transposable Elements on Gene and Genome Evolution. In: Wendel JF, Greilhuber J, Dolezel J, Leitch IJ, eds. Plant Genome Diversity Volume 1: Plant Genomes, their Residents, and their Evolutionary Dynamics. Vienna: Springer Vienna, 35–58.

De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y. 2013. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proceedings of the National Academy of Sciences 110: 2898–2903.

Smit AFA, Hubley R. 2008. RepeatModeler Open-1.0. Available from http://www.repeatmasker.org.

Smit AFA, Hubley R, Green P. 2013. 2013–2015. RepeatMasker Open-4.0.

Smith AR, Pryer KM, Schuettpelz E, Korall P, Schneider H, Wolf PG. 2006. A classification for extant ferns. Taxon 55: 705–731.

Soltis DE. 1986. Genetic evidence for diploidy in Equisetum. American journal of botany: 908–913.

Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Depamphilis CW, Wall PK, Soltis PS. 2009. Polyploidy and angiosperm diversification. American journal of botany 96: 336–48.

Soltis DE, Buggs RJA, Barbazuk WB, Chamala S, Chester M, Gallagher JP, Schnable PS, Soltis PS. 2012. The early stages of polyploidy: rapid and repeated evolution in Tragopogon. In: Polyploidy and genome evolution. Springer, 271–292.

Soltis D, Buggs R, Doyle J, Soltis P. 2010. What we still don’t know about polyploidy. Taxon 59: 1387–1403.

Soltis PS, Liu X, Marchant DB, Visger CJ, Soltis DE. 2014a. Polyploidy and novelty: Gottlieb’s legacy. Philosophical Transactions of the Royal Society B: Biological Sciences 369.

Soltis PS, Marchant DB, Van de Peer Y, Soltis DE. 2015. Polyploidy and genome evolution in plants. Current Opinion in Genetics and Development 35.

Soltis DE, Soltis PS. 1987. Polyploidy and Breeding Systems in Homosporous Pteridophyta: A Reevaluation. The American Naturalist 130: 219–232.

Soltis PS, Soltis DE. 1988. Electrophoretic evidence for genetic diploidy in Psilotum nudum. American journal of botany: 1667–1671.

Soltis DE, Soltis PS. 1992. The distribution of selfing rates in homosporous ferns. American journal of botany (USA).

Soltis DE, Soltis PS. 1999a. Polyploidy: recurrent formation and genome evolution. Trends in Ecology & Evolution 14: 348–352.

Soltis DE, Soltis PS. 1999b. Polyploidy: recurrent formation and genome evolution. Trends in Ecology & Evolution 14: 348–352.

Soltis PS, Soltis DE (Eds.). 2012. Polyploidy and Genome Evolution. Berlin, Heidelberg: Springer Berlin Heidelberg.

Soltis PS, Soltis DE. 2016. Ancient WGD events as drivers of key innovations in angiosperms. Current Opinion in Plant Biology 30: 159–165.

Soltis D, Soltis P, Schemske D. 2007. Autopolyploidy in Angiosperms: Have We Grossly Underestimated the Number of Species? Taxon 56: 13–30.

Soltis P, Soltis D, Wolf P. 1991. Allozymic and Chloroplast DNA Analyses of Polyploidy in Polystichum (Dryopteridaceae). I. The Origins of P. californicum and P. scopulinum. Systematic botany 16: 245–256.

Soltis DE, Visger CJ, Marchant DB, Soltis PS. 2016. Polyploidy: Pitfalls and paths to a paradigm. American Journal of Botany.

Soltis DE, Visger CJ, Soltis PS. 2014b. The polyploidy revolution then… and now: Stebbins revisited. American Journal of Botany 101: 1057–1078.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.

Stebbins GL. 1950. Variation and evolution in plants. Geoffrey Cumberlege.; London.

Stebbins GL. 1984. Polyploidy and the distribution of the arctic-alpine flora: new evidence and a new approach. Bot. Helvetica 94: 1–13.

Stebbins Jr GL. 1940. The significance of polyploidy in plant evolution. The American Naturalist 74: 54–66.

Stegemann S, Bock R. 2009. Exchange of Genetic Material Between Cells in Plant Tissue Grafts. Science 324: 649–651.

Stelpflug SC, Sekhon RS, Vaillancourt B, Hirsch CN, Buell CR, de Leon N, Kaeppler SM. 2016. An expanded maize gene expression atlas based on RNA sequencing and its use to explore root development. The plant genome 9.

Sterck L, Rombauts S, Vandepoele K, Rouzé P, Van de Peer Y. 2007. How many genes are there in plants (… and why are they there)? Current Opinion in Plant Biology 10: 199–203.

Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008. Synteny and Collinearity in Plant Genomes. Science 320: 486–488.

Tang H, Bowers JE, Wang X, Paterson AH. 2010. Angiosperm genome comparisons reveal early polyploidy in the monocot lineage. Proceedings of the National Academy of Sciences 107: 472–477.

Tank DC, Eastman JM, Pennell MW, Soltis PS, Soltis DE, Hinchliff CE, Brown JW, Sessa EB, Harmon LJ. 2015. Nested radiations and the pulse of angiosperm diversification: increased diversification rates often follow whole genome duplications. The New phytologist 207: 454–67.

Testo W, Sundue M. 2016. A 4000-species dataset provides new insight into the evolution of ferns. Molecular Phylogenetics and Evolution 105: 200–211.

Thangavel G, Nayar S. 2018. A survey of MIKC type MADS-box genes in non-seed plants: Algae, Bryophytes, Lycophytes and Ferns. Frontiers in plant science 9: 510.

Theissen G, Becker A, Di Rosa A, Kanno A, Kim JT, Munster T, Winter KU, Saedler H. 2000. A short history of MADS-box genes in plants. Plant Mol Biol 42.

Theodoridis S, Randin C, Broennimann O, Patsiou T, Conti E. 2013. Divergent and narrower climatic niches characterize polyploid species of European primroses in Primula sect. Aleuritia (R Pearson, Ed.). Journal of Biogeography 40: 1278–1289.

Thompson JN, Cunningham BM, Segraves KA, Althoff DM, Wagner D. 1997. Plant Polyploidy and Insect/Plant Interactions. The American Naturalist 150: 730–743.

Thompson JD, Lumaret R. 1992. The evolutionary dynamics of polyploid plants: origins, establishment and persistence. Trends in Ecology & Evolution 7: 302–307.

Thompson JN, Merg KF. 2008. Evolution of polyploidy and the diversification of plant–pollinator interactions. Ecology 89: 2197–2206.

Thompson JN, Nuismer SL, Merg K. 2004. Plant polyploidy and the evolutionary ecology of plant/animal interactions. Biological Journal of the Linnean Society 82: 511–519.

Turelli M, Barton NH, Coyne JA. 2001. Theory and speciation. Trends in Ecology & Evolution 16: 330–343.

UniProt Consortium. 2018. UniProt: the universal protein knowledgebase. Nucleic acids research 46: 2699.

Vanneste K, Sterck L, Myburg Z, Van de Peer Y, Mizrachi E. 2015. Horsetails Are Ancient Polyploids: Evidence from Equisetum giganteum. The Plant cell: 1–13.

Visger CJ, Germain-Aubrey CC, Patel M, Sessa EB, Soltis PS, Soltis DE. 2016. Niche divergence between diploid and autotetraploid . American Journal of Botany 103: 1396–1406.

Vision TJ, Brown DG, Tanksley SD. 2000. The Origins of Genomic Duplications in Arabidopsis. Science 290: 2114–2117.

Visser V, Molofsky J. 2015. Ecological niche differentiation of polyploidization is not supported by environmental differences among species in a cosmopolitan grass genus. American journal of botany 102: 36–49.

Vitte C, Bennetzen JL. 2006. Analysis of retrotransposon structural diversity uncovers properties and propensities in angiosperm genome evolution. Proceedings of the National Academy of Sciences 103: 17638–17643.

Wagner Jr WH. 1970. Biosystematics and evolutionary noise. Taxon: 146–151.

Wagner WH, Wagner FS. 1980. Polyploidy in pteridophytes. In: Polyploidy. Springer, 199– 214.

Walker JD, Geissman JW, Bowring SA, Babcock LE. 2013. The Geological Society of America geologic time scale. Bulletin 125: 259–272.

Wallace R, Jansen R. 1995. DNA evidence for multiple origins of intergeneric allopolyploids in annual Microseris (Asteraceae). Plant Systematics and Evolution 198: 253–265.

Wang J, Guo H, Jin D, Wang X, Paterson AH. 2015. Comparative analysis of gene conversion between duplicated regions in Brassica rapa and B. oleracea genomes. In: The Brassica rapa Genome. Springer, 121–129.

Wang J, Tian L, Lee H-S, Wei NE, Jiang H, Watson B, Madlung A, Osborn TC, Doerge RW, Comai L, et al. 2006. Genomewide Nonadditive Gene Regulation in Arabidopsis Allotetraploids. Genetics 172: 507–517.

Warren DL, Glor RE, Turelli M. 2008a. Environmental niche equivalency versus conservatism: quantitative approaches to niche evolution. Evolution 62: 2868–2883.

Warren DL, Glor RE, Turelli M. 2010. ENMTools: a toolbox for comparative studies of environmental niche models. Ecography.

Warren WC, Hillier LW, Graves JAM, Birney E, Ponting CP, Grutzner F, Belov K, Miller W, Clarke L, Chinwalla AT, et al. 2008b. Genome analysis of the platypus reveals unique signatures of evolution. Nature 453: 175–183.

Werth C, Guttman S, Eshbaugh W. 1985. Electrophoretic evidence of reticulate evolution in the Appalachian Asplenium complex. Systematic Botany 10: 184–192.

Werth C, Windham M. 1991. A model for divergent, allopatric speciation of polyploid pteridophytes resulting from silencing of duplicate-gene expression. The American naturalist.

Wickett NJ, Mirarab S, Nguyen N, Warnow T, Carpenter E, Matasci N, Ayyampalayam S, Barker MS, Burleigh JG, Gitzendanner MA. 2014. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proceedings of the National Academy of Sciences 111: E4859–E4868.

Willis KJ, Bachman S. 2017. State of the world’s plants 2017. Report. Royal Botanic Gardens, Kew.

Winter K-U, Becker A, Münster T, Kim JT, Saedler H, Theissen G. 1999. MADS-box genes reveal that gnetophytes are more closely related to conifers than to flowering plants. Proceedings of the National Academy of Sciences 96: 7342–7347.

Wolf PG, Sessa EB, Marchant DB, Li F-W, Rothfels CJ, Sigel EM, Gitzendanner MA, Visger CJ, Banks JA, Soltis DE, et al. 2015. An exploration into fern genome space. Genome Biology and Evolution.

Wood TE, Takebayashi N, Barker MS, Mayrose I, Greenspoon PB, Rieseberg LH. 2009. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences 106: 13875–13879.

Workman RE, Myrka AM, Wong GW, Tseng E, Welch Jr KC, Timp W. 2018. Single-molecule, full-length transcript sequencing provides insight into the extreme metabolism of the ruby-throated hummingbird Archilochus colubris. GigaScience 7: giy009.

Wu TD, Watanabe CK. 2005. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21: 1859–1875.

Xiong Z, Gaeta RT, Pires JC. 2011. Homoeologous shuffling and chromosome compensation maintain genome balance in resynthesized allopolyploid Brassica napus. Proceedings of the National Academy of Sciences of the United States of America 108: 7908–13.

Yoo M-J, Liu X, Pires JC, Soltis PS, Soltis DE. 2014. Nonadditive Gene Expression in Polyploids. Annual Review of Genetics 48: 485–517.

Zhai J, Zhang H, Arikit S, Huang K, Nan G-L, Walbot V, Meyers BC. 2015. Spatiotemporally dynamic, cell-type–dependent premeiotic and meiotic phasiRNAs in maize anthers. Proceedings of the National Academy of Sciences 112: 3146–3151.

Zhang J, Peterson T. 2004. Transposition of reversed Ac element ends generates chromosome rearrangements in maize. Genetics 167: 1929–1937.

Zimin A, Stevens KA, Crepeau MW, Holtz-Morris A, Koriabine M, Marçais G, Puiu D, Roberts M, Wegrzyn JL, de Jong PJ, et al. 2014. Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics 196: 875–890.

BIOGRAPHICAL SKETCH

Daniel Blaine Marchant was born in Palo Alto, California, in 1989 to Kitzi and Dan Marchant. He has two younger brothers, Pierce and Graham. He has always been fascinated by the natural world, growing up with a multitude of pets ranging from rabbits and chickens to tortoises and poison dart frogs. In high school, his interests expanded into flora, and during this time he built a tropical greenhouse in which he still grows orchids, bromeliads, and carnivorous plants. During his sophomore year of high school, he contacted the Biology Department at Stanford University to find a summer research internship. Dr. Virginia Walbot responded, and Blaine interned in her corn genetics lab for the following two summers before she hired him as a summer technician until 2012. Throughout high school, Blaine played catcher for the Palo Alto High baseball team and was a team captain his senior year. Blaine attended the University of Puget Sound in Tacoma, Washington, from 2007 to 2011, where he received a B.S. in Biology. He worked on polyploidy in Arabidopsis thaliana with Dr. Andreas Madlung during his freshman and sophomore years and then completed his senior thesis on the reproductive biology of a native orchid, Goodyera oblongifolia, with Dr. Betsy Kirkpatrick. He also studied abroad in Costa Rica with the Council on International Educational Exchange (CIEE) in 2009, where he studied tropical biology and conservation. After graduating from the University of Puget Sound, Blaine was hired by his previous study abroad program, CIEE: Costa Rica, as a teaching assistant. He taught in Costa Rica for three semesters before starting his doctoral program at the University of Florida under Drs. Doug and Pam Soltis. Blaine will be starting as a postdoctoral researcher in Dr. Walbot’s lab at Stanford University in the fall of 2018.
