
Evolution by Ancient Gene and Genome Duplication in Hexapods and Land Plants

Item Type text; Electronic Dissertation

Authors Li, Zheng

Publisher The University of Arizona.

Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.

Download date 11/10/2021 02:05:02

Link to Item http://hdl.handle.net/10150/645813

EVOLUTION BY ANCIENT GENE AND GENOME DUPLICATION IN HEXAPODS

AND LAND PLANTS

by

Zheng Li

Copyright © Zheng Li 2020

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF ECOLOGY AND EVOLUTIONARY BIOLOGY

In Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2020


THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Zheng Li, titled "Evolution by Ancient Gene and Genome Duplication in Hexapods and Land Plants", and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Michael S. Barker, Date: Sep 10, 2020

Michael J. Sanderson, Date: Sep 10, 2020

John J. Wiens, Date: Sep 10, 2020

Wendy Moore, Date: Sep 10, 2020

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Michael S. Barker, Date: Sep 10, 2020

Dissertation Committee Chair

Department of Ecology & Evolutionary Biology


DEDICATION

This dissertation is dedicated to my beloved mother, Guohuan He.


TABLE OF CONTENTS

ABSTRACT

INTRODUCTION

PRESENT STUDY

REFERENCES

APPENDIX A: EARLY GENOME DUPLICATIONS IN CONIFERS AND OTHER SEED PLANTS

APPENDIX B: MULTIPLE LARGE-SCALE GENE AND GENOME DUPLICATIONS DURING THE EVOLUTION OF HEXAPODS

APPENDIX C: ANCIENT GENOME DUPLICATIONS AND LOW RATE OF CHROMOSOME LOSS EXPLAIN FERNS WITH HIGH CHROMOSOME NUMBERS

APPENDIX D: PATTERNS AND PROCESSES OF DIPLOIDIZATION IN LAND PLANTS


ABSTRACT

Gene and genome duplications have been found across the eukaryotic tree of life. Yet many aspects of evolution by gene and genome duplication remain unclear, especially for ancient duplications. In my dissertation, I focus on the incidence of ancient gene and genome duplication in different lineages of land plants and hexapods, their impact on genome evolution, and the patterns and processes of diploidization following polyploidy. In Appendix A, I use transcriptomes of gymnosperms and outgroups, together with a novel phylogenetic algorithm, to provide the first comprehensive study of ancient whole genome duplication (WGD) in gymnosperms. In Appendix B, I use over 150 insect genomes and transcriptomes to infer ancient WGDs and other large-scale gene duplications during the evolution of hexapods. In Appendix C, I investigate ancient WGD in ferns using over 140 transcriptomes and test the long-standing hypothesis on high chromosome numbers in ferns. In Appendix D, I summarize current studies on the patterns and processes of diploidization in land plants and provide directions for testing hypotheses and understanding diploidization in the future. As a whole, this work improves our understanding of the mode and tempo of eukaryotic genome evolution and diversity.


INTRODUCTION

Polyploidy is perhaps best recognized as an important component of evolution and speciation (Stebbins, 1950; Grant, 1981). In flowering plants and ferns, about 15% and 31% of speciation events, respectively, are due to recent polyploidization (Wood et al., 2009). Most land plants are now known to be ancient polyploids that have rediploidized (One Thousand Plant Transcriptomes Initiative, 2019). Comparative genomic studies have shown that plant genomes are highly dynamic and that plants experienced cycles of polyploidy followed by diploidization (Wolfe, 2001; Jiao et al., 2011; Arrigo and Barker, 2012; Li et al., 2015; Wendel, 2015; Van de Peer et al., 2017; One Thousand Plant Transcriptomes Initiative, 2019). In contrast to plants, polyploidy is considered rarer in animals (Muller, 1925; Orr, 1990; Mable, 2004). Recent advances in genome sequencing and analysis have revealed ancient polyploidy across the tree of life (Aury et al., 2006; Storchová et al., 2006; Van de Peer et al., 2017; Li, Tiley, et al., 2018; One Thousand Plant Transcriptomes Initiative, 2019). Polyploidy is also an important component of genome evolution. Genome duplication doubles genome size and chromosome number during polyploid formation (Otto, 2007). The process of diploidization following polyploidy can significantly change genome content (Arrigo and Barker, 2012; Murat et al., 2017) and gene networks (Thomas et al., 2006; Freeling, 2009; Defoort et al., 2019), and may have provided novel genetic variation that was important for the evolution of plant diversity (Tank et al., 2015; Landis et al., 2018).


Ancient genome duplications in gymnosperms

Polyploidy is a common mode of speciation and evolution in angiosperms (Stebbins, 1950; Grant, 1981). Signatures of ancient WGD are found in most vascular plants (One Thousand Plant Transcriptomes Initiative, 2019). Recent analyses also provide evidence of an ancient WGD shared by seed plants, and indicate that all flowering plants experienced another round of WGD (Jiao et al., 2011). In contrast, there is little evidence that whole genome duplication played a significant role in the gymnosperms (Wood et al., 2009), except in a few genera (such as Ephedra) in which polyploidy is prevalent (Ahuja, 2005; Ickert-Bond et al., 2020).

One of the major consequences of polyploidy is an increase in genome size through doubling of the whole genome (Otto, 2007). Many gymnosperms, especially conifers, have huge genomes (Murray, 1998; Ahuja and Neale, 2005). The first gymnosperm genome sequenced, that of the Norway spruce, was published in Nature in 2013. In that study, researchers found evidence of the seed plant genome duplication but no ancient WGD specific to gymnosperms (Nystedt et al., 2013). They proposed that the large genome size of conifers is due to transposable element proliferation without ancient WGD (Nystedt et al., 2013). However, this result contradicted some previous studies of G-banding and genome size in conifers, which suggested that ancient WGD might have occurred during the evolution of conifers (Drewry, 1982; Ahuja, 2005).

Before my dissertation research, little was known about ancient genome duplication in gymnosperms. It was also unclear whether the seed plant WGD is restricted to seed plants or shared with monilophytes. In Appendix A, I used 24 transcriptomes of gymnosperms and three outgroups, and a novel phylogenetic algorithm, to provide the first comprehensive study of ancient WGD in gymnosperms. My study also improved the phylogenetic placement of the ancestral seed plant WGD.

Ancient gene and genome duplications in hexapods

In contrast to plants, polyploidy has long been considered rarer in animals (Muller, 1925; Orr, 1990). Although less common, polyploidy has been found in some animal lineages (Otto and Whitton, 2000; Mable, 2004). Multiple recent WGDs have been recorded in fish (Mable et al., 2011). All teleost fish have polyploid ancestry, and all Salmonidae share another ancient WGD (Van de Peer et al., 2003; Van de Peer, 2004; Meyer and Van de Peer, 2005; Berthelot et al., 2014). More broadly, Ohno hypothesized that two rounds of ancient WGD (the 2R hypothesis) occurred in the ancestry of all vertebrates (Ohno, 1970). According to this hypothesis, most vertebrates, including humans, are descendants of an ancient polyploid (Makalowski, 2001). Although whether two rounds of WGD occurred has been debated (Furlong and Holland, 2002; Hughes and Robert, 2003; Kasahara, 2007), several recent genomic studies provide strong evidence for the 2R hypothesis (Dehal and Boore, 2005; Putnam et al., 2008; Smith et al., 2013).

Recent genomic analyses have also revealed multiple paleopolyploidies in the ancestry of various invertebrate lineages. In chelicerates, evidence of ancient WGD has been found in horseshoe crabs (Nossa et al., 2014; Kenny et al., 2017; Nong et al., 2020; Shingate et al., 2020) and spiders (Clarke et al., 2015; Schwager et al., 2017). Genome and chromosome number studies have also found ancient WGD in some molluscs (Hallinan and Lindberg, 2011; Yoshida et al., 2011; Liu et al., 2020). Previous studies also found evidence of recent polyploidy in some insects and suggested that it might be associated with parthenogenesis (Lokki and Saura, 1979; Otto and Whitton, 2000; Jacobson et al., 2013). However, given the small number of known polyploids among invertebrates relative to the hyperdiversity of this group, the frequency of invertebrate polyploidy appears to be relatively low. Before my dissertation research, no evidence of paleopolyploidy among Hexapoda had been found. In Appendix B, I used over 100 transcriptomes and 50 insect genomes to infer ancient WGD and other large-scale gene duplications during the evolution of hexapods.

Ancient genome duplications and chromosome loss in ferns

Genome duplication doubles the chromosome number during polyploid formation. In flowering plants, extensive chromosomal rearrangements and chromosome losses have been found within ten generations of polyploid formation (Xiong et al., 2011; Chester et al., 2012). This rapid chromosomal evolution can result in dysploidy and a return to diploid-like chromosome numbers (Mandáková and Lysak, 2018). This might explain why most flowering plants do not have high chromosome numbers even after experiencing many rounds of WGD (Wood et al., 2009; Rice et al., 2015). For example, Arabidopsis thaliana has experienced at least five rounds of ancient WGD and still has a gametic chromosome number of five (The Arabidopsis Genome Initiative, 2000; Wolfe, 2001).
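The scale of chromosome loss implied by this example can be illustrated with simple arithmetic. The sketch below assumes that each WGD exactly doubles the gametic chromosome number and that no chromosomes are lost or fused afterwards. The function name and the ancestral number used in the loop are illustrative placeholders, not estimates from this dissertation.

```python
def chromosome_number_without_loss(ancestral_n: int, wgd_rounds: int) -> int:
    """Gametic chromosome number after repeated doubling with no loss.

    Assumes every WGD exactly doubles n and nothing is lost afterwards.
    """
    if ancestral_n < 1 or wgd_rounds < 0:
        raise ValueError("ancestral_n must be >= 1 and wgd_rounds >= 0")
    return ancestral_n * 2 ** wgd_rounds


# Hypothetical ancestral gametic number, chosen only for illustration.
HYPOTHETICAL_ANCESTRAL_N = 5

for rounds in range(6):
    n = chromosome_number_without_loss(HYPOTHETICAL_ANCESTRAL_N, rounds)
    print(f"{rounds} WGD round(s) without loss: n = {n}")
```

Under these assumptions, five doublings multiply the starting number by 32, so a lineage that still has a low gametic number after five rounds must have lost or fused most of the duplicated chromosomes.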

In contrast to the pattern in flowering plants, one of the most intriguing patterns in plant evolution is the high chromosome numbers of ferns (Klekowski and Baker, 1966). The average gametic chromosome number in homosporous ferns is n = 57.05 (Klekowski and Baker, 1966). Unlike homosporous ferns, angiosperms and heterosporous ferns have average haploid chromosome numbers of n = 15.99 (Grant, 1963) and n = 13.62 (Klekowski and Baker, 1966), respectively. Numerous hypotheses have been proposed to explain the origin and maintenance of high chromosome numbers in homosporous ferns (Grant, 1963). The alternative hypotheses, such as ascending aneuploidy or high ancestral chromosome numbers, are not well supported because neopolyploid speciation is common in ferns (Otto and Whitton, 2000; Wood et al., 2009) and cytological variation is predominantly euploid (Love et al., 1977). The most compelling hypothesis invokes multiple rounds of whole genome duplication (WGD) (Haufler and Soltis, 1986; Haufler, 1987) combined with active gene silencing without chromosome loss in a polyploid genome (Gastony, 1991).

However, a previous study using linkage mapping in Ceratopteris did not resolve paleopolyploidy (Nakazato et al., 2006). Evidence of ancient WGD was not inferred until recently, when transcriptomes and genomes became available (Barker, 2012; Vanneste et al., 2015; Li, Brouwer, et al., 2018; Clark et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019). Phylogenomic and syntenic analyses of the first two fern genomes revealed two rounds of ancient WGD in Azolla (Li, Brouwer, et al., 2018). In the One Thousand Plant Transcriptomes (1KP) project, 21 ancient polyploidy events were inferred during the evolution of monilophytes (One Thousand Plant Transcriptomes Initiative, 2019). Signatures of ancient WGD are found in most of the ferns in 1KP (One Thousand Plant Transcriptomes Initiative, 2019). Prior to my dissertation, the ‘multiple ancient WGD’ hypothesis had not been tested. In Appendix C, I used over 140 fern transcriptomes to infer ancient WGD in ferns and test this leading hypothesis.

Patterns and Processes of Diploidization in Land Plants

Genome sequencing and comparative genomic analyses have provided conclusive evidence that plants experienced cycles of polyploidy followed by diploidization (Wolfe, 2001; Jiao et al., 2011; Arrigo and Barker, 2012; Li et al., 2015; Wendel, 2015; Van de Peer et al., 2017; One Thousand Plant Transcriptomes Initiative, 2019). We have learned a great deal about polyploidization over the past century (Lutz, 1907; Winge, 1917; Barker et al., 2016). However, we know comparatively little about the mechanisms and forces that drive diploidization (Wolfe, 2001; Dodsworth et al., 2016). In general, diploidization is the return of a polyploid genome to a diploid state (Wolfe, 2001; Wendel, 2015; Soltis et al., 2016; Mandáková and Lysak, 2018). The restoration of bivalent chromosome pairing behavior and the associated disomic inheritance is considered a key feature of diploidization (Clausen, 1941; Stebbins, 1947; Stebbins, 1950; Grant, 1981).

Many mechanisms of genome evolution contribute to diploidization, and these can be broadly grouped into two major processes: cytological diploidization and genic diploidization/fractionation (Ma and Gustafson, 2005; Mandáková and Lysak, 2018). Cytological diploidization is the process of chromosomal evolution, such as chromosomal rearrangement, fission, and fusion, that eventually leads to the restoration of bivalent pairing and disomic inheritance following polyploidy. Genic diploidization/fractionation is the process of gene removal and loss following polyploidy. This results in only a subset of genes being retained as paralogs over time. These two processes may occur largely independently of each other and at different rates, yielding a diversity of genomes with different patterns of diploidization following polyploidy across lineages (Otto and Whitton, 2000; Wolfe, 2001; Mandáková et al., 2017).

In Appendix D, I summarized current studies on the patterns and processes of diploidization in land plants and compared these patterns and processes to those in animals and other eukaryotes. In this study, I conducted a new survey of the plant cytological literature to assess the distribution of bivalent pairing among contemporary polyploid species. I also reviewed differences in the rate of diploidization in plants and presented new analyses of the rates of gene loss across land plants. Finally, I summarized the growing importance of understanding diploidization and provided directions for testing hypotheses on diploidization.


PRESENT STUDY

Each of the studies discussed is presented in the appendices in the form of a manuscript for publication. Appendix A has been published in Science Advances (Li et al., 2015); Appendix B has been published in PNAS (Li, Tiley, et al., 2018); Appendix C has been submitted to Nature Plants; and Appendix D has been submitted to Annual Review of Plant Biology (Li et al., 2020). The following is a summary of the important findings of each study.

Appendix A: Early genome duplications in conifers and other seed plants

The first gymnosperm genome sequenced, that of the Norway spruce, was published in Nature in 2013 (Nystedt et al., 2013). Despite its large genome size (over 20 Gbp), researchers found no evidence of ancient WGD specific to gymnosperms. Based on this result, they proposed that the large genome size of conifers is due solely to transposable element proliferation, without ancient polyploidy (Nystedt et al., 2013). Although most evidence indicates that polyploid speciation is relatively rare among extant gymnosperms (except in a few genera, such as Ephedra) (Ahuja, 2005; Ickert-Bond et al., 2020), previous analyses of conifer genome sizes and chromosomes suggested that paleopolyploidy might have occurred in Pinaceae (Drewry, 1982; Ahuja, 2005).

In Appendix A, I test for evidence of ancient polyploidy in gymnosperms. I assembled transcriptomes for 24 gymnosperms and 3 outgroups, including representatives of all major gymnosperm lineages. Ancient WGDs were inferred from age distributions of gene duplications (Ks plots) by analyzing transcriptomes of single species with the DupPipe pipeline (Barker et al., 2010). I also introduce a new gene tree topology count algorithm, Multi-tAxon Paleopolyploidy Search (MAPS), to place inferred paleopolyploid events in phylogenetic context (Li et al., 2015).
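The idea behind reading a Ks plot can be sketched in a few lines of code. The example below is a deliberately simplified illustration and is not the DupPipe implementation: it assumes that Ks values for duplicate gene pairs are already available, bins them into a histogram, and flags interior bins that are strict local maxima as candidate WGD peaks. The function names, bin width, and count threshold are hypothetical choices, and real analyses typically rely on statistical peak detection, such as mixture models, rather than raw bin counts.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=2.0):
    """Count duplicate pairs per Ks bin; values outside [0, max_ks) are ignored."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            # Clamp the index so floating-point rounding cannot overflow the list.
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


def candidate_peaks(counts, min_count=5):
    """Indices of interior bins that are strict local maxima.

    The first bin is skipped on purpose: it holds the youngest duplicates,
    which usually dominate the distribution without indicating a WGD.
    """
    peaks = []
    for i in range(1, len(counts) - 1):
        is_local_max = counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
        if is_local_max and counts[i] >= min_count:
            peaks.append(i)
    return peaks


# Synthetic Ks values: a decaying background of young duplicates plus a
# cluster of older duplicates centered near Ks = 0.85.
synthetic_ks = (
    [0.05] * 20 + [0.15] * 10 + [0.25] * 6 + [0.35] * 4
    + [0.75] * 8 + [0.85] * 15 + [0.95] * 7
)
counts = ks_histogram(synthetic_ks)
peaks = candidate_peaks(counts)
print("Candidate peak bins:", peaks)
```

In this toy dataset the cluster of older duplicates stands out against the decaying background, which is the qualitative signature that a Ks-based analysis looks for before a method such as MAPS is used to place the event on a phylogeny.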

I found evidence for three ancient WGDs during the evolution of gymnosperms: in the ancestry of Pinaceae, cupressophyte conifers, and Welwitschia (Gnetales). Contrary to previous genomic research that reported an absence of polyploidy in the ancestry of contemporary gymnosperms (Nystedt et al., 2013), our analyses indicate that polyploidy has contributed to the evolution of conifers and other gymnosperms. I also confirm that a WGD hypothesized to be restricted to seed plants is indeed not shared with the monilophytes, a result that was unclear in earlier studies (Jiao et al., 2011).

Appendix B: Multiple large-scale gene and genome duplications during the evolution of hexapods

Although polyploidy occurs in both animals and plants, it has long been recognized that the incidence of polyploidy is much lower in animals than in plants (Muller, 1925; Orr, 1990). Ancient polyploidy is found in the ancestry of vertebrates (Lagman et al., 2013; Braasch et al., 2016; Sacerdot et al., 2018; Smith et al., 2018) and teleost fishes (Jaillon et al., 2004; Meyer and Van de Peer, 2005; Inoue et al., 2015), but there is little evidence for ancient WGDs in invertebrates (Otto and Whitton, 2000). In Appendix B, I use newly available transcriptomes and genomes of insects to investigate whether ancient genome duplication and other large-scale gene duplication occurred during the evolution of hexapods.

I assembled genomic data from more than 150 species across the insect phylogeny. To infer ancient gene and genome duplications, I used a combination of Ks plots, ortholog divergence analyses, and MAPS phylogenomic methods (Barker et al., 2010; Li, Tiley, et al., 2018). I also improved the MAPS algorithm by incorporating statistical comparisons to null and positive simulations of whole genome duplication (Li, Tiley, et al., 2018). To further corroborate the nature of these duplications, I evaluated the pattern of gene retention from putative WGDs observed in the gene age distributions.

I found evidence for 18 ancient WGDs and six other large-scale bursts of gene duplication during insect evolution. I also observed a strong signal of parallel gene retention across many of the putative insect WGDs, likely driven by gene dosage balance. This work provides the first evidence of ancient WGD in hexapods and highlights that ancient genome duplication and other large-scale gene duplication likely contributed to the evolution of hexapods.

Appendix C: Ancient genome duplications and low rate of chromosome loss explain ferns with high chromosome numbers

A longstanding question in plant evolution is why homosporous ferns have much higher chromosome numbers than flowering plants (Klekowski and Baker, 1966). The leading hypothesis suggests that ancient polyploidy without chromosome loss can explain high chromosome numbers in ferns (Haufler and Soltis, 1986; Haufler, 1987). In Appendix C, I test this leading hypothesis using phylogenomic inferences of ancient polyploidy and phylogenetic reconstruction of chromosome evolution in monilophytes.

To investigate ancient WGDs in ferns, I assembled a dataset of over 140 fern transcriptomes. I used a total evidence approach that combined inferences of WGDs from Ks and MAPS phylogenomic methods to infer and place putative ancient WGDs across the monilophytes. I also reconstructed the rates and patterns of chromosome number evolution across more than 2,300 vascular plant genera with ChromEvol (Otto and Whitton, 2000; Mayrose et al., 2010). In addition, I explored whether this genomic characteristic of ferns is influenced by genes related to meiosis by comparing rates of protein evolution for these genes in angiosperms and ferns.

I found evidence of 26 putative ancient WGDs during the evolutionary history of monilophytes, and ferns have on average experienced 3.76 rounds of ancient polyploidy. The phylogenetic reconstruction of chromosome evolution also shows lower rates of ascending and descending dysploidy in ferns than in angiosperms. I also found consistently higher rates of protein evolution in meiosis genes compared to the background rate in angiosperms but not in ferns. Overall, the results of this study are consistent with the predictions of the ancient polyploidy hypothesis, and my work provides comprehensive evidence to support this longstanding hypothesis (Haufler and Soltis, 1986; Haufler, 1987).

Appendix D: Patterns and processes of diploidization in land plants

Previous genomic analyses indicate that plant genomes are highly dynamic and have experienced repeated rounds of ancient WGD followed by diploidization (Schnable et al., 2011; Barker et al., 2012; Wendel, 2015; Soltis et al., 2016). Despite the prevalence of WGDs across the vascular plant phylogeny, we know relatively little about the processes and mechanisms of diploidization, especially outside of the flowering plants (Wolfe, 2001; Dodsworth et al., 2016).

In Appendix D, I provide a literature review to summarize the current understanding of the patterns and processes of diploidization. In addition, I perform a meta-analysis to assess the frequency of bivalent and tetravalent pairing in allopolyploids vs. autopolyploids based on previous cytological studies. To further investigate the rate of diploidization across land plants, I analyzed patterns of gene retention and loss across a thousand transcriptomes from the 1KP project (One Thousand Plant Transcriptomes Initiative, 2019).

In the meta-analysis, I found 208 polyploid systems with at least one record of chromosome pairing behaviour during meiosis. This work quantifies the continuum of variation in polyploid chromosome pairing behaviour. My analyses also provide evidence that rates of gene loss following WGD are better explained by variation among phylogenetically related lineages than simply by time since a WGD. This study highlights that understanding diploidization in plants will require evaluating data from phylogenetically diverse lineages. Combined with the findings on patterns of chromosome loss in vascular plants in Appendix C, these results enhance our understanding of the mode and tempo of land plant genome evolution.


REFERENCES

Ahuja, M. R. 2005. Polyploidy in gymnosperms: Revisited. Silvae Genetica 54: 59–69.

Ahuja, M. R., and D. B. Neale. 2005. Evolution of genome size in conifers. Silvae Genetica 54: 126–137.

Arrigo, N., and M. S. Barker. 2012. Rarely successful polyploids and their legacy in plant genomes. Current opinion in plant biology 15: 140–146.

Aury, J.-M., O. Jaillon, L. Duret, B. Noel, C. Jubin, B. M. Porcel, B. Ségurens, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444: 171–178.

Barker, M. S. 2012. Karyotype and genome evolution in Pteridophytes. Plant Genome Diversity Volume 2: 245–253.

Barker, M. S., G. J. Baute, and S.-L. Liu. 2012. Duplications and turnover in plant genomes. Plant Genome Diversity Volume 1: 155–169.

Barker, M. S., K. M. Dlugosch, L. Dinh, R. S. Challa, N. C. Kane, M. G. King, and L. H. Rieseberg. 2010. EvoPipes.net: Bioinformatic tools for ecological and evolutionary genomics. Evolutionary bioinformatics online 6: 143–149.

Barker, M. S., B. C. Husband, and J. C. Pires. 2016. Spreading Winge and flying high: The evolutionary importance of polyploidy after a century of study. American journal of botany 103: 1139–1145.

Berthelot, C., F. Brunet, D. Chalopin, A. Juanchich, M. Bernard, B. Noël, P. Bento, et al. 2014. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nature communications 5: 3657.

Braasch, I., A. R. Gehrke, J. J. Smith, K. Kawasaki, T. Manousaki, J. Pasquier, A. Amores, et al. 2016. The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons. Nature genetics 48: 427–437.

Chester, M., J. P. Gallagher, V. V. Symonds, A. V. C. da Silva, E. V. Mavrodiev, A. R. Leitch, P. S. Soltis, and D. E. Soltis. 2012. Extensive chromosomal variation in a recently formed natural allopolyploid species, Tragopogon miscellus (Asteraceae). Proceedings of the National Academy of Sciences 109: 1176–1181.

Clarke, T. H., J. E. Garb, C. Y. Hayashi, P. Arensburger, and N. A. Ayoub. 2015. Spider transcriptomes identify ancient large-scale gene duplication event potentially important in silk gland evolution. Genome biology and evolution 7: 1856–1870.

Clark, J. W., M. N. Puttick, and P. C. J. Donoghue. 2019. Origin of horsetails and the role of whole-genome duplication in plant macroevolution. Proceedings of the Royal Society B: Biological sciences 286: 20191662.

Clausen, R. E. 1941. Polyploidy in Nicotiana. The American Naturalist 75: 291–306.

Defoort, J., Y. Van de Peer, and L. Carretero-Paulet. 2019. The evolution of gene duplicates in angiosperms and the impact of protein-protein interactions and the mechanism of duplication. Genome biology and evolution 11: 2292–2305.

Dehal, P., and J. L. Boore. 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS biology 3: e314.

Dodsworth, S., M. W. Chase, and A. R. Leitch. 2016. Is post-polyploidization diploidization the key to the evolutionary success of angiosperms? Botanical Journal of the Linnean Society 180: 1–5.

Drewry, A. 1982. G-banded chromosomes in Pinus resinosa. Journal of Heredity 73: 305–306.

Freeling, M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual review of plant biology 60: 433–453.

Furlong, R. F., and P. W. H. Holland. 2002. Were vertebrates octoploid? Philosophical transactions of the Royal Society of London. Series B, Biological sciences 357: 531–544.

Gastony, G. J. 1991. Gene silencing in a polyploid homosporous fern: paleopolyploidy revisited. Proceedings of the National Academy of Sciences of the United States of America 88: 1602–1605.

Grant, V. 1981. Plant Speciation. Columbia University Press.

Grant, V. 1963. The Origin of Adaptations. Columbia University Press.

Hallinan, N. M., and D. R. Lindberg. 2011. Comparative analysis of chromosome counts infers three paleopolyploidies in the Mollusca. Genome biology and evolution 3: 1150–1163.

Haufler, C. H. 1987. Electrophoresis is modifying our concepts of evolution in homosporous pteridophytes. American journal of botany 74: 953–966.

Haufler, C. H., and D. E. Soltis. 1986. Genetic evidence suggests that homosporous ferns with high chromosome numbers are diploid. Proceedings of the National Academy of Sciences of the United States of America 83: 4389–4393.

Hughes, A. L., and F. Robert. 2003. 2R or not 2R: Testing hypotheses of genome duplication in early vertebrates. Genome Evolution, 85–93.

Ickert-Bond, S. M., A. Sousa, Y. Min, I. Loera, J. Metzgar, J. Pellicer, O. Hidalgo, and I. J. Leitch. 2020. Polyploidy in gymnosperms - Insights into the genomic and evolutionary consequences of polyploidy in Ephedra. Molecular phylogenetics and evolution 147: 106786.

The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.

Inoue, J., Y. Sato, R. Sinclair, K. Tsukamoto, and M. Nishida. 2015. Rapid genome reshaping by multiple-gene loss after whole-genome duplication in teleost fish suggested by mathematical modeling. Proceedings of the National Academy of Sciences of the United States of America 112: 14918–14923.

Jacobson, A. L., J. S. Johnston, D. Rotenberg, A. E. Whitfield, W. Booth, E. L. Vargo, and G. G. Kennedy. 2013. Genome size and ploidy of Thysanoptera. Insect molecular biology 22: 12–17.

Jaillon, O., J.-M. Aury, F. Brunet, J.-L. Petit, N. Stange-Thomann, E. Mauceli, L. Bouneau, et al. 2004. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431: 946–957.

Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr, P. E. Ralph, L. P. Tomsho, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100.

Kasahara, M. 2007. The 2R hypothesis: an update. Current opinion in immunology 19: 547–552.

Kenny, N. J., K. W. Chan, W. Nong, Z. Qu, I. Maeso, H. Y. Yip, T. F. Chan, et al. 2017. Ancestral whole-genome duplication in the marine chelicerate horseshoe crabs. Heredity 119: 388.

Klekowski, E. J., Jr, and H. G. Baker. 1966. Evolutionary significance of polyploidy in the pteridophyta. Science 153: 305–307.

Lagman, D., D. Ocampo Daza, J. Widmark, X. M. Abalo, G. Sundström, and D. Larhammar. 2013. The vertebrate ancestral repertoire of visual opsins, transducin alpha subunits and oxytocin/vasopressin receptors was established by duplication of their shared genomic region in the two rounds of early vertebrate genome duplications. BMC evolutionary biology 13: 238.

Landis, J. B., D. E. Soltis, Z. Li, and H. E. Marx. 2018. Impact of whole‐genome duplication events on diversification rates in angiosperms. American journal of botany 105: 348–363.

Li, F.-W., P. Brouwer, L. Carretero-Paulet, S. Cheng, J. de Vries, P.-M. Delaux, A. Eily, et al. 2018. Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nature plants 4: 460–472.

Liu, C., Y. Ren, Z. Li, Q. Hu, L. Yin, X. Qiao, Y. Zhang, et al. 2020. Giant African snail genomes provide insights into molluscan whole-genome duplication and aquatic-terrestrial transition. bioRxiv: 2020.02.02.930693.

Li, Z., A. E. Baniaga, E. B. Sessa, M. Scascitelli, S. W. Graham, L. H. Rieseberg, et al. 2015. Early genome duplications in conifers and other seed plants. Science Advances 1: e1501084.

Li, Z., M. T. W. McKibben, G. S. Finch, P. D. Blischak, B. L. Sutherland, and M. S. Barker. 2020. Patterns and processes of diploidization in land plants.

Li, Z., G. P. Tiley, S. R. Galuska, C. R. Reardon, T. I. Kidder, R. J. Rundell, and M. S. Barker. 2018. Multiple large-scale gene and genome duplications during the evolution of hexapods. Proceedings of the National Academy of Sciences of the United States of America 115: 4713–4718.

Lokki, J., and A. Saura. 1979. Polyploidy in insect evolution. Basic life sciences 13: 277–312.

Love, A., D. Love, and R. E. G. Pichi Sermolli. 1977. Cytotaxonomical atlas of the Pteridophyta. J. Cramer, Vaduz. 398 p.

Lutz, A. M. 1907. A preliminary note on the chromosomes of lamarckiana and one of its mutants, O. gigas. Science 26: 151–152.

Mable, B. K. 2004. ‘Why polyploidy is rarer in animals than in plants’: myths and mechanisms. Biological journal of the Linnean Society. Linnean Society of London

82: 453–466.

Mable, B. K., M. A. Alexandrou, and M. I. Taylor. 2011. Genome duplication in amphibians and fish: an extended synthesis. Journal of zoology 284: 151–182.

Makalowski, W. 2001. Are we polyploids? A brief history of one hypothesis.

Genome research 11: 667–670.

Mandáková, T., Z. Li, M. S. Barker, and M. A. Lysak. 2017. Diverse genome organization following 13 independent mesopolyploid events in contrasts with convergent patterns of gene retention. The Plant journal: for cell and molecular biology 91: 3–21.

Mandáková, T., and M. A. Lysak. 2018. Post-polyploid diploidization and diversification through dysploid changes. Current opinion in plant biology 42: 55–65.

Ma, X.-F., and J. P. Gustafson. 2005. Genome evolution of allopolyploids: a process of cytological and genetic diploidization. Cytogenetic and genome research 109: 236–249.

Mayrose, I., M. S. Barker, and S. P. Otto. 2010. Probabilistic models of chromosome number evolution and the inference of polyploidy. Systematic biology 59: 132–144.

Meyer, A., and Y. Van de Peer. 2005. From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). BioEssays 27: 937–945.

Muller, H. J. 1925. Why polyploidy is rarer in animals than in plants. The American naturalist 59: 346–353.

Murat, F., A. Armero, C. Pont, C. Klopp, and J. Salse. 2017. Reconstructing the genome of the most recent common ancestor of flowering plants. Nature Genetics 49: 490–496.

Murray, B. 1998. Nuclear DNA amounts in gymnosperms. Annals of Botany 82: 3–15.

Nakazato, T., M.-K. Jung, E. A. Housworth, L. H. Rieseberg, and G. J. Gastony. 2006. Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii. Genetics 173: 1585–1597.

Nong, W., Z. Qu, Y. Li, T. Barton-Owen, A. Y. P. Wong, H. Y. Yip, H. T. Lee, et al. 2020. Horseshoe crab genomes reveal the evolutionary fates of genes and microRNAs after three rounds (3R) of whole genome duplication. bioRxiv: 2020.04.16.045815.

Nossa, C. W., P. Havlak, J.-X. Yue, J. Lv, K. Y. Vincent, H. Brockmann, and N. H. Putnam. 2014. Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication. GigaScience 3: 9.

Nystedt, B., N. R. Street, A. Wetterbom, A. Zuccolo, Y.-C. Lin, D. G. Scofield, F. Vezzi, et al. 2013. The Norway spruce genome sequence and conifer genome evolution. Nature 497: 579–584.

Ohno, S. 1970. Evolution by Gene Duplication.

One Thousand Plant Transcriptomes Initiative. 2019. One thousand plant transcriptomes and the phylogenomics of green plants. Nature 574: 679–685.

Orr, H. A. 1990. ‘Why Polyploidy is Rarer in Animals Than in Plants’ Revisited. The American naturalist 136: 759–770.

Otto, S. P. 2007. The evolutionary consequences of polyploidy. Cell 131: 452–462.

Otto, S. P., and J. Whitton. 2000. Polyploid incidence and evolution. Annual review of genetics 34: 401–437.

Putnam, N. H., T. Butts, D. E. K. Ferrier, R. F. Furlong, U. Hellsten, T. Kawashima, M. Robinson-Rechavi, et al. 2008. The amphioxus genome and the evolution of the chordate karyotype. Nature 453: 1064–1071.

Rice, A., L. Glick, S. Abadi, M. Einhorn, N. M. Kopelman, A. Salman-Minkov, J. Mayzel, et al. 2015. The Chromosome Counts Database (CCDB) - a community resource of plant chromosome numbers. The New phytologist 206: 19–26.

Sacerdot, C., A. Louis, C. Bon, C. Berthelot, and H. Roest Crollius. 2018. Chromosome evolution at the origin of the ancestral vertebrate genome. Genome biology 19: 166.

Schnable, J. C., N. M. Springer, and M. Freeling. 2011. Differentiation of the maize subgenomes by genome dominance and both ancient and ongoing gene loss. Proceedings of the National Academy of Sciences of the United States of America 108: 4069–4074.

Schwager, E. E., P. P. Sharma, T. Clarke, D. J. Leite, T. Wierschin, M. Pechmann, Y. Akiyama-Oda, et al. 2017. The house spider genome reveals an ancient whole-genome duplication during arachnid evolution. BMC biology 15: 62.

Shingate, P., V. Ravi, A. Prasad, B.-H. Tay, K. M. Garg, B. Chattopadhyay, L.-M. Yap, et al. 2020. Chromosome-level assembly of the horseshoe crab genome provides insights into its genome evolution. Nature communications 11: 2322.

Smith, J. J., S. Kuraku, C. Holt, T. Sauka-Spengler, N. Jiang, M. S. Campbell, M. D. Yandell, et al. 2013. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature genetics 45: 415–21, 421e1–2.

Smith, J. J., N. Timoshevskaya, C. Ye, C. Holt, M. C. Keinath, H. J. Parker, M. E. Cook, et al. 2018. The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution. Nature genetics 50: 270–277.

Soltis, D. E., C. J. Visger, D. B. Marchant, and P. S. Soltis. 2016. Polyploidy: Pitfalls and paths to a paradigm. American journal of botany 103: 1146–1166.

Stebbins, G. L. 1947. Types of polyploids: their classification and significance. Advances in genetics 1: 403–429.

Stebbins, G. L. 1950. Variation and evolution in plants.

Storchová, Z., A. Breneman, J. Cande, J. Dunn, K. Burbank, E. O’Toole, and D. Pellman. 2006. Genome-wide genetic analysis of polyploidy in yeast. Nature 443: 541–547.

Tank, D. C., J. M. Eastman, M. W. Pennell, P. S. Soltis, D. E. Soltis, C. E. Hinchliff, J. W. Brown, et al. 2015. Nested radiations and the pulse of angiosperm diversification: increased diversification rates often follow whole genome duplications. The New phytologist 207: 454–467.

Thomas, B. C., B. Pedersen, and M. Freeling. 2006. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome research 16: 934–946.

Van de Peer, Y. 2004. Tetraodon genome confirms Takifugu findings: most fish are ancient polyploids. Genome biology 5: 250.

Van de Peer, Y., E. Mizrachi, and K. Marchal. 2017. The evolutionary significance of polyploidy. Nature reviews. Genetics 18: 411–424.

Van de Peer, Y., J. S. Taylor, and A. Meyer. 2003. Are all fishes ancient polyploids? Journal of structural and functional genomics 3: 65–73.

Vanneste, K., L. Sterck, A. A. Myburg, Y. Van de Peer, and E. Mizrachi. 2015. Horsetails are ancient polyploids: evidence from Equisetum giganteum. The Plant cell 27: 1567–1578.

Wendel, J. F. 2015. The wondrous cycles of polyploidy in plants. American journal of botany 102: 1753–1756.

Winge, O. 1917. The chromosomes. Their numbers and general importance. Compt. Rend. Trav. du Lab. de Carlsberg 13: 131–175.

Wolfe, K. H. 2001a. Yesterday’s polyploids and the mystery of diploidization. Nature Reviews Genetics 2: 333–341.

Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H. Rieseberg. 2009. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America 106: 13875–13879.

Xiong, Z., R. T. Gaeta, and J. C. Pires. 2011. Homoeologous shuffling and chromosome compensation maintain genome balance in resynthesized allopolyploid Brassica napus. Proceedings of the National Academy of Sciences of the United States of America 108: 7908–7913.

Yoshida, M.-A., Y. Ishikura, T. Moritaki, E. Shoguchi, K. K. Shimizu, J. Sese, and A. Ogura. 2011. Genome structure analysis of molluscs revealed whole genome duplication and lineage specific repeat variation. Gene 483: 63–71.

APPENDIX A:

EARLY GENOME DUPLICATIONS IN CONIFERS AND OTHER SEED PLANTS

Citation

Li Z, AE Baniaga, EB Sessa, M Scascitelli, SW Graham, LH Rieseberg and MS Barker (2015) Early genome duplications in conifers and other seed plants. Science Advances 1(10), e1501084.

Authors

Zheng Li1, Anthony E. Baniaga1, Emily B. Sessa2, Moira Scascitelli3, Sean W. Graham3, Loren H. Rieseberg3,4, Michael S. Barker1*

Corresponding author

*Michael S. Barker, Department of Ecology & Evolutionary Biology, University of Arizona, P.O. Box 210088, Tucson, AZ 85721 USA, tel. (520) 621-2213, fax (520) 621-9190, [email protected]

Affiliations

1Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA.

2Department of Biology, University of Florida, Gainesville, FL 32611, USA.

3Department of Botany, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

4Department of Biology, Indiana University, Bloomington, IN 47405, USA.

Abstract

Polyploidy is a common mode of speciation and evolution in angiosperms (flowering plants). In contrast, there is little evidence to date that whole genome duplication (WGD) has played a significant role in the evolution of their putative extant sister lineage, the gymnosperms. Recent analyses of the spruce genome, the first published conifer genome, failed to detect evidence of WGDs in gene age distributions, and attributed many aspects of conifer biology to a lack of WGDs. Here, we present evidence for three ancient genome duplications during the evolution of gymnosperms, based on phylogenomic analyses of transcriptomes from 24 gymnosperms and three outgroups. We use a new algorithm to place these WGD events in phylogenetic context: two in the ancestry of major conifer clades (Pinaceae and cupressophyte conifers) and one in Welwitschia (Gnetales). We also confirm that a WGD hypothesized to be restricted to seed plants is indeed not shared with ferns and relatives (monilophytes), a result that was unclear in earlier studies. Contrary to previous genomic research that reported an absence of polyploidy in the ancestry of contemporary gymnosperms, our analyses indicate that polyploidy has contributed to the evolution of conifers and other gymnosperms. As in the flowering plants, the evolution of the large genome sizes of gymnosperms involved both polyploidy and repetitive element activity.

Introduction

Polyploidy, or whole genome duplication (WGD), is one of the most important forces in vascular plant evolution. Nearly 25% of vascular plants are recent polyploids (Barker et al., 2015), with approximately 15% of angiosperm and 31% of fern speciation events due to genome duplication (Wood et al., 2009). Ancient polyploidy is found in the ancestry of all extant seed and flowering plants (Jiao et al., 2011), and many angiosperm lineages have experienced additional rounds of genome duplication (Cui et al., 2006; Barker et al., 2008; Arrigo and Barker, 2012; McKain et al., 2012; Jiao et al., 2014; Soltis et al., 2014; Cannon et al., 2015). Changes in the rates of molecular evolution and turnover in genome content following polyploidy may have provided novel genetic variation that was important for the evolution of plant diversity (Freeling and Thomas, 2006; Van de Peer et al., 2009; Jiao et al., 2011; Arrigo and Barker, 2012; Barker et al., 2012; Schranz et al., 2012; Rensing, 2014; Selmecki et al., 2015).

Despite the prevalence of polyploidy in the history of flowering plants, the role of polyploidy in gymnosperm evolution is less clear. The extant gymnosperms appear to be the sister clade of angiosperms (Wickett et al., 2014), and the two lineages diverged from their most recent common ancestor (MRCA) as much as 310 million years ago (Schneider et al., 2004). Most evidence indicates that polyploid speciation is relatively rare among extant gymnosperms (Wood et al., 2009), although in some genera (e.g., Ephedra) polyploidy is prevalent (Ickert-Bond and Wojciechowski, 2004; Ahuja, 2005). Previous analyses of conifer genome sizes and chromosomes suggested that paleopolyploidy occurred in Pinaceae (Drewby, 1988; Ahuja, 2005). Although there was evidence of an ancient polyploidy shared by all seed plants (Jiao et al., 2011), no evidence of a gymnosperm or conifer ancient polyploidy was found in the genome of Norway spruce (Picea abies), the first published gymnosperm genome (Nystedt et al., 2013). However, this conclusion was based on only a single plot of the relative ages of duplicate genes, presumably because the genome assembly was not of high enough quality (N50 = 4.87 kb) for syntenic analyses. Based on the pattern of accumulation of paralogs seen in this plot, Nystedt et al. (2013) suggested that the large genomes of conifers originated by mechanisms exclusive of whole genome duplication, in particular through the proliferation of long terminal repeat retrotransposons (LTR-RTs). Given that paleopolyploidy has been observed repeatedly among flowering plants and is also hypothesized to occur among the conifers (Drewby, 1988; Ahuja, 2005), our goal was to test more thoroughly for evidence of ancient polyploidy in gymnosperms, using a phylogenetically diverse dataset and a new phylogenomic method for determining the phylogenetic placement of WGDs.

We assembled transcriptomes for 24 gymnosperms and three outgroup species, including representatives of all major gymnosperm and vascular plant clades (Table S1). Three of these transcriptomes (Ophioglossum petiolatum, Gnetum gnemon, and Ephedra frustillata) were newly sequenced to cover phylogenetic gaps in our dataset. For each transcriptome, we used our DupPipe bioinformatic pipeline to generate age distributions of paralogs and to identify shared bursts of gene duplication that are indicative of ancient WGD (Barker et al., 2008, 2009, 2010). We also introduce a newly developed algorithm, Multi-tAxon Paleopolyploidy Search (MAPS), to place inferred paleopolyploid events in phylogenetic context. For each node in a phylogeny, MAPS evaluates the percentage of gene duplications shared by all taxa descended from that node. Ancient whole genome duplications are identified and located as peaks in plots of duplication events shared among a set of species (Materials and Methods; Figs. S1 & S2). We used MAPS to confirm and locate genome duplication events in the history of the gymnosperms and seed plants.

Results

Phylogenetic Position of the Ancient Seed Plant Polyploidy

Most seed plant species contained evidence of a gene duplication peak consistent with previous evidence for a WGD in the ancestry of all seed plants (Jiao et al., 2011). With the exception of the Gnetales taxa, each gymnosperm Ks plot (Fig. S3) had a peak with median Ks = 0.75 to 1.5 that, in some of these taxa, has previously been correlated with a WGD shared by all seed plants (Jiao et al., 2011). Among the Gnetales, we observed a peak only in Welwitschia mirabilis, with a median Ks = 1.05, which is consistent with a Welwitschia-specific WGD (Cui et al., 2006). None of the three Gnetales taxa contained clear evidence of the putative seed plant WGD, perhaps due to elevated substitution or gene birth/death rates among these species.

To place this ancient WGD in the vascular plant phylogeny, we implemented a new multi-species paleopolyploid search tool, MAPS. Previous analyses found evidence for an ancient polyploidy in the ancestry of all extant seed plants (Jiao et al., 2011). However, a major clade of vascular plants, the monilophytes (ferns), was not included in that analysis. It was therefore unclear whether this WGD is shared among all euphyllophytes (seed plants and monilophytes) or restricted to seed plants. To better place this WGD in the vascular plant phylogeny, we analyzed new transcriptome data from the eusporangiate fern Ophioglossum with data from Araucaria (gymnosperm), Ginkgo (gymnosperm), Amborella (angiosperm), and Selaginella (lycophyte, the sister lineage to euphyllophytes). Gene family phylogenies were constructed for 3,235 gene families with at least one gene copy present in each species. Among these gene families, MAPS identified 544 subtrees that included the MRCA of Araucaria, Ginkgo, and Amborella and were consistent with the species tree. Nearly 64% of these subtrees contained evidence for a shared duplication in the MRCA of the seed plants that was not shared with Ophioglossum (Fig. 1A; Fig. S4A; Table S2). This result demonstrates that the genome duplication whose placement among euphyllophytes was previously unclear (Jiao et al., 2011) is indeed limited to seed plants and is not shared with ferns and other vascular plants (Fig. 2).

Independent Paleopolyploidies in Pinaceae and Cupressaceae

Most gymnosperm lineages only contained evidence for a single, ancient WGD, but some species had multiple signals. The Ks plots for most of the conifers contained a younger peak consistent with a WGD since the seed plant genome duplication (Fig. S3). Among Pinaceae we observed a younger peak with a median Ks = 0.2 to 0.4 for each taxon in our data set. Similarly, gene age distributions for taxa in the Cephalotaxaceae, Cupressaceae, and Taxaceae contained a younger peak with a median Ks = 0.2 to 0.5. Araucaria was the only conifer in our data set without an unambiguous younger peak. Thus, the Ks plots suggest that there may have been one shared conifer WGD, or independent WGDs in the history of different conifer families.

We conducted two different MAPS analyses to resolve the placement and number of WGDs among the conifers. For one analysis, we selected transcriptomes of Pinus, Larix, and Cedrus to represent Pinaceae, Taxus to represent Taxaceae, and chose Ginkgo, Ophioglossum, and Selaginella as outgroups. We recovered 2,175 phylogenies with at least one gene copy from each taxon. MAPS identified 625 subtrees among these gene family phylogenies that included the MRCA of Pinaceae. More than 52% of the subtrees supported a shared duplication in the ancestry of Pinaceae (Fig. 1B; Fig. S4B; Table S3). In contrast, only 9% of 535 subtrees supported a gene duplication shared between Pinaceae and Taxaceae. In the second analysis, we selected Taxus (Taxaceae), Cephalotaxus (Cephalotaxaceae), Cryptomeria (Cupressaceae), and Pinus (Pinaceae), with Ginkgo, Ophioglossum, and Selaginella as outgroups. Among 1,886 gene family phylogenies for these taxa, MAPS identified 469 subtrees that included the MRCA of the cupressophytes. Over 42% of the subtrees supported a shared gene duplication in the MRCA of Cupressaceae and Taxaceae (Fig. 1C; Fig. S4C; Table S4). Only 10% of the subtrees supported a duplication event shared by Pinaceae, Cupressaceae, and Taxaceae. We found similar results with MAPS using only gene trees with >50% bootstrap support for all branches (Table S5). These results suggest that there are two ancient WGDs in the conifers: one shared by Cupressaceae and Taxaceae (the cupressophytes) and one in the ancestry of Pinaceae (Fig. 2).

Analyses of ortholog divergence corroborated our MAPS results and supported independent WGDs among the conifers. We identified 3,266 orthologs by reciprocal best BLAST hit (Barker et al., 2010) from representatives of the Pinaceae and Cupressaceae, Picea glauca and Cryptomeria japonica. Excluding poorly aligned orthologs with Ks > 5, the median orthologous divergence between P. glauca and C. japonica was Ks = 0.78. In contrast, their most recent WGDs occurred at median Ks = 0.35 and 0.24, respectively (Fig. 3), much later than the divergence of their lineages. Orthologous divergence and phylogenomic approaches both support independent WGDs in the Pinaceae and cupressophytes. Consistent with this interpretation is an absence of evidence for these WGDs in Araucariaceae (Fig. S3). Overall, these results are consistent with previous analyses of chromosomes and genome sizes that hypothesized no paleopolyploidy in Araucariaceae but likely ancient WGD in the Pinaceae (Drewby, 1988; Ahuja, 2005).

Discussion

In contrast to the recently published study of the Norway spruce genome (Nystedt et al., 2013), our analyses find evidence for at least two independent WGDs in the ancestry of major conifer clades. Why did analyses of the spruce genome not recover similar evidence of a WGD in Pinaceae? Visual evaluation of the age distribution of paralogs from that analysis (Supplementary Figure 2.6 of Nystedt et al., 2013) suggests that there is in fact a peak consistent with a WGD near Ks ~0.25, similar to our results. Although it is not clear why this result was overlooked, the spruce genome results do appear to be fully consistent with our analyses. Our more extensive phylogenetic sampling provides additional support that this peak is likely a WGD because more than 50% of gene families in multiple Pinaceae species have paralogs from this event (Fig. 1B & C; Fig. S4B & C).

What are the implications of these results for our understanding of conifer genome evolution? First, Nystedt et al. (2013) proposed a model of conifer genome evolution that must be revised in light of our results. Their model suggests that in the absence of polyploidy, 12 ancestral conifer chromosomes expanded at a slow and steady rate due solely to the activity of a diverse set of LTR transposable elements. Although conifer chromosome numbers cluster near n = 12 (Rice et al., 2015), our discovery of WGDs in the ancestry of two major conifer clades (Pinaceae and cupressophytes) indicates that these numbers must have fluctuated rather than remained completely static over time. Our analyses do not contradict evidence that the expansion of repetitive DNA is the major contributor to conifer genome size evolution. However, the dynamics of conifer genome evolution clearly did involve WGDs, and genome duplication events have played a role in generating some of the largest genomes among conifers (e.g., Pinaceae). It is notable that the genome sizes of paleopolyploid Cupressaceae and Taxaceae are not substantially larger on average than those of the non-paleopolyploid Araucariaceae (Burleigh et al., 2012; Garcia et al., 2014). This suggests that an insight from angiosperm genome evolution also holds true for gymnosperms: differences in turnover rates of genome content likely contribute more to genome size variation than a single paleopolyploidy (Leitch and Leitch, 2008; Barker et al., 2012; Bromham et al., 2015).

Nystedt et al. (2013) also suggest that conserved synteny across Pinaceae (Pavy et al., 2012) results from an absence of paleopolyploidy. Analyses of angiosperm genomes indicate that the degree of synteny conservation following paleopolyploidy varies widely (Tang et al., 2008; Barker et al., 2012; Murat et al., 2012; Woodhouse et al., 2014). The composition of parental genomes, in particular differences in transposon load, may establish genome dominance that leads to the biased retention and loss of genes (Woodhouse et al., 2014). If most fractionation and genome rearrangements occur quickly following polyploidy, descendant polyploids may also inherit a largely common synteny (Xiong et al., 2011; Buggs et al., 2012). A lack of reciprocal genome rearrangements following WGDs, such as that reported by Schnable et al. (2012), would also reduce syntenic diversity in descendant lineages. For decades, the broad ancestry of polyploidy in the flowering plants was undetected in linkage mapping studies. Thus, relatively conserved synteny, especially from linkage map data, is not evidence against a paleopolyploidy in Pinaceae.

One of the most intriguing evolutionary questions raised by our analyses is why there are so few polyploid species among extant conifers and other gymnosperms. Our analyses indicate that polyploid speciation contributed to their diversity. Perhaps these WGDs occurred at a climatically favorable time for polyploid species, as was proposed to explain the apparent clustering of angiosperm WGDs near the K-Pg mass extinction event (Vanneste et al., 2014). Based on our phylogenetic placements of WGDs and existing estimates for the ages of gymnosperm lineages (Lu et al., 2014), the conifer WGDs occurred ca. 210–275 million years ago (mya) (Cupressaceae+Taxaceae) and ca. 200–342 mya (Pinaceae). Many major events in Earth’s history occurred during this timeframe, including Earth’s most severe mass extinction event, the Permian-Triassic extinction. Did polyploid conifers survive the end-Permian event better than their diploid contemporaries? Given that many of these conifer clades originated during this period, these WGDs may have uniquely contributed to the morphological and biological diversity of these lineages. Polyploidy may differentially influence the evolution of dosage-sensitive genes and pathways (Freeling and Thomas, 2006; Freeling, 2009; Bekaert et al., 2011; Conant et al., 2014) or generate novelty by sub- or neofunctionalization (Edger et al., 2015). Examining further data sets to pinpoint these WGDs more precisely in the conifer phylogeny, and to explore the effects of duplication on specific gene families, will be critical to understanding how polyploidy has contributed to conifer evolution.

Materials and Methods

Sampling and Sequencing

Leaf material of Ophioglossum petiolatum (PRJNA257107), Gnetum gnemon (PRJNA283231), and Ephedra frustillata (PRJNA283230) was collected on liquid nitrogen from the UBC Botanical Gardens and Greenhouse and then stored in a -80°C freezer (Table S1). We extracted total RNA using the TRIzol reagent (Invitrogen)/RNeasy (QIAGEN) approach as described in Lai et al. (2006). For 454 sequencing (454 Life Sciences, Branford, CT, USA), we employed modified oligo-dT primers during cDNA synthesis to reduce the length of mononucleotide runs associated with the poly(A) tail of mRNA. We used a ‘broken chain’ short oligo-dT primer to prime the poly(A) tail of mRNA during first strand cDNA synthesis (Meyer et al., 2009). cDNA was amplified and normalized with the TRIMMER-DIRECT cDNA Normalization Kit. Following normalization, we fragmented the cDNA to 500- to 800-bp fragments by either sonication or nebulization and size selected to remove small fragments using AMPure SPRI beads (Agencourt, Beverly, MA, USA). Then, the fragmented ends were polished and ligated with adaptors. The optimal ligation products were selectively amplified and subjected to two rounds of size selection including gel electrophoresis and AMPure SPRI bead purification (Lai et al., 2012). Normalized cDNA was prepared for sequencing following the standard genomic DNA shotgun protocol recommended by 454 Life Sciences.

Additional data sets were downloaded from the GenBank SRA (Table S1). These included Sanger and Illumina data from 22 species. Data sets were selected to provide broad phylogenetic coverage of the gymnosperms. We also collected annotated CDS for Amborella trichopoda (Amborella Genome Project, 2013) and Selaginella moellendorffii (Banks et al., 2011) from Phytozome (http://www.phytozome.net/).

Transcriptome Assembly

Raw read quality filtering and trimming were performed by SnoWhite (Dlugosch et al., 2013) before assembly. Three different assembly strategies were used for our three different data types. Sanger ESTs were cleaned using the SeqClean pipeline and assembled using TGICL. For 454 data, we used a combination of MIRA and CAP3 to assemble contigs. We used MIRA version 3.2.1 (Chevreux et al., 2004) with the ‘accurate.est.denovo.454’ assembly mode. As MIRA may split up high coverage contigs into multiple contigs, we used CAP3 at 94% identity to further assemble the MIRA contigs and singletons (Huang and Madan, 1999). SOAPdenovo-Trans (Xie et al., 2014) was used to assemble Illumina sequenced transcriptomes using a k-mer of ~⅔ read length. All other parameters were set to default. Assembly statistics for the 26 assemblies are in Table S1.


Age Distribution of Paralogs

For each species data set, we used our DupPipe pipeline to construct gene families and estimate the age of gene duplications (Barker et al., 2008, 2009, 2010; Shi et al., 2010; Banks et al., 2011). Translations and reading frames were estimated by Genewise alignment to the best hit protein from a collection of proteins from 25 plant genomes on Phytozome. As in other DupPipe runs, we used protein-guided DNA alignments to align our nucleic acids while maintaining reading frame. For each node in our gene family phylogenies, we estimated synonymous divergence (Ks) using PAML with the F3X4 model (Yang, 2000). Summary plots of the age distribution of gene duplications were evaluated for each gymnosperm species for peaks of gene duplication as evidence of ancient WGDs. Taxa with peaks suggesting ancient WGDs were further analyzed using a multi-species approach (described below) to assess what fraction of gene families show a shared gene duplication and simultaneously place potential WGDs in phylogenetic context.
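The summary step described above can be illustrated with a minimal Python sketch. It assumes that Ks values for duplication nodes have already been estimated; it is not the DupPipe implementation, and the function names, the bin width of 0.1, and the upper Ks limit of 2.0 are hypothetical choices made only for illustration.

```python
from statistics import median

def ks_histogram(ks_values, bin_width=0.1, ks_max=2.0):
    """Count duplications per Ks bin; values outside (0, ks_max] are dropped."""
    n_bins = int(round(ks_max / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 < ks <= ks_max:
            # Clamp so that ks == ks_max falls into the last bin.
            idx = min(int(ks / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts

def median_ks_in_window(ks_values, low, high):
    """Median Ks of duplications inside a visually chosen peak window."""
    window = [ks for ks in ks_values if low <= ks <= high]
    return median(window) if window else None

# Hypothetical Ks values, not data from this study.
example_ks = [0.05, 0.25, 0.26, 0.31, 1.25]
histogram = ks_histogram(example_ks)
peak_median = median_ks_in_window(example_ks, 0.2, 0.4)
```

A peak in such a histogram is only a candidate WGD signal; in this study, peaks were followed up with the multi-species MAPS analysis.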

Estimating Orthologous Divergence of Pinaceae and Cupressaceae

To estimate the average ortholog divergence of conifer taxa and compare to observed paleopolyploid peaks, we used our previously described RBH Ortholog pipeline (Barker et al., 2010). Briefly, we identified orthologs as reciprocal best BLAST hits in the transcriptomes of Picea glauca (Pinaceae) and Cryptomeria japonica (Cupressaceae). Using protein-guided DNA alignments, we estimated the pairwise synonymous (Ks) divergence for each pair of orthologs using PAML with the F3X4 model (Yang, 2000). We plotted the distribution of ortholog divergences and calculated the median divergence to compare against the synonymous divergence of paralogs from inferred WGDs in these lineages.
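The reciprocal best hit idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RBH Ortholog pipeline itself: hits are assumed to be available as (query, subject, score) tuples, and all function names and example identifiers are hypothetical. The final comparison uses the median Ks values reported in the Results and implicitly assumes comparable substitution rates in the two lineages.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)

def wgd_postdates_divergence(paralog_median_ks, ortholog_median_ks):
    """True if the WGD peak is younger (lower Ks) than the lineage split."""
    return paralog_median_ks < ortholog_median_ks

# Hypothetical hit tables for two transcriptomes.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90)]
b_vs_a = [("b1", "a1", 210), ("b2", "a1", 140), ("b2", "a2", 80)]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)

# Median Ks values from the Results: ortholog divergence 0.78,
# WGD peaks at 0.35 (P. glauca) and 0.24 (C. japonica).
independent = all(wgd_postdates_divergence(ks, 0.78) for ks in (0.35, 0.24))
```

Here only the pair ("a1", "b1") is retained, because "b2" prefers "a1" over "a2". Both WGD peaks fall below the ortholog divergence, which is the pattern interpreted in the text as independent duplications after the lineages split.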

Inference of Gene Family Phylogenies

Each transcriptome was translated into amino acid sequences using the TransPipe pipeline (Barker et al., 2010). We performed reciprocal protein BLAST (blastp) searches of selected transcriptomes with an e-value of 10e-5 as a cutoff. Gene families were clustered from these BLAST results using OrthoMCL v2.0 with default parameters (Li et al., 2003). Using a custom Perl script, we filtered for gene families that contained at least one gene copy from each taxon and discarded the remaining OrthoMCL clusters. SATé was used for automatic alignment and phylogeny reconstruction of gene families (Liu et al., 2009). For each gene family phylogeny, we ran SATé until five iterations without an improvement in score using a centroid breaking strategy. MAFFT was used for alignments (Katoh et al., 2002), Opal for mergers (Wheeler and Kececioglu, 2007), and RAxML for tree estimation (Stamatakis, 2014). The best SATé tree for each gene family was used to infer and locate whole genome duplications by our multi-taxon paleopolyploidy search (MAPS) algorithm.
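The gene family filter described above (keep only clusters with at least one gene copy from every taxon) can be sketched as follows. The original filter was a custom Perl script that is not reproduced here; this Python version is an illustrative stand-in, and the data layout (a dictionary of family identifiers mapped to lists of (taxon, gene) pairs) and all names are assumptions.

```python
def families_with_all_taxa(families, taxa):
    """Keep gene families that contain at least one gene from every taxon.

    families: dict mapping a family id to a list of (taxon, gene_id) pairs.
    taxa: iterable of taxon names that must all be represented.
    """
    required = set(taxa)
    return {
        family_id: genes
        for family_id, genes in families.items()
        if required <= {taxon for taxon, _ in genes}
    }

# Hypothetical clusters for three taxa.
clusters = {
    "fam1": [("Pinus", "g1"), ("Taxus", "g2"), ("Ginkgo", "g3"), ("Pinus", "g4")],
    "fam2": [("Pinus", "g5"), ("Taxus", "g6")],
}
kept = families_with_all_taxa(clusters, ["Pinus", "Taxus", "Ginkgo"])
```

In this example "fam2" is discarded because it lacks a Ginkgo sequence, while "fam1" is kept even though Pinus contributes two copies.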

Multi-tAxon Paleopolyploidy Search (MAPS)

To infer and locate ancient WGDs in our data sets, we developed a gene tree sorting and counting algorithm. This algorithm, the multi-taxon paleopolyploidy search (MAPS), uses a given species tree to filter for subtrees within complex gene trees consistent with relationships at each node in the species tree. For each node of the species tree, MAPS parses the species tree into subtrees with a sister species and an outgroup, e.g., ((A,B),C). MAPS iteratively searches for each of these subtrees in the gene tree and will ignore subtrees that do not have the expected relationship. In-paralogs are collapsed by MAPS to simplify the search. We filter for these subtrees, rather than filtering on entire topologies, because ancient WGDs may yield phylogenies with many nested and/or orthologous clades. Filtering for a simple gene tree that matches the species tree would eliminate many of the trees that support WGDs. By filtering for subtrees of the species tree, MAPS captures the evidence for polyploidy in complex gene family topologies.

Using this filtered set of gene trees, MAPS records the number of subtrees that support a gene duplication at a particular node in the species tree (Fig. S1). To infer and locate a potential whole genome duplication in the species tree, we plot the percentage of gene duplications shared by descendant taxa by node (Fig. S2). A WGD will produce a large burst of shared duplications across taxa and gene trees. This burst of duplication will appear as an increase in the percentage of shared gene duplications in our MAPS analyses.

To evaluate whether a WGD occurred before the divergence of taxa A and B, MAPS requires gene trees with at least A, B, and an outgroup C (Fig. S1). The basic algorithm of MAPS uses two steps. In step one, MAPS collapses in-paralogs that evolved after the divergence of A and B to a single copy in each gene tree (Fig. S1). In step two, MAPS counts subtrees from all gene trees that are consistent with a duplication event in the most recent common ancestor (MRCA) of A and B. In our ABC example, subtrees with a topology consistent with duplication before the divergence of A and B, e.g., (((A,B),(A,B)),C), will be recorded as a duplication at their MRCA node (Fig. S1.6). Additionally, subtrees with a topology consistent with duplication before the divergence of A and B followed by independent gene loss, e.g., (((A,~),(A,B)),C) or (((A,B),(~,B)),C), will also be recorded as a duplication at their MRCA node (Fig. S1.7-10). If gene trees do not have a topology consistent with any gene duplication among the ingroup taxa, then no duplications will be recorded at the internal nodes (Fig. S1.1-5). When searching for ancient WGDs in a collection of gene trees that contain more than three taxa, MAPS will repeat the same algorithm on each node of the tree (Fig. S2). WGDs are inferred by searching for evidence of a large number of shared duplications at a particular node(s) of the species tree (Fig. S2).
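The two steps can be made concrete with a minimal Python sketch restricted to the three-taxon case ((A,B),C). It is an illustration of the counting logic as described in the text, not the MAPS implementation: gene trees are assumed to be nested tuples whose leaves are taxon labels, lost copies (written as ~ above) are simply absent, and every function name is hypothetical.

```python
def leaf_taxa(tree):
    """Set of taxon labels found at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_taxa(child) for child in tree))

def collapse_in_paralogs(tree):
    """Step one: replace any clade drawn from a single taxon by one leaf."""
    if isinstance(tree, str):
        return tree
    taxa = leaf_taxa(tree)
    if len(taxa) == 1:
        return next(iter(taxa))
    return tuple(collapse_in_paralogs(child) for child in tree)

def is_species_clade(tree, a, b):
    """True for the leaf A, the leaf B, or the clade (A,B) in either order."""
    if isinstance(tree, str):
        return tree in (a, b)
    return (len(tree) == 2 and all(isinstance(c, str) for c in tree)
            and set(tree) == {a, b})

def classify_subtree(subtree, a, b, outgroup):
    """Label a subtree of the form (ingroup, outgroup); None means ignored."""
    if isinstance(subtree, str) or len(subtree) != 2:
        return None
    left, right = subtree
    if left == outgroup:
        ingroup = right
    elif right == outgroup:
        ingroup = left
    else:
        return None
    if isinstance(ingroup, str) or len(ingroup) != 2:
        return None
    if is_species_clade(ingroup, a, b):
        return "single_copy"  # ((A,B),C): no duplication at the MRCA
    # Two species-consistent copies sharing a taxon: duplication before
    # the A/B split, with or without later loss of one copy.
    if (all(is_species_clade(child, a, b) for child in ingroup)
            and leaf_taxa(ingroup[0]) & leaf_taxa(ingroup[1])):
        return "duplication"
    return None

def iter_subtrees(tree):
    """Yield the tree itself and every nested subtree."""
    yield tree
    if not isinstance(tree, str):
        for child in tree:
            yield from iter_subtrees(child)

def count_maps_support(gene_trees, a, b, outgroup):
    """Step two: count subtrees with and without a shared duplication."""
    counts = {"duplication": 0, "single_copy": 0}
    for tree in gene_trees:
        for subtree in iter_subtrees(collapse_in_paralogs(tree)):
            label = classify_subtree(subtree, a, b, outgroup)
            if label is not None:
                counts[label] += 1
    return counts

# Hypothetical gene trees, one per line.
trees = [
    ((("A", "B"), ("A", "B")), "C"),   # duplication, both copies retained
    (("A", "B"), "C"),                 # no duplication
    ((("A", "A"), "B"), "C"),          # in-paralogs in A collapse to ((A,B),C)
    ((("A", "B"), "B"), "C"),          # duplication followed by loss of one A
    (("A", "C"), "B"),                 # conflicts with the species tree: ignored
]
support = count_maps_support(trees, "A", "B", "C")
percent_shared = 100.0 * support["duplication"] / sum(support.values())
```

With these five trees, two subtrees support a duplication at the MRCA of A and B and two do not, so the shared-duplication percentage is 50%. For species trees with more than three taxa, the text describes repeating the same procedure at each node; that extension is not shown here.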

To evaluate the phylogenetic placement of the putative ‘seed plant’ WGD, we used MAPS to analyze gene families from representatives of each vascular plant lineage (Fig. 1A, Fig. S4A). We selected Araucaria angustifolia and Ginkgo biloba to represent gymnosperms because our Ks plots suggest that they only experienced the ‘seed plant’ WGD. We also analyzed the Amborella genome to represent angiosperms (Amborella Genome Project, 2013). The newly sequenced Ophioglossum petiolatum transcriptome and the Selaginella moellendorffii genome (Banks et al., 2011) were chosen to represent ferns and lycophytes, respectively.

We conducted two MAPS analyses to evaluate the numbers and placements of WGDs among conifers (Fig. 1B & C, Fig. S4B & S4C). Two analyses were conducted instead of one because the MAPS algorithm works best with simple, ladderized species trees. For the Pinaceae analysis, to maximize the number of gene trees while retaining good coverage of the Pinaceae phylogeny, we selected transcriptomes of Pinus monticola, Larix gmelinii, and Cedrus atlantica to represent Pinaceae, and Taxus mairei to represent the cupressophytes. Likewise, for the cupressophyte analysis, we chose Taxus mairei, Cephalotaxus hainanensis, and Cryptomeria japonica to represent cupressophytes and Pinus monticola to represent Pinaceae. For both the Pinaceae and cupressophyte analyses, transcriptomes of Ginkgo biloba and Ophioglossum petiolatum as well as the Selaginella moellendorffii genome were selected as outgroups.

Acknowledgments

We thank K. Dlugosch, S. Jorgensen, and X. Qi for discussion. Hosting infrastructure and services were provided by the Biotechnology Computing Facility (BCF) at the University of Arizona.

Data and materials availability

Raw reads for newly sequenced transcriptomes of Ophioglossum petiolatum (PRJNA257107), Gnetum gnemon (PRJNA283231), and Ephedra frustillata (PRJNA283230) are deposited in NCBI SRA.

References

Ahuja, M. R. 2005. Polyploidy in gymnosperms: revisited. Silvae genetica 54: 59–68.

Amborella Genome Project. 2013. The Amborella genome and the evolution of flowering plants. Science 342: 1241089.

Arrigo, N., and M. S. Barker. 2012. Rarely successful polyploids and their legacy in plant genomes. Current opinion in plant biology 15: 140–146.

Banks, J. A., T. Nishiyama, M. Hasebe, J. L. Bowman, M. Gribskov, C. dePamphilis, V. A. Albert, et al. 2011. The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science 332: 960–963.

Barker, M. S., N. Arrigo, A. E. Baniaga, Z. Li, and D. A. Levin. 2015. On the relative abundance of auto- and allopolyploids. The New phytologist In press.

Barker, M. S., G. J. Baute, and S.-L. Liu. 2012. Duplications and turnover in plant genomes. Plant Genome Diversity Volume 1, 155–169. Springer Vienna.

Barker, M. S., K. M. Dlugosch, L. Dinh, R. S. Challa, N. C. Kane, M. G. King, and L. H. Rieseberg. 2010. EvoPipes.net: bioinformatic tools for ecological and evolutionary genomics. Evolutionary bioinformatics online 6: 143–149.

Barker, M. S., N. C. Kane, M. Matvienko, A. Kozik, R. W. Michelmore, S. J. Knapp, and L. H. Rieseberg. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Molecular biology and evolution 25: 2445–2455.

Barker, M. S., H. Vogel, and M. E. Schranz. 2009. Paleopolyploidy in the Brassicales: analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome biology and evolution 1: 391–399.

Bekaert, M., P. P. Edger, J. C. Pires, and G. C. Conant. 2011. Two-phase resolution of polyploidy in the Arabidopsis metabolic network gives rise to relative and absolute dosage constraints. The Plant cell 23: 1719–1728.

Bromham, L., X. Hua, R. Lanfear, and P. F. Cowman. 2015. Exploring the relationships between rates, life history, genome size, environment, and species richness in flowering plants. The American naturalist 185: 507–524.

Buggs, R. J. A., S. Chamala, W. Wu, J. A. Tate, P. S. Schnable, D. E. Soltis, P. S. Soltis, and W. B. Barbazuk. 2012. Rapid, repeated, and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Current biology: CB 22: 248–252.

Burleigh, J. G., W. B. Barbazuk, J. M. Davis, A. M. Morse, and P. S. Soltis. 2012. Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis. Journal of botany 2012: 1–6.

Cannon, S. B., M. R. McKain, A. Harkess, M. N. Nelson, S. Dash, M. K. Deyholos, Y. Peng, et al. 2015. Multiple polyploidy events in the early radiation of nodulating and nonnodulating legumes. Molecular biology and evolution 32: 193–210.

Chevreux, B., T. Pfisterer, B. Drescher, A. J. Driesel, W. E. G. Müller, T. Wetter, and S. Suhai. 2004. Using the miraEST Assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome research 14: 1147–1159.

Conant, G. C., J. A. Birchler, and J. C. Pires. 2014. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Current opinion in plant biology 19: 91–98.

Cui, L., P. K. Wall, J. H. Leebens-Mack, B. G. Lindsay, D. E. Soltis, J. J. Doyle, P. S. Soltis, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome research 16: 738–749.

Dlugosch, K. M., Z. Lai, A. Bonin, J. Hierro, and L. H. Rieseberg. 2013. Allele identification for transcriptome-based population genomics in the invasive plant Centaurea solstitialis. G3 3: 359–367.

Drewby, A. 1988. The G-banded karyotype of Pinus resinosa Ait.

Edger, P. P., H. M. Heidel-Fischer, M. Bekaert, J. Rota, G. Glöckner, A. E. Platts, D. G. Heckel, et al. 2015. The butterfly plant arms-race escalated by gene and genome duplications. Proceedings of the National Academy of Sciences of the United States of America 112: 8362–8366.

Freeling, M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual review of plant biology 60: 433–453.

Freeling, M., and B. C. Thomas. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome research 16: 805–814.

Garcia, S., I. J. Leitch, A. Anadon-Rosell, M. Á. Canela, F. Gálvez, T. Garnatje, A. Gras, et al. 2014. Recent updates and developments to plant genome size databases. Nucleic acids research 42: D1159–66.

Huang, X., and A. Madan. 1999. CAP3: A DNA sequence assembly program. Genome research 9: 868–877.

Ickert-Bond, S. M., and M. F. Wojciechowski. 2004. Phylogenetic relationships in Ephedra (Gnetales): Evidence from nuclear and chloroplast DNA sequence data. Systematic botany 29: 834–849.

Jiao, Y., J. Li, H. Tang, and A. H. Paterson. 2014. Integrated syntenic and phylogenomic analyses reveal an ancient genome duplication in monocots. The Plant cell 26: 2792–2802.

Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr, P. E. Ralph, L. P. Tomsho, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100.

Katoh, K., K. Misawa, K.-I. Kuma, and T. Miyata. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research 30: 3059–3066.

Lai, Z., B. L. Gross, Y. Zou, J. Andrews, and L. H. Rieseberg. 2006. Microarray analysis reveals differential gene expression in sunflower species. Molecular ecology 15: 1213–1227.

Lai, Z., Y. Zou, N. C. Kane, J.-H. Choi, X. Wang, and L. H. Rieseberg. 2012. Preparation of normalized cDNA libraries for 454 Titanium transcriptome sequencing. Data production and analysis in population genomics, Methods in Molecular Biology, 119–133.

Leitch, A. R., and I. J. Leitch. 2008. Genomic plasticity and the diversity of polyploid plants. Science 320: 481–483.

Li, L., C. J. Stoeckert, and D. S. Roos. 2003. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome research 13: 2178–2189.

Liu, K., S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. 2009. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324: 1561–1564.

Lu, Y., J.-H. Ran, D.-M. Guo, Z.-Y. Yang, and X.-Q. Wang. 2014. Phylogeny and divergence times of gymnosperms inferred from single-copy nuclear genes. PloS one 9: e107679.

McKain, M. R., N. Wickett, Y. Zhang, S. Ayyampalayam, W. R. McCombie, M. W. Chase, J. C. Pires, et al. 2012. Phylogenomic analysis of transcriptome data elucidates co-occurrence of a paleopolyploid event and the origin of bimodal karyotypes in Agavoideae (Asparagaceae). American journal of botany 99: 397–406.

Meyer, E., G. Aglyamova, S. Wang, J. Buchanan-Carter, D. Abrego, J. Colbourne, B. Willis, and M. Matz. 2009. Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC genomics 10: 219.

Murat, F., Y. Van de Peer, and J. Salse. 2012. Decoding plant and animal genome plasticity from differential paleo-evolutionary patterns and processes. Genome biology and evolution 4: 917–928.

Nystedt, B., N. R. Street, A. Wetterbom, A. Zuccolo, Y.-C. Lin, D. G. Scofield, F. Vezzi, et al. 2013. The Norway spruce genome sequence and conifer genome evolution. Nature 497: 579–584.

Pavy, N., B. Pelgas, J. Laroche, P. Rigault, N. Isabel, and J. Bousquet. 2012. A spruce gene map infers ancient plant genome reshuffling and subsequent slow evolution in the gymnosperm lineage leading to extant conifers. BMC biology 10: 84.

Rensing, S. A. 2014. Gene duplication as a driver of plant morphogenetic evolution. Current opinion in plant biology 17: 43–48.

Rice, A., L. Glick, S. Abadi, M. Einhorn, N. M. Kopelman, A. Salman-Minkov, J. Mayzel, et al. 2015. The Chromosome Counts Database (CCDB) - a community resource of plant chromosome numbers. The New phytologist 206: 19–26.

Schnable, J. C., M. Freeling, and E. Lyons. 2012. Genome-wide analysis of syntenic gene deletion in the grasses. Genome biology and evolution 4: 265–277.

Schneider, H., E. Schuettpelz, K. M. Pryer, R. Cranfill, S. Magallón, and R. Lupia. 2004. Ferns diversified in the shadow of angiosperms. Nature 428: 553–557.

Schranz, M. E., S. Mohammadin, and P. P. Edger. 2012. Ancient whole genome duplications, novelty and diversification: the WGD radiation lag-time model. Current opinion in plant biology 15: 147–153.

Selmecki, A. M., Y. E. Maruvka, P. A. Richmond, M. Guillet, N. Shoresh, A. L. Sorenson, S. De, et al. 2015. Polyploidy can drive rapid adaptation in yeast. Nature.

Shi, T., H. Huang, and M. S. Barker. 2010. Ancient genome duplications during the evolution of kiwifruit (Actinidia) and related Ericales. Annals of botany 106: 497–504.

Soltis, D. E., C. J. Visger, and P. S. Soltis. 2014. The polyploidy revolution then… and now: Stebbins revisited. American journal of botany.

Stamatakis, A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.

Tang, H., X. Wang, J. E. Bowers, R. Ming, M. Alam, and A. H. Paterson. 2008. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome research 18: 1944–1954.

Van de Peer, Y., S. Maere, and A. Meyer. 2009. The evolutionary significance of ancient genome duplications. Nature reviews. Genetics 10: 725–732.

Vanneste, K., G. Baele, S. Maere, and Y. Van de Peer. 2014. Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary. Genome research 24: 1334–1347.

Wheeler, T. J., and J. D. Kececioglu. 2007. Multiple alignment by aligning alignments. Bioinformatics 23: i559–i568.

Wickett, N. J., S. Mirarab, N. Nguyen, T. Warnow, E. Carpenter, N. Matasci, S. Ayyampalayam, et al. 2014. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proceedings of the National Academy of Sciences of the United States of America 111: E4859–68.

Woodhouse, M. R., F. Cheng, J. C. Pires, D. Lisch, M. Freeling, and X. Wang. 2014. Origin, inheritance, and gene regulatory consequences of genome dominance in polyploids. Proceedings of the National Academy of Sciences of the United States of America 111: 5283–5288.

Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H. Rieseberg. 2009. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America 106: 13875–13879.

Xie, Y., G. Wu, J. Tang, R. Luo, J. Patterson, S. Liu, W. Huang, et al. 2014. SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads. Bioinformatics.

Xiong, Z., R. T. Gaeta, and J. C. Pires. 2011. Homoeologous shuffling and chromosome compensation maintain genome balance in resynthesized allopolyploid Brassica napus. Proceedings of the National Academy of Sciences of the United States of America 108: 7908–7913.

Yang, Z. 2000. Phylogenetic analysis by maximum likelihood (PAML). abacus.gene.ucl.ac.uk/software/paml.html.

Appendix A: Figures

Fig. 1. Multi-tAxon Paleopolyploidy Search (MAPS) result in the associated phylogeny. Percentage of subtrees that contained a gene duplication (red line) shared by descendant species at each node. Ovals correspond to inferred locations of WGD events. (A) Seed plant analysis, black oval = seed plant WGD. (B) Pinaceae analysis, black oval = seed plant WGD, green oval = Pinaceae WGD. (C) Cupressophyte analysis, black oval = seed plant WGD, red oval = cupressophyte WGD.

Fig. 2. Phylogenetic placement of WGDs in seed plant and gymnosperm history. Ovals correspond to inferred locations of WGD events: black = seed plant WGD, gray = angiosperm WGD, purple = Welwitschia WGD, green = Pinaceae WGD, and red = cupressophyte WGD. Amborella image adapted from Amborella Genome Project, 2013. Other botanical illustrations are in the public domain.

Fig. 3. Pinaceae - Cupressaceae Ortholog Divergence and Independent WGDs. Combined Ks plot of gene age distributions of Picea glauca (Pinaceae, green) and Cryptomeria japonica (Cupressaceae, orange), and their ortholog divergences (blue). The median peaks for these plots are highlighted. Analyses of ortholog divergence indicated that these two taxa diverged before their most recent WGDs.

Appendix A: Supplementary Information


Fig. S1. Example topologies processed by MAPS to identify a gene duplication (red star) or not (black dot) in a given gene family phylogeny. Left column of trees are observed phylogenies. Middle column of trees have in-paralogs collapsed following MAPS step one. The right column of trees In trees 1.1 - 1.5, only in-paralogs are present and no shared gene duplication (black dot) is observed in the phylogeny. Trees 1.6 - 1.10 contain evidence for a duplication (red star) in the ancestry of taxa A and B that will be recorded by MAPS.

Fig. S2. Multi-tAxon Paleopolyploidy Search (MAPS) example summary results for a four-taxon phylogeny. Percentage of gene trees that fit the expected species tree and are consistent with a shared duplication (red) at each node. Red ovals correspond to inferred locations of WGD events.


Fig. S3. Histograms of the age distribution of gene duplications from 24 gymnosperm transcriptomes.

Fig. S4. Numerical summary of MAPS results. Species trees summarizing the percentage of gene subtrees supporting a shared gene duplication (red) versus no shared duplication (black) inferred by MAPS at each node. A) Seed plant WGD MAPS analysis. B) Pinaceae WGD MAPS analysis. C) Cupressophyte WGD MAPS analysis.

Family           Species                           Data source  Accession #  Contig #  Library type
Amborellaceae    Amborella trichopoda              (46)         Phytozome    27313     Genome Project
Araucariaceae    Araucaria angustifolia            (70)         SRR1185544   46157     93 Paired_end
Cephalotaxaceae  Cephalotaxus hainanensis          (71)         SRR1509462   57602     101 Paired_end
Cupressaceae     Cunninghamia lanceolata           (72)         SRR475258    90230     100 Paired_end
Cupressaceae     Cryptomeria japonica              (73)         -            22439     Sanger
Cycadaceae       Cycas rumphii                     (74, 75)     -            10034     Sanger
Ephedraceae      Ephedra frustillata               New          PRJNA283230  52292     Normalized 454
Ginkgoaceae      Ginkgo biloba                     (76)         SRR325161    61988     75 Paired_end
Gnetaceae        Gnetum gnemon                     New          PRJNA283231  48191     Normalized 454
Ophioglossaceae  Ophioglossum petiolatum           New          PRJNA257107  55232     Normalized 454
Pinaceae         Larix gmelinii                    (77)         SRR522902    47991     90 Paired_end
Pinaceae         Cedrus atlantica                  (78)         SRR1518632   46684     101 Paired_end
Pinaceae         Pseudolarix amabilis              (79)         SRS260867    51557     101 Paired_end
Pinaceae         Pinus contorta                    (80)         SRR576449    62719     108 Paired_end
Pinaceae         Pinus monticola                   (81)         SRR1013833   73965     76 Paired_end
Pinaceae         Pseudotsuga menziesii             (82)         SRR488373    123015    100 Paired_end
Pinaceae         Picea sitchensis                  (83)         -            21717     Sanger
Pinaceae         Pinus pinaster                    (84, 85)     -            11622     Sanger
Pinaceae         Picea engelmannii x Picea glauca  (83)         -            13462     Sanger
Pinaceae         Pinus taeda                       (86, 87)     -            4023      Sanger
Pinaceae         Picea glauca                      (83, 88)     -            29864     Sanger
Pinaceae         Picea abies                       (89)         -            4110      Sanger
Selaginellaceae  Selaginella moellendorffii        (47)         Phytozome    21008     Genome Project
Taxaceae         Taxus mairei                      (90)         SRR350719    47598     75 Paired_end
Taxaceae         Taxus chinensis                   (91)         SRR527088    33393     75 Paired_end
Taxaceae         Taxus x media                     (92)         SRR534004    40886     75 Paired_end
Welwitschiaceae  Welwitschia mirabilis             (4)          -            6286      Sanger

Table S1. Assembly statistics and accession numbers for 25 transcriptomes and 2 genomes. All assemblies will be deposited on Dryad. For Sanger ESTs, we only reference the final assembly on Dryad rather than thousands of accession numbers for each read.

Node  Non-duplication  Duplication  Total
N1    226              5            231
N2    196              348          544
N3    347              144          491

Table S2. Numbers of gene subtrees that fit the expected species tree and do or do not support a shared duplication at each node in the seed plant analysis. Node numbers correspond to species tree in Fig. S4A.
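The percentages of shared duplications that MAPS plots at each node follow directly from these counts. As a minimal sketch (the dictionary layout and function name are illustrative assumptions, not part of MAPS), the Table S2 values can be converted as follows:

```python
# (non-duplication, duplication) subtree counts per node, copied from Table S2.
TABLE_S2 = {"N1": (226, 5), "N2": (196, 348), "N3": (347, 144)}


def duplication_percentages(counts):
    """Percentage of subtrees supporting a shared duplication at each node."""
    return {
        node: 100.0 * dup / (nondup + dup)
        for node, (nondup, dup) in counts.items()
    }


percentages = duplication_percentages(TABLE_S2)
for node, pct in percentages.items():
    print(f"{node}: {pct:.1f}% of subtrees support a shared duplication")

# Node with the largest share of duplicated subtrees.
peak_node = max(percentages, key=percentages.get)
print("Largest burst at", peak_node)
```

The same function can be applied to the counts in Tables S3 to S5.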

Node  Non-duplication  Duplication  Total
N1    566              32           598
N2    298              327          625
N3    484              51           535
N4    159              141          300
N5    221              66           287

Table S3. Numbers of gene subtrees that fit the expected species tree and do or do not support a shared duplication at each node in the Pinaceae analysis. Node numbers correspond to species tree in Fig. S4B.

Node  Non-duplication  Duplication  Total
N1    507              51           558
N2    271              198          469
N3    557              65           622
N4    141              128          269
N5    207              57           264

Table S4. Numbers of gene subtrees that fit the expected species tree and do or do not support a shared duplication at each node in the cupressophyte analysis. Node numbers correspond to species tree in Fig. S4C.

Node  Non-duplication  Duplication  Total
N1    206              10           216
N2    124              73           197
N3    182              5            187
N4    60               64           124
N5    76               6            82

Table S5. Numbers of gene subtrees that fit the expected species tree and do or do not support a shared duplication at each node in the cupressophyte analysis using only trees with >50% bootstrap support for each branch. Node numbers correspond to species tree in Fig. S4C.

APPENDIX B:

MULTIPLE LARGE-SCALE GENE AND GENOME DUPLICATIONS DURING THE EVOLUTION OF HEXAPODS

Citation

Li Z, GP Tiley, S Galuska, CR Reardon, TI Kidder, RJ Rundell, MS Barker. (2018) Multiple large-scale gene and genome duplications during the evolution of hexapods. PNAS 115 (18): 4713-4718.

Authors

Zheng Li*1, George P. Tiley*2,3, Sally R. Galuska1, Chris R. Reardon1, Thomas I. Kidder1, Rebecca J. Rundell1,4, Michael S. Barker1

Corresponding author [email protected]

* Authors contributed equally

Affiliations

1Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721

2Department of Biology, University of Florida, Gainesville, FL 32611

3Department of Biology, Duke University, Durham, NC 27708

4Department of Environmental and Forest Biology, State University of New York College of Environmental Science and Forestry, Syracuse, NY 13210

Keywords polyploidy, whole genome duplication, large-scale genome duplication, genome evolution, hexapods

Abstract

Polyploidy or whole genome duplication (WGD) is a major contributor to genome evolution and diversity. Although polyploidy is recognized as an important component of plant evolution, it is generally considered to play a relatively minor role in animal evolution. Ancient polyploidy is found in the ancestry of some animals, especially fishes, but there is little evidence for ancient WGDs in other metazoan lineages. Here we use recently published transcriptomes and genomes from more than 150 species across the insect phylogeny to investigate whether ancient WGDs occurred during the evolution of Hexapoda, the most diverse clade of animals. Using gene age distributions and phylogenomics, we found evidence for 18 ancient WGDs and six other large-scale bursts of gene duplication during insect evolution. These bursts of gene duplication occurred in the history of lineages such as the Lepidoptera, Trichoptera, and Odonata. To further corroborate the nature of these duplications, we evaluated the pattern of gene retention from putative WGDs observed in the gene age distributions. We found a relatively strong signal of convergent gene retention across many of the putative insect WGDs. Considering the phylogenetic breadth and depth of the insect phylogeny, this observation is consistent with polyploidy as we expect dosage-balance to drive the parallel retention of genes. Together with recent research on plant evolution, our hexapod results suggest that genome duplications contributed to the evolution of two of the most diverse lineages of eukaryotes on Earth.

Introduction

Genome duplication has long been considered a major force of genome evolution and a generator of diversity. Evidence of paleopolyploidy is found in the genomes of many eukaryotes, such as yeasts, teleost fishes, and plants (Wolfe and Shields 1997; Berthelot et al. 2014; Barker, Husband, et al. 2016; Van de Peer et al. 2017). Polyploid speciation is perhaps most important among plants, where nearly ⅓ of contemporary vascular plant species have recently duplicated genomes (Wood et al. 2009; Barker, Arrigo, et al. 2016). All extant seed plants have also experienced at least one ancient WGD (Jiao et al. 2011; Li et al. 2015), and many flowering plants have undergone multiple rounds of paleopolyploidy (Wendel 2015; Barker, Husband, et al. 2016; Van de Peer et al. 2017). The creation of new genes (Edger et al. 2015; Gout and Lynch 2015), higher turnover of genome content (Arrigo and Barker 2012; Murat et al. 2017), and increased rates of adaptation (Selmecki et al. 2015) following polyploidy have likely contributed to the diversification of flowering plants (Arrigo and Barker 2012; Tank et al. 2015).

In contrast to plants, polyploid speciation among animals is generally regarded as exceptional (Orr 1990; Otto and Whitton 2000). The best-known polyploidization events in animals are two rounds of ancient WGD (the 2R hypothesis) that occurred in the ancestry of all vertebrates (Ohno 1970; McLysaght et al. 2002). However, most known cases of polyploidy in animals are found among parthenogenetic and hermaphroditic groups (Otto and Whitton 2000; Gregory and Mable 2005). If paleopolyploidy is indeed fundamental to the evolution of animal life across deep time, as it is in plants, we would expect to find WGDs throughout the most species-rich animal lineages: molluscs and arthropods. Little is known about ancient WGD among invertebrates, but there is growing evidence for paleopolyploidy in molluscs (Hallinan and Lindberg 2011) and chelicerates (Nossa et al. 2014; Clarke et al. 2015; Schwager et al. 2017). There is no evidence of paleopolyploidy among Hexapoda, the most diverse lineage of animals on Earth. Only 0.01% of the more than 800,000 described hexapod species (De Wever A. eds. 2015) are known polyploids (Otto and Whitton 2000; Gregory and Mable 2005). However, until recently there were limited data available to search for evidence of paleopolyploidy among the hexapods and other animal clades. Thus, the contributions of polyploidy to animal evolution and the differences with plant evolution have remained unclear.

To search for evidence of WGDs among the hexapods, we leveraged recently released genomic data for the insects (Misof et al. 2014). Combined with additional data sets from public databases, we assembled 128 transcriptomes and 27 genomes with at least one representative from each order of Hexapoda (SI Appendix, Dataset S1). We selected data from chelicerates, myriapods, and crustaceans as outgroups. Ancient WGDs were initially identified in the distributions of gene ages (Ks plots) produced by DupPipe (Barker et al. 2008; Barker et al. 2010). We also used the MAPS algorithm (Li et al. 2015) to infer WGDs or other large-scale genome duplications that are shared among descendant taxa. MAPS uses multi-species gene trees to infer the phylogenetic placement of significant bursts of ancient gene duplication based on comparison to simulated gene trees with and without WGDs. Simulations were conducted with GenPhyloData (Sjöstrand et al. 2013) with background gene birth and death rates estimated from WGDgc (Rabier et al. 2014) for each MAPS analysis (Dataset S3-S4). Analyses of synteny within the Bombyx mori genome (International Silkworm Genome Consortium 2008) provided additional evidence that significant duplications inferred by our MAPS analyses may result from large-scale genome duplication events. We also compared the synonymous divergence of putative WGD paralogs with the orthologous divergence among lineages to place inferred genome duplications in phylogenetic context. Potential ancient WGDs detected in our gene age distributions were further corroborated by analyses of biased gene retention across 20 hexapod genomes.

Results

Inference of WGDs from Gene Age Distributions

Our phylogenomic analyses revealed evidence for WGDs in the ancestry of many insects. Peaks of gene duplication consistent with WGDs were observed in the gene age distributions of 20 hexapod species (Fig. 1, SI Appendix, Fig. S1-S4, and Table S1). Each of the inferred WGDs was identified as a significant peak using SiZer and mixture model analyses (SI Appendix, Fig. S1, Table S1, Dataset S2). Fifteen of these appear as phylogenetically independent WGDs because the sampled sister lineages lack evidence of the duplications (Fig. 2, SI Appendix, Fig. S1-S4). In two cases, multiple sister lineages contained evidence for paleopolyploidy in their Ks plots. All sampled species of Thysanoptera contained evidence of at least one peak consistent with paleopolyploidy in their Ks plots (Fig. 1B, SI Appendix, Fig. S2I-K). Analyses of orthologous divergence among these taxa indicated that the putative WGD peaks are older than the divergence of these lineages, and we currently infer a single, shared WGD in the ancestry of Thysanoptera. Similarly, multiple taxa in the Trichoptera had evidence for WGD(s) (SI Appendix, Fig. S1O-R, and Fig. S2O-R). Analyses of orthologous divergence indicated that each of these putative WGDs occurred independently (SI Appendix, Fig. S5O and P, and Table S2). A MAPS analysis also supported the independence of these WGDs and found evidence for a deeper duplication event shared among all the sampled Trichoptera (SI Appendix, Fig. S6Y and Table S3).

Overall, our analyses of gene age distributions found evidence for 18 independent paleopolyploidizations in the ancestry of 14 orders of hexapods (Fig. 2, SI Appendix, Fig. S7). We observed evidence for ancient WGDs in diverse lineages of hexapods including springtails, beetles, ants, lice, flies, thrips, moths, termites, sawflies, caddisflies, stoneflies, and mayflies. Some of these WGDs were of relatively modest synonymous divergence and may be correlated with the origins of families or clades of genera, such as the inferred paleopolyploidization in Trichocera saltator (Fig. 1D). However, many of these putative WGDs appear to have occurred early in the evolution of different hexapod orders with relatively high synonymous divergence among paralogs. For example, applying the Drosophila synonymous substitution rate of 5.8 × 10⁻⁹ substitutions/synonymous site/year (Cutter 2008) to thrips, we estimated that the thrips duplication occurred approximately 155 MYA based on the median paralog divergence of the WGD. However, if thrips have a slower rate of evolution than Drosophila, then this WGD would be older.
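The conversion from paralog divergence to age can be sketched as below. The text does not state the formula or the median Ks that was used; the sketch assumes the commonly used relation T = Ks / (2r), in which both paralogs accumulate synonymous substitutions at rate r. Under that assumption, an age of 155 MYA corresponds to a median Ks of roughly 1.8. The function names are illustrative.

```python
# Drosophila synonymous substitution rate cited in the text (Cutter 2008),
# in substitutions per synonymous site per year.
DROSOPHILA_RATE = 5.8e-9


def divergence_to_age(median_ks, rate=DROSOPHILA_RATE):
    """Age in years, ASSUMING T = Ks / (2 * rate) (both paralogs diverge at `rate`)."""
    return median_ks / (2.0 * rate)


def age_to_divergence(age_years, rate=DROSOPHILA_RATE):
    """Inverse of divergence_to_age under the same assumption."""
    return 2.0 * rate * age_years


# Median Ks implied by the reported ~155 MYA estimate (not a value from the source).
implied_ks = age_to_divergence(155e6)
print(f"Implied median Ks: {implied_ks:.3f}")

# A slower rate gives an older estimate for the same divergence.
slower_age = divergence_to_age(implied_ks, rate=DROSOPHILA_RATE / 2)
print(f"Age at half the Drosophila rate: {slower_age / 1e6:.0f} MYA")
```

The second calculation illustrates the caveat in the text: for a fixed paralog divergence, halving the assumed substitution rate doubles the inferred age.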

Phylogenomic Inference and Simulation of Ancient Large-scale Genome Duplications

Given the depth of the phylogeny, there may be many WGDs or other large-scale genome duplications in the ancestry of hexapods that do not appear in Ks plots due to saturation of substitutions. We conducted 33 MAPS (Li et al. 2015) analyses of 111,933 nuclear gene family phylogenies to infer large bursts of gene duplication deep in the history of all major clades of hexapods (SI Appendix, Fig. S6 and Dataset S3). By examining shared gene duplications from multiple species, MAPS increases the signal of deep duplications and provides more resolution than a single species analysis. Overall, we found 25 branches, across 22 MAPS analyses, with significantly more shared gene duplications than expected under the null simulations (SI Appendix, Fig. S8 and Dataset S3). To further characterize these significant gene bursts, we simulated an additional set of gene trees with a WGD at the phylogenetic location of the duplication bursts in these 22 analyses. We found that 14 of the 25 bursts of gene duplication were statistically consistent with our positive simulations of WGDs (SI Appendix, Fig. S9 and Dataset S4). Six of these large-scale genome duplications had robust and consistent evidence across all MAPS analyses (Fig. 2, SI Appendix, Fig. S6, Table S3 and Dataset S3-S4). These included large-scale genome duplications in the ancestry of Odonata, Lepidoptera, and Trichoptera. We also observed seven episodic bursts of gene duplication in the insect phylogeny that had varying levels of significance across different MAPS analyses (e.g., Coleoptera, Hymenoptera (in part), and Hemiptera (in part); Fig. 2, SI Appendix, Fig. S6-7 and Dataset S4). Although there was conflict among our analyses as to whether these seven events were statistically consistent with large-scale genome duplications, they do reflect significant increases in gene duplication at these locations in the insect phylogeny.

Considering the putative ancient nature of the MAPS inferred duplications, most are likely too saturated to appear in any of our Ks plots. Saturation of substitutions diminishes the signature peaks of polyploidy in Ks plots, and WGDs with a peak of Ks > 2 may become difficult to detect (Cui et al. 2006; Barker et al. 2008; Vanneste et al. 2013). To confirm that we should not expect to see these six large-scale genome duplications in Ks plots, we tested if the ortholog divergence among these lineages was Ks > 2. As expected, the median ortholog divergence was Ks > 2 (SI Appendix, Fig. S5 and Table S2). Thus, all six of the large-scale genome duplications inferred with MAPS are likely too saturated to appear in Ks plots.

Synteny provides perhaps the most compelling evidence to characterize ancient genome duplication events. Although there are a number of high quality hexapod genomes, there are few high quality genomes in the clades where we inferred ancient duplications. The genome of Bombyx mori (International Silkworm Genome Consortium 2008), the silkworm moth, is one of the few genomes that is reasonably well assembled and has an ancient large-scale genome duplication based on our MAPS analyses (SI Appendix, Figs. S8AA and S9AA, Dataset S5). If this ancient duplication involved structural duplications, as with a WGD or other large-scale chromosomal duplication event, then we expect to find a significant association of syntenic chains with the MAPS paralogs in the genome of B. mori. Using CoGe’s SynMap tool (Lyons et al. 2008), we identified 728 syntenic chains that included 2210 genes (SI Appendix, Fig. S11). To test the significance of this association, we used a statistical approach similar to a method developed to find evidence of paleopolyploidy from synteny in linkage mapping data (Nakazato et al. 2006). We found that significantly more syntenic chains (83 chains) were associated with our MAPS paralogs than expected by chance (chi-square test p-value = 0.0001). Although many of these syntenic chains are small, this is not unexpected given the quality of the B. mori genome assembly and the potential age of this duplication event. Thus, these results are consistent with some type of ancient duplication event in the ancestry of the Lepidoptera.
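The kind of test described above can be illustrated with a minimal sketch of a two-category chi-square goodness-of-fit test. The 728 chains and 83 associated chains come from the text; the expected fraction of 0.07 is a purely hypothetical placeholder, because the null expectation is not reported in this section, and the function name is illustrative. The resulting statistic and p-value are therefore not the values reported here.

```python
import math

def chi_square_1df(observed_hits, total, expected_fraction):
    """Two-category chi-square goodness-of-fit test (1 degree of freedom)."""
    expected_hits = total * expected_fraction
    expected_misses = total - expected_hits
    observed_misses = total - observed_hits
    stat = ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_misses - expected_misses) ** 2 / expected_misses)
    # With 1 df, the chi-square survival function equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# 728 and 83 are from the text; expected_fraction is a HYPOTHETICAL value.
stat, p_value = chi_square_1df(observed_hits=83, total=728, expected_fraction=0.07)
print(f"chi-square = {stat:.2f}, p = {p_value:.4g}")
```

A larger observed count relative to the null expectation gives a larger statistic and a smaller p-value; the actual analysis would derive the expectation from the genome-wide distribution of MAPS paralogs.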

Biased Gene Retention and Loss Following Inferred WGDs

Given the rarity of polyploidy among insects and the fact that no ancient WGDs have been previously observed in the clade, we further characterized the nature of these putative WGDs. A common signature of paleopolyploidy is the biased retention and loss of genes relative to the background pattern of gene turnover. Surviving paralogs from ancient WGDs are often enriched with idiosyncratic gene ontology (GO) categories in plants (Barker et al. 2008; De Smet et al. 2013; Li et al. 2016; Mandáková et al. 2017; Rody et al. 2017), yeasts (Conant 2014), and animals (Berthelot et al. 2014; Lien et al. 2016; Session et al. 2016). Among the many hypotheses to explain the retention of paralogs (Kondrashov and Kondrashov 2006; Freeling 2009; Hahn 2009), the dosage balance hypothesis (DBH) is the only hypothesis that predicts the parallel retention and loss of functionally related genes following WGDs across species (Freeling and Thomas 2006; Freeling 2009; Conant et al. 2014). The DBH predicts that genes with many connections or in dosage sensitive regulatory networks will be retained in duplicate following polyploidy to maintain the relative abundance of protein products. Conversely, these same genes will be preferentially lost following small-scale gene duplications to prevent the disruption of dosage (Freeling 2009). Thus, if the signatures of gene duplication observed in the insects are the result of WGDs, we expect to find a biased pattern of retention and loss that may be shared among the putative insect WGDs.

To test for biased gene retention and loss among our inferred WGDs, we used an HMMER-based approach to annotate the genomes or transcriptomes of 20 hexapod and one outgroup species with the D. melanogaster Gene Ontology (GO) data (Gene Ontology Consortium 2015). Paralogs were partitioned from the putative WGDs inferred in Ks plots by fitting a normal mixture model to the distributions (SI Appendix, Table S5). Using the numbers of genes annotated to each GO category, we performed a principal component analysis to assess the overall differences in GO category composition among all genes in the genome/transcriptome and paralogs retained from each putative WGD (Fig. 3, SI Appendix, Dataset S7). These categories of genes formed non-overlapping clusters in the PCA. Notably, the GO composition of WGD paralogs from each species formed a narrower confidence interval than the entire transcriptomes/genomes. Significant differences (p < 0.001) between the GO composition of these two groups were also found using a chi-square goodness of fit test. Paralogs from WGDs across all 21 species demonstrated biased patterns of gene retention and loss (Fig. 3, SI Appendix, Fig. S13-S14). Consistent with our PCA, many of the same GO categories were significantly over- and under-represented among the genes maintained in duplicate from the putative insect WGDs. For example, over 50% of our sampled species had paralogs from the ancient WGDs that were significantly enriched for RNA metabolic processes, nucleus component, DNA binding function, and nucleobase-containing compound metabolic process (SI Appendix, Fig. S13). Similarly, many GO categories were significantly under-retained among the paralogs of ancient WGDs (SI Appendix, Fig. S13). Over half of our sampled species demonstrated significant under-retention of genes associated with oxidoreductase activity, hydrolase activity, electron carrier activity, peptidase activity, and proteolysis. Although there is some noise in the patterns of gene retention, this may not be surprising given the great divergence of many species from the annotation source, Drosophila, and the phylogenetic scale of the hexapods. Considering the great diversity of these lineages, the convergent pattern of gene retention and loss is consistent with our expectations following polyploidy.
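The comparison of GO composition between WGD paralogs and the whole genome or transcriptome can be sketched as a chi-square goodness-of-fit calculation with per-category residuals. This is a minimal illustration, not the analysis code used for this study: the function name, the use of Pearson residuals to flag over- and under-retention, and all counts in the example are hypothetical.

```python
import math

def go_goodness_of_fit(background_counts, paralog_counts):
    """Compare GO category counts of WGD paralogs against a background.

    Returns the chi-square statistic, the degrees of freedom, and a Pearson
    residual per category (positive = over-retained, negative = under-retained).
    Every background category is assumed to have a non-zero count.
    """
    categories = sorted(background_counts)
    background_total = sum(background_counts.values())
    paralog_total = sum(paralog_counts.get(c, 0) for c in categories)
    stat = 0.0
    residuals = {}
    for c in categories:
        expected = paralog_total * background_counts[c] / background_total
        observed = paralog_counts.get(c, 0)
        stat += (observed - expected) ** 2 / expected
        residuals[c] = (observed - expected) / math.sqrt(expected)
    return stat, len(categories) - 1, residuals

# HYPOTHETICAL counts, for illustration only.
background = {"DNA binding": 900, "oxidoreductase activity": 700, "proteolysis": 400}
paralogs = {"DNA binding": 140, "oxidoreductase activity": 40, "proteolysis": 20}
stat, dof, residuals = go_goodness_of_fit(background, paralogs)
for category, value in residuals.items():
    print(f"{category}: residual = {value:+.2f}")
```

The sign of each residual indicates the direction of the bias for that category; significance would be judged against a chi-square distribution with the returned degrees of freedom.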

Discussion

Our analyses provide the first evidence for paleopolyploidy in the hexapods. Combining our gene age distribution and phylogenomic analyses, we found evidence for 24 significant, episodic bursts of gene duplication in the insects. Although some of these duplication events may result from other mechanisms of gene duplication, they appear to be consistent with WGDs inferred using similar approaches in plants (Barker et al. 2008; Barker et al. 2016; Badouin et al. 2017) and animals (Berthelot et al. 2014; Schwager et al. 2017). Of these bursts, 18 were detected as peaks in gene age distributions that are characteristic of ancient WGDs in 14 hexapod orders (Fig. 2, SI Appendix, Table S3). Genes retained in duplicate from these 18 putative WGD events had a shared pattern of biased gene retention and loss, an expected result of paleopolyploidy but not other types of gene duplication (Kondrashov and Kondrashov 2006; Freeling 2009). An additional six large-scale genome duplication events were inferred deeper in the phylogeny of insects using phylogenomic analyses. These six duplications were consistent with simulated WGDs at these locations in the insect phylogeny. Many phylogenomic analyses use a single value of gene duplication number per branch to diagnose a WGD (Yang et al. 2015; Huang et al. 2016). However, variation in branch lengths and gene birth/death rates may confound these phylogenomic inferences of gene duplication (Hahn 2007). The number of duplications on a given branch is expected to covary with branch length. Without taking into account branch length variation, the number of duplications may appear to change dramatically from branch to branch. Our use of simulated gene trees with both null and positive simulations of WGDs should provide a more robust inference of large-scale genome duplication events that is less sensitive to branch length variation than previous approaches (Li et al. 2015; Yang et al. 2015; Huang et al. 2016). Although many of these inferred duplications are consistent with simulated WGDs, we refer to them here as large-scale duplication events given the phylogenetic depth and difficulty assessing their nature with other methods. Notably, we do find statistical evidence from analyses of synteny in the B. mori genome that a duplication in the ancestry of the Lepidoptera likely involved structural duplication rather than just gene duplication alone. However, more complete genomic analyses are needed to confirm the nature of these duplication events.

Our analyses of GO categories revealed a largely convergent pattern of biased gene retention and loss following genome duplication in multiple species of insects. Based on the dosage balance hypothesis (DBH), we expected to observe biased retention and loss of similar GO categories following WGDs rather than other types of gene duplication events (Kondrashov and Kondrashov 2006; Freeling 2009). Our observation across multiple species is consistent with post-polyploid genome evolution because dosage-balance may drive the convergent retention of genes. Although neo- and subfunctionalization are also important in duplicate gene retention, only the DBH predicts a convergent pattern of gene retention following polyploidy across multiple species (Kondrashov and Kondrashov 2006; Freeling 2009). Thus, the signal of convergent gene retention is consistent with our inference that these events were likely WGDs.

This biased pattern was also found among WGD paralogs in an outgroup species, Ixodes scapularis (deer tick), and suggests that it may be consistent across arthropods. Previous studies in plants have found that similar functional categories of genes have been maintained following different genome duplication events (Li et al. 2016; Rody et al. 2017), although there are often idiosyncratic patterns observed across families (Barker et al. 2008; Li et al. 2016; Mandáková et al. 2017). The 20 insect transcriptomes included in our GO category analyses represent diverse hexapod orders (SI Appendix, Fig. S13) whose divergence times far exceed most previously studied plant examples. At least among the arthropods, our analyses suggest that biases in duplicate gene retention may be maintained over hundreds of millions of years. A potential explanation for the long consistency of this pattern is that insects have experienced a limited number of ancient WGDs that may influence large-scale shifts in gene network relationships. Most hexapod species included in our analyses only had one round of genome duplication, whereas nearly all flowering plants have likely experienced at least three rounds of paleopolyploidy (Barker, Husband, et al. 2016).

Our discovery of WGDs and other large-scale genome duplications in the ancestry of hexapods raises many questions about the role of gene and genome duplication in plant and animal evolution. It has long been known that polyploidy is rarer in animals than in plants (Orr 1990; Otto and Whitton 2000). Muller hypothesized that sex chromosomes are barriers to polyploidy in animals (Muller 1925; Orr 1990). Although our sample size is limited, we observed some patterns in our data set consistent with this hypothesis. For example, we observed more putative WGDs in the Trichoptera, with Z0 sex-determination, relative to its sister lineage, the Lepidoptera, which mostly has a ZW system (Tree of Sex Consortium 2014; Blackmon et al. 2017). Although we had a small sample size for each lineage, this observation is expected because WGD will cause less disruption of dosage compensation in Z0 compared to ZW systems (Orr 1990; Otto and Whitton 2000). More refined placement and denser sampling of ancient WGDs in insects will provide a new opportunity to test Muller's classic hypothesis. Similarly, a recent study proposed that the phylogenetic distribution of ancient WGD among plants may be a by-product of asexuality rather than an intrinsic advantage of polyploidy itself (Freeling 2017). Given the diversity of insect sexual systems, a better understanding of ancient WGDs in insects would provide an improved context for testing this hypothesis in other eukaryotes. Our results also lend support to long-standing hypotheses that gene and genome duplications are important forces in animal evolution (Ohno 1970; Van de Peer et al. 2009). In insects, examples include the expansion of Hox genes in the Lepidoptera (Ferguson et al. 2014) and duplications during the co-evolutionary radiation of pierid butterflies and the Brassicales (Edger et al. 2015). Our observation of numerous episodes of duplication in the hexapod phylogeny raises the possibility that large-scale duplications may be associated with the evolution of novelty and diversity across the insect phylogeny, including additional duplication driven co-evolutionary interactions with plants. Further phylogenomic sampling of insects will likely reveal more paleopolyploidy and other large-scale genome duplications. Large sequencing projects such as 1KP (Wickett et al. 2014), 1KITE (Misof et al. 2014), and i5K (i5K Consortium 2013) will improve our ability to place these events in the plant and hexapod phylogenies. As it stands, our results indicate that large-scale gene and genome duplications have occurred during the evolution of the most diverse clade of eukaryotes.

Materials and Methods

Data Sampling

We compiled a phylogenetically diverse genomic dataset that comprised every hexapod order and outgroups from related arthropods (SI Appendix, Dataset S1). These data included 119 transcriptomes and 25 genomes for hexapods, as well as nine transcriptomes and two genomes from the chelicerates, myriapods, and crustaceans as outgroups. We downloaded 128 published transcriptome assemblies from the GenBank Transcriptome Shotgun Assembly database (TSA), and 27 published genomes from multiple genome databases (SI Appendix).

DupPipe: Inference of WGDs from Paralog Age Distributions

For each data set, we used our DupPipe pipeline to construct gene families and estimate the age of gene duplications (Barker et al. 2010). We translated DNA sequences and identified reading frames by comparing the Genewise alignment to the best hit protein from a collection of proteins from 24 metazoan genomes from Metazome v3.0. For each node in our gene family phylogenies, we estimated synonymous divergence (Ks) using PAML with the F3X4 model (Yang 1997). For each species, we identified ancient WGDs as significant peaks of gene duplication in histograms of the age distribution of gene duplications (Ks plots) using mixture models (McLachlan and Peel 1999) and SiZer (Chaudhuri and Marron 1999).

MAPS: Phylogenomic Inference of Large-scale Genome Duplications from Nuclear Gene Trees

To infer large-scale genome duplications, we used the Multi-tAxon Paleopolyploidy Search (MAPS) tool (Li et al. 2015), a gene tree sorting and counting algorithm. We translated each transcriptome into amino acid sequences using the TransPipe pipeline (Barker et al. 2010). Using these translations, we performed reciprocal BLASTP searches for each MAPS data set with an E-value of 10e-5 as a cutoff. We clustered gene families from these BLAST results using OrthoMCL v2.0 with the default parameters (Li et al. 2003) and only retained gene families that contained at least one gene copy from each taxon. We used SATé for alignment and phylogeny reconstruction of gene families (Liu et al. 2009). The best scoring SATé tree for each gene family was used to infer large-scale genome duplications with MAPS. Results of all 33 MAPS analyses are provided in Dataset S3.
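The retention rule for gene families (at least one gene copy from each taxon) can be sketched as a simple filter. The data layout, a dictionary mapping a family ID to (taxon, gene ID) pairs, and all names below are assumptions for illustration; they do not reflect the OrthoMCL output format or the pipeline code.

```python
def families_with_all_taxa(families, taxa):
    """Keep gene families that contain at least one gene from every taxon.

    `families` maps a family ID to a list of (taxon, gene_id) pairs.
    """
    required = set(taxa)
    return {
        family_id: members
        for family_id, members in families.items()
        if required <= {taxon for taxon, _ in members}
    }

# HYPOTHETICAL families and taxon labels.
families = {
    "fam1": [("Bombyx", "g1"), ("Bombyx", "g2"), ("Danaus", "g3"), ("Tribolium", "g4")],
    "fam2": [("Bombyx", "g5"), ("Danaus", "g6")],
}
kept = families_with_all_taxa(families, ["Bombyx", "Danaus", "Tribolium"])
print(sorted(kept))
```

Families with multiple copies from one taxon are kept, since those extra copies are exactly the paralogs that a duplication analysis needs.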

Syntenic Analysis of Bombyx mori

To validate our MAPS inferences of large-scale genome duplications in the Lepidoptera, we examined the Bombyx mori genome (International Silkworm Genome Consortium 2008) for syntenic evidence of duplication (SI Appendix, Figs. S6AA-S9AA). SynMap on the CoGe platform (Lyons et al. 2008) was used to identify syntenic regions. We used blastp with an E-value cutoff of 10e-5. Quota Align Merge was selected to merge syntenic chains and other parameters were set as default. Synteny was detected with a minimum of three genes to seed a chain and a Manhattan distance of 40.

Estimating Orthologous Divergence to Place Large-Scale Genome Duplications in Relation to Lineage Divergence

We estimated ortholog divergences among major hexapod clades to place large-scale genome duplications in relation to lineage divergence (SI Appendix, Table S2). We used the RBH Ortholog pipeline (Barker et al. 2010) to estimate the median ortholog divergence between 24 species pairs (SI Appendix, Fig. S5 and Table S2). The median ortholog divergence was used to estimate the lower bound of paralog divergence for shared ancient large-scale genome duplication events.
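The logic of using median ortholog divergence as a lower bound can be sketched as follows. A duplication shared by two lineages must predate their divergence, so its paralog Ks should be at least as large as the median ortholog Ks for that species pair, assuming roughly comparable substitution rates. The function names and all Ks values below are hypothetical; the Ks > 2 threshold is the saturation level discussed in the text.

```python
from statistics import median

SATURATION_KS = 2.0  # Ks above which peaks become difficult to detect in Ks plots

def median_ortholog_ks(ks_values):
    """Median synonymous divergence across ortholog pairs of one species pair."""
    return median(ks_values)

def predates_divergence(paralog_peak_ks, ortholog_median_ks):
    """True if a paralog peak is old enough to be shared by both lineages."""
    return paralog_peak_ks >= ortholog_median_ks

def likely_saturated(ks):
    """True if a duplication of this age is expected to be saturated."""
    return ks > SATURATION_KS

# HYPOTHETICAL ortholog Ks values for one species pair.
ortholog_ks = [2.1, 2.6, 3.0, 2.4, 2.9]
split_ks = median_ortholog_ks(ortholog_ks)
print(split_ks, likely_saturated(split_ks), predates_divergence(0.8, split_ks))
```

In this hypothetical case, a Ks plot peak at 0.8 would be too young to be shared by the two lineages, and any duplication shared by them would fall in the saturated range.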

Gene Ontology (GO) Annotations and Paleolog Retention and Loss Patterns

We used the best hit with a length of at least 100 bp and an E-value of 0.01 or lower in phmmer (HMMER 3.1b1) for Gene Ontology (GO) annotation. For each species, we assigned paralogs to ancient WGDs based on the Ks ranges identified in mixture model analyses (SI Appendix, Table S5) (McLachlan and Peel 1999; Barker et al. 2008). We evaluated the overall differences between the genomes/transcriptomes and WGD paralogs by performing a principal component analysis (PCA) using the rda function in the R package vegan (Dixon 2003). We also tested for differences among GO annotations across the inferred WGD events using chi-square tests (SI Appendix, Fig. S13-S14).

Acknowledgments

We thank A. Baltzell, A. Baniaga, J. Bronstein, K. Dlugosch, S. Jorgensen, E. Lyons, X. Qi, J. Czekanski-Moir and Y. Tomoyasu for suggestions and discussion. We thank Z. Wang, L. Tang, and X. Song for assistance with insect images. We also thank two anonymous reviewers for their careful reading and suggestions that improved the manuscript. Hosting infrastructure and services were provided by the Biotechnology Computing Facility (BCF) at the University of Arizona. M.S.B. was supported by NSF-IOS-1339156 and NSF-EF-1550838.

References

Arrigo N, Barker MS (2012) Rarely successful polyploids and their legacy in plant genomes. Curr Opin Plant Biol 15(2):140–146.

Badouin H, et al. (2017) The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546(7656):148–152.

Barker MS, Arrigo N, Baniaga AE, Li Z, Levin DA (2016) On the relative abundance of autopolyploids and allopolyploids. New Phytol 210(2):391–398.

Barker MS, et al. (2008) Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol Biol Evol 25(11):2445–2455.

Barker MS, et al. (2010) EvoPipes.net: Bioinformatic tools for ecological and evolutionary genomics. Evol Bioinform Online 6:143–149.

Barker MS, et al. (2016) Most Compositae (Asteraceae) are descendants of a paleohexaploid and all share a paleotetraploid ancestor with the Calyceraceae. Am J Bot 103(7):1203–1211.

Barker MS, Husband BC, Pires JC (2016) Spreading Winge and flying high: The evolutionary importance of polyploidy after a century of study. Am J Bot 103(7):1139–1145.

Berthelot C, et al. (2014) The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat Commun 5:3657.

Blackmon H, Ross L, Bachtrog D (2017) Sex determination, sex chromosomes, and karyotype evolution in Insects. J Hered 108(1):78–93.

Chaudhuri P, Marron JS (1999) SiZer for exploration of structures in curves. J Am Stat Assoc 94(447):807–823.

Clarke TH, Garb JE, Hayashi CY, Arensburger P, Ayoub NA (2015) Spider transcriptomes identify ancient large-scale gene duplication event potentially important in silk gland evolution. Genome Biol Evol 7(7):1856–1870.

Conant GC (2014) Comparative genomics as a time machine: how relative gene dosage and metabolic requirements shaped the time-dependent resolution of yeast polyploidy. Mol Biol Evol 31(12):3184–3193.

Conant GC, Birchler JA, Pires JC (2014) Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Curr Opin Plant Biol 19:91–98.

Cui L, et al. (2006) Widespread genome duplications throughout the history of flowering plants. Genome Res 16(6):738–749.

Cutter AD (2008) Divergence times in Caenorhabditis and Drosophila inferred from direct estimates of the neutral mutation rate. Mol Biol Evol 25(4):778–786.

De Smet R, et al. (2013) Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc Natl Acad Sci USA 110(8):2898–2903.

De Wever A. eds. (2015) Species 2000 & ITIS, 2015 Annual Checklist. Catalogue of Life. Available at: www.catalogueoflife.org/annual-checklist/2015 [Accessed October 20, 2015].

Dixon P (2003) VEGAN, a package of R functions for community ecology. J Veg Sci 14(6):927.

Edger PP, et al. (2015) The butterfly plant arms-race escalated by gene and genome duplications. Proc Natl Acad Sci USA 112(27):8362–8366.

Ferguson L, et al. (2014) Ancient expansion of the hox cluster in lepidoptera generated four genes implicated in extra-embryonic tissue formation. PLoS Genet 10(10):e1004698.

Freeling M (2009) Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu Rev Plant Biol 60(1):433–453.

Freeling M (2017) Picking up the ball at the K/Pg boundary: the distribution of ancient polyploidies in the plant phylogenetic tree as a spandrel of asexuality with occasional sex. Plant Cell 29(2):202–206.

Freeling M, Thomas BC (2006) Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16(7):805–814.

Gene Ontology Consortium (2015) Gene Ontology Consortium: going forward. Nucleic Acids Res 43(Database issue):D1049–1056.

Gout J-F, Lynch M (2015) Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol 32(8):2141–2148.

Gregory TR, Mable BK (2005) Polyploidy in animals. The Evolution of the Genome, ed Gregory TR (Elsevier Academic, London), pp 427–517.

Hahn MW (2007) Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol 8(7):R141.

Hahn MW (2009) Distinguishing among evolutionary models for the maintenance of gene duplicates. J Hered 100(5):605–617.

Hallinan NM, Lindberg DR (2011) Comparative analysis of chromosome counts infers three paleopolyploidies in the Mollusca. Genome Biol Evol 3:1150–1163.

Huang C-H, et al. (2016) Multiple polyploidization events across Asteraceae with two nested events in the early history revealed by nuclear phylogenomics. Mol Biol Evol 33(11):2820–2835.

i5K Consortium (2013) The i5K Initiative: advancing arthropod genomics for knowledge, human health, agriculture, and the environment. J Hered 104(5):595–600.

International Silkworm Genome Consortium (2008) The genome of a lepidopteran model insect, the silkworm Bombyx mori. Insect Biochem Mol Biol 38(12):1036–1045.

Jiao Y, et al. (2011) Ancestral polyploidy in seed plants and angiosperms. Nature 473(7345):97–100.

Kondrashov FA, Kondrashov AS (2006) Role of selection in fixation of gene duplications. J Theor Biol 239(2):141–151.

Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13(9):2178–2189.

Li Z, et al. (2015) Early genome duplications in conifers and other seed plants. Science Advances 1(10):e1501084.

Li Z, et al. (2016) Gene duplicability of core genes is highly consistent across all angiosperms. Plant Cell 28(2):326–344.

Lien S, et al. (2016) The Atlantic salmon genome provides insights into rediploidization. Nature 533(7602):200–205.

Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324(5934):1561–1564.

Lyons E, Pedersen B, Kane J, Freeling M (2008) The value of nonmodel genomes and an example using SynMap within CoGe to dissect the hexaploidy that predates the rosids. Trop Plant Biol 1(3-4):181–190.

Mandáková T, Li Z, Barker MS, Lysak MA (2017) Diverse genome organization following 13 independent mesopolyploid events in Brassicaceae contrasts with convergent patterns of gene retention. Plant J 91(1):3–21.

McLachlan GJ, Peel D (1999) The EMMIX algorithm for the fitting of normal and t-components. J Stat Softw 4(2):1–14.

McLysaght A, Hokamp K, Wolfe KH (2002) Extensive genomic duplication during early chordate evolution. Nat Genet 31(2):200–204.

Misof B, et al. (2014) Phylogenomics resolves the timing and pattern of insect evolution. Science 346(6210):763–767.

Muller HJ (1925) Why polyploidy is rarer in animals than in plants. Am Nat 59(663):346–353.

Murat F, Armero A, Pont C, Klopp C, Salse J (2017) Reconstructing the genome of the most recent common ancestor of flowering plants. Nat Genet 49(4):490–496.

Nakazato T, Jung M-K, Housworth EA, Rieseberg LH, Gastony GJ (2006) Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii. Genetics 173(3):1585–1597.

Nossa CW, et al. (2014) Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication. Giga Sci 3(1):9.

Ohno S (1970) Evolution by Gene Duplication (Springer, New York).

Orr HA (1990) “Why Polyploidy is Rarer in Animals Than in Plants” Revisited. Am Nat 136(6):759–770.

Otto SP, Whitton J (2000) Polyploid incidence and evolution. Annu Rev Genet 34:401–437.

Rabier C-E, Ta T, Ané C (2014) Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol Biol Evol 31(3):750–762.

Rody HVS, Baute GJ, Rieseberg LH, Oliveira LO (2017) Both mechanism and age of duplications contribute to biased gene retention patterns in plants. BMC Genomics 18(1):46.

Schwager EE, et al. (2017) The house spider genome reveals an ancient whole-genome duplication during arachnid evolution. BMC Biol 15(1):62.

Selmecki AM, et al. (2015) Polyploidy can drive rapid adaptation in yeast. Nature 519(7543):349–352.

Session AM, et al. (2016) Genome evolution in the allotetraploid frog Xenopus laevis. Nature 538(7625):336–343.

Sjöstrand J, Arvestad L, Lagergren J, Sennblad B (2013) GenPhyloData: realistic simulation of gene family evolution. BMC Bioinformatics 14:209.

Tank DC, et al. (2015) Nested radiations and the pulse of angiosperm diversification: increased diversification rates often follow whole genome duplications. New Phytol 207(2):454–467.

Tree of Sex Consortium (2014) Tree of Sex: a database of sexual systems. Sci Data 1:140015.

Van de Peer Y, Maere S, Meyer A (2009) The evolutionary significance of ancient genome duplications. Nat Rev Genet 10(10):725–732.

Van de Peer Y, Mizrachi E, Marchal K (2017) The evolutionary significance of polyploidy. Nat Rev Genet 18(7):411–424.

Vanneste K, Van de Peer Y, Maere S (2013) Inference of genome duplications from age distributions revisited. Mol Biol Evol 30(1):177–190.

Wendel JF (2015) The wondrous cycles of polyploidy in plants. Am J Bot 102(11):1753–1756.

Wickett NJ, et al. (2014) Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Natl Acad Sci USA 111(45):E4859–E4868.

Wolfe KH, Shields DC (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387(6634):708–713.

Wood TE, et al. (2009) The frequency of polyploid speciation in vascular plants. Proc Natl Acad Sci USA 106(33):13875–13879.

Yang Y, et al. (2015) Dissecting molecular evolution in the highly diverse plant clade Caryophyllales using transcriptome sequencing. Mol Biol Evol 32(8):2001–2014.

Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Bioinformatics 13(5):555–556.


Appendix B: Figures

Fig. 1. Inferring ancient WGDs and large-scale genome duplications. Histograms of the age distribution of gene duplications (Ks plots) with mixture models of inferred WGDs for (A) Baetis sp. (Ephemeroptera) inferred WGD peak median Ks = 0.83. (B) Gynaikothrips ficorum (Thysanoptera) inferred WGD peak median Ks = 1.73. (C) Menopon gallinae (Psocodea) inferred WGD peak median Ks = 0.80. (D) Trichocera saltator (Diptera) inferred WGD peak median Ks = 0.59. The mixture model distributions consistent with inferred ancient WGDs are highlighted in yellow. (E) MAPS results from observed data, null and positive simulations on the associated phylogeny. Percentage of subtrees that contain a gene duplication shared by descendant species at each node, results from observed data (red line), 100 resampled sets of null simulations (multiple black lines) and positive simulations (multiple gray lines). The red oval corresponds to the location of an inferred large-scale genome duplication event in Lepidoptera.

Fig. 2. Placement of inferred ancient genome duplications on the phylogeny of Hexapoda. Red circles = WGDs in hexapods inferred from Ks plots; Orange circles = WGDs in outgroups inferred from Ks plots; Blue diamonds = large-scale genome duplications inferred by MAPS analyses; Empty squares = episodic bursts of gene duplication with varying levels of significance across different MAPS analyses. Hexapod phylogeny adapted from Misof et al. 2014. The Solenopsis invicta (Hymenoptera) WGD inferred by Ks plot is not included on this phylogeny. Images of Raphidioptera, Coleoptera, and Neuroptera credit to Tang Liang, Zichen Wang and Zheng Li. Other images are in the public domain (Table S6).

Fig. 3. Principal component analysis of the GO category composition of all genes in each genome/transcriptome and WGD paralogs. Red circles = number of genes annotated to each GO category in the whole genome or transcriptomes. Black circles = number of WGD paralogs annotated to each GO category. Ellipses represent the 95% confidence interval of standard deviation of point scores.

Appendix B: Supplementary Information

Supplemental Methods

Supplemental Discussion

References

Figures S1-S14

Tables S1-S6

Datasets S1-S8

Supplemental Methods

Sampling and Transcriptome Assembly

We downloaded 128 published transcriptome assemblies from the GenBank Transcriptome Shotgun Assembly database (TSA), and 27 published genomes from VectorBase (1), AphidBase (2), Hymenoptera genome databases (3), BeetleBase (4), SilkDB (5) and Ensembl Metazoa (6) (Dataset S1). We reassembled four transcriptomes (Aphis gossypii, Cryptops hortensis, Daphnia pulex, Litopenaeus vannamei) with data from the GenBank Sequence Read Archive (SRA) because the previous assemblies significantly reduced the number of recent paralogs. We performed quality filtering and trimming of these raw reads using the SnoWhite pipeline (7) and assembled the cleaned reads from each dataset using SOAPdenovo-Trans (8) with a k-mer comprising ⅔ of the read lengths and other parameters set to default.

DupPipe: Single Species Inference of WGDs from Paralog Age Distributions

We used the DupPipe pipeline (9) to estimate the divergence of paralogs and infer whole genome duplication events. We translated DNA sequences and identified reading frames by comparing the Genewise alignment to the best hit protein from a collection of proteins from 24 metazoan genomes from Metazome v3.0. For all DupPipe runs, only protein-coding gene sequences with significant similarity to a collection of annotated proteins from 24 metazoan genomes from Metazome v3.0 were retained by our pipelines (9). For each node in our gene family phylogenies, we estimated synonymous divergence (Ks) using PAML with the F3X4 model (10). The age distribution of gene duplications was plotted on two sets of histograms with x-axis scales of Ks = 2 and Ks = 5 to assess WGDs at different scales (Figs. S1-S4).

Following previous analyses (11–13), we used a combination of SiZer (14) and mixture model analyses to identify significant peaks of gene duplication in hexapod Ks plots. SiZer uses the first derivative of a range of kernel density estimates to find significant slope increases or decreases in the age distribution. Statistically significant peaks appear as significant increases and decreases in slope. In contrast, EMMIX (15) fits mixtures of normal distributions to a given set of data. Peaks produced by paleopolyploidy are expected to be approximately Gaussian (16, 17), and this mixture model test identifies the number of normal distributions and their positions that best explain our observed age distributions. For our analyses, 1–5 normal distributions were fitted to the data with 1,000 random starts and 100 k-mean starts. Significant peaks were identified by comparing the ΔBIC values for models with and without inferred polyploid peaks. We used the BIC instead of the Akaike information criterion (AIC) because the BIC has more severe penalties for increasing parameters.
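The model comparison described above can be illustrated with a minimal sketch that fits one-dimensional normal mixtures by expectation-maximization and compares them by BIC. This is not EMMIX: it uses a single deterministic start rather than 1,000 random and 100 k-means starts, and the synthetic data, function names, and parameter choices are all assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_mixture(data, k, iterations=100):
    """Fit a k-component normal mixture by EM; return (loglik, BIC, parameters)."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: split the sorted data into k equal-sized chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    sds = [max(1e-3, math.sqrt(sum((x - m) ** 2 for x in c) / len(c)))
           for c, m in zip(chunks, means)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(1e-3, math.sqrt(var))
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sds)) or 1e-300)
        for x in xs)
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    bic = n_params * math.log(n) - 2.0 * loglik
    return loglik, bic, list(zip(weights, means, sds))

# SYNTHETIC Ks-like values: a young background peak plus an older peak.
random.seed(1)
ks = ([random.gauss(0.3, 0.1) for _ in range(300)]
      + [random.gauss(1.2, 0.2) for _ in range(200)])
_, bic_one, _ = fit_mixture(ks, 1)
_, bic_two, params_two = fit_mixture(ks, 2)
delta_bic = bic_one - bic_two  # positive values favor the two-component model
print(f"delta BIC = {delta_bic:.1f}")
```

Because the BIC penalty grows with log(n) per parameter, an additional component is favored only when it improves the likelihood substantially, which is the conservative behavior motivating the choice of BIC over AIC.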

For many species with evidence of WGDs, Ks plots of their sister lineages did not contain evidence of the putative WGD. In these cases, we concluded that the WGD occurred only in the ancestry of the lineage with evidence. In some cases, WGDs were apparent in Ks plots of sister lineages. We used a combination of ortholog divergence analyses and a phylogenomic approach (MAPS described below) to further analyze these data and simultaneously infer and place WGDs in a phylogenetic context.

MAPS: Phylogenomic Inference of Large-Scale Genome Duplications from Nuclear Gene Trees

We used 33 MAPS analyses to infer large-scale genome duplications across the hexapod phylogeny. We circumscribed and constructed nuclear gene family phylogenies from multiple species for each MAPS analysis. We used SATé for alignment and phylogeny reconstruction of gene families (18). For each gene family, we ran SATé until we reached five iterations without an improvement in the likelihood score using a centroid breaking strategy. We constructed alignments using MAFFT (19), employed Opal for mergers (20), and RAxML for tree estimation (21). We used the best scoring SATé tree for each gene family to infer large-scale genome duplications with MAPS. The MAPS algorithm uses a given species tree to filter collections of nuclear gene trees for subtrees consistent with relationships at each node in the species tree. Using this filtered set of subtrees, MAPS identifies and counts the number of gene duplications shared by descendant taxa at each node. To maintain sufficient gene tree numbers for each MAPS analysis, we used collections of gene family phylogenies for six to eight taxa to infer ancient WGDs.
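The counting step can be illustrated with a toy example. The sketch below is not the MAPS implementation: it represents a gene tree as nested two-element tuples whose leaves are species names, skips the species-tree filtering step entirely, and uses a hypothetical criterion in which a gene tree node counts as a shared duplication when at least two species of the focal clade occur on both sides of the split.

```python
def leaf_species(tree):
    """Return the set of species names found at the tips of a gene (sub)tree."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaf_species(left) | leaf_species(right)


def shared_duplications(tree, clade, min_shared=2):
    """Count nodes where >= min_shared clade species appear in both child clades."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    count = (shared_duplications(left, clade, min_shared)
             + shared_duplications(right, clade, min_shared))
    common = leaf_species(left) & leaf_species(right) & set(clade)
    if len(common) >= min_shared:
        count += 1
    return count


# Toy gene tree: species A and B each carry two paralogs; C is single copy.
toy_tree = ((("A", "B"), ("A", "B")), "C")
n_shared = shared_duplications(toy_tree, clade={"A", "B"})
```

In this toy tree the duplication predates the split of A and B, so both species appear in each paralogous clade and one shared duplication is counted.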

Errors in transcriptome or genome assembly, gene family clustering, and the construction of gene family phylogenies can result in topological errors in gene trees (22). Previous studies have suggested that errors in gene trees can bias the placement of duplications towards the root of the tree and of losses towards the tips of the tree (23). For this reason, we aimed to place the focal nodes of each MAPS analysis in the middle of the phylogeny. To further decrease potential error in our inferences of gene duplications, we required at least 45% of the ingroup taxa to be present in all subtrees analyzed by MAPS. If this minimum ingroup taxa requirement was not met, the gene subtree was filtered out and excluded from our analysis. Increasing taxon occupancy should lead to more accurate duplication inference and reduce some of the biases in mapping duplications onto a species tree (23, 24).
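The occupancy requirement amounts to a simple filter on each subtree. The following sketch assumes that a subtree is summarized by the set of species at its tips; the function name and the species labels in the usage note are hypothetical.

```python
def passes_occupancy(subtree_taxa, ingroup_taxa, min_fraction=0.45):
    """True if at least min_fraction of the ingroup taxa occur in the subtree."""
    ingroup = set(ingroup_taxa)
    present = set(subtree_taxa) & ingroup
    return len(present) / len(ingroup) >= min_fraction
```

With seven ingroup taxa, for example, a subtree containing three of them (about 43%) would be excluded, whereas a subtree containing four (about 57%) would be retained.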

We expect that the number of gene duplications on each branch will vary with branch length. We also expect that changes in species composition among our MAPS analyses will impact the number of shared gene duplications observed at a particular branch in different analyses. To account for this variation in our comparisons, all MAPS analyses were first compared to a null simulation of the number of gene duplications we expect on each branch from background gene birth and death. A Fisher’s exact test, implemented in R (25), was used to identify locations with significant increases of gene duplication compared with a null simulation (Figure S8). For 22 MAPS analyses with significant bursts of gene duplication detected with the Fisher’s exact test, we simulated an additional distribution of gene trees with a WGD at the location of the significant burst of duplication. If these increases in gene duplications are due to large-scale genome duplications, we expect the number of shared gene duplications in the observed MAPS result to be consistent with these positive simulations. A second Fisher’s exact test was then used to characterize bursts of duplication as ancient large-scale duplications if the percentage of duplications in the observed data was not significantly lower than that in a positive simulation (Figure S9).
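The tests in this study were run in R. For readers who want to see the arithmetic, the sketch below is a minimal Python equivalent of a one-sided Fisher’s exact test on a two-by-two table, written from the hypergeometric distribution. The table layout (observed versus simulated subtrees, with versus without a shared duplication) and the function name are assumptions for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value for enrichment of the first row in the first column.

    Table layout: [[a, b], [c, d]], e.g. a/b = observed subtrees with/without a
    shared duplication and c/d = simulated subtrees with/without one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, row1)
    p = 0
    # Sum the probabilities of all tables at least as extreme as the observed one.
    for x in range(a, upper + 1):
        p += comb(col1, x) * comb(n - col1, row1 - x)
    return p / total
```

A small p-value from the comparison with the null simulation marks a burst of duplication, whereas the comparison with the positive simulation tests the opposite direction and therefore swaps the roles of the rows.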

For the null simulations, we first estimated the mean background gene duplication rate (λ) and gene loss rate (μ) with WGDgc (26)(Table S4). Gene count data were obtained from OrthoMCL (27) clusters associated with each species tree (Table S4). λ and μ were estimated using only gene clusters that spanned the root of their respective species trees, which has been shown to reduce biases in the maximum likelihood estimates of λ and μ (26). We chose a maximum gene family size of 100 for parameter estimation, which was necessary to provide an upper bound for numerical integration of node states (26). We provided a prior probability distribution on the number of genes at the root of each species tree, such that ancestral gene family sizes followed a shifted geometric distribution with mean equal to the average number of genes per gene family across species (Table S4).
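The prior on ancestral gene family size can be written out explicitly. The sketch below assumes the usual shifted geometric distribution on 1, 2, 3, ..., whose mean is the reciprocal of its success probability; the mean of 1.5 genes per family is a placeholder, not a value from Table S4.

```python
def shifted_geometric_pmf(n, mean):
    """P(N = n) for n = 1, 2, ... with E[N] = mean (requires mean >= 1)."""
    if n < 1:
        return 0.0
    p = 1.0 / mean
    return p * (1.0 - p) ** (n - 1)


# Prior over root gene counts up to the maximum family size of 100.
prior = [shifted_geometric_pmf(n, mean=1.5) for n in range(1, 101)]
```

Truncating at 100 genes discards a negligible amount of probability for means of this size, which is consistent with using 100 as an upper bound for the calculations described above.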

Gene trees were then simulated within each MAPS species tree using the GuestTreeGen program from GenPhyloData (28). We developed ultrametric species trees from the topological relationships of Misof et al. (2014) (29) and median branch lengths from TimeTree (30). For each species tree, we simulated 4000 gene trees with at least one tip per species: 1000 gene trees at the λ and μ maximum likelihood estimates, 1000 gene trees at half the estimated λ and μ, 1000 trees at three times λ and μ, and 1000 trees at five times λ and μ. For all simulations, we applied the same empirical prior used for estimation of λ and μ. We then randomly resampled 1000 trees without replacement from the total pool of gene trees 100 times to provide a measure of uncertainty on the percentage of subtrees at each node. Given that natural gene tree discordance and errors from methodological sources are an inescapable part of our analyses, we also introduced error into our gene tree simulations. For each set of 1000 gene trees, each gene tree had a 50% chance that three pairs of genes would randomly swap positions. We chose to randomize three pairs of genes to introduce a range of errors in our gene trees similar to other studies (31).
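The resampling and error-injection scheme can be sketched as follows. This is an illustration under simplifying assumptions, not the code used in the study: each simulated gene tree is stood in for by a label, a tree's tips are represented as a flat list of gene names on a fixed topology, and all function names are hypothetical.

```python
import random

RATE_MULTIPLIERS = [1.0, 0.5, 3.0, 5.0]  # MLE, half, three times, five times
TREES_PER_RATE = 1000


def build_pool():
    """Label each simulated gene tree by (rate multiplier, index)."""
    return [(m, i) for m in RATE_MULTIPLIERS for i in range(TREES_PER_RATE)]


def resample(pool, rng, size=1000, replicates=100):
    """Draw `replicates` subsamples of `size` trees, each without replacement."""
    return [rng.sample(pool, size) for _ in range(replicates)]


def add_swap_error(tips, rng, prob=0.5, n_pairs=3):
    """With probability `prob`, swap the positions of `n_pairs` random tip pairs."""
    tips = list(tips)
    if rng.random() < prob:
        for _ in range(n_pairs):
            i, j = rng.sample(range(len(tips)), 2)
            tips[i], tips[j] = tips[j], tips[i]
    return tips
```

Sampling without replacement applies within each replicate, so the same tree can appear in several of the 100 replicates but never twice in one.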

For positive simulations, we simulated gene trees using the same methods described above. However, we incorporated a WGD at the location in each of the 22 MAPS phylogenies with significantly larger numbers of gene duplications compared to the null simulation. Based on previous phylogenomic analyses of WGDs (32–35), we allowed at least 30% of the genes to be retained following the simulated WGD to account for biased gene retention and loss (36–39).

Validating the MAPS Inferred Gene Duplication Event in Lepidoptera with Syntenic Evidence from Bombyx mori

Synteny provides perhaps the most compelling evidence to characterize ancient genome duplication events. Although there are a number of high quality hexapod genomes, especially among model organisms, there are few high quality assemblies in the clades where we inferred ancient large-scale genome duplications. The genome of Bombyx mori (40), the silkworm moth, is one of the few genomes that has an ancient genome duplication based on our MAPS analyses. We used the SynMap tool on the CoGe platform (41) to identify syntenic regions. Given the quality of the B. mori assembly and our expectation that the syntenic evidence of an ancient large-scale gene duplication event in the ancestry of extant Lepidoptera will be significantly eroded, we required three genes to seed a syntenic region and up to 40 interleaving genes to define syntenic chains. Similar analyses of deep genome duplication events, such as the syntenic analyses in the genome of Amborella trichopoda (42), often use more relaxed parameters and allow up to 80 interleaving genes to identify syntenic chains. However, such relaxed syntenic parameters make it possible for a single syntenic region to be counted as multiple smaller subsets of nested syntenic chains. Although our parameters were more stringent than other analyses, we still searched for and removed any syntenic chains that were subsets of a larger chain or reorderings of existing chains from the SynMap output.

To be conservative, we also discarded chains that differed by two or fewer genes from a chain previously encountered in the SynMap output. We also accounted for the presence of tandem duplicates in our syntenic analysis. Including tandem duplicates in our results may lead to overestimation of the number of syntenic chains. In CoGe SynMap, the blast2raw program was used to identify tandem duplicates. These tandem duplicates are then condensed and treated as a single gene in SynMap analyses. To further minimize the potential impact of tandem duplicates, we also removed any syntenic chains whose duplicate chain was a hit to the same scaffold. Such cases are more likely to result from the expansion of tandem arrays within a scaffold rather than large-scale gene duplications. All self-syntenic regions in the B. mori genome (Fig. S11), as well as those only corresponding to chains with MAPS genes (Fig. S12), were plotted using Circos (43).
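The chain filtering rules described above can be sketched as a single pass over the chains. This is an illustrative reconstruction rather than the script used for the analysis: it assumes each chain is a dictionary holding a set of gene IDs and the two scaffolds carrying the duplicated regions, and it visits larger chains first so that nested chains are compared against the chains that contain them. The dictionary keys and function name are hypothetical and do not reflect the SynMap output format.

```python
def filter_chains(chains, max_diff=2):
    """Drop same-scaffold chains and chains redundant with one already kept."""
    kept = []
    # Visit larger chains first so nested subsets are compared against them.
    for chain in sorted(chains, key=lambda c: len(c["genes"]), reverse=True):
        if chain["scaffold_a"] == chain["scaffold_b"]:
            continue  # more likely a tandem array within one scaffold
        genes = set(chain["genes"])
        redundant = any(
            genes <= set(k["genes"])                      # subset or reordering
            or len(genes ^ set(k["genes"])) <= max_diff   # differs by <= 2 genes
            for k in kept
        )
        if not redundant:
            kept.append(chain)
    return kept
```

Treating genes as sets makes a reordered chain identical to the original, so reorderings are removed by the same subset check.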

Using these syntenic data, we tested if paralogs associated with the large-scale genome duplication event inferred by MAPS were more likely to occur on syntenic chains than expected by chance. This method is based on an approach introduced by Nakazato et al. (2006), which was developed to find evidence of paleopolyploidy from synteny in linkage mapping data (44). The null expectation for the number of genes associated with syntenic chains given the level of genome fragmentation is established by comparing the proportion of syntenic chains in the entire genome to the total number of genes in the genome. Using a 2 x 2 contingency table, this is then compared to the observed proportion of syntenic chains with MAPS genes relative to the number of B. mori paralogs associated with the large-scale genome duplication event inferred by the Lepidoptera MAPS analysis. A chi-square test with Yates correction was used to test if these proportions are significantly different from our null expectation.
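For reference, the Yates-corrected statistic for a 2 x 2 table has a closed form that can be computed directly. The sketch below is a generic implementation of that formula with a p-value for one degree of freedom; the function name is hypothetical, and how the four cells are filled from the syntenic chain and paralog counts is left to the reader because the exact table layout is not spelled out above.

```python
import math


def yates_chi_square(a, b, c, d):
    """Chi-square with Yates continuity correction for the table [[a, b], [c, d]].

    Returns the statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # For df = 1 the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

The continuity correction subtracts half the total count from the absolute cross-product difference, which makes the test slightly more conservative than the uncorrected chi-square.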

Gene Ontology (GO) Annotations and Paleolog Retention and Loss Patterns

Gene Ontology (GO) annotations of all hexapod (and one outgroup) transcriptomes and genomes were obtained through phmmer (HMMER 3.1b1) (45) searches against annotated Drosophila melanogaster transcripts from Gene Ontology Consortium (46) to find the best hit. We evaluated the overall differences between the GO composition of whole genomes/transcriptomes and WGD paralogs by PCA (principal component analysis) using the rda function in vegan (47) (Fig. 3, Dataset S7). We tested if the GO category composition was different between the whole genome/transcriptome and WGD paralogs using the goodness of fit test in the vegan envfit function. Ellipses representing the 95% confidence interval of standard deviation of point scores were drawn on the PCA plot using the ordiellipse function in vegan.

We further tested for differences among GO annotations using chi-square tests. When chi-square tests were significant (P < 0.05), GO categories with |residual| > 2 were implicated as major contributors to the significant chi-square statistic. A category with residual > 2 indicates significant over-retention of this category following WGD, whereas residual < –2 indicates significant under-retention (13). Using this statistical framework, we tested for significant differences between the overall transcriptome and paralogs from all the paleopolyploid species.
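The residual cutoff can be illustrated with a small calculation. The sketch below assumes Pearson residuals, (observed - expected) / sqrt(expected), computed for the paralog row of a two-row table of GO category counts; the text does not state which type of residual was used, so this choice, the function names, and any counts passed in are assumptions.

```python
import math


def pearson_residuals(observed_a, observed_b):
    """Pearson residuals for group A in a 2 x k table of GO category counts."""
    total_a, total_b = sum(observed_a), sum(observed_b)
    grand = total_a + total_b
    residuals = []
    for a, b in zip(observed_a, observed_b):
        expected_a = (a + b) * total_a / grand
        residuals.append((a - expected_a) / math.sqrt(expected_a))
    return residuals


def flag_categories(names, residuals, cutoff=2.0):
    """Split categories into over-retained (> cutoff) and under-retained (< -cutoff)."""
    over = [n for n, r in zip(names, residuals) if r > cutoff]
    under = [n for n, r in zip(names, residuals) if r < -cutoff]
    return over, under
```

Here group A would hold the WGD paralog counts and group B the whole transcriptome counts, so a positive residual corresponds to over-retention after WGD.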

Supplemental Results

Overall, our analyses of Ks plots for 155 species revealed evidence for 18 WGDs in hexapods. Peaks of gene duplication consistent with ancient WGDs were observed in the Ks plots of 21 taxa (20 hexapods and one chelicerate; Fig. S1-S2, Table S1, and Dataset S2). SiZer analyses identified significant peaks at the same positions in the Ks plots as the mixture model analyses in 17 of the 21 taxa. Leuctra sp., Thrips palmi, Tenthredo koehleri and Nemophora degeerella did not have a significant peak in SiZer, most likely because the younger peaks of gene duplication are in the “shadow” of the slope from the peak of recent gene birth (Fig. S1F, J, M and S). However, the ΔBIC values of mixture models with and without inferred polyploid peaks were large, supporting our inferences for WGDs in 20 hexapod species and one chelicerate (Table S1, Dataset S2). The sampled sister lineages for 12 of these taxa did not contain any evidence of an ancient WGD. This suggests that the observed WGDs are restricted to these 12 lineages (Fig. 2). These ancient WGDs are found in diverse hexapod lineages including springtails, beetles, ants, lice, flies, thrips, moths, termites, sawflies, caddisflies, stoneflies, and mayflies. For 10 taxa, there was evidence of WGDs among related taxa. In addition to the 21 taxa with strong signatures of gene duplication consistent with WGDs, the Ks and SiZer plots of other species contained ambiguous peaks that may be the result of ancient WGDs or another large-scale genome duplication process (Fig. S3A, K, V, AF, AK, AN, AO, AV, BS, BY, BZ, CE, CF, CM, CS, DN, DX, ED, and Fig. S4A, K, AK, AN, BY, BZ, CM, DS, DX). To avoid overestimating the frequency of ancient WGDs in hexapods, we did not recognize any of these ambiguous peaks as WGDs.

Four trichopteran species contained at least one peak of gene duplication in their Ks plots, suggesting that they experienced at least one round of ancient WGD (Fig. S1O-R, and Fig. S2O-R). To resolve whether ancient genome duplication events occurred independently in different trichopteran lineages or a single duplication event was shared among sampled taxa, we used MAPS and ortholog divergence analyses to infer the placement of ancient large-scale genome duplication events. We selected Platycentropus, Rhyacophila, and Hydroptila to represent Trichoptera in our MAPS analyses. At N2, the node representing the MRCA of Platycentropus and Rhyacophila, 94.5% of the subtrees that fit the expected species tree do not support shared gene duplication at this node. However, at N3, the node representing the MRCA of Trichoptera, 47.1% of the subtrees that fit the expected species tree showed shared gene duplication at this node (Fig. S6Y, and Dataset S3-S4). The results from our comparison to the null and positive simulations suggest that an ancient large-scale genome duplication event is shared among all trichopterans. We estimated that the ortholog divergence between Hydroptila and Rhyacophila is older than Ks = 5 (Fig. S5P and Table S2). Recent peaks in Hydroptila and Rhyacophila Ks plots have a median Ks ~0.5, more recent than their ortholog divergence (Fig. S1P and Q, and Table S2). Likewise, analyses of ortholog divergences suggest that WGD peaks younger than Ks = 1 in Platycentropus and Philopotamus are most likely independent WGDs (Fig. S1O-R, and Fig. S2O-R). Thus, our MAPS and ortholog analyses support an ancient large-scale genome duplication consistent with a WGD before the diversification of trichopterans as well as two more recent phylogenetically independent WGDs.

We used a similar approach to disentangle the history of polyploidy among the Thysanoptera. Each sampled Thysanoptera species contained evidence of a polyploid peak in their Ks plots. Frankliniella had a gene duplication peak with median Ks ~ 2, whereas Thrips and Gynaikothrips had a peak with median Ks ~ 1.7 (Fig. S1I-K, and Table S1). The median ortholog divergence between Frankliniella and Gynaikothrips was Ks = 0.72, younger than the peaks of gene duplication in the Ks plots for each of the Thysanoptera taxa (Table S1). This relatively recent ortholog divergence suggests a WGD event occurred before the divergence of these lineages. However, our MAPS analyses (Figs. S6M and S8M) did not find strong evidence for this shared duplication event (Table S2). It is possible that our current species phylogeny is incorrect and causing problems with our MAPS inference. Analyses of genome sizes have found evidence for recent polyploidy among the thrips (48). Based on current evidence, we infer a single shared WGD in the ancestry of Thysanoptera.

Due to the long evolutionary history of Hexapoda, some ancient WGD or other large-scale genome duplication events are not likely to be observed in Ks plots. To infer possible ancient large-scale genome duplication events that are too saturated to be seen on Ks plots, we conducted 33 MAPS analyses with 111,933 nuclear gene family phylogenies from across the phylogeny of hexapods (Fig. S6 and Dataset S3). To verify that these large-scale genome duplications were not the result of differences in branch lengths, gene mapping errors, or chance, we simulated gene family evolution with background rates of gene birth and death for each MAPS analysis as a null expectation. We then compared the observed MAPS results to our null simulations using a Fisher’s exact test. We compared the percentage of subtrees with and without a shared duplication for the observed and simulated data in a two-by-two contingency table. As we were only interested in locations where the observed number of duplications was significantly higher than our null expectation, we used a one-sided test and applied a Bonferroni correction that accounted for the total of 184 tests and an alpha of 0.05.
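The Bonferroni correction described above reduces to a single division. The sketch below shows the resulting per-test threshold for 184 tests at an alpha of 0.05, which is roughly 2.7 x 10^-4; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value, alpha=0.05, n_tests=184):
    """True if a one-sided p-value passes the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(0.05, 184)
```

A node therefore needs a one-sided p-value below this threshold, rather than below 0.05, to be treated as a significant burst of duplication.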

Using the statistical approach in MAPS, we inferred six ancient large-scale genome duplications across the insect phylogeny (Fig. 2, Fig. S6, Table S3 and Dataset S3-S4). We also observed episodic bursts of gene duplication at seven other locations that had varying levels of significance across different MAPS analyses (Fig. 2, Fig. S6-7, Dataset S4). For example, in MAPS analysis B, the frequency of duplications on the branch subtending the N2 node at the MRCA of Odonata and Zygentoma is consistent with large-scale genome duplication as supported by the two Fisher’s exact tests. However, the frequency of duplications at the same location fails to reject the null hypothesis when tested at N4 in MAPS analysis C (Fig. S6 and Dataset S3-S4). Similarly, in MAPS analysis E the frequency of duplications on the branch subtending the N4 node is consistent with the positive simulation, but conflicts with N5 in MAPS analysis F, which is not significantly different from the null expectation. In five other similar cases, we also observed episodic bursts of gene duplication at different locations that had conflicting results in different MAPS analyses (Fig. 2, Fig. S6K, N, S, W, X, and Dataset S4). In total, we observed seven episodic bursts of gene duplication at different locations that had varying levels of significance across different MAPS analyses (Fig. 2, Fig. S6B, E, K, N, S, W, X, and Dataset S4). To be conservative with inferring ancient large-scale genome duplications, we did not consider these seven episodic bursts of gene duplications to be ancient large-scale genome duplications. We did recognize them as “empty squares” in Figure 2 so that future studies can evaluate the nature of gene duplications at these locations in the hexapod phylogeny.

Overall, our MAPS analyses uncovered evidence for six additional ancient large-scale genome duplications (Fig. 2, Fig. S10 and Dataset S4). One of these ancient large-scale genome duplications is inferred in the ancestry of the Lepidoptera (Fig. 2, Fig. S6, Table S3 and Dataset S3-S4). More than 55% of subtrees from a total of 7007 gene family phylogenies support a shared ancient large-scale genome duplication event in the MRCA of sampled Lepidoptera, but it is not likely shared with its sister lineage, Trichoptera (Fig. S6Z and AA, and Table S3). Similarly, more than 47% of subtrees from a MAPS analysis support a shared large-scale genome duplication in the MRCA of Odonata (Fig. S6C, and Dataset S3-S4). Other MAPS analyses inferred ancient large-scale genome duplications in the ancestry of Collembola, Ephemeroptera (in part), Plecoptera, and Trichoptera (Fig. 2, Fig. S6, Table S3 and Dataset S3-S4).

To confirm that the six inferred large-scale genome duplications are robust when only gene trees with high bootstrap support are used, we applied RAxML’s rapid bootstrap approach to filter out gene trees with lower support (Dataset S8). We restricted the gene trees to require >50% bootstrap support for each branch in our MAPS analyses. The results for 5 out of 6 MAPS analyses remain the same. In MAPS A, the number of nuclear gene trees with >50% bootstrap support on all branches was too low at one node (N3) to apply Fisher’s exact test with sufficient power at all nodes. Overall, the MAPS analyses did not change substantially with the stringent filter of >50% bootstrap at all branches.

As validation of the ancient large-scale genome duplication event inferred in the evolutionary history of Lepidoptera (Figs. S6AA-S9AA, Dataset S5), we identified self-syntenic regions in the Bombyx mori genome (40) (Dataset S6) using the SynMap tool on the CoGe platform (41). Given the quality of the B. mori assembly as well as the putative ancient age and significant fractionation of a genome duplication in the ancestry of Lepidoptera, we used several parameter combinations between the number of genes required to seed a syntenic region and the number of interleaving genes allowed between syntenic genes. The Manhattan distance of 40 genes was used when detecting synteny from ancient large-scale gene duplication events. Using three anchor genes to seed a syntenic chain, we identified 728 syntenic chains containing 2210 genes. Eighty-three of these syntenic chains had at least one gene included in the Lepidoptera MAPS analyses. Overall, 39 MAPS paralogs were found on these 83 syntenic chains. Seeding with five anchor genes resulted in many fewer total syntenic chains that only contained six of the MAPS paralogs. Although this number of syntenic regions associated with our MAPS analyses is small, it is not unexpected because of the fragmented B. mori genome assembly and the erosion of synteny over time (37, 49). Importantly, we found that the number of syntenic chains associated with the MAPS paralogs was statistically significant (chi-square test, p < 0.0001). We would not expect this significant relationship if the bursts of genes identified in our MAPS analysis resulted from independent, small-scale gene family expansions. Instead, the significant association of synteny and paralogs in our MAPS analysis suggests that the duplication was likely structural in nature, such as a WGD or other chromosomal duplication event. However, the B. mori genome is still in thousands of scaffolds and syntenic relationships within hexapod genomes should be re-evaluated as more near-chromosome level assemblies become available.

References:

1. Giraldo-Calderón GI, et al. (2015) VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases. Nucleic Acids Res 43(Database issue):D707–713.

2. Legeai F, et al. (2010) AphidBase: a centralized bioinformatic resource for annotation of the pea aphid genome. Insect Mol Biol 19 Suppl 2:5–12.

3. Munoz-Torres MC, et al. (2011) Hymenoptera Genome Database: integrated community resources for insect species of the order Hymenoptera. Nucleic Acids Res 39(Database issue):D658–662.

4. Wang L, Wang S, Li Y, Paradesi MSR, Brown SJ (2007) BeetleBase: the model organism database for Tribolium castaneum. Nucleic Acids Res 35(Database issue):D476–479.

5. Wang J, et al. (2005) SilkDB: a knowledgebase for silkworm biology and genomics. Nucleic Acids Res 33(Database issue):D399–402.

6. Kersey PJ, et al. (2014) Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic Acids Res 42(Database issue):D546–552.

7. Dlugosch KM, Lai Z, Bonin A, Hierro J, Rieseberg LH (2013) Allele identification for transcriptome-based population genomics in the invasive plant Centaurea solstitialis. G3 3(2):359–367.

8. Xie Y, et al. (2014) SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 30(12):1660–1666.

9. Barker MS, et al. (2010) EvoPipes.net: bioinformatic tools for ecological and evolutionary genomics. Evol Bioinform Online 6:143–149.

10. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Bioinformatics 13(5):555–556.

11. Barker MS, Vogel H, Schranz ME (2009) Paleopolyploidy in the Brassicales: analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biol Evol 1:391–399.

12. Shi T, Huang H, Barker MS (2010) Ancient genome duplications during the evolution of kiwifruit (Actinidia) and related Ericales. Ann Bot 106(3):497–504.

13. Barker MS, et al. (2008) Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol Biol Evol 25(11):2445–2455.

14. Chaudhuri P, Marron JS (1999) SiZer for exploration of structures in curves. J Am Stat Assoc 94(447):807.

15. McLachlan GJ, Peel D (1999) The EMMIX algorithm for the fitting of normal and t-components. J Stat Softw 4(2). doi:10.18637/jss.v004.i02.

16. Schlueter JA, et al. (2004) Mining EST databases to resolve evolutionary events in major crop species. Genome 47(5):868–876.

17. Blanc G (2004) Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16(7):1667–1678.

18. Liu K, et al. (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324(5934):1561–1564.

19. Katoh K, Misawa K, Kuma K-I, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):3059–3066.

20. Wheeler TJ, Kececioglu JD (2007) Multiple alignment by aligning alignments. Bioinformatics 23(13):i559–i568.

21. Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313.

22. Yang Y, Smith SA (2013) Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC Genomics 14:328.

23. Hahn MW (2007) Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol 8(7):R141.

24. Smith SA, Moore MJ, Brown JW, Yang Y (2015) Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants. BMC Evol Biol 15:150.

25. R Core Team (2014) R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing 2013.

26. Rabier C-E, Ta T, Ané C (2014) Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol Biol Evol 31(3):750–762.

27. Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13(9):2178–2189.

28. Sjöstrand J, Arvestad L, Lagergren J, Sennblad B (2013) GenPhyloData: realistic simulation of gene family evolution. BMC Bioinformatics 14:209.

29. Misof B, et al. (2014) Phylogenomics resolves the timing and pattern of insect evolution. Science 346(6210):763–767.

30. Hedges SB, Dudley J, Kumar S (2006) TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22(23):2971–2972.

31. Wu Y-C, Rasmussen MD, Bansal MS, Kellis M (2012) TreeFix: statistically informed gene tree error correction using species trees. Syst Biol 62(1):110–120.

32. Yang Y, et al. (2015) Dissecting molecular evolution in the highly diverse plant clade Caryophyllales using transcriptome sequencing. Mol Biol Evol 32(8):2001–2014.

33. Barker MS, et al. (2016) Most Compositae (Asteraceae) are descendants of a paleohexaploid and all share a paleotetraploid ancestor with the Calyceraceae. Am J Bot 103(7):1203–1211.

34. Huang C-H, et al. (2016) Multiple polyploidization events across Asteraceae with two nested events in the early history revealed by nuclear phylogenomics. Mol Biol Evol 33(11):2820–2835.

35. Yang Y, et al. (2017) Improved transcriptome sampling pinpoints 26 ancient and more recent polyploidy events in Caryophyllales, including two allopolyploidy events. New Phytol 217(2):855–870.

36. Rody HVS, Baute GJ, Rieseberg LH, Oliveira LO (2017) Both mechanism and age of duplications contribute to biased gene retention patterns in plants. BMC Genomics 18(1):46.

37. Conant GC (2014) Comparative genomics as a time machine: how relative gene dosage and metabolic requirements shaped the time-dependent resolution of yeast polyploidy. Mol Biol Evol 31(12):3184–3193.

38. Mayfield-Jones D, et al. (2013) Watching the grin fade: tracing the effects of polyploidy on different evolutionary time scales. Semin Cell Dev Biol 24(4):320–331.

39. Pires JC, Conant GC (2016) Robust yet fragile: expression noise, protein misfolding and gene dosage in the evolution of genomes. Annu Rev Genet 50:113–131.

40. International Silkworm Genome Consortium (2008) The genome of a lepidopteran model insect, the silkworm Bombyx mori. Insect Biochem Mol Biol 38(12):1036–1045.

41. Lyons E, Pedersen B, Kane J, Freeling M (2008) The value of nonmodel genomes and an example using SynMap within CoGe to dissect the hexaploidy that predates the Rosids. Trop Plant Biol 1(3-4):181–190.

42. Amborella Genome Project (2013) The Amborella genome and the evolution of flowering plants. Science 342(6165):1241089.

43. Krzywinski M, et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19(9):1639–1645.

44. Nakazato T, Jung M-K, Housworth EA, Rieseberg LH, Gastony GJ (2006) Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii. Genetics 173(3):1585–1597.

45. Finn RD, et al. (2015) HMMER web server: 2015 update. Nucleic Acids Res 43(W1):W30–8.

46. Gene Ontology Consortium (2015) Gene Ontology Consortium: going forward. Nucleic Acids Res 43(Database issue):D1049–1056.

47. Dixon P (2003) VEGAN, a package of R functions for community ecology. J Veg Sci 14(6):927.

48. Jacobson AL, et al. (2013) Genome size and ploidy of Thysanoptera. Insect Mol Biol 22(1):12–17.

49. Zhao T, Schranz ME (2017) Network approaches for plant phylogenomic synteny analysis. Curr Opin Plant Biol 36:129–134.


Fig. S1. Ks 2 Histograms of the age distribution of gene duplications (Ks plots) with mixture model distributions and SiZer analyses for 21 taxa with evidence of WGD(s). Histogram x-axis scale is Ks 0–2, and y-axis is number of gene duplications. The mixture model distributions that are consistent with the ancient WGDs are highlighted in yellow. The SiZer plots show significant features at corresponding Ks values with blue areas indicating significantly increasing slopes, red indicating significantly decreasing slopes, purple representing no significant slope change, and gray indicating not enough data for the test. Peaks of gene duplication consistent with ancient WGDs and other details for each Ks plot are described in Table S1 and Dataset S2.

Fig. S2. Ks 5 Histograms of the age distribution of gene duplications (Ks plots) for 21 taxa with evidence of WGD(s) with mixture model distributions. Histogram x-axis scale is Ks 0–5, and y-axis is number of gene duplications. The mixture model distributions that are consistent with the ancient WGDs are highlighted in green. Peaks of gene duplication consistent with ancient WGDs and other details for each Ks plot are described in Table S1 and Dataset S2.


Fig. S3. 134 Histograms of the age distribution of gene duplications (Ks plots) and SiZer analyses, ranging from Ks 0 to 2. No unambiguous peaks consistent with ancient WGDs are observed in these Ks plots. The SiZer plots show significant features at corresponding Ks values with blue areas indicating significantly increasing slopes, red indicating significantly decreasing slopes, purple representing no significant slope change, and gray indicating not enough data for the test. Detailed information for each Ks plot is provided in Dataset S1.


Fig. S4. 134 Histograms of the age distribution of gene duplications (Ks plots), ranging from Ks 0 to 5. No unambiguous peaks consistent with ancient WGDs are observed in these Ks plots. Detailed information for each Ks plot is provided in Dataset S1.

Fig. S5. Synonymous ortholog divergence among select hexapods with MAPS inferred large-scale genome duplications. Blue shading highlights the peaks of synonymous ortholog divergences for pairs of hexapod taxa. The vertical red line indicates the Ks = 5 cut off for gene age distribution analyses. MAPS inferred large-scale genome duplications in the ancestry of these taxa would have occurred prior to their ortholog divergence and these ranges are highlighted in orange. For most taxa, MAPS inferred large-scale genome duplications would have Ks > 5, and all would be peaks with Ks > 2. Thus, the MAPS inferred large-scale duplications may not be apparent in Ks plots of gene duplications. Sampling information and median ortholog divergences are presented in Table S2.


Fig. S6. Phylogenetic placement of ancient large-scale genome duplications inferred by MAPS analyses. Ancient large-scale genome duplications inferred by MAPS are indicated by blue diamonds. Episodic bursts of gene duplication that had varying levels of significance across different MAPS analyses are indicated by empty diamonds on each phylogeny. Percentages on each phylogeny represent the percentage of subtrees that support a shared gene duplication on the branch subtending a particular node in the species tree. The total number of nuclear gene family phylogenies analyzed and other detailed information for each MAPS analysis are provided in Dataset S3.


Fig. S7. Phylogenetic placement of ancient large-scale genome duplications inferred in the phylogeny of Hexapoda. Red circles represent WGDs inferred from Ks plots; blue diamonds represent large-scale genome duplications inferred by MAPS analyses; orange circles represent WGDs in outgroups (i.e., non-hexapods) inferred from Ks plots; empty squares represent episodic bursts of gene duplication that had varying levels of significance across different MAPS analyses. Alphabetical labels correspond to Ks plots described in Fig. S1, Fig. S2 and Table S1. Numerical labels correspond to MAPS inferred ancient large-scale genome duplications described in Fig. S6, Fig. S10, Table S3 and Dataset S3-S4. Other episodic bursts of gene duplication are labeled as A1-A7. Hexapod phylogeny adapted from Misof et al. 2014. The Solenopsis invicta (Hymenoptera) paleopolyploidy inferred from a Ks plot is not included on this phylogeny.


Fig. S8. MAPS results for all 33 species trees and null simulations. The percentages of subtrees mapping to each node in the species trees (see Fig. S6) are shown for the observed data (black) as well as the null simulation results (red). The mean for each null simulation is given as a red triangle and vertical bars are the 95% confidence intervals for each simulated mean. Asterisks indicate an observed node that had a significantly higher frequency of shared gene duplications than expected compared to the simulated distribution as determined by a Fisher’s exact test (see Dataset S3).


Fig. S9. MAPS results for 14 potential ancient large-scale genome duplications and their simulated distributions. The percentages of subtrees mapping to each node in the species trees (see Fig. S6) are shown for the observed data (black) as well as the null simulation results (red), and simulated data where 30% of the paralogs from a WGD were retained (blue). Triangles for simulated data represent the mean across 100 resampled datasets. The 95% confidence intervals are given for each simulated mean as vertical bars. Asterisks indicate an observed node that had a significantly higher frequency of gene duplications than the null simulations and was not significantly lower than the number of duplications in the positive simulations, as determined by a Fisher’s exact test (see Dataset S4).

Fig. S10. Phylogenetic placements of ancient large-scale genome duplications inferred by MAPS in the phylogeny of Hexapoda. Blue diamonds represent ancient large-scale genome duplications inferred only by MAPS analyses; empty squares represent episodic bursts of gene duplication that had varying levels of significance across different MAPS analyses; black circles represent nodes covered by MAPS analyses. Alphabetical labels for each black circle correspond to MAPS analyses described in Fig. S6 and Dataset S3. Numerical labels correspond to MAPS analysis information in Table S3. Hexapod phylogeny adapted from Misof et al. 2014.

Fig. S11. Distribution of syntenic regions in the Bombyx mori genome. Syntenic regions in the B. mori genome are shown as links between scaffolds. Grey links indicate syntenic regions that are not associated with a gene in the MAPS inferred duplication event, whereas yellow links show syntenic regions that contain a B. mori gene from the large-scale genome duplication inferred by MAPS. Tick marks indicate two megabase intervals. Synteny is highly fragmented in the B. mori genome.

Fig. S12. Pairs of syntenic chains in the Bombyx mori genome that were associated with paralogs identified by MAPS. Syntenic regions between B. mori scaffolds that are associated with genes from the MAPS inferred large-scale genome duplication are shown as links on their respective scaffolds. Numbers are the scaffold identifiers and tick marks indicate one megabase intervals.

Fig. S13. Pattern of gene retention and loss following ancient WGDs. Each column represents the annotated GO categories of paralogs retained following ancient WGDs from each analyzed species. The order of analyzed species (20 hexapods and one outgroup) is based on the median Ks of the WGD peak, from low to high. The overall ranking of GO category rows was determined by the ranking of GO annotations among Hydrophilla sp. with a median WGD peak of Ks = 0.6. Colored boxes indicate GO categories among putative WGD paralogs that were significantly over- (red) or under-retained (blue) relative to the whole transcriptome, as determined by residuals from chi-square tests. GO categories with gray boxes were not present among WGD paralogs in significantly different numbers relative to their frequency in the whole transcriptome.

Fig. S14. GO annotations of whole transcriptomes and genes retained from ancient WGDs. Each column represents the annotated GO categories of the whole transcriptome or of genes retained in duplicate following ancient WGD from each analyzed species. Colors of the heatmap represent the percent of the transcriptome represented by a particular GO category. The overall ranking of GO category rows was determined by the ranking of GO annotations among the total pooled transcriptomes. The order of analyzed species (20 hexapods and one outgroup) is based on the median Ks of the WGD peak, from low to high. The summary statistics of genes retained from ancient WGD(s) are provided in Table S5.

Table S1. Summary of Ks plots with evidence of ancient WGD(s). Alphabetical labels correspond to Ks plots from Fig. S1 and Fig. S2. The ΔBIC values of mixture models with and without inferred polyploid peaks were large, supporting our inferences for WGDs in 20 hexapod species. Phylogenetic placements of ancient WGDs inferred from these Ks plots are presented in Fig. S7. * indicates sampling from whole genome sequence. Peaks of gene duplication consistent with ancient WGDs and other details for each Ks plot are described in Dataset S2.

| Label | Organism | Order | # of peaks | Median Ks of WGD Peaks | Delta BIC |
|---|---|---|---|---|---|
| A | Ixodes scapularis* | Arachnida | 2 | 0.6312, 2.1507 | 80.046, 27.573 |
| B | Folsomia candida | Collembola | 1 | 0.4095 | 144.109 |
| C | Occasjapyx japonicus | Diplura | 1 | 0.3785 | 371.645 |
| D | Baetis sp. | Ephemeroptera | 1 | 0.8284 | 107.887 |
| E | Zorotypus caudelli | Zoraptera | 1 | 0.4298 | 370.086 |
| F | Leuctra sp. | Plecoptera | 1 | 0.6046 | 60.853 |
| G | Haploembia palaui | Embioptera | 1 | 2.0720 | 77.597 |
| H | Mastotermes darwiniensis | Blattodea | 1 | 0.4151 | 344.241 |
| I | Gynaikothrips ficorum | Thysanoptera | 1 | 1.7313 | 1046.290 |
| J | Thrips palmi | Thysanoptera | 1 | 1.7684 | 461.804 |
| K | Frankliniella cephalica | Thysanoptera | 1 | 2.0831 | 81.871 |
| L | Menopon gallinae | Psocodea | 1 | 0.7952 | 568.027 |
| M | Tenthredo koehleri | Hymenoptera | 1 | 0.8147 | 824.200 |
| N | Lepicerus sp. | Coleoptera | 1 | 0.5302 | 2020.560 |
| O | Platycentropus radiatus | Trichoptera | 1 | 0.4580 | 130.032 |
| P | Rhyacophila fasciata | Trichoptera | 1 | 0.4276 | 153.674 |
| Q | Hydroptila sp. | Trichoptera | 1 | 0.6495 | 323.443 |
| R | Philopotamus ludificatus | Trichoptera | 1 | 0.3419 | 62.161 |
| S | Nemophora degeerella | Lepidoptera | 1 | 0.2646 | 153.451 |
| T | Trichocera saltator | Diptera | 1 | 0.5859 | 262.350 |
| U | Solenopsis invicta* | Hymenoptera | 1 | 0.1971 | 426.618 |

Table S2. Summary of ortholog age distribution analyses. Alphabetical labels correspond to labels of ortholog divergence analyses in Fig. S5. Median ortholog divergence is the median Ks value of ortholog divergences of the selected species pair in each analysis.

| Label | Large-scale Gene and Genome Duplication | Taxa 1 | Order 1 | Taxa 2 | Order 2 | Median Ortholog Divergence (Ks) |
|---|---|---|---|---|---|---|
| A | 1 | Pogonognathellus sp. | Collembola | Anurida maritima | Collembola | 5.88 |
| B | 2 | Meinertellus cundinamarcensis | Archaeognatha | Cordulegaster boltonii | Odonata | 6.31 |
| C | 2, A1 | Thermobia domestica | Zygentoma | Cordulegaster boltonii | Odonata | 5.99 |
| D | 2, A1 | Thermobia domestica | Zygentoma | Epiophlebia superstes | Odonata | 6.06 |
| E | 2 | Cordulegaster boltonii | Odonata | Calopteryx splendens | Odonata | 2.03 |
| F | 3 | Isonychia bicolor | Ephemeroptera | Ephemera danica | Ephemeroptera | 2.88 |
| G | 4, A3 | Blaberus atropos | Blattodea | Leuctra sp. | Plecoptera | 5.56 |
| H | A3 | Cryptocercus wrighti | Blattodea | Zorotypus caudelli | Zoraptera | 6.09 |
| I | A2 | Forficula auricularia | Dermaptera | Prosarthria teretrirostris | Orthoptera | 6.9 |
| J | 4 | Perla marginata | Plecoptera | Leuctra sp. | Plecoptera | 4.15 |
| K | A4 | Okanagana villosa | Hemiptera | Xenophysella greensladeae | Hemiptera | 5.74 |
| L | A4 | Thrips palmi | Thysanoptera | Okanagana villosa | Hemiptera | 6.93 |
| M | A5 | Orussus abietinus | Hymenoptera | Chrysis viridula | Hymenoptera | 4.2 |
| N | - | Conwentzia psociformis | Neuroptera | Xanthostigma xanthostigma | Raphidioptera | 5.49 |
| O | 5 | Hydroptila sp. | Trichoptera | Annulipalpia sp. | Trichoptera | 5.25 |
| P | 5 | Rhyacophila fasciata | Trichoptera | Hydroptila sp. | Trichoptera | 5.08 |
| Q | 6 | Dyseriocrania subpurpurella | Lepidoptera | Parides eurimedes | Lepidoptera | 5.44 |
| R | 6 | Triodia sylvina | Lepidoptera | Planococcus citri | Hemiptera | 7.24 |
| S | A4, A6, A7 | Aleochara curtula | Coleoptera | Orussus abietinus | Hymenoptera | 7.58 |
| T | 6, A5 | Chrysis viridula | Hymenoptera | Parides eurimedes | Lepidoptera | 6.88 |
| U | 5 | Pseudomallada prasinus | Neuroptera | Hydroptila sp. | Trichoptera | 7.35 |
| V | 5 | Rhyacophila fasciata | Trichoptera | Pseudomallada prasinus | Neuroptera | 7.24 |
| W | 6 | Triodia sylvina | Lepidoptera | Conwentzia psociformis | Neuroptera | 6.53 |
| X | 6 | Xanthostigma xanthostigma | Raphidioptera | Triodia sylvina | Lepidoptera | 6.64 |

Table S3. Summary of 6 large-scale genome duplications from MAPS Analyses. Numerical labels correspond with results in Fig. S7 indicating the phylogenetic placement of ancient large-scale genome duplications inferred from MAPS analyses. Labels of MAPS analyses supporting inferences of each large gene burst correspond to results in Fig. S6, Dataset S3-S4.

| Number | Order(s) with Large-scale Genome Duplications | Location | # of Orders | Supporting MAPS Analyses |
|---|---|---|---|---|
| 1 | Collembola | Collembola | 1 | A |
| 2 | Odonata | Odonata | 1 | C |
| 3 | Ephemeroptera (in part) | Ephemeroptera | 1 | D |
| 4 | Plecoptera | Plecoptera | 1 | F |
| 5 | Trichoptera | Trichoptera | 1 | Y |
| 6 | Lepidoptera | Lepidoptera | 1 | Z, AA |

Table S4. Rates of gene duplication (λ) and gene loss (μ) used in null simulations. Alphabetical labels correspond to labels of MAPS analyses in Fig. S6. Rates of gene duplication (λ) and gene loss (μ) were obtained by extracting gene count data from OrthoMCL clusters associated with each MAPS analysis. Values correspond to global MLEs and mean rates for null simulations. The prior mean is the mean of the geometric probability distribution applied to the root of each species tree for optimizing MLEs of λ and μ as well as simulating gene trees with and without WGDs.

| MAPS Analysis | λ | μ | Prior Mean |
|---|---|---|---|
| A | 0.00072 | 0.00101 | 1.10 |
| B | 0.00051 | 0.00061 | 1.14 |
| C | 0.00065 | 0.00081 | 1.10 |
| D | 0.00072 | 0.00099 | 1.10 |
| E | 0.00075 | 0.00114 | 1.02 |
| F | 0.00062 | 0.00154 | 1.10 |
| G | 0.00064 | 0.00107 | 1.05 |
| H | 0.00072 | 0.00131 | 1.07 |
| I | 0.00112 | 0.00177 | 1.03 |
| J | 0.00096 | 0.00192 | 1.10 |
| K | 0.00106 | 0.00147 | 1.05 |
| L | 0.00068 | 0.00105 | 1.06 |
| M | 0.00088 | 0.00129 | 1.09 |
| N | 0.00068 | 0.00105 | 1.10 |
| O | 0.00094 | 0.00113 | 1.09 |
| P | 0.00095 | 0.00120 | 1.08 |
| Q | 0.00098 | 0.00145 | 1.02 |
| R | 0.00065 | 0.00099 | 1.10 |
| S | 0.00066 | 0.00091 | 1.10 |
| T | 0.00071 | 0.00092 | 1.08 |
| U | 0.00083 | 0.00108 | 1.10 |
| V | 0.00064 | 0.00086 | 1.06 |
| W | 0.00077 | 0.00110 | 1.02 |
| X | 0.00074 | 0.00126 | 1.10 |
| Y | 0.00093 | 0.00139 | 1.10 |
| Z | 0.00145 | 0.00163 | 1.20 |
| AA | 0.00156 | 0.00127 | 1.30 |
| AB | 0.00080 | 0.00106 | 1.05 |
| AC | 0.00075 | 0.00100 | 1.10 |
| AD | 0.00152 | 0.00139 | 1.30 |
| AE | 0.00124 | 0.00067 | 1.33 |
| AF | 0.00111 | 0.00060 | 1.30 |
| AG | 0.00101 | 0.00073 | 1.13 |

Table S5. Summary statistics of mixture model distributions of ancient WGD(s) used in Gene Ontology (GO) annotations. Alphabetical labels correspond to Ks plots from Fig. S1, Fig. S2 and Table S1. The annotated unigenes represent unigenes from each species with at least one hit to the annotated Drosophila melanogaster transcripts through phmmer (HMMER 3.1b1).

| Label | Species | # of Unigenes | # of Annotated Unigenes | Ks Ranges | BIC Score | Log-likelihood Score |
|---|---|---|---|---|---|---|
| A | Ixodes scapularis | 20486 | 12778 | 0.269 - 1.157 | 3046.59 | 48.69 |
| A (peak2) | Ixodes scapularis | 20486 | 12778 | 1.179 - 3.697 | 3046.59 | 48.69 |
| B | Folsomia candida | 15680 | 14543 | 0.164 - 0.628 | 3084.72 | 21.9 |
| C | Occasjapyx japonicus | 17222 | 16061 | 0.314 - 0.701 | 102.68 | 1639.63 |
| D | Baetis sp. | 14508 | 13708 | 0.159 - 1.256 | 2617.33 | 38.53 |
| E | Zorotypus caudelli | 17706 | 13821 | 0.193 - 0.660 | 1539.97 | 128.2 |
| F | Leuctra sp. | 14551 | 12967 | 0.150 - 0.983 | 3038.41 | 204 |
| G | Haploembia palaui | 15264 | 13053 | 0.635 - 2.517 | 5722.89 | 232.16 |
| H | Mastotermes darwiniensis | 34181 | 31016 | 0.194 - 0.578 | 27107.89 | 809.54 |
| I | Gynaikothrips ficorum | 18116 | 16616 | 0.861 - 2.075 | 8569.63 | 1014.88 |
| J | Thrips palmi | 12311 | 11483 | 0.949 - 2.791 | 1825.18 | 52.92 |
| K | Frankliniella cephalica | 15807 | 13778 | 0.940 - 2.491 | 6207.88 | 131.03 |
| L | Menopon gallinae | 16780 | 15986 | 0.177 - 1.242 | 2412.57 | 305.47 |
| M | Tenthredo koehleri | 11961 | 11221 | 0.239 - 0.844 | 1815.43 | 43.37 |
| N | Lepicerus sp. | 25367 | 23432 | 0.179 - 0.921 | 6701.16 | 205.17 |
| O | Platycentropus radiatus | 10049 | 8785 | 0.289 - 0.815 | 1839.53 | 23.8 |
| P | Rhyacophila fasciata | 7230 | 6738 | 0.085 - 0.743 | 1726.94 | 8.9 |
| Q | Hydroptila sp. | 17637 | 16564 | 0.099 - 1.056 | 10383.12 | 816.23 |
| R | Philopotamus ludificatus | 5581 | 5277 | 0.233 - 0.562 | 776.23 | 7.7 |
| S | Nemophora degeerella | 14086 | 12122 | 0.110 - 0.502 | 1607.65 | 20.56 |
| T | Trichocera saltator | 17222 | 16969 | 0.173 - 1.072 | 110.13 | 1290.36 |
| U | Solenopsis invicta | 21108 | 17711 | 0.0001 - 0.403 | 794.73 | 166.33 |

Table S6. Summary of image sources. This table contains sources of images used in Fig. 2. Images of Raphidioptera, Coleoptera, and Neuroptera are credited to artists Tang Liang, Zichen Wang and Zheng Li. All other images are in the public domain with credit information and links provided in the table.


Dataset S1. Assembly statistics and accession numbers for 155 transcriptomes and genomes. Whole genomes analyzed for 27 species are indicated by an asterisk. See Fig. S1 and Fig. S2 for associated Ks plots with WGD peaks. See Fig. S3 and Fig. S4 for associated Ks plots without WGD peaks.

Dataset S2. Summary of mixture model of normal distributions from EMMIX. Alphabetical labels correspond to Ks plots from Fig. S1 and Fig. S2. ΔBIC values of mixture models with and without inferred polyploid peaks are indicated. The total number of estimated components and the median Ks, variance, and proportion of each component are provided. * indicates components that are consistent with an ancient whole genome duplication.

Dataset S3. Summary statistics and null simulations for the 33 MAPS analyses. This table contains sampling and percentage of subtrees with shared gene duplications at each node of 33 MAPS analyses. The total number of nuclear gene family phylogenies included in each analysis is also reported. This table also contains the p-value for a Fisher’s exact test used to detect nodes with a significantly higher proportion of mapping subtrees compared to our null simulations. An * indicates significant nodes after a Bonferroni correction.
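As an illustration of the kind of test summarized in Dataset S3, the following is a minimal sketch of a one-sided Fisher's exact test followed by a Bonferroni correction. The way the 2x2 table is built (observed versus simulated subtrees, with and without a shared duplication), the function names, and all counts are assumptions made for this example; they are not taken from the dataset or from the MAPS code.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    where X is the count in the top-left cell.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that pass a Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical counts for three nodes (not taken from Dataset S3):
# (observed subtrees with a shared duplication, observed without,
#  simulated with, simulated without)
nodes = [(300, 700, 120, 880), (110, 890, 100, 900), (95, 905, 100, 900)]
p_values = [fisher_exact_greater(*counts) for counts in nodes]
for counts, p, sig in zip(nodes, p_values, bonferroni_significant(p_values)):
    print(counts, f"p = {p:.3g}", "*" if sig else "")
```

The Bonferroni step simply divides the significance level by the number of nodes tested, so a node is starred only if its p-value falls below that stricter threshold.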

Dataset S4. Summary statistics and simulations with WGDs for the 22 selected MAPS analyses. This table contains data for simulations with 30% of paralogs retained from a WGD. This table also contains the p-value for a Fisher’s exact test used to detect nodes with a significantly lower proportion of mapping subtrees compared to our simulation with 30% duplicate retention. An * indicates significant nodes after a Bonferroni correction. See Fig. S10 for phylogenetic placement of inferred ancient large-scale genome duplications by each MAPS analysis.

Dataset S5. The nuclear gene family phylogenies for MAPS analysis AA. This dataset contains 4136 nuclear gene trees used in the MAPS analysis AA. The species names and abbreviation codes used in the gene trees are provided: Bombyx mori (bom), Parides eurimedes (gxh), Zygaena fausta (gyb), Hydroptila sp. (gvm), Drosophila melanogaster (dme), Tribolium castaneum (trc), Chrysis viridula (gty). See Fig. S6, Dataset S3-S4 for detailed information of MAPS analysis AA.

Dataset S6. Chains of syntenic genes that have at least one paralog from the MAPS inferred large-scale duplication in Bombyx mori. There are 639 B. mori genes represented on these 83 chains. Thirty-nine MAPS genes were associated with 83 syntenic chains detected by SynMap on the CoGe platform. The genes included in MAPS analyses are italicized and identified by their Ensembl accession number.

Dataset S7. Summary table of response variable scores for principal component analysis. This table contains the raw response variable scores of each GO category in PC1 and PC2 in principal component analysis. These response variable scores represent the unscaled coordinates used to ordinate points and vectors.

Dataset S8. Summary statistics and simulations with and without WGDs for the 6 selected MAPS analyses with bootstrap support. Gene trees were required to have >50% bootstrap support for each branch. This table contains data for null simulations and positive simulations with a 30% gene retention rate. This table also contains the p-value for a Fisher’s exact test when compared to null or positive simulation. An * indicates significant nodes after a Bonferroni correction.

APPENDIX C:

ANCIENT POLYPLOIDY AND LOW RATE OF CHROMOSOME LOSS EXPLAIN FERNS WITH HIGH CHROMOSOME NUMBERS

Authors

Zheng Li^a, Shing H. Zhan^a,b, Michael S. Barker^a,1

Affiliations

^a Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721 USA

^b Department of Zoology & Biodiversity Research Centre, University of British Columbia, Vancouver, BC V6T 1Z4 CANADA

Abstract

A longstanding question in plant evolution is why homosporous ferns have much higher chromosome numbers than flowering plants. The leading hypothesis suggests that ancient polyploidy without chromosome loss can explain high chromosome numbers in ferns. Here, we test this hypothesis by using phylogenomic inferences of ancient polyploidy and phylogenetic reconstruction of chromosome evolution in monilophytes. We found evidence of paleopolyploidy and a lower rate of chromosome loss in ferns compared to angiosperms. To explore whether this genomic characteristic of ferns is impacted by genes related to meiosis, we compared rates of protein evolution for these genes in angiosperms and ferns. We found consistently higher rates in meiosis genes compared to the background rate in angiosperms but not in ferns. Furthermore, we evaluated patterns of gene retention after polyploidization, and we present the first evidence of parallel and biased gene retention in ferns. Overall, we provide comprehensive evidence to support the ancient polyploidy hypothesis, and our results explain part of this unresolved question in plant evolution.

Introduction

A longstanding mystery in plant evolutionary biology is the origin of the exceptional chromosome number variation observed across the vascular plant phylogeny. Vascular plant chromosome numbers range from n = 2 in multiple angiosperms (Abraham and Ninan, 1954) to n = 1260 in the eusporangiate fern, Ophioglossum reticulatum (Abraham and Ninan, 1954). The average gametic chromosome number in homosporous ferns is n = 57.05 (Klekowski and Baker, 1966). In contrast, angiosperms and heterosporous ferns have average haploid chromosome numbers of n = 15.99 (Grant, 1963) and n = 13.62 (Klekowski and Baker, 1966), respectively. Although a variety of cytogenetic mechanisms are known to influence chromosome number variation at lower taxonomic scales (Stebbins and Others, 1971; Levin, 2002), it is not clear what processes are responsible for the macroevolutionary pattern observed between homosporous and heterosporous plants.

Numerous hypotheses have been proposed to explain the origin and maintenance of high chromosome numbers in homosporous ferns (Wagner and Wagner, 1980; Haufler and Soltis, 1986; Haufler, 1987; Barker and Wolf, 2010). The most compelling one invokes multiple rounds of whole genome duplication (WGD) (Haufler and Soltis, 1986; Haufler, 1987). The alternatives, such as ascending aneuploidy or high ancestral chromosome numbers, are not well supported because neopolyploid speciation is common in ferns (Otto and Whitton, 2000; Wood et al., 2009) and cytological variation is predominantly euploid (Love et al., 1977). An influential early hypothesis argued that repeated cycles of genome doubling provided homoeologous heterozygosity, which was necessary to compensate for putatively high rates of inbreeding (Klekowski and Baker, 1966; Chapman et al., 1979). However, isozyme studies demonstrated that homosporous ferns with the base chromosome number for their genus are diploid, not polyploid, with gene expression profiles showing Mendelian inheritance and patterns of genetic variation consistent with outcrossing, not inbreeding (Gastony and Gottlieb, 1982, 1985; Gastony and Darrow, 1983; Haufler and Soltis, 1986; Haufler, 1987; Wolf et al., 1987). To explain this intriguing combination of high chromosome numbers and diploid gene expression, Haufler (1987) suggested that ferns experienced multiple rounds of polyploid speciation followed by gene silencing but not chromosome loss. In support of this hypothesis, a few studies have identified multiple silenced copies of nuclear genes in putatively diploid homosporous fern genomes (Pichersky et al., 1990; Mitchell McGrath et al., 1994; McGrath and Hickok, 1999), and another documented the active process of gene silencing without chromosome loss in a polyploid genome (Gastony, 1991). However, genetic mapping of the diploid homosporous fern, Ceratopteris richardii (n = 39), failed to identify homoeologous chromosomes (Nakazato et al., 2006).

Despite the lack of success in resolving paleopolyploidy using genetic mapping in Ceratopteris (Nakazato et al., 2006), recent transcriptomic and genomic analyses have found evidence of ancient genome duplication in ferns (Barker, 2012; Vanneste et al., 2015; Li, Brouwer, et al., 2018; Clark et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019). Analyses of the first two fern genomes, using phylogenomic and syntenic approaches, revealed two rounds of ancient WGDs in Azolla (Li, Brouwer, et al., 2018). In the One Thousand Plant Transcriptome (1KP) project, 21 ancient polyploidy events were inferred during the evolution of monilophytes (One Thousand Plant Transcriptomes Initiative, 2019). To further investigate ancient WGDs in ferns, we assembled a dataset with over 140 fern transcriptomes. Ancient WGDs were initially identified in gene age distributions using DupPipe (Barker et al., 2010). To place inferred WGDs in a phylogenetic context, we used the Multi-tAxon Paleopolyploidy Search (MAPS) phylogenomic approach and simulations (Li et al., 2015; Li, Tiley, et al., 2018) and compared synonymous divergence of the WGD paralogs with the orthologous divergence. We also reconstructed the rates and patterns of chromosome number evolution across more than 2,300 vascular plant genera with ChromEvol (Mayrose et al., 2010). To investigate genes that might be associated with different patterns of chromosome number evolution between angiosperms and ferns, we compared the rate of protein evolution for meiosis-related genes in these two lineages. We also evaluated the pattern of gene retention and loss following inferred ancient WGDs in monilophytes. Our combination of phylogenetic reconstruction and genomic inference provides independent and complementary views on the evolution of high chromosome numbers in homosporous ferns.

Results

Inference, Distribution, and Frequency of Paleopolyploidy. To infer and place ancient WGDs in ferns in a phylogenetic context, we assembled 142 fern genomes and transcriptomes, one of the largest fern genomic datasets to date (Fig. 1 and Supplementary Fig. 1). Using gene age distributions, ortholog divergence analyses, and the MAPS phylogenomic approach with simulations, our study reveals ancient WGDs in the ancestry of major fern lineages such as Equisetaceae (EQUIα), Ophioglossaceae (OPHIα), Psilotaceae (PSILα, PSILβ), Marattiaceae (MARAα), and Osmundaceae (OSMNα) (Fig. 1 and Supplementary Fig. 1). Histograms of gene age distributions (Ks plots) showed peaks of gene duplication consistent with WGDs in all fern transcriptomes (Supplementary Table 1). The Ks plots revealed evidence of five ancient WGDs in Osmolindsaea (OSODα, Ks = 0.17), Schizaea (SCDIα, Ks = 0.28; ANTOα, Ks = 0.8587), Stenochlaena (STPAα, Ks = 0.25), and Taenitis (TABLα, Ks = 0.32) (Supplementary Table 1). These putative ancient WGDs were not previously inferred in 1KP due to lack of samples. They appear as phylogenetically independent WGDs because sampled sister lineages lack evidence of the same WGDs. Analyses of orthologous divergence also indicated that each of these putative WGDs occurred independently (Supplementary Table 2). The peaks consistent with WGDs in other species confirmed ancient WGDs previously inferred in the 1KP project (Supplementary Table 1). We further placed seven putative WGDs in ferns (ASNIα, HYMEα, LINDα, ANTOα, LYGOα, VILIα, and CYATα) in a phylogenetic context with MAPS (Supplementary Table 3, 4). These ancient WGDs were not tested with MAPS in the 1KP project due to lack of samples. Using gene trees and null and positive simulations of WGDs, our MAPS approach placed these WGDs in the ancestry of a subclade of Asplenium (ASNIα), a subclade of the Polypodiidae (HYMEα), (LINDα), Anemiaceae, and Schizaeaceae (ANTOα), Lygodiaceae (LYGOα), a subclade of the Vittarioideae (VILIα), and a subclade of the (CYATα) (Fig. 1, and Supplementary Table 3, 4). Overall, we found evidence for 26 ancient WGDs during the evolutionary history of monilophytes (Fig. 1 and Supplementary Fig. 1).

To explore the frequency of paleopolyploidy in ferns and compare it to other vascular plants, we incorporated inferences of ancient WGDs from this study and previous 1KP analyses (One Thousand Plant Transcriptomes Initiative, 2019). We estimated the frequency of paleopolyploidy as the mean number of ancient WGDs in the ancestry of each species (Fig. 2a). On average, fern species experienced 3.76 ± 0.83 rounds of ancient genome duplication. Similar to ferns, angiosperms have on average 3.87 ± 0.65 rounds of ancient WGDs. Gymnosperm species went through 1.86 ± 0.34 rounds of ancient WGDs, and lycophytes on average underwent 1.52 ± 1.21 rounds of ancient genome duplication (Fig. 2a). We further used the two-sample Mann-Whitney U test to assess whether the frequency of paleopolyploidy in ferns is significantly different from that in other vascular plants. We found no difference between ferns and angiosperms in the number of ancient WGDs in the ancestry of each species (p = 0.8034) (Fig. 2a). However, the number of rounds of ancient genome duplication in ferns is different from that in gymnosperms and lycophytes (p < 0.0001) (Fig. 2a). To account for the impact of age on the number of ancient WGDs in each lineage, we also estimated the rate of ancient WGDs as the number of ancient genome duplications divided by the minimum crown group age (Fig. 2b). The minimum crown group ages used for angiosperms, ferns, gymnosperms, and lycophytes are 197.5, 384.9, 308.4, and 392.8 million years, respectively (Table SX) (Morris et al., 2018). Using the two-sample Mann-Whitney U test, we found that the rate of ancient WGD in flowering plants is significantly higher than in ferns (p < 0.0001). The rates in gymnosperms and lycophytes are lower than in ferns (p < 0.0001) (Fig. 2b).
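The age-normalized comparison described above can be sketched as follows. This is a minimal, hedged illustration rather than the analysis code used here: the per-species WGD counts are hypothetical, the function names are invented for the example, and the test uses the large-sample normal approximation with a tie correction (statistical packages may instead use exact p-values or a continuity correction). Only the two crown group ages are taken from the text.

```python
import math


def wgd_rate(n_wgds, crown_age_my):
    """Ancient WGDs per million years for one species."""
    return n_wgds / crown_age_my


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-species WGD counts (illustrative only).
fern_counts = [3, 4, 4, 5, 3, 4, 2, 5]
angio_counts = [4, 3, 4, 5, 4, 3, 4, 4]

# Minimum crown group ages (million years) as given in the text.
fern_rates = [wgd_rate(c, 384.9) for c in fern_counts]
angio_rates = [wgd_rate(c, 197.5) for c in angio_counts]

print("counts:", mann_whitney_u(fern_counts, angio_counts))
print("rates: ", mann_whitney_u(fern_rates, angio_rates))
```

Dividing by crown age can change the outcome of the test even when the raw counts are similar, which is the pattern reported for ferns and angiosperms.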

Patterns and Rates of Chromosome Number Evolution. Likelihood estimates of the rate of chromosome number evolution are higher in angiosperms than in other vascular plants (Fig. 3). Using ChromEvol, we modeled chromosome number evolution across rbcL phylogenies of 2112 angiosperm, 199 fern, and 51 gymnosperm genera for which cytological data were available. Given the size of the angiosperm phylogeny, we broke down the flowering plant ChromEvol analyses by order. On average, angiosperms were found to have nearly 1.9 times the rate of polyploidy of ferns and more than 8.5 times the rate of polyploidy found among gymnosperms (Fig. 3). However, angiosperms are also estimated to have relatively high rates of dysploidy, with chromosome loss occurring approximately two times more frequently per million years than among ferns. Estimated rates for all types of chromosome number evolution in ferns were lower than in angiosperms. Rates for gymnosperms were the lowest among these vascular plants (Fig. 3). Notably, the ratios of descending to ascending dysploidy and descending dysploidy to polyploidy were very similar across these three groups of plants (Fig. 3). Thus, the relative proportion of dysploidy to polyploidy is broadly similar among these three clades, but the rates of these chromosomal changes vary.

To further explore the pattern of chromosome number evolution among seed plants and ferns, we plotted the inferred chromosome number at each node from our rbcL phylogenies (Fig. 4). Our reconstructions yielded three different patterns of chromosome number evolution over the history of ferns, angiosperms, and gymnosperms. Angiosperms demonstrated a continuous range of chromosome number variation over their evolutionary history, with inferred numbers at most nodes near the base chromosome number for the clade. This pattern is recovered even near the tips of the angiosperm tree, where signatures of more recent polyploidy events are most likely to be observed. These estimates suggest that most angiosperms quickly reduce chromosome number following polyploidy (Fig. 4a). In contrast, gymnosperms demonstrated almost no variation in inferred chromosome number over their phylogeny. Nearly all nodes had an inferred chromosome number of n = 12, with evidence of recent chromosome loss (Fig. 4c). Ferns demonstrated an intriguing pattern with a continuous increase in chromosome number over their evolutionary history. In particular, at least two distinct "steps" of chromosome number increase are evident in our analysis. Although our inferred fern numbers demonstrate some dysploidy (Fig. 3), there appears to be remarkable conservation of chromosome number among many nodes following these broadly shared increases in chromosome number (Fig. 4b).

Rates of Protein Evolution of Meiosis Genes in Angiosperms and Ferns. High chromosome numbers in polyploids can lead to multivalent formation and failure in chromosome pairing during meiosis (Comai, 2005). How ferns overcome these potential challenges in meiotic stability is unknown. Recent studies have shown that specific genes are responsible for meiotic stabilization in a recent polyploid, Arabidopsis arenosa (Yant et al., 2013; Morgan et al., 2020). To explore whether high chromosome number in ferns is impacted by genes that are related to meiosis, we compared the rate of protein evolution of meiosis-related genes in angiosperms and ferns. To eliminate the impact of rate heterogeneity between angiosperms and ferns, we compared the rate of protein evolution for 352 meiosis genes based on the Arabidopsis Gene Ontology (GO) category with that for 500 randomly selected Arabidopsis genes as background. These 500 random genes were used as the query sequences to blast against eleven angiosperm genomes and ten fern genomes and transcriptomes as databases. The Physcomitrella patens genome was used as an outgroup. Gene families were clustered using the blast hit sequences. Overall, 129 and 720 gene family phylogenies were constructed for meiosis-related genes and randomly selected genes, respectively. After re-rooting with Physcomitrella patens and dropping tips of the outgroup, we estimated the mean, minimum, and maximum root-to-tip distance for each species in each gene tree. We used the Mann-Whitney U test to compare the distribution of root-to-tip distances between meiosis and randomly selected genes in angiosperms and ferns. In angiosperms, we observed a significantly higher rate of protein evolution in meiosis genes compared to background genes (p < 0.001) (Fig. 5). However, we found no significant differences in the rate of protein evolution in meiosis genes compared to random background genes in ferns (p = 0.1710) (Fig. 5). This overall pattern is consistent when using the mean, minimum, or maximum root-to-tip distance, and with or without outliers from the distribution (Fig. 5).

Biased Gene Retention and Loss Following Inferred WGDs. The biased pattern of gene retention and loss is a common feature of ancient WGDs, but this pattern is still unknown in monilophytes. To test for biased gene retention and loss following our inferred paleopolyploidy events in ferns, we compared the overall differences between the GO composition of retained paralogs and the whole transcriptome. We performed a principal component analysis (PCA) using the number of genes annotated to each GO category. The PCA shows two significantly different clusters (p < 0.001) with some overlap (Fig. 6). The retained paralogs formed a tighter cluster with a narrower 95% confidence interval compared to the whole transcriptome (Fig. 6). We also used a hierarchical clustering approach to assess the overall GO composition similarity of paralogs retained from ancient WGDs. We observed biased retention and loss from all inferred WGDs in monilophytes (Supplementary Fig. 3). However, we found little evidence of parallel patterns of gene retention across different ancient genome duplications. The hierarchical clustering resolved clusters of gene retention and loss among a few ancient WGDs, for example PSILβ, CYATγ, EQUIα, and PTERα. Most paleopolyploidy events were not resolved by the hierarchical clustering approach. We further used a simulated chi-square test to infer whether any categories are significantly over- or under-retained (Supplementary Fig. 3). Many WGDs in ferns were significantly enriched for genes associated with transcription factor activity, DNA or RNA binding, and the nucleus. Paralogs retained from these ancient duplications are often significantly under-retained for genes associated with other enzyme activity, hydrolase activity, and transferase activity (Supplementary Fig. 3). Overall, all ancient WGDs in ferns show a pattern of biased retention and loss. Parallel patterns of gene retention and loss predicted by the DBH were observed in a few ancient genome duplications.

Discussion

Ancient whole genome duplication is one of the major forces in vascular plant evolution (Van de Peer et al., 2009, 2017; Wendel, 2015; Barker et al., 2016; Soltis et al., 2016). Previous genomic analyses found evidence of ancient polyploidy in the ancestry of seed plants and flowering plants (Jiao et al., 2011; Li et al., 2015). Many angiosperm and gymnosperm lineages have experienced additional rounds of polyploidy (Cui et al., 2006; Barker et al., 2008; Jiao et al., 2014; Cannon et al., 2015; Li et al., 2015; Yang et al., 2015). In monilophytes, we found evidence of 26 putative ancient WGDs during their evolutionary history (Fig. 1). Our result is consistent with ancient WGDs inferred in the recent 1KP study (One Thousand Plant Transcriptomes Initiative, 2019). Although our sample size is nearly double that of 1KP, we found only five previously unidentified ancient genome duplications. This indicates that ancient WGDs in major fern lineages had already been inferred by 1KP (One Thousand Plant Transcriptomes Initiative, 2019). We also confirmed the genome duplications that were previously inferred in Ceratopteris (Barker, 2012; Marchant et al., 2019), Equisetum (Vanneste et al., 2015), Azolla, and Salvinia (Li, Brouwer, et al., 2018). With better sampling, our phylogenomic approach also improves the resolution of the phylogenetic placement of seven putative WGDs in ferns. Overall, ongoing and future fern genome and transcriptome sequencing projects will provide syntenic evidence to confirm these putative ancient WGDs and better place these events in a phylogenetic context.

Our finding that ferns have experienced multiple rounds of ancient polyploidy is consistent with early proposed hypotheses. One of the most distinctive features of homosporous ferns is their high chromosome numbers (Manton, 1950; Klekowski and Baker, 1966). Multiple hypotheses have been proposed to explain why homosporous ferns have high chromosome numbers. The alternative explanations do not involve polyploidy but instead include an ancestrally high chromosome number and ascending chromosomal fission (Wagner and Wagner, 1980; Haufler and Soltis, 1986; Barker and Wolf, 2010). However, given the high rate of recent polyploidy in ferns (Wood et al., 2009) and the highly uniform size and structure of fern chromosomes across genera (Wagner and Wagner, 1980), these alternative explanations have never gained much support. The most compelling hypothesis, proposed by Haufler (1987), predicts that ferns have undergone repeated rounds of ancient genome duplication. Under this hypothesis, the high chromosome numbers and diploid isozyme expression in ferns are due to gene silencing without chromosome loss following polyploidy (Haufler, 1987). By estimating the distribution of ancient WGDs in ferns, we found that ferns have experienced 3.76 rounds of ancient polyploidy on average (Fig. 2a). This result is consistent with the prediction that ferns experienced repeated rounds of ancient polyploidy (Haufler, 1987).

In our phylogenetic reconstruction of chromosome evolution, we found lower rates of ascending and descending dysploidy in ferns compared to angiosperms (Fig. 3). Given that flowering plants have a higher rate of ancient polyploidy than ferns (Fig. 2b), the much higher chromosome number in ferns is likely due to their lower rate of chromosome loss. By plotting the inferred chromosome numbers from the fern phylogeny, we also observed increases in chromosome number towards the tips of the phylogeny (Fig. 3c). This result indicates that the high chromosome number in ferns is due to their capacity to retain chromosomes over time rather than an ancestrally high chromosome number. In general, our genomic inference of ancient WGDs and phylogenetic reconstruction of chromosome evolution support the multiple ancient WGDs hypothesis (Haufler, 1987). Overall, our study shows that multiple rounds of ancient polyploidy and a low rate of chromosome loss explain the high chromosome numbers in ferns.

High chromosome numbers in newly formed polyploids can lead to problems with meiotic stability (Comai, 2005). The additional chromosome copies can cause multivalent formation and failure in chromosome pairing during meiosis (Bomblies et al., 2016). However, how ferns with high chromosome numbers overcome these potential challenges to meiotic stability is unclear. Recent studies have shown that specific genes are responsible for meiotic stabilization in a recent polyploid, Arabidopsis arenosa (Yant et al., 2013; Morgan et al., 2020). To explore whether the low rate of chromosome loss in ferns is influenced by genes related to meiosis, we compared the rate of protein evolution of meiosis-related genes in angiosperms and ferns. We found a consistent pattern of faster protein evolution of meiosis genes in flowering plants but not in ferns (Fig. 5). This result suggests that meiosis-related genes might be conserved in ferns and evolve faster in angiosperms. This finding also indicates that the different patterns of chromosomal and genome evolution in angiosperms and ferns might be related to the evolution of meiosis-related genes. However, future analyses are needed to reveal the molecular basis of the differences in genome evolution between these major vascular plant lineages.

To further investigate the unique genome evolution in ferns, we evaluated the pattern of gene retention and loss following ancient polyploidy. Previous studies have found parallel gene retention in single ancient WGDs (Scannell et al., 2007; Mandáková et al., 2017) and in multiple independent ancient WGDs (Schranz and Mitchell-Olds, 2006; Barker et al., 2008; Mandáková et al., 2017; Li, Tiley, et al., 2018). Our PCA analyses found a tight cluster of the retained paralogs (Fig. 6). This result suggests that retained paralogs are more similar to one another in GO composition than the whole transcriptomes are, and indicates these genes are likely to be retained from ancient WGDs. Based on the GO composition analyses, we found evidence of parallel gene retention following ancient WGDs in ferns (Supplementary Fig. 2-3). This parallel retention is not as dramatic as previously observed in flowering plants using a similar statistical framework (Barker et al., 2008; Mandáková et al., 2017). This result is expected for two reasons. First, the phylogenetic depth of the fern phylogeny might weaken the parallel retention pattern. Extant fern orders share a most recent common ancestor ~350 MYA (Rothfels et al., 2015; Morris et al., 2018). Given that each fern species has experienced 3.76 rounds of ancient WGD on average, many signals of gene retention might also be overlain by different genome duplications. Second, a lower chromosome loss rate in ferns might result in weaker parallel retention compared to angiosperms. A previous study in Tragopogon showed that the parallel pattern of gene loss can be established in only 40 generations (Buggs et al., 2012). The strong parallel retention pattern in angiosperms might be driven by their high rates of chromosome evolution. Our observed retention patterns might be consistent with slow rates of chromosome loss in ferns. Overall, our study provides the first evidence of parallel and biased gene retention and loss following paleopolyploidy in monilophytes.

Our findings show the unique characteristics of genome evolution in ferns. Monilophytes are the only lineage of vascular plants with a strong positive correlation between genome size and chromosome number (Nakazato et al., 2008; Bainard et al., 2011; Clark et al., 2016). Consistent with previous observations (Barker, 2012; Liu et al., 2019), our finding of a lower rate of chromosome loss in ferns suggests that fern chromosome evolution is less dynamic than that of angiosperms. Our combination of genomic inference and phylogenetic reconstruction provides complementary evidence for the long-standing ‘multiple ancient WGDs’ hypothesis (Haufler, 1987). Consistent with its prediction, we found that multiple rounds of ancient polyploidy and low rates of chromosome loss explain the high chromosome numbers in ferns. Our study also provides a starting point for future studies to reveal the processes that might explain the genomic differences among lineages of vascular plants. Future fern genome sequencing projects, especially on homosporous ferns with high chromosome numbers, will advance our understanding of this unique aspect of genome evolution in ferns.

Methods

DupPipe Analyses of WGDs from Transcriptomes of Single Species

For each transcriptome, we used the DupPipe pipeline to construct gene families and estimate the age distribution of gene duplications (Barker et al., 2008, 2010). We translated DNA sequences and identified reading frames by comparing the GeneWise (Birney et al., 2004) alignment to the best-hit protein from a collection of proteins from 25 plant genomes from Phytozome (Goodstein et al., 2012). For all DupPipe runs, we used protein-guided DNA alignments to align our nucleic acid sequences while maintaining the reading frame. We estimated synonymous divergence (Ks) using PAML with the F3X4 model (Yang, 2007) for each node in the gene family phylogenies. We identified peaks of gene duplication as evidence of ancient WGDs in histograms of the age distribution of gene duplications (Ks plots). We identified species with potential WGDs by comparing their paralog age distribution to a simulated null using a Kolmogorov–Smirnov goodness of fit test (Cui et al., 2006). We then used mixture modeling and manual curation to identify significant peaks consistent with a potential WGD and to estimate their median paralog Ks values. Significant peaks were identified using a likelihood ratio test in the boot.comp function of the mixtools R package (Benaglia et al., 2009).
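The comparison of a paralog age distribution against a simulated null can be illustrated with a minimal sketch. This is not the DupPipe implementation; it only computes the two-sample Kolmogorov–Smirnov statistic D in plain Python. The function name ks_statistic and all Ks values below are hypothetical and chosen for illustration.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical Ks values (illustration only): a null with steady decay,
# and an observed set with an excess of duplicates near Ks = 1.0.
null_ks = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.6, 0.8, 1.2, 1.8]
observed_ks = [0.05, 0.1, 0.2, 0.9, 0.95, 1.0, 1.0, 1.05, 1.1, 1.8]
d_stat = ks_statistic(observed_ks, null_ks)
print(f"D = {d_stat:.2f}")
```

A large D suggests the observed distribution departs from the null; turning D into a p-value and fitting mixture models to locate peaks are separate steps not shown here.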

Estimating Orthologous Divergence

To place putative WGDs in relation to lineage divergence, we estimated the synonymous divergence of orthologs among species pairs that may share a WGD based on their phylogenetic position and evidence from the within-species Ks plots. We used the RBH Ortholog pipeline (Barker et al., 2010) to estimate the mean and median synonymous divergence of orthologs and compared those to the synonymous divergence of inferred paleopolyploid peaks. We identified orthologs as reciprocal best BLAST hits in pairs of transcriptomes. Using protein-guided DNA alignments, we estimated the pairwise synonymous (Ks) divergence for each pair of orthologs using PAML with the F3X4 model (Yang, 2007). WGDs were interpreted to have occurred after lineage divergence if the median synonymous divergence of WGD paralogs was younger than the median synonymous divergence of orthologs. Similarly, if the synonymous divergence of WGD paralogs was older than the ortholog synonymous divergence, then we interpreted those WGDs as shared.
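The decision rule described above can be written as a short sketch. The function name place_wgd and the Ks values are hypothetical, and the text does not say how an exact tie between the two medians is handled; the sketch treats a tie as shared, which is an assumption.

```python
from statistics import median

def place_wgd(paralog_ks, ortholog_ks):
    """Place a WGD relative to a lineage split by comparing median Ks values.

    A smaller (younger) paralog median than the ortholog median means the
    WGD postdates the divergence; otherwise it is treated as shared.
    """
    wgd_median = median(paralog_ks)
    split_median = median(ortholog_ks)
    # Assumption: ties are classified as shared.
    return "after divergence" if wgd_median < split_median else "shared"

# Hypothetical Ks values for illustration only.
print(place_wgd([0.4, 0.5, 0.6], [0.9, 1.0, 1.1]))  # WGD peak younger than the split
print(place_wgd([1.4, 1.5, 1.6], [0.9, 1.0, 1.1]))  # WGD peak older than the split
```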

MAPS Analyses of WGDs from Transcriptomes of Multiple Species

To infer and locate putative WGDs in our datasets, we used a gene tree sorting and counting algorithm, the Multi-tAxon Paleopolyploidy Search (MAPS) tool (Li et al., 2015). For each MAPS analysis, we selected at least two species that potentially shared a WGD in their ancestry as well as representative species from lineages that may phylogenetically bracket the WGD. MAPS uses this given species tree to filter collections of nuclear gene trees for subtrees consistent with relationships at each node in the species tree. Using this filtered set of subtrees, MAPS identifies and records nodes with a gene duplication shared by descendant taxa. To infer and locate a potential WGD, we compared the number of duplications observed at each node to a null simulation of background gene birth/death rates (Rabier et al., 2014; Li, Tiley, et al., 2018). A Fisher’s exact test, implemented in R (R Core Team, 2014), was used to identify locations with significant increases of gene duplication compared with a null simulation. Locations with significantly more duplications than expected were then compared to a simulated WGD at this location. If the observed duplications were consistent with this simulated WGD under Fisher’s exact test, and the location agreed with inferences from Ks plots and ortholog divergence data, we identified it as a WGD.
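As an illustration of the comparison between observed and simulated duplication counts, the following sketch computes a one-sided Fisher's exact test on a 2x2 table in plain Python. The analyses above used the implementation in R; the sidedness of the test, the function name fisher_exact_greater, and the counts are all assumptions made for this example.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Tests whether the first row is enriched for the first column,
    by summing hypergeometric probabilities for counts >= a.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n, col1)

# Hypothetical counts: subtrees with / without a shared duplication at one node.
observed = (300, 700)   # empirical gene trees
null = (100, 900)       # null simulation
p_value = fisher_exact_greater(observed[0], observed[1], null[0], null[1])
print(p_value < 0.001)
```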

Each MAPS analysis was designed to place focal WGDs near the center of a species tree to minimize errors in WGD inference. Errors in transcriptome or genome assembly, gene family clustering, and the construction of gene family phylogenies can result in topological errors in gene trees (Yang and Smith, 2013). Previous studies have suggested that errors in gene trees can lead to biased placements of duplicates towards the root of the tree and losses towards the tips of the tree (Hahn, 2007). For this reason, we aimed to put focal nodes for a particular MAPS analysis test in the middle of the phylogeny. To further decrease potential error in our inferences of gene duplications, we required at least 45% of the ingroup taxa to be present in all subtrees analyzed by MAPS (Li, Tiley, et al., 2018). If this minimum ingroup taxa number requirement is not met, the gene subtree will be filtered out and excluded from our analysis. Increasing taxon occupancy leads to more accurate duplication inference and reduces some of the biases in mapping duplications onto a species tree (Hahn, 2007; Smith et al., 2015). To maintain sufficient gene tree numbers for each MAPS analysis, we used collections of gene family phylogenies for six to eight taxa to infer ancient WGDs.
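The taxon occupancy filter can be sketched as follows. This is not the MAPS code; the function name passes_occupancy and the taxon labels are hypothetical, and the only value taken from the text is the 45% threshold.

```python
def passes_occupancy(subtree_taxa, ingroup_taxa, min_fraction=0.45):
    """Return True if enough ingroup taxa are represented in a gene subtree."""
    ingroup = set(ingroup_taxa)
    present = set(subtree_taxa) & ingroup  # ignore outgroup or unknown labels
    return len(present) / len(ingroup) >= min_fraction

# Hypothetical six-taxon ingroup for illustration only.
ingroup = ["sp1", "sp2", "sp3", "sp4", "sp5", "sp6"]
subtrees = [
    ["sp1", "sp2", "sp3", "outgroup"],  # 3 of 6 ingroup taxa: kept
    ["sp1", "sp2", "outgroup"],         # 2 of 6 ingroup taxa: filtered out
]
kept = [taxa for taxa in subtrees if passes_occupancy(taxa, ingroup)]
print(len(kept))
```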

For each MAPS analysis, the transcriptomes were translated into amino acid sequences using the TransPipe pipeline (Barker et al., 2010). Using these translations, we performed reciprocal protein BLAST (blastp) searches among data sets for the MAPS analysis using an E-value of 10e-5 as a cutoff. We clustered gene families from these BLAST results using OrthoFinder under the default parameters (Emms and Kelly, 2015). Using a custom Perl script (https://bitbucket.org/barkerlab/maps/src/master/), we filtered for gene families that contained at least one gene copy from each taxon in a given MAPS analysis and discarded the remaining OrthoFinder clusters. We used PASTA (Mirarab et al., 2014) for automatic alignment and phylogeny reconstruction of gene families. For each gene family phylogeny, we ran PASTA until we reached three iterations without an improvement in likelihood score using a centroid breaking strategy. Within each iteration of PASTA, we constructed subset alignments using MAFFT (Katoh et al., 2002), employed MUSCLE (Edgar, 2004) for merging these subset alignments, and used RAxML (Stamatakis, 2014) for tree estimation. The parameters for each software package were the default options for PASTA. We used the best-scoring PASTA tree for each multi-species nuclear gene family to collectively estimate the numbers of shared gene duplications on each branch of the given species tree.

To generate null simulations, we first estimated the mean background gene duplication rate (λ) and gene loss rate (μ) with WGDgc (Rabier et al., 2014). Gene count data were obtained from OrthoFinder (Emms and Kelly, 2015) clusters associated with each species tree. λ and μ were estimated using only gene clusters that spanned the root of their respective species trees, which has been shown to reduce biases in the maximum likelihood estimates of λ and μ (Rabier et al., 2014). We chose a maximum gene family size of 100 for parameter estimation, which was necessary to provide an upper bound for numerical integration of node states (Rabier et al., 2014). We provided a prior probability distribution on the number of genes at the root of each species tree, such that ancestral gene family sizes followed a shifted geometric distribution with mean equal to the average number of genes per gene family across species.

Gene trees were then simulated within each MAPS species tree using the GuestTreeGen program from GenPhyloData (Sjöstrand et al., 2013). For each species tree, we simulated 3000 gene trees with at least one tip per species: 1000 gene trees at the λ and μ maximum likelihood estimates, 1000 gene trees at half the estimated λ and μ, and 1000 trees at three times λ and μ. For all simulations, we applied the same empirical prior used for estimation of λ and μ. We then randomly resampled 1000 trees without replacement from the total pool of gene trees 100 times to provide a measure of uncertainty on the percentage of subtrees at each node. For positive simulations of WGDs, we simulated gene trees using the same approach as the nulls but incorporated a WGD at the test branch. Previous empirical estimates of paralogs retained following a plant WGD are approximately 10% on average (Tiley et al., 2016). To be conservative for inferring WGDs in our MAPS analyses, we allowed at least 20% of the genes to be retained following the simulated WGD to account for biased gene retention and loss.
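The resampling step can be illustrated with a small sketch: draw 1000 trees without replacement from a pool of 3000 simulated trees, 100 times, and record the percentage of drawn trees that carry a duplication at a given node. The pool composition, the function name resample_percentages, and the fixed seed are hypothetical choices for this example and do not reproduce the simulations themselves.

```python
import random

def resample_percentages(has_duplication, n_draw=1000, n_reps=100, seed=0):
    """Percentage of sampled trees with a duplication, over repeated draws.

    has_duplication: one boolean per simulated gene tree for a single node.
    Each replicate samples n_draw trees without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    percentages = []
    for _ in range(n_reps):
        draw = rng.sample(has_duplication, n_draw)
        percentages.append(100.0 * sum(draw) / n_draw)
    return percentages

# Hypothetical pool: 300 of 3000 simulated trees show a duplication at the node.
pool = [True] * 300 + [False] * 2700
pcts = resample_percentages(pool)
print(min(pcts), max(pcts))
```

The spread between the minimum and maximum gives a rough sense of the uncertainty on the percentage of subtrees with a duplication at that node.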

Gene Ontology (GO) Annotations and Paleolog Retention and Loss Patterns

Gene Ontology (GO) annotations of all fern transcriptomes were obtained through discontiguous MegaBlast searches against annotated Arabidopsis thaliana transcripts from TAIR (Swarbreck et al., 2007) to find the best hit with length of at least 100 bp and an e-value of at least 0.01. We evaluated the overall differences between the GO composition of transcriptomes and WGD paralogs by principal component analysis (PCA) using the rda function in vegan (Dixon, 2003). We tested whether the GO category composition differed between transcriptomes and WGD paralogs using a permutation test in the vegan envfit function. Ellipses representing the 95% confidence interval of the standard deviation of point scores were drawn on the PCA plot using the ordiellipse function in vegan. We further tested for differences among GO annotations using chi-square tests. When chi-square tests were significant (p < 0.05), GO categories with absolute residuals > 2 were implicated as major contributors to the significant chi-square statistic. A category with residual > 2 indicates significant over-retention of this category following WGD, whereas residual < –2 indicates significant under-retention (Barker et al., 2008). Using this statistical framework, we tested for significant differences between the overall transcriptome and paralogs from all paleopolyploid species.
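The residual-based flagging of GO categories can be sketched as below. The sketch assumes the residuals are Pearson residuals, (observed - expected) / sqrt(expected), which the text does not state explicitly. The function name pearson_residuals and all counts are hypothetical, and the significance test itself is not reproduced here.

```python
from math import sqrt

def pearson_residuals(table):
    """Return (chi-square statistic, residual matrix) for a table of counts.

    table: list of rows, each a list of counts per category.
    Residuals are Pearson residuals (an assumption for this sketch).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2, residuals = 0.0, []
    for r, row in enumerate(table):
        res_row = []
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / n
            res = (observed - expected) / sqrt(expected)
            chi2 += res * res
            res_row.append(res)
        residuals.append(res_row)
    return chi2, residuals

# Hypothetical gene counts per GO category (illustration only).
categories = ["transcription factor activity", "hydrolase activity", "other"]
paralogs = [60, 20, 120]
transcriptome = [100, 200, 700]
chi2, res = pearson_residuals([paralogs, transcriptome])

flagged = {}
for name, value in zip(categories, res[0]):  # residuals for the paralog row
    if value > 2:
        flagged[name] = "over-retained"
    elif value < -2:
        flagged[name] = "under-retained"
print(flagged)
```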

Rates of Protein Evolution of Meiosis Genes in Angiosperms and Ferns

To estimate the rates of protein evolution, we selected 352 meiosis-related genes based on GO terms under meiosis and their child terms. To avoid the impact of rate heterogeneity between angiosperms and ferns, we compared the rate of protein evolution for these meiosis genes to that of 500 randomly selected genes from the Arabidopsis genome. Meiosis-related genes and randomly selected genes were used as queries. Whole genomes of 11 angiosperms (Amaranthus hypochondriacus, Amborella trichopoda, Ananas comosus, Aquilegia coerulea, Arabidopsis thaliana, Carica papaya, grandis, Mimulus guttatus, Oryza sativa, Populus trichocarpa, and Vitis vinifera) were downloaded from Phytozome (Goodstein et al., 2012) and used as the angiosperm database. Whole genomes of Azolla filiculoides and Salvinia cucullata (Li, Brouwer, et al., 2018) and transcriptomes of eight ferns (Asplenium formosae, Ceratopteris thalictroides, Dicksonia antarctica, Equisetum diffusum, Microlepia speluncae, Ophioglossum vulgatum, Osmunda javanica, Polypodium hesperium) were used as the database for ferns. Physcomitrella patens was used as an outgroup, and both A. thaliana and P. patens served as outgroups in the fern analyses. Both query sets were searched against the angiosperm and fern databases using protein BLAST (blastp) with an E-value of 1e-20 as the cutoff. The hits from each protein BLAST search were used for gene family clustering with OrthoFinder 2.3.7 (Emms and Kelly, 2015). Gene families with at least one P. patens sequence and one A. thaliana sequence were selected. PASTA (Mirarab et al., 2014) was used to construct gene family phylogenies. Gene trees were re-rooted with Physcomitrella patens, and outgroup tips were dropped from the gene trees after re-rooting. The root-to-tip distance was extracted with the ‘distances’ function in ape (Paradis and Schliep, 2019). For each gene tree, we estimated the mean, minimum, and maximum root-to-tip distance for each species. To compare the rate of protein evolution, we used the Mann-Whitney U test to compare the distribution of root-to-tip distances between meiosis and randomly selected genes in angiosperms and ferns. To visualize these distributions, we also plotted the distributions of each comparison of angiosperms and ferns with ggplot in R.
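The per-species summary of root-to-tip distances can be illustrated with a minimal sketch. The analyses above were done in R with ape; here a rooted gene tree is represented as nested tuples of (label, branch length, children), which is a representation chosen only for this example. The tip labels, the "species|gene" naming scheme, and the branch lengths are hypothetical.

```python
from statistics import mean

def root_to_tip(node, depth=0.0, out=None):
    """Collect {tip label: summed branch length from the root} for a nested-tuple tree."""
    if out is None:
        out = {}
    label, length, children = node
    depth += length
    if not children:          # a tip has no children
        out[label] = depth
    for child in children:
        root_to_tip(child, depth, out)
    return out

def summarize_by_species(distances):
    """Return {species: (mean, min, max)} using the 'species|gene' tip label prefix."""
    by_species = {}
    for tip, dist in distances.items():
        by_species.setdefault(tip.split("|")[0], []).append(dist)
    return {sp: (mean(v), min(v), max(v)) for sp, v in by_species.items()}

# Hypothetical rooted gene tree with the outgroup already removed.
tree = ("root", 0.0, [
    ("Azolla|g1", 0.30, []),
    ("n1", 0.10, [
        ("Salvinia|g1", 0.20, []),
        ("Salvinia|g2", 0.40, []),
    ]),
])
summary = summarize_by_species(root_to_tip(tree))
print(summary["Salvinia"])
```

The resulting per-species values for meiosis and background gene trees are what would then be compared with a rank-based test such as the Mann-Whitney U test.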

Phylogenetic Reconstruction of Chromosomal Evolution

We reconstructed two time-calibrated phylogenetic trees based on rbcL (ribulose-1,5-bisphosphate carboxylase/oxygenase, large subunit), one for the seed plants (angiosperms and gymnosperms) and the other for the ferns. The rbcL sequences were downloaded from NCBI GenBank. For each genus, the longest rbcL sequence was chosen as the representative; in case of ties, one rbcL sequence was picked at random. The two genus-level phylogenetic trees were inferred from the sequence data sets using SATé version 1.4.0 (Liu et al., 2009; Sukumaran and Holder, 2010), breaking the subproblems using the centroid strategy and iteratively rebuilding the phylogeny until no improvement was seen for 10 consecutive rounds. MAFFT version 6.717 (Katoh et al., 2002) was chosen as the main aligner, MUSCLE version 3.8.31 (Edgar, 2004; for the seed plant phylogeny) or OPAL version 1.0.3 (Wheeler and Kececioglu, 2007; for the fern phylogeny) as the merger, and RAxML version 7.2.6 (Stamatakis, 2006) as the phylogeny estimator.

The seed plant phylogeny (likelihood score = -258546.04; alignment length = 1647) contains 2,352 angiosperm genera and 51 gymnosperm genera, and it is rooted using 96 outgroup genera (ferns and lycophytes). The fern phylogeny (likelihood score = -45845.59; alignment length = 1448) contains 199 ingroup genera, and it is rooted using six outgroup genera (lycophytes). The trees were then dated with fossil records using PATHd8 version 1.0 (Britton et al., 2007). The crown (root) age of the seed plant phylogeny (including the outgroup taxa) was fixed at 380 million years ago (Kenrick and Crane, 1997), and minimum age constraints were applied at 51 ancestral nodes using the fossil records. Similarly, the crown (root) age of the fern phylogeny (including the outgroup taxa) was fixed at 380 MYA, and minimum age constraints were applied at 11 ancestral nodes using fossil records analyzed by Schneider et al. (2004). Prior to further analyses, the outgroup genera were pruned from the phylogenies.

Furthermore, we took a by-order approach to estimate the rates of chromosomal evolution for the angiosperms. In an initial analysis, we attempted to estimate the rates for the full angiosperm phylogeny, but the run time was prohibitive. Hence, we extracted subtrees corresponding to 26 major angiosperm orders from the dated seed plant phylogeny and then performed the rate estimation on each of the order-level data sets. We focused on these orders because they contain at least 25 genera with a representative chromosome count. Likewise, we extracted the gymnosperm subtree from the seed plant phylogeny for downstream analysis. Gametic chromosome numbers were collated and summarized using the mode count as the representative for each genus. In the case of ties, the lowest mode count was taken. The representative chromosome counts were matched to the tips of the phylogenetic trees for each genus.
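The rule for choosing a representative chromosome count per genus (the mode, with ties resolved to the lowest value) can be expressed as a short sketch. The function name representative_count and the example counts are hypothetical.

```python
from collections import Counter

def representative_count(counts):
    """Return the most frequent chromosome count; ties go to the lowest count."""
    tally = Counter(counts)
    top_frequency = max(tally.values())
    return min(n for n, freq in tally.items() if freq == top_frequency)

# Hypothetical gametic counts reported for the species of one genus.
print(representative_count([29, 29, 58, 58, 87]))  # two modes, lowest is taken
print(representative_count([41, 82, 82]))          # single mode
```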

ChromEvol version 1 (Mayrose et al., 2010) was run to infer the best-fitting model and to estimate clade-wide rates of chromosomal evolution for (1) each angiosperm order, (2) all ferns, and (3) all gymnosperms (the maximum allowed chromosome number was set to one plus the maximum observed). Here, we considered four ChromEvol models that assume that the rates of chromosomal evolution are constant throughout a given phylogenetic tree. Chromosome numbers can evolve along a phylogeny via ascending dysploidy (i.e., single increment), descending dysploidy (i.e., single decrement), polyploidy (i.e., doubling), and demi-polyploidy (i.e., multiplication by 1.5; for example, in the formation of hexaploids). In the simplest model (“NO DUPL”), chromosome numbers can vary only via dysploidy (two parameters). In the next simplest model, polyploidy is allowed (“CONST RATE”; three parameters). There are two models allowing for demi-polyploidy, “DEMI” (three parameters) and “DEMI EST” (four parameters); the former assumes that the rates of polyploidy and demi-polyploidy are equal and the latter allows the two rates to vary. The best-fitting ChromEvol model was identified by the lowest AIC score. For each clade, we took the rate parameters estimated under the best-fitting model for statistical analysis. Additionally, to show the pattern of chromosome number change over the phylogeny of each clade, we extracted the ancestral chromosome numbers inferred at each of the internal nodes of the phylogeny as well as the estimated ages of the internal nodes. For the angiosperm orders, the ChromEvol results were visualized together.
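Model choice by AIC can be illustrated with the standard formula AIC = 2k - 2 ln L, where k is the number of free parameters. In the sketch below, the parameter counts come from the model descriptions above, while the log-likelihood values and the function names are hypothetical and serve only to show the selection step.

```python
# Number of free rate parameters for each model, as described in the text.
PARAMS = {"NO DUPL": 2, "CONST RATE": 3, "DEMI": 3, "DEMI EST": 4}

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def best_model(log_likelihoods):
    """Return (name of the lowest-AIC model, dict of AIC scores)."""
    scores = {m: aic(lnl, PARAMS[m]) for m, lnl in log_likelihoods.items()}
    return min(scores, key=scores.get), scores

# Hypothetical log-likelihoods for one clade (illustration only).
fits = {"NO DUPL": -520.0, "CONST RATE": -480.0, "DEMI": -478.5, "DEMI EST": -478.2}
winner, scores = best_model(fits)
print(winner, scores[winner])
```

In this made-up case the four-parameter model has the highest likelihood, but the extra parameter is penalized, so the three-parameter “DEMI” model is preferred.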

References

Abraham, A., and C. A. Ninan. 1954. Chromosomes of Ophioglossum reticulatum L. Current science 23: 213–214.

Bainard, J. D., T. A. Henry, L. D. Bainard, and S. G. Newmaster. 2011. DNA content variation in monilophytes and lycophytes: large genomes that are not endopolyploid. Chromosome research 19: 763–775.

Barker, M. S. 2012. Karyotype and genome evolution in Pteridophytes. Plant Genome Diversity Volume 2, 245–253.

Barker, M. S., K. M. Dlugosch, L. Dinh, R. S. Challa, N. C. Kane, M. G. King, and L. H. Rieseberg. 2010. EvoPipes.net: Bioinformatic tools for ecological and evolutionary genomics. Evolutionary bioinformatics online 6: 143–149.

Barker, M. S., B. C. Husband, and J. C. Pires. 2016. Spreading Winge and flying high: The evolutionary importance of polyploidy after a century of study. American journal of botany 103: 1139–1145.

Barker, M. S., N. C. Kane, M. Matvienko, A. Kozik, R. W. Michelmore, S. J. Knapp, and L. H. Rieseberg. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Molecular biology and evolution 25: 2445–2455.

Barker, M. S., and P. G. Wolf. 2010. Unfurling fern biology in the genomics age. Bioscience 60: 177–185.

Benaglia, T., D. Chauveau, D. Hunter, and D. Young. 2009. mixtools: An R package for analyzing mixture models. Journal of Statistical Software, Articles 32: 1–29.

Birney, E., M. Clamp, and R. Durbin. 2004. GeneWise and Genomewise. Genome research 14: 988–995.

Bomblies, K., G. Jones, C. Franklin, D. Zickler, and N. Kleckner. 2016. The challenge of evolving stable polyploidy: could an increase in ‘crossover interference distance’ play a central role? Chromosoma 125: 287–300.

Britton, T., C. L. Anderson, D. Jacquet, S. Lundqvist, and K. Bremer. 2007. Estimating divergence times in large phylogenetic trees. Systematic biology 56: 741–752.

Buggs, R. J. A., S. Chamala, W. Wu, J. A. Tate, P. S. Schnable, D. E. Soltis, P. S. Soltis, and W. B. Barbazuk. 2012. Rapid, repeated, and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Current biology: CB 22: 248–252.

Cannon, S. B., M. R. McKain, A. Harkess, M. N. Nelson, S. Dash, M. K. Deyholos, Y. Peng, et al. 2015. Multiple polyploidy events in the early radiation of nodulating and nonnodulating legumes. Molecular biology and evolution 32: 193–210.

Chapman, R. H., E. J. Klekowski Jr, and R. K. Selander. 1979. Homoeologous heterozygosity and recombination in the fern Pteridium aquilinum. Science 204: 1207–1209.

Clark, J., O. Hidalgo, J. Pellicer, H. Liu, J. Marquardt, Y. Robert, M. Christenhusz, et al. 2016. Genome evolution of ferns: evidence for relative stasis of genome size across the fern phylogeny. The New phytologist 210: 1072–1082.

Clark, J. W., M. N. Puttick, and P. C. J. Donoghue. 2019. Origin of horsetails and the role of whole-genome duplication in plant macroevolution. Proceedings of the Royal Society B. Biological sciences 286: 20191662.

Comai, L. 2005. The advantages and disadvantages of being polyploid. Nature reviews. Genetics 6: 836–846.

Cui, L., P. K. Wall, J. H. Leebens-Mack, B. G. Lindsay, D. E. Soltis, J. J. Doyle, P. S. Soltis, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome research 16: 738–749.

Dixon, P. 2003. VEGAN, a package of R functions for community ecology. Journal of vegetation science 14: 927.

Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32: 1792–1797.

Emms, D. M., and S. Kelly. 2015. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome biology 16: 157.

Gastony, G. J. 1991. Gene silencing in a polyploid homosporous fern: paleopolyploidy revisited. Proceedings of the National Academy of Sciences of the United States of America 88: 1602–1605.

Gastony, G. J., and D. C. Darrow. 1983. Chloroplastic and cytosolic isozymes of the homosporous fern Athyrium filix-femina L. American journal of botany 70: 1409–1415.

Gastony, G. J., and L. D. Gottlieb. 1982. Evidence for genetic heterozygosity in a homosporous fern. American journal of botany 69: 634–637.

Gastony, G. J., and L. D. Gottlieb. 1985. Genetic variation in the homosporous fern Pellaea andromedifolia. American journal of botany 72: 257–267.

Goodstein, D. M., S. Shu, R. Howson, R. Neupane, R. D. Hayes, J. Fazo, T. Mitros, et al. 2012. Phytozome: a comparative platform for green plant genomics. Nucleic acids research 40: D1178–86.

Grant, V. 1963. The origin of adaptations. Columbia University Press.

Hahn, M. W. 2007. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome biology 8: R141.

Haufler, C. H. 1987. Electrophoresis is modifying our concepts of evolution in homosporous pteridophytes. American journal of botany 74: 953–966.

Haufler, C. H., and D. E. Soltis. 1986. Genetic evidence suggests that homosporous ferns with high chromosome numbers are diploid. Proceedings of the National Academy of Sciences of the United States of America 83: 4389–4393.

Jiao, Y., J. Li, H. Tang, and A. H. Paterson. 2014. Integrated syntenic and phylogenomic analyses reveal an ancient genome duplication in monocots. The Plant cell 26: 2792–2802.

Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr, P. E. Ralph, L. P. Tomsho, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100.

Katoh, K., K. Misawa, K.-I. Kuma, and T. Miyata. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research 30: 3059–3066.

Kenrick, P., and P. R. Crane. 1997. The origin and early evolution of plants on land. Nature 389: 33–39.

Klekowski, E. J., Jr, and H. G. Baker. 1966. Evolutionary significance of polyploidy in the pteridophyta. Science 153: 305–307.

Levin, D. A. 2002. The role of chromosomal change in plant evolution. Oxford University Press.

Li, F.-W., P. Brouwer, L. Carretero-Paulet, S. Cheng, J. de Vries, P.-M. Delaux, A. Eily, et al. 2018. Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nature plants 4: 460–472.

Liu, H., L. Ekrt, P. Koutecky, J. Pellicer, O. Hidalgo, J. Marquardt, F. Pustahija, et al. 2019. Polyploidy does not control all: Lineage‐specific average chromosome length constrains genome size evolution in ferns. Journal of Systematics and Evolution 57: 418–430.

Liu, K., S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. 2009. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324: 1561–1564.

Li, Z., A. E. Baniaga, E. B. Sessa, M. Scascitelli, S. W. Graham, L. H. Rieseberg, and M. S. Barker. 2015. Early genome duplications in conifers and other seed plants. Science advances 1: e1501084.

Li, Z., G. P. Tiley, S. R. Galuska, C. R. Reardon, T. I. Kidder, R. J. Rundell, and M. S. Barker. 2018. Multiple large-scale gene and genome duplications during the evolution of hexapods. Proceedings of the National Academy of Sciences of the United States of America 115: 4713–4718.

Love, A., D. Love, and R. E. G. Pichi Sermolli. 1977. Cytotaxonomical atlas of the Pteridophyta. J. Cramer, Vaduz. 398 p.

Mandáková, T., Z. Li, M. S. Barker, and M. A. Lysak. 2017. Diverse genome organization following 13 independent mesopolyploid events in Brassicaceae contrasts with convergent patterns of gene retention. The Plant journal 91: 3–21.

Manton, I. 1950. Problems of cytology and evolution in the Pteridophyta.

Marchant, D. B., E. B. Sessa, P. G. Wolf, K. Heo, W. B. Barbazuk, P. S. Soltis, and D. E. Soltis. 2019. The C-Fern (Ceratopteris richardii) genome: insights into plant genome evolution with the first partial homosporous fern genome assembly. Scientific reports 9: 18181.

Mayrose, I., M. S. Barker, and S. P. Otto. 2010. Probabilistic models of chromosome number evolution and the inference of polyploidy. Systematic Biology 59: 132–144.

McGrath, J. M., and L. G. Hickok. 1999. Multiple ribosomal RNA gene loci in the genome of the homosporous fern Ceratopteris richardii. Canadian journal of botany. Journal canadien de botanique 77: 1199–1202.

Mirarab, S., N. Nguyen, and T. Warnow. 2014. PASTA: Ultra-large multiple sequence alignment. In R. Sharan [ed.], Research in Computational Molecular Biology, Lecture Notes in Computer Science, 177–191. Springer International Publishing, Cham.

Mitchell McGrath, J., L. G. Hickok, and E. Pichersky. 1994. Assessment of gene copy number in the homosporous ferns Ceratopteris thalictroides and C. richardii (Parkeriaceae) by restriction fragment length polymorphisms. Plant systematics and evolution 189: 203–210.

Morgan, C., H. Zhang, C. E. Henry, F. C. H. Franklin, and K. Bomblies. 2020. Derived alleles of two axis proteins affect meiotic traits in autotetraploid Arabidopsis arenosa. Proceedings of the National Academy of Sciences 117: 8980–8988.

Morris, J. L., M. N. Puttick, J. W. Clark, D. Edwards, P. Kenrick, S. Pressel, C. H. Wellman, et al. 2018. The timescale of early land plant evolution. Proceedings of the National Academy of Sciences of the United States of America 115: E2274–E2283.

Nakazato, T., M. S. Barker, L. H. Rieseberg, and G. J. Gastony. 2008. Evolution of the nuclear genome of ferns and lycophytes. Biology and evolution of ferns and lycophytes, Cambridge University Press.

Nakazato, T., M.-K. Jung, E. A. Housworth, L. H. Rieseberg, and G. J. Gastony. 2006. Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii. Genetics 173: 1585–1597.

One Thousand Plant Transcriptomes Initiative. 2019. One thousand plant transcriptomes and the phylogenomics of green plants. Nature 574: 679–685.

Otto, S. P., and J. Whitton. 2000. Polyploid incidence and evolution. Annual review of genetics 34: 401–437.

Paradis, E., and K. Schliep. 2019. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35: 526–528.

Pichersky, E., D. Soltis, and P. Soltis. 1990. Defective chlorophyll a/b-binding protein genes in the genome of a homosporous fern. Proceedings of the National Academy of Sciences of the United States of America 87: 195–199.

Rabier, C.-E., T. Ta, and C. Ané. 2014. Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Molecular biology and evolution 31: 750–762.

Rothfels, C. J., F.-W. Li, E. M. Sigel, L. Huiet, A. Larsson, D. O. Burge, M. Ruhsam, et al. 2015. The evolutionary history of ferns inferred from 25 low-copy nuclear genes. American journal of botany 102: 1089–1107.

Scannell, D. R., A. C. Frank, G. C. Conant, K. P. Byrne, M. Woolfit, and K. H. Wolfe. 2007. Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proceedings of the National Academy of Sciences of the United States of America 104: 8397–8402.

Schneider, H., E. Schuettpelz, K. M. Pryer, R. Cranfill, S. Magallón, and R. Lupia. 2004. Ferns diversified in the shadow of angiosperms. Nature 428: 553–557.

Schranz, M. E., and T. Mitchell-Olds. 2006. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. The Plant cell 18: 1152–1165.

Sjöstrand, J., L. Arvestad, J. Lagergren, and B. Sennblad. 2013. GenPhyloData: realistic simulation of gene family evolution. BMC bioinformatics 14: 209.

Smith, S. A., M. J. Moore, J. W. Brown, and Y. Yang. 2015. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants. BMC evolutionary biology 15: 150.

Soltis, D. E., C. J. Visger, D. B. Marchant, and P. S. Soltis. 2016. Polyploidy: Pitfalls and paths to a paradigm. American journal of botany 103: 1146–1166.

Stamatakis, A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688–2690.

Stebbins, G. L. 1971. Chromosomal evolution in higher plants.

Sukumaran, J., and M. T. Holder. 2010. DendroPy: a Python library for phylogenetic computing. Bioinformatics 26: 1569–1571.

Swarbreck, D., C. Wilks, P. Lamesch, T. Z. Berardini, M. Garcia-Hernandez, H. Foerster, D. Li, et al. 2007. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic acids research 36: D1009–D1014.

R Core Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Tiley, G. P., C. Ané, and J. G. Burleigh. 2016. Evaluating and characterizing ancient whole-genome duplications in plants with gene count data. Genome biology and evolution 8: 1023–1037.

Van de Peer, Y., S. Maere, and A. Meyer. 2009. The evolutionary significance of ancient genome duplications. Nature reviews. Genetics 10: 725–732.

Van de Peer, Y., E. Mizrachi, and K. Marchal. 2017. The evolutionary significance of polyploidy. Nature reviews. Genetics 18: 411–424.

Vanneste, K., L. Sterck, A. A. Myburg, Y. Van de Peer, and E. Mizrachi. 2015. Horsetails are ancient polyploids: evidence from Equisetum giganteum. The Plant cell 27: 1567–1578.

Wagner, W. H., and F. S. Wagner. 1980. Polyploidy in Pteridophytes. In W. H. Lewis [ed.], Polyploidy: Biological Relevance, 199–214. Springer US, Boston, MA.

Wendel, J. F. 2015. The wondrous cycles of polyploidy in plants. American journal of botany 102: 1753–1756.

Wheeler, T. J., and J. D. Kececioglu. 2007. Multiple alignment by aligning alignments. Bioinformatics 23: i559–i568.

Wolf, P. G., C. H. Haufler, and E. Sheffield. 1987. Electrophoretic evidence for genetic diploidy in the Bracken Fern (Pteridium aquilinum). Science 236: 947–949.

Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H. Rieseberg. 2009. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America 106: 13875–13879.

Yang, Y., M. J. Moore, S. F. Brockington, D. E. Soltis, G. K.-S. Wong, E. J. Carpenter, Y. Zhang, et al. 2015. Dissecting molecular evolution in the highly diverse plant clade Caryophyllales using transcriptome sequencing. Molecular biology and evolution 32: 2001–2014.

Yang, Y., and S. A. Smith. 2013. Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC genomics 14: 328.

Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution 24: 1586–1591.

Yant, L., J. D. Hollister, K. M. Wright, B. J. Arnold, J. D. Higgins, F. C. H. Franklin, and K. Bomblies. 2013. Meiotic adaptation to genome duplication in Arabidopsis arenosa. Current biology: CB 23: 2151–2156.

Acknowledgements

We thank Y. Li for providing the illustrations of ferns. We also thank C. Román-Palacios and members of the Barker lab for discussion. Funding:

Author contributions

Z.L., S.H.Z., and M.S.B. designed research; Z.L., S.H.Z., and M.S.B. performed research and analyzed data; Z.L. and M.S.B. wrote the paper.

Bitbucket repository

Two different versions of the Ks plots (same format as the 1KP).


140 histograms of the age distribution of gene duplications (Ks plots), ranging from Ks 0 to 2. Detailed information on the mixture model for each Ks plot is provided in Supplementary Table 1.

140 histograms of the age distribution of gene duplications (Ks plots), ranging from Ks 0 to 5. Detailed information on the mixture model for each Ks plot is provided in Supplementary Table 1.

Appendix C: Figures

Fig. 1 Phylogenetic placement of ancient WGDs inferred in the phylogeny of monilophytes. Yellow stars represent WGDs inferred from Ks plots and synonymous ortholog divergence analyses. Red stars represent WGDs inferred by Ks plots, synonymous ortholog divergence analyses, and MAPS analyses. Blue stars represent WGDs inferred in outgroups. The phylogeny is modified from Testo and Sundue (2016). Illustrations of ferns are credited to artist Yifan Li.


Fig. 2 The number of inferred ancient WGDs in vascular plants. a, The number of inferred ancient polyploidization events within each lineage is shown in the violin plots. The sample sizes for angiosperms, ferns, gymnosperms, and lycophytes are 674, 81, 133, and 21, respectively. b, The rate of inferred ancient polyploidization events within each lineage is shown in the violin plots. The rate of inferred WGDs is estimated as the number of inferred WGDs divided by the minimum crown group age of each lineage. The minimum crown group ages used for angiosperms, ferns, gymnosperms, and lycophytes are 197.5, 384.9, 308.4, and 392.8 million years, respectively. The black dot indicates the mean, the thick black bars represent the standard deviation of the data, and the color shading represents the density of data points. Comparisons of the mean of ferns to the other lineages by the two-sample Mann-Whitney U test are provided. ‘***’ represents p < 0.0001.
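The rate calculation described for panel b can be sketched as follows. This is a minimal illustration rather than the analysis code used for the figure: the crown group ages are those quoted in the caption, while the function name `wgd_rate` and the example count of four WGDs are assumptions made here for demonstration.

```python
# Minimum crown group ages (million years), as quoted in the Fig. 2 caption.
CROWN_AGE_MY = {
    "angiosperms": 197.5,
    "ferns": 384.9,
    "gymnosperms": 308.4,
    "lycophytes": 392.8,
}


def wgd_rate(n_wgds: int, lineage: str) -> float:
    """Inferred WGDs per million years: count divided by minimum crown group age."""
    return n_wgds / CROWN_AGE_MY[lineage]


# Illustrative count only (hypothetical input, not a value taken from the figure).
fern_rate = wgd_rate(4, "ferns")
angiosperm_rate = wgd_rate(4, "angiosperms")
print(f"ferns: {fern_rate:.5f} WGDs per million years")
print(f"angiosperms: {angiosperm_rate:.5f} WGDs per million years")
```

Because the denominator is the crown group age, the same number of inferred WGDs yields a lower rate in an older lineage such as ferns than in a younger one such as angiosperms.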


Fig. 3 Rates of ascending dysploidy, descending dysploidy, and polyploidy in angiosperms, gymnosperms, and ferns. The rate distributions of ascending and descending dysploidy and polyploidy of angiosperm orders are shown in the violin plots. The mean angiosperm rate and the rates of gymnosperms and ferns are shown as circles (red, angiosperms; blue, gymnosperms; green, ferns). The y-axis represents the rate of dysploidy and polyploidy (per one million years).

Fig. 4 Patterns of inferred ancestral chromosome number in angiosperms, gymnosperms, and ferns. The x-axis represents the root-to-internal node distance on the phylogenies analyzed using ChromEvol. The y-axis shows the ancestral chromosome numbers taken from the ChromEvol results. a. 2,352 data points show the pattern in angiosperms. b. 51 data points show the pattern in gymnosperms. c. 199 data points show the pattern in ferns.

Fig. 5 The rates of protein evolution of meiosis-related genes in angiosperms and ferns. The rates of protein evolution of meiosis-related genes and randomly selected genes of angiosperms and ferns are shown in the violin plots. The black dot indicates the mean, the thick black bars represent the standard deviation of the data, and the color shading represents the density of data points. In total, 129 and 720 gene family phylogenies were constructed for meiosis-related genes and randomly selected genes in angiosperms and ferns.

Fig. 6 Principal component analysis of the GO category composition of all genes in each genome/transcriptome and WGD paralogs. Red circles indicate the number of genes annotated to each GO category in the whole transcriptomes; black circles indicate the number of WGD paralogs annotated to each GO category. Ellipses represent the 95% confidence interval of SD of point scores.

Appendix C: Supplementary Information

Supplemental Tables:

Supplementary Table 1 Summary of Ks distributions of duplicate gene pairs for each species. Species are organized in alphabetical order. Taxa and median Ks for each histogram of the age distribution of gene duplications, and p-values of the two-sided K-S goodness of fit test, are provided for each taxon. The number of inferred WGDs in the ancestry of each species is also reported. The phylogenetic placement of each inferred ancient WGD is provided in Figure 1 and Supplementary Fig. 1.

# | Species | Code | K-S test p-value | Median Ks 1 | WGD 1 | Median Ks 2 | WGD 2 | Median Ks 3 | WGD 3 | #WGDs in history
1 | Acrostichum aureum | RS_90 | 0 | 2.0133 | PTERα | | | | | 4
2 | Acystopteris japonica | RS_52 | 0 | 1.2098 | PTERα | | | | | 4
3 | Adiantum aleuticum | WCLG | 0 | 1.5717 | PTERα | 3.5163 | CYATγ | | | 4
4 | Adiantum caudatum | RS_91 | 0 | 2.0689 | PTERα | | | | | 4
5 | Adiantum raddianum | BMJR | 0 | 0.5484 | ADIAα | 1.8217 | PTERα | | | 5
6 | Aleuritopteris chrysophylla | RS_72 | 0 | 1.6663 | PTERα | | | | | 4
7 | Alsophila pinulosa | GANB | 0 | 0.2867 | CYATα | 0.9811 | CYATβ | | | 5
8 | Alsophila podophylla | RS_34 | 0 | 0.2092 | CYATα | 0.9812 | CYATβ | | | 5
9 | Anemia tomentosa | CQPW | 0 | 1.2433 | SCDIβ | | | | | 3
10 | Angiopteris fokiensis | RS_48 | 0 | 0.69988 | MARAα | 2.4275 | OSMNβ | | | 2
11 | Antrophyum callifolium | RS_10 | 0 | 0.459 | VILIα | 1.4793 | PTERα | | | 5
12 | Arachniodes nigrospinosa | RS_56 | 0 | 1.2066 | PTERα | | | | | 4
13 | Argyrochosma nivea | XDDT | 0 | 1.8012 | PTERα | | | | | 4
14 | palisotii | RS_123 | 0 | 1.2061 | PTERα | | | | | 4
15 | Asplenium x lucrosum | BMIF | 0 | 0.3431 | ASNIα | 1.4513 | PTERα | 3.013 | CYATγ | 5
16 | Asplenium formosae | RS_25 | 0 | 0.2628 | ASNIα | 1.3037 | PTERα | | | 5
17 | Asplenium nidus | PSKY | 0 | 0.4116 | ASNIα | 2.3458 | PTERα | | | 5
18 | Asplenium platyneuron | KJZG | 0 | 1.6945 | PTERα | 3.6058 | CYATγ | | | 4
19 | Athyrium filix-femina | URCP | 0 | 1.1495 | PTERα | 2.3847 | CYATγ | | | 4
20 | Athyrium sp. | AFPO | 0 | 1.298 | PTERα | 3.0318 | CYATγ | | | 4
21 | Azolla caroliniana | CVEG | 0 | 0.9755 | AZOLα | 2.005 | CYATγ | | | 4
22 | Azolla pinnata | RS_112 | 0 | 0.991 | AZOLα | 2.544 | CYATγ | | | 4
23 | Blechnum spicant | VITX | 0 | 1.1821 | PTERα | 3.0759 | CYATγ | | | 4
24 | Bolbitis appendiculata | RS_16 | 0 | 1.3205 | PTERα | | | | | 4
25 | Bolbitis repanda | JBLI | 0 | 1.6118 | PTERα | 3.7766 | CYATγ | | | 4
26 | Botrychium japonicum | RS_119 | 0 | 0.2052 | SCEPα | 0.9775 | OPHIα | | | 3
27 | Botrypus virginianus | BEGM | 0 | 0.2315 | SCEPα | 0.8883 | OPHIα | | | 3
28 | Ceratopteris thalictroides | RS_98 | 0 | 1.1561 | CERAα | 3.0559 | PTERα | | | 5
 | | PIVW | 0 | 1.0793 | CERAα | 3.0713 | PTERα | | | 5
29 | chusana | RS_69 | 0 | 1.9828 | PTERα | | | | | 4
30 | Cheiropleuria bicuspis | RS_28 | 0 | 0.8963 | HYMEα | | | | | 2
31 | Cibotium barometz | RS_37 | 0 | 0.1909 | CYATα | 1.0157 | CYATβ | | | 5
32 | Crepidomanes venosum | TWFZ | 0 | ~1.17 | HYMEα | ~3.40 | OSMNβ | | | 2
33 | Cryptogramma acrostichoides | WQML | 0 | 1.6238 | PTERα | | | | | 4
34 | macrocarpa | PNZO | 0 | 1.0192 | CYATβ | | | | | 4
35 | Cyclopeltis crenata | RS_24 | 0 | 1.492 | PTERα | | | | | 4
36 | Cystopteris fragilis | XXHP | 0 | 1.2237 | PTERα | 2.8767 | CYATγ | | | 4
 | | LHLE | 0 | 1.3209 | PTERα | 3.4071 | CYATγ | | | 4
37 | Cystopteris protrusa | YOWV | 0 | 1.3669 | PTERα | 2.8685 | CYATγ | | | 4
38 | Cystopteris reevesiana | RICC | 0 | 1.4523 | PTERα | 3.1656 | CYATγ | | | 4
39 | Cystopteris utahensis | HNDZ | 0 | 1.2225 | PTERα | 2.5857 | CYATγ | | | 4
40 | Danaea nodosa | DFHO | 0 | 0.9452 | MARAα | 2.3841 | OSMNβ | | | 2
41 | | OQWW | 0 | 1.1192 | PTERα | 2.6636 | CYATγ | | | 4
42 | Dennstaedtia davallioides | MTGC | 0 | 1.3277 | PTERα | | | | | 4
43 | Dennstaedtia hirsuta | RS_50 | 0 | 1.0936 | PTERα | | | | | 4
44 | Dennstaedtia scabra | RS_54 | 0 | 1.2223 | PTERα | | | | | 4
45 | Deparia lobato-crenata | FCHS | 0 | 1.2166 | PTERα | 2.5344 | CYATγ | | | 4
46 | Dicksonia antarctica | RS_43 | 0 | 0.179 | CYATα | 0.954 | CYATβ | | | 5
47 | Dicranopteris pedata | RS_18 | 0 | 0.8869 | HYMEα | | | | | 2
48 | Didymochlaena truncatula | RFRB | 0 | 1.3948 | PTERα | 2.941 | CYATγ | | | 4
49 | Diplaziopsis brunoniana | RS_5 | 0 | 1.3239 | PTERα | | | | | 4
50 | Diplazium viridescens | RS_14 | 0 | 1.0335 | PTERα | | | | | 4
51 | Diplazium wichurae | UFJN | 0 | 1.1805 | PTERα | 2.7517 | CYATγ | | | 4
52 | Dipteris conjugata | MEKP | 0 | 0.8265 | HYMEα | 2.4868 | OSMNβ | | | 2
53 | Drynaria bonii | RS_46 | 0 | 1.4281 | PTERα | | | | | 4
54 | Dryopteris pseudocaenopteris | RS_17 | 0 | 1.8307 | PTERα | | | | | 4
55 | Elaphoglossum mcclurei | RS_7 | 0 | 1.7025 | PTERα | | | | | 4
56 | Equisetum diffusum | RS_107 | 0 | 0.7153 | EQUIα | 2.2282 | OSMNβ | | | 2
 | | CAPN | 0 | 0.8026 | EQUIα | 3.2071 | OSMNβ | | | 2
57 | Equisetum hyemale | JVSZ | 0 | 0.7213 | EQUIα | ~2.5 | OSMNβ | | | 2
58 | Goniophlebium niponicum | RS_122 | 0 | 1.0895 | PTERα | | | | | 4
59 | Gymnocarpium dryopteris | HEGQ | 0 | 1.2009 | PTERα | 2.276 | CYATγ | | | 4
60 | dareiformis | RS_115 | 0 | 1.3762 | PTERα | | | | | 4
61 | Haplopteris amboinensis | RS_19 | 0 | 0.47 | VILIα | 1.6387 | PTERα | | | 5
62 | Histiopteris incisa | RS_35 | 0 | 1.0978 | PTERα | | | | | 4
63 | Homalosorus pycnocarpos | OCZL | 0 | 1.2786 | PTERα | 2.8461 | CYATγ | | | 4
64 | Humata repens | RS_8 | 0 | 1.3546 | PTERα | | | | | 4
65 | Hymenophyllum bivalve | QIAD | 0 | 0.6658 | HYMEα | 2.1337 | OSMNβ | | | 2
66 | Hymenophyllum cupressiforme | TRPJ | 0 | 0.8097 | HYMEα | 2.4793 | OSMNβ | | | 2
67 | crenatum | RS_89 | 0 | 1.167 | PTERα | | | | | 4
68 | Hypolepis punctata | RS_42 | 0 | 1.0667 | PTERα | | | | | 4
69 | immersa | WGTU | 0 | 1.368 | PTERα | 3.0689 | CYATγ | | | 4
70 | linearis | NOKI | 0 | 0.7538 | LINDα | 2.018 | CYATγ | | | 4
71 | Lindsaea microphylla | YIXP | 0 | 0.8 | LINDα | 1.7164 | CYATγ | | | 4
72 | Lomagramma matthewii | RS_70 | 0 | 1.2786 | PTERα | | | | | 4
73 | Lomariopsis spectabilis | RS_27 | 0 | 1.9528 | PTERα | | | | | 4
74 | Lonchitis hirsuta | VVRN | 0 | 0.4088 | LONCα | 1.3234 | CYATγ | | | 4
75 | Loxogramme chinensis | RS_39 | 0 | 1.9321 | PTERα | | | | | 4
76 | Lygodium flexuosum | RS_88 | 0 | 1.186 | LYGOα | | | | | 3
77 | Lygodium japonicum | PBUU | 0 | ~1.6 | LYGOα | | | | | 3
78 | Marattia attenuata | UGNK | 0 | 0.9452 | MARAα | 1.9495 | OSMNβ | | | 2
79 | Marsilea quadrifolia | RS_77 | 0 | 2.141 | CYATγ | | | | | 3
80 | Matteuccia struthiopteris | RS_124 | 0 | 1.1998 | PTERα | | | | | 4
81 | Microlepia hookeriana | RS_4 | 0 | 1.226 | PTERα | | | | | 4
82 | Microlepia platyphylla | RS_86 | 0 | 1.1322 | PTERα | | | | | 4
83 | Microlepia speluncae | RS_93 | 0 | 1.097 | PTERα | | | | | 4
84 | Monachosorum henryi | RS_51 | 0 | 1.2372 | PTERα | | | | | 4
85 | Monachosorum maximowiczii | RS_53 | 0 | 1.1895 | PTERα | | | | | 4
86 | Myriopteris rufa | GSXD | 0 | 1.6012 | PTERα | 3.5507 | CYATγ | | | 4
87 | cordifolia | RS_85 | 0 | 1.2528 | PTERα | | | | | 4
88 | | NWWI | 0 | 1.8836 | PTERα | | | | | 4
89 | Notholaena montieliae | YCKE | 0 | ~1.9 | PTERα | | | | | 4
90 | musifolia | RS_101 | 0 | 1.57435 | PTERα | | | | | 4
91 | Onoclea sensibilis | HTFH | 0 | 1.4917 | PTERα | 3.4883 | CYATγ | | | 4
92 | Ophioglossum petiolatum | DJSE | 0 | 0.9902 | OPHIα | 2.5839 | OSMNβ | | | 2
93 | Ophioglossum vulgatum | RS_84 | 0 | 0.9959 | OPHIα | | | | | 2
94 | Oreogrammitis dorsipila | RS_108 | 0 | 1.4829 | PTERα | | | | | 4
95 | Osmolindsaea odorata | RS_71 | 0 | 0.1743 | OSODα | 0.9917 | LINDα | | | 5
96 | Osmunda japonica | RS_38 | 0 | 0.80027 | OSMNα | 2.4725 | OSMNβ | | | 2
97 | Osmunda javanica | VIBO | 0 | 0.8161 | OSMNα | 2.4526 | OSMNβ | | | 2
98 | Osmundastrum cinnamomeum | RFMZ | 0 | 0.7929 | OSMNα | 2.1954 | OSMNβ | | | 2
99 | Parahemionitis cordata | RS_92 | 0 | 2.0045 | PTERα | | | | | 4
 | | ZXJO | 0 | 2.2364 | PTERα | | | | | 4
100 | Phlebodium pseudoaureum | ZQYU | 0 | 1.4523 | PTERα | 3.5508 | CYATγ | | | 4
101 | Phymatosorus grossus | ORJE | 0 | 1.2754 | PTERα | 2.7596 | CYATγ | | | 4
102 | Pilularia globulifera | KIIX | 0 | 1.9545 | CYATγ | | | | | 3
103 | Pityrogramma trifoliata | UJTT | 0 | 2.144 | PTERα | | | | | 4
104 | Plagiogyria japonica | RS_31 | 0 | 0.9377 | CYATβ | | | | | 4
 | | UWOD | 0 | 0.9811 | CYATβ | | | | | 4
105 | Platycerium bifurcatum | RS_47 | 0 | 2.0054 | PTERα | | | | | 4
106 | Pleopeltis polypodioides | UJWU | 0 | 2.0471 | PTERα | | | | | 4
107 | Pleurosoriopsis makinoi | RS_111 | 0 | 1.2832 | PTERα | | | | | 4
108 | Polypodium amorphum | YLJA | 0 | 1.3498 | PTERα | 3.0698 | CYATγ | | | 4
109 | Polypodium glycyrrhiza | CJNT | 0 | 1.3375 | PTERα | 3.0142 | CYATγ | | | 4
110 | Polypodium hesperium | ZRAV | 0 | 1.2104 | PTERα | 2.4599 | CYATγ | | | 4
111 | Polystichum acrostichoides | FQGQ | 0 | 1.3947 | PTERα | 3.5245 | CYATγ | | | 4
112 | Pronephrium simplex | RS_1 | 0 | 1.2082 | PTERα | | | | | 4
113 | Psilotum nudum | RS_21 | 0 | 0.2383 | PSILα | 0.6902 | PSILβ | 2.3618 | OSMNβ | 3
 | | QVMR | 0 | 0.359 | PSILα | 1.0295 | PSILβ | 2.6227 | OSMNβ | 3
114 | Pteridium aquilinum | RS_41 | 0 | 1.1259 | PTERα | | | | | 4
115 | Pteris ensiformis | FLTD | 0 | 2.1841 | PTERα | | | | | 4
116 | Pteris vittata | RS_36 | 0 | ~2.0 | PTERα | | | | | 4
 | | POPJ | 0 | 2.0878 | PTERα | | | | | 4
117 | Rhachidosorus mesosorus | RS_45 | 0 | 1.1196 | PTERα | | | | | 4
118 | Salvinia natans | RS_127 | 0 | 1.402 | CYATγ | | | | | 3
119 | Sceptridium dissectum | EEAQ | 0 | 0.283 | SCEPα | 0.9975 | OPHIα | 2.6911 | OSMNβ | 3
120 | Schizaea dichotoma | RS_116 | 0 | 0.277 | SCDIα | 0.8587 | SCDIβ | | | 4
121 | Stenochlaena palustris | RS_97 | 0 | 0.2477 | STPAα | 1.2626 | PTERα | | | 5
122 | Sticherus lobatus | XDVM | 0 | 0.7788 | HYMEα | 2.4273 | OSMNβ | | | 2
123 | Taenitis blechnoides | RS_114 | 0 | 0.3221 | TABLα | 2.0212 | PTERα | | | 5
124 | Tectaria subpedata | RS_81 | 0 | 1.3618 | PTERα | | | | | 4
125 | Thelypteris acuminata | MROH | 0 | 1.2722 | PTERα | 3.0494 | CYATγ | | | 4
126 | Thyrsopteris elegans | EWXK | 0 | 0.937 | CYATβ | | | | | 4
127 | Tmesipteris parva | ALVQ | 0 | 0.2419 | PSILα | 0.7284 | PSILβ | 2.6967 | OSMNβ | 3
128 | Vandenboschia striata | RS_11 | 0 | 0.9722 | HYMEα | 2.41 | OSMNβ | | | 2
129 | Vittaria lineata | SKYV | 0 | 0.5627 | VILIα | 2.1918 | PTERα | | | 5
130 | Woodsia ilvensis | YQEC | 0 | 1.487 | PTERα | 3.3595 | CYATγ | | | 4
131 | Woodsia polystichoides | RS_103 | 0 | 1.3217 | PTERα | | | | | 4
132 | Woodsia scopulina | YJJY | 0 | 1.2684 | PTERα | 2.6918 | CYATγ | | | 4
133 | Woodwardia prolifera | RS_128 | 0 | 1.0722 | PTERα | | | | | 4

Supplementary Table 2 Summary of the synonymous ortholog divergence analyses.

The mean, median, and standard deviation of synonymous ortholog divergence (Ks) for each inferred ancient WGD is reported. These statistics are based on the number of ortholog pairs identified for each species pair comparison. The sampling information with taxon code and WGD code is also reported.

Index # | Taxon 1 | Code | Taxon 2 | Code | WGD code | Mean | Median | SD | Ks_Ortholog_Divergence_Min | Ks_Ortholog_Divergence_Max
1 | Asplenium formosae | RS_25 | Asplenium x lucrosum | BMIF | ASNIα | 0.1774 | 0.152 | 0.0775 | 0.1 | 0.5
2 | Asplenium formosae | RS_25 | Asplenium platyneuron | KJZG | ASNIα | 0.2657 | 0.2452 | 0.0884 | 0.1 | 0.6
3 | Alsophila podophylla | RS_34 | Dicksonia antarctica | RS_43 | CYATα | 0.1965 | 0.1842 | 0.0619 | 0.1 | 0.4
4 | Cibotium barometz | RS_37 | Alsophila podophylla | RS_34 | CYATα | 0.2085 | 0.1962 | 0.0623 | 0.1 | 0.4
5 | Dicksonia antarctica | RS_43 | Cibotium barometz | RS_37 | CYATα | 0.1784 | 0.1637 | 0.0605 | 0.1 | 0.4
6 | Thyrsopteris elegans | EWXK | Cibotium barometz | RS_37 | CYATα | 0.2052 | 0.1895 | 0.0704 | 0.1 | 0.5
7 | Thyrsopteris elegans | EWXK | Alsophila podophylla | RS_34 | CYATα | 0.2128 | 0.1997 | 0.069 | 0.1 | 0.5
8 | Plagiogyria japonica | RS_31 | Dicksonia antarctica | RS_43 | CYATα | 0.2266 | 0.2111 | 0.0741 | 0.1 | 0.5
9 | Plagiogyria japonica | RS_31 | Alsophila podophylla | RS_34 | CYATα | 0.2505 | 0.2374 | 0.0761 | 0.1 | 0.5
10 | Osmolindsaea odorata | RS_71 | Lindsaea linearis | NOKI | LINDα, OSODα | 0.4096 | 0.3802 | 0.1433 | 0.1 | 0.9
11 | Osmolindsaea odorata | RS_71 | Lonchitis hirsuta | VVRN | LINDα, LONCα | 1.2128 | 1.1692 | 0.2924 | 0.5 | 2
12 | Anemia tomentosa | CQPW | Schizaea dichotoma | RS_116 | LYGOα, SCDIα, SCDIβ | 1.1036 | 1.0649 | 0.2526 | 0.5 | 1.8
13 | Lygodium flexuosum | RS_88 | Lygodium japonicum | PBUU | LYGOα | 0.1113 | 0.1024 | 0.0474 | 0.0001 | 0.3
14 | Lygodium flexuosum | RS_88 | Anemia tomentosa | CQPW | LYGOα, SCDIα, SCDIβ | 1.8341 | 1.7601 | 0.2642 | 1.5 | 2.5
15 | Lygodium japonicum | PBUU | Schizaea dichotoma | RS_116 | LYGOα, SCDIα, SCDIβ | 1.8247 | 1.7632 | 0.4621 | 1 | 3
16 | Lygodium japonicum | PBUU | Pilularia globulifera | KIIX | LYGOα | 2.0352 | 1.9901 | 0.4549 | 1 | 3
17 | Schizaea dichotoma | RS_116 | Pilularia globulifera | KIIX | LYGOα | 2.5558 | 2.5025 | 0.6818 | 1 | 4
18 | Ophioglossum petiolatum | DJSE | Ophioglossum vulgatum | RS_84 | OPHIα | 0.0383 | 0.0239 | 0.0431 | 0.0001 | 0.3
19 | Ophioglossum vulgatum | RS_84 | Psilotum nudum | RS_21 | OPHIα | 1.4085 | 1.384 | 0.2899 | 0.8 | 2
20 | Psilotum nudum | RS_21 | Botrychium japonicum | RS_119 | OPHIα | 0.9631 | 0.9341 | 0.2165 | 0.5 | 1.5
21 | Sceptridium dissectum | EEAQ | Botrychium japonicum | RS_119 | OPHIα, SCEPα | 0.0342 | 0.0187 | 0.0482 | 0.0001 | 0.3
22 | Ophioglossum vulgatum | RS_84 | Botrychium japonicum | RS_119 | OPHIα, SCEPα | 0.9722 | 0.8402 | 0.4404 | 0.4 | 3
23 | Botrypus virginianus | BEGM | Botrychium japonicum | RS_119 | OPHIα, SCEPα | 0.1523 | 0.1401 | 0.0604 | 0.0001 | 0.4
24 | Botrypus virginianus | BEGM | Sceptridium dissectum | EEAQ | OPHIα, SCEPα | 0.1567 | 0.1426 | 0.0642 | 0.0001 | 0.4
25 | Stenochlaena palustris | RS_97 | Blechnum spicant | VITX | STPAα | 0.2706 | 0.2564 | 0.0811 | 0.1 | 0.6
26 | Taenitis blechnoides | RS_114 | Pityrogramma trifoliata | UJTT | TABLα | 0.5773 | 0.5599 | 0.134 | 0.3 | 1
27 | Taenitis blechnoides | RS_114 | Pteris ensiformis | FLTD | TABLα | 0.531 | 0.5104 | 0.1331 | 0.2 | 1
28 | Haplopteris amboinensis | RS_19 | Vittaria lineata | SKYV | VILIα | 0.4915 | 0.4785 | 0.1095 | 0.2 | 0.8
29 | Antrophyum callifolium | RS_10 | Vittaria lineata | SKYV | VILIα | 0.4847 | 0.4731 | 0.1336 | 0.2 | 0.8
30 | Haplopteris amboinensis | RS_19 | Antrophyum callifolium | RS_10 | VILIα | 0.483 | 0.4629 | 0.116 | 0.2 | 0.9
31 | Antrophyum callifolium | RS_10 | Adiantum aleuticum | WCLG | VILIα | 0.6798 | 0.6499 | 0.1739 | 0.3 | 1.3
32 | Haplopteris amboinensis | RS_19 | Adiantum caudatum | RS_91 | VILIα | 0.838 | 0.8046 | 0.1977 | 0.4 | 1.5

Supplementary Table 3 Summary statistics and null simulations (no WGDs) for seven Multi-tAxon Paleopolyploidy Search (MAPS) analyses. For each node of the seven MAPS analyses, the percentage of subtrees with shared inferred gene duplications and the numbers of gene duplications in simulations without WGDs are reported, along with the p-value for a one-sided Fisher’s exact test used to detect nodes with a significantly higher proportion of inferred gene duplications compared to the null distribution. An * indicates a significant node.
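The two quantities reported per node can be illustrated with a short sketch. It is not the MAPS implementation: the function names are introduced here for illustration, the observed counts are taken from node N2 of the ASNIα analysis in this table, and the null-simulation counts are hypothetical placeholders because the simulated counts are not listed in this appendix.

```python
from math import comb


def percent_shared(n_with: int, n_without: int) -> float:
    """Percentage of subtrees that share an inferred duplication at a node."""
    return 100.0 * n_with / (n_with + n_without)


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Observed counts for node N2 of the ASNIα analysis (Supplementary Table 3).
obs_with, obs_without = 1456, 1255
print(round(percent_shared(obs_with, obs_without), 2))

# Hypothetical null-simulation counts, chosen only to make the sketch runnable.
sim_with, sim_without = 300, 2700
p_value = fisher_exact_greater(obs_with, obs_without, sim_with, sim_without)
print(p_value < 0.05)
```

Exact integer binomial coefficients are used here so that the sketch needs only the standard library; a statistics package would normally be preferred for routine use.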


Index # | MAPS Index | Total nuclear gene family phylogenies # | Code | Species | Node | WGD ID | % Subtrees with shared duplication | # Subtrees without shared duplication | # Subtrees with shared duplication | Total subtrees # | p-value
1 | ASNIα | 5676 | RS_25 (afo) | Asplenium formosae | | | | | | |
 | | | PSKY | Asplenium nidus | N1 | | 4.89% | 1012 | 52 | 1064 | 1
 | | | BMIF | Asplenium cf. x lucrosum | N2* | ASNIα | 53.71% | 1255 | 1456 | 2711 | 5.8E−79
 | | | KJZG | Asplenium platyneuron | N3 | | 13.21% | 1156 | 176 | 1332 | 0.99985
 | | | YJJY | Woodsia scopulina | N4 | | 4.49% | 723 | 34 | 757 | 0.58065
 | | | FQGQ | Polystichum acrostichoides | | | | | | |
2 | HYMEα | 4911 | QIAD | Hymenophyllum bivalve | | | | | | |
 | | | RS_11 (vst) | Vandenboschia striata | N1 | | 9.38% | 2513 | 260 | 2773 | 0.9034
 | | | RS_18 (dpe) | Dicranopteris pedata | N2* | HYMEα | 18.37% | 1768 | 398 | 2166 | 1.2E−30
 | | | VIBO | Osmunda javanica | N3* | | 6.70% | 1058 | 76 | 1134 | 7.4E−05
 | | | UGNK | Marattia attenuata | N4 | | 13.25% | 655 | 100 | 755 | 0.97072
 | | | QVMR | Psilotum nudum | | | | | | |
3 | LINDα | 6626 | NOKI | Lindsaea linearis | | | | | | |
 | | | YIXP | Lindsaea microphylla | N1 | | 8.43% | 4054 | 373 | 4427 | 0.99483
 | | | RS_71 (ood) | Osmolindsaea odorata | N2 | LINDα | 19.80% | 1284 | 317 | 1601 | 0.63481
 | | | VVRN | Lonchitis hirsuta | N3 | | 1.71% | 1033 | 18 | 1051 | 0.99427
 | | | POPJ | Pteris vittata | N4 | | 16.01% | 1637 | 312 | 1949 | 1.3E−05
 | | | PNZO | Culcita macrocarpa | | | | | | |
4 | SCDIβ & LYGOα | 6575 | CQPW | Anemia tomentosa | | | | | | |
 | | | RS_11 (sdi) | Schizaea dichotoma | N1 | | 12.02% | 2495 | 341 | 2836 | 0.96024
 | | | RS_88 (lfl) | Lygodium flexuosum | N2* | | 14.15% | 1741 | 287 | 2028 | 5.7E−12
 | | | KIIX | Pilularia globulifera | N3* | | 14.20% | 1003 | 166 | 1169 | 2.6E−11
 | | | RS_28 (cbi) | Cheiropleuria bicuspis | N4 | | 9.60% | 1102 | 117 | 1219 | 0.71292
 | | | VIBO | Osmunda javanica | | | | | | |
5 | SCEPα | 5201 | RS_119 (bja) | Botrychium japonicum | | | | | | |
 | | | EEAQ | Sceptridium dissectum | N1* | | 20.20% | 2094 | 530 | 2624 | 1.6E−14
 | | | BEGM | Botrypus virginianus | N2* | SCEPα | 23.72% | 1225 | 381 | 1606 | 0.00349
 | | | DJSE | Ophioglossum petiolatum | N3 | | 10.38% | 881 | 102 | 983 | 1
 | | | QVMR | Psilotum nudum | N4* | | 13.57% | 809 | 127 | 936 | 0.00936
 | | | VIBO | Osmunda javanica | | | | | | |
6 | VILIα | 6892 | RS_10 (aca) | Antrophyum callifolium | | | | | | |
 | | | SKYV | Vittaria lineata | N1* | | 6.50% | 1626 | 113 | 1739 | 0.00058
 | | | RS_19 (ham) | Haplopteris amboinensis | N2* | VILIα | 51.98% | 593 | 642 | 1235 | 1.1E−52
 | | | WCLG | Adiantum aleuticum | N3 | | 0.99% | 896 | 9 | 905 | 0.99991
 | | | GSXD | Myriopteris rufa | N4 | | 8.84% | 1876 | 182 | 2058 | 0.99999
 | | | FLTD | Pteris ensiformis | | | | | | |
7 | CYATα | 6008 | GANB | Alsophila pinulosa | | | | | | |
 | | | RS_37 (cba) | Cibotium barometz | N1* | | 5.63% | 536 | 32 | 568 | 4.4E−06
 | | | RS_43 (dan) | Dicksonia antarctica | N2* | CYATα | 38.58% | 589 | 370 | 959 | 9.2E−43
 | | | EWXK | Thyrsopteris elegans | N3* | | 9.76% | 610 | 66 | 676 | 0.00121
 | | | PNZO | Culcita macrocarpa | N4 | | 22.38% | 912 | 263 | 1175 | 0.95828
 | | | VVRN | Lonchitis hirsuta | | | | | | |

Supplementary Table 4 Summary statistics and positive simulations (with WGDs) for six MAPS analyses. As with Supplementary Table 3, for each lineage in each MAPS tree, the percentage of subtrees with shared inferred gene duplications is reported along with expectations based on simulations with 20% paralog retention following WGDs. This table contains the p-value for a one-sided Fisher’s exact test used to detect nodes with a significantly lower proportion of mapping subtrees compared to our simulation. An * indicates the placement of the putative ancient WGD.


Index # | MAPS Index | Total nuclear gene family phylogenies # | Code | Species | Node | WGD ID | % Subtrees with shared duplication | # Subtrees without shared duplication | # Subtrees with shared duplication | Total subtrees # | p-value
1 | ASNIα | 5676 | RS_25 (afo) | Asplenium formosae | | | | | | |
 | | | PSKY | Asplenium nidus | N1 | | 4.89% | 1012 | 52 | 1064 | NA
 | | | BMIF | Asplenium cf. x lucrosum | N2* | ASNIα | 53.71% | 1255 | 1456 | 2711 | 1
 | | | KJZG | Asplenium platyneuron | N3 | | 13.21% | 1156 | 176 | 1332 | NA
 | | | YJJY | Woodsia scopulina | N4 | | 4.49% | 723 | 34 | 757 | NA
 | | | FQGQ | Polystichum acrostichoides | | | | | | |
2 | HYMEα | 4911 | QIAD | Hymenophyllum bivalve | | | | | | |
 | | | RS_11 (vst) | Vandenboschia striata | N1 | | 9.38% | 2513 | 260 | 2773 | NA
 | | | RS_18 (dpe) | Dicranopteris pedata | N2* | HYMEα | 18.37% | 1768 | 398 | 2166 | 0.016
 | | | VIBO | Osmunda javanica | N3 | | 6.70% | 1058 | 76 | 1134 | 5.9E−22
 | | | UGNK | Marattia attenuata | N4 | | 13.25% | 655 | 100 | 755 | NA
 | | | QVMR | Psilotum nudum | | | | | | |
4 | SCDIβ & LYGOα | 6575 | CQPW | Anemia tomentosa | | | | | | |
 | | | RS_11 (sdi) | Schizaea dichotoma | N1 | | 12.02% | 2495 | 341 | 2836 | NA
 | | | RS_88 (lfl) | Lygodium flexuosum | N2 | | 14.15% | 1741 | 287 | 2028 | 1E−14
 | | | KIIX | Pilularia globulifera | N3 | | 14.20% | 1003 | 166 | 1169 | 9.6E−10
 | | | RS_28 (cbi) | Cheiropleuria bicuspis | N4 | | 9.60% | 1102 | 117 | 1219 | NA
 | | | VIBO | Osmunda javanica | | | | | | |
5 | SCEPα | 5201 | RS_119 (bja) | Botrychium japonicum | | | | | | |
 | | | EEAQ | Sceptridium dissectum | N1 | | 20.20% | 2094 | 530 | 2624 | 0.00039
 | | | BEGM | Botrypus virginianus | N2 | SCEPα | 23.72% | 1225 | 381 | 1606 | 6.7E−05
 | | | DJSE | Ophioglossum petiolatum | N3 | | 10.38% | 881 | 102 | 983 | NA
 | | | QVMR | Psilotum nudum | N4 | | 13.57% | 809 | 127 | 936 | 6.3E−05
 | | | VIBO | Osmunda javanica | | | | | | |
6 | VILIα | 6892 | RS_10 (aca) | Antrophyum callifolium | | | | | | |
 | | | SKYV | Vittaria lineata | N1 | | 6.50% | 1626 | 113 | 1739 | 4.1E−31
 | | | RS_19 (ham) | Haplopteris amboinensis | N2* | VILIα | 51.98% | 593 | 642 | 1235 | 1
 | | | WCLG | Adiantum aleuticum | N3 | | 0.99% | 896 | 9 | 905 | NA
 | | | GSXD | Myriopteris rufa | N4 | | 8.84% | 1876 | 182 | 2058 | NA
 | | | FLTD | Pteris ensiformis | | | | | | |
7 | CYATα | 6008 | GANB | Alsophila pinulosa | | | | | | |
 | | | RS_37 (cba) | Cibotium barometz | N1 | | 5.63% | 536 | 32 | 568 | 1.4E−19
 | | | RS_43 (dan) | Dicksonia antarctica | N2* | CYATα | 38.58% | 589 | 370 | 959 | 1
 | | | EWXK | Thyrsopteris elegans | N3 | | 9.76% | 610 | 66 | 676 | 2.6E−14
 | | | PNZO | Culcita macrocarpa | N4 | | 22.38% | 912 | 263 | 1175 | NA
 | | | VVRN | Lonchitis hirsuta | | | | | | |


Supplementary Table 5 Rates of gene duplication (λ) and gene loss (μ) used in null and positive simulations. Rates of gene duplication (λ) and gene loss (μ) were estimated using gene counts from OrthoFinder clusters associated with each MAPS analysis (Supplementary Tables 3-4). Values correspond to global Maximum Likelihood Estimates (MLEs) and mean rates for simulations. The prior mean is the mean of the geometric probability distribution applied to the root of each species tree for optimizing MLEs of λ and μ as well as simulating gene trees with and without WGDs.

Index # MAPS Index λ μ Prior Mean 1 ASNIa 0.00277 0.00663 1.38 2 HYMEa 0.00094 0.00091 1.58 3 LINDa 0.00157 0.00272 1.55 SCDIβ & 4 LYGOα 0.00124 0.00084 1.68 5 SCEPa 0.00170 0.00355 1.59 6 VILIa 0.00203 0.00285 1.38 7 CYATα 0.00131 0.00232 1.54
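To illustrate how rates like those in Supplementary Table 5 could feed a simulation, the following is a minimal sketch of a linear birth-death process for gene copy number along a single branch, with the root count drawn from a geometric distribution. It is not the simulation pipeline used for the MAPS analyses. The function names, the geometric parameterisation (mean = 1/p on the support 1, 2, ...), and the branch length of 100 time units are assumptions made only for illustration.

```python
import math
import random


def sample_root_count(prior_mean, rng):
    """Draw a root gene count from a geometric distribution on {1, 2, ...}.

    Assumes mean = 1 / p; this parameterisation is an illustrative assumption.
    """
    p = 1.0 / prior_mean
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1


def simulate_branch(n0, lam, mu, t, rng):
    """Gillespie simulation of gene copy number under per-copy
    duplication rate lam and loss rate mu along a branch of length t."""
    n, elapsed = n0, 0.0
    rate = lam + mu
    while n > 0 and rate > 0:
        elapsed += rng.expovariate(n * rate)  # waiting time to the next event
        if elapsed > t:
            break
        n += 1 if rng.random() < lam / rate else -1
    return n


# Example with the ASNIα row: λ = 0.00277, μ = 0.00663, prior mean = 1.38.
rng = random.Random(1)
counts = [
    simulate_branch(sample_root_count(1.38, rng), 0.00277, 0.00663, 100.0, rng)
    for _ in range(1000)
]
print(sum(counts) / len(counts))  # mean copy number at the end of the branch
```

Because μ exceeds λ in most rows of the table, a sketch like this tends to show copy number declining along long branches in the absence of a WGD.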


Supplementary Figures


Supplementary Fig. 1 Phylogenetic placement of ancient WGDs inferred in the phylogeny of ferns. Yellow stars represent WGDs inferred from Ks plots and synonymous ortholog divergence analyses. Red stars represent WGDs inferred by Ks plots, synonymous ortholog divergence analyses, and MAPS analyses. Blue stars represent WGDs inferred in outgroups. The WGD identification labels correspond to WGD ID codes in Supplementary Table 1.


Supplementary Fig. 2 GO annotations of whole transcriptomes and genes retained from ancient WGDs. Each column represents the annotated GO categories of pooled whole transcriptomes or genes retained in duplicate following ancient WGD from each analyzed species. Colors of the heatmap represent the percent of the transcriptome represented by a particular GO category. The overall ranking of GO category rows was determined by the ranking of GO annotations among the total pooled transcriptomes. Hierarchical clustering was used to organize the heatmap columns.

Supplementary Fig. 3 Pattern of gene retention and loss following ancient WGDs. Each column represents the annotated GO categories of paralogs retained following ancient WGDs from each analyzed species. The order of analyzed species (20 hexapods and one outgroup) is based on hierarchical clustering. The overall ranking of GO category rows was determined by the ranking of GO annotations among ancient WGD OSMNα in Osmundastrum cinnamomeum. Colored boxes indicate GO categories among putative WGD paralogs that were significantly over- (red) or under-retained (blue) relative to the pooled whole transcriptomes, as determined by residuals from chi-square tests. GO categories with gray boxes were not present among WGD paralogs in significantly different numbers relative to their frequency in the pooled whole transcriptomes.


APPENDIX D:

PATTERNS AND PROCESSES OF DIPLOIDIZATION IN LAND PLANTS

Authors

Zheng Li, Michael T. W. McKibben, Geoffrey S. Finch, Paul D. Blischak, Brittany L. Sutherland, Michael S. Barker

Corresponding Authors

Zheng Li ([email protected])

Michael S. Barker ([email protected])

Affiliation

Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721, USA

Keywords

Diploidization, polyploidy, genome evolution, plant genomics, genome fractionation, chromosome pairing

Abstract

Most land plants are now known to be ancient polyploids that have rediploidized. This process of diploidization involves many changes in genome organization that ultimately restore bivalent chromosome pairing and disomic inheritance, and resolve dosage and other issues caused by genome duplication. Here, we provide an overview of the variety of mechanisms involved in diploidization as well as new analyses of pairing behavior and variation in gene fractionation across land plants. Overall, we find that lineage- and WGD-specific attributes influence the evolutionary outcomes of WGD and the process of diploidization in plant genomes. Ultimately, many of the mechanisms and forces driving diploidization remain to be discovered. Future research that leverages variation in the patterns and processes of diploidization will be able to advance our understanding of plant genome evolution and unlock the mysteries of diploidization.

Introduction

A major insight from two decades of sequencing plant genomes is that most are not simply diploid, but diploidized paleopolyploid genomes. Although it has long been recognized that many contemporary plants are polyploids (Stebbins 1950; Wood et al. 2009; Mayrose et al. 2011; Barker, Arrigo, et al. 2016), or species with duplicated genomes, it required comparative genomic analyses to provide conclusive evidence that plants experienced cycles of polyploidy followed by diploidization (Wolfe 2001; Jiao et al. 2011; Arrigo and Barker 2012; Z. Li et al. 2015; Wendel 2015; Van de Peer et al. 2017; One Thousand Plant Transcriptomes Initiative 2019). Over the past century (Lutz 1907; Winge 1917; Barker, Husband, et al. 2016), we have learned a great deal about polyploidization, but we know comparatively little about the mechanisms and forces that drive diploidization (Wolfe 2001; Dodsworth et al. 2016). In the most basic sense, diploidization is the return of a polyploid genome to a diploid state (Wolfe 2001; Wendel 2015; Soltis et al. 2016; Mandáková and Lysak 2018). One of the earliest references to this sort of diploidization (the fungal literature used “diploidization” in a different manner; e.g., Fulton 1950) was by Stebbins (1947) in reference to a study by R. E. Clausen on pairing behavior in Nicotiana allopolyploids (Clausen 1941; Stebbins 1947; Stebbins 1950). The restoration of bivalent chromosome pairing behavior and associated diploid genetics is considered a key feature of diploidization. As recognized early on (Stebbins 1947), the characteristics of a given whole genome duplication (WGD) event impact the pairing behavior, genetics, and subsequent course of diploidization in a polyploid genome. Thus, not all polyploid species necessarily experience the same process of post-polyploid genome evolution and diploidization.

Although many mechanisms of genome evolution contribute to diploidization, it can be broadly described as involving two major processes: cytological diploidization and genic diploidization/fractionation (Ma and Gustafson 2005; Mandáková and Lysak 2018). Cytological diploidization occurs via chromosomal rearrangements, fission, fusion, and other large-scale chromosomal evolution events that produce significant changes in genome structure and eventually lead to diploid-like chromosome pairing behavior during meiosis (Ma and Gustafson 2005). During fractionation, many genes duplicated during the WGD event are lost, and only a subset of genes are retained as paralogs over time (Leitch and Leitch 2008; Freeling et al. 2012). These two processes may occur largely independently of each other and at different rates, yielding a diversity of genomes with different patterns of diploidization following polyploidy across lineages (Otto and Whitton 2000; Wolfe 2001; Mandáková, Li, et al. 2017).

In this review, we discuss the different aspects of diploidization and post-polyploid genome evolution. We largely focus on genome evolution in the land plants, but also compare their patterns and processes of diploidization to those in animals and other eukaryotes. We begin with an introduction to the nature of polyploidy and how it may affect chromosome pairing behavior during meiosis. This includes a new survey of the plant cytological literature to assess the distribution of bivalent pairing among contemporary polyploid species. In the following sections we describe the two main processes of diploidization, cytological and genic diploidization, and summarize current knowledge on the molecular mechanisms of these diploidization processes. We also review differences in the rate of diploidization in plants and present new analyses on the rates of gene loss across land plants. Finally, we highlight the growing importance of developing new models and simulations to rigorously test hypotheses on diploidization as we try to understand the ultimate question: why diploidize at all?

The Nature of Polyploidy and Chromosome Pairing Behavior

A key milestone during diploidization is establishing bivalent chromosome pairing during meiosis (Wolfe 2001). Bivalent pairing is important because it is a precursor to restoring diploid-like genetics with two alleles per locus (i.e., disomic inheritance). Although polyploids are often imagined to have multivalent pairing, many polyploid species actually have bivalent pairing at formation or evolve it quickly (Tayalé and Parisod 2013). Differences in pairing behavior are often used to distinguish the two major categories of polyploid species, allopolyploids and autopolyploids (Ramsey and Schemske 1998; Otto and Whitton 2000; Ramsey and Schemske 2002; Barker, Arrigo, et al. 2016). Distinguishing allo- and autopolyploids by pairing behavior is considered to be the “genetic classification” of polyploid species (Doyle and Egan 2010; Barker, Arrigo, et al. 2016; Doyle and Sherman-Broyles 2017). In allopolyploids, divergence between the parental taxa is expected to limit pairing among the homoeologous chromosomes, and the homologous chromosomes are expected to form pairs of bivalents during meiosis. In contrast, autopolyploids are expected to have homologous chromosomes that form multivalents (Figure 1). The bivalent pairing expected to occur in allopolyploids should lead to mostly disomic inheritance (two alleles at each of two distinct loci), whereas autopolyploids with multivalent pairing are expected to have multisomic inheritance (multiple alleles at a single locus) (Figure 1). It is important to point out that even though strictly bivalent pairing can occur in some autopolyploids, random segregation of homologous chromosomes during meiosis can still result in multisomic inheritance (Jackson and Jackson 1996; Qu et al. 1998; Hauber et al. 1999; Landergott et al. 2006; Stift et al. 2008). Therefore, multisomic inheritance is a unique feature that can be used to define autopolyploids (Parisod et al. 2010; Tayalé and Parisod 2013). Although the genetic definition is widely used in the field, many studies distinguish allo- and autopolyploid species by a taxonomic definition. This definition emphasizes the number of progenitor species (Ramsey and Schemske 2002). Allopolyploid species result from hybridization of two or more species with genome duplication. In contrast, autopolyploids result from a genome duplication within a single progenitor species (Doyle and Egan 2010; Barker, Arrigo, et al. 2016). The taxonomic definition putatively gets around one of the limitations of the genetic definition: change in pairing behavior over time. As polyploid species diploidize, bivalent pairing is restored, and this can make the genetic classification of an allo- or autopolyploid contingent on the age of the polyploid species. The taxonomic definition captures the nature of polyploid species regardless of the age of the WGD event and stage of diploidization.
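The contrast between disomic and multisomic inheritance described above can be made concrete by enumerating gametes. The following is a minimal sketch, assuming a tetraploid carrying two copies each of alleles A and a, random pairwise segregation of the four homologs in the autopolyploid case (ignoring double reduction), and strict homolog pairing within each subgenome in the allopolyploid case. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def tetrasomic_gametes(chromosomes):
    """Gametes when any two of the four homologs can end up together."""
    return Counter("".join(sorted(pair)) for pair in combinations(chromosomes, 2))


def disomic_gametes(pair_one, pair_two):
    """Gametes when each homologous pair forms its own bivalent
    and contributes exactly one chromosome."""
    return Counter("".join(sorted(g)) for g in product(pair_one, pair_two))


# Autotetraploid AAaa: all four chromosomes are interchangeable partners.
auto = tetrasomic_gametes(["A", "A", "a", "a"])
# Allotetraploid: subgenome 1 is AA, subgenome 2 is aa; pairing stays within subgenomes.
allo = disomic_gametes(["A", "A"], ["a", "a"])

print(auto)  # 1 AA : 4 Aa : 1 aa
print(allo)  # only Aa gametes
```

Under these assumptions the autotetraploid segregates AA, Aa, and aa gametes, whereas the allotetraploid breeds true for Aa, which is why inheritance pattern can distinguish the two even when both show bivalents.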

Although the definitions of allo- and autopolyploidy are straightforward, in practice it is often difficult to describe the nature of polyploid species and degree of diploidization because of the dynamic processes of genome divergence and evolution. Allo- and autopolyploidy represent two ends of a continuum of variation in subgenome divergence and independence (Stebbins 1947; Ramsey and Schemske 2002; Barker, Arrigo, et al. 2016). This gradient of polyploid variation has long been recognized (Stebbins 1947; Stebbins 1950). For example, the term “segmental allopolyploidy” was used for polyploid species that show mixtures of bivalent and multivalent formation (Stebbins 1947). Differences in observed pairing behavior across this spectrum have been documented in multiple systems (Ramsey and Schemske 2002). This variation led to describing the inheritance patterns of segmental allopolyploids and other polyploids in the middle of this gradient of pairing behavior as being “mixosomic” (Soltis et al. 2016). Although segmental allopolyploidy and mixosomic inheritance can be recognized by careful genetic analyses, most studies simply classify polyploid species as allo- or autopolyploids without acknowledging this continuum of polyploid variation (Barker, Arrigo, et al. 2016). However, to understand diploidization we ultimately must grapple with this continuum of variation and recognize that not all studies of post-polyploid genome evolution are examining the same biology. For example, if a polyploid species is born with diploid-like bivalent pairing, is the ongoing divergent evolution of those homoeologous chromosomes really diploidization? Is it equivalent to the evolution of bivalent pairing in a multivalent autotetraploid? Analyses of diploidization in recent and ancient polyploid genomes need to better understand the origin of the species to evaluate what is and is not due to diploidization in these genomes.

One starting point for understanding diploidization in polyploid genomes is to assess how many contemporary polyploid species have bivalent pairing and how this pattern aligns with allo- and autopolyploid species. To address this gap in our knowledge, we conducted a survey of pairing behavior in allo- and autopolyploid species recognized by the taxonomic definition. The initial survey was based on a previous study of the frequency of allo- and autopolyploidy that examined data for 4,003 species from 47 genera of vascular plants (Barker, Arrigo, et al. 2016). For each species, we recorded the chromosome pairing behavior during meiosis (Supplemental Table 1). We classified the meiotic chromosome pairing behavior as either strictly bivalent pairing (only bivalent formation was observed) or a mix (multivalent or mixture of bivalent and multivalent pairing). We identified 208 polyploid species from 40 genera (Supplemental Table 1) with at least one record of meiotic chromosome pairing behavior (Figure 2). Among these species, 118 were classified by Barker, Arrigo, et al. (2016) as allopolyploids and 90 as autopolyploids (Figure 2, Supplemental Table 1). Overall, we found that 92 of these species had strictly bivalent pairing, whereas 116 had mixed or multivalent pairing. Among species classified as allopolyploids, 48.3% had bivalent pairing and 51.7% had at least some multivalent formation during meiosis. Only 38.9% of the autopolyploids had bivalent pairing and 61.1% of the autopolyploids had multivalent or mixed pairing behavior. Consistent with our expectations, we found a lower frequency of strictly bivalent pairing among autopolyploid species compared to allopolyploids. However, the difference in pairing behavior between allo- and autopolyploids was not as large as expected. Some of this may be due to the taxonomic and phylogenetic classification of allo- and autopolyploid species used by Barker, Arrigo, et al. (2016), but the methodology used to classify polyploid species in that study is consistent with the approaches used broadly in the community. Our results suggest that segmental allopolyploidy is likely prevalent among polyploid plant species and that many autopolyploid species may rapidly evolve bivalent pairing.
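The percentages in this survey can be checked directly from the reported counts. In the sketch below, the species totals (118 allopolyploids, 90 autopolyploids) and the overall count of 92 strictly bivalent species come from the text; the per-class bivalent counts (57 and 35) are back-calculated from the reported 48.3% and 38.9% and are therefore an inference, not values taken from Supplemental Table 1. The Pearson chi-square at the end is one possible way to gauge the size of the difference; it is an illustrative addition, not part of the original analysis, and it treats species as independent observations.

```python
import math

# Species totals reported in the text.
n_allo, n_auto = 118, 90
# Strictly bivalent counts back-calculated from 48.3% and 38.9% (inferred).
allo_biv, auto_biv = 57, 35
allo_mix, auto_mix = n_allo - allo_biv, n_auto - auto_biv


def pct(part, whole):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * part / whole, 1)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and its p-value
    for a 2x2 table with 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


print(pct(allo_biv, n_allo), pct(auto_biv, n_auto))        # per-class bivalent %
print(pct(allo_biv + auto_biv, n_allo + n_auto))           # overall bivalent %
stat, p_value = chi_square_2x2(allo_biv, allo_mix, auto_biv, auto_mix)
print(stat, p_value)
```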

Despite possessing twice the number of chromosomes as their progenitors and regardless of the nature of polyploid speciation, nearly half (44.2%) of the polyploid species we surveyed have bivalent chromosome pairing behavior. As expected, allopolyploid species demonstrated more strictly bivalent pairing than autopolyploid species. The stable meiosis of allopolyploid species likely results from pairing preferences for homologs and suppression of pairing between the divergent homoeologs (Otto and Whitton 2000; Ramsey and Schemske 2002; Comai 2005). However, it has been suggested that stability of meiosis may be a neutral by-product of chromosomal divergence (Hollister 2015). Future studies need to determine whether and to what degree divergence among homoeologous chromosomes leads to bivalent formation in polyploids. Further analyses on the divergence of the parental diploids and the pairing behavior of their allopolyploid species would provide some insight into this question. Similarly, analyses of the age of the surveyed autopolyploid species would help explain why nearly 40% had strictly bivalent pairing. Are these species simply older autopolyploids that have gone through cytological diploidization already? Or are they cryptic allopolyploids that were misclassified as autopolyploids? The answers to these questions will help us understand the mechanisms that lead to the restoration of bivalent pairing in allo- and autopolyploids, and eventually the evolution of disomic inheritance across the spectrum of polyploid species.

Mechanisms of Cytological Diploidization

What are the mechanisms that lead to the restoration of bivalent pairing, disomic inheritance, and cytological diploidization of polyploid genomes in plants? Although the forces and mechanisms driving cytological diploidization are not completely understood (Le Comber et al. 2010; Feldman and Levy 2012; Hollister 2015), the process broadly involves changes in genome organization that ultimately produce pairs of homologous chromosomes that pair with each other and limit homoeologous pairing (Figure 3). These changes include chromosomal rearrangements, fissions, fusions, and other changes that lead to differentiated pairs of homologous chromosomes (Le Comber et al. 2010; Schubert and Lysak 2011). Dysploidy can also occur as a part of genome evolution associated with cytological diploidization, causing changes to base chromosome numbers (Escudero et al. 2014; Mandáková and Lysak 2018) and chromosome loss following WGD (Mandáková et al. 2010; Schubert and Lysak 2011; Xiong et al. 2011; Lysak 2014; Mandáková, Pouch, et al. 2017). More broadly, it is not yet clear if these changes accumulate (neutrally or through local adaptation) and lead to divergent resolution in different populations of a polyploid species (Werth and Windham 1991), or if natural selection is driving cytological diploidization because of some fitness benefit of diploid genetics or meiosis.

Evidence from studies of established polyploid species indicates that natural selection is likely driving some aspects of cytological diploidization. Research on established polyploids suggests they have lower crossover frequencies compared to neotetraploids or their diploid relatives (Ramsey and Schemske 2002; Yant et al. 2013). Recently formed polyploid species, especially autopolyploids but many allopolyploids as well (Figure 1), produce multivalents during meiosis. Multivalents are generally less stable during meiosis than bivalents and can lead to the loss of chromosomes during anaphase (Le Comber et al. 2010; Xiong et al. 2011; Zhang et al. 2013). This loss of chromosomes and other challenges of multivalent pairing and segregation can lead to reductions in fitness. These observations lead to a hypothesis that selection may reduce the number of crossovers or chiasmata to suppress multivalent formation and non-homologous pairing in polyploid species (Cifuentes et al. 2010; Le Comber et al. 2010; Bomblies et al. 2016). Reducing the number of crossovers limits the opportunity for chromosomes to pair with more than one partner during meiosis and leads to more stable, bivalent pairing.

In autopolyploids, meiotic stability is associated with the rate of crossover (Bomblies et al. 2015). More meiotically stable autopolyploids have diploid progenitors with a lower frequency of crossover formation, whereas polyploids with higher multivalent frequencies are formed by diploids with higher crossover rates (Morrison and Rajhathy 1960; Hazarika and Rees 1967; Bomblies et al. 2015). Studies suggest a single crossover per pair of homologous chromosomes is essential in most diploid species for chromosome segregation (Jones and Franklin 2006; Crismani and Mercier 2012). For a chromosome to be associated with more than one partner during meiosis, at least two crossovers are required (Bomblies et al. 2015). Theoretically, reducing crossovers to one per pair of homologous chromosomes in autopolyploids would be ideal for chromosome segregation and lead to bivalent formation (Bomblies et al. 2016). A model has been proposed for the mechanistic basis for limiting the number of crossovers in autopolyploids (Bomblies et al. 2016). In this model, the number of crossovers will be reduced to one if the range of crossover interference is larger than the distance to the end of the chromosome (Bomblies et al. 2016). However, the genetic and molecular mechanisms that control the number of crossovers are not well understood.
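The intuition behind the interference argument can be illustrated with a toy simulation. The sketch below is not the model of Bomblies et al. (2016); it simply assumes that candidate crossover sites arise at uniformly random positions and that a candidate is accepted only if it lies at least one interference range away from every crossover already placed. The function name, the number of attempts, and the chromosome length units are all hypothetical.

```python
import random


def place_crossovers(chrom_length, interference, attempts, rng):
    """Toy interference model: accept a candidate site only if it is at
    least `interference` away from every previously accepted crossover."""
    placed = []
    for _ in range(attempts):
        site = rng.uniform(0.0, chrom_length)
        if all(abs(site - other) >= interference for other in placed):
            placed.append(site)
    return sorted(placed)


rng = random.Random(3)
# Interference range shorter than the chromosome: several crossovers can fit.
short_range = place_crossovers(100.0, 30.0, 20, rng)
# Interference range longer than the chromosome: only the first crossover fits.
long_range = place_crossovers(100.0, 150.0, 20, rng)
print(len(short_range), len(long_range))
```

In this toy setting, once the interference range exceeds the chromosome length no second crossover can be accepted, so each chromosome can be tied to at most one partner, which is the condition for bivalent formation described above.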

The genetic basis of autopolyploid meiosis has mainly been studied in autotetraploid Arabidopsis arenosa (Hollister et al. 2012; Yant et al. 2013; Morgan et al. 2020). Previous studies used population data to show that eight unlinked candidate genes were important for meiotic chromosome pairing (Hollister et al. 2012; Yant et al. 2013). Strong signatures of selective sweeps are found on these genes and they are differentiated between polyploids and diploids. The results suggest that the genetics of re-establishing bivalent pairing in autopolyploid meiosis is likely to be polygenic (Yant et al. 2013). A more recent follow-up study identified the derived alleles of two genes, ASY1 and ASY3, that are associated with meiotic changes in A. arenosa (Morgan et al. 2020). This functional study also found that derived alleles of both genes are associated with meiotic traits, such as reduced multivalent formation, reduced chromosome axis length, and a tendency toward more rod-shaped bivalents during meiosis (Morgan et al. 2020). This work provides the first empirical analysis of multiple genes involved in bivalent restoration in autopolyploid meiosis and provides evidence that pairing behavior in autopolyploids can be genetically controlled. Although this model of restoring bivalent pairing has been developed in the context of autopolyploid species, it likely applies to many allopolyploids that experience multivalent pairing as well (Figure 2).

Meiotic chromosome pairing behavior in allopolyploids is traditionally considered to be stable and diploid-like (Otto and Whitton 2000; Ramsey and Schemske 2002; Comai 2005). The general explanation of the stable meiosis in allopolyploid species is that the homoeologous chromosomes are already differentiated, making it easier to establish bivalent pairing between homologs and suppress homoeolog pairing (Otto and Whitton 2000; Ramsey and Schemske 2002; Comai 2005). The molecular mechanism that makes chromosome pairing behavior dependent on the divergence of chromosomes remains unclear (Comai et al. 2003; Le Comber et al. 2010; Bomblies and Madlung 2014). Further, many allopolyploid species still experience significant chromosomal change following genome duplication. Extensive chromosomal rearrangements and chromosome losses have been found in both synthetic Brassica napus and natural populations of Tragopogon miscellus (Xiong et al. 2011; Chester et al. 2012). As we found above (Figure 2), many allopolyploids also demonstrate some multivalent pairing and need to at least partially restore bivalent pairing to diploidize. Studies have shown that the restoration of diploid-like chromosome segregation is genetically controlled (Riley and Chapman 1958; Feldman and Levy 2012; Martín et al. 2014; Hollister 2015; Gonzalo et al. 2019). The best known example is the Ph1 locus, which has been studied in grasses, especially in wheat. This locus is associated with suppressing homoeologous pairing and promoting homologous chromosome pairing in meiosis. In the absence of Ph1, the number of crossovers increases and extensive homoeologous pairing can occur (Sánchez-Morán et al. 2001). Loci with similar effects have also been identified in the allotetraploids Brassica napus (Jenczewski et al. 2003; Liu et al. 2006) and A. suecica (Henry et al. 2014). A recent study proposed a clear mechanism for how non-homologous crossovers can be suppressed in allopolyploids (Gonzalo et al. 2019). The gene MSH4 is essential for the main crossover pathway in B. napus. The number of non-homologous crossovers decreases if MSH4 returns to single copy, and these crossovers will not be affected if MSH4 is lost. Significantly, the authors found a convergent pattern of MSH4 returning to a single copy following multiple independent WGDs across the angiosperms. However, they suggest MSH4 is unlikely to contribute to meiotic stability in autopolyploids because it mainly affects non-homologous crossovers that are not thought to be important in autopolyploid pairing. This study provides a new mechanism for restoration of bivalent pairing in allopolyploids and suggests that chromosome pairing in allopolyploids is genetically determined across flowering plants (Gonzalo et al. 2019).

Overall, the mechanisms behind restoring bivalent pairing are still not clear (Le Comber et al. 2010; Feldman and Levy 2012; Hollister 2015). Some evidence suggests chromosome pairing is genetically determined in different auto- and allopolyploid systems (Jenczewski et al. 2003; Liu et al. 2006; Yant et al. 2013; Henry et al. 2014). Few systems have been studied to understand the cytological diploidization of autopolyploids (Hollister et al. 2012; Yant et al. 2013; Bomblies and Madlung 2014; Morgan et al. 2020). It remains unclear how these mechanisms may vary across the phylogeny. The recent study on MSH4 shed some light on the molecular mechanism of cytological diploidization in flowering plants (Gonzalo et al. 2019). Future studies should look for MSH4 and other genes associated with pairing and test if chromosome pairing is genetically determined across land plants. The molecular mechanisms of cytological diploidization and the restoration of diploid-like bivalent pairing remain to be fully understood (Bomblies and Madlung 2014).

Genic Diploidization and Fractionation

Although some polyploid species are essentially cytologically diploid at birth with bivalent pairing, all polyploid genomes appear to go through extensive gene loss and fractionation. Plant genomes are highly dynamic with significant turnover in content, especially following WGDs (Schnable et al. 2011; Barker et al. 2012; Wendel 2015; Soltis et al. 2016). All genes are duplicated during polyploidization and many of these new paralogs do not persist for long (Adams and Wendel 2005; Barker et al. 2008; Shi et al. 2010; Conant et al. 2014; Douglas et al. 2015). This process of gene removal and loss following polyploidy is known as fractionation (Leitch and Leitch 2008; Freeling et al. 2012). Although fractionation does not necessarily lead to the restoration of bivalent pairing or disomic inheritance, focusing on pairing behavior as the only process involved in diploidization misses the other aspects of genome evolution caused by WGDs. These include significant changes in gene content, network structure, and expression (Blischak et al. 2016). Fractionation is a particularly important component of diploidization and post-polyploid genome evolution because all polyploid genomes experience gene loss and the resolution of duplicated gene networks.

Two major molecular mechanisms for fractionation have been proposed: pseudogenization and gene deletion by recombination (Gaut et al. 2007; Freeling et al. 2012; Freeling et al. 2015). In flowering plants, it has been suggested that gene deletion by recombination is the predominant mechanism of fractionation and that pseudogenization may be relatively rare (Freeling et al. 2012). However, a recent study estimated that the numbers of pseudogenes are highly lineage specific in angiosperm genomes, ranging from 5,000 to over 73,000 (Xie, Li, et al. 2019). These results suggest that pseudogenization may be more common in plant genomes than previously thought. Pseudogenization is generally caused by mutation and results in the non-functionalization of a gene (Zhang et al. 2003; Xie, Chen, et al. 2019; Xie, Li, et al. 2019). Although gene function is lost, pseudogenes are not physically deleted from the genome. In contrast, gene deletion by recombination removes DNA from the genome (Woodhouse et al. 2010; Pang et al. 2015). Illegitimate recombination and unequal intra-strand homologous recombination are thought to be the two primary molecular mechanisms of gene deletion in plants (Devos et al. 2002; Soltis et al. 2015). These two mechanisms involve unequal crossing over during recombination and result in physical removal of DNA from the genome (Devos et al. 2002; Woodhouse et al. 2010). An additional potential mechanism for gene deletion in plants was recently proposed from research on synthetic allohexaploid Brassica (Gaebelein et al. 2019). Unlike the other two major molecular mechanisms of gene deletion, this deletion mechanism occurs between homoeologs during homoeologous recombination in allopolyploids. In the case of synthetic allohexaploid Brassica, fertility was significantly reduced when a particular subgenome was duplicated or deleted in a homoeologous exchange. This difference in fertility based on which subgenome is unbalanced in the homoeologous exchange can lead to non-random retention of a subgenome (Gaebelein et al. 2019).

In many other plant genomes, the process of fractionation has also been observed to be non-random (Gaut and Doebley 1997; Bowers et al. 2003; Wang et al. 2011; Paterson et al. 2012). This biased fractionation can result in subgenome dominance, in which one subgenome is retained more than the other. This phenomenon has been widely observed across angiosperm lineages (Schnable et al. 2011; Cheng et al. 2012; Freeling et al. 2012; Renny-Byfield et al. 2015; Emery et al. 2018; Edger et al. 2019). In general, genes from the more highly retained subgenome are expressed at a higher level than their homoeologs (Schnable et al. 2011; Cheng et al. 2012). Transposable element (TE) density and methylation of these TEs can reduce the expression level of nearby genes (Hollister and Gaut 2009; Hollister et al. 2011). In allopolyploids, one parental genome may have a higher TE density and higher level of methylation compared to the other parental genome. It has been hypothesized that genes from the subgenome with higher TE density and methylation may be expressed at a lower level, resulting in more fractionation compared to the other subgenome (Woodhouse et al. 2014). Under this hypothesis, there is more opportunity for subgenome dominance to occur in allopolyploid species (Woodhouse et al. 2014). This hypothesis has also been extended to paleopolyploidy events (Garsmeur et al. 2014). It has been proposed that genomes with evidence of biased fractionation and subgenome dominance are more likely to be ancient allopolyploids (Garsmeur et al. 2014). However, studies have shown that allopolyploidy may not always result in subgenome dominance. For example, in allopolyploids such as B. napus, wheat, and cotton, subgenome dominance is not observed (Yoo et al. 2013; Chalhoub et al. 2014; Pfeifer et al. 2014; Harper et al. 2016). In soybean, subgenome dominance is not found and the nature of its paleopolyploid event is still unresolved (Zhao et al. 2017). These observations suggest the degree of genome differentiation prior to polyploidy may determine the amount of subgenome dominance. It remains unclear why this pattern varies across the phylogeny. Recent studies have made progress on understanding the potential mechanisms that may drive subgenome dominance and biased fractionation. In the genome, it has been found that subgenome dominance and biased fractionation are associated with higher gene body methylation, degree of protein-protein interactions, and gene expression levels (Shi et al. 2020). Recent studies also suggested homoeologous exchanges in allopolyploids are likely to impact the pattern of subgenome dominance (Edger et al. 2017; Bird et al. 2018; Gaebelein et al. 2019; Alger and Edger 2020). The phylogenetic distribution and relative contributions of these mechanisms to the evolution of subgenome dominance and biased fractionation are not yet clear, but additional analyses leveraging population genomics, resynthesized polyploids, and other analyses of genetics and fitness will provide further insight into their roles in polyploid genome evolution.
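A toy simulation can illustrate how a modest per-gene bias in which copy is lost accumulates into the uneven subgenome retention described above. The sketch below is purely illustrative and does not implement any published model: the function name, the per-step loss probability, the bias parameter, and the assumption that each duplicated pair always keeps at least one copy are all hypothetical.

```python
import random


def fractionate(n_pairs, loss_prob, bias_to_b, steps, rng):
    """Toy fractionation of duplicated gene pairs from subgenomes A and B.

    At each step, every pair that still has both copies loses one copy with
    probability `loss_prob`; the lost copy comes from subgenome B with
    probability `bias_to_b`. Returns genes retained in (A, B).
    """
    state = ["both"] * n_pairs
    for _ in range(steps):
        for i, s in enumerate(state):
            if s == "both" and rng.random() < loss_prob:
                state[i] = "A_only" if rng.random() < bias_to_b else "B_only"
    kept_a = sum(s != "B_only" for s in state)
    kept_b = sum(s != "A_only" for s in state)
    return kept_a, kept_b


rng = random.Random(7)
unbiased = fractionate(10000, 0.02, 0.5, 50, rng)  # losses split evenly
biased = fractionate(10000, 0.02, 0.7, 50, rng)    # losses favor subgenome B
print(unbiased, biased)
```

With `bias_to_b` above 0.5, subgenome A ends up with more retained genes than subgenome B, mimicking a dominant and a recessive subgenome; with 0.5 the two remain roughly balanced.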

The drastic and biased gene loss that accompanies diploidization can also result in significant genome reorganization, which may occur to resolve genomic conflicts or dosage balance issues that would otherwise reduce polyploid fitness (Wendel 2015; Pires et al. 2016). It has been shown that paralogs with more interaction partners, such as transcription factors, are more likely to be retained following WGD to maintain protein product stoichiometry or dosage (Thomas et al. 2006; Freeling 2009; Defoort et al. 2019). This dosage-balance hypothesis (DBH) also predicts that dosage-sensitive genes will be preferentially lost following small-scale gene duplication events to prevent dosage disruptions, as their interaction partners are not doubled (Freeling 2009; Birchler and Veitia 2011; Li et al. 2016; Defoort et al. 2019). An alternative to the DBH attributes retention of paralogs to functional diversification, especially neofunctionalization (a gene copy acquiring a novel function) (Ohno 1970) or subfunctionalization (each gene copy retaining part of the original function) (Lynch and Force 2000). A previous study suggests subfunctionalization may also drive cytological diploidization by maintaining appropriate chromosome pairs and promoting bivalent chromosome pairing and disomic inheritance (Le Comber et al. 2010). However, neo- and subfunctionalization cannot explain the parallel pattern of gene retention following different WGDs (Barker et al. 2008; De Smet et al. 2013; Mandáková, Li, et al. 2017). Among these hypotheses for duplicate gene retention (Kondrashov and Kondrashov 2006; Freeling 2009), the DBH is the only hypothesis that explicitly predicts the parallel retention and loss of functionally related genes across species following WGD (Thomas et al. 2006; Freeling 2009; Conant et al. 2014). A recent study of tandem duplicate genes in mammals suggests that the DBH might explain the initial survival of these gene duplicates, whereas neo- or subfunctionalization may be more important for the long-term retention of paralogs (Lan and Pritchard 2016). It remains to be understood what determines the portion of retained duplicate genes that is explained by the DBH, neo- and subfunctionalization, and other processes, and how this pattern varies across different lineages of plants.

In general, genic diploidization/fractionation occurs after all WGDs. Although the complete set of forces and mechanisms that drive fractionation is not yet understood, there is plenty of evidence that the process is generally not random with regard to the subgenomes and types of genes that are retained and lost (Gaut and Doebley 1997; Bowers et al. 2003; Wang et al. 2011; Paterson et al. 2012). Future studies should aim to better understand how much fractionation is determined by the nature of polyploidy or other factors such as the level of methylation in parental genomes. We also need to understand how genic diploidization and fractionation contribute to resolving genomic conflicts or dosage balance issues. This will help improve our understanding of the fate of duplicate genes from WGDs. Given that diverse mechanisms and forces appear to drive fractionation, the processes of genic diploidization may vary considerably among lineages.

Rate of Diploidization in Plants

The process of diploidization involves many mechanisms and forces, and it is not yet clear how they operate in different lineages of plants. Most studies on genetic and cytological diploidization have focused on the angiosperms. In Tragopogon, it has been shown that the parallel pattern of gene loss and chromosomal rearrangements can be established in only 40 generations (Buggs et al. 2012). Similarly, Xiong et al. (2011) studied ten generations of the resynthesized allopolyploid Brassica napus and found evidence for many chromosomal rearrangements and aneuploidies. Although there is evidence for rapid chromosomal evolution following polyploidy, a recent study demonstrated that the rate of diploidization following WGD can vary among related lineages (Mandáková, Li, et al. 2017). In 13 independent Brassicaceae mesopolyploidies, multiple species displayed different degrees of diploidization, yielding a range of chromosome numbers and rearrangements across lineages. The different levels of diploidization are not clearly predicted by the age of these polyploidy events (Mandáková, Li, et al. 2017). More strikingly, in a recent cytological study of a Brassicaceae tribe largely endemic to , different lineages descending from a common allopolyploid ancestor can have different rates of diploidization (Mandáková, Pouch, et al. 2017). The difference in rate is mainly driven by the number of chromosomal rearrangements observed in each species (Mandáková, Pouch, et al. 2017). Given that the rate of diploidization can vary dramatically in the descendants of a single WGD, the rate of diploidization likely varies across different lineages of flowering plants. However, it is not yet clear how much the rates of different aspects of diploidization vary across the land plant phylogeny or which forces drive these differences in rate.

Relatively little is known about diploidization outside of angiosperms. A recent study in Sequoia confirms that an autopolyploidization event occurred around 33 Ma (Scott et al. 2016). However, Sequoia has apparently maintained multivalent pairing since this paleopolyploidy (Stebbins 1948), suggesting a slow diploidization process in comparison to flowering plants (Scott et al. 2016). Although debated (Ruprecht et al. 2017; Zwaenepoel and Van de Peer 2019), genomic analyses have inferred at least three other ancient WGDs in the gymnosperms (Z. Li et al. 2015; One Thousand Plant Transcriptomes Initiative 2019; Li and Barker 2020). Other recent studies have found evidence of neopolyploidy in Ginkgo (Šmarda et al. 2016; Šmarda et al. 2018) and Juniperus (Farhat et al. 2019). These ancient and recent WGDs provide opportunities to estimate the rate of genic and cytological diploidization in gymnosperms. Better understanding of diploidization in gymnosperms may provide a new angle to understand why polyploidy is relatively rare in most of the gymnosperms (Ahuja 2005). Similar to the gymnosperms, diploidization remains to be studied in ferns. It has been hypothesized that ferns experienced multiple rounds of ancient WGDs without losing their chromosomes following WGDs (Haufler and Soltis 1986; Haufler 1987; Barker and Wolf 2010). In contrast to the flowering plants, diploidization in the ferns has been hypothesized to be predominantly driven by gene silencing or pseudogenization rather than gene deletion (Haufler 1987; Nakazato et al. 2006; Nakazato et al. 2008; Barker 2013). A few studies have identified multiple, silenced copies of nuclear genes in putatively diploid homosporous fern genomes (Pichersky et al. 1990; McGrath et al. 1994; McGrath and Hickok 1999) and the active process of gene silencing without chromosome loss in a polyploid genome (Gastony 1991). However, the molecular mechanism of gene fractionation and the rate of diploidization in ferns are still unknown. Two heterosporous fern genomes have been published (F.-W. Li et al. 2018). However, these two genomes might experience different processes of diploidization compared to the homosporous ferns, which have much higher average chromosome numbers. Similar to the gymnosperms and ferns, relatively little is known about diploidization in the other lineages of land plants. Future studies should estimate the patterns and processes of diploidization with chromosome-level genome assemblies of these lineages, especially mosses, Lycopodiaceae, Isoetaceae, and the homosporous ferns, where polyploidy seems to be prominent (One Thousand Plant Transcriptomes Initiative 2019).

Estimating the rate of genic and cytological diploidization in plants can be challenging because the process occurs across large timescales and requires substantial genomic data. Additional phylogenetic and cytological analyses could be used to develop greater insight into the rate of cytological diploidization (Figure 4-5). Similarly, the rate of gene loss following polyploidy can be estimated from recent studies on the incidence of paleopolyploidy across the plant phylogeny. With genomic and transcriptomic data, the rate of duplicated gene loss in ancient polyploids can be estimated by comparing the fraction of paralogs in a genome derived from a WGD and the age of the WGD across multiple events and species. In general, studies have used synteny or duplicate gene age distribution analyses to infer duplicate genes derived from polyploidy events (Ren et al. 2018; Qiao et al. 2019; Gout et al.). The relative age of a WGD can be estimated using the synonymous divergence (Ks) of the paralogs in the WGD peak from a Ks plot. By plotting the fraction of retained WGD paralogs in the genome (% paleologs) against the median paralog divergence for a WGD, we can obtain an estimate of the variation in the rate of genic diploidization following ancient WGDs.
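As a rough illustration of the two quantities described above, the following Python sketch computes the median Ks of a WGD peak and the fraction of paleologs for a single species. The function name, the input layout (gene pairs with a Ks value), and the hand-set peak bounds are assumptions made for illustration only; in practice the peak would normally be delimited by a mixture model or a similar approach rather than by fixed cutoffs.

```python
from statistics import median

def summarize_wgd_peak(paralog_pairs, peak_min, peak_max, n_unigenes):
    """Summarize one WGD peak from a list of (gene_a, gene_b, ks) tuples.

    peak_min and peak_max are hypothetical Ks bounds of the WGD peak.
    Returns (median Ks of pairs in the peak, fraction of paleologs).
    """
    in_peak = [(a, b, ks) for a, b, ks in paralog_pairs
               if peak_min <= ks <= peak_max]
    if not in_peak or n_unigenes <= 0:
        raise ValueError("need at least one pair in the peak and n_unigenes > 0")
    # Paralog divergence of the WGD: median Ks of the pairs in the peak.
    median_ks = median(ks for _, _, ks in in_peak)
    # Count each retained gene once, even if it appears in several pairs.
    retained_genes = {g for a, b, _ in in_peak for g in (a, b)}
    return median_ks, len(retained_genes) / n_unigenes

# Toy data: three pairs fall inside the assumed peak (Ks 0.4 to 0.8).
pairs = [("g1", "g2", 0.5), ("g3", "g4", 0.6),
         ("g5", "g6", 2.0), ("g1", "g7", 0.7)]
median_ks, fraction = summarize_wgd_peak(pairs, 0.4, 0.8, n_unigenes=50)
```

Counting unique genes rather than pairs is one possible convention; a study that counts pairs would obtain a different denominator-to-numerator relationship, so the convention should be stated alongside any reported fraction.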

Previous research has found that the fraction of genes retained from WGDs decreases exponentially over time in flowering plants (Ren et al. 2018; Qiao et al. 2019; Gout et al.). To estimate variation in the rate of gene loss across land plants, we analyzed transcriptomic data from 815 land plant species that are inferred to have experienced at least one round of ancient polyploidy, drawn from the One Thousand Plant Transcriptomes (1KP) project (One Thousand Plant Transcriptomes Initiative 2019). These species were organized into five major lineages of land plants: bryophytes, lycophytes, ferns, gymnosperms, and angiosperms (Supplemental Table 2). We used mixture modeling to identify genes retained from the most recent ancient WGD that each species experienced based on the WGD peak in the Ks plot (Li and Barker 2020). The paralog divergence of the WGD was estimated by the median Ks value of the WGD peak. We estimated the fraction of paleologs as the total number of genes retained from an ancient WGD divided by the total number of unigenes in the transcriptome (Supplemental Table 2). We then plotted the fraction of paleologs against the paralog divergence (Ks) of the WGD for each species (Figure 4). To infer whether there was a significant trend in the data, we fit linear and exponential models to the distribution (Supplemental Table 3). Consistent with previous research (Ren et al. 2018; Qiao et al. 2019), we found a decrease in the fraction of retained paleologs over time in the angiosperms (Figure 4, Supplemental Table 3). We also observed higher variation in the fraction of retained paralogs among relatively young WGDs (lower Ks values) compared to older WGDs (higher Ks values). In contrast, we observed an increase in the fraction of paleologs over time in the gymnosperms (Figure 4, Supplemental Table 3). The bryophytes, lycophytes, and ferns did not have a significant increase or decrease in the fraction of retained WGD paralogs over time (Figure 4).
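The two model fits described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for Supplemental Table 3: the function names are hypothetical, and fitting the exponential model y = a·exp(b·x) by least squares on log-transformed fractions is an assumption, since the text does not state how the exponential fit was obtained.

```python
import math

def fit_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) via a linear fit to ln(y); requires y > 0."""
    b, ln_a = fit_linear(x, [math.log(yi) for yi in y])
    return math.exp(ln_a), b

def r_squared(x, y, slope, intercept):
    """Proportion of variance in y explained by the linear fit."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Toy data: paleolog fraction decaying exactly as 0.3 * exp(-0.5 * Ks).
ks = [0.2, 0.5, 1.0, 1.5]
frac = [0.3 * math.exp(-0.5 * k) for k in ks]
a, b = fit_exponential(ks, frac)
slope, intercept = fit_linear(ks, frac)
```

A negative `b` (or a negative `slope`) corresponds to a loss of paleologs with increasing paralog divergence; significance testing, which the analyses above report, is not included in this sketch.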

One issue with analyses of ancient polyploidy is that many taxa may be closely related and some taxa may share the same ancient duplication event. To test whether there is any phylogenetic signal for the fraction of retained paralogs and the relative age of the polyploidy, we used the phylosig function in the phytools R package (Revell 2012). We found evidence of significant phylogenetic signal for all categories except fractions of paleologs in the ferns and lycophytes. To address the potential impact of these closely related species and phylogenetically shared WGDs on the observed relationship between WGD age and paleolog retention, we used phylogenetic independent contrasts (PIC) to account for the phylogenetic relatedness among lineages in our dataset. Specifically, we transformed the raw values of the fraction of genes retained from each WGD and the Ks value of each WGD, using the phylogeny from the 1KP project and the pic function in the ape R package (Popescu et al. 2012). Similar to the results above, our phylogenetically corrected analyses did not recover a significant relationship between gene loss and the relative age of the WGD event in bryophytes, lycophytes, and ferns (Figure 5A-D, Supplemental Table 3). The positive relationship observed in the gymnosperms was no longer significant after taking phylogeny into account (Figure 5D, Supplemental Table 3). Our phylogenetically corrected analyses recover a significant linear fit (p < 0.001, adjusted R-squared = 0.09593, slope = -0.04506) and a significant exponential fit (p < 0.001, b = -0.2032) in angiosperms (Figure 5E, Supplemental Table 3). Similar to studies that did not take phylogeny into account (Ren et al. 2018; Qiao et al. 2019), we observed that paleologs were lost over time. We found that the relative age of the WGDs explains about 10% of the variation in the amount of gene loss in the linear model fit after PIC (Supplemental Table 3). Our study provides the first observation of the rate of gene loss in other lineages of land plants. Unlike in flowering plants, the amount of gene loss from a WGD does not appear to be correlated with the relative age of the WGDs in these lineages. Our results suggest the dominant mechanism of fractionation may vary across land plants, and appears to be different in angiosperms compared to other land plants. Considering that the relative age of the WGD explained a relatively small amount of the variation in gene loss in angiosperms, other mechanisms are clearly important. It may be that each WGD ultimately experiences different patterns of fractionation. Every post-WGD lineage experiences different demography, selection pressures, and other population genetic differences that could drive unique rates of gene loss. Variation in all of these dimensions likely contributes to the differences in the patterns of fractionation we observed across the land plant phylogeny.
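To make the PIC transformation concrete, the sketch below implements the standard independent contrasts algorithm of Felsenstein (1985) for a fully bifurcating tree and then regresses one set of contrasts on the other through the origin. It is a Python re-implementation of the general idea, not the code of the pic function in ape; the nested-tuple tree encoding, the function names, and the toy trait values are all assumptions made for illustration.

```python
import math

def contrasts(node, trait):
    """Compute standardized independent contrasts on a binary tree.

    A node is (content, branch_length); content is a tip name (str)
    or a pair (left_node, right_node). Returns a tuple of
    (node value, adjusted branch length, list of contrasts).
    """
    content, length = node
    if isinstance(content, str):
        return trait[content], length, []
    left, right = content
    x_l, v_l, c_l = contrasts(left, trait)
    x_r, v_r, c_r = contrasts(right, trait)
    # Contrast standardized by the square root of its expected variance.
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)
    # Ancestral value: average weighted by inverse branch lengths.
    value = (x_l / v_l + x_r / v_r) / (1 / v_l + 1 / v_r)
    # Lengthen the branch below this node to reflect estimation error.
    adjusted = length + v_l * v_r / (v_l + v_r)
    return value, adjusted, c_l + c_r + [contrast]

def slope_through_origin(cx, cy):
    """Least-squares slope for contrasts, with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)

# Toy tree ((A:1,B:1):1,(C:1,D:1):1) with hypothetical trait values.
tree = (
    (
        ((("A", 1.0), ("B", 1.0)), 1.0),
        ((("C", 1.0), ("D", 1.0)), 1.0),
    ),
    0.0,
)
ks_values = {"A": 1.0, "B": 3.0, "C": 5.0, "D": 9.0}
fractions = {tip: 2.0 * v + 1.0 for tip, v in ks_values.items()}
_, _, ks_contrasts = contrasts(tree, ks_values)
_, _, fraction_contrasts = contrasts(tree, fractions)
slope = slope_through_origin(ks_contrasts, fraction_contrasts)
```

Because contrasts are differences between relatives, any constant offset in a trait cancels out, which is why the regression on contrasts is conventionally forced through the origin. A tree with n tips yields n - 1 contrasts per trait.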

Our results highlight that there is still much we do not understand about diploidization. Although other analyses also suggest that the rate of diploidization is likely to vary across the phylogeny of plants (Mandáková, Pouch, et al. 2017), it is not clear why we observed no relationship between the age of a WGD (as inferred by paralog divergence) and the fraction of retained paralogs for most clades of land plants. Future studies are needed to understand whether the angiosperms have evolved novel mechanisms of gene fractionation distinct from those found in other land plants. Sample size in other lineages may contribute to some of the differences we observed, but the bryophytes, ferns, and gymnosperms were all represented by more than 50 species. Given the potential importance of eliminating genes after WGD (Thomas et al. 2006; Freeling 2009; Birchler and Veitia 2012; Defoort et al. 2019), the apparently efficient gene fractionation in angiosperms may be a part of their evolutionary success. Similarly, more comprehensive analyses of pseudogenization across land plants are needed to understand variation in gene loss among lineages. It also remains to be resolved how allo- and autopolyploidy influence the rate of gene loss and chromosomal evolution. Analyses leveraging comparative genomic approaches from emerging chromosome-level gymnosperm and homosporous fern genomes will be important to address why these rates of diploidization differ across land plants. Similarly, deeper analyses of populations and species descended from the same WGD are needed to understand the forces that drive diploidization. Our analyses and others (Ren et al. 2018; Qiao et al. 2019; Gout et al.) indicate that there is ample variation in the rates of diploidization to begin understanding these forces.

Differences in Diploidization Between Plants and Animals

Variation in the patterns and rates of diploidization is also evident between plants and animals. In angiosperms, most of the gene loss that occurs during fractionation is attributed to intrachromosomal recombination (Woodhouse et al. 2010; Freeling et al. 2012; Schnable et al. 2012; Tang et al. 2012). However, in animals many gene losses appear to be caused by pseudogenization (Freeling et al. 2012). Vertebrate genomes do not seem to rapidly remove functionless nonrepetitive DNA, and pseudogenes can be carried for tens of millions of years (Meyer and Schartl 1999; Schrider et al. 2009; Berthelot et al. 2014; Lien et al. 2016).

Patterns of gene loss following paleopolyploidy have been studied in many flowering plants such as A. thaliana (Bowers et al. 2003), Brassica (Gaut and Doebley 1997; Bowers et al. 2003; Wang et al. 2011; Paterson et al. 2012), maize (Gaut and Doebley 1997; Bowers et al. 2003; Wang et al. 2011; Paterson et al. 2012), as well as more recent cotton allopolyploids (Wendel et al. 2012). A general pattern that has been found across these flowering plant genomes is that most of the gene losses are due to illegitimate recombination rather than gene pseudogenization (Woodhouse et al. 2010; Freeling et al. 2012; Schnable et al. 2012; Tang et al. 2012). In maize, around 10% of the paleologs have been removed after a whole genome duplication that occurred around 12 million years ago. These paralogs were deleted by intrachromosomal recombination facilitated by direct repeats flanking the gene or exons (Woodhouse et al. 2010). In Brassica rapa, gene loss following the Brassiceae paleohexaploidy was driven by the same gene deletion mechanism (Tang et al. 2012).

In contrast to plant genomes with rapid gene deletion caused by intrachromosomal recombination, pseudogenization appears to be the major gene loss mechanism in vertebrates (Meyer and Schartl 1999; Schrider et al. 2009; Berthelot et al. 2014; Lien et al. 2016). The most common type of pseudogenization occurs when a gene is disrupted by and becomes unexpressed or non-functional (Zhang 2003). For example, all of the nearly 200 genes lost since humans diverged from chimpanzees are present as pseudogenes in our genome (Schrider et al. 2009). Another excellent example of slow gene deletion in vertebrates comes from the recently sequenced rainbow trout genome (Berthelot et al. 2014). Analyses of the genome revealed an ancient WGD shared by the salmonid family. After nearly 100 million years of evolution, syntenic analyses found that the two subgenomes are still highly collinear. Nearly half of the protein-coding genes are retained in the genome, and most of the gene loss is due to pseudogenization. The authors also estimated that the average rate of gene inactivation is ~170 genes per million years (Berthelot et al. 2014). Similarly, carp experienced a WGD 8-18 MYa. Analyses of the common carp genome found a slow rate of gene loss, with 92% of the paralogs from the polyploid event still retained in both copies (J.-T. Li et al. 2015). In Xenopus frogs, there is significant pseudogene accumulation following an allopolyploidy event that occurred 17-18 MYa. Comparable to rainbow trout, around 64% of paralogs from the WGD experienced gene loss by pseudogenization (Session et al. 2016). Unlike the patterns observed in flowering plants, few large-scale gene deletions have been observed in animals. Most genes are deleted independently from neighboring genes by single gene deletion (Session et al. 2016). Notably, vertebrates represent all of the currently studied post-polyploid animal genomes. It is not clear if this pattern of gene deletion following WGDs is shared by all animals (Berthelot et al. 2014; Lien et al. 2016).

The slow rate of gene removal in animals contrasts with the flowering plant-centric perspective that genes are rapidly deleted and genomes highly reorganized following WGDs. Slow gene deletion may impede the rate at which dosage balance problems are resolved following WGDs as well as reduce the rate of diploidization. The rapid gene deletion in flowering plants may allow them to resolve dosage balance problems much faster than animals. This hypothesis might help explain why polyploidy is rarer in animals compared to plants (Muller 1925; Orr 1990; Mable 2004). Future studies should confirm whether this pattern of gene deletion is shared by all animals. Recent genomic analyses revealed multiple paleopolyploidies in the ancestry of various invertebrate lineages, such as insects, horseshoe crabs, spiders, and molluscs (Hallinan and Lindberg 2011; Yoshida et al. 2011; Nossa et al. 2014; Clarke et al. 2015; Z. Li et al. 2018). These ancient polyploids can be used to test whether this pattern of gene deletion is shared by invertebrates. To test this hypothesis, one needs to assess the average rate of pseudogenization and gene deletion following polyploidy in animals and compare it to plants. Synteny analyses on high-quality animal and plant genomes are needed to estimate the average rate of gene loss. Variation in the rates and mechanisms of diploidization will likely be found. For example, a recent study using 13 Paramecium genomes shows a slower post-WGD gene loss rate compared to plants and vertebrates (Gout et al.). Future studies are needed to further investigate the mechanisms and patterns of gene deletion following WGDs across eukaryotes.

Final thoughts and future directions

Diploidization involves a diversity of mechanisms to return polyploid genomes to an effectively diploid state. New comparative and population genomic data combined with cytogenetic and molecular biological approaches will continue to uncover the genetics and biology of the mechanisms involved in diploidization. Perhaps the most important next step in improving our understanding of diploidization is developing a more rigorous and objective framework for testing hypotheses about diploidization. Many studies of diploidization are largely descriptive. This is fair because we are still in the relatively early days of discovering ancient WGDs and their legacies in eukaryotic genomes. As we move forward and more data become available, we need to work towards more explicit hypothesis testing of diploidization. There has been progress in this area for some aspects of diploidization, such as hypotheses on subgenome dominance (Bird et al. 2018). Developing model- and simulation-based approaches to evaluate and test diploidization hypotheses would push the field forward. For example, model-based analyses of chromosomal evolution first introduced with chromEvol provided a new phylogenetic framework to test hypotheses of cytological evolution (Mayrose et al. 2010). Similar modeling and simulation approaches would permit researchers to more rigorously test hypotheses and develop more informed expectations about the outcomes of diploidization caused by different mechanisms and forces. Ultimately, the scale of data will demand more rigorous approaches as single genome analyses make way for phylogenomic and population genomic investigations.

More rigorous analyses of diploidization will also allow us to address perhaps the most interesting question about the entire process: why diploidize at all? Given the prevalence of diploidy among eukaryotes and the frequency of polyploid speciation in plants, we can deduce that polyploid species either diploidize or go extinct (Arrigo and Barker 2012; Baduel et al. 2018). Why do polyploid species ultimately diploidize? It may be that bivalent pairing is inherently more stable than multivalent pairing and increases fitness. Perhaps bivalent pairing eventually leads to disomic inheritance and chromosomal differentiation by drift (Le Comber et al. 2010). Alternatively, diploidization may be driven to more efficiently purge deleterious substitutions in polyploid genomes (Otto 2007). It may be that natural selection is more efficient in diploid genomes (Otto and Whitton 2000; Monnahan and Brandvain 2019) and selection in the environment, rather than the genome, drives diploidization. Model- and simulation-based analyses of these and other hypotheses would provide new ways to explicitly test the ultimate causes and drivers of diploidization. Coupling comparative genomic analyses and data with studies that are explicitly aimed at measuring the fitness effects of the changes associated with diploidization is also needed. A challenge of studying diploidization is that many of the processes happen in that shadowy area of inference where the power of population genetics starts to fade but comparative phylogenetics may not be possible because of too few species. Moving forward, a combination of explicit models and simulations with data from carefully selected systems will help shine a light on the shadow of polyploidy.

Acknowledgments

We thank Cristian Román-Palacios and Anthony E. Baniaga for helpful discussions. M. S. B. was supported by US National Science Foundation (NSF) grants IOS-1339156 and EF-1550838.

Terms and definitions list

Homologous Chromosomes (Homologs): A set of chromosomes that pair up during meiosis I; one is of maternal origin and the other of paternal origin.

Homoeologous Chromosomes (Homoeologs): A set of chromosomes in an allopolyploid that are derived from different parental species and have shared homology.

Disomic inheritance: Regular pairing and segregation of two chromosomes that produces two alleles at a locus.

Tetrasomic inheritance: Pairing and segregation of four chromosomes that produces four alleles at a single locus.

Multisomic/Polysomic inheritance: Combinations of chromosome pairing and segregation that yield more than two alleles at a locus.

Mixosomic inheritance: Combination of disomic and multisomic inheritances in a species.

Allopolyploidy: Polyploid species formed by interspecific hybridization and whole genome duplication. Generally considered to have pairs of homologous chromosomes from each parent that form bivalents during meiosis.

Autopolyploidy: Polyploid species with a single progenitor species and typically expected to have sets of homologous chromosomes that form multivalents during meiosis.

Segmental Allopolyploidy: Polyploid species with a mixture of bivalent and multivalent chromosome pairing.

Bivalent: A pair of homologous chromosomes aligned on the meiotic spindle during meiosis I.

Multivalent: Three or more homologous chromosomes aligned on the meiotic spindle during meiosis I.

Fractionation/Genic diploidization: The process of gene removal and loss following polyploidy by molecular mechanisms such as pseudogenization and gene deletion by recombination.

Cytological diploidization: The process of chromosomal evolution and restoration of bivalent pairing and disomic inheritance following polyploidy.

References

Adams KL, Wendel JF. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8:135–141.

Ahuja MR, Raj Ahuja M. 2005. Polyploidy in gymnosperms: revisited. Silvae Genetica [Internet] 54:59–69. Available from: http://dx.doi.org/10.1515/sg-2005-0010

Alger EI, Edger PP. 2020. One subgenome to rule them all: underlying mechanisms of subgenome dominance. Curr. Opin. Plant Biol. 54:108–113.

Arrigo N, Barker MS. 2012. Rarely successful polyploids and their legacy in plant genomes. Curr. Opin. Plant Biol. 15:140–146.

Baduel P, Bray S, Vallejo-Marin M, Kolář F, Yant L. 2018. The “Polyploid Hop”: shifting challenges and opportunities over the evolutionary lifespan of genome duplications. Frontiers in Ecology and Evolution 6:117.

Barker MS. 2013. Karyotype and genome evolution in pteridophytes. Plant Genome Diversity Volume 2 [Internet]. Available from: https://link.springer.com/chapter/10.1007/978-3-7091-1160-4_15

Barker MS, Arrigo N, Baniaga AE, Li Z, Levin DA. 2016. On the relative abundance of autopolyploids and allopolyploids. New Phytol. 210:391–398.

Barker MS, Baute GJ, Liu S-L. 2012. Duplications and turnover in plant genomes. Plant Genome Diversity Volume 1 [Internet]:155–169. Available from: http://dx.doi.org/10.1007/978-3-7091-1130-7_11

Barker MS, Husband BC, Pires JC. 2016. Spreading Winge and flying high: The evolutionary importance of polyploidy after a century of study. Am. J. Bot. 103:1139–1145.

Barker MS, Kane NC, Matvienko M, Kozik A, Michelmore RW, Knapp SJ, Rieseberg LH. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol. Biol. Evol. 25:2445–2455.

Barker MS, Wolf PG. 2010. Unfurling fern biology in the genomics age. BioScience [Internet] 60:177–185. Available from: http://dx.doi.org/10.1525/bio.2010.60.3.4

Berthelot C, Brunet F, Chalopin D, Juanchich A, Bernard M, Noël B, Bento P, Da Silva C, Labadie K, Alberti A, et al. 2014. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat. Commun. 5:3657.

Birchler JA, Veitia RA. 2011. Protein-Protein and Protein-DNA dosage balance and differential paralog transcription factor retention in polyploids. Front. Plant Sci. 2:64.

Birchler JA, Veitia RA. 2012. Gene balance hypothesis: connecting issues of dosage sensitivity across biological disciplines. Proc. Natl. Acad. Sci. U. S. A. 109:14746–14753.

Bird KA, VanBuren R, Puzey JR, Edger PP. 2018. The causes and consequences of subgenome dominance in hybrids and recent polyploids. New Phytol. 220:87–93.

Blischak PD, Mabry ME, Conant GC, Chris Pires J. 2016. Integrating networks, phylogenomics, and population genomics for the study of polyploidy. Annual Review of Ecology, Evolution, and Systematics. Available from: http://dx.doi.org/10.1146/annurev-ecolsys-121415-032302

Bomblies K, Higgins JD, Yant L. 2015. Meiosis evolves: adaptation to external and internal environments. New Phytologist [Internet] 208:306–323. Available from: http://dx.doi.org/10.1111/nph.13499

Bomblies K, Jones G, Franklin C, Zickler D, Kleckner N. 2016. The challenge of evolving stable polyploidy: could an increase in “crossover interference distance” play a central role? Chromosoma [Internet] 125:287–300. Available from: http://dx.doi.org/10.1007/s00412-015-0571-4

Bomblies K, Madlung A. 2014. Polyploidy in the Arabidopsis genus. Chromosome Res. 22:117–134.

Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438.

Buggs RJA, Chamala S, Wu W, Tate JA, Schnable PS, Soltis DE, Soltis PS, Barbazuk WB. 2012. Rapid, repeated, and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Curr. Biol. 22:248–252.

Chalhoub B, Denoeud F, Liu S, Parkin IAP, Tang H, Wang X, Chiquet J, Belcram H, Tong C, Samans B, et al. 2014. Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 345:950–953.

Cheng F, Wu J, Fang L, Sun S, Liu B, Lin K, Bonnema G, Wang X. 2012. Biased gene fractionation and dominant gene expression among the subgenomes of Brassica rapa. PLoS One 7:e36442.

Chester M, Gallagher JP, Symonds VV, da Silva AVC, Mavrodiev EV, Leitch AR, Soltis PS, Soltis DE. 2012. Extensive chromosomal variation in a recently formed natural allopolyploid species, Tragopogon miscellus (Asteraceae). Proceedings of the National Academy of Sciences [Internet] 109:1176–1181. Available from: http://dx.doi.org/10.1073/pnas.1112041109

Cifuentes M, Grandont L, Moore G, Chèvre AM, Jenczewski E. 2010. Genetic regulation of meiosis in polyploid species: new insights into an old question. New Phytol. 186:29–36.

Clarke TH, Garb JE, Hayashi CY, Arensburger P, Ayoub NA. 2015. Spider transcriptomes identify ancient large-scale gene duplication event potentially important in silk gland evolution. Genome Biol. Evol. 7:1856–1870.

Clausen RE. 1941. Polyploidy in Nicotiana. The American Naturalist [Internet] 75:291–306. Available from: http://dx.doi.org/10.1086/280965

Comai L. 2005. The advantages and disadvantages of being polyploid. Nat. Rev. Genet. 6:836–846.

Comai L, Tyagi AP, Lysak MA. 2003. FISH analysis of meiosis in Arabidopsis allopolyploids. Chromosome Res. 11:217–226.

Conant GC, Birchler JA, Pires JC. 2014. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Curr. Opin. Plant Biol. 19:91–98.

Crismani W, Mercier R. 2012. What limits meiotic crossovers? Cell Cycle 11:3527–3528.

Defoort J, Van de Peer Y, Carretero-Paulet L. 2019. The evolution of gene duplicates in angiosperms and the impact of protein-protein interactions and the mechanism of duplication. Genome Biol. Evol. 11:2292–2305.

De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y. 2013. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc. Natl. Acad. Sci. U. S. A. 110:2898–2903.

Devos KM, Brown JKM, Bennetzen JL. 2002. Genome size reduction through illegitimate recombination counteracts genome expansion in Arabidopsis. Genome Res. 12:1075–1079.

Dodsworth S, Chase MW, Leitch AR. 2016. Is post-polyploidization diploidization the key to the evolutionary success of angiosperms? Botanical Journal of the Linnean Society [Internet] 180:1–5. Available from: http://dx.doi.org/10.1111/boj.12357

Douglas GM, Gos G, Steige KA, Salcedo A, Holm K, Josephs EB, Arunkumar R, Ågren JA, Hazzouri KM, Wang W, et al. 2015. Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris. Proc. Natl. Acad. Sci. U. S. A. 112:2806–2811.

Doyle JJ, Egan AN. 2010. Dating the origins of polyploidy events. New Phytol. 186:73–85.

Doyle JJ, Sherman-Broyles S. 2017. Double trouble: taxonomy and definitions of polyploidy. New Phytol. 213:487–493.

Edger PP, Poorten TJ, VanBuren R, Hardigan MA, Colle M, McKain MR, Smith RD, Teresi SJ, Nelson ADL, Wai CM, et al. 2019. Origin and evolution of the octoploid strawberry genome. Nat. Genet. 51:541–547.

Edger PP, Smith R, McKain MR, Cooley AM, Vallejo-Marin M, Yuan Y, Bewick AJ, Ji L, Platts AE, Bowman MJ, et al. 2017. Subgenome dominance in an interspecific hybrid, synthetic allopolyploid, and a 140-year-old naturally established neo-allopolyploid monkeyflower. The Plant Cell [Internet] 29:2150–2167. Available from: http://dx.doi.org/10.1105/tpc.17.00010

Emery M, Willis MMS, Hao Y, Barry K, Oakgrove K, Peng Y, Schmutz J, Lyons E, Pires JC, Edger PP, et al. 2018. Preferential retention of genes from one parental genome after polyploidy illustrates the nature and scope of the genomic conflicts induced by hybridization. PLoS Genet. 14:e1007267.

Escudero M, Martín-Bravo S, Mayrose I, Fernández-Mazuecos M, Fiz-Palacios O, Hipp AL, Pimentel M, Jiménez-Mejías P, Valcárcel V, Vargas P, et al. 2014. Karyotypic changes through dysploidy persist longer over evolutionary time than polyploid changes. PLoS One 9:e85266.

Farhat P, Hidalgo O, Robert T, Siljak-Yakovlev S, Leitch IJ, Adams RP, Dagher-Kharrat MB. 2019. Polyploidy in the conifer genus Juniperus: an unexpectedly high rate. Frontiers in Plant Science [Internet] 10. Available from: http://dx.doi.org/10.3389/fpls.2019.00676

Feldman M, Levy AA. 2012. Genome evolution due to allopolyploidization in wheat. Genetics 192:763–774.

Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol. 60:433–453.

Freeling M, Scanlon MJ, Fowler JE. 2015. Fractionation and subfunctionalization following genome duplications: mechanisms that drive gene content and their consequences. Curr. Opin. Genet. Dev. 35:110–118.

Freeling M, Woodhouse MR, Subramaniam S, Turco G, Lisch D, Schnable JC. 2012. Fractionation mutagenesis and similar consequences of mechanisms removing dispensable or less-expressed DNA in plants. Curr. Opin. Plant Biol. 15:131–139.

Fulton IW. 1950. Unilateral nuclear migration and the interactions of haploid mycelia in the Fungus Cyathus stercoreus. Proceedings of the National Academy of Sciences [Internet] 36:306–312. Available from: http://dx.doi.org/10.1073/pnas.36.5.306

Gaebelein R, Schiessl SV, Samans B, Batley J, Mason AS. 2019. Inherited allelic variants and novel karyotype changes influence fertility and genome stability in Brassica allohexaploids. New Phytol. [Internet]. Available from: http://dx.doi.org/10.1111/nph.15804

Garsmeur O, Schnable JC, Almeida A, Jourda C, D’Hont A, Freeling M. 2014. Two evolutionarily distinct classes of paleopolyploidy. Mol. Biol. Evol. 31:448–454.

Gastony GJ. 1991. Gene silencing in a polyploid homosporous fern: paleopolyploidy revisited. Proc. Natl. Acad. Sci. U. S. A. 88:1602–1605.

Gaut BS, Doebley JF. 1997. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc. Natl. Acad. Sci. U. S. A. 94:6809–6814.

Gaut BS, Wright SI, Rizzon C, Dvorak J, Anderson LK. 2007. Recombination: an underappreciated factor in the evolution of plant genomes. Nat. Rev. Genet. 8:77–84.

Gonzalo A, Lucas M-O, Charpentier C, Sandmann G, Lloyd A, Jenczewski E. 2019. Reducing MSH4 copy number prevents meiotic crossovers between non-homologous chromosomes in Brassica napus. Nat. Commun. 10:2354.

Gout J-F, Johri P, Arnaiz O, Doak TG, Bhullar S, Couloux A, Guérin F, Malinsky S, Sperling L, Labadie K, et al. Universal trends of post-duplication evolution revealed by the genomes of 13 Paramecium species sharing an ancestral whole-genome duplication. Available from: http://dx.doi.org/10.1101/573576

Hallinan NM, Lindberg DR. 2011. Comparative analysis of chromosome counts infers three paleopolyploidies in the Mollusca. Genome Biol. Evol. 3:1150–1163.

Harper AL, Trick M, He Z, Clissold L, Fellgett A, Griffiths S, Bancroft I. 2016. Genome distribution of differential homoeologue contributions to gene expression in bread wheat. Plant Biotechnol. J. 14:1207–1214.

Hauber DP, Reeves A, Stack SM. 1999. Synapsis in a natural autotetraploid. Genome [Internet] 42:936–949. Available from: http://dx.doi.org/10.1139/g99-026

Haufler CH. 1987. Electrophoresis is modifying our concepts of evolution in homosporous pteridophytes. Am. J. Bot. 74:953–966.

Haufler CH, Soltis DE. 1986. Genetic evidence suggests that homosporous ferns with high chromosome numbers are diploid. Proc. Natl. Acad. Sci. U. S. A. 83:4389–4393.

Hazarika MH, Rees H. 1967. Genotypic control of chromosome behaviour in rye X. Chromosome pairing and fertility in autotetraploids. Heredity [Internet] 22:317–332. Available from: http://dx.doi.org/10.1038/hdy.1967.44

Henry IM, Dilkes BP, Tyagi A, Gao J, Christensen B, Comai L. 2014. The BOY NAMED SUE quantitative trait locus confers increased meiotic stability to an adapted natural allopolyploid of Arabidopsis. The Plant Cell [Internet] 26:181–194. Available from: http://dx.doi.org/10.1105/tpc.113.120626

Hollister JD. 2015. Polyploidy: adaptation to the genomic environment. New Phytologist [Internet] 205:1034–1039. Available from: http://dx.doi.org/10.1111/nph.12939

Hollister JD, Arnold BJ, Svedin E, Xue KS, Dilkes BP, Bomblies K. 2012. Genetic adaptation associated with genome-doubling in autotetraploid Arabidopsis arenosa. PLoS Genetics [Internet] 8:e1003093. Available from: http://dx.doi.org/10.1371/journal.pgen.1003093

Hollister JD, Gaut BS. 2009. Epigenetic silencing of transposable elements: a trade-off between reduced transposition and deleterious effects on neighboring gene expression. Genome Res. 19:1419–1428.

Hollister JD, Smith LM, Guo Y-L, Ott F, Weigel D, Gaut BS. 2011. Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proc. Natl. Acad. Sci. U. S. A. 108:2322–2327.

Jackson RC, Jackson JW. 1996. Gene segregation in autotetraploids: prediction from meiotic configurations. American Journal of Botany [Internet] 83:673–678. Available from: http://dx.doi.org/10.1002/j.1537-2197.1996.tb12756.x

Jenczewski E, Eber F, Grimaud A, Huet S, Lucas MO, Monod H, Chèvre AM. 2003. PrBn, a major gene controlling homeologous pairing in oilseed rape (Brassica napus) haploids. Genetics 164:645–653.

Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473:97–100.

Jones GH, Franklin FCH. 2006. Meiotic crossing-over: obligation and interference. Cell 126:246–248.

Kondrashov FA, Kondrashov AS. 2006. Role of selection in fixation of gene duplications. J. Theor. Biol. 239:141–151.

Landergott U, Naciri Y, Jakob Schneller J, Holderegger R. 2006. Allelic configuration and polysomic inheritance of highly variable microsatellites in tetraploid gynodioecious Thymus praecox agg. Theoretical and Applied Genetics [Internet] 113:453–465. Available from: http://dx.doi.org/10.1007/s00122-006-0310-6

Lan X, Pritchard JK. 2016. Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals. Science 352:1009–1013.

Le Comber SC, Ainouche ML, Kovarik A, Leitch AR. 2010. Making a functional diploid: from polysomic to disomic inheritance. New Phytol. 186:113–122.

Leitch AR, Leitch IJ. 2008. Genomic plasticity and the diversity of polyploid plants. Science 320:481–483.

Lien S, Koop BF, Sandve SR, Miller JR, Kent MP, Nome T, Hvidsten TR, Leong JS, Minkley DR, Zimin A, et al. 2016. The Atlantic salmon genome provides insights into rediploidization. Nature [Internet]. Available from: http://dx.doi.org/10.1038/nature17164

Li F-W, Brouwer P, Carretero-Paulet L, Cheng S, de Vries J, Delaux P-M, Eily A, Koppers N, Kuo L-Y, Li Z, et al. 2018. Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nat Plants 4:460–472.

Li J-T, Hou G-Y, Kong X-F, Li C-Y, Zeng J-M, Li H-D, Xiao G-B, Li X-M, Sun X-W. 2015. The fate of recent duplicated genes following a fourth-round whole genome duplication in a tetraploid fish, common carp (Cyprinus carpio). Scientific Reports [Internet] 5. Available from: http://dx.doi.org/10.1038/srep08199

Liu Z, Adamczyk K, Manzanares-Dauleux M, Eber F, Lucas M-O, Delourme R, Chèvre AM, Jenczewski E. 2006. Mapping PrBn and other quantitative trait loci responsible for the control of homeologous chromosome pairing in oilseed rape (Brassica napus L.) haploids. Genetics [Internet] 174:1583–1596. Available from: http://dx.doi.org/10.1534/genetics.106.064071

Li Z, Baniaga AE, Sessa EB. 2015. Early genome duplications in conifers and other seed plants. Sci Adv [Internet]. Available from: http://advances.sciencemag.org/content/1/10/e1501084.abstract

Li Z, Barker MS. 2020. Inferring putative ancient whole-genome duplications in the 1000 Plants (1KP) initiative: access to gene family phylogenies and age distributions. Gigascience [Internet] 9. Available from: http://dx.doi.org/10.1093/gigascience/giaa004

Li Z, Defoort J, Tasdighian S, Maere S, Van de Peer Y, De Smet R. 2016. Gene duplicability of core genes is highly consistent across all angiosperms. The Plant Cell [Internet] 28:326–344. Available from: http://dx.doi.org/10.1105/tpc.15.00877

Li Z, Tiley GP, Galuska SR, Reardon CR, Kidder TI, Rundell RJ, Barker MS. 2018. Multiple large-scale gene and genome duplications during the evolution of hexapods. Proc. Natl. Acad. Sci. U. S. A. 115:4713–4718.

Lutz AM. 1907. A preliminary note on the chromosomes of Oenothera lamarckiana and one of its mutants, O. gigas. Science 26:151–152.

Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.

Lysak MA. 2014. Live and let die: centromere loss during evolution of plant chromosomes. New Phytol. 203:1082–1089.

Mable BK. 2004. “Why polyploidy is rarer in animals than in plants”: myths and mechanisms. Biol. J. Linn. Soc. Lond. 82:453–466.

Mandáková T, Joly S, Krzywinski M, Mummenhoff K, Lysak MA. 2010. Fast diploidization in close mesopolyploid relatives of Arabidopsis. Plant Cell 22:2277–2290.

Mandáková T, Li Z, Barker MS, Lysak MA. 2017. Diverse genome organization following 13 independent mesopolyploid events in Brassicaceae contrasts with convergent patterns of gene retention. Plant J. 91:3–21.

Mandáková T, Lysak MA. 2018. Post-polyploid diploidization and diversification through dysploid changes. Curr. Opin. Plant Biol. 42:55–65.

Mandáková T, Pouch M, Harmanová K, Zhan SH, Mayrose I, Lysak MA. 2017. Multispeed genome diploidization and diversification after an ancient allopolyploidization. Mol. Ecol. 26:6445–6462.

Martín AC, Shaw P, Phillips D, Reader S, Moore G. 2014. Licensing MLH1 sites for crossover during meiosis. Nat. Commun. 5:4580.

Ma X-F, Gustafson JP. 2005. Genome evolution of allopolyploids: a process of cytological and genetic diploidization. Cytogenet. Genome Res. 109:236–249.

Mayrose I, Barker MS, Otto SP. 2010. Probabilistic models of chromosome number evolution and the inference of polyploidy. Syst. Biol. 59:132–144.

Mayrose I, Zhan SH, Rothfels CJ, Magnuson-Ford K, Barker MS, Rieseberg LH, Otto SP. 2011. Recently formed polyploid plants diversify at lower rates. Science 333:1257.

McGrath JM, Hickok LG. 1999. Multiple ribosomal RNA gene loci in the genome of the homosporous fern Ceratopteris richardii. Can. J. Bot. 77:1199–1202.

McGrath JM, Hickok LG, Pichersky E. 1994. Assessment of gene copy number in the homosporous ferns Ceratopteris thalictroides and C. richardii (Parkeriaceae) by restriction fragment length polymorphisms. Plant Syst. Evol. 189:203–210.

Meyer A, Schartl M. 1999. Gene and genome duplications in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr. Opin. Cell Biol. 11:699–704.

Monnahan P, Brandvain Y. 2019. The effect of autopolyploidy on population genetic signals of hard sweeps. bioRxiv [Internet]:753491. Available from: https://www.biorxiv.org/content/10.1101/753491v1

Morgan C, Zhang H, Henry CE, Franklin FCH, Bomblies K. 2020. Derived alleles of two axis proteins affect meiotic traits in autotetraploid Arabidopsis arenosa. Proceedings of the National Academy of Sciences [Internet] 117:8980–8988. Available from: http://dx.doi.org/10.1073/pnas.1919459117

Morrison JW, Rajhathy T. 1960. Chromosome behaviour in autotetraploid cereals and grasses. Chromosoma [Internet] 11:297–309. Available from: http://dx.doi.org/10.1007/bf00328656

Muller HJ. 1925. Why polyploidy is rarer in animals than in plants. Am. Nat. 59:346–353.

Nakazato T, Barker MS, Rieseberg LH, Gastony GJ. 2008. Evolution of the nuclear genome of ferns and lycophytes. In: Biology and evolution of ferns and lycophytes. Cambridge University Press.

Nakazato T, Jung M-K, Housworth EA, Rieseberg LH, Gastony GJ. 2006. Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii. Genetics 173:1585–1597.

Nossa CW, Havlak P, Yue J-X, Lv J, Vincent KY, Brockmann H, Putnam NH. 2014. Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication. Giga Sci 3:9.

Ohno S. 1970. Evolution by gene duplication. Available from: http://dx.doi.org/10.1007/978-3-642-86659-3

One Thousand Plant Transcriptomes Initiative. 2019. One thousand plant transcriptomes and the phylogenomics of green plants. Nature 574:679–685.

Orr HA. 1990. “Why Polyploidy is Rarer in Animals Than in Plants” Revisited. Am. Nat. 136:759–770.

Otto SP. 2007. The evolutionary consequences of polyploidy. Cell 131:452–462.

Otto SP, Whitton J. 2000. Polyploid incidence and evolution. Annu. Rev. Genet. 34:401–437.

Pang E, Cao H, Zhang B, Lin K. 2015. Crop genome annotation: a case study for the Brassica rapa genome. Compendium of Plant Genomes [Internet]:53–64. Available from: http://dx.doi.org/10.1007/978-3-662-47901-8_5

Parisod C, Holderegger R, Brochmann C. 2010. Evolutionary consequences of autopolyploidy. New Phytologist [Internet] 186:5–17. Available from: http://dx.doi.org/10.1111/j.1469-8137.2009.03142.x

Paterson AH, Wang X, Li J, Tang H. 2012. Ancient and recent polyploidy in monocots. In: Polyploidy and Genome Evolution. p. 93–108.

Pfeifer M, Kugler KG, Sandve SR, Zhan B, Rudi H, Hvidsten TR, Mayer KFX, Olsen O-A, International Wheat Genome Sequencing Consortium. 2014. Genome interplay in the grain transcriptome of hexaploid bread wheat. Science [Internet] 345:1250091–1250091. Available from: http://dx.doi.org/10.1126/science.1250091

Pichersky E, Soltis D, Soltis P. 1990. Defective chlorophyll a/b-binding protein genes in the genome of a homosporous fern. Proc. Natl. Acad. Sci. U. S. A. 87:195–199.

Pires JC, Conant GC. 2016. Robust yet fragile: expression noise, protein misfolding, and gene dosage in the evolution of genomes. Annu. Rev. Genet. 50:113–131.

Popescu A-A, Huber KT, Paradis E. 2012. ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics 28:1536–1537.

Qiao X, Li Q, Yin H, Qi K, Li L, Wang R, Zhang S, Paterson AH. 2019. Gene duplication and evolution in recurring polyploidization–diploidization cycles in plants. Genome Biology [Internet] 20. Available from: http://dx.doi.org/10.1186/s13059-019-1650-2

Qu L, Hancock JF, Whallon JH. 1998. Evolution in an autopolyploid group displaying predominantly bivalent pairing at meiosis: genomic similarity of diploid Vaccinium darrowi and autotetraploid V. corymbosum (Ericaceae). American Journal of Botany [Internet] 85:698–703. Available from: http://dx.doi.org/10.2307/2446540

Ramsey J, Schemske DW. 1998. Pathways, mechanisms, and rates of polyploid formation in flowering plants. Annual Review of Ecology and Systematics [Internet] 29:467–501. Available from: http://dx.doi.org/10.1146/annurev.ecolsys.29.1.467

Ramsey J, Schemske DW. 2002. Neopolyploidy in flowering plants. Annual Review of Ecology and Systematics [Internet] 33:589–639. Available from: http://dx.doi.org/10.1146/annurev.ecolsys.33.010802.150437

Renny-Byfield S, Gong L, Gallagher JP, Wendel JF. 2015. Persistence of subgenomes in paleopolyploid cotton after 60 my of evolution. Mol. Biol. Evol. 32:1063–1071.

Ren R, Wang H, Guo C, Zhang N, Zeng L, Chen Y, Ma H, Qi J. 2018. Widespread whole genome duplications contribute to genome complexity and species diversity in angiosperms. Mol. Plant 11:414–428.

Revell LJ. 2012. phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution [Internet] 3:217–223. Available from: http://dx.doi.org/10.1111/j.2041-210x.2011.00169.x

Riley R, Chapman V. 1958. Genetic control of the cytologically diploid behaviour of hexaploid wheat. Nature [Internet] 182:713–715. Available from: http://dx.doi.org/10.1038/182713a0

Ruprecht C, Lohaus R, Vanneste K, Mutwil M, Nikoloski Z, Van de Peer Y, Persson S. 2017. Revisiting ancestral polyploidy in plants. Sci Adv 3:e1603195.

Sánchez-Morán E, Benavente E, Orellana J. 2001. Analysis of karyotypic stability of homoeologous-pairing (ph) mutants in allopolyploid wheats. Chromosoma 110:371–377.

Schnable JC, Freeling M, Lyons E. 2012. Genome-wide analysis of syntenic gene deletion in the grasses. Genome Biol. Evol. 4:265–277.

Schnable JC, Springer NM, Freeling M. 2011. Differentiation of the maize subgenomes by genome dominance and both ancient and ongoing gene loss. Proc. Natl. Acad. Sci. U. S. A. 108:4069–4074.

Schrider DR, Costello JC, Hahn MW. 2009. All human-specific gene losses are present in the genome as pseudogenes. J. Comput. Biol. 16:1419–1427.

Schubert I, Lysak MA. 2011. Interpretation of karyotype evolution should consider chromosome structural constraints. Trends Genet. 27:207–216.

Scott AD, Stenz NWM, Ingvarsson PK, Baum DA. 2016. Whole genome duplication in coast redwood (Sequoia sempervirens) and its implications for explaining the rarity of polyploidy in conifers. New Phytologist [Internet] 211:186–193. Available from: http://dx.doi.org/10.1111/nph.13930

Session AM, Uno Y, Kwon T, Chapman JA, Toyoda A, Takahashi S, Fukui A, Hikosaka A, Suzuki A, Kondo M, et al. 2016. Genome evolution in the allotetraploid frog Xenopus laevis. Nature 538:336–343.

Shi T, Huang H, Barker MS. 2010. Ancient genome duplications during the evolution of kiwifruit (Actinidia) and related Ericales. Ann. Bot. 106:497–504.

Shi T, Rahmani RS, Gugger PF, Wang M, Li H, Zhang Y, Li Z, Wang Q, Van de Peer Y, Marchal K, et al. 2020. Distinct expression and methylation patterns for genes with different fates following a single whole-genome duplication in flowering plants. Mol. Biol. Evol. [Internet]. Available from: http://dx.doi.org/10.1093/molbev/msaa105

Šmarda P, Horová L, Knápek O, Dieck H, Dieck M, Ražná K, Hrubík P, Orlóci L, Papp L, Veselá K, et al. 2018. Multiple haploids, triploids, and tetraploids found in modern-day “living fossil” Ginkgo biloba. Horticulture Research [Internet] 5. Available from: http://dx.doi.org/10.1038/s41438-018-0055-9

Šmarda P, Veselý P, Šmerda J, Bureš P, Knápek O, Chytrá M. 2016. Polyploidy in a “living fossil” Ginkgo biloba. New Phytol. [Internet]. Available from: http://onlinelibrary.wiley.com/doi/10.1111/nph.14062/full

Soltis DE, Visger CJ, Marchant DB, Soltis PS. 2016. Polyploidy: Pitfalls and paths to a paradigm. Am. J. Bot. 103:1146–1166.

Soltis PS, Marchant DB, Van de Peer Y, Soltis DE. 2015. Polyploidy and genome evolution in plants. Curr. Opin. Genet. Dev. 35:119–125.

Stebbins GL Jr. 1947. Types of polyploids: their classification and significance. Adv. Genet. 1:403–429.

Stebbins GL Jr. 1948. The chromosomes and relationships of Metasequoia and Sequoia. Science 108:95–98.

Stebbins GL Jr. 1950. Variation and evolution in plants. Available from: http://dx.doi.org/10.7312/steb94536

Stift M, Berenos C, Kuperus P, van Tienderen PH. 2008. Segregation models for disomic, tetrasomic and intermediate inheritance in tetraploids: a general procedure applied to Rorippa (yellow cress) microsatellite data. Genetics 179:2113–2123.

Tang H, Woodhouse MR, Cheng F, Schnable JC, Pedersen BS, Conant G, Wang X, Freeling M, Pires JC. 2012. Altered patterns of fractionation and exon deletions in Brassica rapa support a two-step model of paleohexaploidy. Genetics 190:1563–1574.

Tayalé A, Parisod C. 2013. Natural pathways to polyploidy in plants and consequences for genome reorganization. Cytogenet. Genome Res. 140:79–96.

Thomas BC, Pedersen B, Freeling M. 2006. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res. 16:934–946.

Van de Peer Y, Mizrachi E, Marchal K. 2017. The evolutionary significance of polyploidy. Nat. Rev. Genet. 18:411–424.

Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun J-H, Bancroft I, Cheng F, et al. 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43:1035–1039.

Wendel JF. 2015. The wondrous cycles of polyploidy in plants. Am. J. Bot. 102:1753–1756.

Wendel JF, Flagel LE, Adams KL. 2012. Jeans, genes, and genomes: cotton as a model for studying polyploidy. In: Polyploidy and Genome Evolution. p. 181–207.

Werth CR, Windham MD. 1991. A model for divergent, allopatric speciation of polyploid pteridophytes resulting from silencing of duplicate-gene expression. Am. Nat. 137:515–526.

Winge O. 1917. The chromosomes. Their numbers and general importance. Compt. Rend. Trav. du Lab. de Carlsberg 13:131–175.

Wolfe KH. 2001. Yesterday’s polyploids and the mystery of diploidization. Nature Reviews Genetics [Internet] 2:333–341. Available from: http://dx.doi.org/10.1038/35072009

Woodhouse MR, Cheng F, Pires JC, Lisch D, Freeling M, Wang X. 2014. Origin, inheritance, and gene regulatory consequences of genome dominance in polyploids. Proc. Natl. Acad. Sci. U. S. A. 111:5283–5288.

Woodhouse MR, Schnable JC, Pedersen BS, Lyons E, Lisch D, Subramaniam S, Freeling M. 2010. Following tetraploidy in maize, a short deletion mechanism removed genes preferentially from one of the two homeologs. PLoS Biol. 8:e1000409.

Wood TE, Takebayashi N, Barker MS, Mayrose I, Greenspoon PB, Rieseberg LH. 2009. The frequency of polyploid speciation in vascular plants. Proc. Natl. Acad. Sci. U. S. A. 106:13875–13879.

Xie J, Chen S, Xu W, Zhao Y, Zhang D. 2019. Origination and function of plant pseudogenes. Plant Signal. Behav. 14:1625698.

Xie J, Li Y, Liu X, Zhao Y, Li B, Ingvarsson PK, Zhang D. 2019. Evolutionary origins of pseudogenes and their association with regulatory sequences in plants. Plant Cell 31:563–578.

Xiong Z, Gaeta RT, Pires JC. 2011. Homoeologous shuffling and chromosome compensation maintain genome balance in resynthesized allopolyploid Brassica napus. Proc. Natl. Acad. Sci. U. S. A. 108:7908–7913.

Yant L, Hollister JD, Wright KM, Arnold BJ, Higgins JD, Franklin FCH, Bomblies K. 2013. Meiotic adaptation to genome duplication in Arabidopsis arenosa. Curr. Biol. 23:2151–2156.

Yoo M-J, Szadkowski E, Wendel JF. 2013. Homoeolog expression bias and expression level dominance in allopolyploid cotton. Heredity 110:171–180.

Yoshida M-A, Ishikura Y, Moritaki T, Shoguchi E, Shimizu KK, Sese J, Ogura A. 2011. Genome structure analysis of molluscs revealed whole genome duplication and lineage specific repeat variation. Gene 483:63–71.

Zhang H, Bian Y, Gou X, Zhu B, Xu C, Qi B, Li N, Rustgi S, Zhou H, Han F, et al. 2013. Persistent whole-chromosome aneuploidy is generally associated with nascent allohexaploid wheat. Proc. Natl. Acad. Sci. U. S. A. 110:3447–3452.

Zhang J. 2003. Evolution by gene duplication: an update. Trends in Ecology & Evolution [Internet] 18:292–298. Available from: http://dx.doi.org/10.1016/s0169-5347(03)00033-8

Zhang Z, Harrison PM, Liu Y, Gerstein M. 2003. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13:2541–2558.

Zhao M, Zhang B, Lisch D, Ma J. 2017. Patterns and consequences of subgenome differentiation provide insights into the nature of paleopolyploidy in plants. Plant Cell 29:2974–2994.

Zwaenepoel A, Van de Peer Y. 2019. Inference of ancient whole-genome duplications and the evolution of gene duplication and loss rates. Mol. Biol. Evol. 36:1384–1404.

Appendix D: Figures

Figure 1. Chromosome pairing behavior during meiosis in a diploid (white), an autopolyploid (red), and an allopolyploid (yellow). Chromosomes of the same size and color but different shades represent homologous chromosomes. Chromosomes of the same size but different colors (blue vs. gray) represent homoeologous chromosomes.


Figure 2. The frequency of strictly bivalent pairing (gray) vs. multivalent or mixed bivalent and multivalent pairing (black) among species reported as allo- or autopolyploids in Barker et al. (7). This meta-analysis is based on 208 species (Supplemental Table 1). The categories represent allopolyploids, autopolyploids, or all polyploids combined. The y-axis represents the number of species.
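The tally behind a bar chart of this kind can be sketched as follows. This is a minimal illustration, not the procedure used for the figure: the function name `tally_pairing` and the tuple layout are assumptions, and the four rows are copied from the first entries of Supplemental Table 1 only to show the input shape.

```python
from collections import Counter

# A few rows copied from Supplemental Table 1:
# (species, pairing behavior, ploidal state).
rows = [
    ("Arabidopsis arenosa", "bivalent", "auto"),
    ("Arabidopsis kamchatica", "bivalent", "allo"),
    ("Arabidopsis suecica", "bivalent", "allo"),
    ("Arachis monticola", "bivalent", "allo"),
]

def tally_pairing(rows):
    """Count species per (ploidal state, pairing behavior).

    An extra "all" category pools allo- and autopolyploids, matching the
    three categories named in the caption. Hypothetical helper.
    """
    counts = Counter()
    for _species, pairing, ploidy in rows:
        counts[(ploidy, pairing)] += 1
        counts[("all", pairing)] += 1
    return counts

counts = tally_pairing(rows)
for key in sorted(counts):
    print(key, counts[key])
```

Each bar height in the figure would correspond to one of these counts once all 208 species are included.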


Figure 3. The major processes and mechanisms of diploidization. From left to right, the abrupt transition from white to black represents a change from diploidy to polyploidy. The gradual transition from gray to white represents diploidization. The shade of color shows the hypothetical level of diploidization. The difference in shade between cytological and genic diploidization shows that they are independent processes that occur at different rates. Cytological diploidization involves chromosomal evolution leading to the restoration of bivalent pairing and disomic inheritance following polyploidy. Genic diploidization and fractionation involve gene removal and loss following polyploidy by molecular mechanisms such as pseudogenization and gene deletion by recombination.

Figure 4. The fraction of genes retained from a WGD plotted against the estimated median Ks value of that WGD in land plants. The x-axis represents the median Ks value of a WGD inferred by a mixture model in the gene age distribution analysis. The y-axis represents the fraction of genes retained from a WGD, estimated as the number of paralogs retained from the WGD divided by the total number of unigenes in the transcriptome. This study is based on 815 species of land plants (Supplemental Table 2; A: Bryophytes, 52 species; B: Lycophytes, 13 species; C: Ferns, 66 species; D: Gymnosperms, 73 species; E: Angiosperms, 610 species).
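The y-axis quantity defined in the caption is a simple ratio, sketched below. The function name `fraction_retained` and the example counts are hypothetical and are not taken from the dissertation's data.

```python
def fraction_retained(n_wgd_paralogs, n_unigenes):
    """Fraction of genes retained from a WGD.

    n_wgd_paralogs: number of paralogs attributed to the WGD peak
    n_unigenes: total number of unigenes in the transcriptome
    """
    if n_unigenes <= 0:
        raise ValueError("n_unigenes must be positive")
    if n_wgd_paralogs < 0:
        raise ValueError("n_wgd_paralogs must be non-negative")
    return n_wgd_paralogs / n_unigenes

# Hypothetical transcriptome: 2,500 WGD paralogs among 50,000 unigenes.
example = fraction_retained(2500, 50000)
print(example)
```

Because the denominator is the transcriptome's unigene count, the value depends on transcriptome completeness as well as on gene loss.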

Figure 5. Phylogenetically corrected rate of post-WGD paralog loss in land plants. Both the fraction of genes retained from a WGD (y-axis) and the estimated median Ks value of a WGD (x-axis) in land plants were corrected using phylogenetically independent contrasts (PIC). This study is based on 815 species of land plants (Supplemental Table 2; A: Bryophytes, 52 species; B: Lycophytes, 13 species; C: Ferns, 66 species; D: Gymnosperms, 73 species; E: Angiosperms, 610 species).
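For readers unfamiliar with PIC, the correction can be sketched with Felsenstein's contrast algorithm on a toy tree. This is a from-scratch illustration under stated assumptions, not the pipeline used for the figure: the three-tip tree, branch lengths, trait values, and the function names `independent_contrasts` and `slope_through_origin` are all hypothetical.

```python
from math import sqrt

def independent_contrasts(tree, values):
    """Standardized contrasts for one trait on a bifurcating tree.

    A tip is (name, branch_length); an internal node is
    (left, right, branch_length). Branch lengths must be positive at tips.
    """
    out = []

    def visit(node):
        if len(node) == 2:                      # tip
            name, length = node
            return values[name], length
        left, right, length = node
        x1, v1 = visit(left)
        x2, v2 = visit(right)
        out.append((x1 - x2) / sqrt(v1 + v2))   # standardized contrast
        # Weighted ancestral estimate and lengthened subtending branch.
        x = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
        return x, length + v1 * v2 / (v1 + v2)

    visit(tree)
    return out

def slope_through_origin(xc, yc):
    """Regression of y contrasts on x contrasts, forced through the origin."""
    return sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)

# Hypothetical tree ((A:1, B:1):1, C:2) with made-up trait values.
tree = ((("A", 1.0), ("B", 1.0), 1.0), ("C", 2.0), 0.0)
ks = {"A": 1.0, "B": 3.0, "C": 5.0}
retained = {"A": 0.30, "B": 0.20, "C": 0.10}

ks_contrasts = independent_contrasts(tree, ks)
retained_contrasts = independent_contrasts(tree, retained)
slope = slope_through_origin(ks_contrasts, retained_contrasts)
print(len(ks_contrasts), slope)
```

A tree with n tips yields n - 1 contrasts per trait, and the slope of the regression through the origin is read as the phylogenetically corrected relationship between the two variables.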


Appendix D: Supplementary Information

Supplemental tables

Supplemental Table 1. Summary of the frequency of strictly bivalent pairing vs. multivalent or mixed bivalent and multivalent pairing in allo- and autopolyploids.

# Genus Species Pairing-behavior category Ploidal state References

1 Arabidopsis Arabidopsis arenosa bivalent auto Comai et al., 2003

2 Arabidopsis Arabidopsis kamchatica bivalent allo Shimizu et al., 2005

3 Arabidopsis Arabidopsis suecica bivalent allo Comai et al., 2003

4 Arachis Arachis monticola bivalent allo Stalker, 1980

5 Artemisia Artemisia desertorum bivalent auto Tantray et al., 2020

6 Artemisia Artemisia maritima bivalent allo Koul, 1964

7 Artemisia Artemisia vulgaris bivalent auto Tantray et al., 2020; Dolyatari et al., 2013

8 Asplenium Asplenium ceterach bivalent auto Pinter et al., 2002

9 Asplenium Asplenium cristatum bivalent auto Morzenti, 1967

10 Asplenium Asplenium cuneifolium bivalent auto Sleep, 1983

11 Asplenium Asplenium cyprium bivalent auto Van den Heede et al., 2002

12 Cardamine Cardamine flexuosa bivalent allo Mandakova et al., 2014

13 Centaurea Centaurea jacea bivalent auto Gardou, 1972

14 Centaurium Centaurium erythraea bivalent auto Ubsdell, 1976

15 Cochlearia Cochlearia danica bivalent allo Gill, 1990

16 Cochlearia Cochlearia micacea bivalent auto Gill, 1973

17 Cochlearia Cochlearia officinalis bivalent auto Gill, 1973

18 Draba Draba porsildii bivalent auto Mulligan, 1974

19 Dryopteris Dryopteris campyloptera bivalent allo Walker, 1961

20 Dryopteris Dryopteris celsa bivalent allo Walker, 1962

21 Dryopteris Dryopteris chinensis bivalent auto Kurita, 1961

22 Dryopteris Dryopteris clintoniana bivalent allo Walker, 1962

23 Dryopteris Dryopteris crispifolia bivalent allo Gibby, 1983

24 Dryopteris Dryopteris cristata bivalent allo Walker, 1962

25 Dryopteris Dryopteris dilatata bivalent allo Sorsa and Widen, 1967

26 Dryopteris Dryopteris erythrosora bivalent auto Kurita, 1961

27 Dryopteris Dryopteris goldiana bivalent auto Walker, 1962

28 Dryopteris Dryopteris sparsa bivalent auto Darnaedi, 2016

29 Dryopteris Dryopteris spinulosa bivalent allo Britton, 1961

30 Eragrostis Eragrostis hypnoides bivalent auto Gerrit and Pohl, 1974

31 Eragrostis Eragrostis pilosa bivalent allo Christopher and Abraham, 1974

32 Eragrostis Eragrostis tef bivalent allo Tavassoli, 1986

33 Eragrostis Eragrostis tenella bivalent auto Christopher and Abraham, 1974

34 Eragrostis Eragrostis tremula bivalent allo Christopher and Abraham, 1974

35 Fuchsia Fuchsia boliviana bivalent allo Talluri, 2010

36 Fuchsia Fuchsia glazioviana bivalent allo Talluri, 2010

37 Fuchsia Fuchsia hatschbachii bivalent allo Talluri, 2010

38 Fuchsia Fuchsia lycioides bivalent allo Hoshino and Berry, 1989

39 Gagea Gagea villosa bivalent allo Mesicek and Hrouda, 1974

40 Geranium Geranium lucidum bivalent auto Kaur et al., 2010

41 Glycine Glycine tomentella bivalent allo Singh et al., 1987; Singh and Hymowitz, 1985; Newell and Hymowitz, 1987

42 Helianthus Helianthus decapetalus bivalent auto Kulshreshtha and Gupta, 1979

43 Helianthus Helianthus eggertii bivalent allo Atlagic, 1996

44 Helianthus Helianthus resinosus bivalent allo Atlagic, 1996

45 Helianthus Helianthus tuberosus bivalent allo Kulshreshtha and Gupta, 1979

46 Hordeum capense bivalent auto von Bothmer and Subrahmanyam, 1987

47 Hordeum Hordeum depressum bivalent auto Gupta and Fedak, 1985

48 Hordeum Hordeum marinum bivalent auto Eilam et al., 2009

49 Hordeum Hordeum murinum bivalent allo Cuadrado, 2013

50 Hordeum Hordeum secalinum bivalent allo von Bothmer and Subrahmanyam, 1987

51 Iris Iris spuria bivalent auto Ryan, 1983

52 Isoetes Isoetes japonica bivalent allo Takamiya et al., 1996

53 Isoetes Isoetes sinensis bivalent auto Takamiya et al., 1996; He et al., 2004

54 Lycopodium Lycopodium wightianum bivalent allo Wagner, 1992

55 Melampodium Melampodium argophyllum bivalent auto Stuessy et al., 2004

56 Nicotiana Nicotiana africana bivalent allo Burns, 1982

57 Orobanche Orobanche austrohispanica bivalent allo Piednoël et al., 2015

58 Orobanche Orobanche californica bivalent allo Schneeweiss et al., 2004

59 Orobanche Orobanche densiflora bivalent allo Piednoël et al., 2015

60 Orobanche Orobanche gracilis bivalent allo Schneeweiss et al., 2004

61 Orobanche Orobanche macrolepis bivalent allo Schneeweiss et al., 2004

62 Orobanche Orobanche pinorum bivalent allo Schneeweiss et al., 2004

63 Paeonia Paeonia obovata bivalent allo Sang et al., 2004

64 Paspalum Paspalum commune bivalent allo Sede et al., 2010

65 Paspalum Paspalum conjugatum bivalent auto Mehra and Chaudhary, 1979

66 Paspalum Paspalum dasypleurum bivalent allo Quarin and Caponio, 1995

67 Paspalum Paspalum denticulatum bivalent auto Sede et al., 2010

68 Paspalum Paspalum dilatatum bivalent allo Burson, 1979

69 Paspalum Paspalum intermedium bivalent auto Burson and Quarin, 1981

70 Paspalum Paspalum ionanthum bivalent allo Sede et al., 2010

71 Paspalum Paspalum mandiocanum bivalent allo Burson and Bennett, 1971

72 Paspalum Paspalum procurrens bivalent auto Hojsgaard et al., 2008

73 Paspalum Paspalum regnellii bivalent allo Pagliarini et al., 1997

74 Paspalum Paspalum urvillei bivalent allo Burson, 1979

75 Penstemon subserratus bivalent allo Keck, 1945

76 Persicaria Persicaria minor bivalent allo Sharma et al., 2013

77 Plantago media bivalent auto Qu et al., 1998

78 Plantago Plantago spathulata bivalent allo Wong and Murray, 2014

79 Plantago Plantago virginica bivalent allo Zhang et al., 2012

80 Ranunculus Ranunculus nivicola bivalent allo Rendle and Murray, 1989

81 Senecio eboracensis bivalent allo Lowe and Abbott, 2003

82 Senecio bivalent allo Kadereit, 1983

83 Silene Silene vulgaris bivalent auto Kumar et al., 2012

84 Solanum Solanum albicans bivalent allo Pendinen et al., 2012

85 Solanum Solanum stoloniferum bivalent allo Panahandeh, 2017

86 bivalent allo Clausen, 1931; Moore and Harvey, 1961

87 Viola Viola elatior bivalent auto Moore and Harvey, 1961

88 Viola Viola lactea bivalent allo Moore and Harvey, 1961

89 Viola bivalent auto Clausen, 1929

90 Viola Viola pumila bivalent allo Moore and Harvey, 1961

91 Viola bivalent auto Valentine, 1950

92 Viola bivalent allo Valentine, 1950

93 Arabidopsis Arabidopsis thaliana mix auto Santos et al., 2003

94 Achillea Achillea millefolium mix allo Vetter et al., 2014

95 Arachis Arachis glabrata mix auto Stalker, 1980

96 Arachis Arachis hagenbeckii mix auto Stalker, 1980

97 Arachis Arachis hypogaea mix allo Burow et al., 2001

98 Artemisia Artemisia arbuscula mix allo McArthur et al., 1981

99 Artemisia Artemisia bigelovii mix auto McArthur et al., 1981

100 Artemisia Artemisia douglasiana mix allo Estes, 1969

101 Artemisia Artemisia frigida mix auto McArthur et al., 1981

102 Artemisia Artemisia glauca mix auto Mir et al., 2015

103 Artemisia Artemisia ludoviciana mix auto Estes, 1969

104 Artemisia Artemisia rigida mix auto McArthur et al., 1981

105 Artemisia Artemisia rothrockii mix allo McArthur et al., 1981

106 Artemisia Artemisia tridentata tridentata mix auto McArthur et al., 1981

107 Artemisia Artemisia tridentata vaseyana mix auto McArthur et al., 1981

108 Artemisia Artemisia tridentata wyomingensis mix allo McArthur et al., 1981

109 Asplenium Asplenium ruta-muraria mix auto Rasbach et al., 1991

110 Asplenium Asplenium septentrionale mix auto Brownsey, 1977

111 Cardamine Cardamine pratensis mix auto Lawrence, 2005

112 Centaurium Centaurium wigginsii mix auto Broome, 1978

113 Cuphea Cuphea ericoides mix auto Graham, 1989

114 Dactylorhiza Dactylorhiza fuchsii mix auto Lord and Richards, 1977

115 Dactylorhiza Dactylorhiza purpurella mix allo Lord and Richards, 1977

116 Delphinium Delphinium elatum mix allo Gajewski, 1963

117 Delphinium Delphinium hansenii mix auto Gage, 1953

118 Delphinium Delphinium ruysii mix allo Gage, 1953

119 Delphinium Delphinium variegatum mix auto Gage, 1953

120 Fuchsia Fuchsia magellanica mix allo Talluri, 2013

121 Fuchsia Fuchsia regia mix auto Hoshino and Berry, 1989

122 Gagea Gagea bohemica mix allo Mesicek and Hrouda, 1974

123 Gagea Gagea pratensis mix allo Mesicek and Hrouda, 1974

124 Gagea Gagea reticulata mix auto Malik, 1960

125 Gossypium Gossypium hirsutum mix allo Sheidai et al., 1998; Webber, 1934

126 Helianthus Helianthus annuus mix auto Srivastava and Srivastava, 2002

127 Helianthus Helianthus ciliaris mix allo Jackson and Hauber, 1994

128 Helianthus Helianthus hirsutus mix allo Georgieva-Todorova and Bohorova, 1979

129 Hordeum Hordeum arizonicum mix allo Rajhathy and Symko, 1966

130 Hordeum Hordeum brachyantherum mix allo von Bothmer and Subrahmanyam, 1987

131 Hordeum Hordeum bulbosum mix auto Eilam et al., 2009

132 Hordeum Hordeum cordobense mix auto von Bothmer and Subrahmanyam, 1987

133 Hordeum Hordeum fuegianum mix allo von Bothmer et al., 1986

134 Hordeum Hordeum jubatum mix allo von Bothmer and Subrahmanyam, 1987

135 Hordeum Hordeum lechleri mix allo von Bothmer and Subrahmanyam, 1987

136 Hordeum Hordeum murinum mix auto von Bothmer and Subrahmanyam, 1987

137 Hordeum Hordeum parodii mix allo von Bothmer et al., 1986

138 Hordeum Hordeum procerum mix allo von Bothmer and Subrahmanyam, 1987

139 Hordeum Hordeum tetraploidum mix allo von Bothmer et al., 1986

140 Iris Iris ensata mix auto Yabuya et al., 1989

141 Iris Iris japonica mix auto Chimphamba, 1973

142 Isoetes Isoetes tuberculata mix auto Srivastava, 2003

143 Melampodium Melampodium cinereum mix auto Stuessy et al., 2004

144 Melampodium Melampodium leucanthum mix auto Stuessy et al., 2004

145 Nicotiana Nicotiana tabacum mix allo Olmo, 1936

146 Paeonia Paeonia officinalis mix allo Stebbins, 1948

147 Paeonia Paeonia peregrina mix allo Stebbins, 1948

148 Paeonia Paeonia wittmanniana mix allo Stebbins, 1948

149 Paspalum Paspalum almum mix auto Burson, 1975

150 Paspalum Paspalum arechavaletae mix allo Burson and Bennett, 1971

151 Paspalum Paspalum arundinaceum mix allo Sede et al., 2010

152 Paspalum Paspalum brunneum mix auto Burson, 1975

153 Paspalum Paspalum compressifolium mix allo Pagliarini et al., 2001

154 Paspalum Paspalum conspersum mix allo Burson, 1978

155 Paspalum Paspalum convexum mix auto Selva, 1975

156 Paspalum Paspalum coryphaeum mix auto Burson, 1975

157 Paspalum Paspalum cromyorhizon mix allo Burson and Bennett, 1971

158 Paspalum Paspalum dedeccae mix auto Quarin and Burson, 1991

159 Paspalum Paspalum distichum mix auto Katayama and Ikeda, 1975

160 Paspalum Paspalum durifolium mix allo Burson, 1985

161 Paspalum Paspalum exaltatum mix auto Burson and Bennett, 1971

162 Paspalum Paspalum glaucescens mix allo Pozzobon and Valls, 2000

163 Paspalum Paspalum guenoarum mix allo Burson and Bennett, 1971

164 Paspalum Paspalum limbatum mix auto Adamowski et al., 2005

165 Paspalum Paspalum lividum mix allo Burson and Bennett, 1971

166 Paspalum Paspalum maculosum mix auto Norman and Burson, 1989

167 Paspalum Paspalum malacophyllum mix auto Pagliarini et al., 2001

168 Paspalum Paspalum maritimum mix allo Adamowski et al., 2000

169 Paspalum Paspalum nicorae mix allo Burson and Bennett, 1970

170 Paspalum Paspalum notatum mix allo Dahmer et al., 2008

171 Paspalum Paspalum oteroi mix auto Novo et al., 2016

172 Paspalum Paspalum plicatulum mix allo Pagliarini et al., 2001

173 Paspalum Paspalum quadrifarium mix auto Speranza et al., 2003

174 Paspalum Paspalum rufum mix auto Burson, 1975

175 Paspalum Paspalum simplex mix auto Hojsgaard et al., 2008

176 Paspalum Paspalum virgatum mix allo Burson and Quarin, 1981

177 Penstemon Penstemon attenuatus mix allo Keck, 1945

178 Penstemon Penstemon confertus mix auto Keck, 1945

179 Penstemon Penstemon procerus mix auto Keck, 1945

180 Plantago Plantago coronopus mix auto Bocher et al., 1995

181 Plantago Plantago lagopus mix auto Bhan et al., 1990

182 Plantago Plantago major mix auto Bala and Gupta, 2011

183 Plantago Plantago ovata mix auto Kaul et al., 2005

184 Ranunculus Ranunculus cantoniensis mix allo Okada, 1988

185 Ranunculus Ranunculus ficaria mix auto Wislow and Pogan, 1981

186 Senecio Senecio brasiliensis mix allo Lopez et al., 2013

187 Senecio Senecio cambrensis mix allo Abbott et al., 2005

188 Senecio Senecio grisebachii mix allo Lopez et al., 2013

189 Senecio Senecio ragonesei mix allo Lopez et al., 2012; Lopez et al., 2013

190 Senecio Senecio subulatus mix allo Lopez et al., 2013

191 Senecio Senecio uspallatensis mix allo Lopez et al., 2013

192 Senecio Senecio viridis mix allo Lopez et al., 2012; Lopez et al., 2013

193 Solanum Solanum acaule mix allo Swaminathan, 1953

194 Solanum Solanum cardiophyllum mix auto Gavrilenko et al., 2007

195 Solanum Solanum demissum mix allo Ono et al., 2016

196 Solanum Solanum hjertingii mix allo Sangowawa, 1989

197 Solanum Solanum hougasii mix allo Dvora, 1981

198 Solanum Solanum longipedicellatum mix allo Swaminathan, 1953

199 Solanum Solanum luteum mix auto Dvora, 1981

200 Solanum Solanum nigrum mix auto Magoon et al., 1962; Kumar, 1981

201 Solanum Solanum tuberosum mix auto Swaminathan, 1953

202 Solanum Solanum vallis-mexici mix allo Marks, 1957

203 Stephanomeria Stephanomeria elata mix allo Sherman, 1996

204 Thinopyrum Thinopyrum curvifolium mix allo Liu and Wang, 1993

205 Thinopyrum Thinopyrum elongatum mix auto Liu and Wang, 1993

206 Thinopyrum Thinopyrum junceum mix allo Liu and Wang, 1993

207 Thinopyrum Thinopyrum ponticum mix allo Wang et al., 1991

208 Thinopyrum Thinopyrum sartorii mix allo Liu and Wang, 1992

Supplemental Table 2. List of fraction of genes retained from a WGD over estimated median Ks value of a WGD in land plants.

Bryophytes

#  1KP Code  Species  WGD code  Start Ks  End Ks  Median Ks  # Paleologs  # Unigenes  # Paleologs / # Unigenes

1 QMWB Anomodon attenuatus PHFOα 0.0918 1.4290 0.7391 2356 15290 0.1541

2 TMAJ Neckera douglasii PHFOα 0.0658 1.3614 0.6839 3707 21502 0.1724

3 WSPM Loeskeobryum brevirostre PHFOα 0.1111 1.3585 0.7555 1878 14293 0.1314

4 VBMM Claopodium rostratum PHFOα 0.0634 1.4999 0.7708 2871 15570 0.1844

5 MIRS Climacium dendroides PHFOα 0.1863 1.2359 0.7411 1661 15979 0.1039

6 TAVP Calliergon cordifolium PHFOα 0.0747 1.4728 0.7537 2632 17282 0.1523

7 EEMJ Thuidium delicatulum PHFOα 0.0849 1.4425 0.7421 1971 13611 0.1448

8 LNSF Stereodon subimponens PHFOα 0.0698 1.4109 0.7361 2283 16033 0.1424

9 IGUH Leucodon julaceus PHFOα 0.0525 1.4857 0.7969 3988 22425 0.1778

10 ZACW Leucodon brachypus PHFOα 0.0532 1.5140 0.7667 3638 17755 0.2049

11 QKQO Pseudotaxiphyllum elegans PHFOα 0.0615 1.3508 0.6730 2956 15538 0.1902

12 DHWX Fontinalis antipyretica PHFOα 0.0859 1.3777 0.7858 1768 16479 0.1073

13 WNGH Aulacomnium heterostichum PHFOα 0.0933 0.9918 0.5519 3858 19721 0.1956

14 CMEQ Orthotrichum lyellii PHFOα 0.1120 1.5188 0.8327 2618 16761 0.1562

15 JMXW Bryum argenteum PHFOα 0.1078 2.2034 1.1629 2367 14189 0.1668

16 XWHK Rosulabryum cf. capillare PHFOα 0.1698 1.9866 1.1325 4071 15212 0.2676

17 BGXB Plagiomnium insigne PHFOα 0.1248 1.1103 0.6456 2313 15124 0.1529

18 ORKS Philonotis fontana PHFOα 0.1161 0.9051 0.5045 3919 17434 0.2248

19 YWNF Hedwigia ciliata PHFOα 0.1719 0.9594 0.6079 2563 16533 0.1550

20 FFPD Ceratodon purpureus DISCα 0.0797 1.2141 0.6689 3338 16365 0.2040

21 GRKU Syntrichia princeps DISCα 0.1711 1.7734 0.9310 2755 18351 0.1501

22 NGTD Dicranum scoparium DISCα 0.0664 1.0716 0.5845 3187 18000 0.1771

23 RGKI Leucobryum glaucum DISCα 0.1429 1.0941 0.6538 2693 16159 0.1667

24 VMXJ Leucobryum albidum DISCα 0.0836 1.1527 0.6567 5523 20616 0.2679

25 ABCD Racomitrium elongatum DISCα 0.0571 1.1922 0.6524 2870 16274 0.1764

26 RDOO Racomitrium varium DISCα 0.0841 1.1651 0.6443 3886 18793 0.2068

27 BPSG Scouleria aquatica DISCα 0.1568 0.8333 0.4807 5987 21228 0.2820

28 ZQRI Timmia austriaca DISCα 0.0815 0.8268 0.4183 5822 19441 0.2995

29 YEPO Physcomitrium sp. PHPAα 0.1601 1.0907 0.6749 10353 26170 0.3956

30 KEFD Encalypta streptocarpa ENSTα 0.0833 0.9100 0.4935 3924 17971 0.2184

31 RCBT Sphagnum palustre SPPAα 0.0895 0.7855 0.4178 8681 24135 0.3597

32 GOWD Sphagnum lescurii SPPAα 0.1297 0.7912 0.4376 5817 17979 0.3235

33 UHLI Sphagnum recurvum SPPAα 0.1135 0.7882 0.4407 8765 21302 0.4115

34 IRBN Scapania nemorosa RLINβ 0.0430 1.6531 0.8015 2123 22250 0.0954

35 YBQN Odontoschisma prostratum RLINβ 0.0721 0.9226 0.4248 1514 16825 0.0900

36 RTMU Calypogeia fissa RLINβ 0.4112 2.5799 1.4754 2974 23061 0.1290

37 WZYK Bazzania trilobata RLINβ 0.0650 1.4775 0.6536 2104 14331 0.1468

38 LGOW Schistochila sp. SCHLα 0.0799 0.5885 0.3283 1244 16375 0.0760

39 KRUQ Porella navicularis RLINβ 0.0525 1.2109 0.4867 3122 15367 0.2032

40 UUHD Porella pinnata RLINβ 0.0702 1.4347 0.6618 2538 17557 0.1446

41 TGKW Frullania sp. FRULα 0.0649 0.6223 0.3231 886 16162 0.0548

42 YFGP Pallavicinia lyellii RLINβ 0.0598 1.5859 0.7113 2608 16533 0.1577

43 PIUF Pellia cf. epiphylla RLINβ 0.0716 1.5387 0.7131 3571 14654 0.2437

44 DXOU Nothoceros aenigmaticus LEIAβ 0.0892 2.2092 1.1089 1111 10257 0.1083

45 TCBC Nothoceros vincentianus LEIAβ 0.0660 2.0086 0.9628 2930 14988 0.1955

46 AKXB Phaeomegaceros coriaceus LEIAβ 0.0508 1.7660 0.8835 2502 14168 0.1766

47 FAJB Paraphymatoceros hallii LEIAβ 0.0405 1.7386 0.8533 3704 14186 0.2611

48 WEEQ Phaeoceros carolinianus LEIAβ 0.0457 1.8172 0.8621 3860 14561 0.2651

49 BSNI Anthoceros agrestis LEIAβ 0.0737 1.9871 1.0033 2121 11266 0.1883

50 TWUW Anthoceros agrestis LEIAβ 0.0492 1.6057 0.7247 2636 11210 0.2351

51 ZFRE Phaeoceros carolinianus LEIAβ 0.0646 2.7196 1.4128 16699 23513 0.7102

52 ANON Leiosporoceros dussii LEIAα 0.0847 1.3687 0.7351 581 8876 0.0655

Lycophytes

#  1KP Code  Species  WGD code  Start Ks  End Ks  Median Ks  # Paleologs  # Unigenes  # Paleologs / # Unigenes

1 ENQF Lycopodium annotinum LYCOα 0.1069 0.7509 0.4079 3959 16977 0.2332

2 WAFT Diphasiastrum digitatum LYCOα 0.1106 0.8049 0.4196 3847 15645 0.2459

3 PQTO Lycopodium deuterodensum LYCOα 0.1163 0.9261 0.5369 4724 17995 0.2625

4 XNXF Dendrolycopodium obscurum LYCOα 0.0002 0.1189 0.4634 914 14712 0.0621

5 ULKT Lycopodiella appressa LYAPα 0.1404 0.7999 0.4641 3986 16129 0.2471

6 UPMJ Pseudolycopodiella caroliniana LYAPα 0.1339 0.9981 0.6021 4423 17035 0.2596

7 GKAG Huperzia lucidula HUSQα 0.0858 0.3997 0.2216 4470 18903 0.2365

8 YHZW Huperzia selago HUSQα 0.0765 0.3984 0.2138 4787 22202 0.2156

9 CBAE Huperzia myrsinites HUSQα 0.1266 0.6722 0.4070 4172 15374 0.2714

10 GAON Huperzia squarrosa HUSQα 0.1227 0.6651 0.3902 4202 14618 0.2875

11 ZFGK Selaginella kraussiana SEKRα 0.1525 4.9952 1.9688 2318 14332 0.1617

12 PKOX Isoetes tegetiformans ISTEβ 0.0596 0.5567 0.3143 1366 17905 0.0763

13 PYHZ Isoetes sp. ISTEβ 0.0771 0.5742 0.3367 1269 17091 0.0742

Ferns

#  1KP Code  Species  WGD code  Start Ks  End Ks  Median Ks  # Paleologs  # Unigenes  # Paleologs / # Unigenes

1 JVSZ Equisetum hyemale EQUIα 0.1733 1.1475 0.7214 5105 18169 0.2810

2 CAPN Equisetum diffusum EQUIα 0.1228 1.4209 0.8048 3232 11616 0.2782

3 ALVQ Tmesipteris parva PSILα 0.0979 0.4152 0.2325 2771 18010 0.1539

4 QVMR Psilotum nudum PSILα 0.1044 0.5801 0.3525 2489 13462 0.1849

5 DJSE Ophioglossum petiolatum OPHIα 0.1721 1.8607 0.9488 4370 18419 0.2373

6 EEAQ Sceptridium dissectum SCEPα 0.0991 0.5039 0.2818 5998 19576 0.3064

7 BEGM Botrypus virginianus SCEPα 0.0846 0.3994 0.2000 2482 17653 0.1406

8 DFHO Danaea nodosa MARAα 0.0868 1.7471 0.9456 5592 15643 0.3575

9 UXCS Marattia sp. MARAα 0.1346 1.3483 0.7613 1519 7618 0.1994

10 UGNK Marattia attenuata MARAα 1.0320 1.7753 1.1090 3182 17790 0.1789

11 UOMY Osmunda sp. OSMNα 0.0954 1.5719 0.7694 5283 16462 0.3209

12 VIBO Osmunda javanica OSMNα 0.4185 1.5542 0.8863 3934 15176 0.2592

13 RFMZ Osmundastrum cinnamomeum OSMNα 0.0816 1.4831 0.7447 4052 13715 0.2954

14 FCHS Deparia lobato-crenata PTERα 0.1350 2.3670 1.3448 3130 14276 0.2192

15 MROH Thelypteris acuminata PTERα 0.1155 2.9231 1.4435 3066 12116 0.2531

16 OCZL Homalosorus pycnocarpos PTERα 0.1256 2.3008 1.2809 3840 15352 0.2501

17 AFPO Athyrium sp. PTERα 0.1256 2.2913 1.2712 5094 17235 0.2956

18 URCP Athyrium filix-femina PTERα 0.0986 2.3319 1.2693 4630 17101 0.2707

19 UFJN Diplazium wichurae PTERα 0.0940 2.1754 1.1780 6182 19231 0.3215

20 VITX Blechnum spicant PTERα 0.0807 2.4538 1.2831 5986 19493 0.3071

21 YJJY Woodsia scopulina PTERα 0.1302 2.1721 1.2753 4544 18663 0.2435

22 YQEC Woodsia ilvensis PTERα 0.0756 2.7404 1.4980 4525 17233 0.2626

23 XXHP Cystopteris fragilis PTERα 0.0805 2.2412 1.2260 5756 20433 0.2817

24 YOWV Cystopteris protrusa PTERα 0.0914 2.3817 1.3367 4452 17227 0.2584

25 LHLE Cystopteris fragilis PTERα 0.0937 1.9838 1.1999 4626 18841 0.2455

26 RICC Cystopteris reevesiana PTERα 0.1084 2.6704 1.4753 4072 16329 0.2494

27 CJNT Polypodium glycyrrhiza PTERα 0.0849 2.5829 1.3849 5261 18208 0.2889

28 ZRAV Polypodium hesperium PTERα 0.0937 2.8375 1.4782 6699 22479 0.2980

29 YLJA Polypodium amorphum PTERα 0.1059 2.5429 1.3766 4988 18046 0.2764

30 ZQYU Phlebodium pseudoaureum PTERα 0.1464 2.7457 1.5048 4629 21031 0.2201

31 ORJE Phymatosorus grossus PTERα 0.1382 2.2099 1.2826 3348 17798 0.1881

32 NWWI Nephrolepis exaltata PTERα 0.3162 4.9748 1.8831 6030 22430 0.2688

33 FQGQ Polystichum acrostichoides PTERα 0.0841 2.6524 1.4197 4852 18291 0.2653

34 WGTU Leucostegia immersa PTERα 0.1044 2.4461 1.3615 5467 19008 0.2876

35 RFRB truncatula PTERα 0.1394 2.9027 1.5676 2136 9625 0.2219

36 BMIF Asplenium cf. x lucrosum ASNIα 0.1009 0.6976 0.3398 7041 20917 0.3366

37 PSKY Asplenium nidus ASNIα 0.1195 0.8827 0.4114 3551 12987 0.2734

38 KJZG Asplenium platyneuron PTERα 0.1066 2.9202 1.6465 3787 17106 0.2214

39 MTGC Dennstaedtia davallioides PTERα 0.0873 2.2809 1.3241 3016 13746 0.2194

40 YCKE Notholaena montieliae PTERα 0.5304 3.2512 1.9111 3697 17653 0.2094

41 GSXD Myriopteris rufa PTERα 0.1265 2.8874 1.6041 4363 18076 0.2414

42 XDDT Argyrochosma nivea PTERα 0.0469 3.5587 1.8034 5090 18225 0.2793

43 BMJR Adiantum raddianum ADIAα 0.1518 0.9247 0.5342 5898 20578 0.2866

44 WCLG Adiantum aleuticum PTERα 0.0934 2.9347 1.5856 3761 17161 0.2192

45 SKYV Vittaria lineata VILIα 0.1230 1.0353 0.5628 4163 17282 0.2409

46 WQML Cryptogramma acrostichoides PTERα 0.0467 3.0428 1.6168 5067 18727 0.2706

47 FLTD Pteris ensiformis PTERα 0.0634 4.9982 2.1837 4209 16124 0.2610

48 POPJ Pteris vittata PTERα 0.0475 4.9987 2.0876 6062 19748 0.3070

49 UJTT Pityrogramma trifoliata PTERα 0.0943 0.4967 0.2113 4794 22114 0.2168

50 PIVW Ceratopteris thalictroides CERAα 0.0663 2.1135 1.1036 5954 19514 0.3051

51 NOKI Lindsaea linearis LINDα 0.0921 1.3076 0.7540 5696 21121 0.2697

52 YIXP Lindsaea microphylla LINDα 0.1580 1.2497 0.8000 3857 16809 0.2295

53 VVRN Lonchitis hirsuta LONCα 0.1358 0.7527 0.4662 4790 21705 0.2207

54 PNZO Culcita macrocarpa CYATβ 0.3297 1.7695 0.9689 5939 22348 0.2658

55 UWOD Plagiogyria japonica CYATβ 0.3624 1.8007 0.9815 6150 22964 0.2678

56 EWXK Thyrsopteris elegans CYATβ 0.3512 1.7172 0.9654 5505 20786 0.2648

57 GANB Alsophila pinulosa CYATα 0.1054 0.5354 0.2866 3941 15535 0.2537

58 CVEG Azolla cf. caroliniana AZOLα 0.1662 1.5038 0.8240 3752 22043 0.1702

59 KIIX Pilularia globulifera CYATγ 0.4644 3.4077 1.9688 4378 18491 0.2368

60 CQPW Anemia tomentosa LYGOα 0.0701 2.3196 1.2716 4580 19164 0.2390

61 PBUU Lygodium japonicum LYGOα 0.3070 2.3497 1.4138 2316 14702 0.1575

62 MEKP Dipteris conjugata HYMEα 0.0640 1.4967 0.8024 5683 18707 0.3038

63 XDVM Sticherus lobatus HYMEα 0.1722 1.1989 0.7369 335 2799 0.1197

64 QIAD Hymenophyllum bivalve HYMEα 0.0758 1.2693 0.6660 2283 11563 0.1974

65 TRPJ Hymenophyllum cupressiforme HYMEα 0.1566 1.6094 0.8078 405 3075 0.1317

66 TWFZ Crepidomanes venosum HYMEα 0.3899 2.2384 1.2099 211 2032 0.1038

Gymnosperms

#  1KP Code  Species  WGD code  Start Ks  End Ks  Median Ks  # Paleologs  # Unigenes  # Paleologs / # Unigenes

1 BUWV Platycladus orientalis METAα 0.1082 0.7729 0.4554 2641 18005 0.1467

2 XQSG Microbiota decussata METAα 0.0998 0.7449 0.4250 2682 18274 0.1468

3 CGDN Tetraclinis sp. METAα 0.0511 0.7912 0.4469 2659 17579 0.1513

4 FRPM Calocedrus decurrens METAα 0.0977 0.8191 0.4685 3201 18155 0.1763

5 QNGJ Cupressus dupreziana METAα 0.1087 0.8032 0.4716 2714 18537 0.1464

6 XMGP Juniperus scopulorum METAα 0.1510 0.7374 0.4588 1286 14739 0.0873

7 AIGO Chamaecyparis lawsoniana METAα 0.0924 0.7441 0.4056 3055 18173 0.1681

8 UEVI Fokienia hodginsii METAα 0.0963 0.7311 0.4062 2809 17477 0.1607

9 NKIN Thujopsis dolabrata METAα 0.1074 0.7015 0.3994 2399 16349 0.1467

10 VFYZ Thuja plicata METAα 0.1037 0.6802 0.3680 2079 14800 0.1405

11 IFLI Callitris gracilis METAα 0.1742 0.8755 0.5601 1262 15166 0.0832

12 RMMV Callitris macleayana METAα 0.0957 0.8058 0.4757 2093 16977 0.1233

13 AUDE Widdringtonia cedarbergensis METAα 0.0827 0.7714 0.4377 2025 18085 0.1120

14 GKCZ Diselma archeri METAα 0.0741 0.7844 0.4287 2595 17618 0.1473

15 OVIJ Papuacedrus papuana METAα 0.0932 0.7639 0.4478 1741 15947 0.1092

16 YYPE Austrocedrus chilensis METAα 0.0858 0.7724 0.4411 2832 18100 0.1565

17 ETCJ Pilgerodendron uviferum METAα 0.0616 0.8044 0.4564 2463 17494 0.1408

18 FHST Taxodium distichum METAα 0.0782 0.6755 0.3578 3119 18381 0.1697

19 OXGJ Glyptostrobus pensilis METAα 0.1024 0.6351 0.3461 2690 16813 0.1600

20 GMHZ Cryptomeria japonica METAα 0.0666 0.6731 0.3550 4517 20155 0.2241

21 HBGV Sequoia sempervirens METAα 0.1143 0.5677 0.3267 615 10506 0.0585

22 QFAE Sequoiadendron giganteum METAα 0.0001 0.0956 0.4209 600 14819 0.0405

23 NRXL Metasequoia glyptostroboides METAα 0.0001 0.0714 0.2915 1313 19078 0.0688

24 XIRK Athrotaxis cupressoides METAα 0.0485 0.6212 0.3198 2772 17719 0.1564

25 QSNJ Taiwania cryptomerioides METAα 0.0862 0.5942 0.3219 2725 18126 0.1503

26 ZQVF Cunninghamia lanceolata METAα 0.0721 0.5660 0.2943 2748 18292 0.1502

27 WWSS Taxus baccata METAα 0.1079 0.5974 0.3514 1415 14860 0.0952

28 BTTS Austrotaxus spicata METAα 0.0912 0.6367 0.3441 1781 16969 0.1050

29 YLPM Pseudotaxus chienii METAα 0.1102 0.6179 0.3540 1519 16725 0.0908

30 EFMS Torreya taxifolia METAα 0.1050 0.5304 0.3110 1322 17373 0.0761

31 IAJW Amentotaxus argotaenia METAα 0.0642 0.4894 0.2652 1735 17575 0.0987

32 NVGZ Cephalotaxus harringtonia METAα 0.0655 0.5452 0.3032 2781 18884 0.1473

33 YFZK Sciadopitys verticillata METAα 0.0803 0.5437 0.3045 1514 14914 0.1015

34 SCEB Podocarpus coriaceus PODOα 0.0633 0.5112 0.2462 1583 16571 0.0955

35 XLGK Podocarpus rubens PODOα 0.0864 0.5640 0.2839 1774 17178 0.1033

36 UUJS Nageia nagi PODOα 0.0696 0.5653 0.2721 1627 15586 0.1044

37 VGSX Retrophyllum minus PODOα 0.0599 0.5635 0.2716 2401 16781 0.1431

38 QHBI Falcatifolium taxoides PODOα 0.0841 0.6294 0.3200 2164 17384 0.1245

39 ROWR Falcatifolium taxoides PODOα 0.0936 0.6152 0.3182 1892 17032 0.1111

40 FMWZ Dacrycarpus compactus PODOα 0.0747 0.6081 0.2945 2309 17408 0.1326

41 MHGD Microcachrys tetragona PODOα 0.0612 0.5132 0.2474 1760 17717 0.0993

42 QCGM Saxegothaea conspicua PODOα 0.0719 0.5638 0.2672 1711 17039 0.1004

43 CDFR Manoao colensoi PODOα 0.0769 0.5502 0.2729 1886 16990 0.1110

44 ZQWM Lagarostrobos franklinii PODOα 0.0333 0.5715 0.2544 2576 16812 0.1532

45 JZVE Parasitaxus usta PODOα 0.0438 0.5146 0.2404 1339 15428 0.0868

46 BBDD Microstrobos fitzgeraldii PODOα 0.0723 0.5014 0.2375 1325 17907 0.0740

47 JRNA Phyllocladus hypophyllus PODOα 0.0797 0.6176 0.3077 2175 17692 0.1229

48 OWFC Halocarpus bidwillii PODOα 0.0793 0.5906 0.2857 2079 17646 0.1178

49 EGLZ Prumnopitys andina PODOα 0.0717 0.5168 0.2473 1304 15302 0.0852

50 KLGF Sundacarpus amarus PODOα 0.0701 0.4996 0.2459 1195 17868 0.0669

51 ACWS Agathis macrophylla GINKα 0.1256 2.2651 1.0979 5326 17670 0.3014

52 MIXZ Agathis robusta GINKα 0.1043 2.2260 1.0619 5162 17313 0.2982

53 RSCE Wollemia nobilis GINKα 0.1163 2.5395 1.1752 6070 17026 0.3565

54 XTZO Araucaria rulei GINKα 0.1092 2.3829 1.1566 5892 18099 0.3255

55 MFTM Pinus jeffreyi PINEα 0.0980 0.6143 0.3831 3866 19625 0.1970

56 DZQM Pinus radiata PINEα 0.0977 0.6280 0.3714 3874 18695 0.2072

57 IIOL Pinus parviflora PINEα 0.1080 0.6231 0.3887 3682 19317 0.1906

58 AWQB Picea engelmannii PINEα 0.0969 0.5481 0.3111 2658 16979 0.1565

59 NPRL Cathaya argyrophylla PINEα 0.1250 0.6012 0.3571 2063 19548 0.1055

60 IOVS Pseudotsuga wilsoniana PINEα 0.1124 0.6487 0.3732 3600 16519 0.2179

61 WVWN Larix speciosa PINEα 0.0899 0.5925 0.3452 3222 16345 0.1971

62 AREG Nothotsuga longibracteata PINEα 0.0948 0.5511 0.3196 2440 16547 0.1475

63 GAMH Tsuga heterophylla PINEα 0.0897 0.5799 0.3324 3400 17830 0.1907

64 AQFM Pseudolarix amabilis PINEα 0.0983 0.5160 0.3001 2346 17785 0.1319

65 JUWL Keteleeria evelyniana PINEα 0.1082 0.5422 0.3192 2565 18294 0.1402

66 VSRH Abies lasiocarpa PINEα 0.1011 0.5972 0.3371 3416 18006 0.1897

67 GGEA Cedrus libani PINEα 0.1116 0.5602 0.3240 2087 15828 0.1319

68 TOXE Welwitschia mirabilis WELMα 0.3242 1.7073 0.9885 3182 15669 0.2031

69 GNQG Encephalartos barteri GINKα 0.1248 1.4054 0.7753 6000 18030 0.3328

70 KAWQ Stangeria eriopus GINKα 0.1320 1.5934 0.9171 5067 17316 0.2926

71 WLIC Dioon edule GINKα 0.0929 1.4057 0.7517 4102 15416 0.2661

72 XZUY Cycas micholitzii GINKα 0.1496 1.4155 0.8000 3677 13269 0.2771

73 SGTW Ginkgo biloba GINKα 0.2398 1.0928 0.7724 1959 10551 0.1857
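The last column of Supplemental Table 2 is the number of paleologs divided by the number of unigenes. The sketch below recomputes that ratio for three bryophyte rows copied from the table and compares it with the reported value. The function name `retained_fraction` and the rounding tolerance are illustrative assumptions, not part of the original analysis.

```python
# Rows copied from Supplemental Table 2 (bryophytes):
# (1KP code, # paleologs, # unigenes, reported # paleologs / # unigenes)
rows = [
    ("QMWB", 2356, 15290, 0.1541),
    ("TMAJ", 3707, 21502, 0.1724),
    ("WSPM", 1878, 14293, 0.1314),
]


def retained_fraction(paleologs: int, unigenes: int) -> float:
    """Fraction of unigenes that are retained paleologs (hypothetical helper)."""
    return paleologs / unigenes


for code, paleologs, unigenes, reported in rows:
    fraction = retained_fraction(paleologs, unigenes)
    # The table reports four decimal places, so allow half a unit in the last place.
    agrees = abs(fraction - reported) < 5e-5
    print(f"{code}: computed {fraction:.4f}, reported {reported:.4f}, agrees={agrees}")
```

The same check can be run over any other row of the table by appending its code, counts, and reported fraction to `rows`.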


Supplemental Table 3. Summary statistics of phylogenetic signal, and of linear and exponential fits to the rate of post-WGD paralog loss in land plants before and after phylogenetic correction (PIC).

Ks

Group lambda P-value

Bryophytes 0.975479 0.00028562

Lycophytes 1.0198 2.58E-07

Ferns 0.816982 0.0262867

Gymnosperms 1.00291 7.59E-36

Angiosperms 0.608506 4.70E-10

Fraction

Group lambda P-value

Bryophytes 0.906518 1.30E-06

Lycophytes 0.562202 0.0931625

Ferns 6.66E-05 1

Gymnosperms 0.979765 1.10E-18

Angiosperms 0.457574 1.14E-12

Linear model before PIC

Group Adjusted R-squared Slope p-value

Bryophytes 0.002456 0.06567 0.2938

Lycophytes -0.0821 -0.01619 0.7704

Ferns -0.0004521 0.01324 0.3282

Gymnosperms 0.622 0.236 2.00E-16

Angiosperms 0.06364 -0.045062 1.51E-10


Exponential model before PIC

Group b value p-value

Bryophytes 0.4115 0.145

Lycophytes -0.0769 0.798

Ferns 0.05293 0.335

Gymnosperms 1.16058 2.00E-16

Angiosperms -0.20323 9.33E-10

Linear model after PIC

Group Adjusted R-squared Slope p-value

Bryophytes 0.02379 -0.092667 0.1451

Lycophytes -0.08253 0.044224 0.6964

Ferns 0.007037 -0.01945 0.2386

Gymnosperms 0.03806 0.1181599 0.05628

Angiosperms 0.09593 -5.13E-02 3.50E-15

Exponential model after PIC

Group b value p-value

Bryophytes NA NA

Lycophytes 9.006 0.743255

Ferns NA NA

Gymnosperms NA NA

Angiosperms -4.614 0.000183
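As a rough illustration of the kind of fits summarized in Supplemental Table 3, the sketch below fits a straight line and an exponential curve to the lycophyte rows of Supplemental Table 2 (median Ks against # paleologs / # unigenes). Several things here are assumptions rather than the dissertation's procedure: the fraction is treated as the response variable, the exponential model is taken to be y = a * exp(b * x) and is fitted by least squares on log(y), and no phylogenetic correction (PIC) is applied. The output is therefore not expected to reproduce the tabulated values.

```python
import numpy as np

# Lycophyte rows of Supplemental Table 2: median Ks and # paleologs / # unigenes.
ks = np.array([0.4079, 0.4196, 0.5369, 0.4634, 0.4641, 0.6021, 0.2216,
               0.2138, 0.4070, 0.3902, 1.9688, 0.3143, 0.3367])
frac = np.array([0.2332, 0.2459, 0.2625, 0.0621, 0.2471, 0.2596, 0.2365,
                 0.2156, 0.2714, 0.2875, 0.1617, 0.0763, 0.0742])


def linear_fit(x, y):
    """Ordinary least squares y = intercept + slope * x, with adjusted R-squared."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n = len(x)
    # One predictor, so the residual degrees of freedom are n - 2.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return float(slope), float(intercept), adj_r2


def exponential_fit(x, y):
    """Assumed model y = a * exp(b * x), fitted as a line through log(y)."""
    b, log_a = np.polyfit(x, np.log(y), 1)
    return float(np.exp(log_a)), float(b)


slope, intercept, adj_r2 = linear_fit(ks, frac)
a, b = exponential_fit(ks, frac)
print(f"linear: slope={slope:.5f}, adjusted R-squared={adj_r2:.4f}")
print(f"exponential (log-linear fit): a={a:.4f}, b={b:.4f}")
```

Fitting the exponential on the log scale keeps the example dependency-light; a nonlinear least-squares fit on the original scale would generally give a somewhat different b value.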