GBE
Phylotranscriptomic Analyses Reveal Asymmetrical Gene Duplication Dynamics and Signatures of Ancient Polyploidy in Mints
Grant T. Godden 1,*, Taliesin J. Kinser1,2, Pamela S. Soltis1, and Douglas E. Soltis1,2 1Florida Museum of Natural History, University of Florida 2Department of Biology, University of Florida
*Corresponding author: E-mail: g0ddengr@ufl.edu. Accepted: October 28, 2019 Data deposition: This project has been deposited at Dryad under the accession doi: 10.5061/dryad.qbzkh18cr.
Abstract Ancient duplication events and retained gene duplicates have contributed to the evolution of many novel plant traits and, conse- quently, to the diversity and complexity within and across plant lineages. Although mounting evidence highlights the importance of whole-genome duplication (WGD; polyploidy) and its key role as an evolutionary driver, gene duplication dynamics and mechanisms, both of which are fundamental to our understanding of evolutionary process and patterns of plant diversity, remain poorly char- acterized in many clades. We use newly available transcriptomic data and a robust phylogeny to investigate the prevalence, occur- rence, and timing of gene duplications in Lamiaceae (mints), a species-rich and chemically diverse clade with many ecologically, economically, and culturally important species. We also infer putative WGDs—an extreme mechanism of gene duplication—using large-scale data sets from synonymous divergence (KS), phylotranscriptomic, and divergence time analyses. We find evidence for widespread but asymmetrical levels of gene duplication and ancient polyploidy in Lamiaceae that correlate with species richness, including pronounced levels of gene duplication and putative ancient WGDs (7–18 events) within the large subclade Nepetoideae and up to 10 additional WGD events in other subclades. Our results help disentangle WGD-derived gene duplicates from those produced by other mechanisms and illustrate the nonuniformity of duplication dynamics in mints, setting the stage for future investigations that explore their impacts on trait diversity and species diversification. Our results also provide a practical context for evaluating the benefits and limitations of transcriptome-based approaches to inferring WGD, and we offer recommendations for researchers interested in investigating ancient WGDs in other plant groups. Key words: Lamiaceae, mints, phylotranscriptomics, gene duplication, whole-genome duplication, ancient polyploidy.
Introduction many extant lineages, are also found throughout angiosperm Duplicated genes are abundant in plant genomes and can phylogeny (Wood et al. 2009; Jiao et al. 2011, 2014; Husband arise via multiple mechanisms, including tandem duplications, et al. 2012; Tank et al. 2015; McKain et al. 2016; Landis et al. chromosomal segmental duplications, or whole-genome 2018). Ancient polyploidy has preceded the diversification of duplications (WGDs; polyploidy). On average, 64.5% of an- all seed plants, angiosperms, eudicots, and monocots (Tuskan notated genes in plant genomes have a duplicate copy, and et al. 2006; Tang et al. 2010; Jiao et al. 2011, 2012; Amborella comparative studies suggest that most are derived from WGD Genome Project 2013; Li et al. 2015). Furthermore, recent (Panchy et al. 2016). evidence indicates that WGD has occurred much more fre- In angiosperms, the prevalence of WGD and its role as a quently than traditionally appreciated both within angio- key driver in the diversification of species and traits have be- sperms (Landis et al. 2018) and in most land plant lineages come increasingly evident. An estimated 35% of all extant (One Thousand Plant Transcriptomes Initiative forthcoming). angiosperm species are considered recent polyploids, yet WGD can arise through multiple avenues in plants (Ramsey past polyploidy events, accompanying the divergence of and Schemske 1998; Comai 2005), and it is a common
ß The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense(http://creativecommons.org/licenses/by-nc/4.0/),whichpermitsnon- commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Genome Biol. Evol. 11(12):3393–3408. doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 3393 Godden et al. GBE
feature of many species-rich clades (Soltis et al. 2009). The within the context of a dated species tree (Jiao et al. 2011, novel genetic material resulting from WGD events (e.g., 2014). Comai 2005; Freeling and Thomas 2006; VandePeeretal. To better understand the prevalence, occurrence, and tim- 2009; Rensing 2014; Soltis et al. 2014, 2015) may contribute ing of gene duplications in Lamiaceae, we use a combination to key evolutionary innovations that are central to rapid radi- of KS, phylotranscriptomic, and divergence time approaches ations (e.g., Edger et al. 2015; Soltis and Soltis 2016;Green to document and characterize gene duplication dynamics and Plant Consortium, unpublished). For instance, ancient WGDs to infer ancient polyploidy in the clade, taking advantage of contributed to the proliferation of novel genes and gene inter- recently available transcriptomic resources for mints and a actions in Brassicales, vastly increasing specialized metabolite robust phylogeny based on data from 520 nuclear genes diversity in the clade, particularly within Brassicaceae (mus- (MEGC 2018). We compare patterns of putative ancient tards). Importantly, this expansion of chemical diversity was WGD events across 11 major mint subclades (¼traditional accompaniedbyanincreaseinspeciesdiversificationrates subfamilies) and correlate these patterns with their species- (Edger et al. 2015), highlighting the importance of WGD as level diversity. We also compare and discuss the congruence a driver of both chemical innovation and speciation in plants of results across different analyses as well as the efficacy of (Soltis et al. 2018). these approaches. Most importantly, we provide a set of hy- Despite increasing evidence for the prevalence of WGDs potheses for gene duplication dynamics and ancient poly- and their impacts on trait evolution (reviewed in Van de Peer ploidy that can be used in future investigations that explore et al. [2009]), gene duplication dynamics and mechanisms their impacts on mint diversity and diversification. remain underexplored in most plant clades. Lamiaceae (mints) are one of the most species-rich angiosperm families, with Materials and Methods more than 7,000 species, and possess a high degree of chem- ical diversity and many ecologically, economically, and cultur- Species Tree and Transcriptome Data Used in This Study ally important species appreciated by peoples worldwide. We sourced a robust species tree hypothesis (fig. 1a)and Beyond differences in mint genome sizes (MEGC 2018) and available leaf transcriptomes from 48 mint species and four well-documented examples of auto- and allo-polyploidy, the outgroup species from Lamiales (supplementary table S1, latter of which is common throughout the clade (Harley et al. Supplementary Material online) generated by the Mint 2004), little is known about gene duplication dynamics and Evolutionary Genomics Consortium (2018) from the Dryad mechanisms in Lamiaceae. For example, it is unclear how of- Digital Repository (doi:10.5061/dryad.tj1p3). The evolutionary ten WGD events have occurred throughout mint evolutionary sampling scheme from the Mint Evolutionary Genomics history and whether gene or WGD events have occurred with Consortium (2018) study reflects our current understanding greater frequency in species-rich mint lineages such as of Lamiaceae phylogeny and includes species from 11 of 12 Nepetoideae (>3,300 species) as compared with other line- recognized subclades, thereby representing very well the phy- ages with much lower species richness. logenetic diversity of mints and providing a suitable data set Recent WGDs are evident through comparison of chromo- for investigation of gene duplication dynamics and ancient some numbers, but evidence of ancient WGD is often erased WGD in the clade. All transcriptome data sets were prefiltered through postpolyploid diploidization, making detection from and included both Transdecoder-predicted (Haas et al. 2013) chromosome numbers difficult or impossible. Detection of peptides and coding sequences from representative transcrip- ancient WGD has relied largely on estimates of synonymous tome assemblies (RTAs)—that is, filtered transcriptomes that substitutions per synonymous site (KS) among paralogous se- included only the longest assembled isoform for each gene quence pairs present in EST or transcriptome data sets, which (see Mint Evolutionary Genomics Consortium [2018]for are subsequently used to infer the ages of gene duplications methods and details). within individual species (Maere et al. 2005). At the genome level, KS distributions representing all paralogous gene pairs allow for identification of putative WGD events, which appear Inferring WGDs from Estimates of Synonymous as punctuated episodes of large-scale gene duplication in time Divergence and stand out against background levels of gene duplication The fraction of synonymous substitutions per synonymous site arising from other processes (Lynch and Conery 2000; Cui (KS) between a given pair of paralogous sequences can be et al. 2006; Barker et al. 2008). Recently, researchers have used to estimate their time of duplication (Lynch and Conery used phylotranscriptomic approaches to infer and place puta- 2000; Blanc and Wolfe 2004; Maere et al. 2005; Cui et al. tive WGDs in a multispecies phylogenetic context. Two of 2006). At the genome level, KS distributions representing all these approaches consider gene or gene family trees and rec- paralogous gene pairs can reveal temporal patterns of gene oncile gene duplications using a species tree reference (Li et al. duplication that may reflect ancient WGD events (Lynch and 2015; McKain et al. 2016), whereas another method esti- Conery 2000). This methodology has been widely used to mates the ages of gene duplication events and places them assess putative ancient WGDs. We analyzed RTAs from 52
3394 Genome Biol. Evol. 11(12):3393–3408 doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 Gene and Whole-Genome Duplication Dynamics in Lamiaceae GBE
(a)(b) Monarda didyma 1.2 1.0 0.8 0.6 Density 0.4 0.2 0.0 0.5 1.0 1.5 2.0 Ks
Sign. incr. ↑ 0.0 Not sign. ≠ 0 Sign. dec. ↓
0.03 10 −1.0 log −2.0 MMonardaonarda didymadidyma (bandwidths) 0.5 1.0 1.5 2.0 N = 4003 MenthaMentha × piperitapiperita
MenthaMentha spspicataicata 1.2 Salvia hispanica OriganumOriganum mamajoranajorana MenthinaeMenthinae 1.0 OriganumOriganum vulvulgaregare ThyThymusmus vuvulgarislgaris 0.8 NepetaNepeta catariacataria 0.6 Density NepetaNepeta mussiniimussinii 0.4 AgastacheAgastache foeniculum NepetinaeNepetinae 0.2 MentheaeMentheae GlechomaGlechoma hederacea 0.0 HHyssopusyssopus oofficinalisfficinalis 0.5 1.0 1.5 2.0 Ks
LycopusLycopus ameramericanusicanus LycopinaeLycopinae Sign. incr. ↑ 0.0 Not sign. ≠ 0 ↓ PPrunellarunella vulgarisvulgaris PrunellinaePrunellinae Sign. dec. 10 −1.0 Salvia officinalisofficinalis log Rosmarinus officinalisofficinalis −2.0
(bandwidths) 0.5 1.0 1.5 2.0 Perovskia atriatriplicifoliaplicifolia SalviinaeSalviinae N = 5477 SSalviaalvia hispanichispanicaa
MelissaMelissa oofficinalisfficinalis 1.2 Lavandula angustifolia PlectranthusPlectranthus babarbatusrbatus 1.0 HyHyptisptis suaveosuaveolenslens OcimeaeOcimeae 0.8 OcimumOcimum basilicum 0.6
Lavandula angustifoliaangustifolia Density 0.4 Collinsonia canadensis ElsholtzieaeElsholtzieae 0.2 Perilla frutescensfrutescens 0.0 BBallotaallota ppseudodictamnusseudodictamnus 0.5 1.0 1.5 2.0 MarrubiumMarrubium vulgarevulgare Ks
Sign. incr. ↑ 0.0 Not sign. ≠ 0 LamiumLamium albumalbum ↓ Sign. dec.
LeonotisLeonotis leoleonurusnurus 10 −1.0
LLeonuruseonurus cardiacacardiaca log −2.0
Phlomis fruticosafruticosa (bandwidths) 0.5 1.0 1.5 2.0 BetonicaBetonica offofficinalisicinalis N = 6698
PogostemonPogostemon cablincablin Petraeovitex bambusetorium RRothecaotheca myrmyricoidesicoides 1.5 ClerodendrumClerodendrum bungeibungei TeucriumTeucrium cacanadensenadense 1.0 AjugaAjuga reptans Density PetraeovitexPetraeovitex babambusetorummbusetorum 0.5 HolmskioldiaHolmskioldia sanguineasanguinea ScutellariaScutellaria baicalensis 0.0 0.5 1.0 1.5 2.0 Cornutia pyramidatapyramidata Ks
Gmelina pphilippensishilippensis Sign. incr. ↑ 0.0 Not sign. ≠ 0 Sign. dec. ↓ PremnaPremna mmicrophyllaicrophylla 10 −1.0
VitexVitex agnus-castus log CongeaCongea tomentosatomentosa −2.0
(bandwidths) 0.5 1.0 1.5 2.0 Tectona grandisgrandis N = 5746 ProstantheraProstanthera lasialasianthosnthos WWestringiaestringia fruticosafruticosa 1.2 Prostanthera lasianthos CCallicarpaallicarpa americanaamericana 1.0 AureolariaAureolaria pectinatapectinata 0.8 LaLanceancea tibeticatibetica 0.6 Density PetreaPetrea vvolubilisolubilis 0.4 PaulowniaPaulownia tomentosatomentosa 0.2 0.0 0.5 1.0 1.5 2.0 Clade Color Key: Ks
Sign. incr. ↑ Ajugoideae Peronematoideae Symphorematoideae 0.0 Not sign. ≠ 0 Sign. dec. ↓
Callicarpoideae Premnoideae Tectonoideae 10 −1.0
Lamioideae Prostantheroideae Viticoideae log Nepetoideae Scutellarioideae Outgroups −2.0
(bandwidths) 0.5 1.0 1.5 2.0 N = 6698
FIG.1.—(a) Maximum likelihood phylogeny of Lamiaceae inferred from 520 single-copy nuclear genes by the Mint Evolutionary Genomics Consortium (2018). The tree topology is shown as a cladogram, with a phylogram illustrating the tree shape as an inset. Subfamilial classifications (Li et al. 2016; Li and Olmstead 2017) are color coded according to the provided key, and tribal and subtribal classifications for Nepetoideae (Harley et al. 2004; Drew and Sytsma
2012) are indicated with brackets. (b) KS (top) and SiZer (bottom) plots for a selection of five species sampled by our study. Gaussian distributions produced by mixture models are shown as overlays on each KS distribution, with red or blue color-coded peaks representing putative WGD events that were either corroborated or not corroborated, respectively, by a SiZer analysis (Chaudhuri and Marron 1999). As shown in plots below each KS distribution, SiZer tests for significant increases (blue) or decreases (red), or no significant changes (pink) across a distribution at various (log transformed) bandwidths to distinguish true data features from noise.
Genome Biol. Evol. 11(12):3393–3408 doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 3395 Godden et al. GBE
species (48 ingroup and 4 outgroup) with DupPipe (Barker our species tree topology for Lamiaceae (fig. 1a) does not have et al. 2010). DupPipe uses a reciprocal best-BLAST-hit ap- a stepwise branching pattern, we followed Li et al. (2015, proach to identify pairs of putatively paralogous transcripts 2018) and systematically pruned taxa from the tree in R using (gene duplicates) and estimates their pairwise synonymous theAPEpackage(Paradis et al. 2004), producing 12 stepwise site divergence (KS) from protein-guided DNA alignments guide trees for use with MAPS (supplementary fig. S1, with PAML and the F3 4 model (Yang 2007). The resulting Supplementary Material online). Each guide tree contained a KS estimates from each species were plotted as a histogram focal clade (i.e., a subclade) from the species tree, and the (KS plot) with age distributions from 0.1 to 2; all gene pairs taxon representation in its sister clade and in all other sub- with KS values below and above these upper and lower val- clades in the species tree was reduced to a single individual to ues, respectively, were removed to enable reliable inference of produce a tree with a stepwise branching pattern. Collectively, WGD events (e.g., Vanneste et al. 2013). Significant peaks in these guide trees enabled investigations of WGD events across each observed KS distribution were inferred with Gaussian most nodes in our (unpruned) species tree. Preparation of a mixture models, as implemented in the mixtools R package guide tree was not necessary for use of PUG. (Benaglia et al. 2009) with the expectation-maximization (EM) algorithm (McLachlan and Peel 2000), and the most likely Gene Family Circumscription number of Gaussian components that fit each distribution We precomputed reciprocal protein BLAST (BlastP) searches was tested with parametric bootstrap analyses (100 boot- among RTAs comprising Transdecoder-predicted peptide straps) of the likelihood ratio statistic using the “boot.comp” sequences from all 52 species using NCBI BLAST version function of mixtools. In cases where Gaussian peaks did not 2.2.31 (Altschul et al. 1990)withane-value threshold of align with K peaks or multiple peaks were placed in approx- S 0.001. Gene families (i.e., orthogroups comprising gene imately the same position, the number of components was orthologs and close paralogs) were circumscribed from the manually adjusted to better fit the K distribution. We further S BLAST results with OrthoFinder (Emms and Kelly 2015)ver- compared all mixture model components with results from a sion 0.7.1 using default parameters. The resulting gene family SiZer analysis (Chaudhuri and Marron 1999). SiZer tests for data were processed directly in preparation for the PUG anal- significant increases or decreases or no significant changes ysis. However, additional OrthoFinder processing was neces- across a distribution at various bandwidths to distinguish sary to prepare individual data sets for each of the MAPS true data features from noise. Values of K 2 and band- S analyses. For the latter, we systematically reclustered BLAST widths ranging from K ¼ 0.01 to K ¼ 2wereusedtoiden- S S output using the corresponding taxon sampling in each MAPS tify significant (a ¼ 0.05) features in each observed K S guide tree (supplementary fig. S1, Supplementary Material distribution. Shifts from significant increases to significant online), producing new gene family circumscriptions for decreases identified by SiZer were interpreted as true peaks, each MAPS data set. In all, 13 independent gene family and all peaks corroborating Gaussian mixture models were data sets with partial or complete taxon sampling (i.e., 12 inferred as WGD events. for MAPS and one for PUG, respectively) were produced for downstream filtering, multiple sequence alignment, tree esti- Inferring and Placing WGDs in a Phylogenetic Context mation, and analysis.
Although KS plots are useful for documenting putative WGD events that occurred within the evolutionary histo- Filtering, Multiple Sequence Alignment, and Tree ries of individual species, phylotranscriptomic approaches Estimation utilize transcriptome data from multiple species and test Amino acid sequences from gene families occupied by at least for the occurrence of putative WGDs across all nodes of a one sequence per species represented in a corresponding given species tree to facilitate detection of shared WGDs guide tree were parsed and passed to PASTA (Mirarab et al. across lineages. To characterize patterns of ancestral poly- 2015) for automated multiple sequence alignment and phy- ploidy in Lamiaceae, we inferred and placed putative logenetic inference; all remaining data were discarded. WGD events in the context of our best species tree Sequences were divided into subsets and aligned using hypothesis (fig. 1a) using two available pipelines: Multi- MAFFT (Katoh et al. 2002), with MUSCLE (Edgar 2004)and tAxon Paleopolyploidy Search (MAPS) (Li et al. 2015)and RAxML (Stamatakis 2014) used for pairwise merger of aligned Phylogenetic Placement of Polyploidy Using Genomes subsets and tree estimation, respectively, according to the (PUG) (McKain et al. 2016). default parameters set by PASTA for each tool. For each gene family, we ran PASTA until no improvement in the like- Guide Tree Preparation lihood score was reached after three iterations using a cen- The MAPS algorithm searches gene family tree topologies and troid break strategy. We followed Li et al. (2015) and identifies where paralogs coalesce within the context of a compiled the best-scoring amino acid-based PASTA trees for user-provided species tree with stepwise branching. Because each data set for use as tree input in their respective MAPS
3396 Genome Biol. Evol. 11(12):3393–3408 doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 Gene and Whole-Genome Duplication Dynamics in Lamiaceae GBE
analyses. To prepare tree input for PUG, we followed McKain punctuated gene duplication that are indicative of putative et al. (2016). Corresponding coding sequences were forced ancient WGD events (Lynch and Conery 2000). We also com- onto the amino acid alignments from PASTA using PAL2NAL pared these results with putative WGD events inferred with KS (Suyama et al. 2006), and gene family trees were estimated and other phylotranscriptomic approaches. from the resulting nucleotide alignments. All phylogenetic analyses with nucleotide sequence data were conducted us- Divergence Time Estimations ing RAxML version 8.2.10 and the GTRGAMMA model, and We extracted a set of gene family trees for molecular dating topological uncertainty in each of the resulting maximum like- that could be reliably rooted with all outgroup species (using lihood (ML) trees was assessed with 100 rapid bootstraps. The the APE package in R) and contained putative gene paralog resulting bootstrapped (nucleotide-based) ML trees were pairs identified by PUG. Divergence times were estimated for compiled into a single file for analysis with PUG. species in our species tree for Lamiaceae (fig. 1a) and for paralog pairs in each gene family tree assuming a relaxed MAPS and PUG Analyses molecular clock, as implemented in treePL (Smith and Weran12MAPSanalysestoinferandplaceputativeWGD O’Meara 2012) with a penalized likelihood approach events in Lamiaceae in a phylogenetic context. A compiled set (Sanderson 2002). We used the “prime” option in treePL of multispecies gene family phylogenies and corresponding to identify the best optimization parameters for each analy- guide tree were used as input for each analysis. Using MAPS sis, which was followed by a “thorough” analysis using output, we identified nodes in each guide tree that exhibited those parameters. The best smoothing parameter for each shared duplications among descendent taxa in 40% of gene analysis was determined by cross-validation. We used the family trees with congruent topologies and had higher levels following age constraints for the estimation procedure: a of duplication relative to neighboring nodes. We interpreted minimum and maximum age of 59.99 and 70.94 Myr, re- nodes at or above this threshold as putative WGD nodes, fol- spectively, for the crown node of Lamiaceae, and a maxi- lowing Li et al. (2015) and Barker et al. (2016). We summa- mum age of 107 Myr for the root node (Lamiales crown). rized results across all MAPS analyses by resolving WGD nodes These constraints correspond to the lower and upper bounds in each guide tree to our species tree topology. For compar- of the 95% highest posterior density interval of estimated ison, we also ran an analysis with PUG to infer and place WGD ages reported for the Lamiaceae crown (Yao et al. 2016)and events in the context of our species tree. The the oldest estimated age (upper confidence interval) “estimate_paralogs” option in PUG was used to identify all reported for the Lamiales crown (Janssens et al. 2009), re- putative gene paralog pairs in each gene family phylogeny, spectively. All divergence times for paralog pair coalescence whose coalescence nodes were subsequently identified and nodes inferred with PUG were extracted with the APE R resolved to our species tree topology. We accounted for pos- package and compiled for downstream analyses; sible phylogenetic uncertainty in each gene family tree topol- Estimated ages for constrained nodes were not considered. ogy and applied a stringent minimum bootstrap threshold to filter our results and confidently place gene duplication events Statistical Analyses within the context of our species tree; following McKain et al. We extracted and analyzed the estimated divergence times for (2016) and Unruh et al. (2018), only paralog pair coalescence paralog pairs representing unique duplication events resolved nodes with ML bootstrap values 80 in each gene family tree by PUG to individual nodes in our species tree and used these were retained and interpreted as part of the final data set. We ages to place duplications within the context of our time- used an R script, PUG_Figure_Maker.R (Unruh et al. 2018), to calibrated mint phylogeny. We also assessed whether gene identify putative WGDs in our species tree. The script identifies duplication events at each node were clustered in time (i.e., the maximum number of paralog pairs mapped to any given representing 1 putative WGD event). For the latter, we fol- node in the species tree and labels the stem lineage subtend- lowed Cui et al. (2006) andcomparedeachobserveddistri- ing this and all other nodes with 10% of the maximum bution of gene duplication ages with an expected (null) number as having putative WGD events. distribution that was generated under a constant-rate birth– death model. The null model forms a declining exponential Inferring and Placing WGDs in a Temporal Context distribution whose decay rate was informed by an observed Although the phylotranscriptomic approaches described age distribution. The null distribution differs from possible above are useful for placing gene duplication events, they WGD scenarios, where observed distributions are expected do not account for factors such as the phylogenetic timing to show temporal patterns of punctuated gene duplication. and ages of gene duplications, both of which are useful for We estimated a decay parameter for nodes in our species tree identifying putative shared WGD events. We investigated that had observed age distributions with samples sizes 5and both factors and placed gene duplications within a used it to randomly generate a null distribution that was sub- time-calibrated phylogeny to identify temporal patterns of sampled to reflect the range and size of the observed
Genome Biol. Evol. 11(12):3393–3408 doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 3397 Godden et al. GBE
distribution. The decay parameter for each test node corre- Kendall’s tau-b (sb) correlation coefficient, which provides a sponded to its expected gene death rate, which was estimated nonparametric measure of the strength and direction of as- from the sample mean of its observed age distribution. Unlike sociation between these variables and enables adjustments Cui et al. (2006), who incorporated error into their analyses for ties within the rankings. Plant name records from The
(with KS data), we accounted for possible outliers in our ob- Plant List (www.theplantlist.org) were downloaded and cu- served age distributions and removed the 1.5th and 99.5th rated, and numbers of currently recognized species were percentiles prior to estimating decay parameters and generat- summed for each of the following clades (¼traditional sub- ing null distributions. We also accounted for possible subsam- families) to produce species richness estimates: Ajugoideae, pling artifacts when simulating null distributions and repeated Callicarpoideae, Lamioideae, Nepetoideae, Peronematoideae, the process 5,000 times. To compare the observed distribution Premnoideae, Prostantheroideae, Scutellarioideae, of gene duplication ages at each test node with each simu- Symphorematoideae, Tectonoideae, and Viticoideae (see lated null model, we used the bootstrap version of the univar- fig. 1a). Cumulative levels of ancient polyploidy were com- iate Kolmogorov–Smirnov (KS) test, as implemented with piled for each clade by summing the number of putative an- 1,000 replicates in R with the “ks.boot” function (http:// cient WGD events inferred from MAPS, PUG, and divergence sekhon.polisci.berkeley.edu/matching/ks.boot.html). times results, respectively, and placed at or near nodes within Our strategy produced a distribution of 5,000 boot- each clade and all ancestral nodes through the most recent strapped P values for each test node, which we examined common ancestor (MRCA) of Lamiaceae. All statistical analy- to identify observed age distributions that bore signatures of ses were conducted in R version 3.5.3 (R-Core-Team 2019). putative ancient WGDs. If KS tests with an observed age dis- tribution yielded a relatively uniform distribution of P values Results (0 P 1), we concluded that the distribution was largely consistent with the null model (Breheny et al. 2018). WGDs Inferred from KS Distributions Conversely, if the P value distributions showed a large skew We observed peaks of gene duplication consistent with WGD toward marginal or significant P values (P < 0.05), we con- events in the KS distributions of all study species (fig. 1b and cluded that the distribution was inconsistent with the null supplementary fig. S2, Supplementary Material online). Our model and performed additional tests to characterize possible mixture modeling identified two to four significant peaks per age clustering. For the latter, we inferred data features within KS distribution, but only one to three of these were corrobo- each observed distribution using finite Gaussian mixture mod- rated by our SiZer results (supplementary fig. S2 and table S2, els, as implemented using the Mclust function of the Supplementary Material online). At least 1 putative WGD “mclust” R package with the EM algorithm (Fraley and event per each of the 52 species analyzed was inferred. Raftery 2002). For the univariate data used here, the EM al- However, 17 species showed evidence for 2 putative gorithm identifies data features under variable variance mod- WGDs, and 1 species, Paulownia tomentosa (Thunb.) els and tests a range of possible numbers of mixture Steud., had significant peaks corresponding to 3 putative components representing individual gene duplication age WGDs. The range of mean KS values estimated by mixtools clusters; the optimal number of components is selected using for peaks confirmed by SiZer ranged from 0.21 to 1.657 (sup- the Bayesian Information Criterion. We tested a range of one plementary table S2, Supplementary Material online). to three possible mixture components per distribution and inferred the optimal number of components within each. Given that mixture models could incorrectly infer one or Phylogenetic Analyses and Placement of WGDs more components within some broad age distributions (Johnson et al. 2016), we performed an additional test to Gene Family Circumscription, Multiple Sequence assess whether mixture model components represented Alignment, and Tree Estimation “true” data features (i.e., clusters of gene duplication in The number of transcriptomes (RTAs) and genes comprising time). We conducted a SiZer analysis to visualize the data each MAPS data set varied considerably (supplementary fig. and test for significant data features in each distribution, as S1 and table S3, Supplementary Material online), with 5–18 described above for KS analyses. Clusters of gene duplication species and 227,344–830,078 genes represented in each set, ages inferred with mixture models that were corroborated by respectively. OrthoFinder circumscribed 67% of the genes our SiZer results were interpreted as putative WGD events. in each data set into gene families (orthogroups) and yielded a range of 25,731–46,731 gene families per data set (mean ¼ 34,124; median ¼ 33,319). More than 8,000 gene families in Inferring Associations between Species Richness and each MAPS data set had full species representation, and Cumulative WGDs sequences from these families were filtered for phylogenetic Associations between lineage-specific species richness and analysis (supplementary table S3, Supplementary Material on- cumulative levels of ancient polyploidy were examined using line). We successfully generated amino acid sequence
3398 Genome Biol. Evol. 11(12):3393–3408 doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 Gene and Whole-Genome Duplication Dynamics in Lamiaceae GBE
alignments and corresponding ML trees with PASTA for gene the MRCA of Nepetoideae and a grade of Tectonoideae þ families comprising each filtered MAPS data set, albeit with a Symphorematoideae þ Viticoideae s.s., Premnoideae þ few exceptions. PASTA failed to complete analysis for 18 gene Scutellarioideae þ Peronematoideae þ Ajugoideae þ families distributed among 4 data sets, and we excluded these Lamoideae (fig. 2a: N46), the MRCA of Scutellarioideae and data from downstream analyses. The OrthoFinder analysis of Premnoideae þ Scutellarioideae þ Preonematoideae þ sequence data comprising our PUG data set, which included Ajugoideae þ Lamoideae (fig. 2a: N17), the MRCA of RTAs from all 52 species and over 2 million gene sequences, Peronematoideae and Ajugoideae þ Lamoideae (fig. 2a: yielded 89,266 gene families containing two-thirds of the to- N16), and at a single node each within Lamioideae (fig. 2a: tal genes analyzed (supplementary table S3, Supplementary N13) and Ajugoideae (fig. 2a: N6). Material online). In all, 6,336 gene families had full species representation, and multiple sequence alignments and ML PUG Results trees were successfully generated for 97% of these families Our PUG analysis identified 168,547 putative paralog and used for the PUG analysis; PASTA alignments or RAxML pairs within the filtered set of 6,162 gene family trees. trees were not obtained for 174 gene families. However, only 10,690 of these pairs resolved to nodes supporting our species tree topology (fig. 1a)andsur- MAPS Results passed our minimum bootstrap threshold (ML BS 80) for inclusion in the final data set (supplementary table S5, Analyses of gene family tree data and guide trees with MAPS Supplementary Material online). These pairs were inter- identified putative WGD events in 10 of 12 data sets (supple- preted as uniquely mapped duplication events and pro- mentary fig. S1, Supplementary Material online). As summa- vided evidence for 21 putative WGD events distributed rized on our best species tree hypothesis for Lamiaceae, we across the species tree (fig. 3), including one WGD within identified pronounced gene duplication levels consistent with the outgroup stem lineage (201 duplication events; fig. 3: WGDs at 14 nodes (fig. 2a). Each of these nodes showed a N50), a second WGD within the Lamiaceae stem lineage characteristic upsurge in gene duplication levels relative to (332 duplication events; fig. 3: N47), and 19 additional neighboring nodes and had a high proportion of gene family WGDs within Lamiaceae (fig. 3:descendantsofN47). trees with shared gene duplications observed across descen- Regarding the latter, it is noteworthy that the greatest dent taxa (i.e., 40% of the gene trees analyzed were topo- number of uniquely mapped duplications and inferred logically congruent with respect to the guide tree and showed WGD events were observed for Nepetoideae (i.e., 15 evidence for duplication at that node) (supplementary fig. S1 WGDs and 8,397 duplication events in total, including and table S4, Supplementary Material online). Because the the Nepetoideae stem lineage [fig. 3: N45]), with pro- species tree was subsampled to generate stepwise guide trees nounced levels of duplication observed within for MAPS, six species tree nodes (fig. 2a: N16, N17, N27, N41, Mentheae. Outside Nepetoideae, additional WGDs were N44,andN46) were represented more than once in our set of inferred within the stem lineages of Prostantheroideae guide trees and, consequently, were tested more than once and Lamioideae (fig. 3: N1 and N14, respectively), across individual MAPS analyses. MAPS consistently inferred within the stem lineage of Ballota þ Marrubium putative WGD events at only three of these nodes (fig. 2a: (Lamioideae) (fig. 3: N8),andwithinthestemlineage N17, N41,andN46). As for the remaining nodes with puta- of Clerodendrum þ Teucrium þ Ajuga (Ajugoideae) tive WGDs, eight were tested only once (fig. 2a: N2, N6, N13, (fig. 3: N6). N25, N29, N34, N37,andN39). The total number of putative WGDs inferred by MAPS as well as the numbers and propor- tions of gene family trees showing shared gene duplications Divergence Time Results across MAPS data sets were most pronounced within Of the 6,181 gene family trees comprising the PUG data set, Nepetoideae, especially within the species-rich subclade 4,021 trees contained well-supported (ML BS 80) nodes Mentheae. We inferred putative WGDs at the node repre- representing gene duplication events in our species tree (sup- senting the MRCA of Mentheae (fig. 2a: N44) and at five plementary table S5, Supplementary Material online). additional nodes within most major subclades comprising However, only 910 of these trees could be reliably rooted this clade (fig. 2a: N29, N34, N37, N39,andN41). Two ad- with all outgroup species and were filtered for phylogenetic ditional WGDs were identified by MAPS within dating. The final dated tree set included estimated ages for Nepetoideae—one at the node representing the MRCA of 1,868 uniquely mapped gene duplication events correspond- Elscholzieae þ Ocimeae (fig. 2a: N27) and another at the ingto43speciestreestemlineages(supplementary tables S5 MRCA of Ocimeae (fig. 2a: N25)—for a sum of eight WGD and S6, Supplementary Material online), and significant events in Nepetoideae. Outside Nepetoideae, MAPS inferred (P < 0.05) KS test results revealed that 12 of these had non- six WGDs. These events were placed at the MRCA uniform data distributions with clustered gene duplication of Callicarpoideae and Prostantheroideae (fig. 2a: N2), ages (supplementary figs. S3 and S4, Supplementary
Genome Biol. Evol. 11(12):3393–3408 doi:10.1093/gbe/evz239 Advance Access publication November 5, 2019 3399 odne al. et Godden 3400 otie eedpiain e tr eoepttv G oe,wihsoe ulctosin duplications showed which nodes, WGD putative denote stars Red (see duplication. output gene PUG a to contained correspond numbers ( Node analyses. analyses. MAPS across independent comparisons 12 of one least at from identified (see hypothesis ulcto eesrltv onihoignds ee to plots. Refer nodes. corresponding neighboring to relative levels duplication F IG .2. eoeBo.Evol. Biol. Genome —( a uaieWDeet nLmaeeifre n lcdwt h ASppln ( pipeline MAPS the with placed and inferred Lamiaceae in events WGD Putative ) g 1 fig. r 4acetplpod vns e tr eoendswt rnucdgn ulcto eescnitn ihWD htwere that WGDs with consistent levels duplication gene pronounced with nodes denote stars Red events. polyploidy ancient 14 are ) N39 ( N38 Monarda didyma a )
N40 Mentha × piperita N47 b N36 Mentha spicata xmlso ud re sdwt AS(et,wt lt rgt hwn h ecnaeo utesi ahaayi that analysis each in subtrees of percentage the showing (right) plots with (left), MAPS with used trees guide of Examples ) 11)39–48di1.03geez3 dac cespbiainNvme ,2019 5, November publication Access Advance doi:10.1093/gbe/evz239 11(12):3393–3408 N37 N46 Origanum majorana N41
N22 Origanum vulgare N47 N32 Thymus vulgaris N21 N46 Nepeta cataria N43 N20
N45 Nepeta mussinii N35
N18 Agastache foeniculum N44 Glechoma hederacea N17 N34 N33 N47 N43 (
b Hyssopus officinalis N44
N16 Nepetoideae N46 ) N41 Lycopus americanus N42 N28 N15 N30 N45 N40
N47 Prunella vulgaris N29
N14 Salvia officinalis N27 N39 N2
N13 Rosmarinus officinalis N31 N23 N38 N45 upeetr gr S2 figure supplementary Perovskia atriplicifolia N1 N12 Salvia hispanica N24 N11 Melissa officinalis N25 N26 N10 Petrea volubilis Westringia fruticosa Tectona grandis hispanica Salvia Ocimum basilicum Perilla frutescens Collinsonia canadensis Monarda didyma Monarda spicata Mentha Petrea volubilis Westringia fruticosa Tectona grandis Hyptis suaveolens hispanica Salvia vulgaris Prunella hederacea Glechoma Thymus vulgaris Mentha Callicarpa americana Callicarpa Petrea volubilis Tectona grandis Westringia fruticosa Prostanthera lasianthos Plectranthus barbatus Hyptis suaveolens N27 N9 Ocimum basilicum N23
N8 Lavandula angustifolia ×
piperita Collinsonia canadensis N8 N9 Perilla frutescens Rotheca myricoides Cornutia pyramidataCornutia Holmskioldia sanguinea Pogostemon cablin Betonica officinalis fruticosa Phlomis cardiaca Leonurus Leonotis leonurus Lamium album vulgare Marrubium Ballota pseudodictamnus Petrea volubilis Westringia fruticosa vulgare Origanum Tectona grandis Vitex agnus-castus Gmelina philippensis Petraeovitex bambusetorum Ballota pseudodictamnus N10 N11 Marrubium vulgare Lamium album N12
, Leonotis leonurus N13 N46 upeetr Material Supplementary N14 Leonurus cardiaca Lamioideae .002 .007 1.00 0.75 0.50 0.25 0.00 .002 .007 1.00 0.75 0.50 0.25 0.00 .002 .007 1.00 0.75 0.50 0.25 0.00 Phlomis fruticosa
N15 Betonica officinalis