Accepted Manuscript1.0
Total Page:16
File Type:pdf, Size:1020Kb
774 • American Journal of Botany One of the first molecular studies investigating these deep rela- clusters to investigate support for contentious interfamilial relation- tionships used three plastid and two mitochondrial loci; the authors ships, ancient gene and genome duplications, heterogeneous rates concluded that the data set was unable to resolve several interfamil- of evolution, and conflicting signals among genes. We examined ial relationships (Anderberg et al., 2002). These relationships were the deep relationships of Ericales to determine whether the data revisited by Schönenberger et al. (2005) with 11 loci (two nuclear, strongly support any resolution. We also consider the possibility two mitochondrial, and seven chloroplast). Schönenberger et al. that the data are unable to resolve these relationships (e.g., a hard or (2005) found that maximum parsimony and Bayesian analyses pro- soft polytomy) despite previous strong support for alternative res- vided support for the resolution of some early diverging lineages. olutions and thousands of transcriptomic sequences. While future Rose et al. (2018) utilized three nuclear, nine mitochondrial, and developments in methods and sampling will likely continue to elu- 13 chloroplast loci in a concatenated supermatrix consisting of cidate many contentious relationships across the plant phylogeny, 49,435 aligned sites and including 4531 ericalean species but with we consider whether a polytomy may represent a more justifiable 87.6% missing data. Despite the extensive taxon sampling utilized representation of the evolutionary history of a clade than any single in Rose et al. (2018), several relationships were only poorly sup- fully bifurcating species tree given our current resources. ported, including several deep divergences that the authors show In applying a phylogenetic consensus approach that considers to be the result of an ancient, rapid radiation. The 1KP initiative several methodological alternatives, we examined disagreement (Leebens-Mack et al., 2019) analyzed transcriptomes from across among methods. We explore this approach as a means of provid- green plants, including 25 species of Ericales; their results suggest ing valuable information about whether the available data may be that whole-genome duplications (WGDs) have occurred several insufficient to confidently resolve a single, bifurcating species tree, times in ericalean taxa. However, the equivocal support they re- even if a given methodology may suggest a resolved topology with cover for several interfamilial relationships within Ericales high- strong support. Our investigation of the evolution of Ericales may lights the need for a more thorough investigation of the biological be particularly well-suited to initiate a discussion about the affect and methodological sources of phylogenetic incongruence in the of topological uncertainty on our ability to confidently resolve the group (Anderberg et al., 2002; Bremer et al., 2002; Schönenberger placement of rare evolutionary events (e.g., whole-genome duplica- et al., 2005; Rose et al., 2018; Leebens-Mack et al., 2019). tion, major morphological innovation), the prevalence of biological Despite the increasing availability of genome-scale data polytomies across the Tree of Life, and when polytomies may be sets, many relationships across the Tree of Life remain controversial. considered useful representations of evolutionary relationships in Research groups recover different answers to the same evolution- the postgenomic era. ary questions, often with seemingly strong support (e.g., Shen et al., 2017). One benefit of genome-scale data for phylogenetics (i.e., phy- logenomics) is the ability to examine conflicting signals within and MATERIALS AND METHODS among data sets, which can be used to help understand conflicting species tree results and conduct increasingly comprehensive inves- Taxon sampling and de novo assembly procedures tigations as to why some relationships remain elusive. A key find- ing in the phylogenomics literature has been the high prevalence of We assembled a data set consisting of coding sequence data from conflicting phylogenetic signals among genes at contentious nodes 97 transcriptomes and genomes—the most extensive exome data (e.g., Brown et al., 2017b; Reddy et al., 2017; Vargas et al., 2017; set for Ericales to date (Appendix S1). The data set was constructed Walker et al., 2018a). Such conflict may be the result of biological by obtaining the ericalean transcriptomes included in Matasci processes (e.g., introgression, incomplete lineage sorting, horizontal et al. (2014) and Vargas et al. (2019). In addition, at least one se- gene transfer), but can also occur because of lack of phylogenetic quencing run for every species with data available on the Sequence information or other methodological artifacts (Richards et al., 2018; Read Archive was downloaded, with preference given to those with Gonçalves et al., 2019; Walker et al., 2019). By identifying regions of the most reads (Leinonen et al., 2010). Outgroup sampling included high conflict, it becomes possible to determine areas of the phylog- cornalean transcriptomes from Matasci et al. (2014) and several ref- eny where additional analyses are warranted and future sampling erence genomes from Ensembl Plants (http://plants.ensem bl.org). efforts might prove useful. Transcriptomes provide information Samples assembled from raw reads were done so with Trinity ver- from hundreds to thousands of coding sequences per sample and sion 2.5.1 using default parameters and the “–trimmomatic” option have elucidated many of the most contentious relationships in the (Grabherr et al., 2011; Kodama et al., 2011; Matasci et al., 2014). green plant phylogeny (e.g., Simon et al., 2012; Wickett et al., 2014; Assembled reads were translated using TransDecoder v5.0.2 (Haas Walker et al., 2017; Leebens-Mack et al., 2019). Transcriptomes also et al., 2013) with the option to retain Basic Local Alignment Search provide information about gene and genome duplications not pro- Tool (BLAST) hits to a custom protein database consisting of Beta vided by most other common sequencing protocols for nonmodel vulgaris (Amaranthaceae), Arabidopsis thaliana (Brassicaceae), and organisms. Gene duplications are often associated with import- Daucus carota (Apiaceae) obtained from Ensembl plants (Altschul ant molecular evolutionary events in taxa, but duplicated genes are et al., 1990). The translated amino acid sequences from each as- also inherited through descent and should therefore contain evi- sembly were reduced by clustering with CD-HIT version 4.6 with dence for how clades are related. By leveraging the multiple lines of settings -c 0.995 -n 5 (Fu et al., 2012). As a quality control step, nu- phylogenetic evidence and the large amount of data available from cleotide sequences for the chloroplast genes rbcL and matK were transcriptomes, several phylogenetic hypotheses can be generated extracted from each assembly using a custom script (see Data and tested to gain a holistic understanding of contentious nodes in Availability Statement) and then queried using BLAST against the the Tree of Life. National Center of Biotechnology Information (NCBI) online da- In this study, we sought to understand the evolutionary history tabase (Altschul et al., 1990). In cases where the top hits were to a of Ericales by analyzing sequences from thousands of homolog species not closely related to that of the query, additional sequences May 2020, Volume 107 • Larson et al.—A consensus approach to resolving the history of Ericales • 775 were investigated and if contamination, misidentification, or other (Appendix S1). Previous work has suggested that Ericales is sister to issues seemed likely, the transcriptome was not included in further the euasterids with Cornales sister to Ericales+euasterids (Stevens, analyses. The final transcriptome sampling included 86 ingroup taxa 2001 onward), therefore we treated Helianthus annuus (Asteraceae) spanning 17 of the 22 families within Ericales as recognized by the and Solanum lycopersicum (Solanaceae) as ingroup taxa for the Angiosperm Phylogeny Group (Appendix S1; Chase et al., 2016). purposes of the RT procedure so that orthologs could be rooted on a non-ericalean taxon after ortholog identification. Orthologs were not extracted from homolog groups with more than 5000 tips be- Homology inference cause of the uncertainty in reconstructing very large homolog trees We used the hierarchical clustering procedure from Walker et al. (Walker et al., 2018b). Orthologs with sequences from fewer than 50 (2018b) for homology identification. In short, the method involves ingroup taxa were discarded to reduce the amount of missing data performing an all-by-all BLAST procedure on user-defined clades; in downstream analyses. The tree-aware ortholog identification em- homolog clusters identified within each clade are then combined ployed here should provide the best available safeguard against mis- recursively with clusters of a sister clade based on sequence similar- identified orthology, which could mislead phylogenetic analyses ity until clusters from all clades have been combined. To assign taxa (Eisen, 1998; Gabaldón, 2008; Yang and Smith, 2014; Brown and to groups for clustering, we identified coding sequences from the Thomson, 2016). genes rpoC2, rbcL, nhdF, and matK using the same script as for the The resulting ortholog trees were then filtered to require at least quality control step. Sequences from each