bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1 Characterizing gene tree conflict in plastome-inferred phylogenies 2 3 4 Joseph F. Walker1,2, Gregory W. Stull1,2, Nathanael Walker-Hale1, Oscar M. Vargas1, and Drew 5 A. Larson1 6 7 8 1Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 9 48109, USA 10 11 2Authors for correspondence: J. F. Walker ([email protected]) and G. W. Stull 12 ([email protected]) 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

29 ABSTRACT 30 • Premise of the study: Evolutionary relationships among have been inferred 31 primarily using chloroplast data. To date, no study has comprehensively examined the 32 plastome for gene tree conflict. 33 • Methods: Using a broad sampling of angiosperm plastomes, we characterized gene tree 34 conflict among plastid genes at various time scales and explore correlates to conflict (e.g., 35 evolutionary rate, gene length, molecule type). 36 • Key results: We uncover notable gene tree conflict against a backdrop of largely 37 uninformative genes. We find gene length is the strongest correlate to concordance, and 38 that nucleotides outperform amino acids. Of the most commonly used markers, matK 39 greatly outperforms rbcL; however, the rarely used gene rpoC2 is the top-performing 40 gene in every analysis. We find that rpoC2 reconstructs angiosperm phylogeny as well as 41 the entire concatenated set of protein-coding chloroplast genes. 42 • Conclusions: Our results suggest that longer genes are superior for phylogeny 43 reconstruction. The alleviation of some conflict through the use of nucleotides suggests 44 that systematic error is likely the root of most of the observed conflict, but further 45 research on biological conflict within plastome is warranted given the documented cases 46 of heteroplasmic recombination. We suggest rpoC2 as a useful marker for reconstructing 47 angiosperm phylogeny, reducing the effort and expense of assembling and analyzing 48 entire plastomes. 49 50 Key words: Angiosperms, chloroplast, gene tree conflict, matK, phylogenomics, plastome, rbcL. 51 52 53 54 55 56 57 bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

58 Chloroplast data have been the most prominent source of information for phylogenetics, 59 largely due to the ease with which chloroplast genes can be sequenced, assembled, and analyzed 60 (Palmer, 1985; Taberlet et al., 1991). The majority of broad-scale phylogenetic studies on plants 61 have used chloroplast genes (e.g., Chase et al., 1993, Soltis et al., 2000, 2011), and the resulting 62 phylogenies have been used for countless other comparative studies examining ancestral states, 63 historical biogeography, and other evolutionary patterns. While older studies relied mostly on 64 targeted genes such as rbcL and matK, recent advances in DNA sequencing have drastically 65 increased the ease and affordability of whole-chloroplast genome (i.e., plastome) sequencing 66 (Moore et al. 2006; Cronn et al., 2008, 2012; Stull et al., 2013; Uribe-Convers, et al. 2014), 67 increasing the number of studies employing plastome-scale data for phylogenetic and 68 comparative analyses (e.g., Jansen et al., 2007; Moore et al., 2007, 2010; Ruhfel et al., 2014; 69 Stull et al., 2015; Gitzendanner et al., 2018). Nonetheless, the utility of plastid genes, as well as 70 the entire plastome, is ultimately determined by the extent to which they reflect ‘true’ 71 evolutionary relationships (i.e., the ‘species tree’) of the lineages in question (Doyle, 1992). 72 Phylogenomic conflict (i.e., the presence of conflicting relationships among gene trees in 73 a genomic dataset) is now recognized as a nearly ubiquitous feature of nuclear phylogenomic 74 studies (Smith et al., 2015). Most gene tree conflict is attributed to biological causes such as 75 incomplete lineage sorting, hybridization, and gene duplication and loss (Maddison, 1997; 76 Galtier and Daubin, 2008; Smith et al., 2015; Walker et al., 2017; Vargas et al., 2017). The genes 77 within the plastome, however, are generally thought to be free of such biological sources of 78 conflict. This is because the plastome is typically uniparentally inherited (maternally in 79 angiosperms, paternally in conifers: Mogensen, 1996) and undergoes a unique form of 80 recombination that is not expected to result in conflicting gene histories within a single genome 81 (Palmer, 1983; Bendich, 2004; Walker et al. 2015). 82 However, in angiosperms, nonmaternal inheritance is relatively common (e.g., Smith, 83 1989; McCauley et al., 2007), and several surveys of pollen in flowering plants (Corriveau and 84 Coleman, 1988; Zhang et al., 2003) have documented plastid DNA in up to 18% of the species 85 examined, indicating potential for biparental inheritance and heteroplasmy (i.e., the presence of 86 two or more different plastomes in a single organism, cell, or organelle). Heteroplasmy, which 87 has been documented in multiple angiosperm species (e.g., Johnson and Palmer, 1988; Lee et al., 88 1988; Hansen et al., 2007; Carbonell-Caballero et al., 2015), creates an opportunity for bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

89 heteroplasmic recombination, which could result in gene tree conflict in plastomes-inferred 90 phylogenies. Heteroplasmic recombination (both intra- and interspecific) has been invoked by 91 multiple studies to explain discordance in various clades (Huang et al., 2001; Marshall et al., 92 2001; Erixxon and Oxelman, 2008; Bouillé et al., 2011), but only a few recent studies have 93 documented clear evidence of this phenonmenon (Sullivan et al., 2017; Sancho et al., 2018). 94 Beyond heteroplasmy, sharing of genes between the chloroplast and nuclear genomes remains 95 another potential source of biological conflict (Martin et al. 1998; Martin 2003 Stegemann et al., 96 2003). Although biological conflict in the plastome generally seems rare, the true extent of intra- 97 plastome conflict is poorly known. Quantifying its extent is of high importance given that the 98 vast majority of studies assume no conflict as an operating principle (Wolfe and Randle, 2004). 99 Aside from biological sources of conflict, there also remain significant potential sources 100 of systematic conflict that have been poorly explored across the plastome (e.g., Burleigh and 101 Mathews 2007a,b). Chloroplast data are used at various time scales, and the accumulation of 102 substitutions over long periods of evolutionary time increases the probability of encountering 103 systematic error due to saturation (Rodríguez-Ezpeleta et al., 2007; Philippe et al., 2011). 104 Conflict has been demonstrated among different functional groups of genes (Liu et al., 2012), 105 among different regions of the plastome (Walker et al., 2014), as well as among individual genes 106 (e.g., Shepherd et al., 2008, Foster et al. 2018). The rate of chloroplast evolution as a whole has 107 been examined (and compared with the nuclear and mitochondrial genomes; Wolfe, 1987), and 108 rate variation within the chloroplast—especially across the three major regions of the genome, 109 i.e., the long single-copy (LSC) region, the short single-copy (SSC) region, and the inverted 110 repeats (IRa, IRb)—has been explored to help determine the markers useful for phylogenetic 111 inference at different time scales (e.g., Graham and Olmstead, 2000; Shaw et al., 2005, 2007, 112 2014). However, no study has comprehensively examined gene tree conflict within the plastome 113 to better characterize the extent and sources of conflict, and to identify the plastid genes most 114 concordant with our current understanding of angiosperm phylogeny (e.g., Soltis et al., 2011; 115 Wickett et al., 2014; Zeng et al., 2014; Gitzendanner et al., 2018; but see Logacheva et al. [2007] 116 for a preliminary investigation of the concordance of individual plastid genes with angiosperm 117 phylogeny). 118 Here we use phylogenomic tools to characterize the extent of conflict among plastid 119 genes as a function of evolutionary rate, rate variation among species, sequence length, and data bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

120 type (i.e., nucleotides vs. amino acids) at varying time scales across angiosperms. Our results 121 show that the plastome—at all levels—contains notable gene tree conflict, with the number of 122 conflicting genes at each node often comparable to the number of concordant genes; however, 123 the majority of plastid genes are uninformative for most nodes when considering support. 124 Information content (gene length and molecule type , i.e., nucleotides vs amino acids) was the 125 strongest correlate with concordance, suggesting that most observed gene-tree conflict is a 126 consequence of spurious inferences from insufficient information. However, many nodes across 127 angiosperm phylogeny show at least several strongly supported conflicting genes, indicating the 128 need for future investigations into the causes of intraplastome conflict (e.g., inappropriate 129 models, heteroplasmic recombination, horizontal gene transfer). We also document the 130 performance of individual genes at recapitulating angiosperm phylogeny, finding the seldom- 131 used gene rpoC2 to outperform commonly used genes (e.g., rbcL, matK) in all cases, consistent 132 with previous work highlighting the utility of this gene (e.g., Logacheva et al., 2007). Our results 133 provide an important glimpse into the extent and sources of intraplastome conflict. 134 135 MATERIALS AND METHODS 136 Data acquisition and sampling—Complete plastome coding data (both nucleotide and amino 137 acid) were downloaded from NCBI for 53 taxa: 51 angiosperm ingroups and two gymnosperm 138 outgroups (Ginkgo Biloba and Podocarpus lambertii; Table S1). Our sampling scheme was 139 designed to capture all major angiosperm lineages (e.g., Soltis et al., 2011), while also including 140 denser sampling for nested clades in . This allowed us to evaluate the extent of gene 141 tree conflict at different evolutionary levels/time scales, from species-level relationships in 142 Diplostephium () to the ordinal-level relationships defining the backbone of 143 angiosperm phylogeny. 144 145 Data preparation, alignment, and phylogenetic inference—All scripts developed for this study 146 may be found on GitHub (https://github.com/jfwalker/ChloroplastPhylogenomics). Orthology 147 was determined based upon the annotations of protein-coding genes on Genbank; this resulted in 148 almost complete gene occupancy apart from instances of gene loss or reported pseudogenization. 149 We did not use non-coding regions, as proper orthology at deep time scales can be difficult to 150 assess due to gene re-arrangements and inversions; this is in part why most deep phylogenetic bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

151 analyses of angiosperms using plastomes have excluding non-coding data (e.g., Moore et al., 152 2010; Ruhfel et al., 2014). 153 The amino acid and nucleotide data were aligned using Fast Statistical Alignment (FSA; 154 Bradley et al., 2009) with the default settings for peptide and the setting “--noanchored” for 155 nucleotide. FSA has been shown to be one of the top-performing alignment programs 156 (Redelings, 2014), and does not rely upon a guide tree for sequence alignment, alleviating 157 downstream bias. A maximum likelihood (ML) tree was then inferred for each gene using 158 RAxMLv.8.2.4 (Stamatakis, 2014), with the PROTGAMMAAUTO and GTR+G models of 159 evolution used for the amino acid and nucleotide data, respectively. For each dataset, we 160 conducted 200 rapid bootstrap replicates. The alignments were also concatenated into 161 supermatrices and partitioned by gene using the phyx program pxcat (Brown et al. 2017). 162 The nucleotide and amino acid supermatrices were then each used to infer ‘plastome’ trees using 163 the GTR+G and the PROTGAMMAAUTO models, respectively, as implemented in RAxML. 164 To complement the model inference performed by RAxML from the AUTO feature, we also 165 used IQ-TREE’s (Nguyen et al. 2014) built-in model selection process (Kalyaanamoorthy et al. 166 2017) on the partitioned data. We did not partition by codon position because the lengths of most 167 of the plastid genes, especially when divided into three smaller partitions, would have been 168 insufficient to inform an evolutionary model. 169 170 Reference phylogenies—For the conflict analyses, described below, we created several reference 171 trees against which the gene trees were mapped. Primarily, we used a topology based on Soltis et 172 al. (2011) and—for species-level relationships within Asteraceae—Vargas et al. (2017); we refer 173 to this is our ‘accepted tree’, or AT. We also created a reference tree based on the recent 174 plastome phylogeny of Gitzendanner et al. (2018), which we refer to as the ‘Gitzendanner tree’, 175 or GT. The GT is largely identical to the AT, but several deep (and commonly contentions) 176 nodes differ: e.g., the placement of Buxales and ordinal relationships in lamiids (here, we only 177 sampled Lamiales, Gentianales, and Solanales). In the latter case, all three possible combinations 178 of lamiids taxa were assessed (discussed further below in Assesment of Supported Conflict). 179 Although it is difficult to determine which of these trees better represents species relationships of 180 angiosperms, they both serve as useful frameworks for revealing conflict among genes. Here, we bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

181 primarily focus analyses using the AT, but conflict analyses using the GT were also conduct to 182 ensure that our overall results were consistent across reference trees. 183 184 Analysis of conflict—All gene trees were rooted on the outgroups using the phyx program pxrr 185 (Brown et al. 2017), in a ranked fashion (“-r”) in the order Podocarpus lambertii then Ginkgo 186 biloba; this was performed to account for any missing genes in either of the taxa. Conflict in the 187 data was identified using the bipartition method as implemented in phypartsv.0.0.1 (Smith et al. 188 2015), with the gene trees from each data set (amino acid and nucleotide) mapped against the AT 189 and GT references trees. The concordance analyses were performed using both a support cutoff 190 (at 70% bootstrap support, i.e., moderate support) and no support cutoff. When the support cutoff 191 is used, any gene tree node with under 70 bootstrap support is regarded as uninformative for the 192 reference node in question (i.e., it is uncertain whether the gene tree node is in conflict or 193 concordance with the reference node); when no support cutoff is used, the gene tree node is 194 evaluated as conflicting or concordant regardless of the support value. However, in both cases, a 195 gene is considered uninformative if a taxon relevant to a particular node/relationship is missing 196 from the gene dataset. For example, consider the reference tree ((A,B),C); if a gene tree is 197 missing taxon B, but contains the relationship (A,C), then the gene tree is concordant with the 198 node ((A,B),C) but uninformative regarding the node (A,B). See Smith et al. (2015) for an in- 199 depth explanation of this method. 200 Levels of conflict were examined across the entire tree and within particular time 201 intervals of the evolutionary history of angiosperms. For the latter case, the ~150 Ma of crown 202 angiosperm evolution were divided into five time intervals (30 Ma each): 150 to 121 mya, 120 to 203 91 mya, 90 to 61 mya, 60 to 31 mya, and 30 to 0 mya. The nodes across the tree were binned into 204 the time intervals based on their inferred ages (ages > 20 Ma from Magallón et al., 2015; ages < 205 20 Ma from Vargas et al., 2017 and Roquet et al., 2009). At each time interval, the proportion of 206 concordant nodes for each gene was calculated (for all the nodes falling within that time 207 interval). This allowed us to assess the level(s) of divergence at which each gene is most 208 informative. We also determined concordance levels of the ‘plastome’ trees (against the 209 reference) within each of these time intervals. 210 bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

211 Assessments of Individual Plastid Genes—Using the phyx program pxlstr (Brown et al. 2017), 212 we calculated summary statistics for each gene alignment and corresponding gene tree: number 213 of included species, alignment length, tree length (a measure of gene evolutionary rate), and 214 root-to-tip variance (a measure of rate variation across the phylogeny). Alignment length and tree 215 length represent different measures of a gene’s information content. Levels of concordance of 216 each gene tree with the reference trees were then assessed by tabulating the number of nodes 217 concordant between the gene tree and the given reference tree (e.g., the AT). The number of 218 concordant nodes (in Fig. 2 this is treated as a proportion of total nodes available to support and 219 in Fig. 3 this is based on total nodes) was used as a measure of the gene’s ability to accurately 220 reconstruct angiosperm phylogeny. 221 222 Predictors of Concordance—We examined the statistical relationships between gene tree 223 concordance (using concordance data from the AT-based conflict analysis) and alignment length, 224 tree length, and root-to-tip variance. Because the performance of each gene consists of an 225 aggregate sample of trials (with each node being a trial with outcomes of either concordance or 226 discordance), we analyzed the relationships between gene performance (concordance levels) and 227 alignment length, tree length, and root-to-tip variance using logistic regression of aggregate 228 binomial trials with the function glm() in R (R Core Team, 2018). Binomial models were 229 generally characterized by high residual deviance, and we thus allowed for overdispersion by 230 fitting quasibinomial logistic regressions (using ‘family = quasibinomial()’ in R). All code used 231 for these analyses is available on GitHub 232 (https://github.com/jfwalker/ChloroplastPhylogenomics). 233 We modelled gene performance as a function of length, tree length, and root-to-tip 234 variation, and as a function of each predictor individually. Because it is possible that apparent 235 relationships between alignment length and concordance may reflect signal from gene 236 information content per alignment site, we also modelled gene performance as a function of 237 length and tree length (as a proxy of gene information content, noted above), to assess the 238 relationship between alignment length and gene performance after controlling for variation 239 associated with gene information content. This has the added benefit of controlling for possible 240 multicollinearity introduced by the covariation between tree length and root-to-tip variation. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

241 Investigation of model fits on full datasets revealed that several observations were highly 242 influential based on leverage and Cook’s distance values. Therefore, we also conducted 243 investigations on reduced datasets to investigate the influence of these observations. In amino 244 acid datasets, we excluded rpl22 and rpl32, which were probably influential due to their high 245 tree length values, and ycf1 and ycf2, which were probably influential based on their long 246 alignments. In nucleotide datasets, we excluded clpP, rps15, ycf1 and ycf2. Combined analyses 247 of alignment length and tree length were not subject to influence driven by high root-to-tip 248 variance values, and hence reduced datasets had fewer genes removed. In this case, we excluded 249 only ycf1 and ycf2. Regression results were summarized in tables using the R package Stargazer 250 (Hlavac, 2018). 251 252 Saturation Analyses—We also performed saturation analyses on all the chloroplast genes to 253 determine if they were capable of inferring deep divergence times (Phillipe and Forterre, 1999). 254 The saturation analysis was performed on each codon position and thus the data were realigned 255 by codons for this analysis. The amino acid alignments were used to guide the codon alignments 256 using the program pxaa2cdn, from the phyx package. Four of the genes did not appear to match 257 the amino acid sequence in length (i.e., the gene was not 1/3rd the length of the amino acid); thus 258 these were left out of this analysis, since the codons could not be properly aligned. This 259 discrepancy is likely due to errors in the GenBank submission (the amino acid and nucleotide 260 data do not perfectly correspond). Saturation was assessed by determining the observed number 261 of differences between sequences compared to the inferred number of substitutions. This analysis 262 was performed using the “dist.dna” and “dist.corrected” functions in the R package “ape” 263 (Paradis et al. 2017), with the JC69 model of evolution used for the correction. The analysis was 264 conducted on the entire gene and on each codon position separately. 265 266 Comparison of genomic regions—We assessed the utility of the three major plastome regions— 267 the Long Single Copy (LSC) region, Short Single Copy (SSC), and the Inverted Repeat (IR) 268 region—for reconstructing angiosperm phylogeny in two ways. First, we constructed ML 269 phylogenies (as described above for each of the ‘plastome’ analyses) for each genomic region 270 using the concatenated set of genes comprising each region; with the resulting trees, we then 271 calculated (for each genomic region) the number of nodes concordant with the AT. Second, bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

272 using the concordance levels of each individual gene (described above), we created a plastome 273 diagram (with genes arranged according to their genomic position) showing the concordance 274 levels of each gene at the five different time scales discussed above (Fig. S1); this permits a 275 qualitative visual assessment of the general concordance levels of each genomic region at each 276 time slice. 277 278 Assessment of supported conflict—We compared four contentious regions of the AT to a series 279 of alternative relationships. The tested alternative relationships were present in either the GT 280 (i.e., the plastome phylogeny of Gitzendanner et al., 2018), our plastome phylogenies (with >70 281 BS for the node in question; we refer to this tree below as the ‘plastome tree’, or PT), and/or 282 three or more of the gene trees from our nucleotide dataset (with >70 BS for the node in 283 question). These alternative relationships pertain to: (1) the placements of Buxus and 284 Trochodendron; (2) relationships among Acnistus (Solanales), Coffea (Gentianales), and Olea 285 (Lamiales); (3) the placements of Jacobaea and Artemisia in Asteraceae; and (4) the placements 286 of Galinsoga, Ageratina, and Parthenium in Asteraceae. To test these alternative relationships, 287 we used a modified version of the Maxmimum Gene Wise Edge (MGWE) method (Walker et al. 288 2018), where instead of a defined “TREE SET”, we used a constraint tree for each alternative 289 hypothesis. The likelihood for every gene was calculated across each constraint, thereby forcing 290 the relationship in question to be the same. Similar to MGWE, this allowed for conflict across 291 the rest of the topology; however, this provides a greater amount of tree space to be explored 292 (similar to Smith et al., 2018). This method creates likelihood scores solely based on each 293 alternative relationship, making it robust to gene tree conflict outside of the node in question. 294 The modification of the MGWE method has been implemented in the EdgeTest.py program of 295 the package PHylogenetic Analysis Into Lineages (https://github.com/jfwalker/PHAIL). 296 297 RESULTS 298 Patterns of conflicting chloroplast signal—The sampling for this experiment allowed us to 299 examine conflict at multiple timescales, with roughly 10 divergences occurring in each 30 Ma 300 bin. Our ‘plastome’ trees, inferred from the concatenated gene sets (nucleotide and amino acid), 301 were highly concordant with the AT (Fig. 1). Without considering support, the gene trees 302 showed notable levels of conflict across the different analyses. This pattern is similar to that seen bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

303 in nuclear phylogenomic datasets (Smith et al., 2015). When considering support, the majority of 304 the plastid genes were uninformative for practically all nodes in the phylogeny (Fig. 1; Tables 305 S2, S3); i.e., they had bootstrap support below 70 (moderate support) for that particular 306 relationship (whether in conflict or concordance). 307 There was no obvious relationship between the amount of gene tree conflict and 308 evolutionary scale (i.e., conflict was relatively evenly distributed across shallow and deeper 309 nodes/time scales; Figs. 1, 2). Although the greatest degree of gene tree concordance with the AT 310 appeared in the nodes with inferred ages between 90–61 mya (ages based on Magallón et al., 311 2015), these nodes typically still contained at least 50% uninformative gene trees (Fig. 1). 312 Genome location also should weak correspondence to conflict/concordance (Fig. S1; Table S4); 313 discussed further below. Rather than timescale and genome location, data type (nucleotide vs 314 amino acid) had a much greater impact on the prevalence of conflict (Figs. 2, 3), with the amino 315 acid dataset generally showing higher levels of gene tree conflict. When factoring in support (BS 316 70 cutoff), the amino acid data set showed even less concordance with the AT (as more genes 317 were considered uninformative due to low BS support); the integration of support also decreased 318 concordance of the nucleotide data set with the AT, but proportionally less. See Figs. S3 and S4 319 for conflict analyses showing the amino acid and nucleotide gene trees mapped onto the AT (the 320 nucleotide results from Fig. S4 are also shown in Fig. 1). 321 To identify how many different models of amino acid evolution underlie the genes in the 322 plastome, we tested each gene against the candidate set of amino acid models in IQ-TREE and 323 RAxML. We found that a wide range of evolutionary models best fit our data—rather than just a 324 single model for the entire concatenated set of genes. Many of the models were not designed 325 specifically for plastome data, and cpRev, which was designed for plastome data, was only the 326 best fit for 19 of the 79 genes based upon the IQ-TREE model test (Table S2). 327 To examine relationships between gene characteristics and levels of 328 concordance/conflict, we calculated the following statistics for each gene: alignment length (a 329 measure of gene information content), tree length (a measure of evolutionary rate), and root-to- 330 tip variance (a measure of variation in evolutionary rate across the tree). This information is 331 presented for each gene across both data types in the supplementary information (Tables S2 and 332 S3). We used logistic multiple regression to test relationships between gene performance and 333 characteristics (Tables S5–S12). We found that alignment length had a significant positive bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

334 multiplicative relationship with odds of concordance across both datasets (Fig. 2, Tables S5-S8). 335 Tree length and root-to-tip variance had significant positive and negative multiplicative 336 relationships, respectively, in bot datasets (Tables S5, S6). Notably, excluding highly influential 337 observations with outlying predictor values rendered all three predictors significant but did not 338 affect the direction of most relationships Tables S7, S8). In models including only alignment 339 length and tree length, both predictors had significant positive multiplicative relationships with 340 odds of concordance even when the other was included in the model (Fig. 2, Tables S9-S12). 341 342 Genomic patterns of concordance/conflict—In terms of number of nodes concordant with the 343 AT, the LSC and SSC regions were roughly comparable, with both outperforming the IR 344 whether considering bootstrap or not (Table S4). This pattern held across time periods, with the 345 LSC and SSC regions having more concordant nodes than the IR at every time slice. The tree 346 lengths of the LSC and SSC regions (1.82 and 2.05, respectively) were also considerably larger 347 than that of the IR (0.98). The alignment lengths of the LSC region, SSC region, and IR were 348 73422 bp, 10395 bp, and 19314 bp respectively. The genome diagram of concordance (Fig. S1) 349 does not show any striking patterns among the different genomic regions; however, it is notable 350 that the majority of the LSC region is discordant (or uninformative) with the exception of a few 351 highly informative genes (namely, rpoC2 and matK). 352 353 Performance of individual plastid genes—Across all analyses, rpoC2 showed the highest levels 354 of concordance with the AT (Fig. 2, Tables S2 and S3); in general, it performed at least as well 355 as the 79 concatenated genes in reconstructing the AT. The commonly used genes ndhF and 356 matK generally scored among the best-performing plastid genes (in terms of number of 357 concordant nodes), while rbcL, the other most commonly used gene, performing relatively 358 poorly (Fig. 2). The matK alignment is ~250 bp longer than the rbcL alignment; the best 359 performing gene, rpoC2 (alignment length 4660 bp), is one of the longest plastid genes. 360 However, the notably long region ycf1—which encodes for ~5,400 bp (Dong et al., 2015)—did 361 not perform as well as rpoC2. In this study, the alignment length of ycf1 was 21,696 bp (vs. 4660 362 bp for rpoC2). In several cases it performed toward the top; however, it was never the top- 363 performing gene in terms of number of concordant nodes (Tables S2 and S3). Notably, despite 364 the high levels of observed conflict overall, we found that every node of the AT was supported bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

365 by at least one gene. Thus, to varying degrees, all relationships of the AT are found within the 366 plastome gene tree set. 367 368 Saturation analyses—None of the genes analyzed showed significant signatures of saturation 369 (Fig. S2). The uncorrected genetic distances and the corrected genetic distances appeared to be 370 roughly the same, and this trend was true for the first, second, and third codon position. Four 371 genes (ndhD, psbL, rpl16 and rpl2) were unable to be properly analyzed for saturation as the 372 nucleotide data uploaded to GenBank were missing the codons for 2 to 3 amino acids in some 373 taxa resulting in an improper codon alignment. 374 375 Analysis of well-supported conflict—We found that, out of the four major contentious 376 relationships in the AT, the edge-based analysis only supports one as the most likely (Table 1) 377 compared to the alternative relationships found in the other trees examined (i.e., the GT, the 378 plastome trees, and the gene trees). In the case of Buxus (node 1 in Fig. 1), this analysis supports 379 Buxus as sister to Trochodendraceae (Table 1), which is the topology found in the GT and four 380 gene trees: rbcL, petL, rps2, rps14 (Table 2). In the case of lamiids relationships (node 2), the 381 data support Olea (Lamiales) sister to Acnistus (Solanales) + Coffea (Gentianales), which is the 382 relationship present in the GT and 15 gene trees (Tables 1, 2). For node 3 (the placements of 383 Jacobaea and Artemisa), the data support the AT topology, yet no individual gene trees provided 384 >70 BS support for this relationship; however, rpoC2 matched the GT topology with strong 385 support (Tables 1 and 2). Lastly, for node 4 (the placement of Galinsoga), the data collectively 386 support Parthenium argentatum sister to Galinsoga (our plastome tree, PT), with two gene trees 387 (petA, psaJ) providing strong support for this topology; however, 16 genes (Table 2) supported a 388 dominant alternative (Ageratina sister to Galinsoga quadriradiata). 389 390 DISCUSSION 391 Following the typical assumptions of chloroplast inheritance, we would expect all genes in the 392 plastomes to have the same evolutionary history. We would also expect all plastid genes to show 393 similar patterns of conflict when compared to non-plastid inferred phylogenies. Furthermore, the 394 amino acid and nucleotide plastid data we used should show the same conflicting/concordant 395 relationships against given reference trees (e.g., the AT or GT). However, our results, discussed bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

396 below, frequently conflict with these common assumptions about chloroplast inheritance and 397 evolutionary history. 398 399 Conflicting topologies inferred from the chloroplast genome—In general, the ‘plastome’ 400 topologies inferred from nucleotide and amino acid alignments showed high levels of 401 concordance with the AT (Fig. 1; Figs. S3 and S4). While the genes within the chloroplast 402 genome are largely uninformative for most nodes of the phylogeny, a number of genes exhibited 403 well-supported conflict (Fig. 1). In general, there appears to be no relationship between 404 evolutionary scale and amount of gene tree conflict: i.e., conflict generally does not appear to 405 correlate with divergence time in angiosperms. Instead, the extent of conflict/concordance had a 406 stronger relationship with the molecule type analyzed. The amino-acid dataset showed the 407 highest levels of gene tree conflict (Figs. 3, S3), and the nucleotide dataset had about half the 408 amount of gene tree conflict found in the amino acid data (Figs. 3, S3, S4). It is difficult to 409 determine the causes of the observed instances of strongly supported conflict, which can be 410 found at most nodes in the phylogeny (Figs. 1, S3, and S4). The edge based analysis’ support for 411 the GT over the PT suggest some systematic error occurred when resolving the species 412 relationships using all genes. However, no analyses we performed fully explain the level of 413 conflict we saw and we suggest that the possibility of biological conflict deserves further 414 exploration, especially given the recent documented instances of inter-plastome recombination 415 (Sancho et al., 2018; Sullivan et al., 2017) and chloroplast-nuclear genomic exchange (Martin et 416 al. 1998; Martin 2003). 417 The superior performance of (coding) nucleotide data compared to amino acid data 418 possibly stems from the relatively greater information content of nucleotides (i.e., longer 419 alignments). Assuming there is not a significant amount of missing/indel data (with the exception 420 of ycf1), longer alignments should result in better-informed models, aided by both parsimony- 421 informative and -uninformative characters (Yang, 1998). However, inherent differences in amino 422 acid and nucleotide models might also explain differences in performance. The nucleotide data 423 was run using the GTR model, where the substitution rates among bases are individually 424 estimated; however, for the empirical amino acid substitution models investigated here, 425 substitution rates are pre-estimated, as the number of estimated parameters for changes among 20 426 states is extremely large. Additionally, although the plastome has been treated as a single bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

427 molecule for designing amino acid models of evolution (Adachi et al. 2000), a wide variety of 428 amino acid models (some of which were designed for viruses, such as flu or HIV) were inferred 429 to be the best for different plastid genes (Table S2). This might be the result of the different 430 methods implemented in RAxML vs. IQ-TREE for model testing, the different available models, 431 or the lack of sufficient information (because of gene length) to inform the model. However, the 432 most important point is that, based on the amount of information present in each gene, the 433 chloroplast is inferred to evolve under significantly different models of evolution. Given the 434 highly pectinate structure of the AT, phylogenetic inference in this case should rely heavily on 435 the model for the likelihood calculations. While in some cases (e.g., the shortest plastid genes) 436 amounts genetic information might be inherently insufficient, in others, improvements in amino 437 acid modeling might lead to great improvements in phylogenetic inference; such has been 438 suggested for animal mitochondrial data (Richards, 2018). 439 We expect that at deeper time scales, nucleotides (of coding regions) may begin to 440 experience saturation and thus information loss due to increased noise, at which point amino 441 acids (with 20 states) would begin to outperform nucleotides. However, the time scale of 442 angiosperm evolution does not appear great enough to result in nucleotide saturation (at least for 443 the genes sampled here, given that no genes appeared to exhibit significant levels of saturation; 444 Fig. S2), indicating that nucleotides are the most informative molecule for phylogenetic analysis 445 of plastomes across angiosperms. Future work, with a broader plastome sampling across green 446 plants, will be necessary to determine the evolutionary scale at which amino acids become more 447 informative than nucleotides for phylogenetic inference. 448 449 Analyses of well-supported conflict—The use of a reference topology (AT) afforded us the 450 ability to examine how individual genes within the plastomes agree/conflict with our current, 451 generally accepted hypothesis of angiosperm phylogeny. However, several contentious areas of 452 angiosperm phylogeny remain unresolved, with different topologies recovered across different 453 analyses (e.g., the AT, GT, and PT show small differences in several parts of the tree). Thus we 454 used edge-based analyses to compare these alternative topologies. 455 The placement of Buxus (Buxales) is a salient example, with different analyses resolving 456 different placements. Our edge-based analyses found Buxus sister to 457 Trochodendron/Trochdendrales (i.e., the GT topology) to have the highest likelihood score bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

458 (Table 1), and this placement was recovered in four plastid genes (rbcL, petL, rps2, rps14) with 459 strong support. However, five plastid genes recovered, with strong support, Buxus and 460 Trochodendron successively sister to the core (i.e., the AT topology): ndhD, rps4, 461 matK, psbB, and rpoC2. Relationships among core lamiid orders are another intriguing example 462 of gene tree conflict. Core lamiid relationships have been notoriously problematic (Refulio- 463 Rodriguez and Olmstead, 2014; Stull et al., 2015), and previous phylogenetic studies have 464 recovered every possible combination of the constituent orders (Boraginales, Gentianales, 465 Lamiales, Solanales). While our sampling does not include Boraginaceae, we tested alternative 466 relationships among the three remaining orders. The relationship 467 (Lamiales(Gentianales,Solanales)), which is present in the GT, received the highest likelihood 468 support and is found in 15 gene trees (Tables 1 and 2). However, the alternatives were each 469 recovered with strong support by two plastid genes: psaA and atpH (Gentianales sister to the 470 rest), and psbB and petA (Solanales sister to the rest). Asteraceae also exhibited several examples 471 of strongly supported conflict (Tables 1 and 2), but sampling differences across the trees 472 examined make comparisons more difficult. 473 These examples above are not meant to represent an exhaustive examination of 474 conflicting relationships within angiosperms. Instead, they are intended to highlight several 475 instances of strongly supported conflict within the plastome. Based on the present analyses, it is 476 difficult to say whether this conflict is biological (e.g.., genes with different evolutionary 477 histories in a single genome) or systematic (e.g., modeling error) in nature. But these results 478 nevertheless have important implications for plastid phylogenomics, suggesting that 479 concatenated analyses should be performed with caution given the notable presence of gene-tree 480 conflict. The majority of nodes show at least one strongly supported conflicting gene, and many 481 nodes have roughly equivalent numbers of conflicting and concordant genes against the 482 reference tree (Fig. 1). Concatenated analyses with extensive underlying conflict can yield 483 problematic results in terms of both topology and branch lengths, and occasionally, even with 484 large datasets, a few genes can drive the entire analysis toward the wrong topology (Brown and 485 Thomson, 2017; Shen et al. 2017; Walker et al. 2018). 486 487 Utility of individual plastid genes for future studies—Previous studies have laid a strong 488 framework for determining the utility of chloroplast regions at various phylogenetic scales. For bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

489 example, work by Shaw et al. (2005, 2007, 2014) highlighted non-coding DNA regions useful 490 for shallow evolutionary studies, while Graham and Olmstead (2000) explored protein-coding 491 genes useful for reconstructing deep relationships in angiosperms. Here, we expand upon 492 previous work by using a novel phylogenomic approach, allowing us examining the concordance 493 of individual protein-coding plastid genes with all nodes of the accepted angiosperm phylogeny 494 (AT; as well as slight variations thereof, e.g., the GT). We paid special attention to matK and 495 rbcL, given their historical significance for plant systematics (e.g., Donoghue et al., 1992; Chase 496 et al., 1993; Hilu et al., 2003). We find that rbcL performs relatively well in recapitulating the 497 TT— however, matK performs considerably better (i.e., it generally has more nodes concordant 498 with the TT; Tables S2 and S3). This is likely due to a strong positive correlation between 499 alignment length and number of concordant nodes, as noted above (Fig. 2); matK has a longer 500 alignment/gene length than rbcL. 501 The gene ycf1 has been found to be a useful marker in phylogenetics (e.g., Neubig et al., 502 2008; Neubig and Abbot, 2010; Thomson et al., in press) and barcoding (Dong et al. 2015), and 503 here we find that it generally performs above average. The alignment of ycf1 is abnormally long, 504 and this is likely due to its position spanning the boundary of the IR and the SSC, an area known 505 to fluctuate greatly in size. This variability likely contributes to the value of ycf1 as a marker for 506 ‘species-level’ phylogenetics and barcoding. However, the performance of ycf1 does not scale 507 with its alignment length. In terms of concordance, we find that matK performs roughly equally 508 as well if not slightly better than ycf1 (Tables S2 and S3). This might in part be a consequence of 509 ycf1 being missing/lacking annotation from some species, preventing us from analyzing its 510 concordance/conflict with certain nodes. Nevertheless, our results add to the body of evidence 511 supporting ycf1 as a generally useful plastid region. However, we found rpoC2 to outperform all 512 other plastid regions in every case (Tables S2 and S3), and its alignment length (4660 bp) is 513 ~1/5th the length of ycf1, easing the computational burden of using alignment tools such as FSA 514 (which would struggle with a region as long as ycf1), as well as divergence dating and tree- 515 building programs such as BEAST (Redelings and Suchard 2006; Drummond and Rambaut 516 2007). 517 In our analyses, when BS support is not considered, rpoC2 performed at least as well if 518 not better than using the concatenation of all chloroplast genes (Tables S2 and S3). When 519 support is considered (Tables S2 and S3), rpoC2 still remains the best-performing gene, but it bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

520 performs slightly worse than the concatenation of all chloroplast genes (in terms of number of 521 supported nodes concordant with the TT). The utility of rpoC2 likely stems from its notable 522 length, resulting in a wealth of useful phylogenetic information. In light of our results, rpoC2 523 should be a highly attractive coding region for future studies, as it generally recapitulates the 524 plastome phylogeny while allowing more proper branch length inferences (given that conflicting 525 signal among multiple genes can result in problematic branch length estimates: Mendes & Hahn, 526 2016). Recent work (Smith et al., 2018) indicates that filtering for a smaller set of highly 527 informative genes might yield more accurate results in various phylogenetic applications (e.g., 528 divergence dating). Given this, rpoC2 might be particularly useful for comparative analyses 529 requiring accurate branch length estimates (i.e., the majority of comparative analyses). Use of 530 rpoC2 alone (instead of the entire plastome) would also allow for more complex, 531 computationally expensive models to be implemented. Furthermore, focused sequencing of 532 rpoC2 would increase compatibility of datasets from different studies, facilitating subsequent 533 comprehensive, synthetic analyses. Although the performance rpoC2 may be dataset dependent, 534 our results support its utility at multiple levels, in terms of both time scales and sampling. It is 535 important to note, however, that we did include non-coding regions in our study, which would 536 likely outperform coding regions at shallow phylogenetic levels. But, as we noted, it is generally 537 too difficult to use non-coding regions across broad phylogenetic scales (e.g., angiosperms) 538 because of complications related to homology assessment and alignment. 539 540 Genomic patterns of concordance/conflict—Several previous studies (e.g., Jian et al., 2008; 541 Moore et al., 2011) have highlighted the Inverted Repeat (IR) as a valuable plastid region for 542 deep-level phylogenetic analyses, attributing its utility to its relatively slow rate of evolution, 543 resulting in less homoplasy and minimal saturation. However, our results contradict this notion, 544 suggesting that the coding sequences of the IR alone perform poorly compared to the LSC and 545 SSC coding regions for reconstructing angiosperm phylogeny (Fig. S1, Table S4). However, 546 there are important differences between our study and earlier studies on the IR (e.g., Jian et al., 547 2008; Moore et al., 2011). For one, we did not include the ribosomal RNA genes, which are 548 highly conserved; thus if the conserved nature of the IR (or at least portions of it) is the basis of 549 its utility, then this might explain the poor performance in the current study. However, our 550 saturation analyses (described above) did not reveal any genes to have significant saturation bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

551 issues at the scale of angiosperm evolution. This calls into question the idea that the conserved 552 genes of the IR would make it superior for reconstruction of angiosperm phylogeny. Instead, it is 553 possible that the non-coding regions of the IR (which we did not include here) are highly 554 informative for angiosperm phylogeny. While the non-coding regions of the LSC and SSC 555 regions have been extensively examined for use as phylogenetic markers (Shaw et al., 2005, 556 2007, 2014), the non-coding regions of the IR have been underexplored. Among the IR genes 557 examined here, ycf2 (which is exceptionally long) showed the greatest levels of concordance 558 (Fig. S1, Table S4), underscoring the idea that longer genes are generally more useful for 559 phylogeny reconstruction. 560 Another important difference between our study and Moore et al. (2011) is that we only 561 partitioned our data by gene region, while Moore et al. (2011) explored various partitioning 562 strategies (including codon positions and different combinations of genes). It is clear that 563 sequences within the plastomes follow various different models of molecular evolution (as 564 shown above in the results section Patterns of conflicting chloroplast signal). Exploration of 565 more complicated partitioning schemes—which is a time-consuming process (Lanfear et al., 566 2012; Kainer and Lanfear 2015), and beyond the scope of this study—might generally improve 567 plastome-inferred phylogenies. 568 569 Conclusions—We find notable levels of gene-tree conflict within the plastome, at all levels of 570 angiosperm phylogeny, highlighting the necessity of future research into the causes of plastome 571 conflict—do some genes share different evolutionary histories, or is systematic error (e.g., poor 572 modeling) the source of the observed conflict?—and the appropriateness of concatenating plastid 573 genes for phylogeney reconstruction. However, while conflict is evident, the majority of plastid 574 genes are largely uninformative for angiosperm phylogeny (Fig. 1). We find alignment length 575 and molecular type (nucleotide vs amino acids) to be the strongest correlate with plastid gene 576 performance—i.e., longer genes and nucleotides (which have greater information content than 577 amino acids) generally show higher levels of concordance with our accepted hypothesis of 578 angiosperm phylogeny. Accordingly, we found rpoC2 (one of the longest plastid genes) to be the 579 best-performing plastid gene, outperforming all of the more commonly used genes (e.g., rbcL, 580 ndhF, matK) at reconstructing angiosperm phylogeny. This supports the notion that longer genes bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

581 (with greater information content) are generally superior for phylogeny reconstruction (Yang, 582 1998) and should be prioritized when developing phylogenetic studies. 583 584 ACKNOWLEDGEMENTS 585 We thank S. A. Smith and M. J. Moore for reading an early draft of the manuscript and proving 586 helpful suggestions toward its improvement. JFW was supported by a Rackham Predoctoral 587 Fellowship. G.W.S. was supported by a NSF Postdoctoral Fellowship (NSF DBI grant 1612032). 588 O.M.V was supported by NSF FESD 1338694 and DEB 1240869. D.A.L. was supported by NSF 589 DEB grant 1458466. 590 591 AUTHOR CONTRIBUTIONS 592 This study was conceived by all of the authors. Analyses were led by J.F.W and N.W.-H, with 593 contributions from D.A.L, O.M.V, and G.W.S. Writing was led by G.W.S., J.F.W., and N.W.-H., 594 with contributions from O.M.V and. D.A.L. 595 596 LITERATURE CITED 597 Adachi, J., P. J. Waddell, W. Martin, and M. Hasegawa. 2000. Plastid genome phylogeny and a 598 model of amino acid substitution for proteins encoded by chloroplast DNA. Journal of 599 Molecular Evolution 50: 348–358. 600 Bendich, A. J. 2004. Circular chloroplast chromosomes: the grand illusion. The Plant Cell 16: 601 1661–1666. 602 Birky, C. W., Jr. 1995. Uniparental inheritance of mitochondrial and chloroplast genes: 603 mechanisms and evolution. Proceedings of the National Academy of Sciences 92: 11331– 604 11338. 605 Bradley, R. K., A. Roberts, M. Smoot, S. Juvekar , J. Do, C. Dewey, I. Holmes, and L. Pachter. 606 2009. Fast statistical alignment. PLoS Computational Biology 5: e1000392. 607 Brown, J. M., and R. C. Thomson. 2017. Bayes factors unmask highly variable information 608 content, bias, and extreme influence in phylogenomic analyses. Systematic Biology 66: 517– 609 530. 610 Brown, J. W., J. F. Walker, and S. A. Smith. 2017. Phyx: phylogenetic tools for unix. 611 Bioinformatics 33: 1886–1888. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

612 Burleigh, J. G., and S. Mathews. 2007a. Assessing among-locus variation in the inference of 613 seed plant phylogeny. International Journal of Plant Sciences 168: 111–124. 614 Burleigh, J. G., and S. Mathews. 2007b. Assessing systematic error in the inference of seed plant 615 phylogeny. International Journal of Plant Sciences 168: 125–135. 616 Carbonell-Caballero, J., R. Alonso, V. Ibañez, J. Terol, M. Talon, and J. Dopazo. 2015. A 617 Phylogenetic Analysis of 34 Chloroplast Genomes Elucidates the Relationships between Wild 618 and Domestic Species within the Genus Citrus. Molecular Biology and Evolution 32: 2015– 619 2035. 620 Chase, M. W., D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, 621 R. A. Price, H. G. Hills, Y.-L. Qiu, et al. 1993. Phylogenetics of seed plants: an analysis of 622 nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical 623 Garden 80: 528–580. 624 Corriveau, J. L., and A. W. Coleman. 1988. Rapid screening method to detect potential 625 biparental inheritance of plastid DNA and results for over 200 angiosperms. American 626 Journal of Botany 75: 1443–1458. 627 Cronn, R., A. Liston, M. Parks, D. S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex 628 sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis 629 technology. Nucleic acids research 36: e122–e122. 630 Cronn, R., B. J. Knaus, A. Liston, P. J. Maughan, M. Parks, J. V. Syring, and J. Udall. 2012. 631 Targeted enrichment strategies for next‐generation plant biology. American Journal of Botany 632 99: 291–311. 633 Dong, W., C. Xu, C. Li, J. Sun, Y. Zuo, S. Shi, T. Cheng, J. Guo, and S. Zhou. 2015. ycf1, the 634 most promising plastid DNA barcode of land plants. Scientific Reports 5: 8348. 635 Donoghue, M. J., R. G. Olmstead, J. F. Smith, and J. D. Palmer. 1992. Phylogenetic relationships 636 of Dipsacales based on rbcL sequences. Annals of the Missouri Botanical Garden 79: 333– 637 345. 638 Doyle, J. J. 1992. Gene trees and species trees: molecular systematics as one-character 639 . Systematic Botany 17: 144–163. 640 Drummond, A. J., and A. Rambaut. 2007. BEAST: Bayesian evolutionary analysis by sampling 641 trees. BMC Evolutionary Biology 7: 214. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

642 Foster, C.S., Henwood, M.J. and Ho, S.Y., 2018. Plastome sequences and exploration of tree- 643 space help to resolve the phylogeny of riceflowers (Thymelaeaceae: Pimelea). Molecular 644 phylogenetics and evolution. 645 Galtier, N., and V. Daubin. 2008. Dealing with incongruence in phylogenomic analyses. 646 Philosophical Transactions of the Royal Society B: Biological Sciences 363: 4023–4029. 647 Gitzendanner, M. A., P. S. Soltis, G. K. Wong, B. R. Ruhfel, and D. E. Soltis. 2018. Plastid 648 phylogenomic analysis of green plants: a billion years of evolutionary history. American 649 Journal of Botany 105: 291–301. 650 Graham, S. W., and R. G. Olmstead. 2000. Utility of 17 chloroplast genes for inferring the 651 phylogeny of the basal angiosperms. American Journal of Botany 87: 1712–1730. 652 Hensen, A.K., L. K. Escobar, L. E. Gilbert, and R. K. Jansen. 2007. Paternal, maternal, and 653 bipaternal inheritance of the chloroplast genome in Passiflora (Passifloraceae): implications 654 for phylogenetic studies. American Journal of Botany 94: 42–46. 655 Hilu, K. W., T. Borsch, K. Müller, D. E. Soltis, P. S. Soltis, V. Savolainen, M. W. Chase, M. P. 656 Powell, L. A. Alice, R. Evans. 2003. Angiosperm phylogeny based on matK sequence 657 information. American Journal of Botany 90: 1758–1776. 658 Jansen, R. K., Z. Cai, L. A. Raubeson, H. Daniell, C. W. dePamphilis, J. Leebens-Mack, K. F. 659 Müller, M. Guisinger-Bellian, R. C. Haberle, and A. K. Hansen. 2007. Analysis of 81 genes 660 from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale 661 evolutionary patterns. Proceedings of the National Academy of Sciences 104: 19369–19374. 662 Jian, S., P. S. Soltis, M. A. Gitzendanner, M. J. Moore, R. Li, T. A. Hendry, Y. L. Qiu, A. 663 Dhingra, C. D. Bell, and D. E. Soltis. 2008. Resolving an ancient, rapid radiation in 664 Saxifragales. Systematic Biology 57: 38–57. 665 Johnson, L. B., and J. D. Palmer. 1988. Heteroplasmy of chloroplast DNA in Medicago. Plant 666 Molecular Biology 12: 3–11. 667 Kainer, D., and R. Lanfear. 2015. The effects of partitioning on phylogenetic inference. 668 Molecular Biology and Evolution 32: 1611–1627. 669 Kalyaanamoorthy, S., B. Q. Minh, T. K. F. Wong, A. von Haeseler, and L. S. Jermiin. 2017. 670 ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods 14: 671 587–589. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

672 R. Lanfear, B. Calcott, S. Y. Ho, and S. Guindon. 2012. PartitionFinder: combined selection of 673 partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology 674 and Evolution 29: 1695–1701. 675 Lee, D. J., T. K. Blake, and S. E. Smith. 1988. Biparental inheritance of chloroplast DNA and the 676 existence of heteroplasmic cells in alfalfa. Theoretical and Applied Genetics 76: 545–549. 677 Liu, J., Z. C. Qi, Y. P. Zhao, C. X. Fu CX, and Q. Y. Jenny Xiang. 2012. Complete cpDNA 678 genome sequence of Smilax china and phylogenetic placement of Liliales—Influences of 679 gene partitions and taxon sampling. Molecular Phylogenetics and Evolution 64: 545–562. 680 Maddison, W. P. 1997. Gene trees in species trees. Systematic Biology 46: 523–536. 681 Magallón, S., S. Gómez-Acevedo, L. L. Sánchez-Reyes, T. Hernández-Hernández. 2015. A 682 metacalibrated time‐tree documents the early rise of phylogenetic diversity. 683 New Phytologist 207: 437–453. 684 Martin, W. 2003. Gene transfer from organelles to the nucleus: frequent and in big chunks. 685 Proceedings of the National Academy of Sciences 100: 8612–8614. 686 Martin, W., B. Stoebe, V. Goremykin, S. Hapsmann, M. Hasegawa, and K. V. Kowallik. 1998. 687 Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393: 162–165. 688 McCauley, D. E., A. K. Sundby, M. F. Bailey MF, M E. Welch. 2007. Inheritance of chloroplast 689 DNA is not strictly maternal in Silene vulgaris (Caryophyllaceae): evidence from 690 experimental crosses and natural populations. American Journal of Botany 94: 1333–1337. 691 Mendes, F. K., and M. H. Hahn. 2016. Gene tree discordance causes apparent substitution rate 692 variation. Systematic Biology 65: 711–721. 693 Mogensen, H. L. 1996. The hows and whys of cytoplasmic inheritance in seed plants. American 694 Journal of Botany 83: 383–404 695 Moore, M. J., A. Dhingra, P. S. Soltis, R. Shaw, W. G. Farmerie, K. M. Folta, and D. E. Soltis. 696 2006. Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant 697 Biology 6: 17. 698 Moore, M. J., C. D. Bell, P. S. Soltis, and D. E. Soltis. 2007. Using plastid genome-scale data to 699 resolve enigmatic relationships among basal angiosperms. Proceedings of the National 700 Academy of Sciences 104: 19363–19368. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

701 Moore MJ, Soltis PS, Bell CD, Burleigh JG, Soltis DE. 2010. Phylogenetic analysis of 83 plastid 702 genes further resolves the early diversification of eudicots. Proceedings of the National 703 Academy of Sciences 107: 4623–4628. 704 Moore, M. J., N. Hassan, M. A. Gitzendanner, R. A. Bruenn, M. Croley, A. Vandeventer, J. W. 705 Horn, A. Dhingra, S. F. Brockington, and M. Latvis. 2011. Phylogenetic analysis of the 706 plastid inverted repeat for 244 species: insights into deeper-level angiosperm relationships 707 from a long, slowly evolving sequence region. International Journal of Plant Sciences 172: 708 541–558. 709 Neubig, K. M., and J. R. Abbott. 2010. Primer development for the plastid region ycf1 in 710 Annonaceae and other magnoliids. American Journal of Botany 97: e52–e55. 711 Neubig, K. M., W. M. Whitten, B. S. Carlsward, M. A. Blanco, L. Endara, N. H. Williams, and 712 M. J. Moore. 2009. Phylogenetic utility of ycf1 in orchids: a plastid gene more variable than 713 matK. Plant Systematics and Evolution 277: 75–84. 714 Nguyen, L.-T., H. A. Schmidt, A. von Haeseler, and B. Q. Minh. 2014. IQ-TREE: a fast and 715 effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular 716 Biology and Evolution 32: 268–274. 717 Palmer, J. D. 1983. Chloroplast DNA exists in two orientations. Nature 301: 92–93. 718 Palmer, J. D. 1985. Comparative organization of chloroplast genomes. Annual review of 719 genetics 19: 325–354. 720 Philippe, H., H. Brinkmann, D. V. Lavrov, D. T. J. Littlewood, M. Manuel, G. Wörheide, and D. 721 Baurain. 2011. Resolving difficult phylogenetic questions: why more sequences are not 722 enough. PLoS Biology 9: e1000602. 723 Redelings, B. 2014. Erasing errors due to alignment ambiguity when estimating positive 724 selection. Molecular Biology and Evolution 31: 1979–1993. 725 Refulio-Rodriguez, N. F., and R. G. Olmstead. 2014. Phylogeny of Lamiidae. American Journal 726 of Botany 101: 287–299. 727 Richards, E. J., J. M. Brown, A. J. Barley, R. A. Chong, R. C. Thomson. 2018. Variation across 728 mitochondrial gene trees provides evidence for systematic error: How much gene tree 729 variation is biological? Systematic Biology doi:10.1093/sysbio/syy013. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

730 Rodríguez-Ezpeleta, N., H. Brinkmann, G. Burger, A. J. Roder, M. W. Gray, H. Philippe, and B. 731 F. Lang. 2007. Toward resolving the eukaryotic tree: the phylogenetic positions of jakobids 732 and cercozoans. Current Biology 17: 1420–1425. 733 Rokas, A., B. L. Williams, N. King, S. B. Carroll. 2003. Genome-scale approaches to resolving 734 incongruence in molecular phylogenies. Nature 425: 798–804. 735 Roquet, C., I. Sanmartín, N. Garcia-Jacas, L. Sáez, A. Susanna, N. Wikström, and J. J. Aldasoro. 736 2009. Reconstructing the history of Campanulaceae with a Bayesian approach to molecular 737 dating and dispersal–vicariance analyses. Molecular Phylogenetics and Evolution 52: 575– 738 587. 739 Ruhfel, B. R., M. A. Gitzendanner, P. S. Soltis, D. E. Soltis, and J. G. Burleigh. 2014. From 740 algae to angiosperms–inferring the phylogeny of green plants (Viridiplantae) from 360 plastid 741 genomes. BMC Evolutionary Biology 14: 23. 742 Sancho, R., C. P. Cantalapiedra, D. López-Alvarez, S. P. Gordon, J. P. Vogel, P. Catalán, B. 743 Contreras-Moreira. 2018. Comparative plastome genomics and phylogenomics of 744 Brachypodium: flowering time signatures, introgression and recombination in recently 745 diverged ecotypes. New Phytologist 218: 1631–1644. 746 Shaw, J., E. B. Lickey, J. T. Beck, S. B. Farmer, W. Liu, J. Miller, K. C. Siripun, C. T. Winder, 747 E. E. Schilling, R. L. Small. 2005. The tortoise and the hare II: relative utility of 21 noncoding 748 chloroplast DNA sequences for phylogenetic analysis. American Journal of Botany 92: 142– 749 166. 750 Shaw, J., E. B. Lickey, E. E. Schilling, R. L. Small. 2007. Comparison of whole chloroplast 751 genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the 752 tortoise and the hare III. American Journal of Botany 94: 275–288. 753 Shaw, J., H. L. Shafer, O. R. Leonard, M. J. Kovach, M. Schorr, and A. B. Morris. 2014. 754 Chloroplast DNA sequence utility for the lowest phylogenetic and phylogeographic 755 inferences in angiosperms: the tortoise and the hare IV. American Journal of Botany 101: 756 1987–2004. 757 Shen, X.-X., C. T. Hittinger, and A. Rokas. 2017. Contentious relationships in phylogenomic 758 studies can be driven by a handful of genes. Nature Ecology & Evolution 1: 126. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

759 Shepherd, L. D., B. R. Holland, L. P. Perrie. 2008. Conflict amongst chloroplast DNA sequences 760 obscures the phylogeny of a group of Asplenium ferns. Molecular Phylogenetics and 761 Evolution 48: 176–187. 762 Smith, S. E. 1989. Biparental inheritance of organelles and its implications in crop improvement. 763 Plant Breeding Reviews 6: 361–393. 764 Smith, S. A., M. J. Moore, J. W. Brown, and Y. Yang. 2015. Analysis of phylogenomic datasets 765 reveals conflict, concordance, and gene duplications with examples from animals and plants. 766 BMC Evolutionary Biology 15: 150. 767 Smith, S.A., Walker, J.F., Brown, J. and Walker-Hale, N., 2018. Nested phylogenetic conflicts 768 and deep phylogenomics in plants. bioRxiv, p.371930. 769 Soltis, D. E., P. S. Soltis, M. W. Chase, M. E. Mort, D. C. Albach, M. Zanis, V. Salvolainen, W. 770 H. Hahn, S. B. Hoot, M. F. Fay, et al. 2000. Angiosperm phylogeny inferred from 18S rDNA, 771 rbcL, and atpB sequences. Botanical Journal of the Linnean Society 133: 381–461. 772 Soltis, D. E., S. A. Smith, N. Cellinese, K. J. Wurdack, D. C. Tank, S. F. Brockington, N. F. 773 Refulio-Rodriguez, J. B. Walker, M. J. Moore, B. S. Carlsward, et al. 2011. Angiosperm 774 phylogeny: 17 genes, 640 taxa. American journal of botany 98: 704–730. 775 Stamatakis, A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of 776 large phylogenies. Bioinformatics 30: 1312-1313. 777 Stegemann, S., S. Hartmann, S. Ruf, and R. Bock. 2003. High-frequencing gene transfer from 778 the chloroplast genome to the nucleus. Proceedings of the National Academy of Sciences 100: 779 8828–8833. 780 Stull, G. W., M. J. Moore, V. S. Mandala, N. A. Douglas, H. R. Kates, X. Qi, S. F. Brockington, 781 P. S. Soltis, D. E. Soltis, and M. A. Gitzendanner. 2013. A targeted enrichment strategy for 782 massively parallel sequencing of angiosperm plastid genomes. Applications in Plant Sciences 783 1: 1200497. 784 Stull, G. W., R. Duno de Stefano, D. E. Soltis, P. S. Soltis. 2015. Resolving basal lamiid 785 phylogeny and the circumscription of Icacinaceae with a plastome‐scale data set. American 786 Journal of Botany 102: 1794–1813. 787 Suchard, M. A., and B. D. Redelings. 2006. BAli-Phy: simultaneous Bayesian inference of 788 alignment and phylogeny. Bioinformatics 22: 2047–2048. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

789 Sullivan, A. R., B. Schiffthaler, S. L. Thompson, N. R. Street, and X. R. Wang. 2017. 790 Interspecific plastome recombination reflects ancient reticulate evolution in Picea (Pinaceae). 791 Molecular Biology and Evolution 34: 1689–1701. 792 Taberlet, P., L. Gielly, G. Pautou, and J. Bouvet. 1991. Universal primers for amplification of 793 three non-coding regions of chloroplast DNA. Plant molecular biology 17: 1105–1109. 794 Thomson, A. M., O. M. Vargas, and C. W. Dick. 2017. Comparative analysis of 24 chloroplast 795 genomes yields highly informative genetic markers for the Brazil nut family (Lecythidaceae). 796 bioRxiv: 192112. 797 Uribe-Convers, S., J. R. Duke, M. J. Moore, D. C. Tank. 2014. A long PCR-based approach for 798 DNA enrichment prior to next-generation sequencing for systematic studies. Applications in 799 Plant Sciences 2: 1300063. 800 Vargas, O. M., E. M. Ortiz, and B. B. Simpson. 2017. Conflicting phylogenomic signals reveal a 801 pattern of reticulate evolution in a recent high‐Andean diversification (Asteraceae: : 802 Diplostephium). New Phytologist 214: 1736–1750. 803 Walker, J. F., M. J. Zanis, N. C. Emery. 2014. Comparative analysis of complete chloroplast 804 genome sequence and inversion variation in Lasthenia burkei (Madieae, Asteraceae)." 805 American Journal of Botany 101: 722–729. 806 Walker, J. F., R. K. Jansen, M. J. Zanis, N. C. Emery. 2015. Sources of inversion variation in the 807 small single copy (SSC) region of chloroplast genomes. American Journal of Botany 102: 808 1751–1752. 809 Walker, J. F., Y. Yang, M. J. Moore, J. Mikenas, A. Timoneda, S. F. Brockington, and S. A. 810 Smith. 2017. Widespread paleopolyploidy, gene tree conflict, and recalcitrant relationships 811 among the carnivorous Caryophyllales. American Journal of Botany 104: 858–867. 812 Walker, J.F., Brown, J.W. and Smith, S.A., 2018. Analyzing contentious relationships and outlier 813 genes in phylogenomics. Systematic biology. 814 Wickett, N. J., S. Mirarab, N. Nguyen, T. Warnow, E. Carpenter, N. Matasci, S. Ayyampalayam, 815 M. S. Barker, J. G. Burleigh, M. A. Gitzendanner, et al. 2014. Phylotranscriptomic analysis of 816 the origin and early diversification of land plants. Proceedings of the National Academy of 817 Sciences 111: e4859–e4868. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

818 Wolfe, K. H., W.-H. Li, and P. M. Sharp. 1987. Rates of nucleotide substitution vary greatly 819 among plant mitochondrial, chloroplast, and nuclear DNAs. Proceedings of the National 820 Academy of Sciences 84: 9054–9058. 821 Wolfe A. D., and C. P. Randle. 2004. Recombination, heteroplasmy, haplotype polymorphism, 822 and paralogy in plastid genes: implications for plant molecular systematics. Systematic 823 Botany 29:1011–1020. 824 Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis. Systematic Biology 47: 825 125–133. 826 Zeng, L., Q. Zhang, R. Sun, H. Kong, N. Zhang, H. Ma. 2014. Resolution of deep angiosperm 827 phylogeny using conserved nuclear genes and estimates of early divergence times. Nature 828 Communications 5: 4956. 829 Zhang, Q., Y. Liu, and Sodmergen. 2003. Examination of the cytoplasmic DNA in male 830 reproductive cells to determine the potential for cytoplasmic inheritance in 295 angiosperm 831 species. Plant and Cell Physiology 44: 941–951. 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

849 Figures 850 851 Figure 1. Summary of chloroplast conflict against the reference phylogeny of angiosperms. 852 Green and orange lines indicate where the amino acid- and nucleotide-inferred plastome trees 853 conflict with the reference phylogeny. Pie charts depict the amount of gene tree conflict observed 854 in the nucleotide analysis, with the blue, red, green, and gray slices representing, respectively, 855 the proportion of gene trees concordant, conflicting (supporting a single main alternative 856 topology), conflicting (supporting various alternative topologies), and uninformative (BS < 70 or 857 missing taxon) at each node in the species tree. The dashed lines represent 30 myr time intervals 858 (positioned based on Magallon et al. 2015 and Vargas et al. 2017) used to bin nodes for 859 examinations of conflict at different levels of divergence. 860 861 862 863 Figure 2. Gene tree concordance/conflict at varying time scales. Each diagram represents a 864 different molecule type and shows the proportion of concordance each gene exhibits at the five 865 time slices shown in Fig. 1: (1) 150–120 mya, (2) 120–90 mya, (3) 90–60 mya, (4) 60–30 mya 866 and (5) 30–0 mya. The individual genes are scaled by length of alignment; however, ycf1 and 867 ycf2 are cut to approximately the length of rpoC2 due to their abnormally long alignments. The 868 plots along the bottom show the relationships between gene concordance levels and various 869 attributes of the genes (e.g., alignment length, tree length). 870 871 872 Figure 3. Histograms depicting number of concordant edges each gene tree contains 873 compared to the reference phylogeny (i.e., the ‘accepted tree, AT). Each molecule type is 874 plotted twice (once integrating BS support, shown on the bottom row; another time not 875 integrating BS support, shown on the top row), resulting in four total histograms. The x-axes 876 place each gene based on the number of concordant nodes it shares with the AT; the y-axes show 877 the number of genes with different counts of concordant nodes. Commonly used markers (matK, 878 ndhF, rbcL, and ycf1) are labeled on the graph, along with the most concordant gene (rpoC2) and 879 the number of concordant nodes for the complete chloroplast (CC) compared to the AT. bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

880 881 882 Contentious relationships GT AT CC Alternative topology (>3 genes supporting) Buxus placement -618242.522753 -618371.605232 -618371.605232 -618242.522753 (node 10) Lamiid relationships -618203.494625 -618235.710231 -618203.494625 -618203.494625 (node 19) Jacobaea and Artemisia -618506.967905 -618506.967905 N/A (BS < 70) N/A placements (node 34) Galinsoga placement NA -618308.350169 -618174.41765 -618319.713582 (node 36) 883 884 Table 1. Likelihood scores for alternative resolutions of contentious nodes based on edge-based 885 analyses using a modification of the Maximum Gene Wise Edge (MGWE) method (Walker et al. 886 2018). The best-supported relationship (highest likelihood score) for each case is presented in 887 bold. 888 889 Contentious relationships GT AT CC Alternative topology (>3 genes supporting)

Buxus placement rbcL, petL, rps2, ndhD, rps4, ndhD, rps4, rbcL, petL, rps2, rps14 matK, psbB, matK, psbB, rps14 (node 10) rpoC2 rpoC2

Lamiid relationships ndhK, rbcL, psbB, petA ndhK, rbcL, ndhK, rbcL, rps11, rps11, rps7, ycf4, rps11, rps7, rps7, ycf4, rpoC2, (node 19) rpoC2, rps4, ycf4, rpoC2, rps4, ndhA, ndhJ, ndhA, ndhJ, rps4, ndhA, ccsA, rps2, petB, ccsA, rps2, petB, ndhJ, ccsA, matK, ndhI, atpB matK, ndhI, atpB rps2, petB, matK, ndhI, bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

atpB

Jacobaea and Artemisia rpoC2 None N/A N/A placements (node 34)

Galinsoga placement N/A ndhC, rbcL, petA, psaJ ndhF, petB, psbC, matK, ndhE ndhI, psaB, ycf2, (node 36) psbB, ycf4, petD, rpl16, ndhA, rpoC2 atpB, rpoB, ccsA, atpF 890 Table 2. Individual genes supporting (with >70 BS) the alternative topologies examined for each 891 contentious relationship. 892 121-150mya 91-120mya 61-90 31-60mya 0-30mya mya

Diplostephium haenkei Diplostephium oxapampanum Diplostephium gnidioides Diplostephium pulchrum Diplostephium foliosissimum Diplostephium empetrifolium Diplostephium meyenii Diplostephium azureum dominant alternative Diplostephium hartwegii Hinterhubera ericoides concordant Diplostephium oblanceolatum all other Oritrophium peruvianum conflict caucana Artemisa frigida Anaphalis sinica 3 Parthenium argentatum 4 Ageratina adenophora Galinsoga quadriradiata bioRxiv preprintuninformative doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not Jacobaea vulgaris certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under Lactuca sativa aCC-BY-NC 4.0 International license. Conflict with amino acid alignment Centaurea diffusa Conflict with nucleotide alignment Campanula punctata Adenophora remotiflora Lobelia inflata Fatsia japonica Dendropanax dentiger Schefflera heptaphylla Aralia undulata Viburnum utile Sambucus williamsii Kolkwitzia amabilis Helwingia himalaica Coffea arabica 2 Olea europaea Acnistus arborescens Camellia oleifera Cornus controversa Haloxylon ammodendron Champereia manillana Liquidambar formosana 1 padus Buxus microphylla Trochodendron aralioides Platanus occidentalis Berberis bealei Acorus americanus Bismarckia nobilis Liriodendrom tulipifera Chloranthus japonicus Illicium oligandrum Amborella trichopoda ycf1 ycf1 C C

0- 0- 30 30 30- 30- 60 60 60- 60- 90 90 rpoC2 90- rpoC2 90- 120 120

120- 120- 150 150

Amino Acid Nucleotide (BS > 70) (BS > 70)

Proportion of Concordant Nodes

75%

matK matK 50% bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under rbcLaCC-BY-NC 4.0 International license. rbcL 50%

*** *** *** ***

Length Tree Length Length Tree Length (aa) (substitutions per site) (bp) (substitutions per site) bioRxiv preprint doi: https://doi.org/10.1101/512079; this version posted January 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Nucleotide Amino Acid

rbcL (15) rbcL (27) rpoC2 matK rpoC2 CC (37) (46) (44) ycf1 CC ycf1matK Number of Genes (36) (45) Number of Genes (27)(32) 0 2 4 6 8 10 0 2 4 6 8 10

0 10 20 30 40 50 0 10 20 30 40 50

Concordant edges Concordant edges

Nucleotide Amino Acid (BS > 70) (BS > 70)

rbcL (21) matK rpoC2 (30) (39) rbcL ycf1 CC (9) Number of Genes (25) (42) Number of Genes rpoC2 CC (31) (44) ycf1 matK (18) (25) 0 2 4 6 8 10 0 5 10 15 20

0 10 20 30 40 50 0 10 20 30 40 50

Concordant edges Concordant edges