This is the peer reviewed version of the following article:

Schmickl, R. , Liston, A. , Zeisek, V. , Oberlander, K. , Weitemier, K. , Straub, S. C., Cronn, R. C., Dreyer, L. L. and Suda, J. (2016), Phylogenetic marker development for target enrichment from transcriptome and genome skim data: the pipeline and its application in southern African ().

Molecular Ecology Resources, 16: 1124-1135. which has been published in final form at

https://doi.org/10.1111/1755-0998.12487. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. 1 Phylogenetic marker development for target enrichment from transcriptome

2 and genome skim data: the pipeline and its application in southern African

3 Oxalis (Oxalidaceae) 4

5 Roswitha Schmickl1,*, Aaron Liston2, Vojtěch Zeisek1,3, Kenneth Oberlander1,4, Kevin

6 Weitemier2, Shannon C.K. Straub5, Richard C. Cronn6, Léanne L. Dreyer7, Jan

7 Suda1,3

9 1 Institute of Botany, The Czech Academy of Sciences, Zámek 1, 252 43

10 Průhonice, Czech Republic

11 2 Department of Botany and Pathology, Oregon State University, 2082

12 Cordley Hall, Corvallis, OR 97331, USA

13 3 Department of Botany, Faculty of Science, Charles University in Prague,

14 Benátská 2, 128 01 Prague, Czech Republic

15 4 Department of Conservation Ecology and Entomology, Stellenbosch

16 University, Private Bag X1, Matieland, 7602, South Africa

17 5 Department of Biology, Hobart and William Smith Colleges, 213 Eaton Hall,

18 Geneva, NY 14456, USA

19 6 USDA Forest Service, Pacific Northwest Research Station, 3200 SW

20 Jefferson Way, Corvallis, OR 97331, USA

21 7 Department of Botany and Zoology, Stellenbosch University, Private Bag X1,

22 Matieland, 7602, South Africa 23

24 *Correspondence: Roswitha Schmickl; Institute of Botany, The Czech Academy of

25 Sciences, Zámek 1, 252 43 Průhonice, Czech Republic; Fax: +420 221 951 645; E-

26 mail: [email protected]

27 Keywords: cytonuclear discordance, genome skimming, low-copy nuclear genes,

28 Oxalis, species tree, target enrichment

1 29

30 Running head: Low-copy nuclear genes for Hyb-Seq 31

32 Abstract

33 Phylogenetics benefits from using a large number of putatively independent nuclear

34 loci and their combination with other sources of information, such as the plastid and

35 mitochondrial genomes. To facilitate the selection of orthologous low-copy nuclear

36 (LCN) loci for phylogenetics in non-model organisms, we created an automated and

37 interactive script to select hundreds of LCN loci by a comparison between

38 transcriptome and genome skim data. We used our script to obtain LCN genes for

39 southern African Oxalis (Oxalidaceae), a speciose plant lineage in the Greater Cape

40 Floristic Region. This resulted in 1,164 LCN genes greater than 600 bp. Using target

41 enrichment combined with genome skimming (Hyb-Seq) we obtained on average

42 1,141 LCN loci, nearly the whole plastid genome and the nrDNA cistron from 23

43 southern African Oxalis species. Despite a wide range of gene trees, the phylogeny

44 based on the LCN genes was very robust, as retrieved through various gene and

45 species tree reconstruction methods as well as concatenation. Cytonuclear

46 discordance was strong. This indicates that organellar phylogenies alone are unlikely

47 to represent the species tree and stresses the utility of Hyb-Seq in phylogenetics. 48

49 Introduction

50 High-throughput sequencing (HTS) has the potential to greatly increase the amount

51 of phylogenetically informative signal in molecular datasets (Parks et al., 2009, 2012)

52 and overcome difficulties in phylogenetic reconstructions, such as polytomies and

53 low support values, that are often the result of using only a small fraction of the

54 genome. However, HTS also “opens the era of real incongruence” (Jeffroy et al.,

55 2006), and even massive amounts of sequence data do not always result in strongly

56 resolved phylogenies (Pyron, 2015). When HTS was introduced to plant

2 57 phylogenetics, sequencing of the plastid genome was its first focus (e.g., Parks et al.,

58 2009, Givnish et al., 2010). Later approaches of genome skimming, the sequencing

59 of the high-copy fractions of the nuclear, plastid and mitochondrial genome (Straub et

60 al., 2012), resulted in the assembly of the rDNA cistron and nearly the complete

61 plastid and mitochondrial genome. Currently, target enrichment (sequence capture)

62 of hundreds of loci is becoming increasingly popular in phylogenetics. In animal

63 phylogenomics non-exonic or partly exonic ultraconserved elements and their quite

64 variable flanking regions are often utilized (e.g., Faircloth et al., 2012, Hedtke et al.,

65 2013, Smith et al., 2013). For plant phylogenetics, low-copy nuclear (LCN) genes are

66 targeted (Mandel et al., 2014, Weitemier et al., 2014, Grover et al., 2015, Heyduk et

67 al., 2015, Stephens et al., 2015a, b, Mandel et al., 2015, Nicholls et al., 2015) due to

68 the paucity of ultraconserved nuclear sequences (Reneker et al., 2012).

69 Target sequencing strategies for plant nuclear genomes are largely lineage-

70 specific, requiring the de novo design of target enrichment probes. Chamala et al.

71 (2015) recently introduced a pipeline for phylogenetic marker development in

72 angiosperms using transcriptomes, and they obtained several hundred putative LCN

73 genes that can be utilized at three phylogenetic levels (genus, family, order);

74 however empirical evidence for the phylogenetic utility of these loci was not

75 demonstrated. Alternative phylogenetic marker developments, also utilizing

76 transcriptomes (Rothfels et al., 2013, Pillon et al., 2014, Tonnabel et al., 2014),

77 resulted in a much smaller number (up to 20) of mainly LCN loci, but these loci were

78 evaluated with PCR in the empirical datasets, not target enrichment. In recently

79 published phylogenies based on target enrichment of several hundred LCN genes,

80 these loci were selected from transcriptomes, gene expression studies, the literature,

81 or a combination of these sources (Mandel et al., 2014, Grover et al., 2015, Heyduk

82 et al., 2015, Stephens et al., 2015a, b, Mandel et al., 2015, Nicholls et al., 2015).

83 Weitemier et al. (2014) designed LCN probes for target enrichment based on a

84 combination of transcriptome and genome data and demonstrated their phylogenetic

3 85 utility in Asclepias L.. The limitation of this probe design pipeline is that (draft)

86 genomes are still infrequent, especially for non-model species, and are costly to

87 generate. This limitation also applies to the approach of de Sousa et al. (2014), who

88 selected 50 LCN loci from a genomic source and amplified them using target

89 enrichment. Except for Chamala et al. (2015), who offer a user-friendly but

90 empirically untested probe design pipeline, and Weitemier et al. (2014), whose Hyb-

91 Seq pipeline is designed for more advanced users, no automated probe design

92 pipeline for LCN genes is currently available.

93 In this study we developed a novel probe design pipeline for targeting

94 orthologous LCN loci for phylogenetic reconstruction by using genome skim and

95 transcriptome data. In particular, genome skim data of one accession of the studied

96 plant group were combined with a congeneric transcriptome from the 1000

97 (1KP) initiative (http://www.onekp.com/). We implemented our software workflow in

98 the user-friendly, automated and interactive BASH script Sondovač, which allows a

99 straightforward design of LCN probes also for users with limited bioinformatics skills.

100 The utility of this approach is demonstrated by the design of probes for southern

101 African Oxalis, and over 1,000 candidate LCN loci were obtained. Use of the probes

102 for targeted sequencing of these loci in 23 southern African Oxalis species resulted in

103 sufficient sequencing depth of the LCN loci, as well as the plastid and high-copy

104 nuclear genome (e.g., nuclear ribosomal DNA (nrDNA) cistron). Considering their

105 different evolutionary rates and modes of inheritance (Small et al., 2004), the

106 combination of all three datasets can substantially contribute to understanding

107 speciation from a phylogenetic perspective. The observed, strong cytonuclear

108 discordance (nuclear tree topology deviating from organellar tree topology) suggests

109 that organellar phylogenies alone do not resemble the species tree. 110

111 Materials and methods

112 Taxonomic focus

4 113 Oxalis L. (ca. 500 species) is common in the flora of the New World and a major

114 component of the Greater Cape Floristic Region (GCFR) (Born et al., 2006).

115 Southern African taxa comprise ca. 46% of the genus (ca. 230 species) and

116 represent the seventh-largest genus and the largest geophytic genus in the GCFR

117 (Proches et al., 2006); they bear with above-ground parts emanating from

118 seasonal stems. There is some evidence of rapid, possibly adaptive radiation of

119 southern African Oxalis, as the base of the clade is poorly resolved (Oberlander et

120 al., 2011). Results of dating the southern African crown Oxalis radiation varied

121 between an age of 9.9 and 32.2 mil. years (Oberlander et al., 2014).

122 Oberlander et al. (2011) published a phylogeny of the southern African

123 species, based on a combined dataset of plastid trnL intron, trnL-trnF intergenic

124 spacer, and trnS-trnG, as well as the internal transcribed spacers of nuclear

125 ribosomal DNA (ITS), and found that these species are monophyletic. However, the

126 phylogeny lacked good resolution and high support values for many clades, and

127 strong cytonuclear discordance was observed. For Hyb-Seq we selected 23 Oxalis

128 species from the core southern African Oxalis clade (clade four of Oberlander et al.,

129 2011). Twelve of those species were from the “O. hirta and relatives clade” (clade 11

130 of Oberlander et al., 2011; hereafter the Hirta clade), which showed strong

131 cytonuclear discordance. Oxalis hirta L. and Oxalis obtusa Jacq. had two accessions

132 each, as they are among the most morphologically variable taxa within the core

133 southern African Oxalis clade. Approximately one third of the accessions were

134 polyploid. Silica-dried leaf material was used in all cases. Sampling information is

135 available in Table S1 (Supporting information). 136

137 Target enrichment probe design

138 The transcriptome draft assembly of the cosmopolitan weed Oxalis corniculata L.

139 from the 1KP initiative (accession JHCN) and genome skim raw data of an accession

140 of southern African O. obtusa (accession J12; see “Illumina library preparation and

5 141 Hyb-Seq”) were combined to get hundreds of orthologous LCN loci. Enrichment of

142 multi-copy loci was minimized by using unique transcripts only, which were obtained

143 by comparing all transcripts and removing those sharing ≥90% sequence similarity

144 using BLAT v.32x1 (Kent, 2002). Before matching the O. obtusa genome skim data

145 against those unique transcripts, reads of plastid and mitochondrial origin were

146 removed with Bowtie 2 (Langmead and Salzberg, 2012), SAMtools (Li et al., 2009)

147 and bam2fastq (http://gsl.hudsonalpha.org/information/software/bam2fastq) utilizing

148 Ricinus communis L. GenBank accessions NC_016736 and NC_015141 as

149 references. Paired-end reads were subsequently combined with FLASH (Magoč and

150 Salzberg, 2011). These processed reads were matched against the unique O.

151 corniculata transcripts sharing ≥85% sequence similarity with BLAT. Transcripts with

152 >1000 BLAT hits, indicating repetitive elements, and O. obtusa BLAT hits containing

153 masked nucleotides were removed before de novo assembly of the O. obtusa BLAT

154 hits to larger contigs with Geneious v.6.1.7 (Kearse et al., 2012), using the medium

155 sensitivity / fast setting. After assembly, only those contigs that comprised exons

156 ≥120 bp and had a total locus length ≥600 bp were retained. To ensure that probes

157 did not target multiple similar loci, any probe sequences sharing ≥90% sequence

158 similarity were removed using cd-hit-est v.4.5.4 (Li and Godzik, 2006), followed by a

159 second filtering step for contigs containing exons ≥120 bp and totaling loci length

160 ≥600 bp. To ensure that plastid sequences were absent from the probes, the probe

161 sequences were matched against the Ricinus plastome reference sharing ≥90%

162 sequence similarity with BLAT, and the hits were removed from the probe set.

163 Repetitive elements were then masked with RepeatMasker

164 (http://www.repeatmasker.org/). Ambiguous sites in the probe sequences, which

165 were generated during assembly of the genome skim reads, were randomly replaced

166 by one of the relevant nucleotides and stretches of up to 5N replaced by T. Tiling

167 density 2☓ was used.

6 168 The workflow described above up to the removal of remaining plastid

169 sequences is summarized in Figure 1. It was implemented in an automated and

170 interactive BASH script named Sondovač, which is deposited in GitHub

171 (https://github.com/V-Z/sondovac/wiki/) and licensed under open-source license GPL

172 v.3 allowing further modifications. The script runs on major Linux distributions and

173 Mac OS X; it runs on a standard desktop computer equipped with modern CPU like

174 Intel i5 or i7. Strong bioinformatics skills and access to high-performance computer

175 clusters are not required. 176

177 Illumina library preparation and Hyb-Seq

178 Genomic DNA was extracted according to the CTAB protocol (Doyle and Doyle,

179 1987). The genome skim data of O. obtusa accession J12 was obtained with 250 bp

180 paired-end reads from a partial lane of an Illumina MiSeq (San Diego, California,

181 USA) performed by StarSEQ (Mainz, Germany). For the other 23 species of Oxalis,

182 used for Hyb-Seq, 200 ng to 1 µg extracted DNA was sheared with a Covaris

183 (Woburn, Massachusetts, USA) S220 sonicator using the program for fragmentation

184 to 1,000 bp for 45 sec (200 cycles, 4°C). Library preparation followed the NEBNext

185 Ultra DNA Library Prep (New England Biolabs, Ipswich, Massachusetts, USA)

186 protocol for Illumina with a few modifications: (1) Size-selection (~600–650 bp) was

187 performed on a 1% agarose gel and (2) two additional cleanup steps were

188 implemented, one after adapter-ligation with the QIAquick PCR Purification Kit

189 (Qiagen, Venlo, Netherlands), and a second after gel extraction with the QIAquick

190 Gel Extraction Kit (Qiagen). (3) Enriched PCR products were cleaned up with the

191 QIAquick PCR Purification Kit and subsequently with Agencourt AMPure XP beads

192 (Beckman Coulter Genomics, Danvers, Massachusetts, USA). Amplification of

193 ligated, size-selected fragments was performed with 10 cycles of PCR, using

194 NEBNext Multiplex Oligos for Illumina Index Primers Set 1 and 2 (New England

195 Biolabs). Libraries were subsequently pooled in approximately equimolar ratios in a

7 196 24-plex reaction. Solution hybridization with MyBaits biotinylated RNA baits

197 (MYcroarray, Ann Arbor, Michigan, USA), which were synthesized from our custom-

198 designed probes, and enrichment followed the MYbaits manual v.1.3.8 with

199 approximately 200 ng of input DNA (9 ng per accession) and 12 cycles of PCR

200 enrichment. Target-enriched libraries were sequenced on an Illumina MiSeq at

201 Oregon State University (v.3 chemistry) to obtain 150 bp paired-end reads. All DNA

202 concentration measurements were performed with the Qubit 2.0 fluorometer

203 (Invitrogen / Life Technologies, Carlsbad, California, USA). 204

205 Data analysis pipelines for quality filtering, assembly, alignment, and quality

206 assessment of the Hyb-Seq data

207 Adapter sequences and low quality reads were removed with Trimmomatic v.0.30

208 (Bolger et al., 2014). In case of quality

209 discarded. The remaining part of the read was trimmed, if average quality in a 5 bp

210 window was

211 FASTX-Toolkit (Gordon and Hannon, 2010) was used to remove duplicate reads.

212 Reference-guided assembly of the targeted loci was performed with Alignreads v.

213 2.25 (Straub et al., 2011), using the probe sequences separated by a string of 200

214 Ns each as reference and the parameter settings of Weitemier et al. (2014). Steps up

215 to the final alignments of the LCN loci were performed following Weitemier et al.

216 (2014). Although the LCN loci were created by randomly concatenating the

217 respective exons, they will be called genes. All analyses based on the LCN genes

218 were conducted on 727 loci that contained sequence information for at least part of

219 these genes for all accessions; the remaining 437 genes out of the total 1,164

220 targeted had completely missing sequences for certain accessions.

221 The plastid genome and the nrDNA cistron (18S-ITS1-5.8S-ITS2-26S) were

222 also assembled with Alignreads, utilizing Oxalis reference sequences that were built

223 according to Straub et al. (2011, 2012). Genome skim reads of O. obtusa were

8 224 quality trimmed, using the same settings as for the Hyb-Seq data, and duplicate

225 reads were removed. Alignreads was run with the following masking parameters,

226 utilizing the plastid genome sequence of Ricinus communis (GenBank accession

227 JF937588) as reference: Any base with sequencing depth <5 was masked, and

228 single nucleotide polymorphisms (SNPs) were only called in O. obtusa if 80% of

229 reads supported that SNP with sequencing depth ≥25 at that site. Read type 454,

230 accounting for longer read length, and linear setting were chosen in YASRA, the

231 assembler within Alignreads, and medium percentage sequence identity between

232 genome skim data and reference sequence was chosen, as this resulted in the

233 highest quality assembly (data not shown). Plastome regions present in the Ricinus

234 reference, but absent in O. obtusa, were masked. Hyb-Seq reads were then

235 assembled with guidance of the draft Oxalis plastome reference with the same

236 settings used for read assembly of the targeted nuclear loci. The resulting contigs of

237 all accessions were aligned with Mulan (Ovcharenko et al., 2005), utilizing the draft

238 Oxalis plastome reference sequence. Positions in consensus sequences were

239 masked in the alignment editor MEGA v.6 (Tamura et al., 2013) in case of (1) regions

240 with many SNPs compared to the reference due to wrongly assembled indels or

241 SNPs between overlapping contig ends, or (2) insertions not found in other Oxalis

242 accessions present at contig ends. Visual inspection of the assemblies was

243 performed with Tablet v.1.14.04.10 (Milne et al., 2013).

244 The nrDNA cistron (without the external transcribed spacer due to repetitive

245 elements) was assembled with Alignreads based on an Oxalis reference that was

246 built from a 774 bp partial sequence of O. obtusa (GenBank accession EU436922).

247 The same masking parameters as for building the Oxalis draft plastome reference

248 were chosen. Hyb-Seq reads were assembled to this reference with Alignreads

249 without masking parameters. In cases of more than one contig per accession and

250 divergent SNPs between them, such sites were masked. The consensus sequences

251 were aligned using MAFFT v.6864b (Katoh and Toh, 2008) with default settings.

9 252 Data completeness of all final alignments was calculated considering missing

253 data per base pair, and in case of the LCN gene matrix also considering missing

254 number of targeted exons and missing number of targeted genes, which were

255 regarded missing if all targeted exons failed enrichment. Sequence similarity

256 between probes and assembled reads of each Oxalis accession was estimated using

257 BLAT, based on a minimum 85% sequence similarity. 258

259 Plastome annotation

260 Annotation of the draft plastid genome was performed with DOGMA (Wyman et al.,

261 2004) on the Oxalis accession with the longest and one of the most complete

262 plastome sequence (Oxalis hirsuta Sond.). The annotated plastome was visualized

263 with GenomeVx (Conant and Wolfe, 2008). 264

265 Phylogenetic network analysis

266 Putative conflicting signals within the LCN gene, plastome and nrDNA cistron

267 datasets were visualized with NeighborNet (Bryant and Moulton, 2004) implemented

268 in SplitsTree v.4.13.1 (Huson and Bryant, 2006) using concatenated sequence

269 alignments, and networks were constructed based on uncorrected pairwise matrices.

270 Bootstrapping was performed with 100 replicates. 271

272 Phylogenetic tree reconstructions and gene-gene/-species tree comparisons

273 Maximum likelihood (ML) and Bayesian Markov chain Monte Carlo (MCMC) /

274 Bayesian inference (BI) methods were used for phylogenetic reconstruction of gene

275 trees. ML gene trees were run with RAxML v.7.3.0 (Stamatakis, 2006) with the GTR

276 + Г nucleotide substitution model and rapid bootstrap with 100 replicates each.

277 Bayesian MCMC analysis was performed with MrBayes v.3.2 (Ronquist and

278 Huelsenbeck, 2003), utilizing sampling across substitution models and Г correction.

279 Two simultaneous runs with four chains each were performed for 5 mil. generations,

10 280 in each run 1,001 trees were sampled, of which the first 25% were discarded as

281 burn-in. In order to diagnose convergence between the individual Bayesian MCMC

282 runs in such a large dataset, three measures of convergence were chosen: (1)

283 average effective sample size (avgESS) >100, (2) potential scale reduction factor

284 (PSRF) ~1.0, and (3) average standard deviation of split frequencies (ASDSF) <0.05.

285 Phylogenetic hypotheses based on the targeted LCN genes were inferred in

286 three different ways, which are intensely debated (e.g., von Haeseler, 2012,

287 DeGiorgio et al., 2014, Liu et al., 2015, Tonini et al., 2015): (1) species tree

288 reconstruction under the multispecies coalescent model utilizing three different

289 programs: (i) ASTRAL (Mirarab et al., 2014), which finds the species tree that agrees

290 with the largest number of quartet trees induced by the set of gene trees, (ii) MP-EST

291 (Liu et al., 2010), which uses maximum pseudo-likelihood for the estimation of

292 species trees, and (iii) STAR (Liu et al., 2009), which uses average ranks of gene

293 coalescence times to build species trees; bootstrapping of the STAR species tree

294 was performed according to the multilocus bootstrap method of Seo (2008), (2) a

295 supertree approach using matrix representation parsimony (MRP) (Baum, 1992,

296 Ragan, 1992), and (3) a supermatrix approach using concatenation of the dataset.

297 The first two methods were performed both on the set of MrBayes maximum

298 posterior probability trees and RAxML best trees. The concatenated dataset was run

299 with RAxML, applying the GTR + Г nucleotide substitution model and rapid bootstrap

300 with 100 replicates.

301 Topological differences between gene trees were assessed by calculating the

302 Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) with the R function

303 RF.dist of the phangorn package (Schliep, 2011) and visualized as principal

304 coordinate (PCoA) plot. Topological differences between the gene trees and the

305 STAR species tree were calculated as modified RF distance (RFD) using STRAW

306 (Shaw et al., 2013) and visualized as a histogram.

11 307 Phylogenetic reconstruction based on the draft plastid genome and nrDNA

308 cistron was performed using MrBayes on partitioned datasets. In the case of the

309 plastome data the alignment was partitioned into protein-, tRNA-, and rRNA-coding

310 as well as intron/spacer (i.e., non-coding) regions. The nrDNA alignment was

311 partitioned into 18S, ITS1, 5.8S, ITS2 and 26S. All partitions were sampled across

312 substitution models, and an initial partition-specific among-site rate variation

313 correction, estimated by PartitionFinder v.1.1.1 (Lanfear et al., 2012), was employed,

314 which was updated based on output from initial MrBayes runs. All parameters were

315 unlinked across partitions, and each partition had its own, independent evolutionary

316 rate. Two simultaneous runs with four chains each were performed for 100 mil.

317 generations, in each run 1,001 trees were sampled, of which the first 25% were

318 discarded as burn-in. For plastome analysis, the complete protein-coding plastid

319 complement (Moore et al., 2010) and inverted repeat of Oxalis latifolia Kunth

320 (GenBank accession HQ664602) was used as outgroup, whereas O. corniculata

321 served as outgroup both in the LCN and nrDNA cistron dataset. However, as O.

322 corniculata and O. latifolia are phylogenetically very distantly related, and the branch

323 leading to O. corniculata was very long in all analyses, these outgroups were

324 trimmed from the final trees and trees were rooted using Oxalis imbricata Eckl. &

325 Zeyh. in the comparison of phylogenetic hypotheses based on all three datasets. 326

327 Results

328 Probe design, sequencing, quality trimming, assembly, alignment and quality

329 assessment

330 Design of Oxalis LCN probes resulted in 4,926 exons ≥120 bp length and 1,164

331 genes (Table S2, Supporting information) of total 1,127,209 bp. Gene length ranged

332 from 600 to 4,125 bp with a mean of 968 bp. Adapter and quality-filtering resulted in

333 an average loss of 7% of the reads (Table S3, Supporting information). In O.

334 palmifrons T.M. Salter 22% of the reads were dropped, indicating its poor initial DNA

12 335 quality; DNA of this accession was nearly degraded. After duplicate read removal on

336 average 47% of quality-filtered reads remained (Table S3, Supporting information).

337 This was a high number of duplicate reads, likely the result of PCR duplicates from

338 too many PCR cycles during genomic library preparation and enrichment. Of the

339 quality-filtered reads after duplicate removal on average 59% mapped to the LCN

340 genes (on-target), 27% to the draft plastome and 3% to the nrDNA cistron reference

341 (Table S3, Supporting information). The mean number of targeted loci with complete

342 or partial sequence information was 4,124 for exons (from total of 4,926) and 1,141

343 for genes (from total of 1,164) (Table S2, Supporting information). Sequences

344 diverged by 3–4% between reads and LCN probes (Table S2, Supporting

345 information), and sequence divergence was 11% between probes and the O.

346 corniculata transcriptome used to define exon boundaries. Completeness of the

347 datasets were as follows: 90% for the LCN exons, 96% for the plastome, and 96% for

348 the nrDNA cistron (Table S2, Supporting information). These mean values are an

349 underestimate due to two accessions with exceptionally low read numbers (O. hirta

350 J62, O. palmifrons). Mean sequencing depth for the datasets was 16 (LCN exons),

351 173 (plastome) and 290 (nrDNA cistron) (Table S2, Supporting information). 352

353 Plastome annotation

354 The plastid genome was annotated to enable partitioning of the alignment for

355 phylogenetic reconstruction, and the annotation can be summarized as follows (Fig.

356 S1, Supporting information): 79 protein-coding genes were found, including three

357 genes that were absent in closely related Ricinus (infA, ycf15, ycf68). Two genes

358 (rps16, rpl32) were missing in Oxalis compared to Ricinus. All 30 tRNA-coding genes

359 were present, as were all four rRNA-coding genes. 360

361 Phylogenetic reconstructions

362 1) Phylogenetic networks

13 363 In the phylogenetic network based on the concatenated LCN gene matrix (Fig. S2A,

364 Supporting information) splits were numerous but short, and in the plastome network

365 (Fig. S2B, Supporting information) splits were largely absent, thereby justifying the

366 use of phylogenetic tree reconstruction methods for those two datasets. In contrast,

367 the nrDNA cistron network contained numerous splits (Fig. S2C, Supporting

368 information).

369 2) Robustness of phylogenetic hypothesis based on the LCN genes

370 Although all avgESS estimates were satisfactory, a small minority of PSRF and

371 ASDSF values (<20) strongly differed from expectation (Fig. S3, Supporting

372 information). Removing these loci from species tree reconstruction idiosyncratically

373 affected the uncertain nodes discussed in the following (data not shown), and the loci

374 were kept. Species tree topology was relatively robust to both the method of species

375 and gene tree reconstruction, as the major clades were obtained and relationships

376 within clades were nearly identical in all cases (Fig. S4, Supporting information): (1)

377 Topology within the Hirta clade was identical between the different methods except

378 for the weakly supported position of Oxalis primuloides R. Knuth in the concatenated

379 tree with 45% bootstrap support (BS). (2) In all cases Oxalis amblyosepala Schltr.

380 and Oxalis polyphylla Jacq. formed a clade, which was in sister relationship to the

381 Hirta clade. (3) Oxalis hirsuta, Oxalis inconspicua T.M. Salter, O. obtusa, and Oxalis

382 pulchella Jacq. formed a clade, and within-clade relationships were identical with all

383 methods except for the MRP supertree based on BI of gene trees. Clustering of O.

384 inconspicua and the outgroup O. corniculata was apparently due to long-branch

385 attraction: These two species exhibited the longest branches in the tree, which is

386 likely to result in clustering together, independent on the relationships of the

387 underlying sequences (Felsenstein, 1978). (4) Phylogenetic placement of O.

388 imbricata, Oxalis orthopoda T.M. Salter, Oxalis truncatula Jacq., O. palmifrons and

389 Oxalis smithiana Eckl. & Zeyh. partly deviated between the different methods. In all

14 390 cases the latter two were in a successive sister relationship to the clade comprising

391 the Hirta clade and O. amblyosepala and O. polyphylla.

392 The number of incongruent nodes between species trees based on BI of gene

393 trees versus ML gene trees was two to three, and in a comparison of all utilized

394 species tree methods with concatenation there were six incongruent nodes (Fig. S4,

395 Supporting information). All nodes of the STAR tree had 100% BS. The

396 concatenated dataset showed weak support for all those nodes, which were

397 discussed above in (1) and (4) as ambiguous in phylogenetic placement; otherwise

398 the BS values were also 100%.

399 3) Incongruences between gene-gene and gene-species trees

400 The extensive and largely homogeneous cluster of RF distances between gene trees

401 shown in the PCoA demonstrated the great variability of gene tree topologies (Fig.

402 S5, Supporting information). RF distances between the gene trees and the STAR

403 species tree, displayed in a histogram (Fig. S6, Supporting information), was on

404 average RFD = 34, which is a relatively high value, considering that the maximum RF

405 value for this number of accessions is RFD = 46.

406 4) Comparison between phylogenetic trees based on the LCN genes, the plastome,

407 and the nrDNA cistron

408 The STAR species tree reconstruction method was chosen to compare datasets

409 (Figure 2). The plastome and nrDNA cistron trees resulted in the same major clades

410 as the LCN gene dataset: the Hirta clade, O. amblyosepala and O. polyphylla as

411 sister, and O. hirsuta, O. inconspicua, O. obtusa, and O. pulchella as a clade.

412 However, relationships within the Hirta clade were incongruently resolved, especially

413 in the plastome tree (Figure 2A). In the nrDNA cistron tree (Figure 2B) there were

414 numerous polytomies and partly low posterior probability (PP), which is possibly the

415 result of a lack of parsimony informative sites in the ribosomal RNAs combined with a

416 high rate of evolution in the spacers, hindering direct comparison. Similar to the

417 comparison of reconstruction methods of trees based on the LCN loci, phylogenetic

15 418 placement of O. imbricata, O. orthopoda, O. truncatula, O. palmifrons, and O.

419 smithiana deviated between the trees based on the three datasets; the differences

420 were slightly stronger, as O. palmifrons and O. smithiana were not in sister

421 relationship to the clade comprising the Hirta clade and O. amblyosepala and O.

422 polyphylla in the plastome and nrDNA cistron trees. All nodes of the plastome tree

423 were supported with 1 PP except for the Oxalis ciliaris Jacq. / O. hirta and Oxalis

424 tenella Jacq. / Oxalis callosa R. Knuth – O. primuloides splits (Figure 2A). The 18

425 incongruent nodes between the species tree based on the LCN loci and the plastome

426 tree revealed the strong cytonuclear discordance, which is supported by the high

427 support values in both the plastome tree and species tree based on the LCN genes

428 (Figure 2A). 429

430 Discussion

431 The value of our target enrichment probe design for plant phylogenetics and

432 its application in Hyb-Seq

433 Our LCN probe design pipeline, implemented in the BASH script Sondovač,

434 generates LCN loci from a combination of transcriptome and genome skim data.

435 Many transcriptomes can be taken from the already existing HTS resource 1KP

436 initiative, in which approximately 70% of APG III families are represented (APG III,

437 2009). Depending on the phylogenetic level of the group under study and the number

438 of LCN genes users want to obtain, a transcriptome of a more or less closer relative

439 of the study group must be utilized. Users need to provide paired-end genome skim

440 data of one of the taxa of their study group, as this accounts for the specificity of the

441 probe design. The reference sequences of the plastid and possibly mitochondrial

442 genome, which are used in the pipeline to remove organellar reads from the genome

443 skim read pool, are also part of existing HTS resources

444 (http://www.ncbi.nlm.nih.gov/genome/organelle/); plastome reference sequences

16 445 from taxa up to the same order of the studied plant group are suitable (Straub et al.,

446 2012).

447 By running Sondovač for southern African Oxalis, over 1,000 orthologous

448 LCN genes of adequate length were obtained. Target enrichment of these loci

449 resulted in a nearly complete data matrix. Efficiency of target enrichment of the LCN

450 loci was similar to that reported for the Asclepias dataset by Weitemier et al. (2014) if

451 considering only their accessions with external barcodes; there were approximately

452 60% on-target, quality-filtered reads after duplicate removal, although sequence

453 divergence to the LCN probes was larger for Oxalis (average 4% compared to 1.5%

454 for Asclepias). The average sequencing depth 16☓ of LCN exons should facilitate

455 unambiguous SNP calling in orthologous genes, but also enable the identification of

456 paralogous genes in polyploid accessions, which will be an essential step towards

457 using target enrichment on polyploids. Plastome assembly and annotation suggest

458 that it is possible to obtain (nearly) the whole plastid genome with Hyb-Seq. Only a

459 few quickly evolving regions, such as the full-length copy of ycf1, were partial in this

460 analysis. 461

462 Towards a refined phylogeny of southern African Oxalis

463 Although the selected 23 Oxalis species comprised only a small subset of species

464 from the southern African Oxalis clade, preliminary phylogenetic conclusions and

465 evolutionary implications can be drawn. Major phylogenetic relationships based on

466 the LCN loci did not strongly differ from the published phylogeny of Oberlander et al.

467 (2011), but node resolution and support improved dramatically.

468 Topologies of the trees based on the LCN loci were relatively robust to the

469 methods by which they were obtained (multispecies coalescent, MRP supertree,

470 concatenation) and also to the methods through which the gene trees were estimated

471 (BI and ML). Only few highly supported, conflicting relationships were found between

17 472 the concatenated tree and the coalescent trees, indicating that the coalescent model

473 likely reduces to the concatenation model in this case (Liu et al., 2009, 2015).

474 The phylogenetic tree topology based on the LCN genes showed widespread

475 and strong incongruences with the plastome tree topology. Cytonuclear discordance

476 is considered as evidence for either incomplete lineage sorting (ILS) of the

477 chloroplast or chloroplast introgression (chloroplast capture). Studies usually

478 interpreted this discordance in terms of hybridization (e.g., Helianthus L.: Dorado et

479 al., 1992, Mitella L.: Okuyama et al., 2005, Ficus L.: Renoult et al., 2009,

480 Senecioneae Cass.: Pelser et al., 2010), although the presence of ILS was rarely

481 statistically tested. In addition to cytonuclear discordance, both gene-gene and gene-

482 species trees showed quite strong topological incongruences in Oxalis. There was an

483 immense variety of gene tree topologies in general, and gene tree topologies strongly

484 differed from the species tree topology. The underlying reasons for this topological

485 variation need to be found by testing for ILS, hybridization, paralogy and a

486 combination of those under coalescent models (e.g., Rasmussen and Kellis, 2012,

487 Yu et al., 2014). Given the very short branch lengths at the base of the southern

488 African Oxalis clade (Oberlander et al., 2011) as well as generally large population

489 sizes of most species, strong ILS effects are perhaps to be expected, particularly at

490 the base of the putative southern African radiation. The extent of hybridization in

491 Oxalis is poorly known. The only confirmed hybrid is South American Oxalis tuberosa

492 Molina (Emshwiller and Doyle, 2002). Both heterostyly and pollinator specificity may

493 reduce the potential for hybridization: In Oxalis tristyly is predominant and distyly rare

494 (Weller et al., 2007, Gardner et al., 2011), meaning that three (in distyly two) floral

495 morphs with reciprocal placement of and anthers persist in a population.

496 Heterostyly promotes precise transfer, which could thus promote reproductive

497 isolation and prevent hybridization, if the position of sexual organs is not similar

498 between the mating individuals (Keller et al., 2012). Dense taxon sampling in the

499 southern African lineage is required to address the extent of hybridization. Finally,

18 500 gene duplication and loss could well be a major cause of the observed

501 incongruences, especially in the polyploid accessions, but testing that requires the

502 identification of paralogous genes. A bioinformatic pipeline for paralogue

503 identification of LCN genes from target enrichment data is currently under

504 development by AL.

505 The observed, strong cytonuclear discordance suggests that organellar

506 phylogenies alone are unlikely to represent the species tree (Davis et al., 2014), and

507 the strong incongruences between gene-gene and gene-species trees imply that a

508 large number of LCN loci is needed to robustly resolve phylogenies in the presence

509 of ILS and hybridization, especially of putatively radiating lineages such as southern

510 African Oxalis. 511

512 Acknowledgements

513 The authors thank Gane Ka-Shu Wong (University of Alberta) and Douglas E. Soltis

514 (University of Florida) for access to the 1KP transcriptome data, Tomáš Fér (Charles

515 University in Prague) for access to the genome skim data, Lenka Flašková (Charles

516 University in Prague) for DNA extraction, Filip Pardy (Central European Institute of

517 Technology, Brno) for help with sonication, Mark Dasenko (Oregon State University

518 Center for Genome Research and Biocomputing / CGRB) for Illumina sequencing

519 support, Sanjuro Jodgeo (Oregon State University) for data analysis support, and

520 Daisie Huang, Jennifer Mandel and an anonymous reviewer for constructive

521 comments. Data were analysed on the CGRB computer cluster, on MetaCentrum

522 under the program LM2010005, and on CERIT-SC under the program Centre CERIT

523 Scientific Cloud. MetaCentrum and CERIT-SC are part of the Operational Program

524 Research and Development for Innovations, Reg. no. CZ.1.05/3.2.00/08.0144, Czech

525 Republic. This work, including three stays in the Liston Laboratory, was financially

526 supported to R.S. by project Reg. no. CZ.1.07/2.3.00/30.0048 of the European Social

527 Fund in the Czech Republic through the Operational Program Education for

19 528 Competitiveness. Additional funding came from the long-term research development

529 projects RVO 67985939 (The Czech Academy of Sciences) and institutional

530 resources of the Ministry of Education, Youth and Sports of the Czech Republic for

531 the support of science and research. 532

533 Author contributions

534 A.L., R.S., and J.S. designed the study, R.S. and A.L. developed the probe design

535 pipeline, V.Z. converted this pipeline into the BASH script Sondovač, J.S., K.O., and

536 L.D. provided samples, R.S. conducted laboratory work, K.W., S.C.K.S., and R.C.C.

537 advised on data collection and computational analyses, R.S., V.Z., and K.O.

538 analysed the data, and R.S. wrote the initial draft of the manuscript. All authors

539 contributed to the final version of the manuscript. 540

541 References

542 The Angiosperm Phylogeny Group (2009) An update of the Angiosperm Phylogeny

543 Group classification for the orders and families of flowering plants: APG III. Botanical

544 Journal of the Linnean Society, 161, 105–121.

545 Baum BR (1992) Combining trees as a way of combining data sets for phylogenetic

546 inference, and the desirability of combining gene trees. Taxon, 41, 3–10.

547 Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina

548 sequence data. Bioinformatics, 30, 2114–2120.

549 Born J, Linder HP, Desmet P (2006) The Greater Cape Floristic Region. Journal of

550 Biogeography, 34, 147–162.

551 Bryant D, Moulton V (2004) Neighbor-Net: an agglomerative method for the

552 construction of phylogenetic networks. Molecular Biology and Evolution, 21, 255–

553 265.

20 554 Chamala S, García N, Godden GT et al. (2015) MarkerMiner 1.0: a new application

555 for phylogenetic marker development using angiosperm transcriptomes. Applications

556 in Plant Sciences, 3, 1400115.

557 Conant GC, Wolfe KH (2008) GenomeVx: simple web-based creation of editable

558 circular chromosome maps. Bioinformatics, 24, 861–862.

559 Davis CC, Xi Z, Mathews S (2014) Plastid phylogenomics and green plant

560 phylogeny: almost full circle but not quite there. BMC Biology, 12, 11.

561 DeGiorgio M, Syring J, Eckert AJ et al. (2014) An empirical evaluation of two-stage

562 species tree inference strategies using a multilocus dataset from North American

563 pines. BMC Evolutionary Biology, 14, 67.

564 De Sousa F, Bertrand YJK, Nylinder S et al. (2014) Phylogenetic properties of 50

565 nuclear loci in Medicago (Leguminosae) generated using multiplexed sequence

566 capture and next-generation sequencing. PLoS ONE, 9, e109704.

567 Dorado O, Rieseberg LH, Arias DM (1992) Chloroplast DNA introgression in

568 southern California sunflowers. Evolution, 46, 566–572.

569 Doyle JJ, Doyle JL (1987) A rapid DNA isolation procedure for small quantities of

570 fresh leaf tissue. Phytochemical Bulletin, 19, 11–15.

571 Emshwiller E, Doyle JJ (2002) Origins of domestication and polyploidy in oca (Oxalis

572 tuberosa; Oxalidaceae). 2. Chloroplast-expressed glutamine synthetase data.

573 American Journal of Botany, 89, 1042–1056.

574 Faircloth BC, McCormack JE, Crawford NG et al. (2012) Ultraconserved elements

575 anchor thousands of genetic markers for target enrichment spanning multiple

576 evolutionary timescales. Systematic Biology, 61, 717–726.

577 Felsenstein J (1978) Cases in which parsimony or compatibility methods will be

578 positively misleading. Systematic Zoology, 27, 401–410.

579 Givnish TJ, Ames M, McNeal JR et al. (2010) Assembling the tree of the

580 Monocotyledons: plastome sequence phylogeny and evolution of Poales. Annals of

581 the Missouri Botanical Garden, 97, 584–616.

21 582 Gordon A, Hannon GJ (2010) FASTX-Toolkit. FASTQ/A short-reads pre-processing

583 tools. http://hannonlab.cshl.edu/fastx_toolkit/ [accessed 29 May 2015].

584 Grover CE, Gallagher JP, Jareczek JJ et al. (2015) Re-evaluating the phylogeny of

585 allopolyploid Gossypium L.. Molecular Phylogenetics and Evolution, 92, 45–52.

586 Hedtke SM, Morgan MJ, Cannatella DC, Hillis DM (2013) Targeted enrichment:

587 maximizing orthologous gene comparisons across deep evolutionary time. PLoS

588 ONE, 8, e67908.

589 Heyduk K, Trapnell DW, Barrett CF, Leebens-Mack J (2015) Phylogenomic analyses

590 of species relationships in the genus Sabal (Arecaceae) using targeted sequence

591 capture. Biological Journal of the Linnean Society, early view.

592 Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary

593 studies. Molecular Biology and Evolution, 23, 254–267.

594 Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning

595 of incongruence. Trends in Genetics, 22, 225–231.

596 Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence

597 alignment program. Briefings in Bioinformatics, 9, 286–298.

598 Kearse M, Moir R, Wilson A et al. (2012) Geneious Basic: an integrated and

599 extendable desktop software platform for the organization and analysis of sequence

600 data. Bioinformatics, 28, 1647–1649.

601 Keller B, de Vos JM, Conti E (2012) Decrease of sexual organ reciprocity between

602 heterostylous primrose species, with possible functional and evolutionary

603 implications. Annals of Botany, 110, 1233–1244.

604 Kent WJ (2002) BLAT – the BLAST-like alignment tool. Genome Research, 12, 656–

605 664.

606 Lanfear R, Calcott B, Ho SY, Guindon S (2012) PartitionFinder: combined selection

607 of partitioning schemes and substitution models for phylogenetic analyses. Molecular

608 Biology and Evolution, 29, 1695–1701.

22 609 Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nature

610 Methods, 9, 357–359.

611 Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets

612 of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.

613 Li H, Handsaker B, Wysoker A et al. (2009) The Sequence Alignment/Map format

614 and SAMtools. Bioinformatics, 25, 2078–2079.

615 Liu L, Xi Z, Wu S, Davis C, Edwards SV (2015) Estimating phylogenetic trees from

616 genome-scale data. ArXiv:1501.03578.

617 Liu L, Yu L, Edwards SV (2010) A maximum pseudo-likelihood approach for

618 estimating species trees under the coalescent model. BMC Evolutionary Biology, 10,

619 302.

620 Liu L, Yu L, Kubatko L et al. (2009) Coalescent methods for estimating phylogenetic

621 trees. Molecular Phylogenetics and Evolution, 53, 320–328.

622 Liu L, Yu L, Pearl DK, Edwards SV (2009) Estimating species phylogenies using

623 coalescence times among sequences. Systematic Biology, 58, 468–477.

624 Magoč T, Salzberg SL (2011) FLASH: fast length adjustment of short reads to

625 improve genome assemblies. Bioinformatics, 27, 2957–2963.

626 Mandel JR, Dikow RB, Funk VA (2015) Using phylogenomics to resolve mega-

627 families: an example from Compositae. Journal of Systematics and Evolution, 53,

628 391–402.

629 Mandel JR, Dikow RB, Funk VA et al. (2014) A target enrichment method for

630 gathering phylogenetic information from hundreds of loci: an example from the

631 Compositae. Applications in Plant Sciences, 2, 1300085.

632 Milne I, Stephen G, Bayer M et al. (2013) Using Tablet for visual exploration of

633 second-generation sequencing data. Briefings in Bioinformatics, 14, 193–202.

634 Mirarab S, Reaz R, Bayzid MdS et al. (2014) ASTRAL: genome-scale coalescent-

635 based species tree estimation. Bioinformatics, 30, i541–i548.

23 636 Moore MJ, Soltis PS, Bell CD et al. (2010) Phylogenetic analysis of 83 plastid genes

637 further resolves the early diversification of . Proceedings of the National

638 Academy of Sciences of the United States of America, 107, 4623–4628.

639 Nicholls JA, Pennington RT, Koenen EJM et al. (2015) Using targeted enrichment of

640 nuclear genes to increase phylogenetic resolution in the neotropical rain forest genus

641 Inga (Leguminosae: Mimosoideae). Frontiers in Plant Science,

642 doi:10.3389/fpls.2015.00710.

643 Oberlander KC, Dreyer LL, Bellstedt DU (2011) Molecular phylogenetics and origins

644 of southern African Oxalis. Taxon, 60, 1667–1677.

645 Oberlander KC, Roets F, Dreyer LL (2014) Pre-Pleistocene origin of an endangered

646 habitat: links between vernal pools and aquatic Oxalis in the Greater Cape Floristic

647 Region of South Africa. Journal of Biogeography, 41, 1572–1582.

648 Okuyama Y, Fujii N, Wakabayashi M et al. (2005) Nonuniform concerted evolution

649 and chloroplast capture: heterogeneity of observed introgression patterns in three

650 molecular data partition phylogenies of Asian Mitella (Saxifragaceae). Molecular

651 Biology and Evolution, 22, 285–296.

652 Ovcharenko I, Loots GG, Giardine BM et al. (2005) Mulan: multiple-sequence local

653 alignment and visualization for studying function and evolution. Genome Research,

654 15, 184–194.

655 Parks M, Cronn R, Liston A (2009) Increasing phylogenetic resolution at low

656 taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC

657 Biology, 7, 84.

658 Parks M, Cronn R, Liston A (2012) Separating the wheat from the chaff: mitigating

659 the effects of noise in a plastome phylogenomic dataset from Pinus L. (Pinaceae).

660 BMC Evolutionary Biology, 12, 100.

661 Pelser PB, Kennedy AH, Tepe EJ et al. (2010) Patterns and causes of incongruence

662 between plastid and nuclear Senecioneae (Asteraceae) phylogenies. American

663 Journal of Botany, 97, 856–873.

24 664 Pillon Y, Johansen J, Sakishima T et al. (2014) Primers for low-copy nuclear genes in

665 Metrosideros and cross-amplification in Myrtaceae. Applications in Plant Sciences, 2,

666 1400049.

667 Proches S, Cowling RM, Goldblatt P et al. (2006) An overview of the Cape

668 geophytes. Biological Journal of the Linnean Society, 87, 27–43.

669 Pyron RA (2015) Post-molecular systematics and the future of phylogenetics. Trends

670 in Ecology and Evolution, 30, 384–389.

671 Ragan MA (1992) Phylogenetic inference based on matrix representation of trees.

672 Molecular Phylogenetics and Evolution, 1, 53–58.

673 Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and

674 coalescence using a locus tree. Genome Research, 22, 755–765.

675 Reneker J, Lyons E, Conant GC et al. (2012) Long identical multispecies elements in

676 plant and animal genomes. Proceedings of the National Academy of Sciences of the

677 United States of America, 109, E1183–E1191.

678 Renoult JP, Kjellberg F, Grout C et al. (2009) Cyto-nuclear discordance in the

679 phylogeny of Ficus section Galoglychia and host shifts in plant-pollinator

680 associations. BMC Evolutionary Biology, 9, 248.

681 Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Mathematical

682 Biosciences, 53, 131–147.

683 Ronquist F, Huelsenbeck J (2003) MrBayes 3: Bayesian phylogenetic inference

684 under mixed models. Bioinformatics, 19, 1572–1574.

685 Rothfels CJ, Larsson A, Li F-W et al. (2013) Transcriptome-mining for single-copy

686 nuclear markers in ferns. PLoS ONE, 8, e76957.

687 Schliep KP (2011) phangorn: phylogenetic analysis in R. Bioinformatics, 27, 592–

688 593.

689 Seo TK (2008) Calculating bootstrap probabilities of phylogeny using multilocus

690 sequence data. Molecular Biology and Evolution, 25, 960–971.

25 691 Shaw TI, Ruan Z, Glenn TC, Liu L (2013) STRAW: Species TRee Analysis Web

692 server. Nucleic Acids Research, 41, W238–W241.

693 Small RL, Cronn RC, Wendel JF (2004) Use of nuclear genes for phylogeny

694 reconstruction in plants. Australian Systematic Botany, 17, 145–170.

695 Smith BT, Harvey MG, Faircloth BC et al. (2013) Target capture and massively

696 parallel sequencing of ultraconserved elements for comparative studies at shallow

697 evolutionary time scales. Systematic Biology, 0, 1–13.

698 Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic

699 analyses with thousands of taxa and mixed models. Bioinformatics, 22, 2688–2690.

700 Stephens JD, Rogers WL, Heyduk K et al. (2015a) Resolving phylogenetic

701 relationships of the recently radiated carnivorous plant genus Sarracenia using target

702 enrichment. Molecular Phylogenetics and Evolution, 85, 76–87.

703 Stephens JD, Rogers WL, Mason CM et al. (2015b) Species tree estimation of

704 diploid Helianthus (Asteraceae) using target enrichment. American Journal of Botany,

705 102, 910–920.

706 Straub SCK, Fishbein M, Livshultz T et al. (2011) Building a model: developing

707 genomic resources for common milkweed (Asclepias syriaca) with low coverage

708 genome sequencing. BMC Genomics, 12, 211.

709 Straub SC, Parks M, Weitemier K et al. (2012) Navigating the tip of the genomic

710 iceberg: next-generation sequencing for plant systematics. American Journal of

711 Botany, 99, 349–364.

712 Tamura K, Stecher G, Peterson D et al. (2013) MEGA6: molecular evolutionary

713 genetics analysis version 6.0. Molecular Biology and Evolution, 30, 2725–2729.

714 Tonini J, Moore A, Stern D et al. (2015) Concatenation and species tree methods

715 exhibit statistically indistinguishable accuracy under a range of simulated conditions.

716 PLoS Currents, 7, ecurrents.tol.34260cc27551a527b124ec5f6334b6be.

717 Tonnabel J, Olivieri I, Mignot A et al. (2014) Developing nuclear DNA phylogenetic

718 markers in the angiosperm genus Leucadendron (Proteaceae): a next-generation

26 719 sequencing transcriptomic approach. Molecular Phylogenetics and Evolution, 70, 37–

720 46.

721 von Haeseler A (2012) Do we still need supertrees? BMC Biology, 10, 13.

722 Weitemier K, Straub SCK, Cronn RC et al. (2014) Hyb-Seq: combining target

723 enrichment and genome skimming for plant phylogenomics. Applications in Plant

724 Sciences, 2, 1400042.

725 Wyman SK, Jansen RK, Boore JL (2004) Automatic annotation of organellar

726 genomes with DOGMA. Bioinformatics, 20, 3252–3255.

727 Yu Y, Dong J, Liu KJ, Nakhleh L (2014) Maximum likelihood inference of reticulate

728 evolutionary histories. Proceedings of the National Academy of Sciences of the

729 United States of America, 111, 16448–16453. 730

731 Data accessibility

732 Sondovač is deposited in GitHub (https://github.com/V-Z/sondovac/wiki/) together

733 with the input Oxalis data (genome skim paired-end reads, plastid and mitochondrial

734 reference). There we provided a link to the transcriptome data, which were obtained

735 in the framework of the 1KP initiative. Raw reads, assemblies, alignments,

736 phylogenetic trees, the Geneious de novo assembly output and the final probe

737 sequences are available under Dryad doi:10.5061/dryad.dn08t. 738

27 739 Figures and tables 740

741 Figure 1

742 Workflow of the probe design script Sondovač. An overview on the main steps of

743 Hyb-Seq are given in the top part of the figure; probe design is the first one. Each

744 step of Sondovač is numbered and illustrated by three boxes each: Software is

745 highlighted in dark gray, a summary of each step is given in medium gray, and

746 input/output of each step is depicted in light gray. Optional removal of reads of

747 mitochondrial origin from the genome skim data is marked by decoloration of the text.

748 The required input files of Sondovač are highlighted in bold. The direction of the

749 workflow is indicated by arrows. 750

751 Figure 2

752 Comparison of phylogenetic hypotheses based on 727 LCN genes, the plastid

753 genome and the nrDNA cistron of southern African Oxalis. (A) Comparison between

754 the STAR species tree, based on maximum likelihood gene trees, and the plastome

755 tree. (B) Comparison between the STAR species tree and the nrDNA cistron tree.

756 Dashed lines connect each accession to its placement in the contrasting tree. Values

757 close to the nodes are bootstrap support (* = 100%) in the STAR species tree or

758 posterior probability (* = 1) in the plastome and nrDNA cistron tree. The Hirta clade

759 (Oberlander et al., 2011) is colored in light gray (respectively green in the on-line

760 color figure). Nodes that are incongruent between the contrasting phylogenetic trees

761 of the different datasets are marked with arrows.

28 Hybridization between Bait synthesis, based Probe design baits and genomic Sequencing Data analysis on probe sequences libraries

2 3 4 Bowtie 2, SAMtools, Bowtie 2, SAMtools, FLASH bam2fastq bam2fastq Removal of reads of Removal of reads of Combination of plastid origin mitochondrial origin paired-end reads

Paired-end genome skim Combined genome skim Paired-end genome Paired-end genome skim reads without plastid and reads without plastid (and skim reads reads without plastid reads mitochondrial reads mitochondrial) reads

Mitochondriome Plastome reference 1 reference BLAT, UNIX commands

Removal of transcripts sharing ≥90% sequence similarity

Transcripts Unique transcripts

8 7 6 5 UNIX commands Geneious UNIX commands BLAT

Retention of those contigs De novo assembly of the Filtering of BLAT output: Matching of the unique that comprise exons transcript or genome skim (1) Choice of transcript or transcripts and the filtered, ≥ bait length (in the Oxalis BLAT hits [depending on genome skim sequences combined genome skim case 120 bp) and have a the selection in (1) in the for further processing reads sharing ≥85% certain total locus length previous step] to larger (2) Removal of transcripts sequence similarity (in the Oxalis case contigs with >1,000 BLAT hits ≥600 bp) (3) Removal of transcript or genome skim BLAT hits [depending on the selection in (1)] containing masked nucleotides

De novo assembled Quality-filtered sequences Sequences of matching Length-filtered probe sequences of filtered of matching transcripts transcripts and genome sequences BLAT hits and genome skim reads skim reads

9 10 11 cd-hit-est UNIX commands BLAT, UNIX commands

Removal of probe Retention of those contigs Removal of probe sequences sharing ≥90% that comprise exons sequences sharing ≥90% sequence similarity ≥ bait length (in the Oxalis sequence similarity with case 120 bp) and have a the plastome reference certain total locus length (in the Oxalis case ≥600 bp)

Similarity-filtered probe Similarity-, length-filtered Final probe sequences sequences probe sequences

Plastome reference A LCN genes Plastome

B LCN genes nrDNA cistron 1 Supporting information 2 3 Table S1 4 Voucher information of southern African Oxalis accessions used in this study. 5 Herbarium specimens are either deposited in the Herbarium Collections at Charles 6 University in Prague (PRC) or the Stellenbosch University Herbarium (STEU). Ploidy 7 level estimates are unpublished data of JS. 8 Species Herbarium/ Locality Latitude/ Collector/ Ploidy Voucher no. Longitude Date Oxalis amblyosepala Schltr. PRC Vanhrynsdorp, Western Cape, S31° 46' 52.5'' J. Suda & R. Sudová 2x J11-969 South Africa E18° 46' 02.7'' 24.08.2011 Oxalis blastorrhiza T.M. Salter PRC Vanhrynsdorp, Western Cape, S31° 36' 07.9'' J. Suda & R. Sudová 2x J557 South Africa E18° 43' 47.9'' 31.05.2010 Oxalis callosa R. Knuth PRC Calvinia, Northern Cape, South S31° 58' 22.4'' J. Suda & R. Sudová 6x J11-726 Africa E20° 08' 50.3'' 19.08.2011 Oxalis ciliaris Jacq. PRC Swellendam, Western Cape, S34° 01' 28.3'' J. Suda & P. Trávníček 4x J233 South Africa E20° 25' 40.3'' 01.09.2009 Oxalis creaseyii T.M. Salter PRC Garies, Northern Cape, South S30° 28' 09.9'' J. Suda & R. Sudová 2x J11-961 Africa E18° 08' 08.2'' 24.08.2011 Oxalis exserta T.M. Salter PRC Springbok, Northern Cape, S29° 26' 14.4'' J. Suda & R. Sudová 8x J616 South Africa E17° 57' 10.1'' 08.06.2010 Oxalis gracilis Jacq. PRC Vanhrynsdorp, Western Cape, S31° 43' 04.4'' J. Suda & R. Sudová 2x J558 South Africa E18° 46' 07.6'' 31.05.2010 Oxalis helicoides T.M. Salter PRC Springbok, Northern Cape, S29° 46' 22.9'' J. Suda & P. Trávníček 2x J319 South Africa E17° 49' 15.5'' 06.09.2009 Oxalis hirsuta Sond. PRC Williston, Northern Cape, South S31° 22' 59.6'' J. Suda & R. Sudová 2x J12-20 Africa E20° 35' 24.6'' 14.04.2012 Oxalis hirta L. PRC Yzerfontein, Western Cape, S33° 19' 42.1'' J. Suda & R. Sudová 4x J57 South Africa E18° 11' 15.9'' 29.08.2007 Oxalis hirta L. PRC Yzerfontein, Western Cape, S33° 08' 10.2'' J. Suda & R. Sudová 4x J62 South Africa E17° 59' 54.5'' 30.08.2007 Oxalis imbricata Eckl. & Zeyh. STEU Worcester, Western Cape, S33°53.624' K. Oberlander 2x MO472 South Africa E19°26.284' 05.05.2004 Oxalis inconspicua T.M. Salter PRC Garies, Northern Cape, South S30° 32' 03.9'' J. Suda & R. Sudová 2x J595 Africa E18° 08' 14.9'' 02.06.2010 Oxalis kamiesbergensis T.M. PRC Kamieskroon, Northern Cape, S30° 10' 43.6'' J. Suda & R. Sudová 2x Salter J153 South Africa E18° 00' 45.0'' 14.09.2007 Oxalis obtusa Jacq. PRC Kommetjie, Western Cape, S34° 08' 00.5'' J. Suda & R. Sudová 2x J12 South Africa E18° 20' 13.6'' 26.08.2007 Oxalis obtusa Jacq. PRC Calvinia, Northern Cape, South S31° 19' 51.1'' J. Suda & R. Sudová 2x J99 Africa E19° 52' 09.4'' 02.09.2007 Oxalis orthopoda T.M. Salter STEU Mossel Bay, Western Cape, S34°10.723' K. Oberlander 4x MO351 South Africa E21°56.650' 24.04.2003 Oxalis palmifrons T.M. Salter STEU NA NA Ex Hort./ 2x MO403 NA Oxalis polyphylla Jacq. PRC De Hoop, Western Cape, South S34° 25' 51.7'' J. Suda & J. Rauchová 2x J11-44 Africa E20° 25' 06.5'' 21.05.2011 Oxalis primuloides R. Knuth PRC Calvinia, Northern Cape, South S31° 21' 03.1'' J. Suda & R. Sudová 2x J136 Africa E19° 31' 29.0'' 12.09.2007

1 Oxalis pulchella Jacq. PRC Steinkopf, Northern Cape, S29° 13' 33.2'' J. Suda & R. Sudová 2x J11-875 South Africa E17° 36' 29.2'' 22.08.2011 Oxalis reclinata Jacq. PRC Bitterfontein, Western Cape, S30° 58' 34.4'' J. Suda & P. Trávníček 2x J343 South Africa E18° 14' 03.6'' 08.09.2009 Oxalis smithiana Eckl. & Zeyh. PRC Cradock, Eastern Cape, South S32° 15' 46.4'' P. Trávníček & M. Kubešová 4x J11-511 Africa E25° 25' 21.9'' 05.06.2011 Oxalis tenella Jacq. PRC Clanwilliam, Western Cape, S32° 10' 05.3'' J. Suda & R. Sudová 2x J131 South Africa E18° 52' 15.2'' 11.09.2007 Oxalis truncatula Jacq. PRC Greyton, Western Cape, South S34° 04' 11.8'' J. Suda & P. Trávníček 4x J230 Africa E19° 37' 58.0'' 31.08.2009 9 10 Table S2 11 Success of Hyb-Seq in terms of number of reads after duplicate read removal, 12 number of assembled LCN exons and genes, sequence divergence between the 13 LCN matrix and the LCN probes, and completion and sequencing depth of all three 14 datasets of southern African Oxalis. The table is complemented by the genome skim 15 accession O. obtusa J12 and the transcriptome accession O. corniculata JHCN, 16 which did not form part of the calculation of the means. 17

Species Accession # quality-filtered # LCN exons # LCN genes Mean Completion of Completion of Completion of Mean Mean Mean number reads after assembled assembled sequence LCN matrix plastome nrDNA cistron sequencing sequencing sequencing duplicate from total 4,926 from total 1,164 divergence depth of LCN depth of depth of nrDNA removal from LCN loci plastome cistron probes

Oxalis J11-969 497,402 4,246 (86%) 1,161 (100%) 4% 94% 98% 100% 13 166 209 amblyosepala

Oxalis J557 715,360 4,339 (88%) 1,160 (100%) 4% 95% 99% 100% 16 220 427 blastorrhiza

Oxalis callosa J11-726 788,393 4,490 (91%) 1,160 (100%) 4% 97% 98% 100% 17 139 405

Oxalis ciliaris J233 542,186 4,345 (88%) 1,160 (100%) 4% 95% 99% 100% 13 99 223

Oxalis creaseyii J11-961 703,091 4,454 (90%) 1,162 (100%) 4% 96% 99% 99% 15 155 293

Oxalis exserta J616 496,193 4,170 (85%) 1,160 (100%) 4% 92% 97% 100% 12 103 225

Oxalis gracilis J558 749,774 4,432 (90%) 1,162 (100%) 4% 96% 94% 100% 17 118 418

Oxalis helicoides J319 798,432 4,450 (90%) 1,162 (100%) 4% 97% 99% 100% 16 184 379

Oxalis hirsuta J12-20 933,290 4,514 (92%) 1,163 (100%) 3% 98% 99% 100% 21 393 306

Oxalis hirta J57 472,230 4,137 (84%) 1,162 (100%) 4% 91% 98% 93% 11 111 273

Oxalis hirta J62 156,655 2,611 (53%) 1,093 (94%) 4% 63% 79% 98% 8 55 69

Oxalis imbricata MO472 3,280,494 4,520 (92%) 1,163 (100%) 4% 99% 98% 83% 45 361 1,583

Oxalis J595 655,265 4,408 (89%) 1,161 (100%) 4% 95% 99% 98% 16 217 172 inconspicua

Oxalis J153 401,417 3,913 (79%) 1,158 (99%) 4% 88% 99% 79% 12 138 228 kamiesbergensis

Oxalis obtusa J99 534,889 4,287 (87%) 1,163 (100%) 3% 95% 99% 81% 14 166 233

Oxalis orthopoda MO351 939,291 4,516 (92%) 1,163 (100%) 4% 98% 96% 100% 21 168 660

Oxalis palmifrons MO403 168,543 1,253 (25%) 753 (65%) 4% 22% 64% 100% 16 83 58

Oxalis polyphylla J11-44 599,495 4,298 (87%) 1,159 (100%) 4% 94% 99% 100% 16 234 190

Oxalis J136 500,786 4,093 (83%) 1,161 (100%) 4% 91% 99% 100% 13 153 317 primuloides

Oxalis pulchella J11-875 549,374 4,225 (86%) 1,162 (100%) 4% 93% 99% 100% 15 226 248

Oxalis reclinata J343 375,386 4,003 (81%) 1,158 (99%) 4% 90% 99% 100% 11 111 158

Oxalis smithiana J11-511 1,074,130 4,451 (90%) 1,162 (100%) 3% 97% 96% 100% 26 299 981

Oxalis tenella J131 511,049 4,343 (88%) 1,160 (100%) 4% 94% 99% 100% 14 137 246

Oxalis truncatula J230 627,899 4,488 (91%) 1,164 (100%) 3% 98% 98% 79% 15 126 231

Mean 711,293 4,124 (84%) 1,141 (98%) 4% 90% 96% 96% 16 173 290

Oxalis obtusa J12 7,938,349 3,972 (81%) 1,160 (100%) 1% 88% 99% 100% 11 825 4,490

Oxalis corniculata JHCN NA NA 1,163 (100%) 11% 93% NA NA NA NA NA 18 19 Table S3

2 20 Success of Hyb-Seq in terms of number of raw reads after removal of PhiX reads, 21 reads after quality-filtering, reads after duplicate read removal, number and 22 proportion of reads mapped to the LCN probes (on-target), the plastome reference 23 and the nrDNA cistron reference in southern African Oxalis. In cases where the sum 24 of on-target, plastid and nrDNA reads after assembly exceeded the initial, total 25 number of quality-filtered reads after duplicate removal, which was obviously due to 26 read mapping across datasets, the three read sources were marked with an asterisk. 27 The table is complemented by the genome skim accession O. obtusa J12, which did 28 not form part of the calculation of the means. The effect of target enrichment is 29 evident from a comparison of the proportion of reads mapped on-target between the 30 genome skim accession J12 and the Hyb-Seq accession J99 of O. obtusa: Instead of 31 on average 2% genome skim reads 59% Hyb-Seq reads mapped on-target. 32 Species Accession # raw reads # quality-filtered # quality-filtered # on-target reads # plastid reads # nrDNA reads number reads reads after (as % of quality- (as % of quality- (as % of quality- duplicate removal filtered reads filtered reads filtered reads (as % of quality- after duplicate after duplicate after duplicate filtered reads) removal) removal) removal) Oxalis J11-969 1,220,088 1,101,853 497,402 (45%) 313,410 (63%) 162,330 (33%) 11,079 (2%) amblyosepala Oxalis J557 1,895,744 1,724,917 715,360 (41%) 402,111 (56%) 222,377 (31%) 26,254 (4%) blastorrhiza Oxalis callosa J11-726 1,451,514 1,311,491 788,393 (60%) 472,837 (60%) 142,150 (18%) 20,045 (3%)

Oxalis ciliaris J233 1,012,810 931,389 542,186 (58%) 317,293 (59%) 96,490 (18%) 10,788 (2%)

Oxalis creaseyii J11-961 1,543,348 1,423,598 703,091 (49%) 424,600 (60%) 154,369 (22%) 16,048 (2%)

Oxalis exserta J616 935,348 841,451 496,193 (59%) 268,508 (54%) 97,016 (20%) 13,602 (3%)

Oxalis gracilis J558 1,296,462 1,188,885 749,774 (63%) 447,138 (60%) 109,747 (15%) 28,477 (4%)

Oxalis helicoides J319 1,838,158 1,704,468 798,432 (47%) 458,664 (57%) 186,043 (23%) 19,722 (2%)

Oxalis hirsuta J12-20 3,207,132 2,876,738 933,290 (32%) 625,456 (67%)* 407,215 (44%)* 20,964 (2%)*

Oxalis hirta J57 1,039,590 942,496 472,230 (50%) 236,745 (50%) 111,585 (24%) 12,304 (3%)

Oxalis hirta J62 276,908 246,727 156,655 (63%) 77,330 (49%) 39,072 (25%) 3,420 (2%)

Oxalis imbricata MO472 6,867,908 6,243,085 3,280,494 (53%) 1,775,687 (54%) 372,088 (11%) 69,811 (2%)

Oxalis J595 1,773,838 1,635,614 655,265 (40%) 419,477 (64%)* 224,056 (34%)* 18,540 (3%)* inconspicua Oxalis J153 1,029,982 927,395 401,417 (43%) 232,487 (58%) 134,519 (34%) 9,216 (2%) kamiesbergensis Oxalis obtusa J99 1,378,380 1,260,241 534,889 (42%) 331,024 (62%) 162,427 (30%) 9,398 (2%)

Oxalis orthopoda MO351 1,849,216 1,711,627 939,291 (55%) 620,990 (66%) 167,454 (18%) 87,142 (9%)

Oxalis palmifrons MO403 507,374 395,164 168,543 (43%) 66,170 (39%) 64,900 (39%) 3,728 (2%)

Oxalis polyphylla J11-44 1,806,672 1,631,230 599,495 (37%) 383,590 (64%)* 235,690 (39%)* 11,903 (2%)*

Oxalis J136 1,197,900 1,100,614 500,786 (46%) 264,869 (53%) 147,514 (29%) 19,336 (4%) primuloides Oxalis pulchella J11-875 1,719,890 1,540,537 549,374 (36%) 339,571 (62%)* 237,816 (43%)* 17,299 (3%)*

Oxalis reclinata J343 826,650 742,405 375,386 (51%) 224,180 (60%) 109,011 (29%) 7,297 (2%)

3 Oxalis smithiana J11-511 2,730,800 2,505,934 1,074,130 (43%) 629,066 (59%) 283,942 (26%) 61,765 (6%)

Oxalis tenella J131 1,207,402 1,114,970 511,049 (46%) 320,370 (63%) 134,438 (26%) 13,894 (3%)

Oxalis truncatula J230 1,262,226 1,158,263 627,899 (54%) 419,502 (67%) 122,845 (20%) 8,458 (1%)

Mean 1,630,473 1,510,879 711,293 (47%) 419,628 (59%) 171,879 (27%) 21,687 (3%)

Oxalis obtusa J12 9,236,186 8,525,040 7,938,349 (93%) 186,831 (2%) 1,113,811 (14%) 264,900 (3%) 33 34 Fig. S1 35 Map of the draft plastid genome of Oxalis hirsuta. The thin black lines of the inner 36 circle indicate the location of the large single copy (LSC) and small single copy (SSC) 37 regions, the thick black lines those of the inverted repeats (IR). Transcription is 38 clockwise for genes on the outside of the circle and counterclockwise for genes on 39 the inside of the circle. Genes with multiple exons are marked with an asterisk. Quasi 40 full-length (lacking start and stop codon) and partial genes are marked with an arrow. 41 42 Fig. S2 43 Splits network of the LCN gene matrix, the plastome and the nrDNA cistron datasets. 44 (A) 727 concatenated LCN loci (least squares fit = 99.5%), (B) plastid genome (least 45 squares fit = 98.0%), (C) nrDNA cistron (least squares fit = 96.7%). For better 46 visualization only bootstrap support values above 90% are shown. 47 48 Fig. S3 49 Convergence measures of Bayesian MCMC runs. (A) Average effective sample size 50 (avgESS) and (B) potential scale reduction factor (PSRF). Parameters are TL: total 51 tree length; r(A⟨–⟩C)–r(G⟨–T⟩ ): reversible substitution rates; k_revmat: substitution 52 rates of the GTR model; pi(A)–pi(T): stationary state frequencies; alpha: shape of 53 gamma distribution of rate variation across sites. Convergence is indicated by 54 avgESS >100 and PSRF ~1.0. (C) Average standard deviation of split frequencies 55 (ASDSF), shown for each of the 727 LCN genes. Convergence is indicated by 56 ASDSF <0.05. 57 58 Fig. S4 59 Comparison of phylogenetic trees based on 727 LCN genes. Species trees were 60 reconstructed under the assumption of the multispecies coalescent model 61 implemented in ASTRAL (A), MP-EST (B), and STAR (C), as MRP supertree (D) and 62 concatenated supermatrix (E). Except for concatenation both Bayesian inference (BI) 63 and maximum likelihood (ML) were used for building the species tree. Dashed lines 64 connect each accession to its placement in the contrasting tree. Values close to the 65 nodes are bootstrap support (* = 100%). The Hirta clade (Oberlander et al., 2011) is 66 colored in green. Nodes which are incongruent between BI of gene trees and ML 67 gene trees within the same species tree method are marked with a blue arrow. 68 Nodes which are incongruent between the different species tree methods and 69 concatenation are marked with orange arrows on the concatenated tree. 70 71 Fig. S5 72 Principal coordinate analysis of Robinson-Foulds (RF) distances between gene trees. 73 (A) RF distances between gene trees obtained with Bayesian inference (BI), (B) RF 74 distances between maximum likelihood (ML) gene trees. Eigenvalues were highest 75 for the first three axes, and the first two are displayed. Each dot represents a gene 76 tree. Kernel density is indicated with blue lines. 77 78 Fig. S6

4 79 Robinson-Foulds (RF) distances between gene trees and the species tree. (A) RF 80 distances between gene trees obtained with Bayesian inference (BI) and the STAR 81 species tree, (B) RF distances between maximum likelihood (ML) gene trees and the 82 STAR species tree.

5 trnS

trnL trnF trnL* * -GGA

-UCC

-UAA -GAA -UAA trnM trnG psbZ -GGU psbC psbD -CAU trnT

psaA ycf3 ycf3 ycf3 rps4 trnT -GCA

ndhJ petN

ndhK psaB

rbcL -UGU rps14 trnC ndhC * * -CAU ⌂ trnV * -UGA trnV accD atpE atpB -UAC trnS -UAC trnfM -UUC -GUA -GUC psaI psbM trnE ycf4 trnY trnD cemA * * rpoB petA * rpoC1 * rpoC1 petL petG psbJ psaJ psbF LSC rpoC2 rpl33 psbL ⌂ rps18 trnW psbE trnP -CCA rps2 -UGG atpI rpl20 rps12 atpH clpP 5' end * atpF clpP * ⌂ ⌂ Oxalis hirsuta draft plastome *atpF clpP * -UCU psbB ⌂ atpA trnR * trnG-GCC psbT 153,678 bp psbH psbN psbI

trnS-GCU psbK ⌂ petB trnQ-UUG petD ⌂ rpoA trnK-UUU rps11 *

rpl36 ⌂

infA ⌂ matK rps8 rpl14 *trnK-UUU rpl16 psbA

trnH ⌂ -GUG rps3 rps19 rpl22 rpl2 rps19 * rpl2* rpl23 rpl2 * trnI * rpl2 -CAU rpl23 trnI-CAU IRb

IRa ⌂ ycf2

SSC

ycf2 -CAA ⌂ ⌂ * trnL trnL ycf68 -CAA ndhB * trnV ndhB 3' end ndhB -GAC ndhB rps7 *trnI * *trnA*trnI rrn16 rps7 rps12 *trnA -GAU -GAU rps12 * ycf15 -UGC -UGC

trnR ycf15

rrn23 3' end

rrn4.5 ⌂ ⌂ -ACG rrn5

-GUU rps15 ycf1 ndhH ndhA

ndhG ndhA ndhE -GAC psaC * rrn16 ndhI * trnN

trnV -GAU ndhF -GAU trnI -UGC ndhD trnN * trnI -UGC rrn23 * trnA -GUU trnA rrn5 * rrn4.5 -ACG ycf68 * ycf1'

⌂ ccsA trnR ⌂ -UAG

trnL

ATP synthase Ribosomal RNA Other function NADH dehydrogenase RNA polymerase Intron Photosystem protein Rubisco subunit Unknown function Ribosomal protein subunit Transfer RNA O. tenella J131 0.0010 0.0010 O. hirta J57 0.01 O. hirta J62 O. callosa J11-726 95 O. primuloides J136 93 O. creaseyii J11-961 O. callosa J11-726 O. kamiesbergensis J153 O. ciliaris J233 100 100 100 O. helicoides J319 O. blastorrhiza J557 O. kamiesbergensis J153 100 O. amblyosepala J11-969 100 100 100 O. tenella J131 O. gracilis J558 100 O. exserta J616 100 O. hirta J57 100 100 100 O. creaseyii J11-961 O. helicoides J319 100 O. primuloides J136 100 O. polyphylla J11-44 100 O. gracilis J558 100 100 O. ciliaris J233 O. reclinata J343 O. tenella J131 100 96 100 100 100 O. blastorrhiza J557 100 98 99 O. gracilis J558 100 100 100 100 100 O. hirta J62 96 O. amblyosepala J11-969 100 100 100 92 O. exserta J616 100 O. reclinata J343 100 O. helicoides J319 100 99 O. amblyosepala J11-969 100 97 100 98 100 100 100 96 100 100 O. smithiana J11-511 O. polyphylla J11-44 O. primuloides J136 O. polyphylla J11-44 99 100 100 100 100 100 100 O. blastorrhiza J557 O. smithiana J11-511 100 100 100 O. reclinata J343 100 100 97 100 100

100 O. callosa J11-726 O. ciliaris J233 1 O. imbricata MO472 97 100 100 O. exserta J616 O. hirta J57 100 100 O. palmifrons MO403 94 O. palmifrons MO403 91 100 95 100 O. hirta J62 93 100 O. truncatula J230 97 100 98 100 O. truncatula J230 100 O. orthopoda MO351 100 100 O. creaseyii J11-961 100 O. orthopoda MO351 100 100 100 100 100 100 O. orthopoda MO351 O. kamiesbergensis J153 100 96 93 97 O. hirsuta J12-20 100 100 O. imbricata MO472 100 O. hirsuta J12-20 100 O. smithiana J11-511 100 O. hirsuta J12-20 100 100 100 100 O. palmifrons MO403 100 100 O. inconspicua J595 LCN genes 100 Plastome nrDNA cistron O. pulchella J11-875 100 100 100 O. obtusa J12 O. pulchella J11-875 O. obtusa J99 100 O. pulchella J11-875 O. truncatula J230 O. obtusa J12 O. obtusa J99

O. obtusa J99 O. obtusa J12 O. inconspicua J595

A O. inconspicua J595 B C O. imbricata MO472 TL TL pi(A) pi(C) pi(G) pi(T) alpha pi(A) pi(C) pi(G) pi(T) alpha A r(A‹-›C) r(A‹-›G) r(A‹-›T) r(C‹-›G) r(C‹-›T) r(G‹-›T) B r(A‹-›C) r(A‹-›G) r(A‹-›T) r(C‹-›G) r(C‹-›T) r(G‹-›T) k_revmat k_revmat C A ASTRAL species tree B MP-EST species tree

BI of gene trees ML gene trees BI of gene trees ML gene trees

C STAR species tree D MRP supertree

BI of gene trees ML gene trees BI of gene trees ML gene trees

E Concatenation

ML d = 5 d = 5 Eigenvalues Eigenvalues

A BI of gene trees B ML gene trees BI of gene trees ML gene trees Numbergene-species of comparisons tree Numbergene-species of comparisons tree 0 20 40 60 80 0 20 40 60 80 100

15 20 25 30 35 40 45 15 20 25 30 35 40 45 A RF distance B RF distance