Terminology

We use the term paralog to describe gene copies that diverged from one another in a duplication event; hence, multiple paralogs can be present in a single individual. In contrast, ortholog is used when referring to a set of homologous genes that originated via speciation events. Depending on the context, a single gene can therefore be included and discussed in the context of a paralog group or an ortholog group. We also use the term “locus” to refer to a particular ortholog in the aligned matrices.

Data Availability

All scripts are available in a public repository. One folder contains the analysis pipeline (https://github.com/abigail-Moore/baits-analysis) and a second folder has the scripts for the bait design and gene tree/species tree analysis (https://github.com/abigail-Moore/baits-suppl_scripts).

Probe Design

Probes for targeted enrichment were designed based on analyses of eight previously sequenced transcriptomes from the Portulacineae (Christin et al. 2014, 2015; Anacampserotaceae: Anacampseros filamentosa; Cactaceae: Echinocereus pectinatus, Nopalea cochenillifera, Pereskia bleo, Pereskia grandifolia, Pereskia lychnidiflora; Portulacaceae: Portulaca oleracea; and Talinaceae: Talinum portulacifolium) and four from its sister group Molluginaceae (Matasci et al. 2014; Hypertelis cerviana (called M. cerviana in 1KP), Mollugo verticillata, Paramollugo nudicaulis (called M. nudicaulis in 1KP), and Trigastrotheca pentaphylla (called M. pentaphylla in 1KP)), jointly referred to as portullugo. Probes were designed from two sets of genes: gene families that were known to be important in the CAM and C4 photosynthetic pathways and other low- or single-copy nuclear genes.

For the photosynthesis-related genes, 19 families of CAM-C4 photosynthesis-related genes were used for probe design (Table S1). Sequences from these gene families were taken from the alignments from Christin et al. (2014, 2015), which included the transcriptomic data, sequences from GenBank, and individual loci from other members of the portullugo clade. In some cases, non-transcriptomic sequences were therefore also used in probe design. Twelve of these gene families had multiple known paralogs; we designed separate sets of probes for each paralog for a total of 45 paralogs (with nadmdh and nadpmdh accidentally included twice).

The remaining, non-photosynthetic genes in the portullugo transcriptomes were assigned a gene family identity by blasting them (BLASTN 2.2.25, default settings; Altschul et al. 1990) against sets of orthologous sequences of known gene families from six model plants (Ensembl database; Kersey et al. 2016; plants.ensembl.org/, accessed 4 Dec. 2013; Arabidopsis thaliana, Glycine max, Oryza sativa, Populus trichocarpa, Solanum tuberosum, Vitis vinifera). The best blast hit was taken as the preliminary gene family assignment. In addition to the portullugo transcriptomes, further Caryophyllales transcriptomes were classified through these blast searches and included in subsequent alignments and trees (Beta vulgaris, Amaranthaceae, from the Beta vulgaris genome project, Dohm et al. 2013; and Amaranthus hypochondriacus, Amaranthaceae; Boerhavia coccinea, Nyctaginaceae; Mesembryanthemum crystallinum, Aizoaceae; and Trianthema portulacastrum, Aizoaceae, from Christin et al. 2014, 2015).

The newly classified sequences were added to the sets of orthologous model-species genes to form “ortho-groups” and aligned with MUSCLE version 3.8.31 (Edgar 2004), using default options. In most cases, “ortho-groups” contained many distantly related sequences, so a smaller subset of sequences was selected for probe design using custom R (R Core Team 2016) and Python 2.7 scripts that implemented the following iterative process. First, it was determined whether the mean K80 genetic distance among all sequences was less than 0.40. If so, the sequences were considered closely enough related to use as they were. If not, the alignment was split by making a distance tree and splitting it at its midpoint. If a tree-half contained at least one cactus, one Molluginaceae, and one other member of the portullugo clade, its corresponding alignment was retained as potentially suitable for bait design. In this case, a new sequence alignment was made for just those sequences and the process was repeated until all (sub-)sets of sequences met the K80 criterion. Among these, we further selected subsets based on an apparent lack of gene duplication within Caryophyllales, judged by the frequency of occurrence of Caryophyllales taxa: we retained only those alignments for which the corresponding tree resolved one of the six model organisms as sister to a group of Caryophyllales taxa in which each taxon was represented at most once. Finally, we randomly selected 64 alignments, each representing an ortholog group of non-photosynthesis genes for which probes were designed.
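The distance criterion in the iterative process above can be illustrated with a minimal Python sketch. It assumes the standard Kimura two-parameter (K80) formula and an alignment held as a dictionary of equal-length strings; the function names and the handling of gaps and ambiguity codes (skipped) are assumptions for illustration and are not taken from the original R and Python 2.7 scripts.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k80_distance(seq1, seq2):
    """K80 distance between two aligned sequences of equal length."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: gaps and ambiguity codes are skipped
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    # math.log raises ValueError if the pair is saturated
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)

def mean_k80(alignment):
    """Mean pairwise K80 distance over a dict of name -> aligned sequence."""
    dists = [k80_distance(a, b) for a, b in combinations(alignment.values(), 2)]
    return sum(dists) / len(dists)

def needs_split(alignment, threshold=0.40):
    """True if the alignment fails the mean-distance criterion and must be split."""
    return mean_k80(alignment) >= threshold
```

An alignment for which `needs_split` returns `False` would be used as it is; otherwise it would go on to the midpoint split of the distance tree, which is not sketched here.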

We pruned each of the 111 selected alignments to two Molluginaceae sequences, two Cactaceae sequences, and two portullugo sequences from outside of those lineages and re-aligned them using MUSCLE; we then used the resulting alignment for probe design. In some cases, not all of these sequences were available and smaller alignments were used. A total of 20,000 unique baits of 120 bases each were designed from these alignments by MYcroarray (Ann Arbor, MI), using their bait design pipeline with 2x coverage.

Barcodes

A combination of 25 different inline barcodes (4–6 bp in length) and seven third-read (TruSeq) barcodes was used to sequence 50 samples per lane while achieving a balanced mix of nucleotides at each site. All barcodes differed by at least two nucleotides, so that a single sequencing error could not turn one barcode into another.
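The two-nucleotide requirement can be checked with a short Python sketch. The barcode strings in the usage note are hypothetical, and comparing barcodes of unequal length over their shared leading positions is an assumption made here for illustration; the source does not say how barcodes of different lengths were compared.

```python
from itertools import combinations

def mismatches(a, b):
    """Count differing positions over the shared length of two barcodes.

    Assumption: barcodes of unequal length are compared from the start of
    the read over the length of the shorter one.
    """
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest number of mismatches between any two barcodes in the set."""
    return min(mismatches(a, b) for a, b in combinations(barcodes, 2))

def is_error_tolerant(barcodes, min_distance=2):
    """True if no single substitution can turn one barcode into another."""
    return min_pairwise_distance(barcodes) >= min_distance
```

For example, `is_error_tolerant(["ACGT", "TGCA", "GATTC"])` holds for this hypothetical set, whereas a set containing both `"ACGT"` and `"ACGA"` would fail the check.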

PCR recombination between the inline barcode and the insert to be sequenced should not take place, as they are immediately adjacent to one another. To avoid misclassification due to PCR recombination between the TruSeq barcode and the sequence (which may be common due to their separation by 34 bp that are identical in all samples; Kircher et al. 2011), samples were combined in bait hybridization reactions (and subsequent PCR) such that each inline barcode was present only once. Multiple (generally three or four) third-read barcodes were present in each reaction to achieve proper color balance in the lane, while minimizing the number of hybridization reactions. This allowed recombinant reads (those with a novel inline/third-read barcode combination) to be identified and reclassified according to their inline barcode.
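The reclassification logic can be sketched as a pair of lookups. The sketch assumes that each third-read barcode was used in only one hybridization reaction, so that it identifies the reaction, while the inline barcode identifies the sample within that reaction. That assumption, the sample sheet layout, and all names are hypothetical and are not taken from the pipeline scripts.

```python
def build_lookup(sample_sheet):
    """Build lookup tables from (sample, reaction, inline, third_read) tuples."""
    expected = {}            # (inline, third_read) -> sample
    third_to_reaction = {}   # third_read -> reaction (assumed one reaction each)
    inline_in_reaction = {}  # (reaction, inline) -> sample
    for sample, reaction, inline, third in sample_sheet:
        expected[(inline, third)] = sample
        third_to_reaction[third] = reaction
        inline_in_reaction[(reaction, inline)] = sample
    return expected, third_to_reaction, inline_in_reaction

def assign_read(inline, third, lookup):
    """Return the sample for a read, or None if it cannot be assigned."""
    expected, third_to_reaction, inline_in_reaction = lookup
    if (inline, third) in expected:
        return expected[(inline, third)]       # expected combination
    # Novel combination: treat as recombinant and trust the inline barcode
    reaction = third_to_reaction.get(third)
    return inline_in_reaction.get((reaction, inline))
```

Because each inline barcode occurs only once per reaction, a recombinant read still maps to a single sample once its reaction is known.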

Taxon Sampling

Sixty portullugo individuals were sequenced (Supplemental Table 2), including multiple representatives of all major lineages (with the exception of the monotypic Halophytaceae, which was represented by Halophytum ameghinoi), and relevant sequences from transcriptomes of two further species were added (Pereskia bleo, Cactaceae; Portulaca oleracea, Portulacaceae). As outgroups we used sequences from the five non-portullugo Caryophyllales transcriptomes (i.e., Amaranthus hypochondriacus, Amaranthaceae; Beta vulgaris, Amaranthaceae; Boerhavia coccinea, Nyctaginaceae; Mesembryanthemum crystallinum, Aizoaceae; Trianthema portulacastrum, Aizoaceae) and the six model plant genomes used in bait design, for a total of 73 taxa.

Molecular Methods

Due to the high polysaccharide content of most leaf material in the portullugo clade, a two-step DNA extraction procedure was performed. First, 100–400 mg of fresh leaf material or 20–40 mg of silica-dried leaf material was extracted using the FastDNA Spin Kit (MP Biomedicals, Santa Ana, CA) following the manufacturer's protocol with the following adjustments: Silica-dried samples were kept at room temperature for approximately two hours following homogenization and addition of the CLS-VF and PPS buffers. Extracted DNA was eluted twice in 75 µl distilled water (DES). After the first extraction, fresh samples were incubated for 15 minutes at 37°C with 0.5 µl Thermo RNase (Thermo Fisher Scientific, Waltham, MA). Following the FastDNA extraction, samples were cleaned again using a QIAquick PCR Cleanup Kit (Qiagen Inc., Valencia, CA), again following the manufacturer's protocol. The samples were eluted twice in 50 µl EB buffer for most samples or twice in 30 µl EB buffer for samples for which we had relatively little starting material.

Samples were quantified using a Qubit Fluorometer (Invitrogen, part of Thermo Fisher Scientific) with the Qubit dsDNA HS Assay Kit (Invitrogen). For sonication, additional EB buffer was added to obtain 500 ng of sample DNA in 117 µl of buffer. If less than 500 ng of DNA was present in the sample, the entire sample was used with enough additional buffer to make the total volume 117 µl. Samples were sonicated using a Covaris S220 (Covaris, Inc., Woburn, MA) at the Brown University Genomics Core Facility. The following parameters were used to achieve a mean fragment length of 400 bp: peak power 140.0, duty factor 10.0, and cycles/burst 200 for 50 seconds.
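The dilution rule for sonication amounts to a small calculation, sketched below in Python. The function name and its arguments (a measured concentration in ng/µl and the available sample volume in µl) are hypothetical. The sketch raises an error when a sample is too dilute to fit in 117 µl, a case the text does not describe.

```python
def sonication_mix(conc_ng_per_ul, sample_volume_ul, target_ng=500.0, total_ul=117.0):
    """Return (sample volume, added EB buffer volume) in µl for sonication."""
    available_ng = conc_ng_per_ul * sample_volume_ul
    if available_ng <= target_ng:
        dna_ul = sample_volume_ul            # use the entire sample
    else:
        dna_ul = target_ng / conc_ng_per_ul  # volume holding the target mass
    if dna_ul > total_ul:
        raise ValueError("sample volume exceeds the total sonication volume")
    return dna_ul, total_ul - dna_ul
```

For a sample at 10 ng/µl with 100 µl available, this gives 50 µl of sample plus 67 µl of buffer.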

Sonicated DNA was used for library preparation using the NEBNext Ultra DNA Library Prep Kit for Illumina or NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA), following the manufacturer's protocols with the following modifications: Agencourt AMPure beads (Beckman Coulter, Brea, CA) were used for cleanup and no further size selection was performed. As we used custom adapters, the USER Enzyme digest was not performed. The number of cycles was adjusted depending on the kit and the amount of sonicated DNA: For 300–500 ng and less than 300 ng of sonicated DNA, 14 and 16 cycles, respectively, were used with the Ultra kit and 13 and 15, respectively, with the Ultra II kit.

After PCR cleanup, the samples were pooled in groups of 8–9, using the MinElute PCR Purification Kit (Qiagen) and eluted with 30 µl EB buffer. Pooled samples were then combined for hybridization so that there were approximately equal amounts of DNA from each individual and a total of 100–500 ng of DNA in 5.9 µl of buffer.

Because species used for bait design were sometimes quite distantly related to the species sequenced, a protocol for low stringency hybridization was followed (Li et al. 2013). The hybridization temperatures were as follows: 11 hours at 65°C, 11 hours at 60°C, 11 hours at 55°C, and 11 hours at 50°C, followed by a hold at 50°C until the samples were cleaned. The remainder of the hybridization and cleanup protocol followed version 2 of the MYbaits manual using the reagents provided and Dynabeads MyOne Streptavidin C1 beads (Invitrogen), except that the cleanup steps took place at 50°C instead of 65°C. PCR was performed with KAPA HiFi HotStart Ready Mix (Kapa Biosystems, Inc., Wilmington, MA), following the MYbaits protocol, with 14 cycles and an annealing temperature of 65°C. PCR products were cleaned using the MinElute Gel Extraction Kit (Qiagen).

Final quantification, combination, and sequencing of most samples were performed at the Brown University Genomics Core Facility on an Illumina HiSeq 2000 or 2500, to obtain 100-bp, paired-end reads. Some test samples were run at the Rhode Island Genomics and Sequencing Center on an Illumina MiSeq, to obtain 250-bp, paired-end reads. The individuals analyzed for this paper were not sequenced alone; instead, they were sequenced with additional individuals from across the Portulacineae, whose sequences will be presented in future papers.

Data Processing and Ortholog Assignment Pipeline

First, reads were assigned to individuals using their barcodes (script: trans_bcparse_2reads.py). Paired reads with neither inline barcode matching exactly to the template barcode were discarded. For accepted reads, the barcode as well as the last five bases were trimmed. Trimmed reads with more than one low-quality (Phred score < 2 (#)) base were also discarded.
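The per-read part of this filter can be sketched in Python as follows. This is not the code of trans_bcparse_2reads.py. It assumes Phred+33 quality encoding, in which "#" encodes a score of 2, and it counts bases with a score of 2 or lower as low quality; both are assumptions, as is every function name. The pair-level rule (keeping a pair if either read carries the expected barcode) is not shown.

```python
def trim_read(seq, qual, barcode, end_trim=5):
    """Trim the inline barcode and the last `end_trim` bases.

    Returns (seq, qual) or None if the read does not start with an exact
    copy of the expected barcode.
    """
    if not seq.startswith(barcode):
        return None
    start = len(barcode)
    return seq[start:len(seq) - end_trim], qual[start:len(qual) - end_trim]

def count_low_quality(qual, max_phred=2, offset=33):
    """Number of bases at or below `max_phred` (assumed Phred+33 encoding)."""
    return sum((ord(c) - offset) <= max_phred for c in qual)

def passes_quality(qual, max_low=1):
    """A trimmed read is kept if it has at most one low-quality base."""
    return count_low_quality(qual) <= max_low
```

A read would be retained when `trim_read` returns a result and `passes_quality` is true for the trimmed quality string.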

We designed a three-part bioinformatics pipeline to reconstruct gene sequences (Fig. 2). Part I (tfastq_assembly_master.py and subordinate scripts) aimed to extract all relevant reads for each gene family and then assemble them into contigs. Part II (tcontig_classif_master.py and subordinate scripts) then constructed longer sequences from contigs and assigned them to particular paralogs within a gene family. Part III (tgenefam_to_spptree_master.py and subordinate scripts) identified gene duplications within gene families, extracted phylogenetically useful sets of orthologs, and used them for phylogenetic analysis.

For various parts of the pipeline analysis, the individuals need to be divided into ca. 5 to 15 groups that are known from previous research to be well supported as monophyletic, and which we would expect to be monophyletic in many of the gene trees. All or almost all of the groups should contain multiple individuals. For this study, we divided the portullugo into families, because that gave us the right number of groups of the proper size; in most other cases, however, different clades would be more appropriate.

Part I. In Part I, the fastq files were converted to fasta files and the pairs of reads were classified into gene families using BLASTN version 2.2.29 (Altschul et al. 1990) and assembled into contigs. Paired reads were classified into a gene family if either read matched (with an e-value < 10⁻¹⁶) the sequences used to design its baits (trans_fastq_to_2blast.py and tbaits_blastn_parse.py). For each gene family, reads were then pooled among the individuals that belonged to each of the nine major lineages, and SPAdes version 3.1.0 (Bankevich et al. 2012) was used to assemble them into nine sets of preliminary contigs (tblast_to_fastq.py, run in “together” mode). By using reads from different individuals and different species in the same assembly, we maximized contig number and lengths by also assembling chimeric contigs containing reads from multiple individuals; this step allowed us to pull significantly more reads into the pool for analysis. In the next step, a new BLAST database was created from the reads from the preliminary contigs and the sequences from which the baits were designed. The original reads were then blasted to this larger database, again extracting both reads of a pair if either matched (tassembly_to_blast.py and tbaits_blastn_parse.py). Separate assemblies for each individual for each gene family were then constructed by SPAdes (tblast_to_fastq.py, run in “separate” mode). A single fasta file was constructed for each gene family containing all of the contigs for that gene family, labeled according to individual. Finally, these fasta files were blasted against the bait sequences (which do not contain introns) to delimit exons (tassembly_to_loci.py, run in “spades” mode). Only exons were used for all subsequent analyses.
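The pair-level classification rule can be illustrated with a short Python sketch that parses BLAST tabular output (the standard 12-column format, with the e-value in column 11). It is not the code of tbaits_blastn_parse.py. The read naming scheme (`<pair>/1` and `<pair>/2`), the dictionary mapping bait sequences to gene families, and the choice of the lowest e-value when a pair hits several families are all assumptions made for illustration.

```python
def classify_pairs(blast_lines, subject_to_family, max_evalue=1e-16):
    """Assign read pairs to gene families from tabular BLAST hits.

    A pair is classified if either of its reads has a hit below
    `max_evalue`; both reads of the pair then share the assignment.
    """
    best = {}  # pair_id -> (evalue, family)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= max_evalue:
            continue
        pair_id = query.rsplit("/", 1)[0]  # assumed read naming: <pair>/1, <pair>/2
        family = subject_to_family[subject]
        # Assumption: keep the family of the lowest e-value hit per pair
        if pair_id not in best or evalue < best[pair_id][0]:
            best[pair_id] = (evalue, family)
    return {pair_id: family for pair_id, (_, family) in best.items()}
```

The returned dictionary could then be used to write both reads of each classified pair to the fastq file of the corresponding gene family before assembly.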

Part II. Part II of the pipeline identified the paralog that each contig from Part I belonged to, in order to combine contigs and maximize the sequence length for each paralog (tcontig_classif_master.py). This classification was performed on the principle that shorter sequences (i.e., the contigs) can be placed within a backbone tree built from longer sequences (i.e., the bait sequences); the topological position of a placed sequence then indicates its affinity. This procedure was conducted after removing introns from all contigs according to the blast results from the previous pipeline (tbaits_intron_removal.py). The pipeline iteratively refined the backbone tree used. Initially, we used alignments that consisted only of the same sequences that were used in bait design (transcriptome, model plant, and, in the case of some photosynthesis-related genes and phytochrome C, other sequences from the same gene families downloaded from GenBank or amplified by PCR). Part II of the pipeline also attempted to place the paralogs from each major plant lineage in separate iterations, because the more tightly clustered the contigs from each paralog are on the phylogeny, the easier it is to group the contigs according to paralog. (For example, if we are trying to place the Cactaceae paralogs by themselves, the phylogenetic distance between the different paralogs is much greater, and thus the overlap between clusters of contigs is much less, than if we were placing all portullugo paralogs at once.) At the end of each iteration, all good consensus sequences were added to the backbone tree for the subsequent round.

To place contigs in the backbone tree, we executed the following steps. All contigs for a gene family were added to the backbone alignments for their gene families using the addfragments algorithm in MAFFT version 7.017 (Katoh and Standley 2013). The short-read classification algorithm in RAxML version 8.0.22 (option “-f v”; Berger 2011, Stamatakis 2014) was then used to place these sequences in the backbone gene family tree (tbaits_intron_removal.py and tcontigs_to_fixed_paralogs.py). Each contig was given a set of possible placements in the backbone tree together with the probability of each placement; in subsequent analyses, we considered the set of placements that gave us a total probability of 0.9, i.e., the 90% confidence interval for the placement of that contig. We then looked for clusters of contigs that had overlapping placements on the backbone tree, but whose 90% confidence intervals did not overlap with those of sequences from other clusters. Each cluster was treated as a putative paralog and extracted for further testing (tseq_placer_dup.py). For each of these clusters, the contigs from each individual were combined into a consensus sequence for that individual (tcontigs_to_fixed_paralogs.py). Each consensus sequence was then examined based on three criteria to decide whether it would be accepted: 1) it was at least 75 bases long; 2) it was at least 75% of the mean length of all consensus sequences from that cluster; and 3) the number of bases that differed in overlapping regions of contigs (due to multiple alleles (if low) or multiple paralogs (if high) in that individual) was below a threshold of twice the number of contigs for non-polyploids and five times the number of contigs for plants that were previously known to be polyploid. (The number of contigs was used instead of sequence length, because, when the contigs are correctly classified, the ends of the contigs are usually the only places they overlap.) In intermediate rounds, all sequences from a given plant family and a given cluster were examined together to determine whether the number of accepted sequences was less than twice the number of rejected sequences or whether the number of bases that differed in the contigs was more than 5% of the total contig length. If these criteria were met, it was assumed that that sequence group consisted of one paralog and a consensus sequence was accepted for further analysis by including it in the alignment of existing backbone sequences. Sequences that failed these criteria were analyzed again in the next round (tparalog_combiner.py). (In the final round, all accepted contigs were passed on to Part III of the pipeline.)
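Two of the steps above lend themselves to a brief Python sketch: selecting the set of placements that accounts for a total probability of 0.9, and applying the three acceptance criteria to a consensus sequence. The input format (a list of branch and probability pairs) is hypothetical and is not the RAxML output format, and the function names are not those of the pipeline scripts.

```python
def placement_set(placements, mass=0.9, tol=1e-9):
    """Smallest set of branches whose summed placement probability reaches `mass`.

    `placements` is a list of (branch, probability) pairs; `tol` guards
    against floating-point rounding in the running sum.
    """
    chosen, total = set(), 0.0
    for branch, prob in sorted(placements, key=lambda bp: bp[1], reverse=True):
        chosen.add(branch)
        total += prob
        if total >= mass - tol:
            break
    return chosen

def accept_consensus(length, mean_cluster_length, n_differing_bases,
                     n_contigs, polyploid=False):
    """Apply the three acceptance criteria to one consensus sequence."""
    factor = 5 if polyploid else 2
    return (length >= 75
            and length >= 0.75 * mean_cluster_length
            and n_differing_bases < factor * n_contigs)
```

Two contigs would be candidates for the same cluster when their placement sets share at least one branch; the clustering itself is not sketched here.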

After six iterations of contig classification, some contigs remained orphaned, i.e., they could not be combined into acceptable consensus sequences (most likely due to a recent duplication that was absent from the backbone tree). Here, a single contig per individual and paralog was selected. If multiple individuals had orphaned contigs for the same paralog, we retained those contigs that were alignable across individuals (i.e., representing the same exons) for further analysis. In particular, we selected the set of alignable contigs that contained the most individuals and the greatest total contig length (tundivcontigs_combiner.py and tcontig_selection.py).

The selected consensus sequences and orphaned contigs were combined and aligned using the localpair algorithm in MAFFT (tparcomb_combiner.py). Each alignment was checked to make sure each individual had at most one sequence per paralog, based on the original naming of the paralog. If multiple sequences per individual and paralog were present, an attempt was made to combine them into a single consensus sequence using the criteria for accepting consensus sequences made from contigs as described above (tparcomb_final.py). If they could not be combined, the sequences were kept separate. All sequences that were over 150 bases long were added to the backbone alignments for each gene family to make a combined alignment for analysis in Part III of the pipeline.

Part III. Part III of the pipeline extracted paralogs as separate phylogenetic loci from the gene-family trees, by identifying the positions of gene duplications in comparison with a preliminary species tree, and used these loci to reconstruct gene trees (tgenefam_to_spptree_master.py). Part III was performed twice, first with a preliminary species tree constructed from three chloroplast loci (matK, ndhF, and rbcL) and the nuclear internal transcribed spacer (ITS) region, all recovered as off-target reads (see below for construction of the preliminary species tree), and then with an updated species tree, reconstructed from the loci recovered from the pipeline using ASTRAL II (Mirarab et al. 2014).

The fasta files containing the original backbone sequences as well as all of the new sequences were pruned to include only the individuals present in the species tree (tcombpars_to_trees.py). The pruned fasta files were aligned with the localpair algorithm in MAFFT, and trees were made from the alignments using RAxML with 100 bootstrap replicates.

NOTUNG version 2.8.1.6 (Chen et al. 2000, Stolzer et al. 2012) was used to find gene duplications in the gene family trees, based on the given species tree. While the topology of the species tree was taken as given, poorly supported nodes (< 90% bootstrap) on the gene family trees were rearranged to correspond to the species tree, to minimize the impact of lack of support on paralog classification. Losses were not reconstructed, only gene duplications. Besides accounting for poorly supported incongruences between the gene-family tree and species tree, we also employed a conservative strategy to accept duplications (tnotung_homolog_parsing.py). Here, a duplication was accepted if it met the following criteria: 1) At least one individual had to be present on both sides of the duplication or, if there were no duplicated individuals, at least two plant families had to be present on both sides of the duplication. If neither of these conditions was met, the putative duplication was assumed to be due to incongruence between the gene tree and the species tree, instead of being an actual duplication. 2) If very few individuals (either fewer than five or fewer than 40% of the total individuals in the two sides of the duplication) were present on both sides of the duplication, an attempt was made to combine the sequences of those individuals (in the same way that the contigs were combined in Part II of the pipeline), and the duplication was only accepted if the sequences were not combinable. 3) The duplication had to contain at least two individuals. If the duplication was within a single individual, it was rejected and the longer of the two sequences was chosen to represent that individual in the subsequent analyses.
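The three criteria can be expressed as a single decision function, sketched below in Python. This is one possible reading of the text and not the logic of tnotung_homolog_parsing.py. In particular, it interprets "at least two plant families on both sides" as two families shared between the sides and takes the 40% threshold relative to all individuals found on either side. The `combinable` callback stands in for the consensus test from Part II and is hypothetical.

```python
def accept_duplication(side_a, side_b, family_of, combinable=lambda shared: False):
    """Decide whether a putative duplication node is accepted.

    side_a, side_b: sets of individuals below each child of the node.
    family_of: dict mapping individual -> plant family.
    combinable: hypothetical callback, True if the sequences of the shared
        individuals can be merged into consensus sequences.
    """
    shared = side_a & side_b
    everyone = side_a | side_b
    # Criterion 3: a duplication within a single individual is rejected
    if len(everyone) < 2:
        return False
    # Criterion 1: shared individuals, or else at least two shared families
    if not shared:
        families_a = {family_of[i] for i in side_a}
        families_b = {family_of[i] for i in side_b}
        return len(families_a & families_b) >= 2
    # Criterion 2: with few shared individuals, accept only if not combinable
    if len(shared) < 5 or len(shared) < 0.4 * len(everyone):
        return not combinable(shared)
    return True
```

Under this reading, a node whose two sides share six of six individuals is accepted outright, whereas a node with only two shared individuals is accepted only if their sequences cannot be merged.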

After inspection, at each node that subtended an accepted duplication, the smaller sister group was pruned off as a distinct locus, while the larger group was retained on the gene tree. (Note that after pruning, the larger group was no longer a single paralog, as it contained one of the paralogs from the accepted duplication in addition to the unduplicated sequences that subtend the duplication.) This strategy maximized the number of loci that contained all or most of the individuals, facilitating phylogenetic inference. The sets of sequences for each locus were aligned using MAFFT. Once the final set of loci had been obtained, the number of individuals and the number of plant families that had each locus were calculated (tparalog_selector.py), thus allowing different subsets of loci and individuals to be selected, in order to run analyses with different levels of missing data.

New sets of alignments were made containing only the selected loci and individuals (tal_combiner.py). In addition, to reduce the amount of missing data, all sites with >90% missing data were pruned prior to analysis. These regions were largely the result of the transcriptomes and genome sequences being longer than the bait sequences. These alignments were analyzed in three different ways: A concatenated alignment containing all sequences was made and analyzed in RAxML, and separate trees were made for each locus using both RAxML and MrBayes version 3.2.2 (Ronquist et al. 2012) for further analysis.
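The site-pruning step can be sketched in a few lines of Python. The alignment is assumed to be a dictionary of equal-length strings, and the set of characters counted as missing (gaps, "?" and "N") is an assumption; the original scripts may define missing data differently.

```python
def prune_sparse_columns(alignment, max_missing=0.9, missing_chars="-?N"):
    """Drop alignment columns in which more than `max_missing` of sequences are missing."""
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    n_seqs = len(seqs)
    keep = [
        i for i in range(len(seqs[0]))
        if sum(s[i] in missing_chars for s in seqs) / n_seqs <= max_missing
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}
```

A column with exactly 90% missing data is kept under this sketch, since only sites with more than 90% missing data were pruned.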

Pipeline Parallelization

Extensive parallelization helped the pipelines run much faster. Each of the three wrapper steps (tfastq_assembly_master.py, tcontig_classif_master.py, and tgenefam_to_spptree_master.py) and some of the internal scripts could be parallelized in two different ways. The first used GNU Parallel, which can automatically detect the number of jobs that can be run simultaneously on a given computer or node of a cluster, and run multiple concurrent jobs (run in “Parallel” mode). (However, it is often necessary to run fewer jobs than could potentially be run on a given computer to allow each job to have sufficient RAM.) If the pipeline was being run on a cluster, however, it was much more efficient (both in terms of time and in terms of usage of resources on the cluster) to run each portion as a separate job, so that more jobs could be run at once and the resources for each job were freed immediately upon completion of that job. For this reason, there is a second option, “Array” mode, for Slurm-based clusters, in which jobs are scheduled using the sbatch command.
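The RAM caveat for “Parallel” mode can be made concrete with a small Python helper that caps the number of concurrent jobs by available memory as well as by CPU count. The function and its parameters are hypothetical; the memory needed per job depends on the data and would have to be estimated by the user.

```python
import os

def concurrent_jobs(total_ram_gb, ram_per_job_gb, n_cpus=None):
    """Number of jobs to run at once, limited by both CPUs and RAM."""
    n_cpus = n_cpus or os.cpu_count() or 1
    by_ram = int(total_ram_gb // ram_per_job_gb)
    return max(1, min(n_cpus, by_ram))
```

On a 16-core node with 64 GB of RAM and jobs that each need about 16 GB, this gives four concurrent jobs rather than sixteen.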

Construction of the Preliminary Species Tree

The preliminary species tree was constructed from three chloroplast loci (matK, ndhF, and rbcL) and the nuclear internal transcribed spacer (ITS) region, all recovered as off-target reads (using the wrapper script torig_spp_tree_master.py). These sequences were recovered from our baits using the first part of the pipeline with one round of blast (using torig_spp_tree_blasting.py if the fasta files are already present or trans_fastq_to_2blast.py if the fasta files also need to be made; tbaits_blastn_parse.py, tblast_to_fastq.py, and tassembly_to_loci.py). The longest sequence from each individual for each locus was used, and trees were made in RAxML from the separate and concatenated alignments (tbaits_to_spptreeseqs.py). These trees were then checked by eye for individuals that were out of place, likely due to the selection of a pseudogene sequence. The putative pseudogenes were removed for those individuals and a new species tree was produced from the filtered sequences.
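Selecting the longest sequence per individual and locus is simple to sketch in Python. The record layout (individual, locus, sequence) is hypothetical, and measuring length without gap characters is an assumption; this is not the code of tbaits_to_spptreeseqs.py.

```python
def longest_per_individual(records):
    """Keep the longest sequence for each (individual, locus) pair.

    `records` is an iterable of (individual, locus, sequence) tuples;
    length is measured without gap characters.
    """
    best = {}
    for individual, locus, seq in records:
        key = (individual, locus)
        length = len(seq.replace("-", ""))
        if key not in best or length > len(best[key].replace("-", "")):
            best[key] = seq
    return best
```

Sequences flagged as putative pseudogenes after visual inspection of the trees could be dropped from `records` before the function is run again.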