Terminology
We use the term “paralog” to describe gene copies that diverged from one another in a
duplication event; hence, multiple paralogs can be present in a single individual. In contrast,
“ortholog” refers to a set of homologous genes that originated via speciation events. Depending
on the context, a single gene can therefore be discussed as a member of a paralog group or of an
ortholog group. We also use the term “locus” to refer to a particular ortholog in the aligned
matrices.

Data Availability
All scripts are available in two public repositories. One contains the analysis pipeline
(https://github.com/abigail-Moore/baits-analysis) and the other contains the scripts for the bait
design and gene tree/species tree analysis (https://github.com/abigail-Moore/baits-suppl_scripts).

Probe Design
Probes for targeted enrichment were designed based on analyses of eight previously
sequenced transcriptomes from the Portulacineae (Christin et al. 2014, 2015;
Anacampserotaceae: Anacampseros filamentosa; Cactaceae: Echinocereus pectinatus, Nopalea
cochenillifera, Pereskia bleo, Pereskia grandifolia, Pereskia lychnidiflora; Portulacaceae:
Portulaca oleracea; and Talinaceae: Talinum portulacifolium) and four from its sister group
Molluginaceae (Matasci et al. 2014; Hypertelis cerviana (called M. cerviana in 1KP), Mollugo
verticillata, Paramollugo nudicaulis (called M. nudicaulis in 1KP), and Trigastrotheca
pentaphylla (called M. pentaphylla in 1KP)), jointly referred to as portullugo. Probes were
designed from two sets of genes: gene families that were known to be important in the CAM and
C4 photosynthetic pathways and other low- or single-copy nuclear genes.
For the photosynthesis-related genes, 19 gene families involved in CAM and C4
photosynthesis were used for probe design (Table S1). Sequences from these gene families were
taken from the alignments of Christin et al. (2014, 2015), which included the transcriptomic
data, sequences from GenBank, and individual loci from other members of the portullugo clade.
In some cases, non-transcriptomic sequences were therefore also used in probe design. Twelve of
these gene families had multiple known paralogs; we designed a separate set of probes for each
paralog, for a total of 45 paralogs (with nadmdh and nadpmdh accidentally included twice).
The remaining, non-photosynthetic genes in the portullugo transcriptomes were assigned
a gene family identity by blasting them (BLASTN 2.2.25, default settings; Altschul et al. 1990)
against sets of orthologous sequences of known gene family from six model plants (Ensembl
database; Kersey et al. 2016; plants.ensembl.org/, accessed 4 Dec. 2013; Arabidopsis thaliana,
Glycine max, Oryza sativa, Populus trichocarpa, Solanum tuberosum, Vitis vinifera). The best
blast hit was taken as the preliminary gene family assignment. In addition to the portullugo
transcriptomes, transcriptomes from other Caryophyllales were classified through these blast
searches and included in subsequent alignments and trees (Beta vulgaris, Amaranthaceae, from
the Beta vulgaris genome project, Dohm et al. 2013; and Amaranthus hypochondriacus,
Amaranthaceae; Boerhavia coccinea, Nyctaginaceae; Mesembryanthemum crystallinum,
Aizoaceae; and Trianthema portulacastrum, Aizoaceae, from Christin et al. 2014, 2015).
The newly classified sequences were added to the sets of orthologous model-species
genes to form “ortho-groups” and aligned with MUSCLE version 3.8.31 (Edgar 2004), using
default options. In most cases, “ortho-groups” contained many distantly related sequences, so a
smaller subset of sequences was selected for probe design using custom R (R Core Team 2016)
and Python 2.7 scripts that implemented the following iterative process. First, it was determined
whether the mean K80 (Kimura two-parameter) genetic distance among all sequences was less
than 0.40. If so, the sequences were considered closely enough related to use as they were. If
not, the alignment was split by making a distance tree and splitting it at its midpoint. If a
tree-half contained at least one cactus, one Molluginaceae, and one other member of the
portullugo clade, its corresponding alignment was retained as potentially suitable for bait
design. In this case, a new sequence alignment was made for just those sequences and the process
was repeated until all (sub-)sets of sequences met the K80 criterion. Among these, we further
selected subsets based on an apparent lack of gene duplication within Caryophyllales, judged by
the frequency of occurrence of Caryophyllales taxa: we retained only those alignments for which
the corresponding tree resolved one of the six model organisms as sister to a group of
Caryophyllales taxa in which each taxon was represented at most once. Finally, we randomly
selected 64 of these alignments, each representing an ortholog group of non-photosynthesis
genes, for probe design.
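The stopping rule of this iterative procedure can be illustrated with a minimal Python 3 sketch.
It is not the code from the repositories: the function names (k80_distance, mean_k80,
usable_as_is, half_is_retained), the treatment of gaps, and the group labels are assumptions
made for illustration, and the midpoint split of the distance tree is omitted.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter (K80) distance between two aligned sequences.

    Assumption: sites where either sequence has a gap or an ambiguity
    code are ignored.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        return float("nan")
    p = transitions / compared
    q = transversions / compared
    term1, term2 = 1 - 2 * p - q, 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return float("inf")  # saturated: the distance is undefined
    return -0.5 * math.log(term1 * math.sqrt(term2))


def mean_k80(alignment):
    """Mean pairwise K80 distance; `alignment` maps names to aligned sequences."""
    dists = [k80_distance(alignment[x], alignment[y])
             for x, y in combinations(alignment, 2)]
    dists = [d for d in dists if not math.isnan(d)]
    return sum(dists) / len(dists) if dists else float("nan")


def usable_as_is(alignment, threshold=0.40):
    """True if the alignment meets the K80 criterion (mean distance < 0.40)."""
    return mean_k80(alignment) < threshold


def half_is_retained(groups):
    """Taxon-composition check for one half of a split tree.

    `groups` holds one label per sequence; the labels "Cactaceae",
    "Molluginaceae", and "other_portullugo" are hypothetical.
    """
    return {"Cactaceae", "Molluginaceae", "other_portullugo"} <= set(groups)
```

An alignment for which usable_as_is returns False would be split, and each half that passes
half_is_retained would be re-aligned and tested again.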
We pruned each of the 111 selected alignments to two Molluginaceae sequences, two
Cactaceae sequences, and two portullugo sequences from outside of those lineages and
re-aligned them using MUSCLE; we then used the resulting alignment for probe design. In some
cases, not all of these sequences were available and smaller alignments were used. A total of
20,000 unique baits of 120 bases each were designed from these alignments by MYcroarray
(Ann Arbor, MI), using their bait design pipeline with 2x coverage.

Barcodes
A combination of 25 different inline barcodes (4–6 bp in length) and seven third-read
(TruSeq) barcodes was used to sequence 50 samples per lane while achieving a balanced mix of
nucleotides at each site. All barcodes differed by at least two nucleotides, so that a single
sequencing error could not turn one barcode into another.
PCR recombination between the inline barcode and the insert to be sequenced should not
take place, as they are immediately adjacent to one another. To avoid misclassification due to
PCR recombination between the TruSeq barcode and the sequence (which may be common due
to their separation by 34 bp that are identical in all samples; Kircher et al. 2011), samples were
combined in bait hybridization reactions (and subsequent PCR) such that each inline barcode
was present only once in each reaction. Multiple (generally three or four) third-read barcodes
were present in each reaction to achieve proper color balance in the lane, while minimizing the
number of hybridization reactions. This allowed recombinant reads (those with a novel
inline/third-read barcode combination) to be identified and reclassified according to their
inline barcode.
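The reclassification logic can be sketched as follows. This is a minimal illustration, not the
published script: the function name classify_read and the layout of the `pools` dictionary are
hypothetical, and the sketch assumes that, within a lane, each third-read barcode was used in
only one hybridization reaction, so that it still identifies the reaction after recombination.

```python
def classify_read(inline_bc, third_bc, pools):
    """Assign a read to a sample from its two barcodes.

    `pools` maps a hybridization-reaction id to
    {inline_barcode: (third_read_barcode, sample_name)}; each inline
    barcode occurs only once per reaction.
    Assumption: a third-read barcode is used in a single reaction per lane.
    Returns (sample_name, status), where status is "expected",
    "recombinant", or "unknown".
    """
    for members in pools.values():
        thirds_in_pool = {third for third, _sample in members.values()}
        if third_bc not in thirds_in_pool:
            continue  # this read did not come from this reaction
        if inline_bc not in members:
            return None, "unknown"
        expected_third, sample = members[inline_bc]
        status = "expected" if expected_third == third_bc else "recombinant"
        # A recombinant read is still assigned by its inline barcode.
        return sample, status
    return None, "unknown"
```

With this layout, a read carrying the inline barcode of one sample and the third-read barcode of
another sample from the same reaction is reported as "recombinant" and assigned to the sample
that owns the inline barcode.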

Taxon Sampling
Sixty portullugo individuals were sequenced (Supplemental Table 2), including multiple
representatives of all major lineages (with the exception of the monotypic Halophytaceae, which
was represented by Halophytum ameghinoi), and relevant sequences from transcriptomes of two
further species were added (Pereskia bleo, Cactaceae; Portulaca oleracea, Portulacaceae). As
outgroups we used sequences from the five non-portullugo Caryophyllales transcriptomes (i.e.,
Amaranthus hypochondriacus, Amaranthaceae; Beta vulgaris, Amaranthaceae; Boerhavia
coccinea, Nyctaginaceae; Mesembryanthemum crystallinum, Aizoaceae; Trianthema
portulacastrum, Aizoaceae) and the six model plant genomes used in bait design, for a total of 73
taxa.

Molecular Methods
Due to the high polysaccharide content of most leaf material in the portullugo clade, a
two-step DNA extraction procedure was performed. First, 100–400 mg of fresh leaf material or
20–40 mg of silica-dried leaf material was extracted using the FastDNA Spin Kit (MP
Biomedicals, Santa Ana, CA) following the manufacturer's protocol with the following
adjustments: Silica-dried samples were kept at room temperature for approximately two hours
following homogenization and addition of the CLS-VF and PPS buffers. Extracted DNA was
eluted twice in 75 µl distilled water (DES). After the first extraction, fresh samples were
incubated for 15 minutes at 37°C with 0.5 µl Thermo RNase (Thermo Fisher Scientific,
Waltham, MA). Following the FastDNA extraction, samples were cleaned using a QIAquick
PCR Cleanup Kit (Qiagen Inc., Valencia, CA), again following the manufacturer's protocol.
DNA was eluted twice in 50 µl EB buffer for most samples, or twice in 30 µl EB buffer for
samples for which we had relatively little starting material.
Samples were quantified using a Qubit Fluorometer (Invitrogen, part of Thermo Fisher
Scientific) with the Qubit dsDNA HS Assay Kit (Invitrogen). For sonication, additional EB
buffer was added to obtain 500 ng of sample DNA in 117 µl of buffer. If less than 500 ng of
DNA was present in the sample, the entire sample was used with enough additional buffer to
make the total volume 117 µl. Samples were sonicated using a Covaris S220 (Covaris, Inc.,
Woburn, MA) at the Brown University Genomics Core Facility. The following parameters were
used to achieve a mean fragment length of 400 bp: peak power 140.0, duty factor 10.0, and
cycles/burst 200 for 50 seconds.
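The dilution arithmetic for sonication can be written out as a short worked example. The
function name sonication_mix and its parameters are hypothetical; the 500 ng target and the
117 µl final volume are the values given above, and the error raised for very dilute samples
covers a case the protocol does not describe.

```python
def sonication_mix(conc_ng_per_ul, available_ul, target_ng=500.0, final_ul=117.0):
    """Volumes of DNA extract and EB buffer for one sonication sample.

    Returns (dna_volume_ul, buffer_volume_ul, dna_mass_ng).
    """
    total_ng = conc_ng_per_ul * available_ul
    if total_ng < target_ng:
        # Less than 500 ng available: use the entire sample.
        dna_ul = available_ul
    else:
        dna_ul = target_ng / conc_ng_per_ul
    if dna_ul > final_ul:
        # Not covered in the text; flagged rather than guessed at.
        raise ValueError("required sample volume exceeds the final volume")
    return dna_ul, final_ul - dna_ul, conc_ng_per_ul * dna_ul
```

For example, a 100 µl extract at 10 ng/µl contributes 50 µl of DNA and 67 µl of buffer, whereas
a 100 µl extract at 2 ng/µl (200 ng in total) is used in full with 17 µl of buffer.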
Libraries were prepared from the sonicated DNA using the NEBNext Ultra DNA Library
Prep Kit for Illumina or NEBNext Ultra II DNA Library Prep Kit for Illumina (New England
Biolabs, Ipswich, MA), following the manufacturer's protocols with the following modifications:
Agencourt AMPure beads (Beckman Coulter, Brea, CA) were used for cleanup and no further
size selection was performed. As we used custom adapters, the USER Enzyme digest was not
performed. The number of PCR cycles was adjusted depending on the kit and the amount of
sonicated DNA: For 300–500 ng and less than 300 ng of sonicated DNA, 14 and 16 cycles,
respectively, were used with the Ultra kit and 13 and 15, respectively, with the Ultra II kit.
After PCR cleanup, the samples were pooled in groups of 8–9, using the MinElute PCR
Purification Kit (Qiagen) and eluted with 30 µl EB buffer. Pooled samples were then combined
for hybridization so that there were approximately equal amounts of DNA from each individual
and a total of 100–500 ng of DNA in 5.9 µl of buffer.
Because species used for bait design were sometimes quite distantly related to the species
sequenced, a protocol for low stringency hybridization was followed (Li et al. 2013). The
hybridization temperatures were as follows: 11 hours at 65°C, 11 hours at 60°C, 11 hours at
55°C, and 11 hours at 50°C, followed by a hold at 50°C until the samples were cleaned. The
remainder of the hybridization and cleanup protocol followed version 2 of the MYbaits manual
using the reagents provided and Dynabeads MyOne Streptavidin C1 beads (Invitrogen), except
that the cleanup steps took place at 50°C instead of 65°C. PCR was performed with KAPA HiFi
HotStart Ready Mix (Kapa Biosystems, Inc., Wilmington, MA), following the MYbaits protocol,
with 14 cycles and an annealing temperature of 65°C. PCR products were cleaned using the
MinElute Gel Extraction Kit (Qiagen).
Final quantification, combination, and sequencing of most samples were performed at the
Brown University Genomics Core Facility on an Illumina HiSeq 2000 or 2500, to obtain 100-bp,
paired-end reads. Some test samples were run at the Rhode Island Genomics and Sequencing
Center on an Illumina MiSeq, to obtain 250-bp, paired-end reads. The individuals analyzed for
this paper were not sequenced alone; instead, they were sequenced with additional individuals
from across the Portulacineae, whose sequences will be presented in future papers.

Data Processing and Ortholog Assignment Pipeline
First, reads were assigned to individuals using their barcodes (script:
trans_bcparse_2reads.py). Read pairs in which neither inline barcode exactly matched a
template barcode were discarded. For accepted reads, the barcode as well as the last five bases
were trimmed. Trimmed reads with more than one low-quality (Phred score < 2 (#)) base were
also discarded.
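These filtering rules can be sketched in a few lines of Python 3. The sketch is not
trans_bcparse_2reads.py: the names match_barcode and process_pair are hypothetical, and it
assumes that both reads of a pair begin with an inline barcode of the same length, that
qualities are Phred+33 encoded so that low-quality bases appear as "#", and that a pair is
dropped when either trimmed read fails the quality rule.

```python
LOW_QUALITY_CHAR = "#"  # assumption: Phred+33 encoding; "#" marks low-quality calls


def match_barcode(seq, barcodes):
    """Return the inline barcode that exactly matches the start of `seq`, or None."""
    # Longest barcodes first, in case one barcode is a prefix of another.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if seq.startswith(barcode):
            return barcode
    return None


def process_pair(read1, read2, barcodes, tail=5, max_low_quality=1):
    """Demultiplex and trim one read pair.

    Each read is a (sequence, quality_string) tuple. Returns
    (barcode, trimmed_read1, trimmed_read2), or None if the pair is discarded.
    """
    barcode = match_barcode(read1[0], barcodes) or match_barcode(read2[0], barcodes)
    if barcode is None:
        return None  # neither read starts with a known barcode
    trimmed = []
    for seq, qual in (read1, read2):
        # Remove the barcode from the start and the last `tail` bases from the end.
        start, end = len(barcode), len(seq) - tail
        if end <= start:
            return None
        seq, qual = seq[start:end], qual[start:end]
        if qual.count(LOW_QUALITY_CHAR) > max_low_quality:
            return None  # more than one low-quality base
        trimmed.append((seq, qual))
    return barcode, trimmed[0], trimmed[1]
```

The returned barcode would then be combined with the third-read barcode to identify the
individual.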
We designed a three-part bioinformatics pipeline to reconstruct gene sequences (Fig. 2).
Part I (tfastq_assembly_master.py and subordinate scripts) aimed to extract all relevant reads for
each gene family and then assemble them into contigs. Part II (tcontig_classif_master.py and
subordinate scripts) then constructed longer sequences from contigs and assigned them to
particular paralogs within a gene family. Part III (tgenefam_to_spptree_master.py and
subordinate scripts) identified gene duplications within gene families, extracted phylogenetically
useful sets of orthologs, and used them for phylogenetic analysis.
For various parts of the pipeline, the individuals need to be divided into ca. 5 to 15
groups that are known from previous research to be well supported as monophyletic, and
which we would expect to be monophyletic in many of the gene trees. All or almost all of the
groups should contain multiple individuals. For this study, we divided the portullugo into
families, because that gave us the right number of groups of the proper size; in most other
cases, however, different clades would be more appropriate.
Part I.—In Part I, the fastq files were converted to fasta files and the pairs of reads were
classified into gene families using BLASTN version 2.2.29 (Altschul et al. 1990) and assembled
into contigs. Paired reads were classified into a gene family if either read matched (with an
e-value < 10⁻¹⁶) the sequences used to design its baits (trans_fastq_to_2blast.py and
tbaits_blastn_parse.py). For each gene family, reads were then pooled among the individuals that
belonged to each of the nine major lineages, and SPAdes version 3.1.0 (Bankevich et al.
2012) was used to assemble them into nine sets of preliminary contigs (tblast_to_fastq.py, run in
“together” mode). By using reads from different individuals and different species in the same
assembly, we maximized contig number and lengths by also assembling chimeric contigs
containing reads from multiple individuals; this step allowed us to pull significantly more reads
into the pool for analysis. In the next step, a new BLAST database was created from the
preliminary contigs and the sequences from which the baits were designed. The original
reads were then blasted against this larger database, again extracting both reads of a pair if
either matched (tassembly_to_blast.py and tbaits_blastn_parse.py). Separate assemblies for each
individual for each gene family were then constructed by SPAdes (tblast_to_fastq.py, run in
“separate” mode). A single fasta file was constructed for each gene family containing all of the
contigs for that gene family, labeled according to individual. Finally, these fasta files were
blasted against the bait sequences (which do not contain introns) to delimit exons
(tassembly_to_loci.py, run in “spades” mode). Only exons were used for all subsequent analyses.
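The read-classification rule can be illustrated with a minimal sketch that parses tabular
BLAST output (the standard 12-column format, in which the e-value is the eleventh column). It
is not tbaits_blastn_parse.py: the function name classify_pairs, the "/1" and "/2" read-name
suffixes, the subject-to-family mapping, and the choice of the lowest e-value when the two
reads of a pair hit different gene families are all assumptions.

```python
def classify_pairs(blast_lines, family_of_subject, max_evalue=1e-16):
    """Assign read pairs to gene families from tabular BLAST hits.

    `blast_lines` are tab-separated lines with the 12 standard columns
    (query id, subject id, ..., e-value, bit score).
    `family_of_subject` maps bait-source sequence names to gene families
    (a hypothetical lookup table).
    Returns {pair_id: gene_family}.
    """
    best = {}  # pair_id -> (evalue, family)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments and malformed lines
        read_id, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= max_evalue:
            continue
        family = family_of_subject.get(subject)
        if family is None:
            continue
        # Assumed naming: "name/1" and "name/2" are the two reads of a pair,
        # so a hit to either read classifies the whole pair.
        pair_id = read_id.rsplit("/", 1)[0]
        if pair_id not in best or evalue < best[pair_id][0]:
            best[pair_id] = (evalue, family)
    return {pair_id: family for pair_id, (_evalue, family) in best.items()}
```

Both reads of every pair in the returned dictionary would then be written to the fastq file for
the corresponding gene family before assembly.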
Part II.—Part II of the pipeline identified the paralog that each contig from Part I
belonged to, in order to combine contigs and maximize the sequence length for each paralog
(tcontig_classif_master.py). This classification was performed on the principle that shorter
sequences (i.e. the contigs) can be placed within a backbone tree built from longer sequences
(i.e. the bait sequences); the topological position of a placed sequence then indicates its affinity.
This procedure was conducted after removing introns from all contigs according to the blast
results from the previous part of the pipeline (tbaits_intron_removal.py). The pipeline
iteratively refined the backbone tree used. Initially, we used alignments that consisted only of
the same sequences that were used in bait design (transcriptome, model plant, and, in the case
of some photosynthesis-related genes and phytochrome C, other sequences from the same gene
families downloaded from GenBank or amplified by PCR). Part II of the pipeline also attempted
to place the paralogs from each major plant lineage in separate iterations, because the more
tightly clustered the contigs from each paralog are on the phylogeny, the easier it is to group
the contigs according to paralog. (For example, if we are trying to place the Cactaceae paralogs
by themselves, the phylogenetic distance between the different paralogs is much greater, and
thus the overlap between clusters of contigs is much less, than if we were placing all
portullugo paralogs at once.) At the end of each iteration, all good consensus sequences were
added to the backbone tree for the subsequent round.
To place contigs in the backbone tree, we executed the following steps. All contigs for a
gene family were added to the backbone alignment for that gene family using the
addfragments algorithm in MAFFT version 7.017 (Katoh and Standley 2013). The short-read
classification algorithm in RAxML version 8.0.22 (option “-f v”; Berger 2011, Stamatakis 2014)
was then used to place these sequences in the backbone gene family tree
(tbaits_intron_removal.py and tcontigs_to_fixed_paralogs.py). Each contig was given a set
of possible placements in the backbone tree together with the probability of each placement; in
subsequent analyses, we considered the set of placements that gave a total probability of 0.9,
i.e., the 90% confidence interval for the placement of that contig. We then looked for clusters
of contigs that had overlapping placements on the backbone tree, but whose 90% confidence
intervals did not overlap with those of sequences from other clusters. Each cluster was treated as
a putative paralog and extracted for further testing (tseq_placer_dup.py). For each of these
clusters, the contigs from each individual were combined into a consensus sequence for that
individual (tcontigs_to_fixed_paralogs.py). Each consensus sequence was then examined based
on three criteria to decide whether it would be accepted: 1) it was at least 75 bases long; 2) it was
at least 75% of the mean length of all consensus sequences from that cluster; and 3) the number
of bases that differed in overlapping regions of contigs (due to multiple alleles, if low, or
multiple paralogs, if high, in that individual) was less than twice the number of contigs for
non-polyploids and less than five times the number of contigs for plants that were previously
known to be polyploid. (The number of contigs was used instead of sequence length, because,
when the contigs are correctly classified, the ends of the contigs are usually the only places
they overlap.) In intermediate rounds, all sequences from a given plant family and a given
cluster were examined together to determine if the number of accepted sequences was less than
twice the number of rejected sequences or if the number of bases that differed in the contigs
was more than 5% of the total contig length. If these criteria were met, it was assumed that that
sequence group consisted of one paralog and a consensus sequence was accepted for further
analysis by including it in the alignment of existing backbone sequences. Sequences that failed
these criteria were analyzed again in the next round (tparalog_combiner.py). (In the final round,
all accepted contigs were passed on to Part III of the pipeline.)
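Two of the rules in this step lend themselves to a short sketch: selecting the placements that
make up the 90% set for a contig, and applying the three acceptance criteria to a consensus
sequence. The function names and argument layout are hypothetical, the placement probabilities
are assumed to have been parsed from the RAxML output already, and the family-level checks of
the intermediate rounds are not included.

```python
def credible_placements(placements, mass=0.9):
    """Smallest set of branches whose placement probabilities sum to `mass`.

    `placements` is a list of (branch, probability) tuples for one contig.
    """
    chosen, total = [], 0.0
    for branch, prob in sorted(placements, key=lambda item: item[1], reverse=True):
        chosen.append(branch)
        total += prob
        if total >= mass - 1e-9:  # small tolerance for floating-point sums
            break
    return chosen


def accept_consensus(length, cluster_lengths, n_differences, n_contigs, polyploid=False):
    """Apply the three acceptance criteria to one consensus sequence.

    length          -- length of this consensus sequence (bases)
    cluster_lengths -- lengths of all consensus sequences in the cluster
    n_differences   -- bases that differ where the contigs overlap
    n_contigs       -- number of contigs combined into the consensus
    polyploid       -- True for individuals previously known to be polyploid
    """
    mean_length = sum(cluster_lengths) / len(cluster_lengths)
    limit = (5 if polyploid else 2) * n_contigs
    return (length >= 75
            and length >= 0.75 * mean_length
            and n_differences < limit)
```

Two contigs would be candidates for the same cluster when the branch sets returned by
credible_placements overlap.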
After six iterations of contig classification, some contigs remained orphaned, i.e., they
could not be combined into acceptable consensus sequences (most likely due to a recent
duplication that was absent from the backbone tree). Here, a single contig per individual and
paralog was selected. If multiple individuals had orphaned contigs for the same paralog, we
retained those contigs that were alignable across individuals (i.e. representing the same exons)
for further analysis. In particular, we selected the set of alignable contigs that contained the
most individuals and the greatest total contig length (tundivcontigs_combiner.py and
tcontig_selection.py).
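The final selection among sets of alignable orphaned contigs amounts to a ranking, sketched
below. The function name and data layout are hypothetical, and the sketch assumes that the
number of individuals takes priority over total contig length; the text does not state how the
two quantities were weighed against each other.

```python
def best_alignable_set(candidate_sets):
    """Pick one set of mutually alignable orphaned contigs for a paralog.

    `candidate_sets` is a list of dicts, each mapping an individual to its
    single selected contig sequence.
    Assumption: sets are ranked by number of individuals first and by
    total contig length second.
    """
    return max(
        candidate_sets,
        key=lambda contigs: (len(contigs), sum(len(seq) for seq in contigs.values())),
    )
```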
The selected consensus sequences and orphaned contigs were combined and aligned
using the localpair algorithm in MAFFT (tparcomb_combiner.py). Each alignment was checked
to make sure each individual had at most one sequence per paralog, based on the original
naming of the paralog. If multiple sequences per individual and paralog were present, an attempt
was made to combine them into a single consensus sequence using the criteria for accepting
consensus sequences made from contigs as described above (tparcomb_final.py). If they could
not be combined, they were kept separate. All sequences that were over 150
bases long were added to the backbone alignments for each gene family to make a combined
alignment for analysis in Part III of the pipeline.
Part III.—Part III of the pipeline extracted paralogs as separate phylogenetic loci from
the gene-family trees, by identifying the positions of gene duplications in comparison with a
preliminary species tree, and used these loci to reconstruct gene trees
(tgenefam_to_spptree_master.py). Part III was performed twice, first with a preliminary species
tree constructed from three chloroplast loci (matK, ndhF, and rbcL) and the nuclear internal
transcribed spacer (ITS) region, all recovered as off-target reads (see below for construction of
the preliminary species tree), and then with an updated species tree, reconstructed from the loci
recovered from the pipeline using ASTRAL II (Mirarab et al. 2014).
The fasta files containing the original backbone sequences as well as all of the new
sequences were pruned to include only the individuals present in the species tree
(tcombpars_to_trees.py). The pruned fasta files were aligned in MAFFT with the localpair
algorithm and trees were made from the alignments using RAxML with 100 bootstrap replicates.
NOTUNG version 2.8.1.6 (Chen et al. 2000, Stolzer et al. 2012) was used to find gene
duplications in the gene family trees, based on the given species tree. While the topology of the
species tree was taken as given, poorly supported nodes (< 90% bootstrap) on the gene family
trees were rearranged to correspond to the species tree, to minimize the impact of lack of support
on paralog classification. Losses were not reconstructed, only gene duplications. Besides
accounting for poorly supported incongruences between the gene-family tree and species tree,
we also employed a conservative strategy to accept duplications
(tnotung_homolog_parsing.py). Here, a duplication was accepted if it met the following criteria:
1) At least one individual had to be present on both sides of the duplication or, if there were no
duplicated individuals, then at least two plant families had to be present on both sides of the
duplication. If neither of these conditions was met, the putative duplication was assumed to be
due to incongruence between the gene tree and the species tree, rather than an actual
duplication. 2) If very few individuals (either fewer than five or fewer than 40% of the total
individuals in the two sides of the duplication) were present on both sides of the duplication,
an attempt was made to combine the sequences of those individuals (in the same way that the
contigs were combined in Part II of the pipeline), and the duplication was only accepted if the
sequences were not combinable. 3) The duplication had to involve at least two individuals. If
the duplication was within a single individual, it was rejected and the longer of the two
sequences was chosen to represent that individual in the subsequent analyses.
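One possible reading of these three criteria is sketched below; it is an interpretation, not the
logic of tnotung_homolog_parsing.py. The function name and arguments are hypothetical. The
sketch assumes that “two plant families present on both sides” means two families shared by
the two sides, that the 40% threshold refers to all individuals found on either side, and that
the result of the attempt to combine sequences is supplied by the caller.

```python
def accept_duplication(side_a, side_b, family_of, combinable=False):
    """Decide whether a putative duplication node is accepted.

    side_a, side_b -- sets of individuals in the two daughter clades
    family_of      -- dict mapping each individual to its plant family
    combinable     -- True if the sequences of the shared individuals could
                      be merged into acceptable consensus sequences
    """
    shared = side_a & side_b
    everyone = side_a | side_b

    # Criterion 3: a duplication within a single individual is rejected.
    if len(everyone) < 2:
        return False

    # Criterion 1: shared individuals, or else at least two shared families.
    if not shared:
        families_a = {family_of[ind] for ind in side_a}
        families_b = {family_of[ind] for ind in side_b}
        return len(families_a & families_b) >= 2

    # Criterion 2: with very few shared individuals, accept only if their
    # sequences could not be combined.
    few_shared = len(shared) < 5 or len(shared) < 0.4 * len(everyone)
    if few_shared and combinable:
        return False
    return True
```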
After inspection, at each node that subtended an accepted duplication, the smaller sister
group was pruned off as a distinct locus, while the larger group was retained on the gene tree.
(Note that after pruning, the larger group was no longer a single paralog, as it contained one of
the paralogs from the accepted duplication in addition to the unduplicated sequences that subtend
the duplication.) This strategy maximized the number of loci that contained all or most of the
individuals, facilitating phylogenetic inference. The sets of sequences for each locus were
aligned using MAFFT. Once the final set of loci had been obtained, the number of individuals
and the number of plant families that had each locus were calculated (tparalog_selector.py),
thus allowing different subsets of loci and individuals to be selected, in order to run analyses
with different levels of missing data.
New sets of alignments were made containing only the selected loci and individuals
(tal_combiner.py). In addition, to reduce the amount of missing data, all sites with >90%
missing data were pruned prior to analysis. These regions were largely the result of the
transcriptomes and genome sequences being longer than the bait sequences. These alignments
were analyzed in three different ways: a concatenated alignment containing all sequences was
made and analyzed in RAxML, and separate trees were made for each locus using both RAxML
and MrBayes version 3.2.2 (Ronquist et al. 2012) for further analysis.
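The site-pruning step can be illustrated with a minimal sketch. The function name
prune_sparse_sites is hypothetical, and the sketch assumes that gaps, “?”, and “N” all count
as missing data; the characters treated as missing in tal_combiner.py may differ.

```python
def prune_sparse_sites(alignment, max_missing=0.90, missing="-?N"):
    """Remove alignment columns in which more than 90% of sequences are missing.

    `alignment` maps sequence names to aligned sequences of equal length.
    Assumption: "-", "?", and "N" are all treated as missing data.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    keep = []
    for i in range(length):
        n_missing = sum(alignment[name][i].upper() in missing for name in names)
        if n_missing / len(names) <= max_missing:
            keep.append(i)
    return {name: "".join(alignment[name][i] for i in keep) for name in names}
```

A column in which exactly 90% of the sequences are missing is kept, because the criterion is
strictly greater than 90%.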

Pipeline Parallelization
Extensive parallelization helped the pipelines run much faster. Each of the three wrapper
scripts (tfastq_assembly_master.py, tcontig_classif_master.py, and
tgenefam_to_spptree_master.py) and some of the internal scripts could be parallelized in two
different ways. One was using GNU Parallel, which can automatically detect the number of jobs
that can be run simultaneously on a given computer or node of a cluster, and run multiple
concurrent jobs (run in “Parallel” mode). (It is often necessary, however, to run fewer jobs than
could potentially be run on a given computer to allow each job to have sufficient RAM.) If the
pipeline was being run on a cluster, it was much more efficient (both in terms of time
and in terms of usage of resources on the cluster) to run each portion as a separate job, so that
more jobs could be run at once and the resources for each job were freed immediately upon
completion of that job. For this reason, there is a second option, “Array” mode, for Slurm-based
clusters, in which jobs are scheduled using the sbatch command.

Construction of the Preliminary Species Tree
The preliminary species tree was constructed from three chloroplast loci (matK, ndhF,
and rbcL) and the nuclear internal transcribed spacer (ITS) region, all recovered as off-target
reads (using the wrapper script torig_spp_tree_master.py). These sequences were recovered
from our baits using the first part of the pipeline with one round of blast (using
torig_spp_tree_blasting.py if the fasta files are already present or trans_fastq_to_2blast.py if the
fasta files also need to be made; tbaits_blastn_parse.py, tblast_to_fastq.py, and
tassembly_to_loci.py). The longest sequence from each individual for each locus was used and
trees were made in RAxML from the separate and concatenated alignments
(tbaits_to_spptreeseqs.py). These trees were then checked by eye for individuals that were out of
place, likely due to the selection of a pseudogene sequence. The putative pseudogenes were
removed for those individuals and a new species tree was produced from the filtered sequences.
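The choice of the longest sequence per individual and locus can be sketched as follows. The
function name and the record layout are hypothetical, and the sketch assumes that length is
counted without gap characters.

```python
def longest_per_individual(records):
    """Keep the longest sequence for each (locus, individual) combination.

    `records` is an iterable of (individual, locus, sequence) tuples.
    Assumption: length is measured after removing gap characters.
    Returns {(locus, individual): sequence}.
    """
    best = {}
    for individual, locus, sequence in records:
        key = (locus, individual)
        ungapped = len(sequence.replace("-", ""))
        if key not in best or ungapped > len(best[key].replace("-", "")):
            best[key] = sequence
    return best
```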