Supplemental Methods
Supplemental methods for: Geographic range dynamics drove hybridization in a lineage of angiosperms

R.A. FOLK1, C.J. VISGER1, P.S. SOLTIS1, D.E. SOLTIS2, R. GURALNICK1

1Florida Museum of Natural History
2Biology, University of Florida
3Author for correspondence: [email protected]

Sequencing: Sequencing followed previously developed methods1 with the following modifications: library preparation was performed by RAPiD Genomics (Gainesville, FL; using TruSeq-like adapters as in Folk et al. 2015), the targeted insert size was > 200 bp, and sequencing used a 300-cycle (150 bp read) kit on a HiSeq 3000 instrument. The overall outgroup sampling (21 taxa total; Supplementary Table S1) was improved more than fivefold.2 This includes several representatives each of all lineages that have been hypothesized to undergo hybridization in the Heuchera group of genera.

For the transcriptomes, reads were assembled against the low-copy nuclear loci from our targeted enrichment experiment; the targets were stripped of intronic sequence, but assembly methods otherwise followed a previously developed BWA-based approach1. Transcriptomic reads were also mapped to a Heuchera parviflora var. saurensis chloroplast genome reference1 that was stripped of intronic and intergenic sequence. Assembly methods for target-enriched data followed the BWA-based approach1 directly. In practice, intronic sequence can be recovered from RNAseq data,3 but it has consistently lower coverage (pers. obs.); moreover, non-coding read dropout can be expected to be high for the more divergent outgroups added here. For this reason, only coding reference sequences were used to assemble transcriptomic taxa. For nuclear analyses, reads were assembled with 277 references comprising the gene sequences used for bait design, with intronic sequences stripped. For chloroplast analyses, the reference plastid genome of Heuchera parviflora var.
saurensis1 was stripped of all intronic and intergenic sequence; reads were assembled to 113 references comprising coding sequences in the chloroplast (protein-coding genes, rDNA, tRNAs).

Alignment and concatenation: All contig consensus sequences were aligned in MAFFT4 with the “--auto” option and default gap costs (alignments were completely remade rather than adding new sequences to the existing alignments). Manual editing was deemed unnecessary for the plastid analysis and for enriched taxa in the nuclear analysis. However, for several nuclear genes, exon-intron boundaries were incorrectly aligned for the 5 taxa with transcriptomic data. These errors were identified and manually edited in Geneious (version R9) to improve sequence homology assessment. Due to aberrant placement of transcriptome taxa towards the base of the tree, likely caused by the highly non-random nature of missing data (introns, 2/3 of the dataset, are absent from transcriptomes), we excluded all intronic positions (still present in genomic assemblies) from the matrix after alignment.

Species tree estimate: Given the strong similarity of concatenated and coalescent phylogenetic estimates with and without various partitioning schemes for the nuclear dataset,2 we chose to focus on concatenation for this case study, which under these conditions serves as an estimate of species phylogeny. To incorporate phylogenetic uncertainty in our ancestral reconstructions, we performed a RAxML rapid bootstrap analysis (option -f a) with 1,000 bootstraps on the nuclear data, using a matrix in which one individual was randomly selected per species and setting RAxML to output branch lengths on the bootstrap trees. Since non-coding sites were excluded, and gene-wise partitioning did not previously have a major impact on topology or support, we did not partition nucleotide positions.
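The random selection of one individual per species, used to build the bootstrap matrix, can be illustrated with a minimal Python sketch. This is not the script used in the study; the function name `pick_one_per_species`, the sample identifiers, and the fixed seed are hypothetical and serve only to show the logic.

```python
import random
from collections import defaultdict


def pick_one_per_species(sample_to_species, seed=None):
    """Return a dict mapping each species to one randomly chosen sample ID.

    Hypothetical helper for illustration; not taken from the study's code.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for sample, species in sample_to_species.items():
        by_species[species].append(sample)
    # Sort species and samples so the draw depends only on the seed.
    return {
        species: rng.choice(sorted(samples))
        for species, samples in sorted(by_species.items())
    }


# Made-up sample IDs, for illustration only.
samples = {
    "Hpar_01": "Heuchera parviflora",
    "Hpar_02": "Heuchera parviflora",
    "Hlon_01": "Heuchera longiflora",
}
chosen = pick_one_per_species(samples, seed=1)
print(chosen)
```

Fixing the seed makes the draw reproducible, so the same reduced matrix can be regenerated if the alignment is rebuilt.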
Point record synthesis: We synthesized available point records from the California, Pacific Northwest, and Intermountain Consortia of Herbaria; GBIF; S-NET (http://science-net.kahaku.go.jp/specimen_en/collection/); OS (https://herbarium.osu.edu/online-data-access); SEINet; and SERNEC. After assessing weaknesses in point records from these repositories, strategic taxa of Heuchera were identified, imaged, and georeferenced from specimen loans from the following herbaria: ARIZ, ASC, ASU, BRIT, CAS, CS, DUKE, F, GH, MEXU, MICH, MINN, MNA, MO, NCU, NMC, NY, RM, RSA, SIU, TENN, TEX, UNM, US, UTC, UTEP, VT, WCUH, WIS, XAL. Several species (H. acutifolia, H. glomerulata, H. inconstans, H. longipetala, H. mexicana, H. rosendahlii, H. sanguinea, H. soltisii, and H. wellsiae) consist entirely of records identified by the first author and taken from previously published monographic range maps;5-7 additional unpublished identified records were available for Mexican H. versicolor. Finally, we used a significant number of occurrences from previous fieldwork that serve as ground-truthed records; for H. longiflora, H. missouriensis, H. parviflora, H. puberula, and H. soltisii we mostly or entirely used field-collected GPS records (all but H. longiflora published previously7,8).

For this work, species delimitations were conservative and primarily followed current taxonomic works;5-12 an exception was made for Heuchera versicolor, which was found to be only distantly phylogenetically related to H. rubescens,2 yet has been synonymized with it in recent works. Recognizing a grossly polyphyletic species was seen as problematic, so the two are treated separately here. Locality data for Heuchera versicolor in the strict sense were downloaded from GBIF and SEINet, together with new georeferencing; few records are available for H. rubescens explicitly identified in the strict sense, so we took all records for H.
rubescens sensu lato from SEINet and GBIF; individuals outside of the approximate range of this entity and in the range of H. versicolor were removed following the approximately allopatric range previously recognized.9 We removed all point records with coordinates calculated only to the nearest degree. Spurious records were removed manually by reference to known ranges in the literature (cited above); in particular, a large number of European records had to be removed because this group contains common garden subjects; other botanical garden records were found by scanning locality fields and removed.

Layer correlation: We calculated layer correlation in R using Pearson’s ρ on layers downsampled to 2.5-minute resolution (Python library GDAL, http://www.gdal.org/; using nearest-neighbor sampling, which is equivalent to subsampling 1/5th of the untransformed original data) and clipped to the combined range of sampled extant taxa using the training region method noted below on pooled point records; convex hulls were calculated separately for disjunct range portions in Asia and North America and merged for the correlation calculation. For each highly correlated (ρ ≥ 0.75) pair of environmental layers, we deleted one layer chosen with a random-number generator. This resulted in 22 layers, which was still excessive given the limited locality data we had for many species, so we retained from this set 12 layers chosen to capture multiple climatic, edaphic, and topographic aspects relevant to Heuchera12,17: mean annual temperature, temperature mean annual range, annual mean precipitation, mean precipitation of driest quarter (i.e., BIOCLIM 1, 7, 12, and 17), elevation, slope, mean coarse fragment percent, mean pH, mean sand percent, mean organic carbon content, needle-leaf land cover percent, and herbaceous land cover percent.
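The correlation filter can be sketched as follows. The analysis itself was run in R; this Python/NumPy version is only an illustration under stated assumptions: the function name `drop_correlated_layers` is hypothetical, pixel values are assumed to be already extracted into a table with no missing data, the threshold is applied to the absolute value of the coefficient, and the data are synthetic.

```python
import numpy as np


def drop_correlated_layers(values, names, threshold=0.75, seed=None):
    """Randomly drop one layer from each highly correlated pair.

    values: array of shape (n_pixels, n_layers) with no missing values.
    names:  list of layer names, one per column.
    Returns the names of the retained layers.
    Hypothetical helper; |r| >= threshold is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(values, rowvar=False)  # Pearson correlation matrix
    keep = list(range(len(names)))
    while True:
        pairs = [
            (i, j)
            for pos, i in enumerate(keep)
            for j in keep[pos + 1:]
            if abs(corr[i, j]) >= threshold
        ]
        if not pairs:
            break
        i, j = pairs[0]
        # Delete one member of the pair at random.
        keep.remove(int(rng.choice([i, j])))
    return [names[k] for k in keep]


# Synthetic data: layers "a" and "b" are nearly identical, "c" is independent.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=500)
table = np.column_stack([
    x,
    x + data_rng.normal(scale=0.01, size=500),
    data_rng.normal(size=500),
])
kept = drop_correlated_layers(table, ["a", "b", "c"], seed=0)
print(kept)
```

Removing one layer at a time and re-checking the remaining pairs avoids deleting both members of a pair when a layer is correlated with several others.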
Predicted niche occupancy profiles: While outgroup occurrence samples were generally sufficient, the genus Heuchera contains a number of extremely narrow endemics (documented range as small as tens of kilometers), many of which have been described in the last decade. For ten included taxa, we could not obtain more training points than the number of layers (i.e., n < 15, which is 12 layers plus 25%); given the extensive loans undertaken from critical collections for western North America and Mexico, this likely represents the limits of our knowledge of occurrences for these taxa. Under these conditions complex multivariate methods such as Maxent are suspect, yet simply excluding taxa with insufficient data may ultimately be misguided in small trees where the effect of taxon sampling may be large. We addressed this by sampling directly from layer values at occurrence coordinates (somewhat similar to ref. 13), excluding pixel-wise duplicates and creating a PNO (predicted niche occupancy) profile solely of these observed values to be sampled under a uniform distribution (next section).

Ancestral suitability overlap: We projected nodal variable distributions into geographical space using (1) a binary approach, and (2) a binned-probabilities approach. For the binary map (1), intended as representing a literal Hutchinsonian niche (an n-dimensional hypercube in E-space) and a close analog of the BIOCLIM method14 (cf. ref. 15), we calculated 95% credibility intervals (2.5th and 97.5th percentiles) from pooled MCMC chains for both ancestral taxa, thinned as described above. We then calculated a binary map in QGIS for each LGM raster by scoring as present only pixels that were in the credibility interval. We then combined these rasters by taking the intersection (in this case the product), returning all pixels that fall within the 95% credibility interval of all four BIOCLIM variables in the LGM projection.
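The binary map (1) amounts to a percentile threshold per variable followed by a pixel-wise product. The maps themselves were produced in QGIS; the NumPy sketch below only illustrates the arithmetic. The function name `binary_suitability`, the layer names, and all values are hypothetical, and rasters are assumed to be already loaded as arrays on a common grid.

```python
import numpy as np


def binary_suitability(layers, samples, lower=2.5, upper=97.5):
    """Score pixels as 1 if every layer falls inside its credibility interval.

    layers:  dict of layer name -> 2-D array of raster values.
    samples: dict of layer name -> 1-D array of pooled posterior samples
             of the ancestral value for that variable.
    Hypothetical helper for illustration; NaN pixels are scored 0.
    """
    result = None
    for name, raster in layers.items():
        lo, hi = np.percentile(samples[name], [lower, upper])
        inside = ((raster >= lo) & (raster <= hi)).astype(np.uint8)
        # The product of 0/1 maps is their intersection.
        result = inside if result is None else result * inside
    return result


# Toy 2 x 2 rasters and evenly spaced "posterior samples" (made-up values).
layers = {
    "bio1": np.array([[1.0, 50.0], [99.0, 50.0]]),
    "bio12": np.array([[5.0, 5.0], [5.0, 20.0]]),
}
samples = {
    "bio1": np.linspace(0.0, 100.0, 1001),   # interval: 2.5 to 97.5
    "bio12": np.linspace(0.0, 10.0, 1001),   # interval: 0.25 to 9.75
}
binary = binary_suitability(layers, samples)
print(binary)
```

Only the top-right pixel is inside both intervals, so it is the only pixel scored as present in the combined map.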
We also calculated a binned-probabilities map (2), intended to correctly incorporate distributions of suitability rather than ranges, and hence closer to recent modeling approaches, to