ddRADseq and data analysis for inferring phylogenetics of the rapidly radiated Australian Camphorosmeae Philipp Hühn & Gudrun Kadereit Institute of Molecular Physiology, Working Group Biodiversity and Evolution of Johannes Gutenberg-University Mainz, 55099 Mainz, Germany

Introduction One of the major challenges in evolutionary biology is resolving phylogenetic relationships of closely related, rapidly radiated and recently Lab Workflow diversified lineages1. The Australian Camphorosmeae Moq. (Chenopodiaceae, Fig. 2) are a typical example of such a species group which is further complicated by ongoing hybridization events and a reduced morphology2. This most species-rich tribe of Australian Chenopodiaceae consists of 149 described species in 14 genera that form an ubiquitous and diverse component of the Australian vegetation and its current is thought to be highly artificial3. For inferring population and species histories multi-locus approaches have become the baseline in phylogeography and phylogenetics4. These methods require gene trees which are inferred from loci with high information content, each containing a series of linked SNPs within the locus4. Thanks to the fast development of Next Generation Sequencing (NGS) technologies during the last decade, the resources needed for DNA sequence generation has significantly decreased. Most NGS platforms generate sequences of a relatively short length and low information content that are unsuitable for coalescent-based gene tree methods5. More recent innovations in NGS technology have resulted in an increase in sequencing quality and read length. We developed a modified NGS ddRADseq lab workflow (Fig. 1) for targeting a defined set of long loci up to 600 bp length and optimized the data analysis (Fig. 3) to guarantee the best possible sequence processing and information output.

200 ng genomic DNA

Lab Workflow New Polymorphic Loci The lab workflow (Fig. 1) is based on two NGS protocols: 250 Genotyping-by-Sequencing (GBS6) and double digest Restriction site- 200 Associated DNA sequencing (ddRADseq7) and consist of three major 150 steps. First, 200 ng of genomic DNA is digested with both restriction 100 enzymes simultaneously. Second, specially designed, sample specific 50 0 barcode adapters are ligated to the digested DNA fragments and the -50

samples are pooled together. Third, the library is size selected (broad -100 Double digest using BamHI and KpnI target range 350-710 bp) by Pippin Prep, enriched by a low copy PCR amplification and finally size selected and purified by Magnet Figure 3A: The turn-over clustering threshold was Beads. The ready-to-load library is then sequenced on the Illumina determined at 0.95 in regard to the number of MiSeq platform using the v3 kit for 300 bp paired-end reads with an new polymorphic loci found in the data sets of estimated output of up to 40 Million reads per lane. varying clustering thresholds.

Loci in Assembly 3500

3000

2500 Adapter ligation and pooling 2000 1500

1000

500

0

Figure 3B: The identified threshold of 0.95 is located before the climax of loci found in the assemblies. The number of loci collapses after the climax peak.

Sum SNPs Fragment segregation to target 1600 range 300-600 bp length 1400 Figure 2: The Australian Camphorosmeae serve as study group for the 1200 development of a lab workflow and downstream data analysis tests 1000 for inferring phylogenetic signals of challenging species groups using 800 RADseq data. The 149 species form an ubiquitous and diverse 600 component of the Australian vegetation3. The lineage originated 400 200

around 10 Mya at the end of the Miocene and diversified during the 0 Pliocene, probably driven by the aridification of Australia2. Figure 3C: Also the number of SNPs found in each data set is maximized at the determined Data Analysis clustering threshold of 0.95. Low copy library amplification A thorough data analysis (Fig. 3) is cruicial for any RADseq experiment. We choose ipyrad8 for filtering and assembling the reads to among sample loci. Throughout the assembling process, values must be chosen for key parameters, which frequently have a significant effect on the building and quality of the resulting loci. We tested and optimized the main parameter of the pipeline for our study group based on the parameter tests by Paris et al. 9 for the STACKS pipeline and Pätzold et al.10 for the ipyrad pipeline. First, the clustering threshold was changed iteratively and the resulting new polymorphic loci (Fig. 3A), loci in assembly (Fig. 3B) and SNPs in the loci set (Fig. 3C) were analyzed. Then the settings for the Max SNP and Max Indel Final size selection and purification filter were adjusted to the read length. For a better understanding of the distribution of information across the data set, the sampling coverage, read length and content of SNPs were visualized in a 3D Figure 1: Main steps of the ddRADseq lab workflow. loci plot (Fig. 3D).

Results and Outlook Figure 3D: Typically, long and variable loci The double digest approach together with the specially designed barcode adapters and three purification/enrichment steps lead to a library (white dots) show a low sample coverage, where- consisting of 50 samples with a fragment target range of 300-600 bp length. The sequence output of the Illumina MiSeq v3 kit is overall of a as less informative loci are shared by a majority good quality. However, the read quantity is significantly lower than expected, probably due to a clustering bias of very long fragments. The of samples. parameter tests revealed a within sample clustering threshold of 0.95 to be optimal for assembling the reads (Fig. 3A). At this point (turn-over) mis-clustering of homologous loci to heterologous loci and vice versa is avoided. This threshold is also located directly before the loci in assembly Figure 3: Analysis of the RADseq data. collapse (Fig. 3B, climax) and furthermore shows the highest number of SNPs found in this data set (Fig. 3C). The assembly tests revealed well supported basic phylogenetic patterns. However, missing data caused by loci dropout due to cut site polymorphism hinders the utility of RADseq loci for gene tree studies. Typically, long and informative RAD loci have a low sample coverage (Fig. 3D), whereas more conserved loci are shared by the majority. To overcome this problem, a denser sampling is needed and hyRAD11 and Hyb-Seq12 based on the RAD tags will be further applied to the target group. In the end, a practical solution for inferring phylogenies of rapidly diversified lineages probably lies in a two-step approach of sampling loci from the whole genome by RADseq and then capturing a suitable number of long and variable loci using lineage specific probes. Data Analysis

References Copyright of pictures Acknowledgements 1 2 Whitfield JB, Lockhart PJ (2007) Deciphering ancient rapid radiations. Trends in Ecology and Evolution 22, 258–265. / Cabrera J, Jacobs SWL, Kadereit G (2009) Phylogeny of the Australian Camphorosmeae (Chenopodiaceae) and the taxonomic significance of the fruiting perianth. Figure 1, Lab Workflow Figure 2, Australian Camphorosmeae, Kevin R. Thiele (University of Perth, WA) The authors would like to thank: 3 4 International Journal of Sciences 170, 505–521. / Wilson PG (1984) Chenopodiaceae. Canberra, Bureau of Flora and Fauna. , 4, 81-317. / Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends DNA – David Smith (2017), commons.wikimedia.org A) Maireana georgei, flickr.com/photos/66951228@N07/6282631413/ DFG – Taxon-Omics (SPP 1991) and PROMOS (DAAD) for financial support 5 6 in Ecology and Evolution 24, 332–340. / McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT (2013) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution 66, 526–538. / Elshire, R. J., Glaubitz, J. C., Sun, Q., BamHI – S. Jähnichen (1998), commons.wikimedia.org B) Threlkeldia diffusa, flickr.com/photos/66951228@N07/15248623749/ TERN (University of Queensland,Australia) and John McDonald (University of 7 Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PloS one, 6(5), e19379. / Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., & Hoekstra, H. E. (2012). Double digest RADseq: an Ligase - Tom Ellenberger (2015), commons.wikimedia.org C) Osteocarpum acropterum, flickr.com/photos/66951228@N07/19542884812/ Adelaide, Australia) for providing plant material 8 9 inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PloS one, 7(5), e37135. / Eaton, D. A. (2014). PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30(13), 1844-1849. / Paris, J. R., Stevens, J. R., & Catchen, Pippin Prep – Sage Science Inc. (2018), sagescience.com D) Eriochiton sclerolaenoides, flickr.com/photos/66951228@N07/7232038400/ MOGON (JGU Mainz) for the computing time granted 10 J. M. (2017). Lost in parameter space: a road map for stacks. Methods in Ecology and Evolution, 8(10), 1360-1373. / C Paetzold, K Wood, D Eaton, WL Wagner & MS Appelhans (in prep) Navigating the assembly parameter maze: RADSeq resolves Phylogeny of Hawaiian Melicope Thermocycler – Analytik Jena AG (2018), analytic-jena.de E) , flickr.com/photos/66951228@N07/7232048280/ 11 12 (Rutaceae) / Suchan T, et al. (2016) Hybridization capture using RAD probes (hyRAD), a new tool for performing genomic analyses on collection specimens. PLoS One, 11. Jg., Nr. 3, S. e0151651. / Weitemier K, et al. (2014) Hyb‐Seq: Combining target enrichment and genome Magnetic Beads – Thermo Fisher Sceintific (2018), thermofisher.com F) Sclerolaena diacantha, flickr.com/photos/66951228@N07/7232033322/ skimming for plant phylogenomics. Applications in Plant Sciences, 2(9), 1400042.