Developing Computational Tools for Evolutionary Inferences in Polyploids
Dissertation
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Paul David Blischak, B.Sc.
Graduate Program in Evolution, Ecology, and Organismal Biology
The Ohio State University
2018
Dissertation Committee:
Andrea D. Wolfe, Advisor
Bryan C. Carstens
John V. Freudenstein
Laura S. Kubatko
© Copyright by
Paul David Blischak
2018
Abstract
Methods for generating genome-scale data sets are facilitating the inference of phylogenetic
relationships in non-model taxa across the Tree of Life. However, rapid speciation
and heterogeneous patterns of diversification make this task difficult when gene trees
have conflicting histories (e.g., from incomplete lineage sorting). For plant species in
particular, additional complications arise due to the intermixing of divergent lineages
through hybridization and the subsequent occurrence of whole genome duplication
(WGD; i.e., allopolyploidy). Investigations regarding the evolutionary history of recently
formed polyploids and their diploid progenitors are difficult to conduct because
of problems with resolving ambiguous genotypes in the polyploids as well as analyzing
species with different ploidies. The focus of my dissertation has been to develop models
and bioinformatic tools for analyzing high-throughput sequencing (HTS) data collected
in non-model taxa of different ploidy levels to estimate phylogenetic relationships. I
am applying these tools in the plant genus Penstemon (Plantaginaceae) to infer the
relationships in two groups of closely related species containing diploids, tetraploids,
and hexaploids.
The first chapter of my dissertation uses HTS data and a hierarchical Bayesian
framework to estimate biallelic single nucleotide polymorphism (SNP) genotypes and
allele frequencies in populations of any ploidy level (diploid or higher) assuming Hardy-Weinberg
equilibrium. It does this using Markov chain Monte Carlo (MCMC) to
integrate over the uncertainty in the estimated genotypes. I then assess the model’s
accuracy using simulations and test it on a SNP data set in autotetraploid potato
(Solanum tuberosum). Both of these tests demonstrate the usefulness of the model for
parameter inference at different ploidy levels. The MCMC algorithm that is used for
inference is implemented in the open source R package polyfreqs.
The set of models in my second chapter builds on Chapter 1 in two important ways.
First, I extend the Hardy-Weinberg equilibrium model to include inbreeding. Second,
I directly address the hybrid nature of allopolyploid organisms by separately modeling
the genomes of the two parental species. Using both simulations and empirical data
sets from the literature (autopolyploid: Andropogon gerardii, allopolyploid: Betula
pubescens + diploid parent: B. pendula), I benchmark these methods against other
software (Genome Analysis Toolkit) to demonstrate their effectiveness for estimating
genotypes. These new models also use a different algorithm for inferring population
parameters, the expectation maximization algorithm, which I have implemented in
the open source software package ebg.
Chapter 3 uses ideas similar to those presented in the first two chapters, but
focuses on inferring full haplotype sequences, rather than single SNP genotypes, for
samples of arbitrary ploidy. The method is able to process paired-end HTS data
collected using double-barcoded amplicon sequencing, and uses the program PURC to
cluster sequencing reads into haplotypes. It then uses a multinomial likelihood to infer
haplotypes while also accounting for sequencing error. The pipeline is implemented in
the software Fluidigm2PURC, and I demonstrate its use on a polyploid series from
the genus Thalictrum (Ranunculaceae).
My final chapter uses nuclear amplicon sequencing to infer evolutionary relationships
between two closely related groups in Penstemon: subsections Humiles and
Proceri (Plantaginaceae). These two groups are known to hybridize and have documented
cases of WGD events forming putative allotetraploids and allohexaploids. To
estimate phylogeny in these two groups, I first use the methods described in Chapter
3 to determine haplotypes from paired-end HTS data for all diploid, tetraploid, and
hexaploid individuals. I also develop a method for assessing the proportion of gene
trees supporting a species-level quartet (quartet concordance factors; QCFs), which I
use as input for estimating a species network using the program SNaQ. Phylogenies
inferred using both species tree and network approaches recover subsections Humiles
and Proceri as non-monophyletic. There is also strong evidence for hybridization within and between these two groups.
This is dedicated to my grandparents:
Doris & Michael† Blischak and Mary† & Carl Firm
Acknowledgments
First and foremost, I would like to thank my advisor, Dr. Andrea Wolfe, for her
guidance and support throughout the process of getting my Ph.D. I started working with Andi as an undergraduate math major, and she convinced me to go to grad school,
and to give biology a try. I will be forever grateful for this advice. My co-advisor for
this undergrad research project with Andi was Dr. Laura Kubatko, who has continued
to serve as a mentor and “unofficial” Ph.D. co-advisor, for which I am very thankful.
Working with Andi and Laura has been an incredible experience, and I will miss
getting to chat in their offices while mulling over the latest issues about why none
of my research is working. I would also like to thank my other committee members,
Dr. Bryan Carstens and Dr. John Freudenstein, who provided invaluable insights
into key areas of my thesis, and were always happy to discuss my research or to help
troubleshoot.
I would like to thank all members of the Wolfe Pack, both past and present. When
I first started in the lab, I was beyond clueless, but my senior lab mates, Dan Robarts
and Aaron Wenzel, were an immense help, and continued to guide me through the
early years of my graduate work. I’d also like to acknowledge my current lab mates,
Ben Stone and Rosa Rodriguez, who are great friends and colleagues, and who have
helped me bounce around ideas or problems that I was having with my research on
numerous occasions.
Getting through this Ph.D. would not have been possible without my family. My
parents, Maggi and Dave, have shown indefatigable love and support throughout all
stages of my education, even when I had absolutely no idea what I wanted to do with
my life. They are model human beings, and I hope to live up to the example that they
have set for me. My siblings, John and Julianna, are equally badass. Not only are
they great couch-fort builders, living-room soccer players, and backyard mat tumblers,
they are incredible friends. I am also immeasurably fortunate to have an amazing
partner, Makenzie Mabry, whose love and encouragement have sustained me through
both the high and low points of my Ph.D. over the past couple of years.
I would like to thank everyone who helped me in the field and with obtaining
collecting permits, including Mikel Stevens, Noel Holmgren, Karen and Steve Shelly,
Carol Blackburn, Teresa Prendusi, Dale Reinhart, Maret Pajutee, and Steve Popovich.
I also thank the organizations that provided funding for my research: The Ohio State
University (Distinguished University Fellowship), the National Science Foundation
(Doctoral Dissertation Improvement Grant; DEB-1601096), the Society of Systematic
Biologists (Graduate Student Research Award), and the American Society of Plant
Taxonomists (Graduate Student Research Grant). Computational resources for my
research were provided by the Ohio Supercomputer Center and the College of Arts
and Sciences Unity Cluster at OSU. I would also like to acknowledge Drs. Xin He,
John Novembre, and Matthew Stephens, as well as their lab members, for allowing
me to come visit and present my research at the University of Chicago, and thanks
especially to my brother John for orchestrating this opportunity.
To all of the wonderful friends that I have made while working in EEOB, thank you all for the memories, the laughs, and the good times. I will miss you all, and I
hope we’ll bump into each other as frequently as possible.
Vita
2008 ...... Archbishop Hoban High School.
2012 ...... B.Sc. Mathematics, The Ohio State University.
2012-2013, 2017-2018 ...... Distinguished University Fellow, The Ohio State University.
2013-2015 ...... Graduate Teaching Associate, The Ohio State University.
2015-2017 ...... Graduate Research Associate, The Ohio State University.
Publications
Research Publications
Blischak, P. D., M. Latvis, D. F. Morales-Briones, J. C. Johnson, V. S. Di Stilio, A. D. Wolfe, and D. C. Tank. Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons. Applications in Plant Sciences, 6:e1156, 2018.
Blischak, P. D., J. Chifman, A. D. Wolfe, and L. S. Kubatko. HyDe: a Python Package for Genome-Scale Hybridization Detection. Systematic Biology, doi:10.1093/sysbio/syy023, 2018.
Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. SNP Genotyping and Parameter Estimation in Polyploids Using Low-Coverage Sequencing Data. Bioinformatics, 34:407–415, 2018.
Latvis, M., S. J. Jacobs, S. M. E. Mortimer, M. Richards, P. D. Blischak, S. Mathews, and D. C. Tank. Primers for Castilleja and Their Utility Across Orobanchaceae: II. Single-Copy Nuclear Loci. Applications in Plant Sciences, 5:1700038, 2017.
Wolfe, A. D., T. Necamp, S. Fassnacht, P. D. Blischak, and L. S. Kubatko. Population Genetics of Penstemon albomarginatus (Plantaginaceae), a Rare Mojave Desert Species of Conservation Concern. Conservation Genetics, 17:1245–1255, 2016.
Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. Accounting for Genotype Uncertainty in the Estimation of Allele Frequencies in Autopolyploids. Molecular Ecology Resources, 16:742–754, 2016.
Blischak, P. D., A. J. Wenzel, and A. D. Wolfe. Gene Prediction and Annotation in Penstemon (Plantaginaceae): a Workflow for Marker Development from Extremely Low-Coverage Genome Sequencing. Applications in Plant Sciences, 2:1400044, 2014.
Fields of Study
Major Field: Evolution, Ecology, and Organismal Biology
Minor Field: Statistics
Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Accounting for Genotype Uncertainty in the Estimation of Allele Frequencies in Autopolyploids
    1.1 Abstract
    1.2 Introduction
    1.3 Materials and Methods
        1.3.1 Model Setup
        1.3.2 Full Conditionals and MCMC Using Gibbs Sampling
        1.3.3 Simulation Study
        1.3.4 Example Analyses of Autotetraploid Potato (Solanum tuberosum)
    1.4 Results
        1.4.1 Simulation Study
        1.4.2 Example Analyses
    1.5 Discussion
    1.6 Conclusions
    1.7 Software Note
    1.8 Acknowledgements
    1.9 Author Contributions
    1.10 Data Accessibility

2. SNP Genotyping and Parameter Estimation in Polyploids from Low-Coverage Sequencing Data
    2.1 Abstract
    2.2 Introduction
    2.3 Models
        2.3.1 Autopolyploid Model
        2.3.2 Allopolyploid Model
        2.3.3 Other Approaches
    2.4 Methods
        2.4.1 Simulations
        2.4.2 Empirical Data Analysis
        2.4.3 Software and Reproducibility
    2.5 Results
        2.5.1 Simulations
        2.5.2 Empirical Data Analysis
    2.6 Discussion
    2.7 Conclusions
    2.8 Acknowledgements

3. Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons
    3.1 Abstract
    3.2 Introduction
    3.3 Methods and Results
        3.3.1 Input data
        3.3.2 Step 1: fluidigm2purc
        3.3.3 Step 2: PURC
        3.3.4 Step 3: crunch_clusters
        3.3.5 Example analysis
    3.4 Conclusions
    3.5 Availability
    3.6 Acknowledgements

4. Inferring Species Trees and Networks from Gene Tree Quartet Site Patterns: An Example from the Plant Genus Penstemon (Plantaginaceae)
    4.1 Abstract
    4.2 Introduction
    4.3 Approach
        4.3.1 Calculating Quartet Concordance Factors
        4.3.2 Bootstrapping and Gene Tree Uncertainty
        4.3.3 Validating QCF Estimation
        4.3.4 Implementation
    4.4 Materials and Methods
        4.4.1 Study System
        4.4.2 Sample Collection, DNA Extraction, and Amplicon Sequencing
        4.4.3 Species Tree Inference
        4.4.4 Candidate Hybridization Events from Rooted Triples
        4.4.5 Species Network Inference
    4.5 Results
        4.5.1 Nuclear Amplicon Data
        4.5.2 Species Tree Inference
        4.5.3 Tests for Hybridization and Species Network Inference
    4.6 Discussion
        4.6.1 Taxonomy of Subsections Humiles and Proceri
        4.6.2 Character Evolution and Biogeography
        4.6.3 Phylogenetics of Hybrids and Polyploids
    4.7 Conclusions

Bibliography

A. Chapter 1 Supplemental Materials
    A.1 Example Analyses of Autotetraploid Potato (Solanum tuberosum)
        A.1.1 Calculating Expected and Observed Heterozygosity
        A.1.2 Evaluating Model Adequacy

B. Chapter 2 Supplemental Materials
    B.1 EM Algorithms
        B.1.1 Autopolyploid Model
        B.1.2 Allopolyploid Model
        B.1.3 C++ Code
    B.2 Simulations
        B.2.1 Inbreeding Coefficient From Called Genotypes
    B.3 Empirical Data Analysis
        B.3.1 Data Acquisition
        B.3.2 Comparison with GATK

C. Chapter 3 Supplemental Materials
    C.1 Haplotype Inference
        C.1.1 Inferring Haplotypes with Known Ploidy
        C.1.2 Inferring Haplotypes with Unknown Ploidy
    C.2 Example Analysis
        C.2.1 Fluidigm2PURC
        C.2.2 dbcAmplicons (reduce_amplicons.R)

D. Chapter 4 Supplemental Materials
    D.1 Validating QCF Estimation
        D.1.1 Tree Simulations
        D.1.2 Network Simulations
    D.2 Code for Species Tree and Network Inference
        D.2.1 Gene Tree Estimates with RAxML
        D.2.2 Species Tree Inference with ASTRAL-III
        D.2.3 Species Tree Inference with qcf+QuartetMaxCut
        D.2.4 Network Analyses with PhyloNetworks
List of Tables

1.1 Notation and symbols used in the description of the model for estimating allele frequencies in polyploids
2.1 A key to the symbols and notation that are used in describing the autopolyploid and allopolyploid models
3.1 Dependencies for the Fluidigm2PURC pipeline with version numbers in parentheses
3.2 Thalictrum L. species included in the comparison of Fluidigm2PURC and dbcAmplicons
3.3 Overall alignment statistics for the comparison between Fluidigm2PURC and the reduce_amplicons.R script
3.4 Per species data for the number of haplotypes inferred by Fluidigm2PURC using known vs. unknown ploidy
C.1 Haplotype configurations and their corresponding log-likelihoods for a tetraploid with ordered cluster sizes equal to 285, 95, 10, and 8
C.2 Haplotype configurations for an individual with six clusters
D.1 Collection and ploidy information for accessions from Penstemon subsections Humiles and Proceri
D.2 Primers for amplicon sequencing
D.3 Primers for amplicon sequencing (continued)
D.4 RMSD values for QCF estimation using data simulated from a tree topology (Figure 4.1a)
D.5 RMSD values for QCF estimation using data simulated from a network topology (Figure 4.1b)
List of Figures

1.1 Error in allele frequency estimation as measured by the RMSE of posterior means
1.2 Posterior standard deviation for allele frequency estimates across levels of sequencing coverage
1.3 Posterior distribution of observed and expected heterozygosity in Solanum tuberosum
2.1 RMSD values for simulations under the autopolyploid model with inbreeding for (a) estimated inbreeding coefficients and (b) estimated genotypes
2.2 RMSD values for full genotype estimation (combined number of alternative alleles in subgenomes one and two)
2.3 Results of empirical data analyses: (a) Levels of inbreeding in Andropogon gerardii and (b) genotype estimation error in Betula pubescens
3.1 Flowchart outlining the steps for haplotype inference using Fluidigm2PURC
4.1 Simulation setup for (a) tree and (b) network topologies. Internal branches are annotated with their lengths in coalescent units (CUs). The total tree height is 4.0 CUs.
4.2 Hypotheses of allopolyploid formation in Penstemon attenuatus
4.3 Phylogeny of P. subsect. Humiles and Proceri inferred by ASTRAL-III
4.4 Phylogeny of P. subsect. Humiles and Proceri inferred using qcf and QuartetMaxCut
4.5 Phylogeny of P. subsect. Humiles and Proceri inferred using RAxML
4.6 Best ML networks for clades A and B
A.1 Comparison of posterior mean versus mean read ratio estimates of allele frequencies for all simulation settings
A.2 Comparison of posterior mean versus mean read ratio estimates of allele frequencies for Solanum tuberosum
A.3 Density plot of the difference between the mean read ratio (simple) and posterior mean estimates in Solanum tuberosum
A.4 A close up comparison of the effect of coverage and the number of individuals sampled on estimation error for octoploids
B.1 RMSD values for inbreeding coefficient estimation with 25 individuals (all simulations)
B.2 RMSD values for inbreeding coefficient estimation with 50 individuals (all simulations)
B.3 RMSD values for inbreeding coefficient estimation with 100 individuals (all simulations)
B.4 RMSD values for autopolyploid genotype estimation with 25 individuals (all simulations)
B.5 RMSD values for autopolyploid genotype estimation with 50 individuals (all simulations)
B.6 RMSD values for autopolyploid genotype estimation with 100 individuals (all simulations)
B.7 RMSD values for the estimation of the allele frequency in subgenome two for the allopolyploid model
B.8 RMSD values for the estimation of the genotype in subgenome one for the allopolyploid model
B.9 RMSD values for the estimation of the genotype in subgenome two for the allopolyploid model
B.10 Distribution of the difference in allele frequency estimates for the Hardy-Weinberg model vs. GATK
B.11 Distribution of the genotypes estimated by the allopolyploid model for each possible value of the genotype estimated by GATK
D.1 Simulation results for tree topology (Figure 4.1a)
D.2 Simulation results for network topology (Figure 4.1b)
D.3 Phylogeny of P. subsect. Humiles and Proceri inferred by ASTRAL-III (with branch lengths)
D.4 Phylogeny of P. subsect. Humiles and Proceri inferred using RAxML (with branch lengths)
D.5 Networks inferred for clade A using SNaQ as implemented in the software PhyloNetworks
D.6 Networks inferred for clade B using SNaQ as implemented in the software PhyloNetworks
Chapter 1: Accounting for Genotype Uncertainty in the Estimation of Allele Frequencies in Autopolyploids
Publication Information
This chapter is formatted for this dissertation from the following publication:
Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. Accounting for Genotype Uncertainty
in the Estimation of Allele Frequencies in Autopolyploids. Molecular Ecology Resources,
16:742–754, 2016.
1.1 Abstract
Despite the increasing opportunity to collect large-scale data sets for population
genomic analyses, the use of high-throughput sequencing to study populations of
polyploids has seen little application. This is due in large part to problems associated with determining allele copy number in the genotypes of polyploid individuals (allelic
dosage uncertainty; ADU), which complicates the calculation of important quantities
such as allele frequencies. Here we describe a statistical model to estimate biallelic SNP
frequencies in a population of autopolyploids using high-throughput sequencing data
in the form of read counts. We bridge the gap from data collection (using restriction
enzyme based techniques [e.g., GBS, RADseq]) to allele frequency estimation in a
unified inferential framework using a hierarchical Bayesian model to sum over genotype
uncertainty. Simulated data sets were generated under various conditions for tetraploid,
hexaploid, and octoploid populations to evaluate the model’s performance and to help
guide the collection of empirical data. We also provide an implementation of our model
in the R package polyfreqs and demonstrate its use with two example analyses that
investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy.
Our simulations show that the number of individuals sampled from a population has
a greater impact on estimation error than sequencing coverage. The example analyses
also show that our model and software can be used to make inferences beyond the
estimation of allele frequencies for autopolyploids by providing assessments of model
adequacy and estimates of heterozygosity.
1.2 Introduction
Biologists have long been fascinated by the occurrence of whole genome duplication
(WGD) in natural populations and have recognized its role in the generation of
biodiversity (Clausen et al., 1940; Stebbins, 1950; Grant, 1971; Otto and Whitton,
2000). Though WGD is thought to have occurred at some point in nearly every major
group of eukaryotes, it is a particularly common phenomenon in plants and is regarded
by many to be an important factor in plant diversification (Wood et al., 2009; Soltis et al., 2009; Scarpino et al., 2014). The role of polyploidy in plant evolution was
originally considered by some to be a “dead-end” (Stebbins, 1950; Wagner, 1970; Soltis et al., 2014) but, since its first discovery in the early twentieth century, polyploidy
has been continually studied in nearly all areas of botany (Winge, 1917; Winkler,
1916; Clausen et al., 1945; Grant, 1971; Stebbins, 1950; Soltis et al., 2003, 2010; Soltis
and Soltis, 2009; Ramsey and Ramsey, 2014). Though fewer examples of WGD are
currently known for animal systems, groups such as amphibians, fish, and reptiles
all exhibit polyploidy (Allendorf and Thorgaard, 1984; Gregory and Mable, 2005).
Ancient genome duplications are also thought to have played an important role in
the evolution of both plants and animals, occurring in the lineages preceding the
seed plants, angiosperms, and vertebrates (Ohno, 1970; Otto and Whitton, 2000;
Furlong and Holland, 2001; Jiao et al., 2011). These ancient WGD events during
the early history of seed plants and angiosperms have been followed by several more
WGDs in all major plant groups (Cui et al., 2006; Scarpino et al., 2014; Cannon et al.,
2014). Recent experimental evidence has also demonstrated increased survivorship
and adaptability to foreign environments of polyploid taxa when compared with their
lower ploidy relatives (Ramsey, 2011; Selmecki et al., 2015).
Polyploids are generally divided into two types based on how they are formed: auto-
and allopolyploids. Autopolyploids form when a WGD event occurs within a single evolutionary
lineage and typically have polysomic inheritance. Allopolyploids are formed
by hybridization between two separately evolving lineages followed by WGD and are
thought to have mostly disomic inheritance. Multivalent chromosome pairing during
meiosis can occur in allopolyploids, however, resulting in mixed inheritance patterns
across loci in the genome (segmental allopolyploids; Stebbins, 1950). Autopolyploids
can also undergo double reduction, a product of multivalent chromosome pairing wherein segments from sister chromatids move together during meiosis, resulting
in allelic inheritance that breaks away from a strict pattern of polysomy (Haldane,
1930). Autopolyploidy was also thought to be far less common than allopolyploidy,
but recent studies have concluded that autopolyploidy occurs much more frequently
than originally proposed (Soltis et al., 2007; Parisod et al., 2010).
The theoretical treatment of population genetic models in polyploids has its origins
in the Modern Synthesis with Fisher, Haldane, and Wright each contributing to
the development of some of the earliest mathematical models for understanding the
genetic patterns of inheritance in polyploids (Haldane, 1930; Wright, 1938; Fisher,
1943). Early empirical work on polyploids that influenced Fisher, Haldane, and Wright
includes studies on Lythrum salicaria by N. Barlow (Barlow, 1913, 1923), Dahlia by W.
J. C. Lawrence (Lawrence, 1929), and Primula by H. J. Muller (Muller, 1914). The
foundation laid down by these early papers has led to the continuing development
of population genetic models for polyploids, including models for understanding the
rate of loss of genetic diversity and extensions of the coalescent in autotetraploids,
as well as modifications of the multispecies coalescent for the inference of species
networks containing allotetraploids (Moody et al., 1993; Arnold et al., 2012; Jones et al., 2013). Much of this progress was described in a review by Dufresne et al.
(2014), who outlined the current state of population genetics in polyploids regarding
both molecular techniques and statistical models. Not surprisingly, one of the most
promising developments for the future of population genetics in polyploids is the
advancement of sequencing technologies. A particularly common way of gathering
large data sets for genome-scale inferences is to use restriction-enzyme-based techniques
(e.g., RADseq, ddRAD, GBS, etc.), which we will refer to generally as RADseq (Miller et al., 2007; Baird et al., 2008; Peterson et al., 2012; Puritz et al., 2014). However,
despite its popularity for population genetic inferences at the diploid level, there are
many fewer examples of RADseq experiments conducted on polyploid taxa (but see
Ogden et al., 2013; Wang et al., 2013; Logan-Young et al., 2015).
Among the primary reasons for the dearth of RADseq applications in polyploids is
the issue of allelic dosage uncertainty (ADU), or the inability to fully determine the
genotype of a polyploid organism when it is partially heterozygous at a given locus.
This is the same problem that has been encountered by other codominant markers such
as microsatellites, which have been commonly used for population genetic analyses in
polyploids. One way of dealing with allelic dosage that has been used for multi-allelic
microsatellite markers has been to code alleles as either present or absent based on
electropherogram readings (allelic phenotypes) and to analyze the resulting dominant
data using a program such as polysat (Clark and Jasieniuk, 2011; Dufresne et al.,
2014). de Silva et al. (2005) developed a method for inferring allele frequencies using
observed allelic phenotype data and used an expectation-maximization algorithm to
deal with the incomplete genotype data resulting from ADU. Attempts to directly infer
the genotype of polyploid microsatellite loci have also been successfully completed in
some cases by using the relative electropherogram peak heights of the alleles in the
genotypes (Esselink et al., 2004). The estimation problem would be similar for biallelic
SNP data collected using RADseq, where a partially heterozygous polyploid will
have high-throughput sequencing reads containing both alleles. For a tetraploid, the
possible genotypes for a partial heterozygote (alleles A and B) would be AAAB, AABB,
and ABBB. For a hexaploid they are AAAAAB, AAAABB, AAABBB, AABBBB,
and ABBBBB. In general, the number of possible genotypes for a biallelic locus of
a partially heterozygous K-ploid (K = 3, 4, 5,...) is K − 1. A possible solution to
this problem for SNPs would be to try to use existing genotype callers and to rely on
the relative number of sequencing reads containing the two alleles (similar to what was done for microsatellites). However, this could lead to erroneous inferences when
genotypes are simply fixed at point estimates based on read proportions without
considering estimation error. Furthermore, when sequencing coverage is low, the
number of genotypes that will appear to be equally probable increases with ploidy,
making it difficult to distinguish among the possible partially heterozygous genotypes.
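The genotype counts above can be checked with a short enumeration. The following is a minimal Python sketch, not part of polyfreqs; the function name `partial_heterozygotes` and the allele labels A and B are chosen here purely for illustration.

```python
def partial_heterozygotes(ploidy, ref="A", alt="B"):
    """List the partially heterozygous genotypes at a biallelic locus.

    A partial heterozygote carries at least one copy of each allele, so the
    number of reference copies runs from ploidy - 1 down to 1.
    (Hypothetical helper for illustration only.)
    """
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2")
    return [ref * n_ref + alt * (ploidy - n_ref)
            for n_ref in range(ploidy - 1, 0, -1)]


print(partial_heterozygotes(4))       # ['AAAB', 'AABB', 'ABBB']
print(len(partial_heterozygotes(6)))  # 5, i.e. K - 1 for K = 6
```

For any ploidy K the list has K − 1 entries, which is why the number of candidate genotypes that must be distinguished grows with ploidy.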
In this paper we describe a model that aims to address the problems associated with ADU by treating genotypes as a latent variable in a hierarchical Bayesian
model and using high throughput sequencing read counts as data. In this way we preserve the uncertainty that is inherent in polyploid genotypes by inferring a
probability distribution across all possible values of the genotype, rather than treating
them as being directly observed. This approach has been used by Buerkle and
Gompert (2013) to deal with uncertainty in calling genotypes in diploids, and the work we present here builds on their earlier models. Our model assumes that
the ploidy level of the population is known and that the genotypes of individuals in
the population are drawn from a single underlying allele frequency for each locus.
These assumptions imply that alleles in the population are undergoing polysomic
inheritance without double reduction, which most closely adheres to the inheritance
patterns of an autopolyploid. We acknowledge that the model in its current form
is an oversimplification of biological reality and realize that it does not apply to a
large portion of polyploid taxa. Nevertheless, we believe that accounting for ADU
by modeling genotype uncertainty has the potential to be applied more broadly via modifications of the probability model used for the inheritance of alleles, which
could lead to more generalized population genetic models for polyploids (see the
Extensibility section of the Discussion).
1.3 Materials and Methods
Our goal is to estimate the frequency of a reference allele for each locus sampled
from a population of known ploidy (ψ), where the reference allele can be chosen
arbitrarily between the two alleles at a given biallelic SNP. To do this we extend the
population genomic models of Buerkle and Gompert (2013), which employ a Bayesian
framework to model high-throughput sequencing reads (T, R), genotypes (G), and
allele frequencies (p), to the case of arbitrary ploidy. The idea behind the model is to view the sequencing reads gathered for an individual as a random sample from the
unobserved genotype at each locus. Genotypes can then be treated as a parameter in
a probability model that governs how likely it is that we see a particular number of
sequencing reads carrying the reference allele. Similarly, we can treat genotypes as
a random sample from the underlying allele frequency in the population (assuming
Hardy-Weinberg equilibrium). For our model, a genotype is simply a count of the
number of reference alleles at a locus, which can range from 0 (a homozygote with no
reference alleles in the genotype) to ψ (a homozygote with only reference alleles in the
genotype). All whole numbers in between 0 and ψ represent partially heterozygous
genotypes. This hierarchical setup addresses the problems associated with ADU by
treating genotypes as a latent variable that can be integrated out using Markov chain
Monte Carlo (MCMC).
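The two sampling levels described above can be illustrated with a small forward simulation. This is a minimal sketch under simplifying assumptions that are ours, not the chapter's: sequencing error is ignored, every individual has the same read depth, and the function name `simulate_locus` and its arguments are hypothetical. Binomial draws are built from Bernoulli trials so that only the standard library is needed.

```python
import random


def simulate_locus(n_ind, ploidy, p, coverage, seed=None):
    """Simulate genotypes and reference read counts at one biallelic locus.

    Level 1: each genotype is the number of reference alleles among `ploidy`
             independent draws with success probability p (Hardy-Weinberg).
    Level 2: each read carries the reference allele with probability
             genotype / ploidy (no sequencing error in this sketch).
    """
    rng = random.Random(seed)

    def binom(n, prob):
        # Binomial draw as a sum of n Bernoulli trials.
        return sum(rng.random() < prob for _ in range(n))

    genotypes = [binom(ploidy, p) for _ in range(n_ind)]
    ref_reads = [binom(coverage, g / ploidy) for g in genotypes]
    return genotypes, ref_reads


# Ten tetraploids at a locus with reference allele frequency 0.3 and 8 reads each.
genotypes, ref_reads = simulate_locus(n_ind=10, ploidy=4, p=0.3, coverage=8, seed=1)
print(genotypes)
print(ref_reads)
```

In the inference problem only the read counts are observed; the genotypes in the first list are the latent quantities that the model integrates over.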
Symbol               Description
L                    The number of loci.
$\ell$               Index for loci ($\ell \in \{1,\ldots,L\}$).
N                    Total number of individuals sequenced.
i                    Index for individuals ($i \in \{1,\ldots,N\}$).
ψ                    The ploidy level of individuals in the population (e.g., tetraploid: ψ = 4).
$p_\ell$             Frequency of the reference allele at locus $\ell$. [p]
$g_{i\ell}$          The number of copies of the reference allele for individual i at locus $\ell$. [G]
$\tilde{g}_{i\ell}$  Simulated genotype for posterior predictive model checking.
$t_{i\ell}$          The total number of reads for individual i at locus $\ell$. [T]
$r_{i\ell}$          The number of reads with the reference allele for individual i at locus $\ell$. [R]
$\tilde{r}_{i\ell}$  Simulated reference read count for posterior predictive model checking. [$\tilde{R}$]
$\epsilon$           Sequencing error.
Table 1.1: Notation and symbols used in the description of the model for estimating allele frequencies in polyploids. Vector and matrix forms of the variables are also provided when appropriate.
1.3.1 Model Setup
Here we consider a sample of N individuals from a single population of ploidy level
ψ sequenced at L unlinked SNPs. The data for the model consist of two matrices
containing counts of high-throughput sequencing reads mapping to each locus for
each individual: R and T. The N × L matrix T contains the total number of
reads sampled at each locus for each individual. Similarly, R is an N × L matrix
containing the number of sampled reads with the reference allele at each locus for
each individual. Then for individual i at locus $\ell$, we model the number of sequencing
reads containing the reference allele ($r_{i\ell}$) as a Binomial random variable conditional
on the total number of sequencing reads ($t_{i\ell}$), the underlying genotype ($g_{i\ell}$), and a
constant level of sequencing error ($\epsilon$):
\[
P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon) = \binom{t_{i\ell}}{r_{i\ell}} g_\epsilon^{r_{i\ell}} (1 - g_\epsilon)^{t_{i\ell} - r_{i\ell}}. \tag{1.1}
\]
Here $g_\epsilon$ is the probability of observing a read containing the reference allele corrected
for sequencing error:
\[
g_\epsilon = \frac{g_{i\ell}}{\psi}(1 - \epsilon) + \left(1 - \frac{g_{i\ell}}{\psi}\right)\epsilon. \tag{1.2}
\]
The intuition behind including error is that we want to calculate the probability
that we observe a read containing the reference allele. There are two ways that
this can happen. (1) Reads are drawn from the reference allele(s) in the genotype
with probability $g_{i\ell}/\psi$ but are only observed as reference reads if they are not errors
(probability $1 - \epsilon$). (2) Similarly, reads from the non-reference allele(s) in the genotype
are drawn with probability $1 - g_{i\ell}/\psi$ but can be mistakenly read as coming from a
reference allele if an error occurs (probability $\epsilon$). The sum across these two possibilities
gives the overall probability of observing a read containing the reference allele. If we
also assume conditional independence of the sequencing reads given the genotypes,
the joint probability distribution for sequencing reads is given by
\[
P(R \mid T, G, \epsilon) = \prod_{\ell=1}^{L} \prod_{i=1}^{N} P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon). \tag{1.3}
\]
Since the $r_{i\ell}$'s are the data that we observe, the product of $P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon)$ across loci
and individuals will form the likelihood in the model.
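As an illustration, the read-level likelihood in Eqs. 1.1 and 1.2 can be sketched in a few lines of Python. This is a minimal sketch and not the polyfreqs implementation; the function names and the example read counts are hypothetical.

```python
from math import comb

def ref_read_prob(g, ploidy, eps):
    """Eq. 1.2: probability that one read shows the reference allele."""
    frac = g / ploidy
    return frac * (1.0 - eps) + (1.0 - frac) * eps

def read_likelihood(r, t, g, ploidy, eps):
    """Eq. 1.1: Binomial probability of r reference reads among t reads."""
    q = ref_read_prob(g, ploidy, eps)
    return comb(t, r) * q**r * (1.0 - q)**(t - r)

# Hypothetical tetraploid individual: 3 reference reads out of 10, 1% error.
likes = [read_likelihood(3, 10, g, ploidy=4, eps=0.01) for g in range(5)]
best_g = max(range(5), key=lambda g: likes[g])  # genotype with the highest likelihood
```

Evaluating the likelihood for every genotype from 0 to ψ, as in the last two lines, shows how a single set of read counts supports several genotypes to different degrees, which is the uncertainty that the rest of the model propagates.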
The next level in the hierarchy is the conditional prior for genotypes. We model
each gi` as a Binomial random variable conditional on the ploidy level of the population
and the frequency of the reference allele for locus $\ell$ ($p_\ell$):
\[
P(g_{i\ell} \mid \psi, p_\ell) = \binom{\psi}{g_{i\ell}} p_\ell^{g_{i\ell}} (1 - p_\ell)^{\psi - g_{i\ell}}.
\]
We also assume that the genotypes of the sampled individuals are conditionally
independent given the allele frequencies, which is equivalent to taking a random
sample from a population in Hardy-Weinberg equilibrium. Factoring the distribution
for genotypes and taking the product across loci and individuals gives us the joint
probability distribution of genotypes given the ploidy level of the population and the
vector of allele frequencies at each locus ($p = \{p_1, \ldots, p_L\}$):
\[
P(G \mid \psi, p) = \prod_{\ell=1}^{L} \prod_{i=1}^{N} P(g_{i\ell} \mid \psi, p_\ell). \tag{1.4}
\]
We choose here to ignore other factors that may be influencing the distribution of
genotypes such as double reduction. In general, double reduction will act to increase
homozygosity (Hardy, 2016). However, it is more prevalent for loci that are farther
away from the centromere, which makes the estimation of a global double reduction
parameter (typically denoted α) inappropriate for the thousands of loci gathered from
across the genome using techniques such as RADseq. It might be possible to estimate
a per locus rate of double reduction ($\alpha_\ell$) but this would add an additional parameter
that would need to be estimated for each locus, perhaps unnecessarily if the majority
end up being equal, or close, to 0.
The final level of the model is the prior distribution on allele frequencies. Assuming a priori independence across loci, we use a Beta distribution with parameters α and β
both equal to 1 as our prior distribution for each locus. A Beta(1,1) is equivalent to a
Uniform distribution over the interval (0, 1), making our choice of prior uninformative.
The joint posterior distribution of allele frequencies and genotypes is then equal to
the product across all loci and all individuals of the likelihood, the conditional prior
on genotypes and the prior distribution on allele frequencies up to a constant of
proportionality
\[
P(p, G \mid T, R, \epsilon) \propto P(R \mid T, G, \epsilon)\, P(G \mid \psi, p)\, P(p)
= \prod_{\ell=1}^{L} \prod_{i=1}^{N} P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon)\, P(g_{i\ell} \mid \psi, p_\ell)\, P(p_\ell). \tag{1.5}
\]
The marginal posterior distribution for allele frequencies can be obtained by summing
over genotypes
\[
P(p \mid T, R, \epsilon) \propto \sum_{G} P(p, G \mid T, R, \epsilon). \tag{1.6}
\]
It would also be possible to examine the marginal posterior distribution of genotypes
but here we will focus primarily on allele frequencies.
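For a single locus, the sum over genotypes in Eq. 1.6 can also be evaluated directly, because individuals are conditionally independent given the allele frequency and each genotype can therefore be summed out separately. The sketch below does this on a grid of allele frequency values. It is an illustration under the Beta(1,1) prior only, not the MCMC approach used in this chapter, and the function names and read counts are hypothetical.

```python
from math import comb

def binom_pmf(k, n, prob):
    return comb(n, k) * prob**k * (1.0 - prob)**(n - k)

def marginal_unnorm(p, r, t, ploidy, eps):
    """Unnormalized marginal posterior of p at one locus under a Beta(1,1) prior.

    Each individual's genotype is summed out separately, which is valid
    because individuals are conditionally independent given p.
    """
    total = 1.0
    for r_i, t_i in zip(r, t):
        per_ind = 0.0
        for g in range(ploidy + 1):
            q = (g / ploidy) * (1.0 - eps) + (1.0 - g / ploidy) * eps
            per_ind += binom_pmf(r_i, t_i, q) * binom_pmf(g, ploidy, p)
        total *= per_ind
    return total

# Hypothetical read counts for three tetraploid individuals.
r, t = [5, 5, 5], [10, 10, 10]
grid = [i / 100 for i in range(1, 100)]
dens = [marginal_unnorm(p, r, t, ploidy=4, eps=0.01) for p in grid]
post = [d / sum(dens) for d in dens]          # normalize over the grid
p_mode = grid[post.index(max(post))]
```

A grid evaluation like this is only practical for one locus at a time, but it can serve as a check on the output of a sampler for small data sets.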
1.3.2 Full Conditionals and MCMC Using Gibbs Sampling
We estimate the joint posterior distribution for allele frequencies and genotypes in
Eq. 1.5 using MCMC. This is done using Gibbs sampling of the states (p, G) in a
Markov chain by alternating samples from the full conditional distributions of p and
G. Given the setup for our model using Binomial and Beta distributions (which form
a conjugate family), analytical solutions for these distributions can be readily acquired
(Gelman et al., 2014). The full conditional distribution for allele frequencies is Beta
distributed and is given by Eq. 1.7 below:
\[
p_\ell \mid g_{i\ell}, r_{i\ell}, \epsilon \sim \mathrm{Beta}\left(\alpha = \sum_{i=1}^{N} g_{i\ell} + 1,\ \beta = \sum_{i=1}^{N} (\psi - g_{i\ell}) + 1\right), \quad \text{for } \ell = 1, \ldots, L. \tag{1.7}
\]
This full conditional distribution for $p_\ell$ has a natural interpretation as it is roughly
centered at the number of sampled alleles carrying the reference allele divided by
the total number of alleles sampled. The “+1” comes from the prior distribution and
will not have a strong influence on the posterior when the sample size is large.
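The update in Eq. 1.7 amounts to counting reference and non-reference alleles in the current genotypes and drawing from the corresponding Beta distribution. A minimal Python sketch, with hypothetical function names and genotype values, is:

```python
import random

def beta_full_conditional(genotypes, ploidy):
    """Shape parameters of Eq. 1.7 for one locus under a Beta(1,1) prior."""
    alpha = sum(genotypes) + 1
    beta = sum(ploidy - g for g in genotypes) + 1
    return alpha, beta

def sample_allele_freq(genotypes, ploidy, rng):
    alpha, beta = beta_full_conditional(genotypes, ploidy)
    return rng.betavariate(alpha, beta)

# Hypothetical current genotypes for four tetraploid individuals.
genotypes = [1, 2, 0, 1]
alpha, beta = beta_full_conditional(genotypes, ploidy=4)
cond_mean = alpha / (alpha + beta)   # (4 + 1) / (16 + 2)
draw = sample_allele_freq(genotypes, 4, random.Random(1))
```

With 4 reference alleles among 16 sampled alleles, the mean of the full conditional is 5/18, slightly above the raw proportion 4/16 because of the “+1” terms contributed by the prior.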
The full conditional distribution for genotypes is a discrete categorical distribution
over the possible values for the genotypes (0, ..., ψ). The distribution for individual
i at locus $\ell$ is
\[
P(g_{i\ell} \mid g_{(-i)\ell}, p_\ell, r_{i\ell}, \epsilon) = \binom{t_{i\ell}}{r_{i\ell}} g_\epsilon^{r_{i\ell}} (1 - g_\epsilon)^{t_{i\ell} - r_{i\ell}} \times \binom{\psi}{g_{i\ell}} p_\ell^{g_{i\ell}} (1 - p_\ell)^{\psi - g_{i\ell}}, \tag{1.8}
\]
where $g_{(-i)\ell}$ is the value of the genotypes for all sampled individuals excluding individual
i and $g_\epsilon$ is the same as Eq. 1.2. The full conditional distribution for genotypes can
be seen as the product of two quantities: (1) the probability of each of the possible
genotypes based on the observed reference reads and (2) the probability of drawing
each genotype given the allele frequency for that locus in the population.
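To draw from this categorical distribution, the product in Eq. 1.8 is evaluated for each possible genotype and the results are rescaled to sum to one. The following Python sketch illustrates this for one individual at one locus; the function name and input values are hypothetical, and it is not taken from the polyfreqs source.

```python
from math import comb

def genotype_full_conditional(r, t, p, ploidy, eps):
    """Normalized probabilities of genotypes 0..ploidy for one individual (Eq. 1.8)."""
    weights = []
    for g in range(ploidy + 1):
        q = (g / ploidy) * (1.0 - eps) + (1.0 - g / ploidy) * eps
        read_term = q**r * (1.0 - q)**(t - r)   # C(t, r) is constant in g and cancels
        prior_term = comb(ploidy, g) * p**g * (1.0 - p)**(ploidy - g)
        weights.append(read_term * prior_term)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical individual with no reference reads among 20 total reads.
probs = genotype_full_conditional(r=0, t=20, p=0.5, ploidy=4, eps=0.01)
```

Because the read term multiplies many probabilities together, it can underflow to zero in floating point arithmetic when read counts are very large; working with logarithms is one common way to avoid this.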
We begin our Gibbs sampling algorithm in a random position in parameter space
through the use of uniform probability distributions. The genotype matrix is initialized with random draws from a Discrete Uniform distribution ranging from 0 to ψ and the
initial allele frequencies are drawn from a Uniform distribution on the interval [0,1].
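Putting the two full conditionals together, a toy version of the sampler for a single locus might look like the sketch below. It is a simplified illustration rather than the polyfreqs code: the function name, default settings, and read counts are hypothetical, and the run length is kept short so that the example finishes quickly.

```python
import random
from math import comb

def gibbs_single_locus(r, t, ploidy, eps=0.01, n_iter=2000, thin=10, seed=1):
    """Toy Gibbs sampler for one locus; returns thinned (p, genotypes) states."""
    rng = random.Random(seed)
    n = len(r)
    states = list(range(ploidy + 1))
    # Random starting point: Discrete Uniform genotypes, Uniform allele frequency.
    g = [rng.randint(0, ploidy) for _ in range(n)]
    p = rng.random()
    # Error-corrected reference read probability for each genotype (Eq. 1.2).
    q = [(k / ploidy) * (1 - eps) + (1 - k / ploidy) * eps for k in states]
    samples = []
    for it in range(1, n_iter + 1):
        # Genotypes: categorical full conditional (Eq. 1.8), one individual at a time.
        for i in range(n):
            w = [q[k] ** r[i] * (1 - q[k]) ** (t[i] - r[i])
                 * comb(ploidy, k) * p ** k * (1 - p) ** (ploidy - k)
                 for k in states]
            g[i] = rng.choices(states, weights=w)[0]
        # Allele frequency: Beta full conditional (Eq. 1.7).
        p = rng.betavariate(sum(g) + 1, n * ploidy - sum(g) + 1)
        if it % thin == 0:
            samples.append((p, list(g)))
    return samples

# Hypothetical read counts for six tetraploid individuals.
r = [3, 5, 0, 2, 4, 1]
t = [10, 10, 10, 10, 10, 10]
samples = gibbs_single_locus(r, t, ploidy=4)
kept = samples[len(samples) // 4:]            # discard the first 25% as burn-in
post_mean = sum(p for p, _ in kept) / len(kept)
```

Since loci are treated as independent, the same loop can be applied locus by locus.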
1.3.3 Simulation Study
Simulations were performed to assess error rates in allele frequency estimation for
tetraploid, hexaploid, and octoploid populations (ψ = 4, 6, and 8, respectively). Data were generated under the model by sampling genotypes from a Binomial distribution
conditional on a fixed, known allele frequency ($p_\ell$ = 0.01, 0.05, 0.1, 0.2, 0.4). Total
read counts were simulated for a single locus using a Poisson distribution with mean
coverage equal to 5, 10, 20, 50 or 100 reads per individual. We then sampled the
number of sequencing reads containing the reference allele from a Binomial distribution
conditional on the number of total reads, the genotype, and sequencing error (Eq. 1.1;
$\epsilon$ fixed to 0.01). Finally, we varied the number of individuals sampled per population
(N = 5, 10, 20, 30) and ran all possible combinations of the simulation settings. Our
choice for the number of individuals to simulate was intended to reflect sampling within a single population/locality and not that of an entire population genetics study.
Furthermore, RAD sequencing is used at various taxonomic levels from population
genetics to phylogenetics (e.g., Rheindt et al., 2014; Eaton et al., 2015), and we wanted
our simulations to be informative across these applications. Each combination of
sequencing coverage, individuals sampled, and allele frequency was analyzed using
100 replicates for tetraploid, hexaploid, and octoploid populations for a total of
30,000 simulation runs. MCMC analyses using Gibbs sampling were run for 100,000
generations with parameter values stored every 100th generation. The first 25% of
the sample was discarded as burn-in, resulting in 750 posterior samples for each
replicate. Convergence on the stationary distribution, $P(p, G \mid T, R, \epsilon)$, was assessed
by examining trace plots for a subset of runs for each combination of settings and
ensuring that the effective sample sizes (ESS) were greater than 200. Deviations from
the known underlying allele frequency used to simulate each data set were assessed by
taking the posterior mean of each replicate and calculating the root mean squared
error (RMSE) based on the true underlying value. We also compared the posterior
mean as an estimate of the allele frequency at a locus to a simpler estimate
calculated directly from the read counts (mean read ratio): $\frac{1}{N}\sum_{i} \frac{r_{i\ell}}{t_{i\ell}}$.
Comparisons between estimates were again made using the RMSE.
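The data-generating steps and the mean read ratio comparison described above can be sketched as follows. The original simulations were run in R; this Python version is only an illustration, with hypothetical function names, a simple Poisson generator chosen to keep the example dependency-free, and one arbitrary combination of settings.

```python
import random
from math import exp, sqrt

def rpois(lam, rng):
    """Poisson draw via Knuth's multiplication method (adequate for means up to ~100)."""
    limit, k, prod = exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rbinom(n, prob, rng):
    """Binomial draw as a sum of n Bernoulli trials."""
    return sum(rng.random() < prob for _ in range(n))

def simulate_locus(n_ind, ploidy, p, coverage, eps, rng):
    """Return (reference reads, total reads) for each simulated individual."""
    data = []
    for _ in range(n_ind):
        g = rbinom(ploidy, p, rng)                       # genotype given p
        t = rpois(coverage, rng)                         # total read count
        q = (g / ploidy) * (1 - eps) + (1 - g / ploidy) * eps
        data.append((rbinom(t, q, rng), t))              # reference reads (Eq. 1.1)
    return data

def mean_read_ratio(data):
    # Assumption: individuals with zero reads are skipped to avoid dividing by zero.
    ratios = [r / t for r, t in data if t > 0]
    return sum(ratios) / len(ratios)

def rmse(estimates, truth):
    return sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

# 100 replicates: 20 tetraploids, true allele frequency 0.2, mean coverage 10x.
rng = random.Random(42)
estimates = [mean_read_ratio(simulate_locus(20, 4, 0.2, 10, 0.01, rng))
             for _ in range(100)]
error = rmse(estimates, truth=0.2)
```

The same `rmse` helper could be applied to posterior means from a sampler to compare the two estimators on identical simulated data.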
All simulations were performed using the R statistical programming language (R
Core Team, 2014) on the Oakley cluster at the Ohio Supercomputer Center
(https://osc.edu). Figures were generated using the R packages ggplot2 (Wickham, 2009)
and reshape (Wickham, 2007), with additional figure manipulation completed using
Inkscape (https://inkscape.org). MCMC diagnostics were done using the coda
package (Plummer et al., 2006). All scripts are available on GitHub
(https://github.com/pblischak/polyfreqs-ms-data) in the ‘code/’ folder and all simulated data
sets are in the ‘raw_data/’ folder.
1.3.4 Example Analyses of Autotetraploid Potato (Solanum tuberosum)
To further evaluate the model and to demonstrate its use we present an example analysis
using an empirical data set collected for autotetraploid potato (Solanum tuberosum)
using the Illumina GoldenGate platform (Anithakumari et al., 2010; Voorrips et al.,
2011). Though these data are not the typical reads returned by RADseq experiments,
they still represent the same type of binary response data that our model uses to get a
probability distribution for biallelic SNP genotypes. A detailed walkthrough with the
code used for each step is provided in Appendix A. The data set and output are also
available on GitHub (https://github.com/pblischak/polyfreqs-ms-data) in the
‘example/’ folder.
Calculating Expected and Observed Heterozygosity
One advantage of using a Bayesian framework for our model is that we can approximate
a posterior distribution for any quantity that is a functional transformation of the
parameters that we are estimating without doing any additional MCMC simulation
(Gelman et al., 2014). Two such quantities that are often used in population genetics
are the observed and expected heterozygosity, which are in turn used for calculating
the various fixation indices ($F_{IS}$, $F_{IT}$, $F_{ST}$) introduced by Wright (1951). To analyze
levels of heterozygosity in this way, we used the estimators of Hardy (2016) to calculate
the per locus observed ($H_o$) and expected ($H_e$) heterozygosity for each stored sample
of the joint posterior distribution in Eq. 1.5. This procedure is especially useful
because it estimates heterozygosity while taking into account ADU by utilizing the
marginal posterior distribution of genotypes. Given a total of M posterior samples of
genotypes and allele frequencies, we calculate the mth (m = 1,...,M) estimate of the
observed heterozygosity using Eq. 1.9 [numerator of Eq. 7 in Hardy (2016)]:
\[
H_o^{[m]} = \frac{1}{N} \sum_{i} h_i^{[m]} = \frac{1}{N} \sum_{i} \frac{g_{i\ell}^{[m]} \left(\psi - g_{i\ell}^{[m]}\right)}{\binom{\psi}{2}}. \tag{1.9}
\]
Similarly, the mth estimate of the expected heterozygosity is calculated using Eq. 1.10
[denominator of Eq. 8 in Hardy (2016)]:
\[
H_e^{[m]} = \frac{N}{N - 1} \left[ 1 - \left(p_\ell^{[m]}\right)^2 - \left(1 - p_\ell^{[m]}\right)^2 - \frac{\psi - 1}{\psi N^2} \sum_{i} h_i^{[m]} \right]. \tag{1.10}
\]
The posterior distribution of a multi-locus estimate of heterozygosity can then be
approximated by taking the average across loci for each of the per locus posterior
samples.
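For one stored posterior sample at one locus, Eqs. 1.9 and 1.10 can be computed as in the following Python sketch. The function names are hypothetical and are not the polyfreqs functions; the genotypes and allele frequency are made-up values chosen so that the arithmetic is easy to follow.

```python
from math import comb

def gametic_het(g, ploidy):
    """h_i in Eq. 1.9: chance two alleles drawn without replacement differ."""
    return g * (ploidy - g) / comb(ploidy, 2)

def observed_het(genotypes, ploidy):
    """Eq. 1.9 for one posterior sample at one locus."""
    return sum(gametic_het(g, ploidy) for g in genotypes) / len(genotypes)

def expected_het(p, genotypes, ploidy):
    """Eq. 1.10 for one posterior sample at one locus."""
    n = len(genotypes)
    h_sum = sum(gametic_het(g, ploidy) for g in genotypes)
    inner = 1 - p**2 - (1 - p)**2 - (ploidy - 1) / (ploidy * n**2) * h_sum
    return n / (n - 1) * inner

# One hypothetical posterior sample: four tetraploids, all with genotype 2.
h_o = observed_het([2, 2, 2, 2], ploidy=4)
h_e = expected_het(0.5, [2, 2, 2, 2], ploidy=4)
```

In this example every individual has two copies of each allele, so the observed heterozygosity is 4/6 for each individual, which exceeds the expected value of 0.5 at an allele frequency of 0.5. Repeating the calculation for each stored sample gives the posterior distribution of each quantity.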
To evaluate levels of heterozygosity in autotetraploid potato, we obtained biallelic
count data for 224 accessions collected at 384 loci using the Illumina GoldenGate
platform from the R package fitTetra (Voorrips et al., 2011), which provides the
data set as part of the package. We chose the ‘X’ reading to be the count data for
the reference allele and added the ‘X’ and ‘Y’ readings together to get the total read
counts (‘X’ and ‘Y’ represent the counts of the two alternative alleles). Initial attempts
to analyze the data set using our Gibbs sampling algorithm were unsuccessful due to
arithmetic underflow. This was due to the fact that the counts/intensities returned by
the Illumina GoldenGate platform are on a different scale (∼10,000-20,000+) than
the read counts that would be expected from a RADseq experiment. To alleviate this
problem, we rescaled the data set while preserving the relative dosage information by
dividing the GoldenGate count readings by 100 and rounding to the nearest whole
number. We then analyzed the rescaled count data using 100,000 MCMC generations,
sampling every 100 generations and using the stored samples of the allele frequencies
and genotypes to calculate the observed and expected heterozygosity for a total of
1,000 posterior samples of the per locus observed and expected heterozygosity. We also
compared post burn-in (25%) allele frequency estimates based on the posterior mean
to the simple allele frequency estimate based directly on read counts used previously
(mean read ratio). Posterior distributions for multi-locus estimates of observed and
expected heterozygosity were obtained by taking the average across loci for each
posterior sample of the per locus estimates using a burn-in of 25%.
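The rescaling step can be illustrated with a short Python sketch. It assumes that the two readings are rescaled separately and then summed to give the total, which guarantees that the reference count never exceeds the total; the function name and the example intensities are hypothetical.

```python
def rescale_counts(x, y, scale=100):
    """Return (reference count, total count) after rescaling two intensity readings."""
    ref = round(x / scale)   # 'X' reading treated as the reference allele
    alt = round(y / scale)   # 'Y' reading treated as the alternative allele
    return ref, ref + alt

# Hypothetical GoldenGate readings for one accession at one locus.
ref, total = rescale_counts(12345, 6789)
```

Dividing both readings by the same constant keeps their ratio, and therefore the dosage signal, approximately unchanged while bringing the counts onto a scale closer to that of sequencing reads.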
Evaluating Model Adequacy
As noted earlier, the probability model that we use for the inheritance of alleles is one
of polysomy without double reduction. In some cases, this model may be inappropriate.
Therefore, it can be informative to check for loci that do not follow the model that we assume. Below we describe a procedure for rejecting our model of inheritance
on a per locus basis using comparisons with the posterior predictive distribution of
sequencing reads. Model checking is an important part of making statistical inferences
and can play a role in understanding when a model adequately describes the data
being analyzed. In the case of our model, it can serve as a basis for understanding
the inheritance patterns of the organism being studied by determining which loci
adhere to a simple pattern of polysomic inheritance. Other sources of disequilibrium
that could indicate poor model fit include inbreeding, null alleles, and allele drop out
(sensu Arnold et al., 2013), making this posterior predictive model check more broadly
applicable for RADseq data.
Given M posterior samples for the allele frequencies at locus $\ell$, $\{p_\ell^{[1]}, p_\ell^{[2]}, \ldots, p_\ell^{[M]}\}$, we simulate new values for the genotypes ($\tilde{g}_{i\ell}$) and reference read counts ($\tilde{r}_{i\ell}$) for all
individuals and use the ratio of simulated reference read counts to observed total read counts ($\tilde{r}_{i\ell}/t_{i\ell}$) as a summary statistic for comparing the observed read count ratios to the distribution of the predicted read count ratios. The use of the likelihood (or similar
quantities) as a summary statistic has been a common practice in posterior predictive
comparisons of nucleotide substitution models, and more recently for comparative
phylogenetics (Ripplinger and Sullivan, 2010; Reid et al., 2014; Pennell et al., 2015).
We use the ratio of reference to total read counts here because it is the maximum
likelihood estimate of the probability of success for a Binomial random variable and
because it is a simple quantity to calculate. The use of other summary statistics, or a
combination of multiple summary statistics, would also be possible. The procedure
for our posterior predictive model check is as follows:
1. For locus $\ell$ = 1,...,L:
   1.1. For posterior sample m = 1,...,M:
      1.1.1. Simulate new genotype values $\tilde{g}_{i\ell}^{[m]}$ for all individuals (i = 1,...,N)
             by drawing from a Binomial($\psi$, $p_\ell^{[m]}$).
      1.1.2. Simulate new reference read counts $\tilde{r}_{i\ell}^{[m]}$ from each new genotype for
             all individuals by drawing from Eq. 1.1.
      1.1.3. Calculate the reference read ratio for the simulated data for sample m
             and sum across individuals: $\tilde{S}_\ell^{[m]} = \sum_{i=1}^{N} \tilde{r}_{i\ell}^{[m]} / t_{i\ell}$.
      1.1.4. Calculate the reference read ratio for the observed data and sum across
             individuals: $S_\ell = \sum_{i=1}^{N} r_{i\ell} / t_{i\ell}$.
   1.2. Calculate the difference between the observed reference read ratio and the
        M simulated reference read ratios: $\{S_\ell - \tilde{S}_\ell^{[1]}, \ldots, S_\ell - \tilde{S}_\ell^{[M]}\}$.
2. Determine if the 95% highest posterior density (HPD) interval of the distribution
   of re-centered reference read ratios contains 0.
When the distribution of the differences in ratios between the observed and
simulated data sets does not contain 0 in the 95% HPD interval, it provides evidence
that the locus being examined does not follow a pattern of strict polysomic inheritance.
A similar approach could be used on an individual basis by comparing the observed
ratio of reference reads to the predicted ratios for each individual at each locus. We
used this posterior predictive model checking procedure to assess model adequacy in
the potato data set using the posterior distribution of allele frequencies estimated in
the previous section with 25% of the samples discarded as burn-in.
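The procedure above can be sketched for a single locus as follows. This is an illustrative Python version with hypothetical function names; the HPD interval is approximated as the shortest interval containing 95% of the sampled differences, and the posterior samples of the allele frequency are replaced by fixed placeholder values rather than real MCMC output.

```python
import random
from math import ceil

def rbinom(n, prob, rng):
    """Binomial draw as a sum of n Bernoulli trials."""
    return sum(rng.random() < prob for _ in range(n))

def hpd_interval(values, mass=0.95):
    """Shortest interval holding `mass` of the sampled values."""
    xs = sorted(values)
    n_in = min(len(xs), ceil(mass * len(xs)))
    starts = range(len(xs) - n_in + 1)
    best = min(starts, key=lambda i: xs[i + n_in - 1] - xs[i])
    return xs[best], xs[best + n_in - 1]

def locus_fits_model(p_samples, r, t, ploidy, eps, rng):
    """Steps 1.1 to 2 for one locus; True if the 95% HPD of the differences contains 0."""
    s_obs = sum(r_i / t_i for r_i, t_i in zip(r, t))      # step 1.1.4
    diffs = []
    for p in p_samples:
        s_sim = 0.0
        for t_i in t:
            g = rbinom(ploidy, p, rng)                    # step 1.1.1
            q = (g / ploidy) * (1 - eps) + (1 - g / ploidy) * eps
            s_sim += rbinom(t_i, q, rng) / t_i            # steps 1.1.2 and 1.1.3
        diffs.append(s_obs - s_sim)                       # step 1.2
    low, high = hpd_interval(diffs)
    return low <= 0.0 <= high                             # step 2

# Placeholder "posterior samples": 200 copies of a fixed allele frequency.
rng = random.Random(7)
t = [20] * 10
ok = locus_fits_model([0.5] * 200, [10] * 10, t, ploidy=4, eps=0.01, rng=rng)
bad = locus_fits_model([0.1] * 200, [20] * 10, t, ploidy=4, eps=0.01, rng=rng)
```

In the first call the observed read ratios match what the model predicts, so the interval covers 0. In the second call every observed read carries the reference allele even though the allele frequency is 0.1, so the interval lies entirely above 0 and the locus would be flagged.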
1.4 Results
Our Gibbs sampling algorithm was able to accurately estimate allele frequencies for a
number of simulation settings while simultaneously allowing for genotype uncertainty.
There were no indications of a lack of convergence (ESS values > 200) for any of the
simulation replicates and all trace plots examined also indicated that the Markov
chain had reached stationarity. Running the MCMC for 100,000 generations and
sampling every 100th generation appeared to be suitable for our analyses and we
recommend it as a starting point for running most data sets. Reducing the number of
generations and sampling more frequently (e.g., 50,000 generations sampled every 50
generations) could be a potential work around for larger data sets. When doing test
runs we went as low as 20,000 generations sampled every 20th generation, which still
passed our diagnostic tests for convergence. This is likely because the parameter space
of our model is not overly difficult to navigate so stationarity is reached rather quickly.
Ultimately, the deciding factor on how long to run the analysis and how frequently to
sample the chain will come down to assessing convergence.
1.4.1 Simulation Study
Increasing the number of individuals sampled had the largest effect on the accuracy
of allele frequency estimation (Figure 1.1). Since allele frequencies are population
parameters, it is not surprising that sampling more individuals from the population
leads to better estimates. This appears to be the case even when sequencing coverage
is quite low (5x, 10x), which corroborates the observations made by Buerkle and
Gompert (2013). This is not to say, however, that sequencing coverage has no effect
on the posterior distribution of allele frequencies. Lower sequencing coverage affects
the posterior distribution by increasing the posterior standard deviation (Figure 1.2).
An interesting pattern that emerged during the simulation study is the observation
that the allele frequencies closer to 0.5 tend to have higher error rates, which is to
be expected given that the variance of a Binomial random variable is highest when
the probability of success is 0.5. We also observed small differences in the RMSE
between ploidy levels, with estimates increasing in accuracy with increasing ploidy.
Comparisons between the posterior mean and mean read ratio estimates of allele
frequencies (Figure A.1) show that the estimate based on read ratios has a lower
RMSE than the posterior mean when the true allele frequency is low ($p_\ell$ = 0.01, 0.05)
but has higher error rates than the posterior mean for allele frequencies closer to 0.5.
When sequencing coverage is greater than 10x and the number of individuals sampled
is greater than 20, the two estimates are almost indistinguishable.
Figure 1.1: Error in allele frequency estimation as measured by the RMSE of posterior means. Columns represent the different allele frequencies used to simulate read data (0.01, 0.05, 0.1, 0.2, 0.4) and rows represent the number of individuals sampled from the population (5, 10, 20, 30). Each individual plot shows the RMSE of the estimates for each ploidy level (tetra, hex, octo) across the different levels of coverage (5x, 10x, 20x, 50x, 100x). The best scenario is in the bottom left with 30 individuals sampled and an allele frequency of 0.01. The worst scenario is in the upper right corner with 5 individuals sampled and an allele frequency of 0.4. Looking across rows shows that error increases as allele frequencies get closer to 0.5. Looking up and down columns shows that error increases as the number of individuals decreases. Within each plot, increasing sequence coverage does not have as large of an effect on error, and differences in ploidy show that error decreases as ploidy increases.
Figure 1.2: The posterior standard deviation for allele frequencies decreases when compared across levels of sequencing coverage. This plot provides a comparison of the distribution of the posterior standard deviations of the 100 replicates performed for each level of sequencing coverage (5x, 10x, 20x, 50x, 100x) for the hexaploid simulation with 30 individuals sampled from the population and an allele frequency of 0.2.
Figure 1.3: Posterior distributions of the multilocus estimates of expected and observed heterozygosity in Solanum tuberosum. The observed heterozygosity is higher than the expected, consistent with a pattern of excess outbreeding.
1.4.2 Example Analyses
Our analyses of Solanum tuberosum tetraploids showed levels of heterozygosity consistent with a pattern of excess outbreeding ($H_o > H_e$). In fact, the posterior distributions of the multi-locus estimates of observed and expected heterozygosity do not overlap at all (Figure 1.3). The assessment of model adequacy also showed that 49 out of the 384 loci (∼13%) were a poor fit to the model of polysomic inheritance that we assume. The allele frequency estimates using the posterior mean and the mean read ratio provided similar estimates and were comparable for most loci. For loci in which the frequency of the reference allele is very low, the read ratio estimate tends to be higher than the posterior mean. However, the overall pattern does not indicate over- or underestimation for most allele frequencies (Figure A.2). When we took the difference between the estimates at each locus, the distribution was centered near 0 (Figure A.3).
1.5 Discussion
The inference of population genetic parameters and the demographic history of non-
model polyploid organisms has consistently lagged behind that of diploids. The
difficulties associated with these inferences present themselves at two levels. The first
of these is the widely known inability to determine the genotypes of polyploids due
to ADU. Even though there have been theoretical developments in the description
of models for polyploid taxa as early as the 1930s, a large portion of this population
genetic theory relies on knowledge about individuals’ genotypes (e.g., Haldane, 1930;
Wright, 1938). The second complicating factor is the complexity of inheritance
patterns and changes in mating systems that often accompany WGD events. Polyploid
organisms can sometimes mate by both outcrossing and selfing, and can display mixed
inheritance patterns at different loci in the genome (Dufresne et al., 2014). If genotypes were known, then it might be easier to develop and test models for dealing with and
inferring rates of selfing versus outcrossing, as well as understanding inheritance
patterns across the genome. However, ADU only compounds the problems associated with these inferences, making the development and application of appropriate models
far more difficult (but see list of software in Dufresne et al., 2014). The model we have
presented here deals with the first of these two issues by not treating genotypes as
observed quantities. Almost all other methods of genotype estimation for polyploids
treat the genotype as the primary parameter of interest. Our model is different in
that we still use the read counts generated by high-throughput sequencing platforms
as our observed data but instead integrate across genotype uncertainty when inferring
other parameters, thus bypassing the problems caused by ADU.
Despite our focus on bypassing ADU, an important consideration for the model we
present here is that, because it approximates the joint posterior distribution of allele
frequencies and genotypes, it would also be possible to use the marginal posterior
distribution of genotypes to make inferences using existing methods. This could be
done using the posterior mode as a maximum a posteriori (MAP) estimate of the
genotype for downstream analyses, followed by analyzing the samples taken from the
marginal posterior distribution of genotypes. The resulting set of estimates would not
constitute a “true” posterior distribution of downstream parameters but would allow
researchers to interpret their results based on the MAP estimate of the genotypes while still getting a sense for the amount of variation in their estimates. Using the
marginal posterior distribution of genotypes in this way could technically be applied
to any type of polyploid, but is only really appropriate for autopolyploids due to the
model of inheritance that is used. Other methods for estimating SNP genotypes from
high-throughput sequencing data include the program SuperMASSA, which models
the relative intensity of the two alternative alleles using Normal densities (Serang et al., 2012).
A second important factor for using our model is that, although estimates of allele
frequencies can be accurate when sequencing coverage is low and sample sizes are
large (see Figure A.4 for a direct comparison between sample size and coverage), the
resulting distribution for genotypes is likely going to be quite diffuse. For analyses that
treat genotypes as a nuisance parameter, this is not an issue since we can integrate
across genotype uncertainty. However, if the genotype is of primary interest, then the
experimental design of the study will need to change to acquire higher coverage at
each locus for more accurate genotype estimation. Therefore, the decision between
sequencing more individuals with lower average coverage versus sequencing fewer
individuals with higher average coverage depends primarily on whether the genotypes will be used or not.
Extensibility
The modular nature of our hierarchical model can allow for the addition and
modification of levels in the hierarchy. One of the simplest extensions to the model that can
build directly on the current setup would be to consider loci with more than two alleles.
This can be done using Multinomial distributions for sequencing reads and genotypes,
and a Dirichlet prior on allele frequencies (the Multinomial and Dirichlet distributions
form a conjugate family; Gelman et al., 2014). We could also model populations
of mixed ploidy by using a vector of individually assigned ploidy levels instead of
assuming a single value for the whole population ($\psi = \{\psi_1, \ldots, \psi_N\}$). However, this would assume random mating among ploidy levels.
Double Reduction
The inclusion of double reduction into the model is a difficult consideration for genome wide data collected using high-throughput sequencing platforms. The number of
parameters estimated by our model is L × (N + 1) and including double reduction would add an additional L parameters, bringing the total to L × (N + 2). Though the
addition of these parameters would not prohibit an analysis using Gibbs sampling, we chose to implement the simpler equilibrium model. We hope to include double
reduction in future models but feel that our posterior predictive model checking
procedure will prove sufficient for identifying loci in disequilibrium with our current
implementation. Another concern that we had regarding double reduction is that it
can be confounded with the overall signal of inbreeding, making it especially difficult
to tease apart the specific effects of double reduction alone (Hardy, 2016). However,
because the probability of double reduction at a locus ($\alpha_\ell$) depends on its distance
from the centromere (call it $x_\ell$), a potential way to estimate $\alpha_\ell$ would be to use the
$x_\ell$'s as predictor variables in a linear model: $\alpha_\ell = \beta_0 + \beta_1 x_\ell$. This would only add
two additional parameters ($\beta_0$ and $\beta_1$) that would need to be estimated and would
be completely independent of the number of loci analyzed. The downside to this
approach is that it would only be applicable for polyploid organisms with sequenced
genomes (or the genome of a diploid progenitor), making the use of such a model
impractical for the time being.
Additional Levels in the Hierarchical Model
The place where we believe our model could have the greatest impact is through
modifications and extensions of the probability model used for the inheritance of alleles.
These models have been difficult to apply in the past as a result of genotype uncertainty.
However, using our model as a starting point, it could be possible to infer patterns
of inheritance (polysomy, disomy, heterosomy) and other demographic parameters
(e.g., effective population size, population differentiation) without requiring direct
knowledge about the genotypes of the individuals in the population. For example,
Haldane’s (1930) model of genotype frequencies for autopolyploids that are partially
selfing could be used to infer the prevalence of self-fertilization within a population.
Another possible approach would be to use general disequilibrium coefficients ($D_A$)
to model departures from Hardy-Weinberg equilibrium (Hernández and Weir, 1989;
Weir, 1996). A more recent model described by Stift et al. (2008) used microsatellites
to infer the different inheritance patterns (disomic, tetrasomic, intermediate) for
tetraploids in the genus Rorippa (Brassicaceae) following crossing experiments. The
reformulation of such a model for biallelic SNPs gathered using high-throughput
sequencing could provide a suitable framework for understanding inheritance patterns
across the genome. An ideal model would be one that could help to understand
genome-wide inheritance patterns for a polyploid of arbitrary formation pathway
(autopolyploid ↔ allopolyploid) without the need to conduct additional experiments.
However, to our knowledge, such a model does not currently exist.
1.6 Conclusions
The recent emergence of models for genotype uncertainty in diploids has introduced
a theoretical framework for dealing with the fact that genotypes are unobserved
quantities (Gompert and Buerkle, 2012; Buerkle and Gompert, 2013). Our extension
of this theory to cases of higher ploidy (specifically to autopolyploids) progresses
naturally from the original work but also serves to alleviate the deeper issue of ADU.
The power and flexibility of these models as applied at the diploid level has the
potential to be replicated for polyploid organisms with the addition of suitable models
for allelic inheritance. The construction of hierarchical models containing probability
models for ADU, allelic inheritance, and perhaps even additional levels for important
parameters such as F-statistics or the allele frequency spectrum also has the potential
to provide key insights into the population genetics of polyploids (Gompert and
Buerkle, 2011a; Buerkle and Gompert, 2013). Future work on such models will help
to progress the study of polyploid taxa and could eventually lead to more generalized
models for understanding the processes that have shaped their evolutionary histories.
1.7 Software Note
We have combined the scripts for our Gibbs sampler into an R package, polyfreqs,
which is available on GitHub (https://github.com/pblischak/polyfreqs). Though polyfreqs is written in R, it deals with the large data sets that
are generated by high-throughput sequencing platforms in two ways. First, it takes
advantage of R’s ability to incorporate C++ code via the Rcpp and RcppArmadillo
packages, allowing for a faster implementation of our MCMC algorithm (Eddelbuettel
and François, 2011; Eddelbuettel, 2013; Eddelbuettel and Sanderson, 2014). Second,
since the model assumes independence between loci, polyfreqs can facilitate the
process of parallelizing analyses by splitting the total read count and reference read
count matrices into subsets of loci which can be analyzed at the same time on separate
nodes of a computing cluster. Additional features of the program include
• Estimation of posterior distributions of per locus observed and expected het-
erozygosity (het_obs and het_exp, respectively).
• Maximum a posteriori (posterior mode) estimation of genotypes using the
get_map_genotypes() function.
• Posterior predictive model checking using the polyfreqs_pps() function.
• Simulation of high-throughput sequencing read counts and genotypes from user
specified allele frequencies using the sim_reads() function.
• Options for controlling program output such as writing genotype samples to file,
printing MCMC updates to the R console, etc.
• Simple input format using tab delimited text files that can be directly imported
into R using the read.table() function. The format is as follows:
1. An optional row of locus names (use header=TRUE to specify this in
read.table()).
2. One row for each individual.
3. First column contains individual names (use row.names=1 to specify this
in read.table()).
4. One column for each locus.
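The input layout described above can be illustrated with a short parsing sketch. It is written in Python purely for illustration (polyfreqs itself reads these files in R with read.table()). The function name `read_count_matrix`, the example values, and the assumption that the optional header row of locus names is present are all hypothetical.

```python
import csv
import io

def read_count_matrix(handle):
    """Parse a tab-delimited read-count table into (individuals, loci, matrix).

    Assumes the optional header row of locus names is present.
    """
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    n_fields = len(body[0])
    # The header may have one fewer field than the data rows (no label above
    # the individual names) or the same number; accept both layouts.
    loci = header if len(header) == n_fields - 1 else header[1:]
    individuals = [row[0] for row in body]          # first column: individual names
    matrix = [[int(x) for x in row[1:]] for row in body]  # one column per locus
    return individuals, loci, matrix

# Hypothetical example: total read counts for 2 individuals at 3 loci.
example = "loc1\tloc2\tloc3\nind1\t12\t8\t15\nind2\t9\t11\t7\n"
individuals, loci, total_reads = read_count_matrix(io.StringIO(example))
print(individuals, loci, total_reads)
```

The same layout would apply to both the total read count matrix and the reference read count matrix.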
1.8 Acknowledgements
The authors would like to thank the Ohio Supercomputer Center for access to com-
puting resources and Nick Skomrock for assistance with deriving the full conditional
distributions of the model in the diploid case. We would also like to thank Frederic
Austerlitz, Aaron Wenzel, members of the Wolfe and Kubatko labs, and 3 anonymous
reviewers for their helpful comments on the manuscript. This work was partially
funded through a grant from the National Science Foundation (DEB-1455399) to
ADW and LSK.
1.9 Author Contributions
Conceived of the study: PDB, LSK, and ADW. PDB derived the polyploid model,
ran the simulations and other analyses, coded the R package, and wrote the initial
draft of the manuscript. PDB, LSK, and ADW reviewed all parts of the manuscript
and all authors approved of the final version.
1.10 Data Accessibility
Scripts for simulating the data sets, analyzing them using Gibbs sampling, and
producing the figures from the resulting output can all be found on GitHub, along with the original simulated data sets and autotetraploid potato data
(https://github.com/pblischak/polyfreqs-ms-data). We also provide an implementation
of the Gibbs sampler for estimating allele frequencies in the R package polyfreqs
(https://github.com/pblischak/polyfreqs). See the package vignette or GitHub wiki for more details (https://github.com/pblischak/polyfreqs/wiki).
Chapter 2: SNP Genotyping and Parameter Estimation in Polyploids from Low-Coverage Sequencing Data
Publication Information
This chapter is formatted for this dissertation from the following publication:
Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. SNP Genotyping and Parameter
Estimation in Polyploids from Low-Coverage Sequencing Data. Bioinformatics, 34:407–
415, 2018.
2.1 Abstract
Motivation: Genotyping and parameter estimation using high throughput sequenc-
ing data are everyday tasks for population geneticists, but methods developed for
diploids are typically not applicable to polyploid taxa. This is due to their duplicated
chromosomes, as well as the complex patterns of allelic exchange that often accom-
pany whole genome duplication (WGD) events. For WGDs within a single lineage
(autopolyploids), inbreeding can result from mixed mating and/or double reduction.
For WGDs that involve hybridization (allopolyploids), alleles are typically inherited
through independently segregating subgenomes. Results: We present two new models
for estimating genotypes and population genetic parameters from genotype likelihoods
for auto- and allopolyploids. We then use simulations to compare these models to
existing approaches at varying depths of sequencing coverage and ploidy levels. These
simulations show that our models typically have lower levels of estimation error for
genotype and parameter estimates, especially when sequencing coverage is low. Finally, we also apply these models to two empirical data sets from the literature. Overall, we show that the use of genotype likelihoods to model non-standard inheritance
patterns is a promising approach for conducting population genomic inferences in
polyploids. Availability: A C++ program, ebg, is provided to perform inference
using the models we describe. It is available under the GNU GPLv3 on GitHub:
https://github.com/pblischak/polyploid-genotyping.
2.2 Introduction
The discovery and analysis of genetic variation in natural populations is a central task
of evolutionary genetics, with applications that include inferring population
structure and patterns of historical demography, detecting selection and local adaptation,
and performing genetic association studies. The ability to use high-throughput
sequencing technologies to detect variants across the genome has further advanced our
understanding of the impact of evolutionary forces on genetic diversity in populations.
However, the nature of data sets collected using high-throughput sequencing often
requires special considerations regarding sequencing error and, especially, the level of
sequencing coverage. Common approaches for dealing with low-coverage sequence data
use genotype likelihoods to integrate over the uncertainty of inferring genotypes when
estimating other parameters [allele frequencies, inbreeding coefficients, population
differentiation, etc.] (e.g., Martin et al., 2010; Li, 2011; Nielsen et al., 2011, 2012;
Fumagalli et al., 2013; Vieira et al., 2013; Huang et al., 2016, among others). Genotype
likelihoods for biallelic SNPs are calculated as the probability of the sequencing read
data mapping to a variable site (total number of reads, number of reads with the
alternative allele, and probability of sequencing error) given the possible values of
the genotypes (typically 0, 1, or 2 for the number of copies of the alternative allele
in diploids). When combined with computationally efficient algorithms for inference,
these models are the primary tools used for conducting population genetic analyses
from high-throughput data.
Although the theory for these models is well established for diploids and even
special cases of higher ploidy samples (treated equivalently to pooled samples of
multiple diploids), the application of these tools to taxa that have experienced a
recent whole genome duplication (WGD) is currently limited (McKenna et al., 2010;
DePristo et al., 2011; Li, 2011). This is due in part to ambiguity in the
copy number of each allele in the genotype of a polyploid, a phenomenon referred
to as allelic dosage uncertainty (Blischak et al., 2016). Another important aspect of
polyploid evolution to consider is that the occurrence of WGD can have an impact
on how alleles are exchanged in a population, making the assumption of randomly
inherited alleles inappropriate. Together these two factors have limited the widespread
application of population genomic tools to gain insights about levels of genetic variation
following WGD. Given both the evolutionary and economic importance of many of
these organisms (e.g., agricultural crops, farmed fishes), the development of methods
that can accommodate more complex patterns of inheritance is critical for the study
of polyploids (Stebbins, 1950; Grant, 1971; Otto and Whitton, 2000; Soltis and Soltis,
2000; Soltis et al., 2014).
In this paper we present two new models for SNP genotyping in polyploids using
high-throughput sequencing data. The models correspond to two different ways in which polyploids can be formed: WGD within a lineage (autopolyploid) or involving
hybridization between two lineages (allopolyploid). The former builds off of previous work to relax the assumption of Hardy-Weinberg equilibrium by including inbreeding
(Blischak et al., 2016) and the latter provides a framework for separately determining
the genotypes within the two genomes that compose the allopolyploid (typically
referred to as subgenomes). We test our models using a wide range of simulations
and describe our numerical approach for parameter estimation using the expectation
maximization (EM) algorithm (Dempster et al., 1977). For comparison, we analyzed
our simulated data sets using two additional approaches based on models that assume
either Hardy Weinberg equilibrium or equal genotype probabilities. Finally, we also
test the models on empirical data sets collected for a diploid-allotetraploid species
pair from the genus Betula (birch trees) and a mixed-ploidy grass species, Andropogon gerardii. Overall, we demonstrate that genotype uncertainty resulting from
low-coverage sequencing data, allelic dosage uncertainty, and non-standard inheritance
patterns can be overcome in polyploids using genotype likelihoods.
2.3 Models
Assumptions: For each of the models below, we assume that SNPs are biallelic, and
that loci and individuals are independent. For the autopolyploid model, we do not
directly include double reduction (but see Discussion). For the allopolyploid model, we assume that subgenomes are independent, that they do not interact during meiosis
(i.e., no homoeologous recombination), and that they are both in Hardy Weinberg
equilibrium.
Notation for each model is introduced in the descriptions we provide below and
is also summarized in Table 2.1. Throughout the paper, we use boldface letters to
denote an array of the respective parameter across either individuals (N), loci (L), or
both (e.g., $\mathbf{p} := p_1, \ldots, p_L$, $\mathbf{F} := F_1, \ldots, F_N$, and $\mathbf{G} := g_{11}, g_{12}, \ldots, g_{N(L-1)}, g_{NL}$).
2.3.1 Autopolyploid Model
The genotype for a biallelic SNP in an autopolyploid with K sets of chromosomes
has K + 1 possible values. For example, using A and a to denote the two alleles,
an autotetraploid can have genotypes equal to AAAA, AAAa, AAaa, Aaaa, or aaaa
(e.g., $g_{i\ell} = 0, 1, 2, 3$, or $4$, if a is the alternative allele; $i = 1, \ldots, N$ and $\ell = 1, \ldots, L$).
A simple extension of the typical binomial sampling (Hardy Weinberg; HW) model
used for diploids but with larger sample size to accommodate higher ploidy levels has
been used previously (Li, 2011; Blischak et al., 2016). However, inbreeding in various
forms can bias inferences made when HW equilibrium is assumed. Vieira et al. (2013)
introduced a genotype prior to include inbreeding either per-site or per-individual for
a sample of diploids (implemented in the programs ngsF and ANGSD). This model
used a formulation for generalized HW that includes the inbreeding coefficient, F, which is the probability that two alleles are identical by descent (ibd). Instead of
using a generalized HW formulation for autopolyploids, we used the Balding-Nichols
beta-binomial model (Balding and Nichols, 1995, 1997; Bradburd et al., 2013), which
also models the probability of two alleles being ibd but is more easily extended to
higher ploidy levels by not directly enumerating all combinations of allele draws for
the genotype of an autopolyploid. The beta-binomial distribution is obtained from the
product of a binomial and beta distribution, which are commonly used in population
genetics to model genotypes and allele frequencies, respectively (Wright, 1931). The
beta distribution in this case is used to model genetic correlations that can result
from inbreeding and/or population subdivision. These types of models are commonly
referred to as F-models because of their relation to Wright’s fixation indices (e.g.,
$F_{IS}$, $F_{ST}$; Wright, 1931), and they form the basis of many well-known population
genetic models, including those by Holsinger et al. (2002), Falush et al. (2003), and
Foll and Gaggiotti (2008), as well as more recent modeling applications that include
uncertainty in genotype calling from high-throughput sequencing data using genotype
likelihoods (e.g., Gompert et al., 2010; Gompert and Buerkle, 2011b; Fumagalli et al.,
2013).
Given genotype values at L loci for N individuals each of ploidy $m_i$, we model
individual genotypes at each locus ($g_{i\ell} = 0, \ldots, m_i$ copies of the alternative allele) as a
beta-binomial random variable. This distribution derives from treating the probability
of drawing an alternative allele as a beta distributed random variable with parameters
$\alpha = p_\ell \frac{1 - F_i}{F_i}$ and $\beta = (1 - p_\ell) \frac{1 - F_i}{F_i}$, which scales the binomial probability of successfully
drawing an alternative allele by both the allele frequency ($p_\ell$) and the amount of
inbreeding ($F_i$) (Balding and Nichols, 1995; Bradburd et al., 2013). The log likelihood
of the genotype data for this model given the allele frequency at each site ($p_\ell$) and the
per-individual inbreeding coefficients ($F_i$) is then
$$
\log L(\mathbf{p}, \mathbf{F}; \mathbf{G}) = \sum_i \sum_\ell \log P(g_{i\ell} \mid p_\ell, F_i)
= \sum_i \sum_\ell \log \frac{B\left(g_{i\ell} + p_\ell \frac{1 - F_i}{F_i},\; m_i - g_{i\ell} + (1 - p_\ell) \frac{1 - F_i}{F_i}\right)}{B\left(p_\ell \frac{1 - F_i}{F_i},\; (1 - p_\ell) \frac{1 - F_i}{F_i}\right)}, \tag{2.1}
$$

where $B(\alpha, \beta)$ represents the beta function with parameters $\alpha$ and $\beta$. Since genotypes
must be inferred from sequence data ($d_{i\ell}$; see Methods), we can also account for
this uncertainty by summing over the possible genotype values to get the likelihood
of the sequence data given allele frequencies and inbreeding coefficients by including
genotype likelihoods [$P(d_{i\ell} \mid g_{i\ell})$]:

$$
\log L(\mathbf{p}, \mathbf{F}; \mathbf{D}) = \sum_i \sum_\ell \log \left[ \sum_a P(d_{i\ell} \mid g_{i\ell} = a) \, P(g_{i\ell} = a \mid p_\ell, F_i) \right]. \tag{2.2}
$$

Here $P(g_{i\ell} \mid p_\ell, F_i)$ is the beta-binomial distribution from Eq. (2.1). Because maximization
of the log likelihood is encumbered by the logarithm of the sum over genotypes, we instead use an expectation conditional maximization algorithm to obtain maximum
likelihood (ML) estimates for $\mathbf{p}$ and $\mathbf{F}$ (Meng and Rubin, 1993). Since an analytical
solution for the maximization step is not readily available, we instead employ numerical
maximization of the likelihood using Brent’s method (Brent, 1973). Then, given the
ML parameter estimates, we can calculate the posterior probability of the genotype of
each individual at each locus using Bayes’ theorem:

$$
P(g_{i\ell} = a \mid d_{i\ell}) = \frac{P(d_{i\ell} \mid g_{i\ell} = a) \, P(g_{i\ell} = a \mid \hat{p}_\ell, \hat{F}_i)}{\sum_{a'=0}^{m_i} P(d_{i\ell} \mid g_{i\ell} = a') \, P(g_{i\ell} = a' \mid \hat{p}_\ell, \hat{F}_i)}, \tag{2.3}
$$

for $a = 0, \ldots, m_i$.
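As an illustration of the beta-binomial genotype prior and the posterior in Eq. (2.3), the following Python sketch evaluates both for a single individual at a single locus. It is not the ebg implementation: the function names are hypothetical, the genotype likelihoods in the example are made up, and the binomial coefficient is included so that the prior sums to one (it is constant with respect to $p_\ell$ and $F_i$). The sketch assumes $0 < F < 1$ and $0 < p < 1$.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_prior(m, p, F):
    """P(g = a | p, F) for a = 0..m under the beta-binomial model.

    Assumes 0 < F < 1 and 0 < p < 1 so that alpha and beta are positive.
    """
    alpha = p * (1.0 - F) / F
    beta = (1.0 - p) * (1.0 - F) / F
    probs = []
    for a in range(m + 1):
        log_pr = (math.log(math.comb(m, a))
                  + log_beta(a + alpha, m - a + beta)
                  - log_beta(alpha, beta))
        probs.append(math.exp(log_pr))
    return probs

def genotype_posterior(geno_liks, prior):
    """Eq. (2.3): normalize likelihood times prior over genotype values."""
    joint = [lik * pr for lik, pr in zip(geno_liks, prior)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical tetraploid example with made-up genotype likelihoods.
prior = beta_binomial_prior(m=4, p=0.3, F=0.25)
post = genotype_posterior([0.01, 0.2, 0.5, 0.2, 0.01], prior)
print([round(x, 4) for x in prior])
print([round(x, 4) for x in post])
```

With inbreeding, the prior places more weight on the homozygous genotypes than a binomial (HW) prior with the same allele frequency would.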
2.3.2 Allopolyploid Model
Deviations from simple HW expectations are evident in allopolyploids in that they
have two (sometimes more) sets of chromosomes inherited from separate evolutionary
lineages. When these sets of chromosomes (called homoeologs, or homoeologous chromosomes) segregate during meiosis, they are inherited separately from one another
and should be treated independently. For example, the genotypes for a biallelic SNP
in an allotetraploid with two diploid subgenomes could have values AA|A′A′, AA|A′a′,
Aa|A′A′, AA|a′a′, Aa|A′a′, aa|A′A′, Aa|a′a′, aa|A′a′, or aa|a′a′. Here the vertical bar
‘|’ denotes separation between the subgenomes and the prime (′) indicates homoeologous alleles.
With perfect knowledge about which alleles go with each subgenome, determining the
genotypes could be done completely independently. However, if separate reference
genomes for the homoeologous chromosomes are not available, all reads mapping to
a variable position will not be separable into reads coming from one subgenome or
the other. Thus, when considering a variable site across the full set of homoeologs, we need to account for the fact that the frequency of the alternative allele may not
be the same in each subgenome due to their separate evolutionary histories, even if
both subgenomes are independently in Hardy Weinberg equilibrium. When we cannot
separate reads, we can instead consider the full genotype of an allopolyploid with two
subgenomes as being a combination of the genotypes within the subgenomes (i.e., the
number of alternative alleles summed across subgenomes). Returning to the previous
example, a tetraploid with two diploid subgenomes can have a full genotype of 0,..., 4
copies of the alternative allele, but each of these full genotypes can be found via a
different combination of genotypes in the subgenomes: {0 = (0, 0); 1 = (0, 1), (1, 0); 2 =
(0, 2), (2, 0), (1, 1); 3 = (1, 2), (2, 1); 4 = (2, 2)}. In general, for an allopolyploid that
has two subgenomes with ploidy levels equal to $m_{1i}$ and $m_{2i}$, there are a total of
$(m_{1i} + 1) \times (m_{2i} + 1)$ genotype combinations to consider. The probabilities of these
genotypes are then determined using the allele frequencies for the alternative allele in
the subgenomes.
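The enumeration of subgenome genotype combinations can be sketched in a few lines of Python. The function name `subgenome_combinations` is hypothetical and is only meant to make the counting argument concrete.

```python
from itertools import product

def subgenome_combinations(m1, m2):
    """Map each full genotype g to the (g1, g2) pairs with g1 + g2 = g."""
    combos = {g: [] for g in range(m1 + m2 + 1)}
    for g1, g2 in product(range(m1 + 1), range(m2 + 1)):
        combos[g1 + g2].append((g1, g2))
    return combos

# Allotetraploid with two diploid subgenomes.
combos = subgenome_combinations(2, 2)
for g, pairs in combos.items():
    print(g, pairs)
print(sum(len(pairs) for pairs in combos.values()))  # (2 + 1) * (2 + 1) combinations
```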
An obvious complication of not being able to separate the sequencing reads into
sets coming from each subgenome is that it makes independently estimating the allele
frequencies and genotypes impossible. However, it is sometimes the case that the
parental species of the allopolyploid are known, which can help with inferring genotypes
by providing an outside estimate of the allele frequencies within the subgenomes. For
our model, we relax this use of outside knowledge further and assume that only a single
parent has been identified. Arbitrarily designating the known parent as subgenome
one, we treat the allele frequencies at each locus estimated in the parental population
to be known ($\mathbf{p}_1^*$) and require only the estimation of the allele frequencies in subgenome
two ($\mathbf{p}_2$). We then model the full genotype in the allopolyploid as the sum of the two
independent subgenomes with separate, and potentially unequal, allele frequencies.
Since we assume Hardy Weinberg equilibrium within each subgenome, we can model
the sum of the number of alternative alleles in the two subgenomes as a product of two
binomial distributions. The log likelihood for known genotype data across individuals
at all loci is then given by
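$$
\log L(\mathbf{p}_2; \mathbf{p}_1^*, \mathbf{G}_1, \mathbf{G}_2) = \sum_i \sum_\ell \left\{ \log \left[ \binom{m_{1i}}{g_{1i\ell}} (p_{1\ell}^*)^{g_{1i\ell}} (1 - p_{1\ell}^*)^{(m_{1i} - g_{1i\ell})} \right] + \log \left[ \binom{m_{2i}}{g_{2i\ell}} (p_{2\ell})^{g_{2i\ell}} (1 - p_{2\ell})^{(m_{2i} - g_{2i\ell})} \right] \right\}. \tag{2.4}
$$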
The inclusion of genotype likelihoods is done in a similar way to the autopolyploid
model, only now we are summing over the values of the genotypes in both subgenomes
one and two. The log likelihood for observed sequence data given the allele frequencies
in each of the subgenomes is
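$$
\log L(\mathbf{p}_2; \mathbf{p}_1^*, \mathbf{D}) = \sum_i \sum_\ell \log \left[ \sum_{a_1} \sum_{a_2} P(d_{i\ell} \mid g_{i\ell} = g_{1i\ell} + g_{2i\ell}) \times P(g_{1i\ell} = a_1 \mid p_{1\ell}^*) \, P(g_{2i\ell} = a_2 \mid p_{2\ell}) \right], \tag{2.5}
$$

where $P(d_{i\ell} \mid g_{i\ell})$ is the genotype likelihood, and $P(g_{1i\ell} \mid p_{1\ell}^*)$ and $P(g_{2i\ell} \mid p_{2\ell})$ are binomial
distributions.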
Because maximizing the log likelihood involves the logarithm of a double sum, we
turn once again to the expectation maximization algorithm to obtain a ML estimate
for the allele frequency at each locus in subgenome two (Dempster et al., 1977). An
analytical solution for the maximization step of the EM algorithm is given by
$$
p_{2\ell}^{(t+1)} = \frac{\sum_i \sum_{a_1} \sum_{a_2} a_2 \, P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^*, p_{2\ell}^{(t)})}{\sum_i m_{2i}}, \tag{2.6}
$$

where $P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^*, p_{2\ell}^{(t)})$ is the joint conditional probability of the genotypes
in subgenomes one and two given the data and the current parameter estimates. Using
these ML estimates, an empirical Bayes estimate of the genotypes within each of the
subgenomes can be found using their joint posterior probability (note that subscripts
$i$ and $\ell$ are dropped for readability)

$$
P(g_1 = a_1, g_2 = a_2 \mid d) = \frac{P(d \mid g = g_1 + g_2) \, P(g_1 = a_1 \mid p_1^*) \, P(g_2 = a_2 \mid \hat{p}_2)}{\sum_{a_1'} \sum_{a_2'} P(d \mid g = g_1 + g_2) \, P(g_1 = a_1' \mid p_1^*) \, P(g_2 = a_2' \mid \hat{p}_2)}, \tag{2.7}
$$

where $a_1 = 0, \ldots, m_{1i}$ and $a_2 = 0, \ldots, m_{2i}$.
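A minimal Python sketch of the EM update in Eq. (2.6) and the joint posterior in Eq. (2.7) for a single locus is given below. It is an illustration under simplifying assumptions rather than the ebg code: all individuals are assumed to share the same subgenome ploidies (so the denominator of Eq. (2.6) becomes m2 times the number of individuals), and the function names, starting value, and toy likelihoods are hypothetical.

```python
import math

def binom_pmf(g, m, p):
    """Binomial probability of g alternative alleles among m chromosomes."""
    return math.comb(m, g) * p ** g * (1.0 - p) ** (m - g)

def joint_posterior(lik, m1, m2, p1, p2):
    """Eq. (2.7) for one individual: P(g1 = a1, g2 = a2 | d).

    lik[g] is the genotype likelihood P(d | g) for full genotype g = a1 + a2.
    """
    joint = {(a1, a2): lik[a1 + a2] * binom_pmf(a1, m1, p1) * binom_pmf(a2, m2, p2)
             for a1 in range(m1 + 1) for a2 in range(m2 + 1)}
    total = sum(joint.values())
    return {pair: value / total for pair, value in joint.items()}

def em_p2(liks, m1, m2, p1, p2=0.5, iters=100, tol=1e-8):
    """Eq. (2.6) at one locus; liks[i][g] = P(d_i | g) for individual i."""
    for _ in range(iters):
        expected_alt = 0.0
        for lik in liks:
            post = joint_posterior(lik, m1, m2, p1, p2)
            # Expected number of alternative alleles in subgenome two.
            expected_alt += sum(a2 * pr for (_, a2), pr in post.items())
        new_p2 = expected_alt / (m2 * len(liks))
        converged = abs(new_p2 - p2) < tol
        p2 = new_p2
        if converged:
            break
    return p2

# Toy data: genotype likelihoods that put all weight on full genotypes 0, 1, 2, 1.
liks = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
# With the reference allele fixed in subgenome one (p1 = 0), every alternative
# allele must come from subgenome two.
print(em_p2(liks, m1=2, m2=2, p1=0.0, p2=0.3))
```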
2.3.3 Other Approaches
We consider two additional approaches that use genotype priors that have been
described in previous studies. The first is an implementation of the SAMtools Hardy
Weinberg equilibrium prior (Li, 2011) and the second is a flat prior on genotypes that
is similar to the model used by the Genome Analysis Toolkit (GATK; McKenna et al.,
2010). Other approaches that accommodate polyploids such as the fitTetra package
in R (Voorrips et al., 2011) and the method of Maruki and Lynch (2017) were not
considered here because they can only handle specific ploidy levels (triploids and/or
tetraploids).
2.4 Methods
Genotype likelihoods were calculated using a simplified version of the SAMtools model
by using average sequencing error values at each locus, $\ell$, across reads and individuals
(Li, 2011). Then for the possible values of the genotype ($a = 0, \ldots, m_i$), the probability
of the read data, $d_{i\ell} = \{t_{i\ell}, r_{i\ell}, \epsilon_\ell\}$ ($t_{i\ell}$ = total read count, $r_{i\ell}$ = alternative allele read
count), given the genotype, $g_{i\ell}$, is

$$
P(d_{i\ell} \mid g_{i\ell} = a) = \binom{t_{i\ell}}{r_{i\ell}} f(a, m_i, \epsilon_\ell)^{r_{i\ell}} \times \left[1 - f(a, m_i, \epsilon_\ell)\right]^{(t_{i\ell} - r_{i\ell})}, \tag{2.8}
$$

where

$$
f(a, m, \epsilon) = \frac{a}{m}(1 - \epsilon) + \left(1 - \frac{a}{m}\right)\epsilon. \tag{2.9}
$$

Table 2.1: A key to the symbols and notation that are used in describing the autopolyploid and allopolyploid models. We use either a bold or bold-capitalized letter when referring to the collection of parameters together (e.g., $\mathbf{G}$ refers to $g_{i\ell}$ for all individuals at all loci). Parameters within subgenomes for the allopolyploid model use the same symbol but with either a 1 or a 2 added as a subscript.

Symbol              Description
$N$, $L$            The number of individuals and loci sampled.
$m_i$               Ploidy level of individual $i$.
$d_{i\ell}$         Sequence data for individual $i$ at locus $\ell$ ($= \{t_{i\ell}, r_{i\ell}, \epsilon_\ell\}$).
$t_{i\ell}$         Total number of reads for individual $i$ at locus $\ell$.
$r_{i\ell}$         Number of alternative allele reads for individual $i$ at locus $\ell$.
$\epsilon_\ell$     Average sequencing error at locus $\ell$.
$g_{i\ell}$         Genotype for individual $i$ at locus $\ell$.
$p_\ell$            Allele frequency at locus $\ell$.
$F_i$               Inbreeding coefficient for individual $i$.
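Eqs. (2.8) and (2.9) can be sketched in Python as follows. The function names and the example read counts are hypothetical, and this is only an illustration of the formulas, not the code used in ebg.

```python
import math

def alt_read_prob(a, m, eps):
    """Eq. (2.9): probability that a read shows the alternative allele."""
    return (a / m) * (1.0 - eps) + (1.0 - a / m) * eps

def genotype_likelihoods(t, r, m, eps):
    """Eq. (2.8): P(d | g = a) for a = 0..m given t total and r alternative reads."""
    liks = []
    for a in range(m + 1):
        q = alt_read_prob(a, m, eps)
        liks.append(math.comb(t, r) * q ** r * (1.0 - q) ** (t - r))
    return liks

# Hypothetical tetraploid with 3 alternative reads out of 10 and 1% error.
liks = genotype_likelihoods(t=10, r=3, m=4, eps=0.01)
print([round(x, 4) for x in liks])
print(liks.index(max(liks)))  # genotype value with the highest likelihood
```

For large read counts it would be safer to work with log likelihoods to avoid numerical underflow; the direct form is kept here to mirror the equations.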
2.4.1 Simulations
We generated sequencing read data with mean coverage per individual, per locus equal
to 2x, 5x, 10x, 20x, 30x, and 40x, simulated from a Poisson distribution for 10 000
sites. The number of individuals was set to 25, 50, or 100, and we tested ploidy levels
equal to 4, 6, and 8 (4=2+2, 6=2+4, and 8=4+4 for allopolyploids). Sequencing
errors were drawn from a beta distribution with parameters α = 1 and β = 200 (mean
error ≈ 0.005). Allele frequencies were drawn from a truncated beta distribution with a minimum minor allele frequency of 5% and parameters α = β = 0.01. For
the autopolyploid model, the values of the inbreeding coefficient were set to 0.1, 0.25,
0.5, 0.75, and 0.9. For the allopolyploid model, the allele frequencies simulated for
subgenome one were treated as the reference panel. Genotypes were drawn according
to their respective generating models (autopolyploid or allopolyploid), and the number
of alternative reads for each individual at each locus was drawn from the binomial
distribution in Eq. (2.8) given the total read count, genotype, and level of sequencing
error. For each simulation, we evaluated estimation error using the root mean squared
deviation (RMSD)
$$
\mathrm{RMSD} = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left(X_{\mathrm{est.}}^{[r]} - X_{\mathrm{true}}\right)^2}, \tag{2.10}
$$

where $R$ represents the number of replicates, $X_{\mathrm{est.}}^{[r]}$ is the estimated value for replicate
$r$, and $X_{\mathrm{true}}$ is the original value used to simulate the data.
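The simulation scheme for the autopolyploid model and the RMSD in Eq. (2.10) can be sketched in Python using only the standard library. Several details are assumptions made for illustration: genotypes are drawn by compounding a beta draw with a binomial draw, the truncation is done by rejection sampling, the Poisson sampler is a simple textbook method, and all function names and the seed are hypothetical. The original simulations were written in R and C++.

```python
import math
import random

def alt_read_prob(a, m, eps):
    """Eq. (2.9): probability that a read shows the alternative allele."""
    return (a / m) * (1.0 - eps) + (1.0 - a / m) * eps

def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def binomial(rng, n, q):
    """Number of successes in n Bernoulli(q) trials."""
    return sum(rng.random() < q for _ in range(n))

def simulate_locus(rng, n_ind, ploidy, F, coverage):
    """Simulate (genotype, total reads, alternative reads) for each individual."""
    p = rng.betavariate(0.01, 0.01)
    while not 0.05 <= p <= 0.95:       # truncate at a 5% minor allele frequency
        p = rng.betavariate(0.01, 0.01)
    eps = rng.betavariate(1, 200)       # sequencing error, mean about 0.005
    alpha, beta = p * (1 - F) / F, (1 - p) * (1 - F) / F
    data = []
    for _ in range(n_ind):
        g = binomial(rng, ploidy, rng.betavariate(alpha, beta))  # beta-binomial draw
        t = poisson(rng, coverage)                               # total read count
        r = binomial(rng, t, alt_read_prob(g, ploidy, eps))      # alternative reads
        data.append((g, t, r))
    return p, eps, data

def rmsd(estimates, truth):
    """Eq. (2.10): root mean squared deviation across replicates."""
    return math.sqrt(sum((x - truth) ** 2 for x in estimates) / len(estimates))

rng = random.Random(2018)
p, eps, data = simulate_locus(rng, n_ind=25, ploidy=4, F=0.25, coverage=5)
print(round(p, 3), data[:3])
print(rmsd([1.0, 3.0], 2.0))
```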
To compare our models with other methods, we reused these simulated data as
input for the estimation of genotypes and model parameters using priors that assume
either Hardy Weinberg equilibrium or equal genotype probabilities (GATK-like). For
the allopolyploid model, this also equates to ignoring the fact that genotypes are drawn
from two independent subgenomes. Inference for the Hardy Weinberg model used the
EM algorithm described in Li (2011). Genotypes under the GATK-like model were calculated from normalized genotype likelihoods as described in McKenna et al. (2010).
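The two comparison approaches can be sketched as follows. The flat-prior function simply normalizes the genotype likelihoods, and the EM function is the standard allele frequency update under a binomial (HW) genotype prior. Whether either matches the corresponding ebg implementation line by line is an assumption, and the function names and toy data are hypothetical.

```python
import math

def binom_pmf(g, m, p):
    """Binomial (HW) genotype prior for ploidy m and allele frequency p."""
    return math.comb(m, g) * p ** g * (1.0 - p) ** (m - g)

def flat_prior_posterior(lik):
    """GATK-like approach: normalized genotype likelihoods (equal priors)."""
    total = sum(lik)
    return [x / total for x in lik]

def hw_em_allele_freq(liks, ploidies, p=0.5, iters=100, tol=1e-8):
    """EM estimate of the allele frequency at one locus under a HW prior."""
    for _ in range(iters):
        expected_alt = 0.0
        for lik, m in zip(liks, ploidies):
            weights = [lik[g] * binom_pmf(g, m, p) for g in range(m + 1)]
            total = sum(weights)
            # Posterior expected number of alternative alleles for this individual.
            expected_alt += sum(g * w for g, w in enumerate(weights)) / total
        new_p = expected_alt / sum(ploidies)
        converged = abs(new_p - p) < tol
        p = new_p
        if converged:
            break
    return p

# Toy tetraploid data with all likelihood weight on genotypes 0, 2, 4, and 2.
liks = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1], [0, 0, 1, 0, 0]]
print(hw_em_allele_freq(liks, [4, 4, 4, 4], p=0.2))
print(flat_prior_posterior([1, 1, 2]))
```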
Comparisons for the autopolyploid model were based on the RMSD of four estimates
of the inbreeding coefficient. The first of these was the estimate obtained by our
ECM algorithm, which is built directly into the model. The other three estimates were calculated as a summary statistic from estimated genotypes for the three models
(Appendix B). We then also compared RMSD values of the estimated genotype values for the three methods. For the allopolyploid model, direct comparisons with
models that assume Hardy Weinberg or uniform genotype priors are more difficult
because they do not share the assumption of two subgenomes. Therefore, we focused
on the accuracy of the models to infer the full genotype by again comparing RMSD values.
2.4.2 Empirical Data Analysis
Andropogon gerardii
We tested our autopolyploid model on an empirical data set collected in the grass
species Andropogon gerardii. SNP data from McAllister and Miller (2016) were
downloaded from Dryad as a VCF file (http://datadryad.org/resource/doi:10.5061/dryad.05qs7).
The data were filtered using VCFtools with the following criteria:
biallelic SNPs only, no more than 50% missing data per site, one SNP per 10 000 base
pair window, and a minimum sequencing depth of five reads (Danecek et al., 2011).
The output from VCFtools was then converted to a plain text format containing
the number of total reads and alternative allele reads per individual per site using a
Perl script (read-counts-from-vcf.pl; available on GitHub). We then also removed
any individuals with more than 50% missing data using an R script (filter-inds.R;
available on GitHub). Since A. gerardii has two cytotypes (6N and 9N), we analyzed
the hexaploid and nonaploid individuals separately and compared the estimates of the
inbreeding coefficients across ploidy levels.
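The per-individual missing data filter can be illustrated with a short Python sketch. The original step used an R script (filter-inds.R), so this is only an analogue. The function name is hypothetical, and treating a total read count of zero or less as missing is an assumption about how missing data are coded.

```python
def filter_individuals(total_reads, max_missing=0.5):
    """Return the names of individuals with at most max_missing missing sites.

    total_reads maps an individual name to its per-site total read counts;
    a count of zero or less is treated as missing (an assumption).
    """
    keep = []
    for name, counts in total_reads.items():
        missing = sum(1 for t in counts if t <= 0) / len(counts)
        if missing <= max_missing:
            keep.append(name)
    return keep

# Hypothetical read counts for three individuals at four sites.
reads = {"ind_a": [5, 0, 7, 9], "ind_b": [0, 0, 0, 6], "ind_c": [0, 0, 3, 4]}
print(filter_individuals(reads))
```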
Betula pubescens and B. pendula
To test the allopolyploid model, biallelic SNP genotypes from Zohren et al. (2016) for
the allotetraploid Betula pubescens and its putative diploid progenitor, B. pendula, were
downloaded from Dryad (http://datadryad.org/resource/doi:10.5061/dryad.815rj).
Treating the genotypes as known, we simulated read data and error values
as before using Eq. (2.8) with beta distributed error values. We varied the level of
sequencing coverage (5x, 10x, 20x) but did not alter the amount of missing data. Allele
frequencies for B. pendula were estimated under the assumption of Hardy Weinberg
equilibrium and disequilibrium to assess which was a better fit. These allele frequency
estimates were then used as the reference panel for genotype estimation in B. pubescens
using the allopolyploid model.
Comparison with GATK
As a final comparison, we re-analyzed raw sequence data collected for B. pendula
and B. pubescens using GATK v3.5.0 and our model for allopolyploids. Data for 15
individuals each of B. pendula and B. pubescens were downloaded from the European
Nucleotide Archive (Project Accession ERA600270). Reads were mapped to a draft
reference genome of B. nana (Dryad, doi:10.5061/dryad.815rj; Wang et al., 2013) using
the MEM algorithm in BWA v0.7.13 with additional processing (conversion to BAM
and sorting) using SAMtools v1.4.1 (Li and Durbin, 2009; Li, 2011). Read group
information was added using Picard (http://broadinstitute.github.io/picard),
followed by variant calling and genotype estimation using the GATK UnifiedGenotyper
(B. pubescens was run with -ploidy=4; McKenna et al., 2010). Variant site positions
in the resulting VCF files were used to extract base quality scores from the original
BAM files using the SAMtools mpileup command (Li, 2011). All other data processing
steps (filtering sites, finding shared variants, etc.) were conducted using Python and
R scripts (available on GitHub; see Supplemental Text, §B.3). Allele frequencies
at each site were estimated in B. pendula using our implementation of the Hardy
Weinberg model (run until convergence). These allele frequencies were then used as
the reference panel for estimating genotypes in B. pubescens using the allopolyploid
model (EM+Brent with 100 iterations). All VCF, pileup, and input/output files are
publicly available on Zenodo (doi:10.5281/zenodo.825228).
2.4.3 Software and Reproducibility
We have packaged our code for the EM/ECM algorithms in a command line C++
program called ebg, which we have included as part of a GitHub repository for this
manuscript (doi:10.5281/zenodo.195779). This software includes our implementations
of the autopolyploid (diseq), allopolyploid (alloSNP), Hardy Weinberg (hwe), and
GATK-like (gatk) models for genotyping in polyploids. Code for the simulation study
and empirical data analyses was written using a combination of the R statistical
language and C++ through the use of the Rcpp package (Eddelbuettel and François,
2011; Eddelbuettel, 2013; R Core Team, 2014). Figures were generated using the
ggplot2 package in R (Wickham, 2009). Additional figure manipulations were done
using Inkscape (https://inkscape.org/). All Python, Perl, R, and Bash scripts
used to process data files are included on GitHub in the ‘helper-scripts/’ folder.
2.5 Results
2.5.1 Simulations
Autopolyploid Model
Simulated read count data were generated to assess the impact of sequencing coverage
and ploidy level on estimation error in autopolyploids using an expectation conditional
maximization (ECM) algorithm. Convergence of the ECM algorithm depended on
the number of individuals sampled, sequencing coverage, and ploidy. Each iteration
of the algorithm employs Brent’s method, itself an iterative maximization algorithm,
resulting in slower M-steps than the other EM algorithms we describe. However,
overall convergence was reached before the maximum number of allowed iterations
(1000) in all cases, with analyses typically employing between 50 and 100 iterations.
For the estimation of individual inbreeding coefficients (Fi), Figure 2.1a shows the
root mean squared deviation (RMSD) for estimated inbreeding coefficients for the
four different estimation methods across ploidy levels and the three lowest levels of
sequencing coverage (sample size of 50 individuals). Compared with the other methods
that used called genotypes (diseqCG, hwe, gatk), the level of sequencing coverage
and ploidy level had virtually no effect on estimation error using our model (diseq).
For the other estimates, increasing sequencing coverage lowered estimation error as
expected, and higher ploidy levels showed higher levels of error. However, inbreeding
coefficients estimated from genotypes called from our model (diseqCG) did have lower
48 (a) Inbreeding Coeff. Estimation Error [50 ind.]
p4 p6 p8
0.75 0.50 c2 0.25 0.00
0.75 0.50 c5
RMSD 0.25 0.00
0.75 c10 0.50 0.25 0.00
F10 F25 F50 F75 F90 F10 F25 F50 F75 F90 F10 F25 F50 F75 F90 Method diseq diseqCG gatk hwe
(b) Genotype Estimation Error [50 ind.]
p4 p6 p8
2.0
1.5 c2 1.0 0.5
2.0
1.5 c5 1.0 RMSD 0.5
2.0 1.5 c10 1.0 0.5
F10 F25 F50 F75 F90 F10 F25 F50 F75 F90 F10 F25 F50 F75 F90 Method diseq gatk hwe
Figure 2.1: RMSD values for simulations under the autopolyploid model with in- breeding for (a) estimated inbreeding coefficients and (b) estimated genotypes. Each individual plot within (a) and (b) displays the RMSD on the y-axis and inbreeding coefficients on the x-axis. Rows correspond with the depth of sequencing coverage (2x, 5x, 10x) and the columns correspond to the ploidy level (4, 6, 8). The different estimation methods (diseq, diseqCG, gatk, hwe) are represented by different shapes within each plot. (a) The RMSD of the inbreeding coefficient estimated by our model (diseq) is consistently the lowest across all depths of sequencing coverage, ploidy level, and level of inbreeding. (b) Genotypes estimated by our model are at least as accurate as the other methods and are not as affected by high or low levels of inbreeding.
RMSD values than the other methods, except when the inbreeding coefficient was
0.5, when the level of error was about the same. All of the methods except for Hardy
Weinberg showed low levels of estimation error once the depth of sequencing reached
10x. Figures B.1–B.3 show the results for all simulated depths of sequencing (2x to
40x) and sample sizes (25, 50, and 100 individuals).
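The error measure used throughout these comparisons is simple to compute. The following is a minimal sketch in Python, assuming paired lists of estimated and true values; the function name `rmsd` and the example numbers are hypothetical and are not taken from the software described here.

```python
import math


def rmsd(estimates, truths):
    """Root mean squared deviation between estimated and true values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: three individuals simulated with F = 0.5
example = rmsd([0.4, 0.5, 0.6], [0.5, 0.5, 0.5])
print(round(example, 4))
```

The same function applies to genotypes (counts of the alternative allele) as well as to inbreeding coefficients, since both are compared against the simulated true values.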
Our empirical Bayes approach for maximum a posteriori (MAP) genotype esti-
mation resulted in a similar overall pattern of lower estimation error for increased
sequencing coverage (Figure 2.1b). Interestingly, the other two methods for genotyping
(gatk, hwe) showed opposing patterns of accuracy: the GATK-like model increased
in accuracy with increasing levels of inbreeding but the Hardy Weinberg model had
decreasing accuracy. Genotypes called by our method showed some dependence on
the level of inbreeding with intermediate values having the most error. However, our
method was still the most accurate across the range of inbreeding values simulated.
Ploidy also had an impact on genotyping with higher ploidy levels having higher levels
of estimation error. This is largely due to the fact that higher ploidy individuals have
a larger number of possible values for the genotype and that the average sequencing
coverage per allele (chromosome) is lower (e.g., 10x coverage in a tetraploid is on aver-
age 2.5x per allele but is 1.25x in an octoploid). Once the depth of sequencing reached
10x, the only model that still showed a higher level of error was the Hardy Weinberg
model. Figures B.4–B.6 show the results for all simulated depths of sequencing (2x to
40x) and sample sizes (25, 50, and 100 individuals).
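The two effects of ploidy described above can be made concrete with a short calculation. This sketch assumes a biallelic SNP, so a genotype is the number of alternative alleles (0 through the ploidy level); the function names are illustrative only.

```python
def per_allele_coverage(depth, ploidy):
    """Average sequencing depth per allele copy (chromosome)."""
    return depth / ploidy


def n_genotype_values(ploidy):
    """Number of possible biallelic genotypes: 0, 1, ..., ploidy alternative alleles."""
    return ploidy + 1


for ploidy in (4, 6, 8):
    print(ploidy, per_allele_coverage(10, ploidy), n_genotype_values(ploidy))
```

At 10x coverage, this gives 2.5x per allele with five possible genotypes for a tetraploid, and 1.25x per allele with nine possible genotypes for an octoploid.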
Allopolyploid Model
Using the same general parameter settings as the simulations for the autopolyploid
model (except for inbreeding), we calculated genotype likelihoods by simulating read
data from genotypes generated under the model from Eq. (2.4). The ploidy of each
subgenome was as follows: tetraploids = diploid + diploid, hexaploid = diploid +
tetraploid, and octoploid = tetraploid + tetraploid. Our expectation maximization
Figure 2.2: RMSD values for full genotype estimation (combined number of alternative alleles in subgenomes one and two). Sequencing coverage is on the x-axis and RMSD values are on the y-axis. Each column represents a different ploidy level and the three methods used (allosnp, gatk, hwe) are represented by different shapes. For low levels of sequencing coverage, the allosnp and hwe models have much lower levels of estimation error when compared with the gatk model. The level of sequencing coverage required for the three methods to converge in error rate depends on the ploidy level, with tetraploids needing less coverage and octoploids needing more.
algorithm for this model was slow to converge, despite each maximization step taking
less time when compared with the autopolyploid model. Analyses never reached the
upper limit on the number of iterations (again 1000) but some analyses did not reach
convergence until over 900 iterations had been run. To make analyses with this model
more practical, we reanalyzed all simulated data sets using only 100 EM iterations
followed by direct maximization of the observed data log likelihood function in Eq.
(2.5) using Brent’s method (EM+Brent).
Comparing our model with other genotype priors (Hardy Weinberg, GATK) only
allowed us to consider the full genotype estimates from the different methods. Figure
2.2 shows the level of estimation error for each of the three genotyping methods for
each ploidy level across all depths of sequencing coverage. For low depths of sequencing,
genotyping with the GATK-like model resulted in high levels of error. As the depth
of coverage increased, the three methods converged. However, this was dependent
on the ploidy level: octoploids required a higher depth of sequencing for the GATK
model than tetraploids or hexaploids to achieve the same level of accuracy. The Hardy
Weinberg prior performed almost identically to our allopolyploid model, most likely as
a result of our assuming Hardy Weinberg within the subgenomes of the allopolyploid.
We also assessed the accuracy of the model for estimating parameters based on
the true values used for the simulations. Allele frequency estimates for subgenome
two improved as the number of individuals and sequencing coverage were increased
(Figure B.7). Tetraploids showed the highest estimation error for subgenome two
(diploid), followed by octoploids and hexaploids (tetraploid subgenomes), respectively.
This pattern with hexaploids and octoploids is counterintuitive considering that higher
ploidy levels typically result in better estimates of allele frequencies since more alleles
are sampled from the population (Blischak et al., 2016). However, the tetraploid
subgenomes in the hexaploid and octoploid individuals do not show similar levels
of error as would be expected. This is likely a result of subgenome one having
higher ploidy in the octoploid simulations, resulting in a larger number of possible
genotype combinations and therefore higher estimation error (octoploid: 5 × 5 = 25 vs.
hexaploid: 3 × 5 = 15). Figures B.8 and B.9 show the error in genotype estimation in
subgenome one and two, respectively. Here we again observe that higher ploidy levels
have higher levels of estimation error for genotypes. Overall, genotype estimates were
inferred with higher error for subgenome two. This result makes sense given that we
treat the allele frequencies for subgenome one as known but have to estimate them in
subgenome two.
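The counts of genotype combinations quoted above follow directly from the subgenome ploidies used in the simulations. A small sketch, assuming biallelic SNPs and using hypothetical names, enumerates them:

```python
from itertools import product

# Subgenome ploidies (subgenome one, subgenome two) as described in the text
SUBGENOME_PLOIDIES = {
    "tetraploid": (2, 2),
    "hexaploid": (2, 4),
    "octoploid": (4, 4),
}


def genotype_combinations(m1, m2):
    """All (g1, g2) pairs of alternative-allele counts in subgenomes one and two."""
    return list(product(range(m1 + 1), range(m2 + 1)))


for name, (m1, m2) in SUBGENOME_PLOIDIES.items():
    combos = genotype_combinations(m1, m2)
    # Several (g1, g2) pairs collapse onto the same full genotype g1 + g2
    full_genotypes = sorted({g1 + g2 for g1, g2 in combos})
    print(name, len(combos), full_genotypes)
```

The hexaploid yields 15 combinations and the octoploid 25, while the number of distinct full genotypes is only the total ploidy plus one, which is why several subgenome configurations are consistent with the same observed read counts.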
2.5.2 Empirical Data Analysis
Andropogon gerardii
Analyzing and filtering the data sets for hexaploid and nonaploid A. gerardii separately
resulted in slightly different numbers of loci (6N: 83 individuals, 6 928 loci; 9N: 70
individuals, 6 887 loci). The average depth of sequencing coverage was 10.9x for
hexaploids and 10.8x for nonaploids. Though levels of inbreeding for both cytotypes were low, nonaploids showed significantly higher levels of inbreeding than hexaploids
(Figure 2.3a; $F_{1,151} = 36.14$, $p = 1.3 \times 10^{-8}$).
Betula pubescens and B. pendula
The data set for the species of Betula consisted of 130 individuals for B. pubescens
and 34 individuals for B. pendula with genotype data for 49 021 loci. For B. pendula, we inferred allele frequencies and genotypes assuming Hardy Weinberg (HW), as well
as using our model for individual inbreeding coefficients. The log likelihoods of the
two models were very similar and most of the inbreeding coefficients were estimated
to be close to 0, so we used the allele frequency estimates from the HW model as
the reference panel for the allopolyploid model. After estimating the parameters of
this model for B. pubescens using the EM+Brent method, we assessed the accuracy
of our empirical Bayes genotype estimates by comparing them to the original data
set using the root mean squared deviation. This comparison is shown for each of the
possible genotype values (0–4) in Figure 2.3b. The left panel shows the RMSD for
each genotype value and the right panel shows a weighted measure of the RMSD that
corresponds to the relative amount of error based on the frequency of that genotype
in the original data set. For example, we do a poor job of estimating the genotype
when the true value is 0, but very few of the true genotypes have that value (∼0.5%),
so the relative contribution to the overall error is much less. In contrast, roughly
75% of the true genotypes have a value of 4, which is the value that we estimate the
best. In addition, many of the genotypes in B. pendula were equal to 2 (∼88%), so
the estimates of the allele frequencies were very close to 1.0, which could have led to
more error prone estimates of the genotypes in B. pubescens when using them as the
reference panel.
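The per-genotype comparison can be sketched as follows. The exact weighting used for the right panel of Figure 2.3b is not spelled out in this section, so the sketch assumes the simplest reading: the RMSD for each true genotype value multiplied by the frequency of that value in the original data set. All names and numbers are illustrative.

```python
import math
from collections import defaultdict


def rmsd_by_true_genotype(true_genotypes, estimated_genotypes):
    """Return {true value: (RMSD, frequency-weighted RMSD)}.

    Assumption: the weighted measure is RMSD times the relative
    frequency of the true genotype value.
    """
    squared_errors = defaultdict(list)
    for truth, estimate in zip(true_genotypes, estimated_genotypes):
        squared_errors[truth].append((estimate - truth) ** 2)
    total = len(true_genotypes)
    results = {}
    for value, errors in sorted(squared_errors.items()):
        value_rmsd = math.sqrt(sum(errors) / len(errors))
        frequency = len(errors) / total
        results[value] = (value_rmsd, value_rmsd * frequency)
    return results


# Hypothetical data: a rare genotype (0) estimated poorly, a common one (4) estimated well
print(rmsd_by_true_genotype([0, 4, 4, 4], [2, 4, 4, 3]))
```

Under this weighting, a genotype value that is estimated poorly but occurs rarely contributes little to the overall error, which is the pattern described for the value 0.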
Comparison with GATK
Variant calling and genotype estimation using GATK resulted in 14 931 shared SNPs
between B. pendula and B. pubescens after applying the following filters: biallelic sites
only, variant quality score (QUAL) greater than 30, minimum read depth (DP) per
individual per site of at least five, and a maximum of five missing individuals per
site. Analyzing these same sites for B. pendula using the Hardy Weinberg equilibrium
model produced genotype estimates that were 99.1% identical to the estimates from
GATK. Allele frequencies estimated by the Hardy Weinberg model had an RMSD value of 0.032 when compared to those estimated by GATK (Figure B.10). Similarly
for B. pubescens, genotype estimates combined from the allopolyploid subgenomes
resulted in full genotype estimates that were 96.2% identical to GATK. Most
of the differences in the estimated genotypes between the two methods were due
to the allopolyploid model inferring one fewer copy of the alternative allele compared
to GATK (Figure B.11). Run times between our models and GATK are not directly
comparable because the latter was used to identify all variants before filtering and it
also performs more steps than genotyping and parameter estimation. However, it is worth noting that the analyses with our models took approximately 3.5s and 43s for
Figure 2.3: Results of empirical data analyses. (a) Levels of inbreeding in Andropogon gerardii. Inbreeding in the two cytotypes of A. gerardii is generally low, but the nonaploid (9N) samples have higher levels of inbreeding on average. (b) Genotype estimation error in Betula pubescens. The left panel shows the RMSD values for each of the possible full genotypes (0–4; number of alternative alleles in subgenomes one and two). The right panel shows a relative measure of the RMSD where each value is weighted by the occurrence of the particular genotype in the data set (see text for details).
the Hardy Weinberg and allopolyploid models, respectively (measured using the Unix
time command).
2.6 Discussion
Genotyping individuals in a population can be an under-appreciated task,
even though it is typically the first step of any population genetic analysis. This is
especially true for populations of polyploids, where genotyping is further complicated
by duplicated chromosomes and their subsequent genome evolution. Until recently,
genotyping polyploids using high-throughput sequencing data was only possible in
model organisms with reference genomes and/or subgenomes. However, more re-
searchers have begun genotyping SNPs in both model and non-model organisms using whole genome resequencing and reduced representation methods such as restriction-
site associated DNA sequencing (RADseq) and its variants (e.g., Arnold et al., 2015;
Douglas et al., 2015; Cornille et al., 2016; Zohren et al., 2016). Most of these stud-
ies used already existing pieces of software to perform SNP calling and genotyping
[e.g., Genome Analysis Toolkit (McKenna et al., 2010), UNEAK (Lu et al., 2012),
TASSEL-GBS (Glaubitz et al., 2014)] but others used novel approaches for estimating
genotypes (e.g., Voorrips et al., 2011; Zohren et al., 2016; Maruki and Lynch, 2017).
A major caveat with these tools, however, is that many of them cannot estimate
inbreeding coefficients for arbitrary ploidy levels in autopolyploids, nor can they
separately estimate genotypes in the subgenomes of an allopolyploid. This is especially
important considering that ignoring the independence of allopolyploid subgenomes
can lead to biases in the estimation of heterozygosity when alternative alleles are
fixed in the individual subgenomes (fixed heterozygosity, Cornille et al., 2016). In
general, our models aim to incorporate more biologically realistic assumptions about
how population-level factors influence the distribution of genotypes in populations of
polyploids, which is critical when conducting population genetic studies in these taxa.
Furthermore, our approaches use genotype likelihoods and produce updated estimates of genotype probabilities given population parameters that can be used to propagate the uncertainty in calling genotypes in polyploids to downstream analyses such as estimating heterozygosity or population differentiation, rather than relying on called genotypes.
Though our models were accurate for many of our simulations and outperformed comparable methods at low depths of sequencing coverage, it is important to consider scenarios when their assumptions are inappropriate. One concern for autopolyploids is the occurrence of double reduction, a process by which alleles in the genotype are identical by descent due to the segregation of sister chromatids to the same gamete during meiosis (Haldane, 1930). As we mentioned before, our model does not directly estimate rates of double reduction. However, because double reduction leads to identity by descent, it contributes to deviations from Hardy Weinberg that are similar to inbreeding. Therefore, our model for individual inbreeding coefficients should be able to accommodate, but not specifically estimate, double reduction.
Allopolyploids present a different set of challenges that are a result of their hybrid origins. In our model, we assume that the two subgenomes of the allopolyploid are completely independent. However, homoeologous recombination can make this assumption inappropriate. Future work that models this exchange of alleles between subgenomes will be an important extension of the model we presented here. Another potential avenue would be to develop ways to use more parental information, as well as demographic parameters to account for the amount of divergence between the allopolyploid and its parents. Models that help to identify parental taxa will also be an important contribution for future research on allopolyploids.
2.7 Conclusions
As methods for the analysis of polyploid data continue to be developed, we are hopeful
that the barriers to more widespread study of these taxa will begin to drop. The
prevalence of polyploidy in plants and other groups of eukaryotes, including fish,
amphibians, and fungi, makes these methods fundamentally important for furthering
our understanding of the impact of WGD on genetic diversity (Rogers, 1973; Otto
and Whitton, 2000; Gregory and Mable, 2005; Wood et al., 2009). Of the main
problems that complicate population genetics in polyploids, modeling allelic inheritance
remains the most difficult. Overall, we believe that using genotype likelihoods when
studying polyploids to overcome difficulties in determining allele copy number and for
dealing with low-coverage sequencing data is a promising approach for future model
development.
2.8 Acknowledgements
The authors thank members of the Wolfe lab, B. Berger, M. Fumagalli, and
three anonymous reviewers for their helpful comments on this manuscript. We also
thank J. Zohren and her colleagues, C. McAllister, and A. Miller for making their
data sets publicly available. Early versions of these models were presented to X. He,
J. Novembre, M. Stephens, and their lab members, and we thank them for their
constructive feedback and advice (especially regarding the use of EM algorithms).
Chapter 3: Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons
Publication Information
This chapter is formatted for this dissertation from the following publication:
Blischak, P. D., M. Latvis, D. F. Morales-Briones, J. C. Johnson, V. S. Di Stilio, A.
D. Wolfe, and D. C. Tank. Fluidigm2PURC: Automated Processing and Haplotype
Inference for Double-Barcoded PCR Amplicons. Applications in Plant Sciences, 6:e1156,
2018.
3.1 Abstract
Premise of the study: Targeted enrichment strategies for phylogenomic inference are
a time- and cost-efficient way to collect DNA sequence data for large numbers of
individuals at multiple, independent loci. Automated and reproducible processing of
these data is a crucial step for researchers conducting phylogenetic studies.
Methods and Results: We present Fluidigm2PURC, an open source Python utility for processing
paired-end Illumina data from double-barcoded PCR amplicons. In combination with
the program PURC (Pipeline for Untangling Reticulate Complexes), our scripts process
raw FASTQ files for analysis with PURC and use its output to infer haplotypes for
diploids, polyploids, and samples with unknown ploidy. We demonstrate the use of
the pipeline with an example data set from the genus Thalictrum L. (Ranunculaceae).
Conclusions: Fluidigm2PURC is freely available for Unix-like operating systems
on GitHub [https://github.com/pblischak/fluidigm2purc] and for all operating
systems through Docker [https://hub.docker.com/r/pblischak/fluidigm2purc].
3.2 Introduction
The collection of large-scale, multilocus data sets for phylogenomic inference has
become an increasingly common method for understanding evolutionary relationships within a group of taxa. Coupled with recent implementations of coalescent-based
species tree estimation programs that take into account the independent histories
of different genes (e.g., SVDquartets, Chifman and Kubatko (2014); ASTRAL-II,
Mirarab and Warnow (2015)), targeted enrichment strategies are powerful methods for
collecting more informative data sets for conducting phylogenomic investigations. Of
the many types of targeted enrichment that exist, several recent studies have begun
to use a method that combines both library preparation and target amplification
into a single step. This process, known as double-barcoded amplicon sequencing
(Uribe-Convers et al., 2016), allows for the collection of multilocus sequence data for
large numbers of individuals that is both time- and cost-effective.
Double-barcoded amplicon sequencing combines the amplification of a targeted
region in the genome with the addition of sample-specific barcodes and Illumina
sequencing adapters to the resulting PCR product for paired-end sequencing on
an Illumina MiSeq platform (Uribe-Convers et al., 2016). This is done by adding
conserved sequence (CS) tags to traditional PCR primers, which act as templates
for adding barcodes and adapters when preparing the sequencing library. Parallel
amplification is most often achieved using microfluidic PCR with the Fluidigm Access
Array (Fluidigm, San Francisco, CA, USA; e.g., Gostel et al., 2015; Uribe-Convers et al., 2016; Kates et al., 2017), allowing for multiple samples and loci to be amplified
simultaneously (minimum of 48 samples x 48 loci). The newer Fluidigm Juno system
can also handle up to 192 samples in a single run, and multiplexing of primer pairs can
allow for even higher throughput, provided that the primers do not interact during
amplification. Double-barcoded amplicons can also be generated by other means using
approaches such as traditional or highly-multiplexed PCR (e.g., Bybee et al., 2011;
Dupuis et al., 2017).
Previous methods to analyze these data have typically relied on generating consen-
sus sequences using software packages such as Geneious (Kearse et al., 2012; Gostel et al., 2015), HiMAP (Dupuis et al., 2017), or an R script, reduce_amplicons.R, that
is part of the dbcAmplicons package (but see comparison with “occurrence-based”
methods in dbcAmplicons in the Example Analyses section; Uribe-Convers et al.,
2016; Kates et al., 2017). However, using consensus sequences can often ignore impor-
tant within-individual level variation, such as differing alleles or levels of ploidy. To
alleviate this issue, and to facilitate the analysis of these data for haplotype inference, we developed Fluidigm2PURC. Fluidigm2PURC consists of two main Python scripts
that process input data files using several external programs (Table 3.1) that automate
quality filtering, read merging, and file formatting for downstream steps (Figure 3.1).
Although it can be used to process any double-barcoded amplicons, the software
derives its name from the method of PCR amplification that we used to generate our
data (Fluidigm Access Array), as well as its primary dependency, PURC, a Python
program that combines sequence clustering and PCR chimera detection (Rothfels et al.,
2017). The final step in the Fluidigm2PURC pipeline processes clusters from PURC
and outputs a FASTA file containing phased haplotypes for all targeted sequences.
This last step has methods for haplotype inference that work on diploids, polyploids,
individuals with unknown ploidy, or any mixture of the three. To demonstrate the util-
ity of Fluidigm2PURC, we analyzed nuclear amplicon data from the genus Thalictrum
L. (Ranunculaceae) and compared the results with those obtained from dbcAmplicons
using the reduce_amplicons.R script (Uribe-Convers et al., 2016).
3.3 Methods and Results
3.3.1 Input data
The input data for Fluidigm2PURC are paired-end FASTQ files (R1 and R2 for paired
reads) that have been demultiplexed using the program dbcAmplicons (Uribe-Convers et al., 2016). dbcAmplicons demultiplexes reads using the original sample barcodes
and amplicon primer sequences to annotate the reads with the sample and locus
name that each read comes from, followed by trimming these identifying parts of the
sequence. The resulting pair of FASTQ files is then input into the first script in the
pipeline, fluidigm2purc.
3.3.2 Step 1: fluidigm2purc
The fluidigm2purc script takes the paired-end FASTQ files, filters them using Sickle
(Joshi and Fash, 2011, minimum length = 100bp, PHRED threshold = 20), merges
the filtered reads using FLASH2 (Magoč and Salzberg, 2011), and then converts the
resulting FASTQ files into FASTA files (one for each locus) with sequence header
information that is compatible with PURC. The sequence headers for PURC follow
Figure 3.1 (flowchart contents):
1. fluidigm2purc: filter and trim reads using Sickle; join paired reads using FLASH2; convert reads to PURC format and write FASTA files for each locus.
2. PURC (purc_recluster.py): detect chimeras and cluster reads for each taxon at each locus using USEARCH; annotate resulting clusters with size information; combine clusters for each locus across taxa and align with MUSCLE.
3. crunch_clusters: (a) known ploidy levels: add ploidy levels to the taxon table and infer haplotypes and their dosage; (b) unknown ploidy levels: infer real haplotypes versus sequencing errors using a likelihood cutoff; realign, clean, and return consensus or unique haplotypes.
Figure 3.1: Flowchart outlining the steps for haplotype inference using Fluidigm2PURC.
Dependency           Citation                    Link
PURC (v1.02)         Rothfels et al. (2017)      https://bitbucket.org/crothfels/purc
Sickle (v1.33)       Joshi and Fash (2011)       https://github.com/najoshi/sickle
FLASH2 (v2.2.00)     Magoč and Salzberg (2011)   https://github.com/dstreett/FLASH2
MAFFT (v7.237)       Katoh (2013)                http://mafft.cbrc.jp/alignment/software/
Phyutility (v2.7.1)  Smith and Dunn (2008)       https://github.com/blackrim/phyutility
Table 3.1: Dependencies for the Fluidigm2PURC pipeline with version numbers in parentheses.
the format ‘>IndividualName|LocusName|UniqueID#’. When paired reads with
low quality bases are trimmed by Sickle and no longer overlap, we merge them
artificially with multiple N’s inserted between them. The fluidigm2purc script writes
two additional files: (1) the taxon table, a two-column table listing each sequenced
taxon and its ploidy level, and (2) the locus-err table, a two-column table listing each
sequenced locus and the average level of sequencing error for all reads coming from
that locus. The taxon table lists the ploidy as “None” for all individuals by default,
but known ploidy levels can be included by the user (e.g., diploid has the value “2,”
tetraploid has the value “4,” etc.). For the locus-err table, the per locus levels of
sequencing error are calculated individually from the input FASTQ files using the
average PHRED score per read averaged across all reads coming from that locus.
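To illustrate how a per-locus error level could be derived from quality scores, the sketch below averages PHRED scores within each read, averages those values across reads, and converts the result to an error probability with the standard relation p = 10^(-Q/10). It assumes Phred+33 encoded quality strings and that the conversion is applied after averaging; both are assumptions made for illustration, and the function names are hypothetical rather than part of fluidigm2purc.

```python
def mean_phred(quality_string, offset=33):
    """Average PHRED score of one read (assumes Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def locus_error(quality_strings, offset=33):
    """Average per-read PHRED score across reads, converted to an error probability."""
    per_read = [mean_phred(q, offset) for q in quality_strings]
    mean_q = sum(per_read) / len(per_read)
    # Standard PHRED definition: Q = -10 * log10(p)
    return 10 ** (-mean_q / 10)


# Hypothetical reads for one locus: 'I' encodes Q40 and '5' encodes Q20
print(locus_error(["IIII", "5555"]))
```

With one read at Q40 and one at Q20, the mean score is Q30, which corresponds to an error probability of about 0.001.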
3.3.3 Step 2: PURC
The output FASTA files from fluidigm2purc can be run through PURC using the
purc_recluster.py script (Rothfels et al., 2017). This script is used to iteratively
run chimera detection and sequence clustering (performed with USEARCH; Edgar,
2010; Edgar et al., 2011) on each locus individually to produce a reduced set of
putative haplotypes that includes size information about the number of original reads
forming each cluster. Details on running PURC can be found on its Bitbucket page
[https://bitbucket.org/crothfels/purc].
3.3.4 Step 3: crunch_clusters
The clusters output by PURC are then run through our second script, crunch_clusters, which uses the taxon table and locus-err table output by fluidigm2purc (Step 1) to
infer haplotypes in a maximum likelihood framework. This script also has options for
realigning clusters using MAFFT (Katoh, 2013), as well as cleaning the clusters using
Phyutility (Smith and Dunn, 2008). Before haplotypes can be inferred at a locus, we
first do a pairwise comparison of all clusters for each taxon individually and merge any
clusters that are identical (ignoring gaps). This step is necessary because of the initial
trimming/filtering step in the fluidigm2purc script. Artificially joining unmerged reads
often causes two sequencing clusters representing the same haplotype to form: (1)
one cluster for reads that were merged, and (2) one cluster for the reads that were
artificially merged and contain a large number of gapped sites in the middle. In this
case, these two clusters should not be treated as separate haplotypes, so we combine
the clusters by keeping the larger haplotype (i.e., the one with fewer gaps) and adding
the sizes of the two clusters together. The alternative would be to process the original
data by ignoring all reads that did not merge. However, throwing away unmerged
reads could potentially discard sequence variation that should be represented in the
data set, especially if most reads are unmerged, which may be the case for large
amplicons. The downside of merging sequences that are identical except for gaps is
that it potentially discards informative indel variation, although it is unlikely that
a locus within an individual would have only one of its haplotypes containing gaps
and not the others. Overall, we felt that this approach provided the best method for
including more of the original data when inferring haplotypes.
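The merging step described above can be sketched as a pairwise comparison of aligned cluster sequences. This is an illustrative sketch only, not the crunch_clusters implementation: it assumes each cluster is an (aligned sequence, size) pair for one taxon at one locus, that gaps are written as "-", and it merges greedily in input order.

```python
def same_ignoring_gaps(seq_a, seq_b, gap_chars="-"):
    """True if two aligned sequences agree at every column where neither has a gap."""
    if len(seq_a) != len(seq_b):
        return False
    return all(
        a == b or a in gap_chars or b in gap_chars
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


def n_gaps(seq, gap_chars="-"):
    """Number of gap characters in a sequence."""
    return sum(ch in gap_chars for ch in seq)


def merge_identical_clusters(clusters, gap_chars="-"):
    """Merge clusters that are identical ignoring gaps.

    Keeps the sequence with fewer gaps and adds the cluster sizes together.
    """
    merged = []
    for seq, size in clusters:
        for i, (kept_seq, kept_size) in enumerate(merged):
            if same_ignoring_gaps(seq, kept_seq, gap_chars):
                fewer = n_gaps(kept_seq, gap_chars) <= n_gaps(seq, gap_chars)
                merged[i] = (kept_seq if fewer else seq, kept_size + size)
                break
        else:
            merged.append((seq, size))
    # Return clusters ordered from largest to smallest
    return sorted(merged, key=lambda cluster: cluster[1], reverse=True)


# Hypothetical clusters: the second is the first with a gapped middle region
print(merge_identical_clusters([("ACGTACGT", 50), ("ACG--CGT", 30), ("ACGTACGA", 20)]))
```

Because "identical ignoring gaps" is not a transitive relation, the greedy pass here is a simplification; a cluster with many gaps could match more than one gap-free haplotype.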
Inferring haplotypes with ploidy information
For known ploidy levels, we use a multinomial likelihood to determine the number
of copies of each potential haplotype using the ordered cluster sizes returned by
PURC (largest to smallest). Given an individual of ploidy level K, we enumerate the
possible haplotype configurations using integer partitions (an unordered
collection of positive integers that sums to K; Stojmenovic and Zoghbi, 1998). Since the cluster
sizes are sorted, we never need to consider more than the K largest clusters.
For example, a tetraploid can have a maximum of four haplotypes, and the integer
partitions to consider are (4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0), (2, 1, 1, 0), and (1, 1, 1, 1).
This corresponds to (4 copies of haplotype one), (3 copies of haplotype one, 1 copy
of haplotype two), (2 copies of haplotype one, 2 copies of haplotype two), (2 copies
of haplotype one, 1 copy of haplotype two, 1 copy of haplotype three), and (1 copy
of haplotype one, 1 copy of haplotype two, 1 copy of haplotype three, 1 copy of
haplotype four). The mathematical details for the likelihood function with an example
calculation are presented in Appendix C. Once the most likely configuration has been
identified, the crunch_clusters script will return each haplotype in proportion to its
representation in the maximum likelihood estimate. We have also provided options to
return only unique haplotypes and to treat loci as haploid, the latter of which can be
used to process organellar data. The haploid option can also be used as an alternative
to finding consensus sequences for nuclear loci by returning only the cluster with the
most reads.
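The enumeration of configurations can be sketched in a few lines of Python. The partition generator reproduces the tetraploid example given above. The likelihood in this sketch is a stand-in, since the actual function is given in Appendix C and is not reproduced here: it assumes that reads are multinomially distributed with expected proportion (copies / K)(1 - e) + e / K for each of the K largest clusters, where e is a per-locus error rate strictly between 0 and 1. All function names are hypothetical.

```python
import math


def integer_partitions(k, largest=None):
    """Yield the partitions of k as non-increasing tuples of positive integers."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, largest), 0, -1):
        for rest in integer_partitions(k - first, first):
            yield (first,) + rest


def padded_partitions(k):
    """Partitions of k padded with zeros to length k, e.g. (3, 1, 0, 0)."""
    return [p + (0,) * (k - len(p)) for p in integer_partitions(k)]


def log_likelihood(config, sizes, error_rate):
    """Assumed multinomial log likelihood (constant term dropped).

    error_rate must lie strictly between 0 and 1 so that every
    expected proportion is positive.
    """
    k = len(config)
    total = 0.0
    for copies, n_reads in zip(config, sizes):
        proportion = (copies / k) * (1 - error_rate) + error_rate / k
        total += n_reads * math.log(proportion)
    return total


def most_likely_configuration(cluster_sizes, ploidy, error_rate=0.01):
    """Pick the haplotype dosage configuration with the highest assumed likelihood."""
    sizes = sorted(cluster_sizes, reverse=True)[:ploidy]
    sizes += [0] * (ploidy - len(sizes))
    return max(
        padded_partitions(ploidy),
        key=lambda config: log_likelihood(config, sizes, error_rate),
    )


print(padded_partitions(4))
print(most_likely_configuration([60, 58, 3], ploidy=4))
```

For a hypothetical tetraploid with cluster sizes 60, 58, and 3, this stand-in likelihood favors two copies each of the two largest haplotypes and treats the small third cluster as error.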
Inferring haplotypes without ploidy information
For unknown ploidy levels, we no longer have information about the maximum number
of haplotypes that an individual can have. However, we can use the cluster sizes to
infer which clusters from PURC are actual haplotypes versus those that are likely to
be sequencing errors. We do this by calculating the likelihood that each successive
haplotype in the sorted list is a “real” haplotype versus a sequencing error. As an
example, consider a tetraploid with six clusters identified by PURC. We first calculate
the likelihood that all clusters are errors. Then we calculate the likelihood that
cluster one is a real haplotype, and two through six are errors. Next, we calculate the
likelihood that clusters one and two are real haplotypes, and that three through six
are errors. This continues until we calculate the likelihood of all six clusters being real
haplotypes. We then apply a cutoff that uses the relative increase in the likelihood when an additional haplotype is added. If treating an additional cluster as a haplotype
increases the likelihood by less than the cutoff then only the previous haplotypes are
kept and the others are considered errors. We use a default cutoff of 10% increase in
the likelihood. An example with the likelihood function that we use for this approach
is provided in Appendix C.
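The stopping rule can be illustrated separately from the likelihood itself. The sketch below takes a list of already-computed values, where entry k is the likelihood (or log likelihood) obtained when the k largest clusters are treated as real haplotypes, and stops at the first step whose relative increase falls below the cutoff. Defining the relative increase as (current - previous) / |previous| is an assumption made for this illustration; the likelihood function is the one given in Appendix C and is not reproduced here.

```python
def n_real_haplotypes(scores, cutoff=0.10):
    """Number of clusters kept as real haplotypes under a relative-increase cutoff.

    scores[k] is the (log) likelihood with the k largest clusters treated
    as real haplotypes; scores[0] treats every cluster as an error.
    """
    kept = 0
    for k in range(1, len(scores)):
        previous, current = scores[k - 1], scores[k]
        if previous == 0:
            raise ValueError("relative increase is undefined for a score of zero")
        relative_increase = (current - previous) / abs(previous)
        if relative_increase < cutoff:
            break  # keep only the previous haplotypes
        kept = k
    return kept


# Hypothetical values: large gains for the first two clusters, little after that
print(n_real_haplotypes([-1000.0, -400.0, -150.0, -145.0, -144.0]))
```

In this hypothetical case the first two clusters each improve the score by more than 10%, the third does not, and so two haplotypes are retained.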
Species                                          Ploidy Level              Collection Information
T. thalictroides: (L.) A. J. Eames & B. Boivin   2N=2X=14                  V. Di Stilio, 123, WTU
T. squarrosum: Stephan ex Willd.                 2N=6X=42                  V. Di Stilio & X. Duan Thalictrum sp#8, 20120617, PE
T. macrostylum: Shuttlew. ex Small & A. Heller   2N=8X=56                  R. Penny (unvouchered)
T. pubescens: Pursh                              2N=12X=84 or 2N=22X=154   D. Baum & D. Howarth, 375, A
T. revolutum: DC.                                2N=20X=144                V. Soza, 1917, WTU
T. dasycarpum: Fisch., C. A. Mey. & Avé-Lall.    2N=22X=154                V. Di Stilio, 110, WTU
Table 3.2: Thalictrum L. species included in the comparison of Fluidigm2PURC and dbcAmplicons. Collection information is listed as the collector(s), collection number, and the herbarium. All additional information is available from Soza et al. (2013), Tables S1, S3, and S4. For all analyses, T. pubescens was analyzed at the 22X level.
3.3.5 Example analysis
To demonstrate the use of the Fluidigm2PURC pipeline, we analyzed amplicon
sequence data generated from orthologs of the nuclear gene PISTILLATA (PI) in the
genus Thalictrum L. (Ranunculaceae), which is single copy in diploids and two-copy in
tetraploids (Di Stilio et al., 2005). PI is responsible for establishing stamen and petal
identity during flower development in Arabidopsis thaliana (Goto and Meyerowitz,
1994), and has been used to detect reticulation in polyploid Lepidium L. (Brassicaceae)
(Lee et al., 2002; Soza et al., 2014). Given the length of the PI locus, primers were
designed to sequence exons 3 to 6 in two overlapping 600-bp segments: exons 3 to 5
(PIS_4) and exons 4 to 6 (PIS_3). Our analyses focused on six species with known
ploidy levels ranging from diploid (2N=2X=14) to 22-ploid (2N=22X=154). These
species are presented in Table 3.2, with accession numbers following Soza et al. (2013).
Paired-end reads were demultiplexed and annotated using dbcAmplicons (Uribe-
Convers et al., 2016) followed by read trimming, merging, and sequence renaming
using the fluidigm2purc script with default options. All reads coming from PIS_3
and PIS_4 were then run separately through PURC using the purc_recluster.py
script (Rothfels et al., 2017). After clustering and chimera detection, we determined
haplotypes for each amplicon using three different approaches: (1) consensus sequences
using the ‘--haploid’ option, (2) unique haplotypes assuming unknown ploidy (10%
likelihood cutoff), and (3) unique haplotypes using known ploidy. For each of these
methods, we realigned and cleaned the sequences using MAFFT (Katoh, 2013) and
Phyutility (Smith and Dunn, 2008) by adding the options ‘--realign --clean 0.33’.
As a comparison, we also analyzed these data using the reduce_amplicons.R script
from the dbcAmplicons package (v0.8.5; Uribe-Convers et al., 2016). This script
merges paired-end reads using FLASH2 (Magoč and Salzberg, 2011) and allows for a
global read trimming size to be used for read one, read two, or both. Unmerged reads
are treated independently, resulting in separate haplotypes for read one and read two.
The final result is a FASTA file with the unaligned haplotypes that can be further
processed for downstream applications. We generated consensus haplotypes as well as
haplotypes based on read occurrence (controlled by the minimum read frequency and
minimum read count) using the default settings, and trimmed 20 bp from read one and
40 bp from read two. We then aligned the resulting sequences using MAFFT (Katoh,
2013). These results were compared to the haplotypes from Fluidigm2PURC based
on (1) the number of recovered haplotypes, (2) the length of the resulting alignment,
and (3) the amount of gaps in the alignment.
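The three comparison criteria can be computed directly from an aligned FASTA file. The following is a minimal sketch under stated assumptions, not the procedure used by the alignment software in this study: the function names `read_fasta` and `alignment_stats` are hypothetical, and percent gaps is assumed to mean the share of gap characters among all cells of the alignment.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def alignment_stats(seqs):
    """Number of haplotypes, alignment length, and percent gaps.

    Assumption: percent gaps = 100 * (number of '-' characters) /
    (alignment length * number of sequences).
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    aln_len = lengths.pop()
    n_gaps = sum(s.count("-") for s in seqs.values())
    return {
        "n_haplotypes": len(seqs),
        "alignment_length": aln_len,
        "percent_gaps": 100.0 * n_gaps / (aln_len * len(seqs)),
    }


# Toy alignment with two haplotypes and one gap in eight cells.
stats = alignment_stats(read_fasta([">h1", "AC-T", ">h2", "ACGT"]))
```

Other tools may define percent gaps differently (for example, per column), so values from this sketch are not guaranteed to match those reported in Table 3.3.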
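Method | Statistic | PIS_3 Fluidigm2PURC | PIS_3 reduce_amplicons | PIS_4 Fluidigm2PURC | PIS_4 reduce_amplicons
Consensus | Alignment length (bp) | 395 | 415 | 403 | 418
Consensus | Percent gaps | 27.9 | 30.0 | 1.9 | 4.4
Consensus | P.I.S. (parsimony informative sites) | 19 | 19 | 11 | 10
Unknown Ploidy/Occurrence | Number of haplotypes | 18 | 5 | 14 | 6
Unknown Ploidy/Occurrence | Alignment length (bp) | 395 | 403 | 403 | 424
Unknown Ploidy/Occurrence | Percent gaps | 16.2 | 48.9 | 2.7 | 7.1
Unknown Ploidy/Occurrence | P.I.S. | 48 | 11 | 43 | 10
Known Ploidy | Number of haplotypes | 57 | – | 43 | –
Known Ploidy | Alignment length (bp) | 395 | – | 403 | –
Known Ploidy | Percent gaps | 10.3 | – | 2.6 | –
Known Ploidy | P.I.S. | 81 | – | 62 | –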
Table 3.3: Overall alignment statistics for the comparison between Fluidigm2PURC and the reduce_amplicons.R script.
Results
Haplotypes inferred by both methods were visualized and compared using alignment
statistics computed in Geneious v8.1.8 (Kearse et al., 2012) and MEGA v7.0.18 (Kumar et al., 2016). Consensus sequences from the Fluidigm2PURC and dbcAmplicons
pipelines were similar overall, with the reduce_amplicons.R script producing longer
haplotypes, but containing more gaps (Table 3.3). We then compared the occurrence-
based method from the reduce_amplicons.R script with the crunch_clusters results when ploidy levels are treated as unknown. In this case, Fluidigm2PURC recovered
more haplotypes with fewer gaps and more parsimony informative sites. We believe
that the reduce_amplicons.R script recovered so few haplotypes because its
minimum read count and frequency criteria rely on reads being identical to
form haplotypes, rather than clustering based on similarity. Inferring haplotypes with
Fluidigm2PURC using known ploidy levels resulted in the largest number of recovered
haplotypes. Using known rather than unknown ploidy levels produced
more haplotypes (PIS_3: 57 vs. 18, PIS_4: 43 vs. 14) because the cluster
sizes that went into the likelihood calculation were disparate for some species (a few
large clusters and many smaller ones), making the smaller clusters difficult to model
when the ploidy level was unknown due to lack of prior knowledge about how many
haplotypes should be expected. On a per species basis, using known ploidy levels
never led to fewer inferred haplotypes, and usually led to more (Table 3.4). For example, the PIS_4 region for
Thalictrum pubescens recovered 15 haplotypes when assuming known ploidy (analyzed
as 22X), but only one haplotype when assuming unknown ploidy. The reason for this
is that the cluster data for this species had one putative haplotype with many reads
(147), but all other putative haplotypes had far fewer reads (the next largest cluster
had 25 reads, and nine clusters had fewer than 10 reads). In general, drawing the line
between real haplotypes and errors for clusters with lower read counts is difficult when
the ploidy level is unknown. Because it applies a threshold, the method we implement is a
conservative way to estimate haplotypes that only includes clusters with the highest
read counts.
Code and Data Availability
The code for each step of our example analysis is available in Appendix C. Raw
sequence data from the PIS_3 and PIS_4 loci for the six sampled Thalictrum
species, as well as all output FASTA files from the Fluidigm2PURC and dbcAmplicons
pipelines, are available on Dryad.
Species | Ploidy | PIS_3 Known | PIS_3 Unknown | PIS_4 Known | PIS_4 Unknown
T. thalictroides | 2N=2X=14 | 1 (2.3) | 1 (2.3) | 1 (0.8) | 1 (0.8)
T. squarrosum | 2N=6X=42 | 6 (3.3) | 4 (3.1) | 4 (3.9) | 3 (3.2)
T. macrostylum | 2N=8X=56 | 8 (13.9) | 5 (20.9) | 4 (3.3) | 4 (3.3)
T. pubescens | 2N=12X=84 or 2N=22X=154 | 12X: 11 (3.7); 22X: 19 (3.3) | 1 (1.5) | 12X: 9 (2.3); 22X: 15 (2.3) | 1 (1.2)
T. revolutum | 2N=20X=140 | 10 (26.9) | 4 (40.9) | 5 (2.5) | 3 (2.9)
T. dasycarpum | 2N=22X=154 | 13 (9.2) | 3 (2.8) | 14 (2.4) | 2 (2.1)
Table 3.4: Per species data for the number of haplotypes inferred by Fluidigm2PURC using known vs. unknown ploidy. Data are presented as: number of inferred haplotypes (average percent gaps per haplotype). For Thalictrum pubescens, haplotypes are presented at both the 12X and 22X level.
3.4 Conclusions
The ability to infer haplotypes regardless of an individual’s ploidy level is a crucial
step toward understanding the complex relationships within many plant groups, whose evolutionary histories often contain multiple instances of hybridization and whole genome duplication (Soltis and Soltis, 2009; Van de Peer et al., 2009). As
models that accommodate these processes continue to be developed (e.g., Jones et al.,
2013; Solís-Lemus and Ané, 2016; Oberprieler et al., 2017; Thomas et al., 2017; Wen
and Nakhleh, 2018), we anticipate that the functionality of our pipeline will be
especially useful for conducting phylogenomic studies with nuclear sequence data.
Furthermore, the increase in genomic resources for taxa across the Plant Tree of Life will continue to facilitate the process of phylogenetic marker development, allowing
more researchers to take advantage of targeted enrichment strategies such as double-
barcoded amplicon sequencing. Compared with existing approaches for analyzing
these data, the methods we present here offer an improved workflow for sequence
processing, clustering, and haplotype inference, and are particularly well suited for
analyses in taxa with incomplete knowledge about ploidy levels.
3.5 Availability
Fluidigm2PURC is open source software that is freely available on GitHub
[https://github.com/pblischak/fluidigm2purc] for Unix-like operating systems (Mac,
Linux) under the GNU General Public License v3. We have also built a Docker
image with all dependencies (Table 3.1) pre-installed for use on any operating system
with a compatible distribution of the Docker software
[https://hub.docker.com/r/pblischak/fluidigm2purc] (https://www.docker.com; Merkel, 2014).
Fluidigm2PURC is written in Python and has been successfully tested using Python versions 2.7 and 3.6. Documentation for the software can be found on ReadTheDocs
[http://fluidigm2purc.readthedocs.io].
3.6 Acknowledgements
The authors thank C. Rothfels for helpful discussions regarding the use of PURC.
We also thank S. Uribe-Convers for providing valuable feedback while testing the
Fluidigm2PURC code. This work was supported by the following grants from the
National Science Foundation (NSF): DEB-1455399 (ADW, L. S. Kubatko), DEB-
1253463 (DCT), and IOS-1121669 (VSD), with additional support to DCT and VSD
from the NSF BEACON Center for the Study of Evolution in Action (DBI-0939454).
Chapter 4: Inferring Species Trees and Networks from Gene Tree Quartet Site Patterns: An Example from the Plant Genus Penstemon (Plantaginaceae)
4.1 Abstract
Reticulate evolutionary events are hallmarks of plant phylogeny, and are increasingly
recognized as common occurrences in other branches of the Tree of Life. However,
inferring the evolutionary history of admixed lineages presents a difficult challenge for
systematists due to genealogical discordance caused by both incomplete lineage sorting
(ILS) and hybridization. Methods that accommodate both of these processes are
continuing to be developed, but they often do not scale well to larger numbers of species.
An additional complicating factor for many plant species is the occurrence of whole
genome duplication (WGD), which can have various outcomes on the genealogical
history of haplotypes sampled from the genome. In this study, we sought to investigate
patterns of hybridization and WGD in two subsections from the genus Penstemon
(Plantaginaceae; subsect. Humiles and Proceri), a speciose group of angiosperms that
has rapidly radiated across North America. Species in subsect. Humiles and Proceri
occur primarily in the Pacific Northwest of the USA, occupying habitats such as mesic,
subalpine meadows, as well as more well-drained substrates at varying elevations.
Ploidy levels in the subsections range from diploid to hexaploid, and it is hypothesized
that most of the polyploids are hybrids (i.e., allopolyploids). To estimate phylogeny in these groups, we first developed a method for estimating quartet concordance factors (QCFs) from multiple sequences sampled per lineage, allowing us to model all haplotypes from a polyploid. QCFs represent the proportion of gene trees that support a particular species quartet relationship, and are used for species network estimation in the program SNaQ (Solís-Lemus & Ané. 2016. PLoS Genet. 12:e1005896). Using phased haplotypes for nuclear amplicons, we inferred species trees and networks for 38 taxa from P. subsect. Humiles and Proceri. Our phylogenetic analyses recovered two clades comprising a mix of taxa from both subsections, indicating that the current taxonomy for these groups is inconsistent with our estimates of phylogeny. In addition, there was little support for hypotheses regarding the formation of putative allopolyploid lineages. Overall, we found evidence for the effects of both ILS and admixture on the evolutionary history of these species, but were able to evaluate our taxonomic hypotheses despite high levels of gene tree discordance. Our method for estimating
QCFs from multiple haplotypes also allowed us to include species of varying ploidy levels in our analyses, which we anticipate will help to facilitate estimation of species networks in other plant groups as well.
4.2 Introduction
Phylogenetic inference with multiple gene sequences has emerged as a dominant paradigm in systematics, with multilocus datasets ranging in size from just a handful of genes, to thousands of loci pulled from whole genomes. Discordant signals from these different gene regions can often be present, however, raising the issue of how to model the incongruence among the sampled gene trees from the underlying species
tree. The multispecies coalescent (MSC) model is one approach for species tree
estimation from multilocus data that can accommodate gene tree discordance caused
by incomplete lineage sorting (ILS) (reviewed in Degnan and Rosenberg, 2009). The
appeal of the MSC model stems from its connection with concepts in population
genetics (Wright-Fisher model; Kingman, 1982), and its explicit predictions regarding
the amount of gene tree discordance that should be present for a given species tree
(Tavaré, 1984; Pamilo and Nei, 1988; Takahata, 1989). Nevertheless, despite the
popularity of the coalescent model, it has been shown that it can be a poor fit to
empirical data sets (Reid et al., 2014; Gruenstaeudl et al., 2015). A potential reason
for the poor performance of the MSC in empirical data is that it only models ILS,
leaving other processes that generate genealogical discordance, such as gene flow and
hybridization, unaccounted for (Maddison, 1997).
An alternative to using the coalescent to model gene tree discordance is to use
the concept of a “gene to tree map,” wherein gene tree topologies are mapped to
possible species tree topologies without assuming an underlying process. This was
the approach taken by Ané et al. (2007), who used a Bayesian framework to estimate
a species tree by maximizing gene tree concordance. Implemented in the software
BUCKy (Larget et al., 2010), this method relies on the concept of concordance factors,
or the proportion of gene trees for which a given bipartition is true (Baum, 2007).
The resulting phylogenetic estimate is referred to as the primary concordance tree
(PCT), and can be estimated even if ILS is not the only process affecting gene tree
incongruence. Larget et al. (2010) also introduced the concept of a population tree, which uses the average concordance factors for all quartets on an internal branch
of the PCT to calculate branch lengths in coalescent units. Because concordance
factors contain information about both ILS and gene flow, Solís-Lemus and Ané (2016)
developed a method for estimating species networks (species trees with reticulate
edges) from concordance factors estimated for quartets of species. Their method,
called SNaQ (Species Networks applying Quartets), uses these quartet concordance
factors (QCFs) to maximize a pseudolikelihood function that matches the expected
QCF values under the coalescent model with hybridization (Meng and Kubatko, 2009)
and the observed QCFs.
Despite the availability and success of the SNaQ method, there remain several areas where we believe the estimation of the QCF input can be improved. First, estimating
QCFs for multiple individuals or haplotypes per species is not easily accomplished
using BUCKy or the methods available in the PhyloNetworks package that implements
the SNaQ method. This is problematic not only because having multiple alleles
sampled from a population can increase phylogenetic resolution (Andermann et al.,
2018), but also because many hybrid plant lineages are polyploids, which means that
not all of their homoeologs can be modeled simultaneously. Second, BUCKy requires
the estimation of posterior distributions of gene trees, which can be computationally
demanding for large numbers of loci and/or large numbers of sampled alleles. Using
gene tree posteriors is a common way to deal with gene tree uncertainty (utilized by
several methods in PhyloNet; Wen et al., 2018), but other methods, such as those
using site pattern frequencies, would allow for faster computation of the gene tree
quartet topologies that are used to estimate QCFs.
To address the issues listed above, we first developed a method for estimating
concordance factors directly from sequence data for quartets of species. Our method
accommodates multiple haplotypes sampled per species, and can conduct bootstrapping
to account for gene tree uncertainty. To validate the method, we simulated multilocus
sequence data on both tree and network topologies to assess how accurately it could
estimate QCFs. We then collected nuclear amplicon data for two subsections in the
plant genus Penstemon (Plantaginaceae; subsect. Humiles and Proceri). Subsections
Humiles and Proceri are known to hybridize, and have the additional complication
of containing multiple polyploid species. Using phased haplotypes from the nuclear
amplicon sequences, we estimated species trees and networks using four different
approaches to evaluate whether the current circumscription of the subsections is in agreement with our phylogenetic estimates. Overall, we found strong evidence for hybridization
in subsect. Humiles and Proceri, but phylogenetic support was generally lacking for
many of the species relationships, owing to large amounts of genealogical discordance.
Given the pace with which Penstemon has recently radiated, these types of patterns
are not unexpected (Wolfe et al., 2006). Nevertheless, our phylogenetic estimates
showed some stable relationships among the different methods used, and suggest that
the current taxonomy of the two subsections needs revising.
4.3 Approach
We begin with a brief description of our method for QCF estimation, which uses site
patterns to estimate the gene tree quartet relationships that are used to calculate
species-level concordance factors. The basis for this method stems from ideas regarding
the use of phylogenetic invariants for the inference of phylogenetic trees (Allman et al.,
2008; Chifman and Kubatko, 2015). For a given quartet of species, the QCF values
represent the proportion of gene trees that agree with each of the three possible
unrooted topologies relating the four species: ((1,2)(3,4)), ((1,3)(2,4)), and ((1,4)(2,3)).
Because they represent a single bipartition among four species, these topologies are
referred to as “splits”, and are denoted 12|34, 13|24, and 14|23 (Chifman and Kubatko,
2015). Estimating the species-level concordance factors then amounts to estimating
quartet topologies for each gene, followed by tabulating which species-level split is
supported by each gene. When there are multiple haplotypes at a locus, we consider all
of their possible sampling combinations and calculate the gene tree quartet topologies
that support each species-level relationship. Using this approach, we are able to quickly
estimate QCF values for samples with different ploidy levels. Below we detail our
method for scoring gene tree quartet topologies, and combining them across sampled
haplotypes to get species-level QCFs.
4.3.1 Calculating Quartet Concordance Factors
Consider four species (1, 2, 3, and 4) with DNA sequence data collected at G
independent loci, and with haplotypes phased and aligned for each locus. For example,
a diploid individual from species 1 might have two haplotypes at locus one, which we denote $1_{(1,1)}$ and $1_{(1,2)}$. For any species, S, we denote its haplotypes at each
gene by indexing across the gene number ($g = 1, \ldots, G$) and the haplotype number
($h = 1, \ldots, s_g$; where $s_g$ is the number of haplotypes present in species S at gene g).
Then, for each gene, we score the three possible splits of all haplotype combinations
using the frequency of matching site patterns. To get these scores, we first calculate
the number of times a pair of species, A and B, have the same nucleotide at each site
in the alignment:
$$m_{AB} = \sum_{i} I\big(A(i) = B(i)\big). \tag{4.1}$$
Here $I()$ is the indicator function and is equal to 1 if the two bases are the same, and
0 otherwise.
We then use the $m_{IJ}$'s to calculate scores for each of the three unrooted quartet
topologies:
$$\begin{aligned}
G(1, 2|3, 4) &= m_{12} + m_{34} - (m_{13} + m_{14} + m_{23} + m_{24}),\\
G(1, 3|2, 4) &= m_{13} + m_{24} - (m_{12} + m_{14} + m_{23} + m_{34}),\\
G(1, 4|2, 3) &= m_{14} + m_{23} - (m_{12} + m_{13} + m_{24} + m_{34}).
\end{aligned} \tag{4.2}$$
For these scores, patterns of nucleotide substitution that support a given split are given
positive weight, while those that support alternative topologies are given negative weight. At the species level, we tabulate the number of gene trees supporting these
same splits, and add 1 to the species topology that corresponds to the gene split with the highest score. If there is a tie for the highest score, we add 0.5 to the two
species-level splits for the highest scoring gene trees. If all three gene tree splits have
the same score, then 1/3 is added to each species-level split.
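The scoring and tie-breaking rules above can be sketched in a few lines of Python. This is an illustrative sketch rather than the qcf implementation (which is written in C++): the function names are hypothetical, the four haplotypes are assumed to be aligned strings of equal length, and gaps or ambiguity codes are assumed to be compared like any other character, which the text does not specify.

```python
from fractions import Fraction


def matches(a, b):
    """m_AB (Eq. 4.1): number of aligned sites where a and b are identical.

    Assumption: a and b have equal length; every character, including
    gaps, is compared literally.
    """
    return sum(x == y for x, y in zip(a, b))


def split_scores(h1, h2, h3, h4):
    """Scores for the splits 12|34, 13|24, and 14|23 (Eq. 4.2)."""
    m12, m13, m14 = matches(h1, h2), matches(h1, h3), matches(h1, h4)
    m23, m24, m34 = matches(h2, h3), matches(h2, h4), matches(h3, h4)
    return (
        m12 + m34 - (m13 + m14 + m23 + m24),
        m13 + m24 - (m12 + m14 + m23 + m34),
        m14 + m23 - (m12 + m13 + m24 + m34),
    )


def split_votes(scores):
    """Share one vote equally among the top-scoring splits.

    A unique best split gets 1, a two-way tie gives 1/2 each, and a
    three-way tie gives 1/3 each.
    """
    best = max(scores)
    n_best = sum(s == best for s in scores)
    return tuple(Fraction(1, n_best) if s == best else Fraction(0) for s in scores)


# Haplotypes 1 and 2 are identical, as are 3 and 4, so 12|34 scores highest.
scores = split_scores("AAAA", "AAAA", "CCCC", "CCCC")  # (8, -8, -8)
votes = split_votes(scores)
```

Exact fractions are used here only so that the tie weights stay exact in the example; floating-point weights would serve equally well.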
The calculation of concordance factors for each species-level quartet is then done
by summing over all genes and tabulating how often each possible split is supported
by each gene. This sum is also taken over all possible combinations of haplotypes,
giving the following equation for calculating QCFs:
$$CF_{12|34} \propto \sum_{g=1}^{G} \left( \sum_{a=1}^{1_g} \sum_{b=1}^{2_g} \sum_{c=1}^{3_g} \sum_{d=1}^{4_g} \operatorname{argmax}\, G\left(1_{(g,a)}, 2_{(g,b)} \,|\, 3_{(g,c)}, 4_{(g,d)}\right) \right). \tag{4.3}$$
Here, argmax() is an indicator that is 1 if G(1, 2|3, 4) is the maximum argument and
0 otherwise. Ties are handled as described above. The calculations of $CF_{13|24}$ and
$CF_{14|23}$ are the same as above but with the species and sums switched into the correct
order for the split under consideration. All three species-level concordance factors are
then normalized by their sum.
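The sum over genes and haplotype combinations in Eq. 4.3 maps naturally onto a Cartesian product. The sketch below is a hedged illustration, not the qcf source code: `best_split_votes` and `quartet_cfs` are hypothetical names, and the input layout (one tuple of four haplotype lists per locus) is an assumption made for the example.

```python
from itertools import product


def best_split_votes(h1, h2, h3, h4):
    """One vote shared among the top-scoring splits (12|34, 13|24, 14|23)."""
    def m(a, b):
        # Number of aligned sites with identical characters (Eq. 4.1).
        return sum(x == y for x, y in zip(a, b))

    m12, m13, m14 = m(h1, h2), m(h1, h3), m(h1, h4)
    m23, m24, m34 = m(h2, h3), m(h2, h4), m(h3, h4)
    scores = [
        m12 + m34 - (m13 + m14 + m23 + m24),
        m13 + m24 - (m12 + m14 + m23 + m34),
        m14 + m23 - (m12 + m13 + m24 + m34),
    ]
    best = max(scores)
    n_best = scores.count(best)
    return [1.0 / n_best if s == best else 0.0 for s in scores]


def quartet_cfs(genes):
    """Normalized concordance factors for 12|34, 13|24, and 14|23.

    genes: list of loci; each locus is a tuple of four lists holding the
    aligned haplotypes of species 1-4 at that locus (assumed layout).
    """
    totals = [0.0, 0.0, 0.0]
    for sp1, sp2, sp3, sp4 in genes:
        # Every combination of one haplotype per species casts a vote.
        for combo in product(sp1, sp2, sp3, sp4):
            for k, vote in enumerate(best_split_votes(*combo)):
                totals[k] += vote
    norm = sum(totals)
    if norm == 0:
        raise ValueError("no haplotype combinations to score")
    return [t / norm for t in totals]


genes = [
    (["AAAA", "AAAA"], ["AAAA"], ["CCCC"], ["CCCC"]),  # two combos for 12|34
    (["ACAC"], ["GTGT"], ["ACAC"], ["GTGT"]),          # one combo for 13|24
]
cfs = quartet_cfs(genes)
```

As written in Eq. 4.3, a locus with more haplotype combinations contributes more votes before normalization, which is why the first toy locus counts twice in this example.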
4.3.2 Bootstrapping and Gene Tree Uncertainty
To deal with uncertainty in gene tree quartet estimation, we can also conduct bootstrap
resampling of sites within genes when calculating the gene tree split scores. If we
conduct B rounds of resampling, the gene tree contributions to the species-level splits
can then be calculated across bootstrap replicates, with each gene tree split getting a weight proportional to the number of times it was the best scoring topology across all
replicates:
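$$\tilde{G}\left(1_{(g,a)}, 2_{(g,b)} \,|\, 3_{(g,c)}, 4_{(g,d)}\right) = \frac{1}{B} \sum_{b=1}^{B} \operatorname{argmax}\, G\left(1_{(g,a)}, 2_{(g,b)} \,|\, 3_{(g,c)}, 4_{(g,d)}\right). \tag{4.4}$$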
The species-level QCF value is then taken as the sum over these bootstrap weighted
gene tree splits:
$$CF_{12|34} \propto \sum_{g=1}^{G} \left( \sum_{a=1}^{1_g} \sum_{b=1}^{2_g} \sum_{c=1}^{3_g} \sum_{d=1}^{4_g} \tilde{G}\left(1_{(g,a)}, 2_{(g,b)} \,|\, 3_{(g,c)}, 4_{(g,d)}\right) \right). \tag{4.5}$$
As before, $CF_{13|24}$ and $CF_{14|23}$ are calculated in a similar way, such that their indices
are in the correct order. These CF values are then normalized by their sum.
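The bootstrap weights in Eq. 4.4 can be illustrated by resampling alignment columns with replacement for a single haplotype combination. This is a sketch under stated assumptions, not the qcf implementation: the function names are hypothetical, and ties within a replicate are assumed to split the vote equally, as in the non-bootstrapped case.

```python
import random


def best_split_votes(columns):
    """Votes for 12|34, 13|24, 14|23 from a list of four-base columns."""
    def m(i, j):
        # Number of columns in which haplotypes i and j match (Eq. 4.1).
        return sum(col[i] == col[j] for col in columns)

    m12, m13, m14 = m(0, 1), m(0, 2), m(0, 3)
    m23, m24, m34 = m(1, 2), m(1, 3), m(2, 3)
    scores = [
        m12 + m34 - (m13 + m14 + m23 + m24),
        m13 + m24 - (m12 + m14 + m23 + m34),
        m14 + m23 - (m12 + m13 + m24 + m34),
    ]
    best = max(scores)
    n_best = scores.count(best)
    # Assumption: ties within a replicate share the vote equally.
    return [1.0 / n_best if s == best else 0.0 for s in scores]


def bootstrap_weights(h1, h2, h3, h4, n_reps=100, rng=None):
    """Fraction of bootstrap replicates in which each split scores best (Eq. 4.4)."""
    rng = rng or random.Random()
    columns = list(zip(h1, h2, h3, h4))
    if not columns:
        raise ValueError("alignment has no sites")
    totals = [0.0, 0.0, 0.0]
    for _ in range(n_reps):
        # Resample sites with replacement, keeping the alignment length fixed.
        resampled = rng.choices(columns, k=len(columns))
        for k, vote in enumerate(best_split_votes(resampled)):
            totals[k] += vote
    return [t / n_reps for t in totals]


# Every site supports 12|34, so every replicate does as well.
weights = bootstrap_weights("AAAA", "AAAA", "CCCC", "CCCC", n_reps=50,
                            rng=random.Random(1))
```

In Eq. 4.5 these weights take the place of the 0/1 indicator used in Eq. 4.3, so an uncertain haplotype combination spreads its contribution across splits instead of committing to one.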
4.3.3 Validating QCF Estimation
Figure 4.1: Simulation setup for (a) tree and (b) network topologies. Internal branches are annotated with their lengths in coalescent units (CUs). The total tree height is 4.0 CUs. [Figure not reproduced: both panels relate taxa A–F, with internal branch lengths of 0.5, 1.0, and 2.0 CUs; panel (b) adds a reticulation labeled γ = 0.6.]
We validated our approach for QCF estimation using simulations on both tree and
network topologies (Figure 4.1). The details of these simulations can be found in
Appendix D. In general, our method for calculating QCF values produced accurate
estimates when compared to the true simulated data, with RMSD values ranging
from 0.019–0.042 (0.023–0.059 for bootstrapped) and 0.019–0.036 (0.021–0.043 for
bootstrapped) for the tree and network topologies, respectively (Tables D.4 and D.5).
Figures D.1 and D.2 show plots of fitted linear regression models for each quartet of
species and the corresponding QCF estimate for each of the three unrooted topologies.
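For readers who want to reproduce this kind of accuracy summary, the root-mean-square deviation between estimated and true concordance factors can be computed as follows. This is a minimal sketch: the function name `rmsd` is hypothetical, and how values were pooled across quartets and replicates in Appendix D is not restated here, so the pairing of inputs is an assumption.

```python
import math


def rmsd(estimated, true):
    """Root-mean-square deviation between paired estimated and true values."""
    if len(estimated) != len(true) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values for one quartet: estimated vs. true CFs of the three splits.
error = rmsd([0.70, 0.16, 0.14], [0.72, 0.14, 0.14])
```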
4.3.4 Implementation
We have implemented our new method in the open-source software package qcf.
qcf is written in C++ and is available under the GNU GPL v3 on GitHub
(https://github.com/pblischak/qcf). Documentation and tutorials for using the software
can be found on ReadTheDocs (https://qcf.readthedocs.io).
4.4 Materials and Methods
4.4.1 Study System
Penstemon Mitch. (Plantaginaceae) is the largest group of flowering plants endemic
to North America, with ca. 300 species distributed from Alaska to Guatemala, and
from the Pacific to Atlantic coasts (Wolfe et al., 2006). The center of diversity for
Penstemon is the Intermountain West of the United States, with the biogeographic
origin of the genus hypothesized to be in the Columbia Plateau (Straw, 1966; Wolfe et al., 2002, 2006). Penstemon has undergone a recent and rapid radiation, which
is thought to be driven by Pleistocene glaciation cycles, as well as adaptation to
different ecological habitats and pollinators (Castellanos et al., 2006; Wolfe et al., 2006;
Wilson et al., 2007). The most comprehensive molecular phylogeny of Penstemon was published by Wolfe et al. (2006), with 193 species sampled for their analyses of
the nuclear ribosomal ITS region and two chloroplast genes (trnCD+trnTL). While
support for many species-level relationships was lacking, Wolfe et al. (2006) were able
to make several inferences regarding higher level relationships within Penstemon. A
more recent study conducted high-throughput sequencing for 70 species of Penstemon
(Wessinger et al., 2016), and recovered high support across the entire tree. However,
their taxon sampling included few members of subg. Penstemon, which limits the
interpretation of their results in the context of the whole genus.
Figure 4.2: Hypotheses of allopolyploid formation in Penstemon attenuatus according to Keck (1945). Varieties of P. attenuatus are placed in the center, with their putative diploid parent below (P. albertinus for all), and their putative tetraploid parents above. P. attenuatus var. palustris is marked with a dashed arrow due to the uncertainty of its placement. [Diagram not reproduced: tetraploid parents confertus, globosus, and procerus (4X) above; varieties palustris, attenuatus, militaris, and pseudoprocerus (6X) in the center; albertinus (2X) below.]
Two groups within Penstemon subg. Penstemon that are of particular interest
are the subsections Humiles and Proceri. Species in these subsections are primarily
distributed in the Pacific Northwest of the United States, and occur at subalpine
to alpine elevations in a variety of habitats, including some that are atypical for
species of Penstemon in the western US (e.g., mesic meadows; Keck, 1945; Nold,
1999). These subsections are also morphologically distinct from other members of
the genus, with their inflorescences organized into verticillasters. Many of the species
also have glandular hairs on the inflorescence, a character present in all species of
subsect. Humiles, but only in some members of subsect. Proceri (Keck, 1945). The
traditional taxonomic division between these groups is based on a single leaf character:
members of subsect. Humiles have serrate leaf margins and members of subsect.
Proceri have entire margins (Keck, 1945). However, there have been observations of
hybridization between the two subsections, such that members of subsect. Proceri
can sometimes have toothed leaf margins (Strickler, 1997). Cases of hybridization
at the diploid level are well documented in Penstemon (Straw, 1955; Crosswhite,
1965; Wolfe et al., 1998b,a; Datwyler and Wolfe, 2004), and numerous instances of
polyploidy, mostly within sect. Penstemon and subg. Saccanthera, have been studied
as well (Keck, 1945; Broderick et al., 2011). Most of these polyploids are thought to
be allopolyploids (formed through hybridization) (Wolfe et al., 2006), but the majority
of these hypotheses remain untested (but see Lawrence and Datwyler, 2016).
Given the lability of the leaf character dividing these two subsections, as well as
their similarities in geographic ranges and morphology, we sought to evaluate the
monophyly of subsect. Humiles and Proceri using nuclear amplicon data. We also
aimed to investigate the extent to which hybridization has occurred in these groups,
and to understand the origin of any polyploid taxa (auto- vs.
allopolyploid). In the case of subsect. Humiles and Proceri, the P. attenuatus species
complex presents a compelling test for understanding polyploidy in these groups.
According to Keck (1945), the four varieties of P. attenuatus are all hypothesized to
be allopolyploids, forming through hybridization between P. albertinus in subsect.
Humiles and three different species in subsect. Proceri (see Figure 4.2). Earlier
molecular phylogenetic analyses recovered the subsections as polyphyletic (Wolfe et al.,
2006). However, these patterns are based on two uncombined gene trees, which would
not allow for processes that cause gene tree incongruence to be modeled. Here we use
species tree and network approaches to account for genealogical discordance caused by
both ILS and hybridization to estimate a phylogeny for subsect. Humiles and Proceri.
4.4.2 Sample Collection, DNA Extraction, and Amplicon Sequencing
DNA was extracted from field-collected leaf tissue that was dried on silica gel. We used
a modified CTAB protocol for DNA isolation (Wolfe, 2005), quantified all samples using
a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA), and normalized all samples
to 20 ng/µL. Normalized DNA samples for 38 accessions representing 17/22 and
20/27 currently circumscribed taxa from subsect. Humiles and Proceri, respectively,
plus an outgroup taxon from Penstemon subgenus Dasanthera (P. davidsonii var. davidsonii), were sent to the IBEST Genomics Resources Core at the University of
Idaho (Moscow, ID, USA) for sample preparation and sequencing (listed in Table D.1).
Amplification of targeted amplicons and the addition of sample barcodes and Illumina
adapters were done using microfluidic PCR on the Fluidigm 48 x 48 Access Array
(Fluidigm Corporation, South San Francisco, CA, USA), followed by 300 bp paired-end
sequencing on an Illumina MiSeq (Illumina, San Diego, CA, USA) (Uribe-Convers et al., 2016). Primers for the 48 loci used in this study were designed and tested as
described in Blischak et al. (2014), and are given in Tables D.2 and D.3.
Raw, paired-end sequencing reads were returned from IBEST and processed using
Fluidigm2PURC v0.1.2 (https://github.com/pblischak/fluidigm2purc; Blischak et al., 2018a). Fluidigm2PURC trims reads using Sickle (Joshi and Fass, 2011), joins
paired reads using FLASH2 (Magoč and Salzberg, 2011), and prepares the input file for
clustering and chimera detection using the program PURC (Rothfels et al., 2017). After
clustering with PURC, haplotypes are inferred based on cluster sizes and user-specified
ploidy levels. Three rounds of chimera detection and clustering were performed using
default settings in the script purc_recluster2.py, a modified version of the original
script distributed with PURC (https://bitbucket.org/crothfels/purc). To get
haplotypes, clusters were first cleaned of excessive gaps using Phyutility (threshold
= 33%; Smith and Dunn, 2008) and then realigned using MAFFT (Katoh, 2013).
Haplotypes were then inferred for all sampled taxa assuming known ploidy levels
reported in Keck (1945), Strickler (1997), and Broderick et al. (2011). Information
regarding haplotype dosage (i.e., number of haplotype copies) was ignored, resulting
in only unique haplotypes being returned for each taxon at each gene.
4.4.3 Species Tree Inference
Haplotype-level gene trees for each locus were inferred with RAxML v8.2.11 using the
GTRGAMMA model of nucleotide substitution and 500 rapid bootstrap replicates
(Stamatakis et al., 2008; Stamatakis, 2014). We then inferred a taxon-level species
tree with ASTRAL v5.5.9 (ASTRAL-III) using a mapping file to link haplotypes with
their respective taxa (Mirarab and Warnow, 2015). To increase the thoroughness of
the ASTRAL-III search algorithm, we added the following command line options:
--polylimit 20 (maximum size of polytomy), --samplingrounds 100 (number of
rounds of subsampling haplotypes from taxa), and --extraLevel 2 (increase the
number of bipartitions added to the search space).
A species tree was also inferred using methods from the TICR pipeline (https:
//github.com/nstenz/TICR; Stenz et al., 2015). The TICR pipeline estimates QCFs
using BUCKy (Larget et al., 2010), and then uses the concordance values of these
quartets to infer a species tree using the QuartetMaxCut algorithm (Snir, 2012).
Here, we instead used our new method to estimate QCFs, and inferred a species
tree using the script TICR/scripts/get-pop-tree.pl. Average concordance factors
and branch lengths (in coalescent units) were then estimated for this tree using the
TICR/scripts/getTreeBranchLengths.R script.
As a final estimate of phylogeny, we used only the majority haplotype (haplotype
inferred from the largest cluster) returned by Fluidigm2PURC to estimate a species
tree with RAxML using a supermatrix as input. It has been shown previously that
concatenating multilocus data can result in incorrect inferences of phylogeny when
gene tree discordance is present (Kubatko and Degnan, 2007). However, congruence
among different methods can also be a good indicator of stable species relationships.
Inference with RAxML was conducted using a partition file to estimate separate model
parameters for each gene, and 1000 rapid bootstrap replicates were used to assess
statistical support.
4.4.4 Candidate Hybridization Events from Rooted Triples
To generate a list of candidate hybridization events, we used the program HyDe v0.4.2 to test for hybridization on all possible triples of taxa from subsect. Humiles
and Proceri (Blischak et al., 2018b). HyDe tests for hybridization using site pattern
frequencies (Kubatko and Chifman, 2015), and estimates the amount of admixture
occurring between two parental taxa to form a third hybrid taxon. Using P. davidsonii var. davidsonii as an outgroup and a mapping file to assign haplotypes to taxa, we tested all triples in all directions using the run_hyde_mp.py script. Statistical
significance was assessed at the α = 0.05 level with a Bonferroni correction for the
number of hypothesis tests conducted.
4.4.5 Species Network Inference
Our species tree analyses with ASTRAL-III, QCF+QuartetMaxCut, and RAxML
(supermatrix) recovered two clades with corresponding taxon membership (see Results), which we refer to as clades A and B (Figures 4.3–4.5). To reduce the computational
burden of estimating a large network, we chose to analyze these clades independently.
Haplotypes from taxa belonging to each clade were extracted from the original
sequence alignments and written to new files. Penstemon davidsonii var. davidsonii was included in the data set for both clades as an outgroup. We then estimated
haplotype-level gene trees as before, and inferred a taxon-level species tree in ASTRAL-III
using a mapping file and default search settings. We also estimated QCFs for
each clade using the qcf software with 500 bootstrap replicates. The resulting
species trees and QCF estimates for clades A and B were then used as input for
network estimation using the SNaQ method implemented in the software package
PhyloNetworks v0.7.0 (Solís-Lemus and Ané, 2016; Solís-Lemus et al., 2017). We varied
the maximum number of hybridization events from h=1 to h=5 and used the resulting
log-pseudolikelihood values to determine the most likely number of hybridization
events. The log-pseudolikelihood for the case of no hybridization (h=0) was calculated
by maximizing the fit of the observed QCF values on the fixed tree topology estimated
by ASTRAL-III. All network analyses were conducted on the Oakley cluster at the
Ohio Supercomputer Center (https://www.osc.edu).
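One common way to read a series of pseudolikelihood values across h is to look for the point where adding another reticulation stops improving the fit appreciably. The Python sketch below illustrates that heuristic under stated assumptions: the values are hypothetical, they are expressed so that larger is better (a score that is minimized would need to be negated first), and the min_gain cutoff is an arbitrary illustrative choice rather than a criterion used in this study.

```python
def gains(loglik_by_h):
    """Improvement in log-pseudolikelihood gained by each added reticulation."""
    hs = sorted(loglik_by_h)
    return {h: loglik_by_h[h] - loglik_by_h[prev] for prev, h in zip(hs, hs[1:])}


def choose_h(loglik_by_h, min_gain):
    """Smallest h after which one more reticulation gains less than min_gain."""
    hs = sorted(loglik_by_h)
    step_gain = gains(loglik_by_h)
    for h, next_h in zip(hs, hs[1:]):
        if step_gain[next_h] < min_gain:
            return h
    return hs[-1]


# Hypothetical log-pseudolikelihoods for h = 0..5 (not results from this study).
example = {0: -900.0, 1: -700.0, 2: -600.0, 3: -560.0, 4: -555.0, 5: -554.0}
best_h = choose_h(example, min_gain=10.0)
print(gains(example), best_h)
```

In practice the values are usually plotted against h and inspected by eye, and the biological plausibility of each network is checked before a value of h is accepted.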
4.5 Results
4.5.1 Nuclear Amplicon Data
Of the 48 loci that were amplified using the Fluidigm Access Array, 43 (all nuclear)
recovered sufficient data for processing and downstream phylogenetic analyses. Data
processing with Fluidigm2PURC on these 43 loci produced phased haplotypes for all
38 taxa, with many of the polyploid taxa containing three or more unique haplotypes.
The supermatrix of majority haplotypes used for analysis with RAxML had a total
alignment length of 18,207 bp.
4.5.2 Species Tree Inference
Species trees were inferred with three different methods, all of which produced mostly
similar phylogenetic estimates for subsect. Humiles and Proceri (Figures 4.3–4.5). For
each method, four species were consistently recovered as a grade outside of the rest of
the ingroup: P. anguineus, P. rattanii, P. watsonii, and P. whippleanus. Penstemon
ovatus was also recovered outside of the two subsections in the ASTRAL-III analysis.
For ASTRAL-III and qcf+QuartetMaxCut, two species from subsect. Proceri, P. attenuatus var. attenuatus and P. attenuatus var. pseudoprocerus, were recovered in a
clade consisting primarily of species from subsect. Humiles. These two methods also
recovered a clade consisting almost entirely of species from subsect. Proceri, with the
exception of P. radicosus being present in the ASTRAL-III tree. The supermatrix
analysis inferred a tree with a number of relationships that differed from the other
two approaches. Three of the four varieties of P. attenuatus were inferred to belong
to the same clade, but P. attenuatus var. pseudoprocerus was still recovered in a clade
of Humiles taxa. This analysis also shifted a clade of three species, P. radicosus, P. degeneri, and P. inflatus, to be sister to the clade consisting of Proceri species with
high support (bootstrap = 96). Another notable difference among the methods was
that ASTRAL-III did not recover the three varieties of P. humilis as monophyletic,
but the qcf+QuartetMaxCut and supermatrix approaches did.
Estimated branch lengths in coalescent units from ASTRAL-III and
qcf+QuartetMaxCut were extremely short for all internal branches, indicating ram-
pant genealogical discordance (Figures 4.4, D.3, and D.4). Branch lengths from the
RAxML supermatrix analysis were also very short for the branches along the backbone
of the tree, demonstrating that few substitutions were present to inform relationships
for these deeper bipartitions. Support values were generally low across the different
trees, with only a few relationships showing high levels of support. This is likely a
result of the short branches observed in the different trees, and supports the hypothesis
that speciation has occurred rapidly, with little time for informative substitutions
to occur. Another possible reason for these low support values is the occurrence of
hybridization. As we show below, there is strong evidence for hybridization in these
groups, making phylogenetic inference difficult.
Figure 4.3: Phylogeny of Penstemon subsections Humiles and Proceri inferred by ASTRAL-III. Labels on branches are local posterior probabilities (Sayyari and Mirarab, 2016). Taxa are colored based on their current taxonomic classification: red = subsect. Proceri, blue = subsect. Humiles.
4.5.3 Tests for Hybridization and Species Network Inference
Analyzing all possible triples of ingroup taxa with HyDe resulted in a total of 23,310
hypothesis tests, of which 282 showed significant evidence for hybridization. The
average value for the hybridization parameter (γ) was 0.513 (standard deviation =
0.114), with a minimum and maximum value of 0.205 and 0.843, respectively. Out of
37 total ingroup taxa, 24 had a significant signal for hybridization.
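The number of tests reported above is consistent with testing every unordered triple of the 37 ingroup taxa three times, once with each member of the triple as the putative hybrid. The short Python sketch below reproduces that count and the corresponding Bonferroni-corrected significance threshold; the variable names are illustrative.

```python
from math import comb

n_ingroup = 37         # ingroup taxa analyzed with HyDe
tests_per_triple = 3   # each taxon in a triple is tested as the hybrid
n_tests = tests_per_triple * comb(n_ingroup, 3)

alpha = 0.05
bonferroni_threshold = alpha / n_tests  # per-test significance cutoff

n_significant = 282
fraction_significant = n_significant / n_tests

print(n_tests)  # 23310
print(f"{bonferroni_threshold:.3e}", f"{fraction_significant:.4f}")
```

Under this correction a single test is called significant only when its p-value falls below 0.05 divided by 23,310.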
Figure 4.4: Phylogeny of Penstemon subsections Humiles and Proceri inferred using qcf and QuartetMaxCut. Each branch is labeled above by its length in coalescent units and below by the average QCF value for all quartets induced by that branch. All branches with average QCF values greater than 0.38 are plotted with thicker lines (Stenz et al., 2015).
Figure 4.5: Phylogeny of Penstemon subsections Humiles and Proceri inferred with RAxML using a supermatrix of 43 loci. Labels on branches are support values from 1000 bootstrap replicates.
Species network inference with SNaQ was then conducted on the two primary
clades that were recovered in the qcf+QuartetMaxCut analyses (clades A and B;
Figure 4.4). The reason for using these clades was that this analysis recovered a
pattern of relationships for subsect. Humiles and Proceri that was most consistent with the current taxonomy. Although hybridization was detected between species in
these clades using HyDe, network inference with SNaQ cannot currently handle 38
taxa. However, since the members of these clades were recovered fairly consistently
between the different methods that we used for species tree inference, we decided to
analyze them independently to make network inference computationally feasible.
Using a range of values on the number of possible reticulation events (h=0 to h=5), we were able to infer species networks in all cases for both clades A and B within the
amount of compute time allotted (10 cores, 80–100 hours). For both clades, adding
reticulation events greatly reduced the log-pseudolikelihood, providing strong evidence
that hybridization is occurring within these clades. Networks with four and three
reticulations had the highest pseudolikelihoods for clades A and B, respectively, but
the network topology with four reticulations for clade A produced nonsensical
relationships (hybridization with the outgroup), so we preferred the network with
h=3 (Figures D.5 and D.6). In addition to having the highest (or second highest for
clade A) log-pseudolikelihood, the networks estimated with three reticulations were
among the only estimates that had sensible branch lengths. For most other networks
in both clades A and B, one of the reticulate edges was always inferred to have a
branch length of >9.5 coalescent units. Given the amount of gene tree discordance
present in the data set, and the short branch lengths estimated by ASTRAL-III and
qcf+QuartetMaxCut, these estimates are most likely incorrect.
Figure 4.6: Best maximum pseudolikelihood (ML) networks for clades A and B estimated by SNaQ. The maximum number of hybridization events for these ML networks is h=3.
Clade A tip labels: P. davidsonii davidsonii (outgroup), P. spatulatus, P. flavescens, P. attenuatus palustris, P. attenuatus militaris, P. rydbergii oreocharis, P. laxus, P. globosus, P. pratensis, P. heterodoxus heterodoxus, P. rydbergii rydbergii, P. euglaucus, P. procerus procerus, P. confertus, P. hesperius, P. washingtonensis, P. peckii, P. cinicola.
Clade B tip labels: P. davidsonii davidsonii (outgroup), P. radicosus, P. aridus, P. inflatus, P. degeneri, P. virens, P. attenuatus attenuatus, P. elegantulus, P. subserratus, P. pruinosus, P. attenuatus pseudoprocerus, P. wilcoxii, P. albertinus, P. ovatus, P. humilis brevifolius, P. humilis humilis, P. humilis obtusifolius.
The best networks for clades A and B showed different patterns for the timing of
hybridization (Figure 4.6). For clade A, all of the reticulation events occurred closer to
the present, and only involved pairs of species hybridizing. The hybridization events
inferred include: (1) P. spatulatus × P. attenuatus var. palustris → P. flavescens,
(2) P. rydbergii var. oreocharis × P. globosus → P. laxus, and (3) P. confertus × P. cinicola → P. washingtonensis. Clade B, on the other hand, was estimated to have a
deep reticulation event involving two ancestral populations, with the resulting hybrid
lineage then diversifying into 12 different taxa. The other hybridization event in this
clade was P. ovatus × P. humilis var. humilis → P. humilis var. brevifolius.
4.6 Discussion
In this paper, we investigated the phylogenetic relationships and taxonomic affinities
among the taxa within Penstemon subsections Humiles and Proceri using nuclear
amplicon sequencing. We found strong evidence for hybridization in these groups, but
the rapid diversification in these two subsections made the exact inference, localization,
and interpretation of reticulation events extremely difficult. Despite these shortcomings,
there are some clear trends regarding the taxonomic implications of our phylogenetic
estimates, as well as what they may mean for character evolution and biogeography in
the group. Our work also highlights the difficulties of estimating phylogeny for recently
radiating groups with variable ploidy levels and high amounts of hybridization, an
issue that is common for many groups of angiosperms.
4.6.1 Taxonomy of Subsections Humiles and Proceri
Using several methods for phylogenetic inference, we found evidence for the non-
monophyly of subsect. Humiles and Proceri (Figures 4.3–4.5). This pattern was
recovered for all methods, despite variable levels of statistical support for the different
analyses, suggesting that this pattern is robust to the various assumptions of each
method. Four species in particular were recovered completely outside of the two
subsections: P. whippleanus, P. rattanii, and P. anguineus (all subsect. Humiles), as well as P. watsonii (subsect. Proceri). Only one of the methods, qcf+QuartetMaxCut,
recovered a monophyletic grouping of species belonging to subsect. Proceri. However,
this analysis also placed two taxa currently classified in subsect. Proceri (P. attenuatus var. attenuatus and P. attenuatus var. pseudoprocerus) into a clade of species from
subsect. Humiles, casting doubt on Keck’s hypotheses of allopolyploid formation for
these taxa (Keck, 1945). Interestingly, none of the varieties of P. attenuatus showed
the predicted affinities for their putative parental taxa (Figure 4.2). However, given
their hypothesized hybrid nature, it is possible that they are simply difficult to place
using models that do not include reticulation.
The phylogenetic placement of hybrid taxa has been shown to be problematic
in phylogenetic analyses, with hybrids typically branching at the base of a clade
containing one of their parental taxa (e.g., McDade, 1990, 1992). Our tests for
hybridization and species network analyses confirmed the presence of hybrids within
and between subsect. Humiles and Proceri, but did not support the putative parentage
of the varieties of P. attenuatus or of a number of other hypothesized hybrids
from Keck (1945). Nevertheless, the occurrence of a deep hybridization event in
the clade consisting of taxa primarily from subsect. Humiles (clade B) is especially
interesting. The impact of hybridization on genetic variation and its connection to the
subsequent speciation and diversification of hybrid lineages has long been understood
in plants (Anderson, 1949; Stebbins, 1950; Anderson and Stebbins, 1954; Grant, 1971;
Mallet, 2007), and several recent studies have observed deep hybridization events at
the diploid (e.g., Folk et al., 2016; García et al., 2017; Folk et al., 2018) and polyploid
(e.g., Morales-Briones et al., 2018) levels. The hybrid group within clade B is a mix
of diploids and polyploids, suggesting that the processes of both hybridization and whole genome duplication have been at play in its diversification. Future work with
more genomic data will be important for this clade to gain better resolution for any
additional hybridization events.
A particular strength of our method of QCF estimation is the ability to analyze
taxa with different ploidy levels, allowing us to analyze all diploid and polyploid
taxa in subsect. Humiles and Proceri simultaneously. From our network analyses with SNaQ, the only polyploid that was inferred to be a hybrid was P. flavescens
(hexaploid; clade A). However, our tests for hybridization with HyDe found far more
evidence for hybridization, likely because it only tests three species at a time, rather
than trying to infer an entire network. Nevertheless, out of the 24 taxa inferred to be
hybrids using HyDe, only four were polyploids. This casts doubt on the hypothesis
that most of the polyploids in subsect. Proceri are of hybrid origin. However, a lack
of phylogenetically informative variation could also be preventing us from detecting
the full extent of hybridization potentially occurring in these polyploids.
4.6.2 Character Evolution and Biogeography
Given the reticulate history of subsect. Humiles and Proceri, there are several patterns
of morphological character evolution that can be interpreted in the context of their
past genetic exchanges. Of particular interest is the presence of glandular hairs on
the inflorescence, a trait with potentially adaptive importance (Levin, 1973) that is
present in all species of subsect. Humiles, but is absent in the majority of species
in subsect. Proceri. For the species in subsect. Proceri where it does occur, it is
hard to determine if it is simply a labile trait that has arisen several times, or if
there is a single clade of species that all have the trait. A perhaps more interesting
scenario could be that this trait was gained through hybridization or introgression;
however, testing this hypothesis is currently not feasible due to a lack of methods
for discrete character reconstruction on phylogenetic networks (but see Jhwueng and
O’Meara, 2015; Bastide et al., 2018, for examples of continuous character evolution).
A possible workaround would be to construct all possible resolutions of the underlying
trees displayed by the networks and to reconstruct the character history on each tree.
Nevertheless, future model development on discrete character evolution on networks will help to address this type of question.
The biogeographic context of these hybridization events is also of interest, with
reconstructions of species’ geographic ranges potentially helping to shed light on the
plausibility of hypotheses about the occurrence of reticulation. The current geographic
distribution of the taxa in subsect. Humiles and Proceri is concentrated in the Pacific
Northwest of the United States, an area with several well-established biogeographic
and phylogeographic hypotheses regarding the occurrence of species in the Cascade
Range, the Northern Rocky Mountains, and the Sierra Nevada in northern California
(Brunsfeld et al., 2001; Carstens et al., 2005; Brunsfeld et al., 2007). Two recent studies
that have investigated the biogeography of hybridization events include Burbrink and
Gehara (2018) and Folk et al. (2018), who take different approaches to reconstructing
ancestral contact zones where hybridization could potentially have occurred. Burbrink
and Gehara (2018) found a single deep reticulation event in the phylogeny of New
World kingsnakes and used the resulting parental trees (trees where a hybrid clade is
sister to either parent) to infer ancestral areas for the hybrid clade. Folk et al. (2018)
used climatic niche reconstructions to find likely regions where ancestral lineages
of Heuchera and Mitella may have occurred in sympatry and hybridized. These
approaches could be used in concert to illuminate the dynamics of vicariance and
dispersal for lineages of subsect. Humiles and Proceri, and to help locate
geographic regions where hybridization and whole genome duplication could
have occurred in the past.
4.6.3 Phylogenetics of Hybrids and Polyploids
As was seen from our phylogenetic analyses, the internal branch lengths of our
species trees and networks were very short (most were less than 0.5 coalescent units),
highlighting the prevalence of incomplete lineage sorting and genealogical discordance
in our data. Previous research has shown that Penstemon is a young genus (crown age
2.5–4.0 mya) that has radiated extremely rapidly, with hybridization and polyploidy
occurring frequently (Wolfe et al., 2006; Wessinger et al., 2016). These types of
processes are likely not uncommon for other groups of angiosperms, and having
methods to deal with them will be especially important for making future inferences
about the evolutionary history of these groups. To resolve hybridization events,
especially when they involve polyploids, there are several methods that have already
been developed. Some of these are not coalescent-based, but instead try to reconstruct
a network from gene trees that have all of the haplotypes from a polyploid sampled (a
so-called “multi-labeled” tree; Lott et al., 2009; Marcussen et al., 2012). Other studies
have relied on coalescent-based assignment of homoeologous haplotypes into putative,
diploid subgenomes, but these approaches can be computationally limited due to the
cost of exploring all permutations of haplotype assignments (Bertrand et al., 2015;
Oberprieler et al., 2017). The only approach to simultaneously infer a network topology
and homoeolog assignment in a coalescent framework is the method of Jones et al.
(2013). However, this method uses a hierarchical Bayesian framework that does not
scale well to large numbers of loci or taxa.
If homoeolog assignment is the goal, then it may be beneficial to first identify
parental taxa so that the number of comparisons for determining haplotype origin
is reduced. Kamneva et al. (2017) used such an approach in strawberries to identify
potential parents for several different polyploid species. They used a two-step approach
to generate and test hypotheses, first constructing networks using consensus methods,
followed by evaluating the likelihood of candidate networks using PhyloNet (Wen et al.,
2018). Their analyses were limited to no more than 5 haplotypes per taxon, and also
did not include an actual search over network space. Our method for QCF estimation was able to analyze all inferred haplotypes for all 38 taxa sampled in this study, and
our network analyses with SNaQ were used to conduct an actual search over network
topologies. The appeal of these types of approaches is that they do not require a priori
knowledge about parental taxa when inferring a network. For non-model taxa where
cases of hybridization and allopolyploidy are being investigated for the first time, the
ability to model these processes with little input from the user regarding putative
hybridization events should help to facilitate the discovery of reticulate evolutionary
events in virtually any group of taxa where they may be occurring.
4.7 Conclusions
Hybridization and polyploidy are processes that obscure phylogenetic inference for
many groups of taxa, and are a particular problem for lineages of angiosperms, where
they are especially common. Using the concept of quartet concordance factors (the
proportion of gene tree quartets supporting a species-level quartet), we developed a
method for estimating these concordance factors that can accommodate taxa with variable ploidy levels. Using this approach, and several others, we then inferred species
trees for Penstemon subsect. Humiles and Proceri, finding that the subsections were
not reciprocally monophyletic. Tests for hybridization and species network inference
also revealed that reticulation has been a common occurrence in these groups. In
general, this study highlights the difficulties of inferring phylogeny in a rapid species
radiation where hybridization and WGD are common. However, our approach for
QCF estimation, in combination with the network inference method SNaQ, helped to
disentangle the complex patterns of hybridization in these subsections, and should
provide a useful tool for other researchers interested in reticulate evolution as well.
Bibliography
Allendorf, F. W. and Thorgaard, G. H. 1984. Tetraploidy and the evolution of salmonid
fishes. In: Evolutionary genetics of fishes. Edited by B. J. Turner. Plenum Press,
pp. 1–53.
Allman, E. S., Ané, C., and Rhodes, J. A. 2008. Identifiability of a Markovian model of
molecular evolution with Gamma-distributed rates. Advances in Applied Probability,
40: 229–249.
Andermann, T., Fernandes, A. M., Olsson, U., Töpel, M., Pfeil, B. E., Oxelman,
B., Aleixo, A., Faircloth, B. C., and Antonelli, A. 2018. Allele phasing greatly
improves the phylogenetic utility of ultraconserved elements. Systematic Biology,
https://doi.org/10.1093/sysbio/syy039.
Anderson, E. 1949. Introgressive hybridization. John Wiley, New York, NY, USA.
Anderson, E. and Stebbins, G. L. 1954. Hybridization as an evolutionary stimulus.
Evolution, 8: 378–388.
Ané, C., Larget, B., Baum, D. A., and Rokas, A. 2007. Bayesian estimation of
concordance among gene trees. Molecular Biology and Evolution, 24: 412–426.
Anithakumari, A., Tang, J., van Eck, H., Visser, R., Leunissen, J., Vosman, B., and
van der Linden, C. 2010. A pipeline for high throughput detection and mapping of
SNPs from EST databases. Molecular Breeding, 26: 65–75.
Arnold, B., Bomblies, K., and Wakeley, J. 2012. Extending coalescent theory to
autotetraploids. Genetics, 192: 195–204.
Arnold, B., Corbett-Detig, R. B., Hartl, D., and Bomblies, K. 2013. RADseq underes-
timates diversity and introduces genealogical biases due to nonrandom haplotype
sampling. Molecular Ecology, 22: 3179–3190.
Arnold, B., Kim, S.-T., and Bomblies, K. 2015. Single geographic origin of a widespread
autotetraploid Arabidopsis arenosa lineage followed by interploidy admixture. Molecular
Biology and Evolution, 32: 1382–1395.
Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A.,
Selker, E. U., Cresko, W. A., and Johnson, E. A. 2008. Rapid SNP discovery and
genetic mapping using sequenced RAD markers. PLoS ONE, 3: e3376.
Balding, D. J. and Nichols, R. A. 1995. A method for quantifying differentiation
between populations at multi-allelic loci and its implications for investigating identity
and paternity. Genetica, 96: 3–12.
Balding, D. J. and Nichols, R. A. 1997. Significant genetic correlations among
Caucasians at forensic DNA loci. Heredity, 108: 583–589.
Barlow, N. 1913. Preliminary note on heterostylism in Oxalis and Lythrum. Journal
of Genetics, 3: 53–65.
Barlow, N. 1923. Inheritance of the three forms in trimorphic plants. Journal of
Genetics, 13: 133–146.
Bastide, P., Solís-Lemus, C., Kriebel, R., Sparks, K. W., and Ané, C. 2018. Phyloge-
netic comparative methods on phylogenetic networks with reticulations. Systematic
Biology, https://doi.org/10.1093/sysbio/syy033.
Baum, D. A. 2007. Concordance trees, concordance factors, and the exploration of
reticulate genealogy. Taxon, 56: 417–426.
Bertrand, Y. J. K., Scheen, A.-C., Marcussen, T., Pfeil, B. E., de Sousa, F., and
Oxelman, B. 2015. Assignment of homoeologues to parental genomes in allopoly-
ploids for species tree inference, with an example from Fumaria (Papaveraceae).
Systematic Biology, 64: 448–471.
Blischak, P. D., Wenzel, A. J., and Wolfe, A. D. 2014. Gene prediction and annotation
in Penstemon (Plantaginaceae): a workflow for marker development from low-
coverage genome sequencing. Applications in Plant Sciences, 2: 1400044.
Blischak, P. D., Kubatko, L. S., and Wolfe, A. D. 2016. Accounting for genotype
uncertainty in the estimation of allele frequencies in autopolyploids. Molecular
Ecology Resources, 16: 742–754.
Blischak, P. D., Latvis, M., Morales-Briones, D. F., Johnson, J. C., Di Stilio, V. S.,
Wolfe, A. D., and Tank, D. C. 2018a. Fluidigm2PURC: automated processing and
haplotype inference for double-barcoded PCR amplicons. Applications in Plant
Sciences, 6: e1156.
Blischak, P. D., Chifman, J., Wolfe, A. D., and Kubatko, L. S. 2018b. HyDe: a
Python package for genome-scale hybridization detection. Systematic Biology,
https://doi.org/10.1093/sysbio/syy023.
Bradburd, G., Ralph, P., and Coop, G. 2013. Disentangling the effects of geographic
and ecological isolation on genetic differentiation. Evolution, 67: 3258–3273.
Brent, R. P. 1973. Algorithms for minimization without derivatives. Prentice-Hall,
Englewood Cliffs, NJ.
Broderick, S. R., Stevens, M. R., Geary, B., Love, S. L., Jellen, E. N., Dockter, R. B.,
Daley, S. L., and Lindgren, D. T. 2011. A survey of Penstemon’s genome size.
Genome, 54: 160–173.
Brunsfeld, S. J., Sullivan, J., Soltis, D. E., and Soltis, P. S. 2001. Integrating ecological
and evolutionary processes in a spatial context, chapter Comparative phylogeography
of northwestern North America: a synthesis, pages 319–339. Oxford: Blackwell
Science.
Brunsfeld, S. J., Miller, T. R., and Carstens, B. C. 2007. Insights into the biogeography
of the Pacific Northwest of North America: evidence from the phylogeography of
Salix melanopsis (Salicaceae). Systematic Botany, 32: 129–139.
Buerkle, C. A. and Gompert, Z. 2013. Population genomics based on low coverage
sequencing: how low should we go? Molecular Ecology, 22: 3028–3035.
Burbrink, F. T. and Gehara, M. 2018. The biogeography of deep time reticulation.
Systematic Biology, https://doi.org/10.1093/sysbio/syy019.
Bybee, S. M., Bracken-Grissom, H., Haynes, B. D., Hermansen, R. A., Byers, R. L.,
Clement, M. J., Udall, J. A., Wilcox, E. R., and Crandall, K. A. 2011. Targeted
amplicon sequencing (TAS): a scalable next-gen approach to multilocus, multitaxa
phylogenetics. Genome Biology and Evolution, 3: 1312–1323.
Cannon, S. B., McKain, M. R., Harkess, A., Nelson, M. N., Dash, S., Deyholos, M. K.,
Peng, Y., Joyce, B., Stewart, C. N., Rolf, M., Kutchan, T., Tan, X., Chen, C.,
Zhang, Y., Carpenter, E., Wong, G. K.-S., Doyle, J. J., and Leebens-Mack, J. 2014.
Multiple polyploidy events in the early radiation of nodulating and nonnodulating
legumes. Molecular Biology and Evolution, 32: 193–210.
Carstens, B. C., Brunsfeld, S. J., Demboski, J. R., Good, J. M., and Sullivan, J. 2005.
Investigating the evolutionary history of the Pacific Northwest mesic forest ecosystem:
hypothesis testing within a comparative phylogeographic framework. Evolution, 59:
1639–1652.
Castellanos, M. C., Wilson, P. S., Keller, S. J., Wolfe, A. D., and Thompson, J. D. 2006.
Anther evolution: pollen presentation strategies when pollinators differ. American
Naturalist, 167: 288–296.
Chifman, J. and Kubatko, L. S. 2014. Quartet inference from SNP data under the
coalescent model. Bioinformatics, 30: 3317–3324.
Chifman, J. and Kubatko, L. S. 2015. Identifiability of the unrooted species tree
topology under the coalescent model with time-reversible substitution processes,
site-specific rate variation, and invariable sites. Journal of Theoretical Biology, 374:
35–47.
Clark, L. V. and Jasieniuk, M. 2011. polysat: an R package for polyploid microsatellite
analysis. Molecular Ecology Resources, 11: 562–566.
Clausen, J., Keck, D. D., and Hiesey, W. M. 1940. Experimental studies on the nature
of species. I. Effect of varied environments on western American plants. Carnegie
Inst. Washington Publ.
Clausen, J., Keck, D. D., and Hiesey, W. M. 1945. Experimental studies on the nature
of species. II. Plant evolution through amphiploidy and autoploidy, with examples
from Madiinae. Carnegie Inst. Washington Publ.
Cornille, A., Salcedo, A., Kryvokhyzha, D., Glémin, S., Holm, K., Wright, S. I., and
Lascoux, M. 2016. Genomic signature of successful colonization of Eurasia by the
allopolyploid shepherd’s purse (Capsella bursa-pastoris). Molecular Ecology, 25:
616–629.
Crosswhite, F. S. 1965. Hybridization of Penstemon barbatus (Scrophulariaceae) of
section Elmigera with species of Habroanthus. Southwestern Naturalist, 10: 234–237.
Cui, L., Wall, P. K., Leebens-Mack, J. H., Lindsay, B. G., Soltis, D. E., Doyle, J. J.,
Soltis, P. S., Carlson, J. E., Arumuganathan, K., Barakat, A., Albert, V. A., Ma,
H., and dePamphilis, C. W. 2006. Widespread genome duplications throughout the
history of flowering plants. Genome Research, 16: 738–749.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A.,
Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin,
R., and 1000 Genomes Project Analysis Group 2011. The variant call format and
VCFtools. Bioinformatics, 27: 2156–2158.
Datwyler, S. L. and Wolfe, A. D. 2004. Phylogenetic relationships and morphological
evolution in Penstemon subg. Dasanthera (Veronicaceae). Systematic Botany, 29:
165–176.
de Silva, H., Hall, A., Rikkerink, E., McNeilage, M., and Fraser, L. 2005. Estimation
of allele frequencies in polyploids under certain patterns of inheritance. Heredity,
95: 327–334.
Degnan, J. H. and Rosenberg, N. A. 2009. Gene tree discordance, phylogenetic
inference and the multispecies coalescent. Trends in Ecology and Evolution, 24:
332–340.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society:
Series B (Methodological), 39: 1–38.
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C.,
Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell,
T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler,
D., and Daly, M. J. 2011. A framework for variation discovery and genotyping using
next-generation DNA sequencing data. Nature Genetics, 43(5): 491–498.
Di Stilio, V. S., Kramer, E. M., and Baum, D. A. 2005. Floral MADS box genes
and homeotic gender dimorphism in Thalictrum dioicum (Ranunculaceae) - a new
model for the study of dioecy. The Plant Journal, 41: 755–766.
Douglas, G. M., Gos, G., Steige, K. A., Salcedo, A., Holm, K., Josephs, E. B.,
Arunkumar, R., Ågren, J. A., Hazzouri, K. M., Wang, W., Platts, A. E., Williamson,
R. J., Neuffer, B., Lascoux, M., Slotte, T., and Wright, S. I. 2015. Hybrid origins
and the earliest stages of diploidization in the highly successful recent polyploid
Capsella bursa-pastoris. Proceedings of the National Academy of Sciences USA, 112:
2806–2811.
Dufresne, F., Stift, M., Vergilino, R., and Mable, B. K. 2014. Recent progress and
challenges in population genetics of polyploid organisms: an overview of current
state-of-the-art molecular and statistical tools. Molecular Ecology, 23: 40–69.
Dupuis, J. R., Bremer, F. T., Kauwe, A., San Jose, M., Leblanc, L., Rubinoff, D., and
Geib, S. 2017. HiMAP: robust phylogenomics from highly multiplexed amplicon
sequencing. bioRxiv, https://doi.org/10.1111/1755-0998.12783.
Eaton, D. A. R., Hipp, A. L., González-Rodríguez, A., and Cavender-Bares, J. 2015.
Historical introgression among the American live oaks and the comparative nature
of tests for introgression. Evolution, 69: 2587–2601.
Eddelbuettel, D. 2013. Seamless R and C++ integration with Rcpp. Springer, New
York.
Eddelbuettel, D. and François, R. 2011. Rcpp: seamless R and C++ integration.
Journal of Statistical Software, 40: 1–18.
Eddelbuettel, D. and Sanderson, C. 2014. RcppArmadillo: accelerating R with high-
performance C++ linear algebra. Computational Statistics and Data Analysis, 71:
1054–1063.
Edgar, R. C. 2010. Search and clustering orders of magnitude faster than BLAST.
Bioinformatics, 26: 2460–2461.
Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C., and Knight, R. 2011. UCHIME
improves sensitivity and speed of chimera detection. Bioinformatics, 27: 2194–2200.
Esselink, G. D., Nybom, H., and Vosman, B. 2004. Assignment of allelic configuration
in polyploids using the MAC-PR (microsatellite DNA allele counting–peak ratios)
method. Theoretical and Applied Genetics, 109: 402–408.
Falush, D., Stephens, M., and Pritchard, J. 2003. Inference of population structure us-
ing multilocus genotype data: linked loci and correlated allele frequencies. Genetics,
164: 1567–1587.
Fisher, R. A. 1943. Allowance for double reduction in the calculation of genotype
frequencies with polysomic inheritance. Annals of Eugenics, 12: 169–171.
Folk, R. A., Mandel, J. R., and Freudenstein, J. V. 2016. Ancestral gene flow and
parallel organellar genome capture result in extreme phylogenomic discord in a
lineage of angiosperms. Systematic Biology, 66: 320–337.
Folk, R. A., Visger, C. J., Soltis, P. S., Soltis, D. E., and Guralnick, R. P. 2018.
Geographic range dynamics drove ancient hybridization in a lineage of angiosperms.
American Naturalist, https://dx.doi.org/10.1086/698120.
Foll, M. and Gaggiotti, O. 2008. A genome-scan method to identify selected loci
appropriate for both dominant and codominant markers: a Bayesian perspective.
Genetics, 180: 977–993.
Fumagalli, M., Vieira, F. G., Korneliussen, T., Linderoth, T., Huerta-Sánchez, E.,
Albrechtsen, A., and Nielsen, R. 2013. Quantifying population genetic differentiation
from next-generation sequencing data. Genetics, 195: 979–992.
Furlong, R. F. and Holland, P. W. H. 2001. Were vertebrates octoploid? Philosophical
Transactions of the Royal Society B: Biological Sciences, 357: 531–544.
García, N., Folk, R. A., Meerow, A. W., Chamala, S., Gitzendanner, M. A., de Oliveira,
R. S., Soltis, D. E., and Soltis, P. S. 2017. Deep reticulation and incomplete lineage
sorting obscure the diploid phylogeny of rain-lilies and allies (Amaryllidaceae tribe
Hippeastreae). Molecular Phylogenetics and Evolution, 111: 231–247.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B.
2014. Bayesian data analysis. Chapman & Hall/CRC Press, 3rd edn.
Glaubitz, J., Casstevens, T. M., Lu, F., Harriman, J., Elshire, R. J., Sun, Q., and
Buckler, E. S. 2014. TASSEL-GBS: A high capacity genotyping by sequencing
analysis pipeline. PLoS ONE, 9: e90346.
Gompert, Z. and Buerkle, C. A. 2011a. A hierarchical Bayesian model for next-
generation population genomics. Genetics, 187: 903–917.
Gompert, Z. and Buerkle, C. A. 2011b. A hierarchical Bayesian model for next-
generation population genomics. Genetics, 187: 903–917.
Gompert, Z. and Buerkle, C. A. 2012. bgc: software for Bayesian estimation of
genomic clines. Molecular Ecology Resources, 12: 1168–1176.
Gompert, Z., Forister, M. L., Fordyce, J. A., Nice, C. C., Williamson, R. J., and
Buerkle, C. A. 2010. Bayesian analysis of molecular variance in pyrosequences
quantifies population genetic structure across the genome of Lycaeides butterflies.
Molecular Ecology, 19: 2455–2473.
Gostel, M. R., Coy, K. A., and Weeks, A. 2015. Microfluidic PCR-based target
enrichment: a case study in two rapid radiations of Commiphora (Burseraceae)
from Madagascar. Journal of Systematics and Evolution, 53: 411–431.
Goto, K. and Meyerowitz, E. M. 1994. Function and regulation of the Arabidopsis floral
homeotic gene PISTILLATA. Genes and Development, 8: 1548–1560.
Grant, V. 1971. Plant speciation. Columbia University Press.
Gregory, T. R. and Mable, B. K. 2005. Polyploidy in animals. In: The evolution of
the genome. Edited by T. R. Gregory. Elsevier, pp. 427–517.
Gruenstaeudl, M., Reid, N. M., Wheeler, G. L., and Carstens, B. C. 2015. Posterior
predictive checks of coalescent models: P2C2M, an R package. Molecular Ecology
Resources, 16: 193–205.
Haldane, J. B. S. 1930. Theoretical genetics of autopolyploids. Journal of Genetics,
22: 359–372.
Hardy, O. J. 2016. Population genetics of autopolyploids under a mixed mating model
and the estimation of selfing rate. Molecular Ecology Resources, 16: 103–117.
Hernández, J. L. and Weir, B. S. 1989. A disequilibrium coefficient approach to
Hardy-Weinberg testing. Biometrics, 45: 53–70.
Holsinger, K. E., Lewis, P. O., and Dey, D. K. 2002. A Bayesian approach to inferring
population structure from dominant markers. Molecular Ecology, 11: 1157–1164.
Huang, G., Wang, S., Wang, X., and You, N. 2016. An empirical Bayes method for
genotyping and SNP detection using multi-sample next-generation sequencing data.
Bioinformatics, 32: 3240–3245.
Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model of
genetic variation. Bioinformatics, 18: 337–338.
Jhwueng, D.-C. and O’Meara, B. 2015. Trait evolution on phylogenetic networks.
bioRxiv, https://doi.org/10.1101/023986.
Jiao, Y., Wickett, N. J., Ayyampalayam, S., Chanderbali, A. S., Landherr, L., Ralph,
P. E., Tomsho, L. P., Hu, Y., Liang, H., Soltis, P. S., Soltis, D. E., Clifton, S. W.,
Schlarbaum, S. E., Schuster, S. C., Ma, H., Leebens-Mack, J., and dePamphilis,
C. W. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature, 473:
97–100.
Jombart, T. and Ahmed, I. 2011. adegenet 1.3-1: new tools for the analysis of
genome-wide SNP data. Bioinformatics, 27: 3070–3071.
Jones, G., Sagitov, S., and Oxelman, B. 2013. Statistical inference of allopolyploid
species networks in the presence of incomplete lineage sorting. Systematic Biology,
62: 467–478.
Joshi, N. A. and Fash, J. N. 2011. Sickle: sliding-window, adaptive,
quality-based trimming tool for FASTQ files (version 1.33). Available at
https://github.com/najoshi/sickle.
Kamneva, O. K., Syring, J., Liston, A., and Rosenberg, N. A. 2017. Evaluating
allopolyploid origins in strawberries (Fragaria) using haplotypes generated from
target sequence capture. BMC Evolutionary Biology, 17: 180.
Kates, H. R., Soltis, P. S., and Soltis, D. E. 2017. Evolutionary and domestication
history of Cucurbita (pumpkin and squash) species inferred from 44 nuclear loci.
Molecular Phylogenetics and Evolution, 111: 98–109.
Katoh, S. 2013. MAFFT multiple sequence alignment software version 7: improvements
in performance and usability. Molecular Biology and Evolution, 30: 772–780.
Kearse, M., Moir, R., Wilson, A., Stones-Havas, S., Cheung, M., Sturrock, S., Buxton,
S., Cooper, A., Markowitz, S., Duran, C., Thierer, T., Ashton, B., Meintjes, P.,
and Drummond, A. 2012. Geneious Basic: an integrated and extendable desktop
software platform for the organization and analysis of sequence data. Bioinformatics,
28: 1647–1649.
Keck, D. D. 1945. Studies in Penstemon–XIII: a cyto-taxonomic account of the section
Spermunculus. American Midland Naturalist, 33: 128–206.
Kingman, J. F. C. 1982. On the genealogy of large populations. Journal of Applied
Probability, 19: 27–43.
Kubatko, L. S. and Chifman, J. 2015. An invariants-based method for hybridization de-
tection from genome-scale sequence data. bioRxiv, https://doi.org/10.1101/034348.
Kubatko, L. S. and Degnan, J. H. 2007. Inconsistency of phylogenetic estimates from
phylogenetic data under coalescence. Systematic Biology, 56: 17–24.
Kumar, S., Stecher, G., and Tamura, K. 2016. MEGA7: Molecular Evolutionary
Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution,
33: 1870–1874.
Larget, B., Kotha, S. K., Dewey, C. N., and Ané, C. 2010. BUCKy: gene tree /
species tree reconciliation with Bayesian concordance analysis. Bioinformatics, 26:
2910–2911.
Lawrence, T. J. and Datwyler, S. L. 2016. Testing the hypothesis of allopolyploidy
in the origin of Penstemon azureus (Plantaginaceae). Frontiers in Ecology and
Evolution, 4: 60.
Lawrence, W. J. C. 1929. The genetics and cytology of Dahlia species. Journal of
Genetics, 21: 125–158.
Lee, J.-Y., Mummenhoff, K., and Bowman, J. L. 2002. Allopolyploidization and
evolution of species with reduced floral structures in Lepidium L. (Brassicaceae).
Proceedings of the National Academy of Sciences USA, 99: 16835–16840.
Levin, D. A. 1973. The role of trichomes in plant defense. Quarterly Review of Biology,
48: 3–15.
Li, H. 2010. Mathematical notes on SAMtools algorithms.
https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf.
Li, H. 2011. A statistical framework for SNP calling, mutation discovery, association
mapping and population genetical parameter estimation from sequencing data.
Bioinformatics, 27: 2987–2993.
Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics, 25: 1754–1760.
Logan-Young, C. J., Yu, J. Z., Verma, S. K., Percy, R. G., and Pepper, A. E. 2015.
SNP discovery in complex allotetraploid genomes (Gossypium spp., Malvaceae)
using genotyping by sequencing. Applications in Plant Sciences, 3: 1400077.
Lott, M., Spillner, A., Huber, K. T., and Moulton, V. 2009. PADRE: a package for
analyzing and displaying reticulate evolution. Bioinformatics, 25: 1199–1200.
Lu, F., Lipka, A. E., Glaubitz, J., Elshire, R., Cherney, J. H., Casler, M. D., Buckler,
E. S., and Costich, D. E. 2012. Switchgrass genomic diversity, ploidy, and evolution:
Novel insights from a network-based SNP discovery protocol. PLoS Genetics, 9:
e1003215.
Maddison, W. P. 1997. Gene trees in species trees. Systematic Biology, 46: 523–536.
Magoč, T. and Salzberg, S. L. 2011. FLASH: fast length adjustment of short reads to
improve genome assemblies. Bioinformatics, 27: 2957–2963.
Mallet, J. 2007. Hybrid speciation. Nature, 446: 279–283.
Marcussen, T., Jakobsen, K. S., Danihelka, J., Ballard, H. E., Blaxland, K., Brysting,
A. K., and Oxelman, B. 2012. Inferring species networks from gene trees in high-
polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic
Biology, 61: 107–126.
Martin, E. R., Kinnamon, D. D., Schmidt, M. A., Powell, E. H., Zuchner, S., and Morris,
R. W. 2010. SeqEM: an adaptive genotype-calling approach for next-generation
sequencing studies. Bioinformatics, 26: 2803–2810.
Maruki, T. and Lynch, M. 2017. Genotype calling from population-genomic sequencing
data. G3: Genes, Genomes, Genetics, 7: 1393–1404.
McAllister, C. A. and Miller, A. J. 2016. Single nucleotide polymorphism discovery
via genotyping by sequencing to assess population genetic structure and recurrent
polyploidization in Andropogon gerardii. American Journal of Botany, 103: 1314–
1325.
McDade, L. 1990. Hybrids and phylogenetic systematics I. Patterns of character
expression in hybrids and their implications for cladistic analysis. Evolution, 44:
1685–1700.
McDade, L. 1992. Hybrids and phylogenetic systematics II. The impact of hybrids on
cladistic analysis. Evolution, 46: 1329–1346.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A.,
Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. 2010. The
Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation
DNA sequencing data. Genome Research, 20: 1297–1303.
Meng, C. and Kubatko, L. S. 2009. Detecting hybrid speciation in the presence
of incomplete lineage sorting using gene tree incongruence: a model. Theoretical
Population Biology, 75: 35–45.
Meng, X.-L. and Rubin, D. B. 1993. Maximum likelihood estimation via the ECM
algorithm: a general framework. Biometrika, 80: 267–278.
Merkel, D. 2014. Docker: lightweight Linux containers for consistent development and
deployment. Linux Journal, 2014: 2.
Miller, M. R., Dunham, J. P., Amores, A., Cresko, W. A., and Johnson, E. A.
2007. Rapid and cost-effective polymorphism identification and genotyping using
restriction site associated DNA (RAD) markers. Genome Research, 17: 240–248.
Mirarab, S. and Warnow, T. 2015. ASTRAL-II: coalescent-based species tree estimation
with many hundreds of taxa and thousands of genes. Bioinformatics, 31: i44–i52.
Moody, M. L., Mueller, L. D., and Soltis, D. E. 1993. Genetic variation and random
drift in autotetraploid populations. Genetics, 134: 649–657.
Morales-Briones, D. F., Liston, A., and Tank, D. C. 2018. Phylogenomic analyses
reveal a deep history of hybridization and polyploidy in the Neotropical genus
Lachemilla (Rosaceae). New Phytologist, https://doi.org/10.1111/nph.15099.
Muller, H. J. 1914. A new mode of segregation in Gregory’s tetraploid primulas.
American Naturalist, 48: 508–512.
Nielsen, R., Paul, J. S., Albrechtsen, A., and Song, Y. S. 2011. Genotyping and
SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12:
443–451.
Nielsen, R., Korneliussen, T., Albrechtsen, A., and Li, Y. 2012. SNP calling, genotype
calling, and sample allele frequency estimation from new-generation sequencing
data. PLoS ONE, 7: e37558.
Nold, R. 1999. Penstemons. Timber Press, Portland, OR.
Oberprieler, C., Wagner, F., Tomasello, S., and Konowalik, K. 2017. A permutation approach
for inferring species networks from gene trees in polyploid complexes by minimizing
deep coalescences. Methods in Ecology and Evolution, 8: 835–849.
Ogden, R., Gharbi, K., Mugue, N., Martinsohn, J., Senn, H., Davey, J. W., Pourkazemi,
M., McEwing, R., Eland, C., Vidotto, M., Sergeev, A., and Congiu, L. 2013. Stur-
geon conservation genomics: SNP discovery and validation using RAD sequencing.
Molecular Ecology, 22: 3112–3123.
Ohno, S. 1970. Evolution by gene duplication. Springer.
Otto, S. P. and Whitton, J. 2000. Polyploid incidence and evolution. Annual Review
of Genetics, 34: 401–437.
Pamilo, P. and Nei, M. 1988. Relationships between gene trees and species trees.
Molecular Biology and Evolution, 5: 568–583.
Parisod, C., Holderegger, R., and Brochmann, C. 2010. Evolutionary consequences of
autopolyploidy. New Phytologist, 186: 5–17.
Pennell, M. W., FitzJohn, R. G., Cornwell, W. K., and Harmon, L. J. 2015. Model ade-
quacy and the macroevolution of angiosperm functional traits. American Naturalist,
186: E100.
Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., and Hoekstra, H. E. 2012.
Double digest RADseq: an inexpensive method for de novo SNP discovery and
genotyping in model and non-model species. PloS ONE, 7: e37135.
Plummer, M., Best, N., Cowles, K., and Vines, K. 2006. CODA: Convergence
Diagnostics and Output Analysis for MCMC. R News, 6: 7–11.
Puritz, J. B., Matz, M. V., Toonen, R. J., Weber, J. N., Bolnick, D. I., and Bird, C. E.
2014. Demystifying the RAD fad. Molecular Ecology, 23: 5937–5942.
R Core Team 2014. R: a language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.
R Core Team 2016. R: a language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.
Rambaut, A. and Grassly, N. C. 1997. Seq-Gen: an application for the Monte Carlo simulation
of DNA sequence evolution along phylogenetic trees. Computer Applications
in the Biosciences, 13: 235–238.
Ramsey, J. 2011. Polyploidy and ecological adaptation in wild yarrow. Proceedings of
the National Academy of Sciences, 108: 7096–7101.
Ramsey, J. and Ramsey, T. S. 2014. Ecological studies of polyploidy in the 100 years
following its discovery. Philosophical Transactions of the Royal Society B: Biological
Sciences, 369: 20130352.
Reid, N. M., Hird, S. M., Brown, J. M., Pelletier, T. A., McVay, J. D., Satler, J. D.,
and Carstens, B. C. 2014. Poor fit to the multispecies coalescent is widely detectable
in empirical data. Systematic Biology, 63: 322–333.
Rheindt, F. E., Fujita, M. K., Wilton, P. R., and Edwards, S. V. 2014. Introgression
and phenotypic assimilation in Zimmerius flycatchers (Tyrannidae): population
genetic and phylogenetic inferences from genome-wide SNPs. Systematic Biology,
63: 134–152.
Ripplinger, J. and Sullivan, J. 2010. Assessment of substitution model adequacy using
frequentist and Bayesian methods. Molecular Biology and Evolution, 27: 2790–2803.
Rogers, J. D. 1973. Polyploidy in Fungi. Evolution, 27: 153–160.
Rothfels, R. C., Li, F.-W., and Pryer, K. M. 2017. Next-generation polyploid phyloge-
netics: rapid resolution of hybrid polyploid complexes using PacBio single-molecule
sequencing. New Phytologist, 213: 413–429.
Sayyari, E. and Mirarab, S. 2016. Fast coalescent-based computation of local branch
support from quartet frequencies. Molecular Biology and Evolution, 33: 1654–1668.
Scarpino, S. V., Levin, D. A., and Meyers, L. A. 2014. Polyploid formation shapes
flowering plant diversity. American Naturalist, 184: 456–465.
Selmecki, A. M., Maruvka, Y. E., Richmond, P. A., Guillet, M., Shoresh, N., Sorenson,
A. L., De, S., Kishony, R., Michor, F., Dowell, R., and Pellman, D. 2015. Polyploidy
can drive rapid adaptation in yeast. Nature, 519: 349–352.
Serang, O., Mollinari, M., and Garcia, A. A. F. 2012. Efficient exact maximum a
posteriori computation for Bayesian SNP genotyping in polyploids. PloS ONE, 7:
e30906.
Smith, S. A. and Dunn, C. 2008. Phyutility: a phyloinformatics utility for trees,
alignments, and molecular data. Bioinformatics, 24: 715–716.
Snir, S. 2012. Quartet maxcut: a fast algorithm for amalgamating quartet trees.
Molecular Phylogenetics and Evolution, 62: 1–8.
Solís-Lemus, C. and Ané, C. 2016. Inferring phylogenetic networks with maximum
pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12: e1005896.
Solís-Lemus, C., Bastide, P., and Ané, C. 2017. PhyloNetworks: a package for
phylogenetic networks. Molecular Biology and Evolution, 34: 3292–3298.
Soltis, D. E., Soltis, P. S., and Tate, J. A. 2003. Advances in the study of polyploidy
since plant speciation. New Phytologist, 161: 173–191.
Soltis, D. E., Soltis, P. S., Schemske, D. W., Hancock, J. F., Thompson, J. N., Husband,
B. C., and Judd, W. S. 2007. Autopolyploidy in angiosperms: have we grossly
underestimated the number of species? Taxon, 56: 13–30.
Soltis, D. E., Albert, V. A., Leebens-Mack, J., Bell, C. D., Peterson, A. H., Zheng, C.,
Sankoff, D., dePamphilis, C. W., Wall, P. K., and Soltis, P. S. 2009. Polyploidy and
angiosperm diversification. American Journal of Botany, 96: 336–348.
Soltis, D. E., Buggs, R. J. A., Doyle, J. J., and Soltis, P. S. 2010. What we still don’t
know about polyploidy. Taxon, 59: 1387–1403.
Soltis, D. E., Visger, C. J., and Soltis, P. S. 2014. The polyploidy revolution then...and
now: Stebbins revisited. American Journal of Botany, 101: 1057–1078.
Soltis, P. S. and Soltis, D. E. 2000. The role of genetic and genomic attributes in
the success of polyploids. Proceedings of the National Academy of Sciences, 97:
7051–7057.
Soltis, P. S. and Soltis, D. E. 2009. The role of hybridization in plant speciation.
Annual Review of Plant Biology, 60: 561–588.
Soza, V. L., Haworth, K. L., and Di Stilio, V. S. 2013. Timing and consequences
of recurrent polyploidy in meadow-rues (Thalictrum, Ranunculaceae). Molecular
Biology and Evolution, 30: 1940–1954.
Soza, V. L., Hyunh, V. L., and Di Stilio, V. S. 2014. Pattern and process in the
evolution of the sole dioecious member of Brassicaceae. EvoDevo, 5: 42.
Stamatakis, A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-
analysis of large phylogenies. Bioinformatics, 30: 1312–1313.
Stamatakis, A., Hoover, P., and Rougemont, J. 2008. A rapid bootstrap algorithm for
the RAxML web servers. Systematic Biology, 57: 758–771.
Stebbins, G. L. 1950. Variation and evolution in plants. Columbia University Press.
Stenz, N. W. M., Larget, B., Baum, D. A., and Ané, C. 2015. Exploring tree-like and
non-tree-like patterns using genome sequences: an example using the inbreeding
plant species Arabidopsis thaliana (L.) Heynh. Systematic Biology, 64: 809–823.
Stift, M., Berenos, C., Kuperus, P., and van Tienderen, P. H. 2008. Segregation
models for disomic, tetrasomic and intermediate inheritance in tetraploids: a
general procedure applied to Rorippa (yellow cress) microsatellite data. Genetics,
179: 2113–2123.
Stojmenovic, I. and Zoghbi, A. 1998. Fast algorithms for generating integer partitions.
International Journal of Computer Mathematics, 70: 319–332.
Straw, R. M. 1955. Hybridization, homogamy, and sympatric speciation. Evolution, 9:
441–444.
Straw, R. M. 1966. A redefinition of Penstemon (Scrophulariaceae). Brittonia, 18:
80–95.
Strickler, D. 1997. Northwest Penstemons. Flower Press.
Takahata, N. 1989. Gene genealogy in three related populations: consistency probability
between gene and population trees. Genetics, 122: 957–966.
Tavaré, S. 1984. Line-of-descent and genealogical processes, and their applications in
population genetics models. Theoretical Population Biology, 26: 119–164.
Thomas, G. W. C., Ather, S. H., and Hahn, M. W. 2017. Gene-tree reconciliation
with MUL-trees to resolve polyploidy events. Systematic Biology, 66: 1007–1018.
Uribe-Convers, S., Settles, M. L., and Tank, D. C. 2016. A phylogenomic approach
based on PCR enrichment and high throughput sequencing: resolving diversity
within the South American species of Bartsia L. (Orobanchaceae). PLoS ONE, 11:
e0148203.
Van de Peer, Y., Fawcett, J. A., Proost, S., Sterck, L., and Vandepoele, K. 2009. The
flowering world: a tale of duplications. Trends in Plant Sciences, 14: 680–688.
Vieira, F. G., Fumagalli, M., Albrechtsen, A., and Nielsen, R. 2013. Estimating
inbreeding coefficients from NGS data: impact on genotype calling and allele
frequency estimation. Genome Research, 23: 1852–1861.
Voorrips, R., Gort, G., and Vosman, B. 2011. Genotype calling in tetraploid species
from bi-allelic marker data using mixture models. BMC Bioinformatics, 12: 172.
Wagner, W. H. 1970. Biosystematics and evolutionary noise. Taxon, 19: 146–151.
Wang, N., Thomson, M., Bodles, W. J. A., Crawford, R. M. M., Hunt, H. V.,
Fetherstone, A. W., Pellicer, J., and Buggs, R. J. A. 2013. Genome sequence of
dwarf birch (Betula nana) and cross-species RAD markers. Molecular Ecology, 22:
3098–3111.
Weir, B. S. 1996. Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.
Wen, D. and Nakhleh, L. 2018. Coestimating reticulate phylogenies
and gene trees from multilocus sequence data. Systematic Biology,
https://doi.org/10.1093/sysbio/syx085.
Wen, D., Yu, Y., Zhu, J., and Nakhleh, L. 2018. Inferring phylogenetic networks using
PhyloNet. Systematic Biology, https://doi.org/10.1093/sysbio/syy015.
Wessinger, C. A., Freeman, C. C., Mort, M. E., Rausher, M. D., and Hileman, L. C.
2016. Multiplexed shotgun genotyping resolves species relationships within the
North American genus Penstemon. American Journal of Botany, 103: 912–922.
Wickham, H. 2007. Reshaping data with the reshape package. Journal of Statistical
Software, 21(12): 1–20.
Wickham, H. 2009. ggplot2: elegant graphics for data analysis. Springer, New York.
Wilson, P. S., Wolfe, A. D., Armbruster, W. S., and Thompson, J. D. 2007. Constrained
lability in floral evolution: counting convergent origins of hummingbird pollination
in Penstemon and Keckiella. New Phytologist, 176: 883–890.
Winge, Ö. 1917. The chromosomes: their number and general importance. Comptes
rendus des travaux du Laboratoire Carlsberg, 13: 131–275.
Winkler, H. 1916. Über die experimentelle Erzeugung von Pflanzen mit abweichenden
Chromosomenzahlen. Zeitschrift für induktive Abstammungs- und Vererbungslehre,
8: 417–531.
Wolfe, A. D. 2005. ISSR techniques for evolutionary biology. Methods in Enzymology,
395: 134–144.
Wolfe, A. D., Xiang, Q.-Y., and Kephart, S. R. 1998a. Assessing hybridization in
natural populations of Penstemon (Scrophulariaceae) using hypervariable intersimple
sequence repeat (ISSR) bands. Molecular Ecology, 7: 1107–1125.
Wolfe, A. D., Xiang, Q.-Y., and Kephart, S. R. 1998b. Diploid hybrid speciation in
Penstemon (Scrophulariaceae). Proceedings of the National Academy of Sciences,
95: 5112–5115.
Wolfe, A. D., Datwyler, S. L., and Randle, C. P. 2002. A phylogenetic and biogeographic
analysis of the Cheloneae (Scrophulariaceae) based on ITS and matK sequence data.
Systematic Botany, 27: 138–148.
Wolfe, A. D., Randle, C. P., Datwyler, S. L., Morawetz, J. J., Arguedas, N., and
Diaz, J. 2006. Phylogeny, taxonomic affinities, and biogeography of Penstemon
(Plantaginaceae) based on ITS and cpDNA sequence data. American Journal of
Botany, 93: 1699–1713.
Wood, T. E., Takebayashi, N., Barker, M. S., Mayrose, I., Greenspoon, P. B., and
Rieseberg, L. H. 2009. The frequency of polyploid speciation in vascular plants.
Proceedings of the National Academy of Sciences, 106: 13875–13879.
Wright, S. 1931. Evolution in Mendelian populations. Genetics, 16: 97–159.
Wright, S. 1938. The distribution of gene frequencies in populations of polyploids.
Proceedings of the National Academy of Sciences, 24: 372–377.
Wright, S. 1951. The genetical structure of populations. Annals of Eugenics, 15:
323–354.
Zohren, J., Wang, N., Kardailsky, I., Borrell, J. S., Joecker, A., Nichols, R. A.,
and Buggs, R. J. A. 2016. Unidirectional diploid–tetraploid introgression among
British birch trees with shifting ranges shown by restriction site-associated markers.
Molecular Ecology, 25: 2413–2426.
Appendix A: Chapter 1 Supplemental Materials
A.1 Example Analyses of Autotetraploid Potato (Solanum tuberosum)
The following walk-through covers every step of the analysis of the autotetraploid
potato (Solanum tuberosum) data set presented in the manuscript.
Because the analysis with polyfreqs takes a few hours (there are 86,400 parameters
to estimate), we have provided the output from that step. The potato data
set is distributed freely with the R package fitTetra, and the code below shows
how we acquired, formatted, and rescaled it for an analysis with polyfreqs.
Instructions for installing polyfreqs can be found on the GitHub page associated with the package (http://pblischak.github.io/polyfreqs). The following sections
are intended to be completed with the data in the example/ folder found in the
GitHub repository accompanying the manuscript (https://github.com/pblischak/polyfreqs-ms-data).
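As a rough check on the parameter count quoted above, the sketch below assumes that the model estimates one genotype per individual per locus plus one allele frequency per locus. The function name n_params() is hypothetical, and the dimensions used (224 individuals, 384 loci) are an assumption that is not stated in this appendix; they are shown only because they reproduce the quoted total.

```r
# Hypothetical helper: number of parameters when there is one genotype per
# individual per locus plus one allele frequency per locus.
n_params <- function(n_ind, n_loci) {
  n_ind * n_loci + n_loci
}

# Assumed dimensions (not given in this appendix): 224 individuals, 384 loci.
n_params(224, 384)  # 86400
```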
# Using autotetraploid potato data from the fitTetra package.
# If not installed, install it using:
# install.packages("fitTetra")
# Then load the data.
library(fitTetra)
data(tetra.potato.SNP)
# Get the names of the individuals and loci.
samples <- unique(tetra.potato.SNP$SampleName)
markers <- unique(tetra.potato.SNP$MarkerName)
# Initialize x and y matrices -- x will be the reference allele.
potato_mat_x <- matrix(NA,
                       nrow=length(unique(tetra.potato.SNP$SampleName)),
                       ncol=length(unique(tetra.potato.SNP$MarkerName)))
rownames(potato_mat_x) <- samples
colnames(potato_mat_x) <- markers
potato_mat_y <- matrix(NA,
                       nrow=length(unique(tetra.potato.SNP$SampleName)),
                       ncol=length(unique(tetra.potato.SNP$MarkerName)))
# Get the counts from the data frame.
for(i in 1:dim(potato_mat_x)[1]){
tmp <- subset(tetra.potato.SNP, SampleName==samples[i])
potato_mat_x[i,] <- tmp$X_Raw
potato_mat_y[i,] <- tmp$Y_Raw
}
# Get the total counts as the sum of x and y and give row and column names.
potato_mat_tot <- potato_mat_x + potato_mat_y
rownames(potato_mat_tot) <- samples
colnames(potato_mat_tot) <- markers
# Rescale, then print the tables to file in a format suitable for polyfreqs.
potato_mat_x <- round(potato_mat_x/100)
potato_mat_tot <- round(potato_mat_tot/100)
write.table(potato_mat_x, file="potato_ref_reads.txt", quote=F, sep="\t")
write.table(potato_mat_tot, file="potato_tot_reads.txt", quote=F, sep="\t")
If you look at the files that were just made (potato_ref_reads.txt and
potato_tot_reads.txt), you can see how data should be formatted for running
an analysis with polyfreqs. More details will be provided in the next section when we read in the data and analyze it.
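The reshaping and rescaling steps above can also be tried on a small made-up data set that does not require fitTetra. The sketch below is illustrative only: the toy data frame, its values, and the object names are assumptions, and match() is added as a guard in case markers are not listed in the same order for every sample (the original loop relies on a consistent order).

```r
# Toy long-format data with the same column names as tetra.potato.SNP
# (all values are made up for illustration).
toy <- data.frame(
  SampleName = rep(c("ind1", "ind2"), each = 3),
  MarkerName = rep(c("m1", "m2", "m3"), times = 2),
  X_Raw = c(100, 240, 0, 400, 160, 300),
  Y_Raw = c(300, 160, 400, 0, 240, 100),
  stringsAsFactors = FALSE
)

samples <- unique(toy$SampleName)
markers <- unique(toy$MarkerName)

# Individuals in rows, loci in columns.
toy_x <- matrix(NA, nrow = length(samples), ncol = length(markers),
                dimnames = list(samples, markers))
toy_y <- toy_x

for (i in seq_along(samples)) {
  tmp <- subset(toy, SampleName == samples[i])
  idx <- match(markers, tmp$MarkerName)  # align rows to the marker order
  toy_x[i, ] <- tmp$X_Raw[idx]
  toy_y[i, ] <- tmp$Y_Raw[idx]
}

# Total signal, then the same divide-by-100 rescaling used above.
toy_tot <- toy_x + toy_y
toy_ref_scaled <- round(toy_x / 100)
toy_tot_scaled <- round(toy_tot / 100)
```

After rescaling, each entry of toy_ref_scaled should be no larger than the matching entry of toy_tot_scaled, which is the relationship expected between reference and total read counts.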
A.1.1 Calculating Expected and Observed Heterozygosity
Next we will read the data into R. The simplest way to do this is to use the
read.table() function. In the total and reference read count files for the potato data,
the first row is a tab delimited list of locus names. This row is optional and can be
excluded. After that, each row has the name of the individual followed by the read
counts at each locus (tab delimited). The individual name is required because it is used when writing genotype samples to file (set genotypes=T when running polyfreqs).
To specify that the first column contains the names, we use the row.names argument
and set it equal to 1. To specify that the first row has column names for each locus
(you do not need a label for the names), set the header argument to TRUE. With the
data read in, all that is left to do is to load polyfreqs and set up an analysis.
NB: When the data are passed to the polyfreqs() function, make sure that they are converted to matrices using the as.matrix() function.
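The file layout described above can be checked with a small round trip that does not involve the potato data. This is a minimal sketch with made-up counts and a temporary file; the object names are hypothetical. It writes a matrix the same way as in the previous section, reads it back with row.names=1 and header=TRUE, and converts the result to a matrix.

```r
# Toy total read counts: 3 individuals (rows) by 2 loci (columns).
toy_tot <- matrix(c(10, 12, 8, 9, 11, 7), nrow = 3,
                  dimnames = list(c("ind1", "ind2", "ind3"),
                                  c("loc1", "loc2")))

# Write the table in the same way as the potato files.
toy_file <- tempfile(fileext = ".txt")
write.table(toy_tot, file = toy_file, quote = FALSE, sep = "\t")

# The header row holds only the locus names, so it has one fewer field
# than the data rows; row.names = 1 takes the individual names from the
# first column.
toy_in <- as.matrix(read.table(toy_file, row.names = 1, header = TRUE))

dim(toy_in)       # 3 2
rownames(toy_in)  # "ind1" "ind2" "ind3"
```

read.table() returns a data frame, which is why the as.matrix() conversion is needed before the counts are passed on.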
# Read in data using read.table. Remember the row.names and header options.
# If you don't have locus names in the first row, take out header=T.
potato_tot_table <- read.table("potato_tot_reads.txt", row.names=1, header=T)
potato_ref_table <- read.table("potato_ref_reads.txt", row.names=1, header=T)
# Load polyfreqs
library(polyfreqs)
# Run through polyfreqs with genotypes=T
# and geno_dir="potato_genotypes".
# Make sure you use the as.matrix() command. potato_out <- polyfreqs(as.matrix(potato_tot_table),
as.matrix(potato_ref_table), ploidy=4, iter=100000,
genotypes=T, geno_dir="potato_genotypes",
outfile="potato_mcmc.out")
The potato_out object will be a list of four items:
• potato_out$posterior_freqs – a matrix of the posterior samples of allele fre-
quencies at each locus prior to burn-in (also printed to the file potato_mcmc.out).
• potato_out$map_genotypes – a matrix of the maximum a posteriori genotypes
for each individual at each locus estimated using the posterior mode.
• potato_out$het_obs – a matrix of the per locus posterior samples of observed
heterozygosity.
• potato_out$het_exp – a matrix of the per locus posterior samples of expected
heterozygosity.
We will write each of these to file for downstream analyses (except for posterior_freqs, which already has its own file).
write.table(potato_out$map_genotypes, "potato_map_genotypes.txt", quote=F,
            row.names=F, col.names=F)
write.table(potato_out$het_obs, "potato_het_obs.txt", quote=F,
            row.names=F, col.names=F)
write.table(potato_out$het_exp, "potato_het_exp.txt", quote=F,
            row.names=F, col.names=F)
To evaluate the observed and expected heterozygosity, we will get multi-locus estimates by taking the mean across loci of the per locus posterior samples in the het_obs and het_exp matrices. We can then plot these and calculate summary statistics to understand the difference between them.
# If you have the potato_out object in the workspace you can proceed
# without reading in the files using the commands:
#
# het_obs <- potato_out$het_obs
# het_exp <- potato_out$het_exp
# We will read in the files and convert to matrices at the same time.
het_obs <- as.matrix(read.table("potato_het_obs.txt"))
het_exp <- as.matrix(read.table("potato_het_exp.txt"))
# Get a multi-locus estimate by taking the mean across loci using the
# apply function. Take 25% burn-in, only samples 251-1000 are used.
multi_het_obs <- apply(het_obs[251:1000,], 1, mean, na.rm=T)
multi_het_exp <- apply(het_exp[251:1000,], 1, mean, na.rm=T)
# Check for convergence
library(coda)
effectiveSize(mcmc(multi_het_obs))
## var1
## 920.8956
effectiveSize(mcmc(multi_het_exp))
## var1
## 750
# Plot a simple set of histograms to see the difference (Figure 1.3 in MS).
# The histograms will look slightly different but this is just a quick view.
# The reason is because the spreads are very different,
# which affects bin size.
hist(multi_het_exp, col="blue", xlim=c(0.37, 0.39),
     main="Heterozygosity", xlab="")
hist(multi_het_obs, col="red", add=T)
legend(x="topright",
       c("expected","observed"),
       col=c("blue","red"),
       fill=c("blue","red"), bty="n")
[Figure: overlaid histograms titled "Heterozygosity" for expected (blue) and observed (red) heterozygosity; x-axis from 0.370 to 0.390, y-axis "Frequency".]
# Calculate summary stats (mean and a 95% interval) with the quantile() function.
# Note that quantile() returns an equal-tailed interval, which is used here
# as an approximation to the 95% highest posterior density [HPD] interval.
list("mean_exp"= mean(multi_het_exp),
     "95HPD_exp"= quantile(multi_het_exp, c(0.025, 0.975)),
     "mean_obs"= mean(multi_het_obs),
     "95HPD_obs"= quantile(multi_het_obs, c(0.025, 0.975)))
## $mean_exp
## [1] 0.3722944
##
## $`95HPD_exp`
## 2.5% 97.5%
## 0.3711756 0.3735096
##
## $mean_obs
## [1] 0.3880829
##
## $`95HPD_obs`
## 2.5% 97.5%
## 0.3877001 0.3884551
As can be seen from the histograms and the summary statistics, the observed
heterozygosity is higher than the expected heterozygosity, consistent with a pattern of
excess outbreeding.
A.1.2 Evaluating Model Adequacy
To evaluate model adequacy using posterior predictive simulation, we use the posterior
distribution of allele frequencies from the polyfreqs run (potato_mcmc.out) minus
burn-in to look at model fit on a per locus basis. We also need the original read
count data to compare the observed and predicted read count ratios for each locus.
# Read in the original read count data using read.table().
# Again, remember the row.names and header arguments.
potato_tot_table <- read.table("potato_tot_reads.txt",
row.names=1, header=T)
potato_ref_table <- read.table("potato_ref_reads.txt",
row.names=1, header=T)
# If you haven't done so, load polyfreqs.
library(polyfreqs)
# Now we'll read in the posterior distribution of allele frequencies.
potato_mcmc_table <- read.table("potato_mcmc.out", row.names=1,
header=T)
# Take burn-in
potato_post <- potato_mcmc_table[251:1000,]
# If you haven't done so, load coda.
library(coda)
# Check for convergence
sum(effectiveSize(mcmc(potato_post)) < 200)
## [1] 0
plot(mcmc(potato_post[,4]))
[Figure: trace plot ("Trace of var1") and density plot ("Density of var1") for the fourth locus; N = 750, bandwidth = 0.003843.]
# Run the analysis using the polyfreqs_pps() function.
potato_pps <- polyfreqs_pps(as.matrix(potato_post),
                            as.matrix(potato_tot_table),
                            as.matrix(potato_ref_table),
                            ploidy=4, error=0.01)
The potato_pps object will be a list with two items:
• potato_pps$ratio_diff – A matrix with the per locus posterior predictive
samples of the read ratio differences.
• potato_pps$locus_fit – A logical vector indicating whether each locus passed
or failed the posterior predictive check.
These two items can then be used to examine various aspects of model fit, such as the proportion of adequate and inadequate loci and the posterior predictive distribution of read ratio differences for inadequate loci.
# Get the proportion of adequate and inadequate loci.
list("adequate"= mean(potato_pps$locus_fit),
     "inadequate"= 1 - mean(potato_pps$locus_fit))
## $adequate
## [1] 0.8723958
##
## $inadequate
## [1] 0.1276042
# Get the names of the loci that are inadequate
# (provided that locus names are given).
names(potato_pps$locus_fit[potato_pps$locus_fit==FALSE])
## [1] "PotSNP002" "PotSNP015" "PotSNP020" "PotSNP044" "PotSNP068"
## [6] "PotSNp071" "PotSNP080" "PotSNP104" "PotSNP138" "PotSNP140"
## [11] "PotSNP154" "PotSNP183" "PotSNP193" "PotSNP213" "PotSNP225"
## [16] "PotSNP238" "PotSNP245" "PotSNP247" "PotSNP249" "PotSNP252"
## [21] "PotSNP254" "PotSNP258" "PotSNP259" "PotSNP262" "PotSNP267"
## [26] "PotSNP268" "PotSNP275" "PotSNP277" "PotSNP286" "PotSNP287"
## [31] "PotSNP289" "PotSNP299" "PotSNP300" "PotSNP310" "PotSNP311"
## [36] "PotSNP313" "PotSNP327" "PotSNP329" "PotSNP331" "PotSNP335"
## [41] "PotSNP339" "PotSNP360" "PotSNP367" "PotSNP368" "PotSNP369"
## [46] "PotSNP372" "PotSNP373" "PotSNP383" "PotSNP384"
length(potato_pps$locus_fit[potato_pps$locus_fit==FALSE])
## [1] 49
# Plot the posterior predictive distribution of read ratio differences.
inadequate <- names(potato_pps$locus_fit[potato_pps$locus_fit==FALSE])
hist(potato_pps$ratio_diff[,inadequate[1]], main=inadequate[1], xlab="")
abline(v=quantile(potato_pps$ratio_diff[,inadequate[1]], c(0.025,0.975)),
       col="blue", lty="dashed", lwd=2)
abline(v=0, col="red")
[Figure: histogram of read ratio differences for locus PotSNP002; x-axis from −5 to 25, y-axis "Frequency".]
The stochastic nature of simulating data may slightly change the results between posterior predictive model checking runs, but we consistently get ∼13-14% of loci fitting the model poorly.
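As a rough illustration of how a per locus pass/fail flag could be derived from posterior predictive samples such as those in ratio_diff, the Python sketch below marks a locus as adequate when the central 95% interval of its read ratio differences contains zero. This criterion is an assumption inferred from the plotting code above (which draws the 2.5% and 97.5% quantiles and a line at zero); the function names and the toy samples are hypothetical, and this is not the polyfreqs_pps() implementation.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def locus_is_adequate(ratio_diffs, lower=0.025, upper=0.975):
    """Assumed criterion: the central 95% interval of the differences contains 0."""
    vals = sorted(ratio_diffs)
    return quantile(vals, lower) <= 0.0 <= quantile(vals, upper)


# Hypothetical posterior predictive samples for two loci.
centered = [x / 10.0 for x in range(-50, 51)]         # spread around 0
shifted = [x / 10.0 + 20.0 for x in range(-50, 51)]   # far from 0

fits = [locus_is_adequate(centered), locus_is_adequate(shifted)]
proportion_adequate = sum(fits) / len(fits)
```

Averaging the resulting flags gives the proportion of adequate loci, in the same way that mean() is applied to locus_fit in the R code.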
[Figure A.1 image: grid of RMSE panels with columns for allele frequency (f0.01, f0.05, f0.1, f0.2, f0.4) and rows for number of individuals (i5, i10, i20, i30); x-axis: coverage (c5, c10, c20, c50, c100); y-axis: RMSE; methods: polyfreqs, ratio.]
Figure A.1: Comparison of posterior mean versus mean read ratio estimates of allele frequencies for all simulation settings. All panels are set up the same as in Figure 1.1.
[Figure A.2 image: scatter plot of allele frequency estimates by method (polyfreqs, simple); y-axis "Allele frequency" from 0.00 to 1.00.]
Figure A.2: Comparison of posterior mean versus mean read ratio estimates of allele frequencies for Solanum tuberosum. We sorted the estimates from lowest to highest based on the posterior mean, which is why the read ratio estimates appear to be quite erratic. The same result is seen in the posterior mean estimates if the results are sorted by the read ratio estimate.
[Figure A.3 image: density plot; x-axis "Difference: simple − polyfreqs" from −0.05 to 0.10, y-axis "density".]
Figure A.3: Density plot of the difference between the mean read ratio (simple) and posterior mean estimates in Solanum tuberosum taken for all loci. The distribution is centered around 0, demonstrating that on average there is no difference between the estimates.
[Figure A.4 image: two panels (c10, c100); x-axis "Number of individuals" (i5, i10, i20, i30); y-axis "RMSE"; lines for allele frequencies f0.01, f0.05, f0.1, f0.2, f0.4.]
Figure A.4: A close-up comparison of the effect of coverage and the number of individuals sampled on estimation error for octoploids. The two panels represent two levels of sequencing coverage (10x and 100x) and compare the RMSE across the different numbers of individuals sampled (5, 10, 20, 30) and the different allele frequencies used to simulate read counts (0.01, 0.05, 0.1, 0.2, 0.4) for the simulation study. The two panels show that error decreases more from an increase in the number of individuals sampled than from a 10-fold increase in sequencing coverage.
Appendix B: Chapter 2 Supplemental Materials
B.1 EM Algorithms
An important aspect of the EM algorithm is the convergence criterion. For all of our
analyses we updated all individual parameters in the model until successive estimates
differed by less than 1e-5. Once a parameter had converged, we no longer updated it
in the remaining iterations of the algorithm. This causes the algorithm to “gain speed”
as more and more parameter values converge. Below we outline the mathematical
details of the maximization steps that were used for our EM algorithms.
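The convergence bookkeeping described above can be sketched in Python as follows. This is a minimal illustration, assuming a generic per-parameter update; the function run_until_converged and the toy update rule are hypothetical stand-ins and are not taken from the ebg source code. Only the tolerance of 1e-5 and the rule of no longer updating converged parameters come from the text.

```python
def run_until_converged(params, update, tol=1e-5, max_iter=1000):
    """Iterate `update` on each parameter, freezing those that have converged.

    `update(k, value)` is a hypothetical stand-in for one maximization step
    for parameter k; it is not the ebg implementation.
    """
    params = list(params)
    converged = [False] * len(params)
    n_updates = 0
    for _ in range(max_iter):
        if all(converged):
            break
        for k, value in enumerate(params):
            if converged[k]:
                continue  # converged parameters are no longer updated
            new_value = update(k, value)
            n_updates += 1
            if abs(new_value - value) < tol:
                converged[k] = True
            params[k] = new_value
    return params, converged, n_updates


# Toy update: each parameter moves halfway toward its own target value.
targets = [0.2, 0.5, 0.9]
estimates, done, n_updates = run_until_converged(
    [0.5, 0.5, 0.1], lambda k, v: v + 0.5 * (targets[k] - v))
```

Because the second parameter starts at its target, it is frozen after a single update, so later iterations only do work for the parameters that are still moving.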
B.1.1 Autopolyploid Model
When doing the conditional maximization steps for the autopolyploid model, our
ECM algorithm first updates the allele frequencies with the current estimates for the
inbreeding coefficients fixed, followed by updating the inbreeding coefficients with the
current estimates for allele frequencies fixed. For each locus $\ell = 1,\ldots,L$, we update
$p_\ell$ independently using the following objective function for Brent's method (Brent,
1973):
\[
p_\ell^{(t+1)} = \arg\max_{p_\ell} \left[ \sum_i \sum_a P(g_{i\ell} = a \mid d_{i\ell}, p_\ell^{(t)}, F_i^{(t)}) \times \log P(d_{i\ell} \mid g_{i\ell} = a)\, P(g_{i\ell} = a \mid p_\ell, F_i^{(t)}) \right] \tag{B.1}
\]
Once we have obtained all of the $p_\ell^{(t+1)}$'s, we then estimate the inbreeding coefficient
for each individual $i = 1,\ldots,N$ using a similar function:
\[
F_i^{(t+1)} = \arg\max_{F_i} \left[ \sum_\ell \sum_a P(g_{i\ell} = a \mid d_{i\ell}, p_\ell^{(t+1)}, F_i^{(t)}) \times \log P(d_{i\ell} \mid g_{i\ell} = a)\, P(g_{i\ell} = a \mid p_\ell^{(t+1)}, F_i) \right] \tag{B.2}
\]
B.1.2 Allopolyploid Model
An analytical solution for the EM update of the allele frequencies in subgenome two is
available using a derivation similar to that of Li (2010, 2011). We wish to maximize
the following function across all loci independently:
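\[
p_{2\ell}^{(t+1)} = \arg\max_{p_{2\ell}} Q(p_{2\ell} \mid p_{2\ell}^{(t)}) = \arg\max_{p_{2\ell}} \Big[ \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \times \log[ P(d_{i\ell} \mid g_{i\ell} = a_1 + a_2)\, P(g_{1i\ell} = a_1 \mid p_{1\ell}^{*})\, P(g_{2i\ell} = a_2 \mid p_{2\ell}) ] \Big]. \tag{B.3}
\]
Because many of the terms in $Q(p_{2\ell} \mid p_{2\ell}^{(t)})$ are constant and do not depend on $p_{2\ell}$, we
can combine them together to form a simpler expression:
\[
\begin{aligned}
Q(p_{2\ell} \mid p_{2\ell}^{(t)}) &= C_1 + C_2 + \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \log P(g_{2i\ell} = a_2 \mid p_{2\ell}) \\
&= C + \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \log P(g_{2i\ell} = a_2 \mid p_{2\ell}) \\
&= C + \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \times [a_2 \log(p_{2\ell}) + (m_{2i} - a_2) \log(1 - p_{2\ell})].
\end{aligned}
\tag{B.4}
\]
\[
C_1 = P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \log P(d_{i\ell} \mid g_{i\ell} = a_1 + a_2),
\]
\[
C_2 = P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \log P(g_{1i\ell} = a_1 \mid p_{1\ell}^{*}).
\]
Here $C$ is equal to the sum of $C_1$ and $C_2$ across all individuals and values for the genotypes in each subgenome, and we also drop the binomial coefficient in the third step.
Taking the derivative with respect to $p_{2\ell}$ and setting it equal to 0, we can derive the EM update for the next iteration:
\[
\begin{aligned}
\frac{\partial}{\partial p_{2\ell}} Q(p_{2\ell} \mid p_{2\ell}^{(t)}) &= \frac{1}{p_{2\ell}(1 - p_{2\ell})} \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \times [a_2 (1 - p_{2\ell}) - (m_{2i} - a_2) p_{2\ell}] \\
&= \frac{1}{p_{2\ell}(1 - p_{2\ell})} \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)}) \times [a_2 - m_{2i} p_{2\ell}] = 0.
\end{aligned}
\tag{B.5}
\]
Substituting $p_{2\ell}^{(t+1)}$ for $p_{2\ell}$ and solving gives us the needed equation:
\[
p_{2\ell}^{(t+1)} = \frac{\sum_i \sum_{a_1} \sum_{a_2} a_2\, P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^{*}, p_{2\ell}^{(t)})}{\sum_i m_{2i}}. \tag{B.6}
\]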
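The update in Equation B.6 can be sketched for a single locus in Python. This is a minimal illustration under stated assumptions and is not the C++ code used for our analyses: the read likelihood is assumed to be binomial with a symmetric error rate, every individual is assumed to have the same subgenome ploidies (m1 and m2), and the function names and toy read counts are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def read_likelihood(alt, tot, g, m, err):
    """Assumed read model: alternative reads are binomial with symmetric error."""
    q = (g / m) * (1 - err) + (1 - g / m) * err
    return binom_pmf(alt, tot, q)


def update_p2(alt_reads, tot_reads, p1, p2, m1, m2, err):
    """One EM update of the subgenome-two allele frequency at one locus (Eq. B.6)."""
    expected_a2 = 0.0
    for alt, tot in zip(alt_reads, tot_reads):
        weights = {}
        for a1 in range(m1 + 1):
            for a2 in range(m2 + 1):
                weights[(a1, a2)] = (
                    read_likelihood(alt, tot, a1 + a2, m1 + m2, err)
                    * binom_pmf(a1, m1, p1)
                    * binom_pmf(a2, m2, p2)
                )
        norm = sum(weights.values())
        # Posterior expectation of a2 for this individual.
        expected_a2 += sum(a2 * w for (_, a2), w in weights.items()) / norm
    # Denominator of Eq. B.6 when m2 is the same for every individual.
    return expected_a2 / (m2 * len(alt_reads))


def estimate_p2(alt_reads, tot_reads, p1, m1=2, m2=2, err=0.01, tol=1e-5):
    """Iterate the update from p2 = 0.5 until successive estimates differ by < tol."""
    p2 = 0.5
    for _ in range(1000):
        new_p2 = update_p2(alt_reads, tot_reads, p1, p2, m1, m2, err)
        if abs(new_p2 - p2) < tol:
            return new_p2
        p2 = new_p2
    return p2


# Toy data: three individuals with 20 reads each and a fixed reference
# frequency of 0 for subgenome one.
low = estimate_p2([0, 0, 0], [20, 20, 20], p1=0.0)
high = estimate_p2([20, 20, 20], [20, 20, 20], p1=0.0)
```

With only reference reads the estimate moves toward 0, and with only alternative reads it moves toward 1, which is the behavior expected from the posterior-weighted average in the numerator.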
B.1.3 C++ Code
The C++ source code for fitting the four models used in the manuscript is provided
on GitHub (https://github.com/pblischak/polyploid-genotyping). The four
models are (1) Hardy Weinberg equilibrium (hwe), (2) Autopolyploid/Hardy Weinberg
disequilibrium (diseq), (3) the Allopolyploid Subgenome model (alloSNP), and
(4) the flat genotype prior model (gatk). The executable, ebg, can be compiled
using the Makefile provided in the ebg/ folder that contains the source code. The
program has a simple command line interface for specifying which model is to be used
and the input necessary for completing an analysis. Within the data/ folder of the
GitHub repository, we also provide the simulated read count data from Betula pendula
and B. pubescens as example data sets with instructions for how to analyze them in
the README. Another thing to note is that the first two models can also be run on
diploids as was done in the manuscript for B. pendula. Analyses of mixed ploidy are
not currently implemented.
In the same GitHub repository we also provide the R and C++ code that we used
for all of the simulations conducted in the paper. These can be found in the Rcode/
folder, along with a README file with details about how the scripts can be used for
simulating and analyzing the data used in the manuscript.
B.2 Simulations
B.2.1 Inbreeding Coefficient From Called Genotypes
We calculated the inbreeding coefficient from called genotypes using observed and
expected heterozygosity:
\[
F_i = 1 - \frac{H_o(i)}{H_e}. \tag{B.7}
\]
Observed heterozygosity was calculated using the following equation (Blischak et al.,
2016; Hardy, 2016):
\[
H_o(i) = \frac{1}{L} \sum_\ell h_{i\ell} = \frac{1}{L} \sum_\ell \frac{g_{i\ell}(m_i - g_{i\ell})}{\binom{m_i}{g_{i\ell}}}. \tag{B.8}
\]
Expected heterozygosity was calculated as the average heterozygosity across all loci:
\[
H_e = \frac{1}{L} \sum_\ell \left[ 1 - p_\ell^2 - (1 - p_\ell)^2 \right]. \tag{B.9}
\]
B.3 Empirical Data Analysis
B.3.1 Data Acquisition
Andropogon gerardii
A VCF file for A. gerardii was downloaded from Dryad (file:
McAllister.Miller.all.mergedRefGuidedSNPs.vcf.gz; link:
http://datadryad.org/resource/doi:10.5061/dryad.05qs7) along with all individual metadata
(file: McAllister_Miller_Locality_Ploidy_Info.csv). Read counts were
extracted and filtered as described in the Methods for hexaploids and nonaploids
separately. Below is an example of the commands that we ran using VCFtools, and our Perl and R scripts.
# Bash code for running VCFtools + Perl and R scripts
# Substitute 'nona' for 'hex' to run the scripts for nonaploids
vcftools --gzvcf McAllister.Miller.all.mergedRefGuidedSNPs.vcf.gz \
--keep andropogon-hex-names.txt \
--max-alleles 2 --min-alleles 2 --thin 10000 \
--minDP 5 --max-missing 0.5 --remove-indels \
--remove-filtered-all --recode \
--stdout | perl read-counts-from-vcf.pl andr-hex-tot.txt \
andr-hex-alt.txt 2 5
# Perl script arguments: name for tot file, name for alt file,
# allele depth position in VCF file, min read depth
# R script Arguments: tot file, alt file, % missing cutoff,
# transpose output (T/F), missing data val
Rscript --vanilla filter-inds.R andr-hex-tot.txt andr-hex-alt.txt \
0.5 TRUE -9
VCF files store information with individuals in columns and loci in rows. ebg expects loci to be in columns and individuals to be in rows. This is why we transpose the data matrices within the R script. Loci in McAllister and Miller (2016) were kept with a minimum Phred score of Q20. Since error information was not directly available,
we used a value of 0.01 for the error for each locus as a conservative, maximum level
of error.
Betula pubescens and B. pendula
Genotype data for Betula were downloaded from Dryad (file:
data_80p_genlight.rdata; link: http://datadryad.org/resource/doi:10.5061/dryad.815rj)
as an Rdata file with genotypes stored as a genlight
object from the R package adegenet (Jombart and Ahmed, 2011). The genlight
object is designed to store genotypes more efficiently but can be easily converted into
a matrix of integer genotypes, which is what we did for simulating read data using
our own R and C++ code.
B.3.2 Comparison with GATK
Below we provide a walkthrough of our analyses that compared our models for
genotyping with GATK. The code that was used is provided, and we have also
provided any scripts that were used on GitHub (https://github.com/pblischak/polyploid-genotyping). These steps were completed separately for Betula pendula
and B. pubescens (replace ‘pendula’ with ‘pubescens’ in each step).
Indexing Betula nana Reference Genome
The reference genome for Betula nana was downloaded from Dryad
(http://datadryad.org/resource/doi:10.5061/dryad.815rj) and was processed following
Zohren et al. (2016) (concatenating all contigs with 50 N's in between). Next, we
indexed the reference genome for downstream analyses using BWA, SAMtools, and
Picard (Li and Durbin, 2009; Li, 2011; https://broadinstitute.github.io/picard).
bwa index Betula_concat_reference.fasta
samtools faidx Betula_concat_reference.fasta
java -jar ~/picard-tools-2.2.1/picard.jar \
CreateSequenceDictionary \
R=Betula_concat_reference.fasta \
O=Betula_concat_reference.dict
Mapping Reads with BWA
We then downloaded FASTQ files from the European Nucleotide Archive (Project
Accession ERA600270; link: http://www.ebi.ac.uk/ena/data/view/PRJEB3322)
for 15 individuals each of B. pendula and B. pubescens. Below are the individual files
that we downloaded for each species.
Betula pendula:
• 1147x_CTCTCTAG.fq.gz, 1148x_AGCTATAG.fq.gz, 1163x_TGTGACTG.fq.gz,
14007_CTAGCTCT.fq.gz, 14008_CTGATGCT.fq.gz, 14009_GACTCATC.fq.gz,
2310x_TCTCGCTC.fq.gz, 2315x_TGACTGTG.fq.gz, 2320x_ACACTGAC.fq.gz,
2346x_GTACTCGT.fq.gz, 2347x_GTCATGTG.fq.gz, 2350x_TGCATCGT.fq.gz,
2354x_TGTGACTG.fq.gz, 2361x_ACACGACA.fq.gz, 2380x_AGAGCTAG.fq.gz
Betula pubescens:
• 1045x_CACACAGT.fq.gz, 1045x_CATGA_1.fq.gz, 1123x_AAGGG_1.fq.gz,
1123x_ACGTAGCA.fq.gz, 1153x_TTTTA_1.fq.gz, 1158x_CAGTGTGT.fq.gz,
1158x_GTTGT_1.fq.gz, 13004_CTAGTGTC.fq.gz, 13006_CTAGATAG.fq.gz,
14007_CTAGCTCT.fq.gz, 14008_CTGATGCT.fq.gz, 14009_GACTCATC.fq.gz,
1578x_CGTATGTA.fq.gz, 1578x_GTGTG_1.fq.gz, 38005_GACTACGA.fq.gz
These files were mapped to the B. nana reference using the BWA MEM algorithm
(Li and Durbin, 2009). These mapped alignments were then converted from SAM to
BAM format and sorted using SAMtools (Li, 2011).
for f in *.fq.gz;
do
PREFIX=$(echo $f | awk -F'.' '{print $1}')
bwa mem -t 2 ../../Betula_concat_reference.fasta $f | \
samtools view -bSu - | \
samtools sort -O bam -o ../bam/$PREFIX.sorted.bam
done
Adding Read Groups and Genotyping with GATK
Read groups were added to the BAM files output by the previous step using Picard, followed by sorting by coordinate, and genotyping using the GATK UnifiedGenotyper
(https://broadinstitute.github.io/picard; McKenna et al., 2010).
# Adding read groups to BAM files
for f in *.bam
do
PREFIX=$(echo $f | awk -F'.' '{print $1}')
java -jar ~/picard-tools-2.2.1/picard.jar AddOrReplaceReadGroups \
I=$f \
O=$PREFIX.sortedRG.bam \
SORT_ORDER=coordinate \
RGID=pendula \
RGLB=pendula \
RGPL=illumina \
RGSM=$PREFIX \
RGPU=pendula \
CREATE_INDEX=True
done
# Running GATK
java -jar ~/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R ../../Betula_concat_reference.fasta \
$(for f in *sortedRG.bam; do printf "%sI $f " '-'; done) \
-o ../pendula-ug.vcf \
-ploidy 2
Filtering Variants Called by GATK
Variants were filtered using the following criteria: biallelic SNPs only, a minimum QUAL score of 30, a minimum read depth of 5, and at most 5 missing individuals.
We wrote our own R script, filter-vcf.R, to complete this step because most tools do not
take polyploid VCF files; the script is available on GitHub.
# Filtering variants based on depth, QUAL, max # of missing individuals
Rscript filter-vcf.R --vcf pendula-ug.vcf --minDP 5 --minQ 30 \
--missing 5 --out pendula-ug-filtered30.vcf
Finding Shared Variants
After filtering the variants for B. pendula and B. pubescens, we took the intersection
of their VCF files to find SNPs that were in common.
# Find intersection of variants in two VCF files
Rscript intersect-vcf.R --vcf1 pendula-ug-filtered30.vcf \
--vcf2 pubescens-ug-filtered30.vcf \
--prefix filtered30
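The idea behind taking the intersection can be sketched in Python as follows. This is an illustration only: it assumes that variants are matched on their chromosome and position (the first two columns of a VCF record), and the function names and toy records are hypothetical rather than taken from intersect-vcf.R.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS) keys from the non-header lines of a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        chrom, pos = line.split("\t")[:2]
        keys.add((chrom, int(pos)))
    return keys


def shared_variants(vcf1_lines, vcf2_lines):
    """Return the sorted variant positions present in both VCF files."""
    return sorted(variant_keys(vcf1_lines) & variant_keys(vcf2_lines))


# Hypothetical records standing in for the two filtered VCF files.
header = ["##fileformat=VCFv4.1",
          "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
vcf1 = header + ["contig1\t100\t.\tA\tG\t50\t.\t.",
                 "contig1\t250\t.\tC\tT\t60\t.\t."]
vcf2 = header + ["contig1\t250\t.\tC\tT\t45\t.\t.",
                 "contig2\t30\t.\tG\tA\t70\t.\t."]

shared = shared_variants(vcf1, vcf2)
```

A stricter comparison could also require the REF and ALT alleles to match, which matters if the two call sets disagree about the alternative allele at a shared position.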
Preparing Input Files for ebg
We extracted read counts (total number of reads, number of alt reads) from the allele
depth field (AD) using a Python script. Per locus error rates were calculated by first
making a pileup of reads from the original BAM files at the shared variant positions
using SAMtools. We then used Python to calculate the average error value at each
site using the PHRED scores reported in the pileup file.
# Get read counts using Python
python read-counts-from-vcf.py -n 15 -s 14931 \
--vcf filtered30-vcf1-pendula.vcf --prefix filtered30-pendula
# Get error values from pileup files
samtools mpileup -I -f Betula_concat_reference.fasta \
-l filtered30-variants.txt $(ls *.sortedRG.bam) \
-o filtered30-pendula.pileup
# Extract error values and take per locus avg.
python per-locus-err.py -i filtered30-pendula.pileup \
-n 15 > filtered30-pendula-err.txt
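The conversion from PHRED scores to an average error value can be sketched in Python as follows. The relationship between a PHRED score Q and the error probability, 10^(-Q/10), is standard; the assumptions here are that base qualities are encoded as ASCII characters with an offset of 33 and that all bases from all individuals at a site are averaged together. The function names are hypothetical and are not taken from per-locus-err.py.

```python
def phred_to_error(qual_char, offset=33):
    """Convert one ASCII-encoded base quality to an error probability."""
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)


def mean_site_error(qual_strings):
    """Average error probability over all bases from all samples at one site."""
    errs = [phred_to_error(c) for quals in qual_strings for c in quals]
    return sum(errs) / len(errs) if errs else None


# Hypothetical quality strings for two samples at one site:
# "5" encodes Q20 (error 0.01) and "+" encodes Q10 (error 0.1).
site_error = mean_site_error(["55", "+"])
```

Note that the error probabilities are averaged rather than the PHRED scores themselves, because the scores are on a logarithmic scale.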
Running ebg
With the input files for ebg prepared, we ran an analysis on B. pendula using our hwe model.
# Analysis for Betula pendula w/ hwe model
ebg hwe -t pendula-filtered30-tot.txt \
-a pendula-filtered30-alt.txt \
-e pendula-filtered30-err.txt \
-n 15 \
-l 14931 \
-p 2 \
--iters 1000 \
--prefix filtered30-pendula
The allele frequency estimates for B. pendula were used as a reference for the estimation of genotypes in B. pubescens using the alloSNP model.
# Analysis for Betula pubescens w/ alloSNP model
ebg alloSNP -f filtered30-pendula-freqs.txt \
-t pubescens-filtered30-tot.txt \
-a pubescens-filtered30-alt.txt \
-e pubescens-filtered30-err.txt \
-n 15 \
-l 14931 \
-p1 2 \
-p2 2 \
--iters 100 \
--brent \
--prefix filtered30-pubescens
Comparing Genotype Estimates
To compare the estimated genotypes between GATK and ebg, we first extracted the
genotypes (number of alternative alleles) from each VCF file using a Python script.
python gt-from-vcf.py --vcf filtered30-pendula-vcf1.vcf \
--prefix filtered30-pendula-gatk
Next, we read the genotype estimates into R and calculated what percent of the
estimated genotypes were identical between the two methods (B. pendula: 99.1%
identical; B. pubescens: 96.2% identical).
# Read in genotype data for allopolyploid B. pubescens
g1_alloSNP <- as.matrix(read.table("filtered30-pubescens-alloSNP-g1.txt",
                                   header=F, na.strings="-9"))
g2_alloSNP <- as.matrix(read.table("filtered30-pubescens-alloSNP-g2.txt",
                                   header=F, na.strings="-9"))
# The full genotype is the sum of the two subgenome genotypes
g_tot_alloSNP <- g1_alloSNP + g2_alloSNP
# Read in genotypes from GATK
g_tot_gatk <- t(as.matrix(read.table("filtered30-pubescens-gatk-out.txt",
                                     header=F)))
# Percent identical genotypes
mean((g_tot_gatk - g_tot_alloSNP)==0, na.rm = T) * 100
# Read in genotype data for diploid B. pendula
g_hwe <- as.matrix(read.table("filtered30-pendula-hwe-genos.txt",
                              header=F, na.strings="-9"))
# Read in genotypes called by GATK for B. pendula
g_gatk <- t(as.matrix(read.table("filtered30-pendula-gatk-out.txt",
                                 header=F)))
# Percent identical genotypes
mean((g_gatk - g_hwe)==0, na.rm = T) * 100
We also compared allele frequency estimates between our Hardy Weinberg model
and those estimated by GATK for B. pendula (Figure B.10). The root mean squared
deviation (RMSD) between the two estimates was calculated as follows:
\[
\mathrm{RMSD} = \sqrt{ \frac{1}{L} \sum_{\ell=1}^{L} (p_{\mathrm{hwe},\ell} - p_{\mathrm{gatk},\ell})^2 }. \tag{B.10}
\]
Finally, we compared the genotype estimates from GATK and the allopolyploid
model by looking at the distribution of estimated full genotypes for our model and
seeing how often it matched the estimate from GATK (Figure B.11). We did this in
the following way: for each genotype estimated to be 0 by GATK, we looked at the
same genotypes to see what the corresponding estimates were for the allopolyploid
model. We then repeated this procedure for genotypes estimated by GATK to be 1, 2,
3, and 4 copies of the alternative allele. Figure B.11 shows this distribution of the
genotypes estimated by the allopolyploid model when the estimate from GATK is 0
through 4 copies of the alternative allele. The R code for performing these comparisons is
below. For genotype estimates that did not match, the estimates by our model tended
to have one fewer copy of the alternative allele compared to GATK. We believe this is
because our model uses an outside estimate of the allele frequency for subgenome one, which influences our genotype estimation algorithm in ways not experienced by GATK.
Using this outside information can be especially useful when sequencing coverage is
low (e.g., see Figure 2.2 in Chapter 2). However, if the allele frequencies used are not
representative of the allele frequencies in subgenome one (i.e., if they are not from the
actual parental species), then they may lead to poor genotype estimates. Thus it is
important to use a reference panel from a known parental species.
library(ggplot2)
################################
# Allele frequency comparisons #
################################
hwe_freqs <- as.matrix(read.table("filtered30-pendula-hwe-freqs.txt"))
variants <- read.table("filtered30-variants.txt", stringsAsFactors = F)
gatk_vcf <- read.table("pendula-ug-filtered30.vcf", stringsAsFactors = F)
gatk_variants <- dplyr::semi_join(gatk_vcf, variants)
# Parse the second ';'-separated entry of the INFO column (column 8),
# taking the value after the '='.
gatk_freqs <- apply(gatk_variants, 1,
                    function(x){
                      as.numeric(strsplit(strsplit(x[8],";")[[1]][2],"=")[[1]][2])
                    })
sqrt(mean((hwe_freqs - gatk_freqs)^2))
figS10 <- qplot(hwe_freqs - gatk_freqs) + theme_bw(base_size = 22) +
  xlab("hwe - gatk") +
  ggtitle("Allele frequency estimates for Hardy Weinberg vs. GATK")
print(figS10)
ggsave("../supp/supp-figs/FigureS10-hwe-gatk-freqs.pdf",
       figS10, height=100, width=169, units="mm", scale=2.5)
########################
# Genotype comparisons #
########################
gatk_genos <- t(as.matrix(read.table("filtered30-pubescens-gatk-out.txt")))
alloSNP_genos1 <- as.matrix(
  read.table("filtered30-pubescens-alloSNP-g1.txt",
             na.strings="-9"))
alloSNP_genos2 <- as.matrix(
  read.table("filtered30-pubescens-alloSNP-g2.txt",
             na.strings="-9"))
alloSNP_genos <- alloSNP_genos1 + alloSNP_genos2
off <- matrix(NA, nrow=5, ncol=5)
mismatch <- data.frame(gatk=rep(0:4, 5),
                       alloSNP=rep(0:4, each=5),
                       Frequency=rep(NA,25))
for(i in 0:4){
  for(j in 0:4){
    gatk <- gatk_genos == i
    off[i+1,j+1] <- mean(alloSNP_genos[gatk] == j, na.rm = T)
    mismatch[mismatch$gatk == i & mismatch$alloSNP == j,]$Frequency <-
      mean(alloSNP_genos[gatk] == j, na.rm=T)
  }
}
off
figS11 <- ggplot(mismatch, aes(x=alloSNP, y=Frequency)) +
  geom_bar(stat="identity") +
  facet_grid(.~gatk) + theme_bw(base_size = 22) +
  ggtitle("Allopolyploid vs GATK genotype estimates")
print(figS11)
ggsave("../supp/supp-figs/FigureS11-alloSNP-gatk-genos.pdf", figS11,
       height=100, width=169, units="mm", scale=2.5)
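For readers who prefer to see the three comparisons without the plotting code, the following Python sketch computes the RMSD of Equation B.10, the percent of identical genotypes, and the distribution of model genotypes for each GATK genotype. It is an illustration under stated assumptions: genotypes are given as flat lists, missing calls are coded as -9, and the function names and toy values are hypothetical.

```python
from math import sqrt


def rmsd(x, y):
    """Root mean squared deviation between two equal-length lists (Eq. B.10)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def percent_identical(g1, g2, missing=-9):
    """Percent of genotype calls that match, ignoring missing calls."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a != missing and b != missing]
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def conditional_distribution(gatk, model, ploidy=4, missing=-9):
    """Row i: distribution of model genotypes among calls GATK scored as i."""
    table = []
    for i in range(ploidy + 1):
        matched = [m for g, m in zip(gatk, model) if g == i and m != missing]
        if matched:
            row = [matched.count(j) / len(matched) for j in range(ploidy + 1)]
        else:
            row = [None] * (ploidy + 1)  # no GATK calls with this genotype
        table.append(row)
    return table


# Hypothetical genotype calls (number of alternative alleles, -9 = missing).
gatk_calls = [0, 1, 2, 2, 2, 3, 4, -9]
model_calls = [0, 1, 2, 2, 1, 3, 4, 2]

identical = percent_identical(gatk_calls, model_calls)
table = conditional_distribution(gatk_calls, model_calls)
freq_rmsd = rmsd([0.1, 0.5], [0.1, 0.3])
```

Each row of the table corresponds to one panel of the faceted bar plot produced by the R code above.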
[Figure B.1 image: "Inbreeding Coeff. Estimation Error [25 ind.]"; columns: ploidy (p4, p6, p8); rows: coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD; methods: diseq, diseqCG, gatk, hwe.]
Figure B.1: Root mean squared deviation (RMSD) values for all levels of sequencing coverage for the estimation of inbreeding coefficients using the autopolyploid and other models (25 ind.). Everything is set up the same as in Figure 2.1a.
[Figure B.2 image: "Inbreeding Coeff. Estimation Error [50 ind.]"; columns: ploidy (p4, p6, p8); rows: coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD; methods: diseq, diseqCG, gatk, hwe.]
Figure B.2: RMSD values for all levels of sequencing coverage for the estimation of inbreeding coefficients using the autopolyploid and other models (50 ind.). Everything is set up the same as in Figure 2.1a.
[Figure B.3 image: "Inbreeding Coeff. Estimation Error [100 ind.]"; columns: ploidy (p4, p6, p8); rows: coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD; methods: diseq, diseqCG, gatk, hwe.]
Figure B.3: RMSD values for all levels of sequencing coverage for the estimation of inbreeding coefficients using the autopolyploid and other models (100 ind.). Everything is set up the same as in Figure 2.1a.
[Figure B.4 image: "Genotype Estimation Error [25 ind.]"; columns: ploidy (p4, p6, p8); rows: coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD; methods: diseq, gatk, hwe.]
Figure B.4: RMSD values for all levels of sequencing coverage for the estimation of genotypes in autopolyploids (25 ind.). Everything is set up the same as in Figure 2.1b.
[Figure B.5 image: "Genotype Estimation Error [50 ind.]"; columns: ploidy (p4, p6, p8); rows: coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD; methods: diseq, gatk, hwe.]
Figure B.5: RMSD values for all levels of sequencing coverage for the estimation of genotypes in autopolyploids (50 ind.). Everything is set up the same as in Figure 2.1b.
[Figure B.6 image: "Genotype Estimation Error [100 ind.]"; columns: ploidy (p4, p6, p8); rows: coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD; methods: diseq, gatk, hwe.]
Figure B.6: RMSD values for all levels of sequencing coverage for the estimation of genotypes in autopolyploids (100 ind.). Everything is set up the same as in Figure 2.1b.
[Figure B.7 image: "Allele Frequency in Subgenome 2"; panels: i25, i50, i100; x-axis: coverage (c2, c5, c10, c20, c30, c40); y-axis: RMSD; ploidy: p4, p6, p8.]
Figure B.7: RMSD values for the estimation of the allele frequency in subgenome two with the allopolyploid model. Increasing sequencing coverage increases accuracy but there is a plateau. Sampling more individuals also decreases estimation error.
[Figure B.8 image: "Genotype Estimation in Subgenome 1"; panels: i25, i50, i100; x-axis: coverage (c2, c5, c10, c20, c30, c40); y-axis: RMSD; ploidy: p4, p6, p8.]
Figure B.8: RMSD values for the estimation of genotypes in subgenome one with the allopolyploid model. Increasing sequencing coverage increases accuracy but sampling more individuals does not have much of an effect.
[Figure B.9 image: "Genotype Estimation in Subgenome 2"; panels: i25, i50, i100; x-axis: coverage (c2, c5, c10, c20, c30, c40); y-axis: RMSD; ploidy: p4, p6, p8.]
Figure B.9: RMSD values for the estimation of genotypes in subgenome two with the allopolyploid model. Increasing sequencing coverage increases accuracy but sampling more individuals does not have much of an effect. Estimation error in subgenome two is also higher than in subgenome one since allele frequencies for subgenome two must be estimated.
[Figure: "Allele frequency estimates for Hardy Weinberg vs. GATK". Histogram; x-axis: hwe − gatk (−0.50 to 0.50); y-axis: count (0 to 12500).]
Figure B.10: Distribution of the difference in allele frequency estimates from our Hardy Weinberg model versus GATK (hwe - GATK).
[Figure: "Allopolyploid vs GATK genotype estimates". Panels: GATK genotypes 0, 1, 2, 3, 4; x-axis: alloSNP (0 to 4); y-axis: Frequency (0.00 to 1.00).]
Figure B.11: Distribution of the full genotypes estimated by the allopolyploid model for each possible value of the genotype estimated by GATK. Each individual plot corresponds to genotypes estimated by GATK to have 0 through 4 copies of the alternative allele. The distributions within each plot are the frequency with which the allopolyploid model estimated each of the possible genotypes given the estimates by GATK. For example, for all genotypes estimated by GATK to be 2, the allopolyploid model also estimated ∼80% of those genotypes to have 2 copies of the alternative allele, but estimated that ∼10–15% had only 1 copy.
Appendix C: Chapter 3 Supplemental Materials
C.1 Haplotype Inference
C.1.1 Inferring Haplotypes with Known Ploidy
To infer the maximum likelihood haplotype configurations with known ploidy, we use
a multinomial likelihood that models the size of each cluster being considered as a
haplotype. For an individual with ploidy level K, we take the first K clusters sorted
by size and calculate the likelihood for a given partition as follows: each entry in a
partition, P , contains the number of times that a particular haplotype is represented
in the configuration. Given cluster sizes C1 through CK, and a sequencing error rate
of ε, the log-likelihood for a partition P is:
\[
\ell_P = \sum_{i=1}^{|P|} C_i \times \log\left(\frac{P[i]}{K}\right) + \sum_{j>|P|}^{K} C_j \times \log(\epsilon). \tag{C.1}
\]
Here |P| represents the size of the partition.
Example Calculation
To illustrate this calculation, let us consider a tetraploid individual that has the
following sizes for the first four clusters: C1 = 285, C2 = 95, C3 = 10, and C4 = 8. We will also use an error rate of ε = 0.002. Table C.1 has the list of possible haplotype
configurations and their corresponding log-likelihood values. The maximum likelihood
haplotype configuration has three copies of haplotype one and one copy of haplotype
two. The explicit likelihood calculation for this haplotype configuration proceeds as
follows:
\[
\ell_{(3,1)} = 285 \times \log\left(\frac{3}{4}\right) + 95 \times \log\left(\frac{1}{4}\right) + 10 \times \log(0.002) + 8 \times \log(0.002) = -325.5503. \tag{C.2}
\]

Haplotype Configuration    Log-Likelihood
(4, 0, 0, 0)               -702.2507
(3, 1, 0, 0)               -325.5503*
(2, 2, 0, 0)               -375.2589
(2, 1, 1, 0)               -392.8247
(1, 1, 1, 1)               -551.7452

Table C.1: Haplotype configurations and their corresponding log-likelihoods for a tetraploid with ordered cluster sizes equal to 285, 95, 10, and 8. The haplotype configuration with three copies of haplotype one and one copy of haplotype two has the highest likelihood.
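The calculation in Equation C.1 can be reproduced for every haplotype configuration of the tetraploid example with a short script. The following is a minimal Python sketch, not the Fluidigm2PURC implementation; the function names `partitions` and `partition_loglik` are hypothetical, and the sketch assumes that |P| is the number of haplotypes with at least one copy, which is the reading that reproduces the values in Table C.1.

```python
from math import log

def partitions(k, max_part=None):
    """Yield the integer partitions of k as non-increasing tuples."""
    if max_part is None:
        max_part = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, max_part), 0, -1):
        for rest in partitions(k - first, first):
            yield (first,) + rest

def partition_loglik(parts, sizes, ploidy, eps):
    """Log-likelihood of Equation C.1 for one partition of `ploidy` copies.

    `parts` holds only the nonzero copy numbers; the clusters beyond
    len(parts), up to the ploidy, are treated as sequencing errors.
    """
    real = sum(c * log(p / ploidy) for c, p in zip(sizes, parts))
    errors = sum(c * log(eps) for c in sizes[len(parts):ploidy])
    return real + errors

# Values from the example calculation (cluster sizes sorted largest first)
sizes = [285, 95, 10, 8]
ploidy = 4
eps = 0.002

results = {p: partition_loglik(p, sizes, ploidy, eps) for p in partitions(ploidy)}
best = max(results, key=results.get)

for p, ll in results.items():
    padded = p + (0,) * (ploidy - len(p))  # pad with zeros as in Table C.1
    print(padded, round(ll, 4))
print("Maximum likelihood configuration:", best)
```

Enumerating integer partitions keeps the search exhaustive, which is feasible here because the number of partitions stays small for realistic ploidy levels.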
C.1.2 Inferring Haplotypes with Unknown Ploidy
Inferring haplotype configurations for individuals with unknown ploidy levels involves
distinguishing clusters that are likely to be “real” haplotypes from those that are likely
to be errors. We do this by considering a set of models that range from treating all
clusters as errors, to one where all clusters are real haplotypes. The models in between
successively treat each cluster in the ordered set as a real haplotype (clusters are sorted
by size). For an individual with N clusters, there are N + 1 models to test. Each of
these models has H real haplotypes (1,...,H) and N − H errors (H + 1,...,N). The
log-likelihood for each of these models is the sum of the cluster sizes (C1,...,CN), each multiplied by
the log of the probability that the cluster is a sequencing error (ε) or not (1 − ε). The log-likelihood
for a model with H haplotypes is given by:
\[
\ell_H = \sum_{i=1}^{H} C_i \times \log(1 - \epsilon) + \sum_{j>H}^{N} C_j \times \log(\epsilon). \tag{C.3}
\]
To determine the most likely haplotype configuration, we calculate how much the
likelihood increases over the previous model when another haplotype is added (the
likelihood is monotonically increasing). We also normalize these differences by the
total change in likelihood from the model with H = 0 to the model with H = N. If
this value is less than a given cutoff (we use a default of 0.10), the previous model is
treated as the best configuration. Since the cluster sizes are ordered, the increase in
the log-likelihood will always be smaller for any additional haplotypes.
Example Calculation
We will illustrate this procedure using an example for six clusters with the following
sizes: C1 = 425,C2 = 210,C3 = 145, C4 = 18, C5 = 11, and C6 = 7. Using an error
rate of 0.002, the R code below will calculate the likelihoods for the different models
as well as the relative increase for each of them (Table C.2).
R code:
# A trailing + keeps each sum in a single expression across the line break
ullik_0 <- 425*log(0.002) + 210*log(0.002) + 145*log(0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_1 <- 425*log(1-0.002) + 210*log(0.002) + 145*log(0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_2 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_3 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_4 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(1-0.002) + 11*log(0.002) + 7*log(0.002)
ullik_5 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(1-0.002) + 11*log(1-0.002) + 7*log(0.002)
ullik_6 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(1-0.002) + 11*log(1-0.002) + 7*log(1-0.002)

# Increase in log-likelihood for each added haplotype, relative to the total change
total_diff <- abs(ullik_0 - ullik_6)
diff_1 <- ullik_1 - ullik_0; rel_diff_1 <- diff_1 / total_diff
diff_2 <- ullik_2 - ullik_1; rel_diff_2 <- diff_2 / total_diff
diff_3 <- ullik_3 - ullik_2; rel_diff_3 <- diff_3 / total_diff
diff_4 <- ullik_4 - ullik_3; rel_diff_4 <- diff_4 / total_diff
diff_5 <- ullik_5 - ullik_4; rel_diff_5 <- diff_5 / total_diff
diff_6 <- ullik_6 - ullik_5; rel_diff_6 <- diff_6 / total_diff
Using a cutoff of 0.10, we can see that the configuration (1, 1, 1, 0, 0, 0) is the last haplotype configuration that increases the likelihood by more than 10%, meaning that the most likely scenario is that the first three clusters are real haplotypes and the last three are errors.
Haplotype Configuration    Log-Likelihood    Relative Increase
(0,0,0,0,0,0)              -5071.12          NA
(1,0,0,0,0,0)              -2430.763         0.5208
(1,1,0,0,0,0)              -1126.115         0.2574
(1,1,1,0,0,0)              -225.2875         0.1777*
(1,1,1,1,0,0)              -113.4605         0.0221
(1,1,1,1,1,0)              -45.12188         0.0135
(1,1,1,1,1,1)              -1.633634         0.0086
Table C.2: Haplotype configurations for an individual with six clusters. The ordered cluster sizes are 425, 210, 145, 18, 11, and 7. A model where the first three clusters are real haplotypes is the best fit.
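The model comparison for unknown ploidy can also be written as a general function of the cluster sizes. The following is a minimal Python sketch of the procedure described in this section, not the Fluidigm2PURC implementation; the function names `model_logliks` and `choose_num_haplotypes` are hypothetical, and the defaults (error rate 0.002, cutoff 0.10) are the values used in the example.

```python
from math import log

def model_logliks(sizes, eps):
    """Equation C.3 for H = 0, ..., N; `sizes` must be sorted largest first."""
    return [
        sum(c * log(1 - eps) for c in sizes[:h]) + sum(c * log(eps) for c in sizes[h:])
        for h in range(len(sizes) + 1)
    ]

def choose_num_haplotypes(sizes, eps=0.002, cutoff=0.10):
    """Return the chosen H, the log-likelihoods, and the relative increases."""
    logliks = model_logliks(sizes, eps)
    total = abs(logliks[-1] - logliks[0])  # total change from H = 0 to H = N
    rel = [(logliks[h] - logliks[h - 1]) / total for h in range(1, len(logliks))]
    best = 0
    for h, r in enumerate(rel, start=1):
        if r < cutoff:
            break  # the previous model is treated as the best configuration
        best = h
    return best, logliks, rel

# Cluster sizes from the example calculation
sizes = [425, 210, 145, 18, 11, 7]
best_h, logliks, rel = choose_num_haplotypes(sizes)

for h in range(len(logliks)):
    increase = "NA" if h == 0 else round(rel[h - 1], 4)
    print(h, round(logliks[h], 4), increase)
print("Number of real haplotypes:", best_h)
```

Because the clusters are sorted by size, the relative increases are non-increasing, so stopping at the first value below the cutoff gives the same answer as scanning all N + 1 models.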
C.2 Example Analysis
We have provided the sequence data from the PIS_3 and PIS_4 loci for the six species
of Thalictrum in the files Thalictrum_R1.fastq.gz and Thalictrum_R2.fastq.gz.
The following sections will walk through the analyses that we did to compare haplotypes
inferred using Fluidigm2PURC and dbcAmplicons (Uribe-Convers et al., 2016).
C.2.1 Fluidigm2PURC
To analyze the data using Fluidigm2PURC, we first run the fluidigm2purc script:
$ fluidigm2purc -f Thalictrum -o thalictrum-f2p -j 2
All of the reads coming from each locus will be written to separate FASTA files
that are put into the directory thalictrum-f2p-FASTA/ (the -o option gives the
output prefix). Next, we change into this directory and run PURC (Rothfels et al.,
2017) on the PIS_3 and PIS_4 loci separately.
$ cd thalictrum-f2p-FASTA/
$ purc_recluster.py -f PIS_3.fasta -o PIS_3 -c 0.997 0.995 0.99 0.997 \
-s 2 5 --clean
$ purc_recluster.py -f PIS_4.fasta -o PIS_4 -c 0.997 0.995 0.99 0.997 \
-s 2 5 --clean
Here the results are written to separate directories: PIS_3/ and PIS_4/. We will work on these loci in their respective directories by running the crunch_clusters script.
For analyses 1 and 2, we do not change the taxon table to include ploidy information.
However, before running step 3, we add the ploidy levels of the taxa sampled. Running
the first two cluster crunching steps for both loci before adding the ploidy information
is best so that we do not have to change the taxon table more than once. We also
manually renamed the output FASTA files after each crunch_clusters run so that the
files do not get overwritten. In a normal situation, you would not need to run all three
of these analyses unless you wanted to compare the differences between assuming
ploidy levels are known or unknown.
## Working on the PIS_3 locus
# 1. Getting consensus loci using the --haploid flag
$ cd PIS_3/
$ crunch_clusters -i PIS_3_clustered_reconsensus.afa -l PIS_3 \
-s ../../thalictrum-f2p-taxon-table.txt \
-e ../../thalictrum-f2p-locus-err.txt \
--realign --clean 0.33 --haploid
# 2. Assuming we don’t know ploidy levels
$ crunch_clusters -i PIS_3_clustered_reconsensus.afa -l PIS_3 \
-s ../../thalictrum-f2p-taxon-table.txt \
-e ../../thalictrum-f2p-locus-err.txt \
--realign --clean 0.33
# 3. Getting unique haplotypes with known ploidy information
# (add ploidy info first)
$ crunch_clusters -i PIS_3_clustered_reconsensus.afa -l PIS_3 \
-s ../../thalictrum-f2p-taxon-table.txt \
-e ../../thalictrum-f2p-locus-err.txt \
--realign --clean 0.33 --unique_haps
## Working on the PIS_4 locus
# 1. Getting consensus loci using the --haploid flag
$ cd ../PIS_4/
$ crunch_clusters -i PIS_4_clustered_reconsensus.afa -l PIS_4 \
-s ../../thalictrum-f2p-taxon-table.txt \
-e ../../thalictrum-f2p-locus-err.txt \
--realign --clean 0.33 --haploid
# 2. Assuming we don’t know ploidy levels
$ crunch_clusters -i PIS_4_clustered_reconsensus.afa -l PIS_4 \
-s ../../thalictrum-f2p-taxon-table.txt \
-e ../../thalictrum-f2p-locus-err.txt \
--realign --clean 0.33
# 3. Getting unique haplotypes with known ploidy information
# (add ploidy info first)
$ crunch_clusters -i PIS_4_clustered_reconsensus.afa -l PIS_4 \
-s ../../thalictrum-f2p-taxon-table.txt \
-e ../../thalictrum-f2p-locus-err.txt \
--realign --clean 0.33 --unique_haps
C.2.2 dbcAmplicons (reduce_amplicons.R)
To get haplotypes with dbcAmplicons, we first run the reduce_amplicons.R script.
We trim 20 bases from read one and 40 bases from read two. We also run both
consensus- and occurrence-based haplotype inference using the -p option. The output
directory is specified using the -o option.
$ reduce_amplicons.R -p consensus,occurrence --trim-1 20 --trim-2 40 \
-o thalictrum-dbc Thalictrum
Next, we need to align the output of the reduce_amplicons.R script for the
consensus- and occurrence-based haplotype inference methods. We first change into
the consensus.split_amplicon/ directory in the main thalictrum-dbc/ output
directory and then align the haplotypes for PIS_3 and PIS_4 using MAFFT (Katoh,
2013).
$ cd thalictrum-dbc/consensus.split_amplicon/
$ mafft --auto --quiet Amplicon.PIS_3.merged.fasta > PIS_3-consensus.fasta
$ mafft --auto --quiet Amplicon.PIS_4.merged.fasta > PIS_4-consensus.fasta
Next, we change back into the main output directory and then change into the
occurrence.split_amplicon/ directory to align the occurrence-based haplotypes
inferred by dbcAmplicons.
$ cd ../occurrence.split_amplicon/
$ mafft --auto --quiet Amplicon.PIS_3.merged.fasta > PIS_3-occurrence.fasta
$ mafft --auto --quiet Amplicon.PIS_4.merged.fasta > PIS_4-occurrence.fasta
All of the resulting haplotype files were then read into Geneious to visualize the alignments and calculate alignment statistics (Kearse et al., 2012). Parsimony-informative sites were calculated in MEGA7 (Kumar et al., 2016).
Appendix D: Chapter 4 Supplemental Materials
D.1 Validating QCF Estimation
To evaluate the accuracy of our approach for QCF estimation, we performed a
simulation study using both tree and network topologies (Figure 4.1), and compared
our estimates with the true QCF values from simulated gene trees. Gene trees were
simulated using the program ms for 50 loci using the specified topology with internal
branch lengths of 0.5, 1.0, and 2.0 coalescent units (Hudson, 2002). For the species
network, the ancestral lineage to species C and D was simulated as a 60:40 hybrid
species forming 1.0 coalescent units in the past through an admixture event between
species E (γ = 0.6) and the ancestral lineage to A and B (1 − γ = 0.4). Sequence
data was then simulated on each gene tree using the program Seq-Gen (Rambaut
and Grassly, 1997). The length of each gene was 400 bp, with an expected number of
substitutions per site of 0.05. QCFs were then estimated with the approach outlined in
§4.3 using either (1) no bootstrapping or (2) 500 bootstrap replicates. True QCF values were calculated using the simulated gene trees as input with the software package
PhyloNetworks v0.7.0 (Solís-Lemus et al., 2017). Estimates from our approach were
compared to the true values using the root mean squared deviation (RMSD), as well
as linear regression, in R v3.3.2 (R Core Team, 2016). Results were plotted using
ggplot2 v2.2.1 (Wickham, 2009). Code for performing these simulations can be found in §D.1.1 and §D.1.2.
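The RMSD comparison described above can be illustrated with a short calculation. The following is a minimal Python sketch and not the R code used for the analyses; the function name `rmsd` is hypothetical, and the concordance factor values are made up purely to show the arithmetic.

```python
from math import sqrt

def rmsd(estimates, truths):
    """Root mean squared deviation between paired estimates and true values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return sqrt(sum(squared) / len(squared))

# Hypothetical concordance factors for one quartet topology across three replicates
estimated_cf = [0.70, 0.62, 0.66]
true_cf = [0.68, 0.64, 0.66]

print(round(rmsd(estimated_cf, true_cf), 6))
```

In this sketch the deviations are pooled across replicates for a single quartet topology, which matches the layout of Tables D.4 and D.5 (one RMSD value per quartet and topology).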
D.1.1 Tree Simulations
qcf-sims-tree.sh:
#!/bin/bash
# Global parameters
REP=$1 # Rep number
THETA=0.05 # Expected number of mutations per base
BP=400 # Sequence length in base pairs
julia5=/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia
for i in `seq 1 50`
do
# Simulate gene tree using ms
ms 6 1 -T -I 6 1 1 1 1 1 1 -ej 0.25 1 2 -ej 0.25 3 4 -ej 0.5 2 4 \
-ej 1.0 4 5 -ej 2.0 5 6 | grep '^(' > trees-${REP}-${i}.tre
# Simulate sequence data using seq-gen
seq-gen -mGTR -s $THETA -l $BP -r 1.0 0.2 10.0 0.75 3.2 1.6 \
-f 0.15 0.35 0.15 0.35 -i 0.2 -a 5.0 -g 3 -q \
< trees-${REP}-${i}.tre > seqs-${REP}-${i}.phy
done
ls -1 *.phy > genes.txt
cat trees-${REP}-*.tre > trees-${REP}.tre
qcf -i genes.txt -m map.txt --prefix tree-${REP}
qcf -i genes.txt -m map.txt -b 500 --prefix tree-${REP}-boot
# Double quotes let bash expand ${REP} before the command is passed to Julia
$julia5 -e "using PhyloNetworks; readTree2CF(\"trees-${REP}.tre\", \
\"tree-${REP}-phynet.CFs.csv\", writeSummary=false)"
mkdir rep-${REP}
mv *.tre *.phy *.csv rep-${REP}
$ for i in `seq 1 100`; do ./qcf-sims-tree.sh ${i}; done
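The -ej times passed to ms in the script above can be related to the internal branch lengths stated in §D.1. The following is a minimal Python sketch, assuming the usual conventions that ms measures time in units of 4N generations and that one coalescent unit is 2N generations; the variable names are hypothetical.

```python
# Assumed scaling: ms time (4N generations) x 2 = coalescent units (2N generations)
MS_TO_COALESCENT = 2.0

# -ej times from the ms command in qcf-sims-tree.sh
join_times_ms = [0.25, 0.25, 0.5, 1.0, 2.0]

# Convert each population join time to coalescent units
join_times_cu = [t * MS_TO_COALESCENT for t in join_times_ms]

# Distinct node heights, from the most recent join to the root
node_heights = sorted(set(join_times_cu))

# Internal branch lengths are the gaps between successive node heights
internal_branches = [older - younger
                     for younger, older in zip(node_heights, node_heights[1:])]

print(node_heights)
print(internal_branches)
```

Under these assumptions the gaps between successive joins are 0.5, 1.0, and 2.0 coalescent units, which agrees with the internal branch lengths given in §D.1.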
D.1.2 Network Simulations
qcf-sims-network.sh:
#!/bin/bash
# Global parameters
REP=$1 # Rep number
THETA=0.05 # Expected number of mutations per base
BP=400 # Sequence length in base pairs
julia5=/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia
# Simulate 20 gene trees from topology 1
for i in `seq 1 20`
do
# Simulate gene tree using ms
ms 6 1 -T -I 6 1 1 1 1 1 1 -ej 0.25 1 2 -ej 0.25 3 4 -ej 0.5 2 4 \
-ej 1.0 4 5 -ej 2.0 5 6 | grep '^(' > trees-${REP}-${i}.tre
# Simulate sequence data using seq-gen
seq-gen -mGTR -s $THETA -l $BP -r 1.0 0.2 10.0 0.75 3.2 1.6 \
-f 0.15 0.35 0.15 0.35 -i 0.2 -a 5.0 -g 3 -q \
< trees-${REP}-${i}.tre > seqs-${REP}-${i}.phy
done
# Simulate 30 gene trees from topology 2
for i in `seq 21 50`
do
# Simulate gene tree using ms
ms 6 1 -T -I 6 1 1 1 1 1 1 -ej 0.25 1 2 -ej 0.25 3 4 -ej 0.5 4 5 \
-ej 1.0 2 5 -ej 2.0 5 6 | grep '^(' > trees-${REP}-${i}.tre
# Simulate sequence data using seq-gen
seq-gen -mGTR -s $THETA -l $BP -r 1.0 0.2 10.0 0.75 3.2 1.6 \
-f 0.15 0.35 0.15 0.35 -i 0.2 -a 5.0 -g 3 -q \
< trees-${REP}-${i}.tre > seqs-${REP}-${i}.phy
done
ls *.phy > genes.txt
cat trees-${REP}-*.tre > trees-${REP}.tre
qcf -i genes.txt -m map.txt --prefix net-${REP}
qcf -i genes.txt -m map.txt -b 500 --prefix net-${REP}-boot
# Double quotes let bash expand ${REP} before the command is passed to Julia
$julia5 -e "using PhyloNetworks; readTree2CF(\"trees-${REP}.tre\", \
\"net-${REP}-phynet.CFs.csv\", writeSummary=false)"
mkdir rep-${REP}
mv *.tre *.phy *.csv rep-${REP}
$ for i in `seq 1 100`; do ./qcf-sims-network.sh ${i}; done
D.2 Code for Species Tree and Network Inference
D.2.1 Gene Tree Estimates with RAxML
# Loop through all genes and analyze with RAxML
for f in *.phy
do
raxml -f a -x 12345 -p 12345 -# 500 -m GTRGAMMA \
-s "$f" -n "${f%.phy}"  # -n sets the run name used in RAxML output file names
done
# Then combine all gene trees
cat RAxML_bipartitions.* > AllGeneTrees.tre
D.2.2 Species Tree Inference with ASTRAL-III
java -jar astral.5.5.9.jar -i AllGeneTrees.tre -a map.txt \
-o Humiles-Proceri.tre --polylimit 20 \
--samplingrounds 100 --extraLevel 2
The analyses for clades A and B were conducted using the same commands but with only the subset of taxa belonging to each clade.
D.2.3 Species Tree Inference with qcf+QuartetMaxCut
# Run qcf
qcf -i gene-list.txt -m map.txt -b 500
# Run get-pop-tree.pl from TICR
perl get-pop-tree.pl out-qcf.CFs.csv
# Run getTreeBranchLengths.R from TICR
Rscript getTreeBranchLengths.R out-qcf Pdavidsoniidavidsonii
D.2.4 Network Analyses with PhyloNetworks
Network analyses were conducted using the SNaQ method in the PhyloNetworks package (v0.7.0) with the following template for each script [written in the Julia language using versions 0.5.2 and 0.6.2] (Solís-Lemus and Ané, 2016; Solís-Lemus et al., 2017).
# snaq-net
addprocs(10) # add processors to run things in parallel
using PhyloNetworks;
t = readTopology("cladeA-astral.tre")
cf = readTableCF("cladeA-qcf.CFs.csv")
net1 = snaq!(t, cf, hmax=
filename="cladA-net
outgroup="Pdavidsoniidavidsonii")
For each network analysis, we changed the hmax argument to the corresponding maximum number of hybridization events (
[Figure: "QCF Estimation". Panel columns: the 15 quartets from A,B,C,D to C,D,E,F; panel rows: quartet topologies 12|34, 13|24, 14|23; x-axis: PhyloNetworks (0.00 to 1.00); y-axis: CF (0.00 to 1.00); legend: Method (QCF, QCF−Boot).]
Figure D.1: Simulation results for the tree topology from Figure 4.1a. Results are plotted for bootstrapped (blue) and non-bootstrapped (yellow) comparisons. Regression lines were estimated by ggplot2 using the lm method in the geom_smooth() function.
[Figure: "QCF Estimation". Panel columns: the 15 quartets from A,B,C,D to C,D,E,F; panel rows: quartet topologies 12|34, 13|24, 14|23; x-axis: PhyloNetworks (0.00 to 1.00); y-axis: CF (0.00 to 1.00); legend: Method (QCF, QCF−Boot).]
Figure D.2: Simulation results for the network topology from Figure 4.1b. Results are plotted for bootstrapped (blue) and non-bootstrapped (yellow) comparisons. Regression lines were estimated by ggplot2 using the lm method in the geom_smooth() function.
[Figure: phylogeny whose tip labels are the taxa listed in Table D.1; scale bar: 0.2.]
Figure D.3: Phylogeny of Penstemon subsections Humiles and Proceri inferred by ASTRAL-III. Branch lengths are in coalescent units.
[Figure: phylogeny whose tip labels are the sampled accessions of the taxa listed in Table D.1; scale bar: 0.0040.]
Figure D.4: Phylogeny of Penstemon subsections Humiles and Proceri inferred with RAxML v8.2.11 using a supermatrix of 43 loci. Branch lengths are the mean number of substitutions per site.
[Figure: five network panels for clade A (h=1 to h=5) and one panel titled "Negative Log-Likelihood for Maximum Number of Reticulations" (x-axis: h, 0 to 5; y-axis: −logLik, 3000 to 4500).]
Figure D.5: Networks inferred for clade A using SNaQ, as implemented in the software PhyloNetworks. Each panel shows the network estimated for the given number of hybridization events (h=1 to h=5), starting at the top left (h=1) and ending at the bottom right (h=5). The panel in the top right corner shows the negative log-pseudolikelihoods for each number of hybridization events, with h=3 having the best pseudolikelihood.
[Figure: five network panels for clade B (h=1 to h=5) and one panel titled "Negative Log-Likelihood for Maximum Number of Reticulations" (x-axis: h, 0 to 5; y-axis: −logLik, 2600 to 3200).]
Figure D.6: Networks inferred for clade B using SNaQ, as implemented in the software PhyloNetworks. Each panel shows the network estimated for the given number of hybridization events (h=1 to h=5), starting at the top left (h=1) and ending at the bottom right (h=5). The panel in the top right corner shows the negative log-pseudolikelihoods for each number of hybridization events, with h=3 having the best pseudolikelihood.

Taxon                           Ploidy  Voucher
Outgroup
P. davidsonii davidsonii        2X      SLD 53
Subsection Humiles
P. albertinus                   2X      PDB 41
P. anguineus                    2X      ADW 1505
P. aridus                       2X      Andrew Lutz, Billy Creek 1
P. degeneri                     2X      ADW 401
P. elegantulus                  2X      Idaho Gray 4321
P. inflatus                     2X      ADW 811
P. humilis brevifolius          2X      ADW 573
P. humilis humilis              2X      ADW 761
P. humilis obtusifolius         2X      ADW 1430
P. ovatus                       2X      ADW 608
P. pruinosus                    2X      BYU 98608/EC Moran sn (1971)
P. radicosus                    2X      Mt. West Enviro. Services 7962
P. rattanii                     2X      ADW 512
P. subserratus                  2X      ADW 590
P. virens                       2X      Mt. West Enviro. Services 7953
P. whippleanus                  2X      1073 Potsdam
P. wilcoxii                     2X      PDB 39
Subsection Proceri
P. attenuatus attenuatus        6X      PDB 19
P. attenuatus militaris         6X      PDB 12
P. attenuatus palustris         6X      PDB 61
P. attenuatus pseudoprocerus    6X      PDB 42
P. cinicola                     2X      PDB 32
P. confertus                    4X      PDB 18
P. euglaucus                    6X      PDB 33
P. flavescens                   6X      PDB 63
P. globosus                     4X      ADW 1566
P. hesperius                    2X      –
P. heterodoxus heterodoxus      2X      ADW 1498
P. laxus                        2X      Idaho Smith 8123
P. peckii                       4X      PDB 31
P. pratensis                    2X      PDB 14
P. procerus procerus            4X      PDB 3
P. rydbergii rydbergii          4X      Western Env. 8845
P. rydbergii oreocharis         2X      PDB 16
P. spatulatus                   2X      PDB 56
P. washingtonensis              2X      PDB 36
P. watsonii                     2X      ADW 786

Table D.1: Collection and ploidy information for accessions from Penstemon subsections Humiles and Proceri.
Locus       Direction  Primer
COS4270     Forward    ACACTGACGACATGGTTCTACAACCAAGCTCTTCACCTGGAA
            Reverse    TACGGTAGCAGAGACTTGGTCTAGACCAGCATAACAATTTTATTCCTAA
COS14240    Forward    ACACTGACGACATGGTTCTACACCGACATTAGTCACGGTCCT
            Reverse    TACGGTAGCAGAGACTTGGTCTCGGCATTCCTTCAGATAAAC
COS23460    Forward    ACACTGACGACATGGTTCTACATGGTTGTTCGTGTGAGGTTG
            Reverse    TACGGTAGCAGAGACTTGGTCTAAACTCATTATTTGCTCGATAGGG
COS24530    Forward    ACACTGACGACATGGTTCTACATTAAAATGCAGGAGGGCTTG
            Reverse    TACGGTAGCAGAGACTTGGTCTCCAACCAAATCAGTTCTCTGC
COS50360    Forward    ACACTGACGACATGGTTCTACACCATGGAATCAAACCTGGAC
            Reverse    TACGGTAGCAGAGACTTGGTCTAAGCCCAAATCGAAGAAGAA
COS57850    Forward    ACACTGACGACATGGTTCTACAGAAGGAGCCTCAAAGCAGTG
            Reverse    TACGGTAGCAGAGACTTGGTCTGGATGTCCATCTAACCCGTTT
PPR876      Forward    ACACTGACGACATGGTTCTACACAGCTTCTGGTAGATGGGCT
            Reverse    TACGGTAGCAGAGACTTGGTCTCTCTCCCCACAATCTTCGCC
PPR1651     Forward    ACACTGACGACATGGTTCTACAACGCAGCTTCTGGTAGATGG
            Reverse    TACGGTAGCAGAGACTTGGTCTCACCACAATTACACACGCCC
PPR5729     Forward    ACACTGACGACATGGTTCTACATTTGCTCACTGCTTGTGCTG
            Reverse    TACGGTAGCAGAGACTTGGTCTCCTTCATCCACCATGCCACA
PPR985      Forward    ACACTGACGACATGGTTCTACATCCGTGGCATTTTTAGTGCG
            Reverse    TACGGTAGCAGAGACTTGGTCTCCGGAGAAAGCTCTTACGGTT
PPR1839     Forward    ACACTGACGACATGGTTCTACATGCAACCGTAATGCTCGACT
            Reverse    TACGGTAGCAGAGACTTGGTCTTTTGCTCACTGCTTGTGCTG
34130       Forward    ACACTGACGACATGGTTCTACATCTAAGTTTGCGGATGTTGAGA
            Reverse    TACGGTAGCAGAGACTTGGTCTCATTCCCAGAACATACATGCAA
59820       Forward    ACACTGACGACATGGTTCTACAGCAGATTTAGTTTTACTCTCCTCCA
            Reverse    TACGGTAGCAGAGACTTGGTCTGGTCTTAAATACCATCTTCTGTGTCC
80460       Forward    ACACTGACGACATGGTTCTACACCGAAATTTTACCCAAAATCG
            Reverse    TACGGTAGCAGAGACTTGGTCTGCAATGTGGGATTTGTTCGT
20370       Forward    ACACTGACGACATGGTTCTACATTCAGAGCTCCCATTTTTGC
            Reverse    TACGGTAGCAGAGACTTGGTCTTTGACCTTCATCCAATAGAGCA
21370       Forward    ACACTGACGACATGGTTCTACACTGTTTTTCCAATTTTCCATCC
            Reverse    TACGGTAGCAGAGACTTGGTCTCAGGTTGTGGGCTACGATTT
PPR369      Forward    ACACTGACGACATGGTTCTACAGGAAAGGAAATCCATGCCCA
            Reverse    TACGGTAGCAGAGACTTGGTCTAGCCTTCAGTTACCATTCCG
PPR950      Forward    ACACTGACGACATGGTTCTACATCTCCATCTTCGAGAACGCC
            Reverse    TACGGTAGCAGAGACTTGGTCTGCCGGATCAGGTGACGATAG
PPR1250     Forward    ACACTGACGACATGGTTCTACAAAAGCCCTTCTTGCACGAGT
            Reverse    TACGGTAGCAGAGACTTGGTCTGGACAGCTTTGATTGCAGGG
PPR1561     Forward    ACACTGACGACATGGTTCTACATCCCTTTTGCCTCATCGACC
            Reverse    TACGGTAGCAGAGACTTGGTCTGGTTCACACGGTGAATGTCG
30360       Forward    ACACTGACGACATGGTTCTACAAGGTTGCTAAAGGCCGATTC
            Reverse    TACGGTAGCAGAGACTTGGTCTGGGTCTTTATCTAAAAGGCGAGA
35920       Forward    ACACTGACGACATGGTTCTACAGGGGACAAAAATAGCAGAGC
            Reverse    TACGGTAGCAGAGACTTGGTCTTCACCGTGCTTGTTAAGTGC
27260       Forward    ACACTGACGACATGGTTCTACACTCCCCCGGAAAGTAACAAA
            Reverse    TACGGTAGCAGAGACTTGGTCTTTGTTTCATGTTGCGCCTTT
2840        Forward    ACACTGACGACATGGTTCTACATCTGGAAAAATTCCCTGGAC
            Reverse    TACGGTAGCAGAGACTTGGTCTGCGCTTTGCAAATTCTTGAG

Table D.2: Primers for amplicon sequencing using the Fluidigm AccessArray. Primer sequences include conserved sequence tags.
Locus       Direction  Primer
53950       Forward    ACACTGACGACATGGTTCTACAAAACTGTGCTCTTCCTCCAA
            Reverse    TACGGTAGCAGAGACTTGGTCTGGGGAATGGTGACTCCTACA
62010       Forward    ACACTGACGACATGGTTCTACAAGCAACGCCATAAACTGGAA
            Reverse    TACGGTAGCAGAGACTTGGTCTTTGGGAAAGTTGATTGAGACG
2350        Forward    ACACTGACGACATGGTTCTACAGCTCCCATCTTTGTATATCTCG
            Reverse    TACGGTAGCAGAGACTTGGTCTCCCGTTCGTTCGATTGATAG
18520       Forward    ACACTGACGACATGGTTCTACATTCGAGAACGCCTCTAAACC
            Reverse    TACGGTAGCAGAGACTTGGTCTGTTATGCACAAAACGGGATG
33495       Forward    ACACTGACGACATGGTTCTACATCTCCATTTCTCAACCTCAGC
            Reverse    TACGGTAGCAGAGACTTGGTCTGCCCCTTCCTCTCCATACA
38180       Forward    ACACTGACGACATGGTTCTACAGGCATCAAAAGTGGATGATG
            Reverse    TACGGTAGCAGAGACTTGGTCTTCCCTCGTTGAGACATTCCT
4810        Forward    ACACTGACGACATGGTTCTACATACCAATTCCCCAGTTCTGC
            Reverse    TACGGTAGCAGAGACTTGGTCTATGGGAAGAGATGCTTACCTGA
37450       Forward    ACACTGACGACATGGTTCTACATGCTATCAAAACTTCGGCATC
            Reverse    TACGGTAGCAGAGACTTGGTCTGATCTCAAAAAGCACAACTCCA
66330       Forward    ACACTGACGACATGGTTCTACATGCAAATTCTTGAGCTGTCC
            Reverse    TACGGTAGCAGAGACTTGGTCTAAAATTCCCTGGACGCTTG
1331        Forward    ACACTGACGACATGGTTCTACAGCACCAAGATATGCCATTGA
            Reverse    TACGGTAGCAGAGACTTGGTCTTCCGAGCTAAGGCTATACATTCA
48730       Forward    ACACTGACGACATGGTTCTACACCATACGCGTAAATAAGAGAGC
            Reverse    TACGGTAGCAGAGACTTGGTCTTGATGGATATGGTAAAGCTAAACG
3382        Forward    ACACTGACGACATGGTTCTACATCTGAAAGCCTTGTACCAACC
            Reverse    TACGGTAGCAGAGACTTGGTCTGAGCCCTCTTGCCATTTCTA
2829        Forward    ACACTGACGACATGGTTCTACATGACCCGTTGACAACCCTAT
            Reverse    TACGGTAGCAGAGACTTGGTCTTGCTTACAGGCCCTTTGGTA
604         Forward    ACACTGACGACATGGTTCTACAATGGCCTCCGGTAATTCTCT
            Reverse    TACGGTAGCAGAGACTTGGTCTAGTGGCCTGAACTTTGCAGT
598         Forward    ACACTGACGACATGGTTCTACATCTGGGCTAACCTGAAATCG
            Reverse    TACGGTAGCAGAGACTTGGTCTTGGAAATGATCAAGAAATGAAGC
233         Forward    ACACTGACGACATGGTTCTACAACAACGCTGTGTGTTTGGTC
            Reverse    TACGGTAGCAGAGACTTGGTCTCCCACCAGCCCTTAACTACTC
2978        Forward    ACACTGACGACATGGTTCTACATCCATACAATGGTAAGATCACAAGA
            Reverse    TACGGTAGCAGAGACTTGGTCTGAAGGGTGTTCCGGGATTAT
2919        Forward    ACACTGACGACATGGTTCTACAAGAGTGTCATGGCCACCAAT
            Reverse    TACGGTAGCAGAGACTTGGTCTATGGACCGCATAGCTCAAAG
2782        Forward    ACACTGACGACATGGTTCTACAGAATTGAGGAGATTTGGGAATTT
            Reverse    TACGGTAGCAGAGACTTGGTCTCAGAATTGGGCCCTCCTAAG
1397        Forward    ACACTGACGACATGGTTCTACATCCAGTTTCGCTGAAATCACT
            Reverse    TACGGTAGCAGAGACTTGGTCTTAAAGGCCTTGGAGAAGCAA
836         Forward    ACACTGACGACATGGTTCTACATCCCAATTTATCCCAGAAAGC
            Reverse    TACGGTAGCAGAGACTTGGTCTAATCATGGGCGACCTATTTG
rps12rpl20  Forward    ACACTGACGACATGGTTCTACAATTAGAAANRCAAGACAGCCAAT
            Reverse    TACGGTAGCAGAGACTTGGTCTCGYYAYCGAGCTATATATCC
trnTL       Forward    ACACTGACGACATGGTTCTACACATTACAAATGCGATGCTCT
            Reverse    TACGGTAGCAGAGACTTGGTCTTCTACCGATTTCGCCATATC
trnCD       Forward    ACACTGACGACATGGTTCTACACCAGTTCAAATCTGGGTGTC
            Reverse    TACGGTAGCAGAGACTTGGTCTGGGATTGTAGTTCAATTGGT

Table D.3: Primers for amplicon sequencing using the Fluidigm AccessArray (continued). Primer sequences include conserved sequence tags.
Quartet   Topology  QCF        QCF-Boot
A,B,C,D   12|34     0.0365333  0.0310762
A,B,C,D   13|24     0.0283334  0.0293331
A,B,C,D   14|23     0.0242000  0.0524701
A,B,C,E   12|34     0.0336667  0.0307411
A,B,C,E   13|24     0.0295333  0.0279207
A,B,C,E   14|23     0.0304000  0.0357291
A,B,C,F   12|34     0.0423667  0.0334601
A,B,C,F   13|24     0.0314000  0.0293172
A,B,C,F   14|23     0.0297667  0.0465771
A,B,D,E   12|34     0.0320666  0.0301693
A,B,D,E   13|24     0.0313667  0.0285252
A,B,D,E   14|23     0.0311000  0.0377345
A,B,D,F   12|34     0.0387000  0.0317376
A,B,D,F   13|24     0.0317334  0.0317221
A,B,D,F   14|23     0.0319667  0.0447368
A,B,E,F   12|34     0.0307667  0.0247832
A,B,E,F   13|24     0.0233333  0.0252398
A,B,E,F   14|23     0.0229000  0.0394890
A,C,D,E   12|34     0.0282000  0.0265650
A,C,D,E   13|24     0.0282333  0.0365885
A,C,D,E   14|23     0.0323000  0.0257895
A,C,D,F   12|34     0.0306667  0.0286579
A,C,D,F   13|24     0.0314666  0.0411249
A,C,D,F   14|23     0.0361333  0.0283903
A,C,E,F   12|34     0.0387333  0.0313787
A,C,E,F   13|24     0.0293333  0.0382125
A,C,E,F   14|23     0.0329333  0.0578481
A,D,E,F   12|34     0.0399333  0.0323120
A,D,E,F   13|24     0.0290667  0.0377330
A,D,E,F   14|23     0.0312000  0.0587910
B,C,D,E   12|34     0.0268000  0.0258273
B,C,D,E   13|24     0.0245333  0.0355890
B,C,D,E   14|23     0.0260000  0.0280845
B,C,D,F   12|34     0.0247333  0.0312867
B,C,D,F   13|24     0.0313000  0.0387545
B,C,D,F   14|23     0.0295667  0.0284660
B,C,E,F   12|34     0.0365333  0.0315087
B,C,E,F   13|24     0.0272000  0.0346498
B,C,E,F   14|23     0.0296667  0.0532958
B,D,E,F   12|34     0.0360000  0.0285520
B,D,E,F   13|24     0.0244000  0.0334152
B,D,E,F   14|23     0.0313333  0.0534680
C,D,E,F   12|34     0.0252000  0.0228099
C,D,E,F   13|24     0.0194000  0.0227021
C,D,E,F   14|23     0.0207333  0.0345828
Table D.4: RMSD values for QCF estimation using data simulated from a tree topology (Figure 4.1a).
Quartet   Topology  QCF        QCF-Boot
A,B,C,D   12|34     0.0232667  0.0213276
A,B,C,D   13|24     0.0199333  0.0212925
A,B,C,D   14|23     0.0191333  0.0323620
A,B,C,E   12|34     0.0255000  0.0235640
A,B,C,E   13|24     0.0211000  0.0240413
A,B,C,E   14|23     0.0219333  0.0345390
A,B,C,F   12|34     0.0342000  0.0277485
A,B,C,F   13|24     0.0274000  0.0242728
A,B,C,F   14|23     0.0261333  0.0412771
A,B,D,E   12|34     0.0277000  0.0237231
A,B,D,E   13|24     0.0225000  0.0239117
A,B,D,E   14|23     0.0235333  0.0328303
A,B,D,F   12|34     0.0341000  0.0275928
A,B,D,F   13|24     0.0246667  0.0279635
A,B,D,F   14|23     0.0277000  0.0425720
A,B,E,F   12|34     0.0245000  0.0222437
A,B,E,F   13|24     0.0190000  0.0241789
A,B,E,F   14|23     0.0205000  0.0381137
A,C,D,E   12|34     0.0266000  0.0233988
A,C,D,E   13|24     0.0270334  0.0358624
A,C,D,E   14|23     0.0281000  0.0273872
A,C,D,F   12|34     0.0243667  0.0259449
A,C,D,F   13|24     0.0287000  0.0369820
A,C,D,F   14|23     0.0288667  0.0292340
A,C,E,F   12|34     0.0294000  0.0330291
A,C,E,F   13|24     0.0273000  0.0286997
A,C,E,F   14|23     0.0281667  0.0265154
A,D,E,F   12|34     0.0311333  0.0349219
A,D,E,F   13|24     0.0323667  0.0313181
A,D,E,F   14|23     0.0313000  0.0244409
B,C,D,E   12|34     0.0267667  0.0263707
B,C,D,E   13|24     0.0282000  0.0345829
B,C,D,E   14|23     0.0311667  0.0280294
B,C,D,F   12|34     0.0264667  0.0263840
B,C,D,F   13|24     0.0264000  0.0366836
B,C,D,F   14|23     0.0268000  0.0291028
B,C,E,F   12|34     0.0318667  0.0311415
B,C,E,F   13|24     0.0287333  0.0305662
B,C,E,F   14|23     0.0301333  0.0293290
B,D,E,F   12|34     0.0300333  0.0340250
B,D,E,F   13|24     0.0313667  0.0315823
B,D,E,F   14|23     0.0358000  0.0246551
C,D,E,F   12|34     0.0295333  0.0258286
C,D,E,F   13|24     0.0241667  0.0287898
C,D,E,F   14|23     0.0299666  0.0373072
Table D.5: RMSD values for QCF estimation using data simulated from a network topology (Figure 4.1b).