Developing Computational Tools for Evolutionary Inferences in Polyploids

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Paul David Blischak, B.Sc.

Graduate Program in Evolution, Ecology, and Organismal Biology

The Ohio State University

2018

Dissertation Committee:

Andrea D. Wolfe, Advisor

Bryan C. Carstens

John V. Freudenstein

Laura S. Kubatko

© Copyright by

Paul David Blischak

2018

Abstract

Methods for generating genome-scale data sets are facilitating the inference of phylogenetic

relationships in non-model taxa across the Tree of Life. However, rapid speciation

and heterogeneous patterns of diversification make this task difficult when gene trees

have conflicting histories (e.g., from incomplete lineage sorting). For plant species in

particular, additional complications arise due to the intermixing of divergent lineages

through hybridization and the subsequent occurrence of whole genome duplication

(WGD; i.e., allopolyploidy). Investigations regarding the evolutionary history of recently

formed polyploids and their diploid progenitors are difficult to conduct because

of problems with resolving ambiguous genotypes in the polyploids as well as analyzing

species with different ploidies. The focus of my dissertation has been to develop models

and bioinformatic tools for analyzing high-throughput sequencing (HTS) data collected

in non-model taxa of different ploidy levels to estimate phylogenetic relationships. I

am applying these tools in the plant genus Penstemon (Plantaginaceae) to infer the

relationships in two groups of closely related species containing diploids, tetraploids,

and hexaploids.

The first chapter of my dissertation uses HTS data and a hierarchical Bayesian

framework to estimate biallelic single nucleotide polymorphism (SNP) genotypes and

allele frequencies in populations of any ploidy level (diploid or higher) assuming Hardy-Weinberg

equilibrium. It does this using Markov chain Monte Carlo (MCMC) to

integrate over the uncertainty in the estimated genotypes. I then assess the model’s

accuracy using simulations and test it on a SNP data set in autotetraploid potato

(Solanum tuberosum). Both of these tests demonstrate the usefulness of the model for

parameter inference at different ploidy levels. The MCMC algorithm that is used for

inference is implemented in the open source R package polyfreqs.

The set of models in my second chapter builds on Chapter 1 in two important ways.

First, I extend the Hardy-Weinberg equilibrium model to include inbreeding. Second,

I directly address the hybrid nature of allopolyploid organisms by separately modeling

the genomes of the two parental species. Using both simulations and empirical data

sets from the literature (autopolyploid: Andropogon gerardii, allopolyploid: Betula

pubescens + diploid parent: B. pendula), I benchmark these methods against other

software (Genome Analysis Toolkit) to demonstrate their effectiveness for estimating

genotypes. These new models also use a different algorithm for inferring population

parameters, the expectation maximization algorithm, which I have implemented in

the open source software package ebg.

Chapter 3 uses ideas similar to those presented in the first two chapters, but

focuses on inferring full haplotype sequences, rather than single SNP genotypes, for

samples of arbitrary ploidy. The method is able to process paired-end HTS data

collected using double-barcoded amplicon sequencing, and uses the program PURC to

cluster sequencing reads into haplotypes. It then uses a multinomial likelihood to infer

haplotypes while also accounting for sequencing error. The pipeline is implemented in

the software Fluidigm2PURC, and I demonstrate its use on a polyploid series from

the genus Thalictrum (Ranunculaceae).

My final chapter uses nuclear amplicon sequencing to infer evolutionary relationships

between two closely related groups in Penstemon: subsections Humiles and

Proceri (Plantaginaceae). These two groups are known to hybridize and have documented

cases of WGD events forming putative allotetraploids and allohexaploids. To

estimate phylogeny in these two groups, I first use the methods described in Chapter

3 to determine haplotypes from paired-end HTS data for all diploid, tetraploid, and

hexaploid individuals. I also develop a method for assessing the proportion of gene

trees supporting a species-level quartet (quartet concordance factors; QCFs), which I

use as input for estimating a species network using the program SNaQ. Phylogenies

inferred using both species tree and network approaches recover subsections Humiles

and Proceri as non-monophyletic. There is also strong evidence for hybridization within and between these two groups.

This is dedicated to my grandparents:

Doris & Michael† Blischak and Mary† & Carl Firm

Acknowledgments

First and foremost, I would like to thank my advisor, Dr. Andrea Wolfe, for her

guidance and support throughout the process of getting my Ph.D. I started working with Andi as an undergraduate math major, and she convinced me to go to grad school,

and to give biology a try. I will be forever grateful for this advice. My co-advisor for

this undergrad research project with Andi was Dr. Laura Kubatko, who has continued

to serve as a mentor and “unofficial” Ph.D. co-advisor, for which I am very thankful.

Working with Andi and Laura has been an incredible experience, and I will miss

getting to chat in their offices while mulling over the latest issues about why none

of my research is working. I would also like to thank my other committee members,

Dr. Bryan Carstens and Dr. John Freudenstein, who provided invaluable insights

into key areas of my thesis, and were always happy to discuss my research or to help

troubleshoot.

I would like to thank all members of the Wolfe Pack, both past and present. When

I first started in the lab, I was beyond clueless, but my senior lab mates, Dan Robarts

and Aaron Wenzel, were an immense help, and continued to guide me through the

early years of my graduate work. I’d also like to acknowledge my current lab mates,

Ben Stone and Rosa Rodriguez, who are great friends and colleagues, and who have

helped me bounce around ideas or problems that I was having with my research on

numerous occasions.

Getting through this Ph.D. would not have been possible without my family. My

parents, Maggi and Dave, have shown indefatigable love and support throughout all

stages of my education, even when I had absolutely no idea what I wanted to do with

my life. They are model human beings, and I hope to live up to the example that they

have set for me. My siblings, John and Julianna, are equally badass. Not only are

they great couch-fort builders, living-room soccer players, and backyard mat tumblers,

they are incredible friends. I am also immeasurably fortunate to have an amazing

partner, Makenzie Mabry, whose love and encouragement have sustained me through

both the high and low points of my Ph.D. over the past couple of years.

I would like to thank everyone who helped me in the field and with obtaining

collecting permits, including Mikel Stevens, Noel Holmgren, Karen and Steve Shelly,

Carol Blackburn, Teresa Prendusi, Dale Reinhart, Maret Pajutee, and Steve Popovich.

I also thank the organizations that provided funding for my research: The Ohio State

University (Distinguished University Fellowship), the National Science Foundation

(Doctoral Dissertation Improvement Grant; DEB-1601096), the Society of Systematic

Biologists (Graduate Student Research Award), and the American Society of Plant

Taxonomists (Graduate Student Research Grant). Computational resources for my

research were provided by the Ohio Supercomputer Center and the College of Arts

and Sciences Unity Cluster at OSU. I would also like to acknowledge Drs. Xin He,

John Novembre, and Matthew Stephens, as well as their lab members, for allowing

me to come visit and present my research at the University of Chicago, and thanks

especially to my brother John for orchestrating this opportunity.

To all of the wonderful friends that I have made while working in EEOB, thank you all for the memories, the laughs, and the good times. I will miss you all, and I

hope we’ll bump into each other as frequently as possible.

Vita

2008 ...... Archbishop Hoban High School.

2012 ...... B.Sc. Mathematics, The Ohio State University.

2012-2013, 2017-2018 ...... Distinguished University Fellow, The Ohio State University.

2013-2015 ...... Graduate Teaching Associate, The Ohio State University.

2015-2017 ...... Graduate Research Associate, The Ohio State University.

Publications

Research Publications

Blischak, P. D., M. Latvis, D. F. Morales-Briones, J. C. Johnson, V. S. Di Stilio, A. D. Wolfe, and D. C. Tank. Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons. Applications in Plant Sciences, 6:e1156, 2018.

Blischak, P. D., J. Chifman, A. D. Wolfe, and L. S. Kubatko. HyDe: a Python Package for Genome-Scale Hybridization Detection. Systematic Biology, doi:10.1093/sysbio/syy023, 2018.

Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. SNP Genotyping and Parameter Estimation in Polyploids Using Low-Coverage Sequencing Data. Bioinformatics, 34:407–415, 2018.

Latvis, M., S. J. Jacobs, S. M. E. Mortimer, M. Richards, P. D. Blischak, S. Mathews, and D. C. Tank. Primers for Castilleja and Their Utility Across Orobanchaceae: II. Single-Copy Nuclear Loci. Applications in Plant Sciences, 5:1700038, 2017.

Wolfe, A. D., T. Necamp, S. Fassnacht, P. D. Blischak, and L. S. Kubatko. Population Genetics of Penstemon albomarginatus (Plantaginaceae), a Rare Mojave Desert Species of Conservation Concern. Conservation Genetics, 17:1245–1255, 2016.

Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. Accounting for Genotype Uncertainty in the Estimation of Allele Frequencies in Autopolyploids. Molecular Ecology Resources, 16:742–754, 2016.

Blischak, P. D., A. J. Wenzel, and A. D. Wolfe. Gene Prediction and Annotation in Penstemon (Plantaginaceae): a Workflow for Marker Development from Extremely Low-Coverage Genome Sequencing. Applications in Plant Sciences, 2:1400044, 2014.

Fields of Study

Major Field: Evolution, Ecology, and Organismal Biology

Minor Field: Statistics

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Accounting for Genotype Uncertainty in the Estimation of Allele Frequencies in Autopolyploids
    1.1 Abstract
    1.2 Introduction
    1.3 Materials and Methods
        1.3.1 Model Setup
        1.3.2 Full Conditionals and MCMC Using Gibbs Sampling
        1.3.3 Simulation Study
        1.3.4 Example Analyses of Autotetraploid Potato (Solanum tuberosum)
    1.4 Results
        1.4.1 Simulation Study
        1.4.2 Example Analyses
    1.5 Discussion
    1.6 Conclusions
    1.7 Software Note
    1.8 Acknowledgements
    1.9 Author Contributions
    1.10 Data Accessibility

2. SNP Genotyping and Parameter Estimation in Polyploids from Low-Coverage Sequencing Data
    2.1 Abstract
    2.2 Introduction
    2.3 Models
        2.3.1 Autopolyploid Model
        2.3.2 Allopolyploid Model
        2.3.3 Other Approaches
    2.4 Methods
        2.4.1 Simulations
        2.4.2 Empirical Data Analysis
        2.4.3 Software and Reproducibility
    2.5 Results
        2.5.1 Simulations
        2.5.2 Empirical Data Analysis
    2.6 Discussion
    2.7 Conclusions
    2.8 Acknowledgements

3. Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons
    3.1 Abstract
    3.2 Introduction
    3.3 Methods and Results
        3.3.1 Input data
        3.3.2 Step 1: fluidigm2purc
        3.3.3 Step 2: PURC
        3.3.4 Step 3: crunch_clusters
        3.3.5 Example analysis
    3.4 Conclusions
    3.5 Availability
    3.6 Acknowledgements

4. Inferring Species Trees and Networks from Gene Tree Quartet Site Patterns: An Example from the Plant Genus Penstemon (Plantaginaceae)
    4.1 Abstract
    4.2 Introduction
    4.3 Approach
        4.3.1 Calculating Quartet Concordance Factors
        4.3.2 Bootstrapping and Gene Tree Uncertainty
        4.3.3 Validating QCF Estimation
        4.3.4 Implementation
    4.4 Materials and Methods
        4.4.1 Study System
        4.4.2 Sample Collection, DNA Extraction, and Amplicon Sequencing
        4.4.3 Species Tree Inference
        4.4.4 Candidate Hybridization Events from Rooted Triples
        4.4.5 Species Network Inference
    4.5 Results
        4.5.1 Nuclear Amplicon Data
        4.5.2 Species Tree Inference
        4.5.3 Tests for Hybridization and Species Network Inference
    4.6 Discussion
        4.6.1 of Subsections Humiles and Proceri
        4.6.2 Character Evolution and Biogeography
        4.6.3 Phylogenetics of Hybrids and Polyploids
    4.7 Conclusions

Bibliography

A. Chapter 1 Supplemental Materials
    A.1 Example Analyses of Autotetraploid Potato (Solanum tuberosum)
        A.1.1 Calculating Expected and Observed Heterozygosity
        A.1.2 Evaluating Model Adequacy

B. Chapter 2 Supplemental Materials
    B.1 EM Algorithms
        B.1.1 Autopolyploid Model
        B.1.2 Allopolyploid Model
        B.1.3 C++ Code
    B.2 Simulations
        B.2.1 Inbreeding Coefficient From Called Genotypes
    B.3 Empirical Data Analysis
        B.3.1 Data Acquisition
        B.3.2 Comparison with GATK

C. Chapter 3 Supplemental Materials
    C.1 Haplotype Inference
        C.1.1 Inferring Haplotypes with Known Ploidy
        C.1.2 Inferring Haplotypes with Unknown Ploidy
    C.2 Example Analysis
        C.2.1 Fluidigm2PURC
        C.2.2 dbcAmplicons (reduce_amplicons.R)

D. Chapter 4 Supplemental Materials
    D.1 Validating QCF Estimation
        D.1.1 Tree Simulations
        D.1.2 Network Simulations
    D.2 Code for Species Tree and Network Inference
        D.2.1 Gene Tree Estimates with RAxML
        D.2.2 Species Tree Inference with ASTRAL-III
        D.2.3 Species Tree Inference with qcf+QuartetMaxCut
        D.2.4 Network Analyses with PhyloNetworks

List of Tables

1.1 Notation and symbols used in the description of the model for estimating allele frequencies in polyploids

2.1 A key to the symbols and notation that are used in describing the autopolyploid and allopolyploid models

3.1 Dependencies for the Fluidigm2PURC pipeline with version numbers in parentheses

3.2 Thalictrum L. species included in the comparison of Fluidigm2PURC and dbcAmplicons

3.3 Overall alignment statistics for the comparison between Fluidigm2PURC and the reduce_amplicons.R script

3.4 Per species data for the number of haplotypes inferred by Fluidigm2PURC using known vs. unknown ploidy

C.1 Haplotype configurations and their corresponding log-likelihoods for a tetraploid with ordered cluster sizes equal to 285, 95, 10, and 8

C.2 Haplotype configurations for an individual with six clusters

D.1 Collection and ploidy information for accessions from Penstemon subsections Humiles and Proceri

D.2 Primers for amplicon sequencing

D.3 Primers for amplicon sequencing (continued)

D.4 RMSD values for QCF estimation using data simulated from a tree topology (Figure 4.1a)

D.5 RMSD values for QCF estimation using data simulated from a network topology (Figure 4.1b)

List of Figures

1.1 Error in allele frequency estimation as measured by the RMSE of posterior means

1.2 Posterior standard deviation for allele frequency estimates across levels of sequencing coverage

1.3 Posterior distribution of observed and expected heterozygosity in Solanum tuberosum

2.1 RMSD values for simulations under the autopolyploid model with inbreeding for (a) estimated inbreeding coefficients and (b) estimated genotypes

2.2 RMSD values for full genotype estimation (combined number of alternative alleles in subgenomes one and two)

2.3 Results of empirical data analyses: (a) Levels of inbreeding in Andropogon gerardii and (b) genotype estimation error in Betula pubescens

3.1 Flowchart outlining the steps for haplotype inference using Fluidigm2PURC

4.1 Simulation setup for (a) tree and (b) network topologies. Internal branches are annotated with their lengths in coalescent units (CUs). The total tree height is 4.0 CUs.

4.2 Hypotheses of allopolyploid formation in Penstemon attenuatus

4.3 Phylogeny of P. subsect. Humiles and Proceri inferred by ASTRAL-III

4.4 Phylogeny of P. subsect. Humiles and Proceri inferred using qcf and QuartetMaxCut

4.5 Phylogeny of P. subsect. Humiles and Proceri inferred using RAxML

4.6 Best ML networks for clades A and B

A.1 Comparison of posterior mean versus mean read ratio estimates of allele frequencies for all simulation settings

A.2 Comparison of posterior mean versus mean read ratio estimates of allele frequencies for Solanum tuberosum

A.3 Density plot of the difference between the mean read ratio (simple) and posterior mean estimates in Solanum tuberosum

A.4 A close up comparison of the effect of coverage and the number of individuals sampled on estimation error for octoploids

B.1 RMSD values for inbreeding coefficient estimation with 25 individuals (all simulations)

B.2 RMSD values for inbreeding coefficient estimation with 50 individuals (all simulations)

B.3 RMSD values for inbreeding coefficient estimation with 100 individuals (all simulations)

B.4 RMSD values for autopolyploid genotype estimation with 25 individuals (all simulations)

B.5 RMSD values for autopolyploid genotype estimation with 50 individuals (all simulations)

B.6 RMSD values for autopolyploid genotype estimation with 100 individuals (all simulations)

B.7 RMSD values for the estimation of the allele frequency in subgenome two for the allopolyploid model

B.8 RMSD values for the estimation of the genotype in subgenome one for the allopolyploid model

B.9 RMSD values for the estimation of the genotype in subgenome two for the allopolyploid model

B.10 Distribution of the difference in allele frequency estimates for the Hardy-Weinberg model vs. GATK

B.11 Distribution of the genotypes estimated by the allopolyploid model for each possible value of the genotype estimated by GATK

D.1 Simulation results for tree topology (Figure 4.1a)

D.2 Simulation results for network topology (Figure 4.1b)

D.3 Phylogeny of P. subsect. Humiles and Proceri inferred by ASTRAL-III (with branch lengths)

D.4 Phylogeny of P. subsect. Humiles and Proceri inferred using RAxML (with branch lengths)

D.5 Networks inferred for clade A using SNaQ as implemented in the software PhyloNetworks

D.6 Networks inferred for clade B using SNaQ as implemented in the software PhyloNetworks

Chapter 1: Accounting for Genotype Uncertainty in the Estimation of Allele Frequencies in Autopolyploids

Publication Information

This chapter is formatted for this dissertation from the following publication:

Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. Accounting for Genotype Uncertainty

in the Estimation of Allele Frequencies in Autopolyploids. Molecular Ecology Resources,

16:742–754, 2016.

1.1 Abstract

Despite the increasing opportunity to collect large-scale data sets for population

genomic analyses, the use of high-throughput sequencing to study populations of

polyploids has seen little application. This is due in large part to problems associated with determining allele copy number in the genotypes of polyploid individuals (allelic

dosage uncertainty–ADU), which complicates the calculation of important quantities

such as allele frequencies. Here we describe a statistical model to estimate biallelic SNP

frequencies in a population of autopolyploids using high-throughput sequencing data

in the form of read counts. We bridge the gap from data collection (using restriction

enzyme based techniques [e.g., GBS, RADseq]) to allele frequency estimation in a

unified inferential framework using a hierarchical Bayesian model to sum over genotype

uncertainty. Simulated data sets were generated under various conditions for tetraploid,

hexaploid, and octoploid populations to evaluate the model’s performance and to help

guide the collection of empirical data. We also provide an implementation of our model

in the R package polyfreqs and demonstrate its use with two example analyses that

investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy.

Our simulations show that the number of individuals sampled from a population has

a greater impact on estimation error than sequencing coverage. The example analyses

also show that our model and software can be used to make inferences beyond the

estimation of allele frequencies for autopolyploids by providing assessments of model

adequacy and estimates of heterozygosity.

1.2 Introduction

Biologists have long been fascinated by the occurrence of whole genome duplication

(WGD) in natural populations and have recognized its role in the generation of

biodiversity (Clausen et al., 1940; Stebbins, 1950; Grant, 1971; Otto and Whitton,

2000). Though WGD is thought to have occurred at some point in nearly every major

group of eukaryotes, it is a particularly common phenomenon in plants and is regarded

by many to be an important factor in plant diversification (Wood et al., 2009; Soltis et al., 2009; Scarpino et al., 2014). The role of polyploidy in plant evolution was

originally considered by some to be a “dead-end” (Stebbins, 1950; Wagner, 1970; Soltis et al., 2014) but, since its first discovery in the early twentieth century, polyploidy

has been continually studied in nearly all areas of botany (Winge, 1917; Winkler,

1916; Clausen et al., 1945; Grant, 1971; Stebbins, 1950; Soltis et al., 2003, 2010; Soltis

and Soltis, 2009; Ramsey and Ramsey, 2014). Though fewer examples of WGD are

currently known for animal systems, groups such as amphibians, fish, and reptiles

all exhibit polyploidy (Allendorf and Thorgaard, 1984; Gregory and Mable, 2005).

Ancient genome duplications are also thought to have played an important role in

the evolution of both plants and animals, occurring in the lineages preceding the

seed plants, angiosperms, and vertebrates (Ohno, 1970; Otto and Whitton, 2000;

Furlong and Holland, 2001; Jiao et al., 2011). These ancient WGD events during

the early history of seed plants and angiosperms have been followed by several more

WGDs in all major plant groups (Cui et al., 2006; Scarpino et al., 2014; Cannon et al.,

2014). Recent experimental evidence has also demonstrated increased survivorship

and adaptability to foreign environments of polyploid taxa when compared with their

lower ploidy relatives (Ramsey, 2011; Selmecki et al., 2015).

Polyploids are generally divided into two types based on how they are formed: auto-

and allopolyploids. Autopolyploids form when a WGD event occurs within a single evolutionary

lineage and typically have polysomic inheritance. Allopolyploids are formed

by hybridization between two separately evolving lineages followed by WGD and are

thought to have mostly disomic inheritance. Multivalent chromosome pairing during

meiosis can occur in allopolyploids, however, resulting in mixed inheritance patterns

across loci in the genome (segmental allopolyploids; Stebbins, 1950). Autopolyploids

can also undergo double reduction, a product of multivalent chromosome pairing wherein segments from sister chromatids move together during meiosis—resulting

in allelic inheritance that breaks away from a strict pattern of polysomy (Haldane,

1930). Autopolyploidy was also thought to be far less common than allopolyploidy,

but recent studies have concluded that autopolyploidy occurs much more frequently

than originally proposed (Soltis et al., 2007; Parisod et al., 2010).

The theoretical treatment of population genetic models in polyploids has its origins

in the Modern Synthesis with Fisher, Haldane, and Wright each contributing to

the development of some of the earliest mathematical models for understanding the

genetic patterns of inheritance in polyploids (Haldane, 1930; Wright, 1938; Fisher,

1943). Early empirical work on polyploids that influenced Fisher, Haldane, and Wright

includes studies on Lythrum salicaria by N. Barlow (Barlow, 1913, 1923), Dahlia by W.

J. C. Lawrence (Lawrence, 1929), and Primula by H. J. Muller (Muller, 1914). The

foundation laid down by these early papers has led to the continuing development

of population genetic models for polyploids, including models for understanding the

rate of loss of genetic diversity and extensions of the coalescent in autotetraploids,

as well as modifications of the multispecies coalescent for the inference of species

networks containing allotetraploids (Moody et al., 1993; Arnold et al., 2012; Jones et al., 2013). Much of this progress was described in a review by Dufresne et al.

(2014), who outlined the current state of population genetics in polyploids regarding

both molecular techniques and statistical models. Not surprisingly, one of the most

promising developments for the future of population genetics in polyploids is the

advancement of sequencing technologies. A particularly common method of gathering

large data sets for genome-scale inferences is the use of restriction enzyme based techniques

(e.g., RADseq, ddRAD, GBS, etc.), which we will refer to generally as RADseq (Miller et al., 2007; Baird et al., 2008; Peterson et al., 2012; Puritz et al., 2014). However,

despite its popularity for population genetic inferences at the diploid level, there are

many fewer examples of RADseq experiments conducted on polyploid taxa (but see

Ogden et al., 2013; Wang et al., 2013; Logan-Young et al., 2015).

Among the primary reasons for the dearth of RADseq applications in polyploids is

the issue of allelic dosage uncertainty (ADU), or the inability to fully determine the

genotype of a polyploid organism when it is partially heterozygous at a given locus.

This is the same problem that has been encountered with other codominant markers such

as microsatellites, which have been commonly used for population genetic analyses in

polyploids. One way of dealing with allelic dosage that has been used for multi-allelic

microsatellite markers has been to code alleles as either present or absent based on

electropherogram readings (allelic phenotypes) and to analyze the resulting dominant

data using a program such as polysat (Clark and Jasieniuk, 2011; Dufresne et al.,

2014). de Silva et al. (2005) developed a method for inferring allele frequencies using

observed allelic phenotype data and used an expectation-maximization algorithm to

deal with the incomplete genotype data resulting from ADU. Attempts to directly infer

the genotype of polyploid microsatellite loci have also been successfully completed in

some cases by using the relative electropherogram peak heights of the alleles in the

genotypes (Esselink et al., 2004). The estimation problem would be similar for biallelic

SNP data collected using RADseq, where a partially heterozygous polyploid will

have high-throughput sequencing reads containing both alleles. For a tetraploid, the

possible genotypes for a partial heterozygote (alleles A and B) would be AAAB, AABB,

and ABBB. For a hexaploid they are AAAAAB, AAAABB, AAABBB, AABBBB,

and ABBBBB. In general, the number of possible genotypes for a biallelic locus of

a partially heterozygous K-ploid (K = 3, 4, 5,...) is K − 1. A possible solution to

this problem for SNPs would be to try to use existing genotype callers and to rely on

the relative number of sequencing reads containing the two alleles (similar to what was done for microsatellites). However, this could lead to erroneous inferences when

genotypes are simply fixed at point estimates based on read proportions without

considering estimation error. Furthermore, when sequencing coverage is low, the

number of genotypes that will appear to be equally probable increases with ploidy,

making it difficult to distinguish among the possible partially heterozygous genotypes.
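The counting argument and the point-estimate problem described above can be illustrated with a minimal sketch. The function names are hypothetical and chosen only for this example; the naive caller simply rounds the read proportion to the nearest dosage, which is an assumed stand-in for "fixing genotypes at point estimates" rather than the behavior of any particular genotype caller.

```python
def partial_heterozygotes(ploidy, ref="A", alt="B"):
    """List the partially heterozygous genotypes at a biallelic locus."""
    # d counts copies of the alternative allele; the two homozygotes
    # (d = 0 and d = ploidy) are excluded, leaving ploidy - 1 genotypes.
    return [ref * (ploidy - d) + alt * d for d in range(1, ploidy)]


def naive_dosage(ref_reads, total_reads, ploidy):
    """Point estimate of reference-allele dosage from the read proportion."""
    # Hypothetical naive caller: scale the read proportion by the ploidy
    # and round, ignoring how many reads support the estimate.
    return round(ploidy * ref_reads / total_reads)


if __name__ == "__main__":
    for k in (4, 6):
        genotypes = partial_heterozygotes(k)
        print(k, len(genotypes), genotypes)
    # Identical read proportions give identical calls, even though the
    # second call rests on ten times as many reads as the first.
    print(naive_dosage(3, 4, 4), naive_dosage(30, 40, 4))
```

The two calls in the last line return the same dosage, which is the sense in which a point estimate discards the difference in evidence between low and high coverage.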

In this paper we describe a model that aims to address the problems associated with ADU by treating genotypes as a latent variable in a hierarchical Bayesian

model and using high throughput sequencing read counts as data. In this way we preserve the uncertainty that is inherent in polyploid genotypes by inferring a

probability distribution across all possible values of the genotype, rather than treating

them as being directly observed. This approach has been used by Buerkle and

Gompert (2013) to deal with uncertainty in calling genotypes in diploids and the work we present here builds off of their earlier models. Our model assumes that

the ploidy level of the population is known and that the genotypes of individuals in

the population are drawn from a single underlying allele frequency for each locus.

These assumptions imply that alleles in the population are undergoing polysomic

inheritance without double reduction, which most closely adheres to the inheritance

patterns of an autopolyploid. We acknowledge that the model in its current form

is an oversimplification of biological reality and realize that it does not apply to a

large portion of polyploid taxa. Nevertheless, we believe that accounting for ADU

by modeling genotype uncertainty has the potential to be applied more broadly via modifications of the probability model used for the inheritance of alleles, which

could lead to more generalized population genetic models for polyploids (see the

Extensibility section of the Discussion).

1.3 Materials and Methods

Our goal is to estimate the frequency of a reference allele for each locus sampled

from a population of known ploidy (ψ), where the reference allele can be chosen

arbitrarily between the two alleles at a given biallelic SNP. To do this we extend the

population genomic models of Buerkle and Gompert (2013), which employ a Bayesian

framework to model high-throughput sequencing reads (T, R), genotypes (G), and

allele frequencies (p), to the case of arbitrary ploidy. The idea behind the model is to view the sequencing reads gathered for an individual as a random sample from the

unobserved genotype at each locus. Genotypes can then be treated as a parameter in

a probability model that governs how likely it is that we see a particular number of

sequencing reads carrying the reference allele. Similarly, we can treat genotypes as

a random sample from the underlying allele frequency in the population (assuming

Hardy-Weinberg equilibrium). For our model, a genotype is simply a count of the

number of reference alleles at a locus which can range from 0 (a homozygote with no

reference alleles in the genotype) to ψ (a homozygote with only reference alleles in the

genotype). All whole numbers in between 0 and ψ represent partially heterozygous

genotypes. This hierarchical setup addresses the problems associated with ADU by

treating genotypes as a latent variable that can be integrated out using Markov chain

Monte Carlo (MCMC).

Symbol                 Description
$L$                    The number of loci.
$\ell$                 Index for loci ($\ell \in \{1, \ldots, L\}$).
$N$                    Total number of individuals sequenced.
$i$                    Index for individuals ($i \in \{1, \ldots, N\}$).
$\psi$                 The ploidy level of individuals in the population (e.g., tetraploid: $\psi = 4$).
$p_\ell$               Frequency of the reference allele at locus $\ell$. [$\boldsymbol{p}$]
$g_{i\ell}$            The number of copies of the reference allele for individual $i$ at locus $\ell$. [$\boldsymbol{G}$]
$\tilde{g}_{i\ell}$    Simulated genotype for posterior predictive model checking.
$t_{i\ell}$            The total number of reads for individual $i$ at locus $\ell$. [$\boldsymbol{T}$]
$r_{i\ell}$            The number of reads with the reference allele for individual $i$ at locus $\ell$. [$\boldsymbol{R}$]
$\tilde{r}_{i\ell}$    Simulated reference read count for posterior predictive model checking. [$\tilde{\boldsymbol{R}}$]
$\epsilon$             Sequencing error.

Table 1.1: Notation and symbols used in the description of the model for estimating allele frequencies in polyploids. Vector and matrix forms of the variables are also provided in brackets when appropriate.

1.3.1 Model Setup

Here we consider a sample of N individuals from a single population of ploidy level

ψ sequenced at L unlinked SNPs. The data for the model consist of two matrices

containing counts of high-throughput sequencing reads mapping to each locus for

each individual: R and T. The N × L matrix T contains the total number of

reads sampled at each locus for each individual. Similarly, R is an N × L matrix

containing the number of sampled reads with the reference allele at each locus for

each individual. Then for individual $i$ at locus $\ell$, we model the number of sequencing reads containing the reference allele ($r_{i\ell}$) as a Binomial random variable conditional on the total number of sequencing reads ($t_{i\ell}$), the underlying genotype ($g_{i\ell}$), and a constant level of sequencing error ($\epsilon$):

$$P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon) = \binom{t_{i\ell}}{r_{i\ell}} g_{\epsilon}^{r_{i\ell}} (1 - g_{\epsilon})^{t_{i\ell} - r_{i\ell}}. \tag{1.1}$$

Here $g_{\epsilon}$ is the probability of observing a read containing the reference allele corrected for sequencing error:

$$g_{\epsilon} = \frac{g_{i\ell}}{\psi}(1 - \epsilon) + \left(1 - \frac{g_{i\ell}}{\psi}\right)\epsilon. \tag{1.2}$$

The intuition behind including error is that we want to calculate the probability that we observe a read containing the reference allele. There are two ways that this can happen. (1) Reads are drawn from the reference allele(s) in the genotype with probability $g_{i\ell}/\psi$ but are only observed as reference reads if they are not errors (probability $1 - \epsilon$). (2) Similarly, reads from the non-reference allele(s) in the genotype are drawn with probability $1 - g_{i\ell}/\psi$ but can be mistakenly read as coming from a reference allele if an error occurs (probability $\epsilon$). The sum across these two possibilities gives the overall probability of observing a read containing the reference allele. If we also assume conditional independence of the sequencing reads given the genotypes, the joint probability distribution for sequencing reads is given by

$$P(\boldsymbol{R} \mid \boldsymbol{T}, \boldsymbol{G}, \epsilon) = \prod_{\ell=1}^{L} \prod_{i=1}^{N} P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon). \tag{1.3}$$

Since the $r_{i\ell}$'s are the data that we observe, the product of $P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon)$ across loci and individuals will form the likelihood in the model.
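The calculation in Eqs. 1.1 to 1.3 can be sketched in Python as follows. This is an illustration only and not the polyfreqs implementation: the function names are hypothetical, the matrices are assumed to be lists of rows (individuals by loci), and $\epsilon$ is assumed to lie strictly between 0 and 1 so that the logarithms are defined. Working on the log scale is a choice made here to avoid multiplying many small probabilities.

```python
import math

def read_prob(g, ploidy, eps):
    """Eq. 1.2: probability that a read shows the reference allele."""
    frac = g / ploidy                       # chance of drawing a reference copy
    return frac * (1.0 - eps) + (1.0 - frac) * eps

def read_loglik(r, t, g, ploidy, eps):
    """Log of Eq. 1.1 for one individual at one locus."""
    q = read_prob(g, ploidy, eps)
    return (math.log(math.comb(t, r))
            + r * math.log(q) + (t - r) * math.log(1.0 - q))

def total_loglik(R, T, G, ploidy, eps):
    """Log of Eq. 1.3: sum over individuals (rows) and loci (columns)."""
    return sum(read_loglik(r, t, g, ploidy, eps)
               for R_i, T_i, G_i in zip(R, T, G)
               for r, t, g in zip(R_i, T_i, G_i))
```

For a tetraploid with two reference copies, the two error terms cancel and `read_prob(2, 4, 0.01)` is 0.5 up to floating-point rounding, whereas a genotype of 0 gives a reference read probability equal to the error rate.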

The next level in the hierarchy is the conditional prior for genotypes. We model

each $g_{i\ell}$ as a Binomial random variable conditional on the ploidy level of the population and the frequency of the reference allele for locus $\ell$ ($p_\ell$):

$$P(g_{i\ell} \mid \psi, p_\ell) = \binom{\psi}{g_{i\ell}} p_\ell^{g_{i\ell}} (1 - p_\ell)^{\psi - g_{i\ell}}.$$

We also assume that the genotypes of the sampled individuals are conditionally independent given the allele frequencies, which is equivalent to taking a random sample from a population in Hardy-Weinberg equilibrium. Factoring the distribution for genotypes and taking the product across loci and individuals gives us the joint probability distribution of genotypes given the ploidy level of the population and the vector of allele frequencies at each locus ($\boldsymbol{p} = \{p_1, \ldots, p_L\}$):

$$P(\boldsymbol{G} \mid \psi, \boldsymbol{p}) = \prod_{\ell=1}^{L} \prod_{i=1}^{N} P(g_{i\ell} \mid \psi, p_\ell). \tag{1.4}$$

We choose here to ignore other factors that may be influencing the distribution of

genotypes such as double reduction. In general, double reduction will act to increase

homozygosity (Hardy, 2016). However, it is more prevalent for loci that are farther

away from the centromere, which makes the estimation of a global double reduction

parameter (typically denoted α) inappropriate for the thousands of loci gathered from

across the genome using techniques such as RADseq. It might be possible to estimate

a per locus rate of double reduction ($\alpha_\ell$) but this would add an additional parameter

that would need to be estimated for each locus, perhaps unnecessarily if the majority

end up being equal, or close, to 0.

The final level of the model is the prior distribution on allele frequencies. Assuming a priori independence across loci, we use a Beta distribution with parameters α and β

both equal to 1 as our prior distribution for each locus. A Beta(1,1) is equivalent to a

Uniform distribution over the interval (0, 1), making our choice of prior uninformative.

The joint posterior distribution of allele frequencies and genotypes is then equal to

the product across all loci and all individuals of the likelihood, the conditional prior

on genotypes and the prior distribution on allele frequencies up to a constant of

proportionality

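$$\begin{aligned} P(\boldsymbol{p}, \boldsymbol{G} \mid \boldsymbol{T}, \boldsymbol{R}, \epsilon) &\propto P(\boldsymbol{R} \mid \boldsymbol{T}, \boldsymbol{G}, \epsilon)\, P(\boldsymbol{G} \mid \psi, \boldsymbol{p})\, P(\boldsymbol{p}) \\ &= \prod_{\ell=1}^{L} \prod_{i=1}^{N} P(r_{i\ell} \mid t_{i\ell}, g_{i\ell}, \epsilon)\, P(g_{i\ell} \mid \psi, p_\ell)\, P(p_\ell). \end{aligned} \tag{1.5}$$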

The marginal posterior distribution for allele frequencies can be obtained by summing

over genotypes

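$$P(\boldsymbol{p} \mid \boldsymbol{T}, \boldsymbol{R}, \epsilon) \propto \sum_{\boldsymbol{G}} P(\boldsymbol{p}, \boldsymbol{G} \mid \boldsymbol{T}, \boldsymbol{R}, \epsilon). \tag{1.6}$$

It would also be possible to examine the marginal posterior distribution of genotypes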

but here we will focus primarily on allele frequencies.
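Because reads and genotypes are conditionally independent across individuals, the sum over genotypes in Eq. 1.6 factors into a separate sum over the $\psi + 1$ possible genotypes of each individual, so the marginal posterior at a single locus can be evaluated directly for small examples. The Python sketch below does this under the Beta(1, 1) prior, which contributes only a constant. The function names are hypothetical and the code illustrates the equations rather than reproducing polyfreqs, which uses MCMC instead of direct summation.

```python
import math

def binom_pmf(k, n, q):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * q**k * (1.0 - q)**(n - k)

def marginal_read_prob(r, t, p, ploidy, eps):
    """P(r | t, p, eps): Eq. 1.1 times the genotype prior, summed over genotypes."""
    total = 0.0
    for g in range(ploidy + 1):
        q = (g / ploidy) * (1.0 - eps) + (1.0 - g / ploidy) * eps   # Eq. 1.2
        total += binom_pmf(r, t, q) * binom_pmf(g, ploidy, p)
    return total

def log_marginal_posterior(p, reads, totals, ploidy, eps):
    """Unnormalized log of Eq. 1.6 at one locus (uniform prior on p)."""
    return sum(math.log(marginal_read_prob(r, t, p, ploidy, eps))
               for r, t in zip(reads, totals))
```

Evaluating `log_marginal_posterior` on a grid of allele frequencies between 0 and 1 gives a direct picture of the curve that the Gibbs sampler approximates for that locus.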

1.3.2 Full Conditionals and MCMC Using Gibbs Sampling

We estimate the joint posterior distribution for allele frequencies and genotypes in

Eq. 1.5 using MCMC. This is done using Gibbs sampling of the states (p, G) in a

Markov chain by alternating samples from the full conditional distributions of p and

G. Given the setup for our model using Binomial and Beta distributions (which form

a conjugate family), analytical solutions for these distributions can be readily acquired

(Gelman et al., 2014). The full conditional distribution for allele frequencies is Beta

distributed and is given by Eq. 1.7 below:

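$$p_\ell \mid g_{i\ell}, r_{i\ell}, \epsilon \sim \text{Beta}\left(\alpha = \sum_{i=1}^{N} g_{i\ell} + 1,\; \beta = \sum_{i=1}^{N} (\psi - g_{i\ell}) + 1\right), \quad \text{for } \ell = 1, \ldots, L. \tag{1.7}$$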

This full conditional distribution for $p_\ell$ has a natural interpretation as it is roughly centered at the number of sampled alleles carrying the reference allele divided by the total number of alleles sampled. The “+1” comes from the prior distribution and will not have a strong influence on the posterior when the sample size is large.
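A draw from the full conditional in Eq. 1.7 can be sketched in a few lines of Python. The function name is hypothetical and the code is an illustration rather than the polyfreqs implementation; it assumes the genotypes for one locus are supplied as a list with one entry per individual.

```python
import random

def sample_allele_freq(genotypes, ploidy, rng=random):
    """Draw p for one locus from the Beta full conditional in Eq. 1.7."""
    ref = sum(genotypes)                      # reference allele copies sampled
    alt = ploidy * len(genotypes) - ref       # non-reference allele copies sampled
    return rng.betavariate(ref + 1, alt + 1)  # the +1 terms come from the Beta(1, 1) prior
```

The mean of this Beta distribution is (ref + 1) / (ψN + 2), which approaches the sample proportion ref / (ψN) as the number of sampled alleles grows.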

The full conditional distribution for genotypes is a discrete categorical distribution

over the possible values for the genotypes ($0, \ldots, \psi$). The distribution for individual $i$ at locus $\ell$ is, up to a normalizing constant,

$$P(g_{i\ell} \mid g_{(-i)\ell}, p_\ell, r_{i\ell}, \epsilon) \propto \binom{t_{i\ell}}{r_{i\ell}} g_{\epsilon}^{r_{i\ell}} (1 - g_{\epsilon})^{t_{i\ell} - r_{i\ell}} \times \binom{\psi}{g_{i\ell}} p_\ell^{g_{i\ell}} (1 - p_\ell)^{\psi - g_{i\ell}}, \tag{1.8}$$

where $g_{(-i)\ell}$ is the value of the genotypes for all sampled individuals excluding individual $i$ and $g_{\epsilon}$ is the same as in Eq. 1.2. The full conditional distribution for genotypes can be seen as the product of two quantities: (1) the probability of each of the possible

genotypes based on the observed reference reads and (2) the probability of drawing

each genotype given the allele frequency for that locus in the population.
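The genotype update can be sketched as follows. Because the product in Eq. 1.8 has to be normalized over the $\psi + 1$ possible genotypes before it can be sampled from, the sketch computes one weight per genotype and divides by their sum. The function names are hypothetical, the code is not the polyfreqs implementation, and it assumes read counts small enough that the weights do not all underflow to zero.

```python
import math
import random

def genotype_probs(r, t, p, ploidy, eps):
    """Normalized full conditional over genotypes 0..ploidy (Eq. 1.8)."""
    weights = []
    for g in range(ploidy + 1):
        q = (g / ploidy) * (1.0 - eps) + (1.0 - g / ploidy) * eps   # Eq. 1.2
        # The binomial coefficient C(t, r) does not depend on g, so it is
        # dropped here and absorbed by the normalization below.
        like = q**r * (1.0 - q)**(t - r)
        prior = math.comb(ploidy, g) * p**g * (1.0 - p)**(ploidy - g)
        weights.append(like * prior)
    total = sum(weights)
    return [w / total for w in weights]

def sample_genotype(r, t, p, ploidy, eps, rng=random):
    """Draw one genotype from the categorical distribution above."""
    probs = genotype_probs(r, t, p, ploidy, eps)
    return rng.choices(range(ploidy + 1), weights=probs, k=1)[0]
```

Alternating this draw for every individual and locus with a Beta draw from Eq. 1.7 for every locus gives one generation of the Gibbs sampler described in this section.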

We begin our Gibbs sampling algorithm in a random position in parameter space

through the use of uniform probability distributions. The genotype matrix is initialized with random draws from a Discrete Uniform distribution ranging from 0 to ψ and the

initial allele frequencies are drawn from a Uniform distribution on the interval [0,1].

1.3.3 Simulation Study

Simulations were performed to assess error rates in allele frequency estimation for

tetraploid, hexaploid, and octoploid populations (ψ = 4, 6, and 8, respectively). Data were generated under the model by sampling genotypes from a Binomial distribution

conditional on a fixed, known allele frequency ($p_\ell$ = 0.01, 0.05, 0.1, 0.2, 0.4). Total

read counts were simulated for a single locus using a Poisson distribution with mean

coverage equal to 5, 10, 20, 50 or 100 reads per individual. We then sampled the

number of sequencing reads containing the reference allele from a Binomial distribution

conditional on the number of total reads, the genotype, and sequencing error (Eq. 1.1;

$\epsilon$ fixed to 0.01). Finally, we varied the number of individuals sampled per population

(N = 5, 10, 20, 30) and ran all possible combinations of the simulation settings. Our

choice for the number of individuals to simulate was intended to reflect sampling within a single population/locality and not that of an entire population genetics study.

Furthermore, RAD sequencing is used at various taxonomic levels from population

genetics to phylogenetics (e.g., Rheindt et al., 2014; Eaton et al., 2015), and we wanted

our simulations to be informative across these applications. Each combination of

sequencing coverage, individuals sampled, and allele frequency was analyzed using

100 replicates for tetraploid, hexaploid, and octoploid populations for a total of

30,000 simulation runs. MCMC analyses using Gibbs sampling were run for 100,000

generations with parameter values stored every 100th generation. The first 25% of

the sample was discarded as burn-in, resulting in 750 posterior samples for each

replicate. Convergence on the stationary distribution, $P(\boldsymbol{p}, \boldsymbol{G} \mid \boldsymbol{T}, \boldsymbol{R}, \epsilon)$, was assessed

by examining trace plots for a subset of runs for each combination of settings and

ensuring that the effective sample sizes (ESS) were greater than 200. Deviations from

the known underlying allele frequency used to simulate each data set were assessed by

taking the posterior mean of each replicate and calculating the root mean squared

error (RMSE) based on the true underlying value. We also compared the posterior

mean as an estimate of the allele frequency at a locus to a simpler estimate calculated directly from the read counts (mean read ratio): $\frac{1}{N} \sum_{i} \frac{r_{i\ell}}{t_{i\ell}}$. Comparisons between estimates were again made using the RMSE.
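The data-generating steps and the two summary quantities described above can be sketched in Python as follows. This is not the R code used for the study: the function names are hypothetical, the Poisson sampler is a simple textbook method chosen to keep the example dependency-free, and skipping individuals with zero total reads when averaging ratios is an assumption, since the handling of such individuals is not stated in the text.

```python
import math
import random

def rbinom(n, q, rng):
    """Binomial(n, q) draw as a sum of Bernoulli trials."""
    return sum(rng.random() < q for _ in range(n))

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for means up to 100)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_locus(n_ind, ploidy, p, coverage, eps, rng):
    """Simulate (genotype, total reads, reference reads) for each individual."""
    data = []
    for _ in range(n_ind):
        g = rbinom(ploidy, p, rng)                      # genotype given p
        t = rpois(coverage, rng)                        # total read count
        q = (g / ploidy) * (1.0 - eps) + (1.0 - g / ploidy) * eps
        data.append((g, t, rbinom(t, q, rng)))          # reference reads (Eq. 1.1)
    return data

def mean_read_ratio(data):
    """Average of r / t over individuals with at least one read (assumption)."""
    ratios = [r / t for _, t, r in data if t > 0]
    return sum(ratios) / len(ratios)

def rmse(estimates, truth):
    """Root mean squared error of replicate estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))
```

For example, calling `simulate_locus(30, 4, 0.2, 20, 0.01, random.Random(42))` one hundred times and passing the resulting mean read ratios to `rmse` with a true value of 0.2 mirrors one cell of the simulation design for the read-ratio estimator.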

All simulations were performed using the R statistical programming language (R

Core Team, 2014) on the Oakley cluster at the Ohio Supercomputer Center (https://osc.edu).

Figures were generated using the R packages ggplot2 (Wickham, 2009)

and reshape (Wickham, 2007), with additional figure manipulation completed using

Inkscape (https://inkscape.org). MCMC diagnostics were done using the coda

package (Plummer et al., 2006). All scripts are available on GitHub

(https://github.com/pblischak/polyfreqs-ms-data) in the ‘code/’ folder and all simulated data

sets are in the ‘raw_data/’ folder.

1.3.4 Example Analyses of Autotetraploid Potato (Solanum tuberosum)

To further evaluate the model and to demonstrate its use we present an example analysis

using an empirical data set collected for autotetraploid potato (Solanum tuberosum)

using the Illumina GoldenGate platform (Anithakumari et al., 2010; Voorrips et al.,

2011). Though these data are not the typical reads returned by RADseq experiments,

they still represent the same type of binary response data that our model uses to get a

probability distribution for biallelic SNP genotypes. A detailed walkthrough with the

code used for each step is provided in Appendix A. The data set and output are also

available on GitHub (https://github.com/pblischak/polyfreqs-ms-data) in the

‘example/’ folder.

Calculating Expected and Observed Heterozygosity

One advantage of using a Bayesian framework for our model is that we can approximate

a posterior distribution for any quantity that is a functional transformation of the

parameters that we are estimating without doing any additional MCMC simulation

(Gelman et al., 2014). Two such quantities that are often used in population genetics

are the observed and expected heterozygosity, which are in turn used for calculating

the various fixation indices ($F_{IS}$, $F_{IT}$, $F_{ST}$) introduced by Wright (1951). To analyze

levels of heterozygosity in this way, we used the estimators of Hardy (2016) to calculate

the per locus observed ($H_o$) and expected ($H_e$) heterozygosity for each stored sample

of the joint posterior distribution in Eq. 1.5. This procedure is especially useful

because it estimates heterozygosity while taking into account ADU by utilizing the

marginal posterior distribution of genotypes. Given a total of M posterior samples of

genotypes and allele frequencies, we calculate the $m$th ($m = 1, \ldots, M$) estimate of the observed heterozygosity using Eq. 1.9 [numerator of Eq. 7 in Hardy (2016)]:

$$H_o^{[m]} = \frac{1}{N} \sum_{i} h_i^{[m]} = \frac{1}{N} \sum_{i} \frac{g_{i\ell}^{[m]} \left(\psi - g_{i\ell}^{[m]}\right)}{\binom{\psi}{2}}. \tag{1.9}$$

Similarly, the $m$th estimate of the expected heterozygosity is calculated using Eq. 1.10 [denominator of Eq. 8 in Hardy (2016)]:

$$H_e^{[m]} = \frac{N}{N - 1} \left[ 1 - \left(p_\ell^{[m]}\right)^2 - \left(1 - p_\ell^{[m]}\right)^2 - \frac{\psi - 1}{\psi N^2} \sum_{i} h_i^{[m]} \right]. \tag{1.10}$$

The posterior distribution of a multi-locus estimate of heterozygosity can then be

approximated by taking the average across loci for each of the per locus posterior

samples.
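Eqs. 1.9 and 1.10 translate into a short calculation for each posterior sample. The Python sketch below uses hypothetical function names and assumes that the genotypes for one locus and one posterior sample are given as a list with one entry per individual; it is an illustration of the formulas as written here, not the polyfreqs code.

```python
import math

def individual_het(g, ploidy):
    """h_i: fraction of the C(ploidy, 2) allele pairs in a genotype that differ."""
    return g * (ploidy - g) / math.comb(ploidy, 2)

def observed_het(genotypes, ploidy):
    """Eq. 1.9 for one locus and one posterior sample."""
    return sum(individual_het(g, ploidy) for g in genotypes) / len(genotypes)

def expected_het(genotypes, p, ploidy):
    """Eq. 1.10 for one locus and one posterior sample (requires N > 1)."""
    n = len(genotypes)
    h_sum = sum(individual_het(g, ploidy) for g in genotypes)
    inner = 1.0 - p**2 - (1.0 - p)**2 - (ploidy - 1) / (ploidy * n**2) * h_sum
    return n / (n - 1) * inner
```

Applying both functions to every stored sample at every locus, and then averaging across loci within each sample, yields the posterior samples of the multi-locus estimates.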

To evaluate levels of heterozygosity in autotetraploid potato, we obtained biallelic

count data for 224 accessions collected at 384 loci using the Illumina GoldenGate

platform from the R package fitTetra (Voorrips et al., 2011), which provides the

data set as part of the package. We chose the ‘X’ reading to be the count data for

the reference allele and added the ‘X’ and ‘Y’ readings together to get the total read

counts (‘X’ and ‘Y’ represent the counts of the two alternative alleles). Initial attempts

to analyze the data set using our Gibbs sampling algorithm were unsuccessful due to

arithmetic underflow. This was due to the fact that the counts/intensities returned by

the Illumina GoldenGate platform are on a different scale (∼10,000-20,000+) than

the read counts that would be expected from a RADseq experiment. To alleviate this

problem, we rescaled the data set while preserving the relative dosage information by

dividing the GoldenGate count readings by 100 and rounding to the nearest whole

number. We then analyzed the rescaled count data using 100,000 MCMC generations,

sampling every 100 generations and using the stored samples of the allele frequencies

and genotypes to calculate the observed and expected heterozygosity for a total of

1,000 posterior samples of the per locus observed and expected heterozygosity. We also

compared post burn-in (25%) allele frequency estimates based on the posterior mean

to the simple allele frequency estimate based directly on read counts used previously

(mean read ratio). Posterior distributions for multi-locus estimates of observed and

expected heterozygosity were obtained by taking the average across loci for each

posterior sample of the per locus estimates using a burn-in of 25%.
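The underflow problem and the effect of rescaling can be illustrated with a small Python example. The readings below are hypothetical values on the GoldenGate scale, not values from the potato data set, and the function name is an assumption. Rounding halves upward is also an assumption, since the text does not say how ties were rounded.

```python
def rescale_counts(counts, divisor=100):
    """Divide by `divisor` and round to the nearest whole number (halves round up)."""
    return [int(c / divisor + 0.5) for c in counts]

# Hypothetical readings for one individual: reference ('X') and total ('X' + 'Y').
x_raw, total_raw = 7500, 15000
q = 0.5  # read probability for a balanced heterozygote (Eq. 1.2 with g = psi / 2)

# Genotype-dependent part of Eq. 1.1 on the raw scale: too small for a double.
naive = q**x_raw * (1.0 - q)**(total_raw - x_raw)

# The same quantity after rescaling is small but representable.
x, total = rescale_counts([x_raw, total_raw])
rescaled = q**x * (1.0 - q)**(total - x)
```

On the raw scale the product evaluates to exactly zero in double precision, so the genotype weights can no longer be compared or normalized, while the rescaled counts keep the same reference-to-total ratio and give a nonzero value.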

Evaluating Model Adequacy

As noted earlier, the probability model that we use for the inheritance of alleles is one

of polysomy without double reduction. In some cases, this model may be inappropriate.

Therefore, it can be informative to check for loci that do not follow the model that we assume. Below we describe a procedure for rejecting our model of inheritance

on a per locus basis using comparisons with the posterior predictive distribution of

sequencing reads. Model checking is an important part of making statistical inferences

and can play a role in understanding when a model adequately describes the data

being analyzed. In the case of our model, it can serve as a basis for understanding

the inheritance patterns of the organism being studied by determining which loci

adhere to a simple pattern of polysomic inheritance. Other sources of disequilibrium

that could indicate poor model fit include inbreeding, null alleles, and allele drop out

(sensu Arnold et al., 2013), making this posterior predictive model check more broadly

applicable for RADseq data.

Given $M$ posterior samples for the allele frequencies at locus $\ell$, $\{p_\ell^{[1]}, p_\ell^{[2]}, \ldots, p_\ell^{[M]}\}$, we simulate new values for the genotypes ($\tilde{g}_{i\ell}$) and reference read counts ($\tilde{r}_{i\ell}$) for all individuals and use the ratio of simulated reference read counts to observed total read counts ($\tilde{r}_{i\ell}/t_{i\ell}$) as a summary statistic for comparing the observed read count ratios to the distribution of the predicted read count ratios. The use of the likelihood (or similar

quantities) as a summary statistic has been a common practice in posterior predictive

comparisons of nucleotide substitution models, and more recently for comparative

phylogenetics (Ripplinger and Sullivan, 2010; Reid et al., 2014; Pennell et al., 2015).

We use the ratio of reference to total read counts here because it is the maximum

likelihood estimate of the probability of success for a Binomial random variable and

because it is a simple quantity to calculate. The use of other summary statistics, or a

combination of multiple summary statistics, would also be possible. The procedure

for our posterior predictive model check is as follows:

1. For locus $\ell = 1, \ldots, L$:

1.1. For posterior sample $m = 1, \ldots, M$:

1.1.1. Simulate new genotype values $\tilde{g}_{i\ell}^{[m]}$ for all individuals ($i = 1, \ldots, N$) by drawing from a Binomial$(\psi, p_\ell^{[m]})$.

1.1.2. Simulate new reference read counts $\tilde{r}_{i\ell}^{[m]}$ from each new genotype for all individuals by drawing from Eq. 1.1.

1.1.3. Calculate the reference read ratio for the simulated data for sample $m$ and sum across individuals: $\tilde{S}_\ell^{[m]} = \sum_{i=1}^{N} \tilde{r}_{i\ell}^{[m]} / t_{i\ell}$.

1.1.4. Calculate the reference read ratio for the observed data and sum across individuals: $S_\ell = \sum_{i=1}^{N} r_{i\ell} / t_{i\ell}$.

1.2. Calculate the difference between the observed reference read ratio and the $M$ simulated reference read ratios: $\{S_\ell - \tilde{S}_\ell^{[1]}, \ldots, S_\ell - \tilde{S}_\ell^{[M]}\}$.

2. Determine if the 95% highest posterior density (HPD) interval of the distribution of re-centered reference read ratios contains 0.
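The steps above can be sketched for a single locus in Python. The function names are hypothetical and this is not the polyfreqs implementation. Two assumptions are made: every individual has at least one read at the locus, and the HPD interval is estimated as the shortest interval containing 95% of the sampled differences, since the text does not specify the estimator. The observed sum is computed once because it does not depend on the posterior sample.

```python
import math
import random

def rbinom(n, q, rng):
    """Binomial(n, q) draw as a sum of Bernoulli trials."""
    return sum(rng.random() < q for _ in range(n))

def hpd_interval(values, mass=0.95):
    """Shortest interval covering `mass` of the sampled values."""
    xs = sorted(values)
    k = math.ceil(mass * len(xs))
    start = min(range(len(xs) - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def predictive_check(p_samples, reads, totals, ploidy, eps, rng):
    """Return True if the 95% HPD interval of the differences contains 0."""
    observed = sum(r / t for r, t in zip(reads, totals))      # step 1.1.4
    diffs = []
    for p in p_samples:                                       # step 1.1
        simulated = 0.0
        for t in totals:
            g = rbinom(ploidy, p, rng)                        # step 1.1.1
            q = (g / ploidy) * (1.0 - eps) + (1.0 - g / ploidy) * eps
            simulated += rbinom(t, q, rng) / t                # steps 1.1.2 and 1.1.3
        diffs.append(observed - simulated)                    # step 1.2
    low, high = hpd_interval(diffs)                           # step 2
    return low <= 0.0 <= high
```

A return value of `False` flags the locus as a poor fit to the assumed model; looping over loci (step 1) is left to the caller.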

When the distribution of the differences in ratios between the observed and

simulated data sets does not contain 0 in the 95% HPD interval, it provides evidence

that the locus being examined does not follow a pattern of strict polysomic inheritance.

A similar approach could be used on an individual basis by comparing the observed

ratio of reference reads to the predicted ratios for each individual at each locus. We

used this posterior predictive model checking procedure to assess model adequacy in

the potato data set using the posterior distribution of allele frequencies estimated in

the previous section with 25% of the samples discarded as burn-in.

1.4 Results

Our Gibbs sampling algorithm was able to accurately estimate allele frequencies for a

number of simulation settings while simultaneously allowing for genotype uncertainty.

There were no indications of a lack of convergence (ESS values > 200) for any of the

simulation replicates and all trace plots examined also indicated that the Markov

chain had reached stationarity. Running the MCMC for 100,000 generations and

sampling every 100th generation appeared to be suitable for our analyses and we

recommend it as a starting point for running most data sets. Reducing the number of

generations and sampling more frequently (e.g., 50,000 generations sampled every 50

generations) could be a potential workaround for larger data sets. When doing test

runs we went as low as 20,000 generations sampled every 20th generation, which still

passed our diagnostic tests for convergence. This is likely because the parameter space

of our model is not overly difficult to navigate so stationarity is reached rather quickly.

Ultimately, the deciding factor on how long to run the analysis and how frequently to

sample the chain will come down to assessing convergence.

1.4.1 Simulation Study

Increasing the number of individuals sampled had the largest effect on the accuracy

of allele frequency estimation (Figure 1.1). Since allele frequencies are population

parameters, it is not surprising that sampling more individuals from the population

leads to better estimates. This appears to be the case even when sequencing coverage

is quite low (5x, 10x), which corroborates the observations made by Buerkle and

Gompert (2013). This is not to say, however, that sequencing coverage has no effect

on the posterior distribution of allele frequencies. Lower sequencing coverage affects

the posterior distribution by increasing the posterior standard deviation (Figure 1.2).

An interesting pattern that emerged during the simulation study is the observation

that the allele frequencies closer to 0.5 tend to have higher error rates, which is to

be expected given that the variance of a Binomial random variable is highest when

the probability of success is 0.5. We also observed small differences in the RMSE

between ploidy levels, with estimates increasing in accuracy with increasing ploidy.

Comparisons between the posterior mean and mean read ratio estimates of allele

frequencies (Figure A.1) show that the estimate based on read ratios has a lower

RMSE than the posterior mean when the true allele frequency is low ($p_\ell$ = 0.01, 0.05)

but has higher error rates than the posterior mean for allele frequencies closer to 0.5.

When sequencing coverage is greater than 10x and the number of individuals sampled

is greater than 20, the two estimates are almost indistinguishable.

Figure 1.1: Error in allele frequency estimation as measured by the RMSE of posterior means. Columns represent the different allele frequencies used to simulate read data (0.01, 0.05, 0.1, 0.2, 0.4) and rows represent the number of individuals sampled from the population (5, 10, 20, 30). Each individual plot shows the RMSE of the estimates for each ploidy level (tetra, hex, octo) across the different levels of coverage (5x, 10x, 20x, 50x, 100x). The best scenario is in the bottom left with 30 individuals sampled and an allele frequency of 0.01. The worst scenario is in the upper right corner with 5 individuals sampled and an allele frequency of 0.4. Looking across rows shows that error increases as allele frequencies get closer to 0.5. Looking up and down columns shows that error increases as the number of individuals decreases. Within each plot, increasing sequence coverage does not have as large of an effect on error, and differences in ploidy show that error decreases as ploidy increases.

Figure 1.2: The posterior standard deviation for allele frequencies decreases as sequencing coverage increases. This plot provides a comparison of the distribution of the posterior standard deviations of the 100 replicates performed for each level of sequencing coverage (5x, 10x, 20x, 50x, 100x) for the hexaploid simulation with 30 individuals sampled from the population and an allele frequency of 0.2.

Figure 1.3: Posterior distributions of the multilocus estimates of expected and observed heterozygosity in Solanum tuberosum. The observed heterozygosity is higher than the expected, consistent with a pattern of excess outbreeding.

1.4.2 Example Analyses

Our analyses of Solanum tuberosum tetraploids showed levels of heterozygosity consistent with a pattern of excess outbreeding ($H_o > H_e$). In fact, the posterior distributions of the multi-locus estimates of observed and expected heterozygosity do not overlap at all (Figure 1.3). The assessment of model adequacy also showed that 49 out of the 384 loci (∼13%) were a poor fit to the model of polysomic inheritance that we assume. The posterior mean and the mean read ratio provided similar allele frequency estimates for most loci. For loci in which the frequency of the reference allele is very low, the read ratio estimate tends to be higher than the posterior mean. However, the overall pattern does not indicate over- or underestimation for most allele frequencies (Figure A.2). When we took the difference between the estimates at each locus, the distribution was centered near 0 (Figure A.3).

1.5 Discussion

The inference of population genetic parameters and the demographic history of non-

model polyploid organisms has consistently lagged behind that of diploids. The

difficulties associated with these inferences present themselves at two levels. The first

of these is the widely known inability to determine the genotypes of polyploids due

to ADU. Even though there have been theoretical developments in the description

of models for polyploid taxa as early as the 1930s, a large portion of this population

genetic theory relies on knowledge about individuals’ genotypes (e.g., Haldane, 1930;

Wright, 1938). The second complicating factor is the complexity of inheritance

patterns and changes in mating systems that often accompany WGD events. Polyploid

organisms can sometimes mate by both outcrossing and selfing, and can display mixed

inheritance patterns at different loci in the genome (Dufresne et al., 2014). If genotypes were known, then it might be easier to develop and test models for dealing with and

inferring rates of selfing versus outcrossing, as well as understanding inheritance

patterns across the genome. However, ADU only compounds the problems associated with these inferences, making the development and application of appropriate models

far more difficult (but see list of software in Dufresne et al., 2014). The model we have

presented here deals with the first of these two issues by not treating genotypes as

observed quantities. Almost all other methods of genotype estimation for polyploids

treat the genotype as the primary parameter of interest. Our model is different in

that we still use the read counts generated by high-throughput sequencing platforms

as our observed data but instead integrate across genotype uncertainty when inferring

other parameters, thus bypassing the problems caused by ADU.

Despite our focus on bypassing ADU, an important consideration for the model we

present here is that, because it approximates the joint posterior distribution of allele

frequencies and genotypes, it would also be possible to use the marginal posterior

distribution of genotypes to make inferences using existing methods. This could be

done using the posterior mode as a maximum a posteriori (MAP) estimate of the

genotype for downstream analyses, followed by analyzing the samples taken from the

marginal posterior distribution of genotypes. The resulting set of estimates would not

constitute a “true” posterior distribution of downstream parameters but would allow

researchers to interpret their results based on the MAP estimate of the genotypes while still getting a sense for the amount of variation in their estimates. Using the

marginal posterior distribution of genotypes in this way could technically be applied

to any type of polyploid, but is only really appropriate for autopolyploids due to the

model of inheritance that is used. Other methods for estimating SNP genotypes from

high-throughput sequencing data include the program SuperMASSA, which models

the relative intensity of the two alternative alleles using Normal densities (Serang et al., 2012).

A second important factor for using our model is that, although estimates of allele

frequencies can be accurate when sequencing coverage is low and sample sizes are

large (see Figure A.4 for a direct comparison between sample size and coverage), the

resulting distribution for genotypes is likely going to be quite diffuse. For analyses that

treat genotypes as a nuisance parameter, this is not an issue since we can integrate

across genotype uncertainty. However, if the genotype is of primary interest, then the

experimental design of the study will need to change to acquire higher coverage at

each locus for more accurate genotype estimation. Therefore, the decision between

sequencing more individuals with lower average coverage versus sequencing fewer

individuals with higher average coverage depends primarily on whether the genotypes will be used or not.

Extensibility

The modular nature of our hierarchical model can allow for the addition and modifica-

tion of levels in the hierarchy. One of the simplest extensions to the model that can

build directly on the current setup would be to consider loci with more than two alleles.

This can be done using Multinomial distributions for sequencing reads and genotypes,

and a Dirichlet prior on allele frequencies (the Multinomial and Dirichlet distributions

form a conjugate family; Gelman et al., 2014). We could also model populations

of mixed ploidy by using a vector of individually assigned ploidy levels instead of

assuming a single value for the whole population ($\psi = \{\psi_1, \ldots, \psi_N\}$). However, this would assume random mating among ploidy levels.
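For the multi-allelic extension, the analogue of the Beta update in Eq. 1.7 would be a Dirichlet draw. The Python sketch below is a hypothetical illustration of that conjugate update and is not part of polyfreqs. It assumes a uniform Dirichlet(1, ..., 1) prior by analogy with the Beta(1, 1) prior, and takes as input the total number of copies of each allele summed over the current genotypes at a locus.

```python
import random

def sample_allele_freqs(allele_counts, rng=random):
    """Draw allele frequencies from Dirichlet(counts + 1) via normalized Gamma draws."""
    gammas = [rng.gammavariate(c + 1, 1.0) for c in allele_counts]
    total = sum(gammas)
    return [x / total for x in gammas]
```

With two alleles this reduces to the Beta full conditional already used for the biallelic model.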

Double Reduction

The inclusion of double reduction into the model is a difficult consideration for genome wide data collected using high-throughput sequencing platforms. The number of

parameters estimated by our model is L × (N + 1) and including double reduction would add an additional L parameters, bringing the total to L × (N + 2). Though the

addition of these parameters would not prohibit an analysis using Gibbs sampling, we chose to implement the simpler equilibrium model. We hope to include double

reduction in future models but feel that our posterior predictive model checking

procedure will prove sufficient for identifying loci in disequilibrium with our current

implementation. Another concern that we had regarding double reduction is that it

can be confounded with the overall signal of inbreeding, making it especially difficult

to tease apart the specific effects of double reduction alone (Hardy, 2016). However,

because the probability of double reduction at a locus ($\alpha_\ell$) depends on its distance

from the centromere (call it $x_\ell$), a potential way to estimate $\alpha_\ell$ would be to use the

$x_\ell$'s as predictor variables in a linear model: $\alpha_\ell = \beta_0 + \beta_1 x_\ell$. This would only add

two additional parameters ($\beta_0$ and $\beta_1$) that would need to be estimated and would

be completely independent of the number of loci analyzed. The downside to this

approach is that it would only be applicable for polyploid organisms with sequenced

genomes (or the genome of a diploid progenitor), making the use of such a model

impractical for the time being.

Additional Levels in the Hierarchical Model

The place where we believe our model could have the greatest impact is through

modifications and extensions of the probability model used for the inheritance of alleles.

These models have been difficult to apply in the past as a result of genotype uncertainty.

However, using our model as a starting point, it could be possible to infer patterns

of inheritance (polysomy, disomy, heterosomy) and other demographic parameters

(e.g., effective population size, population differentiation) without requiring direct

knowledge about the genotypes of the individuals in the population. For example,

Haldane’s (1930) model of genotype frequencies for autopolyploids that are partially

selfing could be used to infer the prevalence of self-fertilization within a population.

Another possible approach would be to use general disequilibrium coefficients ($D_A$)

to model departures from Hardy-Weinberg equilibrium (Hernández and Weir, 1989;

Weir, 1996). A more recent model described by Stift et al. (2008) used microsatellites

to infer the different inheritance patterns (disomic, tetrasomic, intermediate) for

tetraploids in the genus Rorippa (Brassicaceae) following crossing experiments. The

reformulation of such a model for biallelic SNPs gathered using high-throughput

sequencing could provide a suitable framework for understanding inheritance patterns

across the genome. An ideal model would be one that could help to understand

genome-wide inheritance patterns for a polyploid of arbitrary formation pathway

(autopolyploid ↔ allopolyploid) without the need to conduct additional experiments.

However, to our knowledge, such a model does not currently exist.

1.6 Conclusions

The recent emergence of models for genotype uncertainty in diploids has introduced

a theoretical framework for dealing with the fact that genotypes are unobserved

quantities (Gompert and Buerkle, 2012; Buerkle and Gompert, 2013). Our extension

of this theory to cases of higher ploidy (specifically to autopolyploids) progresses

naturally from the original work but also serves to alleviate the deeper issue of ADU.

The power and flexibility of these models as applied at the diploid level has the

potential to be replicated for polyploid organisms with the addition of suitable models

for allelic inheritance. The construction of hierarchical models containing probability

models for ADU, allelic inheritance, and perhaps even additional levels for important

parameters such as F-statistics or the allele frequency spectrum also has the potential

to provide key insights into the population genetics of polyploids (Gompert and

Buerkle, 2011a; Buerkle and Gompert, 2013). Future work on such models will help

to progress the study of polyploid taxa and could eventually lead to more generalized

models for understanding the processes that have shaped their evolutionary histories.

1.7 Software Note

We have combined the scripts for our Gibbs sampler into an R package, polyfreqs, which is available on GitHub

(https://github.com/pblischak/polyfreqs). Though polyfreqs is written in R, it deals with the large data sets that

are generated by high-throughput sequencing platforms in two ways. First, it takes

advantage of R’s ability to incorporate C++ code via the Rcpp and RcppArmadillo

packages, allowing for a faster implementation of our MCMC algorithm (Eddelbuettel

and François, 2011; Eddelbuettel, 2013; Eddelbuettel and Sanderson, 2014). Second,

since the model assumes independence between loci, polyfreqs can facilitate the

process of parallelizing analyses by splitting the total read count and reference read

count matrices into subsets of loci which can be analyzed at the same time on separate

nodes of a computing cluster. Additional features of the program include

• Estimation of posterior distributions of per-locus observed and expected heterozygosity (het_obs and het_exp, respectively).

• Maximum a posteriori (posterior mode) estimation of genotypes using the

get_map_genotypes() function.

• Posterior predictive model checking using the polyfreqs_pps() function.

• Simulation of high-throughput sequencing read counts and genotypes from user-specified allele frequencies using the sim_reads() function.

• Options for controlling program output such as writing genotype samples to file,

printing MCMC updates to the R console, etc.

• Simple input format using tab-delimited text files that can be directly imported

into R using the read.table() function. The format is as follows:

1. An optional row of locus names (use header=TRUE to specify this in

read.table()).

2. One row for each individual.

3. First column contains individual names (use row.names=1 to specify this

in read.table()).

4. One column for each locus.
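The input layout and the splitting of loci for parallel runs can be illustrated with a short Python sketch. It is not part of polyfreqs: the function names `read_count_matrix` and `split_loci` are hypothetical, and the sketch assumes integer counts with no missing-data codes. It parses a tab-delimited matrix with an optional header row of locus names and individual names in the first column, then divides the loci into contiguous blocks that could be analyzed separately.

```python
import csv
import io


def read_count_matrix(handle, header=True):
    """Parse a tab-delimited count matrix: rows are individuals, columns are loci."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    loci = None
    if header:
        loci, rows = rows[0], rows[1:]
        # Assumption: if the header is as long as a data row, its first field
        # labels the individual-name column and is dropped.
        if rows and len(loci) == len(rows[0]):
            loci = loci[1:]
    names = [row[0] for row in rows]
    counts = [[int(x) for x in row[1:]] for row in rows]
    if loci is None:
        loci = [f"locus{j + 1}" for j in range(len(counts[0]))]
    return names, loci, counts


def split_loci(counts, n_chunks):
    """Split the columns (loci) of a matrix into n_chunks contiguous blocks."""
    n_loci = len(counts[0])
    bounds = [k * n_loci // n_chunks for k in range(n_chunks + 1)]
    return [[row[lo:hi] for row in counts] for lo, hi in zip(bounds, bounds[1:])]


# Toy input with two individuals and three loci (made-up values).
example = "loc1\tloc2\tloc3\nind1\t10\t4\t7\nind2\t8\t0\t12\n"
names, loci, total_reads = read_count_matrix(io.StringIO(example))
chunks = split_loci(total_reads, 2)
```

Because the model treats loci as independent, each block in `chunks` could be passed to a separate run and the per-locus results concatenated afterwards.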

1.8 Acknowledgements

The authors would like to thank the Ohio Supercomputer Center for access to computing

resources and Nick Skomrock for assistance with deriving the full conditional

distributions of the model in the diploid case. We would also like to thank Frederic

Austerlitz, Aaron Wenzel, members of the Wolfe and Kubatko labs, and 3 anonymous

reviewers for their helpful comments on the manuscript. This work was partially

funded through a grant from the National Science Foundation (DEB-1455399) to

ADW and LSK.

1.9 Author Contributions

Conceived of the study: PDB, LSK, and ADW. PDB derived the polyploid model,

ran the simulations and other analyses, coded the R package, and wrote the initial

draft of the manuscript. PDB, LSK, and ADW reviewed all parts of the manuscript

and all authors approved of the final version.

1.10 Data Accessibility

Scripts for simulating the data sets, analyzing them using Gibbs sampling, and

producing the figures from the resulting output can all be found on GitHub, along with the original simulated data sets and autotetraploid potato data

(https://github.com/pblischak/polyfreqs-ms-data). We also provide an implementation

of the Gibbs sampler for estimating allele frequencies in the R package polyfreqs

(https://github.com/pblischak/polyfreqs). See the package vignette or GitHub wiki for more details (https://github.com/pblischak/polyfreqs/wiki).

Chapter 2: SNP Genotyping and Parameter Estimation in Polyploids from Low-Coverage Sequencing Data

Publication Information

This chapter is formatted for this dissertation from the following publication:

Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. SNP Genotyping and Parameter

Estimation in Polyploids from Low-Coverage Sequencing Data. Bioinformatics, 34:407–

415, 2018.

2.1 Abstract

Motivation: Genotyping and parameter estimation using high-throughput sequencing data are everyday tasks for population geneticists, but methods developed for diploids are typically not applicable to polyploid taxa. This is due to their duplicated chromosomes, as well as the complex patterns of allelic exchange that often accompany whole genome duplication (WGD) events. For WGDs within a single lineage (autopolyploids), inbreeding can result from mixed mating and/or double reduction. For WGDs that involve hybridization (allopolyploids), alleles are typically inherited through independently segregating subgenomes.

Results: We present two new models for estimating genotypes and population genetic parameters from genotype likelihoods for auto- and allopolyploids. We then use simulations to compare these models to existing approaches at varying depths of sequencing coverage and ploidy levels. These simulations show that our models typically have lower levels of estimation error for genotype and parameter estimates, especially when sequencing coverage is low. Finally, we also apply these models to two empirical data sets from the literature. Overall, we show that the use of genotype likelihoods to model non-standard inheritance patterns is a promising approach for conducting population genomic inferences in polyploids.

Availability: A C++ program, ebg, is provided to perform inference using the models we describe. It is available under the GNU GPLv3 on GitHub: https://github.com/pblischak/polyploid-genotyping.

2.2 Introduction

The discovery and analysis of genetic variation in natural populations is a central task

of evolutionary genetics, with applications ranging from the inference of population

structure and patterns of historical demography to the detection of selection and local

adaptation and the performance of genetic association studies. The ability to use high-throughput

sequencing technologies to detect variants across the genome has further advanced our

understanding of the impact of evolutionary forces on genetic diversity in populations.

However, the nature of data sets collected using high-throughput sequencing often

requires special considerations regarding sequencing error and, especially, the level of

sequencing coverage. Common approaches for dealing with low-coverage sequence data

use genotype likelihoods to integrate over the uncertainty of inferring genotypes when

estimating other parameters [allele frequencies, inbreeding coefficients, population

differentiation, etc.] (e.g., Martin et al., 2010; Li, 2011; Nielsen et al., 2011, 2012;

Fumagalli et al., 2013; Vieira et al., 2013; Huang et al., 2016, among others). Genotype

likelihoods for biallelic SNPs are calculated as the probability of the sequencing read

data mapping to a variable site (total number of reads, number of reads with the

alternative allele, and probability of sequencing error) given the possible values of

the genotypes (typically 0, 1, or 2 for the number of copies of the alternative allele

in diploids). When combined with computationally efficient algorithms for inference,

these models are the primary tools used for conducting population genetic analyses

from high-throughput data.

Although the theory for these models is well established for diploids and even

special cases of higher ploidy samples (treated equivalently to pooled samples of

multiple diploids), the application of these tools to taxa that have experienced a

recent whole genome duplication (WGD) is currently limited (McKenna et al., 2010;

DePristo et al., 2011; Li, 2011). This is due in part to ambiguity in the

copy number of each allele in the genotype of a polyploid, a phenomenon referred

to as allelic dosage uncertainty (Blischak et al., 2016). Another important aspect of

polyploid evolution to consider is that the occurrence of WGD can have an impact

on how alleles are exchanged in a population, making the assumption of randomly

inherited alleles inappropriate. Together these two factors have limited the widespread

application of population genomic tools to gain insights about levels of genetic variation

following WGD. Given both the evolutionary and economic importance of many of

these organisms (e.g., agricultural crops, farmed fishes), the development of methods

that can accommodate more complex patterns of inheritance is critical for the study

of polyploids (Stebbins, 1950; Grant, 1971; Otto and Whitton, 2000; Soltis and Soltis,

2000; Soltis et al., 2014).

In this paper we present two new models for SNP genotyping in polyploids using

high-throughput sequencing data. The models correspond to two different ways in which polyploids can be formed: WGD within a lineage (autopolyploid) or involving

hybridization between two lineages (allopolyploid). The former builds off of previous work to relax the assumption of Hardy-Weinberg equilibrium by including inbreeding

(Blischak et al., 2016) and the latter provides a framework for separately determining

the genotypes within the two genomes that compose the allopolyploid (typically

referred to as subgenomes). We test our models using a wide range of simulations

and describe our numerical approach for parameter estimation using the expectation

maximization (EM) algorithm (Dempster et al., 1977). For comparison, we analyzed

our simulated data sets using two additional approaches based on models that assume

either Hardy Weinberg equilibrium or equal genotype probabilities. Finally, we also

test the models on empirical data sets collected for a diploid-allotetraploid species

pair from the genus Betula (birch trees) and a mixed-ploidy grass species, Andropogon gerardii. Overall, we demonstrate that genotype uncertainty resulting from

low-coverage sequencing data, allelic dosage uncertainty, and non-standard inheritance

patterns can be overcome in polyploids using genotype likelihoods.

2.3 Models

Assumptions: For each of the models below, we assume that SNPs are biallelic, and

that loci and individuals are independent. For the autopolyploid model, we do not

directly include double reduction (but see Discussion). For the allopolyploid model, we assume that subgenomes are independent, that they do not interact during meiosis

(i.e., no homoeologous recombination), and that they are both in Hardy Weinberg

equilibrium.

Notation for each model is introduced in the descriptions we provide below and

is also summarized in Table 2.1. Throughout the paper, we use boldface letters to

denote an array of the respective parameter across either individuals (N), loci (L), or

both (e.g., $\mathbf{p} := p_1, \ldots, p_L$, $\mathbf{F} := F_1, \ldots, F_N$, and $\mathbf{G} := g_{11}, g_{12}, \ldots, g_{N(L-1)}, g_{NL}$).

2.3.1 Autopolyploid Model

The genotype for a biallelic SNP in an autopolyploid with K sets of chromosomes

has K + 1 possible values. For example, using A and a to denote the two alleles,

an autotetraploid can have genotypes equal to AAAA, AAAa, AAaa, Aaaa, or aaaa

(e.g., $g_{i\ell} = 0, 1, 2, 3,$ or $4$, if a is the alternative allele; $i = 1, \ldots, N$ and $\ell = 1, \ldots, L$).

A simple extension of the typical binomial sampling (Hardy Weinberg; HW) model

used for diploids but with larger sample size to accommodate higher ploidy levels has

been used previously (Li, 2011; Blischak et al., 2016). However, inbreeding in various

forms can bias inferences made when HW equilibrium is assumed. Vieira et al. (2013)

introduced a genotype prior to include inbreeding either per-site or per-individual for

a sample of diploids (implemented in the programs ngsF and ANGSD). This model

used a formulation for generalized HW that includes the inbreeding coefficient, F, which is the probability that two alleles are identical by descent (ibd). Instead of

using a generalized HW formulation for autopolyploids, we used the Balding-Nichols

beta-binomial model (Balding and Nichols, 1995, 1997; Bradburd et al., 2013), which

also models the probability of two alleles being ibd but is more easily extended to

higher ploidy levels by not directly enumerating all combinations of allele draws for

the genotype of an autopolyploid. The beta-binomial distribution is obtained from the

product of a binomial and beta distribution, which are commonly used in population

genetics to model genotypes and allele frequencies, respectively (Wright, 1931). The

beta distribution in this case is used to model genetic correlations that can result

from inbreeding and/or population subdivision. These types of models are commonly

referred to as F-models because of their relation to Wright’s fixation indices (e.g.,

$F_{IS}$, $F_{ST}$; Wright, 1931), and they form the basis of many well-known population

genetic models, including those by Holsinger et al. (2002), Falush et al. (2003), and

Foll and Gaggiotti (2008), as well as more recent modeling applications that include

uncertainty in genotype calling from high-throughput sequencing data using genotype

likelihoods (e.g., Gompert et al., 2010; Gompert and Buerkle, 2011b; Fumagalli et al.,

2013).

Given genotype values at L loci for N individuals each of ploidy $m_i$, we model individual genotypes at each locus ($g_{i\ell} = 0, \ldots, m_i$ copies of the alternative allele) as a beta-binomial random variable. This distribution derives from treating the probability of drawing an alternative allele as a beta distributed random variable with parameters $\alpha = p_\ell \frac{1 - F_i}{F_i}$ and $\beta = (1 - p_\ell) \frac{1 - F_i}{F_i}$, which scales the binomial probability of successfully drawing an alternative allele by both the allele frequency ($p_\ell$) and the amount of inbreeding ($F_i$) (Balding and Nichols, 1995; Bradburd et al., 2013). The log likelihood of the genotype data for this model given the allele frequency at each site ($p_\ell$) and the per-individual inbreeding coefficients ($F_i$) is then

$$
\log L(\mathbf{p}, \mathbf{F}; \mathbf{G}) = \sum_i \sum_\ell \log P(g_{i\ell} \mid p_\ell, F_i)
= \sum_i \sum_\ell \log \left[ \frac{B\left(g_{i\ell} + p_\ell \frac{1 - F_i}{F_i},\; m_i - g_{i\ell} + (1 - p_\ell) \frac{1 - F_i}{F_i}\right)}{B\left(p_\ell \frac{1 - F_i}{F_i},\; (1 - p_\ell) \frac{1 - F_i}{F_i}\right)} \right], \tag{2.1}
$$

where $B(\alpha, \beta)$ represents the beta function with parameters $\alpha$ and $\beta$. Since genotypes

must be inferred from sequence data ($d_{i\ell}$; see Methods), we can also account for this uncertainty by summing over the possible genotype values to get the likelihood of the sequence data given allele frequencies and inbreeding coefficients by including genotype likelihoods [$P(d_{i\ell} \mid g_{i\ell})$]:

$$
\log L(\mathbf{p}, \mathbf{F}; \mathbf{D}) = \sum_i \sum_\ell \log \left[ \sum_a P(d_{i\ell} \mid g_{i\ell} = a) \, P(g_{i\ell} = a \mid p_\ell, F_i) \right]. \tag{2.2}
$$

Here $P(g_{i\ell} \mid p_\ell, F_i)$ is the beta-binomial distribution from Eq. (2.1). Because maximization of the log likelihood is encumbered by the logarithm of the sum over genotypes, we instead use an expectation conditional maximization algorithm to obtain maximum

likelihood (ML) estimates for p and F (Meng and Rubin, 1993). Since an analytical

solution for the maximization step is not readily available, we instead employ numerical

maximization of the likelihood using Brent’s method (Brent, 1973). Then, given the

ML parameter estimates, we can calculate the posterior probability of the genotype of

each individual at each locus using Bayes’ theorem:

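$$
P(g_{i\ell} = a \mid d_{i\ell}) = \frac{P(d_{i\ell} \mid g_{i\ell} = a) \, P(g_{i\ell} = a \mid \hat{p}_\ell, \hat{F}_i)}{\sum_{a'=0}^{m_i} P(d_{i\ell} \mid g_{i\ell} = a') \, P(g_{i\ell} = a' \mid \hat{p}_\ell, \hat{F}_i)}, \tag{2.3}
$$

for $a = 0, \ldots, m_i$.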
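The genotype prior and the posterior calculation in Eqs. (2.1) to (2.3) can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in ebg: the function names are hypothetical, the genotype likelihoods in the example are made-up numbers, and the binomial coefficient of the standard beta-binomial probability mass function is included explicitly (an assumption on our part, so that the prior sums to one; it is constant in $p_\ell$ and $F_i$). Both $p_\ell$ and $F_i$ must lie strictly between 0 and 1.

```python
from math import comb, exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_prior(m, p, F):
    """P(g = a | p, F) for a = 0..m under the Balding-Nichols parameterization."""
    alpha = p * (1.0 - F) / F
    beta = (1.0 - p) * (1.0 - F) / F
    log_denom = log_beta(alpha, beta)
    # comb(m, a) is the binomial coefficient of the beta-binomial pmf.
    return [comb(m, a) * exp(log_beta(a + alpha, m - a + beta) - log_denom)
            for a in range(m + 1)]


def genotype_posterior(geno_liks, m, p, F):
    """Posterior over genotypes (Eq. 2.3) and the log of the inner sum in Eq. 2.2."""
    prior = beta_binomial_prior(m, p, F)
    joint = [lik * q for lik, q in zip(geno_liks, prior)]
    total = sum(joint)
    return [j / total for j in joint], log(total)


# Hypothetical genotype likelihoods P(d | g = a) for a tetraploid (a = 0..4).
example_liks = [0.01, 0.20, 0.50, 0.20, 0.01]
posterior, log_marginal = genotype_posterior(example_liks, m=4, p=0.3, F=0.5)
```

Summing `log_marginal` over individuals and loci gives the log likelihood of Eq. (2.2) for a given set of parameter values, which is the quantity a numerical optimizer would maximize.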

2.3.2 Allopolyploid Model

Deviations from simple HW expectations are evident in allopolyploids in that they

have two (sometimes more) sets of chromosomes inherited from separate evolutionary

lineages. When these sets of chromosomes (called homoeologs, or homoeologous chromosomes) segregate during meiosis, they are inherited separately from one another

and should be treated independently. For example, the genotypes for a biallelic SNP

in an allotetraploid with two diploid subgenomes could have values AA|A′A′, AA|A′a′,

Aa|A′A′, AA|a′a′, Aa|A′a′, aa|A′A′, Aa|a′a′, aa|A′a′, or aa|a′a′. Here the vertical bar

‘|’ denotes separation between the subgenomes and the prime (′) indicates homoeologous alleles.

With perfect knowledge about which alleles go with each subgenome, determining the

genotypes could be done completely independently. However, if separate reference

genomes for the homoeologous chromosomes are not available, all reads mapping to

a variable position will not be separable into reads coming from one subgenome or

the other. Thus, when considering a variable site across the full set of homoeologs, we need to account for the fact that the frequency of the alternative allele may not

be the same in each subgenome due to their separate evolutionary histories, even if

both subgenomes are independently in Hardy Weinberg equilibrium. When we cannot

separate reads, we can instead consider the full genotype of an allopolyploid with two

subgenomes as being a combination of the genotypes within the subgenomes (i.e., the

number of alternative alleles summed across subgenomes). Returning to the previous

example, a tetraploid with two diploid subgenomes can have a full genotype of 0, ..., 4

copies of the alternative allele, but each of these full genotypes can be found via a

different combination of genotypes in the subgenomes: {0 = (0, 0); 1 = (0, 1), (1, 0); 2 =

(0, 2), (2, 0), (1, 1); 3 = (1, 2), (2, 1); 4 = (2, 2)}. In general, for an allopolyploid that

has two subgenomes with ploidy levels equal to $m_{1i}$ and $m_{2i}$, there are a total of

$(m_{1i} + 1) \times (m_{2i} + 1)$ genotype combinations to consider. The probabilities of these

genotypes are then determined using the allele frequencies for the alternative allele in

the subgenomes.
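The bookkeeping described above can be made concrete with a small Python sketch. The function names are hypothetical and the allele frequencies in the example are arbitrary; the sketch enumerates the subgenome genotype pairs behind each full genotype and, assuming binomial (Hardy Weinberg) sampling within each subgenome, sums their probabilities.

```python
from itertools import product
from math import comb


def binom_pmf(a, m, p):
    """Binomial probability of a alternative alleles among m copies."""
    return comb(m, a) * p**a * (1.0 - p) ** (m - a)


def subgenome_combinations(m1, m2):
    """Map each full genotype to the (g1, g2) pairs that sum to it."""
    combos = {g: [] for g in range(m1 + m2 + 1)}
    for g1, g2 in product(range(m1 + 1), range(m2 + 1)):
        combos[g1 + g2].append((g1, g2))
    return combos


def full_genotype_probs(m1, m2, p1, p2):
    """Probability of each full genotype given subgenome allele frequencies."""
    probs = [0.0] * (m1 + m2 + 1)
    for g, pairs in subgenome_combinations(m1, m2).items():
        for g1, g2 in pairs:
            probs[g] += binom_pmf(g1, m1, p1) * binom_pmf(g2, m2, p2)
    return probs


# Allotetraploid with two diploid subgenomes and arbitrary allele frequencies.
combos = subgenome_combinations(2, 2)
probs = full_genotype_probs(2, 2, p1=0.3, p2=0.6)
```

For the tetraploid example, `combos` holds the nine pairs listed in the text, grouped by their sum.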

An obvious complication of not being able to separate the sequencing reads into

sets coming from each subgenome is that it makes independently estimating the allele

frequencies and genotypes impossible. However, it is sometimes the case that the

parental species of the allopolyploid are known, which can help with inferring genotypes

by providing an outside estimate of the allele frequencies within the subgenomes. For

our model, we relax this use of outside knowledge further and assume that only a single

parent has been identified. Arbitrarily designating the known parent as subgenome

one, we treat the allele frequencies at each locus estimated in the parental population

to be known ($\mathbf{p}_1^*$) and require only the estimation of the allele frequencies in subgenome

two ($\mathbf{p}_2$). We then model the full genotype in the allopolyploid as the sum of the two

independent subgenomes with separate, and potentially unequal, allele frequencies.

Since we assume Hardy Weinberg equilibrium within each subgenome, we can model

the sum of the number of alternative alleles in the two subgenomes as a product of two

binomial distributions. The log likelihood for known genotype data across individuals

at all loci is then given by

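$$
\log L(\mathbf{p}_2; \mathbf{p}_1^*, \mathbf{G}_1, \mathbf{G}_2) = \sum_i \sum_\ell \left\{ \log \left[ \binom{m_{1i}}{g_{1i\ell}} (p_{1\ell}^*)^{g_{1i\ell}} (1 - p_{1\ell}^*)^{(m_{1i} - g_{1i\ell})} \right] + \log \left[ \binom{m_{2i}}{g_{2i\ell}} (p_{2\ell})^{g_{2i\ell}} (1 - p_{2\ell})^{(m_{2i} - g_{2i\ell})} \right] \right\}. \tag{2.4}
$$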

The inclusion of genotype likelihoods is done in a similar way to the autopolyploid

model, only now we are summing over the values of the genotypes in both subgenomes

one and two. The log likelihood for observed sequence data given the allele frequencies

in each of the subgenomes is

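$$
\log L(\mathbf{p}_2; \mathbf{p}_1^*, \mathbf{D}) = \sum_i \sum_\ell \log \left[ \sum_{a_1} \sum_{a_2} P(d_{i\ell} \mid g_{i\ell} = g_{1i\ell} + g_{2i\ell}) \times P(g_{1i\ell} = a_1 \mid p_{1\ell}^*) \, P(g_{2i\ell} = a_2 \mid p_{2\ell}) \right], \tag{2.5}
$$

where $P(d_{i\ell} \mid g_{i\ell})$ is the genotype likelihood, and $P(g_{1i\ell} \mid p_{1\ell}^*)$ and $P(g_{2i\ell} \mid p_{2\ell})$ are binomial

distributions.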

Because maximizing the log likelihood involves the logarithm of a double sum, we

turn once again to the expectation maximization algorithm to obtain a ML estimate

for the allele frequency at each locus in subgenome two (Dempster et al., 1977). An

analytical solution for the maximization step of the EM algorithm is given by

$$
p_{2\ell}^{(t+1)} = \frac{\sum_i \sum_{a_1} \sum_{a_2} a_2 \, P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^*, p_{2\ell}^{(t)})}{\sum_i m_{2i}}, \tag{2.6}
$$

where $P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p_{1\ell}^*, p_{2\ell}^{(t)})$ is the joint conditional probability of the genotypes

in subgenomes one and two given the data and the current parameter estimates. Using

these ML estimates, an empirical Bayes estimate of the genotypes within each of the

subgenomes can be found using their joint posterior probability (note that subscripts

i and ` are dropped for readability)

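$$
P(g_1 = a_1, g_2 = a_2 \mid d) = \frac{P(d \mid g = g_1 + g_2) \, P(g_1 = a_1 \mid p_1^*) \, P(g_2 = a_2 \mid \hat{p}_2)}{\sum_{a_1'} \sum_{a_2'} P(d \mid g = g_1 + g_2) \, P(g_1 = a_1' \mid p_1^*) \, P(g_2 = a_2' \mid \hat{p}_2)}, \tag{2.7}
$$

where $a_1 = 0, \ldots, m_{1i}$ and $a_2 = 0, \ldots, m_{2i}$.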
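As an illustration of Eqs. (2.6) and (2.7), the following Python sketch runs the EM update for the subgenome-two allele frequency at a single locus. It is a simplified sketch, not the ebg implementation: the function names and the convergence tolerance are hypothetical, the genotype likelihoods are supplied by the caller, and every individual is assumed to share the same subgenome ploidies.

```python
from math import comb


def binom_pmf(a, m, p):
    """Binomial probability of a alternative alleles among m copies."""
    return comb(m, a) * p**a * (1.0 - p) ** (m - a)


def joint_posterior(lik, m1, m2, p1, p2):
    """Eq. (2.7): posterior over (a1, a2); lik[g] = P(d | g) for g = 0..m1+m2."""
    joint = {
        (a1, a2): lik[a1 + a2] * binom_pmf(a1, m1, p1) * binom_pmf(a2, m2, p2)
        for a1 in range(m1 + 1)
        for a2 in range(m2 + 1)
    }
    total = sum(joint.values())
    return {pair: w / total for pair, w in joint.items()}


def em_p2(geno_liks, m1, m2, p1, p2=0.5, max_iter=100, tol=1e-8):
    """EM estimate of the subgenome-two allele frequency at one locus.

    geno_liks holds one list of genotype likelihoods per individual; p1 is
    the reference-panel allele frequency, treated as known.
    """
    for _ in range(max_iter):
        # E-step: expected number of alternative alleles in subgenome two.
        expected_a2 = 0.0
        for lik in geno_liks:
            post = joint_posterior(lik, m1, m2, p1, p2)
            expected_a2 += sum(a2 * w for (_, a2), w in post.items())
        # M-step: Eq. (2.6) with a common m2 across individuals.
        new_p2 = expected_a2 / (m2 * len(geno_liks))
        converged = abs(new_p2 - p2) < tol
        p2 = new_p2
        if converged:
            break
    return p2


# Three allotetraploid individuals whose reads all point to full genotype 4.
certain_four = [[0.0, 0.0, 0.0, 0.0, 1.0]] * 3
p2_hat = em_p2(certain_four, m1=2, m2=2, p1=0.3)
```

Passing the final estimate back into `joint_posterior` yields the empirical Bayes genotype probabilities for each pair of subgenome genotypes.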

2.3.3 Other Approaches

We consider two additional approaches that use genotype priors that have been

described in previous studies. The first is an implementation of the SAMtools Hardy

Weinberg equilibrium prior (Li, 2011) and the second is a flat prior on genotypes that

is similar to the model used by the Genome Analysis Toolkit (GATK; McKenna et al.,

2010). Other approaches that accommodate polyploids such as the fitTetra package

in R (Voorrips et al., 2011) and the method of Maruki and Lynch (2017) were not

considered here because they can only handle specific ploidy levels (triploids and/or

tetraploids).

2.4 Methods

Genotype likelihoods were calculated using a simplified version of the SAMtools model by using average sequencing error values at each locus, $\ell$, across reads and individuals (Li, 2011). Then for the possible values of the genotype ($a = 0, \ldots, m_i$), the probability of the read data, $d_{i\ell} = \{t_{i\ell}, r_{i\ell}, \epsilon_\ell\}$ ($t_{i\ell}$ = total read count, $r_{i\ell}$ = alternative allele read count), given the genotype, $g_{i\ell}$, is

$$
P(d_{i\ell} \mid g_{i\ell} = a) = \binom{t_{i\ell}}{r_{i\ell}} f(a, m_i, \epsilon_\ell)^{r_{i\ell}} \times [1 - f(a, m_i, \epsilon_\ell)]^{(t_{i\ell} - r_{i\ell})}, \tag{2.8}
$$

where

$$
f(a, m, \epsilon) = \frac{a}{m} (1 - \epsilon) + \left(1 - \frac{a}{m}\right) \epsilon. \tag{2.9}
$$

Symbol            Description
N, L              The number of individuals and loci sampled.
$m_i$             Ploidy level of individual i.
$d_{i\ell}$       Sequence data for individual i at locus $\ell$ ($= \{t_{i\ell}, r_{i\ell}, \epsilon_\ell\}$).
$t_{i\ell}$       Total number of reads for individual i at locus $\ell$.
$r_{i\ell}$       Number of alternative allele reads for individual i at locus $\ell$.
$\epsilon_\ell$   Average sequencing error at locus $\ell$.
$g_{i\ell}$       Genotype for individual i at locus $\ell$.
$p_\ell$          Allele frequency at locus $\ell$.
$F_i$             Inbreeding coefficient for individual i.

Table 2.1: A key to the symbols and notation that are used in describing the autopolyploid and allopolyploid models. We use either a bold or bold-capitalized letter when referring to the collection of parameters together (e.g., $\mathbf{G}$ refers to $g_{i\ell}$ for all individuals at all loci). Parameters within subgenomes for the allopolyploid model use the same symbol but with either a 1 or a 2 added as a subscript.
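Eqs. (2.8) and (2.9) translate directly into code. The Python sketch below (function names are hypothetical; the read counts and error rate in the example are arbitrary) computes the genotype likelihoods for one individual at one locus.

```python
from math import comb


def allele_read_prob(a, m, eps):
    """Eq. (2.9): probability that a single read shows the alternative allele."""
    return (a / m) * (1.0 - eps) + (1.0 - a / m) * eps


def genotype_likelihoods(t, r, m, eps):
    """Eq. (2.8): P(d | g = a) for a = 0..m, with t total and r alternative reads."""
    liks = []
    for a in range(m + 1):
        f = allele_read_prob(a, m, eps)
        liks.append(comb(t, r) * f**r * (1.0 - f) ** (t - r))
    return liks


# A tetraploid with 10 reads, 5 of which carry the alternative allele.
liks = genotype_likelihoods(t=10, r=5, m=4, eps=0.01)
best = max(range(len(liks)), key=liks.__getitem__)
```

With half of the reads carrying the alternative allele, the likelihood peaks at the balanced genotype (two copies) and is symmetric around it.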

2.4.1 Simulations

We generated sequencing read data with mean coverage per individual, per locus equal

to 2x, 5x, 10x, 20x, 30x, and 40x, simulated from a Poisson distribution for 10 000

sites. The number of individuals was set to 25, 50, or 100, and we tested ploidy levels

equal to 4, 6, and 8 (4=2+2, 6=2+4, and 8=4+4 for allopolyploids). Sequencing

errors were drawn from a beta distribution with parameters α = 1 and β = 200 (mean

error ≈ 0.005). Allele frequencies were drawn from a truncated beta distribution with a minimum minor allele frequency of 5% and parameters α = β = 0.01. For

the autopolyploid model, the values of the inbreeding coefficient were set to 0.1, 0.25,

0.5, 0.75, and 0.9. For the allopolyploid model, the allele frequencies simulated for

subgenome one were treated as the reference panel. Genotypes were drawn according

to their respective generating models (autopolyploid or allopolyploid), and the number

of alternative reads for each individual at each locus was drawn from the binomial

distribution in Eq. (2.8) given the total read count, genotype, and level of sequencing

error. For each simulation, we evaluated estimation error using the root mean squared

deviation (RMSD)

$$
\mathrm{RMSD} = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left(X_{\mathrm{est.}}^{[r]} - X_{\mathrm{true}}\right)^2}, \tag{2.10}
$$

where R represents the number of replicates, $X_{\mathrm{est.}}^{[r]}$ is the estimated value for replicate

r, and $X_{\mathrm{true}}$ is the original value used to simulate the data.
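The simulation steps for the autopolyploid model can be sketched as follows. This is a rough Python illustration under stated assumptions, not the R/C++ code used for the study: the function names are hypothetical, the truncated beta distribution is sampled by simple rejection, and the Poisson and binomial draws use basic textbook methods so that only the standard library is needed.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's multiplication method for a Poisson draw (fine for modest means)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def binomial(rng, n, p):
    """Binomial draw as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate_locus(rng, n_ind, ploidy, coverage, F):
    """Simulate (genotype, total reads, alternative reads) per individual at one locus."""
    eps = rng.betavariate(1, 200)        # sequencing error, mean about 0.005
    p = rng.betavariate(0.01, 0.01)
    while not 0.05 <= p <= 0.95:         # reject until minor allele frequency >= 5%
        p = rng.betavariate(0.01, 0.01)
    alpha, beta = p * (1 - F) / F, (1 - p) * (1 - F) / F
    data = []
    for _ in range(n_ind):
        # Beta-binomial genotype: beta-distributed success probability, then binomial.
        g = binomial(rng, ploidy, rng.betavariate(alpha, beta))
        t = poisson(rng, coverage)
        f = (g / ploidy) * (1 - eps) + (1 - g / ploidy) * eps   # Eq. (2.9)
        data.append((g, t, binomial(rng, t, f)))
    return p, eps, data


def rmsd(estimates, truth):
    """Eq. (2.10): root mean squared deviation of replicate estimates."""
    return math.sqrt(sum((x - truth) ** 2 for x in estimates) / len(estimates))


rng = random.Random(1)
p, eps, data = simulate_locus(rng, n_ind=20, ploidy=4, coverage=10, F=0.5)
```

Repeating `simulate_locus` over many loci, estimating the parameters, and passing the replicate estimates to `rmsd` mirrors the evaluation described above.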

To compare our models with other methods, we reused these simulated data as

input for the estimation of genotypes and model parameters using priors that assume

either Hardy Weinberg equilibrium or equal genotype probabilities (GATK-like). For

the allopolyploid model, this also equates to ignoring the fact that genotypes are drawn

from two independent subgenomes. Inference for the Hardy Weinberg model used the

EM algorithm described in Li (2011). Genotypes under the GATK-like model were calculated from normalized genotype likelihoods as described in McKenna et al. (2010).
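The two comparison approaches can be sketched in the same style. The Python below is a hedged illustration rather than the hwe and gatk code in ebg: the function names are hypothetical, the EM update is the standard one for a binomial genotype prior written in the spirit of Li (2011), and all individuals are assumed to share the same ploidy.

```python
from math import comb


def flat_prior_posterior(lik):
    """GATK-like call: with equal genotype priors, normalize the likelihoods."""
    total = sum(lik)
    return [x / total for x in lik]


def hw_em_allele_freq(geno_liks, m, p=0.5, max_iter=100, tol=1e-8):
    """EM estimate of the allele frequency at one locus under a binomial (HW) prior."""
    for _ in range(max_iter):
        expected = 0.0
        for lik in geno_liks:
            # E-step: posterior-weighted expected number of alternative alleles.
            joint = [lik[a] * comb(m, a) * p**a * (1 - p) ** (m - a)
                     for a in range(m + 1)]
            expected += sum(a * w for a, w in enumerate(joint)) / sum(joint)
        # M-step: expected alternative alleles divided by total allele copies.
        new_p = expected / (m * len(geno_liks))
        converged = abs(new_p - p) < tol
        p = new_p
        if converged:
            break
    return p


# Two tetraploids whose reads point unambiguously to one alternative copy each.
one_copy = [[0.0, 1.0, 0.0, 0.0, 0.0]] * 2
p_hat = hw_em_allele_freq(one_copy, m=4)
calls = flat_prior_posterior([1.0, 1.0, 2.0])
```

Neither function uses an inbreeding coefficient or subgenome structure, which is the simplification that the comparisons in this chapter are designed to probe.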

Comparisons for the autopolyploid model were based on the RMSD of four estimates

of the inbreeding coefficient. The first of these was the estimate obtained by our

ECM algorithm, which is built directly into the model. The other three estimates were calculated as a summary statistic from estimated genotypes for the three models

(Appendix B). We then also compared RMSD values of the estimated genotype values for the three methods. For the allopolyploid model, direct comparisons with

models that assume Hardy Weinberg or uniform genotype priors are more difficult

because they do not share the assumption of two subgenomes. Therefore, we focused

on the accuracy of the models to infer the full genotype by again comparing RMSD values.

2.4.2 Empirical Data Analysis

Andropogon gerardii

We tested our autopolyploid model on an empirical data set collected in the grass

species Andropogon gerardii. SNP data from McAllister and Miller (2016) were

downloaded from Dryad as a VCF file (http://datadryad.org/resource/doi:10.5061/dryad.05qs7).

The data were filtered using VCFtools with the following criteria:

biallelic SNPs only, no more than 50% missing data per site, one SNP per 10 000 base

pair window, and a minimum sequencing depth of five reads (Danecek et al., 2011).

The output from VCFtools was then converted to a plain text format containing

the number of total reads and alternative allele reads per individual per site using a

Perl script (read-counts-from-vcf.pl; available on GitHub). We then also removed

any individuals with more than 50% missing data using an R script (filter-inds.R;

available on GitHub). Since A. gerardii has two cytotypes (6N and 9N), we analyzed

the hexaploid and nonaploid individuals separately and compared the estimates of the

inbreeding coefficients across ploidy levels.
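The conversion from VCF to read counts was done with a Perl script in the original analysis. As a stand-in illustration only, the Python sketch below shows one way such a conversion could work. It assumes that each record is biallelic and carries an `AD` (allelic depth) entry in its FORMAT column, which may not match the fields used by read-counts-from-vcf.pl; the function name and the example record are hypothetical.

```python
def read_counts_from_vcf_line(line):
    """Return (total, alternative) read counts per sample for one biallelic VCF record."""
    fields = line.rstrip("\n").split("\t")
    keys = fields[8].split(":")          # FORMAT column, e.g. GT:AD:DP
    ad_index = keys.index("AD")
    counts = []
    for sample in fields[9:]:
        values = sample.split(":")
        if ad_index >= len(values) or "." in values[ad_index]:
            counts.append((None, None))  # missing data for this sample
            continue
        ref, alt = (int(x) for x in values[ad_index].split(",")[:2])
        counts.append((ref + alt, alt))
    return counts


# Made-up record with one hexaploid sample and one sample with missing data.
record = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/0/0/1/1/1:6,4:10\t./.:.:."
counts = read_counts_from_vcf_line(record)
```

Samples flagged as missing here would count toward the per-site and per-individual missing-data filters described above.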

Betula pubescens and B. pendula

To test the allopolyploid model, biallelic SNP genotypes from Zohren et al. (2016) for

the allotetraploid Betula pubescens and its putative diploid progenitor, B. pendula, were

downloaded from Dryad (http://datadryad.org/resource/doi:10.5061/dryad.815rj).

Treating the genotypes as known, we simulated read data and error values

as before using Eq. (2.8) with beta distributed error values. We varied the level of

sequencing coverage (5x, 10x, 20x) but did not alter the amount of missing data. Allele

frequencies for B. pendula were estimated under the assumption of Hardy Weinberg

equilibrium and disequilibrium to assess which was a better fit. These allele frequency

estimates were then used as the reference panel for genotype estimation in B. pubescens

using the allopolyploid model.

Comparison with GATK

As a final comparison, we re-analyzed raw sequence data collected for B. pendula

and B. pubescens using GATK v3.5.0 and our model for allopolyploids. Data for 15

individuals each of B. pendula and B. pubescens were downloaded from the European

Nucleotide Archive (Project Accession ERA600270). Reads were mapped to a draft

reference genome of B. nana (Dryad, doi:10.5061/dryad.815rj; Wang et al., 2013) using

the MEM algorithm in BWA v0.7.13 with additional processing (conversion to BAM

and sorting) using SAMtools v1.4.1 (Li and Durbin, 2009; Li, 2011). Read group

information was added using Picard (http://broadinstitute.github.io/picard),

followed by variant calling and genotype estimation using the GATK UnifiedGenotyper

(B. pubescens was run with -ploidy=4; McKenna et al., 2010). Variant site positions

in the resulting VCF files were used to extract base quality scores from the original

BAM files using the SAMtools mpileup command (Li, 2011). All other data processing

steps (filtering sites, finding shared variants, etc.) were conducted using Python and

R scripts (available on GitHub; see Supplemental Text, §B.3). Allele frequencies

at each site were estimated in B. pendula using our implementation of the Hardy

Weinberg model (run until convergence). These allele frequencies were then used as

the reference panel for estimating genotypes in B. pubescens using the allopolyploid

model (EM+Brent with 100 iterations). All VCF, pileup, and input/output files are

publicly available on Zenodo (doi:10.5281/zenodo.825228).

2.4.3 Software and Reproducibility

We have packaged our code for the EM/ECM algorithms in a command line C++

program called ebg, which we have included as part of a GitHub repository for this

manuscript (doi:10.5281/zenodo.195779). This software includes our implementations

of the autopolyploid (diseq), allopolyploid (alloSNP), Hardy Weinberg (hwe), and

GATK-like (gatk) models for genotyping in polyploids. Code for the simulation study

and empirical data analyses was written using a combination of the R statistical

language and C++ through the use of the Rcpp package (Eddelbuettel and François,

2011; Eddelbuettel, 2013; R Core Team, 2014). Figures were generated using the

ggplot2 package in R (Wickham, 2009). Additional figure manipulations were done

using Inkscape (https://inkscape.org/). All Python, Perl, R, and Bash scripts

used to process data files are included on GitHub in the ‘helper-scripts/’ folder.

2.5 Results

2.5.1 Simulations

Autopolyploid Model

Simulated read count data were generated to assess the impact of sequencing coverage

and ploidy level on estimation error in autopolyploids using an expectation conditional

maximization (ECM) algorithm. Convergence of the ECM algorithm depended on

the number of individuals sampled, sequencing coverage, and ploidy. Each iteration

of the algorithm employs Brent’s method, itself an iterative maximization algorithm,

resulting in slower M-steps than the other EM algorithms we describe. However,

overall convergence was reached before the maximum number of allowed iterations

(1000) in all cases, with analyses typically employing between 50 and 100 iterations.

For the estimation of individual inbreeding coefficients (Fi), Figure 2.1a shows the

root mean squared deviation (RMSD) for estimated inbreeding coefficients for the

four different estimation methods across ploidy levels and the three lowest levels of

sequencing coverage (sample size of 50 individuals). Compared with the other methods

that used called genotypes (diseqCG, hwe, gatk), the level of sequencing coverage

and ploidy level had virtually no effect on estimation error using our model (diseq).

For the other estimates, increasing sequencing coverage lowered estimation error as

expected, and higher ploidy levels showed higher levels of error. However, inbreeding

coefficients estimated from genotypes called from our model (diseqCG) did have lower

[Figure 2.1 (image not reproduced). Panels: (a) Inbreeding Coeff. Estimation Error [50 ind.]; (b) Genotype Estimation Error [50 ind.]. Columns: ploidy (p4, p6, p8). Rows: coverage (c2, c5, c10). x-axis: inbreeding coefficient (F10, F25, F50, F75, F90). y-axis: RMSD. Methods: diseq, diseqCG, gatk, hwe in (a); diseq, gatk, hwe in (b).]

Figure 2.1: RMSD values for simulations under the autopolyploid model with inbreeding for (a) estimated inbreeding coefficients and (b) estimated genotypes. Each individual plot within (a) and (b) displays the RMSD on the y-axis and inbreeding coefficients on the x-axis. Rows correspond with the depth of sequencing coverage (2x, 5x, 10x) and the columns correspond to the ploidy level (4, 6, 8). The different estimation methods (diseq, diseqCG, gatk, hwe) are represented by different shapes within each plot. (a) The RMSD of the inbreeding coefficient estimated by our model (diseq) is consistently the lowest across all depths of sequencing coverage, ploidy level, and level of inbreeding. (b) Genotypes estimated by our model are at least as accurate as the other methods and are not as affected by high or low levels of inbreeding.

RMSD values than the other methods, except when the inbreeding coefficient was

0.5, when the level of error was about the same. All of the methods except for Hardy

Weinberg showed low levels of estimation error once the depth of sequencing reached

10x. Figures B.1–B.3 show the results for all simulated depths of sequencing (2x to

40x) and sample sizes (25, 50, and 100 individuals).
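For reference, the RMSD reported throughout these results is the square root of the mean squared difference between estimated and true values. A minimal Python sketch follows; the function name and the example values are hypothetical and are not taken from the simulation output.

```python
from math import sqrt


def rmsd(estimates, truths):
    """Root mean squared deviation between paired estimates and true values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical inbreeding coefficients for four individuals simulated at F = 0.5.
print(rmsd([0.45, 0.55, 0.50, 0.60], [0.5, 0.5, 0.5, 0.5]))
```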

Our empirical Bayes approach for maximum a posteriori (MAP) genotype estimation

resulted in a similar overall pattern of lower estimation error for increased

sequencing coverage (Figure 2.1b). Interestingly, the other two methods for genotyping

(gatk, hwe) showed opposing patterns of accuracy: the GATK-like model increased

in accuracy with increasing levels of inbreeding but the Hardy Weinberg model had

decreasing accuracy. Genotypes called by our method showed some dependence on

the level of inbreeding with intermediate values having the most error. However, our

method was still the most accurate across the range of inbreeding values simulated.

Ploidy also had an impact on genotyping with higher ploidy levels having higher levels

of estimation error. This is largely due to the fact that higher ploidy individuals have

a larger number of possible values for the genotype and that the average sequencing

coverage per allele (chromosome) is lower (e.g., 10x coverage in a tetraploid is on average

2.5x per allele but is 1.25x in an octoploid). Once the depth of sequencing reached

10x, the only model that still showed a higher level of error was the Hardy Weinberg

model. Figures B.4–B.6 show the results for all simulated depths of sequencing (2x to

40x) and sample sizes (25, 50, and 100 individuals).
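The two effects of ploidy described above can be made concrete with a short calculation. This Python sketch (the function names are illustrative) reproduces the per-allele coverage from the example in the text and counts the possible biallelic genotypes, which is one more than the ploidy.

```python
def coverage_per_allele(total_coverage, ploidy):
    """Average sequencing depth per chromosome copy."""
    return total_coverage / ploidy


def n_genotypes(ploidy):
    """Number of biallelic genotypes: 0..ploidy copies of the alternative allele."""
    return ploidy + 1


for ploidy in (4, 6, 8):
    print(ploidy, coverage_per_allele(10, ploidy), n_genotypes(ploidy))
```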

Allopolyploid Model

Using the same general parameter settings as the simulations for the autopolyploid

model (except for inbreeding), we calculated genotype likelihoods by simulating read

data from genotypes generated under the model from Eq. (2.4). The ploidy of each

subgenome was as follows: tetraploid = diploid + diploid, hexaploid = diploid +

tetraploid, and octoploid = tetraploid + tetraploid. Our expectation maximization

[Figure 2.2 (image not reproduced). Panel title: Full Genotype Estimation. Columns: ploidy (p4, p6, p8). x-axis: coverage (c2, c5, c10, c20, c30, c40). y-axis: RMSD. Methods: allosnp, gatk, hwe.]

Figure 2.2: RMSD values for full genotype estimation (combined number of alternative alleles in subgenomes one and two). Sequencing coverage is on the x-axis and RMSD values are on the y-axis. Each column represents a different ploidy level and the three methods used (allosnp, gatk, hwe) are represented by different shapes. For low levels of sequencing coverage, the allosnp and hwe models have much lower levels of estimation error when compared with the gatk model. The level of sequencing coverage required for the three methods to converge in error rate depends on the ploidy level, with tetraploids needing less coverage and octoploids needing more.

algorithm for this model was slow to converge, despite each maximization step taking

less time when compared with the autopolyploid model. Analyses never reached the

upper limit on the number of iterations (again 1000) but some analyses did not reach

convergence until over 900 iterations had been run. To make analyses with this model

more practical, we reanalyzed all simulated data sets using only 100 EM iterations

followed by direct maximization of the observed data log likelihood function in Eq.

(2.5) using Brent’s method (EM+Brent).

Comparing our model with other genotype priors (Hardy Weinberg, GATK) only

allowed us to consider the full genotype estimates from the different methods. Figure

2.2 shows the level of estimation error for each of the three genotyping methods for

each ploidy level across all depths of sequencing coverage. For low depths of sequencing,

genotyping with the GATK-like model resulted in high levels of error. As the depth

of coverage increased, the three methods converged. However, this was dependent

on the ploidy level: octoploids required a higher depth of sequencing for the GATK

model than tetraploids or hexaploids to achieve the same level of accuracy. The Hardy

Weinberg prior performed almost identically to our allopolyploid model, most likely as

a result of our assuming Hardy Weinberg within the subgenomes of the allopolyploid.

We also assessed the accuracy of the model for estimating parameters based on

the true values used for the simulations. Allele frequency estimates for subgenome

two improved as the number of individuals and sequencing coverage were increased

(Figure B.7). Tetraploids showed the highest estimation error for subgenome two

(diploid), followed by octoploids and hexaploids (tetraploid subgenomes), respectively.

This pattern with hexaploids and octoploids is counterintuitive considering that higher

ploidy levels typically result in better estimates of allele frequencies since more alleles

are sampled from the population (Blischak et al., 2016). However, the tetraploid

subgenomes in the hexaploid and octoploid individuals do not show similar levels

of error as would be expected. This is likely a result of subgenome one having

higher ploidy in the octoploid simulations, resulting in a larger number of possible

genotype combinations and therefore higher estimation error (octoploid: 5 × 5 = 25 vs.

hexaploid: 3 × 5 = 15). Figures B.8 and B.9 show the error in genotype estimation in

subgenome one and two, respectively. Here we again observe that higher ploidy levels

have higher levels of estimation error for genotypes. Overall, genotype estimates were

inferred with higher error for subgenome two. This result makes sense given that we

treat the allele frequencies for subgenome one as known but have to estimate them in

subgenome two.
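The genotype-combination counts quoted above follow from each subgenome of ploidy k having k + 1 possible biallelic genotypes. A small Python check, using a hypothetical helper name:

```python
def n_genotype_combinations(ploidy_one, ploidy_two):
    """Joint genotype count for two independent subgenomes."""
    return (ploidy_one + 1) * (ploidy_two + 1)


# Subgenome ploidies used in the simulations (subgenome one, subgenome two).
designs = {"tetraploid": (2, 2), "hexaploid": (2, 4), "octoploid": (4, 4)}
for name, (k1, k2) in designs.items():
    print(name, n_genotype_combinations(k1, k2))
```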

2.5.2 Empirical Data Analysis

Andropogon gerardii

Analyzing and filtering the data sets for hexaploid and nonaploid A. gerardii separately

resulted in slightly different numbers of loci (6N: 83 individuals, 6 928 loci; 9N: 70

individuals, 6 887 loci). The average depth of sequencing coverage was 10.9x for

hexaploids and 10.8x for nonaploids. Though levels of inbreeding for both cytotypes were low, nonaploids showed significantly higher levels of inbreeding than hexaploids

(Figure 2.3a; $F_{1,151} = 36.14$, $p = 1.3 \times 10^{-8}$).

Betula pubescens and B. pendula

The data set for the species of Betula consisted of 130 individuals for B. pubescens

and 34 individuals for B. pendula with genotype data for 49 021 loci. For B. pendula, we inferred allele frequencies and genotypes assuming Hardy Weinberg (HW), as well

as using our model for individual inbreeding coefficients. The log likelihoods of the

two models were very similar and most of the inbreeding coefficients were estimated

to be close to 0, so we used the allele frequency estimates from the HW model as

the reference panel for the allopolyploid model. After estimating the parameters of

this model for B. pubescens using the EM+Brent method, we assessed the accuracy

of our empirical Bayes genotype estimates by comparing them to the original data

set using the root mean squared deviation. This comparison is shown for each of the

possible genotype values (0–4) in Figure 2.3b. The left panel shows the RMSD for

each genotype value and the right panel shows a weighted measure of the RMSD that

corresponds to the relative amount of error based on the frequency of that genotype

in the original data set. For example, we do a poor job of estimating the genotype

when the true value is 0, but very few of the true genotypes have that value (∼0.5%),

so the relative contribution to the overall error is much less. In contrast, roughly

75% of the true genotypes have a value of 4, which is the value that we estimate the

best. In addition, many of the genotypes in B. pendula were equal to 2 (∼88%), so

the estimates of the allele frequencies were very close to 1.0, which could have led to

more error prone estimates of the genotypes in B. pubescens when using them as the

reference panel.
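One way to compute the two quantities shown in Figure 2.3b is sketched below in Python. The per-genotype RMSD groups individuals by their true genotype, and the weighted version multiplies each RMSD by the proportion of true genotypes with that value. This weighting is one plausible reading of the description above and not necessarily the exact calculation used for the figure; the function name and the toy data are also assumptions.

```python
from collections import defaultdict
from math import sqrt


def rmsd_by_genotype(true_genos, est_genos):
    """Return {genotype: (rmsd, frequency-weighted rmsd)} grouped by true genotype."""
    sq_errors = defaultdict(list)
    for truth, est in zip(true_genos, est_genos):
        sq_errors[truth].append((est - truth) ** 2)
    n_total = len(true_genos)
    result = {}
    for geno, errs in sorted(sq_errors.items()):
        value = sqrt(sum(errs) / len(errs))
        weight = len(errs) / n_total  # proportion of true genotypes with this value
        result[geno] = (value, weight * value)
    return result


# Toy data in which most true genotypes equal 4 (illustrative values only).
truth = [4, 4, 4, 4, 4, 4, 3, 3, 0, 2]
calls = [4, 4, 4, 4, 4, 3, 3, 3, 2, 2]
for geno, (value, weighted) in rmsd_by_genotype(truth, calls).items():
    print(geno, round(value, 3), round(weighted, 3))
```

A rare genotype can have a large RMSD and still contribute little to the weighted measure, which is the pattern described for the true value of 0.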

Comparison with GATK

Variant calling and genotype estimation using GATK resulted in 14 931 shared SNPs

between B. pendula and B. pubescens after applying the following filters: biallelic sites

only, variant quality score (QUAL) greater than 30, minimum read depth (DP) per

individual per site of at least five, and a maximum of five missing individuals per

site. Analyzing these same sites for B. pendula using the Hardy Weinberg equilibrium

model produced genotype estimates that were 99.1% identical to the estimates from

GATK. Allele frequencies estimated by the Hardy Weinberg model had an RMSD value of 0.032 when compared to those estimated by GATK (Figure B.10). Similarly

for B. pubescens, genotype estimates combined from the allopolyploid subgenomes

resulted in full genotype estimates that were 96.2% identical to GATK. Most

differences in the estimated genotypes between the two methods were due

to the allopolyploid model inferring one fewer copy of the alternative allele compared

to GATK (Figure B.11). Run times between our models and GATK are not directly

comparable because the latter was used to identify all variants before filtering and it

also performs more steps than genotyping and parameter estimation. However, it is worth noting that the analyses with our models took approximately 3.5s and 43s for

[Figure 2.3 (image not reproduced). Panels: (a) Inbreeding levels in A. gerardii, with inbreeding on the y-axis and ploidy (6N, 9N) on the x-axis; (b) Genotyping error for B. pubescens, with RMSD and RRMSD panels, genotype values 0–4 on the x-axis, and coverage levels c5, c10, c20.]

Figure 2.3: Results of empirical data analyses. (a) Levels of inbreeding in Andropogon gerardii. Inbreeding in the two cytotypes of A. gerardii is generally low, but the nonaploid (9N) samples have higher levels of inbreeding on average. (b) Genotype estimation error in Betula pubescens. The left panel shows the RMSD values for each of the possible full genotypes (0–4; number of alternative alleles in subgenomes one and two). The right panel shows a relative measure of the RMSD where each value is weighted by the occurrence of the particular genotype in the data set (see text for details).

the Hardy Weinberg and allopolyploid models, respectively (measured using the Unix

time command).

2.6 Discussion

Genotyping individuals in a population can be an under-appreciated task,

even though it is typically the first step of any population genetic analysis. This is

especially true for populations of polyploids, where genotyping is further complicated

by duplicated chromosomes and their subsequent genome evolution. Until recently,

genotyping polyploids using high-throughput sequencing data was only possible in

model organisms with reference genomes and/or subgenomes. However, more researchers

have begun genotyping SNPs in both model and non-model organisms using whole genome resequencing and reduced representation methods such as restriction-site

associated DNA sequencing (RADseq) and its variants (e.g., Arnold et al., 2015;

Douglas et al., 2015; Cornille et al., 2016; Zohren et al., 2016). Most of these studies

used existing software to perform SNP calling and genotyping

[e.g., Genome Analysis Toolkit (McKenna et al., 2010), UNEAK (Lu et al., 2012),

TASSEL-GBS (Glaubitz et al., 2014)] but others used novel approaches for estimating

genotypes (e.g., Voorrips et al., 2011; Zohren et al., 2016; Maruki and Lynch, 2017).

A major caveat with these tools, however, is that many of them cannot estimate

inbreeding coefficients for arbitrary ploidy levels in autopolyploids, nor can they

separately estimate genotypes in the subgenomes of an allopolyploid. This is especially

important considering that ignoring the independence of allopolyploid subgenomes

can lead to biases in the estimation of heterozygosity when alternative alleles are

fixed in the individual subgenomes (fixed heterozygosity, Cornille et al., 2016). In

general, our models aim to incorporate more biologically realistic assumptions about

how population-level factors influence the distribution of genotypes in populations of

polyploids, which is critical when conducting population genetic studies in these taxa.

Furthermore, our approaches use genotype likelihoods and produce updated estimates of genotype probabilities given population parameters. These probabilities can be used to propagate the uncertainty in calling genotypes in polyploids to downstream analyses, such as estimating heterozygosity or population differentiation, rather than relying on called genotypes.

Though our models were accurate for many of our simulations and outperformed comparable methods at low depths of sequencing coverage, it is important to consider scenarios when their assumptions are inappropriate. One concern for autopolyploids is the occurrence of double reduction, a process by which alleles in the genotype are identical by descent due to the segregation of sister chromatids to the same gamete during meiosis (Haldane, 1930). As we mentioned before, our model does not directly estimate rates of double reduction. However, because double reduction leads to identity by descent, it contributes to deviations from Hardy Weinberg that are similar to inbreeding. Therefore, our model for individual inbreeding coefficients should be able to accommodate, but not specifically estimate, double reduction.

Allopolyploids present a different set of challenges that are a result of their hybrid origins. In our model, we assume that the two subgenomes of the allopolyploid are completely independent. However, homoeologous recombination can make this assumption inappropriate. Future work that models this exchange of alleles between subgenomes will be an important extension of the model we presented here. Another potential avenue would be to develop ways to use more parental information, as well as demographic parameters to account for the amount of divergence between the allopolyploid and its parents. Models that help to identify parental taxa will also be an important contribution for future research on allopolyploids.

2.7 Conclusions

As methods for the analysis of polyploid data continue to be developed, we are hopeful

that the barriers to more widespread study of these taxa will begin to drop. The

prevalence of polyploidy in plants and other groups of eukaryotes, including fish,

amphibians, and fungi, makes these methods fundamentally important for furthering

our understanding of the impact of WGD on genetic diversity (Rogers, 1973; Otto

and Whitton, 2000; Gregory and Mable, 2005; Wood et al., 2009). Of the main

problems that complicate population genetics in polyploids, modeling allelic inheritance

remains the most difficult. Overall, we believe that using genotype likelihoods when

studying polyploids to overcome difficulties in determining allele copy number and for

dealing with low-coverage sequencing data is a promising approach for future model

development.

2.8 Acknowledgements

The authors thank members of the Wolfe lab, B. Berger, M. Fumagalli, and

three anonymous reviewers for their helpful comments on this manuscript. We also

thank J. Zohren and her colleagues, C. McAllister, and A. Miller for making their

data sets publicly available. Early versions of these models were presented to X. He,

J. Novembre, M. Stephens, and their lab members, and we thank them for their

constructive feedback and advice (especially regarding the use of EM algorithms).

Chapter 3: Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons

Publication Information

This chapter is formatted for this dissertation from the following publication:

Blischak, P. D., M. Latvis, D. F. Morales-Briones, J. C. Johnson, V. S. Di Stilio, A.

D. Wolfe, and D. C. Tank. Fluidigm2PURC: Automated Processing and Haplotype

Inference for Double-Barcoded PCR Amplicons. Applications in Plant Sciences, 6:e1156,

2018.

3.1 Abstract

Premise of the study: Targeted enrichment strategies for phylogenomic inference are

a time- and cost-efficient way to collect DNA sequence data for large numbers of

individuals at multiple, independent loci. Automated and reproducible processing of

these data is a crucial step for researchers conducting phylogenetic studies. Methods

and Results: We present Fluidigm2PURC, an open source Python utility for processing

paired-end Illumina data from double-barcoded PCR amplicons. In combination with

the program PURC (Pipeline for Untangling Reticulate Complexes), our scripts process

raw FASTQ files for analysis with PURC and use its output to infer haplotypes for

diploids, polyploids, and samples with unknown ploidy. We demonstrate the use of

the pipeline with an example data set from the genus Thalictrum L. (Ranunculaceae).

Conclusions: Fluidigm2PURC is freely available for Unix-like operating systems

on GitHub [https://github.com/pblischak/fluidigm2purc] and for all operating

systems through Docker [https://hub.docker.com/r/pblischak/fluidigm2purc].

3.2 Introduction

The collection of large-scale, multilocus data sets for phylogenomic inference has

become an increasingly common method for understanding evolutionary relationships within a group of taxa. Coupled with recent implementations of coalescent-based

species tree estimation programs that take into account the independent histories

of different genes (e.g., SVDquartets, Chifman and Kubatko (2014); ASTRAL-II,

Mirarab and Warnow (2015)), targeted enrichment strategies are powerful methods for

collecting more informative data sets for conducting phylogenomic investigations. Of

the many types of targeted enrichment that exist, several recent studies have begun

to use a method that combines both library preparation and target amplification

into a single step. This process, known as double-barcoded amplicon sequencing

(Uribe-Convers et al., 2016), allows for the collection of multilocus sequence data for

large numbers of individuals that is both time- and cost-effective.

Double-barcoded amplicon sequencing combines the amplification of a targeted

region in the genome with the addition of sample-specific barcodes and Illumina

sequencing adapters to the resulting PCR product for paired-end sequencing on

an Illumina MiSeq platform (Uribe-Convers et al., 2016). This is done by adding

conserved sequence (CS) tags to traditional PCR primers, which act as templates

for adding barcodes and adapters when preparing the sequencing library. Parallel

amplification is most often achieved using microfluidic PCR with the Fluidigm Access

Array (Fluidigm, San Francisco, CA, USA; e.g., Gostel et al., 2015; Uribe-Convers et al., 2016; Kates et al., 2017), allowing for multiple samples and loci to be amplified

simultaneously (minimum of 48 samples × 48 loci). The newer Fluidigm Juno system

can also handle up to 192 samples in a single run, and multiplexing of primer pairs can

allow for even higher throughput, provided that the primers do not interact during

amplification. Double-barcoded amplicons can also be generated by other means using

approaches such as traditional or highly-multiplexed PCR (e.g., Bybee et al., 2011;

Dupuis et al., 2017).

Previous methods to analyze these data have typically relied on generating consensus

sequences using software packages such as Geneious (Kearse et al., 2012; Gostel et al., 2015), HiMAP (Dupuis et al., 2017), or an R script, reduce_amplicons.R, that

is part of the dbcAmplicons package (but see comparison with “occurrence-based”

methods in dbcAmplicons in the Example Analyses section; Uribe-Convers et al.,

2016; Kates et al., 2017). However, using consensus sequences can often ignore important

within-individual level variation, such as differing alleles or levels of ploidy. To

alleviate this issue, and to facilitate the analysis of these data for haplotype inference, we developed Fluidigm2PURC. Fluidigm2PURC consists of two main Python scripts

that process input data files using several external programs (Table 3.1) that automate

quality filtering, read merging, and file formatting for downstream steps (Figure 3.1).

Although it can be used to process any double-barcoded amplicons, the software

derives its name from the method of PCR amplification that we used to generate our

data (Fluidigm Access Array), as well as its primary dependency, PURC, a Python

program that combines sequence clustering and PCR chimera detection (Rothfels et al.,

2017). The final step in the Fluidigm2PURC pipeline processes clusters from PURC

and outputs a FASTA file containing phased haplotypes for all targeted sequences.

This last step has methods for haplotype inference that work on diploids, polyploids,

individuals with unknown ploidy, or any mixture of the three. To demonstrate the utility

of Fluidigm2PURC, we analyzed nuclear amplicon data from the genus Thalictrum

L. (Ranunculaceae) and compared the results with those obtained from dbcAmplicons

using the reduce_amplicons.R script (Uribe-Convers et al., 2016).

3.3 Methods and Results

3.3.1 Input data

The input data for Fluidigm2PURC are paired-end FASTQ files (R1 and R2 for paired

reads) that have been demultiplexed using the program dbcAmplicons (Uribe-Convers et al., 2016). dbcAmplicons demultiplexes reads using the original sample barcodes

and amplicon primer sequences to annotate the reads with the sample and locus

name that each read comes from, followed by trimming these identifying parts of the

sequence. The resulting pair of FASTQ files is then input into the first script in the

pipeline, fluidigm2purc.

3.3.2 Step 1: fluidigm2purc

The fluidigm2purc script takes the paired-end FASTQ files, filters them using Sickle

(Joshi and Fash, 2011; minimum length = 100 bp, PHRED threshold = 20), merges

the filtered reads using FLASH2 (Magoč and Salzberg, 2011), and then converts the

resulting FASTQ files into FASTA files (one for each locus) with sequence header

information that is compatible with PURC. The sequence headers for PURC follow

[Figure 3.1 flowchart, reconstructed as text]

1. fluidigm2purc: Filter and trim reads using Sickle; join paired reads using FLASH2; convert reads to PURC format and write FASTA files for each locus.

2. PURC (purc_recluster.py): Detect chimeras and cluster reads for each taxon at each locus using USEARCH; annotate resulting clusters with size information; combine clusters for each locus across taxa and align with MUSCLE.

3. crunch_clusters: (a) Known ploidy levels: add ploidy levels to the taxon table and infer haplotypes and their dosage. (b) Unknown ploidy levels: infer real haplotypes versus sequencing errors using a likelihood cutoff. Then realign, clean, and return consensus or unique haplotypes.

Figure 3.1: Flowchart outlining the steps for haplotype inference using Fluidigm2PURC.

Dependency | Citation | Link
PURC (v1.02) | Rothfels et al. (2017) | https://bitbucket.org/crothfels/purc
Sickle (v1.33) | Joshi and Fash (2011) | https://github.com/najoshi/sickle
FLASH2 (v2.2.00) | Magoč and Salzberg (2011) | https://github.com/dstreett/FLASH2
MAFFT (v7.237) | Katoh (2013) | http://mafft.cbrc.jp/alignment/software/
Phyutility (v2.7.1) | Smith and Dunn (2008) | https://github.com/blackrim/phyutility

Table 3.1: Dependencies for the Fluidigm2PURC pipeline with version numbers in parentheses.

the format ‘>IndividualName|LocusName|UniqueID#’. When paired reads with

low quality bases are trimmed by Sickle and no longer overlap, we merge them

artificially with multiple N’s inserted between them. The fluidigm2purc script writes

two additional files: (1) the taxon table, a two-column table listing each sequenced

taxon and its ploidy level, and (2) the locus-err table, a two-column table listing each

sequenced locus and the average level of sequencing error for all reads coming from

that locus. The taxon table lists the ploidy as “None” for all individuals by default,

but known ploidy levels can be included by the user (e.g., diploid has the value “2,”

tetraploid has the value “4,” etc.). For the locus-err table, the per locus levels of

sequencing error are calculated individually from the input FASTQ files using the

average PHRED score per read averaged across all reads coming from that locus.
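The three operations described in this step can be sketched in Python as follows. This is an illustration rather than the code in the fluidigm2purc script: the function names, the number of N's used as padding, the reverse-complementing of the second read, the reading of "UniqueID#" as a numeric identifier, and the conversion of a mean PHRED score Q to an error rate via 10^(-Q/10) are all assumptions.

```python
def purc_header(individual, locus, unique_id):
    """Format a FASTA header as '>IndividualName|LocusName|UniqueID#'."""
    return f">{individual}|{locus}|{unique_id}"


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (N is left as N)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def join_unmerged(read1, read2, n_pad=8):
    """Join a non-overlapping read pair with a run of N's between the mates."""
    return read1 + "N" * n_pad + reverse_complement(read2)


def mean_error_rate(quality_strings, offset=33):
    """Average the per-read mean PHRED score across reads, then convert to an error rate."""
    read_means = [
        sum(ord(ch) - offset for ch in qual) / len(qual)
        for qual in quality_strings
    ]
    locus_mean = sum(read_means) / len(read_means)
    return 10 ** (-locus_mean / 10)


# Hypothetical sample name, locus name, reads, and quality strings.
print(purc_header("sample01", "PIS_3", 17))
print(join_unmerged("ACGT", "AACC", n_pad=4))
print(mean_error_rate(["IIII", "5555"]))
```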

3.3.3 Step 2: PURC

The output FASTA files from fluidigm2purc can be run through PURC using the

purc_recluster.py script (Rothfels et al., 2017). This script is used to iteratively

run chimera detection and sequence clustering (performed with USEARCH; Edgar,

2010; Edgar et al., 2011) on each locus individually to produce a reduced set of

putative haplotypes that includes size information about the number of original reads

forming each cluster. Details on running PURC can be found on its Bitbucket page

[https://bitbucket.org/crothfels/purc].

3.3.4 Step 3: crunch_clusters

The clusters output by PURC are then run through our second script, crunch_clusters, which uses the taxon table and locus-err table output by fluidigm2purc (Step 1) to

infer haplotypes in a maximum likelihood framework. This script also has options for

realigning clusters using MAFFT (Katoh, 2013), as well as cleaning the clusters using

Phyutility (Smith and Dunn, 2008). Before haplotypes can be inferred at a locus, we

first do a pairwise comparison of all clusters for each taxon individually and merge any

clusters that are identical (ignoring gaps). This step is necessary because of the initial

trimming/filtering step in the fluidigm2purc script. Artificially joining unmerged reads

often causes two sequencing clusters representing the same haplotype to form: (1)

one cluster for reads that were merged, and (2) one cluster for the reads that were

artificially merged and contain a large number of gapped sites in the middle. In this

case, these two clusters should not be treated as separate haplotypes, so we combine

the clusters by keeping the larger haplotype (i.e., the one with fewer gaps) and adding

the sizes of the two clusters together. The alternative would be to process the original

data by ignoring all reads that did not merge. However, throwing away unmerged

reads could potentially discard sequence variation that should be represented in the

data set, especially if most reads are unmerged, which may be the case for large

amplicons. The downside of merging sequences that are identical except for gaps is

that it potentially discards informative indel variation, although it is unlikely that

a locus within an individual would have only one of its haplotypes containing gaps

and not the others. Overall, we felt that this approach provided the best method for

including more of the original data when inferring haplotypes.
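A minimal Python sketch of this merging rule is given below, assuming the clusters for one taxon are already aligned to equal length and that both '-' and 'N' mark missing positions. The data layout (a list of sequence and size pairs) and the function names are hypothetical and do not describe the internals of crunch_clusters.

```python
GAP_CHARS = set("-Nn")


def same_ignoring_gaps(seq_a, seq_b):
    """True if two aligned sequences agree at every site where neither has a gap."""
    if len(seq_a) != len(seq_b):
        return False
    return all(
        a.upper() == b.upper()
        for a, b in zip(seq_a, seq_b)
        if a not in GAP_CHARS and b not in GAP_CHARS
    )


def n_gaps(seq):
    """Number of gap or N characters in a sequence."""
    return sum(ch in GAP_CHARS for ch in seq)


def merge_identical_clusters(clusters):
    """Merge clusters that match when gaps are ignored; keep the less gapped sequence.

    clusters is a list of (sequence, size) pairs for a single taxon and locus.
    Returns merged pairs sorted from largest to smallest size.
    """
    merged = []
    for seq, size in clusters:
        for i, (kept_seq, kept_size) in enumerate(merged):
            if same_ignoring_gaps(seq, kept_seq):
                best = kept_seq if n_gaps(kept_seq) <= n_gaps(seq) else seq
                merged[i] = (best, kept_size + size)
                break
        else:
            merged.append((seq, size))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


# Hypothetical clusters: the second is a gapped copy of the first.
print(merge_identical_clusters([("ACGTACGT", 40), ("ACG--CGT", 25), ("ACGTTCGT", 30)]))
```

In this sketch a gapped cluster that is compatible with more than one haplotype is assigned to the first match it encounters; a full implementation would need an explicit rule for that case.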

Inferring haplotypes with ploidy information

For known ploidy levels, we use a multinomial likelihood to determine the number

of copies of each potential haplotype using the ordered cluster sizes returned by

PURC (largest to smallest). Given an individual of ploidy level K, we enumerate the

possible haplotype configurations using integer partitions (an unordered

collection of positive integers that sums to K; Stojmenovic and Zoghbi, 1998). Since the cluster

sizes are sorted, we never need to consider more than the K largest clusters.

For example, a tetraploid can have a maximum of four haplotypes, and the integer

partitions to consider are (4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0), (2, 1, 1, 0), and (1, 1, 1, 1).

This corresponds to (4 copies of haplotype one), (3 copies of haplotype one, 1 copy

of haplotype two), (2 copies of haplotype one, 2 copies of haplotype two), (2 copies

of haplotype one, 1 copy of haplotype two, 1 copy of haplotype three), and (1 copy

of haplotype one, 1 copy of haplotype two, 1 copy of haplotype three, 1 copy of

haplotype four). The mathematical details for the likelihood function with an example

calculation are presented in Appendix C. Once the most likely configuration has been

identified, the crunch_clusters script will return each haplotype in proportion to its

representation in the maximum likelihood estimate. We have also provided options to

return only unique haplotypes and to treat loci as haploid, the latter of which can be

used to process organellar data. The haploid option can also be used as an alternative

to finding consensus sequences for nuclear loci by returning only the cluster with the

most reads.
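The enumeration of haplotype configurations can be sketched in Python as below. The partition generator reproduces the five tetraploid configurations listed above. The scoring function is only a stand-in for the likelihood in Appendix C: it assumes reads are drawn from each haplotype in proportion to its copy number, with a hypothetical error rate `eps` spread evenly over the K clusters considered so that clusters assigned zero copies keep a small probability.

```python
from math import lgamma, log


def integer_partitions(k, max_part=None):
    """Yield partitions of k as non-increasing tuples, e.g. (2, 1, 1) for k = 4."""
    if max_part is None:
        max_part = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, max_part), 0, -1):
        for rest in integer_partitions(k - first, first):
            yield (first,) + rest


def padded_partitions(k):
    """Partitions of k padded with zeros to length k."""
    return [p + (0,) * (k - len(p)) for p in integer_partitions(k)]


def multinomial_loglik(sizes, copies, eps=0.01):
    """Multinomial log-likelihood of cluster sizes given haplotype copy numbers."""
    k = len(copies)
    # Illustrative error model: mix copy-number proportions with a uniform term.
    probs = [(1.0 - eps) * c / sum(copies) + eps / k for c in copies]
    n = sum(sizes)
    loglik = lgamma(n + 1) - sum(lgamma(s + 1) for s in sizes)
    return loglik + sum(s * log(p) for s, p in zip(sizes, probs))


def best_configuration(cluster_sizes, ploidy, eps=0.01):
    """Most likely copy-number configuration for the largest `ploidy` clusters."""
    sizes = sorted(cluster_sizes, reverse=True)[:ploidy]
    sizes += [0] * (ploidy - len(sizes))
    return max(
        padded_partitions(ploidy),
        key=lambda copies: multinomial_loglik(sizes, copies, eps),
    )


print(padded_partitions(4))
# Hypothetical tetraploid with two clusters of 150 and 50 reads.
print(best_configuration([150, 50], ploidy=4))
```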

Inferring haplotypes without ploidy information

For unknown ploidy levels, we no longer have information about the maximum number

of haplotypes that an individual can have. However, we can use the cluster sizes to

infer which clusters from PURC are actual haplotypes versus those that are likely to

be sequencing errors. We do this by calculating the likelihood that each successive

haplotype in the sorted list is a “real” haplotype versus a sequencing error. As an

example, consider a tetraploid with six clusters identified by PURC. We first calculate

the likelihood that all clusters are errors. Then we calculate the likelihood that

cluster one is a real haplotype, and two through six are errors. Next, we calculate the

likelihood that clusters one and two are real haplotypes, and that three through six

are errors. This continues until we calculate the likelihood of all six clusters being real

haplotypes. We then apply a cutoff that uses the relative increase in the likelihood when an additional haplotype is added. If treating an additional cluster as a haplotype

increases the likelihood by less than the cutoff then only the previous haplotypes are

kept and the others are considered errors. We use a default cutoff of 10% increase in

the likelihood. An example with the likelihood function that we use for this approach

is provided in Appendix C.
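The stopping rule can be sketched independently of the likelihood function itself, which is given in Appendix C. In the Python sketch below, the caller supplies the likelihood of 0, 1, 2, and so on of the sorted clusters being real haplotypes. How the relative increase is defined (here, the change divided by the absolute value of the previous value, which also works when log-likelihoods are supplied) is an assumption, as is the function name.

```python
def n_real_haplotypes(likelihoods, cutoff=0.10):
    """Number of leading clusters kept as haplotypes under a relative-increase cutoff.

    likelihoods[h] is the likelihood that the first h clusters are real
    haplotypes and the remaining clusters are errors (h = 0..n_clusters).
    Values must be nonzero because each is used as a denominator.
    """
    for h in range(1, len(likelihoods)):
        previous, current = likelihoods[h - 1], likelihoods[h]
        relative_increase = (current - previous) / abs(previous)
        if relative_increase < cutoff:
            # Adding cluster h does not help enough: keep only the first h - 1.
            return h - 1
    return len(likelihoods) - 1


# Hypothetical log-likelihoods for zero to six real haplotypes.
loglik = [-950.0, -400.0, -210.0, -120.0, -112.0, -111.0, -110.9]
print(n_real_haplotypes(loglik))
```

Separating the cutoff from the likelihood calculation makes it easy to explore how sensitive the number of retained haplotypes is to the choice of cutoff.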

Species | Ploidy Level | Collection Information
T. thalictroides: (L.) A. J. Eames & B. Boivin | 2N=2X=14 | V. Di Stilio, 123, WTU
T. squarrosum: Stephan ex Willd. | 2N=6X=42 | V. Di Stilio & X. Duan Thalictrum sp#8, 20120617, PE
T. macrostylum: Shuttlew. ex Small & A. Heller | 2N=8X=56 | R. Penny (unvouchered)
T. pubescens: Pursh | 2N=12X=84 or 2N=22X=154 | D. Baum & D. Howarth, 375, A
T. revolutum: DC. | 2N=20X=144 | V. Soza, 1917, WTU
T. dasycarpum: Fisch., C. A. Mey. & Avé-Lall. | 2N=22X=154 | V. Di Stilio, 110, WTU

Table 3.2: Thalictrum L. species included in the comparison of Fluidigm2PURC and dbcAmplicons. Collection information is listed as the collector(s), collection number, and the herbarium. All additional information is available from Soza et al. (2013), Tables S1, S3, and S4. For all analyses, T. pubsescens was analyzed at the 22X level.

3.3.5 Example analysis

To demonstrate the use of the Fluidigm2PURC pipeline, we analyzed amplicon

sequence data generated from orthologs of the nuclear gene PISTILLATA (PI) in the

genus Thalictrum L. (Ranunculaceae), which is single copy in diploids and two-copy in

tetraploids (Di Stilio et al., 2005). PI is responsible for establishing stamen and petal

identity during flower development in Arabidopsis thaliana (Goto and Meyerowitz,

1994), and has been used to detect reticulation in polyploid Lepidium L. (Brassicaceae)

(Lee et al., 2002; Soza et al., 2014). Given the length of the PI locus, primers were

designed to sequence exons 3 to 6 in two overlapping 600-bp segments: exons 3 to 5

(PIS_4) and exons 4 to 6 (PIS_3). Our analyses focused on six species with known

ploidy levels ranging from diploid (2N=2X=14) to 22-ploid (2N=22X=154). These

species are presented in Table 3.2, with accession numbers following Soza et al. (2013).

Paired-end reads were demultiplexed and annotated using dbcAmplicons (Uribe-Convers et al., 2016) followed by read trimming, merging, and sequence renaming

using the fluidigm2purc script with default options. All reads coming from PIS_3

and PIS_4 were then run separately through PURC using the purc_recluster.py

script (Rothfels et al., 2017). After clustering and chimera detection, we determined

haplotypes for each amplicon using three different approaches: (1) consensus sequences

using the ‘--haploid’ option, (2) unique haplotypes assuming unknown ploidy (10%

likelihood cutoff), and (3) unique haplotypes using known ploidy. For each of these

methods, we realigned and cleaned the sequences using MAFFT (Katoh, 2013) and

Phyutility (added the options ‘--realign --clean 0.33’; Smith and Dunn, 2008).

As a comparison, we also analyzed these data using the reduce_amplicons.R script

from the dbcAmplicons package (v0.8.5; Uribe-Convers et al., 2016). This script

merges paired-end reads using FLASH2 (Magoč and Salzberg, 2011) and allows for a

global read trimming size to be used for read one, read two, or both. Unmerged reads

are treated independently, resulting in separate haplotypes for read one and read two.

The final result is a FASTA file with the unaligned haplotypes that can be further

processed for downstream applications. We generated consensus haplotypes as well as

haplotypes based on read occurrence (controlled by the minimum read frequency and

minimum read count) using the default settings, and trimmed 20 bp from read one and

40 bp from read two. We then aligned the resulting sequences using MAFFT (Katoh,

2013). These results were compared to the haplotypes from Fluidigm2PURC based

on (1) the number of recovered haplotypes, (2) the length of the resulting alignment,

and (3) the amount of gaps in the alignment.
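The three comparison criteria can be illustrated with a short sketch. The statistics reported in this chapter were computed in Geneious and MEGA, so the code below is only an approximation under stated assumptions: the function name `alignment_stats` is hypothetical, percent gaps is assumed to be the fraction of all alignment cells that are gap characters, and a site is counted as parsimony informative when at least two nucleotides each occur in at least two sequences.

```python
def alignment_stats(seqs):
    """Summarize a list of aligned sequences (equal-length strings)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Assumed definition: gap characters over all cells in the alignment.
    n_gaps = sum(s.count("-") for s in seqs)
    percent_gaps = 100.0 * n_gaps / (length * len(seqs))

    informative_sites = 0
    for column in zip(*seqs):
        counts = {}
        for base in column:
            base = base.upper()
            if base in ("A", "C", "G", "T"):
                counts[base] = counts.get(base, 0) + 1
        # Informative: at least two states, each seen in at least two sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            informative_sites += 1

    return {
        "n_haplotypes": len(seqs),
        "alignment_length": length,
        "percent_gaps": percent_gaps,
        "informative_sites": informative_sites,
    }


# Toy alignment of four haplotypes (made-up data).
stats = alignment_stats(["ACGT-A", "ACGTTA", "ATGT-C", "ATGTTC"])
print(stats)
```

In the toy alignment, two of the 24 cells are gaps and two columns (the second and the sixth) are parsimony informative.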

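| Haplotype method | Statistic | PIS_3 Fluidigm2PURC | PIS_3 reduce_amplicons | PIS_4 Fluidigm2PURC | PIS_4 reduce_amplicons |
|---|---|---|---|---|---|
| Consensus | Alignment length (bp) | 395 | 415 | 403 | 418 |
| Consensus | Percent gaps | 27.9 | 30.0 | 1.9 | 4.4 |
| Consensus | P.I.S. | 19 | 19 | 11 | 10 |
| Unknown Ploidy/Occurrence | Number of haplotypes | 18 | 5 | 14 | 6 |
| Unknown Ploidy/Occurrence | Alignment length (bp) | 395 | 403 | 403 | 424 |
| Unknown Ploidy/Occurrence | Percent gaps | 16.2 | 48.9 | 2.7 | 7.1 |
| Unknown Ploidy/Occurrence | P.I.S. | 48 | 11 | 43 | 10 |
| Known Ploidy | Number of haplotypes | 57 | – | 43 | – |
| Known Ploidy | Alignment length (bp) | 395 | – | 403 | – |
| Known Ploidy | Percent gaps | 10.3 | – | 2.6 | – |
| Known Ploidy | P.I.S. | 81 | – | 62 | – |

Table 3.3: Overall alignment statistics for the comparison between Fluidigm2PURC and the reduce_amplicons.R script. P.I.S. = parsimony informative sites.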

Results

Haplotypes inferred by both methods were visualized and compared using alignment

statistics computed in Geneious v8.1.8 (Kearse et al., 2012) and MEGA v7.0.18 (Kumar et al., 2016). Consensus sequences from the Fluidigm2PURC and dbcAmplicons

pipelines were similar overall, with the reduce_amplicons.R script producing longer

haplotypes, but containing more gaps (Table 3.3). We then compared the occurrence-

based method from the reduce_amplicons.R script with the crunch_clusters results when ploidy levels are treated as unknown. In this case, Fluidigm2PURC recovered

more haplotypes with fewer gaps and more parsimony informative sites. We believe

the reason that the reduce_amplicons.R script recovered so few haplotypes is due to its

use of minimum read count and frequency criteria that rely on reads being identical to

form haplotypes, rather than clustering based on similarity. Inferring haplotypes with

Fluidigm2PURC using known ploidy levels resulted in the largest number of recovered haplotypes. Using known rather than unknown ploidy levels produced more haplotypes (PIS_3: 57 vs. 18; PIS_4: 43 vs. 14) because the cluster sizes that went into the likelihood calculation were disparate for some species (a few large clusters and many smaller ones). Without prior knowledge about how many haplotypes to expect, the smaller clusters were difficult to distinguish from errors. On a per-species basis, using known ploidy levels led to at least as many inferred haplotypes as assuming unknown ploidy, and usually more (Table 3.4). For example, the PIS_4 region for

Thalictrum pubescens recovered 15 haplotypes when assuming known ploidy (analyzed

as 22X), but only one haplotype when assuming unknown ploidy. The reason for this

is that the cluster data for this species had one putative haplotype with many reads

(147), but all other putative haplotypes had far fewer reads (the next largest cluster

had 25 reads, and nine clusters had fewer than 10 reads). In general, drawing the line

between real haplotypes and errors for clusters with lower read counts is difficult when

the ploidy level is unknown. By applying a threshold, the method we implement is a

conservative way to estimate haplotypes that only includes clusters with the highest

read counts.

Code and Data Availability

The code for each step of our example analysis is available in Appendix C. Raw

sequence data from the PIS_3 and PIS_4 loci for the six sampled Thalictrum

species, as well as all output FASTA files from the Fluidigm2PURC and dbcAmplicons

pipelines, are available on Dryad.

| Species | Ploidy | PIS_3 Known | PIS_3 Unknown | PIS_4 Known | PIS_4 Unknown |
|---|---|---|---|---|---|
| T. thalictroides | 2N=2X=14 | 1 (2.3) | 1 (2.3) | 1 (0.8) | 1 (0.8) |
| T. squarrosum | 2N=6X=42 | 6 (3.3) | 4 (3.1) | 4 (3.9) | 3 (3.2) |
| T. macrostylum | 2N=8X=56 | 8 (13.9) | 5 (20.9) | 4 (3.3) | 4 (3.3) |
| T. pubescens | 2N=12X=84 or 2N=22X=154 | 12X: 11 (3.7); 22X: 19 (3.3) | 1 (1.5) | 12X: 9 (2.3); 22X: 15 (2.3) | 1 (1.2) |
| T. revolutum | 2N=20X=140 | 10 (26.9) | 4 (40.9) | 5 (2.5) | 3 (2.9) |
| T. dasycarpum | 2N=22X=154 | 13 (9.2) | 3 (2.8) | 14 (2.4) | 2 (2.1) |

Table 3.4: Per species data for the number of haplotypes inferred by Fluidigm2PURC using known vs. unknown ploidy. Data are presented as: number of inferred haplotypes (average percent gaps per haplotype). For Thalictrum pubescens, haplotypes are presented at both the 12X and 22X level.

3.4 Conclusions

The ability to infer haplotypes regardless of an individual’s ploidy level is a crucial

step toward understanding the complex relationships within many plant groups, whose evolutionary histories often contain multiple instances of hybridization and whole genome duplication (Soltis and Soltis, 2009; Van de Peer et al., 2009). As

models that accommodate these processes continue to be developed (e.g., Jones et al.,

2013; Solís-Lemus and Ané, 2016; Oberprieler et al., 2017; Thomas et al., 2017; Wen

and Nakhleh, 2018), we anticipate that the functionality of our pipeline will be

especially useful for conducting phylogenomic studies with nuclear sequence data.

Furthermore, the increase in genomic resources for taxa across the Plant Tree of Life will continue to facilitate the process of phylogenetic marker development, allowing

more researchers to take advantage of targeted enrichment strategies such as double-

barcoded amplicon sequencing. Compared with existing approaches for analyzing

these data, the methods we present here offer an improved workflow for sequence

processing, clustering, and haplotype inference, and are particularly well suited for

analyses in taxa with incomplete knowledge about ploidy levels.

3.5 Availability

Fluidigm2PURC is open source software that is freely available on GitHub [https://github.com/pblischak/fluidigm2purc] for Unix-like operating systems (Mac, Linux) under the GNU General Public License v3. We have also built a Docker image with all dependencies (Table 3.1) pre-installed for use on any operating system with a compatible distribution of the Docker software [https://hub.docker.com/r/pblischak/fluidigm2purc] (https://www.docker.com; Merkel, 2014). Fluidigm2PURC is written in Python and has been successfully tested using Python versions 2.7 and 3.6. Documentation for the software can be found on ReadTheDocs [http://fluidigm2purc.readthedocs.io].

3.6 Acknowledgements

The authors thank C. Rothfels for helpful discussions regarding the use of PURC.

We also thank S. Uribe-Convers for providing valuable feedback while testing the

Fluidigm2PURC code. This work was supported by the following grants from the

National Science Foundation (NSF): DEB-1455399 (ADW, L. S. Kubatko), DEB-

1253463 (DCT), and IOS-1121669 (VSD), with additional support to DCT and VSD

from the NSF BEACON Center for the Study of Evolution in Action (DBI-0939454).

Chapter 4: Inferring Species Trees and Networks from Gene Tree Quartet Site Patterns: An Example from the Plant Genus Penstemon (Plantaginaceae)

4.1 Abstract

Reticulate evolutionary events are hallmarks of plant phylogeny, and are increasingly

recognized as common occurrences in other branches of the Tree of Life. However,

inferring the evolutionary history of admixed lineages presents a difficult challenge for

systematists due to genealogical discordance caused by both incomplete lineage sorting

(ILS) and hybridization. Methods that accommodate both of these processes are

continuing to be developed, but they often do not scale well to larger numbers of species.

An additional complicating factor for many plant species is the occurrence of whole

genome duplication (WGD), which can have various outcomes on the genealogical

history of haplotypes sampled from the genome. In this study, we sought to investigate

patterns of hybridization and WGD in two subsections from the genus Penstemon

(Plantaginaceae; subsect. Humiles and Proceri), a speciose group of angiosperms that

has rapidly radiated across North America. Species in subsect. Humiles and Proceri

occur primarily in the Pacific Northwest of the USA, occupying habitats such as mesic,

subalpine meadows, as well as more well-drained substrates at varying elevations.

Ploidy levels in the subsections range from diploid to hexaploid, and it is hypothesized

that most of the polyploids are hybrids (i.e., allopolyploids). To estimate phylogeny in these groups, we first developed a method for estimating quartet concordance factors (QCFs) from multiple sequences sampled per lineage, allowing us to model all haplotypes from a polyploid. QCFs represent the proportion of gene trees that support a particular species quartet relationship, and are used for species network estimation in the program SNaQ (Solís-Lemus & Ané. 2016. PLoS Genet. 12:e1005896). Using phased haplotypes for nuclear amplicons, we inferred species trees and networks for 38 taxa from P. subsect. Humiles and Proceri. Our phylogenetic analyses recovered two clades comprising a mix of taxa from both subsections, indicating that the current taxonomy for these groups is inconsistent with our estimates of phylogeny. In addition, there was little support for hypotheses regarding the formation of putative allopolyploid lineages. Overall, we found evidence for the effects of both ILS and admixture on the evolutionary history of these species, but were able to evaluate our taxonomic hypotheses despite high levels of gene tree discordance. Our method for estimating

QCFs from multiple haplotypes also allowed us to include species of varying ploidy levels in our analyses, which we anticipate will help to facilitate estimation of species networks in other plant groups as well.

4.2 Introduction

Phylogenetic inference with multiple gene sequences has emerged as a dominant paradigm in systematics, with multilocus datasets ranging in size from just a handful of genes, to thousands of loci pulled from whole genomes. Discordant signals from these different gene regions can often be present, however, raising the issue of how to model the incongruence among the sampled gene trees from the underlying species

tree. The multispecies coalescent (MSC) model is one approach for species tree

estimation from multilocus data that can accommodate gene tree discordance caused

by incomplete lineage sorting (ILS) (reviewed in Degnan and Rosenberg, 2009). The

appeal of the MSC model stems from its connection with concepts in population

genetics (Wright-Fisher model; Kingman, 1982), and its explicit predictions regarding

the amount of gene tree discordance that should be present for a given species tree

(Tavaré, 1984; Pamilo and Nei, 1988; Takahata, 1989). Nevertheless, despite the

popularity of the coalescent model, it has been shown that it can be a poor fit to

empirical data sets (Reid et al., 2014; Gruenstaeudl et al., 2015). A potential reason

for the poor performance of the MSC in empirical data is that it only models ILS,

leaving other processes that generate genealogical discordance, such as gene flow and

hybridization, unaccounted for (Maddison, 1997).

An alternative to using the coalescent to model gene tree discordance is to use

the concept of a “gene to tree map,” wherein gene tree topologies are mapped to

possible species tree topologies without assuming an underlying process. This was

the approach taken by Ané et al. (2007), who used a Bayesian framework to estimate

a species tree by maximizing gene tree concordance. Implemented in the software

BUCKy (Larget et al., 2010), this method relies on the concept of concordance factors,

or the proportion of gene trees for which a given bipartition is true (Baum, 2007).

The resulting phylogenetic estimate is referred to as the primary concordance tree

(PCT), and can be estimated even if ILS is not the only process affecting gene tree

incongruence. Larget et al. (2010) also introduced the concept of a population tree, which uses the average concordance factors for all quartets on an internal branch

of the PCT to calculate branch lengths in coalescent units. Because concordance

factors contain information about both ILS and gene flow, Solís-Lemus and Ané (2016)

developed a method for estimating species networks (species trees with reticulate

edges) from concordance factors estimated for quartets of species. Their method,

called SNaQ (Species Networks applying Quartets), uses these quartet concordance

factors (QCFs) to maximize a pseudolikelihood function that matches the expected

QCF values under the coalescent model with hybridization (Meng and Kubatko, 2009)

and the observed QCFs.

Despite the availability and success of the SNaQ method, there remain several areas where we believe the estimation of the QCF input can be improved. First, estimating

QCFs for multiple individuals or haplotypes per species is not easily accomplished

using BUCKy or the methods available in the PhyloNetworks package that implements

the SNaQ method. This is problematic not only because having multiple alleles

sampled from a population can increase phylogenetic resolution (Andermann et al.,

2018), but also because many hybrid plant lineages are polyploids, which means that

not all of their homoeologs can be modeled simultaneously. Second, BUCKy requires

the estimation of posterior distributions of gene trees, which can be computationally

demanding for large numbers of loci and/or large numbers of sampled alleles. Using

gene tree posteriors is a common way to deal with gene tree uncertainty (utilized by

several methods in PhyloNet; Wen et al., 2018), but other methods, such as those

using site pattern frequencies, would allow for faster computation of the gene tree

quartet topologies that are used to estimate QCFs.

To address the issues listed above, we first developed a method for estimating

concordance factors directly from sequence data for quartets of species. Our method

accommodates multiple haplotypes sampled per species, and can conduct bootstrapping

to account for gene tree uncertainty. To validate the method, we simulated multilocus

sequence data on both tree and network topologies to assess how accurately it could

estimate QCFs. We then collected nuclear amplicon data for two subsections in the

plant genus Penstemon (Plantaginaceae; subsect. Humiles and Proceri). Subsections

Humiles and Proceri are known to hybridize, and have the additional complication

of containing multiple polyploid species. Using phased haplotypes from the nuclear

amplicon sequences, we estimated species trees and networks using four different

approaches to evaluate whether the current circumscription of the subsections is in agreement with our phylogenetic estimates. Overall, we found strong evidence for hybridization

in subsect. Humiles and Proceri, but phylogenetic support was generally lacking for

many of the species relationships, owing to large amounts of genealogical discordance.

Given the pace with which Penstemon has recently radiated, these types of patterns

are not unexpected (Wolfe et al., 2006). Nevertheless, our phylogenetic estimates

showed some stable relationships among the different methods used, and suggest that

the current taxonomy of the two subsections needs revising.

4.3 Approach

We begin with a brief description of our method for QCF estimation, which uses site

patterns to estimate the gene tree quartet relationships that are used to calculate

species-level concordance factors. The basis for this method stems from ideas regarding

the use of phylogenetic invariants for the inference of phylogenetic trees (Allman et al.,

2008; Chifman and Kubatko, 2015). For a given quartet of species, the QCF values

represent the proportion of gene trees that agree with each of the three possible

unrooted topologies relating the four species: ((1,2)(3,4)), ((1,3)(2,4)), and ((1,4)(2,3)).

Because they represent a single bipartition among four species, these topologies are

referred to as “splits”, and are denoted 12|34, 13|24, and 14|23 (Chifman and Kubatko,

2015). Estimating the species-level concordance factors then amounts to estimating

quartet topologies for each gene, followed by tabulating which species-level split is

supported by each gene. When there are multiple haplotypes at a locus, we consider all

of their possible sampling combinations and calculate the gene tree quartet topologies

that support each species-level relationship. Using this approach, we are able to quickly

estimate QCF values for samples with different ploidy levels. Below we detail our

method for scoring gene tree quartet topologies, and combining them across sampled

haplotypes to get species-level QCFs.

4.3.1 Calculating Quartet Concordance Factors

Consider four species (1, 2, 3, and 4) with DNA sequence data collected at G

independent loci, and with haplotypes phased and aligned for each locus. For example,

a diploid individual from species 1 might have two haplotypes at locus one, which we denote $1_{(1,1)}$ and $1_{(1,2)}$. For any species, $S$, we denote its haplotypes at each gene by indexing across the gene number ($g = 1, \ldots, G$) and the haplotype number ($h = 1, \ldots, s_g$; where $s_g$ is the number of haplotypes present in species $S$ at gene $g$). Then, for each gene, we score the three possible splits of all haplotype combinations using the frequency of matching site patterns. To get these scores, we first calculate the number of times a pair of species, $A$ and $B$, have the same nucleotide at each site in the alignment:

$$ m_{AB} = \sum_{i} I\left(A^{(i)} = B^{(i)}\right). \tag{4.1} $$

Here $I()$ is the indicator function and is equal to 1 if the two bases are the same, and 0 otherwise.

We then use the $m_{IJ}$'s to calculate scores for each of the three unrooted quartet topologies:

$$
\begin{aligned}
G(1, 2 \mid 3, 4) &= m_{12} + m_{34} - (m_{13} + m_{14} + m_{23} + m_{24}), \\
G(1, 3 \mid 2, 4) &= m_{13} + m_{24} - (m_{12} + m_{14} + m_{23} + m_{34}), \\
G(1, 4 \mid 2, 3) &= m_{14} + m_{23} - (m_{12} + m_{13} + m_{24} + m_{34}).
\end{aligned}
\tag{4.2}
$$

For these scores, patterns of nucleotide substitution that support a given split are given positive weight, while those that support alternative topologies are given negative weight. At the species level, we tabulate the number of gene trees supporting these same splits, and add 1 to the species topology that corresponds to the gene split with the highest score. If there is a tie for the highest score, we add 0.5 to the two species-level splits for the highest scoring gene trees. If all three gene tree splits have the same score, then 1/3 is added to each species-level split.
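The scoring and tie-handling rules above can be sketched in a few lines. This is an illustrative sketch and not the code in the qcf package (which is written in C++): the function names `match_counts`, `split_scores`, and `split_weights` are hypothetical, and gap or missing characters are compared like any other character, which is a simplification.

```python
from itertools import combinations

# The three unrooted splits, each written as two pairs of species labels.
SPLITS = {
    "12|34": ((1, 2), (3, 4)),
    "13|24": ((1, 3), (2, 4)),
    "14|23": ((1, 4), (2, 3)),
}


def match_counts(seqs):
    """Eq. 4.1: number of sites at which each pair of sequences is identical.

    seqs maps the species labels 1-4 to aligned sequences of equal length.
    """
    return {
        (a, b): sum(x == y for x, y in zip(seqs[a], seqs[b]))
        for a, b in combinations(sorted(seqs), 2)
    }


def split_scores(seqs):
    """Eq. 4.2: matches within the two pairs minus matches across them."""
    m = match_counts(seqs)
    total = sum(m.values())
    scores = {}
    for name, (pair1, pair2) in SPLITS.items():
        within = m[pair1] + m[pair2]
        scores[name] = within - (total - within)
    return scores


def split_weights(scores):
    """Give weight 1, 1/2, or 1/3 to the best-scoring split(s)."""
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return {name: (1.0 / len(winners) if name in winners else 0.0)
            for name in scores}


# Toy quartet in which species 1 and 2 share one sequence and 3 and 4 another.
toy = {1: "AAAA", 2: "AAAA", 3: "CCCC", 4: "CCCC"}
print(split_scores(toy))
print(split_weights(split_scores(toy)))
```

For the toy quartet, only the split 12|34 receives a positive score, so it receives the full weight of 1.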

The calculation of concordance factors for each species-level quartet is then done

by summing over all genes and tabulating how often each possible split is supported

by each gene. This sum is also taken over all possible combinations of haplotypes,

giving the following equation for calculating QCFs:

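$$ CF_{12|34} \propto \sum_{g=1}^{G} \left( \sum_{a=1}^{1_g} \sum_{b=1}^{2_g} \sum_{c=1}^{3_g} \sum_{d=1}^{4_g} \operatorname{argmax}\left( G\left(1_{(g,a)}, 2_{(g,b)} \mid 3_{(g,c)}, 4_{(g,d)}\right) \right) \right). \tag{4.3} $$

Here, argmax() is an indicator that is 1 if $G(1, 2 \mid 3, 4)$ is the maximum argument and 0 otherwise, and the upper limits $1_g, \ldots, 4_g$ are the numbers of haplotypes present in species 1 through 4 at gene $g$. Ties are handled as described above. The calculations of $CF_{13|24}$ and $CF_{14|23}$ are the same as above but with the species and sums switched into the correct order for the split under consideration. All three species-level concordance factors are then normalized by their sum.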
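The sum over genes and haplotype combinations can be sketched as below. This is an illustrative sketch under stated assumptions rather than the qcf implementation: the function names and the input layout (a list of genes, each a dictionary mapping the species labels 1 to 4 to a list of aligned haplotypes) are hypothetical, and at least one gene with haplotypes for all four species is assumed so that the normalizing sum is nonzero.

```python
from itertools import product


def split_weights(h1, h2, h3, h4):
    """Tie-aware weights (1, 1/2, or 1/3) for one quartet of haplotypes."""
    def m(a, b):
        return sum(x == y for x, y in zip(a, b))

    m12, m13, m14 = m(h1, h2), m(h1, h3), m(h1, h4)
    m23, m24, m34 = m(h2, h3), m(h2, h4), m(h3, h4)
    scores = {
        "12|34": m12 + m34 - (m13 + m14 + m23 + m24),
        "13|24": m13 + m24 - (m12 + m14 + m23 + m34),
        "14|23": m14 + m23 - (m12 + m13 + m24 + m34),
    }
    best = max(scores.values())
    winners = [s for s in scores if scores[s] == best]
    return {s: (1.0 / len(winners) if s in winners else 0.0) for s in scores}


def quartet_cfs(genes):
    """Sum split weights over genes and haplotype combinations, then normalize."""
    totals = {"12|34": 0.0, "13|24": 0.0, "14|23": 0.0}
    for gene in genes:
        # Every way of drawing one haplotype from each of the four species.
        for h1, h2, h3, h4 in product(gene[1], gene[2], gene[3], gene[4]):
            for split, weight in split_weights(h1, h2, h3, h4).items():
                totals[split] += weight
    norm = sum(totals.values())
    return {split: value / norm for split, value in totals.items()}


# Two toy genes; species 1 has two haplotypes at the first gene.
genes = [
    {1: ["AAAA", "AAAT"], 2: ["AAAA"], 3: ["CCCC"], 4: ["CCCC"]},
    {1: ["ACGT"], 2: ["TTTT"], 3: ["ACGT"], 4: ["TTTT"]},
]
print(quartet_cfs(genes))
```

In the toy data, both haplotype combinations at the first gene support 12|34 and the single combination at the second gene supports 13|24, giving concordance factors of 2/3, 1/3, and 0. Note that, as in the sum above, a gene with more haplotype combinations contributes more terms before normalization.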

4.3.2 Bootstrapping and Gene Tree Uncertainty

To deal with uncertainty in gene tree quartet estimation, we can also conduct bootstrap

resampling of sites within genes when calculating the gene tree split scores. If we

conduct B rounds of resampling, the gene tree contributions to the species-level splits

can then be calculated across bootstrap replicates, with each gene tree split getting a weight proportional to the number of times it was the best scoring topology across all

replicates:

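$$ \tilde{G}\left(1_{(g,a)}, 2_{(g,b)} \mid 3_{(g,c)}, 4_{(g,d)}\right) = \frac{1}{B} \sum_{r=1}^{B} \operatorname{argmax}\left( G^{(r)}\left(1_{(g,a)}, 2_{(g,b)} \mid 3_{(g,c)}, 4_{(g,d)}\right) \right). \tag{4.4} $$

Here $r$ indexes the bootstrap replicates, and $G^{(r)}$ is the split score computed from the sites resampled in replicate $r$. The species-level QCF value is then taken as the sum over these bootstrap-weighted gene tree splits:

$$ CF_{12|34} \propto \sum_{g=1}^{G} \left( \sum_{a=1}^{1_g} \sum_{b=1}^{2_g} \sum_{c=1}^{3_g} \sum_{d=1}^{4_g} \tilde{G}\left(1_{(g,a)}, 2_{(g,b)} \mid 3_{(g,c)}, 4_{(g,d)}\right) \right). \tag{4.5} $$

As before, $CF_{13|24}$ and $CF_{14|23}$ are calculated in a similar way, such that their indices are in the correct order. These CF values are then normalized by their sum.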
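The bootstrap weighting for a single quartet of haplotypes can be sketched as follows. This is a minimal sketch under stated assumptions, not the qcf implementation: the function names, the `n_boot` and `seed` parameters, and the use of Python's `random` module are all choices made for illustration. Sites are resampled with replacement, the best split is recorded for each replicate (with ties shared as described earlier), and the weights are averaged over replicates.

```python
import random


def split_weights(h1, h2, h3, h4):
    """Tie-aware weights (1, 1/2, or 1/3) for one quartet of haplotypes."""
    def m(a, b):
        return sum(x == y for x, y in zip(a, b))

    m12, m13, m14 = m(h1, h2), m(h1, h3), m(h1, h4)
    m23, m24, m34 = m(h2, h3), m(h2, h4), m(h3, h4)
    scores = {
        "12|34": m12 + m34 - (m13 + m14 + m23 + m24),
        "13|24": m13 + m24 - (m12 + m14 + m23 + m34),
        "14|23": m14 + m23 - (m12 + m13 + m24 + m34),
    }
    best = max(scores.values())
    winners = [s for s in scores if scores[s] == best]
    return {s: (1.0 / len(winners) if s in winners else 0.0) for s in scores}


def bootstrap_split_weights(h1, h2, h3, h4, n_boot=100, seed=None):
    """Average the split weights over bootstrap resamples of the sites."""
    rng = random.Random(seed)
    n_sites = len(h1)
    totals = {"12|34": 0.0, "13|24": 0.0, "14|23": 0.0}
    for _ in range(n_boot):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = ["".join(h[i] for i in cols) for h in (h1, h2, h3, h4)]
        for split, weight in split_weights(*resampled).items():
            totals[split] += weight
    return {split: value / n_boot for split, value in totals.items()}


# Every site supports 12|34 here, so every replicate picks the same split.
print(bootstrap_split_weights("AAAA", "AAAA", "CCCC", "CCCC", n_boot=50, seed=1))
```

These bootstrap weights would then take the place of the 0/1 indicator when summing over genes and haplotype combinations.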

4.3.3 Validating QCF Estimation

[Figure 4.1 image: panel (a) shows a tree and panel (b) a network on taxa A–F, with internal branches labeled 0.5, 1.0, and 2.0; the network in (b) includes a reticulation labeled γ = 0.6.]

Figure 4.1: Simulation setup for (a) tree and (b) network topologies. Internal branches are annotated with their lengths in coalescent units (CUs). The total tree height is 4.0 CUs.

We validated our approach for QCF estimation using simulations on both tree and network topologies (Figure 4.1). The details of these simulations can be found in Appendix D. In general, our method for calculating QCF values produced accurate estimates when compared to the true simulated data, with RMSD values ranging from 0.019–0.042 (0.023–0.059 for bootstrapped) and 0.019–0.036 (0.021–0.043 for bootstrapped) for the tree and network topologies, respectively (Tables D.4 and D.5). Figures D.1 and D.2 show plots of fitted linear regression models for each quartet of species and the corresponding QCF estimate for each of the three unrooted topologies.
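For readers unfamiliar with the accuracy measure, the root-mean-square deviation (RMSD) between estimated and true QCF values can be computed as in the sketch below. The function name `rmsd` and the example values are hypothetical, and how the values were grouped (for example, per simulation setting) in Appendix D is not assumed here.

```python
import math


def rmsd(estimated, true):
    """Root-mean-square deviation between two equal-length lists of values."""
    if len(estimated) != len(true) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up QCFs for the three splits of one quartet.
estimated_qcfs = [0.70, 0.20, 0.10]
true_qcfs = [0.60, 0.20, 0.20]
print(round(rmsd(estimated_qcfs, true_qcfs), 4))
```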

4.3.4 Implementation

We have implemented our new method in the open-source software package qcf. qcf is written in C++ and is available under the GNU GPL v3 on GitHub (https://github.com/pblischak/qcf). Documentation and tutorials for using the software can be found on ReadTheDocs (https://qcf.readthedocs.io).

4.4 Materials and Methods

4.4.1 Study System

Penstemon Mitch. (Plantaginaceae) is the largest group of flowering plants endemic

to North America, with ca. 300 species distributed from Alaska to Guatemala, and

from the Pacific to Atlantic coasts (Wolfe et al., 2006). The center of diversity for

Penstemon is the Intermountain West of the United States, with the biogeographic

origin of the genus hypothesized to be in the Columbia Plateau (Straw, 1966; Wolfe et al., 2002, 2006). Penstemon has undergone a recent and rapid radiation, which

is thought to be driven by Pleistocene glaciation cycles, as well as adaptation to

different ecological habitats and pollinators (Castellanos et al., 2006; Wolfe et al., 2006;

Wilson et al., 2007). The most comprehensive molecular phylogeny of Penstemon was published by Wolfe et al. (2006), with 193 species sampled for their analyses of

the nuclear ribosomal ITS region and two chloroplast genes (trnCD+trnTL). While

support for many species-level relationships was lacking, Wolfe et al. (2006) were able

to make several inferences regarding higher level relationships within Penstemon. A

more recent study conducted high-throughput sequencing for 70 species of Penstemon

(Wessinger et al., 2016), and recovered high support across the entire tree. However, their taxon sampling included few members of subg. Penstemon, limiting the interpretation of their results in the context of the whole genus.

[Figure 4.2 image: diagram with the putative tetraploid parents confertus (4X), globosus (4X), and procerus (4X) at the top; the varieties palustris (6X), attenuatus (6X), militaris (6X), and pseudoprocerus (6X) in the center; and albertinus (2X) at the bottom.]

Figure 4.2: Hypotheses of allopolyploid formation in Penstemon attenuatus according to Keck (1945). Varieties of P. attenuatus are placed in the center, with their putative diploid parent below (P. albertinus for all), and their putative tetraploid parents above. P. attenuatus var. palustris is marked with a dashed arrow due to the uncertainty of its placement.

Two groups within Penstemon subg. Penstemon that are of particular interest are the subsections Humiles and Proceri. Species in these subsections are primarily distributed in the Pacific Northwest of the United States, and occur at subalpine to alpine elevations in a variety of habitats, including some that are atypical for species of Penstemon in the western US (e.g., mesic meadows; Keck, 1945; Nold, 1999). These subsections are also morphologically distinct from other members of

the genus, with their inflorescences organized into verticillasters. Many of the species

also have glandular hairs on the inflorescence, a character present in all species of

subsect. Humiles, but only in some members of subsect. Proceri (Keck, 1945). The

traditional taxonomic division between these groups is based on a single leaf character:

members of subsect. Humiles have serrate leaf margins and members of subsect.

Proceri have entire margins (Keck, 1945). However, there have been observations of

hybridization between the two subsections, such that members of subsect. Proceri

can sometimes have toothed leaf margins (Strickler, 1997). Cases of hybridization

at the diploid level are well documented in Penstemon (Straw, 1955; Crosswhite,

1965; Wolfe et al., 1998b,a; Datwyler and Wolfe, 2004), and numerous instances of

polyploidy, mostly within sect. Penstemon and subg. Saccanthera, have been studied

as well (Keck, 1945; Broderick et al., 2011). Most of these polyploids are thought to

be allopolyploids (formed through hybridization) (Wolfe et al., 2006), but the majority

of these hypotheses remain untested (but see Lawrence and Datwyler, 2016).

Given the lability of the leaf character dividing these two subsections, as well as

their similarities in geographic ranges and morphology, we sought to evaluate the

monophyly of subsect. Humiles and Proceri using nuclear amplicon data. We also

aimed to investigate the extent to which hybridization has occurred in these groups, and to understand the origin of any polyploid taxa (auto- vs.

allopolyploid). In the case of subsect. Humiles and Proceri, the P. attenuatus species

complex presents a compelling test for understanding polyploidy in these groups.

According to Keck (1945), the four varieties of P. attenuatus are all hypothesized to

be allopolyploids, forming through hybridization between P. albertinus in subsect.

Humiles and three different species in subsect. Proceri (see Figure 4.2). Earlier

molecular phylogenetic analyses recovered the subsections as polyphyletic (Wolfe et al.,

2006). However, these patterns are based on two uncombined gene trees, which would

not allow for processes that cause gene tree incongruence to be modeled. Here we use

species tree and network approaches to account for genealogical discordance caused by

both ILS and hybridization to estimate a phylogeny for subsect. Humiles and Proceri.

4.4.2 Sample Collection, DNA Extraction, and Amplicon Sequencing

DNA was extracted from field-collected leaf tissue that was dried on silica gel. We used

a modified CTAB protocol for DNA isolation (Wolfe, 2005), quantified all samples using

a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA), and normalized all samples

to 20 ng/µL. Normalized DNA samples for 38 accessions representing 17/22 and

20/27 currently circumscribed taxa from subsect. Humiles and Proceri, respectively,

plus an outgroup taxon from Penstemon subgenus Dasanthera (P. davidsonii var. davidsonii), were sent to the IBEST Genomics Resources Core at the University of

Idaho (Moscow, ID, USA) for sample preparation and sequencing (listed in Table D.1).

Amplification of targeted amplicons and the addition of sample barcodes and Illumina

adapters were done using microfluidic PCR on the Fluidigm 48 x 48 Access Array

(Fluidigm Corporation, South San Francisco, CA, USA), followed by 300 bp paired-end

sequencing on an Illumina MiSeq (Illumina, San Diego, CA, USA) (Uribe-Convers et al., 2016). Primers for the 48 loci used in this study were designed and tested as

described in Blischak et al. (2014), and are given in Tables D.2 and D.3.

Raw, paired-end sequencing reads were returned from IBEST and processed using

Fluidigm2PURC v0.1.2 (https://github.com/pblischak/fluidigm2purc; Blischak et al., 2018a). Fluidigm2PURC trims reads using Sickle (Joshi and Fash, 2011), joins

paired reads using FLASH2 (Magoč and Salzberg, 2011), and prepares the input file for

clustering and chimera detection using the program PURC (Rothfels et al., 2017). After

clustering with PURC, haplotypes are inferred based on cluster sizes and user-specified

ploidy levels. Three rounds of chimera detection and clustering were performed using

default settings in the script purc_recluster2.py, a modified version of the original

script distributed with PURC (https://bitbucket.org/crothfels/purc). To get

haplotypes, clusters were first cleaned of excessive gaps using Phyutility (threshold

= 33%; Smith and Dunn, 2008) and then realigned using MAFFT (Katoh, 2013).

Haplotypes were then inferred for all sampled taxa assuming known ploidy levels

reported in Keck (1945), Strickler (1997), and Broderick et al. (2011). Information

regarding haplotype dosage (i.e., number of haplotype copies) was ignored, resulting

in only unique haplotypes being returned for each taxon at each gene.

4.4.3 Species Tree Inference

Haplotype-level gene trees for each locus were inferred with RAxML v8.2.11 using the

GTRGAMMA model of nucleotide substitution and 500 rapid bootstrap replicates

(Stamatakis et al., 2008; Stamatakis, 2014). We then inferred a taxon-level species

tree with ASTRAL v5.5.9 (ASTRAL-III) using a mapping file to link haplotypes with

their respective taxa (Mirarab and Warnow, 2015). To increase the thoroughness of

the ASTRAL-III search algorithm, we added the following command line options:

--polylimit 20 (maximum size of polytomy), --samplingrounds 100 (number of

rounds of subsampling haplotypes from taxa), and --extraLevel 2 (increase the

number of bipartitions added to the search space).
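As a rough illustration of how such a run might be set up, the sketch below writes mapping-file lines and assembles an argument list. The three search options are those named in the text. The jar file name, the file paths, the haplotype labels, and the use of the -i, -a, and -o options with a "taxon:hap1,hap2" mapping format are assumptions based on our reading of the ASTRAL documentation and should be checked against the installed version.

```python
from collections import defaultdict


def mapping_lines(hap_to_taxon):
    """Group haplotype labels by taxon, one 'taxon:hap1,hap2' line per taxon."""
    groups = defaultdict(list)
    for hap, taxon in hap_to_taxon.items():
        groups[taxon].append(hap)
    return [
        f"{taxon}:{','.join(sorted(haps))}"
        for taxon, haps in sorted(groups.items())
    ]


def astral_command(jar, gene_trees, map_file, out_tree):
    """Assemble the argument list; nothing is executed here."""
    return [
        "java", "-jar", jar,
        "-i", gene_trees,          # assumed: input gene trees
        "-a", map_file,            # assumed: haplotype-to-taxon mapping file
        "-o", out_tree,            # assumed: output species tree
        "--polylimit", "20",       # options below are those given in the text
        "--samplingrounds", "100",
        "--extraLevel", "2",
    ]


# Hypothetical haplotype labels for two taxa.
haps = {
    "humilis_h1": "P_humilis",
    "humilis_h2": "P_humilis",
    "virens_h1": "P_virens",
}
print("\n".join(mapping_lines(haps)))
print(" ".join(astral_command("astral.jar", "genes.tre", "map.txt", "species.tre")))
```

Building the command as a list rather than a single string makes it straightforward to pass to subprocess.run without shell quoting issues.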

A species tree was also inferred using methods from the TICR pipeline

(https://github.com/nstenz/TICR; Stenz et al., 2015). The TICR pipeline estimates QCFs

using BUCKy (Larget et al., 2010), and then uses the concordance values of these

quartets to infer a species tree using the QuartetMaxCut algorithm (Snir, 2012).

Here, we instead used our new method to estimate QCFs, and inferred a species

tree using the script TICR/scripts/get-pop-tree.pl. Average concordance factors

and branch lengths (in coalescent units) were then estimated for this tree using the

TICR/scripts/getTreeBranchLengths.R script.
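For readers unfamiliar with concordance factors, the quantity being estimated for each set of four taxa can be sketched as a simple proportion. This is a simplification under stated assumptions: it takes one already-resolved quartet topology per gene tree as input and does not reproduce the handling of multiple haplotypes per taxon in our method or the Bayesian estimation in BUCKy. The topology labels and counts are hypothetical.

```python
from collections import Counter

# The three possible unrooted resolutions of a quartet of taxa a, b, c, d.
TOPOLOGIES = ("ab|cd", "ac|bd", "ad|bc")


def quartet_cfs(observed):
    """Return the proportion of gene-tree quartets supporting each resolution."""
    counts = Counter(observed)
    unknown = set(counts) - set(TOPOLOGIES)
    if unknown:
        raise ValueError(f"unrecognized topology labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no resolved quartets")
    return {topo: counts[topo] / total for topo in TOPOLOGIES}


# Hypothetical counts across 43 loci for a single quartet.
observed = ["ab|cd"] * 20 + ["ac|bd"] * 12 + ["ad|bc"] * 11
for topo, cf in quartet_cfs(observed).items():
    print(topo, round(cf, 3))
```

Values near one third for all three resolutions, as in this toy example, are what would be expected when internal branches are very short in coalescent units.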

As a final estimate of phylogeny, we used only the majority haplotype (haplotype

inferred from the largest cluster) returned by Fluidigm2PURC to estimate a species

tree with RAxML using a supermatrix as input. It has been shown previously that

concatenating multilocus data can result in incorrect inferences of phylogeny when

gene tree discordance is present (Kubatko and Degnan, 2007). However, congruence

among different methods can also be a good indicator of stable species relationships.

Inference with RAxML was conducted using a partition file to estimate separate model

parameters for each gene, and 1000 rapid bootstrap replicates were used to assess

statistical support.
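A supermatrix and its partition file can be assembled along the following lines. This is a minimal sketch rather than the script used here: the function name, locus names, and sequences are hypothetical, missing taxa are padded with gaps as an assumption, and the "DNA, name = start-end" lines follow the partition format that RAxML reads with its -q option.

```python
def build_supermatrix(loci):
    """Concatenate per-locus alignments and record RAxML-style partitions.

    `loci` maps a locus name to {taxon: aligned sequence}. Taxa missing from
    a locus are padded with '-' so that every row has the same length.
    """
    taxa = sorted({t for aln in loci.values() for t in aln})
    rows = {t: [] for t in taxa}
    partitions = []
    start = 1
    for name, aln in loci.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {name} are not aligned")
        length = lengths.pop()
        for t in taxa:
            rows[t].append(aln.get(t, "-" * length))
        partitions.append(f"DNA, {name} = {start}-{start + length - 1}")
        start += length
    matrix = {t: "".join(parts) for t, parts in rows.items()}
    return matrix, partitions


# Hypothetical two-locus example; taxon B is missing from locus2.
loci = {
    "locus1": {"A": "ACGT", "B": "ACGA"},
    "locus2": {"A": "GG"},
}
matrix, partitions = build_supermatrix(loci)
print(matrix)
print("\n".join(partitions))
```

Recording the coordinates while concatenating keeps the partition file in step with the alignment, which is easy to get wrong when the two are produced separately.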

4.4.4 Candidate Hybridization Events from Rooted Triples

To generate a list of candidate hybridization events, we used the program HyDe v0.4.2 to test for hybridization on all possible triples of taxa from subsect. Humiles

and Proceri (Blischak et al., 2018b). HyDe tests for hybridization using site pattern

frequencies (Kubatko and Chifman, 2015), and estimates the amount of admixture

occurring between two parental taxa to form a third hybrid taxon. Using P. davidsonii var. davidsonii as an outgroup and a mapping file to assign haplotypes to taxa, we tested all triples in all directions using the run_hyde_mp.py script. Statistical

significance was assessed at the α = 0.05 level with a Bonferroni correction for the

number of hypothesis tests conducted.

4.4.5 Species Network Inference

Our species tree analyses with ASTRAL-III, QCF+QuartetMaxCut, and RAxML

(supermatrix) recovered two clades with corresponding taxon membership (see Results), which we refer to as clades A and B (Figures 4.3–4.5). To reduce the computational

burden of estimating a large network, we chose to analyze these clades independently.

Haplotypes from taxa belonging to each clade were extracted from the original

sequence alignments and written to new files. Penstemon davidsonii var. davidsonii was included in the data set for both clades as an outgroup. We then estimated

haplotype-level gene trees as before, and inferred a taxon-level species tree in ASTRAL-III

using a mapping file and default search settings. We also estimated QCFs for

each clade using the qcf software with 500 bootstrap replicates. The resulting

species trees and QCF estimates for clades A and B were then used as input for

network estimation using the SNaQ method implemented in the software package

PhyloNetworks v0.7.0 (Solís-Lemus and Ané, 2016; Solís-Lemus et al., 2017). We varied

the maximum number of hybridization events from h=1 to h=5 and used the resulting

log-pseudolikelihood values to determine the most likely number of hybridization

events. The log-pseudolikelihood for the case of no hybridization (h=0) was calculated

by maximizing the fit of the observed QCF values on the fixed tree topology estimated

by ASTRAL-III. All network analyses were conducted on the Oakley cluster at the

Ohio Supercomputer Center (https://www.osc.edu).
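The extraction of clade-specific alignments described above can be sketched as a filter over FASTA records. The function names, haplotype labels, and taxon names below are hypothetical, and the sketch assumes a simple FASTA layout with one haplotype label per header; it is not the script used to prepare our data.

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def subset_for_clade(records, hap_to_taxon, clade_taxa, outgroup):
    """Keep haplotypes whose taxon is in the clade, plus the outgroup taxon."""
    keep = set(clade_taxa) | {outgroup}
    return {h: s for h, s in records.items() if hap_to_taxon.get(h) in keep}


def to_fasta(records):
    """Write records back out as FASTA text, one sequence per line."""
    return "".join(f">{h}\n{s}\n" for h, s in records.items())


# Hypothetical alignment with one haplotype per taxon.
alignment = ">hapA1\nACGT\n>hapB1\nACGA\n>hapO1\nTTTT\n"
hap_to_taxon = {"hapA1": "taxonA", "hapB1": "taxonB", "hapO1": "outgroup"}
subset = subset_for_clade(parse_fasta(alignment), hap_to_taxon, ["taxonA"], "outgroup")
print(to_fasta(subset), end="")
```

Because columns are not re-cleaned after filtering, a subset alignment produced this way may contain all-gap columns, which is one reason to re-estimate gene trees from the new files.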

4.5 Results

4.5.1 Nuclear Amplicon Data

Of the 48 loci that were amplified using the Fluidigm Access Array, 43 (all nuclear)

recovered sufficient data for processing and downstream phylogenetic analyses. Data

processing with Fluidigm2PURC on these 43 loci produced phased haplotypes for all

38 taxa, with many of the polyploid taxa containing three or more unique haplotypes.

The supermatrix of majority haplotypes used for analysis with RAxML had a total

alignment length of 18,207 bp.

4.5.2 Species Tree Inference

Species trees were inferred with three different methods, all of which produced mostly

similar phylogenetic estimates for subsect. Humiles and Proceri (Figures 4.3–4.5). For

each method, four species were consistently recovered as a grade outside of the rest of

the ingroup: P. anguineus, P. rattanii, P. watsonii, and P. whippleanus. Penstemon

ovatus was also recovered outside of the two subsections in the ASTRAL-III analysis.

For ASTRAL-III and qcf+QuartetMaxCut, two species from subsect. Proceri, P. attenuatus var. attenuatus and P. attenuatus var. pseudoprocerus, were recovered in a

clade consisting primarily of species from subsect. Humiles. These two methods also

recovered a clade consisting almost entirely of species from subsect. Proceri, with the

exception of P. radicosus being present in the ASTRAL-III tree. The supermatrix

analysis inferred a tree with a number of relationships that differed from the other

two approaches. Three of the four varieties of P. attenuatus were inferred to belong

to the same clade, but P. attenuatus var. pseudoprocerus was still recovered in a clade

of Humiles taxa. This analysis also shifted a clade of three species, P. radicosus, P. degeneri, and P. inflatus, to be sister to the clade consisting of Proceri species with

high support (bootstrap = 96). Another notable difference among the methods was

that ASTRAL-III did not recover the three varieties of P. humilis as monophyletic,

but the qcf+QuartetMaxCut and supermatrix approaches did.

Estimated branch lengths in coalescent units from ASTRAL-III and

qcf+QuartetMaxCut were extremely short for all internal branches, indicating rampant

genealogical discordance (Figures 4.4, D.3, and D.4). Branch lengths from the

RAxML supermatrix analysis were also very short for the branches along the backbone

of the tree, demonstrating that few substitutions were present to inform relationships

for these deeper bipartitions. Support values were generally low across the different

trees, with only a few relationships showing high levels of support. This is likely a

result of the short branches observed in the different trees, and supports the hypothesis

that speciation has occurred rapidly, with little time for informative substitutions

to occur. Another possible reason for these low support values is the occurrence of

hybridization. As we show below, there is strong evidence for hybridization in these

groups, making phylogenetic inference difficult.

Figure 4.3: Phylogeny of Penstemon subsections Humiles and Proceri inferred by ASTRAL-III. Labels on branches are local posterior probabilities (Sayyari and Mirarab, 2016). Taxa are colored based on their current taxonomic classification: red = subsect. Proceri, blue = subsect. Humiles.

4.5.3 Tests for Hybridization and Species Network Inference

Analyzing all possible triples of ingroup taxa with HyDe resulted in a total of 23,310

hypothesis tests, of which 282 showed significant evidence for hybridization. The

average value for the hybridization parameter (γ) was 0.513 (standard deviation =

0.114), with a minimum and maximum value of 0.205 and 0.843, respectively. Out of

37 total ingroup taxa, 24 had a significant signal for hybridization.
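The number of tests and the resulting Bonferroni threshold can be checked with a short calculation. The sketch assumes that testing "all directions" means each unordered triple of ingroup taxa is tested three times, once with each taxon as the candidate hybrid; under that assumption 37 ingroup taxa give exactly the 23,310 tests reported above. The function names are hypothetical.

```python
from math import comb


def hyde_test_count(n_ingroup):
    """Each unordered triple is tested three times, once per candidate hybrid."""
    return 3 * comb(n_ingroup, 3)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


n_tests = hyde_test_count(37)
print(n_tests)  # 23310
print(bonferroni_threshold(0.05, n_tests))  # roughly 2.1e-06
```

With a per-test threshold this small, the 282 significant tests are conservative with respect to false positives, at the cost of reduced power to detect weaker signals of admixture.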

Figure 4.4: Phylogeny of Penstemon subsections Humiles and Proceri inferred using qcf and QuartetMaxCut. Each branch is labeled above by its length in coalescent units and below by the average QCF value for all quartets induced by that branch. All branches with average QCF values greater than 0.38 are plotted with thicker lines (Stenz et al., 2015).

Figure 4.5: Phylogeny of Penstemon subsections Humiles and Proceri inferred with RAxML using a supermatrix of 43 loci. Labels on branches are support values from 1000 bootstrap replicates.

Species network inference with SNaQ was then conducted on the two primary

clades that were recovered in the qcf+QuartetMaxCut analyses (clades A and B;

Figure 4.4). The reason for using these clades was that this analysis recovered a

pattern of relationships for subsect. Humiles and Proceri that was most consistent with the current taxonomy. Although hybridization was detected between species in

these clades using HyDe, network inference with SNaQ cannot currently handle 38

taxa. However, since the members of these clades were recovered fairly consistently

between the different methods that we used for species tree inference, we decided to

analyze them independently to make network inference computationally feasible.

Using a range of values on the number of possible reticulation events (h=0 to h=5), we were able to infer species networks in all cases for both clades A and B within the

amount of compute time allotted (10 cores, 80–100 hours). For both clades, adding

reticulation events greatly reduced the log-pseudolikelihood, providing strong evidence

that hybridization is occurring within these clades. Networks with four and three

reticulations had the highest pseudolikelihoods for clades A and B, respectively, but

the network topology with four reticulations for clade A produced nonsensical

relationships (hybridization with the outgroup), so we preferred the network with

h=3 (Figures D.5 & D.6). In addition to having the highest (or second highest for

clade A) log-pseudolikelihood, the networks estimated with three reticulations were

among the only estimates that had sensible branch lengths. For most other networks

in both clades A and B, one of the reticulate edges was always inferred to have a

branch length of >9.5 coalescent units. Given the amount of gene tree discordance

present in the data set, and the short branch lengths estimated by ASTRAL-III and

qcf+QuartetMaxCut, these estimates are most likely incorrect.
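The reasoning used to settle on a preferred network can be written out as a small filter-then-rank step. This is only a sketch of the logic described above, not a procedure that was run as code: the class and function names are hypothetical, every numeric value in the example is invented for illustration, and the log-pseudolikelihood is treated as a quantity where higher is better.

```python
from dataclasses import dataclass


@dataclass
class NetworkFit:
    h: int                  # maximum number of hybridization events allowed
    loglik: float           # log-pseudolikelihood (higher is better here)
    longest_edge: float     # longest branch length, in coalescent units
    outgroup_hybrid: bool   # True if a reticulation involves the outgroup


def preferred_network(fits, max_edge=9.5):
    """Pick the best-scoring fit among those passing both plausibility checks."""
    plausible = [
        f for f in fits
        if f.longest_edge <= max_edge and not f.outgroup_hybrid
    ]
    if not plausible:
        return None
    return max(plausible, key=lambda f: f.loglik)


# Invented values that mimic the pattern described for clade A.
fits = [
    NetworkFit(0, -520.0, 1.2, False),
    NetworkFit(1, -410.0, 10.3, False),
    NetworkFit(2, -365.0, 11.0, False),
    NetworkFit(3, -330.0, 2.1, False),
    NetworkFit(4, -325.0, 1.9, True),
    NetworkFit(5, -331.0, 12.4, False),
]
print(preferred_network(fits).h)  # 3
```

Making the plausibility criteria explicit in this way shows that the choice of h=3 for clade A rests on biological judgment as well as on the pseudolikelihood score.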

[Figure 4.6 tip labels. Clade A: P. davidsonii var. davidsonii (outgroup), P. spatulatus, P. flavescens, P. attenuatus var. palustris, P. attenuatus var. militaris, P. rydbergii var. oreocharis, P. laxus, P. globosus, P. pratensis, P. heterodoxus var. heterodoxus, P. rydbergii var. rydbergii, P. euglaucus, P. procerus var. procerus, P. confertus, P. hesperius, P. washingtonensis, P. peckii, P. cinicola. Clade B: P. davidsonii var. davidsonii (outgroup), P. radicosus, P. aridus, P. inflatus, P. degeneri, P. virens, P. attenuatus var. attenuatus, P. elegantulus, P. subserratus, P. pruinosus, P. attenuatus var. pseudoprocerus, P. wilcoxii, P. albertinus, P. ovatus, P. humilis var. brevifolius, P. humilis var. humilis, P. humilis var. obtusifolius.]

Figure 4.6: Best maximum pseudolikelihood (ML) networks for clades A and B estimated by SNaQ. The maximum number of hybridization events for these ML networks is h=3.

The best networks for clades A and B showed different patterns for the timing of

hybridization (Figure 4.6). For clade A, all of the reticulation events occurred closer to

the present, and only involved pairs of species hybridizing. The hybridization events

inferred include: (1) P. spatulatus × P. attenuatus var. palustris → P. flavescens,

(2) P. rydbergii var. oreocharis × P. globosus → P. laxus, and (3) P. confertus × P. cinicola → P. washingtonensis. Clade B, on the other hand, was estimated to have a

deep reticulation event involving two ancestral populations, with the resulting hybrid

lineage then diversifying into 12 different taxa. The other hybridization event in this

clade was P. ovatus × P. humilis var. humilis → P. humilis var. brevifolius.

4.6 Discussion

In this paper, we investigated the phylogenetic relationships and taxonomic affinities

among the taxa within Penstemon subsections Humiles and Proceri using nuclear

amplicon sequencing. We found strong evidence for hybridization in these groups, but

the rapid diversification in these two subsections made the exact inference, localization,

and interpretation of reticulation events extremely difficult. Despite these shortcomings,

there are some clear trends regarding the taxonomic implications of our phylogenetic

estimates, as well as what they may mean for character evolution and biogeography in

the group. Our work also highlights the difficulties of estimating phylogeny for recently

radiating groups with variable ploidy levels and high amounts of hybridization, an

issue that is common for many groups of angiosperms.

4.6.1 Taxonomy of Subsections Humiles and Proceri

Using several methods for phylogenetic inference, we found evidence for the non-

monophyly of subsect. Humiles and Proceri (Figures 4.3–4.5). This pattern was

recovered for all methods, despite variable levels of statistical support for the different

analyses, suggesting that this pattern is robust to the various assumptions of each

method. Four species in particular were recovered completely outside of the two

subsections: P. whippleanus, P. rattanii, and P. anguineus (all subsect. Humiles), as well as P. watsonii (subsect. Proceri). Only one of the methods, qcf+QuartetMaxCut,

recovered a monophyletic grouping of species belonging to subsect. Proceri. However,

this analysis also placed two taxa currently classified in subsect. Proceri (P. attenuatus var. attenuatus and P. attenuatus var. pseudoprocerus) into a clade of species from

subsect. Humiles, casting doubt on Keck’s hypotheses of allopolyploid formation for

these taxa (Keck, 1945). Interestingly, none of the varieties of P. attenuatus showed

the predicted affinities for their putative parental taxa (Figure 4.2). However, given

their hypothesized hybrid nature, it is possible that they are simply difficult to place

using models that do not include reticulation.

The phylogenetic placement of hybrid taxa has been shown to be problematic

in phylogenetic analyses, with hybrids typically branching at the base of a clade

containing one of their parental taxa (e.g., McDade, 1990, 1992). Our tests for

hybridization and species network analyses confirmed the presence of hybrids within

and between subsect. Humiles and Proceri, but did not support the putative parentage

of the varieties of P. attenuatus, or that of a number of other hypothesized hybrids

from Keck (1945). Nevertheless, the occurrence of a deep hybridization event in

the clade consisting of taxa primarily from subsect. Humiles (clade B) is especially

interesting. The impact of hybridization on genetic variation and its connection to the

subsequent speciation and diversification of hybrid lineages has long been understood

in plants (Anderson, 1949; Stebbins, 1950; Anderson and Stebbins, 1954; Grant, 1971;

Mallet, 2007), and several recent studies have observed deep hybridization events at

the diploid (e.g., Folk et al., 2016; García et al., 2017; Folk et al., 2018) and polyploid

(e.g., Morales-Briones et al., 2018) levels. The hybrid group within clade B is a mix

of diploids and polyploids, suggesting that the processes of both hybridization and whole genome duplication have been at play in its diversification. Future work with

more genomic data will be important for this clade to gain better resolution for any

additional hybridization events.

A particular strength of our method of QCF estimation is the ability to analyze

taxa with different ploidy levels, allowing us to analyze all diploid and polyploid

taxa in subsect. Humiles and Proceri simultaneously. From our network analyses with SNaQ, the only polyploid that was inferred to be a hybrid was P. flavescens

(hexaploid; clade A). However, our tests for hybridization with HyDe found far more

evidence for hybridization, likely because it only tests three species at a time, rather

than trying to infer an entire network. Nevertheless, out of the 24 taxa inferred to be

hybrids using HyDe, only four were polyploids. This casts doubt on the hypothesis

that most of the polyploids in subsect. Proceri are of hybrid origin. However, a lack

of phylogenetically informative variation could also be preventing us from detecting

the full extent of hybridization potentially occurring in these polyploids.

4.6.2 Character Evolution and Biogeography

Given the reticulate history of subsect. Humiles and Proceri, there are several patterns

of morphological character evolution that can be interpreted in the context of their

past genetic exchanges. Of particular interest is the presence of glandular hairs on

the inflorescence, a trait with potentially adaptive importance (Levin, 1973) that is

present in all species of subsect. Humiles, but is absent in the majority of species

in subsect. Proceri. For the species in subsect. Proceri where it does occur, it is

hard to determine if it is simply a labile trait that has arisen several times, or if

there is a single clade of species that all have the trait. A perhaps more interesting

scenario could be that this trait was gained through hybridization or introgression;

however, testing this hypothesis is currently not feasible due to a lack of methods

for discrete character reconstruction on phylogenetic networks (but see Jhwueng and

O’Meara, 2015; Bastide et al., 2018, for examples of continuous character evolution).

A possible workaround would be to construct all possible resolutions of the underlying

trees displayed by the networks and to reconstruct the character history on each tree.

Nevertheless, future model development on discrete character evolution on networks will help to address this type of question.
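The workaround of enumerating tree resolutions can be sketched for the three clade A events reported in the Results. In this simplified view, each hybrid taxon is attached to one of its two parents, so h reticulations give up to 2^h resolutions. The function name and the flat (hybrid, parents) representation are assumptions made for illustration; a full implementation would operate on the network topology itself.

```python
from itertools import product


def displayed_resolutions(hybrid_events):
    """Yield one {hybrid: chosen parent} dict per combination of parent choices."""
    hybrids = [hybrid for hybrid, _ in hybrid_events]
    parent_pairs = [parents for _, parents in hybrid_events]
    for choice in product(*parent_pairs):
        yield dict(zip(hybrids, choice))


# Clade A events as reported in the Results: (hybrid, (parent 1, parent 2)).
clade_a = [
    ("P. flavescens", ("P. spatulatus", "P. attenuatus var. palustris")),
    ("P. laxus", ("P. rydbergii var. oreocharis", "P. globosus")),
    ("P. washingtonensis", ("P. confertus", "P. cinicola")),
]
resolutions = list(displayed_resolutions(clade_a))
print(len(resolutions))  # 8
print(resolutions[0])
```

A discrete character such as glandular inflorescence hairs could then be reconstructed on each of the eight trees and the results compared, although the number of resolutions doubles with every additional reticulation.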

The biogeographic context of these hybridization events is also of interest, with

reconstructions of species’ geographic ranges potentially helping to shed light on the

plausibility of hypotheses about the occurrence of reticulation. The current geographic

distribution of the taxa in subsect. Humiles and Proceri is concentrated in the Pacific

Northwest of the United States, an area with several well-established biogeographic

and phylogeographic hypotheses regarding the occurrence of species in the Cascade

Range, the Northern Rocky Mountains, and the Sierra Nevada in northern California

(Brunsfeld et al., 2001; Carstens et al., 2005; Brunsfeld et al., 2007). Two recent studies

that have investigated the biogeography of hybridization events include Burbrink and

Gehara (2018) and Folk et al. (2018), who take different approaches to reconstructing

ancestral contact zones where hybridization could potentially have occurred. Burbrink

and Gehara (2018) found a single deep reticulation event in the phylogeny of New

World kingsnakes and used the resulting parental trees (trees where a hybrid clade is

sister to either parent) to infer ancestral areas for the hybrid clade. Folk et al. (2018)

used climatic niche reconstructions to find likely regions where ancestral lineages

of Heuchera and Mitella may have occurred in sympatry and hybridized. These

approaches could be used in concert to illuminate the dynamics of vicariance and

dispersal for lineages of subsect. Humiles and Proceri, and to help locate

geographic regions where hybridization and whole genome duplication could

have occurred in the past.

4.6.3 Phylogenetics of Hybrids and Polyploids

As was seen from our phylogenetic analyses, the internal branch lengths of our

species trees and networks were very short (most were less than 0.5 coalescent units),

highlighting the prevalence of incomplete lineage sorting and genealogical discordance

in our data. Previous research has shown that Penstemon is a young genus (crown age

2.5–4.0 mya) that has radiated extremely rapidly, with hybridization and polyploidy

occurring frequently (Wolfe et al., 2006; Wessinger et al., 2016). These types of

processes are likely not uncommon for other groups of angiosperms, and having

methods to deal with them will be especially important for making future inferences

about the evolutionary history of these groups. To resolve hybridization events,

especially when they involve polyploids, there are several methods that have already

been developed. Some of these are not coalescent-based, but instead try to reconstruct

a network from gene trees that have all of the haplotypes from a polyploid sampled (a

so-called “multi-labeled” tree; Lott et al., 2009; Marcussen et al., 2012). Other studies

have relied on coalescent-based assignment of homoeologous haplotypes into putative,

diploid subgenomes, but these approaches can be computationally limited due to the

cost of exploring all permutations of haplotype assignments (Bertrand et al., 2015;

Oberprieler et al., 2017). The only approach to simultaneously infer a network topology

and homoeolog assignment in a coalescent framework is the method of Jones et al.

(2013). However, this method uses a hierarchical Bayesian framework that does not

scale well to large numbers of loci or taxa.

If homoeolog assignment is the goal, then it may be beneficial to first identify

parental taxa so that the number of comparisons for determining haplotype origin

is reduced. Kamneva et al. (2017) used such an approach in strawberries to identify

potential parents for several different polyploid species. They used a two-step approach

to generate and test hypotheses, first constructing networks using consensus methods,

followed by evaluating the likelihood of candidate networks using PhyloNet (Wen et al.,

2018). Their analyses were limited to no more than 5 haplotypes per taxon, and also

did not include an actual search over network space. Our method for QCF estimation was able to analyze all inferred haplotypes for all 38 taxa sampled in this study, and

our network analyses with SNaQ were used to conduct an actual search over network

topologies. The appeal of these types of approaches is that they do not require a priori

knowledge about parental taxa when inferring a network. For non-model taxa where

cases of hybridization and allopolyploidy are being investigated for the first time, the

ability to model these processes with little input from the user regarding putative

hybridization events should help to facilitate the discovery of reticulate evolutionary

events in virtually any group of taxa where they may be occurring.

4.7 Conclusions

Hybridization and polyploidy are processes that obscure phylogenetic inference for

many groups of taxa, and are a particular problem for lineages of angiosperms, where

they are especially common. Using the concept of quartet concordance factors (the

proportion of gene tree quartets supporting a species-level quartet), we developed a

method for estimating these concordance factors that can accommodate taxa with variable ploidy levels. Using this approach, and several others, we then inferred species

trees for Penstemon subsect. Humiles and Proceri, finding that the subsections were

not reciprocally monophyletic. Tests for hybridization and species network inference

also revealed that reticulation has been a common occurrence in these groups. In

general, this study highlights the difficulties of inferring phylogeny in a rapid species

radiation where hybridization and WGD are common. However, our approach for

QCF estimation, in combination with the network inference method SNaQ, helped to

disentangle the complex patterns of hybridization in these subsections, and should

provide a useful tool for other researchers interested in reticulate evolution as well.

Bibliography

Allendorf, F. W. and Thorgaard, G. H. 1984. Tetraploidy and the evolution of salmonid

fishes. In: Evolutionary genetics of fishes. Edited by B. J. Turner. Plenum Press,

pp. 1–53.

Allman, E. S., Ané, C., and Rhodes, J. A. 2008. Identifiability of a Markovian model of

molecular evolution with Gamma-distributed rates. Advances in Applied Probability,

40: 229–249.

Andermann, T., Fernandes, A. M., Olsson, U., Töpel, M., Pfeil, B. E., Oxelman,

B., Aleixo, A., Faircloth, B. C., and Antonelli, A. 2018. Allele phasing greatly

improves the phylogenetic utility of ultraconserved elements. Systematic Biology,

https://doi.org/10.1093/sysbio/syy039.

Anderson, E. 1949. Introgressive hybridization. John Wiley, New York, NY, USA.

Anderson, E. and Stebbins, G. L. 1954. Hybridization as an evolutionary stimulus.

Evolution, 8: 378–388.

Ané, C., Larget, B., Baum, D. A., and Rokas, A. 2007. Bayesian estimation of

concordance among gene trees. Molecular Biology and Evolution, 24: 412–426.

Anithakumari, A., Tang, J., van Eck, H., Visser, R., Leunissen, J., Vosman, B., and

van der Linden, C. 2010. A pipeline for high throughput detection and mapping of

SNPs from EST databases. Molecular Breeding, 26: 65–75.

Arnold, B., Bomblies, K., and Wakeley, J. 2012. Extending coalescent theory to

autotetraploids. Genetics, 192: 195–204.

Arnold, B., Corbett-Detig, R. B., Hartl, D., and Bomblies, K. 2013. RADseq underestimates

diversity and introduces genealogical biases due to nonrandom haplotype

sampling. Molecular Ecology, 22: 3179–3190.

Arnold, B., Kim, S.-T., and Bomblies, K. 2015. Single geographic origin of a widespread

autotetraploid Arabidopsis arenosa lineage followed by interploidy admixture. Molecular

Biology and Evolution, 32: 1382–1395.

Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A.,

Selker, E. U., Cresko, W. A., and Johnson, E. A. 2008. Rapid SNP discovery and

genetic mapping using sequenced RAD markers. PLoS ONE, 3: e3376.

Balding, D. J. and Nichols, R. A. 1995. A method for quantifying differentiation

between populations at multi-allelic loci and its implications for investigating identity

and paternity. Genetica, 96: 3–12.

Balding, D. J. and Nichols, R. A. 1997. Significant genetic correlations among

Caucasians at forensic DNA loci. Heredity, 78: 583–589.

Barlow, N. 1913. Preliminary note on heterostylism in Oxalis and Lythrum. Journal

of Genetics, 3: 53–65.

Barlow, N. 1923. Inheritance of the three forms in trimorphic plants. Journal of

Genetics, 13: 133–146.

Bastide, P., Solís-Lemus, C., Kriebel, R., Sparks, K. W., and Ané, C. 2018. Phylogenetic

comparative methods on phylogenetic networks with reticulations. Systematic

Biology, https://doi.org/10.1093/sysbio/syy033.

Baum, D. A. 2007. Concordance trees, concordance factors, and the exploration of

reticulate genealogy. Taxon, 56: 417–426.

Bertrand, Y. J. K., Scheen, A.-C., Marcussen, T., Pfeil, B. E., de Sousa, F., and

Oxelman, B. 2015. Assignment of homoeologues to parental genomes in allopolyploids

for species tree inference, with an example from Fumaria (Papaveraceae).

Systematic Biology, 64: 448–471.

Blischak, P. D., Wenzel, A. J., and Wolfe, A. D. 2014. Gene prediction and annotation

in Penstemon (Plantaginaceae): a workflow for marker development from low-

coverage genome sequencing. Applications in Plant Sciences, 2: 1400044.

Blischak, P. D., Kubatko, L. S., and Wolfe, A. D. 2016. Accounting for genotype

uncertainty in the estimation of allele frequencies in autopolyploids. Molecular

Ecology Resources, 16: 742–754.

Blischak, P. D., Latvis, M., Morales-Briones, D. F., Johnson, J. C., Di Stilio, V. S.,

Wolfe, A. D., and Tank, D. C. 2018a. Fluidigm2PURC: automated processing and

haplotype inference for double-barcoded PCR amplicons. Applications in Plant

Sciences, 6: e1156.

Blischak, P. D., Chifman, J., Wolfe, A. D., and Kubatko, L. S. 2018b. HyDe: a

Python package for genome-scale hybridization detection. Systematic Biology,

https://doi.org/10.1093/sysbio/syy023.

Bradburd, G., Ralph, P., and Coop, G. 2013. Disentangling the effects of geographic

and ecological isolation on genetic differentiation. Evolution, 67: 3258–3273.

Brent, R. P. 1973. Algorithms for minimization without derivatives. Prentice-Hall,

Englewood Cliffs, NJ.

Broderick, S. R., Stevens, M. R., Geary, B., Love, S. L., Jellen, E. N., Dockter, R. B.,

Daley, S. L., and Lindgren, D. T. 2011. A survey of Penstemon’s genome size.

Genome, 54: 160–173.

Brunsfeld, S. J., Sullivan, J., Soltis, D. S., and Soltis, P. S. 2001. Integrating ecological

and evolutionary processes in a spatial context, chapter Comparative phylogeography

of northwestern North America: a synthesis, pages 319–339. Oxford: Blackwell

Science.

Brunsfeld, S. J., Miller, T. R., and Carstens, B. C. 2007. Insights into the biogeography

of the Pacific Northwest of North America: evidence from the phylogeography of

Salix melanopsis (Salicaceae). Systematic Botany, 32: 129–139.

Buerkle, C. A. and Gompert, Z. 2013. Population genomics based on low coverage

sequencing: how low should we go? Molecular Ecology, 22: 3028–3035.

Burbrink, F. T. and Gehara, M. 2018. The biogeography of deep time reticulation.

Systematic Biology, https://doi.org/10.1093/sysbio/syy019.

Bybee, S. M., Bracken-Grissom, H., Haynes, B. D., Hermansen, R. A., Byers, R. L.,

Clement, M. J., Udall, J. A., Wilcox, E. R., and Crandall, K. A. 2011. Targeted

amplicon sequencing (TAS): a scalable next-gen approach to multilocus, multitaxa

phylogenetics. Genome Biology and Evolution, 3: 1312–1323.

Cannon, S. B., McKain, M. R., Harkess, A., Nelson, M. N., Dash, S., Deyholos, M. K.,

Peng, Y., Joyce, B., Stewart, C. N., Rolf, M., Kutchan, T., Tan, X., Chen, C.,

Zhang, Y., Carpenter, E., Wong, G. K.-S., Doyle, J. J., and Leebens-Mack, J. 2014.

Multiple polyploidy events in the early radiation of nodulating and nonnodulating

legumes. Molecular Biology and Evolution, 32: 193–210.

Carstens, B. C., Brunsfeld, S. J., Demboski, J. R., Good, J. M., and Sullivan, J. 2005.

Investigating the evolutionary history of the Pacific Northwest mesic forest ecosystem:

hypothesis testing within a comparative phylogeographic framework. Evolution, 59:

1639–1652.

Castellanos, M. C., Wilson, P. S., Keller, S. J., Wolfe, A. D., and Thompson, J. D. 2006.

Anther evolution: pollen presentation strategies when pollinators differ. American

Naturalist, 167: 288–296.

Chifman, J. and Kubatko, L. S. 2014. Quartet inference from SNP data under the

coalescent model. Bioinformatics, 30: 3317–3324.

Chifman, J. and Kubatko, L. S. 2015. Identifiability of the unrooted species tree

topology under the coalescent model with time-reversible substitution processes,

site-specific rate variation, and invariable sites. Journal of Theoretical Biology, 374:

35–47.

Clark, L. V. and Jasieniuk, M. 2011. polysat: an R package for polyploid microsatellite

analysis. Molecular Ecology Resources, 11: 562–566.

Clausen, J., Keck, D. D., and Hiesey, W. M. 1940. Experimental studies on the nature

of species. I. Effect of varied environments on western American plants. Carnegie

Inst. Washington Publ.

Clausen, J., Keck, D. D., and Hiesey, W. M. 1945. Experimental studies on the nature

of species. II. Plant evolution through amphiploidy and autoploidy, with examples

from Madiinae. Carnegie Inst. Washington Publ.

Cornille, A., Salcedo, A., Kryvokhyzha, D., Glémin, S., Holm, K., Wright, S. I., and

Lascoux, M. 2016. Genomic signature of successful colonization of Eurasia by the

allopolyploid shepherd’s purse (Capsella bursa-pastoris). Molecular Ecology, 25:

616–629.

Crosswhite, F. S. 1965. Hybridization of Penstemon barbatus (Scrophulariaceae) of

section Elmigera with species of Habroanthus. Southwestern Naturalist, 10: 234–237.

Cui, L., Wall, P. K., Leebens-Mack, J. H., Lindsay, B. G., Soltis, D. E., Doyle, J. J.,

Soltis, P. S., Carlson, J. E., Arumuganathan, K., Barakat, A., Albert, V. A., Ma,

H., and dePamphilis, C. W. 2006. Widespread genome duplications throughout the

history of flowering plants. Genome Research, 16: 738–749.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A.,

Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin,

R., and 1000 Genomes Project Analysis Group 2011. The variant call format and

VCFtools. Bioinformatics, 27: 2156–2158.

Datwyler, S. L. and Wolfe, A. D. 2004. Phylogenetic relationships and morphological

evolution in Penstemon subg. Dasanthera (Veronicaceae). Systematic Botany, 29:

165–176.

de Silva, H., Hall, A., Rikkerink, E., McNeilage, M., and Fraser, L. 2005. Estimation

of allele frequencies in polyploids under certain patterns of inheritance. Heredity,

95: 327–334.

Degnan, J. H. and Rosenberg, N. A. 2009. Gene tree discordance, phylogenetic

inference and the multispecies coalescent. Trends in Ecology and Evolution, 24:

332–340.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from

incomplete data via the EM algorithm. Journal of the Royal Statistical Society:

Series B (Methodological), 39: 1–38.

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C.,

Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell,

T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler,

D., and Daly, M. J. 2011. A framework for variation discovery and genotyping using

next-generation DNA sequencing data. Nature Genetics, 43: 491–498.

Di Stilio, V. S., Kramer, E. M., and Baum, D. A. 2005. Floral MADS box genes

and homeotic gender dimorphism in Thalictrum dioicum (Ranunculaceae) - a new

model for the study of dioecy. The Plant Journal, 41: 755–766.

Douglas, G. M., Gos, G., Steige, K. A., Salcedo, A., Holm, K., Josephs, E. B.,

Arunkumar, R., Ågren, J. A., Hazzouri, K. M., Wang, W., Platts, A. E., Williamson,

R. J., Neuffer, B., Lascoux, M., Slotte, T., and Wright, S. I. 2015. Hybrid origins

and the earliest stages of diploidization in the highly successful recent polyploid

Capsella bursa-pastoris. Proceedings of the National Academy of Sciences USA, 112:

2806–2811.

Dufresne, F., Stift, M., Vergilino, R., and Mable, B. K. 2014. Recent progress and

challenges in population genetics of polyploid organisms: an overview of current

state-of-the-art molecular and statistical tools. Molecular Ecology, 23: 40–69.

Dupuis, J. R., Bremer, F. T., Kauwe, A., San Jose, M., Leblanc, L., Rubinoff, D., and

Geib, S. 2017. HiMAP: robust phylogenomics from highly multiplexed amplicon

sequencing. bioRxiv, https://doi.org/10.1111/1755-0998.12783.

Eaton, D. A. R., Hipp, A. L., González-Rodríguez, A., and Cavender-Bares, J. 2015.

Historical introgression among the American live oaks and the comparative nature

of tests for introgression. Evolution, 69: 2587–2601.

Eddelbuettel, D. 2013. Seamless R and C++ integration with Rcpp. Springer, New

York.

Eddelbuettel, D. and François, R. 2011. Rcpp: seamless R and C++ integration.

Journal of Statistical Software, 40: 1–18.

Eddelbuettel, D. and Sanderson, C. 2014. RcppArmadillo: accelerating R with high-

performance C++ linear algebra. Computational Statistics and Data Analysis, 71:

1054–1063.

Edgar, R. C. 2010. Search and clustering orders of magnitude faster than BLAST.

Bioinformatics, 26: 2460–2461.

Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C., and Knight, R. 2011. UCHIME

improves sensitivity and speed of chimera detection. Bioinformatics, 27: 2194–2200.

Esselink, G. D., Nybom, H., and Vosman, B. 2004. Assignment of allelic configuration

in polyploids using the MAC-PR (microsatellite DNA allele counting–peak ratios)

method. Theoretical and Applied Genetics, 109: 402–408.

Falush, D., Stephens, M., and Pritchard, J. 2003. Inference of population structure us-

ing multilocus genotype data: linked loci and correlated allele frequencies. Genetics,

164: 1567–1587.

Fisher, R. A. 1943. Allowance for double reduction in the calculation of genotype

frequencies with polysomic inheritance. Annals of Eugenics, 12: 169–171.

Folk, R. A., Mandel, J. R., and Freudenstein, J. V. 2016. Ancestral gene flow and

parallel organellar genome capture result in extreme phylogenomic discord in a

lineage of angiosperms. Systematic Biology, 66: 320–337.

Folk, R. A., Visger, C. J., Soltis, P. S., Soltis, D. E., and Guralnick, R. P. 2018.

Geographic range dynamics drove ancient hybridization in a lineage of angiosperms.

American Naturalist, https://dx.doi.org/10.1086/698120.

Foll, M. and Gaggiotti, O. 2008. A genome-scan method to identify selected loci

appropriate for both dominant and codominant markers: a Bayesian perspective.

Genetics, 180: 977–993.

Fumagalli, M., Vieira, F. G., Korneliussen, T., Linderoth, T., Huerta-Sánchez, E.,

Albrechtsen, A., and Nielsen, R. 2013. Quantifying population genetic differentiation

from next-generation sequencing data. Genetics, 195: 979–992.

Furlong, R. F. and Holland, P. W. H. 2001. Were vertebrates octoploid? Philosophical

Transactions of the Royal Society B: Biological Sciences, 357: 531–544.

García, N., Folk, R. A., Meerow, A. W., Chamala, S., Gitzendanner, M. A., de Oliveira,

R. S., Soltis, D. E., and Soltis, P. S. 2017. Deep reticulation and incomplete lineage

sorting obscure the diploid phylogeny of rain-lilies and allies (Amaryllidaceae tribe

Hippeastreae). Molecular Phylogenetics and Evolution, 111: 231–247.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B.

2014. Bayesian data analysis. Chapman & Hall/CRC Press, 3rd edn.

Glaubitz, J., Casstevens, T. M., Lu, F., Harriman, J., Elshire, R. J., Sun, Q., and

Buckler, E. S. 2014. TASSEL-GBS: A high capacity genotyping by sequencing

analysis pipeline. PLoS ONE, 9: e90346.

Gompert, Z. and Buerkle, C. A. 2011a. A hierarchical Bayesian model for next-

generation population genomics. Genetics, 187: 903–917.

Gompert, Z. and Buerkle, C. A. 2011b. A hierarchical Bayesian model for next-

generation population genomics. Genetics, 187: 903–917.

Gompert, Z. and Buerkle, C. A. 2012. bgc: software for Bayesian estimation of

genomic clines. Molecular Ecology Resources, 12: 1168–1176.

Gompert, Z., Forister, M. L., Fordyce, J. A., Nice, C. C., Williamson, R. J., and

Buerkle, C. A. 2010. Bayesian analysis of molecular variance in pyrosequences

quantifies population genetic structure across the genome of Lycaeides butterflies.

Molecular Ecology, 19: 2455–2473.

Gostel, M. R., Coy, K. A., and Weeks, A. 2015. Microfluidic PCR-based target

enrichment: a case study in two rapid radiations of Commiphora (Burseraceae)

from Madagascar. Journal of Systematics and Evolution, 53: 411–431.

Goto, K. and Meyerowitz, E. M. 1994. Function and regulation of the Arabidopsis floral

homeotic gene PISTILLATA. Genes and Development, 8: 1548–1560.

Grant, V. 1971. Plant speciation. Columbia University Press.

Gregory, T. R. and Mable, B. K. 2005. Polyploidy in animals. In: The evolution of

the genome. Edited by T. R. Gregory. Elsevier, pp. 427–517.

Gruenstaeudl, M., Reid, N. M., Wheeler, G. L., and Carstens, B. C. 2015. Posterior

predictive checks of coalescent models: P2C2M, an R package. Molecular Ecology

Resources, 16: 193–205.

Haldane, J. B. S. 1930. Theoretical genetics of autopolyploids. Journal of Genetics,

22: 359–372.

Hardy, O. J. 2016. Population genetics of autopolyploids under a mixed mating model

and the estimation of selfing rate. Molecular Ecology Resources, 16: 103–117.

Hernández, J. L. and Weir, B. S. 1989. A disequilibrium coefficient approach to

Hardy-Weinberg testing. Biometrics, 45: 53–70.

Holsinger, K. E., Lewis, P. O., and Dey, D. K. 2002. A Bayesian approach to inferring

population structure from dominant markers. Molecular Ecology, 11: 1157–1164.

Huang, G., Wang, S., Wang, X., and You, N. 2016. An empirical Bayes method for

genotyping and SNP detection using multi-sample next-generation sequencing data.

Bioinformatics, 32: 3240–3245.

Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model of

genetic variation. Bioinformatics, 18: 337–338.

Jhwueng, D.-C. and O’Meara, B. 2015. Trait evolution on phylogenetic networks.

bioRxiv, https://doi.org/10.1101/023986.

Jiao, Y., Wickett, N. J., Ayyampalayam, S., Chanderbali, A. S., Landherr, L., Ralph,

P. E., Tomsho, L. P., Hu, Y., Liang, H., Soltis, P. S., Soltis, D. E., Clifton, S. W.,

Schlarbaum, S. E., Schuster, S. C., Ma, H., Leebens-Mack, J., and dePamphilis,

C. W. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature, 473:

97–100.

Jombart, T. and Ahmed, I. 2011. adegenet 1.3-1: new tools for the analysis of

genome-wide SNP data. Bioinformatics, 27: 3070–3071.

Jones, G., Sagitov, S., and Oxelman, B. 2013. Statistical inference of allopolyploid

species networks in the presence of incomplete lineage sorting. Systematic Biology,

62: 467–478.

Joshi, N. A. and Fass, J. N. 2011. Sickle: sliding-window, adaptive,

quality-based trimming tool for FASTQ files (version 1.33). Available at

https://github.com/najoshi/sickle.

Kamneva, O. K., Syring, J., Liston, A., and Rosenberg, N. A. 2017. Evaluating

allopolyploid origins in strawberries (Fragaria) using haplotypes generated from

target sequence capture. BMC Evolutionary Biology, 17: 180.

Kates, H. R., Soltis, P. S., and Soltis, D. E. 2017. Evolutionary and domestication

history of Cucurbita (pumpkin and squash) species inferred from 44 nuclear loci.

Molecular Phylogenetics and Evolution, 111: 98–109.

Katoh, K. and Standley, D. M. 2013. MAFFT multiple sequence alignment software version 7: improvements

in performance and usability. Molecular Biology and Evolution, 30: 772–780.

Kearse, M., Moir, R., Wilson, A., Stones-Havas, S., Cheung, M., Sturrock, S., Buxton,

S., Cooper, A., Markowitz, S., Duran, C., Thierer, T., Ashton, B., Meintjes, P.,

and Drummond, A. 2012. Geneious Basic: an integrated and extendable desktop

software platform for the organization and analysis of sequence data. Bioinformatics,

28: 1647–1649.

Keck, D. D. 1945. Studies in Penstemon–XIII: a cyto-taxonomic account of the section

Spermunculus. American Midland Naturalist, 33: 128–206.

Kingman, J. F. C. 1982. On the genealogy of large populations. Journal of Applied

Probability, 19: 27–43.

Kubatko, L. S. and Chifman, J. 2015. An invariants-based method for hybridization de-

tection from genome-scale sequence data. bioRxiv, https://doi.org/10.1101/034348.

Kubatko, L. S. and Degnan, J. H. 2007. Inconsistency of phylogenetic estimates from

phylogenetic data under coalescence. Systematic Biology, 56: 17–24.

Kumar, S., Stecher, G., and Tamura, K. 2016. MEGA7: Molecular Evolutionary

Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution,

33: 1870–1874.

Larget, B., Kotha, S. K., Dewey, C. N., and Ané, C. 2010. BUCKy: gene tree /

species tree reconciliation with Bayesian concordance analysis. Bioinformatics, 26:

2910–2911.

Lawrence, T. J. and Datwyler, S. L. 2016. Testing the hypothesis of allopolyploidy

in the origin of Penstemon azureus (Plantaginaceae). Frontiers in Ecology and

Evolution, 4: 60.

Lawrence, W. J. C. 1929. The genetics and cytology of Dahlia species. Journal of

Genetics, 21: 125–158.

Lee, J.-Y., Mummenhoff, K., and Bowman, J. L. 2002. Allopolyploidization and

evolution of species with reduced floral structures in Lepidium L. (Brassicaceae).

Proceedings of the National Academy of Sciences USA, 99: 16835–16840.

Levin, D. A. 1973. The role of trichomes in plant defense. Quarterly Review of Biology,

48: 3–15.

Li, H. 2010. Mathematical notes on SAMtools algorithms.

https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf.

Li, H. 2011. A statistical framework for SNP calling, mutation discovery, association

mapping and population genetical parameter estimation from sequencing data.

Bioinformatics, 27: 2987–2993.

Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with

Burrows-Wheeler transform. Bioinformatics, 25: 1754–1760.

Logan-Young, C. J., Yu, J. Z., Verma, S. K., Percy, R. G., and Pepper, A. E. 2015.

SNP discovery in complex allotetraploid genomes (Gossypium spp., Malvaceae)

using genotyping by sequencing. Applications in Plant Sciences, 3: 1400077.

Lott, M., Spillner, A., Huber, K. T., and Moulton, V. 2009. PADRE: a package for

analyzing and displaying reticulate evolution. Bioinformatics, 25: 1199–1200.

Lu, F., Lipka, A. E., Glaubitz, J., Elshire, R., Cherney, J. H., Casler, M. D., Buckler,

E. S., and Costich, D. E. 2012. Switchgrass genomic diversity, ploidy, and evolution:

Novel insights from a network-based SNP discovery protocol. PLoS Genetics, 9:

e1003215.

Maddison, W. P. 1997. Gene trees in species trees. Systematic Biology, 46: 523–536.

Magoč, T. and Salzberg, S. L. 2011. FLASH: fast length adjustment of short reads to

improve genome assemblies. Bioinformatics, 27: 2957–2963.

Mallet, J. 2007. Hybrid speciation. Nature, 446: 279–283.

Marcussen, T., Jakobsen, K. S., Danihelka, J., Ballard, H. E., Blaxland, K., Brysting,

A. K., and Oxelman, B. 2012. Inferring species networks from gene trees in high-

polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic

Biology, 61: 107–126.

Martin, E. R., Kinnamon, D. D., Schmidt, M. A., Powell, E. H., Zuchner, S., and Morris,

R. W. 2010. SeqEM: an adaptive genotype-calling approach for next-generation

sequencing studies. Bioinformatics, 26: 2803–2810.

Maruki, T. and Lynch, M. 2017. Genotype calling from population-genomic sequencing

data. G3: Genes, Genomes, Genetics, 7: 1393–1404.

McAllister, C. A. and Miller, A. J. 2016. Single nucleotide polymorphism discovery

via genotyping by sequencing to assess population genetic structure and recurrent

polyploidization in Andropogon gerardii. American Journal of Botany, 103: 1314–

1325.

McDade, L. 1990. Hybrids and phylogenetic systematics I. Patterns of character

expression in hybrids and their implications for cladistic analysis. Evolution, 44:

1685–1700.

McDade, L. 1992. Hybrids and phylogenetic systematics II. The impact of hybrids on

cladistic analysis. Evolution, 46: 1329–1346.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A.,

Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. 2010. The

Genome Analysis Toolkit: A mapreduce framework for analyzing next-generation

DNA sequencing data. Genome Research, 20: 1297–1303.

Meng, C. and Kubatko, L. S. 2009. Detecting hybrid speciation in the presence

of incomplete lineage sorting using gene tree incongruence: a model. Theoretical

Population Biology, 75: 35–45.

Meng, X.-L. and Rubin, D. B. 1993. Maximum likelihood estimation via the ECM

algorithm: a general framework. Biometrika, 80: 267–278.

Merkel, D. 2014. Docker: lightweight Linux containers for consistent development and

deployment. Linux Journal, 2014: 2.

Miller, M. R., Dunham, J. P., Amores, A., Cresko, W. A., and Johnson, E. A.

2007. Rapid and cost-effective polymorphism identification and genotyping using

restriction site associated DNA (RAD) markers. Genome Research, 17: 240–248.

Mirarab, S. and Warnow, T. 2015. ASTRAL-II: coalescent-based species tree estimation

with many hundreds of taxa and thousands of genes. Bioinformatics, 31: i44–i52.

Moody, M. L., Mueller, L. D., and Soltis, D. E. 1993. Genetic variation and random

drift in autotetraploid populations. Genetics, 134: 649–657.

Morales-Briones, D. F., Liston, A., and Tank, D. C. 2018. Phylogenomic analyses

reveal a deep history of hybridization and polyploidy in the Neotropical genus

Lachemilla (Rosaceae). New Phytologist, https://doi.org/10.1111/nph.15099.

Muller, H. J. 1914. A new mode of segregation in Gregory’s tetraploid primulas.

American Naturalist, 48: 508–512.

Nielsen, R., Paul, J. S., Albrechtsen, A., and Song, Y. S. 2011. Genotyping and

SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12:

443–451.

Nielsen, R., Korneliussen, T., Albrechtsen, A., and Li, Y. 2012. SNP calling, genotype

calling, and sample allele frequency estimation from new-generation sequencing

data. PLoS ONE, 7: e37558.

Nold, R. 1999. Penstemons. Timber Press, Portland, OR.

Oberprieler, C., Wagner, F., Tomasello, S., and Konowalik, K. 2017. A permutation approach

for inferring species networks from gene trees in polyploid complexes by minimizing

deep coalescences. Methods in Ecology and Evolution, 8: 835–849.

Ogden, R., Gharbi, K., Mugue, N., Martinsohn, J., Senn, H., Davey, J. W., Pourkazemi,

M., McEwing, R., Eland, C., Vidotto, M., Sergeev, A., and Congiu, L. 2013. Stur-

geon conservation genomics: SNP discovery and validation using RAD sequencing.

Molecular Ecology, 22: 3112–3123.

Ohno, S. 1970. Evolution by gene duplication. Springer.

Otto, S. P. and Whitton, J. 2000. Polyploid incidence and evolution. Annual Review

of Genetics, 34: 401–437.

Pamilo, P. and Nei, M. 1988. Relationships between gene trees and species trees.

Molecular Biology and Evolution, 5: 568–583.

Parisod, C., Holderegger, R., and Brochmann, C. 2010. Evolutionary consequences of

autopolyploidy. New Phytologist, 186: 5–17.

Pennell, M. W., FitzJohn, R. G., Cornwell, W. K., and Harmon, L. J. 2015. Model ade-

quacy and the macroevolution of angiosperm functional traits. American Naturalist,

186: E100.

Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., and Hoekstra, H. E. 2012.

Double digest RADseq: an inexpensive method for de novo SNP discovery and

genotyping in model and non-model species. PLoS ONE, 7: e37135.

Plummer, M., Best, N., Cowles, K., and Vines, K. 2006. CODA: Convergence

Diagnostics and Output Analysis for MCMC. R News, 6: 7–11.

Puritz, J. B., Matz, M. V., Toonen, R. J., Weber, J. N., Bolnick, D. I., and Bird, C. E.

2014. Demystifying the RAD fad. Molecular Ecology, 23: 5937–5942.

R Core Team 2014. R: a language and environment for statistical computing. R

Foundation for Statistical Computing, Vienna, Austria.

R Core Team 2016. R: a language and environment for statistical computing. R

Foundation for Statistical Computing, Vienna, Austria.

Rambaut, A. and Grassly, N. C. 1997. Seq-Gen: an application for the Monte Carlo

simulation of DNA sequence evolution along phylogenetic trees. Computer Applications

in the Biosciences, 13: 235–238.

Ramsey, J. 2011. Polyploidy and ecological adaptation in wild yarrow. Proceedings of

the National Academy of Sciences, 108: 7096–7101.

Ramsey, J. and Ramsey, T. S. 2014. Ecological studies of polyploidy in the 100 years

following its discovery. Philosophical Transactions of the Royal Society B: Biological

Sciences, 369: 20130352.

Reid, N. M., Hird, S. M., Brown, J. M., Pelletier, T. A., McVay, J. D., Satler, J. D.,

and Carstens, B. C. 2014. Poor fit to the multispecies coalescent is widely detectable

in empirical data. Systematic Biology, 63: 322–333.

Rheindt, F. E., Fujita, M. K., Wilton, P. R., and Edwards, S. V. 2014. Introgression

and phenotypic assimilation in Zimmerius flycatchers (Tyrannidae): population

genetic and phylogenetic inferences from genome-wide SNPs. Systematic Biology,

63: 134–152.

Ripplinger, J. and Sullivan, J. 2010. Assessment of substitution model adequacy using

frequentist and Bayesian methods. Molecular Biology and Evolution, 27: 2790–2803.

Rogers, J. D. 1973. Polyploidy in Fungi. Evolution, 27: 153–160.

Rothfels, R. C., Li, F.-W., and Pryer, K. M. 2017. Next-generation polyploid phyloge-

netics: rapid resolution of hybrid polyploid complexes using PacBio single-molecule

sequencing. New Phytologist, 213: 413–429.

Sayyari, E. and Mirarab, S. 2016. Fast coalescent-based computation of local branch

support from quartet frequencies. Molecular Biology and Evolution, 33: 1654–1668.

Scarpino, S. V., Levin, D. A., and Meyers, L. A. 2014. Polyploid formation shapes

flowering plant diversity. American Naturalist, 184: 456–465.

Selmecki, A. M., Maruvka, Y. E., Richmond, P. A., Guillet, M., Shoresh, N., Sorenson,

A. L., De, S., Kishony, R., Michor, F., Dowell, R., and Pellman, D. 2015. Polyploidy

can drive rapid adaptation in yeast. Nature, 519: 349–352.

Serang, O., Mollinari, M., and Garcia, A. A. F. 2012. Efficient exact maximum a

posteriori computation for Bayesian SNP genotyping in polyploids. PLoS ONE, 7:

e30906.

Smith, S. A. and Dunn, C. 2008. Phyutility: a phyloinformatics utility for trees,

alignments, and molecular data. Bioinformatics, 24: 715–716.

Snir, S. 2012. Quartet maxcut: a fast algorithm for amalgamating quartet trees.

Molecular Phylogenetics and Evolution, 62: 1–8.

Solís-Lemus, C. and Ané, C. 2016. Inferring phylogenetic networks with maximum

pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12: e1005896.

Solís-Lemus, C., Bastide, P., and Ané, C. 2017. PhyloNetworks: a package for

phylogenetic networks. Molecular Biology and Evolution, 34: 3292–3298.

Soltis, D. E., Soltis, P. S., and Tate, J. A. 2003. Advances in the study of polyploidy

since plant speciation. New Phytologist, 161: 173–191.

Soltis, D. E., Soltis, P. S., Schemske, D. W., Hancock, J. F., Thompson, J. N., Husband,

B. C., and Judd, W. S. 2007. Autopolyploidy in angiosperms: have we grossly

underestimated the number of species? Taxon, 56: 13–30.

Soltis, D. E., Albert, V. A., Leebens-Mack, J., Bell, C. D., Peterson, A. H., Zheng, C.,

Sankoff, D., dePamphilis, C. W., Wall, P. K., and Soltis, P. S. 2009. Polyploidy and

angiosperm diversification. American Journal of Botany, 96: 336–348.

Soltis, D. E., Buggs, R. J. A., Doyle, J. J., and Soltis, P. S. 2010. What we still don’t

know about polyploidy. Taxon, 59: 1387–1403.

Soltis, D. E., Visger, C. J., and Soltis, P. S. 2014. The polyploidy revolution then...and

now: Stebbins revisited. American Journal of Botany, 101: 1057–1078.

Soltis, P. S. and Soltis, D. E. 2000. The role of genetic and genomic attributes in

the success of polyploids. Proceedings of the National Academy of Sciences, 97:

7051–7057.

Soltis, P. S. and Soltis, D. E. 2009. The role of hybridization in plant speciation.

Annual Review of Plant Biology, 60: 561–588.

Soza, V. L., Haworth, K. L., and Di Stilio, V. S. 2013. Timing and consequences

of recurrent polyploidy in meadow-rues (Thalictrum, Ranunculaceae). Molecular

Biology and Evolution, 30: 1940–1954.

Soza, V. L., Huynh, V. L., and Di Stilio, V. S. 2014. Pattern and process in the

evolution of the sole dioecious member of Brassicaceae. EvoDevo, 5: 42.

Stamatakis, A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-

analysis of large phylogenies. Bioinformatics, 30: 1312–1313.

Stamatakis, A., Hoover, P., and Rougemont, J. 2008. A rapid bootstrap algorithm for

the RAxML web servers. Systematic Biology, 57: 758–771.

Stebbins, G. L. 1950. Variation and evolution in plants. Columbia University Press.

Stenz, N. W. M., Larget, B., Baum, D. A., and Ané, C. 2015. Exploring tree-like and

non-tree-like patterns using genome sequences: an example using the inbreeding

plant species Arabidopsis thaliana (L.) Heynh. Systematic Biology, 64: 809–823.

Stift, M., Berenos, C., Kuperus, P., and van Tienderen, P. H. 2008. Segregation

models for disomic, tetrasomic and intermediate inheritance in tetraploids: a

general procedure applied to Rorippa (yellow cress) microsatellite data. Genetics,

179: 2113–2123.

Stojmenovic, I. and Zoghbi, A. 1998. Fast algorithms for generating integer partitions.

International Journal of Computer Mathematics, 70: 319–332.

Straw, R. M. 1955. Hybridization, homogamy, and sympatric speciation. Evolution, 9:

441–444.

Straw, R. M. 1966. A redefinition of Penstemon (Scrophulariaceae). Brittonia, 18:

80–95.

Strickler, D. 1997. Northwest Penstemons. Flower Press.

Takahata, N. 1989. Gene genealogy in three related populations: consistency probability

between gene and population trees. Genetics, 122: 957–966.

Tavaré, S. 1984. Line-of-descent and genealogical processes, and their applications in

population genetics models. Theoretical Population Biology, 26: 119–164.

Thomas, G. W. C., Ather, S. H., and Hahn, M. W. 2017. Gene-tree reconciliation

with MUL-trees to resolve polyploidy events. Systematic Biology, 66: 1007–1018.

Uribe-Convers, S., Settles, M. L., and Tank, D. C. 2016. A phylogenomic approach

based on PCR enrichment and high throughput sequencing: resolving diversity

within the South American species of Bartsia L. (Orobanchaceae). PLoS ONE, 11:

e0148203.

Van de Peer, Y., Fawcett, J. A., Proost, S., Sterck, L., and Vandepoele, K. 2009. The

flowering world: a tale of duplications. Trends in Plant Science, 14: 680–688.

Vieira, F. G., Fumagalli, M., Albrechtsen, A., and Nielsen, R. 2013. Estimating

inbreeding coefficients from NGS data: impact on genotype calling and allele

frequency estimation. Genome Research, 23: 1852–1861.

Voorrips, R., Gort, G., and Vosman, B. 2011. Genotype calling in tetraploid species

from bi-allelic marker data using mixture models. BMC Bioinformatics, 12: 172.

Wagner, W. H. 1970. Biosystematics and evolutionary noise. Taxon, 19: 146–151.

Wang, N., Thomson, M., Bodles, W. J. A., Crawford, R. M. M., Hunt, H. V.,

Fetherstone, A. W., Pellicer, J., and Buggs, R. J. A. 2013. Genome sequence of

dwarf birch (Betula nana) and cross-species RAD markers. Molecular Ecology, 22:

3098–3111.

Weir, B. S. 1996. Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.

Wen, D. and Nakhleh, L. 2018. Coestimating reticulate phylogenies

and gene trees from multilocus sequence data. Systematic Biology,

https://doi.org/10.1093/sysbio/syx085.

Wen, D., Yu, Y., Zhu, J., and Nakhleh, L. 2018. Inferring phylogenetic networks using

PhyloNet. Systematic Biology, https://doi.org/10.1093/sysbio/syy015.

Wessinger, C. A., Freeman, C. C., Mort, M. E., Rausher, M. D., and Hileman, L. C.

2016. Multiplexed shotgun genotyping resolves species relationships within the

North American genus Penstemon. American Journal of Botany, 103: 912–922.

Wickham, H. 2007. Reshaping data with the reshape package. Journal of Statistical

Software, 21(12): 1–20.

Wickham, H. 2009. ggplot2: elegant graphics for data analysis. Springer, New York.

Wilson, P. S., Wolfe, A. D., Armbruster, W. S., and Thompson, J. D. 2007. Constrained

lability in floral evolution: counting convergent origins of hummingbird pollination

in Penstemon and Keckiella. New Phytologist, 176: 883–890.

Winge, Ö. 1917. The chromosomes: their number and general importance. Comptes

rendus des travaux du Laboratoire Carlsberg, 13: 131–275.

Winkler, H. 1916. Über die experimentelle Erzeugung von Pflanzen mit abweichenden

Chromosomenzahlen. Zeitschrift für induktive Abstammungs- und Vererbungslehre,

8: 417–531.

Wolfe, A. D. 2005. ISSR techniques for evolutionary biology. Methods in Enzymology,

395: 134–144.

Wolfe, A. D., Xiang, Q.-Y., and Kephart, S. R. 1998a. Assessing hybridization in

natural populations of Penstemon (Scrophulariaceae) using hypervariable intersimple

sequence repeat (ISSR) bands. Molecular Ecology, 7: 1107–1125.

Wolfe, A. D., Xiang, Q.-Y., and Kephart, S. R. 1998b. Diploid hybrid speciation in

Penstemon (Scrophulariaceae). Proceedings of the National Academy of Sciences,

95: 5112–5115.

Wolfe, A. D., Datwyler, S. L., and Randle, C. P. 2002. A phylogenetic and biogeographic

analysis of the Cheloneae (Scrophulariaceae) based on ITS and matK sequence data.

Systematic Botany, 27: 138–148.

Wolfe, A. D., Randle, C. P., Datwyler, S. L., Morawetz, J. J., Arguedas, N., and

Diaz, J. 2006. Phylogeny, taxonomic affinities, and biogeography of Penstemon

(Plantaginaceae) based on ITS and cpDNA sequence data. American Journal of

Botany, 93: 1699–1713.

Wood, T. E., Takebayashi, N., Barker, M. S., Mayrose, I., Greenspoon, P. B., and

Rieseberg, L. H. 2009. The frequency of polyploid speciation in vascular plants.

Proceedings of the National Academy of Sciences, 106: 13875–13879.

Wright, S. 1931. Evolution in Mendelian populations. Genetics, 16: 97–159.

Wright, S. 1938. The distribution of gene frequencies in populations of polyploids.

Proceedings of the National Academy of Sciences, 24: 372–377.

Wright, S. 1951. The genetical structure of populations. Annals of Eugenics, 15:

323–354.

Zohren, J., Wang, N., Kardailsky, I., Borrell, J. S., Joecker, A., Nichols, R. A.,

and Buggs, R. J. A. 2016. Unidirectional diploid–tetraploid introgression among

British birch trees with shifting ranges shown by restriction site-associated markers.

Molecular Ecology, 25: 2413–2426.

Appendix A: Chapter 1 Supplemental Materials

A.1 Example Analyses of Autotetraploid Potato (Solanum tuberosum)

The following walk-through will take you through every step of analyzing the data set

for autotetraploid potato (Solanum tuberosum) that was completed in the manuscript.

Because the analysis with polyfreqs takes a few hours (there are 86,400 parameters

to estimate), we have provided the output from that step for you. The potato data

set is provided for free with the R package fitTetra, and the code below goes

through how we acquired, formatted, and rescaled it for an analysis with polyfreqs.

Instructions for installing polyfreqs can be found on the GitHub page associated with

the package (http://pblischak.github.io/polyfreqs). The following sections

are intended to be completed with the data in the example/ folder found in the

GitHub repository accompanying the manuscript

(https://github.com/pblischak/polyfreqs-ms-data).

# Using autotetraploid potato data from the fitTetra package.
# If not installed, install it using:
# install.packages("fitTetra")
# Then load the data.
library(fitTetra)
data(tetra.potato.SNP)

# Get the names of the individuals and loci.
samples <- unique(tetra.potato.SNP$SampleName)
markers <- unique(tetra.potato.SNP$MarkerName)

# Initialize x and y matrices -- x will be the reference allele.
potato_mat_x <- matrix(NA,
                       nrow=length(unique(tetra.potato.SNP$SampleName)),
                       ncol=length(unique(tetra.potato.SNP$MarkerName)))
rownames(potato_mat_x) <- samples
colnames(potato_mat_x) <- markers

potato_mat_y <- matrix(NA,
                       nrow=length(unique(tetra.potato.SNP$SampleName)),
                       ncol=length(unique(tetra.potato.SNP$MarkerName)))

# Get the counts from the data frame, one individual (row) at a time.
for(i in 1:dim(potato_mat_x)[1]){
  tmp <- subset(tetra.potato.SNP, SampleName==samples[i])
  potato_mat_x[i,] <- tmp$X_Raw
  potato_mat_y[i,] <- tmp$Y_Raw
}

# Get the total counts as the sum of x and y and give row and column names.
potato_mat_tot <- potato_mat_x + potato_mat_y
rownames(potato_mat_tot) <- samples
colnames(potato_mat_tot) <- markers

# Rescale, then print the tables to file in a format suitable for polyfreqs.
potato_mat_x <- round(potato_mat_x/100)
potato_mat_tot <- round(potato_mat_tot/100)
write.table(potato_mat_x, file="potato_ref_reads.txt", quote=F, sep="\t")
write.table(potato_mat_tot, file="potato_tot_reads.txt", quote=F, sep="\t")

If you look at the files that were just made (potato_ref_reads.txt and

potato_tot_reads.txt), you can see how data should be formatted for running

an analysis with polyfreqs. More details will be provided in the next section when we read in the data and analyze it.

A.1.1 Calculating Expected and Observed Heterozygosity

Next we will read the data into R. The simplest way to do this is to use the

read.table() function. In the total and reference read count files for the potato data,

the first row is a tab delimited list of locus names. This row is optional and can be

excluded. After that, each row has the name of the individual followed by the read

counts at each locus (tab delimited). The individual name is required because it is used when writing genotype samples to file (set genotypes=T when running polyfreqs).

To specify that the first column contains the names, we use the row.names argument

and set it equal to 1. To specify that the first row has column names for each locus

(you do not need a label for the names), set the header argument to TRUE. With the

data read in, all that is left to do is to load polyfreqs and set up an analysis.

NB: When the data are passed to the polyfreqs() function, make sure that they are converted to matrices using the as.matrix() function.
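To make the file layout described above concrete, here is a minimal Python sketch of a parser for the same tab-delimited format: an optional first row of locus names (with no label above the name column), followed by one row per individual. The function name `read_count_table` and the toy values are hypothetical and are not part of polyfreqs; the walkthrough itself uses read.table() in R.

```python
import io

def read_count_table(handle, header=True):
    """Parse a tab-delimited read count table.

    Returns (locus_names, individual_names, counts), where counts holds one
    list of integer read counts per individual.
    """
    lines = [ln.rstrip("\n") for ln in handle if ln.strip()]
    loci = None
    if header:
        # The header row lists locus names only (no label for the name column).
        loci = lines[0].split("\t")
        lines = lines[1:]
    names, counts = [], []
    for ln in lines:
        fields = ln.split("\t")
        names.append(fields[0])                      # individual name
        counts.append([int(x) for x in fields[1:]])  # read counts per locus
    if loci is not None and any(len(row) != len(loci) for row in counts):
        raise ValueError("row length does not match the number of locus names")
    return loci, names, counts

# Toy example (hypothetical values, two individuals and three loci).
example = "loc1\tloc2\tloc3\nind1\t12\t0\t7\nind2\t9\t15\t3\n"
loci, names, counts = read_count_table(io.StringIO(example))
```

The header check mirrors the description above: the first row, when present, has one field fewer than the rows that follow.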

# Read in data using read.table. Remember the row.names and header options.
# If you don't have locus names in the first row, take out header=T.
potato_tot_table <- read.table("potato_tot_reads.txt", row.names=1, header=T)
potato_ref_table <- read.table("potato_ref_reads.txt", row.names=1, header=T)

# Load polyfreqs
library(polyfreqs)

# Run through polyfreqs with genotypes=T
# and geno_dir="potato_genotypes".
# Make sure you use the as.matrix() command.
potato_out <- polyfreqs(as.matrix(potato_tot_table),
                        as.matrix(potato_ref_table), ploidy=4, iter=100000,
                        genotypes=T, geno_dir="potato_genotypes",
                        outfile="potato_mcmc.out")

The potato_out object will be a list of four items:

• potato_out$posterior_freqs – a matrix of the posterior samples of allele frequencies at each locus prior to burn-in (also printed to the file potato_mcmc.out).

• potato_out$map_genotypes – a matrix of the maximum a posteriori genotypes for each individual at each locus estimated using the posterior mode.

• potato_out$het_obs – a matrix of the per locus posterior samples of observed heterozygosity.

• potato_out$het_exp – a matrix of the per locus posterior samples of expected heterozygosity.

We will write each of these to file for downstream analyses (except for posterior_freqs, which already has its own file).

write.table(potato_out$map_genotypes, "potato_map_genotypes.txt", quote=F,
            row.names=F, col.names=F)
write.table(potato_out$het_obs, "potato_het_obs.txt", quote=F,
            row.names=F, col.names=F)
write.table(potato_out$het_exp, "potato_het_exp.txt", quote=F,
            row.names=F, col.names=F)

To evaluate the observed and expected heterozygosity, we will get multi-locus estimates by taking the mean across loci of the per locus posterior samples in the het_obs and het_exp matrices. We can then plot these and calculate summary statistics to understand the difference between them.

# If you have the potato_out object in the workspace you can proceed
# without reading in the files using the commands:
#
# het_obs <- potato_out$het_obs
# het_exp <- potato_out$het_exp

# We will read in the files and convert to matrices at the same time.
het_obs <- as.matrix(read.table("potato_het_obs.txt"))
het_exp <- as.matrix(read.table("potato_het_exp.txt"))

# Get a multi-locus estimate by taking the mean across loci using the
# apply function. Take 25% burn-in, so only samples 251-1000 are used.
multi_het_obs <- apply(het_obs[251:1000,], 1, mean, na.rm=T)
multi_het_exp <- apply(het_exp[251:1000,], 1, mean, na.rm=T)

# Check for convergence
library(coda)
effectiveSize(mcmc(multi_het_obs))
## var1
## 920.8956
effectiveSize(mcmc(multi_het_exp))
## var1
## 750

# Plot a simple set of histograms to see the difference (Figure 1.3 in MS).
# The histograms will look slightly different but this is just a quick view.
# This is because the spreads are very different,
# which affects bin size.
hist(multi_het_exp, col="blue", xlim=c(0.37, 0.39),
     main="Heterozygosity", xlab="")
hist(multi_het_obs, col="red", add=T)
legend(x="topright",
       c("expected","observed"),
       col=c("blue","red"),
       fill=c("blue","red"), bty="n")

[Figure: overlaid histograms titled "Heterozygosity" for the expected (blue) and observed (red) multi-locus estimates; x-axis from 0.370 to 0.390, y-axis: Frequency (0 to 250).]

# Calculate summary stats (mean and a 95% interval taken from the 2.5% and
# 97.5% quantiles, used here in place of an exact highest posterior density
# [HPD] interval) with the quantile() function.
list("mean_exp"= mean(multi_het_exp),
     "95HPD_exp"= quantile(multi_het_exp,c(0.025, 0.975)),
     "mean_obs"= mean(multi_het_obs),
     "95HPD_obs"= quantile(multi_het_obs,c(0.025, 0.975)))
## $mean_exp
## [1] 0.3722944
##
## $`95HPD_exp`
## 2.5% 97.5%
## 0.3711756 0.3735096
##
## $mean_obs
## [1] 0.3880829
##
## $`95HPD_obs`
## 2.5% 97.5%
## 0.3877001 0.3884551

As can be seen from the histograms and the summary statistics, the observed

heterozygosity is higher than the expected heterozygosity, consistent with a pattern of

excess outbreeding.

A.1.2 Evaluating Model Adequacy

To evaluate model adequacy using posterior predictive simulation, we used the posterior

distribution of allele frequencies from the polyfreqs run (potato_mcmc.out) minus

burn-in to look at model fit on a per locus basis. You will also need the original read

count data to compare the observed and predicted read count ratios for each locus.

# Read in the original read count data using read.table().
# Again, remember the row.names and header arguments.
potato_tot_table <- read.table("potato_tot_reads.txt",
                               row.names=1, header=T)
potato_ref_table <- read.table("potato_ref_reads.txt",
                               row.names=1, header=T)

# If you haven't done so, load polyfreqs (and coda for the convergence checks).
library(polyfreqs)
library(coda)

# Now we'll read in the posterior distribution of allele frequencies.
potato_mcmc_table <- read.table("potato_mcmc.out", row.names=1,
                                header=T)

# Take burn-in
potato_post <- potato_mcmc_table[251:1000,]

# Check for convergence
sum(effectiveSize(mcmc(potato_post)) < 200)
## [1] 0
plot(mcmc(potato_post[,4]))

[Figure: trace plot ("Trace of var1", 750 iterations) and density plot ("Density of var1", N = 750, bandwidth = 0.003843) for the fourth locus; sampled values range from about 0.74 to 0.82.]

# Run the analysis using the polyfreqs_pps() function.
potato_pps <- polyfreqs_pps(as.matrix(potato_post),
                            as.matrix(potato_tot_table),
                            as.matrix(potato_ref_table),
                            ploidy=4, error=0.01)

The potato_pps object will be a list with two items:

• potato_pps$ratio_diff – A matrix with the per locus posterior predictive samples of the read ratio differences.

• potato_pps$locus_fit – A logical vector indicating whether each locus passed or failed the posterior predictive check.

These two items can then be used to examine various aspects of model fit, such as the proportion of adequate and inadequate loci and the posterior predictive distribution of read ratio differences for inadequate loci.
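The logic of a posterior predictive check of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the polyfreqs_pps() implementation: for each posterior sample of the allele frequency it assumes genotypes are drawn from a binomial distribution, reads are drawn from a binomial with a symmetric error rate, and a locus counts as adequate when the central 95% of the simulated read ratio differences contains zero. The function names and toy data are hypothetical, and the exact statistic computed by polyfreqs may differ.

```python
import random

def binomial(n, p, rng):
    """Draw one Binomial(n, p) variate using n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

def pps_locus(post_freqs, tot, ref, ploidy, error=0.01, seed=1):
    """Posterior predictive check for one locus (assumed model, see lead-in).

    post_freqs: posterior samples of the allele frequency (after burn-in)
    tot, ref:   total and reference read counts, one entry per individual
    Returns (diffs, fit): the simulated read ratio differences and whether
    the central 95% interval of those differences contains zero.
    """
    rng = random.Random(seed)
    obs_ratio = sum(ref) / sum(tot)
    diffs = []
    for p in post_freqs:
        sim_ref = 0
        for t in tot:
            g = binomial(ploidy, p, rng)       # simulated genotype
            frac = g / ploidy
            # Probability that a read shows the reference allele, with error.
            p_read = frac * (1 - error) + (1 - frac) * error
            sim_ref += binomial(t, p_read, rng)
        diffs.append(obs_ratio - sim_ref / sum(tot))
    ordered = sorted(diffs)
    # Approximate 2.5% and 97.5% quantiles from order statistics.
    lo = ordered[int(0.025 * (len(ordered) - 1))]
    hi = ordered[int(0.975 * (len(ordered) - 1))]
    return diffs, (lo <= 0.0 <= hi)

# Toy data: 30 tetraploid individuals with 20 reads each.
post = [0.5] * 200
diffs_ok, fit_ok = pps_locus(post, [20] * 30, [10] * 30, ploidy=4)
diffs_bad, fit_bad = pps_locus(post, [20] * 30, [20] * 30, ploidy=4)
```

In the toy data the first locus has an observed read ratio that matches the posterior allele frequency, while the second has only reference reads, so the simulated differences sit well away from zero.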

# Get the proportion of adequate and inadequate loci.
list("adequate"= mean(potato_pps$locus_fit),
     "inadequate"=1- mean(potato_pps$locus_fit))

## $adequate

## [1] 0.8723958

##

## $inadequate

## [1] 0.1276042

# Get the names of the loci that are inadequate
# (provided that locus names are given).
names(potato_pps$locus_fit[potato_pps$locus_fit==FALSE])
## [1] "PotSNP002" "PotSNP015" "PotSNP020" "PotSNP044" "PotSNP068"
## [6] "PotSNp071" "PotSNP080" "PotSNP104" "PotSNP138" "PotSNP140"
## [11] "PotSNP154" "PotSNP183" "PotSNP193" "PotSNP213" "PotSNP225"
## [16] "PotSNP238" "PotSNP245" "PotSNP247" "PotSNP249" "PotSNP252"
## [21] "PotSNP254" "PotSNP258" "PotSNP259" "PotSNP262" "PotSNP267"
## [26] "PotSNP268" "PotSNP275" "PotSNP277" "PotSNP286" "PotSNP287"
## [31] "PotSNP289" "PotSNP299" "PotSNP300" "PotSNP310" "PotSNP311"
## [36] "PotSNP313" "PotSNP327" "PotSNP329" "PotSNP331" "PotSNP335"
## [41] "PotSNP339" "PotSNP360" "PotSNP367" "PotSNP368" "PotSNP369"
## [46] "PotSNP372" "PotSNP373" "PotSNP383" "PotSNP384"
length(potato_pps$locus_fit[potato_pps$locus_fit==FALSE])
## [1] 49

# Plot the posterior predictive distribution of read ratio differences.
inadequate <- names(potato_pps$locus_fit[potato_pps$locus_fit==FALSE])
hist(potato_pps$ratio_diff[,inadequate[1]], main=inadequate[1], xlab="")
abline(v=quantile(potato_pps$ratio_diff[,inadequate[1]],c(0.025,0.975)),
       col="blue", lty="dashed", lwd=2)
abline(v=0, col="red")

[Figure: histogram titled "PotSNP002" of the posterior predictive read ratio differences; x-axis from −5 to 25, y-axis: Frequency (0 to 120).]

The stochastic nature of simulating data may change the results slightly between posterior predictive model checking runs, but we consistently get approximately 13–14% of loci fitting the model poorly.

[Plot: RMSE (y-axis, 0.05 to 0.15) against sequencing coverage (c5, c10, c20, c50, c100) in a grid of panels, with allele frequencies f0.01, f0.05, f0.1, f0.2, f0.4 in columns and numbers of individuals i5, i10, i20, i30 in rows; methods: polyfreqs, ratio.]

Figure A.1: Comparison of posterior mean versus mean read ratio estimates of allele frequencies for all simulation settings. All panels are set up the same as in Figure 1.1.

[Plot: per locus allele frequency estimates (y-axis: Allele frequency, 0.00 to 1.00) for the two methods, polyfreqs and simple.]

Figure A.2: Comparison of posterior mean versus mean read ratio estimates of allele frequencies for Solanum tuberosum. We sorted the estimates from lowest to highest based on the posterior mean, which is why the read ratio estimates appear to be quite erratic. The same result is seen in the posterior mean estimates if the results are sorted by the read ratio estimate.

[Plot: density (y-axis, 0 to 15) of the difference simple − polyfreqs (x-axis, −0.05 to 0.10).]

Figure A.3: Density plot of the difference between the mean read ratio (simple) and posterior mean estimates in Solanum tuberosum taken for all loci. The distribution is centered around 0, demonstrating that on average there is no difference between the estimates.

[Plot: RMSE (y-axis, 0.025 to 0.075) against number of individuals (i5, i10, i20, i30) in two panels, c10 and c100, with one series per allele frequency (f0.01, f0.05, f0.1, f0.2, f0.4).]

Figure A.4: A close-up comparison of the effect of coverage and the number of individuals sampled on estimation error for octoploids. The two panels represent two levels of sequencing coverage (10x and 100x) and compare the RMSE across the different numbers of individuals sampled (5, 10, 20, 30) and the different allele frequencies used to simulate read counts (0.01, 0.05, 0.1, 0.2, 0.4) for the simulation study. The two panels show that error decreases more from an increase in the number of individuals sampled than from a 10-fold increase in sequencing coverage.

Appendix B: Chapter 2 Supplemental Materials

B.1 EM Algorithms

An important aspect of the EM algorithm is the convergence criterion. For all of our

analyses we updated all individual parameters in the model until successive estimates

differed by less than 1e-5. Once a parameter had converged, we no longer updated it

in the remaining iterations of the algorithm. This causes the algorithm to “gain speed”

as more and more parameter values converge. Below we outline the mathematical

details of the maximization steps that were used for our EM algorithms.
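The per-parameter stopping rule described above can be illustrated with a short Python sketch. It is not taken from the ebg source code: the function `iterate_until_converged`, its `update` callback, and the toy update rule are hypothetical stand-ins for the actual maximization steps. The point is only that a parameter leaves the active set once two successive estimates differ by less than the tolerance, so later iterations do less work.

```python
def iterate_until_converged(params, update, tol=1e-5, max_iter=1000):
    """Update each parameter until its successive estimates differ by < tol.

    update(k, params) returns the new estimate for parameter k given the
    current values of all parameters. Converged parameters are frozen.
    Returns the final estimates and the number of iterations performed.
    """
    params = list(params)
    active = set(range(len(params)))   # indices still being updated
    n_iter = 0
    while active and n_iter < max_iter:
        n_iter += 1
        new_values = {k: update(k, params) for k in active}
        for k, new in new_values.items():
            if abs(new - params[k]) < tol:
                active.discard(k)      # converged: skip in later iterations
            params[k] = new
    return params, n_iter

# Toy update rule (hypothetical): move halfway toward a fixed target value.
targets = [0.1, 0.5, 0.9]
estimates, n_iter = iterate_until_converged(
    [0.0, 0.0, 0.0], lambda k, cur: (cur[k] + targets[k]) / 2.0)
```

With this toy rule the parameter with the smallest target converges first and is dropped from the active set while the others continue.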

B.1.1 Autopolyploid Model

When doing the conditional maximization steps for the autopolyploid model, our

ECM algorithm first updates the allele frequencies with the current estimates for the

inbreeding coefficients fixed, followed by updating the inbreeding coefficients with the

current estimates for allele frequencies fixed. For each locus $\ell = 1,\ldots,L$, we update $p_\ell$ independently using the following objective function for Brent’s method (Brent, 1973):

$$p_\ell^{(t+1)} = \arg\max_{p_\ell} \left[ \sum_i \sum_a P(g_{i\ell} = a \mid d_{i\ell}, p_\ell^{(t)}, F_i^{(t)}) \times \log\left[ P(d_{i\ell} \mid g_{i\ell} = a)\, P(g_{i\ell} = a \mid p_\ell, F_i^{(t)}) \right] \right] \tag{B.1}$$

Once we have obtained all of the $p_\ell^{(t+1)}$’s, we then estimate the inbreeding coefficient for each individual $i = 1,\ldots,N$ using a similar function:

$$F_i^{(t+1)} = \arg\max_{F_i} \left[ \sum_\ell \sum_a P(g_{i\ell} = a \mid d_{i\ell}, p_\ell^{(t+1)}, F_i^{(t)}) \times \log\left[ P(d_{i\ell} \mid g_{i\ell} = a)\, P(g_{i\ell} = a \mid p_\ell^{(t+1)}, F_i) \right] \right] \tag{B.2}$$

B.1.2 Allopolyploid Model

An analytical solution for the EM update of the allele frequencies in subgenome two is available using a derivation similar to that of Li (2010, 2011). We wish to maximize the following function across all loci independently:

$$p_{2\ell}^{(t+1)} = \arg\max_{p_{2\ell}} Q(p_{2\ell} \mid p_{2\ell}^{(t)}) = \arg\max_{p_{2\ell}} \left[ \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \times \log\left[ P(d_{i\ell} \mid g_{i\ell} = a_1 + a_2)\, P(g_{1i\ell} = a_1 \mid p^{*}_{1\ell})\, P(g_{2i\ell} = a_2 \mid p_{2\ell}) \right] \right]. \tag{B.3}$$

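Because many of the terms in $Q(p_{2\ell} \mid p_{2\ell}^{(t)})$ are constant and do not depend on $p_{2\ell}$, we can combine them together to form a simpler expression:

$$\begin{aligned}
Q(p_{2\ell} \mid p_{2\ell}^{(t)}) &= C_1 + C_2 + \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \log P(g_{2i\ell} = a_2 \mid p_{2\ell}) \\
&= C + \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \log P(g_{2i\ell} = a_2 \mid p_{2\ell}) \\
&= C + \left[ \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \times \left[ a_2 \log(p_{2\ell}) + (m_{2i} - a_2) \log(1 - p_{2\ell}) \right] \right].
\end{aligned} \tag{B.4}$$

$$C_1 = P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \log P(d_{i\ell} \mid g_{i\ell} = a_1 + a_2),$$

$$C_2 = P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \log P(g_{1i\ell} = a_1 \mid p^{*}_{1\ell}).$$

Here $C$ is equal to the sum of $C_1$ and $C_2$ across all individuals and values for the genotypes in each subgenome, and we also drop the binomial coefficient in the third step.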

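Taking the derivative with respect to $p_{2\ell}$ and setting it equal to 0, we can derive the EM update for the next iteration:

$$\begin{aligned}
\frac{\partial}{\partial p_{2\ell}} Q(p_{2\ell} \mid p_{2\ell}^{(t)}) &= \frac{1}{p_{2\ell}(1 - p_{2\ell})} \left[ \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \times \left[ a_2 (1 - p_{2\ell}) - (m_{2i} - a_2) p_{2\ell} \right] \right] \\
&= \frac{1}{p_{2\ell}(1 - p_{2\ell})} \left[ \sum_i \sum_{a_1} \sum_{a_2} P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)}) \times \left[ a_2 - m_{2i} p_{2\ell} \right] \right] = 0.
\end{aligned} \tag{B.5}$$

Substituting $p_{2\ell}^{(t+1)}$ for $p_{2\ell}$ and solving gives us the needed equation:

$$p_{2\ell}^{(t+1)} = \frac{\sum_i \sum_{a_1} \sum_{a_2} a_2\, P(g_{i\ell} = a_1 + a_2 \mid d_{i\ell}, p^{*}_{1\ell}, p_{2\ell}^{(t)})}{\sum_i m_{2i}}. \tag{B.6}$$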
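The closed-form update in Equation B.6 is a posterior-weighted count of alternative alleles in subgenome two divided by the total number of allele copies in that subgenome. The Python sketch below shows that arithmetic for a single locus. It assumes the posterior genotype probabilities have already been computed in the E-step; the function name `update_p2`, the nested-list layout `post[i][a1][a2]`, and the toy values are hypothetical and do not come from the ebg code.

```python
def update_p2(post, m2):
    """EM update for the subgenome-two allele frequency at one locus.

    post[i][a1][a2]: posterior probability that individual i carries a1 and
                     a2 copies of the alternative allele in subgenomes one
                     and two (each post[i] sums to 1).
    m2[i]:           ploidy of subgenome two in individual i.
    """
    numer = 0.0
    for post_i in post:
        for row in post_i:                    # loop over a1
            for a2, prob in enumerate(row):   # loop over a2
                numer += a2 * prob            # expected alt copies, subgenome two
    return numer / sum(m2)

# Toy posterior for two individuals with diploid subgenomes (a1, a2 in 0..2).
post = [
    [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # a2 = 2 with certainty
    [[0.5, 0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # a2 = 0 or 1, equally likely
]
p2_new = update_p2(post, m2=[2, 2])
```

Here the expected number of alternative copies in subgenome two is 2 + 0.5 = 2.5 out of 4 allele copies, so the updated frequency is 0.625.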

B.1.3 C++ Code

The C++ source code for fitting the four models used in the manuscript is provided
on GitHub (https://github.com/pblischak/polyploid-genotyping). The four
models are (1) Hardy Weinberg equilibrium (hwe), (2) Autopolyploid/Hardy Weinberg
disequilibrium (diseq), (3) the Allopolyploid Subgenome model (alloSNP), and
(4) the flat genotype prior model (gatk). The executable, ebg, can be compiled
using the Makefile provided in the ebg/ folder that contains the source code. The
program has a simple command line interface for specifying which model is to be used
and the input necessary for completing an analysis. Within the data/ folder of the
GitHub repository, we also provide the simulated read count data from Betula pendula
and B. pubescens as example data sets, with instructions for how to analyze them in
the README. Note that the first two models can also be run on
diploids, as was done in the manuscript for B. pendula. Analyses of mixed ploidy are
not currently implemented.

In the same GitHub repository we also provide the R and C++ code that we used

for all of the simulations conducted in the paper. These can be found in the Rcode/

folder, along with a README file with details about how the scripts can be used for

simulating and analyzing the data used in the manuscript.

B.2 Simulations

B.2.1 Inbreeding Coefficient From Called Genotypes

We calculated the inbreeding coefficient from called genotypes using observed and expected heterozygosity:

$$F_i = 1 - \frac{H_o(i)}{H_e}. \tag{B.7}$$

Observed heterozygosity was calculated using the following equation (Blischak et al., 2016; Hardy, 2016):

$$H_o(i) = \frac{1}{L} \sum_\ell h_i = \frac{1}{L} \sum_\ell \frac{g_{i\ell}(m_i - g_{i\ell})}{\binom{m_i}{g_{i\ell}}}. \tag{B.8}$$

Expected heterozygosity was calculated as the average heterozygosity across all loci:

$$H_e = \frac{1}{L} \sum_\ell \left[ 1 - p_\ell^2 - (1 - p_\ell)^2 \right]. \tag{B.9}$$
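As a small worked example of Equations B.7 and B.9, the Python sketch below computes the expected heterozygosity from a list of per locus allele frequencies and then the inbreeding coefficient for one individual. The observed heterozygosity is passed in as a precomputed value rather than derived from genotypes. The function names and the toy numbers are hypothetical and are not taken from the scripts used in the manuscript.

```python
def expected_heterozygosity(freqs):
    """Average of 1 - p^2 - (1 - p)^2 across loci (Equation B.9)."""
    return sum(1.0 - p ** 2 - (1.0 - p) ** 2 for p in freqs) / len(freqs)

def inbreeding_coefficient(h_obs, h_exp):
    """F_i = 1 - H_o(i) / H_e (Equation B.7)."""
    return 1.0 - h_obs / h_exp

# Toy values: two loci with allele frequency 0.5 give H_e = 0.5, so an
# individual with observed heterozygosity 0.25 has F_i = 0.5.
h_e = expected_heterozygosity([0.5, 0.5])
f_i = inbreeding_coefficient(0.25, h_e)
```

An individual whose observed heterozygosity equals the expected value has an estimate of 0, and values above the expectation give negative estimates.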

B.3 Empirical Data Analysis

B.3.1 Data Acquisition

Andropogon gerardii

A VCF file for A. gerardii was downloaded from Dryad (file:
McAllister.Miller.all.mergedRefGuidedSNPs.vcf.gz; link:
http://datadryad.org/resource/doi:10.5061/dryad.05qs7) along with all individual metadata
(file: McAllister_Miller_Locality_Ploidy_Info.csv). Read counts were
extracted and filtered as described in the Methods for hexaploids and nonaploids
separately. Below is an example of the commands that we ran using VCFtools, and our Perl and R scripts.

# Bash code for running VCFtools + Perl and R scripts

# Substitute 'nona' for 'hex' to run the scripts for nonaploids

vcftools --gzvcf McAllister.Miller.all.mergedRefGuidedSNPs.vcf.gz \

--keep andropogon-hex-names.txt \

--max-alleles 2 --min-alleles 2 --thin 10000 \

--minDP 5 --max-missing 0.5 --remove-indels \

--remove-filtered-all --recode \

--stdout | perl read-counts-from-vcf.pl andr-hex-tot.txt \

andr-hex-alt.txt 2 5

# Perl script arguments: name for tot file, name for alt file,

# allele depth position in VCF file, min read depth

# R script Arguments: tot file, alt file, % missing cutoff,

# transpose output (T/F), missing data val

Rscript --vanilla filter-inds.R andr-hex-tot.txt andr-hex-alt.txt \

0.5 TRUE -9

VCF files store information with individuals in columns and loci in rows. ebg expects loci to be in columns and individuals to be in rows. This is why we transpose the data matrices within the R script. Loci in McAllister and Miller (2016) were kept with a minimum Phred score of Q20. Since error information was not directly available,

we used a value of 0.01 for the error for each locus as a conservative, maximum level

of error.

Betula pubescens and B. pendula

Genotype data for Betula were downloaded from Dryad (file:
data_80p_genlight.rdata; link:
http://datadryad.org/resource/doi:10.5061/dryad.815rj) as an Rdata file with genotypes stored as a genlight
object from the R package adegenet (Jombart and Ahmed, 2011). The genlight
object is designed to store genotypes more efficiently but can be easily converted into
a matrix of integer genotypes, which is what we did for simulating read data using
our own R and C++ code.

B.3.2 Comparison with GATK

Below we provide a walkthrough of our analyses that compared our models for

genotyping with GATK. The code that was used is provided, and we have also

provided any scripts that were used on GitHub (https://github.com/pblischak/polyploid-genotyping). These steps were completed separately for Betula pendula

and B. pubescens (replace ‘pendula’ with ‘pubescens’ in each step).

Indexing Betula nana Reference Genome

The reference genome for Betula nana was downloaded from Dryad
(http://datadryad.org/resource/doi:10.5061/dryad.815rj) and was processed following
Zohren et al. (2016) (concatenating all contigs with 50 N’s in between). Next, we
indexed the reference genome for downstream analyses using BWA, SAMtools, and Picard
(Li and Durbin, 2009; Li, 2011; https://broadinstitute.github.io/picard).

bwa index Betula_concat_reference.fasta

samtools faidx Betula_concat_reference.fasta

java -jar ~/picard-tools-2.2.1/picard.jar \

CreateSequenceDictionary \

R=Betula_concat_reference.fasta \

O=Betula_concat_reference.dict

Mapping Reads with BWA

We then downloaded FASTQ files from the European Nucleotide Archive (Project

Accession ERA600270; link: http://www.ebi.ac.uk/ena/data/view/PRJEB3322)

for 15 individuals each of B. pendula and B. pubescens. Below are the individual files

that we downloaded for each species.

Betula pendula:

• 1147x_CTCTCTAG.fq.gz, 1148x_AGCTATAG.fq.gz, 1163x_TGTGACTG.fq.gz,

14007_CTAGCTCT.fq.gz, 14008_CTGATGCT.fq.gz, 14009_GACTCATC.fq.gz,

2310x_TCTCGCTC.fq.gz, 2315x_TGACTGTG.fq.gz, 2320x_ACACTGAC.fq.gz,

2346x_GTACTCGT.fq.gz, 2347x_GTCATGTG.fq.gz, 2350x_TGCATCGT.fq.gz,

2354x_TGTGACTG.fq.gz, 2361x_ACACGACA.fq.gz, 2380x_AGAGCTAG.fq.gz

Betula pubescens:

• 1045x_CACACAGT.fq.gz, 1045x_CATGA_1.fq.gz, 1123x_AAGGG_1.fq.gz,

1123x_ACGTAGCA.fq.gz, 1153x_TTTTA_1.fq.gz, 1158x_CAGTGTGT.fq.gz,

1158x_GTTGT_1.fq.gz, 13004_CTAGTGTC.fq.gz, 13006_CTAGATAG.fq.gz,

14007_CTAGCTCT.fq.gz, 14008_CTGATGCT.fq.gz, 14009_GACTCATC.fq.gz,

1578x_CGTATGTA.fq.gz, 1578x_GTGTG_1.fq.gz, 38005_GACTACGA.fq.gz

These files were mapped to the B. nana reference using the BWA MEM algorithm

(Li and Durbin, 2009). These mapped alignments were then converted from SAM to

BAM format and sorted using SAMtools (Li, 2011).

for f in *.fq.gz;
do
  # Use the part of the file name before the first '.' as the output prefix.
  PREFIX=$(echo $f | awk -F'.' '{print $1}')
  bwa mem -t 2 ../../Betula_concat_reference.fasta $f | \
    samtools view -bSu - | \
    samtools sort -O bam -o ../bam/$PREFIX.sorted.bam
done

Adding Read Groups and Genotyping with GATK

Read groups were added to the BAM files output by the previous step using Picard, followed by sorting by coordinate, and genotyping using the GATK UnifiedGenotyper

(https://broadinstitute.github.io/picard; McKenna et al., 2010).

# Adding read groups to BAM files
for f in *.bam
do
  # Use the part of the file name before the first '.' as the sample name.
  PREFIX=$(echo $f | awk -F'.' '{print $1}')
  java -jar ~/picard-tools-2.2.1/picard.jar AddOrReplaceReadGroups \
    I=$f \
    O=$PREFIX.sortedRG.bam \
    SORT_ORDER=coordinate \
    RGID=pendula \
    RGLB=pendula \
    RGPL=illumina \
    RGSM=$PREFIX \
    RGPU=pendula \
    CREATE_INDEX=True
done

# Running GATK

java -jar ~/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar \

-T UnifiedGenotyper \

-R ../../Betula_concat_reference.fasta \

$(for f in *sortedRG.bam; do printf "%sI $f " '-'; done) \

-o ../pendula-ug.vcf \

-ploidy 2

Filtering Variants Called by GATK

Variants were filtered using the following criteria: biallelic SNPs only, minimum QUAL score of 30, minimum read depth of 5, and maximum number of 5 missing individuals.

We wrote our own R script to complete this step (most tools do not take polyploid

VCF files) called filter-vcf.R which is available on GitHub.

# Filtering variants based on depth, QUAL, max # of missing individuals

Rscript filter-vcf.R --vcf pendula-ug.vcf --minDP 5 --minQ 30 \

--missing 5 --out pendula-ug-filtered30.vcf

Finding Shared Variants

After filtering the variants for B. pendula and B. pubescens, we took the intersection

of their VCF files to find SNPs that were in common.

# Find intersection of variants in two VCF files

Rscript intersect-vcf.R --vcf1 pendula-ug-filtered30.vcf \

--vcf2 pubescens-ug-filtered30.vcf \

--prefix filtered30

Preparing Input Files for ebg

We extracted read counts (total number of reads, number of alt reads) from the allele

depth field (AD) using a Python script. Per locus error rates were calculated by first

making a pileup of reads from the original BAM files at the shared variant positions

using SAMtools. We then used Python to calculate the average error value at each

site using the PHRED scores reported in the pileup file.

# Get read counts using Python

python read-counts-from-vcf.py -n 15 -s 14931 \

--vcf filtered30-vcf1-pendula.vcf --prefix filtered30-pendula

# Get error values from pileup files

samtools mpileup -I -f Betula_concat_reference.fasta \

-l filtered30-variants.txt $(ls *.sortedRG.bam) \

-o filtered30-pendula.pileup

# Extract error values and take per locus avg.

python per-locus-err.py -i filtered30-pendula.pileup \

-n 15 > filtered30-pendula-err.txt
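The two extraction steps described above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the contents of read-counts-from-vcf.py or per-locus-err.py: it assumes the AD field lists reference and alternative depths in that order, that base qualities use the standard Phred+33 encoding, and that missing or low-depth entries are coded as -9 (the missing data value used elsewhere in this appendix). The function names and the example record are hypothetical.

```python
def read_counts_from_vcf_line(line, min_depth=0, missing=-9):
    """Return (total, alt) read counts per sample from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    ad_idx = fields[8].split(":").index("AD")   # position of AD in FORMAT
    tot, alt = [], []
    for sample in fields[9:]:
        parts = sample.split(":")
        ad = parts[ad_idx].split(",") if len(parts) > ad_idx else ["."]
        if len(ad) < 2 or "." in ad:
            tot.append(missing)
            alt.append(missing)
            continue
        ref_n, alt_n = int(ad[0]), int(ad[1])
        if ref_n + alt_n < min_depth:
            # Treat low-depth genotypes as missing data.
            tot.append(missing)
            alt.append(missing)
        else:
            tot.append(ref_n + alt_n)
            alt.append(alt_n)
    return tot, alt

def mean_phred_error(quals, offset=33):
    """Average error probability from a string of Phred+33 base qualities."""
    errs = [10 ** (-(ord(c) - offset) / 10) for c in quals]
    return sum(errs) / len(errs)

# Hypothetical VCF record with three samples.
record = "chr1\t100\t.\tA\tG\t50\tPASS\tAC=1\tGT:AD:DP\t0/1:6,4:10\t./.:.:.\t0/0:3,0:3"
tot, alt = read_counts_from_vcf_line(record, min_depth=5)
err = mean_phred_error("I5")   # 'I' is Q40 and '5' is Q20
```

In the example, the second sample has no allele depth information and the third falls below the minimum depth of 5, so both are coded as missing.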

Running ebg

With the input files for ebg prepared, we ran an analysis on B. pendula using our hwe model.

# Analysis for Betula pendula w/ hwe model

ebg hwe -t pendula-filtered30-tot.txt \

-a pendula-filtered30-alt.txt \

-e pendula-filtered30-err.txt \

-n 15 \

-l 14931 \

-p 2 \

--iters 1000 \

--prefix filtered30-pendula

The allele frequency estimates for B. pendula were used as a reference for the estimation of genotypes in B. pubescens using the alloSNP model.

# Analysis for Betula pubescens w/ alloSNP model

ebg alloSNP -f filtered30-pendula-freqs.txt \

-t pubescens-filtered30-tot.txt \

-a pubescens-filtered30-alt.txt \

-e pubescens-filtered30-err.txt \

-n 15 \

-l 14931 \

-p1 2 \

-p2 2 \

--iters 100 \

--brent \

--prefix filtered30-pubescens

Comparing Genotype Estimates

To compare the estimated genotypes between GATK and ebg, we first extracted the

genotypes (number of alternative alleles) from each VCF file using a Python script.

python gt-from-vcf.py --vcf filtered30-pendula-vcf1.vcf \

--prefix filtered30-pendula-gatk

Next, we read the genotype estimates into R and calculated what percent of the

estimated genotypes were identical between the two methods (B. pendula: 99.1%

identical; B. pubescens: 96.2% identical).

# Read in genotype data for allopolyploid B. pubescens
g1_alloSNP <- as.matrix(read.table("filtered30-pubescens-alloSNP-g1.txt",
                                   header=F, na.strings="-9"))
g2_alloSNP <- as.matrix(read.table("filtered30-pubescens-alloSNP-g2.txt",
                                   header=F, na.strings="-9"))
# The full genotype is the sum of the two subgenome genotypes.
g_tot_alloSNP <- g1_alloSNP + g2_alloSNP

# Read in genotypes from GATK
g_tot_gatk <- t(as.matrix(read.table("filtered30-pubescens-gatk-out.txt",
                                     header=F)))

# Percent identical genotypes
mean((g_tot_gatk - g_tot_alloSNP)==0, na.rm = T) * 100

# Read in genotype data for diploid B. pendula
g_hwe <- as.matrix(read.table("filtered30-pendula-hwe-genos.txt",
                              header=F, na.strings="-9"))

# Read in genotypes called by GATK for B. pendula
g_gatk <- t(as.matrix(read.table("filtered30-pendula-gatk-out.txt",
                                 header=F)))

# Percent identical genotypes
mean((g_gatk - g_hwe)==0, na.rm = T) * 100

We also compared allele frequency estimates between our Hardy Weinberg model
and those estimated by GATK for B. pendula (Figure B.10). The root mean squared
deviation (RMSD) between the two estimates was calculated as follows:

$$\mathrm{RMSD} = \sqrt{\frac{1}{L} \sum_{\ell=1}^{L} \left( p_{\mathrm{hwe},\ell} - p_{\mathrm{gatk},\ell} \right)^2}. \tag{B.10}$$

Finally, we compared the genotype estimates from GATK and the allopolyploid
model by looking at the distribution of estimated full genotype for our model and

seeing how often it matched the estimate from GATK (Figure B.11). We did this in

the following way: for each genotype estimated to be 0 by GATK, we looked at the

same genotypes to see what the corresponding estimates were for the allopolyploid

model. We then repeated this procedure for genotypes estimated by GATK to be 1, 2,

3, and 4 copies of the alternative allele. Figure B.11 shows this distribution of the
genotypes estimated by the allopolyploid model when the estimate from GATK is 0
through 4 copies of the alternative allele. The R code for performing these comparisons is

below. For genotype estimates that did not match, the estimates by our model tended

to have one fewer copy of the alternative allele compared to GATK. We believe this is

because our model uses an outside estimate of the allele frequency for subgenome one, which influences our genotype estimation algorithm in ways not experienced by GATK.

Using this outside information can be especially useful when sequencing coverage is

low (e.g., see Figure 2.2 in Chapter 2). However, if the allele frequencies used are not

representative of the allele frequencies in subgenome one (i.e., if they are not from the

actual parental species), then they may lead to poor genotype estimates. Thus it is

important to use a reference panel from a known parental species.

library(ggplot2)

################################
# Allele frequency comparisons #
################################

hwe_freqs <- as.matrix(read.table("filtered30-pendula-hwe-freqs.txt"))
variants <- read.table("filtered30-variants.txt", stringsAsFactors = F)
gatk_vcf <- read.table("pendula-ug-filtered30.vcf", stringsAsFactors = F)
gatk_variants <- dplyr::semi_join(gatk_vcf, variants)
# Take the value of the second key=value entry in the INFO column (column 8).
gatk_freqs <- apply(gatk_variants, 1,
                    function(x){
                      as.numeric(strsplit(strsplit(x[8],";")[[1]][2],"=")[[1]][2])
                    })
sqrt(mean((hwe_freqs - gatk_freqs)^2))
figS10 <- qplot(hwe_freqs - gatk_freqs) + theme_bw(base_size = 22) +
  xlab("hwe - gatk") +
  ggtitle("Allele frequency estimates for Hardy Weinberg vs. GATK")
print(figS10)
ggsave("../supp/supp-figs/FigureS10-hwe-gatk-freqs.pdf",
       figS10, height=100, width=169, units="mm", scale=2.5)

########################
# Genotype comparisons #
########################

gatk_genos <- t(as.matrix(read.table("filtered30-pubescens-gatk-out.txt")))
alloSNP_genos1 <- as.matrix(
  read.table("filtered30-pubescens-alloSNP-g1.txt",
             na.strings="-9"))
alloSNP_genos2 <- as.matrix(
  read.table("filtered30-pubescens-alloSNP-g2.txt",
             na.strings="-9"))
alloSNP_genos <- alloSNP_genos1 + alloSNP_genos2
off <- matrix(NA, nrow=5, ncol=5)
mismatch <- data.frame(gatk=rep(0:4, 5),
                       alloSNP=rep(0:4, each=5),
                       Frequency=rep(NA,25))

# For each GATK genotype i, record how often the alloSNP genotype equals j.
for(i in 0:4){
  for(j in 0:4){
    gatk <- gatk_genos == i
    off[i+1,j+1] <- mean(alloSNP_genos[gatk] == j, na.rm = T)
    mismatch[mismatch$gatk == i & mismatch$alloSNP == j,]$Frequency <-
      mean(alloSNP_genos[gatk] == j, na.rm=T)
  }
}
off

figS11 <- ggplot(mismatch, aes(x=alloSNP, y=Frequency)) +
  geom_bar(stat="identity") +
  facet_grid(.~gatk) + theme_bw(base_size = 22) +
  ggtitle("Allopolyploid vs GATK genotype estimates")
print(figS11)
ggsave("../supp/supp-figs/FigureS11-alloSNP-gatk-genos.pdf", figS11,
       height=100, width=169, units="mm", scale=2.5)
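The two summaries used in this comparison, the RMSD of Equation B.10 and the distribution of one method's genotype calls conditional on the other's, can be written compactly in Python. The sketch below is an illustration only: the function names, the flat-list inputs, and the use of -9 for missing calls are assumptions, and the toy data are made up.

```python
import math

def rmsd(x, y):
    """Root mean squared deviation between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def conditional_distribution(ref_calls, other_calls, max_copies=4, missing=-9):
    """table[i][j]: fraction of genotypes called i by the reference method
    that were called j by the other method (NaN when i never occurs)."""
    counts = [[0] * (max_copies + 1) for _ in range(max_copies + 1)]
    for r, o in zip(ref_calls, other_calls):
        if r == missing or o == missing:
            continue
        counts[r][o] += 1
    table = []
    for row in counts:
        n = sum(row)
        table.append([c / n if n else float("nan") for c in row])
    return table

# Toy allele frequencies and genotype calls (flattened across individuals and loci).
dev = rmsd([0.1, 0.5], [0.2, 0.3])
table = conditional_distribution([0, 0, 1, 1, 1, 1], [0, 0, 1, 0, 1, 1])
```

In the toy calls, the reference method's genotype of 1 is matched three times out of four, and the one mismatch carries one fewer copy of the alternative allele.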

[Plot: "Inbreeding Coeff. Estimation Error [25 ind.]". RMSD (y-axis, 0.00 to 0.75) against inbreeding coefficient (F10, F25, F50, F75, F90) in a grid of panels, with ploidy (p4, p6, p8) in columns and coverage (c2, c5, c10, c20, c30, c40) in rows; methods: diseq, diseqCG, gatk, hwe.]

Figure B.1: Root mean squared deviation (RMSD) values for all levels of sequencing coverage for the estimation of inbreeding coefficients using the autopolyploid and other models (25 ind.). Everything is set up the same as in Figure 2.1a.

[Figure: "Inbreeding Coeff. Estimation Error [50 ind.]". Panels: ploidy (p4, p6, p8) by coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD (0.00 to 0.75); Method: diseq, diseqCG, gatk, hwe.]

Figure B.2: RMSD values for all levels of sequencing coverage for the estimation of inbreeding coefficients using the autopolyploid and other models (50 ind.). Everything is set up the same as in Figure 2.1a.

[Figure: "Inbreeding Coeff. Estimation Error [100 ind.]". Panels: ploidy (p4, p6, p8) by coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD (0.00 to 0.75); Method: diseq, diseqCG, gatk, hwe.]

Figure B.3: RMSD values for all levels of sequencing coverage for the estimation of inbreeding coefficients using the autopolyploid and other models (100 ind.). Everything is set up the same as in Figure 2.1a.

[Figure: "Genotype Estimation Error [25 ind.]". Panels: ploidy (p4, p6, p8) by coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD (0.0 to 2.0); Method: diseq, gatk, hwe.]

Figure B.4: RMSD values for all levels of sequencing coverage for the estimation of genotypes in autopolyploids (25 ind.). Everything is set up the same as in Figure 2.1b.

[Figure: "Genotype Estimation Error [50 ind.]". Panels: ploidy (p4, p6, p8) by coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD (0.0 to 2.0); Method: diseq, gatk, hwe.]

Figure B.5: RMSD values for all levels of sequencing coverage for the estimation of genotypes in autopolyploids (50 ind.). Everything is set up the same as in Figure 2.1b.

[Figure: "Genotype Estimation Error [100 ind.]". Panels: ploidy (p4, p6, p8) by coverage (c2, c5, c10, c20, c30, c40); x-axis: F10, F25, F50, F75, F90; y-axis: RMSD (0.0 to 2.0); Method: diseq, gatk, hwe.]

Figure B.6: RMSD values for all levels of sequencing coverage for the estimation of genotypes in autopolyploids (100 ind.). Everything is set up the same as in Figure 2.1b.

[Figure: "Allele Frequency in Subgenome 2". Panels: number of individuals (i25, i50, i100); x-axis: coverage (c2, c5, c10, c20, c30, c40); y-axis: RMSD (0.00 to 0.12); Ploidy: p4, p6, p8.]

Figure B.7: RMSD values for the estimation of the allele frequency in subgenome two with the allopolyploid model. Increasing sequencing coverage increases accuracy but there is a plateau. Sampling more individuals also decreases estimation error.

[Figure: "Genotype Estimation in Subgenome 1". Panels: number of individuals (i25, i50, i100); x-axis: coverage (c2, c5, c10, c20, c30, c40); y-axis: RMSD (0.00 to 1.00); Ploidy: p4, p6, p8.]

Figure B.8: RMSD values for the estimation of genotypes in subgenome one with the allopolyploid model. Increasing sequencing coverage increases accuracy but sampling more individuals does not have much of an effect.

[Figure: "Genotype Estimation in Subgenome 2". Panels: number of individuals (i25, i50, i100); x-axis: coverage (c2, c5, c10, c20, c30, c40); y-axis: RMSD (0.00 to 1.00); Ploidy: p4, p6, p8.]

Figure B.9: RMSD values for the estimation of genotypes in subgenome two with the allopolyploid model. Increasing sequencing coverage increases accuracy but sampling more individuals does not have much of an effect. Estimation error in subgenome two is also higher than in subgenome one since allele frequencies for subgenome two must be estimated.

[Figure: "Allele frequency estimates for Hardy Weinberg vs. GATK". Histogram; x-axis: hwe − gatk (−0.50 to 0.50); y-axis: count (0 to 12500).]

Figure B.10: Distribution of the difference in allele frequency estimates from our Hardy Weinberg model versus GATK (hwe - GATK).

[Figure: "Allopolyploid vs GATK genotype estimates". Panels: GATK genotype (0, 1, 2, 3, 4); x-axis: alloSNP (0 to 4); y-axis: Frequency (0.00 to 1.00).]

Figure B.11: Distribution of the full genotypes estimated by the allopolyploid model for each possible value of the genotype estimated by GATK. Each individual plot corresponds to genotypes estimated by GATK to have 0 through 4 copies of the alternative allele. The distributions within each plot are the frequency with which the allopolyploid model estimated each of the possible genotypes given the estimates by GATK. For example, for all genotypes estimated by GATK to be 2, the allopolyploid model also estimated ∼80% of those genotypes to have 2 copies of the alternative allele, but estimated that ∼10–15% had only 1 copy.
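The conditional frequencies described in the caption of Figure B.11 can be illustrated with a small Python sketch. This is not the code used for the analysis (the R script above is); the function name `conditional_frequencies` and the toy genotype calls are hypothetical, and missing genotypes are represented by `None`.

```python
from collections import Counter


def conditional_frequencies(reference_calls, other_calls, n_states=5):
    """Row i holds the frequencies of other_calls among sites where reference_calls == i."""
    table = []
    for i in range(n_states):
        # Keep the paired call only when the reference call equals i and the pair is not missing.
        matched = [o for r, o in zip(reference_calls, other_calls)
                   if r == i and o is not None]
        counts = Counter(matched)
        if matched:
            table.append([counts[j] / len(matched) for j in range(n_states)])
        else:
            # No usable sites for this reference genotype.
            table.append([None] * n_states)
    return table


# Hypothetical toy data: copies of the alternative allele (0 through 4) per site.
gatk_calls = [2, 2, 2, 2, 1, 0, 4, 2, 3, None]
allo_calls = [2, 2, 1, 2, 1, 0, 4, None, 3, 2]

freqs = conditional_frequencies(gatk_calls, allo_calls)
print(freqs[2])  # frequencies of allopolyploid-model genotypes 0..4 where GATK called 2
```

Each row of the result corresponds to one panel of the figure, and each entry within a row to one bar.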

Appendix C: Chapter 3 Supplemental Materials

C.1 Haplotype Inference

C.1.1 Inferring Haplotypes with Known Ploidy

To infer the maximum likelihood haplotype configurations with known ploidy, we use a multinomial likelihood that models the size of each cluster being considered as a haplotype. For an individual with ploidy level K, we take the first K clusters sorted by size and calculate the likelihood for a given partition as follows: each entry in a partition, P, contains the number of times that a particular haplotype is represented in the configuration. Given cluster sizes C1 through CK, and a sequencing error rate of ε, the log-likelihood for a partition P is:

$$\ell_P = \sum_{i=1}^{|P|} C_i \times \log\left(\frac{P[i]}{K}\right) + \sum_{j>|P|}^{K} C_j \times \log(\epsilon). \tag{C.1}$$

Here |P| represents the size of the partition, that is, the number of haplotypes with a nonzero count.

Example Calculation

To illustrate this calculation, let us consider a tetraploid individual that has the following sizes for the first four clusters: C1 = 285, C2 = 95, C3 = 10, and C4 = 8. We will also use an error rate of ε = 0.002. Table C.1 has the list of possible haplotype configurations and their corresponding log-likelihood values. The maximum likelihood haplotype configuration has three copies of haplotype one and one copy of haplotype two. The explicit likelihood calculation for this haplotype configuration proceeds as follows:

$$\ell_{(3,1)} = 285 \times \log\left(\frac{3}{4}\right) + 95 \times \log\left(\frac{1}{4}\right) + 10 \times \log(0.002) + 8 \times \log(0.002) = -325.5503. \tag{C.2}$$

Haplotype Configuration    Log-Likelihood
(4, 0, 0, 0)               -702.2507
(3, 1, 0, 0)               -325.5503*
(2, 2, 0, 0)               -375.2589
(2, 1, 1, 0)               -392.8247
(1, 1, 1, 1)               -551.7452

Table C.1: Haplotype configurations and their corresponding log-likelihoods for a tetraploid with ordered cluster sizes equal to 285, 95, 10, and 8. The haplotype configuration with three copies of haplotype one and one copy of haplotype two has the highest likelihood.
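The calculation in Equation C.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the Fluidigm2PURC implementation: the function name `partition_loglik` is hypothetical, and the list of candidate configurations is copied from Table C.1 instead of being enumerated.

```python
import math


def partition_loglik(cluster_sizes, partition, ploidy, error_rate):
    """Log-likelihood of one haplotype configuration, following Equation C.1."""
    loglik = 0.0
    for size, copies in zip(cluster_sizes, partition):
        if copies > 0:
            # Cluster treated as a real haplotype present in `copies` of the K genome copies.
            loglik += size * math.log(copies / ploidy)
        else:
            # Cluster treated as sequencing error.
            loglik += size * math.log(error_rate)
    return loglik


# Values from the worked example: a tetraploid with four ordered clusters.
cluster_sizes = [285, 95, 10, 8]
configurations = [(4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0), (2, 1, 1, 0), (1, 1, 1, 1)]

logliks = {p: partition_loglik(cluster_sizes, p, ploidy=4, error_rate=0.002)
           for p in configurations}
best = max(logliks, key=logliks.get)

for p in configurations:
    print(p, round(logliks[p], 4))
print("maximum likelihood configuration:", best)
```

Run on the example cluster sizes, the sketch should recover the values listed in Table C.1, with (3, 1, 0, 0) as the maximum likelihood configuration.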

C.1.2 Inferring Haplotypes with Unknown Ploidy

Inferring haplotype configurations for individuals with unknown ploidy levels involves distinguishing clusters that are likely to be “real” haplotypes from those that are likely to be errors. We do this by considering a set of models that range from treating all clusters as errors to treating all clusters as real haplotypes. The models in between successively treat each cluster in the ordered set as a real haplotype (clusters are sorted by size). For an individual with N clusters, there are N + 1 models to test. Each of these models has H real haplotypes (clusters 1,...,H) and N − H errors (clusters H + 1,...,N). The log-likelihood for each of these models is the sum of the cluster sizes (C1,...,CN) times the log of the probability that they are sequencing errors (ε) or not (1 − ε). The log-likelihood for a model with H haplotypes is given by:

$$\ell_H = \sum_{i=1}^{H} C_i \times \log(1 - \epsilon) + \sum_{j>H}^{N} C_j \times \log(\epsilon). \tag{C.3}$$

To determine the most likely haplotype configuration, we calculate how much the likelihood increases over the previous model when another haplotype is added (for ε < 0.5, the likelihood is monotonically increasing in H). We also normalize these differences by the total change in likelihood from the model with H = 0 to the model with H = N. If this value is less than a given cutoff (we use a default of 0.10), the previous model is treated as the best configuration. Since the cluster sizes are ordered, the increase in the log-likelihood will never be larger for any additional haplotypes.

Example Calculation

We will illustrate this procedure using an example for six clusters with the following

sizes: C1 = 425, C2 = 210, C3 = 145, C4 = 18, C5 = 11, and C6 = 7. Using an error

rate of 0.002, the R code below will calculate the likelihoods for the different models

as well as the relative increase for each of them (Table C.2).

R code:

# Each expression ends its first line with "+" so that R continues onto the next line
ullik_0 <- 425*log(0.002) + 210*log(0.002) + 145*log(0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_1 <- 425*log(1-0.002) + 210*log(0.002) + 145*log(0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_2 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_3 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(0.002) + 11*log(0.002) + 7*log(0.002)
ullik_4 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(1-0.002) + 11*log(0.002) + 7*log(0.002)
ullik_5 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(1-0.002) + 11*log(1-0.002) + 7*log(0.002)
ullik_6 <- 425*log(1-0.002) + 210*log(1-0.002) + 145*log(1-0.002) +
           18*log(1-0.002) + 11*log(1-0.002) + 7*log(1-0.002)

# Change in log-likelihood for each added haplotype, relative to the total change
total_diff <- abs(ullik_0 - ullik_6)
diff_1 <- ullik_1 - ullik_0; rel_diff_1 <- diff_1 / total_diff
diff_2 <- ullik_2 - ullik_1; rel_diff_2 <- diff_2 / total_diff
diff_3 <- ullik_3 - ullik_2; rel_diff_3 <- diff_3 / total_diff
diff_4 <- ullik_4 - ullik_3; rel_diff_4 <- diff_4 / total_diff
diff_5 <- ullik_5 - ullik_4; rel_diff_5 <- diff_5 / total_diff
diff_6 <- ullik_6 - ullik_5; rel_diff_6 <- diff_6 / total_diff

Using a cutoff of 0.10, we can see that the configuration (1, 1, 1, 0, 0, 0) is the last haplotype configuration whose relative increase in log-likelihood exceeds the cutoff (0.1777 > 0.10), meaning that the most likely scenario is that the first three clusters are real haplotypes and the last three are errors.

Haplotype Configuration    Log-Likelihood    Relative Increase
(0,0,0,0,0,0)              -5071.12          NA
(1,0,0,0,0,0)              -2430.763         0.5208
(1,1,0,0,0,0)              -1126.115         0.2574
(1,1,1,0,0,0)              -225.2875         0.1777*
(1,1,1,1,0,0)              -113.4605         0.0221
(1,1,1,1,1,0)              -45.12188         0.0135
(1,1,1,1,1,1)              -1.633634         0.0086

Table C.2: Haplotype configurations for an individual with six clusters. The ordered cluster sizes are 425, 210, 145, 18, 11, and 7. A model where the first three clusters are real haplotypes is the best fit.
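The model comparison for unknown ploidy can also be written as a short loop. The following Python sketch follows Equation C.3 and the cutoff rule described in §C.1.2; it is an illustration under stated assumptions, not the Fluidigm2PURC source. The function names `model_logliks` and `choose_num_haplotypes` are hypothetical, and the sketch assumes at least one cluster with a nonzero size.

```python
import math


def model_logliks(cluster_sizes, error_rate):
    """Log-likelihoods for H = 0..N real haplotypes, following Equation C.3."""
    logliks = []
    for h in range(len(cluster_sizes) + 1):
        real = sum(cluster_sizes[:h]) * math.log(1.0 - error_rate)
        errors = sum(cluster_sizes[h:]) * math.log(error_rate)
        logliks.append(real + errors)
    return logliks


def choose_num_haplotypes(cluster_sizes, error_rate, cutoff=0.10):
    """Return (best H, log-likelihoods, relative increases) using the cutoff rule."""
    logliks = model_logliks(cluster_sizes, error_rate)
    total = abs(logliks[-1] - logliks[0])
    rel = [(logliks[h] - logliks[h - 1]) / total for h in range(1, len(logliks))]
    best_h = 0
    for h, increase in enumerate(rel, start=1):
        if increase < cutoff:
            break  # the previous model is kept as the best configuration
        best_h = h
    return best_h, logliks, rel


# Cluster sizes (sorted, largest first) and error rate from the worked example.
best_h, logliks, rel = choose_num_haplotypes([425, 210, 145, 18, 11, 7], 0.002)
print("number of real haplotypes:", best_h)
print([round(r, 4) for r in rel])
```

Because every step changes the log-likelihood by C_H × [log(1 − ε) − log(ε)], each relative increase reduces to the cluster's share of the total cluster size (for example, 145/816 ≈ 0.1777 for the third cluster), which is why the values in Table C.2 do not depend on the error rate.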

C.2 Example Analysis

We have provided the sequence data from the PIS_3 and PIS_4 loci for the six species

of Thalictrum in the files Thalictrum_R1.fastq.gz and Thalictrum_R2.fastq.gz.

The following sections will walk through the analyses that we did to compare haplotypes

inferred using Fluidigm2PURC and dbcAmplicons (Uribe-Convers et al., 2016).

C.2.1 Fluidigm2PURC

To analyze the data using Fluidigm2PURC, we first run the fluidigm2purc script:

$ fluidigm2purc -f Thalictrum -o thalictrum-f2p -j 2

All of the reads coming from each locus will be written to separate FASTA files

that are put into the directory thalictrum-f2p-FASTA/ (the -o option gives the

output prefix). Next, we change into this directory and run PURC (Rothfels et al.,

2017) on the PIS_3 and PIS_4 loci separately.

$ cd thalictrum-f2p-FASTA/

$ purc_recluster.py -f PIS_3.fasta -o PIS_3 -c 0.997 0.995 0.99 0.997 \

-s 2 5 --clean

$ purc_recluster.py -f PIS_4.fasta -o PIS_4 -c 0.997 0.995 0.99 0.997 \

-s 2 5 --clean

Here the results are written to separate directories: PIS_3/ and PIS_4/. We will work on these loci in their respective directories by running the crunch_clusters script.

For analyses 1 and 2, we do not change the taxon table to include ploidy information.

However, before running step 3, we add the ploidy levels of the taxa sampled. Running

the first two cluster crunching steps for both loci before adding the ploidy information

is best so that we do not have to change the taxon table more than once. We also

manually renamed the output FASTA files after each crunch_clusters run so that the

files do not get overwritten. In a normal situation, you would not need to run all three

of these analyses unless you wanted to compare the differences between assuming

ploidy levels are known or unknown.

## Working on the PIS_3 locus

# 1. Getting consensus loci using the --haploid flag

$ cd PIS_3/

$ crunch_clusters -i PIS_3_clustered_reconsensus.afa -l PIS_3 \

-s ../../thalictrum-f2p-taxon-table.txt \

-e ../../thalictrum-f2p-locus-err.txt \

--realign --clean 0.33 --haploid

# 2. Assuming we don’t know ploidy levels

$ crunch_clusters -i PIS_3_clustered_reconsensus.afa -l PIS_3 \

-s ../../thalictrum-f2p-taxon-table.txt \

-e ../../thalictrum-f2p-locus-err.txt \

--realign --clean 0.33

# 3. Getting unique haplotypes with known ploidy information

# (add ploidy info first)

$ crunch_clusters -i PIS_3_clustered_reconsensus.afa -l PIS_3 \

-s ../../thalictrum-f2p-taxon-table.txt \

-e ../../thalictrum-f2p-locus-err.txt \

--realign --clean 0.33 --unique_haps

## Working on the PIS_4 locus

# 1. Getting consensus loci using the --haploid flag

$ cd ../PIS_4/

$ crunch_clusters -i PIS_4_clustered_reconsensus.afa -l PIS_4 \

-s ../../thalictrum-f2p-taxon-table.txt \

-e ../../thalictrum-f2p-locus-err.txt \

--realign --clean 0.33 --haploid

# 2. Assuming we don’t know ploidy levels

$ crunch_clusters -i PIS_4_clustered_reconsensus.afa -l PIS_4 \

-s ../../thalictrum-f2p-taxon-table.txt \

-e ../../thalictrum-f2p-locus-err.txt \

--realign --clean 0.33

# 3. Getting unique haplotypes with known ploidy information

# (add ploidy info first)

$ crunch_clusters -i PIS_4_clustered_reconsensus.afa -l PIS_4 \

-s ../../thalictrum-f2p-taxon-table.txt \

-e ../../thalictrum-f2p-locus-err.txt \

--realign --clean 0.33 --unique_haps

C.2.2 dbcAmplicons (reduce_amplicons.R)

To get haplotypes with dbcAmplicons, we first run the reduce_amplicons.R script.

We trimmed 20 bases from read one and 40 bases from read two. We also run both

consensus- and occurrence-based haplotype inference using the -p option. The output

directory is specified using the -o option.

$ reduce_amplicons.R -p consensus,occurrence --trim-1 20 --trim-2 40 \

-o thalictrum-dbc Thalictrum

Next, we need to align the output of the reduce_amplicons.R script for the

consensus- and occurrence-based haplotype inference methods. We first change into

the consensus.split_amplicon/ directory in the main thalictrum-dbc/ output

directory and then align the haplotypes for PIS_3 and PIS_4 using MAFFT (Katoh,

2013).

$ cd thalictrum-dbc/consensus.split_amplicon/

$ mafft --auto --quiet Amplicon.PIS_3.merged.fasta > PIS_3-consensus.fasta

$ mafft --auto --quiet Amplicon.PIS_4.merged.fasta > PIS_4-consensus.fasta

Next, we change back into the main output directory and then change into the

occurrence.split_amplicon/ directory to align the occurrence-based haplotypes

inferred by dbcAmplicons.

$ cd ../occurrence.split_amplicon/

$ mafft --auto --quiet Amplicon.PIS_3.merged.fasta > PIS_3-occurrence.fasta

$ mafft --auto --quiet Amplicon.PIS_4.merged.fasta > PIS_4-occurrence.fasta

All of the resulting haplotype files were then read into Geneious to visualize and calculate alignment statistics (Kearse et al., 2012). Parsimony informative sites were calculated in MEGA7 (Kumar et al., 2016).

Appendix D: Chapter 4 Supplemental Materials

D.1 Validating QCF Estimation

To evaluate the accuracy of our approach for QCF estimation, we performed a

simulation study using both tree and network topologies (Figure 4.1), and compared

our estimates with the true QCF values from simulated gene trees. Gene trees were

simulated using the program ms for 50 loci using the specified topology with internal

branch lengths of 0.5, 1.0, and 2.0 coalescent units (Hudson, 2002). For the species

network, the ancestral lineage to species C and D was simulated as a 60:40 hybrid

species forming 1.0 coalescent units in the past through an admixture event between

species E (γ = 0.6) and the ancestral lineage to A and B (1 − γ = 0.4). Sequence

data were then simulated on each gene tree using the program Seq-Gen (Rambaut

and Grassly, 1997). The length of each gene was 400 bp, with an expected number of

substitutions per site of 0.05. QCFs were then estimated with the approach outlined in

§4.3 using either (1) no bootstrapping or (2) 500 bootstrap replicates. True QCF values were calculated using the simulated gene trees as input with the software package

PhyloNetworks v0.7.0 (Solís-Lemus et al., 2017). Estimates from our approach were

compared to the true values using the root mean squared deviation (RMSD), as well

as linear regression, in R v3.3.2 (R Core Team, 2016). Results were plotted using

ggplot2 v2.2.1 (Wickham, 2009). Code for performing these simulations can be found in Appendix D (§D.1.1 and §D.1.2).
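As a reminder of how the RMSD comparison works, the following Python sketch computes it for one set of concordance factors. The comparison itself was done in R, as stated above; the function name `rmsd` and the numerical values are hypothetical and serve only to illustrate the formula.

```python
import math


def rmsd(estimates, truth):
    """Root mean squared deviation between paired estimates and true values."""
    if len(estimates) != len(truth):
        raise ValueError("estimates and truth must have the same length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical concordance factors for the three resolutions of a single quartet.
true_cf = [0.60, 0.20, 0.20]
est_cf = [0.58, 0.22, 0.20]

print(round(rmsd(est_cf, true_cf), 4))
```

An RMSD of zero means the estimated and true QCFs agree exactly; larger values indicate larger average deviations on the same 0 to 1 scale as the concordance factors.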

D.1.1 Tree Simulations

qcf-sims-tree.sh:

#!/bin/bash

# Global parameters

REP=$1 # Rep number

THETA=0.05 # Expected number of mutations per base

BP=400 # Sequence length in base pairs

julia5=/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia

for i in `seq 1 50`

do

# Simulate gene tree using ms

ms 6 1 -T -I 6 1 1 1 1 1 1 -ej 0.25 1 2 -ej 0.25 3 4 -ej 0.5 2 4 \

-ej 1.0 4 5 -ej 2.0 5 6 | grep '^(' > trees-${REP}-${i}.tre

# Simulate sequence data using seq-gen

seq-gen -mGTR -s $THETA -l $BP -r 1.0 0.2 10.0 0.75 3.2 1.6 \

-f 0.15 0.35 0.15 0.35 -i 0.2 -a 5.0 -g 3 -q \

< trees-${REP}-${i}.tre > seqs-${REP}-${i}.phy

done

ls -1 *.phy > genes.txt

cat trees-${REP}-*.tre > trees-${REP}.tre

qcf -i genes.txt -m map.txt --prefix tree-${REP}

qcf -i genes.txt -m map.txt -b 500 --prefix tree-${REP}-boot

# Double quotes let the shell expand ${REP} before Julia sees the command
$julia5 -e "using PhyloNetworks; readTree2CF(\"trees-${REP}.tre\", \
\"tree-${REP}-phynet.CFs.csv\", writeSummary=false)"

mkdir rep-${REP}

mv *.tre *.phy *.csv rep-${REP}

$ for i in `seq 1 100`; do ./qcf-sims-tree.sh ${i}; done
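The `-ej` times passed to ms in the script above are measured in units of 4N generations, so they need to be doubled to be read as coalescent units (assuming one coalescent unit equals 2N generations, as for a diploid population). The Python sketch below checks that the join times in `qcf-sims-tree.sh` give the internal branch lengths of 0.5, 1.0, and 2.0 coalescent units described in §D.1. The function name `internal_branch_lengths` is hypothetical.

```python
def internal_branch_lengths(join_events, ms_to_coalescent=2.0):
    """Internal branch lengths, in coalescent units, implied by ms -ej events.

    join_events holds (time, source_pop, dest_pop) tuples with times in ms units
    (4N generations). The factor of 2 assumes 1 coalescent unit = 2N generations.
    """
    node_age = {}  # ms time at which each population's current lineage formed (tips: 0.0)
    lengths = []
    for time, source, dest in sorted(join_events):
        for pop in (source, dest):
            age = node_age.get(pop, 0.0)
            if age > 0.0:
                # The lineage started at an earlier join, so this is an internal branch.
                lengths.append((time - age) * ms_to_coalescent)
        node_age[dest] = time
        node_age.pop(source, None)
    return lengths


# -ej events from qcf-sims-tree.sh: (time, population moved, population joined)
tree_events = [(0.25, 1, 2), (0.25, 3, 4), (0.5, 2, 4), (1.0, 4, 5), (2.0, 5, 6)]
print(internal_branch_lengths(tree_events))  # [0.5, 0.5, 1.0, 2.0]
```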

D.1.2 Network Simulations

qcf-sims-network.sh:

#!/bin/bash

# Global parameters

REP=$1 # Rep number

THETA=0.05 # Expected number of mutations per base

BP=400 # Sequence length in base pairs

julia5=/Applications/Julia-0.5.app/Contents/Resources/julia/bin/julia

# Simulate 20 gene trees from topology 1

for i in `seq 1 20`

do

# Simulate gene tree using ms

ms 6 1 -T -I 6 1 1 1 1 1 1 -ej 0.25 1 2 -ej 0.25 3 4 -ej 0.5 2 4 \

-ej 1.0 4 5 -ej 2.0 5 6 | grep '^(' > trees-${REP}-${i}.tre

# Simulate sequence data using seq-gen

seq-gen -mGTR -s $THETA -l $BP -r 1.0 0.2 10.0 0.75 3.2 1.6 \

-f 0.15 0.35 0.15 0.35 -i 0.2 -a 5.0 -g 3 -q \

< trees-${REP}-${i}.tre > seqs-${REP}-${i}.phy

done

# Simulate 30 gene trees from topology 2

for i in `seq 21 50`

do

# Simulate gene tree using ms

ms 6 1 -T -I 6 1 1 1 1 1 1 -ej 0.25 1 2 -ej 0.25 3 4 -ej 0.5 4 5 \

-ej 1.0 2 5 -ej 2.0 5 6 | grep '^(' > trees-${REP}-${i}.tre

# Simulate sequence data using seq-gen

seq-gen -mGTR -s $THETA -l $BP -r 1.0 0.2 10.0 0.75 3.2 1.6 \

-f 0.15 0.35 0.15 0.35 -i 0.2 -a 5.0 -g 3 -q \

< trees-${REP}-${i}.tre > seqs-${REP}-${i}.phy

done

ls *.phy > genes.txt

cat trees-${REP}-*.tre > trees-${REP}.tre

qcf -i genes.txt -m map.txt --prefix net-${REP}

qcf -i genes.txt -m map.txt -b 500 --prefix net-${REP}-boot

# Double quotes let the shell expand ${REP} before Julia sees the command
$julia5 -e "using PhyloNetworks; readTree2CF(\"trees-${REP}.tre\", \
\"net-${REP}-phynet.CFs.csv\", writeSummary=false)"

mkdir rep-${REP}

mv *.tre *.phy *.csv rep-${REP}

$ for i in `seq 1 100`; do ./qcf-sims-network.sh ${i}; done

D.2 Code for Species Tree and Network Inference

D.2.1 Gene Tree Estimates with RAxML

# Loop through all genes and analyze with RAxML

for f in *.phy

do

raxml -f a -x 12345 -p 12345 -# 500 -m GTRGAMMA \

-s $f

done

# Then combine all gene trees

cat RAxML_bipartitions.* > AllGeneTrees.tre

D.2.2 Species Tree Inference with ASTRAL-III

java -jar astral.5.5.9.jar -i AllGeneTrees.tre -a map.txt \

-o Humiles-Proceri.tre --polylimit 20 \

--samplingrounds 100 --extraLevel 2

The analyses for clades A and B were conducted using the same commands but with only the subset of taxa belonging to each clade.

D.2.3 Species Tree Inference with qcf+QuartetMaxCut

# run QCF

qcf -i gene-list.txt -m map.txt -b 500

# Run get-pop-tree.pl from TICR

perl get-pop-tree.pl out-qcf.CFs.csv

# Run getTreeBranchLengths.R from TICR

Rscript getTreeBranchLengths.R out-qcf Pdavidsoniidavidsonii

D.2.4 Network Analyses with PhyloNetworks

Network analyses were conducted using the SNaQ method in the PhyloNetworks package (v0.7.0) with the following template for each script [written in the Julia language using versions 0.5.2 and 0.6.2] (Solís-Lemus and Ané, 2016; Solís-Lemus et al., 2017).

# snaq-net.jl

addprocs(10) # add processors to run things in parallel

using PhyloNetworks;

t = readTopology("cladeA-astral.tre")

cf = readTableCF("cladeA-qcf.CFs.csv")

net1 = snaq!(t, cf, hmax=,

filename="cladA-net",

outgroup="Pdavidsoniidavidsonii")

For each network analysis, we changed the hmax argument to the corresponding maximum number of hybridization events (h = 1 through 5).

[Figure: "QCF Estimation". Columns: the 15 quartets of taxa A through F (A,B,C,D through C,D,E,F); rows: quartet resolutions 12|34, 13|24, and 14|23; x-axis: PhyloNetworks (0.00 to 1.00); y-axis: CF (0.00 to 1.00); Method: QCF, QCF-Boot.]

Figure D.1: Simulation results for the tree topology from Figure 4.1a. Results are plotted for bootstrapped (blue) and non-bootstrapped (yellow) comparisons. Regression lines were estimated by ggplot2 using the lm method in the geom_smooth() function.

[Figure: "QCF Estimation". Columns: the 15 quartets of taxa A through F (A,B,C,D through C,D,E,F); rows: quartet resolutions 12|34, 13|24, and 14|23; x-axis: PhyloNetworks (0.00 to 1.00); y-axis: CF (0.00 to 1.00); Method: QCF, QCF-Boot.]

Figure D.2: Simulation results for the network topology from Figure 4.1b. Results are plotted for bootstrapped (blue) and non-bootstrapped (yellow) comparisons. Regression lines were estimated by ggplot2 using the lm method in the geom_smooth() function.

[Figure: species tree with one tip per taxon listed in Table D.1. Scale bar: 0.2.]

Figure D.3: Phylogeny of Penstemon subsections Humiles and Proceri inferred by ASTRAL-III. Branch lengths are in coalescent units.

[Figure: supermatrix tree with tips labeled by taxon and accession (e.g., Pdavidsoniidavidsonii960101_1). Scale bar: 0.0040.]

Figure D.4: Phylogeny of Penstemon subsections Humiles and Proceri inferred with RAxML v8.2.11 using a supermatrix of 43 loci. Branch lengths are the mean number of substitutions per site.

[Figure: five network panels for clade A (h=1 to h=5) with tips labeled by taxon, plus a panel titled "Negative Log-Likelihood for Maximum Number of Reticulations" (x-axis: h, 0 to 5; y-axis: −logLik, 3000 to 4500).]

Figure D.5: Networks inferred for clade A using SNaQ, as implemented in the software PhyloNetworks. Each panel shows the network estimated for the given number of hybridization events (h=1 to h=5), starting at the top left (h=1) and ending at the bottom right (h=5). The panel in the top right corner shows the negative log-pseudolikelihoods for each number of hybridization events, with h=3 having the best pseudolikelihood.

[Figure: five network panels for clade B (h=1 to h=5) with tips labeled by taxon, plus a panel titled "Negative Log-Likelihood for Maximum Number of Reticulations" (x-axis: h, 0 to 5; y-axis: −logLik, 2600 to 3200).]

Figure D.6: Networks inferred for clade B using SNaQ, as implemented in the software PhyloNetworks. Each panel shows the network estimated for the given number of hybridization events (h=1 to h=5), starting at the top left (h=1) and ending at the bottom right (h=5). The panel in the top right corner shows the negative log-pseudolikelihoods for each number of hybridization events, with h=3 having the best pseudolikelihood.

Taxon                            Ploidy   Voucher
Outgroup
P. davidsonii davidsonii         2X       SLD 53
Subsection Humiles
P. albertinus                    2X       PDB 41
P. anguineus                     2X       ADW 1505
P. aridus                        2X       Andrew Lutz, Billy Creek 1
P. degeneri                      2X       ADW 401
P. elegantulus                   2X       Idaho Gray 4321
P. inflatus                      2X       ADW 811
P. humilis brevifolius           2X       ADW 573
P. humilis humilis               2X       ADW 761
P. humilis obtusifolius          2X       ADW 1430
P. ovatus                        2X       ADW 608
P. pruinosus                     2X       BYU 98608/EC Moran sn (1971)
P. radicosus                     2X       Mt. West Enviro. Services 7962
P. rattanii                      2X       ADW 512
P. subserratus                   2X       ADW 590
P. virens                        2X       Mt. West Enviro. Services 7953
P. whippleanus                   2X       1073 Potsdam
P. wilcoxii                      2X       PDB 39
Subsection Proceri
P. attenuatus attenuatus         6X       PDB 19
P. attenuatus militaris          6X       PDB 12
P. attenuatus palustris          6X       PDB 61
P. attenuatus pseudoprocerus     6X       PDB 42
P. cinicola                      2X       PDB 32
P. confertus                     4X       PDB 18
P. euglaucus                     6X       PDB 33
P. flavescens                    6X       PDB 63
P. globosus                      4X       ADW 1566
P. hesperius                     2X       –
P. heterodoxus heterodoxus       2X       ADW 1498
P. laxus                         2X       Idaho Smith 8123
P. peckii                        4X       PDB 31
P. pratensis                     2X       PDB 14
P. procerus procerus             4X       PDB 3
P. rydbergii rydbergii           4X       Western Env. 8845
P. rydbergii oreocharis          2X       PDB 16
P. spatulatus                    2X       PDB 56
P. washingtonensis               2X       PDB 36
P. watsonii                      2X       ADW 786

Table D.1: Collection and ploidy information for accessions from Penstemon subsections Humiles and Proceri.

197 Locus Direction Primer COS4270 Forward ACACTGACGACATGGTTCTACAACCAAGCTCTTCACCTGGAA Reverse TACGGTAGCAGAGACTTGGTCTAGACCAGCATAACAATTTTATTCCTAA COS14240 Forward ACACTGACGACATGGTTCTACACCGACATTAGTCACGGTCCT Reverse TACGGTAGCAGAGACTTGGTCTCGGCATTCCTTCAGATAAAC COS23460 Forward ACACTGACGACATGGTTCTACATGGTTGTTCGTGTGAGGTTG Reverse TACGGTAGCAGAGACTTGGTCTAAACTCATTATTTGCTCGATAGGG COS24530 Forward ACACTGACGACATGGTTCTACATTAAAATGCAGGAGGGCTTG Reverse TACGGTAGCAGAGACTTGGTCTCCAACCAAATCAGTTCTCTGC COS50360 Forward ACACTGACGACATGGTTCTACACCATGGAATCAAACCTGGAC Reverse TACGGTAGCAGAGACTTGGTCTAAGCCCAAATCGAAGAAGAA COS57850 Forward ACACTGACGACATGGTTCTACAGAAGGAGCCTCAAAGCAGTG Reverse TACGGTAGCAGAGACTTGGTCTGGATGTCCATCTAACCCGTTT PPR876 Forward ACACTGACGACATGGTTCTACACAGCTTCTGGTAGATGGGCT Reverse TACGGTAGCAGAGACTTGGTCTCTCTCCCCACAATCTTCGCC PPR1651 Forward ACACTGACGACATGGTTCTACAACGCAGCTTCTGGTAGATGG Reverse TACGGTAGCAGAGACTTGGTCTCACCACAATTACACACGCCC PPR5729 Forward ACACTGACGACATGGTTCTACATTTGCTCACTGCTTGTGCTG Reverse TACGGTAGCAGAGACTTGGTCTCCTTCATCCACCATGCCACA PPR985 Forward ACACTGACGACATGGTTCTACATCCGTGGCATTTTTAGTGCG Reverse TACGGTAGCAGAGACTTGGTCTCCGGAGAAAGCTCTTACGGTT PPR1839 Forward ACACTGACGACATGGTTCTACATGCAACCGTAATGCTCGACT Reverse TACGGTAGCAGAGACTTGGTCTTTTGCTCACTGCTTGTGCTG 34130 Forward ACACTGACGACATGGTTCTACATCTAAGTTTGCGGATGTTGAGA Reserve TACGGTAGCAGAGACTTGGTCTCATTCCCAGAACATACATGCAA 59820 Forward ACACTGACGACATGGTTCTACAGCAGATTTAGTTTTACTCTCCTCCA Reverse TACGGTAGCAGAGACTTGGTCTGGTCTTAAATACCATCTTCTGTGTCC 80460 Forward ACACTGACGACATGGTTCTACACCGAAATTTTACCCAAAATCG Reverse TACGGTAGCAGAGACTTGGTCTGCAATGTGGGATTTGTTCGT 20370 Forward ACACTGACGACATGGTTCTACATTCAGAGCTCCCATTTTTGC Reverse TACGGTAGCAGAGACTTGGTCTTTGACCTTCATCCAATAGAGCA 21370 Forward ACACTGACGACATGGTTCTACACTGTTTTTCCAATTTTCCATCC Reverse TACGGTAGCAGAGACTTGGTCTCAGGTTGTGGGCTACGATTT PPR369 Forward ACACTGACGACATGGTTCTACAGGAAAGGAAATCCATGCCCA Reverse TACGGTAGCAGAGACTTGGTCTAGCCTTCAGTTACCATTCCG PPR950 Forward ACACTGACGACATGGTTCTACATCTCCATCTTCGAGAACGCC Reverse 
TACGGTAGCAGAGACTTGGTCTGCCGGATCAGGTGACGATAG PPR1250 Forward ACACTGACGACATGGTTCTACAAAAGCCCTTCTTGCACGAGT Reverse TACGGTAGCAGAGACTTGGTCTGGACAGCTTTGATTGCAGGG PPR1561 Forward ACACTGACGACATGGTTCTACATCCCTTTTGCCTCATCGACC Reverse TACGGTAGCAGAGACTTGGTCTGGTTCACACGGTGAATGTCG 30360 Forward ACACTGACGACATGGTTCTACAAGGTTGCTAAAGGCCGATTC Reverse TACGGTAGCAGAGACTTGGTCTGGGTCTTTATCTAAAAGGCGAGA 35920 Forward ACACTGACGACATGGTTCTACAGGGGACAAAAATAGCAGAGC Reverse TACGGTAGCAGAGACTTGGTCTTCACCGTGCTTGTTAAGTGC 27260 Forward ACACTGACGACATGGTTCTACACTCCCCCGGAAAGTAACAAA Reverse TACGGTAGCAGAGACTTGGTCTTTGTTTCATGTTGCGCCTTT 2840 Forward ACACTGACGACATGGTTCTACATCTGGAAAAATTCCCTGGAC Reverse TACGGTAGCAGAGACTTGGTCTGCGCTTTGCAAATTCTTGAG

Table D.2: Primers for amplicon sequencing using the Fluidigm AccessArray. Primer sequences include conserved sequence tags.
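Because every forward primer in the table begins with the same 22-base prefix and every reverse primer with a different shared 22-base prefix, the locus-specific portion of each primer can be recovered by removing that prefix. The following is a minimal sketch of this step; the two prefixes are read directly off the table and are assumed here to be the conserved sequence tags, and the helper name `strip_tag` is hypothetical rather than part of any software described in this dissertation.

```python
# Shared prefixes read off Table D.2; assumed to be the conserved sequence tags.
FORWARD_TAG = "ACACTGACGACATGGTTCTACA"
REVERSE_TAG = "TACGGTAGCAGAGACTTGGTCT"


def strip_tag(primer, direction):
    """Return the locus-specific part of a tagged primer (hypothetical helper)."""
    tags = {"Forward": FORWARD_TAG, "Reverse": REVERSE_TAG}
    if direction not in tags:
        raise ValueError("direction must be 'Forward' or 'Reverse'")
    tag = tags[direction]
    if not primer.startswith(tag):
        raise ValueError("primer does not begin with the expected tag")
    # Everything after the tag is the locus-specific sequence.
    return primer[len(tag):]


# The COS4270 primer pair from Table D.2.
forward_specific = strip_tag(
    "ACACTGACGACATGGTTCTACAACCAAGCTCTTCACCTGGAA", "Forward")
reverse_specific = strip_tag(
    "TACGGTAGCAGAGACTTGGTCTAGACCAGCATAACAATTTTATTCCTAA", "Reverse")
print(forward_specific)
print(reverse_specific)
```

Raising an error on a missing prefix, rather than returning the input unchanged, makes a mistyped or truncated primer visible immediately.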

Locus       Direction  Primer
53950       Forward    ACACTGACGACATGGTTCTACAAAACTGTGCTCTTCCTCCAA
53950       Reverse    TACGGTAGCAGAGACTTGGTCTGGGGAATGGTGACTCCTACA
62010       Forward    ACACTGACGACATGGTTCTACAAGCAACGCCATAAACTGGAA
62010       Reverse    TACGGTAGCAGAGACTTGGTCTTTGGGAAAGTTGATTGAGACG
2350        Forward    ACACTGACGACATGGTTCTACAGCTCCCATCTTTGTATATCTCG
2350        Reverse    TACGGTAGCAGAGACTTGGTCTCCCGTTCGTTCGATTGATAG
18520       Forward    ACACTGACGACATGGTTCTACATTCGAGAACGCCTCTAAACC
18520       Reverse    TACGGTAGCAGAGACTTGGTCTGTTATGCACAAAACGGGATG
33495       Forward    ACACTGACGACATGGTTCTACATCTCCATTTCTCAACCTCAGC
33495       Reverse    TACGGTAGCAGAGACTTGGTCTGCCCCTTCCTCTCCATACA
38180       Forward    ACACTGACGACATGGTTCTACAGGCATCAAAAGTGGATGATG
38180       Reverse    TACGGTAGCAGAGACTTGGTCTTCCCTCGTTGAGACATTCCT
4810        Forward    ACACTGACGACATGGTTCTACATACCAATTCCCCAGTTCTGC
4810        Reverse    TACGGTAGCAGAGACTTGGTCTATGGGAAGAGATGCTTACCTGA
37450       Forward    ACACTGACGACATGGTTCTACATGCTATCAAAACTTCGGCATC
37450       Reverse    TACGGTAGCAGAGACTTGGTCTGATCTCAAAAAGCACAACTCCA
66330       Forward    ACACTGACGACATGGTTCTACATGCAAATTCTTGAGCTGTCC
66330       Reverse    TACGGTAGCAGAGACTTGGTCTAAAATTCCCTGGACGCTTG
1331        Forward    ACACTGACGACATGGTTCTACAGCACCAAGATATGCCATTGA
1331        Reverse    TACGGTAGCAGAGACTTGGTCTTCCGAGCTAAGGCTATACATTCA
48730       Forward    ACACTGACGACATGGTTCTACACCATACGCGTAAATAAGAGAGC
48730       Reverse    TACGGTAGCAGAGACTTGGTCTTGATGGATATGGTAAAGCTAAACG
3382        Forward    ACACTGACGACATGGTTCTACATCTGAAAGCCTTGTACCAACC
3382        Reverse    TACGGTAGCAGAGACTTGGTCTGAGCCCTCTTGCCATTTCTA
2829        Forward    ACACTGACGACATGGTTCTACATGACCCGTTGACAACCCTAT
2829        Reverse    TACGGTAGCAGAGACTTGGTCTTGCTTACAGGCCCTTTGGTA
604         Forward    ACACTGACGACATGGTTCTACAATGGCCTCCGGTAATTCTCT
604         Reverse    TACGGTAGCAGAGACTTGGTCTAGTGGCCTGAACTTTGCAGT
598         Forward    ACACTGACGACATGGTTCTACATCTGGGCTAACCTGAAATCG
598         Reverse    TACGGTAGCAGAGACTTGGTCTTGGAAATGATCAAGAAATGAAGC
233         Forward    ACACTGACGACATGGTTCTACAACAACGCTGTGTGTTTGGTC
233         Reverse    TACGGTAGCAGAGACTTGGTCTCCCACCAGCCCTTAACTACTC
2978        Forward    ACACTGACGACATGGTTCTACATCCATACAATGGTAAGATCACAAGA
2978        Reverse    TACGGTAGCAGAGACTTGGTCTGAAGGGTGTTCCGGGATTAT
2919        Forward    ACACTGACGACATGGTTCTACAAGAGTGTCATGGCCACCAAT
2919        Reverse    TACGGTAGCAGAGACTTGGTCTATGGACCGCATAGCTCAAAG
2782        Forward    ACACTGACGACATGGTTCTACAGAATTGAGGAGATTTGGGAATTT
2782        Reverse    TACGGTAGCAGAGACTTGGTCTCAGAATTGGGCCCTCCTAAG
1397        Forward    ACACTGACGACATGGTTCTACATCCAGTTTCGCTGAAATCACT
1397        Reverse    TACGGTAGCAGAGACTTGGTCTTAAAGGCCTTGGAGAAGCAA
836         Forward    ACACTGACGACATGGTTCTACATCCCAATTTATCCCAGAAAGC
836         Reverse    TACGGTAGCAGAGACTTGGTCTAATCATGGGCGACCTATTTG
rps12rpl20  Forward    ACACTGACGACATGGTTCTACAATTAGAAANRCAAGACAGCCAAT
rps12rpl20  Reverse    TACGGTAGCAGAGACTTGGTCTCGYYAYCGAGCTATATATCC
trnTL       Forward    ACACTGACGACATGGTTCTACACATTACAAATGCGATGCTCT
trnTL       Reverse    TACGGTAGCAGAGACTTGGTCTTCTACCGATTTCGCCATATC
trnCD       Forward    ACACTGACGACATGGTTCTACACCAGTTCAAATCTGGGTGTC
trnCD       Reverse    TACGGTAGCAGAGACTTGGTCTGGGATTGTAGTTCAATTGGT

Table D.3: Primers for amplicon sequencing using the Fluidigm AccessArray (continued). Primer sequences include conserved sequence tags.

Quartet   Topology  QCF        QCF-Boot
A,B,C,D   12|34     0.0365333  0.0310762
A,B,C,D   13|24     0.0283334  0.0293331
A,B,C,D   14|23     0.0242000  0.0524701
A,B,C,E   12|34     0.0336667  0.0307411
A,B,C,E   13|24     0.0295333  0.0279207
A,B,C,E   14|23     0.0304000  0.0357291
A,B,C,F   12|34     0.0423667  0.0334601
A,B,C,F   13|24     0.0314000  0.0293172
A,B,C,F   14|23     0.0297667  0.0465771
A,B,D,E   12|34     0.0320666  0.0301693
A,B,D,E   13|24     0.0313667  0.0285252
A,B,D,E   14|23     0.0311000  0.0377345
A,B,D,F   12|34     0.0387000  0.0317376
A,B,D,F   13|24     0.0317334  0.0317221
A,B,D,F   14|23     0.0319667  0.0447368
A,B,E,F   12|34     0.0307667  0.0247832
A,B,E,F   13|24     0.0233333  0.0252398
A,B,E,F   14|23     0.0229000  0.0394890
A,C,D,E   12|34     0.0282000  0.0265650
A,C,D,E   13|24     0.0282333  0.0365885
A,C,D,E   14|23     0.0323000  0.0257895
A,C,D,F   12|34     0.0306667  0.0286579
A,C,D,F   13|24     0.0314666  0.0411249
A,C,D,F   14|23     0.0361333  0.0283903
A,C,E,F   12|34     0.0387333  0.0313787
A,C,E,F   13|24     0.0293333  0.0382125
A,C,E,F   14|23     0.0329333  0.0578481
A,D,E,F   12|34     0.0399333  0.0323120
A,D,E,F   13|24     0.0290667  0.0377330
A,D,E,F   14|23     0.0312000  0.0587910
B,C,D,E   12|34     0.0268000  0.0258273
B,C,D,E   13|24     0.0245333  0.0355890
B,C,D,E   14|23     0.0260000  0.0280845
B,C,D,F   12|34     0.0247333  0.0312867
B,C,D,F   13|24     0.0313000  0.0387545
B,C,D,F   14|23     0.0295667  0.0284660
B,C,E,F   12|34     0.0365333  0.0315087
B,C,E,F   13|24     0.0272000  0.0346498
B,C,E,F   14|23     0.0296667  0.0532958
B,D,E,F   12|34     0.0360000  0.0285520
B,D,E,F   13|24     0.0244000  0.0334152
B,D,E,F   14|23     0.0313333  0.0534680
C,D,E,F   12|34     0.0252000  0.0228099
C,D,E,F   13|24     0.0194000  0.0227021
C,D,E,F   14|23     0.0207333  0.0345828

Table D.4: RMSD values for QCF estimation using data simulated from a tree topology (Figure 4.1a).
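For readers who want to see how an entry of this kind can be computed, the sketch below assumes the conventional definition of RMSD: the square root of the mean squared deviation of replicate estimates from an expected value. The function name `rmsd` and the replicate values are hypothetical placeholders chosen for illustration; they are not results from this dissertation, and the exact procedure behind the table may differ in detail.

```python
import math


def rmsd(estimates, expected):
    """Root-mean-square deviation of estimates from one expected value."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    squared_devs = [(x - expected) ** 2 for x in estimates]
    return math.sqrt(sum(squared_devs) / len(squared_devs))


# Hypothetical placeholder values (NOT data from the dissertation):
# two replicate QCF estimates compared against an expected QCF of 0.33.
example_value = rmsd([0.30, 0.36], 0.33)
print(round(example_value, 4))
```

Because the deviations are squared before averaging, a few replicates that fall far from the expected value raise the RMSD more than many replicates that are slightly off.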

Quartet   Topology  QCF        QCF-Boot
A,B,C,D   12|34     0.0232667  0.0213276
A,B,C,D   13|24     0.0199333  0.0212925
A,B,C,D   14|23     0.0191333  0.0323620
A,B,C,E   12|34     0.0255000  0.0235640
A,B,C,E   13|24     0.0211000  0.0240413
A,B,C,E   14|23     0.0219333  0.0345390
A,B,C,F   12|34     0.0342000  0.0277485
A,B,C,F   13|24     0.0274000  0.0242728
A,B,C,F   14|23     0.0261333  0.0412771
A,B,D,E   12|34     0.0277000  0.0237231
A,B,D,E   13|24     0.0225000  0.0239117
A,B,D,E   14|23     0.0235333  0.0328303
A,B,D,F   12|34     0.0341000  0.0275928
A,B,D,F   13|24     0.0246667  0.0279635
A,B,D,F   14|23     0.0277000  0.0425720
A,B,E,F   12|34     0.0245000  0.0222437
A,B,E,F   13|24     0.0190000  0.0241789
A,B,E,F   14|23     0.0205000  0.0381137
A,C,D,E   12|34     0.0266000  0.0233988
A,C,D,E   13|24     0.0270334  0.0358624
A,C,D,E   14|23     0.0281000  0.0273872
A,C,D,F   12|34     0.0243667  0.0259449
A,C,D,F   13|24     0.0287000  0.0369820
A,C,D,F   14|23     0.0288667  0.0292340
A,C,E,F   12|34     0.0294000  0.0330291
A,C,E,F   13|24     0.0273000  0.0286997
A,C,E,F   14|23     0.0281667  0.0265154
A,D,E,F   12|34     0.0311333  0.0349219
A,D,E,F   13|24     0.0323667  0.0313181
A,D,E,F   14|23     0.0313000  0.0244409
B,C,D,E   12|34     0.0267667  0.0263707
B,C,D,E   13|24     0.0282000  0.0345829
B,C,D,E   14|23     0.0311667  0.0280294
B,C,D,F   12|34     0.0264667  0.0263840
B,C,D,F   13|24     0.0264000  0.0366836
B,C,D,F   14|23     0.0268000  0.0291028
B,C,E,F   12|34     0.0318667  0.0311415
B,C,E,F   13|24     0.0287333  0.0305662
B,C,E,F   14|23     0.0301333  0.0293290
B,D,E,F   12|34     0.0300333  0.0340250
B,D,E,F   13|24     0.0313667  0.0315823
B,D,E,F   14|23     0.0358000  0.0246551
C,D,E,F   12|34     0.0295333  0.0258286
C,D,E,F   13|24     0.0241667  0.0287898
C,D,E,F   14|23     0.0299666  0.0373072

Table D.5: RMSD values for QCF estimation using data simulated from a network topology (Figure 4.1b).
