Molecular Population Genetics of Mating System

MOLECULAR POPULATION GENETICS OF MATING SYSTEM

TRANSITIONS IN THE BRASSICACEAE

JOHN PAUL FOXE

A DISSERTATION SUBMITTED TO THE FACULTY OF GRADUATE

STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN BIOLOGY

YORK UNIVERSITY

TORONTO, ONTARIO

OCTOBER 2010

Library and Archives Canada
Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada

ISBN: 978-0-494-80588-6

NOTICE:

The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.

While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

Abstract

The transition from outcrossing to selfing represents one of the most common major evolutionary transitions in the plant kingdom. Predominantly selfing and outcrossing species pairs exist in a variety of plant taxa, yet little is known about the underlying evolutionary dynamics promoting such a transition. In Chapter 2 of this dissertation, I investigate the evolution of the selfing Capsella rubella from its sister species, the outcrossing C. grandiflora. My results suggest that this transition to selfing has led to a substantial bottleneck in C. rubella, with a 100- to 1500-fold reduction in effective population size. Furthermore, I estimate a divergence time of approximately 13,500 years between these two species, suggesting that this recent speciation event has led to dramatic genotypic and phenotypic changes in C. rubella over a relatively short period of time.

While results from Chapter 2 show little evidence for introgression in natural populations of C. grandiflora and C. rubella, in Chapter 3 I ask whether a shift in mating system has led to the establishment of effective reproductive isolating barriers between these two sister species. By investigating the population genetics of potentially hybridizing populations of C. grandiflora and C. rubella, I find no significant evidence for ongoing gene flow between the two species, suggesting that differences in mating system are acting as an effective reproductive barrier between them.

In Chapter 4, I investigate the evolutionary origins of the third species in the genus Capsella, the selfing polyploid C. bursa-pastoris. My results suggest that C. bursa-pastoris originated via autopolyploidization involving two distinct C. grandiflora haplotypes. Furthermore, I have dated this event to approximately 667,000 years ago.

In Chapter 5, I investigate the demographic history and population genetics of selfing and outcrossing populations of Arabidopsis lyrata from the Great Lakes region. I find evidence for strong geographic population clustering irrespective of mating system, suggesting that selfing either evolved multiple times or has spread to multiple genetic backgrounds, unlike the transition to selfing in C. rubella.

Acknowledgements

The work presented in this dissertation was made possible by a large number of people, to whom I am very grateful. Specific acknowledgements are outlined at the end of Chapters 2 and 5. However, there are a number of other individuals who contributed to additional aspects of my dissertation.

I thank each of my committee members: Dr. Joel Shore, Dr. Imogen Coe and Dr. Dawn Bazely for their support, their thoughtful suggestions and their encouragement throughout my time spent as a PhD candidate.

I thank my examination committee members Dr. Amro Zayed, Dr. Bridget Stutchbury, Dr. Raymond Mar and Dr. Dan Schoen for their willingness to serve on my examination committee and for being so flexible around the timing of my defence.

I thank each of my coauthors on Chapters 2 and 5, namely: Tanja Slotte, Eli Stahl, Barbara Neuffer, Herbert Hurka, Marc Stift, Andrew Tedder, Annabelle Haudry and Barbara Mable. In particular, Marc Stift and Barbara Mable are two fantastic people to work with and I count myself very lucky to have had the opportunity to collaborate with them.

I thank Kate St. Onge for her assistance while sampling Capsella in Greece and Martin Lascoux for providing Capsella seeds used in Chapter 4.

There are too many of my peers to thank individually. I thank the members of the Wright laboratory, as well as the many graduate students in the Biology Department at York University and in the Department of Ecology and Evolutionary Biology at the University of Toronto, with whom I have had such a great experience over the past few years.

Finally, I thank my supervisor, Dr. Stephen Wright. In addition to his supervision, Dr. Wright has acted as a mentor to me. More than that, he has become a friend. Without his constant support, advice and guidance, the body of work presented here would not have been possible. Thank you.

Note on authorship

Chapters 2 and 5 of this thesis have been published as refereed journal articles. In each case, I am the first author. Below I outline the contributions of each author to each of these Chapters.

For Chapter 2, I performed the research. B. Neuffer and H. Hurka provided Capsella grandiflora and C. rubella seeds. E. Stahl provided computer code allowing the program MIMAR to be run in a number of different ways. T. Slotte, S. Wright and I analyzed the data and wrote the paper.

For Chapter 5, I generated all of the nuclear gene data. M. Stift, A. Tedder and A. Haudry generated the microsatellite and chloroplast data. M. Stift, A. Tedder, A. Haudry, S. Wright, B. Mable and I analyzed the data. M. Stift, B. Mable, S. Wright and I wrote the paper.

I have received written permission from each of my coauthors to include these published articles in my dissertation (see Appendix 1).

Table of Contents

Abstract

Acknowledgments

Note on authorship

Table of Contents

List of Tables

List of Figures

Chapter 1

Introduction to Molecular Population Genetics of Mating System Transitions in the Brassicaceae

Chapter 2

Recent speciation associated with the evolution of selfing in Capsella

Chapter 3

No evidence for ongoing and widespread hybridization between sympatric populations of C. grandiflora and C. rubella

Chapter 4

Dynamics of polyploid speciation in the genus Capsella

Chapter 5

Reconstructing origins of loss of self-incompatibility and selfing in North American Arabidopsis lyrata: a population genetic context

Chapter 6

Conclusions

Appendix A - Permission to include published manuscripts in dissertation

List of Tables

Chapter 2

Table 1. Modes of parameter estimates under a range of MIMAR models, with 90% HPD intervals in parentheses.

SI Table 2. Predictive posterior probabilities from simulations of the posterior distributions.

SI Table 3. Likelihood ratio test comparing model 1 to other models.

SI Table 4. Goodness-of-fit tests based on simulations under the marginal modes (a) and under maximum likelihood parameter estimates (b).

SI Table 5. Goodness-of-fit test P values based on simulations under the marginal modes (a) and under maximum likelihood parameter estimates (b), allowing for a lineage-specific change in recombination in population 2.

SI Table 6. Sequence-based summary statistics as estimated for each locus in both C. grandiflora and C. rubella.

SI Table 7. Modes of parameter estimates under a range of MIMAR models using summaries of the data that do not rely on outgroup inference, with 90% HPD intervals in parentheses.

Chapter 3

Table 1. Modes of parameter estimates under a range of MIMAR models using the entire dataset, allopatric populations only and sympatric populations only.

Table 2. Number of silent sites subdivided by species and population category (allopatric and sympatric) as well as gene ontology terms for each of the 18 loci used in this study.

Chapter 4

Table 1. Minimum number of synonymous substitutions and number of fixed differences between C. bursa-pastoris A and C. grandiflora, C. bursa-pastoris B and C. grandiflora, and C. bursa-pastoris A and C. bursa-pastoris B. Numbers of fixed differences are given in parentheses.

Table 2. Modes of parameter estimates under a range of MIMAR models.

Table S1. Accession name and origin of each individual used in this study.

Table S2. Names and gene ontology terms for each of the 14 loci studied.

Table S3. Modes of parameter estimates under a range of MIMAR models for 5 simulated datasets where an autopolyploid event was simulated using the coalescent simulation program ms (see Methods).

Chapter 5

Table 1. Population-level outcrossing rates (Tm), proportion of self-compatible (SC) individuals (as an indication of selfing phenotype), summary statistics and observed heterozygosity for 18 nuclear gene sequences, and observed and expected heterozygosities (Ho and He) across nine microsatellite loci, with populations ordered by increasing outcrossing rates.

Table 2. Linear regressions of outcrossing rate (Tm) and the proportion of self-compatible (SC) individuals per population on synonymous diversity (πsyn), corrected synonymous diversity, the recombination parameter (ρ), the corrected recombination parameter and observed heterozygosity (Ho) across 18 nuclear gene sequences; and observed and expected microsatellite heterozygosity (Ho and He) across nine microsatellite loci.

Table 3. Total and unique number of variants (microsatellite alleles across nine loci, nuclear gene haplotypes across 18 genes) for the group of inbreeding populations and for the group of outcrossing populations; overall total number of different variants; number of variants shared across inbreeding and outcrossing populations.

Table S1. Explanation of population abbreviations, lakefront (where relevant), and geographic specifications (state/province, country and coordinates) for each of the 24 populations used in this study (more details regarding sampling method are given in the supplementary text).

List of Figures

Chapter 2

Figure 1. Floral organs and petals are reduced in C. rubella (left) compared to C. grandiflora (right).

Figure 2. Comparison of polymorphism patterns between C. grandiflora and C. rubella. Bars represent the median, boxes the interquartile range, and whiskers extend out to 1.5 times the interquartile range. a) π synonymous, where π is the average pairwise differences (16); b) the population recombination estimator ρ per base pair, using the composite likelihood estimator of Hudson (32); c) Tajima's D (16) in C. grandiflora and C. rubella.

Figure 3. Derived SNP frequencies in C. grandiflora and C. rubella calculated using A. thaliana as an outgroup.

Figure 4. Smoothed marginal posterior distributions of speciation parameters estimated by MIMAR, for two models with posterior modes showing good fit to data summaries, assuming either symmetric migration (solid lines) or no migration (dashed lines) and equal effective population sizes in the ancestor as in present-day C. grandiflora. (A) Marginal densities for effective population size (individuals) in C. rubella (grey) and C. grandiflora/the ancestral species. (B) Marginal densities for divergence time (years). (C) Marginal density for migration (4Nm), where N is the effective population size of C. grandiflora.

SI Figure 5. Observed and expected (under neutrality as calculated using Equation 49, Tajima, 1989) minor allele frequency distribution of synonymous SNPs in a) C. grandiflora and b) C. rubella using A. thaliana as an outgroup.

SI Figure 6. Average levels of linkage disequilibrium as measured by the squared correlation coefficient r² in C. rubella and C. grandiflora in 100 bp windows.

SI Figure 7. STRUCTURE output for a) C. grandiflora and C. rubella combined (k=2) and for b) C. rubella alone (k=3).

SI Figure 8. Distribution of the fraction of shared polymorphism from simulated 39-gene datasets.

Chapter 3

Figure 1. Comparison of polymorphism patterns between C. grandiflora and C. rubella for a) the entire dataset, b) the allopatric populations and c) the sympatric populations, as measured by π synonymous, where π is the average pairwise differences.

Figure 2. Distribution of synonymous variants. Variants are classed as unique to C. rubella, unique to C. grandiflora or shared between species. The datasets are subdivided into a) the entire dataset, b) allopatric populations only and c) sympatric populations only.

Figure 3. Posterior probabilities of Bayesian clustering analysis (InStruct) conducted on the entire dataset, where k = 2-4.

Figure 4. Posterior probabilities of Bayesian clustering analysis (InStruct), using (a) the C. grandiflora individuals and (b) the C. rubella individuals, where k = 2-4.

Figure S1. Geographic location of each of the nine populations included in this study.

Chapter 4

Figure 1. Floral organs and petals are reduced in C. bursa-pastoris and C. rubella (left and middle respectively) compared with C. grandiflora (right).

Figure 2. Comparison of silent polymorphism patterns between C. bursa-pastoris A, C. bursa-pastoris B, C. grandiflora and C. rubella, given by π synonymous, where π is the average pairwise difference.

Figure 3. Number of synonymous fixed differences between all pairs of C. bursa-pastoris A, C. bursa-pastoris B, C. grandiflora and C. rubella.

Figure 4. Bayesian estimates of species trees of the Capsella genus generated using BEST including the close relatives a) A. thaliana and b) A. thaliana and Neslia as outgroups.

Figure 5. Marginal posterior distributions of speciation parameters estimated by MIMAR, with posterior modes showing good fit to data summaries. $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate ($1.5 \times 10^{-8}$). a) Constrained model assuming equal effective population sizes in the ancestor as in present-day C. grandiflora; b) unconstrained model.

Figure 6. Numbers of unique synonymous variants for each of the three species pairs: a) C. bursa-pastoris A and C. bursa-pastoris B, b) C. grandiflora and C. bursa-pastoris A and c) C. grandiflora and C. bursa-pastoris B.

Figure 7. Numbers of a) synonymous shared variants and b) synonymous fixed differences for each of the three species pairs: 1) C. bursa-pastoris A and C. bursa-pastoris B, 2) C. grandiflora and C. bursa-pastoris A and 3) C. grandiflora and C. bursa-pastoris B.

Figure S1. Autopolyploid event simulated using the coalescent simulation program ms. $\theta$ is the scaled effective population size ($4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate, $1.5 \times 10^{-8}$).

Chapter 5

Figure 1. Posterior probabilities of Bayesian clustering analysis (InStruct) using the combined nuclear gene sequence and microsatellite datasets, based on a prior of six clusters (k = 6).

Figure 2. Comparisons between microsatellite-based multilocus outcrossing rates (Tm) and genetic diversity: (a) synonymous diversity (πsyn) based on nuclear sequence data from 18 unlinked loci, where π is the average number of pairwise differences between two sequences; (b) expected heterozygosity (He) estimated from microsatellite data.

Figure S1. Posterior probabilities of Bayesian clustering analysis (using STRUCTURE, 2,000,000 generations with a burnin of 200,000), using a combination of the nuclear haplotype and microsatellite data.

Figure S2. Posterior probabilities of Bayesian clustering analysis (using InStruct, 2,000,000 generations with a burnin of 200,000), using a combination of the nuclear haplotype and microsatellite data.

Figure S3. Distribution of the -ln probability (-ln(P), represented by closed circles) and its variance (Var[ln(P)], represented by open squares) for Bayesian clustering analysis (2,000,000 generations with a burnin of 200,000, five chains for each setting of the number of pre-defined clusters (k), k ranging from 1-12) with a) InStruct and b) STRUCTURE.

CHAPTER 1

Introduction to Molecular Population Genetics of Mating System Transitions in the

Brassicaceae

The transition from outcrossing to selfing represents one of the most common major evolutionary transitions in flowering plants (Stebbins 1970; Barrett 2002). Predominantly selfing and outcrossing species pairs exist in a variety of plant taxa including Leavenworthia (Charlesworth and Yang 1998; Liu et al. 1998; Filatov and Charlesworth 1999; Liu et al. 1999), Arabidopsis (Ross-Ibarra et al. 2008; Savolainen et al. 2000; Wright et al. 2003), Lycopersicon (Baudry et al. 2001), Miscanthus (Chiang et al. 2003), Amsinckia (Schoen et al. 1997) and Capsella (Foxe et al. 2009 (Chapter 2)). This transition confers two major advantages upon selfers versus their outcrossing congeners. First, because selfers are 100% related to their progeny and can also act as pollen donors for seed produced by other individuals, they have an inherent transmission advantage over outcrossers (Fisher 1941; Nagylaki 1976; Lloyd 1979). Second, selfing provides reproductive assurance when pollinators or potential mates are limited (Baker 1955). Evolutionary theory predicts that selfing will evolve when these advantages outweigh the costs associated with inbreeding depression (Charlesworth 2006).
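The balance between the transmission advantage and inbreeding depression can be illustrated with a minimal sketch. It uses the simplest textbook gene-counting argument, which is an assumption of this example and is not taken from the analyses in this dissertation: an outcrosser transmits one gene copy through each of its own seeds and one through exported pollen; a selfer transmits two copies through each selfed seed and exports pollen at the same rate (no pollen discounting); and selfed offspring have relative fitness 1 - delta, where delta is inbreeding depression. The function and parameter names are hypothetical.

```python
def gene_copies_transmitted(selfing_rate: float, inbreeding_depression: float) -> float:
    """Expected gene copies transmitted per seed produced plus pollen exported.

    Illustrative assumptions (not from the dissertation): no pollen
    discounting, and selfed offspring have relative fitness 1 - delta.
    """
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing_rate must lie in [0, 1]")
    if not 0.0 <= inbreeding_depression <= 1.0:
        raise ValueError("inbreeding_depression must lie in [0, 1]")
    # Two copies per selfed seed, discounted by inbreeding depression.
    via_selfed_seed = 2.0 * selfing_rate * (1.0 - inbreeding_depression)
    # One copy per outcrossed seed on the maternal plant.
    via_outcrossed_seed = 1.0 * (1.0 - selfing_rate)
    # One copy through pollen exported to other plants.
    via_pollen_export = 1.0
    return via_selfed_seed + via_outcrossed_seed + via_pollen_export


def selfing_favoured(selfing_rate: float, inbreeding_depression: float) -> bool:
    """True if the selfing variant transmits more copies than a pure outcrosser."""
    outcrosser = gene_copies_transmitted(0.0, inbreeding_depression)
    selfer = gene_copies_transmitted(selfing_rate, inbreeding_depression)
    return selfer > outcrosser


if __name__ == "__main__":
    for delta in (0.0, 0.4, 0.6):
        print(delta, gene_copies_transmitted(1.0, delta), selfing_favoured(1.0, delta))
```

Under these assumptions a complete selfer with no inbreeding depression transmits three copies for every two transmitted by an outcrosser, and the advantage disappears once inbreeding depression exceeds one half. Real systems depart from this sketch through pollen discounting, reproductive assurance and purging of deleterious alleles.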

The shift from outcrossing to selfing often has both phenotypic and genotypic consequences. Phenotypically, this transition is almost universally associated with the 'selfing syndrome', characterized by a severe reduction in flower size and a breakdown of the morphological and genetic mechanisms preventing self-fertilization. Empirically, the population genetic consequences associated with a transition to selfing have been well documented at the species level in both plant and animal systems (Charlesworth and Yang 1998; Baudry et al. 2001; Chiang et al. 2003; Cutter and Payseur 2003; Glemin 2006). Most directly, selfing is associated with an increase in homozygosity and a decrease in levels of genetic diversity (Wright et al. 2008). As a result of a reduction in the number of distinct alleles, the effective population size (Ne) is decreased. Theory predicts that complete selfing reduces the effective population size by 50% as a result of homozygosity (Charlesworth et al. 1993; Nordborg 2000). Furthermore, this increase in homozygosity can lead to an increase in levels of linkage disequilibrium among loci, amplifying the effects of genetic hitchhiking. This can lead to the fixation of positively selected mutations and the purging of deleterious alleles due to purifying selection, both of which will further reduce Ne.
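The 50% reduction quoted above follows from a standard result for partially selfing populations, sketched here under stated assumptions: at equilibrium the inbreeding coefficient is F = s / (2 - s) for selfing rate s, and the effective size is Ne = N / (1 + F). The sketch captures only the direct effect of homozygosity and ignores the further reductions from hitchhiking and background selection described in the text. Function names are illustrative.

```python
def inbreeding_coefficient(selfing_rate: float) -> float:
    """Equilibrium inbreeding coefficient F = s / (2 - s) under partial selfing."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing_rate must lie in [0, 1]")
    return selfing_rate / (2.0 - selfing_rate)


def effective_size(census_size: float, selfing_rate: float) -> float:
    """Effective population size Ne = N / (1 + F).

    Accounts only for the direct effect of homozygosity; hitchhiking and
    background selection, which reduce Ne further, are not modelled.
    """
    return census_size / (1.0 + inbreeding_coefficient(selfing_rate))


if __name__ == "__main__":
    for s in (0.0, 0.5, 0.9, 1.0):
        print(f"s = {s:.1f}: F = {inbreeding_coefficient(s):.3f}, "
              f"Ne/N = {effective_size(1.0, s):.3f}")
```

With s = 1 the ratio Ne/N is one half, so the 100- to 1500-fold reduction estimated for C. rubella in Chapter 2 cannot be explained by this direct effect alone.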

While there are clear advantages to a selfing lifestyle (described above), selfing also has disadvantages. The transition to selfing can lead to a rapid increase in homozygote frequencies. This increase can cause individuals to express recessive or particularly deleterious alleles, leading to lower survival rates and reduced fertility, a condition known as inbreeding depression (Charlesworth 2006).

Despite the advantages to a selfing lifestyle, a mere 20-25% of plant taxa are predominantly selfing (Barrett and Eckert 1990), leading us to the obvious question: why aren't more plants selfing? The predominant hypothesis attempting to answer this question suggests that selfing is an "evolutionary dead end." Specifically, this hypothesis suggests that selfing lineages do not persist over extended evolutionary time periods and that new lineages are founded from outcrossing progenitors.

Several studies suggest that the complex characteristics promoting outcrossing have broken down frequently, leading to transitions to selfing. Furthermore, these selfers are typically restricted to the terminal clades of phylogenies, suggesting that the selfing lineages evolved from outcrossing progenitors and that these selfing lineages tend not to be ancient. For example, a phylogenetic analysis of the genus Amsinckia, based on chloroplast DNA restriction site variation, suggests four independent origins of homostylous selfing species from heterostylous outcrossers. These findings were based on the assumption that distyly was ancestral or that the loss of distyly was more likely than the gain (Schoen et al. 1997).

While studies have suggested frequent transitions to selfing, in most cases the timescale over which this transition occurs remains difficult to ascertain. The model plant Arabidopsis thaliana is thought to have evolved self-fertilization approximately 1 million years ago through inactivation of the self-incompatibility locus, referred to as the S-locus (Tang et al. 2007). Evidence for the role of the S-locus stems from transformation studies, which identified five accessions in which full self-incompatibility (SI) could be restored by transformation with a functional S-locus, implying that all other genes required for SI are still intact in these accessions (Tang et al. 2007; Boggs et al. 2009). Recent results suggest that a mutation in the male component of self-incompatibility (SCR) has resulted in loss of SI, apparently across a wide range of accessions (Tsuchimatsu et al. 2010). In addition, a modifier locus unlinked to the S-locus has been identified (Liu et al. 2007), which suggests that S-locus inactivation may not be the sole mechanism by which SI broke down in A. thaliana and that different mechanisms of loss could have operated in different accessions (Boggs et al. 2009).

Systems with more recent transitions from outcrossing to selfing may provide a more direct picture of the causes and short-term consequences of mating system evolution (Foxe et al. 2009 (Chapter 2); Guo et al. 2009; Ness et al. 2010). For example, if the evolution of selfing involves the long-term spread of modifiers through previously outcrossing populations, recently derived selfing populations are expected to retain reasonably high levels of ancestral polymorphism, as recently observed in Eichhornia paniculata (Ness et al. 2010). In contrast, if a highly selfing lineage evolves rapidly from a small number of founders, we would expect a severe loss of genetic variation, as seen in Capsella rubella (Foxe et al. 2009 (Chapter 2)).

In Chapter 2 of this dissertation, I investigate the evolution of selfing in diploid species in the genus Capsella. Capsella rubella is characterized by a high rate of self-fertilization (Hurka et al. 1989) and shows the typical morphological characteristics of a selfing syndrome (Hurka and Neuffer 1997). From genetic marker studies, the selfing rate in C. rubella has been estimated as 1, with a lower-bound estimate of 0.7 (Hurka et al. 1989). Compared with its self-incompatible congener C. grandiflora, C. rubella has undergone a derived breakdown of the self-incompatibility mechanism, and its floral organs are highly reduced in size. The breakdown of self-incompatibility is also associated with an expansion of geographic range; C. grandiflora is restricted to Greece and Albania and occurs locally in Northern Italy, while C. rubella has expanded into much of southern Europe, extending to Middle Europe, Northern Africa and into Australia and North and South America (Hurka and Neuffer 1997; Paetsch et al. 2006). Interspecific crossing experiments suggest that, in addition to mating system evolution, there is considerable post-pollination reproductive isolation between the species, with only a small proportion of crosses producing viable seed (Hurka and Neuffer 1997; Koch and Kiefer 2005; T. Slotte, K. Hazzouri, and S. Wright, unpublished).

My results from Chapter 2 suggest that this transition to selfing has led to a substantial bottleneck in C. rubella, with a 100- to 1500-fold reduction in effective population size (Ne) (Foxe et al. 2009 (Chapter 2)). Furthermore, I estimate a divergence time of approximately 13,500 years between these two species (Foxe et al. 2009 (Chapter 2)), suggesting that dramatic genotypic and phenotypic changes in C. rubella occurred over a relatively short period of time.

While results from studies outlined in Chapter 2 show little evidence for introgression in natural populations of C. grandiflora and C. rubella, in Chapter 3 I ask whether a shift in mating system has led to the establishment of effective reproductive isolating barriers between these two sister species. By investigating the population genetics of potentially hybridizing populations of C. grandiflora and C. rubella, I find no significant evidence for ongoing gene flow between the two species, suggesting that differences in mating system are acting as an effective reproductive barrier between them.

An additional factor in the speciation process is that of polyploidy. Polyploidy is considered by many to be the predominant mode of sympatric speciation (Coyne and Orr 2004; Mallet 2007). Polyploidization can act to instantly create a new species, as the newly formed polyploid is often immediately reproductively isolated from the diploid progenitor(s) due to changes in ploidy. Such changes in ploidy are often associated with mating system transitions, and vice versa. The relative contribution of polyploidy to speciation in plants is a controversial topic, with widely varying estimates of the frequency of polyploid speciation. Based upon the fraction of speciation events that involve any change in chromosome number as well as the fraction of changes in chromosome number that involve polyploidy, Otto and Whitton (2000) report that 2-4% of speciation events in angiosperms and 7% in ferns are a direct result of polyploidy. More recent work estimates the frequency of polyploid speciation by tracking changes in ploidy level across infrageneric phylogenetic trees (Wood et al. 2009). Wood and colleagues (2009) put this number at 15% in angiosperms and 31% in ferns. These estimates indicate that polyploidization represents an extremely common vehicle for the speciation process in plants.

Given the dominant role of polyploidization in plant speciation, understanding the evolutionary context in which this process occurs becomes an important aspect of speciation genetics. Several relevant and important questions must be posed in order to elucidate the evolutionary history of a polyploid species, for instance: does the species have a single or multiple origins; what is the role of founder events in this process; is there ongoing gene flow between the species and its ancestor(s); when did the polyploidization event occur; is the species an allo- or autopolyploid? Addressing these questions is difficult considering the challenges associated with distinguishing multiple origins of polyploids, extinction of parental lineages and the sampling of standing variation in progenitor species (Doyle and Egan 2009).

One such polyploid is the weed C. bursa-pastoris or Shepherd's Purse, which is a selfing tetraploid species and a member of the genus Capsella (Hurka and Neuffer 1997). Like C. rubella, C. bursa-pastoris very clearly displays the selfing syndrome phenotypically. Previous work on these species has suggested that both C. grandiflora and C. rubella may be ancestral to C. bursa-pastoris (Hurka and Neuffer 1997), and more recent findings reveal that C. rubella diverged from C. grandiflora approximately 13,500 years ago (Foxe et al. 2009 (Chapter 2, see above)). Capsella bursa-pastoris has a worldwide distribution that can partly be explained anthropogenically (Hurka and Neuffer 1997; Slotte et al. 2008). In contrast to C. grandiflora and C. rubella, C. bursa-pastoris can be found on each continent and thrives in a wide climate range (Hurka and Neuffer 1997).

In Chapter 4, using DNA sequence data from 14 unlinked nuclear loci in C. bursa-pastoris, C. grandiflora and C. rubella, I address the following areas. First, I characterize patterns of polymorphism in all three species in this genus. Next, using molecular phylogenetic techniques, I elucidate the phylogenetic relationships between C. bursa-pastoris, C. grandiflora and C. rubella. Finally, using coalescent-based analyses, I date the divergence of C. bursa-pastoris from the ancestral C. grandiflora and assess evidence for population bottlenecks in C. bursa-pastoris, thus elucidating the evolutionary history of all three species in the genus Capsella. My results suggest that C. bursa-pastoris originated via autopolyploidization involving two distinct C. grandiflora haplotypes. Furthermore, I have dated this event to approximately 667,000 years ago.

In Chapter 5, I investigate the demographic history and population structure of North American Arabidopsis lyrata (Foxe et al. 2010). It has been suggested that A. lyrata colonized North America from ancestral European populations (Clauss and Mitchell-Olds 2006; Ross-Ibarra et al. 2008), which are highly self-incompatible and exclusively outcrossing. The North American populations are unique because some are still predominantly outcrossing, despite the occurrence of self-compatible individuals at low frequency, while others are almost entirely self-compatible and have undergone a transition to high rates of selfing (Mable et al. 2005; Mable and Adam 2007). This transition to selfing in A. lyrata appears to be very recent, as selfing populations belong to a chloroplast lineage that also contains outcrossing populations (Hoebe et al. 2009). Moreover, selfing populations are not characterized by smaller flowers (Hoebe 2009), which contrasts with other systems where the transition to selfing has led to notable floral evolution towards smaller flowers (Hurka and Neuffer 1997; Charlesworth and Vekemans 2005; Tang et al. 2007; Foxe et al. 2009 (Chapter 2); Guo et al. 2009).

In this study, I integrate polymorphism information from nuclear genes,

chloroplast markers and nuclear microsatellites, in order to obtain a detailed picture of the

demographic history and population structure of A. lyrata in the Great Lakes region of

North America. I find evidence for strong geographic clustering irrespective of mating

system, suggesting that selfing either evolved multiple times or has spread to multiple genetic backgrounds. I find much reduced diversity in selfing populations, but not to the

extent of the severe loss of variation expected if selfing evolved under severe selective pressure to colonize new areas. Furthermore, unlike the transition to selfing in C. rubella, these results suggest multiple transitions to selfing in this system.

References

Baker, H. G. 1955. Self-compatibility and establishment after long-distance dispersal.

Evolution 9:347-349.

Barrett, S. C. H. 2002. The evolution of plant sexual diversity. Nat Rev Genet 3:274-284.

Barrett, S. C. H., and C. G. Eckert. 1990. Variation and evolution of mating systems in

seed plants. Pp. 229-254 in S. Kawano, ed. Biological approaches and

evolutionary trends in plants. Academic Press, New York, New York, USA.

Baudry, E., C. Kerdelhue, H. Innan, and W. Stephan. 2001. Species and recombination

effects on DNA variability in the tomato genus. Genetics 158:1725-1735.

Boggs, N. A., J. B. Nasrallah, and M. E. Nasrallah. 2009. Independent S-locus mutations

caused self-fertility in Arabidopsis thaliana. PLoS Genet 5:e1000426.

Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1993. The effects of deleterious

mutations on neutral molecular variation. Genetics 134:1289-1303.

Charlesworth, D. 2006. Evolution of plant breeding systems. Curr Biol 16:R726-735.

Charlesworth, D., and X. Vekemans. 2005. How and when did Arabidopsis thaliana

become highly self-fertilising. Bioessays 27:472-476.

Charlesworth, D., and Z. Yang. 1998. Allozyme diversity in Leavenworthia populations

with different inbreeding levels. Heredity 81:453-461.

Chiang, Y. H., B. A. Schaal, C. H. Chou, S. Huang, and T. Y. Chiang. 2003. Contrasting

selection modes at the Adh1 locus in outcrossing Miscanthus sinensis vs.

inbreeding Miscanthus condensatus (Poaceae). Am J Bot 90:561-570.

Clauss, M. J., and T. Mitchell-Olds. 2006. Population genetic structure of Arabidopsis

lyrata in Europe. Mol Ecol 15:2753-2766.

Coyne, J. A., and H. A. Orr. 2004. Speciation. Sinauer Associates, Inc., Sunderland,

Massachusetts.

Cutter, A. D., and B. A. Payseur. 2003. Selection at Linked Sites in the Partial Selfer

Caenorhabditis elegans. Mol Biol Evol 20:665-673.

Doyle, J. J., and A. N. Egan. 2009. Dating the origins of polyploidy events. New Phytol

186:73-85.

Filatov, D. A., and D. Charlesworth. 1999. DNA polymorphism, haplotype structure and

balancing selection in the Leavenworthia PgiC locus. Genetics 153:1423-1434.

Fisher, R. A. 1941. Average excess and average effect of a gene substitution. Ann Eugen

11:53-63.

Foxe, J. P., T. Slotte, E. A. Stahl, B. Neuffer, H. Hurka, and S. I. Wright. 2009. Rapid

morphological evolution and speciation associated with the evolution of selfing in

Capsella. PNAS 106:5241-5245.

Foxe, J. P., M. Stift, A. Tedder, A. Haudry, S. I. Wright, and B. K. Mable. 2010.

Reconstructing Origins of Loss of Self-Incompatibility and Selfing in North

American Arabidopsis lyrata: a Population Genetic Context. Evolution In Print.

Glemin, S., E. Bazin, and D. Charlesworth. 2006. Impact of mating systems on patterns of

sequence polymorphism in flowering plants. Proc Biol Sci 273:3011-3019.

Guo, Y.-L., J. S. Bechsgaard, T. Slotte, B. Neuffer, M. Lascoux, D. Weigel, and M. H.

Schierup. 2009. Recent speciation of Capsella rubella from Capsella grandiflora,

associated with loss of self-incompatibility and an extreme bottleneck. PNAS

106:5246-5251.

Hoebe, P. N., M. Stift, A. Tedder, and B. K. Mable. 2009. Multiple losses of self-

incompatibility in North-American Arabidopsis lyrata: phylogeographic context

and population genetic consequences. Mol Ecol 18:4924-4939.

Hurka, H., S. Freundner, A. H. Brown, and U. Plantholt. 1989. Aspartate

aminotransferase isozymes in the genus Capsella (Brassicaceae): subcellular

location, gene duplication, and polymorphism. Biochem Genet 27:77-90.

Hurka, H., and B. Neuffer. 1997. Evolutionary processes in the genus Capsella

(Brassicaceae). Plant Sys Evol 206:295-316.

Koch, M., and M. Kiefer. 2005. Genome evolution among cruciferous plants: a lecture

from the genetic maps of three diploid species— Capsella rubella, Arabidopsis

lyrata subsp. petraea, and A. thaliana. Am J Bot 92:761-767.

Liu, F., D. Charlesworth, and M. Kreitman. 1999. The effect of mating system

differences on nucleotide diversity at the phosphoglucose isomerase locus in the

plant genus Leavenworthia. Genetics 151:343-357.

Liu, F., L. Zhang, and D. Charlesworth. 1998. Genetic diversity in Leavenworthia

populations with different inbreeding levels. Proc R Soc Lond B Biol Sci

265:293-301.

Liu, P., S. Sherman-Broyles, M. E. Nasrallah, and J. B. Nasrallah. 2007. A Cryptic

Modifier Causing Transient Self-Incompatibility in Arabidopsis thaliana. Curr

Biol 17:734-740.

Lloyd, D. G. 1979. Some reproductive factors affecting the selection of self-fertilization

in plants. Am Nat 113:67-79.

Mable, B. K., and A. Adam. 2007. Patterns of genetic diversity in outcrossing and selfing

populations of Arabidopsis lyrata. Mol Ecol 16:3565-3580.

Mable, B. K., A. V. Robertson, S. Dart, C. Di Berardo, and L. Witham. 2005. Breakdown

of self-incompatibility in the perennial Arabidopsis lyrata (Brassicaceae) and its

genetic consequences. Evolution 59:1437-1448.

Mallet, J. 2007. Hybrid speciation. Nature 446:279-283.

Nagylaki, T. 1976. A model for the evolution of self-fertilization and vegetative

reproduction. J Theor Biol 58:55-58.

Ness, R. W., S. I. Wright, and S. C. H. Barrett. 2010. Mating-System Variation,

Demographic History and Patterns of Nucleotide Diversity in the Tristylous Plant

Eichhornia paniculata. Genetics 184:381-392.

Otto, S. P., and J. Whitton. 2000. Polyploid incidence and evolution. Ann Rev Gen

34:401-437.

Nordborg, M. 2000. Linkage disequilibrium, gene trees and selfing: an ancestral

recombination graph with partial self-fertilization. Genetics 154:923-929.

Paetsch, M., S. Mayland-Quellhorst, and B. Neuffer. 2006. Evolution of the

self-incompatibility system in the Brassicaceae: identification of S-locus receptor

kinase (SRK) in self-incompatible Capsella grandiflora. Heredity 97:283-290.

Ross-Ibarra, J., S. I. Wright, J. P. Foxe, A. Kawabe, L. DeRose-Wilson, G. Gos, D.

Charlesworth, and B. S. Gaut. 2008. Patterns of Polymorphism and Demographic

History in Natural Populations of Arabidopsis lyrata. PLoS One 3:e2411.

Savolainen, O., C. H. Langley, B. P. Lazzaro, and H. Freville. 2000. Contrasting patterns

of nucleotide polymorphism at the alcohol dehydrogenase locus in the outcrossing

Arabidopsis lyrata and the selfing Arabidopsis thaliana. Mol Biol Evol 17:645-

655.

Schoen, D. J., M. O. Johnston, A.-M. L'Heureux, and J. V. Marsolais. 1997. Evolutionary

history of the mating system in Amsinckia (Boraginaceae). Evolution 51:1090-

1099.

Slotte, T., H. Huang, M. Lascoux, and A. Ceplitis. 2008. Polyploid speciation did not

confer instant reproductive isolation in Capsella (Brassicaceae). Mol Biol Evol

25:1472-1481.

Stebbins, G. L. 1970. Adaptive radiation of reproductive characteristics in angiosperms.

I. Pollination mechanisms. Ann Rev Ecol Sys 1:307-326.

Tang, C., C. Toomajian, S. Sherman-Broyles, V. Plagnol, Y. L. Guo, T. T. Hu, R. M.

Clark, J. B. Nasrallah, D. Weigel, and M. Nordborg. 2007. The evolution of

selfing in Arabidopsis thaliana. Science 317:1070-1072.

Tsuchimatsu, T., K. Suwabe, R. Shimizu-Inatsugi, S. Isokawa, P. Pavlidis, T. Stadler, G.

Suzuki, S. Takayama, M. Watanabe, and K. K. Shimizu. 2010. Evolution of

self-compatibility in Arabidopsis by a mutation in the male specificity gene. Nature

464:1342-1346.

Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H.

Rieseberg. 2009. The frequency of polyploid speciation in vascular plants. PNAS

106:13875-13879.

Wright, S. I., B. Lauga, and D. Charlesworth. 2003. Subdivision and haplotype structure in

natural populations of Arabidopsis lyrata. Mol Ecol 12:1247-1263.

Wright, S. I., R. W. Ness, J. P. Foxe, and S. C. H. Barrett. 2008. Genomic consequences

of outcrossing and selfing in plants. Int J Plant Sci 169:105-118.

Recent speciation associated with the evolution of selfing in Capsella

Biological Sciences: Evolution

John Paul Foxe*,†, Tanja Slotte*,‡, Eli A. Stahl§, Barbara Neuffer¶, Herbert Hurka¶ and Stephen I. Wright†,‡

* These authors contributed equally to the work

† Department of Biology, York University, 4700 Keele St., Toronto, ON, Canada M3J 1P3

‡ Department of Ecology and Evolutionary Biology, University of Toronto, 25 Willcocks St., Toronto, ON, Canada M5S 3B2

§ Department of Biology, University of Massachusetts Dartmouth, North Dartmouth, MA 02747, USA

¶ Universität Osnabrück, Fachbereich Biologie/Chemie, Spezielle Botanik, Barbarastrasse 11, D-49076 Osnabrück, Germany

Corresponding Author: Stephen I. Wright Department of Ecology and Evolutionary Biology University of Toronto 25 Willcocks St., Toronto, ON Canada. Phone: 416-946-8508 Fax: 416-978-5878 [email protected]

The evolution from outcrossing to predominant self-fertilization represents one of the most common transitions in flowering plant evolution. This shift in mating system is almost universally associated with the 'selfing syndrome', characterized by marked reduction in flower size and a breakdown of the morphological and genetic mechanisms that prevent self-fertilization. In general, the timescale over which these transitions occur, and the evolutionary dynamics associated with the evolution of the selfing syndrome, are poorly known. We investigated the origin and evolution of selfing in the annual plant Capsella rubella from its self-incompatible, outcrossing progenitor Capsella grandiflora by characterizing multilocus patterns of DNA sequence variation at nuclear genes. We estimate that the transition to selfing and subsequent geographic expansion have taken place during the last 20,000 years. This transition was probably associated with a shift from stable equilibrium towards a near-complete population bottleneck causing a major reduction in effective population size. The timing and severe founder event support the hypothesis that selfing was favored during colonization as new habitats emerged following the last glaciation and the expansion of agriculture. These results suggest that natural selection for reproductive assurance can lead to major morphological evolution and speciation on relatively short evolutionary timescales.

Selfing plants benefit from two distinct advantages over their outcrossing

competitors (1-3). First, because selfers are 100% related to their progeny and can

also act as outcross pollen donors for seed produced by other individuals, they have

an inherent transmission advantage over outcrossers (4). A second major advantage

conferred by selfing, first discussed by Darwin (5), is the ability to reproduce when

pollinators or potential mates are limited (reproductive assurance). One important

aspect of reproductive assurance is the ability of selfing lineages to colonize new

habitats from a very small founding population. Evolutionary theory predicts that

selfing will evolve when these advantages outweigh the costs associated with

inbreeding depression, reduced fitness caused by increased homozygosity of

deleterious recessive alleles (6).
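The balance between the transmission advantage and inbreeding depression can be illustrated with a simple gene-counting sketch. This is not part of the analyses reported here. It assumes no pollen discounting (selfing does not reduce pollen export), a single inbreeding depression parameter applied to selfed seed, and hypothetical function names chosen for illustration.

```python
def gene_copies_transmitted(selfing_rate, inbreeding_depression):
    """Expected gene copies transmitted per plant, on a relative scale.

    Assumptions (illustrative only): each selfed seed carries two copies of
    the parent's genes and survives with probability 1 - inbreeding_depression;
    each outcrossed seed carries one copy; pollen exported to other plants
    contributes one further copy regardless of the selfing rate.
    """
    selfed_seed = 2.0 * selfing_rate * (1.0 - inbreeding_depression)
    outcrossed_seed = 1.0 - selfing_rate
    exported_pollen = 1.0
    return selfed_seed + outcrossed_seed + exported_pollen


def selfing_favored(selfing_rate, inbreeding_depression):
    """True if a plant with this selfing rate out-transmits a pure outcrosser."""
    outcrosser = gene_copies_transmitted(0.0, inbreeding_depression)
    selfer = gene_copies_transmitted(selfing_rate, inbreeding_depression)
    return selfer > outcrosser


if __name__ == "__main__":
    for delta in (0.0, 0.4, 0.6):
        print(delta, gene_copies_transmitted(1.0, delta), selfing_favored(1.0, delta))
```

Under these assumptions a complete selfer transmits three gene copies for every two transmitted by an outcrosser when there is no inbreeding depression, and the advantage disappears once inbreeding depression exceeds one half.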

In most cases the timescale over which the selfing syndrome has evolved

remains difficult to ascertain. In the model genus Arabidopsis for example, the selfing

A. thaliana diverged from its closest outcrossing relatives roughly 5 million years ago

(7), and patterns of diversity have suggested a possibly complicated picture of the breakdown of outcrossing over a period of more than a million years (7, 8). Thus it remains unclear how rapidly the selfing syndrome can arise, and what historical conditions have favored mating system evolution. Capturing and characterizing a recent transition is important for inferring this timescale, and the relative importance of various forces favoring its evolution. For example, if selfing is initially favored through the genetic transmission advantage, we would expect that self-compatibility alleles would invade individual populations that were ancestrally outcrossing, and the

recently derived selfing species should maintain considerable levels of genetic variation at regions unlinked to the locus causing selfing (9). In contrast, selection for colonization ability in small, founding populations should lead to a severe reduction in genetic variation genome-wide in the derived selfing lineage.

Reprinted with permission from the Proceedings of the National Academy of Sciences of the United States of America, 106:5241-5245. Copyright 2009.

Here, we investigate the evolution of selfing in diploid species in the genus

Capsella. Capsella rubella is characterized by a high rate of self-fertilization (10) and shows the typical morphological characteristics of a selfing syndrome (11). From genetic marker studies, the selfing rate in C. rubella has been estimated as 1, with a lower-bound estimate of 0.7 (10). In comparison with its self-incompatible congener

C. grandiflora, there has been a derived breakdown of the self-incompatibility mechanism, and its floral organ sizes are highly reduced (Figure 1). The breakdown of self-incompatibility is also associated with an expansion of geographic range; C. grandiflora is restricted to Western Greece and Albania and locally in Northern Italy, while C. rubella has expanded into much of southern Europe extending to Middle

Europe, Northern Africa and into Australia and North and South America (11, 12).

Interspecific crossing experiments suggest that, in addition to mating system evolution, there is considerable post-pollination reproductive isolation between the species, with only a small proportion of crosses producing viable seed ((11, 13), T.

Slotte, K. Hazzouri, and S. Wright, unpublished). We characterized multilocus patterns of neutral DNA sequence diversity across the genome to investigate the evolutionary history associated with this transition to selfing. We estimated the parameters of a model of isolation with or without migration, and tested the goodness of fit of the data to models incorporating species-specific recombination rate. Our

approach provides a detailed examination of the evolutionary history associated with the origins of selfing Capsella.

Results

Patterns of Polymorphism

We estimated levels of synonymous and nonsynonymous diversity in a geographically broad sample of both C. rubella (14 individuals) and C. grandiflora

(20 diploid individuals) using direct sequencing of 39 nuclear genes. We identified a total of 587 synonymous single nucleotide polymorphisms (SNPs) and 343 nonsynonymous SNPs in C. grandiflora. In C. rubella, diversity was highly reduced with 81 synonymous and 41 nonsynonymous SNPs. Figure 2a illustrates this major reduction in synonymous diversity in C. rubella when compared to C. grandiflora.

Over a third of loci (36%) were devoid of any variation in C. rubella, and almost half

(46%) lacked synonymous site variation. In contrast, all loci were highly variable in

C. grandiflora.

Although a reduction in sequence diversity in selfing species is predicted under models of long-term equilibrium (14), the diversity reduction seen here is extreme (15), and our sequence data imply a very close relationship between C. rubella and C. grandiflora. At the majority of loci (74%), we identify sequence haplotypes shared between the species, and sequence divergence, when present, is very low. To illustrate the patterns of polymorphism, we plot the joint derived SNP frequency spectrum from twelve samples of each species (Figure 3). More than 80% of the segregating sites found in C. rubella are shared with C. grandiflora, and thus

there is a very small proportion of private polymorphisms present in this species

(Figure 3; private polymorphisms to C. rubella have a derived frequency of 0 in C. grandiflora). We also identify a significant fraction of the segregating sites found in

C. grandiflora to be fixed derived SNPs in C. rubella (Figure 3; Derived allele frequency of 12 in C. rubella). Similarly only 3 fixed differences were identified in our dataset in a total of over 20,000 bp surveyed for this study. This suggests that the sequence haplotypes present in C. rubella were mostly sub-sampled from existing variation in C. grandiflora.
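The site categories used in this comparison (shared polymorphisms, polymorphisms private to one species, and fixed differences) can be sketched as a small classifier over derived-allele counts. The function names and the toy counts below are hypothetical and are not the Perl scripts or data used in this study; the sketch assumes each site is biallelic and that the ancestral state is known from an outgroup.

```python
from collections import Counter


def classify_site(derived_1, n_1, derived_2, n_2):
    """Classify one biallelic site from derived-allele counts.

    derived_1, derived_2: derived-allele counts in species 1 and species 2.
    n_1, n_2: number of sampled chromosomes in each species.
    """
    polymorphic_1 = 0 < derived_1 < n_1
    polymorphic_2 = 0 < derived_2 < n_2
    if polymorphic_1 and polymorphic_2:
        return "shared"
    if polymorphic_1:
        return "private_1"
    if polymorphic_2:
        return "private_2"
    # Neither species is polymorphic: the site is a fixed difference
    # only if one species is fixed derived and the other fixed ancestral.
    if (derived_1 == n_1) != (derived_2 == n_2):
        return "fixed_difference"
    return "monomorphic"


def summarize(sites, n_1, n_2):
    """Count site categories over (derived_1, derived_2) pairs."""
    return Counter(classify_site(d1, n_1, d2, n_2) for d1, d2 in sites)


if __name__ == "__main__":
    # Toy counts for 12 chromosomes per species (illustrative, not real data).
    toy_sites = [(3, 5), (0, 4), (12, 6), (12, 0), (0, 1)]
    print(summarize(toy_sites, 12, 12))
```

In this scheme a site that is fixed for the derived allele in one species but still segregating in the other counts as private to the polymorphic species, which matches the description of the joint frequency spectrum above.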

Analysis of the frequency distribution of polymorphisms and linkage disequilibrium suggest that C. grandiflora represents a large, stable equilibrium population. In particular, the derived frequency spectrum of synonymous SNPs matches closely to the expected distribution under long-term constant population size

(Fig. 2c, SI Fig. 5). Furthermore, linkage disequilibrium decays rapidly with distance

(SI Fig. 6) giving a high estimate of the effective rate of recombination (Fig. 2b), consistent with expectations from neutral equilibrium (10, 16). In contrast C. rubella shows much greater within-locus linkage disequilibrium and very little effective recombination within genes, consistent with a severe bottleneck and recent transition towards high selfing (Fig. 2; SI Fig. 6). Nevertheless, a lack of linkage disequilibrium between loci (SI Fig. 6) suggests that C. rubella does some level of outcrossing, uncoupling coalescent history among distant loci. The site frequency spectrum, measured by Tajima's D (16), shows an elevated variance across loci (Figure 2) relative to the predictions of stable equilibrium in C. rubella (simulation results,

p<0.05; Figure 2), providing further evidence for recent changes in population size

(17).
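For readers unfamiliar with the statistic, Tajima's D compares the mean number of pairwise differences with the number of segregating sites scaled by sample size. The following is a minimal sketch of the standard formula, not the code used for the analyses in this chapter; the function name and argument names are illustrative.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for one locus.

    n: number of sampled sequences (at least 4, so the variance is positive).
    seg_sites: number of segregating sites S at the locus.
    pi: mean number of pairwise differences at the locus (not per site).
    Returns None when there are no segregating sites.
    """
    if n < 4:
        raise ValueError("need at least 4 sequences")
    if seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


if __name__ == "__main__":
    # Ten singletons in ten sequences: an excess of rare variants.
    print(tajimas_d(10, 10, 2.0))
    # Ten sites each at frequency 5/10: an excess of intermediate variants.
    print(tajimas_d(10, 10, 50.0 / 9.0))
```

Negative values indicate an excess of rare variants and positive values an excess of intermediate-frequency variants; a recent bottleneck is expected to inflate the variance of D across loci, which is the pattern tested above.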

Evidence for Single Origin of C. rubella

The extent of shared haplotypes and polymorphism between these species contrasts with the 5 million year divergence of the selfing species Arabidopsis thaliana from its closest extant outcrossing relatives (7), and suggests one of three possibilities: very recent evolution of selfing, multiple origins of C. rubella, and/or ongoing hybridization. We first sought to test whether selfing C. rubella was derived from a single speciation event, or whether we can detect evidence for multiple origins or recent hybridization. A Bayesian clustering algorithm with the entire dataset (18) found the most likely number of clusters to be 2, in which all C. rubella and C. grandiflora individuals cluster with their own species, with no evidence for very recent hybridization affecting this sample (SI Fig. 7). If we analyze the species on their own, we find no evidence for geographic substructure in our sample of C. grandiflora, while C. rubella subdivides into three distinct clusters, broadly consistent with the geographic origins of our samples (SI Fig. 7). Given the wide geographic sampling for this study, the combined results suggest that C. rubella has evolved once, with subsequent geographic expansion and subdivision.

Demographic Model Fitting

Our results provide evidence for a single recent origin of Capsella rubella from Capsella grandiflora, with little or no recent hybridization affecting our

samples. To investigate the dynamics of the speciation event in detail, we used a

Markov Chain Monte Carlo (MCMC) approach based on coalescent simulations to fit a model of isolation with and without migration to the data (19). The approach makes use of the observed information from each locus on the number of shared and unique polymorphisms, as well as fixed differences. For our analysis, we restrict the inference to synonymous sites, to avoid potential effects of selection on the nonsynonymous variants. The model assumes that a single ancestral population of size NA split into two at time τ, and the two derived populations have distinct effective population sizes (N1 and N2).

Although our Structure results suggest little or no hybridization, it remains possible that historical gene flow has contributed to the shared polymorphism between the species. To estimate demographic parameters, we therefore investigated a series of models that varied in the inclusion of symmetrical, asymmetrical or no migration, and varied whether or not the ancestral population size was constrained to be the same size as present-day C. grandiflora. We estimated historical demographic parameters from a total of four alternative models (Table 1, SI Tables 2 to 5). To evaluate the fit of these models to the data, we used goodness-of-fit tests. In particular we simulated datasets from both the posterior distributions and the point parameter estimates of each model, and estimated the probability of seeing our observed multilocus summary statistics, including additional data summaries not used in parameter estimation (see Supplemental Information).

Our parameter estimates from these models are consistent with an extremely recent speciation event associated with a major reduction in effective population size

in C. rubella; Figure 4 shows the parameter estimates from two of the best fit models, based on the results of goodness-of-fit tests (Table 1 and Supplementary Information).

Under the model with no migration, our most likely estimate of divergence time is approximately 13,500 years ago with a 90% posterior density from 1500 to 51,000 years, suggesting species divergence since the last glacial maximum about 21,000 years ago (20). Similarly, models with migration provide estimates of divergence time of 15,000 or 22,000 years ago. The boundaries on the divergence time estimate across all models examined depend on whether migration is included in the model (Table 1); in particular, allowing for a wide prior probability on migration and time leads to a long tail in the posterior density which is small but non-zero for older divergence

(Figure 4b). However, the best model is clearly a recent divergence in the last 25,000 years. Estimates of population size parameters suggest that C. grandiflora has maintained a large population size since its divergence from the ancestor, while we infer an approximately 100 to 1500-fold smaller effective population size in C. rubella (Table 1; Figure 4b). Taken together, these results clearly indicate recent speciation associated with the breakdown of self-incompatibility from a small number of founding lineages.

The one exception is a 1.4 million year divergence time estimate from the fully unconstrained model (Table 1, Supplementary Information); this model also infers a very high migration rate from C. rubella to C. grandiflora, an extremely low effective population size in C. rubella and a very low ancestral population size. The inferred level of gene flow under this model seems implausibly high given the mating system, our Structure results, and the evidence for post-pollination reproductive

isolation between the species. Furthermore, goodness-of-fit tests and comparisons of likelihoods suggest that this unconstrained model provides a poorer fit to our data

(Supplementary Information). However, since some models with and without migration appear to provide comparable fits to the observed data, divergence may be too recent to have sufficient power to infer whether or not there has been ongoing gene flow.

One potential concern with the approach used is that the model assumes a constant rate of recombination since the species split. Since the transition to selfing is expected to lead to a dramatic shift in the effective recombination rate (12), this assumption is violated. However, the method has been shown to be robust to inaccuracies in estimation of recombination (17), and given the very recent timescale and massive population bottleneck inferred here, effective recombination rates are suppressed even in the absence of this effect. Consistent with this, goodness-of-fit tests using simulations that allow for a lineage-specific change in recombination rate to zero in C. rubella show that the best-fit models provide an equally adequate fit to the data when this shift is incorporated (SI Table 5).

Discussion

The timescale of speciation suggested by our analysis is consistent with selfing evolving following the last glacial maximum. C. grandiflora is distributed in the Balkans, which was a glacial refugium for many plant species (11), and this is consistent with our inference of long-term equilibrium for the outcrossing species.

Furthermore, the massive loss of diversity in C. rubella suggests that selfing alleles

did not spread through a previously outcrossing population because of their transmission advantage. Instead selection for reproductive assurance, either due to a lack of pollinators or via selection for colonization ability, seems the most likely explanation for the evolution of selfing. Under either mode of selection for reproductive assurance, the enhanced colonization ability likely contributed to the geographic spread of selfing. As the glaciers receded, agriculture spread and Europe warmed, new habitats emerged, and this would favor colonization by selfing genotypes that can reproduce in the absence of pollinators and/or available mates (21-

23). Our inference of a very severe population bottleneck associated with the recent evolution of selfing is consistent with this model for the evolution and spread of selfing Capsella. This result is in line with mounting ecological data suggesting that reproductive assurance is a major factor favoring the evolution and maintenance of selfing (24-26).

If, as our results suggest, C. rubella experienced recent evolution of selfing associated with a severe population bottleneck, we would predict a similar severe loss of variation with little sequence divergence at the self-incompatibility locus. Recent results confirm this prediction (Guo, Y., et al, unpublished), providing support for our inference of recent speciation tied to the breakdown of self-incompatibility.

In A. thaliana, the transition to selfing appears to have been gradual, and occurred considerably further back in time (about 1 Myr or more), providing ample time for the evolution of morphological traits associated with the selfing habit (8). In contrast, our inference of recent speciation in Capsella implies that extensive phenotypic evolution has occurred on a very short evolutionary timescale. Given the

evidence for rapid morphological evolution, one possible alternative explanation to a severe bottleneck in C. rubella is the occurrence of positive selection reducing diversity genome-wide. Although it is likely that the evolution of selfing and spread through Europe has led to novel selective pressures, our observed low between-locus linkage disequilibrium (SI Figure 6) implies that this would likely be restricted to a small subset of the genome even in this highly selfing species, unless the extent of positive selection is very high. Furthermore, positive selection at a subset of loci would be expected to increase the variance in diversity across loci beyond the bottleneck expectations, but goodness-of-fit tests suggest that our models are sufficient to explain the observed variance (SI Tables 4 and 5). Finally, the observed retention of shared ancestral polymorphism in C. rubella is unlikely under a model of genome-wide selective sweeps (SI Figure 8).

Given the extreme population bottleneck, this nevertheless implies that rapid evolution has occurred from small amounts of standing variation. Understanding the genomic extent and genetic basis of these changes, and testing the extent to which phenotypic evolution occurred via positive selection or relaxation of stabilizing selection, will be of considerable interest in future investigations.

Methods

Species Sampling and DNA Sequencing

We collected nucleotide polymorphism data from 39 large exons (see SI Table

6) in C. grandiflora and C. rubella. A total of 20 diploid C. grandiflora and 14 C. rubella individuals were used for this study. The C. grandiflora samples originated

from eight natural localities in Greece. Sequences were obtained from four individuals from Ioannina; three individuals from the Katara Pass; two individuals from Metsovo; four individuals from Sokraki on the island of Corfu; three individuals from Doukades, Corfu; two individuals from Votonosi, Corfu; one individual from Paleokastritsas; and one individual from Pantokrator, Corfu. The C. rubella samples originated from four natural localities. Four individuals were from

Buenos Aires, Argentina; four were from Cumbre Dorsal, Tenerife, Spain; three were from Caldera de Taburiente, La Palma, Spain; and three were from Ioannina, Greece.

Outgroup data from Boechera stricta was obtained using DNA kindly provided by

B.H. Song and T. Mitchell-Olds.

Seeds were placed at 4°C on wet filter paper for 14 days before being allowed to germinate at room temperature. The seedlings were grown at 20°C under conditions of 18 hours of light and 6 hours of darkness. After 6 weeks of growth,

DNA was extracted from leaf material using a FastDNA® Kit and the FastPrep®

Instrument (Qbiogene, Inc., CA).

PCR primers for the large exons were designed as described by Wright et al. and Ross-Ibarra et al. (27, 28). Briefly, primers were designed to amplify 650-700 bp from single large exons based on the A. thaliana genome sequence, chosen with no a priori expectation as to their function or the action of selection upon these genes.

Each exon was used as a BLAST query against the shotgun genome sequence of

Brassica oleracea and homologous regions were used to design primers using

PrimerQuest (Integrated DNA Technologies). PCR reactions were carried out in

25 µL volumes on an Eppendorf Mastercycler. The cycling conditions were as follows: 2 minutes

at 94°C, followed by 35 cycles of 20 seconds at 94°C, 20 seconds at 55°C and 40 seconds at 72°C, with a final extension time of 4 minutes at 72°C.

These products were sequenced directly using ABI sequencing by Cogenics

(Houston, TX). Chromatograms were checked manually for heterozygous sites, using

Sequencher version 4.7 (Gene Codes, Ann Arbor, MI), with the aid of the 'Call secondary peaks' option. Sequences were aligned using Genedoc (29). Consistent with high levels of selfing, no heterozygous sites were identified in our C. rubella dataset. This complete lack of heterozygosity in C. rubella also allowed us to confirm that we were sequencing single copy regions only. Nucleotide sequences are deposited in GenBank under accession numbers (FJ182244-FJ183352).

Sequence Statistics and Analysis

Sequence-based summary statistics θ (30) and π (31), for synonymous and nonsynonymous sites, as well as frequency data, were calculated using a modified version of Perl code (Polymorphurama) written by D. Bachtrog and P. Andolfatto (U.C. San Diego). The frequency spectra of derived polymorphic variants, and the numbers of shared derived polymorphisms, unique polymorphisms and fixed differences, were calculated using Perl scripts written by S. Wright.
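As an illustration of the two summary statistics named above, the following minimal sketch computes Watterson's θ and π from a small alignment. It is not the Polymorphurama code used in the study; the function names and the toy alignment are hypothetical, and the per-site values divide by all sites rather than by synonymous or nonsynonymous sites only.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def watterson_theta(seqs):
    """Watterson's estimator: S divided by a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n

def nucleotide_diversity(seqs):
    """Pi: average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / len(pairs)

# Hypothetical toy alignment (not data from the study): 4 sequences, 8 sites.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
theta_per_site = watterson_theta(toy) / len(toy[0])
pi_per_site = nucleotide_diversity(toy) / len(toy[0])
```

Both functions return per-locus values; dividing by the number of sites of the relevant class gives the per-site estimates of the kind reported in SI Table 6.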

Population recombination rate estimates were calculated using the composite likelihood approach of Hudson (32), using the maxdip program for diploid unphased data for C. grandiflora, and ldhat for C. rubella (33), with data processing code written by J. Ross-Ibarra (http://rossibarra.googlepages.com/ldpipereadme). Linkage disequilibrium, measured as the squared correlation coefficient r², was calculated using Weir's algorithm (34) for unphased data with an R script kindly provided by Stuart MacDonald (U. Kansas). DnaSP (version 4.0) (35) was used to calculate linkage disequilibrium in C. rubella. To assess the significance of the mean and variance of Tajima's D estimates we used Jody Hey's HKA program (http://lifesci.rutgers.edu/heylab/DistributedProgramsandData.htm#HKA).
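For reference, Tajima's D (16) for a single locus can be computed from the sample size, the number of segregating sites and the average number of pairwise differences. The sketch below follows the standard formula from Tajima (1989); it is only an illustration, it does not reproduce the simulation-based significance testing performed with the HKA program, and the function name is hypothetical.

```python
import math

def tajimas_d(n, S, pi):
    """Tajima's D from sample size n, segregating sites S and pairwise
    differences pi (per locus). Returns None when S == 0, where D is
    undefined."""
    if S == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: difference between pi and Watterson's estimate S / a1.
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Hypothetical example: 4 sequences, 2 segregating sites, pi = 1.0.
d_example = tajimas_d(4, 2, 1.0)
```

A negative value, as in this toy example, indicates an excess of low-frequency variants relative to the neutral equilibrium expectation.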

To infer haplotype data in C. grandiflora we used PHASE 2.1, which implements a Bayesian statistical method to reconstruct haplotypes from diploid data (36, 37). Haplotypes with the highest posterior probabilities were used for cluster analysis performed with the program STRUCTURE (18). The program was run under the haploid model assuming values of k (population number) from 1 to 7, each with 1,000,000 repetitions and a burn-in period of 100,000.

Coalescent Simulations

Coalescent simulations were conducted using MIMAR which estimates the parameters of an isolation-migration model based on Hudson's ms (38). Because

MIMAR makes use of outgroup information to infer the number of derived SNP fixations, we made every effort to minimize errors associated with ancestral state inference, as described in the Supplemental Information. Simulations using a modified version of MIMAR that does not rely on outgroup inference provide comparable estimates under all models (SI Table 7).

Simulations were conducted using the 25 loci for which we had sequence data from three outgroup species for ancestral state inference. Furthermore, sites with more than 2 segregating bases were excluded from the analysis. To allow for locus-specific mutation rates, the mutation rate scalar was estimated using θsyn for each of the 26 loci, dividing by the mean θsyn. We ran a series of MIMAR coalescent simulations that differed in the degree of constraint on migration rates and effective population sizes (Table 1). Migration rates were either unconstrained (asymmetric migration), constrained to be symmetrical or set to zero (no migration), whereas effective population sizes were either unconstrained or assumed to be identical in C. grandiflora and the ancestor of C. rubella and C. grandiflora.
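The locus-specific mutation rate scalar described above is each locus's synonymous θ divided by the mean across loci. A minimal sketch, assuming the per-locus estimates are held in a dictionary; the locus names and values below are hypothetical and are not taken from the data set.

```python
def mutation_scalars(theta_syn):
    """Scale each locus's synonymous theta by the mean across loci,
    so that the scalars average to 1."""
    mean_theta = sum(theta_syn.values()) / len(theta_syn)
    return {locus: value / mean_theta for locus, value in theta_syn.items()}

# Hypothetical per-locus synonymous theta estimates.
example_theta = {"locusA": 0.01, "locusB": 0.02, "locusC": 0.03}
scalars = mutation_scalars(example_theta)
```

Loci with above-average synonymous diversity receive a scalar greater than 1, and the scalars average to 1 by construction.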

Prior limits for the Bayesian procedure implemented in MIMAR were set based on initial runs using wide priors. Priors for θ were uniform 0.001-0.1 for both C. grandiflora and the ancestral species, and uniform 0-0.0025 for C. rubella. All runs assumed an exponentially distributed prior with rate 1 for ρ/θ, and a mutation rate per bp of 1.5×10⁻⁸ (39). Migration rate priors were log uniform from -5 to 2.5 for migration from C. grandiflora to C. rubella (forward in time) and log uniform from -5 to 6 for migration from C. rubella to C. grandiflora. The prior for the time of the split between C. rubella and C. grandiflora was uniform 0-4×10⁶.

Each simulation was run for a total of 20,160 minutes (2 weeks) with a burn-in of 100,000 steps, and we performed three sets of simulations with different random seeds for each model. Mixing was monitored by assessing parameter autocorrelation over runs, and we considered that MIMAR had reached convergence when the posterior distributions from independent runs were highly similar (19). The mode of the marginal posterior probability distribution was taken as the point estimate for each parameter, and we calculated 90% highest posterior density (HPD) intervals from the MIMAR output using the boa package in R 2.6.2 (40).
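To make the two posterior summaries concrete, the sketch below computes an empirical HPD interval (the shortest interval containing a given fraction of the sampled values) and a simple histogram-based mode. This is an illustrative stand-in, not the boa routine used in the study; the function names, the binning choice and the sample values are assumptions.

```python
import math

def hpd_interval(samples, mass=0.90):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must contain (guard against float noise).
    k = max(1, math.ceil(mass * n - 1e-9))
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

def histogram_mode(samples, bins=20):
    """Midpoint of the most populated of `bins` equal-width bins.
    Assumes the samples are not all identical."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    peak = counts.index(max(counts))
    return lo + (peak + 0.5) * width

# Hypothetical skewed sample: the HPD interval excludes the outlying value.
example = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
interval = hpd_interval(example, 0.90)
```

For a skewed posterior such as the divergence time, the HPD interval can differ substantially from an equal-tailed interval, which is one reason to report it alongside the mode.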

We thank C. Becquet for assistance and advice with running MIMAR, and S.C.H.

Barrett and A. Cutter for comments on the manuscript. This work was supported by a

Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant, an Early Researcher Award from the Ontario Ministry of Research and

Innovation, and an Alfred P. Sloan Foundation Fellowship to SIW.

1. Charlesworth D (2006) Evolution of plant breeding systems. Curr Biol 16(17):R726-735.
2. Stebbins GL (1950) Variation and Evolution in Plants (Columbia University Press, New York, NY).
3. Barrett SCH (2002) The evolution of plant sexual diversity. Nature Reviews Genetics 3(4):274-284.
4. Fisher RA (1941) Average excess and average effect of a gene substitution. Ann. Eugen. 11:53-63.
5. Darwin CR (1876) The effects of cross and self-fertilization in the vegetable kingdom (John Murray, London).
6. Charlesworth D & Charlesworth B (1987) Inbreeding depression and its evolutionary consequences. Annu Rev Ecol Syst 18:237-268.
7. Koch MA, Haubold B, & Mitchell-Olds T (2000) Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Molecular biology and evolution 17(10):1483-1498.
8. Tang C, et al. (2007) The evolution of selfing in Arabidopsis thaliana. Science 317(5841):1070-1072.
9. Schoen DJ, Morgan MT, & Bataillon T (1996) How does self-pollination evolve? Inferences from floral ecology and molecular genetic variation. Philos. Trans. R. Soc. Lond. B 351:1281-1290.
10. Hurka H, Freundner S, Brown AH, & Plantholt U (1989) Aspartate aminotransferase isozymes in the genus Capsella (Brassicaceae): subcellular location, gene duplication, and polymorphism. Biochemical genetics 27(1-2):77-90.
11. Hurka H & Neuffer B (1997) Evolutionary processes in the genus Capsella (Brassicaceae). Plant Systematics and Evolution 206:295-316.
12. Paetsch M, Maryland-Quellhorst S, & Neuffer B (2006) Evolution of the self-incompatibility system in the Brassicaceae: identification of S-locus receptor kinase (SRK) in self-incompatible Capsella grandiflora. Heredity 97:283-290.
13. Koch M & Kiefer M (2005) Genome evolution among cruciferous plants: a lecture from the comparison of the genetic maps of three diploid species—Capsella rubella, Arabidopsis lyrata subsp. petraea, and A. thaliana. Am J Bot 92(4):761-767.
14. Charlesworth D & Wright SI (2001) Breeding systems and genome evolution. Curr Opin Genet Dev 11(6):685-690.
15. Wright SI, Ness RW, Foxe JP, & Barrett SCH (2008) Genomic consequences of outcrossing and selfing in plants. International Journal of Plant Sciences 169(1):105-118.
16. Tajima F (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.
17. Wright SI & Gaut BS (2005) Molecular population genetics and the search for adaptive evolution in plants. Molecular biology and evolution 22(3):506-519.
18. Pritchard JK, Stephens M, & Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155(2):945-959.
19. Becquet C & Przeworski M (2007) A new approach to estimate parameters of speciation models with application to apes. Genome Res 17(10):1505-1519.
20. Kageyama M, et al. (2001) The Last Glacial Maximum climate over Europe and western Siberia: a PMIP comparison between models and data. Climate dynamics 17(1):23-43.
21. Ammerman AJ & Cavalli-Sforza LL (1984) The Neolithic Transition and the Genetics of Populations in Europe (Princeton Univ. Press, Princeton).
22. Baker G (1985) Prehistoric Farming in Europe (Cambridge University Press, Cambridge).
23. Pinhasi R, Fort J, & Ammerman AJ (2005) Tracing the origin and spread of agriculture in Europe. PLoS Biol 3(12):e410.
24. Baker HG (1955) Self-compatibility and establishment after long-distance dispersal. Evolution 9:347-349.
25. Pannell JR & Barrett SCH (1998) Baker's law revisited: reproductive assurance in a metapopulation. Evolution 52:657-668.
26. Eckert CG, Samis KE, & Dart S (2006) Reproductive assurance and the evolution of uniparental reproduction in flowering plants. Ecology and Evolution of Flowers, eds Harder LD & Barrett SCH (Oxford University Press, Oxford), pp 183-203.
27. Wright SI, et al. (2006) Testing for effects of recombination rate on nucleotide diversity in natural populations of Arabidopsis lyrata. Genetics 174(3):1421-1430.
28. Ross-Ibarra J, et al. (2008) Patterns of Polymorphism and Demographic History in Natural Populations of Arabidopsis lyrata. PLoS One 3(6):e2411.
29. Nicholas KB, Nicholas HB Jr, & Deerfield DW II (1997) GeneDoc: Analysis and Visualization of Genetic Variation. EMBNEW.NEWS 4:14.
30. Watterson GA (1975) On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:188-193.
31. Tajima F (1993) Measurement of DNA polymorphism. Mechanisms of molecular evolution, eds Takahata N & Clark A (Japan Scientific Societies Press, Tokyo), pp 37-60.
32. Hudson RR (2001) Two-locus sampling distributions and their application. Genetics 159:1805-1817.
33. McVean G, Awadalla P, & Fearnhead P (2002) A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160(3):1231-1241.
34. Weir BS (1990) Genetic Data Analysis (Sinauer Assoc, Inc., Sunderland, MA).
35. Rozas J, Sanchez-DelBarrio JC, Messeguer X, & Rozas R (2003) DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19(18):2496-2497.
36. Stephens M, Smith NJ, & Donnelly P (2001) A new statistical method for haplotype reconstruction from population data. American journal of human genetics 68(4):978-989.
37. Stephens M & Donnelly P (2003) A comparison of bayesian methods for haplotype reconstruction from population genotype data. American journal of human genetics 73(5):1162-1169.
38. Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18(2):337-338.
39. Koch M, Haubold B, & Mitchell-Olds T (2001) Molecular systematics of the Brassicaceae: evidence from coding plastidic matK and nuclear Chs sequences. Am. J. Bot. 88:534-544.
40. Smith BJ (2007) boa: An R Package for MCMC Output Convergence Assessment and Posterior Inference. Journal of Statistical Software 21(11):1-37.

Table 1. Modes of parameter estimates under a range of MIMAR models, with 90% HPD intervals in parentheses.

Model  Description                                            Ne(Cg) [a]                Ne(Cr) [a]        Ne(A) [a]                 M(Cg→Cr) [b]          M(Cr→Cg) [c]          T [d]
1      Ancestral size constrained, no migration               488.2 [e] (412.2, 536.0)  3.0 (0.5, 11.5)   488.2 [e] (412.2, 536.0)  -                     -                     13.6 (1.5, 51.7)
2      Ancestral size constrained, symmetrical migration      463.6 [e] (352.4, 602.2)  AJ (2.0, 17.5)    463.6 [e] (352.4, 602.2)  1.9 [e] (0.007, 4.2)  1.9 [e] (0.007, 4.2)  151) (0.8, 3546.8)
3      Ancestral size constrained, asymmetrical migration     487.5 [e] (400.4, 540.9)  0.3 (0.06, 6.0)   487.5 [e] (400.4, 540.9)  O (0.009, 3.1)        50.9 (7.0, 336.9)     21.7 (3.5, 3579.5)
4      Ancestral size unconstrained, asymmetrical migration   525.0 (410.4, 652.1)      0.4 (0.06, 5.5)   283 (16.7, 508.5)         1.0 (0.01, 4.2)       73.0 (13.3, 399.8)    1395.0 [f] (13.3, 3156.1)

[a] Effective population size (effective number of individuals × 10⁻³) for C. rubella (Cr), C. grandiflora (Cg) and their ancestor (A)
[b] Migration rate (4Nem) from C. grandiflora to C. rubella
[c] Migration rate (4Nem) from C. rubella to C. grandiflora
[d] Time (ka) of the split of C. rubella and C. grandiflora
[e] Constrained
[f] Bimodal distribution, with second mode <30,000 years. See also goodness-of-fit tests in SI Text.

Figure 1. Floral organs and petals are reduced in C. rubella (left) compared to C. grandiflora (right).


Figure 2. Comparison of polymorphism patterns between C. grandiflora and C. rubella. Bars represent the median, boxes the interquartile range, and whiskers extend out to 1.5 times the interquartile range. (a) Synonymous π, where π is the average number of pairwise differences (16); (b) the population recombination estimator ρ per base pair, using the composite likelihood estimator of Hudson (32); (c) Tajima's D (16) in C. grandiflora and C. rubella.


Figure 3. Derived SNP frequencies in C. grandiflora and C. rubella calculated using A. thaliana as an outgroup. For this plot, we randomly subsampled the data to include 12 individuals from each species. Values where the derived allele frequency is greater than 0 and less than 12 represent polymorphisms, a frequency of 12 is a derived fixation, and a frequency of zero implies a complete absence of the derived SNP. For example, values with 0 or 12 for one species and not in the other represent unique polymorphisms in the other species, while derived frequencies greater than zero and less than 12 in both species represent shared polymorphisms.
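The classification described in the Figure 3 caption can be sketched as a small function that takes the derived-allele count in each subsample of 12. This is an illustrative reading of the caption rather than the Perl scripts used in the study; the function name and category labels are hypothetical.

```python
def classify_site(derived_1, derived_2, n=12):
    """Classify one site from derived-allele counts in two samples of size n."""
    poly_1 = 0 < derived_1 < n   # segregating in species 1
    poly_2 = 0 < derived_2 < n   # segregating in species 2
    if poly_1 and poly_2:
        return "shared polymorphism"
    if poly_1:
        return "unique to species 1"
    if poly_2:
        return "unique to species 2"
    if derived_1 != derived_2:
        return "fixed difference"   # derived allele fixed in one, absent in the other
    return "monomorphic"            # derived allele absent from, or fixed in, both

# Hypothetical counts for five sites (species 1, species 2).
labels = [classify_site(a, b) for a, b in [(5, 7), (3, 0), (12, 4), (12, 0), (0, 0)]]
```

Counting the sites in each category across loci gives data summaries of the kind used by MIMAR (unique, shared and fixed variants).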


Figure 4. Smoothed marginal posterior distributions of speciation parameters

estimated by MIMAR, for two models with posterior modes showing good fit to data

summaries, assuming either symmetric migration (solid lines) or no migration (dashed

lines) and equal effective population sizes in the ancestor as in present-day C. grandiflora. (A) Marginal densities for effective population size (individuals) in C.

rubella (grey) and C. grandiflora/the ancestral species. (B) Marginal densities for

divergence time (years). (C) Marginal density for migration (4Nm), where N is the

effective population size of C. grandiflora.

Supplementary Information

MIMAR Model Assessment

MIMAR simulates samples under a model of isolation with migration, with six parameters, where population 1 is C. grandiflora and population 2 is C. rubella: θ1, θ2, θA, M12 (migration rate from population 1 to population 2), M21 (migration rate from population 2 to population 1), and T (divergence time). We considered four nested models of demographic history: model 1 assumed no migration and constrained the ancestral effective population size to be equal to that of present-day C. grandiflora, which is population 1 (i.e. θA = θ1, M12 = M21 = 0); model 2 allowed for symmetric migration (θA = θ1, M12 = M21); model 3 allowed for asymmetric migration (θA = θ1); and model 4 allowed for asymmetric migration and free effective population sizes for each population, including the ancestor.

Goodness-of-fit Tests and Likelihood Ratios

We performed goodness-of-fit tests using MIMARgof (1) and using a modified version of ms, mspopr, which can simulate a lineage-specific change in population recombination rate (Stahl, E. unpublished). For each model, we generated

10,000 simulations of 25 loci using the same number of sites and individuals sequenced per locus as in the original data set. Another set of 10,000 simulations of

25 loci were run in mspopr, including a lineage-specific change in recombination in

C. rubella, but with parameter estimates for each model otherwise unchanged.

Specifically, we set the population recombination rate to zero in the C. rubella lineage

at the time of the split of C. grandiflora and C. rubella. Finally, we also performed

goodness-of-fit tests using the predictive posterior distributions of each model. Model

fit was assessed by calculating one-tailed P-values for observed summary statistics,

based on distributions of simulated statistics. In addition to the mean over all loci of

shared and unique variants used by MIMAR, we performed goodness-of-fit tests on

mean nucleotide diversity estimates, mean Fst and mean Tajima's D, to assess how

well the model could accommodate aspects of data not directly used by MIMAR. To

further assess the fit of the data to a model with no locus-specific positive selection,

we also assessed the fit of the data to the variance in π and Tajima's D (varπ1, varπ2, varTajD1, varTajD2).
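A one-tailed P value of this kind is simply the fraction of simulated values at least as extreme as the observed statistic. The sketch below takes the smaller of the two tails, which is an assumption about the convention (the text does not state which tail MIMARgof reports); the function name and example values are hypothetical.

```python
def one_tailed_p(observed, simulated):
    """Fraction of simulated statistics at least as extreme as the observed
    value, taking whichever tail is smaller (an assumed convention)."""
    n = len(simulated)
    upper = sum(1 for s in simulated if s >= observed) / n
    lower = sum(1 for s in simulated if s <= observed) / n
    return min(upper, lower)

# Hypothetical simulated distribution of a summary statistic.
simulated_values = list(range(10))
p_extreme = one_tailed_p(9, simulated_values)    # observed value in the upper tail
p_central = one_tailed_p(4.5, simulated_values)  # observed value at the centre
```

Under this convention a value near 0.5 indicates that the observed statistic lies close to the centre of the simulated distribution, and a small value indicates a poor fit.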

Goodness-of-fit tests were performed in two ways, 1) Simulating under the

Bayesian posterior distribution of parameters (i.e. 'predictive posterior'), and 2)

Simulating under the marginal modes of the posterior parameter distributions. The

second approach allowed us to assess the degree to which a single demographic

model can explain all aspects of the data, while the first approach allows a

comparison of the overall fit of each inferred posterior distribution. As shown in

Table 2, predictive posterior simulations suggest that all models fit the data equally

well under this criterion, suggesting that Model 1 is sufficient to explain our data,

without the need to invoke migration and changes in effective population size

between population 1 and the ancestral population. In other words, we cannot reject the null hypothesis of no migration and equal effective population sizes between population 1 and the ancestral population. A likelihood ratio test based on maximum likelihood estimates across the MIMAR runs leads to a similar conclusion, i.e. the more parameter-rich models do not provide a significant improvement to the likelihood over model 1 (SI Table 3). Although the likelihoods in this analysis should be considered as approximate, the combination of results from the predictive posterior and the likelihood analysis is consistent with the hypothesis that model 1 adequately explains the data.

As shown in Table 4, simulations under the marginal modes provide further evidence in favor of the simpler models over the parameter-rich models. In particular, best-fit parameters under models 1 and 2 are consistent with all tested summary statistics, whereas model 3 fails one goodness-of-fit test, and model 4 fails five tests.

Very similar conclusions are obtained when we allow for a lineage-specific change in recombination (Table 5). One possible explanation for the poor fit of model 4 is that the combination of modes from the marginal posterior distributions is distinct from the best-fitting parameter set. To explore this possibility, we also conducted goodness-of-fit tests under the maximum likelihood parameter estimates for each model. Although these simulations improve the fit considerably, the unconstrained model still shows a poorer fit to the data than the simpler models for parameters not used in estimation (Tables 4 and 5). In particular, the observed values for nucleotide diversity, Tajima's D, and the variance of Tajima's D in C. grandiflora have low probabilities under this model. This inconsistency likely arises because the high gene flow and population growth inferred in C. grandiflora under this model are at odds with the observed patterns of polymorphism, which are consistent with a stable equilibrium in this species.

Exploring the Effects of Positive Selection on Diversity

If recurrent positive selection reduced diversity in C. rubella across the genome, we would predict that these positive selection events would erode any ancestral variation such that the majority of segregating variation would be unique to

C. rubella. In contrast with this expectation, we found that 84% of variation in C. rubella is shared with C. grandiflora. To illustrate this expectation, we simulated

10,000 39-gene datasets under a) the inferred bottleneck model assuming no migration; b) the selective sweep model of Thornton and Jensen (2007), allowing for only a slight bottleneck of a twofold reduction in effective population size (2); and c) the selective sweep model of Innan and Kim (2008) under a slight bottleneck (3), with positive selection acting on standing variation. All simulations assumed a divergence time of 14,000 years, and an ancestral θ equal to θ in C. grandiflora of 0.03 per base pair.

Under the bottleneck model, 60% of 39-gene datasets show 84% or more shared polymorphisms in C. rubella, consistent with our observed data (SI Figure 8).

In contrast, when simulating under a model of selective sweeps, allowing for only a slight bottleneck (reduction of Ne by half in C. rubella) with a selective sweep during the bottleneck (4Ns=6500), no 39-gene datasets in 10,000 were found to have this high a fraction of shared polymorphisms; the maximum proportion of shared polymorphisms was 0.46 (SI Figure 8). Similarly, the model of selection from standing variation generated no datasets with the observed fraction of shared polymorphism, and the maximum proportion observed was 0.79. Although these simulations are not exhaustive, they illustrate that under reasonable parameter values, we would not expect genome-wide positive selection to lead to the observed maintenance of shared polymorphism.

Generating Data Summaries and the Use of Ancestral State Inference

Because MIMAR relies on the inference of derived SNPs, we made every effort to minimize errors associated with ancestral misinference. To determine derived states we used PAML (4) to perform likelihood reconstruction of ancestral states for the common ancestor of Capsella under various substitution models. These likelihood reconstructions were based upon sequence data from A. thaliana, A. lyrata, B. stricta, C. grandiflora and C. rubella. Given that the phylogenetic position of Capsella in relation to Boechera and Arabidopsis shows some uncertainty (5, 6), we assumed a star-shaped phylogeny for the three genera. For the purposes of these reconstructions we also assumed that within-Capsella genealogies are star shaped. Data reported here are from likelihood reconstructions under the Kimura 2-parameter (K80) model, which distinguishes between transitions and transversions. The reconstructed ancestor of Capsella was then used to infer ancestral states for the Capsella polymorphism data. Using the method of Baudry and Depaulis (2003) we calculated the rate of ancestral misinference in our dataset as 0.084/base, using transition/transversion ratios estimated directly at synonymous sites (7). To explore any effect that this residual error may have on demographic inference, we generated a modified version of the MIMAR code to use data summaries that do not rely on outgroup inference. In particular, we modified Ss to include only shared polymorphisms, while S1 and S2 represent polymorphisms unique to population 1 and 2 respectively, regardless of ancestral vs. derived states. As shown in Table 7, parameter estimates using this approach are very much in line with estimates reported in Table 1 of the text.

References

1. Becquet C & Przeworski M (2007) A new approach to estimate parameters of speciation models with application to apes. Genome Res 17(10):1505-1519.
2. Thornton KR & Jensen JD (2007) Controlling the false-positive rate in multilocus genome scans for selection. Genetics 175(2):737-750.
3. Innan H & Kim Y (2008) Detecting local adaptation using the joint sampling of polymorphism data in the parental and derived populations. Genetics 179(3):1713-1720.
4. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555-556.
5. Koch M, Haubold B, & Mitchell-Olds T (2001) Molecular systematics of the Brassicaceae: evidence from coding plastidic matK and nuclear Chs sequences. Am. J. Bot. 88:534-544.
6. Al-Shehbaz IA, Beilstein MA, & Kellog EA (2006) Systematics and phylogeny of the Brassicaceae (Cruciferae): an overview. Plant Systematics and Evolution 259:89-120.
7. Baudry E & Depaulis F (2003) Effect of misoriented sites on neutrality tests with outgroup. Genetics 165(3):1619-1622.

SI Table 2. Predictive posterior probabilities from simulations of the posterior distributions

Model     S1 [a]  S2 [b]  Ss [c]  Sf [d]  Fst [e]  π1 [f]  π2 [g]  TajD1 [h]  TajD2 [i]
          pval    pval    pval    pval    pval     pval    pval    pval       pval
Model 1   0.42    0.25    0.44    0.35    0.47     0.42    0.28    0.34       0.27
Model 2   0.37    0.27    0.42    0.50    0.40     0.37    0.06    0.31       0.47
Model 3   0.44    0.39    0.39    0.38    0.46     0.42    0.25    0.33       0.48
Model 4   0.39    0.34    0.43    0.43    0.45     0.24    0.30    0.16       0.48

Values shown are 1-tailed probabilities of the observed data from simulations under the posterior parameter distributions. [a] Mean number of unique polymorphisms in C. grandiflora. [b] Mean number of unique polymorphisms in C. rubella. [c] Mean number of shared polymorphisms. [d] Mean number of fixed differences. [e] Mean population differentiation. [f] Average pairwise differences in C. grandiflora. [g] Average pairwise differences in C. rubella. [h] Average Tajima's D in C. grandiflora. [i] Average Tajima's D in C. rubella.

SI Table 3. Likelihood ratio test comparing model 1 to other models

Model  Description                                         No. of parameters                Ln likelihood  2(L2-L1)*
1      Ancestral theta constrained, no migration           3 (th1, th2, t)                  -117.68
2      Ancestral theta constrained, symmetric migration    4 (th1, th2, M, t)               -117.66        0.04 ns
3      Ancestral theta constrained, asymmetric migration   5 (th1, th2, m12, m21, t)        -116.29        2.78 ns
4      Asymmetric migration, ancestral theta distinct      6 (th1, th2, thA, m12, m21, t)   -114.08        7.3 ns

*Last column shows twice the difference in ln likelihoods between the model and model 1, and significance is given by using the Chi squared approximation (ns = not significant), with the number of degrees of freedom equal to the difference in the number of free parameters.
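The likelihood ratio comparison in SI Table 3 can be checked with a short calculation. The sketch below uses the rounded log-likelihoods and parameter counts printed in the table and a closed-form chi-squared upper tail for 1 to 3 degrees of freedom; the function name is hypothetical, and because the inputs are rounded the statistic for model 4 comes out as 7.20 rather than the 7.3 given in the table.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable for df = 1, 2 or 3."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("closed form implemented only for df = 1, 2, 3")

# (ln likelihood, number of free parameters) as printed in SI Table 3.
log_lik = {1: (-117.68, 3), 2: (-117.66, 4), 3: (-116.29, 5), 4: (-114.08, 6)}

base_ll, base_k = log_lik[1]
results = {}
for model in (2, 3, 4):
    ll, k = log_lik[model]
    stat = 2 * (ll - base_ll)   # twice the difference in ln likelihoods
    df = k - base_k             # difference in number of free parameters
    results[model] = (stat, df, chi2_sf(stat, df))
```

With these inputs none of the three comparisons reaches the 0.05 level, in agreement with the "ns" entries in the table.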

SI Table 4. Goodness-of-fit tests based on simulations under the marginal modes (a) and under maximum likelihood parameter estimates (b).

Model  S1 [c]  S2 [d]  Ss [e]  π1 [f]  π2 [g]  TD1 [h]  TD2 [i]  Fst [j]  Sf [k]  var π1 [l]  var π2 [m]  var TD1 [n]  var TD2 [o]
1a     0.411   0.301   0.372   0.332   0.231   0.319    0.220    0.385    0.456   0.497       0.226       0.278        0.386
1b     0.455   0.492   0.404   0.302   0.190   0.314    0.215    0.332    0.470   0.480       0.206       0.286        0.397
2a     0.196   0.118   0.237   0.202   0.286   0.324    0.150    0.099    0.257   0.413       0.459       0.290        0.385
2b     0.207   0.405   0.486   0.472   0.004*  0.460    0.459    0.004*   0.323   0.428       0.018*      0.312        0.276
3a     0.326   0.125   0.476   0.326   0.017*  0.325    0.468    0.043*   0.420   0.491       0.050       0.294        0.421
3b     0.408   0.201   0.481   0.300   0.048   0.328    0.466    0.113    0.417   0.483       0.092       0.283        0.469
4a     0.442   0.213   0.009*  0.001*  0.008*  0.031*   0.470    0.100    0.468   0.018*      0.026       0.036*       0.421
4b     0.215   0.343   0.257   0.072*  0.202   0.055*   0.472    0.392    0.315   0.140       0.142       0.045*       0.483

Values shown are 1-tailed P values of the observed means and variances of summary statistics, obtained using coalescent simulations under the various parameter combinations. Significant and marginally significant departures are shown with an asterisk. [c] Mean number of polymorphisms unique to C. grandiflora. [d] Mean number of polymorphisms unique to C. rubella. [e] Mean number of shared polymorphisms. [f] Average pairwise differences in C. grandiflora. [g] Average pairwise differences in C. rubella. [h] Average Tajima's D in C. grandiflora. [i] Average Tajima's D in C. rubella. [j] Mean differentiation. [k] Mean number of fixed differences. [l] Variance in pairwise differences in C. grandiflora. [m] Variance in pairwise differences in C. rubella. [n] Variance in Tajima's D in C. grandiflora. [o] Variance in Tajima's D in C. rubella.

SI Table 5. Goodness-of-fit test P values based on simulations under the marginal modes (a) and under maximum likelihood parameter estimates (b), allowing for a lineage-specific change in recombination in population 2.

Model  S1 [c]  S2 [d]  Ss [e]  π1 [f]  π2 [g]  TD1 [h]  TD2 [i]  Fst [j]  Sf [k]  var π1 [l]  var π2 [m]  var TD1 [n]  var TD2 [o]
1a     0.415   0.309   0.372   0.337   0.238   0.333    0.219    0.384    0.439   0.498       0.236       0.278        0.397
1b     0.450   0.492   0.406   0.305   0.187   0.332    0.221    0.327    0.478   0.482       0.202       0.284        0.404
2a     0.193   0.115   0.232   0.200   0.287   0.321    0.151    0.098    0.256   0.411       0.448       0.290        0.380
2b     0.206   0.407   0.497   0.475   0.004*  0.460    0.453    0.003*   0.314   0.428       0.018*      0.318        0.262
3a     0.319   0.128   0.475   0.331   0.016*  0.329    0.468    0.042*   0.423   0.495       0.046*      0.285        0.424
3b     0.399   0.204   0.470   0.316   0.054*  0.328    0.473    0.109    0.405   0.475       0.100       0.282        0.475
4a     0.437   0.210   0.009   0.001*  0.007*  0.031    0.469    0.099    0.464   0.017*      0.024*      0.037*       0.414
4b     0.221   0.338   0.250   0.063*  0.201   0.054*   0.475    0.380    0.310   0.131       0.144       0.044*       0.490

Values shown are 1-tailed P values of the observed means and variances of summary statistics, obtained using coalescent simulations under the various parameter combinations. Significant and marginally significant departures are shown with an asterisk. [c] Mean number of polymorphisms unique to C. grandiflora. [d] Mean number of polymorphisms unique to C. rubella. [e] Mean number of shared polymorphisms. [f] Average pairwise differences in C. grandiflora. [g] Average pairwise differences in C. rubella. [h] Average Tajima's D in C. grandiflora. [i] Average Tajima's D in C. rubella. [j] Mean differentiation. [k] Mean number of fixed differences. [l] Variance in pairwise differences in C. grandiflora. [m] Variance in pairwise differences in C. rubella. [n] Variance in Tajima's D in C. grandiflora. [o] Variance in Tajima's D in C. rubella.

SI Table 6. Sequence-based summary statistics as estimated for each locus in both C. grandiflora and C. rubella.

Species | Locus | Sample size | Synonymous sites | Theta synonymous | S synonymous | π synonymous | Tajima's D synonymous | Replacement sites | Theta replacement | S replacement | π replacement | Tajima's D replacement
C. grandiflora | At1g01040 | 34 | 120.1095238 | 0.012217381 | 6 | 0.012481181 | 0.059785052 | 419.8904762 | 0.002912315 | 5 | 0.001095269 | -1.649659063
C. grandiflora | At1g03560 | 28 | 128.1551724 | 0.006015516 | 3 | 0.00487174 | -0.456422119 | 459.8448276 | 0.007264736 | 13 | 0.004245739 | -1.392653386
C. grandiflora | At1g04650 | 32 | 149.4090909 | 0.024929075 | 15 | 0.020308498 | -0.616260234 | 474.5909091 | 0.005755263 | 11 | 0.006346722 | 0.326203043
C. grandiflora | At1g06520 | 28 | 115.3045977 | 0.02451511 | 11 | 0.022484729 | -0.270635841 | 400.6954023 | 0.005771862 | 9 | 0.004819663 | -0.520883643
C. grandiflora | At1g06530 | 38 | 111.4188034 | 0.034178123 | 16 | 0.036921937 | 0.260600475 | 440.5811966 | 0.008103115 | 15 | 0.006715557 | -0.550965203
C. grandiflora | At1g10900 | 26 | 139.7407407 | 0.003750622 | 2 | 0.00457991 | 0.473577154 | 448.2592593 | 0 | 0 | 0 | 0
C. grandiflora | At1g11050 | 32 | 118.4646465 | 0.044017205 | 21 | 0.052792394 | 0.689702738 | 352.5353535 | 0.004226107 | 6 | 0.0035629 | -0.440821078
C. grandiflora | At1g15240 | 30 | 111.9516129 | 0.022547226 | 10 | 0.011417084 | -1.563384883 | 410.0483871 | 0.015389637 | 25 | 0.009127042 | -1.448709834
C. grandiflora | At1g31930 | 8 | 108.0925926 | 0.024976036 | 7 | 0.023458724 | -0.289096242 | 317.9074074 | 0.001213167 | 1 | 0.000786392 | -1.054819107
C. grandiflora | At1g59720 | 30 | 114.3494624 | 0.077260479 | 35 | 0.093502442 | 0.76792837 | 419.6505376 | 0.015639002 | 26 | 0.016340909 | 0.160323884
C. grandiflora | At1g62390 | 36 | 124.472973 | 0.025185881 | 13 | 0.021551182 | -0.459368216 | 466.527027 | 0.003101439 | 6 | 0.002534772 | -0.499232495
C. grandiflora | At1g62520 | 28 | 129.8045977 | 0.075228311 | 38 | 0.079260365 | 0.199241023 | 416.1954023 | 0.00617434 | 10 | 0.00359772 | -1.342343614
C. grandiflora | At1g65450 | 22 | 117.6811594 | 0.058276411 | 25 | 0.067207792 | 0.580648839 | 389.3188406 | 0.004932333 | 7 | 0.003891801 | -0.6750978
C. grandiflora | At1g68530 | 36 | 135.7792793 | 0.030192862 | 17 | 0.033469352 | 0.358580449 | 443.2207207 | 0.001632263 | 3 | 0.000605238 | -1.409039404
C. grandiflora | At1g72390 | 40 | 115.2804878 | 0.014275502 | 7 | 0.006561477 | -1.496242606 | 409.7195122 | 0.004016618 | 7 | 0.00269728 | -0.909511059
C. grandiflora | At1g74600 | 30 | 121.1236559 | 0.025007815 | 12 | 0.018258153 | -0.880694329 | 406.8763441 | 0.003722308 | 6 | 0.004418299 | 0.533564216
C. grandiflora | At1g78850 | 36 | 103.6621622 | 0.0604842 | 26 | 0.08571801 | 1.442991383 | 340.3378378 | 0.009919885 | 14 | 0.007606821 | -0.750466787
C. grandiflora | At2g23170 | 36 | 123.3648649 | 0.050824219 | 26 | 0.050784957 | -0.002671947 | 407.6351351 | 0.005915851 | 10 | 0.003247535 | -1.374013943
C. grandiflora | At2g26730 | 38 | 132.1538462 | 0.052228182 | 29 | 0.050471371 | -0.116404366 | 425.8461538 | 0.0016767 | 3 | 0.001299396 | -0.496927108
C. grandiflora | At2g28050 | 32 | 115.4848485 | 0.025801691 | 12 | 0.025226742 | -0.071730967 | 394.5151515 | 0.00566462 | 9 | 0.001998165 | -1.98260169
C. grandiflora | At2g44900 | 36 | 123.1711712 | 0.007831407 | 4 | 0.005940887 | -0.59129855 | 386.8288288 | 0.000623405 | 1 | 0.000143618 | -1.133212888
C. grandiflora | At2g47430 | 38 | 125.5726496 | 0.005686079 | 3 | 0.005290132 | -0.153773609 | 402.4273504 | 0.000591424 | 1 | 0.00067867 | 0.213763113
C. grandiflora | At3g10340 | 36 | 135.3423423 | 0.040981043 | 23 | 0.035535988 | -0.454237424 | 428.6576577 | 0.004500578 | 8 | 0.003036426 | -0.948751959
C. grandiflora | At3g23590 | 26 | 106.5061728 | 0.056591271 | 23 | 0.042207738 | -0.923383112 | 313.4938272 | 0.005015551 | 6 | 0.002129842 | -1.702992442
C. grandiflora | At3g26650 | 38 | 93.73504274 | 0.012695644 | 5 | 0.024174554 | 2.328116714 | 287.2649573 | 0 | 0 | 0 | 0
C. grandiflora | At3g44530 | 38 | 114.4059829 | 0.018723218 | 9 | 0.018065981 | -0.103660344 | 356.5940171 | 0.005339525 | 8 | 0.003219172 | -1.144812013
C. grandiflora | At3g60750 | 22 | 145.6086957 | 0.003767926 | 2 | 0.002318971 | -0.871247959 | 448.3913043 | 0.00061179 | 1 | 0.000202745 | -1.162402043
C. grandiflora | At3g62890 | 38 | 103.8504274 | 0.022918091 | 10 | 0.010738718 | -1.601213668 | 352.1495726 | 0.007434508 | 11 | 0.003728372 | -1.527871009
C. grandiflora | At4g08840 | 28 | 35.71264368 | 0.021586739 | 3 | 0.009555995 | -1.337838808 | 123.2873563 | 0.00625303 | 3 | 0.004699307 | -0.596460285
C. grandiflora | At4g14190 | 30 | 91.87096774 | 0.049455853 | 18 | 0.045741315 | -0.258696163 | 301.1290323 | 0.015926651 | 19 | 0.011481694 | -0.967102366
C. grandiflora | At4g14370 | 22 | 127.0652174 | 0.034542441 | 16 | 0.036930962 | 0.25117467 | 451.9347826 | 0.013960848 | 23 | 0.00758643 | -1.718251797
C. grandiflora | At4g38160 | 34 | 132 | 0.012969655 | 7 | 0.007778318 | -1.148619662 | 429 | 0.001710284 | 3 | 0.001337937 | -0.494978031
C. grandiflora | At5g04190 | 32 | 127.0707071 | 0.019540986 | 10 | 0.018103332 | -0.229741211 | 412.9292929 | 0.010824024 | 18 | 0.008012189 | -0.883634204
C. grandiflora | At5g20280 | 20 | 112.9603175 | 0.044915357 | 18 | 0.040815451 | -0.343415808 | 373.0396825 | 0 | 0 | 0 | 0
C. grandiflora | At5g41920 | 30 | 125.0967742 | 0.048427116 | 24 | 0.050627471 | 0.161166801 | 390.9032258 | 0.005165879 | 8 | 0.005786775 | 0.365120611
C. grandiflora | At5g43670 | 32 | 112.5656566 | 0.02205901 | 10 | 0.028746664 | 0.946713864 | 322.4343434 | 0.001540212 | 2 | 0.000737835 | -1.046838824
C. grandiflora | At5g51670 | 36 | 129.6936937 | 0.044625309 | 24 | 0.05319004 | 0.658884766 | 407.3063063 | 0.004144439 | 7 | 0.001340592 | -1.916647035
C. grandiflora | At5g53020 | 34 | 64.30952381 | 0.049439308 | 13 | 0.080853398 | 2.045386197 | 241.6904762 | 0.020238335 | 20 | 0.033977844 | 2.310606856
C. grandiflora | At5g66280 | 30 | 135.1505376 | 0.013073857 | 7 | 0.010971163 | -0.47519625 | 395.8494624 | 0.001912999 | 3 | 0.000505243 | -1.731782748
C. rubella | At1g01040 | 13 | 120.047619 | 0 | 0 | 0 | 0 | 419.952381 | 0 | 0 | 0 | 0
C. rubella | At1g03560 | 12 | 128.1538462 | 0.01033568 | 4 | 0.012768744 | 0.82792558 | 459.8461538 | 0.001440218 | 2 | 0.001779251 | 0.687881658
C. rubella | At1g04650 | 14 | 149.5 | 0 | 0 | 0 | 0 | 477.5 | 0 | 0 | 0 | 0
C. rubella | At1g06520 | 14 | 114.7444444 | 0.002740457 | 1 | 0.001245003 | -1.155241342 | 401.2555556 | 0 | 0 | 0 | 0
C. rubella | At1g06530 | 14 | 112.5777778 | 0 | 0 | 0 | 0 | 442.4222222 | 0 | 0 | 0 | 0
C. rubella | At1g10900 | 12 | 139.6410256 | 0 | 0 | 0 | 0 | 451.3589744 | 0 | 0 | 0 | 0
C. rubella | At1g11050 | 13 | 120.5952381 | 0.026721361 | 10 | 0.038271699 | 1.726825312 | 359.4047619 | 0.001793226 | 2 | 0.001712233 | -0.126877208
C. rubella | At1g15240 | 14 | 113.3777778 | 0 | 0 | 0 | 0 | 408.6222222 | 0 | 0 | 0 | 0
C. rubella | At1g31930 | 14 | 108.2111111 | 0.002905914 | 1 | 0.002437238 | -0.341438343 | 317.7888889 | 0.0009895 | 1 | 0.00082991 | -0.341438343
C. rubella | At1g59720 | 14 | 114.0777778 | 0.002756472 | 1 | 0.00433481 | 1.212185563 | 419.9222222 | 0 | 0 | 0 | 0
C. rubella | At1g62390 | 13 | 124.3928571 | 0.007771674 | 3 | 0.003710329 | -1.652312061 | 466.6071429 | 0.002071851 | 3 | 0.000989137 | -1.652312061
C. rubella | At1g62520 | 14 | 129.7333333 | 0.048476698 | 20 | 0.058446179 | 0.859675381 | 416.2666667 | 0.001510821 | 2 | 0.002455104 | 1.695975145
C. rubella | At1g65450 | 12 | 118.3846154 | 0 | 0 | 0 | 0 | 388.6153846 | 0 | 0 | 0 | 0
C. rubella | At1g68530 | 14 | 135.7888889 | 0 | 0 | 0 | 0 | 443.2111111 | 0 | 0 | 0 | 0
C. rubella | At1g72390 | 12 | 115.6794872 | 0.005725117 | 2 | 0.004060331 | -0.849714979 | 409.3205128 | 0.002426993 | 3 | 0.001998878 | -0.578636747
C. rubella | At1g74600 | 13 | 121.6309524 | 0.013246912 | 5 | 0.022135054 | 2.392106459 | 406.3690476 | 0.002378972 | 3 | 0.003975166 | 2.121451886
C. rubella | At1g78850 | 14 | 105.5444444 | 0.002979334 | 1 | 0.001353526 | -1.155241342 | 344.4555556 | 0 | 0 | 0 | 0
C. rubella | At2g23170 | 14 | 122.6333333 | 0 | 0 | 0 | 0 | 408.3666667 | 0 | 0 | 0 | 0
C. rubella | At2g26730 | 13 | 132.202381 | 0.004875054 | 2 | 0.002327434 | -1.468005781 | 425.797619 | 0 | 0 | 0 | 0
C. rubella | At2g28050 | 14 | 115.4222222 | 0.005448729 | 2 | 0.002475384 | -1.48074498 | 394.5777778 | 0.001593867 | 2 | 0.001838103 | 0.415804348
C. rubella | At2g44900 | 12 | 123.6923077 | 0 | 0 | 0 | 0 | 386.3076923 | 0 | 0 | 0 | 0
C. rubella | At2g47430 | 14 | 126.4222222 | 0.002487317 | 1 | 0.002086154 | -0.341438343 | 404.5777778 | 0 | 0 | 0 | 0
C. rubella | At3g10340 | 13 | 135.6190476 | 0 | 0 | 0 | 0 | 428.3809524 | 0 | 0 | 0 | 0
C. rubella | At3g23590 | 11 | 106.5833333 | 0.01921973 | 6 | 0.024905821 | 1.171200707 | 313.4166667 | 0 | 0 | 0 | 0
C. rubella | At3g26650 | 14 | 94.02222222 | 0 | 0 | 0 | 0 | 286.9777778 | 0.001095737 | 1 | 0.001531688 | 0.842275109
C. rubella | At3g44530 | 14 | 114.7888889 | 0.008218187 | 3 | 0.011487883 | 1.217964186 | 356.2111111 | 0.003531077 | 4 | 0.004103012 | 0.533557751
C. rubella | At3g60750 | 5 | 145.4444444 | 0 | 0 | 0 | 0 | 448.5555556 | 0 | 0 | 0 | 0
C. rubella | At3g62890 | 11 | 104.375 | 0 | 0 | 0 | 0 | 351.625 | 0.00097097 | 1 | 0.00051708 | -1.128501595
C. rubella | At4g08840 | 7 | 35.35416667 | 0.03463495 | 3 | 0.029632124 | -0.65405158 | 123.6458333 | 0.006602135 | 2 | 0.006161993 | -0.27492444
C. rubella | At4g14190 | 13 | 92.73809524 | 0.006949612 | 2 | 0.006635726 | -0.126877208 | 300.2619048 | 0 | 0 | 0 | 0
C. rubella | At4g14370 | 14 | 127.1222222 | 0.012368104 | 5 | 0.005618889 | -1.889327875 | 451.8777778 | 0.00626291 | 9 | 0.003623464 | -1.617340049
C. rubella | At4g38160 | 14 | 131.4444444 | 0.007176846 | 3 | 0.003260476 | -1.670526133 | 429.5555556 | 0.000732041 | 1 | 0.00033257 | -1.155241342
C. rubella | At5g04190 | 14 | 127.2333333 | 0 | 0 | 0 | 0 | 412.7666667 | 0.000761816 | 1 | 0.000346097 | -1.155241342
C. rubella | At5g20280 | 12 | 112.6025641 | 0 | 0 | 0 | 0 | 373.3974359 | 0 | 0 | 0 | 0
C. rubella | At5g41920 | 13 | 124.6309524 | 0 | 0 | 0 | 0 | 391.3690476 | 0.000823384 | 1 | 0.000393097 | -1.149147105
C. rubella | At5g43670 | 12 | 113 | 0.011721744 | 4 | 0.014481094 | 0.82792558 | 322 | 0.001028383 | 1 | 0.001270469 | 0.540554689
C. rubella | At5g51670 | 14 | 129.5555556 | 0.007281483 | 3 | 0.007888338 | 0.25513408 | 407.4444444 | 0.000771767 | 1 | 0.001078823 | 0.842275109
C. rubella | At5g53020 | 14 | 64.33333333 | 0 | 0 | 0 | 0 | 241.6666667 | 0 | 0 | 0 | 0
C. rubella | At5g66280 | 13 | 136.1785714 | 0 | 0 | 0 | 0 | 397.8214286 | 0 | 0 | 0 | 0

Reprinted with permission from the Proceedings of the National Academy of Sciences of the United States of America, 106:5241-5245. Copyright 2009

SI Table 7. Modes of parameter estimates under a range of MIMAR models using

summaries of the data that do not rely on outgroup inference, with 90% HPD intervals in parentheses.

Model | Ne(Cg) a | Ne(Cr) a | Ne(A) a | M(Cg to Cr) b | M(Cr to Cg) c | T d
1. Ancestral size constrained, no migration | 503.8 e (442.6, 576.0) | [2A] (0.3, 11.1) | 503.8 e (442.6, 576.0) | - | - | [93] (1.2, 37.8)
2. Ancestral size constrained, symmetrical migration | 497.1 e (378.3, 824) | [\52] (5.6, 27.6) | 497.1 e (378.3, 824) | [JJ] e (1.1, 10.7) | [3j] e (1.1, 10.7) | [71] (0.6, 3673.0)
3. Ancestral size constrained, asymmetrical migration | 493.8 e (421.7, 594.0) | 0.4 (0.1, 8.8) | 493.8 e (421.7, 594.0) | 1.9 (0.011, 5.5) | 51.7 (7.7, 376.1) | [171] (2.5, 3608.8)
4. Ancestral size unconstrained, asymmetrical migration | 532.5 (417.6, 893.6) | 0.6 (0.138, 8.6) | 72.1 (16.7, 584.8) | 1.9 (0.01, 8.6) | 81.0 (10.5, 400.8) | 1362.5 (1.94, 3273.6)

Values in square brackets are reproduced as they appear in the source, where the digits or the decimal point are not legible.

a Effective population size (effective number of individuals scaled by a power of 10; the exponent is not legible in the source) for C. rubella (Cr), C. grandiflora (Cg) and their ancestor (A)

b Migration rate (4Nem) from C. grandiflora to C. rubella

c Migration rate (4Nem) from C. rubella to C. grandiflora

d Time (ka) of the split of C. rubella and C. grandiflora

e Constrained

[Figure; legend: observed, expected]

SI Figure 5. Observed and expected (under neutrality, as calculated using Equation 49 of Tajima 1989) minor allele frequency distribution of synonymous SNPs in a) C. grandiflora and b) C. rubella, using A. thaliana as an outgroup.

[Figure; x-axis: Distance (bp), with bins including 201-300, 301-400 and between-locus; series: C. rubella, C. grandiflora]

SI Figure 6. Average levels of linkage disequilibrium, as measured by the squared correlation coefficient r², in C. rubella and C. grandiflora in 100 bp windows.

[Figure; panel labels: Capsella grandiflora, Capsella rubella, Capsella rubella]

SI Figure 7. STRUCTURE output for a) C. grandiflora and C. rubella combined (k=2) and b) C. rubella alone (k=3). Each bar represents an individual, and the color denotes the proportion of that individual's genome assigned to a cluster (k) based on haplotype information.

[Figure; legend: neutral bottleneck, selective sweep, selection on standing variation; x-axis bins from 0-0.1 to 0.9-1]

SI Figure 8. Distribution of the fraction of shared polymorphism from simulated 39-gene datasets. The arrow shows the observed fraction of shared polymorphisms from the data. The selective sweep model was run using the deterministic model of Thornton and Jensen (2007), with the sweep occurring 10,500 years ago and a selection coefficient (s) of 0.003. The model of standing variation assumed a selection coefficient of 0.003 and an initial allele frequency of 0.01 for the selected allele.

CHAPTER 3

No evidence for ongoing and widespread hybridization between sympatric populations of C. grandiflora and C. rubella

Abstract

The speciation process involves the establishment and maintenance of reproductive isolating barriers that restrict hybridization and gene flow. Despite the establishment of these barriers, hybridization in the wild is relatively common. While in some cases a specific reproductive barrier has been identified as the primary isolating barrier, it is often the combination of several barriers that results in reproductive isolation. One such barrier is mating system; however, the efficacy of differences in mating system as a reproductive barrier remains unclear. Here, I search for evidence of hybridization in the genus Capsella. Capsella rubella is characterized by a high rate of self-fertilization and shows the typical morphological characteristics of a selfing syndrome. In comparison with its self-incompatible sister species C. grandiflora, it has undergone a derived breakdown of the self-incompatibility mechanism, and its floral organ sizes are highly reduced. In this study, I ask whether a shift in mating system has led to the establishment of effective reproductive isolating barriers between these two sister species. By investigating the population genetics of potentially hybridizing populations of C. grandiflora and C. rubella, I find no evidence for ongoing and widespread hybridization between sympatric populations of the two species. These results suggest that mating system may be acting as an effective reproductive isolating barrier between these two species.

Introduction

The speciation process involves the establishment and maintenance of reproductive isolating barriers that restrict hybridization and gene flow (Dobzhansky 1937; Mayr 1942; Coyne and Orr 2004). These isolating barriers may be classified according to their timing of action. Prezygotic isolating barriers impede gene flow before the transfer of pollen or sperm to members of other species. Postzygotic barriers, such as hybrid inviability or sterility, act following fertilization (Coyne and Orr 2004; Lowry et al. 2008). While few studies have attempted to quantify comprehensively the action of the various isolating barriers in a speciation event, it has been established that it is often the cumulative effect of many reproductive barriers that results in reproductive isolation (Rieseberg and Willis 2007).

Complete estimates of reproductive isolation have been conducted in a number of different systems. Habitat isolation, followed by pollinator isolation, has been identified as the primary cause of speciation between Mimulus lewisii and M. cardinalis (Ramsey et al. 2003), between Costus pulverulentus and C. scaber (Kay 2006), and between diploid and tetraploid Chamerion angustifolium (Husband and Sahara 2004). In Costus pulverulentus and C. scaber in particular, ecogeographic and pollinator isolation are thought to contribute most to total isolation (Kay 2006). Although hybrids are sometimes formed, these hybrids exhibit severely reduced viability.

One additional potential premating barrier is mating system. The transition from outcrossing to selfing represents one of the most common major evolutionary transitions in the plant kingdom (Stebbins 1970; Barrett 2002). Phenotypically, this transition to

selfing is almost universally associated with the 'selfing syndrome', characterized by a

severe reduction in flower size and a breakdown of the morphological and genetic

mechanisms preventing self-fertilization and promoting outcrossing. By reducing the

probability of pollen transfer between diverging populations, such changes may directly

act as prezygotic reproductive barriers, thus aiding the speciation process (Fishman et al.

2002). However, the efficacy of differences in mating system in functioning as a

reproductive barrier remains unclear. Furthermore, the extent to which gene flow occurs

between outcrossing and selfing species pairs is not yet known.

Despite the presence of strong intrinsic postzygotic barriers, mating system

isolation has been identified as the primary reproductive barrier between the outcrossing

M. guttatus and its selfing sister species M. nasutus (Martin and Willis 2007). It has been

shown that the difference in mating system (in combination with karyotype) has led to the maintenance of reproductive barriers between outcrossing and selfing Stephanomeria

(Gottlieb 1973).

Despite the establishment and maintenance of these reproductive barriers, the processes of hybridization and introgression are thought to be common in many groups of plants and animals (Arnold 1997), yet the majority of hybridizing species remain morphologically distinct because of selection against hybrids (Coyne and Orr 2004).

Theoretically, neutral or advantageous alleles may move across species boundaries unless they are tightly linked to loci contributing to reproductive isolation, yet little is known about how often this occurs (Barton 1979; Harrison 1990). Conclusive detection of hybridization and introgression can be difficult. Cryptic introgression can be inferred if phylogenies based on different loci are not concordant; however, such discordance may instead reflect the presence of ancestral polymorphism (Yatabe et al. 2007).

With recent advances in coalescent modeling, it is now possible to distinguish between hybridization and the retention of ancestral polymorphism (Soltis et al. 2003; Noor and Feder 2006; Becquet and Przeworski 2007; Hey and Nielsen 2007; Hey 2010). Formalized speciation models now facilitate the differentiation of ancestral polymorphism from introgression and allow statistically based estimates of the timing of divergence events (Wakeley and Hey 1997; Nielsen and Wakeley 2001). Using these models, speciation processes have been studied in Drosophila (Wang et al. 1997; Hey and Nielsen 2007), Arabidopsis (Ramos-Onsins et al. 2004), Oryza (Zhang and Ge 2007) and Capsella (Foxe et al. 2009 (Chapter 2)), and evidence for hybridization between species has been detected across a wide variety of taxa: macaques (Bonhomme et al. 2009), Oryctolagus (rabbit) (Geraldes et al. 2006), Serrasalmus (Hubert et al. 2008), Acropora (Vollmer and Palumbi 2002), Sorghum (Morrell et al. 2005), Arabidopsis (Castric et al. 2008) and Capsella (Slotte et al. 2008).

Here, I search for evidence of hybridization in the genus Capsella. Capsella rubella is characterized by a high rate of self-fertilization (Hurka et al. 1989; Hurka and

Neuffer 1997) and shows the typical morphological characteristics of a selfing syndrome

(Hurka and Neuffer 1997). In comparison with its self-incompatible sister species C. grandiflora, there has been a derived breakdown of the self-incompatibility mechanism, and its floral organ sizes are highly reduced (see Figure 1, Foxe et al. 2009 (Chapter 2)).

This transition to selfing has also led to a substantial bottleneck in C. rubella, with a 100- to 1500-fold reduction in effective population size (Ne) (Foxe et al. 2009 (Chapter 2)). While C. grandiflora is restricted to Greece and Albania, and occurs locally in Northern Italy, C. rubella has expanded into much of southern Europe, extending to Middle Europe, Northern Africa, and into Australia and North and South America (Hurka and Neuffer 1997; Paetsch et al. 2006). This expansion is likely associated with the colonization ability conferred upon C. rubella by its transition to selfing. Previous studies have estimated a divergence time of approximately 13,500 years between these two species (Foxe et al. 2009 (Chapter 2)).

Interspecific crossing experiments suggest that, in addition to mating system

evolution, there is considerable postpollination reproductive isolation between the

species, with only a small proportion of crosses producing viable seed (Hurka and

Neuffer 1997; Koch and Kiefer 2005). Under controlled conditions, forced crossing may

result in viable seed when the outcrossing C. grandiflora receives pollen from the selfing

C. rubella. Approximately 80% of these crosses gave rise to viable F1 hybrids while the

reciprocal cross was found to succeed at a frequency of approximately 10% (K. M.

Hazzouri personal communication).

While previous work has shown little evidence for introgression in natural

populations of C. grandiflora and C. rubella (Foxe et al. 2009 (Chapter 2)), this work

focused on allopatric populations of these species. Here, I ask whether a shift in mating system has led to the establishment of effective reproductive isolating barriers between these two sister species. By investigating the population genetics of potentially hybridizing populations of C. grandiflora and C. rubella, I ask whether there is evidence for ongoing and widespread hybridization between sympatric populations of the two species.

Methods

Species Sampling and DNA Sequencing

Nucleotide polymorphism data from 18 large exons were collected (Table 2) in C. grandiflora and C. rubella. Nine natural Capsella localities were visited in the Zagori region of Northern Greece, from which seed was sampled from a total of 37 diploid C. grandiflora and 35 C. rubella individuals (Figure S1). Of these nine populations, three were allopatric C. grandiflora populations, four were allopatric C. rubella populations, and the remaining two were sympatric for C. grandiflora and C. rubella. Five C. grandiflora individuals were sampled from Lazaena, eleven from Monodendri, eight from Papigo, five from Retsina and eight from Serviana. Twelve C. rubella individuals were sampled from Ellinika, seven from Kalavryta, six from Milies, three from Papigo, two from Retsina and five from Souli. Outgroup data from Arabidopsis thaliana were obtained from GenBank.

Following sterilization, seeds were placed at 4°C on sterile Murashige-Skoog (MS) nutrient medium for 14 days before being allowed to germinate at room temperature. The seedlings were grown at 20°C under 18 hours of light and 6 hours of darkness.

After 6 weeks of growth, DNA was extracted from leaf material using a DNeasy kit (QIAGEN, Hilden, Germany). PCR primers for the large exons were designed as described previously (Wright et al. 2006; Ross-Ibarra et al. 2008). Briefly, primers were designed to amplify 650-700 bp from single large exons based on the A. thaliana genome sequence, chosen with no a priori expectation as to their function or the action of selection upon these genes. Each exon was used as a BLAST (Altschul et al. 1990) query against the shotgun genome sequence of Brassica oleracea, and homologous regions were used to design primers using PrimerQuest (Integrated DNA Technologies). PCR reactions were performed in 25 µL reaction volumes (15 mM PCR (10X) buffer, 2 mM MgSO4, 10 mM dNTPs, 10 µM forward primer, 10 µM reverse primer, 1 U Tsg polymerase and 50-100 ng DNA) on an Eppendorf Mastercycler with the following program: 2 minutes at 94°C, followed by 35 cycles of 20 seconds at 94°C, 20 seconds at 55°C and 40 seconds at 72°C, with a final extension of 4 minutes at 72°C.

These products were sequenced on an ABI 3730 sequencer at the Genome Quebec Innovation Centre (McGill University, Canada). Chromatograms were checked manually for heterozygous sites using Sequencher version 4.7 (Gene Codes, Ann Arbor, MI), with the aid of the 'Call secondary peaks' option. Sequences were aligned using Genedoc (Nicholas 1997). Consistent with high levels of selfing, no heterozygous sites were identified in the C. rubella dataset. This complete lack of heterozygosity in C. rubella also confirmed that only single-copy regions were sequenced.

Sequence statistics and Analysis

Synonymous and nonsynonymous sites were identified by aligning each fragment to the corresponding fragment in the A. thaliana genome sequence, identified using BLAST (Altschul et al. 1990), and using the protein annotation from A. thaliana. Standard population genetic descriptive statistics, including numbers of synonymous and nonsynonymous sites, estimates of synonymous (πsyn) and nonsynonymous (πrep) diversity (Tajima 1993), as well as frequency data, were calculated using a modified version of Perl code (Polymorphorama) written by D. Bachtrog and P. Andolfatto (University of California at San Diego, available from http://ib.berkeley.edu/labs/bachtrog/data/polyMORPHOrama/polyMORPHOrama.html). The frequency spectra of derived polymorphic variants, and the numbers of shared derived polymorphisms, unique polymorphisms and fixed differences, were calculated using Perl scripts written by S. Wright.
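The descriptive statistics named above can be illustrated with a minimal Python sketch. This is not the Polymorphorama code or the Perl scripts used in this study; the function names are hypothetical, and the sketch assumes equal-length aligned sequences with no gaps or missing data. It computes the number of segregating sites (S), Watterson's θ, π as the mean number of pairwise differences, and Tajima's D (Tajima 1989), all unscaled. Dividing θ and π by the number of synonymous or replacement sites would give per-site values.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """Count alignment columns that carry more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences (unscaled pi) over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def watterson_theta(s, n):
    """Watterson's estimator: S divided by a1 = sum of 1/i for i = 1..n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    return s / a1


def tajimas_d(seqs):
    """Tajima's D for a list of equal-length aligned sequences."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if s == 0:
        # D is undefined without variation; 0 is returned here by convention.
        return 0.0
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    return (nucleotide_diversity(seqs) - watterson_theta(s, n)) / sqrt(variance)


# Toy alignment of four haplotypes (illustrative data, not from this study).
toy = ["ACGT", "ACGA", "ACTA", "ATTA"]
print(segregating_sites(toy), nucleotide_diversity(toy), tajimas_d(toy))
```

A negative D indicates an excess of rare variants relative to the neutral equilibrium expectation, and a positive D an excess of intermediate-frequency variants.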

Bayesian inference of population structure

Individual haplotypes of unphased diploid sequences were reconstructed using the software PHASE (Stephens et al. 2001), as implemented in DnaSP Version 5.0 (Librado and Rozas 2009). Haplotypes with the highest posterior probabilities were used for cluster analysis performed with the program InStruct version 1.0 (Gao et al. 2007). InStruct performs Bayesian clustering by assigning individuals to a given number of clusters in such a way that deviations from Hardy-Weinberg equilibrium are minimized. Unlike STRUCTURE (Pritchard et al. 2000), InStruct can accommodate non-random mating due to selfing. Based on exploratory runs, the number of clusters (k) was restricted to range from k = 1 to k = 4, and InStruct was run for 2,000,000 generations with a burnin of 200,000 generations, with two independent chains (runs) for each k. InStruct was run in this manner on the entire dataset, on C. grandiflora alone and on C. rubella alone. DISTRUCT v1.1 (Rosenberg 2004) was used to create bar plots of the aligned matrices.

Coalescent Simulations

Coalescent simulations were conducted using MIMAR (Becquet and Przeworski 2007), which estimates the parameters of an isolation-migration model based on Hudson's ms (Hudson 2002). Simulations were conducted using all 18 loci included in this study. Sites with more than two segregating bases were excluded from the analysis. Based upon previous analyses (Foxe et al. 2009 (Chapter 2)) and results from crossing experiments, migration rates were either unconstrained (asymmetric migration) or set to zero (no migration), whereas effective population sizes were assumed to be identical in C. grandiflora and the ancestor of C. rubella and C. grandiflora.

Prior limits for the Bayesian procedure implemented in MIMAR were set based on the models used in Foxe et al. 2009 (Chapter 2). Priors for θ (θ = 4Neμ, where Ne is the effective population size and μ is the mutation rate) were uniform 0.001-0.1 for both C. grandiflora and the ancestral species, and uniform 0-0.0025 for C. rubella. All runs assumed an exponentially distributed prior with rate 1 for rid, and a mutation rate per bp of 1.5 × 10^-8 (Koch et al. 2001). Migration rate priors were log uniform from -5 to 2.5 for migration from C. grandiflora to C. rubella (forward in time) and log uniform from -5 to 9 for migration from C. rubella to C. grandiflora. The prior for the time of the split between C. rubella and C. grandiflora was uniform from 0 to 4 × 10^6.
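As a worked illustration of the relationship θ = 4Neμ, the following Python sketch converts the prior bounds on θ into the effective population sizes they imply. It is an illustration only and not part of the MIMAR analysis; the function name is hypothetical, and it assumes that θ and μ are both expressed per bp and on the same time scale.

```python
MU_PER_BP = 1.5e-8  # mutation rate per bp quoted in the text (Koch et al. 2001)


def theta_to_ne(theta, mu=MU_PER_BP):
    """Solve theta = 4 * Ne * mu for the effective population size Ne."""
    return theta / (4.0 * mu)


# Prior bounds on theta quoted in the text.
theta_bounds = {
    "C. grandiflora and ancestor": (0.001, 0.1),
    "C. rubella": (0.0, 0.0025),
}

for label, (low, high) in theta_bounds.items():
    print(f"{label}: Ne from {theta_to_ne(low):,.0f} to {theta_to_ne(high):,.0f}")
```

Under these assumptions, the θ priors correspond to Ne between roughly 17,000 and 1.7 million for C. grandiflora and the ancestor, and up to roughly 42,000 for C. rubella.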

Simulations were conducted in three ways: 1) on the dataset as a whole, 2) including allopatric populations only and 3) including sympatric populations only. Each simulation was run for a total of 10,080 minutes (1 week) with a burnin of 100,000 steps.

Mixing was monitored by assessing parameter autocorrelation over runs and I considered

that MIMAR reached convergence when the posterior distributions from independent

runs were highly similar (Becquet and Przeworski 2007). The mode of the marginal

posterior probability distribution was considered as a point estimate for each parameter

and 90% highest posterior density (HPD) intervals from the MIMAR output were

calculated using the boa package in R 2.9.0 (Smith 2007).
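The point estimate and interval described above can be sketched in Python as follows. This is a minimal illustration and not the boa implementation: the function names are hypothetical, the HPD interval is taken as the shortest window containing 90% of the sorted posterior samples, and the mode is approximated by the midpoint of the most populated histogram bin (the number of bins is an arbitrary assumption).

```python
def hpd_interval(samples, mass=0.90):
    """Shortest interval containing a fraction `mass` of the posterior samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(round(mass * n)))  # number of samples inside the interval
    # Slide a window of k consecutive samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def histogram_mode(samples, bins=20):
    """Midpoint of the most populated equal-width histogram bin."""
    lo, hi = min(samples), max(samples)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width


# Illustrative right-skewed "posterior" sample (not MIMAR output).
posterior = [0] * 50 + list(range(1, 51))
print(histogram_mode(posterior), hpd_interval(posterior))
```

For a skewed posterior such as many of the migration and divergence time parameters, the HPD interval is not symmetric around the mode, which is why wide and overlapping intervals can accompany quite different point estimates.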

Results

Patterns of polymorphism

Levels of synonymous and nonsynonymous diversity were estimated in both C. grandiflora (37 diploid individuals) and C. rubella (35 individuals) using direct sequencing of 18 nuclear genes. A total of 299 synonymous single nucleotide polymorphisms (SNPs) and 190 nonsynonymous SNPs were found in C. grandiflora, versus 97 synonymous and 63 nonsynonymous SNPs in the selfing C. rubella. In C. rubella, 28% of loci were completely devoid of variation, while 33% lacked synonymous variation. Levels of synonymous variation in C. grandiflora were 2.8 times higher than those in C. rubella (Figure 1a). These results are consistent with previous estimates of diversity in C. grandiflora and C. rubella, which pointed towards a massive reduction in diversity in C. rubella relative to C. grandiflora resulting from an extreme population bottleneck (Foxe et al. 2009 (Chapter 2); Guo et al. 2009).

Levels of synonymous and nonsynonymous diversity were also estimated separately for allopatric and sympatric populations (Figures 1b and 1c). Were C. grandiflora and C. rubella hybridizing in natural populations, levels of diversity would be expected to be elevated in sympatric populations, particularly in the selfing C. rubella. Levels of synonymous and nonsynonymous diversity in both C. grandiflora and C. rubella in allopatric populations were similar to those found in the total dataset. In comparison with allopatric C. rubella, synonymous and nonsynonymous diversity in sympatric C. rubella were reduced 1.7-fold and 2.3-fold, respectively.

The distribution of synonymous variants across C. grandiflora and C. rubella was estimated. 46% of synonymous variants were shared between species (Figure 2a). This estimate is considerably higher than the previous estimate of shared variants between C. grandiflora and C. rubella (26%; Foxe et al. 2009 (Chapter 2)). However, this may reflect the increased sample size in this study (72 compared with 34). Given the extremely recent divergence (~13,500 years; Foxe et al. 2009 (Chapter 2)) between these two species, an increase in sample size could readily explain an increase in the proportion of shared variants. 50% of synonymous variants were unique to C. grandiflora, while 4% were unique to C. rubella. 49% of synonymous variants were shared in allopatric populations (Figure 2b), versus 45% in sympatric populations (Figure 2c). Were C. grandiflora and C. rubella hybridizing in these natural sympatric populations, individuals in sympatric populations would be expected to show an increased proportion of shared variants relative to individuals in allopatric populations. However, this is not the case: the proportion of shared variants is lower in sympatric than in allopatric populations, although this reduction is not statistically significant (p > 0.05, Fisher's exact test).
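The comparison of shared-variant proportions can be illustrated with a small pure-Python implementation of a two-sided Fisher's exact test. This is a sketch only: the function name is hypothetical, and because the text reports percentages rather than counts, the 2x2 table below uses hypothetical counts (49 of 100 versus 45 of 100) chosen solely to mirror the reported 49% and 45%.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Hypothetical counts: rows = allopatric, sympatric; columns = shared, not shared.
p_value = fisher_exact_two_sided(49, 51, 45, 55)
print(f"p = {p_value:.3f}")
```

With counts of this magnitude, a difference of four percentage points is well within sampling error, which is consistent with the non-significant result reported above.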

Bayesian Clustering Analyses

The results from Bayesian clustering analyses for the combined nuclear sequence

data suggest the existence of three clusters across the populations sampled from Greece

(Figure 3). Broadly speaking, Capsella grandiflora and C. rubella fall into two discrete clusters, with each individual clustering within its own species. The main exception is the C. grandiflora individuals from Retsina and Lazaena, which group together to form a third cluster (Figure 3, k = 3). This pattern can also be clearly seen in the results from Bayesian clustering analyses conducted on C. grandiflora only (Figure 4a).

The results from Bayesian clustering analyses conducted on C. rubella alone

suggest the existence of three clusters in C. rubella (Figure 4b). Kalavryta predominantly

forms its own cluster although there is evidence for admixture with Souli. Ellinika clusters with Milies and Papigo, Retsina and Souli form the third cluster.

Geographically, Kalavryta is situated to the east of the other C. rubella populations and

this may explain its forming an individual cluster. While there are no obvious geographic

reasons for the clustering patterns observed in the other populations, it may be that the

clustering patterns are reflective of a common ancestral population.

Demographic Model Fitting

The results from coalescent simulations are in strong agreement with previous

work in that they provide strong evidence for a single recent origin of C. rubella from C.

grandiflora. Here, a Markov Chain Monte Carlo (MCMC) approach based on coalescent

simulations was used to fit a model of isolation with and without migration to the data

(Becquet and Przeworski 2007). The approach makes use of the observed information

from each locus on the number of shared and unique polymorphisms, as well as fixed

differences. For analysis, the inference was restricted to synonymous sites, to avoid

potential effects of selection on the nonsynonymous variants. The model assumes that a

single ancestral population of size Na split into two at time t, and the two derived populations have distinct effective population sizes (N1 and N2).
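The per-locus summaries that the approach relies on (shared polymorphisms, polymorphisms unique to each species, and fixed differences) can be tallied as in the following Python sketch. It is an illustration under stated assumptions, not the Perl scripts or the MIMAR input pipeline used here: the function and key names are hypothetical, both samples are assumed to be aligned to the same coordinates without gaps, and sites with more than two bases are skipped, as in the analysis described in the Methods.

```python
def classify_sites(pop1, pop2):
    """Count shared, unique and fixed-difference sites between two aligned samples."""
    counts = {"shared": 0, "unique_1": 0, "unique_2": 0, "fixed": 0}
    for col1, col2 in zip(zip(*pop1), zip(*pop2)):
        alleles1, alleles2 = set(col1), set(col2)
        if len(alleles1 | alleles2) > 2:
            continue  # skip sites with more than two segregating bases
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            counts["shared"] += 1      # polymorphic in both samples
        elif poly1:
            counts["unique_1"] += 1    # polymorphic only in the first sample
        elif poly2:
            counts["unique_2"] += 1    # polymorphic only in the second sample
        elif alleles1 != alleles2:
            counts["fixed"] += 1       # monomorphic in both, different bases
    return counts


# Toy haplotypes (illustrative data, not from this study).
sample_1 = ["AAGTC", "ATGTC", "ATGCC"]
sample_2 = ["AAGTG", "ATGTG", "AAATG"]
print(classify_sites(sample_1, sample_2))
```

Intuitively, recent divergence or gene flow inflates the shared class, whereas long isolation converts shared polymorphisms into unique polymorphisms and fixed differences.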

Although previous results have suggested little or no hybridization between C. grandiflora and C. rubella, these results were based upon individuals originating from

allopatric populations. To specifically test for the presence of hybridization, individuals

from both allopatric and sympatric populations were included in analyses. To estimate

demographic parameters, I investigated a series of models that varied in the inclusion of

asymmetrical or no migration, and whether the individuals were from allopatric or

sympatric and therefore potentially hybridizing populations (Table 1).

The parameter estimates from these models are consistent with previous results

suggesting an extremely recent speciation event associated with a major reduction in

effective population size in C. rubella (Foxe et al. 2009 (Chapter 2)). Under the model

with no migration, including the entire dataset, the most likely estimate of divergence

time was found to be approximately 14,000 years ago. Similarly, the model with

asymmetrical migration provides estimates of divergence time of approximately 18,000

years ago.

When the allopatric and sympatric datasets were analysed separately, the point estimates

for divergence time in the absence of migration were 14,000 and 10,000 years, respectively,

and 822,000 and 14,000 years allowing for asymmetrical migration. Although the point

estimate of divergence time while allowing for asymmetrical migration, in allopatric populations, differs considerably from that of sympatric populations, the 90% HPD

intervals of each of these estimates overlap (Table 1).

Migration rate estimates are also consistent with previous results providing scant evidence for hybridization between these two species. Although the point estimate for migration from C. rubella to C. grandiflora was found to be 82.6 individuals per

generation in sympatric populations, versus 32.7 individuals per generation in allopatric populations, the 90% HPD intervals of these estimates overlap (Table 1).
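The overlap statements above can be checked directly against the 90% HPD intervals reported in Table 1. The short Python sketch below does this for the asymmetrical-migration models; the helper name `intervals_overlap` is hypothetical, and interval overlap is used here only as the informal criterion applied in the text, not as a formal statistical test.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a = (low, high) and b = (low, high) share a point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


# 90% HPD intervals from Table 1, asymmetrical-migration models
migration_cr_to_cg = {
    "allopatric": (0.01003, 194.819),
    "sympatric": (0.00688, 1296.58),
}
split_time_years = {
    "allopatric": (27164, 3813830),
    "sympatric": (2340, 3793770),
}

for label, hpd in (("migration Cr to Cg", migration_cr_to_cg),
                   ("split time", split_time_years)):
    overlap = intervals_overlap(hpd["allopatric"], hpd["sympatric"])
    print(f"{label}: allopatric and sympatric intervals overlap = {overlap}")
```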

Discussion

Here, I find no clear evidence for gene flow between the outcrossing C. grandiflora and its selfing sister species C. rubella. These results suggest that differences in mating system have led to the establishment of effective reproductive barriers preventing gene flow between these two species. Several lines of evidence support these conclusions. First, diversity in sympatric C. rubella is not increased when compared to allopatric C. rubella, as would be predicted were C. grandiflora and C. rubella hybridizing. Second, the distributions of synonymous variants estimated in allopatric and sympatric Capsella are comparable, with no increase in the proportion of shared alleles in sympatric Capsella. Third, results from Bayesian clustering analyses indicate little or no shared haplotype identity between these two species. Finally, demographic model fitting indicates no significant differences in migration rate estimates in allopatric versus sympatric populations.

While a difference in mating system alone can act as the primary reproductive barrier (Gottlieb 1973; Martin and Willis 2007), this difference can also act to promote speciation in other ways. For instance, selfing can allow a single individual to successfully colonize and populate a new habitat, previously unavailable to an outcrossing relative. This in turn may result in the establishment of habitat isolation

(Coyne and Orr 2004). Interestingly, the results from coalescent simulations reveal a ~27-fold (without migration) and ~14-fold (with migration) reduction in Ne in sympatric C. rubella when compared with allopatric C. rubella (Table 1). In this instance, it is possible that C. grandiflora effectively outcompetes C. rubella when the two species are found in close proximity, resulting in fewer C. rubella individuals at these localities. If this

is the case and C. rubella performs better in allopatry to C. grandiflora, it may be that

habitat isolation has developed as a byproduct of mating system isolation.
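The fold reductions in Ne quoted above follow from the C. rubella θ modes in Table 1. Because θ = 4Neμ, a ratio of θ values equals the corresponding ratio of Ne values provided the same mutation rate applies to both datasets; that shared rate is an assumption of the sketch below, and the function name `fold_reduction` is hypothetical.

```python
def fold_reduction(theta_reference, theta_focal):
    """Ratio of two theta = 4*Ne*mu estimates.

    With a common mutation rate mu, mu cancels and the ratio
    equals Ne(reference) / Ne(focal).
    """
    return theta_reference / theta_focal


# Modes of theta for C. rubella, taken from Table 1
theta_cr = {
    "no migration": {"allopatric": 0.00109, "sympatric": 0.00004},
    "asymmetrical migration": {"allopatric": 0.00083, "sympatric": 0.00006},
}

for model, estimates in theta_cr.items():
    ratio = fold_reduction(estimates["allopatric"], estimates["sympatric"])
    print(f"{model}: allopatric Ne is about {ratio:.0f} times the sympatric Ne")
```

These are ratios of point estimates only; the HPD intervals on θ in Table 1 are wide, so the fold differences should be read with that uncertainty in mind.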

Capsella grandiflora is thought to represent a large, stable population at

equilibrium (Foxe et al. 2009 (Chapter 2)). Despite this, there is some evidence for

population structure in C. grandiflora with the existence of two population clusters. A

study investigating the demographic history of Greek Capsella has recently identified the

presence of three clusters in C. grandiflora in Greece (St. Onge et al. 2010). These

clusters are separated by geography with one cluster located in Northern Greece, a second

cluster in Southern Greece and a third cluster on the island of Corfu. The presence of the

Retsina and Lazaena cluster in C. grandiflora in this dataset is likely reflective of this

geographic clustering.

Mating system isolation has been identified as the primary reproductive barrier in just two plant species pairs to date (Gottlieb 1973; Martin and Willis 2007). Although

mating system appears to be acting as a reproductive barrier between C. grandiflora and

C. rubella, what is not clear is the relative contribution of this mating system isolation in

the speciation process. It is likely that other factors are playing a role and some evidence

for postmating reproductive isolation has been observed. Given the extremely recent

divergence between C. grandiflora and C. rubella, fully understanding the extent to

which different reproductive isolating barriers, including mating system isolation, contribute to speciation will be

of considerable interest in future investigations. Nevertheless, the results from this study provide no significant evidence for ongoing and widespread hybridization between these two species.

References

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local

alignment search tool. J Mol Biol 215:403-410.

Arnold, M. L. 1997. Natural Hybridization and Evolution. Oxford University Press, New

York.

Barrett, S. C. H. 2002. The evolution of plant sexual diversity. Nature Reviews 3:274-

284.

Barton, N. H. 1979. Dynamics of hybrid zones. Heredity 43:341-359.

Becquet, C., and M. Przeworski. 2007. A new approach to estimate parameters of

speciation models with application to apes. Genome Res 17:1505-1519.

Bonhomme, M., S. Cuartero, A. Blancher, and B. Crouau-Roy. 2009. Assessing natural

introgression in 2 biomedical model species, the rhesus macaque (Macaca

mulatta) and the long-tailed macaque (Macaca fascicularis). J Hered 100:158-169.

Castric, V., J. Bechsgaard, M. H. Schierup, and X. Vekemans. 2008. Repeated adaptive

introgression at a gene under multiallelic balancing selection. PLoS Genetics

4:e1000168.

Coyne, J. A., and H. A. Orr. 2004. Speciation. Sinauer Associates, Inc., Sunderland,

Massachusetts.

Dobzhansky, T. 1937. Genetics and the Origin of Species.

Fishman, L., A. J. Kelly, and J. H. Willis. 2002. Minor quantitative trait loci underlie

floral traits associated with mating system divergence in Mimulus. Evolution

56:2138-2155.

Foxe, J. P., T. Slotte, E. A. Stahl, B. Neuffer, H. Hurka, and S. I. Wright. 2009. Rapid

morphological evolution and speciation associated with the evolution of selfing in

Capsella. PNAS 106:5241-5245.

Gao, H., S. Williamson, and C. D. Bustamante. 2007. A Markov chain Monte Carlo

approach for joint inference of population structure and inbreeding rates from

multilocus genotype data. Genetics 176:1635-1651.

Geraldes, A., N. Ferrand, and M. W. Nachman. 2006. Contrasting patterns of

introgression at X-linked loci across the hybrid zone between subspecies of the

European rabbit (Oryctolagus cuniculus). Genetics 173:919-933.

Gottlieb, L. D. 1973. Genetic differentiation, sympatric speciation, and the origin of a

diploid species of Stephanomeria. Am J of Bot 60:545-553.

Guo, Y.-L., J. S. Bechsgaard, T. Slotte, B. Neuffer, M. Lascoux, D. Weigel, and M. H.

Schierup. 2009. Recent speciation of Capsella rubella from Capsella grandiflora,

associated with loss of self-incompatibility and an extreme bottleneck. PNAS

106:5246-5251.

Harrison, R. G. 1990. Hybrid zones: windows on evolutionary processes. Oxf. Surv.

Evol. Biol. 7:69-128.

Hey, J. 2010. Isolation with migration models for more than two populations. Mol Biol

Evol 27:905-920.

Hey, J., and R. Nielsen. 2007. Integration within the Felsenstein equation for improved

Markov chain Monte Carlo methods in population genetics. PNAS 104:2785-

2790.

Hubert, N., J. P. Torrico, F. Bonhomme, and J. F. Renno. 2008. Species polyphyly and

mtDNA introgression among three Serrasalmus sister-species. Mol Phylogenet

Evol 46:375-381.

Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model of genetic

variation. Bioinformatics 18:337-338.

Hurka, H., S. Freundner, A. H. Brown, and U. Plantholt. 1989. Aspartate

aminotransferase isozymes in the genus Capsella (Brassicaceae): subcellular

location, gene duplication, and polymorphism. Biochemical Genetics 27:77-90.

Hurka, H., and B. Neuffer. 1997. Evolutionary processes in the genus Capsella

(Brassicaceae). Plant Sys and Evol 206:295-316.

Husband, B. C., and H. A. Sargent. 2004. Reproductive isolation between autotetraploids

and their diploid progenitors in fireweed, Chamerion angustifolium

(Onagraceae). New Phytologist 161:703-713.

Kay, K. M. 2006. Reproductive isolation between two closely related hummingbird-

pollinated neotropical gingers. Evolution 60:538-552.

Koch, M., B. Haubold, and T. Mitchell-Olds. 2001. Molecular systematics of the

Brassicaceae: evidence from coding plastidic matK and nuclear Chs sequences.

Am J Bot 88:534-544.

Koch, M., and M. Kiefer. 2005. Genome evolution among cruciferous plants: a lecture

from the genetic maps of three diploid species— Capsella rubella, Arabidopsis

lyrata subsp. petraea, and A. thaliana. Am J Bot 92:761-767.

Librado, P., and J. Rozas. 2009. DnaSP v5: a software for comprehensive analysis of

DNA polymorphism data. Bioinformatics 25:1451-1452.

Lowry, D. B., J. L. Modliszewski, K. M. Wright, C. A. Wu, and J. H. Willis. 2008. The

strength and genetic basis of reproductive isolating barriers in flowering plants.

Phil Trans R Soc Lon 363:3009-3021.

Martin, N. H., and J. H. Willis. 2007. Ecological divergence associated with mating

system causes nearly complete reproductive isolation between sympatric Mimulus

species. Evolution 61:68-82.

Mayr, E. 1942. Systematics and the origin of species. Columbia University Press, New

York.

Morrell, P. L., T. D. Williams-Coplin, A. L. Lattu, J. E. Bowers, J. M. Chandler, and A.

H. Paterson. 2005. Crop-to-weed introgression has impacted allelic composition

of johnsongrass populations with and without recent exposure to cultivated

sorghum. Mol Ecol 14:2143-2154.

Nicholas, K. B., Nicholas, H.B. Jr., Deerfield, D.W. II. 1997. GeneDoc: Analysis and

Visualization of Genetic Variation. EMBNEW.NEWS 4:14.

Nielsen, R., and J. Wakeley. 2001. Distinguishing migration from isolation: a Markov

chain Monte Carlo approach. Genetics 158:885-896.

Noor, M. A., and J. L. Feder. 2006. Speciation genetics: evolving approaches. Nature

Reviews 7:851-861.

Paetsch, M., S. Mayland-Quellhorst, and B. Neuffer. 2006. Evolution of the

self-incompatibility system in the Brassicaceae: identification of S-locus receptor

kinase (SRK) in self-incompatible Capsella grandiflora. Heredity 97:283-290.

Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure

using multilocus genotype data. Genetics 155:945-959.

Ramos-Onsins, S. E., B. E. Stranger, T. Mitchell-Olds, and M. Aguade. 2004. Multilocus

analysis of variation and speciation in the closely related species Arabidopsis

halleri and A. lyrata. Genetics 166:373-388.

Ramsey, J., H. D. Bradshaw, Jr., and D. W. Schemske. 2003. Components of

reproductive isolation between the monkeyflowers Mimulus lewisii and M.

cardinalis (Phrymaceae). Evolution 57:1520-1534.

Rieseberg, L. H., and J. H. Willis. 2007. Plant speciation. Science 317:910-914.

Rosenberg, N. A. 2004. DISTRUCT: a program for the graphical display of population

structure. Mol Ecol Notes 4:137-138.

Ross-Ibarra, J., S. I. Wright, J. P. Foxe, A. Kawabe, L. DeRose-Wilson, G. Gos, D.

Charlesworth, and B. S. Gaut. 2008. Patterns of Polymorphism and Demographic

History in Natural Populations of Arabidopsis lyrata. PloS One 3:e2411.

Slotte, T., H. Huang, M. Lascoux, and A. Ceplitis. 2008. Polyploid speciation did not

confer instant reproductive isolation in Capsella (Brassicaceae). Mol Biol Evol

25:1472-1481.

Smith, B. J. 2007. boa: An R Package for MCMC Output Convergence Assessment and

Posterior Inference. Journal of Statistical Software 21:1-37.

Soltis, D. E., P. S. Soltis, and J. A. Tate. 2003. Advances in the study of polyploidy since

Plant Speciation. New Phytologist 161:173-191.

St. Onge, K., T. Kallman, T. Slotte, M. Lascoux, and A. Palme. 2010. Divergent

population history and structure in two closely related species (Capsella rubella

and C. grandiflora) with different mating system. In prep.

Stebbins, G. L. 1970. Adaptive radiation of reproductive characteristics in angiosperms.

I. Pollination mechanisms. Ann Rev of Ecol Sys 1:307-326.

Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplotype

reconstruction from population data. Am J Hum Gen 68:978-989.

Tajima, F. 1993. Measurement of DNA polymorphism. Pp. 37-60 in N. Takahata, and A.

Clark, eds. Mechanisms of molecular evolution. Japan Scientific Societies Press,

Tokyo.

Vollmer, S. V., and S. R. Palumbi. 2002. Hybridization and the evolution of reef coral

diversity. Science 296:2023-2025.

Wakeley, J., and J. Hey. 1997. Estimating ancestral population parameters. Genetics

145:847-855.

Wang, R. L., J. Wakeley, and J. Hey. 1997. Gene flow and natural selection in the origin

of Drosophila pseudoobscura and close relatives. Genetics 147:1091-1106.

Wright, S. I., J. P. Foxe, L. DeRose-Wilson, A. Kawabe, M. Looseley, B. S. Gaut, and D.

Charlesworth. 2006. Testing for effects of recombination rate on nucleotide

diversity in natural populations of Arabidopsis lyrata. Genetics 174:1421-1430.

Yatabe, Y., N. C. Kane, C. Scotti-Saintagne, and L. H. Rieseberg. 2007. Rampant gene

exchange across a strong reproductive barrier between the annual sunflowers,

Helianthus annuus and H. petiolaris. Genetics 175:1883-1893.

Zhang, L.-B., and S. Ge. 2007. Multilocus analysis of nucleotide variation and speciation

in Oryza officinalis and its close relatives. Mol Biol Evol 24:769-783.

Table 1. Modes of parameter estimates under a range of MIMAR models using the entire dataset, allopatric populations only and sympatric populations only. 90% HPD intervals are given in parentheses.

Dataset and Model | θ (Cg) a | θ (Cr) a | θ (A) a | M Cg-Cr b | M Cr-Cg c | T d
Entire dataset, no migration | 0.02275 (0.01908, 0.02662) | 0.00151 (0.00062, 0.00244) | 0.02275 (0.01908, 0.02662) | - | - | 14,014 (4,210, 33,053)
Entire dataset, asymmetrical migration | 0.02216 (0.01839, 0.02744) | 0.00067 (0.00015, 0.00221) | 0.02216 (0.01839, 0.02744) | 0.37390 (0.00674, 9.63982) | 42.0776 (0.01323, 164.544) | 18,018 (3,731, 3,812,460)
Allopatric populations, no migration | 0.02335 (0.01777, 0.02595) | 0.00109 (0.00034, 0.00212) | 0.02335 (0.01777, 0.02595) | - | - | 14,014 (5,126, 39,173)
Allopatric populations, asymmetrical migration | 0.02265 (0.01783, 0.02629) | 0.00083 (0.00007, 0.00145) | 0.02265 (0.01783, 0.02629) | 6.46406 (0.00675, 2.8334) | 32.6719 (0.01003, 194.819) | 822,823 (27,164, 3,813,830)
Sympatric populations, no migration | 0.02295 (0.01888, 0.02788) | 0.00004 (0.00003, 0.00214) | 0.02295 (0.01888, 0.02788) | - | - | 10,010 (459, 35,520)
Sympatric populations, asymmetrical migration | 0.02305 (0.01845, 0.02894) | 0.00006 (0.000002, 0.00167) | 0.02305 (0.01845, 0.02894) | 11.4302 (0.00674, 1.00024) | 82.6839 (0.00688, 1,296.58) | 14,014 (2,340, 3,793,770)

a θ (4Neμ, where Ne is the effective population size and μ is the mutation rate, 1.5 × 10^[exponent illegible in source]) for C. grandiflora (Cg), C. rubella (Cr) and their ancestor (A)
b Migration rate (4Nem) from C. grandiflora to C. rubella
c Migration rate (4Nem) from C. rubella to C. grandiflora
d Time of the split of C. rubella and C. grandiflora

Table 2. Number of silent sites subdivided by species and population category (allopatric and sympatric) as well as gene ontology terms for each of the 18 loci used in this study.

Number of Silent Polymorphisms

Locus | Total | C. grandiflora Allopatric | C. grandiflora Sympatric | C. rubella Allopatric | C. rubella Sympatric | Gene Ontology Terms
At1g01040 | 9 | 8 | 7 | 0 | 0 | DEAD/DEAH box helicase carpel factory
At1g03560 | 11 | 8 | 5 | 4 | 0 | pentatricopeptide (PPR) repeat-containing
At1g04650 | 14 | 12 | 11 | 8 | 0 | hypothetical protein
At1g06520 | 13 | 13 | 9 | 0 | 0 | phospholipid/glycerol acyltransferase family
At1g10900 | 3 | 3 | 1 | 0 | 0 | phosphatidylinositol-4-phosphate 5-kinase family
At1g59720 | 23 | 21 | 18 | 13 | 5 | pentatricopeptide (PPR) repeat-containing
At1g62390 | 7 | 7 | 5 | 2 | 0 | octicosapeptide/Phox/Bem1p (PB1)
At1g62520 | 39 | 31 | 28 | 21 | 13 | expressed protein
At1g65450 | 28 | 25 | 25 | 17 | 0 | transferase family protein
At1g68530 | 27 | 20 | 22 | 1 | 0 | very-long-chain fatty acid condensing enzyme
At1g72390 | 8 | 5 | 5 | 0 | 1 | expressed protein
At1g74600 | 10 | 9 | 6 | 0 | 0 | pentatricopeptide (PPR) repeat-containing
At2g23170 | 35 | 28 | 32 | 0 | 0 | auxin-responsive GH3 family protein
At2g26730 | 21 | 17 | 18 | 1 | 1 | leucine-rich repeat transmembrane protein
At2g44900 | 5 | 5 | 2 | 0 | 0 | armadillo/beta-catenin repeat family protein/F-box family protein
At2g47430 | 4 | 4 | 2 | 1 | 1 | cytokinin-responsive histidine kinase (CKI1)
At4g14190 | 27 | 20 | 16 | 13 | 0 | pentatricopeptide (PPR) repeat-containing
At4g14370 | 26 | 24 | 20 | 8 | 0 | phosphoinositide binding


Figure 1. Comparison of polymorphism patterns between C. grandiflora and C. rubella for a) the entire dataset, b) the allopatric populations and c) the sympatric populations as measured by synonymous π, where π is the average number of pairwise differences. Bars represent the median, boxes the interquartile range, and whiskers extend out to 1.5 times the interquartile range.

[Figure 2 legend: Unique C. rubella; Unique C. grandiflora; Shared; Fixed Differences]

Figure 2. Distribution of synonymous variants. Variants are classed as unique to C. rubella, unique to C. grandiflora or shared between species. The datasets are subdivided into a) the entire dataset, b) allopatric populations only and c) sympatric populations only.

[Figure 3 group labels: C. grandiflora; C. rubella]

Figure 3. Posterior probabilities of Bayesian clustering analysis (InStruct) conducted on the entire dataset, where k = 2-4. Bar plots show individual posterior probabilities.

* denotes sympatric populations


Figure 4. Posterior probabilities of Bayesian clustering analysis (InStruct), using the (a)

C. grandiflora individuals and the (b) C. rubella individuals, where k = 2-4. Bar plots show individual posterior probabilities.

* denotes sympatric populations

Figure S1. Geographic location of each of the nine populations included in this study.

Allopatric C. grandiflora populations are marked in blue, allopatric C. rubella populations are marked in pink and sympatric populations are marked in yellow.

CHAPTER 4

Dynamics of polyploid speciation in the genus Capsella

Abstract

Polyploidy has long been known to be an important form of speciation. Often

polyploidization can act to instantly create a new species as the newly formed polyploid

is immediately reproductively isolated from the diploid progenitor(s) due to resulting

problems with chromosome pairing and segregation in meiosis. Given the dominant role

of polyploidization in plant speciation, understanding the evolutionary context in which

this process occurs becomes an important aspect of speciation genetics. Capsella bursa-pastoris, or Shepherd's Purse, is a selfing tetraploid and sister species to the outcrossing

C. grandiflora and selfing C. rubella. Like C. rubella, C. bursa-pastoris very clearly

displays the selfing syndrome; in comparison with the outcrossing C. grandiflora, their floral organs are very much reduced. Capsella bursa-pastoris has a worldwide

distribution, which can partly be explained anthropogenically. Investigations into the evolutionary origin of C. bursa-pastoris remain inconclusive. Here, using DNA sequence data from 14 unlinked nuclear loci in C. bursa-pastoris, C. grandiflora and C. rubella, I attempt to identify the evolutionary mode of origin of C. bursa-pastoris. Furthermore, I address the phylogenetic relationships between C. bursa-pastoris, C. grandiflora and C. rubella. My results suggest that C. bursa-pastoris diverged from C. grandiflora via autopolyploidization approximately 667,000 years ago, and that estimates of the severity of the population bottleneck in C. bursa-pastoris are inconsistent with a speciation event from a single individual.

Introduction

Polyploidy has long been associated with speciation and is considered by many to

be the predominant mode of sympatric speciation (Coyne and Orr 2004; Mallet 2007).

Polyploidization can act to instantly create a new species as the newly formed polyploid

is often immediately reproductively isolated from the diploid progenitor(s) due to the

change in ploidy. Hybridization between a newly created tetraploid individual and an

ancestral diploid individual results in triploid offspring. These progeny are typically

sterile because, with an odd number of chromosome sets, meiosis cannot proceed correctly due to problems in chromosome pairing and segregation (Ramsey and

Schemske 1998).

The relative contribution of polyploidy to speciation in plants is a controversial topic with widely varying estimates of the frequency of polyploid speciation. Based upon the fraction of speciation events that involve any change in chromosome number as well

as the fraction of changes in chromosome number that involve polyploidy, Otto and

Whitton (2000) report that 2-4% of speciation events in angiosperms and 7% in ferns are

a direct result of polyploidy. More recent estimates based upon phylogenetic data

estimate the frequency of polyploid speciation by tracking changes in ploidy level across

infrageneric phylogenetic trees (Wood et al. 2009). Wood and colleagues (2009) put this number at 15% in angiosperms and 31% in ferns. These estimates indicate that polyploidization represents an extremely common vehicle for the speciation process in plants.

Given the dominant role of polyploidization in plant speciation, understanding the

evolutionary context in which this process occurs becomes an important aspect of

speciation genetics. Several relevant and important questions must be posed in order to

elucidate the evolutionary history of a polyploid species, for instance: does the species have a single or multiple origins; what is the role of founder events in this process; is there ongoing gene flow between the species and its ancestor(s); when did the polyploidization event occur; is the species an allo- or autopolyploid? Addressing these questions is difficult considering the challenges associated with distinguishing multiple origins of polyploids, extinction of parental lineages and the sampling of standing variation in progenitor species (Doyle and Egan 2009).

Recent advances in coalescent modeling have allowed for much progress in the field of speciation genetics (Soltis et al. 2003; Noor and Feder 2006; Becquet and

Przeworski 2007; Hey and Nielsen 2007; Hey 2010). Formalized speciation models now facilitate the differentiation of ancestral polymorphism from introgression and allow statistically based timing estimates of divergence events (Wakeley and Hey 1997;

Nielsen and Wakeley 2001). Using these models, speciation processes have been studied in Drosophila (Wang et al. 1997; Hey and Nielsen 2007), Arabidopsis (Ramos-Onsins et al. 2004), Oryza (Zhang and Ge 2007) and Capsella (Foxe et al. 2009 (Chapter 2)) and polyploid speciation has been investigated in soybean (Gill et al. 2009), A. suecica

(Jakobsson et al. 2006) and C. bursa-pastoris (Slotte et al. 2008) to name but a few.

It is possible that the formation of a new species may involve but a single individual. This situation however, would result in a severe population bottleneck, as

there would be a massive reduction in effective population size. There are several well-

characterized examples of polyploid species where a single origin seems likely, including

wheat (Levy and Feldman 2002); Arachis hypogaea (Kochert et al. 1996); Spartina

anglica (Raybould 1991; Ainouche et al. 2004); and Arabidopsis suecica (Jakobsson et al.

2006; Hazzouri et al. 2008). However, with the advent of the molecular era, the relative

frequency of recurrent polyploid formation has become apparent with over 45 examples

listed in just two literature reviews (Soltis and Soltis 1993; Soltis and Soltis 1999) with more examples being published on a regular basis (e.g.: A. kamchatica (Shimizu-Inatsugi

et al. 2009); Aegilops (Meimberg et al. 2009); Asteraceae (Grubbs et al. 2009; Symonds

et al. 2010)). In fact, recurrent origins of polyploid species may be the rule, not the

exception (Soltis and Soltis 1999).

Polyploids are considered to be tremendously successful species for a number of reasons. Stebbins (1950) suggested that the availability of new ecological niches, closed to the diploid progenitors, was vital to the success of polyploids and, anecdotally, it has been documented that many of the world's most successful weeds are polyploids (Hegarty and Hiscock 2008). One such polyploid is the weed C. bursa-pastoris or Shepherd's

Purse, which is a selfing tetraploid species and a member of the genus Capsella (Hurka and Neuffer 1997). There are three species in this genus, namely C. bursa-pastoris, a predominantly selfing tetraploid which is among the most successful colonizing plant species in existence (Hintz et al. 2006), C. grandiflora, a diploid outbreeder and C. rubella, a diploid selfer (Hurka and Neuffer 1997). The nature of the three Capsella species makes the genus an attractive model in that it contains species with different

mating systems and ploidy levels which maintain a close relationship. Previous work has

demonstrated the ability of diploid Capsella to form inter-specific hybrids, allowing for

many types of genetic analysis (Hurka and Neuffer 1997). Both C. bursa-pastoris and C.

rubella phenotypically very clearly display the selfing syndrome; in comparison with the

outcrossing C. grandiflora, their floral organs are very much reduced (Figure 1). Previous

work on these species has suggested that both C. grandiflora and C. rubella may be

ancestral to C. bursa-pastoris (Hurka and Neuffer 1997) and more recent findings reveal

that C. rubella diverged from C. grandiflora approximately 13,500 years ago (Foxe et al.

2009 (Chapter 2)). C. bursa-pastoris has a worldwide distribution that can partly be

explained anthropogenically (Hurka and Neuffer 1997; Slotte et al. 2008). In contrast to

C. grandiflora and C. rubella, C. bursa-pastoris can be found on each continent and thrives in a wide climate range (Hurka and Neuffer 1997).

Whether C. bursa-pastoris is an autopolyploid or an allopolyploid has not been conclusively established, although several hypotheses have been proposed. Capsella bursa-pastoris displays disomic inheritance (Hurka et al. 1989). It is often assumed that polyploids that form bivalents during meiosis (i.e. exhibit disomic segregation) are allopolyploids and those that form multivalents during meiosis are autopolyploids (Otto and Whitton 2000).

This, however, is not always the case, as polyploids with tetrasomic segregation (pairing of four homologous chromosomes during meiosis) tend to rediploidize over time as mutations accumulate and chromosomes diverge (Ramsey and Schemske 1998).

Furthermore, autopolyploids with small chromosomes or low chiasma frequencies may exhibit disomic inheritance immediately after their formation (Stebbins 1971).

Clearly, the inheritance mechanism observed in C. bursa-pastoris cannot be used to infer

its mode of origin.

Isozyme electrophoresis indicated that C. bursa-pastoris shared alleles with both

C. grandiflora and C. rubella and was hence thought to be an allopolyploid between

these two species (Hurka et al. 1989). Later, based upon restriction site variation in the

chloroplast genome, C. bursa-pastoris was inferred to be an ancient autopolyploid of C. grandiflora (Hurka and Neuffer 1997). Most recently, it has been hypothesized that C.

bursa-pastoris is in fact an allopolyploid, although not between C. grandiflora and C.

rubella (Slotte et al. 2006).

There are a number of mechanisms by which polyploidy may result in

instantaneous speciation. If the newly formed polyploid is selfing, the polyploidization

event need only occur but once. However, this would represent an extreme population bottleneck. Alternatively, multiple origins of different polyploid species have been hypothesized and this mechanism of formation would explain reported shared polymorphism across species with different levels of ploidy. Finally, single or multiple origins of a polyploid species followed by introgression could explain these patterns of shared polymorphism. In this case, the polyploid event would not have introduced instant reproductive isolation (Ramsey and Schemske 1998). While hybridization between a polyploid and diploid is unusual, there are mechanisms under which it may occur. For example, hybridization between a tetraploid and diploid will lead to triploid offspring.

Backcrossing between this triploid and a diploid can lead to formation of a fully fertile tetraploid individual (Müntzing 1930; Skalińska 1945; Ramsey and Schemske 1998).

These three alternatives have recently been tested in C. bursa-pastoris (Slotte et

al. 2008). Using a coalescent based isolation-with-migration model (Nielsen and Wakeley

2001), Slotte et al. (2008) found evidence for gene flow from C. rubella to C. bursa-pastoris following the dispersal of C. bursa-pastoris throughout Eurasia. These findings

indicate that, in this case, polyploidy did not result in instantaneous reproductive isolation

in C. bursa-pastoris. However, these conclusions were made using sequence data from C.

bursa-pastoris and C. rubella only. Studies demonstrating an extremely recent divergence time for C. rubella from C. grandiflora suggest it is unlikely that C. rubella is

ancestral to C. bursa-pastoris (Foxe et al. 2009 (Chapter 2); Guo et al. 2009). Given this,

it seems much more likely that including C. grandiflora may allow for more accurate

inferences about the origins of C. bursa-pastoris.

Here, using DNA sequence data from 14 unlinked nuclear loci in C. bursa-pastoris, C. grandiflora and C. rubella, I address the following areas. First, I characterize patterns of polymorphism in all three species in this genus. Next, using molecular phylogenetic techniques, I elucidate the phylogenetic relationships between

C. bursa-pastoris, C. grandiflora and C. rubella. Finally, using coalescent-based analyses

I date the divergence of C. bursa-pastoris from its putative ancestor C. grandiflora and assess evidence for population bottlenecks in C. bursa-pastoris, thus elucidating the evolutionary history between all three species in the genus Capsella.

Methods

Sampling

Seeds were obtained from a total of 78 accessions of C. bursa-pastoris from

China, Taiwan, Israel and Europe, 53 accessions of C. grandiflora from its native Greece,

43 accessions of C. rubella from Africa, South America, Europe and Israel, as well as

one accession of Neslia paniculata. The accession designations and geographical origins of the Capsella seed material are given in Table S1.

Following sterilization, seeds were placed at 4°C on sterile Murashige-Skoog (MS)

nutrient medium for 14 days before being allowed to germinate at room temperature. The seedlings were grown at 20°C under conditions of 18 hours of light and

6 hours of darkness. After 6 weeks of growth DNA was extracted from fresh or frozen

leaf tissue from a single individual per accession. Leaf tissue was ground to a fine powder

in liquid nitrogen, and DNA was extracted using the QIAgen DNeasy Plant Mini Kit

(QIAGEN, Valencia, California, USA).

PCR and Sequencing

Fourteen single copy, effectively unlinked nuclear genes were chosen for inclusion in this analysis (Table S2). Due to the high conservation of gene content between A. thaliana, C. grandiflora and C. rubella (Acarkan et al. 2000; Boivin et al.

2004), such genes are likely to be single copy in the diploids C. grandiflora and C. rubella and to be found in 2 copies duplicated by polyploidy, homoeologs, in the tetraploid C. bursa-pastoris.

PCR primers for the large exons were designed as described by Ross-Ibarra et al. (2008) and Slotte et al. (2008). In brief, primers were designed to amplify 650-700 bp from single large exons based on the A. thaliana genome

sequence, chosen with no a priori expectation as to their function or the action of

selection on these genes. Each exon was used as a BLAST query against the shotgun

genome sequence of Brassica oleracea and homologous regions were used to design primers by using PrimerQuest (Integrated DNA Technologies). PCR reactions were performed in 25 µL reaction volumes (15 mM PCR (10X) buffer, 2 mM MgSO4, 10 mM dNTPs, 10 µM forward primer, 10 µM reverse primer, 1 U Tsg polymerase and 50-100 ng DNA) on an Eppendorf Mastercycler with the following program: 2 minutes at 94°C, followed by 35 cycles of 20 seconds at 94°C, 20 seconds at 55°C and 40 seconds at 72°C, with a final extension time of 4 minutes at 72°C.

I amplified ~200 bp to ~800 bp of each gene in C. grandiflora, C. rubella and C. bursa-pastoris. These products were sequenced at Lark Technologies (Houston, Texas) and at the Genome Quebec Innovation Centre (McGill University, Canada).

Chromatograms were checked manually for heterozygous sites, using Sequencher version

4.7 (Gene Codes, Ann Arbor, MI), with the aid of the 'Call secondary peaks' option.

Sequences were aligned using Genedoc (Nicholas 1997).

Based on these data, allele-specific primers were designed using polymorphisms to selectively amplify each of the two C. bursa-pastoris homoeologs. Each copy of the 14 loci used in this study was successfully amplified using this approach and sequenced on an ABI 3730 sequencer at the Genome Quebec Innovation Centre (McGill University,

Canada). Because all C. bursa-pastoris sequences analyzed in this study were amplified using homoeolog-specific primers and were sequenced directly, I avoided sequence

artefacts resulting from cloning of heterogeneous polymerase chain reaction products

(Cronn et al. 2002). For each accession and homoeolog, a single sequence was retrieved,

as expected in predominantly selfing species.

Standard Population Genetic Analyses

Synonymous and nonsynonymous sites were identified by aligning each fragment

to the corresponding fragment in the A. thaliana genome sequence, identified using

BLAST (Altschul et al. 1990), and using the protein annotation from A. thaliana.

Sequence-based summary statistics θ (Watterson 1975), π (Tajima 1993) and Tajima's D (Tajima 1989) at synonymous and nonsynonymous sites, as well as frequency data, were

calculated by using a modified version of Perl code (Polymorphurama) written by D.

Bachtrog and P. Andolfatto (University of California at San Diego, available from

http://ib.berkeley.edu/labs/bachtrog/data/polyMORPHOrama/polyMORPHOrama.html).

The frequency spectra of derived polymorphic variants, and the number of shared derived polymorphisms, unique polymorphisms, and fixed differences were calculated by using

Perl scripts written by S. Wright. The minimum number of synonymous substitutions between C. grandiflora haplotypes and each of the two C. bursa-pastoris alleles was estimated using DnaSP Version 5.0 (Librado and Rozas 2009).
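As an illustration of the summary statistics named above, the following minimal sketch computes the number of segregating sites, Watterson's θ, π and Tajima's D from a small alignment. It is not the Polymorphurama code used in this study. The function names and the toy alignment are assumptions made for illustration, the statistics are computed over all sites rather than over synonymous sites only, and the values are per locus rather than per site.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """Count alignment columns that contain more than one base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator: S divided by a1 = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a1


def pairwise_pi(seqs):
    """Average number of pairwise differences between two sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / len(pairs)


def tajimas_d(seqs):
    """Tajima's D (Tajima 1989); returns 0.0 when there is no variation."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if s == 0:
        return 0.0  # convention chosen for this sketch only
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pairwise_pi(seqs) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignment of four sequences, for illustration only.
toy_alignment = ["AAAA", "AAAT", "AATT", "ATTT"]
```

Restricting these calculations to synonymous sites, as done in this chapter, would additionally require the A. thaliana protein annotation described above.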

Based upon the minimum number of synonymous substitutions between C. grandiflora and each of the two C. bursa-pastoris alleles, each locus was designated

locus A or locus B where locus B has the larger minimum distance compared with locus

A (Table 1; following Slotte et al. 2006, 2008).
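The designation rule can be sketched as follows. This is a minimal illustration rather than the procedure run in DnaSP: it counts all differences instead of synonymous substitutions only, the function names and sequences are hypothetical, and the handling of ties (the input order is kept) is an arbitrary assumption.

```python
def min_distance(query, haplotypes):
    """Smallest number of differences between query and any haplotype."""
    return min(sum(a != b for a, b in zip(query, h)) for h in haplotypes)


def designate_homoeologs(copy1, copy2, grandiflora_haplotypes):
    """Label the copy with the larger minimum distance to C. grandiflora 'B'."""
    d1 = min_distance(copy1, grandiflora_haplotypes)
    d2 = min_distance(copy2, grandiflora_haplotypes)
    if d2 >= d1:  # ties keep the input order (assumption of this sketch)
        return {"A": copy1, "B": copy2}
    return {"A": copy2, "B": copy1}


# Hypothetical sequences, for illustration only.
labels = designate_homoeologs("TCGA", "ACGT", ["ACGT", "ACGA"])
```

In this toy case the second copy is identical to one C. grandiflora haplotype and is therefore labelled A, while the first copy is labelled B.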

To infer haplotype data in C. grandiflora I used PHASE 2.1, as implemented in

DnaSP Version 5.0 (Librado and Rozas 2009), which uses a Bayesian statistical method

to reconstruct haplotypes from diploid data (Stephens et al. 2001; Stephens and Donnelly

2003).

Bayesian Estimation of Species Tree

The molecular phylogenetic program BEST (Bayesian estimation of species trees; Liu 2008), which implements a Bayesian hierarchical model while accounting for deep coalescence, was used to estimate the Capsella species tree from this multi-locus dataset. BEST works on concatenated alignments and reportedly does not perform well when there are missing data. Consequently, BEST was run using the 7 loci in this dataset that had the most consistent sampling of individuals across loci (At1g03560, At1g15240, At1g65450, At2g26730, At4g14190, At5g51670 and At5g53020). Alignments were concatenated using MacClade version 4.08 (available from http://macclade.org/). BEST was run in two ways, once using A. thaliana as an outgroup and again including both A. thaliana and N. paniculata (where available). In each case BEST was run twice, with 4 chains for a maximum of 2 million generations, with a burn-in of 200,000 generations, sampling every 100 generations.

Coalescent Simulations

Coalescent simulations were conducted using MIMAR, which estimates the parameters of an isolation-migration model based on Hudson's ms (Hudson 2002).

Simulations were conducted using the 14 loci included in this study. Furthermore, sites

with >2 segregating bases were excluded from the analysis. Coalescent simulations were

run 1) in the absence of migration 2) allowing symmetrical migration and 3) allowing

asymmetrical migration. Effective population sizes were either unconstrained or assumed to be identical in C. grandiflora and the ancestor of C. bursa-pastoris and C. grandiflora.

Prior limits for the Bayesian procedure implemented in MIMAR were set based

on initial runs using wide priors. Priors for θ were uniform 0.001-0.1 for both C. grandiflora and the ancestral species, and uniform 0-0.0025 for C. rubella. All runs assumed an exponentially distributed prior with rate 1 for ρ/θ, and a mutation rate per bp of 1.5 × 10^-8 (Koch et al. 2001). Symmetrical migration rate priors were log uniform from -5 to 2.5. Asymmetrical migration rate priors were log uniform from -5 to 2.5 for migration from C. grandiflora to C. rubella (forward in time) and log uniform from -5 to 6 for migration from C. rubella to C. grandiflora. The prior for the time of the split between C. bursa-pastoris and C. grandiflora was uniform 0-4x10 .
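To give a sense of scale for the priors on the population mutation parameter, the sketch below converts θ to an effective population size using the standard diploid relationship θ = 4Neμ. It assumes that θ is expressed per bp and that the quoted mutation rate of 1.5 × 10^-8 applies per generation; both are assumptions of this illustration, and the function name is hypothetical.

```python
MU_PER_BP = 1.5e-8  # mutation rate per bp quoted in the text (Koch et al. 2001)


def theta_to_ne(theta_per_bp, mu=MU_PER_BP):
    """Effective population size implied by theta = 4 * Ne * mu (diploid)."""
    return theta_per_bp / (4.0 * mu)


if __name__ == "__main__":
    # Bounds of the uniform prior on theta for C. grandiflora and the ancestor.
    for theta in (0.001, 0.1):
        print(f"theta = {theta}: Ne ~ {theta_to_ne(theta):,.0f}")
```

Under these assumptions the prior bounds of 0.001 and 0.1 correspond to effective population sizes of roughly 17,000 and 1.7 million, respectively.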

Simulations were conducted in three ways: 1) including C. grandiflora and C. bursa-pastoris A, 2) including C. grandiflora and C. bursa-pastoris B and 3) including C. bursa-pastoris A and C. bursa-pastoris B. Each simulation was run for a total of 10,080 min (1 week) with a burn-in of 100,000. Mixing was monitored by assessing parameter autocorrelation over runs, and MIMAR was considered to have reached convergence when the posterior distributions from independent runs were highly similar (Becquet and

Przeworski 2007). The mode of the marginal posterior probability distribution was considered as a point estimate for each parameter, and 90% highest posterior density

(HPD) intervals were calculated from the MIMAR output by using the boa package in R

2.9.0 (Smith 2007).
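For readers unfamiliar with HPD intervals, the following minimal sketch shows one common empirical definition: the shortest interval that contains the requested fraction of the posterior samples. It is offered as an illustration only and is not claimed to reproduce the algorithm implemented in the boa package; the function name and the toy samples are hypothetical.

```python
def hpd_interval(samples, mass=0.90):
    """Shortest interval containing about `mass` of the sorted samples."""
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must contain (at least 1, at most n).
    k = max(1, min(n, round(mass * n)))
    # Slide a window of k consecutive samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Hypothetical posterior samples with one extreme value in the upper tail.
toy_samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
toy_hpd = hpd_interval(toy_samples, mass=0.90)
```

With these toy samples the 90% interval excludes the single extreme value, which is the behaviour that distinguishes an HPD interval from an equal-tailed interval.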

To assess the validity of the results from MIMAR, an autopolyploid event

(depicted in Figure S1) was simulated using the coalescent simulation program ms

(Hudson 2002). A total of 14,000 simulations were run. From this output, sixty loci were

randomly selected and the C. bursa-pastoris loci "A" and "B" were assigned where,

based on the number of fixed differences and unique variants, the most diverged locus

from C. grandiflora was designated locus B. These 60 loci were then divided into five

groups of twelve and run through MIMAR in the absence of migration under the three

models as described above.
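The subsampling step described above can be sketched as follows. This is an illustration of the random selection and grouping only, not the script used in this study; the function name, the fixed seed and the use of integer indices to stand for simulated loci are assumptions.

```python
import random


def sample_locus_groups(n_simulated=14000, n_selected=60, n_groups=5, seed=0):
    """Draw loci without replacement and split them into equal-sized groups."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    chosen = rng.sample(range(n_simulated), n_selected)
    size = n_selected // n_groups
    return [chosen[i * size:(i + 1) * size] for i in range(n_groups)]


groups = sample_locus_groups()
```

Each of the five resulting groups holds twelve distinct simulated loci, matching the size of the datasets passed to MIMAR.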

Results

Patterns of Polymorphism

Synonymous diversity as measured by π synonymous (where π is the average number of pairwise differences between two sequences) was estimated in C. grandiflora,

C. rubella and across the A and B loci in C. bursa-pastoris (Figure 2a). In agreement with previous studies (Foxe et al. 2009 (Chapter 2)), levels of synonymous diversity in the outcrossing C. grandiflora were found to be greatly elevated above those found in the

selfing C. rubella. Similarly, synonymous diversity in C. grandiflora was found to be 5- and 7-fold higher than in C. bursa-pastoris A and C. bursa-pastoris B, respectively.

It has been hypothesized that the reduction of diversity in C. rubella is the result of a massive population bottleneck associated with the transition to a selfing lifestyle as

recently as 13,500 years ago (Foxe et al. 2009 (Chapter 2)). The similar reduction in

synonymous diversity in C. bursa-pastoris may also be due to a recent population

bottleneck associated with its origin and/or its selfing lifestyle as it has been well

documented that selfing species display much reduced levels of genetic diversity in

comparison to their outcrossing relatives (Charlesworth and Yang 1998; Baudry et al.

2001; Glémin et al. 2006; Foxe et al. 2009 (Chapter 2)). Average synonymous Tajima's D in C. bursa-pastoris (Tajima's D synonymous = -0.474) was considerably more negative than in C. grandiflora (Tajima's D synonymous = -0.132), consistent with a population expansion following a population bottleneck in C. bursa-pastoris.

Evidence for an autopolyploid origin of C. bursa-pastoris

The minimum number of synonymous substitutions between C. grandiflora and each of the two C. bursa-pastoris alleles, as well as between C. bursa-pastoris A and C. bursa-pastoris B, was calculated (Table 1). Were C. bursa-pastoris an allopolyploid, it would be expected that one of the alleles, when compared to C. grandiflora, would display a significantly greater number of synonymous substitutions than the other allele. Locus by locus, there is, in general, a minimal difference in the number of synonymous substitutions between C. grandiflora and C. bursa-pastoris A versus C. grandiflora and C. bursa-pastoris B. As the more distantly related homoeolog has been designated C. bursa-pastoris B, these results represent the most extreme case. What is clear from these results is that even under an extreme case where all the more distant homoeologs come from 'B', the minimum distance to C. grandiflora is still considerably lower than the distance between C. bursa-pastoris A and C. bursa-pastoris B. These results are perhaps more consistent with an autopolyploid event from two distinct C. grandiflora haplotypes resulting in the formation of C. bursa-pastoris.

At synonymous sites, there were 29 fixed differences between C. bursa-pastoris A and C. bursa-pastoris B, compared with 2 between C. bursa-pastoris A and C. grandiflora and 19 between C. bursa-pastoris B and C. grandiflora (Figure 3).

Again, these results show that both the C. bursa-pastoris "A" and "B" loci are more

closely related to C. grandiflora than they are to each other. These results too seem

consistent with the origin of C. bursa-pastoris from an autopolyploid event from two

distinct C. grandiflora haplotypes. A possible alternative is an allopolyploid event between C. grandiflora and a very close relative.

Molecular Phylogenetics

Taking a molecular phylogenetic approach, the program BEST (Bayesian estimation of species trees), which implements a Bayesian hierarchical model while accounting for deep coalescence, was used to estimate the Capsella species tree from the multi-locus dataset (Liu 2008). BEST was run twice (see Methods), once using A. thaliana as an outgroup (Figure 4a) and again including both A. thaliana and Neslia (where available, Figure 4b). In neither tree is C. grandiflora shown to be more closely related to one of the C. bursa-pastoris loci ("A" or "B") than to the other, and in fact the tree illustrated in Figure 4a represents exactly what we would expect to see if C. bursa-pastoris is an autopolyploid. Once again, these results do not provide strong evidence for an allopolyploid event between two distantly related species.

Demographic Model Fitting

To investigate the timing of a speciation event between C. grandiflora and C.

bursa-pastoris and to quantify effective population size Ne, I used a Markov Chain

Monte Carlo (MCMC) approach based on coalescent simulations to fit a model of

isolation with and without migration to the data (Becquet and Przeworski 2007). This

approach makes use of the observed information from each locus on the number of

shared and unique polymorphisms, as well as fixed differences. The model assumes that a

single ancestral population of size Na split into two at time T, and the two derived populations have distinct effective population sizes (N1 and N2). The coalescent simulations were run under three models: 1) species 1 = C. grandiflora, species 2 = C. bursa-pastoris A; 2) species 1 = C. grandiflora, species 2 = C. bursa-pastoris B; 3) species 1 = C. bursa-pastoris A, species 2 = C. bursa-pastoris B. Each model was run 1) in the absence of gene flow, 2) allowing for symmetrical migration and 3) allowing for asymmetrical migration (Table 2). The models where C. grandiflora is the ancestor assumed equal effective population sizes in the ancestor as in present-day C. grandiflora.

It should again be mentioned that based upon the minimum number of synonymous

substitutions between C. grandiflora and each of the two C. bursa-pastoris alleles, each locus was designated locus A or locus B where locus B has the larger minimum distance compared with locus A (Table 1). Making these assumptions allows for exploration of one extreme possibility where C. bursa-pastoris has originated via allopolyploidization. However, the comparison between C. bursa-pastoris A and C. bursa-pastoris B should not be biased by these assumptions.
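The per-locus summaries used in this model fitting, namely the numbers of shared polymorphisms, unique polymorphisms and fixed differences, can be sketched as below. This is a minimal illustration, not the Perl scripts or the MIMAR input preparation used in this study; the function name and toy sequences are hypothetical, and sites with more than two bases are skipped, mirroring the exclusion described in the Methods.

```python
def classify_sites(sample1, sample2):
    """Count fixed, shared and unique variable sites between two samples."""
    counts = {"fixed": 0, "shared": 0, "unique1": 0, "unique2": 0}
    for col1, col2 in zip(zip(*sample1), zip(*sample2)):
        bases1, bases2 = set(col1), set(col2)
        if len(bases1 | bases2) > 2:
            continue  # sites with more than two segregating bases are excluded
        poly1, poly2 = len(bases1) > 1, len(bases2) > 1
        if poly1 and poly2:
            counts["shared"] += 1   # polymorphic in both samples
        elif poly1:
            counts["unique1"] += 1  # polymorphic in sample 1 only
        elif poly2:
            counts["unique2"] += 1  # polymorphic in sample 2 only
        elif bases1 != bases2:
            counts["fixed"] += 1    # monomorphic in both, for different bases
    return counts


# Hypothetical aligned samples of two sequences each, for illustration only.
toy_sample1 = ["AAACG", "ATTCG"]
toy_sample2 = ["TAACG", "TTAGG"]
toy_counts = classify_sites(toy_sample1, toy_sample2)
```

In the toy data the five columns give one fixed difference, one shared polymorphism, one polymorphism unique to each sample and one invariant site.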

The results from each of these three species pair comparisons are shown in Table

2 and Figure 5. The results from demographic models allowing for symmetrical and

asymmetrical migration show no evidence for migration between C. grandiflora and C.

bursa-pastoris. Consequently, the results outlined below refer to the models run in the

absence of migration.

In comparison to C. grandiflora, C. bursa-pastoris A and C. bursa-pastoris B display a 5- and 7-fold decrease in effective population size, respectively (Figure 5a). These results contrast sharply with previous work, which demonstrated a 100- to 1,500-fold reduction in effective population size in C. rubella in comparison with C. grandiflora,

likely the result of a population bottleneck associated with the transition to selfing (Foxe et al. 2009 (Chapter 2)). It may be that the formation of C. bursa-pastoris as a new species did not result in as severe a bottleneck as in C. rubella. Were C. bursa-pastoris to have a single origin, it is likely that we would see evidence for a strong population bottleneck in the present day population. It may be that evidence of a bottleneck has eroded with time. It is also possible that C. bursa-pastoris is the result of recurrent polyploid formation which would not leave a signature of a severe population bottleneck.

This would be in keeping with the literature, which states that recurrent polyploid formation is vastly more common than a single polyploid speciation event (Soltis and

Soltis 1999).

Divergence time between C. grandiflora and C. bursa-pastoris A was estimated at ~308,000 years and at ~1.1 million years between C. grandiflora and C. bursa-pastoris B (Figure 5a). Divergence estimates between C. bursa-pastoris A and C. bursa-pastoris B lie at ~667,000 years (Figure 5b). While the results from the first set of simulations (Figure 5a) do seem to point to an allopolyploid event, they do not completely rule out an autopolyploid event, as it may be that the assumptions made when designating each of the C. bursa-pastoris alleles "A" and "B" are biasing the results. The results from the second set of simulations, dating the divergence between C. bursa-pastoris A and C. bursa-pastoris B (Figure 5b), give an approximate average of the divergence times between C. grandiflora and C. bursa-pastoris A and between C. grandiflora and C. bursa-pastoris B, consistent with an autopolyploid origin of C. bursa-pastoris.
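The averaging argument can be checked with simple arithmetic. The sketch below uses the three point estimates quoted in this paragraph; the variable and function names are introduced for illustration only.

```python
def mean_divergence(t_a, t_b):
    """Arithmetic mean of two divergence time estimates (years)."""
    return (t_a + t_b) / 2.0


# Point estimates quoted in the text (years).
T_GRANDIFLORA_VS_A = 308_000
T_GRANDIFLORA_VS_B = 1_100_000
T_A_VS_B = 667_000

midpoint = mean_divergence(T_GRANDIFLORA_VS_A, T_GRANDIFLORA_VS_B)
ratio = T_A_VS_B / midpoint
```

The mean of the two C. grandiflora comparisons is 704,000 years, so the estimate of ~667,000 years between the A and B loci lies within about 5% of it.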

To assess the plausibility of an autopolyploid model, an autopolyploid event

(depicted in Figure S1) was simulated using the coalescent simulation program ms (Hudson 2002). The distributions of synonymous variants in the observed and five simulated datasets were compared (Figures 6 and 7). The numbers of shared and unique synonymous variants, as well as synonymous fixed differences, in all simulated datasets were found to be comparable to those in the observed data.

These simulated results were run through MIMAR in the absence of migration

(see Methods). The results of these simulations are presented in Table S3. Each of the five simulated datasets recapitulates the observed results under the models where 1) species 1 = C. grandiflora, species 2 = C. bursa-pastoris A and 2) species 1 = C. grandiflora, species 2 = C. bursa-pastoris B. In each of the simulated datasets, as in the observed data, an asymmetry in divergence time can be seen, where the average simulated divergence time is ~325,000 years between C. grandiflora and C. bursa-pastoris A and ~967,000 years between C. grandiflora and C. bursa-pastoris B. While the divergence time between C. bursa-pastoris A and C. bursa-pastoris B fluctuates across simulated datasets, two of the datasets give divergence time estimates similar to those from observed data. Overall, the congruence between observed and simulated data serves to validate the use of our autopolyploid model in MIMAR.

Discussion

Previous studies investigating the evolutionary history of the Capsella genus have

inferred a single origin of the selfing C. rubella from the outcrossing C. grandiflora as

little as 13,500 years ago. Furthermore, this speciation event is thought to have resulted in

a massive population bottleneck in C. rubella (Foxe et al. 2009 (Chapter 2); Guo et al.

2009). Here, these results suggest multiple origins of C. bursa-pastoris via

autopolyploidization involving two distinct C. grandiflora haplotypes.

While C. grandiflora and C. rubella are thought to have an extremely recent

divergence time, results from demographic model fitting estimate a divergence time of

approximately 667,000 years between C. grandiflora and C. bursa-pastoris. With the

divergence of C. rubella from C. grandiflora dated at approximately 13,500 years ago, these results conclusively exclude a role for C. rubella in the formation of C. bursa-pastoris.

The question of whether C. bursa-pastoris is an allopolyploid or an autopolyploid

has been the subject of some controversy in the literature (Hurka et al. 1989; Hurka and

Neuffer 1997; Slotte et al. 2006). Various studies have resulted in often-conflicting

theories as to the evolutionary origins of C. bursa-pastoris. Direct comparisons of the

minimum number of synonymous changes between C. grandiflora and C. bursa-pastoris,

comparisons in fixation patterns as well as a molecular phylogenetic approach all point to

an autopolyploid event resulting in C. bursa-pastoris.

Were C. bursa-pastoris to have a single origin, we might expect to see evidence

for a population bottleneck. While there is evidence for a reduction in effective population size in C. bursa-pastoris relative to its ancestor, the scale of this reduction is

not comparable to the bottleneck observed in C. rubella (where a 100- to 1,500-fold reduction in

Ne has been estimated (Foxe et al. 2009 (Chapter 2))). It is possible that evidence for a population bottleneck has eroded with time as the divergence of C. bursa-pastoris from

C. grandiflora is approximately 50 times older than the divergence of C. rubella from C. grandiflora. It is likely, however, that C. bursa-pastoris had multiple origins, which would not leave a signature of an extreme population bottleneck.

In keeping with earlier results (Foxe et al. 2009 (Chapter 2)), levels of synonymous diversity were found to be reduced in C. rubella when compared to C. grandiflora. However, these estimates are 2.3 times higher than previous studies indicate.

Much of this diversity found in C. rubella is driven by the DFR locus (π synonymous =

0.1539). Dihydroflavonol reductase (DFR) is a key enzyme in anthocyanin synthesis

(Holton and Cornish 1995). Anthocyanins play a vital role in the synthesis of brick red,

red and blue pigments in plants (Holton and Cornish 1995). Interestingly, one of the

diagnostic characteristics of C. rubella is the presence of the reddish tinge to its fruits.

Recent unpublished flower bud expression data generated from Illumina mRNA

sequencing runs show an approximate 5-fold increase in DFR expression in C. rubella

relative to C. grandiflora (S. Wright, personal communication). However this does not

explain the increased levels of diversity present in DFR in C. rubella. Anthocyanins are

also known to play a role in UV protection and defence against pathogens (Koes et al.

1994). Disease resistance and defence-related genes are often subject to balancing

selection resulting from continuing plant-pathogen dynamics (Tiffin and Moeller 2006).

It may be that DFR is under balancing selection in C. rubella, which would result in

elevated levels of genetic diversity.

While these results shed much light on the evolutionary origin of C. bursa-pastoris and establish the molecular phylogeny of the Capsella genus, little is still known about the extensive phenotypic changes that have occurred in both C. bursa-pastoris and

C. rubella. Understanding the genomic context and underlying evolutionary forces that have prompted these changes will be of considerable interest in future studies.

References

Acarkan A., M. Rossberg, M. Koch, and R. Schmidt (2000) Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J 23: 55-6

Ainouche, M. L., A. Baumel, and A. Salmon. 2004. Spartina anglica C. E. Hubbard: a

natural model system for analysing early evolutionary changes that affect

allopolyploid genomes. Biol J Lin Soc 82:475-484.

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local

alignment search tool. J Mol Biol 215:403-410.

Baudry, E., C. Kerdelhue, H. Innan, and W. Stephan. 2001. Species and recombination

effects on DNA variability in the tomato genus. Genetics 158:1725-1735.

Becquet, C, and M. Przeworski. 2007. A new approach to estimate parameters of

speciation models with application to apes. Genome Res 17:1505-1519.

Boivin K., A. Acarkan, R. S. Mbulu, O. Clarenz and R. Schmidt (2004) The Arabidopsis

genome sequence as a tool for genome analysis in Brassicaceae. A comparison of

the Arabidopsis and Capsella rubella genomes. Plant Physiol 135: 735-744

Charlesworth, D., and Z. Yang. 1998. Allozyme diversity in Leavenworthia populations

with different inbreeding levels. Heredity 81:453-461.

Coyne, J. A., and H. A. Orr. 2004. Speciation. Sinauer Associates, Inc., Sunderland,

Massachusetts.

Cronn, R., M. Cedroni, T. Haselkorn, C. Grover, and J. F. Wendel. 2002. PCR-mediated

recombination in amplification products derived from polyploid cotton. Theor

Appl Genet 104:482-489.

Doyle, J. J., and A. N. Egan. 2009. Dating the origins of polyploidy events. New Phytol.

Foxe, J. P., T. Slotte, E. A. Stahl, B. Neuffer, H. Hurka, and S. I. Wright. 2009. Rapid

morphological evolution and speciation associated with the evolution of selfing in

Capsella. PNAS 106:5241-5245.

Gill, N., S. Findley, J. G. Walling, C. Hans, J. Ma, J. Doyle, G. Stacey, and S. A.

Jackson. 2009. Molecular and chromosomal evidence for allopolyploidy in

soybean. Plant Physiol 151:1167-1174.

Glémin, S., E. Bazin, and D. Charlesworth. 2006. Impact of mating systems on patterns of

sequence polymorphism in flowering plants. Proc Biol Sci 273:3011-3019.

Grubbs, K. C, R. L. Small, and E. E. Schilling. 2009. Evidence for multiple, autoploid

origins of agamospermous populations in Eupatorium sessilifolium (Asteraceae).

Plant Sys Evol 279:151-161.

Guo, Y.-L., J. S. Bechsgaard, T. Slotte, B. Neuffer, M. Lascoux, D. Weigel, and M. H. Schierup. 2009. Recent speciation of Capsella rubella from Capsella grandiflora, associated with loss of self-incompatibility and an extreme bottleneck. PNAS 106:5246-5251.

Hazzouri, K. M., A. Mohajer, S. I. Dejak, S. P. Otto, and S. I. Wright. 2008. Contrasting

patterns of transposable-element insertion polymorphism and nucleotide diversity

in autotetraploid and allotetraploid Arabidopsis species. Genetics 179:581-592.

Hegarty, M. J., and S. J. Hiscock. 2008. Genomic clues to the evolutionary success of

polyploid plants. Curr Biol 18:R435-444.

Hey, J. 2010. Isolation with migration models for more than two populations. Mol Biol

Evol 27:905-920.

Hey, J., and R. Nielsen. 2007. Integration within the Felsenstein equation for improved

Markov chain Monte Carlo methods in population genetics. PNAS 104:2785-

2790.

Hintz, M., C. Bartholmes, P. Nutt, J. Ziermann, S. Hameister, B. Neuffer, and G.

Theissen. 2006. Catching a 'hopeful monster': shepherd's purse (Capsella bursa-

pastoris) as a model system to study the evolution of flower development. J Exp

Bot 57:3531-3542.

Holton, T. A., and E. C. Cornish. 1995. Genetics and Biochemistry of Anthocyanin

Biosynthesis. The Plant Cell 7:1071-1083.

Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model of genetic

variation. Bioinformatics 18:337-338.

Hurka, H., S. Freundner, A. H. Brown, and U. Plantholt. 1989. Aspartate

aminotransferase isozymes in the genus Capsella (Brassicaceae): subcellular

location, gene duplication, and polymorphism. Biochemical genetics 27:77-90.

Hurka, H., and B. Neuffer. 1997. Evolutionary processes in the genus Capsella

(Brassicaceae). Plant Sys Evol 206:295-316.

Jakobsson, M., J. Hagenblad, S. Tavaré, T. Säll, C. Halldén, C. Lind-Halldén, and M.

Nordborg. 2006. A unique recent origin of the allotetraploid species Arabidopsis

suecica: Evidence from nuclear DNA markers. Mol Biol Evol 23:1217-1231.

Koch, M., B. Haubold, and T. Mitchell-Olds. 2001. Molecular systematics of the

Brassicaceae: evidence from coding plastidic matK and nuclear Chs sequences.

Am J Bot 88:534-544.

Kochert, G., H. T. Stocker, M. Gimenes, L. Galgaro, C. R. Lopes, and K. Moore. 1996.

RFLP and cytogenetic evidence on the origin and evolution of allotetraploid

domesticated peanut, Arachis hypogaea (Leguminosae). Am J Bot 83:1282-1291.

Koes, R. E., F. Quattrocchio, and J. N. M. Mol. 1994. The flavonoid biosynthetic pathway in

plants: function and evolution. BioEssays 16:123-132.

Levy, A. A., and M. Feldman. 2002. The impact of polyploidy on grass genome

evolution. Plant Physiol 130:1587-1593.

Librado, P., and J. Rozas. 2009. DnaSP v5: a software for comprehensive analysis of

DNA polymorphism data. Bioinformatics 25:1451-1452.

Liu, L. 2008. BEST: Bayesian estimation of species trees under the coalescent model.

Bioinformatics 24:2542-2543.

Mallet, J. 2007. Hybrid speciation. Nature 446:279-283.

Meimberg, H., K. J. Rice, N. F. Milan, C. C. Njoku, and J. K. McKay. 2009. Multiple

origins promote the ecological amplitude of allopolyploid Aegilops (Poaceae).

American Journal of Botany 96:1262-1273.

Müntzing, A. 1930. Outlines to a genetic monograph of the genus Galeopsis with special

reference to the nature and inheritance of partial sterility. Hereditas 13:185-341.

Nicholas, K. B., Nicholas, H.B. Jr., Deerfield, D.W. II. 1997. GeneDoc: Analysis and

Visualization of Genetic Variation. EMBNEW.NEWS 4:14.

Nielsen, R., and J. Wakeley. 2001. Distinguishing migration from isolation: a Markov

chain Monte Carlo approach. Genetics 158:885-896.

Noor, M. A., and J. L. Feder. 2006. Speciation genetics: evolving approaches. Nature

reviews 7:851-861.

Otto, S. P., and J. Whitton. 2000. Polyploid incidence and evolution. Ann Rev Gen

34:401-437.

Ramos-Onsins, S. E., B. E. Stranger, T. Mitchell-Olds, and M. Aguade. 2004. Multilocus

analysis of variation and speciation in the closely related species Arabidopsis

halleri and A. lyrata. Genetics 166:373-388.

Ramsey, J., and D. Schemske. 1998. Pathways, mechanisms and rates of polyploid

formation in flowering plants. Ann Rev Ecol Sys 29:467-501.

Raybould AF, G. A., Lawrence MJ, Marshall DF. 1991. The evolution of Spartina

anglica C.E. Hubbard (Gramineae): origin and genetic variability. Biological

Journal of the Linnean Society 83:1282-1291.

Ross-Ibarra, J., S. I. Wright, J. P. Foxe, A. Kawabe, L. DeRose-Wilson, G. Gos, D.

Charlesworth, and B. S. Gaut. 2008. Patterns of Polymorphism and Demographic

History in Natural Populations of Arabidopsis lyrata. PloS One 3:e2411.

Shimizu-Inatsugi, R., J. Lihova, H. Iwanaga, H. Kudoh, K. Marhold, O. Savolainen, K.

Watanabe, V. V. Yakubov, and K. K. Shimizu. 2009. The allopolyploid

Arabidopsis kamchatica originated from multiple individuals of Arabidopsis

lyrata and Arabidopsis halleri. Mol Ecol 18:4024-4048.

Skalinska, M. 1945. Cytogenetic studies in triploid hybrids of Aquilegia. J Gen 47:87-

111.

Slotte, T., A. Ceplitis, B. Neuffer, H. Hurka, and M. Lascoux. 2006. Intrageneric

phylogeny of Capsella (Brassicaceae) and the origin of the tetraploid C. bursa-

pastoris based on chloroplast and nuclear DNA sequences. Am J Bot 93:1714-

1724.

Slotte, T., H. Huang, M. Lascoux, and A. Ceplitis. 2008. Polyploid speciation did not

confer instant reproductive isolation in Capsella (Brassicaceae). Mol Biol Evol

25:1472-1481.

Smith, B. J. 2007. boa: An R Package for MCMC Output Convergence Assessment and

Posterior Inference. Journal of Statistical Software 21:1-37.

Soltis, D. E., and P. S. Soltis. 1993. Molecular data and the dynamic nature of

polyploidy. Crit Rev Plant Sci 12:243-273.

Soltis, D. E., and P. S. Soltis. 1999. Polyploidy: recurrent formation and genome

evolution. TREE 14:348-352.

Soltis, D. E., P. S. Soltis, and J. A. Tate. 2003. Advances in the study of polyploidy since

Plant Speciation. New Phyt 161:173-191.

Stebbins, G. L. 1971. Chromosomal Evolution in Higher Plants. Arnold, London, UK.

Stebbins, G. L. 1950. Variation and evolution in plants. Columbia University Press, New

York, New York, USA.

Stephens, M., and P. Donnelly. 2003. A comparison of bayesian methods for haplotype

reconstruction from population genotype data. AM J Hum Gen 73:1162-1169.

Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplotype

reconstruction from population data. AM J Hum Gen 68:978-989.

Symonds, V. V., P. S. Soltis, and D. E. Soltis. 2010. Dynamics of Polyploid Formation in

Tragopogon (Asteraceae): Recurrent Formation, Gene Flow, and Population

Structure. Evolution 64:1984-2003.

Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA

polymorphism. Genetics 123:585-595.

Tajima, F. 1993. Measurement of DNA polymorphism. Pp. 37-60 in N. Takahata, and A.

Clark, eds. Mechanisms of molecular evolution. Japan Scientific Societies Press,

Tokyo.

Tiffin, P., and D. A. Moeller. 2006. Molecular evolution of plant immune system genes.

Trends Genet 22:662-670.

Wakeley, J., and J. Hey. 1997. Estimating ancestral population parameters. Genetics

145:847-855.

Wang, R. L., J. Wakeley, and J. Hey. 1997. Gene flow and natural selection in the origin

of Drosophila pseudoobscura and close relatives. Genetics 147:1091-1106.

Watterson, G. A. 1975. On the number of segregating sites in genetical models without

recombination. Theor Popul Biol 7:256-276.

Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H.

Rieseberg. 2009. The frequency of polyploid speciation in vascular plants. PNAS

106:13875-13879.

Zhang, L.-B., and S. Ge. 2007. Multilocus analysis of nucleotide variation and speciation

in Oryza officinalis and its close relatives. Mol Biol Evol

Table 1. Minimum number of synonymous substitutions and number of fixed differences between C. bursa-pastoris A and C. grandiflora, C. bursa-pastoris B and C. grandiflora, and C. bursa-pastoris A and C. bursa-pastoris B. Numbers of fixed differences are given in parentheses.

Locus       C. bursa-pastoris A    C. bursa-pastoris B    C. bursa-pastoris A
            and C. grandiflora     and C. grandiflora     and C. bursa-pastoris B
At1g03560   0(0)                   2(1)                   1(1)
At1g15240   1(1)                   2(1)                   4(4)
At1g65450   2(1)                   2(2)                   4(2)
At1g77120   0(0)                   2(2)                   2(2)
At2g18790   0(0)                   4(4)                   6(5)
At2g26730   2(0)                   2(1)                   8(4)
At4g00650   1(0)                   2(0)                   3(2)
At4g02560   0(0)                   0(0)                   0(0)
At4g08920   0(0)                   6(6)                   6(4)
At4g14190   0(0)                   2(0)                   4(1)
At5g10140   0(0)                   0(0)                   0(0)
At5g42800   3(0)                   5(0)                   0(0)
At5g51670   2(0)                   4(1)                   4(3)
At5g53020   0(0)                   1(1)                   1(1)

Table 2. Modes of parameter estimates under a range of MIMAR models. 90% HPD intervals are given in parentheses.

Constrained model; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris A
    θ(A) (a)          0.02588 (0.01930, 0.03711)
    θ(Sp1) (a)        0.02588 (0.01930, 0.03711)
    θ(Sp2) (a)        0.00307 (0.00149, 0.00631)
    M(Sp1-Sp2) (b)    n/a
    M(Sp2-Sp1) (c)    n/a
    T (d)             278,278 (116,568, 518,126)

Constrained model; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris B
    θ(A) (a)          0.02601 (0.01951, 0.03499)
    θ(Sp1) (a)        0.002601 (0.01951, 0.03496)
    θ(Sp2) (a)        0.00522 (0.00256, 0.00842)
    M(Sp1-Sp2) (b)    n/a
    M(Sp2-Sp1) (c)    n/a
    T (d)             1,060,000 (639,571, 1,567,280)

Unconstrained model; Species 1 is C. bursa-pastoris A and Species 2 is C. bursa-pastoris B
    θ(A) (a)          0.03843 (0.01891, 0.07939)
    θ(Sp1) (a)        0.00526 (0.00255, 0.00851)
    θ(Sp2) (a)        0.00424 (0.00218, 0.00771)
    M(Sp1-Sp2) (b)    n/a
    M(Sp2-Sp1) (c)    n/a
    T (d)             562,563 (220,087, 1,159,130)

Constrained model, symmetrical migration; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris A
    θ(A) (a)          0.02524 (0.01721, 0.03431)
    θ(Sp1) (a)        0.02524 (0.01721, 0.03431)
    θ(Sp2) (a)        0.00307 (0.00129, 0.006013)
    M(Sp1-Sp2) (b)    0.33664 (0.00674, 0.85197)
    M(Sp2-Sp1) (c)    n/a
    T (d)             314,314 (75,036, 2,203,320)

Constrained model, symmetrical migration; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris B
    θ(A) (a)          0.02538 (0.01854, 0.03438)
    θ(Sp1) (a)        0.02538 (0.01854, 0.03438)
    θ(Sp2) (a)        0.00413 (0.00167, 0.00750)
    M(Sp1-Sp2) (b)    0.03912 (0.00674, 0.12639)
    M(Sp2-Sp1) (c)    n/a
    T (d)             1,263,260 (730,954, 2,029,360)

Unconstrained model, symmetrical migration; Species 1 is C. bursa-pastoris A and Species 2 is C. bursa-pastoris B
    θ(A) (a)          0.02656 (0.00250, 0.06506)
    θ(Sp1) (a)        0.00516 (0.00287, 0.00902)
    θ(Sp2) (a)        0.00395 (0.00191, 0.00746)
    M(Sp1-Sp2) (b)    0.01214 (0.00674, 0.04608)
    M(Sp2-Sp1) (c)    n/a
    T (d)             1,279,280 (397,615, 2,480,350)

Constrained model, asymmetrical migration; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris A
    θ(A) (a)          0.02524 (0.01814, 0.03494)
    θ(Sp1) (a)        0.02524 (0.01814, 0.03494)
    θ(Sp2) (a)        0.00200 (0.00045, 0.00514)
    M(Sp1-Sp2) (b)    0.00866 (0.00674, 0.63413)
    M(Sp2-Sp1) (c)    2.84202 (0.00675, 5.89006)
    T (d)             342,342 (107,360, 3,493,050)

Constrained model, asymmetrical migration; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris B
    θ(A) (a)          0.026374 (0.01889, 0.03486)
    θ(Sp1) (a)        0.026374 (0.01889, 0.03486)
    θ(Sp2) (a)        0.00323 (0.00123, 0.00669)
    M(Sp1-Sp2) (b)    0.00707 (0.00674, 0.09163)
    M(Sp2-Sp1) (c)    0.24184 (0.00675, 0.79889)
    T (d)             1,347,350 (727,725, 2,537,080)

Unconstrained model, asymmetrical migration; Species 1 is C. bursa-pastoris A and Species 2 is C. bursa-pastoris B
    θ(A) (a)          0.019028 (0.00250, 0.07184)
    θ(Sp1) (a)        0.00555 (0.00292, 0.00928)
    θ(Sp2) (a)        0.00384 (0.00171, 0.00712)
    M(Sp1-Sp2) (b)    0.00707 (0.00674, 0.06124)
    M(Sp2-Sp1) (c)    0.01546 (0.00674, 0.11387)
    T (d)             1,355,360 (393,270, 2,812,750)

(a) θ (4Neμ, where Ne is the effective population size and μ is the mutation rate, 1.5 × 10^-8) for the ancestor (A), species 1 (Sp1) and species 2 (Sp2)

(b) Migration rate (4Nem) from Species 1 to Species 2

(c) Migration rate (4Nem) from Species 2 to Species 1

(d) Time of the split of Species 1 and Species 2
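Because θ is defined in the footnote as 4Neμ, each θ estimate can be converted to an effective population size by dividing by 4μ. The following Python sketch shows that arithmetic for two of the tabulated modes. The function name is illustrative, and it assumes the stated mutation rate of 1.5 × 10^-8 applies on the same per-site scale as θ.

```python
MU = 1.5e-8  # mutation rate stated in the table footnote


def ne_from_theta(theta: float, mu: float = MU) -> float:
    """Invert theta = 4 * Ne * mu to obtain the effective population size Ne."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return theta / (4.0 * mu)


if __name__ == "__main__":
    # Modes from the first constrained model (C. grandiflora vs. C. bursa-pastoris A).
    for label, theta in [("ancestor / Sp1", 0.02588), ("Sp2", 0.00307)]:
        print(f"{label}: theta = {theta}, Ne = {ne_from_theta(theta):,.0f}")
```

Under these assumptions the ancestral mode corresponds to an Ne of roughly 431,000 and the C. bursa-pastoris A mode to roughly 51,000.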

Figure 1. Floral organs and petals are reduced in C. bursa-pastoris and C. rubella (left and middle respectively) compared with C. grandiflora (right).

Figure 2. Comparison of silent polymorphism patterns between C. bursa-pastoris A, C. bursa-pastoris B, C. grandiflora and C. rubella, given by synonymous π, where π is the average pairwise difference. Bars represent the median, boxes the interquartile range, and whiskers extend out to 1.5 times the interquartile range.
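The diversity statistic in Figure 2, the average pairwise difference, can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis pipeline used in the thesis: it assumes equal-length aligned sequences with no missing data, and it treats every site alike instead of restricting to synonymous sites. The function name and toy alignment are hypothetical.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Average number of pairwise differences per site among aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions for every pair of sequences.
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / (len(pairs) * length)


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACGT"]
    print(pairwise_diversity(toy_alignment))
```

In the toy alignment, two of the three sequence pairs differ at one of four sites, so the per-site value is 2/12.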

Figure 3. Number of synonymous fixed differences between all pairs of C. bursa-pastoris A, C. bursa-pastoris B, C. grandiflora and C. rubella.

Figure 4. Bayesian estimates of species trees of the Capsella genus generated using BEST including the close relatives a) A. thaliana and b) A. thaliana and Neslia as outgroups. The numbers at each node represent branch lengths.

[Tree tips. Panel A: A. thaliana, C. bursa-pastoris A, C. bursa-pastoris B, C. grandiflora, C. rubella. Panel B: A. thaliana, Neslia, C. bursa-pastoris A, C. bursa-pastoris B, C. grandiflora, C. rubella.]

Figure 5. Marginal posterior distributions of speciation parameters estimated by MIMAR, with posterior modes showing good fit to data summaries. θ = 4Neμ, where Ne is the effective population size and μ is the mutation rate (1.5 × 10^-8).

a) Constrained model assuming equal effective population sizes in the ancestor as in present-day C. grandiflora.

Model 1; Species 1 = C. grandiflora, Species 2 = C. bursa-pastoris A is shown in blue

Model 2; Species 1 = C. grandiflora, Species 2 = C. bursa-pastoris B is shown in red

θ1 C. grandiflora

θ2 C. bursa-pastoris A (given in blue) and C. bursa-pastoris B (given in red)

Tgen Divergence time (years) between C. grandiflora and C. bursa-pastoris A (in blue) and C. bursa-pastoris B (in red)

b) Unconstrained model

θA ancestral C. grandiflora

θ1 C. bursa-pastoris A (given in blue)

θ2 C. bursa-pastoris B (given in red)

Tgen Divergence time (years) between C. bursa-pastoris A and C. bursa-pastoris B

Figure 6. Numbers of unique synonymous variants for each of the three species pairs a) C. bursa-pastoris A and C. bursa-pastoris B, b) C. grandiflora and C. bursa-pastoris A and c) C. grandiflora and C. bursa-pastoris B. Shown are the observed values as well as 5 simulated datasets for which an autopolyploid event was simulated using the coalescent simulation program ms (see Methods).

Figure 7. Numbers of a) synonymous shared variants and b) synonymous fixed differences for each of the three species pairs 1) C. bursa-pastoris A and C. bursa-pastoris B, 2) C. grandiflora and C. bursa-pastoris A and 3) C. grandiflora and C. bursa-pastoris B. Shown are the observed values as well as 5 simulated datasets for which an autopolyploid event was simulated using the coalescent simulation program ms (see Methods).
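The site categories summarized in Figures 6 and 7 (variants unique to one sample, variants shared by both, and fixed differences) can be illustrated with a minimal Python sketch. It assumes two aligned samples of equal length with no missing data, and it counts all sites rather than synonymous sites only. The function name and the toy sequences are hypothetical and are not taken from the thesis.

```python
def classify_sites(sample1, sample2):
    """Count unique, shared and fixed sites between two aligned samples."""
    counts = {"unique_1": 0, "unique_2": 0, "shared": 0, "fixed": 0}
    # zip(*sample) iterates over alignment columns.
    for col1, col2 in zip(zip(*sample1), zip(*sample2)):
        alleles1, alleles2 = set(col1), set(col2)
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            # Shared only if at least two alleles segregate in both samples.
            if len(alleles1 & alleles2) >= 2:
                counts["shared"] += 1
        elif poly1:
            counts["unique_1"] += 1
        elif poly2:
            counts["unique_2"] += 1
        elif alleles1 != alleles2:
            # Both samples monomorphic, but for different alleles.
            counts["fixed"] += 1
    return counts


if __name__ == "__main__":
    sample_a = ["AAGT", "TCGT"]
    sample_b = ["AAGC", "ACCC"]
    print(classify_sites(sample_a, sample_b))
```

In the toy data, the four columns fall into the four categories in turn: the first is polymorphic only in the first sample, the second is polymorphic for the same alleles in both, the third is polymorphic only in the second sample, and the fourth is a fixed difference.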

Supplementary Information

Table S1. Accession name and origin of each individual used in this study.

Species            Accession           Origin
C. bursa-pastoris  AQ404               China
C. bursa-pastoris  AQ416               China
C. bursa-pastoris  BJB234              China
C. bursa-pastoris  BJB236              China
C. bursa-pastoris  CSH1                China
C. bursa-pastoris  CSH2                China
C. bursa-pastoris  CSH3                China
C. bursa-pastoris  DL167               China
C. bursa-pastoris  DL170               China
C. bursa-pastoris  GY31                China
C. bursa-pastoris  GY36                China
C. bursa-pastoris  HD11                China
C. bursa-pastoris  HD6                 China
C. bursa-pastoris  HD63                China
C. bursa-pastoris  HD68                China
C. bursa-pastoris  HRB135              China
C. bursa-pastoris  HRB137              China
C. bursa-pastoris  HY7                 China
C. bursa-pastoris  HY76                China
C. bursa-pastoris  HY89                China
C. bursa-pastoris  JZH1                China
C. bursa-pastoris  JZH145              China
C. bursa-pastoris  JZH147              China
C. bursa-pastoris  KMB205              China
C. bursa-pastoris  KMB21               China
C. bursa-pastoris  KMB210              China
C. bursa-pastoris  LN16 (DL)           China
C. bursa-pastoris  NCH3                China
C. bursa-pastoris  NCH350              China
C. bursa-pastoris  NCH355              China
C. bursa-pastoris  NCHJ (NCH)          China
C. bursa-pastoris  NJ218               China
C. bursa-pastoris  NJ222               China
C. bursa-pastoris  QD3                 China
C. bursa-pastoris  QD318               China
C. bursa-pastoris  QD319               China
C. bursa-pastoris  QD4                 China
C. bursa-pastoris  QF3                 China
C. bursa-pastoris  QF335               China
C. bursa-pastoris  QF340               China
C. bursa-pastoris  SHX2 (XA)           China
C. bursa-pastoris  SHX3                China
C. bursa-pastoris  TY11                China
C. bursa-pastoris  TY121               China
C. bursa-pastoris  TY123               China
C. bursa-pastoris  WH2                 China
C. bursa-pastoris  WH48                China
C. bursa-pastoris  WH49                China
C. bursa-pastoris  XA106               China
C. bursa-pastoris  XA110               China
C. bursa-pastoris  XN442               China
C. bursa-pastoris  XY11                China
C. bursa-pastoris  XY18                China
C. bursa-pastoris  XY20                China
C. bursa-pastoris  ZZH2                China
C. bursa-pastoris  ZZH279              China
C. bursa-pastoris  ZZH284              China
C. bursa-pastoris  ZZH3                China
C. bursa-pastoris  6 20                France
C. bursa-pastoris  FR4 2               France
C. bursa-pastoris  ISR3 3 (MAE-ISR3)   Israel
C. bursa-pastoris  22 19               Italy
C. bursa-pastoris  32 25               Italy
C. bursa-pastoris  32 27               Italy
C. bursa-pastoris  39 12               Italy
C. bursa-pastoris  BEL RU13-1          Russia
C. bursa-pastoris  MOG RU10            Russia
C. bursa-pastoris  OBL RU5             Russia
C. bursa-pastoris  RU10 2 (MOG-RU10)   Russia
C. bursa-pastoris  RU3 1 (KAB-RU3)     Russia
C. bursa-pastoris  RU3 2 (KAB-RU3)     Russia
C. bursa-pastoris  RU5 5 (OBL-RU5)     Russia
C. bursa-pastoris  RU8 1 (BAD-RU8)     Russia
C. bursa-pastoris  VLA RU2             Russia
C. bursa-pastoris  1272 16 2           Spain
C. bursa-pastoris  TWPL                Taiwan
C. bursa-pastoris  CHPL (TWPL)         ?
C. bursa-pastoris  CHPY (TWTY)         ?
C. grandiflora     8c block3           Greece
C. grandiflora     1 8h (8h block1)    Greece
C. grandiflora     10h                 Greece
C. grandiflora     11f                 Greece
C. grandiflora     2e_block5           Greece
C. grandiflora     4a block5           Greece
C. grandiflora     4o                  Greece
C. grandiflora     7g block3           Greece
C. grandiflora     910 18 1            Greece (Corfu)
C. grandiflora     910 18 2            Greece (Corfu)
C. grandiflora     910 20 1            Greece (Corfu)
C. grandiflora     910 20 2            Greece (Corfu)
C. grandiflora     910 21 1            Greece (Corfu)
C. grandiflora     910 21 2            Greece (Corfu)
C. grandiflora     918 1 1             Greece (Corfu)
C. grandiflora     918 1 2             Greece (Corfu)
C. grandiflora     921 3 1             Greece (Corfu)
C. grandiflora     921 3 2             Greece (Corfu)
C. grandiflora     925 14 1            Greece (Katara Pass)
C. grandiflora     925 14 2            Greece (Katara Pass)
C. grandiflora     925 19 1            Greece (Ioannina)
C. grandiflora     925 19 2            Greece (Ioannina)
C. grandiflora     925 21A 1           Greece (Ioannina)
C. grandiflora     925 21A 2           Greece (Ioannina)
C. grandiflora     925 26A 1           Greece (Ioannina)
C. grandiflora     925 26A 2           Greece (Ioannina)
C. grandiflora     925 9A 1            Greece (Ioannina)
C. grandiflora     925 9A 2            Greece (Ioannina)
C. grandiflora     926 4 1             Greece (Corfu)
C. grandiflora     926 4 2             Greece (Corfu)
C. grandiflora     926 8 1             Greece (Corfu)
C. grandiflora     926 8 2             Greece (Corfu)
C. grandiflora     933 14 1            Greece (Katara Pass)
C. grandiflora     933 14 2            Greece (Katara Pass)
C. grandiflora     933 15 1            Greece (Katara Pass)
C. grandiflora     933 15 2            Greece (Katara Pass)
C. grandiflora     933 18 1            Greece (Katara Pass)
C. grandiflora     933 18 1            Greece (Katara Pass)
C. grandiflora     933 18 2            Greece (Katara Pass)
C. grandiflora     933 18 2            Greece (Katara Pass)
C. grandiflora     934 31 1            Greece (Metsovo)
C. grandiflora     934 31 2            Greece (Metsovo)
C. grandiflora     934 32 1            Greece (Metsovo)
C. grandiflora     934 32 2            Greece (Metsovo)
C. grandiflora     935 13 1            Greece (Corfu)
C. grandiflora     935 13 2            Greece (Corfu)
C. grandiflora     935 15A 1           Greece (Corfu)
C. grandiflora     935 15A 2           Greece (Corfu)
C. grandiflora     935 17 1            Greece (Corfu)
C. grandiflora     935 17 2            Greece (Corfu)
C. grandiflora     935 4A 1            Greece (Corfu)
C. grandiflora     935 4A 2            Greece (Corfu)
C. grandiflora     91                  Greece
C. rubella         60 1                Algeria
C. rubella         1377 10A            Argentina
C. rubella         1377 1A             Argentina
C. rubella         1377 2A             Argentina
C. rubella         1377 5A             Argentina
C. rubella         6 24                France
C. rubella         6 25                France
C. rubella         6 26                France
C. rubella         63 1                France
C. rubella         FR1 3 (U S-FR1)     France
C. rubella         FR4 1 (STJ-FR4)     France
C. rubella         U S FR1             France
C. rubella         925 3               Greece
C. rubella         925 6               Greece
C. rubella         AET ISR1            Israel
C. rubella         ENY ISR2            Israel
C. rubella         ISR1 6 (AET-ISR1)   Israel
C. rubella         ISR2 2 (ENY-ISR2)   Israel
C. rubella         2 3                 Italy
C. rubella         2 5                 Italy
C. rubella         22 12               Italy
C. rubella         22 15               Italy
C. rubella         23 13               Italy
C. rubella         27 6                Italy
C. rubella         28 9                Italy
C. rubella         3 11                Italy
C. rubella         32 13               Italy
C. rubella         32 15               Italy
C. rubella         32 17               Italy
C. rubella         35 13               Italy
C. rubella         39 5                Italy
C. rubella         49 13               Italy
C. rubella         8 5                 Luxembourg
C. rubella         4 23                Spain
C. rubella         50 14               Spain
C. rubella         1209 38A            Spain (Canary Islands)
C. rubella         1209 24             Spain (Canary Islands)
C. rubella         1209 26             Spain (Canary Islands)
C. rubella         1209 36             Spain (Canary Islands)
C. rubella         1504_10             Spain (Canary Islands)
C. rubella         1504_2A             Spain (Canary Islands)
C. rubella         15048               Spain (Canary Islands)
C. rubella         1204 2              Spain (Teneriffa)

Table S2. Names and gene ontology terms for each of the 14 loci studied.

Locus       Gene Ontology Terms
At1g03560   pentatricopeptide (PPR) repeat-containing
At1g15240   phox (PX) domain-containing protein
At1g65450   transferase family protein
At1g77120   ADH1 (alcohol dehydrogenase 1)
At2g18790   PHYB (PHYTOCHROME B); G-protein coupled photoreceptor/protein histidine kinase/red or far-red light photoreceptor (PHYB)
At2g26730   leucine-rich repeat transmembrane protein
At4g00650   FRI (FRIGIDA); protein heterodimerization/protein homodimerization (FRI)
At4g02560   LD (luminidependens); transcription factor (LD)
At4g08920   ATP binding/blue light photoreceptor/protein homodimerization/protein kinase (CRY1)
At4g14190   pentatricopeptide (PPR) repeat-containing
At5g10140   specific transcriptional repressor/transcription factor (FLC)/flowering locus C
At5g42800   dihydrokaempferol 4-reductase (DFR)
At5g51670   expressed protein
At5g53020   expressed protein

Table S3. Modes of parameter estimates under a range of MIMAR models for 5 simulated datasets where an autopolyploid event was simulated using the coalescent simulation program ms (see Methods). 90% HPD intervals are given in parentheses.

Constrained model; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris A

Dataset  θ(A) (a)                     θ(Sp1) (a)                   θ(Sp2) (a)                   T (b)
1        0.03168 (0.02389, 0.04140)   0.03168 (0.02389, 0.04140)   0.00296 (0.00136, 0.00572)   254,254 (114,175, 139,990)
2        0.03344 (0.02533, 0.04348)   0.03344 (0.02533, 0.04348)   0.00317 (0.00154, 0.00583)   430,430 (223,414, 795,308)
3        0.02699 (0.01965, 0.03559)   0.02699 (0.01965, 0.03559)   0.00313 (0.00146, 0.00604)   286,286 (118,124, 531,389)
4        0.02688 (0.01903, 0.03485)   0.02688 (0.01903, 0.03485)   0.00459 (0.00219, 0.00804)   310,310 (148,054, 572,365)
5        0.03509 (0.02677, 0.04526)   0.03509 (0.02677, 0.04526)   0.00346 (0.00174, 0.00652)   342,342 (163,933, 600,000)

Constrained model; Species 1 is C. grandiflora and Species 2 is C. bursa-pastoris B

Dataset  θ(A) (a)                     θ(Sp1) (a)                   θ(Sp2) (a)                   T (b)
1        0.02687 (0.02014, 0.03529)   0.02687 (0.02014, 0.03529)   0.00529 (0.00316, 0.00906)   974,975 (598,900, 1,414,660)
2        0.03331 (0.02515, 0.04299)   0.03331 (0.02515, 0.04299)   0.00166 (0.00610, 0.00378)   1,023,020 (602,382, 1,490,120)
3        0.03164 (0.02372, 0.04109)   0.03164 (0.02372, 0.04109)   0.00771 (0.00436, 0.01124)   950,951 (584,542, 1,378,180)
4        0.02646 (0.01921, 0.03416)   0.02646 (0.01921, 0.03416)   0.00554 (0.00318, 0.00897)   1,031,030 (652,155, 1,502,950)
5        0.02795 (0.02094, 0.03714)   0.02795 (0.02094, 0.03714)   0.00375 (0.00184, 0.00654)   854,855 (529,511, 1,295,880)

Unconstrained model; Species 1 is C. bursa-pastoris A and Species 2 is C. bursa-pastoris B

Dataset  θ(A) (a)                     θ(Sp1) (a)                   θ(Sp2) (a)                   T (b)
1        0.036258 (0.01815, 0.06888)  0.00379 (0.00185, 0.00662)   0.00357 (0.00184, 0.00659)   426,426 (179,693, 807,732)
2        0.02579 (0.00251, 0.05503)   0.00324 (0.00156, 0.00594)   0.00545 (0.00317, 0.00908)   926,927 (445,775, 1,865,000)
3        0.00614 (0.00250, 0.03983)   0.00315 (0.00150, 0.00590)   0.00485 (0.00276, 0.00829)   1,143,140 (503,620, 1,824,720)
4        0.01634 (0.00250, 0.04241)   0.00457 (0.00235, 0.00784)   0.00237 (0.00103, 0.00471)   558,559 (220,947, 1,277,640)
5        0.01699 (0.00251, 0.04681)   0.00386 (0.00195, 0.00680)   0.00474 (0.00270, 0.00818)   982,983 (449,219, 1,777,900)

(a) θ (4Neμ, where Ne is the effective population size and μ is the mutation rate, 1.5 × 10^-8) for the ancestor (A), species 1 (Sp1) and species 2 (Sp2)

(b) Time of the split of Species 1 and Species 2

Figure S1. Autopolyploid event simulated using the coalescent simulation program ms.

θ is the effective population size (4Neμ, where Ne is the effective population size and μ is the mutation rate, 1.5 × 10^-8).

1 is C. grandiflora

2 is C. bursa-pastoris locus A

3 is C. bursa-pastoris locus B

T is the time at which the autopolyploid event occurred

CHAPTER 5

Title: Reconstructing origins of loss of self-incompatibility and selfing in North

American Arabidopsis lyrata: a population genetic context

Running title: Reconstructing origins of selfing in A. lyrata

John Paul Foxe*1, Marc Stift*2,3, Andrew Tedder2, Annabelle Haudry2, Stephen I. Wright4,5, Barbara K. Mable2

*These authors contributed equally to this work

1 Department of Biology, York University, 4700 Keele St. Toronto, ON Canada M3J 1P3

2 Division of Ecology and Evolutionary Biology, University of Glasgow, Glasgow, Scotland G12 8QQ

3 Present address: Centro de Investigação em Biodiversidade e Recursos Genéticos, Campus Agrário de Vairão, R. Padre Armando Quintas, 4485-661 Vairão, Portugal

4 Department of Ecology and Evolutionary Biology, University of Toronto, 25 Willcocks St., Toronto, ON Canada M5S 3B2

5 Centre for the Analysis of Genome Evolution and Function, University of Toronto

Corresponding Author: Dr. Marc Stift, Centro de Investigação em Biodiversidade e Recursos Genéticos, Campus Agrário de Vairão, R. Padre Armando Quintas, 4485-661 Vairão, Portugal; email: [email protected]; Phone: +351 252660411; Fax: +351 252661780

Keywords: mating system evolution; breakdown of self-incompatibility; demography; inbreeding depression; Arabidopsis; population genetics; bottlenecks; effective population size

Abstract

Theoretical and empirical comparisons of molecular diversity in selfing and outcrossing plants have primarily focused on long-term consequences of differences in mating system (between species). However, improving our understanding of the causes of mating system evolution requires ecological and genetic studies of the early stages of mating system transition. Here, we examine nuclear and chloroplast DNA sequences and microsatellite variation in a large sample of populations of Arabidopsis lyrata from the Great Lakes region of eastern North America that show intra- and inter-population variation in the degree of self-incompatibility and realized outcrossing rates. Populations show strong geographic clustering irrespective of mating system, suggesting that selfing either evolved multiple times or has spread to multiple genetic backgrounds. Diversity is reduced in selfing populations, but not to the extent of the severe loss of variation expected if selfing evolved due to selection for reproductive assurance in connection with strong founder events. The spread of self-compatibility in this region may have been favored as colonization bottlenecks following glaciation or migration from Europe reduced standing levels of inbreeding depression. However, our results do not suggest a single transition to selfing in this system, as has been suggested for some other species in the Brassicaceae.

Introduction

Inbreeding has often been posited as an evolutionary dead end because of the accumulation of slightly deleterious mutations and reduced adaptability (Stebbins

1957; Takebayashi and Morrell 2001). Paradoxically, the transition to selfing is cited as one of the most common major evolutionary transitions across the plant kingdom and is well documented in a variety of species across many genera (Stebbins 1950;

Stebbins 1970; Grant 1981; Barrett et al. 1996; Barrett 2002; Igic et al. 2008). Selfers have two main advantages over outcrossers: an inherent transmission advantage

(Fisher 1941) and the ability to reproduce without mates (Darwin 1876; Kalisz et al.,

2004; Charlesworth 2006). While outcrossers only transmit 50% of their genome to their offspring, strict selfers transmit their whole genome and can at the same time act as pollen donors for seed produced by other individuals (Fisher 1941). Moreover, unlike outcrossers, selfers can reproduce when pollinators or potential mates are limited (reproductive assurance, first proposed by Darwin (1876)). Thus, a selfing lifestyle can result in increased colonization ability, as a new population may be founded from a single plant (Baker 1955; Stebbins 1956; Stebbins 1957; Pannell and

Barrett 1998). Particularly in the initial stages after a transition to selfing, increased homozygosity may lead to inbreeding depression (the reduction in fitness of selfed versus outcrossed individuals) due to expression of recessive deleterious load. Hence, theory predicts that selfing is only likely to evolve when the advantages of selfing outweigh the costs associated with inbreeding depression (Charlesworth and

Charlesworth 1987).

A selfing strategy is also expected to come at the cost of reduced genetic variation (Charlesworth et al. 1993; Nordborg 2000; Charlesworth and Wright 2001;

Glemin et al. 2006; Wright et al. 2008). First, selfing increases levels of homozygosity, thereby reducing the effective population size (Ne) and levels of diversity up to twofold under complete selfing (Pollak 1987). This increased homozygosity also leads to a reduction in effective recombination rate, resulting in increased linkage disequilibrium (LD) across loci (Nordborg 2000). This facilitates genetic hitchhiking through both positive (selective sweeps (Maynard Smith and

Haigh 1974)) and negative selection (background selection (Charlesworth et al.

1993)). Genetic hitchhiking exacerbates the reduction in Ne and diversity. Finally, increased population turnover and colonization bottlenecks in selfing plants may contribute to further reductions in diversity (Ingvarsson 2002).
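The twofold bound mentioned above can be made concrete with a short Python sketch. It assumes the standard equilibrium relationships F = s / (2 - s) for the inbreeding coefficient under selfing rate s, and Ne = N / (1 + F). These are textbook results that we take to underlie the cited reduction; they are not formulas quoted from this chapter, and the function names are illustrative. The sketch ignores hitchhiking and bottlenecks, which reduce Ne further.

```python
def inbreeding_coefficient(selfing_rate: float) -> float:
    """Equilibrium inbreeding coefficient F for a given selfing rate s."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing_rate must lie between 0 and 1")
    return selfing_rate / (2.0 - selfing_rate)


def effective_size(census_size: float, selfing_rate: float) -> float:
    """Effective population size Ne = N / (1 + F) under partial selfing."""
    return census_size / (1.0 + inbreeding_coefficient(selfing_rate))


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(f"s = {s}: Ne / N = {effective_size(1.0, s):.3f}")
```

With complete selfing (s = 1), F equals 1 and Ne is half the outcrossing value, which is the twofold reduction expected from homozygosity alone.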

Empirically, the population genetic consequences associated with a transition to selfing have been well documented at the species level in both plant and animal systems (Charlesworth and Yang 1998; Baudry et al. 2001; Chiang et al. 2003; Cutter and Payseur 2003; Glemin et al., 2006). Selfing species are typically characterized by greater than twofold reductions in diversity, consistent with roles for genetic hitchhiking and/or increased colonization bottlenecks (Wright et al. 2008). These patterns of reduced genetic diversity in selfers have been found across a number of plant genera that include both outcrossing and selfing species, including

Leavenworthia (Charlesworth and Yang 1998; Liu et al. 1998; Filatov and

Charlesworth 1999; Liu et al. 1999), Arabidopsis (Ross-Ibarra et al. 2008; Savolainen

(Chiang et al. 2003).

The model plant system A. thaliana is thought to have evolved self-fertilization approximately 1 million years ago through inactivation of the self-incompatibility locus, referred to as the S-locus (Tang et al. 2007). Evidence for the role of the S-locus stems from transformation studies, which identified five accessions in which full self-incompatibility could be restored by transformation with a functional S-locus, and implies that all other genes required for SI are still intact in these accessions (Tang et al. 2007; Boggs et al. 2009). Recent results suggest that a mutation in the male component of self-incompatibility (SCR) has resulted in loss of

SI, apparently across a wide range of accessions (Tsuchimatsu et al. 2010). In addition, a modifier locus has been identified, unlinked to the S-locus (Liu et al.

2007), which suggests that S-locus inactivation may not be the sole mechanism by which SI broke down in A. thaliana and different mechanisms of loss could have operated in different accessions (Boggs et al. 2009).

Systems with more recent transitions from outcrossing to selfing may provide a more direct picture of the causes and short-term consequences of mating system evolution (Foxe et al. 2009 (Chapter 2); Guo et al. 2009; Ness et al. 2010). For example, if the evolution of selfing involves the long-term spread of modifiers through previously outcrossing populations, recently derived selfing populations are expected to retain reasonably high levels of ancestral polymorphism, as recently observed in Eichhornia paniculata (Ness et al. 2010). In contrast, if a highly selfing lineage evolves rapidly from a small number of founders, we would expect a severe loss of genetic variation, as seen in Capsella rubella (Foxe et al. 2009 (Chapter 2)).

Reprinted with permission from Evolution: International Journal of Organic Evolution, In Press. Copyright 2010

Here, we have set out to investigate the loss of self-incompatibility in North American populations of the normally outcrossing species A. lyrata (Brassicaceae).

It has been suggested that A. lyrata colonized North America from ancestral

European populations (Clauss and Mitchell-Olds 2006; Ross-Ibarra et al. 2008), which are highly self-incompatible and exclusively outcrossing. The North American populations are unique because some are still predominantly outcrossing, despite the occurrence of self-compatible individuals at low frequency, while others are almost entirely self-compatible and have undergone a transition to high rates of selfing

(Mable et al. 2005; Mable and Adam 2007). This transition to selfing in A. lyrata appears to be very recent, as selfing populations belong to a chloroplast lineage that also contains outcrossing populations (Hoebe et al. 2009). Moreover, selfing populations are not characterized by smaller flowers (Hoebe 2009), which contrasts with other systems where the transition to selfing has led to notable floral evolution towards smaller flowers (Hurka and Neuffer 1997; Charlesworth and Vekemans

2005; Tang et al. 2007; Foxe et al. 2009 (Chapter 2); Guo et al. 2009).

Previous work on North American populations of A. lyrata in the Great Lakes region (where loss of self-incompatibility has occurred) has been based on chloroplast sequences and microsatellite genotypes. The former, as a marker with a single coalescent history and limited variation across populations, allowed (limited) phylogeographic inferences (Hoebe et al. 2009, Tedder et al. 2010). The latter

Mable & Adam 2007; Hoebe et al. 2009), but suffers from potential limitations due to homoplasy and uncertainties in the mutation model, particularly for the individual markers that we had been using (Muller et al. 2008). However, nuclear gene sequences are more powerful to explicitly test population genetic predictions about the reductions in diversity in selfing populations. They also allow for the detection of recombination, and for testing whether selfing populations show more evidence of departures from demographic equilibrium than outcrossing populations. We have extended the population sampling presented in previous work (Mable et al. 2005;

Mable and Adam 2007; Hoebe et al. 2009), primarily to establish if there are more populations that have undergone a transition to selfing.

In this study, we integrate polymorphism information from nuclear genes, chloroplast markers and nuclear microsatellites, in order to obtain a detailed picture of the demographic history and population structure of A. lyrata in the Great Lakes region of North America. Our ultimate goal is to use this framework to elucidate the origins of the selfing populations. Specifically, we aim to: 1) investigate the demographic and population genetic consequences of losses of SI and transition to selfing, by describing population structure and testing the effects of individual selfing phenotype and selfing rate on genetic diversity; and 2) elucidate the extent to which severe population bottlenecks may have played a role in the transition to inbreeding.

Methods

Sampling

The samples for this study were collected from 22 locations throughout the

Great Lakes region of eastern North America (Figure 1, Table S1). From each location, we collected batches of seeds from 25-30 independent plants (growing at least 5 meters apart). More detailed sampling description is available as a supplement

(Supporting Information). In one location (Tobermory Cliffs, on the Bruce Peninsula that extends into Georgian Bay), we collected seeds from two spatially separated areas: in the first, plants were previously demonstrated to be highly selfing (dubbed

TC: Mable et al. 2005); in the second, plants were observed to have lower and more variable seed set in the field and were thus suspected of being more highly outcrossing (TCA). In another location (Tobermory Singing Sands, also on the Bruce

Peninsula, but on the Lake Huron side), we collected seeds from two spatially separated areas: in the first, plants grew on the characteristic sand dune habitat and were previously demonstrated to be highly outcrossing (dubbed TSS: Mable et al.

2005); in the second, plants grew on alvar (limestone pavement) and were observed to have high seed set in the field, suggestive of selfing (TSSA). We grew up eight plants from each of 10 seed batches from each of the 24 population samples (22 locations, cf. Table S1), forming a collection of 192 plants, grown in a common greenhouse environment at the Scottish Crop Research Institute, Invergowrie under a constant regime of 16 hours light: 8 hours dark, 22°C days and 18°C night.

Selfing phenotype determination

For all individual plants that flowered, we manually self-pollinated six

flowers. The resulting siliques were scored as either negative (no seeds), small (a

short silique smaller than 9 mm with no more than 3 seeds) or positive (a silique of 9

mm or longer with more than three seeds). Small siliques were considered as negative

for the purposes of classifying selfing phenotypes (as in Mable et al. 2005) but were

recorded to enable assessment of whether the degree of leakiness in the SI system

varied by population or geographic region. Based on this, we defined the selfing

phenotype of each plant based on the siliques produced after manual self-pollination

as: 1) self-incompatible (SI): zero or one (out of six) positive siliques; 2) self-

compatible (SC): five or six (out of six) positive siliques; and 3) partially self-

compatible (PC): two, three or four (out of six) positive siliques. To exclude pollen

sterility or total sterility causing a false SI phenotype, plants classified as self-

incompatible were crossed with plants from the same population to test for fertility.

All self-incompatible plants were cross-fertile with at least one other plant tested.
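The scoring and classification rules above can be summarized in a short Python sketch. The thresholds come from the text; the function names are illustrative. One assumption is labeled in the code: the text does not say how to score a silique that has seeds but meets only one of the two "positive" criteria, so the sketch treats any such silique as "small".

```python
def score_silique(length_mm: float, seeds: int) -> str:
    """Score one silique produced by manual self-pollination."""
    if seeds == 0:
        return "negative"
    if length_mm >= 9 and seeds > 3:
        return "positive"
    # Assumption: anything with seeds that is not clearly positive is "small".
    return "small"


def selfing_phenotype(siliques) -> str:
    """Classify a plant from six (length_mm, seeds) silique records."""
    if len(siliques) != 6:
        raise ValueError("expected six self-pollinated flowers per plant")
    # Small siliques count as negative for classification purposes.
    positives = sum(score_silique(l, n) == "positive" for l, n in siliques)
    if positives <= 1:
        return "SI"  # self-incompatible
    if positives >= 5:
        return "SC"  # self-compatible
    return "PC"      # partially self-compatible


if __name__ == "__main__":
    plant = [(12, 10), (11, 8), (5, 2), (0, 0), (10, 6), (4, 1)]
    print(selfing_phenotype(plant))
```

The example plant has three positive siliques out of six and is therefore classed as partially self-compatible.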

Mating system determination (establishing population level outcrossing rates)

For 12 populations (IND, LPT, LSP, MAN, PIC, PIN, PTP, PUK, RON, TC,

TSS, WAS), multi-locus outcrossing rates had been determined in previous studies

(Mable and Adam 2007). For nine of the newly sampled populations (BEI, HDC,

KTT, OWB, PCR, PIR, PRIA, SBD, TSSA) we used the same microsatellite markers

and procedures to calculate outcrossing rates as outlined in Mable and Adam (2007)

and Hoebe et al. (2009). In brief, we genotyped progeny arrays (6-10 offspring per mother) from 17 to 27 mothers per population, and estimated multilocus outcrossing rates using MLTR version 2.3 (Ritland 2002), which implements the mixed-mating model described by Ritland and Jain (1981). Too few seed families from IOM and TCA germinated to allow reliable evaluation of outcrossing rates, but a rough estimate for TCA was obtained from 5 maternal families and estimates for IOM were obtained from Yvonne Willi (personal communication), who used a similar set of microsatellite markers on the same seed samples. Similarly, too few seed families from NCM germinated for reliable outcrossing estimates. Therefore, the mating system for NCM was inferred from selfing phenotypes, which showed a predominance of SI individuals.

Microsatellite genotyping

Nine microsatellite loci previously used by Mable and Adam (2007) were screened for variation across all 192 individuals: ADH-1, AthZFPG, ATTS0392,

F20D22, ICE12, ICE9 (Clauss et al. 2002), LYR104, LYR133 and LYR417 (obtained from V. Castric and X. Vekemans, personal communication). Products were amplified by multiplex PCR, using the default reagent concentrations recommended by the kit instruction manual (QIAGEN Multiplex PCR Kit, QIAGEN Ltd; exact primer concentrations can be requested from the authors). Thermocycling was performed on PTC-200 (MJ Research) machines using the following programme: initial denaturation at 95°C for 15 min followed by 34 cycles of 94°C for 30 s, 55°C for 90 s, 72°C for 90 s (ramp to 72°C at 0.7°C/s) and a final 72°C extension for 10 min.

Multiplex products (1:160 dilutions) were genotyped using an ABI 3730 sequencer

(by The Sequencing Service, University of Dundee). Genotypes were analyzed using

GENEMAPPER 4.0 (Applied Biosystems) and corrected manually.

PCR and sequencing of nuclear genes

For each of the 192 individuals across 24 populations, products were produced from PCR primer pairs that were previously designed and confirmed to amplify large exons from 18 nuclear genes (putative functions listed in Table S1 in Ross-Ibarra et al. 2008), following methods described by Wright et al. (2006) and Ross-Ibarra et al. (2008). PCR reactions were performed in 25 µL reaction volumes (15 mM PCR (10X) buffer, 2 mM MgSO4, 10 mM dNTPs, 10 µM forward primer, 10 µM reverse primer, 1 U Tsg polymerase and 50-100 ng DNA) on an Eppendorf Mastercycler with the following program: 2 minutes at 94°C; 35 cycles of 20 seconds at 94°C, 20 seconds at 55°C and 40 seconds at 72°C; and a final extension of 4 minutes at 72°C.

Sequencing reactions were carried out by Lark Technologies, Texas, USA.

Chromatograms were analyzed using Sequencher 4.6 (Gene Codes, Corp.), using the

'call secondary peaks' option to aid in the identification of heterozygous sites. All chromatograms were checked manually for heterozygous nucleotide positions, using the sequence from both strands to confirm putative heterozygous sites. Due to a significant amount of sequencing failure across all 18 loci for individuals from LPT,

this population was removed from subsequent nuclear data analyses. All sequences

have been submitted to GenBank, with accession numbers HM168020-HM171110.

DNA sequencing of chloroplast DNA

The noncoding cpDNA region trnL(UAA)3' exon-trnF(GAA) was amplified

with the primers E 5'-GGTTCAAGTCCCTCTATCCC-3' and F 5'-

ATTTGAACTGGTGACACGAG-3' (Taberlet et al. 1991) and sequenced. This

includes a region with pseudogene copies of the trn gene (Koch and Kiefer 2005;

Ansell et al. 2007; Tedder et al. 2010). Using the same primers in a smaller

population sample, Hoebe et al. (2009) identified a short haplotype (515 bp, dubbed

S1) and two long haplotypes (741 bp, dubbed L1 and L2). After purification with

QiaQuick gel extraction kits (Qiagen Inc.), all PCR products were sequenced directly

on an ABI 3730 sequencer by The Sequencing Service, University of Dundee.

Sequences were visually checked using Sequencher 4.7 (Gene Codes, Corp.) and

aligned to the previously identified L1, L2, and S1 haplotypes (Hoebe et al. 2009).

Nuclear gene sequence analysis

We reconstructed individual haplotypes of unphased diploid sequences using

the software PHASE (Stephens et al. 2001), as implemented in DnaSP Version 5.0

(Librado and Rozas 2009). For the sequence data, synonymous and nonsynonymous

sites were identified by aligning each fragment to the corresponding fragment in the

A. thaliana genome sequence, identified using BLAST (Altschul et al. 1990), and

using the protein annotation from A. thaliana. Standard population genetic

descriptives, including numbers of synonymous and nonsynonymous sites, estimates

of synonymous (πsyn) and nonsynonymous (πrep) diversity and Tajima's D, were

calculated using a modified version of Polymorphurama, a Perl script written by D.

Bachtrog and P. Andolfatto (available from

http://ib.berkeley.edu/labs/bachtrog/data/polyMORPHOrama/polyMORPHOrama.html). Significance of within-population mean Tajima's D was determined by

conducting 10,000 coalescent simulations as implemented in the HKA software

(available from http://genfaculty.rutgers.edu/hey/software#HKA; Kliman et al. 2000).

The within-population recombination parameter ρ (where ρ = 4Ner, Ne being the

effective population size and r recombination rate) was calculated for each locus with

more than three segregating sites and for each population by using the maxdip

program (available from http://genapps.uchicago.edu/maxdip/index.html) for diploid

unphased data. Maxdip applies a composite likelihood approach fit to the observed

pairwise SNP frequencies (Hudson 2001) and assumes an infinite-sites constant-

population-size neutral model.
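As a rough illustration of two of the summary statistics named above, the sketch below computes pairwise nucleotide diversity and Tajima's D from a set of aligned, phased sequences. It is not the Polymorphurama script used in the study: it works on raw counts per fragment rather than per synonymous or nonsynonymous site, ignores missing data, and all function names are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pairwise_diversity(seqs):
    """Mean number of pairwise differences (pi, per fragment, not per site)."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns with more than one state."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajimas_d(seqs):
    """Tajima's D (Tajima 1989); returns None when it is undefined."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its standard deviation
    return (pairwise_diversity(seqs) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

Significance testing by coalescent simulation, as done with the HKA software, is outside the scope of this sketch.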

Microsatellite analysis

For the microsatellite data, we used MSA (Dieringer and Schlotterer 2003) to

calculate observed and expected heterozygosity (Ho and He) for each locus.

The effects of selfing phenotype and mating system on genetic diversity and heterozygosity

Observed heterozygosity at nuclear loci was calculated using Perl scripts written by A.H. For each individual and each locus, the heterozygosity status was determined by comparing the two gene copies carried by the individual. If the two haplotypes were different, the status was described as heterozygous; if they were identical, the status was described as homozygous. Individual Ho was estimated as the average Ho over all loci for each individual.
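The per-individual calculation described above can be sketched as follows. This is not the Perl script used in the study; the data layout (a dictionary of locus names mapped to the two phased haplotypes) and the decision to skip loci with missing data are assumptions made for illustration.

```python
def individual_heterozygosity(genotypes):
    """Fraction of scored loci at which the two haplotypes differ.

    genotypes: dict mapping a locus name to a (haplotype1, haplotype2) pair,
    or to None when the locus failed for this individual (assumed convention).
    """
    scored = [pair for pair in genotypes.values() if pair is not None]
    if not scored:
        return None  # no usable loci for this individual
    heterozygous = sum(hap1 != hap2 for hap1, hap2 in scored)
    return heterozygous / len(scored)
```

With three scored loci, one of which carries two different haplotypes, the function returns 1/3.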

Linear regressions as implemented in JMP 8.0.2 (http://www.jmp.com) were used to test whether summary statistics describing genetic diversity and heterozygosity at the population level (πsyn, ρ, and Ho for nuclear gene data; Ho and He for microsatellites) varied in relation to outcrossing rate (Tm) and/or the proportion of SC individuals in each population. One-way ANOVA, also implemented in JMP 8.0.2, was used to test the effect of selfing phenotype on individual heterozygosity.

Explaining the reduced genetic diversity of selfing populations: testing for bottleneck effects beyond selfing alone

Selfing reduces effective population size (Nordborg 2000). Therefore, genetic diversity is expected to decrease with selfing rate, up to a twofold reduction for completely selfing populations. Other demographic effects and genetic hitchhiking could further decrease genetic diversity. We tested if genetic diversity (πsyn and πrep for gene sequences) in populations classified as selfing (based on multilocus outcrossing rates) was lower than expected based on the selfing rate (S = 1 − Tm) alone. To do so, we corrected diversity using the formula θcorrected = θobs(1 + F), where θ is the genetic diversity measure being corrected, and F = S/(2 − S), where F is the inbreeding coefficient and S is the selfing rate for the population (Nordborg 2000). Similarly, we corrected ρ using the formula Rcorrected = Robs/(1 − S), where R is the recombination rate and S is the selfing rate for the population (Nordborg 2000). Then, we used linear regressions to test if population outcrossing rates (Tm) predicted corrected πsyn, πrep, microsatellite Ho, He and ρ. If so, and corrected genetic diversity decreased with selfing rate, this would provide evidence that genetic diversity in selfing populations is reduced due to additional factors beyond selfing alone (e.g., a bottleneck, selection).
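The two corrections can be illustrated with a short sketch. It assumes the formulas read as corrected diversity = observed diversity × (1 + F) with F = S/(2 − S), and corrected recombination = observed recombination / (1 − S), with S = 1 − Tm; the function names are hypothetical and this is not the code used in the study.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = S / (2 - S)."""
    return selfing_rate / (2 - selfing_rate)


def correct_diversity(theta_obs, outcrossing_rate):
    """Scale observed diversity by (1 + F) to remove the effect of selfing alone."""
    selfing_rate = 1 - outcrossing_rate
    return theta_obs * (1 + inbreeding_coefficient(selfing_rate))


def correct_recombination(rho_obs, outcrossing_rate):
    """Scale the observed recombination parameter by 1 / (1 - S)."""
    selfing_rate = 1 - outcrossing_rate
    if selfing_rate >= 1:
        raise ValueError("correction is undefined for complete selfing (S = 1)")
    return rho_obs / (1 - selfing_rate)
```

Under these assumptions a completely selfing population (Tm = 0) has its diversity doubled by the correction, matching the twofold reduction expected from selfing alone, while a fully outcrossing population (Tm = 1) is left unchanged.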

Bayesian inference of population structure

We used InStruct (version 1.0, Gao et al. 2007) to infer population structure using a combination of phased nuclear gene sequence and microsatellite data. For comparative purposes, we also used STRUCTURE (version 2.3.2, Pritchard et al.

2000) to infer population structure. Both programs perform Bayesian clustering and work by assigning individuals to a given number of clusters in such a way that deviations from Hardy-Weinberg equilibrium are minimized. Unlike STRUCTURE,

InStruct can accommodate non-random mating due to selfing. Based on exploratory runs, we restricted the number of clusters (k) to range from k = 1 to k = 12, and ran

both programs for 2,000,000 generations with a burnin of 200,000 generations, with five independent chains (runs) for each k. We ran InStruct in the mode that allows for admixture and individual selfing rates. We used Perl scripts written by Joseph Hughes

(available from http://linnaeus.zoology.gla.ac.uk/~jhughes/Bioinformatics.html) on

InStruct output files, and Structure Harvester (v0.3 by D.A. Earl: http://taylor0.biology.ucla.edu/struct_harvest/) on STRUCTURE output files to excise the probability matrices for each level of k and perform matrix alignment using the software CLUMPP v1.1.1 (Jakobsson and Rosenberg 2007). DISTRUCT v1.1

(Rosenberg 2004) was used to create bar plots of the aligned matrices.

Inferring the origin of selfing populations

For each locus, we counted the total number of microsatellite alleles and gene sequence haplotypes over all populations by using Perl scripts written by A.H. Then, we grouped populations according to their classification as selfing (Tm< 0.5), and outcrossing (Tm > 0.5). For both groups, we then counted the total number of microsatellite alleles and nuclear gene sequence haplotypes, and those that were unique to either group. Finally, we counted those that were shared between the groups. Then, we tested if the selfing and outcrossing populations had a significantly different number of unique variants relative to the total number of variants observed in each of these groups. We did this separately for microsatellite alleles (across all loci) and unique nuclear gene sequence haplotypes (across all genes) using G-test for goodness-of-fit to assess significance.
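The grouping and counting steps described above can be sketched with set operations. This is not the Perl script used in the study: the data layout, the function names and the treatment of a population with Tm exactly equal to 0.5 (placed in the outcrossing group here) are assumptions, and the G-test itself is not reproduced.

```python
def group_variants(populations, threshold=0.5):
    """Pool variants into selfing (Tm < threshold) and outcrossing groups.

    populations: dict mapping a population name to (Tm, set of variants),
    where each variant is a (locus, allele_or_haplotype) tuple.
    """
    selfing, outcrossing = set(), set()
    for tm, variants in populations.values():
        if tm < threshold:
            selfing.update(variants)
        else:
            outcrossing.update(variants)
    return selfing, outcrossing


def unique_and_shared(selfing, outcrossing):
    """Count variants unique to each group and variants shared by both."""
    return {
        "unique_selfing": len(selfing - outcrossing),
        "unique_outcrossing": len(outcrossing - selfing),
        "shared": len(selfing & outcrossing),
    }
```

Keying each variant by its locus keeps alleles with the same size at different loci from being counted as shared.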

We then looked at the haplotypes that were unique to the selfing population group (i.e., the haplotypes not shared between the selfing and outcrossing group of populations) in a separate analysis. Specifically, we explored whether haplotypes unique to selfing populations were derived from haplotypes occurring in particular outcrossing populations, or unrelated to any haplotypes in outcrossing populations.

We considered haplotypes to be derived from another haplotype if they had up to two base pair differences, and unrelated if they differed by more than two base pairs from any other haplotype in our sampling. The presence of such 'unrelated' haplotypes within selfing populations would indicate an origin not included in our sampling, or mutations that have accumulated subsequent to divergence from a shared ancestor.
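The two-base-pair criterion can be expressed as a simple distance check. The sketch below assumes aligned haplotypes of equal length and counts every mismatched position as one difference; the function names and the handling of alignment gaps are assumptions rather than the procedure actually used.

```python
def bp_differences(hap_a, hap_b):
    """Number of positions at which two aligned haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(a != b for a, b in zip(hap_a, hap_b))


def classify_unique_haplotype(haplotype, other_haplotypes, max_diff=2):
    """Return 'derived' if any other haplotype is within max_diff bp, else 'unrelated'."""
    if not other_haplotypes:
        raise ValueError("at least one comparison haplotype is required")
    closest = min(bp_differences(haplotype, other) for other in other_haplotypes)
    return "derived" if closest <= max_diff else "unrelated"
```

A haplotype that differs by two positions from its closest match is labeled "derived", whereas one that differs by three or more positions from every other haplotype is labeled "unrelated".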

Note that, if a selfing population originated from another selfing population, this would be reflected by two or more selfing populations sharing haplotypes that are derived from the same ancestral population. As in the previous paragraph, this would be indistinguishable from a scenario of independent colonizations from the same outcrossing population.

Results

Selfing phenotypes and outcrossing rates

Population level multilocus outcrossing rate estimates (Tm) ranged from 0.09-

0.99 (Table 1; Table S2). We categorized populations as selfing if Tm was < 0.5, and outcrossing if Tm > 0.5. Eight populations were classified as selfing (Tm < 0.5: KTT,

LPT, PTP, RON, TC, TCA, TSSA, WAS). All of the remaining populations were classified as outcrossing (Tm > 0.5), including PIC, PIN, PIR, PRI, PUK, SBD and TSS. NCM was classified as outcrossing based on a preponderance of SI individuals, but outcrossing rates were not estimated due to insufficient seeds per mother (Table 1). Small siliques were found in all predominantly outcrossing populations but not in the populations that were predominantly self-compatible, emphasizing that their occurrence represents leakiness of the self-incompatibility system rather than a complete loss of it, as suggested previously (Mable et al. 2005). Overall, the proportion of self-compatible individuals in each population was a strong predictor of outcrossing rates (linear regression, beta = -0.68, R2 = 0.887, F[1,21] = 165.7, p < 0.0005). Nevertheless, it is worth noting that some self-compatible (SC) or partially self-compatible (PC) plants were found in the majority of the outcrossing populations (Table S2). SC individuals were found in five of the 15 outcrossing populations (HDC, OWB, IND, MAN and PRI) and PC individuals were found in nine (BEI, HDC, IOM, NCM, OWB, PCR, PIR, PUK, and

SBD). Likewise, some of the populations classified as selfing included some SI and

PC individuals, with fully SI individuals found in two of the eight selfing populations

(TSSA and TC).

The effect of selfing phenotype and mating system (outcrossing rates) on genetic diversity and heterozygosity

Significant linear regressions indicated that both outcrossing rate and the proportion of SC individuals in a population were good predictors of average

multilocus synonymous nucleotide diversity (πsyn) and expected microsatellite heterozygosity (He) (Table 2): both πsyn and He increased with increasing outcrossing rate (Tm) (Figure 2) and decreased with increasing proportion of SC individuals. While a similar effect was observed for πrep, it was not statistically significant. Average population multilocus Ho was found to increase with increasing outcrossing rate and decrease with increasing proportion of SC individuals (Table 2).

In contrast, outcrossing rate and the proportion of SC individuals did not explain the population recombination parameter ρ (Table 2).

Observed heterozygosities (Ho) calculated for each individual for the nuclear and the microsatellite datasets were highly correlated (Spearman rank correlation: r2 = 0.58, p < 0.0001). For both datasets, the selfing phenotype (PC, SC, SI) had a significant effect on individual heterozygosity (one-way ANOVA, F[2,167] = 19.0, p < 0.0001). SC individuals (mean Ho = 0.10 for the nuclear gene sequences, mean Ho = 0.11 for the microsatellites) were found to be less heterozygous than SI (mean Ho = 0.21 for the nuclear gene sequences, mean Ho = 0.26 for the microsatellites) and PC individuals (mean Ho = 0.22 for the nuclear gene sequences, mean Ho = 0.25 for the microsatellites). When the effect of selfing phenotype (SI, PC, SC) on heterozygosity was tested only considering outcrossing populations, the selfing phenotype did not have a significant effect for either the nuclear gene sequence data (overall mean Ho = 0.22), or for the microsatellite data (overall mean Ho = 0.27).

Explaining the reduced genetic diversity of selfing populations: testing for bottleneck effects beyond selfing alone

After correcting πsyn for the differences in effective population size expected due to selfing alone, neither outcrossing rate (Tm) nor proportion of SC individuals per population explained levels of diversity (r2 = 0.039, p > 0.05, r2 = 0.14, p > 0.05), indicating that the neutral effects of selfing alone may explain the reduction in diversity in selfing populations (Table 2).

Many populations showed average Tajima's D values that had significant departures from a standard neutral model; in particular, the average Tajima's D was significantly positive in 15 of our 22 populations, and significantly negative in one

(Table 1). It is worth noting, however, that neither the selfing nor outcrossing populations consistently display more significant departures from neutrality.

Chloroplast Haplotype Distribution

Expanding on the results and following the same naming reported by Hoebe et al. (2009), the previously identified 741 bp L1 and L2 haplotypes and two additional 741 bp haplotypes (dubbed L3 and L4) were found among the 24 populations sampled here. In addition, we found the previously identified 515 bp S1 haplotype (Hoebe et al. 2009) and two additional (498 bp) haplotypes (dubbed S2 and S3). The shorter haplotypes S2 and S3 differed by 1 bp from one another. Based on the eight individuals per population sampled, most populations (21) were fixed for a single chloroplast haplotype, while three populations contained a mixture of cpDNA

S1 predominate (Figure 1). Self-compatible individuals were found with L1, L2, L3, L4 and S1 chloroplast haplotypes and partially compatible individuals were found with haplotypes L1, L2, S1, S2 and S3. Predominantly selfing populations were associated with L1 (LPT, PTP, RON, TSSA), S1 (TC, TCA, TSSA), L3 (KTT) and L4 (WAS). Of those, only L3 was unique to selfing populations. Self-incompatible individuals, and outcrossing populations in general, were found with all haplotypes except L3.

Bayesian Clustering Analyses

Bayesian clustering using InStruct with estimation of individual selfing levels gave similar results to STRUCTURE (compare Figures S1 and S2) and identified six main clusters when combining nuclear gene sequence and microsatellite data (Figure

1; Figure S3). Four clusters contained both selfing and outcrossing populations, while the two remaining clusters contained only outcrossing populations. Clustering based on nuclear gene or microsatellite data alone in general agreed well with the results from the combined dataset, and also identified six clusters (data not shown). Four of the clusters included both selfing and outcrossing populations and there were no clusters that consisted only of selfing populations. There were only a few populations with evidence of admixture, of which the MAN and IND populations had the strongest signal.

Inferences on the origin of selfing populations

Only one microsatellite allele was found to be unique to the group of selfing populations (Tm < 0.5). The remaining 30 microsatellite alleles that occurred in the selfing group were shared with the 52 alleles occurring in the outcrossing group of populations (Table 3). There was a highly significant under-representation of unique microsatellite alleles in the selfing vs. outcrossing populations (Table 3). A similar pattern emerged for the gene sequence haplotypes. Although in absolute terms there were more variants unique to the selfing group (55 of 145), the majority of variants

(90) were still shared with the 449 haplotypes occurring in the group of outcrossing populations (Table 3). The group of selfing populations thus appeared to have a subset of the microsatellite alleles and gene sequence haplotypes found in the outcrossing group. None of these patterns were driven by particular microsatellite loci or nuclear genes (Table S3).

The haplotypes unique to the group of selfing populations were mostly uninformative with regards to revealing a potential origin because they were either not closely related (i.e., three or more bp difference) to any haplotype found in outcrossing populations (Table S4), or closely related (i.e. only one or two bp differences) to haplotypes that occurred in multiple outcrossing populations (Table

S4). Finally, each selfing population had its own unique haplotypes (not shared with other selfing populations in our sampling) and these were never closely related to haplotypes unique to other selfing populations (Table S4). Haplotypes that were not closely related (i.e. three or more bp differences) to any haplotype in outcrossing

PTP (Table S4).

Discussion

Breakdown of self-incompatibility and patterns of genetic variation

Previous investigations into the levels of genetic diversity in mixed mating populations of North American A. lyrata revealed expected patterns, where genetic diversity in selfing populations has been reduced in comparison with outcrossing populations (Mable et al. 2005; Mable and Adam 2007; Hoebe et al. 2009). These studies were based on microsatellite variation alone whereas in this study we combine nuclear gene sequence data with microsatellites and found a striking concordance between them. We found an increase of synonymous nucleotide diversity (πsyn) for nuclear sequence data and He for microsatellite data (Table 2, Figure 2) with increasing outcrossing rate, which corroborates previous conclusions (Mable and

Adam 2007; Hoebe et al. 2009) and confirms the theoretical prediction that a shift to selfing comes at a cost to genetic diversity (Charlesworth et al. 1993; Nordborg 2000;

Charlesworth and Wright 2001; Glemin et al. 2006; Wright et al. 2008). For both datasets, individual Ho was significantly lower for SC individuals versus SI individuals. This effect appeared to be solely due to the transition to inbreeding (so an effect of mating system rather than loss of SI), for when the heterozygosity of SC, PC and SI individuals was compared excluding the inbreeding populations (Tm < 0.5), selfing phenotype (SC, PC, SI) had no effect on heterozygosity. Maintenance of high

heterozygosity in SC and PC individuals in outcrossing populations emphasizes that loss of SI does not always lead to shifts to inbreeding (i.e., there is a two-step process). Even though we only sampled eight individuals per population, self-compatible individuals were found in five of the 15 predominantly outcrossing populations, and partially compatible individuals were found in nine, suggesting that self-compatibility is widespread. The comparable levels of heterozygosity of SC and

SI individuals in outcrossing populations, and the lack of any clustering according to selfing phenotype (Figures S1 and S2), also suggest that self-compatible plants in outcrossing populations, despite their ability to self-fertilize, still predominantly outcross. This is not entirely unexpected because neither the loss of SI nor the shift to selfing in A. lyrata is associated with a reduction in flower size (Hoebe 2009), which would promote exclusive selfing. These results are compatible with a scenario of one or more mutations facilitating the loss of SI having occurred early in the colonization history of North American A. lyrata, and that segregation of these mutations causes segregation of SC phenotypes in all populations, but shifts towards selfing only in a subset.

No role for bottlenecks in the breakdown of self-incompatibility of North American

A. lyrata?

Population bottlenecks may be expected to be common in highly selfing populations, particularly if strong founder events were important in their origins

(Foxe et al. 2009 (Chapter 2); Guo et al. 2009). If this is the case, we would expect a

more severe reduction in diversity in selfing populations than expected under

neutrality (i.e. a greater than twofold reduction in diversity under complete selfing).

Similarly, we would expect a greater skew in the allele frequency spectrum in selfing

populations, generating more positive Tajima's D values. When levels of πsyn were

corrected for differences in Ne due to selfing alone, no significant correlation between

πsyn and multilocus estimation of the outcrossing rate (Tm) was found. Thus, similar to

recent results from Eichhornia paniculata (Ness et al. 2010), our data provided no

evidence for reductions in diversity beyond neutral expectations, providing no

signature of elevated demographic effects or hitchhiking in selfing populations.

Furthermore, although 15 populations had a significantly positive Tajima's D value

(Table 1), which is expected if recent bottlenecks have played a role, there was no

difference in Tajima's D values between selfing and outcrossing populations.

While population bottlenecks alone may result in a positive Tajima's D,

admixture resulting from gene flow can also elevate such estimates (Wright & Gaut,

2005). It is possible that positive Tajima's D values found in the outcrossing

populations are the result of elevated incoming gene flow when compared to the

selfing populations; although our clustering analyses do not suggest that this is

generally true, larger within-population samples might reveal greater gene flow

among outcrossing than selfing populations.

Demographic history and the breakdown of self-incompatibility of North American

A. lyrata

The results from Bayesian clustering analyses for the combined nuclear sequence and microsatellite data suggest the existence of six clusters across the populations sampled from eastern North America (Figure 1). There was good agreement between InStruct and STRUCTURE (Figures S1 and S2). This is surprising given that the STRUCTURE assumption of random mating (Pritchard et al. 2000) is clearly violated in this system with varying levels of inbreeding, and may be reassuring for studies that have used STRUCTURE in systems where selfing plays a role (e.g., Hoebe et al. 2009; Foxe et al. 2009 (Chapter 2)). It is worth noting that the clusters appear to be fairly isolated; a signal of admixture was only particularly strong in IND and MAN. These populations are also characterized by a mixture of chloroplast haplotypes (IND, L1 and L2: cf. Figure 1; MAN, L1 and L2: cf. Hoebe et al. 2009; Tedder et al. 2010).

The overwhelming pattern here is that populations are not clustered by mating system or selfing phenotype (Figure 1), which would have suggested that selfing evolved only once in the region. Instead, they are predominantly clustered by geographic location. This geographic clustering of the populations by both nuclear markers and chloroplast haplotypes (Figure 1) likely reflects the recolonization history of North America after the end of the Wisconsin glaciation (ca. 10,000 years ago), before which the entire Great Lakes region was covered in ice (Lewis et al.

2008). Hoebe et al. (2009) concluded that a mating system transition may have occurred more than once in North American A. lyrata, as the loss of self-incompatibility and a transition to high levels of selfing occurred in multiple

chloroplast lineages (L1, S1; Hoebe et al. 2009). The more extensive geographic

sampling here confirmed this and identified a new population that has undergone a

complete transition to predominant selfing (KTT), which was characterized by a

fourth origin based on clustering analysis and a unique chloroplast haplotype (L3).

The variation among different selfing populations within a geographic area is

once again in stark contrast with the uniformity of different populations of C. rubella

(Foxe et al. 2009 (Chapter 2); Guo et al. 2009). This may reflect that North-American

A. lyrata is at an earlier stage in the transition to selfing, or that the underlying forces

driving the transition are inherently different. None of the selfing populations had a

strong relationship with any of the outcrossing populations we sampled in terms of

sharing of haplotypes (data not shown), contrary to what would be expected if selfing

populations had an origin in specific outcrossing populations in our sampling. Selfing

populations as a group appear to harbor a subset of the genetic variation of

outcrossing populations, with most haplotypes shared with outcrossing populations

and a significant under-representation of unique variants (Table 3).

Outcrossing rate and the proportion of self-compatible individuals do not account for levels of recombination

The transition from outcrossing to selfing can result in a number of

consequences at the genotypic level (reviewed in Wright et al. 2008). Selfing is

associated with a decrease in levels of polymorphism and an increase in levels of

linkage disequilibrium. Increased levels of homozygosity in selfing populations

should lead to less efficient recombination (r) and a reduction in Ne, thus reflected in a reduced estimate of ρ (ρ = 4Ner). However, in our dataset outcrossing rate and the proportion of SC individuals did not explain the population recombination parameter ρ (Table 2). One possible explanation for this is that the power to estimate ρ in selfing populations may be diminished due to low levels of diversity. Alternatively, recent transitions to selfing may retain ancestral recombination events, especially at local physical distances (Tang et al. 2007).

What favours self-compatibility in the Great Lakes Region?

The question remains as to why selfing broke down and has persisted in North

American A. lyrata, and has not occurred in the European subspecies. Selfing is only expected to become established in a previously outcrossing population when the advantages associated with selfing outweigh the costs in terms of inbreeding depression and reduced adaptability (Charlesworth and Charlesworth 1987). On a coarse scale, the geographic distribution of selfing populations is not consistent with expectations of reproductive assurance being favored in peripheral populations at the front of a colonization wave (Baker's law, Baker 1955; also see Pannell and Barrett

1998), since both selfing and outcrossing populations have colonized new areas within the past 10,000 years. Selfing populations tend to be distributed towards the southern part of the habitat that opened up first following the last glacial maximum

(and so are not at the periphery of the current distribution), and there does not appear

to be a difference in population size or obvious habitat differences between the two types of populations (Mable and Adam 2007).

An alternative explanation is that North American populations have generally experienced reduced inbreeding depression through purging, hence lowering the cost associated with a transition to inbreeding. Population bottlenecks, particularly accompanied by long-term reductions in effective population size, can lead to significant purging and/or fixation of deleterious alleles (Bataillon and Kirkpatrick

2000). Lower levels of genetic diversity found in North American versus European A. lyrata have been suggested to reflect a long-term population bottleneck associated with the colonization of North America from European populations (Ross-Ibarra et al.

2008). Therefore, reduced inbreeding depression as a consequence of this bottleneck may have partly facilitated the evolution of selfing. As the Great Lakes populations of

A. lyrata are thought to have spread north from their glacial refugia following the last

Ice Age, this may have further contributed to a substantial purging of deleterious alleles in founder populations. Consistent with this possibility, it has recently been shown that population range expansion can lead to a significant decrease in genetic load in Mercurialis annua (Pujol et al. 2009). Comparative studies of inbreeding depression, both within North American and European populations, could enable a test of this hypothesis in A. lyrata. Previous studies have demonstrated high levels of inbreeding depression in European populations (Karkkainen et al. 1999), which does not seem to be as apparent in A. lyrata (Hoebe 2009).

Conclusions

In summary, we have shown that, compared to predominantly outcrossing populations, genetic diversity was reduced in selfing populations of North American A. lyrata. We found a strong concordance between chloroplast markers, nuclear gene sequence and microsatellite data. The general reduction in diversity appeared to be the consequence of the transition to a selfing mating system, and not of the loss of SI alone. We found no evidence of severe bottlenecks associated with the transition to selfing beyond the bottleneck expected due to the transition itself. Although we assume that the transition to selfing in this system is recent (at least much more recent than the transition to selfing in other systems), and that there have been multiple independent origins of selfing, we did not find a clear relationship of any of the selfing populations to a particular outcrossing "parent" population. The genetic basis for the loss of SI has not yet been elucidated in A. lyrata, but it is important to identify potentially unique origins of self-compatibility prior to designing and interpreting crossing studies to investigate mechanisms of loss. For example, initial investigations of loss of SI in A. thaliana suggested a selective sweep across all populations (Shimizu et al. 2004); later studies that have sampled populations more broadly have found that the story is more complicated, with the possibility of multiple independent losses, perhaps involving different mechanisms (Bechsgaard et al. 2006; Sherman-Broyles et al. 2007; Tang et al. 2007). The North American A. lyrata populations offer an exciting system to unravel the causes and consequences of the loss of SI and an evolutionary transition from outcrossing to selfing, as it involves a much more recent change than that which has given rise to the highly selfing A. thaliana species.

Reprinted with permission from Evolution: International Journal of Organic Evolution, In Press. Copyright 2010

Acknowledgements

We thank Yvonne Willi and David Remington for generously providing seeds and outcrossing rates; Aileen Adam, Peter Hoebe, and Hong-Guang Zha for assistance with laboratory work and plant maintenance; Peter Hoebe, Rob Ness, and Tanja Slotte for useful discussions; Joseph Hughes for assistance with Perl scripting; Hong Gao for providing assistance with the InStruct source code; and three anonymous referees for their constructive and detailed comments, which greatly improved the manuscript. We are grateful for funding from the Natural Environment Research Council (NE/D013461/1) and European Research Area Network in Plant Genomics/Biotechnology and Biosciences Research Council joint funding (ERAPGFP-06.058A) to BKM. We thank Parks Canada, Ontario Parks, Michigan State Parks Authority, U.S. National Park Service, Ohio Department of Natural Resources, and the Ohio Nature Conservancy for access to protected park areas and advice on plant locations.

References

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local

alignment search tool. J. Mol. Biol. 215:403-410.

Ansell, S. W., H. Schneider, N. Pedersen, M. Grundmann, S. J. Russell, and J. C.

Vogel. 2007. Recombination diversifies chloroplast trnF pseudogenes in

Arabidopsis lyrata. J. Evol. Biol. 20:2400-2411.

Baker, H. G. 1955. Self-compatibility and establishment after long-distance dispersal.

Evolution 9:347-349.

Barrett, S. C. H. 2002. The evolution of plant sexual diversity. Nat. Rev. Genet.

3:274-284.

Barrett, S. C. H., L. D. Harder, and A. Worley. 1996. The comparative biology of

pollination and mating in flowering plants. Phil. Trans. R. Soc. Ser. B.

351:1271-1280.

Bataillon, T., and M. Kirkpatrick. 2000. Inbreeding depression due to mildly

deleterious mutations in finite populations: size does matter. Genet. Res.

75:75-81.

Baudry, E., C. Kerdelhue, H. Innan, and W. Stephan. 2001. Species and

recombination effects on DNA variability in the tomato genus. Genetics

158:1725-1735.

Bechsgaard, J. S., V. Castric, D. Charlesworth, X. Vekemans, and M. H. Schierup.

2006. The transition to self-compatibility in Arabidopsis thaliana and

evolution within S-haplotypes over 10 Myr. Mol. Biol. Evol. 23:1741-1750.

Boggs, N. A., J. B. Nasrallah, and M. E. Nasrallah. 2009. Independent S-locus

mutations caused self-fertility in Arabidopsis thaliana. PLoS Genetics

5:e1000426.

Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1993. The effects of

deleterious mutations on neutral molecular variation. Genetics 134:1289-1303.

Charlesworth, D. 2006. Evolution of plant breeding systems. Curr. Biol. 16:R726-

735.

Charlesworth, D., and B. Charlesworth. 1987. Inbreeding depression and its

evolutionary consequences. Annu. Rev. Ecol. Syst. 18:237-268.

Charlesworth, D., and X. Vekemans. 2005. How and when did Arabidopsis thaliana

become highly self-fertilising. Bioessays 27:472-476.

Charlesworth, D., and S. I. Wright. 2001. Breeding systems and genome evolution.

Curr. Opin. Genet. Dev. 11:685-690.

Charlesworth, D., and Z. Yang. 1998. Allozyme diversity in Leavenworthia

populations with different inbreeding levels. Heredity 81:453-461.

Chiang, Y. H., B. A. Schaal, C. H. Chou, S. Huang, and T. Y. Chiang. 2003.

Contrasting selection modes at the Adh1 locus in outcrossing Miscanthus

sinensis vs. inbreeding Miscanthus condensatus (Poaceae). Am. J. Bot.

90:561-570.

Clauss, M. J., H. Cobban, and T. Mitchell-Olds. 2002. Cross-species microsatellite

markers for elucidating population genetic structure in Arabidopsis and Arabis

(Brassicaceae). Mol. Ecol. 11:591-601.

Clauss, M. J., and T. Mitchell-Olds. 2006. Population genetic structure of Arabidopsis

lyrata in Europe. Mol. Ecol. 15:2753-2766.

Cutter, A. D., and B. A. Payseur. 2003. Selection at linked sites in the partial selfer

Caenorhabditis elegans. Mol. Biol. Evol. 20:665-673.

Darwin, C. R. 1876. The effects of cross and self-fertilization in the vegetable

kingdom. John Murray, London.

Dieringer, D., and C. Schlotterer. 2003. Two distinct modes of microsatellite mutation

processes: evidence from the complete genomic sequences of nine species.

Genome Res. 13:2242-2251.

Filatov, D. A., and D. Charlesworth. 1999. DNA polymorphism, haplotype structure

and balancing selection in the Leavenworthia PgiC locus. Genetics 153:1423-

1434.

Fisher, R. A. 1941. Average excess and average effect of a gene substitution. Ann.

Eugen. 11:53-63.

Foxe, J. P., T. Slotte, E. A. Stahl, B. Neuffer, H. Hurka, and S. I. Wright. 2009. Rapid

morphological evolution and speciation associated with the evolution of

selfing in Capsella. Proc. Natl. Acad. Sci. USA 106:5241-5245.

Gao, H., S. Williamson, and C. D. Bustamante. 2007. A Markov chain Monte Carlo

approach for joint inference of population structure and inbreeding rates from

multilocus genotype data. Genetics 176:1635-1651.

Glemin, S., E. Bazin, and D. Charlesworth. 2006. Impact of mating systems on

patterns of sequence polymorphism in flowering plants. Proc. R. Soc. B.

273:3011-3019.

Grant, V. 1981. Plant Speciation. Columbia University Press, New York.

Guo, Y.-L., J. S. Bechsgaard, T. Slotte, B. Neuffer, M. Lascoux, D. Weigel, and M.

H. Schierup. 2009. Recent speciation of Capsella rubella from Capsella

grandiflora, associated with loss of self-incompatibility and an extreme

bottleneck. Proc. Natl. Acad. Sci. USA 106:5246-5251.

Hoebe, P. N. 2009. Evolutionary dynamics of mating systems in populations of North

American Arabidopsis lyrata. Dissertation, 164pp. University of Glasgow.

Hoebe, P. N., M. Stift, A. Tedder, and B. K. Mable. 2009. Multiple losses of self-

incompatibility in North-American Arabidopsis lyrata: phylogeographic

context and population genetic consequences. Mol. Ecol. 18:4924-4939.

Hudson, R. R. 2001. Two-locus sampling distributions and their application. Genetics

159:1805-1817.

Hurka, H., and B. Neuffer. 1997. Evolutionary processes in the genus Capsella

(Brassicaceae). Plant Syst. Evol. 206:295-316.

Igic, B., R. Lande, and J. R. Kohn. 2008. Loss of self-incompatibility and its

evolutionary consequences. Int. J. Plant Sci. 169:93-104.

Ingvarsson, P. K. 2002. A metapopulation perspective on genetic diversity and

differentiation in partially self-fertilizing plants. Evolution 56:2368-2373.

Jakobsson, M., and N. A. Rosenberg. 2007. CLUMPP: a cluster matching and

permutation program for dealing with label switching and multimodality in

analysis of population structure. Bioinformatics 23:1801-1806.

Kalisz, S., D. W. Vogler, and K. M. Hanley. 2004. Context-dependent autonomous

self-fertilization yields reproductive assurance and mixed mating. Nature

430:884-887.

Karkkainen, K., H. Kuittinen, R. Van Treuren, C. Vogl, S. Oikarinen, and O.

Savolainen. 1999. Genetic basis of inbreeding depression in Arabis petraea.

Evolution 53:1354-1365.

Kliman, R. M., P. Andolfatto, J. A. Coyne, F. Depaulis, M. Kreitman, A. J. Berry, J.

McCarter, J. Wakeley, and J. Hey. 2000. The Population Genetics of the

Origin and Divergence of the Drosophila simulans Complex Species. Genetics

156:1913-1931.

Koch, M., and M. Kiefer. 2005. Genome evolution among cruciferous plants: a

lecture from the comparison of the genetic maps of three diploid species-

Capsella rubella, Arabidopsis lyrata subsp. petraea, and A. thaliana. Am. J.

Bot. 92:761-767.

Lewis, C. F. M., P. F. Karrow, S. M. Blasco, F. M. G. McCarthy, J. W. King, T. C.

Moore Jr., and D. K. Rea. 2008. Evolution of lakes in the Huron basin:

Deglaciation to present. Aquatic Ecosystem Health and Management 11:127-

136.

Librado, P., and J. Rozas. 2009. DnaSP v5: a software for comprehensive analysis of

DNA polymorphism data. Bioinformatics 25:1451-1452.

Liu, F., D. Charlesworth, and M. Kreitman. 1999. The effect of mating system

differences on nucleotide diversity at the phosphoglucose isomerase locus in

the plant genus Leavenworthia. Genetics 151:343-357.

Liu, F., L. Zhang, and D. Charlesworth. 1998. Genetic diversity in Leavenworthia

populations with different inbreeding levels. Proc. R. Soc. Lond. B. Biol. Sci.

265:293-301.

Liu, P., S. Sherman-Broyles, M. E. Nasrallah, and J. B. Nasrallah. 2007. A Cryptic

Modifier Causing Transient Self-Incompatibility in Arabidopsis thaliana.

Curr. Biol. 17:734-740.

Mable, B. K., and A. Adam. 2007. Patterns of genetic diversity in outcrossing and

selfing populations of Arabidopsis lyrata. Mol. Ecol. 16:3565-3580.

Mable, B. K., A. V. Robertson, S. Dart, C. Di Berardo, and L. Witham. 2005.

Breakdown of self-incompatibility in the perennial Arabidopsis lyrata

(Brassicaceae) and its genetic consequences. Evolution 59:1437-1448.

Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene.

Genet. Res. 23:23-35.

Muller, M. H., J. Leppala, and O. Savolainen. 2008. Genome-wide effects of

postglacial colonization in Arabidopsis lyrata. Heredity 100:47-58.

Ness, R. W., S. I. Wright, and S. C. H. Barrett. 2010. Mating-System Variation,

Demographic History and Patterns of Nucleotide Diversity in the Tristylous

Plant Eichhornia paniculata. Genetics 184:381-392.

Nordborg, M. 2000. Linkage disequilibrium, gene trees and selfing: an ancestral

recombination graph with partial self-fertilization. Genetics 154:923-929.

Pannell, J. R., and S. C. H. Barrett. 1998. Baker's law revisited: reproductive

assurance in a metapopulation. Evolution 52:657-668.

Pollak, E. 1987. On the theory of partially inbreeding finite populations. I. Partial

selfing. Genetics 117:353-360.

Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population

structure using multilocus genotype data. Genetics 155:945-959.

Pujol, B., S. R. Zhou, J. Sanchez Vilas, and J. R. Pannell. 2009. Reduced inbreeding

depression after species range expansion. Proc. Natl. Acad. Sci. USA 106:

15379-15383.

Ritland, K. 2002. Extensions of models for the estimation of mating systems using n

independent loci. Heredity 88:221-228.

Ritland, K., and S. K. Jain. 1981. A model for the estimation of outcrossing rate and

gene frequencies using n independent loci. Heredity 47:35-52.

Rosenberg, N. A. 2004. DISTRUCT: a program for the graphical display of

population structure. Mol. Ecol. Notes 4:137-138.

Ross-Ibarra, J., S. I. Wright, J. P. Foxe, A. Kawabe, L. DeRose-Wilson, G. Gos, D.

Charlesworth, and B. S. Gaut. 2008. Patterns of Polymorphism and

Demographic History in Natural Populations of Arabidopsis lyrata. PLoS ONE

3:e2411.

Savolainen, O., C. H. Langley, B. P. Lazzaro, and H. Freville. 2000. Contrasting

patterns of nucleotide polymorphism at the alcohol dehydrogenase locus in the

outcrossing Arabidopsis lyrata and the selfing Arabidopsis thaliana. Mol.

Biol. Evol. 17:645-655.

Sherman-Broyles, S., N. Boggs, A. Farkas, P. Liu, J. Vrebalov, M. E. Nasrallah, and J.

B. Nasrallah. 2007. S locus genes and the evolution of self-fertility in

Arabidopsis thaliana. Plant Cell 19:94-106.

Shimizu, K. K., J. M. Cork, A. L. Caicedo, C. A. Mays, R. C. Moore, K. M. Olsen, S.

Ruzsa, G. Coop, C. D. Bustamante, P. Awadalla, and M. D. Purugganan. 2004.

Darwinian selection on a selfing locus. Science 306:2081-2084.

Stebbins, G. L. 1950. Variation and evolution in plants. Columbia University Press,

New York.

Stebbins, G. L. 1956. Taxonomy and evolution of genera with special reference to

family Gramineae. Evolution 10:235-245.

Stebbins, G. L. 1957. Self-fertilization and population variability in higher plants.

Am. Nat. 91:337-354.

Stebbins, G. L. 1970. Adaptive radiation of reproductive characteristics in

angiosperms. I. Pollination mechanisms. Annu. Rev. Ecol. Syst. 1:307-326.

Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for

haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978-

989.

Taberlet, P., L. Gielly, G. Pautou, and J. Bouvet. 1991. Universal primers for

amplification of three non-coding regions of chloroplast DNA. Plant Mol.

Biol. 17:1105-1109.

Takebayashi, N., and P. L. Morrell. 2001. Is self-fertilization an evolutionary dead

end? Revisiting an old hypothesis with genetic theories and a

macroevolutionary approach. Am. J. Bot. 88:1143-1150.

Tang, C., C. Toomajian, S. Sherman-Broyles, V. Plagnol, Y. L. Guo, T. T. Hu, R. M.

Clark, J. B. Nasrallah, D. Weigel, and M. Nordborg. 2007. The evolution of

selfing in Arabidopsis thaliana. Science 317:1070-1072.

Tedder, A., P. N. Hoebe, S. W. Ansell, and B. K. Mable. 2010. Using chloroplast trnF

pseudogenes for phylogeography in Arabidopsis lyrata. Diversity 2:653-678.

Tsuchimatsu, T., K. Suwabe, R. Shimizu-Inatsugi, S. Isokawa, P. Pavlidis, T. Stadler,

G. Suzuki, S. Takayama, M. Watanabe, and K. K. Shimizu. 2010. Evolution of

self-compatibility in Arabidopsis by a mutation in the male specificity gene.

Nature 464:1342-1346.

Wright, S. I., J. P. Foxe, L. DeRose-Wilson, A. Kawabe, M. Looseley, B. S. Gaut, and

D. Charlesworth. 2006. Testing for effects of recombination rate on nucleotide

diversity in natural populations of Arabidopsis lyrata. Genetics 174:1421-

1430.

Wright, S. I., and B. S. Gaut. 2005. Molecular population genetics and the search for

adaptive evolution in plants. Mol. Biol. Evol. 22:506-519.

Wright, S. I., B. Lauga, and D. Charlesworth. 2003. Subdivision and haplotype

structure in natural populations of Arabidopsis lyrata. Mol. Ecol. 12:1247-

1263.

Wright, S. I., R. W. Ness, J. P. Foxe, and S. C. H. Barrett. 2008. Genomic

consequences of outcrossing and selfing in plants. Int. J. Plant Sci. 169:105-

118.

Table 1. Population-level outcrossing rates (Tm), proportion of self-compatible (SC) individuals (as an indication of selfing phenotype), summary statistics and observed heterozygosity for 18 nuclear gene sequences, and observed and expected heterozygosities (Ho and He) across nine microsatellite loci, with populations ordered by increasing outcrossing rates. Columns from "Syn. sites" to "Nuclear Ho" are based on nuclear gene sequences of 18 loci; the last three columns are based on nine microsatellite loci.

| Population | Tm [1] | Prop. SC plants [2] | Syn. sites [3] | Repl. sites [4] | πsyn [5] | Corr. πsyn [6] | πrep [7] | Corr. πrep [6] | Taj D [8] | ρ [9] | Corr. ρ [10] | Nuclear Ho | Microsatellite Ho | Microsatellite He | Corr. He [6] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PTP | 0.09 | 1 | 1880.8 | 6225.2 | 0.005 | 0.0087 | 0.001 | 0.0018 | 0.421 | 0 | 0 | 0.022 | 0.069 | 0.077 | 0.14 |
| LPT [11] | 0.13 | 1 | | | | | | | | | | | 0.069 | 0.13 | 0.23 |
| TC | 0.18 | 0.88 | 1868.03 | 6174.97 | 0.011 | 0.0193 | 0.002 | 0.0034 | 0.505 * | 0.005 | 0.025 | 0.117 | 0.181 | 0.281 | 0.48 |
| WAS | 0.25 | 1 | 1862.12 | 6174.88 | 0.005 | 0.0073 | 0.001 | 0.0016 | 0.774 ** | 0 | 0 | 0.104 | 0.083 | 0.136 | 0.22 |
| RON | 0.28 | 1 | 1870.3 | 6190.8 | 0.004 | 0.0056 | 0.001 | 0.0016 | -0.831 *** | 0 | 0 | 0.045 | 0.028 | 0.028 | 0.044 |
| KTT | 0.31 | 1 | 1876.4 | 6211.6 | 0.008 | 0.012 | 0.001 | 0.0015 | 0.631 ** | 0.008 | 0.024 | 0.057 | 0 | 0.044 | 0.067 |
| TSSA | 0.41 | 0.5 | 1880.52 | 6225.48 | 0.006 | 0.0082 | 0.002 | 0.0028 | -0.278 | 0 | 0 | 0.084 | 0.097 | 0.187 | 0.27 |
| TCA | 0.48 | 1 | 1880.66 | 6225.34 | 0.013 | 0.0182 | 0.002 | 0.0027 | 0.509 ** | 0 | 0 | 0.063 | 0.042 | 0.121 | 0.16 |
| OWB | 0.64 | 0.29 | 1835.9 | 6090 | 0.008 | 0.0094 | 0.001 | 0.0012 | 0.754 ** | 0.004 | 0.006 | 0.27 | 0.35 | 0.3 | 0.36 |
| HDC | 0.65 | 0.14 | 1879.8 | 6226.3 | 0.003 | 0.0037 | 0.001 | 0.0012 | 0.480 | 0.009 | 0.014 | 0.17 | 0.069 | 0.14 | 0.17 |
| PIC | 0.77 | 0 | 1823.2 | 6015.8 | 0.013 | 0.0142 | 0.002 | 0.0023 | 0.513 * | 0 | 0 | 0.21 | 0.36 | 0.32 | 0.36 |
| MAN | 0.83 | 0.13 | 1790.4 | 5850.6 | 0.016 | 0.0177 | 0.002 | 0.0022 | 0.700 ** | 0.005 | 0.005 | 0.23 | 0.22 | 0.27 | 0.29 |
| PIN | 0.84 | 0 | 1878.4 | 6218.7 | 0.014 | 0.0149 | 0.002 | 0.0022 | 0.191 | 0.002 | 0.002 | 0.22 | 0.18 | 0.26 | 0.28 |
| PIR | 0.88 | 0 | 1780.5 | 5929.5 | 0.012 | 0.013 | 0.002 | 0.0021 | 0.170 | 0.001 | 0.001 | 0.21 | 0.26 | 0.26 | 0.28 |
| PRI | 0.89 | 0.13 | 1854.2 | 6143.8 | 0.008 | 0.0086 | 0.001 | 0.0011 | 0.527 * | 0 | 0 | 0.076 | 0.14 | 0.15 | 0.16 |
| TSS | 0.91 | 0 | 1880.1 | 6225.9 | 0.01 | 0.0101 | 0.002 | 0.0021 | 0.578 ** | 0.005 | 0.005 | 0.301 | 0.361 | 0.456 | 0.48 |
| IOM | 0.94 | 0 | 1873.9 | 6199 | 0.01 | 0.0106 | 0.002 | 0.0021 | 0.419 * | 0.003 | 0.003 | 0.23 | 0.28 | 0.4 | 0.41 |
| LSP | 0.94 | 0 | 1781.3 | 5829.7 | 0.005 | 0.0057 | 0.001 | 0.001 | -0.335 | 0.01 | 0.011 | 0.16 | 0.14 | 0.11 | 0.11 |
| SBD | 0.94 | 0 | 1813.3 | 6010.7 | 0.015 | 0.0157 | 0.003 | 0.0031 | 0.757 *** | 0.023 | 0.024 | 0.27 | 0.42 | 0.54 | 0.56 |
| PUK | 0.96 | 0 | 1856.8 | 6150.2 | 0.016 | 0.0161 | 0.002 | 0.002 | 0.822 *** | 0.009 | 0.009 | 0.26 | 0.34 | 0.34 | 0.34 |
| BEI | 0.98 | 0 | 1853.8 | 6135.2 | 0.02 | 0.0204 | 0.003 | 0.003 | 0.683 ** | 0.016 | 0.016 | 0.33 | 0.32 | 0.39 | 0.39 |
| PCR | 0.98 | 0 | 1871.6 | 6195.4 | 0.012 | 0.0119 | 0.001 | 0.001 | -0.009 | 0.002 | 0.002 | 0.23 | 0.21 | 0.3 | 0.3 |
| IND | 0.99 | 0.14 | 1834.6 | 6067.4 | 0.011 | 0.0111 | 0.002 | 0.002 | 0.421 * | 0.005 | 0.005 | 0.3 | 0.43 | 0.47 | 0.47 |
| NCM [12] | - | 0 | 1741.9 | 5668.1 | 0.003 | - | 0.001 | - | 0.875 ** | 0.003 | - | 0.12 | 0.14 | 0.17 | - |

* indicates p < 0.05, ** indicates p < 0.01, *** indicates p < 0.001

1 Based on multilocus estimates obtained from microsatellite variation in progeny arrays, using MLTR version 2.3 (Ritland 2002)

2 Based on manual self-pollinations (see Table S2 for more details)

3 Total number of synonymous sites across loci

4 Total number of replacement (non-synonymous) sites across loci

5 πsyn: synonymous nucleotide diversity: the average number of pairwise differences between two sequences across loci

6 For each of the diversity measures, corrections for the reduction in effective population size due to selfing rate ($S = 1 - T_m$) were performed using the formula $\theta_{\mathrm{corrected}} = \theta_{\mathrm{obs}}(1 + F)$, where $\theta$ represents the genetic diversity measure and $F = S/(2 - S)$, where $F$ is the inbreeding coefficient and $S$ is the selfing rate for the population

7 πrep: replacement (non-synonymous) nucleotide diversity: the average number of pairwise differences between two sequences across loci

8 Taj D: average Tajima's D across loci

9 ρ: population recombination parameter (see text for details), median across loci

10 The recombination parameter ρ was corrected for the reduction in effective population size due to selfing rate ($S = 1 - T_m$) using the formula $R_{\mathrm{corrected}} = R_{\mathrm{obs}}/(1 - S)$, where $R$ is the recombination parameter and $S$ is the selfing rate for the population

11 Nuclear gene sequencing failed for this population due to PCR problems

12 Outcrossing rates could not be obtained for this population due to insufficient seeds in the batches for progeny arrays
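The two selfing corrections described in the Table 1 footnotes can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the inputs are the rounded HDC values from Table 1, so the results agree with the tabulated corrected values only to the precision shown there.

```python
def inbreeding_coefficient(selfing_rate):
    """Inbreeding coefficient F = S / (2 - S) for selfing rate S."""
    return selfing_rate / (2.0 - selfing_rate)


def correct_diversity(observed, outcrossing_rate):
    """Diversity correction: theta_corrected = theta_obs * (1 + F), with S = 1 - Tm."""
    selfing_rate = 1.0 - outcrossing_rate
    return observed * (1.0 + inbreeding_coefficient(selfing_rate))


def correct_recombination(observed, outcrossing_rate):
    """Recombination correction: R_corrected = R_obs / (1 - S), with S = 1 - Tm."""
    selfing_rate = 1.0 - outcrossing_rate
    return observed / (1.0 - selfing_rate)


# Rounded values for population HDC as listed in Table 1 (Tm = 0.65).
tm_hdc = 0.65
corrected_he = correct_diversity(0.14, tm_hdc)        # microsatellite He
corrected_rho = correct_recombination(0.009, tm_hdc)  # recombination parameter

print(round(corrected_he, 2), round(corrected_rho, 3))
```

Because 1 - S equals Tm, the recombination correction amounts to dividing the observed value by the outcrossing rate, which is why it is undefined for a population with no outcrossing estimate.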

Table 2. Linear regressions of outcrossing rate (Tm) and the proportion of self-compatible (SC) individuals per population on synonymous diversity (πsyn), corrected synonymous diversity, the recombination parameter (ρ), the corrected recombination parameter and observed heterozygosity (Ho) across 18 nuclear gene sequences; and observed and expected microsatellite heterozygosity (Ho and He) across nine microsatellite loci.

| Measure | Tm [1]: r² | Tm: F ratio | Tm: beta | Tm: degrees of freedom | Tm: p value | Prop. SC [2]: r² | Prop. SC: F ratio | Prop. SC: beta | Prop. SC: degrees of freedom | Prop. SC: p value |
|---|---|---|---|---|---|---|---|---|---|---|
| πsyn [3] | 0.3 | 8.65 | 0.008 | 21 | 0.008 | 0.21 | 5.4 | -0.004 | 22 | 0.03 |
| Corrected πsyn [4] | 0.039 | 0.81 | 0.003 | 21 | 0.38 | 0.14 | 0.28 | -0.001 | 21 | 0.61 |
| πrep [5] | 0.171 | 4.12 | 0.008 | 21 | 0.0558 | 0.112 | 2.77 | -0.0005 | 22 | 0.11 |
| Corrected πrep [4] | 0.0069 | 0.139 | 0.00019 | 21 | 0.71 | 0.0004 | 0.08 | 0.0001 | 21 | 0.78 |
| ρ [6] | 0.149 | 3.49 | -0.005 | 21 | 0.0761 | 0.129 | 2.96 | -0.005 | 22 | 0.1006 |
| Corrected ρ [7] | 0.0008 | 0.0156 | 0.0003 | 21 | 0.90 | 0.0003 | 0.0053 | 0.0003 | 21 | 0.94 |
| Nuclear Ho | 0.64 | 35.23 | 0.250 | 21 | <0.0001 | 0.58 | 1.13 | -0.169 | 22 | <0.0001 |
| Microsatellite Ho | 0.49 | 19.1 | 0.306 | 22 | 0.0003 | 0.51 | 20.5 | -0.223 | 23 | 0.0002 |
| Microsatellite He | 0.44 | 15.6 | 0.313 | 22 | 0.0008 | 0.44 | 15.5 | -0.223 | 23 | 0.0008 |

1 Estimated based on multilocus microsatellite variation in progeny arrays, using MLTR version 2.3 (Ritland 2002)

2 Proportion of self-compatible (SC) individuals within the population (cf. Table S2)

3 Synonymous diversity across all 18 nuclear loci in each population as measured by πsyn, where π is the average number of pairwise differences between two sequences

4 For each of πsyn and πrep, corrections for the reduction in effective population size due to selfing rate ($S = 1 - T_m$) were performed using the formula $\theta_{\mathrm{corrected}} = \theta_{\mathrm{obs}}(1 + F)$, where $\theta$ represents the genetic diversity measure and $F = S/(2 - S)$, where $F$ is the inbreeding coefficient and $S$ is the selfing rate for the population

5 Replacement diversity across all 18 nuclear loci in each population as measured by πrep, where π is the average number of pairwise differences between two sequences

6 The population recombination parameter ρ; here we use the median ρ across all 18 nuclear loci in each population on sequences with > 3 segregating sites

7 The recombination parameter ρ was corrected for the reduction in effective population size due to selfing rate ($S = 1 - T_m$) using the formula $R_{\mathrm{corrected}} = R_{\mathrm{obs}}/(1 - S)$, where $R$ is the recombination parameter and $S$ is the selfing rate for the population
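As an illustration of the kind of regression summarised in Table 2, the sketch below fits an ordinary least-squares line of microsatellite He on outcrossing rate using the rounded Table 1 values for the 23 populations with a Tm estimate. It is a minimal sketch, not the authors' analysis: the function name is hypothetical, the software and exact inputs used for Table 2 are not stated here, and because the inputs are rounded the estimates need not match Table 2 exactly.

```python
def linear_regression(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r2, f_ratio)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    # Residual sum of squares; F compares explained to residual variance.
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    if ss_resid == 0:
        f_ratio = float("inf")  # perfect fit
    else:
        f_ratio = (syy - ss_resid) / (ss_resid / (n - 2))
    return slope, intercept, r2, f_ratio


# Outcrossing rate (Tm) and microsatellite He, rounded values from Table 1
# (PTP to IND; NCM is omitted because it has no Tm estimate).
tm = [0.09, 0.13, 0.18, 0.25, 0.28, 0.31, 0.41, 0.48, 0.64, 0.65, 0.77, 0.83,
      0.84, 0.88, 0.89, 0.91, 0.94, 0.94, 0.94, 0.96, 0.98, 0.98, 0.99]
he = [0.077, 0.13, 0.281, 0.136, 0.028, 0.044, 0.187, 0.121, 0.3, 0.14, 0.32,
      0.27, 0.26, 0.26, 0.15, 0.456, 0.4, 0.11, 0.54, 0.34, 0.39, 0.3, 0.47]

slope, intercept, r2, f_ratio = linear_regression(tm, he)
print(f"beta = {slope:.3f}, r2 = {r2:.2f}, F = {f_ratio:.1f}")
```

A positive slope here corresponds to the pattern reported in the table: expected heterozygosity increases with outcrossing rate.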

Table 3. Total and unique number of variants (microsatellite alleles across nine loci, nuclear gene haplotypes across 18 genes) for the group of inbreeding populations and for the group of outcrossing populations; overall total number of different variants; number of variants shared across inbreeding and outcrossing populations. For the unique alleles, expected numbers under the null hypothesis (equal proportion of unique alleles in inbreeding vs. outcrossing populations) are given in brackets. A goodness-of-fit test (G-test) was used to evaluate whether the observed numbers of unique alleles differed significantly from null expectations (* p < 0.05 × 10^[…], ** p < 0.05 × 10^[…]).

| Variant type | Inbreeding populations: total | Inbreeding populations: unique (expected) | Outcrossing populations: total | Outcrossing populations: unique (expected) | G-test statistic | Total number of different variants | Total shared between inbreeding and outcrossing |
|---|---|---|---|---|---|---|---|
| Microsatellites | 31 | 1 (8.59) | 52 | 22 (14.4) | 14.3 * | 53 | 30 |
| Nuclear genes | 145 | 55 (101.1) | 449 | 359 (312.9) | […] ** | 504 | 90 |
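The expected counts and the G statistic in Table 3 can be checked with a short calculation. The sketch below is an assumption about how the test was set up, inferred from the caption: expected unique counts are the group totals multiplied by the pooled proportion of unique variants, and G = 2 Σ O ln(O/E). The function names are hypothetical. With the microsatellite counts this gives expected values of about 8.59 and 14.4 and a statistic of about 14.3, as in the table. The nuclear-gene statistic is not legible in the table, so the second value printed is only what this formula yields from the tabulated counts.

```python
import math


def expected_unique(total_in, total_out, unique_in, unique_out):
    """Expected unique counts if both groups had the same proportion of unique variants."""
    pooled = (unique_in + unique_out) / (total_in + total_out)
    return total_in * pooled, total_out * pooled


def g_statistic(observed, expected):
    """Goodness-of-fit G = 2 * sum(O * ln(O / E)), skipping empty classes."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Counts from Table 3: (total inbreeding, total outcrossing,
#                       unique inbreeding, unique outcrossing)
datasets = {
    "Microsatellites": (31, 52, 1, 22),
    "Nuclear genes": (145, 449, 55, 359),
}

for name, (t_in, t_out, u_in, u_out) in datasets.items():
    exp_in, exp_out = expected_unique(t_in, t_out, u_in, u_out)
    g = g_statistic((u_in, u_out), (exp_in, exp_out))
    print(f"{name}: expected = ({exp_in:.2f}, {exp_out:.2f}), G = {g:.1f}")
```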

[Figure 1 image: map of the sampled populations with pie charts, and bar plots of individuals grouped as Selfing and Outcrossing.]

Figure 1. Posterior probabilities of Bayesian clustering analysis (InStruct) using the combined nuclear gene sequence and microsatellite datasets, based on a prior of six clusters (k = 6). Bar plots show individual posterior probabilities; pie charts on the map show mean posterior probabilities for each population (averaged over 8 individuals). Chloroplast trnF(GAA) region haplotypes found in each population are listed beneath the population labels below the bar plots. The dotted line on the map indicates the approximate southern limit of the ice sheet during the last glacial maximum.