GENOMIC CAUSES AND CONSEQUENCES OF THE EVOLUTION OF SELF-FERTILIZATION IN THE FLOWERING GENERA CAPSELLA AND COLLINSIA

by

Khaled Hazzouri

A thesis submitted in conformity with the requirements for the degree of Doctorate of Philosophy

Graduate Department of Ecology and Evolutionary Biology

University of Toronto

© Copyright by Khaled Hazzouri 2012

GENOMIC CAUSES AND CONSEQUENCES OF THE EVOLUTION OF SELF-FERTILIZATION IN THE GENERA CAPSELLA AND COLLINSIA

Khaled Hazzouri

Doctor of Philosophy

Department of Ecology and Evolutionary Biology

University of Toronto

2012

Abstract

The shift in mating system from outcrossing to selfing is associated with many evolutionary changes, including reduced flower size and changes in sex allocation, leading to a suite of morphological characteristics known as the selfing syndrome. Furthermore, the evolution of selfing is expected to have important effects on genetic variation and the efficacy of natural selection. However, the underlying genomic causes of morphological evolution and the extent of relaxed selection remain unresolved. In this thesis I use new genomic approaches to investigate the genetic basis of floral evolution as well as the consequences of the evolution of selfing in the genus Capsella (Brassicaceae), in which the highly selfing C. rubella evolved recently from the self-incompatible, obligately outcrossing C. grandiflora. Quantitative trait locus (QTL) mapping results suggest that few loci with major effects on multiple floral phenotypes underlie the evolution of the selfing syndrome. Patterns of neutral diversity in QTL regions from both resequencing and next-generation transcriptome sequencing suggest an important role for positive directional selection in the evolution of the selfing syndrome. Combined with the identification of differentially expressed genes, the signals of positive selection provide candidate regions for identifying the causal evolutionary changes. Analysis of whole-transcriptome polymorphism patterns shows a significant increase in the ratio of unique nonsynonymous to synonymous polymorphism in C. rubella, consistent with a genome-wide relaxation of selection.

To investigate the evolution of selfing in the genus Collinsia, I combined transcriptome sequencing with Sanger resequencing of multiple populations of the large-flowered Collinsia linearis and its small-flowered, more highly selfing sister species Collinsia rattanii. Patterns of nucleotide diversity suggest that the recent divergence of C. rattanii was associated with a severe population bottleneck. C. rattanii showed an increased proportion of nonsynonymous polymorphism, consistent with a relaxation of natural selection following the population bottleneck and shift to selfing. In general, my results show that combining genome-wide polymorphism, QTL mapping and gene expression data provides a powerful approach for investigating the targets of adaptive evolution following the shift to selfing, and they highlight the potential for relaxation of selection associated with this transition.


ACKNOWLEDGEMENTS

I would like to express my deep and sincere gratitude to my supervisor, Dr. Stephen Wright, whose expertise, understanding and patience added so much to my experience as a graduate student. Without his vast knowledge of the field, this thesis would not appear in its present form. I would like to thank him for his continuous support.

I would like to thank my committee members, Dr. Asher Cutter and Dr. Aneil Agrawal, for the assistance they provided at all levels of the research project. Finally, I would like to thank Dr. Kermit Ritland from the University of British Columbia (UBC) for taking time out to serve as my external examiner.

I would like to thank Dr. Tanja Slotte, from whom I learned a lot when she was a postdoctoral fellow in the Wright lab. In particular, I would like to thank Dr. Peter Andolfatto from Princeton University for his willingness to have me in his lab to learn and perform the multiplexed shotgun genotyping (MSG) protocol. I would like to thank Adrian Platte from McGill University for all his help with programming. I would like to thank Rob Ness and Juan Escobar, who were there to help with their expertise in the field and took the time to explain things.

I would like to thank my colleagues in the Wright lab for our interesting debates and exchanges of knowledge, which helped enrich my experience. I would like to thank Bruce and Andrew at the University of Toronto greenhouse for all their help and advice about handling and taking care of the plants.

Finally, I would like to thank my precious family. In particular, I would like to dedicate this thesis to my Dad and Mom, who waited a long time and supported me through my entire life abroad. Without their sacrificial love and encouragement, I would not have finished this thesis.


TABLE OF CONTENTS

Abstract

Acknowledgements

Table of Contents

List of Tables

List of Figures

CHAPTER ONE: GENERAL INTRODUCTION

Causes and consequences of mating system evolution

Capsella as a study system

Collinsia as a study system

Research objectives

CHAPTER TWO: GENETIC ARCHITECTURE AND ADAPTIVE SIGNIFICANCE OF THE SELFING SYNDROME IN CAPSELLA

Summary

Introduction

Materials and Methods

Results

Discussion

Conclusions

CHAPTER THREE: EVOLUTIONARY GENOMICS OF MATING SYSTEM IN CAPSELLA

Summary

Introduction

Materials and Methods

Results

Discussion

CHAPTER FOUR: COMPARATIVE POPULATION GENOMICS IN TWO COLLINSIA SPECIES WITH CONTRASTING MATING SYSTEM

Summary

Introduction

Materials and Methods

Results

Discussion

CONCLUSION

LITERATURE CITED


LIST OF TABLES

Table 2.1. Means and standard deviations for vegetative, floral and reproductive traits in C. rubella and C. grandiflora

Table 2.2. List of significant QTL, including 1.5-LOD and 2-LOD confidence intervals and effect size estimates

Table 2.3. Population genetics of narrow QTL regions compared to the rest of the genome

Table 3.1. Sampling locations of C. grandiflora and C. rubella

Table 3.2. Summary of the Capsella rubella genome assembly and annotation

Table 3.3. Summary of the initial ordering of de novo assembly of different contigs

Table 3.4. Genome-wide polymorphism summary statistics

Table 3.5. Down-regulated genes in Capsella rubella after enrichment analysis using DAVID bioinformatics

Table 3.6. Up-regulated genes in Capsella rubella after enrichment analysis using DAVID bioinformatics

Table 3.7. Estimates of the distribution of fitness effects (DFE) of amino acid mutations that are unique to C. grandiflora and C. rubella

Table 4.1. Population samples used for Collinsia rattanii and Collinsia linearis

Table 4.2. Summary of the de novo assembly of C. linearis and C. rattanii

Table 4.3. Pairwise comparisons of synonymous polymorphisms (unique, shared and fixed differences) between Collinsia linearis and Collinsia rattanii

Table 4.4. Codon preference and GC bias from pairwise comparison at synonymous sites that differ between C. linearis and C. rattanii

LIST OF FIGURES

Figure 2.1. Schematic showing floral measurements

Figure 2.2. Histograms of representative floral and reproductive traits in the F2, F1 as well as in C. grandiflora and C. rubella

Figure 2.3. Character correlations in the F2 population

Figure 2.4. Capsella linkage map and 1.5-LOD confidence intervals

Figure 2.5. Fine-mapping of QTL for self-incompatibility on linkage group seven

Figure 3.1. Comparative genome mapping of A. lyrata, C. rubella and S. parvula

Figure 3.2. A sliding window of synonymous diversity in C. grandiflora and C. rubella

Figure 3.3. Proportion of nonsynonymous relative to synonymous polymorphism over the different frequency classes (1-5) as well as site frequency spectra for synonymous and nonsynonymous polymorphisms in both Capsella grandiflora and Capsella rubella

Figure 3.4. Plot of the normalized mean expression versus ln fold change for the difference in C. grandiflora compared to C. rubella

Figure 3.5. Pearson correlation analysis of differentially expressed genes in C. grandiflora and C. rubella

Figure 3.6. A box plot comparing differentially expressed genes in Capsella and Arabidopsis species for genes that were identified as significantly up- and down-regulated in C. rubella compared to C. grandiflora

Figure 3.7. Ratio of nonsynonymous to synonymous polymorphisms for unique, shared and fixed differences in C. grandiflora and C. rubella at the genome-wide level and at genes found to be significantly up- and down-regulated in C. rubella

Figure 3.8. Sliding window of unique and shared polymorphism, and fixed differences in C. rubella and C. grandiflora in windows of 100 SNPs

Figure 3.9. Comparison of the number of fixed differences relative to shared polymorphisms in genome-wide data compared with narrow QTL regions

Figure 4.1. Contig length distribution of de novo assemblies of Collinsia linearis and Collinsia rattanii

Figure 4.2. Plot of the fraction of each of the assembled contigs that is aligned to its top BLASTx hit from the plant Uniprot database against the proportion of that top hit which is aligned to the query sequence

Figure 4.3. Majority-rule consensus tree inferred with the concatenate of 17 loci sequenced from different populations in C. linearis and C. rattanii

Figure 4.4. Barplots of the average Watterson’s synonymous diversity (θW) using the 17 loci sequenced in Collinsia rattanii and Collinsia linearis

Figure 4.5. Plot of the frequency of optimal codons (Fop) as a function of gene expression in Collinsia linearis and Collinsia rattanii

Figure 4.6. Decay of linkage disequilibrium (r²) with physical distance for the combined 17 loci in Collinsia linearis and Collinsia rattanii

LIST OF APPENDICES

Appendix 2.1

Appendix 2.2

Appendix 2.3

Appendix 3.1

Appendix 4.1

Appendix 4.2

Appendix 4.3


CHAPTER ONE: GENERAL INTRODUCTION

Causes and consequences of mating system evolution

In flowering plants the evolutionary shift of mating system from outcrossing to selfing has occurred frequently. Stebbins (1970) described it as the most common evolutionary transition in angiosperms. The majority of angiosperms possess hermaphroditic flowers and many are highly outcrossed. Nevertheless, approximately 30% are predominantly selfing (Barrett 2002). Selfing can provide two types of evolutionary advantage. First, it increases reproductive success when pollinators are scarce or pollen transfer is insufficient, which is known as reproductive assurance. Second, there is a 3:2 transmission advantage, first shown by Fisher (1941): in contrast to an outcrossing individual, a selfing individual serves as both ovule and pollen parent to its own selfed progeny, and can still act as pollen parent to the progeny of other plants. Just as the forces favoring the evolution of self-fertilization are well known, so is the primary force opposing its spread, inbreeding depression, which is the reduction in viability of selfed relative to outcrossed progeny (Charlesworth and Charlesworth 1987b).

Early models (Lande and Schemske 1985) of plant mating system evolution suggested that predominant outcrossing and selfing are alternative stable states. Under this model, only two stable equilibria exist: selfing with low levels of inbreeding depression (δ) and outcrossing with high levels of inbreeding depression. As selfing rates increase, inbreeding depression decreases because selection becomes more efficient at removing recessive deleterious alleles (Charlesworth and Charlesworth 1987b; Dudash and Carr 1998). An unstable equilibrium exists at δ = 0.5, where the two-fold transmission advantage of selfing is exactly balanced by the reduction in fitness due to inbreeding depression. Therefore, a small increase or decrease in inbreeding depression from this point should favor outcrossing or selfing, respectively. More recent theory has also found conditions under which mixed mating is stable (Holsinger 1988; Uyenoyama et al. 1993; Porcher and Lande 2005). One possible precursor to the evolution of selfing may be a population bottleneck (Fowler 1965; Lewis 1973), which can result in a high degree of inbreeding, reducing the frequency of deleterious alleles and thus the level of inbreeding depression. Theory also suggests that even with high inbreeding depression, a mutation causing complete or nearly complete selfing can spread to fixation (Lande and Schemske 1989; Holsinger 1988; Charlesworth et al. 1990; Schultz and Willis 1995). Similarly, associations between loci controlling the outcrossing rate, through genetic and morphological traits, and loci that cause inbreeding depression could favor the evolution of selfing (Uyenoyama and Waller 1991a).
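The threshold of δ = 0.5 can be illustrated by counting the gene copies that a plant transmits per seed it produces. The following Python sketch is an illustration only and is not part of the analyses in this thesis. The function name is hypothetical, and the sketch assumes no pollen discounting, that is, a selfer exports as much pollen as an outcrosser.

```python
def gene_copies_transmitted(selfing_rate, delta):
    """Expected gene copies transmitted per seed produced by a plant.

    Assumptions (made for this illustration only):
      * fitness of an outcrossed seed is scaled to 1;
      * selfed seeds survive with relative fitness 1 - delta;
      * no pollen discounting: every plant sires, on average, one
        outcrossed seed on other plants through exported pollen.
    """
    # Each surviving selfed seed carries two copies of the parent's genes.
    via_selfed_seed = 2.0 * selfing_rate * (1.0 - delta)
    # Each outcrossed seed carries one copy from the ovule parent.
    via_outcrossed_seed = 1.0 * (1.0 - selfing_rate)
    # One copy is transmitted through pollen that sires seed elsewhere.
    via_pollen_export = 1.0
    return via_selfed_seed + via_outcrossed_seed + via_pollen_export


outcrosser = gene_copies_transmitted(0.0, 0.0)         # 2 copies
selfer_no_depression = gene_copies_transmitted(1.0, 0.0)  # 3 copies, the 3:2 advantage
selfer_at_threshold = gene_copies_transmitted(1.0, 0.5)   # 2 copies, advantage cancelled
```

Under these assumptions the expression simplifies to 2 + s(1 − 2δ) for selfing rate s, so any increase in selfing is favored when δ < 0.5 and disfavored when δ > 0.5.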

In flowering plants the evolution of selfing involves a syndrome characterized by changes affecting many morphological traits (Ornduff 1969; Jain 1976; Wyatt 1988). In selfing species, the traits that most commonly change include petal size, anther size, pistil and style length, the degree of stigma-anther separation, seed set and seed mass, and the relative allocation to the two sexual functions (pollen to ovule ratio). In addition to the morphological evolution that influences the shift in mating system, genetic breakdown of self-incompatibility (SI) systems leads to self-compatibility (SC) (Paetsch et al. 2006; Boggs et al. 2009). Overall, in response to the increase in selfing rate, allocation of resources to structures for pollinator attraction and reward is reduced, and flower size tends to decrease. The association of floral reduction and high selfing has been demonstrated in many genera, including Leavenworthia (Lloyd 1965), Arenaria (Wyatt 1984), Mimulus (Ritland and Ritland 1989), Leptosiphon (Goodwillie 1999), Collinsia (Kalisz 2010), and Arabidopsis (Goodwillie et al. 2010).


A major factor generating patterns of flowering plant diversity has been interactions with animal pollinators (Fenster et al. 2004), and the morphological shape of flowers has been under continual natural selection (Campbell 1989; Alexandersson and Johnson 2002; Maad 2000; Maad and Alexandersson 2004). When seed production is limited by the amount of pollen received (Burd 1994; Larson and Barrett 2000; Ashman et al. 2004; Knight et al. 2005), theory suggests that traits affecting pollinator visitation and pollinator efficiency should be subject to selection via female fitness (Ashman and Morgan 2004). For example, directional selection operates when plants with larger flowers produce more seeds than small-flowered plants, as has been documented in Mimulus guttatus (Willis 1996), Ipomopsis aggregata (Campbell et al. 1991), and Arabidopsis lyrata (Saskia and Agren 2009). A wide range of morphological traits affects pollination success. Pollinator behavior and pollen transfer efficiency influence the evolution of flower morphology, and may influence both fecundity and mating system (Nilsson 1988; Harder and Barrett 1993; Aigner 2004). Stabilizing selection is also believed to maintain the architectural invariability of animal-pollinated flowers (Cresswell 2000). Once a population evolves towards high selfing, however, strong stabilizing selection on flower shape may be relaxed, allowing genetic drift to predominate.

On the other hand, it has been proposed that pollen limitation not only shapes floral characters via natural selection but can also select for traits that increase seed set through autonomous selfing (Campbell 1989; Alexandersson and Johnson 2002; Maad and Alexandersson 2004; Fishman and Willis 2007; Saskia and Agren 2009). If the structures and rewards that attract pollinators, such as petals and floral nectar, benefit male fitness more than female fitness (Bell 1985), resource allocation to such functions should decrease with increasing selfing rate. However, the decrease in resource allocation to attractive structures will depend upon the extent to which female fitness benefits from attracting pollinators (Charlesworth and Charlesworth 1987a; Lloyd 1987). Resources that are saved by reducing male function can be redirected into female function (ovules and seeds) or into growth and maintenance of vegetative structures (Schoen 1982a; Lloyd 1987). In general, the morphological evolution associated with the selfing syndrome could be the result of directional natural selection or relaxation of stabilizing selection, and the relative roles of these two processes remain unclear.

The genetics and molecular biology of self-incompatibility versus self-compatibility have been described in detail (Takayama and Isogai 2005). However, the genetic and molecular basis of adaptation of selfing syndrome traits is less well known, despite the frequent occurrence of the syndrome in angiosperm evolution. So far, only one key gene contributing to a selfing syndrome trait has been identified: a major locus responsible for style length, which was found to influence the degree of separation of anther and stigma (herkogamy) in cultivated tomatoes (Bernacchi and Tanksley 1997; Fulton et al. 1997; Chen et al. 2007).

Theoretical work by Fisher (1930) proposed that adaptation proceeds through the fixation of small-effect mutations, since small phenotypic modifications are more likely to be beneficial than mutations of large effect. This hypothesis of gradual evolution was challenged by Gottlieb (1984), whose review of studies of the genetics of plant shape and architecture suggested the importance of mutations with large phenotypic effects. Subsequently, Orr (1998) suggested that a population far from its optimum can first approach it rapidly via mutations of large effect, with further refinement of its position relative to the optimum accomplished via small-effect mutations.

Studies of the genetic basis of mating system evolution began with biometric analysis, which provides information about the number of genes responsible for traits and their effect sizes. Some biometric studies demonstrated polygenic control of floral traits (MacNair and Cumbes 1989; Fenster and Ritland 1994a; Fenster et al. 1995; Holtsford and Ellstrand 1992; Shore and Barrett 1990), whereas fewer have shown a major gene effect (Marshall and Abbott 1984; Fenster and Barrett 1994). However, inference about individual loci was not feasible until the emergence of molecular mapping of quantitative traits (Lander and Botstein 1989). Several studies have used QTL mapping to gain insight into the genetic basis of mating system evolution (Grandillo and Tanksley 1996; Lin and Ritland 1997; Bradshaw et al. 1998; Fishman et al. 2002; Georgiady et al. 2002; Hodges et al. 2002). For example, only one to six QTLs, with at least one of large effect, were found to distinguish floral traits associated with pollinator divergence in the bee-pollinated Mimulus lewisii and the hummingbird-pollinated Mimulus cardinalis (Bradshaw et al. 1998). In contrast, QTL analyses of traits involved in interspecific mating system differences between the outcrossing Mimulus guttatus and the selfing Mimulus nasutus indicate that these trait differences (floral morphology and stigma-anther separation) are primarily controlled by numerous loci of small effect (Fishman et al. 2002), whereas three to seven QTLs of moderate effect differentiate the selfer Leptosiphon bicolor from the outcrosser Leptosiphon jepsonii (Goodwillie et al. 2006). In general, QTL experiments indicate that most phenotypic differences associated with the selfing syndrome among populations and species are polygenic, but the number of loci and their effect sizes differ across traits and species (Lin and Ritland 1997; Bernacchi and Tanksley 1997; Fishman et al. 2002; Georgiady et al. 2002; Goodwillie et al. 2006; Grillo et al. 2009). Moreover, most detected QTL are pleiotropic, implying that the evolutionary shift may be constrained. In contrast, floral evolution in cultivated crops (Grandillo and Tanksley 1996; Georgiady et al. 2002) is typically governed by few genes of large effect, which may be the result of strong selection associated with human domestication of plants (Remington and Purugganan 2003).


What are the genomic consequences of mating system transitions from outcrossing to selfing? The evolutionary potential of selfing species is expected to be reduced by lower genetic variation, lower effective recombination and greater vulnerability to mutational decay (Charlesworth and Wright 2001). The mating system influences the vertical transmission of heritable variation from generation to generation, whereas gene flow affects its horizontal transfer among populations. The primary genomic consequence of selfing is increased homozygosity at loci across the genome, which is an increasing function of selfing rate. This increase in homozygosity affects the distribution of genetic variation within individuals and among populations. Although recombination still occurs, its effective rate is normally low, because crossing over rarely takes place between heterozygous sites (Nordborg 2000). Given this, genetic hitchhiking (selective sweeps of beneficial mutations or background selection against deleterious mutations) (Charlesworth and Wright 2001) can significantly reduce the effective population size. Thus, selfing alone can be responsible for more than the expected two-fold decrease in effective population size. Furthermore, many plants with high selfing rates are annual (Barrett et al. 1996) and many are very good colonizers (Brown and Burdon 1987). Because of these life history features, populations of selfing plants are often short-lived, subdivided and subject to frequent bottlenecks and isolation. Therefore, in addition to selfing itself, demography also plays a role in reducing the effective population size of selfing populations, which is commonly lower than that of outcrossing species (Schoen and Brown 1991).
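The expected two-fold reduction mentioned above follows from standard neutral-model approximations, which are assumed here for illustration and are not derived in this thesis: an equilibrium inbreeding coefficient F = s/(2 − s) for selfing rate s, an effective population size of N/(1 + F), and an effective recombination rate of r(1 − F). A minimal Python sketch, with hypothetical function names:

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F for a given selfing rate s."""
    return selfing_rate / (2.0 - selfing_rate)


def effective_population_size(census_size, selfing_rate):
    """Effective size under partial selfing, ignoring hitchhiking and demography."""
    return census_size / (1.0 + inbreeding_coefficient(selfing_rate))


def effective_recombination_rate(recombination_rate, selfing_rate):
    """Recombination rate discounted by the excess homozygosity caused by selfing."""
    return recombination_rate * (1.0 - inbreeding_coefficient(selfing_rate))


# Complete selfing (s = 1) gives F = 1: the effective size is halved and
# effective recombination vanishes. Intermediate selfing has milder effects.
for s in (0.0, 0.5, 0.9, 1.0):
    print(s, effective_population_size(1000, s), effective_recombination_rate(1e-8, s))
```

These expressions capture only the direct effect of selfing. The additional reductions due to hitchhiking and demography discussed above are not modeled in this sketch.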

Baker (1953) established that mating system plays a strong role in determining the pattern of phenotypic variation within and among populations. Selfing species are expected to show low genetic variation within populations but increased differentiation among populations. Recent studies provide evidence of reduced DNA diversity and increased differentiation among populations of selfing species compared to populations of outcrossing relatives (Leavenworthia: Liu et al. 1998, 1999; Arabidopsis: Savolainen et al. 2000, Wright et al. 2002; Solanum: Baudry et al. 2001; Mimulus: Sweigart and Willis 2003; Amsinckia: Perusse and Schoen 2004; Capsella: Guo et al. 2009, Foxe et al. 2009).

One expected consequence of the reduction in effective population size that accompanies the shift from outcrossing to selfing is a reduction in the efficacy of natural selection, which is governed by the balance between genetic drift and selection. This decrease in the efficacy of selection predicts that fewer beneficial mutations, and more slightly deleterious ones, will reach fixation. Theoretical work by Glemin (2007) points out that these expected consequences hold when a large fraction of mutations are slightly deleterious and additive. If the majority of mutations are strongly selected, the reduction in effective population size due to selfing is not enough to make them effectively neutral. Also, recessive mutations are exposed by the increased homozygosity due to selfing, which increases the efficiency of selection on both beneficial and deleterious recessive mutations. Despite the theoretical work, recent empirical work has not shown clear evidence for the accumulation of slightly deleterious mutations in selfing species (Wright et al. 2002; Glemin et al. 2006; Cutter et al. 2008; Haudry et al. 2008; Wright et al. 2008; Escobar et al. 2010).
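How a reduced effective population size weakens selection can be illustrated with Kimura's diffusion approximation for the fixation probability of a new additive mutation. This Python sketch is an illustration under stated assumptions and not a method used in this thesis: the function name and parameter values are hypothetical, the heterozygote and homozygote fitnesses are assumed to be 1 + s and 1 + 2s, and the mutation is assumed to start as a single copy in a diploid population of census size N.

```python
import math


def fixation_probability(s, n_e, n_census):
    """Kimura's approximation for a new additive mutation.

    s        : selection coefficient of the heterozygote (negative if deleterious)
    n_e      : effective population size
    n_census : census population size (initial frequency is 1 / (2 * n_census))
    Note: very large values of |n_e * s| for deleterious mutations can
    overflow math.expm1; the values used below are well within range.
    """
    if s == 0.0:
        # Neutral limit: fixation probability equals the initial frequency.
        return 1.0 / (2.0 * n_census)
    numerator = math.expm1(-2.0 * s * n_e / n_census)
    denominator = math.expm1(-4.0 * n_e * s)
    return numerator / denominator


# Fixation probability of a slightly deleterious mutation, relative to a
# neutral one, in a small and in a large population (illustrative values).
s_del = -1e-4
for n in (1_000, 100_000):
    relative = fixation_probability(s_del, n, n) / fixation_probability(0.0, n, n)
    print(n, relative)
```

With these illustrative values the mutation behaves almost neutrally in the small population but is very unlikely to fix in the large one, which is the sense in which a lower effective population size relaxes selection against slightly deleterious mutations.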

Several explanations have been proposed for the lack of evidence for an accumulation of deleterious mutations, including the recent origin of selfing, the small number of loci sampled, the distribution of mutational fitness effects and their dominance, and the complicating effects of interspecific comparisons. The relative importance of these explanations is still unclear, and a general understanding of the effect of mating system on patterns of selection across the genome is still in its infancy.

A central goal of my thesis is to use new genomic approaches to gain a better understanding of the genetic basis of the evolution of selfing as well as its genomic consequences. In an effort to provide more insight into these questions I used two systems, Capsella and Collinsia. Below I provide a short summary of the two systems, including their natural history and features.

Capsella as a study system

Capsella is a genus of herbaceous annual plants in the mustard family Brassicaceae. Four species listed in Flora Europaea (Chater 1993) are commonly accepted; species described in the genus include Capsella grandiflora, Capsella rubella, Capsella bursa-pastoris, Capsella orientalis and Capsella thracia. Hurka et al. (2012), using parsimony and Bayesian analyses to estimate divergence time, found that C. orientalis forms a clade with the autopolyploid C. bursa-pastoris and that C. thracia is an allopolyploid that emerged from an interspecific hybridization between the autopolyploid C. bursa-pastoris and the diploid C. grandiflora. The two species under study in my thesis are the diploid (2n=2x=16) self-incompatible outcrosser Capsella grandiflora and the self-compatible Capsella rubella. Capsella rubella is characterized by high rates of self-fertilization (Hurka et al. 1989) and shows the typical morphological features of the selfing syndrome (Hurka et al. 1997; Sicard and Lenhard 2011). Geographically, C. grandiflora is restricted to western Greece, Albania and, locally, northern Italy, whereas C. rubella is scattered across southern and central Europe, North Africa, Australia, and North and South America (Hurka et al. 1997; Paetsch et al. 2006). Recent population genetic studies suggest that, in contrast to A. thaliana, where selfing appears to have evolved gradually and further back in time (~1 Myr ago), C. rubella evolved selfing very recently (~35,000 to 45,000 years ago) in association with a severe population bottleneck (Foxe et al. 2009; Guo et al. 2009). Furthermore, the massive loss of diversity in C. rubella suggests that autonomous selfing may have evolved rapidly by providing reproductive assurance through colonization ability. The recent origin combined with a severe bottleneck implies that rapid phenotypic evolution has occurred from low levels of standing variation.

Collinsia as a study system

Collinsia is a genus of about 24 species of annual flowering plants. Traditionally, it was placed in the snapdragon family Scrophulariaceae, but following recent research in molecular systematics, it has now been placed in the much enlarged family Plantaginaceae (reviewed in Baldwin et al. 2011). Phylogenetic analysis by Armbruster et al. (2002) corroborates the hypothesis that diversification in Collinsia is marked by recurring shifts in flower size. Baldwin et al. (2011) found at least seven to eight shifts from large to small flowers and one shift from small to large flowers, resulting in an interesting system of replicate sister pairs.

Kalisz et al. (2012) showed that floral morphology in Collinsia is associated with mating system, with outcrossing species having larger flowers than selfing species. They showed that Collinsia species can be divided into two mating system groups based on dichogamy. The first group consists of large-flowered species with high dichogamy and high herkogamy, and the second group consists of small-flowered species showing the characteristics of the selfing syndrome. Using polymorphic microsatellite markers in wild populations, they showed that the selfing rate is significantly higher in small-flowered species than in large-flowered species.

Reproductive assurance has an important role in mating system evolution across Collinsia. Field experiments have demonstrated that flower size, local population size and the local pollinator environment (Elle and Carney; Kennedy and Elle 2008a; Kalisz and Vogler 2003; Kalisz et al. 2004) affect the selective regime behind selfing and the mating system transition in Collinsia. In addition, the geographic range of small-flowered species is larger than that of large-flowered ones, which highlights the colonizing ability of selfing Collinsia species and their capacity to found new populations (Randle et al. 2009). Taken together, these patterns highlight the potential for an association between selfing, reproductive assurance and population bottlenecks.

In my thesis the focus is on two sister species that have diverged recently (Baldwin et al. 2011): Collinsia linearis, a large-flowered species, and Collinsia rattanii, a small-flowered species. Both are self-compatible, with outcrossing rates that are correlated with flower size. The range of C. linearis lies in the watershed of the Klamath River, which runs through Oregon and northern California, whereas C. rattanii is also found in northwestern California and the northern Sierra Nevada.

Research objectives

The general objective of my thesis is to gain novel insights into the evolution of selfing using the genera Capsella and Collinsia. My focus is on the genomic basis of phenotypic evolution and the genomic consequences of the shift in mating system from outcrossing to selfing. My goal is to contribute to the existing body of literature by applying next-generation sequencing combined with the techniques of population genomics, in an effort to better resolve, on a whole-genome scale, the evolutionary genomic changes associated with mating system shifts.

I will briefly outline the chapters in my thesis:


Chapter Two − Genetic architecture and adaptive significance of the selfing syndrome in Capsella. In this chapter I use a combination of quantitative genetic QTL mapping and multiplexed shotgun genotyping to examine the genetic architecture of, and the evolutionary forces responsible for, the evolution of the selfing syndrome in Capsella rubella, which recently diverged from its outcrossing progenitor Capsella grandiflora. I mapped floral and reproductive traits to multiple genomic locations. My results suggest that few loci with major effect may have contributed to the evolution of the selfing syndrome. Using polymorphism and divergence data, I showed that QTL regions have an excess of fixed differences compared to the rest of the genome, suggestive of directional selection driving the evolution of the selfing syndrome and consistent with the directionality of QTL effects.

Chapter Three − Evolutionary genomics of mating system evolution in Capsella. In this chapter I investigated how the mating system transition from outcrossing to selfing affects the efficacy of selection, using whole-transcriptome sequencing of flower buds from five C. grandiflora individuals and six C. rubella individuals. I analyzed genome-wide patterns of polymorphism and divergence in the selfing C. rubella compared to the outcrossing C. grandiflora, searching for evidence of relaxed selection. I detected a significant increase in the ratio of unique nonsynonymous to synonymous polymorphism in C. rubella, which is consistent with relaxed selection. Using gene expression to test whether there is a specific relaxation of selection on floral traits associated with pollinator attraction, I showed that down-regulated and up-regulated genes show the same excess of nonsynonymous polymorphism. Interestingly, I found that down-regulated genes in C. rubella are enriched for genes involved in pollen production and up-regulated genes are enriched for female functions, consistent with a shift in resource allocation during the shift in mating system. My findings are consistent with a global relaxation of selection on the genome of Capsella. Combining population genomic data with QTL mapping and differential gene expression analysis, I found evidence suggesting that positive selection has driven the evolution of the selfing syndrome, and identified candidate regions for the causal changes associated with adaptation to selfing.

Chapter Four − Comparative population genomics in two Collinsia species with contrasting mating system. In this chapter I used de novo transcriptome assembly, combined with Sanger population resequencing, to examine patterns of polymorphism, base composition and efficacy of selection in the selfing Collinsia rattanii and the mixed-mating, more outcrossing sister species Collinsia linearis. Our de novo assembly identified 14,106 orthologous sequences between the two species. Both Sanger resequencing of 17 genes and transcriptome polymorphism confirm the very recent species divergence as well as a strong population bottleneck in C. rattanii. We find evidence for ongoing relaxed selection in C. rattanii, in the form of an increase in the ratio of unique nonsynonymous to synonymous polymorphism.

CHAPTER TWO GENETIC ARCHITECTURE AND ADAPTIVE SIGNIFICANCE OF THE SELFING SYNDROME IN CAPSELLA

Summary

The transition from outcrossing to predominant self-fertilization is one of the most common evolutionary transitions in flowering plants. This shift is often accompanied by a suite of changes in floral and reproductive characters termed the selfing syndrome. Here, we characterize the genetic architecture and evolutionary forces underlying evolution of the selfing syndrome in Capsella rubella following its recent divergence from the outcrossing ancestor Capsella grandiflora. We conduct genotyping by multiplexed shotgun sequencing and map floral and reproductive traits in a large (N=550) F2 population. Our results suggest that in contrast to previous studies of the selfing syndrome, changes at a few loci, some with major effects, have shaped the evolution of the selfing syndrome in Capsella. The directionality of QTL effects as well as population genetic patterns of polymorphism and divergence at 318 loci are consistent with a history of directional selection on the selfing syndrome. Our study provides an important first step towards characterizing the genetic basis and evolutionary forces underlying the evolution of the selfing syndrome in a genetically accessible model system.

Introduction

The transition from outcrossing to predominant self-fertilization is one of the most common evolutionary transitions in flowering plants (Stebbins 1950). In association with this mating system shift, similar changes in a suite of floral and reproductive characters have evolved repeatedly (Barrett 2002). In general, selfers tend to have smaller, more inconspicuous flowers, a lower degree of separation between anthers and stigma and lower pollen-ovule ratios than their outcrossing relatives. This combination of floral and reproductive characters is termed the selfing syndrome (Darwin 1876, Ornduff 1969).

Convergent evolution of the selfing syndrome strongly suggests that these floral and reproductive trait changes are adaptive. Indeed, there are several reasons to expect natural selection to favor the evolution of the selfing syndrome (reviewed in Sicard and Lenhard 2011).

Reduced resource allocation to costly structures such as petals and nectaries should be favored in selfers, which are no longer dependent on pollinators for reproductive success (Sicard and Lenhard 2011). If there is a trade-off between resource allocation to pollen and ovule production, reduced pollen production is expected to be selected for in selfers (Charlesworth and Charlesworth 1981). Floral morphology could evolve to reduce the incidence of herbivory (Eckert et al. 2006), or simply as a result of selection for combinations of traits that improve autonomous selfing ability (Moeller and Geber 2005; Fishman and Willis 2008).

Quantitative genetic studies can aid in distinguishing between these different hypotheses for the evolution of the selfing syndrome. For instance, genetic correlations between pollen and ovule number can point to the existence of intrinsic genetic trade-offs, and genetic correlations between floral traits may suggest the presence of genetic constraints to floral evolution (Ashman and Majetic 2006). Quantitative trait locus (QTL) mapping can yield additional insights into the evolution of the selfing syndrome. For instance, a prevalence of QTL with allelic effects consistent with species differences is suggestive of directional selection on the selfing syndrome (Orr 1998) and overlapping QTL may suggest that underlying loci have pleiotropic effects, provided that resolution is sufficiently high. In addition, QTL mapping studies can yield information on the number and effect sizes of loci involved in the evolution of the selfing syndrome.

Dissecting the genetic architecture of the selfing syndrome through QTL mapping may thus allow for insights into adaptive evolution. However, despite the prevalence of the transition to selfing (Barrett 2002) and the vast literature on the causes and population genetic effects of this transition (reviewed in e.g. Barrett and Harder 1996; Charlesworth and Wright 2001; Charlesworth 2006), quantitative genetic studies of the selfing syndrome have only been conducted in a handful of systems. These studies have found that floral traits involved in the selfing syndrome mostly have a polygenic basis (e.g. in Mimulus, Fishman et al. 2002; Lin and Ritland 1997; Leptosiphon, Goodwillie et al. 2006; Oryza, Grillo et al. 2009; and Solanum, Bernacchi and Tanksley 1997), with little evidence for major genes. The directionality of QTL effects in the majority of these studies is consistent with a history of directional selection, although this is not always explicitly tested. Genetic correlations between floral traits are often positive (Ashman and Majetic 2006), and overlapping QTL for multiple floral traits (Bernacchi and Tanksley 1997; Fishman et al. 2002) suggest that genetic architecture may pose constraints for floral evolution, although distinguishing between close linkage and pleiotropy will require vastly improved mapping resolution.

Our knowledge of the molecular basis of selfing syndrome evolution is limited and only a single gene underlying a selfing syndrome trait has been cloned so far (regulatory changes at a gene cause the loss of stigma exsertion in domesticated tomatoes; Chen et al. 2007).

Development of genomic resources for genetically accessible model systems is important in order to improve our understanding of the types of genetic changes underlying this evolutionary transition, including the contribution of regulatory vs protein-coding changes. Recent developments in sequencing technology hold the promise to facilitate such studies.

Here, we characterize the genetic architecture of the selfing syndrome in the crucifer genus Capsella, a promising model system for the study of mating system shifts. This genus harbors two diploid sister species that differ in their mating system, the highly self-fertilizing Capsella rubella, and the obligately outcrossing, self-incompatible Capsella grandiflora (Hurka and Neuffer 1997). The selfer C. rubella is thought to be derived from an outcrossing, C. grandiflora-like ancestor (Foxe et al. 2009; Guo et al. 2009). These two species differ not only in their geographical distribution and mating system, but also with respect to floral and reproductive traits. C. grandiflora is mainly found in the western Balkans and occasionally in northern Italy, whereas C. rubella has a wider circum-Mediterranean distribution (Hurka and Neuffer 1997). C. rubella has undergone a derived loss of self-incompatibility and exhibits the typical characteristics of the selfing syndrome (Hurka and Neuffer 1997). These changes have resulted in high rates of selfing, with effective selfing rates in natural populations of C. rubella estimated at 0.90-0.97 (St. Onge et al. 2011).

Population genetic analysis of the S locus and multilocus data has suggested that the evolution of selfing in C. rubella was associated with a severe population bottleneck, suggestive of a major, rapid shift to high selfing rates from a small number of founding lineages (Foxe et al. 2009; Guo et al. 2009). Changes in floral and reproductive traits in C. rubella have probably evolved relatively rapidly, as the transition to selfing occurred recently, most likely within the last 50,000 years (Foxe et al. 2009; Guo et al. 2009). If floral evolution occurred subsequent to the founder event, adaptive morphological evolution in C. rubella would have proceeded with a limited amount of standing genetic variation.

Studies of the genetic basis of the selfing syndrome in Capsella are facilitated by the interfertility of the two diploid species, their close relationship to A. thaliana (Boivin et al. 2004) and the availability of a sequenced C. rubella genome. Here, we make full use of these advantages to map QTL for floral and reproductive traits in a large interspecific F2 population (N=550). We generate a dense set of markers and genotype all individuals using a cost-effective technique based on massively parallel sequencing (MSG; Andolfatto et al. 2011). As a proof-of-concept, we map self-compatibility to a 255 kb region that encompasses the canonical Brassicaceae S- locus. We assess the distribution and directionality of additive QTL effects, as well as the degree of overlap between QTL. Finally, we use population genomic data for 318 loci to explore whether regions harboring QTL for the selfing syndrome exhibit signs of directional selection.

This study forms an important initial step in characterizing the genetic basis and evolutionary forces underlying recent evolution of the selfing syndrome in Capsella.

Materials and methods

Plant material

We generated an F2 mapping population from an interspecific cross between C. grandiflora and C. rubella. The F2 was generated by self-fertilizing a self-compatible F1 individual produced from a cross of an outbred C. grandiflora accession (2e-TS1) from Paleokastritsas, Greece as seed parent and a C. rubella accession (1GR1) from Manolates, Samos, Greece as pollen donor. Floral and reproductive traits (see below) were measured in both parents and the F1. In 2009, we grew a total of 700 F2 individuals alongside six selfed offspring of C. rubella 1GR1 and 16 accessions each of C. rubella and C. grandiflora sampled across the range of each species (Appendix 2.1).

Plant growth conditions

Seeds were surface-sterilized, plated on half-strength Murashige-Skoog medium and vernalized at 1.8 °C for 18 days. Germination took place at room temperature over eight days, and F2 seeds had a consistently very high germination rate (>95%). Seedlings were transplanted to Pro-Mix BX potting mix in 3.5 inch pots, which were placed in a fully randomized design in the greenhouse at the University of Toronto on June 15th 2009. Seedlings were allowed to acclimatize to natural light conditions for one week and were subsequently grown under long day conditions (22 °C day/21 °C night; 16 h light) with supplemental light from sodium high pressure lamps and biweekly fertilization with N:P:K 20:20:20 fertilizer.

Phenotypic measurements

A total of 13 floral, vegetative and reproductive characters were measured. We scored vegetative characters on all plants, whereas more labor-intensive floral and reproductive trait measurements were done on a subset of 550 F2s as well as on all C. rubella and C. grandiflora individuals.

We measured the following seven floral traits: petal length and width, the length of lateral and median sepals, the length of lateral and median stamens and the total length of the style and gynoecium (Figure 2.1). All floral measurements were done on three flowers from each individual. These measurements were based on digital images of dissected floral organs, taken with an Olympus SZTR1 dissecting microscope with an Infinity CCD camera (Olympus Canada, Markham ON, Canada). Images were calibrated with a stage micrometer and measurements were done using ImageJ 1.40 (Abramoff et al. 2004).

We assessed three reproductive traits: the total number of pollen grains per flower, the number of ovules per flower, and self-incompatibility. Pollen and ovule counts were done on three flowers per plant, using a standard aniline blue-lactophenol staining protocol (Kearns and Inouye 1993) and a hemacytometer. As in Fishman et al. (2001), we also tentatively classified those pollen grains that had a reduced diameter and did not stain strongly as inviable. Ovule counts were done under a dissecting microscope. We scored self-incompatibility as a binary trait at the end of the experiment, with plants producing any seeds by autonomous pollination classified as self-compatible and those that produced no seeds by autonomous pollination classified as self-incompatible.

For an overall assessment of plant size, we measured the length of the two longest leaves at the start of flowering, and to assess variation in phenology we scored the number of days to flowering and the number of rosette leaves at the start of flowering.

In the F2 population, we found that some individuals exhibited floral abnormalities, such as fusions between floral organs, on some of their flowers. To test whether this could be a result of inbreeding depression affecting some aspect of floral development, we scored and mapped the presence of such abnormalities as a binary trait.

Descriptive statistics

We assessed phenotypic differentiation between C. rubella and C. grandiflora with respect to floral and reproductive traits using the Wilcoxon rank sum test. To assess trait normality in the F2 population we used the Shapiro-Wilk normality test. Nonparametric measures of correlation (Spearman’s rho) were calculated for all pairwise combinations of traits in the F2.

Multiplexed shotgun sequencing

To genotype our mapping population, we used multiplexed shotgun genotyping (MSG), a new approach based on shotgun sequencing of multiplexed Illumina libraries (Andolfatto et al. 2011). Briefly, MSG involves digesting genomic DNA with a restriction enzyme, followed by barcoding each sample with a unique adapter and pooling these samples for sequencing. Given a mapping population from a controlled cross and a reference genome, a Hidden Markov Model (HMM) is used to estimate ancestry probabilities for all markers in each individual from the MSG sequence data.

For genotyping of the 550 F2 individuals, we first extracted genomic DNA from frozen leaf tissue using a DNeasy Plant Mini Kit (QIAGEN). Genomic DNA concentration was quantified using a Qubit BR kit (Invitrogen) and a fluorometer and was subsequently diluted to a standard concentration. Sequence library construction closely followed the protocol described in Andolfatto et al. (2011). First, a total of 10 ng of genomic DNA was digested with Mse I (New England Biolabs, Ipswich, MA, USA). We then ligated unique bar-coded adapters to each sample and pooled samples to give six pools of 96 samples each. The initial part of the protocol was conducted in 96-well format, and 26 samples with low coverage in the first run were therefore included multiple times to fill all 576 wells and improve coverage. Independent multiplexed sequencing libraries were constructed for each of these six pools. Ligated linker-dimers were removed from each pool with Ampure beads and the cleaned ligation products were size-selected on an agarose gel to yield fragments of length 250-300 bp. FC2 flow-cell sequences were attached to ligation products using PCR. We sequenced each library on a separate lane of an Illumina Genome Analyzer IIX (Illumina) at Princeton Microarray Facility (http://www.genomics.princeton.edu/microarray/) using standard Illumina sequencing protocols.

Sequence parsing and genotype calling

The MSG HMM algorithm (Andolfatto et al. 2011) requires reference genomes for both parents of the mapping population; however, the high heterozygosity of our C. grandiflora parent required us to take a two-step approach to generating reference parental genomes. First, the genome of our reference C. rubella mapping parent was constructed as part of our ongoing population genomics effort in Capsella (http://biology.mcgill.ca/vegi/index.html). One lane of 108 bp PE Illumina GAII sequence data was generated from C. rubella 1GR1-TS2 at the Genome Quebec Innovation Centre at McGill University. Paired-end reads were mapped to a pre-release of the JGI Capsella rubella genome assembly (http://www.jgi.doe.gov/) using the Burrows-Wheeler algorithm (BWA; Li and Durbin 2009) with default settings. 79% of the approximately 5.9 million reads mapped successfully to the C. rubella reference nuclear genome, giving a median coverage of approximately 30X. To call SNP differences between our C. rubella parent and the JGI assembly, we used the MAQ algorithm as implemented in the samtools pileup package (samtools 1.7), which assumes at most two segregating bases per site. For SNP calling, we considered only homozygous SNP calls from our reads with coverage greater than 15X and less than 50X, and the phred-scaled consensus quality score was required to be greater than 75.

The upper 50X coverage cutoff criterion for calling SNPs was introduced in order to avoid calling SNPs in repetitive regions, regions with excessive read coverage due to amplification artifacts or errors in read mapping. We created a reference genome sequence for our C. rubella parent using custom Perl scripts to replace any SNPs passing these criteria.
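As an illustration only, the filtering criteria above can be sketched in Python. The thesis pipeline used samtools and custom Perl scripts, which are not reproduced here; the `PileupSite` record, the function names and the toy sequence are hypothetical, and only the thresholds (coverage greater than 15X and less than 50X, consensus quality greater than 75, homozygous calls only) come from the text.

```python
from dataclasses import dataclass


@dataclass
class PileupSite:
    """Hypothetical, simplified view of one pileup consensus record."""
    position: int    # 1-based position on the reference
    ref_base: str    # base in the reference assembly
    consensus: str   # consensus call (IUPAC ambiguity code if heterozygous)
    quality: int     # phred-scaled consensus quality
    coverage: int    # read depth at the site


def passes_parent_filter(site, min_cov=15, max_cov=50, min_qual=75):
    """Return True for a homozygous SNP call that meets the stated cutoffs."""
    call = site.consensus.upper()
    is_homozygous = call in {"A", "C", "G", "T"}
    differs = call != site.ref_base.upper()
    return (is_homozygous and differs
            and min_cov < site.coverage < max_cov
            and site.quality > min_qual)


def apply_snps(reference, sites):
    """Replace reference bases with the parental allele at passing sites."""
    seq = list(reference)
    for site in sites:
        if passes_parent_filter(site):
            seq[site.position - 1] = site.consensus.upper()
    return "".join(seq)


# Toy example: only the first site passes all filters.
toy_sites = [
    PileupSite(2, "C", "T", 80, 30),   # passes
    PileupSite(4, "T", "G", 80, 60),   # coverage too high
    PileupSite(6, "C", "Y", 90, 30),   # heterozygous call
    PileupSite(8, "T", "A", 75, 20),   # quality not greater than 75
]
toy_parent = apply_snps("ACGTACGT", toy_sites)
```

The strict inequalities mirror the wording "greater than" and "less than" in the text; whether the original scripts treated boundary values the same way is not stated.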

We used all reads from the F2 mapping population MSG sequence to identify informative SNPs (i.e. SNPs that segregate in the mapping population) and generate a synthetic C. grandiflora parental reference sequence. Briefly, this was done by read-mapping all reads from the mapping population to the pre-release C. rubella reference genome using BWA with default settings. Basecalling of the pool of read-mapped MSG sequence data was done in samtools using the MAQ algorithm. The resulting pileup file, in combination with the C. rubella 1GR1-TS2 genomic pileup, was subsequently parsed to call informative SNPs using custom Perl scripts. For calling an informative SNP we required the MSG data to have heterozygous base calls with Phred-scaled consensus quality greater than 20, and coverage greater than 30. In addition, for any site with an informative SNP the 1GR1-TS2 genomic pileup was required to have coverage greater than 15 but less than 50, and the Phred-scaled consensus quality score was required to be greater than 75 for a homozygous base.

We then replaced any bases in the C. rubella 1GR1-TS2 reference genome that passed these criteria to construct a C. grandiflora reference genome sequence. By taking this approach, we ensure that only informative SNPs that were identified in our F2 mapping population will be queried in the genotyping analysis.

For all individuals in the mapping population, we assigned ancestry probabilities across chromosomes using MSG v0.3, a pipeline of scripts developed by Andolfatto et al. (2011) (available at http://genomics.princeton.edu/AndolfattoLab/MSG.html). Briefly, sequence reads were parsed by bar-code into 550 groups corresponding to the 550 F2 individuals. Reads were mapped to the parental reference genomes using BWA with default settings, and a HMM was used to estimate a posterior probability of each possible genotype (in our case: homozygous C. rubella, heterozygous, or homozygous C. grandiflora) in a genomic region. Genotype probabilities for each informative marker position were obtained by imputing ancestry for all individuals at all positions that were typed in at least one individual. Using this pipeline we obtained genotype probabilities for a total of 121,979 SNPs. We set the HMM parameter g, which specifies the degree of uncertainty in the parental reference genomes, to 0.03 for both parental genomes, the expected number of recombination events per genome per meiosis to 16 (i.e. at least one crossover per chromosome arm), and the model recombination rate parameter, rfac, to 1 × 10^-6. Similar results were obtained with an rfac setting of 1, and our results are thus not sensitive to the exact setting used for the model recombination rate. As we analyzed an F2 population, genotype priors were set to 0.5 for the heterozygote and 0.25 for each parental allele homozygote.

To simplify linkage map construction and facilitate QTL mapping, we first assigned each individual a single genotype at each marker (i.e. a ‘hard ancestry call’). To do this, ancestry probabilities resulting from the MSG HMM were used. A cutoff ancestry probability of 0.9 was used for genotype calling, and we filtered markers that had identical genotype configurations across the F2 so that only one of those markers was retained. If markers differed only in terms of missing ancestry calls, we kept the marker that had less missing data. The trimmed data set was used for linkage map construction and quantitative trait locus (QTL) analyses using MQM mapping, which requires hard genotype calls. To test the use of MSG for fine-scale location of QTL, we also used a different filtering strategy and retained the HMM posterior probabilities for mapping. Specifically, we filtered genotype probabilities to retain markers that differed in posterior probability by more than 0.01 in at least one individual and genotype in our F2, and used those genotype probabilities (‘soft ancestry calls’) instead of hard ancestry calls for QTL analyses. This procedure was only used to map self-compatibility, which was suggested to have a simple genetic basis both in previous studies (Riley 1934; Nasrallah et al. 2007) and on the basis of phenotype data for our mapping population, and for which we could use a simpler interval mapping algorithm.
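The conversion from posterior probabilities to hard ancestry calls, and the removal of redundant markers, could look roughly like the following Python sketch. It is not the code used in the thesis: the genotype labels, function names and the greedy merging strategy are assumptions, and whether the 0.9 cutoff was applied as "at least" or "greater than" is not stated in the text.

```python
def hard_call(posteriors, cutoff=0.9):
    """Convert (P(rubella hom.), P(het.), P(grandiflora hom.)) to a call.

    Returns 'R', 'H' or 'G', or None (missing) if no genotype reaches
    the cutoff. Using >= here is an assumption.
    """
    labels = ("R", "H", "G")
    best = max(range(3), key=lambda i: posteriors[i])
    return labels[best] if posteriors[best] >= cutoff else None


def compatible(calls_a, calls_b):
    """True if two markers agree wherever both have a non-missing call."""
    return all(a == b for a, b in zip(calls_a, calls_b)
               if a is not None and b is not None)


def collapse_redundant(markers):
    """Greedily keep one marker per genotype configuration.

    `markers` maps marker name -> list of calls across individuals.
    Markers that differ only by missing calls are merged, keeping the
    one with fewer missing calls.
    """
    kept = {}
    for name, calls in markers.items():
        match = next((k for k, v in kept.items() if compatible(calls, v)), None)
        if match is None:
            kept[name] = calls
        elif calls.count(None) < kept[match].count(None):
            del kept[match]
            kept[name] = calls
    return kept


# Toy example with three individuals and three markers.
toy_markers = {
    "m1": ["R", "H", None],
    "m2": ["R", "H", "G"],   # same configuration as m1, less missing data
    "m3": ["H", "H", "G"],   # differs from m2 in the first individual
}
toy_kept = collapse_redundant(toy_markers)
```

Because compatibility with missing data is not transitive, the result of this greedy pass can depend on marker order; a production pipeline would need to define that behavior explicitly.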

Linkage map construction and segregation distortion

We constructed a linkage map in R/QTL (Broman et al. 2003) under default parameters, using a LOD score cutoff of 6 to assign markers to linkage groups as suggested by Broman (2010). We tested for segregation distortion at each marker using a chi-square test of goodness-of-fit of expected vs observed genotype frequencies and assessed significance after Bonferroni correction for multiple tests.
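For concreteness, a per-marker test of this kind can be sketched in Python as below, assuming the expected F2 genotype ratio of 1:2:1 and a family-wise significance level of 0.05 (the level is an assumption; the text does not state it). The function names and counts are illustrative. With three genotype classes the test has two degrees of freedom, for which the chi-square upper tail probability has the closed form exp(-χ2/2).

```python
import math


def segregation_chi2(n_rr, n_het, n_gg):
    """Chi-square goodness-of-fit test against the 1:2:1 F2 expectation.

    Returns (chi2, p_value); the p-value uses the exact survival
    function of the chi-square distribution with 2 degrees of freedom.
    """
    n = n_rr + n_het + n_gg
    observed = (n_rr, n_het, n_gg)
    expected = (0.25 * n, 0.5 * n, 0.25 * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.exp(-chi2 / 2)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Illustrative genotype counts for two markers (not data from the thesis).
chi2_a, p_a = segregation_chi2(25, 50, 25)   # perfect 1:2:1 fit
chi2_b, p_b = segregation_chi2(10, 50, 40)   # deficit of one homozygote
flags = bonferroni_significant([p_a, p_b])
```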

QTL analysis

For floral, vegetative and reproductive traits that were scored on a continuous scale we mapped QTL by multiple QTL mapping (MQM) (Jansen 1993; Jansen and Stam 1994), which has benefits over interval mapping and composite interval mapping in terms of power and avoidance of false positives (Arends et al. 2010). After imputation of missing genotype data, significant cofactors were identified using an automated backward elimination procedure (Arends et al. 2010). We tested for QTL in 1 cM intervals and excluded cofactors in a window of 25 cM when testing for a QTL effect at a location. For traits that were measured on a binary scale, such as self-incompatibility and presence/absence of floral abnormalities, we used a binary interval mapping model instead (Xu and Atchley 1996). All these QTL analyses were conducted in R/QTL (Broman et al. 2003), using hard genotype calls (see section Sequence parsing and genotype calling above).

We assessed significance of QTL using LOD scores, where LOD = log10(L1/L0), with L0 being the likelihood under the null hypothesis of no QTL in the interval and L1 the likelihood under the alternative hypothesis of a QTL in the interval. For each trait, genomewide significance thresholds (1% and 5%) were determined by 1000 permutations. For all QTL significant at P≤0.01, we obtained 1.5-LOD and 2-LOD confidence intervals, as well as estimates of additive allelic effects and dominance deviations. We present additive effect sizes both standardized by the mean difference between the parental species, as in Fishman et al. (2002), and as the proportion of F2 phenotypic variance explained. We searched for interactions between all significant QTL and fitted a model containing all main QTL effects and significant interactions. For each trait, we obtained an estimate of the total proportion of F2 phenotypic variation explained by this multiple QTL model. We assessed whether the model assumption of normally distributed residuals was fulfilled using a Shapiro-Wilk test.
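The logic of a LOD score and a permutation-based genomewide threshold can be illustrated with a deliberately simplified Python sketch. It uses single-marker regression under a normal model rather than the MQM procedure in R/QTL that was actually applied, and all function names and data are hypothetical. Under a normal model the ratio of maximized likelihoods reduces to a ratio of residual sums of squares, so LOD = (n/2) log10(RSS0/RSS1).

```python
import math
import random


def marker_lod(genotypes, phenotypes):
    """LOD for one marker: log10 likelihood ratio of a genotype-means
    model (alternative) versus a single-mean model (null)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    by_genotype = {}
    for g, y in zip(genotypes, phenotypes):
        by_genotype.setdefault(g, []).append(y)
    rss_qtl = 0.0
    for values in by_genotype.values():
        mean = sum(values) / len(values)
        rss_qtl += sum((y - mean) ** 2 for y in values)
    if rss_null == 0.0:
        return 0.0          # no phenotypic variation to explain
    if rss_qtl == 0.0:
        return math.inf     # genotype explains the phenotype perfectly
    return (n / 2) * math.log10(rss_null / rss_qtl)


def permutation_threshold(marker_genotypes, phenotypes,
                          n_perm=1000, alpha=0.05, seed=1):
    """Genomewide threshold: the (1 - alpha) quantile of the maximum
    LOD across markers when phenotypes are shuffled among individuals."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_lods.append(max(marker_lod(g, shuffled) for g in marker_genotypes))
    max_lods.sort()
    index = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    return max_lods[index]


# Toy data: one marker, eight individuals (not data from the thesis).
toy_genotypes = ["R", "R", "H", "H", "H", "H", "G", "G"]
toy_phenotypes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_lod = marker_lod(toy_genotypes, toy_phenotypes)
toy_threshold = permutation_threshold([toy_genotypes], toy_phenotypes, n_perm=200)
```

Taking the maximum LOD over all markers in each permutation is what makes the threshold genomewide rather than per-marker.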

To test whether our QTL effect size estimates could be biased due to variation in recombination rate and/or gene density (the “Noor effect”; Noor et al. 2001), we compared recombination rate and gene density within and outside of 1.5-LOD confidence intervals for petal size traits. Recombination rate (Mb/cM) and gene density were estimated in ~1 cM bins and QTL regions were compared to the remainder of the genome using a Mann-Whitney test.

For fine-scale location of QTL for self-incompatibility, we obtained a denser set of markers by filtering HMM ancestry probabilities from MSG as outlined in the section Sequence parsing and genotype calling above, and used these ancestry probabilities for interval mapping in R/QTL.

Population Genetics Analysis

To test whether QTL regions show evidence for directional selection in C. rubella, we made use of our resequencing data from 354 exons in both C. rubella and C. grandiflora, analyzing only those loci with at least 6 samples sequenced in each species. These loci were sequenced as described in Slotte et al. (2010) and Qiu et al. (in press). Briefly, individuals from eight Mediterranean populations of C. rubella and five populations of C. grandiflora were sampled for this study, and single large exons were amplified and sequenced on both strands. We analyzed polymorphism levels using a modified version of Polymorphorama (Andolfatto 2007; Haddrill et al. 2008), and we used Perl scripts written by the authors to calculate the number of fixed differences, and shared and unique polymorphisms. We determined the physical positions of these exons on the C. rubella genome using BLAST (Altschul et al. 1990), and identified exons that fell within the 2-LOD interval of QTL. To avoid very wide QTL intervals that would encompass large fractions of chromosomes, we focused this analysis on QTL with 2-LOD intervals of 2 Mb or less. When multiple overlapping QTL fit these criteria, we used the narrower QTL interval.
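To make the categories of fixed differences and shared and unique polymorphisms concrete, the following Python sketch classifies the columns of two aligned samples. It is not the authors' Perl code; the function names, the handling of missing data and the toy sequences are assumptions, and "shared" is defined here as both species segregating the same set of alleles.

```python
from collections import Counter


def classify_site(alleles_a, alleles_b):
    """Classify one alignment column given the bases seen in each species."""
    a = {x for x in alleles_a if x in "ACGT"}   # ignore gaps and Ns
    b = {x for x in alleles_b if x in "ACGT"}
    if not a or not b:
        return "missing"
    if len(a) == 1 and len(b) == 1:
        return "fixed" if a != b else "monomorphic"
    if len(a) > 1 and len(b) > 1:
        return "shared" if a == b else "other"
    return "unique_a" if len(a) > 1 else "unique_b"


def count_site_classes(seqs_a, seqs_b):
    """Tally site classes across two sets of aligned, equal-length sequences."""
    counts = Counter()
    for column_a, column_b in zip(zip(*seqs_a), zip(*seqs_b)):
        counts[classify_site(column_a, column_b)] += 1
    return counts


# Toy alignment: three sequences per species, four sites.
toy_a = ["ACGT", "ACGA", "ACGT"]
toy_b = ["TCGT", "TCGA", "TCCA"]
toy_counts = count_site_classes(toy_a, toy_b)
```

In this toy example the four columns are, in order, a fixed difference, a monomorphic site, a polymorphism unique to the second species and a shared polymorphism.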

Directional selection is expected to lead to a reduction in the proportion of shared polymorphisms and an increase in the proportion of fixed differences between species (Foxe et al. 2009). We tested whether QTL regions exhibited such a signature using a Fisher’s exact test.

In addition, we assessed whether the ratio of polymorphism in C. rubella and C. grandiflora differed between QTL regions and other genomic regions, as may be expected if QTL regions have undergone selective sweeps.
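A comparison of this kind amounts to a 2 × 2 contingency table, for example fixed differences versus shared polymorphisms in QTL regions versus the rest of the genome. The following Python sketch implements a two-sided Fisher's exact test from the hypergeometric distribution using only the standard library; the function name is an assumption and the counts are invented for illustration, not taken from the thesis.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(low, high + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)


# Invented counts: rows = QTL regions / other regions,
# columns = fixed differences / shared polymorphisms.
p_toy = fisher_exact_two_sided(1, 9, 11, 3)
```

The small relative tolerance when comparing probabilities guards against floating-point ties between equally probable tables.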

Results

Phenotypic variation between species and in the F2

We found significant phenotypic differentiation between C. rubella and C. grandiflora for all measured floral and reproductive traits, but not for leaf size and phenology traits (Table 2.1). The distribution of petal and reproductive traits in C. rubella did not overlap with that of C. grandiflora, whereas there was considerable overlap for other floral traits (Figure 2.2).

In the F2 population, all floral traits exhibited continuous variation with a unimodal distribution. Petal trait means in the F2 were intermediate between those of the parental accessions, with no clear evidence of transgressive segregation. For the other floral traits, as well as for ovule number and leaf size, more extreme values were often found in the F2 than in either parental accession. However, in most cases the range of variation in the F2 did not exceed that found in the parental species (Figure 2.2).

All traits except petal and stamen length deviated significantly from normality (data not shown). However, only pollen number per flower had a clearly bimodal distribution. The lower mode of this distribution was close to the value for the C. rubella parental accession, with an additional peak intermediate between the parental values (Figure 2.2). The mean proportion of inviable pollen in the F2 was low (mean: 7.9±10.7%) suggesting that hybrid incompatibilities affecting pollen viability are not rampant in this population.

Phenotypic correlations were highest among floral traits, while these traits exhibited a lower degree of correlation with vegetative and reproductive traits (Figure 2.3). However, there were significant correlations between and within all three types of traits (Figure 2.3).

About 28% of the F2 individuals were self-incompatible. Segregation of self-incompatibility did not deviate significantly from the 1:3 ratio expected under a single dominant locus with the C. rubella allele conferring self-compatibility (Chi-square test, χ2=1.08, d.f.=1, P=0.30), as previously reported in crosses of C. grandiflora and C. rubella (Riley 1934; Nasrallah et al. 2007).
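As a check on the reasoning, a goodness-of-fit test against a 1:3 ratio can be sketched in Python as follows. The function names and the example counts are hypothetical (the underlying counts are not given in this section); the only values taken from the text are χ2=1.08 and one degree of freedom. With one degree of freedom the chi-square upper tail probability equals erfc(sqrt(χ2/2)).

```python
import math


def chi2_p_value_1df(chi2):
    """Upper tail probability of a chi-square statistic with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2))


def two_class_chi2(observed, expected_proportions):
    """Goodness-of-fit test for two classes (hence 1 degree of freedom)."""
    n = sum(observed)
    chi2 = sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_proportions))
    return chi2, chi2_p_value_1df(chi2)


# P-value implied by the reported statistic.
p_reported = chi2_p_value_1df(1.08)

# Hypothetical counts of self-incompatible vs self-compatible plants.
chi2_toy, p_toy_si = two_class_chi2((28, 72), (0.25, 0.75))
```

Evaluating `chi2_p_value_1df(1.08)` gives a value close to 0.30, in line with the P-value reported above.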

Multiplexed Shotgun Genotyping

We used multiplexed shotgun genotyping (MSG) to generate indexed 101 bp Illumina reads from each of our F2 individuals. We were able to map 67% of our approximately 188 million reads to the C. rubella reference nuclear genome assembly and identified a total of 121,979 informative SNPs among those reads. The median number of informative markers per individual was 6474, corresponding to a marker density of about 1 per 20 kb.

We used MSG v0.3 to assign ancestry probabilities for each F2 individual at each of these 121,979 markers. After imputation, this resulted in a marker density of about 1 marker per kb. Ancestry probabilities were subsequently converted to hard ancestry calls (‘genotypes’), of which we retained a total of 890 markers for linkage map construction and initial QTL analyses. To test the use of soft ancestry calls (genotype probabilities) from MSG for fine-scale location of QTL, we filtered soft ancestry calls and used those for QTL mapping.

Linkage map construction

The resulting linkage map contained 890 markers and had eight linkage groups (Figure 2.4), consistent with previous linkage mapping results (Boivin et al. 2004). The mean number of markers per linkage group was 111 (min: 93, max: 152) and the total map distance was 381.8 cM. The mean distance between markers was 0.4 cM, and the maximum distance between markers was 8.4 cM.

A total of 152 markers showed significant segregation distortion (Figure 2.4). These markers mapped to five main regions: the lower part of LG1 and most of LG4 showed a consistent deficit of genotypes homozygous for the C. rubella allele, whereas regions on LG5, LG6 and the lower part of LG7 showed an excess of heterozygous genotypes.
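One common way to flag segregation distortion at a marker is a chi-square goodness-of-fit test of F2 genotype counts against the 1:2:1 Mendelian expectation, with a Bonferroni-adjusted threshold across markers. The sketch below assumes this approach; the exact test used for the cross is not specified in this section. The genotype counts are hypothetical, and only the number of markers (890) comes from the text.

```python
from math import exp

N_MARKERS = 890  # number of markers in the linkage map (from the text)

def segregation_chi2(n_gg, n_gr, n_rr):
    """Chi-square statistic (2 d.f.) against the 1:2:1 F2 expectation."""
    total = n_gg + n_gr + n_rr
    expected = (total / 4.0, total / 2.0, total / 4.0)
    observed = (n_gg, n_gr, n_rr)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_2df(chi2):
    """Upper-tail P-value for 2 degrees of freedom (closed form exp(-x/2))."""
    return exp(-chi2 / 2.0)

def is_distorted(n_gg, n_gr, n_rr, alpha=0.05, n_tests=N_MARKERS):
    """True if the marker is significant after Bonferroni correction."""
    p_value = chi2_sf_2df(segregation_chi2(n_gg, n_gr, n_rr))
    return p_value < alpha / n_tests

# Hypothetical counts (C. grandiflora homozygotes, heterozygotes,
# C. rubella homozygotes); these are NOT data from the cross.
print(is_distorted(140, 270, 140))  # close to 1:2:1
print(is_distorted(180, 300, 70))   # deficit of C. rubella homozygotes
```

With 890 tests, the per-marker threshold is 0.05/890, so only strong departures from 1:2:1 are flagged.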

QTL mapping

We identified a total of 41 QTL for the 13 phenotypic traits assessed in this study. For each floral trait we identified between two and five significant QTL, and there were a total of 24 significant QTL for the seven floral size traits measured (Figure 2.4). These QTL co-localized to a great extent. Floral size QTL mainly mapped to five regions: the upper part of LG1, LG6 and LG7, the central part of LG8 and the lower part of LG2 (Figure 2.4). QTL for reproductive traits (three QTL for pollen number and two for ovule number) also co-localized with floral trait QTL on LG1, LG2, LG6 and LG7 (Figure 2.4). In contrast, QTL for phenology traits did not overlap with floral size QTL, with the exception of one QTL for days to flowering that mapped to the upper part of LG1 (Figure 2.4). We did not find any significant QTL for plant size at flowering.

The median width of 1.5-LOD confidence intervals (CIs) for floral and reproductive trait QTL was 7.6 cM or 4.4 Mb; however, individual 1.5-LOD CIs ranged from 1.8 to 34.6 cM (0.35 to 13.6 Mb).

Petal size trait QTL on LG2 had the narrowest confidence intervals of all continuous traits (1.8 cM or 0.7 Mb for petal length and 2.3 cM or 0.4 Mb for petal width; Table 2.2; Figure 2.4).
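For readers unfamiliar with LOD support intervals, the sketch below shows one simple way to derive a 1.5-LOD interval from a LOD profile along a linkage group. It is a minimal illustration under stated assumptions: the function name and the toy profile are hypothetical, and the interval is extended outward to the first position whose LOD falls more than 1.5 below the peak, which is a conservative convention and not necessarily the one used by the QTL software in this study.

```python
def lod_support_interval(positions, lods, drop=1.5):
    """Return (left, peak, right) map positions of a LOD support interval.

    The interval is extended in each direction to the first position whose
    LOD score is below (peak LOD - drop), or to the end of the profile.
    """
    peak = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak] - drop
    left = peak
    while left > 0 and lods[left] >= threshold:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right] >= threshold:
        right += 1
    return positions[left], positions[peak], positions[right]

# Toy LOD profile (positions in cM); hypothetical values, not thesis data.
toy_positions = [0, 2, 4, 6, 8, 10]
toy_lods = [0.5, 2.0, 5.0, 6.2, 4.9, 1.0]
left, peak, right = lod_support_interval(toy_positions, toy_lods)
print(left, peak, right, "width:", right - left, "cM")
```

Interval width in cM depends on local recombination and marker spacing, which is why the same cM width can correspond to very different physical (Mb) widths.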

For 25 of the 29 floral size and reproductive trait QTL, the direction of allelic effects was consistent with phenotypic differences between species. Floral size QTL with allelic effects opposite to expectation were only found for traits whose phenotypic distributions overlap between C. rubella and C. grandiflora (e.g. stamen length and the total length of style and gynoecium; Table 2.2).

Altogether, a model with significant QTL explained between 10 and 64% of the phenotypic variation for floral and reproductive traits in the F2 population. Petal size QTL explained the highest proportion of F2 variance (petal width: 64%; petal length: 63%), whereas QTL for stamen length explained an intermediate proportion (median stamen length: 50%, lateral stamen length: 46%) and QTL for reproductive traits had the lowest explanatory power (pollen number: 19%, ovule number: 10%).

Individual QTL effects were also greatest for petal size traits; the leading QTL for petal width and length explained 31% and 27% of F2 phenotypic variation, respectively (Table 2.2).

Homozygous additive effects at leading QTL for these traits accounted for a large fraction of the phenotypic difference between C. grandiflora and C. rubella (26% for petal width and 41% for petal length; Table 2.2). Overall, dominance deviations were small in relation to additive effects of QTL, with the exception of floral size QTL mapping to the lower part of LG2, where the C. rubella QTL allele was largely recessive for petal length, petal width, lateral length, and stamen length (Table 2.2). Pairwise epistatic effects were found for petal length (between QTL on LG1 and LG7), stamen length (lateral stamen length: between QTL on LG1 and LG6, and between QTL on LG6 and LG8; median stamen length: between QTL on LG1 and LG2) and days to flowering (between QTL on LG1 and LG3, and between QTL on LG1 and LG4), but these effects explained a low proportion of the F2 variation (0.4-1.4%).
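The effect-size quantities used here can be made concrete with a small calculation. The sketch below follows the definitions given with Table 2.2 (the additive effect a is half the difference between the two homozygote means, and the relative homozygous effect is 2a divided by the difference between the parental species means). The genotype-class means are hypothetical; only the parental petal width means are taken from Table 2.1, and the function names are illustrative.

```python
def qtl_effects(mean_gg, mean_gr, mean_rr):
    """Additive (a) and dominance (d) effects from F2 genotype-class means.

    mean_gg: C. grandiflora homozygotes, mean_gr: heterozygotes,
    mean_rr: C. rubella homozygotes.
    """
    midpoint = (mean_gg + mean_rr) / 2.0
    additive = (mean_gg - mean_rr) / 2.0
    dominance = mean_gr - midpoint  # > 0: heterozygote closer to GG class
    return additive, dominance

def relative_homozygous_effect(additive, parent_grandiflora, parent_rubella):
    """Homozygous effect (2a) as a fraction of the species difference."""
    return 2.0 * additive / (parent_grandiflora - parent_rubella)

# Parental petal width means (mm) from Table 2.1.
PETAL_WIDTH_RUBELLA = 1.06
PETAL_WIDTH_GRANDIFLORA = 2.58

# Hypothetical genotype-class means (mm) at one QTL, NOT thesis data.
a, d = qtl_effects(mean_gg=2.0, mean_gr=1.9, mean_rr=1.6)
rel = relative_homozygous_effect(a, PETAL_WIDTH_GRANDIFLORA, PETAL_WIDTH_RUBELLA)
print(f"a = {a:.2f} mm, d = {d:.2f} mm, 2a / divergence = {rel:.1%}")
```

In this convention a positive d with a positive a means the heterozygote resembles the C. grandiflora homozygote, that is, the C. rubella allele is partly recessive.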

With the exception of pollen number, the distributions of residuals after accounting for QTL effects were unimodal and reasonably symmetric (Appendix 2.3). There were no significant deviations from normality of residuals for petal length or width, or for median and lateral stamen lengths. However, we did find slight but significant departures from residual normality for some traits (e.g. sepal and style length, pollen/ovule number and phenology traits). Standard QTL mapping methods should be robust to deviations from normality when significance is determined by permutations and genotyping information is dense, as in our case (Broman and Sen 2009). We therefore did not attempt to transform these data to achieve residual normality.

We did not find evidence for a “Noor effect” causing overestimates of effect sizes for petal traits, as there were no significant differences between QTL regions and the remainder of the genome in either recombination rate (petal length: W=369, NS; petal width: W=610, NS) or gene density (petal length: W=306, NS; petal width: W=552, NS).

About 40% of F2 individuals exhibited floral abnormalities on some inflorescences. We found four QTL for this trait, on linkage groups 1, 2, 6 and 7. Partly recessive C. rubella alleles on LG7 explain most of the F2 variation for this trait; however, two minor QTL on LG1 and LG2 with fully or partly recessive C. grandiflora alleles also contribute to some degree. We did not find evidence for significant interaction between QTL for floral abnormalities in a two-locus QTL scan.

In agreement with the 1:3 segregation of self-compatibility in our cross, SC mapped to a single, strongly significant QTL at ~7.6 Mb on LG7, with a dominant C. rubella allele conferring self-compatibility (Figure 2.5). As a test of the utility of the MSG method for fine-scale location of QTL, we mapped self-compatibility using 5361 markers on LG7 with soft ancestry calls.

Using this method, the peak of the QTL for SI was located 50 kb downstream of the S locus and the 255 kb wide 1.5-LOD interval included both key S locus genes, SRK and SCR. This is consistent with previous reports that SI maps to the S locus in crosses between C. rubella and C. grandiflora (Nasrallah et al. 2007). The 1.5-LOD CI for self-compatibility did not overlap with those for floral traits, but overlapped with a very wide CI for a QTL for ovule number (Figure 2.4). Consistent with this, SI plants also produced slightly fewer ovules per flower than SC plants on average (SC: 15.4; SI: 14.2 ovules/flower; Mann-Whitney test, W=38374.5; P=4.4 × 10⁻⁷).

Population genetics of QTL

We identified four non-overlapping QTL that had 2-LOD intervals of 2 Mb or less in the C. rubella genome. Out of 318 loci that were resequenced in at least 6 samples from each species, seven of our resequenced exons fall within three of these four QTL regions (Table 2.3). Across these seven loci, only a single segregating site was observed in C. rubella, compared with 85 segregating sites in C. grandiflora. This contrasts with the relative diversity overall, where C. rubella has 766 segregating sites compared with 3395 in C. grandiflora (2-tailed Fisher’s exact test, p<0.01). QTL regions therefore exhibit a more extreme reduction in diversity than other genomic regions in C. rubella. When we compare the proportion of shared polymorphisms and fixed differences, the seven loci falling within QTL regions similarly show an excess of fixed differences relative to shared polymorphism (2-tailed Fisher’s exact test, p<0.01).
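The two contingency-table comparisons can be sketched with a small, dependency-free implementation of the two-sided Fisher's exact test. This is an illustration under stated assumptions, not a record of how the tests were run: the counts are taken from Table 2.3, but the choice to contrast QTL loci with the remaining loci (totals minus QTL counts) is an assumption made here, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    # Unnormalised hypergeometric weights, kept as exact integers.
    weights = {x: comb(col1, x) * comb(total - col1, row1 - x)
               for x in range(lo, hi + 1)}
    observed = weights[a]
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(total, row1)

# Segregating sites (C. rubella, C. grandiflora): QTL loci vs. other loci.
# "Other" = Table 2.3 totals minus the QTL counts (an assumption).
p_diversity = fisher_exact_two_sided(1, 85, 766 - 1, 3395 - 85)

# Fixed differences vs. shared polymorphisms: QTL loci vs. other loci.
p_fixed_shared = fisher_exact_two_sided(9, 0, 111 - 9, 327 - 0)

print(p_diversity < 0.01, p_fixed_shared < 0.01)
```

Working with integer weights avoids floating-point ties when deciding which tables count as at least as extreme as the observed one.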

Discussion

In this study we have begun to characterize the genetic architecture of the selfing syndrome in Capsella by QTL mapping of floral and reproductive traits that differ between the self-fertilizing species C. rubella and the obligate outcrosser C. grandiflora. In addition, we use population genetic data to test for selection on genomic regions that affect the selfing syndrome.

Linkage map and segregation distortion

Our study highlights the strength of methods based on massively parallel sequencing for marker discovery and genotyping. Using multiplexed shotgun genotyping (MSG; Andolfatto et al. 2011), we identified a total of 121,979 markers, or about 1 marker per kb. Given the rapid development of sequencing and genotyping technology, mapping resolution will increasingly be limited by the amount of recombination in the analyzed pedigree rather than by marker availability. Indeed, this is the case in our cross, where 890 markers remain after filtering those with identical genotype configurations across F2 individuals.

Our linkage map contained 8 linkage groups, as expected from previous studies (e.g. Boivin et al. 2004). As is common for interspecific crosses (Rieseberg et al. 2000), a substantial proportion of markers (17%) showed significant segregation distortion. Inbreeding depression could be expected to cause segregation distortion in our study, as we used an outbred, highly heterozygous C. grandiflora individual as a mapping parent, and deleterious recessive C. grandiflora alleles would be rendered homozygous in the F2 mapping population. However, none of the regions that exhibited segregation distortion showed a deficit of homozygotes for the C. grandiflora allele, and the germination and survival rate in the F2 population was very high (>95%). Thus, inbreeding depression does not appear to be a major factor underlying segregation distortion in our cross. Other possible explanations include hybrid incompatibilities (Moyle et al. 2006) or loci that affect pollen performance and/or pollen-style interactions (Fishman et al. 2008).

Number of loci, effect sizes and pleiotropy

We identify a total of 41 QTL for all 13 traits examined. The number of QTL per trait is modest, between two and five, and additive effects of leading QTL for floral size traits explain a considerable proportion of F2 variance as well as divergence between species (e.g. 32% of F2 variance; 26% of interspecific divergence for petal width). Furthermore, homozygous additive effects at the three significant QTL for petal length and width can account for the majority of the divergence in petal length between C. grandiflora and C. rubella. Thus, changes at a few genomic regions appear to be sufficient to cause the severe reduction in petal size seen in C. rubella.

As we used a large mapping population (N=550), our effect size estimates should not be greatly overestimated due to the Beavis effect (Beavis 1994). Variation in recombination rates also seems unlikely to bias our estimates (“the Noor effect”; Noor et al. 2001), as petal size QTL regions did not have significantly lower recombination rates or higher gene densities than the rest of the genome. There was also no evidence for large-effect QTL being located in regions of unusually high gene density or low recombination rate, as would be expected under the Noor effect (Appendix 2.4). It is possible that there are additional minor QTL of small effect that we did not have the power to detect in this study. However, we note that our study has high power to detect QTL of even very small effect (e.g. explaining about 3% of the F2 variance in petal size traits), assuming environmental and genetic variances as estimated for Capsella in recombinant inbred lines by Sicard et al. (2011). This conclusion was not sensitive to the exact values of environmental and genetic variance, as a similar estimated minimum detectable QTL effect was obtained when we assumed an environmental variance twice as large as that estimated by Sicard et al. (2011).

Our results thus differ from those of previous studies of the selfing syndrome that found a large number of QTL, each of small effect (e.g. Fishman et al. 2002; Goodwillie et al. 2006). For instance, in an interspecific Mimulus guttatus-Mimulus nasutus mapping population, Fishman et al. (2002) found at least 11 QTL for each trait examined; while our study has similar power, we only detect two to five QTL per trait. The distribution of effect sizes also appears to differ between Mimulus and Capsella, as none of the relative homozygous effects in Mimulus estimated by Fishman et al. (2002) are as great as the largest that we find in Capsella.

Another way of assessing effect sizes is to relate them to levels of standing variation in the ancestral population or species (we are unable to use environmental standard deviations as we do not have an inbred parental line of C. grandiflora). When we do this we find that leading QTL for petal size traits have homozygous additive effects of about 1.6 times the C. grandiflora standard deviation. Although it is difficult to compare directly, this is about twice as large as the leading corolla width effect size in Mimulus when standardized by the variation seen in the M. guttatus Iron Mountain population (Fishman et al. 2002).

Thus, regardless of the exact measure of effect size used, the evolution of the selfing syndrome in Capsella seems to have involved genes of larger effect than in Mimulus. This could in part be due to differences in the demographics of the transition to selfing in these two genera.

In Capsella, there was a severe reduction in effective population size in association with the transition to selfing (Foxe et al. 2009; Guo et al. 2009). If floral evolution occurred subsequent to the bottleneck, such population size reductions may have rendered selection on alleles of small effect inefficient, resulting in the preferential fixation of alleles of larger effect (Hamblin et al. 2011). Additionally, since outcrossing Capsella flowers are not as large and showy as those of Mimulus, fewer mutational steps may be required to achieve the selfing syndrome phenotype.

Previous studies have found high correlations between floral size traits, but lower correlations between floral and vegetative traits (e.g. Bernacchi and Tanksley 1997; Lin and Ritland 1997; Fishman et al. 2002; Georgiady et al. 2002; Goodwillie et al. 2006). Our results agree with this pattern, as phenotypic correlations were high between floral size traits, but lower between floral traits and both vegetative/phenology-related and reproductive traits. Consistent with this, we also found that floral size QTL co-localized to a great extent, but less so with QTL for phenology traits. These QTL map to five main genomic regions on linkage groups 1, 2, 6, 7 and 8. Distinguishing between pleiotropy and close linkage as a cause for co-localization of QTL will require additional fine-mapping; at present, however, our results suggest the selfing syndrome could have evolved through mostly major-effect changes at a modest number of loci.

Inbreeding depression

Our F2 mapping population allowed us to assess the degree of dominance at individual QTL. Most QTL had additive effects that were considerably greater than dominance effects, the most prominent exception being a QTL on LG2 affecting several floral size traits, for which the C. rubella allele was largely recessive. Overall, there was no tendency for alleles from either species to be mostly recessive, and the majority of allelic effects were in the direction expected from the phenotypic divergence between species. If inbreeding depression affected our QTL mapping of floral and reproductive traits, we would expect to see recessive C. grandiflora alleles causing reduced flower size or pollen/ovule number. As this was not the case, we conclude that our QTL mapping results are unlikely to be severely affected by inbreeding depression.

We observed abnormal flowers on some inflorescences in about 40% of all F2 individuals. To test whether this could be a result of inbreeding depression, we mapped QTL for this trait. If floral abnormalities were a result of inbreeding depression, we would expect them to be caused by partly or completely recessive C. grandiflora alleles. There were indeed two QTL that showed this pattern; however, together they accounted for only about 7% of the F2 variation, and the major QTL for this trait instead featured a recessive C. rubella allele. As we have not observed this phenotype in C. rubella, and did not find any evidence for two-locus epistatic interactions, we hypothesize that higher-order interactions or nuclear-cytoplasmic incompatibilities may be involved.

Selection on the selfing syndrome

As mentioned above, the majority (86%) of QTL effects for floral and reproductive traits were in the direction expected from phenotypic differences between the two species. This is consistent with directional selection favoring the evolution of the selfing syndrome of C. rubella. However, we did not conduct a formal test for directional selection such as Orr’s sign test (Orr 1998), due to the relatively small number of QTL per trait and the possible non-independence of QTL for different floral traits due to pleiotropy.

Population genetic analysis of 318 loci sequenced in both C. rubella and C. grandiflora yielded some additional evidence for selection on the selfing syndrome. Narrow QTL regions show an excess of fixed differences relative to polymorphism in C. rubella, as expected if selective sweeps in C. rubella have affected these regions (Foxe et al. 2009). While these results are consistent with QTL regions being the target of recent selective sweeps, it will be important to integrate these results with genome-wide patterns of polymorphism, since the severe population bottleneck in C. rubella could generate large-scale heterogeneity in polymorphism that cannot be accounted for here, with short genomic fragments sequenced across the genome.

Evolution of selfing

Previous population genetic analysis in this species pair has suggested very recent origins of C. rubella in the context of a severe population bottleneck leading to a substantial loss of genetic diversity (Foxe et al. 2009; Guo et al. 2009). Evidence for a severe genome-wide population bottleneck is consistent with a self-compatible lineage experiencing a rapid shift to high selfing, rather than a protracted spread of selfing modifiers through a previously outcrossing population. Models for the evolution of selfing suggest that a major modifier to high selfing can evolve even in the context of high inbreeding depression (Lande and Schemske 1985), and selection for reproductive assurance associated with colonization of new postglacial habitats may have enhanced this spread (Foxe et al. 2009).

Our results agree with Riley’s (1934) conclusion that self-compatibility is caused by a dominant C. rubella allele, and confirm a previous report that self-compatibility maps to the canonical Brassicaceae S-locus in Capsella (Nasrallah et al. 2007). The switch to self-compatibility might alone have resulted in high rates of selfing, as introgression of the self-compatibility allele from C. rubella into C. grandiflora yields a mean autonomous selfing efficiency of ~0.4-0.5, about half that of present-day C. rubella (Sicard et al. 2011). Plants that have a high selfing efficiency under greenhouse settings may still outcross to a large extent if pollinators and mates are available, and the realized selfing rate therefore depends on population and pollinator context. Thus, while additional field-based experiments would be important to test the effect of self-compatibility per se on the realized selfing rate, it is possible that the mutation conferring self-compatibility itself comprised a major mutation to high selfing.

If there is extensive pleiotropy, this may have facilitated subsequent evolution of the selfing syndrome in the new self-compatible lineage. An attractive hypothesis is that following the loss of SI, selection for improved efficacy of autonomous self-pollination resulted in correlated changes in floral and reproductive traits (Sicard et al. 2011). In any case, changes in petal size likely occurred prior to the geographical spread of C. rubella, as our study finds petal size QTL in similar genomic locations to those reported by Sicard et al. (2011), despite the fact that the C. rubella mapping parents in the two studies are from widely separated geographical locations (Greece vs. Canary Islands).

Conclusions

In this study we have conducted QTL mapping of floral and reproductive traits that differ between the outcrosser C. grandiflora and the predominantly selfing C. rubella. We find a modest number of QTL for each floral and reproductive trait examined. In contrast to previous studies of the selfing syndrome, we find evidence for major-effect loci affecting floral and reproductive trait divergence in Capsella. The directionality of QTL effects and patterns of polymorphism and divergence in QTL regions suggest that the selfing syndrome has been subject to directional selection. This study therefore forms an important starting point for further studies of the evolutionary forces and genetic basis that underlie evolution of the selfing syndrome.

Table 2.1. Means and standard deviations for vegetative, floral and reproductive traits in C. rubella and C. grandiflora, as well as Bonferroni-corrected P-values for Wilcoxon rank sum tests of trait differences between C. rubella and C. grandiflora.

Trait                               C. rubella (n=15)   C. grandiflora (n=15)   P (Bonferroni)
Petal length (mm)                   2.45±0.21           4.24±0.47               8.65 × 10⁻⁸
Petal width (mm)                    1.06±0.14           2.58±0.29               8.65 × 10⁻⁸
Median sepal length (mm)            1.95±0.20           2.38±0.22               4.39 × 10⁻⁵
Lateral sepal length (mm)           1.92±0.20           2.30±0.22               2.21 × 10⁻³
Median stamen length (mm)           2.08±0.15           2.42±0.20               1.05 × 10⁻⁴
Lateral stamen length (mm)          1.84±0.15           2.17±0.22               1.05 × 10⁻⁴
Length of style + gynoecium (mm)    1.89±0.20           2.19±0.29               2.23 × 10⁻²
Ovules per flower                   21±2                14±2                    2.04 × 10⁻⁴
Pollen per flower                   5791±302            37453±4114              2.04 × 10⁻⁴
Leaf number                         11±3                12±4                    NS
Leaf length (cm)                    14.9±4.3            15.8±5.6                NS
Days to flowering                   24±3                24±2                    NS

Table 2.2. List of significant QTL, including 1.5-LOD and 2-LOD confidence intervals and effect size estimates.

¹ Additive effect: half the mean phenotypic difference between homozygotes for the C. grandiflora allele and homozygotes for the C. rubella allele.

² Per cent F2 phenotypic variance explained by QTL.

³ Homozygous additive effect (2a) relative to mean phenotypic difference between C. grandiflora and C. rubella for traits that differ significantly between C. grandiflora and C. rubella.

Table 2.3. Population genetics of narrow (2-LOD interval <2 Mb) QTL regions compared to the rest of the genome.

QTL                       Locus       Segregating sites   Segregating sites   Fixed         Shared
                                      C. grandiflora      C. rubella          differences   polymorphism
LG1, Petal Length         At1g18880   3                   1                   3             0
                          At1g18900   12                  0                   1             0
                          At1g20510   10                  0                   1             0
                          At1g20960   5                   0                   0             0
LG2, Petal Width          At1g68530   10                  0                   0             0
LG7, Self-compatibility   At4g21390   25                  0                   3             0
                          At4g21380   20                  0                   1             0
All QTL                               85                  1                   9             0
Total Genes                           3395                766                 111           327

Figure 2.1. Schematic showing floral measurements.

Figure 2.2. Histograms of representative floral and reproductive traits in the F2 (grey) as well as in C. rubella (red) and C. grandiflora (blue). Phenotypic values of the parental accessions are shown as lines in the upper panel of each plot (C. rubella 1GR1 in red, C. grandiflora 2e-TS1 in blue and the F1 value in black). The traits are A. Petal length (mm), B. Petal width (mm), C. Median stamen length (mm), D. Style + gynoecium length (mm), E. Ovule number per flower, and F. Pollen number per flower.

Figure 2.3. Character correlations in the F2 population. Scatterplots of all pairs of traits are given below the diagonal and Spearman’s correlation coefficients are given above the diagonal. Boldface type indicates that a correlation is significant (at P<0.05) after Bonferroni correction. Traits are color-coded with green for vegetative and phenology traits, yellow for floral traits and grey for reproductive traits.

Figure 2.4. Capsella linkage map and 1.5-LOD confidence intervals. Markers that exhibit significant segregation distortion (at P<0.05 after Bonferroni correction) are colored in red. All distances are in cM.

Figure 2.5. Fine-mapping of QTL for self-incompatibility on LG7. The large figure shows the LOD profile resulting from interval mapping of SI with hard genotype calls, and the inset shows the fine-scale location of the QTL to a region of LG7 containing the S locus, as well as the ancestry probabilities for each individual across LG7, sorted by self-incompatibility status. For each individual, a blue bar indicates high probability (>0.9) that the region is homozygous for the C. grandiflora allele, a yellow bar indicates high probability that the region is homozygous for the C. rubella allele and an orange bar indicates high probability of heterozygosity.

CHAPTER THREE EVOLUTIONARY GENOMICS OF MATING SYSTEM IN CAPSELLA

Summary

The shift in mating system from outcrossing to selfing reduces the effective population size, which is expected to reduce the efficacy of natural selection. Additionally, recent shifts to selfing are expected to be associated with a suite of adaptive changes associated with increasing selfing efficiency and fitness through female function. In this study, I use the complete genome sequence of Capsella rubella to investigate the effect of recent mating system evolution on genome-wide selection, gene expression and diversity. We generate a high-quality assembly of the Capsella rubella genome combined with next generation sequencing of flower bud transcriptomes from six selfing C. rubella individuals and five outcrossing C. grandiflora individuals. I find a significantly elevated proportion of nonsynonymous polymorphisms unique to C. rubella compared with C. grandiflora, providing evidence of relaxed purifying selection following the shift to selfing. These data were used to estimate the distribution of fitness effects of new deleterious mutations, which revealed a higher proportion of effectively neutral amino acid mutations in C. rubella. Our results show a signature of selective sweeps co-localized with floral trait QTL. This approach, combined with identification of gene expression differences between the two species, helps us identify candidate selected regions important in the transition to selfing. Comparisons of gene expression changes in Capsella with those between selfing and outcrossing Arabidopsis suggest a role for parallel evolution of gene expression in mating system evolution.

Introduction

Over evolutionary timescales, the transition in mating system from outcrossing to selfing results in recurrent changes in floral morphology, life history and genetic diversity. The variation in diversity of reproductive system within and among closely related plant species, as well as the significant heritability of traits that affect mating, suggest that mating system can be altered rapidly by selection (Ashman et al. 2006; Herlihy et al. 2007). However, we still know very little about the genomic basis of the evolution of selfing, or its genomic consequences.

The shift to selfing from outcrossing is generally associated with a set of adaptive changes to the flower (Ornduff 1969; Richards 1986), termed the selfing syndrome. Similar floral morphological changes characterize independently derived selfers in many lineages (e.g., Amsinckia, Schoen et al. 1997; Eichhornia, Kohn et al. 1996; Mimulus, Fenster and Ritland 1994; Linanthus, Goodwillie 1999). Those changes include a reduction in flower size, a reduction in the physical distance between anthers and stigma (herkogamy), a breakdown of self-incompatibility (SI), as well as a shift in sex allocation.

Several selective forces are likely to play an important role in morphological evolution during the evolution of selfing. First, floral trait morphology may have evolved by direct selection for efficient selfing. Herkogamy is the most obvious trait with a direct effect, since less separation between stigma and anthers will promote self-pollen deposition. If genes controlling herkogamy also influence flower size, selection on herkogamy can also drive the evolution of other floral traits (Takebayashi et al. 2006). However, this correlation is not complete; for instance, in a cross of M. guttatus and M. nasutus, Fishman and colleagues (2002) found several QTL influencing both traits, but overall there was a weak correlation between herkogamy and flower size.

Second, sex allocation theory predicts that selfing plants should allocate less to male function because fewer ovules are available for cross-fertilization by pollen and, consequently, resources expended on male function gain less in terms of fitness. In other words, selfing taxa allocate fewer resources to male relative to female function than related outcrossers, and also invest less in traits that attract mates (Johnston and Schoen 1996; Sato and Yahara 1999; Fishman and Stratton 2004; French et al. 2005; Goodwillie and Ness 2005; Ortiz et al. 2006; Tomimatsu and Ohara 2006). Thus, pollen production is expected to decrease as the selfing rate increases, resulting in lower pollen:ovule (P:O) ratios (Charnov 1979; Charlesworth and Charlesworth 1981; Ritland and Ritland 1989; Armbruster et al. 2002). Selection for small flowers in selfers may thus also be explained by the reallocation of resources.

Since adaptation is shaped by selection, it can leave a distinct imprint on the levels and patterns of nucleotide variation in an organism’s genome. The main principle is that neutral loci across the genome will be similarly affected by genetic drift, demography and evolutionary history. In contrast, regions of the genome under positive directional selection will show unusual patterns of diversity due to genetic hitchhiking (selective sweeps). They will show reduced nucleotide diversity as well as an increase in genetic differentiation (Fst) and fixed differences (Barton 2000; Andolfatto et al. 2007; Andersen et al. 2012). The extent of the reduction in diversity is a function of the strength of selection, the timing of the sweep, and the effective rate of recombination (Wright et al. 2005). Population bottlenecks also reduce diversity, but in contrast to the localized reductions caused by sweeps, this effect is manifested genome-wide. By using a large number of genes spread across the genome, the effect of selection on advantageous mutations on neutral diversity at flanking sites (Maynard Smith and Haigh 1974) can be compared to the genome-wide demographic effect.

Because of uncertainty in demographic history, detection of selective sweeps across the genome can introduce false positives; QTL mapping provides a more direct approach to identifying the targets of phenotypic evolution, but often results in wider genomic intervals.

Thus, the most powerful approach for identifying selected loci may be to use a combination of the population genomic approach and QTL mapping to narrow down the sweep signature (Stinchcombe et al. 2007). Additionally, whole-genome transcriptome analysis can be used to identify genes that are differentially expressed between species. This combination of expression analysis with population genomics and genetic mapping should provide a powerful approach to narrow down the targets of selection and help determine the functional basis of adaptive trait changes.

Beyond their adaptive significance, mating system transitions from outcrossing to selfing have a number of population genetic consequences (Charlesworth and Wright 2001; Glémin et al. 2006; Wright et al. 2008). First, selfing will increase homozygosity and this will lead to a reduction of the effective population size (Ne), up to two-fold under complete selfing (Charlesworth et al. 1993b; Nordborg 2000). Furthermore, the increase in homozygosity will decrease the chances of crossing over between heterozygous sites, thus increasing linkage disequilibrium among loci. Selfing genomes can thus be affected strongly by genetic hitchhiking of positively selected mutations (selective sweeps) and by the elimination of new deleterious mutations (background selection), further reducing the effective population size (Charlesworth and Wright 2001). Effective population size may be reduced still further by life-history and demographic factors associated with selfing, such as population subdivision, isolation and genetic bottlenecks. Finally, gene flow between species will be reduced with increased selfing (Sweigart and Willis 2003), which can further reduce genetic diversity.
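The two-fold figure follows from a standard population genetic result: at equilibrium, a selfing rate s gives an inbreeding coefficient F = s/(2 − s), and the effective population size is scaled by 1/(1 + F). The sketch below simply evaluates these expressions; the function names are illustrative, and the calculation ignores the additional effects of hitchhiking, background selection and demography discussed above.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = s / (2 - s) for selfing rate s."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing rate must be between 0 and 1")
    return selfing_rate / (2.0 - selfing_rate)

def ne_ratio(selfing_rate):
    """Ne under partial selfing relative to random mating: 1 / (1 + F)."""
    return 1.0 / (1.0 + inbreeding_coefficient(selfing_rate))

for s in (0.0, 0.5, 0.9, 1.0):
    print(f"s = {s:.1f}: F = {inbreeding_coefficient(s):.3f}, "
          f"Ne ratio = {ne_ratio(s):.3f}")
```

Under complete selfing (s = 1), F = 1 and Ne is halved, which is the direct effect of homozygosity alone.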

Given all of the processes above, we expect that selfing will lead to a reduction in the efficacy of selection, which will cause an increase in the rate of fixation of slightly deleterious mutations. On the one hand, the increase of homozygosity in selfing species will expose recessive deleterious mutations to selection, leading to their elimination by purging (Ohta and Cockerham 1974; Barrett 2002; Schoen and Busch 2008); strongly deleterious recessive mutations will therefore be removed from the population. On the other hand, selfing species will accumulate more slightly deleterious mutations with additive fitness effects (Wang et al. 1999), and these may predominate across the genome.

The effect of selfing on the level of polymorphism is well documented in the literature (Hamrick and Godt 1996; Nybom 2004; Glémin et al. 2006; Foxe et al. 2009; Busch et al. 2011; Ness et al. 2011). The signature of relaxed selection pressures in selfing species should be detectable at the molecular level as an increase in the ratio of nonsynonymous to synonymous polymorphism (πn/πs) or in the ratio of nonsynonymous to synonymous divergence (Dn/Ds) (Bustamante et al. 2002; Wright et al. 2002). However, fewer studies have assessed the impact of selfing on the efficacy of selection, and these have generally seen little to no effect of mating system on the strength of selection on nonsynonymous sites and codon bias (Foxe et al. 2008; Haudry et al. 2008; Escobar et al. 2010).

Several studies have reported high proportions of rare nonsynonymous polymorphisms in selfing species (Cummings and Clegg 1998; Bustamante et al. 2002; Nordborg et al. 2005), but a direct comparison of polymorphism and divergence with closely related outcrossing species is generally lacking. For instance, Bustamante et al. (2002) compared polymorphism and divergence and found an excess of nonsynonymous polymorphism and a reduction of selection, but this comparison was done using A. thaliana and Drosophila melanogaster. More detailed analysis of polymorphism and divergence for more closely related selfing and outcrossing species is needed. To date, comparisons of polymorphism to divergence in the selfing A. thaliana versus the outcrossing A. lyrata, using 73 genes in A. lyrata and 675 in A. thaliana, did not suggest any significant difference (Foxe et al. 2008). Glémin et al. (2006) showed a weak but significant increase in the ratio of nonsynonymous to synonymous polymorphism and divergence in selfing species when comparing plants with different mating systems.

A major limitation of previous studies testing for relaxed selection has been the number of loci used in the study. Since the process of relaxed selection is driven by genetic drift, it is highly stochastic and may not be detectable with small numbers of genes. Therefore, the acquisition of whole genome comparisons of polymorphism and divergence in closely related selfing and outcrossing species is key to addressing the genomic consequences of the shift to selfing.

In this study, we sequenced and assembled the genome of Capsella rubella. I focused on the two diploid species in the genus Capsella, the outcrossing Capsella grandiflora and the selfing Capsella rubella, because they provide a key opportunity to study the early genomic consequences of the evolution of mating system. The recent shift to selfing (30 to 45 KY) in C. rubella was accompanied by a severe bottleneck with a major loss of diversity, as well as a breakdown of self-incompatibility that coincides with the species divergence (Foxe et al. 2009; Guo et al. 2009). The divergence of the two species is marked by divergent floral evolution, where C. rubella has smaller flowers, lower pollen production and higher ovule production (Chapter two). In order to address the genomic changes associated with the mating system shift in Capsella, I combined the high quality genome assembly of C. rubella with Illumina RNA sequence data (RNAseq) collected from mixed flower buds at different stages from C. grandiflora and C. rubella.

Materials and Methods

Capsella rubella Assembly

The initial sequencing and assembly was performed by JGI (Joint Genome Institute, CA) using the genome-wide shotgun assembly software ARACHNE (Batzoglou et al. 2002) incorporating 454 linear data, 454 paired end data, Illumina sequence data and Sanger BAC-end sequencing as well as Fosmid-end sequencing (FES/BES). This represents 22.35x coverage of the 134.8 Megabase (Mb) assembly. In addition, ~ 48.22 GBs of trimmed Illumina sequence data were used for pre- and post-fixing of 454 base pair errors in the consensus, to improve accuracy of the Capsella rubella genome sequence. Gene annotation was conducted by JGI using a combination of transcriptome data with gene prediction and comparative analysis of Arabidopsis, papaya and grapevine.

A set of molecular markers from the genetic map already established from the mapping population generated in an interspecific cross (C. grandiflora x C. rubella) (Chapter two) was used to identify misjoins in the assembly. Misjoins were characterized by an abrupt change in the linkage group combined with a region of low BAC/Fosmid support. A significant proportion of the markers were located on the assembly, allowing us to anchor a large proportion of scaffolds from the assembled 134.8 Megabase (Mb) genome onto the eight Capsella rubella linkage groups. Scaffolds were then oriented, ordered and joined together using a combination of the Capsella rubella genetic map and Arabidopsis lyrata synteny information. The results of anchoring, ordering and orientation of the scaffolds (Table 3.3) were used to generate eight large scaffolds corresponding to the major chromosomal sequence.

RNA extraction, sequencing and read mapping

Total RNA was harvested from mixed flower buds flash frozen in liquid nitrogen from five Capsella grandiflora genotypes and six Capsella rubella genotypes sampled from different geographic locations (Table 3.1) using the RNeasy Plant Mini Kit (Qiagen) with minor modifications to obtain the required yield (~5 µg) for RNA sequencing. Following extraction, mRNA isolation, library preparation and paired-end 38 bp sequencing were done on an Illumina Genome Analyzer at the Centre for Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto. To compare our expression patterns in Capsella with those in Arabidopsis, replicate runs of strand-specific RNA sequencing from flower buds of the selfing A. thaliana and the outcrossing Arabidopsis lyrata were obtained from Richard Clark (University of Utah). I used TopHat (Trapnell et al. 2010) to map our sequence reads to the reference genome of Capsella rubella. I used an inner distance between reads (-r) of 100, and minimum and maximum intron lengths of 40 and 1000, respectively. I set the minimum normalized depth to 0 (-F 0) to eliminate any heuristic filters associated with the human genome. A splice junction annotation file in GFF format, defining the boundaries of all exons and introns, was generated from the C. rubella gene annotation and used to guide the alignment of sequence reads, ignoring any novel junctions (--no-novel-juncs). A. thaliana and A. lyrata RNA sequence reads were mapped to their respective reference genomes using the same parameters.

Gene expression, enrichment analyses and potential candidate genes

After aligning our reads to the C. rubella reference genome, the Cufflinks pipeline (Trapnell et al. 2010) was used to assemble transcripts, estimate their abundance and extract differentially expressed genes using the C. rubella annotation with known exons. Gene expression was normalized as fragments per kilobase pair of transcript per million mapped reads (FPKM = 10^9 * C / (N * L), where C is the number of mapped reads, N is the total number of mappable reads in the experiment and L is the length of the gene in base pairs). Normalization of FPKM values among the C. grandiflora and C. rubella samples, calculation of filtered expression values and hierarchical clustering analysis (Eisen et al. 1998) were done using dChip (Li and Wong 2001a). Gene expression data were log-transformed and genes were tested for significant differences between species using a t-test, with a significance cutoff of 0.001. Genes showing significant up- and down-regulation were tested for enrichment of functional categories using the DAVID bioinformatics database (http://david.abcc.ncifcrf.gov/), subject to a false discovery rate (FDR) of 0.1.
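As a minimal illustration of the FPKM normalization defined above, the following Python sketch evaluates the formula for a single gene. The function name and the example counts are hypothetical and are not taken from the Cufflinks or dChip output used in this chapter.

```python
def fpkm(mapped_fragments, total_mapped, gene_length_bp):
    """FPKM = 10^9 * C / (N * L).

    C: fragments mapped to the gene
    N: total mappable fragments in the experiment
    L: gene length in base pairs
    """
    if total_mapped <= 0 or gene_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * mapped_fragments / (total_mapped * gene_length_bp)


# Hypothetical gene: 500 fragments, 20 million mapped fragments, 1.5 kb transcript.
example_value = fpkm(500, 20_000_000, 1_500)
print(round(example_value, 2))
```

Because the value is scaled by both library size and gene length, long genes and deeply sequenced samples are not automatically assigned higher expression.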

Multispecies orthologues assignment

Four-way orthologues of Capsella rubella/Arabidopsis thaliana/Arabidopsis lyrata/Thellungiella halophila were assigned using the program Inparanoid/Multiparanoid (Remm et al. 2001), using Thellungiella halophila as outgroup to resolve any ambiguous paralogy. The Inparanoid program takes as input the protein sequences of the different species and performs a pairwise reciprocal blastp with the best similarity hit method. Two cut-off values were applied in the process: 1) a score cut-off of 50 bits, and 2) since orthologous sequences maintain homology over the majority of their length, an overlap cut-off of 50% to prevent short, domain-level matches. Finally, a bootstrapping technique was used to estimate the probability that a pair of orthologues had a mutual best score by chance. Putative four-way orthologues were retained only if they were orthologues in all pairwise comparisons. The selection of the four-way orthologues was done using a Perl script implemented in Multiparanoid.
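The reciprocal best-hit step with the two cut-offs described above can be sketched in Python as follows. This is a simplified illustration only: it is not the Inparanoid implementation (it omits in-paralog clustering and bootstrapping), and the input layout, function names and gene identifiers are assumptions made for the example.

```python
def best_hits(hits, min_bits=50.0, min_overlap=0.5):
    """Best-scoring subject per query after applying both cut-offs.

    hits: iterable of (query, subject, bit_score, overlap_fraction),
    a hypothetical summary of blastp output.
    """
    best = {}
    for query, subject, bits, overlap in hits:
        if bits < min_bits or overlap < min_overlap:
            continue  # too weak, or only a short domain-level match
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_bits=50.0, min_overlap=0.5):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, min_bits, min_overlap)
    reverse = best_hits(b_vs_a, min_bits, min_overlap)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical hits between two proteomes.
a_vs_b = [("cr1", "at1", 200, 0.9), ("cr1", "at2", 80, 0.8),
          ("cr2", "at2", 45, 0.9), ("cr3", "at3", 120, 0.3)]
b_vs_a = [("at1", "cr1", 198, 0.9), ("at2", "cr1", 75, 0.8),
          ("at3", "cr3", 120, 0.3)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy input only cr1/at1 survives: cr2 fails the score cut-off, cr3 fails the overlap cut-off, and the best hit of at2 is already paired elsewhere.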

Whole-genome coding sequence extraction and SNP calling

The mapping of sequence reads of C. rubella and C. grandiflora using Bowtie/TopHat, described above, generated a BAM file for each individual, a binary format containing the sequence alignment data. I used Samtools (Li et al. 2009) to sort and index the BAM files and to call SNPs for all eleven individuals (5 C. grandiflora, 6 C. rubella). A VCF file, a text format summarizing the variant calls from mpileup, was generated using bcftools implemented in Samtools. An in-house Python script was written to extract coding sequences from the VCF file with the help of the C. rubella genome annotation in GFF format. Stringent filters were used in the Python script to reduce the incorporation of errors into further analyses. In particular, I required a minimum SNP quality of 25, a minimum genotype quality score of 20 and a minimum individual read depth of 20. I generated fasta files of the coding sequences for 5 C. grandiflora and 6 C. rubella genotypes. To obtain a close outgroup for divergence analyses, I did a BLASTn (Altschul et al. 1997) search of the C. rubella coding sequences against a de novo Illumina assembly of Neslia paniculata generated using Ray (Boisvert et al. 2010); a Perl script was used to extract coding sequences from the blast output using the highest bit score with an identity match of > 80% and an E-value cut-off of 10^-5. Coding sequences extracted from Neslia paniculata were incorporated as an outgroup into the C. grandiflora and C. rubella alignments.
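The three quality thresholds above can be illustrated with a short Python sketch. It is not the in-house script used in this chapter; the record layout is a hypothetical pre-parsed form of one VCF line, and treating failing genotypes as missing data (rather than discarding the whole site) is an assumption.

```python
MIN_SNP_QUAL = 25   # minimum site-level SNP quality
MIN_GENO_QUAL = 20  # minimum per-sample genotype quality
MIN_DEPTH = 20      # minimum per-sample read depth


def filter_site(site):
    """Apply the three cut-offs to one site.

    site: {"qual": float,
           "samples": {name: {"gt": str, "gq": int, "dp": int}}}
    Returns None if the site fails the SNP quality cut-off; otherwise a
    dict mapping each sample to its genotype, or to None (missing) when
    the genotype quality or depth is too low.
    """
    if site["qual"] < MIN_SNP_QUAL:
        return None
    kept = {}
    for name, call in site["samples"].items():
        passes = call["gq"] >= MIN_GENO_QUAL and call["dp"] >= MIN_DEPTH
        kept[name] = call["gt"] if passes else None
    return kept


# Hypothetical site with one confident and one low-quality genotype.
site = {"qual": 30.0,
        "samples": {"Cr1": {"gt": "1/1", "gq": 40, "dp": 25},
                    "Cg1": {"gt": "0/1", "gq": 15, "dp": 30}}}
calls = filter_site(site)
```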

Whole-genome polymorphism and divergence

I ran a modified version of Polymorphorama (Haddrill et al. 2008), a library of Perl scripts for calculating polymorphism and divergence statistics, on the aligned C. grandiflora, C. rubella and N. paniculata fasta coding sequences. Additionally, I sub-sampled six C. grandiflora and six C. rubella sequences and calculated synonymous and nonsynonymous site-frequency spectra for each gene. The population mutation parameter θ = 4Neμ was estimated using the average number of pairwise differences in a sample, π (Nei 1987), and Watterson’s θ (Watterson 1975). A sliding window of synonymous diversity across each chromosome was generated for both C. grandiflora and C. rubella using windows of 10 genes with a step size of 1 gene. Tajima’s D statistic (Tajima 1989), a measure of the skew in the site frequency spectrum, was also calculated. Divergence was estimated as Dxy, the average pairwise difference between each species and the outgroup N. paniculata (Nei 1987), using a Jukes-Cantor correction (Jukes and Cantor 1969). In-house Perl scripts were run on the aligned coding sequences of C. grandiflora and C. rubella to calculate the number of synonymous and nonsynonymous single nucleotide polymorphisms (SNPs) that are unique, shared and fixed for both species. A sliding window of the number of shared polymorphisms, unique polymorphisms and fixed differences in both species was generated across each chromosome, using windows of 100 SNPs and a step size of 10 SNPs. To illustrate the pattern of polymorphism, I used in-house Perl scripts to generate the derived joint site frequency spectrum of each species using N. paniculata as outgroup.
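For reference, the summary statistics named above (π, Watterson’s θ, Tajima’s D and the Jukes-Cantor correction) can be computed as in the following Python sketch. It follows the standard published formulas rather than the Polymorphorama code, assumes gap-free aligned sequences of a single site class, and all function names and the toy alignment are hypothetical.

```python
from itertools import combinations
from math import log, sqrt


def diversity_stats(seqs):
    """Per-site pi and Watterson's theta, plus Tajima's D, for aligned sequences."""
    n, length = len(seqs), len(seqs[0])
    if n < 2 or any(len(s) != length for s in seqs):
        raise ValueError("need at least two sequences of equal length")
    # S: number of segregating sites in the sample.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # pi: mean number of differences over all pairs of sequences.
    total_diffs = sum(sum(x != y for x, y in zip(a, b))
                      for a, b in combinations(seqs, 2))
    pi = total_diffs / (n * (n - 1) / 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = seg_sites / a1
    tajima_d = None  # undefined when there is no variation
    if seg_sites > 0:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
        tajima_d = (pi - theta_w) / sqrt(variance)
    return {"pi": pi / length, "theta_w": theta_w / length, "tajima_d": tajima_d}


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from a raw proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)


# Toy alignment with two singleton variants, which gives a negative D.
stats = diversity_stats(["AAAA", "TAAA", "ATAA", "AAAA"])
```

An excess of rare variants lowers π relative to Watterson’s θ and makes D negative, whereas a deficit of rare variants makes D positive.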

Distribution of fitness effects of new deleterious mutations and the proportion of adaptive amino acid fixation (α) estimates

To estimate the distribution of fitness effects and the proportion of amino acid divergence fixed by positive selection (α), I used the method of Eyre-Walker and Keightley (2009) and Keightley and Eyre-Walker (2007) implemented in the program DFE-alpha (http://www.homepages.ed.ac.uk/eang33). To generate the input file, I used the number of divergent SNPs between Neslia paniculata and, respectively, Capsella grandiflora and Capsella rubella, and the folded site frequency spectra (SFS) for unique synonymous SNPs and nonsynonymous SNPs. The method depends on population genetic models and uses the nucleotide polymorphism spectrum and between-species nucleotide divergence to estimate the distribution of fitness effects of deleterious mutations (DFE) and α, the fraction of nonsynonymous substitutions fixed due to positive selection, as well as the demographic parameters associated with population size changes, taking into account slightly deleterious mutations (Eyre-Walker and Keightley 2009; Halligan et al. 2010). Since a significant fraction of SNPs in C. rubella will be shared with C. grandiflora and possibly reflect ancestral selection pressures, I restricted this analysis to unique polymorphisms segregating within each species to capture current selection pressures.
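The distinction between unique, shared and fixed variants can be made concrete with the Python sketch below. It is an illustrative simplification and not the in-house Perl scripts: it assumes biallelic sites and complete data, and the function names and example alleles are hypothetical.

```python
from collections import Counter


def classify_site(alleles_a, alleles_b):
    """Classify one aligned site from the alleles sampled in species A and B."""
    set_a, set_b = set(alleles_a), set(alleles_b)
    union = set_a | set_b
    if len(union) == 1:
        return "invariant"
    if len(union) > 2:
        return "other"  # more than two alleles: outside this sketch
    a_polymorphic = len(set_a) == 2
    b_polymorphic = len(set_b) == 2
    if a_polymorphic and b_polymorphic:
        return "shared"
    if a_polymorphic:
        return "unique_a"
    if b_polymorphic:
        return "unique_b"
    return "fixed"  # each species monomorphic for a different allele


def tally_sites(sites):
    """Count site classes over an iterable of (alleles_a, alleles_b) pairs."""
    return Counter(classify_site(a, b) for a, b in sites)


# Hypothetical sites: species A = C. grandiflora, species B = C. rubella.
counts = tally_sites([("AAAT", "AAAA"), ("CCGG", "CGCC"),
                      ("TTTT", "GGGG"), ("AAAA", "AAAA")])
```

Restricting an analysis to the unique classes excludes variants that segregate in both species and may therefore predate the species split.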

Results

Sequencing and assembly

The assembly performed using the ARACHNE assembler consists of 9,675 contigs and 853 scaffolds. The main nuclear genome comprises a total of 134.8 Mb encompassing the 853 scaffolds, and over 90% of the nuclear assembly was found on the main eight chromosome scaffolds. Although this is considerably smaller than the estimated genome size (220 Mb), Illumina read mapping and EST analysis suggest that our assembly encompasses the vast majority of the euchromatic regions of the genome. The remaining scaffolds represent mostly heterochromatic centromeric repeats and some ribosomal, mitochondrial and chloroplast DNA. Gene annotation identified 26,521 protein-coding genes, representing 24.5% of the assembly (Table 3.2). This gene number is lower than in A. thaliana (27,025; The Arabidopsis Genome Initiative 2000), A. lyrata (32,670; Hu et al. 2011) and Thellungiella parvula (28,901; Dassanayake et al. 2011).

High levels of synteny were found between C. rubella and A. lyrata, consistent with previous comparative mapping work (Koch et al. 2005; Schranz et al. 2007) where only three major genome-wide rearrangements were identified. Comparative mapping of orthologous genes (http://genomevolution.org/CoGe/) and synteny information from Capsella rubella with A. lyrata and S. parvula indicate that those three rearrangements are specific to Arabidopsis lyrata (Figure 3.1).

Genome-wide pattern of polymorphism and divergence

The summaries of polymorphism and divergence are shown in Table 3.4. In agreement with previous studies using smaller data sets (Foxe et al. 2009; Slotte et al. 2010), the genome-wide level of nucleotide diversity at synonymous sites is much higher in the outcrossing C. grandiflora (π = 0.0164; θw = 0.0169) than in the selfing C. rubella (π = 0.0047; θw = 0.0041). The severe loss of synonymous nucleotide diversity from C. grandiflora to C. rubella is apparent genome-wide, although there is also considerable chromosomal variation in the retention of polymorphism (Figure 3.2).

The average Tajima’s D (Tajima 1989) for synonymous sites in C. grandiflora is slightly negative at -0.204, indicating a skew toward rare polymorphisms that could be explained by a recent population expansion and/or the action of weak purifying selection on some synonymous sites associated with selection for codon usage bias. Tajima’s D is more negative at nonsynonymous sites (-0.369), which is consistent with the observed allele frequency spectra in C. grandiflora (Figure 3.3) and suggests that a significant proportion of nonsynonymous polymorphisms are under purifying selection. In contrast, in C. rubella the average Tajima’s D is positive for synonymous sites (0.412), which is consistent with a recent population bottleneck (Foxe et al. 2009; Guo et al. 2009). Nonsynonymous sites do show an excess of rare variants in C. rubella (Figure 3.3), although the ratio of nonsynonymous to synonymous SNPs is consistently elevated compared to C. grandiflora across all frequency classes, consistent with a genome-wide relaxation of selection in the selfing species (Figure 3.3).

Gene expression analysis

To avoid flow cell lane effects, three of our samples (one C. grandiflora and two C. rubella) that were run separately were excluded from the expression analysis. I therefore restricted the expression analysis to data collected from flower buds at different stages from four C. grandiflora and four C. rubella samples. I identified 5771 genes that are differentially expressed between the two species at a p-value cutoff of 0.001 (Figure 3.4). As expected, the correlation of gene expression across samples was higher within species than between species, consistent with considerable evolution of gene expression since species divergence (Figure 3.5). Filtering the differentially expressed genes using a false discovery rate (FDR) of 0.1, I identified 272 down-regulated and 155 up-regulated genes in C. rubella compared to C. grandiflora. Using enrichment tests for gene function (http://david.abcc.ncifcrf.gov/), I found that down-regulated genes in C. rubella are enriched for functions associated with pollen production, and up-regulated genes are enriched for female reproductive and post-embryonic functions (Tables 3.5, 3.6). Comparing global gene expression across species, I found a modest trend for A. thaliana gene expression to be more strongly correlated with C. rubella (Spearman’s R = 0.787) than with C. grandiflora (0.739), and for A. lyrata to be more strongly correlated with C. grandiflora (0.737) than with C. rubella (0.69). Nevertheless, the strongest gene expression correlations were found between A. thaliana and A. lyrata (0.81) and between C. grandiflora and C. rubella (0.95). Looking at the differentially expressed genes that were identified in Capsella, I found that genes up-regulated in C. rubella were also significantly up-regulated in A. thaliana compared with A. lyrata, and similarly genes down-regulated in C. rubella were also significantly down-regulated in A. thaliana (Wilcoxon test p<0.001) (Figure 3.6).

To search for candidate differentially expressed genes important in the mating system transition, I used a combination of QTL mapping and differential gene expression between C. grandiflora and C. rubella. Our results identify potential candidate genes that are differentially expressed and located under the QTLs for petal width, homeotic phenotype, pollen number and self-incompatibility (SI) on different chromosomes; this list is summarized in Appendix 3.1. Candidate genes that are down-regulated in C. rubella are those under the QTLs for pollen number, self-incompatibility and some homeotic phenotypes. On the other hand, candidate genes that are up-regulated in C. rubella are under the petal width QTLs.

Efficacy of natural selection

After conducting SNP calls from the six C. rubella and five C. grandiflora individuals, and using cut-offs for depth and quality for each sample, I obtained polymorphism data from 4582 genes and identified a total of 44,451 SNPs in coding regions. The majority of SNPs are uniquely segregating in C. grandiflora (74.32%), while only 4.7% were fixed between species, 10.6% were unique to C. rubella and 10.2% were shared. The ratio of nonsynonymous to synonymous changes was significantly elevated (Chi-square with Yates correction, 2-tailed p<0.0001) at SNPs unique to C. rubella (Figure 3.7). In contrast, the ratios of nonsynonymous to synonymous changes at fixed and shared polymorphisms are comparable to that at the unique C. grandiflora SNPs (Figure 3.7). To test whether the increase in unique nonsynonymous polymorphism was mainly due to relaxed constraint on genes specifically responsible for outcrossing, I analyzed down-regulated and up-regulated genes separately; both sets show the same excess of unique nonsynonymous polymorphisms as the total dataset (2-tailed Fisher’s exact test p<0.05) (Figure 3.7).
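The chi-square test with Yates continuity correction on a 2x2 table of nonsynonymous and synonymous SNP counts can be sketched in Python as below. The counts in the example are hypothetical placeholders (the underlying counts are not reproduced in the text), and the function name is an assumption.

```python
from math import erfc, sqrt


def yates_chi_square(a, b, c, d):
    """Chi-square statistic (1 d.f.) with Yates correction for the table
    [[a, b], [c, d]], plus its two-tailed p-value."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column total must be positive")
    # The correction shrinks |ad - bc| by n/2, but never below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / margins
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = two SNP classes, columns = nonsyn, syn.
chi2, p_value = yates_chi_square(10, 20, 30, 40)
```

In practice the rows would hold the SNPs unique to C. rubella and the comparison class, so that a small p-value indicates a difference in the nonsynonymous to synonymous ratio between the two classes.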

Distribution of fitness effects of new mutations and α estimates

Using the method of Keightley and Eyre-Walker (2007), I estimated the demographic parameters as well as the percentage of mutations in each selection category (Table 3.7). The common trend I found is that the majority of mutations in C. grandiflora and C. rubella were inferred to be strongly deleterious (83.4-63.9%; Nes>10), having a low probability of becoming fixed in the population. When I looked at the other end of the spectrum, we can see that the proportion of both nearly neutral mutations (0

The extent of directional selection in C. rubella associated with the shift to selfing

Examining the sliding windows of shared, fixed, and unique polymorphism in both C. grandiflora and C. rubella (Figure 3.8), I found that a number of genomic regions showed an increase of fixed differences compared to shared polymorphism, and this includes the mapped self-incompatibility region (Chapter two). I looked at the narrow QTL regions from Chapter two (petal width on chromosomes 1 and 2; self-incompatibility (SI) on chromosome 7), and I found an overall increase of fixed differences relative to shared polymorphism under the narrow QTLs compared to the rest of the genome (Figure 3.9). The ratio of fixed to shared polymorphism is 2.4 in narrow QTL regions compared with 0.5 genome-wide (P<0.01, Fisher exact test).
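A comparison of fixed-to-shared ratios of this kind can be tested with a two-sided Fisher exact test on a 2x2 table, sketched below in Python. The counts are hypothetical values chosen only to reproduce ratios of 2.4 and 0.5; they are not the counts underlying Figure 3.9, and the function name is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: rows = QTL regions, rest of genome; columns = fixed, shared.
qtl_fixed, qtl_shared = 24, 10
genome_fixed, genome_shared = 500, 1000
qtl_ratio = qtl_fixed / qtl_shared
genome_ratio = genome_fixed / genome_shared
p_value = fisher_exact_two_sided(qtl_fixed, qtl_shared, genome_fixed, genome_shared)
```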

Discussion

In this study, I combined a high quality assembly of the Capsella rubella genome with whole transcriptome data (RNAseq) in Capsella grandiflora and Capsella rubella in order to address the genomic causes and consequences of the mating shift from outcrossing to selfing.

The high marker density genetic map generated in the QTL analysis in Chapter two using genotyping by sequencing (next generation RAD) provided a key resource to identify false joins in the genome assembly and to order scaffolds into higher-order chromosomal segments. As genome projects increasingly rely on ordering short scaffolds from next-generation sequencing, this genetic mapping approach is likely to become an increasingly important tool in the genome assembly process.

In the genome-wide survey of nucleotide polymorphism and divergence, our results showing severe diversity loss in Capsella rubella compared to C. grandiflora are generally consistent with Foxe and colleagues (2009) and reinforce their conclusions suggesting recent speciation under a severe population bottleneck. However, there is variation across chromosomes, with marked retention of polymorphism in some regions, for instance on the arm of chromosome 2. This could be the result of introgression and/or balancing polymorphism.

The negative Tajima’s D at synonymous sites in C. grandiflora could be explained by some population expansion, and this is consistent with the demographic parameter estimates from the Keightley and Eyre-Walker 2009 method (Table 3.7). Evidence for expansion in C. rubella following the bottleneck is also apparent when I focus on unique variants (Table 3.7).

Interestingly, the ratio of nonsynonymous to synonymous changes is significantly elevated at unique C. rubella SNPs, indicating an ongoing relaxation of purifying selection.

Using the method of Keightley and Eyre-Walker, I find an elevated proportion of effectively neutral mutations in C. rubella, consistent with relaxed selection. The absence of a similar elevation for fixed differences suggests that there is ongoing relaxed selection that was not driven solely by the initial founder event. Another possible explanation for the excess of nonsynonymous SNPs in C. rubella is that we are capturing a higher SNP error rate, particularly at heterozygous sites. To assess this, I reran the analysis using only homozygous SNPs in C. rubella and we still find a significant excess of nonsynonymous SNPs (p<0.05, Fisher exact test).

Our DFE-alpha results for C. grandiflora yielded parameter estimates comparable to those of Slotte et al. (2010) from 257 loci, where I also detected high rates of positive selection and efficient purifying selection. Comparable results have been described in sunflower species with high effective population sizes (Strasburg et al. 2011). Our DFE-alpha analysis for C. rubella showed proportions of effectively neutral mutations more comparable to those seen in humans and Arabidopsis species with relatively low Ne (Halligan et al. 2010; Slotte et al. 2010). Our results suggest that the severe bottleneck and shift in mating system can lead to rapid changes in the efficacy of natural selection.

An alternative explanation for the elevated nonsynonymous polymorphism in C. rubella is the possibility that relaxed selection on floral traits mediated by pollinator attraction is driving the genome-wide signal of elevated nonsynonymous polymorphism. Looking at the differentially expressed genes between C. grandiflora and C. rubella, I find comparable elevated nonsynonymous polymorphism in both sets (Figure 3.7), suggesting that there is a global reduction in the efficacy of selection rather than specific relaxation on down-regulated genes.

Gene expression levels in C. grandiflora and C. rubella are highly correlated, as expected since the species are closely related and recently diverged. Down-regulated genes show enrichment for male function, particularly pollen production, while up-regulated genes are enriched for female function. This is consistent with the shift in allocation of resources toward female function during the shift in mating system to selfing (Chapter two). Orthologous genes that are down-regulated or up-regulated in the selfing C. rubella relative to the outcrossing C. grandiflora show the same direction of expression change in the selfing A. thaliana relative to the outcrossing A. lyrata, suggesting a role for parallel evolution of gene expression in mating system evolution.

Identification of differentially expressed genes in regions with floral QTL provides an important resource for fine mapping of the genes, mutation(s) or single nucleotide polymorphisms responsible for adaptive floral evolution in Capsella. The gene MS1 (AT5G22260), under the pollen number QTL, is important for post-meiotic development of pollen and is down-regulated in C. rubella, consistent with reduced pollen production in this species (Chapter Two). In addition, another pollen number QTL includes GDSL (AT5G09440), which is up-regulated in C. rubella and is important for pollen coat formation and male fertility. Interestingly, under the self-incompatibility QTL, a putative kinase that is part of the Brassica S locus family, ARK2 (AT1G65800), is up-regulated in C. rubella. This could possibly explain in part the cryptic SI in our F2 mapping population (Chapter Two, data not shown), in which a number of F2 plants have a leaky self-incompatibility phenotype: early inflorescences are SI, while later ones set autogamous seed. Another F2 phenotype I identified in QTL mapping (Chapter Two) is a combination of shrunken floral organs and abnormal modified female and/or male parts fused to the petal. The differentially expressed candidate genes under these QTL (AT1G77300; AT4G36930) have mutations in Arabidopsis that show comparable phenotypes (Kim et al. 2005; Penfield et al. 2005; Grini et al. 2009). Perhaps surprisingly, the two differentially expressed genes found in the petal width QTL regions were up-regulated in C. rubella. However, they could be part of a pathway that acts upstream of other genes, such as the boundary-specific genes PETAL LOSS (PTL) or CUP-SHAPED COTYLEDONS1 and 2 (CUC1 and CUC2), that could play a role in petal size regulation.

Overall, evidence for an increase in the proportion of fixed differences relative to shared polymorphism is consistent with directional selection shaping floral evolution in Capsella, rather than simply relaxed selection on traits mediating pollinator attraction. This is also consistent with results from Chapter two using Sanger sequencing of far fewer genes. Our whole-genome polymorphism data will enable us to narrow down the targets of positive selection in QTL regions. Other regions showing increased shared polymorphism relative to unique polymorphism in C. grandiflora could possibly reflect some gene flow between species and/or retention of balanced polymorphism.

In conclusion, this study is the first to show genome-wide evidence for reduced efficacy of natural selection associated with the shift to selfing. Furthermore, our expression and population genomics analyses integrated with QTL work provide a foundation for understanding the genomic basis of mating system evolution.

Table 3.1. Sampling locations of C. grandiflora and C. rubella

Species          Assigned names   Location
C. grandiflora   Cg88.23          Monodendri, Greece
C. grandiflora   Cg103.02         East of Greece
C. grandiflora   Cg85.9           Greece (39°33.313N; 20°54.985E)
C. grandiflora   CgAxF            Greece (F1 cross)
C. grandiflora   Cg5a             Corfu, Greece
C. rubella       CrTaal           Taguemont, Algeria
C. rubella       Cr1337           Argentina
C. rubella       Cr1Gr1           Samos, Greece
C. rubella       Cr34.11          S. Luca, Italy
C. rubella       Cr81.02          Greece (37°17.731N; 22°03.631E)
C. rubella       Cr75.2           Greece (38°05.440N; 22°10.316E)

Table 3.2. Summary of the Capsella rubella genome assembly and annotation

Assembly         Number   N/L50          Size (Mb)   Percentage of the assembly
Contigs          9675     265/134.1 Kb   130.1       -
Scaffolds        853      4/15.1 Mb      134.8       96.1

Annotation
Protein coding   26521    -              33.1        24.5

Table 3.3. Summary of the initial ordering of de novo assembly of different contigs (e.g. super 0) using our high density genetic map and synteny (chapter two) for anchoring positions of those contigs and ordering into chromosomal segments.

Linkage group   Scaffold     Start position   End position   Left gene    Right gene
1               Super0       Start            8415233        AT1G01020    AT1G24100
                Super10      Start            End            AT1G24120    AT1G36980
                Centromere
                Super7       Start            2390185        AT1G41830    AT1G48270
                Super12      Start            End            AT1G48280    AT1G56200
2               Super9       End              Start          AT1G56210    AT1G64720
                Centromere
                Super6       End              Start          AT1G64790    AT1G80950
3               Super2       Start            End            AT3G01015    AT2G07050
                Centromere
                Super1       Start            4141427        AT2G10870    AT2G20400
                Super29      Start            End            AT2G20410    AT2G20900
4               Super11      Start            End            AT2G20920    AT2G26570
                Super8       End              2994099        AT2G26590    AT2G33630
                Super19      End              Start          AT2G33640    AT2G36270
                Super5       4873194          Start          AT2G36290    AT2G48160
5               Super17      Start            End            AT2G011060   AT2G04030
                Super0       8417344          End            AT3G26820    AT3G30390
                Centromere
                Super25      End              Start          AT3G42660    AT3G43740
                Super28      End              Start          AT3G43740    AT3G44670
                Centromere
                Super7       End              241786         AT3G44770    AT3G57020
                Super5       End              8048070        AT3G57030    AT3G61600
                Super23      End              Start          AT3G61610    AT3G53540
6               Super3       Start            End            AT5G01010    AT5G29000
                Centromere
                Super26      End              Start          AT5G31302    AT5G35142
                Centromere
                Super5       4874398          8043209        AT4G01990    AT4G12620
                Super22      End              Start          AT4G01970    AT4G00026
7               Super8       Start            2990235        AT4G40100    AT4G32670
                Super4       Start            End            AT4G32670    AT4G12640
                Centromere
                Super14      End              Start          AT5G32440    AT5G38300
                Super15      Start            End            AT5G38310    AT5G42110
8               Super13      End              Start          AT5G47800    AT5G43320
                Super16      Start            End            AT5G43310    AT5G42140
                Centromere
                Super18      Start            End            AT5G47820    AT5G50080
                Super1       4143246          End            AT5G50090    AT5G67640

Table 3.4. Genome-wide polymorphism summary statistics as well as Tajima’s D for nonsynonymous (Nonsyn) and synonymous (Syn) sites.

Species          Site     Mean π   Mean θ   Mean Tajima’s D
C. grandiflora   Syn      0.0164   0.0169   -0.204
                 Nonsyn   0.0019   0.0022   -0.369
C. rubella       Syn      0.0043   0.0041    0.421
                 Nonsyn   0.0007   0.0007   -0.174

Table 3.5. Down-regulated genes in Capsella rubella were analyzed for enrichment using DAVID bioinformatics (http://david.abcc.ncifcrf.gov/). A false discovery rate (FDR) of 0.1 was applied and only categories showing at least a six-fold enrichment score are reported.

Categories   Function                                                Observed number of genes/expected   False discovery rate (FDR)
1            Pollen wall assembly                                    6                                   4.04E-09
2            Cellular component assembly involved in morphogenesis   6                                   4.04E-09
3            Pollen exine formation                                  5                                   7.72E-08
4            Pollen development                                      7.6                                 3.89E-05
5            Gametophyte development                                 7.6                                 9.79E-04
6            Sporopollenin biosynthetic process                      2                                   0.1104

Table 3.6. Up-regulated genes in Capsella rubella were analyzed for enrichment using DAVID bioinformatics (http://david.abcc.ncifcrf.gov/). A false discovery rate (FDR) of 0.1 was applied and only categories showing at least a three-fold enrichment score are reported.

Categories   Function                             Observed number of genes/expected   False discovery rate (FDR)
1            Floral whorl development             7                                   5.20E-03
2            Post-embryonic organ development     8                                   5.20E-03
3            Flower development                   8                                   1.80E-02
4            Gynoecium development                5                                   1.50E-02
5            Floral organ development             6                                   2.80E-02
6            Post-embryonic development           13                                  5.40E-02
7            Carpel development                   4                                   8.50E-02
8            Reproductive structure development   11                                  1.10E-01

Table 3.7. Estimates of the distribution of fitness effects (DFE) of amino acid mutations that are unique to each species, for intervals of the strength of selection, Nes, and related demographic parameters.

                                                     Percentage of mutations in different Nes intervals
Species          N2/N1 (a)   t/N2 (b)   Beta (c)     0-1     1-10    >10
C. grandiflora   2.79        0.72       0.392        6.7     9.8     83.4
C. rubella       2.31        0.25       0.442        13.2    22.7    63.9

a Estimated demographic model: the ratio of present (N2) to ancestral (N1) population size.
b Timing of growth, scaled in units of N2.
c Estimated shape parameter of the gamma distribution.

Figure 3.1. Comparative mapping of the three genomes of A. lyrata, C. rubella and S. parvula, showing three rearrangements in highlighted squares that are specific to A. lyrata.

Figure 3.2. A sliding window of synonymous diversity in windows of 10 genes with a step size of 1 gene across the eight chromosomes, A (1-4) and B (5-8), in C. rubella (blue) compared to C. grandiflora (red). The x-axis shows genome positions.

Figure 3.3. A) Proportion of nonsynonymous relative to synonymous polymorphism across the different frequency classes (1-5) in both Capsella grandiflora (gray) and Capsella rubella (black) using an equivalent sub-sampling of six C. rubella and six C. grandiflora sequences. B) Site frequency spectra in C. grandiflora and C. rubella of nonsynonymous (black) and synonymous (gray) mutations. Neslia paniculata was used to infer the ancestral state for each polymorphism.

Figure 3.4. Plot of the normalized mean expression versus ln fold change for the difference in C. grandiflora compared to C. rubella. Red points mark the differentially expressed genes (higher or lower expression).

Figure 3.5. Pearson correlation of gene expression using four C. grandiflora and four C. rubella FPKM counts. Red shows high correlation (> 0.9), blue shows a correlation of 0.8-0.9 and yellow a correlation of 0.7-0.8. The panels below the diagonal show the correlation between pairs of samples (X, Y), while the panels above the diagonal show the same correlations with the axes reversed.

Figure 3.6. Box-and-whisker plot showing the median, the upper and lower quartiles and the extremes, comparing, in Capsella and Arabidopsis species, the genes that were identified as significantly up- and down-regulated in C. rubella compared to C. grandiflora.

Figure 3.7. Ratio of nonsynonymous to synonymous polymorphisms for unique, shared and fixed differences in C. grandiflora (Cg) and C. rubella (Cr) at the genome-wide level and at genes found to be significantly up- and down-regulated in C. rubella.

Figure 3.8. Sliding window of unique and shared polymorphism, and fixed differences in C. rubella and C. grandiflora in windows of 100 SNPs. Red circle shows a region of increased fixed differences relative to shared polymorphism surrounding the self-incompatibility locus.

Figure 3.9. Comparison of the number of fixed differences relative to shared polymorphisms in genome-wide data compared with narrow QTL regions as defined in Chapter two.

CHAPTER FOUR COMPARATIVE POPULATION GENOMICS IN TWO COLLINSIA SPECIES WITH CONTRASTING MATING SYSTEM

Summary

Highly selfing species experience reduced effective recombination rates and often more severe founder events, which can lead to reductions in both polymorphism and the efficacy of natural selection on slightly deleterious mutations. In this study, I use next-generation transcriptome sequencing coupled with population resequencing to test for changes in polymorphism, base composition and selection efficacy in the highly selfing angiosperm Collinsia rattanii compared with its mixed mating, more outcrossing sister species C. linearis. I assembled a total of 24,687 and 29,804 transcripts in C. linearis and C. rattanii respectively, and identified 14,106 orthologous sequences between the two species. Pairwise comparisons of orthologous genes indicate a reduction in GC content in C. rattanii. Since this pattern is found across all gene expression classes, it is most consistent with lower levels of GC-biased gene conversion associated with the transition to selfing. Polymorphism analysis using both Sanger resequencing of 17 genes and the transcriptome data confirms very recent species divergence coupled with a strong population bottleneck in C. rattanii. Analysis of shared and unique polymorphisms suggests an elevated ratio of unique nonsynonymous to synonymous polymorphism in C. rattanii, consistent with ongoing relaxation of selection.

Introduction

The shift in mating system from outcrossing to selfing is one of the most prevalent evolutionary transitions in flowering plants (Barrett et al. 1996; Takebayashi and Morrell 2001; Igic et al. 2008). In addition to major effects on mating patterns and floral morphology, mating system transitions are expected to affect the patterns of molecular polymorphism, molecular evolution and base composition in the genome (Charlesworth and Wright 2001). Compared to outcrossing, selfing decreases the effective population size by reducing the number of independent gametes sampled for reproduction (Pollak 1987; Nordborg 2000), reduces the effective rate of recombination due to limited heterozygosity and, as a consequence, increases linkage disequilibrium (LD) among loci. Increased LD can also cause a further reduction in effective population size due to selection at linked sites (selective sweeps, background selection, and weak Hill-Robertson interference: Charlesworth et al. 1993a; Nordborg 2000; McVean and Charlesworth 2000; Charlesworth and Wright 2001). In addition, demographic factors associated with selfing, such as frequent extinction-recolonization and founder events, can further reduce the effective population size, and diminish within-population and species-wide genetic diversity (Schoen and Brown 1991; Hamrick and Godt 1996; Baker 1955; Charlesworth and Pannell 2001; Ingvarsson 2002; Foxe et al. 2009; Guo et al. 2009; Ness et al. 2010; Busch et al. 2011). Limited gene flow via pollen should also increase population subdivision in selfers compared to outcrossers (Ingvarsson 2002). Finally, if selfing evolved recently via a severe founder event, this may lead to particularly extreme genome-wide reductions in effective population size (Foxe et al. 2009; Guo et al. 2009; Busch et al. 2011).

Because of the reduced effective population size, both genetic and demographic processes operating in selfing species should also reduce the efficacy of natural selection compared to outcrossing species. One consequence of this is that slightly deleterious mutations are more likely to be segregating and become fixed in selfing than in outcrossing species, which may increase the ratio of nonsynonymous to synonymous polymorphism and divergence (Glémin 2006). Furthermore, selection on advantageous mutations may be less efficient in selfers than in outcrossers (Slotte et al. 2010), although if advantageous mutations are predominantly recessive the predictions for beneficial mutations are less clear (Glémin 2007).

In addition to the expected effects on nonsynonymous polymorphism and divergence, selfing species are expected to experience reduced selection on synonymous codons (Marais et al. 2004; Wright et al. 2004; Qiu et al. 2011). In many species, natural selection favors the usage of the subset of codons that maximize the efficiency and/or accuracy of translation (the so-called “preferred” or “optimal” codons), leading to a bias in codon usage (Ikemura 1982; Duret and Mouchiroud 1999; Hershberg and Petrov 2008). The selectively favored codons tend to correspond to the most highly expressed tRNAs, and codon usage bias tends to be stronger in highly expressed genes (Ikemura 1985; Kanaya et al. 1999; Duret and Mouchiroud 1999; Wright et al. 2004; Sharp et al. 2005). As selection for synonymous codon usage is weaker than purifying selection against deleterious nonsynonymous mutations, we expect that the effects of selfing and effective population size may be stronger on the patterns of codon usage than on nonsynonymous sites. Therefore, reduced efficacy of selection can be inferred from an increase in the frequency of unpreferred codons or from a reduction in codon usage bias, comparing selfing with outcrossing species as well as highly with lowly expressed genes.

Another consequence of high homozygosity of selfing species is the rarity of GC-biased gene conversion (BGC) (Marais et al. 2004). BGC is a process that preferentially converts A/T into G/C at sites heterozygous for AT and GC (Marais 2003). The net effect of BGC is to increase the GC content of recombining DNA sequences and genomes (Glémin 2007; Glémin 2011). Because of high homozygosity, highly selfing taxa will experience a strong reduction in the opportunity for GC-biased gene conversion, and therefore reduced GC content (Marais et al. 2004; Wright et al. 2007).

There is clear evidence for important effects of selfing on nucleotide diversity, including in Leavenworthia (Liu et al. 1998; Busch et al. 2011), Arabidopsis (Savolainen et al. 2000; Wright et al. 2002), Capsella (Foxe et al. 2009; Guo et al. 2009), Eichhornia (Ness et al. 2010), Mimulus (Sweigart and Willis 2003), Lycopersicon (Baudry et al. 2001), and several grasses (Glémin et al. 2009). This reduced diversity is also marked in the nematode genus Caenorhabditis (Graustein et al. 2002; Cutter 2006). However, the general effect of mating systems on the strength and efficacy of selection has been less clear. In particular, most studies have found little to no effect of selfing on the strength of selection on nonsynonymous sites or on codon usage bias, suggesting that effective population sizes in most selfing lineages may be high enough to counteract the effects of reduced recombination rates (Wright et al. 2002; Foxe et al. 2008; Haudry et al. 2008; Escobar et al. 2010). However, most studies to date have focused primarily on patterns of substitution rates, and this may give a limited picture for several reasons. First, in some cases selfing may have evolved very recently relative to species divergence, meaning that the effects of selfing may be limited to a small proportion of overall divergence (Qiu et al. 2011). Second, the fitness effects of nonsynonymous mutations likely range from strongly deleterious, through neutral, to beneficial, which means that a simple comparison of substitution rates may not be very informative about the relaxation of selection. Because of this, studies that combine polymorphism and divergence across multiple genes will be important for quantifying the interaction between mating systems, demographic history and the strength of selection. Indeed, recent analyses of large-scale polymorphism and divergence patterns in Arabidopsis and Capsella (Slotte et al. 2010; Qiu et al. 2011) provide more evidence for changes in selection pressures across species on both codon usage bias and nonsynonymous mutations.

Next-generation sequencing should make such large-scale studies attainable across multiple replicate species pairs, even in species with little prior genomic information, allowing a more robust assessment of the role of mating system in the efficacy of selection (Ness et al. 2011).

The genus Collinsia (Plantaginaceae) has experienced numerous repeated shifts in flower size and selfing rate (Randle et al. 2009; Baldwin et al. 2011), making it an outstanding model group to investigate the causes and consequences of mating system evolution. Replicate shifts to selfing in Collinsia are thought to be caused primarily by selection for reproductive assurance through autonomous selfing, which depends on local population size (Kennedy and Elle 2008a), and local pollination environment (Kalisz and Vogler 2003). The presence of many independent species pairs with mating system shifts makes Collinsia an important target for genomic studies and for testing predictions about the evolution of selfing and its genomic consequences.

However, at present the availability of genomic resources for Collinsia is limited.

In this study, I investigate the effect of shifts in mating system on the pattern of sequence polymorphism, base composition and codon usage bias in two sister species of Collinsia which are thought to have diverged recently (Baldwin et al. 2011): C. linearis, a large-flowered, mixed mating species (outcrossing rate t = 0.57; Kalisz et al. 2012) and C. rattanii, a predominantly selfing, small-flowered species (t = 0.12; Kalisz et al. 2012). The geographic range of C. linearis comprises the Klamath River watershed, which runs through Oregon and northern California, whereas C. rattanii is also found in Northwestern California and the north Sierra Nevada. We use a combination of floral transcriptome sequencing of pooled population samples and Sanger sequencing of a broader population sample. I compare levels of polymorphism at synonymous and nonsynonymous sites, as well as base composition and the frequency of optimal codons in the two sister species. I find evidence of a rapid shift in diversity, the efficacy of natural selection and patterns of base composition evolution associated with the shift in the selfing rate.

Materials and Methods

De novo assembly

Total RNA was extracted from equivalent-staged mixed flower buds pooled across 12 individuals from a single population each of C. rattanii (M6; Table 4.1) and C. linearis (THR; Table 4.1) using the RNeasy Plant Mini kit (Qiagen) according to the manufacturer’s instructions. Paired-end (PE) cDNA libraries were generated for each species, and one lane of 108 bp PE Illumina GAII sequencing was run for each species. This generated a total of approximately 71 million reads (7.7 Gbp) per lane passing quality filter. Sequencing and library construction were conducted at the Genome Quebec Innovation Centre at McGill University. Sequence data have been deposited at the NCBI Trace Archive under accession numbers xxxxxxx-xxxxxx.

For de novo assembly, I trimmed low quality bases and reads from the raw sequence data using a sliding window approach. Moving in 5 bp windows from 5’ to 3’ in 1 bp increments, I trimmed the bases when the mean quality of a window dropped below a threshold (Q < 20). I assembled the transcriptome of each species using the program OASES v 0.1.22 (D. R. Zerbino, European Bioinformatics Institute), designed specifically as an extension of the program Velvet (Zerbino and Birney 2008) for de novo transcriptome assembly. To optimize the assembly parameters I used a combinatorial approach trying all possible combinations of multiple assembly parameters including: k-mer length (k = 31, 35, 39, 43, 47, 51, 55, 59 and 63), minimum number of times a k-mer has to be observed to be used in the assembly (cov_cutoff = 5, 10 or 15), minimum ratio between the numbers of observed and expected connecting read pairs (paired_cutoff from 0.1 – 0.9), and minimum number of times two contigs must be connected by reads or read pairs to be clustered together (min_pair_cov = 5, 10 or 20). From this, I calculated a number of summaries to guide our choice of the optimal assembly, considering both length based statistics (N50, mean contig length, longest contig, number of contigs, total assembly length) and similarity to known transcripts from Mimulus guttatus (fraction of contigs with significant BLAST hits, mean e-value of best BLAST hits, mean proportion of a contig aligned to its best BLAST hit, fraction of total assembly that is aligned in the best BLAST hit of each contig, number of M. guttatus homologs in our assembly and average proportion of identities to these homologs in aligned regions). In the end, I combined two assemblies for each species where k = 63, cov_cutoff = 10 with two different scaffolding parameter sets (paired_cutoff = 0.5, min_pair_cov = 10 and paired_cutoff = 0.9, min_pair_cov = 20). To avoid including contigs present in both assemblies I removed near-identical contigs from the concatenated assemblies (> 99% identical over 95% of the length of the shorter contig). This also helped reduce the probability that allelic variation was falsely represented as distinct loci. The final set of contigs from each species represents our reference transcriptome.
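The sliding-window quality trimming described above can be illustrated with a minimal Python sketch. It is not the script used for the thesis: the function names are hypothetical, and the exact cut point (here, the first base of the first window whose mean quality falls below the threshold) is an assumption, since the text does not specify it.

```python
def keep_length(quals, window=5, threshold=20):
    """Number of 5' bases retained after sliding-window quality trimming."""
    n = len(quals)
    if n < window:
        # Assumption: a read shorter than one window is kept only if its
        # overall mean quality reaches the threshold.
        return n if n and sum(quals) / n >= threshold else 0
    # Slide from 5' to 3' in 1 bp increments.
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            # Assumption: cut at the first base of the failing window.
            return start
    return n


def trim_read(seq, quals, window=5, threshold=20):
    """Return the trimmed sequence and its matching quality values."""
    keep = keep_length(quals, window, threshold)
    return seq[:keep], quals[:keep]
```

For example, a 15 bp read whose last five bases have quality 10 and whose first ten have quality 30 is cut once the window mean first drops below 20.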

Whole transcriptome sequence analysis

I used a reciprocal best BLAST to identify orthologous transcripts between C. linearis and C. rattanii. In this approach all transcripts from each species are compared to all transcripts of the other species. Only transcripts from each species that are best reciprocal BLAST hits are defined as orthologs. To avoid pairing paralogous transcripts, I required orthologs to be over 70% identical over 80% of their length.
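The reciprocal best BLAST logic can be sketched as follows. This is an illustrative Python outline rather than the pipeline actually used: the `Hit` record, the use of bit score to rank hits, and the way the identity and coverage filters are applied (to the forward hit of each reciprocal pair, with coverage already expressed as a fraction) are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record for one parsed BLAST hit.
Hit = namedtuple("Hit", "query subject bitscore identity coverage")


def best_hits(hits):
    """Map each query to its highest-scoring hit."""
    best = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, min_identity=70.0, min_coverage=0.8):
    """Return sorted (a, b) pairs that are each other's best hit and pass the filters."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    pairs = []
    for a, hit in ab.items():
        back = ba.get(hit.subject)
        if back is None or back.subject != a:
            continue  # not reciprocal
        if hit.identity > min_identity and hit.coverage >= min_coverage:
            pairs.append((a, hit.subject))
    return sorted(pairs)
```

Pairs that are reciprocal but fall below the identity or coverage thresholds are dropped, which mirrors the stated aim of avoiding paralogous pairings.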

Polymorphic sites from the set of 14,106 orthologous genes were randomly phased into two sequences and aligned using Prank (Löytynoja and Goldman 2005). In-frame orthologous genes for both species were generated using a BLASTx (Altschul et al. 1990) search against the non-redundant (nr) protein database, followed by an in-house Perl script to parse the assigned frame from the BLAST results for each transcript.

Single nucleotide polymorphism (SNP) analysis of the transcriptome data was conducted by read mapping of the whole transcriptome reads of both C. rattanii and C. linearis using the Burrows-Wheeler Aligner (BWA) (Li and Durbin 2009) to the C. rattanii in-frame orthologous transcripts as reference. As confirmation, I also conducted analysis using the C. linearis transcripts as reference, and the quantitative results were unaffected (data not shown). We used Samtools and bcftools (Li and Durbin 2009) to call SNPs and calculate genotype likelihoods. In-house Python scripts were used to process SNPs in the VCF file from the genotype likelihoods, using a cutoff of Q = 40 for the second-highest genotype likelihood and a depth cutoff of 20. SNPs were defined as shared between species or unique to one species based on the genotype likelihood calls, and classified as synonymous or nonsynonymous based on the inferred reading frame.
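As a rough illustration of the filtering and classification step, the sketch below shows one way such logic might look in Python. It is not the in-house script: the function names are hypothetical, the filter assumes Phred-scaled genotype likelihoods normalized so that the best genotype scores 0 (as in the VCF PL field), and the classification assumes that the set of alleles called in each species is already known for every site.

```python
def passes_filter(pl_values, depth, min_q=40, min_depth=20):
    """Keep a call if the second-best genotype is at least min_q worse and depth suffices.

    Assumption: pl_values are Phred-scaled likelihoods with the best genotype at 0.
    """
    if len(pl_values) < 2:
        return False
    second_best = sorted(pl_values)[1]
    return second_best >= min_q and depth >= min_depth


def classify_site(alleles_a, alleles_b):
    """Classify a site from the (non-empty) sets of alleles seen in species A and B."""
    a, b = set(alleles_a), set(alleles_b)
    poly_a, poly_b = len(a) > 1, len(b) > 1
    if poly_a and poly_b:
        # Shared only if the same two (or more) alleles segregate in both species.
        return "shared" if len(a & b) >= 2 else "other"
    if poly_a:
        return "unique_a"
    if poly_b:
        return "unique_b"
    return "fixed" if a != b else "monomorphic"
```

Synonymous versus nonsynonymous status would then be assigned separately from the inferred reading frame, which is outside the scope of this sketch.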

For each transcript, I estimated average expression levels in each species using the number of fragments mapped per kilobase of transcript per million reads mapped (FPKM) as estimated using the program Cuffdiff from the package Cufflinks v 0.83 (Trapnell et al. 2010).
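For reference, the basic definition behind the FPKM unit can be written as a short Python function. This is only the naive count-based definition; Cufflinks estimates abundances with a statistical model, so its reported values need not equal this simple ratio. The function name and arguments are illustrative.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
```

Under this definition, a 2 kb transcript with 500 mapped fragments in a library of 10 million mapped fragments has an FPKM of 25.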

I split the set of 14,106 1:1 orthologous genes assembled in C. linearis and C. rattanii according to gene expression into 30 classes of approximately 415 genes and at least 30,000 codons each. I used the program CodonW to estimate the frequency of optimal codons (Fop) (for determination of codon preferences, see Optimal Codons below), GC content and GC content at the third codon position (GC3) in each transcript in both species. In addition, I estimated nonsynonymous to synonymous substitution rates (Ka/Ks) for each gene. Because we did not have a close outgroup to polarize substitutions, I estimated pairwise substitution rates between the two Collinsia species. Substitution rates were estimated using the program CODEML implemented in the PAML 4.4 package (Yang 2007). I tested if Fop, GC, GC3, Ka and Ks are correlated with gene expression (Fop, GC and GC3 data arcsine-square-root transformed; gene expression transformed using ln[1+x]). I additionally tested for differences in Fop, GC and GC3 between the two species using non-parametric Mann-Whitney tests. The latter tests were performed using all genes, the lowest expressed genes (expression categories 1 and 2 in both species; 830 genes), and the highest expressed genes (expression categories 29 and 30 in both species; 832 genes).
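The per-transcript quantities used here (GC, GC3, Fop and the arcsine-square-root transformation) can be illustrated with a small Python sketch. It is not a reimplementation of CodonW: the function names are hypothetical, the set of optimal codons is supplied by the caller, and excluding Met, Trp and stop codons from the Fop denominator is an assumption about the usual convention.

```python
import math

# Codons with no synonymous alternative, plus stops (assumed excluded from Fop).
NON_INFORMATIVE = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def codons(cds):
    """Split an in-frame coding sequence into complete codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def gc_content(seq):
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc3(cds):
    """GC content at third codon positions."""
    cs = codons(cds)
    return sum(c[2] in "GC" for c in cs) / len(cs)


def fop(cds, optimal_codons):
    """Fraction of informative codons that are in the optimal set."""
    informative = [c for c in codons(cds) if c not in NON_INFORMATIVE]
    return sum(c in optimal_codons for c in informative) / len(informative)


def arcsine_sqrt(p):
    """Variance-stabilizing transform for a proportion in [0, 1]."""
    return math.asin(math.sqrt(p))
```

Expression values would be transformed separately with `math.log1p`, which computes ln(1 + x).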

In addition, in the alignments of the 14,106 orthologous genes of the two species, I counted the number of G and C bases among the variable sites (236,817 out of 13,333,608 sites). We tested for differences in GC content at variable sites between the two Collinsia species using ranked sign tests. Statistical analyses were performed with R 2.13.0 (R Development Group 2011).

Primer Design, DNA extraction and sequencing

Individuals were sampled from five C. rattanii and four C. linearis populations grown at the greenhouse at the University of Pittsburgh (Table 4.1). In each population, I sequenced 3-4 individuals, except for two populations where fewer individuals were available (CFR, AG-500).

Genomic DNA was extracted from leaf material using a Qiagen DNeasy plant kit.

Based on the orthologous genes generated from the de novo assembly of both species, and using BLASTX (Altschul et al. 1990) similarity against Mimulus guttatus, conserved exonic regions were identified by aligning Collinsia transcripts from both species to M. guttatus genomic regions. Only genes showing a single copy significant match to the M. guttatus genome were used. Seventeen primer pairs were designed to amplify 500-700 bp fragments (Appendix 4.1) using Primer3 (Rozen et al. 2000). A standard PCR reaction of 30 cycles with optimized annealing temperatures for each primer pair was used for all genes. Each cycle included 30 seconds of denaturing at 95°C, 40 seconds of annealing and 1 minute of extension at 72°C. Both forward and reverse strands of the amplicons were sequenced directly at McGill University (Genome Quebec Innovation Centre), using the same primers as for the amplification.

Sanger sequencing analysis

For the 17 loci, haplotypes were inferred using PHASE 2.1 (Stephens et al. 2001; Stephens and Donnelly 2003), as implemented in DnaSP v 4.50 (Rozas et al. 2003). Diversity statistics were calculated for synonymous and nonsynonymous sites using a modified version of Polymorphorama (Bachtrog and Andolfatto 2006) including θW (Watterson 1975), π (Tajima 1983) and Tajima’s D (Tajima 1989) for synonymous and nonsynonymous sites as well as pairwise divergence between the species for synonymous (Ks) and nonsynonymous sites (Ka). In addition, I estimated population structure (Fst) by calculating 1-πS/πT, where πS is the average pairwise nucleotide diversity within populations and πT is the total pairwise nucleotide diversity across populations. Using custom Perl scripts, I calculated the number of synonymous and nonsynonymous unique, and shared polymorphisms and fixed differences between the two species.
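The summary statistics named above follow standard definitions, which the following Python sketch illustrates on a list of aligned, equal-length haplotype strings. It is not Polymorphorama: the function names are hypothetical, values are per sequence rather than per site, no distinction is made between synonymous and nonsynonymous sites, and pooling all haplotypes to obtain πT is an assumption about how the total diversity is computed.

```python
import math
from itertools import combinations


def pairwise_diffs(a, b):
    return sum(x != y for x, y in zip(a, b))


def nucleotide_diversity(haps):
    """pi: mean number of pairwise differences (needs at least two haplotypes)."""
    pairs = list(combinations(haps, 2))
    return sum(pairwise_diffs(a, b) for a, b in pairs) / len(pairs)


def segregating_sites(haps):
    return sum(len(set(column)) > 1 for column in zip(*haps))


def watterson_theta(haps):
    """theta_W = S / a1, with a1 the (n-1)th harmonic number."""
    a1 = sum(1 / i for i in range(1, len(haps)))
    return segregating_sites(haps) / a1


def tajimas_d(haps):
    """Tajima's (1989) D; returns NaN when there are no segregating sites."""
    n, s = len(haps), segregating_sites(haps)
    if s == 0:
        return float("nan")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (nucleotide_diversity(haps) - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


def fst(populations):
    """Fst = 1 - piS / piT, with piT taken from the pooled sample (assumption)."""
    pi_s = sum(nucleotide_diversity(p) for p in populations) / len(populations)
    pi_t = nucleotide_diversity([h for p in populations for h in p])
    return float("nan") if pi_t == 0 else 1 - pi_s / pi_t
```

With two populations fixed for different haplotypes, πS is zero and the estimate is 1, the upper bound of this statistic.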

I estimated intralocus recombination and linkage disequilibrium (LD) with the LDhat software (McVean et al. 2004) implemented as a composite likelihood method (Hudson 2001) for pairs of polymorphic informative sites, and calculated the correlation between pairwise linkage disequilibrium (r²) and physical distance between sites (d). I also estimated the population recombination parameter (ρ = 4Ner) where Ne is the effective population size, and r is the recombination rate expressed as the expected crossover events per generation per base pair between adjacent SNPs. A likelihood permutation test was performed for each ρ estimate and the corresponding maximum likelihood was used to test for significant evidence of recombination. Using LDhat, I calculated the minimum number of recombination events (Hudson and Kaplan 1985). The effect of selfing on linkage disequilibrium was assessed using the ratio of the population recombination parameter ρ to the population mutation parameter θ (i.e., ρ/θ) (where ρ = 4Ner and θ = 4Neµ, where µ is the rate of mutation) (Nordborg 2000). Using the libsequence analysis package (Thornton 2003), LD values were calculated between pairs of phased sites using the squared allele frequency correlation measure r². We also used an R script (LDit.r) written by J. Ross-Ibarra (http://www.rilab.org/code/files/LDit.html) that uses equation 1 from Remington et al. (2001) to estimate 4Ner via a nonlinear regression and plot the decay of LD over distance.
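The r² measure between two phased sites follows directly from haplotype and allele frequencies. The sketch below is an illustrative Python version, not the libsequence implementation; the function name is hypothetical, and it assumes both sites are biallelic and polymorphic, taking the first observed allele at each site as the reference.

```python
def r_squared(site_a, site_b):
    """Squared allele-frequency correlation between two phased biallelic sites.

    site_a and site_b hold the allele carried by each haplotype, in the same order.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must cover the same non-empty set of haplotypes")
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denom
```

Values computed this way for all site pairs within a locus, paired with the physical distance between the sites, are the input for examining how LD decays with distance.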

The 17 genes described above were concatenated into a supermatrix 10,158 bp long to visualize the genetic similarity of populations. Tree diagrams were inferred from this alignment using MrBayes 3.1.2 (Huelsenbeck and Ronquist 2001) with the following parameters: Dirichlet priors (1,1,1,1) for base frequencies and (1,1,1,1,1) for General Time Reversible (GTR) parameters scaled to the G-T rate, a uniform (0.05,50) and (0,1) priors for the gamma (Γ) shape and the proportion of invariable sites (I), and an exponential (10.0) prior for branch lengths.

Analyses were performed on the supermatrix using the partition of loci. Base frequencies, substitution matrix parameters, Γ and I were thus independently estimated for each partition. The tree space was explored using Metropolis-coupled Markov Chain Monte Carlo (MCMCMC) analyses with random starting trees, five simultaneous, sequentially heated independent chains sampled every 1,000 iterations during 50 million iterations. Suboptimal trees were discarded once the burn-in phase was identified (five million iterations) and a majority-rule consensus tree was constructed with the remaining trees. Since this analysis includes within-species polymorphism data, the tree should be considered an exploratory tool for assessing relatedness among individuals, populations and species rather than a phylogenetic hypothesis.

Optimal codons

To analyse selection on codon usage in our Collinsia datasets, we needed to define the set of optimal codons. To do this, I obtained information on tRNA abundance from M. guttatus, and assumed that optimal codons are the same in Collinsia. This assumption is reasonable since optimal codons are mostly conserved among (Kawabe and Miyashita 2003; Wright et al. 2004; Ingvarsson 2008; Qiu et al. 2010). I obtained data on tRNA gene abundance in the nuclear chromosomes (scaffolds 1 to 14) from the complete genome assembly of M. guttatus (ftp://ftp.jgipsf.org/pub/JGI_data/phytozome/v7.0/Mguttatus/assembly/) using the program tRNAscan-SE 1.23 (Lowe and Eddy 1997). tRNA gene abundance has been shown to correlate with the corresponding tRNA expression level in prokaryotes and eukaryotes (Kanaya et al. 2001), and so I used it as a proxy of the relative abundance of tRNAs in the genome, as previously done in Arabidopsis (Wright et al. 2004). tRNA relative abundance served to define the set of optimal codons in M. guttatus, after taking into account the revised wobble rules for eukaryotic genomes (Percudani 2001). The revised wobble rules assume that GNN tRNAs pair with both C-ending and U-ending codons, whereas ANN tRNAs decode both U-ending and G-ending codons.

Results

De novo assembly

For C. linearis and C. rattanii I assembled 24,687 and 29,804 transcripts respectively (Table 4.2). Each of these assemblies totaled approximately 33 Mbp of sequence. The mean contig length was 1,084 bp and 1,359 bp for C. linearis and C. rattanii respectively (Figure 4.1). The vast majority of transcripts had similarity to known proteins (e-value < 1e-10) from the plant Uniprot database (C. linearis 80.6% and C. rattanii 84.3%; Table 4.2). In addition, many of the homologous proteins identified with this comparison were alignable over the full length of the protein (Figure 4.2). The median fraction of best BLASTx hits aligned to transcripts was 76.1% in C. linearis and 89.8% in C. rattanii. I also identified 14,106 1:1 orthologs when I compared the two transcriptome assemblies using a reciprocal best BLAST approach. Thus, overall our approach was highly successful in de novo assembly of thousands of transcripts that show clear similarity to genes in other species across a large proportion of their length.

Multigenic tree

The majority-rule consensus tree diagram inferred with the 17 genes for the C. linearis and C. rattanii populations is shown in Figure 4.3. This tree confirmed the existence of two strongly supported clades (posterior probability PP = 1.00): one clade grouped all populations of C. rattanii with the CFR population of C. linearis and a second clade grouped the populations of C. linearis except CFR. Notably, the terminal branches in the C. rattanii clade were shorter than in the C. linearis clade, consistent with fewer segregating sites and reduced nucleotide diversity in the selfing C. rattanii.

The tree shown in Figure 4.3 strongly suggested that individuals from the CFR population were more closely related to those from C. rattanii than C. linearis. For analysis of species-wide diversity below, we therefore excluded the CFR population from summary statistics for C. linearis.

Genetic diversity

A total of 10,158 bp of aligned coding sequences was produced from 17 loci, an average of 536 bp per locus. The three populations of C. linearis yielded 197 segregating sites and an average nucleotide diversity (θ) of 0.019 ± 0.0033 for synonymous sites and 0.002 ± 0.0007 for nonsynonymous sites. In contrast, the five populations of C. rattanii yielded 73 segregating sites and an average θ of 0.0069 ± 0.001 for synonymous sites and 0.0008 ± 0.001 for nonsynonymous sites. Average Tajima’s D at synonymous sites in C. linearis was close to zero (-0.06 ± 0.2) whereas for C. rattanii it was positive (0.79 ± 0.2), consistent with recent population bottlenecks and/or the effects of pooled samples in a structured population (Wakeley and Lessard 2003; Städler et al. 2009).

Within-population variation in C. linearis was generally high and comparable to species-wide diversity, with the exception of the divergent population CFR, which showed relatively limited synonymous diversity (Figure 4.4). In C. rattanii, within-population variation was found for some populations, while others were nearly or, in the case of M3, completely devoid of variation (Figure 4.4). Average between-population differentiation, Fst, was 0.14 ± 0.02 for C. linearis, while it was much higher at 0.57 ± 0.1 in C. rattanii.

Divergence between the species was low, with average Ks = 0.03 ± 0.004 for the Sanger-sequenced loci and Ks = 0.0546 ± 0.001 for the transcriptome data as a whole. At nonsynonymous sites, the substitution rate Ka was 0.003 ± 0.001 for the Sanger-sequenced loci and 0.0104 ± 0.0003 for the transcriptome data. This divergence level was only slightly greater than the amount of diversity in C. linearis, even within a single population, consistent with a very recent origin of C. rattanii. Furthermore, there were very few fixed differences between species (Table 4.3), and the proportion of shared polymorphism was low, suggestive of a severe, recent founder event in C. rattanii. Consistent with the network tree, CFR had no fixed differences with C. rattanii, but showed fixed differences with other C. linearis populations (Table 4.3). Looking within a single population pair (C. linearis THR, C. rattanii M3), the proportion of fixed differences increased, as expected in a local population (Table 4.3). Analysis of the whole-transcriptome Illumina data confirmed this pattern genome-wide, and revealed a very low but detectable level of standing variation within the M3 population (Table 4.3).
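The partition of variable sites into unique polymorphisms, shared polymorphisms and fixed differences can be sketched as follows. This is a simplified illustration, assuming aligned, gap-free sequences and counting a site as shared whenever both samples are polymorphic at it; the function name and the toy sequences are hypothetical.

```python
def classify_sites(pop1, pop2):
    """Count unique, shared and fixed sites between two aligned samples."""
    counts = {"unique_1": 0, "unique_2": 0, "shared": 0, "fixed": 0}
    for col1, col2 in zip(zip(*pop1), zip(*pop2)):
        alleles1, alleles2 = set(col1), set(col2)
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            counts["shared"] += 1      # polymorphic in both samples
        elif poly1:
            counts["unique_1"] += 1    # polymorphic only in sample 1
        elif poly2:
            counts["unique_2"] += 1    # polymorphic only in sample 2
        elif alleles1 != alleles2:
            counts["fixed"] += 1       # monomorphic in both, but different
    return counts


# Hypothetical toy samples (not data from Table 4.3).
sample_1 = ["AAGA", "AAGT"]
sample_2 = ["ATCA", "ACCT"]
print(classify_sites(sample_1, sample_2))
```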

Recombination and the decay of linkage disequilibrium

For the 17 loci, the composite likelihood estimates of ρ and the lower bound on the number of recombination events (Rmin), as well as the correlation of r² with distance, are given in Appendix 4.3. On average, recombination rates were higher in C. linearis than in C. rattanii; Rmin varied between 0 and 3 in C. linearis compared to 0 and 1 in C. rattanii, and the average ρ estimates were higher in C. linearis (7×10⁻³/bp) than in C. rattanii (0.4×10⁻³/bp).

The ratio of ρ/θ was reduced in C. rattanii (0.1) compared to C. linearis (0.9), consistent with expectations from the mating system transition from outcrossing to selfing (Appendix 4.3). This is comparable to the estimates of ρ/θ of 0.05 in the highly selfing species A. thaliana (Nordborg et al. 2005). In the outcrossing C. linearis, LD decayed rapidly over less than 1 kbp, whereas patterns of LD in the more selfing C. rattanii decayed more slowly and suggest LD could extend to several kb (Figure 4.6).
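For reference, the pairwise linkage disequilibrium statistic r² underlying such decay curves can be computed from phased haplotypes as in the sketch below. It assumes two biallelic sites that are both polymorphic in the sample; the haplotypes and the function name are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """r^2 between biallelic sites i and j from phased haplotypes."""
    n = len(haplotypes)
    # Use the alleles of the first haplotype as the reference alleles.
    ref_i, ref_j = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == ref_i for h in haplotypes) / n
    p_b = sum(h[j] == ref_j for h in haplotypes) / n
    p_ab = sum(h[i] == ref_i and h[j] == ref_j for h in haplotypes) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical haplotypes: complete association, then no association.
print(r_squared(["AC", "AC", "GT", "GT"], 0, 1))
print(r_squared(["AC", "AT", "GC", "GT"], 0, 1))
```

Plotting r² for all site pairs against their physical distance gives the kind of decay pattern summarized in Figure 4.6.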

Efficacy of selection

The mean Tajima’s D across the 17 genes in C. linearis was more negative at nonsynonymous (-0.56 ± 0.1) than synonymous sites (-0.06 ± 0.2), consistent with the action of purifying selection. This was also apparent in C. rattanii, where average Tajima’s D was less positive at nonsynonymous (0.31 ± 0.2) than synonymous sites (0.79 ± 0.2). For the 17 genes, the ratio of nonsynonymous to synonymous polymorphism at unique sites was higher in C. rattanii (0.51) than C. linearis (0.35), although this difference was not significant (2×2 contingency table, P > 0.05). To evaluate this trend further, I examined SNPs in the transcriptome data of pooled within-population samples. From the Illumina data, the ratio of nonsynonymous to synonymous unique polymorphisms was significantly elevated in C. rattanii (0.79) compared to C. linearis (0.37; 2×2 contingency table, P < 0.0001), consistent with ongoing relaxation of selection in the more selfing species.
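A 2×2 contingency comparison of this kind can be sketched as below. The example assumes a Pearson chi-square test without continuity correction, which is only one way to analyse a 2×2 table and is not necessarily the test used here; the SNP counts are hypothetical values chosen to give ratios of 0.79 and 0.37, not the counts underlying the reported results.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: rows are species, columns are
# (nonsynonymous, synonymous) unique polymorphisms.
chi2, p = chi_square_2x2(79, 100, 370, 1000)
print(chi2, p)
```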

Base composition and molecular evolution

The set of optimal codons in M. guttatus is shown in Appendix 4.2. According to this set, the frequency of optimal codons (Fop) in Collinsia significantly increased with gene expression (Spearman’s ρ = 0.18, P < 0.0001). Moreover, this increase was not linear but exponential; optimal codons were overrepresented in the highest-expressed genes (Figure 4.5). Gene expression was also significantly correlated with GC content (Spearman’s ρ = 0.17, P < 0.0001), whereas the correlation with GC3 was significant but negligible (Spearman’s ρ = 0.03, P < 0.0001).
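The frequency of optimal codons for a single coding sequence can be computed along the lines of the sketch below. It assumes Fop is the number of optimal codons divided by the number of codons that have synonymous alternatives, so Met, Trp and stop codons are ignored; the optimal codon set shown is hypothetical and is not the M. guttatus set of Appendix 4.2.

```python
def fop(cds, optimal_codons, ignored=("ATG", "TGG", "TAA", "TAG", "TGA")):
    """Frequency of optimal codons in an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    informative = [c for c in codons if c not in ignored]
    if not informative:
        return float("nan")
    return sum(c in optimal_codons for c in informative) / len(informative)


# Hypothetical optimal codon set and toy sequence.
toy_optimal = {"GCC", "CTC"}
print(fop("ATGGCCGCTCTCTAA", toy_optimal))
```

Computing this value per transcript and ranking transcripts by expression level would then allow a rank correlation such as the Spearman coefficient reported above.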

To examine evidence for shifts in base composition and codon usage, we compared orthologous codons where synonymous sites differed between species in either their base composition (GC->AT changes) or codon preferences. Interestingly, we identified a significant reduction in GC at variable sites in C. rattanii compared to C. linearis, and this was apparent for all genes, as well as for highly and lowly expressed genes. The number of preferred codons for all genes was significantly higher in C. rattanii compared to C. linearis, but this was not significant for the subsets of low and high expression (Table 4.4).

Discussion

In this study, I demonstrate the utility of combining de novo assembly of the transcriptome and Sanger sequencing in a non-model genus of angiosperms, Collinsia, to investigate sequence polymorphism, efficacy of selection, and base composition in species with contrasting mating systems. I generated approximately 71 million sequence reads for each species and report non-redundant sets of 24,678 (C. rattanii) and 29,804 (C. linearis) transcripts (Table 4.2), each representing about 33 Mbp of sequence from the Collinsia genome. Our approach yielded a large set of orthologous genes for comparative molecular evolution and population genomics.

Population genetic theory predicts a two-fold reduction in levels of neutral variation between outcrossing and complete self-fertilization. Accordingly, the decrease of neutral diversity is predicted to be less pronounced in partially self-fertilizing species (Pollak 1987; Nordborg and Donnelly 1997; Charlesworth 2003), such as those studied here. Yet, in our study, we see a stronger than expected decrease in diversity in C. rattanii compared to C. linearis (synonymous π = 0.008 in C. rattanii and 0.019 in C. linearis), even though C. rattanii is not completely self-fertilizing and C. linearis is partially selfing. This greater than twofold reduction in diversity has also been reported in other species, such as the self-fertilizing A. thaliana compared to its primarily outcrossing relative A. lyrata (Savolainen et al. 2000; Wright et al. 2003), in Mimulus (Sweigart and Willis 2003) and in Capsella (Foxe et al. 2009). Thus, in agreement with previous results in angiosperms, our study suggests that mating system alone cannot explain the loss of diversity observed in C. rattanii and that other factors, likely including a founding bottleneck during speciation, have contributed. The extreme loss of within-population diversity across the 17 loci in three populations of C. rattanii (SM, FH7 and M3) suggests severe founder events in these populations and/or hitchhiking effects (Anderson et al. 2012; Cutter and Payseur 2003).
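The expectation referred to above can be made concrete with a short worked example. It assumes the standard neutral result that the inbreeding coefficient at equilibrium is F = s/(2 - s) for selfing rate s, and that effective population size scales as 1/(1 + F) relative to an outcrosser of the same census size. The selfing rate of 0.85 is the value quoted for C. rattanii in this chapter; treating C. linearis as fully outcrossing is a simplifying assumption.

```python
def relative_ne(selfing_rate):
    """Expected Ne relative to a fully outcrossing population of equal size."""
    f = selfing_rate / (2.0 - selfing_rate)  # equilibrium inbreeding coefficient
    return 1.0 / (1.0 + f)


# Expected ratio of neutral diversity (selfer / outcrosser) from selfing alone.
expected_complete = relative_ne(1.0)   # complete selfing
expected_partial = relative_ne(0.85)   # selfing rate quoted for C. rattanii

# Observed ratio of synonymous diversity reported in the text.
observed = 0.008 / 0.019

print(expected_complete, expected_partial, observed)
```

Under these assumptions the observed ratio falls below the value expected from partial selfing alone, in line with the argument that additional factors such as a bottleneck are needed to explain the loss of diversity.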

The shift in mating system and reduction in diversity are expected to result in increased differentiation, consistent with our results (Fst = 0.80 in C. rattanii; Fst = 0.14 in C. linearis). A positive average Tajima’s D, as observed in our data, is consistent with recent population bottlenecks, likely associated with the speciation event, playing an important role in reducing effective population size in C. rattanii. However, other factors such as the pooled sampling scheme in strongly subdivided populations likely also contribute to the positive Tajima’s D (Wakeley and Lessard 2003; Städler et al. 2009).

Population AG retains most of the within-population variation observed in C. rattanii (Figure 4.4). Interestingly, this population is sympatric with C. linearis. It is possible that gene flow between the species is ongoing (or occurred historically); alternatively, C. rattanii may have originated in this region, with subsequent colonization bottlenecks reducing diversity in other populations. Larger-scale genome-wide analysis of polymorphism levels will be important to distinguish the effects of retention of ancestral polymorphism vs. ongoing gene flow. For accurate identification of introgression events genome-wide, it will also be important to order transcripts along chromosomes using genetic mapping and/or de novo genome sequencing approaches.

Conversely, one C. linearis population (CFR) shows relatively low within-population diversity compared to other C. linearis populations. Interestingly, we found no fixed differences between CFR and C. rattanii, suggesting a very recent divergence or contemporary gene flow.

This population has been identified as C. linearis based on its flower morphology. However, the tree presented in Figure 4.3 suggests that population CFR clusters more closely with C. rattanii, consistent with results from Baldwin et al. (2011) showing that C. linearis populations north of the Klamath River are more closely related to C. rattanii than either is to other C. linearis populations. This suggests three possibilities. First, C. rattanii may have arisen from the northern populations of C. linearis, possibly following a population bottleneck in this region. Under this scenario, a bottleneck may have predated the evolution of selfing, which could have reduced inbreeding depression, promoting the shift to selfing, as has been suggested for the evolution of selfing in Arabidopsis lyrata (Foxe et al. 2010). Alternatively, it is possible that this population of C. linearis has independently evolved a large flower morphology, reverting to higher rates of outcrossing (Baldwin et al. 2011). Finally, introgression from C. rattanii into northern C. linearis populations could be affecting inferences about evolutionary history; however, all 17 of our genes independently confirm the same topology (data not shown), making this possibility unlikely. It will be important to evaluate this further using large numbers of samples and genome-wide data for all three lineages, to disentangle the origins of C. rattanii, better understand the extent of introgression, and evaluate the possibility of reversion to outcrossing.

Effective population size, outcrossing rate, mating system and demographic history all play a role in shaping the effective rate of recombination (ρ). Collinsia linearis has higher estimates of ρ compared to C. rattanii, consistent with its higher outcrossing rates and higher effective population size. Although C. rattanii selfs at a high rate (85%), which decreases recombination, high ρ estimates for some genes could be attributed to the recent transition to selfing (Lin et al. 2002). If selfing evolved recently, then crossover events that occurred before the mating-system transition took place would still be detected in our C. rattanii dataset (Morrell et al. 2005).

Selfing is predicted to reduce both recombination and genetic diversity, but the effect on recombination is expected to be more pronounced (Nordborg 2000). Our data show an 18.3-fold reduction in the average ρ in C. rattanii (0.0003) compared to C. linearis (0.007) and a 2.4-fold reduction in θ, consistent with the expectation that linkage disequilibrium will be more affected than diversity. Consistent with the above, our results show faster decay of LD in the outcrossing, more recombining C. linearis compared to the selfing C. rattanii. The estimate of ρ/θ in A. thaliana is around 0.05, comparable to that in C. rattanii (0.1). However, species-wide estimates of ρ/θ from the outcrossing species Drosophila melanogaster and Zea mays are approximately 1.0 and 1.5, respectively (Balakirev and Ayala 2004b; Wright et al. 2005). These values are comparable to our estimate in C. linearis (0.9).

Our whole transcriptome and Sanger data both show an excess of nonsynonymous to synonymous unique polymorphism in C. rattanii compared to C. linearis. This may be due to a reduced efficacy of selection in eliminating slightly deleterious mutations in the more selfing species. The observation of a significant excess of nonsynonymous polymorphisms that are unique to C. rattanii highlights that relaxed selection is an ongoing process, and not simply associated with a founder event during speciation.

Our analyses of codon usage showed a positive and significant correlation of the frequency of optimal codons (Fop) with gene expression in both C. linearis and C. rattanii. Preferred codons are likely to be preferentially used in highly expressed genes to increase translational efficiency and/or accuracy (Kanaya et al. 2001). However, I did not find any significant difference in Fop across gene expression categories between the two species. If selection were less efficient due to the shift to selfing, we would have expected an increase in unpreferred codons in the more selfing species C. rattanii compared with C. linearis. We would also expect that, at synonymous sites that differ between the species, C. linearis would have more AT->GC changes compared with GC->AT changes relative to C. rattanii. While we see the expected shift in base composition, C. rattanii actually shows a higher preferred codon usage, but this is not associated with more highly expressed genes. Taken together, I find evidence for a loss of biased gene conversion but not relaxed selection on codon usage in C. rattanii (Table 4.4). The lack of difference in the use of optimal codons is consistent with the conclusions of Wright et al. (2007), who found no difference between A. thaliana and A. lyrata. However, subsequent analyses of polymorphism data by Qiu et al. (2011) did indicate relaxed selection on codon usage in A. thaliana compared with A. lyrata. It is possible that more detailed analysis of large-scale polymorphism patterns in a likelihood framework will provide a more powerful approach for testing for relaxed selection on codon bias.

Table 4.1. Population samples used for Collinsia rattanii and Collinsia linearis

Species        Population code   County and State   Number of individuals
C. linearis    BURR              Trinity Co., CA    3
C. linearis    BAIR              Humboldt Co., CA   3
C. linearis    CFR               Jackson Co., OR    2
C. linearis    THR               Humboldt Co., CA   4
C. rattanii    SM                Glenn Co., CA      2
C. rattanii    M3                Glenn Co., CA      5
C. rattanii    AG-500*           Jackson Co., OR    2
C. rattanii    FH7               Glenn Co., CA      3

CA, California; OR, Oregon
* Sympatric with C. linearis

Table 4.2. Summary statistics for de novo assembly of Collinsia rattanii and Collinsia linearis transcriptomes

Species        N50 (bp)   Mean contig length (bp)   Total assembly size (bp)   Number of contigs   Contigs with BLASTx hits (1)
C. linearis    1,555      1,084                     32,297,895                 29,804              80.6%
C. rattanii    1,777      1,359                     33,536,604                 24,678              84.3%

(1) Percentage of contigs with BLAST hits with an e-value < 1e-10
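For readers unfamiliar with the N50 statistic in Table 4.2, the sketch below shows one common way to compute it: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. The contig lengths are hypothetical, and this is not the assembler's own implementation.

```python
def n50(lengths):
    """Contig length at which half of the total assembly size is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths in bp.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs), sum(toy_contigs) / len(toy_contigs))
```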

Table 4.3. Pairwise comparisons of synonymous polymorphisms (unique, shared and fixed differences) between Collinsia linearis and Collinsia rattanii. Note that the CFR population is treated as a separate taxon (see Figure 4.3).

Populations                Unique, Population 1   Unique, Population 2   Shared       Fixed differences
C. rattanii/C. linearis    41 (23.4%)             116 (66.2%)            6 (3.4%)     12 (6.8%)
CFR/C. rattanii            10 (17.5%)             38 (66.6%)             9 (15.7%)    0
CFR/C. linearis            15 (8.8%)              128 (75.7%)            4 (2.3%)     22 (13%)
M3/THR                     0                      77 (59.6%)             0            52 (40.3%)
M3/THR (transcriptome)     1704 (1.5%)            51747 (46.6%)          854 (0.7%)   56661 (51%)

Table 4.4. Summary of pairwise comparison of orthologues at synonymous sites that differ between C. linearis and C. rattanii for codon preference and GC bias.

Pairwise comparison   All genes    Low expressed genes   High expressed genes
Codon preference
  C. linearis         34583***     635                   3680
  C. rattanii         30639        719                   3484
GC bias
  C. linearis         231448***    11983***              19018***
  C. rattanii         218159       9378                  17050

*Significance (p<0.05)

Figure 4.1. Contig length distribution of de novo assemblies of Collinsia linearis and Collinsia rattanii. The distribution is represented as a proportion of the total number of contigs. All contigs over 5000bp are represented in a single category.

Figure 4.2. Fraction of each of the assembled contigs that is aligned to its top BLASTx hit from the plant Uniprot database against the proportion of that top hit which is aligned to the query sequence. This plot shows the size of our transcripts relative to the closest known homolog in other plant species. Dense clusters of points along the top of the figure represent transcripts fully covering their best BLAST protein hit and points along the right are entirely aligned to their respective protein hits. (A) Collinsia linearis (B) Collinsia rattanii.

Figure 4.3. Majority-rule consensus tree inferred from the concatenation of 17 genes (10,158 bp). Population codes are given at the tips of the tree and posterior probabilities at the nodes. Branch lengths are given in units of per-site substitution rate. Note that because this tree includes within-species polymorphism data, the tree should be considered an exploratory tool of relatedness among individuals, populations and species rather than a phylogenetic hypothesis.

Figure 4.4. Barplots of the average Watterson’s synonymous diversity (θW) using the 17 loci sequenced in Collinsia rattanii and Collinsia linearis. Species-wide and individual population samples are shown for each species. Error bars show the mean ± SE (standard error).

Figure 4.5. Frequency of optimal codons (Fop) as a function of gene expression in Collinsia linearis (A) and Collinsia rattanii (B). Error bars show the mean ± SE (standard error).

Figure 4.6. Decay of linkage disequilibrium (r²) with physical distance for the combined 17 loci in Collinsia linearis and Collinsia rattanii. The solid curve indicates the least-squares regression fit of r² with distance.

CONCLUSION

In this thesis, I investigated the evolutionary genetics of mating system evolution in Capsella and Collinsia, and more specifically the genetic basis of floral evolution and the genomic consequences of the mating system shift from outcrossing to selfing. The new approach of combining population genomic data with quantitative trait locus (QTL) mapping enabled us to narrow down the targets of adaptive floral evolution. In addition, with whole-genome data, it was possible to robustly infer patterns of selection across the genome in selfers and outcrossers. Our population genomic results show that the selfing species C. rubella, which evolved recently from its outcrossing progenitor C. grandiflora and experienced a severe population bottleneck, shows the same expected genomic consequence, relaxation of selection, as the selfing C. rattanii, which is shown to have diverged recently from the outcrossing C. linearis and whose origin may also be associated with a severe bottleneck. This explains the loss of adaptation in the outcrossing species, consistent with the geographic restriction of both C. grandiflora and C. linearis (Foxe et al. 2009; Randle et al. 2010), and the convergence of phenotypes toward the selfing form.

We showed the importance of integrating population genomic data with mapping experiments to identify candidate regions that control traits associated with the shift to selfing and that are under positive directional selection. We also identified candidate genes that are differentially expressed in the selfing species and showed conservation of up- and down-regulated gene expression between the selfing and outcrossing Capsella and Arabidopsis species, alluding to parallel evolution. Our novel approach to answering these questions in Capsella builds a foundation for future work to test the relative importance of positive selection in the recently evolved C. rattanii, and in other similar systems, to quantify the changes in floral evolution as well as resource allocation, and to characterize their genetic basis and consequences. I hope that the population genomic and mapping studies in my thesis will lead to more insights into the relative roles of drift and selection in shaping transitions in mating systems, as well as into the associated genomic consequences.

LITERATURE CITED

Abramoff, MD, PJ Magelhaes, SJ Ram. 2004. Image processing with ImageJ. Biophotonics International 11:36-42. Aigner, PA. 2004. Floral specialization without trade-offs: Optimal corolla flare in contrasting pollination environments. Ecology 85:2560-2569. Alexandersson, R, SD Johnson. 2002. Pollinator-mediated selection on flower-tube length in a hawkmoth-pollinated Gladiolus (Iridaceae). Proceedings of the Royal Society of London Series B-Biological Sciences 269:631-636. Altschul, SF, W Gish, W Miller, EW Myers, DJ Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403-410. Andolfatto, P. 2007. Hitchhiking effects of recurrent beneficial amino acid substitutions in the Drosophila melanogaster genome. Genome Research 17:1755-1762. Andolfatto, P, D Davison, D Erezyilmaz, TT Hu, J Mast, T Sunayama-Morita, DL Stern. 2011. Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Research 21:610-617. Arends, D, P Prins, RC Jansen, KW Broman. 2010. R/qtl: high-throughput multiple QTL mapping. Bioinformatics 26:2990-2992. Armbruster, WS, CP Mulder, BG Baldwin, S Kalisz, B Wessa, H Nute. 2002. Comparative analysis of late floral development and mating-system evolution in tribe Collinsieae (Scrophulariaceae s.l.). American Journal of Botany 89:37-49. Armbruster, WS, CPH Mulder, BG Baldwin, S Kalisz, B Wessa, H Nute. 2002. Comparative analysis of late floral development and mating-system evolution in Tribe Collinsieae (Scrophulariaceae SL). American Journal of Botany 89:37-49. Ashman, TL, TM Knight, JA Steets, et al. 2004. Pollen limitation of plant reproduction: Ecological and evolutionary causes and consequences. Ecology 85:2408-2421. Ashman, TL, CJ Majetic. 2006. Genetic constraints on floral evolution: a review and evaluation of patterns. Heredity (Edinb) 96:343-352. Ashman, TL, MT Morgan. 2004. Explaining phenotypic selection on plant attractive characters: male function, gender balance or ecological context? 
Proceedings of the Royal Society of London Series B-Biological Sciences 271:553-559. Baker, HG. 1953. Race Formation and Reproductive Method in Flowering Plants. Symposia of the Society for Experimental Biology 7:114-145. Baker, HG. 1955. Self Compatibility and Establishment after Long Distance Dispersal. Evolution 9:347-349. Balakirev, ES, FJ Ayala. 2004. Nucleotide variation in the tinman and bagpipe homeobox genes of Drosophila melanogaster. Genetics 166:1845-1856.

Baldwin, BG, S Kalisz, WS Armbruster. 2011. Phylogenetic Perspectives on Diversification, Biogeography, and Floral Evolution of Collinsia and Tonella (Plantaginaceae). American Journal of Botany 98:731-753. Barrett, SC. 2002. The evolution of plant sexual diversity. Nature Reviews Genetics 3:274-284. Barrett, SCH. 2002. The evolution of plant sexual diversity. Nature Reviews Genetics 3:274-284. Barrett, SCH, LD Harder. 1996. Ecology and evolution of plant mating. Trends in Ecology & Evolution 11:A73-A79. Barrett, SCH, LD Harder, AC Worley. 1996. The comparative biology of pollination and mating in flowering plants. Philosophical Transactions of the Royal Society of London Series B- Biological Sciences 351:1271-1280. Baudry, E, C Kerdelhue, H Innan, W Stephan. 2001. Species and recombination effects on DNA variability in the tomato genus. Genetics 158:1725-1735. Bell, G. 1985. On the function of flowers. Pro R Soc London B 224:223-265. Bernacchi, D, SD Tanksley. 1997. An interspecific backcross of Lycopersicon esculentum x L. hirsutum: linkage analysis and a QTL study of sexual compatibility factors and floral traits. Genetics 147:861-877. Bernacchi, D, SD Tanksley. 1997. An interspecific backcross of Lycopersicon esculentum x L- hirsutum: Linkage analysis and a QTL study of sexual compatibility factors and floral traits. Genetics 147:861-877. Boggs, NA, JB Nasrallah, ME Nasrallah. 2009. Independent S-locus mutations caused self- fertility in Arabidopsis thaliana. PLoS Genet 5:e1000426. Bohonak, AJ. 1999. Dispersal, gene flow, and population structure. Quarterly Review of Biology 74:21-45. Boisvert, S, F Laviolette, J Corbeil. 2010. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 17:1519-1533. Boivin, K, A Acarkan, RS Mbulu, O Clarenz, R Schmidt. 2004. The Arabidopsis genome sequence as a tool for genome analysis in Brassicaceae. A comparison of the Arabidopsis and Capsella rubella genomes. 
Plant Physiology 135:735-744. Bradshaw, HD, KG Otto, BE Frewen, JK McKay, DW Schemske. 1998. Quantitative trait loci affecting differences in floral morphology between two species of monkeyflower (Mimulus). Genetics 149:367-382. Broman, KW, H Wu, S Sen, GA Churchill. 2003. R/qtl: QTL mapping in experimental crosses. Bioinformatics 19:889-890. Brown, AHD, JJ Burdon. 1987. Mating systems and colonizing success in plants. A.J. Gray, M.J. Crawley and P.J. Edwards, eds. Colonization Succession and Stability. Blackwell, Oxford, UK:115-131 Burd, M. 1994. Bateman Principle and Plant Reproduction - the Role of Pollen Limitation in Fruit and Seed Set. Botanical Review 60:83-139. Busch, JW, S Joly, DJ Schoen. 2011. Demographic Signatures Accompanying the Evolution of Selfing in Leavenworthia alabamica. Molecular Biology and Evolution 28:1717-1729.

Bustamante, CD, R Nielsen, SA Sawyer, KM Olsen, MD Purugganan, DL Hartl. 2002. The cost of inbreeding in Arabidopsis. Nature 416:531-534. Campbell, DR. 1989. Measurements of Selection in a Hermaphroditic Plant - Variation in Male and Female Pollination Success. Evolution 43:318-334. Campbell, DR, NM Waser, MV Price, EA Lynch, RJ Mitchell. 1991. Components of Phenotypic Selection - Pollen Export and Flower Corolla Width in Ipomopsis-Aggregata. Evolution 45:1458-1467. Charlesworth, B, MT Morgan, D Charlesworth. 1993a. The Effect of Deleterious Mutations on Neutral Molecular Variation. Genetics 134:1289-1303. Charlesworth, D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos Trans R Soc Lond B Biol Sci 358:1051-1070. Charlesworth, D, B Charlesworth. 1981. Allocation of Resources to Male and Female Functions in Hermaphrodites. Biological Journal of the Linnean Society 15:57-74. Charlesworth, D, B Charlesworth. 1987. Inbreeding Depression and Its Evolutionary Consequences. Annual Review of Ecology and Systematics 18:237-268. Charlesworth, D, B Charlesworth. 1987. The Effect of Investment in Attractive Structures on Allocation to Male and Female Functions in Plants. Evolution 41:948-968. Charlesworth, D, MT Morgan, B Charlesworth. 1990. Inbreeding Depression, Genetic Load, and the Evolution of Outcrossing Rates in a Multilocus System with No Linkage. Evolution 44:1469-1489. Charlesworth, D, MT Morgan, B Charlesworth. 1993. Mutation Accumulation in Finite Outbreeding and Inbreeding Populations. Genetical Research 61:39-56. Charlesworth, D, X Vekemans. 2005. How and when did Arabidopsis thaliana become highly self-fertilising. Bioessays 27:472-476. Charlesworth, D, SI Wright. 2001. Breeding systems and genome evolution. Current Opinion in Genetics & Development 11:685-690. Charnov, EL. 1979. Simultaneous Hermaphroditism and Sexual Selection. Proc Natl Acad Sci U S A 76:2480-2484. Chater, AO. 1993. Flora Europaea. Cambridge University Press. 
Chen, KY, B Cong, R Wing, J Vrebalov, SD Tanksley. 2007. Changes in regulation of a transcription factor lead to autogamy in cultivated tomatoes. Science 318:643-645. Clauss, MJ, T Mitchell-Olds. 2006. Population genetic structure of Arabidopsis lyrata in Europe. Mol Ecol 15:2753-2766. Cresswell, JE. 2000. Manipulation of female architecture in flowers reveals a narrow optimum for pollen deposition. Ecology 81:3244-3249. Crnokrak, P, SCH Barrett. 2002. Perspective: Purging the genetic load: A review of the experimental evidence. Evolution 56:2347-2358. Cummings, MP, MT Clegg. 1998. Nucleotide sequence diversity at the alcohol dehydrogenase 1 locus in wild barley (Hordeum vulgare ssp. spontaneum): an evaluation of the background selection hypothesis. Proc Natl Acad Sci U S A 95:5637-5642.

Cutter, AD. 2006. Nucleotide polymorphism and linkage disequilibrium in wild populations of the partial selfer Caenorhabditis elegans. Genetics 172:171-184. Cutter, AD, JD Wasmuth, NL Washington. 2008. Patterns of molecular evolution in caenorhabditis preclude ancient origins of selfing. Genetics 178:2093-2104. Darwin, C. 1876. The effects of cross and self fertilization in the vegetable kingdom. John Murray, London, UK. Dassanayake, M, DH Oh, JS Haas, et al. 2011. The genome of the extremophile crucifer Thellungiella parvula. Nature Genetics 43:913-U137. Delesalle, VA, SJ Mazer, H Paz. 2008. Temporal variation in the pollen : ovule ratios of Clarkia (Onagraceae) taxa with contrasting mating systems: field populations. Journal of Evolutionary Biology 21:310-323. Dudash, MR, DE Carr. 1998. Genetics underlying inbreeding depression in Mimulus with contrasting mating systems. Nature 393:682-684. Duret, L, D Mouchiroud. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, Arabidopsis. Proceedings of the National Academy of Sciences of the United States of America 96:4482-4487. Eckert, CG, KE Samis, S Dart. 2006. Reproductive assurance and the evolution of uniparental reproduction in flowering plants. Harder, L. D., and S. C. H. Barrett (Eds). Ecology and Evolution of Flowers. Oxford University Press, Oxford, UK. Eisen, MB, PT Spellman, PO Brown, D Botstein. 1998. Cluster analysis and display of genome- wide expression patterns. Proc Natl Acad Sci U S A 95:14863-14868. Elle, E, R Carney. 2003. Reproductive assurance varies with flower size in Collinsia parviflora (Scrophulariaceae). American Journal of Botany 90:888-896. Escobar, JS, A Cenci, J Bolognini, A Haudry, S Laurent, J David, S Glemin. 2010. An integrative test of the dead-end hypothesis of selfing evolution in Triticeae (Poaceae). Evolution 64:2855-2872. Escobar, JS, S Glemin, N Galtier. 2011. 
GC-Biased Gene Conversion Impacts Ribosomal DNA Evolution in Vertebrates, Angiosperms, and Other Eukaryotes. Molecular Biology and Evolution 28:2561-2575. Eyre-Walker, A, PD Keightley. 2007. The distribution of fitness effects of new mutations. Nature Reviews Genetics 8:610-618. Eyre-Walker, A, PD Keightley. 2009. Estimating the Rate of Adaptive Molecular Evolution in the Presence of Slightly Deleterious Mutations and Population Size Change. Molecular Biology and Evolution 26:2097-2108. Fenster, CB, WS Armbruster, P Wilson, MR Dudash, JD Thomson. 2004. Pollination syndromes and floral specialization. Annual Review of Ecology Evolution and Systematics 35:375- 403. Fenster, CB, PK Diggle, SCH Barrett, K Ritland. 1995. The Genetics of Floral Development Differentiating 2 Species of Mimulus (Scrophulariaceae). Heredity 74:258-266. Fenster, CB, K Ritland. 1994. Evidence for Natural-Selection on Mating System in Mimulus (Scrophulariaceae). International Journal of Plant Sciences 155:588-596.

Fisher, RA. 1930. The genetical theory of natural selection. Oxford: Oxford University Press. Fisher, RA. 1941. Average excess and average effect of a gene substitution. Annals of Eugenics 11:53-63. Fishman, L, J Aagaard, JC Tuthill. 2008. Toward the Evolutionary Genomics of Gametophytic Divergence: Patterns of Transmission Ratio Distortion in Monkeyflower (Mimulus) Hybrids Reveal a Complex Genetic Basis for Conspecific Pollen Precedence. Evolution 62:2958-2970. Fishman, L, AJ Kelly, E Morgan, JH Willis. 2001. A genetic map in the Mimulus guttatus species complex reveals transmission ratio distortion due to heterospecific interactions. Genetics 159:1701-1716. Fishman, L, AJ Kelly, JH Willis. 2002. Minor quantitative trait loci underlie floral traits associated with mating system divergence in Mimulus. Evolution 56:2138-2155. Fishman, L, DA Stratton. 2004. The genetics of floral divergence and postzygotic barriers between outcrossing and selfing populations of Arenaria uniflora (caryophyllaceae). Evolution 58:296-307. Fishman, L, JH Willis. 2008. Pollen limitation and natural selection on floral characters in the yellow monkeyflower, Mimulus guttatus. New Phytologist 177:802-810. Fowler, DP. 1965. Effects of inbreeding in red pine, Pinus resinosa Ait. II. Pollination studies. Silvae Genet 14:12-23. Foxe, JP, VU Dar, H Zheng, M Nordborg, BS Gaut, SI Wright. 2008. Selection on amino acid substitutions in Arabidopsis. Molecular Biology and Evolution 25:1375-1383. Foxe, JP, VUN Dar, H Zheng, M Nordborg, BS Gaut, SI Wright. 2008. Selection on amino acid substitutions in Arabidopsis. Molecular Biology and Evolution 25:1375-1383. Foxe, JP, T Slotte, EA Stahl, B Neuffer, H Hurka, SI Wright. 2009. Recent speciation associated with the evolution of selfing in Capsella. Proceedings of the National Academy of Sciences of the United States of America 106:5241-5245. Foxe, JP, M Stift, A Tedder, A Haudry, SI Wright, BK Mable. 2010. 
Reconstructing Origins of Loss of Self-Incompatibility and Selfing in North American Arabidopsis Lyrata: A Population Genetic Context. Evolution 64:3495-3510. French, GC, RA Ennos, AJ Silverside, PM Hollingsworth. 2005. The relationship between flower size, inbreeding coefficient and inferred selfing rate in British Euphrasia species. Heredity (Edinb) 94:44-51. Fulton, TM, T BeckBunn, D Emmatty, Y Eshed, J Lopez, V Petiard, J Uhlig, D Zamir, SD Tanksley. 1997. QTL analysis of an advanced backcross of Lycopersicon peruvianum to the cultivated tomato and comparisons with QTLs found in other wild species. Theoretical and Applied Genetics 95:881-894. Gao, H, S Williamson, CD Bustamante. 2007. A Markov Chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data. Genetics 176:1635-1651.

129

Georgiady, MS, RW Whitkus, EM Lord. 2002. Genetic analysis of traits distinguishing outcrossing and self-pollinating forms of currant tomato, Lycopersicon pimpinellifolium (Jusl.) Mill. Genetics 161:333-344. Glemin, S. 2007. Mating systems and the efficacy of selection at the molecular level. Genetics 177:905-916. Glemin, S. 2011. Surprising Fitness Consequences of GC-Biased Gene Conversion. II. Heterosis. Genetics 187:217-227. Glemin, S, T Bataillon. 2009. A comparative view of the evolution of grasses under domestication. New Phytologist 183:273-290. Glemin, S, E Bazin, D Charlesworth. 2006. Impact of mating systems on patterns of sequence polymorphism in flowering plants. Proc Biol Sci 273:3011-3019. Goodwillie, C. 1999. Multiple origins of self-compatibility in Linanthus section Leptosiphon (Polemoniaceae): Phylogenetic evidence from internal-transcribed-spacer sequence data. Evolution 53:1387-1395. Goodwillie, C, JM Ness. 2005. Correlated evolution in floral morphology and the timing of self- compatibility in Leptosiphon jepsonii (Polemoniaceae). International Journal of Plant Sciences 166:741-751. Goodwillie, C, C Ritland, K Ritland. 2006. The genetic basis of floral traits associated with mating system evolution in Leptosiphon (Polemoniaceae): An analysis of quantitative trait loci. Evolution 60:491-504. Goodwillie, C, RD Sargent, CG Eckert, et al. 2010. Correlated evolution of mating system and floral display traits in flowering plants and its implications for the distribution of mating system variation. New Phytologist 185:311-321. Gottlieb, LD. 1984. Genetics and Morphological Evolution in Plants. American Naturalist 123:681-709. Grandillo, S, SD Tanksley. 1996. QTL analysis of horticultural traits differentiating the cultivated tomato from the closely related species Lycopersicon pimpinellifolium. Theoretical and Applied Genetics 92:935-951. Graustein, A, JM Gaspar, JR Walters, MF Palopoli. 2002. 
Levels of DNA polymorphism vary with mating system in the nematode genus Caenorhabditis. Genetics 161:99-107. Grillo, MA, CB Li, AM Fowlkes, TM Briggeman, AL Zhou, DW Schemske, T Sang. 2009. Genetic Architecture for the Adaptive Origin of Annual Wild Rice, Oryza Nivara. Evolution 63:870-883. Guo, YL, JS Bechsgaard, T Slotte, B Neuffer, M Lascoux, D Weigel, MH Schierup. 2009. Recent speciation of Capsella rubella from Capsella grandiflora, associated with loss of self-incompatibility and an extreme bottleneck. Proceedings of the National Academy of Sciences of the United States of America 106:5246-5251. Haddrill, PR, D Bachtrog, P Andolfatto. 2008. Positive and negative selection on noncoding DNA in Drosophila simulans. Molecular Biology and Evolution 25:1825-1834. Halligan, DL, F Oliver, A Eyre-Walker, B Harr, PD Keightley. 2010. Evidence for Pervasive Adaptive Protein Evolution in Wild Mice. PLoS Genet 6.

130

Hamblin, MT, ES Buckler, JL Jannink. 2011. Population genetics of genomics-based crop improvement methods. Trends in Genetics 27:98-106. Hamrick, JL, MJW Godt. 1996. Effects of life history traits on genetic diversity in plant species. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 351:1291-1298. Harder, LD, SCH Barrett. 1993. Pollen Removal from Tristylous-Pontederia-Cordata - Effects of Anther Position and Pollinator Specialization. Ecology 74:1059-1072. Haudry, A, A Cenci, C Guilhaumon, E Paux, S Poirier, S Santoni, J David, S Glemin. 2008. Mating system and recombination affect molecular evolution in four Triticeae species. Genetics Research 90:97-109. Herlihy, CR, CG Eckert. 2007. Evolutionary analysis of a key floral trait in aquilegia canadensis (ranunculaceae): genetic variation in herkogamy and its effect on the mating system. Evolution 61:1661-1674. Hershberg, R, DA Petrov. 2008. Selection on Codon Bias. Annual Review of Genetics 42:287- 299. Hodges, SA, JB Whittall, M Fulton, JY Yang. 2002. Genetics of floral traits influencing reproductive isolation between Aquilegia formosa and Aquilegia pubescens. American Naturalist 159:S51-S60. Hohenlohe, PA, S Bassham, PD Etter, N Stiffler, EA Johnson, WA Cresko. 2010. Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags. PLoS Genet 6. Holsinger, KE. 1988. Inbreeding depression doesn't matter: The genetic basis of mating sys- tem evolution. Evolution 42:1235-1244. Holtsford, TP, NC Ellstrand. 1992. Genetic and Environmental Variation in Floral Traits Affecting Outcrossing Rate in Clarkia-Tembloriensis (Onagraceae). Evolution 46:216- 225. Hu, TT, P Pattyn, EG Bakker, et al. 2011. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genetics 43:476-+. Hudson, RR, NL Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. 
Genetics 111:147-164. Huelsenbeck, JP, F Ronquist. 2001. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755. Hurka, H, S Freundner, AHD Brown, U Plantholt. 1989. Aspartate-Aminotransferase Isozymes in the Genus Capsella (Brassicaceae) - Subcellular Location, Gene Duplication, and Polymorphism. Biochemical Genetics 27:77-90. Hurka, H, N Friesen, DA German, A Franzke, B Neuffer. 2012. 'Missing link' species Capsella orientalis and Capsella thracica elucidate evolution of model plant genus Capsella (Brassicaceae). Mol Ecol 21:1223-1238. Hurka, H, B Neuffer. 1997. Evolutionary processes in the genus Capsella (Brassicaceae). Plant Systematics and Evolution 206:295-316.

131

Igic, B, R Lande, JR Kohn. 2008. Loss of self-incompatibility and its evolutionary consequences. International Journal of Plant Sciences 169:93-104. Ikemura, T. 1985. Codon Usage and Transfer-Rna Content in Unicellular and Multicellular Organisms. Molecular Biology and Evolution 2:13-34. Ingvarsson, PK. 2002. A metapopulation perspective on genetic diversity and differentiation in partially self-fertilizing plants. Evolution 56:2368-2373. Ingvarsson, PK. 2008. Molecular evolution of synonymous codon usage in Populus. BMC Evolutionary Biology 8. Innan, H, R Terauchi, NT Miyashita. 1997. Microsatellite polymorphism in natural populations of the wild plant Arabidopsis thaliana. Genetics 146:1441-1452. Jain, SK. 1976. The evolution of inbreeding in plants. Annual Review of Ecology and Systematics 1. 7. Jansen, RC. 1993. Interval Mapping of Multiple Quantitative Trait Loci. Genetics 135:205-211. Jansen, RC, P Stam. 1994. High-Resolution of Quantitative Traits into Multiple Loci Via Interval Mapping. Genetics 136:1447-1455. Johnston, MO, DJ Schoen. 1996. Correlated evolution of self-fertilization and inbreeding depression: An experimental study of nine populations of Amsinckia (Boraginaceae). Evolution 50:1478-1491. Kalisz, S, A Randle, D Chaiffetz, M Faigeles, A Butera, C Beight. 2012. Dichogamy correlates with outcrossing rate and defines the selfing syndrome in the mixed-mating genus Collinsia. Ann Bot 109:571-582. Kalisz, S, DW Vogler. 2003. Benefits of autonomous selfing under unpredictable pollinator environments. Ecology 84:2928-2942. Kalisz, S, DW Vogler, KM Hanley. 2004. Context-dependent autonomous self-fertilization yields reproductive assurance and mixed mating. Nature 430:884-887. Kanaya, S, Y Yamada, M Kinouchi, Y Kudo, T Ikemura. 2001. Codon usage and tRNA genes in uukaryotes: correlation of codon usage diversity with translation efficiency and with CG- dinucleotide usage as assessed by multivariate analysis. Journal of Molecular Evolution 53:290-298. 
Kanaya, S, Y Yamada, Y Kudo, T Ikemura. 1999. Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238:143-155. Kaul, SHL KooJ Jenkins, et al. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815. Kawabe, A, NT Miyashita. 2003. Patterns of codon usage bias in three dicot and four monocot plant species. Genes and Genetic Systems 78:343–352. Kearns, CA, DW Inouye. 1993. Techniques for pollination biologists. University of Colorado Press, Niwot, CO, USA.

132

Kennedy, BF, E Elle. 2008. The reproductive assurance benefit of selfing: importance of flower size and population size. Oecologia 155:469-477. Kennedy, BF, E Elle. 2008a. The reproductive assurance benefit of selfing: importance of flower size and population size. Oecologia 155:469-477. Knight, TM, JA Steets, JC Vamosi, SJ Mazer, M Burd, DR Campbell, MR Dudash, MO Johnston, RJ Mitchell, TL Ashman. 2005. Pollen limitation of plant reproduction: Pattern and process. Annual Review of Ecology Evolution and Systematics 36:467-497. Koch, MA, M Kiefer. 2005. Genome evolution among cruciferous plants: A lecture from the comparison of the genetic maps of three diploid species - Capsella rubella, Arabidopsis lyrata subsp Petraea, and A. thaliana. American Journal of Botany 92:761-767. Kohn, JR, SW Graham, B Morton, JJ Doyle, SCH Barrett. 1996. Reconstruction of the evolution of reproductive characters in Pontederiaceae using phylogenetic evidence from chloroplast DNA restriction-site variation. Evolution 50:1454-1469. Lande, R, DW Schemske. 1985. The Evolution of Self-Fertilization and Inbreeding Depression in Plants .1. Genetic Models. Evolution 39:24-40. Larson, BMH, SCH Barrett. 2000. A comparative analysis of pollen limitation in flowering plants. Biological Journal of the Linnean Society 69:503-520. Lewis, H. 1973. The origin of diploid neospecies in Clarkia. Am. Nat 107:161-170. Li, C, WH Wong. 2001. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc Natl Acad Sci U S A 98:31-36. Li, H, R Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754-1760. Li, H, B Handsaker, A Wysoker, T Fennell, J Ruan, N Homer, G Marth, G Abecasis, R Durbin. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078- 2079. Lin, JZ, PL Morrell, MT Clegg. 2002. 
The influence of linkage and inbreeding on patterns of nucleotide sequence diversity at duplicate alcohol dehydrogenase loci in wild barley (Hordeum vulgare ssp. spontaneum). Genetics 162:2007-2015. Lin, JZ, K Ritland. 1997. Quantitative trait loci differentiating the outbreeding Mimulus guttatus from the inbreeding M-platycalyx. Genetics 146:1115-1121. Liu, F, D Charlesworth, M Kreitman. 1999. The effect of mating system differences on nucleotide diversity at the phosphoglucose isomerase locus in the plant genus Leavenworthia. Genetics 151:343-357. Liu, F, L Zhang, D Charlesworth. 1998. Genetic diversity in Leavenworthia populations with different inbreeding levels. Proc Biol Sci 265:293-301. Lloyd, DG. 1965. Evolution of self-compatibility and racial differentiation in Leavenworthia (Cruciferae). Contributions from the Gray Herbarium of Harvard University 195:3-134. Lloyd, DG. 1987. Allocations to pollen, seeds and pollination mechanisms in self-fertilizing plants . Func. Ecol 1:83-89.

133

Lowe, TM, SR Eddy. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25:955-964. Loytynoja, A, N Goldman. 2005. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557-10562. Maad, J. 2000. Phenotypic selection in hawkmoth-pollinated Platanthera bifolia: Targets and fitness surfaces. Evolution 54:112-123. Maad, J, R Alexandersson. 2004. Variable selection in Platanthera bifolia (Orchidaceae): phenotypic selection differed between sex functions in a drought year. Journal of Evolutionary Biology 17:642-650. Mable, BK, A Adam. 2007. Patterns of genetic diversity in outcrossing and selfing populations of Arabidopsis lyrata. Mol Ecol 16:3565-3580. Macnair, MR, QJ Cumbes. 1989. The Genetic Architecture of Interspecific Variation in Mimulus. Genetics 122:211-222. Marais, G. 2003. Biased gene conversion: implications for genome and sex evolution. Trends in Genetics 19:330-338. Marais, G, B Charlesworth, SI Wright. 2004. Recombination and base composition: the case of the highly self-fertilizing plant Arabidopsis thaliana. Genome Biol 5. Marshall, DF, RJ Abbott. 1984. Polymorphism for Outcrossing Frequency at the Ray Floret Locus in Senecio-Vulgaris L .2. Confirmation. Heredity 52:331-336. Mazer, SJ, LS Dudley, VA Delesalle, H Paz, P Galusky. 2009. Stability of pollen-ovule ratios in pollinator-dependent versus autogamous Clarkia sister taxa: testing evolutionary predictions. New Phytologist 183:630-648. McVean, GAT, B Charlesworth. 2000. The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 155:929-944. Moeller, DA, MA Geber. 2005. Ecological context of the evolution of self-pollination in Clarkia xantiana: Population size, plant communities, and reproductive assurance. Evolution 59:786-799. Morrell, PL, DM Toleno, KE Lundy, MT Clegg. 2005. 
Low levels of linkage disequilibrium in wild barley (Hordeum vulgare ssp. spontaneum) despite high rates of self-fertilization. Proc Natl Acad Sci U S A 102:2442-2447. Morrell, PL, DM Toleno, KE Lundy, MT Clegg. 2006. Estimating the contribution of mutation, recombination and gene conversion in the generation of haplotypic diversity. Genetics 173:1705-1723. Moyle, LC, EB Graham. 2006. Genome-wide associations between hybrid sterility QTL and marker transmission ratio distortion. Molecular Biology and Evolution 23:973-980. Muyle, A, L Serres-Giardi, A Ressayre, J Escobar, S Glemin. 2011. GC-Biased Gene Conversion and Selection Affect GC Content in the Oryza Genus (rice). Molecular Biology and Evolution 28:2695-2706.

134

Nasrallah, JB, P Liu, S Sherman-Broyles, R Schmidt, ME Nasrallah. 2007. Epigenetic mechanisms for breakdown of self-incompatibility in interspecific hybrids. Genetics 175:1965-1973. Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY. Ness, RW, M Siol, SC Barrett. 2011. De novo sequence assembly and characterization of the floral transcriptome in cross- and self-fertilizing plants. BMC Genomics 12:298. Ness, RW, SI Wright, SCH Barrett. 2010. Mating-System Variation, Demographic History and Patterns of Nucleotide Diversity in the Tristylous Plant Eichhornia paniculata. Genetics 184:381-U105. Nilsson, LA. 1988. The Evolution of Flowers with Deep Corolla Tubes. Nature 334:147-149. Noor, MAF, AL Cunningham, JC Larkin. 2001. Consequences of recombination rate variation on quantitative trait locus mapping studies: Simulations based on the Drosophila melanogaster genome. Genetics 159:581-588. Nordborg, M. 2000. Linkage disequilibrium, gene trees and selfing: an ancestral recombination graph with partial self-fertilization. Genetics 154:923-929. Nordborg, M, P Donnelly. 1997. The coalescent process with selfing. Genetics 146:1185-1195. Nordborg, M, TT Hu, Y Ishino, et al. 2005. The pattern of polymorphism in Arabidopsis thaliana. Plos Biology 3:e196. Nordborg, M, TT Hu, Y Ishino, et al. 2005. The pattern of polymorphism in Arabidopsis thaliana. Plos Biology 3:1289-1299. Nybom, H. 2004. Comparison of different nuclear DNA markers for estimating intraspecific genetic diversity in plants. Mol Ecol 13:1143-1155. Ohta, T, Cockerha.Cc. 1974. Detrimental Genes with Partial Selfing and Effects on a Neutral Locus. Genetical Research 23:191-200. Ornduff, R. 1969. Reproductive biology in relation to systematics. Taxon 18:121-133. Orr, HA. 1998. The population genetics of adaptation: The distribution of factors fixed during adaptive evolution. Evolution 52:935-949. Orr, HA. 1998. Testing natural selection vs. 
genetic drift in phenotypic evolution using quantitative trait locus data. Genetics 149:2099-2104. Paetsch, M, S Mayland-Quellhorst, B Neuffer. 2006. Evolution of the self-incompatibility system in the Brassicaceae: identification of S-locus receptor kinase (SRK) in self- incompatible Capsella grandiflora. Heredity (Edinb) 97:283-290. Percudani, R. 2001. Restricted wobble rules for eukaryotic genomes. Trends in Genetics 17:133- 135. Perusse, JR, DJ Schoen. 2004. Molecular evolution of the GapC gene family in Amsinckia spectabilis populations that differ in outcrossing rate. Journal of Molecular Evolution 59:427-436. Polak, M. 2003. Developmental instability : causes and consequences. Oxford ; New York: Oxford University Press.

135

Porcher, E, R Lande. 2005. The evolution of self-fertilization and inbreeding depression under pollen discounting and pollen limitation. J Evol Biol 18:497-508. Qiu, S, R Bergero, K Zeng, D Charlesworth. 2010. Patterns of codon usage bias in Silene latifolia. Molecular Biology and Evolution 28:771–780. Qiu, S, K Zeng, T Slotte, S Wright, D Charlesworth. 2011. Reduced Efficacy of Natural Selection on Codon Usage Bias in Selfing Arabidopsis and Capsella Species. Genome Biology and Evolution 3:868-880. R Development Core Team. 2011. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. R., W. 1988. Phylogenetic aspects of the evolution of self-pollination. Gottlieb L. D., Jain S. (eds.) Plant Evolutionary Ecology. London: Chapman & Hall:109–131. Randle, AM, JB Slyder, S Kalisz. 2009. Can differences in autonomous selfing ability explain differences in range size among sister-taxa pairs of Collinsia (Plantaginaceae)? An extension of Baker's Law. New Phytologist 183:618-629. Remington, DL, MD Purugganan. 2003. Candidate genes, quantitative trait loci, and functional trait evolution in plants. International Journal of Plant Sciences 164:S7-S20. Remm, M, CEV Storm, ELL Sonnhammer. 2001. Automatic clustering of orthologs and in- paralogs from pairwise species comparisons. Journal of Molecular Biology 314:1041- 1052. Rieseberg, LH, SJE Baird, KA Gardner. 2000. Hybridization, introgression, and linkage evolution. Plant Molecular Biology 42:205-224. Riley, HP. 1934. A further test showing the dominance of self-fertility to self-sterility in shepherd's purse. American Naturalist 68:60-64. Ritland, C, K Ritland. 1989. Variation of Sex Allocation among 8 Taxa of the Mimulus-Guttatus Species Complex (Scrophulariaceae). American Journal of Botany 76:1731-1739. Rozas, J, JC Sanchez-DelBarrio, X Messeguer, R Rozas. 2003. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19:2496-2497. 
Rozen, S, H Skaletsky. 2000. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132:365-386. Sandring, S, J Agren. 2009. Pollinator-mediated selection on floral display and flowering time in the perennial herb Arabidopsis lyrata. Evolution 63:1292-1300. Sato, H, T Yahara. 1999. Trade-offs between flower number and investment to a flower in selfing and outcrossing varieties of Impatiens hypophylla (Balsaminaceae). American Journal of Botany 86:1699-1707. Savolainen, O, CH Langley, BP Lazzaro, H Fr. 2000. Contrasting patterns of nucleotide polymorphism at the alcohol dehydrogenase locus in the outcrossing Arabidopsis lyrata and the selfing Arabidopsis thaliana. Molecular Biology and Evolution 17:645-655. Savolainen, O, CH Langley, BP Lazzaro, H Freville. 2000. Contrasting patterns of nucleotide polymorphism at the alcohol dehydrogenase locus in the outcrossing Arabidopsis lyrata and the selfing Arabidopsis thaliana. Molecular Biology and Evolution 17:645-655.

136

Schoen, DJ. 1982. Male Reproductive Effort and Breeding System in an Hermaphroditic Plant. Oecologia 53:255-257. Schoen, DJ, AHD Brown. 1991. Intraspecific Variation in Population Gene Diversity and Effective Population-Size Correlates with the Mating System in Plants. Proceedings of the National Academy of Sciences of the United States of America 88:4494-4497. Schoen, DJ, JW Busch. 2008. On the evolution of self-fertilization in a metapopulation. International Journal of Plant Sciences 169:119-127. Schoen, DJ, MO Johnston, AM LHeureux, JV Marsolais. 1997. Evolutionary history of the mating system in Amsinckia (Boraginaceae). Evolution 51:1090-1099. Schranz, ME, AJ Windsor, BH Song, A Lawton-Rauh, T Mitchell-Olds. 2007. Comparative genetic mapping in Boechera stricta, a close relative of Arabidopsis. Plant Physiology 144:286-298. Schultz, ST, JH Willis. 1995. Individual variation in inbreeding depression: the roles of inbreeding history and mutation. Genetics 141:1209-1223. Scotti-Saintagne, C, S Mariette, I Porth, PG Goicoechea, T Barreneche, K Bodenes, K Burg, A Kremer. 2004. Genome scanning for interspecific differentiation between two closely related oak species [Quercus robur L. and Q petraea (Matt.) Liebl.]. Genetics 168:1615- 1626. Sharp, PM, E Bailes, RJ Grocock, JF Peden, RE Sockett. 2005. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research 33:1141-1153. Shimizu, KK, JM Cork, AL Caicedo, et al. 2004. Darwinian selection on a selfing locus (Retracted Article. See vol 320, pg 176, 2008). Science 306:2081-2084. Shore, JS, SCH Barrett. 1990. Quantitative Genetics of Floral Characters in Homostylous Turnera-Ulmifolia Var Angustifolia Willd (Turneraceae). Heredity 64:105-112. Sicard, A, M Lenhard. 2011. The selfing syndrome: a model for studying the genetic and evolutionary basis of morphological adaptation in plants. Ann Bot 107:1433-1443. Sicard, A, N Stacey, K Hermann, J Dessoly, B Neuffer, I Baurle, M Lenhard. 2011. 
Genetics, Evolution, and Adaptive Significance of the Selfing Syndrome in the Genus Capsella. Plant Cell 23:3156-3171. Slotte, T, JP Foxe, KM Hazzouri, SI Wright. 2010. Genome-wide evidence for efficient positive and purifying selection in Capsella grandiflora, a plant species with a large effective population size. Molecular Biology and Evolution 27:1813-1821. Smith, JM, J Haigh. 1974. Hitch-Hiking Effect of a Favorable Gene. Genetical Research 23:23- 35. St Onge, KR, T Kallman, T Slotte, M Lascoux, AE Palme. 2011. Contrasting demographic history and population structure in Capsella rubella and Capsella grandiflora, two closely related species with different mating systems. Mol Ecol 20:3306-3320. Stadler, T, B Haubold, C Merino, W Stephan, P Pfaffelhuber. 2009. The impact of sampling schemes on the site frequency spectrum in nonequilibrium subdivided populations. Genetics 182:205-216.

137

Stebbins, GL. 1950. Variation and evolution in plants. Columbia University Press, New York, USA. Stebbins, GL. 1970. Adaptive radiation of reproductive characteristics in angiosperms, I: pollination mechanisms. Annual Review of Ecology and Systematics 1. Stephens, M, P Donnelly. 2003. A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162-1169. Stinchcombe, JR, HE Hoekstra. 2008. Combining population genomics and quantitative genetics: finding the genes underlying ecologically important traits. Heredity (Edinb) 100:158-170. Suyama, M, D Torrents, P Bork. 2006. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research 34:W609- W612. Sweigart, AL, JH Willis. 2003. Patterns of nucleotide diversity in two species of Mimulus are affected by mating system and asymmetric introgression. Evolution 57:2490-2506. Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460. Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595. Takayama, S, A Isogai. 2005. Self-incompatibility in plants. Annual Review of Plant Biology 56:467-489. Takebayashi, N, PL Morrell. 2001. Is self-fertilization an evolutionary dead end? Revisiting an old hypothesis with genetic theories and a macroevolutionary approach. American Journal of Botany 88:1143-1150. Takebayashi, N, DE Wolf, LF Delph. 2006. Effect of variation in herkogamy on outcrossing within a population of Gilia achilleifolia. Heredity (Edinb) 96:159-165. Tenaillon, MI, MC Sawkins, LK Anderson, SM Stack, J Doebley, BS Gaut. 2002. Patterns of diversity and recombination along chromosome 1 of maize (Zea mays ssp. mays L.). Genetics 162:1401-1413. Thornton, K. 2003. Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19:2325-2327. Tomimatsu, H, M Ohara. 2006. 
Evolution of hierarchical floral resource allocation associated with mating system in an animal-pollinated hermaphroditic herb, Trillium camschatcense (Trilliaceae). American Journal of Botany 93:134-141. Trapnell, C, BA Williams, G Pertea, A Mortazavi, G Kwan, MJ van Baren, SL Salzberg, BJ Wold, L Pachter. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28:511-515. Uyenoyama, MK, KE Holsinger, DM Waller. 1993. Ecological and genetic factors directing the evolution of self-fertilization. D.J. Futuyma and J. Antonovics, eds. Oxford Surveys in Evolutionary Biology. Oxford University Press, Oxford, UK:283–326

138

Uyenoyama, MK, DM Waller. 1991. Coevolution of self-fertilization and inbreeding depression. I. Mutation-selection balance at one and two loci. Theor Popul Biol 40:14-46. Vigouroux, Y, M McMullen, CT Hittinger, K Houchins, L Schulz, S Kresovich, Y Matsuoka, J Doebley. 2002. Identifying genes of agronomic importance in maize by screening microsatellites for evidence of selection during domestication. Proc Natl Acad Sci U S A 99:9650-9655. Wakeley, J, S Lessard. 2003. Theory of the effects of population structure and sampling on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 164:1043-1053. Wang, JL, WG Hill, D Charlesworth, B Charlesworth. 1999. Dynamics of inbreeding depression due to deleterious mutations in small populations: mutation parameters and inbreeding rate. Genetical Research 74:165-178. Watterson, GA. 1975. On the number of segregating sites in genetical models without recombination. Theor Popul Biol 7:256-276. Willis, JH. 1996. Measures of phenotypic selection are biased by partial inbreeding. Evolution 50:1501-1511. Wright, SI, IV Bi, SG Schroeder, M Yamasaki, JF Doebley, MD McMullen, BS Gaut. 2005. The effects of artificial selection on the maize genome. Science 308:1310-1314. Wright, SI, G Iorgovan, S Misra, M Mokhtari. 2007. Neutral evolution of synonymous base composition in the Brassicaceae. Journal of Molecular Evolution 64:136-141. Wright, SI, B Lauga, D Charlesworth. 2002. Rates and patterns of molecular evolution in inbred and outbred Arabidopsis. Molecular Biology and Evolution 19:1407-1420. Wright, SI, B Lauga, D Charlesworth. 2003. Subdivision and haplotype structure in natural populations of Arabidopsis lyrata. Mol Ecol 12:1247-1263. Wright, SI, RW Ness, JP Foxe, SCH Barrett. 2008. Genomic consequences of outcrossing and selfing in plants. International Journal of Plant Sciences 169:105-118. Wright, SI, CBK Yau, M Looseley, BC Meyers. 2004. 
Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Molecular Biology and Evolution 21:1719-1726. Wyatt, R. 1984. The Evolution of Self-Pollination in Granite Outcrop Species of Arenaria (Caryophyllaceae) .4. Correlated Changes in the Gynoecium. American Journal of Botany 71:1006-1014. Xie, XF, J Molina, R Hernandez, A Reynolds, AR Boyko, CD Bustamante, MD Purugganan. 2011. Levels and Patterns of Nucleotide Variation in Domestication QTL Regions on Rice Chromosome 3 Suggest Lineage-Specific Selection. PLoS One 6. Xu, SZ, WR Atchley. 1996. Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics 143:1417-1424. Yang, Z. 2007. PAML 4: a program package for phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24:1586-1591. Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24:1586-1591.

139

Zerbino, DR, E Birney. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821-829.

140

APPENDICES

Appendix 2.1

Species         Accession designation  Geographical origin                 Lat      Long     Collector
C. grandiflora  Cg103-2                Near Lipiota, Greece                39°31N   21°31E   TS, KS, JPF
C. grandiflora  Cg2e                   Paleokastritsas, Corfu, Greece      39°40N   19°42E   TS
C. grandiflora  Cg3c                   Near Kanakades, Corfu, Greece       39°39N   19°45E   TS
C. grandiflora  Cg5a                   Near Troumpeta, Corfu, Greece       39°42N   19°45E   TS
C. grandiflora  Cg83-18                Near Vathyrrevma, Greece            39°26N   21°25E   TS, KS
C. grandiflora  Cg85-2                 Omirou, Greece                      39°33N   20°55E   TS, KS
C. grandiflora  Cg85-9                 Omirou, Greece                      39°33N   20°55E   TS, KS
C. grandiflora  Cg88-23                Monodendri, Zagori, Greece          39°53N   20°44E   TS, KS, JPF
C. grandiflora  Cg88-3                 Monodendri, Zagori, Greece          39°53N   20°44E   TS, KS, JPF
C. grandiflora  Cg88-37                Monodendri, Zagori, Greece          39°53N   20°44E   TS, KS, JPF
C. grandiflora  Cg89-8                 Near Koukouli, Zagori, Greece       39°52N   20°46E   TS, KS, JPF
C. grandiflora  Cg8d                   Near Thirio, Sterea Ellada, Greece  38°52N   20°59E   TS
C. grandiflora  Cg91-27                Near , Zagori, Greece               39°52N   20°42E   TS, KS, JPF
C. grandiflora  Cg925/9                , Zagori, Greece                    NA       NA       BN
C. grandiflora  Cg935/13               Sokraki, Corfu, Greece              NA       NA       BN
C. grandiflora  Cg94-05                Mikro , Zagori, Greece              39°57N   20°43E   TS, KS, JPF
C. rubella      Cr104-12               Sesklo, Greece                      39°21N   22°50E   TS, KS, JPF
C. rubella      Cr1209/26_TS4          Cumbre Dorsal, Tenerife, Spain      NA       NA       BN
C. rubella      Cr1215/17-TS1          La Laguna, Tenerife, Spain          NA       NA       BN
C. rubella      Cr1377/5               Buenos Aires, Argentina             NA       NA       BN
C. rubella      Cr32-14                La Sila, Italy                      39°22N   16°32E   KS
C. rubella      Cr34-11                San Luca, Italy                     38°10N   16°32E   KS
C. rubella      Cr39-5                 Bacia, Italy                        37°59N   15°20E   KS
C. rubella      Cr73-10                Near Chalkeio, Greece               37°53N   22°43E   KS
C. rubella      Cr75-2                 Near Moni Megalou Spileou, Greece   38°05N   22°10E   KS
C. rubella      Cr76-05                Near Neo Kompigadi, Greece          38°02N   21°54E   KS
C. rubella      Cr79-30                Near Lalai, Greece                  37°44N   21°41E   KS
C. rubella      Cr81-02                Near Paradeisia, Greece             37°18N   22°04E   KS
C. rubella      Cr82-14                Nemea, Greece                       NA       NA       KS
C. rubella      Cr84-27                Kato Kerasovo, Greece               38°29N   21°26E   TS, KS
C. rubella      Cr86IT1-C              Serrano, Italy                      NA       NA       HC
C. rubella      CrGÖ665-1              NA                                  NA       NA       BGG

Collectors: TS, Tanja Slotte; KS, Kate St. Onge; JPF, John Paul Foxe; BN, Barbara Neuffer; HC, Helene Ceplitis; BGG, Botanical Garden Germany.

Appendix 2.2

Appendix 2.3

Appendix 3.1

Summary of candidate genes that are differentially expressed under narrow QTL regions.

Appendix 4.1

Locus         Length  Primer (L/R)  Primer sequence         Size (bp)
Collinsia_1   1492    L             TTCAATTGATGCAGCAGAGG    753
                      R             TCACCAGTGCTACGCATCTC
Collinsia_2   1270    L             TCCTTGACCTGTCCCTGTTC    793
                      R             TGCCCAAAAATTGGAAGAAG
Collinsia_3   1106    L             TGTTGGCTTATTTGGCTTCC    849
                      R             GATTCTGGGCCTGAAAATGA
Collinsia_4   1174    L             AATTACGGCATCGTGAGGAC    813
                      R             TGAGAGAAGCTATGCGCTGA
Collinsia_5   1192    L             AATGCGCTGGTATCAACCTC    728
                      R             CAGAGATTGCGGGATCATTT
Collinsia_6   1137    L             TTGGATCTGGGAAGAGATGG    719
                      R             TTCCATCTGGAATCCTGGAC
Collinsia_7   1131    L             TTCTCAGCACACGGATCTTG    728
                      R             CATTAAACGCACCCCTGTCT
Collinsia_8   1027    L             CCTCAATTCCTCCGTTCGTA    845
                      R             CCTGTCTCCCCATTACTCCA
Collinsia_12  807     L             CCCTTCAAATTCTCGCGTAA    609
                      R             CCCCTCGATTTTCAACTTCA
Collinsia_13  804     L             TGTGAAAGCTCGACTGGTTG    620
                      R             CATACGCCTTAGCAGCTTCG
Collinsia_14  786     L             TTGTGCCATGCTACTGGAAG    669
                      R             AAGGAGAAGTCCTGCCACCT
Collinsia_15  777     L             TGGCTTCATCAAGCTTCTCA    607
                      R             CAAAGAGTGTCCCCTGAACC
Collinsia_17  737     L             ACGGTGGTGTTGAAACCTTC    601
                      R             GGAGCCTTAGATCCCAAATCA
Collinsia_18  722     L             GGGTTGACCATCATCTGTCC    638
                      R             GTCTGATGTCCCTGGCTGAT
Collinsia_22  659     L             GGATTTAGGCCCTCAAGTCAA   594
                      R             CCAAGAGCCCAGTTTGTCAT
Collinsia_23  922     L             CATACCCAGCAACAACATGC    675
                      R             GGCATACAAAAGGGGTCTCA
Collinsia_25  1114    L             TATCCGACCCAGAAGACACC    778
                      R             CAATTCCCAACTTCCTCCAA

Appendix 4.2

Optimal codons in Mimulus guttatus defined by tRNA abundance in nuclear chromosomes. Optimal codons are underlined. RF: relative frequency.

Amino acid  Codon  tRNA genes  RF tRNA
Ala         GCA    4           0.44
            GCC    0           0.00
            GCG    0           0.00
            GCT    5           0.56
Arg         AGA    2           0.15
            AGG    2           0.15
            CGA    2           0.15
            CGC    0           0.00
            CGG    2           0.15
            CGT    5           0.38
Asn         AAC    3           1.00
            AAT    0           0.00
Asp         GAC    11          1.00
            GAT    0           0.00
Cys         TGC    7           1.00
            TGT    0           0.00
Gln         CAA    1           0.33
            CAG    2           0.67
Glu         GAA    1           0.17
            GAG    5           0.83
Gly         GGA    0           0.00
            GGC    11          0.85
            GGG    2           0.15
            GGT    0           0.00
His         CAC    10          1.00
            CAT    0           0.00
Ile         ATA    3           0.16
            ATC    0           0.00
            ATT    16          0.84
Leu         CTA    4           0.40
            CTC    0           0.00
            CTG    1           0.10
            CTT    1           0.10
            TTA    1           0.10
            TTG    3           0.30
Lys         AAA    2           0.20
            AAG    8           0.80
Phe         TTC    7           1.00
            TTT    0           0.00
Pro         CCA    3           0.43
            CCC    0           0.00
            CCG    1           0.14
            CCT    3           0.43
Ser         AGC    5           0.56
            AGT    0           0.00
            TCA    0           0.00
            TCC    0           0.00
            TCG    4           0.44
            TCT    0           0.00
Thr         ACA    0           0.00
            ACC    0           0.00
            ACG    0           0.00
            ACT    6           1.00
Tyr         TAC    4           1.00
            TAT    0           0.00
Val         GTA    1           0.10
            GTC    0           0.00
            GTG    2           0.20
            GTT    7           0.70
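The RF tRNA column can be reproduced from the tRNA gene counts: within each amino acid, a codon's count is divided by the total count for that amino acid. The following is a minimal Python sketch of that arithmetic on a small subset of the rows above. The names `TRNA_GENES`, `relative_frequencies`, and `most_abundant_codon` are hypothetical, and picking the codon with the most tRNA genes as the candidate optimal codon is an assumption made for illustration, not necessarily the criterion used in the thesis.

```python
from collections import defaultdict

# Illustrative subset of the table: (amino acid, codon, number of tRNA genes).
TRNA_GENES = [
    ("Ala", "GCA", 4), ("Ala", "GCC", 0), ("Ala", "GCG", 0), ("Ala", "GCT", 5),
    ("Gln", "CAA", 1), ("Gln", "CAG", 2),
    ("Lys", "AAA", 2), ("Lys", "AAG", 8),
]


def relative_frequencies(rows):
    """Return {codon: tRNA gene count / total count for its amino acid}."""
    totals = defaultdict(int)
    for amino_acid, _codon, count in rows:
        totals[amino_acid] += count
    return {
        codon: (count / totals[amino_acid] if totals[amino_acid] else 0.0)
        for amino_acid, codon, count in rows
    }


def most_abundant_codon(rows):
    """Assumption: the candidate optimal codon is the one with the most tRNA genes."""
    best = {}
    for amino_acid, codon, count in rows:
        if amino_acid not in best or count > best[amino_acid][1]:
            best[amino_acid] = (codon, count)
    return {amino_acid: codon for amino_acid, (codon, _count) in best.items()}


if __name__ == "__main__":
    for codon, rf in relative_frequencies(TRNA_GENES).items():
        print(f"{codon}\t{rf:.2f}")
    print(most_abundant_codon(TRNA_GENES))
```

Ties, such as the equal counts for CCA and CCT in Pro, are resolved here by keeping the first codon encountered; a real analysis would need an explicit rule for such cases.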

Appendix 4.3

C. linearis

Locus     Rmin         ρ=4Ner(bp)   Corr(r²,d)         Theta(θw)    ρ/θ
Locus 1   3            0.006        -0.06296     nnn   0.00575      1.043478261
Locus 2   0            0.004        -0.51204     nnn   0.00295      1.355932203
Locus 3   2            0.002        -0.06169     ssn   0.01218      0.164203612
Locus 4   2            0.012        -0.41421     nns   0.007        1.714285714
Locus 5   1            0.003        -0.18054     nnn   0.003        1
Locus 6   0            0.001        0.12289      nnn   0.0041       0.243902439
Locus 7   1            0.003        -0.37685     nss   0.00952      0.31512605
Locus 8   2            0.026        -0.03913     ssn   0.01914      1.358411703
Locus 9   3            0.025        -0.01016     nsn   0.01239      2.017756255
Locus 10  2            0.011        -0.26295     sss   0.0062       1.774193548
Locus 11  1            0.006        -0.04688     nnn   0.00456      1.315789474
Locus 12  0            0.001        na           snn   0.00114      0.877192982
Locus 13  0            0.001        -0.18479     nsn   0.00707      0.141442716
Locus 14  0            0.001        -0.95912     nnn   0.00212      0.471698113
Locus 15  0            0.001        -0.13242     nnn   0.00335      0.298507463
Locus 16  1            0.002        -0.31015     ssn   0.00349      0.573065903
Locus 17  3            0.014        -0.05294     nsn   0.00943      1.484623542
Average   1.235294118  0.007        -0.21774625        0.00667      0.949977058

C. rattanii

Locus     Rmin         ρ=4Ner(bp)   Corr(r²,d)         Theta(θw)    ρ/θ
Locus 1   0            0            na           ns    0.00106      0
Locus 2   0            0            0.59982      nsn   0.00129      0
Locus 3   0            0            -0.67651     nsn   0.00164      0
Locus 4   0            0.001        -0.87309     nsn   0.00173      0.578034682
Locus 5   1            0.001        0.32755      nss   0.00203      0.492610837
Locus 6   na           na           na
Locus 7   1            0.001        0.01793      nnn   0.00351      0.284900285
Locus 8   0            0            0.2773       nss   0.00651      0
Locus 9   1            0.002        0.0015       nnn   0.00369      0.54200542
Locus 10  0            0            0.47032      snn   0.00169      0
Locus 11  0            0            0.50204      nsn   0.00247      0
Locus 12  na           na           na           na    na
Locus 13  0            0            0.39768      ssn   0.00177      0
Locus 14  na           na           na           na
Locus 15  0            0            0.39816      sss   0.00183      0
Locus 16  na           na           na           na
Locus 17  0            0            0.29861      nns   0.00118      0
Average   0.230769231  0.000384615  0.145109167        0.002338462  0.145965479

n, not significant; s, significant; na, not applicable. Rmin, minimum number of recombination events (Hudson and Kaplan 1985); ρ = 4Ner per bp (Fearnhead and Donnelly 2001); Corr(r², d), correlation between pairwise LD (r²) and distance between sites (d); θw, Watterson's theta (4Neu).
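The final column of each table above is the per-locus ratio of the population recombination rate ρ to Watterson's θ. As a rough illustration, the sketch below recomputes that ratio from the tabulated values and shows the standard Watterson estimator, θw = S / (a_n L) with a_n = 1 + 1/2 + ... + 1/(n-1). The function names and the inputs `segregating_sites`, `n_sequences`, and `n_sites` are hypothetical; the thesis does not state which software or exact per-site scaling produced the tabulated θw values, so this is a sketch under those assumptions.

```python
def watterson_theta(segregating_sites, n_sequences, n_sites):
    """Per-site Watterson estimator: S / (a_n * L), a_n = sum_{i=1}^{n-1} 1/i."""
    if n_sequences < 2 or n_sites <= 0:
        raise ValueError("need at least two sequences and a positive number of sites")
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * n_sites)


def rho_theta_ratio(rho, theta):
    """Ratio of recombination to mutation parameters; None when theta is zero."""
    return rho / theta if theta else None


if __name__ == "__main__":
    # Values taken from the C. linearis rows for Locus 1 and Locus 3.
    print(rho_theta_ratio(0.006, 0.00575))
    print(rho_theta_ratio(0.002, 0.01218))
    # Hypothetical inputs, for illustration only.
    print(watterson_theta(segregating_sites=3, n_sequences=3, n_sites=100))
```

Note that the Average row reports the mean of the per-locus ratios, which generally differs from the ratio of the average ρ to the average θw.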