The genomic and evolutionary analysis of floral heteromorphy in Primula
Jonathan Matthew Cocker
Thesis submitted for the degree of Doctor of Philosophy
University of East Anglia John Innes Centre September 2016
© This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with the author and that use of any information derived therefrom must be in accordance with UK Copyright Law. In addition, any quotation or extract must include full attribution.
Abstract
The genetic basis and evolutionary significance of floral heteromorphy in Primula has been debated for over 150 years. Charles Darwin was the first to explain the importance of the two heterostylous floral morphs, pin and thrum, suggesting that their reciprocal anther and stigma heights facilitate cross-pollination, and showing that only between morph crosses are fully compatible. This key innovation is an archetypal example of convergent evolution that serves to physically promote insect-mediated outcrossing, having evolved in over 28 angiosperm families.
Darwin’s findings laid the foundation for an extensive number of studies into heterostyly that contributed to the establishment of modern genetic theory. The widely accepted genetic model portrays the Primula S locus, which controls heterostyly and self-incompatibility, as a coadapted group of tightly-linked genes, or supergene. It is predicted that self-fertile homostyle flowers, with anthers and stigma at the same height, arise via rare recombination events between dominant and recessive alleles in heterozygous thrums. These observations have underpinned over 60 years of research into the genetics and evolution of heterostyly.
The Primula vulgaris genome assembly and associated transcriptomic and comparative sequence analyses have facilitated the assembly and characterisation of the complete S locus in this species. Here it is revealed that thrums are hemizygous not heterozygous: the S locus contains five thrum-specific genes which are completely absent in pins, which means recombination cannot be the cause of homostyles as previously believed. The studies also reveal candidate genes in Primula veris and other species, and have facilitated an estimation for the assembly of the S locus supergene at 51.7 MYA. These findings challenge established theory, and reveal novel insight into the structure and origin of the Primula S locus, providing the foundation for understanding the evolution and breakdown of insect-mediated outcrossing in Primula and other heterostylous species.
2
Contents
Acknowledgements 6 Abbreviations 7 List of publications 9 1 Introduction 10 1.1 Heterostyly in Primula 10 1.2 Heteromorphic self-incompatibility 13 1.3 The role of heterostyly 15 1.4 Morphology of the heterostylous floral morphs 17 1.5 Inter-morph transfer of pollen 18 1.6 Models for the evolution of heterostyly 20 1.7 Phylogenetic context 25 1.8 Potential applications for the study of heterostyly 27 1.9 The history of heterostyly 29 1.10 Molecular studies of floral heteromorphy 33 1.11 Next-generation sequencing in the study of heterostyly 34 1.12 Summary and research aims 42 2 Assembly and genomic analysis of Primula genomes 43 2.1 Relevant publications 43 2.2 Introduction 43 2.3 Methods 47 Genomic DNA and RNA-Seq paired-end read libraries 47 Genomic DNA paired-end read assemblies 49 Assembly validation 50 Repeat library construction and repeat masking 50 Alignment of proteins from related species 51 Annotation of genes 52 Functional annotation of predicted genes 52 Comparative analysis of P. vulgaris and P. veris genes 53 Analysis of orthologous gene groups 53 RNA-Seq differential expression analysis 54 2.4 Results 55 Analysis of paired-end reads 55 Assembly validation 57 Evaluation of the genespace captured in the assembly 63 Repeats in the Primula vulgaris genome 67 Gene annotations in the Primula vulgaris genome 69 Comparison of genes in P. vulgaris and P. veris genomes 70 OrthoMCL analysis of orthologous genes 73 Differential expression 75 2.5 Discussion 78 3 Annotation and characterisation of S-linked genes 82 3.1 Relevant publications 82
3
3.2 Introduction 82 3.3 Methods 88 Plant material 88 Differential gene expression between Oakleaf and wild-type 88 Gene model predictions for P. vulgaris KNOX (PvKNOX) genes 89 Generation of the PvKNOX phylogenetic tree 90 Identification of variant sites between Oakleaf and wild-type 90 Linkage analysis of PvKNOX candidate genes 90 BAC contig assembly 91 Repeat masking of the BAC contig assembly 92 Prediction of genes in the BAC contig assembly 92 Functional annotation of the BAC contig 93 3.4 Results 93 Prediction of Primula vulgaris KNOX-like (PvKNL) gene models 93 Characterisation of the PvKNOX gene family 95 Expression of PvKNOX genes in Oakleaf and wild-type P. vulgaris 97 Sequence comparison of PvKNOX genes in wild-type and Oakleaf 100 Differential expression between Oakleaf and wildtype P. vulgaris 102 Annotation of the BAC contig assembly 103 Oakleaf in the BAC contig assembly 104 Linkage analysis of Oakleaf candidates 104 3.5 Discussion 107 4 The structure of the Primula vulgaris S locus 113 4.1 Relevant publications 113 4.2 Introduction 113 4.3 Methods 116 Read depth across the S locus 116 RNA-Seq differential expression analysis 117 Analysis of thrum-specific genome regions 117 Detection of recombination in S locus flanking regions 118 Annotation of microRNAs (miRNAs) in the Primula vulgaris S locus 120 Similarity searches of S locus genes to the P. vulgaris genome 121 Repeat analyses of the P. vulgaris S locus 121 Analysis of intron sizes in the Primula vulgaris S locus 121 4.4 Results 122 Alignment of thrum-specific BAC 70F11 generates 455 kb assembly 122 Predicted genes in the 455 kb assembly 123 Genomic reads aligned to the 455 kb assembly 123 Functional evaluation of genes in the 278 kb thrum-specific region 127 Linkage of the 278 kb region to the S locus 129 Expression of genes at the S locus 131 Identification of thrum-specific regions in the P. vulgaris genome 133 Recombination in regions flanking the S locus 137 The annotation of miRNAs in the Primula vulgaris S locus 143 Similarity of genes at the P. vulgaris S locus to other genomic regions 145 Repetitiveness of the Primula S locus 147 Intron sizes at the Primula S locus 149
4
4.5 Discussion 151 5 Evolution and cross-species comparative analyses of the S locus 161 5.1 Relevant publications 161 5.2 Introduction 161 5.3 Methods 164 Primula veris S locus gene model curation 164 P. vulgaris and P. veris S locus gene model visualization 164 Primula veris S locus gene expression analysis 165 S locus genomic read coverage for P. veris and P. vulgaris 165 Bayesian relaxed-clock phylogenetic analysis 165 Selection of sequences and parameters 168 Inspection of alignment for sequence saturation 169 5.4 Results 169 Analysis of read depth across the 455 kb S locus region 169 Identification of S locus genes in Primula veris 171 Expression of Primula veris S locus genes 173 Phylogenetic analysis of GLO-GLOT divergence 175 Saturation analysis of B-function MADS-box genes 177 Validation of the Bayesian phylogenetic analysis 179 The comparative analysis of CFB flanking genes 185 5.5 Discussion 186 6 General discussion and conclusions 196 Bibliography 200
5
Acknowledgements
I would like to thank my supervisor Phil Gilmartin for facilitating my development as a scientist, providing support and encouragement throughout this doctoral project, and giving me the freedom to develop and discuss ideas that have led to the results presented herein. In particular, I thank Jinhong Li for countless discussions, support and the generation of numerous resources, without which much this work would not have been possible. I also thank fellow final year PhD student Olivia Kent for phylogenetic resources and support throughout, including numerous discussions. I thank all members of the Gilmartin lab and others at JIC who have supported this work. The collaborative nature of some elements to this project mean that I am indebted to a number of collaborators, including Jon Wright who helped me to develop as a bioinformatician and carried out many genomic-based analyses that support this work, Mark McMullan for support in phylogenetic analyses, and my third supervisor David Swarbreck and other colleagues at TGAC for input into genomic analyses. I also acknowledge the input of Cock van Oosterhout, whose supervision, enthusiasm and encouragement was invaluable throughout manuscript submission and many of the analyses conducted. I thank my mentors David Westhead and Peter Meyer at the University of Leeds, and others who led me in this direction.
The friends and colleagues I have connected with over the last four years have provided an inspiring and friendly environment in which to work, and in which to unwind. I thank all fellow PhD students, especially friends in my year, and those I have met outside of JIC. I thank my family for encouraging me to follow anything I set my mind to, and for providing a platform from which I was able to take those steps. I cannot thank them enough for the support they have provided over the years. Finally, I thank Emily for immeasurable support and reassurance, for walks and stargazing, and always being there to listen.
6
Abbreviations
A Androecium S locus allele AHRD Automated Assignment of Human Readable Descriptions ARF Auxin response factor as1 asymmetric leaves 1 ASGR Apomixis-specific genomic region BACs Bacterial artificial chromosomes CEGs Core eukaryotic genes CFB Cyclin-like F box cM Centimorgans CRT Cyclic reversible terminator DEF DEFICIENS DSB Double-stranded break ESS Effective Sample Size FPKM Fragments Per Kilobase of transcript per Million mapped reads G Gynocieum S locus allele GLO GLOBOSA GO Gene Ontology GSI Gametophytic self-incompatibility HGP Human Genome Project HR Homologous recombination HSPs High-scoring segment pairs ID Identity JL Jinhong Li JMC Jonathan Matthew Cocker JW Jon Wright KAT K-mer Analysis Toolkit k-mer sequence of k length KNOX KNOTTED-like homeobox LH Long homostyle LMP Long mate-pair LTR Long terminal repeat MCMC Markov Chain Monte Carlo miRNAs microRNAs MYA Million years ago NGS Next generation sequencing NHEJ Non-homologous end-joining ONT Oxford Nanopore Technologies OVCK Olivia Victoria Constance Kent P Pollen S locus allele PacBio Pacific Biosciences PCR Polymerase chain reaction PMG Philip Mark Gilmartin PP Pin parent PvKNL Primula vulgaris KNOTTED-like PvSTL Primula vulgaris SHOOT MERISTEMLESS-like
7
SFGL S locus Flanking Gene Left SFGR S locus Flanking Gene Right SH Short homostyle SI Self-incompatibility SNP Single nucleotide polymorphism SRY Sex-determining region Y SSI Sporophytic self-incompatibility TDF Testis-determining factor TEs Transposable elements TF Transcription factor TGAC The Genome Analysis Centre TP Thrum parent UTR Untranslated region WGS Whole genome sequencing
8
List of publications
Li, J.*, Cocker, J.M.*, Wright, J., Webster, M.A., McMullan, M., Ayling, S., Swarbreck, D., Caccamo, M., Oosterhout, Cv., Gilmartin, P.M. (2016) Genetic architecture and evolution of the S locus supergene in Primula vulgaris. Nature plants, 2: 16188.
Cocker, J.M.*, Webster, M.A.*, Li, J., Wright, J., Kaithakottil, G., Swarbreck, D., Gilmartin, P.M. (2015) Oakleaf: an S locus-linked mutation of Primula vulgaris that affects leaf and flower development. New Phytologist, 208: 149–161.
Li, J., Webster, M.A., Wright, J., Cocker, J.M., Smith, M.C., Badakshi, F., Heslop- Harrison, P., Gilmartin, P.M. (2015) Integration of genetic and physical maps of the Primula vulgaris S locus and localization by chromosome in situ hybridization. New Phytologist, 208: 137–148.
Cocker, J.M., Wright, J., Li, J., Swarbreck, D., Dyer, S., Caccamo, M., Gilmartin, P.M. (2017) The Primula vulgaris genome (in preparation).
* These authors contributed equally
9
1 Introduction
1.1 Heterostyly in Primula
The floral architecture and breeding habits of Primula have been studied for over 150 years (Darwin, 1862, Darwin, 1877). Charles Darwin explained the importance of precise floral organ positioning as a physical method of promoting insect-mediated outcrossing (Darwin, 1862, Darwin, 1876, Darwin, 1877, Gilmartin, 2015). Recent advancements in genomics technology coupled with classical genetics approaches are allowing many of the ideas initially put forward by Darwin to be investigated in a holistic manner (Cohen, 2010). Thus, a bioinformatics project focussing on heterostyly in Primula is distinct in that it combines the historical with the cutting-edge: the application of next-generation sequencing (NGS) innovations has revolutionised biological research (van Dijk et al., 2014).
In the phenomenon known as floral heteromorphy, or heterostyly, each primrose plant has one of two forms of flower, the “pin” (long-styled) or the “thrum” (short-styled) (Darwin, 1877, Richards and Barrett, 1992). Darwin showed that intra-morph “illegitimate” fertilizations result in a much-reduced seed set, producing around 20-50% of the seed that results from so-called “legitimate” pin-thrum or thrum-pin cross- fertilizations (Darwin, 1877). He concluded that “seedlings raised from such [illegitimate] unions” were “in some degree sterile, dwarfed, and feeble” and was aware of such effects in other species, going so far as to dedicate an entire volume to the subject, The Effects of Cross and Self-fertilisation in the Vegetable Kingdom (Darwin, 1876, Darwin, 1877). Floral innovations and associated interactions with pollinators have since captured the imagination of evolutionary biologists, with the genetic effect of reproductive systems and the high evolvability that results thought to be a driving force
10
behind the remarkable species-diversity of the angiosperms, outstripping that of their sister group more than 250-fold (de Vos et al., 2014, Puttick et al., 2015).
The development of the two heterostylous floral forms (distyly) in Primula is controlled by a coadapted linkage group, or “supergene”, known as the S locus (Gregory et al., 1923, Gilmartin, 2015), which through the work of Ernst (1955) and others, was predicted to comprise at least three tightly-linked genes (Lewis, 1954, Dowrick, 1956, Lewis and Jones, 1992): the dominant G (Gynocieum) allele represses style cell elongation, the P (Pollen) allele acts to increase pollen size, and the A (Androecium) allele promotes cell division in the anthers (Ernst, 1936c, Lewis and Jones, 1992, Webster and Gilmartin, 2006). Together, the dominant alleles of the three genes (GPA) result in flowers with a short style, large pollen, and anthers situated at the mouth of the corolla tube; this morphology is known as the thrum, so-called due to the anthers at the mouth of the floral tube resembling “the ends of weavers’ threads” (Darwin, 1877). The thrum is thought to be heterozygous at the S locus (S/s) with the three dominant genes linked in coupling (GPA/gpa). The pin form, with the protruding stigma resembling the head of a pin, is recognised as homozygous recessive (gpa/gpa) at the S locus (s/s) and has the opposite morphology: a long style, small pollen, and anthers situated midway- down the corolla tube.
In this way, the stigma and anthers of two floral morphs are in complementary positions (Figure 1.1). The reciprocal architectures suggest that pollen is transferred to the proboscis or body of an insect from the anthers of one form of flower in such a way that it is transferred to the stigma of the other (Darwin, 1877).
11
Figure 1.1 The reciprocal floral architectures of Primula vulgaris pin (s/s) and thrum (S/s) heterostylous flowers; stigma and anthers are indicated, as well as the genotype of each morph for the three tightly-linked genes predicted to lie at the S locus. Figure reproduced from Cocker et al. (2017).
In addition, an associated biochemical self-incompatibility (SI) system acts as a safeguard against self-fertilization. It is predicted that rare recombination events between the genes at the S locus results in flowers with the anthers and stigma situated at the same height (Charlesworth and Charlesworth, 1979a, Li et al., 2015a); these flowers are termed homostyles and are associated with a breakdown in self- incompatibility (Figure 1.2). The above predictions form the backdrop to the last 60 years of research into heterostyly.
12
Figure 1.2 Disruption of the S locus “supergene” results in homostyle flowers with anthers and stigma at the same height; the short homostyle in this photo is in a Hose in Hose background, a GLOBOSA overexpression mutant that results in conversion of sepals to petals (see Chapter 3). Figure reproduced from Li et al. (2016).
1.2 Heteromorphic self-incompatibility
The majority of flowering plants (~90%) are hermaphrodite, with flowers that have both male and female reproductive structures; as such, they frequently employ diverse mechanisms to prevent self-fertilization through the presence of tightly-linked genes that control the pollen and pistil identity of specific mating types (Allen and Hiscock, 2008, McCubbin, 2008, Barrett and Hough, 2012). In Primula, a di-allelic biochemical SI system reinforces the morphological constraints of the heterostylous floral forms by reducing self and intra-morph fertilization through the action of pin- and thrum-specific molecules that interact on or in the stigma (McCubbin, 2008); genetic evidence shows that the pollen and pistil specificities are controlled by separate genes (Dowrick, 1956, Kurian and Richards, 1997). The GPA locus could be extended to include two additional genes for pollen and style incompatibility, but it remains that these genes could be regulated by the S locus genes rather than linked to them (Barrett, 2008).
The pollen of each Primula morph can only fertilise the opposite form effectively (Bateson and Gregory, 1905). Nonetheless, the mechanism has been described as “leaky”, with self-incompatibility being stronger in thrum-thrum crosses than in pin, as
13
observed through comparison of the relative size of resulting seed sets (Darwin, 1877, Ganders, 1979, McCubbin, 2008). Fully uninhibited pollen tube growth occurs only in inter-morph crosses (McCubbin, 2008).
In contrast to heteromorphic SI, the more frequently occurring gametophytic self- incompatibility (GSI) and sporophytic self-incompatibility (SSI) mechanisms are so- called homomorphic SI systems where the flowers of the different mating types are morphologically indistinguishable (Charlesworth, 2010). These homomorphic SI systems are regulated by one or more multi-allelic SI loci (McCubbin, 2008). In contrast, distylous species are characterised as having only two incompatibility types; they are di-allelic (Charlesworth, 2010). There are also heterostylous species that are tri- morphic such as Lythrum, where there are three mating types associated with mid-, long- and short-styled morphs (tristyly), with the morphological features potentially under the control of two supergene loci, S and M (Richards and Barrett, 1992, Charlesworth, 2010).
It is possible to gain some insight into the nature of a species’ SI system based on the site of the incompatibility reaction. SSI systems generally show pollen inhibition on the stigma surface, whereas inhibition of incompatible pollen tubes typically occurs within the style in GSI (Allen and Hiscock, 2008); an exception is pollen rejection triggered by a stigma-specific glycoprotein in the GSI system of poppy (Hiscock, 2002). The site of the self-incompatibility reaction appears to differ between heterostylous species in different families but also between species of Primula, reportedly occurring on the stigmatic surface, within the stigma, style, or ovary, and even conflicting locations between distinct floral morphs of the same species (Dulberger, 1964, Shivanna et al., 1981, Wedderburn and Richards, 1990, Lewis and Jones, 1992, McCubbin, 2008, Cohen, 2010).
In thrums, the majority of intra-morph pollen fails to germinate on or in the stigma, consistent with SSI, whilst within-pin pollinations can infrequently result in self pollen tubes that penetrate the stigma and reach as far as the stylar transmitting tract (Darwin, 1877, Baker, 1966, Shivanna et al., 1981). This is not necessarily the result of multiple sites of inhibition. Shivanna et al. (1981) showed that there are multiple sites at which growth of the pollen tube ceases, but it could be that the incompatibility reaction does indeed occur on or in the stigma, but varies in effectiveness such that pollen tube
14
growth stops at different points, with the SI reaction occurring at different speeds (McCubbin, 2008).
Based on such studies, a frequent assertion is that heteromorphic SI might function in a sporophytic manner, with inhibition on the stigma surface (Baker, 1966, Gibbs, 1986, Cohen, 2010). However, another conclusion is that SSI might function in the short- styled thrum morph, and either sporophytic or gametophytic SI in the pin morph. It has even been suggested that multiple gene products could be involved in several different SI mechanisms that act in tandem (Golynskaya et al., 1976, Wedderburn and Richards, 1990, McCubbin, 2008).
Further studies in the heterostylous Luculia gratissima (Rubiaceae) suggest inhibition occurs at the stigma. Stevens and Murray (1982) showed that, after removal of the stigma, in pin styles 33% vs. 21% of pollen tubes from legitimate vs. illegitimate pollinations grew from the top of the style to the bottom, and in thrum 66% vs. 72%. This suggests that the female SI determinant is primarily located in the stigma of both morphs, since its removal appears to reduce inhibition of illegitimate pollen tubes, as seen in the comparable percentages above. This effect appears to be clearer in the thrum morph, however, and the possibility of a second site of inhibition remains in pin, where there is a larger difference in the percentages of pollen tubes growing; this could potentially result from expansion of the spatial and temporal expression of the s gene products (McCubbin, 2008). It remains that more research needs to be to done reveal the SI components themselves: despite concerted efforts to identify morph-specific genes and associated SI determinants, there is currently no molecular evidence to support the above proposals (McCubbin et al., 2006, Li et al., 2015a).
1.3 The role of heterostyly
The major drawback of a di-allelic self-incompatibility (SI) system, as recognised by Charles Darwin, is that it renders a plant incapable of fertilizing “half its brethren” (Darwin, 1877). In addition to intra-flower pollen deposition, the absence of cross- pollination means that 50% of the incoming pollen will be incompatible based on a typical 1:1 ratio of pin to thrum plants in the wild (Ornduff, 1979, Barrett and Shore, 2008, McCubbin, 2008). Lloyd and Webb (1992b) supported this point of view through
15
a quantitative model that suggests the problem is great enough such that di-allelic self- incompatibility could not have been selected for prior to the onset of heterostyly. In contrast, homomorphic SI systems comprise multiple mating-types controlled by multi- allelic loci, which implies that much of the pollen that is brought to a stigma will be compatible, therefore resulting in adequate compatible pollen for a good seed set (McCubbin, 2008).
Darwin suggested that heterostyly preceded self-incompatibility; he was forced to regard the evolution of self-incompatibility as an “incidental and purposeless” by- product of the constraints already imposed by heterostyly (Darwin, 1877). Indeed, there are numerous species in primarily heterostylous groups without a self-incompatibility system, but not vice versa (Lloyd and Webb, 1992a), suggesting heterostyly is not a redundant mechanism despite the proposition that a self-incompatibility system in heterostylous species could seemingly act to guarantee outcrossing (Barrett and Shore, 2008). Furthermore, heterostyly is present in at least 28 families and is thought to have evolved independently on over 23 occasions (Barrett, 1992, Lloyd and Webb, 1992a, Naiki, 2012), which implies that it must confer some sort of selective advantage.
The purpose of heterostyly then, appears to be in the promotion of cross-pollination, such that pollen (the male gametophyte) is not wasted through deposition on incompatible stigmas, thereby promoting the male component of fitness through the reduction of “pollen wastage” (Lloyd and Webb, 1992b, Barrett and Shore, 2008). This suggests heterostyly results in a greater proportion of the pollen deposited on a stigma being compatible (Lloyd and Webb, 1992b), therefore mitigating against the detrimental effect that a di-allelic SI system would have on paternal fitness in preventing fertilization with half the population.
Yeo (1975) also proposed that heterostyly could prevent “stigma clogging” effects whereby the access of legitimate pollen might be hindered by illegitimate pollen deposited on stigmas; this hypothesis receives some support in evidence of reduced seed set in some species through application of illegitimate and legitimate pollen on the same stigmas, but other analyses suggest that it has no obvious role in the evolution of distyly (Piper and Charlesworth, 1986, Lloyd and Webb, 1992b).
16
If heterostyly is in place, and functioning to reduce “pollen wastage” and any other potentially disadvantageous effects of illegitimate pollination, the female component of fitness could then be increased by safeguarding against self-fertilization and reducing inbreeding depression through the action of heteromorphic self-incompatibility. In this way, neither mechanism is redundant. Darwin’s early adaptive explanation for the function of heterostyly in promoting insect-mediated cross-pollination between floral morphs is supported, not in spite of, but as a complementary mechanism that seemingly makes the unprecedented evolution of di-allelic SI possible (Darwin, 1877, Lloyd and Webb, 1992b, Barrett and Shore, 2008).
1.4 Morphology of the heterostylous floral morphs
In addition to dimorphic stigma and anther positioning, a range of ancillary features occurs in different Primula species. This includes differences in style cross-sectional area, stigma size and shape, pollen size and amount, and corolla mouth size (Dowrick, 1956, Webster and Gilmartin, 2006, McCubbin, 2008). In Primula species the anther filaments are fused to the corolla tube (McCubbin, 2008). Differential anther positioning is brought about by increased cell division in the lower corolla tube of thrum flowers. This is accompanied by wider cells in the upper corolla that presumably accommodate the above change by producing a larger corolla mouth (Webster and Gilmartin, 2006).
It has been suggested that developmentally interconnected traits, such as the distinct morphological features pertaining to the style and stigma, or those associated with the pollen or anthers, are probably controlled by single genetic factors (Richards, 1997, Webster and Gilmartin, 2006, McCubbin, 2008). Correspondingly, the observation of recombinants that show disruption of the linkage between specific features and their associated floral morph led to a model that describes three genes at the S locus in the order GPA (Lewis and Jones, 1992). In some cases recombinants with altered ancillary features would be difficult to observe, but there is no such issue in relation to pollen grain size: the thrum pollen grains are just over two-fold larger in volume than those in pin flowers, whilst the production of pollen in pin is slightly more than two-fold greater than in thrum (Darwin, 1877, Dowrick, 1956, Ornduff, 1979, Piper and Charlesworth, 1986, Barrett and Shore, 2008).
17
The importance of the above features in the operation of heteromorphic SI is debated, with some suggestions that pollen size and stigmatic papillae may be important (Dulberger, 1975, Darwin, 1877, McCubbin, 2008). Darwin (1877) proposed that the larger pollen of the thrum might provide greater energy reserves suitable for pollen tube growth to the base of the elongated pin style. However, this is refuted by Kurian and Richards (1997) who describe a mutant with a mix of small and large sized pollen that retains thrum pollen specificity in the pollination of pin stigmas. In some species, pollen only adheres to the stigma in inter-morph pollinations, resulting in suggestions that stigmatic papillae length might play a role in the SI reaction, but it seems this is more likely to be the result of biochemical rather than mechanical interaction (Richards, 1997, McCubbin, 2008).
Despite much debate on their potential roles, the remarkably consistent appearance and convergent evolution of the various morph-specific traits across diverse angiosperm species suggests that many of these features may be important in the function of heterostyly (McCubbin, 2008). It seems likely that the morph-associated morphological characters do not directly participate in the active self-recognition and rejection reaction between the pollen and pistil of the reciprocal morphs; perhaps these ancillary features are the result of coadaptation between the interacting surfaces, for the purpose of maximising legitimate and minimising illegitimate transfer of pollen in subtle ways, through the pleiotropic effect of genes at the S locus. It is intuitive to suggest that reciprocal anther and stigma positioning might bring about cross-pollination, the role of the ancillary features might be to further enhance this in support of precise pollinator interactions (Darwin, 1877, Lloyd and Webb, 1992a, McCubbin, 2008).
1.5 Inter-morph transfer of pollen
The distinct difference in pollen size between the two heterostylous morphs provides a direct means of assessing pollen loads. The vast majority of heterostylous species undergo biotic (insect) pollination (Lloyd and Webb, 1992). Darwin physically inserted the proboscises of dead bees, as well as bristles and needles, into the corolla tubes of Primula flowers in an attempt to ascertain whether heterostyly might result in morph- specific positioning of pollen on an insect body during nectar feeding (Darwin, 1877). In agreement with the positions of the anthers in the two floral morphs, he observed that
18
the large thrum pollen was largely positioned near the base, and the small pin pollen near the extremity of the proboscis, with some degree of intermingling of pollen types in between.
In support of Darwin’s hypothesis of insect-mediated cross-pollination (Darwin, 1877), others have subsequently shown differential deposition of the two pollen types onto different parts of insect bodies through examination of insects captured whilst foraging on heterostylous species, including those in the genus Fagopyrum, Pulmonaria and Cratoxylum (Rosov and Screbtsova, 1958, Olesen, 1979, Lewis, 1982, Lloyd and Webb, 1992b) as well as the tristylous species Pontederia cordata (Barrett and Wolfe, 1986, Lloyd and Webb, 1992b).
It has been suggested that the analysis of stigmatic pollen loads only moderately supports Darwin’s theory, with excess illegitimate pollen observed on stigmatic surfaces when examining intact flowers in some species (Ganders, 1979, Piper and Charlesworth, 1986, Lloyd and Webb, 1992b). However, it is pointed out that intra- flower pollination would inherently cause more pollen of the same morph to be deposited on a stigma than expected by random pollination (Ganders, 1979, Piper and Charlesworth, 1986, Lloyd and Webb, 1992b). This suggests that analysis of intact flowers says little about pollen flow between plants: intact flowers alone are an unsatisfactory resource for statistically ascertaining the level of disassortative pollination; that is, pollination between flower types (Piper and Charlesworth, 1986, Lloyd and Webb, 1992b). To this end, a number of studies including both intact and emasculated flowers of various species (e.g. Primula) revealed a reduction in the proportion of own-morph pollen deposited on the stigmas of emasculated flowers, suggesting that intra-flower pollination does indeed contribute significantly to the observed pollen load on stigmas (Ganders, 1974, Ganders, 1976, Piper and Charlesworth, 1986, Lloyd and Webb, 1992b).
Lloyd and Webb (1992) reanalysed data on pollen loads for stigmas of both intact and emasculated flowers for both floral morphs of the distylous species Jepsonia heterandra (Ganders, 1974), accounting for the observation that pin stigmas receive more pollen than thrum stigmas in total, perhaps due to the elevated stigma being more accessible in the precise entry and exit paths of pollinators during nectar feeding (Ganders, 1979, Lloyd and Webb, 1992b, Stone, 1995, Barrett and Shore, 2008). In doing so, the
19
probability of a pollen grain of each type being deposited on a stigma of each type was analysed, removing the effect of unequal total flower and pollen numbers, and considering the pin and thrum as either maternal or paternal parent in separate crosses. If emasculated flowers are considered and the distorting effects of self-pollination therefore removed, there is an overall twofold greater probability of legitimate transfer in Jepsonia heterandra, thus supporting Darwin’s cross-pollination hypothesis (Lloyd and Webb, 1992b). In the reanalysis of pollen load data for the tristylous Pontederia cordata (Barrett and Glover, 1985) it is shown that all three morphs are more likely to receive legitimate pollen grains, and all three pollen types are more likely to be deposited on legitimate stigmas (Lloyd and Webb, 1992b) thus confirming the above findings.
Piper and Charlesworth (1986) used a natural population of Primula vulgaris to analyse factors that could promote the evolution of distyly which were considered in the theoretical framework set out by Charlesworth and Charlesworth (1979). These factors include selection for disassortative pollination, and selection for a reduction in self- pollination, stigma clogging, or pollen wastage. The analysis revealed significantly more legitimate than illegitimate pollen on emasculated flowers, which supports disassortative pollination; they also showed that pin and thrums receive a fifth less self- pollen than long homostyles, suggesting heterostyly also reduces self-pollination to a degree (Lloyd and Webb, 1992b). The small pollen saving effect found was proposed to be biologically insignifciant, and stigma clogging is thought to be unimportant (Piper and Charlesworth, 1986). In summary, the results of various studies taken together with similar considerations to the above confirm that heterostyly promotes cross-pollination across diverse angiosperm families; there are some suggestions for the precise mechanism of transfer, how insect-mediated outcrossing might be physically achieved, and the consequences of pollinator foraging behaviours (Piper and Charlesworth, 1986, Lloyd and Webb, 1992b, Stone, 1995, Harder and Wilson, 1998, Richards et al., 2009, Keller et al., 2014, Zhou et al., 2015).
1.6 Models for the evolution of heterostyly
In defining a species as heterostylous, the presence of two or more floral forms with anthers and stigma at complementary heights (reciprocal herkogamy) is considered
20
essential (Ganders, 1979, Lloyd and Webb, 1992a). Lloyd and Webb (1992a) suggested that the more varied ancillary features (section 1.4) probably evolved after reciprocal stigma and anther positioning. For this reason, less attention is paid to these traits, and the main focus of theoretical models for the evolution of heterostyly is in establishing the order of the reciprocal architectures and SI, as well as determining the intermediate states in between. It is notable, however, that di-allelic SI is absent in species that have reciprocal herkogamy much more often than the reverse scenario, which means SI need not be present for a species to qualify as heterostylous. This perhaps emphasises the importance of reciprocal herkogamy as a means for promoting cross-pollination (section 1.5) (Lloyd and Webb, 1992a).
In line with the considerable depth of historical research conducted into heterostyly over the course of the last 150 years (section 1.9) there has been a broad range of predictions for the origins of this breeding system (Ernst, 1936a, Mather and De Winton, 1941, Baker, 1966, Charlesworth and Charlesworth, 1979b, Lloyd and Webb, 1992a). Darwin (1877) suggested that reciprocal herkogamy evolved first due to selection for increased precision in reciprocal pollen transfer, thus bringing the stigma and anthers into complementary positions in the two morphs. He postulated that self-incompatibility and other ancillary traits followed reciprocal herkogamy as an incidental by-product of coadaptation between compatible pollen and style. Mather and De Winton (1941) suggested heterostyly and SI arose together in a single step, whilst Ernst (1936a) proposed a series of mutations starting from a homostyle ancestor, but no reasons were given for how such mutations would spread in the population due to a selective advantage (Charlesworth and Charlesworth, 1979b). The majority of other postulates support the evolution of SI prior to reciprocal herkogamy (Charlesworth and Charlesworth, 1979b).
There are two major competing quantitative models considering the evolution of reciprocal herkogamy and di-allelic self-incompatibility; that of Charlesworth and Charlesworth (1979b) and Lloyd and Webb (1992a). Lloyd and Webb (1992a) suggest that heterostyly evolved prior to SI (see below). In contrast, Charlesworth and Charlesworth (1979b) proposed a model that supports two ancestral steps in the evolution of self-incompatibly from a long homostyle-like progenitor that has a long style and elevated anthers. Firstly, a transition to an effectively male-sterile plant based
21
on mutation to a new pollen type; this pollen mutant is unable to fertilise progenitor- type individuals or itself, and would spread as a result of increased female fitness due to the elimination of self-fertilization. Secondly, a mutation to a new stigma type in the progenitor that is compatible with the new pollen type, thus establishing two mating- types that are reciprocally compatible. This model requires that the second gene is linked to the first, and thus encompasses linkage constraints. For this reason the model is seen to support the supergene hypothesis, such that a “translocation bringing the genes together would be selected for” (see section 1.9) (Charlesworth and Charlesworth, 1979b). It was proposed that a stigma-height dimorphism would then emerge, perhaps due to further reduction in self-fertilization or reduction of pollen wastage. This was followed by the subsequent establishment of heterostyly via a change in anther position in response to selection for disassortative pollination.
The homostyle progenitor is proposed based on a number of monomorphic species within the predominantly heterostylous taxa Plumbaginaceae (Baker, 1966). Baker (1966) revealed species representing possible stages in the evolution of distyly based on a taxonomic analysis of this group, including some homostylous species that retain self- incompatibility (Charlesworth and Charlesworth, 1979b, Ganders, 1979). From this, Baker (1966) concluded that the dimorphisms were imposed sequentially from a homostylous ancestor, with di-allelic self-incompatibility being the first step; a view that is quantitatively supported (Charlesworth and Charlesworth, 1979b). In the Charlesworth and Charlesworth (1979b) model the long homostyle is favoured over the short homostyle as a progenitor based on reciprocal crossing behaviour with pins and thrums (see section 1.9) (Charlesworth and Charlesworth, 1979b). This occurs as most homostyle species retain self-incompatibility reactions; homostyles with no self- incompatibility reaction have been interpreted as being the primitive state (Baker, 1966).
In contrast to Darwin’s postulate, proposals such as the above are often based on the view that the evidence for promotion of cross-pollination by heterostyly is insubstantial, and that it confers a secondary selective force as compared to the strong effect of SI in preventing selfing (Lloyd and Webb, 1992a). This view was challenged by Lloyd and Webb (1992b) (section 1.5) and receives support from a reduction of self-pollen seen on the stigmas of homostyles (Piper and Charlesworth, 1986). It has often been assumed
22
that the ancestral morph was homostylous based on the occurrence of homostyle species in predominantly heterostylous clades, as indicated above (Baker, 1966, Charlesworth and Charlesworth, 1979b, Lloyd and Webb, 1992a). However, it is reasonable to suggest that these self-pollinating species may be so-called “secondary homostyles” that have derived from founding heterostylous species on multiple occasions, perhaps due to selection for reproductive assurance based on unreliable pollinator service (Mast et al., 2006, Barrett and Shore, 2008). Furthermore, where SI reactions are absent they may have arisen due to relaxation of the selective pressure to maintain them in homostyle species that are predominantly self-fertilizing, rather than this being the primitive state (Ganders, 1979).
Lloyd and Webb (1992a) proposed a contrasting model based on an ancestor with a pin- like approach herkogamous morphology that comprises stigma(s) situated above the anthers. This ancestral morph was suggested based on a comprehensive sampling of character states across heterostylous families, and occurs in addition to other characteristic features that were recognised, such as having a so-called depth-probed flower with a floral tube, as described below (section 1.7). The opposite arrangement with anthers above the stigma (reverse herkogamy) is much less common, whilst approach herkogamy occurs relatively frequently in the angiosperms as compared to heterostyly (Webb and Lloyd, 1986). The model supports the evolution of reciprocal herkogamy from this approach herkogamous ancestor, which in turn preceded the evolution of self-incompatibility. The first step is postulated to be a stigma-height dimorphism as in the Charlesworth and Charlesworth (1979b) model, which produces two herkogamous forms in one step from a herkogamous ancestor, as opposed to two steps from a homostylous ancestor. This may be one reason why SI was proposed as the first step in the Charlesworths’ model that assumes a homostylous ancestor.
In the Lloyd and Webb (1992b) proposal the initial morphologies of the two herkogamous morphs comprise stigma and anthers at different heights, but they are not in exactly complementary positions. The quantitative model suggests that the stigma- height dimorphism could evolve due to the selective advantage of legitimate- over illegitimate-pollinations, but not from a reduction in self-fertilization, suggesting Darwin’s cross-pollination hypothesis provides a powerful selective force. The foundation of a stigma-height dimorphism was proposed to allow further physiological
23
and morphological characteristics to evolve more readily, SI may subsequently evolve depending on the importance of selecting against the genetic costs of self-fertilization (Lloyd and Webb, 1992b). The importance of SI in heterostylous groups appears to vary considerably, with some species being self-compatible, others showing differing strengths in the incompatibility reaction between morphs, and rigid SI in all morphs (Barrett and Cruzan, 1994, Barrett and Shore, 2008). Instead of a homostylous precursor, the presence of a reciprocal herkogamy arrangement that promotes cross- pollination is shown to make the selection for self-incompatibility much more likely because it reduces the inherent problem of a di-allelic system in that each plant is unable to fertilize half the population (Lloyd and Webb, 1992b). In this model, additional dimorphic features such as an anther-height polymorphism or ancillary traits can be selected for if they decrease self-fertilization or increase legitimate pollination (Lloyd and Webb, 1992b).
The Lloyd and Webb model supports Darwin’s proposal that reciprocal herkogamy preceded SI, but it is clear that SI is not an incidental by-product; the two systems are complementary (section 1.3). Lloyd and Webb (1992a) extended the taxonomic assessment of heterostylous groups to show that the situation in Plumbaginaceae is atypical; this family represents one instance of di-allelic SI in the absence of heterostyly, as opposed to numerous heterostylous species that are self-compatible in an array of other angiosperm families (Ganders, 1979, Lloyd and Webb, 1992a). Furthermore, stigma-height polymorphisms are frequent in non-heterostylous groups, but there are no such occurrences for di-allelic SI, suggesting this was not the ancestral state prior to heterostyly and further acknowledging the supporting role of heterostyly in the evolution of di-allelic SI (Ganders, 1979, Lloyd and Webb, 1992a).
Despite the mathematical plausibility of the selective events above, theoretical models rely heavily on their assumptions (Ganders, 1979); changing the assumptions could dramatically affect the outcomes. For example, Lloyd and Webb (1992b) point out that selection against self-fertilization is likely to have played a more important role than assigned to it if the ancestors were not herkogamous. In order to distinguish between the theoretical evolutionary models, refinement of the character states associated with heterostylous families and an in-depth analysis of their sister clades is required (Lloyd and Webb, 1992b). This, alongside data-driven analysis of evolutionary events at the S
24
locus, will determine the order of the ancestral morphological and physiological states in the evolution of heterostyly and homostyly: only once the complete S locus sequence and heterostyly-determining genes are uncovered will this be possible.
1.7 Phylogenetic context
Floral heteromorphy is present in over 28 angiosperm families, with the most well studied distylous genuses including Fagopyrum, Primula, Turnera and Linum (Figure 1.3) (Cohen, 2010). The characteristic features of heterostylous families reveal that reciprocal herkogamy is more likely to evolve in species with actinomorphic flowers and a floral tube, typically with pollen presented peripherally and stigmas centrally, such that pollinators contact the reproductive organs in succession when feeding on nectar at the base of the flower (depth-probed). In contrast, species with numerous free carpels or stamens, exposed nectar, irregular corollas, or open dish-shaped flowers are in most cases thought to lack the precision required for the precise pollinator contacts (Darwin, 1877, Ganders, 1979, Lloyd and Webb, 1992a, Barrett and Shore, 2008). The taxonomic distribution of heterostyly is therefore strongly non-random (Lloyd and Webb, 1992a).
25
Figure 1.3 Phylogeny of Primula vulgaris and other angiosperm species; species with sequenced and assembled genomes are shown in bold, heterostylous species are underlined. Figure reproduced from Cocker et al. (2017).
Primula has probably attracted the most attention as a model of heteromorphic flower development. However, perhaps due to the agronomic importance of Fagopyrum esculentum (buckwheat), much research has also gone into the identification of the genes controlling the development of morph-specific features in this so-called pseudocereal (Yasui et al., 2012). Due to the meticulous work carried out in Primula towards determining the number of S locus genes and their order, other heterostylous taxa are also discussed in terms of control by a supergene (Barrett and Shore, 2008, Cohen, 2010). The evidence for supergene control of heterostyly in families other than the Primulaceae is limited due to a lack of unambiguous recombinants. In agronomically important Fagopyrum species this is perhaps due to the difficulties in carrying out association studies in crop plants due to their complex breeding histories (Flint-Garcia et al., 2003, Pasam et al., 2012, Yasui et al., 2012). Fagopyrum also represent an exception to the character states identified above, with species having open dish- or bowl-shaped flowers (Björkman et al., 1995).
26
In around ten heterostylous species the inheritance pattern of distyly is the same as in Primula; in some cases the dominance relationship is reversed such that the pin allele of the S locus is dominant to thrum (Lewis and Jones, 1992, Ornduff, 1992, Barrett and Shore, 2008). However, this does not imply that there is a supergene comprising multiple tightly-linked genes, and it remains a possibility that the “supergene” in these species comprises a single gene as in the control of butterfly mimicry for example (Barrett and Shore, 2008, Kunte et al., 2014) (see Chapter 4 discussion). In crosses between distylous and long homostyle Fagopyrum and Turnera species, thrum dominance is maintained, thus revealing allelic relationships equivalent to those at the Primula S locus (Barrett and Shore, 1987, Lewis and Jones, 1992, Fesenko et al., 2006, Yasui et al., 2012). The observed long homostyle species could therefore be the result of recombination at the S locus in thrum (GPA/gpa) to give a gPA haplotype as in Primula, but this interpretation only supports a supergene comprising two genes due to a lack of recombinants with dissociation between pollen size and anther height (Barrett and Shore, 2008).
1.8 Potential applications for the study of heterostyly
Primula are important horticultural crops in Europe, the United States and Japan, with a 100 million euro market in Germany alone (Hayta et al., 2016). These plants are produced mainly for autumn and early spring bedding and pot plant markets; as a cool- season crop they offer good return for low inputs and high plant densities, in growing temperatures that are unsuitable for most ornamental crops (Karlsson, 2001, Hayta et al., 2016). Breeding programs focus on flower size and colour, improved germination rate, and flowering time (Karlsson, 2001). From a genomics perspective, the sequencing and assembly of the Primula vulgaris genome could ultimately facilitate genome- assisted breeding (Varshney et al., 2005, Varshney et al., 2013, Kole et al., 2015), whilst an understanding of the genes controlling heterostyly could allow manipulation of floral architectures to further aid breeding programs.
The identification of the complete S locus will reveal the genes controlling both style and anther height. Perhaps these genes could allow precise floral engineering in a number of agriculturally important crop plants: self-pollinators could be engineered into out-breeders through the separation of anthers and stigma to aid hybrid seed set; or
27
anthers and stigma could be brought closer together in order to facilitate self- pollination. In addition, the breakdown to homostyly in heterostylous clades offers the opportunity to discover how and why self-fertilization evolves; an important consideration in light of declining populations of insect pollinators (Klein et al., 2007, Potts et al., 2010). It is thought that the transition of heterostylous species from an outcrossing system to inbreeding is a result of absent or unreliable pollen vectors such that self-fertilization is required to ensure seed set (Baker, 1966, Ganders, 1979). This decline is of particular concern for crops that rely heavily on insect pollinators to set seed or maximise fruit set (Richards, 2001). It is estimated that 20-50% of extant plant species are autogamous, including many crop plants. However, the loss of self- incompatibility is not sufficient to guarantee self-fertilization as changes in floral morphology are often required to facilitate self-pollination (Chen et al., 2007). In many crops, spatial separation of the anthers and stigma can mean that autogamy and fruit set is uncertain in the absence of animal visitation (Richards, 2001). The quality of fruit can also depend on the size or number of seeds; seed set is often reduced without insect- mediated cross-pollination due to inbreeding depression (Charlesworth and Charlesworth, 1987, Richards, 2001). The group of plants for which this is a concern includes species of agricultural importance such as oilseed rape (Brassica napus), flax and linseed (Linum usitatissimum), sunflowers (Helianthus annuus), cotton (Gossypium spp.), soya (Glycine max), strawberry (Fragaria x ananassa), aubergine (Solanum melanocarpum), pepper (Capsicum annuum), tomato (Lycopersicum esculentum), olives (Olea europea) and grapes (Vitis vinifera) (Richards, 2001). In the case of poor pollination by insect pollinators, costly techniques to enforce fruit set may have to be applied such as the introduction of bees, and in extreme cases plants may require hand pollination, which is particularly expensive (Westerkamp and Gottsberger, 2000, Richards, 2001). The manipulation of floral architectures could bring anthers and stigma closer together to aid pollination.
Linum grandifolium presents flowers with a stigma-height dimorphism but no difference in anther positioning (Darwin, 1863, Barrett, 2010, Ushijima et al., 2012). The closely related Linum usitatissimum is important for flaxseed oil production (Wang et al., 2012). This species is predominantly self-pollinating and has hypogynous rather than heterostylous flowers (Jhala et al., 2011), but perhaps its cultivation might benefit from a greater understanding of the breeding systems in this genus. Interestingly, the
28
outbreeding wild relatives of the autogamous cultivated tomato Solanum lycopersicum are approach herkogamous, whilst the cultivated tomato itself has recessed stigmas; flowers in which the stigma is recessed relative to the anthers are more likely to self- pollinate (Chen et al., 2007).
Finally, the identification of the S locus and its constituent genes in Primula would find a more direct application in an agronomic setting by providing a platform for directed studies into isolation of the heterostyly-determining genes in Fagopyrum esculentum. Buckwheat species have a short vegetation period (~60 days) and are important as low input crops due to higher yield returns in sandy, stony, and nitrogen poor soils, as well as being naturally gluten-free and having an excellent nutrient profile (Kreft and Luthar, 1990, Alvarez-Jubete et al., 2010). It has been estimated that heterostyly evolved independently on 23 occasions across diverse angiosperm families (Lloyd and Webb, 1992a). Despite the potential for disparity in the underlying genes, the isolation of the S locus in any species would represent a significant leap forward.
1.9 The history of heterostyly
The primrose has fascinated botanists for over four centuries. Perhaps the first attempt to describe the two forms of flower was documented in the late 16th Century by Clusius (Clusius, 1583). Clusius thought the distinct floral morphs were separate varieties (or species) and is known as a botanist who paid a great deal of attention to the intricate details of floral morphologies (van Dijk, 1943, Gilmartin, 2015). It is perhaps surprising then, that his illustrations of the primrose do not show a dissected thrum flower, suggesting he was seemingly unaware of the anthers hidden in the depths of the flower (van Dijk, 1943, Gilmartin, 2015). Two centuries later, the primrose featured centrally in studies of species definition spearheaded by Carl Linnaeus, who cited the plant as an example of where flower enthusiasts concentrate on small details that “no sane botanist” would consider important, such as those between the two forms of heterostylous flower (Linnaeus, 1792, Gilmartin, 2015). The result was that the floral morphs were thereafter recognised as being of the same species; a train of thought which carved the path for Darwin, who sought to answer the inevitable question of why those differences existed at all.
29
Darwin noted the “balancement of long and short pistils and stamens” in 1860 and wrote to his close-friend and confidant Hooker to explain his observations (Darwin, 1860, Gilmartin, 2015). He subsequently read a paper before the Linnaean Society on the dimorphic condition and remarkable sexual relations of Primula species, in which he contested the general consensus amongst botanists that heterostyly represented “mere variability” (Darwin, 1862, Gilmartin, 2015). In Darwin’s landmark book, The Different Forms of Flowers on Plants of the Same Species (Darwin, 1877), the true significance of the reproductive organs’ reciprocal positions was recognised. Darwin reported that within-morph (or “illegitimate”) crosses yielded a much-reduced seed set in comparison to “legitimate” crosses between morphs, and likened heterostyly to the male and female sexes in humans (Darwin, 1877).
Darwin’s first idea was that Primula were tending towards a dioecious condition, the long styled plants with longer pistils, rougher stigmas and smaller pollen were more feminine and thus produced more seed, and the short styled plants with shorter pistils, longer stamens and larger pollen grains were more masculine (Darwin, 1877). On rare occasions the transition from heterostyly to dioecy is thought to have occurred (Ganders, 1979). Darwin concluded, however, that the morphological differences between pin and thrum were in place as part of a system to promote insect-mediated outcrossing: “the benefit which heterostyled dimorphic plants derive from the existence of the two forms is sufficiently obvious, namely, the intercrossing of distinct plants being thus ensured…”
Darwin somewhat remarkably foresaw that the benefit of such a system was in reducing the propensity for inbreeding. This, despite the fact that he presumably knew nothing of Mendel’s studies at the time, and thus had no insight into the genetic basis of inbreeding depression that results from recessive deleterious traits being exposed (Charlesworth and Charlesworth, 1987). In fact, he demonstrated the negative effects of self- fertilization through his writings in an entire volume on the subject, as highlighted at the start of this introduction (Darwin, 1876).
It may be true that Darwin was aware of Henslow’s earlier drawings of the two forms of primrose flower in 1826 (Kohn et al., 2005), and still before that Curtis depicted whole and dissected forms of the two floral morphs in Flora Londinensis, using the descriptive terms pin-eyed and thrum-eyed 100 years before Darwin (Curtis, 1777-1798); but it was
30
Darwin who first sought to explain the true significance of the two morphs in promoting cross-pollination where others had previously only observe (Gilmartin, 2015).
The rediscovery of Gregor Mendel’s work in the early 20th Century sparked a chain of genetic studies into the genetic basis of heterostyly (Bateson, 1902, Bateson and Gregory, 1905). Bateson and Gregory (1905) were the first to define floral heteromorphy in Primula as a classic example of Mendelian inheritance, defining the dominance relationship of thrum and pin at the S locus. This was just months before Bateson coined the term “genetics” (Ornduff, 1992).
Despite little attention being paid to the ancillary features of heterostyly in determining its evolutionary origins (section 1.6), perhaps the main fascination of this system is the Mendelian segregation of a whole suite of phenotypic characters akin to the separation of the sexes (Ornduff, 1992). Pellow (1928), in an annual meeting of the John Innes Horticultural Institute, proposed that the morphological characters of the two heteromorphs might be controlled by more than one tightly-linked gene. Haldane (1933) expanded this to a series of closely linked genes. Darlington and Mather (1949) subsequently coined the term supergene to describe such a region, as well as suggesting that the S locus could lie on the largest pericentric chromosome based on an excessive number of S-linked phenotypic traits (Darlington, 1931).
The extensive inheritance studies of Alfred Ernst ultimately led to a proposal of three genes at the S locus, thus establishing the genetic architecture of distyly in Primula (Ernst, 1928a, Ernst, 1955). Ernst (1928a) described anomalous combinations of phenotypic characters and deduced that these abnormal plants were the result of mutation. It has been concluded, however, that the rate of appearance for these novel floral phenotypes is in keeping with recombination between the three genes thought to lie at the S locus rather than mutation, hence the prediction that homostyles are the result of recombination as described at the start of this introduction (Lewis and Jones, 1992, Barrett and Shore, 2008). Darwin described homostyles as early as 1877 (Darwin, 1877). There are natural populations in the UK near Sparkford, Somerset, and in the Chilterns, that captured the attention of population geneticists (Crosby, 1940, Fisher, 1949, Dowrick, 1956, Bodmer, 1960, Piper et al., 1984). In one case, Crosby (1949) counted 15,555 plants from heterostyled populations of Primula vulgaris, all of which were pin or thrum, suggesting the occurrence of homostyles is very rare, perhaps due to
31
recombination suppression at the S locus (Mather and De Winton, 1941, Barrett and Shore, 2008).
It is interesting that throughout Ernst’s studies of Primula species (Ernst, 1955, Ernst, 1957), the vast majority of “secondary homostyle” species are long homostyles; although some short homostyle species are known, long homostyles have been reported much more frequently (Charlesworth and Charlesworth, 1979a, Ganders, 1979). This is very surprising, as crossing-over should produce long and short homostyles in equal frequency (Ganders, 1979). However, if homostyles are almost completely autogamous, then they can act as male parents (pollen donors) but are unlikely to be female parents (Charlesworth and Charlesworth, 1979a, Ganders, 1979). In this case, it has been suggested that the establishment of an inbred homozygous population of long homostyles would be favoured over short homostyles as in outcrossed matings the short homostyle functions as recessive and the long homostyle functions as dominant. This is based on the following: pollen from the long homostyle (gPA/gPA) is compatible with the pin (gpa/gpa) stigma exclusively, with crosses resulting in a gPA/gpa (long homostyle) genotype. In contrast, short homostyle pollen (GPa/GPa) is only compatible with the thrum (GPA/gpa), suggesting they are less likely to spread and become established in a heterostylous population (Dowrick, 1956, Charlesworth and Charlesworth, 1979a, Ganders, 1979). The dominance relationships at the S locus determine the relative frequency of short and long homostyles. This conclusion was supported by the quantitative model of Charlesworth and Charlesworth (1979a) which predicts that a self-fertile homostyle with the dominant allele for pollen type could spread into and eventually replace a heterostylous population. This view is apparently supported due to long homostyles being more frequent in the Primulaceae, and short homostyles more frequent in heterostylous families where the dominance relationship of pin and thrum is reversed. Indeed, the short homostyle phenotype is noted in the literatue as having been fixed only once in Primula (P. septemloba var. minor) (Richards, 2014).
This broad and fundamental body of work highlights the primrose as an early model for genetics; a cornerstone in the establishment of the Darwinian evolutionary synthesis, featuring prominently in the development of Mendelian and population genetics, including research focusing on the generation of linkage maps, patterns of inheritance,
32
epistasis, supergenes and polymorphic equilibria by the likes of W. Bateson, R.A. Fisher, J.B.S. Haldane, C.D. Darlington, C.B. Bridges, A. Ernst, K. Mather, and D. Lewis (Darlington and Mather, 1949, Barrett and Shore, 2008). The breadth and detail of the above work, including meticulous inheritance studies in Primula over the course of 30 years by A. Ernst (Ernst, 1928a, Ernst, 1957, Lewis and Jones, 1992), suggests that heterostyly is one of the most well studied plant sexual polymorphisms (Barrett and Shore, 2008, McCubbin, 2008). Despite this, 150 years since Darwin first explained the importance of heterostyly in the promotion of cross-pollination, the molecular basis of the system is still to be uncovered (Darwin, 1862).
1.10 Molecular studies of floral heteromorphy
Efforts towards isolating the genes at the S locus have focused on examining differential expression between morphs, and the generation of linkage maps and associated BAC assemblies. The most well studied species in this regard are Primula vulgaris (Primulaceae), Turnera subulata (Turneraceae), Fagopyrum esculentum (Polygonaceae) and Linum grandiflorum (Linaceae) (Cohen, 2010).
In Primula recent investigations include the analysis of differentially expressed floral genes using subtractive hybridisation, and the characterisation of S locus-linked sequences and mutant phenotypes, with a view towards generating genetic and physical maps spanning the S locus (Manfield et al., 2005, McCubbin et al., 2006, Li et al., 2007, Li et al., 2008, Li et al., 2010, Li et al., 2011b, Yoshida et al., 2011). In Linum grandiflorum (heterostylous flax) analysis of morph-specific cDNA fragments and proteins revealed morph-related genes including a potential downstream target of the S locus that reduces style and anther height when overexpressed in Arabidopsis thaliana (Ushijima et al., 2012).
In Turnera subulata significant progress towards isolating the key genes was made using a high-resolution genetic map (Labonne et al., 2008). This facilitated the assembly of three BAC contigs that enabled the positional cloning of the recessive s haplotype through the use of X-ray deletion mutants that delimit the S locus region (Labonne et al., 2009, Labonne et al., 2010, Labonne and Shore, 2011). The application of similar approaches in Fagopyrum esculentum revealed S-linked markers and enabled
33
construction of a genetic map and associated BAC assembly (Yasui et al., 2004, Konishi et al., 2006, Ota et al., 2006, Yasui et al., 2008). In addition, several proteins specifically expressed in long or short styles were identified (Miljus-Dukic et al., 2004). The analysis of a short-style chromosome deletion mutant that produces long-styled flowers identified a candidate gene named S-ELF3 that was lost in the deletion (Yasui et al., 2012).
The classical genetic and molecular-based studies above provide a significant resource going forward that will facilitate the identification and characterisation of the heterostyly-determining genes, whilst the identification of differentially expressed genes has potentially revealed downstream targets of the S locus.
1.11 Next-generation sequencing in the study of heterostyly
It is perhaps surprising given the impressive number of focused molecular studies described above, that prior to the analyses described in this thesis, the genes responsible for the development of heterostyly had not been uncovered in any species; such efforts suggest that the identification of the so-called “supergene” represents a significant challenge. The application of next-generation sequencing (NGS) and associated technologies has revolutionised biological research and enabled complex tasks such as the assembly of the ~17 Gb hexaploid bread wheat (Triticum aestivum) genome and population level human genome sequencing through the 1000 Genomes Project (van Dijk et al., 2014, Goodwin et al., 2016). The technologies used in such studies represent a relatively untapped potential in the identification of the heterostyly determining genes.
Following the completion of the Human Genome Project (HGP) in 2003 (Collins et al., 2003), high-throughput sequencing has led to a 50,000-fold reduction in the cost of human genome sequencing, with arguably the greatest issues going forward leaning towards ethical and data storage considerations rather than sequencing costs (Goodwin et al., 2016). This makes next-generation sequencing a viable option for fundamental studies in non-model plants (Unamba et al., 2015). The HGP utilised Sanger sequencing; although this approach is costly and low-throughput it produces relatively long sequencing reads (up to 1000 bp) that allow genuine overlaps to be distinguished from repetitive sequences present in multiple copies throughout the genome. NGS
34
approaches on the other hand produce characteristically short reads; even accurately sequenced repetitive reads will be present in multiple copies throughout the genome (Goodwin et al., 2016).
Despite the above drawback, NGS platforms developed by Illumina currently dominate the sequencing industry, accounting for the largest market share and boasting a comparatively high-throughput and low per-base cost in comparison to alternative systems (van Dijk et al., 2014, Goodwin et al., 2016). In the cyclic reversible terminator (CRT) technology used in such platforms both ends of random DNA fragments are ligated with adapter oligonucleotides that bind to an optically transparent chip or “flow cell”. The high density of primers on the flow cell results in a massively parallelised approach, where each fragment forms a bridge with an adjacent primer to allow PCR amplification, and the build-up of dense clusters of identical fragments that permit sequencing. CRT is a so-called “sequencing by synthesis” approach where four fluorescently labelled deoxynucleoside triphosphates (dNTPs) are washed over the flow cell with DNA polymerase and incorporated into a growing chain; after laser excitation, the emitted fluorescence is imaged and the fluorophore that doubles as a reversible terminator is cleaved to allow the sequencing cycle to continue with incorporation and identification of the next base (Schatz et al., 2010, Koren et al., 2012). The maximum paired-end read length of Illumina HiSeq 2500 is 250 bp with the lower throughput Rapid Run Mode, whilst the high-throughput mode only produces reads up to 150 bp (Rhoads and Au, 2015).
In contrast to Illumina CRT, some third-generation sequencing technologies do not require cyclical hybridisation and wash steps as they aim to directly detect the composition of single-stranded DNA molecules without the need for costly sequencing reagents or base-incorporation steps (Goodwin et al., 2016). These platforms are based on the transit of the DNA molecule through a pore that allows sequencing based on an electric current or optical signal that is characteristic of a particular DNA sequence (Clarke et al., 2009). The single flowcell MinION from Oxford Nanopore Technologies (ONT) is the first commercially available platform for such a technology and produces considerably longer reads than Illumina CRT (up to 200 kb), but so far it is comparatively expensive and has much lower throughput. In lieu of the above, long mate-pair (LMP) libraries with large insert sizes can be used to bridge gaps in an
35
assembly by traversing repetitive regions. The sequencing of LMP libraries is carried out in the same way as for paired-end reads with short insert sizes, but preparation of a library comprising long fragments of DNA can prove problematic (Koren et al., 2012). This therefore suggests that the production of long reads by the so-called third- generation sequencing methods is a technology worth pursuing.
Future proposals include the massively parallelised ONT PromethION platform that is reported to comprise 48 improved flow cells that could produce ~2-4 Tb of sequencing data in a two day run. These figures would put this system in potential competition with the ultra high-throughput Illumina HiSeq X, currently the highest throughput platform available (Schadt et al., 2010, Goodwin et al., 2016). In response, Illumina’s synthetic long-read sequencing platform combines existing sequencing technologies with barcoded wells that contain long fragments (~8-10 kb) that are amplified and sheared to ~350 bp for pooled short read sequencing that retains a record of reads associated with each long fragment (Goodwin et al., 2016). Pacific Biosciences (PacBio) currently offer the most widely used (third-generation) platform for the sequencing of single molecules in real time, based on fluorescently labelled probes; unlike the ONT system this approach relies on sequencing reagents and optical equipment for base detection, but it does not require a pause between each base-read as it leverages the split-second delay during base-incorporation (Schadt et al., 2010, Rhoads and Au, 2015, Goodwin et al., 2016). This platform therefore allows faster runs than CRT approaches, and it is continually improving in terms of throughput and read lengths, but the per-base-pair costs and high single-pass error rates (11-15%) means data is often augmented with short reads from Illumina CRT that have an overall accuracy rate of > 99.5% (Nagarajan and Pop, 2013, Goodwin et al., 2016, Mostovoy et al., 2016). The PacBio RS II system boasts average read lengths of over 10 kb in single-pass runs, but multiple (15) passes of the circular DNA molecule are required to improve accuracy to > 99%, therefore prompting a trade-off between read length and accuracy based on limits stipulated by the lifetime of the DNA polymerase; longer sequences have lower accuracy (Rhoads and Au, 2015). NGS technologies are increasingly becoming a routine part of biological research, but the developing technologies described above are yet to achieve the necessary throughput or accuracy to compete with the high-throughput platforms from Illumina that are currently used in most genomics studies (Nagarajan and Pop, 2013, Goodwin et al., 2016, Jackman et al., 2016).
36
For the assembly of short sequencing reads associated with cost effective next- generation sequencing platforms, a number of application-specific algorithms were developed. These approaches are based on the de Bruijn graph data structure, first used in the EULER assembler (Schatz et al., 2010). Some of the most popular tools are Velvet, SOAPdenovo, ABySS and ALLPATHS (Butler et al., 2008, Zerbino and Birney, 2008, Simpson et al., 2009, Luo et al., 2012, Nagarajan and Pop, 2013). The first assemblies of long Sanger sequencing reads were accomplished using algorithms that find overlaps between reads allowing for sequence differences (1-10%), then merging sequences with the longest overlap to form contiguous sequences or “contigs” (Schatz et al., 2010) These simple “greedy” algorithms where all reads are compared with each other are not tractable for the millions of short reads generated for the assembly of large genomes (Schatz et al., 2010).
In de Bruijn graph based tools, reads are divided into sequences of k length (k-mers) that form the nodes of the graph, with k-mers that overlap by k-1 bases connected by edges such that a path through the graph can be traversed to form contigs that end on the boundaries of repeats (Schatz et al., 2010). The advantage of this approach is that k-mer overlaps representing perfect complementarity between reads are implicitly captured by the graph, rather than explicitly computed, saving a substantial amount of computing time (Simpson et al., 2009, Jackman et al., 2016). The method reveals sequencing errors as “tips” (dead-ends) or “bubbles” in the graph, with “forks” further highlighting repetitive or duplicated regions that may be simplified into one sequence or removed. In dividing reads into a set of k-mers pre-processing and error correction is possible, which results in an easier assembly process; a contig with a large number of associated reads can also be flagged as repetitive based on an erroneously high coverage (Chaisson and Pevzner, 2008, Zerbino and Birney, 2008, Schatz et al., 2010, Nagarajan and Pop, 2013).
Following contig assembly, information gleaned from paired-end reads is incorporated; these reads are generated from sequencing the DNA fragments from both ends, with reads separated by the average fragment (insert) size of the library. If a unique path in the assembly graph is found to connect reads at the end of a read-pair with length equal to the fragment size, then the path is thought to be correct; heuristic measures help to facilitate the intensive task of finding paths of pre-defined length (Jackman et al., 2016).
37
This approach enables contigs to be linked together across repetitive regions less than the insert size to generate “scaffolds” through the merging of linked contigs (Schatz et al., 2010, Jackman et al., 2016).
Recent advancements in genome assembly have focused on improving memory efficiency through new approaches to construct and process assembly graphs, thus facilitating the analysis of larger datasets without the need for high-performance computational infrastructure (Simpson et al., 2009, Nagarajan and Pop, 2013, Goodwin et al., 2016, Jackman et al., 2016). In addition, the incorporation of complementary information derived from multiple sequencing technologies and/or mate-pair data is increasingly being leveraged (Nagarajan and Pop, 2013, Jackman et al., 2016). The central challenge in genome assembly is resolving repetitive regions (Schatz et al., 2010). The use of LMP (paired-end read) libraries generated from libraries with larger fragment sizes enables larger gaps in short paired-end read assemblies to be bridged with “N”s representing the insert size between reads, but as a result of the difficulties in getting DNA to circularize efficiently as fragments get longer, these LMP libraries often comprise too little DNA to cover the genome at a depth comparable to that of paired- end read libraries generated from short fragments, and there is also a higher associated variance in their length distribution (Collins and Weissman, 1984, Schatz et al., 2010, Jackman et al., 2016).
If further linkage information in the form of a physical map is available, this also permits ordering and orientation of contigs; one widely applied approach to this end is optical mapping, which involves stretching linear DNA fragments across a glass surface or in a “nanochannel” array such that locations of restriction digest sites can be visualised and fragment lengths estimated with the aid of dye or fluorescent labelling (Schwartz et al., 1993, Dong et al., 2013, Tang et al., 2015, Jackman et al., 2016). The cost of generating a physical map is often prohibitive, but use of NGS generated paired- end reads in combination with LMP reads is now attainable for a non-model species such as Primula vulgaris, using libraries with a mixture of insert sizes can be very effective (Schatz et al., 2010, Jackman et al., 2016).
The most direct application of the above technologies in this project is for the whole genome sequencing (WGS) and assembly of Primula vulgaris. WGS offers the most comprehensive view of genomic information (Goodwin et al., 2016), whilst
38
transcriptomics applications through RNA sequencing (RNA-Seq), and sequence capture approaches, provide an alternative that requires much less throughput per sample for applications where it is neither practical nor necessary to sequence an entire genome (ten Bosch and Grody, 2008, Goodwin et al., 2016). For example, whole- exome sequencing is designed to selectively enrich for coding regions that make up only 2% of the human genome, but contain 85% of known disease-causing variants (Goodwin et al., 2016). The capture of a specific set of sequences is often achieved with in-solution capture methods (e.g. MYbait oligo-capture), where a biotinylated probe is designed for hybridization to an a priori genomic target, followed by fragment retrieval with streptavidin-coated magnetic beads, and a wash step to discard fragments that were not captured. PCR-amplification and sequencing of the captured fragments can then be carried out (Ávila-Arcos et al., 2015, Warr et al., 2015).
RNA-Seq reads are derived from the high-throughput sequencing of fragments prepared from total mRNA content in cells, comprising a mixture of different spliceoforms and transcripts at various levels of abundance that represent the genes that are transcribed at the time of sampling. The resulting reads can be used for the prediction of gene structures, identification of splice variants, and due to the correlation between the number of RNA-Seq reads and the abundance of the transcript that the reads were derived from, whole-genome expression profiling (Trapnell et al., 2009, Trapnell et al., 2012). RNA-Seq assembly approaches can be de novo (no reference assembly required) as in Trinity, trans-ABySS, Oases, and SOAPdenovo-Trans (Robertson et al., 2010, Grabherr et al., 2011, Schulz et al., 2012, Xie et al., 2014), using a similar k- mer based approach to the graph-based WGS assemblers above; or referenced based, in which case a high-quality reference genome assembly is desirable (Grabherr et al., 2011, Chang et al., 2015).
NGS tools produce hundreds of millions of short-sequencing reads; these reads may contain sequencing errors, or genuine mismatches, insertions and deletions as a result of genomic variation (Trapnell et al., 2012, Dobin et al., 2013). Phred quality scores provide a measure of read quality in the FASTQ sequence files produced using NGS technologies, and are calculated using several parameters related to peak shape and peak resolution of DNA sequencing traces at each base following NGS. These parameters are used to detect corresponding quality scores in a large lookup table. The lookup table
39
contains information on how often the correct base was called for a particular set of parameter values where the correct base was known. This was carried out for a test set comprising tens of thousands of known sequences (Ewing et al., 1998). The probability value obtained from the table is converted into a Phred score, with a higher score indicating a smaller probability of error (Cock et al., 2010).
In DNA resequencing efforts, the aim is to align NGS reads to a previously assembled reference genome, often in order to capture information on Single Nucleotide Polymorphisms (SNPs); the mapped reads are assigned Phred-scaled mapping quality values. The most popular alignment tools for this purpose are Bowtie, BWA and SOAP (Li et al., 2009b, Langmead and Salzberg, 2012, Li, 2013). These tools use the Burrows Wheeler Transform (BWT) which allows sequences to be more easily compressed such that repeats in an indexed genome can be collapsed (Li and Durbin, 2009). This means reads can be aligned against each copy of a repeat simultaneously instead of against each one individually, which makes these algorithms more efficient; sequences can be retrieved as the BWT works in such a way that the compression can be reversed (Nong and Zhang, 2007, Ruffalo et al., 2011). There are continual advancements to improve the efficiency of such tools (Liu et al., 2012, Xie et al., 2014). There is often a compromise between mapping accuracy and computational resource requirements, with the speed of read alignment increasingly representing a serious bottleneck in throughput (Dobin et al., 2013). Following short-read alignment, SNPs can be called using tools such as SAMtools and GATK if there is a sufficient number of high-quality reads covering the SNP position (Li et al., 2009a, DePristo et al., 2011).
The above tools are not sufficient for the mapping of RNA-Seq reads as transcriptome information is reorganised into non-contiguous exons that must be spliced together to produce mature transcripts (Dobin et al., 2013). For this reason “splice-aware” alignment methods have been developed to discover splice sites, including Tophat, GSNAP, STAR and HISAT (Wu and Nacu, 2010, Trapnell et al., 2012, Dobin et al., 2013, Kim et al., 2015). The problem of aligning sequences with mismatches to the genome is shared with DNA resequencing, but this must be extended to consider splicing in RNA-Seq alignment tools in order to fully reconstruct intron and exon structures based on spliced RNA molecules. This is a particular problem where a read maps to a splice site such that a short (< 10 bp) overhang is left. The above issues are
40
further confounded due to families of related genes that share similar sequences, thus resulting in multiple mapping locations that are seemingly correct (Dobin et al., 2013). Following alignment of RNA-Seq reads, tools such as Cufflinks or Scripture can be used to assemble the reads into transcripts by merging sequences with overlapping alignments and reconstructing splice variants (Guttman et al., 2010, Trapnell et al., 2012). The alignment of sequencing reads is another scenario where the long sequencing reads offered by third-generation sequencing platforms would provide great advancement in capabilities; particularly in the reconstruction of haplotype information and RNA connectivity.
Transcriptome assembly approaches are often the platform for detection of differential expression between genes using tools such as CuffDiff, DESeq and edgeR (Anders and Huber, 2010, Robinson et al., 2010, Trapnell et al., 2013). This task, and indeed RNA- Seq assembly itself, is complicated by non-uniform coverage between and within transcripts, and across samples with different library sizes. For this reason normalization schemes are applied, for example transformed quantities such as FPKM (Fragments Per Kilobase of transcript per Million mapped reads) are used to normalize the counts for differing library sizes and transcript sizes; long transcripts will have more associated reads than a short transcript with equal expression (Soneson and Delorenzi, 2013). RNA-Seq based transcriptome assemblies can be also used as evidence for intron positions and exon-noncoding region boundaries in genome annotation software such as AUGUSTUS and MAKER2, alongside other sources of support such as proteins from related-organisms and repeat annotations (Stanke and Morgenstern, 2005, Holt and Yandell, 2011, Hoff et al., 2016). RNA-Seq coverage alone is not sufficient to predict protein coding regions; algorithms incorporating statistical models for gene prediction usually require a training step where a set of high quality example genes defines species specific parameters (Hoff et al., 2016).
Following assembly and the generation of gene annotations and associated RNA-Seq datasets, functional annotation and cross-species comparative analyses are often used to facilitate evolutionary analyses and assign potential functions to genes; such studies can be based on local alignments of protein and nucleotide sequences using tools such as Exonerate, BLAT or BLAST+ (Kent, 2002, Slater and Birney, 2005, Camacho et al., 2009). These programs first scan for short matches (words), and then extend them into
41
larger alignments termed high-scoring pairs (HSPs). In contrast, multiple sequence alignment (MSA) tools such as MUSCLE or Clustal seek to maximize the sum of sequence similarities across the entire length of a set of sequences, so-called global alignment, with penalties for gaps (Edgar, 2004, Edgar and Batzoglou, 2006, Larkin et al., 2007). MSA is the starting point for the inference of phylogenetic distances and protein structure prediction (Edgar and Batzoglou, 2006).
For whole genome assembly, annotation, differential expression analysis, and associated comparative analyses in Primula vulgaris, a combination of the methods described in this section will form the first focus. In the study of heterostyly, the use of these approaches could help to bridge gaps in BAC assemblies, whilst RNA-Seq approaches will provide the precision to define differentially expressed sequences between pin and thrum, including those at the S locus.
1.12 Summary and research aims
Research into primroses and heterostyly has straddled major advances in science over the last century and a half. First, is an era characterised as descriptive, based on increasingly meticulous observations of floral morphologies, and the fine balance between this and Linnaean thinking (Clusius, 1583, Curtis, 1777-1798, Linnaeus, 1792, Gilmartin, 2015); second is a focus on Darwin’s explanation of the detrimental effects of self-fertilization, and the adaptive value of heterostyly in promoting outcrossing and generating novel variation as a substrate for natural selection (Darwin, 1862, Darwin, 1877, Li et al., 2016). Following this was the rediscovery of Mendel’s work, with a central role for the primrose in the subsequent advancement and synthesis of subjects related to genetics and the principles of Darwinian evolution (Bateson and Gregory, 1905, Bridges, 1914). Finally, the development of molecular-based technologies has accelerated progress towards isolation of the S locus genes (Labonne and Shore, 2011, Yasui et al., 2012, Li et al., 2015a) (section 1.10). This thesis is situated at the intersection between the molecular-era and a new phase of genomics-based research. The aim is to present an array of assemblies and associated genomic resources that have contributed towards the isolation and characterisation of the elusive S locus supergene, as well as an understanding of its evolutionary origin and maintenance.
42
2 Assembly and genomic analysis of Primula genomes
2.1 Relevant publications
Cocker, J.M., Wright, J., Li, J., Swarbreck, D., Dyer, S., Caccamo, M., Gilmartin, P.M. (2017) The Primula vulgaris genome (in preparation).
2.2 Introduction
The study of floral heteromorphy in Primula dates back over 150 years. It was Charles Darwin, in his book The Different Forms of Flowers in Plants of the Same Species, who first explained the importance of the system in promoting outcrossing (Darwin, 1877). Darwin described how each primrose plant had one of two forms of flower; the pin or the thrum, with the anthers and stigma situated in reciprocal positions, such that insect- mediated cross-pollination between the two floral morphs is physically promoted. In the majority of Primulaceae species this functions alongside a di-allelic self-incompatibility (SI) system termed heteromorphic SI, which acts to reduce self-fertilization (Lewis, 1954, Dowrick, 1956, Lewis and Jones, 1992).
Darwin was aware of the “inferiority” of seedlings resulting from self-fertilization in terms of their reduced height and vigour (Darwin, 1876), and proposed that the primary purpose of heterostyly was in preventing these detrimental effects (Darwin, 1877). This particular adaptation of floral reproductive structures is a remarkable example of the evolutionary innovations that angiosperm species have undergone in order to promote outcrossing, often involving coevolution with insect pollinators; the resulting genetic diversity may facilitate adaptation to changing environments due to favourable
43
combinations of alleles, thus mitigating extinction and accelerating species diversification (Dodd et al., 1999, de Vos et al., 2014). Floral innovations are thought to have enabled this group of plants to become the most diverse on earth, with over 350,000 living species classified into 416 families (Dodd et al., 1999, Paton et al., 2008, de Vos et al., 2014, The Angiosperm Phylogeny, 2016).
The apparent adaptive value of heterostyly in the promotion of outcrossing marks it as one of the most fascinating examples of convergent evolution, having evolved independently on at least 23 occasions, with representative species in 28 families of flowering plants (Lloyd and Webb, 1992a). The Primula genus sits within the Primulaceae family of the order Ericales; a highly evolved sub-clade of the Asterids lineage, within which the majority (90%) of species are heterostylous (Richards and Edwards, 2003). Like the closely related heterostylous species Primula veris (Nowak et al., 2015), the Primula vulgaris genome has 11 chromosome pairs (2n =2x = 22), with estimates by flow-cytometry placing the genome size at either 459 Mb (Temsch et al., 2010) or 489 Mb (Siljak-Yakovlev et al., 2010), which is comparable to the 479.22 Mb predicted for Primula veris (Siljak-Yakovlev et al., 2010).
The draft genome of P. veris was recently published (Nowak et al., 2015), with suggestions for the potential basis of heterostyly based on S locus-linked sequences previously identified by Li et al. (2011b). However, it did not facilitate the assembly of the S locus, nor was complete linkage to the heterostylous floral morphs shown for any of the assembled sequences. The assembly covers only 301.8 Mb (63%) (Nowak et al., 2015) of the estimated 479.22 Mb genome. In addition, the gene annotation and differential expression studies are based on RNA-Seq datasets without biological replicates: samples are only from leaf and flower tissue, as opposed flowers, leaves, roots, fresh seed, and seedlings in our analysis of P. vulgaris. To proceed with analyses of heterostyly in Primula, an improved reference assembly for the Primulaceae would be a valuable resource.
The comparison of assembly size with the predicted size of the genome based on flow- cytometry approaches is a useful gauge of genome-completeness, and is one reason why the NG50 measure of genome assembly quality is preferred over the N50 (Earl et al., 2011, Bradnam et al., 2013). The N50 is an often used metric that describes the contig length above which 50% of the total genome assembly content is contained, whereas the
44
NG50 is the contig length above which 50% of the total predicted size of the genome is contained; hence the latter takes into account the actual size of the genome being sequenced as opposed to the assembly size, which may fall somewhat short of this (Earl et al., 2011). These measures are an improvement over comparisons using the average contig size due to the high frequency of short contigs in many genome assemblies (Parra et al., 2009). The assembly of short sequencing reads from non-haploid organisms that are highly heterozygous due to widespread polymorphism often leads to a highly fragmented assembly with a total size larger than estimated (Small et al., 2007, Pryszcz et al., 2014, Pryszcz and Gabaldón, 2016). This suggests that the P. veris assembly could feasibly represent less than 63% of the genome content, with much of the polymorphic content from a heterozygous read library likely to be represented by alternate contigs that artificially inflate the size of the assembly. The use of k-mer frequency abundance graphs can provide a visual depiction of the heterozygosity of a genome by fragmenting genomic reads into k sized sequences (k-mers) and determining the number of distinct k-mers appearing in the genome at each frequency. Since a heterozygous genome will result in polymorphic regions being represented by two sets of alternate sequences, k-mers unique to these regions will have a frequency that is roughly half that in homozygous regions. This results in a distribution of k-mer frequency abundances containing two peaks, one with double the average k-mer frequency of the other (Liu et al., 2013). Alongside the N50 and NG50 metrics, this is another method that is useful for the validation of whole-genome assemblies (Mapleson et al., 2016). Further to this, genome assembly-completeness can be assessed by mapping genes known to be present in the majority of eukaryotic species. For this purpose, sets of core eukaryotic genes (CEGs) and associated methods for ascertaining their presence in a genome assembly have been developed: highly paralogous genes are removed from the CEG dataset to reduce false positives when trying to accurately identify orthologues (Parra et al., 2007, Simão et al., 2015). This chapter will employ the above methods for validation of the Primula genome assemblies described herein.
In the assembly of the potato (Solanum tuberosum) genome, for example, considerable effort was taken to overcome the key issue of heterozygosity by using a doubled- monoploid form of potato that can be derived through tissue culture approaches (Paz and Veilleux, 1999, Xu et al., 2011). Here, our chosen reference assembly for P. vulgaris is generated from a long homostyle P. vulgaris plant derived from an in-bred
45
population that originates from the Wyke champflower population in Somerset, UK (Crosby, 1949). This plant was selfed for five generations in the hands of the PMG lab (Jinhong Li, personal communication), but is likely to have undergone further selfing prior to collection. In contrast to pin or thrum plants, homostyle flowers are characterised by the display of anthers and stigma at the same height (Crosby, 1940); the long homostyle presents both of these organs at the mouth of the flower. In such an individual, there is a breakdown in the reciprocal positions of the reproductive structures and associated heteromorphic SI system, such that the long homostyle plant is self-fertile. This allowed an inbred and presumably highly-homozygous line to be generated for use in this study: the homozygosity of this individual will be explored using k-mer frequency abundance plots (Mapleson et al., 2016). Self-pollination of homostyle flowers will produce offspring with greater allelic homozygosity than obligate out-crossing pin or thrum plants. For genome assembly, greater homozygosity reduces fragmented contig assemblies and gene models, as well as duplicate redundant contigs and gene paralogues that could result from highly polymorphic regions of the genome (Pryszcz & Gabaldón, 2016). The use of this individual will therefore avoid problems of heterozygosity in the genome: alongside comprehensive P. vulgaris RNA- Seq datasets from multiple tissues, this will facilitate the generation of a high-quality genome assembly, which should enable the annotation of a more complete geneset than that of the draft P. veris genome assembly (Nowak et al., 2015), as well as contributing to the advancement of comparative analyses within the Primulaceae.
Efforts since Darwin’s work (Darwin, 1862, Darwin, 1877) have centred on defining the Mendelian inheritance of the genes responsible for heterostyly (Bateson and Gregory, 1905, Gregory et al., 1923, Dowrick, 1956), generating linkage maps around the controlling locus based on classical genetics approaches (Bridges, 1914, Altenburg, 1916, De Winton and Haldane, 1931, Ernst, 1936c), and the assembly and annotation of a BAC assembly initiated by genetic markers that co-segregate with the pin and thrum phenotypes (Li et al., 2011b, Li et al., 2015a); as described in Chapter 3 of this thesis. The combination of genetics and genomics promises to unravel the molecular basis of heterostyly in Primula by building on the above studies; this chapter presents the foundation of this work through multiple Primula genome assemblies and associated genomic analyses.
46
2.3 Methods
Genomic DNA and RNA-Seq paired-end read libraries
Plant material is from the population of Primula plants maintained by the Philip Mark Gilmartin (PMG) lab; for the inbred self-fertile P. vulgaris long homostyle, DNA was originally isolated from the wild population of long homostyles at Wyke Champflower, Somerset, UK (Crosby, 1940). DNA and RNA preparation was carried out by Jinhong Li (JL) prior to genomic DNA and RNA-Seq library preparation at The Genome Analysis Centre (TGAC) (now Earlham Institute) using standard Illumina protocols. Paired-end sequencing of read libraries (Table 2.1) was carried out with either Illumina HiSeq2000 or HiSeq2500 at TGAC, as described in Li et al. (2016). Genomic DNA was isolated from leaf tissue, and RNA from various tissues as listed in Table 2.1. The mean insert-sizes of the read libraries were determined (by TGAC) using an Agilent 2100 Bioanalyzer, which analyses wells on a chip based on the principles of gel- electrophoresis (http://www.genomics.agilent.com/).
47
Library Material Type Mean insert size (bp) Read count LIB2558 Long homostyle Genomic 522 140114901 LIB5215 Long homostyle Genomic 4131 40450009 LIB5216 Long homostyle Genomic 6675 54234033 LIB5217 Long homostyle Genomic 8818 53977718 LIB3565 Short homostyle Genomic 464 168866674 LIB1732 Pin parent Genomic 423 205079089 LIB1167 Thrum parent Genomic 368 147913358 LIB1474 Thrum parent Genomic 9780 225676870 LIB1730 Pin pool Genomic 349 197538574 LIB1731 Thrum pool Genomic 180 204029422 LIB3564 P. veris thrum Genomic 556 179871647 LIB4735 Root (pin & thrum) RNA 199 77754060 LIB4736 Fresh seed (pin & RNA 205 63843125 thrum) LIB4737 Seedlings (pin & RNA 235 80545495 thrum) LIB8234 Pin flower (1) RNA 296 39027602 LIB8235 Pin flower (2) RNA 278 26497129 LIB8236 Pin flower (3) RNA 318 31792278 LIB8237 Pin flower (4) RNA 284 34838987 LIB8238 Thrum flower (1) RNA 296 30318958 LIB8239 Thrum flower (2) RNA 295 32575413 LIB8240 Thrum flower (3) RNA 269 23971121 LIB8241 Thrum flower (4) RNA 286 38550763
Table 2.1 DNA and RNA-Seq paired-end read libraries used for genome assembly and transcriptomic analyses. Numbers in brackets indicate replicates for RNA-Seq libraries; flower replicate libraries were generated from 15-20 mm P. vulgaris buds. Read count = number of paired-end reads in the library.
48
Genomic DNA paired-end read assemblies
Genomic paired-end read assemblies listed in Table 2.2 were carried out by Jonathan Matthew Cocker (JMC) or Jon Wright (JW) as listed. To generate the LH_v2 assembly, the long homostyle paired-end reads (LIB2558) were assembled with SOAPdenovo (v2.04) by JW (k-mer length (-K) 81), followed by scaffolding (k-mer length (-k) 41) with long mate-pair (LMP) libraries (LIB5215, LIB5216, LIB5217) and removal of contaminated contigs. All other paired-end read assemblies (Table 2.2) were generated by either JMC: SH (short homostyle), PP (pin parent), and VT (P. veris thrum), or JW: TP (thrum parent) and LH_v1 (long homostyle), using ABySS (v1.3.4) (Simpson et al., 2009), and scaffolded (where applicable) with SOAPdenovo (v2.04) (Luo et al., 2012), as described in Li et al. (2016). For the short homostyle (SH_v2) and pin parent (PP_v2) assemblies, k-mer lengths of 85 (-k 85) and 71 (-k 71) respectively, were specified for contig assembly, followed by scaffolding with the 9 kb thrum parent LMP reads (LIB5217) (prepare command options: -K 85 (SH), -K 71 (PP), map command options: -k 63 (SH), -k 71 (PP)). P. veris thrum (VT_v1) was assembled with ABySS (v1.3.4) (Simpson et al., 2009) using a k-mer length of 81 (by JMC). For all assemblies, only sequences ≥ 200 bp were retained for further analysis.
Assembly Assembled Flower form Libraries used for assembly name by LH_v2 JW Long homostyle LIB2558 (scaffolded with LIB5215, (TGAC) LIB5216, LIB5217) LH_v1 JW Long homostyle LIB2558 (scaffolded with LIB1474) (TGAC) SH_v2 JMC Short homostyle LIB3565 (scaffolded with LIB1474) TP_v2 JW Thrum LIB1167 (scaffolded with LIB1474) (TGAC) TP_v1 JW Thrum LIB1167 (not scaffolded) (TGAC) PP_v1 JMC Pin LIB1732 PP_v2 JMC Pin LIB1732 (scaffolded with LIB1474) VT_v1 JMC P. veris Thrum LIB3564 (not scaffolded)
Table 2.2 Primula vulgaris (and P. veris, where listed) paired-end read assemblies carried out by JMC or JW (TGAC).
49
Assembly validation
To evaluate the LH_v2 assembly for duplicated content, and to compare the content of the unprocessed paired-end reads against the final assembly, the long homostyle assembly (LH_v2) and read library (LIB2558) were used to generate k-mer hashes with the Jellyfish (v2.2.0) “count” function: k-mer length (-m) = 31 (Marçais and Kingsford, 2011). The K-mer Analysis Toolkit (KAT) (v2.1.0) (Mapleson et al., 2016) was used to produce a matrix of k-mers shared between the FASTA file of LH_v2 and the long homostyle read library (LIB2558) (with the “comp” function), and to produce a k-mer spectra plot using the “plot” function. Jellyfish (v2.2.0) was also used to produce k-mer hashes for the pin (LIB1732), short homostyle (LIB4568), thrum (LIB1167), and published P. veris (Nowak et al., 2015) read libraries; the “histo” function was then used to generate a frequency-abundance distribution of k-mer occurrences, which was plotted in R (v3.2.0) with x- (frequency) and y-axis (k-mer count) limits of 100 and 2x107 respectively. Separate plots were produced in the same way for the P. veris reads, one on the same scale as the above for comparison, and the other with altered x- and y- axis limits of 400 and 5x106, respectively, to improve visualisation of the distinct peaks in the distribution. KAT was also used to produce a plot of k-mers shared between the P. veris reads and assembly (x- and y-axis limits of 400 and 5x106).
CEGMA (v2.5) (Parra et al., 2007) was used to evaluate the presence of a set of 248 core eukaryotic genes in the Primula vulgaris LH_v2 genome assembly, as a proxy for completeness of the genome; the genome assembly FASTA file was used as input (intron_size=50000).
Repeat library construction and repeat masking
Only the long homostyle (LH_v2) genome was annotated. RepeatModeler (open v1.0.7) (http://www.repeatmasker.org/RepeatModeler.html) was used to identify de novo repetitive sequences in the LH_v2 scaffolds.
To reduce the chance of protein coding genes being annotated as repeats, sequences from the de novo repeat library were aligned to Pfam-A (using curated thresholds) and Pfam-B (e-value 1x10-4) with HMMer hmmscan (v. 3.1b1) (Eddy, 2009, Finn et al., 2014). The sequences with at least one alignment to a transposition-associated domain,
50
or with no alignments to any Pfam domains (transposition-associated or otherwise), were retained. Pfam domains were considered transposition-associated based on alignment to a database of transposable elements included in the RepeatRunner package using HMMer hmmscan (v3.1b1) (e-value 1x10-4) (http://www.yandell- lab.org/software/repeatrunner.html). BlastX alignments to the NCBI “nr” database were carried out for sequences in the repeat library identifying a non-transposition-associated domain in addition to a transposition-associated domain; those with no BlastX (v2.2.28+) (Camacho et al., 2009) hits or that appeared to be transposon-associated based on manual review were retained, the remainder were removed from the repeat library.
Short-interspersed and simple repeats were identified in the genome assembly using the curated de novo repeat library with a local installation of RepeatMasker based on the RMBlast algorithm (open v4.0.1) (http://www.repeatmasker.org/). Additional classification of repeat elements was performed with TEclass (Abrusan et al., 2009).
Alignment of proteins from related species
FASTA files of protein sequences were obtained for the species listed in Table 2.3. Additional protein sequences from species within the Asterids were obtained from the NCBI protein database (http://www.ncbi.nlm.nih.gov/) using “asterids[orgn]” as the search term, with the aforementioned species excluded from the results. The sequences containing Selenocysteine (“U”) or Pyrrolysine (“O”), or ambiguous amino acids, were removed from the dataset; low-complexity regions were masked in the remaining sequences using the “segmasker” tool (available as part of Blast 2.2.28+) (Camacho et al., 2009).
The above proteins were aligned to the LH_v2 assembly with Exonerate 2.2.0 (https://www.ebi.ac.uk/~guy/exonerate/) using the “protein2genome” model with soft- masked query and target (minintron=20, maxintron=50000) to produce a GFF format file that was converted to an AUGUSTUS hints file using the exonerate2hints Perl script (available at http://augustus.gobics.de/binaries/scripts/exonerate2hints.pl); this script was modified to exclude alignments below 90% identity and 70% coverage.
51
Species Access link Version Solanum http://solgenomics.net/organism/Solanum_lycopersicu 2.4 lycopersicum m/genome Solanum http://solgenomics.net/organism/Solanum_tuberosum/ 3.4 tuberosum genome Mimulus http://phytozome.jgi.doe.gov/pz/#!info?alias=Org_Mg 2.0 guttatus uttatus Capsicum http://peppersequence.genomics.cn/page/species/index. 2.0 annuum jsp Actinidia http://bioinfo.bti.cornell.edu/cgi-bin/kiwi/home.cgi NA chinensis Nicotiana http://solgenomics.net/organism/Nicotiana_benthamia 0.4.4 benthamiana na/genome
Table 2.3 Protein databases of species closely related to Primula vulgaris used for alignment to the LH_v2 genome assembly.
Annotation of genes
The ab initio annotation software AUGUSTUS (Stanke and Morgenstern, 2005) was trained (by JW) with a set of full-length transcripts assembled using TopHat (Trapnell et al., 2012) (v2.0.11) and Cufflinks (Trapnell et al., 2012) (v2.1.1). These transcripts are based on alignment of RNA-Seq datasets derived from leaves, flowers, roots, fresh seed, and seedlings from both pin and thrum P. vulgaris plants (Table 2.1, with leaf/flower libraries from Table 3.1). This facilitated prediction (by JW) of protein- coding genes, with the repeat annotations, protein alignments from related species (JC; above), and RNA-Seq transcript models (assembled by JW) as evidence.
Functional annotation of predicted genes
Functional annotation of predicted genes in LH_v2 was carried out using AHRD (Automated assignment of Human Readable Descriptions) (https://github.com/groupschoof/AHRD) (Hallab, 2015). Alignments were performed using BLASTP (v2.2.26) (e-value 1x10-4) with the following target databases: Uniprot/trEMBL, Uniprot/Swissprot (http://www.uniprot.org/) and TAIR10 (https://www.arabidopsis.org/). AHRD aims to algorithmically select the best scoring descriptions from Blast alignments, with greater weight being given to descriptions from more trusted sources (weighting applied for this analysis; TAIR = 50, trEMBL =
52
10, Swissprot = 100); descriptions were truncated to a maximum of 100 characters long. GO (gene ontology) terms were assigned using Blast2GO (Conesa et al., 2005) with BlastX searches against the NCBI “nr” database as input (e-value 1x10-4); domains were annotated with InterProScan5 (Jones et al., 2014).
TransposonPSI (http://transposonpsi.sourceforge.net/) was used with the Primula vulgaris and Primula veris (Nowak et al., 2015) coding sequences to detect degenerate repetitive elements based on protein homology, as well as repetitive sequences which otherwise escaped detection with RepeatMasker.
Comparative analysis of P. vulgaris and P. veris genes
Primula veris coding sequences predicted in Nowak et al. (2015) were aligned to P. vulgaris LH_v2 coding sequences and genomic contigs in two separate alignments using TBLASTX (v2.2.31+) (Camacho et al., 2009). In corresponding TBLASTX alignments LH_v2 coding sequences were mapped against the published P. veris genomic contigs and coding sequences. In cases where there were multiple isoforms available for each gene, the isoform with the suffix “.1” was regarded as the principal isoform and was used in this analysis. The resulting output files from the alignments were parsed to extract any HSPs with over 95% sequence identity; the total percentage coverage across each coding sequence region for these alignments was recorded, and the cumulative number of coding sequences at each coverage plotted for each of the four alignments using R (v3.2.0) (https://www.r-project.org/).
Analysis of orthologous gene groups
Orthologous and paralogous groups were determined using OrthoMCL (v2.0.9) with the method described on the OrthoMCL website (inflation factor=1.5) (http://orthomcl.org/common/downloads/software/v2.0/UserGuide.txt). Two comparable analyses were performed, using proteins from (i) Primula vulgaris LH_v2, and (ii) Primula veris (Nowak et al., 2015). The all-vs-all alignments required for analysis with OrthoMCL was carried out with BLASTP (v2.2.28+) with the “seg” and “soft masking” options applied (e-value=1x10-5); protein sequences used in the alignments were from Actinidia chinensis (http://bioinfo.bti.cornell.edu/cgi- bin/kiwi/home.cgi), Orzya sativa (version 7,
53
http://rice.plantbiology.msu.edu/downloads_gad.shtml), Arabidopsis thaliana (TAIR10, https://www.arabidopsis.org/) and Solanum lycopersicum (version 2.4, http://solgenomics.net/organism/Solanum_lycopersicum/genome), as well as either Primula vulgaris (LH_v2, current study) (i) or Primula veris (Nowak et al. 2015) (ii). In cases where there were multiple isoforms available for each gene, the isoform first listed (with the suffix “.1”) was regarded as the principal isoform and was used in this analysis. From the OrthoMCL output, Venn diagrams of orthologous genes were drawn using the tool available at http://bioinformatics.psb.ugent.be/webtools/Venn/
RNA-Seq differential expression analysis
RNA was isolated in biological replicates from 15-20 mm buds of four wild-type pin plants and four wild-type thrum plants for sequencing with Illumina HiSeq2000 (Table 2.1), as described in Li et al. (2016). The resulting reads were screened for rRNA removal using SortMeRNA and quality-trimmed with trim galore (Q20) (Kopylova et al., 2012) (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore), before alignment to the long-homostyle genome assembly with TopHat (v2.0.13) and assembly with Cufflinks (v2.2.1) (Trapnell et al., 2012) using LH_v2 gene models as a guide after manual curation of all S locus genes (see Chapter 4). Differential expression was carried out using the Cufflinks (v2.2.1) tool “Cuffdiff”; genes differentially expressed between pin and thrum flowers, as well as genes specifically expressed in only one morph, were extracted from the “gene_exp.diff” file. R (v3.2.0) (https://www.r-project.org/) was used to plot the log2 fold change in FPKM for each differentially expressed gene, and the log10 difference in FPKM + 1 for genes with morph-specific expression.
GO term enrichment analysis based on Fisher's exact test was performed using the goatools package for Python (https://github.com/tanghaibao/goatools). GO terms corresponding to the subset of genes identified as differentially expressed, or morph- specific in expression, were compared with GO terms associated with genes across the whole genome, based on the functional annotation with Blast2GO (Conesa et al., 2005).
54
2.4 Results
Analysis of paired-end reads
Primula vulgaris plants are normally outbreeding (Li et al., 2011b). However, in rare cases homostyle plants with the anthers and stigma at the same height are produced; these plants are associated with a breakdown in self-incompatibility (SI) (Charlesworth and Charlesworth, 1979a). The long homostyle individual used for the LH_v2 assembly is from an inbred line; this plant was highlighted as a possible source for generating the paired-end read library required for a high-quality Primula vulgaris genome assembly, based on potentially reduced polymorphic content as compared to an outbreeding P. vulgaris plant (Pryszcz and Gabaldón, 2016).
The k-mer frequency distribution (spectra) of the LIB2558 library generated from the long homostyle individual reveals that the genome of this plant is highly homozygous, with a unimodal distribution beyond the first local minima (Figure 2.1). The distribution for LIB2558 is free from any noteworthy secondary peaks, indicating minimal heterozygosity within the sequencing reads. The distributions of the pin and thrum read libraries indicate heterozygosity within their genomes (Figure 2.1); the pin distribution suggests it might be difficult to resolve between true genomic content and erroneous sequences for this read library, as there is some overlap between the exponential phase of the distribution and the presumed true genomic content which lies beyond the first local minima (Mapleson et al., 2016).
55
Figure 2.1 Frequency distribution (spectra) of k-mers (k=31) showing the number of distinct k-mers occurring with each frequency in paired-end read libraries for pin (black), long homostyle (orange), short homostyle (blue) and thrum (red) individuals (LIB1732, LIB2558, LIB4568, LIB1167; Table 2.1).
These results provide a good indication that a long homostyle assembly generated from the LIB2558 library might comprise longer contigs, and an assembly size that more accurately reflects genome size predictions, based on reduced polymorphic content and an improved capacity to recognise sequencing errors in comparison to the other sequenced genomic read libraries (Table 2.1). In a more heterozygous genome with increased polymorphism, such as the thrum sequenced here (LIB1167) (Table 2.1), the resulting assembly would typically be more fragmented, comprising redundant contigs that cannot be resolved by a single path due to the small size of the short sequencing reads. This would potentially result in fragmented gene models and a greater number of
56
apparent paralogs and duplicated genomic regions than are actually present in the genome (Pryszcz and Gabaldón, 2016).
The k-mer spectra for the short sequencing read library used to generate the published P. veris genome assembly suggests there is a high level of heterozygosity in this library (Figure 2.2). This is probably due to the use of a read library comprising paired-end reads derived from genomic DNA isolated from both pin and thrum individuals (Nowak et al., 2015). Since wildtype Primula plants are usually outbreeding (Li et al., 2011b), both the pin and the thrum genome would be heterozygous with numerous polymorphic sites between the distinct P. veris individuals, thus resulting in what appears to be four peaks in the distribution. This would result in numerous groups of alternate contigs that comprise the same polymorphic genomic regions (Pryszcz and Gabaldón, 2016).
Figure 2.2 Frequency distribution (spectra) of k-mers (k=31) showing the number of distinct k-mers occurring with each frequency in the paired-end read library used in the published P. veris assembly (Nowak et al., 2015); a = on the same scale as Figure 2.1 for comparison, b = reduced y-axis maximum to facilitate visualisation of the distinct peaks in the distribution.
Assembly validation
Genomic paired-end reads (Table 2.1) were used to generate the assemblies listed in Table 2.2. The long homostyle LIB2558 paired-end read library, as well as the long
57
homostyle long mate-pair (LMP) reads (LIB5215, LIB5216, LIB5217) (Table 2.1) were used in the long homostyle (LH_v2) assembly which, with an N50 of 294.8 kb and NG50 of 229.8 kb, was chosen for subsequent annotation (Table 2.4). The LH_v2 NG50 value was calculated using the higher of the two flow-cytometry estimates for the P. vulgaris genome size (Siljak-Yakovlev et al., 2010, Temsch et al., 2010), that is 489 Mb (Siljak-Yakovlev et al., 2010).
The published Primula veris genome assembly is 309.7 Mb and has a reported N50 of 164 kb; it covers 65% of the estimated genome size (479.22 Mb) (Siljak-Yakovlev et al., 2010). However, the NG50 calculated with GenomeTools (v1.5.8) (Gordon, 2013) using the estimated genome size of 479.22 Mb, is 73.3 Kb; this considerable difference in N50 and NG50 values (55% decrease) is most likely due to the apparent removal of contigs less than 888 bp, which would result in the removal of a significant portion of the true genomic content from the P. veris assembly (Figure 2.5), thereby decreasing the assembly size and increasing the N50 statistic; this is an example of the NG50 metric offering a more appropriate measure of genome assembly completeness, as it is more robust to the removal of true genomic content (Earl et al., 2011, Bradnam et al., 2013). The P. vulgaris assembly covers more of the genome and has a higher NG50.
58
Assembly Assembled Flower Libraries Contig Total N50 NG50 name by form used for count (Mb) (kb) (kb) assembly LH_v2 JW Long LIB2558 67491 411.1 294.8 229.8 (TGAC) homostyle (scaffolded with LIB5215, LIB5216, LIB5217) LH_v1 JW Long LIB2558 30792 401.0 57.6 37.8 (TGAC) homostyle (scaffolded with LIB1474) SH_v2 JMC Short LIB3565 125497 552.8 12.0 14.2 homostyle (scaffolded with LIB1474) TP_v2 JW Thrum LIB1167 124558 625.5 13.7 19.8 (TGAC) (scaffolded with LIB1474) TP_v1 JW Thrum LIB1167 156809 482.2 8.3 8.1 (TGAC) (not scaffolded) TP_v1.1 JW Thrum LIB1167 159254 614.9 12.8 19.3 (TGAC) (scaffolded with LIB1474) PP_v1 JMC Pin LIB1732 136961 499.6 9.1 9.4 PP_v2 JMC Pin LIB1732 105238 581.6 14.9 20.0 (scaffolded with LIB1474) VT_v1 JMC P. veris LIB3564 145617 441. 5 10.8 9.5 Thrum
Table 2.4 Genome assembly statistics for Primula vulgaris (and P. veris, where listed) paired-end read assemblies carried out by JMC or JW (TGAC). NG50 is calculated using a genome size of 489 Mb (Primula vulgaris) or 479.22 (Primula veris) (Siljak- Yakovlev et al., 2010). In all cases contigs < 200 bp were removed prior to these calculations.
59
In summary, the LH_v2 genome assembly of 411 Mb covers between 84 and 90% of the 459-489 Mb genome (Siljak-Yakovlev et al., 2010, Temsch et al., 2010) and offers a considerable improvement over the P. veris assembly in terms of NG50 value, suggesting most of the Primula genome content is in contigs of comparatively larger size in the LH_v2 assembly. Furthermore, the 309.7 Mb P. veris assembly contains 40.7 Mb (13.14%) “N”s (ambiguous bases) as opposed to 29.9 Mb (7.26%) in the 411 Mb P. vulgaris assembly, suggesting LH_v2 comprises a greater proportion of resolved base calls. This is perhaps due to the homozygosity of the long-homostyle allowing highly polymorphic sites to be assembled into contigs (Pryszcz and Gabaldón, 2016); this would result in fewer contigs having to be scaffolded together with LMP libraries, which would result in intervening “N”s between joined sequences.
The final LH_v2 assembly discussed above was filtered to remove contigs less than 200 bp. The comparison of k-mers present in the LH_v2 genome assembly (without 200 bp contigs removed) and the paired-end reads used to generate it suggests that this precursor of the LH_v2 assembly incorporates the majority of true genomic content that lies beyond the first local minima (Figure 2.3); the k-mers in the exponential phase of the plot, which represent unique sequencing errors present in low frequencies (Sato et al., 2012), are for the most part not included in the genome assembly (0x coverage in black), with the homozygous content present once in the assembly (red), suggesting the assembly is highly “collapsed”, with few alternate contigs that would otherwise result from unresolvable polymorphic regions of the genome (Mapleson et al., 2016, Pryszcz and Gabaldón, 2016).
60
Figure 2.3 Frequency distribution (spectra) of k-mers in the LIB2558 read library, and the copy number of these k-mers in the LH_v2 assembly (prior to removal of contigs < 200 bp); black = content absent from the assembly, red = content present once in the assembly, purple = twice, etc.
Based on the validation steps described above, the LH_v2 genome with contigs < 200 bp removed was chosen for annotation and analysis of genes and repetitive sequences; the removal of “ultra-small” contigs < 200 bp is a common step (e.g. Mayer et al., 2014) as small contigs might comprise highly repetitive sequences, or sequencing errors, and are not incorporated into larger contigs, or would otherwise slow down downstream analyses. In this instance the removal of contigs < 200 bp does not remove a significant portion of genomic content in the LH_v2 assembly as there is little change in the k-mer spectra of the assembly with these contigs removed (Figure 2.4) in comparison to Figure 2.3.
61
Figure 2.4 Frequency distribution (spectra) of k-mers in the LIB2558 read library, and the copy number of these k-mers in the LH_v2 assembly (with removal of contigs < 200 bp); black = content absent from the assembly, red = content present once in the assembly, purple = twice, etc.
For completeness, the equivalent analysis for P. veris paired-end reads against the published genome assembly (Figure 2.5) was performed. This indicates that a significant proportion of the true genomic content in the read libraries is missing from the assembly. This is perhaps a result of the difficulty in distinguishing between erroneous k-mers and true genomic content in the highly-polymorphic read library, as well as the apparent use of a ≥ 888 bp cut-off for contig size. The k-mers that are present only once in the read library most likely result from reads containing sequencing errors; if the true genomic content is also represented by k-mers with a low copy number as compared to the main distribution, then a partial assembly will result.
62
Figure 2.5 Frequency distribution (spectra) of k-mers in the P. veris SRR1658103 paired-end read library, and the copy number of these k-mers in the published P. veris assembly (Nowak et al., 2015); black = content absent from the assembly, red = content present once in the assembly, purple = twice, etc.
Evaluation of the genespace captured in the assembly
The LH_v2 assembly was inspected for the inclusion of 248 CEGs; these genes are expected to be present in the majority of eukaryotic genomes (Parra et al., 2007). This is a useful measure of whether the genespace of an organism has been captured by the assembly (Simão et al., 2015). The vast majority (97.18%) of CEGs are at least partially present in the LH_v2 assembly, which suggests a high level of completeness for a non- model species. The published Primula veris genome (Nowak et al., 2015) partially covers 94.18% of CEGs, which by proxy suggests a greater number of genes may be absent from this assembly. The use of partial matches is to avoid biases in relation to method used for alignment (Parra et al., 2007, Parra et al., 2009).
63
No. proteins Completeness (%)
Complete 223 89.92
Group 1 61 92.42
Group 2 46 82.14
Group 3 53 86.89
Group 4 63 96.92
Partial 241 97.18
Group 1 63 95.45
Group 2 53 94.64
Group 3 61 100.00
Group 4 64 98.46
Table 2.5 The number and percentage of 248 ultra-conserved CEGs present (either complete or partial) in the Primula vulgaris LH_v2 genome, as determined by CEGMA (v2.5) (Parra et al., 2007).
RNA-Seq datasets derived from leaves, flowers, roots, fresh seed, and seedlings from both pin and thrum P. vulgaris plants (Table 2.1 and Table 3.1) (see section 2.3.6) were mapped to the LH_v2 genome (by JW) using the DRAGEN co-processor (http://www.edicogenome.com/dragen_bioit_platform/). The mean and mean concordant (paired) mapping rate was 98.5% and 88.1% respectively. The TopHat alignment of RNA-Seq reads (used in our expression analysis; see section 5.3.3) to the published P. veris assembly gives a mean overall mapping rate of 82.5%, and a concordant pair alignment rate of 75.7%. In both cases RNA-Seq reads were filtered for ribosomal RNA with sortmeRNA v1.9 (Kopylova, E. et al., 2012) and quality-trimmed (Q20) using trim galore v0.3.3 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), prior to short-read alignment. These RNA-Seq reads were used in gene prediction for the respective genomes. In conclusion, despite RNA-Seq reads from a much wider range of tissues and
64
a greater number of libraries being mapped to LH_v2, the mapping rates suggest a more complete genespace for this assembly.
It has been suggested that a good gauge of the “usefulness” of an assembly is to determine the number of scaffolds with a length greater than that of an average gene (Halder et al., 2010). Based on assembled and annotated vertebrate genomes, this was calculated as 25,000 bp (Halder et al., 2010). The average gene length in rice is 3,223 bp (RGAP 7; http://rice.plantbiology.msu.edu/), and in Arabidopsis 2,300 bp (TAIR10; https://www.arabidopsis.org/). Here, the average of these two values (2,761.5 bp) was used as the approximate length of an average angiosperm gene. To this end, ~303 Mb of sequence in the P. veris genome assembly comprises contigs > 2761.5 bp, versus ~388 Mb for P. vulgaris LH_v2, suggesting both genomes are adequate for gene prediction, but that the P. vulgaris assembly contains more total sequence of sufficient length for annotation (Figure 2.6). However, the approximate average gene length of 2,761.5 bp for angiosperms is much lower than that calculated for vertebrates (25,000 bp), suggesting the measure of relative “usefulness” used by Bradnam et al. (2013) is not as applicable for plants due to expansion-limited intron sizes in plants as compared to vertebrates (Wu et al., 2013) and the apparent ease of assembling contigs (and thus genes) less than 5 kb in length (Figure 2.6).
65
Figure 2.6 Contribution of genomic contigs of different lengths (x-axis) to total cumulative size of genome assembly (y-axis), for P. veris (red) (Nowak et al., 2015) and P. vulgaris LH_v2 (black). The dashed lines indicate the respective NG50 values for the assemblies (red = veris, black = vulgaris).
The validation methods above suggest that comprehensive annotation and analysis of the assembled P. vulgaris LH_v2 genome is worth pursuing based on a high-quality and relatively complete genome assembly.
66
Repeats in the Primula vulgaris genome
The species-specific de novo repeat library generated for P. vulgaris through curation of the RepeatModeler output allowed 37.03% of the 411 Mb P. vulgaris LH_v2 assembly to be annotated as repetitive (Table 2.6), with up to 36% comprising transposable elements (TEs); this is comparable with the predicted TE content in the similar sized (370 Mb) rice assembly (> 35%), as well as the closely-related and more recently annotated 616.1 Mb Actinidia chinensis (kiwifruit) genome (36%). In P. vulgaris up to 66.14% of annotated TEs were categorized as long terminal repeat (LTR) retrotransposons, in line with previous observations that LTRs are the most widespread TEs in plants (Kubat et al., 2014).
67
Repeat type Length (bp) % in genome % in repeat
Interspersed repeats 153945885 (153945885) 37.44 (37.44) 96.15 (96.15)
Class I: Retroelement 62615656 (95912158) 15.23 (23.33) 39.11 (63.68)
LTR Retrotransposon 54323780 (66959990) 13.21 (16.28) 33.93 (41.82)
Copia 39875915 (39875915) 9.70 (9.70) 24.91 (24.91)
Gypsy 12544567 (12544567) 3.05 (3.05) 7.83 (7.83)
Other LTR 1903298 (14539508) 0.46 (3.54) 1.19 (9.08)
non-LTR Retrotransposon 8291876 (28952168) 2.02 (7.04) 5.18 (18.08)
SINE 1085346 (1363804) 0.26 (0.33) 0.68 (0.85)
LINE 7206311 (20717729) 1.75 (5.04) 4.50 (12.94)
other non-LTR 219 (6870635) < 0.01 (1.67) < 0.01 (4.29)
Other Class I 219 (6045423) < 0.01 (1.47) < 0.01 (3.78)
Class II: DNA transposon 19522480 (52043722) 4.75 (12.66) 12.19 (32.50)
Unclassified interspersed 71807749 (5990005) 17.46 (1.46) 44.85 (3.74)
Simple repeats 6164398 (6164398) 1.50 (1.50) 3.85 (3.85)
Total (uncorrected) 160110283 (160110283) 38.94 100.00 (100.00)
Total (corrected for overlaps) 152243370 37.03
Table 2.6 Repetitive sequences annotated in the Primula vulgaris genome using RepeatMasker with a de novo repeat library for P. vulgaris. Length (base-pairs), and % in genome and repetitive portion of the genome is shown for each repeat class; additional classification with TEclass (Abrusan et al., 2009) is shown in brackets. Total length of all repeat types (bp) is shown either uncorrected or corrected for overlapping annotations based on the RepeatMasker output. The sequence percent for each individual repeat type is not corrected for overlaps.
68
The de novo repeat library generated for P. vulgaris was used to annotate 34.62% of the draft VT_v1 P. veris thrum genome (Table 2.2) as repetitive, which is much greater than the reported 7.7% for the published P. veris genome, suggesting the true proportion of repeats might be higher than 7.7% if a species-specific repeat library is generated for P. veris (Nowak et al., 2015). Indeed, using the P. vulgaris repeat library, ~26% of the published P. veris genome was annotated as repetitive. This perhaps highlights that a significant portion of the repeat content has been removed, or was not assembled due to comparatively high polymorphism between the multiple individuals used in sample preparation, as opposed to the preliminary VT_v1 assembly that adopts a more conservative ≥ 200 bp cut-off and was generated from a read library of one sequenced individual; this is consistent with findings in Figure 2.5 that show a large portion of the read library unrepresented in the published P. veris assembly.
The comprehensive repeat library generated for Primula vulgaris will facilitate comparisons between the S locus and surrounding genomic regions once the genes at this locus have been identified (see Chapter 4).
Gene annotations in the Primula vulgaris genome
In total 24,600 genes (principal isoforms) were predicted in the P. vulgaris LH_v2 genome assembly with an mean coding sequence length of 1,466 bp. This is a similar number of predicted genes to the 27,655 predicted in A. thaliana (27,411; TAIR10, https://www.arabidopsis.org/), but less than that in the Actinidia chinensis (kiwifruit) genome (39,040) (Huang et al., 2013). Kiwifruit is the most closely related sequenced plant species other than P. veris, which has a reported 19,507 (18,301 published) predicted genes. The comparatively large number of genes in the Actinidia chinensis genome is most likely due to recent whole genome duplication events in this species (Huang et al., 2013), as also noted by Nowak et al. (2015).
Of the 24,600 predicted P. vulgaris protein sequences, 84% were functionally annotated with AHRD based on homology to SwissProt, TrEMBL (http://www.uniprot.org/) and TAIR10 (https://www.arabidopsis.org/) protein databases using BLASTP (Camacho et al., 2009), and additional searches with Interproscan and BLAST2GO (Conesa et al., 2005, Jones et al., 2014). Of these, 90% contain at least one domain, and 58% have GO terms attached; 8% of the 24,600 genes have no descriptions based on homology to
69
proteins via the BLASTP alignments, but are nonetheless annotated with GO terms or domains.
Based on the results of additional homology-based searches for potentially degenerate repeat elements using TransposonPSI (http://transposonpsi.sourceforge.net/), as well as an accompanying lack of non TE-related descriptions in the functional annotations, up to 762 of the 24,600 predicted genes in LH_v2 could contain TEs. If this is the case, then the actual number of genes in the P. vulgaris genome is perhaps closer to 23,838. Unless, of course, some of these genes are in fact protein coding, but contain domains that share similarity to TE-related domains, as is the case with the AP2 binding domain that is present in both plant developmental transcription factors (TFs) and integrases such as tn916 (Balaji et al., 2005). There is a suggestion that recruitment of DNA binding domains in TFs from transposases or integrases could be a recurrent theme in evolution (Balaji et al., 2005), with evolutionary mobile protein domains seen in different sequence contexts (Triant and Pearson, 2015).
Comparison of genes in P. vulgaris and P. veris genomes
The above k-mer analyses show that the P. vulgaris LH_v2 genome assembly incorporates most of the genomic content present in the reads (Fig. 2.4). CEGMA (Table 2.5) and RNA-Seq alignment analyses suggest that the genespace has been successfully captured. The P. vulgaris assembly is therefore expected to include most of the genes in the genome. The P. veris assembly has a slightly lower percentage of CEGs partially mapped to it (94.18% vs. 97.18%), and a much lower percentage of CEGs with complete matches (79.84% vs. 89.92%) (Nowak et al., 2015). The RNA-Seq read mapping rate is also lower than that for P. vulgaris, and k-mer analyses show that some of the genomic content may have been lost due to the polymorphic read library (Fig. 2.5).
To evaluate the different number of genes annotated in the P. vulgaris LH_v2 and published P. veris genome assemblies (24,600 vs. 18,301) the predicted coding sequences were compared against each other and the respective genome assemblies, as shown in Figure 2.7. The results show that 6,502 coding sequences out of 24,600 genes predicted in the LH_v2 assembly (26.43%) have no coverage (> 95% identity) in the P. veris coding sequences (Figure 2.7). Furthermore, the alignment of P. veris coding
70
sequences to P. vulgaris coding sequences (red line) and P. vulgaris contigs (blue) results in a shallower distribution compared to reverse alignments of P. vulgaris coding sequences to P. veris (black/orange). Therefore, this analysis also suggests that P. vulgaris coding sequences mapped to the P. veris assembly do so with lower coverage than P. veris coding sequences mapped to P. vulgaris, as supported by a reduced number of CEGs with complete coverage in the P. veris genome (79.84% vs. 89.92%). It should be noted that the P. veris genome manuscript (Nowak et al., 2015) quotes the number of genes in the assembly as 19,507, whilst final available file of coding sequences contains 18,301 sequences. This is presumably due to the apparent use of a ≥ 888 bp cut-off for contig size in the assembly, which would result in some of the annotated genes being removed from the final assembly (Figure 2.5).
Of these 6,502 coding sequences, 1,166 are completely absent from the P. veris genome assembly (blue line). The absence of these genes might be expected based on the failure of the P. veris assembly to incorporate all of the genomic content in the reads (Fig. 2.5). This analysis therefore clarifies that the lower number of genes annotated in the P. veris assembly is not a result of gene duplication events in P. vulgaris or genes that are otherwise species-specific, but that the P. veris annotations are not as comprehensive as those in LH_v2, and do not include a sizeable portion of the genes in the genome. This could be due to the use of single library RNA-Seq datasets from a narrow range of tissues (leaves and flowers) compared to the multiple-replicate datasets used for the LH_v2 annotations (flowers, leaves, roots, fresh seed, and seedlings): unreplicated RNA-Seq data may fail to capture biological variance (see section 3.4.3). The broader range of tissues for P. vulgaris is likely to offer a more comprehensive sampling of the expressed genes. The use of a highly polymorphic genomic read library containing both pin and thrum reads (Figure 2.2) may also have confounded the P. veris assembly itself due the difficulty of assembling heterozygous genomes, resulting in the assembly excluding true genomic content (and thus genes) from the reads (Figure 2.5) (see section 2.4.2).
71
Figure 2.7 Comparison of genes annotated in P. vulgaris LH_v2 and published P. veris genome assemblies: total coverage of HSPs (High Scoring Pairs) with ≥ 95% sequence identity in TBLASTX alignments. LH_v2 coding sequences aligned to P. veris genome assembly (orange), and P. veris coding sequences (red). P. veris coding sequences mapped to LH_v2 genome assembly (black), and LH_v2 coding sequences (blue). Dotted lines indicate total number of genes annotated in each genome (P. veris = 18,301, P. vulgaris = 24,600). Δ1 = number of P. veris coding sequences with no coverage in LH_v2 coding sequences (685; 17,616 of 18,301 present); Δ2 = number of LH_v2 coding sequences with no coverage in P. veris coding sequences (6,502; 18,098 of 24,600 present); Δ3 = number of LH_v2 coding sequences with no coverage in P. veris genome assembly (1,166; 23,434 of 24,600 present). The small number of P. veris genes with no coverage in the P. vulgaris genome (130; 18,171 of 18,301 present) is not indicated. In contrast, there are only 685 (as opposed to 6,502) P. veris coding sequences with no coverage (> 95% identity) in the P. vulgaris geneset; with a minimal number (130) absent from the genome assembly as a whole. This suggests that the majority of P. veris coding sequences are covered by the P. vulgaris genome and geneset; conceivably those that are missing could be species-specific.
72
In addition to the above, TransposonPSI (http://transposonpsi.sourceforge.net/) annotations of the P. veris geneset suggest that 226 of the genes could be TE-related; as noted above, up to 762 genes in the Primula vulgaris LH_v2 annotations could be TE- related. This could explain at least some of the 6,502 coding sequences that are absent from the P. veris annotations.
Nowak et al. (2015) note a discrepancy between the number of assembly-based gene predictions (19,507), and the number of transcripts obtained in the de novo transcript assembly for P. veris (25,409); this was produced using Trinity (Grabherr et al., 2011). These de novo assembled transcripts without use of the assembly as a reference may represent the full complement of genes in the genome without limitation from missing genomic content, but it seems more likely that genes are still missing due to use of fewer tissue samples, and that a good proportion are instead alternative spliceoforms or fragmented transcripts, as is often the case with de novo transcriptome assemblers that can overestimate the number of transcripts compared to that expected for a given organism (Zhao et al., 2011, Bankar et al., 2015, Sayadi et al., 2016). Our final annotations comprise 29,088 coding sequences, with 4,488 recognised as alternative spliceoforms.
OrthoMCL analysis of orthologous genes
Further analysis of the full complement of genes in P. vulgaris to determine orthologues in well-annotated and closely related angiosperm species will provide a platform for future understanding of the gene families in which the S locus genes lie. This was undertaken through alignment of proteins from each of these species and analysis with OrthoMCL to produce clusters of related proteins (Li et al., 2003) (Figure 2.8). To provide a comparison, P. veris proteins were also aligned to these species, and vice- versa; due to differences in the number of predicted genes that may affect the clustering stage of the analysis, the two Primula species were investigated in two separate sets of alignments.
These analyses resulted in a sizeable number of orthologues being identified in both Primula species. In the P. vulgaris analysis, a total of 19,861 groups were identified, as compared to 19,448 in P. veris. P. veris performs relatively well in this analysis, presumably because the genes in question are those that are well-conserved across
73
angiosperm species; such genes are used as evidence in annotation pipelines (Holt and Yandell, 2011).
There are slight differences in the number of elements within each grouping. It is difficult to speculate on the exact cause of such differences, but the most noticeable change lies in the number of P. vulgaris- and P. veris-specific groups (853 vs. 224). It appears that the number of P. vulgaris-specific gene families (853) is more in line with that seen for the other species, which suggests that the missing genes in P. veris may have affected the clustering algorithm in such a way that some missing paralogous genes are no longer identified in this group, leading to their partners being recognised as more closely related to the groups of genes in other species.
This resource will facilitate downstream pairwise comparison of paralogous genes in the P. vulgaris genome, as well as phylogenetic analyses of the S locus genes once they have been identified.
a b
Figure 2.8 OrthoMCL analysis showing orthologous genes between P. vulgaris and four other angiosperm species (a), with alignments based on predicted protein sequences in the LH_v2 genome. For comparison, the same analysis is shown (b) using published P. veris protein sequences (Nowak et al., 2015).
74
Differential expression
Differential expression analysis for all genes predicted in the P. vulgaris genome was carried out between pin and thrum flowers using RNA-Seq reads in 4x biological replicate. This study identified 401 genes differentially expressed between pin and thrum flowers, and 994 with morph-specific expression (Figure 2.9).
The most distinct feature of this analysis is that there are many more genes upregulated in thrum (383), as compared to the number upregulated in pin (118) (Figure 2.9). This includes eight genes expressed in a morph-specific manner, with no expression in one of the floral forms. There are an approximately equal number of genes expressed only in thrum flowers (525), or only in pin flowers (468).
75
a
b
Figure 2.9 (a) genes upregulated in pin (black) and thrum (red) flowers, with log2 fold- change in expression (FPKM) shown relative to thrum; genes with morph-specific expression are not plotted; (b) genes specifically expressed in either pin (black) or thrum (red) flowers, log10 expression (FPKM+1) relative to thrum.
76
GO terms associated with genes in the sets of differentially expressed and morph- specific genes (Figure 2.9) were extracted from the LH_v2 functional annotations. GO- term enrichment analysis was carried out to identify over-represented GO terms as compared to the frequency of occurrence for GO terms attached to the full complement of genes in the LH_v2 assembly.
In the functional annotations for the set of 401 differentially expressed genes, GO terms involved in cell wall modification and potentially SI-related pathways are overrepresented (Figure 2.10). This highlights the genes as putative targets in the downstream regulatory pathway of the S locus genes that control differential cell division and elongation between the pin and thrum floral forms (Webster and Gilmartin, 2006).
Figure 2.10 Gene Ontology (GO) terms in the set of 401 genes upregulated in thrum flowers (Figure 2.9), and their associated GO-term enrichment scores (top 30 over- represented GO terms (False Discovery Rate (FDR) < 0.1) are shown). Enrichment score = –log10(p-value), p-value as calculated in goatools GO-term enrichment analysis versus GO term occurrence in the P. vulgaris LH_v2 gene annotations (https://github.com/tanghaibao/goatools).
77
In contrast, overrepresentation of such GO terms was not found for the genes with morph-specific expression, casting doubt on the importance of such genes in the downstream regulatory pathway. However, this set of genes includes all genes expressed in only one of the floral forms, including those with extremely low (in many cases < 0.1 FPKM) expression. If subsets of these genes with > 0.1 FPKM or > 1 FPKM expression are taken, then it remains that there is no enrichment of GO terms of apparent interest in relation to heterostyly. In scrutinising the functional annotations directly, it appears that some of the genes identified, for example those with similarity to F-box transcription factors, may be of potential interest with regards to a role in SI; nonetheless, genes expressed in a morph-specific manner by and large seem to be of limited importance in terms of identifying the broad downstream targets of the S locus.
2.5 Discussion
This chapter describes a broad range of assemblies and associated annotations that are poised to serve as a robust platform for future genomic analyses within the Primulaceae. Primula vulgaris is an outbreeding angiosperm with a putatively sporophytic self- incompatibility (SI) system that acts as a safeguard against self-fertilization (McCubbin, 2008, Li et al., 2011b), as such it might reasonably be expected that a wild-type primrose would have a highly heterozygous genome composition. The level of polymorphism or heterozygosity within a genome further confounds problems associated with the assembly of short sequencing reads, as the assembly algorithm is charged with resolving multiple polymorphic versions of a genome within one sequencing library (Pryszcz and Gabaldón, 2016).
In some cases it is inevitable that a heterozygous genome must be assembled, as with the heterozygous diploid Trifolium pratense (red clover) that is difficult to inbreed without severe loss of viability and vigour due to a gametophytic SI system (De Vega et al., 2015). In contrast, homostyle primroses are associated with a breakdown in SI; they are self-fertile, so given a sufficient level of inbreeding, perhaps could provide a highly- homozygous source for genome assembly. In choosing a P. vulgaris genome based on an inbred long-homostyle for annotation, the results presented here show that the homozygosity of this individual has been exploited, thus producing a reference genome sequence that is more contiguous, complete and compact than the published P. veris
78
assembly, covering 411 Mb (84-90%) of the estimated 459-489 Mb genome (Siljak- Yakovlev et al., 2010, Temsch et al., 2010) as compared to 65% for P. veris. In addition, draft assemblies and short sequencing read libraries for separate pin, thrum and short-homostyle individuals will offer invaluable insight into the specific differences between pin and thrum floral-morphs in trying to formulate a complete picture of the S locus and the surrounding regions.
Here, it is shown through an integrated approach combining RNA-Seq, repeat annotations and evidence from closely related angiosperm species, that the diploid Primula vulgaris long-homostyle (LH_v2) genome has 24,600 predicted genes. This is perhaps a reasonable estimate for the total number of genes to be expected in the closely related Primula veris genome (Nowak et al., 2015), as well as other diploid species in the Primulaceae that have not undergone whole-genome duplications or otherwise been subject to expansion of discrete gene families. The set of comprehensive P. vulgaris genes and associated functional annotations will serve as a blueprint for defining the genes in other Primulaceae species, as well as for the rapid functional classification of genes resulting from various expression and exploratory genomic analyses in this family.
The P. veris genome assembly has 18,301 annotated genes (Nowak et al., 2015), which contrasts with 24,600 genes in the LH_v2 assembly. From the results presented in this chapter, it appears that the majority of these genes are unannotated, perhaps resulting from the minimal number of tissues represented in the P. veris RNA-Seq dataset. Furthermore, a subset of those genes are completely absent from the P. veris genome. Due to the close relationship of P. veris and P. vulgaris, which can interbreed to produce hybrids known as false oxlip (Gurney et al., 2007) one might expect their similarly-sized diploid genomes (Siljak-Yakovlev et al., 2010, Temsch et al., 2010) to contain a similar number of genes. This suggests that some of the genomic DNA content is not incorporated into the P. veris assembly, as supported by the apparent absence of many seemingly error-free k-mers that are present in the associated genomic read library. The Primula vulgaris LH_v2 genome presented here represents a significant leap forward in terms of the accessible functionally-annotated genespace for the Primulaceae. In addition, a more comprehensive repeat analysis for P. vulgaris with 37.03% of the genome annotated as repetitive in comparison to 7% in P. veris,
79
highlights the potential usefulness of the repeat library generated in this study for the annotation of repeats in other Primula species, as well as for the analysis of the genomic architecture of the S locus once it is revealed.
The data presented in this chapter provide the first broad, robust set of differentially expressed genes between pin and thrum flowers for the Primulaceae, using RNA-Seq data based on four biological replicates from Primula vulgaris pin and thrum flowers, as well as a comprehensive gene prediction strategy encompassing multiple sources of evidence, including RNA-Seq libraries from a wide range of tissues. The differentially expressed geneset contains a large proportion of genes of potential interest as downstream targets, with a number of overrepresented GO terms that might relate to the cell division and elongation processes affecting the corolla tube and style of P. vulgaris plants in order to bring about the respective morphologies of the distinct heterostylous floral forms (Webster and Gilmartin, 2006). In addition, a large array of genes distributed throughout the genome is apparently regulated in response to the S locus, which raises the possibility that the SI determining genes might not reside at the S locus itself (Barrett, 2008).
Intriguingly, the differential expression analysis reveals more than double the number of upregulated genes in thrum as compared to pin. The thrum is characterised by large pollen, and increased cell division below the point of anther attachment. In addition, cells are widened above the point of anther attachment, which presumably acts to maintain the precision of pollinator interactions by counteracting the increased number of cells below the point of anther attachment; in pin flowers there is no such contrast in cell size (Webster and Gilmartin, 2006). It could be that the specific morphological features of the thrum are more intricate and complex to attain. This may involve the downregulation of some genes, but perhaps high levels of expression for the downstream genes is ultimately required to drive the increased size of the pollen and corolla tube cells, with a greater number of genes being required to manipulate the differential cell sizes above and below the point of anther attachment; additional developmental pathways may be upregulated by the S locus to generate the specific morphology of the thrum flowers.
Genes with morph-specific expression seem to be of largely limited importance in the regulatory pathway downstream of the S locus; there is no associated GO term
80
overrepresentation, and the majority of associated functional annotations reveal no immediately apparent role in cell division and elongation processes that act to bring about the distinct heterostylous floral morphologies (Webster and Gilmartin, 2006). Though it may prove that a small proportion of these morph-specific genes are important in this pathway, perhaps it is not surprising that the majority of the determining genes appear to be differentially expressed rather than subject to morph- specific silencing; precise adjustments in expression rather than complete downregulation of genes in the specific pathways leading to the fine-scale tuning of reciprocal organ height in the two forms of Primula flower could be seen as in keeping with the intricate developmental patterns that might be associated with such a system, as well as the precise nature of insect-mediated pollen transfer (Cohen, 2010, de Vos et al., 2014).
In summary, the P. vulgaris long-homostyle assembly and associated genomic analyses represent a significant improvement over existing resources for the Primulaceae, as well as highlighting this genome as a potential platform for the identification and assembly of the Primula S locus. These results exemplify the necessity for good starting material in the assembly, annotation and analysis of a non-model plant species; without such material it would not have been possible to assemble a good quality genome sequence without significant further funding. The current study describes a comprehensive set of genomic data that will facilitate the drive towards identifying the regions underpinning heterostyly in diverse families across the angiosperms.
81
3 Annotation and characterisation of S-linked genes
3.1 Relevant publications
Cocker, J. M.*, Webster, M. A.*, Li, J., Wright, J., Kaithakottil, G., Swarbreck, D., Gilmartin, P. M. (2015) Oakleaf: an S locus-linked mutation of Primula vulgaris that affects leaf and flower development. New Phytologist, 208: 149–161.
Li, J., Webster, M. A., Wright, J., Cocker, J. M., Smith, M. C., Badakshi, F., Heslop- Harrison, P., Gilmartin, P. M. (2015) Integration of genetic and physical maps of the Primula vulgaris S locus and localization by chromosome in situ hybridization. New Phytologist, 208: 137–148.
* These authors contributed equally
3.2 Introduction
Primula vulgaris plants have one of two forms of flower, pin or thrum. The S locus, which controls the development of these floral morphs, comprises a tightly linked cluster of at least three genes (Lewis and Jones, 1992); the dominant alleles of these genes co-segregate with the thrum phenotype, which presents anthers with large pollen at the mouth of the corolla tube and the stigma midway down. The pin has a reciprocal floral morphology, with the stigma at the mouth of the flower and the anthers halfway down the floral tube. This arrangement of reproductive structures serves to physically promote insect-mediated outcrossing between plants with the reciprocal forms of flower. The architectural constraints imposed by the complementary positioning of the
82
male and female reproductive structures is reinforced by a putatively sporophytic self- incompatibility (SI) system, for which the determining genes are thought to be linked to or under the control of the S locus and its constituent genes (Lewis, 1949, Dowrick, 1956de Nettancourt, 1997, Lewis and Jones, 1992). There is typically a 1:1 ratio of pin to thrum plants in field populations based on equally frequent pin and thrum genotypes (Ornduff, 1979).
It can be easily observed whether a plant has flowers of one form or the other, as such it is straightforward to determine whether a phenotypic character is linked to the S locus based on co-segregation of the character with the pin or thrum phenotype. As a result, over the many years of study since Darwin first explained the importance of the system in promoting outcrossing (Darwin, 1877), several genes have been identified as linked to the S locus in Primula sinensis and P. vulgaris, including flower pigment genes (De Winton and Haldane, 1933, Kurian, 1996), Hose in Hose (Fig. 3.1) (Ernst, 1942, Webster and Grant, 1990, Webster and Gilmartin, 2003, Webster, 2005, Li et al., 2010) and sepaloid (Webster and Gilmartin, 2003, Webster, 2005, Li et al., 2008).
Figure 3.1 Hose in hose wood block print (van de Passe, 1614) showing characteristic conversion of sepals to petals; arrow indicates flower with normal sepals that has reverted to wildtype. Figure reproduced from Li et al. (2010).
83
The above genes, as well as an array of S locus-linked genes and markers identified by random amplified polymorphic DNA (RAPD) and fluorescent differential display analyses (Manfield et al., 2005, Li et al., 2008) facilitated the formulation of a genetic map flanking the Primula vulgaris S locus (Figure 3.2) (Li et al., 2011, Li et al., 2015). In a drive towards identifying the key genes responsible for the development of the reciprocal floral architectures, the linked genes and markers were used as probes in the identification of founding BACs (Bacterial Artificial Chromosomes) in a BAC-walking strategy that resulted in BACs spanning the S locus (Li et al., 2011, Li et al., 2015). In one of the crosses used to generate the genetic map, a self-fertile short homostyle was produced, the sequencing and assembly of which was described in Chapter 2 of this thesis. This was surprising as homostyles are notably rare, with Richards (1997) estimating that they might occur at frequencies of less than 1% in wild P. vulgaris populations. Darwin (1877) noted them, as did Bodmer (1960) in Somerset, and Crosby (1940) in the Chilterns, but these are long homostyle populations.
Figure 3.2 Genetic map of the Primula S locus; mapping distances (cM) for S locus- linked genes and markers are shown using (i) data from Lewis and Jones (1992) for the predicted distance between S locus constituent genes (GPA), (ii-v) data from analyses in Li et al. (2015a). Figure reproduced from Li et al. (2015a).
Following the rediscovery of Mendel’s work on pea plants (e.g. Bateson, 1902), Bateson and Gregory’s studies in defining the dominance relationship of pin and thrum
84
flowers (Bateson and Gregory, 1905) led to a series of studies which mark heterostyly in Primula as one of first genetic systems for which genetically-linked phenotypes were identified (Gregory, 1911, Bridges, 1914, Altenburg, 1916); these phenotypes include Hose in Hose (Ernst, 1942) and four loci in P. sinensis: magenta (b), red stigma (g), red leaf back (l) and double (x) (Gregory, 1911, De Winton, 1928); genes responsible for sporophytic SI are also linked (Lewis, 1949, de Nettancourt, 1997, Lewis and Jones, 1992). De Winton and Haldane (1935) constructed the first genetic map for a distylous species. However, the identification of floral variants in Primula dates back over 400 years (van de Passe, 1614). This is perhaps owing to the primrose’s status as one of the most popular garden plants in the world, known in European gardens since the time of medieval herbalists (Mast et al., 2001, Richards and Edwards, 2003). It seems quite remarkable that Hose in Hose, shown illustrated by van de Passe in 1614 (van de Passe, 1614) (Figure 3.1) was subsequently identified as S locus-linked (Ernst, 1942, Webster and Grant, 1990), although perhaps not so surprising in the eyes of Darlington (1931), who noted that there were a great number of S-linked loci and used this as a basis to suggest that the S locus might lie on the largest chromosome, as confirmed in analyses associated with the current study (Li et al., 2015); in situ hybridisation analyses reveal that the P. vulgaris S locus is located close to the centromere of the largest metacentric chromosome, with PvGLO orientated proximal to the centromere. This confirms previous evidence of linkage to the centromere based on “double reduction” in autotetraploid plants, and is in line with suggestions of recombination suppression at the S locus to maintain the distinct pin and thrum genotypes (Darlington, 1929, Ornduff, 1992).
In addition to the Hose in Hose and sepaloid mutants that were established as S locus- linked in more contemporary studies (Ernst, 1942, Li et al., 2008), a third S locus-linked developmental phenotype exists in Primula vulgaris that produces lobed leaves and distinctive attenuated petals with variable shape and character, possibly due to differential expression of the locus in different genetic backgrounds (Figure 3.3). This mutant phenotype was named Oakleaf and is revealed as a single dominant locus through the observation of the expected ratio of progeny in crosses between Oakleaf and wildtype plants, supported by chi-squared analysis (Cocker et al., 2015). The phenotype is sometimes visible in seedlings as lobed cotyledons, with the first true leaves consistently showing a lobed phenotype that is similar in appearance to that of Quercus
85
species, such as the oak tree. Ectopic meristems, typically vegetative, can emerge from the leaves of Oakleaf plants. In addition, an increase in the separation and size of sepals can sometimes be observed, but the sepals are not lobed. The floral phenotype ranges from petals similar to wild type with splits in the corolla, to attenuated and separated petals; the most severe of which presents straight and narrow petals that look like the spokes of a wheel (Cocker et al., 2015). In combination with mutants that result in a disruption of floral-organ identity, the attenuation of petals or lobed appearance of Oakleaf leaves are observable in whorls that are converted to petals or leaves (Cocker et al., 2015). For example, in combination with Hose in Hose, where the first whorl is converted from sepals to petals, the first whorl petals show attenuation; thus, the effect of Oakleaf is organ-specific rather than whorl-specific.
86
Figure 3.3 Oakleaf developmental phenotypes (bars = 1 cm), (a) seedlings with wild‐ type and Oakleaf phenotypes (arrows), (b) leaf from Oakleaf plant, (c) Oakleaf mutant with distinctive lobed leaves and attenuated petals, (d) extreme attenuated petals phenotype, (e) partially attenuated petals, (f) Oakleaf plant with near normal petals, (g) Oakleaf leaf with leaves developing from ectopic meristem (arrow), (h) ectopic flower (arrow), (i) ectopic flower in (h) with developing seed capsule following pollination (arrow). Figure reproduced from Cocker et al. (2015).
The above features (lobed leaves with occasional ectopic meristems) are highly similar to the phenotype of Arabidopsis thaliana with ectopically expressed Class I KNOX genes (Lincoln et al., 1994, Chuck et al., 1996, Hay and Tsiantis, 2010), and other species including tobacco (Hareven et al., 1996), which like P. vulgaris resides in the asterids. This suggests KNOX-like genes in P. vulgaris are a good starting place for identifying the cause of Oakleaf.
Gregory (1911) described a P. sinensis mutation named “o” that also affects the flowers and causes oak-shaped leaves, but it is distinct from Oakleaf in that it was found to be recessive and is not linked to the S locus. The Oakleaf phenotype was identified in 1999
87
amongst commercial ornamental Primula plants and was developed in polyanthus form as a commercial variety by Margaret Webster, with a division of the original mutant plant being used to establish the Oakleaf population used in the current study (Cocker et al., 2015).
This chapter describes transcriptomic and genomic analyses towards the characterization of the new S locus-linked mutant named Oakleaf, as well as the ab initio annotation of genes in the BAC region spanning the S locus, in an effort to orientate and validate the region as a prelude to the identification of the genes controlling the development of heterostyly.
3.3 Methods
Plant material
Plants used in this study are wild-type Primula vulgaris Huds. and derived commercial cultivars. Primula vulgaris Oakleaf plants were originally obtained from Richards Brumpton (Woodborough Nurseries, Nottingham, UK) in 1999 and maintained by Margaret Webster as part of the National Collection of Primula, British Floral Variants. Plants were grown as described previously by Margaret Webster and JL (Webster & Gilmartin, 2006).
Differential gene expression between Oakleaf and wild-type
RNA was isolated from leaves and open flowers of Oakleaf and wild-type pin plants, as well as mixed stage pin and thrum flowers for RNA-Seq using Illumina HiSeq2000 (Table 3.1). RNA-Seq reads were aligned to LH_v1 contigs using TopHat (v2.0.8) (http://ccb.jhu.edu/software/tophat/index.shtml), followed by construction and merging of the transcriptome using Cufflinks (v2.1.1) (Trapnell et al., 2013) (http://cole-trapnell- lab.github.io/cufflinks/). RNA-Seq reads from mixed stage pin and thrum flowers were used for transcriptome assembly but not subsequent expression analysis. HTSeq (Anders et al., 2014) was used to count raw read numbers per gene with RNA-Seq data from Oakleaf and wild-type leaf and flower samples. DESeq (v1.16.0) was used to
88
normalise these read counts by estimating the effective library size (Anders & Huber, 2010) and to carry out differential expression analysis. Genes upregulated by a ×2 log2 fold-change in both Oakleaf leaves and Oakleaf flowers were characterised by BLASTX analysis (e-value 1 × 10−4) (Camacho et al., 2009) to identify related sequences in the TAIR10 (https://arabidopsis.org/) and NCBI “nr” protein databases, the latter alignments being used as an input for Blast2GO (Conesa et al., 2005). Sequences were submitted to NCBI under Bioproject number PRJNA260472.
Library Description Type Insert size Read count
LIB668 Pin mixed flower buds RNA 223 100768798
LIB669 Thrum mixed flower buds RNA 180 81045610
LIB976 Oakleaf flower RNA 220 24995179
LIB977 Pin mature open flower RNA 209 33400153
LIB978 Oakleaf leaf RNA 219 14310589
LIB979 Pin leaf RNA 215 45723021
Table 3.1 RNA-Seq libraries used for transcriptome assembly and differential expression analyses
Gene model predictions for P. vulgaris KNOX (PvKNOX) genes
Arabidopsis thaliana KNOX proteins, KNAT1, KNAT2 (Lincoln et al., 1994), KNAT3, KNAT4, KNAT5 (Serikawa et al., 1996), KNAT6 (Belles-Boix et al., 2006), KNAT7 (Li et al., 2011), and STM1 (Long et al., 1996) were aligned to the LH_v1 Primula vulgaris genome assembly (see Chapter 2 for assembly details) with Exonerate (v2.2.0) (Slater & Birney, 2005). Primula vulgaris KNOX loci were identified and the gene models confirmed by transcript evidence from TopHat (v2.0.8) and Cufflinks (v2.1.1) (Trapnell et al., 2013) and by homology of the predicted proteins to KNOX proteins from the TAIR10 protein database (https://www.arabidopsis.org/). Parameters for protein sequence comparisons were ≥ 50% identity with ≥ 30% coverage of the KNOX query sequence. Gene models were curated manually where necessary with
89
GenomeView to resolve the intron and exons sizes of all eight predicted genes (http://genomeview.org/). PvKNL1 was initially identified as two sequences on separate genomic contigs, but was resolved as one locus by alignment to a single transcript sequence spanning the boundary between the two sequences; this transcript was present in a de novo transcriptome assembly generated with Trinity (Grabherr et al., 2011) by TGAC using RNA-Seq data from P. vulgaris pin and thrum mixed stage flower buds, as used in the Cufflinks assembly.
Generation of the PvKNOX phylogenetic tree
Multiple sequence alignment of Zea mays KNOTTED1, A. thaliana KNOX proteins, and predicted PvKNOX protein sequences was carried out in MEGA6 using MUSCLE (Edgar, 2004, Tamura et al., 2013). To obtain phylogeny support, Bayesian analyses were performed using MrBayes (v3.2.2) (Ronquist et al., 2012) and output files visualised in FigTree (v1.4.0) (http://tree.bio.ed.ac.uk/software/figtree/). The mixed amino acid substitution model was used, and the first 25% of samples discarded as burn-in. The consensus tree was obtained after 1,000,000 generations, with the average standard deviation of split frequencies below 0.01 to ensure convergence.
Identification of variant sites between Oakleaf and wild-type
BAM files produced with TopHat (in the above transcriptome assembly) were used as an input for the SAMtools (v0.1.18) (Li et al., 2009a) “mpileup” tool for the identification of variant sites between Oakleaf and LH_v1, and the wild-type pin plant and LH_v1, for both leaves and open flowers. The positions of the KNOX-loci based on the curated PvKNOX gene models predicted above were used to determine the resulting amino acid substitutions that would result from the identified SNPs. The predicted impact on protein function of each amino acid substitution identified above was determined by PMG using the SIFT prediction tool (Ng and Henikoff, 2003b).
Linkage analysis of PvKNOX candidate genes
Four libraries of genomic paired-end sequencing reads for individual pin and thrum plants (LIB1732 and LIB1167; Table 2.1), and separate pools of 24 pin and 28 thrum progeny resulting from a cross between the individual pin and thrum plants (LIB1730
90
and LIB1731; Table 2.1), were mapped to the LH_v2 genome assembly using BWA v0.6.2 (Li and Durbin, 2009). This analysis was carried out at a later stage to the analyses described above, hence the use of the LH_v2 assembly. SAMtools “rmdup” was used to reduce amplification biases (PCR duplicates) for the mapped read libraries (Li and Durbin, 2009).
SAMtools (v0.1.18) (Li et al., 2009a) was used to call SNPs between the four read libraries and the LH_v2 contigs; sites were excluded with mapping quality (MQ) < 20 and depth (DP) < 10 in the Variant Call Format (VCF) files for each of the four read libraries; sites with genotype quality (GQ) < 30 in the thrum or pin parent were also omitted. PvKNOX genes were mapped to the P. vulgaris long-homostyle LH_v2 genome assembly using Exonerate (v2.2.0) (Slater and Birney, 2005), and SNPs extracted for the contigs to which they aligned. Sites that were heterozygous in the thrum parent and homozygous for an alternate (non-reference) allele in the pin parent were analysed to ascertain the ratio of reference to alternate alleles in the progeny, and thus calculate the mean genetic distance from the S locus in centimorgans (cM) for each contig based on the number of recombinants represented by the minor (alternate) allele frequency at each site. The distance used for each site was the lower of the two putative distances calculated for either the pin or thrum progeny, assuming one must be homozygous and one heterozygous if the contig is linked. Sites were only included where the reference allele frequency was lower than the alternate ( ) allele frequency in either pin or thrum progeny, whichever was selected based on a smaller associated genetic distance. Exonerate (v2.2.0) (Slater and Birney, 2005) was used to map PvKNOX genes to the LH_v2 contigs, and genetic distances for the associated contigs extracted.
BAC contig assembly
BAC library construction, screening and sequencing was carried out by JL, and is documented in Li et al. (2011 and 2015). The contig assemblies of BAC sequences from the resulting 454 and Illumina reads were generated by JW using gsAssembler (v2.6) and ABySS (v1.3.6) respectively (Simpson, 2009). BAC contig assembly to combine the small NGS-derived contigs above was carried out by JW using minimus2 (Sommer et al., 2007) to merge overlapping BAC sequences and BAC-end sequences (with
91
sequence identity > 98%); this was facilitated by the inclusion of contigs from the draft thrum genome assembly TP_v2 (see Chapter 2), and further merging of contigs based on regions overlapping > 500 bp using BLAT v3.5 (Kent, 2002).
Repeat masking of the BAC contig assembly
RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) was used to identify de novo repetitive sequences in the draft thrum genome assembly TP_v2; the resulting sequences were hard-marked for inclusions from protein coding genes using BLASTX (v2.2.28) (Camacho et al., 2009) alignments (e-value 1x10-4) to protein databases from Actinidia chinensis (http://bioinfo.bti.cornell.edu/cgi- bin/kiwi/home.cgi), Mimulus guttatus (v2.0), Solanum tuberosum (v3.4) and Solanum lycopersicum (v2.4) (http://phytozome.jgi.doe.gov/).
The repeat library of de novo repetitive sequences comprised all hard-masked sequences with at least one alignment to a transposition-associated domain from Pfam-A (curated thresholds) or Pfam-B (evalue 1x10-4); alignments were carried out using HMMer hmmscan (v3.1b1) (http://pfam.sanger.ac.uk/; http://hmmer.janelia.org/search/hmmscan). Pfam domains were considered transposition-associated if they aligned with any of the sequences contained in the database of transposable elements included in the RepeatRunner package (http://www.yandelllab.org/software/repeatrunner.html).
Repeats and low complexity regions were identified in the BAC contig assembly using the repeat library with a local installation of RepeatMasker based on the RMBlast algorithm (version open-4.0.1: http://www.repeatmasker.org/).
Prediction of genes in the BAC contig assembly
The self-training gene annotation program GeneMark-ES (http://opal.biology.gatech.edu/) was used with the draft thrum genome assembly to produce a training file which served as an input for the de novo gene finder GeneMark- E in the annotation of the repeat-masked BAC assembly. The gene models obtained for each contig were scanned with the software package Full-LengtherNEXT (e-value 1x10-
92
4) (http://www.scbi.uma.es/site/scbi/downloads/313-full-lengthernext) in order to classify them as full- length, 5’-end, 3’-end or internal.
Functional annotation of the BAC contig
BLASTX (e-value 1x10-4) (Camacho et al., 2009) was used to query the putative genes against the TAIR10 Arabidopsis thaliana protein database (http://www.arabidopsis.org/) and the NCBI non-redundant protein database, the result of the latter being an input for further annotation with Blast2GO (https://www.blast2go.com/) (Conesa et al., 2005) (see Table S2 (Li et al., 2015)).
3.4 Results
Prediction of Primula vulgaris KNOX-like (PvKNL) gene models
Oakleaf plants present lobed leaves and sometimes produce ectopic meristems on the veins of the leaves that can be floral or vegetative. Oakleaf is dominant to wildtype. These features have similarities to the role of Class I KNOX homeodomain genes in tomato and Cardamine hirsuta (Hareven et al., 1996, Bharathan et al., 2002, Hay and Tsiantis, 2006, Shani et al., 2009) and are particularly reminiscent of ectopically expressed Class I KNOX genes in Arabidopsis thaliana (Lincoln et al., 1994, Chuck et al., 1996, Hay and Tsiantis, 2010). We therefore hypothesized that Oakleaf might result from mutation of a KNOX-like gene in P. vulgaris.
On this basis, the current study proceeded with an exploration of KNOX-like genes in the Primula vulgaris genome to consider the following possibilities: (i) that the Oakleaf phenotype is the result of constitutive overexpression (as in Arabidopsis) of a KNOX- like homeodomain gene in mature leaves and flowers, (ii) that expression is unchanged but a mutation results in a dominant gain of function in protein activity, (iii) that the phenotype is the result of upregulation of a gene with no similarity to A. thaliana KNOX family genes.
The first reference-based transcriptome for Primula vulgaris was assembled prior to LH_v2 gene prediction (Chapter 2). This assembly comprised RNA-Seq datasets from
93
leaves and flowers of Primula vulgaris wild-type and Oakleaf plants that were used for subsequent expression analyses, as well as additional RNA-Seq reads from pin and thrum mixed stage flower samples, to maximise coverage of the transcribed regions of the genome (Table 3.1). The draft Primula vulgaris LH_v1 genome assembly available at the start of the current study was assembled prior to the final LH_v2 assembly (Chapter 2) and was used as a reference assembly for the above transcriptome (Table 2.2).
The seven Arabidopsis thaliana KNAT and STM coding sequences were aligned to the P. vulgaris genome with Exonerate (v2.2.0) (Slater and Birney, 2005) using evidence from transcripts in the Cufflinks-generated (Trapnell et al., 2013) transcriptome (above) to define the Primula vulgaris KNOX-like gene models. This analysis revealed five Class I and three Class II KNOX genes in the Primula vulgaris (LH_v1) genome (Figure 3.4). Class I KNOX family genes have roles in shoot apical meristem identity (Hay and Tsiantis, 2010); this is consistent with the ectopic meristems that are sometimes produced on the veins of Oakleaf leaves (Cocker et al., 2015). Class II KNOX family genes KNAT3, KNAT4 and KNAT5 are involved in root development (Truernit et al., 2006) and KNAT7 in the formation of secondary cell walls (Li et al., 2011a, Li et al., 2012). If these classifications hold true in P. vulgaris, then this suggests genes showing homology to Class I KNOX genes might be the best candidates for the Oakleaf phenotype.
PvKNL1 was originally identified on two separate contigs, supported by two transcript sequences, one comprising three exons from the 5’ end of the KNOX-like gene and the other two exons from the 3’ homeodomain region, potentially forming part of the same KNOX-like gene; this was confirmed using a de novo Trinity (Grabherr et al., 2011) transcriptome assembly containing a single transcript spanning the boundary between the two exon sequences.
RNA-Seq reads or transcriptome data can be used to resolve gaps between contigs as it captures information about the connections between exons in a single gene (Zhang et al., 2016). Introns that are absent from transcribed sequences represented by RNA-Seq reads may be sufficiently repetitive to complicate the process of assembling genomic sequencing data (Wang et al., 2008); this can lead to fragmented genes comprising predictions in multiple contigs that can be connected by scaffolding approaches that
94
leverage the information provided by exons anchored to distinct genomic fragments; as a result, evidence such as that present in the above de novo transcriptome data is sufficient to join the two contigs in the Primula vulgaris genome assembly, with a gap of unknown size between contigs and the fragmented intron sequence therein. The P. vulgaris genome thus contains eight expressed PvKNOX genes, with the predicted gene structures shown in Figure 3.4.
Figure 3.4 Primula vulgaris Knotted-like genes grouped based on similarity to Class I and Class II A. thaliana KNOX genes: predicted gene structures of the eight Primula vulgaris Knotted-like (PvKNL) and Shoot meristemless-like (PvSTL) gene models that show amino acid similarity to Arabidopsis thaliana KNAT and STM KNOTTED-like homeobox (KNOX) gene family members (thick lines = exons, thin lines = introns. Figure reproduced from Cocker et al. (2015).
Characterisation of the PvKNOX gene family
The Oakleaf phenotype is similar to that displayed by Arabidopsis thaliana plants where overexpressed KNAT1 induces lobed leaves with ectopic meristems (Lincoln, 1994). The Knotted mutant, after which Knotted-like homeobox (KNOX) genes are named, was first discovered in maize. KNAT1 and KNAT2 were identified through the use of the maize Knotted-1 (Kn1) homeobox as a heterologous probe (Lincoln, 1994), whilst further low stringency screening using these two sequences identified further
95
KNAT genes in Arabidopsis (Long et al., 1996, Serikawa et al., 1996, Belles-Boix et al., 2006, Li et al., 2011a). The phylogenetic tree (Figure 3.5) was generated using the eight PvKNOX-family protein sequences, as well as the amino acid sequences for A. thaliana KNAT (Long et al., 1996, Serikawa et al., 1996, Belles-Boix et al., 2006, Li et al., 2011a), STM (Long et al., 1996), and the Zea Mays KNOTTED-1 protein that was originally used to design the heterologous probe (based on the coding sequence), as described above. This analysis facilitated naming of the PvKNOX genes as Knotted-like (PvKNL) and Shoot meristemless-like (PvSTL) based on protein sequence similarity, and facilitated grouping of these genes as Class I and Class II, as defined for KNOX homeodomain gene families in Maize (Vollbrecht et al., 1991), Arabidopsis (Lincoln et al., 1994, Long et al., 1996, Serikawa et al., 1996, Belles-Boix et al., 2006, Li et al., 2011) and other species (Bharathan et al., 1999, Hay and Tsiantis, 2010) based on their expression and phylogenetic relationships (Kerstetter et al., 1994, Bharathan et al., 1999). Further low-stringency alignments with Exonerate (v2.2.0) confirm that P. vulgaris has no homologue of AtKNAT5, but it does have two AtSTM-like genes, resulting in five Class I and three Class II P. vulgaris KNOX-like genes in total.
96
Figure 3.5 Bayesian strict-click phylogenetic tree based on multiple sequence alignments of Zea Mays KNOTTED1, Arabidopsis thaliana KNAT and STM, and the eight Primula vulgaris PvKNL and PvSTL amino acid sequences. Posterior probabilities for clades are shown as percentages. Figure reproduced from Cocker et al. (2015).
Expression of PvKNOX genes in Oakleaf and wild-type P. vulgaris
The normalised read counts for RNA-Seq reads mapped to the nine PvKNOX-family genes facilitated a direct comparison between expression levels in leaves and flowers of Oakleaf and wildtype plants (Figure 3.6). The criterion of constitutive over-expression of a PvKNOX gene identified by upregulation in both leaves and flowers of Oakleaf plants compared to wild-type was used to determine whether the Oakleaf phenotype is caused by overexpression of Class I KNOX genes, as is the case for the similar knotted phenotype in A. thaliana (Lincoln et al., 1994, Chuck et al., 1996, Hay and Tsiantis, 2010).
97
For RNA-Seq data analysis, an estimate of the false discovery rate (FDR) is often used to correct for multiple testing errors when considering thousands of genes simultaneously; the more tests performed, the more likely some of those tests will be significant just by chance (Bi and Liu, 2016). In practice, the true FDR of a new dataset is unknown (Li, 2012): with a low number of replicates there is a failure to control FDR under the model assumed by differential expression tools due inaccurate calculation of the uncorrected p-values (Schurch et al., 2016). It is therefore prudent to carry out RNA-Seq analyses with biological replicates for each condition to unambiguously identify differentially expressed genes. If the biological variance within a condition is unknown, some of the estimates for fold change will be imprecise. In contrast, parallel measurements of biologically distinct samples capture random biological variation in a design that incorporates RNA-Seq replicates of biological conditions (Blainey et al., 2014).
In the current study, despite lack of replicates for each condition, it was reasoned that upregulation in both Oakleaf leaves and flowers as an indication of constitutive upregulation was sufficient to counteract the possibility of false positives in differential expression if either of these measures were to be used in isolation; the above observations are noted as a possible caveat.
98
Figure 3.6 The normalised counts for RNA-Seq reads mapping to PvKNOX family genes for RNA isolated from Oakleaf leaves (grey bars) and wild-type leaves (white bars) (a); Oakleaf flowers (grey bars) and wild-type flowers (white bars) (b); and the log2 fold change in expression between Oakleaf and wild-type leaves (grey bars) and Oakleaf and wild-type flowers (white bars) (c). Figure reproduced from Cocker et al. (2015).
99
The low expression of Class I PvKNOX genes in the leaves of Oakleaf and wildtype P. vulgaris (Figure 3.6) suggests that these genes are not responsible for the Oakleaf phenotype; PvKNL6 is upregulated in both leaves and flowers of Oakleaf but its low expression in Oakleaf leaves is not suggestive of constitutive upregulation. In contrast, the Class II PvKNOX genes show high expression in P. vulgaris leaves and flowers;
PvKNL3 shows minimal log2 fold-upregulation in both Oakleaf leaves (0.31) and flowers (0.26) as compared to wildtype. These data suggest that none of the PvKNOX family genes are promising candidates for Oakleaf based on an expectation of constitutive expression, as they do not show a high level of expression and upregulation in both Oakleaf leaves and flowers.
Sequence comparison of PvKNOX genes in wild-type and Oakleaf
SAMtools (v0.1.18) (Li et al., 2009a) was used to identify single nucleotide polymorphisms (SNPs) between Oakleaf RNA-Seq reads and the P. vulgaris LH_v1 genome assembly (Table 3.2). The Oakleaf plant that was used is heterozygous at the Oakleaf locus in a pin genetic background, SNPs that are heterozygous in Oakleaf and absent in the pin RNA-Seq reads are therefore potentially Oakleaf specific.
This analysis identified 10 Oakleaf-specific SNPs between seven of the PvKNOX genes in Oakleaf and wildtype: three are predicted to cause truncated proteins (in PvKNL2 and PvSTL1) (Table 3.2), with the remaining seven analysed with SIFT to predict the potential impact on protein function (Ng and Henikoff, 2003). Of these, five SNPs were predicted to result in tolerated amino acid substitutions, with the remaining two, Leu335-Ser (PvKNL3) and Gly6-Glu (PvKNL7), predicted to result in non-tolerated amino acid substitutions that could affect protein function. In addition, profiles of RNA- Seq read coverage for Oakleaf and wildtype leaves and flower were analysed using GenomeView (http://genomeview.org/); no discernible difference was observed between Oakleaf and wildtype read profiles, suggesting the PvKNOX genes in Oakleaf do not have alternate splicing profiles that could result in a protein with dominant function due to lack of a critical regulatory domain for example.
100
Gene Position Base Amino Oakleaf flower Oakleaf leaves Comments Change acid SNP Total SNP Total change reads reads reads reads
PvKNL1 671 TCT - ACT Ser35 – 7 7 0 0 Homozygous Thr PvKNL1 1882 ACG - ATG Thr161 - 3 3 0 0 Homozygous Met PvKNL2 3534 GCT - GGT Ala120 - 3 3 No - Homozygous Gly SNP PvKNL2 25367 GGG - Gly233 - 9 10 No - Also found in GCG Ala SNP pin PvKNL2 30146 CAG - TAG Gln298 - 2 15 No - Truncated Stop SNP protein PvSTL1 8202 Frameshift - aa12: 3 8 No - Truncated 2 bp stop16 SNP protein PvSTL1 8188 GAA - Glu17 – 8 15 2 2 Substitution GCA Ala tolerated PvSTL1 8068 Frameshift aa57: 4 11 No - Truncated +2 bp stop63 SNP protein PvSTL1 2413 TCG - CCG Ser271 - 5 10 No - Substitution Pro SNP tolerated PvSTL1 2278 GTC - ATC Val316 - 8 13 No - Substitution Ile SNP tolerated PvSTL2 7432 GAA - Glu47 - 37 37 0 0 Homozygous CAA Gln and in Pin PvSTL2 7336 GCG - ACG Ala79 - 29 29 0 0 Homozygous Thr and in Pin PvSTL2 3225 ATT - ATG Ile197 - 31 31 0 0 Homozygous Met and in Pin PvKNL3 7545 CAG - CAT Gln65 - 67 98 50 63 Substitution His tolerated PvKNL3 2860 TTG - TCG Leu335 - No No SNP 2 15 Substitution not Ser SNP tolerated PvKNL3 2034 AGT - AGA Ser395 - No No SNP 2 11 Also found in Arg SNP pin PvKNL4 2083 AAC - AGC Asn28 - 17 69 23 66 Substitution Ser tolerated PvKNL7 8564 GGG - Gly6 - 0 0 66 132 Substitution not GAG Glu tolerated
101
Table 3.2 Single nucleotide polymorphisms (SNPs) in PvKNOX family genes for Oakleaf RNA-Seq reads versus wildtype LH_v1 genome assembly. Position in genomic DNA (contig) is shown, with the nucleotide and associated change in amino acid indicated. Read counts for the total number of mapped reads and reads supporting the SNP are shown for Oakleaf leaves and flowers; where there is no SNP at a particular position, “No SNP” is shown. For predicted frameshifts, the amino acid at which the nucleotide change occurs and the new stop codon in the shifted reading frame is shown. The comments indicate whether the Oakleaf SNP is homozygous or heterozygous; where it is heterozygous and not present in pin, the impact of the SNP on the encoded protein is indicated; this was predicted (by PMG) using SIFT (Ng and Henikoff, 2003) (http://sift.bii.a-star.edu.sg/).
Differential expression between Oakleaf and wildtype P. vulgaris
The P. vulgaris draft transcriptome generated in this study contains 39,193 transcripts; normalised RNA-Seq read counts for each of the transcripts obtained using HTSeq and DESeq (Anders and Huber, 2010, Anders et al., 2014) were subjected to a log2 fold- change cut-off > 2 to reveal 507 genes potentially upregulated in both Oakleaf leaves and flowers versus wildtype (Figure 3.7). If there are no biological replicates, as in our analysis, then we do not know the within-group variance (Blainey et al., 2014): p-values can only be calculated with the use of pooled data from genes with similar expression strengths for the purpose of variance estimation (Anders and Huber, 2010), thereby reducing power; as a result, in this analysis we instead use a cut-off on the fold change in normalized expression to produce a set of differential expressed genes.
If the Oakleaf phenotype is a result of constitutive overexpression as predicted, then this geneset could contain a gene unrelated to PvKNOX that is responsible for the Oakleaf phenotype; none of the PvKNOX genes are in this set of differentially expressed genes as their fold-change in expression is below the two-fold cut-off. The P. vulgaris LH_v2 genome assembly and annotations were not available at the start of the current study, hence the use of a draft transcriptome with LH_v1 as a reference genome.
102
Figure 3.7 (a) Genes 2x log2fold-upregulated upregulated and (b) 2x log2fold- downregulated in P. vulgaris Oakleaf leaves and flowers versus wildtype P. vulgaris pin leaves and flowers. Genes that are up- or downregulated in both leaves and flowers of Oakleaf are indicated. Figure reproduced from Cocker et al. (2015).
KNOX proteins are transcriptional regulators, suggesting changes in their expression in Oakleaf would result in wider alterations in the expression of target genes in the downstream regulatory network. In the event that a PvKNOX gene is shown to be responsible for the Oakleaf phenotype, then the differentially expressed genes, including those that are both up- and downregulated in Oakleaf as compared to wildtype, are potential downstream targets that may be in the downstream regulatory network leading to the Oakleaf phenotype (Figure 3.7).
Annotation of the BAC contig assembly
The de novo annotation of BAC contigs flanking the S locus was carried out using GeneMark (http://exon.gatech.edu/GeneMark/). In total, 266 predicted genes or gene fragments were identified. BLASTX alignments identified matches in the NCBI “nr” and TAIR10 databases for 119 of the predicted protein sequences; after removal of potential duplicates indicated by gene models on the same contig or neighbouring
103
contigs that matched the same Arabidopsis gene, there were 82 Arabidopsis genes related to the predicted genes (Li et al., 2015).
The BAC assembly comprises BAC contigs positioned on the “left” and “right” of the S locus. In BAC Contig S-right (comprising BACs to the right of the S locus) PvSLL1 was identified on S_locus_groupB_ctg13, PvGLO on S_locus_groupB_ctg58, and PvSLP1 on S_locus_groupB_ctg36. PvSLL2 was found on contig S_locus_groupA_ctg9 within Contig S-left (comprising BACs to the left of the S locus); see Li et al. (2015) for details. These results unequivocally confirm the order of the S- linked markers as PvSLL2-S locus–PvSLL1–PvSLP1–PvGLO.
Oakleaf in the BAC contig assembly
The S locus-linked marker PvSLL2 is located 0.05 to 1.37 cM from the S locus within BAC Contig S-left, whilst crosses performed in Cocker et al. (2015) and Li et al. (2015a) give a map distance of between 0.56 and 2.55 cM for Oakleaf on the same side of the S locus as PvSLL2. Perhaps unsurprisingly then, none of the PvKNOX genes are located in BAC Contig S-left in which PvSLL2 lies due to its relatively close proximity to the S locus. Furthermore, none of the genes upregulated in Oakleaf versus wildtype are in this contig. The functional annotation of predicted genes in this assembly does not reveal any obvious candidates for Oakleaf.
Linkage analysis of Oakleaf candidates
Four libraries of genomic paired-end sequencing reads for individual pin and thrum plants, and separate pools of pin and thrum progeny resulting from a cross between these individual pin and thrum parental plants, were mapped to the LH_v2 genome. SAMtools (v0.1.18) (Li et al., 2009a) was used to call SNPs between these libraries and LH_v2 contigs, and the results used to calculate the estimated genetic distance from the S locus based on the ratio of reference to alternate alleles for either the pin or thrum progeny; the lower genetic distance of the two at each position was incorporated into the mean distance tabulated (Table 3.3). The SAMtools SNP calling pipeline is designed for use with samples from diploid individuals (Li and Durbin, 2009, Raineri et al., 2012); sequenced pools of individuals offer a lower cost alternative to sequenced individuals but may be unsuitable for inferring linkage, it is difficult to distinguish between read
104
errors (0.1 – 1% for Illumina sequencing) and low-frequency alleles, based on each read being an independent draw from a large pool of chromosomes (Schlotterer et al., 2014). Individuals may be differentially represented due to unequal amounts DNA in the pool; a lack of multiple reads for each individual means sequencing errors are hard to resolve (Schlotterer et al., 2014). Furthermore, the issues above in combination with the low number of individuals present in the pools of 24 pin and 28 thrum progeny means that the calculated genetic distance may be imprecise, with discordant genetic distances estimated at each site; small pool sizes yield suboptimal results (Schlotterer et al., 2014).
In summary, the approach applied here offers only a crude estimate of the mapping distance. The depth (DP) and mapping quality (MQ) values were thus used to filter SNPs in the progeny pools in order to reduce coverage problems based on misaligned reads, the quality values associated with SAMtools (Li et al., 2009a) SNP calling or genotype quality (GQ) were not used, as they may be inappropriate based on the above. Nonetheless, ignoring the above scenarios and assuming an equal number of reads for each individual in the progeny pools, it was reasoned that the ratio of pin progeny alleles could provide a preliminary measure of the distance of each contig from the S locus. The estimated distance in centimorgans (cM) was determined for each contig by dividing the number of reference alleles at each site by the number of alternate alleles, then calculating the mean distance (±SE) across all sites (Table 3.3). The analysis indicates that seven of the eight PvKNOX genes either recombine very frequently with the S locus, or are otherwise situated on a different chromosome; there are no SNPs that pass the criteria described above for PvKNL2.
105
Gene # SNPs Distance (cM) ±SEM
PvKNL2 0 NA NA
PvKNL1 35 31.92 1.63
PvKNL7 65 32.93 1.74
PvSTL2 12 24.42 1.42
PvSTL1 385 29.93 0.61
PvKNL3 18 27.86 2.80
PvKNL6 143 30.30 0.89
PvKNL4 443 41.22 0.86
Table 3.3 The mean genetic distance from the S locus (cM) is shown as predicted using SNPs present in pin and thrum read libraries mapped to LH_v2 contigs associated with the indicated PvKNOX genes. The standard error is indicated; problems with coverage and unresolved sequencing errors due to the use of pooled data (Schlotterer et al., 2014) may result in variability between the predicted distances at each site; the number of SNPs (# SNPs) is shown as the SEM may be artificially low in the event of a low number of SNPs being present in the contig.
In a preliminary analysis, the same strategy was used to test for linkage of contigs associated with the 507 genes upregulated in both Oakleaf leaves and flowers; using this approach the contigs associated with ten genes show mapping distances < 2.55 cM from the S locus, suggesting they may be potential candidates for Oakleaf should a KNOX gene not be responsible.
106
3.5 Discussion
The lobed leaves with occasional ectopic meristems that P. vulgaris Oakleaf mutants display is reminiscent of the overexpression phenotype of Class I KNOX genes in Arabidopsis thaliana (Lincoln et al., 1994, Chuck et al., 1996, Hay and Tsiantis, 2010). For this reason, we explored KNOX-like genes in P. vulgaris in an attempt to identify the cause of the Oakleaf phenotype. The current study reveals eight KNOX-like genes in the P. vulgaris genome; phylogenetic analyses facilitated classification into groups of five Class I and three Class II KNOX-like genes based on similarity to A. thaliana KNOX protein sequences, and phylogenetic analysis (Figure 3.5) (Kerstetter et al., 1994, Bharathan et al., 1999).
Differential expression studies carried out in the current study reveal that Class I KNOX-like genes in P. vulgaris are expressed at a low level in Oakleaf leaves and flowers, thus rendering them poor candidates for Oakleaf. In contrast, the Class II PvKNOX genes are highly expressed in both leaves and flowers of Oakleaf and wildtype. However, PvKNL3 is the only gene upregulated in Oakleaf leaves and flowers, and the level of upregulation is minimal. In addition, studies in A. thaliana show that Class I and II KNOX genes are involved in different developmental processes; Class II KNOX family genes KNAT3, KNAT4 and KNAT5 are typically involved in root development (Truernit et al., 2006) and KNAT7 in the formation of secondary cell walls (Li et al., 2011a, Li et al., 2012) whereas the Class I KNOX family genes have roles in shoot apical meristem identity (Hay and Tsiantis, 2010); this is consistent with the ectopic meristems that are sometimes produced on the veins of Oakleaf leaves (Cocker et al., 2015).
The above, as well as the abnormal lobed-leaf phenotype of A. thaliana plants with overexpression of KNAT1 (Chuck et al., 1996), is the primary basis for the prediction that abnormal expression of a Class I KNOX-like gene might be the underlying cause of the Oakleaf phenotype. In comparison to the Class I PvKNOX genes, the comparable high-level expression of Class II genes in both leaves and flowers of P. vulgaris and Oakleaf is consistent with the broader expression of Class II KNOX genes in A. thaliana. This suggests the Class I and II KNOX genes in P. vulgaris may also be associated with different developmental processes due to their contrasting expression profiles (Serikawa et al., 1997, Bharathan et al., 1999, Truernit et al., 2006). The Class
107
II KNOX gene PvKNL3 is therefore not considered a strong candidate for Oakleaf as it is a Class II KNOX gene that shows only minimal upregulation in leaves and flowers of Oakleaf plants. The Class I KNOX genes do not show strong upregulation in Oakleaf; the Class I KNOX gene PvKNL6 does have higher read counts in both Oakleaf leaves and flowers, with higher fold-upregulation than PvKNL3, but the extremely low expression in both leaves and flowers is not consistent with the constitutive overexpression of KNOX-genes that is observed in A. thaliana (Lincoln et al., 1994, Chuck et al., 1996, Hay and Tsiantis, 2010). In conclusion, the Oakleaf phenotype is most likely not the result of overexpression of a KNOX-like gene in Primula vulgaris.
In lieu of a discernible difference in transcript expression, it was reasoned that mutation in a PvKNOX gene could lead to the Oakleaf phenotype through a dominant gain of function in protein activity. If there is an intronic mutation affecting a splice site in a PvKNOX gene then it would not be detected as a SNP in the Oakleaf RNA-Seq reads. However, it was reasoned that a splice site mutation might affect the coverage profile of Oakleaf RNA-Seq reads across the gene. For the eight PvKNOX genes, there was no discernible difference in the coverage profiles of Oakleaf and wildtype pin RNA-Seq reads, thus suggesting a splice site mutation was not resulting in a dominant gain of function in PvKNOX protein activity that could occur due to the absence of a critical regulatory domain as a result of protein truncation. The analysis of RNA-Seq reads mapped to PvKNOX genes revealed several Oakleaf-specific SNPs following removal of homozygous sites in Oakleaf and SNPs present in wild-type pin RNA-Seq reads. In the remaining heterozygous Oakleaf SNPs that were absent in pin, several were predicted to result in conservative amino acid changes and were discounted as the basis for Oakleaf. Three SNPs in PvKNL2 and PvSTL1 were predicted to cause truncation of the encoded protein, but were only observed in Oakleaf flower RNA-Seq reads, not leaves. Two SNPs in PvKNL3 and PvKNL7 would cause non-conservative amino acid substitutions, but were only observed in Oakleaf leaf RNA-Seq reads, not flowers. In conclusion, the absence of these SNPs in both leaf and flower Oakleaf RNA-Seq reads suggests that they are unlikely to be the cause of the Oakleaf phenotype. The differential expression and SNP analyses taken together suggest that mutation or overexpression of a PvKNOX gene is not the basis for Oakleaf.
108
In the studies associated with this chapter (Cocker et al., 2015), linkage of Oakleaf to the S locus was supported by the predominant co-segregation of Oakleaf with either the pin or thrum phenotype, with the observed number of recombinants between Oakleaf and the S locus being small in all crosses (Cocker et al., 2015). The crosses carried out with Oakleaf as the female parent pollinated with pollen from wildtype pin reveal Oakleaf as a single dominant locus, with chi-squared analysis supporting a 1:1 ratio (P > 0.700). The 1:1 ratio would be expected with a heterozygous parental Oakleaf plant, given the dominance of Oakleaf over wildtype. Furthermore, crosses between two heterozygous Oakleaf plants with thrum as either pollen donor or recipient reveal the 3:1 ratio of Oakleaf to wild-type progeny that would also be expected given a dominant Oakleaf locus (P > 0.95 and P > 0.50). However, one of the crosses with Oakleaf as the male parent revealed a significant (P < 0.001) deviation from the expected 1:1 ratio of Oakleaf to wild-type progeny. In this cross, 45 seedlings were lost before secondary leaf development. The first true leaves of Oakleaf plants show a consistent lobed phenotype that is similar in appearance to the leaves of Oak trees and shrubs (Quercus); this facilitates reliable identification of an Oakleaf plant (Cocker et al., 2015); seedlings that are lost before the first true leaves are produced cannot therefore be reliably classified as wildtype in the absence of Oakleaf characteristics prior to secondary leaf development.
The underrepresentation of Oakleaf progeny may be a statistical consequence of the low total number of progeny under consideration; however, if the 45 seedlings lost before secondary leaf development are included in the chi-squared analysis as wild-type, then the test would support a 1:1 ratio (P > 0.300) of Oakleaf to wild-type progeny. In addition, for three out of four crosses carried out between Oakleaf and wildtype plants, it appears that progeny losses before flowering are higher in wild-type than Oakleaf. The leaves of Oakleaf plants are thicker and firmer than wildtype (Cocker et al., 2015), and perhaps may be less susceptible to the ‘damping off’ that can occur in Primula seedlings due to bacterial or fungal infection before the secondary leaves emerge, thus suggesting the Oakleaf mutation may result in greater resilience to seedling loss under pathogen exposure or unfavourable conditions; this is a possible explanation for the distorted ratio of Oakleaf to wildtype progeny.
In Arabidopsis, Antirrhinum and tobacco, studies show that asymmetric leaves 1 (as1) mutants of the AS1 gene involved in downregulation of KNOX gene expression have
109
enhanced resistance to necrotrophic fungi, with phenotypes similar to KNAT1 overexpression lines (Hay et al., 2002, Nurmberg et al., 2007). If AS1 is responsible for the Oakleaf phenotype, then upregulation of the PvKNOX genes in Oakleaf might be expected, but this is not the case; perhaps overexpression of a downstream target of AS1 or PvKNOX genes is the cause of Oakleaf. The identification of Oakleaf is of interest in order to determine the molecular basis of this potential resilience in Oakleaf seedlings; this may be of use in an agronomic setting.
RNA-Seq data assembled into a de novo assembly were used to join two fragments of PvKNL1 that were located on two distinct genomic contigs; a single transcript spanning the two exons resolved the two fragments as a single gene. In this way, transcriptome data, particularly that spanning large introns, can effectively act as a long mate-pair (LMP) library with insert size “equivalent to intron length” (Zhang et al., 2016). This is presumably best approximated by the average intron size of the full complement of genes in the genome, analogous to the use of the average insert-size of an LMP library in conventional scaffolding approaches, without the complex and time-consuming process of generating such a library (Xue et al., 2013). In an analysis of intron size distributions in five high-quality assemblies, introns greater than 5 kb in length were present in 42.5%–83.9% of all gene loci (Xue et al., 2013). In addition, the concept of pervasive transcription indicates that the majority of the genome is transcribed to some degree (Clark et al., 2011), suggesting the above may act as a tool to scaffold large portions of the genome, particularly in transcribed and correspondingly gene-rich regions of functional interest (Xue et al., 2013). The use of RNA-Seq data to improve the Primula vulgaris genome assembly without the need for significant further funding is therefore an intriguing possibility for this non-model species given the availability of a comprehensive array of libraries representing a number of different tissues (Chapter 2); the current study provides a practical example for the potential utility of such an approach through the resolution of the PvKNL1 gene structure.
Differential expression in Oakleaf and wildtype pin plants have revealed sets of up- and downregulated genes that may be downstream targets of Oakleaf. If one of the PvKNOX genes is shown to be the cause of the Oakleaf phenotype then these candidate genes will offer insight into the regulatory pathways leading to the development of the distinctive lobed leaves and attenuated flowers. If, on the other hand, the PvKNOX genes are not
110
responsible for Oakleaf, then these sets of genes may contain the key gene itself. Indeed, six of these genes lie in contigs that show preliminary mapping distances less than the highest estimated distance of Oakleaf from the S locus (2.55 cM). Class I KNOX genes in Arabidopsis thaliana are associated with downstream upregulation of GA2 oxidase and IPT7, and downregulation of GA20 oxidase, affecting cytokinin and gibberellin concentrations; genes involved in lignin synthesis are also downregulated (Hay and Tsiantis, 2010). There are no such genes present in the set of differentially expressed genes identified in Oakleaf based on the functional annotations performed; this perhaps suggests that the phenotype is caused by a different pathway. Indeed, the expression and mutation analyses presented here suggest that some of the eight PvKNOX genes are at best only very weak candidates for Oakleaf, despite the similarity of Oakleaf to the phenotypes seen in Arabidopsis thaliana KNOX overexpression lines; these lines do not however, present abnormalities in the petals despite the occasional appearance of ectopic meristems on the leaves as in Oakleaf. In addition, based on the preliminary linkage analysis carried out, it is likely that seven of the eight PvKNOX genes are not linked to the S locus; there is no evidence of linkage for the remaining gene. This analysis offers only a very crude estimate of the mapping distance from the S locus; the large distances predicted suggest that although the distance may be somewhat smaller due to the inherent imprecision of the method used, it is unlikely these genes are in tight linkage with the S locus. Future segregation analyses offering more robust estimates of the genetic distance will confirm whether any of the eight PvKNOX genes are linked to the S locus.
None of the eight PvKNOX genes are present in the BAC assembly generated in analyses associated with this study (Li et al., 2015a); this is perhaps not surprising as PvSLL2 is estimated at < 1.37 cM from the S locus (Figure 3.2), with Oakleaf between 0.56 and 2.55 cM away. The de novo annotation of the BAC assembly confirms the positions of the S linked markers and reveals at least 86 new S-linked genes based on comparisons with genes in Arabidopsis thaliana (Li et al., 2015). The inspection of the functional annotations for the predicted genes in the BAC assembly reveals no obvious candidates for Oakleaf. The findings presented here suggest that Oakleaf is not caused by a PvKNOX gene. Future segregation analyses will confirm this, as well as reducing the number of candidates in the set of up- and downregulated genes identified. The molecular characterization of Oakleaf will contribute to resolving the orientation of the
111
BAC contig, and facilitate further BAC- or in silico-walking with genome contigs towards the S locus, to close the gap in the BAC assembly. The various genome assemblies presented in Chapter 2 perhaps offer the resources to accomplish this. This chapter describes efforts to identify Oakleaf and annotate the BAC assembly flanking the S locus, providing validation for the position of the S locus-linked markers and a platform for further analysis of the regions surrounding the S locus.
112
4 The structure of the Primula vulgaris S locus
4.1 Relevant publications
Li, J.*, Cocker, J.M.*, Wright, J., Webster, M.A., McMullan, M., Ayling, S., Swarbreck, D., Caccamo, M., Oosterhout, Cv., Gilmartin, P.M. (2016) Genetic architecture and evolution of the S locus supergene in Primula vulgaris. Nature Plants, 2: 16188.
Cocker, J.M., Wright, J., Li, J., Swarbreck, D., Dyer, S., Caccamo, M., Gilmartin, P.M. (2017) The Primula vulgaris genome (in preparation).
* These authors contributed equally
4.2 Introduction
The established genetic model for heterostyly in Primula predicts that the S locus comprises a cluster of at least three genes (Ernst, 1936c, Lewis and Jones, 1992). These genes control the development of morphological features associated with two forms of flower, the pin or thrum: G controls stigma height; P controls pollen size; and A controls anther positioning (Dowrick, 1956, Webster and Gilmartin, 2006). The pin form is thought to be homozygous recessive for the three genes (gpa/gpa) (s/s) and the thrum heterozygous (GPA/gpa) (S/s), such that presence of the dominant alleles for the three predicted genes results in plants presenting thrum flowers with a short style due to repression of cell elongation (G), elevated anthers through increased cell division in the corolla tube (A), and large pollen (P); pin flowers have the opposite morphology, thus promoting insect-mediated outcrossing based on the reciprocal positions of the reproductive structures (Webster and Gilmartin, 2006).
113
The S locus has remained elusive since Darwin first explained the importance of heterostyly in promoting outcrossing more than 150 years ago (Darwin, 1877); the independent origins of this system in distinct orders marks it as a remarkable example of convergent evolution, with similar genetic architectures predicted for the S locus in P. vulgaris, Turnera subulata, and Fagopyrum esculentum (Gilmartin and Li, 2010). It is predicted that the constituent genes are very tightly linked, so much so that they have been termed a ‘supergene’, a term used to describe loci controlling other complex morphologies in plants, animals and fungi (Pellow, 1928, Darlington and Mather, 1949, Schwander et al., 2014, Thompson and Jiggins, 2014).
It is widely accepted that extremely rare recombination events within the S locus supergene result in homostyles that present flowers with the anthers and stigma at the same height (Lewis, 1954, Dowrick, 1956). In some cases no recombinants have been observed in as many as 12,000 Primula plants, with reports suggesting that the S locus in Turnera sublata could be as small as 32 kb (Ernst, 1928b, De Winton and Haldane, 1935, Mather, 1950, Dowrick, 1956, Labonne et al., 2009), as such it was originally proposed that homostyles were the result of mutation, prior to recombination becoming the accepted explanation (Ernst, 1936b). Disruption of the architectural constraints imposed by heterostyly in homostyles is also associated with a breakdown in self- incompatiblity (SI); this is another key facet of the accepted genetic model, thought to result from recombination at the S locus disrupting linkage with the potentially S-linked SI components that normally serve to inhibit within-morph pollination (Lewis, 1954, Dowrick, 1956, Lewis and Jones, 1992, Barrett and Shore, 2008).
The genes controlling heterostyly are thought to be within a region of suppressed recombination (Mather, 1950), thus maintaining the collective supergene and the distinct pin and thrum phenotypes; in situ hybridisation studies related to Chapter 3 show that the S locus is located close to the centromere of the largest pericentric chromosome (Li et al., 2015a), a region characterised by reduced recombination (Mather, 1939, Choo, 1998). The attempts to isolate genes at the P. vulgaris S locus through the BAC walking strategy described in Chapter 3 (Li et al., 2011b, Li et al., 2015a) were halted in progress towards the S locus, with a gap remaining between the two BAC contigs flanking the S locus due to highly repetitive sequences, resulting in multiple BACs being identified in further screening of the BAC library (Li et al.,
114
2015a); one possibility is that this is a result of repeat-rich sequences near to the centromere (Kursel and Malik, 2016).
The minimum number of genes at the S locus (three: G, P, and A) is predicted through the observation of putative recombinants that disrupt linkage between the reciprocal morphological features that co-segregate with the two floral forms (Ernst, 1936a, Lewis and Jones, 1992), with the order GPA defined based on the least number of double recombinants being required to describe the number of recombinants seen (Ernst, 1936a, Lewis and Jones, 1992). The prediction that recombination is the mechanism that disrupts the supergene and produces homostyles underpins the last 60 years of research directed towards identifying the molecular and evolutionary basis of heterostyly. There is considerable debate on the predicted number of genes at the S locus. Dowrick (1956) speculated that the S locus supergene could comprise up to seven genes, with the various morphological and physiological traits specified by distinct genetic factors, but there is no support from recombinants for this prediction. Kurian and Richards (1997) concluded that there must be seven genes at the very minimum, whilst Richards (2003) proposed as many as nine.
The identification of the S locus and subsequent functional analyses will reveal the true number of genes controlling heterostyly, but since there is no reason to suggest that the heterostyly-determining supergene contains only protein-coding genes, it will also be important to carry out further annotation of the S locus region. This is exemplified by reports that microRNAs (miRNAs) play a key role in the development of floral reproductive structures through the targeted regulation of auxin response factor 6 (ARF6) and ARF8 expression; double mutants of ARF6 and ARF8 present short stamen filaments due to decreased cell expansion as compared to wildtype, whilst overexpression of MIR167a driven by the 35S promoter mimics this phenotype (Nagpal et al., 2005, Wu et al., 2006, Su et al., 2016). Because of the increased distance between the anthers and stigma, arf6-2 and arf8-3 plants self-pollinate inefficiently; epidermal cells of double mutant stamen filaments were about half as long as those of wild-type filaments (13 compared with 24 μm), indicating that decreased cell expansion caused the short filaments (Wu et al., 2006). The annotation of miRNAs at the S locus will therefore be important in providing a full picture of the potential regulatory modules contained therein.
115
It has been suggested that until the molecular basis of heterostyly is revealed, further speculation on the SI determinants and evolutionary steps leading to heterostyly will be severely limited (Barrett and Shore, 2008, McCubbin, 2008). The integration of the genetic map and BAC assembly associated data (Chapter 3) (Li et al., 2011b, Li et al., 2015a) with the array of read libraries, assemblies and associated genomic resources presented in Chapter 2, provides an opportunity to identify the S locus and its constituent genes based on a holistic approach that exploits both genetics and genomics, to offer insights into the genomic architecture of the region that were touched on by Mather (1950) and provide a blueprint for defining the basis of heterostyly in diverse angiosperm families for the first time since Darwin’s studies 150 years ago (Darwin, 1877).
4.3 Methods
Read depth across the S locus
Four long-homostyle (LH_v2) contigs forming the 455,880 bp S locus and flanking regions identified in Li et al. (2016) were removed from the LH_v2 assembly and replaced with the contiguous 455,880 bp sequence comprising the full S locus and flanking regions, as assembled by JL (Li et al., 2016). Genomic sequencing reads from thrum, pin, long-homostyle and short-homostyle plants (Table 2.1) were aligned to the Primula vulgaris LH_v2 genome assembly containing the contiguous S locus sequence using BWA (v0.7.12) (Li and Durbin, 2009).
The SAMtools “depth” tool (v0.1.19) (Li et al., 2009a) was used to return the depth of coverage at each position for reads across the S locus and flanking regions for each read library. The read depth between each of the four libraries was normalised according to the average size of the four read libraries, and the depth of read coverage plotted across the S locus region in 5,000 bp windows using R (v3.2.0) (https://www.r-project.org/).
In addition, box plots were drawn to summarise read depth across the 278 kb S locus region for each of the four read libraries using R (v3.2.0); the CFB genes flanking the S locus were excluded and the predicted read depth calculated by taking the median of the total summed depth for all read libraries at each position and then determining the
116
expected read depth for each library by multiplying by the number of S alleles in that individual: the homozygous long-homostyle (median multiplied by two), heterozygous thrum and short-homostyle (median multiplied by one), and absence of the 278 kb region in pin (median multiplied by zero).
RNA-Seq differential expression analysis
RNA was isolated in biological replicates from 15-20 mm buds of four wild-type pin plants and four wild-type thrum plants for RNA-Seq with Illumina HiSeq2000 by JL, as described in Li et al. (2016); reads were screened for rRNA removal using SortMeRNA (Kopylova et al., 2012) and quality-trimmed with trim galore (Q20) (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore).
The RNA-Seq reads were aligned to the long-homostyle (LH_v2) genome assembly with TopHat (Trapnell et al., 2012) (v2.0.13) and assembled using Cufflinks (Trapnell et al., 2012) (v2.2.1), guided by LH_v2 gene model annotations after manual curation of all S locus genes (Trapnell et al., 2012). Differential expression analysis between the four pin- and four thrum-replicate libraries was carried out using Cuffdiff (Trapnell et al., 2012); the number of fragments per kilobase of transcript per million fragments mapped +1 (FPKM + 1) (log10-transformed) is reported for genes at the S locus in Figure 4.7.
Analysis of thrum-specific genome regions
Individual pin and thrum plants used for genome sequencing and assembly of PP_ and TP_ assemblies (Table 2.2) were used (by JL) to generate segregating pin and thrum progeny, which were sequenced in separate pools of DNA from 24 pin and 28 thrum plants (LIB1730 and LIB1731, respectively; Table 2.1); the thrum progeny pool was not used in this analysis. RNA-Seq reads from the four pin and four thrum replicate libraries, as described in the differential expression analysis above, were aligned to the thrum parent genome assembly (TP_v1) using TopHat (v2.0.13) (Trapnell et al., 2012); transcripts were assembled and merged with Cufflinks and Cuffmerge (Trapnell et al., 2012) (v2.2.1) and differential expression carried out with Cuffdiff (Trapnell et al., 2012).
117
Transcripts showing thrum-specific expression or near to thrum-specific expression (cut-off < 0.1 FPKM for pin flower) were identified from the differential expression results; pin-progeny reads were then aligned to the TP_v1 assembly using BWA (v0.6.2) (Li and Durbin, 2009) and the per-base depth of read coverage for contigs containing the transcripts calculated using the SAMtools (Li et al., 2009a) (v0.1.18) “depth” tool. The per-base depth of read coverage was then used to calculate the average depth and breadth of pin-progeny read coverage across each transcript region in the genomic contigs identified by a thrum specific transcript.
The transcripts were classified into two groups using the k-means algorithm implemented in the scikit-learn package for Python, with the breadth of pin-progeny read coverage and log10-transformed depth of pin-progeny read coverage across each transcript region as input variables (n_clusters=2); matplotlib was used for plotting in Python (Hunter, 2007, Pedregosa et al., 2011). Gene identities of thrum genome- specific transcripts were determined by alignments to the LH_v2 assembly and gene models using Exonerate v2.2.0 (Slater and Birney, 2005).
Detection of recombination in S locus flanking regions
Genomic paired-end reads from pin and thrum parental plants (Table 2.1) were aligned to the long homostyle (LH_v2) genome with BWA (v0.6.2) (Li and Durbin, 2009). SAMtools (v0.1.18) (Li et al., 2009a) was used to remove PCR-duplicates (over- amplified fragments) with the “rmdup” tool, and for variant calling between the two read libraries and LH_v2. The genotype (GT) sub-field in the resulting Variant Call Format (VCF) files was used to determine the genotype for pin and thrum at each nucleotide position; two analyses were then carried out: firstly, a phased analysis using only heterozygous sites in thrum (figures not shown; see Li et al. (2016)), and secondly, using heterozygous sites in thrum as well as homozygous sites in thrum where at least one of the alleles in pin was different to the nucleotide in thrum at that site. Sites were excluded with depth (DP) < 10, genotype quality < 30, or mapping quality (MQ) < 20 for heterozygous thrum sites (first and second analysis), or in either pin or thrum for homozygous thrum sites (second analysis).
118
The cumulative binomial probability was calculated for the S locus left- and right- flanking sequences using a sliding window of 5,000 bp and an overlap (step size) of 1,000 bp to test whether the observed frequency of variant sites in each window was significantly lower than that expected given the total number of variant sites in each flanking sequence (Equation 4.1); this was performed using variant sites in (i) to (iii) below.