THE GENETIC BASIS FOR SEED COAT POLYMORPHISMS IN LUPINUS PERENNIS

Rachel E. Wilson

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

December 2019

Committee:

Helen Michaels, Advisor

Paul Morris

Scott Rogers

© 2019

Rachel Wilson

All Rights Reserved

ABSTRACT

Helen J. Michaels, Advisor

Multigenic traits, specifically seed coat phenotypes, are poorly understood in domesticated Lupinus. This knowledge gap can sometimes be filled by studying wild relatives (Mammadov et al., 2018; von Wettberg 2018). Pigments responsible for seed colors are deposited in the seed coat, which is composed of the two outermost layers of a seed. Seed coat phenotypes can be polymorphic and typically involve complex pathways with multiple layers of expression controlling pigment compounds such as anthocyanins. The Anthocyanin Biosynthetic Pathway (ABP) plays a central role in the polymorphic phenotypes of seed coats in many Fabaceae, a family that includes Lupines such as Lupinus perennis (Chalker-Scott, 1999). We hypothesized that the genetics of the ABP correlate with seed coat phenotype.

This study aims to identify candidate genes that might be responsible for color differences in the polymorphic seeds of L. perennis. Using RNA sequencing (RNAseq), we produced de novo assemblies of the seed coat transcriptomes of immature white and speckled seeds. Two stages of immature seeds were used: a pre-pigment and a post-pigment stage. The use of both stages increased the chance of constructing a transcriptome containing pigment transcripts. Putative functional annotations of the seed coat transcripts were assigned using multiple databases. Differential expression analysis revealed 58 candidate genes whose expression differed between the two phenotypes, 36 upregulated and 22 downregulated. Two pertain to the ABP, and several were previously reported to be involved in defense, such as powdery mildew resistance genes. Further work is necessary to verify these results by qPCR and to examine the seed coat transcriptome in greater detail, for example through co-expression analysis. These results have important implications for endangered butterfly habitat restoration and crop breeding in the genus Lupinus and suggest that there is much more to understand about these seed coat phenotypes.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Helen Michaels for all of her advice and guidance. I would also like to thank the members of my committee, Dr. Scott Rogers and Dr. Paul Morris, for all of the help they provided and input they supplied throughout this project. Additionally, I would like to thank my fellow graduate students in the Michaels’ lab: Haley Meek, Erica Forstater, Meigan Day, and Ian Anderson for their assistance and support both in the field and in the lab. Lastly, I would like to thank the Bowling Green City Parks Department and Cinda Stutzman for allowing me to collect seeds on their properties.

TABLE OF CONTENTS

INTRODUCTION
METHODOLOGY
Sample Collection and Preparation
RNA Extraction
RNA Quality Control Determination
Illumina Sequencing
Transcriptome Reconstruction
Functional Annotation
Gene Expression Analysis
Differential Gene Expression
RESULTS
Phenotypes of Biological Replicates
Raw Data
Data Quality
Transcriptome Reconstruction
Gene Functional Annotation
GO Classification
KOG Classification
KEGG Classification
Gene Expression Analysis
Sample Correlation
Summary of Gene Expression Levels
Gene Expression Difference Analysis
DISCUSSION
REFERENCES
APPENDIX A. SUPPLEMENTARY TABLES
APPENDIX B. SUPPLEMENTARY INFORMATION

LIST OF FIGURES

Figure
1 Pathway of the Anthocyanin Biosynthetic Pathway
2 Three Developmental Time Points of L. perennis Seed Pods
3 Graph Interval Lengths
4 GO Classification
5 KOG Classification
6 KEGG Classification
7 Pearson Correlation
8 Venn Diagram of Differentially Expressed Genes
9 Volcano Plots
10 Heatmap of Differentially Expressed Genes

LIST OF TABLES

Table
1 Sample Purity, Concentration, and Tissue Stage
2 Data Quality Measurements for Library Construction
3 The Mapped Transcriptome
4 The Length Measurements of Transcriptome
5 Interval Lengths of Transcripts
6 The Transcriptome Annotation Summary
7 Sample Names and Approaches

INTRODUCTION

Seed polymorphisms are common in many plants, especially legumes (members of the Fabaceae), which are plants with pods that contain their seeds and have high nutritional value. Some examples of economically important legumes with polymorphic seeds are Soybean, Chickpea, Common Bean, Sweet Pea, and Lupin (Todd and Vodkin 1996; Tuteja et al., 2009; Zabala and Vodkin 2014; Mirali et al., 2016). Literature suggests that legumes, which are protein rich, can be used as a substitute for meat (Abraham et al., 2019). Lupines are popular ornamental plants because of their towering spiraled inflorescences and color variety. The most common ornamental variety is the Russell Hybrid, with yellows, reds, purples, and all shades in between (Elmer et al., 2001). While lupin is a common ornamental plant for gardeners, it is starting to gain standing as an important agricultural crop. Lupin has been cultivated for over 3000 years, beginning in the Mediterranean Basin (Gladstone, 1970), and there are currently five domesticated Lupines: L. angustifolius, L. albus, L. polyphyllus, L. mutabilis, and L. luteus (Gladstone, 1974). In Western Australia, the world’s largest producer of lupines, the Lupin industry has an export value of $65 million (Lupins, 2018). As a cover crop, Lupin can reduce the need for nitrogen-based fertilizers because of its symbiotic relationship with nitrogen-fixing bacteria known as rhizobia. Lupin forms root nodules for the rhizobia to colonize, where they convert atmospheric nitrogen into a biologically usable form of nitrogen (Zahran, 1999). Once the nodules are colonized, this relationship allows Lupin to increase nitrogen concentration in fields. It also enables Lupin to colonize volcanic slopes such as those of Mt. St. Helens (Bishop, 2002). Aside from use as a cover crop, Lupin is emerging as a health food and food additive, as well as a farm feed (Abraham et al., 2019; Kasprowicz-Potocka et al., 2015). Lupin is rich in protein; the seeds are 30-40% protein (Williams, 1979). The industry value of Lupin is not as great as that of other crops like soybean and corn, which could be explained by the alkaloid content. The seeds have a high alkaloid concentration, and alkaloids have a bitter taste, which is a limitation to industry (Hernandez, 2011).

Our study species is Lupinus perennis, known by the common names Wild Lupine or Sundial Lupine (Plant Database, 2019). L. perennis occupies dry, sandy savannah typical of North American prairies (Gleason and Cronquist, 1991). In the Lupinus genus, L. angustifolius has a sequenced genome and many genomic resources. It is estimated to have a genome size of around 920 Mb (Hane, 2017) and has a diploid chromosome number of 40 (Gladstone, 1970), while L. perennis has a diploid chromosome number of 48 or 96 depending on subclade (CCDB, 2018). Unlike wheat, barley, and pea, L. angustifolius is a quite recently domesticated crop that was subjected to multiple genetic bottlenecks during early breeding programs, and it therefore carries an even narrower genetic background than most domesticated species do relative to their wild relatives. This underscores the need to explore wild relatives to discover potential untapped genetic diversity (Berger 2012). It also means that, although some genetic resources are available for studying a wild relative, resources from a domesticated species are highly likely to lack much of the genetic information relevant to the evolution of adaptations.

Polyploidy is the presence of multiple (3 or more) sets of homologous chromosomes in a genome and is a common occurrence in many organisms, especially plants (Hollister, 2014). One major cause of polyploidy is whole genome duplication (WGD), which has been documented multiple times in the evolution of angiosperms. Within the Fabaceae family, WGD events have been observed at several points in its evolution. In Lupinus there is also evidence of its own WGD history and of its split from other legumes (Hane 2017). Papilionoideae (Papil) is a subfamily of Fabaceae, and the lineage underwent WGD after splitting from Mimosoideae-Cassiinae-Caesalpinieae (MCC). Lupin, as a part of Papil, is thought to have undergone one fewer WGD than the rest of the subfamily, which includes Soybean and Medicago (Cannon et al., 2015). A consequence of WGD is the formation of multigene families (Grassi et al., 2008), and multigene families occur in the ABP (Springob et al., 2003). Polyploidy alters plant phenotypes and interactions with the environment. For example, polyploidy has altered nodule size, which then affects nitrogen fixation rates (Forrester and Ashman, 2017).

In plants the seed coat is the outermost layer protecting the seed. Pigments, along with many other compounds, are deposited in the seed coat, which consists of an inner layer (the integument) and an outer layer called the outer testa. Because the integument layer is derived from maternal tissue during seed development, the characteristics of the seed coat are heavily influenced by maternal genetics (Hedley and Ambrose, 1980; Roach and Wulff, 1987). The seed coat is influenced not only by the maternal parent but also by the seed’s genotype and its interaction with the environment in which the seed develops (Penfield, 2017). The seed coat plays an important role in controlling dormancy (Kelly et al., 1992), dispersal (Howe and Smallwood, 1982), predation (Lai et al., 2014), and microbe interactions (Cooper 2007; Mhlongo et al., 2018). There has been considerable research in legumes on the effects of seed color on early seedling development. Seed coat color (and anthocyanin content) was associated with differential germination success and speeds in two Lotus species (L. glinoides and L. halophilus), which have seed coats of yellow, green, and black (Bhatt et al., 2016). Yellow seeds of these species had lower germination speeds and decreased success compared to seeds with green or black coats (Bhatt et al., 2016). Since the darker seeds were associated with larger amounts of anthocyanins, increased anthocyanin levels in the seed coat may influence the germination rate (Bhatt et al., 2016).

Seed coat polymorphisms have been documented in many plants, along with color polymorphisms of flower pigmentation. Some legumes show seed color morphs that vary in background color and speckling, appearing in many colors: light yellow and brownish black (chickpea: Cicer) (Penmetsa et al., 2016); light brown, black, and brown with black speckling (soybean: Glycine) (Zabala and Vodkin, 2014); and all white, white with brown speckling, and white with red speckling (common bean: Phaseolus) (Stoilova et al., 2013). Seed phenotypes observed in L. perennis vary from white to a gradient of speckling against a background that is typically white but may appear pale tannish yellow or gray. This darkly pigmented speckling can vary considerably in degree, as some seed coats have up to 80% speckling while others have none (Cartwright, 1997; Shimola, 2013; Michaels et al., 2019). Understanding complex phenotypes of wild relatives can benefit crop species because wild relatives are free from artificial selection, making them a reservoir of genetic diversity for potentially useful traits absent in highly selected crop species (von Wettberg, 2018). Plant breeders often use flower or seed phenotypes linked with desirable traits, like taste or yield, as visible markers in artificial selection. Seed coat color is an important consumer-related trait, and consumers often make decisions on the acceptability, quality, and presumed taste of a product based on its appearance, including color (Kostyla et al., 1978; Simonne et al., 2001).

Seed pigment polymorphisms could be maintained in populations for multiple reasons. Possible explanations include pollinator interactions, predation, selection based on microbe interactions, or recurrent mutation. Selection pressure from pollinators arises because many plants use flower pigments that reflect wavelengths in the color spectrum to which specific pollinators are sensitive (Chittka et al., 1992). Major changes in flower color are known to be associated with adaptation and speciation that are accompanied by shifts in primary pollinators (Schemske and Bradshaw 1999; Bradshaw and Schemske 2003). Color variations may reflect interactions with pollinator and seed predator communities, such as those involving Gentiana lutea (Yellow Gentian) and Bombus terrestris (Bumblebee). Bumblebees can distinguish between oranges and yellows, preferentially visiting yellow flowers of G. lutea, but seed predators oviposit more on those with orange flowers (Sobral et al., 2015). Pollinator-mediated selection could therefore maintain seed polymorphisms through a possible link between seed coat color and flower pigmentation, as both flower and seed pigments are produced by the ABP. Compelling support for this link is that flower pigmentation mutants are often also mutant in seed coat pigmentation. In Glycine max, silencing of flavanone 3-hydroxylase (F3H) leads to mutant plants with pink flowers and lighter gray seeds (Zabala and Vodkin, 2005). In Chickpea, various mutations creating a premature stop in the basic helix-loop-helix (bHLH) transcription factor lead to mutants with white flowers and tan seeds (Penmetsa et al., 2016).

Another explanation is seed predation. In L. perennis field studies, darker pigmented seeds are preferentially removed over white seeds (Michaels, Cartwright, and Wakeley 2019). This is surprising because color polymorphisms in other legumes are thought to function as camouflage against seed predators (Porter, 2013). Another possible reason for seed predator preference is seed chemical composition, because secondary metabolite concentrations could differ. Alkaloids are nitrogen-containing secondary metabolites that can be toxic and are known to occur in many lupin seeds, so the seeds may have a bitter taste (Frick et al., 2017). If alkaloids or other bitter compounds were present at higher levels in white seeds, this could explain the seed predator behavior independent of camouflage effects.

The seed coat and microbes communicate through chemical signaling. The seed coat acts as a channel for stimuli to enter and interact with the seed (Radchuk and Borisjuk, 2014). Because they are sessile, plants also rely on sending signals out into the environment. One way they do this is through the root system, excreting chemical messages from the roots into the rhizosphere that microbes respond to (Babri et al., 2009). An example is the symbiotic relationship between rhizobia and lupin (Mierziak et al., 2014). Soybean roots have been shown to produce genistein as a signal for the induction of nod genes in Bradyrhizobium japonicum, particularly under cold root conditions (Zhang and Smith, 1997). The isoflavones genistein and daidzein act as chemical signals that activate nodulation genes in rhizobia, mediating chemoattraction (Reddy et al., 2007). They also serve as a host-specific attraction signal for the soybean pathogen Phytophthora sojae (Morris and Ward 1992), which uses genistein as a signal to locate legumes and then cause root rot (Subramanian et al., 2005). Secondary compounds can also be detrimental, as in the case of pasture bloat (Lees, 1992) or pathogen attack. These examples illustrate how the root system and microbes interact: the root system is a major chemical channel between the plant and the environment with which it interacts. The same can be said for the seed coat because it is the first point of contact between the seed and the environment. Furthermore, flavonoids and many other metabolites are also released from seed coats after imbibition, potentially providing an early source of defensive compounds as well as initiating signaling to microbes during germination (Ndakidemi and Dakora 2003).

Other possible explanations for the persistence of the phenotype could lie in the seed’s genetics. Besides selection, it is possible that the polymorphism involves an unstable mutation, such as a transposon insertion, and that recurrent mutation maintains the phenotype in the population (Zabala and Vodkin 2014). The unstable mutation is then responsible for reintroduction of the polymorphic phenotype. This means that seed coat polymorphisms could stay in the population because of the insertion and removal of such elements in the genes.

It has been well established that observable seed color is determined by the amounts of various anthocyanin pigments present (Chalker-Scott, 1999), which are end products of a complex but highly conserved pathway (Figure 1). Although some color variation is caused by environmental factors (Holtsford and Ellstrand, 1992), color variants are produced by integrated networks of structural and regulatory genes. Preliminary LC-MS studies in L. perennis found that seed phenotype differences are associated with differences in chemical composition (unpublished data Kelly, Opiyo, Phuntumart, Michaels), specifically that genistein concentration was 180 times higher in darkly speckled than in white seeds. Other research has found that in some Fabaceae, seeds with darker coats, and therefore greater anthocyanin content, absorbed water faster and germinated more quickly than lighter seeds with lower anthocyanin content (Atis et al., 2011; Chachalis and Smith, 2000). Anthocyanins are responsible for orange, red, blue, and purple pigments (Tanaka et al., 2008). They also mediate plant responses to UV stress by absorbing the energy, preventing damage to the cells (Chalker-Scott, 1999). Anthocyanins are a subgroup of flavonoids, which also include chalcones, flavones, flavonols, flavandiols, and condensed tannins (or proanthocyanidins). Legumes are also known to produce isoflavonoids, another product of the pathway. Flavonoids are known to play roles in plant-microbe interactions, defense against microbes, deterrence, male fertility, and UV protection (Winkel-Shirley, 2001). Flavonoids also act as antioxidants, which play an important part in disease prevention (Tsuda et al., 1994). Additionally, anthocyanins can be a part of other stress responses. The flavonoid pathway regulates both UV and defense responses. A study in Petroselinum crispum (Parsley) observed that UV induces activity of acyl-CoA (part of the flavonoid pathway) while pathogen attack inhibits acyl-CoA activity (Logemann and Hahlbrock, 2002). It has been observed that field plants exposed to ultraviolet-B (UV-B) radiation (in the 280-320 nm wavelength range) often show an increased resistance to herbivory (Stratmann, 2003). Furthermore, the genes involved in regulating flavonoids are regulated by both UV light levels and herbivory responses. The gene responsible is a MYB transcription factor, which also controls expression of other genes in the ABP (Schenke et al., 2011).

The ABP is well studied and highly conserved across plant lineages and is involved in more than just control of flower color pigmentation (Chalker-Scott, 1999; Gould 2004; Winkel-Shirley, 2001). While the ABP is responsible for flower color, it is also responsible for fruit, seed, and tissue pigmentation, microbe interactions, defense responses (UV/pathogen/herbivore), and stress responses. The ABP has a core set of genes and transcription factors. The core genes are structural genes that are split between early and late groups: early genes are chalcone synthase (CHS), chalcone isomerase (CHI), and flavanone-3-hydroxylase (F3H); late genes are dihydroflavonol-4-reductase (DFR), anthocyanidin synthase (ANS), and UDP-glucose:flavonoid 3-O-glucosyltransferase (UF3GT) (Park et al., 2004) (Figure 1). The key transcription factors are myeloblastic (MYB), basic helix-loop-helix (bHLH), and beta-transducin repeat (WD40), together called the BMW complex (Ramsay and Glover, 2005) (Figure 1). Mutations of genes in the ABP can lead to changes in anthocyanin production. In the preliminary metabolomics data for L. perennis, malvidin (an anthocyanin, Figure 1) was another metabolite showing a concentration difference among seed coat phenotypes (unvalidated because of the absence of a standard). That both an early product (genistein) and a late product (malvidin) were affected suggests that either an early gene or a transcription factor in the ABP is involved. The ABP begins with p-coumaroyl-CoA from the phenylpropanoid pathway being converted by chalcone synthase (CHS) to chalcone (also known as naringenin chalcone), which has a yellow color. Then chalcone isomerase (CHI) converts chalcone to naringenin, from which the pathway branches into different sections that produce different end products. See Figure 1 for the ABP overview.

Figure 1 Pathway of the Anthocyanin Biosynthetic Pathway. The Anthocyanin Biosynthetic Pathway based on the KEGG flavonoid and anthocyanin pathways. The chemicals are in black font. The transcription factors are in blue. The genes are abbreviated into three letters. The red star represents multiple steps that are unknown. The enzyme chalcone synthase converts p-coumaroyl-CoA to chalcone, and chalcone isomerase then converts chalcone to naringenin. Naringenin serves as a branch point for the synthesis of genistein by isoflavone synthase, dihydrokaempferol by flavanone 3-hydroxylase (F3H), or eriodictyol by F3’H. The transcription factors in the MYB family and bHLH are known to regulate most genes in the pathway. WD40 activates synthesis after naringenin.

Early research on domesticated Lupinus using traditional plant breeding discovered that the gene leucospermus was linked with white flowers and white seeds, while the wild type has blue flowers and dark seeds (Gladstones, 1977). Other important domestication genes and traits are as follows: mollis (for soft seeds), leucospermus (for white flower and seed color), lentus (for reduced pod shattering), iucundis (for low alkaloid levels), Ku (for early flowering), and the moustache pattern on seed coats (Boersma et al., 2005; Nelson et al., 2010). The absence of pigmentation is a strong indication that the anthocyanin pigment-producing genes of the ABP are being downregulated. Many different mutations could be responsible for this observed polymorphism. In Arctic Mustard, stress-induced silencing of CHS can lead to an absence of pigments in petals, producing mutants with white flowers (Butler et al., 2014). Another possible explanation is insertions or deletions. In Soybean (G. max), insertion of an inverted repeat in CHS leads to loss of pigment in seed coats, producing the observed mutant yellow seeds (Yang et al., 2010). Elucidating the mutations behind these phenotypes will increase understanding of the ABP, the products of which are important to agriculture. While the ultimate goal of this research is to determine the genetic basis for testa polymorphisms in L. perennis, this project seeks to use RNA-seq to identify potential candidates for polymorphism in seed coat color.

RNA-seq can be used to study a broad range of topics: alternative splicing events, single nucleotide polymorphisms (SNPs), and gene expression profiles (Applications of RNA-seq, 2016; O’Rourke et al., 2012; Yang et al., 2019). There is no single pipeline for RNA-seq because of the broad range of topics to which it can be applied; the approach depends on the experiment and hypothesis. The overall path is cDNA library construction from transcripts, quality control, transcriptome alignment/assembly, followed by functional annotation, transcript quantification, and differential expression analyses. In RNA-seq library construction for Illumina, RNA is reverse transcribed into cDNA, which is then fragmented and sequenced. Illumina is one of many next-generation sequencing methods; others are Roche 454, Ion Torrent, and SOLiD. Paired-end sequencing is preferred over single-end because it involves sequencing a fragment from both ends (producing both the forward and reverse strand information), which increases accuracy during the alignment stages. Quality control measures are involved in every step of the process, starting with the quality of RNA and continuing through review of raw sequences and the normalization and filtering during bioinformatics tests. Differential gene expression (DEG) analysis normalizes read count data and then statistically tests for differences in abundance of transcripts among varying treatments or groups of samples (Chen and Wong, 2019). DESeq performs multiple comparisons and is the bioinformatics program used for biological replicates (Anders et al., 2010; Differential gene expression analysis, 2016).
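
The normalization step mentioned above can be illustrated with a small worked example. The following is a minimal Python sketch of median-of-ratios size-factor normalization, the approach associated with DESeq (Anders et al., 2010). It is not the pipeline used in this study: the function names (size_factors, normalize), the sample labels, and the toy counts are hypothetical and chosen only for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """Estimate one size factor per library by the median-of-ratios method.

    counts: dict mapping sample name -> list of raw read counts per gene
            (all lists in the same gene order). Assumes at least one gene
            has a nonzero count in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each gene across samples; genes with a zero
    # count in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        row = [counts[s][g] for s in samples]
        if all(c > 0 for c in row):
            log_geo_means.append(sum(math.log(c) for c in row) / len(row))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        # Ratio of each gene's count to its geometric mean, on the log scale
        log_ratios = [math.log(counts[s][g]) - log_geo_means[g]
                      for g in range(n_genes) if log_geo_means[g] is not None]
        factors[s] = math.exp(median(log_ratios))
    return factors

def normalize(counts):
    """Divide each library's raw counts by its size factor."""
    f = size_factors(counts)
    return {s: [c / f[s] for c in counts[s]] for s in counts}

# Hypothetical raw counts for three genes in two libraries
toy_counts = {"white": [10, 20, 30], "speckled": [20, 40, 60]}
factors = size_factors(toy_counts)
normalized = normalize(toy_counts)
```

In this toy case the speckled library has exactly twice the counts of the white library for every gene, so the size factors differ by a factor of two and the normalized counts agree. Real count tables would leave residual differences for the statistical test to evaluate.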

Aim I. Use RNAseq to test that the absence of pigmentation involves decreased expression of ABP structural genes. There is a large body of research on legumes like soybean that suggests similar genetics may be at play in L. perennis. Most phenotypic differences of seed coats are found to involve differences in expression of genes controlling the Anthocyanin Biosynthetic Pathway (ABP). To understand the genetic basis for these phenotypes, we seek to identify whether they derive from structural or regulatory effects. Therefore, differences in expression levels of either ABP genes or transcription factors were expected between the two phenotypes. In other plants, a downregulation of ABP genes, from insertions and deletions, resulted in an absence of pigment (Park et al., 2007; Park et al., 2004; Xu and Chang, 2009; Zabala and Vodkin, 2005; Casimiros-Soriguer et al., 2016; Mirali et al., 2016). Therefore, candidate genes are expected to have lower or no expression in the white phenotype, with relatively higher expression in the speckled phenotype.

Hypothesis I) Decreased expression in a structural gene of the ABP, like CHI, is found in the white seed phenotype because as an early gene in the pathway its decreased expression would affect the later colored pigment producing genes in the pathway.

Hypothesis II) Decreased expression of a transcription factor controlling the ABP, like MYB, is found in the white seed phenotype because, as a transcription factor controlling most of the pathway, its reduced expression could lower expression of the whole pathway, including the late pigment-producing enzymes.

Aim II. Identify candidate genes from Differential Gene Expression Analysis of the assembled transcriptome. Differential Gene Expression Analysis will identify expression differences in genes between the phenotypes, including genes outside the ABP. While anthocyanins are visible, other chemicals present in seeds are colorless and could cause other differences between the two phenotypes. Because the ABP is very complex and may influence development of other traits, RNA-seq is an efficient approach for identifying where potential genetic differences reside overall.

METHODOLOGY

Sample Collection and Preparation

Immature seeds were collected from a large, naturalized population at Wintergarden Park, Bowling Green, OH. A total of 20 flowering plants were haphazardly chosen for sampling and their GPS locations recorded (Appendix Table A2). In this study, a family consisted of one individually distinguishable plant, and two replicates were collected per family. Seeds were removed from pods and transferred into 1.5 mL tubes that were stored on dry ice in the field until they could be transferred to a -80 °C freezer, where they were stored for a maximum of one month before RNA extraction. To allow for subsequent seed collection, each inflorescence was bagged using bridal veil mesh after pollination (after flower pigment change), because pods dehisce to scatter their seeds and bagging helps prevent the loss of the seeds needed for phenotyping. The phenotypes were split into two categories: white (speckling of 15% or below) and speckled (speckling above 15%). Immature seed tissues from two developmental stages were sampled separately to increase the probability of acquiring pigment-related transcripts. The two developmental stages were defined as pre-pigment (no pigment present) and post-pigment (pigment visible). Examples of these stages can be seen in Figure 2.

Figure 2 Three Developmental Time Points of L. perennis Seed Pods. A) Opened seed pod of L. perennis to show the pre-pigment seeds. The time period is 18 days after flowering. B) Opened seed pod of L. perennis to show post-pigmented seeds. This time period is 25 days after flowering. C) Mature seed and seed pod of L. perennis.

After inflorescences were bagged, the pods were left to develop. The pre-pigment stages were collected at 18 to 20 days after flowering (defined by opening of the upper banner petals on flowers for pollination), while post-pigment stages were collected at 25 to 30 days. The pods of the pre-pigment stage are typically light green and about 3.5 cm in length and 0.8 cm in width (Figure 2A). Pods with seeds in the post-pigment stage were yellow-green and about 2.3 cm in length and 0.5 cm in width (Figure 2B). Seeds collected in the field were placed on dry ice for transport to the lab and stored at -80 °C until seeds were removed for RNA extraction. Mature pods are desiccated and black and were collected 32 to 40 days after flowering (Figure 2C). Phenotypes were determined after the remaining pods on the same inflorescence were mature. Mature seeds were collected and individually photographed on both sides to determine the amount of speckling using a Canon PowerShot A1400 digital camera placed 0.3 meters above seeds illuminated with three Ivation 7-LED dimmable clip lights (750 lux; 5500 K) on a white background with a 24-color checker card (CameraTrax). We used ImageJ image analysis software (v1.52a, National Institutes of Health, Bethesda, MD, USA) to quantify the amount of pigmentation present following the method detailed by Haque et al. (2015) and implemented for L. perennis as described by Meek (2019) (Appendix: Protocol for Image Analysis of Seed Phenotypes using ImageJ). We determined the proportion of the surface area that was speckled across the two sides for five mature seeds, which was then averaged for each family. For this project a family was classified as white when 0-15% speckling was present and as speckled when the proportion of speckled surface area was greater than 15%.
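
The classification rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speckling percentages are invented, the function names (family_speckling, classify_family) are hypothetical, and averaging the two sides of a seed assumes both sides contribute equal imaged area, which the ImageJ protocol itself may handle differently.

```python
from statistics import mean

# Percent speckled surface area above which a family is called "speckled"
SPECKLED_THRESHOLD = 15.0

def family_speckling(seeds):
    """Mean percent speckling for one family.

    seeds: list of (side_a_percent, side_b_percent) tuples, one per seed.
    Each seed's two sides are averaged first (assumes equal area per side),
    then the per-seed values are averaged across the family.
    """
    per_seed = [mean(sides) for sides in seeds]
    return mean(per_seed)

def classify_family(seeds):
    """Return 'white' for 0-15% mean speckling, 'speckled' above 15%."""
    if family_speckling(seeds) > SPECKLED_THRESHOLD:
        return "speckled"
    return "white"

# Hypothetical measurements for five seeds from each of two families
family_a = [(0.0, 2.0), (1.0, 3.0), (0.0, 0.0), (5.0, 5.0), (2.0, 2.0)]
family_b = [(40.0, 60.0), (30.0, 50.0), (20.0, 20.0), (70.0, 90.0), (10.0, 30.0)]

label_a = classify_family(family_a)
label_b = classify_family(family_b)
```

Because the threshold is applied to the family mean rather than to individual seeds, a family with one heavily speckled seed can still be classified as white if the other seeds are nearly unpigmented.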

RNA Extraction

Before RNA extraction and seed coat removal, all utensils and surfaces were cleaned and sprayed with RNase away (0.1N SDS and 0.1N NaOH) to prevent RNA degradation. RNA extractions were performed on the two developmental stages separately; pre-pigment tissue was not combined with post-pigment tissue until it was shipped to Novogene. To isolate RNA from the seed coats only, the seed coat was first separated from the rest of the seed (cotyledon and embryo) through heat shock and physical removal. Seeds frozen in a microcentrifuge tube were placed in a 65 °C water bath for 15 seconds, after which the seed (cotyledon and embryo) was quickly squeezed out by hand, leaving only the seed coat. The seed coat was then immediately placed on dry ice in a coffee grinder and ground into a fine powder. The powdered dry ice/seed coat mix was transferred into pre-chilled 1.5 mL microfuge tubes and then placed in a -20 °C freezer to allow the dry ice to sublime. This time was variable depending on the amount of dry ice present. Typically, one hour and thirty minutes was enough time for sublimation, but if dry ice was still present then another thirty minutes was allowed. During this waiting period all necessary reagents and equipment were organized for a quick and timely procedure. RNA extraction began immediately after the dry ice had completely sublimed. At no point during the procedure (besides when necessary for incubating or precipitating) was the RNA extraction protocol halted or interrupted (more detail in Appendix).

For RNA extraction we used the TRIzol™ extraction procedure following ThermoFisher’s protocol with modifications to increase the quantity and quality of RNA precipitated (ThermoFisher, 2016). To pulverize the tissues, a coffee grinder and dry ice were used instead of a homogenizer (see above about turning the seed coat into a fine powder). Along with the 1 mL of TRIzol™, 4 µL of β-mercaptoethanol and 20 µL of polyvinylpyrrolidone were added to help remove secondary compounds often present in plants, like condensed tannins and phenols. The next modification involved the second incubation time of the lysate, which was increased to 5 minutes. No RNase-A was added. The amount of isopropanol added was increased to 1 mL of chilled isopropanol per 1 mL of TRIzol™, after which RNA was precipitated for two hours at -20 °C. After the RNA pellet was air dried at room temperature, it was resuspended in 40 µL of 0.1% DEPC-treated water and stored at -20 °C.

RNA Quality Control Determination

Before samples could be sequenced, the RNA degradation and purity were determined and the two developmental stages for each sample were combined. Because it was not known when pigment-producing messages are made, both stages were important for capturing potential candidates. RNA degradation was checked on a 1% agarose gel with ethidium bromide. RNA purity and concentration were analyzed by spectroscopy using a NanoDrop 2000 spectrophotometer (Thermo Scientific™). Following sample quality evaluation, we chose six families to be sequenced: SD2, SD7, SD20, SW3, SW4, and SW10. These families were then sent to Novogene (Novogene Corporation Inc. 8801 Folsom Blvd #290, Sacramento, CA 95826) for sequencing. Samples with a concentration above 50 ng/µL are preferred for sequencing, but to have three replicates of the speckled seeded phenotype we included one family below the recommended value (see details under Results: Raw data: RNA degradation and purity). Novogene performed further quality analysis on all RNA samples before library preparation proceeded, including tests for degradation, contamination, purity, and integrity. Novogene used a 1% agarose gel, a NanoPhotometer® spectrophotometer (IMPLEN, CA, USA), and the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA) for their quality determination.

Illumina Sequencing

After quality evaluations, samples were prepared for Illumina sequencing by performing mRNA enrichment, double-stranded cDNA synthesis, end repair/poly-A/adaptor addition, fragment selection/PCR, and library quality assessment. The first step, mRNA enrichment, determines what type of messages will be in the library. RNA extraction gives total RNA, which is a mixture of ribosomal RNA (rRNA), transfer RNA (tRNA) and messenger RNA (mRNA), the majority of which is rRNA. Because the transcripts of interest are the mRNA, the rRNA and tRNA must be removed prior to sequencing to prevent domination of the resulting libraries by the ribosomal genes, which could cause our transcript signals to be overshadowed and not identified during later analyses. mRNA was purified from total RNA of the sample using poly-T oligo-attached magnetic beads. Fragmentation was carried out under an elevated temperature in NEBNext First Strand Synthesis Reaction Buffer (5X).

Before sequencing, the RNA was converted into double-stranded cDNA using random hexamer primers, M-MuLV Reverse Transcriptase (RNase H-), DNA Polymerase I, and RNase H, and then overhangs were converted into blunt ends via exonuclease/polymerase reaction. After adenylation of the 3’ ends of DNA fragments, NEBNext Adaptors with hairpin loop structure were ligated and poly-A tails were also attached using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) following the manufacturer’s recommendations.

Next, cDNA fragments of 250 to 300 bp were preferentially selected by purifying the cDNA samples with the AMPure XP system (Beckman Coulter, Beverly, USA). Then 3 µl

USER Enzyme (NEB, USA) was used with size-selected, adaptor-ligated cDNA at 37°C for 15 minutes followed by 5 minutes at 95 °C before PCR. Then PCR was performed with Phusion

High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. Finally, PCR products were purified (AMPure XP system) and library quality was assessed on an Agilent

Bioanalyzer 2100 system. Following library preparation, complete sequencing can commence.

We used the Illumina HiSeq X at a sequencing depth of 30 million reads of 150 base pairs paired-end fragments for each sample. Illumina sequencing produced raw data in the form of

FASTQ files that were then processed into clean reads. Processing involved the removal of reads that were of low quality or contained adapter sequence. Low-quality reads were any reads that consisted of 10% or more uncertain nucleotides.
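The uncertain-nucleotide criterion can be illustrated with a short Python sketch. It is not the filtering software used by the sequencing provider; the function names are hypothetical, and adapter removal and base-quality filtering are deliberately left out.

```python
def fraction_uncertain(sequence):
    """Fraction of bases in a read that were called as N (uncertain)."""
    if not sequence:
        return 1.0  # treat an empty read as entirely uncertain
    return sequence.upper().count("N") / len(sequence)


def keep_read(sequence, max_n_fraction=0.10):
    """Keep a read only if fewer than 10% of its bases are uncertain."""
    return fraction_uncertain(sequence) < max_n_fraction


# Hypothetical 10-base reads: one with no N, one with exactly 10% N.
print(keep_read("ACGTACGTAC"))
print(keep_read("ACGTACGTAN"))
```

Because the text defines low quality as 10% or more uncertain nucleotides, a read sitting exactly at 10% is discarded in this sketch.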

Transcriptome Reconstruction

Transcriptome assembly by de novo reconstruction was performed using Trinity (Grabherr et al., 2011). Inspection of correlation coefficients between samples revealed that SD20 was statistically deviant from the other families, with correlation coefficients much closer to 0 than to 1 (see Results: Pearson correlation, volcano plots, and heatmaps). Therefore, three assemblies were constructed: 1) only speckled-seed samples, 2) only white-seed samples, and 3) five samples excluding sample SD20. A hierarchical clustering of the transcriptome using Corset (Davidson and Oshlack, 2014) was performed before the transcriptome was annotated, in order to filter low mapped reads and remove redundant clusters from the assembly.

Functional Annotation

After redundant clusters were removed, we identified the most likely function of the candidates that mapped to the assemblies. Gene function was annotated using the following databases: Nr (NCBI non-redundant protein sequences) (NCBI, 2017); Nt (NCBI non-redundant nucleotide sequences) (NCBI, 2019); Pfam (Protein family) (EMBL-EBI, 2018); KOG/COG

(Clusters of Orthologous Groups of proteins) (JGI, 2019); Swiss-Prot (A manually annotated and reviewed protein sequence database) (Services and Services, 2019); KO (KEGG Ortholog database) (Kanehisa Laboratories, 2019); GO (Gene Ontology) (Geneontology Unifying

Biology, 2019). The following parameters were used for each database. NT: NCBI BLAST 2.2.28+, with an e-value threshold of 1e-5 (each cluster shows the top 10 alignment results). NR, SwissProt, and KOG: Diamond 0.8.22, with an e-value threshold of 1e-5 for NR and SwissProt (each cluster shows the top 10 alignment results) and 1e-3 for KOG. PFAM (prediction of protein structure domains): HMMER 3.0 package, hmmscan, with an e-value threshold of 0.01. GO (based on the protein annotation results of NR and Pfam): Blast2GO v2.5 (Götz et al., 2008) and a Novogene script, with an e-value threshold of 1e-6. KEGG: KAAS (KEGG Automatic Annotation Server), with an e-value threshold of 1e-10.

To determine GO classification, the successfully annotated genes were grouped into three main GO domains: Biological Process, Cellular Component, and Molecular Function. During KOG classification, the successfully annotated genes were grouped into 26 categories: RNA processing and modifications; chromatin structure and dynamics; energy production and conversion; cell cycle control, cell division, and chromosome partitioning; amino acid transport and metabolism; nucleotide transport and metabolism; carbohydrate transport and metabolism; coenzyme transport and metabolism; lipid transport and metabolism; translation, ribosomal structure, and biogenesis; transcription; replication, recombination, and repair; cell wall/membrane/envelope biogenesis; cell motility; posttranslational modification, protein turnover, and chaperones; inorganic ion transport and metabolism; secondary metabolites biosynthesis, transport, and catabolism; general function prediction only; function unknown; signal transduction mechanisms; intracellular trafficking, secretion, and vesicular transport; defense mechanisms; extracellular structures; unnamed proteins; nuclear structure; and cytoskeleton.

For KEGG classification, we successfully annotated genes and separated them into 5

KEGG categories: Cellular Processes, Environmental Information Processing, Genetic

Information Processing, Metabolism, and Organismal Systems. We divided the CDS (coding sequence) prediction into two steps. Step 1 involved the BLASTing of clusters according to the priority of NR and Swissprot databases. If the information matched, then we extracted the CDS from cluster sequences and translated them into peptide sequences based on the standard codon table (using the 5' to 3' direction). If there was no BLAST match, then we proceeded to analyze clusters with no hits with ESTScan (3.0.3) to predict their coding regions and determine their sequence direction.

After differential gene expression identified candidates, we performed additional manual annotation of the resulting clusters using NCBI BLAST against the Lupinus angustifolius genome. We did this because in the first annotation many of the hits for clusters were from species outside the Fabaceae family. Because a genome resource is available for L. angustifolius, we repeated the BLAST to check that differentially expressed genes were correctly annotated to a member of the Lupinus genus.

Gene Expression Analysis

Once the transcriptome was mapped and annotated, we quantified expression levels using RSEM software (Li and Dewey, 2011), which calculated expression levels of mapped reads based on FPKM (Fragments Per Kilobase of transcript per Million mapped fragments). FPKM is a commonly used measure of gene expression because it accounts for both sequencing depth and gene length, which allows us to standardize read counts so that we can compare expression between white and speckled transcriptomes.
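The normalization behind FPKM can be written out as a minimal Python sketch. The function name and the example numbers are hypothetical, and the sketch uses the plain transcript length; RSEM itself works with an effective length and a statistical model for multi-mapped reads, neither of which is reproduced here.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    fragments: fragments assigned to the transcript
    transcript_length_bp: transcript length in base pairs
    total_mapped_fragments: all mapped fragments in the library
    """
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million mapped fragments.
print(fpkm(500, 2_000, 20_000_000))
```

Dividing by length keeps long transcripts from appearing more highly expressed simply because they yield more fragments, and dividing by library size makes samples sequenced to different depths comparable.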

For biological replicates it is important to look at correlation between samples as an indicator of reliability. A Pearson correlation was used as a quality control test for replicates, which measures similarity among samples. The higher the similarity, the higher the correlation.

Because these samples were from a highly outcrossing, native species and not an inbred line, some variability is expected even among replicates. The results of this test informed our analysis and group design for Differential Gene Expression Analysis.
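For reference, the Pearson coefficient between two samples can be computed as in the following Python sketch. The function name and the expression vectors are hypothetical; whether the expression values are transformed before correlation (for example log-transformed FPKM) is not specified here, so the sketch operates on whatever values it is given.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical expression values for the same four clusters in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [2.1, 3.9, 6.2, 7.8]
print(pearson(sample_a, sample_b))
```

A coefficient near 1 indicates that the two samples rank and scale the clusters similarly, which is the property used here as a check on replicate reliability.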

Differential Gene Expression

Differential expression analysis was performed using the DESeq R package (1.10.1). DESeq provides statistical routines for determining differential expression in digital gene expression data using a model based on the negative binomial distribution (Anders and Huber, 2010). The model accounts for the biological variation found among replicates. The resulting P-values were adjusted using Benjamini and Hochberg’s approach for controlling the false discovery rate. Genes with an adjusted P-value < 0.05 found by DESeq were assigned as differentially expressed. As part of differential gene expression analysis, we produced volcano plots and heatmaps to visualize the results. The volcano plots show the overall distribution of differentially expressed genes, while the heatmap provides a visual representation of clustering based on a hierarchical clustering analysis. The heatmap also conveys expression information through colors that illustrate up or down expression.
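The Benjamini and Hochberg adjustment can be illustrated with a short Python sketch. This is a generic implementation of the published procedure, not the code path inside DESeq, and the function name and example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P-values for four clusters.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Each raw P-value is scaled by the number of tests divided by its rank, so the adjustment grows stricter as more clusters are tested at once.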

RESULTS

Phenotypes of Biological Replicates

The phenotypes (percentage of area speckled) of the speckled families used for sequencing ranged from 15% to 43%, while the white families had 0% speckling. The average speckling percentages for each maternal family were: SD2 49%, SD7 15%, SD20 43%, SW3 0%, SW4 0%, and SW10 0%. In the field surveys the spectrum of speckling can vary from white seeds with 0% speckling to darkly speckled seeds with 85% speckling. Our spectrum for speckled families that were initially collected and from which RNA was isolated ranged from 7% to 70%. The higher speckling phenotypes were not sequenced because those samples did not pass quality standards for library construction.

Raw Data

RNA extraction yields for most of the families exceeded the concentration requirement for library construction of 50 ng/µL, except for family SD20 (Table 1). Although the purity was acceptable for all tested samples (a 260/280 ratio of about 2 indicates pure RNA), degradation of the speckled families was a recurrent problem that led to the use of some samples with only one of the developmental stages. Although this was not an issue for the white seeded families, two of the three speckled families used for library construction had either the pre- or post-pigment stage and were missing the other developmental stage (Table 1). While SD20 was missing the pre-pigment developmental stage because it was one of the more mature samples collected, it was selected over the other families because it had less RNA degradation. This reasoning also explains why SD2, which was missing the post-pigment tissue stage, was included. As will be evident below, the inclusion of SD20 had a profound effect on the design of the analyses.

Table 1 Sample Purity, Concentration, and Tissue Stage. A260 and A280 are the absorbances measured at 260 nm and 280 nm. Nucleic acids absorb at 260 nm, while protein and other contaminants absorb at 280 nm. The 260/280 ratio indicates the purity of the sample.

Sample ID   Nucleic Acid Concen. (ng/uL)   A260 (Abs)   A280 (Abs)   260/280   Total ng   Tissue Stage
SW3         287                            7.174        3.791        1.89      12341      Pre- and post-pigment
SW4         81.9                           2.047        1.2          1.71      3521.7     Pre- and post-pigment
SW10        126.4                          3.16         1.844        1.71      5435.2     Pre- and post-pigment
SD2         889.1                          22.153       11.076       2.00      38231.3    Pre-pigment
SD7         122.8                          3.071        1.671        1.84      5280.4     Pre- and post-pigment
SD20        30.1                           0.754        0.450        1.68      1294.3     Late post-pigment

Data Quality

All downstream analyses were based on clean data with high quality. The average percentage of contamination was around 6%. Another indicator of data quality is the error of base calling expressed as error rate or Phred Quality (Q) score, where Q20 is the percentage of bases with correct base recognition rates greater than 99% in total bases, while Q30 is the percentage of bases with correct base recognition rates greater than 99.9% of total bases. There was an error rate of 0.02 for SW10, SW3, SW4, and SD7 and an error rate of 0.03 for SD2 and SD20 (where an error of 0.02 indicates a chance for 2 incorrect bases for every 100 bases). The average Q20 was 98.25% and Q30 was 94.81% for the six samples (Table 2). The GC content of the Lupinus angustifolius genome is around 34% (Yang et al., 2013), while our seed coat transcriptomes had an average overall GC content around 43% (Table 2).
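The relationship between a Phred score and the probability of an incorrect base call follows the standard definition Q = -10 log10(P). The Python sketch below shows the conversion in both directions; the function names are hypothetical.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.1f}%")
```

Under this definition a Q20 base has a 1 in 100 chance of being wrong and a Q30 base a 1 in 1,000 chance, which is why Q20 and Q30 correspond to the 99% and 99.9% accuracy levels mentioned above.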

Table 2 Data Quality Measurements for Library Construction. Each sample has its own error rate and count of raw and clean reads. Raw reads include contamination, which has been removed from clean reads. Contamination arises from reads with adaptor information, uncertain nucleotides greater than 10%, and low-quality nucleotides greater than 50%. Error rate comes from the sequencer and increases as the reads are extended because reagents become scarce. Sequencing mistakes depend on the sequencing method, sequencing machine, reagents available, and the sample. Phred scores (Q20 or Q30) are a measure of quality in identifying nucleotides. The higher the Phred score, the more accurate the base calling and the lower the chance of sequencing mistakes. GC content is based on the AT/GC separation. GC content is a measure of stability. The AT content was higher than the GC content for each sample.

Sample    Raw Reads   Clean reads   Percentage of Clean reads   Clean bases   Error Rate   Q20     Q30     GC content
SW10      57130538    52536710      91.96                       7.9G          0.02         98.54   95.45   42.18
SW3       62213808    53356894      85.76                       8G            0.02         98.63   95.74   42.43
SW4       89917546    87084178      96.85                       13.1G         0.02         98.17   94.61   43.5
SD7       67471490    64436314      95.50                       9.7G          0.02         98.66   95.82   43.17
SD2       81331984    78669052      96.73                       11.8G         0.03         97.74   93.54   42.84
SD20      80208146    76178074      94.98                       11.4G         0.03         97.75   93.68   46.76
Average   73045585    68710204      93.63                       10.3G         0.023        98.25   94.81   43.48

Transcriptome Reconstruction

To reconstruct the transcriptome, we used the de novo assembly of the five samples across both phenotypes (with SD20 excluded) as a guide to map the separate white and speckled assemblies. Between 73% and 80% of the reads were mapped back for the maternal families SW3, SW4, SW10, SD7, and SD2. Only 66% were mapped back for SD20, which was expected because it was not included in the de novo assembly (Table 3). De novo assemblies can include an increased number of transcripts from artifacts. Artifacts are any variation introduced by non-biological processes, such as sequencing errors. To reduce transcripts with artifacts, the software Corset (Davidson and Oshlack, 2014) was used to produce clusters. Corset also provides the gene-level counts used in differential expression analysis because it takes multi-mapped reads into account. The software also performs the hierarchical clustering used in the construction of heatmaps. We compared the lengths of transcripts and clusters, which were generally the same (Table 4). The one difference was that cluster counts were lower than transcript counts (Table 5, Figure 3). When the transcript lengths were categorized in intervals of less than 300bp, 300-500bp, 500-1kb, 1-2kb, and greater than 2kb, the majority were in the interval between 500-1kb, followed by 300-500bp (Figure 3). The interval counts for both transcripts and clusters were the same for sizes greater than 1kb. For lengths less than 1kb, clusters had slightly lower counts than transcripts. The transcripts missing from the cluster counts were removed because they contained sequence errors.

Table 3 The Mapped Transcriptome. Each sample was used in the assembly except SD20. Total mapped reads divided by total reads gives the percentage mapped.

Sample name   Total reads   Total mapped   Percentage
SW10          52536710      42164948       80.26%
SW3           53356894      39635084       74.28%
SW4           87084178      66325962       76.16%
SD7           64436314      47253776       73.33%
SD2           78669052      62055032       78.88%
SD20          76178074      50290072       66.02%

Table 4 The Length Measurements of Transcriptome. The length measurements of the transcripts after mapping. N50 and N90 are weighted median statistics of the assembly. N50 is the contig length at which contigs of that length or longer cover 50% of the assembly; N90 is the corresponding length for 90% of the assembly.

              Min Length   Mean Length   Median Length   Max Length   N50    N90   Total Nucleotides
Transcripts   201          871           562             49098        1234   395   147267448
Unigene       201          871           562             49098        1234   395   147234592
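The N50 and N90 statistics can be computed from a list of contig lengths as in the following Python sketch. The function name and the toy lengths are hypothetical and are not taken from the assembly reported here.

```python
def nxx(lengths, fraction=0.5):
    """Length L such that contigs of length >= L cover `fraction` of the assembly.

    fraction=0.5 gives N50 and fraction=0.9 gives N90.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= target:
            return length
    raise ValueError("no contig lengths were provided")


# Toy assembly of nine contigs (hypothetical lengths in bp).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nxx(toy_lengths, 0.5), nxx(toy_lengths, 0.9))
```

Because N90 requires covering more of the assembly, it is always less than or equal to N50, which matches the ordering of the two values in Table 4.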

Table 5 Interval Lengths of Transcripts. Record of the numbers for each interval length. These are the numbers shown in the graph of length distribution.

Transcript length interval   <300bp   300-500bp   500-1kb   1-2kb   +2kb    Total
Number of transcripts        23560    49523       52901     28636   14461   169081
Number of clusters           23445    49509       52900     28636   14461   168951

Gene Functional Annotation

Overall at least 80% of clusters were annotated using the databases: NR, NT, KO,

SwissProt, PFAM, GO, and KOG (Table 6). The other 20% are uncharacterized and were not annotated in the databases used. This could be because the transcripts mapped to a gene that has not been characterized yet or has no known match in the databases used.

Table 6 The Transcriptome Annotation Summary. The transcriptome annotation summary for each database. The number of clusters annotated in the database is shown along with the percentage. The percentage comes from the number of clusters annotated over the total number of clusters.

                                     Number of clusters   Percentage (%)
Annotated in NR                      79795                47.22
Annotated in NT                      119639               70.81
Annotated in KO                      32640                19.31
Annotated in SwissProt               88045                52.11
Annotated in PFAM                    71212                42.14
Annotated in GO                      72019                42.62
Annotated in KOG                     38544                22.81
Annotated in all Databases           18219                10.78
Annotated in at least one Database   135260               80

Figure 3 Graph Interval Lengths. Graphical representation of the interval lengths. The interval counts for clusters and transcripts are similar for most intervals. The exact counts are found in Table 5.

GO Classification

The annotated transcripts were grouped into three classifications based on the GO annotation. Those three groups were biological processes, cellular component, and molecular function (Figure 4). Biological processes examples include cellular process, metabolic process, and single-organism process. Cellular component examples included the cell, the cell part, and organelles. Molecular function examples include binding, and catalytic activity.

Figure 4 GO Classification. The GO classifications are broken down into three groups with further subclasses. The three groups are biological process, cellular component, and molecular function.

KOG Classification

Alternatively, annotated transcripts were grouped based on KOG classification (Figure

5). This system had 26 groups. The largest group was the General function prediction only

(group R). The next two largest groups were signal transduction (group T) and posttranslational modification, protein turnover, and chaperones (group O). These three groups represent the transcript classifications present in developing seeds. Defense mechanism (group V) was a small percentage of KOG classification, yet genes with defense function are seen in the differential expression analysis.

Figure 5 KOG Classification. KOG classification of clusters. KOG stands for EuKaryotic Orthologous Groups and is specific for eukaryotic species. It is used for identifying ortholog and paralog proteins from clusters.

KEGG Classification

Further, annotated transcripts were grouped based on KEGG classification (Figure 6).

Here there were five groups: A) Cellular processes, B) Environmental Information Processing,

C) Genetic Information Processing, D) Metabolism, and E) Organismal Systems. These are the same transcripts grouped according to KEGG showing the breakdown of transcripts in another form. The largest classification was in group D, Metabolism. This was expected because immature seeds are actively producing products before the mature seeds go dormant.

Figure 6 KEGG Classification. KEGG stands for Kyoto Encyclopedia of Genes and Genomes. This database deals with genomes, chemical reactions, and biological pathways.

Gene Expression Analysis

Evaluations of results from the Pearson correlation (Figure 7), volcano plots (Figure 9) and heatmaps (Figure 10) led to the creation of an SD1 group with the removal of SD20. The groups were as follows: SW, SD, and SD1 (Table 1, Table 7). SW consisted of SW10, SW3, and

SW4 all with both pre and post tissues. SD consisted of SD7, SD2, and SD20 (which had a mixture of pre, post, and late post-pigment tissues). SD1 was made up of SD7 and SD2 which was a mixture of pre and post tissues, but without the later developmental stage of SD20.

Sample Correlation

In the Pearson correlation, the closer the coefficient is to 1 the more similar the samples are to each other. The tissue developmental stage played a role in the similarity of the replicates (Figure 7). SD20 and SD2 have coefficients below 0.6 while the rest have coefficients around 0.6. Both SD20 and SD2 were missing one of the tissue types (Table 1). SD2 had a Pearson coefficient closer to the other samples with both pre and post tissues, while SD20 was particularly deviant. This follows because SD2 had only the pre-pigment tissue, a stage when transcripts would be synthesized to produce signals for the immature seed to develop. Yet in a later stage seed, as in SD20, the transcripts would be changing towards maturation and the development of dormancy. These results supported the creation of two approaches for differential gene expression analysis.

Figure 7 Pearson Correlation. Table of Pearson correlations between each sample. The darker the color the higher the correlation and the more alike the samples are to each other.

Summary of Gene Expression Levels

Gene expression was based on read counts, and only clusters with significant expression levels and gene annotation were recorded in Table A1 of the Appendix.

Gene Expression Difference Analysis

We used two approaches to examine the differential expression results. In Approach 1, all three SD replicates were compared against all three SW replicates. In Approach 2, SD1 (only families SD7 and SD2, with SD20 removed) was compared against all three SW samples (Table 7). Approach 1 identified 6 transcripts with significant expression differences (Figure 8). In Approach 2 (where the more mature tissues of SD20 were excluded) 58 transcripts were identified, including the 6 from Approach 1 and 52 new clusters (Table A8). Volcano plots (Figure 9) for both approaches show the distribution of the whole transcriptome. Although both the volcano plot and heatmap show expression increases or decreases in the speckled phenotype (either SD or SD1 depending on the approach), the heatmap (Figure 10) also reflects hierarchical clustering analysis and shows co-expression of the transcripts. Combined, the volcano plot and heatmap illustrate the 58 differentially expressed genes.

The volcano plots display the distribution of the transcripts for the whole seed coat transcriptome (Figure 9). Any transcripts above the cut-off line (adjusted p-value < 0.05, which corresponds to -log10(adjusted p-value) > 1.3) were differentially expressed. Fold change gives the direction of expression. The fold change for the volcano plot is in reference to the speckled phenotype. This means that a negative fold change conveys a transcript with decreased expression in speckled seeds, while a positive fold change refers to increased expression in speckled seeds. The larger the fold change, the stronger the expression difference. Combining adjusted p-value and fold change, the closer the points were plotted to the top left and top right corners, the greater the difference in expression.
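How a single point is placed and labeled on a volcano plot can be sketched in Python as follows. The function names and the example values are hypothetical; only the 0.05 threshold on the adjusted p-value and the sign convention (fold change relative to the speckled phenotype) are taken from the text.

```python
import math


def volcano_height(adjusted_p):
    """Vertical position of a point on the volcano plot."""
    return -math.log10(adjusted_p)


def classify_transcript(log2_fold_change, adjusted_p, alpha=0.05):
    """Label a transcript relative to the speckled phenotype."""
    if adjusted_p >= alpha:
        return "not differentially expressed"
    if log2_fold_change > 0:
        return "up in speckled"
    return "down in speckled"


# The 0.05 threshold corresponds to a height of about 1.3 on the plot.
print(round(volcano_height(0.05), 2))

# Hypothetical transcripts: (log2 fold change, adjusted p-value).
for fold_change, adjusted_p in [(2.0, 0.01), (-1.5, 0.01), (3.0, 0.2)]:
    print(fold_change, adjusted_p, classify_transcript(fold_change, adjusted_p))
```

Taking the negative logarithm turns small adjusted p-values into large heights, so the most significant transcripts sit at the top of the plot.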

A heatmap also illustrates differential expression, with deeper colors indicating stronger expression. However, heatmaps can also show clustering of transcripts. Using the hierarchical clustering methods, the transcripts were clustered based on expression. The transcripts were labeled as Cluster-ID with a unique number ID for the whole transcriptome. Only the 58 differentially expressed genes are shown in the heatmap. Inspection of the heatmap reveals the changes in transcripts in SD20 and the color difference illustrated shows its variability from the other samples. Looking at SD, SD1, and SW shows a clear clustering of expressed genes.

However, SD20 does not follow a similar pattern, consistent with less distinct color differences for SD compared to SD1. There is a clear clustering of expression for up and down expressed genes.

Table 7 Sample Names and Approaches. The sample name and group names for each sample according to the approach type. Sample Name has the identity for each sample that corresponds to Table 1. The labels in Approach 1 and 2 classify the samples included in each group. For example, SD2 in Approach 1 is included in group SD and in Approach 2 it is within group SD1.

Sample Name   Approach 1   Approach 2
SD2           SD           SD1
SD7           SD           SD1
SD20          SD           SD20
SW3           SW           SW
SW4           SW           SW
SW10          SW           SW

Figure 8 Venn Diagram of Differentially Expressed Genes. A Venn diagram of the overlap of the approaches. Approach 1 is labeled as SD vs. SW and Approach 2 is labeled as SD1 vs. SW. The numbers represent the total differentially expressed genes in each approach.


Figure 9 Volcano Plots. Volcano plots of Approach 1 and Approach 2. A) Approach 1 with SD20 included. Six clusters were identified. B) Approach 2 with SD20 removed. Fifty-eight clusters were identified. For both graphs the dashed line represents the adjusted p-value threshold of 0.05, plotted as -log10(0.05), which equals 1.3. The dots above the dashed line represent the differentially expressed genes. The red dots are up expressed in the speckled phenotype. The green dots are down expressed in the white phenotype. The blue dots are non-differentially expressed transcripts.

Figure 10 Heatmap of Differentially Expressed Genes. Heatmap of the three assemblies and SD20 of the 58 differentially expressed genes. The hue of the color indicates degree of expression. The darker the hue the greater the expression difference. The red represents increased expression in the speckled phenotype, while the blue represents decreased expression in the white phenotype. The highlighted clusters are from Approach 1 (with SD20). The clusters outlined in red are from Approach 2 (without SD20).

DISCUSSION

The aims were to identify whether Anthocyanin Biosynthesis genes and transcription factors differed in expression based on phenotype (Aim I) and what other genes differ between the phenotypes (Aim II). Aim I, Hypotheses I and II, were supported by the differentially expressed genes identified, because one candidate gene associated with the Anthocyanin pathway and one transcription factor were correlated with phenotype. Differential gene expression analysis identified an early gene and a transcription factor of the Anthocyanin Biosynthetic Pathway (ABP) that had decreased expression in the white phenotype. The two genes that relate back to the ABP are Cluster-39697.17855 and Cluster-39697.11726. The functional annotation for Cluster-39697.17855 (BLASTed against the Lupinus angustifolius genome) was cinnamoyl-CoA reductase, while the annotation for Cluster-39697.11726 was DDB1- and CUL4-associated factor 13 (a WD40 protein) (Appendix Table 8). Other differentially expressed genes have roles in plant defense and the immune response, supporting Aim II. For the comparison without SD20, fifteen clusters were uncharacterized or unknown when BLASTed against the L. angustifolius genome.

Cinnamoyl-CoA reductase (CCR) is a class of enzymes that plays a role in building cell walls (Barakat et al., 2011). CCR converts cinnamoyl-CoA into cinnamaldehydes in the first step of monolignol biosynthesis and is the starting point for flavonoid metabolites in the phenylpropanoid pathways (Lauvergeat et al., 2001). Within the CCR class the cluster specifically BLASTs to cinnamoyl-CoA reductase Snl6, located in the Lupinus angustifolius genome at the genomic location LG06 NC_032014.1 (4120541, 4123132). Specifically, cinnamoyl-CoA reductase Snl6 is part of the plant immune response. In rice the enzyme is part of the NH1 (NPR1) immune response; NH1 provides rice with resistance to Xanthomonas oryzae pv. oryzae, which causes bacterial blight (Bart et al., 2010). Cinnamoyl-CoA reductase was up expressed in speckled seeded lupin, suggesting that speckled seedlings could have a boosted response against bacterial blight.

DDB1- and CUL4-associated factor 13 is in the transcription factor class of WD40 proteins. WD40 proteins function to facilitate protein-protein interactions or act as a connecting point for several proteins to function in a complex (Ramsay and Glover, 2005). Plant genomes typically contain more than two hundred WD40 repeat proteins (Miller et al., 2016). They are involved in a wide range of diverse functions such as signal transduction, regulation, and protein modifications (Zhang and Zhang, 2015). Some functions include trichome initiation, epidermal cell formation, and seed coat pigmentation. The ability of WD40 to be involved in regulating diverse functions lies in its ability to act as a scaffold for other proteins. In the anthocyanin biosynthetic pathway, WD40 plays a role in forming the MBW (MYB-bHLH-WD40) transcription factor complex with MYB and bHLH. This complex functions to regulate expression of late genes in the ABP after naringenin, like F3H, DFR, and ANS. This transcription factor was up expressed in speckled seeds. The increased expression of this transcription factor could signal the ABP to synthesize anthocyanins, producing pigments in the seed coat. This is a simplified scenario in which all downstream products of the ABP are affected, but the ABP is also part of the Flavonoid Biosynthesis Pathway, and it is not clear if all downstream products are affected or only certain branches of the pathway. Another aspect that is not yet understood is the role of transposons in these phenotypes. Classic examples of transposon activity, such as those in Barbara McClintock’s work on corn (1950), produce a speckling pattern similar to that observed in L. perennis, suggesting that transposable elements could be involved in the tissue specific expression of the speckling pattern of the seed coat. A transposon insertion in the WD40 gene of Ipomoea nil (Japanese morning glory) leads to the loss of both flower pigment and seed pigment (Hoshino et al., 2016). The increased expression of WD40 associated with the speckled phenotype suggests that when WD40 is down regulated, the white phenotype arises because one or more late-acting genes of the ABP are down regulated, and the pigments that produce the speckling are not being synthesized.

Of the 6 clusters identified as differentially expressed in the comparison of SD and SW, 3 were identified to play a role in plant defense or immune response. In addition to Cluster-39697.17855 (cinnamoyl-CoA reductase), mentioned above, two genes are involved in plant defense: Cluster-39697.21529 and Cluster-39697.7745. Cluster-39697.21529 is a receptor-like protein 51, while Cluster-39697.7745 is a protein trichome birefringence gene.

Receptor-like protein 51 functions in signal transduction at the plasma membrane and in plant defense. Receptor-like proteins contain leucine-rich repeats that identify them as a class of resistance genes in plants (Dangl and Jones, 2001). They are known to confer resistance against fungal pathogens such as Verticillium, which causes wilt in tomatoes (Fradin et al., 2009). This cluster was up expressed in speckled seeds, while Cluster-39697.7745 was down expressed in speckled seeds. Cluster-39697.7745, a protein trichome birefringence gene, is known as powdery mildew resistance 5. Trichome birefringence is a plant protein family. In Arabidopsis this family includes 46 members. The trichome birefringence family is proposed to encode wall polysaccharide specific O-acetyltransferases. Members of the trichome birefringence protein family have been shown to affect pathogen resistance, freezing tolerance, and cellulose biosynthesis. Powdery mildew resistance (PMR) genes are another name for trichome birefringence genes (Bischoff et al., 2010). Another cluster is also involved in powdery mildew resistance: Cluster-44464.0, which is a trichome birefringence-like protein. Trichome birefringence-like proteins are also involved in the cell wall, for example in maintaining esterification of cell wall polymers (UniProt, 2019). However, this cluster is up expressed in speckled seeds.

The different correlation pattern of these defense genes suggests that the seed coat phenotypes may be the result of differences in local adaptation and histories of selection for pathogen defense. This could be related to the co-evolutionary arms race of pathogens and hosts. This interaction can maintain polymorphisms in populations (Salvaudon et al., 2008; Bergelson et al., 2001).

The last up expressed cluster of the speckled phenotype (see Appendix Table 8 for a list of the others) is Cluster-39697.25331, which is a WVD2 or Targeting protein for Xklp2 (TPX2). WVD2 proteins are microtubule-associated proteins (MAPs), which have a role in the orientation of interphase cortical microtubules (UniProt, 2019). WVD2 stands for wave-dampened 2, and in Arabidopsis its high expression leads to short, thick stems and roots compared to wild-type plants (Perrin et al., 2007). Different root systems could generate differences in foraging and stress response between the two phenotypes. It is possible that seedlings emerging from speckled seeds would also have shallower and thicker root systems than those from white seeds, or that this protein also influences seed coat cell development.

The last two clusters identified were down expressed in the speckled phenotype. One was Cluster-39697.15107, a nucleobase-ascorbate transporter, and the other was Cluster-39697.22337, a low-temperature-induced cysteine proteinase. Nucleobase-ascorbate transporters have transmembrane transporter activity and specifically transport purines and pyrimidines as well as ascorbate (UniProt, 2019). They play a role in DNA and RNA metabolism and have multiple other roles such as redox signaling, growth regulation, and pathogen defense (Maurino et al., 2006). This could suggest that in speckled plants decreased transcript levels lead to reduced transport activity in nucleobase metabolism. Cluster-39697.22337 is a low-temperature-induced cysteine proteinase and functions in cysteine-type peptidase activity, in which a cysteine residue forms the catalytic site of the enzyme (UniProt, 2019). This suggests that white seeds could have metabolic differences at low temperatures. A cold spring could induce activity of the low-temperature cysteine proteinase, allowing white seeds to emerge earlier than speckled seeds. This could play a role in early and late spring germination patterns.

In this study, the particular maternal families analyzed affected the analysis and played a major role in interpreting the results. Starting with percent of speckling, the families selected (SD2, SD7, SD20) had speckling percentages between 15% and 47%. This could bias the results toward differentially expressed genes pertaining to seeds that are below 50% speckled. It could also be expected that differentially expressed genes in this group would present weaker signals than would be seen in families with speckling above 50%. However, this is speculation, as we focused on speckled seeds compared to white seeds and not the amount of speckling. Another limitation was the small number of replicates used. Only three families were selected for each phenotype, and in the speckled phenotype not all were true developmental replicates: SD2 and SD20 were each missing one tissue type. Future work to corroborate these results could increase the number of replicates for another round of RNA-seq and include a wider range of the speckling spectrum, for example several families each with 25%, 50%, 75%, and 90% speckling. Another essential study to strengthen this work would be qPCR on the differentially expressed genes in a much larger sample of replicates (Yang et al., 2019; Wang et al., 2018; Wilhelmsson et al., 2019). This would be necessary to validate the results from our differential expression analysis and confirm that the loci shown are differentially expressed.

Other future work to follow this RNA-seq experiment is to perform sequence analysis and construct co-expression networks. With the sequence data it is possible to compare the genes and identify sequence differences such as SNPs (single nucleotide polymorphisms), insertions, deletions, and sequence inversions. Transposon insertions are known to have regulatory effects on the ABP, which can lead to differences in seed coat phenotypes (Zabala and Vodkin, 2014; Zabala and Vodkin, 2005; Park et al., 2007). A co-expression network analysis could further explore the correlation pattern among genes. This would be a visual representation of the relationship between gene expression and phenotype. This work was not done because of limited time and access to computing power for analysis. Future work can include designing experiments to test hypotheses based on differential gene expression. Much of the discussion suggested possible implications of expression differences that require experiments for validation. For example, the suggestion that seed coat phenotypes influence plant defense mechanisms could be tested by an experiment in which speckled and white seeded Lupines are exposed to live field soils containing pathogens, to test whether the seed phenotype confers defense against certain pathogens or diseases.
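As an illustration only, the co-expression analysis proposed above could start from pairwise correlations of expression profiles across samples. The sketch below is not part of this study's pipeline: the gene names, expression values, and the 0.9 correlation threshold are hypothetical, and established network packages use more elaborate methods than a simple correlation cutoff.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, round(r, 3)))
    return edges


# Hypothetical normalized expression values (NOT data from this study).
# Each list holds six samples: three speckled families, then three white.
example_expression = {
    "gene_A": [8.0, 7.5, 8.2, 2.1, 1.9, 2.4],
    "gene_B": [6.1, 5.8, 6.3, 1.0, 1.2, 0.9],
    "gene_C": [1.0, 1.3, 0.8, 5.5, 5.9, 5.2],
    "gene_D": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
}

# Each printed edge links two genes whose profiles rise and fall together
# (positive r) or in opposition (negative r) across the samples.
for edge in coexpression_edges(example_expression):
    print(edge)
```

Genes joined by an edge form the nodes of a network that can then be drawn and compared with phenotype. With only a handful of samples, as here, correlations are unstable, which is one more reason additional replicates would be needed before such an analysis.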

Our results suggest that seed phenotype differences are correlated with a suite of other traits likely to have significant implications for crop breeding beyond visual appearance. The identification of multiple differentially expressed genes related to plant defense could help breeding efforts to make domesticated Lupin more resilient to pests in the field. Wild species hold a reservoir of genetic diversity not present in domesticated species, in which reduced genetic diversity leaves them open to disease (Smýkal et al., 2018). Domesticated crops with a narrow genetic background have limited ability to adapt against disease, as has been shown in domesticated Lupin (Berger et al., 2012). The use of flower and seed pigments for breeding desired traits is a common practice. However, it is not known what other traits are being selected or excluded when accession lines chosen for subsequent breeding work are also uniform in pigmentation. It is possible this gap in knowledge could lead to the loss of beneficial traits. More work is needed to understand the basis of this seed coat polymorphism and the effect this phenotype has on the vast array of other complex traits important to the fitness of the plant.

REFERENCES

Abraham, E., Ganopoulos, I., Madesis, P., Mavromatis, A., Mylona, P., Nianiou-Obeidat, I., … Vlachostergios, D. (2019). The Use of Lupin as a Source of Protein in Animal Feeding: Genomic Tools and Breeding Approaches. International Journal of Molecular Sciences, 20(4), 851. doi: 10.3390/ijms20040851

Anders, S., and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10). doi:10.1186/gb-2010-11-10-r106

Applications of RNA-seq. (2016, July 06). Retrieved from https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/applications-rna-seq

Atis, I., Atak, M., Can, E., and Mavi, K. (2011). Seed coat color effects on seed quality and salt tolerance of red clover (Trifolium pratense). International Journal of Agriculture and Biology, 13: 363-368.

Badri, D. V., Weir, T. L., Lelie, D. V., & Vivanco, J. M. (2009). Rhizosphere chemical dialogues: Plant–microbe interactions. Current Opinion in Biotechnology, 20(6), 642-650. doi:10.1016/j.copbio.2009.09.014

Barakat, A., Yassin, N. B., Park, J. S., Choi, A., Herr, J., & Carlson, J. E. (2011). Comparative and phylogenomic analyses of cinnamoyl-CoA reductase and cinnamoyl-CoA-reductase-like gene family in land plants. Plant Science, 181(3), 249-257. doi:10.1016/j.plantsci.2011.05.012

Bart, R. S., Chern, M., Vega-Sánchez, M. E., Canlas, P., & Ronald, P. C. (2010). Rice Snl6, a Cinnamoyl-CoA Reductase-Like Gene Family Member, Is Required for NH1-Mediated Immunity to Xanthomonas oryzae pv. oryzae. PLoS Genetics,6(9). doi:10.1371/journal.pgen.1001123

Bergelson, J., Dwyer, G., & Emerson, J. J. (2001). Models and Data on Plant-Enemy Coevolution. Annual Review of Genetics, 35(1), 469-499. doi:10.1146/annurev.genet.35.102401.090954

Berger, J. D., Buirchell, B., Luckett, D. J., & Nelson, M. N. (2012). Domestication bottlenecks limit genetic diversity and constrain adaptation in narrow-leafed lupin (Lupinus angustifolius L.). Theor. Appl. Genet., 124, 637–652

Bischoff, V., Nita, S., Neumetzler, L., Schindelasch, D., Urbain, A., Eshed, R., … Scheible, W. R. (2010). TRICHOME BIREFRINGENCE and its homolog AT5G01360 encode plant-specific DUF231 proteins required for cellulose biosynthesis in Arabidopsis. Plant physiology, 153(2), 590–602. doi:10.1104/pp.110.153320

Bishop, J. G. (2002). Early Primary Succession on Mount St. Helens: Impact of Insect Herbivores on Colonizing Lupines. Ecology, 83(1), 191. doi:10.2307/2680131

Bhatt, A., Gairola, S., and El-Keblawy, A.A. 2016. Seed color affects light and temperature requirements during germination in two Lotus species (Fabaceae) of the Arabian subtropical deserts. International Journal of Tropical Biology, 64(2): 483-492.

Boersma, J. G., Pallotta, M., Li, C., Buirchell, B. J., Sivasithamparam, K., Yang, H. (2005). Construction of a Genetic Linkage Map Using MFLP and Identification of Molecular Markers Linked to Domestication Genes in Narrow-leafed Lupin (Lupinus angustifolius L.). Cellular & Molecular Biology Letters, 10: 331-344.

Bradshaw, Jr., H. D. and D. W Schemske. (2003). Allele substitution at a flower colour locus produces a pollinator shift in monkeyflowers. Nature 426: 176-178.

Butler, T., Dick, C., Carlson, M. L., & Whittall, J. B. (2014). Transcriptome Analysis of a Petal Anthocyanin Polymorphism in the Arctic Mustard, Parrya nudicaulis. PLoS ONE,9(7). doi:10.1371/journal.pone.0101338

Cannon, S. B., Mckain, M. R., Harkess, A., Nelson, M. N., Dash, S., Deyholos, M. K., . . . Leebens-Mack, J. (2014). Multiple Polyploidy Events in the Early Radiation of Nodulating and Nonnodulating Legumes. Molecular Biology and Evolution, 32(1), 193-210. doi:10.1093/molbev/msu296

Cartwright C. 1997. Interpopulation Variation In Lupinus perennis, The Wild Lupine. Master’s Thesis

Casimiro-Soriguer, I., Narbona, E., Buide, M. L., Del Valle, J. C., & Whittall, J. B. (2016). Transcriptome and Biochemical Analysis of a Flower Color Polymorphism in Silene littorea (Caryophyllaceae). Frontiers in plant science, 7, 204. doi:10.3389/fpls.2016.00204

CCDB Angiosperms Leguminosae Lupinus. (2018). Retrieved from http://ccdb.tau.ac.il/search/

Chen, L., & Wong, G. (2019). Transcriptome Informatics. Encyclopedia of Bioinformatics and Computational Biology, 324-340. doi:10.1016/b978-0-12-809633-8.20204-5

Cooper, J. (2007). Early interactions between legumes and rhizobia: Disclosing complexity in a molecular dialogue. Journal of Applied Microbiology,103(5), 1355-1365. doi:10.1111/j.1365-2672.2007.03366.x

Chachalis, D. and Smith, M.L. (2000). Imbibition behavior of soybean (Glycine max(L.) Merrill) accessions with different testa characteristics. Seed Science & Technology, 28:321-331.

Chalker-Scott, L. (1999). Environmental significance of anthocyanins in plant stress responses. Photochemical Photobiology 70: 1-9

Chittka, L., & Menzel, R. (1992). The evolutionary adaptation of flower colours and the insect pollinators colour vision. Journal of Comparative Physiology A,171(2). doi:10.1007/bf00188925

Dangl, J. L., & Jones, J. D. (2001). Plant pathogens and integrated defence responses to infection. Nature,411(6839), 826-833. doi:10.1038/35081161

Davidson, N. M., & Oshlack, A. (2014). Corset: Enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biology, 15(7). doi:10.1186/s13059-014-0410-6

Differential gene expression analysis. (2016, June 14). Retrieved from https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/differential-gene

Elmer, W. H., Yang, H. A., & Sweetingham, M. W. (2001). Characterization of Colletotrichum gloeosporioides Isolates from Ornamental Lupines in Connecticut. Plant Disease, 85(2), 216-219. doi:10.1094/pdis.2001.85.2.216

EMBL-EBI. (2018). Pfam 32.0. Retrieved from http://pfam.xfam.org/

Fradin, E. F., Zhang, Z., Ayala, J. C., Castroverde, C. D., Nazar, R. N., Robb, J., . . . Thomma, B. P. (2009). Genetic Dissection of Verticillium Wilt Resistance Mediated by Tomato Ve1. Plant Physiology,150(1), 320-332. doi:10.1104/pp.109.136762

Fraser, C. M., & Chapple, C. (2011). The Phenylpropanoid Pathway in Arabidopsis. The Arabidopsis Book,9. doi:10.1199/tab.0152

Frick, K. M., Kamphuis, L. G., Siddique, K. H., Singh, K. B., & Foley, R. C. (2017). Quinolizidine Alkaloid Biosynthesis in Lupins and Prospects for Grain Quality Improvement. Frontiers in Plant Science,8. doi:10.3389/fpls.2017.00087

Forrester, N. J., & Ashman, T. (2017). The direct effects of plant polyploidy on the legume–rhizobia mutualism. Annals of Botany, 121(2), 209-220. doi:10.1093/aob/mcx121

Geneontology Unifying Biology. (2019). Gene Ontology Resource. Retrieved from http://www.geneontology.org/

Gladstones, J.S. (1970) Lupins as Crop Plants. Field Crop Abstracts, 23, 123-148.

Gladstones, J.S. (1974) Lupins of the Mediterranean Region and Africa. Technical Bulletin No. 26. Western Australian Department of Agriculture, Western Australia

Gladstones, J. S. (1977) The narrow-leafed lupin in Western Australia (Lupinus angustifolius L.). Western Australian Department of Agriculture Bulletin 3990

Gleason, H.A. and Cronquist, A. 1991. Manual of vascular plants of the northeastern U.S. and adjacent Canada. 2nd ed. New York Botanical Garden, Bronx, New York, USA. p. 278

Götz, S., García-Gómez, J. M., Terol, J., Williams, T. D., Nagaraj, S. H., Nueda, M. J., … Conesa, A. (2008). High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids research, 36(10), 3420–3435. doi:10.1093/nar/gkn176

Gould, K.S. (2004). Nature’s Swiss army knife: the diverse protective roles of anthocyanins in leaves. Journal of Biomedicine and Biotechnology 2004: 314-320.

Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., . . . Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology,29(7), 644-652. doi:10.1038/nbt.1883

Grassi, A. D., Lanave, C., & Saccone, C. (2008). Genome duplication and gene-family evolution: The case of three OXPHOS gene families. Gene,421(1-2), 1-6. doi:10.1016/j.gene.2008.05.011

Hane JK, Ming Y, Kamphuis LG, Nelson MN, Garg G, Atkins CA, Bayer P, Bravo A, Bringans S, Cannon S, Edwards D, Foley R, Gao L, Harrison MJ, Huang W, Hurgobin B, Li S, Liu C, McGrath A, Morahan G, Murray J, Weller J, Jian J, and Singh KB. 2017. A comprehensive draft genome sequence for lupin (Lupinus angustifolius), an emerging health food: insights into plant-microbe interactions and legume evolution. Plant Biotechnology Journal 15: 318-330

Hedley C. L., Smith C. M., Ambrose M. J., Cook S., Wang T. L., 1986. An analysis of seed development in Pisum sativum. II. The effect of the r-locus on the growth and development of the seed. Ann. Bot. 58: 371–379

Hernández, E. M., Rangel, M. L. C., Corona, A. E., Angel, J. A. C. D., López, J. A. S., Sporer, F., … Torres, K. B. (2011). Quinolizidine alkaloid composition in different organs of Lupinus aschenbornii. Revista Brasileira De Farmacognosia, 21(5), 824–828. doi: 10.1590/s0102-695x2011005000149

Hollister, J. D. (2014). Polyploidy: Adaptation to the genomic environment. New Phytologist, 205(3), 1034-1039. doi:10.1111/nph.12939

Holtsford, T. P., & Ellstrand, N. C. (1992). Genetic And Environmental Variation In Floral Traits Affecting Outcrossing Rate In Clarkia Tembloriensis (Onagraceae). Evolution,46(1), 216-225. doi:10.1111/j.1558-5646.1992.tb01996.x

Hoshino, A., Yoneda, Y., & Kuboyama, T. (2016). A Stowaway transposon disrupts the InWDR1 gene controlling flower and seed coloration in a medicinal cultivar of the Japanese morning glory. Genes & Genetic Systems, 91(1), 37–40. doi: 10.1266/ggs.15-00062

Howe, H. F.; Smallwood, J. (1982). Ecology of Seed Dispersal. Annual Review of Ecology and Systematics, 13:1, 201-228

Jezierny, D., Mosenthin, R. & Bauer, E. (2010). The use of grain legumes as a protein source in pig nutrition: A review. Animal Feed Science and Technology, 157, pp. 111–128.

JGI. (2019). The KOG Browser. Retrieved from https://mycocosm.jgi.doe.gov/help/kogbrowser.jsf

Kanehisa Laboratories. (2019). KEGG: Kyoto Encyclopedia of Genes and Genomes. Retrieved from http://www.genome.jp/kegg/

Kasprowicz-Potocka, M., Zaworska, A., Frankiewicz, A., Nowak, W., Gulewicz, P., Zduńczyk, Z., & Juśkiewicz, J. (2015). The Nutritional Value and Physiological Properties of Diets with Raw and Candida utilis-Fermented Lupin Seeds in Rats. Food technology and biotechnology, 53(3), 286–297. doi:10.17113/ftb.53.03.15.3979

Kelly KM, Van Staden J, Bell WE (1992) Seed coat structure and dormancy. Plant Growth Regul. 11:201-209.

Kelly, M., Opiyo, S., Phuntumart, V., Michaels, H. J., unpublished data

Kostyla A. S., Clydesdale F. M., McDaniel M. R., 1978 The psychophysical relationships between color and flavor. Food Sci. Nutr. 10: 303–321

Lai, X., Guo, C., & Xiao, Z. (2014). Trait-mediated seed predation, dispersal and survival among frugivore-dispersed plants in a fragmented subtropical forest, Southwest China. Integrative Zoology,9(3), 246-254. doi:10.1111/1749-4877.12046

Lauvergeat, V., Lacomme, C., Lacombe, E., Lasserre, E., Roby, D., & Grima-Pettenati, J. (2001). Two cinnamoyl-CoA reductase (CCR) genes from Arabidopsis thaliana are differentially expressed during development and in response to infection with pathogenic bacteria. Phytochemistry,57(7), 1187-1195. doi:10.1016/s0031-9422(01)00053-x

Lees, G. L. (1992). Condensed Tannins in Some Forage Legumes: Their Role in the Prevention of Pasture Bloat. Plant Polyphenols, 915-934. doi:10.1007/978-1-4615-3476-1_55

Li, B., & Dewey, C. N. (2011). RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12(1). doi:10.1186/1471-2105-12-323

Logemann, E., & Hahlbrock, K. (2002). Crosstalk among stress responses in plants: Pathogen defense overrides UV protection through an inversely regulated ACE/ACE type of light-responsive gene promoter unit. Proceedings of the National Academy of Sciences, 99(4), 2428-2432. doi:10.1073/pnas.042692199

Lupins. (2018). Retrieved from https://www.agric.wa.gov.au/crops/grains/lupins Government of Western Australia Department of Primary Industries and Regional Development

Lupins: About Lupins. (2019). Retrieved from http://www.lupins.org/lupins/ LUPINS.org Information resource portal for Lupins

Mammadov, J., Buyyarapu, R., Guttikonda, S. K., Parliament, K., Abdurakhmonov, I. Y., & Kumpatla, S. P. (2018). Wild Relatives of Maize, Rice, Cotton, and Soybean: Treasure Troves for Tolerance to Biotic and Abiotic Stresses. Frontiers in Plant Science, 9. doi:10.3389/fpls.2018.00886

Maurino, V. G., Grube, E., Zielinski, J., Schild, A., Fischer, K., & Flügge, U. (2006). Identification and Expression Analysis of Twelve Members of the Nucleobase–Ascorbate Transporter (NAT) Gene Family in Arabidopsis thaliana. Plant and Cell Physiology, 47(10), 1381-1393. doi:10.1093/pcp/pcl011

McClintock, B. (1950). The origin and behavior of mutable loci in maize. Proceedings of the National Academy of Sciences, 36(6), 344–355. doi: 10.1073/pnas.36.6.344

Mhlongo, M. I., Piater, L. A., Madala, N. E., Labuschagne, N., and Dubery, I. A. (2018). The chemistry of plant–microbe interactions in the rhizosphere and the potential for metabolomics to reveal signaling related to defense priming and induced systemic resistance. Front. Plant Sci. 9:112. doi: 10.3389/fpls.2018.00112

Michaels, H.J., C. A. Cartwright, and Wakeley-Tomlinson, E.F. 2019. Relationships Among Population Size, Environmental Factors, and Reproduction in Lupinus perennis (Fabaceae). Am. Midl. Nat. (2019) 182:160–180

Mierziak, J., Kostyn, K., & Kulma, A. (2014). Flavonoids as Important Molecules of Plant Interactions with the Environment. Molecules,19(10), 16240-16265. doi:10.3390/molecules191016240

Miller, J. C., Chezem, W. R., & Clay, N. K. (2016). Ternary WD40 Repeat-Containing Protein Complexes: Evolution, Composition and Roles in Plant Immunity. Frontiers in Plant Science, 6. doi: 10.3389/fpls.2015.01108

Mirali, M., Purves, R. W., Stonehouse, R., Song, R., Bett, K., & Vandenberg, A. (2016). Genetics and Biochemistry of Zero-Tannin Lentils. Plos One, 11(10). doi:10.1371/journal.pone.0164624

Morris, P.F., Ward, E.W.B. (1992). Chemoattraction of zoospores of the soybean pathogen Phytophthora sojae by isoflavones. Physiol Mol Plant Pathol. 40:17–22.

NCBI. (2017). RefSeq non-redundant proteins. Retrieved from https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/

NCBI. (2019). BLAST: Basic Local Alignment Search Tool. Retrieved from https://blast.ncbi.nlm.nih.gov/Blast.cgi

NCBI. (2019). Nucleotide. Retrieved from https://www.ncbi.nlm.nih.gov/nucleotide/

Ndakidemi, P.A. and F. D. Dakora. (2003). Legume seed flavonoids and nitrogenous metabolites as signals and protectants in early seedling development. Functional Plant Biology 30: 729-745

Nelson, M. N., Moolhuijzen, P. M., Boersma, J. G., Chudy, M., Lesniewska, K., Bellgard, M., … Ellwood, S. R. (2010). Aligning a new reference genetic map of Lupinus angustifolius with the genome sequence of the model legume, Lotus japonicus. DNA research : an international journal for rapid publication of reports on genes and genomes, 17(2), 73–83. doi:10.1093/dnares/dsq001

O’Rourke, J. A., Yang, S. S., Miller, S. S., Bucciarelli, B., Liu, J., Rydeen, A., … Vance, C. P. (2012). An RNA-Seq Transcriptome Analysis of Orthophosphate-Deficient White Lupin Reveals Novel Insights into Phosphorus Acclimation in Plants. Plant Physiology, 161(2), 705–724. doi: 10.1104/pp.112.209254

Oshlack. (2019, July 06). Oshlack/Corset. Retrieved from https://github.com/Oshlack/Corset

Park, K., Choi, J., Hoshino, A., Morita, Y., & Iida, S. (2004). An intragenic tandem duplication in a transcriptional regulatory gene for anthocyanin biosynthesis confers pale-colored flowers and seeds with fine spots in Ipomoea tricolor. The Plant Journal, 38(5), 840-849. doi:10.1111/j.1365-313x.2004.02098.x

Park, K., Ishikawa, N., Morita, Y., Choi, J., Hoshino, A., & Iida, S. (2007). A bHLH regulatory gene in the common morning glory, Ipomoea purpurea, controls anthocyanin biosynthesis in flowers, proanthocyanidin and phytomelanin pigmentation in seeds, and seed trichome formation. The Plant Journal,49(4), 641-654. doi:10.1111/j.1365-313x.2006.02988.x

Penfield S. (2017). Seed biology - from lab to field. Journal of experimental botany, 68(4), 761–763. doi:10.1093/jxb/erx021

Penmetsa, R. V., Carrasquilla‐Garcia, N., Bergmann, E. M., Vance, L., Castro, B., Kassa, M. T., . . . Cook, D. R. (2016). Multiple post‐domestication origins of kabuli chickpea through allelic variation in a diversification‐associated transcription factor. New Phytologist, 211(4), 1440-1451. doi:10.1111/nph.14010

Perrin, R. M., Wang, Y., Yuen, C. Y., Will, J., & Masson, P. H. (2007). WVD2 is a novel microtubule-associated protein in Arabidopsis thaliana. The Plant Journal, 49(6), 961-971. doi:10.1111/j.1365-313x.2006.03015.x

Plant Database. (2019). Retrieved from https://www.wildflower.org/plants/result.php?id_plant=lupe3 Lady Bird Johnson Wildflower center

Porter, S. S. (2013). Adaptive divergence in seed color camouflage in contrasting soil environments. New Phytologist,197(4), 1311-1320. doi:10.1111/nph.12110

Ramsay, N.A. and Glover, B.J. (2005). MYB-bHLH-WD40 protein complex and the evolution of cellular diversity. Trends Plant Science, 10, 63-70. doi:10.1016/j.tplants.2004.12.011

Reddy, P.P., Rendo-Anaya, M., de los Dolores Soto del Rio, M., and Khandual, S. (2007). Flavonoids as Signaling molecules and regulators of root nodule development. Dynamic Soil, Dynamic Plant, 1(2), 83-94

Roach, D.A. and Wulff, R.A. (1987). Maternal effects in plants. Annual Review of Ecological Systematics, 18: 209-235

Salvaudon, L., Giraud, T., & Shykoff, J. A. (2008). Genetic diversity in natural populations: A fundamental component of plant–microbe interactions. Current Opinion in Plant Biology,11(2), 135-143. doi:10.1016/j.pbi.2008.02.002

Schemske, D. W. & Bradshaw, H. D. Jr Pollinator preference and the evolution of floral traits in monkeyflowers (Mimulus). Proc. Natl Acad. Sci. USA 96, 11910–11915 (1999)

Schenke, D., Böttcher, C., & Scheel, D. (2011). Crosstalk between abiotic ultraviolet-B stress and biotic (flg22) stress signalling in Arabidopsis prevents flavonol accumulation in favor of pathogen defence compound production. Plant, Cell & Environment, 34(11), 1849-1864. doi:10.1111/j.1365-3040.2011.02381.x

Services, E. W., & Servces, E. E. (2019). SWISS-PROT. Retrieved from https://www.iop.vast.ac.vn/theor/conferences/smp/1st/kaminuma/SWISSPROT/index.html

Shimola, J. 2013. "Impacts of a seed predator on sundial lupine" master's thesis. Bowling Green State University. 2013.

Simonne A. H., Weaver D. B., Wei C., 2001 Immature soybean seeds as a vegetable or snack food: acceptability by American consumers. Innov. Food Sci. Emerg. Technol. 1: 289–296. doi:10.1016/S1466-8564(00)00021-7

Sobral, M., Veiga, T., Domínguez, P., Guitián, J. A., Guitián, P., & Guitián, J. M. (2015). Selective Pressures Explain Differences in Flower Color among Gentiana lutea Populations. PloS one, 10(7), e0132522. doi:10.1371/journal.pone.0132522

Smýkal, P., Nelson, M., Berger, J., & Wettberg, E. V. (2018). The Impact of Genetic Changes during Crop Domestication. Agronomy,8(7), 119. doi:10.3390/agronomy8070119

Springob, K., Nakajima, J., Yamazaki, M., & Saito, K. (2003). Recent Advances in the Biosynthesis and Accumulation of Anthocyanins. ChemInform,34(38). doi:10.1002/chin.200338265

Stoilova, T., Pereira, G., de Sousa, M. T., (2005) Diversity in Common Bean Landraces (Phaseolus vulgaris). Journal of Central European Agriculture. 6:4

Stratmann, J. (2003). Ultraviolet-B radiation co-opts defense signaling pathways. Trends in Plant Science,8(11), 526-533. doi:10.1016/j.tplants.2003.09.011

Subramanian, S., Graham, M. Y., Yu, O., & Graham, T. L. (2005). RNA interference of soybean isoflavone synthase genes leads to silencing in tissues distal to the transformation site and to enhanced susceptibility to Phytophthora sojae. Plant physiology, 137(4), 1345–1353. doi:10.1104/pp.104.057257

Tanaka, Y., Sasaki, N., & Ohmiya, A. (2008). Biosynthesis of plant pigments: Anthocyanins, betalains and carotenoids. The Plant Journal, 54(4), 733-749. doi:10.1111/j.1365-313x.2008.03447.x

ThermoFisher. (2009). 260/280 and 260/230 Ratios. Retrieved from https://www.nhm.ac.uk/content/dam/nhmwww/our-science/dpts-facilities-staff/Coreresearchlabs/nanodrop.pdf

ThermoFisher. (2016). TRIzol Reagent User Guide. Retrieved from http://tools.thermofisher.com/content/sfs/manuals/trizol_reagent.pdf

Todd, J.J. and Vodkin, L.O. 1996. Duplications that suppress and deletions that restore expression from a CHS multigene family. Plant Cell, 8: 687–699

Tsuda, T., Watanabe, M., Ohshima, K., Norinobu, S., Choi, S., Kawakishi, S., & Osawa, T. (1994). Antioxidative Activity of the Anthocyanin Pigments Cyanidin 3-O-.beta.-D-Glucoside and Cyanidin. Journal of Agricultural and Food Chemistry, 42(11), 2407-2410. doi:10.1021/jf00047a009

Tuteja, J.H., Zabala, G., Varala, K., Hudson, M., and Vodkin, L.O. 2009. Endogenous, tissue-specific short interfering RNAs silence the chalcone synthase gene family in Glycine max seed coats. Plant Cell, 21: 3063–3077

UniProt Consortium, European Bioinformatics Institute, Protein Information Resource, SIB Swiss Institute of Bioinformatics. (2019, July 31). Low-temperature-induced cysteine proteinase-like. Retrieved from https://www.uniprot.org/uniprot/A0A2G2Y9H4

UniProt Consortium, European Bioinformatics Institute, Protein Information Resource, SIB Swiss Institute of Bioinformatics. (2019, July 31). Nucleobase-ascorbate transporter 6. Retrieved from https://www.uniprot.org/uniprot/Q27GI3

UniProt Consortium, European Bioinformatics Institute, Protein Information Resource, SIB Swiss Institute of Bioinformatics. (2019, July 31). Protein trichome birefringence-like 19. Retrieved from https://www.uniprot.org/uniprot/Q9LFT0-1

UniProt Consortium, European Bioinformatics Institute, Protein Information Resource, SIB Swiss Institute of Bioinformatics. (2019, July 31). Protein WVD2-like 7. Retrieved from https://www.uniprot.org/uniprot/Q67Y69

Wang, L., Ruan, C., Liu, L., Du, W., & Bao, A. (2018). Comparative RNA-Seq Analysis of High- and Low-Oil Yellow Horn During Embryonic Development. International Journal of Molecular Sciences, 19(10), 3071. doi:10.3390/ijms19103071

Wilhelmsson, P. K., Chandler, J. O., Fernandez-Pozo, N., Graeber, K., Ullrich, K. K., Arshad, W., . . . Rensing, S. A. (2019). Usability of reference-free transcriptome assemblies for detection of differential expression: A case study on Aethionema arabicum dimorphic seeds. BMC Genomics, 20(1). doi:10.1186/s12864-019-5452-4

von Wettberg, E., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu, L. B., Moenga, S. M., . . . Cook, D. R. (2018). Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation. Nature Communications, 9(1), 649. doi:10.1038/s41467-018-02867-z

Williams W (1979) Studies on the development of lupins for oil and protein. Euphytica 28:481- 488

Winkel-Shirley, B. (2001). Flavonoid biosynthesis. A colorful model for genetics, biochemistry, cell biology, and biotechnology. Plant Physiology 126:485–493

Xu, B., & Chang, S. K. (2009). Total Phenolic, Phenolic Acid, Anthocyanin, Flavan-3-ol, and Flavonol Profiles and Antioxidant Properties of Pinto and Black Beans (Phaseolus vulgaris L.) as Affected by Thermal Processing. Journal of Agricultural and Food Chemistry, 57(11), 4754-4764. doi:10.1021/jf900695s

Yang, K., Jeong, N., Moon, J., Lee, Y., Lee, S., Kim, H. M., . . . Jeong, S. (2010). Genetic Analysis of Genes Controlling Natural Variation of Seed Coat and Flower Colors in Soybean. Journal of Heredity, 101(6), 757-768. doi:10.1093/jhered/esq078

Yang, H., Tao, Y., Zheng, Z., Zhang, Q., Zhou, G., Sweetingham, M. W., . . . Li, C. (2013). Draft Genome Sequence, and a Sequence-Defined Genetic Linkage Map of the Legume Crop Species Lupinus angustifolius L. PLoS ONE, 8(5). doi:10.1371/journal.pone.0064799

Yang, S., Miao, L., He, J., Zhang, K., Li, Y., & Gai, J. (2019). Dynamic Transcriptome Changes Related to Oil Accumulation in Developing Soybean Seeds. International Journal of Molecular Sciences, 20(9), 2202. doi:10.3390/ijms20092202

Zabala, G., & Vodkin, L. O. (2005). The wp mutation of Glycine max carries a gene-fragment-rich transposon of the CACTA superfamily. The Plant Cell, 17(10), 2619–2632. doi:10.1105/tpc.105.033506

Zabala, G. and Vodkin, L.O. (2014). Methylation affects transposition and splicing of a CACTA transposon from a MYB transcription factor regulating anthocyanin synthase genes in soybean seed coats. PLOS One, 9: 1-20

Zahran, H. H. (1999). Rhizobium-Legume Symbiosis and Nitrogen Fixation under Severe Conditions and in an Arid Climate. Microbiology and Molecular Biology Reviews, 63(4), 968–989.

Zhang, F., & Smith, D. L. (1997). Application of genistein to inocula and soil to overcome low spring soil temperature inhibition of soybean nodulation and nitrogen fixation. Plant and Soil, 192: 141. https://doi.org/10.1023/A:1004284727885

Zhang, C., & Zhang, F. (2015). The Multifunctions of WD40 Proteins in Genome Integrity and Cell Cycle Progression. Journal of Genomics, 3, 40-50. doi:10.7150/jgen.11015

APPENDIX A. SUPPLEMENTARY TABLES

Table A1. Annotation of the 58 genes of the differential expression analysis. The annotation is based on the Lupinus angustifolius genome using NCBI BLAST. The organization of the chart parallels the heatmap. The top gene in the table is the first gene in the heatmap. The six genes identified from Approach 1 (with SD20) are boldfaced. The two described in the discussion from Approach 2 (without SD 20) are italicized. For Fold Change the infinite value represents transcripts where the abundance was outside the sensitivity range for detection. For -Infinite the values were below the detection level, while Infinite values were above the detection limit.

Cluster ID | Relative expression for SD | Fold Change based on speckled vs SW phenotype | Predicted from Lupinus angustifolius reference genome
--- | --- | --- | ---
Cluster-54625.3 | down | Infinite | PREDICTED: Lupinus angustifolius phosphatidate phosphatase PAH1-like (LOC109360901), transcript variant X1, mRNA
Cluster-39697.21780 | down | -Infinite | PREDICTED: Lupinus angustifolius conglutin beta 4 (LOC109339932), mRNA
Cluster-83447.1 | down | -10.475 | PREDICTED: Lupinus angustifolius protein FAR1-RELATED SEQUENCE 3-like (LOC109352034), transcript variant X2, mRNA
Cluster-74161.3 | down | -Infinite | PREDICTED: Lupinus angustifolius uncharacterized LOC109340078 (LOC109340078), mRNA
Cluster-77687.0 | down | -Infinite | PREDICTED: Lupinus angustifolius putative methyltransferase NSUN6 (LOC109341618), mRNA
Cluster-39697.21224 | down | -Infinite | PREDICTED: Lupinus angustifolius DNA-directed RNA polymerase II subunit RPB1-like (LOC109325187), mRNA
Cluster-53920.0 | down | -6.2457 | PREDICTED: Lupinus angustifolius uncharacterized LOC109326678 (LOC109326678), mRNA
Cluster-39697.7745 | down | -Infinite | PREDICTED: Lupinus angustifolius protein trichome birefringence-like 41 (LOC109348085), mRNA
Cluster-39697.17666 | down | -7.5631 | Uncharacterized sequence match to the L. angustifolius genome; maps to Linkage Group 05
Cluster-39697.33789 | down | -6.7779 | PREDICTED: Lupinus angustifolius tumor necrosis factor ligand superfamily member 6 (LOC109347854), mRNA
Cluster-58249.0 | down | -5.9705 | PREDICTED: Lupinus angustifolius uncharacterized LOC109325004 (LOC109325004), transcript variant X1, mRNA
Cluster-39697.6515 | down | -Infinite | PREDICTED: Lupinus angustifolius aconitate hydratase, cytoplasmic-like (LOC109358767), mRNA
Cluster-39697.22337 | down | -Infinite | PREDICTED: Lupinus angustifolius low-temperature-induced cysteine proteinase-like (LOC109350348), mRNA
Cluster-39697.21081 | down | -Infinite | Uncharacterized sequence match to the L. angustifolius genome; maps to Linkage Group 05
Cluster-35369.4 | down | -Infinite | PREDICTED: Lupinus angustifolius uncharacterized LOC109326484 (LOC109326484), mRNA
Cluster-39697.15107 | down | -Infinite | PREDICTED: Lupinus angustifolius nucleobase-ascorbate transporter 6 (LOC109338815), transcript variant X2, mRNA
Cluster-39697.4773 | down | -Infinite | PREDICTED: Lupinus angustifolius DNA-damage-repair/toleration protein DRT102 (LOC109345774), mRNA
Cluster-95772.0 | down | -6.5018 | PREDICTED: Lupinus angustifolius NADH dehydrogenase [ubiquinone] 1 alpha subcomplex subunit 2-like (LOC109351835), transcript variant X2, mRNA
Cluster-39697.34136 | down | -5.3254 | PREDICTED: Lupinus angustifolius DNA-directed RNA polymerase II subunit RPB1-like (LOC109325187), mRNA
Cluster-39697.12053 | down | -Infinite | PREDICTED: Lupinus angustifolius protease Do-like 9 (LOC109357341), mRNA
Cluster-39697.22641 | down | -Infinite | PREDICTED: Lupinus angustifolius bZIP transcription factor 11-like (LOC109355471), mRNA
Cluster-39697.38141 | down | -Infinite | PREDICTED: Lupinus angustifolius mitotic checkpoint serine/threonine-protein kinase BUB1 (LOC109355173), mRNA
Cluster-73545.5 | up | -Infinite | PREDICTED: Lupinus angustifolius increased DNA methylation 3 (LOC109345693), mRNA
Cluster-39697.22360 | up | Infinite | PREDICTED: Lupinus angustifolius paladin-like (LOC109331983), mRNA
Cluster-39697.27835 | up | 5.9271 | PREDICTED: Lupinus angustifolius cyclin-dependent kinase G-2-like (LOC109362371), mRNA
Cluster-26754.3 | up | Infinite | PREDICTED: Lupinus angustifolius protein FAR1-RELATED SEQUENCE 7-like (LOC109345299), transcript variant X1, mRNA
Cluster-39697.18347 | up | Infinite | PREDICTED: Lupinus angustifolius uncharacterized LOC109331688 (LOC109331688), mRNA
Cluster-39697.18349 | up | 4.3613 | PREDICTED: Lupinus angustifolius basic 7S globulin 2-like (LOC109335039), mRNA
Cluster-91171.0 | up | 6.9894 | Uncharacterized sequence match to the L. angustifolius genome; maps to Linkage Group 16
Cluster-39697.18636 | up | 10.033 | Uncharacterized sequence match to the L. angustifolius genome; maps to Linkage Group 05
Cluster-39697.21663 | up | Infinite | PREDICTED: Lupinus angustifolius 3-hydroxybutyryl-CoA dehydrogenase (LOC109351921), mRNA
Cluster-39697.33195 | up | Infinite | PREDICTED: Lupinus angustifolius uncharacterized LOC109347852 (LOC109347852), mRNA
Cluster-58746.11 | up | Infinite | PREDICTED: Lupinus angustifolius U1 small nuclear ribonucleoprotein C-like (LOC109346600), transcript variant X1, mRNA
Cluster-39697.34135 | up | 6.7665 | PREDICTED: Lupinus angustifolius DNA-directed RNA polymerase II subunit RPB1-like (LOC109325187), mRNA
Cluster-39697.18751 | up | 7.1523 | PREDICTED: Lupinus angustifolius uncharacterized LOC109332213 (LOC109332213), mRNA
Cluster-44464.0 | up | 5.4477 | PREDICTED: Lupinus angustifolius protein trichome birefringence-like 38 (LOC109341327), mRNA
Cluster-64077.0 | up | 4.0909 | PREDICTED: Lupinus angustifolius uncharacterized LOC109342976 (LOC109342976), ncRNA
Cluster-39697.10407 | up | 3.5023 | PREDICTED: Lupinus angustifolius mannan endo-1,4-beta-mannosidase 1-like (LOC109350382), mRNA
Cluster-39697.21639 | up | 3.1665 | BLAST did not return any matches for this sequence
Cluster-39697.16183 | up | 3.8051 | Uncharacterized sequence match to the L. angustifolius genome; maps to Linkage Group 05
Cluster-39697.26329 | up | 3.4237 | PREDICTED: Lupinus angustifolius ent-kaurenoic acid oxidase 2-like (LOC109341805), mRNA
Cluster-39697.31659 | up | 3.6822 | BLAST did not return any matches for this sequence
Cluster-39697.25331 | up | Infinite | PREDICTED: Lupinus angustifolius protein WVD2-like 7 (LOC109349466), transcript variant X6, mRNA
Cluster-39697.28378 | up | 7.3987 | PREDICTED: Lupinus angustifolius pentatricopeptide repeat-containing protein At4g35850, mitochondrial (LOC109347428), mRNA
Cluster-39697.21529 | up | 8.6905 | PREDICTED: Lupinus angustifolius receptor-like protein 51 (LOC109345223), mRNA
Cluster-39697.32721 | up | Infinite | PREDICTED: Lupinus angustifolius CBL-interacting serine/threonine-protein kinase 5-like (LOC109347745), mRNA
Cluster-65044.1 | up | Infinite | PREDICTED: Lupinus angustifolius ATP-dependent DNA helicase DDM1-like (LOC109331408), transcript variant X1, mRNA
Cluster-39697.17855 | up | Infinite | PREDICTED: Lupinus angustifolius cinnamoyl-CoA reductase-like SNL6 (LOC109349088), mRNA
Cluster-39697.25346 | up | Infinite | PREDICTED: Lupinus angustifolius receptor-like protein kinase FERONIA (LOC109352683), mRNA
Cluster-39697.27794 | up | Infinite | PREDICTED: Lupinus angustifolius SUPPRESSOR OF GAMMA RESPONSE 1 (LOC109337419), mRNA
Cluster-39697.3051 | up | Infinite | PREDICTED: Lupinus angustifolius uncharacterized LOC109347743 (LOC109347743), transcript variant X1, mRNA
Cluster-39697.16503 | up | 3.9792 | PREDICTED: Lupinus angustifolius dihydrolipoyllysine-residue acetyltransferase component 2 of pyruvate dehydrogenase complex, mitochondrial-like (LOC109334152), transcript variant X3, mRNA
Cluster-78730.0 | up | Infinite | PREDICTED: Lupinus angustifolius ABC transporter B family member 6-like (LOC109335893), transcript variant X1, mRNA
Cluster-94239.4 | up | 6.1323 | PREDICTED: Lupinus angustifolius uncharacterized LOC109360548 (LOC109360548), mRNA
Cluster-39697.18510 | up | 5.7273 | PREDICTED: Lupinus angustifolius putative glucose-6-phosphate 1-epimerase (LOC109348602), transcript variant X1, mRNA
Cluster-39697.41725 | up | 6.6713 | PREDICTED: Lupinus angustifolius probable plastid-lipid-associated protein 4, chloroplastic (LOC109345698), transcript variant X1, mRNA
Cluster-39697.11726 | up | 5.3633 | PREDICTED: Lupinus angustifolius DDB1- and CUL4-associated factor 13 (LOC109347849), mRNA
Cluster-39697.19537 | up | 6.4527 | PREDICTED: Lupinus angustifolius F-box protein At2g35280-like (LOC109343831), mRNA

Table A2. Record of all L. perennis families collected from Wintergarden with GPS location. Each family consisted of one plant, from which two replicates were collected for each tissue type. Seed phenotype was scored as speckled or white for each plant. Each family was named RW followed by a number. The six families chosen for RNA-seq were renamed: SW for the white phenotype and SD for the speckled phenotype. The renamed families are shown in parentheses. Some of the families were lost over the season due to field conditions such as foraging deer and overgrowth of raspberry bushes.

Family | Location (Latitude, Longitude) | Seed Phenotype | Pre-pigment | Post-pigment
--- | --- | --- | --- | ---
RW 1 | 41.3627374917, -83.6680690 | Speckled | collected | collected
RW 2 (SD 2) | 41.362733887, -83.66785135 | Speckled | collected | Not collected (deer ate the plant)
RW 3 (SW 3) | 41.362953200, -82.66814865 | White | collected | Not collected (deer ate the plant)
RW 4 (SW 4) | 41.363808447, -83.6680244 | White | collected | collected
RW 5 | 41.363765364, -83.66843741 | White | collected | collected
RW 6 | 41.363865430, -83.66851536 | Speckled | collected | Not collected (missed timing and seeds became mature)
RW 7 (SD 7) | 41.363735902, -83.6685649 | Speckled | collected | collected
RW 8 | 41.363828186, -83.6684888 | White | collected | collected
RW 9 | 41.363861379, -83.66858661 | Speckled | collected | collected
RW 10 (SW 10) | 41.363830491, -83.66879817 | White | collected | collected
RW 11 | 41.364069459, -83.66832618 | Speckled | collected | collected
RW 12 | 41.364032495, -83.66847882 | White | collected | collected
RW 13 | 41.364055965, -83.66815779 | White | collected | collected
RW 14 | 41.364056468, -83.66795612 | White | collected | Not collected (plant lost to weeds)
RW 15 | 41.363887572, -83.6684465 | White | collected | collected
RW 16 | 41.3642712543172, -83.6677473318611 | White | collected | collected
RW 17 | 41.3639601180715, -83.66771279842 | White | collected | collected
RW 18 | 41.362733887, -83.66785135 | White | collected | collected
RW 19 | 41.363765364, -83.66843741 | White | collected | collected
RW 20 (SD 20) | 41.363808446, -83.6680245 | Speckled | collected | collected

Table A3. Read counts, fold change, and p-values for the 58 genes identified from differential gene expression analysis. The six genes from Approach 1 are in the top section of the table. The remainder are from Approach 2. Read counts are from each approach used. Fold change is calculated with the speckled phenotype (SD or SD1) as the first factor (speckled/white) and was adjusted by taking the log base 2 of the fold change. The -log10(padj) column is the negative log base 10 of the adjusted P-value.

gene_id | SD_readcount | SW_readcount | Log2FoldChange | P-value | Adjusted P-value | -log10(padj)
--- | --- | --- | --- | --- | --- | ---
Cluster-39697.15107 | 0 | 765.5899 | -Infinite | 4.64E-08 | 0.0038 | 2.42
Cluster-39697.17855 | 801.7487 | 0 | Infinite | 4.31E-07 | 0.0177 | 1.75
Cluster-39697.21529 | 1263.36 | 4.003876 | 8.3017 | 2.35E-07 | 0.0129 | 1.89
Cluster-39697.22337 | 0 | 2585.325 | -Infinite | 5.54E-07 | 0.0182 | 1.74
Cluster-39697.25331 | 442.6406 | 0 | Infinite | 1.01E-06 | 0.0275 | 1.56
Cluster-39697.7745 | 10.82381 | 2998.468 | -8.1139 | 4.25E-08 | 0.0038 | 2.42

gene_id | SD1_readcount | SW_readcount | Log2FoldChange | P-value | Adjusted P-value | -log10(padj)
--- | --- | --- | --- | --- | --- | ---
Cluster-26754.3 | 146.9595 | 0 | Infinite | 1.24E-06 | 0.0056 | 2.25
Cluster-35369.4 | 0 | 206.7277 | -Infinite | 4.69E-06 | 0.0163 | 1.79
Cluster-39697.10407 | 1229.475 | 108.5016 | 3.5023 | 1.25E-05 | 0.0372 | 1.43
Cluster-39697.11726 | 670.9544 | 16.30016 | 5.3633 | 1.13E-07 | 0.00066 | 3.18
Cluster-39697.12053 | 0 | 312.3288 | -Infinite | 5.67E-09 | 4.40E-05 | 4.36
Cluster-39697.15107 | 0 | 867.3002 | -Infinite | 3.38E-14 | 9.19E-10 | 9.04
Cluster-39697.16183 | 2215.684 | 158.5143 | 3.8051 | 8.09E-07 | 0.004 | 2.4
Cluster-39697.16503 | 2291.584 | 145.3032 | 3.9792 | 3.49E-07 | 0.00184 | 2.74
Cluster-39697.17666 | 2.659286 | 291.8152 | -6.7779 | 1.86E-06 | 0.00757 | 2.12
Cluster-39697.17855 | 1310.89 | 0 | Infinite | 4.84E-21 | 7.90E-16 | 15.1
Cluster-39697.18347 | 115.8508 | 0 | Infinite | 1.02E-05 | 0.0326 | 1.49
Cluster-39697.18349 | 692.144 | 33.67477 | 4.3613 | 6.21E-06 | 0.0211 | 1.68
Cluster-39697.18510 | 470.315 | 8.877895 | 5.7273 | 2.52E-07 | 0.00137 | 2.86
Cluster-39697.18636 | 320.7694 | 0.306134 | 10.033 | 6.85E-10 | 6.98E-06 | 5.16
Cluster-39697.18751 | 642.5592 | 4.517047 | 7.1523 | 4.63E-11 | 6.29E-07 | 6.2
Cluster-39697.19537 | 223.9196 | 2.556443 | 6.4527 | 4.48E-06 | 0.0162 | 1.79
Cluster-39697.21081 | 0 | 481.6873 | -Infinite | 4.36E-11 | 6.29E-07 | 6.2
Cluster-39697.21224 | 5.619191 | 426.3929 | -6.2457 | 9.50E-07 | 0.00443 | 2.35
Cluster-39697.21529 | 1897.148 | 4.592015 | 8.6905 | 7.55E-18 | 6.15E-13 | 12.2
Cluster-39697.21639 | 5435.479 | 605.3872 | 3.1665 | 8.87E-06 | 0.0295 | 1.53
Cluster-39697.21663 | 771.154 | 0 | Infinite | 6.94E-17 | 3.77E-12 | 11.4
Cluster-39697.21780 | 2.42529 | 3452.666 | -10.475 | 4.39E-10 | 5.12E-06 | 5.29
Cluster-39697.22337 | 0 | 2900.326 | -Infinite | 1.56E-05 | 0.0446 | 1.35
Cluster-39697.22360 | 222.2117 | 3.652061 | 5.9271 | 1.31E-05 | 0.0382 | 1.42
Cluster-39697.22641 | 0 | 924.3735 | -Infinite | 1.83E-14 | 5.95E-10 | 9.23
Cluster-39697.25331 | 583.5731 | 0 | Infinite | 7.51E-15 | 3.06E-10 | 9.51
Cluster-39697.25346 | 372.7038 | 0 | Infinite | 1.64E-08 | 0.000116 | 3.93
Cluster-39697.26329 | 8559.734 | 797.6813 | 3.4237 | 2.61E-06 | 0.00989 | 2
Cluster-39697.27794 | 262.2773 | 0 | Infinite | 1.82E-09 | 1.65E-05 | 4.78
Cluster-39697.27835 | 225.3429 | 0 | Infinite | 1.60E-05 | 0.045 | 1.35
Cluster-39697.28378 | 616.2712 | 3.652061 | 7.3987 | 4.44E-11 | 6.29E-07 | 6.2
Cluster-39697.3051 | 161.1335 | 0 | Infinite | 6.03E-07 | 0.00307 | 2.51
Cluster-39697.31659 | 1960.123 | 152.6921 | 3.6822 | 1.80E-06 | 0.00754 | 2.12
Cluster-39697.32721 | 132.3802 | 0 | Infinite | 3.67E-06 | 0.0136 | 1.87
Cluster-39697.33195 | 429.9301 | 0 | Infinite | 1.27E-12 | 2.59E-08 | 7.59
Cluster-39697.33789 | 4.365522 | 273.7363 | -5.9705 | 1.24E-05 | 0.0372 | 1.43
Cluster-39697.34135 | 366.6369 | 3.367477 | 6.7665 | 6.49E-08 | 0.000423 | 3.37
Cluster-39697.34136 | 0 | 233.6878 | -Infinite | 1.31E-07 | 0.000737 | 3.13
Cluster-39697.38141 | 0 | 511.5531 | -Infinite | 2.36E-11 | 4.28E-07 | 6.37
Cluster-39697.41725 | 558.333 | 5.478091 | 6.6713 | 1.01E-09 | 9.70E-06 | 5.01
Cluster-39697.4773 | 17.96257 | 1627.862 | -6.5018 | 3.52E-10 | 4.42E-06 | 5.35
Cluster-39697.6515 | 0 | 236.8059 | -Infinite | 1.09E-07 | 0.00066 | 3.18
Cluster-39697.7745 | 17.94715 | 3393.911 | -7.5631 | 3.41E-13 | 7.94E-09 | 8.1
Cluster-44464.0 | 396.8228 | 9.092631 | 5.4477 | 1.44E-06 | 0.00635 | 2.2
Cluster-53920.0 | 0 | 144.5958 | -Infinite | 1.08E-05 | 0.034 | 1.47
Cluster-54625.3 | 0 | 298.0787 | -Infinite | 9.01E-09 | 6.68E-05 | 4.18
Cluster-58249.0 | 0 | 278.8985 | -Infinite | 2.11E-08 | 0.000143 | 3.84
Cluster-58746.11 | 116.3424 | 0 | Infinite | 1.17E-05 | 0.0359 | 1.44
Cluster-64077.0 | 678.9989 | 39.84678 | 4.0909 | 9.96E-06 | 0.0325 | 1.49
Cluster-65044.1 | 277.9796 | 0 | Infinite | 4.64E-06 | 0.0163 | 1.79
Cluster-73545.5 | 243.6468 | 0 | Infinite | 4.86E-09 | 3.96E-05 | 4.4
Cluster-74161.3 | 0 | 341.6577 | -Infinite | 1.93E-09 | 1.66E-05 | 4.78
Cluster-77687.0 | 0 | 247.468 | -Infinite | 8.80E-07 | 0.00422 | 2.37
Cluster-78730.0 | 288.7418 | 0 | Infinite | 5.42E-10 | 5.89E-06 | 5.23
Cluster-83447.1 | 0 | 176.5721 | -Infinite | 1.95E-06 | 0.00758 | 2.12
Cluster-91171.0 | 371.222 | 2.921649 | 6.9894 | 1.95E-06 | 0.00758 | 2.12
Cluster-94239.4 | 295.7446 | 4.216033 | 6.1323 | 1.56E-06 | 0.00671 | 2.17
Cluster-95772.0 | 35.07121 | 1406.273 | -5.3254 | 1.08E-07 | 0.00066 | 3.18
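The derived columns of Table A3 can be illustrated with a minimal Python sketch. It assumes the log2 fold change is simply the log base 2 of the speckled/white read count ratio, with a zero count on one side reported as Infinite or -Infinite; the function names are hypothetical and this is not the pipeline used for the analysis.

```python
import math

def log2_fold_change(speckled: float, white: float) -> float:
    """Return log2(speckled / white); a zero count on one side gives +/-inf."""
    if speckled == 0 and white == 0:
        return float("nan")   # no reads in either phenotype: ratio undefined
    if white == 0:
        return math.inf       # reported as "Infinite" in Table A3
    if speckled == 0:
        return -math.inf      # reported as "-Infinite" in Table A3
    return math.log2(speckled / white)

def neg_log10(padj: float) -> float:
    """Return -log10 of an adjusted P-value."""
    return -math.log10(padj)

# Read counts and adjusted P-value for Cluster-39697.21529 (Approach 1).
lfc = log2_fold_change(1263.36, 4.003876)
score = neg_log10(0.0129)
print(round(lfc, 4), round(score, 2))
```

For this row the sketch gives values that agree with the tabulated 8.3017 and 1.89 to the precision shown.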

APPENDIX B. SUPPLEMENTARY INFORMATION

Protocol for Image Analysis of Seed Phenotypes using ImageJ. Take a picture of both the front and the back of the seed with a color card, using a standardized background and distance.

1) Open image in ImageJ.

2) Select the reference color or colors from the color card using the rectangular selection tool and produce an RGB histogram (ctrl + H). Hit the RGB button on the histogram to see the values of the different color channels and ensure that image color is consistent throughout all images. If not, adjust the color channels using the Color Balance tool (Image -> Adjust -> Color Balance).

3) Use the rectangular selection tool to select the seed (when doing both sides of the seed, make sure that the size of the rectangle selected for each side is the same). Right click the rectangular selection and select the “Duplicate” option. This will give you a new image with just the seed in it.

4) Use the “Color Threshold” tool (Image -> Adjust -> Color Threshold) and the circular selection tool + X to remove everything but the seed from the image. Duplicate and save this image for actual phenotypic analysis.

5) Open the seed image created in step 4. Using the “Color Threshold” tool, check the “Dark Background” box, make sure that the top scale of the brightness section reads 1, and press the M key. This will give you the pixel area of the seed (seed size).

6) Uncheck the “Dark Background” box, once again adjust the top scale of the brightness section to read 1, and press the M key. This will give you the pixel area of the speckling (amount of speckling).

7) Using an Excel sheet, input the seed size and amount of speckling values and calculate the percentage of the seed that is speckled.
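The calculation in step 7 can also be sketched in Python instead of a spreadsheet. This is a minimal illustration assuming the two pixel areas measured in steps 5 and 6; the function name and the example pixel values are hypothetical.

```python
def percent_speckled(seed_area_px: float, speckle_area_px: float) -> float:
    """Percentage of the seed area (step 5) covered by speckling (step 6)."""
    if seed_area_px <= 0:
        raise ValueError("seed area must be positive")
    if not 0 <= speckle_area_px <= seed_area_px:
        raise ValueError("speckle area must lie between 0 and the seed area")
    return 100.0 * speckle_area_px / seed_area_px

# Hypothetical measurements for the front and back of one seed.
front = percent_speckled(12000, 3000)
back = percent_speckled(12000, 1800)
# Pooling both sides weights each side by its pixel area.
pooled = percent_speckled(12000 + 12000, 3000 + 1800)
print(front, back, pooled)
```

Pooling the pixel areas of both sides, rather than averaging the two percentages, is one possible choice; the two agree whenever the rectangles selected in step 3 yield equal seed areas.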

RNA 5' Adapter (RA5).

(5')AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

RNA 3' Adapter (RA3).

(5')GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCT

Transcriptome Assembly. Transcriptome assembly was performed on left.fq and right.fq using Trinity (Grabherr et al., 2011) with min_kmer_cov set to 2 and all other parameters left at their defaults. Trinity is a transcript assembly pipeline comprising three programs, Inchworm, Chrysalis, and Butterfly, and is developed by The Broad Institute and Hebrew University of Jerusalem (software available: https://github.com/trinityrnaseq/trinityrnaseq/wiki). Inchworm constructs a k-mer file of all reads, then selects seed k-mers to extend and construct contigs. Chrysalis clusters the Inchworm contigs that have minimal overlap and then constructs de Bruijn graphs. Butterfly reconstructs plausible transcripts from the de Bruijn graphs of Chrysalis and resolves ambiguity.
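As a toy illustration of the k-mer counting and de Bruijn graph ideas described above, the following Python sketch counts k-mers in a few reads, discards k-mers seen fewer than min_kmer_cov times, and links each surviving k-mer's (k-1)-base prefix to its suffix. The reads and function names are hypothetical, and this is not Trinity's implementation, only a sketch of the underlying concept.

```python
from collections import Counter, defaultdict

def count_kmers(reads, k):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def de_bruijn_edges(kmer_counts, min_kmer_cov=2):
    """Build prefix -> suffix edges from k-mers meeting the coverage cutoff."""
    graph = defaultdict(set)
    for kmer, n in kmer_counts.items():
        if n >= min_kmer_cov:            # drop low-coverage (likely erroneous) k-mers
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Hypothetical reads; the last base of the third read mimics a sequencing error.
reads = ["ATGGCGT", "TGGCGTA", "ATGGCGA"]
counts = count_kmers(reads, k=4)
graph = de_bruijn_edges(counts, min_kmer_cov=2)
print(graph)
```

With a cutoff of 2, the k-mers seen only once (CGTA and GCGA) are dropped, leaving a single unbranched path through the graph.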

Transcript length distribution. This step selects the longest transcript from each cluster to represent that cluster. These are the basis for the rest of the following analyses. N50 and N90 are weighted median statistics for clusters related to the assembly. N50 is the minimum length of a transcript needed to cover 50% of the assembly. For example, if 100, 200, 400, and 500 are the transcript lengths, then the total assembly length is 1,200 (half is 600) and N50 is 400, because the two longest transcripts (500 + 400 = 900) are enough to cover half of the assembly.
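The N50 and N90 definitions can be sketched in a few lines of Python. This is a minimal illustration under the definition given above (lengths are accumulated from longest to shortest until the target fraction of the total is reached); the function name is hypothetical.

```python
def nx(lengths, fraction=0.5):
    """Length of the transcript at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches `fraction` of the total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    target = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length

lengths = [100, 200, 400, 500]   # example from the text, total 1,200
n50 = nx(lengths, 0.5)           # 500 + 400 = 900 >= 600
n90 = nx(lengths, 0.9)           # 500 + 400 + 200 = 1100 >= 1080
print(n50, n90)
```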

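Data Quality. Q20 and Q30 measurements were used to determine base calling accuracy and were derived from the Phred quality score (Q), where Q = -10 log10(P) and P is the probability of an incorrect base call (https://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf). A score of Q20 therefore corresponds to an error probability of 1% and Q30 to 0.1%.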
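The relationship between Phred scores and error probability can be sketched as follows. This is a minimal Python illustration of the formula above; the function names and the example quality values are hypothetical.

```python
import math

def phred_to_error(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)

def fraction_at_least(quals, threshold):
    """Fraction of bases with quality >= threshold (e.g. a Q20 or Q30 rate)."""
    return sum(q >= threshold for q in quals) / len(quals)

# Hypothetical per-base quality scores for one short read.
quals = [35, 28, 19, 40]
q20_rate = fraction_at_least(quals, 20)
q30_rate = fraction_at_least(quals, 30)
print(phred_to_error(20), phred_to_error(30), q20_rate, q30_rate)
```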

RNA Isolation. Protocol for RNA isolation using the TRIZOL extraction method.

1) Pre-label 3 sets of 1.5 mL tubes. Clean all utensils with RNase away. Spray gloves with RNase away when touching equipment and periodically during extraction. Leave all utensils in dry ice after cleaning. Keep seeds on dry ice after they are taken out of the -80°C freezer.

2) Heat-shock frozen seeds in a water bath at 65°C for 15 seconds. Use the 1.5 mL tubes the seeds are in to contain them in the water bath.

3) Remove the seed coat by squeezing the embryo and cotyledons out of the coat of all seeds present in the tube (between 5 and 10 seeds). Spray gloves with RNase away before touching seeds. Place the seed coats in a pre-chilled coffee grinder with dry ice. Use 2 to 3 chunks of 4 cm dry ice.

4) Grind the seed coat tissue and dry ice in the coffee grinder until it is a fine powder.

5) Transfer the mixture to a 1.5 mL tube using a pre-chilled spatula. Then leave at -20°C for 1 hour for the dry ice to sublime. Split the powder into the pre-labeled tubes. Transfer with a pre-chilled spatula. This step depends on the amount of tissue and dry ice used: typically aim for 1 g of tissue. In 1.5 mL tubes this will take up 0.5 cm of the bottom of the tube. Dry ice makes it difficult to determine the amount of tissue. Transfer all the powder among the three tubes. If using a small number of seeds (5 or below), split between two tubes.

6) Once all dry ice is gone, add 1 mL of TRIZOL with 4 µL of beta-mercaptoethanol and 20 µL of polyvinylpyrrolidone. Do not pre-mix TRIZOL, beta-mercaptoethanol, and polyvinylpyrrolidone; add them only during extraction.

7) Incubate at room temperature for 5 minutes.

8) Centrifuge at 10,000 x g for 1 minute and transfer the aqueous clear layer (top clear layer) to a new tube. Dispose of the pellet.

9) Add 200 µL of chloroform/isoamyl (24:1). Mix vigorously to form an emulsion.

10) Incubate at room temperature for 5 minutes. Centrifuge for 15 minutes at 10,000 x g.

11) Transfer the top layer to a new tube by pipetting. Do not transfer the lower layer. Stop transferring as soon as liquid from the lower layer starts to be drawn up.

12) Repeat steps 9 to 11.

13) Add 500 µL COLD isopropanol. Mix gently by inversion.

14) Incubate for 2 hours at -20°C.

15) Centrifuge at 10,000 x g for 10 minutes. Remove and discard the supernatant.

16) Add 1 mL COLD 80% ethanol.

17) Centrifuge at 5,000 x g for 5 minutes. Remove and discard the supernatant.

18) Repeat steps 16 and 17 twice. (The 80% ethanol wash should be done three times in total.)

19) Air dry for 10 minutes (do not dry the pellet fully or it will not dissolve in the next step).

20) Rehydrate with 40 µL DEPC-treated water (mix by pipetting).

21) Store in a -20°C freezer.