
Comparative analysis of nodulation-related small secreted peptides across legume species

A DISSERTATION SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF BY

Diana Trujillo

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Nevin Young

November 2017

© Diana Trujillo 2017

Acknowledgments

I would like to thank all those who, one way or another, have made this work possible.

First and foremost, thanks go to my advisor, Nevin Young, who helped shape my vision for this project and was always available to give me support or the necessary guidance in the right direction. I would also like to thank Kevin Silverstein, who played a large role during the development of my LSE pipeline and was a helpful mentor in bioinformatic matters. Thanks to the other members of my advisory committee, Peter Morrell, Michael Sadowsky, and Robert Stupar, who provided a fresh outlook and valuable advice that helped to improve this study.

I would like to thank Joseph Guhlin and Peng Zhou who were always one step away when I had Unix concerns, and Shaun Curtin and Roxanne Denny who guided me through the technical aspects of growing or transforming Medicago.

I am grateful that I had a strong network of friends and colleagues to discuss biology, coding, and life. To Allison Haaning, Beth Fallon, Christina Smith, Leland Werden, Derek Nedveck, and Eli Krumholz: thanks for sharing the journey.

I would like to thank my mother, María Cecilia, who instilled my love of reading, truth-seeking, and . To my husband, Dylan Huss, thank you for always calling to see if I already ate (including as I typed these words).


Table of Contents

Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Chapter 1: Introduction
1.1 Legumes form a unique relationship with nitrogen-fixing rhizobia
1.2 Rhizobial fate is determined by legume hosts
1.3 The role of small-secreted peptides in nodulation remains largely unexplored
1.4 Lineage-specific expansions can lead to phylogeny-restricted traits
1.5 Project overview
Chapter 2: Cross-species examination of legume signaling peptides reveals that nodule-specific PLAT domain proteins are required for nodulation
2.1 Summary
2.2 Keywords
2.3 Introduction
2.4 Materials and Methods
2.4.1 Legume genomic and RNA-seq data
2.4.2 Identification of nodulation-related LSEs in small secreted peptide families
2.4.3 Confirmation of candidate LSEs
2.4.4 Synteny analyses of NPDs and Phaseoleae-specific nodulins
2.4.5 CRISPR/Cas9 mediated knockout of NPD
2.4.6 Nodulation Phenotyping of NPD mutants
2.4.7 RNA-seq differential expression analyses
2.4.8 Microscopy and nodule staining
2.4.9 Accession Numbers
2.5 Results
2.5.1 Overview of identified LSEs
2.5.2 A nodule-specific subset of the PLAT domain family expanded exclusively in the Medicago lineage
2.5.3 Multi-gene NPD knockout lines show differences in nodulation phenotypes
2.5.4 Rhizobia lose N2-fixation capacity when all five NPD genes are knocked out
2.5.5 Transcriptome analysis of nodules from two- and three-gene knockout lines
2.6 Discussion
2.6.1 Types of LSEs found in this study
2.6.2 Phaseoleae-specific nodulins have expanded independently in Glycine and Phaseolus legumes
2.6.3 The NPDs are a previously unexplored family related to nodulation
2.6.4 Utility of LSE studies
2.7 Acknowledgements
2.8 Tables
2.9 Figures
Chapter 3: Genomic characterization of the LEED..PEEDs, a gene family unique to the Medicago lineage
3.1 Summary
3.2 Keywords
3.3 Introduction
3.4 Material and methods
3.4.1 Plant genome and transcriptome sequences
3.4.2 Detection of LPs in plant genomes and transcriptomes
3.4.3 DNA-seq and Pisum sativum RNA-seq analysis
3.4.4 Transcript levels of LPs in M. truncatula
3.4.5 Synteny and collinearity comparisons
3.4.6 Phylogenetic analysis of LP sequences
3.4.7 Accession Numbers
3.5 Results
3.5.1 The LP gene family is specific to the Medicago lineage
3.5.2 Genomic architecture of the LP gene family
3.5.3 Phylogenetic relationship between LP genes
3.5.4 The LEED..PEED family is nodule-specific in M. truncatula
3.6 Discussion
3.7 Acknowledgements
3.8 Tables
3.9 Figures
4 Conclusions
5 Bibliography
6 Appendix


List of Tables

Table 2.1 Overview of genomic and RNA-seq data sources used in the LSE discovery pipeline
Table 2.2 Overview of identified LSEs and magnitude of expansion
Table 2.3 Wild type and NPD knockout line phenotypes
Table 3.1 Summary of analyzed plant genomes and transcriptomes
Table 3.2 Transcript abundance (FPKM) of LP genes in six M. truncatula A17 tissues
Table 6.1 Genomic, transcriptomic, and computational resources generated by this study


List of Figures

Figure 2.1 Multiple sequence alignment of NPDs and PDPs
Figure 2.2 LSE discovery pipeline
Figure 2.3 Expansion of Phaseoleae-specific nodulins
Figure 2.4 Expansion of NPDs in the Medicago lineage
Figure 2.5 NPD gene expression across Medicago accessions and rhizobial strains
Figure 2.6 Comparison of M. truncatula wild type and NPD mutant line phenotypes
Figure 2.7 Correlation between N-fixation traits and plant height
Figure 2.8 Rhizobial reporter gene expression in the five-gene NPD knockout line
Figure 2.9 Rhizobial N-fixation gene expression in NPD knockout lines
Figure 2.10 Differentially expressed genes in M. truncatula wild type and NPD mutant lines
Figure 2.11 Genomic localization of genes belonging to LSE families
Figure 3.1 Multiple sequence alignment of A17 LP peptides
Figure 3.2 Synteny comparisons between LPs 1-13 chromosomal regions in M. truncatula A17 and corresponding regions in G. max and C. arietinum
Figure 3.3 Dot plot analysis of a ~1 kbp region in Chromosome 4 and ~100 kbp region in Chromosome 7 of M. truncatula R108 and A17
Figure 3.4 Evolutionary expansion of the LP gene family
Figure 6.1 CRISPR/Cas9 multiplex genome editing approach
Figure 6.2 LSE analysis of NCR peptides in M. truncatula and T. pratense
Figure 6.3 LSE analysis of GRP1 proteins in M. truncatula and T. pratense
Figure 6.4 LSE analysis of GRP2 proteins in G. max
Figure 6.5 LSE analysis of LEED..PEED peptides in M. truncatula
Figure 6.6 LSE analysis of Aeschynomene NCR-like peptides in A. duranensis
Figure 6.7 LSE analysis of Calmodulin-like proteins in M. truncatula and T. pratense
Figure 6.8 LSE analysis of CAP superfamily proteins in A. duranensis
Figure 6.9 LSE analysis of Bowman Birk peptides in A. duranensis
Figure 6.10 LSE analysis of Antimicrobial MBP-1 peptides in A. duranensis
Figure 6.11 LSE analysis of Leginsulin peptides in M. truncatula
Figure 6.12 LSE analysis of Phaseoleae-specific nodulins in G. max and P. vulgaris
Figure 6.13 LSE analysis of cystatin proteins in A. duranensis
Figure 6.14 LSE analysis of NPD proteins in M. truncatula and T. pratense
Figure 6.15 Knockout mechanism of six NPD knockout lines
Figure 6.16 Aligned LP DNA sequences from A17, HM056 and R108
Figure 6.17 Aligned LP DNA sequences from A17, HM056, R108 and M. sativa
Figure 6.18 Multiple sequence alignment of LPs from M. truncatula accessions A17 and HM056, R108 and M. sativa
Figure 6.19 Dotplot comparisons between M. truncatula and G. max in regions surrounding M. truncatula LPs
Figure 6.20 Phylogenetic tree of A17, HM056, R108 and M. sativa LP nucleotide sequences


List of Abbreviations

BLAST: Basic Local Alignment Search Tool

CaML: Calmodulin-like protein

CLE: CLAVATA3 (CLV3)/ESR-related protein

FPKM: Fragments Per Kilobase of transcript per Million

GRP: Glycine-Rich Protein

IRLC: Inverted Repeat-Lacking Clade

LP: LEED..PEED

LSE: Lineage-Specific Expansion

MCL: Markov Cluster algorithm

NCR: Nodule-specific Cysteine-Rich

NPD: Nodule-specific PLAT Domain

ORF: Open Reading Frame

PDP: PLAT Domain Protein

PLAT: Polycystin-1, Lipoxygenase, Alpha Toxin

SPADA: Small Peptide Alignment Discovery Application


Chapter 1: Introduction

1.1 Legumes form a unique relationship with nitrogen-fixing rhizobia

Access to nitrogen can be a limiting factor for optimal plant growth and yield, as plants cannot use atmospheric N2 gas. Legumes, however, can overcome this constraint by forming symbioses with nitrogen-fixing soil bacteria in dedicated root organs called nodules. In this relationship, legumes receive fixed nitrogen in the form of ammonia, while supplying rhizobia with a carbon source in the form of dicarboxylic acids (Guinel, 2015).

The nodule formation process begins in the soil, when free-living rhizobia come into close quarters with legume roots. In Medicago truncatula and other legumes, the host exudes flavonoids into the rhizosphere, which causes secretion of lipochitooligosaccharide signals (Nod factors) by rhizobia. The host perceives compatible rhizobial strains by Nod factor recognition through receptor-like kinases, triggering a calcium-dependent signaling cascade that leads to host transcriptional reprogramming (reviewed by Dénarié et al., 1996; Oldroyd et al., 2011; Moreau et al., 2011). Nodule organogenesis commences with cell divisions in the root cortex that form a nodule primordium at the site of infection. Meanwhile, root hairs curl around rhizobia and invaginate, allowing rhizobia to progress across the nodule primordium towards the root through infection threads. Rhizobia proliferate within branching infection threads as they arrive at the infection zone of the incipient nodule. Bacteria then exit the infection threads and enter the host cytoplasm by endocytosis, surrounded by a membrane derived from the host. These intracellular compartments, called symbiosomes, house differentiating bacteroids and provide a specialized microenvironment that fosters nitrogen fixation (Gage, 2004).

Nodule development is regulated through concerted signaling processes that depend on a wide variety of peptides secreted by the host (Wang et al., 2010) and by rhizobial partners (Marie et al., 2003). Host x strain specificity can be determined at the onset of nodulation, restricting the interactions that result in nodule induction; plants must recognize compatible rhizobia and downregulate defense responses so that infection can occur (Tóth and Stacey, 2015). Additionally, host x strain recognition and communication are necessary in later stages of nodule formation, such as during bacteroid entry to symbiosomes, with incompatibilities potentially leading to premature nodule senescence (Moreau et al., 2011; Wang et al., 2016).

1.2 Rhizobial fate is determined by legume hosts

Rhizobia that have become incorporated into symbiosomes undergo varying levels of differentiation from a free-living to a symbiotic bacteroid state. This range of bacteroid morphology can culminate in extreme rhizobial differentiation, in which bacteroids become enlarged, become polyploid, and lose their ability to replicate (Mergaert et al., 2006). The legume host plays a critical role in this process, with studies showing that a single rhizobial strain adopts different morphological features within symbiosomes depending on the plant host (Sutton and Paterson, 1980). Interestingly, legumes have evolved multiple times to have the capacity to induce these terminal differentiation traits in nitrogen-fixing bacteroids (Oono et al., 2010; Oono et al., 2011; Ishihara et al., 2011). Improved nitrogen fixation efficiency was seen in plants that host terminally differentiated bacteroids (Oono and Denison, 2010), and the repeated evolution of this trait in at least five different legume lineages suggests there is a fitness benefit associated with reduced reproductive viability of nitrogen-fixing bacteroids (Oono et al., 2010; Oono and Denison, 2010). Nevertheless, it is unknown whether this paraphyletic trait has evolved multiple times through similar mechanisms.

In legume species belonging to the Inverted Repeat-Lacking Clade (IRLC), which includes Medicago and its close relatives, extreme rhizobial differentiation inside nodules is mediated by host-derived nodule-specific cysteine-rich peptides (NCRs; Van de Velde et al., 2010). NCRs are small peptides that are targeted to a nodule-specific secretory pathway that directs them to symbiosomes (Wang et al., 2010; Haag et al., 2011). There, select NCR peptides with antimicrobial properties disrupt membrane integrity and block rhizobial cell division. Thus, as they transition to nitrogen-fixing bacteroids, rhizobia in IRLC legumes undergo terminal differentiation characterized by bacterial elongation, polyploidy, and reduced viability (Haag et al., 2011; Penterman et al., 2014).

NCRs constitute a lineage-specific gene family and have undergone a massive expansion in a limited subset of legume species (Mergaert et al., 2003; Graham et al., 2004). In Medicago truncatula alone, the NCR gene family has up to ~600 members (Young and Bharti, 2012). Most IRLC legumes, including Medicago, Pisum and Trifolium, host terminally differentiated bacteroids and have large NCR gene families (Mergaert et al., 2003). The closely related Cicer has fewer NCR genes (~60) and its bacteroids show intermediate signs of differentiation (Montiel et al., 2017), while Lotus species lack this lineage-specific gene family expansion and do not host terminally differentiated bacteroids (Mergaert et al., 2006; Van de Velde et al., 2010). When Lotus japonicus was transformed with an NCR gene (NCR035), bacteroid terminal differentiation traits were induced in transgenic nodules (Van de Velde et al., 2010), further supporting the notion that a lineage-specific expansion of NCR genes gave rise to a lineage-specific trait.

Although the massively expanded NCR gene family is implicated in the regulation of bacteroid terminal differentiation in IRLC legumes, the mechanism by which bacteroids become terminally differentiated in legume lineages outside of the IRLC is unknown. For example, Leucaena glauca, a legume distantly related to the IRLC, has terminally differentiated bacteroids but lacks NCRs (Ishihara et al., 2011). On the other hand, in Aeschynomene afraspera and A. indica (which manifest both stem and root nodules), ~40 lineage-specific NCR-like genes were identified in each species (Czernic et al., 2015). These NCR-like peptides are candidate regulators of rhizobial terminal differentiation and rhizobial transition to elongated or spherical bacteroid morphotypes in A. afraspera and A. indica, respectively.

1.3 The role of small-secreted peptides in nodulation remains largely unexplored

Increasingly, it is becoming evident that peptides secreted by the host have the potential to regulate the fate of bacteroids residing within nodules and the fate of the nodulation process itself. Small secreted peptides could play important roles in communication between the legume host and symbiotic partners, as evidenced by the NCRs. Peptides are especially suited to the task of local communication due to universal and specific secretion systems and the possibility of post-translational and proteolytic control of activity. Further, peptides can be suitable as precise extracellular ligands even while remaining small and diffusible (Bisseling, 1999; Meng, 2012). In legumes, the nodule-specific signal peptidase complex discovered in Medicago can direct subsets of host secreted peptides to symbiosomes within the developing nodule cells (Wang et al., 2010). Thus, legumes can specifically transmit signals to, and potentially regulate the fate of, bacteroids through secreted peptides.

In legume hosts, numerous gene families are believed to be involved in nodulation, either because they are legume-specific or have nodule-specific expression. Graham et al. (2004) conducted a pioneering study that identified legume-specific peptides through a series of comparative sequence homology searches between legume and non-legume plants. Additionally, nodulation-related genes have been identified in recent transcriptome studies, which have found that hundreds of genes are differentially expressed during nodulation and in specific sections of the nodule (Manoury et al., 2010; Moreau et al., 2011; Limpens et al., 2013). Examples of potential nodulation-related secreted peptides include the NCRs, glycine-rich proteins (GRPs; Kevei et al., 2002) and the CLAVATA3 (CLV3)/ESR-related peptides (CLEs), which have roles in nodule organogenesis and autoregulation of nodulation (Mortier et al., 2011). All of these peptide families share important characteristics, including small sizes of less than 150 amino acids on average, repetitive motifs, and the presence of a putative signal peptide, implying their secretion. However, aside from the NCRs and CLEs, specific functions have yet to be determined for many of the nodule-specific small secreted peptide families.


1.4 Lineage-specific expansions can lead to phylogeny-restricted traits

Small secreted peptide families can have major functions in cell-cell communication, development, and differentiation (Murphy et al., 2012). Alternatively, when acting as effectors during interactions between species, small secreted peptides are often involved in fine-tuning the relationship between the host and symbiont (Wang et al., 2017) or serve as tools in an arms race between plants and pathogens (Cheng et al., 2014).

These rapidly evolving gene families often undergo lineage-specific expansions leading to uneven representation across lineages (Duplessis et al., 2011), which can lead to the evolution of new functional traits unique to each lineage.

The uneven representation of gene families, such as the NCRs, across different taxa leads to the question of the origin of these gene families. The most likely routes include de novo gene evolution, recombination events splicing two distant genomic domains, or a repurposing of existing proteins (Tautz and Domazet-Lošo, 2011). Genes that emerge de novo typically code for proteins with structurally simple domains such as α-helices or histidine/cysteine-rich regions that stabilize molecules (Lespinet et al., 2002). The novel proteins they encode, however, may be small enough that there is no need for strict protein stabilization. In contrast, recombination may result in a signal peptide coding region joining with a C-terminal coding region of another protein. Genes that encode these ‘new’ proteins may then undergo distant segmental duplication or rapid local duplications (Silverstein et al., 2005) to create large gene families with novel functions. This phenomenon likely gave rise to a set of six calmodulin-like proteins (CaMLs) in Medicago spp. The ancestor of these proteins originated from an unequal recombination event between the N-terminal region of nodulin-25 and the calcium-binding region of a nearby calmodulin. The calmodulin-like ancestor tandemly duplicated, resulting in a cluster of six nodule-specific copies in M. truncatula (Liu et al., 2006).

An immediate consequence of gene duplication is an increase in transcript levels of the gene. Thereafter, the duplicated gene may undergo three different fates as new mutations accumulate. A selectively neutral copy of a gene may become fixed in a population by chance alone, but truly redundant gene copies are not stably maintained and are likely to be silenced within a few million years (pseudogenization). However, if one duplicate assumes a different function from the parent gene (neofunctionalization) or both daughter copies assume part of the function from the parent gene (subfunctionalization), then the gene duplicates may be retained through natural selection (Lynch and Conery, 2000). In plants, rates of gene duplication are substantially higher than in other eukaryotes (Hanada et al., 2008), with up to 80% of A. thaliana proteins having undergone lineage-specific expansions relative to non-plant genomes (Lespinet et al., 2002).

1.5 Project overview

The exchange of molecular signals between legume hosts and rhizobia begins during recognition and infection and continues through nitrogen fixation and, finally, nodule senescence. In the Leguminosae, the variety and range of these small signaling molecules across lineages can lead to stark differences in nodule and rhizobial biology, with important implications for the symbiont partners’ coevolution dynamics (Denison, 2000).

In this dissertation, Chapter 2 describes the development of a computational pipeline, integrating phylogenetic, comparative genomic and transcriptomic data, to reliably detect lineage-specific expansions of nodulation-related small secreted peptides in legume species. Using this tool, nodulation-related lineage-specific expansions were detected in 13 gene families, specific to Arachis, Glycine, or Medicago lineages. One such expansion occurred in a subset of PLAT domain genes, giving rise to five nodule-specific PLAT domain (NPD) genes in M. truncatula. In order to determine the role of NPD gene family expansion in nodulation, a set of M. truncatula NPD knockout lines was created, targeting between one and five NPD genes. Knockout lines were subsequently phenotyped to ascertain whether cumulative NPD gene inactivations caused quantifiable differences in nodule or plant traits. Finally, Chapter 3 encompasses the genomic characterization of the LEED..PEEDs (LPs), a Medicago-specific secreted peptide family with nodule-specific expression. Genomic, transcriptomic, and computational resources generated during the course of this project are outlined in Appendix Table 6.1.


Chapter 2: Cross-species examination of legume signaling peptides reveals that nodule-specific PLAT domain proteins are required for nodulation

2.1 Summary

Symbiotic nitrogen fixation in legumes is mediated by an interplay of signaling processes between plant hosts and their rhizobial symbiotic partners. Rapid evolution of nodulation-related signaling peptide families among legume lineages, including family expansion, can lead to novel functional traits such as differences in infection efficiency or nodule viability. A computational workflow was developed to identify nodulation-related gene families encoding small signaling peptides that have undergone lineage-specific expansions (LSEs). This workflow used RNA-seq data from nodule, flower, leaf, and root tissues from five legume species to obtain an initial pool of candidate gene families, followed by iterative clustering and expansion of the list of candidates. After curation, 13 nodulation-related LSEs were identified, each one specific to either Glycine, Arachis or Medicago lineages. Three family expansions were especially notable, including a nodulin family which expanded independently in both soybean (Glycine max) and common bean (Phaseolus vulgaris). In the diploid peanut relative, Arachis duranensis, a family of cysteine-rich peptides with nodule-enhanced expression expanded to ~100 members, while in the Medicago lineage, a set of nodule-specific PLAT domain proteins (NPDs) expanded to five nodule-specific members through tandem duplication. Because NPDs represent a recently discovered component in nodulation, we examined their function in further detail. Using a CRISPR/Cas9 multiplex genome editing approach, an overlapping set of M. truncatula NPD knockout lines was generated, targeting one through five NPD genes. Mutant lines with differing combinations of NPD gene inactivations had progressively earlier onset of nodule senescence, smaller size, or ineffective nodules compared to the wild type control. Two triple-knockout lines showed dissimilar nodulation phenotypes but coincided in upregulation of two proteolysis genes, possible candidates for the observed breakdown of proper nodule function. Studies of lineage-specific expansions are useful to discover new candidate gene families linked to phylogeny-restricted traits. Applying a bioinformatic approach to identify LSEs of nodulation-associated genes, we identified a new family of nodule-specific PLAT domain peptides and confirmed that they play a role in successful nodule formation.

2.2 Keywords

Lineage-specific expansion, nodule-specific PLAT domain, small secreted peptides, nodulation.

2.3 Introduction

During biological nitrogen fixation, soil rhizobia infect legume roots through a series of coordinated steps, leading to the formation of specialized root organs called nodules (Jones et al., 2007). In the model legume, Medicago truncatula, nodules have four distinct zones including a persistent meristem (I), infection zone (II), nitrogen fixation zone (III), and senescence zone (IV). In the nitrogen fixation zone, rhizobia are encased within organelle-like symbiosomes derived from the host cell membrane where they fix atmospheric nitrogen into ammonia usable by the plant (Franssen et al., 1992; Huisman et al., 2012). Studies in M. truncatula have shown that the host cell can direct small signaling peptides into the symbiosomes via a dedicated component of the secretory pathway (Wang et al., 2010), and govern the differentiation of rhizobia into nitrogen-fixing bacteroids (Van de Velde et al., 2010). Extreme bacteroid differentiation, which has been observed in Medicago and Arachis lineages but not in the soybean lineage, is characterized by polyploidization and enlargement of rhizobia that have ceased to divide (reviewed by Pan and Wang, 2017).

Gene families that undergo differential duplications between sets of related species are said to have undergone a lineage-specific expansion (LSE). LSEs can vary greatly both in terms of their taxonomic range as well as magnitude of the expansion, and have been independently associated with increased biological complexity and phylum-specific (Mergaert et al., 2003; David et al., 2008; Tran et al., 2012). For example, the expansion of oxidase genes coincides with the emergence of land plants, and tissue-specific expression allows functional divergence of these genes (Tran et al., 2012). On the other hand, Plett et al. (2017) found poplar-specific small secreted proteins that were induced during mycorrhizal infections and affected hyphal morphology, suggesting a role for these peptides in the symbiosis.

The LSEs of several small signaling peptide families in legumes are known to be associated with nodulation. Examples of legume LSEs include nodule-specific cysteine-rich peptides (NCRs; Mergaert et al., 2003), calmodulin-like proteins (CaMLs; Liu et al., 2006), and glycine-rich proteins (GRPs; Vandepoele and Van de Peer, 2005; reviewed by Silverstein et al., 2006). Recently, our capacity to detect additional LSEs in legumes has greatly improved due to the high-quality genome and transcriptome data becoming available for an increasing number of species.

Others have systematically searched for LSEs in prokaryotes (Jordan et al., 2001), eukaryotes (Lespinet et al., 2002; Campbell et al., 2007; Yang et al., 2009), and in other legumes (Garg et al., 2011; Guillén et al., 2013). In particular, Garg et al. (2011) identified chickpea and legume-specific genes using transcriptome data that did not include nodules. Meanwhile, Guillén et al. (2013) surveyed legume genomes for small proteins, but relied on existing genome annotations and did not focus on secreted peptides. To our knowledge, this study is the first with the specific aim of de novo identification of LSEs of small nodulation-related secreted peptides. To this end, we used nodule transcriptome data coupled with a homology-based gene clustering approach to detect small, putatively secreted peptide families unique to specific legume lineages.

Among the detected gene families, a subset of peptides had a PLAT domain (Polycystin-1, Lipoxygenase, Alpha Toxin; Bateman and Sanford, 1999), a beta-sandwich domain found in lipid-associated proteins in eukaryotes and prokaryotes (PS50095, prosite.expasy.org). Nodule-specific peptides with a signal peptide and single PLAT domains, NPDs, expanded in the M. truncatula lineage and were chosen for further characterization (Figure 2.1).

Our hypothesis was that finding LSEs of genes with nodule-enhanced expression would lead us to find sets of candidate gene families that would otherwise be missed by Tnt1 forward genetic screening or genome-wide association studies because of cumulative properties or functional redundancies of multi-gene families. The study of expansion histories of nodulation-related genes across lineages, coupled with new multiplex targeted gene knockout techniques, is fundamental in deciphering and dissecting the various ways in which specific nodulation traits have evolved across different legume lineages.

2.4 Materials and Methods

2.4.1 Legume genomic and RNA-seq data

A range of sources, listed in Table 2.1, were used to obtain genomic data for A. duranensis, G. max, P. vulgaris, M. truncatula and red clover, Trifolium pratense. A. duranensis is a wild diploid relative of cultivated peanuts in the dalbergioid crown clade. G. max and P. vulgaris both belong to the tribe Phaseoleae and are in the milletioid crown clade, while M. truncatula and T. pratense are in the Inverted Repeat-Lacking Clade within the Hologalegina crown clade (Lavin et al., 2005). Corresponding nodule, root, leaf, and flower RNA-seq data were obtained for each species from publicly available data or, in the case of T. pratense nodule RNA-seq, generated for this study using growth conditions described in Trujillo et al. (2014) for P. sativum (Table 2.1).

2.4.2 Identification of nodulation-related LSEs in small secreted peptide families

The LSE discovery pipeline is summarized in Figure 2.2, and the custom scripts used for the workflow are open source and available at https://github.com/ditrujillo/LSE_pipeline. In the first stage, a pool of candidate peptides was obtained based on nodule-enhanced expression, small size, and presence of a signal peptide. RNA-seq reads from flower, leaf, root, and nodule tissue for each species were mapped onto the corresponding genomes using Tophat v2.0.13 (Trapnell et al., 2009). For each species, nodule-mapped reads were assembled into transcriptomes using Cufflinks v2.0.0 (Trapnell et al., 2010). Transcript abundances for flower, leaf, root, and nodule tissue were then calculated using the nodule transcriptome as a guide (-G option). A transcript was considered to have nodule-enhanced expression if its nodule FPKM (fragments per kilobase of transcript per million) value was greater than 10 and at least 4 times greater than the average of the flower, leaf, and root FPKMs. All possible open reading frames (ORFs) were identified for the transcripts with nodule-enhanced expression. ORFs that coded for peptides 35 to 250 amino acids in length were selected, and the signal peptide prediction program SignalP 4.0 (Petersen et al., 2011) was run on the size-filtered ORFs to obtain a subset of putatively secreted peptides. The size range cut-off was chosen based on known plant signaling peptide lengths. After signal peptide cleavage (usually ~22 amino acids in length) and proteolytic processing, the smallest signaling peptides average 12 amino acids in length. The largest known protein precursor with a signaling function, systemin, is 200 amino acids in length (reviewed in Bisseling, 1999; Meng, 2012; Murphy et al., 2012).
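The two first-stage filters can be illustrated with a minimal Python sketch. This is not the code from the LSE_pipeline repository; the function names and the dictionary layout of the FPKM values are assumptions made for illustration, and only the thresholds (nodule FPKM > 10, at least 4-fold over the mean of the other tissues, 35 to 250 amino acids) come from the text above.

```python
from statistics import mean

MIN_NODULE_FPKM = 10.0       # nodule FPKM must be greater than this value
MIN_FOLD = 4.0               # nodule FPKM must be >= 4x the mean of the other tissues
MIN_LEN, MAX_LEN = 35, 250   # accepted peptide length range, in amino acids


def is_nodule_enhanced(fpkm):
    """Apply the nodule-enhanced expression rule to one transcript.

    `fpkm` is a dict with the keys 'flower', 'leaf', 'root' and 'nodule'
    (a hypothetical layout chosen for this sketch).
    """
    other = mean(fpkm[tissue] for tissue in ("flower", "leaf", "root"))
    return fpkm["nodule"] > MIN_NODULE_FPKM and fpkm["nodule"] >= MIN_FOLD * other


def in_size_range(peptide):
    """Keep ORF translations that are 35 to 250 amino acids long."""
    return MIN_LEN <= len(peptide) <= MAX_LEN
```

Transcripts passing `is_nodule_enhanced` would be translated into ORFs, and only ORFs passing `in_size_range` would be handed to the signal peptide predictor.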

In the second stage, the pool of candidate genes was expanded by clustering the peptides based on similarity, evaluating the clusters for LSE status, and then using multiple sequence alignment profiles to conduct searches against the genome of each species. To start, a similarity matrix for small ORFs was created through All-vs.-All BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990) comparisons. Using the Markov Cluster algorithm (MCL v10-201; van Dongen, 2000), ORFs were assigned to gene families with the clustering parameter i set to 1.5, resulting in fewer and larger gene family clusters. Each MCL group was evaluated for a lineage-specific expansion by determining whether one of the species had at least three times the number of gene members in the MCL group relative to any other species. MCL groups with putative LSEs were aligned using ClustalW2 v2.1 (Larkin et al., 2007) so that SPADA (Small Peptide Alignment Discovery Application; Zhou et al., 2013) searches could be conducted within the legume genomes. Using multiple sequence alignments of small peptide families of interest, SPADA identifies these families in tested genomes with better accuracy than generic gene-finding programs. SPADA was run with an e-value set to 10 for maximum sensitivity, then hits were filtered for small size, presence of a signal peptide, and nodule-enhanced expression, and the resulting set of candidate genes was re-clustered into families. This second stage, consisting of LSE identification, genome prediction, and candidate gene filtration, was repeated to expand and then refine the initial pool of candidate genes obtained through transcriptome assembly with Cufflinks.
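The LSE criterion applied to each MCL group can be sketched as follows. This is an illustrative reading of the rule, not the repository code: it assumes that "at least three times relative to any other species" means the best-represented species must have at least three times the member count of every other species, and the function name and species labels are hypothetical.

```python
from collections import Counter


def lse_species(member_species, all_species, fold=3):
    """Return the species showing a putative LSE in one MCL group, or None.

    member_species: one species label per gene in the group.
    all_species: every species surveyed, so that absent species count as zero.
    """
    counts = Counter(member_species)
    ranked = sorted(all_species, key=lambda s: counts[s], reverse=True)
    top = ranked[0]
    runner_up = counts[ranked[1]] if len(ranked) > 1 else 0
    # The top species must be present and outnumber every other species by `fold`.
    if counts[top] > 0 and counts[top] >= fold * runner_up:
        return top
    return None
```

Under this reading, a group found in only one species passes the test regardless of its size, so a practical implementation would probably also require a minimum number of members; that threshold is not stated in the text.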

In the third stage, the peptide clusters were manually assigned to families based on their annotation and manually aligned in order to run a final SPADA search (e-value set to 0.1 for maximum specificity). The predicted peptide groups were analyzed individually to determine whether the gene family had members with nodule-enhanced expression and, if so, whether these genes had uneven representation across the surveyed legume species.

2.4.3 Confirmation of candidate LSEs

Putative nodulation-related LSEs were further analyzed and manually curated. Part of the manual curation process involved merging large gene families, such as NCRs, that were split by MCL clustering. False positive MCL groups were identified by searching the NCBI non-redundant database (https://blast.ncbi.nlm.nih.gov/, Altschul et al., 2005), and included ORFs that were predicted on a wrong frame of previously annotated proteins or prematurely truncated peptides from larger proteins. The LSE status of each putative gene family was confirmed by combining differential expression analyses (Supplementary File 1) with ClustalW2-generated phylogenetic trees of the gene families. To check the comprehensiveness of the final SPADA search, BLAST (Altschul et al., 1990) searches for families of interest were conducted against the corresponding genomes.

2.4.4 Synteny analyses of NPDs and Phaseoleae-specific nodulins

To ascertain the expansion history of the Phaseoleae-specific nodulins and NPD gene families, synteny comparisons were conducted between genomes of select legume species using the CoGe Comparative Genomics Platform (https://genomevolution.org/, Lyons and Freeling 2008). For each analysis, the SynMap feature in CoGe was used with default parameters for whole genome comparisons and identification of blocks of syntenic genes. The GEvo feature in CoGe was then used to detect micro-synteny boundaries in genomic regions of interest.

2.4.5 CRISPR/Cas9 mediated knockout of NPD genes

Transformation of M. truncatula R108 was carried out using a slightly modified method of Cosson et al. (2006), as described in Curtin et al. (2017). Leaf explants were inoculated with Agrobacterium tumefaciens EHA105, using either a 4-plex version of entry vector pSC218GG (Curtin et al., 2017) or 4-plex Csy4 or tRNA arrays driven by a CmYLCV promoter (Čermák et al., 2017). In all cases, the BAR herbicide resistance and NPTII kanamycin resistance genes were used for selection (pSC218GG example shown in Figure 6.1a). All entry vectors carried the same set of four guide RNAs (Figure 6.1b), each of which could target one through four of the five NPD genes of interest (Figure 6.1c).

2.4.6 Nodulation Phenotyping of NPD mutants

M. truncatula R108 and mutant T2 or T3 seeds were scarified by nicking each with a razor blade, then imbibed for 2-3 days at 4°C in the dark. Seeds were allowed to germinate in the dark at room temperature and planted in sterile plant growth mixture consisting of 4 parts sand:1 part perlite. On average, 34 plants were grown for each genotype in a completely randomized design (only five plants were available for the npd2 line) at 22-24°C, with 75% humidity, on a 16 h light (200 to 350 µmol m-2 s-1) / 8 h dark photoperiod. After 3 days, plants were inoculated with a 10 ml suspension of Fahräeus medium lacking nitrogen (Barker et al., 2006) containing Ensifer meliloti strain 1021 carrying the nifH::GUS reporter gene (OD600 = 0.01, Starker et al., 2006). Thereafter, plants were watered with sterile deionized water twice a week and with Fahräeus medium lacking nitrogen once a week. Harvested plants were photographed, then traits of interest were analyzed with ImageJ 1.50i (Schneider et al., 2012). Measured traits included number of leaves, plant height, nodules per plant, number of pink nodules, and number of senescing nodules. The ImageJ freehand tool was used to trace individual features and capture total nodule area (which was found to be correlated with nodule dry weight in soybean nodules, R = 0.959; Saito et al., 2014), visible pink areas from nodules (N2-fixation area), and visible green areas (senescing area). The ImageJ line tool was used to measure leaf lengths from the base of the petiole to the tip of the individual leaves.

2.4.7 RNA-seq differential expression analyses

Roots and pink nodules from the phenotyping experiment (above) were harvested immediately after photographing whole plants at 30 days post inoculation. Senescent and ineffective nodules from NPD knockout lines were not included. Harvested tissue was flash frozen in liquid nitrogen, then stored at -80°C. Each biological replicate for RNA-seq consisted of at least two individual plants. RNA was extracted using a Qiagen RNeasy Plant Mini kit following manufacturer instructions. Samples were sent to the University of Minnesota Genomics Center for TruSeq dual-indexed stranded RNA library preparation and Illumina sequencing (HiSeq 2500). Each sample consisted of >7.5 million 125 bp paired-end reads with high quality scores (>30).

Reads were mapped to the Medicago truncatula 4.0 reference genome using Tophat v2.0.13 (minimum intron size of 20, maximum intron size of 2000, b2-very-sensitive parameter). HTSeq v0.7.2 (Anders et al., 2015) was used to determine read counts for annotated genes. The DESeq R package (Anders and Huber, 2010) was used to obtain normalization factors, determined for each sample as the median of the ratios between its gene counts and the geometric means of those counts across samples. Differentially expressed genes were detected using the negative binomial model in DESeq, with adjustment for multiple testing using the Benjamini and Hochberg method. Normalized gene counts were used to visualize differences in gene transcript levels. Unix and R scripts used for these analyses are available at https://github.com/ditrujillo/NPD_knockout_experiment.
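For readers unfamiliar with these two steps, the median-of-ratios normalization and the Benjamini and Hochberg adjustment can be written out in a few lines. The sketch below is a plain Python illustration of the arithmetic only; it is not the DESeq implementation or the analysis scripts in the repository, and the function names and input layout are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios normalization factors, one per sample.

    counts: list of genes, each a list of raw read counts (one per sample).
    Genes with a zero count in any sample are skipped so that the geometric
    mean across samples is defined.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Dividing each sample's raw counts by its size factor gives the normalized counts used for visualization; genes whose adjusted p-value falls below the chosen cut-off are reported as differentially expressed.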

We also mapped nodule reads from four Medicago accessions (HM101, HM056, HM034 and HM340) inoculated with two different rhizobial strains (E. meliloti strain KH46c and E. medicae strain WSM419) onto the respective Medicago genomes using Tophat (data from Burghardt et al., 2017; PRJNA327225), as described above. HTSeq counts were estimated for the NPD and PLAT domain genes, then DESeq was used to normalize counts and find differential expression, as described above. Scripts used for these analyses are available at https://github.com/ditrujillo/HostStrain_NPD_expression.

2.4.8 Microscopy and nodule staining

Plants were inoculated with E. meliloti strain 1021 carrying a nodF::GUS, exoY::GUS, bacA::GUS, or nifH::GUS reporter gene (Starker et al., 2006). For GUS staining, harvested nodules were hand sectioned into 50 mM PBS buffer (pH 7.0) with 1 mM X-Gluc (5-bromo-4-chloro-3-indolyl-beta-D-glucuronic acid) and 0.02% SDS and incubated overnight at 37°C. For toluidine blue staining, harvested nodules were fixed with 2.5% glutaraldehyde in 50 mM PBS buffer (pH 7.4) then washed. Fixed nodules were hand sectioned into 0.05% toluidine blue and visualized immediately. Photographs were taken under a light microscope with a phase contrast filter.

2.4.9 Accession Numbers

All RNA-seq raw sequence read data from T. pratense nodules at 30 dpi and NPD knockout experiments are found under BioProject accession numbers PRJNA416968 and PRJNA418151, respectively.


2.5 Results

2.5.1 Overview of identified LSEs

A workflow was developed that was tailored to the discovery of lineage-specific expansions of small nodulation-related gene families. This clustering pipeline consisted of three stages to 1) obtain a pool of candidate genes, 2) expand the pool, and 3) cluster a filtered set of candidate genes into LSE families (Figure 2.2).

Combined expression (Supplementary File 1) and phylogenetic (Figures 6.2-6.14) analyses were used to curate the families identified through this clustering workflow. Ultimately, 13 nodulation-related signaling peptide families were identified as specific to the Medicago, Glycine, or Arachis legume lineages (overview in Table 2.2, multiple sequence alignments in Supplementary File 2). As expected, the pipeline identified known families such as the NCRs (Figure 6.2; Mergaert et al., 2003; Graham et al., 2004), GRPs (Figures 6.3-6.4; Kevei et al., 2002; Alunni et al., 2007), LEED..PEEDs (Figure 6.5; Laporte et al., 2010; Trujillo et al., 2014), Aeschynomene NCR-like peptides (Figure 6.6; Czernic et al., 2015) and calmodulin-like proteins (CaMLs; Figure 6.7; Liu et al., 2006), which have previously been studied in terms of their nodule-related function and are known to be specific to certain legume lineages. In addition to these, our study discovered expansions of other gene families with suspected roles in nodulation in the Medicago, Glycine or Arachis lineages.

Many of the LSE families in the Medicago and Arachis lineages were Cysteine-Rich Peptide (CRP) families (Table 2.2), including CAP superfamily proteins (cysteine-rich secretory proteins, antigen 5, pathogenesis-related 1; Figure 6.8), Bowman Birk Trypsin Inhibitors (Figure 6.9), Antimicrobial Peptide MBP-1 (Figure 6.10), Leginsulin/Albumin-1 peptides (Figure 6.11), and NCRs (Figure 6.2). In the Medicago and Arachis lineages, the magnitude of expansion of some of the families was striking. As reported previously (Zhou et al., 2013), the NCRs in Medicago truncatula expanded to >600 members, most of which are either overexpressed in nodules or not expressed at all (Figure 6.2, Supplementary File 1). In A. duranensis, the Antimicrobial Peptide MBP-1 family expanded to ~100 members, with the majority of genes overexpressed in roots and nodules, and very low expression in any other tissue (Figure 6.10, Supplementary File 1). Like the NCRs, very few or no members of the Antimicrobial Peptide MBP-1 family were found in legumes outside of the Arachis lineage. In contrast, the CAP superfamily had several members with constitutive expression across all three legume lineages included in this study, but a subset of >50 members with root- and nodule-enhanced expression expanded specifically in the Arachis lineage (Figure 6.8, Supplementary File 1).

A nodulin family expanded specifically in a subset of Phaseoleae legumes, with seven members in G. max, produced through nontandem “dispersed duplications” (Ganko et al., 2007) or as a result of a whole genome duplication that occurred approximately 13 million years ago (based on the designations of duplication blocks retrieved from SoyBase; Grant et al., 2010). By contrast, members of this family present in P. vulgaris expanded through tandem duplications and have high percent identity, suggesting a more recent expansion that occurred independently of the expansion in the Glycine lineage (Figure 2.3). Mung bean (Vigna radiata) has two members, located 5 Mbp apart and syntenic to the two G. max members on Chromosome 13 and the P. vulgaris members on Chromosome 5. The Phaseoleae legume Cajanus cajan, which is an outgroup to Glycine and Phaseolus, had a single peptide belonging to this family, further suggesting that independent expansions occurred in the Glycine and Phaseolus lineages. All genes in this family identified in G. max and P. vulgaris have nodule-specific expression (Figure 6.12, Supplementary File 1).

Lastly, not all proteins detected by the pipeline are likely to have signaling functions; rather, some could be families of small enzymes or peptides that interact with other proteins. Specifically, in the Arachis lineage, we identified a small expansion of cysteine protease inhibitors (Figure 6.13, Supplementary File 1), which are small proteins (<150 amino acids) that have been shown to bind and inhibit protease targets (Zhao et al., 2014).

2.5.2 A nodule-specific subset of the PLAT domain protein family expanded exclusively in the Medicago lineage

A family closely related to the PLAT-plant stress family and containing a single PLAT domain was recently annotated as the nodule-specific PLAT domain (NPD) family (C. Pislariu, personal communication). Legumes and non-legume plants have PLAT domain proteins (PDPs) with constitutive expression, but a nodule-specific subset of this family expanded exclusively in the Medicago lineage (Figure 2.4a, Figure 6.14). Comparing synteny plots of peach (Prunus persica), a non-legume outgroup species, against narrowleaf lupin (Lupinus angustifolius) and A. duranensis (Figure 2.4b), we can theorize that a PLAT domain protein ancestor (blue arrow in P. persica) duplicated to two members in legumes, one of which initiated the NPD subset ( arrows) of the PDP gene family. In the Glycine and Medicago lineages, the NPD gene translocated to a proximal region. Finally, in M. truncatula a tandem duplication of the NPD gene resulted in five NPD copies within a 20 kb region (Figure 2.4b). Interestingly, one of the soybean Phaseoleae-specific nodulins discussed above (Figure 2.3, purple arrows) is located between the NPD and PDP genes on chromosome 13 (Figure 2.4b, purple arrow).

Medicago accessions HM340, HM034, HM056, and HM101 (A17 reference genome) all have five members of the nodule-specific PLAT domain genes, suggesting that the tandem duplication occurred before Medicago speciation. Using RNA-seq data from Burghardt et al. (2017), NPD gene expression counts from nodules were compared in these four Medicago accessions, inoculated with two rhizobial strains. In all accessions, NPDs are upregulated in nodules after inoculation with E. meliloti or E. medicae compared to root tissue, but NPD gene expression varied across plant accessions and depended on inoculant strain (Figure 2.5).

2.5.3 Multi-gene NPD knockout lines show differences in nodulation phenotypes

To investigate the potential role of NPDs in nodulation, a CRISPR/Cas9 multiplex genome editing approach (Figure 6.1a) was used to make one or more targeted deletions in multiple locations. Four guide RNAs (gRNAs) were designed, two of which could only target NPD1 or NPD5, one of which could target NPD1, NPD2, NPD3 and NPD4, and one of which could target NPD2, NPD4 and NPD5 (Figure 6.1b). Given the close proximity of the NPDs, these gRNAs also offered the possibility of generating large chromosomal deletions encompassing all five M. truncatula R108 nodule-specific genes (Figure 6.1c).

As expected, a variety of NPD knockouts was obtained, targeting one through five genes: npd2, npd2/4, npd1/2/4, npd2/4/5, npd1/2/4/5, and npd1/2/3/4/5 (Figure 2.6a). Knockouts were achieved through small indels, mid-scale deletions of 11-70 bp, and large deletions of up to 20 kb (Figure 6.15). Whole plant transformants were taken to the T1 stage to obtain homozygous mutant lines. Putative deletions were detected through PCR, followed by Sanger sequencing to confirm mutations in NPD genes. Knockout lines of interest were selected, and seeds from T2 or T3 generation plants were used for nodulation studies.

Nodule phenotyping experiments were carried out in which lines containing cumulative deletions were compared with wild type (M. truncatula HM340) controls at 18 dpi and 30 dpi. When the external morphology of HM340 nodules was compared against mutant lines at 18 dpi (Figure 2.6b), nodules from the one-, two-, three- and four-gene knockout lines all had external morphology similar to that of the wild type. Notably, at 18 dpi, the five-gene knockout line (npd1/2/3/4/5) had numerous but ineffective nodules, as determined by their small, round and white appearance (normal Medicago nodules are elongated and pink), suggesting a cessation of nodule development, lack of leghemoglobin, and reduced or nonexistent nitrogen fixation ability (Figure 2.6b).

At 30 dpi, differences in nodule numbers, morphology, and size (Figure 2.6c, Table 2.3) influenced overall plant health, observed as plants with fewer leaves, reduced height and chlorotic leaves (Figure 2.6d). The nodules of mutant lines npd2 and npd2/4 were similar to the wild type in appearance (representative nodules shown in Figure 2.6c), with elongated pink nodules of comparable size (p-value > 0.05, Table 2.3). However, the average size of the N2-fixation zone in nodules from line npd2 was significantly smaller (p-value = 0.0154), while npd2/4 had fewer pink nodules per plant (8.7 vs. 15.3 in WT, p-value = 0.0003, Table 2.3).

Nodules on the npd2/4/5 triple knockout plants were slightly smaller than those of the wild type control (1.068 mm2 vs. 1.289 mm2 respectively, p-value = 0.0204; Table 2.3) but, more noticeably, many of the larger nodules on npd2/4/5 plants had early or advanced signs of senescence (Figure 2.6c). This observation was confirmed by a higher average number of senescent nodules per plant in line npd2/4/5 (4.9) compared to HM340 controls (3.3) (p-value = 0.0029, Table 2.3). Further, line npd2/4/5 had a significantly larger senescent zone size compared to wild type, measured as green areas in individual nodules (0.189 mm2 vs. 0.089 mm2 respectively, p-value < 0.0001, Table 2.3).

In knockout lines npd1/2/4, npd1/2/4/5, and npd1/2/3/4/5, nodules were significantly smaller (p-value < 0.0001, Figure 2.6c), and these lines had lower numbers of pink nodules per plant compared to wild type (p-value < 0.0001, Figure 2.6e). Furthermore, most nodules in npd1/2/4, npd1/2/4/5 and npd1/2/3/4/5 were ineffective, with an average of 4.3, 2.8 and 0.4 pink nodules per plant, respectively, compared to 15.3 in the wild type (Table 2.3). Leaf color in npd1/2/4, npd1/2/4/5, and npd1/2/3/4/5 knockout plants was a paler green compared to the WT control, while leaves of npd2, npd2/4, and npd2/4/5 gene knockout lines appeared healthy overall (representative plants shown in Figure 2.6d).

Subtle differences in above-ground phenotypes could be clarified by taking into account both nodule number and N2-fixation zone size (measured as pink areas in individual nodules). For example, when cumulative nodule size (the sum of all nodule areas for a given plant) for the npd2/4 knockout line was compared against the control, it was found to be significantly smaller (p-value = 0.0002, Table 2.3), even though the average nodule size was not significantly different (p-value > 0.05, Table 2.3). Indeed, a plant’s cumulative N2-fixation zone size was highly predictive of its height (R2 = 0.63, p-value < 0.0001; Figure 2.7a) and cumulative petiole length (R2 = 0.82, p-value < 0.0001; Figure 2.7b). Not surprisingly, a mutant line’s smaller cumulative pink nodule area compared to wild type coincided with quantifiable differences in cumulative petiole length (Table 2.3) and plant height (Figure 2.6f).
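The two quantities used in this comparison, a per-plant cumulative area and the R2 of a simple linear regression, can be computed as in the minimal sketch below. The function names and the input layout (a mapping from plant identifier to a list of traced areas) are assumptions made for illustration; the sketch does not reproduce the study's measurements or analysis scripts.

```python
def cumulative_area(areas_by_plant):
    """Sum the individually traced areas (e.g. pink zones, in mm2) for each plant."""
    return {plant: sum(areas) for plant, areas in areas_by_plant.items()}


def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # For one predictor, R2 equals the squared Pearson correlation.
    return sxy * sxy / (sxx * syy)
```

Here `x` would hold the cumulative N2-fixation zone size of each plant and `y` its height or cumulative petiole length, in the same plant order.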

2.5.4 Rhizobia lose N2-fixation capacity when all five NPD genes are knocked out

Four rhizobial reporter strains with GUS fusions to different promoters (nodF, exoY, bacA, nifH) were used to determine at what stage nodulation was arrested in the five-gene NPD knockout line (npd1/2/3/4/5). Rhizobia inoculated on npd1/2/3/4/5 plants, which had white ineffective nodules, expressed genes necessary for infection (exoY::GUS reporter strain) and survival in symbiosomes (bacA::GUS reporter strain) at 7 and 11 dpi, with lighter staining or no staining at 8 dpi. The rhizobial strain nifH::GUS, which indicates whether a primary nitrogen fixation gene is being expressed, stained blue at 7 dpi, but not at 8 or 11 dpi. This suggests that nodule organogenesis is arrested after 7 days in the five-gene knockout line (Figure 2.8), and any blue staining that is observed thereafter corresponds to successive waves of nodule inception.

After determining that N2-fixation was completely interrupted for npd1/2/3/4/5, nifH::GUS blue staining was used to determine whether rhizobial N2-fixation capacity was also altered in one-, two-, three-, and four-gene NPD knockout lines. At 18 dpi, lines with one through four NPD gene knockouts stained blue (Figure 2.9) and had pink nodules (Figure 2.6b), suggesting that nitrogen fixation was possible in all the partial NPD knockout lines. Toluidine blue staining showed that rhizobia inside cells were radially organized around a central vacuole in the wild type control and one-, two-, three-, and four-gene knockout lines, but not in the five-gene knockout line (Figure 2.9).

2.5.5 Transcriptome analysis of nodules from two- and three-gene knockout lines

RNA-seq analyses were used to find differentially expressed (DE) genes in nodules from the two- and three-gene knockout lines (npd2/4, npd2/4/5 and npd1/2/4). Compared to wild type nodules, only five genes in the double mutant line npd2/4 (which had healthy pink elongated nodules) were identified as DE, with an adjusted p-value < 0.1 after correcting for multiple testing. These overexpressed genes (Figure 2.10, blue side bars) had varied functions, including mRNA decay (Medtr5g089690), disease resistance (Medtr1g090690), and membrane- and lipid-associated functions (Medtr1g112830, Medtr3g106480, and Medtr4g105450).

For the npd2/4/5 and npd1/2/4 knockout lines we chose to focus on genes that were DE in comparison to both WT and npd2/4 (p-value < 0.01), based on the observation that WT and npd2/4 nodules had similar morphology and similar gene expression patterns. Compared to wild type and npd2/4, eight genes in the npd2/4/5 triple knockout line (which had earlier nodule senescence) were DE (Figure 2.10, red and purple side bars). Two of these genes were also overexpressed in npd1/2/4 (Figure 2.10, purple side bars), and are involved in proteolysis. Genes overexpressed exclusively in npd2/4/5 (Figure 2.10, red side bars) were related to cation transmembrane transport (Medtr3g071990), protein glycosylation (Medtr4g035490), cell wall degradation (Medtr4g094732), and transcription regulation (Medtr5g014040). Genes with reduced transcript levels in npd2/4/5 had functions related to oxido-reduction (Medtr6g043280) and Golgi-associated intracellular transport (Medtr6g005610).

Thirty-nine genes were overexpressed exclusively in the npd1/2/4 knockout line (which had small but pink nodules), while fourteen genes were downregulated relative to WT and npd2/4 (Figure 2.10, green side bars). Genes with the highest increase in transcript levels in npd1/2/4 (Figure 2.10, green underline) included an asparaginase (Medtr4g109900; 50-fold increase), a stress-response gene (Medtr6g471080; 18-fold increase), lipoxygenases (Medtr8g018510 and Medtr8g018520; >20-fold increase), and WRKY transcription factors (Medtr3g095040, Medtr7g073380, Medtr1g013760 and Medtr8g005750; >10-fold increase). Genes that were strongly suppressed in npd1/2/4 (Figure 2.10, green underline) included a hypothetical protein (Medtr5g009260; 30-fold decrease) and a gene related to RNA-silencing (Medtr6g088660; 20-fold decrease).

2.6 Discussion

2.6.1 Types of LSEs found in this study

LSEs emerged mostly as a result of tandem duplications (Figure 2.4b, Figure 2.11), but other mechanisms included dispersed duplications and whole genome duplications (Figure 2.3). Additionally, LSEs arose from duplications of pre-existing genes followed by diversification or by de novo evolution from existing coding or non-coding regions (Dujon, 1996; Lespinet et al., 2002; Tautz and Domazet-Lošo, 2011). Most gene families that underwent LSEs in this study originated from an ancestral sequence and had homologs outside the lineage, while two families (LEED..PEEDs and Phaseoleae-specific Nodulins) appeared de novo and might best be considered orphan genes (Dujon, 1996). Among LSEs with homologs outside their lineage, the LSE families had either unequal gene family size between legume species (NCRs, Antimicrobial Peptide MBP-1) or notably different subsets of nodule-specific expression across legume species (CAP superfamily proteins and NPDs). In terms of the duplication mechanism, we found that both tandem duplication (Phaseoleae-specific Nodulins in P. vulgaris) and dispersed duplication (Phaseoleae-specific Nodulins in G. max) can play a role in the appearance of LSEs (Figure 2.3).

Though some of the families identified in this study have been described previously, their role in nodulation requires further investigation. The NCRs are the best-studied family among nodulation-related LSEs and are known to mediate the terminal differentiation of rhizobia within symbiosomes (Van de Velde et al., 2010; Horváth et al., 2015), while some are necessary for bacteroid survival in symbiosomes (Kim et al., 2015). The cluster of nodule-specific calmodulin-like genes in the M. truncatula genome was explored by Liu et al. (2006), and their localization to symbiosomes was confirmed. Further, the calcium-binding properties of these peptides suggest a likely role in signal transduction or interpretation in fully developed nodules. GRP and LEED..PEED families are also thought to be involved in nodule organogenesis, but the function of these genes remains largely unexplored (Kevei et al., 2002; Laporte et al., 2010; Trujillo et al., 2014).

Many LSE gene families that were identified in the Arachis and Medicago lineages were small and had repetitive simple motifs (such as cysteine- or glycine-rich), which are common findings in studies of gene family lineage-specificity (Graham et al., 2004; Campbell et al., 2007; Lin et al., 2010). LSEs have been shown to occur in stress-related or intracellular regulatory gene families (Lespinet et al., 2002; Hanada et al., 2008) or, in the case of conserved Brassicaceae-specific genes, putatively secreted self-incompatibility families (Lin et al., 2010). The Medicago and Arachis lineages both contained a high number of LSE families, unlike Phaseoleae legumes, and most of these were small CRP families. Our pipeline was biased to identify nodulation-related secreted peptides, so it is unsurprising to find a bias in the putative function and molecular structure of these families. However, the high number of LSEs in the Medicago and Arachis lineages compared to the Phaseoleae lineage was unexpected, and may reflect a common evolutionary mechanism to induce extreme bacteroid differentiation. The Arachis and Medicago lineages both host terminally differentiated bacteroids (Khetmalas, 1996; Oono et al., 2010), and the expansion of CRP families, which often have antimicrobial properties (Silverstein et al., 2007; Farkas et al., 2017), could be a shared means to induce this morphological change in rhizobia. Indeed, Czernic et al. (2015) concluded that NCR-like peptides induce bacteroid terminal differentiation in Aeschynomene legumes, which are closely related to Arachis. The Aeschynomene NCR-like genes described by Czernic et al. (2015) were detected as LSEs in A. duranensis in this study (Table 2.2), but a greater expansion occurred in the Antimicrobial Peptide MBP-1 gene family. This makes the latter family a candidate for future studies of terminal differentiation of bacteroids in the Arachis lineage.

Interestingly, we also detected an expansion of cystatin genes in A. duranensis. Cystatins bind to and reversibly inhibit cysteine proteases (Benchabane et al., 2010), which potentially target CRPs and other proteins inside the symbiosome (Vincent and Brewin, 2000). Cystatins are thought to regulate proteolysis during nodule senescence by balancing protein turnover processes (van Wyk et al., 2014). Cysteine protease silencing in Astragalus sinicus (Li et al., 2008) and cystatin overexpression in Lotus japonicus (Yuan et al., 2017) hairy roots inhibited nodule senescence and produced better aboveground growth. These studies suggest that an increased number of nodule-specific cystatin genes in the Arachis lineage could confer a positive effect on nodulation.

2.6.2 Phaseoleae-specific nodulins have expanded independently in Glycine and Phaseolus legumes

The Phaseoleae-specific nodulins identified in this study are homologous to those identified by Mauro et al. (1985) in soybean. Sengupta-Gopalan et al. (1986) predicted that this small nodulin family is involved in ureide biosynthesis, which takes place in plastids and peroxisomes. The absence of this family outside of the Phaseoleae lineage would be consistent with the observation that ureides are the form of organic nitrogen in soybean, while nitrogen in M. truncatula and A. duranensis is transported in the form of amides (Groat and Vance, 1981; Devi et al., 2010). Despite very high expression levels (Supplementary File 1), with two members making up 35% of a nodule-specific non-leghemoglobin cDNA library (Sengupta-Gopalan et al., 1986), the function of this gene family remains unknown. SignalP detected a signal peptide on most members of this family. However, previous research shows that some members are associated with membranes, while others are found in the cytoplasmic fraction (Jacobs et al., 1987; Richter et al., 1991).

2.6.3 The NPDs are a previously unexplored family related to nodulation

The PLAT domain was first described in large multidomain proteins, such as lipases, lipoxygenases and alpha toxins (Bateman and Sandford, 1999). However, it can also be found as a single domain, such as in PLAT/plant-stress proteins (Hyun et al., 2014; PDP1 subgroup in Figure 2.1) and in proteins with homology to A. thaliana ATS3 (Nuccio and Thomas, 1999; PDP2 subgroup in Figure 2.1). The NPDs that are a focus of this study have high homology to ATS3, which is embryo-specific (Figure 2.1), and are thus labeled “Embryo-specific 3” in Medicago genome annotations (Young et al., 2011). However, PLAT domain proteins were found to have constitutive expression in M. truncatula, or nodule-specific expression in the case of NPDs, and were renamed accordingly.

PLAT domain proteins not associated with nodules have been linked with abiotic and biotic stress responses in multiple studies (Shin et al., 2004; Ali, 2007; Son et al., 2012; Hyun et al., 2014). A PLAT domain-like protein was overexpressed in Allium shoots, tubers and roots during cold stress (Son et al., 2012), while an A. thaliana PLAT plant-stress protein is induced by salt and cold stress and its overexpression promotes plant growth under these conditions (Hyun et al., 2014). On the biotic side, overexpression of a soybean PLAT gene (Glyma08g14550) reduced the number of mature female soybean cyst nematodes after inoculation, compared to controls (Matthews et al., 2013). While the mechanism of action for PLAT genes remains unknown, Hyun et al. (2014) showed that AtPLAT1 loss-of-function results in reduced ABA sensitivity, suggesting that this gene acts downstream of the ABA signaling pathway.

All surveyed legumes had a putative syntenic ortholog to the NPD genes of M. truncatula. Lupinus angustifolius belongs to another major legume branch (not included in the LSE pipeline due to a lack of available nodule RNA-seq data), but it also has two PDPs which are syntenic to the constitutive and nodule-specific PDPs in A. duranensis (Figure 2.4b). It is unknown if one of the lupin copies is nodule-specific, but this shows that the duplication that gave rise to the NPDs occurred prior to the Lupinus-Arachis split (~57 mya), relatively early in the history of the legume lineage, which arose ~59 mya (Lavin et al., 2005).

Pislariu et al. have studied NPD1 using a Tnt1 insertion line, and found that the npd1 Tnt1 mutation segregated with a Fix- phenotype when plants were inoculated with E. meliloti strain 2011 (personal communication). In our study, mutant lines with three, four and five NPD knockouts showed nodulation phenotypes, and two-way comparisons among the mutant lines revealed interesting trends. For example, when we compare npd2/4 to npd2/4/5, we can surmise that knocking out NPD5 caused early nodule senescence, which was observed as a bright green color in the base of nodules (Figure 2.6c). The mutant lines in which NPD1 was knocked out always had smaller nodules (either Fix+ or Fix- nodules), which could lead us to predict a role in nodule organogenesis for this gene, or deficient nodule formation as a consequence of other upstream processes. The different nodulation phenotypes observed for each NPD knockout combination let us surmise a non-redundant role and possible subfunctionalization of NPDs, permitted by the tandem duplication that gave rise to five copies. However, these associations between genes and phenotypes would need to be confirmed with future experiments that incorporate additional NPD knockout combinations and additional rhizobial strains. Early results from the Pislariu lab (personal communication) and the observation (first detected in Burghardt et al. 2017) that NPD expression varies depending on rhizobial strain suggest that, indeed, there could be strain-specific interactions of NPD functions.

In our RNA-seq analysis of NPD knockout mutants, the npd2/4 and npd2/4/5 lines showed few DE genes compared to npd1/2/4 (Figure 2.10). Genes related to proteolysis were overexpressed in both npd2/4/5 and npd1/2/4. Proteases are known to play a role in nodule senescence and programmed cell death (van Wyk et al. 2014). Defense response reactions are normally downregulated to accommodate rhizobia inside plant host cells (reviewed by Guinel, 2015). However, several defense response proteins (Medtr1g090690 in npd2/4, Medtr6g471080, Medtr8g045570 in npd1/2/4) were overexpressed in knockout mutants, suggesting that nodule cells may have started to display antagonistic reactions against rhizobia. Lipoxygenase gene induction (Medtr8g018510, Medtr8g018520) in npd1/2/4 agrees with observations from a study on nodule senescence by Van de Velde et al. (2006), and such induction is suggested to trigger the jasmonic acid defense pathway (Cho et al., 2011). Separately, one of the overexpressed proteins in npd1/2/4 was L-asparaginase (Medtr4g109900), which could interfere with normal transport of fixed nitrogen in the asparagine form to aboveground tissues (Vincze et al., 1994). While it is interesting to observe defense-related genes in NPD knockout mutants, the late stage at which nodules were harvested (>28 dpi) means we cannot know whether genes were differentially expressed due to the lack of specific NPDs, or as a downstream result after nodule senescence or nodule deterioration processes were triggered.


2.6.4 Utility of LSE studies

The objective of this study was to design a workflow to systematically identify nodulation-related peptide families that have undergone LSEs across legume species. Previous studies have developed pipelines to investigate LSEs (Jordan et al. 2001; Hanada et al., 2008; Yang et al. 2009) or to detect small ORFs and small secreted peptides (Guillén et al., 2013; Ghorbani et al., 2015; Plett et al., 2017). However, the pipeline we present here combined both approaches by biasing the LSE pipeline towards the detection of nodulation-related small secreted peptide families, with an a priori assumption that they are undergoing constant evolution in gene family size (Silverstein et al., 2006; Zhou et al., 2013). Additionally, we used RNA-seq transcripts as a starting point for downstream analyses, rather than rely on the accuracy of different gene prediction programs and approaches used in annotating each legume species.

The LSE discovery pipeline enables a global and comprehensive view of gene family evolution, as it relates to a specific trait, within a subset of related species. Potentially, the pipeline we developed could be applied to other systems to identify multigene families associated with traits of interest. These types of families might otherwise be overlooked when they form part of long lists of differentially expressed genes, such as the NPDs in Roux et al. (2014), the Phaseoleae-specific nodulins in studies by Libault et al. (2010) and Severin et al. (2010) or the Antimicrobial MBP-1 proteins in Clevenger et al. (2016). Alternatively, multigene families could go unnoticed by Tnt1 screening or genome-wide association studies alone because of redundant functions or dosage-dependent effects. Indeed, the npd2 single gene knockout line did not show a visible nodule or plant phenotype when inoculated with E. meliloti strain 1021 (Figure 2.6c, Table 2.3).

In addition to previously described LSEs, our study discovered novel gene families with nodule-specific expression that are unique to specific Arachis, Medicago, and Glycine lineages. While we have confirmed a role in nodulation for one of the candidate families, more work is still needed for the other LSE families we identified.

Interesting avenues of study would include confirming where these peptides localize and whether they are signaling proteins. Additionally, breakthrough advances in multigene targeted knockout technologies will enable further investigations in these new candidate families.

2.7 Acknowledgements

We would like to thank Colby Starker, for kindly sharing the rhizobial reporter strains used in this study. We would further like to thank Bob Stupar and Junqi Liu for preliminary insights into NPD function in G. max, Liana Burghardt for valuable advice in RNA-seq analysis, Peter Tiffin for assistance with experimental design, and Shaun Curtin for his expertise and guidance in Medicago transformation. Finally, we would like to thank Catalina Pislariu for sharing her ongoing findings in separate research on NPD genes and advice regarding NPD mutant RNA-seq experiments. Computational analyses were conducted using resources from the Minnesota Supercomputing Institute.


2.8 Tables

Table 2.1 Overview of genomic and RNA-seq data sources used in the LSE discovery pipeline

Species         Genome source    Genome version   RNA-seq source   RNA-seq version (accession)
M. truncatula   Phytozome [a]    Mt4.0 [d]        NCBI [b]         PRJNA80163 [d],
T. pratense     Phytozome [a]    v2 [e]           NCBI [b]         PRJNA287846 [i], PRJNA416968
G. max          Phytozome [a]    v2.0 [f]         NCBI [b]         PRJNA79597 [j], PRJNA208048 [k]
P. vulgaris     Phytozome [a]    v2.0 [g]         NCBI [b]         PRJNA210619 [l]
A. duranensis   PeanutBase [c]   v1.0 [h]         NCBI [b]         PRJNA291488 [m]

[a] Phytozome v12 (www.phytozome.net, accessed March 2017)
[b] National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/, accessed March 2017)
[c] PeanutBase (http://peanutbase.org/, accessed September 2016)
[d] Young et al., 2011, Tang et al., 2014
[e] De Vega et al., 2015
[f] Schmutz et al., 2010
[g] Phaseolus vulgaris v2.1, DOE-JGI and USDA-NIFA, http://phytozome.jgi.doe.gov/
[h] Bertioli et al., 2016
[i] Chakrabarti et al., 2016
[j] Libault et al., 2010
[k] Severin et al., 2010
[l] O'Rourke et al., 2015
[m] Clevenger et al., 2016


Table 2.2 Overview of identified LSEs and magnitude of expansion. For G. max (Gm), M. truncatula (Mt), or A. duranensis (Ad) LSEs, the number of members with nodule-enhanced expression (Nod) relative to the total (Tot) number of members is shown. Gene counts are shown in bold and underlined for the plant lineage where the LSE occurred. In families with repetitive regions such as cysteine-rich peptides (CRP) and glycine-rich proteins (GRP), the repetitive amino acids, and most common number of other amino acids spacing them, is indicated in parentheses.

Gene Family                     Gm Tot  Gm Nod  Mt Tot  Mt Nod  Ad Tot  Ad Nod  Notes (Signature)
Aeschynomene NCR-like                1       0       0       0       8       7  CRP (C7C4C3C11C5C1C3C)
Antimicrobial peptide MBP-1          1       0       0       0      97      54  CRP (C3C13C3C)
Bowman Birk Trypsin Inhibitor       15       1      12       1      11       7  CRP (CC2C1C7C1C6C3C2C1C7C1C6C3C)
CaMLs                                3       0       8       7       0       0  Calmodulin-like proteins
CAP superfamily proteins            21       2      15       1      52      26  CRP (C44C5C15C4C9C)
Cystatin                            15       0      31       0      27      14  Cysteine Protease Inhibitor
LEED..PEED / SNARPs                  0       0      13      13       0       0  (LEED..PEED)
Leginsulin/Albumin 1                 3       0      40      11       0       0  CRP (C3C7C4C1C9C19C5C8C12C)
Glycine-Rich Protein 1               3       0      35      31      11       0  GRP (KGG)
Glycine-Rich Protein 2              25       7      16       0      20       0  GRP..CRP (GGGY..C3CC8C2CC)
Nod-specific Cysteine-Rich           1       0     633     598       3       0  CRP (C5C4C7C4C1C)
Phaseoleae-specific nodulin          7       7       0       0       0       0  CRP (C7C18C7C..C7C13C7C)
PLAT Domain Protein                 14       1      13       5       8       1  CRP (C9C..C10C4C45C); *NPD subset characterized in this study


Table 2.3 Wild type and NPD knockout line phenotypes. Above-ground and below-ground traits were measured for the wildtype control and six NPD knockout mutants. Columns show means and standard deviations of trait measurements, while p-values indicate Tukey Honest Significant Difference tests of knockout line phenotypes compared to wild type using ANOVA fitted models. Above-ground length units are mm while nodule area measurements are mm².


2.9 Figures

Figure 2.1 Multiple sequence alignment of NPDs and PDPs. Select M. truncatula (Mt) and G. max (Gm) PLAT Domain Proteins (PDP) and Nodule-specific PLAT Domain (NPD) proteins were aligned with CLUSTALW. JMol Secondary structure predictions are shown for the PDP1 subgroup, related to PLAT-plant stress proteins, the PDP2 subgroup, related to ATS3, and the NPD subgroup that expanded in the Medicago lineage. Green arrows indicate predicted beta sheets while red tubes indicate predicted alpha helices.


Figure 2.2 LSE discovery pipeline. The clustering pipeline to identify LSEs of nodulation-related signaling peptides consisted of three stages to identify a pool of candidate genes with nodule enhanced expression (top panel), expand the pool (middle panel), and curate the potential LSE gene families (bottom panel). Solid boxes represent output at each step, while dashed boxes indicate the software or scripts used at each step.


Figure 2.3 Expansion of Phaseoleae-specific nodulins. The expansion of Phaseoleae-specific nodulin genes (purple arrows) occurred through a whole genome duplication (13 million years ago) and dispersed duplications in the Glycine lineage. In contrast, members of this family expanded through tandem duplications in the Phaseolus lineage. Gray horizontal lines within genomic regions represent genes, while lines connecting genomic regions represent areas of sequence similarity and synteny. Orange and blue arrows represent NPDs and PLAT Domain Proteins, respectively.


Figure 2.4 Expansion of NPDs in the Medicago lineage. Differential expression and phylogenetic analyses of PLAT Domain genes were combined to confirm the LSE status of the NPDs in M. truncatula (Mt), G. max (Gm) and A. duranensis (Ad) (a). Genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW), with transcript levels for Flower, Leaf, Root and Nodule tissues shown in the heatmap (yellow indicates higher expression). Purple side bars specify genes with nodule-enhanced expression. The nodule-specific subset of PLAT domain genes emerged in legumes (NPDs, orange arrows), and expanded to five members in M. truncatula, while a constitutive PLAT domain gene was conserved (blue arrows) (b).


Figure 2.5 NPD gene expression across Medicago accessions and rhizobial strains. NPD gene expression is significantly different across plant accessions and when inoculated with E. meliloti and E. medicae (in comparisons between rhizobial strains, * indicates pval<0.05, ** is pval<0.01, *** is pval<0.001). Note scale on vertical axis differs between different NPD genes.


Figure 2.6 Comparison of M. truncatula wild type and NPD mutant line phenotypes. Multiple CRISPR knockout lines were obtained, targeting between one and five NPD genes (a). NPD knockout lines showed visual differences in nodule morphology at 18 (b) or 30 dpi (c) and plant phenotype at 30 dpi (d). Mutant lines showed significant differences in the total number of pink nodules per plant (e) and plant height (f) at 30 dpi compared to WT (* indicates pval<0.05, ** is pval<0.01, *** is pval<0.001). Representative nodules and plants were chosen for pictures. Bar represents 200 µm in (b), 1 mm in (c) and 10 mm in (d).


Figure 2.7 Correlation between N-fixation traits and plant height. For each plant within a mutant line, the relationship between its cumulative N-fixation zone size and plant height (a) or cumulative petiole length (b) is shown.

Figure 2.8 Rhizobial reporter gene expression in the five-gene NPD knockout line. Rhizobial marker gene expression is shown for 7, 8, and 11 dpi nodules of HM340 control (WT) and NPD five-gene knockout line. Blue staining indicates successful infection (exoY::GUS), entry into symbiosomes (bacA::GUS), or expression of nitrogen fixation genes (nifH:GUS). Scale bars indicate 200 µm.

Figure 2.9 Rhizobial N-fixation gene expression in NPD knockout lines. Expression of nitrogen-fixation genes and bacteroid organization at 18 dpi in HM340 control (WT) versus NPD mutant lines (one- through five-gene knockouts). Light blue staining indicates expression of rhizobial nitrogen fixation genes (nifH:GUS). Toluidine blue (Tol. Blue and Tol. Blue 3x magnified) staining indicates bacteroid organization around a central vacuole in all lines except the NPD five-gene knockout line. Scale bars indicate 200 µm.

Figure 2.10 Differentially expressed genes in M. truncatula wild type and NPD mutant lines. Differentially expressed (DE) genes were identified for npd2/4, npd2/4/5, or npd1/2/4 versus wildtype plants (WT). Genes with adjusted p-values < 0.1 and DE when compared against at least 2 lines were chosen. After scaling count values, yellow indicates high expression while red indicates low expression. Blue bars indicate DE genes in npd2/4, green bars indicate DE genes in npd1/2/4, and red bars indicate DE genes in npd2/4/5. Purple bars indicate DE genes that npd2/4/5 and npd1/2/4 had in common when compared against WT and npd2/4.


Figure 2.11 Genomic localization of genes belonging to LSE families. For A. duranensis (a), G. max (b), and M. truncatula (c), genomic locations of genes belonging to LSE families in that species are shown. Dots represent single genes and are stacked when found in close proximity, within a 700,000 bp window. The NCR gene family, with >600 members in M. truncatula, was excluded from (c).


Chapter 3: Genomic characterization of the LEED..PEEDs, a gene family unique to the Medicago lineage

3.1 Summary

The LEED..PEED (LP) gene family in Medicago truncatula (A17) is composed of 13 genes coding small putatively secreted peptides with one to two conserved domains of negatively charged residues. This family is not present in the genomes of Glycine max, Lotus japonicus or the IRLC species Cicer arietinum. LP genes were also not detected in a Trifolium pratense draft genome or Pisum sativum nodule transcriptome, which were sequenced de novo in this study, suggesting that the LP gene family arose within the past 25 million years. M. truncatula accession HM056 has 13 LP genes with high similarity to those in A17, while M. truncatula ssp. tricycla (R108) and M. sativa have 11 and 10 LP gene copies, respectively. In M. truncatula A17, twelve LP genes are located on chromosome 7 within a 93 kb window, while one LP gene copy is located on chromosome 4. A phylogenetic analysis of the gene family is consistent with most gene duplications occurring prior to Medicago speciation events, mainly through local tandem duplications and one distant duplication across chromosomes. Synteny comparisons between R108 and A17 confirm that gene order is conserved between the two subspecies, though a further duplication occurred solely in A17. In M. truncatula A17, all 13 LPs are exclusively transcribed in nodules and absent from other plant tissues, including roots, leaves, flowers, seeds, seed shells, and pods. The recent expansion of LP genes in Medicago spp. and their timing and location of expression suggest a novel function in nodulation, possibly as an aftermath of the evolution of bacteroid terminal differentiation or potentially associated with rhizobial-host specificity.

Chapter 3 has been adapted from work in the publication:

Trujillo, Diana I., Kevin A. T. Silverstein, and Nevin D. Young. “Genomic Characterization of the LEED..PEEDs, a Gene Family Unique to the Medicago Lineage.” G3 Genes|Genomes|Genetics 4.10 (2014): 2003–2012.

3.2 Keywords

secreted peptides, LEED..PEEDs, nodulation, IRLC

3.3 Introduction

Legumes form symbiotic relationships with nitrogen-fixing soil bacteria in dedicated root organs called nodules. The establishment of this relationship is regulated through signaling processes dependent on concerted transcription reprogramming in the plant host and its symbiotic rhizobial partners (Moreau et al., 2011; Manoury et al., 2010; Karunakaran et al., 2009). This signaling relies on a wide range of secreted compounds such as flavonoids and secreted peptides. Peptides secreted by both the host (Wang et al., 2010) and by the rhizobial partners (Marie et al., 2003) are necessary for communication between the two organisms and successful establishment of nitrogen-fixing nodules.

There are several known secreted peptide families that are specific to legumes or have nodule-specific expression, such as the nodule-specific cysteine-rich peptides (NCRs; Mergaert et al., 2003; Graham et al., 2004), proline-rich proteins (PRPs; Graham et al., 2004), glycine-rich proteins (GRPs; Kevei et al., 2002) and nodulation-related CLAVATA3 (CLV3)/ESR-related peptides (CLEs; Mortier et al., 2011). Some gene families, like the NCRs and a nodule-specific GRP subfamily, are composed of many members but have so far only been found in legume species belonging to the inverted repeat-lacking clade (IRLC; Kevei et al., 2002; Mergaert et al., 2003; Alunni et al., 2007). Mediated by NCRs with antimicrobial function (Van de Velde et al., 2010), most IRLC legumes host terminally differentiated bacteroids that have undergone genome endoreduplication and lost the ability to replicate. Additionally, IRLC legumes have indeterminate nodules with persistent meristems (reviewed by Kereszt et al., 2011).

One secreted peptide family, which we call the LEED..PEEDs, was first classified as legume-specific in a study by Graham et al. (2004) (‘group 567’) based on a series of comparative sequence homology searches between legume and non-legume plants. The name LEED..PEED describes two conserved motifs that characterize this gene family (Figure 3.1); family members are numbered according to their positional order in the M. truncatula A17 genome. Laporte et al. (2010) previously referred to the LEED..PEED family as SNARPs, for small nodulin acidic RNA-binding protein. These authors provide evidence that one member of the LEED..PEED family binds ssRNA non-specifically; however, the term LEED..PEED is a more generic and appropriate description for the family as a whole, as their biological functions remain unknown. In accordance with the Mt4.0 genome annotation (www.jcvi.org/medicago/), we will refer to these genes as LEED..PEEDs, abbreviated as LPs in text and figures.

LP11, described as MtSNARP2 by Laporte et al. (2010), is targeted to the secretory pathway in infected nodule cells. Specifically, the LP11 signal peptide directed GFP to the membrane surrounding infection threads at 6 dpi and to the nuclear envelope at 14 dpi. Suppression of LP11 by RNAi caused aberrant nodules with a hypertrophic outer cortex. Bacteroids in the RNAi lines initially differentiated normally, but started to degenerate at 10 dpi, leading to empty peribacteroid spaces, collapse of nodule cells and early nodule senescence. Thus, Laporte et al. (2010) showed that at least LP11 is necessary for normal nodule development and function.

In the present study, the LP gene family was examined in terms of its phylogenetic, genomic, transcriptional, and sequence features. We find that this gene family, whose members are exclusively transcribed in nodules, primarily arose through tandem duplications prior to Medicago speciation events, but is absent from other legumes outside of the immediate Medicago clade.

3.4 Material and methods

3.4.1 Plant genome and transcriptome sequences

Genome and transcriptome sequences for a range of legume and non-legume species were obtained from various sources in order to search for the presence or absence of LPs. These sequences are summarized in Table 3.1.

3.4.2 Detection of LPs in plant genomes and transcriptomes

SPADA is a homology-based prediction program that accurately predicts small peptides at the genome level. Given a high-quality profile alignment, SPADA identifies nearly all family members with better performance than all general-purpose gene prediction programs (Zhou et al., 2013). SPADA was run with an e-value cutoff of 0.1 on all available plant genomes (Table 3.1) using an HMM profile based on the multiple sequence alignment of group 567 from Graham et al. (2004). Hits were then added to the multiple sequence alignment, which was refined using Muscle software (Edgar, 2004) for subsequent SPADA searches. SPADA searches followed by alignment refinements can be done iteratively to find additional members of a gene family, but in the case of SPADA searches for LPs in M. truncatula A17, no additional genes were found after the first cycle (see Results).
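The iterative search-and-refine strategy described above can be sketched as a loop that stops once a cycle returns no new family members. This is only an illustration: the `search` callable is a hypothetical stand-in for one full cycle (profile building, genome scan, and alignment refinement), and the toy gene names are invented and do not correspond to LP genes.

```python
from typing import Callable, Iterable, Set, Tuple


def iterate_family_search(
    seed_members: Iterable[str],
    search: Callable[[Set[str]], Set[str]],
    max_cycles: int = 10,
) -> Tuple[Set[str], int]:
    """Repeat search-and-refine cycles until no new family members appear.

    `search` is a hypothetical stand-in for one cycle: build a profile from
    the current members, scan the genome, and return every hit.
    Returns the final member set and the number of cycles that were run.
    """
    members = set(seed_members)
    for cycle in range(1, max_cycles + 1):
        hits = search(members)
        new_hits = hits - members
        if not new_hits:
            # Converged: this cycle found nothing beyond the current set.
            return members, cycle
        members |= new_hits
    return members, max_cycles


# Toy data (invented): each gene is only "detectable" from one relative.
TOY_LINKS = {"geneA": {"geneB"}, "geneB": {"geneC"}, "geneC": set()}


def toy_search(members: Set[str]) -> Set[str]:
    """Return the current members plus any gene linked to one of them."""
    found = set(members)
    for member in members:
        found |= TOY_LINKS.get(member, set())
    return found


family, cycles = iterate_family_search({"geneA"}, toy_search)
```

In this toy case the loop needs three cycles, the last of which only confirms that nothing new was found. A search that returns no new members in its first cycle, as reported here for the LPs in A17, would stop after one.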

In order to perform a more exhaustive search of these peptides in legumes other than Medicago spp., tblastn (Altschul et al., 1990) searches were conducted on all the genomes and transcriptomes using M. truncatula A17 LPs as queries. Additionally, the Uniref90 database (http://www.ebi.ac.uk/) was scanned through an HMM search using M. truncatula A17 LPs. For this, LP sequences were used to scan the InterPro protein signature database using InterProScan at www.ebi.ac.uk/Tools/pfa/iprscan/.
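As an illustration of applying an e-value cutoff to search results, the following minimal sketch filters hits from BLAST+ tabular output. It assumes the default 12-column tabular format (`-outfmt 6`); the output format actually used in this study is not stated, and the two example rows are invented for demonstration only.

```python
import csv
import io
from typing import Dict, List

# Column names of the default BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text: str, max_evalue: float = 0.1) -> List[Dict[str, str]]:
    """Keep only hits whose e-value is at or below the cutoff."""
    kept = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        if float(hit["evalue"]) <= max_evalue:
            kept.append(hit)
    return kept


# Invented example rows: one strong hit and one that fails the cutoff.
example = (
    "queryA\tscaffold_1\t95.0\t60\t3\t0\t1\t60\t1200\t1379\t2e-15\t78.2\n"
    "queryA\tscaffold_9\t31.0\t42\t29\t0\t5\t46\t880\t1005\t3.4\t25.0\n"
)
hits = filter_hits(example, max_evalue=0.1)
```

With the cutoff of 0.1 used in the text, only the first invented row is retained.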

3.4.3 Trifolium pratense DNA-seq and Pisum sativum RNA-seq analysis

Sterilized T. pratense ‘Marathon’ seeds were planted in commercial soil mix. Leaves were collected after one week and frozen at -80°. DNA was extracted from frozen tissue using a Qiagen DNeasy Plant Mini Kit following manufacturer instructions.

Sterilized P. sativum ‘Little Marvel’ seeds were planted in sterilized Leonard jars (Leonard, 1943) containing vermiculite and perlite (3:1) and immediately inoculated with Rhizobium leguminosarum bv. viciae USDA 2370 at a concentration of 10⁸ CFU/seed. Plants were watered with 0.25X Hoagland’s nitrogen-free solution (Hoagland and Arnon 1950) and grown in a growth chamber at 22° and with a 16 h photoperiod. Nodules were harvested after 30 days and frozen at -80°. RNA was extracted from frozen tissue using a Qiagen RNeasy Plant Mini Kit following manufacturer instructions.

RNA and DNA samples were sent to the University of Minnesota Genomics Center for library preparation and sequencing using an Illumina HiSeq 2000. Approximately 92 (51 bp) and 88 (101 bp) million T. pratense paired-end reads and 98 (51 bp) and 96 (101 bp) million P. sativum paired-end reads were obtained. All reads were trimmed with Trimmomatic software (Bolger et al., 2014) to a minimum quality score of 20 from each end and a minimum average quality of 20 using a 4 bp sliding window. Trimmed reads smaller than 40 bp were discarded. T. pratense reads were assembled using ABYSS (Simpson et al., 2009) with default parameters and a kmer size of 33. P. sativum transcript assembly was conducted using Trinity software (Grabherr et al., 2011) with default parameters. Assembled sequences as well as raw reads were searched for A17 LP genes using SPADA as described earlier and tblastn with an e-value of 0.1.
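The trimming criteria above (end trimming at quality 20, a 4 bp sliding window with a minimum average quality of 20, and a 40 bp minimum length) can be illustrated with a simplified sketch. This is not Trimmomatic's implementation, and the order of the steps is an assumption; it only shows how the three thresholds interact on a single read.

```python
from typing import List, Optional, Tuple


def trim_ends(quals: List[int], min_q: int = 20) -> Tuple[int, int]:
    """Return (start, end) after dropping bases below min_q from both ends."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


def sliding_window_cut(quals: List[int], window: int = 4,
                       min_mean: float = 20.0) -> int:
    """Return how many bases to keep: cut at the first window whose mean
    quality drops below min_mean."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean:
            return i
    return len(quals)


def trim_read(seq: str, quals: List[int], min_q: int = 20, window: int = 4,
              min_mean: float = 20.0,
              min_len: int = 40) -> Optional[Tuple[str, List[int]]]:
    """Apply end trimming, then the sliding-window cut, then the length
    filter. Returns None when the trimmed read is shorter than min_len."""
    start, end = trim_ends(quals, min_q)
    seq, quals = seq[start:end], quals[start:end]
    keep = sliding_window_cut(quals, window, min_mean)
    seq, quals = seq[:keep], quals[:keep]
    if len(seq) < min_len:
        return None
    return seq, quals
```

A 50 bp read whose first two and last four bases fall below quality 20 would be trimmed to 44 bp and kept, whereas a read with only 20 high-quality bases would be discarded by the 40 bp length filter.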

3.4.4 Transcript levels of LPs in M. truncatula

Illumina RNA-seq single-end reads 36 bp in size from root, nodule, seed, leaf blade, vegetative bud, and flower tissues were obtained from the Sequence Read Archive at NCBI (Young et al., 2011; Accession SRP008485). RNA-seq reads were trimmed by a sliding window of 1 bp from the 3’ ends until a quality score of 20 was reached. Filtered reads were mapped to the M. truncatula 4.0 reference genome using Tophat (Trapnell et al., 2009) with maximum indel sizes of 4 bp and minimum and maximum intron lengths of 20 and 2000 bp, respectively. Cufflinks (Trapnell et al., 2010) was run with a maximum intron length of 2000 bp using the multi-read correct option and a reference annotation containing the 13 LPs.

3.4.5 Synteny and collinearity comparisons

Regions of macro-synteny between M. truncatula v4.0 and G. max v1.1 genomes were identified and visualized using MUMmer 3 software (Kurtz et al., 2004) with default parameters. Gene homology patterns within these regions were analyzed using GEvo (Lyons and Freeling, 2008; default parameters) and visualized with mGSV software (Revanna et al. 2012), filtering out small syntenic regions (<500 bp) and joining consecutive syntenic regions within a single gene.

LPs 2-13 from M. truncatula ssp. tricycla R108 were located on a single scaffold (848) which was annotated through a best-hit blastn search using A17 annotated gene sequences. Synteny comparisons were conducted between A17 and R108 using low-resolution custom R scripts, which provided analyses similar to synteny detection with GEvo (Lyons and Freeling, 2008) and visualization with mGSV (Revanna et al., 2012), described above. Dotplot comparisons were made between A17 and R108 or against themselves, using Gepard software with default parameters (Krumsiek et al., 2007).

3.4.6 Phylogenetic analysis of LP sequences

LP amino acid sequences of M. truncatula A17 and HM056 and M. truncatula ssp. tricycla R108 were aligned with ClustalW using MEGA version 5 (Tamura et al., 2011). The corresponding nucleotide alignment was then trimmed with removal of gapped columns, leaving 144 nucleotides to use in phylogenetic tree construction (Figure 6.16). Independently, M. sativa assembled genes were aligned with the above sequences, and trimmed to obtain 87 bp of aligned sequence for tree construction (Figure 6.17). Phylogenetic analyses were conducted using Maximum Likelihood in MEGA5 and using Bayesian Inference with MrBayes version 3.1 software (Ronquist and Huelsenbeck, 2003). For both approaches, trees were inferred based on the General Time Reversible model of evolution with gamma-distributed rate variation and a proportion of invariable sites. In the Bayesian phylogenetic analysis, congruence was reached with 300,000 generations and sampling every tenth generation. Phylogenetic trees were visualized using FigTree (http://www.tree.bio.ed.ac.uk/software/figtree/). Using the Bayesian Inference trees, gene duplication histories in the A17 cluster on chromosome 7 were determined and displayed with DILTAG software (Lajoie et al. 2010).
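The removal of gapped columns from an alignment, as described above, can be sketched as follows. This is a minimal illustration and not the MEGA procedure itself; the three toy sequences are invented, and an alignment is assumed to be a mapping from sequence name to an aligned string of equal length.

```python
from typing import Dict


def remove_gapped_columns(alignment: Dict[str, str],
                          gap_chars: str = "-") -> Dict[str, str]:
    """Drop every alignment column in which at least one sequence has a gap."""
    rows = list(alignment.values())
    if not rows:
        return {}
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of columns that are gap-free in every sequence.
    keep = [i for i in range(length)
            if all(row[i] not in gap_chars for row in rows)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Invented toy alignment: columns 1 and 3 each contain a gap.
toy_alignment = {
    "seq1": "ATG-CA",
    "seq2": "ATGGCA",
    "seq3": "A-GGCA",
}
trimmed = remove_gapped_columns(toy_alignment)
```

Here the two gapped columns are removed and four columns remain for every sequence, in the same way that the LP alignment was reduced to 144 gap-free nucleotide positions.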

3.4.7 Accession Numbers

All raw sequence read data have been deposited at NCBI (http://ncbi.nlm.nih.gov/) under BioProject accession numbers PRJNA257076 and PRJNA257308.

3.5 Results

3.5.1 The LP gene family is specific to the Medicago lineage

In M. truncatula A17, LPs detected by SPADA ranged from 66 to 89 amino acids in length. The LPs have an average size of 75 amino acids, with a signal peptide of ~23 amino acids. They all share a small domain of negatively charged glutamic acid and aspartic acid residues (E,D; red) followed by a tryptophan residue (W; orange). The C-terminal end contains another small domain of negatively charged residues in most but not all LPs (Figure 3.1).

LP sequences were detected in M. truncatula accessions A17 (reference genome) and M. truncatula HM056 (phylogenetically close to A17) as well as in M. truncatula ssp. tricycla (R108) and M. sativa. Using an e-value cutoff of 0.1, no LP hits were produced in any of the other SPADA or tblastn searches of available legume genomes or transcriptomes, including C. arietinum, T. pratense and P. sativum, which are all IRLC legumes.

Only 11 LPs were initially detected in the pre-release genome of M. truncatula HM056, which is very closely related to A17, suggesting that the HM056 scaffolds containing LPs might be misassembled. To verify this, raw HM056 reads were aligned to the A17 LP DNA regions using Bowtie2 (Langmead and Salzberg, 2012), changing default parameters to allow reads to only be mapped once. HM056 reads mapped to A17 LPs were visualized with IGV (Thorvaldsdóttir et al., 2012) and then HM056 LPs were manually assembled based on the visual comparison. All 13 A17 LPs were found to have HM056 LP orthologs, generally with zero to one SNP in the coding region (Figure 6.17). Eleven LPs were detected in the pre-release genome of R108 and 10 were detected in M. sativa after manual assembly of Illumina genome reads (Figure 6.18).

The preliminary genome assembly of T. pratense we created had only ~60X average coverage, yielding a low N50 of around 1500 bp. To determine if this genome assembly strategy would be sensitive enough to detect LPs, M. truncatula ssp. tricycla R108 raw reads were subsampled to a level comparable to the T. pratense 101 bp sample, re-assembled and searched for discovery of LP genes. For this, an initial set of 165.7 million 100 bp R108 reads was subsampled twice to obtain sets of 80 million paired-end reads. The subsamples were then assembled with ABYSS using default parameters and a k-mer sweep. Assemblies were then searched for LPs using tblastn with an e-value of 0.1.
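For readers unfamiliar with the N50 statistic mentioned above, a minimal sketch of its computation follows. N50 is taken here as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length; the example lengths are invented.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    return lengths[-1]  # unreachable for non-empty input; kept for clarity


# Invented contig lengths summing to 54; the three longest sum to 27.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A low N50, such as the ~1500 bp reported for the T. pratense assembly, means that half of the assembled bases lie in contigs of roughly that length or shorter, which is why the sensitivity check with subsampled R108 reads was needed.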

With this procedure, all R108 LPs were detected in both assemblies from subsampled reads, indicating that LP genes can be correctly assembled and discovered at equivalent read coverage. By contrast, tblastn searches in T. pratense and P. sativum raw reads confirmed the lack of any LP sequence homology in either species. Non-legume genomes, transcriptomes, and the Uniref90 database also did not contain sequences with homology to LPs. Finally, no LPs were found when searching the InterPro Database with any of the 13 M. truncatula A17 LP sequences.

3.5.2 Genomic architecture of the LP gene family

In M. truncatula A17, LP1 is located on chromosome 4, while LPs 2-13 are located in a 93 kbp region on chromosome 7. Neighboring regions of LP1 on A17 chromosome 4 showed synteny with G. max chromosomes 8 and 15 (visualized in dotplot comparisons in Figure 6.19a and Figure 6.19b, respectively). Directly neighboring the LP1 region on A17, an analysis of corresponding syntenic regions of G. max shows that LP1 in A17 (Figure 3.2a, red arrow) is bordered by two sets of genes on either side that have multiple copies (purple arrows). An LP ortholog is not present within either of the syntenic genomic regions in G. max.

Regions of synteny directly neighboring the LP 2-13 region on A17 chromosome 7 were only found on G. max chromosome 8 (Figure 6.19c), with more distantly related regions on chromosome 18 (Figure 6.19d). A comparison to G. max chromosome 8 and C. arietinum scaffold 451 reveals that the area has undergone numerous tandem duplication events in all three species (Figure 3.2b). Thus, all three species have four to five copies of a flanking gene belonging to the protein kinase family (purple arrows), though some copies have changed orientation during duplication. On the other hand, A17 has an additional set of duplicated genes (the 12 tandem LPs on chromosome 7, red arrows) that is completely absent from G. max and C. arietinum. In M. truncatula, all of these LPs are in the same orientation.

A synteny comparison between accessions A17 and R108 (Figure 3.2b) then revealed that long tracts of synteny are present in the LP 2-8 region, while in the LP 9-13 region, synteny is interrupted in noncoding regions. The presence of ten syntenic paralogous genes between R108 and A17 suggests that the expansion of the LP gene family in the region occurred largely before the Medicago subspecies split.

A dotplot analysis (Figure 3.3) shows the location of duplications in the genomic regions surrounding LPs 2-13 in A17 and R108. The duplication of a region encompassing LP2 gave rise to LP3 in both A17 and R108. A more recent duplication of a region encompassing LPs 3 and 6 occurred solely in the A17 lineage, giving rise to LPs 4 and 5 (Figure 3.3a), which are absent from R108 (Figure 3.3b). Large regions of colinearity between the two subspecies are seen surrounding LPs 2-8, with a degradation of colinearity around LPs 9-13 (Figure 3.3c).
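To make the reading of these dot plots concrete, the following minimal sketch shows how a tandem duplication appears as an off-diagonal line in a self-comparison. It is an illustration only and not the software used to produce Figure 3.3; the function name `kmer_dots`, the word size, and the toy sequence are all assumptions.

```python
from collections import defaultdict


def kmer_dots(seq_x, seq_y, k):
    """Return sorted (i, j) positions where seq_x[i:i+k] == seq_y[j:j+k]."""
    # Index every k-mer of seq_y by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_y) - k + 1):
        index[seq_y[j:j + k]].append(j)
    # Look up each k-mer of seq_x and record the matching coordinates.
    dots = []
    for i in range(len(seq_x) - k + 1):
        for j in index.get(seq_x[i:i + k], []):
            dots.append((i, j))
    return sorted(dots)


# Toy sequence carrying a tandem duplication of the 8 bp block "ACGTTGCA".
toy = "GG" + "ACGTTGCA" + "ACGTTGCA" + "CC"

# Self-comparison: the main diagonal is trivial; dots above it mark repeats.
self_dots = kmer_dots(toy, toy, k=6)
off_diagonal = [(i, j) for i, j in self_dots if i < j]
```

In this toy case the off-diagonal dots form a short line offset by 8 positions from the main diagonal, which is the length of the duplicated block. Comparing two different sequences with the same function gives the between-accession view, where an unbroken diagonal corresponds to colinearity.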

3.5.3 Phylogenetic relationship between LP genes

In order to determine the evolutionary relatedness of the LPs, an unrooted phylogenetic tree was constructed based on an alignment of their nucleotide sequences (Figure 6.16) using Bayesian Inference and Maximum Likelihood approaches. Inferred relationships between genes were largely similar using the two approaches, though the relationship of LPs 7-8 was unresolved using the Maximum Likelihood approach (data not shown). Figure 3.4a shows the tree inferred for A17, HM056 and R108 using Bayesian Inference, with posterior probabilities at the nodes. Including M. sativa in the analysis resulted in fewer resolved nodes (Figure 6.20), probably due to the incomplete LP gene sequence information for this species (Figure 6.18).

Relationships among LP orthologs reflect the established phylogenetic relatedness between the accessions (Yoder et al., 2013). As expected, A17 and HM056 sequences tend to cluster together due to a closer phylogenetic relationship between the two accessions, while M. truncatula ssp. tricycla and M. sativa sequences are less closely related (Figure 6.20). The tree topology shows an order of relatedness for LPs 2-13 which is in concordance with tandem gene duplications, most of which took place before Medicago speciation events. Another duplication appears to have occurred after the M. sativa speciation event, giving rise to LP8 in M. truncatula (Figure 6.20). An additional duplication occurred solely in the M. truncatula A17 and HM056 lineage in a genomic region encompassing two genes, giving rise to LPs 3-6 (Figure 3.4b), as previously indicated by the dot-plot analysis (Figure 3.3a).

3.5.4 The LEED..PEED family is nodule-specific in M. truncatula

LP genes 1-3 and 5, LP genes 4 and 6-8, and LP genes 9-13 cluster into separate groups (Figure 3.4a), which led us to investigate whether genes belonging to the different clusters differ in the timing of their nodule expression. Expression levels for LPs 10 and 11 (Figure 3.4a inset, http://mtgea.noble.org/v3/ and Roux et al. 2014) become noticeable early during nodulation, by 4 dpi, and peak at 6 dpi (shown as dark red bars in the heatmap), while LPs 1, 3, 6 and 7 had higher expression levels which peaked at 10 dpi and were maintained through 20 dpi. At the spatial scale, a clear difference in expression trends among nodule sections was seen between sets of genes. LPs 8 and 9, which belong to different phylogenetic clusters, both showed enriched expression in the distal (FIId) and proximal (FIIp) fractions of nodule zone II, which contains bacterial and nodule cells undergoing infection (FIId) and differentiation (FIIp). LPs 10-13 were more highly expressed in FIIp, which contains rhizobia undergoing endoreduplication, and in interzone II-III (IZ II-III). LPs 1-3, 6 and 7 had higher transcript levels in IZ II-III and the nitrogen fixation zone (ZIII), which contains fully differentiated bacteroids. None of the LP genes were expressed in the nodule meristem (FI). An analysis of transcript levels across several tissues showed that all 13 LPs are transcribed in M. truncatula A17 nodules at high levels, with little to no expression in roots or other tissues (Table 3.2).
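The nodule specificity summarized above can be put into a single number per gene. The sketch below uses the FPKM values from Table 3.2 and computes, for each LP, the fraction of its summed FPKM that comes from the nodule sample. This fraction is an illustrative measure introduced here, not a statistic reported in the study, and the names `fpkm`, `NODULE` and `nodule_fraction` are hypothetical.

```python
# FPKM values copied from Table 3.2, in the column order given there;
# the second column (index 1) is the nodule sample.
fpkm = {
    "LP1": [2, 6120, 0, 0, 0, 0],
    "LP2": [0, 100, 0, 0, 0, 0],
    "LP3": [3, 1922, 0, 0, 0, 0],
    "LP4": [0, 623, 0, 0, 0, 0],
    "LP5": [5, 8805, 0, 0, 0, 0],
    "LP6": [0, 973, 0, 0, 0, 0],
    "LP7": [2, 1593, 0, 0, 0, 0],
    "LP8": [0, 104, 0, 0, 0, 0],
    "LP9": [0, 113, 0, 0, 0, 0],
    "LP10": [6, 396, 2, 0, 0, 0],
    "LP11": [0, 842, 0, 0, 0, 0],
    "LP12": [0, 102, 0, 0, 0, 0],
    "LP13": [6, 2087, 0, 0, 0, 0],
}
NODULE = 1  # index of the nodule column


def nodule_fraction(values):
    """Share of a gene's summed FPKM contributed by the nodule sample."""
    total = sum(values)
    return values[NODULE] / total if total else 0.0


fractions = {gene: nodule_fraction(values) for gene, values in fpkm.items()}
least_specific = min(fractions, key=fractions.get)
```

By this measure the least nodule-restricted member is LP10, which still draws about 98% of its summed FPKM from nodules (396 of 404).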

3.6 Discussion

Like the NCRs and GRPs, the LEED..PEED gene family is specific to IRLC legumes. In the case of LPs, however, this lineage-specific expansion is found in a much narrower range of species. Members of the IRLC group form indeterminate nodules with persistent meristems, while a subset of these legumes, including Medicago spp., host terminally differentiated bacteroids (Mergaert et al., 2006). Chickpea is an IRLC legume with indeterminate nodules, although its rhizobia do not terminally differentiate (Oono et al., 2010). SPADA and tblastn searches of the available C. arietinum genome, which has an estimated 90.8% gene coverage (Varshney et al., 2013), suggested that LPs are absent in this species. Additionally, synteny analysis of the scaffold syntenic to the LP 2-13 region shows a clear absence of LP genes in C. arietinum (Figure 3.2). Furthermore, LPs were not detected in Trifolium and Pisum, which are even more closely related to Medicago and also host terminally differentiated rhizobia. The absence of LP genes in Cicer, Trifolium and Pisum suggests that these proteins are not essential determinants of indeterminate nodule formation or bacteroid terminal differentiation, though they may have arisen as a consequence of these traits. Given that Trifolium and Pisum species are both nodulated by Rhizobium leguminosarum while Medicago species are nodulated by Ensifer meliloti (syn. Sinorhizobium meliloti), the biological function of LP genes could be related to the interaction of Medicago species with their rhizobial partner. Melilotus and Trigonella, sister genera to Medicago in the Trigonellinae (Steele et al., 2010), also associate with E. meliloti (Roumiantseva et al., 2002), so it will be interesting to determine whether these legumes have LP genes (preliminary, unpublished data indicate that LP genes may be found within Melilotus).

The LP genes were considered to be of particular interest because all 13 M. truncatula A17 genes had very high expression in nodules compared to other tissues, based on RNA-seq data from Young et al. (2011; Table 3.2). Phylogenetic trees of the LPs show that genes 1-3 and 5, genes 4 and 6-8, and genes 9-13 of M. truncatula A17 cluster separately, suggesting possible functional differences between the groups. In our analysis of nodulation time-series expression data available at http://mtgea.noble.org/v3/, transcription of LPs 10 and 11 began and peaked earlier than that of LPs 1, 3, 4, and 7. Other studies have shown that LPs 1 and 13 have distinct expression patterns, being activated in mature and immature nodules, respectively (Manoury et al., 2010), and that LP 11 is directed to membranes surrounding infection threads (Laporte et al., 2011). LP genes 8-13 had higher expression in the infection zone with cells undergoing differentiation. Thus, the gene cluster containing LPs 9-13, as well as LP8, may have an earlier role in nodulation. Potentially, LPs in this cluster might be necessary for the maintenance of functional bacteroids as they undergo differentiation, as suggested by an aberrant nodulation phenotype after LP 11 was suppressed (Laporte et al., 2010). Notably, an observable phenotype after suppression of just a single LP gene suggests that these genes may have distinct, non-redundant functions.

The LP gene family arose after the Pisum-Medicago split, which has been estimated to have occurred ~25 Mya (Lavin et al., 2005). It appears the LP gene family evolved by tandem duplication within this time frame. The comparatively rapid expansion and subsequent fixation of multiple LP copies suggests that higher LP copy numbers have provided a selective advantage to Medicago plants. The two most recent duplications occurred less than 7 Mya, one after the M. sativa speciation event (time estimate based on matK substitution rates, Lavin et al., 2005) and one after the much more recent A17 – R108 split.

Possible explanations for how the LP gene family emerged include de novo evolution, domain shuffling and horizontal gene transfer (HGT). None of these can be ruled out. De novo families that have undergone lineage-specific expansion typically have structurally simple domains such as α-helices or histidine/cysteine-rich regions that stabilize molecules (Lespinet et al., 2002). Though the LPs lack such features, their small size may counteract the need for strict protein stabilization. At this point, there is no evidence for HGT from another species. Codon usage patterns, which can be used to distinguish native genes within a species from foreign genes, were calculated for the mature peptide region of LP genes using CAIcal software (Puigbò et al., 2008). These values did not stray from the average of all M. truncatula genes (data not shown). However, the short length of these genes and/or the amount of time since emergence may make it impossible to rule out the possibility of HGT through a codon usage index analysis alone. Evidence for domain shuffling was also not found. Though the LPs could have acquired their signal peptide from another region in the genome, the mature peptide region of LP genes does not have homology to any other gene within Medicago or its immediate Trifolium and Pisum relatives.

Lineage-specific expanded gene families often have roles in an organism’s response to stress or pathogens, either as structural components or as mediators of specificity within signaling pathways (Lespinet et al., 2002). Gene duplication provides new material on which selection can act, without harming the original function of the gene. A selection pressure that favors high gene copy numbers is often associated with dN/dS ratios greater than 1 (positive selection, in which amino acid changes are advantageous), due to an expansion event followed by diversification for specificity-related roles. However, an analysis of dN/dS ratios revealed that the LP gene family as a whole tends toward purifying selection (dN/dS below 1; data not shown), with amino acid changes being deleterious. Perhaps amplification of this family was driven by the genomic region rather than a biological need for a diverse set of LPs. Whatever the driving force behind rising copy numbers, the LPs are a lineage-specific innovation that has been directed toward a function in nodulation.
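The distinction between synonymous and nonsynonymous change that underlies dN/dS can be shown with a small sketch. The code below only classifies single-nucleotide codon differences between two aligned, gap-free coding sequences using the standard genetic code. It is not a full dN/dS estimate, which would also normalize by the number of synonymous and nonsynonymous sites and correct for multiple substitutions, and it is not the method used for the analysis reported above (which is not specified here). The function name and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_differences(seq_a, seq_b):
    """Count codons that differ at exactly one position, split into
    (synonymous, nonsynonymous). Codons with 2 or 3 differences are skipped."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned, gap-free, and a multiple of 3 long")
    syn = nonsyn = 0
    for pos in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[pos:pos + 3], seq_b[pos:pos + 3]
        mismatches = sum(x != y for x, y in zip(codon_a, codon_b))
        if mismatches != 1:
            continue  # identical codons, or multi-hit codons this sketch ignores
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy pair: CTT->CTC (Leu->Leu), GAA->GAG (Glu->Glu), GAT->GGT (Asp->Gly).
syn, nonsyn = classify_differences("CTTGAAGAT", "CTCGAGGGT")
```

Under purifying selection, nonsynonymous changes of this kind are removed more often than synonymous ones, which is what pushes dN/dS below 1.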

The LP gene family has undergone recent expansion, mediated through one distant duplication and several rounds of local tandem duplication. Likewise, most GRPs and NCRs in M. truncatula are found in local clusters, generally in the same orientation (Alunni et al., 2007). LPs differ from NCRs and GRPs in that there are fewer peptides in this gene family, with comparatively lower variation in copy number across Medicago species. Another difference is the much narrower range of legume species in which LPs are found. The lack of sequence similarity of LPs with genes in any other legume plants, including Pisum and Trifolium, suggests that this gene family emerged de novo (though its origins remain unclear), expanded rapidly, and became fixed in relatively short evolutionary time. Additionally, it appears to be specifically directed toward nodulation or rhizobial interactions.

This study supports the use of comparative bioinformatic approaches towards identifying genes of potential biological interest. Future studies should focus on the different biological roles of the LP members and determining whether these proteins are present in any other legume species that are closely related to Medicago, such as Melilotus and Trigonella species.

3.7 Acknowledgements

Computational work for this project was conducted through the Minnesota Supercomputing Institute at the University of Minnesota. We thank Peng Zhou for R scripts used in generating Figure 3.2b, and Joseph Guhlin and Peter Tiffin for useful discussions. This work was supported by grant DBI-1237993 from the National Science Foundation and by graduate fellowship support from the Plant Biological Sciences program of the University of Minnesota.

3.8 Tables


Table 3.1 Summary of analyzed plant genomes and transcriptomes

| Species / accession | LPs detected | Genome source | Genome version / accession | Transcriptome source | Transcriptome version / accession |
| --- | --- | --- | --- | --- | --- |
| Medicago | | | | | |
| A17 (M. truncatula) | Yes | JCVI (a) | Mt4.0v1 | NCBI (b) | SRP008485 |
| HM056 (M. truncatula) | Yes | NCBI (b) | PRJNA256006 | -- | -- |
| R108 (M. t. tricycla) | Yes | NCBI (b) | PRJNA256006 | -- | -- |
| HM102 (M. sativa) | Yes | NCBI (b) | PRJNA256006 | LIS (e) | v1.0 |
| Trifolium pratense | No | NCBI (b) | PRJNA257076 | Nagy et al. 2013 | |
| Pisum sativum | No | -- | -- | NCBI (b) | PRJNA257308 |
| Cicer arietinum | No | NCBI (b) | v1.0 / PRJNA175619 | LIS (e) | v2.0 |
| Lotus japonicus | No | Kazusa (c) | Lj2.5 | DFCI-GI (f) | Release 6.0 |
| Cajanus cajan | No | NCBI (b) | v1.0 / PRJNA72815 | LIS (e) | v2.0 |
| Glycine max | No | NCBI (b) | v1.1 / PRJNA19861 | DFCI-GI (f) | Release 16.0 |
| Phaseolus vulgaris | No | NCBI (b) | v1.0 / PRJNA41439 | DFCI-GI (f) | Release 4.0 |
| Populus trichocarpa | No | Phytozome (d) | JGI v3.0 | DFCI-GI (f) | Release 5.0 |
| Arabidopsis thaliana | No | NCBI (b) | TAIR10 / PRJNA10719 | DFCI-GI (f) | Release 15.0 |
| Oryza sativa | No | Phytozome (d) | MSU release 7 | DFCI-GI (f) | Release 19.0 |

(a) J. Craig Venter Institute (http://www.jcvi.org/medicago/)
(b) National Center for Biotechnology Information (http://ncbi.nlm.nih.gov/)
(c) Kazusa DNA Research Institute (http://www.kazusa.or.jp/lotus/)
(d) Phytozome v9.0 (www.phytozome.net, accessed January 2013)
(e) Legume Information System (comparative-legumes.org/)
(f) Dana Farber Institute – Gene Indices (compbio.dfci.harvard.edu/tgi/tgipage.html)


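Table 3.2 Transcript abundance (FPKM) of LP genes in six M. truncatula A17 tissues

| Gene | Root | 4WK Nodule | Seed Pod | Blade | 4Wk Bud | 4WK Open Flower |
| --- | --- | --- | --- | --- | --- | --- |
| LP1 | 2 | 6120 | 0 | 0 | 0 | 0 |
| LP2 | 0 | 100 | 0 | 0 | 0 | 0 |
| LP3 | 3 | 1922 | 0 | 0 | 0 | 0 |
| LP4 | 0 | 623 | 0 | 0 | 0 | 0 |
| LP5 | 5 | 8805 | 0 | 0 | 0 | 0 |
| LP6 | 0 | 973 | 0 | 0 | 0 | 0 |
| LP7 | 2 | 1593 | 0 | 0 | 0 | 0 |
| LP8 | 0 | 104 | 0 | 0 | 0 | 0 |
| LP9 | 0 | 113 | 0 | 0 | 0 | 0 |
| LP10 | 6 | 396 | 2 | 0 | 0 | 0 |
| LP11 | 0 | 842 | 0 | 0 | 0 | 0 |
| LP12 | 0 | 102 | 0 | 0 | 0 | 0 |
| LP13 | 6 | 2087 | 0 | 0 | 0 | 0 |

FPKM values from RNA-seq expression analysis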


3.9 Figures

Figure 3.1 Multiple sequence alignment of A17 LP peptides. The alignment was generated by ClustalW (Larkin et al. 2007) and viewed with Jalview software (Waterhouse et al. 2009). The signal peptide sequence and conserved regions are indicated by arrows, with the consensus sequence displayed under the LEED..PEED motifs.


Figure 3.2 Synteny comparisons between the chromosomal regions of LPs 1-13 in M. truncatula A17 and corresponding regions in G. max and C. arietinum. Shaded bars indicate synteny between the A17 region surrounding LP 1 on chromosome 4 and G. max chromosomes 15 and 8 (a), and between the A17 chromosome 7 region surrounding LPs 2-13 and G. max chromosome 8 and C. arietinum scaffold 451 (b). LP genes are shown in red, and the region containing them is surrounded by red boxes. Neighboring genes that have also undergone tandem duplications are shown in purple, while non-duplicated neighboring genes are shown in green. In (b), the ~93 kbp LP 2-13 region of A17 chromosome 7 is magnified and compared against a ~100 kbp region of Scaffold 848 of R108. Shaded lines between chromosomes indicate syntenic regions.


Figure 3.3 Dot plot analysis of a ~1 kbp region in Chromosome 4 and ~100 kbp region in Chromosome 7 of M. truncatula R108 and A17. Black diagonal lines indicate duplicated regions within A17 (a) or R108 (b) or sequence colinearity between the two organisms (c). Red horizontal and vertical lines indicate the borders of duplicated and collinear regions.


Figure 3.4 Evolutionary expansion of the LP gene family. The phylogenetic tree of A17, HM056, and R108 LP nucleotide sequences was generated through a Bayesian phylogenetic approach (a). Posterior probability values of the clades are indicated at the nodes. The map insets show spatial (microdissected nodule sections, Roux et al. 2014) and temporal (nodule samples taken at various time points post-inoculation, Benedito et al. 2008 and Carvalho et al. unpublished data at http://mtgea.noble.org/v3/) expression patterns for LP genes, with dark red indicating a higher transcription level for each time point or nodule section. The duplication history for A17 LP genes was inferred for the Bayesian Inference trees using DILTAG software (b). Rounded squares indicate duplication events, while the rounded rectangle indicates a double duplication.


4 Conclusions

The legume-rhizobia symbiosis is a tightly regulated process that combines structured host organogenesis with a controlled infection process. Legumes have evolved to accommodate rhizobia within host nodule cells, providing photosynthesis-derived carbon in exchange for fixed nitrogen. The initial recognition steps and signals that lead to the establishment of symbiosis are generally well understood (reviewed by Gage, 2004; Oldroyd et al., 2011). However, signals that are exchanged during later stages of infection and for maintenance of nodule function are less clear (Manoury et al., 2010; Sinharoy et al., 2013). In M. truncatula and other IRLC legumes, it is believed that the host continues to communicate with rhizobia and determines rhizobial fate by directing nodule-specific cysteine-rich peptides (NCRs) toward symbiosomes. To my knowledge, however, no one had systematically explored whether other legume species also produce unique sets of secreted peptides that may be targeted to symbiosomes. Using publicly available genomic and transcriptomic data, I set out to classify and catalogue the peptide families that legumes may be using to communicate with rhizobia within functioning nodules.

Repertoires of nodule-specific small secreted peptides varied across the five surveyed legume taxa, and thirteen gene families underwent lineage-specific expansions unique to either Arachis, Glycine or Medicago lineages (Figure 2.11). The Arachis and Medicago lineages showed a higher number of expansions than the Phaseolus lineage, and nine of the thirteen LSE families had cysteine-rich motifs. Families that did not have cysteine-rich motifs included the LEED..PEEDs, calmodulin-like proteins (CaMLs), glycine-rich proteins (GRPs), and cystatins (Table 2.2). Gene family expansions occurred mostly through tandem duplications, though dispersed duplications and whole genome duplications were also contributing factors.

Of the thirteen families that underwent LSEs, a defined biological role has been explored only in NCRs and CaMLs. For NCRs, the reason behind the high number of members of this family in IRLC legumes has been a driving question. Montiel et al. (2017) determined that the number of NCRs expressed or detected in different legume species seems to be associated with the degree of rhizobial differentiation. NCR peptides have diverged between and within species, allowing divergence of gene function. For example, some NCR peptides have antimicrobial properties (Van de Velde et al., 2010) while others are thought to protect the rhizobia in the symbiosome environment (Kim et al., 2015). In contrast, the CaMLs underwent a much smaller expansion resulting in six genes (Liu et al., 2006). Nonetheless, this family also shows signs of gene divergence, with differences in the number of calcium binding domains across CaML gene copies. Thus, duplication followed by diversification of nodulation-related genes may allow the fine tuning of nodule organogenesis or of host-symbiont interactions.

I decided to characterize the NPD subset of the PLAT domain protein family, which expanded exclusively in the Medicago lineage, to determine whether the expansion of these nodule-specific members had a role in M. truncatula nodule development. Using a multiplex targeted genome editing strategy, I obtained a diverse set of plant knockout lines with cumulative disruptions in one through all five NPD members, which provided a valuable tool for the dissection of gene function. Indeed, the different nodulation phenotypes for the NPD knockout lines suggested a non-redundant role for NPD genes in the Medicago-Ensifer interaction.

By comparing lines with cumulative NPD gene inactivations, specific NPD impacts on nodule phenotype could be inferred. Knockout lines with NPD2 and NPD4 disruptions (npd2 and npd2/4) did not show a visual change in nodule morphology (though quantitative differences were detected; Table 2.3). Meanwhile, comparing mutant line npd2/4 to npd2/4/5 suggested that knocking out NPD5 probably results in earlier nodule senescence. Likewise, the npd2/4 versus npd1/2/4 comparison suggests that NPD1 inactivation results in smaller nodules. A complete loss of nodule function (fix- nodules) was only observed in the five-gene knockout disrupting all M. truncatula NPDs (Figure 2.6). Interestingly, personal communications with Catalina Pislariu at Texas Woman’s University have led me to believe that the phenotypic results I observed may be rhizobium strain-specific, and that multiple strains should be included in future NPD knockout line characterizations.
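The logic of these pairwise comparisons can be written down explicitly. In the minimal sketch below, each knockout line is a set of disrupted genes, and a phenotypic difference between two lines is attributed to whichever genes are disrupted in one line but not in the other. The dictionary `lines` lists only the lines named in this paragraph, and the helper `added_knockouts` is hypothetical.

```python
# Knockout lines named in the text, each mapped to its disrupted NPD genes.
lines = {
    "npd2": {"NPD2"},
    "npd2/4": {"NPD2", "NPD4"},
    "npd2/4/5": {"NPD2", "NPD4", "NPD5"},
    "npd1/2/4": {"NPD1", "NPD2", "NPD4"},
}


def added_knockouts(base, derived):
    """Genes disrupted in `derived` but not in `base`.

    The comparison is only informative when every gene disrupted in `base`
    is also disrupted in `derived`, so that case is enforced.
    """
    if not lines[base] <= lines[derived]:
        raise ValueError(f"{base} is not nested within {derived}")
    return lines[derived] - lines[base]


# npd2/4 vs npd2/4/5 isolates NPD5; npd2/4 vs npd1/2/4 isolates NPD1.
npd5_comparison = added_knockouts("npd2/4", "npd2/4/5")
npd1_comparison = added_knockouts("npd2/4", "npd1/2/4")
```

Such an attribution assumes that the shared background disruptions behave the same way in both lines, which is why lines that are not nested (for example npd2/4/5 versus npd1/2/4) are rejected rather than compared.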

I tested the LSE discovery pipeline on a comparatively small sample of five species, but these ranged across three distinct clades within the legume family: the dalbergioid, milletioid and Hologalegina crown clades. Therefore, the thirteen gene families detected in this study are strong candidates for further exploration of the divergent evolutionary history of nodule traits in these legume lineages. Further, as more legume transcriptomic and genomic data become available, these data should be incorporated into the LSE discovery pipeline as a way to get a “macro” overview of small peptide family evolution as it relates to nodulation. Including new representative legume species such as Lupinus in the genistoid crown node, for which nodule RNA-seq data recently became available (Keller et al. 2017), will likely yield additional nodulation-related gene candidates to explore.


5 Bibliography

Ali, Walid Wahid. “Screening of Plant Suspension Cultures for Antimicrobial Activities and Characterization of Antimicrobial Proteins from Arabidopsis Thaliana.” University of Würzburg, 2007. Print.

Altschul, S F, W Gish, and W Miller. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (1990): 403–410.

Altschul, Stephen F. et al. “Protein Database Searches Using Compositionally Adjusted Substitution Matrices.” FEBS Journal 272.20 (2005): 5101–5109.

Alunni, Benoit et al. “Genomic Organization and Evolutionary Insights on GRP and NCR Genes, Two Large Nodule-Specific Gene Families in Medicago Truncatula.” Molecular plant-microbe interactions : MPMI 20.9 (2007): 1138–48.

Anders, Simon, and Wolfgang Huber. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11.10 (2010): R106.

Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. “HTSeq-A Python Framework to Work with High-Throughput Sequencing Data.” Bioinformatics 31.2 (2015): 166–169.

Barker, David G et al. “Growing M Truncatula: Choice of Substrates and Growth Conditions.” Medicago Truncatula Handbook. Ed. Ulrike Mathesius, Etienne-Pascal Journet, and Lloyd Sumner. Ardmore OK: The Samuel Roberts Noble Foundation, 2006. Print.

Bateman, A., and R. Sandford. “The PLAT Domain: A New Piece in the PKD1 Puzzle.” Current biology : CB 9.16 (1999): R588-590.

Benchabane, Meriem et al. “Plant Cystatins.” Biochimie 92.11 (2010): 1657–1666.

Benedito, Vagner A et al. “A Gene Expression Atlas of the Model Legume Medicago Truncatula.” The Plant Journal 55.3 (2008): 504–13.

Berends, Tineke, Patricia E Gamble, and John E Mullet. “Primary Structure of the Soybean Nodulin-23 Gene and Potential Regulatory Elements in the 5’-flanking Regions of Nodulin and Leghemoglobin Genes.” Nucleic acids research 13.1 (1985): 239–249.


Bertioli, David John et al. “The Genome Sequences of Arachis Duranensis and Arachis Ipaensis, the Diploid Ancestors of Cultivated Peanut.” Nature Genetics 48.4 (2016): 438–446.

Bisseling, Ton. “The Role of Plant Peptides in Intercellular Signalling.” Current opinion in plant biology 2 (1999): 365–368.

Bolger, Anthony M, Marc Lohse, and Bjoern Usadel. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Bioinformatics btu170 (2014): n. pag.

Burghardt, Liana T. et al. “Transcriptomic Basis of Genome by Genome Variation in a Legume-Rhizobia Mutualism.” Molecular Ecology March (2017): n. pag.

Campbell, M. A. et al. “Identification and Characterization of Lineage-Specific Genes within the Poaceae.” Plant Physiology 145.4 (2007): 1311–1322.

Cermak, Tomas et al. “A Multi-Purpose Toolkit to Enable Advanced Genome Engineering in Plants.” Plant Cell 29.June (2017): 1196–1217.

Chakrabarti, Manohar, Randy D. Dinkins, and Arthur G. Hunt. “De Novo Transcriptome Assembly and Dynamic Spatial Gene Expression Analysis in Red Clover.” The Plant Genome 0.0 (2016): 0.

Cheng, Qiang et al. “Discovery of a Novel Small Secreted Protein Family with Conserved N-Terminal IGY Motif in Dikarya Fungi.” BMC Genomics 15.1 (2014): 1151.

Cho, Kyoungwon et al. “Cellular Localization of Dual Positional Specific Maize Lipoxygenase-1 in Transgenic Rice and Calcium-Mediated Membrane Association.” Plant Science 181.3 (2011): 242–248.

Clevenger, Josh et al. “A Developmental Transcriptome Map for Allotetraploid Arachis Hypogaea.” Frontiers in Plant Science 7.September (2016): 1–18.

Cosson, Viviane et al. “Medicago Truncatula Transformation Using Leaf Explants.” Agrobacterium Protocols. Methods in Molecular Biology, Vol 343. Ed. Kan Wang. Humana Press, 2006. 115–128.

Curtin, Shaun J. et al. “Validating Genome-Wide Association Candidates Controlling Quantitative Variation in Nodulation.” Plant Physiology 173.2 (2017): 921–931.

Czernic, Pierre et al. “Convergent Evolution of Endosymbiont Differentiation in Dalbergioid and Inverted Repeat-Lacking Clade Legumes Mediated by Nodule-Specific Cysteine-Rich Peptides.” Plant Physiology 169.2 (2015): 1254–1265.


David, Charles N. et al. “Evolution of Complex Structures: Minicollagens Shape the Cnidarian Nematocyst.” Trends in Genetics 24.9 (2008): 431–438.

De Vega, Jose J. et al. “Red Clover (Trifolium Pratense L.) Draft Genome Provides a Platform for Trait Improvement.” Scientific Reports 5.1 (2015): 17394.

Dénarié, Jean, Frédéric Debellé, and Jean-Claude C Promé. “Rhizobium Lipo-Chitooligosaccharide Nodulation Factors: Signaling Molecules Mediating Recognition and Morphogenesis.” Annual review of 65.1 (1996): 503–35.

Denison, R Ford. “Legume Sanctions and the Evolution of Symbiotic Cooperation by Rhizobia.” The American Naturalist 156.6 (2000): 567–576.

Devi, M. Jyostna, Thomas R. Sinclair, and Vincent Vadez. “Genotypic Variability among Peanut (Arachis Hypogea L.) in Sensitivity of Nitrogen Fixation to Soil Drying.” Plant and Soil 330.1 (2010): 139–148.

Dujon, B. “The Genome Project: What Did We Learn?” Trends in genetics : TIG 12.7 (1996): 263–270.

Duplessis, Sébastien. “Obligate Biotrophy Features Unraveled by the Genomic Analysis of Rust Fungi.” Proceedings of the National Academy of Sciences of the of America (2011): 1–23.

Edgar, Robert C. “MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput.” Nucleic Acids Research 32.5 (2004): 1792–7.

Farkas, Attila et al. “Comparative Analysis of the Bacterial Membrane Disruption Effect of Two Natural Plant Antimicrobial Peptides.” Frontiers in Microbiology 8.JAN (2017): 1–12.

Franssen, Henk J. et al. “Developmental Aspects of the Rhizobium-Legume Symbiosis.” Plant Molecular Biology 19.1 (1992): 89–107.

Gage, D. J. “Infection and Invasion of Roots by Symbiotic, Nitrogen-Fixing Rhizobia during Nodulation of Temperate Legumes.” Microbiology and Molecular Biology Reviews 68.2 (2004): 280–300.

Ganko, Eric W., Blake C. Meyers, and Todd J. Vision. “Divergence in Expression between Duplicated Genes in Arabidopsis.” Molecular Biology and Evolution 24.10 (2007): 2298–2309.


Garg, R. et al. “Gene Discovery and Tissue-Specific Transcriptome Analysis in Chickpea with Massively Parallel Pyrosequencing and Web Resource Development.” Plant Physiology 156.4 (2011): 1661–1678.

Ghorbani, Sarieh et al. “Expanding the Repertoire of Secretory Peptides Controlling Root Development with Comparative Genome Analysis and Functional Assays.” Journal of Experimental Botany 66.17 (2015): 5257–5269.

Grabherr, Manfred G et al. “Full-Length Transcriptome Assembly from RNA-Seq Data without a Reference Genome.” Nature biotechnology 29.7 (2011): 644–52.

Graham, Michelle A et al. “Computational Identification and Characterization of Novel Genes from Legumes.” Plant Physiology 135.July (2004): 1179–1197.

Grant, David et al. “SoyBase, the USDA-ARS Soybean Genetics and Genomics Database.” Nucleic Acids Research 38.SUPPL.1 (2009): 843–846.

Groat, R. Gene, and Carroll P Vance. “Root Nodule Enzymes of Ammonia Assimilation in Alfalfa (Medicago Sativa L.).” Plant Physiol 67 (1981): 1198–1203.

Guillén, Gabriel et al. “Detailed Analysis of Putative Genes Encoding Small Proteins in Legume Genomes.” Frontiers in plant science 4.June (2013): 208.

Guinel, Frédérique C. “Ethylene, a Hormone at the Center-Stage of Nodulation.” Frontiers in Plant Science 6.December (2015): n. pag.

Haag, Andreas F et al. “Protection of Sinorhizobium against Host Cysteine-Rich Antimicrobial Peptides Is Critical for Symbiosis.” PLoS biology 9.10 (2011): e1001169.

Hanada, Kousuke et al. “Importance of Lineage-Specific Expansion of Plant Tandem Duplicates in the Adaptive Response to Environmental Stimuli.” Plant physiology 148.2 (2008): 993–1003.

Hoagland, DR, and DI Arnon. “The Water-Culture Method for Growing Plants without Soil.” Circular. Agricultural Experiment Station 347 (1950): n. pag.

Horváth, Beatrix et al. “Loss of the Nodule-Specific Cysteine Rich Peptide, NCR169, Abolishes Symbiotic Nitrogen Fixation in the Medicago Truncatula dnf7 Mutant.” Proceedings of the National Academy of Sciences of the United States of America 112.49 (2015): 3–8.

Huisman, Rik et al. “Endocytic Accommodation of Microbes in Plants.” Endocytosis in Plants. Ed. Jozef Šamaj. N.p., 2012. 271–295.


Hyun, Tae Kyung et al. “The Arabidopsis PLAT Domain protein1 Is Critically Involved in Abiotic Stress Tolerance.” PLoS ONE 9.11 (2014): n. pag.

Ishihara, Hironobu et al. “Characteristics of Bacteroids in Indeterminate Nodules of the Leguminous Tree Leucaena Glauca.” Microbes and Environments 26.2 (2011): 156–159.

Jacobs, Fred A et al. “Several Nodulins of Soybean Share Structural Domains but Differ in Their Subcellular Locations.” Nucleic Acids Research 15.3 (1987): 1271–1280.

Jones, Kathryn M et al. “How Rhizobial Symbionts Invade Plants: The Sinorhizobium-Medicago Model.” October 5.8 (2007): 619–633.

Jordan, I. King et al. “Lineage-Specific Gene Expansions in Bacterial and Archaeal Genomes.” Genome Research 11.4 (2001): 555–565.

Karunakaran, R et al. “Transcriptomic Analysis of Rhizobium Leguminosarum Biovar Viciae in Symbiosis with Host Plants Pisum Sativum and Vicia Cracca.” Journal of bacteriology 191.12 (2009): 4002–14.

Keller, J et al. “RNA Sequencing and Analysis of Three Lupinus Nodulomes Provide New Insights into Specific Host-Symbiont Relationships with Compatible and Incompatible Bradyrhizobium Strains.” Plant Science 266 (2017): 102–116.

Kereszt, Attila, Peter Mergaert, and Eva Kondorosi. “Bacteroid Development in Legume Nodules: Evolution of Mutual Benefit or of Sacrificial Victims?” Molecular plant-microbe interactions : MPMI 24.11 (2011): 1300–9.

Kevei, Zoltán et al. “Glycine-Rich Proteins Encoded by a Nodule-Specific Gene Family Are Implicated in Different Stages of Symbiotic Nodule Development in Medicago Spp.” Molecular plant-microbe interactions : MPMI 15.9 (2002): 922–31.

Khetmalas, Madhukar B. “Oleosomes in Some Nitrogen-Fixing Root Nodules.” Memorial University of Newfoundland, 1996.

Kim, Minsoo et al. “An Antimicrobial Peptide Essential for Bacterial Survival in the Nitrogen-Fixing Symbiosis.” Proceedings of the National Academy of Sciences 112.49 (2015): 201500123.

Krumsiek, Jan, Roland Arnold, and Thomas Rattei. “Gepard: A Rapid and Sensitive Tool for Creating Dotplots on Genome Scale.” Bioinformatics 23.8 (2007): 1026–8.

Kurtz, Stefan et al. “Versatile and Open Software for Comparing Large Genomes.” Genome biology 5.2 (2004): R12. 88

Lajoie, Mathieu, Denis Bertrand, and Nadia El-Mabrouk. “Inferring the Evolutionary History of Gene Clusters from Phylogenetic and Gene Order Data.” Molecular biology and evolution 27.4 (2010): 761–72.

Langmead, Ben, and Steven L Salzberg. “Fast Gapped-Read Alignment with Bowtie 2.” Nature methods 9.4 (2012): 357–9.

Laporte, Philippe et al. “A Novel RNA-Binding Peptide Regulates the Establishment of the Medicago Truncatula-Sinorhizobium Meliloti Nitrogen-Fixing Symbiosis.” The Plant Journal 62.1 (2010): 24–38.

Larkin, M a et al. “Clustal W and Clustal X Version 2.0.” Bioinformatics 23.21 (2007): 2947–8.

Lavin, Matt, Patrick S Herendeen, and Martin F Wojciechowski. “Evolutionary Rates Analysis of Leguminosae Implicates a Rapid Diversification of Lineages during the Tertiary.” Systematic biology 54.4 (2005): 575–94.

Leonard, LT. “A Simple Assembly for Use in the Testing of Cultures of Rhizobia.” Journal of bacteriology (1943): 523–527.

Lespinet, Olivier et al. “The Role of Lineage-Specific Gene Family Expansion in the Evolution of Eukaryotes.” Genome research 12.7 (2002): 1048–59.

Li, Yixing et al. “A Nodule-Specific Plant Cysteine Proteinase, AsNODF32, Is Involved in Nodule Senescence and Nitrogen Fixation Activity of the Green Manure Legume Astragalus Sinicus.” New Phytologist 180.1 (2008): 185–192.

Libault, Marc et al. “An Integrated Transcriptome Atlas of the Crop Model Glycine Max, and Its Use in Comparative Analyses in Plants.” Plant Journal 63.1 (2010): 86–99.

Limpens, Erik et al. “Cell- and Tissue-Specific Transcriptome Analyses of Medicago Truncatula Root Nodules.” PloS one 8.5 (2013): e64377.

Lin, H et al. “Comparative Analyses Reveal Distinct Sets of Lineage-Specific Genes within Arabidopsis Thaliana.” BMC Evol Biol 10 (2010): 41.

Liu, J. “Recruitment of Novel Calcium-Binding Proteins for Root Nodule Symbiosis in Medicago Truncatula.” Plant Physiology 141.1 (2006): 167–177.

Lynch, M., and J.S. Conery. “The Evolutionary Fate and Consequences of Duplicate Genes.” Science 290.5494 (2000): 1151–1155.

Lyons, Eric, and Michael Freeling. “How to Usefully Compare Homologous Plant Genes and Chromosomes as DNA Sequences.” The Plant Journal 53.4 (2008): 661–73. 89

Marie, Corinne et al. “Characterization of Nops, Nodulation Outer Proteins, Secreted via the Type III Secretion System of NGR234.” Molecular plant-microbe interactions : MPMI 16.9 (2003): 743–51.

Matthews, Benjamin F. et al. “Engineered Resistance and Hypersusceptibility through Functional Metabolic Studies of 100 Genes in Soybean to Its Major Pathogen, the Soybean Cyst .” Planta 237.5 (2013): 1337–1357.

Maunoury, Nicolas et al. “Differentiation of Symbiotic Cells and Endosymbionts in Medicago Truncatula Nodulation Are Coupled to Two Transcriptome-Switches.” PloS One 5.3 (2010): e9519.

Meng, Ling. “Roles of Secreted Peptides in Intercellular Communication and Root Development.” Plant science : an international journal of experimental plant biology 183 (2012): 106–14.

Mergaert, Peter et al. “A Novel Family in Medicago Truncatula Consisting of More Than 300 Nodule-Specific Genes Coding for Small, Secreted Polypeptides with Conserved Cysteine Motifs.” Plant Physiology 132.May (2003): 161–173.

Mergaert, Peter et al. “Eukaryotic Control on Bacterial Cell Cycle and Differentiation in the Rhizobium-Legume Symbiosis.” Proceedings of the National Academy of Sciences of the United States of America 103.13 (2006): 5230–5.

Montiel, Jesús et al. “Morphotype of Bacteroids in Different Legumes Correlates with the Number and Type of Symbiotic NCR Peptides.” Proceedings of the National Academy of Sciences 114.19 (2017): 5041–5046.

Moreau, Sandra et al. “Transcription Reprogramming during Root Nodule Development in Medicago Truncatula.” PloS One 6.1 (2011): e16463.

Mortier, Virginie et al. “Search for Nodulation-Related CLE Genes in the Genome of Glycine Max.” Journal of experimental botany 62.8 (2011): 2571–83.

Murphy, Evan, Stephanie Smith, and Ive De Smet. “Small Signaling Peptides in Arabidopsis Development: How Cells Communicate over a Short Distance.” The Plant cell 24.8 (2012): 3198–217.

Nuccio ML, Thomas TL. “ATS1 and ATS3 Two Novel Embryo-Specific Genes in Arabidopsis Thaliana.pdf.” Plant Molecular Biology 39.6 (1999): 1153–1163.

O’Rourke, Jamie A et al. “An RNA-Seq Based Gene Expression Atlas of the Common Bean.” BMC Genomics 15.1 (2014): 866.

90

Oldroyd, Giles E.D. et al. “The Rules of Engagement in the Legume-Rhizobial Symbiosis.” Annual Review of Genetics 45.1 (2011): 119–144.

Oono, R., C. G. Anderson, and R. F. Denison. “Failure to Fix Nitrogen by Non- Reproductive Symbiotic Rhizobia Triggers Host Sanctions That Reduce Fitness of Their Reproductive Clonemates.” Proceedings of the Royal Society B: Biological Sciences 278.1718 (2011): 2698–2703.

Oono, Ryoko, and R Ford Denison. “Comparing Symbiotic Efficiency between Swollen versus Nonswollen Rhizobial Bacteroids.” Plant physiology 154.3 (2010): 1541–8.

Oono, Ryoko et al. “Multiple Evolutionary Origins of Legume Traits Leading to Extreme Rhizobial Differentiation.” The New Phytologist 187.2 (2010): 508–20.

Pan, Huairong, and Dong Wang. “Nodule Cysteine-Rich Peptides Maintain a Working Balance during Nitrogen-Fixing Symbiosis.” Nature Plants 3.5 (2017): 17048.

Penterman, Jon et al. “Host Plant Peptides Elicit a Transcriptional Response to Control the Sinorhizobium Meliloti Cell Cycle during Symbiosis.” Proceedings of the National Academy of Sciences of the United States of America 111.9 (2014): 3561– 6.

Petersen, Thomas Nordahl et al. “SignalP 4.0: Discriminating Signal Peptides from Transmembrane Regions.” Nature methods 8.10 (2011): 785–6.

Plett, Jonathan M. et al. “Populus Trichocarpa Encodes Small, Effector-like Secreted Proteins That Are Highly Induced during Mutualistic Symbiosis.” Scientific Reports 7.1 (2017): 382.

Puigbò, Pere, Ignacio G Bravo, and Santiago Garcia-Vallve. “CAIcal: A Combined Set of Tools to Assess Codon Usage .” Biology direct 3 (2008): 38.

Revanna, Kashi V et al. “A Web-Based Multi-Genome Synteny Viewer for Customized Data.” BMC bioinformatics 13.1 (2012): 190.

Richter, H E et al. “Characterization and Genomic Organization of a Highly Expressed Late Nodulin Gene Subfamily in Soybeans.” Mol Gen Genet 229.3 (1991): 445–452.

Ronquist, F., and J. P. Huelsenbeck. “MrBayes 3: Bayesian Phylogenetic Inference under Mixed Models.” Bioinformatics 19.12 (2003): 1572–1574.

Roumiantseva, ML et al. “Diversity of Sinorhizobium Meliloti from the Central Asian Alfalfa Gene Center.” Applied and Environmental Microbiology 68.9 (2002): 4694– 4697.

91

Roux, Brice et al. “An Integrated Analysis of Plant and Bacterial Gene Expression in Symbiotic Root Nodules Using Laser-Capture Microdissection Coupled to RNA Sequencing.” The Plant Journal 77.6 (2014): 817–37.

Saito, Akinori et al. “Effect of Nitrate on Nodule and Root Growth of Soybean (Glycine Max (L.) Merr.).” International Journal of Molecular Sciences 15.3 (2014): 4464– 4480.

Schmutz, Jeremy et al. “Genome Sequence of the Palaeopolyploid Soybean.” Nature 465.7294 (2010): 120–120.

Schneider, Caroline A, Wayne S Rasband, and Kevin W Eliceiri. “NIH Image to ImageJ: 25 Years of Image Analysis.” Nature Methods 9.7 (2012): 671–675.

Sengupta-Gopalan, Champa et al. “Expression of Host Genes during Root Nodule Development in Soybeans.” Mol Gen Genet 203 (1986): 410–420.

Severin, Andrew J et al. “RNA-Seq Atlas of Glycine Max: A Guide to the Soybean Transcriptome.” BMC plant biology 10.2007 (2010): 160.

Shin, R et al. “Capsicum Annuum Tobacco Mosaic Virus-Induced Clone 1 Expression Perturbation Alters the Plant’s Response to Ethylene and Interferes with the Redox Homeostasis.” Plant Physiol 135.1 (2004): 561–573.

Silverstein, Kevin A T, Michelle A. Graham, and Kathryn A. VandenBosch. “Novel Paralogous Gene Families with Potential Function in Legume Nodules and Seeds.” Current Opinion in Plant Biology 9.2 (2006): 142–146.

Silverstein, Kevin A T et al. “Genome Organization of More Than 300 Defensin-Like Genes in Arabidopsis 1.” Plant Physiology 138.June (2005): 600–610.

Silverstein, Kevin A T et al. “Small Cysteine-Rich Peptides Resembling Antimicrobial Peptides Have Been under-Predicted in Plants.” Plant Journal 51.2 (2007): 262– 280.

Simpson, Jared T et al. “ABySS: A Parallel Assembler for Short Read Sequence Data.” Genome Research 19.6 (2009): 1117–23.

Sinharoy, S. et al. “The C2H2 Transcription Factor REGULATOR OF SYMBIOSOME DIFFERENTIATION Represses Transcription of the Secretory Pathway Gene VAMP721a and Promotes Symbiosome Development in Medicago Truncatula.” The Plant Cell 25.9 (2013): 3584–3601.

Son, Jae Han et al. “Isolation of Cold-Responsive Genes from Garlic, Allium Sativum.” Genes and Genomics 34.1 (2012): 93–101. 92

Starker, C. G. “Nitrogen Fixation Mutants of Medicago Truncatula Fail to Support Plant and Bacterial Symbiotic Gene Expression.” Plant Physiology 140.2 (2006): 671– 680.

Steele, Kelly P et al. “Phylogeny and Character Evolution in Medicago (Leguminosae): Evidence from Analyses of Plastid trnK/matK and Nuclear GA3ox1 Sequences.” American journal of botany 97.7 (2010): 1142–55.

Sutton, W D, and A D Paterson. “Effects of the Plant Host on the Detergent Sensitivity and Viability of Rhizobium Bacteroids.” Planta 148 (1980): 287–292.

Tamura, Koichiro et al. “MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods.” Molecular biology and evolution 28.10 (2011): 2731–9.

Tang, Haibao et al. “An Improved Genome Release (Version Mt4.0) for the Model Legume Medicago Truncatula.” BMC Genomics 15.1 (2014): 312.

Tautz, Diethard, and Tomislav Domazet-Lošo. “The Evolutionary Origin of Orphan Genes.” Nature reviews. Genetics 12.10 (2011): 692–702.

Thorvaldsdóttir, Helga, James T Robinson, and Jill P Mesirov. “Integrative Genomics Viewer (IGV): High-Performance Genomics Data Visualization and Exploration.” Briefings in bioinformatics 14.2 (2013): 178–92.

Tóth, Katalin, and Gary Stacey. “Does Plant Immunity Play a Critical Role during Initiation of the Legume-Rhizobium Symbiosis?” Frontiers in Plant Science 6.June (2015): 1–7.

Tran, Lan T, John S Taylor, and C Constabel. “The Polyphenol Oxidase Gene Family in Land Plants: Lineage-Specific Duplication and Expansion.” BMC Genomics 13.1 (2012): 395.

Trapnell, Cole, Lior Pachter, and Steven L Salzberg. “TopHat: Discovering Splice Junctions with RNA-Seq.” Bioinformatics 25.9 (2009): 1105–11.

Trapnell, Cole et al. “Transcript Assembly and Abundance Estimation from RNA-Seq Reveals Thousands of New Transcripts and Switching among Isoforms.” Nature Biotechnology 28.5 (2011): 511–515.

Trujillo, Diana I., Kevin A. T. Silverstein, and Nevin D. Young. “Genomic Characterization of the LEED..PEEDs, a Gene Family Unique to the Medicago Lineage.” G3 Genes|Genomes|Genetics 4.10 (2014): 2003–2012.

93

Van de Velde, Willem et al. “Aging in Legume Symbiosis. A Molecular View on Nodule Senescence in Medicago Truncatula.” Plant physiology 141.2 (2006): 711–20.

Van de Velde, Willem et al. “Plant Peptides Govern Terminal Differentiation of Bacteria in Symbiosis.” Science 327.5969 (2010): 1122–6. van Dongen, Stijn Marinus. “Graph Clustering by Flow Simulation.” University of Utrecht PhD thesis (2000): n. pag. van Wyk, Stefan George et al. “Cysteine Protease and Cystatin Expression and Activity during Soybean Nodule Development and Senescence.” BMC Plant Biology 14.294 (2014): 1–13.

Vandepoele, Klaas, and Yves Van de Peer. “Exploring the Plant Transcriptome through Phylogenetic Profiling.” Plant Physiology 137.January (2005): 31–42.

Varshney, Rajeev K et al. “Draft Genome Sequence of Chickpea (Cicer Arietinum) Provides a Resource for Trait Improvement.” Nature biotechnology 31.3 (2013): 240–6.

Vincent, J L, and N J Brewin. “Immunolocalization of a Cysteine Protease in Vacuoles, Vesicles, and Symbiosomes of Pea Nodule Cells.” Plant physiology 123.2 (2000): 521–30.

Vincze, E et al. “Repression of the L-Asparaginase Gene during Nodule Development in Lupinus Angustifolius.” Plant molecular biology 26.1 (1994): 303–11.

Wang, Dong et al. “A Nodule-Specific Protein Secretory Pathway Required for Nitrogen- Fixing Symbiosis.” Science 327.5969 (2010): 1126–9.

Wang, Qi et al. “Host-Secreted Antimicrobial Peptide Enforces Symbiotic Selectivity in Medicago Truncatula.” Proceedings of the National Academy of Sciences 114.26 (2017): 201700715.

Waterhouse, Andrew M et al. “Jalview Version 2--a Multiple Sequence Alignment Editor and Analysis Workbench.” Bioinformatics 25.9 (2009): 1189–91.

Yang, Xiaohan et al. “Genome-Wide Identification of Lineage-Specific Genes in Arabidopsis, Oryza and Populus.” Genomics 93.5 (2009): 473–480.

Yoder, Jeremy B et al. “Phylogenetic Signal Variation in the Genomes of Medicago ().” Systematic biology 62.3 (2013): 424–38.

Young, Nevin D, and Arvind K Bharti. “Genome-Enabled Insights into Legume Biology.” Annual review of plant biology 63 (2012): 283–305. 94

Young, Nevin D et al. “The Medicago Genome Provides Insight into the Evolution of Rhizobial Symbioses.” Nature 480.7378 (2011): 520–4.

Yuan, Song L. et al. “RNA-Seq Analysis of Nodule Development at Five Different Developmental Stages of Soybean (Glycine Max) Inoculated with Bradyrhizobium Japonicum Strain 113-2.” Scientific Reports 7.May 2016 (2017): 42248.

Zhao, Peng et al. “Comprehensive Analysis of Cystatin Family Genes Suggests Their Putative Functions in Sexual Reproduction, Embryogenesis, and Seed Formation.” Journal of Experimental Botany 65.17 (2014): 5093–5107.

Zhou, Peng et al. “Detecting Small Plant Peptides Using SPADA (Small Peptide Alignment Discovery Application).” BMC bioinformatics 14.335 (2013): 1–16.

95

6 Appendix

Table 6.1 Genomic, transcriptomic, and computational resources generated by this study

| Name | Link / NCBI Accession | Description |
| --- | --- | --- |
| LSE discovery pipeline | https://github.com/ditrujillo/LSE_pipeline | Custom pipeline to detect lineage-specific expansions of nodulation-related small secreted peptides |
| NPD knockout experiment | https://github.com/ditrujillo/NPD_knockout_experiment | Custom scripts for statistical analysis of nodule and plant phenotypic data. Custom scripts for RNA-seq analysis of NPD knockout lines |
| Host x strain analysis of NPD and PDP | https://github.com/ditrujillo/HostStrain_NPD_expression | Custom scripts for RNA-seq analysis of nodule NPD gene expression in four Medicago accessions inoculated with two different rhizobial strains |
| T. pratense nodule RNA-seq | PRJNA416968 | Trifolium pratense 'Marathon' nodule RNA-seq, 30 days post inoculation |
| NPD knockout experiment nodule RNA-seq | PRJNA418151 | Medicago truncatula R108 (wild type), npd2/4, npd1/2/4, and npd2/4/5 knockout line RNA-seq, 30 days post inoculation |
| T. pratense genomic DNA-seq | PRJNA257076 | T. pratense 'Marathon' genomic DNA-seq extracted from 7 day old leaves |
| P. sativum nodule RNA-seq | PRJNA257308 | P. sativum 'Little Marvel' nodule RNA-seq, 30 days post inoculation |

Figure 6.1 CRISPR/Cas9 multiplex genome editing approach. A 4-plex version of entry vector pSC218GG carried four guide RNAs (gRNAs) induced by AtU6 and At7SL promoters, BAR herbicide resistance and NPTII kanamycin resistance (a). Four guide RNAs were designed to target one through five NPD genes (b), which were tandemly duplicated within a 20 kb region (c).

Figure 6.2 LSE analysis of NCR peptides in M. truncatula and T. pratense. Of the 915 NCRs shown, most rows correspond to M. truncatula (548) and T. pratense (363) genes, while a blue arrow indicates G. max (2) and P. vulgaris (1) genes. Genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.3 LSE analysis of GRP1 proteins in M. truncatula and T. pratense. For the LSE of the nodule-specific Glycine-Rich Peptide family (GRP1) in M. truncatula (Mt) and T. pratense (Tp), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.4 LSE analysis of GRP2 proteins in G. max. For the LSE of the nodule-specific Glycine-Rich Peptide family (GRP2) in G. max (Gm), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.5 LSE analysis of LEED..PEED peptides in M. truncatula. For the LSE of the LEED..PEED family in M. truncatula (Mt), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.6 LSE analysis of Aeschynomene NCR-like peptides in A. duranensis. For the LSE of Aeschynomene NCR-like peptides in A. duranensis (Ad), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.7 LSE analysis of Calmodulin-like proteins in M. truncatula and T. pratense. For the LSE of calmodulin-like proteins (CaMLs) in M. truncatula (Mt) and T. pratense (Tp), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.8 LSE analysis of CAP superfamily proteins in A. duranensis. For the LSE of the nodulation-related CAP superfamily proteins in A. duranensis (Ad), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.9 LSE analysis of Bowman Birk peptides in A. duranensis. For the LSE of the Bowman Birk family in A. duranensis (Ad), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.10 LSE analysis of Antimicrobial MBP-1 peptides in A. duranensis. For the LSE of Antimicrobial Peptide MBP-1 genes in A. duranensis (Ad) relative to M. truncatula (Mt) and G. max (Gm), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.11 LSE analysis of Leginsulin peptides in M. truncatula. For the LSE of leginsulin genes in M. truncatula (Mt), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.12 LSE analysis of Phaseoleae-specific nodulins in G. max and P. vulgaris. For the LSE of Phaseoleae-specific nodulins in G. max (Gm) and P. vulgaris (Pv), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.13 LSE analysis of cystatin proteins in A. duranensis. For the LSE of cystatins in A. duranensis (Ad), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.14 LSE analysis of NPD proteins in M. truncatula and T. pratense. For the LSE of a nodule-specific subset (NPD) of PLAT Domain Proteins (PDP) in M. truncatula (Mt) and T. pratense (Tp), genes are ordered by phylogenetic relatedness (tree generated by CLUSTALW). Transcript levels for Flower, Leaf, Root and Nodule tissues are shown in the heatmap, with yellow indicating higher expression. Purple bars indicate genes with nodule-enhanced expression.

Figure 6.15 Knockout mechanism of six NPD knockout lines. Six NPD knockout lines, targeting one through five NPD genes, were characterized using PCR and confirmed with Sanger sequencing. Gene knockouts occurred as a result of insertions (ins) or deletions (del) of varying size.

Figure 6.16 Aligned LP DNA sequences from A17, HM056 and R108. Columns that were omitted for the phylogenetic analysis are highlighted with red headers. Phylogenetic trees were based on the remaining 144 nucleotides.
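
The column filtering described in this caption can be illustrated with a short Python sketch. The criterion used to omit columns is not stated here, so the example assumes, purely for illustration, that any column containing a gap or unknown character is dropped. The function names `gapped_columns` and `drop_columns` and the toy sequences are hypothetical and are not taken from the scripts used in this study.

```python
def gapped_columns(alignment, gap_chars="-?"):
    """Return indices of columns holding any gap/unknown character.

    This omission criterion is an assumption made for illustration only.
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    return {
        i for i in range(length)
        if any(seq[i] in gap_chars for seq in alignment.values())
    }


def drop_columns(alignment, omitted):
    """Remove the given column indices from every aligned sequence."""
    return {
        name: "".join(base for i, base in enumerate(seq) if i not in omitted)
        for name, seq in alignment.items()
    }


# Toy alignment (hypothetical sequences, not the LP data shown in the figure).
toy = {
    "A17":   "ATG-CCGTA",
    "HM056": "ATGACC-TA",
    "R108":  "ATGACCGTA",
}
omitted = gapped_columns(toy)
trimmed = drop_columns(toy, omitted)
```

Every sequence in `trimmed` has the same reduced length, which is the property a downstream phylogenetic analysis relies on.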

Figure 6.17 Aligned LP DNA sequences from A17, HM056, R108 and M. sativa. Columns that were omitted for the phylogenetic analysis are highlighted with red headers. Phylogenetic trees were based on the remaining 87 nucleotides.

Figure 6.18 Multiple sequence alignment of LPs from M. truncatula accessions A17 and HM056, R108 and M. sativa. Unknown M. sativa residues are indicated with “?”.

Figure 6.19 Dotplot comparisons between M. truncatula and G. max in regions surrounding M. truncatula LPs. Panels show syntenic regions of G. max chromosomes 8 (a) and 15 (b) compared against a 2 Mb region in A17 chromosome 4, corresponding to 0.5% of the total M. truncatula genome. Syntenic regions of G. max chromosomes 8 (c) and 18 (d) are compared against a 3-4 Mb region of A17 chromosome 7. Red dots indicate regions of sequence similarity with a forward orientation, while blue dots indicate reverse matches. The LP region in M. truncatula is shown with green rectangles.
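
As a rough illustration of how forward and reverse matches in a dotplot can be distinguished, the Python sketch below collects exact k-mer matches between two sequences in both orientations. It is a minimal exact-match approach assumed for illustration and is not the software used to produce the figure; the function names, the word size `k`, and the toy sequences are all hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def dotplot_matches(x, y, k):
    """Return (forward, reverse) lists of (x_pos, y_pos) exact k-mer matches.

    Forward matches correspond to the red dots in the caption and
    reverse-complement matches to the blue dots.
    """
    index = kmer_index(x, k)
    forward, reverse = [], []
    for j in range(len(y) - k + 1):
        word = y[j:j + k]
        forward.extend((i, j) for i in index.get(word, []))
        reverse.extend((i, j) for i in index.get(reverse_complement(word), []))
    return forward, reverse


# Toy sequences (hypothetical): y starts with the first six bases of x,
# followed by the reverse complement of the last six bases of x.
x = "GATTACAGGC"
y = "GATTAC" + reverse_complement("ACAGGC")
fwd, rev = dotplot_matches(x, y, k=5)
```

In the toy example, the forward matches fall on a rising diagonal and the reverse matches on a falling one, which is the visual signature of an inverted segment in a dotplot.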

Figure 6.20 Phylogenetic tree of A17, HM056, R108 and M. sativa LP nucleotide sequences. The tree was generated through Bayesian Inference and visualized with FigTree software. Posterior probability values of the clades are indicated at the nodes.
