Gene Copy Number Variation Among and Within Plant Species

Research Collection

Doctoral Thesis

Gene copy number variation among and within species

Author(s): Liu, Xuanyu

Publication Date: 2014

Permanent Link: https://doi.org/10.3929/ethz-a-010432389

Rights / License: In Copyright - Non-Commercial Use Permitted

DISS. ETH NO. 22293

Gene copy number variation among and within plant species

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

Presented by
Xuanyu Liu
M.S. in Cell Biology, Chinese Academy of Sciences
Born on April 10th, 1985
Citizen of China

Accepted on the recommendation of

Prof. Dr. Alex Widmer, examiner

Prof. Dr. Ute Krämer, co-examiner

Dr. Weihong Qi, co-examiner

Dr. Alessia Guggisberg, co-examiner

2014

TABLE OF CONTENTS

SUMMARY
ZUSAMMENFASSUNG
ACKNOWLEDGMENTS
GENERAL INTRODUCTION
    THE CONCEPT OF A GENE FAMILY
    GENE GAINS AND LOSSES
    COPY NUMBER VARIANTS
    STUDY SYSTEMS
    MAIN GOALS AND METHODS
    REFERENCES
CHAPTER ONE: Evolutionary analysis of gene family size variation in Arabidopsis and its relatives
    ABSTRACT
    INTRODUCTION
    RESULTS
    DISCUSSION
    EXPERIMENTAL PROCEDURES
    SUPPORTING INFORMATION
    REFERENCES
CHAPTER TWO: Genome-wide comparative analysis of the GRAS gene family in Populus, Arabidopsis and rice
    ABSTRACT
    INTRODUCTION
    RESULTS
    DISCUSSION
    EXPERIMENTAL PROCEDURES
    SUPPORTING INFORMATION
    REFERENCES
CHAPTER THREE: Copy number variants and their putative involvement in edaphic adaptation of Arabidopsis lyrata
    ABSTRACT
    INTRODUCTION
    RESULTS
    DISCUSSION
    EXPERIMENTAL PROCEDURES
    SUPPORTING INFORMATION
    REFERENCES
GENERAL DISCUSSION
    REFERENCES

SUMMARY

For more than a century, biologists have sought to understand the origins, adaptive mechanisms and evolutionary processes underlying genomic variation within and among species. So far, however, our knowledge of these aspects has been mainly limited to one type of genomic variation, i.e. single nucleotide polymorphisms (SNPs). There is growing evidence that another class of genomic variation, gene copy number variation (GCNV, a.k.a. gene family size variation), is pervasive and contributes to environmental adaptation and organismal diversification. While substantial progress has been made in animal studies, our understanding of GCNV in plants is still limited. In this thesis, I investigated several questions surrounding GCNV, both within and among plant species.

Firstly, I used a likelihood approach to perform an evolutionary analysis of gene family size variation among five Brassicaceae species that diverged relatively recently. The average rate of gene gain-and-loss (λ) was estimated to be 0.0022 gains and losses per gene per million years, which corroborates the view that gene gain-and-loss occurs at a similar average rate across different eukaryotic lineages. Branch-specific rate estimation further supported my hypothesis that plant mating system may influence the rate of gene gain-and-loss. Gene families that were inferred to have been evolving rapidly in size were found to be mainly involved in plant-pathogen/herbivore interactions and pollen-pistil interactions, and exhibited a high incidence of positive selection acting at the nucleotide level. Finally, I showed that gene gains via tandem duplication predominantly contributed to the adaptive evolution of gene family size.

Secondly, I pursued my investigations on GCNV among plant species, but over a much longer evolutionary time-scale in an individual gene family. The expansion and diversification of the GRAS gene family in Populus was investigated through comparative analyses with Arabidopsis and rice. I detected 106, 34 and 60 putative GRAS genes in Populus, Arabidopsis and rice, respectively, which could be grouped into 13 subfamilies. The joint action of tandem and segmental duplications could explain the rapid expansion of the GRAS family in Populus, while site-specific shifts in evolutionary rates might constitute the main driver in functional diversification. The observation that GRAS genes evolved mainly under purifying selection after duplication however revealed strong functional constraints. Expression divergence analyses between paralogous pairs of GRAS genes finally suggested that their retention likely resulted from functional novelty such as neo-functionalization or sub-functionalization.

Finally, I investigated copy number variants (CNVs) in Arabidopsis lyrata. Using three different sequencing-based approaches (read depth, read pair and split read), I identified 1,513 tandem duplications and 42,945 deletions among 24 individuals from eight natural populations. Dynamic genome evolution via frequent gene copy gains and losses due to CNVs was revealed, and defense-related genes were found particularly sensitive to these changes. Abundant functional novelty may be achieved by CNV-mediated gene fusions that preferentially happened between paralogous genes. Formation mechanism analyses further revealed that a much higher percentage of deletions originated by non-allelic homologous recombination (NAHR) in A. lyrata (7.5%) than in Drosophila (1%), a likely consequence of the abundant, highly conserved syntenic blocks within angiosperm genomes resulting from several rounds of polyploidization events. Candidate CNVs affecting genes interacting with soil chemicals, such as genes encoding sulfate transporter, ligand-gated ion channel, and high-affinity potassium ion transporter, were found to exhibit high allele frequency differences between plants of A. lyrata from different bedrocks, implying their putative involvement in edaphic adaptation.

Overall, this thesis contributes to our understanding about the origin, evolution and adaptive importance of GCNV in plants. It constitutes the first genome-wide perspective on gene family size evolution among closely related plant species. Of particular interest is the possible interplay found for the first time between plant mating system and the dynamics of gene gain-and-loss. This thesis also provides the first comprehensive analysis of the GRAS gene family in a woody tree species, as well as the first sequencing-based CNV map for A. lyrata.

ZUSAMMENFASSUNG ∗

For more than a century, biologists have tried to understand genomic variation within and among species by investigating its origins, its adaptive mechanisms and the underlying evolutionary processes. Our knowledge of these aspects, however, is mainly limited to one type of genomic variation: single nucleotide polymorphisms (SNPs). There is growing evidence, though, that variation in the number of gene copies (GCNV, also known as gene family size variation) is likewise ubiquitous and contributes to environmental adaptation and organismal diversification. While considerable progress has been made in animal systems, our understanding of GCNV in plants is still very limited. In this thesis, I investigated several questions surrounding GCNV within and among plant species.

First, I carried out an evolutionary analysis of gene family size variation in five Brassicaceae species that diverged from each other relatively recently. The average rate of gene gains and losses (λ) was estimated at 0.0022 gains and losses per gene per million years, which confirms the view that gene gains and losses arise at similar average rates in different eukaryotic lineages. Branch-specific estimates support the hypothesis that the mating system can influence the rate of gene gains and losses. Gene families assumed to have evolved rapidly owing to positive selection at the nucleotide level are involved in particular in plant-pathogen, plant-herbivore and pollen-pistil interactions. Finally, I was able to show that gene gains through tandem duplication contributed predominantly to the adaptive evolution of gene family size.

Second, I investigated GCNV among plant species over a much longer evolutionary time-scale in a single gene family. The evolution and diversification of the GRAS gene family in poplars (Populus) was investigated through comparative analyses with Arabidopsis and rice. I found 106, 34 and 60 putative GRAS genes in Populus, Arabidopsis and rice, respectively, which could be grouped into 13 subfamilies. The joint effect of tandem and segmental duplications could explain the rapid expansion of the GRAS family in Populus, while site-specific changes in evolutionary rates might be the most important driving force in functional diversification. The observation that GRAS genes evolve mainly under purifying selection after duplication, however, pointed to strong functional constraints. Expression divergence analyses between paralogous pairs of GRAS genes finally suggested that their retention was probably due to functional novelty, such as neo-functionalization or sub-functionalization.

Finally, I investigated copy number variation (CNV) in Arabidopsis lyrata. Using three different sequencing-based approaches (read depth, read pair and split read), I identified 1,513 tandem duplications and 42,945 deletions in 24 individuals from eight natural populations. Dynamic genome evolution through frequent gene copy gains and losses due to CNVs was discovered, and genes involved in immune defense were disproportionately affected by such CNVs. Functional novelty can be achieved through CNV-mediated gene fusions, which occur preferentially between paralogous genes. Analyses of formation mechanisms further showed that a much higher percentage of deletions originates from non-allelic homologous recombination (NAHR) in A. lyrata (7.5%) than in Drosophila (1%), which is probably a consequence of the abundant and highly conserved syntenic blocks within angiosperm genomes that arose from several rounds of polyploidization. Large differences in allele frequency were found between A. lyrata plants growing on different bedrocks (acidic versus basic). The candidate CNVs involved affect genes that participate in interactions with soil chemicals (sulfate transporter, ligand-gated ion channel and high-affinity potassium ion transporter). This led to the assumption that these genes are involved in edaphic adaptation.

∗ German version translated by Sonja Hassold.

Overall, this thesis contributes to the understanding of the origin, evolution and adaptive significance of GCNV in plants. It is the first genome-wide analysis of gene family size evolution in closely related plant species. Of particular interest is the possible interplay between the plant mating system and the dynamics of gene gains and losses, which was found here for the first time. This thesis comprises the first detailed analysis of the GRAS gene family in a woody plant, as well as the first sequencing-based CNV map for A. lyrata.

ACKNOWLEDGMENTS

For most people, it is definitely challenging to do a PhD far from home. Fortunately, I met a group of kind people during my hard times, to whom I hereby express my sincere appreciation for their help.

First of all, I am grateful to my supervisor Prof. Dr. Alex Widmer for offering me this opportunity to study and for helping me get through the hardest time. Dr. Alessia Guggisberg helped me greatly with the projects. I have learned a lot from both of you.

Many thanks to Katharina Rentsch, the lady who frequently helps me out with important things! My appreciation also goes to the people who lent me a hand with the work: Maja Frei for taking care of plants, Stenfan Zoller for bioinformatics support and Claudia Michel for sequencing library preparation.

Thanks also to my office mates, Ana Marcela Florez-Rueda, Dr. Fernanda Baena, Sonja Hassold and Dr. Margot Paris, and other colleagues in the group!

Special gratitude goes to Dr. Weihong Qi and her family, as well as Yandong Wang for sharing so much happiness with me.

Saving the best for last, I am deeply grateful to my fiancée Wenhua Zhao (赵文华), my mom and dad. It is their constant encouragement that gave me the energy and confidence to reach this step.

Sep. 8, 2014 (Mid-Autumn Festival)

GENERAL INTRODUCTION

THE CONCEPT OF A GENE FAMILY

A gene family consists of a group of evolutionarily related genes showing significant sequence similarity with each other (Walsh and Stephan, 2002). While the narrow-sense definition of a gene family only refers to multiple, paralogous genes within a genome (multi-gene family), a broad-sense definition includes both paralogs within a genome and orthologs or paralogs between genomes (Figure 1) (Demuth and Hahn, 2009). This broad-sense definition means that each gene belongs to a gene family, making it possible to perform comparative analyses of gene family size variation among species even if one of the studied species has only a single or no gene copy (Hahn et al., 2007b). While the term gene family is most commonly applied to protein-coding genes (protein family) (Walsh and Stephan, 2002), its application has also been extended to include non-coding RNA (ncRNA) genes that produce various classes of functional RNA molecules, such as transfer RNA (tRNA) genes (Wang and Lavrov, 2011), microRNA (miRNA) genes (Jiang et al., 2006) and ribosomal RNA (rRNA) genes (Rooney and Ward, 2005).

Figure 1. Diagram illustrating orthology and paralogy between homologous genes, as modified from Jensen (2001). Any two genes joined by a red dot are a pair of orthologs, which originate by speciation events. Any two genes joined by a green dot are a pair of paralogs, which originate by gene duplications. For example, the gene B1 in species B has one orthologous gene C1 and two paralogous genes C2 and C3 in species C. All six homologous genes in species A, B and C constitute a gene family defined in a broad sense, even though a single gene copy is present in species A.

GENE GAINS AND LOSSES

Gene birth-and-death is a common theme in genome and gene family evolution (Zhang, 2003). Gene gain-and-loss events frequently occur within genomes via various mechanisms, resulting in gene family size variation, a.k.a. gene copy number variation within and among species.

Gene gains via duplication

The evolutionary significance of gene gains via duplication has been appreciated ever since the first report of a gene duplication in Drosophila (Bridges, 1936). Gene duplications may arise from large-scale duplications, i.e. whole genome duplications (WGDs), a.k.a. polyploidization events, whereby all genes within a genome are duplicated at the same time (Wendel, 2000). It has been demonstrated that WGDs have occurred in multiple eukaryotic lineages during evolution, including vertebrates (Dehal and Boore, 2005), fungi (Albertin and Marullo, 2012), and plants (Jiao et al., 2011). Yet, WGDs are particularly prominent and ubiquitous in plants, especially in angiosperms (flowering plants), where the genomes of most species have undergone multiple rounds of polyploidization (Adams and Wendel, 2005; Soltis et al., 2009). For example, the plant model species Arabidopsis thaliana has experienced an ancient γ whole-genome triplication that preceded the rosid-asterid split and two more recent WGDs, known as the β WGD that occurred 65-115 million years ago, and the α WGD that occurred 47-64 million years ago (Bowers et al., 2003; Beilstein et al., 2010).

Gene duplications may also arise from small-scale duplications, including tandem duplications and dispersed duplications. Tandem duplications result from template slippage during DNA repair or unequal crossing-over, generating tandem arrays of homologous genes in close genomic vicinity (Kane et al., 2010). Dispersed duplications occur through DNA- or RNA-based transposition, producing homologous genes that are usually not adjacent to each other (Wang et al., 2012).

Figure 2. The three phases that a gene duplication undergoes before it is eventually preserved in populations, as modified from Innan and Kondrashov (2010). This diagram is based on Ohno’s neofunctionalization model. The fixation phase and fate-determination phase may overlap in other models. A: single-copy gene; A-A: a gene duplication with the original gene copy and a new copy; A-B: a gene duplication with the original gene copy and a derived gene copy with novel functions.

Evolutionary mechanisms underlying the fixation and preservation of gene duplications in populations have long been explored, and ten different models have been proposed, although their relative importance is still under debate (reviewed by Innan and Kondrashov, 2010). It is assumed that a gene duplication goes through three main phases before it eventually becomes stably maintained in populations. The first is the fixation phase during which it segregates in populations, the second is a fate-determination phase when the different gene copies diverge, and the third is a preservation phase during which the fixed duplication is maintained in populations (Innan and Kondrashov, 2010). The proposed models vary with respect to the selective forces acting during each phase and the functional evolution of the new gene copy. As exemplified in Figure 2, the most classic model, known as Ohno’s neofunctionalization (Ohno, 1970), assumes that a gene duplication (A-A) is neutral and eventually gets fixed by genetic drift. Afterwards, the new gene copy (A) may acquire new functions by advantageous nucleotide substitutions that are targeted by positive selection. Finally, the duplication is stably maintained with the original and derived gene copies (A-B).

Gene gains via other mechanisms

Gene gains via other mechanisms may occasionally happen, leading to the creation of so-called orphan genes that lack homologs in other lineages and thus constitute lineage-specific gene families (Tautz and Domazet-Lošo, 2011). Horizontal gene transfer (HGT), for example, results in the transmission of genes between distant species or between organelles and the nucleus (Gao et al., 2014). Genes without homologs in other species may also originate de novo from noncoding regions of DNA (Knowles and McLysaght, 2009).

Gene losses

Gene losses can occur through a process called pseudogenization (nonfunctionalization), when mutations either silence functional genes or cause their loss of function. Genes may also be removed from genomes through deletion events (Koskiniemi et al., 2012). Deletions may originate via the following four mechanisms: (1) VNTRs (variable number of tandem repeats), which refers to the contraction or expansion of simple tandem repeat units due to DNA replication slippage; (2) NAHR (non-allelic homologous recombination), associated with homology-mediated recombination between two stretches of non-allelic DNA sequences with high sequence similarity; (3) TEA, associated with activities of DNA transposons or retrotransposons; and (4) NHR, which refers to recombination occurring in the absence of sequence homology during DNA repair (non-homologous end-joining based DNA double-strand repair) or replication (rescue of DNA replication-fork stalling events) (Hastings et al., 2009; Lam et al., 2009).

Gene losses may also confer a selective advantage to the affected individuals and then contribute to adaptive evolution (Zhang, 2003; Gross, 2006). For instance, the loss of the gene CASPASE12 in humans, which encodes a negative regulator of immune response to endotoxins, confers better protection from sepsis (Wang et al., 2006).

Approaches for studying gene gains and losses

To study gene family size variation among species, it is essential to first accurately infer and count the number of gene gains and losses in each lineage. The direction of change in gene family size among species, i.e., either gene gains in one species or gene losses in another species, however, cannot be determined by simple pairwise comparisons of gene family size between species.

Two approaches have been developed and widely used in analyses of gene family size variation at a genome-wide level. One approach is the so-called phylogenetic tree reconciliation approach, which infers the number of gene gains and losses on each branch of the species tree by reconciling the gene tree reconstructed from members of a gene family with the species tree (Goodman et al., 1979; Chen et al., 2000). However, this approach has been suggested to cause a bias (an excess of ancient losses and recent gains) when the reconstructed gene trees are incorrect (Hahn, 2007). Moreover, it does not allow direct inferences of the evolutionary forces driving the observed changes in gene copy number, i.e. natural selection versus stochastic processes (Demuth and Hahn, 2009).

When assessing the role of natural selection in driving gene copy number changes between species, it is important to take the divergence time of the two species into account, because the more recently the species diverged, the more likely it is that a given change in gene copy number was driven by natural selection (Hahn et al., 2005). To incorporate this aspect, a maximum likelihood approach has been developed, which uses a stochastic birth-and-death process to model gene gains and losses along the branches of a phylogenetic tree of species with known divergence times; departures from this null model can then indicate the action of natural selection (Hahn et al., 2005; Han et al., 2013). This approach can infer the ancestral state for gene family size at each node in the species tree, thereby facilitating inferences about the directions of changes in gene family size and rate estimates of gene gain-and-loss. This approach further allows statistical inferences regarding the likelihood (family-wide P value) that any particular change in gene family size results from a random process, thus providing a statistical framework to test for the action of natural selection (Demuth and Hahn, 2009). The robustness of this approach has been demonstrated by its successful application in analyses of gene family size evolution in multiple lineages, such as yeast (Hahn et al., 2005), mammals (Demuth et al., 2006; Hahn et al., 2007a), and Drosophila (Hahn et al., 2007b). Mathematical details relevant for this approach and related models are explained in Hahn et al. (2005) and Han et al. (2013).
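The stochastic birth-and-death null model can be made concrete with a short sketch. Assuming equal per-gene gain and loss rates λ, a linear birth-and-death process gives a closed-form probability that a family of size s at the start of a branch has size c after time t, expressed in terms of α = λt/(1 + λt). The function name and the example values (λ = 0.0022 per gene per million years from the Summary, t = 43 million years from the divergence time cited for the five Brassicaceae species) are illustrative assumptions; this is not the implementation used in the thesis.

```python
from math import comb


def family_size_transition_prob(s: int, c: int, lam: float, t: float) -> float:
    """P(family size = c after time t | size = s at time 0).

    Assumes a linear birth-and-death process with equal per-gene
    gain and loss rate `lam` (hypothetical helper, for illustration).
    """
    if s == 0:
        # A family with no copies cannot regain one under this model.
        return 1.0 if c == 0 else 0.0
    alpha = lam * t / (1.0 + lam * t)
    total = 0.0
    for j in range(min(s, c) + 1):
        total += (
            comb(s, j)
            * comb(s + c - j - 1, s - 1)
            * alpha ** (s + c - 2 * j)
            * (1.0 - 2.0 * alpha) ** j
        )
    return total


if __name__ == "__main__":
    lam, t = 0.0022, 43.0  # illustrative values taken from the text
    for c in range(0, 4):
        p = family_size_transition_prob(1, c, lam, t)
        print(f"P(size {c} | start size 1) = {p:.4f}")
```

Under these assumptions, a single-copy family is most likely to remain single-copy over 43 million years, and the probability of complete loss equals α, so large changes in size over short branches are the ones that stand out against the null model.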

COPY NUMBER VARIANTS

The definition of copy number variants

Copy number variants (CNVs, Figure 3a) contribute to the standing genetic variation of populations (Zhang et al., 2009). They encompass gains and losses of genomic regions that are larger than or equal to 50 bp and belong to a broader class of genomic variation referred to as structural variations (SVs) (Alkan et al., 2011; Xi et al., 2012). Four types of SVs may lead to copy number changes, including deletions (Figure 3a), tandem duplications (Figure 3a), dispersed duplications and insertions, as opposed to the copy number neutral SVs like inversions (Alkan et al., 2011; Xi et al., 2012). CNVs potentially have significant functional impacts, for they may cause changes not only in gene copy number, but also in gene structure or expression regulation, and may expose recessive alleles in diploid species (Bickhart et al., 2012).

Approaches for high-throughput detection of CNVs within genomes

The various approaches available for high-throughput detection of CNVs can be categorized into either microarray-based or sequencing-based approaches. Microarray-based approaches, including SNP-array and array comparative genome hybridization (aCGH) based approaches, have been widely used to infer copy number expansions and contractions. However, these approaches only detect CNVs of large size and the breakpoint (boundary) resolution of detected CNVs is usually poor (Li and Olivier, 2013). For example, although PennCNV performs relatively well for a SNP array-based approach, it only enables kilobase-resolution detection of CNVs with a median size of ∼12 kb (Wang et al., 2007). In contrast, the recently developed sequencing-based approaches make it possible to detect CNVs of relatively small size (< 1 kb) and to infer breakpoints at nucleotide resolution (Li and Olivier, 2013).

The most commonly used sequencing-based approaches are the read depth (RD), read pair (RP) and split read (SR) approaches. The basic idea underlying the RD approach is that the density (depth) of mapped reads within a genomic region is roughly proportional to the copy number of that region (Xi et al., 2012). The RD approach is suited for estimating absolute copy number; however, it suffers from the same problem as microarray-based approaches do, namely low sensitivity for small CNVs (< 1 kb) and low breakpoint resolution (Bellos et al., 2012). The RP approach (Figure 3b-1, 2) is only suited for analyzing paired-end sequencing data, and in theory allows detecting all types of SVs by assessing the discordance of mapped read pairs from the expected insert size and orientation (Alkan et al., 2011). It outperforms the RD approach in terms of resolution and size range of detected CNVs; however, it fails to detect SVs where the read pairs do not flank the breakpoints, and incorrect mapping, especially in repetitive regions, often causes false positive calls with this approach (Li and Olivier, 2013). When mapping reads to a reference genome, those that overlap with CNV breakpoints usually cannot be mapped entirely and must be split into two mappable fragments, in which case CNVs can be detected at nucleotide resolution (Ye et al., 2009). These ideas constitute the basic principle of the SR approach (Figure 3b-3, 4) that is often used for accurately inferring breakpoints; however, split read alignment may be limited by the short read-length (∼100 bp) of current sequencing data (Li and Olivier, 2013).
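The proportionality between read depth and copy number that underlies the RD approach can be illustrated with a minimal sketch. It assumes a diploid sample, uses the genome-wide median depth as the two-copy baseline, and ignores the corrections (for example for GC content or mappability) that real RD callers apply; the function name and inputs are hypothetical and do not correspond to any specific tool.

```python
from statistics import median
from typing import Sequence


def estimate_copy_number(
    region_depths: Sequence[float],
    genome_depths: Sequence[float],
    ploidy: int = 2,
) -> float:
    """Estimate absolute copy number of a region from per-base read depth.

    Assumption: the genome-wide median depth corresponds to `ploidy`
    copies, so copy number scales linearly with relative depth.
    """
    if not region_depths:
        raise ValueError("region_depths must not be empty")
    baseline = median(genome_depths)
    if baseline <= 0:
        raise ValueError("genome-wide median depth must be positive")
    region_mean = sum(region_depths) / len(region_depths)
    return ploidy * region_mean / baseline


if __name__ == "__main__":
    genome = [30] * 1000       # toy genome-wide depth profile
    het_deletion = [15] * 50   # half the baseline depth
    print(estimate_copy_number(het_deletion, genome))
```

Because the estimate averages depth over the whole region, short regions yield noisy values, which is consistent with the low sensitivity of the RD approach for small CNVs noted above.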

Figure 3. CNV polymorphism and the principles underlying the detection of deletions and tandem duplications with the discordant read pairs (RP) and split read alignment (SR) approaches. (a) All possible genotypes in a diploid species at a copy-number polymorphic locus with at most two rounds of tandem duplications, as modified from Wain et al. (2009). CN: copy number. The red box denotes the locus (genomic region) affected by CNVs. The brown and purple boxes represent its left and right flanking regions, respectively. The normal genotype (iii) at this locus has two copies, but this locus may lose both copies when it is homozygous for a deletion (i). The largest copy number (six) appears when the locus is homozygous for a tandem triplication formed by two successive rounds of tandem duplications (x). (b) Diagrams showing the basic principles of detecting CNVs with the RP (1, 2) and SR approaches (3, 4). D: deletion; TD: tandem duplication. The light purple box indicates the affected genomic region. The read pairs colored in black denote paired reads from samples, while those colored in dark red or green denote the status after mapping to the reference genome.
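The read-pair principle for deletions and tandem duplications can be sketched as a simple classifier. The sketch assumes a standard paired-end library in which concordant pairs map facing each other (forward read upstream, reverse read downstream) at roughly the expected insert size; a pair that spans much more reference sequence than expected suggests a deletion, and a pair that maps in everted (outward-facing) orientation suggests a tandem duplication. The class, function name and three-standard-deviation threshold are illustrative assumptions, not the pipeline used in the thesis.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_pos: int       # reference coordinate of the upstream read
    left_strand: str    # '+' or '-'
    right_pos: int      # reference coordinate of the downstream read
    right_strand: str   # '+' or '-'


def classify_read_pair(
    pair: ReadPair, mean_insert: float, sd_insert: float, n_sd: float = 3.0
) -> str:
    """Label a mapped read pair as concordant or as a candidate signal.

    Assumes a forward-reverse paired-end library (hypothetical sketch).
    """
    span = pair.right_pos - pair.left_pos
    lower = mean_insert - n_sd * sd_insert
    upper = mean_insert + n_sd * sd_insert
    if pair.left_strand == "+" and pair.right_strand == "-":
        if span > upper:
            return "deletion"      # sample lacks sequence present in reference
        if span >= lower:
            return "concordant"
        return "other"             # too close together; not handled here
    if pair.left_strand == "-" and pair.right_strand == "+":
        return "tandem_duplication"  # everted pair spanning a duplication junction
    return "other"                 # same-strand pairs (e.g. inversions)


if __name__ == "__main__":
    print(classify_read_pair(ReadPair(100, "+", 2400, "-"), 300, 30))
    print(classify_read_pair(ReadPair(100, "-", 400, "+"), 300, 30))
```

In practice a call would require several independent discordant pairs supporting the same event, since single mis-mapped pairs in repetitive regions are a common source of false positives.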

Given the fact that these sequencing-based approaches differ with regard to size spectrum, power and resolution, CNV studies in humans (Mills et al., 2011) and Drosophila (Zichner et al., 2013) adopted an integrated strategy, implementing multiple approaches in order to map a full spectrum of CNVs and lower the false positive rate. Among the four types of CNVs, only deletions and tandem duplications can currently be detected and genotyped relatively reliably. Moreover, the current CNV genotyping pipelines are only able to infer genotypes in di-allelic cases wherein the locus may be solely affected by a deletion (Figure 3a-i, ii) or a tandem duplication (Figure 3a-v, vii). It is indeed still challenging to infer and distinguish genotypes in multi-allelic cases (Figure 3a-iv, vi, ix) and cases wherein the locus is affected by multiple rounds of tandem duplications, e.g. tandem triplications, as shown in Figure 3a-viii and x (Wain et al., 2009).
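The genotype space described for Figure 3a can be enumerated directly. With four possible allelic states at a locus (deletion, single copy, tandem duplication, tandem triplication, carrying 0, 1, 2 and 3 copies respectively), a diploid individual has ten unordered genotypes with total copy numbers from zero to six. The allele labels below are illustrative, and the sketch does not reproduce the figure's roman-numeral ordering.

```python
from itertools import combinations_with_replacement

# Copies contributed by each allelic state (labels are illustrative).
ALLELE_COPIES = {
    "deletion": 0,
    "single_copy": 1,
    "tandem_duplication": 2,
    "tandem_triplication": 3,
}


def diploid_genotypes(alleles=ALLELE_COPIES):
    """Return every unordered diploid genotype with its total copy number."""
    return [
        ((a, b), alleles[a] + alleles[b])
        for a, b in combinations_with_replacement(alleles, 2)
    ]


if __name__ == "__main__":
    for (a, b), cn in diploid_genotypes():
        print(f"{a} / {b}: CN = {cn}")
```

Several genotypes share the same total copy number (for example, deletion/duplication and single copy/single copy both give two copies), which illustrates why read depth alone cannot distinguish multi-allelic genotypes.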

STUDY SYSTEMS

System One

The recent publications of the genomes of four Brassicaceae species that are closely related to the plant model Arabidopsis thaliana, namely A. lyrata, Capsella rubella, Eutrema salsugineum and Schrenkiella parvula, offer an unprecedented opportunity to study gene family size variation among plant species. The divergence among these five Brassicaceae species started approximately 43 million years ago and no further WGD events occurred after their divergence (Beilstein et al., 2010; Yang et al., 2013). They therefore meet the requirements for studying gene gains and losses with the maximum likelihood approach proposed by Demuth and Hahn (2009). The biology of these five species differs in many important aspects, such as mating system, life cycle and habitat. Arabidopsis lyrata is a self-incompatible (SI), perennial species that is congeneric with the self-compatible (SC) annual A. thaliana (Hu et al., 2011). Capsella rubella is a SC annual forb that is emerging as an excellent model for understanding the evolution of self-fertilization, and in which selfing evolved more recently (< 20,000 years ago) than in A. thaliana (around 1 million years ago) (Slotte et al., 2013). Both E. salsugineum and S. parvula are SC annual halophytic species that can naturally tolerate multiple types of abiotic stresses such as extreme salinity and cold (Dassanayake et al., 2011; Yang et al., 2013; Vekemans et al., 2014). This heterogeneity in mating system and in the timing of shifts to selfing allowed me to explore the relationship between plant mating system and gene gain-and-loss.

System Two

Woody species have secondary growth, self-supporting structures and a much longer lifespan than forbs, making them distinct from herbaceous plants (Jansson and Douglas, 2007). The GRAS gene family, whose members are primarily involved in plant growth regulation (Bolle, 2004), represents a multi-gene family that may have expanded more rapidly in fast-growing woody species than in herbaceous species. To study the expansion and diversification of this gene family in flowering plants, I chose three plant model species, i.e. Populus trichocarpa (model for woody dicots) (Tuskan et al., 2006), Arabidopsis thaliana (model for herbaceous dicots) (The Arabidopsis Genome Initiative, 2000), and Oryza sativa (model plant for monocots) (Ouyang et al., 2007), for which abundant gene functional information and expression data were available for a comparative study.

System Three

Arabidopsis lyrata is an outcrossing, perennial species with a wide geographic distribution across North America, northern and central Europe, and Asia (Ross-Ibarra et al., 2008). It has become a model for ecological genetic studies on various aspects, such as flowering time variation (Riihimäki et al., 2005; Sandring et al., 2007), trichome production (Kärkkäinen et al., 2004; Kivimäki et al., 2007) and local adaptation (Turner et al., 2010). The publication of a high-quality reference genome of A. lyrata (Hu et al., 2011) makes it particularly suitable for genomic studies. To investigate copy number variation within plant species, I chose 24 individuals of A. lyrata subsp. petraea from eight natural populations growing on three different bedrocks in Europe, including four populations on calcareous, three on siliceous and one on mixed bedrock. This system not only allowed me to investigate basic aspects of CNVs in natural populations of a plant species, e.g. genomic impacts and formation mechanisms, but also gave me the opportunity to explore the roles of CNVs in edaphic adaptation.

MAIN GOALS AND METHODS

For more than a century, biologists have sought to understand the origins, adaptive mechanisms and evolutionary processes underlying genomic variation within and among species (Hilbish and Koehn, 1987). So far, however, our knowledge about these aspects has been mainly limited to one type of genomic variation, i.e. single nucleotide polymorphisms (SNPs, a.k.a. point mutations). There is growing evidence that another class of genomic variation, namely gene copy number variation (GCNV, a.k.a. gene family size variation), is pervasive and contributes to environmental adaptation and organismal diversification (Demuth and Hahn, 2009; Kondrashov, 2012). In this thesis, I investigated questions surrounding GCNV, both within and among plant species.

Chapter One

In the first chapter, I focused on GCNV among plant species that diverged relatively recently (within a plant family). I performed an evolutionary analysis of gene family size variation in Arabidopsis and its relatives using a likelihood approach. The main goals of this study were:

1) To test the hypothesis that mating system impacts the rate of gene gain-and-loss (selfing reduces the rate of gene gain-and-loss),

2) To assess whether rapidly evolving gene families in plants are enriched for defense-related genes and whether they exhibit a high incidence of positive selection acting on point mutations,

3) To explore the relative importance of different variation origins in the adaptive evolution of gene family size in plants.

Chapter Two

In the second chapter, I continued my investigation of GCNV among plant species, but over a much longer evolutionary time-scale and within an individual gene family. The expansion and diversification of the GRAS gene family in Populus was investigated through comparative analyses with A. thaliana and rice on aspects such as phylogeny, expansion mechanisms, gene structure, adaptive evolution, functional divergence, and expression divergence. The main goals were:

1) To explore the mechanisms underlying the rapid expansion of the GRAS gene family in Populus,

2) To examine the extent and type of functional divergence between subfamilies of the GRAS gene family resulting from site-specific changes in evolutionary rates or biochemical properties,

3) To assess the relative contribution of genetic redundancy and functional novelty to the retention of GRAS gene duplicates.

Chapter Three

In the third chapter, I shifted my focus from GCNV among species to GCNV within species. The genomic variation studied was also extended to copy number variation that results from copy number variants (CNVs) within genomes. I used three sequencing-based approaches (RD, RP and SR) to detect CNVs in 24 individuals of A. lyrata from eight natural populations occurring on either calcareous, siliceous or mixed bedrocks. The major goals of this chapter were:

1) To characterize the content, genomic impact and formation mechanisms of CNVs within this plant species,

2) To assess whether there is evidence supporting the involvement of CNVs in edaphic adaptation of A. lyrata.


REFERENCES

Adams, K.L. and Wendel, J.F. (2005) Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol., 8, 135-141.

Albertin, W. and Marullo, P. (2012) Polyploidy in fungi: evolution after whole-genome duplication. Proc. R. Soc. B, 279, 2497-2509.

Alkan, C., Coe, B.P. and Eichler, E.E. (2011) Genome structural variation discovery and genotyping. Nat. Rev. Genet., 12, 363-376.

Beilstein, M.A., Nagalingum, N.S., Clements, M.D., Manchester, S.R. and Mathews, S. (2010) Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proc. Natl. Acad. Sci., 107, 18724-18728.

Bellos, E., Johnson, M.R. and Coin, L.J.M. (2012) cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data. Genome Biol., 13, R120.

Bickhart, D.M., Hou, Y., Schroeder, S.G., Alkan, C., Cardone, M.F., Matukumalli, L.K., Song, J., Schnabel, R.D., Ventura, M., Taylor, J.F., et al. (2012) Copy number variation of individual cattle genomes using next-generation sequencing. Genome Res., 22, 778-790.

Bolle, C. (2004) The role of GRAS proteins in plant signal transduction and development. Planta, 218, 683-692.

Bowers, J.E., Chapman, B.A., Rong, J. and Paterson, A.H. (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature, 422, 433-438.

Bridges, C.B. (1936) The bar "gene" a duplication. Science, 83, 210-211.

Chen, K., Durand, D. and Farach-Colton, M. (2000) NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol., 7, 429-447.

Dassanayake, M., Oh, D.H., Haas, J.S., Hernandez, A., Hong, H., Ali, S., Yun, D.J., Bressan, R.A., Zhu, J.K., Bohnert, H.J., et al. (2011) The genome of the extremophile crucifer Thellungiella parvula. Nat. Genet., 43, 913-918.

Dehal, P. and Boore, J.L. (2005) Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol., 3, e314.

Demuth, J.P., De Bie, T., Stajich, J.E., Cristianini, N. and Hahn, M.W. (2006) The evolution of mammalian gene families. PLoS One, 1, e85.

Demuth, J.P. and Hahn, M.W. (2009) The life and death of gene families. Bioessays, 31, 29-39.

Gao, C., Ren, X., Mason, A.S., Liu, H., Xiao, M., Li, J. and Fu, D. (2014) Horizontal gene transfer in plants. Funct. Integr. Genomics, 14, 23-29.

Goodman, M., Czelusniak, J., Moore, G.W., Romero-Herrera, A. and Matsuda, G. (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Biol., 28, 132-163.

Gross, L. (2006) When less is more: losing genes on the path to becoming human. PLoS Biol., 4, e76.

Hahn, M.W. (2007) Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol., 8, R141.

Hahn, M.W., De Bie, T., Stajich, J.E., Nguyen, C. and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res., 15, 1153-1160.

Hahn, M.W., Demuth, J.P. and Han, S.G. (2007a) Accelerated rate of gene gain and loss in primates. Genetics, 177, 1941-1949.

Hahn, M.W., Han, M.V. and Han, S.G. (2007b) Gene family evolution across 12 Drosophila genomes. PLoS Genet., 3, e197.

Han, M.V., Thomas, G.W.C., Lugo-Martinez, J. and Hahn, M.W. (2013) Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol., 30, 1987-1997.

Hastings, P., Lupski, J.R., Rosenberg, S.M. and Ira, G. (2009) Mechanisms of change in gene copy number. Nat. Rev. Genet., 10, 551-564.

Hilbish, T.J. and Koehn, R.K. (1987) The adaptive importance of genetic variation. Am. Sci., 134-141.

Hu, T.T., Pattyn, P., Bakker, E.G., Cao, J., Cheng, J.-F., Clark, R.M., Fahlgren, N., Fawcett, J.A., Grimwood, J. and Gundlach, H. (2011) The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet., 43, 476-481.

Innan, H. and Kondrashov, F. (2010) The evolution of gene duplications: classifying and distinguishing between models. Nat. Rev. Genet., 11, 97-108.

Jansson, S. and Douglas, C.J. (2007) Populus: a model system for plant biology. Annu. Rev. Plant Biol., 58, 435-458.

Jensen, R.A. (2001) Orthologs and paralogs–we need to get it right. Genome Biol., 2, 1-3.

Jiang, D., Yin, C., Yu, A., Zhou, X., Liang, W., Yuan, Z., Xu, Y., Yu, Q., Wen, T. and Zhang, D. (2006) Duplication and expression analysis of multicopy miRNA gene family members in Arabidopsis and rice. Cell Res., 16, 507-518.

Jiao, Y., Wickett, N.J., Ayyampalayam, S., Chanderbali, A.S., Landherr, L., Ralph, P.E., Tomsho, L.P., Hu, Y., Liang, H., Soltis, P.S., et al. (2011) Ancestral polyploidy in seed plants and angiosperms. Nature, 473, 97-100.

Kane, J., Freeling, M. and Lyons, E. (2010) The evolution of a high copy gene array in Arabidopsis. J. Mol. Evol., 70, 531-544.

Kärkkäinen, K., Løe, G. and Ågren, J. (2004) Population structure in Arabidopsis lyrata: evidence for divergent selection on trichome production. Evolution, 58, 2831-2836.

Kivimäki, M., Kärkkäinen, K., Gaudeul, M., Løe, G. and Ågren, J. (2007) Gene, phenotype and function: GLABROUS1 and resistance to herbivory in natural populations of Arabidopsis lyrata. Mol. Ecol., 16, 453-462.

Knowles, D.G. and McLysaght, A. (2009) Recent de novo origin of human protein-coding genes. Genome Res., 19, 1752-1759.

Kondrashov, F.A. (2012) Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc. R. Soc. B, 279, 5048-5057.

Koskiniemi, S., Sun, S., Berg, O.G. and Andersson, D.I. (2012) Selection-driven gene loss in bacteria. PLoS Genet., 8, e1002787.


Lam, H.Y., Mu, X.J., Stütz, A.M., Tanzer, A., Cayting, P.D., Snyder, M., Kim, P.M., Korbel, J.O. and Gerstein, M.B. (2009) Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat. Biotechnol., 28, 47-55.

Li, W. and Olivier, M. (2013) Current analysis platforms and methods for detecting copy number variation. Physiol. Genomics, 45, 1-16.

Mills, R.E., Walter, K., Stewart, C., Handsaker, R.E., Chen, K., Alkan, C., Abyzov, A., Yoon, S.C., Ye, K., Cheetham, R.K., et al. (2011) Mapping copy number variation by population-scale genome sequencing. Nature, 470, 59-65.

Ohno, S. (1970) Evolution by gene duplication. London: George Allen & Unwin Ltd.; Berlin, Heidelberg and New York: Springer-Verlag.

Ouyang, S., Zhu, W., Hamilton, J., Lin, H., Campbell, M., Childs, K., Thibaud-Nissen, F., Malek, R.L., Lee, Y. and Zheng, L. (2007) The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res., 35, D883-D887.

Riihimäki, M., Podolsky, R., Kuittinen, H., Koelewijn, H. and Savolainen, O. (2005) Studying genetics of adaptive variation in model organisms: flowering time variation in Arabidopsis lyrata. Genetica, 123, 63-74.

Rooney, A.P. and Ward, T.J. (2005) Evolution of a large ribosomal RNA multigene family in filamentous fungi: birth and death of a concerted evolution paradigm. Proc. Natl. Acad. Sci., 102, 5084-5089.

Ross-Ibarra, J., Wright, S.I., Foxe, J.P., Kawabe, A., DeRose-Wilson, L., Gos, G., Charlesworth, D. and Gaut, B.S. (2008) Patterns of polymorphism and demographic history in natural populations of Arabidopsis lyrata. PLoS One, 3, e2411.

Sandring, S., Riihimäki, M.A., Savolainen, O. and Ågren, J. (2007) Selection on flowering time and floral display in an alpine and a lowland population of Arabidopsis lyrata. J. Evol. Biol., 20, 558-567.

Slotte, T., Hazzouri, K.M., Agren, J.A., Koenig, D., Maumus, F., Guo, Y.L., Steige, K., Platts, A.E., Escobar, J.S., Newman, L.K., et al. (2013) The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet., 45, 831-835.

Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., Wall, P.K. and Soltis, P.S. (2009) Polyploidy and angiosperm diversification. Am. J. Bot., 96, 336-348.

Tautz, D. and Domazet-Lošo, T. (2011) The evolutionary origin of orphan genes. Nat. Rev. Genet., 12, 692-702.

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the Arabidopsis thaliana. Nature, 408, 796-815.

Turner, T.L., Bourne, E.C., Von Wettberg, E.J., Hu, T.T. and Nuzhdin, S.V. (2010) Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nat. Genet., 42, 260-263.

Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313, 1596-1604.

Vekemans, X., Poux, C., Goubet, P.M. and Castric, V. (2014) The evolution of selfing from outcrossing ancestors in Brassicaceae: what have we learned from variation at the S-locus? J. Evol. Biol., 27, 1372-1385.

Wain, L.V., Armour, J.A. and Tobin, M.D. (2009) Genomic copy number variation, human health, and disease. Lancet, 374, 340-350.

Walsh, J.B. and Stephan, W. (2002) Multigene families: evolution. eLS, 12, 406-412.

Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S.F., Hakonarson, H. and Bucan, M. (2007) PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res., 17, 1665-1674.

Wang, X., Grus, W.E. and Zhang, J. (2006) Gene losses during human origins. PLoS Biol., 4, e52.

Wang, X. and Lavrov, D.V. (2011) Gene recruitment–A common mechanism in the evolution of transfer RNA gene families. Gene, 475, 22-29.

Wang, Y., Wang, X. and Paterson, A.H. (2012) Genome and gene duplications and gene expression divergence: a view from plants. Ann. N. Y. Acad. Sci., 1256, 1-14.

Wendel, J.F. (2000) Genome evolution in polyploids. In: Plant Molecular Evolution: Springer, pp. 225-249.

Xi, R., Lee, S. and Park, P.J. (2012) A survey of copy-number variation detection tools based on high-throughput sequencing data. Curr. Protoc. Hum. Genet., 75, 7.19.11-17.19.15.

Yang, R., Jarvis, D.E., Chen, H., Beilstein, M.A., Grimwood, J., Jenkins, J., Shu, S., Prochnik, S., Xin, M., Ma, C., et al. (2013) The reference genome of the halophytic plant Eutrema salsugineum. Front. Plant Sci., 4, 46.

Ye, K., Schulz, M.H., Long, Q., Apweiler, R. and Ning, Z. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25, 2865-2871.

Zhang, F., Gu, W., Hurles, M.E. and Lupski, J.R. (2009) Copy number variation in human health, disease, and evolution. Annu. Rev. Genomics Hum. Genet., 10, 451-481.

Zhang, J. (2003) Evolution by gene duplication: an update. Trends Ecol. Evol., 18, 292-298.

Zichner, T., Garfield, D.A., Rausch, T., Stutz, A.M., Cannavo, E., Braun, M., Furlong, E.E. and Korbel, J.O. (2013) Impact of genomic structural variation in Drosophila melanogaster based on population-scale sequencing. Genome Res., 23, 568-579.


CHAPTER ONE: Evolutionary analysis of gene family size variation in Arabidopsis and its relatives

ABSTRACT

Variation in gene family size among species may contribute to adaptation and organismal diversification. A recently developed likelihood approach uses a stochastic birth-and-death process to model gene gain-and-loss across branches of the phylogenetic tree of closely related species, thus providing a statistical framework to trace gene family size variation over time and to test for the action of natural selection. While substantial progress has been made in animals, our understanding of the evolution of gene family size variation in plants is still limited. Using the likelihood approach described above, we performed an evolutionary analysis of gene family size variation among Arabidopsis and its relatives. The average rate of gene gain-and-loss (λ) was estimated to be 0.0022 gains and losses per gene per million years. Branch-specific rate estimates supported our hypothesis that selfing reduces the rate of gene gain-and-loss. 5,683 gene families that did not show any variance in size among species were found to be mainly involved in conserved processes. In contrast, 272 gene families that were inferred to have been evolving rapidly in size are mainly involved in plant-pathogen/herbivore interactions and pollen-pistil interactions. We further demonstrated a high incidence of positive selection acting on nucleotides in these rapidly evolving gene families. Finally, we found that gene gains via tandem duplication predominantly contributed to the adaptive evolution of gene family size. Overall, we provide the first genome-wide analysis of gene family size evolution among closely related plant species, and uncover a possible interplay between plant mating system and the dynamics of gene gain-and-loss.

KEYWORDS

Brassicaceae · gene gain-and-loss · gene family size · mating system · positive selection · tandem duplication

ABBREVIATIONS

FDR: False Discovery Rate, GO: Gene Ontology, MRCA: Most Recent Common Ancestor, MYs: Million years, PCD: Programmed Cell Death, SC: Self-compatible, SI: Self-incompatible, TE: Transposable element, WGDs: Whole Genome Duplications


INTRODUCTION

A gene family consists of a group of evolutionarily related genes showing significant sequence similarity with each other (Walsh and Stephan, 2002). Variation in gene family size, a.k.a. gene copy number variation, has often been documented in comparative genomic studies even among closely related species such as flies (Hahn et al., 2007b; Vieira et al., 2007), ants (Kulmuni et al., 2013), fungi (Hahn et al., 2005; Floudas et al., 2012) and mammals (Demuth et al., 2006; Gutiérrez et al., 2014). Gene family size variation may underlie physiological or morphological differences among species, which may ultimately contribute to environmental adaptation and organismal diversification (Demuth and Hahn, 2009; Kondrashov, 2012).

An increasing number of empirical studies have provided evidence regarding the evolutionary significance of gene copy number changes, including expansions and contractions. One example highlighting the contribution of gene copy number expansion to environmental adaptation comes from a study on the ability of Arabidopsis halleri to colonize soils polluted by heavy metals, which was partially attributed to copy number expansion of the metal pump gene HMA4 (Hanikenne et al., 2008). Another example highlighting the role of gene copy number expansion in shaping morphological diversity arose from a study on leaf shape evolution in the Brassicaceae family (Vlad et al., 2014). RCO-type genes, which originated via duplication of LMI1-type genes after the split of Brassicaceae from Aethionema, are responsible for the evolution of complex leaf shapes in Brassicaceae species such as Cardamine hirsuta (dissected leaves) and Arabidopsis lyrata (lobed leaves), whereas their loss in A. thaliana is responsible for the reversal to a simple leaf shape (Vlad et al., 2014). Gene copy number contraction may also be adaptive. For instance, the loss of the gene CASPASE12 in humans, which encodes a negative regulator of immune response to endotoxins, confers better protection from sepsis (Wang et al., 2006).

Variation in gene family size among species results from differences in gene gains and losses among species (Chauve et al., 2008). In plants, gene gains are mostly attributed to the various modes of gene duplication, namely whole genome duplications (WGDs), a.k.a. polyploidization events (Wendel, 2000; Tang et al., 2008), tandem duplications that result from template slippage during DNA repair or unequal crossing-over (Kane et al., 2010) and dispersed duplications that occur through DNA- or RNA-based transposition of transposable elements (TEs) (Wang et al., 2011b; Owens et al., 2013). Meanwhile, lineage-specific gene families may occasionally originate de novo (Long et al., 2003) or via horizontal gene transfer (Richardson and Palmer, 2007). While gene gains result in the expansion of gene families, gene losses via deletion or pseudogenization lead to the contraction or extinction of gene families (Demuth and Hahn, 2009). The size variation of gene families among closely related species that shared all WGDs may thus mainly be attributed to differential gene losses following WGDs among species, and differential gene gains through tandem duplications or dispersed duplications. Yet, the relative importance of these different origins of variation in the adaptive evolution of gene family size remains elusive.

In plants, the evolutionary shift from an outcrossing to a selfing mating system has occurred independently and repeatedly within families and even species, and has been suggested to play a central role in plant genome evolution (Wright et al., 2008; Vekemans et al., 2014). Our current knowledge about the relationship between genomic or molecular evolution and life-history traits, e.g. mating system, is mainly limited to point mutations. Emerging evidence suggests that selfing may reduce the overall rate of point mutations. A recent study both theoretically and empirically demonstrated that selfing reduces the rate of indel-associated point mutations (Hollister et al., 2010). Theoretical modeling further predicted that the rate of deleterious point mutations in populations is lower in selfers than in outcrossers (Nakayama et al., 2012). We therefore hypothesize that selfing also reduces the rates of other types of mutations including duplications and deletions, thus decreasing the rate of gene gain-and-loss, a.k.a. gene turnover rate (λ) in selfers relative to outcrossers.

While some observed variation in gene family size among extant species has been potentially shaped by natural selection, the vast majority may just be attributed to stochastic processes, e.g. genetic drift (Hahn et al., 2005; Demuth and Hahn, 2009). To distinguish between adaptive and stochastic processes, it is essential to first accurately infer the number of gene gains and losses in any lineage. Yet, the direction of changes (gene gains or losses) cannot be determined by simple pairwise comparisons of gene family size between species. Moreover, it is critical to take the divergence time of the two species into account, because for a given number of changes, the more recently the species diverged, the more likely the changes have been driven by natural selection (Hahn et al., 2005).

A maximum likelihood approach has been developed, which uses a stochastic birth-and-death process to model gene gains and losses along the branches of a time-calibrated phylogenetic tree of species, thus providing a statistical framework to trace gene family size variation over time in a whole-genome context and to test for the action of natural selection (Hahn et al., 2005; Han et al., 2013). The robustness of this approach has been proven by its successful application in multiple lineages of closely related species, such as yeast (Hahn et al., 2005), mammals (Demuth et al., 2006; Hahn et al., 2007a), and flies (Hahn et al., 2007b).
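The birth-and-death process underlying this approach can be illustrated with a small simulation. The following is a minimal sketch only, not the likelihood machinery of the cited software: it simulates the size of a single gene family along one branch, assuming that each gene copy is gained and lost at the same per-gene rate λ. The function name, the starting size of five copies, the seed and the number of replicates are hypothetical choices made for illustration; the rate and the branch length are taken from values reported in this chapter.

```python
import random

def simulate_family_size(n0, lam, t, rng):
    """Simulate a linear birth-death process for one gene family.

    n0  : family size at the start of the branch
    lam : per-gene rate of gain and of loss (assumed equal), per MY
    t   : branch length in MY
    """
    n, elapsed = n0, 0.0
    while n > 0:
        # Each of the n copies is gained or lost at rate lam,
        # so the total event rate is 2 * n * lam.
        elapsed += rng.expovariate(2.0 * n * lam)
        if elapsed > t:
            break
        # Gain and loss are equally likely under equal rates.
        n += 1 if rng.random() < 0.5 else -1
    return n  # size 0 is absorbing: an extinct family gains no copies

rng = random.Random(1)  # fixed seed, for reproducibility only
sizes = [simulate_family_size(5, 0.0022, 43.0, rng) for _ in range(10000)]
mean_size = sum(sizes) / len(sizes)
unchanged = sum(s == 5 for s in sizes) / len(sizes)
```

Under equal gain and loss rates the expected family size stays at its starting value, while the spread of sizes grows with branch length. This is why a given size difference is more surprising between recently diverged species than between distantly related ones.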

Two important insights have been gained from likelihood-based studies of gene families that evolve rapidly in size in animals (Demuth et al., 2006; Hahn et al., 2007b). First, defense-related genes were over-represented among the rapidly evolving gene families. Second, a higher-than-expected proportion of these rapidly evolving gene families showed accelerated nucleotide substitution rates, indicative of positive selection acting on nucleotides. Plants and animals have evolved distinct immune systems to defend themselves against their specific biotic antagonists (Jones and Dangl, 2006). It is thus intriguing for us to assess the generality of these two properties of gene family size evolution by incorporating results from plant species.

While substantial progress has been made in animal studies, our understanding of the evolution of gene family size variation in plants is still limited. To date, numerous studies have been undertaken to compare the size of gene families among different plant species; however, most of them only focused on individual gene families (Yang et al., 2006; Yang et al., 2008; Ye et al., 2011; Bagal et al., 2012; Lei et al., 2012). Furthermore, gene family size differences have commonly been ascribed to natural selection, without taking the divergence time between the studied species into account, although the observed size differences may have resulted from random gene gains and losses over the long period of time since lineage divergence (Hahn et al., 2005). It is therefore timely to investigate gene family size variation in plants from a genome-wide perspective and to distinguish between selection and drift as the underlying processes.

The recent publications of the genomes of four Brassicaceae species that are closely related to the plant model A. thaliana offer an unprecedented opportunity to study gene family size variation in plants. The biology of these five species differs in many important aspects such as mating system, life cycle and habitat. Arabidopsis lyrata is a self-incompatible (SI) perennial species that is congeneric with the self-compatible (SC) annual A. thaliana (Hu et al., 2011). Capsella rubella is a SC annual forb that emerges as an excellent model for understanding the evolution of self-fertilization, in which selfing evolved more recently (< 20,000 years ago) than in A. thaliana (around 1 MYs ago) (Slotte et al., 2013). Both Eutrema salsugineum and Schrenkiella parvula are SC annual halophytic species that can naturally tolerate multiple types of abiotic stresses such as extreme salinity and cold (Dassanayake et al., 2011; Yang et al., 2013; Vekemans et al., 2014). The heterogeneity in mating system and time since the evolution of selfing makes this system also excellent for exploring the relationship between plant mating system and the dynamics of gene gain-and-loss.


In the present study, we performed an evolutionary analysis of gene family size variation in Arabidopsis and its relatives using a likelihood approach. The study had three major goals: 1) to test our hypothesis that mating system affects the rate of gene gain-and-loss (selfing reduces the rate of gene gain-and-loss); 2) to assess whether in plants, rapidly evolving gene families are enriched for defense-related genes and exhibit a high incidence of positive selection acting on nucleotides; and 3) to explore the relative importance of different variation origins in the adaptive evolution of gene family size in plants.

RESULTS

Protein-coding gene families in the five Brassicaceae species

A total of 139,880 protein sequences from the five Brassicaceae species were clustered into 28,333 gene families (Table 1). Of these, 16,258 single-gene families (only one gene in total) were excluded, leaving 12,075 gene families for further analyses (Table S1). As shown in Figure 1, the size of these 12,075 gene families ranged from 0 to 220 in individual species. A majority (81%, 9,781) of them had low average copy number (≤ 2) in individual species, of which 44.5% (4,351) maintained a single copy in each species, and may represent a group of genes resistant to duplication. Although most (90.2%) of these 12,075 gene families exhibited low variance (standard deviation ≤ 1) in size among species, the remaining 9.8% (1,186) showed relatively high variation in size among species (Figure 1).
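The filtering and summary steps above can be sketched on a toy table of per-species gene counts. This is an illustration under stated assumptions, not the pipeline used in this study: the family names and counts are hypothetical, and the sample standard deviation is assumed here because the text does not specify which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical toy counts: family -> copies in each of the five species.
counts = {
    "fam_a": [1, 1, 1, 1, 1],   # single copy in every species
    "fam_b": [0, 0, 1, 0, 0],   # single-gene family (one gene in total)
    "fam_c": [2, 3, 1, 2, 2],
    "fam_d": [12, 4, 7, 9, 3],
}

def summarize(counts):
    # Drop single-gene families, i.e. those with only one gene in total.
    kept = {f: c for f, c in counts.items() if sum(c) > 1}
    return {
        "kept": sorted(kept),
        # Low average copy number across species (<= 2).
        "low_copy": sorted(f for f, c in kept.items() if mean(c) <= 2),
        # Exactly one copy in each species.
        "single_copy": sorted(f for f, c in kept.items() if set(c) == {1}),
        # High variation in size among species (standard deviation > 1).
        "high_variance": sorted(f for f, c in kept.items() if stdev(c) > 1),
    }

summary = summarize(counts)
```

In this toy table, `fam_b` is excluded as a single-gene family, `fam_a` and `fam_c` count as low-copy families, and only `fam_d` exceeds the standard deviation threshold.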

Table 1. Number of genes and gene families in the five Brassicaceae species

By incorporating outgroup species from the Rosid clade, we were able to distinguish lineage-specific gene families from extinction gene families. 1,847 families containing 6,947 genes (5%, 6,947/139,880) were lineage-specific (Table S2), implying that 11.8 new gene families have arisen per MYs in Arabidopsis and its relatives (1,847 lineage-specific families / 156 MYs, the total length of the phylogenetic tree). GO term enrichment analyses revealed that the most over-represented molecular function, biological process and cellular component are cysteine-type peptidase activity, proteolysis and endomembrane system, respectively (Table S3). Meanwhile, we found that 1,251 families containing 5,303 genes (3.8%, 5,303/139,880) went extinct in certain lineages after their divergence (Table S4), meaning that around 8 gene families were lost per MYs. The most over-represented molecular function, biological process and cellular component for these families are structural constituent of ribosome, electron transport chain and intracellular organelle, respectively (Table S5).
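The per-MY figures in this paragraph follow directly from the reported counts and the total tree length. A minimal sketch of the arithmetic, using only numbers given in the text (the variable names are illustrative):

```python
TOTAL_GENES = 139880          # protein sequences across the five species
TREE_LENGTH_MY = 156          # total branch length of the species tree, in MY

lineage_specific_families = 1847
lineage_specific_genes = 6947
extinct_families = 1251
extinct_genes = 5303

# Families gained or lost per million years, averaged over the whole tree.
families_gained_per_my = lineage_specific_families / TREE_LENGTH_MY
families_lost_per_my = extinct_families / TREE_LENGTH_MY

# Share of all genes that belong to each class of families, in percent.
lineage_specific_pct = 100 * lineage_specific_genes / TOTAL_GENES
extinct_pct = 100 * extinct_genes / TOTAL_GENES
```

These values are averages over all branches; they do not imply that families arose or were lost at a constant rate on every lineage.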

We found 8,977 families present in all five Brassicaceae species and these may represent the core proteome of the Brassicaceae. These 8,977 families, together with the 1,251 extinction families, constituted 10,228 families (116,675 genes, 83.4%, 116,675/139,880) that have at least one ancestral gene copy present in the most recent common ancestor (MRCA) of the studied species (Table S6). Of these, at least 44.4% (4,545 families) have changed in size in at least one species over the past 43 MYs since their divergence. The remaining 5,683 (55.6%) families that did not show any variance in size among the five species may represent gene families that are resistant to gene gain-and-loss. GO term enrichment analysis revealed that genes enriched among these families are mainly involved in conserved processes such as DNA repair, cell cycle, plastid organization and metabolism of cellular essential compounds (Figure S2).

Figure 1. Size variation of the 12,075 studied gene families among the five investigated Brassicaceae species.

Rates of gene gain-and-loss in Arabidopsis and its relatives

The 10,213 gene families that were inferred to be present in the MRCA of the five studied species (excluding 15 families with large variance in size; Table S6) were subjected to likelihood analyses of gene gain-and-loss. By using a one-parameter model, the average rate of gene gain-and-loss (λ) across all branches of the phylogenetic tree was estimated to be 0.002425 gains and losses per gene per MYs. To test the hypothesis that different branches evolve at different rates and to find the best-fit model for our dataset, we also tried a full-parameter model (in our case, an eight-parameter model) by assigning a different λ to each branch of the tree. This model, however, failed to converge to a global maximum, as did trials with the seven- and six-parameter models (Table S7). By contrast, the five-, four-, three- and two-parameter models all converged to a single maximum, and the five-parameter model showed the lowest negative log-likelihood (-lnL = 57,307.93; Table S7), i.e. the highest likelihood, implying that it might be the best model for our dataset.

To assess the statistical significance of the five-parameter model, we simulated 1,000 datasets under the one-parameter model to generate a null distribution of likelihood ratios. The result showed that the likelihood ratio (-2∆lnL = 1,913.44) of our dataset was greater than 95% of values in the likelihood-ratio distribution from the simulated datasets (Figure S1). This result indicated that the five-parameter model fit our dataset significantly better than the one-parameter model, thus suggesting that heterogeneous rates of gene gain-and-loss exist across the branches.
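The comparison of an observed likelihood ratio with a simulated null distribution can be sketched as an empirical P-value. This is a minimal illustration: the function name is hypothetical, and the 1,000 null values below are placeholders drawn from a gamma distribution, not the likelihood ratios obtained from the datasets simulated in this study.

```python
import random

def empirical_p_value(observed, null_values):
    """Fraction of simulated likelihood ratios >= the observed one."""
    exceed = sum(v >= observed for v in null_values)
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (len(null_values) + 1)

observed_lr = 1913.44   # -2 * delta lnL reported in the text

# Placeholder null distribution (an assumption for illustration only);
# in practice these values come from datasets simulated under the
# one-parameter model and refitted under both models.
rng = random.Random(0)
null_lr = [rng.gammavariate(2.0, 2.0) for _ in range(1000)]

p = empirical_p_value(observed_lr, null_lr)
```

With 1,000 simulated datasets, the smallest P-value this estimator can return is 1/1,001, so the resolution of the test is set by the number of simulations.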

Figure 2. Expansions and contractions of gene families in Arabidopsis and its relatives. The phylogeny and divergence times of the five studied species were obtained by simplifying a published topology (Beilstein et al., 2010; Yang et al., 2013). The branches of the tree are colored according to branch-specific rates of gene gain-and-loss (from λ1 to λ5). The pie plot on each branch shows the proportion of expanded families (black), contracted families (white) and unchanged families (gray). The two numbers underneath the pie plots indicate the number of expanded (+) and contracted (-) families for each branch, respectively. SC: Self-compatibility; SI: Self-incompatibility.

To correct for errors in genome assemblies and annotations, we also applied the species-specific error models as implemented in CAFE v3.0. The average rate of gene turnover (λ) across all branches of the phylogenetic tree was re-estimated to be 0.0022 gains and losses per gene per MYs. Branch-specific rates under the five-parameter model (from λ1 to λ5) are shown in Figure 2. We found the highest rate (λ2 = 0.006282) on the terminal branch leading to the SI species A. lyrata. This rate was 2.86 times the average (λ = 0.0022) and 4.18 times the slowest rate (λ5 = 0.001503). By contrast, gene gain-and-loss was found to evolve relatively slowly (λ3 = 0.001894; λ5 = 0.001503) along the branches leading to the two halophytic species, E. salsugineum and S. parvula. It is noteworthy that the branch leading to A. thaliana, in which selfing evolved earlier than in C. rubella, had a lower rate of gene gain-and-loss (λ1 = 0.002498) than that leading to the latter (λ4 = 0.003164).
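The fold differences quoted above can be recomputed from the reported rates. A minimal sketch, in which the dictionary keys are illustrative labels and all values are taken from the text:

```python
# Branch-specific rates under the five-parameter model (per gene per MY).
branch_rates = {
    "lambda1": 0.002498,
    "lambda2": 0.006282,
    "lambda3": 0.001894,
    "lambda4": 0.003164,
    "lambda5": 0.001503,
}
average_rate = 0.0022  # error-corrected average across all branches

fastest = max(branch_rates, key=branch_rates.get)
slowest = min(branch_rates, key=branch_rates.get)

fold_vs_average = branch_rates[fastest] / average_rate
fold_vs_slowest = branch_rates[fastest] / branch_rates[slowest]
```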

Finally, we examined the numbers and proportions of expanded or contracted gene families along each branch (Figure 2). Gene family contractions were more frequent than expansions along five out of the eight branches of the tree, with the branch leading to S. parvula having accumulated the largest number of contractions (1,089 families; 10.7%). In contrast, more expansions were found along three out of the eight branches. For example, on the branch leading to A. lyrata, 923 families have expanded while 896 contracted over the past 13 MYs since the divergence of A. lyrata and A. thaliana.


Rapidly evolving gene families in Arabidopsis and its relatives

Given the branch-specific gene turnover rates estimated above, we calculated a family-wide P-value for each gene family to identify those families that have evolved at significantly faster rates than the genome-wide average. With a cutoff family-wide P-value < 0.0001, 272 families out of 10,213 were identified as rapidly evolving families (Table S8). The functional repertoire of these rapidly evolving gene families was assessed by GO term enrichment analysis. As shown in Figure S3, enriched GO terms were mainly associated with processes involved in biotic interactions, including plant-pathogen/herbivore interactions and pollen-pistil interactions (Fisher’s exact test, FDR < 0.01). Representative terms for plant-pathogen/herbivore interactions included “defense response”, “immune system process”, “immune response”, “apoptotic process”, “(1->3)-beta-D-glucan biosynthetic process”, “sesquiterpene biosynthetic process” and “triterpenoid metabolic process”. Representative terms for pollen-pistil interactions included “recognition of pollen”, “pollen-pistil interaction”, “cell communication”, “multi-organism reproductive process” and “programmed cell death”. Additionally, some terms associated with plant responses to abiotic factors were also found, such as "cellular response to cold", "cellular response to salt stress", and "phosphate ion transport" (Figure S3).

We further compared the relative rates of change (expansion or contraction) on the terminal branches leading to Arabidopsis and its relatives for each of these 272 rapidly evolving gene families. As shown in Figure S4, the largest number (89) of rapidly evolving families that have undergone an expansion was found along the branch leading to A. lyrata (blue bars), whereas only 11 were found along the branch leading to S. parvula (purple bars). By setting a cutoff of Viterbi P-value < 0.01, we then examined whether any of these changes were statistically significant on the terminal branch leading to each species. We found that 25, 60, 42, 6 and 42 families have significantly expanded on the branches leading to A. thaliana, A. lyrata, C. rubella, S. parvula, and E. salsugineum, respectively (Table S9; Figure S4).

High incidence of positive selection acting upon nucleotides in rapidly evolving gene families

We searched for signatures of positive selection acting upon nucleotides by testing for significant differences in likelihood scores between two site models, M1a (the nearly neutral model) and M2a (the positive-selection model), for each of the 210 rapidly evolving gene families that were retained in at least two species. For 33 (15.7%) of these 210 families, the positive-selection model was significantly favored over the nearly neutral model after correction for multiple testing (P < 2.38E-4, likelihood ratio test, df = 2), suggesting accelerated nucleotide substitutions by positive selection in these gene families (Table S10 Sheet 1). For comparison, we also examined the percentage of families undergoing positive selection in a set of 1,000 randomly selected gene families. Only 13 (1.3%) of these families showed significant evidence of positive selection after correction for multiple testing (P < 5E-5, likelihood ratio test, df = 2; Table S10 Sheet 2). Altogether, these results suggested that rapidly evolving gene families experience a high incidence of positive selection acting on nucleotides, thus implying an association between accelerated nucleotide substitutions by positive selection and rapid changes in gene copy number in A. thaliana and its relatives. This association was subsequently shown to be statistically significant by Fisher’s exact test (Table S11; P < 2.2E-16).
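The significance thresholds and the association test in this paragraph can be illustrated with a small, standard-library-only Python sketch. It assumes that the multiple-testing correction is a Bonferroni division of 0.05 by the number of families tested (which reproduces the quoted cutoffs), and that the contingency table behind Table S11 is built from the counts given in the text (33 of 210 versus 13 of 1,000). The function name and the one-sided formulation are assumptions made for illustration, so the value it returns need not match the reported one exactly.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the count
    in the top-left cell given fixed row and column totals.
    """
    row1 = a + b          # families in the first group
    col1 = a + c          # families with evidence of positive selection
    n = a + b + c + d     # all families
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

# Bonferroni-style per-test thresholds (assumed form of the correction).
alpha_rapid = 0.05 / 210     # rapidly evolving families tested
alpha_random = 0.05 / 1000   # randomly selected families tested

# Counts taken from the text: selected vs. not selected in each set.
p_value = fisher_exact_greater(33, 210 - 33, 13, 1000 - 13)

print(f"thresholds: {alpha_rapid:.2e}, {alpha_random:.0e}")
print(f"one-sided Fisher p-value: {p_value:.3e}")
```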

Tandem duplication as the predominant duplication mode in rapidly evolving gene families

To investigate the origins of gene family size variation, we first assigned all duplicated genes to four different categories of duplication mode: proximal duplication, dispersed duplication, WGD, and tandem duplication. Subsequently, we searched for potential enrichment of certain modes in each gene family using Fisher’s exact tests adjusted for multiple testing. In total, 144 out of the 272 rapidly evolving families were significantly enriched for at least one type of duplication mode (referred to as duplication-mode-enrichment families; P < 0.05; Table S12 Sheet 1). Among them, 43 families were enriched for more than one mode. Similarly, we found that 3,509 out of all 12,075 studied gene families were significantly enriched for at least one type of duplication mode, and 115 were enriched for more than one mode (P < 0.05; Table S12 Sheet 2). We then compared the percentages of each duplication mode for these duplication-mode-enrichment families between the rapidly evolving gene families and all studied gene families, and significantly different patterns were observed (P < 2.2E-16, Chi-square test). As shown in Figure 3, tandem duplication was the predominant mode (49.7%) in rapidly evolving families, while it represented only 15.1% across all gene families. By contrast, WGD accounted for the largest proportion (47.7%) across all studied gene families, but for only 3.2% in rapidly evolving gene families.

Figure 3. Contributions of four categories of duplication mode to the size dynamics of the rapidly evolving families and all studied gene families.

DISCUSSION

In this study, we performed the first genome-wide analysis on gene family size evolution among closely related plant species. We observed substantial variation in gene family size among Arabidopsis and its relatives, which is comparable with that reported in animals, thus reflecting conserved mechanisms underlying the evolution of gene family size between the two kingdoms. We estimated the average rate of gene turnover (λ) to be 0.0022 gains and losses per gene per MYs, which corroborates the view that gene gains and losses evolve at a similar average rate across different eukaryotic lineages. Our estimates of branch-specific rates supported the hypothesis that selfing reduces the mutation rates of duplications and deletions, thereby decreasing the rate of gene gain-and-loss in SC species. We further speculated that mating system may impact the rate of gene gain-and-loss via differential activity and accumulation of TEs between selfing and outcrossing species. Gene families that did not vary in size among species were found to be mainly involved in conserved processes. In contrast, 272 gene families that evolved rapidly in size were found to be mainly involved in plant-pathogen/herbivore interactions and pollen-pistil interactions. We further demonstrated a high incidence of positive selection acting on nucleotides in these rapidly evolving gene families. Last but not least, we found that gene gains via tandem duplication predominantly contributed to the adaptive evolution of gene family size.

The dynamics of gene birth-and-death is of central interest in studies of genome evolution, and differential gene gains and losses among species often lead to interspecific variation in gene family size (Zhang, 2003). We observed substantial variation in gene family size among Arabidopsis and its relatives. Of the 10,228 families present in their MRCA, 44.4% have changed in size in at least one species over the past 43 MYs since their divergence (Table S6). This proportion is comparable with that observed among 12 Drosophila species (41%) that diverged 60 MYs ago (Hahn et al., 2007b) and five mammalian species (56.3%) that diverged 93 MYs ago (Demuth et al., 2006), implying conserved mechanisms underlying the evolution of gene family size between plant and animal kingdoms.

Despite frequent gene birth and death within genomes, 5,683 gene families (55.6%) showed no variation in size among the five species, and may represent a group of gene families that are resistant to gene gain-and-loss. These gene families were further found to function primarily in conserved processes such as DNA repair, plastid organization, cell cycle, and metabolism of essential cellular compounds (Figure S2), in accordance with a recent study on gene families that maintain only a single copy in 20 flowering plants (De Smet et al., 2013). Our results thus suggested that the conservation of gene copy number in gene families, regardless of whether a single copy or multiple copies are present in each species, has probably been maintained by natural selection rather than by stochastic processes.

To assess the evolutionary processes underlying the observed variation of gene family size, we inferred gene gains and losses on each branch using the maximum likelihood approach implemented in the CAFE package (De Bie et al., 2006; Han et al., 2013). Since this approach is limited to a group of species that have diverged recently (λ*t < 1, whereby t denotes the divergence time) and have not been separated by WGDs (Demuth and Hahn, 2009), we studied five closely related species in the Brassicaceae family, which have not undergone further WGD events since their divergence that started 43 MYs ago (Beilstein et al., 2010; Yang et al., 2013). We excluded the genome sequence of Brassica rapa, because this species underwent a recent whole genome triplication approximately 13-17 MYs ago (Wang et al., 2011a).
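The applicability condition λ*t < 1 can be checked against the values reported in this chapter. The sketch below is only an arithmetic illustration: it uses the total divergence time of 43 MYs as a conservative upper bound for t on every branch (individual branches are shorter), and the dictionary labels are illustrative.

```python
# Upper bound on elapsed time: divergence of the five species started 43 MYs ago.
divergence_time_my = 43.0

# Rates reported in the text (gains and losses per gene per MY).
rates = {
    "genome-wide average": 0.0022,
    "fastest branch (A. lyrata)": 0.006282,
}

# lambda * t must stay below 1 for the approach to be applicable.
lambda_times_t = {label: rate * divergence_time_my for label, rate in rates.items()}

for label, value in lambda_times_t.items():
    print(f"{label}: lambda * t = {value:.3f} (condition met: {value < 1})")
```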

We estimated the average rate of gene turnover (λ) to be 0.0022 gains and losses per gene per MYs across all branches of the phylogenetic tree and a branch-specific rate of 0.002498 on the branch leading to A. thaliana (Figure 2). Using an independent approach, Lynch and Conery (2003) estimated the average rate of gene duplications (the rate of gene gains) to be 0.002 duplications per gene per MYs for A. thaliana. Our result is thus not only consistent with the previous estimate in A. thaliana, but also extends this rate estimate to the Brassicaceae family. The average rates of gene turnover (λ) have been estimated to be 0.0020, 0.0012, and 0.0016 gains and losses per gene per MYs in yeast (Hahn et al., 2005), Drosophila (Hahn et al., 2007b), and mammals (Demuth et al., 2006), respectively. Our rate estimate in plants, therefore, corroborates the view that gene gain-and-loss evolves at a similar average rate across different eukaryotic lineages (Demuth and Hahn, 2009).

The estimation of branch-specific rates demonstrated heterogeneous rates of gene gains and losses across branches (Figure 2). Interestingly, we found the highest rate (λ2 = 0.006282) on the terminal branch leading to the SI species A. lyrata (Figure 2). This result supports our hypothesis that besides point mutation (Hollister et al., 2010), selfing may also reduce the rates of other types of mutation including duplications and deletions, thereby decreasing the rate of gene gain-and-loss in SC species. Selfing evolved earlier in A. thaliana (approximately 1 MYs ago) than in C. rubella (< 20,000 years ago) (Slotte et al., 2013). Accordingly, we observed a lower rate of gene gain-and-loss on the branch leading to A. thaliana (λ1 = 0.002498) than on that leading to C. rubella (λ4 = 0.003164), corroborating the idea that shifts in mating system have profound effects on the rate of gene gain-and-loss. No evidence exists so far as to when selfing evolved in E. salsugineum and S. parvula. The observation of lower rates of gene gain-and-loss on the branches leading to E. salsugineum (λ3 = 0.001894) and S. parvula (λ5 = 0.001503) than on that leading to A. thaliana (λ1 = 0.002498) may imply that selfing evolved even earlier in these two species than in A. thaliana. We hope to further test this hypothesis once more knowledge on the timing of mating system shifts and more genomes of outcrossing Brassicaceae species become available.

TEs contribute to gene gains via a process called trans-duplication, by which a gene or gene fragment captured by a TE is duplicated and transposed to a new position (Lisch, 2012). TEs may also cause gene losses by genomic sequence deletions mediated by TE insertions (Han et al., 2005). Interestingly, mating system has been hypothesized to play a role in the evolutionary dynamics of TEs (Wright and Schoen, 2000; Morgan, 2001). It has indeed been demonstrated that the outcrosser A. lyrata exhibits a lower efficacy of RNA-directed DNA methylation silencing of TEs than its congeneric selfer A. thaliana, suggestive of an enhanced activity of TEs in A. lyrata (Hollister et al., 2011). Moreover, selfing species generally accumulate fewer TEs than outcrossing species, because selfing impedes the spread of TEs among individuals and increases the efficacy of selection against deleterious recessive TE insertions (i.e. purging) in populations (Wright et al., 2008). Abundant TEs in outcrossing species may thus further raise the chances of TE-mediated gene gains and losses. Taken together, we speculate that mating system may impact the rate of gene gain-and-loss via differential activity and accumulation of TEs between selfing and outcrossing species.

The dynamics of gene families with accelerated rates of gene gain-and-loss are most likely driven by natural selection, and such gene families may ultimately contribute to adaptation (Walsh and Stephan, 2002; Demuth and Hahn, 2009). Using a likelihood approach, we identified 272 gene families that have evolved significantly faster than the genome-wide average in Arabidopsis and its relatives (family-wide P-value < 0.0001, Table S8). GO term enrichment analyses further revealed that genes from these rapidly evolving gene families were mainly associated with processes involved in biotic interactions, most notably plant-pathogen/herbivore interactions (plant defense) and pollen-pistil interactions (Figure S3).

Plant immunity involves an array of structural, chemical and protein-based defenses (Freeman and Beattie, 2008). Gene families involved in protein-based defense such as the pathogenesis-related (PR) Bet v I family (Hoffmann-Sommergruber et al., 1997), defensin-like (DEFL) family (Silverstein et al., 2005) and PR-6 proteinase inhibitor family (Sels et al., 2008) were found to be rapidly evolving in size (Table S8). Similarly, gene families known to be involved in recognition and signaling during plant defense, such as TIR-NBS-LRR (TNL) domain-containing (Tarr and Alexander, 2009) and CC-NBS-LRR (CNL) domain-containing gene families (McHale et al., 2006) were identified as rapidly evolving gene families (Table S8). Several GO terms that are associated with the metabolism of crucial compounds in plant defense, including callose (β-1, 3-glucan) (Luna et al., 2011), sesquiterpenes (Köllner et al., 2013) and triterpenoids (triterpenes) (González-Coloma et al., 2011; Muffler et al., 2011), were finally found to be enriched in rapidly evolving gene families (Figure S3), suggesting that genes involved in chemical defenses also rapidly change in copy number. Altogether, our study revealed that regardless of their specific roles, defense-related gene families show accelerated rates of gene gain-and-loss in Arabidopsis and its relatives.

Plants and animals have evolved distinct immune systems. Compared to animals that have evolved both innate (non-specific) and adaptive (acquired) immune systems, plants only rely on an innate immune system for defense against pathogens and herbivores (Jones and Dangl, 2006; Tiffin and Moeller, 2006). Defense-related gene families have been shown to be rapidly evolving in size in animals, as exemplified by studies in 12 Drosophila species (Hahn et al., 2007b) and five mammalian species (Demuth et al., 2006). Our study thus corroborates the view that defense-related gene families are generally rapidly evolving in size not only in animals but also in plants, despite the differences in immune system, suggesting parallel evolution in defense-related gene families between the two kingdoms.

Similar to plant-pathogen interactions, pollen-pistil interactions, which determine the acceptance (compatibility) or rejection (incompatibility) of pollen grains landing on the stigma, also involve genetically regulated processes of recognition, signaling and response (Swanson et al., 2004; Higashiyama, 2010). Interestingly, such processes, e.g. “recognition of pollen”, “pollen-pistil interaction”, “cell communication” and “multi-organism reproductive process”, were found to be over-represented among rapidly evolving gene families (Figure S3). Notably, the terms “apoptotic process” and “programmed cell death” were also over-represented. Programmed cell death (PCD) is known to play an important role in plant-pathogen interactions (Coll et al., 2011), but it has also been recognized as a mechanism to destroy incompatible pollen grains during the plant self-incompatibility response (Bosch et al., 2008; Serrano et al., 2010). Coupled with our finding that selfing may decrease the rate of gene gain-and-loss, these results reveal a possible interplay between plant mating system and the dynamics of gene gain-and-loss.

Previous genomic studies in animals revealed that a higher-than-expected proportion of gene families that rapidly evolve in size (e.g. 65.5% in mammals and 20.4% in flies) also showed accelerated nucleotide substitution rates due to positive selection (Hahn et al., 2007b). Studies on individual gene families further reported that positive selection predominantly acts on gene families that have high gene turnover rates in copy number (Birtle et al., 2005; Popesco et al., 2006; Kulmuni et al., 2013). Concordantly, we found a higher incidence of positive selection in the rapidly evolving gene families that were inferred to be driven by selection than in a random set of 1000 gene families from Arabidopsis and its relatives (Table S10). These results suggest that positive selection may act in a multi-dimensional way during adaptive evolution, namely at the levels of both point mutations and gene copy number variation. However, it remains unresolved whether positive selection first acted on beneficial point mutations of the new duplicate or on the gene duplication itself. Different evolutionary models regarding the fixation and maintenance of gene duplications have been proposed, but the relative importance of these models remains highly debated to date (Innan and Kondrashov, 2010).

Different from WGDs whereby all genes within the genome are duplicated at the same time, tandem duplications and dispersed duplications constitute the predominant mode of small-scale duplications in plant genomes (Flagel and Wendel, 2009). It has been suggested that genes that originated by different duplication modes may show different evolutionary patterns and their retention in genomes may be functionally biased (Freeling, 2009; Wang et al., 2011b). This view may be supported by our finding that the rapidly evolving gene families, which were inferred to be most likely driven by natural selection, originated differently from all analyzed gene families that were presumed to be mainly affected by neutral, stochastic processes (P < 2.2E-16, Chi-square test, Figure 3).

Most angiosperms have undergone multiple rounds of polyploidization events during their evolution (Soltis et al., 2009). For example, all five species studied here share an ancient γ whole-genome triplication that preceded the rosid-asterid split (Tang et al., 2008), and two more recent WGDs, i.e. the β WGD (65-115 MYs ago) and the α WGD (47-64 MYs ago) (Beilstein et al., 2010). Yet, only a subset of duplicates derived from WGDs are finally retained in conserved collinear blocks in the genome due to massive gene loss (Adams and Wendel, 2005; Wang, 2013). Given that the five Brassicaceae species investigated in this study underwent at least three rounds of polyploidization events in their common ancestors, a large proportion of the observed variation in gene family size should be attributed to gene losses following WGDs. This hypothesis is supported by the finding that WGD generally serves as the most important duplication mode across all studied gene families (Figure 3). However, gene losses following WGDs appear to contribute less to the adaptive evolution of gene family size, which is reflected by the low contribution of WGD to the evolution of the rapidly evolving gene families (Figure 3).

Duplication mode enrichment analyses revealed that rapidly evolving gene families mainly originate via tandem duplication (49.7%) or proximal duplication (47.7%) (Figure 3). The latter, however, represents a class of duplicates with ambiguous origin that may result from (1) ancient tandem duplications interrupted by more recent gene insertions, (2) localized transposition or (3) a large tandem duplication simultaneously affecting more than one gene (Wang et al., 2012b). Accordingly, a large proportion of duplicates falling into this category may still originate by tandem duplication. We can therefore conclude that gene gains via tandem duplication predominantly contributed to the adaptive evolution of gene family size in Arabidopsis and its relatives. This result coincides with previous studies on individual families that reported an enrichment of tandem duplicates in gene families showing rapid birth-and-death evolution such as mammalian olfactory receptors (Young et al., 2002) and plant LRR disease resistance genes (Mondragon-Palomino and Gaut, 2005). Similarly, Hanada et al. (2008) investigated the orthologous groups among A. thaliana, poplar, rice and moss, and highlighted the importance of lineage-specific expansion via tandem duplication in adaptive response to environmental stimuli.

Both tandem duplication and dispersed duplication can contribute to gene family size variation through differential gene gains among species. It is therefore of interest to explore why tandem duplications are preferentially selected and contribute more to the adaptive evolution of gene family size than dispersed duplications. Tandem duplications result from template slippage during DNA repair or unequal crossing-over, which thereby generate tandem arrays of homologous genes in close genomic vicinity (Kane et al., 2010), while dispersed duplications occur through DNA- or RNA-based transposition, thereby producing homologous genes that are usually not adjacent to each other (Wang et al., 2012c). Accordingly, tandem duplications may be more likely to produce functional paralogs, because they may share cis-regulatory elements due to their close vicinity, while dispersed duplications (especially via RNA-based retro-transposition) rarely bring regulatory sequences along, and therefore often become “dead-on-arrival” pseudogenes that are finally eliminated from the genome (Schrider et al., 2011). Moreover, tandem duplications are thought to exhibit higher intraspecific polymorphism in natural populations than dispersed duplications, thus providing a larger pool of standing variation for natural selection to act upon (Clark et al., 2007; Zichner et al., 2013). To test these hypotheses, further work aimed at assessing the functional impact and intraspecific polymorphism of both tandem and dispersed duplications in natural populations is necessary.

EXPERIMENTAL PROCEDURES

Data collection, gene family definition and protein sequence clustering to gene families

Five species from the Brassicaceae family were investigated in this study, namely A. thaliana (TAIR10) (The Arabidopsis Genome Initiative, 2000), A. lyrata (v1.0) (Hu et al., 2011), C. rubella (v1.0) (Slotte et al., 2013), E. salsugineum (v1.0, synonym of Thellungiella salsuginea and T. halophila) (Yang et al., 2013) and S. parvula (v2.0, synonym of T. parvula) (Dassanayake et al., 2011). The genome assemblies and annotations were downloaded from the Phytozome v9.1 database (http://www.phytozome.net/), and only non-TE nuclear protein-coding genes were used for subsequent analyses. For genes with alternative splicing, only the longest transcript was considered.
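The filtering step for alternatively spliced genes can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name, the record layout and the gene identifiers are hypothetical, and measuring transcript length by the length of the translated sequence is an assumption.

```python
def longest_transcripts(records):
    """Keep one transcript per gene: the one with the longest sequence.

    `records` is an iterable of (gene_id, transcript_id, sequence) tuples.
    Ties are resolved in favor of the transcript seen first.
    """
    best = {}
    for gene_id, transcript_id, sequence in records:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (transcript_id, sequence)
    return best

# Hypothetical example records (identifiers and sequences are made up).
example = [
    ("geneA", "geneA.1", "MKT"),
    ("geneA", "geneA.2", "MKTLLV"),
    ("geneB", "geneB.1", "MSS"),
]
selected = longest_transcripts(example)
print({gene: transcript for gene, (transcript, _) in selected.items()})
```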

While the narrow-sense definition of a gene family only refers to multiple, paralogous genes within a genome (multigene family), a broad-sense definition includes both paralogs within a genome and orthologs or paralogs between genomes (Demuth and Hahn, 2009). This broad-sense definition means that each gene belongs to a gene family, making it possible to perform comparative analyses of gene family size among species even if one of the studied species has only a single or no gene copy (gene family extinction in this species) (Hahn et al., 2007b).

Clustering all protein sequences into families involved two steps: (1) aligning all protein sequences using all-against-all BLASTP (Camacho et al., 2009); and (2) clustering using SiLiX (Miele et al., 2011). Default settings of SiLiX (35% sequence identity and 80% alignment coverage) were chosen, because the best trade-off between sensitivity and specificity is achieved with this parameter combination (Miele et al., 2011). A custom bash script was then used to parse the output of SiLiX and assign gene families to four categories following Demuth et al. (2006) and Hahn et al. (2007b). Single-gene families mostly resulted from annotation errors and were excluded from further analyses. Lineage-specific families were defined as families whose members only arose in a subset of the studied species following divergence from their MRCA, while extinction families represented families whose ancestral sequences were present in the MRCA, but whose members were lost in some of the extant species. For families that contained genes only from a subset of the studied species, we distinguished between lineage-specific and extinction families by incorporating the genomes of nine species from the Rosids clade as outgroups, namely Carica papaya (v1.0) (Wang et al., 2012a), Citrus clementina (v1.0) (Haploid Clementine Genome, International Citrus Genome Consortium), Citrus sinensis (v1.1) (Xu et al., 2012), Eucalyptus grandis (v1.1) (Eucalyptus grandis Genome Project 2010), Glycine max (v1.1) (Schmutz et al., 2010), Gossypium raimondii (v2.1) (Ming et al., 2008), Populus trichocarpa (v3.0) (Tuskan et al., 2006) and Theobroma cacao (v1.1) (Argout et al., 2011). We considered a gene family to be lineage-specific if its members did not cluster with any sequences from the outgroup taxa. Otherwise, the gene family was classified as an extinction family. Extinction families together with those families containing genes from all studied species were grouped into a category of families with at least one ancestral gene present in the MRCA of the studied species.
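The four-way classification described above can be summarized in a few lines of Python. This is a sketch of the decision rules only, not the custom bash script used in the study: the function names, the category labels and the input layout are hypothetical, and whether a family clusters with outgroup sequences is assumed to be known from a separate clustering run.

```python
INGROUP = {"A. thaliana", "A. lyrata", "C. rubella", "E. salsugineum", "S. parvula"}

def classify_family(members, clusters_with_outgroup):
    """Assign a gene family to one of four categories.

    `members` is a list of (species, gene_id) pairs from the five ingroup
    species; `clusters_with_outgroup` says whether any outgroup sequence
    falls into the same cluster.
    """
    if len(members) == 1:
        return "single-gene"          # mostly annotation errors; excluded
    species_present = {species for species, _ in members}
    if species_present == INGROUP:
        return "all-species"          # genes found in every studied species
    if clusters_with_outgroup:
        return "extinction"           # ancestral, but lost in some species
    return "lineage-specific"         # arose after divergence from the MRCA

def has_ancestral_gene(category):
    """Families assumed to have had at least one gene in the MRCA."""
    return category in {"all-species", "extinction"}

# Hypothetical family with members in two species and an outgroup hit.
family = [("A. thaliana", "g1"), ("A. lyrata", "g2"), ("A. lyrata", "g3")]
category = classify_family(family, clusters_with_outgroup=True)
print(category, has_ancestral_gene(category))
```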

Likelihood analysis of gene gain-and-loss

Analyses of gene family size variation were performed using CAFE v3.0 (Han et al., 2013). One of the input files was a time-calibrated phylogenetic tree of the five studied species, obtained by simplifying published tree topologies (Beilstein et al., 2010; Yang et al., 2013). The other input file contained the information on gene family size. Here, only gene families for which at least one ancestral gene was inferred to be present in the MRCA (one of the requirements to apply the stochastic birth-and-death model) were included (Hahn et al., 2005). It is further essential to eliminate gene families with large size variance before running CAFE, to ensure that the likelihood searches are properly initialized. For our dataset, 15 families that exhibited large size variance were thus excluded (Table S6).

CAFE estimates the rate of gene gain-and-loss (λ), which describes the rate at which gene families are expected to expand or contract over time (De Bie et al., 2006). Different models were tested to find the best one for our complete dataset. The model with a single global λ parameter (hereinafter one-parameter model) was used to estimate the average rate of gene gain-and-loss across all branches of the phylogenetic tree. Considering that different branches may evolve at different rates, we also applied a series of models with varying numbers of λ (2-8; hereinafter two-parameter to eight-parameter models). Because CAFE often fails to converge to a consistent ML estimate when using models with a large number of λ parameters, we ran CAFE at least six times for each model to assess the robustness of the results. The model with both the largest and most consistent ML value was finally considered the best-fit model.

To assess the significance of multi-parameter models, we simulated 1,000 datasets under a one-parameter model using the function "genfamily" in CAFE. Each of these datasets had the same root-size distribution and number of families as our input dataset. Using the command "lhtest", we then calculated two likelihood scores (-ln Likelihood) for each simulated dataset: (1) the likelihood score under a one-parameter model (LS1-p); and (2) the likelihood score under a multi-parameter model (LSm-p), along with the corresponding likelihood ratio (LR) as

LR = 2 × [(LS1-p) - (LSm-p)],

in order to compute a null distribution for the test statistic. A multi-parameter model was finally considered to fit our dataset significantly better than a one-parameter model if the LR of our input dataset was greater than 95% of the null distribution.
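The comparison of the observed likelihood ratio with the simulated null distribution can be sketched in Python as follows. The functions below are illustrative helpers, not part of CAFE; they assume that the likelihood scores are negative log-likelihoods, as stated above, and that "greater than 95% of the null distribution" means that at least 95% of the simulated ratios are smaller than the observed one. The null values in the example are placeholders, not simulation results.

```python
def likelihood_ratio(score_one_param, score_multi_param):
    """LR = 2 * (LS_1-p - LS_m-p), with scores given as -ln(likelihood)."""
    return 2.0 * (score_one_param - score_multi_param)

def fraction_below(observed_lr, null_lrs):
    """Fraction of simulated likelihood ratios smaller than the observed one."""
    return sum(1 for lr in null_lrs if lr < observed_lr) / len(null_lrs)

def is_significant(observed_lr, null_lrs, quantile=0.95):
    """True if the observed LR exceeds the given share of the null distribution."""
    return fraction_below(observed_lr, null_lrs) >= quantile

# Placeholder null distribution (NOT real simulation output): 0.0, 1.0, ..., 99.0
placeholder_null = [float(i) for i in range(100)]

print(likelihood_ratio(10.0, 4.0))             # LR from two -lnL scores
print(is_significant(96.5, placeholder_null))  # above the upper 5% tail
print(is_significant(50.0, placeholder_null))  # well inside the null
```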

To correct for genome assembly and annotation errors that may result in biased gene turnover rate estimates, we ran the python script caferror.py that is implemented in CAFE v3.0 (Han et al., 2013). This allowed us to identify the best-fit error model that maximizes the likelihood of CAFE runs, in order to re-calibrate the gene turnover rate.

Rapidly evolving gene families, i.e. families that are evolving at rates of gene gain-and-loss significantly faster than the genome-wide average, were identified by setting a cutoff family-wide P-value < 0.0001. To assess whether the size change of a rapidly evolving gene family on the branch leading to a given species was statistically significant, we additionally applied a cutoff Viterbi P-value < 0.01. Since CAFE can infer the ancestral states for each gene family at each internal node, we finally compared the relative rates of change on the terminal branches leading to the studied species for each rapidly evolving gene family by dividing the number of changes per MY along a given branch by the sum of changes across all studied branches.
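One way to read the relative-rate calculation is sketched below. It is an interpretation rather than the exact procedure: it assumes that the size change on each terminal branch is the difference between the extant and inferred ancestral family size, that expansions are positive and contractions negative, and that the denominator is the sum of absolute per-MY changes over the terminal branches. The function name and the example values are hypothetical.

```python
def relative_change_rates(size_changes, branch_lengths_my):
    """Relative rate of size change per terminal branch for one gene family.

    `size_changes` maps branch name -> gene copies gained (+) or lost (-);
    `branch_lengths_my` maps branch name -> branch length in million years.
    """
    per_my = {branch: size_changes[branch] / branch_lengths_my[branch]
              for branch in size_changes}
    total = sum(abs(rate) for rate in per_my.values())
    if total == 0:
        # No change on any branch: all relative rates are zero.
        return {branch: 0.0 for branch in per_my}
    return {branch: rate / total for branch, rate in per_my.items()}

# Hypothetical family: +4 genes on one branch, -2 on another, equal lengths.
example = relative_change_rates({"branch_x": 4, "branch_y": -2},
                                {"branch_x": 2.0, "branch_y": 2.0})
print(example)
```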

Gene Ontology (GO) term enrichment analysis

GO annotations were downloaded from the Phytozome v9.1 database and used in GO term enrichment analyses implemented in Blast2GO (Conesa et al., 2005). To identify significantly overrepresented GO terms, we used Fisher's exact tests adjusted for multiple testing with the Benjamini-Hochberg false discovery rate procedure (FDR < 0.01). The enriched GO terms were summarized and visualized with the web tool REVIGO (http://revigo.irb.hr/) (Supek et al., 2011).
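For readers unfamiliar with the false discovery rate adjustment, the following standard-library Python sketch shows the generic Benjamini-Hochberg step-up procedure. It is not the Blast2GO implementation, and the function name and example P-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Illustrative P-values for four GO terms (made up for the example).
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```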

Detection of positive selection

In studies of molecular evolution, the ratio (ω=dN/dS) of the number of non-synonymous substitutions per non-synonymous site (dN) to the number of synonymous substitutions per synonymous site (dS) is widely used as an indicator of selection pressure acting on DNA sequences. To detect positive selection among codons, we calculated and compared the likelihood values of two site models, M1a and M2a, using the program CODEML implemented in the package PAML (Yang, 2007). The nearly neutral model, M1a, does not allow for sites under positive selection (ω > 1) while the positive-selection model M2a does.
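The comparison of M1a and M2a is a likelihood ratio test with two degrees of freedom, for which the chi-square tail probability has the closed form exp(-x/2). The Python sketch below uses that identity; the function name and the log-likelihood values in the example are hypothetical and are not taken from CODEML output.

```python
from math import exp

def lrt_m1a_vs_m2a(lnl_m1a, lnl_m2a):
    """Likelihood ratio test between nested models with df = 2.

    Returns the test statistic 2 * (lnL_M2a - lnL_M1a) and its P-value
    under a chi-square distribution with two degrees of freedom, whose
    survival function is exp(-x / 2).
    """
    statistic = max(2.0 * (lnl_m2a - lnl_m1a), 0.0)
    return statistic, exp(-statistic / 2.0)

# Hypothetical log-likelihoods for one gene family.
statistic, p_value = lrt_m1a_vs_m2a(-5230.0, -5215.0)

# Per-test threshold for 210 families, assuming a Bonferroni-style correction.
threshold = 0.05 / 210
print(statistic, p_value, p_value < threshold)
```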

Homologous protein sequences from the five Brassicaceae species were aligned with the software MUSCLE (Edgar, 2004) using default settings. The resulting protein alignment was then converted into a codon alignment using PAL2NAL (Suyama et al., 2006). This alignment was then used to reconstruct a phylogenetic tree with the neighbor-joining method implemented in MUSCLE (Edgar, 2004). Finally, the phylogenetic tree and the codon alignment obtained above were used as input files for the program CODEML (Yang, 2007).
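The conversion of a protein alignment into a codon alignment can be illustrated with a simplified sketch: each aligned residue is replaced by its codon and each gap by a three-base gap. This is not the PAL2NAL implementation, which also handles cases such as frameshifts and mismatches between protein and coding sequence; the function name and example sequences are hypothetical.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread a coding sequence onto one gapped row of a protein alignment."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")
    codons = []
    pos = 0  # index of the next unread nucleotide in the coding sequence
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")  # a protein gap becomes a three-base gap
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


print(protein_to_codon_alignment("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

Keeping gaps in multiples of three preserves the reading frame, which CODEML requires for codon-based models.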

Duplication mode enrichment analysis

The MCScanX program implemented in the MCScanX toolkit (Wang et al., 2012b) was used to identify all collinear blocks and tandemly duplicated genes within each genome. The duplicate gene classifier from the MCScanX toolkit was then applied to assign each gene to one of five categories: proximal duplicates, dispersed duplicates, WGDs, tandem duplicates and singletons. Proximal duplicates are defined as paralogous genes separated by fewer than ten genes. Tandem duplicates are paralogous genes adjacent to each other. WGDs are paralogous genes located in collinear blocks around anchor genes. Dispersed duplicates are paralogous genes that are located neither within collinear blocks nor adjacent to each other. The singletons category contains all remaining genes. Each gene family was then tested for enrichment of particular modes of duplication using Fisher’s exact tests with Bonferroni correction for multiple testing (P < 0.05), using the Perl script origin_enrichment_analysis.pl in the MCScanX toolkit (Wang et al., 2012b).
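The category definitions above can be expressed as a small decision procedure. The sketch below is an illustration only and not the MCScanX classifier itself: the function name, the data structures and the order in which categories take precedence (WGD, then tandem, then proximal, then dispersed) are assumptions.

```python
def classify_gene(gene, position, paralogs, collinear_genes, max_gap=10):
    """Assign one gene to a duplication category.

    gene: gene identifier
    position: dict gene -> (chromosome, rank of the gene along the chromosome)
    paralogs: dict gene -> set of paralogous gene identifiers
    collinear_genes: set of genes located in collinear blocks
    max_gap: proximal duplicates are separated by fewer than this many genes
    """
    partners = paralogs.get(gene, set())
    if not partners:
        return "singleton"
    if gene in collinear_genes:
        return "WGD"
    chrom, rank = position[gene]
    # Number of genes lying between this gene and each paralog on the
    # same chromosome; zero means the two genes are adjacent.
    gaps = [abs(rank - position[p][1]) - 1
            for p in partners if position[p][0] == chrom]
    if any(gap == 0 for gap in gaps):
        return "tandem"
    if any(0 < gap < max_gap for gap in gaps):
        return "proximal"
    return "dispersed"


# Hypothetical toy genome with two chromosomes.
position = {"g1": ("chr1", 1), "g2": ("chr1", 2), "g3": ("chr1", 6),
            "g4": ("chr2", 3), "g5": ("chr2", 9), "g6": ("chr1", 40),
            "g7": ("chr2", 20)}
paralogs = {"g1": {"g2", "g3", "g6"}, "g2": {"g1"}, "g3": {"g1"},
            "g4": {"g5"}, "g5": {"g4"}, "g6": {"g1"}}
collinear = {"g4", "g5"}
for g in sorted(position):
    print(g, classify_gene(g, position, paralogs, collinear))
```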

∗ SUPPORTING INFORMATION

Figure S1. The density curve of the null distribution of likelihood ratios from 1,000 simulated datasets. The red area on the plot indicates the 5% upper tail of the distribution.

Figure S2. Over-represented biological processes among the 5,683 families that do not show any change in size among the five studied species. The enriched GO terms were summarized and visualized by REVIGO, which groups loosely related terms, shown in the same color, into superclusters. The size of each rectangle was adjusted according to the FDR of the corresponding GO term.

Figure S3. Over-represented biological processes among the 273 rapidly evolving gene families. The enriched GO terms were summarized and visualized by REVIGO.

Figure S4. Relative rates of changes on terminal branches leading to the studied species for each of the 273 rapidly evolving gene families. The widths of the colored bars reflect the relative rate of changes for each family along the branches leading to the five studied species. The five small boxes on the right of the colored bars indicate whether expansion (+), no change (0) or contraction (-) has occurred on the corresponding branches leading to A. thaliana, A. lyrata, C. rubella, S. parvula, and E. salsugineum, respectively. The boxes colored in grey indicate significant changes (Viterbi P-value < 0.01).

Table S1. All 12,075 analyzed gene families that were clustered with protein sequences from five species in the Brassicaceae family. The clustering process was performed using SiLiX with default settings. Each gene family was assigned a unique family ID, which is listed along with the protein IDs of all genes that were clustered into that family. Protein IDs of A. lyrata begin with digits; those of A. thaliana with "AT"; C. rubella with "Carubv"; E. salsugineum with "Thhalv"; and S. parvula with "Tp".

Table S2. Lineage-specific gene families. The size of each family is given for each species.

Table S3. Over-represented GO terms among lineage-specific gene families.

Table S4. Gene families that were completely lost in at least one studied species. The size of each family is given for each species.

Table S5. Over-represented GO terms among gene families that were completely lost in at least one studied species.

Table S6. Gene families that were inferred to have at least one ancestral sequence in the MRCA of Arabidopsis and its relatives. The size of each family is given for each species. Families highlighted in red were excluded from gene gain-and-loss analysis due to their large variance.

Table S7. The likelihood scores and estimated rates of CAFE runs under a series of models. For each model, we ran CAFE at least six times to check the consistency of likelihood scores.

Table S8. Gene families that evolve rapidly in size. The size for each studied species and GO annotations (biological process, molecular function and cellular component) are given for each family.

Table S9. Rapidly evolving gene families that show significant changes on the branches leading to each species. Negative values of "expansions" represent contractions.

∗ Please download it here: https://yadi.sk/d/Pwmf7C2WgPhnV or find it in the accompanying CD.

Table S10. Likelihood ratio tests and parameter estimations for M1a and M2a site models based on sequences from each of the 210 rapidly evolving gene families (Sheet 1) and 1,000 randomly selected families (Sheet 2). Families colored in red are those in which positive selection acting on codons was detected at a cutoff of P-value < 0.05 (P < 2.38E-4 after Bonferroni correction for rapidly evolving families and P < 5E-5 for randomly selected families).

Table S11. The contingency table for the Fisher’s exact test to assess the association between accelerated nucleotide substitutions by positive selection and rapid changes in gene copy number.

Table S12. List of 144 rapidly evolving families (Sheet 1) and 3,509 of the 12,075 analyzed families (Sheet 2) that are significantly enriched for at least one type of duplication mode (P < 0.05). P-values adjusted for multiple testing are shown for each duplication mode.

REFERENCES

Adams, K.L. and Wendel, J.F. (2005) Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol., 8, 135-141. Argout, X., Salse, J., Aury, J.-M., Guiltinan, M.J., Droc, G., Gouzy, J., Allegre, M., Chaparro, C., Legavre, T. and Maximova, S.N. (2011) The genome of Theobroma cacao. Nat. Genet., 43, 101-108. Bagal, U.R., Leebens-Mack, J.H., Lorenz, W.W. and Dean, J.F. (2012) The phenylalanine ammonia lyase (PAL) gene family shows a gymnosperm-specific lineage. BMC Genomics, 13 Suppl 3, S1. Beilstein, M.A., Nagalingum, N.S., Clements, M.D., Manchester, S.R. and Mathews, S. (2010) Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proc. Natl. Acad. Sci., 107, 18724-18728. Birtle, Z., Goodstadt, L. and Ponting, C. (2005) Duplication and positive selection among hominin-specific PRAME genes. BMC Genomics, 6, 120. Bosch, M., Poulter, N.S., Vatovec, S. and Franklin-Tong, V.E. (2008) Initiation of programmed cell death in self-incompatibility: role for cytoskeleton modifications and several caspase-like activities. Mol Plant, 1, 879-887. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. Chauve, C., Doyon, J.P. and El-Mabrouk, N. (2008) Gene family evolution by duplication, speciation, and loss. J. Comput. Biol., 15, 1043-1062. Clark, R.M., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthmann, N., Hu, T.T., Fu, G., Hinds, D.A., et al. (2007) Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317, 338-342. Coll, N.S., Epple, P. and Dangl, J.L. (2011) Programmed cell death in the plant immune system. Cell Death Differ., 18, 1247-1256. Conesa, A., Gotz, S., Garcia-Gomez, J.M., Terol, J., Talon, M. and Robles, M. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674-3676. 
Dassanayake, M., Oh, D.H., Haas, J.S., Hernandez, A., Hong, H., Ali, S., Yun, D.J., Bressan, R.A., Zhu, J.K., Bohnert, H.J., et al. (2011) The genome of the extremophile crucifer Thellungiella parvula. Nat. Genet., 43, 913-918. De Bie, T., Cristianini, N., Demuth, J.P. and Hahn, M.W. (2006) CAFE: a computational tool for the study of gene family evolution. Bioinformatics, 22, 1269-1271. De Smet, R., Adams, K.L., Vandepoele, K., Van Montagu, M.C., Maere, S. and Van de Peer, Y. (2013) Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc. Natl. Acad. Sci., 110, 2898-2903. Demuth, J.P., De Bie, T., Stajich, J.E., Cristianini, N. and Hahn, M.W. (2006) The evolution of mammalian gene families. PLoS One, 1, e85. Demuth, J.P. and Hahn, M.W. (2009) The life and death of gene families. Bioessays, 31, 29-39. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792-1797. Flagel, L.E. and Wendel, J.F. (2009) Gene duplication and evolutionary novelty in plants. New Phytol., 183, 557-564. Floudas, D., Binder, M., Riley, R., Barry, K., Blanchette, R.A., Henrissat, B., Martínez, A.T., Otillar, R., Spatafora, J.W. and Yadav, J.S. (2012) The Paleozoic origin of enzymatic lignin decomposition reconstructed from 31 fungal genomes. Science, 336, 1715-1719. Freeling, M. (2009) Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol., 60, 433-453. Freeman, B. and Beattie, G. (2008) An overview of plant defenses against pathogens and herbivores. Plant Health Instr., DOI: 10.1094/PHI-I-2008-0226-01. González-Coloma, A., López-Balboa, C., Santana, O., Reina, M. and Fraga, B.M. (2011) Triterpene-based plant defenses. Phytochem. Rev., 10, 245-260. Gutiérrez, H., Castillo-Morales, A., Monzón-Sandoval, J. and Urrutia, A.O. 
(2014) Increased brain size in mammals is associated with size variations in gene families with cell signalling, chemotaxis and immune-related functions. Proc. R. Soc. B, 281, 20132428. Hahn, M.W., De Bie, T., Stajich, J.E., Nguyen, C. and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res., 15, 1153-1160. Hahn, M.W., Demuth, J.P. and Han, S.G. (2007a) Accelerated rate of gene gain and loss in primates. Genetics, 177, 1941-1949. Hahn, M.W., Han, M.V. and Han, S.G. (2007b) Gene family evolution across 12 Drosophila genomes. PLoS Genet., 3, e197. Han, K., Sen, S.K., Wang, J., Callinan, P.A., Lee, J., Cordaux, R., Liang, P. and Batzer, M.A. (2005) Genomic rearrangements by LINE-1 insertion-mediated deletion in the human and chimpanzee lineages. Nucleic Acids Res., 33, 4040-4052. Han, M.V., Thomas, G.W.C., Lugo-Martinez, J. and Hahn, M.W. (2013) Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol., 30, 1987-1997. Hanada, K., Zou, C., Lehti-Shiu, M.D., Shinozaki, K. and Shiu, S.H. (2008) Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol., 148, 993-1003. Hanikenne, M., Talke, I.N., Haydon, M.J., Lanz, C., Nolte, A., Motte, P., Kroymann, J., Weigel, D. and Krämer, U. (2008) Evolution of metal hyperaccumulation required cis-regulatory changes and triplication of HMA4. Nature, 453, 391-395. Higashiyama, T. (2010) Peptide signaling in pollen-pistil interactions. Plant Cell Physiol., 51, 177-189. Hoffmann-Sommergruber, K., Vanek-Krebitz, M., Radauer, C., Wen, J., Ferreira, F., Scheiner, O. and Breiteneder, H. (1997) Genomic characterization of members of the Bet v 1 family: genes coding for allergens and

pathogenesis-related proteins share intron positions. Gene, 197, 91-100. Hollister, J.D., Ross-Ibarra, J. and Gaut, B.S. (2010) Indel-associated mutation rate varies with mating system in flowering plants. Mol. Biol. Evol., 27, 409-416. Hollister, J.D., Smith, L.M., Guo, Y.-L., Ott, F., Weigel, D. and Gaut, B.S. (2011) Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proc. Natl. Acad. Sci., 108, 2322-2327. Hu, T.T., Pattyn, P., Bakker, E.G., Cao, J., Cheng, J.-F., Clark, R.M., Fahlgren, N., Fawcett, J.A., Grimwood, J. and Gundlach, H. (2011) The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet., 43, 476-481. Innan, H. and Kondrashov, F. (2010) The evolution of gene duplications: classifying and distinguishing between models. Nat. Rev. Genet., 11, 97-108. Jones, J.D. and Dangl, J.L. (2006) The plant immune system. Nature, 444, 323-329. Kane, J., Freeling, M. and Lyons, E. (2010) The evolution of a high copy gene array in Arabidopsis. J. Mol. Evol., 70, 531-544. Köllner, T.G., Lenk, C., Schnee, C., Köpke, S., Lindemann, P., Gershenzon, J. and Degenhardt, J. (2013) Localization of sesquiterpene formation and emission in maize leaves after herbivore damage. BMC Plant Biol., 13, 15. Kondrashov, F.A. (2012) Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc. R. Soc. B, 279, 5048-5057. Kulmuni, J., Wurm, Y. and Pamilo, P. (2013) Comparative genomics of chemosensory protein genes reveals rapid evolution and positive selection in ant-specific duplicates. Heredity (Edinb), 110, 538-547. Lei, L., Zhou, S.L., Ma, H. and Zhang, L.S. (2012) Expansion and diversification of the SET domain gene family following whole-genome duplications in Populus trichocarpa. BMC Evol. Biol., 12, 51. Lisch, D. (2012) How important are transposons for plant evolution? Nat. Rev. Genet., 14, 49-61. Long, M., Betran, E., Thornton, K. and Wang, W. 
(2003) The origin of new genes: glimpses from the young and old. Nat. Rev. Genet., 4, 865-875. Luna, E., Pastor, V., Robert, J., Flors, V., Mauch-Mani, B. and Ton, J. (2011) Callose deposition: a multifaceted plant defense response. Mol. Plant-Microbe Interact., 24, 183-193. Lynch, M. and Conery, J.S. (2003) The evolutionary demography of duplicate genes. J. Struct. Funct. Genomics, 3, 35-44. McHale, L., Tan, X., Koehl, P. and Michelmore, R.W. (2006) Plant NBS-LRR proteins: adaptable guards. Genome Biol., 7, 212. Miele, V., Penel, S. and Duret, L. (2011) Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics, 12, 116. Ming, R., Hou, S., Feng, Y., Yu, Q., Dionne-Laporte, A., Saw, J.H., Senin, P., Wang, W., Ly, B.V. and Lewis, K.L. (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature, 452, 991-996. Mondragon-Palomino, M. and Gaut, B.S. (2005) Gene conversion and the evolution of three leucine-rich repeat gene families in Arabidopsis thaliana. Mol. Biol. Evol., 22, 2444-2456. Morgan, M.T. (2001) Transposable element number in mixed mating populations. Genet. Res., 77, 261-275. Muffler, K., Leipold, D., Scheller, M.-C., Haas, C., Steingroewer, J., Bley, T., Neuhaus, H.E., Mirata, M.A., Schrader, J. and Ulber, R. (2011) Biotransformation of triterpenes. Process Biochem., 46, 1-15. Nakayama, S.-I., Shi, S., Tateno, M., Shimada, M. and Takahasi, K.R. (2012) Mutation accumulation in a selfing population: consequences of different mutation rates between selfers and outcrossers. PLoS One, 7, e33541. Owens, S.M., Harberson, N.A. and Moore, R.C. (2013) Asymmetric functional divergence of young, dispersed gene duplicates in Arabidopsis thaliana. J. Mol. Evol., 76, 13-27. Popesco, M.C., MacLaren, E.J., Hopkins, J., Dumas, L., Cox, M., Meltesen, L., McGavran, L., Wyckoff, G.J. and Sikela, J.M. (2006) Human lineage–specific amplification, selection, and neuronal expression of DUF1220 domains. 
Science, 313, 1304-1307. Richardson, A.O. and Palmer, J.D. (2007) Horizontal gene transfer in plants. J. Exp. Bot., 58, 1-9. Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L., Song, Q., Thelen, J.J., Cheng, J., et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature, 463, 178-183. Schrider, D.R., Stevens, K., Cardeno, C.M., Langley, C.H. and Hahn, M.W. (2011) Genome-wide analysis of retrogene polymorphisms in Drosophila melanogaster. Genome Res., 21, 2087-2095. Sels, J., Mathys, J., De Coninck, B., Cammue, B. and De Bolle, M.F. (2008) Plant pathogenesis-related (PR) proteins: a focus on PR peptides. Plant Physiol. Biochem., 46, 941-950. Serrano, I., Pelliccione, S. and Olmedilla, A. (2010) Programmed-cell-death hallmarks in incompatible pollen and papillar stigma cells of Olea europaea L. under free pollination. Plant Cell Rep., 29, 561-572. Silverstein, K.A., Graham, M.A., Paape, T.D. and VandenBosch, K.A. (2005) Genome organization of more than 300 defensin-like genes in Arabidopsis. Plant Physiol., 138, 600-610. Slotte, T., Hazzouri, K.M., Agren, J.A., Koenig, D., Maumus, F., Guo, Y.L., Steige, K., Platts, A.E., Escobar, J.S., Newman, L.K., et al. (2013) The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet., 45, 831-835. Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., Wall, P.K. and Soltis, P.S. (2009) Polyploidy and angiosperm diversification. Am. J. Bot., 96, 336-348. Supek, F., Bošnjak, M., Škunca, N. and Šmuc, T. (2011) REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One, 6, e21800. Suyama, M., Torrents, D. and Bork, P. (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res., 34, W609-W612. Swanson, R., Edlund, A.F. and Preuss, D. (2004) Species specificity in pollen-pistil interactions. Annu. Rev. Genet., 38, 793-818. 
Tang, H., Wang, X., Bowers, J.E., Ming, R., Alam, M. and Paterson, A.H. (2008) Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res., 18, 1944-1954.

Tarr, D.E. and Alexander, H.M. (2009) TIR-NBS-LRR genes are rare in monocots: evidence from diverse monocot orders. BMC Res. Notes, 2, 197. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796-815. Tiffin, P. and Moeller, D.A. (2006) Molecular evolution of plant immune system genes. Trends Genet., 22, 662-670. Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313, 1596-1604. Vekemans, X., Poux, C., Goubet, P.M. and Castric, V. (2014) The evolution of selfing from outcrossing ancestors in Brassicaceae: what have we learned from variation at the S-locus? J. Evol. Biol., 27, 1372-1385. Vieira, F.G., Sanchez-Gracia, A. and Rozas, J. (2007) Comparative genomic analysis of the odorant-binding protein family in 12 Drosophila genomes: purifying selection and birth-and-death evolution. Genome Biol., 8, R235. Vlad, D., Kierzkowski, D., Rast, M.I., Vuolo, F., Dello Ioio, R., Galinha, C., Gan, X., Hajheidari, M., Hay, A., Smith, R.S., et al. (2014) Leaf shape evolution through duplication, regulatory diversification, and loss of a homeobox gene. Science, 343, 780-783. Walsh, J.B. and Stephan, W. (2002) Multigene families: evolution. eLS, 12, 406-412. Wang, K., Wang, Z., Li, F., Ye, W., Wang, J., Song, G., Yue, Z., Cong, L., Shang, H. and Zhu, S. (2012a) The draft genome of a diploid cotton Gossypium raimondii. Nat. Genet., 44, 1098-1103. Wang, X., Grus, W.E. and Zhang, J. (2006) Gene losses during human origins. PLoS Biol., 4, e52. Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.H., Bancroft, I., Cheng, F., et al. (2011a) The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet., 43, 1035-1039. Wang, Y. 
(2013) Locally duplicated ohnologs evolve faster than nonlocally duplicated ohnologs in Arabidopsis and rice. Genome Biol. Evol., 5, 362-369. Wang, Y., Tang, H., Debarry, J.D., Tan, X., Li, J., Wang, X., Lee, T.H., Jin, H., Marler, B., Guo, H., et al. (2012b) MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res., 40, e49. Wang, Y., Wang, X. and Paterson, A.H. (2012c) Genome and gene duplications and gene expression divergence: a view from plants. Ann. N. Y. Acad. Sci., 1256, 1-14. Wang, Y., Wang, X., Tang, H., Tan, X., Ficklin, S.P., Feltus, F.A. and Paterson, A.H. (2011b) Modes of gene duplication contribute differently to genetic novelty and redundancy, but show parallels across divergent angiosperms. PLoS One, 6, e28150. Wendel, J.F. (2000) Genome evolution in polyploids. In: Plant Molecular Evolution: Springer, pp. 225-249. Wright, S.I., Ness, R.W., Foxe, J.P. and Barrett, S.C.H. (2008) Genomic consequences of outcrossing and selfing in plants. Int. J. Plant Sci., 169, 105-118. Wright, S.I. and Schoen, D.J. (2000) Transposon dynamics and the breeding system. In: Transposable Elements and Genome Evolution: Springer, pp. 139-148. Xu, Q., Chen, L. L., Ruan, X., Chen, D., Zhu, A., Chen, C., Bertrand, D., Jiao, W.-B., Hao, B.-H. and Lyon, M.P. (2012) The draft genome of sweet orange (Citrus sinensis). Nat. Genet., 45, 59-66. Yang, R., Jarvis, D.E., Chen, H., Beilstein, M.A., Grimwood, J., Jenkins, J., Shu, S., Prochnik, S., Xin, M., Ma, C., et al. (2013) The reference genome of the halophytic plant Eutrema salsugineum. Front. Plant Sci., 4, 46. Yang, X., Kalluri, U.C., Jawdy, S., Gunter, L.E., Yin, T., Tschaplinski, T.J., Weston, D.J., Ranjan, P. and Tuskan, G.A. (2008) The F-box gene family is expanded in herbaceous annual plants relative to woody perennial plants. Plant Physiol., 148, 1189-1200. Yang, X., Tuskan, G.A. and Cheng, M.Z. 
(2006) Divergence of the Dof gene families in poplar, Arabidopsis, and rice suggests multiple modes of gene evolution after duplication. Plant Physiol., 142, 820-830. Yang, Z. (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol., 24, 1586-1591. Ye, C.Y., Li, T., Tuskan, G.A., Tschaplinski, T.J. and Yang, X. (2011) Comparative analysis of GT14/GT14-like gene family in Arabidopsis, Oryza, Populus, Sorghum and Vitis. Plant Sci., 181, 688-695. Young, J.M., Friedman, C., Williams, E.M., Ross, J.A., Tonnes-Priddy, L. and Trask, B.J. (2002) Different evolutionary processes shaped the mouse and human olfactory receptor gene families. Hum. Mol. Genet., 11, 535-546. Zhang, J. (2003) Evolution by gene duplication: an update. Trends Ecol. Evol., 18, 292-298. Zichner, T., Garfield, D.A., Rausch, T., Stutz, A.M., Cannavo, E., Braun, M., Furlong, E.E. and Korbel, J.O. (2013) Impact of genomic structural variation in Drosophila melanogaster based on population-scale sequencing. Genome Res., 23, 568-579.

CHAPTER TWO: Genome-wide comparative analysis of the GRAS gene family in Populus, Arabidopsis and rice

ABSTRACT

GRAS genes belong to a gene family of transcription regulators that function in the regulation of plant growth and development. Our knowledge about the expansion and diversification of this gene family in flowering plants is presently limited to the herbaceous species Arabidopsis and rice. Numerous aspects, including the phylogenetic history, expansion, functional divergence and adaptive evolution await further study, especially in woody tree species. Based on the latest genome assemblies, we found 106, 34 and 60 putative GRAS genes in Populus, Arabidopsis and rice, respectively. Phylogenetic analysis revealed that GRAS proteins could be divided into at least 13 subfamilies. Tandem and segmental duplications are the most common expansion mechanisms of this gene family, and their frequent joint action may explain the rapid expansion in Populus. Site-specific shifts in evolutionary rates might be the main force driving subfamily-specific functional diversification. Adaptive evolution analysis revealed that GRAS genes have evolved mainly under purifying selection after duplication, suggesting that strong functional constraints have a bearing on the evolution of GRAS genes. Both EST and microarray data revealed that GRAS genes in Populus have broad expression patterns across a variety of organs/tissues. Expression divergence analyses between paralogous pairs of GRAS genes suggested that the retention of GRAS genes after duplication could be mainly attributed to substantial functional novelty such as neo-functionalization or sub-functionalization. Our study highlights the expansion and diversification of the GRAS gene family in Populus and provides the first comprehensive analysis of this gene family in the Populus genome.

KEYWORDS

Adaptive evolution · Expansion · Expression · The GRAS gene family · Phylogeny · Populus

ABBREVIATIONS

BI: Bayesian Inference, EST: Expressed Sequence Tag, GEO: Gene Expression Omnibus, HMM: Hidden Markov Model, JTT: Jones-Taylor-Thornton, ML: Maximum Likelihood, NJ: Neighbor Joining, RMA: Robust Multiarray Analysis

INTRODUCTION

The GRAS gene family, named after the three early functionally characterized members, Gibberellic acid insensitive (GAI), Repressor of GA1 (RGA) and Scarecrow (SCR), has long been regarded as a plant-specific transcription regulator family (Pysh et al., 1999; Sun et al., 2012). A recent study, however, reported the presence of this gene family in the genomes of bacteria, and proposed that the GRAS gene family first emerged in bacteria and should be assigned to the Rossmann fold methyltransferase superfamily (Zhang et al., 2012). The proteins encoded by the GRAS gene family typically share a variable N-terminus and a highly conserved C-terminus, known as the GRAS domain, which contains several ordered and conserved motifs, including leucine rich region I, VHIID, leucine rich region II, PFYRE and SAW (Pysh et al., 1999; Tian et al., 2004).

Several attempts have been made to group members of the GRAS gene family into subfamilies that reflect their evolutionary history. An early phylogenetic analysis of the GRAS family proteins in Arabidopsis thaliana assigned them to eight subfamilies named DELLA, LS, SCR, SHR, PAT1, HAM, SCL9 (LISCL) and SCL4/7 (Bolle, 2004). An independent analysis based on sequences from Arabidopsis and rice (Tian et al., 2004) also suggested that this gene family could be divided into eight subfamilies but with slight differences, including a new subfamily SCL3 instead of the subfamily SCL4/7. More recently, GRAS proteins were classified into ten subfamilies: DELLA, AtLAS (LS), AtSCR, AtSHR, AtPAT1, HAM, LISCL, AtSCL3, SCL4/7 and DLT (Sun et al., 2011). However, some GRAS proteins, especially those from rice, were not assigned to subfamilies in these analyses. Engstrom (2011) constructed a phylogenetic tree of aligned GRAS protein sequences selected from Arabidopsis, rice and other plant species, which revealed 12 well-supported monophyletic clades. These studies indicate that this gene family diversified substantially in flowering plants and that more subfamilies may be recognized upon analysis of further plant genomes.

The diversification of the GRAS gene family is congruent with its highly diverse functions. To date, GRAS transcription regulators have been functionally characterized mainly in Arabidopsis and rice. These studies revealed that they play diverse and critical roles in the regulation of plant growth and development, often acting as integrators of multiple growth regulatory and environmental signals, such as phytohormones, light and abiotic/biotic stress. The main biological functions of ten subfamilies of GRAS transcription regulators were recently reviewed by Sun et al. (2012). For example, AtGAI, a member of the DELLA subfamily, participates in gibberellic acid signaling in Arabidopsis (Peng et al., 1997); Ls, a member of the AtLAS subfamily, acts in axillary meristem initiation in tomato (Schumacher et al., 1999); HAM, belonging to the HAM subfamily, functions in shoot meristem maintenance in Petunia hybrida (Stuurman et al., 2002); AtPAT1, belonging to the AtPAT1 subfamily, functions in phytochrome A signaling in Arabidopsis (Bolle et al., 2000); LISCL, a member of the LISCL subfamily, functions in transcription regulation during microsporogenesis in Lilium longiflorum (Morohashi et al., 2003); DLT, a member of the DLT subfamily, functions in brassinosteroid signaling in rice (Tong et al., 2009); AtSHR and AtSCR, belonging to the AtSHR and AtSCR subfamilies respectively, function in root radial patterning and growth in Arabidopsis (Di Laurenzio et al., 1996; Helariutta et al., 2000); and AtSCL3, a member of the AtSCL3 subfamily, integrates multiple signals in root cell elongation in Arabidopsis (Heo et al., 2011). To the best of our knowledge, only a limited number of the GRAS genes in Populus have been functionally characterized to date. Ma et al. (2010) demonstrated that overexpressing PeSCL7 from Populus euphratica (belonging to the AtSCL4/7 subfamily) enhanced drought and salt tolerance in transgenic Arabidopsis plants. Wang et al. (2011a) reported that down-regulation of PtSHR1, isolated from P. trichocarpa, resulted in an increase in primary and secondary growth rates in transgenic Populus. More members of the GRAS family in flowering plants, especially in Populus trees, remain to be functionally characterized.

Although the GRAS gene family has been studied for many years, we currently have only a limited understanding of many aspects, such as the expansion mechanism and the evolutionary forces driving the diversification of this gene family in flowering plants. We also still lack a comprehensive analysis of this gene family, especially in woody species. Woody species have secondary growth, self-supporting structures and a much longer lifespan, making them distinct from herbaceous plants (Lei et al., 2012). Differences in loss and retention of duplicated gene family members between woody and herbaceous species may help to identify genes with specialized roles in the adaptive evolution of different lineages (Jansson and Douglas, 2007). The genome sequence of P. trichocarpa was published in 2006, making this species a model for perennial woody dicots (Tuskan et al., 2006). Previous studies of genome evolution have revealed that Populus has undergone three rounds of polyploidization with the most recent ‘salicoid’ whole genome duplication event dating to about 65 million years ago (Tuskan et al., 2006). The ‘eurosid’ whole genome duplication event happened close to the time of divergence of the eurosid I and II lineages, and an ancient ‘γ’ whole genome triplication event preceded the eurosid-asterid split (Tuskan et al., 2006; Yang et al., 2009). This complicated history of duplications and rearrangements of the Populus genome offers an excellent opportunity to study expansion patterns of gene families in the course of genome evolution (Lan et al., 2009). Furthermore, Populus species are both ecologically and economically important and are being intensively studied in light of the increasing demand for biofuel production around the world. Because members of the GRAS gene family play important roles in plant growth, a comprehensive survey of this gene family in Populus may facilitate future functional genomic studies and biomass production.

Here we present the results of a comparative analysis of the GRAS gene family in three representative plant species. We reconstructed the phylogenetic history of this gene family, explored expansion mechanisms, documented genomic distribution, exon/intron structure and domain architecture, assessed functional divergence and adaptive evolution, and determined expression profiles of Populus GRAS genes in a variety of organs/tissues. We hope that our analyses will facilitate future functional and ecological studies of this important gene family in flowering plants, and especially in Populus.

RESULTS

Genome-wide identification of GRAS genes in flowering plants

Based on the latest genome annotations and assemblies, we found 106, 34 and 60 putative GRAS genes in Populus, Arabidopsis and rice, respectively (Table S1). The numbers of GRAS genes in these genomes are not simply proportional to the assembly sizes of the genomes, which are 422.9, 135 and 372 Mb for Populus, Arabidopsis and rice, respectively, and may thus reflect different duplication and retention rates of this gene family in different plant lineages. We further examined the numbers of GRAS genes in other flowering plant species (Table S1). We identified 94, 71, 48 and 47 GRAS genes in four other tree species, E. grandis, V. vinifera, P. persica and C. sinensis, as well as 61, 73 and 78 GRAS genes in three other herbaceous species, S. lycopersicum, M. truncatula and S. bicolor, respectively.

The numbers of GRAS genes in Arabidopsis and rice identified in the present study differ slightly from earlier reports. Bolle (2004) reported 33 GRAS genes in Arabidopsis including a pseudogene (SCL16), while 32 and 57 genes have been identified in Arabidopsis and rice, respectively, by Tian et al. (2004). Analyses in these studies, however, were based on older genome annotations and assemblies. Using the latest annotations and assemblies, we identified a new putative member (AT5g67411) in Arabidopsis, and one member (AT2g29060) that was considered a single gene in previous studies has now been split into two genes (AT2g29060, AT2g29065). Seven more members were identified in rice as compared to the latest study by Sun et al. (2011). These genes are listed in Table S1.

To predict and assess the completeness of the GRAS domains of these putative genes identified by HMM and BLASTP searches, we performed both SMART and Pfam domain searches. From the alignment of predicted GRAS domain sequences we found members containing partial GRAS domains with missing motifs, some of which were severely truncated. In Populus, for example, these fragments could be as short as 79 amino acids (e.g. Potri.015G095300.1), while the typical GRAS domain has a minimum length of about 350 amino acids (e.g. Potri.002G144200.1; AT4G00150.1). These short remnant sequences may be the results of ancient pseudogenization events (Lan et al., 2009), and sequences with partial domains were thus considered to be putative pseudogenes. Because including these fragments reduces the reliability of phylogenetic analyses, we excluded them from some of the following analyses. Finally, we found 93, 33 and 50 genes in Populus, Arabidopsis and rice, respectively, with full-length domain sequences (Table S1). We assigned each putative GRAS member a symbol based on either the order of its location in the genome or the name used in earlier studies.
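The separation of full-length domains from short remnants can be illustrated with a minimal sketch. It uses domain length alone as a crude proxy, which is an assumption made here for illustration: in the analysis above, completeness was judged from the motifs present in the alignment, not from a length cutoff. The function name, the cutoff constant and the two lengths marked as placeholders are hypothetical.

```python
# Illustrative cutoff only: the text gives "about 350 amino acids" as the
# minimum length of a typical GRAS domain. Using it as a hard threshold is
# an assumption of this sketch, not the criterion applied in the study.
MIN_DOMAIN_LENGTH = 350


def split_by_domain_length(domain_lengths, min_length=MIN_DOMAIN_LENGTH):
    """Split gene IDs into full-length candidates and putative fragments."""
    full_length, fragments = [], []
    for gene_id, length in sorted(domain_lengths.items()):
        if length >= min_length:
            full_length.append(gene_id)
        else:
            fragments.append(gene_id)
    return full_length, fragments


# The 79 aa value is taken from the text; the two 350 aa values are
# placeholders standing in for "about 350 amino acids".
example_lengths = {
    "Potri.015G095300.1": 79,
    "Potri.002G144200.1": 350,
    "AT4G00150.1": 350,
}

full_length, fragments = split_by_domain_length(example_lengths)
print("full-length candidates:", full_length)
print("putative fragments:", fragments)
```

A length filter of this kind would only serve as a first pass; sequences near the cutoff would still need inspection of their motif content.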

Phylogenetic analysis of GRAS genes in Populus, Arabidopsis and rice

To reconstruct the evolutionary history of the GRAS gene family in the studied plant species, we built phylogenetic trees from the alignment of 176 GRAS domain amino acid sequences (Figure S1) in Populus (93), Arabidopsis (33) and rice (50) using Neighbor Joining (NJ), Maximum Likelihood (ML) and Bayesian inference (BI) methods (Figure 1; Figure S2). The tree topologies from different methods consistently revealed that GRAS proteins could be divided into at least 13 subfamilies, most of which were supported by decent posterior probabilities (>0.9) and bootstrap values (>60%). These subfamilies were designated following earlier studies (Tian et al., 2004; Sun et al., 2011) or named according to one of their members in the case of newly identified subfamilies. The 13 subfamilies are AtSHR, AtPAT1, AtSCR, AtSCL4/7, AtLAS, Os19, HAM, Os4, Pt20, DLT, AtSCL3, DELLA and LISCL. Of these, the three subfamilies Os4, Os19 and Pt20 are newly identified. Notably, Pt20 is a Populus-specific subfamily containing eight members that form a well-supported monophyletic clade.

In general, the majority of subfamilies harbored GRAS members from each of the three species (Figure 1). However, Os4 and Os19 subfamilies did not include any Arabidopsis genes, suggesting lineage-specific gene loss in Arabidopsis. Remarkably, the subfamily Pt20 is Populus-specific, suggesting that it has been gained in the Populus lineage after divergence from the most recent common ancestor with Arabidopsis and rice or entirely lost from the latter two lineages. It is tempting to speculate that this subfamily has specialized roles in the adaptive evolution of Populus.


Figure 1. Combined phylogenetic analysis of GRAS proteins from Populus, Arabidopsis, and rice. The NJ tree was constructed from the alignment of 176 GRAS domain amino acid sequences from Populus, Arabidopsis and rice. Bootstrap values above 50% are shown. The posterior probability from BI and bootstrap value from ML analyses are also presented for each subfamily.

Genomic organization and expansion of GRAS genes

After mapping all GRAS genes to their corresponding chromosomes, we found members of the GRAS gene family to be widely distributed on the chromosomes of the three studied genomes. In Populus, physical positions of 102 out of 106 GRAS genes were assigned to 18 of the 19 Populus chromosomes (none on Chr18), while four genes (PtGRAS92, ΨPtGRAS12, ΨPtGRAS13 and PtGRAS93) were assigned to as yet unmapped scaffolds (Figure 2). GRAS genes in Populus also demonstrated uneven distribution across the chromosomes. Chr1, the longest chromosome, harbored the most (15) GRAS genes. In contrast, only a single GRAS gene was found on Chr11 and on Chr13. In rice, the situation was comparable and GRAS genes were found on ten of the 12 chromosomes (none on Chr8 and Chr9). In contrast, GRAS genes in Arabidopsis were almost evenly distributed across the five chromosomes (Table S1).

Figure 2. Chromosomal localization of 106 putative GRAS genes in Populus. Segmentally duplicated genes are connected by red curves. Gray ribbons represent pairwise collinear blocks. Twisted ribbons indicate that collinear block pairs are in reverse orientation. Each tick interval stands for 1Mb. “ψ” prior to gene symbols indicates pseudogene fragments.

We further assessed the contribution of tandem duplications to the expansion of the GRAS gene family by identifying genes within tandem clusters across the three studied genomes. In Populus, we identified 12 tandem clusters of GRAS genes with 2-6 genes per cluster (Figure 2). The largest tandem cluster was identified on Chr16, containing two full-length domain GRAS genes and four putative pseudogenes. Overall, a total of 40 GRAS genes (39%) in Populus were arranged in tandem clusters, indicating that their origins involved tandem duplication events. In Arabidopsis and rice, 4 genes (12%) and 15 genes (25%) were located in tandem clusters, respectively (Table S2).


We also investigated the contribution of segmental duplications. In Populus, 74 genes (73%) were retained in collinear blocks with their duplicates (Table S2), suggesting that their origins involved segmental duplication events. Further analysis revealed that 22 (22%) tandemly duplicated genes were also within collinear blocks (Figure 3), suggesting that their origins involved both tandem and segmental duplication events. Such a combined expansion is assumed to greatly facilitate the rapid expansion of this gene family in Populus. Furthermore, by reconciliation of both the physical positions of genes and the phylogeny, we tried to deduce the expansion histories of the LISCL and Pt20 subfamilies. Both have members in relatively large tandem clusters (five-gene cluster) and involved simultaneous segmental duplications (Figure 4). However, they demonstrated two different expansion scenarios: for the LISCL subfamily, the two tandem clusters were generated by three rounds of tandem duplications that were then followed by segmental duplication; for the Pt20 subfamily, their ancestral gene underwent segmental duplication first, and then four rounds of tandem duplications that formed the present tandem cluster. For comparison, we also investigated the contributions of segmental duplications to the expansion of this family in Arabidopsis and rice. In our analysis, we found that the origins of 17 (48%) and 24 (40%) genes involved segmental duplications in Arabidopsis and rice, respectively (Table S2).

Figure 3. The contributions of tandem and/or segmental duplications to the expansion of GRAS genes. Pie diagrams indicate the contributions of tandem duplications (blue), segmental duplications (orange) and a combination of tandem and segmental duplications (green) to the expansion of GRAS genes in the three studied genomes. “Others” indicates the genes duplicated by mechanisms other than tandem and segmental duplication.

Functional divergence of GRAS proteins

To investigate whether adaptive functional diversification resulted from amino acid substitutions in the conserved GRAS domain, we estimated type-I and type-II functional divergence between subfamilies based on the aligned GRAS domain sequences from Populus, Arabidopsis and rice (Figure S1). As shown in Table S3, our results revealed that with a few exceptions, the coefficients of type-I functional divergence (θI) between subfamilies were statistically significant (Likelihood ratio test, P < 0.01) and varied from 0.29±0.07 to 0.88±0.09.

We further estimated the functional branch length (bF) for each subfamily and a tree-like topology in terms of the functional distance was generated (Figure 5). Interestingly, subfamily LISCL had the longest functional branch with a length of about 0.95, indicating that the evolutionary conservation may be altered at many sites for this subfamily and the derived functional state may be distant from the ancestral state. Nonetheless, we did not detect any evidence for type-II functional divergence between subfamilies (data not shown). Furthermore, we predicted the critical amino acid sites responsible for the functional divergence between subfamilies based on site-specific posterior probability profiles. The results revealed that the numbers of critical sites varied greatly among pairs of subfamilies, ranging from 0 to 68 (Table S3). A list of these critical amino acid sites is presented in Table S4, where the numbering of sites corresponds to the alignment position in Figure S1. We further examined the distribution and frequency of these critical sites along the whole GRAS domain. As shown in Figure S4, these critical sites are relatively evenly distributed along the GRAS domain, and are present in all motifs. Site “745” in the SAW motif is most often associated with functional divergence between subfamilies. Although the most common residue at this site is glutamine, residues at this site show high variability in certain subfamilies, implying that this site may greatly contribute to type-I functional divergence between subfamilies. However, the functional role of this site currently remains unclear.

Figure 4. Hypothetical evolutionary histories of the LISCL and Pt20 subfamilies in Populus reflecting two different expansion scenarios. a, b Phylogenetic relationships for the LISCL and Pt20 subfamilies, respectively. c, d Hypothetical origins of the LISCL and Pt20 subfamilies, respectively, by tandem duplications, segmental duplications and other mechanisms. The letters S, T and O on the nodes indicate the positions where segmental duplications, tandem duplications and other mechanisms-mediated duplication occurred, respectively. The genes within colored rectangles represent tandem clusters and the black boxes represent ancestral genes.


Figure 5. The tree-like topology of GRAS domains of each subfamily in terms of the functional distance. The branch length for each subfamily is proportional to bF, the functional branch length of a given subfamily. The Os19 subfamily was not included in this analysis because of its small size.

Exon/intron structure and domain architecture of the GRAS gene family

To investigate the diversity and evolution of gene structure and domain architecture, we first reconstructed the phylogenetic history of this gene family (Figure 6a) using only the GRAS domain amino acid sequences from Populus, which also classified the genes into 13 subfamilies as above. As listed in Table S5, 36 paralogous pairs of GRAS genes in Populus were identified at the terminal nodes of the phylogenetic tree in contrast to 8 pairs in Arabidopsis and 13 pairs in rice. These paralogous pairs accounted for 77%, 48% and 46% of the GRAS genes in Populus, Arabidopsis and rice, respectively. The sequence similarities can be found in Table S5. We then compared the exon/intron structures of Populus GRAS genes. It is noteworthy that up to 54.7% of the genes (58/106) were intronless in Populus (Figure 6b), 67.6% (23/34) in Arabidopsis and 55% (33/60) in rice (Table S1). We also observed that closely related GRAS genes within the same subfamily generally demonstrated more similar exon/intron structures. Especially for the paralogous pairs, most of them shared conserved exon/intron structures in terms of either gene length or intron number. There were exceptions; for example, we also observed some variations in exon/intron structures for pairs of PtGRAS25/26, PtGRAS31/42 and PtGRAS32/43, which might result from intron loss or gain events during the process of structure evolution.

Early sequence analysis indicated that the products of GRAS genes typically share a variable N-terminus and a highly conserved C-terminus (GRAS domain) (Pysh et al., 1999). However, we found the domain architecture of plant GRAS proteins to be substantially more variable than previously reported and to consist of at least three types. Firstly, most GRAS family members possess a variable N-terminal domain and just one highly conserved C-terminal GRAS domain (e.g. all Arabidopsis GRAS proteins; see Table S1). Secondly, some members contain multiple GRAS domains (e.g. OsGRAS39 with 2 domains, OsGRAS54 with 3 domains; see Table S1). Thirdly, some members start directly with one GRAS domain and are followed by either another functional domain (e.g. PtGRAS26) or none in the C-terminal part (e.g. PtGRAS82; see Figure 6c).

Driving forces for genetic divergence after duplication of GRAS genes

To investigate whether positive selection has been involved in the divergence after duplication of GRAS genes, we first calculated ω ratios for the paralogous pairs of GRAS genes from Populus, Arabidopsis and rice (Table S5) based on overall protein sequences. Except for PtGRAS80/84 and PtGRAS88/89, which have identical protein sequences, all the paralogous pairs have ω ratios less than 1, indicating that the evolution of plant GRAS genes after duplication is constrained by purifying selection. We also calculated ω ratios within and outside the GRAS domain regions for each pair of these paralogs (Figure 7). The results showed that ω ratios outside the GRAS domain were generally higher than ratios within the GRAS domain (Paired t-test, P < 0.001), indicating that the regions outside the GRAS domain have experienced weaker constraints and faster evolution after duplication than the regions within the GRAS domain. This pattern might result from relaxed purifying selection at the regions outside the GRAS domain.
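The paired comparison of ω within and outside the GRAS domain can be sketched as follows. This is a minimal illustration that computes only the paired t statistic and its degrees of freedom using the Python standard library; the ω values are invented for the example and are not taken from Table S5, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(inside, outside):
    """Return (t, degrees of freedom) for the paired differences outside - inside."""
    if len(inside) != len(outside) or len(inside) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [o - i for i, o in zip(inside, outside)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented omega values for five hypothetical paralogous pairs.
omega_inside = [0.10, 0.15, 0.08, 0.20, 0.12]
omega_outside = [0.30, 0.35, 0.25, 0.45, 0.28]

t_value, dof = paired_t_statistic(omega_inside, omega_outside)
print(f"t = {t_value:.2f} with {dof} degrees of freedom")
```

A positive t statistic indicates that ω tends to be higher outside the domain than inside it for the same gene pair; converting the statistic to a P value requires the t distribution, which is left out of this sketch.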

To identify possible positive selection acting at specific sites, six site models that allow ω ratios to vary among sites were used based on the coding sequences of 176 GRAS genes from the three studied plant species. The discrete model M3 fit better than the one-ratio model M0, suggesting that ω ratios vary among sites (Table S6; Likelihood ratio test, P < 0.01). Both M1a-M2a and M7-M8 comparisons suggested that positive selection might act on some sites (Table S6; Likelihood ratio test, P < 0.01). However, only two positively selected sites, listed in Table S6, were detected based on posterior probabilities, while the majority of sites were dominated by purifying selection.
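The likelihood ratio tests used for these nested model comparisons can be sketched in a few lines. The test statistic is twice the difference in log-likelihood between the alternative and the null model, compared against a chi-square distribution. The sketch below assumes the degrees of freedom conventionally used for these comparisons (2 for M1a vs. M2a and for M7 vs. M8, and 4 for M0 vs. M3 with three site classes); the number of site classes and the log-likelihood values are assumptions made for illustration, not values from Table S6.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, closed form for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, P value) for a pair of nested models."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf_even_df(statistic, df)


# Hypothetical log-likelihoods for a null model and its alternative.
statistic, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-990.0, df=2)
print(f"2*delta(lnL) = {statistic:.1f}, P = {p_value:.2e}")
```

A significant test for M1a vs. M2a or M7 vs. M8 indicates only that a class of sites with ω > 1 improves the fit; the individual sites are then identified from their posterior probabilities, as described above.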


Figure 6. Phylogenetic relationships, exon/intron structure and domain architecture of 93 Populus GRAS genes. a an NJ tree constructed from alignments of the GRAS domain amino acid sequences of 93 Populus GRAS proteins. Bootstrap values above 50% are shown. b Exon/intron structures of Populus GRAS genes. Green boxes and black lines represent exons and introns, respectively. c Schematic representation of the conserved domains in the Populus GRAS proteins. Orange boxes represent GRAS domains, while red, green and grey boxes represent the hATC, Znf-BED and Peptidase_C48 domains, respectively.


Figure 7. Comparison of dN/dS ratios between within- and outside-GRAS domain regions for paralogous GRAS gene pairs. Green dots represent ratios from Populus, while red and blue dots represent ratios from Arabidopsis and rice, respectively.

Expression analysis of Populus GRAS genes

Expressed Sequence Tags (ESTs) available in public databases provide a valuable resource to explore gene expression profiles (Hu et al., 2012). To survey the expression patterns of Populus GRAS genes in different organ/tissue types, we performed an in silico EST analysis by counting the frequencies of corresponding ESTs of Populus GRAS genes from various libraries across a set of 18 organ/tissue types (Figure S5). A total of 39 Populus GRAS genes (42%) were found to have corresponding EST sequences in the NCBI EST database (release 120701). Our EST profiling analysis demonstrated that Populus GRAS genes have rather broad expression patterns across a variety of tissues, but frequencies of ESTs were generally low, congruent with the observation that transcription regulators are often weakly expressed (Wilkins et al., 2009; Hu et al., 2012). Strikingly, one gene of the subfamily HAM, PtGRAS67, had a quite high EST representation in flower buds (Figure S5), implying that it may have an important role in flower development or differentiation.

To gain more insights into the expression pattern of Populus GRAS genes, we first re-analyzed an Affymetrix microarray dataset (accession number GSE13990), which encompassed results from eight organ/tissue types or treatments (Wilkins et al., 2009). Four pairs of GRAS genes with identical probe sets and seven genes without corresponding probe sets were excluded (Table S7). The expression profiles of 78 Populus GRAS genes were finally examined. Based on hierarchical clustering, the expression patterns could be grouped into three clusters (Figure 8a). Genes in cluster a (I) mostly had the highest transcript abundance in leaves, especially in young leaves (YL). Most genes in cluster a (II) were preferentially expressed in floral organs, i.e. male or female catkins (MC or FC), implying their potential functions in reproductive processes. We examined the expression profiles of the Populus-specific subfamily Pt20. Of the six genes (PtGRAS20, 58, 59, 60, 69, 90) present in the analyzed datasets, five were preferentially expressed in male or female catkins, implying that the expansion of this subfamily may potentially be associated with the dioecious reproduction that characterizes the Populus life history. In contrast, the remaining genes in cluster a (III) generally showed highest expression levels in vegetative organs like roots (R) or seedlings (CL, DL and CD). Among them, PtGRAS56 showed highest expression in roots and its orthologous gene AtSCL3 has been reported to be a tissue-specific integrator of gibberellin signaling in the Arabidopsis root (Heo et al., 2011). Similarly, PtGRAS4 and PtGRAS21, which had the highest transcript abundance in roots, belong to the AtSHR subfamily, whose members have been reported to mainly function in root radial patterning and growth in Arabidopsis (Sun et al., 2012). These expression patterns may suggest the functional conservation of GRAS gene homologs between Populus and Arabidopsis. In addition, the transcripts of PtGRAS29 and PtGRAS5 greatly accumulated in seedlings grown in continuous darkness and then transferred to light for 3 hours (DL), suggesting their possible roles in light regulation.

Wood (secondary xylem) produced through the secondary growth of tree species serves as a primary feedstock for products such as biofuel, sawn timber and fibers (Plomion et al., 2001). Considering the economic importance of wood for tree species and important roles played by the GRAS gene family in growth regulation, we also re-analyzed a second microarray dataset (accession number GSE30507), primarily from wood-forming tissues. Hierarchical clustering grouped the expression patterns into six clusters (Figure 8b). Most genes in cluster b (I) were preferentially expressed in bark and mature phloem (BP). Genes in cluster b (II) mostly showed the highest transcript abundance in mature xylem (MX). Cluster b (III) contains members that were highly expressed in developing xylem (DX) such as PtGRAS61, 62, 69, 11, 72, 36, and 74. Because the increment of woody biomass mainly results from the successive accumulation of secondary xylem after differentiation of the developing xylem cells, these genes could serve as candidates for future bioengineering modification to improve the quantity of wood biomass (Chaffey, 1999). Genes in cluster b (IV) mostly had the highest expression levels in vascular cambium (VC). These genes could also be candidates to be manipulated to improve the quality and quantity of wood biomass, because wood cells originate from vascular cambial activity, which ensures the perennial life of trees by regular renewing of functional phloem and xylem (Plomion et al., 2001). Cluster b (V) contains members that were preferentially expressed in developing phloem (DP). Cluster b (VI) members were preferentially expressed in shoot and leaf primordia (SL), in agreement with their putative roles in meristem development. Among them, PtGRAS64 and PtGRAS71 belong to the AtLAS subfamily, and their orthologous gene in Arabidopsis, AtLAS, has been reported to be involved in axillary meristem formation (Greb et al., 2003).

To infer the retention modes of GRAS genes after duplication, we evaluated the expression divergence of paralogous pairs by calculating Pearson’s correlation coefficients (r) of their expression intensities. We used r < 0.59, r < 0.53 and r < 0.85 (95% quantile of the r values obtained from 10,000 random gene pairs) as the cut-off values for our studied datasets in Arabidopsis, rice and Populus, respectively, below which duplicated genes were considered divergent in their expression (Blanc and Wolfe, 2004; Chi et al., 2011). After excluding the paralogous pairs without corresponding probe sets, we evaluated the expression divergence of 8, 9 and 28 paralogous pairs across various tissues/organs in Arabidopsis, rice and Populus, respectively (Table S5). The results showed that only two pairs (AtSCL14/AtSCL33a; AtSCL5/AtSCL21) in Arabidopsis, two pairs (OsGRAS7/OsMOC1; OsGRAS8/OsGRAS9) in rice and eight pairs (PtGRAS3/PtGRAS22; PtGRAS15/PtGRAS37; PtGRAS25/PtGRAS26; PtGRAS27/PtGRAS87; PtGRAS28/PtGRAS86; PtGRAS67/PtGRAS68; PtGAI.1/PtGAI.2) in Populus exhibited conserved expression patterns, implying their retention by genetic redundancy and selection for their contributions to the robustness of the genetic network. In comparison, the remaining paralogous pairs (75%, 78% and 71%, in Arabidopsis, rice and Populus, respectively) displayed expression divergence according to our criteria, suggesting that these pairs of genes may have undergone neo-functionalization or sub-functionalization.
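The cut-off procedure described above can be sketched as follows: Pearson's r is computed for many randomly drawn gene pairs, the 95% quantile of those values is taken as the threshold, and a paralogous pair with r below the threshold is called divergent. The sketch is a simplified illustration using only the standard library; the function names, the toy expression data and the simple rank-based quantile rule are assumptions and may differ from the procedure actually used.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def random_pair_cutoff(expression, n_pairs=10_000, quantile=0.95, seed=0):
    """Quantile of r over randomly drawn gene pairs (simple rank-based rule)."""
    rng = random.Random(seed)
    genes = sorted(expression)
    values = []
    for _ in range(n_pairs):
        gene_a, gene_b = rng.sample(genes, 2)
        values.append(pearson_r(expression[gene_a], expression[gene_b]))
    values.sort()
    index = min(len(values) - 1, int(quantile * len(values)))
    return values[index]


def is_divergent(profile_a, profile_b, cutoff):
    """A pair is called divergent when its r falls below the cutoff."""
    return pearson_r(profile_a, profile_b) < cutoff


# Toy data: 50 hypothetical genes measured across 8 tissues.
toy_rng = random.Random(1)
toy_expression = {
    f"gene{i}": [toy_rng.gauss(0.0, 1.0) for _ in range(8)] for i in range(50)
}
cutoff = random_pair_cutoff(toy_expression, n_pairs=1000)
print(f"cutoff from random pairs: r < {cutoff:.2f}")
```

Because the threshold is derived from each dataset separately, it depends on how strongly unrelated genes co-vary across the sampled tissues, which is consistent with the different cut-offs reported for the three species.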


Figure 8. Hierarchical clustering of Populus GRAS gene expression. Log2 expression intensity values were gene-wise normalized (z-score) and hierarchically clustered. The colored dot before the gene symbol indicates the subfamily that the gene belongs to. The same color scheme is used as that in Figure 6. a Heatmap showing hierarchical clustering of Populus GRAS genes across various tissues/organs or treatments. The re-analyzed microarray data encompass results from eight organ/tissue types or treatments including seedlings grown in continuous light (CL), seedlings grown in continuous darkness and then transferred to light for three hours (DL), seedlings grown in continuous darkness (CD), roots (R), young leaves (YL), mature leaves (ML), female catkins (FC) and male catkins (MC). b Heatmap showing hierarchical clustering of Populus GRAS genes across tissues/organs mainly related to wood formation. The re-analyzed microarray data encompass seven tissues/organs including bark and mature phloem (BP), developing phloem (DP), vascular cambium (VC), developing xylem (DX), mature xylem (MX), shoot and leaf primordia (SL), and whole stem (WS).
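The gene-wise z-score normalization mentioned in the legend of Figure 8 can be illustrated with a short sketch. Each gene's log2 profile is centred on its own mean and scaled by its own standard deviation, so that the heatmap compares relative expression across tissues within a gene. The use of the sample standard deviation, the function name and the toy values are assumptions made for illustration.

```python
import statistics


def zscore_rows(log2_expression):
    """Centre and scale each gene's profile across samples (gene-wise z-score)."""
    normalized = {}
    for gene, values in log2_expression.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD; an assumption of this sketch
        # A gene with a flat profile has no variation to scale, so it maps to zeros.
        normalized[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return normalized


# Toy log2 intensities for one hypothetical gene across three tissues.
toy_profiles = {"geneA": [6.0, 8.0, 10.0]}
normalized_profiles = zscore_rows(toy_profiles)
print(normalized_profiles["geneA"])
```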

DISCUSSION

The GRAS transcription regulator family is composed of GRAS domain-containing proteins that play diverse and critical roles in the regulation of plant growth and development. In this study, we performed a comparative study of the GRAS gene family mainly in three representative lineages of flowering plants. In particular, we provide the first comprehensive analysis of this gene family in a woody tree species.

Based on the latest genome annotations and assemblies, we found 106, 34 and 60 putative GRAS genes in Populus, Arabidopsis and rice, respectively (Table S1). Remarkably, the Populus genome contains approximately three times more GRAS genes than Arabidopsis, which is twice the reported ratio of 1.4~1.6 putative homologues in Populus to each gene in Arabidopsis (Tuskan et al., 2006). However, mis-assembly due to allelic variation of genes can generate seemingly independent paralogous genes, which usually results in decreasing sizes of gene families in updated versions of genome assemblies (Kulmuni et al., 2013). For example, 11 out of 59 gene models of the TLP gene family in Populus assembly v1.1 were not validated in assembly v2.0, probably due to allelic artifacts (Petre et al., 2011). To exclude this concern, we examined the number of Populus GRAS genes in the earlier assembly v2.2. The number of putative GRAS genes identified (102, Table S1) was lower than that in the latest assembly v3.0 (106), which was used in the present study. Thus, allelic artifacts are unlikely to underlie the inferred expansion of this gene family in Populus. Gene family size variation among lineages shaped by natural selection may have functional outcomes associated with speciation or adaptation (Lynch and Conery, 2000; Demuth and Hahn, 2009). The great expansion of this gene family in the Populus genome might therefore reflect the specialized roles played by these genes in the transcriptional regulation of this perennial woody species, as also suggested in a study of the HD-ZIP transcription regulator gene family in Populus (Hu et al., 2012).

To check whether this gene family has also greatly expanded in other perennial woody species, we examined the numbers of GRAS genes in four other tree species (Table S1). E. grandis, a woody tree species with the same fast-growing property as Populus (Kullan et al., 2012), has 94 GRAS genes, while the three slower-growing woody tree species, V. vinifera, P. persica and C. sinensis, have only 71, 48 and 47 GRAS genes, respectively. The GRAS gene families in three further herbaceous flowering plants, S. lycopersicum, M. truncatula and S. bicolor, have 61, 73 and 78 members, respectively. Although these results show that the GRAS gene family has not greatly expanded in all woody species, they indicate that great expansions may have occurred in fast-growing woody tree species, which is congruent with the presumed important role of this gene family in plant growth regulation.

Previous studies on the phylogeny of the plant GRAS gene family are generally based on aligned full-length protein sequences, and some members usually could not be classified into subfamilies (Bolle, 2004; Tian et al., 2004; Engstrom, 2011; Sun et al., 2011). For example, one recent study classified GRAS proteins into ten subfamilies based on aligned full-length protein sequences mainly from Arabidopsis and rice (Sun et al., 2011); however, nine GRAS members from rice were not assigned to any subfamily (Table S1, indicated by “Unclassified”). We thus reconstructed phylogenetic trees from the alignment of 176 GRAS domain amino acid sequences from Populus, Arabidopsis and rice. In our analysis, we grouped these unclassified members into their corresponding subfamilies, except for three members (LOC_Os01g67670; LOC_Os11g11600; LOC_Os03g40080) that were considered putative pseudogenes and thus excluded from our analysis. Of these members, LOC_Os01g67650 and LOC_Os04g35250 were assigned to two new subfamilies, Os4 and Os19, respectively, together with genes from Populus. We also tried to construct an NJ tree with full-length protein sequences as done previously (Tian et al., 2004; Sun et al., 2011), and it also demonstrated 13 well-supported subfamilies (Figure S3). However, there were always some genes, for example, LOC_Os01g45860, LOC_Os05g49930 and LOC_Os11g31100, failing to be grouped into any subfamily, as happened in previous studies. We therefore inferred that it might be due to ambiguous alignment, which resulted from the highly variable N-terminus of GRAS proteins when using full-length protein sequences. We consequently suggest that alignments of GRAS domain sequences are more reliable for phylogenetic inference in this gene family than aligned full-length protein sequences.

Tandem and segmental duplications are thought to be the main mechanisms contributing to the expansions of gene families in plants (Cannon et al., 2004). Both tandemly and segmentally duplicated genes that have been retained in plant genomes play important roles in adaptive responses to environmental stimuli (Hanada et al., 2008; Jiang et al., 2010). An earlier analysis of expansion mechanisms of the GRAS gene family in Arabidopsis and rice was performed by Tian et al. (2004), but it was based on older genome assemblies and annotations, and also limited by insufficient information available at the time regarding segmental duplications, especially in the rice genome. The extent to which tandem and segmental duplications contribute to the expansion of the GRAS gene family in flowering plants therefore remained unclear. Our results indicate that tandem and segmental duplications are the most common expansion mechanisms for the GRAS gene family in flowering plants (Figure 2; Table S2). In Populus, the origins of up to 90% of GRAS genes involved tandem and/or segmental duplications, while the percentages in Arabidopsis and rice were 51% and 57%, respectively (Figure 3). Strikingly, the origins of 22% of GRAS genes in Populus involved a combined mechanism of tandem and segmental duplications, in contrast to only 8% in both Arabidopsis and rice (Figure 3). This difference may account for the rapid expansion of this gene family in Populus.
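The Populus percentages follow from the gene counts given in the Results by inclusion-exclusion: genes involved in tandem and/or segmental duplication equal the tandem count plus the segmental count minus the genes counted in both. The sketch below reproduces the arithmetic. The denominator of 102 (the GRAS genes with assigned chromosomal positions) is an assumption of this sketch; it is not stated explicitly in the text, but it reproduces the reported percentages. The function name is hypothetical.

```python
def duplication_summary(tandem, segmental, both, total):
    """Rounded percentages of genes per duplication mode (inclusion-exclusion)."""
    either = tandem + segmental - both  # genes involved in tandem and/or segmental

    def as_percent(count):
        return round(100 * count / total)

    return {
        "tandem": as_percent(tandem),
        "segmental": as_percent(segmental),
        "both": as_percent(both),
        "tandem_and_or_segmental": as_percent(either),
    }


# Counts from the Results: 40 tandem, 74 segmental, 22 in both categories.
# total=102 is assumed here (chromosome-mapped genes), not stated in the text.
populus = duplication_summary(tandem=40, segmental=74, both=22, total=102)
print(populus)
```

With these inputs the sketch yields 39%, 73%, 22% and 90%, matching the values reported for Populus.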

Intronless or single exon genes (SEG) are archetypical of prokaryotic genomes; however, they also account for a large proportion of genes in eukaryotic genomes. The proportions of intronless genes in Arabidopsis, rice and Populus are 21.7%, 19.9% and 18.9%, respectively (Jain et al., 2008). In this study, we found much higher percentages of intronless genes in the GRAS gene family than the genome-wide percentages in all three studied species. The percentages are 67.6%, 55% and 54.7% in Arabidopsis, rice and Populus, respectively (Table S1, Figure 6b). Previous studies reported some large gene families that are enriched with intronless genes in flowering plants, such as F-box transcriptional regulator proteins (Jain et al., 2007), pentatricopeptide repeat (PPR) containing proteins (Lurin et al., 2004), DEAD box RNA helicases (Aubourg et al., 1999) and the small auxin-up RNAs (SAUR) gene family (Jain et al., 2006). These gene families belong to various functional categories and probably have different origins (Jain et al., 2008). Intronless genes in eukaryotic genomes may arise by either horizontal gene transfer from ancient prokaryotes, or duplication of existing intronless genes, or retroposition of intron-containing genes (Zou et al., 2011). Zhang et al. (2012) recently reported the origin of plant GRAS genes from prokaryotic genomes of bacteria by horizontal gene transfer. The enrichment of intronless genes in the plant GRAS gene family is therefore likely to reflect its prokaryotic origin followed by extensive duplications in flowering plants. However, whether all the families enriched with intronless genes have prokaryotic origins remains to be explored.

We estimated type-I and type-II functional divergence between subfamilies of GRAS genes. Significant type-I functional divergence was detected between most subfamilies (Table S3). This suggests that site-specific changes in evolutionary rates occurred in the majority of GRAS proteins, resulting in subfamily-specific functional evolution after duplication. However, we did not find any evidence for type-II functional divergence between subfamilies (data not shown). Our results therefore indicate that site-specific shifts in evolutionary rates may be the main force for the functional divergence between GRAS subfamilies. The same observation was also made in other protein families of flowering plants, such as the ADP-glucose pyrophosphorylase subunit family (Georgelis et al., 2008), the 12-oxo-phytodienoic acid reductases (OPRs) family (Li et al., 2009) and the oligopeptide transporters (OPTs) family (Cao et al., 2011). However, significant type-II functional divergence has often been detected in evolutionary analyses of protein families, such as the NOD26-like intrinsic protein family (Liu et al., 2009) and the SAP30 transcriptional regulator family (Viiri et al., 2009). The relative importance of type-I and type-II functional divergence may thus be related to specific functional categories of protein families. Nevertheless, their relationship remains to be explored.

In studies of molecular evolution, the ratio (ω = dN/dS) of the non-synonymous substitution rate (dN) to the synonymous substitution rate (dS) is widely used as an indicator of the selection pressure acting on protein-coding genes. Generally, ω = 1 is taken as an indicator of neutral evolution, whereas ω > 1 suggests accelerated evolution due to positive selection, and ω < 1 indicates functional constraint due to purifying or negative selection (Chi et al., 2011). Our results from both pairwise comparisons and site-specific analysis indicate that the evolution of plant GRAS genes after duplication is constrained by purifying selection (Table S5; Table S6; Figure 7). Similar results have been observed in other transcription regulator gene families in plants, including the HD-ZIP (Hu et al., 2012), CCCH (Chai et al., 2012), SBP (Yang et al., 2008) and bZIP (Wang et al., 2011b) domain gene families. These results may indicate that strong functional constraints have a bearing on the evolution of these gene families, reflecting their essential roles in the transcriptional regulation of cellular processes in flowering plants (Liu et al., 2010).

The retention of duplicated genes can be attributed to either genetic redundancy (Dean et al., 2008) or functional novelty. The latter includes neo-functionalization, in which a duplicate gains a new function, and sub-functionalization, in which functional modules are partitioned so that both duplicates together provide the original functions of their ancestral gene (Lynch and Conery, 2000). To infer the retention modes of plant GRAS genes after duplication, we evaluated the expression divergence of paralogous pairs (Table S5). The results showed that 75%, 78% and 71% of paralogous pairs in Arabidopsis, rice and Populus, respectively, displayed expression divergence. This suggests that these gene pairs may have undergone neo-functionalization or sub-functionalization. Our results thus indicate that the retention of GRAS genes after duplication can be mainly attributed to substantial functional novelty.

EXPERIMENTAL PROCEDURES

Genome-wide identification of GRAS domain genes

Three representative lineages of flowering plants with fully sequenced genomes were included in most analyses: Arabidopsis thaliana (a model plant for annual herbaceous dicots), Populus trichocarpa (a model plant for perennial woody dicots) and Oryza sativa (a model plant for monocots). We downloaded the latest versions of the genome annotations of the following species from the Phytozome v9.1 database (http://www.phytozome.net/): P. trichocarpa (JGI v3.0) (Tuskan et al., 2006), A. thaliana (TAIR annotation release 10), O. sativa (MSU annotation release 7.0) (Ouyang et al., 2007), Eucalyptus grandis (v1.1), Prunus persica (JGI v1.0) (Verde et al., 2013), Citrus sinensis (Orange v1.1) (Xu et al., 2012), Vitis vinifera (March 2010 release) (Jaillon et al., 2007), Solanum lycopersicum (ITAG v2.3) (Zouine et al., 2012), Sorghum bicolor (v1.4) (Paterson et al., 2009) and Medicago truncatula (Mt3.5 v4) (Young et al., 2011). The Pfam database (http://pfam.sanger.ac.uk/) was used to retrieve representative seed sequences for the GRAS domain (PF03514). The alignment of these seed sequences was then used to build hidden Markov model (HMM) profiles for searches against the annotated protein databases of the different genomes with the program HMMER 3.0 (http://hmmer.org/), using an E-value cutoff of 1.0. We also carried out BLASTP searches using the seed GRAS domain sequences as queries with an E-value cutoff of 0.01 to identify further possible members of this gene family. The presence of the GRAS domain in each putative gene family member was further verified by both a Pfam search (http://pfam.sanger.ac.uk/search) and SMART sequence analysis (http://smart.embl-heidelberg.de/); SMART was also used with default parameters to retrieve the domain sequences.
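The pooling of candidates from the two searches can be sketched as follows. This is a minimal illustration, assuming that the HMMER hits were written as a per-target table (`--tblout`) and the BLASTP hits in the 12-column tabular layout; neither output format is stated in the text, and the function names and sample identifiers are hypothetical.

```python
def parse_hmmer_tblout(text, max_evalue=1.0):
    """Return target IDs whose full-sequence E-value is <= max_evalue."""
    hits = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # tblout columns: target, accession, query, accession, E-value, ...
        if float(fields[4]) <= max_evalue:
            hits.add(fields[0])
    return hits


def parse_blast_tabular(text, max_evalue=0.01):
    """Return subject IDs with at least one hit at or below max_evalue."""
    hits = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # 12-column layout: subject ID is column 2, E-value is column 11
        if float(fields[10]) <= max_evalue:
            hits.add(fields[1])
    return hits


def candidate_genes(hmmer_text, blast_text):
    """Union of both searches; candidates still need domain verification."""
    return parse_hmmer_tblout(hmmer_text) | parse_blast_tabular(blast_text)


# Invented sample records, for illustration only.
HMMER_SAMPLE = """# target accession query accession E-value score bias
geneA - GRAS PF03514 1.2e-80 270.1 0.0
geneB - GRAS PF03514 0.5 10.2 0.1
geneC - GRAS PF03514 3.0 5.0 0.0
"""
BLAST_SAMPLE = (
    "seed1\tgeneA\t55.0\t300\t100\t5\t1\t300\t10\t310\t1e-60\t250\n"
    "seed1\tgeneD\t35.0\t120\t70\t3\t1\t120\t5\t125\t0.005\t40.1\n"
    "seed2\tgeneE\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.5\t20.0\n"
)
print(sorted(candidate_genes(HMMER_SAMPLE, BLAST_SAMPLE)))
```

The permissive cutoffs only build a candidate list; the subsequent Pfam and SMART checks described above decide which candidates are retained.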

Multiple sequence alignment and phylogenetic analysis

Full-length or domain amino acid sequences of GRAS proteins were aligned using MUSCLE (Edgar, 2004). The aligned sequences were then subjected to phylogenetic analysis with different methods. A Neighbor Joining (NJ) consensus tree was constructed using MEGA 5 (Tamura et al., 2011) under the Jones-Taylor-Thornton (JTT) amino acid substitution model and the pairwise deletion option, with 1000 bootstrap replicates to assess the reliability of internal branches. The Maximum Likelihood (ML) tree was generated with the program PhyML 3.0 (Guindon and Gascuel, 2003; Guindon et al., 2010) with 100 bootstrap replicates, using a BIONJ distance-based starting tree and the JTT+I+G+F model as inferred by the program ProtTest 2.4 (Abascal et al., 2005). Under the same model of sequence evolution, a Bayesian inference (BI) tree was constructed with MrBayes 3.2.1 (Ronquist and Huelsenbeck, 2003). Three independent runs were carried out with four Markov chains starting from random trees, each consisting of two million generations with trees sampled every 1000 generations. The 50% majority-rule consensus tree was calculated with supporting posterior probabilities on each clade.

Tandem and segmental duplications

Tandem duplicates have been defined as paralogous genes that are physically close to each other on the chromosomes and may originate from illegitimate chromosomal recombination (Freeling, 2009). To identify tandemly duplicated GRAS genes, we assessed whether they were ten or fewer genes apart and within 350 kb, 100 kb and 350 kb of each other for Populus, Arabidopsis and rice, respectively, as proposed by Lehti-Shiu et al. (2009). Segmental duplicates, which result from large-scale events such as whole genome duplication or duplications of large chromosomal regions, can be inferred through anchor genes in collinear blocks (Cannon et al., 2004). For the analysis of segmental duplications, we did not use the collinear block pairs provided by the Plant Genome Duplication Database (http://chibba.agtec.uga.edu/duplication/index/home), because they were based on older genome assemblies and annotations. Instead, we re-identified collinear block pairs in Populus, Arabidopsis and rice based on the new genome releases using the program MCscan (Tang et al., 2008), the tool used by the Plant Genome Duplication Database, which detects homologous chromosomal regions within or between genomes and aligns them using genes as anchors.
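The tandem duplication criterion can be sketched as a pairwise test. This is an illustrative reading, not the original pipeline: "ten or fewer genes apart" is interpreted here as a difference of at most ten in gene order along the chromosome, physical distance is measured between gene start coordinates, and all names are hypothetical.

```python
from itertools import combinations

# Species-specific distance limits from the text, in base pairs.
MAX_DISTANCE_BP = {"Populus": 350_000, "Arabidopsis": 100_000, "rice": 350_000}
MAX_GENE_RANK_GAP = 10  # assumed reading of "ten or fewer genes apart"


def tandem_pairs(all_genes, family_ids, species):
    """Return family gene pairs that satisfy both proximity criteria.

    all_genes: iterable of (gene_id, chromosome, start_bp) for every
    annotated gene, so that gene order can be derived per chromosome.
    """
    max_bp = MAX_DISTANCE_BP[species]
    by_chrom = {}
    for gene_id, chrom, start in all_genes:
        by_chrom.setdefault(chrom, []).append((start, gene_id))

    pairs = []
    for genes in by_chrom.values():
        genes.sort()  # order genes by start coordinate
        members = [
            (rank, start, gene_id)
            for rank, (start, gene_id) in enumerate(genes)
            if gene_id in family_ids
        ]
        for (r1, s1, g1), (r2, s2, g2) in combinations(members, 2):
            if r2 - r1 <= MAX_GENE_RANK_GAP and s2 - s1 <= max_bp:
                pairs.append((g1, g2))
    return pairs


# Toy genome: 15 genes spaced 10 kb apart plus two genes 200 kb apart.
toy_genome = [(f"g{i}", "chr1", i * 10_000) for i in range(15)]
toy_genome += [("h0", "chr2", 0), ("h1", "chr2", 200_000)]
toy_family = {"g1", "g3", "g14", "h0", "h1"}
print(tandem_pairs(toy_genome, toy_family, "Arabidopsis"))
print(tandem_pairs(toy_genome, toy_family, "Populus"))
```

In the toy genome, the pair h0/h1 passes the 350 kb limit used for Populus but fails the 100 kb limit used for Arabidopsis, which shows why the distance threshold is species-specific.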

Functional divergence analysis

Statistical methods have been developed to test for type I or type II functional divergence between two subfamilies, and to identify amino acid sites that contribute to functional innovations following gene duplication (Gu, 1999; Gu, 2006). Protein sites with type I functional divergence are sites at which the amino acid residues are highly conserved in one subfamily but highly variable in the other (Gu, 2001). For example, tryptophan residues are present at a specific site in all members of subfamily A, while in subfamily B, the residues at this site have been replaced with more than two types of residues. Type I functional divergence implies that these sites have experienced altered functional constraints, which results in site-specific changes of evolutionary rates (Wang and Gu, 2001). Protein sites with type II functional divergence are sites at which the amino acid residues are highly conserved in both subfamilies, but have distinct biochemical properties (Gu, 2001). For example, at a specific site in subfamily A, all the members have positively charged arginine residues, whereas all the members of subfamily B have negatively charged aspartate residues. Type II functional divergence suggests that these residues may contribute to functional specification in different subfamilies (Wang and Gu, 2001). To estimate the level of functional divergence of GRAS gene family members caused by amino acid substitutions in the conserved GRAS domain, the coefficients of type-I and type-II functional divergence (θI and θII) between any two subfamilies (except for Os19 because of its small size) were estimated with the algorithm based on maximum likelihood procedures (Gu, 2001; Gu, 2006), as implemented in the program DIVERGE v2.0 (Gu and Vander Velden, 2002). Type-I or type-II functional divergence is considered statistically significant when θI or θII is significantly larger than 0 (Li et al., 2009). Additionally, the critical amino acid sites responsible for functional divergence between subfamilies were predicted with a site-specific posterior analysis. A posterior probability greater than 0.9 was used as the cutoff value in order to reduce possible false positives. To assess the functional divergence of each subfamily from the ancestral state, we also calculated the parameter bF, the functional branch length for a given subfamily, with the DIVERGE v2.0 program. A large bF value for a given subfamily suggests that the evolutionary conservation may have changed at many sites and that the derived functional state may thus be far away from the ancestral state, whereas bF ≈ 0 indicates that the evolutionary rate of each site has not changed substantially since duplication (Gu et al., 2002).
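The two verbal examples of type I and type II sites can be turned into a toy column classifier. This is only a heuristic illustration of the definitions, not the maximum-likelihood procedure implemented in DIVERGE; the charge grouping, the variability threshold and all function names are assumptions.

```python
POSITIVE = set("KR")  # assumed grouping of positively charged residues
NEGATIVE = set("DE")  # assumed grouping of negatively charged residues


def charge_class(residue):
    """Coarse biochemical class used only for this illustration."""
    if residue in POSITIVE:
        return "positive"
    if residue in NEGATIVE:
        return "negative"
    return "neutral"


def classify_site(column_a, column_b, variable_threshold=3):
    """Label one alignment column given the residues of two subfamilies.

    column_a, column_b: strings of residues observed at the site in
    subfamilies A and B. A subfamily counts as variable when it shows at
    least `variable_threshold` residue types ("more than two types").
    """
    types_a, types_b = set(column_a), set(column_b)
    conserved_a = len(types_a) == 1
    conserved_b = len(types_b) == 1
    if conserved_a and conserved_b:
        (res_a,), (res_b,) = types_a, types_b
        if charge_class(res_a) != charge_class(res_b):
            return "type II-like"
        return "conserved"
    if conserved_a and len(types_b) >= variable_threshold:
        return "type I-like"
    if conserved_b and len(types_a) >= variable_threshold:
        return "type I-like"
    return "unclassified"


def critical_sites(posteriors, cutoff=0.9):
    """Positions (1-based) whose posterior probability exceeds the cutoff."""
    return [i for i, p in enumerate(posteriors, start=1) if p > cutoff]


print(classify_site("WWWW", "ALKV"))  # conserved in A, variable in B
print(classify_site("RRRR", "DDDD"))  # conserved in both, opposite charge
print(critical_sites([0.12, 0.95, 0.90, 0.99]))
```

The `critical_sites` helper mirrors the posterior probability cutoff of 0.9 described above; the posterior probabilities themselves would come from DIVERGE rather than from this sketch.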

Chromosomal localization and exon/intron structure

The physical positions of GRAS genes were obtained from the Phytozome v9.0 database. Chromosomal localization and collinear relationship of genes were visualized using the program Circos (Krzywinski et al., 2009). The exon/intron structures of GRAS genes were generated online by the Gene Structure Display Server (http://gsds.cbi.pku.edu.cn/).

Adaptive evolution analysis

Amino acid sequences of paralogous gene pairs were first aligned by Clustal X v2.1 (Thompson et al., 1997), and then their corresponding coding sequences were converted into codon alignments using PAL2NAL (Suyama et al., 2006). The ratio (ω) of non-synonymous substitution rate (dN) to synonymous substitution rate (dS) for the paralogous pairs was finally calculated with the program Codeml in the PAML v4.6 package (Yang, 2007). To further assess whether positive selection acts upon specific sites, six site models that allow ω ratios to vary among sites, as implemented in the program Codeml, were applied to the coding sequences of 176 GRAS genes from Populus, Arabidopsis and rice. These models are the one-ratio model (M0), the nearly neutral model (M1a), the positive-selection model (M2a), the discrete model (M3), the β model (M7) and the β & ω model (M8). They were used to test different biological hypotheses, and the model fitting the data best was determined with likelihood ratio tests (Yang et al., 2000; Cao et al., 2011). Briefly, M0 assumes one ω ratio for all sites. M3 uses an unconstrained discrete distribution to model heterogeneous ω ratios among sites. M1a does not allow for sites under positive selection (ω > 1) while M2a does. M7 and M8 both fit a β distribution to ω; M8 additionally allows for ω > 1, whereas M7 does not (Yang et al., 2000). The M0-M3 comparison can be used to test whether ω values vary among sites. Both the M1a-M2a and M7-M8 comparisons can be used to test for positive selection acting on sites (Yang et al., 2000). Positively selected sites were inferred based on posterior probabilities calculated by the Bayes Empirical Bayes method, which is implemented only under models M2a and M8 (Yang et al., 2005).
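The likelihood ratio tests between nested site models can be sketched as follows. The test statistic is twice the difference in log-likelihood, compared against a chi-square distribution. The degrees of freedom listed here (4 for M0-M3 with three site classes, 2 for M1a-M2a and M7-M8) follow common practice for these comparisons and are an assumption with respect to this study, as are the function names and the example log-likelihood values.

```python
import math

# Assumed degrees of freedom for the three nested comparisons.
LRT_DF = {("M0", "M3"): 4, ("M1a", "M2a"): 2, ("M7", "M8"): 2}


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) for a pair of nested models."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return statistic, chi2_sf_even_df(statistic, df)


# Hypothetical log-likelihoods, for illustration only.
stat, p_value = likelihood_ratio_test(-1000.0, -990.0, LRT_DF[("M7", "M8")])
print(f"2*dlnL = {stat:.1f}, p = {p_value:.2e}")
```

A small p-value for M1a-M2a or M7-M8 would favour the model that allows ω > 1; only then is it meaningful to inspect the Bayes Empirical Bayes posterior probabilities for individual sites.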

EST profiling and microarray analysis of GRAS genes

The NCBI EST database (http://www.ncbi.nlm.nih.gov/dbEST/, release 120701) was used to retrieve all Populus EST sequences. We then used the coding sequences of 93 Populus GRAS genes as queries to perform BLASTN searches against all ESTs. The corresponding EST sequences were identified if the identities were above 90% and the alignments with lengths of at least 100bp extended over at least 90% of the lengths of the ESTs. For ESTs that hit more than one gene, the gene with the highest identity was considered as the corresponding gene.
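The EST assignment rules can be sketched as a filter followed by a best-hit selection. This is a minimal illustration, assuming each BLASTN hit has already been parsed into a record; the field names are hypothetical, and EST coverage is approximated as alignment length divided by EST length.

```python
from collections import Counter


def assign_ests(hits, min_identity=90.0, min_aln_len=100, min_est_coverage=0.90):
    """Map each EST to the gene of its best qualifying hit.

    hits: iterable of dicts with keys 'est', 'gene', 'identity' (percent),
    'aln_len' (bp) and 'est_len' (bp).
    """
    best = {}
    for hit in hits:
        coverage = hit["aln_len"] / hit["est_len"]
        passes = (
            hit["identity"] > min_identity        # identity above 90%
            and hit["aln_len"] >= min_aln_len     # at least 100 bp aligned
            and coverage >= min_est_coverage      # at least 90% of the EST
        )
        if not passes:
            continue
        current = best.get(hit["est"])
        # An EST hitting several genes goes to the highest-identity gene.
        if current is None or hit["identity"] > current["identity"]:
            best[hit["est"]] = hit
    return {est: hit["gene"] for est, hit in best.items()}


def est_counts(assignments):
    """Number of ESTs assigned to each gene."""
    return Counter(assignments.values())


# Invented hits, for illustration only.
sample_hits = [
    {"est": "EST1", "gene": "geneA", "identity": 98.0, "aln_len": 400, "est_len": 420},
    {"est": "EST1", "gene": "geneB", "identity": 95.0, "aln_len": 410, "est_len": 420},
    {"est": "EST2", "gene": "geneB", "identity": 99.0, "aln_len": 90, "est_len": 95},
    {"est": "EST3", "gene": "geneB", "identity": 92.0, "aln_len": 300, "est_len": 500},
    {"est": "EST4", "gene": "geneB", "identity": 90.0, "aln_len": 200, "est_len": 210},
    {"est": "EST5", "gene": "geneB", "identity": 91.0, "aln_len": 200, "est_len": 210},
]
sample_assignments = assign_ests(sample_hits)
print(sample_assignments)
print(est_counts(sample_assignments))
```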

Populus microarray data were downloaded from the NCBI Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) under the GEO accession numbers GSE13990 and GSE30507. The former involves eight organ/tissue types or treatments including roots (R), female catkins (FC), male catkins (MC), young leaf (YL), mature leaf (ML), seedlings grown in continuous darkness (CD), seedlings grown in continuous darkness and then transferred to light for three hours (DL), and seedlings grown in continuous light (CL) (Wilkins et al., 2009). The latter involves seven tissues/organs including whole stem (WS), shoot and leaf primordia (SL), bark and mature phloem (BP), developing phloem (DP), mature xylem (MX), vascular cambium (VC), and developing xylem (DX). The online probe match tool at the NetAffx Analysis Center (http://www.affymetrix.com/analysis/index.affx) was used to identify probe sets corresponding to different GRAS genes. We considered the median of expression values when genes matched more than one probe set (Chai et al., 2012). For the microarray data analysis, we followed the method of Wilkins et al. (2009). Briefly, the data were first imported with the R package Affy (Gautier et al., 2004) and then normalized with the Robust Multiarray Analysis (RMA) algorithm. The average log2 intensity values for probe sets corresponding to Populus GRAS genes were extracted and then imported into MeV 4.8 (Saeed et al., 2003) to generate heatmaps and perform hierarchical clustering based on Pearson coefficients with average linkage. For Arabidopsis and rice, the Plant Expression Database (http://www.plexdb.org/) was used to retrieve RMA normalized development expression atlas data ("AT40" (Schmid et al., 2005) and "OS64" (Fujita et al., 2010)) across 63 and 33 tissues, respectively. To assess the expression divergence of paralogous pairs, we calculated Pearson's correlation coefficients (r) between the expression of 10,000 random gene pairs across the studied tissues. Because most genes in random pairs are not functionally related, we used the 95% quantile of the r values as the cutoff value below which duplicated gene pairs can be considered divergent in expression (Blanc and Wolfe, 2004).
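The random-pair cutoff for expression divergence can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the linear-interpolation quantile, the treatment of constant profiles and the toy expression matrix are assumptions, not details taken from the original analysis.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 for constant profiles (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def divergence_cutoff(expression, n_pairs=10_000, q=0.95, seed=0):
    """95% quantile of r among random gene pairs.

    expression: dict mapping gene ID to a list of values across tissues.
    """
    rng = random.Random(seed)
    genes = sorted(expression)
    r_values = []
    for _ in range(n_pairs):
        gene_1, gene_2 = rng.sample(genes, 2)
        r_values.append(pearson_r(expression[gene_1], expression[gene_2]))
    return quantile(r_values, q)


def is_divergent(profile_a, profile_b, cutoff):
    """A paralogous pair is called divergent when r falls below the cutoff."""
    return pearson_r(profile_a, profile_b) < cutoff


# Toy expression matrix: 50 genes across 10 tissues, random values.
toy_rng = random.Random(1)
toy_expression = {
    f"gene{i}": [toy_rng.gauss(8.0, 2.0) for _ in range(10)] for i in range(50)
}
toy_cutoff = divergence_cutoff(toy_expression, n_pairs=1000)
print(round(toy_cutoff, 3))
```

Because the cutoff is the upper tail of the random-pair distribution, only paralogous pairs that are more strongly correlated than 95% of random pairs are treated as conserved in expression; all other pairs are counted as divergent.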

∗ SUPPORTING INFORMATION

Figure S1. Multiple sequence alignment of 176 GRAS domain sequences from Populus, Arabidopsis and rice. Conserved residues are shaded. The five conserved motifs within the domain (LHR I, VHIID, LHRII, PFYRE and SAM) are indicated above the sequence alignment following the study by Pysh et al. (1999).

Figure S2. The ML tree (a) and BI tree (b) constructed using 176 GRAS domain sequences from Populus, Arabidopsis and rice.

Figure S3. NJ tree based on 176 full-length protein sequences of GRAS genes from Populus, Arabidopsis and rice.

Figure S4. Frequency and distribution of critical amino acid sites responsible for type I functional divergence between subfamilies along the GRAS domain. The five conserved motifs within the domain are indicated with colored boxes.

Figure S5. In silico EST analysis of Populus GRAS genes. EST frequency for each of the 39 Populus GRAS genes was counted by searching NCBI EST datasets from various libraries across a set of 18 organ/tissue types.

Table S1. The GRAS genes identified in Arabidopsis, rice, Populus and other species. “ψ” prior to gene symbols indicates pseudogene fragments.

Table S2. Tandem and segmental duplications of GRAS genes in Arabidopsis and rice. Genes in tandem clusters are indicated in red.

Table S3. Type I functional divergence between subfamilies of GRAS genes.

Table S4. Critical amino acid sites responsible for the type I functional divergence between subfamilies.

Table S5. Sequence similarities, dN/dS ratios, and Pearson’s correlation coefficient of expression between paralogous genes in Arabidopsis, rice and Populus.

Table S6. Likelihood ratio tests and parameter estimations for the six site models based on the coding sequences of 176 GRAS genes from Populus, Arabidopsis and rice.

Table S7. Corresponding probe sets of Populus GRAS genes in Affymetrix microarray analysis.

∗ Please download it here: https://yadi.sk/d/Pwmf7C2WgPhnV or find it in the accompanying CD.

REFERENCES

Abascal, F., Zardoya, R. and Posada, D. (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics, 21, 2104-2105. Aubourg, S., Kreis, M. and Lecharny, A. (1999) The DEAD box RNA helicase family in Arabidopsis thaliana. Nucleic Acids Res., 27, 628-636. Blanc, G. and Wolfe, K.H. (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell, 16, 1679-1691. Bolle, C. (2004) The role of GRAS proteins in plant signal transduction and development. Planta, 218, 683-692. Bolle, C., Koncz, C. and Chua, N.-H. (2000) PAT1, a new member of the GRAS family, is involved in phytochrome A signal transduction. Genes Dev., 14, 1269-1278. Cannon, S.B., Mitra, A., Baumgarten, A., Young, N.D. and May, G. (2004) The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biol., 4, 10. Cao, J., Huang, J., Yang, Y. and Hu, X. (2011) Analyses of the oligopeptide transporter gene family in poplar and grape. BMC Genomics, 12, 465. Chaffey, N. (1999) Wood formation in forest trees: from Arabidopsis to Zinnia. Trends Plant Sci., 4, 203-204. Chai, G., Hu, R., Zhang, D., Qi, G., Zuo, R., Cao, Y., Chen, P., Kong, Y. and Zhou, G. (2012) Comprehensive analysis of CCCH zinc finger family in poplar (Populus trichocarpa). BMC Genomics, 13, 253. Chi, Y., Cheng, Y., Vanitha, J., Kumar, N., Ramamoorthy, R., Ramachandran, S. and Jiang, S.Y. (2011) Expansion mechanisms and functional divergence of the glutathione s-transferase family in sorghum and other higher plants. DNA Res., 18, 1-16. Dean, E.J., Davis, J.C., Davis, R.W. and Petrov, D.A. (2008) Pervasive and persistent redundancy among duplicated genes in yeast. PLoS Genet., 4, e1000113. Demuth, J.P. and Hahn, M.W. (2009) The life and death of gene families. Bioessays, 31, 29-39. Di Laurenzio, L., Wysocka-Diller, J., Malamy, J.E., Pysh, L., Helariutta, Y., Freshour, G., Hahn, M.G., Feldmann, K.A. and Benfey, P.N. 
(1996) The SCARECROW gene regulates an asymmetric cell division that is essential for generating the radial organization of the Arabidopsis root. Cell, 86, 423-433. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792-1797. Engstrom, E.M. (2011) Phylogenetic analysis of GRAS proteins from moss, lycophyte and lineages reveals that GRAS genes arose and underwent substantial diversification in the ancestral lineage common to bryophytes and vascular plants. Plant Signal. Behav., 6, 850. Freeling, M. (2009) Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol., 60, 433-453. Fujita, M., Horiuchi, Y., Ueda, Y., Mizuta, Y., Kubo, T., Yano, K., Yamaki, S., Tsuda, K., Nagata, T. and Niihama, M. (2010) Rice expression atlas in reproductive development. Plant Cell Physiol., 51, 2060-2081. Gautier, L., Cope, L., Bolstad, B.M. and Irizarry, R.A. (2004) affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307-315. Georgelis, N., Braun, E.L. and Hannah, L.C. (2008) Duplications and functional divergence of ADP-glucose pyrophosphorylase genes in plants. BMC Evol. Biol., 8, 232. Greb, T., Clarenz, O., Schafer, E., Muller, D., Herrero, R., Schmitz, G. and Theres, K. (2003) Molecular analysis of the LATERAL SUPPRESSOR gene in Arabidopsis reveals a conserved control mechanism for axillary meristem formation. Genes Dev., 17, 1175-1187. Gu, J., Wang, Y. and Gu, X. (2002) Evolutionary analysis for functional divergence of Jak protein kinase domains and tissue-specific genes. J. Mol. Evol., 54, 725-733. Gu, X. (1999) Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol., 16, 1664-1674. Gu, X. (2001) Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol., 18, 453-464. Gu, X. 
(2006) A simple statistical method for estimating type-II (cluster-specific) functional divergence of protein sequences. Mol. Biol. Evol., 23, 1937-1945. Gu, X. and Vander Velden, K. (2002) DIVERGE: phylogeny-based analysis for functional–structural divergence of a protein family. Bioinformatics, 18, 500-501. Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W. and Gascuel, O. (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol., 59, 307-321. Guindon, S. and Gascuel, O. (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol., 52, 696-704. Hanada, K., Zou, C., Lehti-Shiu, M.D., Shinozaki, K. and Shiu, S.H. (2008) Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol., 148, 993-1003. Helariutta, Y., Fukaki, H., Wysocka-Diller, J., Nakajima, K., Jung, J., Sena, G., Hauser, M.-T. and Benfey, P.N. (2000) The SHORT-ROOT gene controls radial patterning of the Arabidopsis root through radial signaling. Cell, 101, 555-567. Heo, J.O., Chang, K.S., Kim, I.A., Lee, M.H., Lee, S.A., Song, S.K., Lee, M.M. and Lim, J. (2011) Funneling of gibberellin signaling by the GRAS transcription regulator SCARECROW-LIKE 3 in the Arabidopsis root. Proc. Natl. Acad. Sci., 108, 2166-2171. Hu, R., Chi, X., Chai, G., Kong, Y., He, G., Wang, X., Shi, D., Zhang, D. and Zhou, G. (2012) Genome-wide identification, evolutionary expansion, and expression profile of homeodomain-leucine zipper gene family in poplar (Populus trichocarpa). PLoS One, 7, e31149.

Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N. and Jubin, C. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature, 449, 463-467. Jain, M., Khurana, P., Tyagi, A.K. and Khurana, J.P. (2008) Genome-wide analysis of intronless genes in rice and Arabidopsis. Funct. Integr. Genomics, 8, 69-78. Jain, M., Nijhawan, A., Arora, R., Agarwal, P., Ray, S., Sharma, P., Kapoor, S., Tyagi, A.K. and Khurana, J.P. (2007) F-box proteins in rice. Genome-wide analysis, classification, temporal and spatial gene expression during panicle and seed development, and regulation by light and abiotic stress. Plant Physiol., 143, 1467-1483. Jain, M., Tyagi, A.K. and Khurana, J.P. (2006) Genome-wide analysis, evolutionary expansion, and expression of early auxin-responsive SAUR gene family in rice Oryza sativa. Genomics, 88, 360-371. Jansson, S. and Douglas, C.J. (2007) Populus: a model system for plant biology. Annu. Rev. Plant Biol., 58, 435-458. Jiang, S.Y., Ma, Z. and Ramachandran, S. (2010) Evolutionary history and stress regulation of the lectin superfamily in higher plants. BMC Evol. Biol., 10, 79. Krzywinski, M., Schein, J., Birol, İ., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J. and Marra, M.A. (2009) Circos: an information aesthetic for comparative genomics. Genome Res., 19, 1639-1645. Kullan, A.R., van Dyk, M.M., Hefer, C.A., Jones, N., Kanzler, A. and Myburg, A.A. (2012) Genetic dissection of growth, wood basic density and gene expression in interspecific backcrosses of Eucalyptus grandis and E. urophylla. BMC Genet., 13, 60. Kulmuni, J., Wurm, Y. and Pamilo, P. (2013) Comparative genomics of chemosensory protein genes reveals rapid evolution and positive selection in ant-specific duplicates. Heredity (Edinb), 110, 538-547. Lan, T., Yang, Z.L., Yang, X., Liu, Y.J., Wang, X.R. and Zeng, Q.Y. 
(2009) Extensive functional diversification of the Populus glutathione S-transferase supergene family. Plant Cell, 21, 3749-3766. Lehti-Shiu, M.D., Zou, C., Hanada, K. and Shiu, S.H. (2009) Evolutionary history and stress regulation of plant receptor-like kinase/pelle genes. Plant Physiol., 150, 12-26. Lei, L., Zhou, S.L., Ma, H. and Zhang, L.S. (2012) Expansion and diversification of the SET domain gene family following whole-genome duplications in Populus trichocarpa. BMC Evol. Biol., 12, 51. Li, W., Liu, B., Yu, L., Feng, D., Wang, H. and Wang, J. (2009) Phylogenetic analysis, structural evolution and functional divergence of the 12-oxo-phytodienoate acid reductase gene family in plants. BMC Evol. Biol., 9, 90. Liu, Q., Wang, H., Zhang, Z., Wu, J., Feng, Y. and Zhu, Z. (2009) Divergence in function and expression of the NOD26-like intrinsic proteins in plants. BMC Genomics, 10, 313. Liu, Q., Zhang, C., Yang, Y. and Hu, X. (2010) Genome-wide and molecular evolution analyses of the phospholipase D gene family in Poplar and Grape. BMC Plant Biol., 10, 117. Lurin, C., Andrés, C., Aubourg, S., Bellaoui, M., Bitton, F., Bruyère, C., Caboche, M., Debast, C., Gualberto, J. and Hoffmann, B. (2004) Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. Plant Cell, 16, 2089-2103. Lynch, M. and Conery, J. (2000) The evolutionary fate and consequences of duplicate genes. Science, 290, 1151-1155. Ma, H.S., Liang, D., Shuai, P., Xia, X.L. and Yin, W.L. (2010) The salt- and drought-inducible poplar GRAS protein SCL7 confers salt and drought tolerance in Arabidopsis thaliana. J. Exp. Bot., 61, 4011-4019. Morohashi, K., Minami, M., Takase, H., Hotta, Y. and Hiratsuka, K. (2003) Isolation and characterization of a novel GRAS gene that regulates meiosis-associated gene expression. J. Biol. Chem., 278, 20865-20873. 
Ouyang, S., Zhu, W., Hamilton, J., Lin, H., Campbell, M., Childs, K., Thibaud-Nissen, F., Malek, R.L., Lee, Y. and Zheng, L. (2007) The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res., 35, D883-D887. Paterson, A.H., Bowers, J.E., Bruggmann, R., Dubchak, I., Grimwood, J., Gundlach, H., Haberer, G., Hellsten, U., Mitros, T. and Poliakov, A. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature, 457, 551-556. Peng, J., Carol, P., Richards, D.E., King, K.E., Cowling, R.J., Murphy, G.P. and Harberd, N.P. (1997) The Arabidopsis GAI gene defines a signaling pathway that negatively regulates gibberellin responses. Gene Dev, 11, 3194-3205. Petre, B., Major, I., Rouhier, N. and Duplessis, S. (2011) Genome-wide analysis of eukaryote thaumatin-like proteins (TLPs) with an emphasis on poplar. BMC Plant Biol., 11, 33. Plomion, C., Leprovost, G. and Stokes, A. (2001) Wood formation in trees. Plant Physiol., 127, 1513-1523. Pysh, L.D., Wysocka-Diller, J.W., Camilleri, C., Bouchez, D. and Benfey, P.N. (1999) The GRAS gene family in Arabidopsis: sequence characterization and basic expression analysis of the SCARECROW-LIKE genes. Plant J., 18, 111-119. Ronquist, F. and Huelsenbeck, J.P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572-1574. Saeed, A., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T. and Thiagarajan, M. (2003) TM4: a free, open-source system for microarray data management and analysis. BioTechniques, 34, 374. Schmid, M., Davison, T.S., Henz, S.R., Pape, U.J., Demar, M., Vingron, M., Schölkopf, B., Weigel, D. and Lohmann, J.U. (2005) A gene expression map of Arabidopsis thaliana development. Nat. Genet., 37, 501-506. Schumacher, K., Schmitt, T., Rossberg, M., Schmitz, G. and Theres, K. (1999) The Lateral suppressor (Ls) gene of tomato encodes a new member of the VHIID protein family. Proc. Natl. Acad. 
Sci., 96, 290-295. Stuurman, J., Jäggi, F. and Kuhlemeier, C. (2002) Shoot meristem maintenance is controlled by a GRAS-gene mediated signal from differentiating cells. Genes Dev., 16, 2213-2218. Sun, X., Jones, W.T. and Rikkerink, E.H. (2012) GRAS proteins: the versatile roles of intrinsically disordered proteins in plant signalling. Biochem. J., 442, 1-12. Sun, X., Xue, B., Jones, W.T., Rikkerink, E., Dunker, A.K. and Uversky, V.N. (2011) A functionally required unfoldome from the plant kingdom: intrinsically disordered N-terminal domains of GRAS proteins are involved in molecular recognition during plant development. Plant Mol. Biol., 77, 205-223. Suyama, M., Torrents, D. and Bork, P. (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res., 34, W609-W612. Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. and Kumar, S. (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol., 28, 2731-2739. Tang, H., Bowers, J.E., Wang, X., Ming, R., Alam, M. and Paterson, A.H. (2008) Synteny and collinearity in plant genomes. Science, 320, 486-488. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res., 25, 4876-4882. Tian, C., Wan, P., Sun, S., Li, J. and Chen, M. (2004) Genome-wide analysis of the GRAS gene family in rice and Arabidopsis. Plant Mol. Biol., 54, 519-532. Tong, H., Jin, Y., Liu, W., Li, F., Fang, J., Yin, Y., Qian, Q., Zhu, L. and Chu, C. (2009) DWARF AND LOW-TILLERING, a new member of the GRAS family, plays positive roles in brassinosteroid signaling in rice. Plant J., 58, 803-816. Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313, 1596-1604. Verde, I., Abbott, A.G., Scalabrin, S., Jung, S., Shu, S., Marroni, F., Zhebentyayeva, T., Dettori, M.T., Grimwood, J. and Cattonaro, F. (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat. Genet., 45, 487-494. Viiri, K.M., Heinonen, T.Y., Maki, M. and Lohi, O. (2009) Phylogenetic analysis of the SAP30 family of transcriptional regulators reveals functional divergence in the domain that binds the nuclear matrix. BMC Evol. Biol., 9, 149.
Wang, J., Andersson-Gunneras, S., Gaboreanu, I., Hertzberg, M., Tucker, M.R., Zheng, B., Lesniewska, J., Mellerowicz, E.J., Laux, T., Sandberg, G., et al. (2011a) Reduced expression of the SHORT-ROOT gene increases the rates of growth and development in hybrid poplar and Arabidopsis. PLoS One, 6, e28878. Wang, J., Zhou, J., Zhang, B., Vanitha, J., Ramachandran, S. and Jiang, S.Y. (2011b) Genome-wide expansion and expression divergence of the basic leucine zipper transcription factors in higher plants with an emphasis on sorghum. J Integr Plant Biol, 53, 212-231. Wang, Y. and Gu, X. (2001) Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics, 158, 1311-1320. Wilkins, O., Nahal, H., Foong, J., Provart, N.J. and Campbell, M.M. (2009) Expansion and diversification of the Populus R2R3-MYB family of transcription factors. Plant Physiol., 149, 981-993. Xu, Q., Chen, L.-L., Ruan, X., Chen, D., Zhu, A., Chen, C., Bertrand, D., Jiao, W.-B., Hao, B.-H. and Lyon, M.P. (2012) The draft genome of sweet orange (Citrus sinensis). Nat. Genet., 45, 59-66. Yang, X., Kalluri, U.C., DiFazio, S.P., Wullschleger, S.D., Tschaplinski, T.J., Cheng, M.Z.-M. and Tuskan, G.A. (2009) Poplar Genomics: State of the Science. Crit. Rev. Plant Sci., 28, 285-308. Yang, Z. (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol., 24, 1586-1591. Yang, Z., Nielsen, R., Goldman, N. and Pedersen, A.M.K. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155, 431-449. Yang, Z., Wang, X., Gu, S., Hu, Z., Xu, H. and Xu, C. (2008) Comparative study of SBP-box gene family in Arabidopsis and rice. Gene, 407, 1-11. Yang, Z., Wong, W.S. and Nielsen, R. (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol., 22, 1107-1118. 
Young, N.D., Debellé, F., Oldroyd, G.E., Geurts, R., Cannon, S.B., Udvardi, M.K., Benedito, V.A., Mayer, K.F., Gouzy, J. and Schoof, H. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature, 480, 520-524. Zhang, D., Iyer, L.M. and Aravind, L. (2012) Bacterial GRAS domain proteins throw new light on gibberellic acid response mechanisms. Bioinformatics, 28, 2407-2411. Zou, M., Guo, B. and He, S. (2011) The roles and evolutionary patterns of intronless genes in deuterostomes. Comp. Funct. Genomics, 2011, 1-8. Zouine, M., Latché, A., Rousseau, C., Regad, F., Pech, J.C., Philippot, M., Bouzayen, M., Delalande, C., Frasse, P. and Schiex, T. (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature, 485, 635-641.

53 Chapter Three

CHAPTER THREE: Copy number variants and their putative involvement in edaphic adaptation of Arabidopsis lyrata

ABSTRACT

Copy number variation is a potentially important but largely unexplored component of genomic variation. Our current knowledge about copy number variants (CNVs) in plants is limited, partly due to technical limitations that were only recently overcome through the application of NGS techniques. Using three different sequencing-based approaches (read depth, read pair and split read), we identified 1,513 tandem duplications and 42,945 deletions among 24 Arabidopsis lyrata individuals from eight natural populations. The breakpoints of 95.2% of these CNVs were inferred at nucleotide resolution. Our results revealed dynamic genome evolution via frequent gene copy gain-and-loss in A. lyrata, and showed that defense-related genes are particularly sensitive to copy number changes. Substantial functional novelty may be achieved by gene fusions that were mediated by CNVs and preferentially occurred between paralogs. We found that a higher than expected percentage of deletions originated by non-allelic homologous recombination (NAHR) in A. lyrata (7.5%), which may be attributed to the abundant, highly conserved syntenic blocks within angiosperm genomes that were formed by multiple rounds of polyploidization during evolution. Candidate CNVs affecting genes interacting with soil chemicals, such as genes encoding a sulfate transporter, ligand-gated ion channels, and high-affinity potassium ion transporters, were found to exhibit high allele frequency differences between individuals of A. lyrata from calcareous and siliceous bedrocks, implying their putative involvement in edaphic adaptation. Our study provides the first sequencing-based CNV map for A. lyrata, and constitutes an important step towards deciphering the roles of CNVs in plant adaptation.

KEYWORDS

Arabidopsis lyrata · calcicole-calcifuge problem · copy number variations · deletions · edaphic adaptation · tandem duplications

ABBREVIATIONS

CNVs: Copy Number Variants, FDR: False Discovery Rate, GO: Gene Ontology, IGV: Integrative Genomics Viewer, MAPQ: Mapping Quality, Mya: Million years ago, NAHR: Non-allelic Homologous Recombination, NGS: Next Generation Sequencing, NHR: Non-homologous Recombination, RD: Read Depth, RP: Read Pair, SR: Split Read, SNP: Single Nucleotide Polymorphism, TEA: Transposable Element Activity, VNTRs: Variable Number of Tandem Repeats, WGD: Whole-genome Duplication

INTRODUCTION

Genetic variation is of fundamental importance for the adaptation of natural populations to biotic and abiotic environments (Schoville et al., 2012). Alongside the widely studied single nucleotide polymorphisms (SNPs), copy number variants (CNVs) contribute substantially to the standing genetic variation in populations upon which natural selection can act (Zhang et al., 2009). CNVs encompass gains and losses of genomic regions including tandem duplications, dispersed duplications and deletions that are larger than or equal to 50 bp, and belong to a broader class of genomic variation referred to as structural variants (SVs) (Alkan et al., 2011; Xi et al., 2012). CNVs potentially have significant functional effects, for they may cause changes in gene structure, gene dosage, or expression regulation, and may expose recessive alleles to selection (Bickhart et al., 2012).

While substantial progress has been made in characterizing CNVs in humans and animals (reviewed in Girirajan et al., 2011; Clop et al., 2012), our knowledge about CNVs in plants is still limited. Most studies of CNVs in plants have used array-based approaches, for example studies in Arabidopsis thaliana (DeBolt, 2010), maize (Swanson-Wagner et al., 2010), rice (Yu et al., 2011) and barley (Muñoz-Amatriaín et al., 2013). Recent advances in massively parallel sequencing technology and the development of novel analytical algorithms now allow the detection of CNVs using short read (< 100 bp) sequencing data with small library sizes (< 500 bp) (Xi et al., 2012). Three sequencing-based approaches are commonly used in CNV detection: (1) the read depth (RD) approach, which relies on changes in normalized read depth to estimate gains and losses of copies; (2) the read pair (RP) approach, which is based on discordantly mapped read pairs; and (3) the split read (SR) approach, which uses gapped read alignments (Alkan et al., 2011). Compared with array-based approaches, sequencing-based approaches, especially RP and SR, have obvious advantages, including the ability to detect CNVs of relatively small size (< 1 kb), and to infer the breakpoints of CNVs at nucleotide resolution (Li and Olivier, 2013), which is a prerequisite for assessing their genomic impacts and inferring their formation mechanisms (Lam et al., 2009). Because these three sequencing-based approaches differ with regard to size spectrum, power and resolution (Alkan et al., 2011; Xi et al., 2012), CNV studies in humans (Mills et al., 2011) and Drosophila (Zichner et al., 2013) adopted an integrated strategy, implementing multiple approaches to map a full spectrum of CNVs and lower the false positive rate. Yet, in plants, only a few studies have used sequencing-based approaches, and most relied solely on the RD method (Turner et al., 2010; Cao et al., 2011; Zheng et al., 2011; Flagel et al., 2013), which may entail a size bias towards larger variants and low breakpoint resolution (Xi et al., 2012; Li and Olivier, 2013).

There is growing evidence that CNVs, especially gene duplications, may be associated with adaptation to environmental stresses in a variety of species (reviewed in Kondrashov, 2012). Changes in gene copy number have been reported to be associated with tolerance to toxic soil chemicals in plants. Copy number expansion of the metal pump gene HMA4, for example, contributes to hyper-accumulation and hyper-tolerance to zinc and cadmium in A. halleri (Hanikenne et al., 2008; Hanikenne et al., 2013). Similarly, boron-tolerant genotypes of barley contain four times as many copies of the boron transporter gene (Bot1) as intolerant genotypes (Sutton et al., 2007), and aluminum tolerance in maize is associated with higher copy number of the multidrug and toxin extrusion gene MATE1 (Maron et al., 2013). It is thus of fundamental interest to decipher the roles of CNVs in plant adaptation using state-of-the-art sequencing-based approaches.

While plant adaptation to extreme environments, such as heavy metal contaminated soils (Antonovics et al., 1971), has received substantial attention, much less is known about edaphic adaptation to widespread soil types, which nevertheless may differ substantially in their chemical and physical properties and may have important consequences for individual performance and species composition. Plant ecologists have long been fascinated by the so-called calcicole-calcifuge problem: calcicole species preferentially grow in alkaline calcareous soils, but are rare or absent in acidic siliceous soils, while calcifuges show the reverse pattern (Tansley, 1917; Hope Simpson, 1938; Rorison, 1960a; Rorison, 1960b). It has been suggested that the primary factors governing the distribution of calcicole and calcifuge species are chemical factors that are specific to the different soil types (Lee, 1998). Plants growing on calcareous soils may suffer from calcium toxicity, and the low availability of cobalt, iron, phosphate, potassium and boron (Lee, 1998; Michalet et al., 2002). By contrast, plants growing on siliceous soils may encounter problems due to the toxicity of iron, manganese and aluminum, deficiencies in magnesium, calcium and molybdenum, an impaired nitrogen cycle, and the low availability of phosphate (Lee, 1998). Yet, genetic perspectives on the calcicole-calcifuge problem are still limited.

Arabidopsis lyrata is an outcrossing, perennial species with a wide geographic distribution across North America, northern and central Europe, and Asia (Ross-Ibarra et al., 2008). It has become a model for ecological genetics studies on various aspects such as flowering time variation (Riihimäki et al., 2005; Sandring et al., 2007), trichome production variation (Karkkäinen et al., 2004; Kivimäki et al., 2007) and local adaptation (Turner et al., 2010). The publication of a high-quality reference genome of A. lyrata (Hu et al., 2011) further makes it ideal for genomic studies. While it is still difficult to perform a whole-genome study in calcicole and calcifuge species that have large and complex genomes, studies on the genetic differences between populations of A. lyrata growing on either soil type may shed light on the calcicole-calcifuge problem.

In this study, we used three sequencing-based approaches (RD, RP and SR) to detect CNVs in 24 individuals of A. lyrata from eight natural populations occurring either on calcareous, siliceous or mixed bedrocks. We aim to address the following questions: How pervasive are CNVs within plant species in natural populations? What genomic impacts may these CNVs incur? DNA rearrangements are hypothesized to be influenced by genome architecture (Stankiewicz and Lupski, 2002). Given the fact that most angiosperms have undergone multiple rounds of whole genome duplication (WGD) during their evolution (Soltis et al., 2009), is there any difference in the relative importance of different CNV formation mechanisms between angiosperms and other species that have not undergone WGD? With respect to edaphic adaptation, we further asked whether there is evidence for a role of CNVs in edaphic adaptation of A. lyrata.

RESULTS

Tandem duplications and deletions in 24 A. lyrata individuals from natural populations

We sampled four populations of European A. lyrata growing on calcareous bedrock, three on siliceous bedrock and one on mixed bedrock (Table S1). A total of 24 individuals (three per population) were subjected to whole-genome re-sequencing, and ~307 Gb sequence data were obtained (Table S1). The mean coverage (MAPQ > 20) varied from 5× to 44×, and the median insert size of read pairs ranged from 326 to 500 bp (Table S1).

We focused on two CNV types that can be inferred most reliably, i.e. deletions and tandem duplications. We merged the calls from three different sequencing-based CNV detection approaches (RD, RP and SR) in a precision-aware way and only kept calls detected by at least two approaches. Altogether, we identified 1,513 tandem duplications (Table S2) and 42,945 deletions (Table S3) relative to the reference genome. The breakpoints of 89.2% (1,349) of tandem duplications and 95.4% (40,964) of deletions were inferred at nucleotide resolution (95.2% in total). The size of these CNVs ranged from 52 to 125,183 bp for tandem duplications, and from 50 to 127,861 bp for deletions (Table S2; Table S3). As shown in Figure 1a, size distributions of tandem duplications and deletions were both skewed towards small sizes, but tandem duplications were on average significantly larger (median: 4,602 bp) than deletions (median: 376 bp) (P < 2.2E-16, Mann-Whitney U-test). The frequency spectrum for deletions across the 24 A. lyrata individuals was U-shaped, with a high abundance at both very low (≤ 4 individuals, 32.3%) and very high frequencies (≥ 20 individuals, 29.9%), whereas the one for tandem duplications only peaked at low frequency (Figure 1a).
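The filtering step described above (keeping only calls supported by at least two approaches) can be illustrated with a minimal sketch. The details of the "precision-aware" merging are not given in this section, so the sketch makes two loudly labeled assumptions: calls from different approaches are considered the same event if they share at least 50% reciprocal overlap, and the coordinates of the call from the assumed most precise approach (SR, then RP, then RD) are retained. The names `Call`, `reciprocal_overlap` and `consensus_calls` are hypothetical and not taken from any CNV caller.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    method: str  # "RD", "RP" or "SR"


# ASSUMPTION: breakpoint precision ranking (lower value = more precise).
PRECISION = {"SR": 0, "RP": 1, "RD": 2}


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two fractions of each call covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def consensus_calls(calls, min_overlap=0.5):
    """Greedy merge: keep one representative per event seen by >= 2 methods.

    ASSUMPTION: min_overlap=0.5 is an illustrative threshold, not the
    value used in the study.
    """
    ordered = sorted(calls, key=lambda c: PRECISION[c.method])
    used, kept = set(), []
    for i, call in enumerate(ordered):
        if i in used:
            continue
        # Collect every not-yet-assigned call matching this one (incl. itself).
        cluster = [j for j, other in enumerate(ordered)
                   if j not in used
                   and reciprocal_overlap(call, other) >= min_overlap]
        used.update(cluster)
        if len({ordered[j].method for j in cluster}) >= 2:
            kept.append(call)  # most precise member represents the event
    return kept
```

In this sketch a call seen by only one approach is discarded, while an event seen by several approaches is reported once with the coordinates of its most precise member.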

To study the genomic distribution of CNVs in A. lyrata, we calculated the number and median size for both deletions and tandem duplications in each 0.5 Mb genomic window and mapped them onto the eight longest scaffolds of the genome assembly corresponding to the eight A. lyrata chromosomes (Figure 1b). As shown in tracks 1 and 3 of Figure 1b, the numbers of deletions and tandem duplications were relatively evenly distributed across the genome, and no CNV hotspot was detected. The size of CNVs near the telomeric regions (2 Mb genomic regions near each end of all eight scaffolds) was significantly smaller than that in regions flanking the centromeres (2 Mb genomic regions on each side of all centromeres) (P < 2.2E-16 for deletions; P = 0.03426 for tandem duplications, t-test; Figure 1b).
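The windowed summary described above (number and median size of CNVs per 0.5 Mb window) could be computed along the following lines. This is only a sketch: the function name `window_summary` is hypothetical, and assigning each CNV to the window that contains its start coordinate is an assumption, since the text does not state how CNVs spanning a window boundary were handled.

```python
from collections import defaultdict
from statistics import median

WINDOW = 500_000  # 0.5 Mb, as in the text


def window_summary(cnvs, window=WINDOW):
    """Summarize CNVs per genomic window.

    cnvs: iterable of (scaffold, start, end) tuples, with end > start.
    Returns {(scaffold, window_index): (count, median_size_bp)}.
    ASSUMPTION: a CNV belongs to the window containing its start.
    """
    sizes = defaultdict(list)
    for scaffold, start, end in cnvs:
        sizes[(scaffold, start // window)].append(end - start)
    return {key: (len(v), median(v)) for key, v in sizes.items()}
```

Running the function separately on deletions and on tandem duplications would yield the four tracks plotted in Figure 1b.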

Figure 1. Characterization of deletions and tandem duplications found in 24 A. lyrata individuals. (a) Size distribution (large panel; ≤ 20 kb) and frequency spectrum (small panel) of deletions and tandem duplications in A. lyrata. (b) Genomic distribution of deletions and tandem duplications in A. lyrata. ❶ Number of deletions in each 0.5 Mb window; ❷ Median deletion size in each 0.5 Mb window; ❸ Number of tandem duplications in each 0.5 Mb window; ❹ Median size of tandem duplications in each 0.5 Mb window. The blank regions that are present at the center of each scaffold correspond to the centromeric regions of the reference genome. Only the eight longest scaffolds of the genome assembly are shown.

Genomic impact of tandem duplications and deletions in 24 A. lyrata individuals

We assessed the genomic impact of tandem duplications and deletions by relating them to intersected genomic elements based on the annotation of the reference genome (Table 1). We found that a large percentage (67.1%; 28,801) of deletions were located in intergenic regions, whereas only 25.1% (380) of all tandem duplications were in intergenic regions. 5,130 different genes experienced gene copy losses due to deletions, and 1,722 genes gained gene copies through tandem duplications. We also found that 15.9% (240) of all tandem duplications and 2% (861) of all deletions partially affected two genes at the same time, which led to the creation of novel fusion genes. We further found that at least 50% (120) and 60.4% (520) of the gene fusion events happened between paralogous genes for tandem duplications and deletions, respectively. The genes overlapping with tandem duplications and/or deletions are indicated in Table S2 and Table S3, respectively.

We assessed the functional repertoire of genes that were affected by CNVs using GO term enrichment analyses. Interestingly, we found that plant defense-related GO terms were significantly overrepresented among genes that were either fully or partially affected by tandem duplications or deletions (FDR < 0.05, Fisher’s exact test), such as “defense response” (GO:0006952), “apoptotic process” (GO:0006915), “cell death” (GO:0008219), “immune response” (GO:0006955) and “signaling receptor activity” (GO:0038023). These defense-related genes include members of the NBS-LRR domain containing gene family, defensin-like (DEFL) protein family and pathogenesis-related (PR) gene family (data not shown). The complete lists of significantly overrepresented GO terms for genes fully or partially affected by CNVs are provided in Table S4 and Table S5, respectively.
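As an illustration of the kind of test named above, the following sketch computes a one-sided Fisher's exact test for over-representation of a single GO term and adjusts a list of p-values for multiple testing. Two points are assumptions rather than details from the text: the test is taken to be one-sided (enrichment only), and the FDR is controlled with the Benjamini-Hochberg procedure. Function names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: affected genes annotated with the term
    n: all affected genes
    K: background genes annotated with the term
    N: all background genes
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A term would then be reported as over-represented when its adjusted p-value falls below 0.05, matching the FDR threshold stated in the text.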

Table 1. Genomic impact of tandem duplications and deletions in 24 A. lyrata individuals

A single CNV may simultaneously fall into the categories “Full gene” and “Partial gene”, if it affects one entire gene and also part of another gene.

Ψ CNVs that affect CDS and UTRs / introns simultaneously are also counted. a Number of genes fully affected by CNVs. b Number of genes partially affected by CNVs.

Mechanisms underlying the formation of deletions in A. lyrata

The nucleotide-resolution breakpoint detection enabled us to examine the mechanisms underlying the formation of deletions in A. lyrata. A total of 40,264 deletions with nucleotide-resolution breakpoints that are located in the eight longest scaffolds were subjected to formation mechanism analysis, and 40,147 (99.7%) of them were unambiguously assigned a formation mechanism (Table S3). A large proportion of deletions (86.6%, 34,760/40,147) were inferred to be formed by non-homologous recombination (NHR), and the remaining 7.5%, 4.6% and 1.3% were attributed to non-allelic homologous recombination (NAHR), transposable element activity (TEA) and variable number of tandem repeats (VNTRs), respectively (cf. pie chart in Figure 2a). Deletions originating from these four formation mechanisms were all relatively evenly distributed across the genome (cf. circular chart in Figure 2a). A further analysis of the transposable elements associated with the formation of the deletions revealed that active transposable elements in A. lyrata encompassed both DNA transposons and retrotransposons (Table S3). Among these, the most abundant DNA transposon classes/families included DNA/TcMar-Stowaway (19.2%), DNA/hAT-Ac (14.9%), DNA/MULE-MuDR (12%), RC/Helitron (8%), DNA/hAT (5.9%), and DNA/TcMar-Pogo (2.4%), and the most abundant retrotransposon classes/families included LINE/L1 (11.5%), SINE/tRNA (7%), LTR/Copia (5.6%), and LTR/Gypsy (1.8%).

Figure 2. Mechanisms underlying the formation of deletions in A. lyrata, as inferred by Breakseq. (a) Genomic distribution (circular chart) and relative contributions (pie chart) of four molecular processes underlying the formation of deletions: NAHR (red), NHR (green), TEA (blue) and VNTRs (purple). The histograms in the circular chart show the number of deletions in each 0.5 Mb windows. (b) Size distributions for each formation mechanism (< 10 kb). (c) Frequency spectra for each formation mechanism. (d) Proportions of genomic elements overlapping with deletions originating from each of the four formation mechanisms.

We further related formation mechanisms to deletion sizes (Figure 2b). We observed significant differences in the size distributions of deletions derived from different formation mechanisms (P < 2.2E-16, Kruskal-Wallis test, Figure 2b). A post-hoc test further showed significant differences in deletion size between any two mechanisms (P < 2.2E-16, Mann-Whitney tests with Bonferroni correction). The four formation mechanisms were ranked as follows with respect to the median deletion size: NAHR (5,288 bp) > TEA (461 bp) > NHR (290 bp) > VNTRs (62 bp). As shown in Figure S1, we observed four major peaks in the size spectrum of deletions originating from transposable elements. These corresponded to the characteristic length distributions of four transposable element families: DNA/TcMar-Stowaway, SINE/tRNA, DNA/hAT-Ac, and DNA/MULE-MuDR.

When comparing the frequency spectra of deletions originating from different mechanisms (Figure 2c), we found that TEA-associated deletions showed a lack of low frequency variants (present in ≤ 4 individuals) relative to other mechanisms, but a high abundance of high frequency variants (present in ≥ 20 individuals). We noticed that the vast majority of deletions originating from TEA (97.2%), VNTRs (91.9%) and NHR (82.5%) resided in intergenic and intronic regions (Figure 2d). In contrast, a large proportion (62.7%) of NAHR-associated deletions overlapped with full gene, UTR and coding regions.

Putative involvement of CNVs in edaphic adaptation

Genotypes were unambiguously inferred for 96.7% of all tandem duplications (Table S2) and 92.1% of all deletions (Table S3) for each of the 24 A. lyrata individuals. As shown in Figure S2, for both tandem duplications and deletions, the mean percentage of heterozygous genotypes (RA) in plants growing on mixed bedrock (21.3%) was significantly higher than that in plants growing on calcareous (15.9%) or siliceous bedrock (15.5%) (P < 0.05, one-tailed t-test), and no significant difference could be detected among the bedrock types for the mean percentage of ambiguous genotypes (P > 0.05, one-way ANOVA).

Figure 3. Distribution of allele frequency differences between 12 individuals of A. lyrata growing on siliceous and nine on calcareous bedrock for both deletions and tandem duplications. Cutoffs are indicated for each distribution, along with shaded areas that denote highly differentiated allele frequencies between plants from the two different soil types.

With the genotype information at hand for each individual, we calculated allele frequencies for all tandem duplications and deletions among plants from each bedrock. Ambiguous genotypes were excluded from analysis. To identify deletions and tandem duplications with highly differentiated allele frequencies between plants from siliceous and calcareous bedrocks, we examined the distributions of allele frequency differences (Figure 3). We chose cutoff values at the 5th and 95th percentiles for tandem duplications, but more stringent cutoffs at the 1st and 99th percentiles for deletions, considering that there were many more deletions than tandem duplications. Accordingly, we detected 455 deletions and 81 tandem duplications (affecting 193 and 179 genes, respectively) that were present at higher frequencies among individuals from calcareous than from siliceous bedrock (Sheets 1 and 2 in Table S6). Similarly, we found 431 deletions and 76 tandem duplications (affecting 190 and 158 genes, respectively) that were present at higher frequencies among individuals from siliceous than from calcareous bedrock (Sheets 1 and 2 in Table S7).
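The allele frequency comparison described above can be sketched as follows. Genotypes are coded as in the text (RR, RA, AA, where A denotes the variant allele), and ambiguous genotypes are represented by `None` and excluded. The nearest-rank percentile definition is an assumption, as the text does not specify how percentiles were computed; all function names are hypothetical.

```python
def alt_allele_frequency(genotypes):
    """Frequency of the variant allele 'A' among unambiguous genotypes.

    genotypes: list of "RR", "RA", "AA", or None (ambiguous, excluded).
    Returns None if no genotype could be used.
    """
    called = [g for g in genotypes if g in ("RR", "RA", "AA")]
    if not called:
        return None
    return sum(g.count("A") for g in called) / (2 * len(called))


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 0..100 (ASSUMPTION)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def highly_differentiated(diffs, lower_q, upper_q):
    """Select CNVs whose frequency difference lies in either tail.

    diffs: {cnv_id: frequency difference between the two bedrock types}
    lower_q, upper_q: e.g. (5, 95) for tandem duplications, (1, 99) for
    deletions, following the cutoffs given in the text.
    """
    values = list(diffs.values())
    low, high = percentile(values, lower_q), percentile(values, upper_q)
    return {cnv: d for cnv, d in diffs.items() if d <= low or d >= high}
```

With differences defined, for instance, as calcareous minus siliceous frequency, the upper tail would correspond to the CNVs listed in Table S6 and the lower tail to those in Table S7.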

Figure 4. IGV screenshots of one candidate deletion and one tandem duplication as evidenced by discordant read pairs and read depths. Color bars at the bottom indicate the affected genomic regions, while the numbers at the top right corners represent the range of mapping coverage. (a) A 468 bp deletion affecting the sulfate transporter gene 942653 is fixed (genotype: AA) in individual “88G” from calcareous bedrock, is present in heterozygous form (genotype: RA) in “95C” from mixed bedrock, but absent (genotype: RR) in “174A” from siliceous bedrock. (b) A 5,322 bp tandem duplication affecting two glutamate receptor genes (325650 and 893036) is fixed in individual “133H” from calcareous bedrock, present in heterozygous form in “95C” from mixed bedrock, but absent in “74D” from siliceous bedrock.

Despite the lack of functional annotations for a large proportion (61.3%) of genes affected by CNVs with strong allele frequency differences between soil types, we were able to identify candidate CNVs putatively involved in edaphic adaptation. For example, a 468 bp deletion (ID: 95F_DELLY_Deletion_00029205) affecting the intronic region of a sulfate transporter gene (942653) was found to be fixed among individuals from calcareous bedrock, whereas it was absent in all individuals from siliceous bedrock and present in heterozygous form in all individuals from the mixed bedrock (Table S6, Figure 4a and Figure S3). Another interesting candidate is a 5,322 bp tandem duplication (ID: 112K_PINDEL_Tandem_Duplication_1089), which gives rise to a novel gene by the fusion of partial sequences of the two glutamate receptor genes, 325650 and 893036 (Figure 4b). It was found to be almost fixed among individuals from calcareous bedrock and almost absent among individuals from siliceous bedrock (Table S6, Figure 4b and Figure S4). We also found an intriguing 9,845 bp deletion (ID: 112K_PINDEL_Deletion_330357) that was present in heterozygous form in nearly all individuals from calcareous bedrock, but was absent in all individuals from siliceous bedrock (Table S6). This deletion caused the fusion of partial sequences of two high affinity potassium transporter genes (358974 and 915047) into a single novel gene. Last but not least, we found candidate CNVs that were not fixed for any bedrock type, but exhibited significant allele frequency differences between plants from either bedrock type. For example, a 4,766 bp tandem duplication (ID: 174A_PINDEL_Tandem_Duplication_234) that affected the gene (915596) encoding a major facilitator superfamily protein occurred at higher allele frequency among individuals from calcareous than siliceous bedrock (allele frequency difference: 0.42) (Table S6). Likewise, a 2,985 bp deletion (ID: 112K_DELLY_Deletion_00048956) that affected the MATE efflux family protein-encoding gene (494591) had a significantly higher allele frequency among individuals from siliceous than calcareous bedrock (allele frequency difference: 0.36) (Table S7).

DISCUSSION

In this study, we report the first sequencing-based CNV map for A. lyrata and provide novel insights into the content, genomic impact, and formation mechanisms of CNVs in plant species. Our study also constitutes an important step towards deciphering the roles of CNVs in plant adaptation.

Since the first CNV map was constructed from 270 humans (Redon et al., 2006), the prevalence of CNVs across genomes has been demonstrated in a variety of animal and plant species, although the numbers of CNVs may differ among studies. For example, Cao et al. (2011) found 1,509 CNVs among 80 accessions of A. thaliana, Bickhart et al. (2012) detected 1,265 CNVs among five cattle breeds, Zichner et al. (2013) reported 916 tandem duplications and 8,962 deletions among 39 lines of Drosophila from a single natural population, and Muñoz-Amatriaín et al. (2013) identified 31,494 genomic fragments affected by CNVs across 14 barley accessions. In this study, we identified 1,513 tandem duplications (Table S2) and 42,945 deletions (Table S3) among 24 A. lyrata individuals from eight natural populations. Our results therefore corroborate the notion that CNVs are common in natural plant populations and contribute substantially to the pool of standing genetic variation. When comparing the reported contents of CNVs between studies, five additional aspects must be considered besides the number of individuals and populations investigated: (i) the genome size of studied species (the larger the size, the more CNVs are detected), (ii) the genetic distance between the studied individuals and the reference (the bigger the distance, the more CNVs are detected), (iii) the detection power of the employed method (array-based approaches are less powerful than sequencing-based ones) (Bellos et al., 2012; Li and Olivier, 2013), (iv) the minimum length for a CNV to be considered (the previously defined minimum length is 1 kb, while 50 bp is commonly accepted nowadays) (Alkan et al., 2011), and (v) assembly errors in the reference genome (the more errors, the more variants are identified). Even if the above aspects were considered, our results may still imply that CNVs within plant species are more prevalent than previously appreciated.

Many more deletions (42,945) than tandem duplications (1,513) were found in this study (Table 1), suggesting that the rate of deletions may be much higher than that of tandem duplications. Nevertheless, we found that a much smaller percentage (32.9%) of deletions overlapped with genic regions than do tandem duplications (74.9%; Table 1), which may suggest that deletions are generally more deleterious than tandem duplications, and are therefore purged more rapidly by purifying selection in genic regions.

The birth-and-death dynamics of genes is of central interest in studies on genome evolution (Zhang, 2003). We found that 5,130 different genes in 24 A. lyrata genomes lost gene copies due to deletions, and 1,722 genes gained gene copies through tandem duplications (Table 1), indicating dynamic genome evolution via frequent gene copy gains and losses. Consistent with the results found in other species such as human (Mills et al., 2011), Drosophila (Zichner et al., 2013), and barley (Muñoz-Amatriaín et al., 2013), defense-related genes were substantially over-represented among genes that gained or lost gene copies due to CNVs in A. lyrata (Table S4). Interestingly, defense-related gene families have also been reported to greatly vary in size among different species, e.g. among five mammalian species (Demuth et al., 2006) and among 12 Drosophila species (Hahn et al., 2007). It is therefore conceivable that changes in gene copy number contribute substantially to the adaptive evolution of a robust immune system in both plants and animals. It is also reasonable to assume that the products of many defense-related genes have dosage-dependent effects, and that CNVs changing gene dosage can be selectively favored and eventually fixed in natural populations if they confer an adaptive advantage.

Fusion genes, also known as chimeric genes, play important roles in the evolution of genetic novelty, because they are immediately different from their parental genes and therefore more likely to evolve novel functions than duplicated genes (Jones and Begun, 2005; Williford and Betrán, 2013). We revealed abundant cases (1,101 in total) whereby a single tandem duplication or deletion caused the fusion of partial sequences of two genes (Table S2, Table S3). We further noticed that such gene fusion events preferentially happened between paralogous genes (at least 50% for tandem duplications and 60.4% for deletions). This result may be explained by aberrant recombination between homologous sequences, known as non-allelic homologous recombination (NAHR) (Hastings et al., 2009). Indeed, 90% (468) of the deletions that caused gene fusions between paralogs originated by NAHR (Table S3). Altogether, our results suggest that abundant functional novelty may be achieved in A. lyrata by gene fusions, which preferentially occur between paralogous genes and are mediated by CNVs that originated by NAHR.

Sequencing-based approaches, especially the SR approach, allowed us to identify a large proportion (95.4% of deletions; 89.2% of tandem duplications) of CNVs with nucleotide resolution breakpoints, which in turn enabled us to study the mechanisms underlying the formation of deletions in A. lyrata. Four major formation mechanisms could be distinguished: (1) VNTRs, which refers to the contraction or expansion of simple tandem repeat units due to DNA replication slippage; (2) NAHR, associated with homology-mediated recombination between two stretches of non-allelic DNA sequences with high sequence similarity; (3) TEA, associated with deletion or insertion events by DNA transposons or retrotransposons; and (4) NHR, which refers to recombination occurring in the absence of sequence homology during DNA repair (non-homologous end-joining based DNA double-strand repair) or replication (rescue of DNA replication-fork stalling events) (Hastings et al., 2009; Lam et al., 2009). We found that the relative contributions of NHR and VNTRs (86.6% and 1.3%, respectively) in A. lyrata were comparable with those reported in Drosophila (88% and 1%, respectively) (Zichner et al., 2013) (Figure 2a). However, the largest discrepancy between these two species was found in the relative contributions of NAHR (7.5% in A. lyrata versus 1% in Drosophila) (Zichner et al., 2013), suggesting that the molecular processes associated with NAHR are much more active in A. lyrata than in Drosophila. Since DNA rearrangements are hypothesized to be influenced by genome architecture (Stankiewicz and Lupski, 2002), the differences observed between A. lyrata and Drosophila may result from differences in genomic features between these two species. It is known that all angiosperms have undergone polyploidization events during their evolution (Bowers et al., 2003). Specifically, A. lyrata has experienced an ancient (γ) whole-genome triplication that preceded the rosid-asterid split and two more recent WGDs, namely the β WGD that occurred 65-115 Mya and the α WGD that occurred 47-64 Mya (Beilstein et al., 2010). Due to extensive genomic rearrangements over time, these duplicated sequences formed conserved sequence blocks, so-called syntenic blocks, which are abundant in extant plant genomes (Tang et al., 2008; Tang et al., 2011). We speculate that NAHR may be facilitated in angiosperms by the abundant, highly conserved syntenic blocks formed by multiple rounds of polyploidization events during evolution. Within this context, it is interesting to note that deletions originating from NAHR were generally much larger than those originating from other mechanisms (Figure 2b), in accordance with the large sizes of syntenic blocks within genomes. It will be interesting to see whether our inference finds support from studies in other angiosperms.

When examining the frequency spectra of tandem duplications and deletions in A. lyrata (Figure 1a), we noticed that deletions showed a U-shaped distribution with a distinct peak at high frequencies (≥ 20 individuals), unlike the spectrum obtained for tandem duplications. This observation may be explained by deletions that were almost fixed in European A. lyrata populations or by insertions of transposable elements in the reference genome. The analyses of frequency spectra by formation mechanism (Figure 2c) indeed suggest that transposable element insertions in the reference genome could account for a fraction of the prevalent deletions found in the 24 individuals of A. lyrata. The fact that the vast majority (97.2%) of TE-related deletions were found within intergenic and intronic regions (Figure 2d) further indicates that most of them were probably neutral and became fixed through random genetic drift.

Sequencing-based CNV genotyping remains challenging (Alkan et al., 2011), and current CNV detection tools that include genotyping features, such as DELLY v0.3.3 (Rausch et al., 2012), are only able to infer the genotypes of diallelic deletions and tandem duplications. Despite these limitations, our allele frequency analyses based on the genotyping results revealed several interesting aspects potentially related to edaphic adaptation in A. lyrata. We found that for both deletions and tandem duplications, the mean percentage of heterozygous variants (RA) among individuals growing on mixed bedrock (21.3%) was significantly higher than that among individuals growing on calcareous (15.9%) or siliceous bedrock (15.5%) (Figure S2). If the population from mixed bedrock represented an ancestral population, this observation may imply that edaphic adaptation of the two soil ecotypes of A. lyrata evolved by selection on a large pre-existing pool of standing genetic variation rather than on new mutations (Barrett and Schluter, 2008). Alternatively, gene flow between populations of the two soil ecotypes may account for the high rate of heterozygosity observed in the so-called mixed population.

We found deletions and tandem duplications that showed highly differentiated allele frequencies between individuals from siliceous and calcareous bedrocks (Table S6, Table S7). Among them were several CNVs affecting genes interacting with soil chemical factors, implying a putative role of CNVs in the edaphic adaptation of A. lyrata. Sulfur is an essential nutrient that is mainly taken up by plants as sulfate from the soil (Buchner et al., 2004). Due to the co-precipitation with calcium carbonate in calcareous soils, a large fraction of sulfate may be present in water-insoluble form, thus reducing the availability of sulfate to plants (Williams and Steinbergs, 1962; Roberts and Bettany, 1985). In A. thaliana, the gene Sultr1;1 (AT4G08620) encodes a high-affinity sulfate transporter that mediates uptake of sulfate from soil and its expression is inducible in roots under sulfate starvation (Takahashi et al., 2000; Yoshimoto et al., 2002). We found that its homologous gene in A. lyrata, 942653, has been affected by a homozygous intronic deletion (ID: 95F_DELLY_Deletion_00029205) in all individuals from calcareous bedrock (Figure 4a, Figure S3). It is likely that this deletion altered the expression of this gene, for example by disrupting a transcription regulatory element (silencer) residing in the intron (Morello and Breviario, 2008), thus potentially conferring an enhanced capacity for sulfate uptake to plants growing on calcareous bedrock. In this case, the deletion likely got fixed in calcareous populations due to positive selection. Alternatively, the deletion might be neutral and simply got fixed due to hitchhiking with beneficial mutations that are located in close vicinity (Fay and Wu, 2000).

In plants, a specific amount of calcium ions is required at the outer surface of plasma membranes to maintain the structural stability and functional integrity of the cell wall (Clarkson and Hanson, 1980). Similarly, a low concentration of calcium ions must be steadily maintained in the cytosol due to their role as second messengers in cell signaling (Bush, 1995). The maintenance of a proper cellular concentration of calcium ions must therefore constitute an essential adaptive feature for plants growing in calcareous soils that are enriched with calcium ions (Lee, 1998). Plant glutamate receptors (GLRs) are ligand-gated ion channels that are likely involved in cellular calcium ion homeostasis since they mediate calcium ion influx through plasma membranes (Lacombe et al., 2001). It has been reported that the gene AtGLR3.1 (AT2G17260; formerly AtGLR2) encodes a ligand-gated ion channel protein that unloads calcium ions from xylem vessels, but its overexpression causes symptoms of calcium deficiency in transgenic plants (Kim et al., 2001). This interesting observation was explained by a reduced channel activity resulting from the overexpression of one subunit of a heteromeric channel, or an enhanced competition with potassium and sodium ions that are concomitantly taken up following ectopic expression of this gene, either of which may reduce the utilization efficiency of calcium ions in plants (Kim et al., 2001). In this study, we found a tandem duplication (ID: 112K_PINDEL_Tandem_Duplication_1089) that is almost fixed in individuals from calcareous bedrock, which may give rise to a novel glutamate receptor gene through the fusion of two adjacent paralogous glutamate receptor genes (325650 and 893036) (Figure 4b, Figure S4). This tandem duplication may therefore result in an increased number of transcripts of a glutamate receptor gene, which is potentially analogous to overexpression of the gene AtGLR3.1, thus reducing the utilization efficiency of calcium ions in A. lyrata growing on calcareous soils.

It is well known that plants growing in calcareous soils are limited by low potassium supply, and that the uptake of potassium ions may also be affected due to the predominance of calcium ions in calcareous soils (Lee, 1998; Jalali and Zarabi, 2006). In A. thaliana, the gene AtHAK5 (AT4G13420) encodes a high-affinity potassium ion transporter, which mediates potassium ion uptake in the roots and whose expression is highly induced by potassium deprivation (Gierth et al., 2005). Here, we found that two genes in A. lyrata that are homologous to AtHAK5 were affected by a deletion (ID: 112K_PINDEL_Deletion_330357), which creates a fusion gene (Table S6). Interestingly, this deletion was found to be heterozygous in almost all individuals from calcareous bedrock, implying that it may have been maintained by balancing selection. We speculate that this novel fusion gene enhances potassium ion uptake. Further investigations are needed to elucidate the functions of this novel fusion gene and how this deletion is maintained by balancing selection.

Finally, we found some CNVs that showed relatively large differences in allele frequency between calcareous and siliceous bedrocks. One example is a deletion (ID: 112K_DELLY_Deletion_00048956) that truncated the gene 494591, which encodes a MATE efflux family protein. This deletion exhibited a higher allele frequency among individuals from siliceous than from calcareous bedrock (allele frequency difference: 0.36) (Table S7). CNVs affecting MATE genes have been reported to play a role in adaptation to serpentine soils in A. lyrata (Turner et al., 2010), and to aluminum toxicity in acidic soils in maize (Maron et al., 2013). It is known that plants growing on acidic siliceous soils suffer from the toxicity of aluminum (Lee, 1998). The role of this deletion in edaphic adaptation, especially to acidic siliceous soils, thus needs to be assessed further with more individuals and populations.

EXPERIMENTAL PROCEDURES

Sampling, DNA sequencing and data pre-processing

We sampled eight natural populations of A. lyrata subsp. petraea growing on three different bedrocks in Europe, including four populations on calcareous, three on siliceous and one on mixed bedrock (Table S1). A total of 24 individuals (three per population) were subjected to 2 × 100 bp paired-end whole-genome re-sequencing on the Illumina HiSeq 2000 platform. Quality control of raw sequencing data involved three steps: (1) quality trimming and filtering of reads with ConDeTri (-hq=20 -lq=10 -frac=0.80 -lfrac=0.1 -minlen=30 -mh=10) (Smeds and Künstner, 2011); (2) adapter trimming with cutadapt (-e 0.1 -O 20 -m 100 --match-read-wildcards) (Martin, 2011); and (3) PCR duplicate filtering with the script filterPCRdupl.pl in the package ConDeTri. Reads passing the quality control were mapped against the genome sequence of the North American A. lyrata subsp. lyrata strain MN47 (v1.0) using BWA-MEM under default settings (Li, 2013). To facilitate subsequent analyses, utilities implemented in SAMtools (Li et al., 2009) were used to skip split read alignments (view -F 2048) and low quality alignments (MAPQ < 20), while keeping the information of all unmapped reads (view -f 4). Detailed information is provided in Table S1.
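
The alignment filters described above can be read as simple tests on the SAM FLAG and MAPQ fields. The following Python sketch illustrates that logic only; it is not the pipeline used in this study. The function name keep_record and the way the three criteria are combined into one decision are assumptions, while the flag bits (0x4 for unmapped reads, 0x800 for supplementary alignments) follow the SAM specification.

```python
# Illustrative sketch only; not the pipeline used in the study.
UNMAPPED = 0x4          # SAM flag bit: read is unmapped (samtools view -f 4)
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary (split) alignment (view -F 2048)
MIN_MAPQ = 20           # mapping-quality threshold given in the text


def keep_record(flag: int, mapq: int) -> bool:
    """Return True if a SAM record passes the filters described in the text.

    How the three criteria are combined here is an assumption.
    """
    if flag & UNMAPPED:
        return True            # unmapped reads are retained
    if flag & SUPPLEMENTARY:
        return False           # split-read (supplementary) alignments are skipped
    return mapq >= MIN_MAPQ    # alignments with MAPQ < 20 are skipped


# Hypothetical (FLAG, MAPQ) pairs: proper pair, supplementary, low MAPQ, unmapped.
records = [(99, 60), (2147, 60), (83, 5), (77, 0)]
kept = [r for r in records if keep_record(*r)]
```

In this toy input, only the well-mapped primary alignment and the unmapped read are retained.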

CNV detection and merging

Deletions and tandem duplications were detected using three different sequencing-based approaches: (i) the read depth (RD) approach implemented in CNVnator v0.2.7 (Abyzov et al., 2011) was used with different bin sizes set for different coverage ranges (100 bp for > 15× coverage, 200 bp for 10-15× coverage, 300 bp for < 10× coverage); (ii) the discordant read pair (RP) approach implemented in DELLY v0.3.3 (Rausch et al., 2012) was used with default settings; and (iii) the split read (SR) approach implemented in Pindel v0.2.5 (Ye et al., 2009) was used (-x 6).
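
The coverage-dependent choice of bin size for the RD approach can be written as a small lookup. This is a minimal sketch of the rule stated above; the function name is hypothetical and the treatment of the exact boundary values (10× and 15×) is an assumption.

```python
def rd_bin_size(coverage: float) -> int:
    """Bin size (bp) for the read depth approach, by mean coverage.

    Thresholds are those given in the text; assigning exactly 10x and
    exactly 15x to the 200 bp class is an assumption.
    """
    if coverage > 15:
        return 100
    if coverage >= 10:
        return 200
    return 300


# Hypothetical per-sample coverages.
bins = {cov: rd_bin_size(cov) for cov in (22.0, 12.5, 8.0)}
```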

To combine the CNV calls from all three approaches across all 24 individuals, we performed a so-called “precision-aware merging” (Mills et al., 2011; Zichner et al., 2013). Confidence intervals around the breakpoints of CNVs were defined based on the presumed resolution of breakpoint inference for each approach (RD approach: 1 kb outwards, 400 bp inwards; RP approach: 50 bp outwards, 250 bp inwards; SR approach: 10 bp outwards, 10 bp inwards, accounting for misalignments owing to short indels). CNV calls with overlapping confidence intervals at both start and end coordinates were considered as the same variant and were merged into a final dataset. In order to choose the most precise breakpoint coordinates for each CNV in the final dataset, we set the precedence of breakpoints inferred by different approaches: SR > RP > RD, and for those inferred by the same approach, we chose the one with the best quality, e.g. the largest number of supporting read pairs for the RP approach. To minimize the false positive rate, only CNVs detected by at least two approaches were kept in the final dataset, while small variants (< 50 bp) were filtered out. Custom Python scripts are available upon request.
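
The merging logic described above can be sketched as follows. This is not the custom script used in this study, which is available only upon request; the class and function names, the greedy single-pass clustering, and the use of a single integer "support" value as the quality measure are all assumptions. The confidence intervals, the precedence SR > RP > RD, the requirement of at least two approaches and the 50 bp size cutoff are taken from the text. Calls are assumed to be of the same variant type.

```python
from dataclasses import dataclass

# (outward, inward) breakpoint uncertainty in bp, as stated in the text.
CI = {"RD": (1000, 400), "RP": (50, 250), "SR": (10, 10)}
PRECEDENCE = {"SR": 0, "RP": 1, "RD": 2}  # lower value = more precise


@dataclass
class Call:
    chrom: str
    start: int
    end: int
    method: str   # "RD", "RP" or "SR"
    support: int  # hypothetical quality measure, e.g. supporting read pairs

    def start_ci(self):
        out, inw = CI[self.method]
        return (self.start - out, self.start + inw)

    def end_ci(self):
        out, inw = CI[self.method]
        return (self.end - inw, self.end + out)


def overlaps(a, b):
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def same_variant(x, y):
    """Calls match if confidence intervals overlap at both breakpoints."""
    return (x.chrom == y.chrom
            and overlaps(x.start_ci(), y.start_ci())
            and overlaps(x.end_ci(), y.end_ci()))


def merge_calls(calls, min_size=50, min_methods=2):
    clusters = []
    # Greedy single pass (an assumption): join the first matching cluster.
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if any(same_variant(call, member) for member in cluster):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    merged = []
    for cluster in clusters:
        if len({c.method for c in cluster}) < min_methods:
            continue  # keep only variants detected by at least two approaches
        # Most precise approach first, then the best-supported call.
        best = min(cluster, key=lambda c: (PRECEDENCE[c.method], -c.support))
        if best.end - best.start >= min_size:
            merged.append(best)
    return merged


# Hypothetical calls for one individual.
calls = [
    Call("scaffold_1", 10000, 15000, "RD", 1),
    Call("scaffold_1", 10420, 14810, "SR", 12),
    Call("scaffold_1", 10400, 14800, "RP", 8),
    Call("scaffold_1", 50000, 50500, "RP", 5),
]
merged = merge_calls(calls)
```

In this example the three calls near position 10,400 collapse into one variant that carries the SR breakpoints, while the isolated RP call is discarded because only one approach supports it.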

CNV genotyping and categorization

The genotype for each deletion and tandem duplication was inferred with DELLY (Rausch et al., 2012), which counts all normal and discordant pairs at the predicted breakpoints to infer one of the three possible genotypes: homozygous reference (RR), heterozygous variant (RA), and homozygous variant (AA). In cases where the software failed to clearly distinguish heterozygous from homozygous variants, the genotype was defined as “ambiguous”.
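
Given the three genotype classes defined above, the variant allele frequency within a group of individuals, and the difference between two groups, can be computed as sketched below. The function name is hypothetical, and treating "ambiguous" genotypes as missing data is an assumption; the text does not state how such calls entered the allele frequency analyses.

```python
def variant_allele_frequency(genotypes):
    """Frequency of the variant allele among diploid genotype calls.

    'RR', 'RA' and 'AA' carry 0, 1 and 2 variant alleles. Any other label
    (e.g. 'ambiguous') is treated as missing, which is an assumption.
    Returns None if no genotype is usable.
    """
    variant_alleles = {"RR": 0, "RA": 1, "AA": 2}
    called = [g for g in genotypes if g in variant_alleles]
    if not called:
        return None
    return sum(variant_alleles[g] for g in called) / (2 * len(called))


# Hypothetical genotype vectors for one CNV in two groups of individuals.
calcareous = ["AA"] * 10 + ["RA", "ambiguous"]
siliceous = ["RR"] * 6 + ["RA"] * 3
difference = (variant_allele_frequency(calcareous)
              - variant_allele_frequency(siliceous))
```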

The genome annotation of A. lyrata (v1.0) downloaded from the Phytozome v9.1 database (http://www.phytozome.net/) was used along with custom Python scripts to assign CNVs to genomic elements they overlapped with. Each CNV was first assigned to one or two of the three following categories: “intergenic”, “full gene” and “partial gene”. The CNVs assigned to the category “partial gene” were further assigned to one of the three following sub-categories: “UTR”, “Intron”, and “CDS”.
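
The first level of this categorization can be illustrated with a short interval comparison. The sketch below is not the custom script used in this study; the function name, the use of closed coordinates, and the rule that a CNV overlapping any gene is not labeled "intergenic" are assumptions. The sub-categories of "partial gene" are omitted because the text does not state how a CNV spanning several feature types was resolved.

```python
def categorize_cnv(cnv, genes):
    """Assign a CNV to top-level categories by comparison with gene models.

    cnv   -- (start, end) on one chromosome, closed coordinates (assumed)
    genes -- list of (start, end) gene intervals on the same chromosome
    """
    start, end = cnv
    categories = set()
    for gene_start, gene_end in genes:
        if end < gene_start or start > gene_end:
            continue                        # no overlap with this gene
        if start <= gene_start and end >= gene_end:
            categories.add("full gene")     # gene lies entirely within the CNV
        else:
            categories.add("partial gene")  # only part of the gene is affected
    return categories or {"intergenic"}


# Hypothetical gene models and CNVs.
genes = [(300, 400), (440, 600)]
spanning = categorize_cnv((250, 450), genes)
```

A CNV that covers one gene completely and a neighbouring gene in part receives two categories under this scheme.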

GO term enrichment analysis

GO annotations of A. lyrata (v1.0) were downloaded from the Phytozome v9.1 database for the subsequent GO term enrichment analysis. Enrichment of GO terms in a given gene list compared to the whole genome was tested by means of Fisher's exact tests adjusted for multiple testing using the Benjamini and Hochberg false discovery rate (FDR < 0.01) as implemented in Blast2GO (Conesa et al., 2005).
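
For readers unfamiliar with the procedure, the two statistical steps can be sketched in plain Python. This is an illustration under stated assumptions, not a description of how Blast2GO implements the test: a one-sided test for over-representation is assumed, and both function names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric) p-value, P(X >= k).

    N -- genes in the genome with GO annotation
    K -- genes in the genome annotated with the term
    n -- genes in the list of interest
    k -- genes in the list annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in adjusted]
```

Exact integer arithmetic keeps the sketch dependency-free, but it is slow at genome scale; a statistics library would be the practical choice.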

Formation mechanism analysis

The formation mechanisms of deletions with nucleotide-resolution breakpoints were inferred with the annotation pipeline BreakSeq after setting the minimum size to 50 bp (Lam et al., 2009). By analyzing the sequence features of each deletion and its flanking regions, BreakSeq assigns each deletion to one of four formation mechanisms: variable number of tandem repeats (VNTRs), non-allelic homologous recombination (NAHR), transposable element activity (TE) and non-homologous recombination (NHR). Briefly, the mechanism VNTRs is inferred if a deletion region is covered by tandem repeats and low-complexity DNA; TE, if a deletion region is covered by one or several transposable elements; NAHR, if the flanking regions of the two breakpoints share high sequence similarity; and NHR, if a deletion lacks all of the above patterns (Lam et al., 2009).
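
The decision rule summarized above can be pictured as a cascade of tests. The sketch below is a simplified illustration and does not reproduce BreakSeq: the function name, the input summary statistics, both cutoff values and the order in which the tests are applied are assumptions made for the example.

```python
def classify_deletion(tandem_repeat_cover, te_cover, flank_identity,
                      min_cover=0.5, min_identity=0.85):
    """Assign a deletion to one of four formation mechanisms.

    tandem_repeat_cover -- fraction of the deletion covered by tandem
                           repeats or low-complexity DNA
    te_cover            -- fraction covered by transposable elements
    flank_identity      -- sequence identity between breakpoint flanks
    min_cover, min_identity -- hypothetical cutoffs, not BreakSeq defaults
    """
    if tandem_repeat_cover >= min_cover:
        return "VNTR"
    if te_cover >= min_cover:
        return "TE"
    if flank_identity >= min_identity:
        return "NAHR"
    return "NHR"  # none of the above patterns applies


# Hypothetical deletions described by the three summary statistics.
examples = [(0.9, 0.0, 0.2), (0.0, 0.8, 0.3), (0.0, 0.1, 0.95), (0.0, 0.0, 0.1)]
mechanisms = [classify_deletion(*e) for e in examples]
```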

Statistical analysis and data visualization

All statistical analyses were performed with R v2.15.2 (R Development Core Team, 2012). Most figures were generated using the R package “ggplot2” (Wickham, 2009) except the circular plots, which were drawn with the software Circos (Krzywinski et al., 2009). Sequencing read alignments were visualized with the Integrative Genomics Viewer (IGV) (Thorvaldsdóttir et al., 2012).

∗ SUPPORTING INFORMATION

Figure S1. The size spectrum of deletions originating from TE.

Figure S2. Mean percentage of each genotype among individuals of A. lyrata growing on calcareous, mixed or siliceous bedrock. Error bars represent standard errors. AA: homozygous variant; RA: heterozygous variant; Ambiguous: ambiguous between AA and RA; RR: homozygous reference.

Figure S3. IGV screenshots of a 468 bp deletion (ID: 95F_DELLY_Deletion_00029205) affecting a sulfate transporter gene (942653). The evidence from read depth and read pairs supporting the genotypes is displayed for all 24 individuals.

Figure S4. IGV screenshots of a 5,322 bp tandem duplication (ID: 112K_PINDEL_Tandem_Duplication_1089) affecting two glutamate receptor genes (325650 and 893036). The evidence from read depth and read pairs supporting the genotypes is displayed for all 24 individuals.

Table S1. Overview of the 24 sequenced individuals of A. lyrata.

Table S2. List of tandem duplications found in 24 individuals of A. lyrata. Comprehensive information is provided for each tandem duplication, including breakpoints, size, resolution, support, number of samples, affected gene IDs, and genotypes for each individual.

Table S3. List of deletions found in 24 individuals of A. lyrata. Comprehensive information is provided for each deletion, including breakpoints, size, resolution, support, number of samples, formation mechanism, transposable elements underlying the formation of deletions, affected gene IDs, and genotypes for each individual.

Table S4. Lists of over-represented GO terms for 5,130 genes fully affected by deletions (Sheet 1) and 1,723 genes fully affected by tandem duplications (Sheet 2) (Fisher’s exact test, FDR < 0.05). The terms common to both lists are highlighted in red.

Table S5. Lists of over-represented GO terms for 8,296 genes partially affected by deletions (Sheet 1) and 894 genes partially affected by tandem duplications (Sheet 2) (Fisher’s exact test, FDR < 0.05). The terms common to both lists are highlighted in red.

Table S6. Lists of deletions (Sheet 1) and tandem duplications (Sheet 2) that exhibited higher frequencies among individuals from calcareous than from siliceous bedrock. If one CNV overlapped with more than one gene, each affected gene is shown on a separate row.

Table S7. Lists of deletions (Sheet 1) and tandem duplications (Sheet 2) that exhibited higher frequencies among individuals from siliceous than from calcareous bedrock. If one CNV overlapped with more than one gene, each affected gene is shown on a separate row.

∗ Please download it here: https://yadi.sk/d/Pwmf7C2WgPhnV or find it in the accompanying CD.

REFERENCES

Abyzov, A., Urban, A.E., Snyder, M. and Gerstein, M. (2011) CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res., 21, 974-984. Alkan, C., Coe, B.P. and Eichler, E.E. (2011) Genome structural variation discovery and genotyping. Nat. Rev. Genet., 12, 363-376. Antonovics, J., Bradshaw, A.D. and Turner, R.G. (1971) Heavy metal tolerance in plants. Adv. Ecol. Res, 7, 2-85. Barrett, R.D. and Schluter, D. (2008) Adaptation from standing genetic variation. Trends Ecol. Evol., 23, 38-44. Beilstein, M.A., Nagalingum, N.S., Clements, M.D., Manchester, S.R. and Mathews, S. (2010) Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proc. Natl. Acad. Sci., 107, 18724-18728. Bellos, E., Johnson, M.R. and Coin, L.J.M. (2012) cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data. Genome Biol., 13, R120. Bickhart, D.M., Hou, Y., Schroeder, S.G., Alkan, C., Cardone, M.F., Matukumalli, L.K., Song, J., Schnabel, R.D., Ventura, M., Taylor, J.F., et al. (2012) Copy number variation of individual cattle genomes using next-generation sequencing. Genome Res., 22, 778-790. Bowers, J.E., Chapman, B.A., Rong, J. and Paterson, A.H. (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature, 422, 433-438. Buchner, P., Takahashi, H. and Hawkesford, M.J. (2004) Plant sulphate transporters: co-ordination of uptake, intracellular and long-distance transport. J. Exp. Bot., 55, 1765-1773. Bush, D.S. (1995) Calcium regulation in plant cells and its role in signaling. Annu. Rev. Plant Biol., 46, 95-122. Cao, J., Schneeberger, K., Ossowski, S., Gunther, T., Bender, S., Fitz, J., Koenig, D., Lanz, C., Stegle, O., Lippert, C., et al. (2011) Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat. Genet., 43, 956-963. Clarkson, D.T. 
and Hanson, J.B. (1980) The mineral nutrition of higher plants. Annu. Rev. Plant Physiol., 31, 239-298. Clop, A., Vidal, O. and Amills, M. (2012) Copy number variation in the genomes of domestic animals. Anim. Genet., 43, 503-517. Conesa, A., Gotz, S., Garcia-Gomez, J.M., Terol, J., Talon, M. and Robles, M. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674-3676. DeBolt, S. (2010) Copy number variation shapes genome diversity in Arabidopsis over immediate family generational scales. Genome Biol. Evol., 2, 441-453. Demuth, J.P., De Bie, T., Stajich, J.E., Cristianini, N. and Hahn, M.W. (2006) The evolution of mammalian gene families. PLoS One, 1, e85. Fay, J.C. and Wu, C.I. (2000) Hitchhiking under positive Darwinian selection. Genetics, 155, 1405-1413. Flagel, L.E., Willis, J.H. and Vision, T.J. (2013) The standing pool of genomic structural variation in a natural population of Mimulus guttatus. Genome Biol. Evol., 6, 53-64. Gierth, M., Maser, P. and Schroeder, J.I. (2005) The potassium transporter AtHAK5 functions in K(+) deprivation-induced high-affinity K(+) uptake and AKT1 K(+) channel contribution to K(+) uptake kinetics in Arabidopsis roots. Plant Physiol., 137, 1105-1114. Girirajan, S., Campbell, C.D. and Eichler, E.E. (2011) Human copy number variation and complex genetic disease. Annu. Rev. Genet., 45, 203-226. Hahn, M.W., Han, M.V. and Han, S.G. (2007) Gene family evolution across 12 Drosophila genomes. PLoS Genet., 3, e197. Hanikenne, M., Kroymann, J., Trampczynska, A., Bernal, M., Motte, P., Clemens, S. and Krämer, U. (2013) Hard selective sweep and ectopic gene conversion in a gene cluster affording environmental adaptation. PLoS Genet., 9, e1003707. Hanikenne, M., Talke, I.N., Haydon, M.J., Lanz, C., Nolte, A., Motte, P., Kroymann, J., Weigel, D. and Krämer, U. (2008) Evolution of metal hyperaccumulation required cis-regulatory changes and triplication of HMA4. 
Nature, 453, 391-395. Hastings, P., Lupski, J.R., Rosenberg, S.M. and Ira, G. (2009) Mechanisms of change in gene copy number. Nat. Rev. Genet., 10, 551-564. Hope Simpson, J. (1938) A chalk flora on the lower greensand: its use in interpreting the calcicole habit. J. Ecol., 218-235. Hu, T.T., Pattyn, P., Bakker, E.G., Cao, J., Cheng, J.-F., Clark, R.M., Fahlgren, N., Fawcett, J.A., Grimwood, J. and Gundlach, H. (2011) The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet., 43, 476-481. Jalali, M. and Zarabi, M. (2006) Kinetics of nonexchangeable-potassium release and plant response in some calcareous soils. J. Plant Nutr. Soil Sci., 169, 196-204. Jones, C.D. and Begun, D.J. (2005) Parallel evolution of chimeric fusion genes. Proc. Natl. Acad. Sci., 102, 11373-11378. Karkkäinen, K., Løe, G. and Ågren, J. (2004) Population structure in Arabidopsis lyrata: evidence for divergent selection on trichome production. Evolution, 58, 2831-2836. Kim, S.A., Kwak, J., Jae, S.-K., Wang, M.-H. and Nam, H. (2001) Overexpression of the AtGluR2 gene encoding an Arabidopsis homolog of mammalian glutamate receptors impairs calcium utilization and sensitivity to ionic stress in transgenic plants. Plant Cell Physiol., 42, 74-84. Kivimäki, M., Kärkkäinen, K., Gaudeul, M., Løe, G. and Ågren, J. (2007) Gene, phenotype and function: GLABROUS1 and resistance to herbivory in natural populations of Arabidopsis lyrata. Mol. Ecol., 16, 453-462. Kondrashov, F.A. (2012) Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc. R. Soc. B, 279, 5048-5057. Krzywinski, M., Schein, J., Birol, İ., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J. and Marra, M.A. (2009) Circos: an information aesthetic for comparative genomics. Genome Res., 19, 1639-1645. Lacombe, B., Becker, D., Hedrich, R., DeSalle, R., Hollmann, M., Kwak, J.M., Schroeder, J.I., Le Novère, N., Nam,

H.G. and Spalding, E.P. (2001) The identity of plant glutamate receptors. Science, 292, 1486-1487. Lam, H.Y., Mu, X.J., Stütz, A.M., Tanzer, A., Cayting, P.D., Snyder, M., Kim, P.M., Korbel, J.O. and Gerstein, M.B. (2009) Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat. Biotechnol., 28, 47-55. Lee, J.A. (1998) The calcicole-calcifuge problem revisited. Adv. Bot. Res., 29, 1-30. Li, H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078-2079. Li, W. and Olivier, M. (2013) Current analysis platforms and methods for detecting copy number variation. Physiol. Genomics, 45, 1-16. Maron, L.G., Guimaraes, C.T., Kirst, M., Albert, P.S., Birchler, J.A., Bradbury, P.J., Buckler, E.S., Coluccio, A.E., Danilova, T.V., Kudrna, D., et al. (2013) Aluminum tolerance in maize is associated with higher MATE1 gene copy number. Proc. Natl. Acad. Sci., 110, 5241-5246. Martin, M. (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J., 17, 10-12. Michalet, R., Gandoy, C., Joud, D., Pagès, J.-P. and Choler, P. (2002) Plant community composition and biomass on calcareous and siliceous substrates in the northern French Alps: comparative effects of soil chemistry and water status. Arct. Antarct. Alp. Res., 102-113. Mills, R.E., Walter, K., Stewart, C., Handsaker, R.E., Chen, K., Alkan, C., Abyzov, A., Yoon, S.C., Ye, K., Cheetham, R.K., et al. (2011) Mapping copy number variation by population-scale genome sequencing. Nature, 470, 59-65. Morello, L. and Breviario, D. (2008) Plant spliceosomal introns: not only cut and paste. Curr. Genomics, 9, 227. 
Muñoz-Amatriaín, M., Eichten, S.R., Wicker, T., Richmond, T.A., Mascher, M., Steuernagel, B., Scholz, U., Ariyadasa, R., Spannagl, M., Nussbaumer, T., et al. (2013) Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome. Genome Biol., 14, R58. R Development Core Team (2012) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Rausch, T., Zichner, T., Schlattl, A., Stütz, A.M., Benes, V. and Korbel, J.O. (2012) DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28, i333-i339. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R. and Chen, W. (2006) Global variation in copy number in the human genome. Nature, 444, 444-454. Riihimäki, M., Podolsky, R., Kuittinen, H., Koelewijn, H. and Savolainen, O. (2005) Studying genetics of adaptive variation in model organisms: flowering time variation in Arabidopsis lyrata. Genetica, 123, 63-74. Roberts, T. and Bettany, J. (1985) The influence of topography on the nature and distribution of soil sulfur across a narrow environmental gradient. Can. J. Soil Sci., 65, 419-434. Rorison, I. (1960a) The calcicole-calcifuge problem: II. The effects of mineral nutrition on seedling growth in solution culture. J. Ecol., 679-688. Rorison, I.H. (1960b) Some experimental aspects of the calcicole-calcifuge problem: I. The effects of competition and mineral nutrition upon seedling growth in the field. J. Ecol., 585-599. Ross-Ibarra, J., Wright, S.I., Foxe, J.P., Kawabe, A., DeRose-Wilson, L., Gos, G., Charlesworth, D. and Gaut, B.S. (2008) Patterns of polymorphism and demographic history in natural populations of Arabidopsis lyrata. PLoS One, 3, e2411. Sandring, S., Riihimäki, M.A., Savolainen, O. and Ågren, J. (2007) Selection on flowering time and floral display in an alpine and a lowland population of Arabidopsis lyrata. J.
Evol. Biol., 20, 558-567. Schoville, S.D., Bonin, A., François, O., Lobreaux, S., Melodelima, C. and Manel, S. (2012) Adaptive genetic variation on the landscape: methods and cases. Annu. Rev. Ecol. Evol. Syst., 43, 23-43. Smeds, L. and Künstner, A. (2011) ConDeTri-a content dependent read trimmer for Illumina data. PLoS One, 6, e26314. Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., Wall, P.K. and Soltis, P.S. (2009) Polyploidy and angiosperm diversification. Am. J. Bot., 96, 336-348. Stankiewicz, P. and Lupski, J.R. (2002) Genome architecture, rearrangements and genomic disorders. Trends Genet., 18, 74-82. Sutton, T., Baumann, U., Hayes, J., Collins, N.C., Shi, B.J., Schnurbusch, T., Hay, A., Mayo, G., Pallotta, M., Tester, M., et al. (2007) Boron-toxicity tolerance in barley arising from efflux transporter amplification. Science, 318, 1446-1449. Swanson-Wagner, R.A., Eichten, S.R., Kumari, S., Tiffin, P., Stein, J.C., Ware, D. and Springer, N.M. (2010) Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor. Genome Res., 20, 1689-1699. Takahashi, H., Watanabe-Takahashi, A., Smith, F.W., Blake-Kalff, M., Hawkesford, M.J. and Saito, K. (2000) The roles of three functional sulphate transporters involved in uptake and translocation of sulphate in Arabidopsis thaliana. Plant J., 23, 171-182. Tang, H., Bowers, J.E., Wang, X., Ming, R., Alam, M. and Paterson, A.H. (2008) Synteny and collinearity in plant genomes. Science, 320, 486-488. Tang, H., Lyons, E., Pedersen, B., Schnable, J.C., Paterson, A.H. and Freeling, M. (2011) Screening synteny blocks in pairwise genome comparisons through integer programming. BMC Bioinformatics, 12, 102. Tansley, A. (1917) On competition between Galium saxatile L. (G. hercynicum Weig.) and Galium sylvestre Poll. (G. asperum Schreb.) on different types of soil. J. Ecol., 5, 173-179. Thorvaldsdóttir, H., Robinson, J.T. and Mesirov, J.P. 
(2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform., 14, 178-192. Turner, T.L., Bourne, E.C., Von Wettberg, E.J., Hu, T.T. and Nuzhdin, S.V. (2010) Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nat. Genet., 42, 260-263. Wain, L.V., Armour, J.A. and Tobin, M.D. (2009) Genomic copy number variation, human health, and disease. Lancet,

374, 340-350. Wickham, H. (2009) ggplot2: elegant graphics for data analysis. Springer New York. Williams, C. and Steinbergs, A. (1962) The evaluation of plant-available sulphur in soils. I. The chemical nature of sulphate in some Australian soils. Plant Soil, 17, 279-294. Williford, A. and Betrán, E. (2013) Gene fusion. In: eLS. John Wiley & Sons, Ltd: Chichester. DOI: 10.1002/9780470015902.a0005099.pub3. Xi, R., Lee, S. and Park, P.J. (2012) A survey of copy-number variation detection tools based on high-throughput sequencing data. Curr. Protoc. Hum. Genet., 75, 7.19.11-17.19.15. Ye, K., Schulz, M.H., Long, Q., Apweiler, R. and Ning, Z. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25, 2865-2871. Yoshimoto, N., Takahashi, H., Smith, F.W., Yamaya, T. and Saito, K. (2002) Two distinct high-affinity sulfate transporters with different inducibilities mediate uptake of sulfate in Arabidopsis roots. Plant J., 29, 465-473. Yu, P., Wang, C., Xu, Q., Feng, Y., Yuan, X., Yu, H., Wang, Y., Tang, S. and Wei, X. (2011) Detection of copy number variations in rice using array-based comparative genomic hybridization. BMC Genomics, 12, 372. Zhang, F., Gu, W., Hurles, M.E. and Lupski, J.R. (2009) Copy number variation in human health, disease, and evolution. Annu. Rev. Genomics Hum. Genet., 10, 451-481. Zhang, J. (2003) Evolution by gene duplication: an update. Trends Ecol. Evol., 18, 292-298. Zheng, L.Y., Guo, X.S., He, B., Sun, L.J., Peng, Y., Dong, S.S., Liu, T.F., Jiang, S., Ramachandran, S., Liu, C.M., et al. (2011) Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biol., 12, R114. Zichner, T., Garfield, D.A., Rausch, T., Stutz, A.M., Cannavo, E., Braun, M., Furlong, E.E. and Korbel, J.O. (2013) Impact of genomic structural variation in Drosophila melanogaster based on population-scale sequencing. Genome Res., 23, 568-579.

GENERAL DISCUSSION

This thesis contributes to the growing knowledge about gene copy number variation and, more broadly, copy number variation in plants by using state-of-the-art approaches. Among the most original results of my thesis are the findings that 1) there exists a possible interplay between plant mating system and the dynamics of gene gain-and-loss; 2) defense-related genes, regardless of their specific roles, are sensitive to changes in gene copy number both among and within plant species; 3) the GRAS gene family has undergone significant expansion and diversification in fast-growing trees; 4) the high percentage of deletions originating from NAHR may be a characteristic feature of angiosperms; and 5) both deletions and tandem duplications may contribute to edaphic adaptation of A. lyrata.

Comparative genomic studies between individuals, especially in humans, reveal that gene copy number, i.e. the size of a multi-gene family, may vary extensively among individuals and populations. For example, at least 100 genes differ in copy number between any two African individuals (Conrad et al., 2009; Schrider et al., 2013). Numerous studies on gene family size variation among species, including my studies reported in Chapters One and Two, are, however, based on a single reference genome for each species. The implicit assumption made in such studies is that either the gene copy number in the reference genome of a species represents an average level for that species, or that a given variant allele has already been fixed in natural populations of that species. Accordingly, such reference-genome-based studies are suitable to compare the rates of gene gain-and-loss between species, and to infer the functional repertoire of rapidly evolving gene families. However, one should be cautious when ascribing species-specific adaptations to lineage-specific expansions in gene copy number that are inferred solely from a single reference genome, as is typically done. A confirmation of the inferred expansions based on analyses of population-scale re-sequencing data would be desirable.

Currently, the likelihood approach used in Chapter One only infers the action of natural selection in gene family size evolution in a quantitative framework (Hahn et al., 2005). In other words, size variation in gene families is considered to be driven by natural selection if the families evolve at faster rates of gene gain-and-loss than the genomic average. The power to detect selection may thus be relatively limited with this approach. By contrast, population re-sequencing data provide the opportunity to detect selection with multiple methods, including frequency spectrum-, linkage disequilibrium- and population differentiation-based methods (Vitti et al., 2013). Although extant methods for inferring selection are mainly based on point mutations (SNPs), future advances in structural variant genotyping will enable selection inference based on other types of mutations such as duplications, making it possible to study the role of selection acting on gene duplications, i.e. gene copy number expansions.

Clustering sequences into gene families is sensitive to the chosen threshold of similarity. The more stringent the threshold, the fewer genes can be clustered. In contrast, a threshold that is too low, especially with respect to alignment coverage, can lead to erroneous clustering as a consequence of the “domain chaining” problem (see Figure 1 in Miele et al., 2011). In this case, two protein sequences without any homology are clustered into the same gene family during the single-linkage clustering process. The threshold set for sequence clustering may thus affect the results of the likelihood approach. However, studies have shown that the absolute values of gene gain-and-loss rates may be only slightly affected, while the major results and conclusions should remain the same over a reasonable range of thresholds (Demuth and Hahn, 2009). Hahn et al. (2007), for example, observed changes in rate of only 2% with varying thresholds, suggesting that the effect of thresholds is minor as long as the set threshold is biologically realistic.
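The domain chaining effect can be illustrated with a minimal sketch. The code below is not SiLiX itself: the protein names, the pairwise hits, the identity values and the two coverage cut-offs are hypothetical and chosen only for illustration, and coverage is assumed to be measured as the aligned fraction of the longer sequence. The sketch performs single-linkage clustering with a union-find structure and keeps a pairwise hit only if it passes both thresholds.

```python
def single_linkage(proteins, hits, min_identity, min_coverage):
    """Cluster proteins by single linkage over filtered pairwise hits."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, identity, coverage in hits:
        # A hit links two proteins only if it passes both thresholds.
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return sorted(sorted(f) for f in families.values())


# Hypothetical case: protA carries only domain X, protC only domain Y,
# and protB carries both, so protB aligns to each partner over part of its length.
proteins = ["protA", "protB", "protC"]
hits = [
    ("protA", "protB", 0.60, 0.45),  # shared domain X
    ("protB", "protC", 0.55, 0.40),  # shared domain Y
]

# Low coverage cut-off: protA and protC are chained together through protB.
loose = single_linkage(proteins, hits, min_identity=0.35, min_coverage=0.30)
# Stringent coverage cut-off: partial (domain-only) hits no longer link proteins.
strict = single_linkage(proteins, hits, min_identity=0.35, min_coverage=0.80)
```

With the low cut-off all three proteins fall into one family although protA and protC share no homologous region, whereas the stringent cut-off leaves each protein in its own family, which is the trade-off described above.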

In this thesis, I estimated the average rate of gene gain-and-loss to be 0.0022 gains and losses per gene per million years among Arabidopsis and its relatives using a likelihood approach. It would, however, be interesting in the future to study empirically the rates of duplications and deletions within genomes, in particular the rate of gene copy number changes, in A. thaliana using mutation accumulation lines. The spontaneous point mutation rate in A. thaliana was estimated to be 7×10⁻⁹ base substitutions per site per generation by analyzing the whole-genome sequencing data of five mutation accumulation lines derived by 30 generations of single-seed descent from Col-0 (Ossowski et al., 2010). With the advances in sequencing technology and CNV genotyping approaches, it is becoming possible to estimate the gene turnover rate directly using mutation accumulation lines (Ossowski et al., 2010). Also of relevance would be the assessment of the fitness effects of gene copy number changes with field assays of survival and reproduction (Eyre-Walker and Keightley, 2007). Despite the difficulty in isolating the effects of individual mutations, the combined effects on fitness of spontaneous mutations in the five mutation accumulation lines of A. thaliana have been studied (Rutter et al., 2012). Such studies would greatly improve our understanding of the roles copy number changes may play in ecological adaptation.
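As a rough illustration of how a per-generation rate could be obtained from mutation accumulation lines, the following sketch pools events over lines and divides by the number of line-generations and targets (sites or genes). It is a simplification that ignores ploidy, callable-site corrections and detection sensitivity. The numbers of lines and generations follow the design described above, whereas the event count and the number of genes are hypothetical.

```python
def per_generation_rate(n_events, n_lines, n_generations, n_targets):
    """Events per target (site or gene) per generation, pooled over MA lines."""
    if min(n_lines, n_generations, n_targets) <= 0:
        raise ValueError("lines, generations and targets must be positive")
    return n_events / (n_lines * n_generations * n_targets)


# Five lines and 30 generations as in the design described in the text;
# the 3 copy number changes and the 27,000 genes are hypothetical values.
rate = per_generation_rate(n_events=3, n_lines=5, n_generations=30, n_targets=27_000)
print(f"{rate:.2e} copy number changes per gene per generation")
```

Comparing such an estimate with the phylogenetic rate per million years would additionally require an assumption about generation time, which is not made here.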

Currently available sequencing-based CNV detection approaches have various biases in terms of size, types and breakpoint resolution (Alkan et al., 2011). Therefore, a single approach cannot fulfill the requirements of CNV studies, and a pipeline that integrates multiple approaches, both to map a full spectrum of CNVs and to lower the false positive rate, is highly recommended. Accordingly, I performed a precision-aware merging of the calls of three tools, i.e. CNVnator, DELLY and Pindel, which implement the RD, RP and SR approaches, respectively, following the studies in Drosophila (Zichner et al., 2013) and humans (Mills et al., 2011). Relying on multiple CNV detection tools, however, complicates the whole process and thus increases the likelihood of errors. The software Genome STRiP (Handsaker et al., 2011), which implements two approaches (RD and RP) simultaneously, is convenient in this respect; however, a background population (typically 20 to 30 genomes) is required to run it. The commonly used approaches, including RD, RP and SR, rely on read mapping and are sensitive to mapping accuracy. The sequence assembly (SA) approach, although still in its infancy, is promising because it may not require a reference genome and does not rely on read mapping, and thus serves as a good complement to the available read-mapping-dependent approaches (Alkan et al., 2011). Taken together, we still lack a robust pipeline that implements all sequencing-based approaches, but given that recent years have seen substantial progress in this rapidly developing field, I believe that current limitations will be overcome soon.
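The idea of a precision-aware merge can be sketched as follows. This is not the pipeline used in Chapter Three: the precision ranking (SR above RP above RD), the 50% reciprocal overlap criterion, the greedy matching and the example coordinates are all assumptions made for illustration. Calls that describe the same event are collapsed, the coordinates of the most precise tool are retained, and the supporting tools are recorded.

```python
from dataclasses import dataclass, field

# Assumed ranking of breakpoint precision: split-read > read-pair > read-depth.
PRECISION = {"Pindel": 3, "DELLY": 2, "CNVnator": 1}


@dataclass
class Call:
    chrom: str
    start: int
    end: int
    kind: str                                   # e.g. "DEL" or "DUP"
    tool: str
    support: set = field(default_factory=set)   # tools supporting a merged call


def reciprocal_overlap(a, b):
    """Overlap length divided by the length of the longer call."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a.end - a.start, b.end - b.start)


def merge_calls(calls, min_overlap=0.5):
    merged = []
    # Visit the most precise calls first so that their coordinates are kept.
    for call in sorted(calls, key=lambda c: -PRECISION[c.tool]):
        for kept in merged:
            if (kept.chrom == call.chrom and kept.kind == call.kind
                    and reciprocal_overlap(kept, call) >= min_overlap):
                kept.support.add(call.tool)     # same event, less precise call
                break
        else:
            merged.append(Call(call.chrom, call.start, call.end,
                               call.kind, call.tool, {call.tool}))
    return sorted(merged, key=lambda c: (c.chrom, c.start))


# Hypothetical calls: one deletion seen by all three tools, one duplication by one.
calls = [
    Call("chr1", 1000, 5000, "DEL", "CNVnator"),
    Call("chr1", 1210, 4790, "DEL", "DELLY"),
    Call("chr1", 1203, 4801, "DEL", "Pindel"),
    Call("chr2", 300, 900, "DUP", "DELLY"),
]
merged = merge_calls(calls)
```

In this sketch the chr1 deletion keeps the Pindel coordinates and is supported by all three tools, while the chr2 duplication is supported by DELLY only. A filter on the size of `support` could then be used to trade sensitivity against the false positive rate.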

The major problem with sequencing-based CNV detection still derives from the nature of NGS data (Alkan et al., 2011). The generally short read length of NGS data can lead to abundant read-mapping ambiguity, which affects the performance of all approaches that are sensitive to mapping accuracy. Moreover, short read length may affect split read alignment and sequence assembly, thus directly lowering the performance of two CNV detection approaches, SR and SA. With longer reads and lower sequencing error rates, the sequencing-based CNV detection approaches would achieve better sensitivity and specificity.

The pervasiveness of CNVs across plant genomes and their substantial genomic impacts revealed in Chapter Three make it reasonable to hypothesize that CNVs account for an appreciable proportion of the phenotypic variation in plants. With the advances in CNV detection and genotyping, it may soon be possible to extend genome-wide association studies (GWAS) to include CNVs (McCarroll, 2008). One meaningful direction for future research is thus to assess the contributions of CNVs to plant phenotypic variation using GWAS, which may provide new insights into old, yet unresolved problems surrounding ecological adaptation and evolution.


REFERENCES

Alkan, C., Coe, B.P. and Eichler, E.E. (2011) Genome structural variation discovery and genotyping. Nat. Rev. Genet., 12, 363-376.
Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T.D., Barnes, C. and Campbell, P. (2009) Origins and functional impact of copy number variation in the human genome. Nature, 464, 704-712.
Demuth, J.P. and Hahn, M.W. (2009) The life and death of gene families. Bioessays, 31, 29-39.
Eyre-Walker, A. and Keightley, P.D. (2007) The distribution of fitness effects of new mutations. Nat. Rev. Genet., 8, 610-618.
Hahn, M.W., De Bie, T., Stajich, J.E., Nguyen, C. and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res., 15, 1153-1160.
Handsaker, R.E., Korn, J.M., Nemesh, J. and McCarroll, S.A. (2011) Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat. Genet., 43, 269-276.
McCarroll, S.A. (2008) Extending genome-wide association studies to copy-number variation. Hum. Mol. Genet., 17, R135-142.
Miele, V., Penel, S. and Duret, L. (2011) Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics, 12, 116.
Mills, R.E., Walter, K., Stewart, C., Handsaker, R.E., Chen, K., Alkan, C., Abyzov, A., Yoon, S.C., Ye, K., Cheetham, R.K., et al. (2011) Mapping copy number variation by population-scale genome sequencing. Nature, 470, 59-65.
Ossowski, S., Schneeberger, K., Lucas-Lledo, J.I., Warthmann, N., Clark, R.M., Shaw, R.G., Weigel, D. and Lynch, M. (2010) The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science, 327, 92-94.
Rutter, M.T., Roles, A., Conner, J.K., Shaw, R.G., Shaw, F.H., Schneeberger, K., Ossowski, S., Weigel, D. and Fenster, C.B. (2012) Fitness of Arabidopsis thaliana mutation accumulation lines whose spontaneous mutations are known. Evolution, 66, 2335-2339.
Schrider, D.R., Navarro, F.C., Galante, P.A., Parmigiani, R.B., Camargo, A.A., Hahn, M.W. and de Souza, S.J. (2013) Gene copy-number polymorphism caused by retrotransposition in humans. PLoS Genet., 9, e1003242.
Vitti, J.J., Grossman, S.R. and Sabeti, P.C. (2013) Detecting natural selection in genomic data. Annu. Rev. Genet., 47, 97-120.
Zichner, T., Garfield, D.A., Rausch, T., Stutz, A.M., Cannavo, E., Braun, M., Furlong, E.E. and Korbel, J.O. (2013) Impact of genomic structural variation in Drosophila melanogaster based on population-scale sequencing. Genome Res., 23, 568-579.

Xuanyu Liu’s CV

Xuanyu Liu 刘宣雨
Chinese citizenship
Born in Xuanhua County (宣化县), Hebei Province of China
Born on April 10th, 1985 (Ox year)
E-mail pers: [email protected]
E-mail prof: [email protected]

EDUCATION
Nov. 2011-Oct. 2014 PhD in Plant Science. Plant ecological genomics. Supervisor: Prof. Dr. Alex Widmer. Swiss Federal Institute of Technology in Zurich (ETH), Switzerland
Sep. 2008-Jul. 2011 MS in Cell Biology. Plant tissue culture and genetic transformation; seed physiology. Supervisor: Prof. Dr. Songquan Song. Institute of Botany, the Chinese Academy of Sciences (IB-CAS), China
Sep. 2004-Jul. 2008 BS in Agronomy. China Agricultural University (CAU), China

COMPETENCIES
Operating System: Linux, Macintosh, Windows
Programming Language: Python, Bash, Perl, C (National Computer Rank Examination Certificate Grade 2); R (statistical analyses and data visualization using packages such as "lattice" and "ggplot2")
Database: MySQL
NGS Data Analysis: Quality control (FastQC, fastx toolkit, Qualimap), Read mapping and manipulation (IGV, BWA-MEM, Samtools, Bcftools), SNP calling (Popoolation, GATK), CNV calling (DELLY, CNVnator, Pindel), Downstream analyses (SNPEff, Blast2GO, REVIGO, Circos)
Microarray Analysis: Affymetrix gene expression array
Molecular Evolution Analysis: Selection analysis (PAML, CAFE), Phylogenetic analysis (ProtTest, PhyML, MEGA, MrBayes), Functional divergence analysis (DIVERGE), Sequence alignment (MUSCLE, CLUSTAL X, Blast), Protein family clustering (SiLiX), Synteny analysis (MCscanX)
Lab Skills: Molecular Biology (DNA/RNA extraction, PCR, qPCR, SNP genotyping), Physiology (Plant tissue culture, Transformation, Anti-oxidation enzyme activity assay, Seed germination assay)

PUBLICATIONS
Research Articles
Liu X., Widmer A., Guggisberg A. Copy number variation and its putative involvement in edaphic adaptation of Arabidopsis lyrata (In prep.)
Liu X., Widmer A., Guggisberg A. Evolutionary analysis of gene family size variation in Arabidopsis and its relatives (In prep.)
Liu X. and Widmer A. 2014. Genome-wide comparative analysis of the GRAS gene family in Populus, Arabidopsis and rice. Plant Molecular Biology Reporter, DOI 10.1007/s11105-014-0721-5 (2012 IF: 5.31; Correspondence author)
Liu X., Deng Z., Cheng H., Song S. 2011. Nitrite, sodium nitroprusside, potassium ferricyanide and hydrogen peroxide release dormancy of Amaranthus retroflexus seeds in a nitric oxide-dependent manner. Plant Growth Regulation, 64:155-161 (2009 IF: 1.53)
Liu X., Liu S., Song S. 2010. The research of establishing a highly frequent and efficient regeneration system of sweet sorghum. Scientia Agricultura Sinica 中国农业科学, 43: 4963–4969 (In Chinese with English abstract)
Liu X., Liu S.J., Cheng H., Song S. 2008. Changes in activities of reactive oxygen species scavenging enzymes of sweet sorghum seeds during artificial aging. Plant Physiology Communications 植物生理学通讯, 4, 719-722 (In Chinese with English abstract)
Review
Liu X., Wang Q., Liu S., Song S. 2011. Advances on genetic transformation of sorghum bicolor. Bulletin of Botany 植物学报, 46:261-223 (In Chinese with English abstract)
Patent
Liu X., Zhao L., Liu S., Song S. The mediums and methods for sweet sorghum tissue culture. CN 101766122 B

WORKSHOPS
2014 "Genome-enabled approaches towards molecular functions in ecology and evolution", The First International Adaptomics Symposium, Bad Neuenahr. (A poster presented)
2013 Population genomic data analysis workshop, University of Hohenheim, Germany.
2012 "Analysis of differential gene expression" workshop, Swiss Institute of Bioinformatics, Lausanne, Switzerland.
2010 The Second China-US Workshop on Biotechnology of Bioenergy Plants. Sep. 19-21, Beijing, China.

AWARDS & HONORS
2011 Doctoral study abroad scholarship from China Scholarship Council from 2011 to 2014 (No. 2011491017); "Di Ao" (地奥) scholarship by Graduate University of the Chinese Academy of Science for excellent academic performance; Honored as "Miyoshi student" (三好) of Graduate University of the Chinese Academy of Science
2010 "Miyoshi student" of Graduate University of the Chinese Academy of Science
2007 "Xi Zhi" (曦之) scholarship by China Agricultural University; The second class scholarship by China Agricultural University for the outstanding academic record
2006 "Xi Zhi" Education scholarship by China Agricultural University in 2006; The first class scholarship by China Agricultural University for the outstanding academic record
2005 The second class scholarship by China Agricultural University for the outstanding academic record
2004 "Miyoshi student" of Hebei province

REFERENCES
Prof. Dr. Alex Widmer, ETH Zurich, Switzerland. Email: [email protected]
Prof. Dr. Songquan Song, IB-CAS, China. Email: [email protected]