Coevolution of transposable elements and plant genomes by DNA sequence exchanges

Douglas R. Hoen

Department of Biology

McGill University, Montreal

December, 2011

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy.

© Douglas R. Hoen, 2011

Express yourself completely, then keep quiet. Be like the forces of nature: when it blows, there is only wind; when it rains, there is only rain; when the clouds pass, the sun shines through.

LAO-TZU

Table of Contents

Abstract

Résumé

Acknowledgements

Contributions

Chapter One: Literature review

Introduction

Discovery

Frequent Birth Model

Plant DTEs

FAR1/FHY3/FRS

DAYSLEEPER

MUSTANG

GARY

Additional Examples

Other Types of Coding Exaptation

Regulatory Exaptation, Coding and Noncoding

Conclusions

References

Bridge to Chapter Two

Chapter Two: The evolutionary fate of MULE-mediated duplications of host gene fragments in rice

Abstract

Introduction

Results and Discussion

Methods

MULE detection

Gene prediction

cDNA mapping

Transduplicate detection

Conserved domain detection

Coding-sequence disablement and truncation

Synonymous and nonsynonymous substitution rate

Acknowledgments

Supplemental Material

References

Bridge to Chapter Three

Chapter Three: Transposon-mediated expansion and diversification of a family of ULP-like genes

Abstract

Introduction

Materials and Methods

Sequences

Identification of MULEs, Transduplicates, and Peptidase C48

Alignment, Phylogeny, and Conservation

Expression

Mobility

Results

Detection of MULEs and Transduplicates

Identification of KI

Phylogeny and Age

Conservation

Expression

Mobility

Discussion

Transduplication

KI Sequence Evolution

Expression

Mobility

Is KI Selfish?

Function

Broader Evolutionary Implications

Acknowledgments

Supplemental Material

References

Bridge to Chapter Four

Chapter Four: Discovery of domesticated transposable element genes in Arabidopsis thaliana by integrative genomic data mining

Abstract

Introduction

Methods

Data Sources

Identification of TE-like Genes

Storage, Attribution, and Analysis of Genomic Features

Determination of Shared Microsynteny

Clustering of Domain Sequences

Calculation of DNA Repetitiveness

Calculation of Pericentromere Shoulder Coordinates

Controls

Calculation of Confidence Levels and Curation

Additional Analysis of Complex Loci

Results

TE Signature Domains

Domain Scans

Cross-Genome Conservation

Pericentromere Shoulders

Domesticated Transposable Elements

Discussion

References

Conclusion

Appendix One: The map-based sequence of the rice genome

Appendix Two: MUSTANG is a novel family of domesticated transposase genes found in diverse angiosperms

Abstract

Transposable elements (TEs) are self-replicating genetic elements that comprise a large portion of all characterized nuclear genomes. Self-replication, which is catalyzed by proteins encoded by autonomous TEs, permits TEs to persist without necessarily providing immediate adaptive benefit to the organism; therefore, TEs are sometimes characterized as selfish, parasitic, or junk DNA. Nevertheless, over the course of evolution, TEs have produced diverse and vital eukaryotic adaptations. One way in which TEs coevolve with ordinary genes is by direct sequence exchange: TEs can duplicate and mobilize ordinary genes; conversely, TE-derived sequences can become conserved as ordinary genes. In this thesis, I use genome-scale bioinformatic analyses to identify direct sequence exchanges from plant genomes to TEs, and vice versa, and to characterize their functional and evolutionary consequences. After reviewing the literature, I first examine Mutator-like elements (MULEs) in rice that have duplicated and mobilized thousands of ordinary coding gene fragments, a process we term transduplication. Contrary to a previous report, these sequences do not appear to produce functional proteins, although they may have regulatory functions. Second, I examine a gene family called Kaonashi (KI), which appears to have originated through transduplication in Arabidopsis thaliana MULEs and is conserved within TEs. KI shows that transduplication does occasionally produce functional gene duplications; however, at least in this case, the result is not a new ordinary gene, but a new TE gene. Finally, I examine ordinary genes in A. thaliana derived from TE genes, a process termed molecular domestication. In addition to 3 previously known A. thaliana domesticated transposable element (DTE) families, I identify 23 candidate novel families. Together, these results support the view that, despite persisting by self-replication, TEs are not molecular parasites but are integral components of eukaryotic genomes.

Résumé

Les éléments transposables (ET) sont des séquences d’ADN capables de se déplacer et de s’autoreproduire dans un génome, un mécanisme appelé transposition. Ces éléments représentent l’une des composantes les plus importantes des génomes nucléaires eucaryotes. Cette capacité à s’autoreproduire, grâce aux protéines codées par les ET autonomes, a permis aux ET de persister et de peupler les génomes sans nécessairement apporter un avantage adaptatif immédiat à l’organisme hôte. À cet égard, les ET sont parfois considérés comme des éléments égoïstes ou parasites, ou de l’ADN «poubelle». Néanmoins, les ET ont joué un rôle important au cours de l’évolution en générant diverses adaptations essentielles aux eucaryotes. Ainsi, les ET peuvent coévoluer avec les gènes du génome hôte par l’échange direct de séquence d’ADN. Les ET peuvent se dupliquer et mobiliser des gènes hôtes ; à l’inverse, des séquences d’ADN dérivées de ET peuvent avoir le même niveau de conservation que des gènes hôtes. Dans le cadre de ma thèse, j’ai utilisé des analyses bio-informatiques à l’échelle du génome afin d’identifier des échanges directs de brins de séquence d’ADN à partir de génomes de plantes vers les ET, et vice-versa, et de caractériser leurs fonctions et leurs effets évolutifs. Ma thèse débutera par une recension des diverses publications scientifiques dans le domaine. Je dresserai ensuite un portrait des éléments mobiles Mutator-like (MULE) dans le génome du riz qui ont entraîné la duplication et la mobilisation de milliers de fragments de gènes codants normaux, un procédé appelé transduplication. Contrairement à ce qui avait été rapporté dans des publications antérieures, ces séquences transdupliquées ne semblent pas produire des protéines fonctionnelles malgré le fait qu’elles puissent avoir des fonctions régulatrices. 
En second lieu, j’examinerai une famille de gènes, appelée Kaonashi (KI), qui proviendrait d’un événement de transduplication présent dans les MULE de l’Arabidopsis thaliana, mais également conservé dans les ET. La présence de la famille KI nous montre que le procédé de transduplication permet à l’occasion des duplications fonctionnelles de gènes. Cependant, du moins dans le cas de la KI, le procédé n’entraîne pas la création d’un nouveau gène normal, mais bien d’un nouvel élément transposable. En troisième lieu, j’examinerai les gènes hôtes présents dans le génome de la plante A. thaliana qui proviendrait de ET, un procédé appelé domestication moléculaire. En plus des trois cas de familles d’éléments transposables domestiquées (ETD) déjà connues dans l’espèce A. thaliana, j’ai identifié 23 nouvelles familles potentielles. L’ensemble de ces résultats tend à démontrer que, malgré le fait qu’ils persistent dans les génomes grâce à leur capacité d’autoreproduction, les ET ne sont pas des parasites moléculaires, mais bien des éléments clés faisant partie intégrale des génomes eucaryotes.

Acknowledgements

I appreciate greatly the help, support, and patience that many have given to me as I have walked this path. I especially thank my thesis supervisor, Thomas Bureau, who has been generous with his time, ideas, and most importantly understanding. Thanks also to my supervisory committee, Joseph Dent, Daniel Schoen, and Paul Harrison, as well as to Mathieu Blanchette, for their attention and suggestions. Thanks to my lab mates, co-authors, and other colleagues for fruitful (and fruitless) discussions. Thanks most of all to my family and friends, for their love. My funding was in part provided by the Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship Program and the E & M Wilson Bursary administered by McGill University.

Contributions

Chapter One has been submitted as an invited contribution to an upcoming Topics in Current Genetics book on Plant Transposons. I researched and wrote this chapter with guidance from my thesis supervisor, Prof. Thomas Bureau.

Chapter Two was published in Genome Research: Hoen DR, Juretic N, Huynh ML, Harrison PM, Bureau TE. 2005. The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res. 15(9):1292-7. Co-first authorship is shared by N.J., who contributed to MULE discovery, experimental design, quality control, analysis of specific cases, and writing. M.H. was involved in preliminary MULE discovery. P.H. contributed to analyzing pseudogenic features. T.B. contributed to experimental design and writing. I designed, implemented, and analyzed all large-scale bioinformatic experiments, including MULE¹ & TSD detection, expression mapping, transduplicate detection, conserved domain detection, analyses of coding sequence disablements, and analysis of synonymous substitution rates, and I contributed to writing. This work helped fundamentally reshape scientific understanding of transduplicated gene fragments, which had been previously suggested to encode functional peptides. It received a Faculty of 1000 rating of ten (Exceptional) and has to date been cited by forty-five articles in the Web of Science database.

¹ Based on previous work during my thesis. See Appendix One: IRGSP. 2005. The map-based sequence of the rice genome. Nature. 11;436(7052):793-800.

Chapter Three was published in Molecular Biology and Evolution: Hoen DR, Park KC, Elrouby N, Yu Z, Mohabir N, Cowan RK, Bureau TE. 2006. Transposon-mediated expansion and diversification of a family of ULP-like genes. Mol Biol Evol. 23(6):1254-68. Co-first authorship is shared by K.P., who contributed to transduplicate detection, quality control, and analysis of specific cases, performed wet-lab mobility experiments, and contributed to writing. N.E., Z.Y., N.M., and R.C. were involved in preliminary analyses. T.B. contributed to experimental design and writing. I designed, implemented, and analyzed all large-scale bioinformatic experiments, including MULE detection, transduplicate detection, Peptidase C48 detection, conservation and phylogenetic analyses, synonymous substitution rate analysis, and expression analysis, and I was primarily responsible for writing. This publication has to date been cited by twenty-nine articles in the Web of Science database.

Chapter Four is in preparation to be submitted for publication. I designed, implemented, and analyzed all experiments² and was solely responsible for writing. T.B. contributed to experimental design. The novel DTE candidates identified in this study have already been incorporated into an agronomic trait discovery project (VEGI, http://biology.mcgill.ca/vegi), the preliminary results of which are presented in Table 4-21, and which are the result of the following contributions: T-DNA lines were selected by D.H. Homozygous mutant isolation and genotyping were performed by Zoe Joly-Lopez, Eva Forczek, and Akiko Tomita with minor assistance by D.H. Trait assays were performed by: E.F., freezing tolerance; Z.J., salt tolerance; Young Wha Lee, Emily Josephs, and John Stinchcombe, nitrogen use efficiency.

² Based on previous work during my thesis. See Appendix Two: Cowan RK, Hoen DR, Schoen DJ, Bureau TE. 2005. MUSTANG is a novel family of domesticated transposase genes found in diverse angiosperms. Mol. Biol. Evol. 22(10):2084-9.

Chapter One: Literature review

Introduction

When Barbara McClintock first discovered transposable elements (TEs), she immediately recognized their potential importance in gene regulation and genome evolution (McClintock 1950). But the discovery that TEs self-replicate within the genome led to their characterization as selfish (Doolittle & Sapienza 1980), parasitic (Orgel & Crick 1980; Hickey 1982), junk (Ohno 1972) DNA. We have since learned that TEs are diverse, abundant, and ubiquitous (Feschotte & Pritham 2007; Wicker et al. 2007; Pritham 2009; Aziz et al. 2010; Levin & Moran 2011). We now know that TEs contribute to numerous aspects of eukaryotic genome function and evolution, such as: regulatory networks (Feschotte 2008), epigenetic regulation (Weil & Martienssen 2008; Lisch 2009), gene duplication (Flagel & Wendel 2009), chromosomal rearrangement (Feschotte & Pritham 2007), allelic (Gaut et al. 2007) and ectopic (Ponting et al. 2011) recombination, genome size (Agren & Wright 2011) and structure (Parisod et al. 2010), chromatin organization (Lunyak & Atallah 2011), the generation of variation (Kidwell & Lisch 2001), including copy number (Conrad et al. 2010), structural (Xing et al. 2009), and somatic variation (Levin & Moran 2011), DNA repair (Cordaux & Batzer 2009), centromere (Lisch 2009) and telomere (Pardue & DeBaryshe 2011) maintenance, the evolution of sex (Hickey 1982; Rose & Oakley 2007), the evolution of a dedicated germ line (Johnson 2008), stress response (McClintock 1984; Lisch 2009), and speciation (Rebollo et al. 2010).

Yet, despite these contributions to evolution, it remains common to view TEs as molecular parasites (Kidwell & Lisch 2001; Rose & Oakley 2007; Malone & Hannon 2009; Obbard et al. 2009). This stems from an explanatory framework in which evolution is understood exclusively through adaptationist arguments (Gould & Lewontin 1979). Since TEs persist in the short term without contributing beneficial phenotypes, the argument goes, that is the “reason” they exist, and any beneficial effects that do happen to come about are therefore incidental (Werren 2011). But is mutation a cause or consequence of evolution? A narrow focus exclusively on immediate adaptation fails to satisfactorily explain the origin of many important traits; Darwin himself emphasized that gradual adaptation alone does not account for the incipient stages of many useful structures, but rather that changes in function, such as swim bladders becoming lungs when fish colonized land, are often at the root of innovation (Darwin 1876, Chapter VI, Modes of Transition).

Gould and Vrba (1982) proposed that adaptive features originally built by natural selection for one role, or even non-adaptive features, that have since been co-opted for a new role, be called exaptations. Instead of parasites, TEs might better be viewed as exaptation engines. TEs may be understood as the products of multiple levels of selection: selection on the TE to replicate, selection on the organism to minimize potentially harmful mutations resultant from TE replication, and group selection to maximize potentially beneficial genetic variation resultant from TE replication. Broadly speaking, many of the aforementioned contributions of TEs to organismal evolution may be considered exaptations, albeit nonspecific or indirect. Indeed, any feature evolved at one level of selection (e.g., TE self-replicative selection) that produces a beneficial effect at another level of selection (e.g., organismal phenotypic selection) is an exaptation (Gould & Lloyd 1999). More narrowly, the genetic constituents (i.e., sequences) of TEs, such as genes, binding sites, and terminal repeats, can be directly exapted for specific phenotypic functions in the organism. These specific, direct exaptations of TE sequences, which we refer to simply as TE exaptations, and especially protein-coding TE exaptations in plants, are the topic of this chapter.

We begin by describing the first exapted TE genes to be discovered, the Drosophila P neogenes, to provide not only an historical perspective, but more importantly an illustration of the evolutionary forces underlying TE exaptation. We explore these forces by developing a model of the process. We then turn to plants, describing in detail the first known and best characterized exapted plant TEs, the FHY3/FAR1/FRS family. We then review the other confirmed plant DTE families. We briefly examine other types of TE exaptation, such as chimerization, exonization, and noncoding exaptation. Finally, we discuss the significance of TE exaptation for regulatory evolution. For additional perspectives, we highly recommend, in addition to others cited throughout the chapter, the following excellent reviews: (Volff 2006; Dooner & Weil 2007; Feschotte & Pritham 2007; Feschotte 2008; Sinzelle et al. 2009; Hua-Van et al. 2011).

Discovery

Drosophila P elements were among the first TEs to be extensively characterized (Biémont 2010) and produced the first TE exaptations to be discovered. Like other Drosophila TEs (Sánchez-Gracia et al. 2005), P elements frequently move between species (Daniels et al. 1990), an evolutionary strategy called horizontal transfer that is common among DNA transposons (Diao et al. 2005; Schaack et al. 2010). P elements recently horizontally transferred from D. willistoni to D. melanogaster (Daniels et al. 1990; Pinsker et al. 2001). Crosses between D. melanogaster males carrying autonomous P elements and females lacking P elements lead to TE activation and decreased fitness. The resultant syndrome, hybrid dysgenesis, sexually isolates strains carrying P elements from naive strains lacking P elements, a genetic barrier that could contribute to speciation (Kidwell et al. 1977).

Drosophila P element families consist of a few autonomous elements and many nonautonomous elements. Autonomous P elements produce a complete 88 kDa protein that catalyzes transposition in the germ line. In somatic cells, differential splicing produces a truncated 66 kDa isoform that may repress transposition (Rio 1990). These “P repressors” retain the N-terminal DNA binding domain but lack the C-terminal domain required for TE excision. By binding to a specific subterminal P element motif, repressors may block the promoter, preventing transcription and interfering with normal P transposase activity (Kaufman et al. 1989). Some nonautonomous P elements encode similar truncated products that may act as repressors (Miller et al. 1997). Although P repressor proteins were initially thought to explain the somatic silencing of P elements, more recent work suggests that silencing may instead be mediated predominantly by small RNAs (Brennecke et al. 2007; Khurana et al. 2011).

Unlike D. melanogaster, which has dozens of autonomous P elements (Daniels et al. 1990), species in the D. obscura group have only a few P-like genes. Although these genes are similar to truncated repressors, they are not flanked by P terminal sequences, are not mobile, and are arranged in a complex locus that is orthologous in the genomes of D. guanche, D. subobscura, and D. madeirensis. These are not decayed TE fossils, but instead have three undisrupted exons and conserved intronic splicing signals, are highly similar between species, and are transcribed (Paricio et al. 1991; Miller et al. 1992). Thus, these P homologs are neither functional transposases, nor TE fossils, but are neogenes that must have been immobilized in a common ancestor of the three obscura species, exapted to perform a non-transposition function, and thereafter conserved by phenotypic selection (Miller et al. 1999). Miller et al. (1992) termed this process of TE gene exaptation molecular domestication; we refer to exapted TE genes as domesticated TEs (DTEs).

P elements were also domesticated independently in the D. montium subgroup. A single-copy DTE lacking P termini, but again with three uninterrupted exons similar to the obscura DTEs, is present at orthologous locations in at least nine montium species. Both the obscura and montium DTEs have, subsequent to their domestication, undergone lineage-specific gene rearrangements, including gene duplication, secondary P element insertion, and exon shuffling. Additional TE insertions provided cis-regulatory elements. In total, there were four independent P element immobilization events, and four different products are encoded by the obscura and montium DTEs (Nouaud & Anxolabehere 1997; Quesneville et al. 2005).

Despite their similarity to “P repressors,” the DTEs likely have functions other than repression (Miller et al. 1992). In the obscura group, there are no autonomous P elements, so the DTEs cannot act as repressors. While the montium group does contain at least one active P element subfamily, that family is highly diverged from the DTE, suggesting that repression may not occur. The montium DTE is incapable of repressing transcription or the transposition of canonical P elements. Instead, it is expressed in the brains and gonads of transgenic fly larvae and adults and binds to multiple genomic sites that are not similar to extant P elements, suggesting that domesticated P transposases likely serve some function other than P repression related to binding DNA (Reiss et al. 2005).

P elements were originally thought to be found only in flies (Diptera) but have now been identified in zebrafish (Hammer 2005). DNA binding domains homologous to those of P elements in flies and zebrafish, i.e., THAP domains (Roussigne et al. 2003; Sabogal et al. 2010), have been identified in around 100 genes, many of them transcription factors, with various functions, including apoptosis, angiogenesis, cell cycle regulation, neurological function, stem cell pluripotency, and epigenetic gene silencing (Clouaire et al. 2005). Thus, P elements may have been domesticated multiple times during animal evolution (Quesneville et al. 2005).

Frequent Birth Model

How does molecular domestication work? The answer is not as straightforward as it may seem. We first must understand the different levels of selection (Gould & Lloyd 1999) acting on TEs and ordinary genes. Ordinary genes (a.k.a. host genes or cellular genes) do not self-replicate and are subject to what Doolittle and Sapienza (1980) termed phenotypic selection. That is, they are maintained by selection on their phenotypic effects at the organism level. On the other hand, TEs are subject to what we will term self-replicative selection (which Doolittle and Sapienza termed non-phenotypic selection). That is, they are maintained by selection on their ability to self-replicate in germ cells at the genome level. TEs are also subject to phenotypic selection, in part because mutations that they produce may impair the fitness of their host and thereby also impair their own proliferation, but also because such mutations can sometimes be beneficial. In addition, TEs can produce molecules, such as proteins and small RNAs, that may affect cell function.

One type of interaction between the phenotypic and self-replicative levels of selection is genetic conflict, which results in the evolution of mechanisms to repress transposition (Werren 2011). For instance, alternative splicing of P transposase genes to produce isoforms that may in germ cells catalyze transposition but in somatic cells repress it would result from self-replicative and phenotypic selection, respectively. Similarly, germ tissue-specific regulatory elements, such as pollen-specific enhancers, have also evolved as a form of TE auto-repression (Raizada et al. 2001; Lisch & Jiang 2009). Molecular domestication, a process through which a mobile gene, evolving primarily under self-replicative selection, becomes an ordinary gene, evolving exclusively under phenotypic selection, is another result of interactions between phenotypic and self-replicative levels of selection. This similarity perhaps suggests a link between TE auto-repression and domestication, or at least between the evolutionary forces underlying the two: TEs exist in a balance between competing levels of selection that may occasionally tip either towards excessive self-replicative selection, such as in transposition bursts, or towards excessive phenotypic selection, such as in molecular domestication. We elaborate these ideas in the Frequent Birth model of molecular domestication, as follows:

1. Phenotype: Molecular domestication occurs when the balance of self-replicative and phenotypic selection, under which TEs normally evolve, tips towards phenotypic selection. To eventually become domesticated, a TE sequence must first produce a selectable phenotype. Although TEs can produce phenotypes by a variety of mechanisms, only those in which the specific TE sequence is required are candidates for molecular domestication. For instance, a phenotype produced by a TE that inserts into, and knocks out, a gene is not a potential source of molecular domestication because it is the disruption of the open reading frame that is the cause, not the TE sequence itself. Examples that are candidates for molecular domestication include: a transposase that binds a TE affecting the regulation of a nearby gene; a somatic splice product, such as a P repressor, that represses transposition and reduces somatic mutation; or an endogenous retroviral envelope protein that encourages cell fusion. These phenotypes may occur during normal TE function, so this first step in our model is simply the default state.

2. Benefit: To become domesticated, the phenotype produced by a TE must become beneficial to the organism. In general, we may expect that, because they are side-effects of processes evolved primarily for self-replication, TE-encoded phenotypes are neutral or deleterious. (Nevertheless, it is conceivable that phenotypic selection has enabled TEs to evolve functions other than auto-repression that benefit the organism, which are not yet known.) To become beneficial, a change may need to occur, such as a mutation to the TE or a change in the habitat of the organism. For instance, altering the regulation of genes may be beneficial to populations adapting to new conditions or habitats, but detrimental to populations already adapted to a stable environment. If a TE produces a phenotype that benefits the organism without reducing its own self-replicative fitness, it has a selective advantage at the phenotypic level. Auto-repression, such as by tissue-specific regulatory sequences or somatic repressor protein isoforms (e.g., P repressors), is a special case because it is inherently beneficial, yet concomitant with self-replication. Although it is phenotypically beneficial, it differs from other beneficial phenotypes in that it merely limits harm rather than supplying independent benefits. In addition, because they are still subject to self-replicative selection, active TEs cannot evolve auto-repressive functions that significantly reduce their own germinal replicative success.

3. Immobilization: TEs frequently sustain mutations that prevent transposition, such as truncations of their termini. Immobilization contributes to molecular domestication in several ways. It may elevate and stabilize TE expression, for instance by restricting epigenetic and post-transcriptional silencing, which relies on double-stranded RNAs generated from the intact terminal inverted repeats of DNA transposons. Also, the expression of an immobile sequence can be integrated with the regulatory system of the surrounding area, for instance with adjacent transcription factor binding sites. Immobilization also abolishes self-replicative selection, which may relieve selective constraints that limit the benefits of functions such as transposition repression.

4. Short-term persistence: Genomes contain numerous immobilized TEs. Like ordinary genes, immobilized TE-derived sequences are not subject to self-replicative selection, so to persist they must be selected phenotypically. The phenotypic effects of immobilized TEs may range from deleterious to beneficial, most often being weak and near-neutral, and most will not be sufficient to enable the element to persist in a population. Although the fraction of immobilized TEs that do provide phenotypic benefits, such as those described in Step 2, is probably small, we hypothesize that they may nevertheless occur frequently. In other words, DTEs may undergo frequent birth. Consider for example an immobilized TE that expresses a transposase, which binds to cognate TEs distributed throughout individual genomes in a polymorphic pattern. This forms a novel regulatory network in which both the identity of the genes involved and the strength of regulation vary depending on the location of TE insertions. Few such random networks are likely to be beneficial, and fewer still will have benefits that are strong and stable enough to permit them to persist by phenotypic selection, except in the short term.

5. Long-term persistence: We hypothesize that immobilized TEs have a range of phenotypic effects and that, among those that are beneficial, i.e., nascent DTEs, the majority are only weakly or transiently beneficial and therefore unable to persist beyond the short term. However, some fraction of DTEs evidently do provide a strong and stable enough benefit to generate enough phenotypic selection to become fixed in a population and persist for the medium or long term. Finally, we note that the distinction between short-term and long-term persistence, like that between weak and strong benefit, is an oversimplification. More likely is that there are gradients of both duration and strength of effect, with far more DTEs generated with weak effects and short persistence than with strong effects and long persistence. Since the DTEs that are currently known are all of relatively long persistence, we predict that many more DTEs remain to be discovered.
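The contrast in Steps 4 and 5 between weakly and strongly beneficial immobilized TEs can be made concrete with a small numerical sketch. This is an illustration only and is not part of the Frequent Birth model as stated above: it borrows Kimura's standard diffusion approximation for the fixation probability of a single new variant, and it assumes an idealized diploid population whose effective size equals its census size. The function name, the population size, and the selection coefficients are all hypothetical values chosen for the example.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Kimura's diffusion approximation for the chance that one new copy
    (initial frequency 1/(2n)) with selective advantage s eventually fixes
    in a diploid population of n individuals (assumes N_e = n)."""
    if abs(s) < 1e-12:
        return 1.0 / (2 * n)        # neutral limit of the formula
    exponent = -4.0 * n * s
    if exponent > 700.0:            # strongly deleterious: avoid float overflow
        return 0.0
    # (1 - e^{-2s}) / (1 - e^{-4ns}), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


# Hypothetical values, chosen only for illustration.
POPULATION_SIZE = 1000
EFFECTS = {
    "weakly deleterious": -0.001,
    "neutral": 0.0,
    "weakly beneficial": 0.001,
    "strongly beneficial": 0.01,
}

for label, s in EFFECTS.items():
    p = fixation_probability(s, POPULATION_SIZE)
    print(f"{label:>20}: s = {s:+.3f}, P(fix) = {p:.6f}")
```

Under these assumptions, even a beneficial immobilized TE is far more likely to be lost than fixed, and the chance of fixation rises roughly in proportion to the strength of the benefit. This is in line with the expectation that many nascent DTEs arise but few persist.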

 22 The Frequent Birth model is attractive because it seems probable from a molecular viewpoint. Only a few relatively high probability mutations are required. A simple, high probability change of molecular function, such as the loss of endonucleolytic capacity, can result in a dramatic change in overall function, from catalyzer of transposition to repressor. Production of a repressor protein is not unique to Drosophila P transposases, but common to most DNA transposons, which typically have N-terminal DNA- binding domains (Rice & Baker 2001) that can remain after other C- terminal domains are truncated. For instance, deletion derivatives and the tnpA transcript of maize CACTA (En/Spm) elements also encode repressors (Fedoroff 1999). Exapted retroelement genes may also serve as repressors, such as the mouse Fv1 gene (Volff 2006). Furthermore, DTEs derived from one type of TE may regulate different types of TE. For instance, CENP-B homologs in fission yeast are derived from DNA transposons but, in addition to roles in centromere function, repress retrotransposition (Cam et al. 2008). Also, other types of self-repression are also possible, and may portend alternate domestication events, not necessarily of coding sequence, such as the Mu killer locus in maize (Lisch & Jiang 2009). Successful domestication presumably requires the nascent DTE to come under stable expression regulation. Because genomes have various TE silencing mechanisms, this step may be improbable, but may be made easier by primary exaptation and immobilization. Certainly there may be cases in which no primary exaptation of a repressive function is needed to achieve long-term domestication, e.g., SETMAR (see below). We suggest only that the Frequent Birth model may be one relatively easy path. Verification of the model would require the identification of a larger number of DTEs of various ages, especially young DTEs, assuming they do exist and can be discriminated from their cognate TEs. 
If the model is correct, we should expect to find a relatively large number of young repressors and a few old DTEs with other evolutionarily stable functions.

It is sometimes asked whether molecular domestication even requires immobilization, or whether a gene encoded by an active TE might perform a beneficial cellular function and be conserved (Hoen et al. 2006; Smith et al. 2011). More precisely, can there be TE coding sequence exaptation without molecular domestication? The answer seems to hinge largely on the extent to which TEs directly experience phenotypic selection. Consider germ cell-specific expression, e.g., due to somatic self-repression in P elements or tissue-specific enhancers found in many TEs. Germ cell-specific expression may be due to direct phenotypic selection against genomes with TEs that are active in soma, or to self-replicative selection to counteract cellular TE silencing mechanisms, or both. If direct phenotypic selection does indeed play a significant role in producing germ cell-specific transposition, might it not also be able to maintain other cellular functions, unrelated to transposition, within active TEs? While no such cases have yet been reported, it is not clear whether this is because it is impossible, improbable, difficult to detect, or simply unexamined.

Plant DTEs

FAR1/FHY3/FRS

Plants process sensory inputs by modulating networks affecting development and growth. The perception of light is of particular importance. To determine their position, the time of day, and the season, plants measure the intensity, direction, duration, periodicity, and spectrum of incident light using light-sensitive proteins called photoreceptors (Jiao et al. 2007). Of three major classes of photoreceptors in plants, the phytochromes are the best characterized, consisting in Arabidopsis thaliana of a monophyletic family of five genes, PHYA-PHYE (PHYTOCHROME A-E), that has undergone diversification and subfunctionalization throughout angiosperm evolution (Mathews 2006). Phytochromes predominantly absorb red and far-red light to effect various responses including germination, development, dormancy, shade avoidance, and flowering. PHYA is highly conserved among angiosperms and some PHYA mutations are conditionally lethal. PHYA is the primary photoreceptor for both very low fluence irradiance and far-red high irradiance responses and triggers development in a wide range of light conditions (Mathews 2006). In darkness, phyA protein is concentrated in the cytosol. Absorption of red light by phyA triggers structural changes, exposing interaction surfaces to which FHY1 (FAR-RED ELONGATED HYPOCOTYL 1) and FHL (FHY1-LIKE) bind. FHY1/FHL possess nuclear localization signals, causing phyA, which does not have a nuclear localization signal, to be translocated into the nucleus, where it mediates light responses by triggering transcriptional cascades (Bae & Choi 2008; Chen & Chory 2011).

Screens for A. thaliana mutants that undergo normal de-etiolation under white light, but have impaired inhibition of hypocotyl growth under far-red light, identified fhy3 (far-red elongated hypocotyl 3) (Whitelam et al. 1993), which specifically disrupts only the high irradiance phyA response (Yanovsky et al. 2000), and far1 (far-red-impaired response 1) (Hudson et al. 1999). The FAR1 gene sequence is similar to Mutator transposase genes, yet lacks TE termini, making it the first DTE to be recognized in plants (Lisch et al. 2001). FAR1 and FHY3 are paralogs that probably arose early in eudicot evolution. Each is located at orthologous chromosomal positions in A. thaliana, B. rapa, and P. trichocarpa (Lin et al. 2007). Although both genes affect hypocotyl elongation, fhy3 has a more pleiotropic phenotype, for example also affecting cotyledon opening. Double mutants have longer hypocotyls and greater reductions in cotyledon expansion than single mutants, suggesting that FHY3 and FAR1 act in an additive manner. Overexpression of the FHY3 C-terminal fragment, which contains a SWIM zinc finger domain, completely blocks phyA signaling. The FHY3 protein can substitute for FAR1, as can the promoters, to completely suppress the far1 mutant phenotype; conversely, FAR1 can only partially restore fhy3 (Wang & Deng 2002; Lin et al. 2008b). Thus, FHY3 and FAR1, after duplicating early in eudicot diversification, have undergone subfunctionalization to maintain partially overlapping functions, with FHY3 acting more widely.

FHY3 and FAR1 directly activate FHY1 and FHL expression, which accumulate in dark and dissipate in light, and are required for the high irradiance phyA response. FHY3 and FAR1 activate expression by homo- and heterodimerizing and binding to a specific cis-regulatory sequence, called the FHY3/FAR1 binding site (FBS), found in the promoters of both FHY1 and FHL. DNA binding is mediated by an N-terminal C2H2 zinc finger domain. Two other domains, the central MULE domain and C-terminal SWIM domain, are required for dimerization and to activate expression (Lin et al. 2007; Lin et al. 2008b). In addition to de-etiolation, FHY3 and FAR1 have recently been found to play additional roles.
During dark to red-light transitions, phyA signaling is rapidly desensitized by multiple transcriptional and post-translational feedback mechanisms, three of which directly involve FHY3/FAR1. First, under far-red light, FHY3 binds to underphosphorylated phyA, protecting against COP1/SPA-mediated proteolysis (Saijo et al. 2008). Second, FHY3/FAR1 expression is repressed by phyA signaling in light, which in turn reduces FHY1/FHL expression (Lin et al. 2007). Third, phyA signaling in light activates the transcription of another photomorphogenic transcription factor, HY5 (long hypocotyl 5), which blocks FHY3/FAR1 activity by binding adjacent to the FHY1/FHL FBS, sterically hindering FHY3/FAR1 binding. HY5 also interacts directly with the DNA-binding domain of FHY3/FAR1, but it is not clear whether this interaction is important in repressing FHY1/FHL expression, or if it plays a role in other FHY3/FAR1 and HY5 co-regulatory activities (Li et al. 2010).

FHY3 also plays multiple roles in circadian clock regulation. The FBS is present in the promoters of more than 200 genes that exhibit diurnal or circadian cycling, including PHYB, CCA1 (CIRCADIAN CLOCK-ASSOCIATED 1), and ELF4 (EARLY FLOWERING 4) (Lin et al. 2007). FHY3 specifically gates red light signaling for circadian clock resetting, playing a crucial role in the maintenance of clock rhythmicity, especially at dawn (Allen et al. 2006). By binding to the ELF4 promoter and directly interacting with at least three additional transcription factors, including HY5, FHY3 regulates both the rhythmicity and amplitude of the circadian clock central oscillator (Li et al. 2011a). Finally, FHY3 also stimulates chloroplast division (Ouyang et al. 2011).

More functions of FHY3/FAR1 likely remain to be uncovered. Microarray studies revealed that in fhy3 mutants especially, but also in far1 mutants, the majority of known light-regulated genes, including transcription factor genes and genes involved in cell elongation, have reduced responsiveness to continuous far-red light (Hudson et al. 2003). Furthermore, the FBS is found in hundreds of promoters (Lin et al. 2007). A recent chromatin immunoprecipitation-based sequencing (ChIP-seq) study showed that FHY3 binds to thousands of binding sites associated with over 1700 genes (within 1000 base pairs upstream to the 3-prime untranslated region) (Ouyang et al. 2011). Nearly 800 genes are bound only in darkness, while over 200 are bound only in light, and nearly 800 in both conditions. The majority of genic binding sites are in promoters, with density peaking at the transcription start site. About half of the genic binding sites are FBS, but it remains unknown where exactly FHY3 binds at the remaining sites. Several other types of transcription factor binding site are significantly enriched near the FBS, including 283 genes that have HY5 binding sites in close proximity, and 136 genes that are co-regulated with PIL5 (PHYTOCHROME INTERACTING FACTOR 3-LIKE 5) (Ouyang et al. 2011). In addition, about 40% of FHY3 binding sites are intergenic. Unexpectedly, the majority of intergenic binding sites are located in a centromeric motif, a pattern unlike that of any other known plant transcription factor (Ouyang et al. 2011). The functional significance of centromeric binding is not yet known, but we note that it is reminiscent of mammalian CENP-B (CENTROMERE PROTEIN B), a DTE and component of the centromere/kinetochore, which binds to centromeric satellites and regulates centromere formation (Casola et al. 2008; Zaratiegui et al. 2011).

Microarray analyses showed that about one-eighth of the genes with FHY3 binding sites are differentially regulated in dark (197 genes) or far-red light (86 genes) conditions, which is roughly half of all genes differentially regulated in these conditions. Most of these “directly regulated” genes are activated rather than repressed by FHY3, especially in light, where all but a single gene are activated. In dark, FHY3 directly represses the expression of 43 genes, 42 of which are released from repression on light exposure. Among the directly regulated genes, several functional categories are highly enriched, including transcriptional regulation, signal transduction, intracellular signaling, environmental response, hormone response, and development (Ouyang et al. 2011). These results suggest that FHY3 may have roles in diverse regulatory networks, or networks of other types, as yet mostly uncharacterized.

Furthermore, FHY3 and FAR1 are but two close paralogs in a gene family that includes twelve additional FRSs (FAR1-RELATED SEQUENCES) in A. thaliana (Hudson et al. 2003; Lin & Wang 2004). The FHY3/FAR1/FRS family (hereafter, the FHY3 family) consists of five widely diverged phylogenetic clades, each containing at least two A. thaliana genes (Lin et al. 2007), and each except one with members in both eudicots and monocots. The sole eudicot-specific clade contains both FHY3 and FAR1. Phylogenetic analysis of the FHY3/FAR1 clade suggests that FHY3 and FAR1 arose by a duplication early in eudicot evolution, well before the divergence of the asterids. Furthermore, FAR1 and FHY3 loci are arranged in tandem in the Populus genome, possibly the ancestral configuration, suggesting they arose by tandem duplication. Other duplications are also evident. In the FRS3 clade, FRS5 and FRS9 are arranged in tandem in A. thaliana, suggesting another tandem duplication. In the FRS7 clade, the FRS7 and FRS12 sequences are very similar, suggesting a recent duplication. All twelve A. thaliana FRSs are ubiquitously expressed in all major organs, except FRS10, which appears to have a highly unstable transcript. FRS proteins all localize to the nucleus, consistent with transcription factor activity, although FRS1 maintains residual cytosolic distribution.

So far, only two clades in addition to the FHY3/FAR1 clade have been characterized. Unlike fhy3/far1, frs6 and frs8 (FRS6 clade) do not affect hypocotyl elongation, but mutant plants flower early, especially under short-day conditions, suggesting they are positive regulators of phyB-mediated inhibition of floral initiation (Lin & Wang 2004).
Conversely, fhy3/far1 mutants also flower early but have a greater effect under long-day conditions. FRS9 (FRS3 clade) RNAi knockdowns exhibit a hypersensitive response specifically to continuous red light, suggesting FRS9 is a negative regulator of phyB-mediated de-etiolation (Lin & Wang 2004). Thus, in addition to the well characterized functions of FHY3/FAR1, the other FHY3 family members are likely to play even more diverse roles, at least some of which may involve phytochrome-mediated light responses.

The phylogenetic pattern, chromosomal arrangement, and functions of the FHY3 gene family show that it underwent successive duplication and subfunctionalization throughout angiosperm evolution (Lin et al. 2007). Intriguingly, it may also have been initially founded in not one but multiple molecular domestication events. The FHY3 family phylogram published by Lin et al. (2007) includes two internal branches belonging to extant, active TEs (LOM-1 and Jittery). All five major clades in the phylogram, including the TE branches, have greater than 90% bootstrap support. If this phylogram is correct, then the FHY3 family must have originated by at least three independent molecular domestication events. If so, and if the descendants of these multiple domestication events are involved in similar functions, as they appear to be (i.e., phytochrome-mediated responses), it raises important questions about the nature of molecular domestication. What might each independent domestication event have had in common, linking them to phytochrome responses?

One possibility is pleiotropy. While members of at least three FHY3 family clades are known to be involved in light responses, FHY3/FAR1 may have additional uncharacterized functions, so the different domestication events may not all be specifically linked to light responses. Another possibility is receptiveness. Over the period in which the domestications occurred, the phytochrome system may have been evolving rapidly, making it especially receptive to the addition of novel transcription factors.
Yet it seems implausible that other systems would not also have been evolving quickly during the same period.

A third possibility is that the ancestral Mutator elements themselves, while still active, may have somehow been tied to phytochrome responses. This might not seem surprising, as various TEs are known to respond to specific environmental cues, e.g., stress responses (McClintock 1984; Zeh et al. 2009). Yet if true it might mean that the active TEs would themselves have been able to affect light response phenotypes. Or perhaps, after the first domestication event established a DTE with a phytochrome-related function, the nascent DTE might have continued to interact with cognate TE transposases and binding sites, and vice versa. Could a situation have existed in which DTEs and TEs were co-evolving, gradually adding new DTE binding sites as well as new DTEs? The continued ability of a DTE to bind to sites in TEs would permit it to expand and modulate its functions rather than relying solely on a pre-existing distribution of binding sites present at the time of domestication. If such a mode of interaction between phenotypic and self-replicative selection could indeed be maintained, it might accelerate the evolution of the new function, benefitting the organism. Indeed, this may be a plausible model of transcription factor network domestication even in cases involving only a single transposase domestication event.

DAYSLEEPER

DAYSLEEPER was isolated in a yeast one-hybrid screen for proteins binding upstream of a DNA repair gene (Ku70), where it binds to multimers of a motif (Kubox1) also found upstream of other genes (Bundock & Hooykaas 2005). DAYSLEEPER has the same conserved domain architecture as hAT transposase, including the hAT dimerization domain, but lacks residues required for transposition, and the DAYSLEEPER locus is not flanked by TE termini. Unlike most DTEs, DAYSLEEPER is located in a pericentromeric region dominated by TEs which, unlike DAYSLEEPER, are heavily targeted by small RNAs and DNA methylation and are not expressed (www.arabidopsis.org).  

Homozygous daysleeper knockout mutants have severe developmental defects, which can be rescued by molecular complementation. Overexpressing DAYSLEEPER for prolonged periods causes slow growth, delayed flowering, altered cauline leaves, fasciation, partial or total sterility, and altered flower morphology. Overexpression induced for 24 hours in wild-type seedlings leads to strong changes in the transcript abundance of dozens of genes, many of which are upregulated by more than an order of magnitude. However, none of the genes with significantly altered expression has a Kubox1 DNA motif, nor were any of the genes identified as having a Kubox1 motif, including Ku70, found to have significantly altered expression. Thus, while DAYSLEEPER does bind DNA, it is not clear whether the observed regulatory effects are direct or indirect (Bundock & Hooykaas 2005). Although these results suggest that DAYSLEEPER may be a transcription factor with multiple roles, unfortunately no additional reports have been published since Bundock & Hooykaas (2005). The genes with significantly altered expression when DAYSLEEPER is overexpressed are involved in a range of processes, especially response to stimulus, pathogen defense, signaling, metabolism, and development. EST library searches show that DAYSLEEPER homologs are found in widely diverged eudicots, both rosids and asterids, but not in monocots (D. Hoen, unpublished results).

MUSTANG

To date, most DTEs have been identified through forward genetics. For instance, FAR1 was identified in a phenotypic screen for far-red light mutants (Lin et al. 2007) and DAYSLEEPER in a yeast one-hybrid screen for proteins binding upstream of a DNA repair gene (Bundock & Hooykaas 2005). However, phenotypic screens may be ill-suited to the identification of gene families in which close homologs can compensate for a single knockout, as may be the case for families of DTEs undergoing lineage-specific expansion and subfunctionalization. The problem may be exacerbated in plants, which tend to have large gene families. Furthermore, a general lack of awareness of the existence of molecular domestication may cause some researchers who do observe novel DTEs in forward genetic screens to dismiss them under the false assumption that they are TEs. Because of these limitations, we do not know what ascertainment biases may exist in the set of known DTEs, nor how many DTEs remain to be identified, a potentially large number considering that transposases are the single most abundant and ubiquitous genes in nature (Aziz et al. 2010). To address this problem, we need systematic screens of genomic data to identify DTE candidates, followed by reverse genetic characterization to determine whether they are bona fide DTEs.

One of the few bioinformatic searches conducted thus far was designed to identify plant Mutator-like DTEs that originated prior to the divergence of monocots and eudicots (Cowan et al. 2005). Mutator-like genes that lack TE termini were identified through comprehensive searches of rice and Arabidopsis genome sequences. Phylogenetic analysis revealed that while most Mutator-like genes cluster into clades found only in rice or Arabidopsis, indicative of lineage-specific transposases, two families are different, having close orthologs in both rice and Arabidopsis that are not associated with TE termini and that are expressed. One of these was the previously identified FHY3 family (Hudson et al. 2003), a validation that the method could succeed in finding DTEs. The second was a novel family of DTEs, which was named MUSTANG (MUG). MUG1 has syntenic orthologs in rice, Arabidopsis, Medicago, and poplar. Synonymous substitution rate analysis suggests that MUG1 has been subject to strong purifying selection at the protein level.
Like the majority of characterized DTEs derived from DNA transposons (Sinzelle et al. 2009), including the FHY3 family, MUG has maintained its DNA-binding domain, suggesting it may be a transcription factor. In fact, all of the ancestral Mutator conserved domains are present in MUG. In addition, some MUG genes contain a protein-interaction domain, PB1 (Phox and Bem1), not normally found in transposases. Expressed MUG homologs have also been identified in sugarcane (Saccaro et al. 2007).

Investigations subsequent to Cowan et al. (2005) have revealed additional details. The MUG family consists of two major clades, each with members in all examined angiosperms, including basal angiosperms, but not in gymnosperms or other plants. The clades have undergone different patterns of diversification in monocots and eudicots, and it is not yet clear whether they originated in a single or multiple molecular domestication events. Microarray experiments show that in mug mutants the expression levels of hundreds of additional genes are significantly altered, similar to the pattern found in fhy3 (Ouyang et al. 2011), and consistent with the hypothesis that MUGs may function as transcription factors. Different Arabidopsis mug mutants have different pleiotropic phenotypes, including increased freezing tolerance, delayed development, delayed flowering time, aberrant flower morphology, and reduced seed set. This phylogenetic distribution of MUG, along with the flower-related phenotypes, suggests that it was domesticated early in angiosperm evolution and may have coincided with the origin of flowers (Z. Joly-Lopez, E. Forczek, D. Hoen, and T. Bureau, unpublished data). These results were corroborated by a genome-wide survey of the transcription of TE-like sequences in rice, which found only three putative Mutator-like families that are highly transcribed, two of which are the FHY3 and MUG families (Jiao & Deng 2007). (The third does not in fact appear to be related to Mutator, as it does not contain MuDR or MULE conserved domains; thus, it may be spurious (D. Hoen, unpublished data).)
The successful bioinformatic identification and reverse-genetic characterization of MUG highlights the importance of performing systematic screens to identify DTEs. Although Cowan et al. (2005) identified only two Mutator-like DTE families with orthologs in both monocots and eudicots, more may remain undiscovered. For example, Benjak et al. (2008) identified single-copy grapevine genes that are derived from DNA transposons but lack TE termini, finding 2 DAYSLEEPER homologs, 8 MUG homologs, and 5 FHY3 homologs, as well as one hAT-like and one Mutator-like gene that may be novel DTEs. We ourselves have also undertaken subsequent searches, the results of which suggest that plant genomes may contain many additional unreported DTEs (D. Hoen, unpublished data). Once a sufficiently large and unbiased sample of DTEs has been identified, we may more confidently characterize the nature of molecular domestication and its effect on evolution.
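As a purely illustrative aid, the screening criteria described in this section (absence of TE termini, evidence of expression, and presence in both of the lineages compared) can be sketched in Python. This is a minimal sketch under stated assumptions, not the method of Cowan et al. (2005): the `Candidate` record, its fields, the `dte_candidate_families` function, and the toy loci are all hypothetical, and family membership is used here as a crude stand-in for the homology searches and phylogenetic analysis on which a real screen depends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One transposase-like locus. All fields are hypothetical annotations."""
    name: str
    family: str
    species: str
    has_te_termini: bool  # True if the locus is flanked by TE termini
    is_expressed: bool    # True if there is transcript evidence


def dte_candidate_families(loci, required_species):
    """Return the sorted families that have, in every required species,
    at least one expressed locus that is not flanked by TE termini."""
    species_by_family = {}
    for locus in loci:
        # Loci with TE termini look like intact TEs; unexpressed loci
        # give no evidence of function. Both are set aside.
        if locus.has_te_termini or not locus.is_expressed:
            continue
        species_by_family.setdefault(locus.family, set()).add(locus.species)
    required = set(required_species)
    return sorted(
        family
        for family, species in species_by_family.items()
        if required <= species
    )


# Toy data only; these loci are invented for illustration.
TOY_LOCI = [
    Candidate("a1", "famA", "rice", False, True),
    Candidate("a2", "famA", "arabidopsis", False, True),
    Candidate("b1", "famB", "rice", False, True),
    Candidate("b2", "famB", "rice", False, True),
    Candidate("c1", "famC", "rice", True, True),
    Candidate("c2", "famC", "arabidopsis", False, True),
    Candidate("d1", "famD", "rice", False, False),
    Candidate("d2", "famD", "arabidopsis", False, True),
]

print(dte_candidate_families(TOY_LOCI, ["rice", "arabidopsis"]))
```

In this toy example only the hypothetical family famA passes, since famB is restricted to one species, famC retains TE termini in rice, and famD lacks expression evidence in rice. Any family passing such a filter would be no more than a candidate, still requiring the reverse genetic characterization discussed above.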

GARY

In addition to MUG, one other putative plant DTE was detected bioinformatically. Muehlbauer et al. (2006) identified a hAT-like EST in barley corresponding to a gene, GARY, with only one or two copies in barley, two syntenic copies in rice, and close EST homologs in several other cereal grasses. GARY is not flanked by TE termini, key residues that would be required for mobility are missing, and no transposition was observed in experimental conditions conducive to it (Muehlbauer et al. 2006). The function of GARY is not known, but given its phylogenetic distribution and its expression in wheat and barley spikes, it may have a grass-specific reproductive function. Searches of sequenced plant genomes confirm that GARY is found only in grasses, not eudicots (D. Hoen, unpublished data). The narrow phylogenetic distributions of GARY and DAYSLEEPER suggest that molecular domestication is an ongoing process in both monocot and eudicot plants.

Additional Examples

In total, roughly 100 families of eukaryotic DTEs have been identified so far (Volff 2006; Feschotte & Pritham 2007; Sinzelle et al. 2009). Even though retrotransposons greatly outnumber DNA transposons, the majority of known DTEs are derived from DNA transposons, on which we have focussed, being the only well characterized cases in plants. The disproportionate number of known DTEs derived from DNA transposons may be the result of ascertainment biases, as few systematic searches for DTEs have yet been conducted, or it may be due to intrinsic differences between DNA transposons and retrotransposons. For instance, the molecular functions of DNA transposons may be more easily domesticated. Over half of known DTEs putatively function as transcription factors, suggesting that this is a relatively easy evolutionary transition. Other DTEs have diverse functions, such as transposition repression, translation regulation, nuclear import, mRNA splicing, chromatin regulation, DNA maintenance and repair, telomere maintenance, centromere formation, chromosome segregation, and recombination (Feschotte & Pritham 2007; Sinzelle et al. 2009). Many DTEs regulate development and thus have major phenotypic effects (e.g., Bundock & Hooykaas 2005).

Furthermore, DTEs have often played vital roles in major evolutionary innovations, which should not be surprising, as exaptation is inherently innovative. For example, the vertebrate adaptive immune system includes both domesticated proteins (recombination activating genes) and binding sites (recombination signal sequences), and is essentially a domesticated, specifically regulated transposition system. Another example is the mammalian placenta, which evolved by multiple exaptations of both proteins (e.g., the Syncytins, ERV-3, Peg10, and Rtl1/Peg11) and regulatory elements (Edwards et al. 2008; Rawn & Cross 2008).
Other systems in which TE exaptations play important roles include programmed genome rearrangements in protozoans (Baudry et al. 2009; Cheng et al. 2010), mating-type switching in yeast (Rusche & Rine 2010), and neural development in mammals and other vertebrates (Cao et al. 2006; Santangelo et al. 2007; Okada et al. 2010; Franchini et al. 2011; Beck et al. 2011). TEs and DTEs have also been repeatedly recruited for functions in centromeres (Casola et al. 2008) and telomeres (Levin & Moran 2011; Pardue & DeBaryshe 2011). Indeed, ancient TE exaptations may even have been responsible for the very origin of telomeres in early eukaryotes, an innovation that may have been vital to enable the evolution of meiosis and sexual reproduction (which may also have been driven by TEs), prerequisites for complex life (Nosek et al. 2006).

Other Types of Protein-Coding Exaptation

Thus far, we have focussed on cases of full molecular domestication, i.e., TE exaptations that form whole new genes, because they are illustrative, interesting, and the best characterized in plants. However, additional types of TE coding sequence exaptation also occur. In one type, certain DNA transposons acquire, by unknown mechanisms, duplicated fragments of ordinary genes, called transduplications. It is not yet clear whether transduplications have phenotypic coding functions, but they may generate regulatory small RNAs (Le et al. 2000; Yu et al. 2000; Turcotte et al. 2001; Jiang, Bao, et al. 2004a; Hoen et al. 2005; Lai et al. 2005; Hanada et al. 2009; Jiang et al. 2011). Furthermore, in some cases it appears that transduplicated genes serve self-replicative functions, i.e., they may have been exapted in the opposite direction to molecular domestication, an ordinary gene converted to a TE gene (Hoen et al. 2006; Sela et al. 2008). It is possible that these self-replicative transduplicates might sometimes be re-exapted to again encode ordinary proteins, forming a cycle of exaptations, but no such cases are known.

Another potential path to coding sequence exaptation is through exonization. A TE that inserts into or near a gene can be incorporated as a novel cassette exon. Exonization is especially prevalent in animals, where alternative splicing is common (Sorek 2007; Cordaux & Batzer 2009; Schmitz & Brosius 2011). For instance, thousands of ordinary human gene transcripts may contain TE-derived sequences (Nekrutenko & Li 2001; Britten 2006; Sela et al. 2007). However, most exonized TEs are expressed only as rare splice variants and are probably not translated into functional peptides (Gotea & Makalowski 2006; Lin et al. 2008a). Indeed, the vast majority are derived not from TE coding sequences, but from nonautonomous elements. For instance, roughly half of exonized human TEs originated from Alu elements, the most abundant human TE at over one million copies, accounting for more than 10% of the genome (Nekrutenko & Li 2001; Sela et al. 2007). Alu elements are themselves derived from retrotransposed 7SL RNAs and thus have no inherent coding capacity, but do contain multiple splice signals, which perhaps make them amenable to exonization (Makałowski et al. 1994). It is not yet clear whether or how exonized Alu elements may contribute to phenotypic evolution (Cordaux & Batzer 2009).

Exonization also occurs in plants, but is not as prevalent or well characterized (Barbazuk et al. 2008). In A. thaliana, more than two thousand loci have transcribed exons derived from TEs, but it is not clear how many of these are ordinary genes and how many are transduplications, nor is it known what fraction, if any, are translated into functional proteins (Lockton & Gaut 2009). MITEs, high copy-number TEs prominent in plants but also found in other eukaryotes that frequently insert near genes (see below), may be good candidates for exonization (Marino-Ramirez et al. 2005). However, like Alu elements, MITEs usually have no inherent coding capacity and at least certain families very rarely insert into coding exons or become exonized (Oki et al. 2008; Naito et al. 2009).

Exonization may seem to be an easier path to exaptation than full molecular domestication, given that existing genes are already stably regulated. Conversely, adding new sequence to an existing gene, and thus changing its function, would likely be deleterious, unless perhaps it was a recently duplicated gene, or unless the novel exon were included in only one isoform, leaving the original gene function intact. However, this scenario would also require a concurrent change in regulatory control.

Although exonized TEs appear to usually be nonfunctional, some functional chimeras are known. For instance, the primate gene SETMAR (Robertson & Zumpano 1997; Cordaux et al. 2006), a.k.a. Metnase (Lee et al. 2005), arose when a Tc1-Mariner element inserted downstream of a SET gene. Tc1-Mariner elements are present in all eukaryotic kingdoms and undergo frequent horizontal transfer (Lohe et al. 1995). In humans, Tc1-Mariner is now extinct, like other DNA transposons, but 60 to 80 million years ago, during the primate radiation, Tc1-Mariner underwent a burst of transposition (Pace & Feschotte 2007). SETMAR, which formed shortly thereafter (Shaheen et al. 2010), is conserved in humans, apes, and monkeys (Lee et al. 2005) and, unlike other Tc1-Mariner fossils in the genome, is evolving slowly with uninterrupted, expressed open reading frames (Robertson & Zumpano 1997). SETMAR contains both a transposase domain, evolving under strong purifying selection, and a SET histone methyltransferase domain (Shaheen et al. 2010). It has maintained its ancestral DNA-binding specificity for a 19 base-pair ancestral Tc1-Mariner terminal sequence still present in thousands of copies in the human genome, suggesting that it may be involved in a large network (Cordaux et al. 2006). The DDE motif, normally required for transposition, is absent and SETMAR does not efficiently catalyze transposition. Nor does it mediate chromosomal translocation, as do functional transposases such as hAT in maize (Zhang et al. 2009); instead, it represses translocation (Shaheen et al. 2010).

Yet SETMAR does maintain certain transposase functions, including limited DNA endonuclease activity. One activity of SETMAR is histone methylation, and thus it potentially functions in transcriptional regulation, DNA repair, DNA replication, and imprinting. It also interacts with Pso4, a protein involved both in DNA double-stranded break repair and RNA splicing. It plays a role in double-stranded break repair by non-homologous end joining, and possibly in other DNA repair processes, and in restarting stalled replication forks (Shaheen et al. 2010). Replication fork restart is also a function of another DTE, a CENP-B homolog in fission yeast (Zaratiegui et al. 2011).

SETMAR was formed through a unique series of mutations. After inserting downstream of the ancestral SET gene, a Tc1-Mariner was immobilized by an Alu insertion. The original stop codon of the SET gene was deleted, resulting in the exonization of a segment of 3-prime UTR and creation of a novel intron, which fused the SET gene to the transposase. Curiously, the intron acceptor site is encoded by the ancestral Tc1-Mariner element itself, just three base-pairs upstream of the transposase start codon; furthermore, the ancestral element also encoded two putative branch sites (Cordaux et al. 2006). These features perhaps suggest that ancestral Tc1-Mariners may, like Alu elements, have evolved a capacity to exonize, and thus that TEs themselves can benefit from exonization. This might suggest an evolutionary feedback between self-replicative and phenotypic selection to increase the rate of exonization, which in turn may increase the rate of TE exaptation. In any case, SETMAR demonstrates that transposase functions can be exapted not only through the formation of whole new genes, but also by chimerization with existing ordinary genes. Although only a small fraction of exonized TEs may become exapted, over the long course of evolution they may have made a large contribution to the genome, especially in functions such as DNA binding, protein-protein interaction, and recombination.

Regulatory Exaptation, Coding and Noncoding

Complex multicellular life has evolved primarily not through changes to structural genes but by changes to how genes are regulated (King & Wilson 1975; Carroll 2005; Wray 2007). Transcription factors are a major component of gene regulation. While most genes are similar between different species, some transcription factor families evolve quickly (Nowick & Stubbs 2010). TEs contribute to regulatory evolution in various ways, including by creating new families of lineage-specific transcription factors through molecular domestication (Britten & Davidson 1971; Bourque et al. 2008; Feschotte 2008; Kunarso et al. 2010). DNA transposons in particular have innate characteristics that predispose them to exaptation as de novo regulatory networks, i.e., transcription factors and binding sites (Feschotte 2008). Transposases have DNA binding domains that recognize motifs within TEs (Haren et al. 1999; Lisch & Jiang 2009). Some TEs preferentially insert upstream of genes, an ideal location for elements that regulate transcription (Bureau & Wessler 1992; Naito et al. 2009; Lisch & Jiang 2009). When the transposase is expressed, it binds to the TEs and can modify the expression of flanking genes. If the TEs happen to be distributed in a way that the coordinated expression change is phenotypically beneficial, then the transposase and the binding sites are favored by phenotypic selection and can be domesticated. Of eukaryotic DTEs derived from DNA transposons with putative or known functions, about half are transcription factors (Feschotte & Pritham 2007; Sinzelle et al. 2009). As we have seen, these probably include all the plant DTE families: FHY3 (Lin et al. 2007), DAYSLEEPER (Bundock & Hooykaas 2005), and MUG (Z. Joly-Lopez, E. Forczek, D. Hoen, and T. Bureau, unpublished data).  

Indeed, molecular domestication may have been the ancient foundation of several transcription factor families. We have already discussed THAP genes, many of which are transcription factors, which may have originated by P element domestication. Additional examples include the AP2/ERF and WRKY superfamilies of plant-specific transcription factors, which may also have arisen through the ancient domestication of DNA-binding domains (Magnani et al. 2004; Babu et al. 2006). However, as may be expected for ancient events, these gene families have limited similarity to extant TEs, making it difficult to determine exactly how they arose. There may have been multiple domestication events, or single domestication events followed by exon shuffling (not mediated by TEs), or it is even possible, though unlikely, that the TEs co-opted these domains through transduplication, rather than the transcription factors arising by domestication.

TE exaptation also contributes to regulatory evolution in other ways. TEs contain binding sites not only for transposases, which are needed for transposition, but also for ordinary transcription factors, as well as promoters, which are needed for transposase expression. These binding sites and promoters can be exapted for phenotypic function, independently of transposase domestication, singly or in networks. Indeed, this type of exaptation appears to be far more prevalent than coding sequence domestication. In general, transcription factor binding sites and other functional noncoding sequences evolve rapidly and are frequently restricted to single species or narrow phyletic ranges (Ponting et al. 2011). While some may originate de novo (Eichenlaub & Ettwiller 2011), TEs are appropriate sources of lineage-specific regulatory elements, since TEs themselves evolve rapidly, both in sequence and genomic distribution (e.g., Hollister et al. 2011). Indeed, noncoding TE sequences are frequently exapted as short- and long-distance (Pi et al. 2010) enhancers, repressors, silencers, insulators, alternative transcription start sites, alternative polyadenylation sites (Lee et al. 2008), antisense transcripts, or alternative exons (Feschotte 2008; Beauregard et al. 2008; Shapiro 2010; Studer et al. 2011; Ponting et al. 2011). For instance, in humans, one-quarter to one-third of promoters and transcription factor binding sites may be derived from TEs (Jordan et al. 2003; Bourque et al. 2008; Kunarso et al. 2010), and thousands of conserved noncoding elements derived from TEs are especially enriched near genes involved in development and transcriptional regulation (Lowe et al. 2007). In plants, small nonautonomous DNA transposons known as miniature inverted-repeat transposable elements (MITEs) are especially abundant. MITEs insert preferentially into 5-prime flanking regions of genes, can be exapted as various types of regulatory element, and can form de novo regulatory networks (Bureau et al. 1996; Jiang, Feschotte, et al. 2004b; Oki et al. 2008; Naito et al. 2009).

In addition to transcription factors and cis-binding sites, TEs can also be exapted as various classes of RNAs (Brosius 1999). Of these, the most important for regulation are small RNAs (sRNAs). sRNAs of different types are used in several related RNAi pathways to target transcriptional and post-transcriptional silencing (Chapman & Carrington 2007; Malone & Hannon 2009). sRNAs are generated from double-stranded RNAs (dsRNAs), which ordinary genes do not normally produce, but which TEs produce in various ways, for example by intermolecular hybridization of bidirectional transcripts, or intramolecular hybridization of read-through transcripts containing inverted repeats (Lisch 2009). In plants, transcriptional silencing of TEs by chromatin compaction is activated by sRNA-directed DNA methylation, as well as repressive histone modifications. Secondary sRNAs spread DNA methylation to adjacent areas (Simon & Meyers 2011).
DNA methylation is initiated and reinforced in plant embryos by sRNAs produced in the vegetative cell of pollen and the central cell of endosperm, cells which are hypomethylated to activate TEs, but which do not contribute genetic material to subsequent generations (Law & Jacobsen 2010; Calarco & Martienssen 2011). Once established, DNA methylation is epigenetically inherited (Feng & Jacobsen 2010); thus, degenerate TEs may eventually be desilenced, making them more available for exaptation.

Epigenetic regulation of TEs can alter the expression of nearby ordinary genes, and cis-regulatory elements exapted from TEs can also be epigenetically regulated (Slotkin & Martienssen 2007; Weil & Martienssen 2008; Markljung et al. 2009). In fact, McClintock first discovered TEs by observing effects caused by the epigenetic regulation of maize En/Spm elements (Fedoroff 1999). Methylation of TEs inserted near ordinary genes decreases the expression of those genes (Hollister & Gaut 2009), for instance by methylation spreading to cis-regulatory elements (Martin et al. 2009). Conversely, TE-derived regulatory elements may also be targeted by active histone modifications (Huda et al. 2011).

TE-derived sRNAs can regulate ordinary genes not only epigenetically, but also post-transcriptionally. For instance, transduplications can produce, via antisense transcription, or possibly inverted duplication, sRNAs that post-transcriptionally downregulate parent gene expression in trans (Hoen et al. 2005; Slotkin et al. 2005; Hanada et al. 2009). Also, microRNA (miRNA) genes (MIR) can be derived from TEs (Smalheiser & Torvik 2005; Piriyapongsa et al. 2007; Li et al. 2011b). MIR transcripts form stem-loop dsRNAs which are processed into sRNAs that target the mRNA transcripts of ordinary genes for cleavage or translation inhibition. While some plant MIR genes have ancient origin and are highly conserved, the majority are young and restricted to single species. Young MIR genes may originate by random inversion of coding sequence, or they may be derived from short inverted repeat TEs, such as MITEs (Voinnet 2009). However, many young MIR genes may be evolving neutrally (Fahlgren et al. 2010).

While some TE-derived miRNAs produce canonical stem-loop structures, many are atypical and more similar to ancestral TE configurations. This suggests a model (Piriyapongsa & Jordan 2007) of MIR exaptation similar in principle to the Frequent Birth model. Consider a specific TE locus that generates sRNAs targeting homologous TEs for silencing. As a side effect of silencing, a set of ordinary genes may also be down-regulated, perhaps via a distributed set of exonized TEs targeted by the sRNAs. In the event that this new regulatory network were phenotypically beneficial, the locus would come under phenotypic selection and could evolve to become a MIR gene. The main principle here is the same as that for de novo exaptation of transcription factor networks: a network of sequences derived from TEs randomly distributed near ordinary genes is targeted for regulation, either by an ordinary transcription factor, a domesticated transcription factor, or an miRNA exapted from a TE. Indeed, miRNAs have been likened to post-transcriptional transcription factors (Malone & Hannon 2009). The first step in this process, the exaptation of a TE to produce repressive sRNAs, is nicely illustrated by the maize Mu killer locus (Muk). Muk is an inverted duplication of part of a Mutator TE. It produces a stem-loop dsRNA transcript that is processed into sRNAs which effectively silence, both post-transcriptionally and epigenetically, this otherwise highly active TE family (Slotkin et al. 2005). The large number of observed young MIR genes is consistent with a rapid birth-death cycle as predicted by the Frequent Birth model, and with a role for these miRNAs in lineage-specific diversification. Perhaps a similar phenomenon might be found for domesticated protein-coding genes, given sufficiently sensitive searches.

On a grander scale, the RNAi system itself may be an indirect exaptation of TE-genome coevolution. Perhaps originally evolved to regulate TEs (Lisch & Bennetzen 2011), epigenetic regulation is now used for various purposes such as genomic imprinting (Köhler & Weinhofer-Molisch 2010), gene body methylation (Saze & Kakutani 2011), developmental plasticity, and the buffering of developmental programs (Obbard et al. 2009; Martin & Bendahmane 2010; Mirouze & Paszkowski 2011; Feng & Jacobsen 2011; Simon & Meyers 2011). On an even larger scale, the modulation of epigenetic TE regulation may provide a mechanism for evolvability and may thus play a fundamental evolutionary role. Global epigenetic desilencing increases the rate of transposition, so it may enable periods of rapid evolution and, ultimately, punctuated equilibrium (Gould & Eldredge 1977; Zeh et al. 2009; Okada et al. 2010; Johnson & Tricker 2010; Oliver 2011; Werren 2011).

Conclusions

We have reviewed the known exaptations of transposase genes in plants. Although some results suggest that there may be only a few plant DTEs left to be discovered (Cowan et al. 2005; Jiao & Deng 2007), the question remains open. Indeed, other lines of evidence suggest that molecular domestication may be a relatively common phenomenon: the convergent domestication of pogo transposases into CENP-B-like DTEs in mammals, fission yeast, and possibly insects and plants (Barbosa-Cisneros & Herrera-Esparza 2002; Casola et al. 2008; d'Alençon et al. 2011); the repeated and convergent domestication of placental genes (Rawn & Cross 2008); the recurrent domestication of P elements (Miller et al. 1999; Quesneville & Nouaud 2005); and multiple domestication events during FHY3 evolution (Lin et al. 2007). Thus, before we assess the frequency and global significance of molecular domestication, additional systematic genomic searches are needed. Regardless of how frequent TE coding sequence exaptation is, it has already become apparent that noncoding TE exaptations, such as transcription factor binding sites, contribute significantly to regulatory evolution.

It is common to describe TEs as selfish parasites, a viewpoint supposedly justified by the capacity of TEs to persist without any need for phenotypic selection. Gould cautioned that focussing exclusively on immediate adaptation leads to an inverted understanding of evolution, a focus on spandrels not arches, on paintings not architecture (Gould & Lewontin 1979). Modern genomics supports this view (Koonin 2009). We find it fascinating and beautiful that nature appears to have created a system of molecular evolution in which self-replicating genetic elements, immune to the immediate pressures of phenotypic selection, may nevertheless be fundamental to the evolution of complex organisms. When Barbara McClintock first discovered transposable elements, she recognized their potential importance in gene regulation and genome evolution (McClintock 1950). In the end, she may well be proven right.

References

Agren, J.A. & Wright, S.I., 2011. Co-evolution between transposable elements and their hosts: a major factor in genome size evolution? Chromosome research, 19(6), 777–786.
Allen, T. et al., 2006. Arabidopsis FHY3 specifically gates phytochrome signaling to the circadian clock. The Plant cell, 18(10), 2506–2516.
Aziz, R.K., Breitbart, M. & Edwards, R.A., 2010. Transposases are the most abundant, most ubiquitous genes in nature. Nucleic acids research, 38(13), 4207–4217.
Babu, M.M. et al., 2006. The natural history of the WRKY-GCM1 zinc fingers and the relationship between transcription factors and transposons. Nucleic acids research, 34(22), 6505–6520.
Bae, G. & Choi, G., 2008. Decoding of light signals by plant phytochromes and their interacting proteins. Annual review of plant biology, 59, 281–311.
Barbazuk, W.B., Fu, Y. & McGinnis, K.M., 2008. Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome research, 18(9), 1381–1392.
Barbosa-Cisneros, O. & Herrera-Esparza, R., 2002. CENP-B is a conserved gene among vegetal species. Genetics and molecular research, 1(3), 241–245.
Baudry, C. et al., 2009. PiggyMac, a domesticated piggyBac transposase involved in programmed genome rearrangements in the ciliate Paramecium tetraurelia. Genes & development, 23(21), 2478–2483.

Beauregard, A., Curcio, M.J. & Belfort, M., 2008. The Take and Give Between Retrotransposable Elements and their Hosts. Annual review of genetics, 42(1), 587–617.
Beck, C.R. et al., 2011. LINE-1 Elements in Structural Variation and Disease. Annual review of genomics and human genetics, 12, 187–215.
Benjak, A., Forneck, A. & Casacuberta, J.M., 2008. Genome-wide analysis of the “cut-and-paste” transposons of grapevine. PLoS One, 3(9), e3107.
Biémont, C., 2010. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics, 186(4), 1085–1093.
Bourque, G. et al., 2008. Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome research, 18(11), 1752–1762.
Brennecke, J., Aravin, A. A., Stark, A., Dus, M., Kellis, M., Sachidanandam, R., & Hannon, G. J., 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell, 128(6), 1089–1103.
Britten, R., 2006. Transposable elements have contributed to thousands of human proteins. Proc. Natl. Acad. Sci. USA, 103(6), 1798–1803.
Britten, R.J. & Davidson, E.H., 1971. Repetitive and non-repetitive DNA sequences and a speculation on the origins of evolutionary novelty. The Quarterly review of biology, 46(2), 111–138.
Brosius, J., 1999. RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene, 238(1), 115–134.
Bundock, P. & Hooykaas, P., 2005. An Arabidopsis hAT-like transposase is essential for plant development. Nature, 436(7048), 282–284.
Bureau, T.E. & Wessler, S.R., 1992. Tourist: a large family of small inverted repeat elements frequently associated with maize genes. The Plant cell, 4(10), 1283–1294.
Bureau, T.E., Ronald, P.C. & Wessler, S.R., 1996. A computer-based systematic survey reveals the predominance of small inverted-repeat elements in wild-type rice genes. Proc. Natl. Acad. Sci. USA, 93(16), 8524–8529.
Calarco, J.P. & Martienssen, R.A., 2011. Genome reprogramming and small interfering RNA in the Arabidopsis germline. Current opinion in genetics & development, 21(2), 134–139.
Cam, H.P. et al., 2008. Host genome surveillance for retrotransposons by transposon-derived proteins. Nature, 451(7177), 431–436.
Cao, X. et al., 2006. Noncoding RNAs in the mammalian central nervous system. Annual review of neuroscience, 29, 77–103.
Carroll, S.B., 2005. Evolution at Two Levels: On Genes and Form. PLoS biology, 3(7), e245.
Casola, C., Hucks, D. & Feschotte, C., 2008. Convergent domestication of pogo-like transposases into centromere-binding proteins in fission yeast and mammals. Molecular biology and evolution, 25(1), 29–41.
Chapman, E.J. & Carrington, J.C., 2007. Specialization and evolution of endogenous small RNA pathways. Nature reviews. Genetics, 8(11), 884–896.

Chen, M. & Chory, J., 2011. Phytochrome signaling mechanisms and the control of plant development. Trends in cell biology, 21(11), 664–671.
Cheng, C.-Y. et al., 2010. A domesticated piggyBac transposase plays key roles in heterochromatin dynamics and DNA cleavage during programmed DNA deletion in Tetrahymena thermophila. Molecular biology of the cell, 21(10), 1753–1762.
Clouaire, T. et al., 2005. The THAP domain of THAP1 is a large C2CH module with zinc-dependent sequence-specific DNA-binding activity. Proc. Natl. Acad. Sci. USA, 102(19), 6907–6912.
Conrad, D.F. et al., 2010. Origins and functional impact of copy number variation in the human genome. Nature, 464(7289), 704–712.
Cordaux, R. & Batzer, M.A., 2009. The impact of retrotransposons on human genome evolution. Nature reviews. Genetics, 10(10), 691–703.
Cordaux, R. et al., 2006. Birth of a chimeric primate gene by capture of the transposase gene from a mobile element. Proc. Natl. Acad. Sci. USA, 103(21), 8101–8106.
Cowan, R.K. et al., 2005. MUSTANG is a novel family of domesticated transposase genes found in diverse angiosperms. Molecular biology and evolution, 22(10), 2084–2089.
d'Alençon, E. et al., 2011. Characterization of a CENP-B homolog in the holocentric Lepidoptera Spodoptera frugiperda. Gene, 485(2), 91–101.
Daniels, S.B. et al., 1990. Evidence for horizontal transmission of the P transposable element between Drosophila species. Genetics, 124(2), 339–355.
Darwin, C., 1876. The origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. 6th ed. J. Murray, ed. London.
Diao, X., Freeling, M. & Lisch, D.R., 2005. Horizontal Transfer of a Plant Transposon. PLoS biology, 4(1), e5.
Doolittle, W.F. & Sapienza, C., 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature, 284(5757), 601–603.
Dooner, H.K. & Weil, C.F., 2007. Give-and-take: interactions between DNA transposons and their host plant genomes. Current opinion in genetics & development, 17(6), 486–492.
Edwards, C.A. et al., 2008. The evolution of the DLK1-DIO3 imprinted domain in mammals. PLoS biology, 6(6), e135.
Eichenlaub, M.P. & Ettwiller, L., 2011. De novo genesis of enhancers in vertebrates. PLoS biology, 9(11), e1001188.
Fahlgren, N. et al., 2010. MicroRNA gene evolution in Arabidopsis lyrata and Arabidopsis thaliana. The Plant cell, 22(4), 1074–1089.
Fedoroff, N.V., 1999. The suppressor-mutator element and the evolutionary riddle of transposons. Genes to cells, 4(1), 11–19.
Feng, S. & Jacobsen, S.E., 2011. Epigenetic modifications in plants: an evolutionary perspective. Current opinion in plant biology, 14(2), 179–186.

Feng, S. & Jacobsen, S., 2010. Epigenetic Reprogramming in Plant and Animal Development. Science, 330(6004), 622–627.
Feschotte, C., 2008. Transposable elements and the evolution of regulatory networks. Nature reviews. Genetics, 9(5), 397–405.
Feschotte, C. & Pritham, E.J., 2007. DNA transposons and the evolution of eukaryotic genomes. Annual review of genetics, 41, 331–368.
Flagel, L.E. & Wendel, J.F., 2009. Gene duplication and evolutionary novelty in plants. New Phytologist, 183(3), 557–564.
Franchini, L.F. et al., 2011. Convergent evolution of two mammalian neuronal enhancers by sequential exaptation of unrelated retroposons. Proc. Natl. Acad. Sci. USA, 108(37), 15270–15275.
Gaut, B.S. et al., 2007. Recombination: an underappreciated factor in the evolution of plant genomes. Nature reviews. Genetics, 8(1), 77–84.
Gotea, V. & Makalowski, W., 2006. Do transposable elements really contribute to proteomes? Trends in Genetics, 22(5), 260–267.
Gould, S.J. & Eldredge, N., 1977. Punctuated Equilibria: The Tempo and Mode of Evolution Reconsidered. Paleobiology, 3(2), 115–151.
Gould, S.J. & Lewontin, R.C., 1979. The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proceedings of the Royal Society of London. Series B, 205(1161), 581–598.
Gould, S.J. & Lloyd, E.A., 1999. Individuality and adaptation across levels of selection: how shall we name and generalize the unit of Darwinism? Proc. Natl. Acad. Sci. USA, 96(21), 11904–11909.
Gould, S.J. & Vrba, E., 1982. Exaptation-a missing term in the science of form. Paleobiology, 8(1), 4–15.
Hammer, S.E., 2005. Homologs of Drosophila P Transposons Were Mobile in Zebrafish but Have Been Domesticated in a Common Ancestor of Chicken and Human. Molecular biology and evolution, 22(4), 833–844.
Hanada, K. et al., 2009. The functional role of pack-MULEs in rice inferred from purifying selection and expression profile. The Plant cell, 21(1), 25–38.
Haren, L., Ton-Hoang, B. & Chandler, M., 1999. Integrating DNA: transposases and retroviral integrases. Annual review of microbiology, 53, 245–281.
Hickey, D.A., 1982. Selfish DNA: a sexually-transmitted nuclear parasite. Genetics, 101(3-4), 519–531.
Hoen, D.R., Juretic, N. et al., 2005. The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome research, 15(9), 1292–1297.
Hoen, D.R., Park, K.C. et al., 2006. Transposon-mediated expansion and diversification of a family of ULP-like genes. Molecular biology and evolution, 23(6), 1254–1268.
Hollister, J.D. & Gaut, B.S., 2009. Epigenetic silencing of transposable elements: a trade-off between reduced transposition and deleterious effects on neighboring gene expression. Genome research, 19(8), 1419–1428.

Hollister, J.D. et al., 2011. Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proc. Natl. Acad. Sci. USA, 108(6), 2322–2327.
Hua-Van, A. et al., 2011. The struggle for life of the genome's selfish architects. Biology direct, 6, 19.
Huda, A. et al., 2011. Epigenetic regulation of transposable element derived human gene promoters. Gene, 475(1), 39–48.
Hudson, M.E. et al., 1999. The FAR1 locus encodes a novel nuclear protein specific to phytochrome A signaling. Genes & development, 13(15), 2017–2027.
Hudson, M.E., Lisch, D.R. & Quail, P.H., 2003. The FHY3 and FAR1 genes encode transposase-related proteins involved in regulation of gene expression by the phytochrome A-signaling pathway. The Plant journal, 34(4), 453–471.
Jiang, N. et al., 2011. Pack-Mutator-like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition. Proc. Natl. Acad. Sci. USA, 108(4), 1537–1542.
Jiang, N., Bao, Z., et al., 2004a. Pack-MULE transposable elements mediate gene evolution in plants. Nature, 431(7008), 569–573.
Jiang, N., Feschotte, C., et al., 2004b. Using rice to understand the origin and amplification of miniature inverted repeat transposable elements (MITEs). Current opinion in plant biology, 7(2), 115–119.
Jiao, Y. & Deng, X., 2007. A genome-wide transcriptional activity survey of rice transposable element-related genes. Genome Biology, 8(2), R28.
Jiao, Y., Lau, O.S. & Deng, X.W., 2007. Light-regulated transcriptional networks in higher plants. Nature reviews. Genetics, 8(3), 217–230.
Johnson, L.J. & Tricker, P.J., 2010. Epigenomic plasticity within populations: its evolutionary significance and potential. Heredity, 105(1), 113–121.
Johnson, L.J., 2008. Selfish genetic elements favor the evolution of a distinction between soma and germline. Evolution, 62(8), 2122–2124.
Jordan, I.K. et al., 2003. Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends in genetics, 19(2), 68–72.
Kaufman, P.D., Doll, R.F. & Rio, D.C., 1989. Drosophila P element transposase recognizes internal P element DNA sequences. Cell, 59(2), 359–371.
Khurana, J.S., Wang, J., Xu, J., Koppetsch, B.S., Thomson, T.C., Nowosielska, A., Li, C., Zamore, P.D., Weng, Z., & Theurkauf, W.E., 2011. Adaptation to P element transposon invasion in Drosophila melanogaster. Cell, 147(7), 1551–1563.
Kidwell, M.G. & Lisch, D.R., 2001. Perspective: transposable elements, parasitic DNA, and genome evolution. Evolution, 55(1), 1–24.
Kidwell, M.G., Kidwell, J.F. & Sved, J.A., 1977. Hybrid Dysgenesis in DROSOPHILA MELANOGASTER: A Syndrome of Aberrant Traits Including Mutation, Sterility and Male Recombination. Genetics, 86(4), 813–833.
King, M.C. & Wilson, A.C., 1975. Evolution at two levels in humans and chimpanzees. Science, 188(4184), 107–116.

Koonin, E.V., 2009. Darwinian evolution in the light of genomics. Nucleic acids research, 37(4), 1011–1034.
Köhler, C. & Weinhofer-Molisch, I., 2010. Mechanisms and evolution of genomic imprinting in plants. Heredity, 105(1), 57–63.
Kunarso, G. et al., 2010. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nature genetics, 42(7), 631–634.
Lai, J. et al., 2005. Gene movement by Helitron transposons contributes to the haplotype variability of maize. Proc. Natl. Acad. Sci. USA, 102(25), 9068–9073.
Law, J.A. & Jacobsen, S.E., 2010. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature reviews. Genetics, 11(3), 204–220.
Le, Q.H., Wright, S., Yu, Z., & Bureau, T., 2000. Transposon diversity in Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA, 97(13), 7376–7381.
Lee, J.Y., Ji, Z. & Tian, B., 2008. Phylogenetic analysis of mRNA polyadenylation sites reveals a role of transposable elements in evolution of the 3'-end of genes. Nucleic acids research, 36(17), 5581–5590.
Lee, S.H. et al., 2005. The SET domain protein Metnase mediates foreign DNA integration and links integration to nonhomologous end-joining repair. Proc. Natl. Acad. Sci. USA, 102(50), 18075–18080.
Levin, H.L. & Moran, J.V., 2011. Dynamic interactions between transposable elements and their hosts. Nature reviews. Genetics, 12(9), 615–627.
Li, G. et al., 2011a. Coordinated transcriptional regulation underlying the circadian clock in Arabidopsis. Nature cell biology, 13(5), 616–622.
Li, J. et al., 2010. Arabidopsis Transcription Factor ELONGATED HYPOCOTYL5 Plays a Role in the Feedback Regulation of Phytochrome A Signaling. The Plant cell, 22(11), 3634–3649.
Li, Y. et al., 2011b. Domestication of transposable elements into MicroRNA genes in plants. PLoS One, 6(5), e19212.
Lin, L. et al., 2008a. Diverse Splicing Patterns of Exonized Alu Elements in Human Tissues. PLoS genetics, 4(10), e1000225.
Lin, R. & Wang, H., 2004. Arabidopsis FHY3/FAR1 gene family and distinct roles of its members in light control of Arabidopsis development. Plant physiology, 136(4), 4010–4022.
Lin, R. et al., 2008b. Discrete and essential roles of the multiple domains of Arabidopsis FHY3 in mediating phytochrome A signal transduction. Plant physiology, 148(2), 981–992.
Lin, R. et al., 2007. Transposase-derived transcription factors regulate light signaling in Arabidopsis. Science, 318(5854), 1302–1305.
Lisch, D.R., 2009. Epigenetic regulation of transposable elements in plants. Annual review of plant biology, 60, 43–66.
Lisch, D.R. & Bennetzen, J.L., 2011. Transposable element origins of epigenetic gene regulation. Current opinion in plant biology, 14(2), 156–161.

Lisch, D.R. & Jiang, N., 2009. Mutator and MULE transposons. In J. L. Bennetzen & S. Hake, eds. Handbook of Maize. New York, NY: Springer New York, 277–306.
Lisch, D.R. et al., 2001. Mutator transposase is widespread in the grasses. Plant physiology, 125(3), 1293–1303.
Lockton, S. & Gaut, B.S., 2009. The Contribution of Transposable Elements to Expressed Coding Sequence in Arabidopsis thaliana. Journal of molecular evolution, 68(1), 80–89.
Lohe, A.R. et al., 1995. Horizontal transmission, vertical inactivation, and stochastic loss of mariner-like transposable elements. Molecular biology and evolution, 12(1), 62–72.
Lowe, C.B., Bejerano, G. & Haussler, D., 2007. Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc. Natl. Acad. Sci. USA, 104(19), 8005–8010.
Lunyak, V.V. & Atallah, M., 2011. Genomic relationship between SINE retrotransposons, Pol III-Pol II transcription, and chromatin organization: the journey from junk to jewel. Biochemistry and cell biology, 89(5), 495–504.
Magnani, E., Sjölander, K. & Hake, S., 2004. From endonucleases to transcription factors: evolution of the AP2 DNA binding domain in plants. The Plant cell, 16(9), 2265–2277.
Makałowski, W., Mitchell, G.A. & Labuda, D., 1994. Alu sequences in the coding regions of mRNA: a source of protein variability. Trends in genetics, 10(6), 188–193.
Malone, C.D. & Hannon, G.J., 2009. Small RNAs as guardians of the genome. Cell, 136(4), 656–668.
Marino-Ramirez, L. et al., 2005. Transposable elements donate lineage-specific regulatory sequences to host genomes. Cytogenetic and genome research, 110(1-4), 333–341.
Markljung, E. et al., 2009. ZBED6, a Novel Transcription Factor Derived from a Domesticated DNA Transposon Regulates IGF2 Expression and Muscle Growth. PLoS biology, 7(12), e1000256.
Martin, A. & Bendahmane, A., 2010. A blessing in disguise: Transposable elements are more than parasites. Epigenetics, 5(5), 378–380.
Martin, A. et al., 2009. A transposon-induced epigenetic change leads to sex determination in melon. Nature, 461(7267), 1135–1138.
Mathews, S., 2006. Phytochrome-mediated development in land plants: red light sensing evolves to meet the challenges of changing light environments. Molecular ecology, 15(12), 3483–3503.
McClintock, B., 1950. The origin and behavior of mutable loci in maize. Proc. Natl. Acad. Sci. USA, 36(6), 344–355.
McClintock, B., 1984. The significance of responses of the genome to challenge. Science, 226(4676), 792–801.
Miller, W.J., McDonald, J.F. & Pinsker, W., 1997. Molecular domestication of mobile elements. Genetica, 100(1-3), 261–270.

Miller, W.J. et al., 1999. Molecular domestication--more than a sporadic episode in evolution. Genetica, 107(1-3), 197–207.
Miller, W.J. et al., 1992. P-element homologous sequences are tandemly repeated in the genome of Drosophila guanche. Proc. Natl. Acad. Sci. USA, 89(9), 4018–4022.
Mirouze, M. & Paszkowski, J., 2011. Epigenetic contribution to stress adaptation in plants. Current opinion in plant biology, 14(3), 267–274.
Muehlbauer, G.J. et al., 2006. A hAT superfamily transposase recruited by the cereal grass genome. Molecular genetics and genomics, 275(6), 553–563.
Naito, K. et al., 2009. Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature, 461(7267), 1130–U232.
Nekrutenko, A. & Li, W.H., 2001. Transposable elements are found in a large number of human protein-coding genes. Trends in genetics, 17(11), 619–621.
Nosek, J., Kosa, P. & Tomaska, L., 2006. On the origin of telomeres: a glimpse at the pre-telomerase world. BioEssays, 28(2), 182–190.
Nouaud, D. & Anxolabehere, D., 1997. P element domestication: a stationary truncated P element may encode a 66-kDa repressor-like protein in the Drosophila montium species subgroup. Molecular biology and evolution, 14(11), 1132–1144.
Nowick, K. & Stubbs, L., 2010. Lineage-specific transcription factors and the evolution of gene regulatory networks. Briefings in functional genomics, 9(1), 65–78.
Obbard, D.J. et al., 2009. The evolution of RNAi as a defence against viruses and transposable elements. Philosophical transactions of the Royal Society of London. Series B, 364(1513), 99–115.
Ohno, S., 1972. So much “junk” DNA in our genome. Brookhaven symposia in biology, 23, 366–370.
Okada, N. et al., 2010. Emergence of mammals by emergency: exaptation. Genes to cells, 15(8), 801–812.
Oki, N. et al., 2008. A genome-wide view of miniature inverted-repeat transposable elements (MITEs) in rice, Oryza sativa ssp. japonica. Genes & genetic systems, 83(4), 321–329.
Oliver, K.R., 2011. Mobile DNA and the TE-Thrust hypothesis: supporting evidence from the primates. Mobile DNA, 2, 8.
Orgel, L.E. & Crick, F.H.C., 1980. Selfish DNA: the ultimate parasite. Nature, 284(5757), 604–607.
Ouyang, X. et al., 2011. Genome-Wide Binding Site Analysis of FAR-RED ELONGATED HYPOCOTYL3 Reveals Its Novel Function in Arabidopsis Development. The Plant cell, 23(7), 2514–2535.
Pace, J.K., II & Feschotte, C., 2007. The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome research, 17(4), 422–432.

Pardue, M.-L. & DeBaryshe, P.G., 2011. Retrotransposons that maintain chromosome ends. Proc. Natl. Acad. Sci. USA, [Epub ahead of print]. Paricio, N. et al., 1991. P sequences of Drosophila subobscura lack exon 3 and may encode a 66 kd repressor-like protein. Nucleic acids research, 19(24), 6713–6718. Parisod, C. et al., 2010. Impact of transposable elements on the organization and function of allopolyploid genomes. The New phytologist, 186(1), 37–45. Pi, W. et al., 2010. Long-range function of an intergenic retrotransposon. Proc. Natl. Acad. Sci. USA, 107(29), 12992–12997. Pinsker, W. et al., 2001. The evolutionary life history of P transposons: from horizontal invaders to domesticated neogenes. Chromosoma, 110(3), 148–158. Piriyapongsa, J, Marino-Ramirez, L. & Jordan, I.K., 2007. Origin and Evolution of Human microRNAs From Transposable Elements. Genetics, 176(2), 1323–1337. Piriyapongsa, Jittima & Jordan, I.K., 2007. A family of human microRNA genes from miniature inverted-repeat transposable elements. PLoS One, 2(2), e203. Ponting, C.P., Nellåker, C. & Meader, S., 2011. Rapid turnover of functional sequence in human and other genomes. Annual review of genomics and human genetics, 12, 275–299. Pritham, 2009. Transposable Elements and Factors Influencing their Success in Eukaryotes. The Journal of heredity, 100(5), 648–655. Quesneville, H., Nouaud, D. & Anxolabehere, D., 2005. Recurrent recruitment of the THAP DNA-binding domain and molecular domestication of the P-transposable element. Molecular biology and evolution, 22(3), 741–746. Raizada, M. N., Benito, M. I., & Walbot, V., 2001. The MuDR transposon terminal inverted repeat contains a complex plant promoter directing distinct somatic and germinal programs. Plant J., 25(1), 79–91. Rawn, S.M. & Cross, J.C., 2008. The evolution, regulation, and function of placenta-specific genes. Annual review of cell and developmental biology, 24, 159–181. Rebollo, R. et al., 2010. 
Jumping genes and epigenetics: Towards new species. Gene, 454(1-2), 1–7. Reiss, D. et al., 2005. Domesticated P elements in the Drosophila montium species subgroup have a new function related to a DNA binding property. Journal of molecular evolution, 61(4), 470–480. Rice, P.A. & Baker, T.A., 2001. Comparative architecture of transposase and integrase complexes. Nature structural biology, 8(5), 302–307. Rio, D.C., 1990. Molecular mechanisms regulating Drosophila P element transposition. Annual review of genetics, 24, 543–578. Robertson, H.M. & Zumpano, K.L., 1997. Molecular evolution of an ancient mariner transposon, Hsmar1, in the human genome. Gene, 205(1-2), 203–217.

Rose, M.R. & Oakley, T.H., 2007. The new biology: beyond the Modern Synthesis. Biology direct, 2, 30. Roussigne, M. et al., 2003. The THAP domain: a novel protein motif with similarity to the DNA-binding domain of P element transposase. Trends in biochemical sciences, 28(2), 66–69. Rusche, L.N. & Rine, J., 2010. Switching the mechanism of mating type switching: a domesticated transposase supplants a domesticated homing endonuclease. Genes & development, 24(1), 10–14. Sabogal, A. et al., 2010. THAP proteins target specific DNA sites through bipartite recognition of adjacent major and minor grooves. Nature structural & molecular biology, 17(1), 117–123. Saccaro, N.L., Jr. et al., 2007. MudrA-like sequences from rice and sugarcane cluster as two bona fide transposon clades and two domesticated transposases. Gene, 392(1-2), 117–125. Saijo, Y. et al., 2008. Arabidopsis COP1/SPA1 complex and FHY1/FHY3 associate with distinct phosphorylated forms of phytochrome A in balancing light signaling. Molecular cell, 31(4), 607–613. Santangelo, A.M. et al., 2007. Ancient exaptation of a CORE-SINE retroposon into a highly conserved mammalian neuronal enhancer of the proopiomelanocortin gene. PLoS genetics, 3(10), 1813–1826. Saze, H. & Kakutani, T., 2011. Differentiation of epigenetic modifications between transposons and genes. Current opinion in plant biology, 14(1), 81–87. Sánchez-Gracia, A., Maside, X. & Charlesworth, B., 2005. High rate of horizontal transfer of transposable elements in Drosophila. Trends in genetics, 21(4), 200–203. Schaack, S., Gilbert, C. & Feschotte, C., 2010. Promiscuous DNA: horizontal transfer of transposable elements and why it matters for eukaryotic evolution. Trends in Ecology & Evolution, 25(9), 537–546. Schmitz, J. & Brosius, Jürgen, 2011. Exonization of transposed elements: A challenge and opportunity for evolution. Biochimie, 93(11), 1928–1934. Sela, N. et al., 2007. 
Comparative analysis of transposed element insertion within human and mouse genomes reveals Alu's unique role in shaping the human transcriptome. Genome Biology, 8(6), R127. Sela, N. et al., 2008. Transduplication resulted in the incorporation of two protein-coding sequences into the turmoil-1 transposable element of C. elegans. Biology direct, 3, 41. Shaheen, M. et al., 2010. Metnase/SETMAR: a domesticated primate transposase that enhances DNA repair, replication, and decatenation. Genetica, 138(5), 559–566. Shapiro, J.A., 2010. Mobile DNA and evolution in the 21st century. Mobile DNA, 1(1), 4. Simon, S.A. & Meyers, B.C., 2011. Small RNA-mediated epigenetic modifications in plants. Current opinion in plant biology, 14(2), 148–155.

Sinzelle, L., Izsvák, Z. & Ivics, Z., 2009. Molecular domestication of transposable elements: from detrimental parasites to useful host genes. Cellular and molecular life sciences, 66(6), 1073–1093. Slotkin, R.K. & Martienssen, R.A., 2007. Transposable elements and the epigenetic regulation of the genome. Nature reviews. Genetics, 8(4), 272–285. Slotkin, R.K., Freeling, M. & Lisch, D.R., 2005. Heritable transposon silencing initiated by a naturally occurring transposon inverted duplication. Nature genetics, 37(6), 641–644. Smalheiser, N.R. & Torvik, V.I., 2005. Mammalian microRNAs derived from genomic repeats. Trends in genetics, 21(6), 322–326. Smith, J.J., Sumiyama, K. & Amemiya, C.T., 2011. A living fossil in the genome of a living fossil: Harbinger transposons in the coelacanth genome. Molecular biology and evolution, [Epub ahead of print]. Sorek, R., 2007. The birth of new exons: mechanisms and evolutionary consequences. RNA, 13(10), 1603–1608. Studer, A., Zhao, Q., Ross-Ibarra, J., & Doebley, J., 2011. Identification of a functional transposon insertion in the maize domestication gene tb1. Nature genetics, 43(11), 1160–1163. Turcotte, K., Srinivasan, S., & Bureau, T., 2001. Survey of transposable elements from rice genomic sequences. Plant J. 25(2) 169–179. Voinnet, O., 2009. Origin, biogenesis, and activity of plant microRNAs. Cell, 136(4), 669–687. Volff, J.-N., 2006. Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. BioEssays, 28(9), 913–922. Wang, H. & Deng, X.W., 2002. Arabidopsis FHY3 defines a key phytochrome A signaling component directly interacting with its homologous partner FAR1. The EMBO journal, 21(6), 1339–1349. Weil, C.F. & Martienssen, R.A., 2008. Epigenetic interactions between transposons and genes: lessons from plants. Current opinion in genetics & development, 18(2), 188–192. Werren, J.H., 2011. Selfish genetic elements, genetic conflict, and evolutionary innovation. Proc. 
Natl. Acad. Sci. USA, 108 Suppl 2, 10863–10870. Whitelam, G.C. et al., 1993. Phytochrome A null mutants of Arabidopsis display a wild-type phenotype in white light. The Plant cell, 5(7), 757–768. Wicker, T. et al., 2007. A unified classification system for eukaryotic transposable elements. Nature reviews. Genetics, 8(12), 973–982. Wray, G.A., 2007. The evolutionary significance of cis-regulatory mutations. Nature reviews. Genetics, 8(3), 206–216. Yu, Z., Wright, S.I, & Bureau, T.E., 2000. Mutator-like elements in Arabidopsis thaliana. Structure, diversity and evolution. Genetics, 156(4) 2019–2031. Xing, J. et al., 2009. Mobile elements create structural variation: analysis of a complete human genome. Genome research, 19(9), 1516–1526. Yanovsky, M.J., Whitelam, G.C. & Casal, J.J., 2000. fhy3-1 retains inductive responses of phytochrome A. Plant physiology, 123(1), 235–242.  

Zaratiegui, M. et al., 2011. CENP-B preserves genome integrity at replication forks paused by retrotransposon LTR. Nature, 469(7328), 112–115. Zeh, D.W., Zeh, J.A. & Ishida, Y., 2009. Transposable elements and an epigenetic basis for punctuated equilibria. BioEssays, 31(7), 715–726. Zhang, J. et al., 2009. Alternative Ac/Ds transposition induces major chromosomal rearrangements in maize. Genes & development, 23(6), 755–765.

Bridge to Chapter Two

The broad objective of this thesis is to better characterize the direct or active effects of transposable elements (TEs) on genome evolution in plants, and vice versa. The underlying hypothesis may be summed up as follows: Although TEs are primarily maintained in the short term through intra-genomic replication rather than natural selection at the organism level, they nevertheless actively contribute to genome evolution by duplicating and mobilizing useful DNA sequences that are more frequently co-opted to perform non-transposition functions than is currently reported. Conversely, TEs are themselves shaped by co-opting non-TE sequences for transposition-related functions.

In Chapter One, I reviewed the literature on the coevolution of TEs and genomes, focussing on one specific aspect: molecular domestication, or the co-option of TE sequences by genomes. I revisit molecular domestication in Chapter Four. But first, in Chapter Two, I investigate a different type of sequence exchange, in the opposite direction, from genome to TE. I briefly reviewed this process, called transduplication, in Chapter One. In Chapter Two I first re-examine the relevant literature and then describe a large-scale study of transduplication in rice. Contrary to a previous, widely publicized report, the transduplicated sequences do not appear to produce functional proteins, although they may have regulatory functions. Chapter Two is a reproduction of the following published article: Hoen DR, Juretic N, Huynh ML, Harrison PM, Bureau TE. 2005. The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res. 15(9):1292-7.

Chapter Two: The evolutionary fate of MULE-mediated duplications of host gene fragments in rice

Abstract

DNA transposons are known to frequently capture duplicated fragments of host genes. The evolutionary impact of this phenomenon depends on how frequently the fragments retain protein-coding function as opposed to becoming pseudogenes. Gene fragment duplication by Mutator-like elements (MULEs) has previously been documented in maize, Arabidopsis, and rice. Here we present a rigorous genome-wide analysis of MULEs in the model plant Oryza sativa (domesticated rice). We identify 8274 MULEs with intact termini and target-site duplications (TSDs) and show that 1337 of them contain duplicated host gene fragments. Through a detailed examination of the 5% of duplicated gene fragments that are transcribed, we demonstrate that virtually all cases contain pseudogenic features such as fragmented conserved protein domains, frameshifts, and premature stop codons. In addition, we show that the distribution of the ratio of non-synonymous to synonymous substitution rates for the duplications agrees with the expected distribution for pseudogenes. We conclude that MULE-mediated host gene duplication results in the formation of pseudogenes, not novel functional protein-coding genes; however, the transcribed duplications possess characteristics consistent with a potential role in the regulation of host gene expression.

Introduction

With their ubiquitous presence in eukaryotic genomes and their capacity to generate mutations, transposons are important players in genome evolution, influencing genome size, function, and structure (Kazazian 2004). They are involved in the duplication of parts of their host genome through several different mechanisms, such as ectopic exchange between neighboring transposons via unequal crossing over (Lim and Simmons 1994) and macrotransposition (Ralston, English et al. 1989). In addition to these mechanisms, which produce duplications of the genomic sequence between two transposons, fragments of host genes can be found duplicated within a single transposon. In the case of retrotransposons, this phenomenon is known as transduction and involves read-through transcription from a retrotransposon promoter into adjacent host gene sequences and the incorporation of the host gene into the transposon sequence during reverse transcription (Bureau, White et al. 1994; Kazazian 2004). The mechanism of gene-fragment duplication by DNA transposons, herein referred to as “transduplication,” is unknown, but is clearly different from transduction because duplicated fragments retain introns. The potential functional impact of a duplicated gene fragment depends on whether it forms a protein-coding gene or a pseudogene; while the latter appears to be predominant for transduced gene fragments, the outcome of transduplication events is unclear.

MULEs are DNA transposons found in high copy number in plants, fungi, and prokaryotes (Yu, Wright et al. 2000; Turcotte, Srinivasan et al. 2001; Chalvet, Grimaldi et al. 2003). They are characterized by long terminal inverted repeats (TIRs; longer than 100 bp), lengths ranging from a few hundred base pairs to more than 30 kbp, hyper-variable internal sequence, and a 9-11 bp target site duplication (TSD). MULEs lacking TIRs (i.e., non-TIR MULEs) are rare in rice but predominate in Arabidopsis (Le, Wright et al. 2000; Yu, Wright et al. 2000). MULE families are typically composed of a few autonomous elements containing a transposase gene, mudrA, and tens to hundreds of non-autonomous elements that lack functional mudrA, but which may be mobilized in trans (Le, Wright et al. 2000; Yu, Wright et al. 2000; IRGSP 2005). Gene fragment transduplication by MULEs has previously been documented in maize (Talbert and Chandler 1988), Arabidopsis (Le, Wright et al. 2000; Yu, Wright et al. 2000), and rice (Turcotte, Srinivasan et al. 2001). Recently, transduplication has been suggested as a mechanism for the creation of novel protein-coding genes (Jiang, Bao et al. 2004). In this report, we demonstrate that the evidence points to the contrary: widespread MULE-mediated transduplication in rice results in the formation of pseudogenes that lack protein-coding function. Even the 5.3% of the transduplicates that are transcribed have one or more features that would disable any potential translation product. Furthermore, the ratio of non-synonymous to synonymous substitution rates is consistent with reduced purifying selection. However, the transcribed pseudogenes may function in the regulation of host gene expression.

Results and Discussion

Members of rice MULE families were identified by (1) determining the location of mudrA genes, (2) examining flanking regions for TIRs or terminal structures, (3) using the termini of complete putative autonomous elements to identify the corresponding nonautonomous elements, and (4) removing elements that lack characteristic TSDs (Supplemental Table 2-1). This protocol yields well-supported family classification and avoids the potential mis-annotation of elements that have structural features similar to those of bona fide MULEs but no evolutionary connection with them. In total, we mined 8274 MULEs from the rice genome, with an estimated false-positive rate of 3% and false-negative rate of 17%. The distribution of MULEs along the rice chromosomes (Supplemental Fig. 2-1) is similar to that of other DNA transposons (IRGSP 2005), with higher element density in euchromatic regions (23 MULEs/Mbp) than in centromeric and pericentromeric regions (16 MULEs/Mbp).

To determine the prevalence of transduplication in our data set, we identified 1968 MULEs (24%) with high sequence similarity (BLASTN E-value ≤ 1 × 10⁻³⁷) to coding regions of predicted genes that are unrelated to transposons (Supplemental Table 2-2). Because gene prediction is relatively inaccurate, we focused on the 1337 MULEs (16%) with transduplicates for which the paralogous host gene is confirmed by a full-length cDNA (Supplemental Table 2-3), or which encode a conserved protein domain (Supplemental Table 2-4) (Fig. 2-1). Our methods sacrifice sensitivity for specificity, so we expect the actual number of transduplicates to be considerably larger. The size of transduplicated regions ranges from 95 to 7742 bp, with a mean of 305 bp.
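The similarity filter described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study: it assumes that BLASTN hits have been saved in the common 12-column tab-delimited layout (query, subject, percent identity, and so on, with the E-value in column 11), and the function name and the sample identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-37  # threshold quoted in the text


def mules_with_gene_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs with at least one hit at or below the cutoff.

    Assumes 12 tab-separated columns per line, with the E-value in
    column 11 (index 10), as in standard BLAST tabular output.
    """
    hits = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= cutoff:
            hits.add(row[0])
    return hits


# Hypothetical hits: one strong match and one weak match.
example = (
    "MULE_0001\tgene_A\t95.2\t310\t15\t0\t1\t310\t500\t809\t2e-120\t430\n"
    "MULE_0002\tgene_B\t88.0\t60\t7\t0\t1\t60\t10\t69\t3e-12\t70\n"
)
print(sorted(mules_with_gene_hits(example)))
```

A stringent cutoff of this kind retains only long, highly similar matches, which is consistent with the stated preference for specificity over sensitivity.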

Figure 2-1. Distribution of supported transduplicates among rice MULEs. Percent contribution is shown for each MULE category. (PG) Predicted gene; (PG-cDNA) predicted gene supported by a cDNA; (CD) conserved domain.

We clustered MULEs that contained the same transduplicate(s) and found that 64% are single-copy, and thus were formed by unique transduplication events. The remainder of the MULEs are multi-copy, and thus presumably arose by replicative transposition of an ancestral MULE that had previously undergone transduplication. The number of MULEs per cluster followed a power-law relationship, decreasing rapidly with cluster size; 99% of the clusters had fewer than 10 members. For example, we identified seven MULEs in two families that contain a transduplicated putative poly(A) polymerase (PAP) gene fragment (Fig. 2-2). In addition, we found that in 25% of the cases, a single MULE contains separate transduplicates of more than one host gene fragment, presumably the result of multiple independent transduplication events. In general, the insertion and rearrangement of transduplicated sequences may account for the high level of divergence both within and between MULE families (Yu, Wright et al. 2000).

Figure 2-2. Transduplication and diversification of a poly(A) polymerase (PAP) gene fragment in the rice genome. (A) The central region of a putative PAP gene on chromosome 4 (4_012_1023) was transduplicated in a MULE that later replicated and, in one branch of the group, internal deletions reduced the size of the transduplicate. (B) Three highly similar copies in the rice genome possess the larger transduplicate, each having a different pattern of expression, including TI0000727, which generates sense and antisense transcripts and TI0005797 with the transcript terminating in a retrotransposon solo long terminal repeat (LTR). (C) The MULE copy with the smaller transduplicate subsequently transduplicated regions from two additional rice genes, a ribA-like gene (5_009_0152) and a serine/threonine protein kinase (1_008_0111). (D) There are four MULEs with these three transduplicates and one of them is transcribed in both the sense and antisense direction. Mapped cDNAs and ESTs are represented by arrows, pink bars represent ORFs, and regions containing conserved protein domains are indicated by double-headed arrows. Genes are shown in blue, green, and purple, and introns are shown in a lighter shade. Inverted triangles represent TIRs.

The existence of an extensive data set of rice full-length cDNAs (Kikuchi, Satoh et al. 2003) allowed us to identify transcribed transduplicates by comparing the position of elements to mapped cDNAs in the rice genome. A total of 310 MULEs overlap cDNAs (Supplemental Table 2-5), and in 66 of these, the expressed region includes one or more transduplicates, for a total of 72 transcribed transduplicates (Fig. 2-1, Supplemental Table 2-6). All of the transcribed transduplicates have disablements preventing them from expressing a functional protein.
First, six expressed transduplicates (8%) are not within a predicted ORF of the cDNA (Fig. 2-3A). Second, for 13 expressed transduplicates (18%) that overlap a cDNA ORF, the predicted coding region is not in the same reading frame or has accumulated frameshift mutations and premature stop codons (Fig. 2-3B). Third, for 22 expressed transduplicates with regions in the same reading frame as a cDNA ORF, the overlap does not contain protein domains found within the transduplicate (21%) or these domains are truncated (10%; Fig. 2-3C). The only transduplicate with an intact conserved domain within a possible cDNA ORF is the pentatricopeptide repeat (PPR) transduplicate found in TI0007876 (Supplemental Table 2-6); however, all known PPR proteins contain multiple PPRs (Lurin, Andrés et al. 2004), so the functionality of a single repeat is questionable. Fourth, 23 transduplicates (32%) are expressed only in the antisense direction and therefore do not encode a protein. Finally, 10 of the expressed predicted host genes contain only truncated domains and may themselves be expressed pseudogenes. The predicted genes matching the nine remaining transduplicates do not contain any known conserved domain. In addition, a search of the SWISS-PROT and SCOP protein databases reveals that many of the transduplicates show similarity to known proteins and protein domains, but the matching regions contain frameshifts, premature stop codons, or are truncated (Supplemental Tables 2-6 & 2-7). Interestingly, a recent study of CACTA-like DNA transposons with transduplicated gene fragments also revealed that they are transcribed pseudogenes (Kawasaki and Nitasaka 2004).

In contrast to these results, a recently published study of MULE transduplication in rice (Jiang, Bao et al. 2004) concludes that the coding sequence of over 10% of transduplicates might have been functionally constrained. Our study differed from Jiang et al. (2004) in several ways. We identified and analyzed transduplicates on all 12 rice chromosomes, whereas they limited their analysis to chromosomes 1 and 10, then extrapolated to make a genome-wide estimate. We recorded duplicated regions that are highly similar to a transcribed gene or conserved domain, whereas they used similarity to any non-hypothetical non-transposase protein, regardless of additional support. According to Jiang et al.
(2004), of a sample of 100 of their transduplicates, 20 were not found to be similar to any genomic sequence outside of the MULE, seven of the remaining 80 were not similar to a predicted host coding region, and only 46 of the suspected host-coding sequences were supported by a cDNA. After taking into account these differences, many of our basic results agree with Jiang et al. (2004). For example, we identified 1337 MULEs with transduplicates, whereas Jiang et al. estimate 3200 in total, but based on their sample, fewer than 46% (1500) of these would have host genes with matching cDNAs. Additional similar results include the fraction of transduplicate-containing MULEs present in more than one copy in the genome (36%; Jiang et al. 20%); the fraction of transduplicates that are transcribed (5%; Jiang et al. 5%); and the fraction of MULEs containing multiple transduplicates (25%; Jiang et al. 23%).

Figure 2-3. Examples of expressed transduplicates. (A) The transduplicated conserved domain fragment is found within the cDNA, but it does not overlap with any of the possible ORFs. (Inset) White bars represent the three forward frames of MULE cDNAs, and pale blue regions indicate possible ORFs (not to scale). (B) As a result of a frameshift mutation, the transduplicated conserved domain fragment is split between two cDNA frames and two possible ORFs. (C) The transduplicate is found within one of the possible cDNA ORFs, but the conserved domain is truncated. Mapped cDNAs are represented by arrows, pink bars represent ORFs, and regions containing conserved domains are indicated by double-headed arrows. Genes are shown in green and introns are in light-green. Inverted triangles represent TIRs.

Despite the consistency of these basic results, our key result that virtually all transcribed transduplicates contain coding-sequence disablements contradicts the conclusion of Jiang et al. (2004) that many transduplicates show evidence of purifying selection on the coding sequence. The reason for this contradiction is that Jiang et al. (2004) based their conclusion on a misinterpreted dN/dS analysis. The ratio of dN (non-synonymous substitutions per non-synonymous site) to dS (synonymous substitutions per synonymous site) is a measure of selective constraint on coding sequences (Li, Gojobori et al. 1981). dN and dS can be estimated using a number of substitution models and methods, and the estimates are sensitive to these choices and other complications such as the GC content of the sequences and their genomic context (Bustamante, Nielsen et al. 2002). A pair of sequences will have dN/dS≪1 if both sequences have been under purifying selection, dN/dS<1 if one sequence has been under purifying selection but the other drifting neutrally, dN/dS∼1 if both sequences are drifting neutrally, and rarely, dN/dS>1 at specific sites that are under adaptive selection. For example, mammalian gene-gene pairs have mean dN/dS ∼ 0.15 (Waterston, Lindblad-Toh et al. 2002; Jørgensen, Hobolth et al. 2005), while human gene-pseudogene pairs have mean dN/dS ∼ 0.5 (Torrents, Suyama et al. 2003; Zhang, Harrison et al. 2003).
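The intermediate value for gene-pseudogene pairs can be rationalized with a back-of-the-envelope model. The sketch below is an illustration under stated assumptions, not an analysis from this study: substitutions are assumed to add up along the two branches separating the sequences (no multiple hits), both branches are given equal length by default, and `omega_a` and `omega_b` are hypothetical names for the dN/dS ratio acting on each lineage.

```python
def expected_pairwise_dnds(omega_a, omega_b, t_a=1.0, t_b=1.0):
    """Expected pairwise dN/dS for two sequences sharing a common ancestor.

    omega_a, omega_b: dN/dS acting on each lineage (1.0 means neutral drift).
    t_a, t_b: branch lengths, in synonymous substitutions per synonymous site.
    Assumes substitutions simply accumulate additively along both branches.
    """
    d_s = t_a + t_b                      # synonymous divergence of the pair
    d_n = omega_a * t_a + omega_b * t_b  # non-synonymous divergence of the pair
    return d_n / d_s


# Illustrative inputs: 0.15 is the mean gene-gene value quoted in the text.
gene_gene = expected_pairwise_dnds(0.15, 0.15)
gene_pseudogene = expected_pairwise_dnds(0.15, 1.0)
neutral_pair = expected_pairwise_dnds(1.0, 1.0)
print(round(gene_gene, 3), round(gene_pseudogene, 3), round(neutral_pair, 3))
```

Under these simplifying assumptions, a constrained gene paired with a neutrally drifting copy gives a ratio a little under 0.6, which is well below 1 even though one of the two sequences is under no selection at all.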

Testing for purifying selection on gene-transduplicate pairs using pairwise dN/dS is difficult because the standard null hypothesis for gene-gene pairs, that dN/dS=1, is invalid. In this case, we expect dN/dS<1 due to selection on the host gene, even if the transduplicate has been drifting neutrally. Jiang et al. (2004) overlooked this critical consideration and based their conclusion that >10% of transduplicates have been under purifying selection primarily on the results of a statistical test using this invalid null hypothesis. Their conclusion is therefore unfounded. It would be useful to construct a generally valid null hypothesis of the form dN/dS=α, where α is the expected pairwise dN/dS ratio between a neutrally drifting transduplicate and a corresponding host gene under purifying selection. Unfortunately, α would vary with the strength of selective pressure on the host gene and so would need to be estimated on a case-by-case basis using, for example, additional homologous functional genes. Such an approach would be difficult to implement en masse for large numbers of transduplicates. Instead, we adopted an alternative method and compared the dN/dS distribution of all qualified transduplicates with a benchmark distribution, as follows. We calculated the dN/dS ratios of all 47 transduplicates for which the transduplicate and host gene cDNA sequences could be aligned for a minimum length of 150 bp (87%-96% identity). This gave a dN/dS distribution with a median of 0.55 and standard deviation of 0.36 (Supplemental Fig. 2-2 & Supplemental Table 2-6). As a benchmark, we used the published dN/dS distribution of a large set of human processed pseudogenes (Zhang, Harrison et al. 2003). If a large fraction of transduplicates had been under selection, it would have been reflected in a skewed transduplicate distribution relative to the benchmark.
However, the distribution profiles are similar, both peaking near 0.5 with most dN/dS ratios in the range from 0.4 to 0.7. Although this comparison must be interpreted cautiously due to differences between the human processed pseudogene benchmark population and the rice transduplicate sample, such as GC-content and sequence divergence, the similarity between the two distributions is consistent with the observed high incidence of coding-sequence disablements and our hypothesis that transduplicates are noncoding pseudogenes. This suggests that if the process of transduplication ever does result in the formation of functional protein-coding genes, it occurs only rarely. Of over 1300 transduplicates in the extant rice (japonica) genome, we were unable to identify a single candidate functional protein-coding transduplicate.

Contrary to the standard definition of pseudogenes as nonfunctional, it has been shown that an inability to encode functional proteins often does not preclude pseudogenes from performing other functional roles, such as the provision of sequence reservoirs for gene conversion, or the regulation of paralogous gene expression (Korneev, Park et al. 1999; Balakirev and Ayala 2003; Hirotsune, Yoshida et al. 2003). Sense (51), antisense (23), and bidirectional (3) transcripts containing transduplicated sequences (Supplemental Table 2-6) could negatively affect host gene expression through RNA-mediated gene silencing (i.e., RNA interference or RNAi) by mechanisms including mRNA degradation, translational suppression, or DNA methylation of the host gene sequence (Meister and Tuschl 2004). For prokaryotic and eukaryotic organisms, endogenous noncoding transcripts are known to suppress the expression of genes with which they share short regions of high nucleotide sequence similarity (Korneev, Park et al. 1999; Palatnik, Allen et al. 2003; Vazquez, Vaucheret et al. 2004). Although these RNAs are often processed from a stem-loop precursor structure (i.e., miRNAs) (Bartel 2004), partial complementarity is often sufficient to cause silencing (Korneev, Park et al. 1999; Palatnik, Allen et al. 2003; Vazquez, Vaucheret et al. 2004).
Therefore, transcribed MULE transduplicates possess features that could allow them to silence host genes. Interestingly, categories of proteins that predominate among known plant miRNA targets, such as transcription factors involved in development and proteins involved in ubiquitination and sulfur assimilation (Bonnet, Wuyts et al. 2004), are also frequently found among sources of expressed transduplicates (Supplemental Table 2-6).

Transcription of MULE transduplicates is likely to be differentially regulated, since transduplicates exist in a different regulatory context from their corresponding host genes. The eukaryotic TATA box is usually found 20 to 35 bp upstream of the transcription start site, and in most plant promoters, functionally important cis-regulatory elements reside within 300 bp upstream of the transcription start site (Guilfoyle 1997). Since the transcription start site for 153 of 235 transcripts that initiate within a MULE sequence is located <500 bp from the element terminus, the promoter is likely located within the TIR. In fact, the TIRs of the maize MuDR element harbor promoters for mudrA and mudrB (Raizada, Benito et al. 2001), and while the TIR sequences of rice and maize MULEs are divergent, the majority of rice MULE TIRs contain putative TATA boxes and other promoter elements. In addition, transposition of MULEs with transduplicates may bring them into proximity to other cis-regulatory sequences.

In summary, our study confirms that MULEs mediate the duplication of host gene fragments, but we found no evidence that such transduplicates may frequently evolve into novel functional protein-coding genes. All expressed transduplicates with conserved domains have disablements preventing them from encoding functional proteins, and their dN/dS ratios do not indicate that they have been evolving under purifying selection. However, we cannot rule out the possible existence of ancient protein-coding transduplicates that are no longer recognizable as MULEs. Of more significance may be the fact that transduplicates constitute a large source of pseudogenes with characteristics suggestive of a role in regulating host gene expression, but further analysis is needed to evaluate this possibility.
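As a rough illustration of two of the coding-sequence disablements discussed in this section, the sketch below scans a nucleotide sequence, read in a given frame, for premature stop codons and flags a length that is not a multiple of three. It is a simplified, hypothetical check (the function name and example sequences are invented for illustration) and is not the procedure used in the study, which relied on comparisons with host genes, cDNA ORFs, and conserved domains.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(seq, frame=0):
    """Report simple coding disablements in a DNA sequence read in one frame.

    Returns a dict with:
      premature_stops: 0-based nucleotide positions of in-frame stop codons,
                       excluding a stop that occupies the final complete codon
      partial_codon:   True if the in-frame length is not a multiple of three
    """
    seq = seq.upper()[frame:]
    n_codons = len(seq) // 3
    stops = []
    for i in range(n_codons):
        codon = seq[3 * i: 3 * i + 3]
        is_last = (i == n_codons - 1)
        if codon in STOP_CODONS and not is_last:
            stops.append(frame + 3 * i)
    return {"premature_stops": stops, "partial_codon": len(seq) % 3 != 0}


intact = "ATGGCTGAAACCTAA"    # five codons ending in a stop
disabled = "ATGGCTTGAACCTAA"  # TGA stop introduced at the third codon
shifted = "ATGGCTGAACCTAA"    # one base deleted, leaving a partial codon
for name, s in [("intact", intact), ("disabled", disabled), ("shifted", shifted)]:
    print(name, find_disablements(s))
```

A real frameshift would normally be detected from an alignment to the parental gene rather than from sequence length alone, so the `partial_codon` flag should be read only as a crude proxy.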

Methods

MULE detection

We used HMMER (Eddy 1998) and RepeatMasker (http://www.repeatmasker.org) to identify sequences of the rice genome [Oryza sativa, ssp. japonica; (IRGSP 2005) Build 2.0 pseudomolecules; http://rgp.dna.affrc.go.jp/IRGSP/Build2/build2.html; DDBJ accessions AP006867-AP006877 and NCBI accession AE016959] similar to known MULE families with autonomous members (Juretic, Bureau et al. 2004). We tentatively identified as MULEs pairs of sequences similar to MULE termini of the same family, with inverted orientation and separated by <30 kbp. We retained MULEs longer than 10 kbp only if manual inspection of their TIRs, repetitiveness, and TSDs supported their authenticity. We built custom software to identify TSDs located no more distant than 17 bp from the termini with at most two mismatches, rejecting MULEs lacking TSDs. From among overlapping MULEs, a single element was selected based on TSD quality (located nearest the predicted termini with fewest mismatches); nesting was permitted. We detected additional MULEs using BLASTN (Altschul, Gish et al. 1990) searches of 10 kbp sequences flanking either side of mudrA-like genes not associated with a MULE in our data set. To estimate false positives, 140 MULEs were manually inspected; 136 (97%) were authentic. To estimate false negatives, we mapped 18 independently identified MULEs (Jiang, Bao et al. 2004); 15 mapped to identical positions as MULEs in our data set; 3 (17%) were absent. To determine MULE density in euchromatic and heterochromatic regions, pericentromeric regions were defined by mapping marker positions to regions with reduced recombination surrounding the centromere of each chromosome, defined by comparing genetic to physical distances (S.I. Wright, pers. comm.).
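The TSD criterion described above can be illustrated with a minimal Python sketch. It is not the custom software used in the study: the function names, the 9-bp TSD length, and the restriction to offsets pointing outward from the predicted termini are assumptions made for illustration. Only the 17-bp distance limit, the two-mismatch tolerance, and the preference for TSDs nearest the termini with fewest mismatches come from the text.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_tsd(seq, start, end, tsd_len=9, max_offset=17, max_mismatch=2):
    """Search for a target site duplication flanking the element seq[start:end].

    Returns (left_offset, right_offset, mismatches) for the best candidate,
    preferring candidates nearest the predicted termini and then those with
    the fewest mismatches. Returns None if nothing passes the thresholds.
    tsd_len=9 is an assumed value, not taken from the rice methods.
    """
    best = None
    for left_off in range(max_offset + 1):
        l_end = start - left_off
        l_start = l_end - tsd_len
        if l_start < 0:
            break  # ran off the left edge of the sequence
        left = seq[l_start:l_end]
        for right_off in range(max_offset + 1):
            r_start = end + right_off
            r_end = r_start + tsd_len
            if r_end > len(seq):
                break  # ran off the right edge of the sequence
            mm = hamming(left, seq[r_start:r_end])
            if mm > max_mismatch:
                continue
            key = (left_off + right_off, mm)
            if best is None or key < best[0]:
                best = (key, (left_off, right_off, mm))
    return None if best is None else best[1]
```

A candidate MULE for which `find_tsd` returns `None` would be rejected under the rule above; among overlapping candidates, the returned tuple gives a natural ranking key.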

Gene prediction

Gene models for Build 2.0 pseudomolecules were predicted using FGENESH (Salamov and Solovyev 2000) with monocot-trained settings. Models were rejected if the ORFs were incomplete due to missing initiation or termination codons, if they encoded proteins shorter than 50 amino acids, or if they contained ambiguous sequence (represented by the symbol “N”) (T. Sasaki, pers. comm.). Transposon-related predicted genes (PGs) were identified by the following four methods and rejected: (1) BLASTN against a comprehensive rice transposon nucleotide sequence database (TIGR Oryza Repeats.v2; E < 10^-10; Ouyang and Buell 2004; T. Sasaki, pers. comm.; http://www.tigr.org/tdb/e2k1/osa1/blastsearch.shtml); (2) BLASTP of translated PG coding sequences (CDS) against a custom library of transposon proteins (NCBI accessions BAC15617.2, AAK27803.1, NP_920829.1, BAC15617.2, BAC15618.2, AAA70218.1, AAA70219.1, AAA70220.1, AAA66266.1, AAA66267.1, AAA66268.1, AAA21566.1, AAA21567.1, CAA25498.1, CAA25499.1, CAA26444.1, CAA26445.1, CAA27457.1, CAA27458.1, CAA27459.1, CAA29005.1, CAA29006.1, CAA36480.1); (3) overlap with a MULE or MULE fragment in our data set; (4) BLASTN of transduplicate clusters of more than 20 PGs or 40 MULEs (see Transduplicate detection, below).

cDNA mapping

We used BLAT (Kent 2002) to map rice full-length cDNAs (Kikuchi, Satoh et al. 2003) to Build 2.0 pseudomolecules. We aligned 29,364 of 32,127 cDNAs (91%) by selecting, for each cDNA, the best alignment with at least 99% sequence identity.
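The selection step can be sketched as follows. This is an illustrative Python fragment, not the pipeline used in the study: the `Hit` record, its field names, and the use of matched bases as the criterion for the "best" alignment are assumptions; only the 99% identity threshold comes from the text.

```python
from collections import namedtuple

# Hypothetical record for one cDNA-to-genome alignment.
Hit = namedtuple("Hit", "cdna chrom start end identity matches")


def best_hits(hits, min_identity=0.99):
    """Keep, for each cDNA, the alignment with the most matched bases
    among those meeting the identity threshold."""
    best = {}
    for h in hits:
        if h.identity < min_identity:
            continue  # below the 99% identity cut-off
        if h.cdna not in best or h.matches > best[h.cdna].matches:
            best[h.cdna] = h
    return best


# Fraction of cDNAs mapped, using the counts reported in the text.
mapped_fraction = 29364 / 32127
```

Under these assumptions a cDNA with no alignment at or above the threshold is simply absent from the result, which corresponds to the unmapped cDNAs in the reported counts.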

Transduplicate detection

Transduplicates were identified by BLASTN (E < 10^-37) of MULE internal sequences, trimmed by 200 bp at each end to partially remove TIRs, against non-transposon-related PG CDS (Supplemental Table 2-2). We then identified transduplicates corresponding to expressed PGs, defined as CDS overlapping a mapped cDNA (Supplemental Table 2-3). Because some transduplicated regions shared significant similarity to more than one PG, PGs with overlapping (>30 bp) BLASTN high-scoring segment pairs (HSPs) to one or more MULEs were clustered. Eighty percent of the clusters contained only a single PG, and 99.9% contained 10 or fewer. PG clusters occurring in only one MULE were interpreted as single, independent transduplication events, while PG clusters occurring in more than one MULE were interpreted as being descended by replicative transposition of a common transduplicate-containing ancestral MULE. In addition, individual MULEs containing multiple PG clusters were interpreted as resulting from multiple separate transduplication events.
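The clustering rule can be made concrete with a small single-linkage sketch in Python. It assumes that HSPs are compared in MULE coordinates and that any pair overlapping by more than 30 bp on the same MULE links the two PGs; the input tuple layout and function name are hypothetical, and the study's actual implementation may have differed.

```python
from itertools import combinations


def cluster_pgs(hsps, min_overlap=30):
    """Cluster predicted genes (PGs) whose HSPs overlap on a MULE.

    hsps: iterable of (mule_id, start, end, pg_id), with `end` exclusive
    and coordinates given on the MULE. Returns a list of sets of PG ids.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    by_mule = {}
    for mule, s, e, pg in hsps:
        find(pg)  # register every PG, including singletons
        by_mule.setdefault(mule, []).append((s, e, pg))

    for spans in by_mule.values():
        for (s1, e1, p1), (s2, e2, p2) in combinations(spans, 2):
            if min(e1, e2) - max(s1, s2) > min_overlap:
                union(p1, p2)

    clusters = {}
    for pg in list(parent):
        clusters.setdefault(find(pg), set()).add(pg)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Counting the distinct MULEs that contribute HSPs to each resulting cluster would then separate clusters found in a single MULE from those shared by several MULEs.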

Conserved domain detection

Transduplicates containing conserved protein domains (CD) were located by querying the NCBI Conserved Domain Database (CDD) (Marchler-Bauer, Anderson et al. 2005) with 6-frame translations of each MULE (Supplemental Table 2-4). Five mudrA conserved domains (smart00575, pfam03108, pfam00872, pfam04434, and COG3328) were identified by querying Zea mays mudrA (NCBI accession M76978) against the CDD and excluded from consideration as transduplicates. Additional transposon-related CDs were identified by examining CD descriptions and also excluded.
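For readers unfamiliar with the term, a 6-frame translation can be sketched in a few lines of Python. The example below uses the standard genetic code and is only an illustration of the concept; the function names are hypothetical, codons containing ambiguous bases are rendered as "X" by assumption, and the study does not state which tool generated its translations.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the six conceptual translations keyed '+1'..'+3', '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Each of the six strings would be submitted as a separate protein query, so a domain interrupted by a frameshift tends to appear as partial hits in two different frames.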

Coding-sequence disablement and truncation

We identified 66 MULEs containing 72 high-stringency transduplicates (Supplemental Table 2-6), defined as a transduplicate that overlaps a cDNA and contains a CD or has a corresponding host PG that also overlaps a cDNA. For each high-stringency transduplicate, we verified the TSDs, TIRs, and alignments to conserved domains, PGs and cDNAs. cDNAs overlapping high-stringency transduplicates were compared with known non-fragmentary SWISS-PROT proteins (Bairoch, Apweiler et al. 2005) and known SCOP protein domains (Chandonia, Hon et al. 2004) to test for frameshifts and premature stop codons using a modification of procedures described previously (Harrison, Hegyi et al. 2002; Zhang, Harrison et al. 2003) (Supplemental Tables 2-6 & 2-7). We also identified transduplicate cDNAs with significant protein-domain truncations, where protein-domain mappings were shorter than 50% of the domain length (Harrison, Carriero et al. 2003).
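The selection and truncation criteria above can be restated as simple predicates. The Python sketch below reflects one reading of the definition, namely that a high-stringency transduplicate must overlap a cDNA and, in addition, either contain a CD or have an expressed host PG. The function and argument names are hypothetical, and the stop-codon counter is a simplification of the disabled-homology procedures cited in the text.

```python
def is_high_stringency(overlaps_cdna, contains_cd, host_pg_overlaps_cdna):
    """Assumed reading: expressed AND (has a CD OR host PG is expressed)."""
    return overlaps_cdna and (contains_cd or host_pg_overlaps_cdna)


def is_truncated(mapped_len, domain_len, min_fraction=0.5):
    """True if the domain mapping covers less than 50% of the domain length."""
    return mapped_len < min_fraction * domain_len


def count_premature_stops(protein):
    """Count stop symbols ('*') occurring before the final residue."""
    return protein[:-1].count("*")
```

For example, a mapping of 40 residues onto a 100-residue domain would be flagged as truncated, whereas a mapping of exactly 50 residues would not.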

Synonymous and nonsynonymous substitution rate

We used BL2SEQ (Tatusova and Madden 1999) to align PG cDNA ORFs to corresponding high-stringency transduplicates. Forty-seven alignments longer than 150 bp (87%-96% identity) were manually curated by joining HSPs (E < 10^-4) on the correct strand to eliminate frameshifts and redundancies. dN/dS ratios were calculated using PAML CODONML (runmode=-2, CodonFreq=0) (Yang 1997) (Supplemental Table 2-6).
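As a rough guide to how the two reported settings fit into a pairwise run, the Python sketch below writes a minimal control file for the PAML codon program. Only `runmode = -2` and `CodonFreq = 0` are taken from the text; the file names and the remaining settings (`seqtype`, `icode`, `noisy`, `verbose`) are assumptions chosen for a plain pairwise codon analysis, and the helper function itself is hypothetical.

```python
def codeml_pairwise_ctl(seqfile, outfile):
    """Return the text of a minimal control file for pairwise dN/dS."""
    settings = {
        "seqfile": seqfile,   # codon alignment of one PG/transduplicate pair
        "outfile": outfile,
        "noisy": 0,
        "verbose": 0,
        "runmode": -2,        # pairwise comparison (from the text)
        "seqtype": 1,         # codon sequences (assumed)
        "CodonFreq": 0,       # equal codon frequencies (from the text)
        "icode": 0,           # universal genetic code (assumed)
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"
```

Writing the returned string to a file and passing that file to the program would, under these assumptions, produce one dN/dS estimate per aligned pair.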

Acknowledgments

This study was supported by operating and genomics grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada to T.E.B. and P.M.H. and a grant from the McGill University William Dawson Scholar Chair to T.E.B. The authors thank Ben Burr, Takuji Sasaki, and Takashi Matsumoto for access to the IRGSP sequence and annotation data set, Rebecca Cowan, Daniel Schoen, and Stephen Wright for critical reading of the manuscript, and Stephen Wright for analysis of pericentromeric locations.

Supplemental Material

[Figure panels: Chromosome 1 through Chromosome 12, one plot per chromosome. x-axis: position (Mbp). Primary y-axis: local density (per 100 kbp). Secondary y-axis: regional density (per Mbp, or per scaled interval for MULEs with transduplicates). Scaled intervals given in the axis titles: chromosome 1, 5.8 Mbp; 2, 5.9 Mbp; 3, 6.5 Mbp; 4, 6.6 Mbp; 5, 5.6 Mbp; 6, 6.1 Mbp; 7, 6.4 Mbp; 8, 6.2 Mbp; 9, 6.3 Mbp; 10, 6.6 Mbp; 11, 5.9 Mbp; 12, 5.3 Mbp.]

Supplemental Figure 2-1a. (Preceding pages.) The local and regional density of MULEs with transduplicates and all MULEs is shown. MULEs with transduplicates are shown in white, all MULEs are shown in black. Local (short distance) density is the number of subjects in a 100 kbp long window centered at the indicated position on the chromosome, and plotted as stacked columns on the primary y-axis. Regional (long distance) density is the average local density over a 2.1 Mbp sliding window, centered at the indicated position, plotted as lines on the secondary y-axis. The regional density of MULEs with transduplicates is scaled by the factor indicated in the axis title (e.g. Chromosome 1 has a scaling factor of 5.8; see Supplemental Fig. 2-1c for calculations) to permit direct comparison of the lines. The greater variability of the density of MULEs with transduplicates compared to all MULEs is due to the relatively low density of these MULEs (decreased sample size). A higher density of MULEs in the arms than in the pericentromeric region is apparent (see also Supplemental Fig. 2-1d) but there is no obvious systematic difference between the two distributions.
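The two density measures defined in the caption can be sketched as follows. This Python fragment is illustrative only: it uses non-overlapping 100 kbp bins for local density and a centered moving average of 21 bins (2.1 Mbp) for regional density, truncating the window at chromosome ends. The binning scheme, the edge handling, and the function names are assumptions, since the caption does not specify them.

```python
def local_density(positions, chrom_len, window=100_000):
    """Count elements per window. Positions are in bp and < chrom_len."""
    n_windows = -(-chrom_len // window)  # ceiling division
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    return counts


def regional_density(local, span=21):
    """Centered moving average over `span` windows (21 x 100 kbp = 2.1 Mbp).

    Near chromosome ends the window is truncated rather than padded.
    """
    half = span // 2
    out = []
    for i in range(len(local)):
        lo, hi = max(0, i - half), min(len(local), i + half + 1)
        out.append(sum(local[lo:hi]) / (hi - lo))
    return out


def scale(values, factor):
    """Apply a per-chromosome scaling factor (e.g. 5.8 for chromosome 1)."""
    return [v * factor for v in values]
```

Applying `scale` to the regional density of MULEs with transduplicates places that line on the same axis as the line for all MULEs, which is the comparison the caption describes.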

Supplemental Figure 2-2. The dN/dS distribution of expressed transduplicates for supported predicted genes.

Note: The following supplemental data are too large to be included here, but may be downloaded from the Genome Research journal website at http://genome.cshlp.org/content/15/9/1292/suppl/DC1.

Supplemental Figure 2-1b. Local and regional density.

Supplemental Figure 2-1c. Scaling factors.

Supplemental Figure 2-1d. The copy number and density of MULEs with transduplicates and all MULEs in the pericentromeric regions and chromosome arms.

Supplemental Table 2-1. Rice MULEs.

Supplemental Table 2-2. MULEs with regions of high sequence similarity to predicted genes.

Supplemental Table 2-3. MULEs with regions of high sequence similarity to predicted genes supported by full-length cDNAs.

Supplemental Table 2-4. MULEs containing conserved domains.

Supplemental Table 2-5. MULEs overlapping mapped full-length cDNAs.

Supplemental Table 2-6. MULEs with expressed transduplicates of supported genes.

Supplemental Table 2-7. Identification of coding sequence disruption in high stringency transduplicates by comparison to known protein and protein domain sequences.

References

Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." Journal of molecular biology 215(3): 403-410.
Bairoch, A., R. Apweiler, et al. (2005). "The Universal Protein Resource (UniProt)." Nucleic acids research 33(Database issue): D154-159.
Balakirev, E. S. and F. J. Ayala (2003). "Pseudogenes: are they 'junk' or functional DNA?" Annual review of genetics 37: 123-151.
Bartel, D. P. (2004). "MicroRNAs: genomics, biogenesis, mechanism, and function." Cell 116(2): 281-297.
Bonnet, E., J. Wuyts, et al. (2004). "Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes." Proc. Natl. Acad. Sci. USA 101(31): 11511-11516.
Bureau, T. E., S. E. White, et al. (1994). "Transduction of a cellular gene by a plant retroelement." Cell 77(4): 479-480.
Bustamante, C. D., R. Nielsen, et al. (2002). "A maximum likelihood method for analyzing pseudogene evolution: implications for silent site evolution in humans and rodents." Molecular biology and evolution 19(1): 110-117.
Chalvet, F., C. Grimaldi, et al. (2003). "Hop, an active Mutator-like element in the genome of the fungus Fusarium oxysporum." Molecular biology and evolution 20(8): 1362-1375.
Chandonia, J. M., G. Hon, et al. (2004). "The ASTRAL Compendium in 2004." Nucleic acids research 32(Database issue): D189-192.
Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-763.
Guilfoyle, T. J. (1997). In Genetic Engineering. J. K. Setlow. New York, Plenum Press: 15-47.
Harrison, P. M., N. Carriero, et al. (2003). "A 'polyORFomic' analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs." Journal of molecular biology 333(5): 885-892.
Harrison, P. M., H. Hegyi, et al. (2002). "Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22." Genome research 12(2): 272-280.
Hirotsune, S., N. Yoshida, et al. (2003). "An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene." Nature 423(6935): 91-96.
IRGSP [International Rice Genome Sequencing Project] (2005). "The map-based sequence of the rice genome." Nature 436(7052): 793-800.
Jiang, N., Z. Bao, et al. (2004). "Pack-MULE transposable elements mediate gene evolution in plants." Nature 431(7008): 569-573.
Jørgensen, F. G., A. Hobolth, et al. (2005). "Comparative analysis of protein coding sequences from human, mouse and the domesticated pig." BMC biology 3: 2.
Juretic, N., T. E. Bureau, et al. (2004). "Transposable element annotation of the rice genome." Bioinformatics 20(2): 155-160.
Kawasaki, S. and E. Nitasaka (2004). "Characterization of Tpn1 family in the Japanese morning glory: En/Spm-related transposable elements capturing host genes." Plant & cell physiology 45(7): 933-944.
Kazazian, H. H., Jr. (2004). "Mobile elements: drivers of genome evolution." Science 303(5664): 1626-1632.
Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome research 12(4): 656-664.
Kikuchi, S., K. Satoh, et al. (2003). "Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice." Science 301(5631): 376-379.
Korneev, S. A., J. H. Park, et al. (1999). "Neuronal expression of neural nitric oxide synthase (nNOS) protein is suppressed by an antisense RNA transcribed from an NOS pseudogene." The Journal of neuroscience 19(18): 7711-7720.
Le, Q. H., S. Wright, et al. (2000). "Transposon diversity in Arabidopsis thaliana." Proc. Natl. Acad. Sci. USA 97(13): 7376-7381.
Li, W.-H., T. Gojobori, et al. (1981). "Pseudogenes as a paradigm of neutral evolution." Nature 292(5820): 237-239.
Lim, J. K. and M. J. Simmons (1994). "Gross chromosome rearrangements mediated by transposable elements in Drosophila melanogaster." BioEssays 16(4): 269-275.
Lurin, C., C. Andrés, et al. (2004). "Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis." The Plant cell 16(8): 2089-2103.
Marchler-Bauer, A., J. B. Anderson, et al. (2005). "CDD: a Conserved Domain Database for protein classification." Nucleic acids research 33(Database issue): D192-196.
Meister, G. and T. Tuschl (2004). "Mechanisms of gene silencing by double-stranded RNA." Nature 431(7006): 343-349.
Ouyang, S. and C. Buell (2004). "The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants." Nucleic acids research 32(Database issue): D360-D363.
Palatnik, J. F., E. Allen, et al. (2003). "Control of leaf morphogenesis by microRNAs." Nature 425(6955): 257-263.
Raizada, M. N., M. I. Benito, et al. (2001). "The MuDR transposon terminal inverted repeat contains a complex plant promoter directing distinct somatic and germinal programs." The Plant journal 25(1): 79-91.
Ralston, E., J. English, et al. (1989). "Chromosome-breaking structure in maize involving a fractured Ac element." Proc. Natl. Acad. Sci. USA 86(23): 9451-9455.
Salamov, A. A. and V. V. Solovyev (2000). "Ab initio gene finding in Drosophila genomic DNA." Genome research 10(4): 516-522.
Talbert, L. E. and V. L. Chandler (1988). "Characterization of a highly conserved sequence related to mutator transposable elements in maize." Molecular biology and evolution 5(5): 519-529.
Tatusova, T. A. and T. L. Madden (1999). "BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences." FEMS microbiology letters 174(2): 247-250.
Torrents, D., M. Suyama, et al. (2003). "A genome-wide survey of human pseudogenes." Genome research 13(12): 2559-2567.
Turcotte, K., S. Srinivasan, et al. (2001). "Survey of transposable elements from rice genomic sequences." The Plant journal 25(2): 169-179.
Vazquez, F., H. Vaucheret, et al. (2004). "Endogenous trans-acting siRNAs regulate the accumulation of Arabidopsis mRNAs." Molecular cell 16(1): 69-79.
Waterston, R. H., K. Lindblad-Toh, et al. (2002). "Initial sequencing and comparative analysis of the mouse genome." Nature 420(6915): 520-562.
Yang, Z. (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood." Computer applications in the biosciences 13(5): 555-556.
Yu, Z., S. I. Wright, et al. (2000). "Mutator-like elements in Arabidopsis thaliana. Structure, diversity and evolution." Genetics 156(4): 2019-2031.
Zhang, Z., P. M. Harrison, et al. (2003). "Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome." Genome research 13(12): 2541-2558.

Bridge to Chapter Three

In Chapter Two, I examined transduplication, or the large-scale copying of fragments of ordinary genes into TEs, in rice MULEs. Contrary to a previous, widely publicized report, these sequences do not appear to produce functional proteins, although they may have regulatory functions. In Chapter Three, I extend my analysis of transduplication in MULEs from rice, the first sequenced monocot, to A. thaliana, the first sequenced eudicot. Analysis of A. thaliana MULE transduplicates reveals a large gene family, called Kaonashi (KI), that appears to have originated through transduplication but is conserved within TEs. The conservation of KI in active TEs shows that transduplication does occasionally produce functional gene duplications; however, at least in this case, the result is not a new ordinary gene, but a new TE gene.

Chapter Three is a reproduction of the following published article: Hoen DR, Park KC, Elrouby N, Yu Z, Mohabir N, Cowan RK, Bureau TE. 2006. Transposon-mediated expansion and diversification of a family of ULP-like genes. Mol Biol Evol. 23(6):1254-68.

Chapter Three: Transposon-mediated expansion and diversification of a family of ULP-like genes

Abstract

Transposons comprise a major component of eukaryotic genomes, yet it remains controversial whether they are merely genetic parasites or instead significant contributors to organismal function and evolution. In plants, thousands of DNA transposons were recently shown to contain duplicated cellular gene fragments, a process termed transduplication. Although transduplication is a potentially rich source of novel coding sequences, virtually all appear to be pseudogenes in rice. Here we report the results of a genome-wide survey of transduplication in Mutator-like elements (MULEs) in Arabidopsis thaliana, which shows that the phenomenon is generally similar to rice transduplication, with one important exception: KAONASHI (KI). A family of more than 97 potentially functional genes and apparent pseudogenes, evidently derived at least 15 MYA from a cellular small ubiquitin-like modifier–specific protease gene, KI is predominantly located in potentially autonomous non–terminal inverted repeat MULEs and has evolved under purifying selection to maintain a conserved peptidase domain. Similar to the associated transposase gene but unlike cellular genes, KI is targeted by small RNAs and silenced in most tissues but has elevated expression in pollen. In an Arabidopsis double mutant deficient in histone and DNA methylation with elevated KI expression compared to wild type, at least one KI-MULE is mobile. The existence of KI demonstrates that transduplicated genes can retain protein-coding capacity and evolve novel functions. However, in this case, our evidence suggests that the function of KI may be selfish rather than cellular.

Introduction

Transposons are abundant constituents of all eukaryotic genomes. Although they are considered “selfish DNA” because they survive not by phenotypic selection but through self-replication, mounting evidence indicates that transposons contribute by a variety of mechanisms to the function and evolution of host genomes (Makalowski 2003; Kazazian 2004). For instance, eukaryotic transposons are able to duplicate and mobilize cellular gene sequences, potentially contributing to creative mutagenic processes like exon shuffling and gene duplication. Cellular genes flanking the 3′ termini of retrotransposons can be duplicated by readthrough transcription initiated from the element, reverse transcription of mature mRNA, and insertion into genomic DNA (Moran, DeBerardinis, and Kazazian 1999; Elrouby and Bureau 2001). By a different but unknown mechanism, DNA transposons can incorporate unspliced fragments of unlinked cellular genes between the transposon termini in a process called transduplication (Hoen et al. 2005).

Mutator elements were first discovered in maize, and Mutator-like elements (MULEs) constitute a diverse superfamily of DNA transposons in plants, fungi, and prokaryotes (Le et al. 2000; Yu, Wright, and Bureau 2000; Turcotte, Srinivasan, and Bureau 2001; Lisch 2002; Chalvet et al. 2003; Neuveglise et al. 2005). Autonomous MULEs contain a mudrA gene which encodes a transposase required for mobility. In most organisms, MULEs have long, high-identity terminal inverted repeats (TIRs), but roughly one-third of Arabidopsis MULEs do not. It was not previously known whether non-TIR MULEs are capable of transposition (Le et al. 2000; Yu, Wright, and Bureau 2000).

Maize Mutator elements were first observed to contain insertions of non-Mutator DNA (Chandler, Rivin, and Walbot 1986), and transduplication has subsequently been documented in Arabidopsis MULEs (Le et al. 2000; Yu, Wright, and Bureau 2000), CACTA elements in Japanese morning glory (Kawasaki and Nitasaka 2004) and soybean (Zabala and Vodkin 2005), as well as maize rolling circle elements (Morgante et al. 2005). Two genome-wide studies recently identified over 1,300 duplicated gene fragments formed by transduplication (i.e., transduplicates) in rice MULEs (Jiang et al. 2004; Hoen et al. 2005), but despite their abundance all existing transduplicates in rice appear to be pseudogenes (Hoen et al. 2005), and there is, to our knowledge, no documented case of a transduplicated gene encoding a functional protein. It has also yet to be determined whether transduplicates have other functions, such as the provision of sequence reservoirs for gene conversion or the generation of small RNAs (sRNAs) which might participate in RNA-mediated silencing of paralogous cellular genes.

Here we report the results of a genome-wide survey of MULE-mediated transduplication in Arabidopsis and the discovery of one highly unusual case. We find that the Arabidopsis genome contains at least 97 sequences containing strong similarity to peptidase C48, a conserved domain (CD) found exclusively in ubiquitin-like protein–specific protease (ULP)–like genes. Most of these sequences are located in predicted genes, but our analysis indicates that only eight are cellular ULP-like (AtULP) genes and the remainder form a unique family of transduplicates located in non-TIR MULEs which we name KAONASHI (KI; named after the mysterious character in Hayao Miyazaki's animated film “Spirited Away” who is split between conflicting worlds and identities). We contrast the characteristics of KI with other transduplicates; examine KI phylogeny, conservation, and age; compare KI expression patterns to those of mudrA and cellular AtULP genes; and investigate KI-MULE mobility and transcriptional silencing. We conclude by arguing that KI is a functional and possibly selfish gene family and discuss its significance in understanding the evolutionary forces underlying the phenomenon of transduplication and transposon evolution in general.

Materials and Methods

Sequences

The Institute for Genomic Research (TIGR) Arabidopsis thaliana (Columbia-0) (hereafter referred to simply as Arabidopsis) genome sequences and annotations version 5.0 (TIGR5) were accessed both locally and online at http://www.tigr.org (AGI 2000; Haas et al. 2005). Pseudochromosome sequences were downloaded from http://www.tigr.org in June 2004.

Identification of MULEs, Transduplicates, and Peptidase C48

MULEs are characterized by long terminal sequences (greater than 100 bp) which are conserved within families, flanked by short direct repeats (9–11 bp) called target site duplications (TSDs) which are generated on insertion. MULE internal sequences are variable even between closely related elements due to deletions, insertions (including transduplications), and substitutions and also because only some MULEs contain a mudrA gene. We modified a previously described automated in silico procedure (Hoen et al. 2005) to identify MULEs in the TIGR5 Arabidopsis genome sequence, augmented with extensive manual curation, briefly described here. We compiled a library of terminal sequences (100 bp in length) from a set of previously characterized MULEs (Le et al. 2000; Yu, Wright, and Bureau 2000) and additional MULEs identified through manual curation. Using these as queries, we identified a complete set of putative MULE termini through similarity searches with the genome sequence using WU-Blast (version 2.0 release March 27, 2004; http://blast.wustl.edu) with MASKERAID (Bedell, Korf, and Gish 2000) as a search engine for REPEATMASKER (version 2004/03/06; options: nolow, nocut, no_is, s; http://www.repeatmasker.org). Pairs of termini from the same family, with correct orientation, separated by less than 30 kbp were matched, and the immediate flanking region was searched for 9-bp TSDs. Matched pairs of termini flanked by TSDs with at most two mismatched base pairs were considered to be intact MULEs. Each MULE subsequently found to contain a transduplication or peptidase C48 sequence (see below) was verified by manual inspection of its termini, repetitiveness, and TSDs. More than two TSD mismatches were permitted, subject to manual inspection, if the MULE contained a peptidase C48 sequence or a transduplicate. MULEs were categorized as TIR or non-TIR according to the previously described library, in which TIR MULEs were defined as having at least 60% nucleotide identity in an alignment between the 5′ terminal 100 bp and the reverse complement of the 3′ terminal 100 bp (Yu, Wright, and Bureau 2000). We then identified additional MULEs by searching the regions surrounding mudrA sequences not contained in MULEs in this data set and iterating the above procedure. The distribution of MULEs on Arabidopsis chromosomes was visualized using the Nottingham Arabidopsis Stock Centre Arabidopsis Ensemble KaryoView tool (http://genome.arabidopsis.info).

To locate candidate transduplicates, we searched for cellular CDs within the MULEs by identifying all CDs and then filtering out CDs which are found in MULEs or other transposons. Because many transduplicates will probably not contain a CD, this method is expected to have a high false negative rate; however, it has the advantage of a low false positive rate (see Discussion). Putative transduplicated CDs were identified by querying the National Center for Biotechnology Information (NCBI) CD-Search (default settings; accessed July 2004; Marchler-Bauer and Bryant 2004) with six-frame conceptual translations of the MULE sequences. Transposon-related CDs were filtered out and ambiguous CDs, which may have been derived either from a transposon or a cellular gene, were also filtered out unless adjacent to an unambiguous cellular CD.
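The TIR versus non-TIR criterion described above (at least 60% identity between the 5′ terminal 100 bp and the reverse complement of the 3′ terminal 100 bp) can be illustrated with a short Python sketch. It is a simplification: identity is computed position by position without gaps, whereas the cited definition refers to an alignment, so elements with indels in their termini would be scored lower here. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (other symbols are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_identity(element, n=100):
    """Ungapped identity between the first n bp of a non-empty element and
    the reverse complement of its last n bp."""
    five = element[:n].upper()
    three_rc = reverse_complement(element[-n:]).upper()
    matches = sum(a == b for a, b in zip(five, three_rc))
    return matches / len(five)


def classify(element, threshold=0.60):
    """Label an element 'TIR' or 'non-TIR' using the 60% identity rule."""
    return "TIR" if tir_identity(element) >= threshold else "non-TIR"
```

Replacing the position-by-position comparison with a pairwise alignment score would bring the sketch closer to the published definition.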
Excluding peptidase C48 (pfam02902), which was analyzed separately (see below), a putative cellular gene paralog for each candidate transduplicate was identified as the locus corresponding to the second best hit (best non–self-hit) in an NCBI BlastN (Altschul et al. 1990) search of the genome sequence.

A large number of MULEs were found to contain the peptidase C48 domain. To exhaustively locate all Arabidopsis sequences similar to peptidase C48, we employed two complementary methods: (1) CD-Search of TIGR5 annotated open reading frames (ORFs) and (2) PSI-TBlastN (version 2.2.8; Schaffer et al. 2001) search of the TIGR5 genome sequence, using as query a consensus of eight representative Arabidopsis peptidase C48 sequences computed by the NCBI Conserved Domain Database (gi|3377828, gi|5731755, gi|4309748, gi|3377837, gi|4678213, gi|3859612, gi|3080361, and gi|4733978). We ignored annotated exon and gene boundaries and extracted maximum length peptidase C48 sequences from genomic sequences at the positions identified in these searches and, because many sequences were apparent pseudogenes, we disregarded frameshifts and stop codons. Frameshifts were defined as two consecutive ORFs (putative exons) in different frames separated by a gap (putative intron) of less than 10 bp, a conservative threshold given that 99.9% of true Arabidopsis introns are longer than 50 bp (Yu et al. 2002). Frameshifts and premature stop codons in the peptidase C48 domain were counted. Cases where peptidase C48 sequences were not within a MULE were reexamined in an attempt to identify evidence of MULE-like sequences in the flanking genomic regions. Neighboring mudrA genes were identified based on TIGR5 annotations. The positions and distribution of peptidase C48 sequences were visualized using The Arabidopsis Information Resource (TAIR) Chromosome Map Tool (http://www.arabidopsis.org; Garcia-Hernandez et al. 2002). In four cases, where there was no TIGR5 locus at the position of the peptidase C48 sequence, we assigned ad hoc identifiers based on genomic positions (e.g., At4k0284 was used to identify a peptidase C48 sequence located at chromosome 4 in the region 28.4–28.5 Mbp).

Estimates of the number of peptidase C48 domains in various species were obtained from the Protein Families database (Pfam; version 19.0; Bateman et al. 2004). We also searched for KI-like sequences in preliminary Brassica oleracea genome sequence contigs (less than 0.5× coverage) using a TBlastN search on the TIGR Web site (http://www.tigr.org; accessed 10/2005) with a representative KI peptidase C48 domain (from At2g12100) as the query. We verified that the resulting B. oleracea sequences contained peptidase C48 using NCBI CD-Search.
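The frameshift definition given above (two consecutive ORFs in different frames separated by a gap of less than 10 bp) lends itself to a small worked example. The Python sketch below assumes that ORF segments are supplied as half-open genomic intervals on one strand and that a segment's frame is its start coordinate modulo 3; both conventions, and the function name, are illustrative assumptions rather than details taken from the study.

```python
def count_frameshifts(orfs, max_gap=10):
    """Count adjacent ORF pairs that look like frameshifts.

    orfs: list of (start, end) half-open intervals on the same strand.
    A pair counts as a frameshift if the gap between the segments is
    shorter than max_gap and the segments lie in different frames.
    """
    orfs = sorted(orfs)
    shifts = 0
    for (s1, e1), (s2, e2) in zip(orfs, orfs[1:]):
        gap = s2 - e1
        if 0 <= gap < max_gap and s1 % 3 != s2 % 3:
            shifts += 1
    return shifts
```

Under this rule, two segments separated by 70 bp are treated as exons flanking a putative intron rather than as a frameshift, even if their frames differ.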

Alignment, Phylogeny, and Conservation

3DCOFFEE (standalone T-COFFEE version 2.50; option: special-mode = 3dcoffee; O'Sullivan et al. 2004), a highly accurate multiple sequence alignment tool, was used to align all Arabidopsis peptidase C48 sequences with sequences from Drosophila melanogaster (Ulp1 gi|18860521:1318–1508), Saccharomyces cerevisiae (Ulp1 gi|6325237:433–617; Ulp2 gi|6322158:444–673), Schizosaccharomyces pombe (gi|2894265:377–564), Homo sapiens (Senp2 gi|54607091:397–586), and human adenovirus type 2 protein (HavULP; gi|34810217:50–144), as well as three-dimensional structure data from S. cerevisiae Ulp1 (Protein Data Bank [PDB] ID 1EUV; Mossessova and Lima 2000) and H. sapiens Senp2 (PDB ID 1TH0; Reverter and Lima 2004). Alignments were manually edited to correct the alignment of the invariant glutamine in HavULP and KI sequences as in Mossessova and Lima (2000). We used the Phylogeny Inference Package (version 3.6; http://evolution.genetics.washington.edu/phylip.html) programs SEQBOOT, PROTDIST, NEIGHBOR, and CONSENSE to construct an extended majority rule consensus tree from 1,000 bootstrap replicates. We also clustered both protein and DNA sequences using NCBI BLASTCLUST (standalone version 2.2.10; http://www.ncbi.nlm.nih.gov/BLAST) to define clusters at various levels of divergence. Both the phylogenetic and clustering analyses supported a division between eight previously documented cellular AtULP genes (Novatchkova et al. 2004), which are not located in MULEs, and the remaining sequences, namely KI genes, which are located in MULEs.

To prepare the alignment for synonymous substitution rate analysis, we made the following adjustments. Non-Arabidopsis sequences were removed. Genomic DNA sequences were substituted for corresponding amino acid sequences, stop codons were replaced with gaps, codons containing gaps in more than 15% of sequences were removed, and clusters of greater than 88% amino acid identity were pruned to one sequence. Two alignments were created, one that included cellular AtULP genes and a second that excluded them. TREEVIEW (version 1.6.6; Page 1996) was used in making corresponding edits to the consensus tree. The adjusted alignments and trees were used to estimate dN/dS (the ratio of non-synonymous to synonymous nucleotide substitutions per site) using the Phylogenetic Analysis by Maximum Likelihood package (PAML; version 3.14 release January 2004). BASEML was used to estimate initial branch lengths and CODEML was used for dN/dS calculations (default parameters except CodonFreq = 2, clock = 0, cleandata = 0, fix_blength = 1, and model, NSSites, and fix_omega as below). Overall dN/dS values for KI and AtULP peptidase C48 domains were calculated using the several-ratios branch model (model = 2, NSSites = 0; Yang 1998; Yang and Nielsen 1998) with one dN/dS value for KI and a second for AtULPs. KI clade-specific dN/dS values were calculated under the same model, using only KI sequences (no AtULP), two runs per clade (one run with fix_omega = 0, one with fix_omega = 1 and omega = 1), with one dN/dS value for the selected clade and a second for all remaining nodes. Likelihood ratio tests were used to determine whether clade-specific dN/dS values differed significantly from unity.

In addition to the above calculations, dN/dS was calculated for various other sequence subsets and tree branches (e.g., excluding outliers, including nearly identical sequences, including or excluding AtULP genes) using various models and parameters with similar results (data not shown).
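Two of the alignment adjustments described above (replacing stop codons with gaps and removing codon columns that contain gaps in more than 15% of sequences) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the procedure actually run: the function names are hypothetical, the input is assumed to be an in-frame codon alignment held in a dictionary, and the toy sequences are invented.

```python
from typing import Dict

STOP_CODONS = {"TAA", "TAG", "TGA"}


def mask_stops(seq: str) -> str:
    """Replace each in-frame stop codon with a three-column gap."""
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return "".join("---" if c.upper() in STOP_CODONS else c for c in codons)


def drop_gappy_codons(aln: Dict[str, str],
                      max_gap_fraction: float = 0.15) -> Dict[str, str]:
    """Mask stop codons, then remove codon columns in which more than
    max_gap_fraction of the sequences contain a gap character."""
    names = list(aln)
    masked = {n: mask_stops(aln[n]) for n in names}
    n_codons = min(len(s) for s in masked.values()) // 3
    keep = []
    for i in range(n_codons):
        column = [masked[n][3 * i:3 * i + 3] for n in names]
        gapped = sum("-" in codon for codon in column)
        if gapped / len(names) <= max_gap_fraction:
            keep.append(i)
    return {n: "".join(masked[n][3 * i:3 * i + 3] for i in keep) for n in names}


# Toy alignment (invented): the middle codon is a stop in "a" and a gap in
# "c", so 2 of 4 sequences are gapped there and the column is removed.
toy = {"a": "ATGTAAGGG", "b": "ATGCCCGGG", "c": "ATG---GGG", "d": "ATGCCCGGG"}
filtered = drop_gappy_codons(toy)
```

Masking stop codons before filtering means that a column dominated by premature stops is treated the same way as a column dominated by alignment gaps.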

Expression

We used a combination of data sources to investigate the expression of KI, mudrA, and cellular AtULP genes. We identified expressed sequence tag (EST) and full-length cDNA sequences using the Munich Information Center for Protein Sequences Arabidopsis thaliana Database (http://mips.gsf.de/proj/thal/db; Schoof et al. 2002) and TAIR (http://www.arabidopsis.org; Garcia-Hernandez et al. 2002), and we also searched for evidence of expression in the whole-genome tiled oligonucleotide array data of Yamada et al. (2003). Using GENEVESTIGATOR (https://www.genevestigator.ethz.ch; Zimmermann et al. 2004), we collected microarray measurements (ATH1 22k array, Columbia-0) in different tissues from a pooled and standardized database incorporating several large public databases (Craigon et al. 2004; Barrett et al. 2005; Parkinson et al. 2005). Finally, 17-bp mRNA and sRNA massively parallel signature sequencing (MPSS) data were extracted from the Arabidopsis MPSS database (http://mpss.udel.edu; Nakano et al. 2006).

Mobility

Plant Material

Mutant seeds were obtained from Tetsuji Kakutani (Department of Integrated Genetics, National Institute of Genetics, Japan). Plants were raised in a growth chamber under 24 h daylight at 22°C. Genomic DNA was isolated from leaf tissue using the DNeasy Plant Mini Kit (QIAGEN, Mississauga, Ontario, Canada).

Transposon Display Analysis

Transposon display (TD) was performed as described by Wright et al. (2001) with minor modifications. Genomic DNA (100 ng) was digested with 2.5 U BfaI (New England Biolabs, Beverly, Mass.) and ligated to 15 pmol adaptor cassettes (5′-TAGCAAGGAGAGGACGCTGTCTGTCGAAGGTAAGGAACGGACGAGAGAAGGGAGA-3′ and 5′-TCTTCCCTTCTCGAATCGTAACCGTTCGTACGAGAATCGCTGTCTCTCCTTGC-3′) with T4 DNA ligase (Invitrogen, Carlsbad, Calif.). The ligation reaction was diluted fourfold. A 3-μl aliquot of diluted ligation mixture was used as template for preselective amplification with MULE-specific primer MuP-1 (5′-GGTCAGTTTTTGGCT(G/T)AATGGCTAA-3′) and adaptor-specific primer ap-1 (5′-CGAATCGTAACCGTTCGTACGAGAATCGCT-3′) and the following polymerase chain reaction (PCR) conditions: initial denaturing of 10 min at 94°C; 20 cycles of 1 min at 94°C, 1 min at 55°C, and 1 min at 72°C; and final extension of 10 min at 72°C. PCR products were diluted 100-fold in MilliQ distilled water. A 3-μl aliquot of diluted PCR products was used as template for selective amplification under the same reaction conditions with nested element-specific primer MuP-2 (5′-AAAGTGGGTCAA(C/T)GGCTACTG-3′) and IRD700-labeled (LiCor, Lincoln, Nebr.) nested adaptor-specific primer ap-2 (5′-GTACGAGAATCGCTGTCCTC-3′). Five microliters of loading dye was added to the final amplification products, which were separated by size and visualized on a 5.5% denaturing polyacrylamide gel using a LiCor IR4200 DNA sequencer. LiCor IRD700-labeled DNA (50–700 bp) was used as molecular weight markers. Polymorphic DNA fragments were isolated from the gel, reamplified using the same primers (MuP-2 and ap-2), cloned into pCR 2.1 vectors (TA cloning kit, Invitrogen, Canada), and sequenced using a 3730xl DNA Analyzer (Applied Biosystems, Foster City, Calif.) at the McGill University and Genome Québec Innovation Centre (Montreal, Canada).

Reverse Transcriptase–PCR

Total RNA was isolated from floral tissue using the RNeasy Plant Mini Kit (QIAGEN, Germany) and treated with RNase-free DNase (QIAGEN) to eliminate contaminating DNA. cDNA was synthesized from 1 μg of total RNA using the SuperScript III First-Strand Synthesis System for reverse transcriptase–PCR (RT-PCR) following the manufacturer's instructions. PCR was performed using the following conditions: one cycle of 2 min at 94°C; 35 cycles of 30 s at 94°C, 30 s at 58°C, and 90 s at 68°C; and a final extension of 5 min at 68°C. PCR products were separated on a 1.5% agarose gel containing 0.4 μg/ml ethidium bromide. Forward (F) and reverse (R) primer sequences used for the amplification of KI-At2g12100 and mudrA-At2g12150 transcripts were as follows: UPF, 5′-GAAGAGAAATCGGTATGTCGTT-3′; UPR, 5′-GACGTTGCAGGCATATAGCT-3′; UP1F, 5′-GCAACTGGTAGTCTTCCTGTC-3′; UP1R, 5′-ACAACTTCTTCTTCAGGATTT-3′; UP2F, 5′-GTTCAGAAAGGACTTGGTGGA-3′; UP2R, 5′-ATTAAGGCATTCTCCCTTGGA-3′; MPF, 5′-GAACATGAGGACGAGGATAAC-3′; MPR, 5′-CTGTTGTCACTAGACCGTCA-3′. The actin gene ACT4 was used as control with primers Act4-U (5′-GCATAGAGTGAGAGAACAGC-3′) and Act4-D (5′-GACGTTGAAGACATTCAACC-3′).

Results

Detection of MULEs and Transduplicates

The MULE-mediated acquisition of cellular gene fragments in Arabidopsis was previously documented in a survey of approximately 15% of the genome (Yu, Wright, and Bureau 2000). To better characterize MULE-mediated transduplication in Arabidopsis, we modified a procedure previously shown to be accurate (Hoen et al. 2005) and identified 722 MULEs (Supplemental Table 3-1). Consistent with earlier reports (AGI 2000; Le et al. 2000; Yu, Wright, and Bureau 2000; Singer, Yordan, and Martienssen 2001), MULEs were found to be distributed across all chromosomes and concentrated at higher densities in pericentromeric regions and heterochromatic knobs (Supplemental Table 3-1 and Supplemental Fig. 3-1).

To locate transduplicate candidates, we searched for CDs within the MULEs, filtering out mudrA-related CDs. To differentiate between transduplication and transposon insertion, which occurs frequently, we also filtered out CDs characteristic of other transposons. Because most transduplicated sequences undergo frequent mutation, we identified CDs by scanning conceptual translations of MULE internal sequences in all six reading frames, ignoring frameshifts and stop codons. Each candidate transduplicate-containing MULE was manually verified, corresponding full-length cDNAs and ESTs were identified, and cellular genes corresponding to transduplicates were located and characterized (Table 3-1; Supplemental Table 3-2).

We found a total of 86 CDs in 22 transduplicated sequences and 19 MULEs. However, because our methods only detect transduplicates that contain a CD, these almost certainly constitute only a subset of Arabidopsis transduplicates. Alignments of transduplicated sequences to putative paralogous cellular genes had nucleotide identities ranging from 58% to 98% (unweighted mean 81%). Sixteen MULEs contained a single transduplicate and three MULEs contained two transduplicates each. Nineteen of the transduplicated CDs were present in two groups of MULEs, at the following locations: chr1|16662129–16666762, chr3|11796038–11800933, chr3|14747509–14753917, chr3|20780577–20799426 and chr4|591671–592987, chr4|1866831–1882462, chr4|5200471–5204006. Eighty-one percent of transduplicated CDs were truncated by at least 30%.
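The six-frame conceptual translation step described above can be sketched as below. This is a minimal illustration under stated assumptions and not the pipeline used here: stop codons are emitted as `*` and translation simply continues past them, which mirrors the decision to ignore stop codons; the subsequent conserved-domain search is not shown; and the example sequence is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate without stopping at stop codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Invented 12-bp example sequence.
frames = six_frame_translations("ATGGCCTAAGGG")
```

Because translation does not terminate at `*`, a domain interrupted by a premature stop codon still appears as one contiguous peptide in the relevant frame.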

Table 3-1. Summary of selected Arabidopsis transduplicates.

MULE location¹ | TIR | Length² | Transduplicate locus | CD³ | GO term⁴ | Cellular gene locus | % Identity⁵
Chr 1: 12926896–12932261 | yes | 4,597 | At1g35240 | pfam06507 | Transcription factor | At1g35540 | 82
Chr 2: 4839811–4841640 | no | 553 | At2g11990 | pfam00319 | Transcription factor | At3g04100 | 98
Chr 2: 6585875–6591729 | no | 5,854 | At2g15160 | smart00129 | Microtubule motor | At3g49650 | 94
Chr 3⁶: 20780577–20799426 | yes | 6,180 | At3g55980 | pfam03157 | Transcription factor | At2g40140 | 83
(same MULE) | | 1,628 | At3g55990 | COG1215 | Expressed protein | At2g40150 |
Chr 4: 1866831–1882462 | no | 11,964 | At4g03930 | pfam01095 | Pectinesterase | At3g27980 | 68
Chr 5: 14217004–14218072 | yes | 491 | None⁷ | smart00220 | Kinase | At3g01085 | 88

¹ Chr, chromosome.
² Length in base pairs.
³ Most have multiple CDs, of which only one representative is given. For more details see Supplemental Table 3-2.
⁴ GO, Gene Ontology; term of the cellular gene (http://www.geneontology.org).
⁵ Percent nucleotide identity.
⁶ This MULE contains two transduplicates.
⁷ There is no locus assigned to this transduplicate.

Identification of KI

One apparently transduplicated CD, peptidase C48, had markedly distinct characteristics. While most transduplicates were single copy and highly truncated, peptidase C48 was largely intact and found in a large number of MULEs (Supplemental Table 3-3). Peptidase C48 is a cysteine protease domain approximately 200 amino acids in length found in Ulp-like proteins, located at the C-terminus of Ulp1 and in the central region of Ulp2 (Li and Hochstrasser 1999, 2000). To fully characterize the occurrence of peptidase C48 in Arabidopsis, we scanned the genome sequence and located 84 peptidase C48 domains truncated by less than 5%, an additional 21 truncated by less than 20% (Supplemental Table 3-3), and at least an additional 35 truncated by at least 20% (Supplemental Table 3-4). We refer to domains truncated by 20% or less as “intact” and to the remainder as “truncated” and implicitly refer to intact domains unless otherwise noted.

Eight of the intact domains were located in non-pseudogenic TIGR5 ORFs previously identified as probable AtULP genes (Novatchkova et al. 2004) and were not associated with MULE features such as adjacent mudrA ORFs or flanking MULE termini. We refer to these eight ORFs as cellular AtULP genes. Sixty-nine of the remaining 97 intact domains were located in intact MULEs. Note that we use the term “intact MULEs” to mean not truncated, that is, MULEs which have matching termini and TSDs (see Materials and Methods). This is different from “autonomous,” which would require that, in addition to being intact, a MULE contains a functional mudrA gene and is furthermore capable of mobilizing itself in the absence of other MULEs. Nevertheless, most of these intact MULEs also contained a mudrA gene, many of which (29 of 69) were in turn found to contain all three mudrA-related CDs (pfam03108, pfam00872, and smart00575). These MULEs are potentially autonomous (Supplemental Table 3-3). The remaining 28 intact peptidase C48 domains appeared to be located in truncated MULEs as they were highly similar to sequences in intact MULEs and were usually associated with one or more MULE features such as a single unpaired terminus or a mudrA ORF. Ninety-three of these 97 intact peptidase C48 domains were located in TIGR5 annotated ORFs (Supplemental Table 3-3).

These MULE-related sequences constitute a novel family of ULP-like genes, which we named KI. Seventy-seven percent of KI-MULEs contained a mudrA gene and 60% contained one or more additional ORFs, which may be transduplications or transposon insertions (Supplemental Table 3-3). KI and mudrA are located on opposite strands in convergent orientation. Interestingly, all KI-MULEs have the unusual characteristic that their termini do not form high-identity inverted repeats, that is, they are non-TIR MULEs (Yu, Wright, and Bureau 2000).
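The distinction drawn above between “intact” and “truncated” domains amounts to a simple threshold on the fraction of the domain model that is missing. The sketch below illustrates that rule under stated assumptions: the function names are hypothetical, the lengths in the example are invented, and the exact boundary at 20% is treated as truncated here, a choice the text leaves open.

```python
def percent_truncated(matched_length: int, domain_length: int) -> float:
    """Percentage of the domain model not covered by the matched sequence."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    covered = min(matched_length, domain_length)
    return 100.0 * (domain_length - covered) / domain_length


def classify_domain(matched_length: int, domain_length: int,
                    intact_cutoff_percent: int = 20) -> str:
    """Label a domain 'intact' when less than the cutoff is missing.
    Integer arithmetic avoids floating-point surprises at the boundary."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    missing = domain_length - min(matched_length, domain_length)
    if missing * 100 < intact_cutoff_percent * domain_length:
        return "intact"
    return "truncated"


# Invented example lengths for a 200-residue domain model.
labels = [classify_domain(n, 200) for n in (198, 170, 150)]
```

With a 200-residue model, matches of 198 and 170 residues are labelled intact and a match of 150 residues is labelled truncated.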

Phylogeny and Age

To investigate the phylogeny of KI, we performed multiple sequence alignments of the predicted amino acid sequences of all intact Arabidopsis peptidase C48 domains including both KI and cellular AtULP genes, representative sequences from diverse species, and three-dimensional X-ray crystallographic structures of S. cerevisiae Ulp1 (Mossessova and Lima 2000) and H. sapiens Senp2 (Reverter and Lima 2004; Fig. 3-1; Supplemental Fig. 3-2). We constructed a neighbor-joining tree based on protein distances (Fig. 3-2; Supplemental Fig. 3-3). KI and the eight cellular AtULP genes formed two highly diverged phylogenetic groups. Although the bootstrap support for each major KI clade (see below) is 100%, it is below 50% for the most ancient nodes due to the high degree of divergence between clades and between KI and AtULP genes. This indicates that there is not enough evidence either to determine which cellular AtULP is most closely related to the KI family or to evaluate the order in which KI clades diverged from one another.

Similarity-based clustering of the amino acid sequences was consistent with the phylogenetic analysis and permitted further definition of KI subgroups. At an arbitrary threshold of 1.0 bits/residue (approximately equivalent to 45% identity), we grouped KI into nine clades of 3–20 members, leaving four single-member cluster outliers. At the same threshold, cellular AtULP genes formed four groups (At4g00690, At4g15880, At3g06910; At1g60220, At1g10570; At4g33620, At1g09730; and At5g60190), consistent with our phylogenetic tree and with previous studies (Kurepa et al. 2003; Novatchkova et al. 2004). The non-Arabidopsis Ulp-like sequences each formed single-member clusters at this threshold. Further clustering at various thresholds identified additional subgroups. For instance, at 100% nucleotide identity there were 2 clusters of 2 and 4 members, at 99% nucleotide identity there were 8 clusters of 2–5 members, and at 88% amino acid identity there were 13 clusters of 2–15 members, distributed across all clades (Supplemental Fig. 3-3).

To estimate a minimum KI formation age, we searched for sequences similar to a representative KI peptidase C48 domain (At2g12100, Clade 9) in preliminary B. oleracea genomic contigs (less than 0.5× coverage) and identified 228 KI-like sequences (E < 10⁻¹⁰). The B. oleracea sequences were most closely related to Arabidopsis Clades 7 and 8 with maximum 33% identity in a 208-amino acid BlastP alignment.

Figure 3-1. Multiple alignment of selected Arabidopsis peptidase C48 protein sequences with several outgroups. Probable cellular AtULP genes are indicated by the suffix “-C.” Each KI clade is represented (Clade 1, At2g15140; Clade 3, At4g08430; Clade 2, At2g12680; Clade 5, At3g26530; Clade 4, At1g35770 and At2g07240; Clade 6, At1g34610; Clade 7, At1g35110; Clade 8, At4g04130; and Clade 9, At2g12100 and At5g36860). Outgroups are S. cerevisiae ULP1 (ScULP1, gi18860521:1318–1508) and ULP2 (ScULP2, gi6325237:433–617), Homo sapiens SENP2 (HsSENP2, gi54607091:397–586), and human adenovirus type 2 protein (HavULP, gi34810217:50–144). Similar residues are highlighted in gray and identical residues in black. Probable catalytic residues are indicated by asterisks and are conserved only in some clades. A full alignment of KI and cellular AtULP sequences is given in Supplemental Fig. 3-3.

Figure 3-2. Extended majority rule phylogram of the KI gene family. Individual KI genes and 88% amino acid identity clusters are shown as labeled leaves (see Supplemental Fig. 3-2) with circled numbers indicating the size of clusters. Bootstrap percentages are indicated in gray and dN/dS values for each clade in black. All dN/dS values are significantly smaller than unity (P<0.0001), indicating that these groups have been subject to purifying selection, except Clade 2 (P>0.05). Distances were estimated by PAML, and a scale of 1-nt substitution per codon is shown. Bootstrap values near the divergence between KI and AtULP are low, indicating uncertainty about the order in which KI clades diverged. The position of the branch containing the cellular AtULP genes is shown as a dashed line. A full cladogram of KI and AtULP genes is given in Supplemental Fig. 3-2.  

Conservation

Peptidase C48 contains a putative catalytic triad of histidine, aspartate, and cysteine as well as a highly conserved glutamine residue positioned near the active site (Li and Hochstrasser 1999, 2000). We examined the KI peptidase C48 domains to determine whether these features were conserved and also whether the domains contained obvious disablements such as frameshifts or premature termination codons. Although many of the KI peptidase C48 domains were found to have one or more obvious defects, as expected for transposon-related ORFs, 53 had no obvious disablement and were positioned at the C-terminus, as in Ulp1. In three of the KI clades (Clades 2, 3, and 9), the majority of sequences encoded all four invariant residues; in two clades (Clades 7 and 8), the majority of sequences encoded all three residues in the catalytic triad (histidine, aspartate, cysteine) but not the putative invariant glutamine; in three clades (Clades 4, 5, and 6), the majority of sequences encoded histidine, aspartate, and glutamine but not cysteine; and in the remaining clade (Clade 1), the majority of sequences encoded the invariant glutamine but none of the catalytic triad. It is possible that some of the apparently missing invariant residues are actually present but were improperly aligned; however, the regions adjacent to the invariant sites are particularly well conserved, making this unlikely (Fig. 3-1; Supplemental Fig. 3-2).

To evaluate whether the amino acid sequences of KI genes had been subject to selective constraint, we estimated peptidase C48 domain dN/dS ratios for the entire KI subtree, for each KI clade, and for cellular AtULP genes using maximum likelihood (Fig. 3-2; Supplemental Table 3-5). The overall dN/dS values for KI and AtULP genes were 0.24 and 0.12, respectively. The dN/dS values of eight of the nine clades ranged from 0.11 to 0.51 and were all significantly smaller than unity (likelihood ratio test, P < 0.001). The only exception was Clade 2, which had a dN/dS of 1.35, a value which was, however, not significantly different from unity (P = 0.08). In additional analyses, Clade 2 did not have an exceptionally high dN/dS ratio compared to other clades and so has probably not been subject to positive selection.
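The likelihood ratio test referred to above compares, for each clade, a model in which dN/dS is estimated freely with one in which it is fixed at 1. A minimal sketch of the arithmetic follows. It assumes one degree of freedom (the two models differ by a single parameter) and uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)). The log-likelihood values in the example are hypothetical and are not results from this study.

```python
import math


def lrt_pvalue(lnl_free: float, lnl_fixed: float) -> float:
    """P value for a likelihood ratio test with one degree of freedom.

    lnl_free  : log-likelihood with dN/dS estimated (fix_omega = 0)
    lnl_fixed : log-likelihood with dN/dS fixed at 1 (fix_omega = 1)
    """
    # Test statistic: twice the log-likelihood difference, floored at zero
    # to guard against small negative values from optimizer noise.
    stat = max(2.0 * (lnl_free - lnl_fixed), 0.0)
    # Chi-square (df = 1) survival function.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods chosen so that the statistic is about 3.84.
p_example = lrt_pvalue(-1000.0, -1001.9205)
```

A statistic of about 3.84 corresponds to P of about 0.05, the conventional threshold; larger log-likelihood differences give correspondingly smaller P values.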

Expression

In addition to EST and full-length cDNA sequencing projects, the Arabidopsis transcriptome has been characterized by large-scale microarray and MPSS investigations (Brenner et al. 2000; Meyers et al. 2004; Lu et al. 2005). We identified ESTs and full-length cDNAs corresponding to KI and cellular AtULP genes in large public databases. Each of the eight cellular AtULP genes had between 1 and 30 corresponding ESTs, and all but two (At4g33620 and At4g00690) had at least one full-length cDNA (Supplemental Table 3-3). In contrast, only eight of the KI genes (9%) had a corresponding EST or full-length cDNA, three of which were full-length cDNAs that overlapped only a small fraction of the gene. Data from a high-density whole-genome tiled oligonucleotide array (Yamada et al. 2003) confirmed this pattern, supporting expression for only 5 of 97 KI genes (5%) compared to 3 of 8 cellular AtULP genes (38%; Supplemental Table 3-3).

We also compiled microarray measurements of KI and cellular AtULP gene expression levels in various Arabidopsis tissues using GENEVESTIGATOR, a public Web interface and standardized database of Affymetrix GeneChip data (Zimmermann et al. 2004), which consolidates several large public databases (Craigon et al. 2004; Barrett et al. 2005; Parkinson et al. 2005). Because most KI genes have multiple high-identity copies, many (28 of 61) probesets corresponding to KI genes were ambiguous; however, the ambiguity was mainly restricted to closely related KI genes and both ambiguous and unambiguous probesets gave similar results (Supplemental Table 3-6). Whereas all eight cellular AtULP genes had expression levels significantly greater than background in most tissues (P < 0.06), maximum KI signal intensities were typically lower and not significantly above background in any tissue (Pina et al. 2005). The tissue-wide intensities for KI and AtULP probesets were 103 ± 17 and 597 ± 178 (normalized units; mean of inflorescence, rosette, and roots ± standard error), respectively. The majority (six of eight) of cellular AtULP genes had signal intensities which were 177%–1779% lower in pollen than their tissue-wide means, but two, At4g33620 and At4g15880 (ESD4), had signals that were, respectively, 364% and 141% higher in pollen. Conversely, most KI probesets had signal intensities significantly higher in pollen than their tissue-wide mean (two-tailed paired t-test, P = 9.2 × 10⁻⁵; 281 ± 64% increase). Only 23% of KI probesets had decreased expression in pollen.

The signal intensities of ambiguous probesets appeared to increase roughly in proportion to increased ambiguity, as would be expected if multiple loci were contributing to the signal. For instance, Clade 1 was represented by five probesets with fourfold redundancy (each probeset hybridizes to the same four KI genes which formed a 97% nucleotide identity cluster) and six probesets with twofold redundancy which, respectively, had average pollen signal intensities of 667 and 202 and average overall signal intensities of 394 and 157. Furthermore, even if only unambiguous KI probesets were considered, their signal intensities in pollen remained significantly elevated compared to their tissue-wide means (two-tailed paired t-test, P = 2.1 × 10⁻⁴). Finally, elevated expression in pollen was supported by lower but still elevated levels of expression in the stamen and inflorescence generally, as recorded by a larger number of microarrays (two for pollen, eight for inflorescence), typically with lower standard error (Supplemental Table 3-6).
Although microarrays remain the most widely used technology for performing large-scale analyses of expression patterns, the technology is hybridization based and so, as illustrated by KI, has an inherently limited ability to distinguish weakly expressed genes from the background. MPSS, which involves sequencing millions of short (17–20 bp) cDNA signatures, is both quantitative and sensitive enough to detect transcripts at concentrations as low as three to five transcripts per million (TPM; Brenner et al. 2000). MPSS was recently used to sequence a set of over 36 million 17-bp signatures (268,000 unique signatures) in 14 mRNA libraries from various Arabidopsis tissues, mutants, and treatments (Meyers et al. 2004). We compiled signatures from this set corresponding to KI and mudrA genes in KI-MULEs and to cellular AtULP genes (Supplemental Table 3-7). Because of the repetitiveness of KI-MULEs, many KI and mudrA signatures were non-unique; however, as with the microarray probes, ambiguities usually corresponded to KI or mudrA genes in closely related MULEs, and both unique and non-unique signatures yielded similar results. The cellular AtULP genes each had several unique sense signatures (2.4 ± 0.3 per gene) at high maximum abundance across all libraries (26 ± 7.4 TPM) and few or no antisense signatures (0.6 ± 0.3 per gene) at relatively low maximum abundance (8.1 ± 1.8 TPM). Conversely, the combined numbers of sense signatures for all KI and mudrA genes were only seven (five unique) and six (one unique), respectively, with roughly the same number of antisense signatures, eight (one unique) and three (zero unique), respectively. The maximum abundances of sense and antisense signatures for KI were, respectively, 1.4 ± 0.3 TPM and 3.6 ± 0.7 TPM. Considering only sense signatures (including non-unique signatures), the maximum abundance of KI signatures was significantly lower than that for AtULP genes, consistent with the microarray results (two-tailed heteroscedastic t-test, P = 3.6 × 10⁻³).
To investigate whether KI may be targeted for RNA-mediated silencing, we examined a database of over 2 million sRNAs (over 75,000 non-redundant) derived from Arabidopsis inflorescence tissues and seedlings that were sequenced by MPSS (Lu et al. 2005) and identified sRNAs corresponding to KI, mudrA, and cellular AtULP genes (Supplemental Table 3-8). As with the microarray and mRNA MPSS results, many of the sRNA MPSS signatures for KI and mudrA were non-unique and so could have been generated by any of several closely related genes. However, in this case, the redundancy correlates with biological function because, in vivo, sRNAs presumably target all sequences with which they are able to hybridize. Whereas only two cellular AtULP genes had even a single sRNA, these having low abundance (two and three transcripts per quarter million [TPQ]), KI and mudrA genes had a large total number of sRNA signatures (24 ± 3 and 18 ± 2, respectively) at high abundance (57 ± 6 TPQ and 50 ± 6 TPQ, respectively; excluding the outlier KI-At4g08340 which had an abnormally high abundance of 1512 TPQ due to a non-unique signature which also matches a very highly expressed, unrelated chloroplast gene).

Mobility

We screened for insertions in wild-type Arabidopsis and various mutants (Columbia-0 background) with elevated transposition rates (ddm1, cmt3, met1, and cmt3 met1) using TD, a modification of the amplified fragment length polymorphism technique (Korswagen et al. 1996; Wright et al. 2001). We detected a single new insertion in a cmt3 met1 plant, which we verified by sequencing its termini and the flanking genomic DNA (Fig. 3-3; Supplemental Fig. 3-4). The terminal sequence uniquely identified the element as the MULE-KI-At2g12100. The flanking genomic sequence contained a perfect 9-bp TSD and mapped to the short arm of chromosome 3, immediately downstream of a geranylgeranyl pyrophosphate synthase gene (Supplemental Fig. 3-5). RT-PCR experiments showed that the mudrA and KI genes of KI-MULE-At2g12100 have elevated expression in met1 cmt3 compared to wild type (Fig. 3-4).


Figure 3-3. TD profile in cmt3, met1, and cmt3 met1 mutants using MuP-1 and MuP-2 primers. Three individual plants were analyzed for each mutant line. M, molecular weight marker and WT, wild type (Columbia-0). Molecular marker sizes are indicated (base pairs). The bands at a position indicated by an arrow represent a new insertion of a MULE containing the KI gene At2g12100. Note that the polymorphisms marked by asterisks are not due to insertions or excisions but due to a single 153-bp deletion adjacent to a preexisting MULE.  


Figure 3-4. Expression of the mobile KI gene, At2g12100. (a) The structure of At2g12100. The narrow line represents genomic DNA and boxes represent exons. The exon containing the peptidase C48 domain is labeled. Arrows indicate the position of primers used for RT-PCR. (b) RT-PCR of At2g12100 in different backgrounds using total RNA isolated from flowers. Actin (ACT4) was used as a positive control. Lane 1, wild type (Columbia-0); lane 2, cmt3 met1; lane 3, genomic DNA; and lane 4, molecular weight marker.

Discussion

Transduplication

We conducted a genome-wide survey of Arabidopsis transduplicates, first applying an accurate procedure we previously developed to identify MULEs and then identifying cellular CDs within these elements. Because many transduplicates do not contain CDs (e.g., approximately two-thirds in rice MULEs; Hoen et al. 2005), this approach is limited in sensitivity, but it also has key advantages. (1) CDs are compact, highly conserved sequences corresponding to three-dimensional protein-folding units; therefore, sequences similar to CDs are originally derived from genuine coding sequences, not potentially misannotated genes. (2) Since most transposon-related CDs are known, CDs derived from transposon insertions can be filtered out. (3) CD truncations can be identified with high confidence and, along with frameshifts and premature stop codons, are strong indicators of gene disablement (Harrison et al. 2005). (4) Most importantly, in transduplicates that encode functional proteins maintained by selection, it is possible to detect CDs long after their nucleotide sequences and the less highly conserved regions of their amino acid sequences become too divergent from the paralogous cellular genes for any similarity to be detected.

The results of our survey indicate that the general characteristics of MULE-mediated transduplication in Arabidopsis (a eudicot) are similar to those previously documented in rice (a monocot), suggesting that its mechanism may be widely conserved in higher plants. Eighty-one percent of transduplicated CDs in Arabidopsis (excluding KI; Table 3-1; Supplemental Table 3-2) and 83% of expressed transduplicated CDs in rice (Hoen et al. 2005) are truncated by more than 30%, suggesting that they may be pseudogenes (Harrison et al. 2005). Seventy-eight percent of transduplicated CDs in Arabidopsis (excluding KI; Table 3-1; Supplemental Table 3-2) and 64% of transduplicates in rice (Hoen et al. 2005) are single copy, and virtually all have fewer than 10 copies in each organism, suggesting that transduplicates do not usually confer a direct selective advantage on the corresponding MULEs. Interestingly, in both Arabidopsis and rice, cellular genes encoding DNA-binding and transcription factors appear to have frequently been the targets of transduplication (Table 3-1; Supplemental Table 3-2; Hoen et al. 2005).

KI Sequence Evolution

Despite the many similarities between Arabidopsis and rice transduplication, there is one key difference. KI is a large family of Arabidopsis transduplicates with unique characteristics. Whereas most transduplicates have low copy number and contain truncated CDs, there are 97 KI genes with intact peptidase C48 domains as well as at least 30 with truncated domains (Supplemental Tables 3-3 and 3-4). The KI family is also unusual in its diversity. Most other identified transduplicates have high nucleotide sequence similarity to their putative cellular gene paralogs (over four-fifths have greater than 70% identity), indicating that they formed recently (Table 3-1; Supplemental Table 3-2). In contrast, KI has split into nine highly diverged clades with as much evolutionary distance between clades as between KI and cellular AtULP genes, suggesting that extant KI genes may have arisen from an ancient transduplication event. This is supported by the presence of hundreds of KI-like sequences in B. oleracea (despite 0.5× coverage; data not shown), consistent with a KI origin prior to the divergence of Arabidopsis and B. oleracea, 15–20 MYA (Yang et al. 1999; AGI 2000). Curiously, rice, which diverged from Arabidopsis approximately 200 MYA (Yang et al. 1999; AGI 2000), contains at least 161 genes with peptidase C48 domains (Bateman et al. 2004); however, preliminary analysis indicates that they are predominantly associated with other DNA transposons (data not shown).

Two alternative hypotheses might explain the large number of intact peptidase C48 domains in KI-MULEs: either KI-MULEs have coincidentally undergone a large expansion recently enough for the domains to have escaped mutation, or KI has evolved under selective constraint. While the aforementioned evidence supporting an ancient origin of the KI gene family provides only circumstantial support for the second of these possibilities, nucleotide substitution patterns convincingly demonstrate that KI has been subject to selection. Ratios of nonsynonymous (dN) to synonymous (dS) nucleotide substitutions per site that are much smaller, somewhat smaller, or not significantly different from unity indicate, respectively, that many, some, or none of a set of putative coding sequences have been subject to purifying selection (Li, Gojobori, and Nei 1981). Similarly, values significantly larger than unity indicate positive selection. For the KI gene family, both overall and clade-specific dN/dS values strongly support the claim that KI has evolved under purifying selection. The overall dN/dS for intact KI peptidase C48 sequences (0.24) is within the range typically observed for functional Arabidopsis genes (Zhang, Vision, and Gaut 2002) and comparable to that of cellular AtULP peptidase C48 sequences (0.12). Also, eight of nine KI clades have individual dN/dS values significantly smaller than unity (Fig. 3-2; Supplemental Table 3-5). Furthermore, the KI gene family is typical of transposon-related genes in that, in addition to putatively functional members, it contains many apparent pseudogenes which have obvious disablements such as truncations, premature stop codons, and frameshifts (Supplemental Table 3-3). Because the codon sequences of these pseudogenes have presumably been drifting neutrally since becoming disabled, we would expect that the strength of selection on the functional fraction of KI genes has been underestimated by these dN/dS calculations.
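
The reading of dN/dS ratios described above can be illustrated with a minimal Python sketch. The function names and the `tolerance` band are illustrative assumptions and are not part of the original analysis, which relied on likelihood ratio tests (Supplemental Table 3-5) rather than a fixed cutoff. Only the two ratios passed in at the end (0.24 and 0.12) come from the text.

```python
def dnds_ratio(dn: float, ds: float) -> float:
    """Return omega = dN/dS for one set of coding sequences."""
    if ds <= 0:
        raise ValueError("dS must be positive to form a ratio")
    return dn / ds


def selection_regime(omega: float, tolerance: float = 0.1) -> str:
    """Crudely label a dN/dS ratio.

    `tolerance` is a hypothetical stand-in for a proper significance
    test of whether omega differs from unity.
    """
    if omega < 1 - tolerance:
        return "purifying"
    if omega > 1 + tolerance:
        return "positive"
    return "neutral"


# Overall ratios reported in the text for intact peptidase C48 sequences.
for label, omega in [("KI", 0.24), ("cellular AtULP", 0.12)]:
    print(label, omega, selection_regime(omega))
```

A fixed band around unity is only a teaching device; with real alignments the comparison would be made with a statistical test on the estimated rates.
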
It is important to keep in mind that because KI is located in transposons and is therefore presumably subject to different selective constraints than cellular AtULP genes (see below), direct comparisons between the two should be interpreted with caution. Nevertheless, the observed pattern of peptidase C48 domain sequence evolution indicates that KI has maintained a protein-coding function which utilizes the peptidase C48 domain.

The conclusion that KI encodes a peptidase is further supported by the conservation of invariant residues in widely diverged KI sequences. Peptidase C48 contains four highly conserved residues: a putative catalytic triad of histidine, aspartate, and cysteine and a glutamine residue predicted to help form the oxyanion hole (Li and Hochstrasser 1999, 2000). There are some reported exceptions, including human adenovirus type 2 and African swine fever virus, which contain peptidase C48 domains with a glutamate and an asparagine, respectively, at the aspartate site (Fig. 3-1; Li and Hochstrasser 1999). The majority of sequences in three KI clades (Clades 2, 3, and 9) encode all four invariant residues, and the active site sequences appear to be more highly conserved than other parts of the domain, consistent with the preservation of catalytic activity in these gene products (Fig. 3-1; Supplemental Fig. 3-2). To varying degrees, most other KI clades also appear to have maintained some invariant residues.
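
The residue check described above can be sketched in a few lines of Python. This is an illustrative sketch only: the site labels, the function name, and the idea of passing in residues already extracted from an alignment are assumptions, and the example inputs are made up rather than taken from any KI sequence.

```python
# Residues expected at the four highly conserved peptidase C48 sites:
# the His-Asp-Cys catalytic triad plus the oxyanion-hole Gln.
CANONICAL = {"his": "H", "asp": "D", "cys": "C", "gln": "Q"}

# Substitutions at the aspartate site that the text reports for two
# viral peptidase C48 domains (glutamate and asparagine).
KNOWN_ASP_VARIANTS = {"E", "N"}


def classify_active_site(residues: dict) -> str:
    """Classify the residues observed at the four conserved sites.

    `residues` maps the hypothetical site labels used in CANONICAL to
    one-letter amino acid codes read from an alignment column.
    """
    mismatches = [site for site, aa in CANONICAL.items()
                  if residues.get(site) != aa]
    if not mismatches:
        return "canonical"
    if mismatches == ["asp"] and residues.get("asp") in KNOWN_ASP_VARIANTS:
        return "known aspartate variant"
    return "substituted: " + ", ".join(mismatches)


# Made-up example inputs, not real KI data.
print(classify_active_site({"his": "H", "asp": "D", "cys": "C", "gln": "Q"}))
print(classify_active_site({"his": "H", "asp": "E", "cys": "C", "gln": "Q"}))
print(classify_active_site({"his": "H", "asp": "D", "cys": "S", "gln": "Q"}))
```
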

Expression

The Arabidopsis transcriptome has been exceptionally well characterized by EST and full-length cDNA sequencing, whole-genome microarrays (Edgar, Domrachev, and Lash 2002; Brazma et al. 2003; Craigon et al. 2004), tiled oligonucleotide arrays (Yamada et al. 2003), and MPSS (Meyers et al. 2004; Lu et al. 2005). We used these resources to compile a detailed picture of the expression pattern and sRNA matches of KI and, for comparison, associated mudrA genes and cellular AtULPs. The repetitiveness of KI-MULEs presents an inherent complication in the interpretation of microarray and mRNA (but not sRNA) MPSS results because most of the microarray probesets and virtually all MPSS signatures match multiple genomic locations. However, in many cases, most or all of the ambiguous locations are within closely related KI (or mudrA) genes. Therefore, although it is impossible to determine which KI (or mudrA) gene among a set of matches contributes to signal amplitudes, we can be reasonably confident that the observed expression patterns are due to some KI (or mudrA) gene in the set. The low level of KI and mudrA expression further complicates the interpretation of the microarray (but not the MPSS) data because many individual signals are not significantly above background.

Despite these complications, our results across all data sets and most loci consistently show that KI expression and sRNA patterns are similar to those of mudrA and different from those of cellular AtULP genes. EST, cDNA, tiling array, microarray, and MPSS data all indicate that whereas cellular AtULP genes are generally expressed at significant levels, KI and mudrA genes are expressed at low or undetectable levels (Supplemental Tables 3-3, 3-6, 3-7, and 3-8). Microarray data suggest that this pattern is reversed in pollen, where the expression of six of eight cellular AtULP genes is roughly 2- to 17-fold lower than in other tissues while the expression of KI genes is increased by an average of threefold (Supplemental Table 3-6). Interestingly, the two cellular genes with increased expression in pollen (ESD4 and At4g33620) represent widely diverged branches of the cellular AtULP phylogeny (Supplemental Fig. 3-3), which would be consistent with a separation of functional roles like that of S. cerevisiae ULP1 and ULP2. These two genes may play specialized roles in pollen. The MPSS results confirm these trends and provide a higher resolution picture, showing that KI and mudrA genes generate few sense transcripts and a disproportionately large number of antisense transcripts (Supplemental Table 3-7). The 97 KI genes with intact peptidase C48 domains have only 14 mRNA signatures in total, roughly equal numbers of which are antisense and sense.
The mean abundance of the antisense transcripts is almost threefold higher than that of the sense transcripts. The opposite pattern is true of the eight cellular AtULP genes, which have a total of 26 mRNA signatures. Only one-quarter of these signatures are antisense, and these have a mean abundance more than threefold lower than that of the sense transcripts. Only one KI gene (At1g40078) has a sense signature with abundance (16 TPM) greater than that of the least abundant primary signature of any cellular AtULP (At4g33620, 14 TPM).

sRNAs, which are generated by the cleavage of double-stranded RNAs including long RNA hybrids (e.g., a sense and an antisense transcript) and short RNA “hairpins,” target complementary genomic DNA sequences for transcriptional silencing by heterochromatin formation, and target complementary mRNA sequences for post-transcriptional cleavage (Baulcombe 2005). The sRNA MPSS data show that, while there are virtually no sRNAs for cellular AtULP genes, KI and mudrA match numerous, highly abundant sRNAs (Supplemental Table 3-8). Only a single KI gene (At5g33259) has no sRNA, and each KI gene has an average of approximately 24 sRNAs with abundance 57 TPQ. The large number and high abundance of antisense transcripts and sRNAs are consistent with sRNA-mediated gene silencing of KI and mudrA.
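
The sense versus antisense comparison above reduces to two summary statistics per gene set: the fraction of signatures that are antisense, and the ratio of mean antisense to mean sense abundance. The following Python sketch shows one way to compute them. The function name, the record layout, and the toy values are assumptions made for illustration; the toy values are not the MPSS data of Supplemental Table 3-7.

```python
from statistics import mean


def summarize_signatures(signatures):
    """Summarize strand bias for a set of mRNA signatures.

    `signatures` is an iterable of (strand, abundance) pairs, where
    strand is "sense" or "antisense" and abundance is in TPM.
    """
    records = list(signatures)  # allow generators to be passed in
    sense = [a for strand, a in records if strand == "sense"]
    antisense = [a for strand, a in records if strand == "antisense"]
    total = len(sense) + len(antisense)
    ratio = mean(antisense) / mean(sense) if sense and antisense else None
    return {
        "antisense_fraction": len(antisense) / total if total else None,
        "antisense_to_sense_mean_ratio": ratio,
    }


# Made-up toy records, chosen only to exercise the function.
toy = [("sense", 10), ("sense", 20), ("antisense", 45)]
print(summarize_signatures(toy))
```
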

Mobility

MULEs and other DNA transposons have low transposition rates in wild-type Arabidopsis and elevated rates in DNA and histone methylation mutants (Singer, Yordan, and Martienssen 2001; Lippman et al. 2003). To our knowledge, there is no previously documented case of a non-TIR MULE (e.g., a KI-MULE) being mobilized. Several of our results provide circumstantial evidence of recent KI-MULE transposition. (1) KI is predominantly located in mudrA-containing MULEs (i.e., potentially autonomous MULEs) which must presumably have been recently mobile in order to have maintained these mudrA ORFs (see also the discussion of nonphenotypic selection, below). (2) Many KI-MULEs have perfect target site duplications, indicating that they are recent insertions. (3) Immediately after replication, duplicated transposons are identical, so divergence between paralogous transposon sequences is a measure of time since duplication. The nucleotide sequences of several KI genes are nearly identical; for example, two clusters (six sequences in total) have 100% identity and eight clusters (25 sequences from six clades) have 99% identity (Supplemental Fig. 3-3). While a few of these copies may have arisen through segmental duplication or other cell-mediated mechanisms (e.g., KI-At1k1483 and KI-At1g40078 form one of the 100% identity clusters but appear to have been duplicated in an inverted segmental duplication), differences in TSDs and flanking sequences show that the majority resulted from replicative transposition. This strongly suggests that a diverse group of KI-MULEs were recently mobile; however, it does not exclude the possibility that transposition has since been silenced.

Transposons may be epigenetically silenced by sRNA-directed heterochromatin formation, which involves DNA and histone methylation, primarily via DDM1 (decrease in DNA methylation 1)–dependent histone H3 lysine 9 methylation (H3mK9) as well as MET1 (METHYLTRANSFERASE1)-dependent cytosine methylation at CG sites and, in plants, CMT3 (CHROMOMETHYLASE3)-dependent cytosine methylation at non-CG sites (Bartee, Malagnac, and Bender 2001; Miura et al. 2001; Gendrel et al. 2002; Kato et al. 2003; Lippman et al. 2004). Like most Arabidopsis DNA transposons, KI-MULEs are concentrated in heterochromatin at the pericentromeres and at two isolated islands on chromosomes 4 and 5 (Supplemental Table 3-1 and Supplemental Fig. 3-1; AGI 2000; Lippman et al. 2004). Transposons in these regions, as well as those in cytologically defined euchromatin, have been shown to be subject to H3mK9 and cytosine methylation (Lippman et al. 2003; Zilberman, Cao, and Jacobsen 2003; Lippman et al. 2004). This is consistent with our results, which indicate that both KI and mudrA genes in most KI-MULEs are associated with antisense transcripts and sRNAs and are only weakly expressed.

To test whether KI-MULEs remain transpositionally competent, we used TD to screen for insertions in wild-type Arabidopsis and in various mutants (Columbia-0 background) known to have elevated transposition rates: ddm1, cmt3, met1, and cmt3 met1 (Korswagen et al. 1996; Wright et al. 2001). Although previous studies found increased transposition of a MULE and CACTA elements in ddm1 mutants (Miura et al. 2001; Kato et al. 2003; Lippman et al. 2003), we did not detect KI-MULE mobility in these mutants (Supplemental Fig. 3-4). However, we did detect a single new insertion of the MULE containing the KI gene At2g12100 in a cmt3 met1 background. This confirms that at least one KI-MULE remains capable of mobility and provides the first experimental evidence of non-TIR MULE mobility (Fig. 3-3). The mobility of this MULE may be related to the expression levels of its mudrA and KI genes, which are elevated in the met1 cmt3 background compared to wild type (Fig. 3-4). Interestingly, At2g12100 belongs to a cluster of four closely related KI genes: it has 99%, 97%, and 96% nucleotide sequence identity with the peptidase C48 domains of At1g45090, At2g16180, and At2g05450, respectively. This suggests that the corresponding MULEs were recently mobile in wild type. Although we found no evidence in public databases (i.e., ESTs, full-length cDNAs, tiled oligonucleotide arrays, microarrays, and MPSS mRNAs) that At2g12100 is expressed in wild type, it has no obvious disablement, contains 99.5% of the peptidase C48 domain including all four invariant residues, and the corresponding mudrA gene (At2g12150) contains 100% of two mudrA-related CDs (Supplemental Table 3-3).
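
Two of the lines of evidence above, perfect target site duplications and near-identity between paralogous copies, are simple sequence comparisons. The Python sketch below illustrates both under stated assumptions: the function names are hypothetical, the default TSD length of 9 bp is an assumed value typical of Mutator-like elements rather than one given in this section, and the example sequences are made up.

```python
def has_perfect_tsd(left_flank: str, right_flank: str, tsd_len: int = 9) -> bool:
    """Return True if the element is flanked by identical direct repeats.

    `left_flank` ends at the element's 5' boundary and `right_flank`
    starts at its 3' boundary. `tsd_len` is an assumed TSD length.
    """
    if len(left_flank) < tsd_len or len(right_flank) < tsd_len:
        return False
    return left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Made-up sequences for illustration only.
print(has_perfect_tsd("TTGACCATGCAA", "ACCATGCAATTG"))
print(percent_identity("ACGT", "ACGA"))
```

Under the reasoning in the text, a perfect TSD together with very high identity between two copies at different loci is what a recent replicative transposition would be expected to leave behind, whereas shared flanking sequence would instead point to segmental duplication.
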

Is KI Selfish?

Selfish genetic elements have a unique mode of survival in the genome. Cellular (i.e., non-selfish) elements are selected through their contribution to beneficial phenotypes which increase reproductive success (i.e., phenotypic selection). Selfish elements are also subject to phenotypic selection, but the phenotypes associated with selfish elements are probably deleterious or neutral in most cases because new insertions are likely to jeopardize the function of nearby sequences (Kidwell and Lisch 2001). Thus, phenotypic selection likely works to remove selfish elements rather than conserve them. But selfish elements do not require phenotypic selection to survive, and can even escape mild negative phenotypic selection, because of their ability to self-replicate. By duplicating with sufficient frequency, at least one copy of a selfish element (in an interbreeding population of host organisms) can continually escape disablement and remain able to self-replicate. This process, which has been termed “nonphenotypic” selection, inherently selects for self-replication (Doolittle and Sapienza 1980; Orgel and Crick 1980; Hurst and Werren 2001; Brookfield 2005).

Unlike that of other selfish elements, the evolution of eukaryotic DNA transposons is also shaped by the constraint that, to catalyze self-replication, transposon-encoded proteins must presumably act in trans after being imported into the nucleus. Thus, transposons that encode functional transposases (i.e., autonomous elements) may be no more likely to be replicated by their products than related nonautonomous elements. This leads to the replication and accumulation of nonautonomous elements, which often significantly outnumber autonomous elements, and can result in decreased rates of autonomous transposition and eventually the complete silencing of DNA transposon families (Brookfield 2005).
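
The core idea of nonphenotypic selection, that intact copies persist when duplication outpaces disablement, can be illustrated with a toy branching-process simulation in Python. This is a deliberately simplified sketch and not a model used in this study: the function name, the per-generation rates, the copy-number cap, and the random seed are all assumptions, and the sketch ignores trans-acting transposase, host silencing, and phenotypic selection.

```python
import random


def simulate_active_copies(generations, dup_rate, disable_rate,
                           start=1, cap=1000, seed=0):
    """Track the number of intact (self-replicating) copies over time.

    Each generation, every intact copy is disabled with probability
    `disable_rate`; each surviving copy then adds one new intact copy
    with probability `dup_rate`. `cap` bounds the copy number.
    """
    rng = random.Random(seed)
    active = start
    history = [active]
    for _ in range(generations):
        survivors = sum(1 for _ in range(active)
                        if rng.random() >= disable_rate)
        new_copies = sum(1 for _ in range(survivors)
                         if rng.random() < dup_rate)
        active = min(survivors + new_copies, cap)
        history.append(active)
        if active == 0:
            break  # the family has lost all intact copies
    return history


# Hypothetical rates: duplication above, then below, the disablement rate.
print(simulate_active_copies(30, dup_rate=0.3, disable_rate=0.1))
print(simulate_active_copies(30, dup_rate=0.05, disable_rate=0.2))
```

In this toy setting, lineages with a duplication rate well above the disablement rate tend to retain intact copies, while lineages with the opposite balance tend to go extinct, which is the intuition behind selection for self-replication without any phenotypic benefit.
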

Thus, MULEs evolve under a tension of opposing evolutionary forces. Whereas nonphenotypic selection favors elevated replication rates, new transposon copies can insert into functional cellular sequences causing negative phenotypic selection. As a result, not only do hosts maintain transposition silencing systems such as sRNA-directed transcriptional and post-transcriptional gene silencing, but transposons also evolve self-regulatory mechanisms such as tissue-specific promoters to maximize their reproductive success while minimizing deleterious effects (Kidwell and Lisch 2001). For instance, promoters in autonomous maize MULEs contain nested sets of pollen-specific motifs, and reporter gene expression is increased more than 20-fold in pollen compared to leaves (Walbot and Rudenko 2002). Preferential accumulation in heterochromatic regions with low gene density may also limit the deleterious consequences of transposition (Kidwell and Lisch 2001).

KI has the expected characteristics of a family of selfish genes. Like mudrA and unlike cellular AtULP genes, KI genes are targeted for silencing by numerous abundant sRNAs, are located primarily in heterochromatic regions, are not expressed at high levels, and are preferentially transcribed in pollen. Because KI genes have replicated to high copy number within potentially autonomous MULEs, nonphenotypic selection is sufficient to explain the observed conservation and dN/dS values. Although transposon-related genes are sometimes co-opted to perform cellular functions, these sequences subsequently lose their mobility, which is no longer required for conservation and may be a liability because mobilization could lead to deletion or inactivation of a beneficial cellular function (Cowan et al. 2005). However, KI-MULEs generally have intact termini, contain mudrA genes and so are potentially autonomous, and at least one has retained the ability to transpose.
These observations suggest that KI evolution has been predominantly shaped by selfish selective forces; nevertheless, the possibility remains that selection for advantageous phenotypes that KI may confer on its host has also contributed.

Function

All known proteins containing the peptidase C48 domain are small ubiquitin-like modifier (SUMO)–specific proteases (Li and Hochstrasser 1999). SUMO is a peptide tag that modulates the function of target proteins in diverse processes, including nucleocytoplasmic transport, signal transduction, cell-cycle progression, stress response, and transcriptional regulation (Novatchkova et al. 2004; Hay 2005). Arabidopsis contains nine SUMO genes, at least one of which is a pseudogene (Novatchkova et al. 2004). Among the four which show significant levels of expression, two appear to be involved in stress response signal transduction pathways (Kurepa et al. 2003; Lois, Lima, and Chua 2003). The SUMOylation of transcription factors usually correlates with transcriptional repression (Gill 2005). SUMO-specific proteases (i.e., Ulps) function as endopeptidases to activate SUMO from its inactive precursor and as isopeptidases to deconjugate it from target proteins. Yeast encodes two Ulps which differ in their primary function and localization: Ulp1 is an endopeptidase and localizes to the nuclear periphery, and Ulp2 is an isopeptidase and is distributed throughout the nucleus (Li and Hochstrasser 1999, 2000). Interestingly, although complete ULP1 deletion is lethal, nonlethal mutations have been reported that result in the proliferation of yeast plasmids, a type of selfish element (Dobson et al. 2005). Furthermore, SUMO plays a role in plant defense responses, and some viruses and pathogenic bacteria encode Ulps which disrupt host defense mechanisms (Xia 2004).

Consistent with our results, previous reports have noted the relatively large number of ULP-like genes and pseudogenes in Arabidopsis (Kurepa et al. 2003; Murtas et al. 2003; Novatchkova et al. 2004). Although Kurepa et al. (2003) identified only four of the eight cellular AtULP genes found in this study (which they classified as ULP1-like), as well as eight KI genes (which they classified as ULP2-like), Novatchkova et al. (2004) suggest as candidate functional genes exactly the same set of eight cellular AtULP genes as we identified. Our results provide strong evidence that this set of eight genes, which form a separate phylogenetic group from KI, are the only cellular Arabidopsis ULPs (Supplemental Fig. 3-3). They each contain an intact peptidase C48 domain, are annotated as nonpseudogenic ORFs (TIGR5 annotations), are located in gene-rich euchromatic regions, are not associated with MULE features such as nearby mudrA ORFs or flanking MULE termini, and are expressed at levels similar to other cellular genes. Murtas et al. (2003) carried out the only functional analysis to date of a plant ULP-like gene, showing that one of the cellular AtULP genes (ESD4, early in short days 4, At4g15880) encodes a SUMO isopeptidase (like ULP2) which localizes to the nuclear periphery (like ULP1).

If, as we suggest above, KI has been subject to nonphenotypic selection, then it must function to increase the replicative success of KI-MULEs. Although we have yet to determine the mechanism by which this may occur, we may speculate based on the known functions of SUMO and eukaryotic and pathogen-encoded Ulps that the KI protein might act as a deSUMOylase to disrupt host transposon–silencing mechanisms. For instance, KI could deSUMOylate regulators of MULE transcription, especially in pollen, enabling increased mobility. On the other hand, the substitution of invariant residues in several KI clades may indicate that these proteins no longer act as proteases but may, for instance, instead function as competitive inhibitors of deSUMOylation or might be nonfunctional. Another possibility is that some KI clades have undergone subfunctionalization and, for instance, may act on different SUMO isoforms or even alternative substrates.
Finally, it is possible that, rather than directly increasing MULE mobility, KI might decrease phenotypic selection against KI-MULE transposition by playing a self-regulatory role, such as fine-tuning transpositional timing or frequency.

Broader Evolutionary Implications

The phenomenon of transduplication has recently generated interest largely because of its potential but unproven capacity to drive cellular protein-coding gene diversification, especially given its ability to create chimeric genes by joining duplicated sequences from distant loci (Jiang et al. 2004; Lisch 2005). However, both our current analysis of Arabidopsis transduplicates and our previous analysis in rice (Hoen et al. 2005) suggest that virtually all transduplicates are pseudogenes. Transduplication also has the potential to contribute to cellular function by other mechanisms, such as regulating paralogous gene expression through the generation of complementary sRNAs, or providing sequence reservoirs for gene conversion. KI is the first clear example to show that transduplicated genes sometimes (but rarely) do retain a protein-coding function. Yet instead of directly contributing to cellular function and evolution, we suggest that transduplication may be more important as a mechanism for transposon survival, generating selfish gene diversity that enables transposons to overcome host repression mechanisms. Furthermore, it is now well documented that transposon-related genes can adopt cellular functions (Brookfield 2005), suggesting that a cycle of transduplication followed by “domestication” is a possible albeit circuitous route to the expansion of the cellular protein-coding gene repertoire.

Acknowledgments

We would like to thank Tetsuji Kakutani for supplying the mutant lines. This study was supported by grants to T.E.B. from the Natural Sciences and Engineering Research Council of Canada and the McGill University William Dawson Scholar Chair. K.C.P. was supported by a Korea Research Foundation grant (Ministry of Education & Human Resources Development, KRF-2003-214-C00229; Korea). Sequencing of B. oleracea was funded by the National Science Foundation, and preliminary sequence data were obtained from TIGR (http://www.tigr.org).

Supplemental Material

Supplemental Figure 3-1. The distribution of MULEs and peptidase C48 loci in the Arabidopsis genome. Cellular AtULP genes are labelled c1 to c8. KI genes are labelled with letters. Centromeres are shown as a thin line, and the positions of heterochromatic knobs on chromosomes 4 and 5 are shown with asterisks. Black lines illustrate the density of MULEs and green lines show the density of DNA transposons to an arbitrary scale (AGI 2000).

Supplemental Figure 3-4. Transposon display profile in ddm1-1 and ddm1-2 mutant backgrounds using MuP-1 and MuP-2 primers. M, molecular weight marker; WT, wild type (Columbia-0). Marker sizes are indicated in base pairs.

Supplemental Figure 3-5. PCR analysis of transposition of a KI-MULE in a met1 cmt3 double mutant background. (a) The KI-MULE containing the KI gene At2g12100 inserted upstream of the GPS (Geranylgeranyl Pyrophosphate Synthase) locus. Arrows indicate the position of primers used for PCR. Underlined sequences are TSDs. (b) The PCR products amplified with a primer set specific for the MULE and genomic flanking sequence.

Note: The following supplemental data are too large to be included here, but may be downloaded from the Molecular Biology and Evolution journal website at http://mbe.oxfordjournals.org/content/23/6/1254/suppl/DC1.

Supplemental Figure 3-2. Full alignment of Arabidopsis peptidase C48 domains and outgroups.

Supplemental Figure 3-3. Unrooted majority-rule cladogram of all intact peptidase C48 domains, including KI and cellular AtULP genes.

Supplemental Table 3-1. Arabidopsis MULEs.

Supplemental Table 3-2. Non-KI-MULEs with transduplicates.

Supplemental Table 3-3. Location, expression, and structure of KI-MULEs containing intact peptidase C48 domains.

Supplemental Table 3-4. Truncated peptidase C48 domains.

Supplemental Table 3-5. dN/dS values and likelihood ratio test for KI clades.

Supplemental Table 3-6. Microarray results for intact KI.

Supplemental Table 3-7. mRNA MPSS results.

Supplemental Table 3-8. sRNA MPSS results.

References

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
[AGI] Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815.
Barrett, T., T. O. Suzek, D. B. Troup, S. E. Wilhite, W. C. Ngau, P. Ledoux, D. Rudnev, A. E. Lash, W. Fujibuchi, and R. Edgar. 2005. NCBI GEO: mining millions of expression profiles—database and tools. Nucleic Acids Res. 33:D562–D566.
Bartee, L., F. Malagnac, and J. Bender. 2001. Arabidopsis cmt3 chromomethylase mutations block non-CG methylation and silencing of an endogenous gene. Genes Dev. 15:1753–1758.
Bateman, A., L. Coin, R. Durbin et al. (13 co-authors). 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138–D141.
Baulcombe, D. 2005. RNA silencing. Trends Biochem. Sci. 30:290–293.
Bedell, J. A., I. Korf, and W. Gish. 2000. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16:1040–1041.
Brazma, A., H. Parkinson, U. Sarkans et al. (13 co-authors). 2003. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31:68–71.
Brenner, S., M. Johnson, J. Bridgham et al. (24 co-authors). 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18:630–634.
Brookfield, J. F. 2005. The ecology of the genome—mobile DNA elements and their hosts. Nat. Rev. Genet. 6:128–136.
Chalvet, F., C. Grimaldi, F. Kaper, T. Langin, and M. J. Daboussi. 2003. Hop, an active Mutator-like element in the genome of the fungus Fusarium oxysporum. Mol. Biol. Evol. 20:1362–1375.
Chandler, V., C. Rivin, and V. Walbot. 1986. Stable non-mutator stocks of maize have sequences homologous to the Mu1 transposable element. Genetics 114:1007–1021.
Cowan, R. K., D. R. Hoen, D. J. Schoen, and T. E. Bureau. 2005. MUSTANG is a novel family of domesticated transposase genes found in diverse angiosperms. Mol. Biol. Evol. 22:2084–2089.
Craigon, D. J., N. James, J. Okyere, J. Higgins, J. Jotham, and S. May. 2004. NASCArrays: a repository for microarray data generated by NASC’s transcriptomics service. Nucleic Acids Res. 32:D575–D577.
Dobson, M. J., A. J. Pickett, S. Velmurugan, J. B. Pinder, L. A. Barrett, M. Jayaram, and J. S. Chew. 2005. The 2 microm plasmid causes cell death in Saccharomyces cerevisiae with a mutation in Ulp1 protease. Mol. Cell. Biol. 25:4299–4310.
Doolittle, W. F., and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284:601–603.
Edgar, R., M. Domrachev, and A. E. Lash. 2002. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30:207–210.
Elrouby, N., and T. E. Bureau. 2001. A novel hybrid open reading frame formed by multiple cellular gene transductions by a plant long terminal repeat retroelement. J. Biol. Chem. 276:41963–41968.
Garcia-Hernandez, M., T. Z. Berardini, G. Chen et al. (21 co-authors). 2002. TAIR: a resource for integrated Arabidopsis data. Funct. Integr. Genomics 2:239–253.
Gendrel, A. V., Z. Lippman, C. Yordan, V. Colot, and R. A. Martienssen. 2002. Dependence of heterochromatic histone H3 methylation patterns on the Arabidopsis gene DDM1. Science 297:1871–1873.
Gill, G. 2005. Something about SUMO inhibits transcription. Curr. Opin. Genet. Dev. 15:536–541.
Haas, B. J., J. R. Wortman, C. M. Ronning et al. (12 co-authors). 2005. Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol. 3:7.
Harrison, P. M., D. Zheng, Z. Zhang, N. Carriero, and M. Gerstein. 2005. Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability. Nucleic Acids Res. 33:2374–2383.
Hay, R. T. 2005. SUMO: a history of modification. Mol. Cell 18:1–12.
Hoen, D. R., N. Juretic, M. L. Huynh, P. M. Harrison, and T. E. Bureau. 2005. The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res. 15:1292–1297.
Hurst, G. D., and J. H. Werren. 2001. The role of selfish genetic elements in eukaryotic evolution. Nat. Rev. Genet. 2:597–606.
Jiang, N., Z. Bao, X. Zhang, S. R. Eddy, and S. R. Wessler. 2004. Pack-MULE transposable elements mediate gene evolution in plants. Nature 431:569–573.
Kato, M., A. Miura, J. Bender, S. E. Jacobsen, and T. Kakutani. 2003. Role of CG and non-CG methylation in immobilization of transposons in Arabidopsis. Curr. Biol. 13:421–426.
Kawasaki, S., and E. Nitasaka. 2004. Characterization of Tpn1 family in the Japanese morning glory: En/Spm-related transposable elements capturing host genes. Plant Cell Physiol. 45:933–944.
Kazazian, H. H. Jr. 2004. Mobile elements: drivers of genome evolution. Science 303:1626–1632.
Kidwell, M. G., and D. R. Lisch. 2001. Perspective: transposable elements, parasitic DNA, and genome evolution. Evolution 55:1–24.
Korswagen, H. C., R. M. Durbin, M. T. Smits, and R. H. Plasterk. 1996. Transposon Tc1-derived, sequence-tagged sites in Caenorhabditis elegans as markers for gene mapping. Proc. Natl. Acad. Sci. USA 93:14680–14685.
Kurepa, J., J. M. Walker, J. Smalle, M. M. Gosink, S. J. Davis, T. L. Durham, D. Y. Sung, and R. D. Vierstra. 2003. The small ubiquitin-like modifier (SUMO) protein modification system in Arabidopsis. Accumulation of SUMO1 and -2 conjugates is increased by stress. J. Biol. Chem. 278:6862–6872.
Le, Q. H., S. Wright, Z. Yu, and T. Bureau. 2000. Transposon diversity in Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 97:7376–7381.
Li, S. J., and M. Hochstrasser. 1999. A new protease required for cell-cycle progression in yeast. Nature 398:246–251.
Li, S. J., and M. Hochstrasser. 2000. The yeast ULP2 (SMT4) gene encodes a novel protease specific for the ubiquitin-like Smt3 protein. Mol. Cell. Biol. 20:2367–2377.
Li, W. H., T. Gojobori, and M. Nei. 1981. Pseudogenes as a paradigm of neutral evolution. Nature 292:237–239.
Lippman, Z., A. V. Gendrel, M. Black et al. (14 co-authors). 2004. Role of transposable elements in heterochromatin and epigenetic control. Nature 430:471–476.
Lippman, Z., B. May, C. Yordan, T. Singer, and R. Martienssen. 2003. Distinct mechanisms determine transposon inheritance and methylation via small interfering RNA and histone modification. PLoS Biol. 1:E67.
Lisch, D. 2002. Mutator transposons. Trends Plant Sci. 7:498–504.
Lisch, D. 2005. Pack-MULEs: theft on a massive scale. Bioessays 27:353–355.
Lois, L. M., C. D. Lima, and N. H. Chua. 2003. Small ubiquitin-like modifier modulates abscisic acid signaling in Arabidopsis. Plant Cell 15:1347–1359.
Lu, C., S. S. Tej, S. Luo, C. D. Haudenschild, B. C. Meyers, and P. J. Green. 2005. Elucidation of the small RNA component of the transcriptome. Science 309:1567–1569.
Makalowski, W. 2003. Genomics. Not junk after all. Science 300:1246–1247.
Marchler-Bauer, A., and S. H. Bryant. 2004. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 32:W327–W331.
Meyers, B. C., S. S. Tej, T. H. Vu, C. D. Haudenschild, V. Agrawal, S. B. Edberg, H. Ghazal, and S. Decola. 2004. The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res. 14:1641–1653.
Miura, A., S. Yonebayashi, K. Watanabe, T. Toyama, H. Shimada, and T. Kakutani. 2001. Mobilization of transposons by a mutation abolishing full DNA methylation in Arabidopsis. Nature 411:212–214.
Moran, J. V., R. J. DeBerardinis, and H. H. Kazazian Jr. 1999. Exon shuffling by L1 retrotransposition. Science 283:1530–1534.
Morgante, M., S. Brunner, G. Pea, K. Fengler, A. Zuccolo, and A. Rafalski. 2005. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat. Genet. 37:997–1002.
Mossessova, E., and C. D. Lima. 2000. Ulp1-SUMO crystal structure and genetic analysis reveal conserved interactions and a regulatory element essential for cell growth in yeast. Mol. Cell 5:865–876.
Murtas, G., P. H. Reeves, Y. F. Fu, I. Bancroft, C. Dean, and G. Coupland. 2003. A nuclear protease required for flowering-time regulation in Arabidopsis reduces the abundance of small ubiquitin-related modifier conjugates. Plant Cell 15:2308–2319.
Nakano, M., K. Nobuta, K. Vemaraju, S. S. Tej, J. W. Skogen, and B. C. Meyers. 2006. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res. 34:D731–D735.
Neuveglise, C., F. Chalvet, P. Wincker, C. Gaillardin, and S. Casaregola. 2005. Mutator-like element in the yeast Yarrowia lipolytica displays multiple alternative splicings. Eukaryot. Cell 4:615–624.
Novatchkova, M., R. Budhiraja, G. Coupland, F. Eisenhaber, and A. Bachmair. 2004. SUMO conjugation in plants. Planta 220:1–8.
Orgel, L. E., and F. H. Crick. 1980. Selfish DNA: the ultimate parasite. Nature 284:604–607.
O’Sullivan, O., K. Suhre, C. Abergel, D. G. Higgins, and C. Notredame. 2004. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 340:385–395.
Page, R. D. 1996. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12:357–358.
Parkinson, H., U. Sarkans, M. Shojatalab et al. (18 co-authors). 2005. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 33:D553–D555.
Pina, C., F. Pinto, J. A. Feijo, and J. D. Becker. 2005. Gene family analysis of the Arabidopsis pollen transcriptome reveals biological implications for cell growth, division control, and gene expression regulation. Plant Physiol. 138:744–756.
Reverter, D., and C. D. Lima. 2004. A basis for SUMO protease specificity provided by analysis of human Senp2 and a Senp2-SUMO complex. Structure 12:1519–1531.
Schaffer, A. A., L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, and S. F. Altschul. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29:2994–3005.
Schoof, H., P. Zaccaria, H. Gundlach, K. Lemcke, S. Rudd, G. Kolesov, R. Arnold, H. W. Mewes, and K. F. Mayer. 2002. MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res. 30:91–93.
Singer, T., C. Yordan, and R. A. Martienssen. 2001. Robertson’s Mutator transposons in A. thaliana are regulated by the chromatin-remodeling gene decrease in DNA methylation (DDM1). Genes Dev. 15:591–602.
Turcotte, K., S. Srinivasan, and T. Bureau. 2001. Survey of transposable elements from rice genomic sequences. Plant J. 25:169–179.
Walbot, V., and G. N. Rudenko. 2002. MuDR/Mu transposable elements of maize. Pp. 533–564 in N. L. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, eds. Mobile DNA II. ASM Press, Washington, D.C.
Wright, S. I., Q. H. Le, D. J. Schoen, and T. E. Bureau. 2001. Population dynamics of an Ac-like transposable element in self- and cross-pollinating Arabidopsis. Genetics 158:1279–1288.

 132 Xia, Y. 2004. Proteases in pathogenesis and plant defence. Cell Microbiol. 6:905– 913. Yamada, K., J. Lim, J. M. Dale et al. (70 co-authors). 2003. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302:842–846. Yang, Y. W., K. N. Lai, P. Y. Tai, and W. H. Li. 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J. Mol. Evol. 48:597–604. Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568–573. Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409–418. Yu, J., Z. Yang, M. Kibukawa, M. Paddock, D. A. Passey, and G. K. Wong. 2002. Minimal introns are not ‘‘junk’’. Genome Res. 12:1185–1189. Yu, Z., S. I. Wright, and T. E. Bureau. 2000. Mutator-like elements in Arabidopsis thaliana. Structure, diversity and evolution. Genetics 156:2019–2031. Zabala, G., and L. O. Vodkin. 2005. The wp mutation of Glycine max carries a gene-fragment-rich transposon of the CACTA superfamily. Plant Cell 17:2619– 2632. Zhang, L., T. J. Vision, and B. S. Gaut. 2002. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol. Biol. Evol. 19:1464–1473. Zilberman, D., X. Cao, and S. E. Jacobsen. 2003. ARGONAUTE4 control of locus-specific siRNA accumulation and DNA and histone methylation. Science 299:716–719. Zimmermann, P., M. Hirsch-Hoffmann, L. Hennig, and W. Gruissem. 2004. GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiol. 136: 2621–2632.  133

Bridge to Chapter Four

In Chapter Three, I characterized a novel A. thaliana gene family, KI, that appears to have arisen through transduplication to become conserved in MULEs and play a role in self-replication. As I have described, transduplication is a process through which host genome sequences are copied into TEs. However, as reviewed in Chapter One, TE sequences can also be co-opted for ordinary genomic functions, a process called molecular domestication. In Chapter Four, I examine domesticated TE genes (DTEs) in A. thaliana. In addition to three previously known A. thaliana DTE families, I identify twenty-three candidate novel families.

Chapter Four: Discovery of domesticated transposable element genes in Arabidopsis thaliana by integrative genomic data mining  

Abstract

Transposable elements (TEs) encode proteins required for self-replication, which are sometimes co-opted to perform new functions not related to self-replication. While examples of such domesticated TEs (DTEs) have been discovered fortuitously, some with essential functions, there have been few systematic searches, and those that have been undertaken have been limited, so we do not know how many DTEs remain to be identified. Here, I describe the most thorough DTE search reported to date, in which I combine multiple types of genomic evidence in Arabidopsis thaliana, including comparative, functional, and structural evidence, to detect all 29 known DTEs and to predict 38 novel high-confidence DTE candidates.

Introduction

Transposable elements (TEs) are abundant and ubiquitous constituents of eukaryotic genomes, which encode proteins that catalyze their self-replication. Occasionally, TE genes adopt new functions, not related to self-replication, that organisms co-opt, a process termed molecular domestication. Dozens of domesticated TE genes (DTEs) have been discovered fortuitously in plants, animals, and other eukaryotes, but there have been few systematic searches, so we do not know how many DTEs remain to be identified, nor to what extent molecular domestication affects evolution. This dearth of systematic searches is partly because there is no easy way to discriminate TE genes from DTEs.

As reviewed in Chapter One of this thesis, some domesticated genes play critical roles in development (Bundock and Hooykaas 2005) or survival (Flajnik and Kasahara 2010); others are transcription factors that may drive lineage-specific adaptation (Feschotte and Pritham 2007; Sinzelle, Izsvák et al. 2009). Most known domesticated genes were found by accident in forward genetic studies, an approach which examines genes with specific predetermined functions or phenotypes. For example, DAYSLEEPER was identified in a molecular screen for proteins that bind a particular DNA motif and was subsequently found, through similarity searches, to be similar to a transposase gene (Bundock and Hooykaas 2005). This type of approach cannot deliberately identify domesticated genes, which share no common function.

Instead, systematic screens for domesticated genes exploit fundamental differences between how TE genes and ordinary genes (i.e., non-TE genes, including DTEs) evolve and operate, differences which are independent of specific gene function. For example, because they are maintained by self-replication, TEs at specific loci are not conserved; conversely, most ordinary genes maintain orthologous chromosomal positions for relatively long evolutionary periods, a property called co-linearity or conserved microsynteny. For another example, TE genes are typically silenced both epigenetically and post-transcriptionally by small-RNA-directed chromatin modification and mRNA degradation; conversely, most ordinary genes are expressed at relatively high levels. To be effective, systematic screens must incorporate enough such lines of evidence to be highly specific, because they must distinguish, among thousands of TE genes, the few dozen, perhaps, that are DTEs.

To date, there have been few systematic searches for domesticated genes, each with different limitations. Two searches in vertebrates each focused primarily on one line of evidence: the Human Genome Project focused on expression (Lander, Linton et al. 2001), while a study by Zdobnov et al. (2005) focused on conserved microsynteny between rat and mouse. A third systematic search, in plants, used several lines of evidence, especially TE structure and phylogeny, but was restricted to molecular domestication events occurring prior to the divergence of monocots and eudicots, and to one particular transposase (mudrA). It found one new family of domesticated genes called MUSTANG (Cowan, Hoen et al. 2005).

Because of this paucity of systematic searches, we do not know the amount, identity, or complete set of functions of DTEs. Nor can we begin to properly evaluate the basic characteristics of the process of molecular domestication, such as its frequency, the longevity of DTEs in different lineages of organisms or derived from different TEs, or its prevalence in various evolutionary periods. Using systematic screens to identify large, unbiased sets of domesticated genes, we can begin to evaluate the evolutionary significance of this potentially important evolutionary process. We can also identify DTEs that would otherwise be dismissed as TEs but which may have important functions or applications. Conversely, we can improve genome annotations by identifying TEs that have been erroneously annotated as ordinary genes.

Here, I describe a systematic screen that combines sufficient lines of evidence to accurately discriminate between known DTEs and TEs and that appears to identify novel DTEs. I first identify TE-like genes by computationally scanning the Arabidopsis thaliana genome for genes that contain particular conserved domains that are normally found exclusively or primarily in TEs. Each TE-like gene is then examined for various characteristics that usually differ between TEs and ordinary genes (including DTEs), including structural characteristics (e.g., repetitiveness), functional characteristics (e.g., expression), and evolutionary characteristics (e.g., conserved microsynteny). Based on these characteristics, I discriminate between DTEs and TE genes, using an ad hoc integrative data mining strategy. In addition to successfully detecting all 29 known A. thaliana DTEs, I predict 38 novel high-confidence DTE candidates, including five separate complex loci, which may suggest a connection between these two unusual aspects of molecular evolution. I also find that some lines of evidence, such as shared microsynteny, more effectively discriminate between DTEs and TEs than other lines of evidence, such as expression, yet the best results come from combining diverse lines of evidence.

Methods

Data Sources

A. thaliana TAIR9 genome annotations and chromosome sequences were downloaded from the National Center for Biotechnology Information (NCBI) in GenBank format (GI: 240254421, 240254678, 240255695, 240256243, 240256493). Genome features comprising the lines of evidence described in Table 4-1 (except for cross-genome alignments, see below) were downloaded in GFF format, pre-mapped to the TAIR9 chromosome release, from the Arabidopsis Information Resource (TAIR) (Swarbreck, Wilks et al. 2008).

Precomputed cross-genome VISTA and RankVISTA alignments and contigs between A. thaliana and seventeen other species were downloaded from the Lawrence Berkeley National Laboratory (LBL) (Frazer, Pachter et al. 2004) (Table 4-2) and processed with custom Perl scripts. To download the data, HTTP GET requests were sent to LBL web applications (http://pipeline.lbl.gov/cgi-bin/*, where * is ‘gp_cns’ or ‘textBrowser2’, for VISTA or RankVISTA, respectively). VISTA downloads were broken into multiple overlapping segments per chromosome because requests for segments longer than about one Mbp caused LBL server timeouts. The data consist of tables of local cross-genome alignments grouped into contigs. For VISTA, contig data include an A. thaliana chromosome number and a contig name in the subject species. Local alignment data include genome coordinates in both species, percent identity, and type (exon, intron, UTR, or intergenic). For RankVISTA, only the A. thaliana coordinates are provided, and p-values replace percent identity. Downloaded cross-genome alignment and contig data for each species and chromosome segment were converted from HTML to GFF format.
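As an illustration of the segmented VISTA downloads described above, the following minimal Python sketch (not the original Perl code) splits a chromosome into overlapping segments no longer than one Mbp. The function name and the 10 kbp overlap are assumptions made for the example; the text specifies only that segments overlapped and were limited to about one Mbp.

```python
def overlapping_segments(chrom_length, max_len=1_000_000, overlap=10_000):
    """Return 1-based inclusive (start, end) segments that tile a chromosome.

    max_len follows the ~1 Mbp limit mentioned in the text; the overlap
    size is a hypothetical value chosen only for illustration.
    """
    if chrom_length < 1:
        raise ValueError("chrom_length must be positive")
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    segments = []
    start = 1
    while True:
        end = min(start + max_len - 1, chrom_length)
        segments.append((start, end))
        if end == chrom_length:
            break
        # The next segment starts inside the current one, so that
        # alignments spanning a boundary are not lost.
        start = end - overlap + 1
    return segments


if __name__ == "__main__":
    for seg in overlapping_segments(2_500_000):
        print(seg)
```

Alignments falling in the overlap between two segments would be retrieved twice, so a real pipeline would presumably need to remove duplicates after download.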

Table 4-1. Lines of evidence.

Labels¹     Description                                                     Analyses

COMPARATIVE
Spp         cross-genome local alignments                                   count, species count ²
Cv, Crv     cross-genome contigs                                            count, species count ²
P           p-value (RankVISTA)                                             minimum
I           %-identity vs. A. lyrata (VISTA)                                count, weighted average, stdev.
Sv, Srv     shared microsynteny                                             microsynteny set ²

FUNCTIONAL
cDNA, EST   cDNAs & ESTs from A. thaliana, Brassica, radish (GenBank)       count, coverage
Seq         mRNA-seq (Lister, O'Malley et al. 2008)                         count, coverage ³
n/a         CATMA mRNA (Gagnot, Tamby et al. 2008)                          count, coverage ³
Prot        proteome (Baerenfaller, Grossmann et al. 2008)                  count, coverage
Prot        proteome (Castellana, Payne et al. 2008)                        count, coverage
sm          smRNA (Lister, O'Malley et al. 2008)                            count, coverage ³
n/a         smRNA (Gregory, O'Malley et al. 2008)                           count, coverage ³
n/a         smRNA (ASRP) (Backman, Sullivan et al. 2008)                    count, coverage ³
n/a         smRNA (MPSS) (Lu, Tej et al. 2005)                              count, coverage ³
Me          methylation (Zhang, Yazaki et al. 2006)                         integral, average
n/a         phosphorylation (PhosPhAt) (Heazlewood, Durek et al. 2008)      count, coverage
n/a         Quesneville TEs (TAIR)                                          count, coverage
n/a         polymorphism (Zeller, Clark et al. 2008)                        count, coverage

REPETITIVENESS
RDNA        DNA repetitiveness (domain, gene, flanking)                     copy-number, length, position
RAA         AA clustering (low & high similarity)                           count

POSITIONAL
n/a         pericentromere, distance to shoulder or genetically-defined     distance
EST, sm     regional densities of: protein coding genes, TE genes,          count/kbp, coverage
            TE domains, Quesneville TEs, ESTs, Lister smRNAs

¹ Column headers in Table 4-13, corresponding to this line of evidence
² VISTA and RankVISTA features analyzed separately
³ Sense and antisense features analyzed separately

Table 4-2. Number of cross-genome alignments to A. thaliana per species.

Species             VISTA    RankVISTA
A. lyrata           382624   62694
G. max              204497   601
C. papaya           110323   520
P. trichocarpa      159760   514
M. esculenta        138772   488
M. guttatus         103947   479
R. communis         115642   390
C. sativus          101631   287
O. sativa           76695    287
V. vinifera         127303   271
Z. mays             84059    187
M. truncatula       95747    178
S. bicolor          72390    150
C. reinhardtii      2582     137
B. distachyon       69117    112
P. patens           38926    60
S. moellendorffii   46081    48

Identification of TE-like Genes

Conserved domains that are characteristic of transposable element (TE) genes were identified in the Pfam 24.0 database (Finn, Mistry et al. 2010) using two complementary approaches: literature and keyword searches (Figure 4-1). The literature was searched for references to TE domains and corresponding Pfam models were identified. Keywords were manually compiled (Table 4-3) and used to search the Pfam database. In addition, the Pfam clan of each identified TE domain was searched for additional domains (except ‘domains of unknown function’, DUF). For each model, the literature references, descriptions, and species distributions were examined. Models were categorized as TE signature domains if they were found to be present exclusively or predominantly in eukaryotic TEs, or in known domesticated TEs (DTEs). They were excluded if commonly found in non-TE genes, or found only in bacterial genomes or viral genes.

Figure 4-1. Schematic overview.

[Flowchart. Step 1, Identify TE Domain Models: Pfam literature and keyword searches yield 48 TE domain models (Table 1). Step 2, Locate Domains: a genome scan of sequences yields 2365 TE domain genes. Step 3, Load Data: functional, comparative, and genome data are loaded into a database. Step 4, Analyze Evidence: conservation, repetitiveness, local features, regional features, synteny, and families. Step 5, Integrate & Predict: detect DTEs and TEs. Step 6, Curate Manually: visualize and curate, yielding 67 DTEs (Tables X & Y) and 2298 TEs (Table X).]

Table 4-3. Pfam keywords.

Strings transpos, retrotranspos, gag capsid, aspartic proteinase, reverse transcriptase, rnase h, integrase, envelope, tyrosine recombinase, apurinic endonuclease, Tc1 transposase, mariner, hAT transposase, mule transposase, mudr, mutator transposase, merlin, Transib, Drosophila P element, PiggyBac, PIF, Harbinger, CACTA, En-Spm, Crypton, Helitron, Maverick  

Three types of A. thaliana sequences were scanned for TE signature domains: whole chromosomes, open reading frames (ORFs), and gene sequences. First, the sequence of each chromosome was translated into all six virtual reading frames using the EMBOSS program, transeq (Rice, Longden et al. 2000). Second, potential ORFs, meeting the following definition, were identified in the virtual chromosome translations using a custom Perl script: a contiguous amino acid sequence, longer than 9 residues, bracketed by stop codons. (The minimum length was imposed so that very short ORFs, of which there are many, were not processed. Since the minimum domain length found by hmmscan (see below) was 18 residues, this limit did not eliminate any useful results.) Third, TAIR gene models of the protein coding and transposable element gene types were used. If multiple models were available for a locus, all were used.

Each of these three sequence types was searched for TE signature domain instances using the HMMER3 program, hmmscan (Eddy 2009), with options ‘max’ (maximum sensitivity) and ‘cut_tc’ (use trusted cutoff thresholds). TE domain coordinates from the search results were mapped to chromosome coordinates with a custom Perl script.

Based on the results of whole chromosome, ORF, and gene searches, loci containing TE signature domains were identified, using a custom Perl script, as follows. Loci overlapping each TE domain were identified based on chromosome coordinates. For loci annotated as protein coding genes, a domain was counted as overlapping if it overlaps an exon. For loci annotated as transposable element genes, a domain was counted as overlapping if it overlaps any part of the gene. Each domain was attributed to at most one overlapping locus, as follows. For domains overlapping exactly one locus, the domain was attributed to the locus. For domains overlapping multiple loci in a nested configuration, the domain was attributed to the innermost nested locus. For domains overlapping multiple loci in a non-nested configuration, the domain was attributed, wherever possible, to a locus annotated as a transposable element gene. Domains not overlapping any locus were treated as ‘orphan’ domains (and excluded from most subsequent analyses). TAIR loci that were found to contain a TE signature domain were defined to be TE-like genes.
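The domain attribution rules above can be sketched in Python as follows. This is an illustrative reconstruction, not the original Perl script: the `Locus` class, the function names, and the tie-breaking choice (the first candidate in input order when several loci qualify equally) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field

TE_GENE = "transposable_element_gene"
PROTEIN_CODING = "protein_coding"


@dataclass
class Locus:
    name: str
    kind: str                      # TE_GENE or PROTEIN_CODING
    start: int
    end: int
    exons: list = field(default_factory=list)  # (start, end) pairs


def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def locus_overlaps_domain(locus, d_start, d_end):
    # TE genes: any part of the gene counts. Protein coding genes: exons only.
    if locus.kind == TE_GENE:
        return overlaps(locus.start, locus.end, d_start, d_end)
    return any(overlaps(s, e, d_start, d_end) for s, e in locus.exons)


def attribute_domain(d_start, d_end, loci):
    """Return the single locus a domain is attributed to, or None (orphan)."""
    hits = [l for l in loci if locus_overlaps_domain(l, d_start, d_end)]
    if not hits:
        return None
    if len(hits) == 1:
        return hits[0]
    # Nested configuration: ordered by length, each locus lies inside the next.
    by_len = sorted(hits, key=lambda l: l.end - l.start)
    nested = all(
        outer.start <= inner.start and inner.end <= outer.end
        for inner, outer in zip(by_len, by_len[1:])
    )
    if nested:
        return by_len[0]  # innermost locus
    # Non-nested: prefer a TE gene where possible (tie-break is an assumption).
    te_hits = [l for l in hits if l.kind == TE_GENE]
    return te_hits[0] if te_hits else hits[0]
```

For example, a domain lying in an exon of a protein coding gene that is itself nested inside a transposable element gene would be attributed to the inner, protein coding locus.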

Table 4-4. Pfam TE signature domains.

Order Superfamily Domain Name

Class I

Copia TYA

ATHILA, Gypsy, Transposase_28, Gypsy Peptidase_A2B, rve, Integrase, Peptidase_A2E LTR Integrase_DNA, Retrotrans_gggag, Integrase_Zn RnaseH, RVP, Bel-Pao Peptidase_A17 RVP_2 Retrovirus TLV_coat, Gag_p30, RVT_1, Gag_MA, Retro_M RVT_2 ERV

DIRS (all)

PLE Penelope

L1 Transposase_22 LINE I RnaseH

Class II

Transposase_5, Transposase_1, Transposase_14, Tc3_transposase, Tc1-Mariner Transposase_Tc5, Transp_Tc5_C, DDE

hAT hATC, Hermes_DBD, DUF659

Transposase_mut, MuDR, MULE, FAR1, Zea_mays_MuDR, Mutator TIR Peptidase_C484

P 87kDa_TransP, Transposase_37

PIF-Harbinger Plant_tran, Transposase_11

CACTA Transposase_21, Transposase_24, Transposase_23

Maverick Maverick DNA_pol_B_2, rve  

Storage, Attribution, and Analysis of Genomic Features

All genomic feature data, including loci, TE domains, and each line of evidence, were loaded into a MySQL SeqFeature::Store database. Features were attributed to TE-like genes that they overlap by at least half the feature’s length. Depending on the type of feature, various combinations of analyses were performed, using custom Perl scripts (Table 4-1). Some of these analyses require explanation: ‘Coverage’ is the fraction of unique genomic positions (i.e., base pairs) in a given range (i.e., start to end) that overlap any feature of a specified type. (Unique means that positions overlapping more than one feature were counted only once.) ‘Integral’, calculated for methylation data, which consist of a signal in the interval [0, 1] for each genomic position in a specified range, is the sum of the signals over the range (i.e., the area under the curve). The average signal is the integral divided by the length of the range.

Depending on the type of feature, these analyses were performed over various segments of the gene or genome. Cross-genome analyses were conducted separately for features attributed to a TE domain and features attributed to a TE-like gene but not overlapping a domain. Separate analyses were conducted for VISTA and RankVISTA data. Expression-related analyses were conducted separately for features on sense or antisense strands. All analyses (except shared microsynteny, see below) were conducted at each of three scales: (i) for each TE-like gene, (ii) for windows of various sizes (10, 20, 50, 100, and 200 kbp) centered at each TE-like gene, and (iii) genome-wide, in 100 kbp buckets.

VISTA local alignments of each type (exon, intron, UTR, intergenic) were used to calculate DNA percent identity between A. thaliana and A. lyrata for each TE-like gene, and genome-wide. The length-weighted average percent-identity of a gene was defined as the sum of the products of percent-identity and length, divided by the sum of lengths, of all alignments attributed to the gene.
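The three summary statistics defined above (coverage, integral with average signal, and length-weighted average percent identity) can be illustrated with a short Python sketch. The function names and input layouts are hypothetical and chosen for the example; the original analyses were implemented as custom Perl scripts against the feature database.

```python
def coverage(features, start, end):
    """Fraction of positions in [start, end] overlapped by any feature.

    features is a list of 1-based inclusive (start, end) intervals;
    positions covered by several features are counted once.
    """
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in features
        if s <= end and e >= start
    )
    covered = 0
    run_start = run_end = None
    for s, e in clipped:
        if run_end is None or s > run_end:
            # Close the previous merged run before starting a new one.
            if run_end is not None:
                covered += run_end - run_start + 1
            run_start, run_end = s, e
        else:
            run_end = max(run_end, e)
    if run_end is not None:
        covered += run_end - run_start + 1
    return covered / (end - start + 1)


def integral_and_average(signals):
    """Sum of per-position signals in [0, 1], and that sum over the range length."""
    total = sum(signals)
    return total, total / len(signals)


def weighted_identity(alignments):
    """Length-weighted average percent identity.

    alignments is a list of (percent_identity, length) pairs.
    """
    total_len = sum(length for _, length in alignments)
    return sum(pid * length for pid, length in alignments) / total_len
```

As a worked case, two alignments of 90% identity over 100 bp and 80% identity over 300 bp give (90 × 100 + 80 × 300) / 400 = 82.5%, rather than the unweighted mean of 85%.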

Determination of Shared Microsynteny

The shared microsynteny set for a TE-like gene was defined as the set of neighbouring loci, annotated as protein coding genes, that share with the TE-like gene at least one cross-genome contig, as follows. A cross-genome contig consists of a set of local alignments, between A. thaliana and a subject species, located on contiguous chromosome segments in each species (Frazer, Pachter et al. 2004). As described above, a local alignment is attributed to a TE-like gene if it overlaps (by at least half its length) an exon, in the case of annotated protein coding genes, or any part of the gene, in the case of annotated transposable element genes. The contig to which such an alignment belongs, also attributed to the TE-like gene, may contain alignments that overlap neighbouring protein coding loci. If such an alignment overlaps an exon in a neighbouring locus (by at least half the length of the exon or the alignment), the locus is part of the shared synteny set.

Loci in the shared synteny set of each TE-like gene were divided into two categories: domain and non-domain. ‘Domain’ shared synteny loci were defined as those attributed to any cross-genome contig containing a local alignment that overlaps a TE domain in the gene (by at least half the length of the domain or the alignment). Other shared synteny loci were defined to be ‘non-domain’. The numbers of domain and non-domain shared synteny genes were tallied separately. VISTA and RankVISTA data were considered both separately and jointly.

Shared synteny between A. thaliana and A. lyrata was manually validated in select cases using the TAIR Synteny Viewer tool (http://gbrowse.arabidopsis.org/cgi-bin/gbrowse_syn/arabidopsis/) (McKay, Vergara et al. 2010).
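The construction of the shared microsynteny set can be sketched as below. This is a simplified Python illustration under stated assumptions, not the original implementation: all coordinates are A. thaliana positions on a single chromosome, the data structures are hypothetical, and the domain versus non-domain split is omitted.

```python
def overlap_len(a, b):
    """Number of shared positions between two 1-based inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def length(a):
    return a[1] - a[0] + 1


def half_overlap(a, b):
    """True if a and b overlap by at least half the length of either one."""
    ov = overlap_len(a, b)
    return ov > 0 and (2 * ov >= length(a) or 2 * ov >= length(b))


def shared_synteny_set(te_regions, contigs, neighbours):
    """Return names of neighbouring loci sharing a contig with a TE-like gene.

    te_regions: intervals of the TE-like gene that count for attribution
                (its exons, or the whole gene for a TE gene).
    contigs:    dict mapping contig id to a list of alignment intervals.
    neighbours: dict mapping locus name to a list of exon intervals.
    """
    shared = set()
    for alignments in contigs.values():
        # A contig is attributed to the gene if one of its alignments
        # overlaps the gene by at least half the alignment's length.
        attributed = any(
            2 * overlap_len(a, r) >= length(a)
            for a in alignments
            for r in te_regions
        )
        if not attributed:
            continue
        for locus, exons in neighbours.items():
            if any(half_overlap(a, x) for a in alignments for x in exons):
                shared.add(locus)
    return shared
```

In this sketch, a neighbouring locus enters the set as soon as one contig links it to the TE-like gene; the number of supporting contigs or species is not recorded.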

Clustering of Domain Sequences

The amino acid sequence of each TE domain, as identified by hmmscan (see above), was aligned by TBLASTN to a database of the DNA sequences of all annotated protein coding and transposable element genes (one DNA sequence per locus, which includes exons, introns, and UTRs). Alignments that did not exceed specified percent identity and length thresholds were filtered out (see below). Genes that co-aligned to a common domain sequence were clustered by a single-link clustering algorithm implemented in Perl. Clustering was performed twice, at different thresholds that were calibrated using known TE and DTE reference sets. ‘High similarity’ thresholds were calibrated to form separate clusters of known TE families. ‘Low similarity’ thresholds were calibrated to cluster families of known DTEs. High and low similarity thresholds were, respectively: 90% identity, 85 residues; 55% identity, 85 residues.
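Single-link clustering of genes that co-align to a common domain sequence can be sketched with a union-find structure, as in the Python example below. It is a minimal illustration rather than the original Perl implementation; the input tuple layout, the gene and domain identifiers in the usage note, and the treatment of the thresholds as inclusive minimums are assumptions.

```python
def single_link_clusters(hits, min_identity, min_length):
    """Cluster genes linked through shared domain query sequences.

    hits: iterable of (domain_id, gene_id, percent_identity, aligned_length)
          tuples, one per TBLASTN alignment.
    Returns a sorted list of sorted gene-id lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_gene = {}  # one representative gene per domain sequence
    for domain, gene, identity, aligned in hits:
        if identity < min_identity or aligned < min_length:
            continue  # alignment fails the similarity thresholds
        find(gene)
        if domain in first_gene:
            union(first_gene[domain], gene)
        else:
            first_gene[domain] = gene

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in clusters.values())
```

Running the same hits at (90, 85) and then at (55, 85) mirrors the high and low similarity passes: with single linkage, genes g1 and g3 fall in one cluster if both align to domains that also match g2, even if g1 and g3 never align to the same domain.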

Calculation of DNA Repetitiveness

The DNA sequence of each TE-like gene, plus 1 kbp of flanking sequence, was aligned to the chromosome sequences by BLASTN, with a maximum E-value of 1e-50 (roughly corresponding to 85% identity over 200 bp). The number of alignments (HSPs) in each of the following three (non mutually exclusive) categories was counted: (i) alignments to a TE domain (overlapping the domain by at least half the length of the domain or the alignment), (ii) alignments to the gene but not the domain (overlapping the gene by at least 100 bp but not sufficiently overlapping a TE domain), and (iii) alignments overlapping both the gene (by at least 100 bp) and the flanking region (also by at least 100 bp). Alignments that did not sufficiently overlap the TE-like gene were not counted.
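The three alignment categories can be expressed as a small classification function. The Python sketch below is only an illustration of the stated overlap rules: the function and category names are hypothetical, coordinates are assumed to be query positions on the same axis as the gene, and the flanking region is taken to be 1 kbp on each side of the gene.

```python
def overlap_len(a, b):
    """Number of shared positions between two 1-based inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def categorize_hsp(hsp, gene, domains, flank=1000, min_bp=100):
    """Return the set of repetitiveness categories an HSP falls into.

    hsp, gene: (start, end) intervals; domains: list of (start, end).
    Categories are not mutually exclusive, as in the text.
    """
    hsp_len = hsp[1] - hsp[0] + 1
    on_domain = False
    for dom in domains:
        ov = overlap_len(hsp, dom)
        dom_len = dom[1] - dom[0] + 1
        # At least half the length of the domain or of the alignment.
        if ov > 0 and (2 * ov >= dom_len or 2 * ov >= hsp_len):
            on_domain = True
            break
    on_gene = overlap_len(hsp, gene) >= min_bp
    left = (gene[0] - flank, gene[0] - 1)
    right = (gene[1] + 1, gene[1] + flank)
    on_flank = max(overlap_len(hsp, left), overlap_len(hsp, right)) >= min_bp

    categories = set()
    if on_domain:
        categories.add("domain")
    if on_gene and not on_domain:
        categories.add("gene_not_domain")
    if on_gene and on_flank:
        categories.add("gene_and_flank")
    return categories
```

An alignment that returns an empty set does not sufficiently overlap the TE-like gene and would not be counted.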

Calculation of Pericentromere Shoulder Coordinates

Genetically-defined pericentromere coordinates were used as defined in (AGI 2000). In addition, pericentromeric ‘shoulders’ were identified. Pericentromeric shoulders correspond to regions of greatly elevated TE densities, and reduced non-TE gene densities, that extend beyond the genetically-defined pericentromeres. They were defined as a contiguous region encompassing the genetically-defined pericentromere and extending outwards, by 100 kbp increments, over which both of the following conditions are met: (i) the ‘Quesneville’ TE density (see http://gbrowse.arabidopsis.org/cgi-bin/gbrowse/arabidopsis/?help=citations#Transposons_raw) is greater than the chromosome mean (over all 100 kbp windows) plus one standard deviation, and (ii) the protein coding gene density is less than the mean minus half a standard deviation.
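The outward extension that defines a shoulder can be sketched as follows. This Python example is a hedged reading of the definition above, not the original code: it assumes per-window densities are already computed for consecutive 100 kbp windows, always keeps the genetically-defined windows, and uses the population standard deviation, none of which the text specifies.

```python
from statistics import mean, pstdev


def shoulder_windows(te_density, gene_density, peri_first, peri_last):
    """Extend a pericentromere outwards over consecutive 100 kbp windows.

    te_density, gene_density: per-window densities along one chromosome.
    peri_first, peri_last: indices of the first and last windows of the
    genetically-defined pericentromere.
    Returns (first, last) window indices of the shoulder region.
    """
    te_cutoff = mean(te_density) + pstdev(te_density)
    gene_cutoff = mean(gene_density) - 0.5 * pstdev(gene_density)

    def qualifies(i):
        # Both conditions must hold for a window to join the shoulder.
        return te_density[i] > te_cutoff and gene_density[i] < gene_cutoff

    first, last = peri_first, peri_last
    while first > 0 and qualifies(first - 1):
        first -= 1
    while last < len(te_density) - 1 and qualifies(last + 1):
        last += 1
    return first, last
```

Extension stops at the first window on each side that fails either condition, so the resulting region is contiguous by construction.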

Controls

Known A. thaliana DTEs (Table 4-9) and TE genes (Hoen, Park et al. 2006) were used as positive and negative controls, respectively, for all relevant methods, as were A. thaliana DTEs and TEs identified in a previous, unpublished search (J. Schwartzentruber, pers. comm.). To estimate the sensitivity of RankVISTA in detecting conservation, two additional sets of known genes were used (Table 4-5): (i) twenty-six homologs of selected slowly evolving genes (Nozaki, Iseki et al. 2007), and (ii) twenty arbitrarily-selected known genes.

Calculation of Confidence Levels and Curation

Each TE-like gene was assigned an integer value, the confidence level, to reflect the certainty that it is a DTE. Predicted DTEs were defined as TE-like genes with confidence levels ranging from 5, for the most certain DTEs, to 3, for the least certain. Predicted TEs were defined as TE-like genes with confidence levels ranging from 2, for the least certain TEs, to 0, for the most certain. Genes with confidence level 2 include atypical TEs, such as those that are expressed or have an unusually high level of cross-genome conservation, and may also include some atypical DTEs.

Table 4-5. Additional positive controls.

Name                                                          Loci

SLOWLY EVOLVING GENES
ACTIN 1                                                       AT2G37620
EUKARYOTIC INITIATION FACTOR 1A                               AT5G35680, AT2G04520
ARABIDOPSIS CYTOPLASMIC RIBOSOMAL PROTEIN, L10                AT1G14320, AT1G26910, AT1G66580
HISTONE H4                                                    AT2G28740, AT5G59970, AT5G59690
HEAT SHOCK PROTEIN 70S (HSP70S)                               AT3G09440, AT5G02500, AT1G16030
RPN (26S PROTEASOME SUBUNIT)                                  AT5G09900, AT5G05780
60S RIBOSOMAL PROTEIN L8                                      AT4G36130, AT2G18020
60S RIBOSOMAL PROTEIN L27A                                    AT1G12960
40S RIBOSOMAL PROTEIN S2                                      AT2G41840, AT1G59359, AT3G57490
40S RIBOSOMAL PROTEIN S14                                     AT2G36160, AT3G11510, AT3G52580
40S RIBOSOMAL PROTEIN S17                                     AT5G04800
40S RIBOSOMAL PROTEIN S23                                     AT3G09680
VACUOLAR ATP SYNTHASE SUBUNIT B1                              AT1G76030

OTHER GENES
FERRIC REDUCTION OXIDASE 2                                    AT1G01580
F14N23.19                                                     AT1G10310
ATP-CITRATE LYASE A-1                                         AT1G10670
GIGANTEA                                                      AT1G22770
F28B23.4                                                      AT1G26300
JASMONATE INSENSITIVE 1                                       AT1G32640
F10C21.1                                                      AT1G33320
CELL WALL / VACUOLAR INHIBITOR OF FRUCTOSIDASE 1              AT1G47960
ACYL-ACTIVATING ENZYME 18                                     AT1G55320
PRODUCTION OF ANTHOCYANIN PIGMENT 1                           AT1G56650
F1E22.17                                                      AT1G65820
ETHYLENE RESPONSE 1                                           AT1G66340
ACT REPEAT 4                                                  AT1G69040
GAMMA-GLUTAMYL HYDROLASE 1                                    AT1G78660
GAMMA-GLUTAMYL HYDROLASE 2                                    AT1G78680
NONHOST RESISTANCE TO P. S. PHASEOLICOLA 1                    AT1G80460
XERICO                                                        AT2G04240
MONOGALACTOSYL DIACYLGLYCEROL SYNTHASE 3                      AT2G11810
CYTOCHROME P450, FAMILY 79, SUBFAMILY B, POLYPEPTIDE 3        AT2G22330
PHYB ACTIVATION TAGGED SUPPRESSOR 1                           AT2G26710

Confidence levels were calculated based on the strength of, and agreement between, key lines of evidence. Key lines of evidence, identified based on biological considerations and observed measurements, were combined into simplified categories (Table 4-6). Evidence was divided into four levels of strength: convincing (++), strong (+), weak (?), or absent (−). Each evidence category and strength level was assigned a score. The sum of scores (S) was calculated. Uncurated confidence levels were assigned based on the following minimum values of S, for levels 5 to 0, respectively: 50, 30, 1, -30, -65, -∞.

All DTE-like genes with uncurated confidence level 2 or higher, as well as a sample with levels 0 & 1, were manually curated. Curation involved multiple steps. All key evidence was validated. The distribution of nearby features (e.g., TEs, protein coding genes, unconfirmed exons, small RNAs, cross-genome alignments) was examined at various scales. Exceptions and unexpected cases (e.g., spurious short cross-genome alignments) that were not handled as intended by the computational methods were identified and corrected. In ambiguous cases, additional evidence was examined. Interrelationships between lines of evidence were examined, as were spatial patterns, which are difficult to quantify or evaluate computationally. Evidence from similar genes (based on domain clustering), where available, was taken into account.

Evidence was visualized using the genome browser, GBrowse 2.0 (Stein, Mungall et al. 2002). (For reasons of database efficiency, and to take advantage of recent TAIR10 updates, such as the addition of an “unconfirmed exons” track, both a locally installed GBrowse instance and the TAIR online tool (http://gbrowse.arabidopsis.org) were used. The following tracks were examined. (1) Locally: TE domains (Whole-Chromosome Search, ORF Search, Gene Search, & Multi-Exon), ORFs, Genes, Pseudogenes, Quesneville Natural transposons, RankVISTA (contigs & alignments), and VISTA Synteny (contigs & alignments); (2) At arabidopsis.org: Annotation Units, Gaps, Assembly Updates, TAIR10 Unconfirmed Exons, Locus, Protein Coding Gene Models, Transposable element genes, Obsoleted Genes, Arabidopsis cDNAs, Arabidopsis ESTs, Brassica cDNAs, Brassica ESTs, Radish Clones, AtProteome, AtPeptide, Polyadenylation sites (AtPolyA-DB), Phosphorylation (PhosPhAt), mRNA col-0 (Lister et al. 2008), smRNA col-0 (Lister et al. 2008), Methylation (all eight tracks from Zhang, 2006), and VISTA Plot (all species).)

Based on curation results, DTE-like genes were assigned a “TE-appearance” value, using the same 4-level scale as described above for evidence strength. A curated confidence level was calculated based on the uncurated confidence level and TE-appearance value, as follows. For uncurated confidence levels 3 or higher: if TE-appearance was convincing (++) or strong (+), the curated confidence level was set to 2; if TE-appearance was weak (?), the curated confidence level was set to one less than the uncurated level; if TE-appearance was absent (−), the curated confidence level was set equal to the uncurated level. For uncurated confidence levels 2 or lower, the curated confidence level was set equal to the uncurated level (they were not adjusted because these levels were intended to capture TEs).
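The mapping from score sum to uncurated confidence level, and the curation adjustment, can be written compactly. The Python sketch below takes the thresholds and rules directly from the text; the function names are hypothetical, the per-category scores of Table 4-6 are not reproduced, and the absent level is written as an ASCII hyphen for convenience.

```python
import math

# Minimum score sum S required for each uncurated level, from 5 down to 0.
LEVEL_MINIMUMS = [(5, 50), (4, 30), (3, 1), (2, -30), (1, -65), (0, -math.inf)]


def uncurated_level(score_sum):
    """Return the highest confidence level whose minimum S is met."""
    for level, minimum in LEVEL_MINIMUMS:
        if score_sum >= minimum:
            return level


def curated_level(uncurated, te_appearance):
    """Adjust an uncurated level by the TE-appearance value.

    te_appearance is one of "++", "+", "?", or "-" (absent).
    """
    if uncurated <= 2:
        return uncurated          # levels intended to capture TEs are kept
    if te_appearance in ("++", "+"):
        return 2                  # looks like a TE: demote to least certain TE
    if te_appearance == "?":
        return uncurated - 1
    if te_appearance == "-":
        return uncurated
    raise ValueError(f"unknown TE-appearance value: {te_appearance!r}")
```

Note that a level 3 gene with weak TE-appearance drops to level 2 and therefore moves from the predicted DTE set to the predicted TE set.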

Additional Analysis of Complex Loci

TE-like genes with confidence level 2 or higher, located close together (less than about 10 kbp apart), and containing the same TE domain(s) were considered to be potential complex loci and subjected to various additional analyses. For instance, copy number, intra- and inter-genomic sequence similarity, and location in A. thaliana and A. lyrata, as well as other plant species, were investigated by TBLASTN using the LBL Phytozome website (http://www.phytozome.net). DNA repetitiveness was also examined using Phytozome. Searches for terminal repeats, which are indicative of TEs, were conducted by BLASTN searches of sequences flanking the genes. Combined microarray expression data were examined using the Arabidopsis eFP Browser (Winter, Vinegar et al. 2007). SBS, PARE, and MPSS short read data were examined using the Meyers lab’s (University of Delaware) Arabidopsis short read databases (Nakano, Nobuta et al. 2006). Basic phylogenetic analyses were performed using the phylogeny.fr platform (Dereeper, Guignon et al. 2008).
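The selection criterion for potential complex loci (confidence level 2 or higher, same TE domains, less than about 10 kbp apart) could be expressed as a simple clustering pass. The sketch below is hypothetical: the record type, the gene names and coordinates, the reading of "same TE domain(s)" as an identical domain set, and the measurement of distance from the end of one gene to the start of the next are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record type; field names are illustrative only.
Gene = namedtuple("Gene", "locus chrom start end domains confidence")


def potential_complex_loci(genes, max_gap=10_000, min_confidence=2):
    """Group nearby genes that carry the same set of TE domains."""
    eligible = [g for g in genes if g.confidence >= min_confidence]
    # Sort so that genes sharing a chromosome and domain set are adjacent,
    # ordered by start coordinate.
    eligible.sort(key=lambda g: (g.chrom, tuple(sorted(g.domains)), g.start))

    clusters = []
    current = []       # genes in the cluster being built
    current_end = 0    # right-most end coordinate seen in that cluster
    for gene in eligible:
        joins = (
            current
            and gene.chrom == current[-1].chrom
            and set(gene.domains) == set(current[-1].domains)
            and gene.start - current_end < max_gap
        )
        if not joins:
            if len(current) > 1:
                clusters.append([g.locus for g in current])
            current = []
            current_end = gene.end
        current.append(gene)
        current_end = max(current_end, gene.end)
    if len(current) > 1:
        clusters.append([g.locus for g in current])
    return clusters


# Made-up genes: A and B cluster; C is too far away, D has different
# domains, and E falls below the confidence cutoff.
example = [
    Gene("geneA", "chr1", 1_000, 3_000, ("MuDR", "MULE"), 4),
    Gene("geneB", "chr1", 6_000, 8_000, ("MULE", "MuDR"), 3),
    Gene("geneC", "chr1", 40_000, 42_000, ("MuDR", "MULE"), 5),
    Gene("geneD", "chr1", 9_000, 10_000, ("hATC",), 5),
    Gene("geneE", "chr1", 4_000, 5_000, ("MuDR", "MULE"), 1),
]
print(potential_complex_loci(example))
```

Candidate clusters found this way would still need the manual follow-up described above (copy number, similarity searches, terminal repeats).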

Results

TE Signature Domains

Forty-eight Pfam TE signature domains were identified, 28 for Class I TEs and 25 for Class II (Table 4-4 and Figure 4-1). Class I TE superfamilies have many similar genes (Wicker, Sabot et al. 2007), thus many of the signature domains are common between Class I superfamilies and orders. For instance, all Class I TE orders are represented by at least one signature domain because the RVT_1 and RVT_2 domains are common to all of them (except SINEs, which by definition have no coding capacity). However, many Class I superfamilies are not represented by any signature domain specific to that superfamily alone. Conversely, all Class II domains in the TE signature domain set are superfamily-specific. No signature domain was found for 5 of the 12 Class II superfamilies: Merlin, Transib, PiggyBac, Crypton, and Helitron. Three of the Class II signature domains (DDE, DUF659, and Transposase_11) initially could not be assigned to specific superfamilies based on the literature and Pfam distributions. However, analysis of the completed scan results in A. thaliana (see below) showed that all three unambiguously co-occupy loci containing other, superfamily-specific domains, so these initially unassigned domains were then assigned to the corresponding superfamilies.

The Peptidase_C48 domain was included in the TE signature domain set because it is found in approximately one hundred A. thaliana TEs (Hoen, Park et al. 2006); however, because it is also found in a family of non-TE genes (the ULP family), this domain was treated separately in most analyses. An additional 32 TE-related domains were found, but were excluded from the signature domain set because their distribution was not restricted to eukaryotic TEs (Table 4-7).

Domain Scans

In the three types of domain scan (scans of whole-chromosome translations, virtual open reading frames (ORFs), and gene models), a total of 13,251 matches to TE signature domains were found. All matches had significant E-values: the number of matches decreased rapidly as E-value increased, with only 20 matches at E-values greater than 1e-5 and a maximum E-value of 1e-3. (This E-value distribution is an improvement over previous searches using an earlier version of the scanning software (HMMER2), in which the number of matches increased rapidly at high E-values (near unity), and which had required the use of an arbitrary E-value cutoff, compromising both sensitivity and specificity; data not shown.)

Table 4-7. Non-signature Pfam TE domains.

Domains WRKY, PMD, Plant_all_beta, THAP, DUF390, DUF1979, RVT_connect, RVT_thumb, Phage-MuB_C, TnsA_C, TnsA_N, Mu-transpos_C, Tn916-Xis, IstB, DUF746, TniB, Transposase_34, DUF834, TniQ, Transposase_36, IstB_N, TraO, TraH, DUF1087, CtnDOT_TraJ, DUF2080, Tn7_TnsC, HTH_14, DUF222, zf-FCS, RteC, Hypoth_Ymh  

Table 4-8. Number of TE domain genes by confidence level and type.

Set | Confidence Level¹ | Known TE² | Known PC² | Known Total | Predicted TE² | Predicted PC² | Predicted Total
DTEs | 5 | 6 | 19 | 25 | 4 | 14 | 18
DTEs | 4 | 2 | 2 | 4 | 8 | 6 | 14
DTEs | 3 | 0 | 0 | 0 | 6 | 0 | 6
DTEs | Total | 8 | 21 | 29 | 18 | 20 | 38
TEs | 2 | 3 | 2 | 5 | 52 | 18 | 70
TEs | 1 | 15 | 2 | 17 | 230 | 24 | 254
TEs | 0 | 151 | 0 | 151 | 1795 | 6 | 1801
TEs | Total | 169 | 4 | 173 | 2077 | 48 | 2125

¹ Uncurated confidence level for known DTEs & TEs; curated confidence for predicted.
² Annotated TAIR gene model type: transposable element gene (TE), protein coding (PC).

The TE signature domain matches overlap 2,365 TE-like loci (Table 4-8), not including loci that contain only zf-CCHC or Peptidase_C48 domain matches unless they are known TE genes (Hoen, Park et al. 2006). Most TE-like loci (2,113) were found in all three types of scan, with ORF scans being the most sensitive, finding at least one match in all but 13 of the loci, and finding 67 loci not found in the other two scan types. Whole-chromosome and gene model scans missed 312 loci and 85 loci, respectively, and neither found any locus not also found in at least one other scan type. All 13 loci missed in the ORF scans were found in the gene model scans. The TE-like loci include all 29 known domesticated transposable elements (DTEs) (and other host genes known to contain a TE signature domain) (Tables 4-9 & 4-10), as well as 174 (94.5%) of known TE genes in the control set. In addition to the TE-like genes, 284 ‘orphan’ TE signature domain matches were found that do not overlap any locus. Because most orphan domains were found in both ORF and whole-chromosome searches, and because single domains were often fragmented across multiple ORFs or whole-chromosome reading frames (e.g., because of frame-shifts or because they are spread across multiple exons), the actual number of unique orphan domains is approximately 140. These orphan domains, presumably part of TE fossils, were not analyzed further. Finally, 23 domains overlap more than one locus, all but one of which are annotated as transposable element genes.

Table 4-9. Known DTEs summary.

Name | Domains | Family | Type¹ | Synteny², RankV², VISTA², mRNA², smRNA², Repetitive², Pericent², Phospho²

NOT RECOGNIZED AS TE-DERIVED
TERT | RVT_1 |  | PC | ++ + ++ ++ − − − +
NMAT1a | RVT_1 | RVT_1.3 | PC | ++ ++ ++ + − − − +
NMAT2 | RVT_1 | RVT_1.3 | PC | ++ ++ ++ ++  − − +
NMAT3 | RVT_1 |  | PC | ++ + ++ ++  − − +
NMAT4 | RVT_1 |  | PC | ++ + ++ ++ − − − +
RAD54 | Plant_tran |  | PC | ++ + ++ ++  − − +

ARE RECOGNIZED AS TE-DERIVED
DAYSLEEPER | hATC |  | PC |  ++ ++ ++   ++ −
FAR1 | MULE | FAR1³ | PC | ++++ ++ −−−+
FHY3 | MULE | FAR1³ | PC | ++ − ++ ++ −−−+
FRS1 | MULE | FAR1³ | PC | ++ + ++ ++ −−−+
FRS2 | MULE | FAR1³ | PC | ++ + ++ ++  −−+
FRS3 | MULE | FAR1³ | PC | ++ + ++ ++ −−−+
FRS4 | MULE | FAR1³ | PC | ++ + ++ ++ −−−+
FRS5 | MULE | FAR1³ | PC | ++ + ++ ++ −−−+
FRS6 | MULE⁴ | FAR1³ | PC | ++ + ++ ++ −−−+
FRS7 | MULE | FAR1³ | PC | ++ + ++ ++ −  − +
FRS8 | MULE | FAR1³ | PC | ++ + ++ +  −−+
FRS9 | MULE | FAR1³ | PC | ++ − ++ ++ −−−+
FRS10 | MULE | FAR1³ | PC | − + ++ +  − + +
FRS11 | MULE | FAR1³ | PC | + ++ ++ ++ −−−+
FRS12 | MULE | FAR1³ | PC | ++ + ++ ++ −  − +
MUG1 | MuDR, MULE⁴ | MUG | TE | ++ ++ ++ ++  −−−
MUG2 | MuDR, MULE | MUG | TE | ++ + ++ + −−−−
MUG3 | MuDR, MULE | MUG | TE | ++ + ++ ++  −−−
MUG4 | MuDR, MULE | MUG | TE | ++ ++ ++ ++ −−−−
MUG5 | MuDR, MULE | MUG | TE | + ++ ++ ++ −  −−
MUG6 | MuDR, MULE | MUG | TE | ++++ ++   −−
MUG7 | MuDR, MULE | MUG | TE | ++ ++ ++ ++ −−−−
MUG8 | MuDR, MULE | MUG | TE | + ++ ++ +  − + −

¹ Gene model type: protein coding gene (PC), transposable element gene (TE)
² See Table 4-6
³ FAR1/FHY3/FRS family
⁴ Transposase_mut domain also found

Cross-Genome Conservation

As expected, cross-genome VISTA alignments between A. thaliana and A. lyrata had higher percent identity, and less variance, in regions annotated as exons than in regions annotated as introns, UTRs, or intergenic sequence (Table 4-11). Statistically significant cross-genome alignments (p < 0.1) between A. thaliana and A. lyrata, as computed by the Rank-VISTA pipeline (using the Gumby algorithm) (Prabhakar, Poulin et al. 2006; Brudno, Poliakov et al. 2007), were exceptionally useful in discriminating between TE and non-TE genes, primarily because of their high specificity. Only 4% (7/174) of known TEs in the negative control set had a Rank-VISTA alignment overlapping a TE signature domain (confidence level 2 or lower, see below). (Note, however, that the predicted TE set is biased, since Rank-VISTA alignment was one factor used in calculating confidence levels.) Rank-VISTA alignments between A. thaliana and A. lyrata were also reasonably sensitive. The fraction of genes in four positive controls with at least one Rank-VISTA alignment varied between 63% and 93%, with the highest sensitivity in the known DTE set (Table 4-12). Fifty-five percent (21/38) of predicted DTEs (confidence level 3 or higher, see below) had a Rank-VISTA alignment (Table 4-13). (Note, again, that the predicted DTE set is biased, since Rank-VISTA alignment was one factor used in calculating confidence levels.) Sensitivity was greatly reduced in comparisons to species less closely related to A. thaliana than is A. lyrata. Whereas A. lyrata had tens of thousands of Rank-VISTA alignments to A.
Table 4-10. Known DTEs details.

[Rotated table; cell values are not recoverable in this copy. Rows: TERT, NMAT1a, NMAT2, NMAT3, NMAT4, RAD54, DAYSLEEPER, FAR1, FHY3, FRS1 to FRS12, MUG1 to MUG8. Columns, by group: Conf¹; Synteny (Sv², Srv³); Conservation (Cv⁴, Crv⁵, Spp⁶, P⁷, I⁸); Expression (cDNA⁹, EST¹⁰, Seq¹¹, Prot¹²); Regulation (sm¹³, Me¹⁴); Repetitiveness (R DNA¹⁵, R AA¹⁶); Regional (EST¹⁷, sm¹⁷).]

¹ Curated confidence level (and uncurated confidence level in parentheses, if different)
² Size of VISTA synteny set (TE domain only)
³ Size of RankVISTA synteny set (TE domain only)
⁴ Number of VISTA contigs (TE domain only)
⁵ Number of RankVISTA contigs (TE domain only)
⁶ Number of species with RankVISTA alignments (TE domain only)
⁷ Minimum P-value of RankVISTA alignments (TE domain only; ‘n/a’ if no alignments)
⁸ Average weighted percent-identity to A. lyrata VISTA alignments (‘n/a’ if no alignments)
⁹ Number of A. thaliana cDNAs
¹⁰ Percent coverage of gene by A. thaliana ESTs
¹¹ Number of mRNA-seq reads (sense strand only)
¹² Sum of Baerenfaller & Castellana peptides
¹³ Number of small RNA-seq reads (any strand)
¹⁴ Maximum of HMBD & mCIP (col BU/UB) average methylation
¹⁵ DNA repetitiveness (any category)
¹⁶ AA repetitiveness: number of loci with a high-similarity BLASTP alignment to a TE domain amino acid sequence
¹⁷ EST, smRNA percent coverage over 100 kbp region centered at gene

Table 4-11. Genome-wide DNA percent identity between A. thaliana and A. lyrata.

Type | Number of Alignments | Weighted Average | Standard Deviation
exon | 11610 | 92.7% | 6.1%
gene¹ | 19917 | 89.4% | 8.6%
any² | 26775 | 86.7% | 8.9%

¹ Exon, intron, or UTR
² Exon, intron, UTR, or intergenic

Table 4-12. Rank-VISTA sensitivity.

Gene Set | A. lyrata Only | A. lyrata & Other¹ | Other¹ Only | None² | A. lyrata Sensitivity³
Slowly Evolving | 17 | 5 | 3 | 2 | 81%
Metabolism | 13 | 2 | 0 | 5 | 75%
ULP | 5 | 0 | 0 | 3 | 63%
Known DTE | 18 | 9 | 0 | 2 | 93%

¹ ‘Other’: species other than A. lyrata
² ‘None’: no RankVISTA result
³ Percent of total genes with a RankVISTA result in A. lyrata
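As a worked check of the sensitivity column in Table 4-12, the following sketch recomputes it from the four count columns, taking sensitivity to be the share of genes with an A. lyrata result (alone or together with other species) out of all genes in the set. The variable and function names are illustrative, and half-up rounding is an assumption chosen to match the table's whole-number percentages.

```python
import math

# Counts from Table 4-12:
# (A. lyrata only, A. lyrata & other, other only, none)
TABLE_4_12 = {
    "Slowly Evolving": (17, 5, 3, 2),
    "Metabolism": (13, 2, 0, 5),
    "ULP": (5, 0, 0, 3),
    "Known DTE": (18, 9, 0, 2),
}


def a_lyrata_sensitivity(lyrata_only, lyrata_and_other, other_only, none):
    """Percent of genes in the set with a RankVISTA result in A. lyrata."""
    total = lyrata_only + lyrata_and_other + other_only + none
    return 100.0 * (lyrata_only + lyrata_and_other) / total


def round_half_up(value):
    """Round to the nearest integer, with .5 rounding up (assumed convention)."""
    return math.floor(value + 0.5)


sensitivities = {
    gene_set: round_half_up(a_lyrata_sensitivity(*counts))
    for gene_set, counts in TABLE_4_12.items()
}
for gene_set, percent in sensitivities.items():
    print(f"{gene_set}: {percent}%")
```

Python's built-in `round` uses round-half-to-even, which would turn the ULP value of exactly 62.5 into 62, hence the explicit helper.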

Table 4-13. New DTEs summary.

Locus | Domains | Family | Type | Synteny, RankV, VISTA, mRNA, smRNA, Repetitive, Pericent, Phospho, TE Appearance¹, Confidence²
AT1G64240³ | MuDR, MULE | MULE.2 | TE | ++ − +  − + −−− 3
AT1G64250³ | MuDR, MULE | MULE.2 | TE | ++ + +  − + −−− 4
AT1G64255³ | MuDR, MULE | MULE.2 | PC | ++ + + −−+ − + − 4
AT1G64260³ | MuDR, MULE | MULE.2 | PC | ++ − +  − + − + − 4
AT1G64270³ | MuDR, MULE | MULE.2 | TE | ++ − +++− + −−− 4
AT1G49920 | MuDR, MULE | MULE.2 | PC | ++++ − + − 5
AT5G41710 | MuDR, MULE | MULE.2⁵ | TE | + − + −−−−−− 3
AT5G50315⁴ | MuDR |  | TE |  ++++−−−−− 4
AT4G08220 | MuDR |  | TE | + +++−−+ −  3
AT3G13010³ | hATC, DUF659 | hATC.1 | PC | ++ − ++  −  − + − 4
AT3G13020³ | hATC, DUF659 | hATC.1⁵ | PC | +++++ −  − + − 5
AT3G13030³ | hATC, DUF659 | hATC.1 | PC | +++++++−  − + − 5
AT4G15020⁴ | hATC, DUF659 | hATC.2 | PC | + + ++ ++ −  − + − 5
AT3G22220⁴ | hATC, DUF659 | hATC.2 | PC | + + ++ ++ −  − + − 5
AT1G79740 | hATC, DUF659 |  | PC | +++++++−−−+ − 5
AT3G17450⁴ | hATC, DUF659 |  | PC | ++ ++ ++ ++ −−−+ − 5
AT3G17290³ | hATC | hATC.3⁵ | TE | ++ + +  −−−− 5
AT3G17260³ | hATC | hATC.3⁵ | TE | ++ − +  −−−−− 4
AT1G80020 | hATC | hATC.4⁵ | TE | ++++++ −−−− 5
AT1G15300 | hATC | hATC.4⁵ | TE | ++ − ++ ++ −  −−− 4
AT1G18560 | hATC | (DAYSLEEPER)⁶ | PC | ++++++ −−+ − 5
AT3G14800⁴ | hATC |  | TE | +++++++−−−−− 5
AT1G69950 | hATC |  | TE | ++++++ −−−−− 5
AT4G13120 | hATC |  | TE | + − ++−−−−− 3
AT3G63270 | Plant_tran |  | PC | +++++++−−−+ − 5
AT3G55350 | Plant_tran |  | PC | +++++++−−−+ − 5
AT5G41980 | Plant_tran |  | PC | + − ++ ++ −−−+ − 4
AT3G14780 | Transposase_24 |  | PC | ++ − +++−−−+ − 5
AT1G21220³ | Retrotrans_gag | gag.1 | TE | + − +++−−−−− 3
AT1G21260³ | Retrotrans_gag | gag.1 | TE |  − ++−  −−− 3
AT1G21280³ | Retrotrans_gag | gag.1 | PC |  − +++−  − + − 4
AT1G21290³ | Retrotrans_gag | gag.1 | TE |  − + −−−−−− 3
AT1G61160 | Retrotrans_gag |  | PC | ++ + + + −−−+  4
AT5G20880⁴ | Retrotrans_gag |  | TE | ++ − +++−−−−− 4
AT5G51080 | RnaseH | RnaseH.1 | PC | +++++++−  − + − 5
AT1G24090 | RnaseH | RnaseH.1 | PC | + − ++ ++ −−−+ − 4
AT4G21420 | RnaseH |  | TE | ++++ −−−− 3
AT3G01410⁴ | RnaseH |  | PC | ++ − ++ ++ −−−+ − 5

¹ Curated strength of TE evidence: conclusive (++), strong (+), weak (?), none (−)
² Curated confidence level (and uncurated confidence level in parentheses, if different)
³ Belongs to a complex locus
⁴ Previously identified DTE candidate (J. Schwartzentruber, pers. comm.)
⁵ Not part of an automated low similarity family (added manually)
⁶ BLASTP best-reciprocal-hit with DAYSLEEPER; potential paralogs

Table 4-14. New DTEs details.

[Rotated table; cell values are not recoverable in this copy. Rows: the 38 loci listed in Table 4-13. Columns, by group: Conf; Synteny (Sv, Srv); Conservation (Cv, Crv, Spp, P, I); Expression (cDNA, EST, Seq, Prot); Regulation (sm, Me); Repetitiveness (R DNA, R AA); Regional (EST, sm).]

thaliana, the species with the next highest number of alignments, G. max, had only hundreds of alignments (Table 4-2). Nevertheless, some DTEs, such as members of the MUSTANG gene family (Cowan, Hoen et al. 2005), did have Rank-VISTA alignments to species other than A. lyrata (Table 4-11).

Pericentromere Shoulders

Pericentromere shoulder regions, containing elevated TE density and lowered gene density relative to the chromosome arms, were found to extend beyond the genetically-defined pericentromere boundaries by approximately 0.5 Mbp to 2 Mbp (Table 4-15).

Table 4-15. Pericentromere boundaries.

Chr. | Genetic (Mbp) | Shoulder (Mbp)
1 | 14.2 - 15.7 | 12.6 - 17.1
2 | 3.0 - 3.9 | 1.2 - 6.8
3 | 13.3 - 14.5 | 10.8 - 16.3
4 | 2.1 - 4.2 | 1.6 - 6.1
5 | 11.0 - 12.6 | 9.7 - 14.7

Domesticated Transposable Elements

There are 23 known A. thaliana DTEs, in two families (FAR1/FHY3/FRS and MUG) and one single-copy gene (DAYSLEEPER). All were successfully identified (at confidence level 3 or higher) (Tables 4-9 and 4-10). Six additional non-TE genes that contain TE signature domains, but which are not recognized in the literature as DTEs, were also identified.

Thirty-eight previously unreported DTEs were predicted based on the similarity of their characteristics to known DTEs (Tables 4-13 and 4-14). All have some evidence of conserved microsynteny between A. thaliana and A. lyrata. Four of these have shared microsynteny with fewer than three loci, a condition that is occasionally also found in TEs that contain multiple genes; however, in three of these four (AT1G21260, AT1G21280, AT1G21290), this situation is due to their arrangement in a complex locus. All have at least strong evidence of cross-genome conservation (Rank-VISTA or VISTA, see also above). Many are expressed, none are strongly targeted by small RNAs, and only one is located in a pericentromere shoulder region. Although five predicted DTEs (AT1G64240, AT1G64250, AT1G64255, AT1G64260, AT1G64270) show evidence of repetitiveness, this appears to result from their arrangement in a complex locus rather than from TE-like structures. Twenty-two of the predicted DTEs form 7 gene families. An additional predicted DTE (AT1G18560) is the BLASTP best-reciprocal-hit to the known DTE, DAYSLEEPER, possibly forming another gene family. The remaining 15 predicted DTEs appear to be single-copy genes.

Although only 3% of TE-like genes overall were predicted to be DTEs, Class II TEs were seven times more likely to form DTEs than Class I TEs, as a fraction of extant TE-like genes (Table 4-16). The superfamily with the largest absolute number of DTEs was Mutator-like elements, with nearly double the number of DTEs derived from the second-ranked superfamily, hAT. Among Class II TEs, hAT elements were the most frequently domesticated, followed by Mutator-like and PIF-Harbinger elements. Among Class I TEs, the Retrotrans_gag and RnaseH domains were the most frequently domesticated.

In addition to the predicted DTEs, 2124 predicted TEs were identified, a sample of which are shown in Tables 4-17 & 4-18. Most predicted TEs had characteristics, such as repetitiveness, typical of, and specific to, TEs, and thus had confidence levels less than 2. However, 82 of the predicted TEs had atypical characteristics, such as being expressed. A subset of these atypical TE-like genes at confidence level 2 may be DTEs rather than TEs.

Table 4-16. Number of TEs and DTEs per superfamily.

Class | Order or Superfamily | Domains | TE | DTE | % DTE
II | Mutator | MuDR, MULE, Transposase_mut | 247 | 31 | 11%
II | hAT | hATC, DUF659 | 73 | 16 | 18%
II | PIF-Harbinger | Plant_tran, Transposase_11 | 38 | 4 | 10%
II | CACTA | Transposase_21, Transposase_23, Transposase_24 | 295 | 1 | 0%
II | Tc1-Mariner | Transposase_Tc5 | 1 | 0 | 0%
II | Total |  | 654 | 52 | 7%
I | Gypsy | ATHILA | 220 | 0 | 0%
I | LTR | rve | 527 | 0 | 0%
I | LTR or DIRS | Retrotrans_gag, RnaseH, RVP_2 | 454 | 10 | 2%
I | (nonspecific) | RVT_1, RVT_2 | 346 | 5 | 1%
I | Total |  | 1547 | 15 | 1%
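The percentages in Table 4-16 are consistent with computing the DTE share as DTE / (TE + DTE), i.e. as a fraction of all extant TE-like genes in each group. The short sketch below, with illustrative names, recomputes the class totals under that assumption, along with the overall figure and the Class II to Class I ratio quoted in the text.

```python
# (TE count, DTE count) per group, from Table 4-16.
CLASS_II = {
    "Mutator": (247, 31),
    "hAT": (73, 16),
    "PIF-Harbinger": (38, 4),
    "CACTA": (295, 1),
    "Tc1-Mariner": (1, 0),
}
CLASS_I = {
    "Gypsy": (220, 0),
    "LTR": (527, 0),
    "LTR or DIRS": (454, 10),
    "(nonspecific)": (346, 5),
}


def percent_dte(te, dte):
    """DTEs as a percentage of all TE-like genes (TEs plus DTEs)."""
    return 100.0 * dte / (te + dte)


def class_totals(groups):
    """Sum TE and DTE counts over every group in a class."""
    te = sum(counts[0] for counts in groups.values())
    dte = sum(counts[1] for counts in groups.values())
    return te, dte


te_ii, dte_ii = class_totals(CLASS_II)
te_i, dte_i = class_totals(CLASS_I)

pct_ii = percent_dte(te_ii, dte_ii)
pct_i = percent_dte(te_i, dte_i)
pct_all = percent_dte(te_ii + te_i, dte_ii + dte_i)

print(f"Class II: {pct_ii:.1f}%  Class I: {pct_i:.1f}%  overall: {pct_all:.1f}%")
print(f"Class II / Class I ratio: {pct_ii / pct_i:.1f}")
```

Using DTE / TE instead would give about 13% for Mutator rather than the tabulated 11%, which is why the TE + DTE denominator is assumed here.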

Table 4-17. Predicted TEs examples.

Locus | Domains | Type | Synteny, RankV, VISTA, mRNA, smRNA, Repetitive, Pericent, Phospho, TE Appearance, Confidence
AT4G11200 | MuDR | TE | −−++ + −−−+  2 (3)
AT5G45576 | MuDR, MULE, Transposase_mut | TE |  +++− ++ ++ −−++ 1
AT5G43015 | MuDR, MULE, Transposase_mut | TE | − −−−++ ++ −−++ 0
AT2G16040 | hATC | PC | −−++  ++  − ++ 2
AT2G06541 | hATC | PC | −−++ + ++ + + − + 1
AT5G28673 | hATC | TE | −−−−++ ++ + − ++ 0
AT4G01975 | Plant_tran, MULE | TE | ++¹ − +++++++−−++ 2
AT1G24370 | Plant_tran | TE |  − ++ + ++ ++ − +++ 1
AT5G35207 | Plant_tran | TE | −−−++++++ − ++ 0
AT3G30836 | Transposase_24, Transposase_23 | TE | − +++ ++ − + − ++ 2
AT5G28065 | Transposase_21 | TE |  ++− ++ ++ + − ++ 1
AT2G04210 | Transposase_24 | TE | −−−++ ++ ++ + + ++ 0
AT2G06660 | Transposase_Tc5, DDE | TE | −−−−−−+ − + 1
AT5G28920 | ATHILA | PC | −−−++ + − +++ 2
AT4G05633 | ATHILA | TE | −−++ + ++ ++ ++ + ++ 1
AT2G14000 | ATHILA | PC | −−− ++ + + + ++ 0
AT2G27935 | rve | TE | ++² − + −  −−−+ 2 (4)
AT1G27285 | rve, RVT_2, Retrotrans_gag | TE | −−−++ − ++ −−++ 1
AT3G33011 | rve, RVT_2 | TE | −  − ++ ++ ++ − ++ 0
AT1G47370 | Retrotrans_gag | PC | −−−++−−+  2
AT1G34070 | Retrotrans_gag | PC | −−−++ ++ −−+++ 1
AT3G32383 | Retrotrans_gag | TE | −−−−++ ++ ++ − ++ 0
AT2G11240 | RVT_1 | TE | − ++++ ++ + − + 2
AT2G02110 | RVT_1 | TE |  − + + ++ ++ −−++ 1
AT2G07683 | RVT_2 | TE | −−−+++++++− ++ 0

¹ Spurious synteny due to a short VISTA match
² Synteny appears legitimate

Table 4-18. Selected evidence for example TEs.

[Rotated table; cell values are not recoverable in this copy. Rows: the 25 loci listed in Table 4-17. Columns, by group: Conf; Synteny (Sv, Srv); Conservation (Cv, Crv, Spp, P, I); Expression (cDNA, EST, Seq, Prot); Regulation (sm, Me); Repetitiveness (R DNA, R AA); Regional (EST, sm).]

Table 4-19. Small RNA coverage of DTEs and TEs.

Coverage Range¹ (%) | # DTEs | Cumulative DTEs (%) | # TEs | Cumulative TEs (%)
0 | 79 | 73.8% | 71 | 3.0%
(0, 1] | 15 | 87.9% | 17 | 3.8%
(1, 2] | 8 | 95.3% | 16 | 4.4%
(2, 3] | 3 | 98.1% | 8 | 4.8%
(3, 4] | 1 | 99.1% | 19 | 5.6%
(4, 5] | 1 | 100.0% | 14 | 6.2%
(5, 6] | 0 | 100.0% | 10 | 6.6%
(6, 7] | 0 | 100.0% | 13 | 7.2%
(7, 8] | 0 | 100.0% | 11 | 7.7%
(8, 9] | 0 | 100.0% | 16 | 8.3%
(9, 10] | 0 | 100.0% | 24 | 9.4%
(10, 20] | 0 | 100.0% | 274 | 21.1%
(20, 30] | 0 | 100.0% | 281 | 33.1%
(30, 40] | 0 | 100.0% | 265 | 44.4%
(40, 50] | 0 | 100.0% | 229 | 54.2%
(50, 60] | 0 | 100.0% | 222 | 63.7%
(60, 70] | 0 | 100.0% | 189 | 71.8%
(70, 80] | 0 | 100.0% | 165 | 78.9%
(80, 90] | 0 | 100.0% | 142 | 84.9%
(90, 100] | 0 | 100.0% | 76 | 88.2%
(100, 110] | 0 | 100.0% | 58 | 90.7%
(110, 120] | 0 | 100.0% | 77 | 94.0%
(120, 130] | 0 | 100.0% | 42 | 95.8%
(130, 140] | 0 | 100.0% | 26 | 96.9%
(140, 150] | 0 | 100.0% | 21 | 97.8%
(150, 160] | 0 | 100.0% | 24 | 98.8%
(160, 170] | 0 | 100.0% | 20 | 99.7%
(170, 180] | 0 | 100.0% | 6 | 99.9%
(180, 190] | 0 | 100.0% | 2 | 100.0%
(190, 200] | 0 | 100.0% | 0 | 100.0%

¹ Percent coverage was calculated separately for each strand and then summed. Maximum coverage of each strand is 100%, so the maximum total coverage is 200%.
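The cumulative columns of Table 4-19 follow from the per-bin counts by a running sum divided by the column total. A minimal sketch for the DTE column is shown below; the names are illustrative, and only the non-zero DTE bins (coverage 0 up to the (4, 5] bin) are included, since every later bin adds nothing.

```python
from itertools import accumulate

# Number of DTEs per small RNA coverage bin, from Table 4-19
# (bins: 0, (0, 1], (1, 2], (2, 3], (3, 4], (4, 5]; all later bins are 0).
DTE_COUNTS = [79, 15, 8, 3, 1, 1]


def cumulative_percent(counts):
    """Running total of counts as a percentage of the overall total."""
    total = sum(counts)
    return [100.0 * running / total for running in accumulate(counts)]


dte_cumulative = cumulative_percent(DTE_COUNTS)
for count, percent in zip(DTE_COUNTS, dte_cumulative):
    print(f"{count:3d}  {percent:5.1f}%")
```

The same function applies unchanged to the TE column once all thirty TE bin counts are supplied.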

Table 4-20. Identified TE genes misannotated in TAIR as protein coding

Loci1 AT1G19260, AT1G34070, AT1G36095, AT1G36230, AT1G37020, AT1G40087, AT1G42525, AT1G43260, AT1G47370, AT1G47497, AT1G53705, AT2G04420, AT2G06500, AT2G06541, AT2G07200, AT2G07240, AT2G07738, AT2G10260, AT2G12880, AT2G13770, AT2G14000, AT2G15180, AT2G16040, AT2G19960, AT2G46460, AT3G24255, AT3G25270, AT3G27800, AT3G29750, AT3G30200, AT3G30235, AT3G30370, AT3G30770, AT3G30820, AT3G31402, AT3G42723, AT3G43470, AT3G43490, AT3G44570, AT3G44580, AT3G60040, AT4G00430, AT4G08267, AT4G08430, AT4G09490, AT4G10200, AT4G20220, AT4G20520, AT4G23160, AT4G29090, AT5G28190, AT5G28920, AT5G31412, AT5G33330, AT5G33406, AT5G35490, AT5G35695, AT5G39770, AT5G42905, AT5G45570, AT5G48050

1 List of 61 loci, comprising 4 known KI-MULE loci and 48 predicted TE loci with domains from Table 4-4, plus 9 additional Peptidase_C48-containing loci, likely in KI-MULEs, that were not documented in Hoen et al. (2006).

Figure 4-2: Host genes

[Figure 4-2-1: genome browser view of NC_003070:4917851..4937850. Tracks: TE Domains, Gene Search; Genes (AT1G14370, AT1G14380, AT1G14390, AT1G14400, AT1G14410, AT1G14420, AT1G14430; also labels AT1G01950, AT1G03780, AT1G05610); Pseudogenes; Quesneville et al. Natural transposons; RankVISTA (Arabidopsis_lyrata); VISTA Synteny (A. lyrata); Arabidopsis cDNAs; Arabidopsis ESTs; mRNA col-0 (Lister et al. 2008); AtProteome (Baerenfaller et al. 2008); AtPeptide (Castellana et al. 2008); smRNA col-0 (Gregory et al. 2008); HMBD col BU/UB.]

Figure 4-2-1 (above). The graphical display of key evidence, illustrated using the well-known gene UBIQUITIN CONJUGATING ENZYME 1 (UBC1). Conservation of UBC1 (“Genes” track, AT1G14400) is supported in A. lyrata by matches to both a RankVISTA region (“RankVISTA” track) and a VISTA region (“VISTA Synteny” track, top line), as well as by VISTA matches to other species (“VISTA Synteny” track, unlabeled lines). RankVISTA P-values are indicated by colour codes (white, P > 0.05; red, P > 0.01; yellow, P > 0.001; green, P ≤ 0.001). UBC1 synteny is strongly supported by matches to the A. lyrata VISTA and RankVISTA regions from multiple genes on either side. UBC1 expression is supported by multiple lines of evidence: cDNA (“Arabidopsis cDNAs” track), EST (“Arabidopsis ESTs” track), mRNA-Seq (“mRNA col-0” track), and proteomics (“AtProteome” and “AtPeptide” tracks). There is no evidence that UBC1 is targeted by small RNAs (“smRNA col-0” track) or methylated (“HMBD col BU/UB” track). There are no TE domains (“TE Domains, Gene Search” track) or “transposable element gene” loci (“Pseudogenes” track) found in this region, and there is only a single short annotated TE (“Quesneville et al. Natural transposons” track). (Note that the three thin lines at the top of the “Genes” track, labelled AT1G01950, AT1G03780, and AT1G05610, are artifacts and should be ignored.)

Figure 4-2-2 (below). The characteristics of a transposed gene. AT2G16390 is neither a TE nor a DTE, yet it has transposed in A. thaliana since the last common ancestor with A. lyrata (Woodhouse, Pedersen et al. 2010). Conservation of AT2G16390 is supported by VISTA matches but not RankVISTA. Because it has transposed, AT2G16390 is not syntenic in A. lyrata, as reflected by the VISTA track (see below). (Note: evidence of synteny does exist for AT2G16390 in a different species: M. guttatus, an Asterid.) Expression is well supported and AT2G16390 is not targeted by smRNA or methylated. The insertion pattern reflected in the VISTA Synteny track is interpreted as follows. Orthologous A. lyrata regions are labeled “scaffold 3(a)”, “scaffold 3(b)”, and “scaffold 6”. Scaffold 3a consists of matches to a segment of A. lyrata scaffold 3 (21621431 bp - 18112766 bp), and scaffold 3b is another segment of scaffold 3 (21629923 bp - 21794115 bp), separated from 3a by a gap of 8491 bp (21629923 bp - 21621431 bp - 1 bp). AT2G16390 sits in the gap in scaffold 3, aligning instead to a different A. lyrata scaffold, scaffold 6. This configuration reflects the transposition of AT2G16390 from a region orthologous to A. lyrata scaffold 6 to one orthologous to scaffold 3.

[Figure 4-2-2: genome browser view of NC_003071:7331711..7351710. Tracks: TE Domains, Gene Search; Genes (AT2G16910, AT2G16920, AT2G16930, AT2G16940); Pseudogenes; Quesneville et al. Natural transposons; RankVISTA; VISTA Synteny (scaffold 6, scaffold 3 (a), scaffold 3 (b), M. guttatus); Arabidopsis cDNAs; Arabidopsis ESTs; mRNA col-0 (Lister et al. 2008); AtProteome (Baerenfaller et al. 2008); AtPeptide (Castellana et al. 2008); smRNA col-0 (Gregory et al. 2008); HMBD col BU/UB.]

Figure 4-3: TEs

[Figure 4-3-1: genome browser view of NC_003071:5875000..5931999. Tracks: TE Domains, Gene Search (rve, RVP_2, ATHILA, Peptidase_C48, RVT_1, RVT_1, MULE, Retrotrans_gag, MuDR); Genes (AT2G14000, AT2G14045, AT2G14050, AT2G14060, AT2G14070, AT2G14080); Pseudogenes (AT2G13990, AT2G14010, AT2G14020, AT2G14030, AT2G14040, AT2G14090); Quesneville et al. Natural transposons; RankVISTA; VISTA Synteny (A. lyrata); Arabidopsis cDNAs; Arabidopsis ESTs; mRNA col-0 (Lister et al. 2008); AtProteome (Baerenfaller et al. 2008); AtPeptide (Castellana et al. 2008); smRNA col-0 (Gregory et al. 2008); HMBD col BU/UB.]

Figure 4-3-1 (above). A typical TE-dense region adjoining a typical gene-dense region. Loci annotated as “protein coding” are shown in the “Genes” track; those annotated as “transposable element gene” are shown in the “Pseudogenes” track. The A. lyrata VISTA region is labeled. At approximate position 5907k (chr. 2) there is a boundary between a TE-dense region on the left and a gene-dense region on the right. The TE-dense region includes 4 predicted TE domain loci (or 5, if Peptidase_C48 is included), including AT2G14000, one of the examples listed in Tables 4-17 and 4-18. Note that AT2G14000 is (erroneously) annotated as a “protein coding” gene. None of the predicted TE genes have evidence of synteny or conservation, there is little evidence of expression, and the entire region is heavily targeted by smRNA and methylated. Except for AT2G14000, the loci also all overlap with annotated TEs. Conversely, in the gene-dense region, 5 of 6 loci (including AT2G14090) are supported by both RankVISTA and VISTA conservation and synteny in A. lyrata. The sixth, AT2G14080, appears to have transposed since the divergence of A. thaliana and A. lyrata, yet is expressed and shows no evidence of smRNA targeting or methylation; it may be a non-TE gene that has transposed.

Figure 4-3-2 (below). An easily identifiable TE. AT5G43015 (Tables 4-17 and 4-18) contains Transposase_mut, MULE, and MuDR domains, shows evidence of neither conservation nor expression, is heavily targeted by smRNA, and is methylated, albeit at a low level. Also, in both the RankVISTA and VISTA tracks, AT5G43015 appears to be an insertion relative to A. lyrata. (The A. lyrata VISTA track is labeled; unlabeled tracks are other species.)

[Figure 4-3-2: genome browser view of NC_003076:17223000..17262999. Tracks: TE Domains, Gene Search (Transposase_mut, RVT_2, MULE, MuDR); Genes (AT5G17990, AT5G23040, AT5G40390, AT5G42950, AT5G42955, AT5G42957, AT5G42960, AT5G42965, AT5G42970, AT5G42980, AT5G42990, AT5G43000, AT5G43010, AT5G43020, AT5G43030); Pseudogenes (AT5G43015, AT5G43035); Quesneville et al. Natural transposons; RankVISTA (Arabidopsis_lyrata); VISTA Synteny (A. lyrata); Arabidopsis cDNAs; Arabidopsis ESTs; mRNA col-0 (Lister et al. 2008); AtProteome (Baerenfaller et al. 2008); AtPeptide (Castellana et al. 2008); smRNA col-0 (Gregory et al. 2008); HMBD col BU/UB.]

[Figure 4-3-3: genome browser view of NC_003071:1426859..1446858. Tracks: TE Domains, Gene Search (RVT_1, Transposase_24); Genes (AT2G04190, AT2G04220); Pseudogenes (AT2G04200, AT2G04210); Quesneville et al. Natural transposons; RankVISTA; VISTA Synteny; Arabidopsis cDNAs; Arabidopsis ESTs; mRNA col-0 (Lister et al. 2008); AtProteome (Baerenfaller et al. 2008); AtPeptide (Castellana et al. 2008); smRNA col-0 (Gregory et al. 2008); HMBD col BU/UB.]

Figure 4-3-3. An example of an expressed TE. Expression of AT2G04210, which contains Transposase_24, is supported by cDNAs, ESTs, and peptides, yet it and its flanking regions are targeted by smRNA. Interestingly, although the flanking regions are heavily methylated, AT2G04210 itself is not. There is no evidence of conservation or synteny, and AT2G04210 is repetitive and located in the pericentromere shoulder (Tables 4-17 and 4-18). (The neighbouring gene, AT2G04200, which contains the unrelated RVT_1 domain, is also an apparent TE that is neither expressed nor epigenetically repressed.)

Figure 4-3-4 (following page). A borderline case. Conservation and synteny of AT2G27935 (and its neighbour AT2G27936) are supported by VISTA but not RankVISTA. AT2G27935 is not expressed, yet is neither targeted by smRNA nor methylated. It is also neither repetitive nor located in the pericentromere (Tables 4-17 and 4-18). AT2G27935 may be a DTE, but as there is no strong evidence either way, it was, for the purposes of this study, deemed a TE (confidence level 2).

[Figure 4-3-4 image: genome browser view of NC_003071:11885400..11905399. TE domain shown: rve. Pseudogenes track: AT2G27935, AT2G27936.]

Figure 4-4. Known DTEs

[Figure 4-4-1 image: genome browser view of NC_003074:14307511..14337510. TE domains shown: hATC, rve, RVT_2. Pseudogenes track: AT3G42174.]

Figure 4-4-1 (preceding page). A well-known DTE, DAYSLEEPER (AT3G42170), is unusual in that it is located in a genetically defined pericentromere (Bundock and Hooykaas 2005). Conservation of DAYSLEEPER is supported by both RankVISTA and VISTA, but it is syntenic in A. lyrata with only a single gene, AT3G42180. Unlike most of its neighbours, the expression of DAYSLEEPER is strongly supported by all lines of evidence and it does not appear to be targeted by smRNA or methylated.

[Figure 4-4-2 image: genome browser view of NC_003075:8606106..8626105. TE domain shown: MULE.]

Figure 4-4-2. Another well-known DTE, FAR1 (AT4G15090). Conservation is supported by RankVISTA, and A. lyrata synteny with a single immediate neighbour on either side is supported by VISTA. Expression is supported by all lines of evidence, yet despite little evidence of smRNA targeting, FAR1 is methylated. (This may reflect tissue-specific epigenetic regulation.)

[Figure 4-4-3 image: genome browser view of NC_003074:1228470..1258431. TE domains shown: MuDR, MULE. Pseudogenes track: AT3G04605. Species labeled in the RankVISTA and VISTA tracks: Arabidopsis_lyrata, Wine_grape, Manihot_esculenta, Medicago_truncatula, Cucumis_sativus, Mimulus_guttatus, Poplar, Ricinus_communis, Glycine_max.]

Figure 4-4-3 (preceding page). MUG1 (AT3G04605), another previously identified DTE (Cowan, Hoen et al. 2005), is unusually well conserved, with RankVISTA alignments in multiple species. It is syntenic in A. lyrata with multiple genes on either side. Expression is supported by multiple lines of evidence. Like FAR1 (Fig. 4-4-2), another MULE-like DTE, MUG1 is methylated despite little evidence of smRNA targeting.

Figure 4-5. Predicted DTEs

[Figure 4-5-1 image: genome browser view of NC_003074:23366545..23386544. TE domain shown: Plant_tran.]

Figure 4-5-1. A previously unreported DTE with confidence level 5, the highest confidence level for predicted DTEs. AT3G63270 has strong evidence of conservation, synteny, and expression, and no evidence of smRNA targeting or methylation. NCBI HomoloGene (Sayers, Barrett et al. 2011) identifies the rice gene Os07g0175100 as a homolog; a BLASTP search of the NCBI NR database using query AT3G63270.1 identifies unusually high-scoring hits in multiple monocot and dicot species (data not shown). The Plant_tran domain has 48% amino acid identity with that of AT3G55350, another new DTE candidate (Table 4-13).

[Figure 4-5-2 image: genome browser view of NC_003076:20470334..20490333. TE domain shown: MuDR. Pseudogenes track: AT5G50315.]

Figure 4-5-2 (preceding page). A previously unreported DTE with confidence level 4. Conservation and synteny of AT5G50315 in A. lyrata are supported by RankVISTA and VISTA, and expression is supported by cDNA, EST, and mRNA-Seq; there is no evidence of smRNA targeting and little methylation. AT5G50315 appears as an insertion in VISTA regions from multiple other species, suggesting it may have resulted from a relatively recent domestication event. In addition, AT5G50315 is neither repetitive nor located near the pericentromere (Table 4-13) and was independently identified in a previous, unpublished study (J. Schwartzentruber, pers. comm.).

[Figure 4-5-3 image: genome browser view of NC_003075:5177091..5197090. TE domains shown: Transposase_11, MuDR. Pseudogenes track: AT4G08200, AT4G08220.]

Figure 4-5-3. A previously unreported DTE with confidence level 3, the lowest confidence level for predicted DTEs. AT4G08220 conservation is supported by RankVISTA and VISTA, which also support A. lyrata synteny with four protein-coding genes (AT4G08210, AT4G08230, AT4G08240, and AT4G08250). But AT4G08220 is located in a TE-dense region of the pericentromere shoulder, is closely flanked on either side by TE insertions, and itself appears as an insertion in VISTA regions of species other than A. lyrata. Expression of AT4G08220 is supported by ESTs and mRNA-Seq, but it does not have a cDNA, and, despite showing no evidence of smRNA targeting, it is heavily methylated.

Figure 4-6. Predicted DTEs that form complex loci

[Figure 4-6-1 image: genome browser view of NC_003070:23836100..23856099. TE domains shown: MuDR (×5), MULE (×5). Pseudogenes track: AT1G64240, AT1G64250, AT1G64270.]

Figure 4-6-1. A complex locus consisting of 5 MULE-like genes in a tandem array. AT1G64240, AT1G64250, AT1G64255, AT1G64260, and AT1G64270 are members of a DTE family (MULE.2) of 7 predicted A. thaliana DTEs, which also includes AT1G49920 (Supplementary Figure 1-4) and AT5G41710 (Supplementary Figure 1-16). All 5 match a VISTA region that also matches multiple genes on either side. AT1G64250 and AT1G64255 also have RankVISTA matches. Multiple lines of evidence support AT1G64270 expression. None shows evidence of smRNA targeting or methylation.

[Figure 4-6-2 image: genome browser view of NC_003074:4157956..4177955. TE domains shown: hATC (×5), DUF659 (×5).]

Figure 4-6-2 (preceding page). A complex locus consisting of 3 hAT-like genes in a tandem array. AT3G13010, AT3G13020, and AT3G13030 constitute the DTE family hATC.1. The conservation of all 3 is supported by an A. lyrata VISTA region with matches to multiple genes on either side. Only AT3G13030 has evidence of expression.

[Figure 4-6-3 image: genome browser view of NC_003070:7420900..7455099. TE domains shown: Retrotrans_gag (×4).]

Figure 4-6-3. A complex locus consisting of four short predicted DTEs arranged in tandem, each with a Retrotrans_gag domain, which together form the gag.1 family: AT1G21220, AT1G21260, AT1G21280, and AT1G21290. They are interspersed among, and largely syntenic in A. lyrata with, members of another gene family not derived from TEs, WAK (cell wall-associated kinase): AT1G21210, AT1G21230, AT1G21240, AT1G21250, and AT1G21270. The WAK loci are also arranged in tandem, in the opposite orientation to the gag.1 genes.

[Figure 4-6-4 image: genome browser view of NC_003074:5891700..5911700. TE domains shown: hATC (×2). Pseudogenes track: AT3G17260, AT3G17290.]

Figure 4-6-4. A complex locus of two predicted DTEs. AT3G17260 and AT3G17290 each contain an hATC domain, are located approximately 9 kbp apart on chr. 3, and form the DTE family hATC.3. Although their configuration may suggest a macrotransposition event, additional evidence suggests that this is a complex locus, which consists of 4 members at the orthologous location in A. lyrata.

Figure 4-7. Neighbouring unrelated DTEs

[Figure 4-7-1 image: genome browser view of NC_003070:30092186..30112185. TE domains shown: MULE, hATC. Pseudogenes track: AT1G80020.]

Figure 4-7-1. The predicted DTE AT1G80020, which contains an hATC domain, is adjacent to the known DTE FRS8 (AT1G80010), which contains a MULE domain.

[Figure 4-7-2 image: genome browser view of NC_003074:4958717..4978716. TE domains shown: Transposase_24, hATC. Pseudogenes track: AT3G14800.]

Figure 4-7-2 (preceding page). AT3G14780 and AT3G14800 are two predicted DTEs located close together within the same A. lyrata VISTA synteny region. They are derived from different Class II superfamilies: AT3G14780 from CACTA (Transposase_24 domain) and AT3G14800 from hAT (hATC domain). Both are expressed, but only AT3G14800 is supported by RankVISTA conservation. Interestingly, AT3G14800 is also methylated, despite showing no evidence of smRNA targeting (possibly reflecting tissue-specific epigenetic regulation).

Discussion

Transposable element (TE) ubiquity, abundance, and diversity pose biological challenges to genome function and practical challenges to genome annotation, yet these very qualities also provide opportunity for evolutionary innovation. One process by which TEs directly contribute to host adaptation is molecular domestication, whereby TE genes adopt non-replicative functions, lose their mobility, and become domesticated TEs (DTEs). DTEs are conventional host genes derived from mobile genes; thus, DTE identification hinges on discriminating host genes from mobile genes, which must be accomplished with high sensitivity and specificity to identify a small number of DTEs among a large number of TEs.

Despite its potential impact on evolution, molecular domestication has not yet been thoroughly and systematically investigated. Most known DTEs were discovered fortuitously, for instance in forward genetic screens for genes involved in particular phenotypes or functions. This approach is poorly suited to clarifying the general attributes of molecular domestication, which requires a large and unbiased sample of DTEs, because fortuitous discovery is inherently haphazard and likely suffers from selection biases. For example, it is plausible that even DTEs that were uncovered in forward genetic screens have been dismissed by researchers who assumed they were mobility-related genes. This situation is aggravated by the fact that even some published DTEs continue to be misannotated as TE genes in supposedly well-annotated genomes, e.g., MUSTANG in TAIR10.

There have been only a few systematic searches for DTEs, with various scopes and degrees of success. There are two aspects to finding DTEs: first, identifying genes descended from TEs and, second, discriminating those with mobility-related functions from those that have been co-opted to perform other functions. Discriminating TE genes from DTEs is challenging. The few genome-scale searches that have attempted it have used various combinations of a few common lines of evidence: repetitiveness, structure (e.g., degeneration or absence of flanking TE termini), expression (e.g., representation in cDNA or EST libraries), local TE versus gene density, phylogenetic analysis (e.g., conservation of orthologs across widely diverged taxa), microsynteny, and synonymous substitution rate (Lander, Linton et al. 2001; Cowan, Hoen et al. 2005; Zdobnov, Campillos et al. 2005). But high TE abundance and diversity mean that there are invariably exceptions to any simple rule, particularly rules involving measures of gene function. For instance, while most TEs are silenced, at least in most tissues and developmental stages, some fraction are expressed; furthermore, some host genes are expressed only at low levels or in specific conditions. Thus, using mRNA abundance as a single line of evidence (e.g., Lander, Linton et al. 2001) results in high rates of both false positives and false negatives. Conversely, measurements that rely on patterns of evolution, such as conservation and synteny, are generally more specific but less sensitive, again due to numerous exceptions. Therefore, the most powerful approach, which maximizes both sensitivity and specificity, is to combine diverse lines of evidence, including both functional and evolutionary characteristics. For example, Zdobnov, Campillos et al. (2005) used expression and synteny, and Cowan, Hoen et al. (2005) used phylogeny (i.e., conservation), TE structure, and expression.

The first step in systematic DTE detection is to identify genes descended from TEs, which can be convincingly achieved only for DTEs with sufficient similarity to extant TEs. The strongest evidence would be to identify, for a given DTE, a closely related family of extant TEs whose sequences are significantly more similar to it than is any non-TE gene, i.e., to identify a close extant TE ortholog.
But because mobility-related genes evolve rapidly, and because many domestication events were ancient, this is not generally possible. Instead, common ancestry between dissimilar genes can be identified by examining their most highly conserved subsequences, the conserved domains. Conserved domains are contiguous gene segments that encode specific structural or enzymatic functions and evolve under strong selective constraint. Certain conserved domains are found exclusively or predominantly in TE genes, or in known DTEs. In this search, similarity to TE-specific conserved domains was used to identify TE-derived genes, most of which are TEs, but some of which are DTEs (Table 4-4).

Using conserved domains to identify TE-derived genes has some limitations. Many conserved domains are found not only in TEs but also, to varying degrees, in non-TE genes. To minimize the risk of incorrectly concluding that a gene is derived from a TE, I have limited my search to highly TE-specific domains. As a result, I have identified only 5 well-characterized genes not currently thought to be DTEs (Table 4-9). But specificity comes at the cost of sensitivity, and many TE superfamilies are not represented in the signature domain set (Tables 4-9 and 4-14). Furthermore, even DTEs derived from superfamilies that are represented would have been missed if no signature domain was sufficiently conserved. Also note that genes in DTE gene families have derived their TE signature domain through duplication of a previously domesticated gene, not through independent domestication events.

The second aspect of identifying DTEs, discriminating them from mobility-related genes still associated with TEs, is by far the more problematic. Part of the difficulty is that only a small fraction of TE-like genes are DTEs, the rest being TEs. Thus, the discrimination method must be highly specific to avoid a large number of false positives. A naive approach would be to simply identify all TEs, subtract them from the set of TE-derived genes, and conclude that the remainder are DTEs. But we cannot yet identify TEs to a high level of specificity.
The most commonly used TE detection methods simply identify sequences that are sufficiently similar to known TEs. Such homology-based methods cannot be used here, because DTEs are, by definition, similar to TEs. Instead, certain differences in the characteristics of TE and non-TE genes can be used to differentiate between them (Table 4-1).

The most specific of these characteristics results, perhaps unsurprisingly, directly from the most obvious difference between TEs and ordinary genes: mobility. Non-TE genes maintain colinear genomic locations between divergent genomes, subject only to processes of chromosomal rearrangement resulting from recombination, processes that are infrequent relative to transposition. In contrast, TEs do not generally contribute beneficial traits to the organism and so are not subject to purifying selection; thus, only TEs that transpose are conserved. TEs at specific genomic locations, if they do not excise, gradually accumulate mutations and eventually lose their open reading frames and their ability to encode functional proteins. Thus, in a comparison of colinearity between genomes sufficiently divergent for such mutations to have obliterated stationary mobility-related genes (or at least sufficiently divergent for mutations in TE fossils to have become apparent), only ordinary genes (including DTEs) can maintain colinear positions, not TE genes.

Colinearity can be evaluated by identifying orthologous blocks of genes in two or more genomes. If a gene was present in an orthologous block in the last common ancestor of the two genomes being compared, then it should continue to have the same local gene neighbours, subject to chromosomal rearrangements. This property of shared local gene order between genomes is known as conserved microsynteny. Depending on the frequency and type of rearrangements, non-TE genes often maintain multiple conserved orthologous neighbours between species that have diverged for hundreds of millions of years.
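The neighbour-counting idea behind conserved microsynteny can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in this study (which relied on VISTA and RankVISTA tracks): the function name, the `flank` and `window` parameters, and the toy gene identifiers are all hypothetical, and each genome is reduced to a single ordered list of genes.

```python
def conserved_neighbours(target, order_a, order_b, orthologs, flank=3, window=5):
    """Count flanking genes of `target` in genome A whose orthologs lie
    within `window` genes of the target's ortholog in genome B.

    order_a, order_b: gene identifiers in chromosomal order (one chromosome each;
                      a simplifying assumption of this sketch).
    orthologs:        dict mapping genome A gene -> genome B ortholog.
    """
    position_b = {gene: i for i, gene in enumerate(order_b)}
    target_ortholog = orthologs.get(target)
    if target_ortholog not in position_b:
        return 0  # no detectable ortholog, hence no microsynteny to score
    centre = position_b[target_ortholog]

    i = order_a.index(target)
    # Up to `flank` genes on each side of the target in genome A.
    neighbours = order_a[max(0, i - flank):i] + order_a[i + 1:i + 1 + flank]

    count = 0
    for gene in neighbours:
        ortholog = orthologs.get(gene)
        if ortholog in position_b and abs(position_b[ortholog] - centre) <= window:
            count += 1
    return count


# Toy data (hypothetical identifiers): "t" is the TE-like candidate.
order_a = ["a1", "a2", "a3", "t", "a4", "a5", "a6"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "t": "tb",
             "a4": "b4", "a5": "b5", "a6": "b6"}

# Case 1: the candidate sits in the same gene block in both genomes.
syntenic_b = ["b1", "b2", "b3", "tb", "b4", "b5", "b6"]
# Case 2: the candidate's ortholog lies far from the block, as after transposition.
moved_b = ["b1", "b2", "b3", "b4", "b5", "b6",
           "x1", "x2", "x3", "x4", "x5", "x6", "tb"]

print(conserved_neighbours("t", order_a, syntenic_b, orthologs))
print(conserved_neighbours("t", order_a, moved_b, orthologs))
```

On the toy data, the first call counts all six flanking genes as conserved and the second counts none. Requiring more than one conserved neighbour is a simple guard against mistaking a mobile multi-gene cluster for a conserved block.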
Conversely, the rate at which TEs at specific genomic locations lose detectable open reading frames is faster by orders of magnitude. This difference permits conserved microsynteny between even closely related species, such as A. lyrata and A. thaliana, to be used in DTE detection. Close comparisons such as this are ideal because the more closely related the genomes being compared, the less time there has been for chromosomal rearrangements, and thus the fewer ordinary genes (including DTEs) that will have lost colinearity. That is, close comparisons reduce false negatives. Because of its inherent specificity, conserved colinearity or microsynteny has been used as a key feature in previous large-scale DTE searches (Cowan, Hoen et al. 2005; Zdobnov, Campillos et al. 2005). However, as genomes diverge over evolutionary time, chromosomal rearrangements cause non-mobile genes also to lose conserved microsynteny, and ‘transposition’ of non-TE genes has recently been shown to be more common than previously understood (Woodhouse, Pedersen et al. 2010). Both effects reduce the sensitivity of relying on this characteristic alone, a problem that is exacerbated as the evolutionary distance between the genomes being compared increases. Another problem with conserved microsynteny is technical. Many TEs contain multiple genes, so care must be taken not to mistake mobile clusters of genes for conserved colinearity, for instance by requiring a sufficiently large number of conserved neighbours, certainly more than one, but also by not relying solely on this single line of evidence.

Another consequence of transposition that can be used is repetitiveness. To persist in a population of genomes, members of each TE family must transpose often enough to maintain at least one autonomous copy, i.e., a copy with functional TE genes. As a result, many TE genes have more than one highly similar copy. Furthermore, nonautonomous elements are often present in high copy number, resulting in highly repetitive terminal regions flanking TE genes but not DTEs.
However, because non-TE genes also undergo duplication by various mechanisms, such as segmental chromosomal duplications, the existence of multiple high-similarity copies is not conclusive evidence that a gene is mobility-related. Nevertheless, segmental duplication is rare compared to transposition, so most genes with multiple high-similarity copies are TE-derived genes; thus, a reasonable conservative assumption is that high copy-number gene families with very similar members that contain TE conserved domains are indeed TEs.

If we rely on only a single line of evidence to discriminate between TEs and DTEs, the predictions are limited to the sensitivity and specificity of that one line of evidence. For instance, if conserved microsynteny is used, as in Zdobnov, Campillos et al. (2005), sensitivity will be limited to DTEs that were domesticated prior to the divergence of the species used to calculate synteny and that have maintained a detectable level of microsynteny in those species. If distant species are used, recent domestication events will be excluded, and so if domestication events follow any short-term evolutionary patterns, such as gene birth and death (an intriguing possibility; see Chapter One), those patterns will go undetected. Conversely, if very closely related species are used (if they are available), false positives may occur if some TE-related genes present in the last common ancestor are still detectable.

By incorporating diverse lines of evidence, both the overall sensitivity and specificity of the search can be increased. For instance, expression is another line of evidence that has been used in several studies (e.g., Lander, Linton et al. 2001; Cowan, Hoen et al. 2005). But while it is true that most TE-related genes are silenced in most tissues, many TE-related genes are nevertheless expressed, especially in certain tissues, particularly reproductive tissues. Conversely, many host genes, possibly including DTEs, are expressed only in specific tissues, developmental times, or conditions, and so may not be detected in expression studies.
But whereas the mechanism by which TEs are silenced typically involves small RNA-directed epigenetic modification, host gene expression is typically regulated by tissue-specific enhancers (Levin and Moran 2011). Thus, the combination of measurements of small RNA targeting, methylation, and mRNA production, using different methods and in different tissues, may better discriminate between DTEs and TEs than simple binary observations of expression (e.g., does or does not have cDNAs or ESTs). Combining diverse lines of evidence designed to detect differences in diverse aspects of TE and DTE evolution, mobility, and function may therefore provide the best hope of maximizing both the sensitivity and specificity of detection of DTEs derived from any TE superfamily and over any timescale.

Yet this attractive approach raises the problem of how to combine the multiple lines of evidence. Previous studies have focussed on one or a few lines of mandatory evidence, such as conserved microsynteny or expression, and then have, to varying degrees, eliminated or supported the small subset of candidates using small-scale analyses of additional evidence, such as the ratio of nonsynonymous to synonymous substitution rates (dN/dS). While this approach can improve specificity, it still limits sensitivity to that of the mandatory characteristic. Another drawback of this approach is that it cannot fully utilize the large and rapidly increasing volume of genomic data. Because it is based on expert human knowledge and manual curation, evidence needs to be simplified and only limited controls can be employed. Another major limitation of this approach is that, while repeatable, it is not generalizable: the formula used to make predictions would need to be adjusted for each different combination of available evidence, set of genomes, etc. To combine multiple lines of evidence, I have used something akin to a scoring matrix.
Each line of evidence is given a weight, the weights are summed to give a score for each TE-like gene (Table 4-6), and each score is given a confidence weighting consistent with positive controls or training sets (known DTEs and other ordinary genes) and negative controls or training sets (known TEs). This approach is an improvement over mandatory requirements, especially over studies focussed on only one or a few lines of evidence. For instance, an earlier search for MULE-like DTEs based primarily on expression and conservation found only a single novel DTE family (Cowan, Hoen et al. 2005), whereas this study has identified 3 additional novel MULE-like DTE candidate families (9 genes total; Table 4-13). Also, this study succeeded in finding not only all 6 novel DTE candidates identified in an earlier unpublished study conducted in our lab that focussed on A. thaliana expression data (J. Schwartzentruber, pers. comm.), but also an additional 32 high-confidence candidates (38 total).

Yet the scoring matrix approach also has limitations. One goal of this work was to limit the use of mandatory characteristics, such as conserved microsynteny, in the search. Although no particular characteristic was strictly required, the weighting coefficients used had the effect that some characteristics, such as synteny, were practically required to be present, whereas others, such as strong smRNA targeting, were practically required to be absent (Tables 4-13 and 4-14). However, this is perhaps unavoidable if these characteristics are intrinsic differences between TEs and DTEs. Also, to fit into a simple scoring scheme, the evidence had to be simplified by dividing continuous values, such as the number of smRNAs targeting a gene, into a small number of intervals. The use of extensive positive and negative controls to guide this scheme appears to have been successful, given that the search identified 32 high-confidence candidates, 16 of which show evidence of phenotypic effects in preliminary assays (Table 4-21). Nevertheless, one goal of this work was to create a search methodology that can be applied to additional systems as easily as possible.
While the scoring approach described is probably generalizable, the specific scheme and weightings are unlikely to be, since different genomes will have different available lines of evidence, each with different characteristics.
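A minimal sketch of such a scoring scheme follows. It is not the scheme of Table 4-6: the evidence names, weights, smRNA bins, and score thresholds are invented for illustration, and only the overall shape (weighted binary evidence, a binned continuous value, and a summed score mapped to a confidence level) follows the description above.

```python
# Hypothetical weights: positive values favour a DTE call, negative values a TE call.
WEIGHTS = {
    "microsynteny": 3,
    "conservation": 2,
    "expression": 1,
    "repetitive": -2,
    "pericentromeric": -1,
}

# Hypothetical intervals for a continuous measurement (number of smRNAs
# targeting the gene), given as (inclusive upper bound, score contribution).
SMRNA_BINS = [(10, 0), (100, -1), (float("inf"), -3)]

# Hypothetical (minimum score, confidence level) pairs, highest first.
LEVEL_THRESHOLDS = [(6, 5), (4, 4), (2, 3)]


def bin_score(value, bins):
    """Return the contribution of the first interval that contains `value`."""
    for upper, score in bins:
        if value <= upper:
            return score
    raise ValueError("bins must end with an unbounded interval")


def score_gene(evidence, smrna_count):
    """Sum the weights of the evidence that is present, plus the binned smRNA term."""
    total = sum(WEIGHTS[name] for name, present in evidence.items() if present)
    return total + bin_score(smrna_count, SMRNA_BINS)


def confidence_level(score):
    """Map a score to a confidence level; levels 3 to 5 denote predicted DTEs.

    Scores below the lowest threshold are lumped together as level 2 here,
    which is an assumption of this sketch.
    """
    for minimum, level in LEVEL_THRESHOLDS:
        if score >= minimum:
            return level
    return 2


# Toy inputs (hypothetical): a DTE-like gene and a TE-like gene.
dte_like = {"microsynteny": True, "conservation": True, "expression": True,
            "repetitive": False, "pericentromeric": False}
te_like = {"microsynteny": False, "conservation": False, "expression": True,
           "repetitive": True, "pericentromeric": True}

for name, evidence, smrna in [("DTE-like", dte_like, 0), ("TE-like", te_like, 500)]:
    score = score_gene(evidence, smrna)
    print(name, score, confidence_level(score))
```

With these invented numbers the DTE-like gene scores 6 (level 5) and the TE-like gene scores -5 (level 2). The sketch also makes the limitation concrete: because microsynteny carries the largest weight, a gene lacking it can reach at most a score of 3 here, so the characteristic is required in practice for the higher levels even though no rule states it.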

Table 4-21. DTEs with significant phenotypes¹

Columns: Locus; Domain; Family; T-DNA(s)²; Salt Tolerance Survivorship (%); Freezing Tolerance Survivorship (%); Nitrogen Use Efficiency Phenotype³

KNOWN DTES
FHY3 MULE FAR1 059264C 18***
FHY3 002711C 18***
FRS6 MULE FAR1 114017C 17**
FRS7 MULE FAR1 039727 L: rm**
FRS8 MULE FAR1 122261 68*
FRS9 MULE FAR1 CS853829 43*
FRS11 MULE FAR1 146044 38**
MUG1 MULE MUG GK514B01 32***
MUG2 MULE MUG 055071 43* 22*
MUG2 090878 41*
MUG4 MULE MUG 036244C 8***
MUG4 036408C 16***
MUG7 MULE MUG 063737 21*** 22* L: rm**
MUG7 GK378C04 41*
MUG8 MULE MUG GK155E09 22*
MUG8 GK244B09 36*** 19*

NOVEL DTE CANDIDATES
AT5G50315 MuDR CS839585 31*** 19*
AT3G13020 hATC hATC.1 025166 L: rm*
AT3G13020 034362C 41* L: rm*
AT4G15020 hATC hATC.2 100861C L: rm*
AT3G22220 hATC hATC.2 141788C 40** L: rm*
AT3G17450 hATC 023396 41*
AT1G80020 hATC hATC.4 CS807311 L: rm*
AT1G80020 049737 40**
AT1G15300 hATC hATC.4 020839C 44*
AT1G18560⁴ hATC (DAYSLEEPER)⁵ 139610C 44* L: rm**
AT1G69950 hATC 069419 43* H: rm*
AT1G69950 CS860391 37** 14*** L: rm*
AT3G55350 Plant_tran 122829C L: rm*
AT3G55350 020538 20*
AT5G41980 Plant_tran 146017C 37**
AT1G21260 gag gag.1 121997C L: rm*
AT5G51080 RnaseH RnaseH.1 CS841723 26*** L: rm*
AT2G11240⁶ RVT_1 118772 L: rm*
AT2G11240⁶ 070393 35***
AT4G11200⁶ MuDR 048168C 38**
AT5G43015⁶ MULE 105034C 40**

* p<0.05; ** p<0.01; *** p<0.001.
¹ Homozygous T-DNA knockout lines were isolated, genotyped, and assayed as part of the VEGI project (http://biology.mcgill.ca/vegi), with only modest contribution by the author (see Contributions statement). Statistically significant preliminary results shown as a validation of the efficacy of the DTE search.
² T-DNA stock ID if [ID] starts with “CS” or “GK”, or otherwise it is “SALK_[line ID]”.
³ L, low nitrogen; H, high nitrogen; rm, rosette mass.
⁴ A third T-DNA line (086452C) showed a severe leaf phenotype under normal conditions; however, genotyping results were ambiguous.
⁵ BLASTP best-reciprocal-hit with DAYSLEEPER; potential paralogs.
⁶ Confidence level 2, see Table 4-17.

An improvement to this approach might be to employ machine learning to determine the optimal combinations of evidence and weightings. For example, training could be conducted on a large set (thousands) of known TE and non-TE genes (not necessarily DTEs), and could use each line of evidence independently, without simplification. Such an approach may be able to detect discriminative characteristics not evident to an expert and has the potential to improve sensitivity and specificity.

Perhaps as important as the real accuracy of the search is its perceived accuracy. It is almost pointless to identify DTEs if they continue to be disregarded and misannotated. Hopefully, the inclusion of multiple lines of corroborating evidence will increase not only the calculated confidence in the veracity of the protein-coding functions of these genes, but also their acceptance by the research community.

Using diverse lines of evidence, I have identified, in addition to all 23 known Arabidopsis DTEs, 38 new DTEs with strong evidentiary support. Because this study was designed to be highly specific, and despite attempts to mitigate the loss of sensitivity by using multiple lines of evidence, this number is likely an underestimate: there are approximately one hundred additional genes at confidence level 2, for which I found no strong evidence that they are DTEs, but also no conclusive evidence that they are TEs (Tables 4-17 and 4-18). Indeed, preliminary results of phenotypic assays show that three of eight confidence level 2 candidates that were tested show evidence of phenotypic effect and may therefore be DTEs. Furthermore, the search was restricted to genes that contain TE-specific conserved domains (Table 4-4); however, this identified only about half of annotated TE genes (data not shown). While this approach has the benefit of finding only genes with strong evidence of TE origin, additional homology-based searches could be employed to improve search sensitivity. With additional lines of evidence and improved methods, such as machine learning, I should eventually be able to resolve these ambiguous genes. Thus, the number of DTEs in Arabidopsis is at least 62, and may be as many as around 150, roughly 2.5 to 6 times the number identified prior to this research. Furthermore, since 8 of the previously known DTEs were discovered using a previous systematic search, limited to genes derived from the MULE transposase mudrA (Cowan, Hoen et al. 2005), the number of high-confidence DTEs uncovered through systematic searches is now 46, three times as many as the 15 DTEs uncovered using forward genetic approaches.

The results of this search are roughly consistent with the findings of the most general previous study, conducted in mammals (Zdobnov, Campillos et al. 2005), which is remarkable considering the vast evolutionary distance between plants and animals. Coincidentally, the number of known human DTEs at the time of that study was 21, practically equal to the number known in Arabidopsis at the time of this study, 23. A rough comparison of the two studies suggests that the incorporation of multiple lines of evidence resulted in improved sensitivity. Using synteny, Zdobnov et al. were able to identify 65% (15 of 21) of the human genes then known to be DTEs. They also found three additional cases with synteny: genes that were false negatives in their discovery method and were identified ex post facto, with the knowledge that they were DTEs. As measured by the fraction of known genes that were found, the current study found 100%, perhaps suggesting increased sensitivity from the use of multiple lines of evidence.
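The machine-learning refinement suggested above, in which weightings are learned from known TE and non-TE genes rather than set by hand, might look roughly like the following sketch. It uses logistic regression fitted by batch gradient descent, which is only one possible choice and is not a method used or proposed in this thesis. The feature definitions and all numeric values in the toy training set are invented for illustration.

```python
import math

# Toy training set: every value here is invented purely for illustration.
# Features per gene (hypothetical):
#   [log10(1 + sRNA count), syntenic genomes, log10(1 + expression)]
# Label 1 = ordinary (non-TE) gene, label 0 = known TE gene.
GENES = [
    ([0.0, 3.0, 1.5], 1),
    ([0.3, 2.0, 1.0], 1),
    ([0.0, 2.0, 2.0], 1),
    ([0.5, 3.0, 0.8], 1),
    ([2.5, 0.0, 0.1], 0),
    ([2.0, 0.0, 0.0], 0),
    ([1.8, 1.0, 0.2], 0),
    ([3.0, 0.0, 0.0], 0),
]


def predict(weights, bias, features):
    """Probability that a gene is host-like under the fitted model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(genes, rate=0.1, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent."""
    n_features = len(genes[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in genes:
            error = predict(weights, bias, features) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        bias -= rate * grad_b / len(genes)
        weights = [w - rate * g / len(genes) for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = train(GENES)
for features, label in GENES:
    print(label, round(predict(weights, bias, features), 2))
```

Here the continuous measurements enter the model directly, without being divided into intervals, and the sign and size of each learned weight indicate how strongly that line of evidence separates the two classes. A real application would need thousands of training genes and a held-out set to estimate sensitivity and specificity.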

Even assuming that many of the ambiguous cases are DTEs, there are at most a few hundred DTEs in Arabidopsis, a small fraction of the approximately 30,000 protein-coding genes. From an evolutionary standpoint, is this significant? Leaving aside the as-yet unsupported possibility that domestication plays a large but difficult-to-detect role in exon shuffling (Chapter One), the process may play a larger role than is suggested by the absolute number of DTEs. Evolutionary impact cannot be measured simply by counting genes, as illustrated by the observation that H. sapiens and C. elegans each have roughly 20,000 genes, yet humans are more complex than nematodes. This is partly because many genes perform seemingly basic structural or metabolic functions conserved between widely diverged organisms. In more complex organisms, a single gene may perform multiple functions, depending for example on post-transcriptional modification such as mRNA splicing, or on the regulation of mRNA production. But most complexity appears to arise not by the evolution of structural genes, but by changes in gene expression. Gene expression is largely regulated by transcription factors, proteins that bind to specific DNA regulatory motifs and interact directly or indirectly with other proteins, such as RNA polymerase, to enhance or repress transcription. Many characterized DTEs, especially those derived from Class II (DNA) TEs, are transcription factors. DTEs can also innovate highly adaptive evolutionary novelties, such as the mammalian placenta and adaptive immune system (Feschotte and Pritham 2007; Sinzelle, Izsvák et al. 2009).

Another consideration is that domestication may operate at multiple evolutionary timescales. Whereas some adaptations provided by DTEs may have long-term benefit and be conserved between widely diverged species, as is the case for most known DTEs, others may be needed only for short-lived adaptations, for instance during periods of rapid environmental change. It is difficult to know how many short-lived DTEs may have existed, particularly if the frequency of short-term domestication increases with selective pressure, but any insight would need to come from searches capable of detecting recently domesticated genes, not, for instance, from searches that rely on synteny between widely diverged organisms.

Taken together, these considerations suggest that molecular domestication may, from an evolutionary standpoint, be an extremely fruitful process. Domestication generates transcription factors that may affect the expression of many genes to facilitate rapid or sustained adaptation to environmental challenges and play a major role in the evolution of complexity. It innovates entirely new systems like adaptive immunity and placental isolation. Thus, the novel DTEs identified in this study may be important both in themselves and in advancing our understanding of the overall significance of molecular domestication. At the very least, it seems clear that future systematic DTE searches in additional species would be useful, both to identify new genes and to permit a more general comparison of the process of molecular domestication in different lineages.

References

AGI (2000). "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana." Nature 408(6814): 796-815.

Backman, T. W., C. M. Sullivan, et al. (2008). "Update of ASRP: the Arabidopsis Small RNA Project database." Nucleic acids research 36(Database issue): D982-985.

Baerenfaller, K., J. Grossmann, et al. (2008). "Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics." Science 320(5878): 938-941.

Brudno, M., A. Poliakov, et al. (2007). "Multiple whole genome alignments and novel biomedical applications at the VISTA portal." Nucleic acids research 35(Web Server issue): W669-674.

Bundock, P. and P. Hooykaas (2005). "An Arabidopsis hAT-like transposase is essential for plant development." Nature 436(7048): 282-284.

Castellana, N. E., S. H. Payne, et al. (2008). "Discovery and revision of Arabidopsis genes by proteogenomics." Proceedings of the National Academy of Sciences of the United States of America 105(52): 21034-21038.

Cowan, R. K., D. R. Hoen, et al. (2005). "MUSTANG is a novel family of domesticated transposase genes found in diverse angiosperms." Molecular biology and evolution 22(10): 2084-2089.

Dereeper, A., V. Guignon, et al. (2008). "Phylogeny.fr: robust phylogenetic analysis for the non-specialist." Nucleic acids research 36(Web Server issue): W465-469.

Eddy, S. R. (2009). "A new generation of homology search tools based on probabilistic inference." Genome informatics. International Conference on Genome Informatics 23(1): 205-211.

Feschotte, C. and E. J. Pritham (2007). "DNA transposons and the evolution of eukaryotic genomes." Annual review of genetics 41: 331-368.

Finn, R. D., J. Mistry, et al. (2010). "The Pfam protein families database." Nucleic acids research 38(Database issue): D211-222.

Flajnik, M. F. and M. Kasahara (2010). "Origin and evolution of the adaptive immune system: genetic events and selective pressures." Nature reviews. Genetics 11(1): 47-59.

Frazer, K. A., L. Pachter, et al. (2004). "VISTA: computational tools for comparative genomics." Nucleic acids research 32(Web Server issue): W273-279.

Gagnot, S., J. P. Tamby, et al. (2008). "CATdb: a public access to Arabidopsis transcriptome data from the URGV-CATMA platform." Nucleic acids research 36(Database issue): D986-990.

Gregory, B. D., R. C. O'Malley, et al. (2008). "A link between RNA metabolism and silencing affecting Arabidopsis development." Developmental cell 14(6): 854-866.

Heazlewood, J. L., P. Durek, et al. (2008). "PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor." Nucleic acids research 36(Database issue): D1015-1021.

Hoen, D. R., K. C. Park, et al. (2006). "Transposon-mediated expansion and diversification of a family of ULP-like genes." Molecular biology and evolution 23(6): 1254-1268.

Lander, E. S., L. M. Linton, et al. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.

Levin, H. L. and J. V. Moran (2011). "Dynamic interactions between transposable elements and their hosts." Nature reviews. Genetics 12(9): 615-627.

Lister, R., R. C. O'Malley, et al. (2008). "Highly integrated single-base resolution maps of the epigenome in Arabidopsis." Cell 133(3): 523-536.

Lu, C., S. S. Tej, et al. (2005). "Elucidation of the small RNA component of the transcriptome." Science 309(5740): 1567-1569.

McKay, S. J., I. A. Vergara, et al. (2010). "Using the Generic Synteny Browser (GBrowse_syn)." Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] Chapter 9: Unit 9 12.

Nakano, M., K. Nobuta, et al. (2006). "Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA." Nucleic acids research 34(Database issue): D731-735.

Nozaki, H., M. Iseki, et al. (2007). "Phylogeny of primary photosynthetic eukaryotes as deduced from slowly evolving nuclear genes." Molecular biology and evolution 24(8): 1592-1595.

Prabhakar, S., F. Poulin, et al. (2006). "Close sequence comparisons are sufficient to identify human cis-regulatory elements." Genome research 16(7): 855-863.

Rice, P., I. Longden, et al. (2000). "EMBOSS: the European Molecular Biology Open Software Suite." Trends in genetics : TIG 16(6): 276-277.

Sinzelle, L., Z. Izsvák, et al. (2009). "Molecular domestication of transposable elements: from detrimental parasites to useful host genes." Cellular and molecular life sciences : CMLS 66(6): 1073-1093.

Stein, L. D., C. Mungall, et al. (2002). "The generic genome browser: a building block for a model organism system database." Genome research 12(10): 1599-1610.

Swarbreck, D., C. Wilks, et al. (2008). "The Arabidopsis Information Resource (TAIR): gene structure and function annotation." Nucleic acids research 36(Database issue): D1009-1014.

Wicker, T., F. Sabot, et al. (2007). "A unified classification system for eukaryotic transposable elements." Nature reviews. Genetics 8(12): 973-982.

Winter, D., B. Vinegar, et al. (2007). "An "Electronic Fluorescent Pictograph" browser for exploring and analyzing large-scale biological data sets." PloS one 2(1): e718.

Woodhouse, M. R., B. Pedersen, et al. (2010). "Transposed genes in Arabidopsis are often associated with flanking repeats." PLoS genetics 6(5): e1000949.

Zdobnov, E. M., M. Campillos, et al. (2005). "Protein coding potential of retroviruses and other transposable elements in vertebrate genomes." Nucleic acids research 33(3): 946-954.

Zeller, G., R. M. Clark, et al. (2008). "Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays." Genome research 18(6): 918-929.

Zhang, X., J. Yazaki, et al. (2006). "Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis." Cell 126(6): 1189-1201.

Conclusion

In this thesis, I have described genome-scale bioinformatic analyses that I used to identify direct sequence exchanges from plant genomes to TEs, and vice versa, and to examine their functional and evolutionary consequences. Transduplication is a process whereby ordinary gene fragments are duplicated into, and mobilized by, TEs. In rice, thousands of ordinary gene fragments have been transduplicated by MULEs, but they do not appear to produce functional proteins, although they may have regulatory functions. In A. thaliana, transduplication appears to have given rise to a large gene family that produces functional proteins; however, at least in this case, they are not new ordinary genes, but new TE genes. A converse process to transduplication is molecular domestication, the co-option of TE sequences by genomes. Using an innovative systematic search, I have identified in A. thaliana, in addition to three previously known families, twenty-three candidate novel families of domesticated transposable elements.

Future work in the areas of transduplication and molecular domestication will likely shed much additional light on their mechanisms and effects. The rate at which genomes are being sequenced, along with supporting genomic evidence, such as transcriptome and methylome data, continues to increase dramatically, as do the power and sophistication of computational analysis tools. Our ability to identify and understand events in which TEs generate beneficial variation will increase correspondingly. For instance, we may hope to test the hypotheses: (1) that transduplicated gene fragments may produce small RNAs that regulate the parent gene, and (2) that molecular domestication occurs much more frequently in the short term than in the long term.

In general, the characteristics and scale of TE-genome coevolutionary processes make them difficult to study using classical genetic methods, but, as this thesis demonstrates, the recent explosion of genomic data contains buried within it abundant evidence of their effects. The results I have presented support the view that TEs, even though they perpetuate by self-replication, are not molecular parasites, but an integral part of eukaryotic genomes.

Appendix One: The map-based sequence of the rice genome

Available at:

http://www.nature.com/nature/journal/v436/n7052/full/nature03895.html

Appendix Two: MUSTANG is a novel family of domesticated transposase genes found in diverse angiosperms

Available at:

http://mbe.oxfordjournals.org/content/22/10/2084.full