FINDING SMALL RNAS IN TRANSPOSABLE ELEMENTS FROM LILIACEAE

USING NEXT GENERATION SEQUENCING ON THE ION TORRENT PGM

A University Thesis Presented to the Faculty

of

California State University, East Bay

In Partial Fulfillment

of the Requirements for the Degree

Master of Science in Biological Science

By

Hao Ngan Trinh

August 2016

Copyright © 2016 by Hao Trinh

ABSTRACT

Small RNAs are short non-coding RNAs that regulate gene expression and Transposable Elements (TEs) through complementary binding. They act as an epigenetic defense mechanism that silences TE proliferation in eukaryotic organisms. TEs are repetitive DNA sequences that make up the majority of eukaryotic genomes; in higher families, such as Liliaceae, the proliferation of these TEs contributes to their enormous genome sizes. TEs can also introduce genetic variation into the genome of an organism, particularly within protein-coding regions. How is the abundance of TEs regulated? That question leads into the goal and objective of this graduate thesis project.

This project used Next-Generation Sequencing on the Ion Torrent PGM sequencer to identify small RNAs from Liliaceae and to shed light on how these small RNAs control TEs in this plant family. Sequencing identified hundreds of small RNAs (mainly siRNAs and miRNAs) in Liliaceae. The small RNAs bind to TE conserved domains, which include the GAG and Pol genes that enable the TEs to transpose. Small RNAs binding to these domains can theoretically inhibit transposition by inhibiting translation or cleaving mRNA. The data obtained from this project have revealed the presence of small RNAs; the next step would be to perform bisulfite sequencing of the TEs to look for methylated DNA within the TEs, which is a result of RNA-directed DNA Methylation (RdDM). This will further the study of RNAi in TEs and genome size expansion of Liliaceae and similar plants and organisms.

FINDING SMALL RNAS IN TRANSPOSABLE ELEMENTS FROM LILIACEAE

USING NEXT GENERATION SEQUENCING ON THE ION TORRENT PGM

By

Hao Ngan Trinh

Approved:

Professor Chris Baysdorfer, Ph.D.

Professor Carol Lauzon, Ph.D.

Professor Danika LeDuc, Ph.D.

Professor Ken Curr, Ph.D.

ACKNOWLEDGEMENTS

I am incredibly relieved that this Master’s graduate program has come to an end. I worked very hard on this thesis project and I am really happy it is completed. I could not have finished my thesis without valuable help. I want to thank my graduate thesis adviser Professor Chris Baysdorfer (Dr. B) for the lab space and for all the guidance and help throughout this thesis project. I want to thank my thesis committee members Professor Carol Lauzon, Professor Danika LeDuc, and Professor Ken Curr for their readiness to serve on the committee. I want to thank Professor William “Bill” Pezzaglia from the Physics department for his friendship and encouragement with my thesis. I want to thank a former graduate student Caroline Roth for her willingness to help and good advice. I want to give a shout out to the Starbucks in the San Francisco Premium Outlets in Livermore for the last few months that I spent there working on the remaining parts of my thesis and hiding away from the heat. I want to thank my parents Loc To and Quan Trinh for being kind and good people and for loving and caring about me. And last but not least, I want to thank my loving boyfriend Robert. I want to thank him for his never-ending support, concern, encouragement, and love for me. You are a wonderful person. You make me very happy and I want to spend the rest of my life with you. Regardless of what the world is like, just live life to the fullest and be true to yourself.


TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

LIST OF FIGURES

LIST OF TABLES

INTRODUCTION
    Plant Genome Size and Liliaceae
    Transposable Elements
    TE Classification and Mechanism
    Life Cycle of LTR Retrotransposons
    TE Significance
    RNA Interference (RNAi)
    siRNAs
    miRNAs
    RdDM
    Riboswitches
    Recent RNAi Research

OBJECTIVE AND SIGNIFICANCE

MATERIALS AND METHODS
    Small RNA Isolation
    Small RNA Analysis Using the Agilent Bioanalyzer
    Constructing cDNA from the Isolated Small RNAs
    cDNA Analysis Using the Agilent Bioanalyzer
    Preparing Template-Positive Ion Template OT2 Ion Sphere Particles (ISPs) from the Amplified cDNA
    Enrich the Template-Positive Ion PGM Template OT2 200 ISPs
    Quality Control of the Unenriched ISPs Using the Qubit 2.0 Fluorometer
    Performing the Sequencing Protocol for the Ion 314 Chip
    Assembling the miRNA Sequences Using SeqMan NGen Software
    Mapping the Small RNA Sequences to TE Sequences

RESULTS
    Small RNA and cDNA Bioanalyzer Results
    Small RNA Results from Lilium
    cDNA Results from Lilium
    Data Analysis of the Sequences
    BLASTN Results of Assembled miRNA Sequences
    Mapping of Small RNA Sequences to TE Sequences Results
    Screenshots of the TE and Small RNA Assembly from Lilium

DISCUSSION

REFERENCES

APPENDICES
    Appendix A - Small RNA Bioanalyzer Results
    Appendix B - cDNA Bioanalyzer Results
    Appendix C - SeqMan NGen Files
    Appendix D - Sequencing Run Reports

LIST OF FIGURES

Fig. 1 Classification of main transposable elements and their subclasses
Fig. 2 Structure of the main types of transposable elements
Fig. 3 A positive correlation of genome size and increases in TE proportion
Fig. 4 The contribution of different TE groups in ascending genome size
Fig. 5 The process of siRNA production and mRNA targeting
Fig. 6 The process of miRNA production and mRNA targeting
Fig. 7 The simplified process of RNA-directed DNA Methylation
Fig. 8 Illustration of an ISP with the fluorescence tags and DNA insert
Fig. 9 Experimental Small RNA ladder – Ladder obtained from the experiment
Fig. 10 Small RNA sample result – Sample result from the manual
Fig. 11 Lilium pardalinum – Small RNAs primarily obtained from the 20-100 nts region
Fig. 12 Experimental High Sensitivity DNA ladder – Ladder obtained from the experiment
Fig. 13 Lilium pardalinum – Concentration of paired end adapter peaks 1485 pmol/1560 pmol
Fig. 14 Various small RNAs and their amounts obtained from the miRNA analysis in the five different Liliaceae
Fig. 15 Total small RNAs found from the total number of contigs that were analyzed from the miRNA analysis
Fig. 16 A characterization of the various TE domains analyzed and the amount of small RNAs that bind to each domain
Fig. 17 Percentage of small RNAs that bind to each of the 16 TE domains
Fig. 18 Total amounts of small RNAs that bind to TE domains, other domains, and total number of small RNAs found
Fig. 19 Percentage of small RNAs that bind to TE domains vs. other regions
Fig. 20 Screenshots of conserved domain results, small RNAs and TE alignments, and small RNAs and TE illustration

LIST OF TABLES

Table 1. A comparison of the properties of siRNAs and miRNAs
Table 2. Plant samples used for small RNA isolation
Table 3. Simplified protocol for constructing cDNA from small RNAs
Table 4. cDNA dilution instructions for use with the Bioanalyzer
Table 5. miRNAs and other small RNAs found from doing the miRNA assembly
Table 6. Small RNAs found that bound to TE domains

INTRODUCTION

Plant Genome Size and Liliaceae

Most plants have small genomes, and plants with very large genomes are rare. The smallest plant genome recorded is that of Genlisea margaretae, with 1C = 0.06 pg, and the largest is that of Paris japonica, with 1C = 152.23 pg, where 1C is the amount of DNA in an unreplicated gametic nucleus (Kelly and Leitch 2011). Why study plants with large genomes?

Plants with large genomes are worth studying because genome size differences among plants have potential evolutionary significance. The correlation between genome size and morphological and ecological traits suggests that genome size increases may have a significant impact on evolution and on plant lineages; plants with larger genomes are constrained to a narrower range of lifestyle and life strategy options (Kelly and Leitch 2011). Phenotypic variation decreases with increasing genome size. This pattern has been documented in traits such as climate tolerance, leaf mass per unit area, and seed mass (Beaulieu et al. 2010). Species with small genomes and species with large genomes can both have large seed masses; however, small seed sizes are rarely seen in species with large genomes (Beaulieu et al. 2007).

Also significant are the repetitive elements that make up much of these large genomes. These elements are called transposable elements (TEs). Other large repetitive sequences in the plant genome can be found in centromeres and telomeres (Mehrotra and Goyal 2014). TEs can make up half of the genome of large-genome plants (Grover and Wendel 2010). TEs are DNA segments capable of transposing throughout the chromosome by either replicative or conservative (cut-and-paste) mechanisms (Kejnovsky et al. 2012). Both mechanisms are described later in this thesis.

Most angiosperms have small genome sizes (mode 1C = 0.6 pg, median 1C = 2.9 pg); those with very large genome sizes (i.e. ≥ 35 pg) are few and are restricted to certain plant families, including Liliaceae. Liliaceae is a family of angiosperms that includes some genera with very large genomes. Fritillaria assyriaca has a genome size of 1C = 124.7 pg (Leitch et al. 2007). These large genome sizes are due to transposable elements, a wide array of which is found in genera of Liliaceae such as Lilium and Fritillaria (Ambrožová et al. 2011).
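To put the 1C values quoted above on a more familiar scale, the short sketch below converts picograms of DNA to gigabase pairs. The conversion factor (1 pg ≈ 978 Mbp) is a commonly used approximation and is an assumption brought in from outside this thesis; the function name and labels are illustrative only.

```python
# Assumption (not from this thesis): 1 pg of DNA is commonly taken as ~978 Mbp.
MBP_PER_PG = 978.0


def pg_to_gbp(c_value_pg):
    """Convert a 1C value in picograms to gigabase pairs (Gbp)."""
    return c_value_pg * MBP_PER_PG / 1000.0


# 1C values (in pg) quoted in the text above.
c_values_pg = {
    "smallest recorded plant genome": 0.06,
    "angiosperm mode": 0.6,
    "angiosperm median": 2.9,
    "very large genome threshold": 35.0,
    "Liliaceae example": 124.7,
    "largest recorded plant genome": 152.23,
}

for label, pg in c_values_pg.items():
    print(f"{label}: {pg} pg is roughly {pg_to_gbp(pg):.2f} Gbp")
```

Under this approximation, the threshold for a very large genome (35 pg) corresponds to roughly 34 Gbp per haploid genome.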

Transposable Elements

Transposable elements (TEs), or transposons, are DNA segments capable of transposing throughout the chromosome by either replicative or conservative (cut-and-paste) mechanisms (Kejnovsky et al. 2012). TEs were first discovered by Barbara McClintock in her experiments with maize in the 1940s (Ravindran 2012). The nuclear genomes of plants consist mainly of repetitive DNA as a result of the proliferation of TEs, chiefly Class I RNA elements and, among these, mainly LTR retrotransposons (Kejnovsky et al. 2012).

TE Classification and Mechanism

Eukaryotic TEs are divided into two major classes. Class I TEs (replicative mode of transposition) are retrotransposons, which transpose via an RNA intermediate that must be reverse transcribed before being integrated into the genome, while Class II TEs transpose via a DNA intermediate (Finnegan 1989). Class I TEs are divided into Long Terminal Repeat (LTR) retrotransposons, which integrate into the genome as double-stranded DNA via reverse transcriptase and integrase, and non-LTR retrotransposons, which include Long and Short Interspersed Elements (LINEs and SINEs). Non-LTR retrotransposons use target-primed reverse transcription, a mechanism that couples reverse transcription and integration. Class II TEs are divided into two subclasses. “Cut-and-paste” DNA transposons are characterized by Terminal Inverted Repeats (TIRs); these TEs are excised and then reintegrated as double-stranded DNA by an element-encoded transposase. The other subclass is Helitrons, also known as rolling-circle transposons, which likely transpose via a replicative mechanism that involves a single-stranded DNA intermediate and which encode a recombinase with replication initiator (Rep) and DNA helicase domains (Fig. 1) (Kejnovsky et al. 2012).

Transposable Elements

    Class I: Retrotransposons (RNA intermediate, "replicative")
        Non-LTRs: LINEs, SINEs
        LTRs: GYPSY, COPIA

    Class II: Transposons (DNA intermediate)
        DNA transposons with TIRs (Terminal Inverted Repeats) (“cut-and-paste”)
        Helitrons, also known as rolling circle transposons (“replicative”)

Fig. 1 Classification of main transposable elements and their subclasses.

TEs can be classified based on the protein-coding regions of their genomes (Fig. 2).

Fig. 2 Structure of the main types of transposable elements. All the important genes are labeled. GAG, Protease (PR), Reverse Transcriptase (RT), RNaseH (RH), and Integrase (INT) make up LTR retrotransposons. Endonuclease (EN) and RT make up the non-LTR retrotransposons. Transposase (TPase) is found in the DNA transposons and Helicase (HEL) in the Helitrons. RPA is Replication Protein A, and PPT is Polypurine Tract (Kejnovsky et al. 2012).

The amount of TEs in a plant species is positively correlated with its genome size (Fig. 3), and for most plants, genome size increase is due to LTR retrotransposons (Fig. 4) (Kejnovsky et al. 2012).

Fig. 3 A positive correlation of genome size and increases in TE proportion. As genome size increases, the proportion of TEs also increases. This graph shows a positive correlation of the two factors (Kejnovsky et al. 2012).


Fig. 4 The contribution of different TE groups in ascending genome size (left to right) of 12 plant genomes calculated by genome coverage. LTR retrotransposons contribute the most to genome size increases of these genomes (Kejnovsky et al. 2012).

Within the plant kingdom, Mu transposons have the highest transposition frequency and mutagenicity. Mu transposons have been studied widely in maize (Wang et al. 2008). Mu-like elements (MULEs) capture host genes and gene fragments and reshuffle them within the host genome, thereby contributing to the evolution of the species (Jiang et al. 2004).

Retrotransposons undergo duplicative transposition, in which a copy of the retrotransposon remains at the original site and another copy is transposed to the acceptor site; their numbers therefore increase with each transposition, which gives them the ability to expand genomes. DNA transposons encode transposase, which recognizes the TIRs that flank the transposon, excises the TE from the donor position, and integrates it into a new acceptor site. The gap at the donor position is either left without the element, which makes it a cut-and-paste transposition, or filled with a copy of the transposon by gap repair (Slotkin and Martienssen 2007).

Life Cycle of LTR Retrotransposons

LTR retrotransposons within a chromosome are transcribed from DNA into mRNA in the nucleus by RNA polymerase II. The mRNA is then exported to the cytoplasm, where translation occurs on ribosomes attached to the endoplasmic reticulum. Here, transfer RNAs (tRNAs) deliver the amino acid specified by each codon. The proteins that form virus-like particles (VLPs), as well as RT and Integrase, are produced in this way. Reverse transcription of the mRNA by RT takes place in the VLPs, which reside in the cytoplasm. Reverse transcription starts when a tRNA binds to the primer-binding site in the 5’UTR region of the LTR retrotransposon. The resulting cDNA is transferred from the 5’UTR to the 3’UTR, where reverse transcription proceeds. The PPT region at the 3’UTR end is then primed, and the cDNA primed from the PPT undergoes an additional strand transfer that gives rise to double-stranded cDNA, which is integrated back into the chromosome in the nucleus by Integrase (Havecker et al. 2004).

TE Significance

The highly mutagenic effects of active TEs often result from their insertion into protein-coding genes. TEs also cause chromosomal breakage, illegitimate recombination, and genome rearrangement (Slotkin and Martienssen 2007). TEs influence neighboring genes by altering splicing and polyadenylation patterns and by functioning as enhancers or promoters (Girard and Freeling 1999). TEs are also implicated in the developmental regulation of gene expression (Slotkin and Martienssen 2007). Significantly, TEs are highly conserved among distantly related taxonomic groups, suggesting that they have biological value to the genome (Pennisi 2007).

Polymerase Chain Reaction (PCR) has been used to show that a superfamily of TEs called V-SINEs is widespread among vertebrates, such as lampreys, cartilaginous fish, bony fish, and amphibians. Lampreys are the oldest vertebrates and date back to the Cambrian period (544 to 510 million years ago). These findings suggest that V-SINEs have existed for a very long time, possibly about 544 million years (Ogiwara et al. 2002). TEs are highly conserved in many different species, which suggests that they play a specific role. TE jumping can also have advantageous results, not just destructive ones. Transposons can drive the translocation of genomic sequences and can therefore promote genomic evolution. TEs can also shuffle exons and repair double-stranded breaks (Pray 2008).

The question remains, however, as to what the evolutionary advantage of having a big genome is if the genome consists mainly of TEs and of the non-coding small RNAs that the host genome itself makes. Could there be an advantage, given that TEs are conserved among various species? Is bigger better even if it is “junk” DNA? Perhaps this “junk” DNA serves an underlying significant purpose in helping the species proliferate. But a bigger genome could introduce more mutations via TE transposition, so why would having a bigger genome be advantageous?

Taking into account the potentially harmful effects of active TEs, the genome has developed epigenetic defense mechanisms to silence TE activity. One such epigenetic defense mechanism is RNA interference (RNAi). Silenced TEs retain the coding potential to mobilize themselves, but they do not produce the necessary proteins due to a repressive chromatin environment (Slotkin and Martienssen 2007). Given the impact of TEs on the genome and their abundance in plant genomes, it is important to study them, particularly in plant families such as Liliaceae, in which TE proliferation has been extensive.

RNA Interference (RNAi)

RNA interference (RNAi) in eukaryotic organisms is an epigenetic mechanism of regulating gene expression through the interaction of non-coding regulatory small RNAs with messenger RNAs (mRNAs) or repetitive sequences in the genome. RNAi is both a transcriptional and a post-transcriptional gene silencing process. RNAi was first discovered by Napoli et al. (1990) in their transgenic experiments in petunias (Geley and Müller 2004). However, the involvement of dsRNA in RNAi was discovered by Fire et al. (1998), who found that it was dsRNA, not single-stranded sense or anti-sense RNA, that mediated gene silencing in microinjected Caenorhabditis elegans. Even though small RNAs are known for inhibiting translation or cleaving mRNA, they can also activate gene expression, as in the case of miR-122, a liver-specific miRNA needed for the successful replication of the hepatitis C virus (Großhans and Filipowicz 2008).


siRNAs

In post-transcriptional gene silencing (PTGS), the enzyme Dicer cleaves double-stranded RNA (dsRNA), which can come from transgenes, RNA viruses, transposons, and other sources (Hannon 2002), into smaller pieces (about 21-23 nucleotides long) known as small interfering RNAs (siRNAs) (Hammond et al. 2000). The siRNAs unwind, the sense strand is degraded (Novina and Sharp 2004), and the guide strand (also called the antisense strand) is incorporated into the RNA-induced silencing complex (RISC). The guide strand then base pairs with the complementary target mRNA, and RISC cleaves the transcript. The targeted mRNA is then degraded (Geley and Müller 2004). In fruit flies and mammals, the antisense strand is incorporated directly into RISC to target the complementary mRNA for degradation; in worms and plants, however, RNA-Dependent RNA polymerase (RdRP) binds the antisense strand, and this strand pairs with a complementary mRNA to start an amplification process that makes more dsRNA. Dicer then generates new siRNAs that are specific to different sequences on the same mRNA, and the target mRNA is once again degraded (Fig. 5). In plants, this amplification process enables RNAi to spread throughout somatic cells by cell-to-cell transfer of dsRNAs, thereby generating widespread resistance to viral infections. Within plants, siRNAs also initiate transcriptional silencing and can help recruit enzymes that add methyl groups (methylation) to the target locus to inhibit gene expression (Novina and Sharp 2004).
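The base-pairing step described above, in which the antisense (guide) strand finds its complementary site on the target mRNA, can be illustrated with a minimal sketch. The sequences and function names below are made up for illustration and are not data or software from this thesis; only perfect Watson-Crick complementarity is considered.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_target_sites(guide, mrna):
    """Return 0-based start positions on the mRNA that pair perfectly with the guide."""
    site = reverse_complement_rna(guide)
    mrna = mrna.upper()
    last_start = len(mrna) - len(site)
    return [i for i in range(last_start + 1) if mrna[i:i + len(site)] == site]


# Hypothetical 21-nt target region embedded in a hypothetical mRNA fragment.
target_region = "AUGGCUAGCUAGGAUCCGAUA"
mrna = "GGAUCC" + target_region + "AAUU"

# The guide (antisense) strand is the reverse complement of the target region.
guide = reverse_complement_rna(target_region)

print(find_target_sites(guide, mrna))
```

In this toy example the guide strand pairs with the mRNA at the position where the target region was embedded, which is the site RISC would cleave in the pathway described above.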


dsRNA

↓ dicer

siRNAs (21-23 nts long)

↙ ↘

sense RNA = degraded anti-sense RNA

↓ RISC

anti-sense RNA+RISC

↓ target mRNA

target mRNA (perfect complementarity) = degraded

Fig. 5 The process of siRNA production and mRNA targeting.

miRNAs

Another form of post-transcriptional gene silencing involves microRNAs (miRNAs). miRNAs are single-stranded RNAs (ssRNAs) (Lim et al. 2003) that are encoded by the host genome. They are formed from hairpin RNAs, which are self-complementary RNAs that fold over in the middle to form a double-stranded structure. They are expressed either from an intergenic cluster or from single genetic regions. Both types fold into long hairpin RNAs called primary microRNAs (pri-miRNAs), which have some imperfect complementarity that results in a bulge in the hairpin. Drosha, a nuclear enzyme, cleaves these pri-miRNAs into smaller hairpin RNAs called precursor microRNAs (pre-miRNAs) (Lee et al. 2003). Pre-miRNAs are exported from the nucleus into the cytoplasm, where Dicer cleaves them into mature ssRNAs (21-22 nucleotides (nts) long). A mature ssRNA is assembled into a ribonucleoprotein (miRNP) complex. This complex binds to the 3’-untranslated sequences of particular mRNAs through partial complementarity and prevents translation of the mRNA. If a miRNA is exactly or nearly exactly complementary to the target mRNA, the target mRNA is degraded (Fig. 6) (Novina and Sharp 2004). Table 1 compares the properties of siRNAs and miRNAs, including important features such as structure, complementarity, and gene regulation mechanism.
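The distinction drawn above between exact (or nearly exact) and partial complementarity can be sketched as a simple mismatch count. This is only an illustration: the miRNA sequence is arbitrary, the cutoff of one mismatch for "nearly exact" is a hypothetical parameter, and G:U wobble pairs are treated as mismatches for simplicity.

```python
# Watson-Crick pairs only; G:U wobble pairs count as mismatches (a simplification).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def perfect_site(mirna):
    """Return the mRNA site (5'->3') that pairs perfectly with the miRNA."""
    return "".join(COMPLEMENT[base] for base in reversed(mirna.upper()))


def count_mismatches(mirna, site):
    """Count unpaired positions between a miRNA and an equal-length site (antiparallel)."""
    if len(mirna) != len(site):
        raise ValueError("miRNA and target site must have the same length")
    pairs = zip(mirna.upper(), reversed(site.upper()))
    return sum(1 for m, t in pairs if (m, t) not in WATSON_CRICK)


def predicted_outcome(mirna, site, max_mismatches_for_cleavage=1):
    """Map complementarity to the two outcomes in Fig. 6 (the cutoff is an assumption)."""
    if count_mismatches(mirna, site) <= max_mismatches_for_cleavage:
        return "target mRNA degraded"
    return "translation inhibited"


mirna = "AUGCUAGGCUAACGUUAGCCA"  # arbitrary 21-nt example sequence
exact_site = perfect_site(mirna)

# Introduce four mismatches at the 5' end of the site to mimic partial complementarity.
swap = {"A": "C", "C": "A", "G": "U", "U": "G"}
partial_site = "".join(swap[b] for b in exact_site[:4]) + exact_site[4:]

print(predicted_outcome(mirna, exact_site))
print(predicted_outcome(mirna, partial_site))
```

A real target prediction tool would also weigh the position of mismatches and allow wobble pairing; the sketch only captures the two-way branch shown in Fig. 6.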

pri-miRNAs

↓ drosha

pre-miRNAs (70 nts long)

↓ dicer

ssRNA (21-22 nts long)

↓ miRNP

ssRNA+miRNP

↙ ↘ target mRNA

Exact complementarity = target mRNA degraded        Partial complementarity = inhibit translation

Fig. 6 The process of miRNA production and mRNA targeting.


Table 1 A comparison of the properties of siRNAs and miRNAs (Lam et al. 2015).

There are different types of RNA silencing pathways. In the siRNA pathway, Dicer cleaves dsRNA, and the result is siRNA-targeted destruction of complementary RNA. In the microRNA pathway, Dicer cleaves an inverted-repeat RNA with a partially double-stranded structure. The targeted mRNA can be either translationally repressed or degraded, as in the RNAi (siRNA) pathway. The last pathway is called RNA-directed chromatin silencing. It is similar to the siRNA pathway, but the siRNAs target DNA or chromatin-associated RNA, and the outcome is that the targeted locus is methylated. In plants, siRNAs 21-24 nts long direct the silencing machinery to the targeted mRNA based on complementary base pairing. The 21/22 nts siRNAs mainly function in mRNA cleavage and PTGS, whereas the 24 nts siRNAs mainly function in RNA-directed DNA methylation (RdDM) and transcriptional silencing (Baumberger and Baulcombe 2005).
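The length-based division of labor among plant siRNAs described above lends itself to a small sketch that bins sequencing reads by size class. The reads are placeholders, and the labels simply restate the roles given in the text; this is not the analysis pipeline used in this thesis.

```python
from collections import Counter


def size_class(read):
    """Assign a small RNA read to a functional size class based on its length."""
    n = len(read)
    if n in (21, 22):
        return "21-22 nts: mRNA cleavage and PTGS"
    if n == 24:
        return "24 nts: RdDM and transcriptional silencing"
    if n == 23:
        return "23 nts: within siRNA range, no role assigned in the text"
    return "outside the 21-24 nts siRNA range"


# Placeholder reads of different lengths (sequence content is irrelevant here).
reads = ["A" * 21, "C" * 22, "G" * 24, "U" * 24, "A" * 23, "C" * 30]

counts = Counter(size_class(read) for read in reads)
for label, count in sorted(counts.items()):
    print(f"{label}: {count}")
```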


A recent study suggests that both 21 and 24 nts siRNAs carry out the process of DNA methylation (Nuthikattu et al. 2013), where the 21 nts siRNAs establish DNA methylation and the 24 nts siRNAs amplify and maintain it (Bond and Baulcombe 2015). RdDM is the process of adding methyl groups to cytosine residues in CG, CHG, and CHH contexts (where H is any nucleotide except G) (Henderson and Jacobsen 2007). Methylation represses expression at target loci (Lister et al. 2008) and blocks gene expression when present in promoter regions (Chan et al. 2004). RdDM represses TE activity and contributes to environmental and developmental regulation of gene expression (Baumberger and Baulcombe 2005).
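The three cytosine sequence contexts named above (CG, CHG, and CHH, where H is A, C, or T) can be made concrete with a short sketch that classifies each cytosine on one DNA strand. The function names and the example sequence are illustrative assumptions; only the forward strand is examined, and the input is assumed to contain only A, C, G, and T.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index i as CG, CHG, or CHH (None if too close to the end)."""
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position i is not a cytosine")
    downstream = seq[i + 1:i + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) == 2:
        # The first downstream base is not G, so it is H (A, C, or T).
        return "CHG" if downstream[1] == "G" else "CHH"
    return None  # not enough downstream sequence to decide


def all_contexts(seq):
    """Return (position, context) for every cytosine on the given strand."""
    return [(i, cytosine_context(seq, i)) for i, base in enumerate(seq.upper()) if base == "C"]


example = "ACGTCAGTCTAC"  # made-up sequence
print(all_contexts(example))
```

In bisulfite sequencing analyses, methylation levels are typically reported separately for these three contexts, which is why the classification matters.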

RdDM

In RdDM, Dicer-like proteins cleave dsRNA into 21-24 nts long siRNAs that are loaded onto Argonaute proteins (Baumberger and Baulcombe 2005). Argonaute proteins target chromatin-associated scaffold transcripts in a sequence-specific manner (Wierzbicki et al. 2009). The complexes bound to the chromatin then recruit Domains Rearranged Methyltransferases 1 and 2 (DRM1 and DRM2) to methylate the DNA (Fig. 7) (Kim and Zilberman 2014).


dsRNA

↓ dicer

siRNAs (21-24 nts long)

↓ argonaute (AGO)

siRNAs + AGO

↓ Chromatin

siRNAs +AGO --bind to----> Chromatin

↓ recruits DRM

Chromatin = methylated

Fig. 7 The simplified process of RNA-directed DNA Methylation.

In RdDM, there are also RNA-dependent mechanisms to maintain DNA methylation. For example, plant-specific DNA-dependent RNA polymerases (Pol IV and Pol V) are recruited to chromatin by methyl-DNA binding proteins called Sawadee homeodomain homologs (Johnson et al. 2014). Pol IV produces precursor RNAs that are cleaved into 24 nts siRNAs, and Pol V produces scaffold transcripts at chromatin at DNA methylation sites (Haag and Pikaard 2011). Pol IV has a self-reinforcing feedback system that maintains TE silencing. This, however, does not explain how epigenetic silencing is inherited.


Riboswitches

An RNA element that regulates gene expression through the binding of small molecules is called a riboswitch. Riboswitches are found in the untranslated regions (UTRs) and introns of mRNA. To regulate gene expression, riboswitches convert the intracellular level of a small molecule into an RNA structural rearrangement. Riboswitches regulate adjacent genes by altering their transcription, translation, or splicing. In plants and algae, riboswitches bind thiamin pyrophosphate (TPP) and regulate thiamin biosynthesis. TPP is also important in regulating primary metabolism in plants. In plants, the TPP riboswitch resides in the gene encoding phosphomethylpyrimidine synthase (THIC). The conservation of the features of the THIC 3’ region (introns, exons, the TPP riboswitch, splice sites, the polyadenylation signal, the polypyrimidine tract, etc.) and of the distances between them indicates the structural importance of this region for TPP-mediated gene regulation. TPP riboswitches are the most widespread riboswitches among eukaryotes and prokaryotes. They regulate alternative splicing to induce the formation of mRNA molecules that are likely to be unstable. They do this by introducing upstream start codons that result in translation of a non-functional ORF in algae and fungi, premature translation termination in algae, or a 3’UTR modification in plants.

Riboswitch engineering (building riboswitch-based gene control platforms) in plants has been difficult due to a lack of information on natural metabolite-responsive RNA elements and the complexity of post-transcriptional gene regulation. However, chloroplasts offer greater translational control because their separate gene expression machinery is similar to that found in prokaryotes. A theophylline-responsive control device from E. coli was optimized to meet the translational requirements of chloroplasts, and the resulting synthetic riboswitch was engineered in E. coli and stably transformed into the tobacco chloroplast genome (Bocobza and Aharoni 2014).

Recent RNAi Research

Recent research investigated the de novo establishment of RdDM using genetic mutants of Arabidopsis. It showed that Virus-Induced Gene Silencing (VIGS) RdDM requires Pol V and DRM2 but not Dicer-like 3 or other Pol IV pathway components. It also demonstrated that VIGS-RdDM can be enhanced in mutant plants in which increased production of 24 nts siRNAs reinforces the level of RdDM (Bond and Baulcombe 2015).

Matsunaga et al. (2015) found that the Ty1/copia retrotransposon ONSEN is activated by heat stress in Arabidopsis. Heat induction of the ONSEN promoter was analyzed using a reporter gene assay, which showed that gene expression was enhanced in mutants deficient in siRNA biogenesis. The activity level of the transgene gradually declined after three days. The results indicated that activation of ONSEN was originally regulated by transcription factors, but that the resilencing of ONSEN was controlled independently of RdDM (Matsunaga et al. 2015).

Gong et al. (2015) studied TEs and siRNAs in Gossypium raimondii Ulbr using Illumina sequencing and found that a region at the 3' end of chromosome 1 had significant siRNA coverage. Their analysis of the correlation pattern between uniquely mapped siRNAs and those mapping to multiple regions implied that siRNAs were being actively produced from potentially young TEs. Their research showed that sufficient expression of TEs may be necessary for siRNAs to be generated and to maintain the silent state of recently transposed TEs (Gong et al. 2015).

Researchers who study transposon silencing in Arabidopsis found that these plants have more than 20 different mutator transposon sequences. These sequences are methylated or silenced in wild-type plants. However, in plants that are defective for the enzymes required for methylation, the transposons get transcribed (Pray 2008). In methylation-deficient plants, mutant phenotypes are linked to transposon insertions (Miura et al. 2001).

In 2014, Becher et al. sequenced small RNAs in Fritillaria imperialis using high-throughput sequencing (Illumina), characterized these small RNAs, and determined methylation levels for an endogenous pararetrovirus in this species. They called these viruses FriEPRVs. FriEPRVs are endogenous repeat viral sequences that contribute to repetitive genome content (Becher et al. 2014).

OBJECTIVE AND SIGNIFICANCE

The overall objective of this thesis research project was to isolate small RNAs (mainly siRNAs and miRNAs) from five species of the Liliaceae plant family (Lilium pardalinum, Prosartes smithii, Clintonia andrewsiana, Fritillaria affinis, and Scoliopus bigelovii) and to identify small RNAs that bind to transposable elements in the same family. The small RNAs were sequenced by Next-Generation high-throughput sequencing on the Ion Torrent PGM, and the siRNAs and miRNAs were analyzed using sequences generated by the SeqMan NGen software program from DNASTAR and bioinformatics databases. The sequence analysis will give insight into the small RNAs that bind to TEs. Given the significance that RNAi has in regulating gene expression, mainly through degrading mRNA and/or inhibiting mRNA translation, this research will help to shed light on how TEs influence gene expression in the Liliaceae family and contribute to bulking up the genome, thereby affecting plant evolution; it is also of genetic interest for understanding how RNAi controls TE proliferation in plants with large genomes.

MATERIALS AND METHODS

Small RNA Isolation

Small RNAs were isolated from five wild-type species of the Liliaceae plant family using the mirVana miRNA Isolation Kit from Ambion (Life Technologies, San Francisco, CA). The protocol used to isolate the small RNAs was the mirVana miRNA Isolation Kit Procedure (2011) as described on pages 10-11 and 13-15. Table 2 details the plant samples used.

Table 2 Plant samples used for small RNA isolation.

Plant Sample            Sample Condition   Tissue Type   Isolation Date   Sample Amount (g)
Lilium pardalinum       Fresh              Leaf          7/22/2014        0.1123
Prosartes smithii       Fresh              Leaf          8/19/2014        0.1081
Clintonia andrewsiana   Fresh              Leaf          8/20/2014        0.0644
Fritillaria affinis     Frozen (-80 °C)    Leaf          8/25/2014        0.118
Scoliopus bigelovii     Frozen (-80 °C)    Leaf          8/29/2014        0.119

Small RNA Analysis Using the Agilent Bioanalyzer

Small RNAs were analyzed for quality control (to make sure small RNAs were obtained) using the Bioanalyzer from Agilent Technologies. The protocol used was the Agilent Small RNA Kit Quick Start Guide protocol on pages 1-4. For each sample, an undiluted aliquot and a 1:10 dilution of that original concentration were prepared.

Constructing cDNA from the Isolated Small RNAs

The isolated small RNAs were converted into complementary DNAs (cDNAs) using the Ion Total RNA-Seq v2 for small RNA libraries kit, quick reference protocols pages 3-7. Table 3 simplifies the cDNA construction protocol.

Table 3 Simplified protocol for constructing cDNA from small RNAs.

Constructing the Amplified Small RNA Library

Protocol Step                              Page(s)
Hybridize and Ligate the RNA               3-4
Perform Reverse Transcription (RT)         4
Purify and Size-Select the cDNA            4-5
Amplify the cDNA (Non-barcoded Library)    6-7


cDNA Analysis Using the Agilent Bioanalyzer

The cDNAs constructed from the small RNAs were analyzed for quality control and concentration using the Bioanalyzer from Agilent Technologies. The protocol used was the Agilent High Sensitivity DNA Kit Quick Start Guide, pages 1-4. Each sample was run at its original concentration (undiluted) and at a 1:10 to 1:20 dilution of the original sample. Table 4 shows the protocol for making the cDNA dilutions to be loaded onto the Bioanalyzer. The library that was sequenced needed to be at a very dilute concentration. The sample with the lowest concentration of cDNA obtained from the Bioanalyzer results was further diluted to a final concentration of 20 pM, and those samples were used for the Ion One Touch.

Table 4 cDNA dilution instructions for use with the Bioanalyzer.

Dilution    Preparation
Straight    No dilution, original concentration
1:10        1 µl of sample + 9 µl of DNA grade water
1:20        1 µl of sample + 19 µl of DNA grade water

Preparing Template-Positive Ion Template OT2 Ion Sphere Particles (ISPs) from the Amplified cDNA

From the Bioanalyzer results, the concentration of the paired-end adapter peaks (pmol/L) was determined. For preparation of the templates (template-positive ion template), the library for each plant sample needed to be at a concentration of 20 pM. The libraries with the 1:10 and 1:20 dilutions were used to make an additional dilution to reach a final concentration of 20 pM. The 20 pM diluted library stock needed to be at a final volume of 25 µl. DNA grade water was used to make the dilutions. The Template-Positive Ion Template OT2 Ion Sphere Particles were prepared using the manufacturer's protocol, pages 19-49 of the Ion PGM Template OT2 200 Kit User Guide. For Scoliopus bigelovii, the protocol “Prepare Template-Positive Ion PGM Hi-Q Ion Sphere Particles” on pages 18-45 in the Ion PGM Hi-Q OT2 Kit was used. The Hi-Q protocol was a newer version.
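The dilution to 20 pM in a 25 µl final volume follows the standard C1·V1 = C2·V2 relationship. The following is a minimal Python sketch of that arithmetic, not part of the manufacturer's protocol; the function name and the 150 pM stock value are hypothetical and chosen only for illustration.

```python
def dilution_volumes(stock_pM, target_pM=20.0, final_volume_ul=25.0):
    """Return (library_ul, water_ul) needed to reach target_pM using C1*V1 = C2*V2."""
    if stock_pM < target_pM:
        raise ValueError("stock is already more dilute than the target")
    library_ul = target_pM * final_volume_ul / stock_pM
    water_ul = final_volume_ul - library_ul
    return library_ul, water_ul


# Hypothetical example: a diluted library measured at 150 pM (assumed value).
library_ul, water_ul = dilution_volumes(150.0)
print(f"{library_ul:.2f} µl library + {water_ul:.2f} µl DNA grade water")
```

Because the Bioanalyzer reports pmol/L, which is the same unit as pM, the measured value can be passed in directly once any prior 1:10 or 1:20 dilution has been accounted for.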

Enrich the Template-Positive Ion PGM Template OT2 200 ISPs

After performing the Ion One Touch 2 run, enrichment of the template-positive ion sphere particles was done using the Ion One Touch ES and the protocol “Enrich the Template-Positive Ion PGM Template OT2 200 Ion Sphere Particles” on pages 51-58 of the Ion PGM Template OT2 200 Kit User Guide. The function of the Ion One Touch ES is to isolate template-positive ISPs (pull out the beads with the amplified DNA). These ISPs are then loaded onto the semiconductor-sequencing chip (Ion 314 Chip).

Quality Control of the Unenriched ISPs Using the Qubit 2.0 Fluorometer

Quality control on the unenriched ISPs (using the 2 µl aliquot of the unenriched ISPs obtained after the Ion One Touch 2 run) was performed using the protocol “Quality Control of Ion PGM Template OT2 200 Ion Sphere Particles” in the Ion PGM Template OT2 200 Kit User Guide, pages 43 and 69-84. The quality control assay labels the ISPs with two fluorophores: Alexa Fluor 488 and Alexa Fluor 647. The probe labeled 488 anneals to primer B sites (all of the ISPs) and the probe labeled 647 anneals to primer A sites (only ISPs with extended templates) (Fig. 8). The ratio of the Alexa Fluor 647 fluorescence to the Alexa Fluor 488 fluorescence yields the percent templated ISPs. Information obtained from the Qubit protocol was recorded and entered into the Qubit 2.0 Easy Calculator Microsoft Excel spreadsheet file. The target range that produces the most data is 10-30% templated ISPs.
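The ratio described above can be sketched in a few lines of Python. This is a simplified illustration only: the kit's spreadsheet applies its own calibration and background handling, which are not reproduced here, and the function names and fluorescence readings below are hypothetical.

```python
def percent_templated(af647, af488):
    """Simplified percent templated ISPs: the AF647 / AF488 fluorescence ratio as a percent."""
    if af488 <= 0:
        raise ValueError("Alexa Fluor 488 reading must be positive")
    return 100.0 * af647 / af488


def in_target_range(percent, low=10.0, high=30.0):
    """Check the 10-30% templated-ISP target range stated in the text."""
    return low <= percent <= high


# Hypothetical readings in arbitrary fluorescence units (assumed values).
pct = percent_templated(af647=120.0, af488=600.0)
print(f"{pct:.1f}% templated ISPs, within target range: {in_target_range(pct)}")
```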

Fig. 8 Illustration of an ISP with the fluorescence tags and DNA insert. (Ion PGM Template OT2 200 Kit User Guide).

Performing the Sequencing Protocol for the Ion 314 Chip

The protocol “Sequencing Protocol - Ion 314 Chip” in the Ion PGM Sequencing 200 Kit v2 User Guide, pages 33-42, was followed to sequence the small RNAs. The Ion Torrent PGM was used to sequence the samples in this project because, like other Next-Generation sequencing machines, it can sequence large amounts of DNA (hence, high-throughput) in a relatively short amount of time (2-4 hrs). This sequencing technology facilitates the sequencing of organisms with large genomes.


Assembling the miRNA Sequences Using SeqMan NGen Software

Sequence results came in the form of contigs (consensus sequences put together from overlapping DNA segments) in the FASTQ file format, and these came already assembled. The contigs were then assembled de novo into miRNA sequences, following the SeqMan NGen software instructions.
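For readers unfamiliar with the FASTQ format mentioned above, each record occupies four lines: a header beginning with "@", the sequence, a "+" separator, and a quality string of the same length as the sequence. The following minimal Python sketch reads such records; it is an illustration only and is not how SeqMan NGen handles its input. The read names and sequences are made up.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual


# Toy input built in memory (hypothetical reads, uniform quality).
reads = {"read1": "TGACAGAAGAGAGTGAGCAC", "read2": "TCGGACCAGGCTTCATTCCCC"}
text = "".join(f"@{n}\n{s}\n+\n{'I' * len(s)}\n" for n, s in reads.items())
records = list(read_fastq(io.StringIO(text)))
for name, seq, _ in records:
    print(name, len(seq), "nt")
```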

Once assembly was finished, the results were launched in SeqMan Pro and each contig was individually selected for viewing. BLAST searches were used to aid in data analysis. To do a BLAST search, select the desired contig, then go to Net Search → Blast Selection → OK (make sure the program is blastn and the database is nr). BLASTN searches nucleotide databases using a nucleotide query; nr means the search is run against a non-redundant database. About 10-15 BLAST searches were done for each plant species. The miRNA contig number, its length, BLASTN results, ncRNAs found, and precursor and primary transcripts were recorded.

Mapping the Small RNA Sequences to TE Sequences

Using the TE sequences and the small RNA sequences from all five samples, a templated assembly (the genome assembly special workflow) was done. This assembly used the TEs as the template and aligned the small RNA sequences (in FASTQ format) against them. The set of Liliaceae TE sequences used for analysis was sequenced and assembled by a fellow graduate student, and the bulk of this project was sequencing the small RNAs and finding the small RNAs that bind to the TEs.


A conserved domain BLAST search was done by going to the BLAST website (http://blast.ncbi.nlm.nih.gov/Blast.cgi), scrolling down to “Specialized searches,” and clicking on the “CD-search” link (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). A text file of the TE sequences was opened and the nucleotide sequence was copied and pasted into the box that reads: “Enter protein or nucleotide query...” After submission, the search results came back with the conserved domains identified within the sequence that was searched. To be considered a TE, a sequence should show Integrase, Reverse Transcriptase (RT), GAG, Transposase, or RNAse (Copia/Gypsy) domains. Results can contain one or several of these domains, but are not limited to just these genes. Once results were launched in SeqMan Pro, each contig result that corresponds to the TE being analyzed was selected. Sequences that bind to the TE being examined were identified. Sequences that match in the nucleotide region of the conserved TE genes (Integrase, RT, GAG, Transposase, and RNAse (Copia/Gypsy)) were identified. The contig name and number, the length of the contig (nts), how many small RNA sequences bound, whether it is a TE, how many small RNAs bind to each domain, and how many bind to other domains were all recorded. The search results showed the bp region of a conserved domain within that particular TE, and then SeqMan NGen was used to view the small RNAs that bind to that particular TE and the bp region that the small RNAs bind to, in order to determine whether these were conserved domain regions.
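The comparison described above, checking whether the bp region where a small RNA binds overlaps a conserved domain interval, can be sketched in Python. This is a simplified illustration under stated assumptions: it uses exact string matching on both strands rather than the SeqMan NGen templated assembly, and the toy TE sequence, small RNA sequences, and domain coordinates are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_hits(te_seq, small_rna):
    """Return 1-based (start, end, strand) tuples for exact matches on either strand."""
    hits = []
    for strand, query in (("+", small_rna), ("-", reverse_complement(small_rna))):
        pos = te_seq.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = te_seq.find(query, pos + 1)
    return hits


def classify(hit, domains):
    """Name the first conserved domain whose interval overlaps the hit, else 'Other'."""
    start, end, _ = hit
    for name, (d_start, d_end) in domains.items():
        if start <= d_end and end >= d_start:
            return name
    return "Other"


# Toy 60-nt "TE" with a hypothetical RT domain at positions 21-40.
te = "ACGTTGCAAGCTTAGGCATC" + "TGACAGAAGAGAGTGAGCAC" + "TTGACCGGATAACGCTAGCA"
domains = {"RT": (21, 40)}
small_rnas = {
    "sRNA_a": "TGACAGAAGAGAGTGAGCAC",  # matches the domain on the + strand
    "sRNA_b": "GTGCTCACTCTCTTCTGTCA",  # reverse complement of sRNA_a
    "sRNA_c": "ACGTTGCAAGCTTAGGCAT",   # matches upstream of the domain
}
for name, seq in small_rnas.items():
    for hit in find_hits(te, seq):
        print(name, hit, classify(hit, domains))
```

In the real data the domain intervals would come from the conserved domain search results, for example the RT domain at 6536-7297 in contig LP198.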


RESULTS

Small RNA and cDNA Bioanalyzer Results

After the small RNAs were isolated, they were measured on the Bioanalyzer to check for the presence of small RNAs. If small RNAs were found, they were converted to cDNA and measured on the Bioanalyzer to look for the presence of paired-end adapters. If paired-end adapters were identified, the next step was preparing Template-Positive Ion Template OT2 Ion Sphere Particles (ISPs) using the Ion One Touch 2. Small RNA and cDNA Bioanalyzer results from Lilium pardalinum (Figs. 9-13) are included in the results section, and the remaining Bioanalyzer results for the other four species are in Appendix A (Small RNAs), pages 52-53, and Appendix B (cDNAs), pages 54-55. Bioanalyzer results for the small RNA isolations confirmed that small RNAs were obtained for all of the sampled species, and Bioanalyzer results for cDNA confirmed the presence of cDNA and the adapters. Qubit results showed that all the samples had between 10% and 30% templated ISPs, which is the target range that should produce the most sequence data.


Small RNA Results from Lilium

Fig. 9 Experimental Small RNA ladder – Ladder obtained and used from the experiment. Each peak corresponds to a nt length.

Fig. 10 Small RNA sample result – Sample result from the manual. Small RNA region range is from <20 nts to 200 nts.

Fig. 11 Lilium pardalinum – Small RNAs primarily obtained from the 20-100 nts region.

cDNA Results from Lilium

Fig. 12 Experimental High Sensitivity DNA ladder – Ladder obtained from the experiment.


Fig. 13 Lilium pardalinum – Concentration of paired end adapter peaks 1485 pmol/1560 pmol.

Data Analysis of the Sequences

Emulsion PCR was done on the Ion One Touch, and the samples were sequenced on the Ion Torrent. Sequencing run reports are located in Appendix D, pages 57-66. A table showing all the sequence file names used for the assemblies is in Appendix C, page 56. A miRNA assembly was done to look for miRNAs and a table was created to summarize the results found (Table 5). The results are illustrated on the graphs following the table. Figures 14 and 15 show the different types of small RNAs found, the most being miRNAs (102) and non-coding RNAs (ncRNAs) (36).


BLASTN Results of Assembled miRNA Sequences

Table 5 miRNAs and other small RNAs found from doing the miRNA assembly. Lilium pardalinum miRNA Length (nt) BLASTN results (miRNA ncRNA Precursor (contig #) name) results miRNA (contig #) results (contig #) 114 41 Vitis vinifera Zea mays, 166 114 114 205 29 Ananas comosus, Solanum 205 230 lycopersicum,Vitis vinifera, 396 212 95 230 32 Zea mays, Brachypodium 230 distachyon, 167 Arab thaliana, 2 159 294 95 32 Cumis melon, Vitis vinifera, 159 40 Prosartes smithii 72 36 Glycine max, Vitis vinifera, None 381 Solanum tuberosum, Malus 43 domesticus, Cucumus melo, 156 72 381 35 Solanum lycopersicum, Zea mays, 196 Glycine max, Vitis vinifera, 180 Solanum tuberosum, Malus 232 domestica, Cucumis melo, 166 43 40 Brachypodium distachyon, Citrus sinensis, Zea mays, Cucumis melo, Glycine max 167, Prunus persica, 398 177 33 Vitis vinifera, Brachypodium distachyon, Zea mays, Brassica rapa, Glycine max, Hordeum vulgare, Arab thaliana, Ananas comosus, 159 Arab thaliana primary transcript 159, Arab thaliana ago_5425 siRNA complete seq Cucumis melo, 167

1417 32 Malus domestica, Cucumis melo, Ananas comosus, 396 196 35 Zea mays, Cucumis melo, Vitis vinifera, Solanum tuberosum, 156 210 32 Solanum lycopersicum,


Brachypodium distachyon, Zea mays, Malus domestica, 156 133 30 Glycine max, 2118 151 41 Arab thaliana ath_wt_15224 siRNA Fritillaria affinis 28 40 Homo sapiens, 43768 piRNA 37 11 18 ? Arab thaliana ago4_00296 50 siRNA, Arab thaliana 52 ath_wt_01031 siRNA, Arab 10 thaliana ath_wt_20696 siRNA 47 11 29 Arab thaliana ath_wt_05459 14 siRNA, Arab thaliana 62 ath_wt_15224 siRNA, Arab 43 thaliana ath_wt_18702 siRNA 34 8 41 Glycine max, 6300 18 7 Clintonia andrewsiana 198 30 Solanum lycopersicum, Zea mays, 3 198 Glycine max, Vitis vinifera, 198 118 Solanum tuberosum, Malus 180 196 domestica, Cucumis melo, 166 196 180 124 30 Arab thaliana ath_wt_05459 97 232 siRNA, Arab thaliana 232 201 ath_wt_15224 siRNA, Arab 149 thaliana ath_wt_18702 siRNA 118 33 Brachypodium distachyon, Zea mays, Cucumis melo, Glycine max, 167 Prunus persica, 398 196 32 Vitis vinifera, Glycine max, Malus domestica, Cucumis melo, 828 180 31 Vitis vinifera, Brachypodium distachyon, Zea mays, Cucumis melo, Hordeum vulgare, Arab thaliana primary transcript, Brassica rapa, Glycine max, Ananas comosus, 159 201 40 Brachypodium distachyon, Zea mays, Ananas comosus, 529 232 30 Solanum lycopersicum, Malus domestica, Brassica rapa, Vitis


vinifera, Brachypodium distachyon, Glycine max, Zea mays, Malus domestica, Cucumis melo, eucalyptus globulus, 156 Scoliopus bigelovii 229 31 Solanum lycopersicum, Vitis 26 229 vinifera, Solanum tuberosum, Zea 66 303 mays, Malus domestica, Cucumis 229 melo, Glycine max, 166 65 99 38 Arab thaliana ath_wt_16726 233 siRNA, Arab thaliana 101 ath_wt_06849 siRNA 45 303 30 Vitis vinifera, Brachypodium 146 distachyon, Zea mays, Cucumis 124 melo, Hordeum vulgare, Brassica 150 rapa, Glycine max, Ananas 193 comosus, 159 Arab thaliana primary transcript 159


Various Small RNAs Found in Each Species of Liliaceae

[Bar chart. x-axis: Species (Lilium pardalinum, Prosartes smithii, Fritillaria affinis, Clintonia andrewsiana, Scoliopus bigelovii); y-axis: Number of Small RNAs; series: miRNAs, siRNAs, piRNAs, ncRNAs, Primary Transcript, Precursor miRNAs.]

Fig. 14 Various small RNAs and their amounts obtained from the miRNA analysis in the five different Liliaceae species.

Total Small RNAs Found and Contigs Analyzed

[Bar chart. Total small RNAs found: 172; total contigs analyzed: 27.]

Fig. 15 Total small RNAs identified from the total number of contigs that were analyzed from the miRNA analysis.


Mapping of Small RNA Sequences to TE Sequences Results

Small RNAs that were found from the assembly of TEs and small RNAs are listed in the table below (Table 6) and illustrated in the graphs (Figs 16-19) following this table.

Table 6 Small RNAs found that bound to TE domains. Note: Integrase/Trans insO are overlapping domains and RT_LTR RT from retro/RT are also overlapping domains. Lilium pardalinum HCH4U_ TE Length Seq TE? Integrase RT GAG Transposase RNAse Other contig (nt) LP320 4542 51 Yes 12 39 LP198 8978 34 Yes 2 int/trans 4 28 InsO LP193 9955 31 Yes 1 1 1 retro 28 gag protein LP673 3962 24 Yes 1 23 LP144 8109 20 Yes 1 1 2 retro 16 RT_L gag TR protein RT from retro/ RT LP591 5171 16 Yes 3 1 2 gag 2 Ty1 8 poly copia LP1483 4983 11 Yes 3 3 gag 5 poly LP1088 4101 10 Yes 1 int/trans 9 InsO LP633 3951 10 Yes 3, 1 trans assoc. 6 LP291 6751 7 Yes 1 Ty3 6 gypsy LP712 4335 5 Yes 1 retro 4 gag protein LP 346 4317 4 Yes 4 LP490 4210 2 Yes 1 retropepsin 1 LP149 8933 2 Yes 2 Prosartes smithii 4NEZN_ TE contig Length Seq TE? Integrase RT GAG Transposase RNAse Other (nt)

PS3 8054 51 Yes 1 trans assoc. 50


PS28 8472 33 Yes 3, 1 trans assoc. 29

PS1 6901 26 Yes 1 trans assoc. 25 PS6 14173 10 Yes 1 9 PS16 7321 8 Yes 1 2 5 PS15 11493 7 Yes 3 Ty3 4 gypsy PS11 5639 7 1 3 gag 3 poly PS26 15679 5 Yes 1 1 gag 3 poly PS14 10356 4 Yes 4 PS22 15846 4 Yes 1 1 gag 2 poly PS13 8343 3 Yes 3 PS43 8862 2 Yes 2 PS8 4475 1 Yes 1 PS54 7751 1 Yes 1 Clintonia andrewsiana 30XWK_ TE contig Length Seq TE? Integrase RT GAG Transposase RNAse Other (nt) CA112 5310 58 Yes 3 7 3 gag 2 Ty1 42 poly, 1 copia gag pre- integ domain CA102 4430 39 Yes 1 4 3 gag 1 Ty1 30 pre- copia integ domain CA93 4154 26 Yes 1 Mule trans 25 CA198 3934 21 Yes 21 CA104 5107 20 Yes 2 4 14 CA154 7868 16 Yes 3 2 Ty1 11 copia CA175 4426 12 Yes 2 3 5 RT/R T_LT R RT from retro, 2 retrop epsin CA90 3681 10 Yes 1 gag 9 poly CA86 6799 9 Yes 1 1 retro 7 RT/R gag T_LT protein R RT from


retro CA148 3482 6 Yes 1 5 zinc- bind in RT CA325 2021 6 Yes 6 CA180 1130 3 Yes 2 gag 1 poly CA396 1594 13 Yes 5 helicase 8 CA329 1777 1 Yes 1 Fritillaria affinis AHLTB_ TE contig Length Seq TE? Integrase RT GAG Transposase RNAse Other (nt) FA115 2166 8 Yes 4 4 FA102 4946 4 Yes 2 1 gag 1 poly FA163 4488 1 Yes 1 0 FA117 4235 1 Yes 1 retropepsin 0 FA145 3272 1 Yes 1 FA99 2078 4 Yes 1 3 FA527 2148 2 Yes 1 (133 1 nts) FA1386 2045 1 Yes 1 FA1597 1129 6 Yes 4 Ty1 2 copia FA1922 1497 3 Yes 3 FA142 1786 0 Yes 0 FA277 3133 0 Yes 0 FA580 2394 0 Yes 0 FA260 1659 0 Yes 0 Scoliopus bigelovii SDSSL_ TE contig Length Seq TE Integrase RT GAG Transposase RNAse Other (nt) ? SB211 8942 24 Yes 1 3 1 retro 19 RT_L gag TR RT protein from retro/R T SB363 9139 15 Yes 1 RNAse 14 HI

SB249 5809 15 Yes 2 retro 13 gag protein SB585 5825 12 Yes 1 retropepsin 11 SB62 8375 11 Yes 11 SB438 4966 6 Yes 6


SB126 4571 6 Yes 2 4 RT_L TR RT from retro/ RT SB1 10312 32 Yes 1 retropepsin 1 RNAse 30 HI

SB26 7655 29 Yes 1 2 2 gag 1 Ty1 23 pre- copia integ domain SB17 6216 27 Yes 2 (1 base more 25 than the interval) SB671 6798 17 Yes 5 12 SB18 7427 15 Yes 15 SB116 8870 14 Yes 1 13 SB16 7233 13 Yes 13

Characterization of TE Domains and Small RNA binding

[Bar chart. x-axis: TE domains; y-axis: Number of small RNAs.]

Fig. 16 A characterization of the various TE domains analyzed and the amount of small RNAs that bind to each domain.


Small RNAs Binding to TE Demographic in Liliaceae

[Pie chart of the 16 TE domains: Integrase, RT, RT_LTR RT from retro/RT, zinc-bind RT, Gag, Gag Pre-Integ, Retro Gag, Retropepsin, Transposase, Transposase Assoc., Mule Transposase, Integ/Trans insO, RNAse HI, ty1 Copia, ty3 Gypsy, Helicase.]

Fig. 17 Percentage of small RNAs that bind to each of the 16 TE domains.

Small RNAs Binding to TE Domains vs. Other Areas

[Bar chart. y-axis: Number of small RNAs. TE domains: 166 (18.5%); Other Domains: 730 (81.5%); Total: 896 (100%).]

Fig. 18 Total amounts of small RNAs that bind to TE domains, other domains, and total number of small RNAs found.


TE Domain Demographic in Liliaceae

[Pie chart showing each TE domain as a share of all small RNAs found; the largest slice is Other (81%).]

Fig. 19 Percentage of small RNAs that bind to TE domains vs. other regions.

Screenshots of the TE and Small RNA Assembly from Lilium

A screenshot of the Lilium pardalinum small RNA and TE assembly is included to illustrate the alignment between the two. The top sequence is the “template” sequence, which is the TE sequence, and the sequences below it are the small RNAs that bind to that sequence (Fig. 20). RT is the domain found in all the retrotransposons, both LTR and non-LTR, and it had the greatest number of small RNAs binding to it according to the sequence analysis. Figure 20 also includes an illustration of the TE DNA sequence, the RT domain within the TE sequence, and how many small RNAs bind to that domain.

Lilium pardalinum

TE       Length (nt)   Binding region (bp)   Domain   Small RNAs
LP198    8978          6536-7297             RT       4

[Schematic: TE DNA (8.9 Kb) with the RT domain at 6536-7297 and 4 small RNAs binding to it.]

Fig. 20 Conserved domain results, small RNAs and TE alignments, and small RNAs and TE illustration.


DISCUSSION

The objective of this thesis project was to sequence small RNAs from species of the Liliaceae family using high-throughput sequencing on the Ion Torrent PGM and to analyze the obtained sequences using the Next-Generation Sequencing analysis software SeqMan NGen from DNA Star. The sequence analysis revealed small RNA sequences (mainly siRNAs and miRNAs) that were successfully sequenced (Table 6). The software used to assemble TE sequences with small RNAs showed areas where the small RNAs bound to the TEs.

Sequencing results showed that small RNAs bind to different areas of the TE sequences. The areas of most importance are the conserved domains of the TE genes. LTR retrotransposons primarily contain the GAG and POL protein-coding open reading frames (ORFs), which include GAG, RT, Integrase, and RNAse. Non-LTR retrotransposons can also use RT, while transposons use transposase and helitrons use helicase (Fig. 2). Small RNAs that bind to these domains have the ability to prevent and stop transposition of the TEs.

A total of 70 TE contigs were analyzed in this project, and 896 small RNAs were found among them. Out of the 896 small RNAs, only 166 (18.5%) bind to conserved domains, while the remaining 730 (81.5%) bind to other areas of the TEs (Fig. 18).
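As a quick arithmetic check of the split reported above, the following short Python sketch recomputes the two percentages from the counts given in the text. The variable names are illustrative only.

```python
total_small_rnas = 896
in_conserved_domains = 166
in_other_regions = total_small_rnas - in_conserved_domains  # 730

for label, count in (("TE domains", in_conserved_domains), ("Other", in_other_regions)):
    print(f"{label}: {count} ({100 * count / total_small_rnas:.1f}%)")
# prints:
# TE domains: 166 (18.5%)
# Other: 730 (81.5%)
```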

Interestingly, more small RNAs were found bound to other areas of the TEs than to the conserved domain areas. Those small RNAs could be binding to other regions of the TEs such as LTRs, TIRs, the 5’ UTR, the 3’ UTR, promoter sequences, other coding regions between the TE genes, and other domains of a TE sequence. The TEs sequenced for this project and analyzed so far showed that the majority of the TEs are indeed LTR retrotransposons, as characterized by the LTRs, GAG, PR, RT, Copia/Gypsy RNAses, and Integrase found within them according to the conserved domain search results.

The contig FA 702 is 869 nts long. It is not a TE (based on the conserved domain search), but it does have 4352 small RNAs binding to it, particularly at the beginning of the sequence. BLASTN searches mainly returned ribosomal RNA (rRNA) and chromosomal DNA of various plants, with roughly 90%+ matching identities to the contig and a top E value of 0. BLASTX (which searches protein databases using a translated nucleotide query) returned hypothetical proteins, primarily in Glycine max, with the top result having 78% matching identity and an E value of 1e-112 for a Glycine max hypothetical protein. A conserved domain search showed that this contig contains a conserved domain called a sensor super family, a putative sensor domain found at the N-terminus of proteins that functions in stimulus sensing. The domain sequence is at the 100-252 nts region of the contig. Small RNAs binding in this region could indicate inhibition of expression of that particular gene.

Figure 16 shows the different types of TE conserved domains found in the TEs analyzed and how many small RNAs were found for each domain. Figure 17 shows the number of small RNAs binding to each domain as a percentage of all small RNAs that bind to conserved domains. The domains with the most small RNAs are RT (35, 21%) and Transposase (30, 18%), with Integrase (22, 13%) and GAG (17, 10%) close behind. In addition to the main TE genes, small RNAs were discovered that bind to other TE genes, including Retropepsin, which is a retroviral protease; Transposase insO, a putative transposase; and Mule Transposase, a transposase encoded by mutator-like elements (Fig. 16). One Mule transposon was found in the CA93 contig, and a small RNA was found that bound to this conserved domain. Mu and Mule transposons were also found in Arabidopsis and maize, as described earlier in the paper.
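The percentages quoted for the four most frequently bound domains are consistent with a denominator of 166, the number of small RNAs binding to conserved domains, rather than the 896 small RNAs found in total. The following minimal Python sketch checks this using only the counts stated in the text; the variable names are illustrative.

```python
domain_counts = {"RT": 35, "Transposase": 30, "Integrase": 22, "GAG": 17}
DOMAIN_BOUND = 166   # small RNAs binding to conserved domains
ALL_SMALL_RNAS = 896  # all small RNAs found among the TE contigs

share_of_domain_bound = {
    name: round(100 * n / DOMAIN_BOUND) for name, n in domain_counts.items()
}
share_of_all = {
    name: round(100 * n / ALL_SMALL_RNAS) for name, n in domain_counts.items()
}

print(share_of_domain_bound)  # {'RT': 21, 'Transposase': 18, 'Integrase': 13, 'GAG': 10}
print(share_of_all["RT"])     # 4, i.e. RT as a share of all 896 small RNAs
```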

The majority of the small RNAs found in this analysis are between 19 and 22 nts long. Some are longer than 22 nts, but ones shorter than 19 nts were not observed in the analysis. There are siRNA sequences of 21/22 nts in length that bind perfectly to the TE template, and, according to published papers, these would be the small RNAs that function in cleaving mRNA and PTGS. The 21 nts siRNAs are also posited to establish DNA methylation. The sequence results yielded small RNAs that bind to TEs of more than one species (binding to a species other than that from which they were isolated). For example, in contig PS16, small RNAs from C. andrewsiana (30XWK_) and S. bigelovii (SDSSL_) were found in P. smithii TEs. Since TEs are conserved and abundant, this suggests that these TEs and the corresponding small RNAs are conserved too.
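A length profile of the kind summarized above can be tallied with a few lines of Python. This is an illustrative sketch only, not the procedure used in SeqMan NGen; the function name and the four toy reads are hypothetical.

```python
from collections import Counter


def length_profile(sequences, low=19, high=22):
    """Return (length counts, fraction of reads whose length lies in [low, high])."""
    lengths = Counter(len(s) for s in sequences)
    if not sequences:
        return lengths, 0.0
    in_range = sum(n for length, n in lengths.items() if low <= length <= high)
    return lengths, in_range / len(sequences)


# Toy reads (assumed): one 20-nt, two 21-nt, and one 24-nt sequence.
toy_reads = [
    "TGACAGAAGAGAGTGAGCAC",
    "TCGGACCAGGCTTCATTCCCC",
    "ACGT" * 5 + "A",
    "ACGT" * 6,
]
lengths, fraction = length_profile(toy_reads)
print(dict(sorted(lengths.items())), f"{fraction:.0%} within 19-22 nt")
```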

This project successfully met the objective of sequencing small RNAs in Liliaceae. The results revealed many small RNAs binding to TEs, which supports the theory that small RNAs are part of a mechanism for controlling TE proliferation. It has also been shown in this project that small RNAs are found in these five species of Liliaceae and that the binding of these small RNAs can potentially help curb TE proliferation and maintain genome size.


This project is similar to the paper by Becher et al. (2014), in which they sequenced pararetroviruses from Fritillaria imperialis and the small RNAs that bind to them. For this project, the small RNAs that were sequenced from Liliaceae bind to TEs from unknown proteins; they have yet to be mapped to chromosomes and characterized.

Further work would include more sequence analysis of the remaining TE contigs that are yet to be analyzed. Knock-down experiments, which test for function, can be done to see what the effect would be if these small RNAs were eliminated from the genome. A prediction would be that the genes that the miRNAs target would not be degraded. Additionally, bisulphite sequencing can be done to sequence methylated DNA resulting from RdDM. Methylated DNA is expected to be found from sequencing the TEs.


REFERENCES

Ambrožová, K., Mandáková, T., Bureš, P., Neumann, P., Leitch, I. J., Koblížková, A., …

& Lysak, M. A. (2010). Diverse Retrotransposon Families and an AT-rich

Satellite DNA Revealed in Giant Genomes of Fritillaria Lilies. Annals of Botany,

mcq235.

Baumberger, N., & Baulcombe, D. C. (2005). Arabidopsis ARGONAUTE1 is an RNA

Slicer that Selectively Recruits microRNAs and short interfering RNAs.

Proceedings of the National Academy of Sciences of the United States of America,

102(33), 11928-11933.

Beaulieu, J. M., Moles, A. T., Leitch, I. J., Bennett, M. D., Dickie, J. B., & Knight, C. A.

(2007). Correlated Evolution of Genome Size and Seed Mass. New Phytologist,

173(2), 422-437.

Becher, H., Ma, L., Kelly, L. J., Kovarik, A., Leitch, I. J., & Leitch, A. R. (2014).

Endogenous Pararetrovirus Sequences Associated with 24 nt Small RNAs at the

Centromeres of Fritillaria imperialis L.(Liliaceae), a Species with a Giant

Genome. The Plant Journal, 80(5), 823-833.

Bond, D. M., & Baulcombe, D. C. (2015). Epigenetic Transitions Leading to Heritable,

RNA-mediated de Novo Silencing in Arabidopsis thaliana. Proceedings of the

National Academy of Sciences, 112(3), 917-922.


Bocobza, S. E., & Aharoni, A. (2014). Small Molecules that Interact with RNA:

Riboswitch-based Gene Control and its Involvement in Metabolic Regulation in

Plants and Algae. The Plant Journal, 79(4), 693-703.

Chan, S. W. L., Zilberman, D., Xie, Z., Johansen, L. K., Carrington, J. C., & Jacobsen, S.

E. (2004). RNA Silencing Genes Control de Novo DNA Methylation. Science,

303(5662), 1336-1336.

Finnegan, D. J. (1989). Eukaryotic Transposable Elements and Genome Evolution.

Trends in Genetics, 5, 103-107.

Fire, A., Xu, S., Montgomery, M. K., Kostas, S. A., Driver, S. E., & Mello, C. C. (1998).

Potent and Specific Genetic Interference by Double-Stranded RNA in

Caenorhabditis elegans. Nature, 391(6669), 806-811.

Geley, S., & Müller, C. (2004). RNAi: Ancient Mechanism With a Promising Future.

Experimental Gerontology, 39(7), 985-998.

Girard, L., & Freeling, M. (1999). Regulatory Changes as a Consequence of Transposon

Insertion. Developmental Genetics, 25(4), 291-296.

Gong, L., Masonbrink, R. E., Grover, C. E., Renny-Byfield, S., & Wendel, J. F. (2015).

A Cluster of Recently Inserted Transposable Elements Associated with siRNAs in

Gossypium raimondii. The Plant Genome, 8(2).

Großhans, H., & Filipowicz, W. (2008). Molecular Biology: The Expanding World of

Small RNAs. Nature, 451(7177), 414-416.


Grover, C. E., & Wendel, J. F. (2010). Recent Insights into Mechanisms of Genome Size

Change in Plants. Journal of Botany, 2010.

Haag, J. R., & Pikaard, C. S. (2011). Multisubunit RNA polymerases IV and V:

Purveyors of Non-coding RNA for Plant Gene Silencing. Nature Reviews

Molecular Cell Biology, 12(8), 483-492.

Hammond, S. M., Bernstein, E., Beach, D., & Hannon, G. J. (2000). An RNA-directed

Nuclease Mediates Post-transcriptional Gene Silencing in Drosophila cells.

Nature, 404(6775), 293-296.

Hannon, G. J. (2002). RNA Interference. Nature, 418(6894), 244-251.

Havecker, E. R., Gao, X., & Voytas, D. F. (2004). The diversity of LTR

Retrotransposons. Genome Biology, 5(6), 1.

Henderson, I. R., & Jacobsen, S. E. (2007). Epigenetic Inheritance in Plants. Nature,

447(7143), 418-424.

Jiang, N., Bao, Z., Zhang, X., Eddy, S. R., & Wessler, S. R. (2004). Pack-MULE

Transposable Elements Mediate Gene Evolution in Plants. Nature, 431(7008),

569-573.

Johnson, L. M., Du, J., Hale, C. J., Bischof, S., Feng, S., Chodavarapu, R. K., ... & Patel,

D. J. (2014). SRA-and SET-domain-containing Proteins link RNA polymerase V

Occupancy to DNA methylation. Nature, 507(7490), 124-128.


Kejnovsky, E., Hawkins, J. S., & Feschotte, C. (2012). Plant Transposable Elements:

Biology and Evolution. Plant Genome Diversity Volume 1 (pp. 17-34). Springer

Vienna.

Kelly, L. J., & Leitch, I. J. (2011). Exploring Giant Plant Genomes with Next-Generation

Sequencing Technology. Chromosome Research, 19(7), 939-953.

Kim, M. Y., & Zilberman, D. (2014). DNA Methylation as a System of Plant Genomic

Immunity. Trends in Plant Science, 19(5), 320-326.

Lam, J. K., Chow, M. Y., Zhang, Y., & Leung, S. W. (2015). siRNA versus miRNA as

Therapeutics for Gene Silencing. Molecular Therapy—Nucleic Acids, 4(9), e252.

Lee, Y., Ahn, C., Han, J., Choi, H., Kim, J., Yim, J., ... & Kim, V. N. (2003). The

Nuclear RNase III Drosha Initiates microRNA Processing. Nature, 425(6956),

415-419.

Leitch, I. J., Beaulieu, J. M., Cheung, K., Hanson, L., Lysak, M. A., & Fay, M. F. (2007).

Punctuated Genome Size Evolution in Liliaceae. Journal of Evolutionary Biology,

20(6), 2296-2308.

Lim, L. P., Lau, N. C., Weinstein, E. G., Abdelhakim, A., Yekta, S., Rhoades, M. W., ...

& Bartel, D. P. (2003). The microRNAs of Caenorhabditis elegans. Genes &

Development, 17(8), 991-1008.

Lister, R., O'Malley, R. C., Tonti-Filippini, J., Gregory, B. D., , C. C., Millar, A. H.,

& Ecker, J. R. (2008). Highly Integrated Single-base Resolution Maps of the

Epigenome in Arabidopsis. Cell, 133(3), 523-536.


Matsunaga, W., Ohama, N., Tanabe, N., Masuta, Y., Masuda, S., Mitani, N., ... & Ito, H.

(2015). A Small RNA Mediated Regulation of a Stress-activated Retrotransposon

and the Tissue Specific Transposition During the Reproductive Period in

Arabidopsis. Frontiers in Plant Science, 6, 48.

Mehrotra, S., & Goyal, V. (2014). Repetitive Sequences in Plant Nuclear DNA: Types,

Distribution, Evolution, and Function. Genomics, Proteomics & Bioinformatics,

12(4), 164-171.

Miura, A., Yonebayashi, S., Watanabe, K., Toyama, T., Shimada, H., & Kakutani, T.

(2001). Mobilization of Transposons by a Mutation Abolishing Full DNA

Methylation in Arabidopsis. Nature, 411(6834), 212-214.

Napoli, C., Lemieux, C., & Jorgensen, R. (1990). Introduction of a Chimeric Chalcone

Synthase Gene into Petunia Results in Reversible Co-suppression of Homologous

Genes in Trans. The Plant Cell, 2(4), 279-289.

Novina, C. D., & Sharp, P. A. (2004). The RNAi Revolution. Nature, 430(6996), 161-164.

Nuthikattu, S., McCue, A. D., Panda, K., Fultz, D., DeFraia, C., Thomas, E. N., &

Slotkin, R. K. (2013). The Initiation of Epigenetic Silencing of Active

Transposable Elements is Triggered by RDR6 and 21-22 Nucleotide Small

Interfering RNAs. Plant Physiology, 162(1), 116-131.

Ogiwara, I., Miya, M., Ohshima, K., & Okada, N. (2002). V-SINEs: A New Superfamily

of Vertebrate SINEs that are Widespread in Vertebrate Genomes and Retain a

Strongly Conserved Segment within each Repetitive Unit. Genome Research,

12(2), 316-324.

Pennisi, E. (2007). Jumping Genes Hop into the Evolutionary Limelight. Science,

317(5840), 894-895.

Pray, L. A. (2008). Transposons: The Jumping Genes. Nature Education, 1(1), 204.

Ravindran, S. (2012). Barbara McClintock and the Discovery of Jumping Genes.

Proceedings of the National Academy of Sciences, 109(50), 20198-20199.

Slotkin, R. K., & Martienssen, R. (2007). Transposable Elements and the Epigenetic

Regulation of the Genome. Nature Reviews Genetics, 8(4), 272-285.

Wang, Y., Xu, M., Deng, D., & Bian, Y. (2008). Maize Mutator Transposon. Frontiers of

Agriculture in China, 2(4), 396-403.

Wierzbicki, A. T., Ream, T. S., Haag, J. R., & Pikaard, C. S. (2009). RNA Polymerase V

Transcription Guides ARGONAUTE4 to Chromatin. Nature Genetics, 41(5),

630-634.

APPENDICES

Appendix A

Small RNA Bioanalyzer Results

Prosartes smithii - Small RNAs were obtained primarily from the 20-30 nt and 40-80 nt regions.

Clintonia andrewsiana - Small RNAs were obtained primarily from the 20-30 nt and 40-80 nt regions.

Fritillaria affinis - Small RNAs were obtained primarily from the 4-150 nt region.

Scoliopus bigelovii - Small RNAs were obtained primarily from the 20-100 nt region.

Appendix B

cDNA Bioanalyzer Results

Prosartes smithii - Concentrations of the paired-end adapter peaks: 910 pmol and 1300 pmol.

Clintonia andrewsiana - Concentration of the paired-end adapter peak: 366 pmol.

Fritillaria affinis - Concentrations of the paired-end adapter peaks: 8200 pmol and 1670 pmol.

Scoliopus bigelovii - Concentration of the paired-end adapter peak: 1728 pmol.

Appendix C

SeqMan NGen Files

Lilium pardalinum
Small RNA file (fastq format): R_2014_09_26_13_28_02_user_MOR-107_Auto_user_MOR-107_137.fastq
TE file: Lilium pardalinum TE file (fasta format)
Assembled file of TE and small RNA assembly: Contigs LPTEswithAllsmallRNAs

Prosartes smithii
Small RNA file (fastq format): R_2014_12_30_12_55_21_user_MOR-114_Auto_user_MOR-114_144.fastq
TE file: Prosartes smithii TE file (fasta format)
Assembled file of TE and small RNA assembly: Contigs PSTEswithAllsmallRNAs

Clintonia andrewsiana
Small RNA file (fastq format): R_2014_12_30_15_31_03_user_MOR-115_Auto_user_MOR-115_145.fastq
TE file: Clintonia andrewsiana TE file (fasta format)
Assembled file of TE and small RNA assembly: Contigs CATEswithAllsmallRNAs

Fritillaria affinis
Small RNA file (fastq format): R_2015_04_16_12_15_05_user_MOR-131_Auto_user_MOR-131_161.fastq
TE file: Fritillaria affinis TE file (fasta format)
Assembled file of TE and small RNA assembly: Contigs FATEswithAllsmallRNAs

Scoliopus bigelovii
Small RNA file (fastq format): R_2015_05_28_13_27_22_user_MOR-139.fastq
TE file: Scoliopus bigelovii TE file (fasta format)
Assembled file of TE and small RNA assembly: Contigs SBTEswithAllsmallRNAs
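As an illustration of how the small RNA fastq files listed in this appendix could be inspected before assembly, the following minimal Python sketch tallies read lengths within a size window. It is not part of the SeqMan NGen workflow used in this project. The function names read_fastq and length_histogram are hypothetical, the 20-30 nt default window is an assumption borrowed from the size range noted in Appendix A, and the parser assumes plain four-line fastq records.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records.

    Assumes each record is exactly four lines (header, sequence, '+', quality).
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header line, got: {header!r}")
        seq = next(it, None)
        plus = next(it, None)  # '+' separator line, content ignored
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        yield header[1:], seq.strip(), qual.strip()


def length_histogram(
    records: Iterable[Tuple[str, str, str]],
    min_len: int = 20,  # assumed lower bound of the small RNA window
    max_len: int = 30,  # assumed upper bound of the small RNA window
) -> Counter:
    """Count reads per length, keeping only lengths in [min_len, max_len]."""
    counts: Counter = Counter()
    for _read_id, seq, _qual in records:
        if min_len <= len(seq) <= max_len:
            counts[len(seq)] += 1
    return counts
```

To apply the sketch to one of the files above, open the file in text mode and pass the handle directly, for example length_histogram(read_fastq(handle)); the returned Counter maps each read length to the number of reads of that length.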

Appendix D

Sequencing Run Reports

Lilium pardalinum

Prosartes smithii

Clintonia andrewsiana

Fritillaria affinis

Scoliopus bigelovii
