
MIAMI UNIVERSITY

The Graduate School

Certificate for Approving the Dissertation

We hereby approve the Dissertation

of

Zhixin Zhao

Candidate for the Degree:

Doctor of Philosophy

______Director Dr. Chun Liang

______Co-director Dr. Qingshun Quinn Li

______Reader Dr. Richard Moore

______Reader Dr. Susan Barnum

______Graduate School Representative Dr. John Karro

ABSTRACT

GENOME-WIDE ANALYSIS OF TRANSCRIPTOME DYNAMICS IN PLANTS AND ALGAE

by Zhixin Zhao

Genome-wide transcriptome analysis is an active research field. The first chapter gives a mini review of polyadenylation in plants and algae. The second chapter describes newly detected putative poly(A) signals (UAA and UAAA) in seven species. Based on this systematic analysis of the AAUAAA signal across a wide range of organisms, our results suggest that the first two bases (i.e., NN in NNUAAA) are likely degenerate, whereas UAAA appears to be the core part of the motif. Combined with other published results, our comparative poly(A) signal study suggests that AAUAAA may be derived from UAA through an intermediate putative poly(A) signal, UAAA, following the pathway UAA → UAAA → AAUAAA. The third chapter analyzes in depth the poly(A) signals and alternative polyadenylation (APA) in the green alga Chlamydomonas reinhardtii using different sequencing data (i.e., ESTs, 454 and Illumina). In comparison with previous collections, more new poly(A) sites are found in coding sequences (CDS), introns and intergenic regions by deep sequencing. The prevalence of different poly(A) signals in CDS and 3'-UTRs might indicate different mechanisms of polyadenylation. Our data analysis suggests that the extent of APA is about 68% in C. reinhardtii. Using Gene Ontology (GO) analysis, we found that most of the APA genes are involved in protein synthesis and in hydrolase and ligase activities. Moreover, intronic poly(A) sites are more abundant in constitutively spliced introns than in retained introns, suggesting interplay between polyadenylation and splicing. A mini review of tandem repeats (TRs) is presented in Chapter 4. Chapter 5 characterizes TR motif features and explores their distributions among 31 plant and green algal species in Phytozome v8.0. Genome size has no significantly discernible relationship with TR density, and the highest TR densities are detected in 5'-UTRs in land plants and in introns in green algae. GO annotation in two green algae reveals that the genes with TRs in introns are significantly involved in transcriptional and translational processing. Our study shows that TRs display a non-random distribution in both intragenic and intergenic regions, suggesting that they have potential roles in transcriptional or translational regulation in plants and green algae.

GENOME-WIDE ANALYSIS OF TRANSCRIPTOME DYNAMICS IN PLANTS AND ALGAE

A Dissertation

Submitted to the Faculty of

Miami University in partial

fulfillment of the requirements

for the degree of

Doctor of Philosophy

Department of Botany

by

Zhixin Zhao

Miami University

Oxford, Ohio

2013

Director: Dr. Chun Liang

Co-director: Dr. Qingshun Quinn Li

Copyright

Copyright © of Chapter 2 and Chapter 3 of this dissertation belongs to the authors of the corresponding papers. The two papers have been submitted to journals.

Copyright © of Chapter 5 of this dissertation: this chapter has been accepted and will be published in January 2014 by G3: Genes, Genomes, Genetics, the Genetics Society of America.

Table of Contents

List of Tables

List of Figures

Acknowledgements

Chapter 1 Introduction of mRNA polyadenylation

Chapter 2 Genome-wide comparative analysis of polyadenylation signals in eukaryotes suggest a possible origin of the AAUAAA signal

Abstract

Background

Results

Discussion

Materials and methods

Tables

Figures

Supplementary data

Acknowledgements

Chapter 3 Bioinformatics analysis of alternative polyadenylation from multiple sequencing platforms in green alga Chlamydomonas reinhardtii

Abstract

Background

Results

Discussion

Materials and methods

Tables

Figures

Supplementary data

Acknowledgements

Chapter 4 Introduction of Tandem Repeat

Chapter 5 Genome-wide analysis of tandem repeats in plants and green algae

Abstract

Background

Results

Discussion

Materials and methods

Tables

Figures

Supplementary data

Acknowledgements

Literature Cited

List of Tables

Table 2.1: The conserved poly(A) signals in the NUE region of the seven species

Table 2.2: The frequencies of UGUAA and AAUAAA and their single nucleotide…

Table 3.1: The data sources and numbers of unique poly(A) sites from three sequencing platforms

Table 3.2: Number of poly(A) sites in different gene types

Table 3.3: Number of poly(A) site clusters (PAC) in different categories

Table 3.4: APA extent variation among the four datasets

Table 3.5: The most significant GO functions in high-quality APA genes with 5 or more PACs

Table 5.1: The top TR motifs and GC contents in different regions

Table 5.2: The most significant GO functions of genes with TRs in introns in C. reinhardtii

List of Figures

Figure 2.1: The single nucleotide profiles around poly(A) sites for the 7 unstudied species

Figure 2.2: The phylogenetic relations among the 11 species investigated in our study

Figure 2.3: The combined logos of mononucleotide variants from the UGUAA group (top) and two AAUAAA groups (bottom)

Figure 2.4: The overall frequencies of UGUAA and AAUAAA in a 2-D coordinate in the 11 species

Figure 3.1: The distribution of poly(A) sites in the genic and intergenic regions in different datasets

Figure 3.2: The single nucleotide profiles from different datasets

Figure 3.3: The length difference of intron and CDS with and without poly(A) sites

Figure 3.4: The single nucleotide profiles (-50 to +25) and signals in NUE (-28 to -5) from different categories of all the combined poly(A) sites

Figure 3.5: The genes with different poly(A) site clusters (PACs) among different datasets

Figure 5.1: The phylogenetic tree of 31 species shown in Phytozome v8.0 (http://www.phytozome.net/)

Figure 5.2: The schematic intragenic and intergenic regions used for TR analysis

Figure 5.3: Genome size versus genomic TR density in 31 land plants and green algae

Figure 5.4: The relative density of TRs in different intragenic elements and the intergenic regions

Figure 5.5: The relative distribution position of TRs in the intron and intergenic regions

Figure 5.6: The percentage of TRs in different intragenic and intergenic regions

Acknowledgements

It is a great pleasure to express my heartfelt thanks to my advisor, Dr. Chun Liang, and my co-advisor, Dr. Qingshun Quinn Li, whose encouragement, guidance, and support helped me conduct the three research projects in a meaningful manner. I would also like to thank them for all the time they spent reading and revising my dissertation. I would like to extend my gratitude to my committee members, Dr. Richard Moore, Dr. Susan Barnum and Dr. John Karro, for their enthusiasm and insightful comments. I would also like to thank my lab mates: Praveen Raj Kumar for providing splicing data for Chlamydomonas reinhardtii and helping with the dissertation writing, Cheng Guo and Sreeskandarajan Sutharzan for their support in statistical analysis, and Pei Li and Ming Dong for their great help in the research. I thank the Department of Botany for funding this research through academic challenge grants, and Miami University's DUOS for funding support of the research presented in this dissertation. Last but not least, I would like to thank my mother for making all this possible, and my girlfriend for her great support of my study at Miami.

Chapter 1 Introduction of mRNA polyadenylation

Messenger RNA (mRNA) 3'-end formation, including poly(A) site cleavage and polyadenylation, is a crucial and prevalent step in mRNA post-transcriptional processing in eukaryotes, except for replication-dependent histone mRNAs in metazoa (Dávila López and Samuelsson, 2008). Polyadenylation is essential for eukaryotic gene expression because of its biological functions, including protection of mature mRNA from unregulated degradation and recognition of mature mRNA by the cytoplasmic export machinery and the translational apparatus (Rothnie, 1996; Zhao et al., 1999). Moreover, the formation of 3'-ends is indispensable for transcription termination by RNA polymerase II, which is coupled to cleavage factors that recognize and bind polyadenylation signals on pre-mRNAs (Birse et al., 1998). In recent years, it has been increasingly recognized that polyadenylation can be an important regulator of eukaryotic gene expression (Danckwardt et al., 2008; Millevoi and Vagner, 2010). Two major components are involved in the polyadenylation process: the polyadenylation [poly(A)] signals and the polyadenylation protein factors. Poly(A) signals are cis-elements surrounding the cleavage sites [or poly(A) sites] that are recognized by the polyadenylation protein complex and direct both the cleavage and polyadenylation reactions.
Generally, four different signal elements are involved in the cleavage and polyadenylation process in eukaryotes: the far upstream element (FUE), so named because it is located furthest from the cleavage site and generally regarded as a less conserved U-rich element; the near upstream element (NUE), located ~10 to 35 nucleotides (nt) upstream of the cleavage site and corresponding to the consensus motif AAUAAA in most eukaryotes; the cleavage site, which is usually dominated by adenine (A, generally >50%), together with its immediately surrounding motifs, called the cleavage element (CE); and the poorly conserved U/GU-rich downstream element (DSE), located downstream of the cleavage site and commonly found in animals only (Li and Hunt, 1997; Tian and Graber, 2011; Venkataraman et al., 2005; Xing and Li, 2010). Among these elements, the NUE appears to be the strongest and most conserved one for both cleavage and polyadenylation reactions, because the cleavage and polyadenylation specificity factor (CPSF) has been shown experimentally to bind to this region, especially to the AAUAAA signal. Moreover, other polyadenylation factors, such as cleavage stimulation factor (CstF) and cleavage factor Im (CFIm), which recognize and bind motifs in the DSE and the UGUA motif in the FUE respectively, join the CPSF complex to form the 3'-end processing complex, which cleaves the downstream sequence and adds a poly(A) tail at the cleavage/poly(A) site (Hunt et al., 2008; Shi, 2012; Tian and Graber, 2011; Zhao et al., 1999).

In animals and plants, the highly conserved and frequent poly(A) signals are AAUAAA and its one- or two-nucleotide variants (e.g., AUUAAA) located in the NUE region. For example, 82.5% of 18,277 mRNA transcripts possess the canonical hexamer AAUAAA or one of its 11 single-nucleotide variants in humans (Kamasawa and Horiuchi, 2008; Tian et al., 2005; Zarudnaya et al., 2003). More recently, AAUAAA or its one- or two-nucleotide variants were detected in the 3'-UTRs of 87% of mRNA transcript isoforms in Caenorhabditis elegans (Mangone et al., 2010). Although AAUAAA does not occur with high frequency in plants, this hexamer is still ranked at the top of the NUE signal list, and is detected in ~10% of transcripts in Arabidopsis (Loke et al., 2005) and ~7% in rice (Shen et al., 2008a). Graber et al. (1999a) also found that the usage of the supposedly universal and conserved AAUAAA varies widely among six eukaryotic species (yeast, rice, Arabidopsis, fruit fly, mouse and human), and that this hexamer is especially weak in plants and yeast. It has been reported that the plant polyadenylation complex is assembled from a constant core and many associated peripheral subunits, which may explain the highly diversified and degenerate poly(A) signals in plants (Hunt et al., 2012a). Using direct RNA sequencing (DRS) technology, abundant previously unannotated poly(A) sites were revealed in human and yeast, and TTTTTTTTT and AAWAAA (W=A/T) were identified as novel motifs upstream of poly(A) sites (Ozsolak et al., 2010). These results are questionable, however, because the DRS process lacks a mechanism for preventing internal priming within protein-coding regions (Sherstnev et al., 2012).

Beyond AAUAAA, it has been suggested that UGUAA, a poly(A) signal found in the NUE region, plays vital functions in alpha 2- and beta 2-tubulin-encoding genes in the green algae Volvox carteri and Chlamydomonas reinhardtii (Mages et al., 1995). This pentamer is found predominantly in all Chlorophytes except Pyramimonas parkeae.
For example, based on a large cDNA/EST dataset in C. reinhardtii, 49.8% of 10,508 sequences contain the UGUAA signal within 50 nt upstream of the cleavage site (Wodniok et al., 2007). Shen et al. (2008b) also reported that UGUAA is the most significant and dominant signal (52% of 16,952 sequences) in the NUE region in C. reinhardtii. Furthermore, UGUAA appears to be the most frequent motif (31.8%) within 40 nt upstream of poly(A) sites in the chrysophycean alga Ochromonas danica (Terauchi et al., 2010). Interestingly, some experiments demonstrate that UGUA is the specific motif recognized and bound by mammalian cleavage factor I upstream of the AAUAAA signal in human mRNA 3'-end processing (Yang et al., 2011a, 2011b). However, UGUAA is not a strong poly(A) signal in human in comparison to AAUAAA (Shen et al., 2008b). Moreover, the tetramer UAAA is found to be a poly(A) signal in mRNAs of the parasitic protozoan Trichomonas vaginalis, suggesting a possible cooperation between the translation stop codon (UAA) and signaling for mRNA 3'-end processing (Espinosa et al., 2002; Fuentes et al., 2012).

In eukaryotes, many genes have more than one poly(A) site, and those sites generate multiple mRNA isoforms with different lengths and different 3'-ends, which may change the protein-coding potential of the mRNA (Hunt, 2012). This phenomenon is called alternative polyadenylation (APA), which has been shown to play crucial roles in gene expression and RNA regulation through its potential to form multiple protein isoforms from the same gene (Hunt, 2012; Lutz, 2008). Based on Sanger-sequenced cDNA/EST data, the extent of APA in C. reinhardtii and rice was shown to be 33% (Shen et al., 2008b) and 50% (Shen et al., 2008a), respectively. However, the estimated extent of APA is strongly influenced by sequencing depth and the accuracy of genome annotation. Deep sequencing has demonstrated that APA is more universal and widespread than previously reported (Shen et al., 2011; Shi, 2012; Wu et al., 2011). For example, the APA extent in rice is shown to be ~47% and ~82% when supported by massively parallel signature sequencing data and Illumina sequencing-by-synthesis data, respectively (Shen et al., 2011).

Besides producing multiple coding mRNAs, APA is also involved in RNA-mediated regulation and influences the expression of non-coding RNAs (Hunt, 2012). RNA-mediated regulation can arise from the generation of short or long transcripts, which can inhibit or enhance the activities of transcription and/or translation. Furthermore, APA isoforms can affect mRNA localization, protein stability and/or translational efficiency (Shi, 2012). At the same time, non-coding RNAs (e.g., siRNAs and miRNAs) can affect the choice of poly(A) sites in APA genes (Hunt, 2012). Moreover, APA has been reported in the regulation of non-coding RNAs that causes RNA-mediated chromatin silencing of the Arabidopsis FLC gene (Liu et al., 2010). In addition to various poly(A) signals, multiple forms of polyadenylation factors have been found to affect APA events in plants. For instance, poly(A) site choice is significantly altered in a mutant (oxt6) of the gene encoding the 30-kD subunit of CPSF (CPSF30), and the poly(A) signals detected in oxt6 are different from the canonical plant poly(A) signals (e.g., AAUAAA) (Thomas et al., 2012).
Furthermore, more polyadenylation factors are detected in plants going from "lower" to "higher" along the evolutionary trend (Hunt et al., 2012a). Similarly, enhanced expression of polyadenylation factors has been detected in APA genes in research on human colorectal cancer development (Morris et al., 2012). Alternative mRNA processing, particularly APA, leads to variation in transcript isoform abundance and could be used as a powerful molecular biomarker in cancer research and diagnosis (Singh et al., 2009).

The precise mechanisms of polyadenylation and alternative polyadenylation in plants, including how poly(A) signals are selected to function with protein factors and changed in , are still not clear. Alternative polyadenylation is not the only process that produces multiple isoforms from the same gene: alternative splicing also generates different transcripts, and polyadenylation has been linked with splicing in intronic poly(A) site recognition, which can influence the efficiency of splicing (Hunt, 2012). Some examples of interactions between splicing and polyadenylation indicate that splicing and polyadenylation factors seem to have evolved together in regulation, and that splicing factors could influence how or when mRNA precursors are polyadenylated (Zhao et al., 1999). It has been demonstrated that alternative 3'-UTR length is affected by the thiamin pyrophosphate riboswitch, and that metabolite-dependent riboswitch RNA folding can alter both alternative splicing and alternative 3'-end processing (Wachter et al., 2007). Introns with poly(A) sites have been shown to have weaker 5' splice sites and larger sizes than those without poly(A) sites (Tian et al., 2007). Similar results for polyadenylation and splicing in introns have also been found in Arabidopsis (Wu et al., 2011). Moreover, intronic poly(A) signals (poly(A) sites located in intron regions) reciprocally regulate splicing and polyadenylation and control sFlt1 (shorter Fms-like tyrosine kinase-1) gene expression (Thomas et al., 2007, 2010). APA and alternative splicing (AS) are strongly correlated across human tissue-specific transcripts, and some regulatory motifs and factors are involved in both polyadenylation and splicing in tissue-level regulation (Wang et al., 2008). In contrast with APA, which is universal and prevalent in eukaryotes, AS is rarely detected in yeast although it is universal in humans; it was therefore proposed that APA may be a much more ancient event than AS in eukaryotic evolution (Shi, 2012).

Previously, most polyadenylation data were generated through mRNA/cDNA sequencing. More powerful machines and methods have since been developed with the advance of sequencing technology. The earliest sequencing technology, the chain-termination method, also known as Sanger sequencing and developed by Sanger and his coworkers in the 1970s, was the first dominant sequencing technology owing to its relatively low cost and higher reliability (Sanger and Coulson, 1975; Sanger et al., 1977). Driven by the need to decipher the human genome, capillary electrophoresis technology (e.g., the 3730XL capillary sequencer from Applied Biosystems) was developed and integrated with the chain-termination method for high-throughput DNA sequencing in the late 1980s.
Currently, ultra-high-throughput sequencing technologies, often referred to as Next-Generation Sequencing (NGS), are rapidly replacing Sanger sequencing because they greatly increase sequencing depth and coverage at a much lower cost, and such comprehensive deep sequencing will boost the global analysis of APA events (Mardis, 2008; Schuster, 2008; Shi, 2012). Illumina sequencing, based on sequencing-by-synthesis (SBS) technology, is one of the best-known NGS technologies, and Roche/454 FLX pyrosequencing, developed by 454 Life Sciences and Roche Applied Science in 2005, also uses sequencing-by-synthesis technology (Margulies et al., 2005). Direct RNA Sequencing (DRS) (Ozsolak et al., 2009) technologies provide deeper sequencing of RNAs, even of the whole transcriptome. DRS still has some drawbacks, however, such as a high error rate, short reads (~50 bp) and low yield (Ozsolak and Milos, 2011).

Although the two conserved poly(A) signals UGUAA and AAUAAA have been computationally detected and experimentally verified in some cases, their relationships and evolutionary change patterns have not yet been systematically studied. The study in the second chapter therefore aims to explore the change of poly(A) signals among 11 eukaryotes spanning a wide range of the evolutionary scale, including 7 species whose poly(A) signals have not been investigated to date. This comparative study provides valuable insights into the origin and evolution of poly(A) signals, and extends our understanding of the molecular, biological and evolutionary mechanisms regulating mRNA polyadenylation processes in eukaryotes.

The canonical poly(A) signal UGUAA has been found and experimentally verified in C. reinhardtii, yet its variations in different intragenic regions (5'-UTR, 3'-UTR, CDS and intron) and in the intergenic region have not so far been systematically studied and compared across different sequencing platforms. To this end, in the third chapter we set out to explore the variations of the poly(A) signals and of the alternative polyadenylation extent obtained from three different sequencing platforms (Sanger, 454 and Illumina), and to study the relationship between polyadenylation and splicing at intronic poly(A) sites. This comprehensive study will deepen our understanding of poly(A) signal variations in different parts of genes in C. reinhardtii, and give detailed views of the three sequencing technologies.
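
The APA extent referred to above can be read as the percentage of genes that carry more than one poly(A) site or site cluster. The following is a minimal Python sketch of that calculation, not the procedure used in this dissertation: the function names `cluster_sites` and `apa_extent`, the input layout (a dictionary mapping gene identifiers to poly(A) site coordinates) and the `max_gap` clustering distance are all assumptions made for illustration.

```python
def cluster_sites(positions, max_gap):
    """Group poly(A) site coordinates into clusters.

    Sites are sorted, and a site joins the current cluster when it lies
    within `max_gap` nt of the previous site (hypothetical rule; the
    clustering distance is an assumption of this sketch).
    """
    clusters = []
    for pos in sorted(set(positions)):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def apa_extent(sites_by_gene, max_gap=0, min_clusters=2):
    """Percentage of genes with at least `min_clusters` poly(A) site clusters.

    Only genes with at least one detected poly(A) site are counted in the
    denominator (an assumption of this sketch).
    """
    genes = [g for g, sites in sites_by_gene.items() if sites]
    if not genes:
        return 0.0
    apa_genes = sum(
        1 for g in genes
        if len(cluster_sites(sites_by_gene[g], max_gap)) >= min_clusters
    )
    return 100.0 * apa_genes / len(genes)
```

With `max_gap=0` every distinct coordinate counts as its own site; a larger value merges neighbouring sites, which lowers the estimated extent. This is one reason why estimates from different datasets are only comparable when the same rule is applied to each.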

Chapter 2 Genome-wide comparative analysis of polyadenylation signals in eukaryotes suggest a possible origin of the AAUAAA signal

Abstract

Background

Messenger RNA polyadenylation adds a poly(A) tail at the 3'-end of pre-mRNA to complete the creation of mature mRNA in eukaryotes. It plays crucial molecular functions, including prevention of unregulated mRNA degradation and recognition by the protein apparatuses for cytoplasmic export and translational initiation. The addition of poly(A) tails requires cis-elements on the pre-mRNAs. In eukaryotes, two verified and conserved polyadenylation signals have mainly been reported so far: the pentamer UGUAA and the hexamer AAUAAA. To understand this phenomenon, we characterized the profiles of seven species whose poly(A) signals were not previously known, and compared these signals with those of four other model organisms.

Results

Among the seven species, the ciliate does not show any conserved poly(A) signal. A triplet (UAA) and two tetramers (UAAA and GUAA) were found to be dominant in diatoms and red alga, respectively, in the Near Upstream Element (NUE) located about 10 to 35 nucleotides upstream of the poly(A) sites. The green alga Ostreococcus lucimarinus uses UGUAA as its major poly(A) signal in the NUE, the same profile as in other green algae. Spikemoss and moss use the conserved AAUAAA signal, but at low frequency (~8%), similar to other land plants.

Conclusions

Based on this systematic analysis of the variants of the AAUAAA signal from a wide range of organisms, our results suggest that the first two bases (i.e., NN in NNUAAA) are likely degenerate, whereas UAAA appears to be the core part of the motif. Combined with other published results, our comparative poly(A) signal study suggests that AAUAAA may be derived from UAA through an intermediate putative poly(A) signal, UAAA, following the pathway UAA → UAAA → AAUAAA.

Background

Messenger RNA (mRNA) 3'-end formation, including cleavage and polyadenylation, is a crucial step in mRNA post-transcriptional processing in eukaryotes. Polyadenylation is essential for eukaryotic gene expression because of its biological functions, including protection of mature mRNA from unregulated degradation and recognition of mature mRNA by the cytoplasmic export machinery and the translational apparatus (Rothnie, 1996; Zhao et al., 1999). Moreover, the formation of the 3'-end is an integral part of transcription termination by RNA polymerase II, which is coupled to cleavage factors that utilize polyadenylation signals on pre-mRNAs (Birse et al., 1998). In recent years, it has been increasingly recognized that polyadenylation can be an important regulator of eukaryotic gene expression (Danckwardt et al., 2008; Millevoi and Vagner, 2010). Polyadenylation signals are cis-elements surrounding the cleavage sites (or poly(A) sites) that are recognized by the polyadenylation complex and direct both the cleavage and polyadenylation reactions. They include four different elements: the far upstream element (FUE), so named because it is located furthest from the cleavage site; the near upstream element (NUE), located ~10 to 35 nt upstream of the cleavage site and corresponding to AAUAAA in many species; the cleavage site and its immediately surrounding motifs, called the cleavage element (CE); and the downstream element, located downstream of the cleavage site and commonly found only in animals (Li and Hunt, 1997; Tian and Graber, 2011; Venkataraman et al., 2005; Xing and Li, 2010). Among these elements, the NUE appears to be the strongest signal for both cleavage and polyadenylation reactions, because the cleavage and polyadenylation specificity factor (CPSF) complex has been shown experimentally to bind to this region, especially to the AAUAAA signal (Hunt et al., 2008; Tian and Graber, 2011; Zhao et al., 1999).

In animals, the highly conserved polyadenylation signals are AAUAAA and its one- or two-nucleotide variants (e.g., AUUAAA) located in the NUE region. In human, 82.5% of 18,277 mRNA transcripts possessed the canonical hexamer AAUAAA or one of its 11 single-nucleotide variants (Kamasawa and Horiuchi, 2008). Similar results were also reported in other studies in human and mouse (Tian et al., 2005; Zarudnaya et al., 2003). More recently, AAUAAA or its one- or two-nucleotide variants were detected in the 3'-UTRs of 87% of mRNA transcript isoforms in Caenorhabditis elegans (Mangone et al., 2010). Although AAUAAA did not occur with high frequency in plants, this hexamer was still ranked at the top of the NUE signal list, and was detected in ~10% of transcripts in Arabidopsis (Loke et al., 2005) and ~7% in rice (Shen et al., 2008a). Graber et al. (1999a) also found that the usage of the supposedly universal and conserved AAUAAA varied widely among six eukaryotic species (yeast, rice, Arabidopsis, fruit fly, mouse and human). Using direct RNA sequencing technology, abundant unannotated poly(A) sites were revealed in human and yeast, and TTTTTTTTT and AAWAAA (W=A/T) were identified as novel motifs upstream of poly(A) sites (Ozsolak et al., 2010).

In the green algae Volvox carteri and Chlamydomonas reinhardtii, UGUAA was reported as the poly(A) signal in the NUE region of alpha 2- and beta 2-tubulin-encoding genes (Mages et al., 1995). This pentamer appeared to be the most frequent motif (31.8%) within 40 nt upstream of poly(A) sites in the chrysophycean alga Ochromonas danica (Terauchi et al., 2010). In C. reinhardtii, UGUAA was also found to be the most significant and conserved poly(A) signal in the NUE regions of over 4,000 genes (Shen et al., 2008b). Meanwhile, poly(A) signals showed considerable variation among different green algal species: the UGUAA signal was completely lost in streptophyte algae Pyramimonas, and the NUE region was U-rich instead of A-rich in most algal species investigated (Wodniok et al., 2007). Interestingly, wet-lab experiments demonstrated that UGUA was the specific motif recognized and bound by mammalian cleavage factor I (CFIm) in human mRNA 3'-end processing (Li et al., 2011; Yang et al., 2010). However, UGUAA was not a strong poly(A) signal in human in comparison to AAUAAA (Shen et al., 2008b). Moreover, the tetramer UAAA was found to be a poly(A) signal in mRNAs of the parasitic protozoan Trichomonas vaginalis, suggesting possible cooperation between the translation stop codon (UAA) and signaling for mRNA 3'-end processing (Espinosa et al., 2002; Fuentes et al., 2012).

Although the two conserved poly(A) signals UGUAA and AAUAAA have been computationally detected and experimentally verified in some cases, their relationships and evolutionary change patterns have not yet been systematically studied. This study aims to explore the change of the poly(A) signals among 11 eukaryotes spanning a wide range of the evolutionary scale, including 7 species whose poly(A) signals have not been investigated to date. This comparative study provides valuable insights into the potential origin of poly(A) signals, and furthers our understanding of the molecular, biological and evolutionary mechanisms regulating mRNA polyadenylation processes in eukaryotes.

Results

Single nucleotide profiles around poly(A) sites in 7 species

To survey poly(A) signals across a wide range of eukaryotic species, we first studied 7 species whose poly(A) signals had not been characterized previously: two diatoms (Thalassiosira pseudonana and Phaeodactylum tricornutum), the ciliate Tetrahymena thermophila, the green alga Ostreococcus lucimarinus, the red alga Cyanidioschyzon merolae, the spikemoss Selaginella moellendorffii and the moss Physcomitrella patens. It has been demonstrated that single nucleotide profiles around poly(A) sites are unique in comparison to other parts of transcription units (Loke et al., 2005; Shen et al., 2008a, 2008b). Therefore, 400-nt genomic sequences around poly(A) sites (i.e., -300 nt to +100 nt, where the poly(A) site is defined as position -1) were extracted for poly(A) signal analysis. The single nucleotide profiles of all 7 species are displayed in Figure 1. The average nucleotide frequencies in the whole 400-nt region as well as in the FUE, NUE and CE regions are shown in Table S1. In particular, the positions of the NUE were determined based on the changes of nucleotide profiles near poly(A) sites and the distribution regions of conserved signals, while the FUE and CE elements were set to be ~100 nt upstream and ~20 nt downstream of the determined NUE elements, respectively (see the Materials and methods section for details). Based on both average nucleotide frequencies and single nucleotide profiles, especially in the FUE and NUE regions, the 7 species show interesting differences and some similarities, as follows:

(1) Two diatoms (T. pseudonana and P. tricornutum). As shown in Table S1, the two diatoms have similar nucleotide frequencies in the whole 400-nt region, and there is a common trend of A>U>G>C. In the NUE region of both diatoms (Figure 1 A and B), A and U first increase almost concordantly and then decrease after reaching a peak.

(2) Ciliate (T. thermophila). As shown in Figure 1 C, ciliate shows a unique single nucleotide profile, in which A-richness is evident from -300 to -80 followed by a U-rich region from -80 to -13.

(3) Green alga (O. lucimarinus). G and C contents are higher, especially in the FUE region (see Table S1). In the NUE region (Figure 1 D), a U-peak is followed concordantly by an A-peak.

(4) Red alga (C. merolae). Like diatom P. tricornutum (Figure 1 B), the contents of the four nucleotides are similar to each other in the FUE region in red alga (Figure 1 E). In particular, there are a striking A-peak and three similar U-, G- and C-troughs in the NUE region (Figure 1 E). The roughness of the single nucleotide profile reflects the relatively small amount of sequence data available for our poly(A) signal analysis in this species.

(5) Spikemoss (S. moellendorffii). It displays a unique single nucleotide profile in the FUE region that is different from the other species. There seems to be a transition of the dominant single nucleotide from G (-200 to -110) to U (-110 to -32) in the FUE region (see Table S1 and Figure 1 F). This indicates that spikemoss may sit between green alga C. reinhardtii (possessing the highest G content) and moss (possessing the highest U content) in terms of nucleotide variation in FUE. A striking A-peak and a relatively deeper U-trough are also detected in its NUE region (Figure 1 F).

(6) Moss (P. patens). U content is dominant in the FUE region; the single nucleotide profile in the NUE region is similar to that in spikemoss: an A-peak and a relatively obvious U-trough in comparison with the G- and C-troughs (Figure 1 G).
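The single nucleotide profiles compared above are position-wise base frequencies across the aligned 400-nt windows. A minimal Python sketch of that calculation follows; the function name, the DNA-to-RNA conversion of T to U, and the choice to number downstream positions from +1 (with no position 0) are assumptions made for illustration and are not taken from the original analysis code.

```python
from collections import Counter


def single_nucleotide_profile(windows, upstream=300):
    """Return {position: {base: fraction}} for equal-length windows.

    Assumed coordinate convention: the last upstream base is the poly(A)
    site (-1), upstream positions are negative, and downstream positions
    start at +1 (no position 0).
    """
    if not windows:
        return {}
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = {}
    for i in range(length):
        # Count bases in column i, reporting genomic T as U.
        counts = Counter(w[i].upper().replace("T", "U") for w in windows)
        pos = i - upstream if i < upstream else i - upstream + 1
        profile[pos] = {b: counts.get(b, 0) / len(windows) for b in "AUGC"}
    return profile
```

Plotting `profile[pos][base]` against `pos` for each base would give curves of the kind shown in Figure 1.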

Putative polyadenylation signals revealed in the 7 species


Our analysis shows that no dominant pentamer (e.g., UGUAA) or hexamer (e.g., AAUAAA) is detected in the NUE region of the two diatoms. In contrast, the triplet UAA is extremely dominant (86.51% and 78.21%) with significant Z-Scores (see Materials and methods for definition) of 19.12 and 6.94 for T. pseudonana and P. tricornutum, respectively (Table 1). As shown in Figure S1 A and B, the dominance of the triplet UAA is also evident in the positional distribution profiles across the entire NUE regions of the two diatom species. No significantly frequent motif of 3 nt (triplet) to 8 nt (octamer) is detected in the NUE region in ciliate (Table 1). This is consistent with the fact that no single dominant signal stands out in its positional distribution profile (Figure S1 C), which clearly differs from the other 6 species described herein. UGUAA is the most prominent pentamer in the NUE region in green alga O. lucimarinus (30.4% frequency with a Z-Score of 30.21, Table 1). This is also evident in the UGUAA positional distribution profile in the NUE region shown in Figure S1 D. Although found among the top 50 hexamers, AAUAAA shows a very low frequency (only 4.21%) with a Z-Score too low to be reported by RSAT (Helden et al., 1998). Accordingly, it is inferred that AAUAAA is not a significantly conserved poly(A) signal like UGUAA in green alga O. lucimarinus. In red alga, UAAA and GUAA appear to be more frequent in the NUE region (Table 1 and Figure S1 E): UAAA is detected in 86.45% of transcripts with a significant Z-Score of 15.33, while GUAA is found in 43.23% of transcripts with a good Z-Score of 8.78. In spikemoss, the most frequent motif detected in the NUE region is the canonical hexamer AAUAAA (see Figure S1 F), although its frequency is still low (7.83%, with a Z-Score of 7.18).
In moss, AAUAAA is also found to be the most frequent motif (7.25%, with a significant Z-Score of 15.46) in the NUE region (Table 1), and it also stands out clearly from other hexamers in its positional distribution profile shown in Figure S1 G.

In the FUE region, no obvious or conserved individual signal was reported previously in Arabidopsis, rice or green alga C. reinhardtii (Loke et al., 2005; Shen et al., 2008a, 2008b). In this study, the top 50 signals (from triplets to octamers) in diatom T. pseudonana are found to be AG-rich in the FUE region and then UG-rich in the FUE-NUE junction region (-80 to -35) (see Figure S2 A). Similar to diatom T. pseudonana, the top 50 signals detected in the FUE region of diatom P. tricornutum are also AG-rich (see Figure S2 B). In ciliate, the most frequent motifs in the FUE region are UA-rich (Figure S2 C), which is consistent with its single nucleotide profile (see Figure 1 C). Green alga O. lucimarinus does not show any significantly frequent motifs in the FUE region, but many CG-rich signals appear in the top 50-motif list with stable distribution across the whole FUE region (see Figure S2 D). Red alga does not show any significantly frequent motif in the FUE region (see Figure S2 E). In spikemoss and moss, the highest overall frequencies of the top 50 motifs (from triplets to octamers) are detected in the FUE-NUE junction regions (Figure S2 F and G). In spikemoss, AAG-rich and UUC-rich pentamers are common in the FUE region. The predominant pentamer signals are AAGAA (44.17%, 5.15 Z-Score), UUCUU (40.43%, 5.96 Z-Score) and GAAGA (39.65%, 5.69 Z-Score) in the -150 to -80 region, where a transitional change from G-rich to U-rich is evident in the single nucleotide profile (Figure 1 F). In moss, no motif is found to be significantly frequent in the FUE region, but the top motifs are U-rich and the overall frequencies of the top 50 motifs increase markedly in the region between -70 and -35 (see Figure S2 G).

Conservative nucleotide composition around poly(A) sites in the 7 species

By investigating the nucleotide composition in the CE region, we find that the BA dinucleotide (-2 to -1, B=T/G/C) is dominant in frequency and conserved in the 7 investigated species (Figure S3). BA is an extension of the YA (-2 to -1, Y=U/C) structure (Loke et al., 2005). The average frequency of BA (70.45%) is much higher than that of YA (53.48%), and box plots of the frequency distributions, generated with R (http://www.r-project.org/), show that BA is clearly higher than YA in the 7 species (see Figure S4).
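The BA versus YA comparison amounts to asking how often the base at -2 falls in a given set when the base at -1 is A. The following Python sketch shows one way to compute that fraction from the extracted windows. The function and constant names are hypothetical, and the indexing assumes that the poly(A) site (-1) is the last base of the upstream part of each window.

```python
def dinucleotide_class_frequency(windows, allowed_first, upstream=300):
    """Fraction of windows whose (-2, -1) dinucleotide is [allowed_first]A.

    Assumes index upstream-1 of each window is the poly(A) site (-1).
    """
    hits = 0
    for w in windows:
        seq = w.upper().replace("T", "U")  # treat genomic T as U
        first, second = seq[upstream - 2], seq[upstream - 1]
        if second == "A" and first in allowed_first:
            hits += 1
    return hits / len(windows) if windows else 0.0


YA_FIRST = set("UC")   # Y = U or C
BA_FIRST = set("UCG")  # B = U, C or G (any base except A)
```

Because every YA dinucleotide is also a BA dinucleotide, the BA fraction computed this way can never be lower than the YA fraction; the difference between the two is the contribution of GA.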

The frequency variations of polyadenylation signals in NUE

Our study shows that UAA, UAAA, GUAA, UGUAA and AAUAAA are the major motifs detected in the NUE regions in most of the 7 species, with variable degrees of frequency and significance. Moreover, the poly(A) signals in model organisms have been frequently examined, including green alga Chlamydomonas reinhardtii (Shen et al., 2008b; Wodniok et al., 2007), yeast Saccharomyces cerevisiae (Graber et al., 1999a; Ozsolak et al., 2010), Arabidopsis (Arabidopsis thaliana) (Loke et al., 2005; Wu et al., 2011) and human (Homo sapiens) (Ozsolak et al., 2010; Tian et al., 2005). As shown in Figure 2, after addition of these 4 model organisms the investigated species span large evolutionary distances, from Chromalveolates (ciliate and two diatoms) and Unikonts (yeast and human) to Plantae (red alga, two green algae and embryophytes, i.e., land plants). Meanwhile, some of the species are closely related and represent small evolutionary distances. Thus, the compared species include two diatoms (T. pseudonana and P. tricornutum), two green algae (C. reinhardtii and O. lucimarinus), red alga (C. merolae), ciliate (T. thermophila), moss (P. patens), spikemoss (S. moellendorffii), Arabidopsis (A. thaliana), yeast (S. cerevisiae) and human (H. sapiens). This comparative study is set up to understand the poly(A) signal changes among species separated by large and small evolutionary distances.

In addition to the two canonical poly(A) signals UGUAA and AAUAAA, the mono- or dinucleotide variants of the poly(A) signals were also suggested to affect the polyadenylation process (Kamasawa and Horiuchi, 2008; Mangone et al., 2010; Tian et al., 2005). In our study, the single nucleotide variants of UGUAA and AAUAAA are extracted and compiled from the top 100 most frequent pentamers and hexamers, respectively (the top 100 is used because the single nucleotide variants have low frequencies in the top 50). As shown in Table 2, the combined frequency of mononucleotide variants of UGUAA in green algae is ~22% for C. reinhardtii and ~29% for O. lucimarinus, both of which belong to the UGUAA group (named after the UGUAA signal). Similarly, the combined frequency of mononucleotide variants of AAUAAA is ~25% for human (Table 2). In spikemoss, moss and Arabidopsis, these combined frequencies are 5%~7%, and the overall frequencies that include both AAUAAA and its single-nucleotide variants are still less than 15%. Figure 3 shows the sequence logos, generated with WebLogo (Crooks et al., 2004), for the consensus sequences of single nucleotide variants based on their frequencies for the UGUAA group and the two AAUAAA groups. UAAA appears to be the core part of AAUAAA, whereas the first two bases are likely degenerate.
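Compiling the single nucleotide variants of a canonical signal from a ranked motif list can be sketched as follows. This is an illustrative Python example only: the function names and the `ranked_frequencies` mapping are hypothetical stand-ins for a top-100 list, and the procedure is not claimed to reproduce how Table 2 was assembled.

```python
def single_nucleotide_variants(motif, alphabet="ACGU"):
    """All motifs that differ from `motif` at exactly one position."""
    variants = []
    for i, original in enumerate(motif):
        for base in alphabet:
            if base != original:
                variants.append(motif[:i] + base + motif[i + 1:])
    return variants


def combined_variant_frequency(motif, ranked_frequencies):
    """Sum the frequencies (%) of the variants found in a ranked list.

    `ranked_frequencies` is a hypothetical {motif: percent} mapping; the
    canonical motif itself is not included in the sum.
    """
    variants = set(single_nucleotide_variants(motif))
    return sum(f for m, f in ranked_frequencies.items() if m in variants)
```

For a hexamer there are 18 possible single nucleotide variants. Using the two Arabidopsis variants listed in Table 2 (UAUAAA at 3.44% and AUUAAA at 2.17%), the combined variant frequency is 5.61%, which falls in the 5%~7% range quoted above.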

The variation and distribution of canonical polyadenylation signals (UGUAA and AAUAAA) in the 11 species

Among the 11 species, only green algae C. reinhardtii and O. lucimarinus have a significantly dominant pentamer UGUAA, which is mainly distributed in the NUE regions. Although the other 9 species do not have a significantly dominant UGUAA signal, they do show between-species differences in UGUAA frequencies. In the two-dimensional coordinate graph shown in Figure 4, the overall frequencies of UGUAA (from the -80 to -15 region) and AAUAAA (from the -50 to -15 region) demonstrate that human and the two green algae are located at the far ends of the X and Y axes, respectively. Interestingly, human not only possesses the highest AAUAAA frequency (56%), but also has an intermediate UGUAA frequency (20.09%). Green alga C. reinhardtii shows the highest UGUAA frequency (50%) together with the lowest AAUAAA frequency (0.21%). Green alga O. lucimarinus, which also utilizes UGUAA as its major poly(A) signal (40%), shows AAUAAA at a relatively low frequency (6.77%). The remaining 8 species can be divided roughly into two groups: (1) diatoms and red alga, which have similar AAUAAA (5%~6%) and UGUAA (14%~19%) frequencies, and (2) land plants (embryophytes), yeast and ciliate, which have similar AAUAAA (8%~14%) and UGUAA (19%~25%) frequencies.

Discussion

Polyadenylation is an important post-transcriptional process for mRNA maturation, cytoplasmic export and protein translation (Rothnie, 1996; Zhao et al., 1999). Poly(A) signals near poly(A) sites are critical in defining the location of the cleavage event on a pre-mRNA. Both UGUAA and AAUAAA have been computationally detected and experimentally verified in some eukaryotes as the conserved cis-regulatory motifs (or poly(A) signals) that can be recognized and bound by the polyadenylation complex to conduct polyadenylation (Li et al., 2011; Millevoi and Vagner, 2010; Shen et al., 2008a, 2008b; Wodniok et al., 2007). In this study, we examine and characterize the putative poly(A) signal profiles from 11 species that span a large evolutionary distance, 7 of which were previously unstudied in terms of poly(A) signal analysis. Such comparative poly(A) signal analysis will facilitate our understanding of the potential evolutionary patterns of poly(A) signal usage and variation.

The usage and distribution of poly(A) signals in the NUE region

The seven species show great differences in the frequency and significance of both the conserved poly(A) signals (i.e., UGUAA and AAUAAA) and other putative poly(A) signals (i.e., UAA, GUAA and UAAA) in their NUE regions. As shown in Figure S1 and Table 1, ciliate does not have a conserved signal. Surprisingly, no significantly frequent pentamer (UGUAA) or hexamer (AAUAAA) is found in the two diatoms (T. pseudonana and P. tricornutum) and red alga (C. merolae). Instead, as shown in Figure S1, the triplet UAA shows dominant frequency (~80%) and strong significance in comparison with other triplets, especially in diatom T. pseudonana (Z-Score=19.12). So far there is no direct experimental evidence supporting a triplet as a poly(A) signal, because it is generally believed that three-nucleotide motifs are too short to be bound by polyadenylation factors. Our results, however, suggest that the triplet UAA (a stop codon) might be an ancestral motif that laid the foundation for functional poly(A) signals like AAUAAA and UGUAA. In red alga, UAAA and GUAA are found to have significantly higher frequencies than other tetramers in the NUE region. Interestingly, UAAA was also reported to be a conserved poly(A) signal in mRNAs of the protozoan parasite Trichomonas vaginalis (Espinosa et al., 2002; Fuentes et al., 2012). Our research not only supports the hypothesis that UAAA might be the intermediate in the AAUAAA evolutionary process (Espinosa et al., 2002), but further suggests that UAA may be the initial ancestral signal of UAAA.
Both the previous studies (Shen et al., 2008b; Wodniok et al., 2007) and our work demonstrate that green algae C. reinhardtii and O. lucimarinus mainly use UGUAA as the poly(A) signal in the NUE region. Yet in another green alga, Scherffelia dubia, AAUAAA was found to be loosely distributed in the NUE region (Wodniok et al., 2007). Spikemoss and moss employ the canonical signal AAUAAA (~8% in frequency) with high significance (Z-Score>7, Table 1), which is similar to streptophyta, land plants and animals (Graber et al., 1999a; Loke et al., 2005; Mangone et al., 2010; Tian et al., 2005). AAUAAA seems to be a major poly(A) signal widely used in eukaryotes, while UGUAA could also be utilized as the core poly(A) signal in some eukaryotes.

If AAUAAA does not have a high frequency in the NUE region, then UGUAA might be used frequently. This is especially exemplified by green alga C. reinhardtii in our study (see Figure 4). C. reinhardtii utilizes UGUAA (~53%), rather than AAUAAA (<1%), as its major signal, suggesting that AAUAAA might not be a necessary poly(A) signal in some species. Interestingly, human genes not only use AAUAAA as their major poly(A) signal but also have a relatively high UGUAA frequency (~20%). Moreover, wet-lab experiments have demonstrated that UGUA is the specific motif recognized by CFIm25 in human mRNA 3’-end processing (Li et al., 2011; Yang et al., 2011a). This may explain the high UGUAA frequency in human pre-mRNAs and indicate that UGUAA is necessary for most eukaryotes.

The poly(A) signals in the FUE and CE regions

FUE is defined as a region spanning about 60-100 nt that contains weak cis-element signals. In Arabidopsis, U-rich motifs appeared to be abundant in the FUE region (Loke et al., 2005). The apparent motifs in the FUE region were G-rich in green alga C. reinhardtii (Shen et al., 2008b; Wodniok et al., 2007). In our study, among all scanned motifs (3-8 nt in length) in the 7 previously unstudied species, we do not detect any individual signal that shows significantly high frequency in the FUE region. However, when we examine the top 50 ranked motifs of 3-8 nt in size, our results reveal that the top-ranked motifs in FUE regions tend to have certain nucleotide preferences (e.g., AG-rich signals in diatom T. pseudonana). Such nucleotide preferences appear to be associated with their single nucleotide profiles. Interestingly, a conserved BA (-2 to -1; B=T/C/G) dinucleotide is found in the CE region in the 7 species. A YA (-2 to -1; Y=T/C) dinucleotide was found to be conserved around the cleavage site in six eukaryotic species including Arabidopsis (Graber et al., 1999a; Loke et al., 2005). In our study, GA (-2 to -1) also has high abundance (15%~20%) whereas AA (-2 to -1) is extremely low (<5%) in the 7 species. Therefore, BA (-2 to -1) should be more inclusive and accurate than YA (-2 to -1) in describing the conserved dinucleotide in the CE region.

The relationships between single nucleotide profiles and poly(A) signals

The single nucleotide profiles, which show distinctive patterns among the 7 unstudied species (Figure 1), appear to be consistent with the frequencies and significances of poly(A) signals detected in the NUE and FUE. In Chlorophyta (green algae), Prototheca wickerhamii and Scherffelia dubia were found to possess high G and C contents in the FUE region, the U-peak was ahead of the A-peak in the NUE region, and U-rich signals and UGUAA were found to be dominant in the FUE and NUE regions, respectively (Wodniok et al., 2007). Such results are consistent with those for green algae C. reinhardtii and O. lucimarinus in our research. Similar to our spikemoss result, the pattern in which the dominant single nucleotide changes from G to U in FUE was also found in the Streptophyta Closterium peracerosum and Klebsormidium subtile (Wodniok et al., 2007). Furthermore, both our study and these previous works suggest that U-rich signals in FUE and the AAUAAA signal in NUE are concordantly detected in many eukaryotic species.


Moreover, the single nucleotide profile around the poly(A) site is an indicator of the nucleotide composition of poly(A) signals. For example, a strong A-peak found in the NUE region in human suggests that its major poly(A) signals are likely to be A-rich motifs (e.g., AAUAAA). In green alga O. lucimarinus, a U-rich peak followed concordantly by an A-rich peak in the NUE region would explain why UGUAA is found as its major poly(A) signal. Similarly, in land plants an A-peak and a relatively deep U-trough in NUE regions also explain why AAUAAA is a top poly(A) signal in these species. However, ciliate has no obvious A-peak and red alga does not possess a relatively deep U-trough, suggesting that UGUAA or AAUAAA might not be the major poly(A) signal in ciliate and red alga.

The comparative analysis of poly(A) signals in 11 species suggested an evolutionary pathway of poly(A) signal variation

Based on the comparative analysis of the putative poly(A) signals in the 11 species (Tables 1 and 2) and their evolutionary distances (Figure 2), we suggest that two possible evolutionary pathways of poly(A) signals may exist: (a) UAA → UAAA → AAUAAA and (b) UAA → GUAA → UGUAA. As shown in Figure 2, the two diatoms and ciliate are simple organisms in comparison with the other 8 species. Diatoms have been found to have a strong evolutionary relationship with bacteria through gene transfer, which has been a major driving force during their evolution (Bowler et al., 2008). It is therefore reasonable to assume that the dominant motif UAA (also a stop codon) detected in the two diatoms might represent the ancestral poly(A) signal, whereas UAAA and GUAA may represent the intermediates between UAA and AAUAAA and between UAA and UGUAA, respectively.

The results for the single nucleotide variants of AAUAAA (Table 2) show that the UAAA motif is highly conserved in the two AAUAAA groups (Figure 3), and imply that evolutionary pathway (a) might be true. Moreover, point mutations in the AAUAAA signal of the animal virus SV40 terminator showed that mutations in the last four positions (UAAA) caused a much greater reduction in cleavage and polyadenylation efficiency than those in the first two positions (AA) (Sheets et al., 1990). Recently, in Trichomonas vaginalis, a parasitic protozoan, UAAA was shown to be a core part of the AAUAAA signal for polyadenylation, and its point mutation altered the poly(A) sites (Espinosa et al., 2002; Fuentes et al., 2012). In contrast, mutations in the AAUAAA signal in plants and yeast (S. cerevisiae) did not show a significant difference in polyadenylation efficiency in comparison with wild type, and the 3'-half of the mutated motifs (-AAA) had slightly more tolerance than the 5'-half (AAU-) in terms of the mutation-induced polyadenylation efficiency changes (Rothnie et al., 1994). This suggests that plants and yeast have extreme tolerance to point mutations of poly(A) signals in comparison to animals, and such extreme tolerance may partially explain the low frequency of AAUAAA in plants and yeast, considering that non-AAUAAA signals could also serve as polyadenylation signals with relatively high efficiency. Bioinformatics analysis revealed that the top 6 most significant hexamers in human were AAUAAA and its variants (NNUAAA), and their overall frequencies were up to 81.6% (Beaudoing et al., 2000). Moreover, previous studies (Espinosa et al., 2002; Fuentes et al., 2012) also supported the idea that UAAA may be the ancestral motif in AAUAAA signal evolution. The signal pathway (a) UAA → UAAA → AAUAAA therefore seems to be supported by both these previous studies and our results. In contrast, further evidence is needed to support pathway (b) UAA → GUAA → UGUAA, because our WebLogo result in Figure 3 suggests that only the first U in UGUAA is highly conserved, which conflicts with our assumption.

In spikemoss, moss and Arabidopsis, the frequency of the canonical AAUAAA is still low (<15%, Table 2) even when combined with its single nucleotide variants. This is completely different from the UGUAA group and the high-frequency AAUAAA group, in which the frequencies of single-nucleotide variants are 20%-25% and the overall combined frequencies (including both the core signal AAUAAA or UGUAA and the corresponding mononucleotide variants) could be up to 60% or more. Other researchers also verified that there were high frequencies of AAUAAA (~50%) and its mononucleotide variants (~30%) in human (Kamasawa and Horiuchi, 2008; Tian et al., 2005). Therefore, it is suggested that plants and animals might have quite different polyadenylation processes (including poly(A) signals and proteins), considering the huge differences in poly(A) signal frequencies.

It is noteworthy that no canonical poly(A) signal (i.e., AAUAAA or UGUAA) is detected in ciliate, diatoms or red alga, and that in ciliate no conserved signal is found at all. RNAs with poly(A) tails are nevertheless detected in cDNA/mRNA sequencing in ciliate (Eisen et al., 2006; Xiong et al., 2012), which implies that polyadenylation is a normal mRNA processing event.
It is assumed that not only poly(A) signals but also polyadenylation elements (e.g., FUE, NUE and CE) play crucial roles in polyadenylation (Hunt et al., 2012a; Zhao et al., 1999). Although no significantly conserved poly(A) signal is detected in ciliate, the three elements (i.e., FUE, NUE and CE) are still clearly shown in its single nucleotide profile. Moreover, compared with SV40 (representing mammalian genes), CaMV (representing plants) has much higher tolerance to single nucleotide variation of the poly(A) signal when polyadenylation efficiency is measured, which means that non-canonical signals (e.g., AUUAAA, AAAAAA) retain higher polyadenylation efficiency in plants than in animals (Rothnie et al., 1994). Clearly, these poly(A) signals do not function on their own, but in combination with an extensively characterized set of mRNA 3’ processing factors. Thus the functional unit that directs mRNA 3’ processing is composed of both RNAs and proteins. Furthermore, it is reported that the number of genes encoding poly(A) protein factors distinctly increases in plant evolution from “lower” to “higher” plants (Hunt et al., 2012a). The model of the polyadenylation complex in plants is composed of a constant core part and numerous peripheral subunits, which may explain the degenerate and diversified poly(A) signals in plants (Hunt et al., 2012a). Such a model is helpful for understanding the multiple signals (e.g., UAA, UAAA and AAUAAA) detected in the 11 investigated species. For example, it is possible that the three elements (not a single signal) around poly(A) sites are recognized and bound by the core poly(A) factors (e.g., CPSF) and some peripheral subunits during polyadenylation in ciliate. However, it remains a challenge to answer how the variable poly(A) signals changed along with protein factors during the evolution of eukaryotes.

Materials and methods

Data collection and polyadenylation site definition

Taking advantage of sequenced genomes and relevant ESTs in GenBank dbEST and/or community databases, eleven eukaryotic species, including two diatoms T. pseudonana and P. tricornutum, ciliate T. thermophila, two green algae C. reinhardtii and O. lucimarinus, red alga C. merolae, spikemoss S. moellendorffii, moss P. patens, Arabidopsis A. thaliana, yeast S. cerevisiae and human H. sapiens, were selected to investigate the potential evolutionary patterns of poly(A) signals. In particular, the poly(A) signals of seven of these species (diatoms T. pseudonana and P. tricornutum, ciliate T. thermophila, green alga O. lucimarinus, red alga C. merolae, spikemoss S. moellendorffii and moss P. patens) have not been reported previously. Because only EST data were available for these seven species, only EST-derived poly(A) data were utilized for the other four model organisms (C. reinhardtii, A. thaliana, yeast S. cerevisiae and human H. sapiens) to keep the data comparable. When available, raw EST trace files were used to identify post-transcriptional poly(A) tails, consolidated by identification of cDNA termini to reduce false positives in poly(A) tail identification (Liang et al., 2008). Detailed information about the collected EST data, genome sequences and available gene annotation is listed in Table S2. For example, nearly 77k diatom T. pseudonana and 208k diatom P. tricornutum EST sequences were obtained from the Diatom EST Database (http://www.diatomics.biologie.ens.fr/EST/) (Maheswari et al., 2009). To determine the genomic poly(A) sites, all ESTs for the species other than yeast were mapped to their corresponding genomes using GMAP, a genomic mapping and alignment program for mRNA and EST sequences (Wu and Watanabe, 2005). The poly(A) site data of yeast were directly downloaded from the website (http://harlequin.jax.org/polyA/) (Graber et al., 1999b), which contained 1,353 genomic sequences spanning 110 nt upstream and 50 nt downstream of the putative cleavage sites. The EST-to-genome mapping results were then analyzed and filtered for valid genomic hits using a protocol similar to that described previously (Liang et al., 2008). Because poly(A) tails detected in ESTs are post-transcriptional, they should not map to the genome except where internal priming is likely to have occurred. Internal priming is defined as the case in which at least 6 consecutive adenines (As) are found, or 7 As are detected in a 10-nt window, in the -10/+10 region around poly(A) sites in genomic sequences (Tian et al., 2005). Therefore, genomic poly(A) sites were finally determined from the EST-to-genome mapping results after filtering out those that were potential internal priming candidates.
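The internal priming rule stated above (a run of at least 6 As, or 7 As within a 10-nt window, in the -10/+10 region) can be written as a short filter. The Python sketch below is one possible reading of that rule, not the pipeline actually used: the function name is hypothetical, "7 As" is interpreted as 7 or more, and the poly(A) site is assumed to be the last base of the upstream part of the genomic window.

```python
def is_internal_priming(genomic_window, upstream=300, flank=10):
    """Flag a candidate poly(A) site as a likely internal priming artifact.

    Assumes the poly(A) site (-1) is at index upstream-1, so the -10/+10
    region is the slice [upstream - flank, upstream + flank).
    """
    start = max(0, upstream - flank)
    region = genomic_window[start:upstream + flank].upper()
    # Rule 1: at least 6 consecutive As anywhere in the region.
    if "A" * 6 in region:
        return True
    # Rule 2: 7 or more As within any 10-nt sliding window.
    for i in range(len(region) - 10 + 1):
        if region[i:i + 10].count("A") >= 7:
            return True
    return False
```

Sites for which this function returns `True` would be discarded before the unique poly(A) sites are compiled.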

Frequently, multiple ESTs were mapped to the same poly(A) site during EST-to-genome mapping. To eliminate this redundancy, only one such mapping was kept to represent each unique poly(A) site. Finally, using these non-redundant unique sites, 400-nt sequences were extracted from the corresponding genomes [i.e., 300 nt upstream and 100 nt downstream of the poly(A) site, which was defined as the -1 position (Shen et al., 2008b)] for further data analysis. The relative evolutionary positions of all these target species are shown in Figure 2 according to the Tree of Life Web Project (http://tolweb.org/tree/phylogeny.html).

Poly(A) signal elements definition and poly(A) signals analysis

Based on previous research (Graber et al., 1999a, 1999b; Loke et al., 2005; Shen et al., 2008a, 2008b; Tian and Graber, 2011; Tian et al., 2005; Zhao et al., 1999), the NUE region in our study was defined using two criteria: (1) single nucleotide profile: NUE could start around the first crossing site of A and U (around -30) and end around another crossing site of A and U (around -10); (2) the significantly frequent and dominant motifs (if any) should lie in this region. Once the NUE region was determined, FUE was defined as a range immediately upstream of NUE, in which a dominant single nucleotide profile could be evident. Based on our data, the start position of FUE was defined as the position where a dominant G or U appears; if no such position exists, -200 was used as the FUE start position. Ciliate T. thermophila was an exception, because no dominant signal was found. Because we did not specifically examine the signals in CE, -10 and +10 were defined as the start and end positions of CE.
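The first NUE criterion relies on positions where the A and U curves of the single nucleotide profile cross. In the study the boundaries are described only approximately ("around -30", "around -10"), so they were presumably read from the plots; the Python sketch below merely illustrates how candidate crossing positions could be listed automatically. The function name, the default scan range and the `profile` layout ({position: {"A": fraction, "U": fraction}}) are assumptions.

```python
def au_crossings(profile, start=-60, end=-1):
    """Positions where the sign of (A fraction - U fraction) flips.

    A crossing is reported at the first position after the flip;
    positions where A and U are exactly equal are skipped.
    """
    crossings = []
    prev_sign = 0
    for pos in sorted(p for p in profile if start <= p <= end):
        diff = profile[pos]["A"] - profile[pos]["U"]
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            crossings.append(pos)
        prev_sign = sign
    return crossings
```

On noisy profiles (for example red alga, with few poly(A) sites) the raw curves may cross many times, so smoothing or manual inspection would still be needed before choosing NUE boundaries.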
A new version of SignalSleuth (Loke et al., 2005), SignalSleuth2, was developed to perform an exhaustive search of short sequence motifs in a specified range of nucleotide sequences with variable motif sizes (generally 3-8 nt in length) and to rank the detected motifs based on their occurrence frequencies. In addition, SignalSleuth2 has new functions including Position-Specific Scoring Matrix (PSSM) score calculation and multiple scanning modes. Occasionally, a target motif may appear multiple times within a given region of a sequence. Sometimes such multiple occurrences overlap, resulting in over-representation of a specific motif. SignalSleuth2 provides a distance parameter (-gap) to prevent over-counting of overlapping motifs. In what we call gap scanning mode, -gap is set to motif length - 1 in order to count non-overlapping signal frequency in a given region of a sequence. For example, ATATAT is counted only once in the sequence …ATATATAT… if -gap is set to 5. Likewise, if the AATAAA motif is searched, -gap=5 avoids over-counting overlapping motifs if they exist. The motif frequencies reported in this study were obtained using the gap scanning mode. Meanwhile, SignalSleuth2 also allows users to use overlapping scanning mode (namely, -gap=0) to obtain the frequencies of overlapping signals. Moreover, in cases where there is more than one occurrence of a non-overlapping motif in a given region, SignalSleuth2 is able to choose only the motif that is closest to the poly(A) site. This is termed once scanning mode. For each scanning mode, SignalSleuth2 provides PSSM results simultaneously.

To evaluate the statistical significance of the signals, we used Z-Scores to assess the signals/motifs detected by Regulatory Sequence Analysis Tools (RSAT), which is based on a Markov chain model (Helden et al., 1998). Considering the short length of triplets and tetramers, an order-1 Markov model was used for them and the cutoff value for a valid Z-Score was set to 5; for longer motifs, an order-3 Markov model was used and the cutoff value was 3. For the nucleotide composition in the CE region around poly(A) sites, WebLogo 3.0 was used to examine the profiles of nucleotide composition (Crooks et al., 2004).
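The gap and once scanning modes described above can be illustrated with a short Python sketch. This is not SignalSleuth2 code: the function names are hypothetical, and the interpretation of -gap (the next counted hit must start more than `gap` positions after the previously counted one, scanning 5' to 3') is an assumption chosen so that it reproduces the ATATAT example in the text.

```python
def find_hits(sequence, motif, gap):
    """Start indices of counted motif hits, scanning 5' to 3'.

    A hit is counted only if it starts more than `gap` positions after
    the previously counted hit; gap=0 therefore counts every
    (overlapping) occurrence, and gap=len(motif)-1 counts
    non-overlapping occurrences only.
    """
    hits = []
    next_allowed = 0
    start = sequence.find(motif)
    while start != -1:
        if start >= next_allowed:
            hits.append(start)
            next_allowed = start + gap + 1
        start = sequence.find(motif, start + 1)
    return hits


def closest_hit_to_site(region, motif):
    """Once-mode sketch: the occurrence nearest the poly(A) site.

    Assumes `region` lies upstream of the poly(A) site and is written
    5' to 3', so the right-most occurrence is the closest one.
    """
    hits = find_hits(region, motif, 0)
    return hits[-1] if hits else None
```

With these definitions, `find_hits("ATATATAT", "ATATAT", 5)` counts one hit while `find_hits("ATATATAT", "ATATAT", 0)` counts two, matching the behaviour described for -gap=5 and -gap=0.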


Tables

Table 1. The conserved poly(A) signals in the NUE region of the seven species

Species name (common name)       Conserved signal   Frequency (%)   Z-score
T. pseudonana (T diatom)         UAA                86.51           19.12*
P. tricornutum (P diatom)        UAA                78.21           6.94*
T. thermophila (Ciliate)         -                  -               -
O. lucimarinus (Ostreococcus)    UGUAA              30.4            30.21*
C. merolae (Red alga)            UAAA               86.45           15.33*
                                 GUAA               43.23           8.78*
S. moellendorffii (Spikemoss)    AAUAAA             7.83            7.18
P. patens (Moss)                 AAUAAA             7.25            15.46

Note: * indicates that an order-1 Markov model was used; values without * were obtained with an order-3 Markov model.
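The Z-scores in Table 1 were produced by RSAT. As a rough illustration of what such a score measures, the Python sketch below compares the observed count of a word with the count expected under an order-1 Markov background estimated from the input sequences themselves, using a binomial approximation for the variance. The function names, the choice of background and the variance formula are assumptions made for this example; the sketch is not RSAT's implementation and will not reproduce the values in Table 1.

```python
from collections import Counter
from math import sqrt


def markov1_word_probability(word, sequences):
    """P(word) under an order-1 Markov model estimated from `sequences`."""
    mono, di = Counter(), Counter()
    for s in sequences:
        mono.update(s)
        di.update(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(mono.values())
    if total == 0:
        return 0.0
    prob = mono[word[0]] / total
    for a, b in zip(word, word[1:]):
        # Transition probability P(b | a) from dinucleotide counts.
        from_a = sum(c for pair, c in di.items() if pair[0] == a)
        prob *= di[a + b] / from_a if from_a else 0.0
    return prob


def word_z_score(word, sequences):
    """(observed - expected) / sqrt(expected * (1 - p)); None if undefined."""
    k = len(word)
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    observed = sum(
        1 for s in sequences
        for i in range(len(s) - k + 1) if s[i:i + k] == word
    )
    p = markov1_word_probability(word, sequences)
    variance = positions * p * (1 - p)
    if variance <= 0:
        return None
    return (observed - positions * p) / sqrt(variance)
```

An order-3 background, as used in the study for longer motifs, would follow the same pattern with tetranucleotide counts in place of dinucleotide counts.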


Table 2. The frequencies of UGUAA and AAUAAA and their single nucleotide variants

Group               Species name (common name)       Canonical signal (frequency)   Variant (frequency)   Overall frequency
UGUAA group         C. reinhardtii (Chlamydomonas)   UGUAA (50.38%)                 UGCAA (6.42%)         72.86%
                                                                                    UGUUA (3.38%)
                                                                                    UUUAA (3.28%)
                                                                                    UGUAG (2.92%)
                                                                                    UGUAU (2.43%)
                                                                                    UGUGA (2.34%)
                                                                                    UGAAA (2.33%)
                                                                                    UGUAC (2.14%)
                    O. lucimarinus (Ostreococcus)    UGUAA (30.40%)                 UUUAA (10.47%)        59.64%
                                                                                    UGAAA (5.75%)
                                                                                    UGUAU (5.24%)
                                                                                    UGUUA (4.60%)
                                                                                    UGUGA (4.09%)
                                                                                    UGCAA (4.09%)
Low AAUAAA group    P. patens (Moss)                 AAUAAA (7.25%)                 AGUAAA (2.73%)        11.93%
                                                                                    UAUAAA (2.25%)
                    S. moellendorffii (Spikemoss)    AAUAAA (7.83%)                 UAUAAA (3.31%)        14.41%
                                                                                    AUUAAA (2.08%)
                                                                                    AGUAAA (1.67%)
                    A. thaliana (Arabidopsis)        AAUAAA (8.59%)                 UAUAAA (3.44%)        13.71%
                                                                                    AUUAAA (2.17%)
High AAUAAA group   H. sapiens (Human)               AAUAAA (64.92%)                AUUAAA (16.68%)       85.19%
                                                                                    UAUAAA (4.30%)
                                                                                    AGUAAA (3.76%)


Figures

Figure 1. The single nucleotide profiles around poly(A) sites for the 7 unstudied species. The -1 position is the poly(A) site; "-" designates the upstream sequence (300 nt) and "+" designates the downstream sequence (100 nt).


Figure 2. The phylogenetic relations among the 11 species investigated in our study. The common names and scientific names are listed in parentheses close to the relevant clade names in the phylogenetic tree. The common names are all underlined. The phylogenetic tree is constructed according to Tree of Life Web Project (http://tolweb.org/tree/phylogeny.html).


Figure 3. The combined logos of mononucleotide variants from the UGUAA group (top) and two AAUAAA groups (bottom).


Figure 4. The overall frequencies of UGUAA and AAUAAA in a 2-D coordinate in the 11 species. Abbreviations: chlamy, green alga C. reinhardtii; Ostreococcus, green alga O. lucimarinus; diatom_t, diatom T. pseudonana; diatom_p, diatom P. tricornutum.


Supplementary data

Supplemental File 1 (Figure S1). The distributions of 50 top-ranked poly(A) signals in the NUE regions of the 7 previously unstudied species.


Supplemental File 2 (Figure S2). The distribution of 50 top-ranked poly(A) signals in the FUE regions in the 7 previously unstudied species.


Supplemental File 3 (Figure S3). The nucleotide composition in the CE region in the 7 previously unstudied species.


Supplemental File 4 (Figure S4). The conserved structures in the CE region in the 7 species. The box plots represent the collective range of the BA (right) and YA (left) elements.


Supplemental File 5 (Table S1). The percentages of four nucleotides in 7 previously unstudied species


Supplemental File 6 (Table S2). The data sources of the 11 species for poly(A) signal analysis

Species name (common name)                   EST source                  Genome source                  No. ESTs    Poly(A) sites
Thalassiosira pseudonana (T diatom)          Genbank, JGI                V3.0, JGI                      76,319      2,943
Phaeodactylum tricornutum (P diatom)         Genbank, JGI                V2.0, JGI                      207,560     3,520
Tetrahymena thermophila (Ciliate)            Genbank, TGD                Tetrahymena Genome Database    103,511     2,729
Chlamydomonas reinhardtii (Chlamydomonas)    JGI, Liang et al. (26)      V4.0, JGI                      338,234     21,037
Ostreococcus lucimarinus (Ostreococcus)      Genbank, JGI                V2.0, JGI                      26,863      783
Cyanidioschyzon merolae (Red alga)           C. merolae Genome Project   C. merolae Genome Project      63,712      155
Selaginella moellendorffii (Spikemoss)       Genbank, JGI                V1.0, JGI                      94,214      9,080
Physcomitrella patens (Moss)                 Genbank                     V1.1, JGI                      382,584     8,415
Arabidopsis thaliana (Arabidopsis)           Genbank                     TAIR9.0                        1,527,298   23,762
Saccharomyces cerevisiae (Yeast)             Graber et al. (29)          SGD                            3,425       555
Homo sapiens (Human)                         Genbank                     UCSC hg19                      8,296,280   12,449


Acknowledgements

The authors thank Praveen Raj Kumar and Pei Li for their help with the figures and tables used in the manuscript. Dr. Chun Liang, Dr. Qingshun Quinn Li and Dr. Guoli Ji managed and coordinated the project. Zhixin Zhao carried out data collection and data analysis. Dr. Xiaohui Wu developed the SignalSleuth2 program. All authors participated in manuscript writing and revision.

This paper has been submitted to the journal Genomics.


Chapter 3

Bioinformatics analysis of alternative polyadenylation from multiple sequencing platforms in green alga Chlamydomonas reinhardtii

Abstract

Background
Messenger RNA (mRNA) 3'-end formation is an essential step in most eukaryotic mRNA post-transcriptional processing. Unlike plants and animals, where AAUAAA and its variants are routinely found as the main poly(A) signal, Chlamydomonas reinhardtii uses UGUAA as its major poly(A) signal. Advances in sequencing technology have provided an enormous amount of sequencing data with which to explore the variations of the poly(A) signals, alternative polyadenylation, and its relationship with splicing in C. reinhardtii.

Results
Genome-wide analysis of poly(A) sites in C. reinhardtii identified a large number of poly(A) sites: 21,041 from ESTs, 88,184 from 454 and 195,266 from Illumina sequence reads. In comparison with previous collections, deep sequencing reveals more new poly(A) sites in coding sequences (CDS), introns and intergenic regions. Interestingly, G-rich signals are particularly abundant in the poly(A) signal regions of the poly(A) sites located in introns and intergenic regions. The prevalence of different poly(A) signals in CDS and 3'-UTRs might indicate different mechanisms of polyadenylation. Our data analysis suggests that the extent of alternative polyadenylation (APA) is about 68% in C. reinhardtii. Using Gene Ontology analysis, we found that most of the APA genes are involved in protein synthesis and in hydrolase and ligase activities.

Conclusions
Deep sequencing revealed many more poly(A) sites, a higher proportion of which are located in CDS, introns and intergenic regions. Polyadenylation in CDS may employ a mechanism different from that in 3'-UTRs, considering the dramatically distinct poly(A) signals. Moreover, intronic poly(A) sites are more abundant in constitutively spliced introns than in retained introns, suggesting interplay between polyadenylation and splicing.


Background

Messenger RNA (mRNA) 3'-end formation, including cleavage and polyadenylation, is a crucial step in most eukaryotic mRNA post-transcriptional processing. Polyadenylation plays essential roles in eukaryotes, including protecting mature mRNAs from unregulated degradation and enabling recognition by the mRNA cytoplasmic export machinery and by the translational apparatus. Moreover, 3'-end formation is indispensable for transcription termination (Hunt et al., 2012b; Rothnie, 1996; Zhao et al., 1999). In contrast to AAUAAA, which is believed to be the canonical poly(A) signal in eukaryotes (Graber et al., 1999a; Mangone et al., 2010; Tian et al., 2005; Wu et al., 2011), UGUAA, a poly(A) signal found in the near upstream element (NUE) region, has been reported to play vital roles in the alpha 2- and beta 2-tubulin-encoding genes of the green algae Volvox carteri and Chlamydomonas reinhardtii (Mages et al., 1995). A bioinformatics analysis using 10,508 cDNA/EST sequences in C. reinhardtii revealed that 49.8% of the sequences contain the UGUAA signal within 50 nt upstream of the cleavage site (CS), or poly(A) site (Wodniok et al., 2007). Shen et al. also reported that UGUAA is the most dominant poly(A) signal (52% of 16,952 EST sequences) in the NUE region in C. reinhardtii (Shen et al., 2008b). In eukaryotes, many genes have more than one poly(A) site, and such sites generate multiple mRNA isoforms with different lengths and different 3'-ends. This phenomenon is called alternative polyadenylation (APA), which has been shown to play a crucial role in gene expression (Lutz 2008). Based on Sanger-based sequencing of EST/cDNA data, the extent of APA in C. reinhardtii and rice is up to 33% (Shen et al., 2008b) and 50% (Shen et al., 2008a) of annotated protein-coding genes, respectively. Furthermore, the observed extent of APA is influenced by sequencing depth (Shen et al., 2011; Wu et al., 2011).
For example, the APA extent in rice is ~47% and 82% as supported by massively parallel signature sequencing (MPSS-DGE) and Illumina sequencing-by-synthesis (SBS-DGE) data, respectively (Shen et al., 2011). Interestingly, alternative mRNA processing, particularly APA, leads to variation in transcript isoform abundance and can be used as a powerful molecular biomarker in cancer research and diagnosis (Singh et al., 2009). Not only can APA produce multiple isoforms from the same gene, but alternative splicing (AS) can also generate different transcripts from the same gene. Research shows that introns with poly(A) sites have weaker 5' splice sites and larger sizes compared with introns without a poly(A) site (Tian et al., 2007). Similar results for polyadenylation and splicing in introns have also been found in Arabidopsis (Wu et al., 2011). Moreover, intronic poly(A) signals near the intronic poly(A) site reciprocally regulate splicing and polyadenylation and control sFlt1 (shorter Fms-like tyrosine kinase-1) gene expression (Thomas et al., 2007, 2010).

The chain-termination method, known as Sanger sequencing and developed by Sanger and his coworkers in 1975, was the dominant sequencing technology due to its relatively low cost and higher reliability (Sanger and Coulson, 1975; Sanger et al., 1977). Driven by the need to decipher the human genome, capillary electrophoresis technology (e.g., the 3730XL capillary sequencer from Applied Biosystems) was developed and integrated with the chain-termination method for high-throughput DNA sequencing in the late 1980s. Now ultra-high-throughput sequencing technologies, often referred to as Next-Generation Sequencing (NGS), are rapidly replacing Sanger sequencing because they greatly increase sequencing depth and coverage at much lower cost (Mardis, 2008; Schuster, 2008). The first NGS technology, Roche/454 FLX pyrosequencing, which uses sequencing-by-synthesis, was developed by 454 Life Sciences and Roche Applied Science in 2005 (Margulies et al., 2005). Nowadays, RNA-seq (Wang et al., 2009) and Direct RNA Sequencing (DRS) (Ozsolak et al., 2009) technologies provide deeper transcriptome sequencing data, yet the sequenced reads are still short (~50-200 bp). Without a doubt, the massive data accumulated over the last few years provide opportunities to examine polyadenylation and splicing in many different species.

Although the poly(A) signal UGUAA has been found and experimentally verified in C. reinhardtii, its variations in different genic regions (i.e., 5'-UTR, 3'-UTR, CDS and intron) and in intergenic regions have not been systematically studied and compared. In this study, we compared the poly(A) sites obtained by different sequencing technologies and found that deep sequencing (454 and Illumina sequencing) revealed many more previously unannotated poly(A) sites. In particular, a higher proportion of these new poly(A) sites are located in CDS, intron and intergenic regions. Moreover, our study showed that polyadenylation signals are more pronounced in constitutively spliced introns than in retained introns in C. reinhardtii.

Results

Poly(A) site collection and distribution in the C. reinhardtii genome

From 23,535,153 ESTs and RNA-Seq reads with poly(A) tails, a total of 256,771 unique poly(A) sites were obtained after removing redundancy and potential internal priming candidates (Table 1). Based on the C. reinhardtii genome annotation v4.3, 44% and 40% of the poly(A) sites are located in 3'-UTRs and intergenic regions, respectively, while the 5'-UTR, CDS and intron regions each contain approximately 1%~10% of the sites (Figure 1 A). To account for potential genome annotation errors in 3'-UTRs, we investigated the distribution of the intergenic poly(A) sites within 1000 nt of the 3'-UTR end of all genes. Many more sites are found close to the 3'-UTR ends. For example, 34.36% of the sites (26,136 of 76,056) are located within 5 nt beyond the 3'-UTR ends (Figure S1 in supplemental data), and 54.34% of the sites in the 1000-nt region lie no more than 50 nt from the 3'-UTR ends. Therefore, the annotated 3'-UTRs were empirically extended by 50 nt to avoid such annotation errors; that is, any poly(A) site located within 50 nt of the end of a 3'-UTR is treated as a poly(A) site in the 3'-UTR.
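The 50-nt extension rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `ThreePrimeUTR` record, the function name and the 1-based inclusive coordinates are hypothetical and are not taken from the pipeline used in this study; only the 50-nt value comes from the text.

```python
from dataclasses import dataclass

UTR_EXTENSION_NT = 50  # extension chosen empirically in the text


@dataclass(frozen=True)
class ThreePrimeUTR:
    """Hypothetical record for one annotated 3'-UTR (1-based, inclusive)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def in_extended_utr(utr, chrom, pos, strand, extension=UTR_EXTENSION_NT):
    """Return True if a poly(A) site falls in the 3'-UTR or within
    `extension` nt beyond its transcript 3' end."""
    if chrom != utr.chrom or strand != utr.strand:
        return False
    if strand == "+":
        # On the plus strand the transcript 3' end is the larger coordinate.
        return utr.start <= pos <= utr.end + extension
    # On the minus strand the transcript 3' end is the smaller coordinate.
    return utr.start - extension <= pos <= utr.end


utr = ThreePrimeUTR("chromosome_1", 1000, 1200, "+")
print(in_extended_utr(utr, "chromosome_1", 1250, "+"))  # True: 50 nt past the end
print(in_extended_utr(utr, "chromosome_1", 1251, "+"))  # False: 51 nt past the end
```

The extension is applied only on the 3' side of the UTR, so the direction depends on the strand.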

Overview of single nucleotide profiles and polyadenylation signals (PAS) variation in C. reinhardtii

To analyze the poly(A) signals, 100 nt of upstream and 50 nt of downstream sequence were extracted from the C. reinhardtii genome for each unique poly(A) site. Single nucleotide profiles around cleavage or poly(A) sites have been shown to have characteristic features for different poly(A) signals (Loke et al., 2005; Shen et al., 2008a, 2008b). In general, poly(A) signals in plants are grouped by position: the poly(A) site and its surrounding sequences, called the cleavage element (CE); the near upstream element (NUE) region; and the far upstream element (FUE) region (Loke et al., 2005). In our study, we first examined the single nucleotide profiles and then focused on the poly(A) signals in the NUE, because this is the most conserved region, containing a canonical poly(A) signal and its variants (Loke et al., 2005; Shen et al., 2008a, 2008b; Xing and Li, 2010). The single nucleotide profile from all unique poly(A) sites (Figure 2 A) shows a clear NUE region (-25 to -10) and a high A-peak (52%) at the poly(A) site (-1 position), with a high G content (~34% on average) across the whole investigated region (-100 to +50). Each dataset from the different sequencing platforms was then investigated separately, which showed that the high G content comes from the Illumina data: only the Illumina profile has a high G content (~34% on average) across the whole region (Figure 2 B). The single nucleotide profiles from ESTs and 454 (Figure 2 C and D) are very similar to each other, with low G content (~25% on average) in the NUE regions, much like the profile described by Shen et al. (Shen et al., 2008b). In the NUE regions of all four profiles shown in Figure 2, there are clearly high U- and A-peaks and C- and G-troughs in C. reinhardtii.
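A single nucleotide profile of the kind discussed above is a per-position base composition over aligned flanking sequences. The following is a minimal sketch, assuming that all sequences have equal length, that the last upstream base is the poly(A) site (position -1), and that downstream positions are labelled from +1; the function name is hypothetical and this is not the code used to produce Figure 2.

```python
from collections import Counter


def nucleotide_profile(sequences, upstream=100):
    """Per-position base frequencies for equal-length sequences.

    Index `upstream - 1` is taken to be the poly(A) site (position -1);
    there is no position 0, so downstream labels start at +1.
    """
    if not sequences:
        raise ValueError("no sequences given")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    # Work in the RNA alphabet so that DNA and RNA input are treated alike.
    rna = [seq.upper().replace("T", "U") for seq in sequences]
    total = len(rna)
    profile = {}
    for i in range(length):
        counts = Counter(seq[i] for seq in rna)
        label = i - upstream if i < upstream else i - upstream + 1
        profile[label] = {base: counts.get(base, 0) / total for base in "ACGU"}
    return profile


# Toy input: 2 nt upstream and 2 nt downstream of each site.
toy = nucleotide_profile(["AUAG", "AUCG", "CTAG"], upstream=2)
print(sorted(toy))   # [-2, -1, 1, 2]
print(toy[-1]["U"])  # 1.0
```

With the 150-nt fragments described in the text, `upstream=100` yields positions -100 to +50.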

Similar results are also found for the pentamer signals in the NUE regions (-28 to -5) among the different datasets (Figure S2). The signal profiles from all combined poly(A) site data show that GGGGG has the highest frequency (24.47%), but it cannot be significantly differentiated from background noise because its Z-score is too low to be listed by RSAT (Figure S2 A). The most significant (Z-score=62.22 with an order-3 Markov model) and highly frequent (23.28%) signal in the NUE region is UGUAA, and the significant G-rich pentamers GGUGG and GUGGG are also detected with high frequency (see Table S1 for details). Similarly, in the Illumina data, UGUAA is the most conserved pentamer, and G-rich pentamers (e.g., GGUGG and GUGGG) are also detected (Figure S2 B and Table S1). However, the pentamers from ESTs and 454 closely resemble the results of Shen et al. (Shen et al., 2008b): UGUAA is the dominant pentamer (40%~50%), UGUAA-related motifs are present, and G-rich pentamers are not found (Figure S2 C and D, and Table S1).
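The pentamer frequencies quoted above are percentages of sequences that carry a given pentamer in the NUE window. A rough sketch of such a count is given below. It assumes that each sequence is counted at most once per pentamer and that the pentamer must lie entirely within positions -28 to -5; the function is a hypothetical stand-in and does not reproduce SignalSleuth2 or the RSAT Z-score calculation.

```python
from collections import Counter


def kmer_presence_frequency(sequences, k=5, upstream=100, region=(-28, -5)):
    """Percentage of sequences that contain each k-mer inside `region`.

    Sequences are assumed to hold `upstream` nt ending at the poly(A)
    site (position -1), followed by the downstream flank.
    """
    start = upstream + region[0]    # index of position -28
    end = upstream + region[1] + 1  # one past the index of position -5
    hits = Counter()
    for seq in sequences:
        window = seq[start:end].upper().replace("T", "U")
        # A set ensures each sequence contributes at most once per k-mer.
        hits.update({window[i:i + k] for i in range(len(window) - k + 1)})
    total = len(sequences)
    return {kmer: 100.0 * count / total for kmer, count in hits.items()}


# Toy input: UGUAA inside the window, absent, and outside the window.
seqs = [
    "C" * 80 + "UGUAA" + "C" * 65,
    "C" * 150,
    "C" * 10 + "UGUAA" + "C" * 135,
]
freq = kmer_presence_frequency(seqs)
print(round(freq["UGUAA"], 2))  # 33.33
```

Counting presence rather than total occurrences keeps the frequency bounded by 100%, which matches how percentages are reported in the text.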


Single nucleotide profiles and poly(A) signal variations in the genic and intergenic regions

To understand the variations of polyadenylation, the single nucleotide profiles and poly(A) signals were investigated using data from three different sequencing platforms as well as the combined dataset. As shown in Figure 1, 5'-UTRs always account for a similar proportion of sites (~1%) in all four datasets: ESTs (Figure 1 C), 454 (Figure 1 D), Illumina (Figure 1 E) and high-quality Illumina data (i.e., poly(A) sites supported by at least 3 reads, Figure 1 F). Figure S3 shows that the NUE regions (-28 to -5) have a clear U-peak followed by an A-peak; the peaks are obvious in the EST and 454 data (>40%), whereas the Illumina data do not show such high peaks (~33%). Poly(A) sites (-1 position) always possess a high A-peak (~50%) in all 3 datasets. In terms of poly(A) signals, UGUAA is the most frequent signal (~20%) in the NUE regions (Figure S3 and Table S1). UGUAA has an obviously higher frequency than other pentamers, especially in the 454 and Illumina data. Because the number of collected poly(A) sites in 5'-UTRs is too low (~1%), RSAT cannot give Z-scores for the signals from any of the three datasets.

3'-UTRs always have the highest number of poly(A) sites among all genic and intergenic regions (i.e., 49% in the Illumina dataset and >80% in the EST and 454 datasets, see Figure 1). In the single nucleotide profiles (Figure S4), there is clearly a U-peak (~35%) followed by an A-peak (~40%) in the NUE regions and a high A-peak (~45%) at the poly(A) sites in all three datasets. In the NUE regions, the canonical pentamer UGUAA shows the highest frequency (>33%) and significance (Z-score>30 with an order-3 Markov model) in all 3 datasets (Table S1). Other UGUAA-like pentamers (e.g., CUGUA and UGCAA) are also detected with relatively high frequencies (>7%) and significance (Z-score>5).
The frequency of poly(A) sites in CDS varies dramatically, from 0.29% (ESTs) to 6.15% (Illumina), and after filtering out low-quality poly(A) sites (< 3 supporting reads) there is still a high proportion (3.04%) in CDS in the Illumina dataset (Figure 1). Unlike those of 5'-UTRs and 3'-UTRs, the single nucleotide profiles in CDS show no clear U-peak or A-peak in the NUE regions (-28 to -5) in any of the three datasets, and at the poly(A) sites (-1 position) the A-peak is found only in the 454 and Illumina datasets (Figure S5). Among the pentamers in the NUE region (Figure S5 and Table S1), UGUAA still has the highest frequency (19.35%) in ESTs (RSAT cannot give a Z-score because the dataset is small). In the 454 dataset, UGUAA can still be found, but at low frequency (6.87%). In the Illumina dataset, UGUAA is not among the top 50 ranked pentamers, but other signals (e.g., GGCAA and AGGGC) are found in the NUE region (Figure S5 and Table S1).

ESTs and 454 have similar percentages (~0.5%) of poly(A) sites located in introns, but more intronic poly(A) sites are detected in the Illumina data (12.44%), even after filtering out the low-quality poly(A) sites (10.22%) (Figure 1). The single nucleotide profiles and signals in the NUE are also similar in the EST and 454 datasets: there is clearly a U-peak followed by an A-peak, and UGUAA (~40%) is dominant in the NUE regions (Figure S6). In the Illumina dataset, however, the single nucleotide profile has a dramatically high G content (50.51% on average) and no clear U-peak or A-peak is detected; UGUAA is not detected, but G-rich pentamers (e.g., GUGGG, GGUGG and GAGGG) are found with high frequencies (>40%) and significance (Z-score>10).

The proportions of poly(A) sites in intergenic regions are similar in the EST and 454 datasets (10.55% and 16.94%, respectively), yet more sites are detected in the Illumina (30.97%) and high-quality Illumina data (22.29%) (Figure 1). The single nucleotide profiles and signals are similar in ESTs and 454: a U-peak is followed by an A-peak in the NUE regions, and a dominant A-peak is also found at the poly(A) site (-1 position); UGUAA has the highest frequency (39.63% and 28%) and significance (Z-score=10.4 and 15.29) in the two datasets (Figure S7 A and B). In the single nucleotide profile (Figure S7 C), the Illumina data have a consistently high G content (33.39% on average) except at the poly(A) sites, which is similar to the profile of intronic poly(A) sites (Figure S6 C). There is a weak U-peak (24.71%) followed by an A-peak (24.03%) in the NUE region, and UGUAA is not dominant (3.06% in frequency), but G-rich signals (e.g., GUGGG, GGUGG and GAGGG) are found with high significance (Z-score>16) (Table S1).

Introns and CDS with poly(A) sites are associated with larger sizes

Introns with poly(A) sites are much longer than those without an internal poly(A) site. There are 25,282 poly(A) sites located in 12,135 introns from 5,930 genes. Three control groups (NPA1, NPA2 and NPA3) with the same intron count (12,135) but without a poly(A) site were randomly selected as control datasets. We found that the average length of introns with and without poly(A) sites is 451 nt (median=345) and 259 nt (median=219), respectively. The introns with poly(A) sites are significantly larger than those without poly(A) sites (Figure 3 A, Wilcoxon tests, p-value < 2.2e-16). Similarly, the CDS regions with poly(A) sites are also longer than those without a poly(A) site. In our data analysis, 13,133 poly(A) sites were found in 7,190 CDS regions of 5,064 genes.
As in the intron analysis, three control groups of CDS regions without poly(A) sites were randomly selected; the average length of CDS regions with and without poly(A) sites is 506 nt (median=221) and 242 nt (median=129), respectively. Statistically, the CDS regions with poly(A) sites are significantly larger than those without poly(A) sites (Figure 3 B, Wilcoxon tests, p-value < 2.2e-16).

The preference of polyadenylation on splicing genes in C. reinhardtii

In the combined dataset, 9.69% (25,411) of the poly(A) sites are located in intron regions (Figure 1 B); such sites are referred to as intronic poly(A) sites. Intronic poly(A) sites, categorized by constitutive/retained introns and by protein-coding/non-coding genes, are shown in Table 2. There were 139,857 introns from coding genes and 2,666 introns from potential non-coding genes. The percentage of intronic poly(A) sites is almost equal between the introns of coding (17%) and non-coding genes (15%), whereas it is ~2.5 times higher in constitutively spliced introns (23%) than in retained introns (9%). Such results may indicate that polyadenylation has no preference for coding or non-coding genes, but strongly prefers constitutive introns over retained introns.
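The length comparisons reported above (introns and CDS regions with versus without poly(A) sites) rely on Wilcoxon tests. For readers unfamiliar with the test, a self-contained sketch of the two-sample Wilcoxon rank-sum (Mann-Whitney U) statistic is given below. It uses a normal approximation with tie correction and no continuity correction, so its p-values can differ slightly from those of statistical packages; the function name and the toy lengths are hypothetical and are not data from this study.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, ties corrected).

    Returns (U statistic of x, z-score, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        raise ValueError("all values are identical; the test is undefined")
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, z, p_value


# Hypothetical intron lengths (nt), for illustration only.
with_site = [345, 451, 520, 610, 700]
without_site = [180, 219, 240, 259, 300]
u, z, p = rank_sum_test(with_site, without_site)
print(u, round(z, 3), round(p, 4))
```

A rank-based test suits length data because intron and CDS lengths are strongly right-skewed, as the gap between the reported means and medians suggests.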

The unique poly(A) site categorization and APA extent variations

Based on the definition of four different poly(A) site categories (i.e., constitutive, strong, median and weak poly(A) sites, see Materials and methods for definition), the numbers and percentages of the four kinds of sites and of the corresponding genes are shown in Table 3. Constitutive and strong poly(A) sites are fewer (2,747 (4.93%) and 3,653 (6.56%), respectively), and the corresponding genes are also fewer (2,747 (16.05%) and 3,653 (21.35%), respectively). Because weak poly(A) sites are located in the same genes as strong sites, they have the same gene number (3,653, 21.35%) as strong sites. Most unique poly(A) sites are median poly(A) sites (39,044, 70.07%), and nearly half of all annotated genes (7,946, 46.43%) possess median sites. To study whether the site categories have distinct nucleotide compositions and poly(A) signals, the single nucleotide profiles (-50 to +25) and the signals in the NUE regions (-28 to -5) are shown in Figure 4. Strong poly(A) sites clearly have a high U-peak (42.70%) and A-peak (45.00%) in the NUE region and a dominant A-peak at the poly(A) site in the single nucleotide profile, and UGUAA has the highest frequency (46.45%) and significance (Z-score=8.04 with an order-3 Markov model) among the pentamers of the NUE region (Figure 4 B). Constitutive poly(A) sites have features similar to those of strong sites, but with high G content (34.67% on average) in the single nucleotide profile and G-rich signals in the NUE region (Figure 4 A). Weak and median poly(A) sites share similar features: high G content (34.89% and 36.62%, respectively) in the single nucleotide profiles and low UGUAA frequencies, especially in weak sites (UGUAA is not detected in the top 50 ranked pentamers) (Figure 4 C and D).

The numbers of genes with poly(A) site clusters (PACs) are shown in Figure 5. The APA extent varies dramatically among datasets (Table 4). The APA extent is up to 67.78% according to the total PACs in C. reinhardtii. Yet only 7.87% of genes are APA genes based on the EST data; as the datasets grow, the APA extent increases dramatically, up to 63.46% in the Illumina dataset. Statistical analysis shows a significant relationship between APA extent and the size of the poly(A) datasets (Pearson correlation coefficient: r = 0.995, p = 0.005). Clearly, there are fewer genes with multiple PACs in small datasets, especially in the high-APA group (>=5 PACs). For example, only 0.11% and 1.80% of genes have at least five PACs based on the 6,770 and 17,828 PACs in the EST and 454 datasets, respectively (Figure 5 C and D). In contrast, in the Illumina dataset, which has 50,664 PACs in total, 23.05% of genes have at least five PACs.

It is interesting to investigate the biological functions of the high APA genes (i.e., the genes with at least five PACs). Utilizing the GO annotation v4.0 from JGI and GOEAST (Zheng and Wang, 2008), the highly significant GO functions are listed in Table 5. The significant GO terms involve molecular functions and biological processes, mainly including receptor activities, non-coding RNA processing, protein synthesis, and hydrolase and ligase activities. Such functions imply that the APA genes could be related to the formation of multiple protein isoforms.
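The APA extent and the share of high APA genes discussed above are both fractions of genes with at least a given number of PACs. A minimal sketch of that calculation follows. The choice of denominator is an assumption: by default the function divides by the genes that have at least one PAC, and `total_genes` can be passed to divide by all annotated genes instead; the source text does not state which was used. The gene identifiers and counts are hypothetical.

```python
def percent_genes_with_pacs(pacs_per_gene, min_pacs=2, total_genes=None):
    """Percentage of genes having at least `min_pacs` poly(A) site clusters.

    `pacs_per_gene` maps a gene identifier to its PAC count. With
    min_pacs=2 the result is an APA extent; with min_pacs=5 it is the
    share of high APA genes as defined in the text.
    """
    denominator = total_genes if total_genes is not None else len(pacs_per_gene)
    if denominator <= 0:
        raise ValueError("no genes to summarize")
    qualifying = sum(1 for count in pacs_per_gene.values() if count >= min_pacs)
    return 100.0 * qualifying / denominator


# Hypothetical PAC counts, for illustration only.
counts = {"gene_a": 1, "gene_b": 2, "gene_c": 5, "gene_d": 3}
print(percent_genes_with_pacs(counts))                  # 75.0
print(percent_genes_with_pacs(counts, min_pacs=5))      # 25.0
print(percent_genes_with_pacs(counts, total_genes=8))   # 37.5
```

Because deeper datasets add PACs to genes that previously had one, this percentage rises with dataset size, which is the dependence reported in Table 4.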

Discussion

Deep sequencing technologies have provided us with unprecedented sequence information. With a much larger collection of poly(A) site data from C. reinhardtii, we were able to perform an in-depth analysis of the distribution of polyadenylation sites and their relationship with splicing. Our results provide further information about this model algal organism in terms of its RNA processing profiles and the potential of APA in regulating gene expression.

Poly(A) site distribution and polyadenylation profile variation in the genome and different datasets

According to the current genome annotation, about 40% of the poly(A) sites are located in intergenic regions. This may result from inaccurate or incomplete annotation caused by an insufficient number of ESTs/cDNAs and reads used for gene annotation. After extending the annotated 3'-ends by 50 nt, the proportion of 3'-UTR sites increases to 55.67%, whereas 28.48% of the sites are still located in intergenic regions. These intergenic poly(A) sites might come from unannotated genes or indicate the presence of novel polyadenylated transcripts (Lopez et al., 2006; Wu et al., 2011). The poly(A) site distribution in our EST data is similar to the results of Shen et al., who used 16,952 in silico-verified poly(A) sites from EST sequences (Shen et al., 2008b). Among the three datasets from different sequencing platforms, the genome-wide distribution for ESTs is much more similar to that for 454 than to that for Illumina (Figure 1). This may be caused by differences in read length, considering that 454 reads are much longer (>400 bp) than Illumina reads (~50 bp). Moreover, deep sequencing (454 and Illumina) can reveal many more poly(A) sites than Sanger sequencing. Interestingly, 5'-UTRs always have low site abundance (~1%) in the three datasets, which may suggest that 5'-UTRs are more stable in transcription and/or translation and may be less involved in APA events.
This result is consistent with the hypothesis that a shorter pre-AUG poly(A) tract in 5'-UTRs can bind to translation initiation factors to enhance translation initiation, while a long tract (>=12) will bind to Pab1p, resulting in repression of translation (Xia et al., 2011). The single nucleotide profiles and poly(A) signals clearly differ among the genic and intergenic regions. 3'-UTRs have the most dominant UGUAA signal, which is consistent with published results (Shen et al., 2008b; Wodniok et al., 2007) in C. reinhardtii. No clear NUE region or significant poly(A) signal is detected around CDS poly(A) sites in any of the three datasets (Figure S5), and a similar result is also found in Arabidopsis (Wu et al., 2011). In the Illumina data, the single nucleotide profiles and signals in intron and intergenic regions have much higher G content than those in the EST and 454 data. This feature could be caused by the extreme base composition bias in Illumina sequencing (Oyola et al., 2012), because the genome of C. reinhardtii is strongly GC-biased, with a GC content of up to 63.45%, which is significantly higher than that of multicellular organisms (Merchant et al., 2007). Through Direct RNA Sequencing (DRS) technology, the novel poly(A) signals TTTTTTTTT and AAWAAA (W=A/T) were found around the NUE region in human (Ozsolak et al., 2010). Such results may also be caused by sequencing bias, because the human genome is AT-rich. Another possible explanation is internal priming: it has been reported that internal priming by reverse transcriptase causes artifacts within protein-coding regions when oligo(dT)-based primers are used in A. thaliana DRS sequencing (Sherstnev et al., 2012).

The influence of polyadenylation on splicing events in C. reinhardtii

In our study, CDS regions and introns with poly(A) sites are much longer than those without poly(A) sites, and similar results have been found for intronic poly(A) sites in Arabidopsis (Wu et al., 2011) and human (Tian et al., 2007). Although weak 5' splice sites (5'ss) are detected in introns with poly(A) sites in Arabidopsis (Wu et al., 2011) and human (Tian et al., 2007), no significant difference in 5'ss strength exists between introns with and without poly(A) sites in C. reinhardtii in our study (data not shown). It therefore seems that polyadenylation may not prefer weak 5'ss in C. reinhardtii, and UGUAA-dominated species (e.g., C. reinhardtii) may differ from AAUAAA-dominated species (e.g., Arabidopsis and human) in how splice site selection relates to polyadenylation. Although intron retention makes up half of the alternative splicing events (305 out of 611) in C. reinhardtii (Labadorf et al., 2010), our study shows that constitutively spliced introns are ~2.5 times more enriched with poly(A) sites than retained introns, suggesting that polyadenylation occurs preferentially in constitutively spliced introns. It is thus possible that this preference is the result of interplay between polyadenylation and splicing.

Alternative polyadenylation in C. reinhardtii

The extent of APA in C. reinhardtii is up to 68% based on our combined datasets. We used a highly strict criterion (each site supported by at least 3 ESTs/reads) and the new annotation file (v4.3) to filter the unique poly(A) sites and obtain PACs that can appropriately define APA. Interestingly, APA is found in only 8% of genes based on our EST data (from 21,041 poly(A) sites), whereas Shen et al. report up to 33% APA based on 16,952 poly(A) sites from EST data and gene annotation v3.1 (Shen et al., 2008b). It is clear that the observed extent of APA depends on the size of the dataset and the accuracy of the genome annotation, and similar results have been obtained in Arabidopsis and rice through deep sequencing (Shen et al., 2011; Wu et al., 2011).

In our study, more poly(A) sites are detected in CDS and intron regions, especially in the deep-sequencing data (454 and Illumina), and there is no clear NUE region or significant poly(A) signal around CDS poly(A) sites. APA located in intron/CDS regions results in different protein isoforms and thus differs substantially from APA located in 3'-UTRs. APA in introns/CDS and in 3'-UTRs may involve different regulatory mechanisms (Di Giammartino et al., 2011). Sequencing results from protein-coding regions (CDS/intron regions) may be dubious because of internal priming (Sherstnev et al., 2012), so the larger number of poly(A) sites detected in CDS and intron regions, especially in the Illumina data, may not reflect the real number of poly(A) sites. It is equally possible that the differences in poly(A) signals between CDS/introns and 3'-UTRs relate to different APA mechanisms. Although the functions and mechanisms of APA have been reported in plants and animals (Di Giammartino et al., 2011; Mangone et al., 2010; Ozsolak et al., 2009, 2010; Shen et al., 2011; Singh et al., 2009; Thomas et al., 2010; Wu et al., 2011; Xing and Li, 2010), no verified biological function or mechanism of APA genes in C. reinhardtii has been reported. Our in silico analysis of APA genes indicates that receptor genes, non-coding RNA processing and protein metabolism are mainly affected by APA in C. reinhardtii. The significance of these bioinformatics findings remains to be verified by wet-lab experiments in the near future.

Materials and methods

Data collection and polyadenylation sites definition

Sanger-based EST sequences were collected from both JGI and NCBI GenBank (http://www.ncbi.nlm.nih.gov/genbank/). Our collaborator (Dr. Olivier Vallon from the Institut de Biologie Physico-Chimique, Paris, France) provided us with 454 data for C. reinhardtii that have not yet been deposited in the NCBI SRA. In addition, further publicly available 454 data and most of the Illumina RNA-Seq data were downloaded from DNAnexus (http://sra.dnanexus.com/), a robust web portal that provides user-friendly search and browsing interfaces for the NCBI SRA. For EST data, raw EST trace files were used, when available, to identify poly(A)/(T) tails by identification of cDNA termini (Liang et al., 2008). Detailed information about data sources, poly(A) sites obtained, genome sequences and gene annotation is listed in Table 1. For ESTs, the minimum length of a valid poly(A) tail was set to 10 nt. The maximum number of adenines that were an integral part of a poly(A) tail but also mapped to the genome was set to 4 nt. The maximum non-templated nucleotide addition (Jin and Bian, 2004) between the poly(A) tract and the poly(A) site (e.g., the mapped end of the cDNA before its poly(A) tail) was set to 3 nt. All ESTs with identified poly(A) tails were mapped to the reference genome sequences using GMAP (Wu and Watanabe, 2005). The resultant EST-to-genome mapping results were then analyzed and filtered for valid genomic hits using a protocol similar to that described in Liang et al. (2008). For 454 and Illumina data, poly(A) tails were identified by our in-house tool SCOPE++ (http://code.google.com/p/scopeplusplus/), requiring a length of at least 15 nt and a purity of at least 95%, and the sequence reads with a valid poly(A) tail were then mapped to the genome using Helicos Heliosphere (Ozsolak et al., 2010). Because poly(A) tails detected in ESTs or RNA-Seq reads are added post-transcriptionally, they should not map to the genome except where internal priming is likely to have occurred. Internal priming was defined as the presence of at least 6-7 consecutive adenines (As) in a 10-nt window within the -10 to +10 region around a poly(A) site in the genomic sequence (Tian et al., 2005). Individual genomic poly(A) sites were then determined from the sequence-to-genome mapping results, and poly(A) sites that might be due to internal priming were filtered out. All individual poly(A) sites were consolidated into non-redundant, unique poly(A) sites with distinct genomic coordinates. Using these unique poly(A) sites, we extracted 100 nt of upstream and 50 nt of downstream sequence for further analysis (each poly(A) site is defined as the -1 position (Shen et al., 2008b)). Using SignalSleuth2, a program that conducts an exhaustive search and ranks all motifs by frequency, all of the aforementioned 150-nt sequence fragments were scanned and the occurrences of different motifs at each position were counted.
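The internal-priming filter and the extraction of flanking sequence described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names, the 0-based coordinate convention (the site index is taken as the -1 position), and the choice of 6 as the run threshold (the text gives 6-7) are all hypothetical.

```python
def has_internal_priming(genome, site, min_run=6, window=10, flank=10):
    """Flag a poly(A) site whose genomic neighbourhood is A-rich.

    Assumption: `site` is the 0-based index of the -1 position, so the
    -10..+10 region is the `flank` nt ending at `site` plus the `flank`
    nt that follow it.
    """
    start = max(0, site - flank + 1)
    end = min(len(genome), site + flank + 1)
    region = genome[start:end].upper()
    run = "A" * min_run
    # Slide a `window`-nt frame over the region and look for the A run.
    last = max(1, len(region) - window + 1)
    return any(run in region[i:i + window] for i in range(last))


def flanking_sequence(genome, site, upstream=100, downstream=50):
    """Return the 150-nt fragment (-100..+50) around a site, or None
    if the fragment would run off either end of the sequence."""
    start = site - upstream + 1
    end = site + downstream + 1
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


# Hypothetical toy sequence: a run of six As right after the site.
toy = "GC" * 6 + "A" * 6 + "GC" * 6
print(has_internal_priming(toy, 11))   # site followed by AAAAAA
print(has_internal_priming(toy, 25))   # site far from the A run
```

Sites for which `has_internal_priming` returns True would be discarded before the unique poly(A) sites are compiled.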

Poly(A) signals variations in different parts of genome

In this study, the gene annotation (v4.3) from Phytozome (http://www.phytozome.net/) was used to define the 3'-UTR, 5'-UTR, CDS, intron and intergenic regions of the C. reinhardtii genome. Because gene-end annotation is often incomplete, the 3'-UTR region usually needs to be extended (Wu et al., 2011); this extension allows the distribution of poly(A) sites immediately downstream of the annotated 3'-UTR (namely in the intergenic region) to be investigated more accurately. A new program, SignalSleuth2, was developed to perform an exhaustive search for the most frequent signals within a defined region of the given sequence data. SignalSleuth2 is an improved version of SignalSleuth (Loke et al., 2005) and offers new functionalities, including multiple scanning modes and a Position-Specific Scoring Matrix (PSSM). For sequences that contain more than one signal, SignalSleuth2 provides an optional signal-distance parameter (-gap); scanning with this parameter is called distribution scanning mode. This mode reports the signal distribution in the given region of the sequence reads, avoids missing counts for a specific motif that appears multiple times in that region, and prevents over-counting of overlapping motifs. For example, with the gap set to two, ATATATAT would be counted once for each of the motifs ATATAT and TATATA. If the distance is not considered (namely gap=0), we call it overlapping scanning mode. Moreover, in order to investigate signal frequencies in sequence reads, SignalSleuth2 checks whether each sequence contains the signal (an occurrence of one or zero); if a sequence contains more than one signal, the program chooses the signal closest to the poly(A) site. This is called frequency scanning mode. For each mode, the program provides PSSM results at the same time. To assess statistical significance, we used the Z-score from Regulatory Sequence Analysis Tools (RSAT), which is based on Markov chain models (van Helden et al., 1998).
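The difference between the overlapping and frequency scanning modes can be illustrated with a short sketch. It is not the SignalSleuth2 implementation: the function names are hypothetical, and the -gap parameter of the distribution scanning mode is deliberately not modelled because its exact semantics are not specified in this section.

```python
def overlapping_positions(seq, motif):
    """All 0-based start positions of `motif` in `seq`, overlaps included
    (the behaviour described for overlapping scanning mode, gap=0)."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)  # step by one so overlaps are kept
    return positions


def closest_to_site(seq, motif, site):
    """Frequency-style scan: report at most one hit per sequence,
    the occurrence whose start lies nearest to `site` (None if absent)."""
    hits = overlapping_positions(seq, motif)
    if not hits:
        return None
    return min(hits, key=lambda p: abs(p - site))


print(overlapping_positions("ATATATAT", "ATATAT"))  # two overlapping hits
print(overlapping_positions("ATATATAT", "TATATA"))  # one hit
print(closest_to_site("ATATATCCCCATATAT", "ATATAT", 15))
```

In overlapping mode, ATATATAT contributes two counts for ATATAT, which is the over-counting that the distance parameter is meant to prevent.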

APA analysis from the different datasets

It is well known that the poly(A) complex appears to be “sloppy” in recognizing poly(A) sites when cleaving pre-mRNAs. In order to eliminate the microheterogeneity of poly(A) sites (Liang et al., 2008; Shen et al., 2008b; Tian et al., 2005), iterative clustering of adjacent unique poly(A) sites was performed on each chromosome. We developed an algorithm based on density theory and restraint theory in data mining to cluster poly(A) sites. In our algorithm, we utilized the advanced Ward's minimum variance method (Szekely and Rizzo, 2005), a popular hierarchical clustering method. To obtain high-quality poly(A) site clusters (PACs), individual unique poly(A) sites were retained only if they were supported by at least three ESTs or RNA-Seq reads. Within a PAC, the unique poly(A) site with the most sequence read support was treated as the representative site of that PAC. The PACs therefore represent polyadenylation sites without microheterogeneity, and an APA gene is defined as a gene that has at least two PACs. A non-APA gene has only one PAC, which is also called a constitutive poly(A) site. For APA genes with multiple PACs, a PAC with 75% or more of the supporting reads is categorized as a strong poly(A) site, and the other sites in the same gene are called weak poly(A) sites. If no strong site is detected in an APA gene, all of its poly(A) sites are classified as median poly(A) sites (Hu et al., 2005). To investigate the gene ontology (GO) functions of high-quality APA genes (i.e., those with at least five PACs), the GO annotation (v4.0) file for C. reinhardtii was downloaded from JGI (DOE Joint Genome Institute). GOEAST (Gene Ontology Enrichment Analysis Software Toolkit) (Zheng and Wang, 2008) was then used to assess the significance (p-value) of the GO terms.

The splicing events in C. reinhardtii

The splicing data were obtained from a separate study of splicing in C. reinhardtii (Praveen Raj Kumar et al., in preparation). A total of over 7 million cDNA sequences from both Sanger and NGS technologies were used to collect splicing events. An automated version of PASA (Campbell et al., 2006) integrated with GMAP (Wu and Watanabe, 2005) was used to deduce alternatively spliced isoforms. Contaminants (e.g., adapters, linkers and poly(A) tails) and low-complexity sequences were removed from the raw sequence data using seqclean (http://compbio.dfci.harvard.edu/tgi/software/). The clean cDNA reads were then mapped to the predicted protein-coding mRNAs of the AUGUSTUS u10.2 gene annotation for C. reinhardtii (Stanke et al., 2008); mapped cDNAs were defined as protein-coding cDNAs and unmapped cDNAs as non-coding cDNAs. The protein-coding and non-coding cDNAs were then used separately in PASA to deduce alternative splicing. All of the constitutively spliced introns (i.e., those that are spliced in all the isoforms) and retained introns (i.e., those that are retained in some isoforms but spliced in others) were extracted from the PASA database using Perl scripts, and the genes containing these introns were then categorized into four groups based on coding/non-coding status and constitutive/retained introns (see Table 2).
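The classification of PACs into constitutive, strong, weak and median sites described under the APA analysis can be written compactly. The sketch below is illustrative only: the function names and the input layout (one list of supporting read counts per gene, one dictionary of read counts per PAC) are assumptions, and the clustering step itself is not reproduced.

```python
def classify_pacs(read_counts, strong_fraction=0.75):
    """Label each PAC of one gene from its supporting read counts.

    One PAC        -> constitutive
    A PAC >= 75%   -> that PAC is strong, all others weak
    Otherwise      -> every PAC is median
    """
    if not read_counts:
        return []
    if len(read_counts) == 1:
        return ["constitutive"]
    total = sum(read_counts)
    is_strong = [count / total >= strong_fraction for count in read_counts]
    if any(is_strong):
        return ["strong" if flag else "weak" for flag in is_strong]
    return ["median"] * len(read_counts)


def representative_site(site_support):
    """Pick the coordinate with the most read support within one PAC.
    `site_support` maps a genomic coordinate to its read count."""
    return max(site_support, key=site_support.get)


# Hypothetical read counts, for illustration only.
print(classify_pacs([120]))
print(classify_pacs([80, 15, 5]))
print(classify_pacs([50, 30, 20]))
print(representative_site({1040: 3, 1043: 11, 1047: 4}))
```

Because the threshold is above 50%, at most one PAC per gene can be strong, which is why the remaining PACs of such a gene are all labelled weak.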


Tables

Table 1. The data sources and numbers of unique poly(A) sites from three sequencing platforms

Data types         Source                           No. of reads with poly(A) tails   No. of unique poly(A) sites
Genome sequences   Phytozome v4.3
Gene annotation    Phytozome v4.3
ESTs data          NCBI, JGI, Liang et al., 2008    338,234                           21,041
454 data           Collaborator, DNAnexus           824,565                           88,184
Illumina data      DNAnexus                         22,372,354                        195,266
Total                                               23,535,153                        256,771


Table 2. Number of poly(A) sites in different gene types

                                Constitutive introns                       Retained introns
                                Protein-coding genes   Non-coding genes    Protein-coding genes   Non-coding genes
Intron number                   134,708                2,550               5,151                  116
No. of intronic poly(A) sites   17,102                 247                 208                    6
% of intronic poly(A) sites     13%                    10%                 4%                     5%

Table 3. Number of poly(A) site clusters (PAC) in different categories

Poly(A) site classification   No. of PAC   PAC percentage (%)   No. of genes   Gene percentage (%)
Constitutive                  2,747        4.93                 2,747          16.05
Strong                        3,653        6.56                 3,653          21.35
Weak                          10,276       18.44                3,653          21.35
Median                        39,044       70.07                7,946          46.43

Table 4. APA extent variation among the four datasets

Dataset    PAC number   APA extent (%)
EST        11,035       7.87
454        30,086       27.49
Illumina   88,304       63.46
Total      97,479       67.78


Table 5. The most significant GO functions in high-quality APA genes with 5 or more PACs

GO_ID        Ontology             Term                                                                P-value
GO:0004872   molecular_function   receptor activity                                                   3.24E-33
GO:0034660   biological_process   ncRNA metabolic process                                             1.16E-23
GO:0008236   molecular_function   serine-type peptidase activity                                      3.95E-23
GO:0017171   molecular_function   serine hydrolase activity                                           3.95E-23
GO:1901135   biological_process   carbohydrate derivative metabolic process                           5.48E-23
GO:0006399   biological_process   tRNA metabolic process                                              1.07E-21
GO:0006418   biological_process   tRNA aminoacylation for protein translation                         1.34E-18
GO:0043038   biological_process   amino acid activation                                               1.34E-18
GO:0043039   biological_process   tRNA aminoacylation                                                 1.34E-18
GO:0009056   biological_process   catabolic process                                                   1.43E-17
GO:0004812   molecular_function   aminoacyl-tRNA ligase activity                                      5.44E-16
GO:0016875   molecular_function   ligase activity, forming carbon-oxygen bonds                        5.44E-16
GO:0016876   molecular_function   ligase activity, forming aminoacyl-tRNA and related compounds       5.44E-16
GO:0003824   molecular_function   catalytic activity                                                  8.21E-16
GO:0016798   molecular_function   hydrolase activity, acting on glycosyl bonds                        2.26E-12
GO:0004553   molecular_function   hydrolase activity, hydrolyzing O-glycosyl compounds                2.77E-12


Figures

Figure 1. The distribution of poly(A) sites in the genic and intergenic regions in different datasets.

(A) All poly(A) sites from ESTs, 454 and Illumina before 50-nt extension for 3'-UTR. (B) All poly(A) sites from ESTs, 454 and Illumina after 50-nt extension for 3'-UTR. (C) All EST poly(A) sites after 50-nt extension. (D) All 454 poly(A) sites after 50-nt extension. (E) All Illumina poly(A) sites after 50-nt extension. (F) All high-quality Illumina poly(A) sites (i.e., with support of >= 3 sequence reads) after 50-nt extension.


Figure 2. The single nucleotide profiles from different datasets. (A) All unique poly(A) site data (including ESTs, 454 and Illumina). (B) Illumina data. (C) EST data. (D) 454 data.


Figure 3. The length difference of intron and CDS with and without poly(A) sites. (A) Intron. (B) CDS. PA: poly(A) sites; NPA1: control group 1 without poly(A) sites; NPA2: control group 2 without poly(A) sites; NPA3: control group 3 without poly(A) sites.


Figure 4. The single nucleotide profiles (-50 to +25) and signals in NUE (-28 to -5) from different categories of all the combined poly(A) sites. (A) Constitutive poly(A) sites. (B) Strong poly(A) sites. (C) Weak poly(A) sites. (D) Median poly(A) sites.


Figure 5. The genes with different Poly(A) site clusters (PACs) among different datasets. (A) All poly(A) data (including ESTs, 454 and Illumina). (B) Illumina data. (C) ESTs data. (D) 454 data.


Supplementary data

Supplemental File 1: Figure S1. The distance distribution of intergenic poly(A) sites after 3'-UTRs. 1-1000 nt downstream of 3'-UTR is selected to investigate the distribution of poly(A) sites in intergenic region. X-axis shows the 200 sub-regions which are 5 nt in length (e.g., 1-5 and 6-10). Y-axis labels the percentage of poly(A) sites for each sub-region.


Supplemental File 2: Figure S2. The top frequent motifs from different datasets in the NUE region. (A) All poly(A) data (including ESTs, 454 and Illumina). (B) Illumina data. (C) EST data. (D) 454 data.


Supplemental File 3: Figure S3. The single nucleotide profiles (-50 to +25) and top frequent motifs in NUE (-28 to -5) in 5'-UTRs from different datasets. (A) EST data. (B) 454 data. (C) Illumina data.


Supplemental File 4: Figure S4. The single nucleotide profiles (-50 to +25) and top frequent motifs in NUE (-28 to -5) in 3'-UTRs from different datasets. (A) EST data. (B) 454 data. (C) Illumina data.


Supplemental File 5: Figure S5. The single nucleotide profiles (-50 to +25) and top frequent motifs in NUE (-28 to -5) in CDS from different datasets. (A) EST data. (B) 454 data. (C) Illumina data.


Supplemental File 6: Figure S6. The single nucleotide profiles (-50 to +25) and top frequent motifs in NUE (-28 to -5) in intron from different datasets. (A) EST data. (B) 454 data. (C) Illumina data.


Supplemental File 7: Figure S7. The single nucleotide profiles (-50 to +25) and top frequent motifs in NUE (-28 to -5) in intergenic regions from different datasets. (A) EST data. (B) 454 data. (C) Illumina data.


Supplemental File 8: Table S1. The conserved pentamers detected in the NUE regions in C. reinhardtii.


Acknowledgements

Dr. Chun Liang and Dr. Qingshun Quinn Li managed and coordinated the project. Zhixin Zhao carried out data collection, data analysis, and manuscript writing. Dr. Xiaohui Wu participated in data analysis and developed the SignalSleuth2 program. Praveen Raj Kumar provided the splicing data. Ming Dong implemented poly(A) site clustering. All authors participated in manuscript writing and editing. This project was supported by an NIH AREA grant (1R15GM94732-1 A1 to CL and QQL). This project is also funded by the National Natural Science Foundation of China (Nos. 61174161 and 61201358), the Natural Science Foundation of Fujian Province of China (No. 2012J01154), the specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20120121120038), the Fundamental Research Funds for the Central Universities in China (Xiamen University: Nos. 2011121047, 201112G018 and 201212G005), and the Fundamental Research Fund for the university student Creative and Entrepreneurship training program in China (Xiamen University: No. XDDC201210384063).

This paper will be submitted to the journal Molecular Genetics and Genomics.


Chapter 4

Introduction of Tandem Repeat

Tandem repeats (TRs), also known as satellite DNA, are DNA motifs that contain at least two adjacent repeating units. They are widespread in prokaryotes (Orsi et al., 2010) and eukaryotes (Christians and Watt, 2009; Fujimori et al., 2003; Li et al., 2004; da Maia et al., 2009; Mayer et al., 2010; Roorkiwal and Sharma, 2011; Sharma et al., 2007; Subramanian et al., 2003; Sureshkumar et al., 2009), and are found even in viruses (Zhao et al., 2012). Generally, TRs are divided into two categories based on repeat unit size: microsatellites (unit size: 1-6 or 1-10 bp, also known as simple sequence repeats (SSRs)) and minisatellites (unit size: 10-60 or 10-100 bp) (Gemayel et al., 2012; Mayer et al., 2010). In plants and animals, SSRs are widely detected in both mRNAs (cDNA/ESTs) and genomes (Fujimori et al., 2003; Gemayel et al., 2010; Jurka and Pethiyagoda, 1995; Sharma et al., 2007; Subramanian et al., 2003; Tautz and Renz, 1984; Tóth et al., 2000). Although most published research focuses on microsatellites, Mayer et al. found that in coding regions of the arthropod Daphnia pulex, densities of longer TRs (unit size: 7-50 bp) are much higher than those of shorter TRs (unit size: 1-6 bp) (Mayer et al., 2010), suggesting the necessity and importance of including longer TRs (minisatellites) in comparative analyses. The major mechanisms of TR mutation are strand-slippage replication and recombination (Gemayel et al., 2012). In the protozoan Stylonychia, slippage replication and unequal crossover are suggested to be the cause, and the repeats are claimed to have no biological function in gene expression (Tautz and Renz, 1984). In short tandem repeats (STRs) or microsatellites, strand-slippage replication is reported as the principal mutation mechanism (Fan and Chu, 2007; Tachida and Iizuka, 1992), whereas recombination is involved much more often in minisatellite mutation than in microsatellite mutation (Richard and Pâques, 2000).
In the yeast Saccharomyces cerevisiae, the expansion and contraction of TRs are mainly caused by the repair of double-strand breaks (DSBs) (Pâques et al., 1998). Moreover, DSBs also cause intragenic repeat mutations in S. cerevisiae, and these repeat mutations can diversify cell functions (Verstrepen et al., 2005). However, the precise molecular mechanisms of TR mutation are still not clear (Gemayel et al., 2010), nor is it known how TR mutations change gene expression/regulation and genome size during evolution. TRs are extremely mutable, with mutation rates that are 10 to 100,000 times higher than those of other parts of the genome (Gemayel et al., 2010). Most mutations in TRs are changes in the number of repeating units rather than point mutations, and repeat-number mutation rates are 1 to 10^10 times higher than point mutation rates (Gemayel et al., 2010, 2012; Verstrepen et al., 2005). Repeat-number mutations can lead to serious diseases or defects, which have mainly been reported in animals, especially in humans, where more research has been conducted. The three best-known examples are Fragile X Syndrome (FRAXA), which is caused by length variation of a CGG repeat in a brain-expressed gene (FMR-1) on the human X chromosome (Verkerk et al., 1991); Spinobulbar Muscular Atrophy (SBMA), which is caused by the expansion of a CAG repeat in the androgen receptor gene on the human X chromosome (La Spada et al., 1991); and Huntington's disease (HD), which is caused by the expansion of a CAG repeat on human autosomal chromosome 4 (Walker, 2007). In plants, a well-known defect is caused by the expansion of the triplet TTC/GAA in an intron of the isopropyl malate isomerase large subunit 1 (IIL1; At4g13430) gene in Arabidopsis thaliana (Sureshkumar et al., 2009). In bacteria, contraction of the intragenic TR copy number in the Escherichia coli tolA gene has been found to enhance stress tolerance (Zhou et al., 2012). Studies of TR density variation have concluded that there is no significant relationship between genome size and TR density. A systematic study of EST-derived SSR frequencies and motifs in the moss Physcomitrella patens and 24 other algae and plants detected no significant group-specific characteristics, only species-specific ones (von Stackelberg et al., 2006). Moreover, in silico analysis of TR distribution using cDNA/ESTs detected no association between genome size and TR density among the three plant families Brassicaceae, Solanaceae and Poaceae (da Maia et al., 2009). In a genome-wide study of TRs in 12 species, including two fungi, one green alga, one plant and several animals (among them three arthropods), Mayer et al. detected a weak but non-significant correlation (Pearson correlation coefficient: P=0.111, R=0.483) between genome size and TR density (Mayer et al., 2010). In a recent study of 257 virus genomes, the relative density of simple sequence repeats (SSRs) (SSR base pairs per kilobase of genomic sequence) showed a quite weak correlation with genome size (Zhao et al., 2012). These results suggest that genome expansion/contraction is not related to TR density variation in genome evolution.
Although many hypotheses and mechanisms (e.g., polyploidy, transposon amplification and unequal homologous recombination) have been proposed in plants, the mechanisms of genome size variation during genome evolution from simple to complex are still not clear (Ai et al., 2012; Bennetzen, 2005; Grover and Wendel, 2010). TRs show a non-random distribution in genomes and are often located within genes and regulatory regions (Gemayel et al., 2012; Legendre et al., 2007; Li et al., 2002; Martin et al., 2005; Rockman and Wray, 2002; Streelman and Kocher, 2002; Vinces et al., 2009). In plants, 5'-UTRs have higher TR densities than other intragenic regions (Fujimori et al., 2003; Morgante et al., 2002; Zhang et al., 2006). For example, in A. thaliana 5'-UTRs have the highest TR densities, and the abundant motifs are CT/GA and CTT/GAA, indicating that the repeats may have roles in gene regulation (Zhang et al., 2006). It has been proposed that TR expansion in 5'-UTRs is under very strong positive selection (Morgante et al., 2002). In humans, variable TRs are abundant in genes that are involved in transcriptional regulation and morphogenesis (Legendre et al., 2007). TRs are a source of sequence variation not only in 5'-UTRs but also in promoters, CDS, introns and 3'-UTRs, all of which might be important regulatory regions in gene expression and evolution. The upstream flanks of 5'-UTRs (namely promoter regions) have the second highest TR densities among intragenic regions in A. thaliana (Zhang et al., 2006). In the yeast S. cerevisiae, up to 25% of genes possess TRs in their promoters, and variations in TR sequence length result in changes in gene expression and local nucleosome positioning (Vinces et al., 2009). Among human genes with TRs in coding regions, the most significant (P-value=8.05e-9) biological function (GO term) is regulation of transcription from RNA polymerase II promoter (Legendre et al., 2007). In the pathogen Neisseria meningitidis, the loss or gain of TAAA repeats in the nadA gene promoter affects its number of transcription factor binding sites (Martin et al., 2005). In pathogenic bacteria, variable TRs in promoters can also change the spacing of critical promoter elements, affecting phenotypic variation (Gemayel et al., 2012). Although there is no significant relationship between TR size in promoters and gene expression or phenotype, some studies do suggest that TRs in promoters are a major source of sequence-based variability in gene expression, and that TRs are associated with the rapid evolution of promoters through transcriptional and/or translational regulation (Gemayel et al., 2010, 2012). Among coding sequences (CDS), the overabundant repeat unit sizes are multiples of three nucleotides (e.g., tri- and hexa-nucleotides), because CDS is assumed to be under negative selection against frame-shift mutations for all microsatellites except those whose unit size is a multiple of three (Legendre et al., 2007; Metzgar et al., 2000; Morgante et al., 2002). Moreover, selection is strongly biased toward certain amino acid tandem repeats (e.g., glutamine, arginine, glutamate) in coding regions in 12 species (Mularoni et al., 2010). In 42 fully sequenced prokaryotic genomes, TRs in CDS are preferentially located near CDS termini, yielding U-shaped TR density curves (Lin and Kussell, 2011). Interestingly, it has been suggested that TRs may play beneficial roles in the coding sequences of a number of human genes (Gemayel et al., 2012).
Analysis of STR replacement in the introns and 3'-UTRs of disease-related genes (e.g., FGA, THO1 and vWA) in humans demonstrates that STRs have important roles in gene expression, and such replacements are often retained in important genes (Riley and Krieger, 2005). Intronic TRs can regulate gene expression through changes in repeat unit number that alter the activity of RNA-binding proteins (Gatchel and Zoghbi, 2005) and affect mRNA splicing patterns in humans (Gemayel et al., 2012). The selection pressure on 3'-UTRs is claimed to be moderately positive (Morgante et al., 2002). Overall, TRs in intragenic and regulatory regions are one of the sources of sequence variation and could contribute to the evolution of gene expression and regulation (Gemayel et al., 2010), but the mechanisms by which this regulation arises are still not well understood. Although TRs were previously considered evolutionarily neutral or “junk” DNA, microsatellites have been detected in numerous biological components and processes, such as centromeres and telomeres, chromatin organization, DNA replication and the cell cycle (Li et al., 2002). Moreover, the expansion or contraction of repeats can change biological functions, which helps species adapt rapidly to new environments (Verstrepen et al., 2005). For example, more TRs (especially AAAT repeats) are detected in evolutionary breakpoint regions (i.e., regions where synteny has been disrupted by chromosomal reorganizations) that shape genome architecture in great apes (Farré et al., 2011). It is now believed that TRs play important roles in gene and genome evolution and result in phenotypic changes that allow adaptation to different environments (Gemayel et al., 2012). With the advance of whole-genome deep sequencing, more and more genome sequences are available for searching and identifying TRs, and many bioinformatics algorithms and programs have been developed. For example, Tandem Repeats Finder uses a probabilistic model to search for perfect and imperfect TRs (repeat unit sizes of up to 2,000 bp) (Benson, 1999). Phobos uses a non-probabilistic search algorithm to identify perfect TRs (unit size up to 10,000 bp) and imperfect TRs (unit size up to 50 bp, with high detection and alignment quality) (Mayer et al., 2010). The newest versions of both programs (v4.04 for Tandem Repeats Finder and v3.3.12 for Phobos) provide a Graphical User Interface (GUI) and a command-line version, and both programs can be used on multiple platforms (e.g., Windows, Linux and Mac). The algorithms and programs have been compared comprehensively (Merkel and Gemmell, 2008). Yet how to set thresholds to acquire 'real' TRs, especially imperfect TRs, and how to compare the significance of different motifs remain challenging for all of these programs. Moreover, although it is easy to detect numerous sequences with TRs, identifying and validating their potential biological functions is much more difficult. So far, no systematic research on TR variation and comparison has been conducted on a genome-wide scale in plants. The rapid advance of sequencing technologies makes more and more plant genomes available for investigating the characteristics and distributions of TRs in both intragenic (e.g., 5'-UTR, CDS, intron and 3'-UTR) and intergenic regions.
Taking advantage of the 31 species (i.e., 29 land plants and 2 green algae) recently released in Phytozome v8.0 (http://www.phytozome.net/), we detect and characterize TRs in genomes and examine their distributions and variations in both intragenic and intergenic regions in Chapter 5. This research will facilitate our understanding of TRs and their distribution features, as well as their evolutionary trends in land plants and green algae.
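To make the definition of a perfect tandem repeat concrete (at least two adjacent copies of the same unit), the following sketch scans a sequence for perfect microsatellites with unit sizes of 1-6 bp. It is a simple greedy regular-expression scan written for illustration; it is not the algorithm of Tandem Repeats Finder or Phobos, it does not handle imperfect repeats, and the function name and default thresholds are assumptions.

```python
import re


def find_perfect_trs(seq, max_unit=6, min_copies=2):
    """Return (start, unit, copies) tuples for perfect tandem repeats.

    For each unit size k, every start position is tested with a
    lookahead so that repeats hidden behind an earlier match are not
    missed. Matches that overlap a repeat already reported for the same
    k are skipped, as are units that are themselves repeats of a shorter
    unit (e.g. "AA" or "ATAT"), which are found at the smaller k.
    """
    seq = seq.upper()
    found = []
    for k in range(1, max_unit + 1):
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (k, min_copies - 1))
        last_end = 0
        for match in pattern.finditer(seq):
            run, unit = match.group(1), match.group(2)
            start = match.start()
            if start < last_end:
                continue  # overlaps the repeat reported just before
            if unit in (unit + unit)[1:-1]:
                continue  # unit is not primitive
            found.append((start, unit, len(run) // k))
            last_end = start + len(run)
    return found


# A CTT triplet repeat embedded in a short toy sequence.
print(find_perfect_trs("GCTTCTTCTTCAG", max_unit=3, min_copies=3))
```

Raising `min_copies` or adding a minimum total length is the simplest way to suppress the many short chance repeats that a two-copy threshold reports.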


Chapter 5

Genome-wide analysis of tandem repeats in plants and green algae

Abstract

Background

Tandem repeats (TRs) are widespread in the genomes of prokaryotes and eukaryotes. It is known that TRs are non-randomly distributed within genes and regulatory regions. Based on the sequenced genomes and gene annotations of 31 plant and algal species in Phytozome v8.0 (http://www.phytozome.net/), we examined TRs on a genome-wide scale, characterized their distributions and motif features, and explored their putative biological functions.

Results

Among the 31 species (29 land plants and 2 green algae), no significant correlation was detected between TR density and genome size (Pearson correlation coefficient: r = 0.010, p = 0.957). Interestingly, the green alga Chlamydomonas reinhardtii (42,059 bp/Mbp) and castor bean Ricinus communis (55,454 bp/Mbp) showed much higher TR densities than all other species (13,209 bp/Mbp on average). In the 29 land plants, including 22 dicots, 5 monocots and 2 bryophytes, the 5'-UTR and upstream intergenic 200-nt (UI200) regions possessed the highest and second highest TR densities, respectively, whereas in the two green algae (C. reinhardtii and Volvox carteri), the highest and second highest densities were found in intron and CDS regions, respectively. In CDS regions, tri- and hexa-nucleotide motifs were the most frequently represented in all species. In intron regions, especially in the two green algae, significantly more TRs were detected near the intron-exon junctions. Within intergenic regions in dicots and monocots, more TRs were found near both the 5' and 3' ends of genes. GO annotation in the two green algae revealed that genes with TRs in introns are significantly involved in transcriptional and translational processing.

Conclusions

As the first systematic examination of TRs in plant and green algal genomes, our study showed that genome size has no significant discernible relationship with TR density. TRs displayed a non-random distribution in both intergenic and intragenic regions, suggesting that they have potential roles in transcriptional or translational regulation in plants and green algae.


Background

Tandem repeats (TRs) are DNA sequence motifs that contain at least two adjacent repeating units. They are widespread in prokaryotes and eukaryotes (Orsi et al., 2010; Roorkiwal and Sharma, 2011; Sharma et al., 2007; Sureshkumar et al., 2009; Tautz and Renz, 1984; Tóth et al., 2000). Generally, TRs are divided into two categories based on repeat unit size: microsatellites (unit size: 1-6 or 1-10 bp, also known as simple sequence repeats (SSRs)) and minisatellites (unit size: 10-60 or 10-100 bp) (Gemayel et al., 2012; Mayer et al., 2010). In plants and animals, SSRs are widely detected in both mRNAs (cDNA/ESTs) and genomes (Fujimori et al., 2003; Gemayel et al., 2010; Jurka and Pethiyagoda, 1995; Sharma et al., 2007; Subramanian et al., 2003; Tautz and Renz, 1984; Tóth et al., 2000). For example, by investigating SSRs (repeat unit size: 1-6 bp) in EST databases of 11 plant and green algal species, Victoria et al. found that dimer motifs have higher frequencies in green algae, bryophytes and ferns, while trimer motifs are more frequent in flowering plants (Victoria et al., 2011). Unlike nuclear genomes, mitochondrial genomes appear to prefer mononucleotide repeats (A/T) first and dinucleotide repeats (AT) next in 16 investigated plant species (Kuntal and Sharma, 2011). Although most published research focuses on SSRs, Mayer et al. found that in coding regions of the arthropod Daphnia pulex, densities of longer TRs (unit size: 7-50 bp) are much higher than those of shorter TRs (unit size: 1-6 bp), and suggested the importance of including longer TRs in comparative analyses (Mayer et al., 2010). TRs are extremely mutable, with mutation rates that are much higher than those of other parts of the genome (Gemayel et al., 2010). Most mutations in TRs are changes in the number of repeating units rather than point mutations (Gemayel et al., 2010, 2012; Verstrepen et al., 2005).
In humans, such repeat-number variants are related to serious diseases or defects, such as Fragile X Syndrome (FRAXA) (Verkerk et al., 1991), Spinobulbar Muscular Atrophy (SBMA) (La Spada et al., 1991) and Huntington's disease (HD) (Walker, 2007). In plants, the well-known Bur-0 IIL1 defect in Arabidopsis thaliana, which generates a detrimental phenotype, is caused by the expansion of the triplet TTC/GAA in an intron of the IIL1 gene (Sureshkumar et al., 2009). Studies of TR density variation in a few plant and animal species have concluded that there is no significant relationship between genome size and TR density (da Maia et al., 2009; Mayer et al., 2010). Based on EST data from two green algae, two mosses, a fern, a fern palm, the ginkgo tree, two conifers, ten dicots and five monocots, SSRs were found to have highly variable abundances among different species (von Stackelberg et al., 2006). Recently, a comparative analysis of 282 species including plants and animals showed no sequence conservation in centromere TRs (Melters et al., 2013). Moreover, TRs show a non-random distribution in genomes and are often located within genes and regulatory regions (Legendre et al., 2007; Li et al., 2002; Martin et al., 2005; Rockman and Wray, 2002; Streelman and Kocher, 2002; Vinces et al., 2009). Variable TRs are abundant in genes that are involved in transcriptional regulation and morphogenesis in humans (Legendre et al., 2007). Among the different genic regions in plants, 5'-UTRs have higher TR density (Fujimori et al., 2003; Morgante et al., 2002; Zhang et al., 2006). In A. thaliana, for example, 5'-UTRs have the highest TR density and the abundant motifs are the dinucleotide CT/GA and the trinucleotide CTT/GAA (Zhang et al., 2006). In the yeast Saccharomyces cerevisiae, about 25% of genes possess TRs in their promoters, and variations in repeat unit number can cause changes in gene expression and local nucleosome positioning (Vinces et al., 2009). Among coding sequences (CDS), the dominant repeat unit sizes are multiples of three nucleotides (e.g., tri- and hexa-nucleotides), because it is assumed that such motifs are selected to avoid frame-shift mutations that would affect translation (Legendre et al., 2007; Metzgar et al., 2002). In 42 fully sequenced prokaryotic genomes, the TR distributions in CDS are biased toward CDS termini, yielding U-shaped TR density curves across the span of the CDS (Lin and Kussell, 2011). So far, no systematic research on TR variation and characterization has been conducted on a genome-wide scale in plants. The rapid advance of sequencing technologies has made a number of plant and algal genomes available for investigating the characteristics and distributions of TRs in both intragenic (i.e., 5'-UTR, CDS, intron and 3'-UTR) and intergenic regions. Using genome sequence data from 31 species (i.e., 29 land plants and 2 green algae) released in Phytozome v8.0 (http://www.phytozome.net/), we detected and characterized TRs and examined their distributions and variations in intragenic and intergenic regions. This research will facilitate our understanding of TRs and their potential biological functions in transcription or translation in land plants and green algae.

Results

The TR density variation among different genome sizes

The species that we examined span a large evolutionary distance and include two green algae, a moss and a spikemoss, five monocots, and twenty-two dicots (Figure 1). As shown in Figure 3, there was no correlation between genome size and TR density (r = 0.010, p = 0.957). The mean TR density at the whole-genome level was 13,209 bp/Mbp (sd = 10,309) among all tested species except C. reinhardtii (42,059 bp/Mbp) and Ricinus communis (55,454 bp/Mbp), which showed dramatically higher TR densities than the other species. Even after excluding these two outliers, we found no significant correlation between genome size and TR density among the remaining species (r = 0.311, p = 0.101) (see Figure 3).
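The correlation test reported above can be illustrated with a short Python sketch. It is not the procedure used in this study (the correlations were computed in Minitab 16); the function names are hypothetical and the input values in the usage note are invented toy numbers, not data from the 31 genomes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: no correlation, on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Toy example (hypothetical genome sizes in Mbp and TR densities in bp/Mbp).
sizes = [120.0, 390.0, 480.0, 730.0, 2300.0]
densities = [9800.0, 15200.0, 8700.0, 21000.0, 12500.0]
r = pearson_r(sizes, densities)
t = t_statistic(r, len(sizes))
```

For the r = 0.311 reported for the 29 species remaining after the two outliers are excluded, `t_statistic(0.311, 29)` gives t of about 1.70 on 27 degrees of freedom, which is consistent with the reported two-sided p-value of about 0.10.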

The TR density variation in intragenic and intergenic regions

Sequences from functionally different intragenic regions (i.e., 5'-UTR, CDS, intron and 3'-UTR) and from progressively more distant upstream (i.e., UI200, UI500 and UI1000) and downstream (i.e., DI200, DI500 and DI1000) intergenic regions were analyzed for TRs (Figure 2). UTR annotations were not available from Phytozome v8.0 for four species (Carica papaya, Brassica rapa, Linum usitatissimum and Malus domestica); therefore, 5'-UTRs and 3'-UTRs were analyzed only for the remaining 27 species.
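To make the flanking windows concrete, the following Python sketch derives UI and DI coordinates from a transcript model using the offsets given in the Figure 2 caption (UI200/DI200: 1−200 nt; UI500/DI500: 201−700 nt; UI1000/DI1000: 701−1700 nt). The function name, the 1-based inclusive coordinate convention, the strand handling and the clipping at sequence ends are assumptions made for this sketch; the extraction in this study was done with Perl scripts whose details are not reproduced here.

```python
# Offsets (nearest, farthest) in nt from the transcript end, per Figure 2.
FLANK_OFFSETS = {"200": (1, 200), "500": (201, 700), "1000": (701, 1700)}


def flank_windows(start, end, strand, seq_len):
    """Return {region name: (lo, hi)} for the six flanking windows.

    start <= end are 1-based inclusive transcript coordinates on a
    sequence of length seq_len; strand is '+' or '-'. Windows are clipped
    to the sequence, and windows falling entirely outside it are omitted.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    windows = {}
    for label, (near, far) in FLANK_OFFSETS.items():
        left = (start - far, start - near)   # lower-coordinate side
        right = (end + near, end + far)      # higher-coordinate side
        # Upstream is the lower-coordinate side on '+', the higher on '-'.
        up, down = (left, right) if strand == "+" else (right, left)
        for name, (lo, hi) in (("UI" + label, up), ("DI" + label, down)):
            lo, hi = max(lo, 1), min(hi, seq_len)
            if lo <= hi:
                windows[name] = (lo, hi)
    return windows


plus = flank_windows(5000, 6000, "+", 10000)
minus = flank_windows(5000, 6000, "-", 10000)
```

This sketch does not check whether a window overlaps a neighbouring gene; how such overlaps were treated is not stated in the text, so that step is left out rather than guessed.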

We found that TRs showed clear localization preferences among different intragenic and intergenic regions. In the two green algae, C. reinhardtii and V. carteri (Figure 4 A), intron regions have the highest relative TR densities, 162 and 120 respectively, which are 1.62 and 1.20 times the corresponding whole-genome TR densities (the whole-genome relative TR density is defined as 100 for each species). Based on the F-test from ANOVA, the null hypothesis that all tested intergenic and intragenic regions have equal mean relative TR densities can be rejected (p < 0.001), and Tukey's Honestly Significant Difference (HSD) test (Yandell, 1997) also shows a significant difference between the intron and each of the other regions (p < 0.05). CDS regions have the second highest relative TR densities among the genic regions (86 and 65), whereas 5'-UTRs have the lowest (17 and 22). Interestingly, the relative TR densities of intergenic regions increase with distance from the genes in these two green algae (Figure 4 A). In the 29 land plants we examined, 5'-UTRs have significantly higher relative TR densities than any other intragenic or intergenic region (p < 0.001, F-test from ANOVA; p < 0.01, HSD test; Figure 4 B and C). In the dicots and bryophytes, CDS regions have the significantly lowest relative TR densities among the different regions (p < 0.01, F-test from ANOVA; p < 0.03, HSD test; Figure 4 B). In monocots, the relative densities of CDS, intron and 3'-UTR regions are similarly low (Figure 4 C). Unlike in the two green algae, the intergenic regions in land plants generally show higher TR densities than the genome average (Figure 4 B, C and D). Comparing all intergenic regions (Figure 4 B and C), promoter regions close to the 5'-UTR appear to have more TR occurrences in land plants. In particular, the relative TR densities of the UI200 (upstream intergenic 200 nt) regions show a strong positive correlation with those of the 5'-UTRs in land plants (r = 0.755, p = 1.998e-05).
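The relative densities quoted in this section follow the normalization described in Materials and methods: TR density is expressed in bp/Mbp, and the whole-genome density of each species is set to 100. A minimal Python sketch of that arithmetic is given below; the function names are hypothetical and the numbers in the usage lines are toy values, not results from this study.

```python
def tr_density(tr_bp, searched_bp):
    """TR density in bp/Mbp: TR bases per million valid bases searched."""
    if searched_bp <= 0:
        raise ValueError("searched_bp must be positive")
    return tr_bp * 1_000_000 / searched_bp


def relative_density(region_density, genome_density):
    """Region density rescaled so that the whole-genome density equals 100."""
    if genome_density <= 0:
        raise ValueError("genome_density must be positive")
    return 100.0 * region_density / genome_density


# Toy values: 500 TR bases in 100,000 searched bases -> 5000 bp/Mbp.
region = tr_density(500, 100_000)
# A region at 20,000 bp/Mbp in a genome at 12,500 bp/Mbp -> relative density 160.
rel = relative_density(20_000, 12_500)
```

On this scale a relative density of 162, as reported for C. reinhardtii introns, means the region carries 1.62 times the whole-genome TR density.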

The nucleotide content of the most abundant TR motifs is influenced by GC content

The 22 dicots have GC contents ranging from 32.40% to 39.56%, the five monocots from 43.57% to 46.14%, and the two green algae from 55.70% to 63.45%. Of the two bryophyte species, the moss Physcomitrella patens (33.60%) falls within the range of the dicots, whereas the spikemoss Selaginella moellendorffii (45.25%) falls within the range of the monocots. At the genome-wide scale, GC contents therefore appear to follow the pattern green algae > monocots > dicots. GC contents also vary greatly among different intragenic and intergenic regions. In intragenic regions of dicots, the highest and lowest GC contents are detected in CDS (44.34%) and intron (33.19%) respectively, and 5'-UTR has the second highest (39.89%). In intergenic regions of dicots, GC contents vary from 31.75% to 35.56%. In intragenic regions of monocots and bryophytes, the highest and lowest GC contents are detected in 5'-UTR (54.79%) and intron (39.15%) respectively, and CDS has the second highest (53.32%). In intergenic regions of monocots and bryophytes, GC contents range from 42.39% to 49.20%. In green algae, the highest GC content is detected in CDS (66.17%), and the lowest in 5'-UTR (52.82%) and its adjacent intergenic region (UI200, 52.32%). The 3'-UTR and the other intergenic regions in green algae have GC contents from 53.75% to 57.38%. In contrast to both monocots and dicots, introns in green algae show the second highest GC content (58.38%).
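As a small illustration of how such GC percentages can be computed, the Python sketch below counts only the valid nucleotides (A, T, G and C), in line with the counting rule stated in Materials and methods. The function name and the toy sequences are hypothetical, and the sketch is not the Perl code used in this study.

```python
def gc_content(seq):
    """GC content (%) of a sequence, ignoring anything other than A, C, G, T."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    valid = sum(counts.values())
    if valid == 0:
        raise ValueError("sequence contains no valid nucleotides")
    return 100.0 * (counts["G"] + counts["C"]) / valid


# Assembly gaps (runs of N) do not contribute to the denominator.
toy = gc_content("ACGTNNNN")
```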

Our data suggest a clear relationship between GC content and the nucleotide composition of the most frequent TR motifs detected within a whole genome or within individual intragenic or intergenic regions. If a given region has a high GC content, its abundant TR motifs tend to be GC-rich. In dicots, the most abundant TRs have mono-, di- and trinucleotide repeat units, except in CDS, where trinucleotide motifs are the most frequent, followed by other motifs whose unit size is a multiple of three (e.g., hexa- and 9-nucleotide motifs). The top TR motifs in dicots are dinucleotide motifs (16.87%, e.g., AT), mononucleotide motifs (14.48%, e.g., A/T) and trinucleotide motifs (9.17%, e.g., ATT/AAT and AAG/CTT). Motifs of 4−7 bp still show high frequencies (>3%), whereas longer motifs have low frequencies (≤2%), with the exception of 39-nucleotide motifs (3.92%). This exception is caused by the dramatically high frequency of 39-nucleotide motifs detected in R. communis (69.57%). Moreover, only AT motifs (e.g., T, AT and ATT) are detected among the top motifs in dicot introns, which have the lowest GC content of all intragenic regions (see Table 1). In monocots, by contrast, mononucleotide motifs are lower in frequency (8.73%), whereas trinucleotide motifs (14.12%) and dinucleotide motifs (13.05%) are clearly preferred (>13%). GC-rich motifs such as CGG/GCC are also found more frequently in monocots than in dicots, in line with the higher GC contents of monocots. Although trinucleotide motifs and other motifs with unit sizes that are multiples of three are still dominant in monocot CDS regions, they are essentially GC-rich motifs (e.g., CGG/GCC). Interestingly, dinucleotide (16.69%) and 12-nucleotide (17.48%) motifs have higher frequencies in the two bryophytes, because dramatically high frequencies of dinucleotide (26.79%) and 12-nucleotide (31.11%) motifs are detected in P. patens and S. moellendorffii respectively. In green algae, mononucleotide TR motifs show an extremely low frequency (only 0.94%). In the green alga C. reinhardtii, dinucleotide GT/AC motifs are dominant in all intragenic and intergenic regions except the 5'-UTR and CDS regions, where trinucleotide AGC and CGG/CGG motifs are frequent. In the green alga V. carteri, long 17-nucleotide motifs are frequently found in all intragenic regions except CDS, where trinucleotide CGG and AGC are frequent. Interestingly, the three most frequent motif classes in V. carteri are 17-nucleotide (8.18%), trinucleotide (7.44%) and 50-nucleotide (7.39%) motifs. In 198 experimentally verified plant promoter sequences downloaded from EPD (Eukaryotic Promoter Database), the abundant TR motif units are mono-, di- and tetranucleotides, and the top-ranked motifs are A- and AT-rich (e.g., A/T, AT, ATGC, CTTT and ATTT), which is very similar to our results for the intergenic region adjacent to the 5'-UTR (i.e., UI200) in dicots and monocots.

TR distribution and frequency profiles in intragenic (5'-UTR, CDS, intron and 3'-UTR) and intergenic regions

As shown in Figure 5 A, within the upstream intergenic UI200 regions of both dicots and monocots, the highest and lowest relative TR motif contents are found in the 80−89 (p < 0.001, F-test from ANOVA; p < 0.01, HSD test) and 0−9 intervals (p < 0.001, F-test from ANOVA; p < 0.01, HSD test) respectively. This suggests that the distribution of TR motifs is significantly skewed toward the 3' ends of UI200, closer to the 5' ends of genes. In contrast, within the downstream intergenic DI200 regions (see Figure 5 B), TR motifs are distributed significantly toward the 5' ends of DI200, closer to the 3' ends of genes: the highest and lowest relative motif contents are detected in the 10−19 (p < 0.001, F-test from ANOVA; p < 0.01, HSD test, except in the comparison with 20−29, where p = 0.13) and 90−99 (p < 0.001, F-test from ANOVA; p < 0.01, HSD test) intervals respectively. Interestingly, the progressive increase of motif frequency toward gene ends does not hold in the sub-regions immediately adjacent to gene ends (e.g., the interval 90−99 in Figure 5 A and the interval 0−9 in Figure 5 B). Meanwhile, TR motif frequencies appear to be relatively uniform within 5'-UTR, 3'-UTR and CDS regions (Figure S1), except near their ends (i.e., the intervals 0−9 and 90−99), where lower frequencies are detected. Within the introns of all 31 species, significantly more motifs are detected in the 10−19 (p < 0.001, F-test from ANOVA; p < 0.05, HSD test) and 80−89 (p < 0.001, F-test from ANOVA; p < 0.01, HSD test, except in the comparison with 20−29, where p = 0.19) intervals. TR motifs are therefore more frequent toward intron ends, forming a U-shaped profile (Figure 5 C) that differs from the trends in the intergenic UI200 and DI200 regions.
Unlike dicots and monocots, bryophytes and green algae show distinct TR distribution profiles in both intergenic UI200 and DI200 regions: more motifs are detected in the middle intervals (see Figure 5 D and E), and no progressive increase or decrease is observed. Their TR distribution profiles within intragenic regions, on the other hand, are similar to those of dicots and monocots (see Figure S1). As shown in Figure 6, we determined TR occurrences among all annotated genes, their intragenic regions and adjacent promoter regions for the 4 groups covering all 31 species (bryophytes, monocots, dicots and green algae). First, ~84% of all annotated mRNAs (or genes; some genes have more than one annotated mRNA) possess TRs (see Figure 6 A), and no significant difference is detected among the 4 groups by the F-test and HSD test. Monocots show a lower TR frequency than the other three groups, but the difference is not statistically significant. In UI200 and 5'-UTR regions (see Figure 6 B and C), about 4% and 10% of the annotated mRNAs have TRs respectively, except in green algae, in which only ~2% have TRs in either region. The differences in TR frequencies in UI200 and 5'-UTR regions are not significant among the 4 groups. However, as shown in Figure 6 D, E and F, green algae display significantly higher TR frequencies in 3'-UTR (~17%), CDS (~9%) and intron (~23%) regions than the other three groups, bryophytes, monocots and dicots (p < 0.01, F-test from ANOVA; p < 0.01, HSD test). Using GOEAST (Gene Ontology Enrichment Analysis Software Toolkit) (Zheng and Wang, 2008), we analyzed the GO terms of C. reinhardtii genes with TRs in introns. As shown in Table 2, the most highly enriched GO terms involve catalytic activity (p = 5.384e-35) and hydrolase activity (p = 1.344e-10). In the green alga V. carteri, the most significant GO functions of genes with TRs in introns mainly involve ribosomal proteins and heat shock proteins. These results suggest that intronic TRs could be involved in RNA and/or protein activity related to intron processing.

Discussion

The variation of TR densities in different genomes

In a genome-wide study of TRs in 12 species, including two fungi (S. cerevisiae and Neurospora crassa), one green alga (Ostreococcus lucimarinus), one plant (A. thaliana), three vertebrates (Homo sapiens, Mus musculus, Gallus gallus), one nematode (Caenorhabditis elegans) and three arthropods (Daphnia pulex, Drosophila melanogaster, Apis mellifera), Mayer et al. (2010) detected a weak but non-significant correlation between genome size and TR density (r = 0.483, p = 0.111). In three plant families, Brassicaceae, Solanaceae and Poaceae, no association between genome size and TR density in gene transcripts was found either (da Maia et al., 2009). In a recent study of 257 virus genomes, the relative SSR density (i.e., SSR base pairs per kilobase pair of genomic sequence) showed only a weak correlation with genome size (Zhao et al., 2012). Our analysis likewise detects no significant relationship between TR density and genome size in green algae and plants at the genome-wide scale (see Figure 3). Furthermore, TR densities are clearly species-specific rather than group-specific, as illustrated by the two green algae; this result agrees with the SSR density variation detected in 25 algae and plants (von Stackelberg et al., 2006). Overall, there appears to be a weak positive, but not significant, correlation between genome size and TR density both for compact genomes (such as viruses) and for genomes with many intergenic regions (such as plants). This suggests that TRs might not have contributed significantly to genome size expansion during evolution.

The variation of TR densities in different intragenic and intergenic regions

In Arabidopsis and rice, TRs are significantly enriched within 5'-UTRs (Fujimori et al., 2003; Lawson and Zhang, 2006; Zhang et al., 2004). In our study, both dicots and monocots have their highest and second highest TR densities in 5'-UTRs and in the immediately upstream intergenic region (i.e., UI200) respectively (see Figure 4 B and C); the latter belongs to the promoter region, where core promoter elements are often located (Kokulapalan, 2011). 5'-UTRs are thought to be hotspots for TRs in eukaryotes. Previous studies on genes for light and salicylic acid responses (Li et al., 2004; Zhang et al., 2006) suggested that TRs in 5'-UTRs might be involved in the regulation of transcription and/or translation. As many as 25% of genes in the yeast S. cerevisiae are reported to have TRs in their promoter regions (Vinces et al., 2009). Our study also shows that ~4−25% of genes in dicots and monocots possess TRs in both 5'-UTR and promoter UI200 regions (see Figure 6 B and C). In both dicots and monocots, TR abundance is lowest in CDS regions, suggesting that low TR abundance may limit the evolvability of proteins. This is reasonable because mutations in CDS have been shown to cause changes in protein function, loss of function and protein truncation (Li et al., 2004). Interestingly, intron and 3'-UTR regions have much lower TR densities in monocots than in dicots. The biological meaning of these differences between dicots and monocots remains unclear. In the two green algae we examined, the highest and second highest TR densities are detected in intron and CDS regions respectively (Figure 4 A), which is completely different from the land plants. Our data show that green algae have a significantly larger proportion of intron sequence (32.85% and 37.03% of the whole genome in C. reinhardtii and V. carteri) than land plants (15.73% on average). This may imply that in green algae the TRs in introns and CDS have not expanded randomly and could be involved in intron- or CDS-related activities and in RNA processing (e.g., exon splicing). Indeed, our GO analysis of C. reinhardtii genes with TRs in introns shows that the most significant GO functions are catalytic activity and hydrolase activity (Table 2). These functions indicate that genes with TR-rich introns could be involved in protein synthesis and degradation.

The top TR motifs are influenced by GC content

In our study, the top-ranked TR motifs in dicot 5'-UTRs are CT/AG and CTT/AAG. This is consistent with the results of Zhang et al., in which these motifs (CT/AG and CTT/AAG) were preferred in Arabidopsis 5'-UTRs and acted as regulatory elements for genes involved in light and salicylic acid responses (Zhang et al., 2006). Our results also show that CDS regions are preferentially associated with tri- and hexanucleotide motifs, as reported previously by other researchers (Fujimori et al., 2003; Li et al., 2004; Mayer et al., 2010; Subramanian et al., 2003; Zhang et al., 2006). It has been suggested that evolutionary pressure against TR expansion is stronger in CDS than in introns, in order to keep protein products stable (Dokholyan et al., 2000). This helps explain why motifs whose unit size is a multiple of three (e.g., tri- and hexanucleotide motifs) are more frequent than others: they reduce potential translational frameshifting. In our study, the two green algae have their highest TR densities in introns, and the corresponding abundant motifs are dinucleotide GT/AC. The canonical splicing signals GT and AG are located at the 5' and 3' ends of introns respectively. The abundance of GT/AC dinucleotide TRs in introns might suggest that such repeats are involved in exon splicing or alternative splicing in green algae (Gemayel et al., 2012). In dicots, most TR motifs contain A and/or T nucleotides, whereas both A/T-rich motifs and CCG/CGG motifs are common in monocots. A/T-rich motifs, however, are rarely detected in the two green algae. The top TR motifs therefore have a strong relationship with GC content (Table 1): if the GC content is high, the most frequent TR motifs tend to be GC-rich instead of AT-rich. A similar relationship has also been demonstrated in 11 other species (including green algae, bryophytes, ferns, gymnosperms and angiosperms) (Victoria et al., 2011) and for AARs (Amino Acid Repeats) in 10 angiosperms (Zhou et al., 2011).

In terms of the repeat unit size distribution, mononucleotide motifs are not the most frequent TR motifs across the 31 investigated species. Longer repeats (>6 bp) are known to have high densities in D. pulex (Mayer et al., 2010). In our study, some longer repeats also show higher frequencies than many short TRs: 39-nucleotide motifs in the dicot R. communis, 17-nucleotide and 50-nucleotide motifs in the green alga V. carteri, and 12-nucleotide motifs in the bryophyte S. moellendorffii. This suggests that TRs are not generated randomly in the genome and that longer TRs may play roles in gene expression and regulation.

The distribution and frequency of TRs in intragenic and intergenic regions

The distribution of TR motifs in intergenic regions is significantly biased toward both the 5' and 3' ends of genes in dicots and monocots (Figure 5 A and B). TRs have been shown to be located within genes and regulatory regions and to participate in transcriptional and translational regulation (Legendre et al., 2007; Li et al., 2002; Martin et al., 2005; Rockman and Wray, 2002; Streelman and Kocher, 2002; Vinces et al., 2009). The biased TR motif distribution in intergenic regions observed in our study further supports this notion.

In the introns of all 31 species, and especially in the two green algae, significantly more TR motifs are detected toward the ends of introns. Interestingly, SSR densities in the CDS regions of 42 prokaryotic genomes show a similar U-shaped profile (Lin and Kussell, 2011). Because introns contain important regulatory motifs for many biological processes, including splicing (Barbazuk et al., 2008; Matlin et al., 2005), our results suggest that intronic TRs might have localization preferences related to their regulatory roles. Considering that exon splicing uses the canonical splicing signals (GT and AG) at the 5' and 3' ends of introns, and considering the GO functions of genes with TRs in introns (see Table 2), we believe that the highly abundant TRs in introns, especially in the two green algae, may be involved in both constitutive and alternative splicing. The frequencies of TRs are consistent with the TR density variations in the four groups: 5'-UTR and UI200 regions have much higher TR densities in dicots, monocots and bryophytes, whereas higher TR densities are found in intron and CDS regions in green algae (Figure 4 and Figure 6). In other words, land plants (dicots, monocots and bryophytes) have more TRs, in both density and frequency, in 5'-UTR and promoter (UI200) regions, while green algae have more TRs in intron and CDS regions.

In this study, the genome assemblies and gene annotations were obtained from Phytozome v8.0. Within this release, some species (e.g., Arabidopsis and rice) have better gene annotation than others (e.g., papaya and apple, which lack UTR annotation). We also noticed that many genome assemblies contain unfinished gaps (e.g., …NNN…), perhaps because of the highly repetitive nature of the sequences and the limitations of current sequencing technologies. In addition, our data analysis is biased toward dicots because the number of species available in Phytozome v8.0 is not balanced across the four groups: 2 green algae, 2 bryophytes, 5 monocots and 22 dicots. Another limitation is the choice of repeat unit size. Ideally, we would have examined all TRs with repeat unit sizes of 1−100 bp (i.e., covering all microsatellites and minisatellites) or even longer. We restricted the analysis to TR motifs of 1−50 bp because of constraints in current bioinformatics tools and the demanding computational resources (i.e., CPU, memory and execution time) required for processing all 31 genomes. These limitations will affect the quality of the results presented in this paper to some extent. With the rapid advances in sequencing and computational technologies and the rapid increase in transcriptomics data, we can expect more high-quality, accurate genome assemblies and gene annotations to become available for in-depth TR analyses in more plant and green algal species. This will help improve our understanding of the evolution of TRs and their roles in the regulation of gene expression.

Materials and methods

Collecting genomes and annotation data

The assembled genome sequences (including chromosomes, mitochondria and chloroplasts) and gene annotations of the 31 species were downloaded from Phytozome v8.0 (http://www.phytozome.net/) (see Figure 1 for the species list). Only valid nucleotides (A, T, G and C) were counted when analyzing the sequences. For each species, the nucleotide sequences of the whole genome were used for genome-wide TR detection and density calculation. Individual intergenic and intragenic regions were also extracted, according to the data extraction schema shown in Figure 2, and used for TR analysis. In Phytozome v8.0, UTR annotations (both 5'-UTRs and 3'-UTRs) were not available for Carica papaya, Brassica rapa, Linum usitatissimum and Malus domestica. Therefore, the UTR regions were not examined individually for these four species. However, their upstream and downstream intergenic regions (e.g., UI1000, DI1000) were still examined based on the annotated gene start and end positions (see Figure 2). Perl (Practical Extraction and Report Language) scripts were written to extract sequences, run TR detection and parse the results for downstream data analysis.

TR detection and analysis

To detect both perfect and imperfect TRs, we used Phobos (version 3.3.12), a tandem repeat search tool for complete genomes (Mayer et al., 2010). Considering the computational resources and execution time required for processing all 31 genomes, we adopted 1−50 bp as the repeat unit size range, similar to that used by Mayer et al. The minimum length of a detected repeat was set to 12 nt, and the minimum repeat alignment score for imperfect repeats was set to 12. For recursive TRs, whose repeat unit can be written as several cyclic permutations, only the alphabetically first permutation was used as the representative motif (Jurka and Pethiyagoda, 1995). For example, AAG, AGA and GAA are all repeat units of (AAG)n, but only AAG was used to represent the repeat motif. Moreover, TR motifs and their reverse complements (e.g., AAG and CTT) were investigated separately, because (1) genes are annotated on different strands (i.e., + versus −), (2) many sense and antisense transcripts have recently been reported for many genes (Gu et al., 2009; Kerin et al., 2012), emphasizing the importance of gene orientation in genome annotations, and (3) a similar strategy has been adopted by others (Kuntal and Sharma, 2011; Zhang et al., 2006). TR density was expressed in base pairs per megabase pair (bp/Mbp), i.e., the total length of detected TRs (bp) per Mbp of sequence searched. To enable comparison among different species, or among different regions (e.g., intragenic vs. intergenic regions) within the same species, we normalized the TR densities and computed relative densities: for each species, the whole-genome density was defined as 100, and the relative density of a specific region was computed as 100 × (TR density of the region)/(whole-genome TR density). To investigate TR distribution profiles within a given region, the sequence length of the region was first normalized to a 0−99 scale divided into 10 intervals of equal size (i.e., 0−9, 10−19, …, 90−99). The motif percentages were then calculated for the 10 intervals based on motif occurrences. In this way, regions of the same type but with different sequence lengths can be compared.
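Two of the steps described above, choosing one representative for a recursive motif and binning TR positions into ten intervals, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original Perl implementation: the function names are hypothetical, and assigning each TR to an interval by its 0-based start offset within the region is an assumption, since the text does not say which coordinate of a TR was used.

```python
def canonical_motif(unit):
    """Alphabetically first cyclic permutation of a repeat unit.

    Reverse complements are deliberately NOT merged, so AAG and CTT
    remain distinct motifs, as described in the text.
    """
    unit = unit.upper()
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def interval_index(offset, region_len):
    """Map a 0-based offset in a region to interval 0..9 (0 is '0-9')."""
    if not 0 <= offset < region_len:
        raise ValueError("offset must lie inside the region")
    scaled = offset * 100 // region_len  # position on the 0-99 scale
    return scaled // 10


def interval_percentages(hits):
    """Percentage of TR occurrences per interval.

    hits is an iterable of (offset, region_len) pairs, one per detected TR,
    so regions of different lengths contribute on the same 0-99 scale.
    """
    counts = [0] * 10
    for offset, region_len in hits:
        counts[interval_index(offset, region_len)] += 1
    total = sum(counts)
    return [100.0 * c / total for c in counts] if total else [0.0] * 10


# Toy hits: three TRs in 200-nt regions and one in a 100-nt region.
profile = interval_percentages([(0, 200), (199, 200), (180, 200), (50, 100)])
```

Pooling the per-interval counts over all regions of one type, as done here, is one simple way to obtain a profile; averaging per-species percentages would be an equally plausible reading of the description.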
The 198 experimentally verified plant promoter sequences, which span positions -499 to 100 around the transcription start site (position 0 in the coordinate system), were downloaded from EPD (Eukaryotic Promoter Database, http://epd.vital-it.ch/seq_download.php#) (Périer et al., 2000). These promoter sequences were also scanned for perfect and imperfect TRs. Based on the Chlamydomonas reinhardtii GO annotation (v4.0) from JGI (http://genome.jgi-psf.org/Chlre4/Chlre4.download.ftp.html), GOEAST (Gene Ontology Enrichment Analysis Software Toolkit) (Zheng and Wang, 2008) was used to assess the significance of GO terms for the genes with TRs in introns. GO annotations in Volvox carteri were analyzed with annot8r (Schmid and Blaxter, 2008) and ranked by E-value. Pearson correlation (r) tests were conducted using Minitab 16 (www.minitab.com). Box plots were drawn using R (http://www.r-project.org/). The ANOVA F-test and Tukey's Honestly Significant Difference (HSD) test (Yandell, 1997) were also performed in R for significance testing.
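For readers unfamiliar with the one-way ANOVA F-test mentioned above, the following Python sketch computes the F statistic from grouped observations. It is only an illustration of the statistic: the analyses in this study were run in R, the function name is hypothetical, the sketch does not compute p-values or the Tukey HSD comparisons, and the grouped values in the usage line are toy numbers.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups is a list of lists of observations, e.g. the relative TR
    densities of each region type across species.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need >= 2 non-empty groups and more values than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy example with three groups of three observations each.
f_value, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The F value is then compared with the F distribution on (df_between, df_within) degrees of freedom to obtain the p-value, a step that a statistics package such as R performs directly.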

Tables

Table 1. The top TR motifs and GC contents in different regions

Region         Average GC content (%)   Top-frequent motifs

Group: Dicots (genome GC content range: 32%, 40%)
Whole genome   35.61                    A/T; AT; ATT/AAT, AAG/CTT
UI1000         32.63                    A/T; AT; ATT/AAT, AAG/CTT
UI500          31.75                    A/T; AT; ATT/AAT
UI200          35.56                    A/T; AT, CT; CTT/AAG
5'UTR          39.89                    CT/AG; CTT/AAG
CDS            44.34                    AAG/CTT
Intron         33.19                    T; AT; ATT
3'UTR          35.41                    T; AT; ATT/AAT
DI200          34.58                    T/A; AT, AG; AAT/ATT, AAG/CTT
DI500          32.78                    T/A; AT; AAT/ATT, AAG/CTT
DI1000         33.50                    T/A; AT; AAT/ATT, AAG/CTT

Group: Monocots and bryophytes (genome GC content range: 34%, 55%)
Whole genome   43.29                    AT; AAT/ATT, CCG/CGG
UI1000         42.39                    AT; AAT/ATT, CCG/CGG
UI500          43.38                    AT; ATT, CCG
UI200          49.20                    AT; CCG
5'UTR          54.79                    AG/CT; CCG
CDS            53.32                    CGG/CCG
Intron         39.15                    C; CT
3'UTR          42.34                    CTT/AAG, CCG, GT
DI200          45.27                    AT; CGG/CCG
DI500          42.67                    AT; CGG/CCG
DI1000         42.56                    AT; CGG/CCG

Group: Green algae (genome GC content range: >55%)
Whole genome   59.58                    AC/GT; CCG/CGG; AAGCATATGCGATCTGC
UI1000         57.38                    AC/GT; CCG/CGG; AAGCATATGCGATCTGC
UI500          55.84                    GT/AC
UI200          52.32                    GT/AC
5'UTR          52.82                    AGC
CDS            66.17                    CGG/CGG, AGC
Intron         58.38                    GT/AC
3'UTR          55.42                    GCT/AGC; GT
DI200          53.75                    GT/AC
DI500          56.32                    GT/AC; AAGCATATGCGATCTGC
DI1000         57.16                    GT/AC; AAGCATATGCGATCTGC

Table 2. The most significant GO functions of genes with TRs in introns in C. reinhardtii

Ontology             Term                                                   P-value
molecular function   catalytic activity                                     5.384e-35
molecular function   hydrolase activity                                     1.344e-10
molecular function   oxidoreductase activity                                7.792e-6
molecular function   peptidase activity                                     1.109e-4
molecular function   peptidase activity, acting on L-amino acid peptides    1.766e-4
molecular function   endopeptidase activity                                 5.793e-4

Figures

Figure 1. The phylogenetic tree of the 31 species as shown in Phytozome v8.0 (http://www.phytozome.net/). The abbreviated name (the first two letters of the genus name combined with the first two letters of the species name) and the common name are listed in parentheses.

Figure 2. The schematic intragenic and intergenic regions used for TR analysis. UI200: 1−200 nt upstream of 5'UTR; UI500: 201−700 nt upstream of 5'UTR; UI1000: 701−1700 nt upstream of 5'UTR; DI200: 1−200 nt downstream of 3'UTR; DI500: 201−700 nt downstream of 3'UTR; DI1000: 701−1700 nt downstream of 3'UTR.

Figure 3. Genome size versus genomic TR density in the 31 land plants and green algae. Each abbreviated name consists of the first two letters of the genus name and the first two letters of the species name (see Figure 1). The linear trendline was fitted in Excel.

Figure 4. The relative density of TRs in different intragenic elements and the intergenic regions. (A) 2 green algae; (B) 20 species including dicots and bryophytes; (C) 5 monocot land plants; (D) 4 land plants without available UTRs. Note: the average whole genome density is defined as 100 for each species. The number above each interval is the average density of the grouped species.

Figure 5. The relative distribution position of TRs in the intron and intergenic regions. (A) upstream 200 nt region in dicots and monocots; (B) downstream 200 nt region in dicots and monocots; (C) intron region in the 31 investigated species; (D) upstream 200 nt region in bryophytes and green algae; (E) downstream 200 nt region in bryophytes and green algae.

Figure 6. The percentage of TRs in different intragenic and intergenic regions. (A) mRNAs; (B) UI200 region; (C) 5’-UTR region; (D) 3’-UTR region; (E) CDS region; (F) Intron region.

Supplementary data

Supplemental File 1: Figure S1. The relative positional distribution of TRs in the three intragenic regions. (A) 5’-UTR regions in the 27 investigated species; (B) CDS regions in the 31 investigated species; (C) 3’-UTR regions in the 27 investigated species.

Acknowledgements

This project was funded in part by an NIH-AREA grant (1R15GM94732-1 A1 to CL) and by the Botany Department and the Office for the Advancement of Research and Scholarship (OARS) at Miami University. Dr. Chun Liang managed and coordinated the project. Zhixin Zhao carried out data collection and data analysis. Guo Cheng and Sreeskandarajan Sutharzan helped with the statistical analyses. The GO analysis in V. carteri was performed by Pei Li. All authors participated in writing and editing the manuscript. We thank Dr. Qingshun Quinn Li and two anonymous reviewers for their constructive comments, which improved the manuscript.

This paper has been accepted by the journal G3: Genes | Genomes | Genetics and will be published in January 2014.

Literature Cited

Ai, B., Wang, Z.-S., and Ge, S. (2012). Genome size is not correlated with effective population size in the Oryza species. Evolution 66, 3302–3310.

Barbazuk, W.B., Fu, Y., and McGinnis, K.M. (2008). Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome Res. 18, 1381–1392.

Beaudoing, E., Freier, S., Wyatt, J.R., Claverie, J.M., and Gautheret, D. (2000). Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10, 1001–1010.

Bennetzen, J.L. (2005). Mechanisms of Recent Genome Size Variation in Flowering Plants. Annals of Botany 95, 127–132.

Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580.

Birse, C.E., Minvielle-Sebastia, L., Lee, B.A., Keller, W., and Proudfoot, N.J. (1998). Coupling termination of transcription to messenger RNA maturation in yeast. Science 280, 298–301.

Bowler, C., Allen, A.E., Badger, J.H., Grimwood, J., Jabbari, K., Kuo, A., Maheswari, U., Martens, C., Maumus, F., Otillar, R.P., et al. (2008). The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature 456, 239–244.

Campbell, M.A., Haas, B.J., Hamilton, J.P., Mount, S.M., and Buell, C.R. (2006). Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7, 327.

Christians, J.K., and Watt, C.A. (2009). Mononucleotide repeats represent an important source of polymorphic microsatellite markers in Aspergillus nidulans. Mol Ecol Resour 9, 572–578.

Crooks, G.E., Hon, G., Chandonia, J.M., and Brenner, S.E. (2004). WebLogo: a sequence logo generator. Genome Research 14, 1188–1190.

Danckwardt, S., Hentze, M.W., and Kulozik, A.E. (2008). 3’ end mRNA processing: molecular mechanisms and implications for health and disease. EMBO J 27, 482–498.

Dávila López, M., and Samuelsson, T. (2008). Early evolution of histone mRNA 3’ end processing. RNA 14, 1–10.

Dokholyan, N.V., Buldyrev, S.V., Havlin, S., and Stanley, H.E. (2000). Distributions of dimeric tandem repeats in non-coding and coding DNA sequences. J. Theor. Biol. 202, 273–282.

Eisen, J.A., Coyne, R.S., Wu, M., Wu, D., Thiagarajan, M., Wortman, J.R., Badger, J.H., Ren, Q., Amedeo, P., Jones, K.M., et al. (2006). Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote. PLoS Biology 4, e286.

Espinosa, N., Hernández, R., López-Griego, L., and López-Villaseñor, I. (2002). Separable putative polyadenylation and cleavage motifs in Trichomonas vaginalis mRNAs. Gene 289, 81–86.

Fan, H., and Chu, J.-Y. (2007). A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics 5, 7–14.

Farré, M., Bosch, M., López-Giráldez, F., Ponsà, M., and Ruiz-Herrera, A. (2011). Assessing the Role of Tandem Repeats in Shaping the Genomic Architecture of Great Apes. PLoS ONE 6, e27239.

Fuentes, V., Barrera, G., Sánchez, J., Hernández, R., and López-Villaseñor, I. (2012). Functional Analysis of Sequence Motifs Involved in the Polyadenylation of Trichomonas vaginalis mRNAs. Eukaryotic Cell.

Fujimori, S., Washio, T., Higo, K., Ohtomo, Y., Murakami, K., Matsubara, K., Kawai, J., Carninci, P., Hayashizaki, Y., Kikuchi, S., et al. (2003). A novel feature of microsatellites in plants: a distribution gradient along the direction of transcription. FEBS Lett. 554, 17–22.

Gatchel, J.R., and Zoghbi, H.Y. (2005). Diseases of unstable repeat expansion: mechanisms and common principles. Nat. Rev. Genet. 6, 743–755.

Gemayel, R., Vinces, M.D., Legendre, M., and Verstrepen, K.J. (2010). Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu. Rev. Genet. 44, 445–477.

Gemayel, R., Cho, J., Boeynaems, S., and Verstrepen, K.J. (2012). Beyond Junk-Variable Tandem Repeats as Facilitators of Rapid Evolution of Regulatory and Coding Sequences. Genes 3, 461–480.

Di Giammartino, D.C., Nishida, K., and Manley, J.L. (2011). Mechanisms and consequences of alternative polyadenylation. Mol. Cell 43, 853–866.

Graber, J.H., Cantor, C.R., Mohr, S.C., and Smith, T.F. (1999a). In silico detection of control signals: mRNA 3’-end-processing sequences in diverse species. Proceedings of the National Academy of Sciences of the United States of America 96, 14055–14060.

Graber, J.H., Cantor, C.R., Mohr, S.C., and Smith, T.F. (1999b). Genomic detection of new yeast pre-mRNA 3’-end-processing signals. Nucleic Acids Research 27, 888–894.

Grover, C.E., and Wendel, J.F. (2010). Recent Insights into Mechanisms of Genome Size Change in Plants. Journal of Botany 2010, 1–8.

Gu, R., Zhang, Z., DeCerbo, J.N., and Carmichael, G.G. (2009). Gene regulation by sense-antisense overlap of polyadenylation signals. RNA 15, 1154–1163.

Helden, J. van, Andre, B., and Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology 281, 827–842.

Hu, J., Lutz, C.S., Wilusz, J., and Tian, B. (2005). Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation. RNA 11, 1485–1493.

Hunt, A.G. (2012). RNA Regulatory Elements and Polyadenylation in Plants. Frontiers in Plant Science 2.

Hunt, A.G., Xu, R., Addepalli, B., Rao, S., Forbes, K.P., Meeks, L.R., Xing, D., Mo, M., Zhao, H., Bandyopadhyay, A., et al. (2008). Arabidopsis mRNA polyadenylation machinery: comprehensive analysis of protein-protein interactions and gene expression profiling. BMC Genomics 9, 220.

Hunt, A.G., Xing, D., and Li, Q.Q. (2012a). Plant polyadenylation factors: conservation and variety in the polyadenylation complex in plants. BMC Genomics 13, 641.

Hunt, A.G., Xing, D., and Li, Q.Q. (2012b). Plant polyadenylation factors: conservation and variety in the polyadenylation complex in plants. BMC Genomics 13, 641.

Jurka, J., and Pethiyagoda, C. (1995). Simple repetitive DNA sequences from primates: compilation and analysis. J. Mol. Evol. 40, 120–126.

Kamasawa, M., and Horiuchi, J. (2008). Identification and characterization of polyadenylation signal (PAS) variants in human genomic sequences based on modified EST clustering. In Silico Biol. 8, 347–361.

Kerin, T., Ramanathan, A., Rivas, K., Grepo, N., Coetzee, G.A., and Campbell, D.B. (2012). A Noncoding RNA Antisense to Moesin at 5p14.1 in Autism. Science Translational Medicine 4, 128ra40.

Kokulapalan, W. (2011). Genome-wide Computational Analysis of Chlamydomonas reinhardtii Promoters. OhioLINK ETD Center.

Kuntal, H., and Sharma, V. (2011). In Silico Analysis of SSRs in Mitochondrial Genomes of Plants. OMICS: A Journal of Integrative Biology 15, 783–789.

Labadorf, A., Link, A., Rogers, M.F., Thomas, J., Reddy, A.S., and Ben-Hur, A. (2010). Genome-wide analysis of alternative splicing in Chlamydomonas reinhardtii. BMC Genomics 11, 114.

Lawson, M.J., and Zhang, L. (2006). Distinct patterns of SSR distribution in the Arabidopsis thaliana and rice genomes. Genome Biol. 7, R14.

Legendre, M., Pochet, N., Pak, T., and Verstrepen, K.J. (2007). Sequence-based estimation of minisatellite and microsatellite repeat variability. Genome Res. 17, 1787–1796.

Li, Q., and Hunt, A.G. (1997). The Polyadenylation of RNA in Plants. Plant Physiol 115, 321–325.

Li, B., Xia, Q., Lu, C., Zhou, Z., and Xiang, Z. (2004). Analysis on frequency and density of microsatellites in coding sequences of several eukaryotic genomes. Genomics Proteomics Bioinformatics 2, 24–31.

Li, H., Tong, S., Li, X., Shi, H., Ying, Z., Gao, Y., Ge, H., Niu, L., and Teng, M. (2011). Structural basis of pre-mRNA recognition by the human cleavage factor Im complex. Cell Res.

Li, Y.-C., Korol, A.B., Fahima, T., Beiles, A., and Nevo, E. (2002). Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol. Ecol. 11, 2453–2465.

Liang, C., Liu, Y., Liu, L., Davis, A.C., Shen, Y., and Li, Q.Q. (2008). Expressed sequence tags with cDNA termini: previously overlooked resources for gene annotation and transcriptome exploration in Chlamydomonas reinhardtii. Genetics 179, 83–93.

Lin, W.-H., and Kussell, E. (2011). Evolutionary pressures on simple sequence repeats in prokaryotic coding regions. Nucleic Acids Research 40, 2399–2413.

Liu, F., Marquardt, S., Lister, C., Swiezewski, S., and Dean, C. (2010). Targeted 3’ processing of antisense transcripts triggers Arabidopsis FLC chromatin silencing. Science 327, 94–97.

Loke, J.C., Stahlberg, E.A., Strenski, D.G., Haas, B.J., Wood, P.C., and Li, Q.Q. (2005). Compilation of mRNA polyadenylation signals in Arabidopsis revealed a new signal element and potential secondary structures. Plant Physiol 138, 1457–1468.

Lopez, F., Granjeaud, S., Ara, T., Ghattas, B., and Gautheret, D. (2006). The disparate nature of “intergenic” polyadenylation sites. RNA 12, 1794–1801.

Lutz, C.S. (2008). Alternative polyadenylation: a twist on mRNA 3’ end formation. ACS Chemical Biology 3, 609–617.

Mages, W., Cresnar, B., Harper, J.F., Bruderlein, M., and Schmitt, R. (1995). Volvox carteri alpha 2- and beta 2-tubulin-encoding genes: regulatory signals and transcription. Gene 160, 47–54.

Maheswari, U., Mock, T., Armbrust, E.V., and Bowler, C. (2009). Update of the Diatom EST Database: a new tool for digital transcriptomics. Nucleic Acids Research 37, D1001–5.

Da Maia, L.C., de Souza, V.Q., Kopp, M.M., de Carvalho, F.I.F., and de Oliveira, A.C. (2009). Tandem repeat distribution of gene transcripts in three plant families. Genet. Mol. Biol. 32, 822–833.

Mangone, M., Manoharan, A.P., Thierry-Mieg, D., Thierry-Mieg, J., Han, T., Mackowiak, S.D., Mis, E., Zegar, C., Gutwein, M.R., Khivansara, V., et al. (2010). The landscape of C. elegans 3’UTRs. Science 329, 432–435.

Mardis, E.R. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24, 133–141.

Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380.

Martin, P., Makepeace, K., Hill, S.A., Hood, D.W., and Moxon, E.R. (2005). Microsatellite instability regulates transcription factor binding and gene expression. Proc. Natl. Acad. Sci. U.S.A. 102, 3800–3804.

Matlin, A.J., Clark, F., and Smith, C.W.J. (2005). Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 6, 386–398.

Mayer, C., Leese, F., and Tollrian, R. (2010). Genome-wide analysis of tandem repeats in Daphnia pulex--a comparative approach. BMC Genomics 11, 277.

Melters, D.P., Bradnam, K.R., Young, H.A., Telis, N., May, M.R., Ruby, J.G., Sebra, R., Peluso, P., Eid, J., Rank, D., et al. (2013). Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biology 14, R10.

Merchant, S.S., Prochnik, S.E., Vallon, O., Harris, E.H., Karpowicz, S.J., Witman, G.B., Terry, A., Salamov, A., Fritz-Laylin, L.K., Maréchal-Drouard, L., et al. (2007). The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318, 245–250.

Merkel, A., and Gemmell, N. (2008). Detecting short tandem repeats from genome data: opening the software black box. Brief. Bioinformatics 9, 355–366.

Metzgar, D., Bytof, J., and Wills, C. (2000). Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res. 10, 72–80.

Metzgar, D., Liu, L., Hansen, C., Dybvig, K., and Wills, C. (2002). Domain-level differences in microsatellite distribution and content result from different relative rates of insertion and deletion mutations. Genome Res. 12, 408–413.

Millevoi, S., and Vagner, S. (2010). Molecular mechanisms of eukaryotic pre-mRNA 3’ end processing regulation. Nucleic Acids Res 38, 2757–2774.

Morgante, M., Hanafey, M., and Powell, W. (2002). Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat. Genet. 30, 194–200.

Morris, A.R., Bos, A., Diosdado, B., Rooijers, K., Elkon, R., Bolijn, A.S., Carvalho, B., Meijer, G.A., and Agami, R. (2012). Alternative Cleavage and Polyadenylation during Colorectal Cancer Development. Clinical Cancer Research 18, 5256–5266.

Mularoni, L., Ledda, A., Toll-Riera, M., and Albà, M.M. (2010). Natural selection drives the accumulation of amino acid tandem repeats in human proteins. Genome Res. 20, 745–754.

Orsi, R.H., Bowen, B.M., and Wiedmann, M. (2010). Homopolymeric tracts represent a general regulatory mechanism in prokaryotes. BMC Genomics 11, 102.

Oyola, S.O., Otto, T.D., Gu, Y., Maslen, G., Manske, M., Campino, S., Turner, D.J., Macinnis, B., Kwiatkowski, D.P., Swerdlow, H.P., et al. (2012). Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes. BMC Genomics 13, 1.

Ozsolak, F., and Milos, P.M. (2011). RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98.

Ozsolak, F., Platt, A.R., Jones, D.R., Reifenberger, J.G., Sass, L.E., McInerney, P., Thompson, J.F., Bowers, J., Jarosz, M., and Milos, P.M. (2009). Direct RNA sequencing. Nature 461, 814–818.

Ozsolak, F., Kapranov, P., Foissac, S., Kim, S.W., Fishilevich, E., Monaghan, A.P., John, B., and Milos, P.M. (2010). Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Cell 143, 1018–1029.

Pâques, F., Leung, W.Y., and Haber, J.E. (1998). Expansions and contractions in a tandem repeat induced by double-strand break repair. Mol. Cell. Biol. 18, 2045–2054.

Périer, R.C., Praz, V., Junier, T., Bonnard, C., and Bucher, P. (2000). The eukaryotic promoter database (EPD). Nucleic Acids Res. 28, 302–303.

Richard, G.F., and Pâques, F. (2000). Mini- and microsatellite expansions: the recombination connection. EMBO Rep. 1, 122–126.

Riley, D.E., and Krieger, J.N. (2005). Short tandem repeat (STR) replacements in UTRs and introns suggest an important role for certain STRs in gene expression and disease. Gene 344, 203–211.

Rockman, M.V., and Wray, G.A. (2002). Abundant raw material for cis-regulatory evolution in humans. Mol. Biol. Evol. 19, 1991–2004.

Roorkiwal, M., and Sharma, P.C. (2011). Mining functional microsatellites in legume unigenes. Bioinformation 7, 264–270.

Rothnie, H.M. (1996). Plant mRNA 3’-end formation. Plant Molecular Biology 32, 43–61.

Rothnie, H.M., Reid, J., and Hohn, T. (1994). The contribution of AAUAAA and the upstream element UUUGUA to the efficiency of mRNA 3’-end formation in plants. EMBO J. 13, 2200–2210.

Sanger, F., and Coulson, A.R. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology 94, 441–448.

Sanger, F., Nicklen, S., and Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74, 5463–5467.

Schmid, R., and Blaxter, M.L. (2008). annot8r: GO, EC and KEGG annotation of EST datasets. BMC Bioinformatics 9, 180.

Schuster, S.C. (2008). Next-generation sequencing transforms today’s biology. Nature Methods 5, 16–18.

Sharma, P.C., Grover, A., and Kahl, G. (2007). Mining microsatellites in eukaryotic genomes. Trends Biotechnol. 25, 490–498.

Sheets, M.D., Ogg, S.C., and Wickens, M.P. (1990). Point mutations in AAUAAA and the poly (A) addition site: effects on the accuracy and efficiency of cleavage and polyadenylation in vitro. Nucleic Acids Res. 18, 5799–5805.

Shen, Y., Ji, G., Haas, B.J., Wu, X., Zheng, J., Reese, G.J., and Li, Q.Q. (2008a). Genome level analysis of rice mRNA 3’-end processing signals and alternative polyadenylation. Nucleic Acids Research 36, 3150–3161.

Shen, Y., Liu, Y., Liu, L., Liang, C., and Li, Q.Q. (2008b). Unique features of nuclear mRNA poly(A) signals and alternative polyadenylation in Chlamydomonas reinhardtii. Genetics 179, 167–176.

Shen, Y., Venu, R.C., Nobuta, K., Wu, X., Notibala, V., Demirci, C., Meyers, B.C., Wang, G.-L., Ji, G., and Li, Q.Q. (2011). Transcriptome dynamics through alternative polyadenylation in developmental and environmental responses in plants revealed by deep sequencing. Genome Res. 21, 1478–1486.

Sherstnev, A., Duc, C., Cole, C., Zacharaki, V., Hornyik, C., Ozsolak, F., Milos, P.M., Barton, G.J., and Simpson, G.G. (2012). Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation. Nat. Struct. Mol. Biol. 19, 845–852.

Shi, Y. (2012). Alternative polyadenylation: New insights from global analyses. RNA 18, 2105–2117.

Singh, P., Alley, T.L., Wright, S.M., Kamdar, S., Schott, W., Wilpan, R.Y., Mills, K.D., and Graber, J.H. (2009). Global changes in processing of mRNA 3’ untranslated regions characterize clinically distinct cancer subtypes. Cancer Research 69, 9422–9430.

La Spada, A.R., Wilson, E.M., Lubahn, D.B., Harding, A.E., and Fischbeck, K.H. (1991). Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature 352, 77–79.

Von Stackelberg, M., Rensing, S.A., and Reski, R. (2006). Identification of genic moss SSR markers and a comparative analysis of twenty-four algal and plant gene indices reveal species-specific rather than group-specific characteristics of microsatellites. BMC Plant Biol. 6, 9.

Stanke, M., Diekhans, M., Baertsch, R., and Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644.

Streelman, J.T., and Kocher, T.D. (2002). Microsatellite variation associated with prolactin expression and growth of salt-challenged tilapia. Physiol. Genomics 9, 1–4.

Subramanian, S., Mishra, R.K., and Singh, L. (2003). Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol. 4, R13.

Sureshkumar, S., Todesco, M., Schneeberger, K., Harilal, R., Balasubramanian, S., and Weigel, D. (2009). A genetic defect caused by a triplet repeat expansion in Arabidopsis thaliana. Science 323, 1060–1063.

Szekely, G.J., and Rizzo, M.L. (2005). Hierarchical Clustering via Joint Between-Within Distances: Extending Ward’s Minimum Variance Method. Journal of Classification 22, 151–183.

Tachida, H., and Iizuka, M. (1992). Persistence of repeated sequences that evolve by replication slippage. Genetics 131, 471–478.

Tautz, D., and Renz, M. (1984). Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 12, 4127–4138.

Terauchi, M., Kato, A., Nagasato, C., and Motomura, T. (2010). Research note: Analysis of expressed sequence tags from the chrysophycean alga Ochromonas danica (Heterokontophyta). Phycological Research 58, 217–221.

Thomas, C.P., Andrews, J.I., and Liu, K.Z. (2007). Intronic polyadenylation signal sequences and alternate splicing generate human soluble Flt1 variants and regulate the abundance of soluble Flt1 in the placenta. FASEB J. 21, 3885–3895.

Thomas, C.P., Raikwar, N.S., Kelley, E.A., and Liu, K.Z. (2010). Alternate processing of Flt1 transcripts is directed by conserved cis-elements within an intronic region of FLT1 that reciprocally regulates splicing and polyadenylation. Nucleic Acids Research 38, 5130–5140.

Thomas, P.E., Wu, X., Liu, M., Gaffney, B., Ji, G., Li, Q.Q., and Hunt, A.G. (2012). Genome-Wide Control of Polyadenylation Site Choice by CPSF30 in Arabidopsis. Plant Cell 24, 4376–4388.

Tian, B., and Graber, J.H. (2011). Signals for pre-mRNA cleavage and polyadenylation. Wiley Interdisciplinary Reviews. RNA.

Tian, B., Hu, J., Zhang, H., and Lutz, C.S. (2005). A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Research 33, 201–212.

Tian, B., Pan, Z., and Lee, J.Y. (2007). Widespread mRNA polyadenylation events in introns indicate dynamic interplay between polyadenylation and splicing. Genome Research 17, 156–165.

Tóth, G., Gáspári, Z., and Jurka, J. (2000). Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10, 967–981.

Venkataraman, K., Brown, K.M., and Gilmartin, G.M. (2005). Analysis of a noncanonical poly(A) site reveals a tripartite mechanism for vertebrate poly(A) site recognition. Genes Dev 19, 1315–1327.

Verkerk, A.J., Pieretti, M., Sutcliffe, J.S., Fu, Y.H., Kuhl, D.P., Pizzuti, A., Reiner, O., Richards, S., Victoria, M.F., and Zhang, F.P. (1991). Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell 65, 905–914.

Verstrepen, K.J., Jansen, A., Lewitter, F., and Fink, G.R. (2005). Intragenic tandem repeats generate functional variability. Nat. Genet. 37, 986–990.

Victoria, F.C., da Maia, L.C., and de Oliveira, A. (2011). In silico comparative analysis of SSR markers in plants. BMC Plant Biology 11, 15.

Vinces, M.D., Legendre, M., Caldara, M., Hagihara, M., and Verstrepen, K.J. (2009). Unstable Tandem Repeats in Promoters Confer Transcriptional Evolvability. Science 324, 1213–1216.

Wachter, A., Tunc-Ozdemir, M., Grove, B.C., Green, P.J., Shintani, D.K., and Breaker, R.R. (2007). Riboswitch control of gene expression in plants by splicing and alternative 3’ end processing of mRNAs. Plant Cell 19, 3437–3450.

Walker, F.O. (2007). Huntington’s disease. Lancet 369, 218–228.

Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., and Burge, C.B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476.

Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63.

Wodniok, S., Simon, A., Glockner, G., and Becker, B. (2007). Gain and loss of polyadenylation signals during evolution of green algae. BMC Evolutionary Biology 7, 65.

Wu, T.D., and Watanabe, C.K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875.

Wu, X., Liu, M., Downie, B., Liang, C., Ji, G., Li, Q.Q., and Hunt, A.G. (2011). Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation. Proc. Natl. Acad. Sci. U.S.A. 108, 12533–12538.

Xia, X., MacKay, V., Yao, X., Wu, J., Miura, F., Ito, T., and Morris, D.R. (2011). Translation initiation: a regulatory role for poly(A) tracts in front of the AUG codon in Saccharomyces cerevisiae. Genetics 189, 469–478.

Xing, D., and Li, Q.Q. (2010). Alternative polyadenylation and gene expression regulation in plants. Wiley Interdisciplinary Reviews: RNA.

Xiong, J., Lu, X., Zhou, Z., Chang, Y., Yuan, D., Tian, M., Zhou, Z., Wang, L., Fu, C., Orias, E., et al. (2012). Transcriptome Analysis of the Model Protozoan, Tetrahymena thermophila, Using Deep RNA Sequencing. PLoS ONE 7, e30630.

Yandell, B.S. (1997). Practical data analysis for designed experiments (London; New York: Chapman & Hall).

Yang, Q., Gilmartin, G.M., and Doublié, S. (2010). Structural basis of UGUA recognition by the Nudix protein CFI(m)25 and implications for a regulatory role in mRNA 3’ processing. Proc. Natl. Acad. Sci. U.S.A. 107, 10062–10067.

Yang, Q., Coseno, M., Gilmartin, G.M., and Doublié, S. (2011a). Crystal structure of a human cleavage factor CFI(m)25/CFI(m)68/RNA complex provides an insight into poly(A) site recognition and RNA looping. Structure 19, 368–377.

Yang, Q., Gilmartin, G.M., and Doublié, S. (2011b). The structure of human Cleavage Factor Im hints at functions beyond UGUA-specific RNA binding: A role in alternative polyadenylation and a potential link to 5’ capping and splicing. RNA Biology 8, 748–753.

Zarudnaya, M.I., Kolomiets, I.M., Potyahaylo, A.L., and Hovorun, D.M. (2003). Downstream elements of mammalian pre-mRNA polyadenylation signals: primary, secondary and higher-order structures. Nucleic Acids Res. 31, 1375–1386.

Zhang, L., Yuan, D., Yu, S., Li, Z., Cao, Y., Miao, Z., Qian, H., and Tang, K. (2004). Preference of simple sequence repeats in coding and non-coding regions of Arabidopsis thaliana. Bioinformatics 20, 1081–1086.

Zhang, L., Zuo, K., Zhang, F., Cao, Y., Wang, J., Zhang, Y., Sun, X., and Tang, K. (2006). Conservation of noncoding microsatellites in plants: implication for gene regulation. BMC Genomics 7, 323.

Zhao, J., Hyman, L., and Moore, C. (1999). Formation of mRNA 3’ ends in eukaryotes: mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiology and Molecular Biology Reviews 63, 405–445.

Zhao, X., Tian, Y., Yang, R., Feng, H., Ouyang, Q., Tian, Y., Tan, Z., Li, M., Niu, Y., Jiang, J., et al. (2012). Coevolution between simple sequence repeats (SSRs) and virus genome size. BMC Genomics 13, 435.

Zheng, Q., and Wang, X.-J. (2008). GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res. 36, W358–363.

Zhou, K., Michiels, C.W., and Aertsen, A. (2012). Variation of Intragenic Tandem Repeat Tract of tolA Modulates Escherichia coli Stress Tolerance. PLoS ONE 7, e47766.

Zhou, Y., Liu, J., Han, L., Li, Z.-G., and Zhang, Z. (2011). Comprehensive analysis of tandem amino acid repeats from ten angiosperm genomes. BMC Genomics 12, 632.
