ABSTRACT

GENOME-WIDE COMPUTATIONAL ANALYSIS OF CHLAMYDOMONAS REINHARDTII PROMOTERS

by Kokulapalan Wimalanathan

As a model organism, Chlamydomonas is used not only in biological experiments to understand the chloroplast and flagella, but also in biodiesel production. Chlamydomonas promoter regions were extracted based on available RNA-Seq data and community genome annotation, and these promoters were used to analyze and detect core and proximal promoter elements. While the evidence suggests that the TATA box (canonical and non-canonical) is the only core promoter element, it also indicates that the TATA box in Chlamydomonas differs from the Arabidopsis and human TATA boxes. While some of the proximal promoter elements discovered show weak similarities to known promoter elements from other species, most are novel elements present only in Chlamydomonas. Most of the proximal promoter elements detected show significant similarities to each other. This study indicates that the promoter architecture in Chlamydomonas is simpler than that of animals and plants.

GENOME-WIDE COMPUTATIONAL ANALYSIS OF CHLAMYDOMONAS REINHARDTII PROMOTERS

A Thesis

Submitted to the

Faculty of Miami University

in partial fulfillment of

the requirements for the degree of

Master of Science

Department of

by

Kokulapalan Wimalanathan

Miami University

Oxford, Ohio

2011

Advisor ______(Dr. Chun Liang) Reader ______(Dr. Roger Meicenheimer) Reader ______(Dr. Mufit Ozden)

Table of Contents

List of Tables
List of Figures
Acknowledgments
1 Chapter 1: Review of Core Promoter Analysis
1.1 Gene expression
1.1.1 Eukaryotic gene structure
1.1.2 Transcription factors and cis-regulatory elements
1.1.3 Transcription initiation
1.1.4 Different types of promoters based on the distribution of TSS
1.1.5 Core promoter elements
1.1.6 Proximal promoter elements
1.1.7 Alternative promoters
1.2 Computational methods to identify cis-regulatory elements
1.2.1 Enumerative method
1.2.2 Deterministic optimization method
1.2.3 Probabilistic optimization method
1.2.4 Notations to represent cis-regulatory elements
1.3 References
1.4 Figures
1.5 Tables
2 Chapter 2
2.1 Abstract
2.2 Introduction
2.3 Results
2.3.1 Obtaining promoter regions for analysis
2.3.2 LDSS analysis of core promoters of human, Arabidopsis and Chlamydomonas
2.3.3 Comparing and combining octamer clusters to produce putative core promoter elements
2.3.4 Analysis of proximal promoter elements
2.4 Discussion
2.5 Materials and Methods
2.5.1 Obtaining Valid Promoter Sequences
2.5.2 LDSS analysis of core promoters of three species
2.5.3 Comparing and combining octamer clusters to form octamer groups
2.5.4 Analysis of proximal promoter elements
2.6 Acknowledgments
2.7 References
2.8 Figures
2.9 Tables

List of Tables

Table 1.1: The position and consensus of the core promoter elements in plants and animals
Table 1.2: IUPAC single character codes for nucleic acid sequences
Table 2.1: Basic statistics of upstream promoter analysis
Table 2.2: The 14 KEGG motif groups and their component KEGG motifs
Table 2.3: Functional annotation of GO motifs detected in GO gene groups
Table 2.4: Functional annotation of KEGG motifs detected in KEGG gene groups

List of Figures

Figure 1.1: , structural gene components and pre-initiation complex
Figure 1.2: Different types of promoters based on TSS distribution
Figure 1.3: Core promoter elements commonly found in plants and animals
Figure 1.4: A simple example of how to use the LDSS method
Figure 1.5: Creating sequence logos
Figure 1.6: Information represented in a sequence logo
Figure 1.7: Several ways commonly used to represent consensus motif models
Figure 2.1: The major core promoter elements present in animals and plants
Figure 2.2: Examples of LDSS-positive and LDSS-negative octamers in Chlamydomonas
Figure 2.3: LDSS heatmap graphs for Arabidopsis, human, and Chlamydomonas
Figure 2.4: Sequence logos from Arabidopsis LDSS octamer clusters
Figure 2.5: Sequence logos from human LDSS octamer clusters
Figure 2.6: Sequence logos from Chlamydomonas LDSS octamer clusters
Figure 2.7: Sequence logos from combined putative core promoter elements
Figure 2.8: GO motifs
Figure 2.9: Representative motifs from each KEGG motif group
Figure 2.10: Overall method used to analyze promoters
Figure 2.11: LDSS Parameters
Figure 2.12: Valid Promoters
Figure 2.13: MEME Minimal Motif Format

Acknowledgments

I heartily thank my advisor, Dr. Chun Liang, whose encouragement, guidance, and support have helped me conduct this research in a meaningful manner. I would also like to thank him for all the time spent on reading and revising my thesis. I would like to extend my gratitude to my committee members, Dr. Roger Meicenheimer and Dr. Mufit Ozden, for their enthusiasm and insightful comments. I would also like to thank my lab mate Ms. Lin Liu for annotating the C. reinhardtii gene models to the GO and KEGG databases, and Praveenkumar Rajkumar for aligning ESTs to the genome. I would also like to thank Zhixin Zhao for his support during presentations. I thank the Department of Botany at Miami University for funding this research through academic challenge grants. My sincere gratitude goes to Christina Johnson for proofreading my thesis and providing constructive criticism to improve my writing. I would also like to thank my friend Jie Chen for helping me complete the thesis and providing moral support. My sincere appreciation goes to Kathy Millar for helping with my academic challenge grants, which financially supported this thesis research. Last but not least, I would like to thank my mom for making all this possible, and for checking on my progress frequently throughout the time of writing this thesis.

1 Chapter 1: Review of Core Promoter Analysis

1.1 Gene expression

All known eukaryotic organisms contain deoxyribonucleic acid (DNA) in a membrane-bound nucleus. DNA is the genetic material inherited by successive generations, and a “blueprint” that contains an enormous amount of instructional information for building and maintaining an organism. Only a small portion of the total genomic DNA in each eukaryotic cell actively undergoes transcription. Transcription is the biochemical process that produces a ribonucleic acid (RNA) molecule from a DNA template. Essentially, genes are the regions of the DNA that can be activated or expressed to form RNAs during transcription with the aid of RNA polymerase II (Pol II, in Figure 1.1A). In eukaryotic protein-coding genes, the transcribed RNA is called messenger RNA (mRNA), which can be translated by ribosomes to produce proteins (see Figure 1.1A). Protein-coding genes in eukaryotic genomes have several distinguishable regions, each of which plays an important role in transcription or related processes.

1.1.1 Eukaryotic gene structure

Protein-coding genes found in eukaryotic genomes usually contain a promoter, a 5' untranslated region (5' UTR), introns (sequences that do not code for proteins), exons (sequences coding for polypeptides) and a 3' UTR (see Figure 1.1B) (Brosius, 2009). The 5' UTR and the 3' UTR are part of the exons, but for illustration purposes they have been separated from the exons. Under normal circumstances a gene is transcribed from the start of the 5' UTR to the end of the 3' UTR, where 5' and 3' denote the directionality of the DNA. The exact site where transcription initiates is called the Transcription Start Site (TSS). A pre-mRNA consisting of UTRs, introns and exons is the direct product of transcription. The pre-mRNA goes through RNA processing steps, including 5'-capping, intron splicing (removal of introns) and polyadenylation (addition of an adenine tail at the newly cleaved 3'-end of the pre-mRNA), to become a mature mRNA for translation (Bentley, 1999). All of the aforementioned gene regions might be involved in transcription initiation or transcriptional regulation, as well as in pre-mRNA processing (Brosius, 2009).

1.1.2 Transcription factors and cis-regulatory elements

Transcription initiation and regulation rely on proteins called transcription factors, which initiate and regulate transcription in a precise manner. These protein factors can initiate (Krishnamurthy and Hampsey, 2009), enhance (Collins et al., 1995), or silence transcription (Sparmann and van Lohuizen, 2006). They detect short sequence motifs in the DNA known as cis-regulatory elements (cis-elements) and bind to them in order to initiate and/or regulate transcription. cis-elements contribute directly or indirectly to many pre- and post-transcriptional processes, such as transcription initiation, intron splicing and polyadenylation.

Some cis-elements, such as the TATA box and the Initiator (Inr), are associated with initiating transcription (Juven-Gershon et al., 2006), and they are known as core promoter elements. Other cis-elements, such as the heat shock element, are associated with regulating transcription, e.g., how many transcripts (mRNAs) are generated for the same gene (Lodha et al., 2008). The cis-elements that are present in the promoter region proximal to the TSS and associated with transcriptional regulation are called proximal promoter elements. Even though these are the general characteristics of core and proximal promoter elements, in reality there is a significant overlap between the functions performed by these elements (Juven-Gershon and Kadonaga, 2010).

1.1.3 Transcription initiation

Transcription is mediated by the RNA Polymerase II (Pol II) core enzyme in most eukaryotes, and it is a complex process involving a large number of proteins. Five general transcription factors are essential for the initiation of transcription, namely TFIIB, TFIID, TFIIE, TFIIF and TFIIH (Krishnamurthy and Hampsey, 2009). These general transcription factors recognize specific core promoter elements responsible for transcription initiation, and enable accurate binding of Pol II to the TSS. The recognition of core promoter elements by the general transcription factors plays an important role in initiating transcription at the correct site (i.e., selecting the TSS to initiate transcription instead of other sites). Pol II and the general transcription factors assemble to form a pre-initiation complex (PIC) that initiates transcription (see Figure 1.1C). PIC assembly is a series of cascading events that involves complex protein-DNA motif binding (e.g., TFIID binds to the TATA box) and protein-protein binding (e.g., TFIIF binds to Pol II). Even though the PIC is capable of initiating transcription, a larger protein complex referred to as the mediator is needed for transcriptional regulation (Krishnamurthy and Hampsey, 2009) (see Figure 1.1C).

1.1.4 Different types of promoters based on the distribution of TSS

Initially, each gene was assumed to have only one TSS, and transcription was presumed to initiate from this common starting point. This theory proved to be an oversimplification, because transcription can initiate from more than a single site for a given gene (Juven-Gershon et al., 2006, 2008). Based on different transcription initiation patterns, Juven-Gershon et al. divided promoters into two types, namely focused promoters and dispersed promoters. In focused promoters, transcription initiates from a specific site that is one or a few nucleotides wide. In dispersed promoters, transcription initiates from several weak sites in a region spanning 100 to 150 nucleotides. In a study of mammalian promoter architecture, promoters were grouped into four categories based on TSS patterns (Carninci et al., 2006): focused promoters showed single peaked TSS sites, and dispersed promoters were further divided into three subtypes, namely broad promoters, broad peaked promoters, and multimodal promoters (see Figure 1.2). In broad promoters, transcription was weakly initiated over wider regions (150-200 bp) than in peaked promoters (Carninci et al., 2006). In broad peaked promoters, weak transcription initiation was observed over a wider region as in broad promoters, but a strong transcription initiation site was also present within that region (Carninci et al., 2006). In multimodal promoters, transcription initiated at two or more strong sites, resulting in two or more strong peaks (Carninci et al., 2006).

In simple terms, a promoter is the region upstream of the TSS, but this oversimplified definition does not cover all the features or functions of a promoter. A more complete definition would be “the modulatory DNA structures containing a complex array of cis-acting regulatory elements required for accurate and efficient initiation of transcription and for controlling expression of a gene” (Ayoubi and Van De Ven, 1996). This definition does not restrict the region or the distance from the TSS that can be considered a promoter, but merely assigns the term in a broader scope based on the functions performed by a promoter. This is more appropriate because some important core promoter elements necessary for transcription initiation, such as the Downstream Core Element (Deng and Roberts, 2005) and the Motif Ten Element (Lim et al., 2004), are actually found in the 5' UTR, downstream of the TSS.

1.1.5 Core promoter elements

Cis-elements that are present in a narrow region adjacent to the TSS and directly interact with the basal transcription machinery (i.e., Pol II and the general transcription factors) are called core promoter elements (Smale and Kadonaga, 2003; Krishnamurthy and Hampsey, 2009). Several DNA motifs that are directly involved in transcription initiation have been observed (Frith et al., 2008). Figure 1.3 is based on several review papers and illustrates the core promoter elements that have been discovered in animals and plants (Smale and Kadonaga, 2003; Juven-Gershon et al., 2008; Sandelin et al., 2007; Yamamoto et al., 2007a, 2007b; Chen et al., 2007; Bernard et al., 2010). Most core promoter elements occur within a small region around the TSS, and each core promoter element tends to occur at a specific distance from the TSS (see Table 1.1 for the positions of different core promoter elements). This positional conservation is seen because of two factors: the localization of some core promoter elements is necessary for accurate transcription initiation (e.g., the TATA box), and a particular spacing between different core promoter elements is needed for optimal transcription (e.g., the TATA box and Inr) (Juven-Gershon et al., 2008; Grosveld et al., 1981; Zhu et al., 1995). This stringent spacing or positional requirement relative to the TSS is a special characteristic of core promoter elements (Juven-Gershon et al., 2008; Ponjavic et al., 2006). Table 1.1 summarizes the information known about different core promoter elements, their consensus sequences in IUPAC format and their positional information (Zhu et al., 1995; Smale and Kadonaga, 2003; Juven-Gershon et al., 2006, 2008; Bernard et al., 2010).

The first core promoter element discovered was the TATA box. It was discovered in 1979 by Goldberg in eukaryotic promoter sequences (Goldberg, 1979). It remains the most studied core promoter element, and has been confirmed to influence transcription initiation through wet-lab experiments such as transfection (Grosschedl and Birnstiel, 1980) and in vitro transcription assays (Wasylyk et al., 1980). The TATA box was believed to be the most prevalent core promoter element and was presumed to be a strict requirement for transcription initiation (Breathnach and Chambon, 1981), but recent genome-wide studies have revealed that the TATA box is present in only a subset of genes. A large proportion of the genes that contain the TATA box have peaked promoters. In humans, around 24% of genes have a TATA box, but only about 10% of genes possess the canonical TATA box consensus sequence (TATAWAAR) (Yang et al., 2007). The TATA box is also present in plants (Yamamoto et al., 2007a, 2007b), but only around 25% of genes contain it.
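As a minimal sketch of how a consensus such as TATAWAAR can be matched against a sequence, the IUPAC codes (see Table 1.2) can be expanded into a regular expression. The function names and the example fragment below are illustrative assumptions and are not taken from the cited studies.

```python
import re

# IUPAC single-character nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Expand an IUPAC consensus string into a regular expression."""
    parts = []
    for code in consensus.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_consensus(sequence, consensus):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead lets the scan report overlapping occurrences.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

example = "GGCTATAAAAGGCC"  # hypothetical promoter fragment
print(consensus_to_regex("TATAWAAR"))
print(find_consensus(example, "TATAWAAR"))
```

An exact consensus match of this kind is stricter than the matrix-based scoring usually applied in genome-wide studies, so it should be read only as an illustration of the notation.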

The TFIIB Recognition Element (BRE), which is recognized by the general transcription factor TFIIB (Krishnamurthy and Hampsey, 2009), is a core promoter element closely associated with the TATA box, because the BRE lies immediately upstream or downstream of the TATA box. The BRE was initially found only upstream of the TATA box (BREU) (Lagrange et al., 1998), but recent studies confirmed the presence of another site recognized by TFIIB immediately downstream of the TATA box (BRED) (Deng and Roberts, 2005).

The TATA box was identified as the most prevalent core promoter element in earlier studies; currently, however, the Initiator (Inr) element is considered the most prevalent core promoter element. The Inr element occurs in about 46% of genes in humans, which is approximately 22 percentage points more than the TATA box (Yang et al., 2007). The Inr element encompasses the TSS, and is known to influence the direction of transcription and the selection of the TSS. Inr acts synergistically with the TATA box when both core promoter elements are present (O’Shea-Greenfield and Smale, 1992). Functional Inr elements have also been discovered in some genes of tobacco (Nakamura et al., 2002). About 10% of the genes in Arabidopsis and rice were shown to have Inr elements (Yamamoto et al., 2007a, 2007b). Unlike the Inr elements of humans (i.e., YYANWYY, see Table 1.1) or Drosophila (i.e., TCAKTY, see Table 1.1), plant Inr elements do not have a strict consensus sequence, but they contain a conserved pyrimidine/purine pair at the -1/+1 positions (Yamamoto et al., 2007a).

The Motif Ten Element (MTE) (Lim et al., 2004) and the Downstream core Promoter Element (DPE) were shown to act cooperatively with the Inr element (Burke and Kadonaga, 1996; Burke et al., 1998). Both of these elements were first discovered in Drosophila (Juven-Gershon et al., 2008), but were also identified in humans and have been experimentally verified to influence transcription (Lim et al., 2004). The Downstream Core Element (DCE) was first discovered in the human beta-globin gene, and it was found to have three sub-motifs with gaps between them. The DCE was the first core promoter motif found to contain gaps, and it changed the common belief that core promoter elements did not contain gaps (see Figure 1.3) (Lewis et al., 2000). The DCE frequently occurs with the TATA box and Inr core promoter elements, but also occurs independently of the TATA box (Lewis et al., 2000).

Plant promoters differ considerably from animal promoters, because plants usually lack some core promoter elements that are found in animal promoters, such as the MTE, DPE, DCE and CpG islands (Molina and Grotewold, 2005). Plants contain plant-specific core promoter elements such as the Y-patch, which occurs between the TATA box and the TSS (Civan and Svec, 2009; Yamamoto et al., 2007b), and TC motifs, which occur in the same position as the TATA box (Bernard et al., 2010). Moreover, the Inr is not overrepresented in plants; only a YR rule is seen (Yamamoto et al., 2007a, 2007b), and this rule is also seen in mammals (Yamamoto et al., 2007a, 2007b). The YR rule is the conservation of a pyrimidine and a purine nucleotide at the -1 and +1 positions, respectively (Juven-Gershon et al., 2008).
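The YR rule can be expressed as a simple check on the two nucleotides flanking the TSS. The following is a minimal sketch; the function name and the convention that the TSS is given as the 0-based index of the +1 nucleotide are assumptions made for illustration.

```python
PYRIMIDINES = set("CT")  # Y in IUPAC notation
PURINES = set("AG")      # R in IUPAC notation

def follows_yr_rule(sequence, tss_index):
    """Return True if position -1 is a pyrimidine and position +1 is a purine.

    tss_index is the 0-based index of the +1 nucleotide in sequence
    (an assumed convention for this sketch).
    """
    if tss_index < 1 or tss_index >= len(sequence):
        return False  # no -1 nucleotide, or the index falls outside the sequence
    seq = sequence.upper()
    return seq[tss_index - 1] in PYRIMIDINES and seq[tss_index] in PURINES

def yr_fraction(promoters):
    """Fraction of (sequence, tss_index) pairs that satisfy the YR rule."""
    if not promoters:
        return 0.0
    hits = sum(follows_yr_rule(seq, tss) for seq, tss in promoters)
    return hits / len(promoters)
```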

1.1.6 Proximal promoter elements

Other cis-regulatory elements that are involved in transcriptional regulation and present in the region upstream of the core promoter elements are called proximal promoter elements (Haberer et al., 2011). Proximal promoter elements can be classified into two major groups: enhancers and silencers. As the names imply, enhancers increase the transcription level whereas silencers reduce it (Lauster et al., 1993). Moreover, the large mediator complex involved in transcriptional regulation contains multiple proteins that can bind to a large number of motifs, including proximal promoter elements (Krishnamurthy and Hampsey, 2009). The large number of proteins and sequence motifs enables the combinatorial control of transcription (Takaiwa et al., 1996; Singh, 1998; Smale, 2001; Tuch et al., 2008; Karamitri et al., 2009). This complexity has also enabled the evolution of regulatory motifs that regulate groups of genes with similar function, leading to the lack of a universal proximal promoter element (Takaiwa et al., 1996; Yokoyama and Nishitani, 2001; Zhang et al., 2005; Obayashi et al., 2006). Putative proximal promoter elements are usually discovered by sequence analysis of co-expressed genes (Stine et al., 2003). However, experimental verification is necessary before a putative proximal promoter element can be accepted as a true cis-regulatory element (Yamaguchi-Shinozaki and Shinozaki, 1994).

Proximal promoter elements have been used in genetic engineering studies to improve traits including drought and flood tolerance in crop plants (Guan et al., 2000; Umezawa et al., 2006). In Chlamydomonas, a recent finding revealed that a Negative Fe-Deficiency-Responsive Element (Fei et al., 2010), which is about 15 bp long and found at position -292 upstream of the TSS, controls the expression of the FTR1 gene. More research is needed to characterize important core and proximal promoter elements in this green algal species, in order to improve our understanding of its gene expression and regulation for applications such as biodiesel production (Morowvat et al., 2010).

1.1.7 Alternative promoters

The Human Genome Project revealed that the human genome contains only around 25,000 protein-coding genes, far fewer than the previously predicted 100,000 genes (Lander et al., 2001; Venter et al., 2001). The number of genes in humans does not correlate well with organismal complexity. Great effort has been put into studies that address how a small number of genes could lead to complex structural and behavioral traits in humans as well as in other animals and plants. A growing body of evidence supports the view that alternative pre-mRNA processing enables a single gene to produce multiple mRNA isoforms and increases transcriptome diversity (Moore and Proudfoot, 2009). The major mechanisms shown to contribute to transcriptome diversity are alternative promoter usage, alternative splicing, and alternative polyadenylation (Bentley, 1999). Alternative splicing and alternative polyadenylation have been studied well in different organisms, but alternative promoter usage has been well studied only in animals such as humans (Kimura et al., 2006), Drosophila (Zhu and Halfon, 2009) and mouse (Tsuritani et al., 2007), and its study is lagging in plants. So far, alternative promoter usage in plants has been studied only in Arabidopsis thaliana and Oryza sativa (Chen et al., 2007).

Alternative promoters generate multiple transcripts with different 5' ends from the same gene, thereby increasing transcriptome diversity (Ayoubi and Van De Ven, 1996). Recent methods that utilize the G-cap at the 5' end of the transcript to capture the 5' end of a mature mRNA have increased the accuracy of TSS determination (Sandelin et al., 2007). Furthermore, next-generation sequencing technologies such as 454 pyrosequencing and Illumina SBS (sequencing by synthesis) are replacing traditional techniques such as RACE (Rapid Amplification of cDNA Ends), SAGE (Serial Analysis of Gene Expression), and CAGE (Cap Analysis of Gene Expression) (Gowda et al., 2007, 2006; Ni et al., 2010). These new methods allow accurate genome-wide determination of TSSs in a high-throughput manner. High-throughput identification of TSSs is invaluable in determining which genes utilize alternative promoters on a genome-wide scale (Sandelin et al., 2007).

The prevalence of alternative promoters has been demonstrated widely in animals, starting with humans. Approximately 50% of human genes appear to use alternative promoters, as determined from alternative TSSs that are at least 500 bp apart (Kimura et al., 2006). Similar results are also evident in mouse (Tsuritani et al., 2007) and Drosophila (Zhu and Halfon, 2009), with about half of their genes utilizing alternative promoters. The prevalence of alternative promoter usage in plants seems to be much lower than in animals, but this striking difference may be due to the smaller data sets used for analysis. Rice and Arabidopsis show ~5% and ~5.9% alternative promoter usage, respectively, based on traditional Sanger sequence data (Chen et al., 2007). This is in contrast to an alternative splicing rate of ~42% of intron-containing genes in Arabidopsis (Filichkin et al., 2010), and to alternative polyadenylation, which has so far been documented for ~70% of genes in Arabidopsis (Wu et al., 2011) and ~50% in rice (Shen et al., 2008). If improved methods are used to determine TSSs in plants, the number of genes showing alternative promoter usage is expected to increase significantly.
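The 500 bp criterion can be illustrated with a small sketch that groups the observed TSS positions of one gene and reports whether more than one group exists. The grouping strategy (a new group starts whenever the gap to the previous TSS reaches the threshold) and the function names are assumptions made for illustration; they are not the procedure of Kimura et al. (2006).

```python
def group_tss(positions, min_gap=500):
    """Group TSS coordinates so that neighbouring groups are at least min_gap bp apart."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < min_gap:
            groups[-1].append(pos)   # close to the previous TSS: same putative promoter
        else:
            groups.append([pos])     # far from the previous TSS: new putative promoter
    return groups

def uses_alternative_promoters(positions, min_gap=500):
    """A gene is flagged when its TSSs fall into more than one group."""
    return len(group_tss(positions, min_gap)) > 1

# Hypothetical TSS coordinates for a single gene.
print(group_tss([100, 120, 900, 950, 130]))
```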

While the prevalence and extent of alternative promoter usage are being addressed by recent genome-wide studies, including those using next-generation sequencing, the functional importance of alternative promoters has also been evident in many studies. Usage of alternative promoters has been linked to environmental stresses, such as Hg2+ in bean (Qi et al., 2007), and to disease resistance in rice (Koo et al., 2009). Specific genes, such as the stem cell protein AC133 in humans, produce different tissue-specific isoforms (Shmelkov et al., 2004).

Genome-wide analysis of alternative splicing and alternative promoter usage revealed a significant correlation between these two events. Genes with single promoters were less likely to undergo alternative splicing than genes with alternative promoters, and genes with more alternative promoters tended to have more alternatively spliced isoforms; alternative promoter usage is therefore positively correlated with alternative splicing (Xin et al., 2008).

1.2 Computational methods to identify cis-regulatory elements

As mentioned in the previous section, cis-regulatory elements play a crucial role in both transcription initiation and transcriptional regulation. Moreover, they also play an important role in pre-mRNA processing such as splicing and polyadenylation. The computational approach used to detect potential cis-elements is also known as motif discovery, because cis-elements are essentially short sequence motifs that are recognized and bound by transcription factors or other protein factors.

The motif discovery problem is described as “searching for imperfect copies of an unknown pattern, perhaps as small as 6–8 base pairs, occurring potentially thousands of bases upstream of some unknown subset of our genes of interest” (D’haeseleer, 2006). The motif discovery problem has high complexity and is one of the most challenging computational problems in bioinformatics. Based on the approach used to find motifs, motif discovery methods can be divided into three major groups: enumeration, deterministic optimization and probabilistic optimization (D’haeseleer, 2006).

1.2.1 Enumerative method

Enumerative methods perform an exhaustive search for all possible motifs in the set of input sequences, and the top candidate motifs are selected using statistical methods (D’haeseleer, 2006). The dictionary-based approach is one enumerative approach, in which a dictionary is the set of all possible motifs of a fixed width. For instance, a search for a DNA/RNA motif eight bases long produces a dictionary of 65,536 (4^8) different motifs to scan for. Enumerative methods typically count the occurrences of each of these motifs and try to select the top motifs based on the number of occurrences and its statistical significance (e.g., Z-score) (Sinha and Tompa, 2002; Yamamoto et al., 2007b).
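The dictionary-based idea can be sketched in a few lines: build every possible k-mer, count its occurrences in the input sequences, and score each count against what a background model would predict. The Z-score below uses a simple binomial approximation with independent single-nucleotide background frequencies; this background model and all function names are assumptions made for illustration, and published tools use more careful statistics.

```python
from collections import Counter
from itertools import product
from math import sqrt

def build_dictionary(k):
    """All 4**k possible DNA words of width k."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def count_kmers(sequences, k):
    """Count every overlapping k-mer occurrence across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def z_scores(sequences, k):
    """Z-score of each dictionary word under an independent-base background (assumed model)."""
    counts = count_kmers(sequences, k)
    base_freq = Counter("".join(sequences).upper())
    total_bases = sum(base_freq.values())
    windows = sum(max(len(s) - k + 1, 0) for s in sequences)
    scores = {}
    for kmer in build_dictionary(k):
        p = 1.0
        for base in kmer:
            p *= base_freq[base] / total_bases
        expected = windows * p
        sd = sqrt(windows * p * (1.0 - p))
        scores[kmer] = (counts[kmer] - expected) / sd if sd > 0 else 0.0
    return scores

# Hypothetical toy input; a real analysis would use thousands of promoters.
toy = ["CGTATAAG", "GCTATAAC", "AGTATAAT"]
ranked = sorted(z_scores(toy, 5).items(), key=lambda item: item[1], reverse=True)
print(ranked[:3])
```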

One specific enumerative method that has been used to study promoters in plants and animals is the Local Distribution of Short Sequences (LDSS) (Yamamoto et al., 2007b, 2007a). LDSS is a dictionary-based method, and the statistics it uses to select candidate motifs are more suitable for elucidating and characterizing core promoter elements than proximal promoter elements. LDSS uses a fixed motif width of 8 nt (octamer), and generates all possible octamers (65,536) to create a dictionary. LDSS uses the promoter regions from -1000 bp to +200 bp of the genes chosen for analysis, where +1 indicates the TSS. LDSS tends to produce more accurate results when a large number of genes (>5000) is used for analysis (Yamamoto et al., 2007b, 2007a). The number of genes that have a specific octamer at each position is counted, and this is called the frequency distribution for that octamer. The frequency distribution is used to determine the putative conserved motifs (i.e., core promoter motifs), which usually have a positional distance restriction relative to the TSS (Juven-Gershon et al., 2008).

LDSS uses a 1200 bp region (from -1000 bp to +200 bp) to calculate the frequency distribution. Core promoter elements tend to occur at high frequency near the TSS in a localized manner, resulting in a positional bias in their frequency distribution. As shown in Figure 1.4, over-represented octamers display evident peaks that indicate the positional bias, and these peaks can be detected easily using simple statistics. The LDSS method adopts several statistical parameters to assess the significance of the peaks for different octamers (for details, see Chapter 2). Any octamer that possesses a significant peak in the core promoter region (-50 bp to +50 bp) is considered part of a putative core promoter motif. Multiple octamers can have peaks at the same position, and this is due to the motif degeneracy seen in core promoter elements. The LDSS method takes advantage of the positional conservation of core promoter motifs, and has proven largely effective in core promoter studies of plants and animals (Yamamoto et al., 2007b, 2007a). Most proximal promoter elements do not have a strict spacing requirement or constraint relative to the TSS, so LDSS is less effective in finding proximal promoter elements.
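The frequency distribution described above can be sketched as follows, assuming that all promoter sequences are aligned on the TSS and have the same length. The peak summary shown here (peak height compared with the mean of the remaining positions) is only a stand-in for illustration; it is not the set of LDSS statistical parameters, which are described in Chapter 2. All function names are hypothetical.

```python
def position_distribution(promoters, kmer):
    """Number of promoters carrying kmer at each start index (equal-length promoters assumed)."""
    length = len(promoters[0])
    k = len(kmer)
    dist = [0] * (length - k + 1)
    for seq in promoters:
        seq = seq.upper()
        start = seq.find(kmer)
        while start != -1:
            dist[start] += 1
            start = seq.find(kmer, start + 1)  # continue so overlapping hits are counted
    return dist

def to_tss_coordinate(index, upstream=1000):
    """Convert a 0-based index to a TSS-relative position (+1 is the TSS, there is no 0)."""
    return index - upstream if index < upstream else index - upstream + 1

def peak_summary(dist, upstream=1000):
    """Locate the highest point and compare it with the mean of all other positions."""
    peak_index = max(range(len(dist)), key=dist.__getitem__)
    baseline = (sum(dist) - dist[peak_index]) / (len(dist) - 1)
    return {
        "position": to_tss_coordinate(peak_index, upstream),
        "height": dist[peak_index],
        "baseline": baseline,
    }

# Hypothetical 16-nt "promoters" with 10 nt upstream of the TSS, for illustration only.
toy = ["GGGGTATAGGCCCCCC", "CCCCTATACCGGGGGG", "GGCCTATAGGCCGGCC"]
print(peak_summary(position_distribution(toy, "TATA"), upstream=10))
```

With real data the same functions would be called with 1200-nt promoters, an octamer, and the default upstream length of 1000.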

1.2.2 Deterministic optimization method

Deterministic optimization methods build a motif model, based on a position weight matrix, from a set of random sites (a site is an n-mer string that is considered an occurrence of a given motif) and background frequencies. They iteratively try to refine this motif model by finding and integrating other similar sites. A final model is produced by multiple iterations of model refinement, until the optimal solution is found or the selected number of iterations is completed (D’haeseleer, 2006). An important algorithm based on deterministic optimization is Expectation Maximization (EM) (Do and Batzoglou, 2008), which has been used in gene frequency estimation (Ceppellini and Siniscalco, 1955) and in motif discovery applications (Lawrence and Reilly, 1990). Currently, the most popular motif discovery software suite, MEME, also uses the EM algorithm (Bailey and Elkan, 1994; Bailey et al., 2006, 2009).

MEME uses EM along with several additional measures to find the top candidate motifs. The input to MEME is a set of sequences, which can be either nucleotides (DNA/RNA) or amino acids (protein), and these sequences are presumed to contain common motifs. MEME utilizes EM to identify the common motifs of a given length(s), and ranks them by the expected value (e-value).

EM is initiated by building a motif model from a set of random sites and the nucleotide background frequency. Each site (an n-mer string with the same length as the target motif) in the entire set of input sequences is evaluated to see whether it better fits the background frequency or the putative motif model (Do and Batzoglou, 2008). If a new site appears to fit the motif model, it is counted as an occurrence of the motif and the probability of the site fitting the motif is calculated. The motif model is then reconstructed using a weighted average of the probabilities of all the occurrences of the motif (see Figure 1.5). More motifs can be found by masking the occurrences of previously found motifs in the input sequences and running the iteration again (D’haeseleer, 2006). This iterative process creates multiple motif models, which are then tested to see whether their occurrences are due to random chance or to non-random conservation. A functional motif is non-randomly conserved in a set of sequences due to evolution, and is not merely observed due to similar nucleotide compositions. The expected value (e-value) estimates how many motifs of equal or better quality would be expected by chance, so a motif that occurs because of conservation rather than random chance would ideally have an e-value of 0. In reality, observing an e-value of 0 is rare, and the e-value of over-represented motifs is mostly small and close to 0 (e.g., 1e-10). The final step of MEME ranks the motifs in ascending order of e-value (D’haeseleer, 2006).
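The EM procedure described above can be illustrated with a minimal sketch. It assumes exactly one motif occurrence per sequence, a seed site supplied by the caller, and a small pseudocount; the function names, the softening values (0.7/0.1), and the default parameters are illustrative assumptions and do not reproduce MEME's actual implementation.

```python
ALPHABET = "ACGT"

def em_motif(seqs, width, seed_site, iterations=50, pseudo=0.1):
    """Toy EM for a motif model, assuming one occurrence per sequence."""
    # Background model: overall nucleotide frequencies of the input.
    total = sum(len(s) for s in seqs)
    bg = {b: sum(s.count(b) for s in seqs) / total for b in ALPHABET}
    # Initial motif model: the seed site, softened so no probability is zero
    # (0.7 / 0.1 are arbitrary illustrative values).
    pwm = [{b: (0.7 if b == seed_site[j] else 0.1) for b in ALPHABET}
           for j in range(width)]
    for _ in range(iterations):
        counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
        for s in seqs:
            # E-step: odds that each site fits the motif rather than background.
            odds = []
            for i in range(len(s) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= pwm[j][s[i + j]] / bg[s[i + j]]
                odds.append(ratio)
            norm = sum(odds)
            # Each site contributes to the new model in proportion to its weight.
            for i, o in enumerate(odds):
                for j in range(width):
                    counts[j][s[i + j]] += o / norm
        # M-step: rebuild the motif model from the weighted counts.
        pwm = [{b: col[b] / sum(col.values()) for b in ALPHABET}
               for col in counts]
    return pwm

def consensus(pwm):
    """Most probable base at each position of the model."""
    return "".join(max(col, key=col.get) for col in pwm)
```

The sketch handles only the characters A, C, G and T, and it omits masking, multiple motifs, and e-value estimation.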

The selection of random sites to construct the initial motif model often leads to a suboptimal solution (a local optimum). Therefore, MEME has an additional enumerative step, which first builds basic motif models from all possible n-mers and selects the best model to initiate EM, in order to find the global optimal solution. Due to this additional enumerative step, the MEME software has been immensely successful in motif-finding studies ranging from several genes (Bailey and Elkan, 1994) to genome-wide studies (Molina and Grotewold, 2005). Most importantly, the MEME software suite contains a wide range of bioinformatics utilities to annotate motifs and to compare newly discovered motifs with experimentally verified motifs. Annotation and comparison are essential components of motif-finding studies because they allow the exploration of the potential biological functions of putative motifs and the detection of possible homology among newly discovered motifs, as well as between newly discovered and experimentally verified motifs.
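The idea of enumerating candidate starting points before running EM can be sketched as follows. This is a simplified stand-in: the function name best_seed and its scoring rule (the number of input sequences that contain the n-mer exactly) are assumptions made for illustration, and are not the scoring that MEME itself applies.

```python
def best_seed(seqs, width):
    """Pick the n-mer present in the largest number of input sequences."""
    # Enumerate every n-mer that actually occurs in the input.
    candidates = sorted({s[i:i + width]
                         for s in seqs
                         for i in range(len(s) - width + 1)})

    def support(nmer):
        # Hypothetical score: how many sequences contain this n-mer exactly.
        return sum(nmer in s for s in seqs)

    # Candidates are sorted, so ties resolve to the alphabetically first n-mer.
    return max(candidates, key=support)
```

The n-mer returned here would then serve as the initial motif model for EM, in place of randomly selected sites.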

1.2.3 Probabilistic optimization method

Probabilistic optimization methods create initial motif models based on randomly selected sites from a set of sequences. They then search the sequences and probabilistically decide whether a new site should be added or an old site should be deleted. This is an important difference compared to deterministic optimization, which does not remove improbable sites once they are included in the motif model. After each addition or deletion, the probabilities for each site are recalculated and the model is updated (D’haeseleer, 2006). An example of this approach is the Gibbs sampling algorithm, which is implemented in the MotifSampler tool (Thijs et al., 2002).
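A minimal sketch of a Gibbs-style site sampler is given below to make the add/delete step concrete. It assumes one site per sequence and at least two input sequences; the function names, pseudocounts, and the number of sweeps are illustrative assumptions and do not correspond to the MotifSampler implementation.

```python
import random

ALPHABET = "ACGT"

def build_model(sites, pseudo=0.5):
    """Column-wise base probabilities from equal-length sites."""
    model = []
    for j in range(len(sites[0])):
        col = {b: pseudo for b in ALPHABET}
        for site in sites:
            col[site[j]] += 1
        total = sum(col.values())
        model.append({b: c / total for b, c in col.items()})
    return model

def site_weights(seq, model, background):
    """Likelihood ratio (motif vs. background) for every offset in seq."""
    width = len(model)
    weights = []
    for i in range(len(seq) - width + 1):
        w = 1.0
        for j in range(width):
            w *= model[j][seq[i + j]] / background[seq[i + j]]
        weights.append(w)
    return weights

def gibbs_sample(seqs, width, sweeps=200, rng=None):
    """Return one sampled site start per sequence."""
    rng = rng or random.Random()
    total = sum(len(s) for s in seqs)
    background = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4)
                  for b in ALPHABET}
    # Start from one randomly chosen site in each sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Delete the site of sequence k and rebuild the model from the rest.
            others = [s[p:p + width]
                      for n, (s, p) in enumerate(zip(seqs, starts)) if n != k]
            model = build_model(others)
            # Probabilistically choose the replacement site for sequence k.
            weights = site_weights(seq, model, background)
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because each replacement site is drawn at random in proportion to its weight, repeated runs may return different sites; passing a seeded random.Random instance makes a run reproducible.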

1.2.4 Notations to represent cis-regulatory elements

Although transcription factor–cis-regulatory element binding is specific, a certain degree of flexibility is observed. This flexibility allows minor changes in the nucleotide sequence among different occurrences of a given motif. The sequence variation observed in a motif is called motif degeneracy (Zhang et al., 2005). Due to degeneracy, most DNA motifs are described by consensus sequences, which represent the different nucleotides (or, for protein motifs, amino acids) that can be present at a single position in a motif (see Figure 1.7). The most popular way to represent a consensus sequence within text is to use the IUPAC rules (see Table 1.2). In the case of DNA/RNA, the IUPAC rules allow more than one nucleotide to be represented by a single character.
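As an illustration of how degenerate positions collapse into IUPAC characters, the following sketch derives a consensus string from a set of aligned sites. The function name and the min_fraction threshold (the minimum share of sites in which a base must appear to be kept) are assumptions introduced for this example.

```python
# IUPAC single-character codes for sets of nucleotides (see Table 1.2).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H", frozenset("AGT"): "D",
    frozenset("CGT"): "B", frozenset("ACGT"): "N",
}

def iupac_consensus(sites, min_fraction=0.25):
    """Collapse aligned, equal-length sites into one IUPAC string."""
    consensus = []
    for column in zip(*sites):
        # Keep the bases seen in at least min_fraction of the sites
        # (hypothetical threshold chosen for this example).
        kept = {b for b in set(column)
                if column.count(b) / len(column) >= min_fraction}
        # Fall back to N if the threshold removes every base.
        consensus.append(IUPAC.get(frozenset(kept), "N"))
    return "".join(consensus)

sites = ["TATAAAAG", "TATATAAA", "TATAAAAA", "TATATAAG"]
print(iupac_consensus(sites))  # TATAWAAR
```

The four example sites were chosen so that the result matches the TATA box consensus listed in Table 1.1.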

DNA motifs are also represented by regular expressions (Sigrist et al., 2002). This method is most useful for bioinformaticians who need to search for potential binding sites of a known motif in a given set of sequences. Although these two methods are simple and common, they do not convey enough information about cis-regulatory motifs. Motif description models such as sequence logos and the position-specific probability matrix (also known as the position weight matrix) have been developed to represent other important information associated with cis-regulatory motifs.
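The use of regular expressions to search for occurrences of a known motif can be sketched as follows. The helper names are illustrative; the sketch translates an IUPAC consensus into a character-class pattern and scans only the given strand, so a complete search would also need to scan the reverse complement.

```python
import re

# Character classes for the IUPAC nucleotide codes (see Table 1.2).
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]",
    "S": "[CG]", "Y": "[CT]", "K": "[GT]",
    "V": "[ACG]", "H": "[ACT]", "D": "[AGT]", "B": "[CGT]",
    "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC_CLASSES[c] for c in consensus.upper()))

def find_sites(pattern, sequence):
    """Return (start, matched_text) for every match, including overlaps."""
    # Wrapping the pattern in a lookahead lets overlapping sites be reported.
    overlapping = re.compile("(?=(" + pattern.pattern + "))")
    return [(m.start(), m.group(1)) for m in overlapping.finditer(sequence)]

tata = iupac_to_regex("TATAWAAR")
print(tata.pattern)  # TATA[AT]AA[AG]
print(find_sites(tata, "GGCTATAAAAGGCTATATAAACC"))
# [(3, 'TATAAAAG'), (13, 'TATATAAA')]
```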

A position weight matrix represents the probability of each residue being present at each position of the motif (see Figure 1.7). The matrix is constructed after aligning all occurrences of the motif in a given set of sequences, and it therefore carries additional information about the frequency of each nucleotide at each position. A position weight matrix is used to determine the probability of finding a specific nucleotide at a given position, and thus to identify the most probable nucleotide at each position (Bailey and Elkan, 1994).
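A minimal sketch of how such a matrix can be built from aligned sites is shown below. The function names and the optional pseudocount parameter are assumptions made for illustration; the matrix is stored as one dictionary of base probabilities per motif position.

```python
from collections import Counter

def position_probability_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Per-position base probabilities from aligned, equal-length sites."""
    n = len(sites)
    denom = n + pseudocount * len(alphabet)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: (counts[b] + pseudocount) / denom for b in alphabet})
    return matrix

def most_probable(matrix):
    """Most probable base at each position."""
    return "".join(max(col, key=col.get) for col in matrix)

def site_probability(matrix, site):
    """Probability of a site under the matrix, treating positions as independent."""
    p = 1.0
    for col, base in zip(matrix, site):
        p *= col[base]
    return p

sites = ["TATAAA", "TATATA", "TATAAA", "TACAAA"]
matrix = position_probability_matrix(sites)
print(most_probable(matrix))               # TATAAA
print(site_probability(matrix, "TATAAA"))  # 0.5625
```

Without a pseudocount, any base never observed at a position receives probability 0, so a small positive pseudocount is commonly used when the matrix is applied to score new sites.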

Sequence logos, developed by Schneider and Stephens in 1990, have gained wide acceptance in the literature as a graphical approach to representing consensus motif sequences (Schneider and Stephens, 1990). As in the construction of the position-specific probability matrix, the first step in constructing a sequence logo is to align all occurrences of the motif, including its variants. The information content is then determined and encoded using a method adapted from the communication theory proposed by Shannon (Claude Shannon et al., 1948). Sequence logos provide information about the “general consensus of the motif”, “the order of predominance of the residues at each position”, “relative frequencies of every residue at every position”, and “the amount of information present at every position in the motif” (Schneider and Stephens, 1990) (see Figure 1.6).
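The information content encoded in a sequence logo can be sketched as follows for DNA, where a position carries at most log2(4) = 2 bits. The function names are illustrative, and the sketch omits the small-sample correction that some logo implementations apply.

```python
import math

def column_information(freqs):
    """Information content (bits) of one DNA position: 2 minus its entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(4) - entropy

def logo_heights(matrix):
    """Letter heights per position, ordered from smallest to tallest."""
    heights = []
    for col in matrix:
        info = column_information(col)
        # Each letter's height is its frequency times the column information.
        stack = sorted((p * info, b) for b, p in col.items() if p > 0)
        heights.append([(b, h) for h, b in stack])
    return heights
```

Under this definition, a position where a single nucleotide always occurs carries 2 bits, a position where two nucleotides are equally frequent carries 1 bit, and a position where all four nucleotides are equally frequent carries 0 bits.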

1.3 References

Ayoubi, T., and Van De Ven, W. (1996). Regulation of gene expression by alternative promoters. FASEB J. 10, 453-460.

Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L., Ren, J., Li, W. W., and Noble, W. S. (2009). MEME Suite: tools for motif discovery and searching. Nucleic Acids Research 37, W202-W208.

Bailey, T. L., Williams, N., Misleh, C., and Li, W. W. (2006). MEME: discovering and analyzing DNA and protein sequence motifs. Nucl. Acids Res. 34, W369-373.

Bailey, T. L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36.

Bentley, D. (1999). Coupling RNA polymerase II transcription with pre-mRNA processing. Curr. Opin. Cell Biol 11, 347-351.

Bernard, V., Brunaud, V., and Lecharny, A. (2010). TC-motifs at the TATA-box expected position in plant genes: a novel class of motifs involved in the transcription regulation. BMC Genomics 11, 166.

Breathnach, R., and Chambon, P. (1981). Organization and Expression of Eucaryotic Split Genes Coding for Proteins. Annual Review of Biochemistry 50, 349-383.

Brosius, J. (2009). The Fragmented Gene. Annals of the New York Academy of Sciences 1178, 186-193.

Burke, T. W., and Kadonaga, J. T. (1996). Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATA-box-deficient promoters. Genes & Development 10, 711-724.

Burke, T. W., Willy, P. J., Kutach, A. K., Butler, J. E., and Kadonaga, J. T. (1998). The DPE, a conserved downstream core promoter element that is functionally analogous to the TATA box. Cold Spring Harb Symp Quant Biol 63, 75 - 82.

Carninci, P., Sandelin, A., Lenhard, B., Katayama, S., Shimokawa, K., Ponjavic, J., Semple, C. A. M., Taylor, M. S., Engstrom, P. G., Frith, M. C., et al. (2006). Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 38, 626-635.

Ceppellini, R., and Siniscalco, M. (1955). A new genetic hypothesis of the Lewis-Secretor system and evidence of its linkage with other loci. Riv Ist Sieroter Ital 30, 431-445.

Chen, W.-H., Lv, G., Lv, C., Zeng, C., and Hu, S. (2007). Systematic analysis of alternative first exons in plant genomes. BMC Plant Biology 7, 55.

Civan, P., and Svec, M. (2009). Genome-wide analysis of rice (Oryza sativa L. subsp. japonica) TATA box and Y Patch promoter elements. Genome 52, 294-297.

Claude Shannon, Noshirwan Petigara, and Satwiksai Seshasai (1948). A Mathematical Theory of Communication. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.9920.

Collins, T., Read, M., Neish, A., Whitley, M., Thanos, D., and Maniatis, T. (1995). Transcriptional regulation of endothelial cell adhesion molecules: NF-kappa B and cytokine-inducible enhancers. The FASEB Journal 9, 899-909.

Deng, W., and Roberts, S. G. E. (2005). A core promoter element downstream of the TATA box that is recognized by TFIIB. Genes & development 19, 2418-2423.

D’haeseleer, P. (2006). How does DNA sequence motif discovery work? Nat Biotech 24, 959-961.

Do, C. B., and Batzoglou, S. (2008). What is the expectation maximization algorithm? Nat Biotech 26, 897-899.

Fei, X., Eriksson, M., Li, Y., and Deng, X. (2010). A novel negative Fe-deficiency-responsive element and a TGGCA-type-like FeRE control the expression of FTR1 in Chlamydomonas reinhardtii. Journal of Biomedicine & Biotechnology 2010, 790247.

Filichkin, S. A., Priest, H. D., Givan, S. A., Shen, R., Bryant, D. W., Fox, S. E., Wong, W.-K., and Mockler, T. C. (2010). Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Research 20, 45-58.

Frith, M. C., Valen, E., Krogh, A., Hayashizaki, Y., Carninci, P., and Sandelin, A. (2008). A code for transcription initiation in mammalian genomes. Genome research 18, 1-12.

Goldberg, M. L. (1979). PhD Thesis.

Gowda, M., Li, H., Alessi, J., Chen, F., Pratt, R., and Wang, G.-L. (2006). Robust analysis of 5’-transcript ends (5’-RATE): a novel technique for transcriptome analysis and genome annotation. Nucl. Acids Res. 34, e126.

Gowda, M., Li, H., and Wang, G.-L. (2007). Robust analysis of 5’-transcript ends: a high-throughput protocol for characterization of sequence diversity of transcription start sites. Nat. Protocols 2, 1622-1632.

Grosschedl, R., and Birnstiel, M. L. (1980). Identification of regulatory sequences in the prelude sequences of an H2A gene by the study of specific deletion mutants in vivo. Proceedings of the National Academy of Sciences of the United States of America 77, 1432-1436.

Grosveld, G. C., Shewmaker, C. K., Jat, P., and Flavell, R. A. (1981). Localization of DNA sequences necessary for transcription of the rabbit [beta]-globin gene in vitro. Cell 25, 215-226.

Guan, L. M., Zhao, J., and Scandalios, J. G. (2000). Cis-elements and trans-factors that regulate expression of the maize Cat1 antioxidant gene in response to ABA and osmotic stress: H2O2 is the likely intermediary signaling molecule for the response. The Plant Journal 22, 87-95.

Haberer, G., Wang, Y., and Mayer, K. F. X. (2011). The Non-coding Landscape of the Genome of Arabidopsis thaliana. In and Genomics of the Brassicaceae Plant Genetics and Genomics: Crops and Models. (Springer New York), pp. 67-121-121. Available at: http://dx.doi.org/10.1007/978-1-4419-7118-0_3.

Juven-Gershon, T., and Kadonaga, J. T. (2010). Regulation of gene expression via the core promoter and the basal transcriptional machinery. Developmental Biology 339, 225-229.

Juven-Gershon, T., Hsu, J.-Y., Theisen, J. W., and Kadonaga, J. T. (2008). The RNA polymerase II core promoter – the gateway to transcription. Current Opinion in Cell Biology 20, 253-259.

Juven-Gershon, T., Hsu, J.-Y., and Kadonaga, J. T. (2006). Perspectives on the RNA polymerase II core promoter. Biochem. Soc. Trans. 34, 1047-1050.

Karamitri, A., Shore, A. M., Docherty, K., Speakman, J. R., and Lomax, M. A. (2009). Combinatorial Transcription Factor Regulation of the Cyclic AMP- on the Pgc-1α Promoter in White 3T3-L1 and Brown HIB-1B Preadipocytes. Journal of Biological Chemistry 284, 20738 -20752.

Kimura, K., Wakamatsu, A., Suzuki, Y., Ota, T., Nishikawa, T., Yamashita, R., Yamamoto, J.-ichi, Sekine, M., Tsuritani, K., Wakaguri, H., et al. (2006). Diversification of transcriptional modulation: Large-scale identification and characterization of putative alternative promoters of human genes. Genome Research 16, 55-65.

Koo, S., Choi, M., Chun, H., Park, H., Kang, C., Shim, S., Chung, J., Cheong, Y., Lee, S., Yun, D.-J., et al. (2009). Identification and characterization of alternative promoters of the rice MAP kinase gene OsBWMK1. Molecules and Cells 27, 467-473.

Krishnamurthy, S., and Hampsey, M. (2009). initiation. Current Biology 19, R153-R156.

Lagrange, T., Kapanidis, A. N., Tang, H., Reinberg, D., and Ebright, R. H. (1998). New core promoter element in RNA polymerase II-dependent transcription: sequence-specific DNA binding by transcription factor IIB. Genes & development 12, 34-44.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.

Lauster, R., Reynaud, C. A., Mårtensson, I. L., Peter, A., Bucchini, D., Jami, J., and Weill, J. C. (1993). Promoter, and elements regulate rearrangement of an immunoglobulin transgene. Available at: http://www.pubmedcentral.gov/articlerender.fcgi?artid=413898.

Lawrence, C. E., and Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7, 41-51.

Lewis, B. A., Kim, T.-K., and Orkin, S. H. (2000). A downstream element in the human β-globin promoter: Evidence of extended sequence-specific transcription factor IID contacts. Proceedings of the National Academy of Sciences of the United States of America 97, 7172-7177.

Lim, C. Y., Santoso, B., Boulay, T., Dong, E., Ohler, U., and Kadonaga, J. T. (2004). The MTE, a new core promoter element for transcription by RNA polymerase II. Genes & Development 18, 1606-1617.

Lodha, M., Schulz-Raffelt, M., and Schroda, M. (2008). A New Assay for Promoter Analysis in Chlamydomonas Reveals Roles for Heat Shock Elements and the TATA Box in HSP70A Promoter-Mediated Activation of Transgene Expression. Eukaryotic Cell 7, 172-176.

Meyers, B. C., Tej, S. S., Vu, T. H., Haudenschild, C. D., Agrawal, V., Edberg, S. B., Ghazal, H., and Decola, S. (2004a). The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res 14, 1641-1653.

Meyers, B. C., Vu, T. H., Tej, S. S., Ghazal, H., Matvienko, M., Agrawal, V., Ning, J., and Haudenschild, C. D. (2004b). Analysis of the transcriptional complexity of Arabidopsis thaliana by massively parallel signature sequencing. Nat Biotech 22, 1006-1011.

Molina, C., and Grotewold, E. (2005). Genome wide analysis of Arabidopsis core promoters. BMC Genomics 6, 25.

Moore, M. J., and Proudfoot, N. J. (2009). Pre-mRNA Processing Reaches Back to Transcription and Ahead to Translation. Cell 136, 688-700.

Morowvat, M. H., Rasoul-Amini, S., and Ghasemi, Y. (2010). Chlamydomonas as a “new” organism for biodiesel production. Bioresource Technology 101, 2059-2062.

Nakamura, M., Tsunoda, T., and Obokata, J. (2002). Photosynthesis nuclear genes generally lack TATA-boxes: a tobacco photosystem I gene responds to light through an initiator. The Plant Journal 29, 1-10.

Ni, T., Corcoran, D. L., Rach, E. A., Song, S., Spana, E. P., Gao, Y., Ohler, U., and Zhu, J. (2010). A paired-end sequencing strategy to map the complex landscape of transcription initiation. Nat Meth 7, 521-527.

Obayashi, T., Kinoshita, K., Nakai, K., Shibaoka, M., Hayashi, S., Saeki, M., Shibata, D., Saito, K., and Ohta, H. (2006). ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis. Nucl. Acids Res., gkl783.

O’Shea-Greenfield, A., and Smale, S. T. (1992). Roles of TATA and initiator elements in determining the start site location and direction of RNA polymerase II transcription. Journal of Biological Chemistry 267, 1391-1402.

Ponjavic, J., Lenhard, B., Kai, C., Kawai, J., Carninci, P., Hayashizaki, Y., and Sandelin, A. (2006). Transcriptional and structural impact of TATA-initiation site spacing in mammalian core promoters. Genome Biol 7, R78.

Qi, X.-T., Zhang, Y.-X., and Chai, T.-Y. (2007). The bean PvSR2 gene produces two transcripts by alternative promoter usage. Biochemical and Biophysical Research Communications 356, 273-278.

Sandelin, A., Carninci, P., Lenhard, B., Ponjavic, J., Hayashizaki, Y., and Hume, D. A. (2007). Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nature reviews. Genetics 8, 424-436.

Schneider, T. D., and Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18, 6097-6100.

Shen, Y., Liu, Y., Liu, L., Liang, C., and Li, Q. Q. (2008). Unique Features of Nuclear mRNA Poly(A) Signals and Alternative Polyadenylation in Chlamydomonas reinhardtii. Genetics 179, 167-176.

Shmelkov, S. V., Jun, L., St Clair, R., McGarrigle, D., Derderian, C. A., Usenko, J. K., Costa, C., Zhang, F., Guo, X., and Rafii, S. (2004). Alternative promoters regulate transcription of the gene that encodes stem cell surface protein AC133. Blood 103, 2055-2061.

Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., and Bucher, P. (2002). PROSITE: A documented database using patterns and profiles as motif descriptors. Briefings in Bioinformatics 3, 265-274.

Singh, K. B. (1998). Transcriptional Regulation in Plants: The Importance of Combinatorial Control. Plant Physiol. 118, 1111-1120.

Sinha, S., and Tompa, M. (2002). Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research 30, 5549-5560.

Smale, S. T., and Kadonaga, J. T. (2003). The RNA polymerase II core promoter. Annu Rev Biochem 72, 449-479.

Smale, S. T. (2001). Core promoters: active contributors to combinatorial gene regulation. Genes & Development 15, 2503-2508.

Sparmann, A., and van Lohuizen, M. (2006). Polycomb silencers control cell fate, development and cancer. Nat Rev Cancer 6, 846-856.

Stine, M., Dasgupta, D., and Mukatira, S. (2003). Motif discovery in upstream sequences of coordinately expressed genes. In Evolutionary Computation, 2003. CEC ’03. The 2003 Congress on, pp. 1596-1603 Vol.3. Available at: 10.1109/CEC.2003.1299863.

Takaiwa, F., Yamanouchi, U., Yoshihara, T., Washida, H., Tanabe, F., Kato, A., and Yamada, K. (1996). Characterization of common cis-regulatory elements responsible for the endosperm-specific expression of members of the rice glutelin multigene family. Plant 30, 1207-1221.

Thijs, G., Marchal, K., Lescot, M., Rombauts, S., De Moor, B., Rouzé, P., and Moreau, Y. (2002). A Gibbs Sampling Method to Detect Overrepresented Motifs in the Upstream Regions of Coexpressed Genes. Journal of Computational Biology 9, 447-464.

Tsuritani, K., Irie, T., Yamashita, R., Sakakibara, Y., Wakaguri, H., Kanai, A., Mizushima-Sugano, J., Sugano, S., Nakai, K., and Suzuki, Y. (2007). Distinct class of putative “non-conserved” promoters in humans: Comparative studies of alternative promoters of human and mouse genes. Genome Research 17, 1005-1014.

Tuch, B. B., Galgoczy, D. J., Hernday, A. D., Li, H., and Johnson, A. D. (2008). The Evolution of Combinatorial Gene Regulation in Fungi. PLoS Biol 6, e38.

Umezawa, T., Fujita, M., Fujita, Y., Yamaguchi-Shinozaki, K., and Shinozaki, K. (2006). Engineering drought tolerance in plants: discovering and tailoring genes to unlock the future. Current Opinion in Biotechnology 17, 113-122.

Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001). The Sequence of the Human Genome. Science 291, 1304-1351.

Wasylyk, B., Derbyshire, R., Guy, A., Molko, D., Roget, A., Téoule, R., and Chambon, P. (1980). Specific in vitro transcription of conalbumin gene is drastically decreased by single-point in T-A-T-A box homology sequence. Proceedings of the National Academy of Sciences of the United States of America 77, 7024 -7028.

Wu, X., Liu, M., Downie, B., Liang, C., Ji, G., Li, Q. Q., and Hunt, A. G. (2011). Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation. Proceedings of the National Academy of Sciences 108, 12533-12538.

Xin, D., Hu, L., and Kong, X. (2008). Alternative Promoters Influence Alternative Splicing at the Genomic Level. PLoS ONE 3, e2377.

Yamaguchi-Shinozaki, K., and Shinozaki, K. (1994). A Novel cis-Acting Element in an Arabidopsis Gene Is Involved in Responsiveness to Drought, Low-Temperature, or High-Salt Stress. Plant Cell 6, 251-264.

Yamamoto, Y. Y., Ichida, H., Abe, T., Suzuki, Y., Sugano, S., and Obokata, J. (2007a). Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucl. Acids Res., gkm685.

Yamamoto, Y. Y., Ichida, H., Matsui, M., Obokata, J., Sakurai, T., Satou, M., Seki, M., Shinozaki, K., and Abe, T. (2007b). Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 8, 67.

Yang, C., Bolotin, E., Jiang, T., Sladek, F. M., and Martinez, E. (2007). Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene 389, 52-65.

Yokoyama, R., and Nishitani, K. (2001). A Comprehensive Expression Analysis of all Members of a Gene Family Encoding Cell-Wall Allowed us to Predict cis-Regulatory Regions Involved in Cell-Wall Construction in Specific Organs of Arabidopsis. Plant Cell Physiol. 42, 1025-1033.

Zhang, W., Ruan, J., Ho, T.-hua D., You, Y., Yu, T., and Quatrano, R. S. (2005). Cis-regulatory element based targeted gene finding: genome-wide identification of abscisic acid- and abiotic stress-responsive genes in Arabidopsis thaliana. Bioinformatics 21, 3074-3081.

Zhu, Q., Dabi, T., and Lamb, C. (1995). TATA box and initiator functions in the accurate transcription of a plant minimal promoter in vitro. Plant Cell 7, 1681-1689.

Zhu, Q., and Halfon, M. (2009). Complex organizational structure of the genome revealed by genome-wide analysis of single and alternative promoters in Drosophila melanogaster. BMC Genomics 10, 9.

1.4 Figures

Figure 1.1: Gene expression, structural gene components and pre-initiation complex. (A) The central dogma of molecular biology. (B) Different regions found in eukaryotic genes (UTR: untranslated region; TSS: transcription start site). (C) Transcription pre-initiation complex.

Figure 1.2: Different types of promoters based on TSS distribution. (A) Strong transcription initiation from a single site that is one or several base pairs wide. (B) Weak transcription initiation over a wide region of 100-150 bp. (C) Weak transcription initiation over a wide region, but strong transcription initiation in a narrow region within the wide region. (D) Strong transcription initiation from two or more sites. The transcription start region is only several bp wide in peaked promoters, whereas it can be 100-150 bp wide in the other types of promoters.

[Figure 1.3 diagram: promoter and 5’ UTR shown from -50 to +50 around the TSS (+1), with a 100 bp scale bar. Animals: XCPE1, MTE, BREu, TATA, BREd, Initiator, DCE (SI, SII, SIII), DPE, CpG islands. Plants: TATA, Y patch, Initiator, TC motifs.]

Figure 1.3: Core promoter elements commonly found in plants and animals. TATA box; Initiator (Inr); BRE: TFIIB recognition element; DCE: downstream core element; DPE: downstream core promoter element; MTE: motif ten element; Y patch: a small stretch containing C and T; XCPE1: X core promoter element 1.

Figure 1.4: A simple example of how to use the LDSS method. This example illustrates the major steps used in LDSS to create a distribution profile for a given octamer (TATATATA).

Figure 1.5: Creating sequence logos. The steps of the EM algorithm to create a final motif and construct a final sequence logo. Details of the information present in sequence logos are explained in Figure 1.6.

Figure 1.6: Information represented in a sequence logo

Figure 1.7: Several ways commonly used to represent consensus motif models

1.5 Tables

Core promoter element        Abbr.   Location                                  Consensus
Initiator                    Inr     -2 to +5                                  Human: YYANWYY
                                                                               Drosophila: TCAKTY
                                                                               Mammal: RY
TATA box                     TATA    -31/-30                                   TATAWAAR
TFIIB recognition element    BREu    Immediately upstream of the TATA box      SSRCGCC
                             BREd    Immediately downstream of the TATA box    RTDKKKK
Downstream core promoter     DPE     +28 to +33                                RGWYVT
  element
Motif ten element            MTE     +18 to +27                                CSARCSSAAC
Downstream core element      DCE     SI: +6 to +11                             SI: CTTC
                                     SII: +16 to +21                           SII: CTGT
                                     SIII: +30 to +34                          SIII: AGC
X core promoter element 1    XCPE1   -8 to +2                                  DSGYGGRASM
CpG islands                          1000s of bp upstream of TSS

Table 1.1: The position and consensus of the core promoter elements in plants and animals. See Table 1.2 for an explanation of the different characters used in the consensus sequences (Juven-Gershon et al., 2008).

Symbol   Meaning
A        A
C        C
G        G
T        T
M        A or C
R        A or G
W        A or T
S        C or G
Y        C or T
K        G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
N        A or C or G or T

Table 1.2: IUPAC single character codes for nucleic acid sequences

2 Chapter 2

2.1 Abstract

As a model organism, Chlamydomonas is not only used in biological experiments to understand the chloroplast and flagella, but is also suitable for biodiesel production. This study determined TSSs and extracted promoter regions based on available RNA-seq data, and characterized both the core promoter elements and the proximal promoter elements in this green alga. The TATA box is the only core promoter element found in Chlamydomonas, and the sequence composition suggests the presence of both canonical and non-canonical TATA boxes. There is evidence that the TATA box in Chlamydomonas also differs from that in Arabidopsis and human. While some of the proximal promoter elements discovered show weak similarities to known motifs from other species, most are novel elements only present in Chlamydomonas. Most of the proximal promoter elements discovered are also similar to each other. Based on the core and proximal promoter elements discovered during the analysis, the promoter architecture in Chlamydomonas seems to be simple compared to that of Arabidopsis and human. This initial effort to discover the components of transcription initiation and regulation provides the first genome-wide glimpse into the transcriptional networks found in Chlamydomonas.

2.2 Introduction

Transcription is the synthesis of RNA from a DNA template. Transcription initiates with the recruitment of the RNA polymerase II (Pol II) enzyme to the correct transcription start site (TSS) (Krishnamurthy and Hampsey, 2009). After initiation, transcription needs to be continuously regulated to meet cellular requirements, which change constantly through processes such as the cellular response to external stimuli (Yamaguchi-Shinozaki and Shinozaki, 1994; Shinozaki and Yamaguchi-Shinozaki, 1997; Yamaguchi-Shinozaki and Shinozaki, 2005; Umezawa et al., 2006) or progression through developmental stages (Adrian et al., 2010). Transcription factors, which are proteins capable of recognizing and binding to short DNA sequences, are responsible for transcriptional initiation and regulation (Latchman, 1997). Transcription factors are known to initiate (Krishnamurthy and Hampsey, 2009), enhance (Collins et al., 1995), and silence transcription (Sparmann and van Lohuizen, 2006). The short DNA sequences recognized by these transcription factors are known as cis-regulatory elements (i.e., cis-elements), and are indispensable in transcription initiation and regulation. They directly or indirectly contribute to pre-transcriptional and post-transcriptional processes, such as transcription initiation, intron splicing and polyadenylation (Modrek and Lee, 2002; Shen et al., 2008; Krishnamurthy and Hampsey, 2009).

Mediated by Pol II, transcription of protein-coding genes in eukaryotes is a complex process involving a large number of proteins. General transcription factors essential for transcription initiation recognize specific cis-elements called core promoter elements and allow accurate binding of Pol II to the TSS. Pol II and these general transcription factors assemble to form a pre-initiation complex (PIC) that can initiate transcription. PIC assembly is a series of cascading events that involves complex protein-DNA motif binding (e.g., the TATA box binding protein binds to the TATA box) and protein-protein binding (e.g., general transcription factors bind to Pol II). Even though the PIC is capable of initiating transcription, another larger protein complex referred to as the mediator is necessary for transcriptional regulation, i.e., for determining when transcription occurs and how many RNA transcripts are created (Krishnamurthy and Hampsey, 2009). cis-elements located upstream of core promoter elements regulate transcription, and are usually called proximal promoter elements. Another group of cis-elements called distal promoter elements are located further away from the gene than both the core and proximal promoter elements, and in some instances they are located on a different chromosome (Haberer et al., 2011). Both core promoter elements and proximal promoter elements mostly occur upstream of the TSS (Nolis et al., 2009).

The regulatory region immediately upstream of the 5’-end of a gene is generally known as the promoter region (Brosius, 2009). Promoters are defined as “modulatory DNA structures containing a complex array of cis-acting regulatory elements required for accurate and efficient initiation of transcription and for controlling expression of a gene” (Ayoubi and Van De Ven, 1996). In genome-wide promoter analysis, two major types of promoter elements have been characterized: core promoter elements and proximal promoter elements. Present in a narrow region immediately adjacent to the TSS, core promoter elements are cis-elements that directly interact with the transcription machinery (general transcription factors) to initiate transcription (Smale and Kadonaga, 2003).

The Inr element is the most prevalent core promoter motif in animals such as human, mouse and Drosophila (Smale and Kadonaga, 2003; Juven-Gershon et al., 2006, 2008). The Inr element was detected in roughly 46% of human genes, and was evident in approximately 22% more genes than the TATA box (Yang et al., 2007). Interestingly, Inr-like DNA motifs were also detected in 40% of yeast genes in the same study (Yang et al., 2007). Inr elements (consensus TCCAAG) have been shown to contribute to TSS selection in plant minimal promoter in vitro systems (Zhu et al., 1995), and they were also characterized as plant core promoter elements in the PlantProm database (Shahmuradov et al., 2003).
In contrast to the observations of Zhu et al. (1995), the analysis of Inr by Shahmuradov et al. did not reveal any strong consensus for Inr elements, but it did reveal some differences between the Inr elements of TATA box positive promoters (i.e., promoters with a TATA box) and TATA box negative promoters (i.e., promoters without a TATA box) (Shahmuradov et al., 2003). Moreover, differences between the Inr elements of monocots and dicots were also observed (Shahmuradov et al., 2003). Recent genome-wide studies in Arabidopsis and rice showed that the Inr element is not strictly conserved in plants, although it showed a consensus of YR at the -1/+1 positions, where the TSS is designated as +1 and there is no position 0 (Yamamoto et al., 2007b).

As the first core promoter element discovered (Goldberg, 1979), the TATA box was originally regarded as a strict requirement for transcription initiation, but many studies in animals and plants have since shown this to be incorrect (Loganantharaj, 2006; Molina and Grotewold, 2005; Civan and Svec, 2009). The TATA box occurs in about 29% of Arabidopsis genes (Molina and Grotewold, 2005) and approximately 24% of human genes (Yang et al., 2007).

While some core promoter elements like the TATA box and Inr exist in both animals and plants, there are many animal-specific and plant-specific core promoter elements (see Figure 2.1) (Smale and Kadonaga, 2003; Yamamoto et al., 2007a; Juven-Gershon et al., 2008). Animal-specific core promoter elements include the motif ten element (MTE), the downstream core element (DCE), and the downstream core promoter element (DPE) (Smale and Kadonaga, 2003; Juven-Gershon et al., 2008). CpG islands, which occur in mammals, are important regions of transcriptional regulation (Sandelin et al., 2007). CpG islands are not small core promoter elements like the TATA box or the Inr element, but wider regions spanning up to thousands of nucleotides, and are present near approximately 72% of human genes (Saxonov et al., 2006). As shown in Figure 2.1, plants seem to lack CpG islands, but have Y-patches (smaller patches of C and T nucleotides in the promoter region) instead, and the distribution of these Y-patches is similar to that of CpG islands (Civan and Svec, 2009). In addition to the Y-patch, TC motifs consisting mainly of C and T nucleotides were observed to have a distribution similar to that of the TATA box, occurring in TATA box negative promoters in rice and Arabidopsis (Bernard et al., 2010). Despite their similarities to some motifs in fungi and Drosophila, a definitive function has not yet been assigned to the Y-patch and TC motifs in plants (Haberer et al., 2011).

Core promoter elements are preferentially located around specific positions surrounding the TSS. For example, in Arabidopsis the TATA box is found from positions -39 to -29 with respect to the TSS (Bernard et al., 2010).
There have been reports of stringent spacing requirements between different core promoter elements (Juven-Gershon et al., 2008). For example, MTE and DPE have a stringent spacing requirement from the Inr element, and this spacing is essential for optimal transcription (Burke et al., 1998; Lim et al., 2004). Because of this relative positional conservation, core promoter elements can be effectively elucidated by searching for DNA motifs that are over-represented within a certain distance constraint of the TSS. The Local Distribution of Short Sequences (LDSS) method has proved advantageous in discovering potential core promoter elements based on the localized distribution of DNA motif occurrences (Yamamoto et al., 2007a, 2007b, 2009). Similar approaches that take advantage of the localized distribution of motifs in core promoter studies have been proposed by others as well (Roepcke et al., 2006; Bernard et al., 2010; Perina et al., 2011).

Usually found upstream of the core promoter elements, proximal promoter elements are mainly associated with transcriptional regulation (Haberer et al., 2011). Proximal promoter elements do not directly interact with the transcription machinery (basal transcription factors and Pol II), but regulate transcription by binding to transcription factors capable of interacting with the mediator complex (Krishnamurthy and Hampsey, 2009). Proximal promoter elements include enhancers and silencers, which increase and reduce the transcription level, respectively (Lauster et al., 1993). The interaction between the large number of protein factors and cis-elements enables the combinatorial control of transcription that can lead to differential expression patterns in response to developmental and environmental cues (Singh, 1998; Smale, 2001; Kreiman, 2004; De Val et al., 2008; Tuch et al., 2008; Karamitri et al., 2009). Proximal promoter elements are cis-elements located in promoter regions upstream of the TSS (Haberer et al., 2011), and they are known to regulate transcription in genes with similar functions or from the same gene families (Takaiwa et al., 1996; Hughes et al., 2000). Proximal promoter elements occur less frequently than core promoter elements, and vary widely in sequence structure and composition (Haberer et al., 2011). Moreover, most proximal promoter elements do not display a stringent spacing or distance requirement from the TSS, allowing a much greater level of flexibility in their organization (Brown et al., 2007). Due to the large number of proximal promoter elements present and the complex nature of transcriptional regulatory networks, no proximal promoter element has yet been found that occurs in as large a number of genes as core promoter elements such as Inr or the TATA box.
Clearly, characterizing proximal promoter elements in genes that have similar functions or are co-regulated is the first step in unraveling the complex transcription network for these genes. Many different computational methods and tools have been developed for the discovery and characterization of proximal promoter elements (Tompa et al., 2005). Among them, the Multiple EM for Motif Elicitation (MEME) software suite, which utilizes the expectation maximization (EM) algorithm for motif discovery, appears to be the most popular one (Bailey and Elkan, 1994, 1995a; Tompa et al., 2005; Bailey et al., 2006, 2009). MEME has been successfully used for promoter analysis in various studies ranging from the genome-wide scale to smaller sets of co-expressed genes (Kreiman, 2004; Molina and Grotewold, 2005; Bailey et al., 2006, 2009). The LDSS method was deemed inappropriate for proximal promoter analysis, because LDSS relies on a stringent spacing requirement for motif discovery, which is not evident in most proximal promoter elements discovered (Brown et al., 2007; Yamamoto et al., 2007b, 2007a). Although MEME is not intended for genome-wide scans of over-represented motifs because its algorithm is computationally demanding, it is appropriate for searching for proximal promoter elements in genes with similar biological functions or in co-expressed genes (Hu et al., 2005). Moreover, the MEME suite stands out from other motif search utilities by providing a comprehensive set of tools for downstream analysis and functional annotation of discovered motifs (Bailey et al., 2009).

Chlamydomonas reinhardtii is a unicellular alga that is known to share animal features like flagella or cilia and plant characteristics like the chloroplast and a cellulosic cell wall. Chlamydomonas is used as a model organism in studying photosynthesis (Merchant et al., 2007), chloroplast transformation (Takahashi, 1991), flagella regeneration (Stolc et al., 2005), and most recently biodiesel production (Morowvat et al., 2010). It is also used commercially to produce carotenoids (Del Campo et al., 2007), raw pigments used in the production of food colorings and dyes. These laboratory and commercial uses make Chlamydomonas an interesting and important model organism for genomics studies. Since the draft genome sequence was released in 2007 (Merchant et al., 2007), transcriptomics data for Chlamydomonas from both 454 pyrosequencing and Illumina sequencing by synthesis have been accumulating rapidly (Merchant et al., 2007; Vallon and Dutcher, 2008; González-Ballester et al., 2010; Miller et al., 2010; Kropat et al., 2011), providing critical expression evidence for improving gene annotation.
As in many genomics studies of other model or non-model species, understanding the mechanisms of gene expression and its regulation remains a great challenge for genomics research in Chlamydomonas. Utilizing the currently available genomics resources, this study aims to explore cis-elements in Chlamydomonas promoters on a genome-wide scale using a computational approach. Based on the most up-to-date gene annotation supported by all available gene expression evidence, we have characterized the core and proximal promoter elements using the LDSS and MEME approaches, respectively, and compared them with those of human and Arabidopsis to identify both conserved and species-specific motifs in this algal species. As the first genome-wide computational analysis of promoters in Chlamydomonas, this study will facilitate our understanding of transcription initiation and regulation in this model organism, and will pave the way for functional characterization of such regulatory elements through wet-lab experiments in the future.

2.3 Results

2.3.1 Obtaining promoter regions for analysis

Promoter sequences in this study were extracted from -1000 to +200 around the TSSs of given genes, where +1 is the TSS and there is no position 0. A total of 23,610 unique human promoter sequences were downloaded directly from the Database of Transcriptional Start Sites (DBTSS version 7) (http://dbtss.hgc.jp/), where the TSSs were precisely determined through massive sequencing of full-length cDNA and the 5’-ends of oligo-cap selected cDNA (Tsuchihara et al., 2009). Promoter regions of 19,089 Arabidopsis genes were extracted based on annotated TSSs in the TAIR9 genome annotation (ftp://ftp.Arabidopsis.org/Genes/TAIR9_genome_release/). Chlamydomonas TSS data were obtained from the Augustus 10.2 (Au10.2) genome annotation. As one of the best gene prediction programs evaluated by community-wide independent assessment (Coghlan et al., 2008), AUGUSTUS (Stanke et al., 2004, 2006b, 2006a, 2008; Specht et al., 2011) can operate as an ab initio, de novo, or cDNA-based gene predictor. The Au10.2 gene models are an improved gene set for Chlamydomonas that takes advantage of multiple lines of evidence, including 0.3 million Sanger ESTs and 6.3 million 454 EST sequences, MS/MS (tandem mass spectrometry) data, and genomic conservation with Volvox carteri (Stanke, personal communication). Because of the lack of direct experimental data, TSSs in Au10.2 were trained mainly on (1) differences in sequence composition between 5’ UTRs and intergenic regions, (2) the length distribution of 5’ UTRs, and (3) evidence from coverage by the 6.6 million EST sequences. Among the total of 17,163 genes in Au10.2, we determined that 9,047 genes have valid TSSs based on cDNA-to-genome alignment of the aforementioned 6.6 million transcriptome sequences and an additional 63 million RNA-Seq reads downloaded from the NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/sra?term=SRP002284) (see Materials and Methods for details).
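The coordinate convention above (TSS at +1, no position 0, window from -1000 to +200) yields a 1,200 nt promoter sequence. The following Python sketch illustrates one way such a window could be cut from a genomic sequence on either strand. It is not the extraction script used in this study; the function names, the 1-based TSS input, and the toy genome in the usage note are assumptions made for illustration.

```python
# Illustrative sketch only; names and input conventions are assumptions.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_interval(tss, strand, upstream=1000, downstream=200):
    """Return the 0-based, half-open genomic interval of the promoter.

    `tss` is the 1-based genomic coordinate of position +1.
    """
    if strand == "+":
        return tss - upstream - 1, tss + downstream - 1
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def extract_promoter(genome, tss, strand, upstream=1000, downstream=200):
    """Return the promoter read 5' to 3' on the gene's strand, or None if
    the window runs off either end of the genomic sequence."""
    start, end = promoter_interval(tss, strand, upstream, downstream)
    if start < 0 or end > len(genome):
        return None
    seq = genome[start:end]
    return reverse_complement(seq) if strand == "-" else seq


def relative_to_index(rel, upstream=1000):
    """Map a TSS-relative position (no position 0) to a 0-based index
    in the extracted promoter string."""
    if rel == 0:
        raise ValueError("there is no position 0 in this convention")
    return rel + upstream if rel < 0 else rel + upstream - 1
```

With a toy genome in which the TSS base is the only G, `extract_promoter(genome, 1001, "+")` returns a 1,200 nt string whose character at `relative_to_index(1)` is that G, and position -1000 maps to index 0.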

2.3.2 LDSS analysis of core promoters of human, Arabidopsis and Chlamydomonas

The Local Distribution of Short Sequences (LDSS) method is appropriate for detecting and characterizing core promoter elements because of its ability to identify DNA motifs that show a localized distribution within a certain distance constraint from the TSS (Yamamoto et al., 2007b, 2007a). The initial step of LDSS generates all possible (4^8 = 65,536) octamers, i.e., 8 nt long DNA sequences such as AAAAAAAA, AAAAAAAG, and TTTTTTTT. The number of promoter sequences having a specific octamer (e.g., ATACATAC) at each position (-1000 to +200) is then counted. This count represents the frequency of a specific octamer at a specific position, and the frequencies of the octamer at all promoter positions are called the frequency distribution of the octamer. The frequency distributions of different octamers can have distinct patterns: some are random, whereas others are non-random with obvious peaks. These patterns are referred to as distribution profiles, which essentially represent a collection of frequencies of a given octamer over the entire promoter region (Yamamoto et al., 2007b). Distribution profiles can be visualized using frequency line graphs as shown in Figure 2.2. Because they are preferentially located around a specific position near the TSS, core promoter elements tend to form localized peaks in the distribution profiles, which can be detected by LDSS. For example, the TATA box occurs from -35 to -30 in humans (Yang et al., 2007). In this study, LDSS was used to detect octamers with significant peak(s) in the core promoter region (-50 to +50) (Molina and Grotewold, 2005). LDSS defines octamers with significant peaks in the core promoter region as LDSS positive octamers (see Figure 2.2 A for an example) and octamers without a significant peak in the core promoter region as LDSS negative octamers (see Figure 2.2 B for an example) (Yamamoto et al., 2007b). In our study, LDSS detected 357 LDSS positive octamers for Arabidopsis, 636 for Chlamydomonas, and 546 for human (see Materials and Methods section for details).
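The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the LDSS implementation of Yamamoto et al.: it assumes all promoter strings have the same length, indexes each octamer by the position of its first base, and skips octamers containing ambiguous bases. The function name `octamer_profiles` is hypothetical, and peak detection is not shown.

```python
from collections import defaultdict


def octamer_profiles(promoters, k=8):
    """Count, for every k-mer, how many promoters carry it at each position.

    Returns a dict mapping k-mer -> list of counts, one per start position.
    Only k-mers that actually occur are stored; with k = 8 there are at
    most 4 ** 8 = 65,536 possible keys.
    """
    length = len(promoters[0])
    n_positions = length - k + 1
    profiles = defaultdict(lambda: [0] * n_positions)
    for seq in promoters:
        if len(seq) != length:
            raise ValueError("all promoters must have the same length")
        for i in range(n_positions):
            kmer = seq[i:i + k].upper()
            # Skip windows with N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                profiles[kmer][i] += 1
    return dict(profiles)
```

Each list returned by this function corresponds to one distribution profile and could be plotted as a frequency line graph of the kind shown in Figure 2.2.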
Exhaustive searches for fixed octamer sequences do not provide a way to account for motif degeneracy. To overcome this limitation, K-means clustering was used to cluster LDSS positive octamers into a predetermined number of clusters based on distribution profile similarity (Yamamoto et al., 2007a, 2007b). Octamers within the same cluster share similarities in their distribution profiles, the extent of which depends on the predetermined number of clusters specified for K-means clustering. This similarity might indicate potential functional conservation among distinct octamers that share similar distribution profiles (Yamamoto et al., 2007b). For example, variants of the TATA box were grouped into the same cluster because they were overrepresented within the region from -35 to -30 (Yamamoto et al., 2007b, 2007a). This is why a set of LDSS positive octamers might represent just one core promoter element (e.g., the TATA box). After clustering, the LDSS method generates heatmap graphs based on the distribution profiles of all LDSS-positive motifs over the entire promoter region (-1000 to +200) (Yamamoto et al., 2007b). The heatmap graph not only provides a graphical representation of individual octamers and their distribution profiles, but also helps in evaluating whether two or more K-means clusters can be combined or whether a K-means cluster needs to be further divided to refine the similarity. In some cases, multiple rounds of K-means clustering might be necessary to get similar motifs within the same K-means cluster (Yamamoto et al., 2007a). As exemplified in Figure 2.3, an LDSS heatmap graph compiles all the different octamer clusters (K-means clusters) and separates them by a margin between the clusters. The Y-axis represents the individual, distinct LDSS positive octamers. The taller an octamer cluster is along the Y-axis, the more distinct octamers it contains, indicating a larger extent of motif degeneracy for the core promoter element represented by that cluster. The X-axis represents the individual nucleotide positions around the TSS for the whole promoter region (-1000 to +200). The intensity of the red color indicates the frequency of the octamer at a specific position: the brighter the color, the higher the frequency. For example, the red color intensity will be higher around -35 to -30 bp in the cluster representing the TATA box than in other promoter regions. In the final step, the LDSS method provides a list of LDSS-positive motifs grouped by K-means clustering and labels them as different types of core promoter elements based on visual examination and prior knowledge about promoters (Yamamoto et al., 2007a, 2007b). In this study, we adopted a similar approach with our own improvements.
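To make the clustering step concrete, the sketch below runs K-means (Lloyd's algorithm) on a list of distribution profiles in plain Python. It is only an illustration of the technique: the squared Euclidean distance, the deterministic farthest-point initialization, and the absence of profile normalization are assumptions made here, not details taken from the LDSS method or from this study.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, max_iter=100):
    """Cluster profiles into k groups; returns (labels, centroids).

    Assumes 1 <= k <= len(profiles). Initialization is farthest-point
    (an assumption for reproducibility, not part of LDSS).
    """
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        farthest = max(
            profiles,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Profiles that peak at the same position receive the same label, which is the behavior that lets TATA box variants fall into one cluster. Sorting octamers by label before plotting gives the row order of a heatmap like Figure 2.3.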
K-means clustering requires a predetermined number of clusters, and the choice of this number is crucial for proper separation of the LDSS-positive octamers. A small cluster number leads to poor separation, while a large cluster number generates many small clusters that differ little from one another. A cluster number of 10 was chosen for K-means clustering after experimenting with different numbers ranging from 5 to 12. All LDSS positive octamers with similar distribution profiles were clustered into 10 distinct octamer clusters for each of the three species. Figure 2.3 presents a comparative view of the LDSS heatmap graphs produced for Arabidopsis, human, and Chlamydomonas, while Figures 2.4, 2.5, and 2.6 show species-specific heatmap graphs and the relevant sequence logos. As shown in Figures 2.3 and 2.4, only two major octamer clusters are evident in Arabidopsis: the TATA box, visible as high intensity vertical lines near the TSSs, and the Y-patch, visible as high intensity smears covering a larger range than the TATA box. In human (see Figures 2.3 and 2.5), the TATA box and Inr display clear vertical lines, whereas CpG islands occupy a wider region, similar to the Y-patch in Arabidopsis. Several other core promoter elements, such as Sp1 and DPE, that were identified in the original LDSS analysis were also observed in human in our study (Yamamoto et al., 2007a). All the aforementioned motif elements were determined mainly from our prior knowledge about core promoters in Arabidopsis and human, visual inspection and examination of the individual octamers in each octamer cluster, and comparison of the results of our LDSS analysis with those of the original LDSS analysis. Although they are not as clearly evident as in human and Arabidopsis, the heatmap graphs of Chlamydomonas in Figures 2.3 and 2.6 do point to the presence of putative core promoter elements like the TATA box. This was determined from the positions of the fuzzy vertical lines present in several octamer groups, and from inspection of the list of individual octamers present in each octamer cluster. It is interesting to note that clusters 8, 9 and 10 are ~20 bases closer to the TSS than the other clusters (see Figure 2.4), indicating a variant of the putative TATA box. Apart from visual inspection of the individual octamers within each cluster, no further analysis was pursued in the original LDSS method (Yamamoto et al., 2007a, 2007b). In this study, we extended the LDSS method by creating sequence logos for each octamer cluster and comparing different octamer clusters to identify similarity (see next section).
As a representative view of the consensus sequence for a given group of sequences, sequence logos provide an easy way to visualize the nucleotide composition and the nucleotide frequencies at each position (Schneider and Stephens, 1990). In our study, two factors were taken into account when sequence logos were generated for a given octamer cluster: the distinct octamer sequences and their frequencies over the regions where a significant peak was identified. Indeed, our sequence logos shown in Figures 2.4, 2.5 and 2.6 do facilitate the identification of putative core promoter elements such as Inr, the TATA box, and CpG islands. Moreover, sequence logos make it possible to visually inspect motif similarity among different octamer clusters within the same species, as well as between different species.
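The two inputs to the logos, the distinct octamers in a cluster and their frequencies in the peak region, can be combined into a frequency-weighted position probability matrix, from which per-position information content follows (Schneider and Stephens, 1990). The Python sketch below is a minimal illustration under stated assumptions: the weights are raw peak-region counts, the background is uniform, and no small-sample correction is applied. The function names and the example counts are hypothetical and do not reproduce the logo-drawing procedure used for Figures 2.4 to 2.6.

```python
import math


def weighted_ppm(octamer_counts):
    """Build a position probability matrix from {octamer: count}.

    Each octamer contributes to every column in proportion to its count.
    Returns a list of dicts, one per position, mapping base -> probability.
    """
    width = len(next(iter(octamer_counts)))
    total = sum(octamer_counts.values())
    ppm = [{base: 0.0 for base in "ACGT"} for _ in range(width)]
    for octamer, count in octamer_counts.items():
        for i, base in enumerate(octamer):
            ppm[i][base] += count / total
    return ppm


def information_content(column):
    """Information content of one column in bits (uniform background,
    no small-sample correction): 2 + sum(p * log2(p))."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
```

For a cluster containing TATAAAAA with a count of 3 and TATATAAA with a count of 1, the fifth column has A at 0.75 and T at 0.25, so its letters would be drawn shorter than those of a fully conserved column, which carries 2 bits.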

In Arabidopsis, as shown in Figure 2.4, the TATA box is represented by three different octamer clusters (i.e., clusters 1-3), while the Y-patch is represented by four different octamer clusters (i.e., clusters 4-7). In human, the most obvious core promoter elements visible from the sequence logos are the TATA box (cluster 1), Inr (clusters 2 and 3), CpG islands (clusters 4-7), DPE (cluster 9), and Sp1 (cluster 10) (see Figure 2.5). In Chlamydomonas, all the octamer clusters display similar sequence logos that are AT rich, representing potential TATA boxes (see Figure 2.6). Visual inspection also revealed that the TATA box logos from all three species seem to be similar. Clearly, it is important to compare the similarity among different octamer clusters objectively and quantitatively, both within and between species, so that similar motifs from different octamer clusters can be combined and annotated accordingly.

2.3.3 Comparing and combining octamer clusters to produce putative core promoter elements

The MEME suite, one of the most popular toolkits for discovering and analyzing DNA and protein sequence motifs, provides a tool called Tomtom that can compare motifs based on sequence similarity (Gupta et al., 2007; Bailey et al., 2009). Utilizing different similarity functions (e.g., Euclidean distance) to quantify sequence similarity, Tomtom determines whether the motifs being compared are similar at a statistically significant level (Gupta et al., 2007; Tanaka et al., 2011). Motifs discovered by MEME can be used as query motifs to search a database of known motifs (target motifs) and retrieve significant matches. Tomtom can also be used to compare two motifs if they are presented in MEME minimal motif format, a common motif format that contains the minimum amount of information required to represent a motif (http://meme.nbcr.net/meme4_6_1/doc/meme-format.html; Bailey et al., 2009). A motif in MEME minimal motif format was constructed for each octamer cluster. Based on this format, we compared all octamer clusters within species and between species using Tomtom. For example, in Arabidopsis, we used each octamer cluster as the query motif to search the other 9 octamer clusters, which served as the target motif database, for significantly similar matches. Using the comparison results for the individual octamer clusters, we then detected and combined octamer clusters with reciprocal matches. A reciprocal match is identified when octamer cluster A, used as the query, matches octamer cluster B as the target, and octamer cluster B, used as the query, matches octamer cluster A as the target. Because reciprocal matches signify the most accurate matches among individual octamer clusters within a species, we combined the two (or more) octamer clusters that have reciprocal matches to form a new cluster, called a motif group, that represents a putative core promoter element. After motif comparison and merging, Arabidopsis has 5 major motif groups, human has 4, and Chlamydomonas has 1 (see Figure 2.7). All 10 octamer clusters (i.e., 357 LDSS-positive octamers) in Arabidopsis can be divided into five motif groups, including the TATA box and the Y-patch element, whereas all 10 octamer clusters (i.e., 546 LDSS-positive octamers) in human were assembled into 4 major motif groups that represent the TATA box, CpG islands, DPE and the Inr element. As shown in Figure 2.7, all 10 octamer clusters (i.e., 636 LDSS-positive octamers) in Chlamydomonas are similar enough to be grouped into one motif group that represents a TATA box. Moreover, from visual inspection, the TATA box appears to be the only core promoter element shared among the three species. This observation was further confirmed by between-species comparison of individual octamer clusters using Tomtom. TATA box motifs from Arabidopsis and human had reciprocal matches, but the matches between Chlamydomonas and the other two species were not reciprocal. When Chlamydomonas octamer clusters were used as queries, they matched clusters from the other species, but when clusters from the other species were used as queries, they did not significantly match the Chlamydomonas clusters.
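The reciprocal-match rule used in this section can be expressed compactly. The Python sketch below takes the significant Tomtom matches as a dictionary from each query cluster to the set of target clusters it matched, keeps only reciprocal pairs, and merges clusters linked by such pairs into motif groups. Merging transitively (A with B and B with C gives one group) is an assumption made here for illustration, since the text only states that two or more clusters with reciprocal matches were combined. The cluster names and function names are hypothetical.

```python
def reciprocal_pairs(matches):
    """Return the set of unordered cluster pairs that match in both directions.

    `matches` maps each query cluster to the set of targets it matched.
    """
    pairs = set()
    for query, targets in matches.items():
        for target in targets:
            if target != query and query in matches.get(target, ()):
                pairs.add(frozenset((query, target)))
    return pairs


def motif_groups(matches):
    """Merge clusters connected by reciprocal matches (union-find)."""
    parent = {cluster: cluster for cluster in matches}

    def find(cluster):
        while parent[cluster] != cluster:
            parent[cluster] = parent[parent[cluster]]  # path halving
            cluster = parent[cluster]
        return cluster

    for pair in reciprocal_pairs(matches):
        a, b = tuple(pair)
        parent[find(a)] = find(b)

    groups = {}
    for cluster in matches:
        groups.setdefault(find(cluster), set()).add(cluster)
    return sorted(sorted(group) for group in groups.values())
```

A one-directional match, such as a Chlamydomonas cluster matching a human cluster without the reverse being true, produces no pair and therefore no merge under this rule.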

2.3.4 Analysis of proximal promoter elements

MEME has been used to detect proximal promoter elements for groups of genes that are either co-regulated or functionally similar (Bailey and Elkan, 1995b; Hughes, 2000). In this study, we also adopted MEME for proximal promoter element analysis in Chlamydomonas. Instead of using the whole promoter region (-1000 to +200) that we extracted, we focused only on the 950 bp region (from -1000 to -50) that usually contains proximal promoter elements. In order to group genes that are either co-regulated or functionally similar, the Annot8r annotation pipeline (Schmid and Blaxter, 2008) was used to provide Gene Ontology (GO) and KEGG pathway (KEGG) annotations for the AUGUSTUS u10.2 gene models. KEGG (Kanehisa et al., 2006, 2008) is a database of biological systems, consisting of information about genes and proteins (KEGG GENES), endogenous and exogenous substances (KEGG LIGAND), molecular wiring diagrams of interaction and reaction networks (KEGG PATHWAY), and hierarchies and relationships of various biological objects (KEGG BRITE). Focusing on the KEGG PATHWAY and KEGG GENES databases, we annotated the u10.2 protein sequences with KEGG pathway maps and modules, as well as KEGG ORTHOLOGY information. Among all 9,047 genes with valid TSSs, only 2,562 genes had valid KEGG annotations (e-value < 0.001), enabling us to group genes based on KEGG annotation. We obtained 277 KEGG pathways that contain at least one AUGUSTUS u10.2 gene. In other words, there are 277 KEGG gene groups, within each of which the individual genes share the same KEGG pathway. The number of genes in a group ranged from 1 to 203 (see Table 2.1).
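The grouping step described above amounts to filtering annotation hits by e-value and collecting genes under each pathway identifier. The following Python sketch shows the idea. The tuple layout `(gene, term, evalue)`, the function name, and the gene identifiers in the usage note are assumptions for illustration and do not reflect the actual Annot8r output format.

```python
from collections import defaultdict


def group_genes(hits, max_evalue=1e-3):
    """Group genes by annotation term, keeping only hits below the cutoff.

    `hits` is an iterable of (gene, term, evalue) tuples. A gene may
    appear in several groups if it carries several valid annotations.
    """
    groups = defaultdict(set)
    for gene, term, evalue in hits:
        if evalue < max_evalue:
            groups[term].add(gene)
    return dict(groups)
```

For example, with the made-up hits `[("g1", "00910", 1e-20), ("g2", "00910", 1e-5), ("g3", "00910", 0.5)]`, only `g1` and `g2` would be placed in the group for pathway 00910. The same function applies unchanged to GO Slim terms.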

Gene Ontology (GO) provides a controlled vocabulary that describes genes and their products in terms of their associated biological processes (P), cellular components (C) and molecular functions (F) in a species-independent manner (Hill et al., 2008). Each GO term maps to a GO Slim term, which is a high-level GO term that covers the major aspects of the three GO ontologies (Camon et al., 2004). GO Slim terms provide a bird's-eye view of the GO annotations, and have proved useful for obtaining a genome-wide overview in transcriptome analysis (Lomax, 2005). Therefore, we also used GO Slim terms to group our genes for proximal promoter element analysis. GO annotations from Annot8r were also filtered so that only valid annotation entries (e-value < 0.001) were retained. Out of 9,047 genes with validated TSSs, only 4,104 possessed valid GO Slim annotations (e-value < 0.001). As shown in Table 2.1, there are 32 GO Slim gene groups (GO gene groups) in total, and the number of genes within a group varied from 29 to 3,769.

Comparing the two types of gene grouping, we found that GO gene groups contained more genes per group than KEGG gene groups (see Table 2.1). The difference was mainly attributed to the breadth of GO Slim terms in comparison to the specific nature of KEGG pathway annotations. 2,414 genes possessed both GO and KEGG annotations, and were therefore grouped by both GO and KEGG grouping. Only 146 genes were uniquely annotated to KEGG, while 1,690 genes were uniquely annotated to GO.

The promoter sequences extracted for all KEGG and GO gene groups were subjected to proximal promoter analysis using MEME. MEME analysis of a given gene group produces MEME motifs (position specific probability matrices), representative sequence logos and the associated e-values. The e-value for a MEME motif represents the “probability of finding an equally well conserved pattern in a set of random sequences” (Bailey et al., 2006). A lower e-value means a lower likelihood that the detected motif occurs by random chance. In this study, we required an e-value of 0.001 (i.e., 1 out of 1000) or less for a valid motif, and the motif was also required to be present in at least 50% of the genes in a given gene group. Only 92 KEGG gene groups out of 277 contained valid motifs, and we detected a total of 102 valid KEGG motifs for these groups (see Table S1). Similarly, only 24 GO gene groups out of 32 contained valid motifs, and we detected a total of 24 valid GO motifs for these groups (see Figure 2.8). Tomtom was used to detect motif similarity among all 102 KEGG motifs, as well as among the 24 GO motifs. As shown in Figure 2.9 and Table 2.2, all 102 KEGG motifs can be categorized into 14 distinct KEGG motif groups. In contrast, all 24 GO motifs are similar enough to be grouped into a single GO motif group (see Figure 2.8). From visual inspection of the sequence logos shown in Figures 2.8 and 2.9, three KEGG motif groups (i.e., Figure 2.9 Groups 1-3) look similar to the single GO motif group. Moreover, there are two KEGG motif groups (i.e., Figure 2.9 Groups 4-5) representing microsatellites with repeat units of GT/CA (Kang and Fawley, 1997). While 90% of the KEGG motifs were categorized into Groups 1-5, other motifs that did not show any similarities were categorized into their own groups, i.e., Figure 2.9 Groups 7-14, in which there is only one KEGG motif per group.
Interestingly, the 6 motifs (i.e., Figure 2.9 Groups 9-14) present in KEGG pathway “05322” (Systemic lupus erythematosus; http://www.genome.jp/dbget-bin/www_bget?pathway:map05322) were so different from each other that each of the 6 KEGG motifs was categorized into its own KEGG motif group.
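The two validity criteria applied to each MEME motif, an e-value of 0.001 or less and presence in at least 50% of the genes in the gene group, can be written as a single predicate. The Python sketch below is illustrative only. The function and parameter names are hypothetical, and it assumes that the number of genes carrying at least one motif site has already been derived from the MEME output.

```python
def is_valid_motif(evalue, genes_with_site, genes_in_group,
                   max_evalue=1e-3, min_coverage=0.5):
    """Return True if a motif passes both the e-value and coverage cutoffs.

    Coverage is the fraction of genes in the group that carry the motif.
    """
    if genes_in_group <= 0:
        return False
    coverage = genes_with_site / genes_in_group
    return evalue <= max_evalue and coverage >= min_coverage
```

Under these cutoffs, a motif with an e-value of 1e-10 found in 12 of 20 genes would be retained, whereas the same motif found in only 9 of 20 genes would be rejected regardless of its e-value.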

There are subtle differences among the individual motifs found in the same GO or KEGG motif group. Motif annotation was therefore done independently of motif grouping to explore these differences. In order to assign putative biological functions to individual GO or KEGG motifs, Tomtom was used to search the known motif databases (target motifs) with each of the 24 GO motifs and 102 KEGG motifs (query motifs) based on similarity. Since each motif group represents a putative proximal promoter element, the variation of such a putative proximal promoter element should be manifested by the distinct individual motifs within that particular motif group. Therefore, a database annotation for a specific individual motif would also indicate the function of its motif group, and hence the function of the putative proximal promoter element. When an e-value of 0.01 was used to filter valid database annotations for the GO and KEGG motifs, no significant matches were found. This threshold seems to be too stringent to account for motif variation between species (there are no Chlamydomonas motifs in the databases). When the stringency was relaxed to an e-value of 0.1, some GO and KEGG motifs showed annotation matches with known motifs from other species. Only 16 of the 24 GO motifs had valid annotations. For example, GO motif 1_GO:0016209 matched one target motif, whereas GO motif 1_GO:0003824 matched more than one target motif (see Table 2.3). 10 GO motifs matched MA0048.1 from the JASPAR Core database (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0048.1&rm=present&collection=CORE) (see Table 2.3 for details). MA0048.1 has been found in the E-box motifs present in some genes related to nervous system development in humans (Brown and Baer, 1994). Another match was MA0055.1, which is also found in the JASPAR Core database (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0055.1&rm=present&collection=CORE) and which matched 11 GO motifs (see Table 2.3). MA0055 is also found in humans, but it is a proximal promoter element that confers muscle-specific gene expression (Wasserman and Fickett, 1998). Moreover, another motif, PF0101 from the JASPAR phylofacts motif database, is annotated to 5 GO motifs (Xie et al., 2005) (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=PF0101.1&rm=present&collection=PHYLOFACTS). This motif was discovered during an analysis of human promoters, but is not yet known to be associated with any biological function.
GO motif 1_GO:0016209 was annotated to MX000192 from the Prodoric database (http://prodoric.tu-bs.de/matrix.php?matrix_acc=MX000192), an element associated with nitric oxide metabolism in Rhodobacter sphaeroides (Tosques et al., 1996). A single GO motif matched the Ascl2 motif from Uniprobe (http://the_brain.bwh.harvard.edu/uniprobe/details2?id=99), a motif in mouse and fruit flies that can bind to a protein associated with nervous system development (Badis et al., 2009). Two GO motifs matched the PF0035 motif from the JASPAR phylofacts database (see Table 2.3) (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=PF0035.1&rm=present&collection=PHYLOFACTS), which is bound by MYOD factors known to regulate muscle-specific gene expression (Xie et al., 2005).

Functional annotation using known motif databases was also performed on the 102 KEGG motifs. Only 31 motifs had valid annotations (e-value < 0.1). Most of the KEGG motifs matched the same set of target motifs as the GO motifs, namely MA0048 and MA0055 from JASPAR Core, PF0101 and PF0035 from JASPAR phylofacts, MX000192 from Prodoric, and Ascl2 from Uniprobe (see Table 2.4). Interestingly, the KEGG motif 1_00910 matched the MSN2 motif found in yeast, which has been implicated in binding factors that stimulate the stress response in yeast (MacIsaac et al., 2006) (see Table 2.4). None of the GO motifs matched the MSN2 motif, and KEGG motif 1_00910 is not similar to any GO motif.

46 2.4 Discussion As the first genome-wide exploration of promoter structures and characters in Chlamydomonas, LDSS analysis in our study enabled us to identify and characterize the putative core promoter elements in this green alga species and to compare them with other model species like Arabidopsis and human. In Arabidopsis, TATA box and Y-patch were detected as the most prominent core promoter elements. A microsatellite with CA repeat unit was also found to be overrepresented in the core promoter region (see Figure 2.4 cluster 8), and this element is also identified in the original LDSS analysis (Yamamoto et al., 2007a). Additionally, two more elements with unknown functions were also detected and shown as clusters 9 and 10 in Figure 2.4. These two elements were not detected in the original LDSS analysis, but were previously detected in a genome-wide core promoter analysis in Arabidopsis (Molina and Grotewold, 2005). In human, Inr, TATA box and CpG islands appear to be the most prominent core promoter elements, while DPE and Sp1 elements were also evident (see Figure 2.5). In contrast to the obviously distinctive elements characterized in Arabidopsis and human, Chlamydomonas seems to have only a simple AT rich putative core promoter element. The between-species motif comparison using Tomtom tool revealed that these AT rich elements are significantly similar to the TATA box present in both Arabidopsis and human. Despite such similarity, the TATA box in Chlamydomonas does show more sequence degeneracy, as evident by more LDSS-positive motifs (177 octamers in Arabidopsis, 61 octamers in human and 645 octamers in Chlamydomonas) and more K-means clusters (3 in Arabidopsis, 1 in human, and 10 in Chlamydomonas) in Figures 2.3, 2.4, 2.5, and 2.6. The presence of both canonical and non- canonical TATA boxes have been identified in various organisms such as yeast and human (Yang et al., 2007). 
A canonical TATA box has a strict consensus sequence (TATAWAAR in human and yeast) (Yang et al., 2007), while a non-canonical TATA box is an AT-rich region that allows more sequence degeneracy while functioning similarly to a TATA box (Sugihara et al., 2011). Our data suggest the presence of both canonical and non-canonical TATA boxes in Chlamydomonas, as indicated by the slightly concentrated lines in the middle of the red smears in the Chlamydomonas heatmaps. In addition, in terms of distance to the TSS, there seem to be two different types of TATA boxes in Chlamydomonas: one like those in Arabidopsis and human, located around -35, and the other located around -15. Clearly, the single putative core promoter element evident in Chlamydomonas suggests a simpler core promoter architecture than in either Arabidopsis or human.

Proximal promoter elements are usually shared by functionally similar genes (Meng et al., 2010), co-regulated genes (Brown et al., 2007), and genes involved in the same pathway (Alam and Cook, 2003). In our study, proximal promoter analysis in Chlamydomonas was conducted for GO gene groups and KEGG gene groups, defined by GO Slim terms and KEGG pathway IDs respectively. For each gene group, the MEME suite was used to detect the most commonly occurring motifs with a gene coverage of > 50% (i.e., present in at least half of the genes within a gene group) and an e-value < 0.001. We found 24 GO motifs and 102 KEGG motifs that met these criteria (see Figure 2.8 and supplementary Table S1). Motif comparison confirmed that all 24 GO motifs are similar enough to be grouped into a single GO motif group. Although the 102 KEGG motifs were grouped into 14 KEGG motif groups, most of them (~90%) fall into just 5 KEGG motif groups. The GO motif group showed visual similarities to KEGG motif groups 1, 2 and 3 (see Figure 2.9), and these three KEGG motif groups contain the largest numbers of KEGG motifs. As with the core promoter element, the proximal promoter elements seem to show less diversity in Chlamydomonas.

We used the Tomtom tool to annotate all 24 GO motifs and 102 KEGG motifs against existing motif databases including JASPAR (Sandelin et al., 2004), TRANSFAC (Fogel et al., 2005), Flyreg (Bergman et al., 2005), and UniPROBE (Newburger and Bulyk, 2009). Interestingly, there were no significant matches at a stringent cutoff (e-value < 0.01). When the cutoff was relaxed (e-value < 0.1), we were able to obtain annotations for some GO and KEGG motifs. This indicates that most proximal promoter elements in Chlamydomonas seem to be species-specific and show only weak conservation with known motifs in other species. In particular, many of our GO motifs (e.g., 1_GO:0003824, 1_GO:0005198, 1_GO:0006139 and 1_GO:0005576) and KEGG motifs (e.g., 3_05145, 2_04120, 1_00770 and 2_05140) (see Tables 2.3 and 2.4) show a weak similarity to the E-box motif (CAGCTG), which is bound by basic helix-loop-helix (bHLH) proteins (Brown and Baer, 1994). The E-box motif regulates transcription of genes responding to temperature stress in Chlamydomonas (Voytsekh et al., 2008). A gene encoding the C3 subunit of the CHLAMY1 protein is regulated by the E-box, and CHLAMY1 binds to several genes to regulate them in a circadian manner (Mittag, 1996; Schulze et al., 2010). A number of genes responsible for circadian rhythm in plants and animals also have analogues in Chlamydomonas (Breton and Kay, 2006). It is therefore no surprise to see weak conservation between our proximal promoter elements and the E-box motif. Our annotation results for the KEGG motifs are largely the same as those for the GO motifs, with one exception: KEGG motif 1_00910 (Table 2.4) matched the Msn2 element in yeast (MacIsaac et al., 2006), which is bound by Msn2, a transcription factor responsible for growth control inhibition (Smith et al., 1998). Moreover, some KEGG motifs such as 3_00280, 2_00330, 1_00052 and 1_00600 (see Table 2.2) seem to represent microsatellite regions in Chlamydomonas (see also Groups 4 and 5 in Figure 2.13). Such simple sequence repeats are useful in identity testing, population studies, linkage analysis, and genome mapping in Chlamydomonas (Kang and Fawley, 1997). Further work using differential expression data from microarrays and RNA-seq is needed to discover and confirm more proximal promoter elements and to characterize transcription networks in Chlamydomonas.

Searching for localized distributions of putative motifs in promoter regions has been performed successfully in different species (Bernard et al., 2010; Yamamoto et al., 2007b; Perina et al., 2011). Among such approaches, LDSS is particularly useful because it can detect octamer sequences and characterize core promoter elements (Yamamoto et al., 2007b, 2007a). The underlying reason is that core promoter elements tend to have a specific distance or spacing constraint relative to the TSS (Juven-Gershon et al., 2008). The approach adopted in this study is essentially an extension of the LDSS method, with modifications that appear to make it better suited to exploring species for which there is no prior knowledge of promoter structure and characteristics. The LDSS method is powerful in detecting motifs that have frequency peaks in their distribution profiles around the core promoter region. As its final output, it produces a heatmap of the distribution profiles of the core promoter elements obtained by K-means clustering, and a list of the LDSS-positive octamers belonging to each core promoter element (Yamamoto et al., 2007a, 2007b). Unfortunately, the original LDSS approach does not provide a consensus view of a core promoter element (Yamamoto et al., 2007b). Instead, groups of octamers, labeled as putative core promoter elements on the basis of prior promoter knowledge, are provided as a list. In this study, all LDSS-positive octamers were clustered into 10 octamer clusters by one round of K-means clustering, and a sequence logo that takes into account both the octamer sequences and their frequencies was then generated for each octamer cluster to aid visualization and visual comparison. Furthermore, motif comparisons within and between species were conducted with the Tomtom tool on individual octamer clusters to define motif groups that represent potential core promoter elements. Such motif comparison not only helps us detect and characterize the putative core promoter elements in Chlamydomonas, but also enables us to explore their cross-species conservation. Consequently, a single round of K-means clustering with a fixed cluster number of 10 was sufficient for detecting putative core promoter elements, in place of the multiple rounds of K-means clustering used by the original LDSS method. The multiple rounds of K-means clustering in the original LDSS method require prior knowledge of the promoter architecture of the organism studied. That approach is therefore not effective for organisms such as Chlamydomonas, whose promoters have been little studied. In practice, a single round of K-means clustering combined with sequence logos and Tomtom motif comparison is more appropriate for species whose promoter architecture and characteristics are less explored.

Chlamydomonas is a model organism for studying the chloroplast, flagella and biofuel production. It is therefore essential to explore its promoter structure and characteristics, which are critical for understanding gene expression and regulation in this green alga. Unfortunately, to date there is almost no direct sequencing data for TSSs (e.g., full-length cDNA and 5'-RACE data) available in Chlamydomonas. As a consequence, our promoter data for Chlamydomonas are based on the most recent gene annotation, the Augustus 10.2 gene models, which were trained with 0.3 million Sanger EST and 6.3 million 454 EST sequences. Recently, more transcriptomic data from Illumina RNA-seq have become available for Chlamydomonas, comprising ~63 million reads with a total of ~2.2 billion bases (González-Ballester et al., 2010). Together, these transcriptomic data helped us validate and adjust the TSSs for more than 50% of the Augustus 10.2 genes, which were then used in our promoter analysis. The lack of real TSS data may be the main reason why the Chlamydomonas heatmaps seem to have more noise (e.g., less sharp vertical high-intensity lines) than those of human and Arabidopsis in Figure 2.3. In the future, direct sequencing data for TSSs should help reveal clearer patterns of the regulatory motifs in Chlamydomonas promoters. As the first genome-wide study of Chlamydomonas promoters, our in-silico study, based on computationally predicted TSSs validated with cDNA evidence, nevertheless offers a first, valuable glimpse of promoter structure and characteristics in this model organism.

2.5 Materials and Methods

2.5.1 Obtaining Valid Promoter Sequences

The most recent Chlamydomonas genome annotation, AUGUSTUS u10.2, was used to determine the TSS position of each gene. This set of gene models was trained on several types of evidence, including 0.3 million Sanger ESTs, 6.3 million 454 ESTs, MS/MS data, and genomic conservation with Volvox carteri (Stanke et al., 2004, 2006a, 2006b, 2008; Specht et al., 2011). An additional ~63 million reads of RNA-Seq data (González-Ballester et al., 2010) were not used for training the u10.2 gene models. Moreover, little or no publicly available TSS data (such as 5'-RACE, CAGE, or experimentally verified full-length cDNA data) exist for Chlamydomonas that could be used in this genome-wide study. It was therefore important to align all available cDNA data to the genome and obtain transcriptome evidence to validate the predicted TSSs. A total of 309,185 raw trace files of Sanger ESTs were processed and cleaned using the cDNA terminus approach (Liang et al., 2008). An additional 29,019 vector-masked Sanger ESTs were obtained from the Chlamy Center. For 454 pyrosequencing of the transcriptome, we obtained 689,548 reads from Genoscope and 6,226,485 reads from JGI. We downloaded 63,912,574 Illumina short reads (36 bp) from GEO (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17970) (González-Ballester et al., 2010). Both the 454 sequences and the vector-masked EST sequences were cleaned using SeqClean, which removes adapters and vector fragments as well as low-quality or low-complexity regions (Haas et al., 2003; Campbell et al., 2006). We used GMAP (version 2011-03-18), which is designed for aligning longer sequences to a genome, for the 454 and Sanger ESTs, and GSNAP, which is designed for short-read mapping, for the Illumina reads (Wu and Watanabe, 2005; Wu and Nacu, 2010).
For both tools, only unique and unambiguous alignments were selected for downstream analyses. Both GMAP and GSNAP can export cDNA-to-genome alignments in Sequence Alignment/Map (SAM) format (Li et al., 2009). Using SAMtools (http://samtools.sourceforge.net/), the resulting SAM files were converted into Binary Alignment Map (BAM) files, which were then merged into a single BAM file for effective and comprehensive data processing and mining. We developed a C++ program that uses SAMtools to extract all cDNA reads aligned to the chromosomal regions specified by the AUGUSTUS u10.2 gene models. Among all the reads mapped to a given gene locus, we identified those aligned nearest to the predicted TSS position. A valid alignment had to be at least 30 nt in length and lie within +/- 10 nt of the predicted TSS. If a valid alignment was found for a given predicted gene, its predicted TSS was replaced by the actual start position of the alignment, which became the validated TSS (see Figure 2.11). Of all 17,163 AUGUSTUS genes, only 9,071 genes met our criteria and possess validated TSSs for downstream promoter analysis.
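
The validation rule described above can be illustrated with a minimal Python sketch. It is not the C++ program used in this study; the names `Alignment` and `validate_tss` are hypothetical, and several details are assumptions made only for illustration: alignment length is taken as the genomic span of the alignment, only alignments on the gene's strand are considered, and when several alignments qualify the one whose 5' end is nearest to the predicted TSS is chosen.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MIN_ALIGNED_LENGTH = 30  # nt, threshold stated in the text
MAX_TSS_DISTANCE = 10    # nt, threshold stated in the text


@dataclass(frozen=True)
class Alignment:
    """Hypothetical container for one cDNA-to-genome alignment."""
    start: int   # 1-based leftmost genomic coordinate
    end: int     # 1-based rightmost genomic coordinate (inclusive)
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        # Assumption: length is measured as the genomic span.
        return self.end - self.start + 1

    @property
    def five_prime(self) -> int:
        # The transcript start is the left end on "+" and the right end on "-".
        return self.start if self.strand == "+" else self.end


def validate_tss(predicted_tss: int, strand: str,
                 alignments: Iterable[Alignment]) -> Optional[int]:
    """Return the validated TSS, or None if no alignment supports the prediction."""
    best = None  # (distance to predicted TSS, 5' end of the alignment)
    for aln in alignments:
        if aln.strand != strand or aln.length < MIN_ALIGNED_LENGTH:
            continue
        distance = abs(aln.five_prime - predicted_tss)
        if distance > MAX_TSS_DISTANCE:
            continue
        if best is None or distance < best[0]:
            best = (distance, aln.five_prime)
    return None if best is None else best[1]
```

Genes for which this function returns `None` would correspond to the predicted genes that were excluded from the promoter analysis.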

Strand-specific promoter sequences (-1000 bp to +200 bp, where the validated TSS is at the +1 position) were extracted from the coding strand for all 9,071 protein-coding genes based on the JGI Chlamydomonas genome assembly version 4 (http://genome.jgi-psf.org/Chlre4/download/Chlre4_genomic_scaffolds.fasta.gz). For Arabidopsis, the corresponding promoter regions were extracted for 19,089 genes using the annotated TSSs in the TAIR9 genome release (ftp://ftp.arabidopsis.org/Genes/TAIR9_genome_release/). For human, 23,610 unique promoter sequences (-1000 to +200) were downloaded directly from the Database of Transcriptional Start Sites (DBTSS version 7) (http://dbtss.hgc.jp/). When a gene had more than one documented TSS, the most downstream TSS position was used. The extracted promoter regions were subjected to further downstream analyses using Local Distribution of Short Sequences (LDSS) and the MEME software suite.
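
The coordinate arithmetic behind the strand-specific extraction can be sketched as follows. This is an illustrative Python sketch rather than the code used in the study; the function names are hypothetical, and it assumes 1-based genomic coordinates with no position 0, so that the region from -1000 to +200 spans 1,200 nt with the TSS at offset 1000 of the returned string.

```python
from typing import Optional

UPSTREAM = 1000    # positions -1000 .. -1
DOWNSTREAM = 200   # positions +1 .. +200, where +1 is the TSS

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def extract_promoter(scaffold: str, tss: int, strand: str) -> Optional[str]:
    """Return the -1000..+200 region read 5'->3' on the coding strand.

    `tss` is a 1-based coordinate on `scaffold`. Returns None when the
    window would run off either end of the scaffold (an assumption of this
    sketch; truncated promoters are simply skipped).
    """
    if strand == "+":
        start = tss - UPSTREAM          # genomic coordinate of position -1000
        end = tss + DOWNSTREAM - 1      # genomic coordinate of position +200
    elif strand == "-":
        start = tss - DOWNSTREAM + 1    # position +200 lies to the left
        end = tss + UPSTREAM            # position -1000 lies to the right
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    if start < 1 or end > len(scaffold):
        return None
    window = scaffold[start - 1:end]    # convert to a 0-based slice
    return window if strand == "+" else reverse_complement(window)
```

For a minus-strand gene the window is taken to the right of the TSS and reverse-complemented, so that every promoter is stored in the same orientation relative to its own TSS.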

2.5.2 LDSS analysis of core promoters of three species

The core promoters of Arabidopsis, human and Chlamydomonas were analysed individually by the LDSS method (Yamamoto et al., 2007b). The LDSS method begins by counting the occurrences of all possible octamers (i.e., 65,536) at each position along the whole extracted promoter region (i.e., from position -1000 to +200). The positional distribution profiles of the octamers were examined using several statistical measures to identify LDSS-positive octamers, i.e., octamers with a significant peak in their positional distribution profiles. To determine whether a specific octamer has a significant peak, a basal threshold called the baseline is needed. The baseline is defined as the mean occurrence of a specific octamer in the region from -1000 to -500, because LDSS-positive peaks were not observed beyond 200 bp upstream of the TSS (Yamamoto et al., 2007b). The first statistical measure is the Relative Peak Height (RPH), which is the peak height divided by the baseline. The second measure is the peak height divided by the standard deviation of octamer occurrences from -1000 to -500. The third measure is the peak area divided by the basal fluctuation. The basal fluctuation is calculated by summing the differences between the baseline and the number of occurrences of a given octamer at each position from -1000 to -500, and the peak area is the sum of the octamer occurrences in the region under the peak. Only octamers satisfying all three minimum thresholds (i.e., RPH > 3, peak height divided by standard deviation > 15, and peak area divided by basal fluctuation > 1) were selected as LDSS-positive octamers. Compared with the thresholds used in the original LDSS method (Yamamoto et al., 2007b), our thresholds are somewhat relaxed because of the lack of real TSS data in Chlamydomonas and the resulting higher noise. Figure 2.12 illustrates the basic statistical measurements for assessing an LDSS-positive octamer and how the relevant parameters are calculated. Distribution profiles of all LDSS-positive octamers were extracted, combined, and used as the input file for K-means clustering. With the number of clusters fixed at 10, K-means clustering was conducted using the Cluster software from the Eisen lab (http://rana.lbl.gov/downloads/Cluster/Cluster_vers_2.11.zip) (Eisen et al., 1998). After clustering, the TreeView program from the Eisen lab was used to produce heatmaps of the K-means octamer clusters (http://rana.lbl.gov/downloads/TreeView/TreeView_vers_1_60.exe) (Eisen et al., 1998). Sequence logos were generated for the individual octamer clusters produced by K-means clustering.
WebLogo was used to create a representative sequence logo from a FASTA input file (Crooks et al., 2004). The FASTA input file was generated to contain the nucleotide sequences of each octamer cluster weighted by their occurrence frequencies. For example, if the octamer TATATATA was found 50 times under the peak, it was repeated 50 times as sequence entries in the FASTA file. The sequence logos of all octamer clusters are shown in Figures 2.4 to 2.6 for all three species.
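
The three LDSS measures and their thresholds can be written out as a short Python sketch. It is an illustration under stated assumptions, not the implementation used in this study or in the original LDSS method: the peak height is measured above the baseline, the peak is searched for outside the basal region, the peak region is taken as the contiguous run of positions around the maximum whose counts exceed the baseline, and the basal fluctuation sums absolute differences (signed differences from the mean would cancel to zero). The names `LdssMeasures`, `ldss_measures` and `is_ldss_positive` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass(frozen=True)
class LdssMeasures:
    rph: float              # peak height / baseline
    height_over_sd: float   # peak height / SD of basal counts
    area_over_fluct: float  # peak area / basal fluctuation


def _ratio(numerator: float, denominator: float) -> float:
    # Assumption of this sketch: x/0 is treated as infinity and 0/0 as 0.
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def ldss_measures(profile: Sequence[float], basal_len: int = 501) -> LdssMeasures:
    """Compute the three measures for one octamer's positional profile.

    `profile[i]` is the occurrence count at the i-th promoter position,
    starting at -1000; the first `basal_len` entries (-1000..-500) form
    the basal region.
    """
    basal = profile[:basal_len]
    baseline = mean(basal)
    sd = pstdev(basal)

    # Highest point outside the basal region (first one wins on ties).
    peak = max(range(basal_len, len(profile)), key=lambda i: profile[i])
    peak_height = profile[peak] - baseline

    # Extend the peak region while neighbouring counts stay above the baseline.
    left = right = peak
    while left > basal_len and profile[left - 1] > baseline:
        left -= 1
    while right < len(profile) - 1 and profile[right + 1] > baseline:
        right += 1
    peak_area = sum(profile[left:right + 1])

    basal_fluctuation = sum(abs(count - baseline) for count in basal)
    return LdssMeasures(
        rph=_ratio(peak_height, baseline),
        height_over_sd=_ratio(peak_height, sd),
        area_over_fluct=_ratio(peak_area, basal_fluctuation),
    )


def is_ldss_positive(m: LdssMeasures) -> bool:
    """Apply the thresholds stated in the text (RPH > 3, > 15, > 1)."""
    return m.rph > 3 and m.height_over_sd > 15 and m.area_over_fluct > 1
```

The `basal_len` parameter defaults to the 501 positions between -1000 and -500 but can be shortened to try the function on a toy profile.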

2.5.3 Comparing and combining octamer clusters to form octamer groups

The Tomtom tool from the MEME Suite (Bailey et al., 2009; Tanaka et al., 2011; Gupta et al., 2007) was used to compare the octamer clusters based on quantitative measures of motif similarity (Gupta et al., 2007). To evaluate similarity and identify similar motifs, Tomtom requires both the query and target motifs to be presented in the MEME minimal motif format. The MEME minimal format, exemplified in Figure 2.13, requires background frequencies, which were calculated from all the promoter sequences extracted for a given species. It also needs motif information such as the number of sites/occurrences (nsites), width (w), e-value (E), and a position-specific letter probability matrix. The number of sites was set to the total number of sequences contained in the FASTA file used for generating the sequence logo of a given octamer cluster. The letter probability matrix was calculated from the same FASTA file by counting the number of times each nucleotide occurs at each position across all sequences (e.g., A: 196, C: 20, G: 20, T: 202) and converting the counts to probabilities. Other required information, such as the MEME version (4.6) and strands (+/-), is also needed for Tomtom to accept a given octamer cluster in the MEME minimal motif format. For motif comparison, Tomtom uses a query motif to search a database of target motifs for significant matches. Each octamer cluster was used as a query motif and searched against the other octamer clusters (target motifs) within the same species. Tomtom reports quantitative values for assessing motif similarity, namely a p-value and an e-value. The p-value is the probability that a random motif would match the query at least as well as the target does, and the e-value is the expected number of such chance matches given the size of the target database (the p-value multiplied by the number of target motifs).
The calculation of e-values and p-values for a motif comparison is not reliable when the database contains fewer than 50 motifs (Gupta et al., 2007), but this can be overcome by adding a subset of motifs from the motif databases distributed with the MEME suite. The following motif files were used for this purpose: dmmpmm2009.meme, dpinteract.meme, and homeodomain.meme. These three files contain a total of 276 motifs, which increases the database size enough to calculate reliable e-values and p-values. Only matches with an e-value below 0.001 were considered significant. We also checked for reciprocal matches, i.e., cases in which a query octamer cluster matches a target octamer cluster and vice versa. The octamer comparison by Tomtom was conducted individually for each species, and reciprocally matching octamer clusters within a given species were combined into octamer groups to produce putative core promoter elements for that species. Both the octamer groups and the octamer clusters that show no significant similarity to any other octamer cluster are considered unique putative core promoter elements for a given species. The individual FASTA files were concatenated to form a combined FASTA file for each newly combined octamer group, and the concatenated file was used to create a new sequence logo (see Figure 2.7). Conservation of putative core promoter elements among species was assessed by comparing the putative core promoter elements (or combined octamer groups) between species using Tomtom. The threshold for between-species comparison of combined octamer groups (e-value < 0.01) was less stringent than the one used for within-species comparison of individual octamer clusters (e-value < 0.001), because more variation is expected between similar motifs from different species than within a species.
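
The combination of reciprocally matching octamer clusters can be sketched in Python as a grouping problem. This is a hedged illustration, not the procedure as implemented in the study: the function name and the input layout (a dictionary of Tomtom e-values keyed by query and target cluster) are hypothetical, and the sketch assumes that reciprocal matches are merged transitively, i.e., that groups are the connected components of the reciprocal-match graph.

```python
from typing import Dict, Iterable, List, Tuple


def group_reciprocal_matches(
    clusters: Iterable[str],
    evalues: Dict[Tuple[str, str], float],
    cutoff: float = 0.001,
) -> List[List[str]]:
    """Group clusters linked by reciprocal matches with e-value < cutoff.

    `evalues[(query, target)]` holds the e-value of `target` when searched
    with `query`. Clusters without any reciprocal match stay in a group of
    their own.
    """
    clusters = list(clusters)
    parent = {c: c for c in clusters}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (query, target), evalue in evalues.items():
        if query == target or evalue >= cutoff:
            continue
        # Keep the pair only if the reverse search is also significant.
        if evalues.get((target, query), float("inf")) < cutoff:
            parent[find(query)] = find(target)

    groups: Dict[str, List[str]] = {}
    for c in clusters:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in groups.values())
```

With the default cutoff the function mirrors the within-species comparison; passing `cutoff=0.01` would correspond to the more relaxed between-species threshold.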

2.5.4 Analysis of proximal promoter elements

Proximal promoter analysis was conducted only in Chlamydomonas, using the MEME software on groups of functionally similar or co-expressed genes. A promoter region of 950 nt (from -1000 to -50) was extracted from each gene for this purpose. The Annot8r annotation pipeline (Schmid and Blaxter, 2008) was used to assign Gene Ontology (GO) terms and KEGG pathways to all AUGUSTUS u10.2 genes. To generate GO gene groups, GO annotations with an e-value below 0.001 were kept and genes were grouped using GO Slim terms (Hill et al., 2008) (http://www.geneontology.org/GO.slims.shtml). A gene was allowed to be present in several GO gene groups, but within a given GO gene group, duplicate entries of the same gene due to potential annotation errors were removed. Similarly, all 9,071 AUGUSTUS u10.2 genes with validated TSSs were grouped by KEGG pathway ID to form distinct KEGG gene groups. KEGG annotations for individual genes were filtered using an e-value cutoff of 0.001, and no redundant entries were allowed for the same gene within a KEGG gene group. The promoter sequences (from -1000 bp to -50 bp) of the genes in each group, either a GO or a KEGG gene group, were subjected to motif analysis with the MEME software (version 4.6.0) (Bailey and Elkan, 1994) (http://meme.nbcr.net/meme4_6_0/intro.html). In MEME, the any number of repetitions (anr) option was used to search for the 10 top-scoring octamers for a given GO or KEGG gene group. This option represents a mixture model that integrates two different models to allow for the presence or absence of a motif, as well as multiple occurrences of a motif within the same sequence (Bailey et al., 2006). Motifs discovered for each GO or KEGG gene group were screened further to filter out insignificant motifs. An in-house Perl script was written to extract significant valid motifs with an e-value less than 0.001 and to produce the MEME minimal motif format files required for motif comparison. Significant valid motifs detected from GO gene groups are called GO motifs, and those detected from KEGG gene groups are called KEGG motifs. Using the Tomtom tool, we compared the GO motifs with one another, and likewise the KEGG motifs, to identify similarities. We also compared GO motifs with KEGG motifs. The significance threshold for motif similarity was set at an e-value < 0.01.
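
To make the MEME minimal motif format used for Tomtom input more concrete, the following Python sketch builds a letter probability matrix from a set of equal-length sequences (such as the frequency-weighted octamers of Section 2.5.3) and writes it out as a minimal-format record. It is not the in-house Perl script; the function names, the example background frequencies and the placeholder e-value of 0 are assumptions made for illustration, and the layout follows the MEME minimal format as we understand it (version line, alphabet, strands, background frequencies, then one motif block).

```python
from collections import Counter
from typing import Dict, List, Sequence

ALPHABET = "ACGT"


def letter_probability_matrix(sequences: Sequence[str]) -> List[List[float]]:
    """Per-position nucleotide probabilities; one row per motif position."""
    width = len(sequences[0])
    if any(len(s) != width for s in sequences):
        raise ValueError("all sequences must have the same length")
    nsites = len(sequences)
    matrix = []
    for pos in range(width):
        counts = Counter(s[pos] for s in sequences)
        matrix.append([counts[base] / nsites for base in ALPHABET])
    return matrix


def to_meme_minimal(name: str, sequences: Sequence[str],
                    background: Dict[str, float],
                    evalue: float = 0.0, version: str = "4.6") -> str:
    """Format one motif as a MEME minimal motif format record."""
    matrix = letter_probability_matrix(sequences)
    bg_line = " ".join(f"{base} {background[base]:.3f}" for base in ALPHABET)
    lines = [
        f"MEME version {version}",
        "",
        f"ALPHABET= {ALPHABET}",
        "",
        "strands: + -",
        "",
        "Background letter frequencies",
        bg_line,
        "",
        f"MOTIF {name}",
        "letter-probability matrix: "
        f"alength= {len(ALPHABET)} w= {len(matrix)} "
        f"nsites= {len(sequences)} E= {evalue:g}",
    ]
    lines += [" ".join(f"{p:.6f}" for p in row) for row in matrix]
    return "\n".join(lines) + "\n"
```

For example, `to_meme_minimal("cluster_1", ["TATATATA"] * 3 + ["TATAAATA"], {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})` yields a record with `w= 8` and `nsites= 4`, in which the fifth matrix row assigns probability 0.25 to A and 0.75 to T.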

Also using Tomtom tool, valid GO and KEGG motifs were annotated by searching for similar known motifs in the motif database provided by MEME. Such motif database search can reveal the similarity between our query motifs (i.e., GO or KEGG motifs) and motifs with known biological functions from other species.

2.6 Acknowledgments

The project was supported by funding from an Academic Challenge grant from the Department of Botany, Miami University, Oxford, Ohio. We thank the High Performance Computing Group of Miami University and the Ohio Supercomputer Center for providing computing resources for the MEME analysis. Special thanks go to Jens Muller, Lin Liu and Praveen Kumar Raj Kumar for their help with computer resources, sequence annotation and sequence alignment, respectively.

2.7 References

Adrian, J., Farrona, S., Reimer, J. J., Albani, M. C., Coupland, G., and Turck, F. (2010). cis-Regulatory Elements and Chromatin State Coordinately Control Temporal and Spatial Expression of FLOWERING LOCUS T in Arabidopsis. The Plant Cell Online 22, 1425-1440.

Alam, J., and Cook, J. L. (2003). Transcriptional regulation of the heme oxygenase-1 gene via the stress response element pathway. Curr. Pharm. Des 9, 2499-2511.

Ayoubi, T., and Van De Ven, W. (1996). Regulation of gene expression by alternative promoters. FASEB J. 10, 453-460.

Badis, G., Berger, M. F., Philippakis, A. A., Talukder, S., Gehrke, A. R., Jaeger, S. A., Chan, E. T., Metzler, G., Vedenko, A., Chen, X., et al. (2009). Diversity and Complexity in DNA Recognition by Transcription Factors. Science 324, 1720-1723.

Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L., Ren, J., Li, W. W., and Noble, W. S. (2009). MEME Suite: tools for motif discovery and searching. Nucleic Acids Research 37, W202-W208.

Bailey, T. L., Williams, N., Misleh, C., and Li, W. W. (2006). MEME: discovering and analyzing DNA and protein sequence motifs. Nucl. Acids Res. 34, W369-373.

Bailey, T. L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36.

Bailey, T. L., and Elkan, C. (1995a). The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 3, 21-29.

Bailey, T. L., and Elkan, C. (1995b). The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 3, 21-29.

Bergman, C. M., Carlson, J. W., and Celniker, S. E. (2005). Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics 21, 1747-1749.

Bernard, V., Brunaud, V., and Lecharny, A. (2010). TC-motifs at the TATA-box expected position in plant genes: a novel class of motifs involved in the transcription regulation. BMC Genomics 11, 166.

Breton, G., and Kay, S. (2006). Circadian rhythms lit up in Chlamydomonas. Genome Biology 7, 215.

Brosius, J. (2009). The Fragmented Gene. Annals of the New York Academy of Sciences 1178, 186-193.

Brown, C. D., Johnson, D. S., and Sidow, A. (2007). Functional Architecture and Evolution of Transcriptional Elements That Drive Gene Coexpression. Science 317, 1557-1560.

Brown, L., and Baer, R. (1994). HEN1 encodes a 20-kilodalton phosphoprotein that binds an extended E-box motif as a homodimer. Mol. Cell. Biol. 14, 1245-1255.

Burke, T. W., Willy, P. J., Kutach, A. K., Butler, J. E., and Kadonaga, J. T. (1998). The DPE, a conserved downstream core promoter element that is functionally analogous to the TATA box. Cold Spring Harb Symp Quant Biol 63, 75-82.

Camon, E., Barrell, D., Lee, V., Dimmer, E., and Apweiler, R. (2004). The Gene Ontology Annotation (GOA) Database - an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol 4, 5-6.

Campbell, M. A., Haas, B. J., Hamilton, J. P., Mount, S. M., and Buell, C. (2006). Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7, 327.

Del Campo, J., García-González, M., and Guerrero, M. (2007). Outdoor cultivation of microalgae for carotenoid production: current state and perspectives. Applied Microbiology and Biotechnology 74, 1163-1174.

Civan, P., and Svec, M. (2009). Genome-wide analysis of rice (Oryza sativa L. subsp. japonica) TATA box and Y Patch promoter elements. Genome 52, 294-297.

Coghlan, A., Fiedler, T., McKay, S., Flicek, P., Harris, T., Blasiar, D., Consortium, the nGASP, and Stein, L. (2008). nGASP - the nematode genome annotation assessment project. BMC Bioinformatics 9, 549.

Collins, T., Read, M., Neish, A., Whitley, M., Thanos, D., and Maniatis, T. (1995). Transcriptional regulation of endothelial cell adhesion molecules: NF-kappa B and cytokine-inducible enhancers. The FASEB Journal 9, 899-909.

Crooks, G., Hon, G., Chandonia, J., and Brenner, S. (2004). WebLogo: a sequence logo generator. Genome Research 14, 1188-1190.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-14868.

Fogel, G. B., Weekes, D. G., Varga, G., Dow, E. R., Craven, A. M., Harlow, H. B., Su, E. W., Onyia, J. E., and Su, C. (2005). A statistical analysis of the TRANSFAC database. Biosystems 81, 137-154.

Goldberg, M. L. (1979). PhD Thesis.

González-Ballester, D., Casero, D., Cokus, S., Pellegrini, M., Merchant, S. S., and Grossman, A. R. (2010). RNA-Seq Analysis of Sulfur-Deprived Chlamydomonas Cells Reveals Aspects of Acclimation Critical for Cell Survival. The Plant Cell Online. Available at: http://www.plantcell.org/content/early/2010/06/29/tpc.109.071167.abstract.

Gupta, S., Stamatoyannopoulos, J., Bailey, T., and Noble, W. (2007). Quantifying similarity between motifs. Genome Biology 8, R24.

Haas, B. J., Delcher, A. L., Mount, S. M., Wortman, J. R., Smith Jr, R. K., Hannick, L. I., Maiti, R., Ronning, C. M., Rusch, D. B., Town, C. D., et al. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654-5666.

Haberer, G., Wang, Y., and Mayer, K. F. X. (2011). The Non-coding Landscape of the Genome of Arabidopsis thaliana. In Genetics and Genomics of the Brassicaceae, Plant Genetics and Genomics: Crops and Models. (Springer New York), pp. 67-121. Available at: http://dx.doi.org/10.1007/978-1-4419-7118-0_3.

Hill, D., Smith, B., McAndrews-Hill, M., and Blake, J. (2008). Gene Ontology annotations: what they mean and where they come from. BMC Bioinformatics 9, S2.

Hughes (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology 296, 1205.

Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296, 1205-1214.

Hu, J., Li, B., and Kihara, D. (2005). Limitations and potentials of current motif discovery algorithms. Nucleic Acids Research 33, 4899-4913.

Juven-Gershon, T., Hsu, J.-Y., Theisen, J. W., and Kadonaga, J. T. (2008). The RNA polymerase II core promoter — the gateway to transcription. Current opinion in cell biology 20, 253- 259.

Juven-Gershon, T., Hsu, J.-Y., and Kadonaga, J. T. (2006). Perspectives on the RNA polymerase II core promoter. Biochem.Soc.Trans. 34, 1047-1050.

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., et al. (2008). KEGG for linking genomes to life and the environment. Nucleic Acids Research 36, D480-D484.

Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K. F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M. (2006). From genomics to chemical genomics: new developments in KEGG. Available at: http://nar.oxfordjournals.org/cgi/content/short/34/suppl_1/D354.

Kang, T.-J., and Fawley, M. W. (1997). Variable (CA/GT)n simple sequence repeat DNA in the alga Chlamydomonas. Plant Molecular Biology 35, 943-948.

Karamitri, A., Shore, A. M., Docherty, K., Speakman, J. R., and Lomax, M. A. (2009). Combinatorial Transcription Factor Regulation of the Cyclic AMP-response Element on the Pgc-1α Promoter in White 3T3-L1 and Brown HIB-1B Preadipocytes. Journal of Biological Chemistry 284, 20738-20752.

Kreiman, G. (2004). Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Research 32, 2889-2900.

Krishnamurthy, S., and Hampsey, M. (2009). Eukaryotic transcription initiation. Current Biology 19, R153-R156.

Kropat, J., Hong-Hermesdorf, A., Casero, D., Ent, P., Castruita, M., Pellegrini, M., Merchant, S. S., and Malasarn, D. (2011). A revised mineral nutrient supplement increases biomass and growth rate in Chlamydomonas reinhardtii. The Plant Journal 66, 770-780.

Latchman, D. S. (1997). Transcription factors: An overview. The International Journal of Biochemistry & Cell Biology 29, 1305-1312.

Lauster, R., Reynaud, C. A., Mårtensson, I. L., Peter, A., Bucchini, D., Jami, J., and Weill, J. C. (1993). Promoter, enhancer and silencer elements regulate rearrangement of an immunoglobulin transgene. Available at: http://www.pubmedcentral.gov/articlerender.fcgi?artid=413898.

Liang, C., Liu, Y., Liu, L., Davis, A., Shen, Y., and Li, Q. (2008). Expressed sequence tags with cDNA termini: previously overlooked resources for gene annotation and transcriptome exploration in Chlamydomonas reinhardtii. Genetics 179, 83.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map (SAM) Format and SAMtools. Bioinformatics. Available at: http://bioinformatics.oxfordjournals.org/content/early/2009/06/08/bioinformatics.btp352.abstract.

Lim, C. Y., Santoso, B., Boulay, T., Dong, E., Ohler, U., and Kadonaga, J. T. (2004). The MTE, a new core promoter element for transcription by RNA polymerase II. Genes & Development 18, 1606-1617.

Loganantharaj, R. (2006). Discriminating TATA box from putative TATA boxes in plant genome. Int J Bioinform Res Appl 2, 36-51.

Lomax, J. (2005). Get ready to GO! A biologist’s guide to the Gene Ontology. Briefings in Bioinformatics 6, 298-304.

MacIsaac, K., Wang, T., Gordon, D. B., Gifford, D., Stormo, G., and Fraenkel, E. (2006). An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7, 113.

Meng, G., Mosig, A., and Vingron, M. (2010). A computational evaluation of over-representation of regulatory motifs in the promoter regions of differentially expressed genes. BMC Bioinformatics 11, 267.

Merchant, S. S., Prochnik, S. E., Vallon, O., Harris, E. H., Karpowicz, S. J., Witman, G. B., Terry, A., Salamov, A., Fritz-Laylin, L. K., Marechal-Drouard, L., et al. (2007). The Chlamydomonas Genome Reveals the Evolution of Key Animal and Plant Functions. Science 318, 245-250.

Miller, R., Wu, G., Deshpande, R. R., Vieler, A., Gärtner, K., Li, X., Moellering, E. R., Zäuner, S., Cornish, A. J., Liu, B., et al. (2010). Changes in Transcript Abundance in Chlamydomonas reinhardtii following Nitrogen Deprivation Predict Diversion of Metabolism. Plant Physiology 154, 1737-1752.

Mittag, M. (1996). Conserved circadian elements in phylogenetically diverse algae. Proceedings of the National Academy of Sciences 93, 14401-14404.

Modrek, B., and Lee, C. (2002). A genomic view of alternative splicing. Nat Genet 30, 13-19.

Molina, C., and Grotewold, E. (2005). Genome wide analysis of Arabidopsis core promoters. BMC Genomics 6, 25.

Morowvat, M. H., Rasoul-Amini, S., and Ghasemi, Y. (2010). Chlamydomonas as a “new” organism for biodiesel production. Bioresource Technology 101, 2059-2062.

Newburger, D. E., and Bulyk, M. L. (2009). UniPROBE: an online database of protein binding microarray data on protein–DNA interactions. Nucleic Acids Research 37, D77-D82.

Nolis, I. K., McKay, D. J., Mantouvalou, E., Lomvardas, S., Merika, M., and Thanos, D. (2009). Transcription factors mediate long-range enhancer–promoter interactions. Proceedings of the National Academy of Sciences 106, 20222-20227.

Perina, D., Korolija, M., Roller, M., Harcet, M., Jelicic, B., Mikoc, A., and Cetkovic, H. (2011). Over-represented localized sequence motifs in ribosomal protein gene promoters of basal metazoans. Genomics 98, 56-63.

Roepcke, S., Zhi, D., Vingron, M., and Arndt, P. F. (2006). Identification of highly specific localized sequence motifs in human ribosomal protein gene promoters. Gene 365, 48-56.

Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W., and Lenhard, B. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32, D91-D94.

Sandelin, A., Carninci, P., Lenhard, B., Ponjavic, J., Hayashizaki, Y., and Hume, D. A. (2007). Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nature reviews. Genetics 8, 424-436.

Saxonov, S., Berg, P., and Brutlag, D. L. (2006). A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proceedings of the National Academy of Sciences of the United States of America 103, 1412-1417.

Schmid, R., and Blaxter, M. (2008). annot8r: GO, EC and KEGG annotation of EST datasets. BMC Bioinformatics 9, 180.

Schneider, T. D., and Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18, 6097-6100.

Schulze, T., Prager, K., Dathe, H., Kelm, J., Kießling, P., and Mittag, M. (2010). How the green alga Chlamydomonas reinhardtii keeps time. Protoplasma 244, 3-14.

Shahmuradov, I. A., Gammerman, A. J., Hancock, J. M., Bramley, P. M., and Solovyev, V. V. (2003). PlantProm: a database of plant promoter sequences. Nucleic Acids Res 31, 114-117.

Shen, Y., Liu, Y., Liu, L., Liang, C., and Li, Q. Q. (2008). Unique Features of Nuclear mRNA Poly(A) Signals and Alternative Polyadenylation in Chlamydomonas reinhardtii. Genetics 179, 167-176.

Shinozaki, K., and Yamaguchi-Shinozaki, K. (1997). Gene Expression and Signal Transduction in Water-Stress Response. Plant Physiology 115, 327-334.

Singh, K. B. (1998). Transcriptional Regulation in Plants: The Importance of Combinatorial Control. Plant Physiol. 118, 1111-1120.

Smale, S. T., and Kadonaga, J. T. (2003). The RNA polymerase II core promoter. Annu Rev Biochem 72, 449-479.

Smale, S. T. (2001). Core promoters: active contributors to combinatorial gene regulation. Genes & Development 15, 2503-2508.

Smith, A., Ward, M. P., and Garrett, S. (1998). Yeast PKA represses Msn2p/Msn4p-dependent gene expression to regulate growth, stress response and glycogen accumulation. EMBO J 17, 3556-3564.

Sparmann, A., and van Lohuizen, M. (2006). Polycomb silencers control cell fate, development and cancer. Nat Rev Cancer 6, 846-856.

Specht, M., Stanke, M., Terashima, M., Naumann-Busch, B., Janßen, I., Höhner, R., Hom, E. F. Y., Liang, C., and Hippler, M. (2011). Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the Chlamydomonas reinhardtii genome. Proteomics 11, 1814-1823.

Stanke, M., Diekhans, M., Baertsch, R., and Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637-644.

Stanke, M., Schoffmann, O., Morgenstern, B., and Waack, S. (2006a). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62.

Stanke, M., Steinkamp, R., Waack, S., and Morgenstern, B. (2004). AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Research 32, W309-W312.

Stanke, M., Tzvetkova, A., and Morgenstern, B. (2006b). AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biology 7, S11.

Stolc, V., Samanta, M. P., Tongprasit, W., and Marshall, W. F. (2005). Genome-wide transcriptional analysis of flagellar regeneration in Chlamydomonas reinhardtii identifies orthologs of ciliary disease genes. Proceedings of the National Academy of Sciences of the United States of America 102, 3703-3707.

Sugihara, F., Kasahara, K., and Kokubo, T. (2011). Highly redundant function of multiple AT-rich sequences as core promoter elements in the TATA-less RPS5 promoter of Saccharomyces cerevisiae. Nucleic Acids Research 39, 59-75.

Takahashi, Y. (1991). Directed chloroplast transformation in Chlamydomonas reinhardtii: insertional inactivation of the psaC gene encoding the iron sulfur protein destabilizes photosystem I. The EMBO Journal 10, 2033.

Takaiwa, F., Yamanouchi, U., Yoshihara, T., Washida, H., Tanabe, F., Kato, A., and Yamada, K. (1996). Characterization of common cis-regulatory elements responsible for the endosperm-specific expression of members of the rice glutelin multigene family. Plant Molecular Biology 30, 1207-1221.

Tanaka, E., Bailey, T., Grant, C. E., Noble, W. S., and Keich, U. (2011). Improved similarity scores for comparing motifs. Bioinformatics. Available at: http://bioinformatics.oxfordjournals.org/content/early/2011/05/04/bioinformatics.btr257.abstract.

Tompa, M., Li, N., Bailey, T. L., Church, G. M., De Moor, B., Eskin, E., Favorov, A. V., Frith, M. C., Fu, Y., Kent, W. J., et al. (2005). Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotech 23, 137-144.

Tosques, I., Shi, J., and Shapleigh, J. (1996). Cloning and characterization of nnrR, whose product is required for the expression of proteins involved in nitric oxide metabolism in Rhodobacter sphaeroides 2.4.3. J. Bacteriol. 178, 4958-4964.

Tsuchihara, K., Suzuki, Y., Wakaguri, H., Irie, T., Tanimoto, K., Hashimoto, S.-ichi, Matsushima, K., Mizushima-Sugano, J., Yamashita, R., Nakai, K., et al. (2009). Massive transcriptional start site analysis of human genes in hypoxia cells. Nucleic Acids Research 37, 2249-2263.

Tuch, B. B., Galgoczy, D. J., Hernday, A. D., Li, H., and Johnson, A. D. (2008). The Evolution of Combinatorial Gene Regulation in Fungi. PLoS Biol 6, e38.

Umezawa, T., Fujita, M., Fujita, Y., Yamaguchi-Shinozaki, K., and Shinozaki, K. (2006). Engineering drought tolerance in plants: discovering and tailoring genes to unlock the future. Current Opinion in Biotechnology 17, 113-122.

Vallon, O., and Dutcher, S. (2008). Treasure Hunting in the Chlamydomonas Genome. Genetics 179, 3-6.

De Val, S., Chi, N. C., Meadows, S. M., Minovitsky, S., Anderson, J. P., Harris, I. S., Ehlers, M. L., Agarwal, P., Visel, A., Xu, S.-M., et al. (2008). Combinatorial Regulation of Endothelial Gene Expression by Ets and Forkhead Transcription Factors. Cell 135, 1053-1064.

Voytsekh, O., Seitz, S. B., Iliev, D., and Mittag, M. (2008). Both Subunits of the Circadian RNA-Binding Protein CHLAMY1 Can Integrate Temperature Information. Plant Physiology 147, 2179-2193.

Wasserman, W. W., and Fickett, J. W. (1998). Identification of regulatory regions which confer muscle-specific gene expression. Journal of Molecular Biology 278, 167-181.

Wu, T. D., and Nacu, S. (2010). Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873-881.

Wu, T. D., and Watanabe, C. K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859-1875.

Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., Mootha, V., Lindblad-Toh, K., Lander, E. S., and Kellis, M. (2005). Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434, 338-345.

Yamaguchi-Shinozaki, K., and Shinozaki, K. (1994). A Novel cis-Acting Element in an Arabidopsis Gene Is Involved in Responsiveness to Drought, Low-Temperature, or High-Salt Stress. The Plant Cell 6, 251-264.

Yamaguchi-Shinozaki, K., and Shinozaki, K. (2005). Organization of cis-acting regulatory elements in osmotic- and cold-stress-responsive promoters. Trends in Plant Science 10, 88-94.

Yamamoto, Y. Y., Ichida, H., Abe, T., Suzuki, Y., Sugano, S., and Obokata, J. (2007a). Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucl. Acids Res., gkm685.

Yamamoto, Y. Y., Ichida, H., Matsui, M., Obokata, J., Sakurai, T., Satou, M., Seki, M., Shinozaki, K., and Abe, T. (2007b). Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 8, 67.

Yamamoto, Y. Y., Yoshitsugu, T., Sakurai, T., Seki, M., Shinozaki, K., and Obokata, J. (2009). Heterogeneity of Arabidopsis core promoters revealed by high-density TSS analysis. The Plant Journal 60, 350-362.

Yang, C., Bolotin, E., Jiang, T., Sladek, F. M., and Martinez, E. (2007). Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters. Gene 389, 52-65.

Zhu, Q., Dabi, T., and Lamb, C. (1995). TATA box and initiator functions in the accurate transcription of a plant minimal promoter in vitro. Plant Cell 7, 1681-1689.

2.8 Figures

Figure 2.1: The major core promoter elements present in animals and plants. The elements shown are the TATA box, Initiator (Inr), BRE – TFIIB recognition element, DCE – Downstream core element, DPE – Downstream core promoter element, MTE – Motif 10 Element, Y patch – a small stretch containing C and T, and XCPE1 – X Core Promoter Element 1.

Figure 2.2: Examples of LDSS-positive and LDSS-negative octamers in Chlamydomonas. A) An LDSS-positive octamer is overrepresented in the core promoter region. B) An LDSS-negative octamer occurs randomly along the entire promoter region.

Figure 2.3: LDSS heatmap graphs for Arabidopsis, human, and Chlamydomonas. This figure shows the heatmaps produced by LDSS for each species; the major core promoter elements of each species are listed below. A) Arabidopsis – TATA box and Y-patch. B) Human – Inr, CpG islands, TATA box, DPE, and Sp1. C) Chlamydomonas – TATA box.

Figure 2.4: Sequence logos from Arabidopsis LDSS octamer clusters. These sequence logos were created from the Arabidopsis LDSS octamer clusters produced by the K-means clustering method. Clusters 1-3 represent the TATA box, while clusters 4-7 represent the Y-patch. The remaining clusters have not been assigned any function related to transcription initiation.

Figure 2.5: Sequence logos from human LDSS octamer clusters. These sequence logos were created from the human LDSS octamer groups produced by the K-means clustering method. Cluster 1 seems to represent the TATA box, and clusters 2 and 3 seem to represent Inr. Clusters 4-8 seem to represent CpG islands. Cluster 9 represents the DPE element and cluster 10 represents the Sp1 element, both of which were also recognized in the original LDSS study.

Figure 2.6: Sequence logos from Chlamydomonas LDSS octamer clusters. These sequence logos were created from the Chlamydomonas LDSS octamer groups produced by the K-means clustering method. All groups present seem to represent the TATA box.

Figure 2.7: Sequence logos from combined putative core promoter elements. This figure presents the sequence logos for the octamer groups produced by comparing and combining the initial octamer groups from the LDSS analysis. All species contain octamer groups representing the TATA box core promoter motif. Chlamydomonas contains only one octamer group, which indicates the TATA box. Elements such as the Y-patch and CpG islands are clearly visible in Arabidopsis. The Inr element is indicated in human, but this has to be elucidated using the position of the motif from LDSS.

Figure 2.8: GO motifs. The sequence logos in this figure represent the 24 valid GO motifs detected through MEME analysis of the GO Slim gene groups. All of these motifs were categorized into the same GO motif group based on Tomtom motif similarity measures (e-value < 0.01). This is not unexpected, because visual inspection of the sequence logos does not reveal motifs as dissimilar from one another as those in the KEGG motif groups shown in Figure 2.9. The description for each motif contains the GO ID of the GO gene group and the e-value of the motif. The brackets contain the component gene number of the specific GO gene group and the gene coverage of the motif.

Figure 2.9: Representative motifs from each KEGG motif group. The motifs presented in this figure represent the KEGG motif groups obtained by comparing and categorizing the valid KEGG motifs. KEGG motif groups 1-3 are similar to the single GO motif group produced by Tomtom comparison. Groups 1-5 contain 92 of the 102 valid KEGG motifs (~90%), while the other groups contain only 10 of the 102 valid KEGG motifs (~10%) (see Table 2.2). Groups 1-3 were annotated to the largest number of target motifs in the database provided with the MEME suite (see Table 2.4), and apart from Groups 1-3 only Group 6 matched a target motif in the database (see Table 2.4). The digit before the “_” distinguishes different motifs found for the same KEGG gene group, and the digits after the “_” give the KEGG pathway ID.

Figure 2.10: Overall method used to analyze promoters. This flow chart illustrates the method used to analyze Chlamydomonas promoters: how valid promoters were selected and how promoter regions were extracted. It also includes details about the core promoter and upstream promoter analyses.

Figure 2.11: LDSS Parameters. Peak characteristics used to calculate threshold parameters for LDSS. A – Base Line, B – Peak Height, C – Peak Area, D – Basal Fluctuation. Parameter 1 – Relative Peak Height (RPH) = Peak Height / Base Line, Parameter 2 = Peak Height / Standard Deviation, Parameter 3 = Peak Area / Basal Fluctuation. The figure was created from the idea presented in Yamamoto et al. 2007b. Thresholds used in this study are Parameter 1 >= 3, Parameter 2 >= 15 and Parameter 3 > 1.
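The three ratios and thresholds in the caption of Figure 2.11 can be written out as a minimal sketch. The class and function names below are hypothetical, and the measurement of the peak quantities from an octamer's positional distribution is not shown. The sketch also assumes that an octamer counts as LDSS-positive only when all three thresholds hold at once, which the caption does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class PeakStats:
    """Peak characteristics of one octamer profile (hypothetical container)."""
    baseline: float           # A: base line
    peak_height: float        # B: peak height
    peak_area: float          # C: peak area
    basal_fluctuation: float  # D: basal fluctuation
    std_dev: float            # standard deviation used in Parameter 2


def ldss_parameters(s: PeakStats) -> tuple[float, float, float]:
    """Return (Parameter 1, Parameter 2, Parameter 3) as defined in the caption."""
    p1 = s.peak_height / s.baseline           # relative peak height
    p2 = s.peak_height / s.std_dev            # peak height / standard deviation
    p3 = s.peak_area / s.basal_fluctuation    # peak area / basal fluctuation
    return p1, p2, p3


def is_ldss_positive(s: PeakStats) -> bool:
    """Apply the thresholds quoted in the caption (assumed to be combined with AND)."""
    p1, p2, p3 = ldss_parameters(s)
    return p1 >= 3 and p2 >= 15 and p3 > 1


example = PeakStats(baseline=2.0, peak_height=8.0, peak_area=30.0,
                    basal_fluctuation=10.0, std_dev=0.5)
print(ldss_parameters(example), is_ldss_positive(example))
```

The example values are made up for illustration and are not taken from the analysis.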

Figure 2.12: Valid Promoters. This figure illustrates how valid promoters are selected using predicted TSS positions and EST alignments. The valid EST alignment selected is shown in red. The TSS position is replaced by the leftmost position of the valid alignment.

Figure 2.13: MEME Minimal Motif Format. The MEME motif format contains information such as the strands, the background frequencies of the nucleotides, the position-specific letter probability matrix, the width of the motif, the number of sites of the motif, and the e-value. In this study the background frequencies were calculated from all promoters used in the analysis. The position-specific letter probability matrix was calculated from the FASTA files created to produce the sequence logos.
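As an illustration of the fields listed in the caption of Figure 2.13, the following minimal sketch builds a letter-probability matrix from a set of aligned motif sites and prints it in MEME minimal motif format. The function names, the example sites, the uniform background and the e-value are assumptions made for the example. They do not reproduce the scripts or data used in this study.

```python
ALPHABET = "ACGT"


def pwm_from_sites(sites):
    """Column-wise letter probabilities from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(width):
        column = [s[i].upper() for s in sites]
        pwm.append([column.count(base) / len(sites) for base in ALPHABET])
    return pwm


def format_meme_motif(name, pwm, background, nsites, evalue):
    """Render one motif as text in MEME minimal motif format."""
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "strands: + -",
        "",
        "Background letter frequencies",
        " ".join(f"{b} {background[b]:.3f}" for b in ALPHABET),
        "",
        f"MOTIF {name}",
        f"letter-probability matrix: alength= 4 w= {len(pwm)} "
        f"nsites= {nsites} E= {evalue:g}",
    ]
    for row in pwm:
        lines.append(" ".join(f"{p:.6f}" for p in row))
    return "\n".join(lines) + "\n"


# Hypothetical sites and a uniform background, for illustration only.
sites = ["TATA", "TATA", "TTTA", "TATA"]
uniform = {b: 0.25 for b in ALPHABET}
print(format_meme_motif("demo", pwm_from_sites(sites), uniform,
                        nsites=len(sites), evalue=1e-5))
```

In practice the background frequencies would be replaced by those computed from the promoter set, as described in the caption.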

2.9 Tables

| Property | GO Slim Gene Groups | KEGG Pathway Gene Groups |
|---|---|---|
| Total gene clusters | 32 | 277 |
| Minimum gene count within a group | 29 | 1 |
| Maximum gene count within a group | 3769 | 203 |
| Maximum – Minimum | 3740 | 202 |
| Mean gene count | 1071.25 | 39.91 |
| Median gene count | 5696 | 30 |
| Total Distinct Genes | 4104 | 2560 |
| Unique Genes | 1690 | 146 |
| Shared Genes (common to both) | 2414 | 2414 |
| Groups with valid motif (e-value <= 0.001) | 24 | 92 |
| Valid motifs | 24 | 102 |

Table 2.1: Basic statistics of upstream promoter analysis

| KEGG motif group | Component KEGG motifs |
|---|---|
| Group 1 | 2_00071, 1_00130, 1_00230, 1_00240, 1_00340, 1_00363, 1_00564, 1_00770, 1_03010, 1_03013, 1_03040, 1_04013, 1_04146, 1_04320, 1_04370, 1_04662, 1_04664, 2_00520, 2_00621, 2_00622, 2_03018, 2_04111, 2_04120, 2_04141, 2_04142, 2_04145, 2_04540, 2_05140, 2_05160, 3_04144, 3_05145, 3_05214 |
| Group 2 | 1_00020, 1_00500, 1_00960, 1_00983, 1_04012, 1_04510, 1_04621, 1_04710, 1_04711, 1_04740, 1_04912, 1_05100, 1_05110, 1_05217, 1_05414, 2_00140, 2_04380, 2_04660, 2_05200, 2_05416, 3_04270 |
| Group 3 | 3_00280, 2_00330, 2_00860, 1_04810, 1_04113, 1_04020, 2_05010, 2_00350, 2_05012, 2_04010, 1_04962, 1_00360, 1_03420, 2_05016, 1_00480, 2_05120, 1_00624, 1_04340 |
| Group 4 | 1_00380, 1_00410, 1_00511, 1_00621, 1_00622, 1_00626, 1_00901, 1_00905, 1_00950, 1_00965, 1_04966, 2_00600, 2_05213 |
| Group 5 | 1_00052, 1_00600, 1_00565, 1_01055, 1_00524, 1_00514, 1_00280, 1_05410 |
| Group 6 | 1_00520, 1_00910 |
| Group 7 | 1_00627 |
| Group 8 | 1_00720 |
| Group 9 | 1_05322 |
| Group 10 | 2_05322 |
| Group 11 | 3_05322 |
| Group 12 | 4_05322 |
| Group 13 | 6_05322 |
| Group 14 | 8_05322 |

Table 2.2: The 14 KEGG motif groups and their component KEGG motifs. The table presents all 102 KEGG motifs detected in the KEGG gene groups. Similar KEGG motifs are grouped into the same KEGG motif group. KEGG motif groups were produced by comparing these KEGG motifs to each other using Tomtom and grouping significantly similar motifs (e-value < 0.001). Please see Figure 2.9 for sequence logos of a representative KEGG motif from each group.

| Target Motif | Motif Database | Query GO Motif |
|---|---|---|
| Ascl2 (http://the_brain.bwh.harvard.edu/uniprobe/details2?id=99) | Uniprobe | 2_GO:0008219 |
| MA0048.1 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0048.1&rm=present&collection=CORE) | JASPAR Core | 1_GO:0003824, 1_GO:0005198, 1_GO:0006139, 1_GO:0006810, 1_GO:0007275, 1_GO:0008152, 1_GO:0016874, 1_GO:0030154, 1_GO:0050789, 2_GO:0008219 |
| MA0055.1 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0055.1&rm=present&collection=CORE) | JASPAR Core | 1_GO:0003824, 1_GO:0005198, 1_GO:0005488, 1_GO:0005622, 1_GO:0006139, 1_GO:0006810, 1_GO:0007275, 1_GO:0008152, 1_GO:0030154, 1_GO:0050789, 2_GO:0008219 |
| MX000192 (http://prodoric.tu-bs.de/matrix.php?matrix_acc=MX000192) | Prodoric | 1_GO:0016209 |
| PF0035 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=PF0035.1&rm=present&collection=PHYLOFACTS) | JASPAR phylofacts | 1_GO:0006139, 2_GO:0008219 |
| PF0101 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=PF0101.1&rm=present&collection=PHYLOFACTS) | JASPAR phylofacts | 1_GO:0005576, 1_GO:0006810, 1_GO:0030154, 1_GO:0043062, 2_GO:0007610 |

Table 2.3: Functional annotation of GO motifs detected in GO gene groups. The table presents the GO motifs that matched target motifs in the motif database provided with MEME. The Tomtom tool was used to compare the query GO motifs with the target motifs in the database at a threshold of e-value < 0.1. The table provides the target motif (with a link to the original motif), the database in which the target motif was originally found, and the GO motifs that matched the given target motif. The number preceding the “_” distinguishes different valid GO motifs discovered in a GO gene group. The string following the “_” denotes the GO Slim ID of the specific GO gene group.

| Target motif | Database | Query KEGG Motif |
|---|---|---|
| Ascl2 (http://the_brain.bwh.harvard.edu/uniprobe/details2?id=99) | Uniprobe | 2_04120, 3_05214, 2_04142, 2_00860 |
| MA0048.1 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0048.1&rm=present&collection=CORE) | JASPAR Core | 3_05145, 2_04120, 3_05214, 2_04141, 1_00564, 1_00230, 1_03013, 1_03010, 2_04142, 2_05140, 1_00770, 1_00240, 1_04146, 2_04111 |
| MA0055.1 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0055.1&rm=present&collection=CORE) | JASPAR Core | 2_05016, 2_04120, 3_04270, 3_05214, 2_04141, 1_00564, 1_00230, 1_03013, 1_04020, 2_05010, 1_03010, 2_04142, 2_00860, 1_03420, 2_05140, 1_00770, 2_05160, 1_04146, 1_04962, 2_04111, 3_04144, 2_05012 |
| MSN2 (MacIsaac et al., 2006) | From publication, no database | 1_00910 |
| MX000192 (http://prodoric.tu-bs.de/matrix.php?matrix_acc=MX000192) | Prodoric | 2_04540, 1_00130, 2_00520 |
| PF0035 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=PF0035.1&rm=present&collection=PHYLOFACTS) | JASPAR phylofacts | 2_04120, 3_05214, 2_04142, 1_00770 |
| PF0101 (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=PF0101.1&rm=present&collection=PHYLOFACTS) | JASPAR phylofacts | 1_05110, 2_05010, 1_03010, 1_00360, 1_00480, 2_05140, 2_05160, 1_04962, 3_04144 |

Table 2.4: Functional annotation of KEGG motifs detected in KEGG gene groups. The table presents the KEGG motifs that matched target motifs in the motif database provided with MEME. The Tomtom tool was used to compare the query KEGG motifs with the target motifs in the database at a threshold of e-value < 0.1. The table provides the target motif (with a link to the original motif), the database in which the target motif was originally found, and the KEGG motifs that matched the given target motif. Details of all valid KEGG motifs are found in supplemental file S1. The number preceding the “_” distinguishes different valid KEGG motifs discovered in a KEGG gene group. The string following the “_” denotes the KEGG pathway ID of the specific KEGG gene group.

Table S1: List of 102 valid KEGG motifs. Five columns are included, namely KEGG pathway ID, sequence logo of the motif, e-value of the motif, number of genes annotated to the KEGG pathway ID, and the percentage of genes within the pathway containing the motif.

| KEGG Motif | Sequence Logo | E-value | Component gene # | Gene coverage |
|---|---|---|---|---|
| 1_00020 | | 3.5e-010 | 34 | 64.71% |
| 1_00052 | | 1.5e-023 | 22 | 50% |
| 2_00071 | | 5.9e-007 | 46 | 73.91% |
| 1_00130 | | 6.3e-015 | 26 | 69.23% |
| 2_00140 | | 1.6e-010 | 30 | 53.33% |
| 1_00230 | | 8.5e-053 | 204 | 68.14% |
| 1_00240 | | 2.3e-024 | 80 | 61.25% |
| 1_00280 | | 3.1e-008 | 27 | 51.85% |
| 3_00280 | | 2.6e-007 | 27 | 85.19% |
| 2_00330 | | 1.8e-008 | 64 | 73.44% |
| 1_00340 | | 1.2e-009 | 47 | 78.72% |
| 2_00350 | | 1.2e-016 | 60 | 65% |
| 1_00360 | | 2.2e-006 | 23 | 78.26% |
| 1_00363 | | 2.7e-010 | 68 | 61.76% |
| 1_00380 | | 2.0e-011 | 34 | 61.76% |
| 1_00410 | | 1.0e-006 | 20 | 55% |
| 1_00480 | | 1.9e-007 | 45 | 73.33% |
| 1_00500 | | 2.6e-008 | 57 | 56.14% |
| 1_00511 | | 4.5e-020 | 5 | 80% |
| 1_00514 | | 7.0e-012 | 6 | 66.67% |
| 1_00520 | | 2.2e-013 | 35 | 94.29% |
| 2_00520 | | 1.3e-005 | 35 | 54.29% |
| 1_00524 | | 5.1e-005 | 1 | 100% |
| 1_00564 | | 1.1e-005 | 32 | 75% |
| 1_00565 | | 1.8e-041 | 20 | 50% |
| 1_00600 | | 3.2e-011 | 17 | 52.94% |
| 2_00600 | | 4.9e-011 | 17 | 52.94% |
| 1_00621 | | 2.9e-045 | 5 | 80% |
| 2_00621 | | 3.9e-004 | 5 | 100% |
| 1_00622 | | 1.0e-045 | 3 | 66.67% |
| 2_00622 | | 4.3e-006 | 3 | 100% |
| 1_00624 | | 4.7e-009 | 51 | 66.67% |
| 1_00626 | | 2.3e-036 | 25 | 56% |
| 1_00627 | | 2.4e-009 | 60 | 61.67% |
| 1_00720 | | 1.2e-008 | 19 | 73.68% |
| 1_00770 | | 1.5e-009 | 19 | 78.95% |
| 2_00860 | | 3.0e-010 | 55 | 69.09% |
| 1_00901 | | 6.6e-014 | 3 | 100% |
| 1_00905 | | 4.7e-014 | 5 | 60% |
| 1_00910 | | 5.1e-009 | 21 | 76.19% |
| 1_00950 | | 5.5e-013 | 8 | 62.5% |
| 1_00960 | | 4.6e-005 | 19 | 84.21% |
| 1_00965 | | 4.7e-013 | 3 | 66.67% |
| 1_00983 | | 1.5e-008 | 32 | 50% |
| 1_01055 | | 9.2e-004 | 6 | 66.67% |
| 1_03010 | | 2.0e-022 | 104 | 72.12% |
| 1_03013 | | 5.4e-012 | 60 | 83.33% |
| 2_03018 | | 4.4e-011 | 94 | 67.02% |
| 1_03040 | | 6.2e-055 | 151 | 76.82% |
| 1_03420 | | 6.6e-009 | 45 | 68.89% |
| 2_04010 | | 1.5e-031 | 165 | 56.36% |
| 1_04012 | | 1.1e-012 | 99 | 51.52% |
| 1_04013 | | 1.0e-006 | 41 | 87.8% |
| 1_04020 | | 8.6e-017 | 88 | 54.55% |
| 2_04111 | | 3.2e-014 | 100 | 54% |
| 1_04113 | | 8.6e-035 | 117 | 59.83% |
| 2_04120 | | 2.2e-030 | 89 | 70.79% |
| 2_04141 | | 2.6e-010 | 96 | 52.08% |
| 2_04142 | | 1.5e-010 | 60 | 70% |
| 3_04144 | | 2.7e-014 | 119 | 63.03% |
| 2_04145 | | 4.9e-011 | 61 | 78.69% |
| 1_04146 | | 1.8e-005 | 62 | 69.35% |
| 3_04270 | | 3.9e-010 | 102 | 52.94% |
| 1_04320 | | 4.4e-004 | 36 | 83.33% |
| 1_04340 | | 1.4e-023 | 42 | 76.19% |
| 1_04370 | | 7.6e-005 | 66 | 69.7% |
| 2_04380 | | 4.4e-013 | 105 | 56.19% |
| 1_04510 | | 2.4e-013 | 99 | 63.64% |
| 2_04540 | | 8.4e-004 | 103 | 62.14% |
| 1_04621 | | 6.7e-015 | 86 | 58.14% |
| 2_04660 | | 9.4e-017 | 107 | 58.88% |
| 1_04662 | | 8.1e-004 | 55 | 74.55% |
| 1_04664 | | 1.7e-008 | 62 | 72.58% |
| 1_04710 | | 1.4e-005 | 13 | 100% |
| 1_04711 | | 1.6e-008 | 4 | 100% |
| 1_04740 | | 3.7e-005 | 30 | 56.67% |
| 1_04810 | | 1.8e-026 | 126 | 73.02% |
| 1_04912 | | 2.9e-013 | 84 | 58.33% |
| 1_04962 | | 5.3e-006 | 45 | 53.33% |
| 1_04966 | | 4.7e-011 | 9 | 55.56% |
| 2_05010 | | 1.4e-004 | 84 | 54.76% |
| 2_05012 | | 1.0e-005 | 85 | 51.76% |
| 2_05016 | | 1.6e-016 | 124 | 66.13% |
| 1_05100 | | 3.6e-009 | 54 | 55.56% |
| 1_05110 | | 1.6e-007 | 46 | 67.39% |
| 2_05120 | | 2.2e-005 | 62 | 77.42% |
| 2_05140 | | 2.9e-005 | 75 | 65.33% |
| 3_05145 | | 2.1e-026 | 126 | 50% |
| 2_05160 | | 8.9e-004 | 68 | 66.18% |
| 2_05200 | | 1.9e-016 | 150 | 52.67% |
| 2_05213 | | 5.6e-005 | 57 | 56.14% |
| 3_05214 | | 1.4e-005 | 70 | 61.43% |
| 1_05217 | | 5.7e-008 | 21 | 76.19% |
| 1_05322 | | 2.8e-049 | 33 | 81.82% |
| 2_05322 | | 9.3e-036 | 33 | 93.94% |
| 3_05322 | | 3.8e-025 | 33 | 75.76% |
| 4_05322 | | 3.3e-025 | 33 | 90.91% |
| 6_05322 | | 5.1e-009 | 33 | 57.58% |
| 8_05322 | | 3.4e-004 | 33 | 72.73% |
| 1_05410 | | 3.0e-004 | 30 | 50% |
| 1_05414 | | 2.2e-004 | 25 | 56% |
| 2_05416 | | 1.2e-007 | 50 | 52% |