
ABSTRACT

A GENOME-WIDE ANALYSIS OF PERFECT INVERTED REPEATS IN ARABIDOPSIS THALIANA by Sutharzan Sreeskandarajan

Perfect inverted repeats play a wide variety of roles in genomes. In this study, we conduct a genome-wide analysis of perfect inverted repeats in Arabidopsis thaliana and explore the biological significance of the observed distribution. The roles of palindromic sequences are reviewed in Chapter 1. Chapter 2 describes a tool developed to detect perfect inverted repeats using a novel prime number-based algorithm. Chapter 3 presents the genome-wide analysis, illustrating the observed non-random distribution of perfect inverted repeats among different genic and intergenic regions of the Arabidopsis genome.

A GENOME-WIDE ANALYSIS OF PERFECT INVERTED REPEATS IN ARABIDOPSIS THALIANA

A Thesis

Submitted to the Faculty of Miami University in partial fulfillment of the requirements for the degree of Master of Science, Department of Botany

by Sutharzan Sreeskandarajan

Miami University, Oxford, Ohio

2013

Advisor______Dr. Chun Liang

Reader______Dr. Daniel K. Gladish

Reader______Dr. John E. Karro

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

THESIS ORGANISATION

CHAPTER 1: INTRODUCTION

1.1 DNA Palindromes

1.2 Non β-helix DNA Structures

1.3 Biological importance of Non β-helix DNA Structures

1.4 Binding of Proteins to Palindromes

1.5 Transposons and Palindromes

Figures

CHAPTER 2: A MATLAB-BASED TOOL FOR ACCURATE DETECTION OF PERFECT OVERLAPPING AND NESTED INVERTED REPEATS IN DNA SEQUENCES

Abstract

Introduction

Algorithm

Implementation and evaluation

Conclusion

Figures

CHAPTER 3: A GENOME-WIDE ANALYSIS OF PERFECT INVERTED REPEATS IN ARABIDOPSIS THALIANA

Abstract

Introduction

Materials and Methods

1. Genome-wide detection of perfect IRs in Arabidopsis thaliana

2. The distribution of perfect IRs in the near upstream intergenic region

2.1 Breakdown of the 200 base unit upstream of the 5’ UTR

2.2 The distribution of IRs in the near upstream intergenic regions

2.3 Detection of Cis-acting Elements containing perfect IRs

2.4 The distribution of perfect IRs which are part of Cis-acting Element sequences in the near intergenic regions upstream of 5’ UTR

2.5 The distribution of TATATATA sequences in the near intergenic region upstream of 5’ UTR

Results

1. Genome-wide detection of IRs in Arabidopsis thaliana

2. The analysis of perfect IRs in the near intergenic regions upstream of 5’ UTR

2.1 The distribution of perfect IRs in the near intergenic regions upstream of 5’ UTR

2.2 The distribution of IRs which are part of Cis-acting Element sequences in the near intergenic regions upstream of 5’ UTR

2.3 The distribution of TATATATA sequences in the near intergenic regions upstream of 5’ UTR

Discussion

Conclusion

Tables

Figures

LITERATURE CITED

SUPPLEMENTARY MATERIALS


LIST OF TABLES

Table 1: Nested and overlapping statistics of Arabidopsis thaliana nuclear


LIST OF FIGURES

Figure 1: Examples for different types of palindromes

Figure 2: The major steps of the cumulative prime-number scoring system algorithm for the detection of overlapping IRs

Figure 3: The schematic representation of intragenic and intergenic regions for perfect inverted repeat analysis

Figure 4: Distribution of intergenic distances in Arabidopsis thaliana

Figure 5: Chromosomes lengths of Arabidopsis thaliana

Figure 6: Abundance of perfect IRs in chromosomes of Arabidopsis thaliana

Figure 7: Lengths distribution of perfect IRs in Arabidopsis thaliana genome

Figure 8: The distribution of total bases among various genomic regions

Figure 9: The distribution of total count of perfect IR bases among various genomic regions

Figure 10: The percentage of perfect IR bases in various genomic regions

Figure 11: The distribution of perfect IR bases in the considered windows of the 5’ UTR upstream region

Figure 12: The distribution of perfect IR bases which are part of cis-acting elements in the considered windows of the near intergenic regions upstream of 5’ UTR

Figure 13: The distribution of bases which are part of TATATATA sequences in the considered windows of the near intergenic regions upstream of 5’ UTR

Figure 14: The distribution of bases which are part of cis-acting elements, including TATATATA, in the considered windows of the near intergenic regions upstream of 5’ UTR


THESIS ORGANISATION

This thesis consists of two chapters in two different journal formats. Chapter 2 is in the format of the journal Bioinformatics, and is co-authored by Michelle M Flowers, John E Karro, and Chun Liang. Chapter 3 follows the format specifications of the journal G3, and is co-authored by John E Karro and Chun Liang.


CHAPTER 1: INTRODUCTION

1.1 DNA Palindromes

A palindrome can be defined as a sequence in which the first half is a mirror image of the second. Palindromic patterns are enriched in biological sequences such as nucleic acids and proteins. In DNA sequences, palindromes can be expressed in terms of base identities (Cox & Mirkin, 1997; Gupta, Mittal, & Gupta, 2006; Lu, Jia, Dröge, & Li, 2007). An exact DNA palindrome is a DNA sequence in which one half occurs as the mirror image of the other in terms of nucleotide occurrence. A reverse-complementary DNA palindrome, by contrast, is a DNA sequence in which the first half is the reverse complement of the second half. This research focuses only on reverse-complementary DNA palindromes; hence, from here on, unless otherwise specified, the word palindrome refers only to reverse-complementary DNA palindromes. Palindromes appear to have an important biological role, facilitating the formation of DNA secondary structures such as hairpins, bulge loops and cruciforms, owing to their ability to form complementary base pairs internally within a single sequence (Lu et al., 2007; Mizuuchi, Mizuuchi, & Gellert, 1982).

Palindromes containing a non-palindromic spacer in the middle are generally known as inverted repeats (IRs) (Humphrey-Dixon, Sharp, Schuckers, & Lock, 2011; Lilley, 1980; Smith, 2008). Palindromes can be classified into the following four types based on the perfectness of the complementary base pairing and the presence of a gap (or spacer) sequence in the middle (Figure 1).

1. Perfect palindromes – Palindromes in which the first half is the exact reverse complement of the second. They can also be called perfect IRs.

2. Spacer palindromes – Palindromes which contain a non-palindromic sequence, called a spacer, in the middle. In such a palindrome, the sequences which flank the spacer are complementary to each other. These palindromes are generally called IRs.

3. Imperfect non-spacer palindromes or imperfect inverted repeats – IRs with a certain degree of imperfectness in the complementary base pairing. In such palindromes the first half is not exactly the reverse complement of the second, because several bases are mismatched.

4. Imperfect spacer palindromes – Palindromes which contain a spacer sequence in the middle and also have a certain degree of imperfectness (i.e., mismatches) in the complementary base pairing.

Palindromes which belong to categories 2, 3 and 4 above are referred to as quasipalindromes. In other words, any palindrome which is not a perfect non-spacer palindrome is a quasipalindrome.
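The four categories can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the spacer length is supplied by the caller rather than inferred, and no upper limit is placed on the number of mismatches, so any sequence with mismatched arms is labelled imperfect. A practical tool would cap the mismatch count.

```python
# Watson-Crick complements; input is assumed to contain only A, C, G and T.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def arm_mismatches(seq, spacer_len=0):
    """Count arm positions that fail to pair when the central spacer is skipped."""
    arm_len = (len(seq) - spacer_len) // 2
    left = seq[:arm_len]
    right = seq[len(seq) - arm_len:]
    # Pair the i-th base of the left arm with the i-th base from the right end.
    return sum(COMPLEMENT[a] != b for a, b in zip(left, reversed(right)))


def classify(seq, spacer_len=0):
    """Name the palindrome type of seq for a caller-supplied spacer length."""
    if (len(seq) - spacer_len) % 2:
        raise ValueError("the two arms must have equal length")
    perfect = arm_mismatches(seq, spacer_len) == 0
    if spacer_len == 0:
        return "perfect palindrome" if perfect else "imperfect non-spacer palindrome"
    return "spacer palindrome" if perfect else "imperfect spacer palindrome"


print(classify("GAATTC"))        # arms GAA / TTC pair exactly
print(classify("GAACTTC", 1))    # same arms around a one-base spacer
print(classify("GAATTA"))        # one mismatched pair, no spacer
print(classify("GAACTTA", 1))    # one mismatched pair plus a spacer
```

Under these definitions, the last three calls all return quasipalindrome categories.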

1.2 Non β-helix DNA Structures

DNA structures are formed through intra- and inter-strand base pairing (Mizuuchi et al., 1982). The base pairing is achieved by the formation of hydrogen bonds between bases (Brown, Hunter, Kneale, & Kennard, 1986; Watson & Crick, 1953). The double helix structure is formed by inter-strand base pairing of the two complementary DNA strands. The most prominent example of a double helix structure of DNA is the canonical β-helix DNA structure described by Watson & Crick (1953). Non β-helix DNA structures include hairpins and cruciforms (Bikard, Loot, Baharoglu, & Mazel, 2010; Zhao, Bacolla, Wang, & Vasquez, 2010). Hairpins are formed by intra-strand base pairing in palindromic DNA sequences. A hairpin structure can contain two stems connected by a loop. Hairpins can be perfectly palindromic, formed by perfect palindromes (Nag & Petes, 1991), or can contain non-palindromic characteristics, such as non-palindromic spacers in loops and non-complementary bases in the stems, when they are formed by quasipalindromic sequences (Lilley, 1980; Rennekamp, Wang, & Lieberman, 2010; Smith, 2008). The combination of double helix and hairpins can give rise to a cruciform structure, in which hairpins are present in between stretches of double helix DNA (Gough, Sullivan, & Lilley, 1986; Mizuuchi et al., 1982).

1.3 Biological importance of Non β-helix DNA Structures

Palindromes such as inverted repeats (IRs) can perform numerous biological functions due to their ability to form DNA secondary structures. One such structure, the cruciform, may be involved in the initiation of DNA replication (Pearson, Zorbas, Price, & Zannis-Hadjopoulos, 1996). Cruciform structures can affect the level of supercoiling: they have the potential to relax the negative supercoiling of DNA molecules (White & Bauer, 1987). Changes in the level of supercoiling affect the binding of the regulatory proteins involved in DNA replication (Cozzarelli & Wang, 1990). Similarly, cruciforms can be involved in the control of transcription through their ability to alter the supercoiling level (Liu & Wang, 1987; Pearson et al., 1996). Cruciforms also control several biological processes by affecting nucleosome structure (Pearson et al., 1996). Nucleosome interference can repress the binding of proteins to cis-acting elements. Such interference was observed in the functioning of the transcription factor GAL4 by Workman, Taylor, & Kingston (1991). Their work showed that the effective functioning of the acidic activation domains of GAL4 derivatives is needed to raise the derivatives’ ability to compete with nucleosomes for binding at the site. Another study, carried out by Simpson (1990), indicated that the cis-acting element ARS, which enhances the replication of TRP1ARS1, is hindered by nucleosomes in its protein binding process. Hairpins and cruciforms are unable to bind strongly to histones (Nickol & Martin, 1983; Nobile, Nickol, & Martin, 1986; Pearson et al., 1996). Histones tend to avoid stem-loops when binding to DNA molecules to form nucleosomes. Hence, stem-loops are most likely to be present between nucleosomes rather than being part of them (Nickol & Martin, 1983). Formation of cruciform structures leads to changes in nucleosome phasing, because histones are unable to bind to cruciforms (Nobile et al., 1986). The ability of hairpins and cruciforms to resist nucleosome binding increases the exposure of cis-acting elements to their corresponding trans-acting factors, enabling cruciforms to be a key controlling factor in many biological reactions (Pearson et al., 1996).

1.4 Binding of Proteins to Palindromes

Palindromic sequences can act as protein binding sites, especially the binding sites of dimeric proteins (Lu et al., 2007). Palindromic sequences, including perfect IRs, can act as cis-acting elements for dimeric proteins in prokaryotic and eukaryotic systems (Oñate et al., 1994; Rodikova et al., 2007; Solar, Giraldo, Ruiz-Echevarría, Espinosa, & Díaz-Orejas, 1998; Tang & Perry, 2003). Leucine zippers, a widely studied class of dimeric transcription factor proteins, bind to palindromic sequences (Landschulz, Johnson, & McKnight, 1988; Lu et al., 2007; McKnight, 1991). Two such sequences are the G-Box sequence ‘CCACGTGG’ and the palindromic motif ‘TGACGT’, which contains the palindromic tetramer ‘ACGT’ (Schindler, Beckmann, & Cashmore, 1992; Schindler, Menkens, Beckmann, Ecker, & Cashmore, 1992). Restriction endonuclease recognition sites can also be palindromic. Restriction endonucleases in bacterial systems cleave intruder DNA at specific sites by recognizing specific sequences, as a defense mechanism (Pingoud & Jeltsch, 2001). A category of restriction enzymes called Type II can function as dimers and bind to palindromic sequences (Pingoud & Jeltsch, 2001; Roberts et al., 2003). Basal promoter sequences, such as sequences in TATA boxes and G boxes, can be palindromic (Kiran et al., 2006; Molina & Grotewold, 2005; Schindler, Beckmann, et al., 1992). Many cis-acting elements or motifs contain palindromic sequences (Adrian et al., 2010; Klug, Knapp, Castro, & Beato, 1994; McGuire, Hughes, & Church, 2000). Hence, palindromicity is an important metric in motif prediction (Manson McGuire & Church, 2000; McGuire et al., 2000).
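The palindromic character of the two leucine zipper binding sequences quoted above can be checked directly. The following Python sketch is an illustration only; the helper names are hypothetical, and it considers only even-length substrings of at least four bases.

```python
# Translation table for Watson-Crick complements (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_perfect_palindrome(seq):
    """A perfect (reverse-complementary) palindrome equals its reverse complement."""
    return seq == reverse_complement(seq)


def palindromic_cores(seq, min_len=4):
    """List (start, substring) for every even-length perfect palindrome in seq."""
    found = []
    for start in range(len(seq)):
        for end in range(start + min_len, len(seq) + 1, 2):
            if is_perfect_palindrome(seq[start:end]):
                found.append((start, seq[start:end]))
    return found


print(is_perfect_palindrome("CCACGTGG"))  # G-Box sequence
print(is_perfect_palindrome("TGACGT"))    # not a perfect palindrome as a whole
print(palindromic_cores("TGACGT"))        # reports the ACGT tetramer
```

The check agrees with the description in the text: the G-Box sequence is a perfect palindrome over its full length, whereas TGACGT is palindromic only through its ACGT core.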

1.5 Transposons and Palindromes

Palindromic patterns are associated with many transposons. IRs can be part of transposons (Berg, Johnsrud, McDivitt, Ramabhadran, & Hirschel, 1982). DNA transposons, which transpose without RNA intermediates and are also known as Class II transposable elements, contain terminal IRs (Kuang et al., 2009; Pray, 2008). Hairpin structures aid in the transposition process (Bikard et al., 2010). Many transposons show a biased preference for palindromic insertion sites (Linheiro & Bergman, 2008, 2012).


Figures

Figure 1: Examples for different types of palindromes


CHAPTER 2: A MATLAB-BASED TOOL FOR ACCURATE DETECTION OF PERFECT OVERLAPPING AND NESTED INVERTED REPEATS IN DNA SEQUENCES

Abstract

Summary: Palindromic sequences, or inverted repeats (IRs), in DNA sequences are involved in important biological processes such as DNA–protein binding, DNA replication and DNA transposition. Development of bioinformatics tools that are capable of accurately detecting perfect IRs can enable genome-wide studies of IR patterns in both prokaryotes and eukaryotes. In contrast to conventional string-comparison approaches, we propose a novel algorithm that uses a cumulative score system based on a prime number representation of nucleotide bases. We then implemented this algorithm as a MATLAB-based program for perfect IR detection. In comparison with other existing tools, our program demonstrates a high accuracy in detecting nested and overlapping IRs.

Availability and implementation: The source code is freely available at http://bioinfolab.miamioh.edu/bioinfolab/palindrome.php

Publication information: Received on August 15, 2013; revised on October 23, 2013; accepted on November 5, 2013 (http://bioinformatics.oxfordjournals.org/content/early/2013/11/30/bioinformatics.btt651.full) doi: 10.1093/bioinformatics/btt651

Introduction

Palindromic sequences, or inverted repeats (IRs), are sequences that can form complementary pairing with themselves. IRs are found in abundance throughout both eukaryotic and prokaryotic genomes, and their ability to self-pair allows for the formation of DNA secondary structures that serve as intermediates in biological reactions such as DNA replication and transposition (Berg et al., 1982; Lu et al., 2007; Rice, 2005): cruciforms can initiate DNA replication by changing the supercoiling level or chromatin structure, affecting the binding of DNA replication regulatory factors (Cozzarelli & Wang, 1990; Pearson et al., 1996; White & Bauer, 1987); DNA hairpins are a result of the palindromic nature of perfect and imperfect IRs, allowing them to act as the binding sites of dimeric regulatory factors (LeBlanc, Aspeslagh, Buggia, & Dyer, 2000; Lonskaya et al., 2005) and aid in the transposition process (Kuang et al., 2009; Linheiro & Bergman, 2008, 2012; Pray, 2008). Both perfect and imperfect IRs also have an impact on genomic structure: transposons prefer palindromic insertion sites (Linheiro & Bergman, 2012). Transposons can be present in eukaryotic genomes in a nested manner by occurring within other transposons (Gao et al., 2012). Hence, analysis of nested and overlapping perfect IRs can aid in a variety of research areas including the study of regulatory factor binding sites and of nested transposons.

Currently available tools for the detection of palindromes do not perform well in terms of result quality (Gupta et al., 2006). To make genome-wide studies of palindromic sequences feasible, we need to develop efficient and accurate tools for detecting both perfect and quasi-palindromic sequences (palindromes that contain mismatches and/or non-palindromic spacers). Here we describe a novel algorithm for a tool detecting perfect IRs, implemented as a MATLAB function, capable of running on genome-scale inputs.

Algorithm

Our algorithm differs from the conventional string-comparison algorithms adopted in existing tools for palindrome detection. Its major steps are shown in Figure 2 (see the graphic representation and pseudo-code of our algorithm in Supplementary Figures 1 and 2).

The algorithm detects the border positions of perfect IRs using a cumulative scoring system. The scoring system is based on assigning prime number scores to nucleotides such that the scores of complementary bases cancel each other out. A perfect IR is defined as a DNA sequence that satisfies the following two conditions: (i) the numbers of complementary nucleotide bases are equal and (ii) the number of nested palindromes within the sequence that share the same center as the sequence is equal to half the total number of bases present.

Our detection scheme is based around the assignment of scores to each nucleotide: A and C are each assigned different prime number scores, whereas T and G are each assigned a score equal to the negative of the score of their complementary base, as shown in Supplementary Figure S1. We then scan the sequence and keep a running cumulative score, noting that if a score repeats itself at two positions, the intervening section must have a balanced number of complementary bases (a result following from our choice of prime number score values), and hence is potentially a perfect IR. We filter out all substrings larger than 1000 bp, as doing so significantly increases the efficiency of our algorithm, and our tests on chromosome 1 of the human, Arabidopsis and maize genomes indicate that palindromes larger than 1000 bp are exceedingly rare. We then process each remaining substring with a filtering step to eliminate non-IRs that happen to be balanced in nucleotide numbers.
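The scoring and scanning steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MATLAB implementation: the prime values 2 and 3, the function names and the 0-based half-open coordinates are hypothetical, and the final filter is a plain reverse-complement comparison standing in for the filtering step described in the text. The input is assumed to contain only A, C, G and T.

```python
from collections import defaultdict

# Hypothetical score values: the text only states that A and C receive distinct
# primes and that T and G receive the negatives of their complements' scores.
SCORES = {"A": 2, "T": -2, "C": 3, "G": -3}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_perfect_ir(seq):
    """Filter step: True if seq is identical to its own reverse complement."""
    if len(seq) % 2:
        return False
    return all(COMPLEMENT[a] == b for a, b in zip(seq, reversed(seq)))


def find_perfect_irs(seq, min_len=4, max_len=1000):
    """Return sorted (start, end) half-open spans of perfect IRs in seq."""
    seq = seq.upper()
    positions = defaultdict(list)  # cumulative score -> prefix positions seen
    positions[0].append(0)
    total = 0
    hits = []
    for i, base in enumerate(seq, start=1):
        total += SCORES[base]
        # Any earlier prefix with the same cumulative score bounds a substring
        # whose scores cancel, so it is a candidate perfect IR.
        for j in reversed(positions[total]):
            length = i - j
            if length > max_len:
                break  # older positions are even further away
            if length >= min_len and is_perfect_ir(seq[j:i]):
                hits.append((j, i))
        positions[total].append(i)
    return sorted(hits)


print(find_perfect_irs("TATATA"))
```

On the input TATATA the sketch reports the overlapping IRs TATA (twice) and ATAT as well as the enclosing TATATA, which is the nested and overlapping behaviour the text describes. The candidate loop is written for clarity and has not been tuned for genome-scale inputs.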

Implementation and evaluation

The proposed algorithm was implemented as a MATLAB function (using MATLAB R2012a) and was compared for accuracy against existing IR detection tools including EMBOSS (Rice et al., 2000), BioPHP (http://www.biophp.org/minitools/find_palindromes) and the MATLAB built-in function palindromes on several simple test cases (Supplementary Table 1). The tool IRF (Inverted Repeat Finder) (Warburton, Giordano, Cheung, Gelfand, & Benson, 2004) was excluded from the comparison because it cannot detect overlapping IRs. Our MATLAB tool was able to detect all nested and overlapping IRs, whereas the competing tools were unable to function with 100% accuracy in many cases. As indicated in Supplementary Table 1, MATLAB’s and EMBOSS’s algorithms were unable to detect some perfect IR instances of TATA sequences, whereas BioPHP’s algorithm missed overlapping perfect IRs starting at the same position. Guglietta, Pantaleo, & Graziosi (2010) used BioPHP in their analysis of IRs in HIV-1 gp120 sequences. In their analysis of the C3 region from patient number three, they missed the perfect IRs ATAT and TATA, whereas our tool was able to detect them (see Testcase_HIV.fa in our supplementary data available at http://bioinformatics.oxfordjournals.org/content/early/2013/11/30/bioinformatics.btt651/suppl/DC1).

For validation on real data, our MATLAB program was run on the HIV-1 genome (NC_001802.1) to search for all perfect IRs, including the nested and overlapping patterns, of lengths 4–1000 nt. Consequently, we detected 649 perfect IRs in 0.1339 s. Moreover, our tool proves to be well-suited for the detection of perfect IRs in larger genome data sets like chromosome 1 of Arabidopsis thaliana, Homo sapiens and Zea mays: we detected 17 030 043 perfect IRs in H.sapiens, 25 181 675 in Z.mays and 2 788 106 in A.thaliana, using 1.035 h, 1.050 h and 7.943 min, respectively.

Although our tool has considerably higher accuracy than other tools, that accuracy comes with some cost in runtime. To determine whether the program was still practical to use on large sequence inputs, we looked at the average execution time over 100 randomly generated sequences at sequence lengths ranging from 4 to 1000 bp and found it to increase linearly with input size (Supplementary Figure 3).

Conclusion

In contrast to the conventional string-comparison approaches adopted in existing tools, our MATLAB program uses a novel prime number-based algorithm and can accurately detect nested and overlapping IRs of lengths ranging from 4 to 1000 nt. This tool is practically feasible for perfect IR detection in large DNA sequences. Hence, this tool will assist in the effective and accurate detection of perfect IRs on a genome-wide scale.

Funding: NIH-AREA (1R15GM94732-1 A1 to C.L.) and NSF (No. 0953215 to J.K.) (in part).


Figures

Figure 2: The major steps of the cumulative prime-number scoring system algorithm for the detection of overlapping IRs


CHAPTER 3: A GENOME-WIDE ANALYSIS OF PERFECT INVERTED REPEATS IN ARABIDOPSIS THALIANA

Abstract

Perfect inverted repeats (perfect IRs) play vital roles in genomes. Depending on the functions they perform, the abundance of perfect inverted repeats can vary among genomic regions. This paper presents a genome-wide analysis of perfect IRs in the genome of Arabidopsis thaliana. The analysis detected 11,001,771 perfect IR patterns of 4 bases or more, and revealed that the perfect IRs of Arabidopsis thaliana show a non-random distribution among different intergenic and genic regions. Perfect IRs of Arabidopsis thaliana are enriched in the intergenic region, as compared to the gene regions. Within gene regions, introns appear to have high enrichment of perfect IRs. Intergenic regions which are close to the start sites of genes are enriched in perfect IRs, indicating the importance of palindromic sequences in protein binding sites around promoter regions.

Introduction

Inverted repeats (IRs) play important roles in genomes. The palindromic nature of inverted repeats makes them key players in DNA secondary structures, especially DNA hairpins and cruciforms (Pearson & Sinden, 1996). Such DNA secondary structures are involved in several biological processes, including transcription, DNA replication and transposition (Berg et al., 1982; Kuang et al., 2009; Linheiro & Bergman, 2012; Pearson et al., 1996; Pray, 2008). Many cis-acting elements (e.g. the TATA box element in promoters) and restriction sites consist of patterns including perfect IRs (Lu, Jia, Dröge, & Li, 2007; Molina & Grotewold, 2005; Pingoud & Jeltsch, 2001). Due to the biological importance of IRs, many studies have focused on the genome-wide analysis of IRs in organisms such as Homo sapiens and Saccharomyces cerevisiae (Cox & Mirkin, 1997; Humphrey-Dixon et al., 2011; Lu et al., 2007). Such studies reveal IR distribution trends in several genomic regions.


Genome-wide studies generate databases of IR sequences in genomes. The differences in the distribution of the detected IRs among different intragenic regions and intergenic (far and near) regions can provide evidence for the variety of potential functions they perform. Genome-wide analysis of perfect IRs in the human genome has shown that perfect IRs are differentially distributed among various genomic regions, possibly due to their different roles in the genome (Lu et al., 2007). Human perfect IRs are highly abundant in upstream regions of genes, most likely due to the presence of palindromic regulatory elements, and are also common in introns, possibly aiding in the formation of secondary structures needed for the splicing process of transcription (Berendzen, Stuber, Harter, & Wanke, 2006; Kiran et al., 2006; Lu et al., 2007; Molina & Grotewold, 2005).

Promoter motifs and other cis-acting elements often include perfect IRs. A good example of a perfect IR promoter motif would be the TATA box sequence segment TATATATA (Berendzen et al., 2006; Kiran et al., 2006; Molina & Grotewold, 2005). Dimeric proteins are more likely to bind to motifs containing IR patterns (LeBlanc et al., 2000; Lonskaya et al., 2005; Schindler, Beckmann, et al., 1992; Tsai & Reed, 1998). Hence, intergenic regions could show a non-random distribution of perfect IRs, with more perfect IRs present in upstream intergenic regions closer to 5’ UTRs than in the intergenic regions that are further away from the gene start site (Lu et al., 2007).

This chapter describes the first genome-wide analysis of perfect IRs carried out on the Arabidopsis thaliana genome, illustrates a non-random distribution of the detected perfect IRs among different genic and intergenic regions, and explores the potential functional meaning of the differential enrichment of perfect IRs. As part of testing the functional importance of perfect IRs, potential cis-acting elements or motifs from Arabidopsis thaliana are also analyzed for the presence of perfect palindromic patterns.


Materials and Methods

1. Genome-wide detection of perfect IRs in Arabidopsis thaliana

Using the tool described in Chapter 2, genome-wide detection of nested and overlapping IRs was performed on the Arabidopsis thaliana genome. The minimum and maximum lengths of IRs were taken to be 4 and 1,000 bases. Nested and overlapping perfect IR detection was conducted on individual chromosomes. The distribution of perfect IRs among different intragenic and intergenic regions (i.e., overall intergenic regions, far intergenic regions and near upstream intergenic regions) was analyzed (perfect IRs of 4 and 6 bases in length were excluded from the analysis in order to minimize the effects of random IR sequences). The overall intergenic regions are those without annotated genes on either strand. Moreover, intergenic regions in the size range of 1–10 kb, which are 10 kb away from the flanking protein-coding genes, were considered to be far intergenic regions (Figure 3). These far intergenic regions were designated as the regions with a high probability of containing random perfect IRs because they have fewer cis-acting elements. A distance of 10 kb from the flanking genes was maintained to avoid most of the major cis-acting elements in Arabidopsis thaliana that occur proximally and distally to the genes (Adrian et al., 2010; Lu et al., 2007; Molina & Grotewold, 2005; Schindler, Terzaghi, Beckmann, Kadesch, & Cashmore, 1992). The length limit of 10 kb for the far intergenic distance was defined to avoid large intergenic regions containing high numbers of repetitive sequences, such as centromere regions (Ayele et al., 2005; Round, Flowers, & Richards, 1997). Further, perfect IR bases which correspond to micro RNAs, non-coding RNAs, pseudo-genes, ribosomal RNAs, small nuclear RNAs, small nucleolar RNAs, transfer RNAs and transposable elements were excluded from the perfect IR base count calculation in the far intergenic regions.
The total number of bases that are part of perfect IRs (total count of perfect IR bases) was counted for different intragenic and intergenic regions. The percentage of bases that are part of perfect IRs (the percentage of perfect IR bases) was calculated by dividing the total number of perfect IR bases by the total number of bases present in the specific intragenic or intergenic regions (length of the region). The genomic region coordinates were extracted and determined using the TAIR 10 release genome annotation (Lamesch et al., 2012) (http://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/annotation_data.jsp). Perfect IR base enrichments of different intragenic and intergenic regions were compared using the χ2 statistic, at the 95% confidence level (Cox & Mirkin, 1997).
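The percentage calculation and the comparison of two regions can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the text: IR spans and regions use 0-based half-open coordinates, each base is counted once even when nested or overlapping IRs cover it, and the comparison is a Pearson χ2 test on a 2×2 table with one degree of freedom and no continuity correction. All function names are hypothetical.

```python
import math


def covered_bases(spans, region_start, region_end):
    """Count bases in [region_start, region_end) covered by at least one span."""
    clipped = sorted(
        (max(s, region_start), min(e, region_end))
        for s, e in spans
        if s < region_end and e > region_start
    )
    total = 0
    cursor = region_start  # everything left of cursor is already counted
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            total += e - s
            cursor = e
    return total


def percent_ir_bases(spans, region_start, region_end):
    """Percentage of the region's bases that are part of a perfect IR."""
    return 100.0 * covered_bases(spans, region_start, region_end) / (region_end - region_start)


def chi_square_2x2(ir_a, total_a, ir_b, total_b):
    """Pearson chi-square statistic and p-value (1 d.f.) for two proportions."""
    table = [[ir_a, total_a - ir_a], [ir_b, total_b - ir_b]]
    row = [total_a, total_b]
    col = [ir_a + ir_b, total_a + total_b - ir_a - ir_b]
    n = total_a + total_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


spans = [(0, 4), (0, 6), (1, 5), (10, 14)]
print(covered_bases(spans, 0, 12))
print(chi_square_2x2(10, 100, 30, 100))
```

With a 95% confidence level, two regions would be judged to differ in perfect IR base content when the returned p-value is below 0.05. The function assumes that both columns of the table are non-empty, since an expected count of zero would cause a division by zero.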

2. The distribution of perfect IRs in the near upstream intergenic region

Furthermore, the distribution of perfect IRs in the near intergenic region upstream of the 5’ UTR was analyzed in detail (Figure 3). All perfect IRs with lengths of eight or more nucleotides were considered; perfect IRs of 4 and 6 bases in length were ignored to minimize the effects of random IR sequences. The analysis was restricted to the distribution of inverted repeat bases in the 200 bases upstream of the 5’ UTR of annotated protein-coding genes (chromosomes 1-5). The distribution of intergenic distances in Arabidopsis thaliana was determined. The 200-base upstream region was chosen based on the intergenic region length distribution (Figure 4) to ensure that most of the intergenic regions analyzed did not overlap with adjacent genic regions. The percentages of nested and overlapping genes were determined as a measure of gene overlap in the Arabidopsis thaliana genome (see Table 1).

2.1. Breakdown of the 200 base unit upstream of the 5’ UTR

The 200-bp long near intergenic region upstream of the 5’ UTR was further divided into 10 non-overlapping windows of 20 bases in length (Figure 3). The windows were numbered and labeled in ascending order of their distance from the 5’ UTR starting point. The distance of each window was defined as the distance between the left boundary of the window and the 5’ UTR start position.

2.2 The distribution of IRs in the near upstream intergenic regions

In each of the aforementioned windows, the number of bases that are part of the detected perfect IRs (the count of perfect IR bases) was found for each annotated gene. The overall distribution of the perfect IRs was measured in terms of the total count of perfect IR bases in each window, obtained by adding together the individual perfect IR base counts of all the annotated genes. The statistical significances of the windows having the highest and lowest enrichment of perfect IR bases were determined by performing proportion tests using the χ2 statistic at a confidence level of 95% (Cox & Mirkin, 1997).
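The window breakdown and the per-window counts can be sketched in Python as follows. The sketch assumes a plus-strand gene, 0-based half-open coordinates, and a 5’ UTR start at least 200 bases from the chromosome start; genes on the minus strand would need mirrored coordinates. The function names are hypothetical.

```python
def window_ir_counts(ir_spans, utr_start, n_windows=10, window_size=20):
    """Count perfect IR bases per upstream window; window 1 touches the 5' UTR."""
    region_start = utr_start - n_windows * window_size
    covered = set()
    for start, end in ir_spans:
        # Keep only the part of each IR that falls inside the upstream region.
        covered.update(range(max(start, region_start), min(end, utr_start)))
    counts = []
    for w in range(n_windows):
        right = utr_start - w * window_size
        left = right - window_size
        counts.append(sum(1 for pos in range(left, right) if pos in covered))
    return counts


def window_distances(n_windows=10, window_size=20):
    """Distance from each window's left boundary to the 5' UTR start."""
    return [(w + 1) * window_size for w in range(n_windows)]


def total_counts(per_gene_counts):
    """Add the per-gene window counts together, window by window."""
    return [sum(column) for column in zip(*per_gene_counts)]


gene_a = window_ir_counts([(985, 995), (950, 958), (700, 720)], utr_start=1000)
gene_b = window_ir_counts([(975, 985)], utr_start=1000)
print(window_distances())
print(total_counts([gene_a, gene_b]))
```

Because positions are collected in a set, a base covered by several nested or overlapping IRs is counted once per gene, which is one possible reading of "bases that are part of the detected perfect IRs".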


2.3 Detection of Cis-acting Elements containing perfect IRs

Cis-acting element sequences were downloaded from the Atcis database, a database containing all putative cis-regulatory elements or motifs in the Arabidopsis genome (Yilmaz et al., 2011) (http://arabidopsis.med.ohio-state.edu/AtcisDB/). From these, we determined the cis-acting elements that contain perfect IRs of 8 or more bases in length. The sequences containing perfect IRs were determined using the tool we developed (see Chapter 2).

2.4 The distribution of perfect IRs which are part of Cis-acting Element sequences in the near intergenic regions upstream of 5’ UTR

The distribution of cis-acting elements containing perfect IRs in the 200-nt near intergenic regions upstream of 5’ UTR was determined by searching for cis-acting elements that contain perfect IRs in considered windows (see Figure 3). The searches were performed using exact string matching. The detected location coordinates for the matches were used to determine the frequencies of cis-acting elements that contain perfect IRs in the considered windows.

2.5 The distribution of TATATATA sequences in the near intergenic region upstream of 5’UTR

The potential major palindromic TATA box sequence TATATATA was located by searching for exact matches to the sequence. The detected coordinates were used to determine TATATATA base frequencies for the considered windows. The statistical significance of the windows with the highest and lowest enrichment of TATATATA bases was determined by performing proportion tests using the χ² statistic at a confidence level of 95%.
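The exact-match search and per-window base counting can be sketched as follows. This is only an illustration under stated assumptions: the function name `motif_bases_per_window` is hypothetical, the upstream sequence is assumed to be given 5’ to 3’ on the gene strand (so its last base is adjacent to the 5’ UTR start), and a base covered by several overlapping matches is counted once.

```python
def motif_bases_per_window(upstream, motif="TATATATA", window_len=20):
    """Count bases covered by exact motif matches in each upstream window.

    Returns a list in which element 0 is window 1 (nearest the 5' UTR).
    """
    covered = [False] * len(upstream)
    pos = upstream.find(motif)
    while pos != -1:
        for i in range(pos, pos + len(motif)):
            covered[i] = True
        pos = upstream.find(motif, pos + 1)   # advance by 1 to catch overlapping matches

    n_windows = len(upstream) // window_len
    counts = [0] * n_windows
    for i, is_covered in enumerate(covered):
        if is_covered:
            offset = len(upstream) - i        # 1 = base adjacent to the 5' UTR
            window_index = (offset - 1) // window_len
            if window_index < n_windows:
                counts[window_index] += 1
    return counts
```

Summing these per-gene lists element by element over all annotated genes would give the total TATATATA base count per window.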

Results

1. Genome-wide detection of IRs in Arabidopsis thaliana

The TAIR10 release of Arabidopsis thaliana contains five nuclear and two organelle chromosomes, with a total of 119,667,750 base pairs (Figure 5). The nuclear chromosomes contain a total of 28,496 genes, including both protein-coding and non-protein-coding genes. The percentage of nested and overlapping genes in the nuclear chromosomes ranges from approximately 6.4% to 8.2% (Table 1). The analysis of intergenic distances revealed that many of the non-overlapping genes are closely packed (Figure 4). Around 6% of intergenic regions were shorter than 100 bases, around 16% were shorter than 200 bases, and around 66% were longer than 500 bases. Based on this distance distribution, the 200-base region immediately upstream of the 5’ UTR appears to be a suitable candidate for near upstream intergenic region analysis: it is longer than a 100-base region while keeping the probability of overlapping a neighboring gene lower than a 500-base region would.

The genome-wide analysis revealed a total of 11,001,771 perfect IRs, including both nested and overlapping ones, across all five nuclear chromosomes plus the mitochondrion and chloroplast. A total of 7,500,128 perfect IRs were detected as non-nested perfect IR patterns of 8-88 bases in length. The chromosome-wide distribution of perfect IRs of 8-88 bases in length is shown in Figure 6, and the length distribution of perfect IRs is shown in Figure 7. The regional distribution pattern of the perfect IR base count was similar to the regional distribution of total base counts (Figure 8 and Figure 9), and the percentage of perfect IR bases did not vary much across regions (see Figure 10). Notable differences in the perfect IR distribution were nevertheless observed between chromosomes as a whole and genes, and between intron and exon regions. Gene regions had a significantly lower percentage of perfect IR content than the chromosomal perfect IR base percentage (p-value < 2.2 × 10⁻¹⁶). Within the genic region, perfect IR bases were significantly enriched in introns in comparison to exons (p-value < 2.2 × 10⁻¹⁶). Introns also had a higher perfect IR percentage than the far intergenic region (p-value < 2.2 × 10⁻¹⁶).
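The percentage of perfect IR bases in a region can be sketched as the union of all IR intervals clipped to that region, divided by the region length. The helper names below are hypothetical, and counting each base once even when it belongs to several nested or overlapping IRs is an assumption about how the base counts were tallied.

```python
def ir_base_count(irs, region_start, region_end):
    """Bases in [region_start, region_end] covered by at least one IR.

    irs: iterable of (start, end) pairs, 1-based inclusive.
    """
    clipped = sorted(
        (max(s, region_start), min(e, region_end))
        for s, e in irs
        if e >= region_start and s <= region_end
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1   # close the previous merged block
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)              # extend the current merged block
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def ir_base_percentage(irs, region_start, region_end):
    """Percentage of the region's bases that are part of a perfect IR."""
    region_len = region_end - region_start + 1
    return 100.0 * ir_base_count(irs, region_start, region_end) / region_len
```

For the test sequence CATATATC from Supplementary Table 1, the four detected IRs span positions 2-7, so six of the eight bases are counted.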

2. The analysis of perfect IRs in the near intergenic regions upstream of 5’ UTR

2.1 The distribution of perfect IRs in the near intergenic regions upstream of 5’ UTR

Window number 2 showed a prominent peak in the perfect IR base count distribution of the 200-base near upstream intergenic region, with a total perfect IR base count of ~43 million bases, while window number 1 had the lowest perfect IR base count (~24 million bases) (Figure 11). The other windows did not show any prominent signal patterns. In comparison to windows 3-10 taken together, window number 2 had a significantly higher proportion of perfect IR bases (p-value < 2.2 × 10⁻¹⁶) and window number 1 had a significantly lower proportion (p-value < 2.2 × 10⁻¹⁶).

2.2 The distribution of IRs which are part of Cis-acting Element sequences in the near intergenic regions upstream of 5’ UTR

The Atcis database had 763,507 cis-acting elements, of which ~1% were found to contain IRs of 8 or more bases. These 763,507 elements can be grouped into 471 different sequence types based on sequence content. Of the 471 sequence types, 28 (~6%) contained perfect IR sequences. The distribution of IRs that are part of cis-acting elements in the near intergenic regions upstream of the 5’ UTR did not show any prominent features (Figure 12). The counts were much lower than the frequencies of perfect IR bases. The maximum total base count (625 bases) was recorded in window number 3.

2.3 The distribution of TATATATA sequences in the near intergenic regions upstream of 5’ UTR

The total count of bases that are part of the palindromic TATA box sequence ‘TATATATA’ peaked in the second window (the region 21-40 bases upstream) and was lowest in the first window (Figure 13). The pattern of maximum and minimum enrichment of ‘TATATATA’ was similar to that of all perfect IRs shown in Figure 11. The total count distribution of bases in perfect IR-containing cis-acting elements, including TATATATA, was similar to that of the TATATATA bases alone (Figure 13 and Figure 14).

Discussion

Our genome-wide analysis of perfect IRs generated a dataset of all perfect IR patterns of 4-88 bases in length, including nested and overlapping perfect IRs, in Arabidopsis thaliana. The perfect IRs of 8-88 bases in length were considered for further analysis. These perfect IRs did not show any highly prominent regional preferences when the perfect IR bases were grouped by genomic region using the TAIR10 genome annotation. The distribution of the total perfect IR base count across chromosomes was similar to the distribution of total bases across chromosomes (Figure 5 and Figure 6). However, a notable and significantly lower abundance of perfect IRs was observed in the gene regions in comparison with the overall chromosomal perfect IR abundance (p-value < 2.2 × 10⁻¹⁶), suggesting IR enrichment in the intergenic region. The presence of palindromic regulatory elements, such as palindromic transcriptional regulatory elements in the upstream intergenic regions of genes, might be the major factor behind the observed enrichment of perfect IRs in the intergenic region relative to the genic regions (Lu et al., 2007). Regulatory elements are highly clustered in the intergenic regions that are close to the start positions of genes (Berendzen et al., 2006; Kiran et al., 2006; Lu et al., 2007; Molina & Grotewold, 2005). Hence, far intergenic regions contain fewer perfect IRs than intergenic regions overall (p-value < 2.2 × 10⁻¹⁶), likely owing to their low abundance of cis-acting elements. Most of the far intergenic perfect IRs are likely to have arisen from sequence randomness rather than to exist for functional purposes (Lu et al., 2007). Another differential perfect IR enrichment was observed between the intron and exon regions. Introns contained a significantly higher perfect IR abundance than exons and far intergenic regions (p-value < 2.2 × 10⁻¹⁶). An increased perfect IR presence in introns would increase their ability, relative to exons, to form RNA secondary structures after transcription of the pre-mRNA. Such RNA structures play important roles in the alternative splicing of pre-mRNAs (Eperon, Graham, Griffiths, & Eperon, 1988; Lu et al., 2007; Nasim, Hutchison, Cordeau, & Chabot, 2002).

Because examining the overall distribution is a coarse approach that might blur the fine-level details of the perfect IR distribution, we investigated the perfect IR distribution in the near intergenic regions upstream of the 5’ UTR. Lu et al. (2007) have shown that the perfect IR distribution in the region upstream of the translation start site of human genes shows a site preference, peaking very close to the translation start site. In this study, a similar pattern was observed in the near intergenic region upstream of the 5’ UTR: a significant peak in the total perfect IR base count occurred in the window located 20 bases upstream of the 5’ UTR start position (the transcription start site). Even though the analysis was restricted to a short region of 200 bases to minimize the effects caused by the close proximity of genes, prominent and statistically significant minimum and maximum perfect IR base counts were observed. The observed non-random distribution in the near intergenic regions upstream of the 5’ UTR could be one of the fine-level enrichment patterns that lead to the high enrichment of perfect IRs in the intergenic region. Another statistically significant enrichment was observed in the intron region in comparison to the exon region (Figure 10).

Lu et al. (2007) argued that the observed site preference of perfect inverted repeats in human genes is due to the presence of palindromic regulatory elements. The perfect IR-containing cis-acting elements considered in this study (excluding TATA box elements), which were obtained from the Atcis database, did not support this argument, mainly because their base counts were very low and did not show any notable pattern (see Figure 12). On the other hand, the observed distribution of TATATATA bases in the near intergenic regions upstream of the 5’ UTR supports the concept of perfect IR enrichment due to the presence of regulatory elements (see Figure 13). According to a previous analysis (Molina & Grotewold, 2005), the TATA box sequence is one of the major promoter motifs of Arabidopsis thaliana. The presence of the TATA box motif could explain the high abundance of TATATATA sequences in window number 2 (p-value < 2.2 × 10⁻¹⁶). The observed location of the TATATATA base peak agrees with the previously reported TATA box peak locations (Molina & Grotewold, 2005). The TATATATA bases in window number 2 account for nearly 20% of the perfect IR bases present in the same window. In window 1, where the TATATATA base count is at its lowest (Figure 13, p-value < 2.2 × 10⁻¹⁶), the total perfect IR count is also at its lowest (Figure 11, p-value < 2.2 × 10⁻¹⁶).

When the cis-acting element count distribution was combined with that of the TATATATA bases, no notable change in the overall distribution was observed (Figure 14). This indicates that TATATATA is the major factor influencing the non-random distribution of IRs in the near intergenic regions upstream of the 5’ UTR. However, the considerable proportion (~6%) of perfect IR-containing sequence types among the unique cis-acting element sequence types indicates that IRs can form a considerable part of the regulatory element motif sequences of Arabidopsis thaliana, supporting the idea of palindromicity as a property of regulatory motifs (McGuire & Church, 2000; McGuire et al., 2000). The observed proportion of cis-acting element sequence types agrees with the palindromic motif proportion reported by McGuire et al. (2000) in microbial genomes. Furthermore, it can be argued that the non-random distribution of perfect IRs in the near intergenic regions upstream of 5’ UTRs would be better explained once more cis-acting element motifs are identified in the Arabidopsis thaliana genome.

Conclusion

The genome-wide analysis resulted in a data set of perfect IRs in Arabidopsis thaliana. The intergenic region showed a high abundance of perfect IRs, possibly due to the presence of palindromic regulatory elements. Within genes, introns showed higher perfect IR enrichment than exons, which can be explained by the need for RNA secondary structures in the alternative splicing of pre-mRNAs. Further analysis revealed a non-random distribution of perfect IRs in the near intergenic regions upstream of the 5’ UTR, with the maximum and minimum perfect IR base counts in windows 2 and 1, respectively (21-40 and 1-20 bases upstream of the 5’ UTR start position). The TATATATA sequence, a potential palindromic TATA box motif, showed a pattern of maximum and minimum enrichment similar to that of the perfect IRs, suggesting a major contribution to the perfect IR base abundance in the promoter regions upstream of 5’ UTRs. The observed TATATATA peak location agreed with the previously reported TATA box motif peak locations. Hence, it can be concluded that TATA box motifs play an important role in the non-random distribution of perfect IRs in the region upstream of the 5’ UTR.

This analysis also generated a catalog of cis-regulatory elements or motifs that contain perfect IRs in Arabidopsis thaliana. Even though the perfect IR-containing cis-acting elements obtained from the Atcis database did not contribute notably to the perfect IR distribution due to their low abundance, they made up a considerable share of the cis-acting element sequence types. The observed percentage of perfect IR-containing sequences among the Arabidopsis thaliana cis-acting motif sequence types was similar to the percentage previously reported by McGuire et al. (2000), which provides additional evidence for the importance of palindromicity as a measure in regulatory motif prediction. Furthermore, it can be concluded that the discovery of additional perfect IR-containing motifs in Arabidopsis thaliana could better explain the non-random distribution of perfect IRs in the near intergenic regions upstream of 5’ UTRs. More work on the functional exploration of perfect IR enrichment in introns and in the near intergenic regions upstream of the 5’ UTR is needed to improve our understanding of perfect IRs and the associated biological and molecular mechanisms.

Tables

Table 1: Nested and overlapping gene statistics of Arabidopsis thaliana nuclear chromosomes

Chromosome   Nested and Overlapping Genes   Total Genes   Percentage (%)
1            553                            7509          7.3645
2            365                            4470          8.1655
3            362                            5650          6.4071
4            331                            4308          7.6834
5            484                            6559          7.379

Figures

Figure 3: Schematic representation of the intragenic and intergenic regions used for perfect inverted repeat analysis. A. Different intragenic regions along with the 200-bp near intergenic region upstream of the 5’ UTR. The upstream near intergenic region is divided into ten 20-bp windows (windows #1-10). B. Graphical representation of far intergenic regions. Intergenic regions of 1-10 kb in length, which are 10 kb away from each of the flanking genes, are selected as the far intergenic regions.

Figure 4: Distribution of intergenic distances in Arabidopsis thaliana genes

Figure 5: Chromosome lengths of Arabidopsis thaliana

Figure 6: Abundance of perfect IRs in chromosomes of Arabidopsis thaliana

Figure 7: Length distribution of perfect IRs in the Arabidopsis thaliana genome

Figure 8: The distribution of total bases among various genomic regions

Figure 9: The distribution of total count of perfect IR bases among various genomic regions

Figure 10: The percentage of perfect IR bases in various genomic regions

Figure 11: The distribution of perfect IR bases in the considered windows of the 5’ UTR upstream region.

Figure 12: The distribution of perfect IR bases which are part of cis-acting elements in the considered windows of the near intergenic regions upstream of 5’ UTR

Figure 13: The distribution of bases which are part of TATATATA sequences in the considered windows of the near intergenic regions upstream of 5’ UTR

Figure 14: The distribution of bases which are part of cis-acting elements, including TATATATA, in the considered windows of the near intergenic regions upstream of 5’ UTR

LITERATURE CITED

1. Adrian, J., Farrona, S., Reimer, J. J., Albani, M. C., Coupland, G., & Turck, F. (2010). cis-Regulatory Elements and Chromatin State Coordinately Control Temporal and Spatial Expression of FLOWERING LOCUS T in Arabidopsis. The Plant Cell Online, 22(5), 1425–1440. doi:10.1105/tpc.110.074682
2. Ayele, M., Haas, B. J., Kumar, N., Wu, H., Xiao, Y., Aken, S. V., … Town, C. D. (2005). Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis. Genome Research, 15(4), 487–495. doi:10.1101/gr.3176505
3. Berendzen, K. W., Stuber, K., Harter, K., & Wanke, D. (2006). Cis-motifs upstream of the transcription and translation initiation sites are effectively revealed by their positional disequilibrium in genomes using frequency distribution curves. BMC Bioinformatics, 7, 522. doi:10.1186/1471-2105-7-522
4. Berg, D. E., Johnsrud, L., McDivitt, L., Ramabhadran, R., & Hirschel, B. J. (1982). Inverted repeats of Tn5 are transposable elements. Proceedings of the National Academy of Sciences, 79(8), 2632–2635.
5. Bikard, D., Loot, C., Baharoglu, Z., & Mazel, D. (2010). Folded DNA in Action: Hairpin Formation and Biological Functions in Prokaryotes. Microbiology and Molecular Biology Reviews, 74(4), 570–588. doi:10.1128/MMBR.00026-10
6. Brown, T., Hunter, W. N., Kneale, G., & Kennard, O. (1986). Molecular structure of the G.A base pair in DNA and its implications for the mechanism of transversion mutations. Proceedings of the National Academy of Sciences of the United States of America, 83(8), 2402–2406.
7. Cox, R., & Mirkin, S. M. (1997). Characteristic enrichment of DNA repeats in different genomes. Proceedings of the National Academy of Sciences of the United States of America, 94(10), 5237–5242.
8. Cozzarelli, N. R., & Wang, J. C. (1990). DNA topology and its biological effects. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
9. Eperon, L. P., Graham, I. R., Griffiths, A. D., & Eperon, I. C. (1988). Effects of RNA secondary structure on alternative splicing of Pre-mRNA: Is folding limited to a region behind the transcribing RNA polymerase? Cell, 54(3), 393–401. doi:10.1016/0092-8674(88)90202-4
10. Gao, C., Xiao, M., Ren, X., Hayward, A., Yin, J., Wu, L., … Li, J. (2012). Characterization and functional annotation of nested transposable elements in eukaryotic genomes. Genomics, 100(4), 222–230. doi:10.1016/j.ygeno.2012.07.004
11. Gough, G. W., Sullivan, K. M., & Lilley, D. M. (1986). The structure of cruciforms in supercoiled DNA: probing the single-stranded character of nucleotide bases with bisulphite. The EMBO Journal, 5(1), 191–196.
12. Guglietta, S., Pantaleo, G., & Graziosi, C. (2010). Long sequence duplications, repeats, and palindromes in HIV-1 gp120: length variation in V4 as the product of misalignment mechanism. Virology, 399(1), 167–175. doi:10.1016/j.virol.2009.12.030
13. Gupta, R., Mittal, A., & Gupta, S. (2006). An efficient algorithm to detect palindromes in DNA sequences using periodicity transform. Signal Process., 86(8), 2067–2073. doi:10.1016/j.sigpro.2005.10.008
14. Humphrey-Dixon, E. L., Sharp, R., Schuckers, M., & Lock, R. (2011). Comparative genome analysis suggests characteristics of yeast inverted repeats that are important for transcriptional activity. Genome / National Research Council Canada = Génome / Conseil national de recherches Canada, 54(11), 934–942. doi:10.1139/g11-058
15. Kiran, K., Ansari, S. A., Srivastava, R., Lodhi, N., Chaturvedi, C. P., Sawant, S. V., & Tuli, R. (2006). The TATA-Box Sequence in the Basal Promoter Contributes to Determining Light-Dependent Gene Expression in Plants. Plant Physiology, 142(1), 364–376. doi:10.1104/pp.106.084319
16. Klug, J., Knapp, S., Castro, I., & Beato, M. (1994). Two distinct factors bind to the rabbit uteroglobin TATA-box region and are required for efficient transcription. Molecular and Cellular Biology, 14(9), 6208–6218.
17. Kuang, H., Padmanabhan, C., Li, F., Kamei, A., Bhaskar, P. B., Ouyang, S., … Baker, B. (2009). Identification of miniature inverted-repeat transposable elements (MITEs) and biogenesis of their siRNAs in the Solanaceae: New functional implications for MITEs. Genome Research, 19(1), 42–56. doi:10.1101/gr.078196.108
18. Lamesch, P., Berardini, T. Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., … Huala, E. (2012). The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research, 40(D1), D1202–D1210. doi:10.1093/nar/gkr1090
19. Landschulz, W. H., Johnson, P. F., & McKnight, S. L. (1988). The leucine zipper: a hypothetical structure common to a new class of DNA binding proteins. Science (New York, N.Y.), 240(4860), 1759–1764.
20. LeBlanc, M. D., Aspeslagh, G., Buggia, N. P., & Dyer, B. D. (2000). An Annotated Catalog of Inverted Repeats of Chromosomes III and X, with Observations Concerning Odd/Even Biases and Conserved Motifs. Genome Research, 10(9), 1381–1392. doi:10.1101/gr.122700
21. Lilley, D. M. (1980). The inverted repeat as a recognizable structural feature in supercoiled DNA molecules. Proceedings of the National Academy of Sciences, 77(11), 6468–6472.
22. Linheiro, R. S., & Bergman, C. M. (2008). Testing the palindromic target site model for DNA transposon insertion using the Drosophila melanogaster P-element. Nucleic Acids Research, 36(19), 6199–6208. doi:10.1093/nar/gkn563
23. Linheiro, R. S., & Bergman, C. M. (2012). Whole Genome Resequencing Reveals Natural Target Site Preferences of Transposable Elements in Drosophila melanogaster. PLoS ONE, 7(2), e30008. doi:10.1371/journal.pone.0030008
24. Liu, L. F., & Wang, J. C. (1987). Supercoiling of the DNA template during transcription. Proceedings of the National Academy of Sciences, 84(20), 7024–7027.
25. Lonskaya, I., Potaman, V. N., Shlyakhtenko, L. S., Oussatcheva, E. A., Lyubchenko, Y. L., & Soldatenkov, V. A. (2005). Regulation of Poly(ADP-ribose) Polymerase-1 by DNA Structure-specific Binding. Journal of Biological Chemistry, 280(17), 17076–17083. doi:10.1074/jbc.M413483200
26. Lu, L., Jia, H., Dröge, P., & Li, J. (2007). The human genome-wide distribution of DNA palindromes. Functional & Integrative Genomics, 7(3), 221–227. doi:10.1007/s10142-007-0047-6
27. Manson McGuire, A., & Church, G. M. (2000). Predicting regulons and their cis-regulatory motifs by comparative genomics. Nucleic Acids Research, 28(22), 4523–4530.
28. McGuire, A. M., Hughes, J. D., & Church, G. M. (2000). Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Research, 10(6), 744–757.

29. McKnight, S. L. (1991). Molecular zippers in gene regulation. Scientific American, 264(4), 54–64.
30. Mizuuchi, K., Mizuuchi, M., & Gellert, M. (1982). Cruciform structures in palindromic DNA are favored by DNA supercoiling. Journal of Molecular Biology, 156(2), 229–243. doi:10.1016/0022-2836(82)90325-4
31. Molina, C., & Grotewold, E. (2005). Genome wide analysis of Arabidopsis core promoters. BMC Genomics, 6, 25. doi:10.1186/1471-2164-6-25
32. Nag, D. K., & Petes, T. D. (1991). Seven-Base-Pair Inverted Repeats in DNA Form Stable Hairpins in Vivo in Saccharomyces Cerevisiae. Genetics, 129(3), 669–673.
33. Nasim, F.-U. H., Hutchison, S., Cordeau, M., & Chabot, B. (2002). High-affinity hnRNP A1 binding sites and duplex-forming inverted repeats have similar effects on 5’ splice site selection in support of a common looping out and repression mechanism. RNA (New York, N.Y.), 8(8), 1078–1089.
34. Nickol, J., & Martin, R. G. (1983). DNA stem-loop structures bind poorly to histone octamer cores. Proceedings of the National Academy of Sciences of the United States of America, 80(15), 4669–4673.
35. Nobile, C., Nickol, J., & Martin, R. G. (1986). Nucleosome phasing on a DNA fragment from the replication origin of simian virus 40 and rephasing upon cruciform formation of the DNA. Molecular and Cellular Biology, 6(8), 2916–2922.
36. Oñate, S. A., Prendergast, P., Wagner, J. P., Nissen, M., Reeves, R., Pettijohn, D. E., & Edwards, D. P. (1994). The DNA-bending protein HMG-1 enhances progesterone receptor binding to its target DNA sequences. Molecular and Cellular Biology, 14(5), 3376–3391.
37. Pearson, C. E., & Sinden, R. R. (1996). Alternative Structures in Duplex DNA Formed within the Trinucleotide Repeats of the Myotonic Dystrophy and Fragile X Loci. Biochemistry, 35(15), 5041–5053. doi:10.1021/bi9601013
38. Pearson, C. E., Zorbas, H., Price, G. B., & Zannis-Hadjopoulos, M. (1996). Inverted repeats, stem-loops, and cruciforms: Significance for initiation of DNA replication. Journal of Cellular Biochemistry, 63(1), 1–22. doi:10.1002/(SICI)1097-4644(199610)63:1<1::AID-JCB1>3.0.CO;2-3
39. Pingoud, A., & Jeltsch, A. (2001). Structure and function of type II restriction endonucleases. Nucleic Acids Research, 29(18), 3705–3727. doi:10.1093/nar/29.18.3705
40. Pray, L. (2008). Transposons: The jumping genes. Nature Education, 1(1).
41. Rennekamp, A. J., Wang, P., & Lieberman, P. M. (2010). Evidence for DNA Hairpin Recognition by Zta at the Epstein-Barr Virus Origin of Lytic Replication. Journal of Virology, 84(14), 7073–7082. doi:10.1128/JVI.02666-09
42. Rice, P. A. (2005). Visualizing Mu transposition: assembling the puzzle pieces. Genes & Development, 19(7), 773–775. doi:10.1101/gad.1309305
43. Roberts, R. J., Belfort, M., Bestor, T., Bhagwat, A. S., Bickle, T. A., Bitinaite, J., … Xu, S. (2003). A nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases and their genes. Nucleic Acids Research, 31(7), 1805–1812. doi:10.1093/nar/gkg274
44. Rodikova, E. A., Kovalevskiy, O. V., Mayorov, S. G., Budarina, Z. I., Marchenkov, V. V., Melnik, B. S., … Solonin, A. S. (2007). Two HlyIIR dimers bind to a long perfect inverted repeat in the operator of the hemolysin II gene from Bacillus cereus. FEBS Letters, 581(6), 1190–1196. doi:10.1016/j.febslet.2007.02.035
45. Round, E. K., Flowers, S. K., & Richards, E. J. (1997). Arabidopsis thaliana Centromere Regions: Genetic Map Positions and Repetitive DNA Structure. Genome Research, 7(11), 1045–1053. doi:10.1101/gr.7.11.1045
46. Schindler, U., Beckmann, H., & Cashmore, A. R. (1992). TGA1 and G-box binding factors: two distinct classes of Arabidopsis leucine zipper proteins compete for the G-box-like element TGACGTGG. The Plant Cell, 4(10), 1309–1319.
47. Schindler, U., Menkens, A. E., Beckmann, H., Ecker, J. R., & Cashmore, A. R. (1992). Heterodimerization between light-regulated and ubiquitously expressed Arabidopsis GBF bZIP proteins. The EMBO Journal, 11(4), 1261–1273.
48. Schindler, U., Terzaghi, W., Beckmann, H., Kadesch, T., & Cashmore, A. R. (1992). DNA binding site preferences and transcriptional activation properties of the Arabidopsis transcription factor GBF1. The EMBO Journal, 11(4), 1275–1289.
49. Simpson, R. T. (1990). Nucleosome positioning can affect the function of a cis-acting DNA element in vivo. Nature, 343(6256), 387–389. doi:10.1038/343387a0
50. Smith, G. R. (2008). Meeting DNA palindromes head-to-head. Genes & Development, 22(19), 2612–2620. doi:10.1101/gad.1724708

51. Solar, G. del, Giraldo, R., Ruiz-Echevarría, M. J., Espinosa, M., & Díaz-Orejas, R. (1998). Replication and Control of Circular Bacterial Plasmids. Microbiology and Molecular Biology Reviews, 62(2), 434–464.
52. Tang, W., & Perry, S. E. (2003). Binding Site Selection for the Plant MADS Domain Protein AGL15: An in Vitro and in Vivo Study. Journal of Biological Chemistry, 278(30), 28154–28159. doi:10.1074/jbc.M212976200
53. Tsai, R. Y. L., & Reed, R. R. (1998). Identification of DNA Recognition Sequences and Protein Interaction Domains of the Multiple-Zn-Finger Protein Roaz. Molecular and Cellular Biology, 18(11), 6447–6456.
54. Warburton, P. E., Giordano, J., Cheung, F., Gelfand, Y., & Benson, G. (2004). Inverted Repeat Structure of the Human Genome: The X-Chromosome Contains a Preponderance of Large, Highly Homologous Inverted Repeats That Contain Testes Genes. Genome Research, 14(10a), 1861–1869. doi:10.1101/gr.2542904
55. Watson, J. D., & Crick, F. H. C. (1953). Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature, 171(4356), 737–738. doi:10.1038/171737a0
56. White, J. H., & Bauer, W. R. (1987). Superhelical DNA with local substructures. A generalization of the topological constraint in terms of the intersection number and the ladder-like correspondence surface. Journal of Molecular Biology, 195(1), 205–213.
57. Workman, J. L., Taylor, I. C., & Kingston, R. E. (1991). Activation domains of stably bound GAL4 derivatives alleviate repression of promoters by nucleosomes. Cell, 64(3), 533–544.
58. Yilmaz, A., Mejia-Guerra, M. K., Kurz, K., Liang, X., Welch, L., & Grotewold, E. (2011). AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Research, 39(Database issue), D1118–1122. doi:10.1093/nar/gkq1120
59. Zhao, J., Bacolla, A., Wang, G., & Vasquez, K. M. (2010). Non-B DNA structure-induced genetic instability and evolution. Cellular and Molecular Life Sciences, 67(1), 43–62.

SUPPLEMENTARY MATERIALS

Supplementary Figure 1: Graphic representation of scoring schema.

# Function definitions:
#   unique(A)    - returns unique values of A
#   find(A == B) - returns the positions of A satisfying A == B

# Set up scoring system and calculate cumulative score
# (complementary bases carry opposite scores, so a perfect IR sums to zero)
score['A', 'T', 'G', 'C'] = [10007, -10007, 10009, -10009]
scoreCumulativeVec[0] = 0
for i = 1 to length(sequence):
    scoreCumulativeVec[i] = scoreCumulativeVec[i-1] + score[sequence[i]]
scoreCumulativeUniqueVec = unique(scoreCumulativeVec)

# Find the multiple occurrences of the unique cumulative scores
for each score in scoreCumulativeUniqueVec:
    positionVec = find(scoreCumulativeVec == score)
    initialIRposition.append(positionVec)

# Construct all the possible perfect inverted repeats
IRIniCount = 0, seedVec = [0]*length(sequence)
for each IRPositionVec in initialIRposition:
    if length(IRPositionVec) > 1:
        for i = 1 to length(IRPositionVec)-1:
            for j = i+1 to length(IRPositionVec):
                startPosition = IRPositionVec[i]
                endPosition = IRPositionVec[j] - 1
                IRIniLength = (endPosition - startPosition + 1)
                if IRIniLength <= maxLength and IRIniLength % 2 == 0:
                    IRIniCount += 1
                    startPositionVecIni[IRIniCount] = startPosition
                    endPositionVecIni[IRIniCount] = endPosition
                    lengthVecIni[IRIniCount] = IRIniLength
                    seedPositionIni = startPosition + (IRIniLength / 2)
                    seedVec[seedPositionIni] += 1
                else:
                    break

# Validate the possible perfect inverted repeats
IRCount = 1, lengthIniSrtIndex = sort(lengthVecIni, 'descend')
for each IRIndex in lengthIniSrtIndex:
    if lengthVecIni[IRIndex] >= minLength:
        seedPositionIni = startPositionVecIni[IRIndex] + (lengthVecIni[IRIndex] / 2)
        if (lengthVecIni[IRIndex] / 2) == seedVec[seedPositionIni]:
            startPositionVec[IRCount] = startPositionVecIni[IRIndex]
            endPositionVec[IRCount] = endPositionVecIni[IRIndex]
            IRCount += 1
        seedVec[seedPositionIni] -= 1
return startPositionVec, endPositionVec

Supplementary Figure 2: The pseudocode of our MATLAB program.
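For readers who want to experiment with the idea, the pseudocode can be approximated by the short Python sketch below. It is an illustrative re-expression, not the MATLAB program itself. It assumes that complementary bases carry opposite prime scores (A/T = ±10007, G/C = ±10009), that positions are reported 1-based and inclusive, and that input contains only A, C, G and T; the function name `find_perfect_irs` and its default parameters are hypothetical.

```python
from collections import defaultdict

# Assumed scoring: complementary bases cancel, so a perfect IR sums to zero.
SCORES = {"A": 10007, "T": -10007, "G": 10009, "C": -10009}

def find_perfect_irs(sequence, min_length=4, max_length=100):
    """Return sorted (start, end) pairs, 1-based inclusive, of perfect IRs."""
    sequence = sequence.upper()
    # cumulative[i] is the summed score of the first i bases.
    cumulative = [0]
    for base in sequence:
        cumulative.append(cumulative[-1] + SCORES[base])

    # Equal cumulative scores bracket a zero-sum segment.
    positions = defaultdict(list)
    for index, value in enumerate(cumulative):
        positions[value].append(index)

    # Collect zero-sum candidates and count how many share each centre.
    candidates = []
    seed_count = defaultdict(int)
    for group in positions.values():
        for a in range(len(group) - 1):
            for b in range(a + 1, len(group)):
                length = group[b] - group[a]
                if length > max_length:
                    break              # group is ascending, later ones are longer
                if length % 2 == 0:
                    centre = group[a] + length // 2
                    candidates.append((length, group[a], centre))
                    seed_count[centre] += 1

    # Longest first: a candidate of half-length k is a true IR only if all k
    # nested candidates around the same centre are present.
    results = []
    for length, start0, centre in sorted(candidates, reverse=True):
        if length >= min_length and length // 2 == seed_count[centre]:
            results.append((start0 + 1, start0 + length))
        seed_count[centre] -= 1
    return sorted(results)
```

Calling `find_perfect_irs("CATATATC")` returns `[(2, 5), (2, 7), (3, 6), (4, 7)]`, matching the four patterns listed for the proposed algorithm in Supplementary Table 1.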

Supplementary Figure 3: The average runtimes of our MATLAB program with different input sequence lengths. All tests were performed on an Ubuntu Linux server (1,400 MHz, 96 GB RAM).

Supplementary Table 1: Comparison of the perfect inverted repeats detected using tools provided by MATLAB (palindromes function), BioPHP and EMBOSS, and our proposed algorithm/tool. The comparison is based on selected test cases, which indicate the inability of the other tools to detect some simple perfect inverted repeat patterns. Position is the starting position and Sequence is the palindromic sequence; a dash marks a pattern that was not reported.

Input Sequence  MATLAB (palindromes)  BioPHP                EMBOSS                Proposed Algorithm
                Position  Sequence    Position  Sequence    Position  Sequence    Position  Sequence
CATATATC        2         ATATAT      2         ATATAT      2         ATATAT      2         ATAT
                4         ATAT        3         TATA        2         ATAT        2         ATATAT
                -         -           4         ATAT        4         ATAT        3         TATA
                -         -           -         -           -         -           4         ATAT
AAATTTATA       1         AAATTT      1         AAATTT      1         AAATTT      1         AAATTT
                6         TATA        2         AATT        6         TATA        2         AATT
                -         -           6         TATA        -         -           6         TATA
ATATATGCGC      1         ATATAT      1         ATATAT      1         ATATAT      1         ATAT
                3         ATAT        2         TATA        1         ATAT        1         ATATAT
                7         GCGC        3         ATAT        3         ATAT        2         TATA
                -         -           7         GCGC        7         GCGC        3         ATAT
                -         -           -         -           -         -           7         GCGC
