
Bioinformatics Tools for Finding the Vocabularies of Genomes

A thesis presented to

the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment

of the requirements for the degree

Master of Science

Eric D.C. Petri

August 2008

This thesis titled

Bioinformatics Tools for Finding the Vocabularies of Genomes

by

ERIC D.C. PETRI

has been approved for

the School of Electrical Engineering and Computer Science

and the Russ College of Engineering and Technology by

Lonnie R. Welch

Professor of Electrical Engineering and Computer Science

Dennis Irwin

Dean, Russ College of Engineering and Technology

ABSTRACT

PETRI, ERIC D.C., M.S., August 2008, Computer Science

Bioinformatics Tools for Finding the Vocabularies of Genomes (96 pp.)

Director of Thesis: Lonnie R. Welch

More organisms are having their genomes sequenced recently than in the past,

thus creating a greater demand from the biological community to better understand the

exact biological mechanisms which are encoded within the genomic blueprint of each

organism. While biologists continue to analyze genomes and to identify new functional

elements within organisms, there remain several regions of the genomes which are often

overlooked, such as non-protein encoding regions, , and intergenic regions.

Several bioinformatics algorithms exist to discover functional elements (which are also

referenced within as words) in these regions.

In this thesis, a functional genomics toolkit for finding functional words of

genomes (vocabularies) is presented and described. With currently available vocabulary

based tools, limitations arise when analyzing large input sequences. To overcome this

limitation, a scalable word searching approach is presented and tested with genomic

sequences with file sizes up to 2 Gigabytes (GB). In addition, the toolkit is utilized to

provide a genome-wide characterization of the Arabidopsis thaliana genome in terms of over- and under-represented repeats within specific genome regions and to search for similarities between putative functional elements in the human genome and Arabidopsis thaliana, thereby producing a putative vocabulary. The difficulties encountered during the research process and suggestions for future work are also discussed.


Approved: ______

Lonnie R. Welch

Professor of Electrical Engineering and Computer Science

To my wife, Andrea, and son, Noah, for their

unending love and support.


ACKNOWLEDGMENTS

I would like to thank Dr. Lonnie Welch for introducing me to a new application of

Computer Science in the area of Bioinformatics and for his continued guidance and encouragement through my graduate career and the completion of this thesis. I would also like to extend a thank you to Dr. Klaus Ecker, Dr. Frank Drews, and Dr. Sarah Wyatt for participating in the ongoing Bioinformatics research projects and for serving as members on my thesis committee.

Thank you also to past and present members of the Bioinformatics research group including Dr. Dazhang Gu, Jens Lichtenberg, Joshua Welch, Mohit Alam, Chase Nelson,

Kyle Kurz, Josiah Seaman, Kaiyu Shen, and Xiaoyu Liang for all of our collaboration in developing and extending the WordSeeker Functional Genomics Toolkit. In the initial deployment of WordSeeker, Dr. Gu implemented the word searching algorithm and word selection component, Joshua and Mohit implemented the word scoring component in addition to the Markov modeling, and Chase Nelson provided the early testing of

WordSeeker on actual biological data. With the contribution of this thesis, a scalable approach has been incorporated into WordSeeker to allow the same word discovery analysis on much larger input sequences.

I would also like to thank the Science and Technology Enrichment of Appalachia

Middle-schools (STEAM) project and the principal investigators, Dr. Chang Liu, Dr.

David Chelberg, and Dr. Teresa Franklin, for the opportunity to develop educational games to support Science comprehension and work with area middle-schoolers to promote furthering one’s education. I would also like to thank the other STEAM

Fellows and my partner teacher Mrs. Rebecca Hartline for their partnership and collaboration during my two years of involvement with the STEAM project.

I would also like to thank Ohio University and the School of Electrical Engineering and

Computer Science (EECS) for fostering my undergraduate and graduate career with challenging applications of Computer Science, in-depth problem solving concepts, and for increasing my technology skill set, specifically in the areas of programming and software design. Over the past six years, I have enjoyed all of the experiences I have shared with the EECS faculty and administrative staff.

Thank you also to my parents, Carol and Dennis H. Petri, for raising and instilling within me hard working values and determination. Thank you also to my brother Dennis B.

Petri for his support. And most of all, thank you to my wonderful wife, Andrea Petri, my son, Noah Petri, and furry son, Carter Petri, for their love and understanding during the research and writing completion of this thesis. Without their involvement, I would not be able to achieve all that I am capable of within my life.


TABLE OF CONTENTS

Page

Abstract……………………………………………………………………...…………….3

Dedication……………………………………………………………………...………….5

Acknowledgments……………………………………………………………………..….6

List of Tables…………………………………………………………………………….11

List of Figures……………………………………………………………………..……..12

Chapter 1: Introduction……………………………………………………………..……14

1.1 Background…………………………………………………………..……....14

1.2 Problem Statement…………………………………………………..…….....16

1.3 Overview of Thesis……………………………………………………..……17

Chapter 2: Discovery of Functional Elements in Bioinformatics…………………..……19

2.1 Gene Discovery………………………………………………………..……..20

2.2 Promoter Discovery…………………………………………………….……24

2.3 Cis-regulatory Element Discovery………………...……..……………..……27

2.4 Vocabulary Discovery of Functional Elements……………………………...30

Chapter 3: WordSeeker Functional Genomics Toolkit…………………………….…….34

3.1 Motivation……………………………………………………………………34

3.2 Overview of WordSeeker Functional Genomics Toolkit…………………....34

3.3 Word Searching……………………………………………………….…..…36

3.3.1 SignatureSeeker…………………………………………….…...…36

3.3.2 TEIRESIAS…………………………………………………….…..37

3.3.3 Suffix Trees……………………………………………………...…41

3.4 Word Scoring……………………………………………………………...…48

3.5 Word Selection…………………………………………………………..…...51

Chapter 4: Characterization of Arabidopsis thaliana Genome…………………….….…53

4.1 Description of Arabidopsis thaliana Genome……………………………….53

4.2 Vocabulary Generation of Genome Regions…………………………...……57

4.2.1 Experiment Methodology………………………………………….57

4.2.2 Vocabulary of 5' UTR………………………………………..….…59

4.2.3 Vocabulary of Coding Region……………………………….….…61

4.2.4 Vocabulary of 3' UTR…………………………………………...…62

4.2.5 Vocabulary of Intergenic Region………………………………..…63

4.2.6 Vocabulary of Introns …………………………………………. …65

4.3 Vocabulary Significance Analysis…………………………………………...66

4.3.1 Comparison with Known Transcription Factor Binding Sites……..66

Chapter 5: Scalable Word Searching Approach..………………………………………..68

5.1 Scalable Approach..………………………………………………………….68

5.2 Input Fragmentation………………………………………………………….70

5.3 Validation of Scalable Approach………………….…………………………73

5.4 Methodology Qualifications…………………………………………………74

Chapter 6: Survey of Human Pyknons within Arabidopsis...... 76

6.1 Motivation……………………………………………………………………76

6.2 Experiment Methodology……………………………………………………78

6.3 Results……………………………………………………..…………………79

Chapter 7: Conclusions…………………………………………………..………………85

7.1 Challenges…………………………………………………………..……..…85

7.2 Summary of Results and Findings…………………………………………...85

7.3 Suggestions for Future Work………………………………………………...87

References………………………………………………………………………………..89


LIST OF TABLES

Page

Table 4.1: Details of the Arabidopsis thaliana Genome………………………….……...54

Table 4.2: File Sizes of Arabidopsis thaliana Chromosomes and……………………….56

Genome Regions

Table 4.3: IUPAC Nucleotide Base Alphabet…………………………………………...57

Table 5.1: Larger Genomic Test Sets…………………….………………….…………..74

Table 6.1: Pyknon Matches with High Similarity Discovered in ……………………….80

Arabidopsis Genome

Table 6.2: Pyknon Matches with High Similarity Discovered within…………………..81

Specific Regions of the Arabidopsis thaliana Genome


LIST OF FIGURES

Page

Figure 2.1: Removal of introns by splicing of exons from a DNA sequence……………21

Figure 3.1: Overview of WordSeeker execution………………………………………...36

Figure 3.2: Execution of SignatureSeeker using a sliding window and ………………...37

oligonucleotide frequency table.

Figure 3.3: Demonstration of the convolution phase within TEIRESIAS……………….40

Figure 3.4: Suffix tree representation of the sequence TTCAGCAT……………………42

Figure 3.5: Example of maximal, supermaximal, and near-supermaximal ……………..44

repeats.

Figure 3.6: Pseudocode for finding maximal repeats in a suffix tree……………………45

Figure 3.7: Example calculations of zero, first and second order Markov chains……….50

Figure 4.1: Picture of Arabidopsis thaliana …...………………………………………………….54

Figure 4.2: DNA sequence representation before and after transcription……………….56

Figure 4.3: Experimental procedure for the vocabulary generation of ………………….59

Arabidopsis thaliana genome regions.

Figure 4.4: Vocabulary of all words and the top 5% in the 5’ UTR of Arabidopsis…….60

Figure 4.5: Vocabulary of all words and the top 5% in the CDS of Arabidopsis……...... 62

Figure 4.6: Vocabulary of all words and the top 5% in the 3’ UTR of Arabidopsis…….63

Figure 4.7: Vocabulary of all words and the top 5% in the intergenic region of …...... 65

Arabidopsis.

Figure 4.8: Vocabulary of all words and the top 5% in the introns of Arabidopsis.…….66

Figure 4.9: Arabidopsis genome region vocabularies with instances of known TFBS….67


Figure 5.1: Detailed program execution of the scalable word searching approach.……..70

Figure 5.2: Execution of the input fragmentation technique…………………………….72

Figure 6.1: Experimental procedure for the discovery of human pyknons in…………...79

Arabidopsis thaliana.

Figure 6.2: Pyknon instances within six regions of the Arabidopsis genome………...…82

Figure 6.3: Number of human pyknon instances within the 5’ UTR, 3’ UTR…………..83

and coding regions of six animal genomes in addition to

Arabidopsis (Mouse, Rat, Dog, Chicken, Fruit Fly, Worm, ATH % -

Arabidopsis Similarity Percentage).

Figure 6.4: Percentage of 5’ UTR, 3’ UTR and coding region sequences with pyknon...84

instances for each similarity setting.


CHAPTER 1: INTRODUCTION

1.1 Background

In recent decades, an increasing number of organisms have had, and continue to have,

their genomes sequenced. Further studies are being conducted to help fully understand

the underlying framework, or blueprint, of the genomes and to discover which parts of

the framework are responsible for or are correlated with certain biological functions [1-

3]. The human (Homo sapiens) is one organism whose entire genome has recently been

sequenced with the completion of the Human Genome Project (HGP). HGP was

launched in 1990 by the United States Department of Energy (DoE) and National

Institutes of Health (NIH) to develop a deeper understanding of the human genome

beyond discovering just the nucleotide composition [1]. The HGP also set out to identify

the approximately 20,000 to 25,000 genes in the human genome, to improve DNA

sequencing technology, to create publicly accessible databases that contain the HGP’s

findings, and to address the ethical, legal, and social concerns encountered in regard to the details of the HGP [1]. The project concluded thirteen years later with the final copy of the human genome being published in 2003 [1].

Discovering regulatory elements within model organisms, such as Mus musculus (mouse)

and Escherichia coli (E. coli), has been a valuable approach for the discovery of

functional elements in non-model organisms based upon evolutionary conservation in a

method termed comparative genomics [4]. Comparative genomics has successfully identified genomic regions that are evolutionarily conserved across similar species, but it leaves much unknown about exactly which nucleotide bases are conserved and offers no explanation as to why certain genomic regions were conserved versus others [2].

Functional elements, referenced here as words or biological words, present in genomes are usually found within non-protein coding sequences, where the vast majority of these functional elements still remain undiscovered [4]. Similarly, the human genome consists primarily of non-protein coding regions, with protein coding regions making up only approximately 1.5% [5].

Much of the research being conducted within the biological community focuses on the discovery of novel regulatory elements within the genomes of organisms, predominantly in model organisms such as Arabidopsis thaliana and . The discovery of novel functional elements in model organisms has helped to identify similar functional elements in other organisms and to relate biological functions to areas that were once unknown. A large portion of this type of research is being conducted for the human genome by the ENCyclopedia Of DNA Elements (ENCODE) project.

ENCODE has discovered that the human genome framework is much more complex than first thought, and that the regulatory elements in the genome can be expected to occur in regions other than just genes and proteins [5]. Surprisingly, the pilot phase of the

ENCODE project discovered functional elements that show no evolutionary conservation, which suggests that researchers need to view genomic analyses from additional perspectives [2]. ENCODE’s findings will lead to a better understanding of the underlying framework of complex genomes such as the human genome and an improved knowledge of the exact causes for genetic mutations and diseases.

1.2 Problem Statement

Presently, Bioinformatics algorithms exist that attempt to discover all functional elements within a genomic sequence based on the concept of creating a vocabulary of words. These algorithms are not an absolute means of finding every functional element, but rather aim to give biologists insight into which biological words may be functional amongst vast amounts of genomic data. Existing vocabulary based tools [6-8] are unable to analyze large complex genomes because their memory consumption becomes excessive, which in turn leads to longer running times. A need exists for a vocabulary based tool that can analyze large genomes. To address the memory limitations of the current vocabulary based algorithms, a scalable word searching approach is presented.

A biological word is considered a statistically over- or under-represented repeat or pattern discovered by a word searching algorithm in the genomic sequence currently being analyzed. A vocabulary is considered to contain the set of all discovered putative biological words within the genomic sequence. For example, when analyzing an entire genome, the word searching output is considered the biological vocabulary of the genome. Once a vocabulary is created, observations and statistical analyses can be conducted.
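
The definition above can be made concrete with a small sketch. It is not the WordSeeker implementation: it assumes fixed-length words, a zero-order background model (independent bases), and a simple observed/expected ratio as the measure of representation. The function names `count_words`, `expected_count`, and `representation_ratio` are hypothetical.

```python
from collections import Counter
from math import prod


def count_words(sequence: str, k: int) -> Counter:
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def expected_count(word: str, sequence: str) -> float:
    """Expected occurrences of word if bases were independent (zero-order model)."""
    base_freq = {b: sequence.count(b) / len(sequence) for b in set(sequence)}
    positions = len(sequence) - len(word) + 1
    return positions * prod(base_freq.get(b, 0.0) for b in word)


def representation_ratio(sequence: str, k: int) -> dict:
    """Observed/expected ratio per word: above 1 suggests over-, below 1 under-representation."""
    observed = count_words(sequence, k)
    return {w: n / expected_count(w, sequence) for w, n in observed.items()}


# The "vocabulary" here is simply every observed word with its ratio.
vocabulary = representation_ratio("ACGTACGTACGTTTTT", 4)
```

A real analysis would replace the ratio with a proper statistical score and would handle words of varying length, but the sketch shows how a vocabulary pairs each discovered word with a measure of how unexpected its frequency is.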

From the large set of putative biological words, a standard procedure with strict criteria must be devised to prune the large number of possible significant repeats or patterns to a subset which are most likely to be of biological value. For instance, there may be a substring present within longer words which is more biologically important than the larger word itself. Therefore, each word’s nucleotide composition must be examined as well as its frequency and relative positions. In addition to the successful construction of a

vocabulary, further research focuses on identifying biological phrases and the discovery

of the underlying biological grammar.

While observing the current limitations, this research’s aim and contributions are two-

fold. The first aim is to address the memory consumption restrictions of current

vocabulary based approaches by describing a scalable word searching approach and a

functional genomics toolkit which allows biologists to analyze large amounts of genomic

data. The second aim is to validate the scalable approach by testing the approach with

large genomic sequences and providing a genome-wide characterization of the

Arabidopsis thaliana genome.

1.3 Overview of Thesis

This thesis is structured into seven chapters. Chapter one gives an introduction into the

importance of Bioinformatics research, a description of the research within, and the

relevance of this research to present day Bioinformatics research. Chapter two discusses

prevalent word searching problems that have been encountered in the field of

Bioinformatics and solutions which have been developed to solve them. Chapter three

describes the WordSeeker Functional Genomics Toolkit developed at Ohio University

which aids in the discovery of novel functional elements. Within chapter three, each

major component of the genomics toolkit is discussed in further detail. Chapter four

focuses on the utilization of specific components from the WordSeeker software suite to

aid in a characterization of functional elements within the Arabidopsis thaliana genome.

Chapter five describes a scalable word searching approach that is compatible with multiple word searching algorithms. Chapter six describes an investigation of discovering putative human functional patterns, termed pyknons, within Arabidopsis thaliana. The final chapter presents a summary of the experimental results and findings, challenges encountered, and comments for future work in the area of functional element discovery in genomics and algorithm enhancements to the scalable word searching approach.

CHAPTER 2: DISCOVERY OF FUNCTIONAL ELEMENTS IN

BIOINFORMATICS

Traditionally, the stages of discovering functional elements within an organism have primarily been separated into four classes. The first class of discovery focuses on identifying the genes in relation to their starting and ending positions. The next class builds upon the gene discovery class and aims to discover promoter regions which regulate gene expression levels [9]. Within promoter regions, there exist other functional elements such as binding sites for enhancer, silencer, and transcription factor proteins

which also affect gene expression levels [9]. The third class seeks to identify these other

functional elements, called cis-regulatory elements, located within promoter regions.

The final class, which closely follows the third class, searches for the underlying

vocabulary, or set of all functional elements, within an organism. Many of the new

discoveries made in recent published literature focus on the latter two classes of

discovering novel cis-regulatory elements and vocabulary identification of an organism

[10].

With larger genomic sequences beginning to be analyzed, vocabulary generation with

current algorithms has demonstrated a weakness in scalability resulting in longer running

times. This thesis aims to address the scalability limitation faced by current vocabulary

generation algorithms by describing the approach used within the WordSeeker Functional

Genomics Toolkit. The following overview of these functional element classes is

focused primarily on the examination and discovery of regulatory elements in eukaryotic

organisms and identifying current trends in their respective areas.

2.1 Gene Discovery

At the conclusion of a genome sequencing project, the specific nucleotide composition of

the genome and their respective positions is all that is known. The discovery of genes is

the first step in identifying locations of functional importance within a genome. A gene

is a segment of DNA which inherently specifies an inheritable trait [11]. Although most

genes code for proteins, there exist DNA sequences which code for specific functional

RNA molecules [11].

In the beginning of gene discovery, time consuming and costly biological experiments were required to determine the starting and ending positions of genes and the number of genes present within a specific genome. Although several methods exist to identify or predict genes within genomes, the primary biological structure of genes only adds to the complication of successfully identifying genes, especially in eukaryotes. Eukaryotic genes are not a continuous sequence of nucleotides and are not found immediately adjacent to one another. These types of genes are found sporadically throughout a genome, separated by intergenic regions, and consist of interleaved exons (coding sequences) and introns (non-coding sequences) [12]. During post-transcription stages, introns of a gene are removed and the exons are assembled together using

different splicing mechanisms, such as alternative splicing, during the formation of the

messenger RNA (mRNA) [12]. Figure 2.1 illustrates the removal of introns from a gene

sequence during post-transcription.

[Figure 2.1 image: a row of boxes labeled Exon, Intron, Exon above a row labeled Exon, Exon, Exon]

Figure 2.1: Removal of introns by splicing of exons from a DNA sequence.
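
The splicing step illustrated in Figure 2.1 can be sketched in a few lines. This is a simplification under stated assumptions: exon boundaries are already known and given as 0-based, end-exclusive intervals, the sequence is kept in the DNA alphabet as in the figure, and alternative splicing is ignored. The function name `splice` and the toy sequence are hypothetical.

```python
def splice(gene: str, exons: list) -> str:
    """Concatenate exon intervals (0-based, end-exclusive), dropping the introns between them."""
    return "".join(gene[start:end] for start, end in sorted(exons))


# Toy gene: exon, intron, exon (sequences chosen only for illustration).
gene = "ATGGCC" + "GTAAGT" + "TTCTAA"
mature = splice(gene, [(0, 6), (12, 18)])
```

The hard part of gene discovery is finding the exon intervals in the first place; once they are known, assembling the spliced sequence is trivial, as the sketch shows.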

To add to the gene structure complexity, instances of overlapping genes have been discovered where genes overlap on the same or opposite strands and within the introns of another gene [12]. Pseudogenes are another aspect to consider when searching for functional genes. Pseudogenes are created as a copy of a functional gene, but through mutations, mostly deletions, the pseudogene becomes non-functional [13]. Another type of pseudogene, the processed pseudogene, also originates from a copy of a functional gene but does not contain introns or a promoter [13]. A better understanding of gene structure makes it more feasible to develop algorithms that achieve higher quality gene prediction, although it has been shown that gene prediction accuracy decreases with larger DNA sequences [14].

Gene discovery began with evidence based algorithms using homology methods, which use known gene, protein, and RNA sequences from large genomic databases as the foundation for identifying functionally enriched areas in unknown genomic sequences, since these types of sequences strongly suggest protein coding regions [13]. The underlying candidate genomic sequences of known proteins and mRNA sequences can be derived through re-translation and re-transcription and then searched for within the genome being examined [15]. However, this approach requires a large set of gene, protein, and RNA sequences to be previously known and sequenced, which is very costly, and not all sequences are known, especially in more complex genomes.

Since there exists a strong dependency on databases of known sequences, accuracy and maintenance of these databases raises concerns about the quality of the sequences they contain [14]. By applying computer science techniques and mathematical models to biology, methodologies and algorithms are constantly being refined to increase prediction quality for genes and other regulatory elements.

There are two different types of sensors used for locating genes. The first type of sensor is known as a content sensor which seeks to categorize a DNA sequence into regions.

The second type of sensor is a signal sensor, which aims to discover functional elements for a certain gene [14]. Content sensors use external databases to test for similarity and use mathematical models to predict and identify the characteristics that mark a segment of DNA as a coding region, intergenic region, promoter, exon, or intron [14]. Testing for the similarity of a set of sequences is achieved by sequence alignment techniques such as the Smith-Waterman algorithm and heuristic approaches such as the Basic Local Alignment Search Tool (BLAST) [14].

Similarities with known sequences such as protein sequences, transcripts, intra-genome and inter-genome DNA sequences can help achieve exact positions for exons and introns

[14]. Conservation of these known genes within genetically similar and evolutionarily close organisms is dependent upon the theory that genetic mutations occur at a faster rate within areas of non-functional elements than in areas of functional elements such as genes [13].
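
As a rough illustration of the alignment-based similarity test mentioned above, the following sketch computes a Smith-Waterman local alignment score. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for the example, not parameters taken from any of the cited tools, and only the score is returned rather than the alignment itself. BLAST, being a heuristic, is not shown.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b using a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

Keeping only two rows of the dynamic programming table holds memory linear in the length of `b`, at the cost of not being able to trace back the aligned region.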

To predict and categorize a DNA region, specific characteristics such as GC rich regions in exons, hexamer frequency, and codon composition aid to identify coding sequences

[14]. Hexamer frequency has been shown to be a strong property to differentiate between coding and non-coding regions and is a main technique used in SORFIND, Genview2,

MZEF, and GeneParser [14]. Markov models and neural networks are also widely used within the mathematical and computational models of gene prediction algorithms [14].

Commonly, larger Markov model orders are used to discover intergenic regions and introns and three Markov models are used for coding regions, one model for each codon element [14]. Adaptations to Markov models within algorithms, such as interpolated

Markov models and hidden Markov models, seek to increase the characterization of genomic regions [14]. To increase gene prediction algorithm accuracy, the model used for coding regions is different from the model used for non-coding regions and some algorithms go even further by creating separate models for different non-coding regions

[14].
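
The idea of using separate models for coding and non-coding regions can be sketched as a log-likelihood ratio between two Markov models. This is a simplified illustration, not the method of any tool named above: it uses a single fixed-order model per class (rather than one model per codon element), add-one smoothing, and hypothetical function names.

```python
from collections import defaultdict
from math import log

ALPHABET = "ACGT"


def train_markov(sequences, order):
    """Estimate P(next base | previous `order` bases) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def probability(context, base):
        seen = counts.get(context, {})
        return (seen.get(base, 0) + 1) / (sum(seen.values()) + len(ALPHABET))

    return probability


def log_likelihood(seq, probability, order):
    """Sum of log transition probabilities (the first `order` bases are skipped)."""
    return sum(log(probability(seq[i - order:i], seq[i])) for i in range(order, len(seq)))


def coding_score(seq, coding_model, noncoding_model, order):
    """Positive score: the coding model explains the sequence better."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))
```

In use, each model would be trained on annotated examples of its class and `coding_score` evaluated over a sliding window of the sequence under study.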

With the aid of high performance computing environments and statistical measures, gene prediction has become a viable approach for the discovery of genes, because funds for running these types of biological experiments are often lacking. In eukaryotes, gene prediction algorithms incorporate signal sensors to identify signals and statistical properties of protein coding sequences in genomic DNA sequences, such as polyadenylation sites, transcription factor binding sites, TATA and CAAT boxes, splicing, and start and stop sites for transcription, which may suggest that a gene is located nearby [14-15]. These algorithms rely on successfully identifying not only the starting and ending locations of genes and their structures but also the precise exon and intron separations [15].

Splicing site prediction algorithms like SPLICEVIEW and SplicePredictor exist solely to identify the exact exon positions of a genomic sequence using position weight matrices

(PWM), core sequences from multiple alignments of known regulatory functions, and

Markov models [14]. Algorithms which utilize PWM rely on previously known motif sequences and presume independence between adjacent nucleotides, whereas Markov models assume a dependency between adjacent nucleotides [16]. As a result of knowing the precise exon and intron separations, the amino acid sequence of the area of interest can be extracted and the resulting protein of the gene can be identified [15]. GeneWise and Procrustes are two gene prediction tools that use this idea to construct a gene sequence which encodes for the exact or highly similar amino acid sequence [15]. Some tools combine content and signal sensors into one tool in an effort to increase gene prediction quality [14]. Once a result has been reached, gene predictions must be verified to determine whether the putative gene is correlated with a biological function.
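
The independence assumption behind position weight matrices is visible in a minimal sketch: the score of a window is a plain sum of per-column values. The example below is illustrative only. It assumes equal-length aligned sites over the alphabet ACGT, a uniform background of 0.25, and a pseudocount of 1; the function names are hypothetical and do not come from SPLICEVIEW or SplicePredictor.

```python
from math import log2


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(sites) + 4 * pseudocount
        pwm.append({b: log2(((column.count(b) + pseudocount) / total) / background)
                    for b in "ACGT"})
    return pwm


def score_window(pwm, window):
    """Sum over positions: each column contributes independently of its neighbors."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, offset) of the highest scoring window in the sequence."""
    width = len(pwm)
    return max((score_window(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))
```

A Markov model would instead condition each base on its predecessors, so the score of a position would depend on the bases next to it.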

2.2 Promoter Discovery

To provide an improved knowledge of gene function and regulation, further study needs to be conducted to successfully recognize promoters of genes and transcription factor

binding sites, or motifs, contained within a promoter which is responsible for the

initiation of transcription and the level of gene expression [17]. A promoter is a segment

of DNA usually located in the core promoter region, which is the minimal amount of the

promoter needed to properly initiate transcription of a gene and contains motifs which represent the transcription start site (TSS) [17]. Downstream of the TSS, +1, is the starting position of the region which is transcribed into mRNA. Consequently, each promoter is associated with at least one gene coding region which is later translated into a protein [16].

Similar to gene prediction, some promoter identification methods use comparative genomics and homology methods to identify common functional elements shared between two organisms. However, using comparative genomics will not discover specific functional elements which are related only to a subset of organisms [17]. This leaves a need for the development of accurate promoter prediction algorithms to identify promoter regions and TSS within genomic sequences. Many of the promoter prediction algorithms rely on identifying structural and statistical properties only shared amongst promoters to locate TSS [17]. These algorithms search for properties such as known transcription factor binding sites (TFBS), CpG islands, high numbers of putative TFBS within a region, and the use of comparative genomics and other statistical properties

shared between proximal and core promoter regions [17]. The proximal promoter region is located upstream of the coding sequence and contains primary regulatory elements such as TFBS [11]. A constraint of using known TFBS to identify putative promoter regions and motifs is that these short TFBS DNA sequences also occur by chance in regions of the genome other than promoter regions [16].
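
One of the properties listed above, the CpG island, is commonly detected with simple sequence statistics. The sketch below is an assumption-laden illustration, not taken from the thesis: it uses the widely cited thresholds of a window of at least 200 bases, GC content above 50%, and an observed/expected CpG ratio above 0.6, and the function names are hypothetical.

```python
def cpg_stats(window: str):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c, g = window.count("C"), window.count("G")
    cpg = window.count("CG")  # occurrences of the CG dinucleotide cannot overlap
    gc_fraction = (c + g) / n if n else 0.0
    ratio = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(window: str, min_length: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed length, GC content, and CpG ratio thresholds to a window."""
    gc_fraction, ratio = cpg_stats(window)
    return len(window) >= min_length and gc_fraction > min_gc and ratio > min_ratio
```

A promoter predictor would slide such a window along the genome and treat qualifying windows as one line of evidence among several.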

Promoter prediction algorithms use computer science methods and mathematical models

such as position weight matrices (PWM), Markov models, neural networks, discriminant

analysis and Relevance Vector Machines (RVM) [17]. Neural networks aim to represent

interactions of neurons in the brain and involve a large set of small interconnected units

which each contain specific data [13]. These interconnections have associated weights

and the individual nodes only act upon the input received and the specific data they

contain [13]. Neural networks have the advantage of discovering degenerate patterns in

motifs and adaptations to neural networks have been attempted to increase promoter

prediction quality [16]. Discriminant analysis uses known predictors or standards of a

subject, promoter signals in this case, to label a set of input as either highly similar or

highly different to the standards set [16]. Relevance Vector Machines is a machine

learning technique that embodies a Bayesian analysis which determines an expected

frequency probability of a pattern by observing the distribution from a background

sequence [13, 18]. The NNPP 2.2 algorithm takes a different approach to promoter

prediction by incorporating the distance between the transcription start site and the

translation start site of known coding sequences [16].

Current published promoter prediction algorithms have not accurately solved the

promoter prediction objectives set forth [19]. These promoter prediction algorithms

generate several false positives and false negatives at the outcome of an investigation [16]. Burden et al. [16] attribute the limitations of these promoter prediction algorithms to the limited knowledge of the transcription process and of the interactions between proteins and DNA that is available to researchers and algorithm implementers. Bajic et al.

[19] evaluated eight promoter prediction algorithms on a large portion of the human genome and provided standard measures of assessing the quality of current and new promoter prediction algorithms.

Bajic et al. [19] also describe a few limitations and open problems of current promoter prediction algorithms. The first limitation lies in the structural properties used for promoter identification. These properties are used by the most accurate current promoter prediction algorithms, yet they cannot accurately identify transcription start sites [19]. The second limitation is the extensive computational time spent discovering promoters that lack CpG islands, which make up a large portion of promoters in some eukaryotes such as humans [19]. The inability of prediction algorithms to combine high sensitivity (true positives) with a low number of promoter predictions is another limitation of current prediction methods [19]. Even with the described limitations, Bajic et al. found that masking repeats with RepeatMasker significantly increased the precision of the algorithms [19].
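
The two quality measures mentioned above can be written down directly. The sketch below is a generic illustration, not the evaluation protocol of Bajic et al.: it assumes that a predicted TSS counts as correct when it lies within a tolerance of an annotated TSS, and the tolerance value and function name are hypothetical.

```python
def evaluate_predictions(predicted, annotated, tolerance=500):
    """Return (sensitivity, precision) for predicted versus annotated TSS positions."""
    # Predictions that fall near at least one annotated TSS.
    correct = sum(any(abs(p - t) <= tolerance for t in annotated) for p in predicted)
    # Annotated TSS that were recovered by at least one prediction.
    found = sum(any(abs(p - t) <= tolerance for p in predicted) for t in annotated)
    sensitivity = found / len(annotated) if annotated else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return sensitivity, precision
```

An algorithm that predicts a promoter almost everywhere reaches high sensitivity but low precision, which is the trade-off described in the limitation above.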

2.3 Cis-regulatory Element Discovery

Cis-regulatory elements, or motifs, are relatively short DNA sequences which play a vital role in the level of gene expression. They provide binding sites for transcription factors which can increase or decrease gene expression during transcription [10]. Cis-regulatory elements are usually found upstream of the TSS, in introns, and downstream of the

transcription stop site [20]. Since cis-regulatory elements are reasonably small

sequences, around 10 base pairs long, the successful identification of these small

sequences from a background of large genomic sequences is not a trivial search but rather a computational and statistical endeavor [10]. Most genes

of eukaryotic organisms do not have their transcription initiated by just one TFBS, but by

many groups, called cis-regulatory modules, which consist of multiple TFBS [21].

Motif discovery algorithms are usually divided into three classes of approaches:

enumeration, probabilistic with multiple sequence alignment, and deterministic [22].

Enumeration based approaches generally search for all possible candidate motifs which are found to be overrepresented and cover a specified alphabet with no variations [22].

Since there are plenty of known instances of TFBS with variations, enumerative approaches can be adapted to allow for a user specified number of variations within candidate sequences [22]. WEEDER [23] is an example of an enumerative algorithm which allows for user specified mismatches, or mutations, in a search for patterns within a set of input sequences. Because they search all of the possible candidate motifs that meet a given criterion, enumerative approaches do not suffer from the limitation of local optima [22].
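As a rough illustration of the enumerative strategy with mismatches, the following Python sketch scores every possible word of length k by the number of input sequences that contain it within a given number of mismatches. The function names, the Hamming-distance criterion, and the support threshold are assumptions made for this example. It is a brute-force scan and not the WEEDER algorithm, which relies on a suffix tree.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def enumerate_motifs(seqs, k, max_mismatches, min_support):
    """Brute-force enumeration over all 4**k candidate words (illustrative only)."""
    found = {}
    for candidate in map("".join, product("ACGT", repeat=k)):
        # Support = number of sequences with at least one approximate occurrence.
        support = sum(
            any(
                hamming(candidate, s[i:i + k]) <= max_mismatches
                for i in range(len(s) - k + 1)
            )
            for s in seqs
        )
        if support >= min_support:
            found[candidate] = support
    return found


# Hypothetical toy input, not data from this thesis.
seqs = ["ACGTAC", "TTACGA", "GGACTT"]
exact = enumerate_motifs(seqs, k=3, max_mismatches=0, min_support=2)
fuzzy = enumerate_motifs(seqs, k=3, max_mismatches=1, min_support=3)
print(exact)
print("ACG" in fuzzy)
```

Because every candidate is examined, the result does not depend on a starting point, which is the property that shields enumerative methods from local optima. The cost is a search space that grows as 4^k.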

Gibbs sampling is one method within the class of probabilistic and statistical methods for motif discovery. Gibbs sampling employs a strategy which avoids having to search the entire problem space by randomly sampling candidate sites and probabilistically updating the current motif model by deciding to add or remove sites [22]. Once an action has taken place, the binding sites are added or

removed based on their calculated binding probability and the probabilities for all

considered sites are recalculated [22]. Iteratively, Gibbs sampling will converge towards

the best motif models based on the joint probability between the current motif models and the sites under consideration [22].
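The sampling loop can be sketched in Python as follows. This is a minimal sketch of the classic site sampler, which keeps exactly one site per sequence and resamples it from a profile built from the other sequences. It is a simplification of the add-or-remove variant described above, and the function name, the pseudocount of 1, the iteration count, and the fixed seed are all assumptions of this example.

```python
import random


def gibbs_motif_search(seqs, k, iterations=200, seed=0):
    """One-site-per-sequence Gibbs sampler over the alphabet ACGT (illustrative)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from one randomly chosen site in every sequence.
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for held_out in range(len(seqs)):
            # Build a profile from the sites of all other sequences (pseudocount 1).
            counts = [{c: 1.0 for c in alphabet} for _ in range(k)]
            for j, s in enumerate(seqs):
                if j == held_out:
                    continue
                for offset, ch in enumerate(s[positions[j]:positions[j] + k]):
                    counts[offset][ch] += 1.0
            column_total = len(seqs) - 1 + len(alphabet)
            # Weight every candidate site in the held-out sequence by the profile.
            s = seqs[held_out]
            weights = []
            for start in range(len(s) - k + 1):
                w = 1.0
                for offset, ch in enumerate(s[start:start + k]):
                    w *= counts[offset][ch] / column_total
                weights.append(w)
            # Sample the new site in proportion to its weight.
            positions[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


# Hypothetical toy input in which ACGT is planted in every sequence.
seqs = ["TTACGTAA", "GGACGTCC", "CACGTTTG"]
positions = gibbs_motif_search(seqs, k=4)
print([s[p:p + 4] for s, p in zip(seqs, positions)])
```

Sampling a site rather than always taking the best-scoring one is what lets the procedure leave a poor motif model. Since the result is stochastic, a run is not guaranteed to end on the planted word.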

An example of a common deterministic approach in a subset of motif discovery tools is

using some variant of an expectation-maximization algorithm. Expectation-

maximization creates a preliminary motif model of a certain length from an initial

position weight matrix and then iteratively searches through the target sequences to calculate the probability that other sequences of the same length were generated by the current model motif [22]. Between the calculations of the next site’s probability, the

algorithm attempts to refine the current motif model to see if any modifications need to

be made by considering the weighted average of the probabilities thus far [22]. The main

disadvantage of a general implementation of the expectation-maximization algorithm is

the risk of converging on a local optimum rather than the single optimal motif [22]. Tools such as MEME are based primarily on expectation-maximization, but add enhancements such as enumeration techniques to avoid local optima [22].
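One iteration of this scheme can be sketched in Python as below. The sketch assumes that each sequence contains exactly one motif occurrence and ignores any background model, so it is far simpler than MEME. The function name `em_step`, the pseudocount, and the starting matrix are assumptions of this example, and every entry of the starting matrix must be positive.

```python
def em_step(seqs, pwm, pseudocount=0.1):
    """One expectation-maximization iteration for a position weight matrix.

    pwm is a list of dicts, one per motif position, mapping A/C/G/T to a
    probability greater than zero.
    """
    k = len(pwm)
    alphabet = "ACGT"
    new_counts = [{c: pseudocount for c in alphabet} for _ in range(k)]
    posteriors = []
    for s in seqs:
        # E-step: probability that each window was generated by the current model.
        likelihoods = []
        for start in range(len(s) - k + 1):
            p = 1.0
            for offset in range(k):
                p *= pwm[offset][s[start + offset]]
            likelihoods.append(p)
        total = sum(likelihoods)
        posterior = [p / total for p in likelihoods]
        posteriors.append(posterior)
        # M-step (accumulation): add each window to the counts, weighted by its posterior.
        for start, weight in enumerate(posterior):
            for offset in range(k):
                new_counts[offset][s[start + offset]] += weight
    # M-step (normalization): turn weighted counts back into probabilities.
    new_pwm = []
    for column in new_counts:
        column_total = sum(column.values())
        new_pwm.append({c: v / column_total for c, v in column.items()})
    return new_pwm, posteriors


# Hypothetical toy input; the starting matrix is seeded from the word ACG.
seqs = ["ACGTACGT", "TTACGTTT"]
pwm = [{c: 0.7 if c == ch else 0.1 for c in "ACGT"} for ch in "ACG"]
new_pwm, posteriors = em_step(seqs, pwm)
print(round(new_pwm[0]["A"], 3))
```

Each iteration only refines the model it was given, which is why the outcome depends on the starting matrix and why a poor start can end in a local optimum.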

Biologists are often unsure of which motif prediction algorithms to use for specific

biological problems and the accuracy of the algorithms being used. Tompa et al. [10]

evaluated thirteen popular motif discovery tools. In this study, Tompa et al. discovered

that WEEDER outperformed the other tools in most test sets for reporting a single motif candidate [10]. Most of the motif discovery tools also performed considerably better on yeast data sets than on those of complex multi-cellular eukaryotes, which suggests a need for better understanding of TFBS identification in higher organisms [10]. In the end, Tompa et al. claim that motif discovery remains an open research problem and suggest using a combination of multiple motif discovery tools that employ different methods, then analyzing the top motif candidates from each tool to discover a set of top putative motifs [10].

2.4 Vocabulary Discovery of Functional Elements

Vocabulary and dictionary based techniques, which utilize language processing methods for discovering a set of regulatory elements and an underlying regulatory network, were first proposed in 2000 by Bussemaker et al. [24]. These ideas categorize regulatory elements as a dictionary of words and assume that a genome is a contiguous string of words that share an overrepresentation according to certain statistical properties [6].

Although vocabulary generation provides a search for multiple motifs through one initial scan rather than a single motif at a time, a few limitations and advantages are present when using this type of biological model for an algorithmic design.

Vocabulary generation assumes independence between words in a sequence, which on one hand disregards biological interactions between possibly related words, and on the other hand, provides a speedup in run time [6]. Also in this model, a pattern’s probability is calculated from past computations of previous patterns, which allows for putative motifs to occur at any possible position in a sequence [6]. Words with multiple spellings or variations are assumed to have the same length. Generally, algorithms that utilize a vocabulary based model generate precise putative words with no wildcard positions, while a few recent algorithms allow for overrepresented words with variations, or fuzzy words, to support TFBS reconstruction and multiple candidates for a specific binding site [6]. Algorithms such as Vocabulon [6] have been experimentally shown to be able to reconstruct and discover motifs both with and without prior knowledge of those motifs.

The ENCODE project, as mentioned in chapter one, aims to comb through the entire human genome and analyze the genome piece by piece to discover all functional elements [2]. While conducting this research, the ENCODE project has utilized previous techniques as well as developed innovative techniques to complete the research faster and with higher degrees of accuracy [2]. The ENCODE project is a public project where investigators and researchers from a variety of organizations collaborate, discuss, and report their findings [2]. Currently, the project is focusing on a 1% section of the human genome [2]. Before the researchers move on to the next section, the current section of the human genome must be declared fully analyzed and complete.

Closely related to the analysis of the human genome, the pyknons project at IBM Research has discovered a set of overrepresented patterns, termed pyknons, which occur abundantly in coding regions as well as non-coding regions [3]. The discovery of these patterns, which represent a form of a vocabulary, suggests a possible linkage, specifically in the human genome, between the coding regions and the non-coding regions, also known as junk DNA [3]. At this time, these repeats are believed to have as yet unknown functions and to be involved in certain biological processes [3]. The recent discovery of these pyknons has launched an exciting campaign of innovative research in an area of human genomics that was once purposely overlooked and discarded [25].

Since vocabulary based techniques generate large amounts of putative words during a genome wide analysis, scoring measures are used to rank the significance of the top scoring biological words. As a next step, most of the top words are compared against known regulatory elements of the genome and of similar organisms in the hopes of discovering novel functional elements among the set of top scoring words. Yeast Motif Finder (YMF) [7] and WordSpy [8] are two widely known Bioinformatics tools which utilize variations of a vocabulary based model such as exhaustive searching, word scoring, and steganalysis.

Vocabulary based models have been shown to be very effective as a new approach for discovering genome-wide motifs, even though the discovery of all genome-wide motifs still requires further research into a model that better captures dependent interactions between words [6]. Some of the vocabulary based tools [6-8] offer stand-alone versions, in addition to web interfaces, which allow researchers and biologists to use the tools on their own systems rather than through web interfaces that often enforce an input size limitation. However, a main limitation of current vocabulary based tools [6-8] is their inability to analyze, in a reasonable running time on a single or dual-core system, genomic sequences of up to 2GB or more, which are much larger than small genomes such as Escherichia coli and Saccharomyces cerevisiae.


CHAPTER 3: WORDSEEKER FUNCTIONAL GENOMICS TOOLKIT

3.1 Motivation

The biological function of most genomic data still remains vastly unknown and research

is continuously being conducted in an effort to piece together the underlying blueprints of

organisms from their available genomic information. Having a better understanding of

an organism’s genome structure can help with addressing problems that are potentially

encountered during genetic mutations. Also, it could further the field of genetic

engineering in the means of adapting an organism to meet a novel commercial or

environmental need. The WordSeeker Functional Genomics Toolkit was developed to

create a powerful and dynamic resource to aid biologists in their need to analyze their

genomic data and to support a better understanding of the underlying blueprints of

organisms. The WordSeeker software suite has been designed to identify biological

words within a sequence or set of sequences that are either over- or under-represented

and to rank the words according to their probability of containing a biological function.

3.2 Overview of WordSeeker Functional Genomics Toolkit

The WordSeeker Functional Genomics Toolkit combines the functionality of several

components into one motif discovery software suite. The user of WordSeeker is able to

submit their job for processing through the graphical user interface (GUI) which is stored

within the index.html file. The user’s request is then sent to a Common Gateway

Interface (CGI) script stored in server.pl where the request is interpreted by the system and program parameters are retrieved. The CGI script invokes the control shell which oversees the rest of the system’s execution.

Three main components make up the motif discovery functionality of the WordSeeker suite: word searching, word scoring, and word selection. The word searching component aims to identify putative biological words that satisfy the specific criteria set by the user.

The word scoring component takes the set of putative biological words from the word searching phase and ranks the words by using mathematical models based primarily on the observed number of occurrences of each word versus the expected number of occurrences. Lastly, in the word selection component, putative biological words are filtered for triviality and for conservation within presently known motifs. It is important to note that when words are filtered, the words are not removed from the list but rather are tagged with labels such as redundant, unique, trivial, or non-trivial.

The control shell creates a time stamped folder on the system to uniquely identify each handled request. In this folder, intermediate results from each component are stored as well as files for the current status of the request, a log file to record errors, and a parameter file which stores the parameters invoked by the user. After the request has been fulfilled by the system, the user can view the results page which displays a list of parameters and an accessible zip file containing intermediate and final result files. The following diagram, Figure 3.1, displays an overview of the execution of WordSeeker.


Figure 3.1: Overview of WordSeeker execution.

3.3 Word Searching

3.3.1 SignatureSeeker

SignatureSeeker is a separate module of the WordSeeker Functional Genomics Toolkit which is designed to identify genomic signatures by computing the oligonucleotide frequency of an input genomic sequence. SignatureSeeker uses an exhaustive search approach which spans the length of the input sequence. The oligonucleotide frequencies are computed using a sliding window of a specified length that advances incrementally through the input sequence and tracks the frequency of each word identified so far. Since an exhaustive search is used, the running time is dependent upon the length of the input genomic sequence. The time complexity of SignatureSeeker is then $O(n - l)$ where n represents the length of the input sequence and l represents the specified oligonucleotide length. Figure 3.2 shows the execution of the sliding window mechanism integrated into SignatureSeeker.

Input Sequence: a g t c g a g t c g c
Sliding Window = 4

Oligonucleotide Frequency List:
agtc: 2 (incremented from 1)
gtcg: 1
tcga: 1
cgag: 1
gagt: 1

Figure 3.2: Execution of SignatureSeeker using a sliding window and oligonucleotide frequency table.
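The sliding window mechanism can be sketched in a few lines of Python. This is a minimal sketch and not the SignatureSeeker implementation; the function name `oligo_frequencies` and the use of an in-memory counter are assumptions of this example.

```python
from collections import Counter


def oligo_frequencies(sequence, length):
    """Count every word of the given length with a window that advances one base at a time."""
    counts = Counter()
    for start in range(len(sequence) - length + 1):
        counts[sequence[start:start + length]] += 1
    return counts


# The input sequence and window length of Figure 3.2.
freqs = oligo_frequencies("agtcgagtcgc", 4)
for word, count in freqs.items():
    print(word, count)
```

The figure appears to show the window partway through the sequence, at the second occurrence of agtc. Run to completion, the sketch visits all 11 - 4 + 1 = 8 window positions, so gtcg also reaches a count of 2 and the final word tcgc is added.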

3.3.2 TEIRESIAS

The motivation behind the TEIRESIAS algorithm was to create a novel algorithm which generates all maximal patterns, as specific as possible, within a set of input sequences without searching the entire solution space [26]. The TEIRESIAS algorithm was developed by IBM Research and was used for the discovery of the pyknon data set [3] within the human genome. A pattern is considered a maximal pattern if it cannot be made more specific in terms of nucleotide composition and don’t care states [26]. Unlike TEIRESIAS, its predecessors focused on searching the entire solution space, generating all possible patterns, and then verifying each pattern to see if it was well ‘supported’ by occurring in a specific number of input sequences [26]. Often, algorithm performance is hindered considerably when searching the entire solution space unless that space is relatively small. Searching the entire solution space of all possible maximal patterns within an input genomic sequence has been shown to be an NP-hard problem [26].

The TEIRESIAS algorithm accepts three parameters: L, W, and K. The L parameter indicates the smallest number of literals that must be present in all maximal patterns [26].

The W parameter specifies the minimum length of all maximal patterns [26]. Parameter

K specifies the support that is needed for a maximal pattern to be reported [26]. The support of a pattern is defined as the number of input sequences that must contain the specified pattern. There are two main stages to the TEIRESIAS algorithm: the scanning phase and the convolution phase.

The aim of the scanning phase is to generate all elementary patterns within a set of sequences, using primarily L and W. Elementary patterns are later used to construct maximal patterns during the convolution phase. All elementary patterns are required to start and end with a literal and to have a length of at most W [26]. Wildcards may be present in any location in the elementary pattern other than the starting and ending positions, as long as L literals are contained within the elementary pattern [26]. For each elementary pattern, its support is tested and an offset list is created. The offset list of an elementary pattern contains an (x,y) pair for each occurrence of the pattern [26]. In the (x,y) pair, x represents the sequence index of the occurrence and y represents the position of the occurrence [26]. If the elementary pattern has sufficient support, the generated pattern is inserted into the elementary pattern list [26].
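The bookkeeping of the scanning phase can be illustrated with the Python sketch below, which checks whether a pattern qualifies as elementary and builds its offset list and support. It is not the TEIRESIAS implementation. The function names are hypothetical, indices are zero-based, the wildcard is written as a dot, and the check for exactly L literals is an assumption about how the literal requirement is read.

```python
def is_elementary(pattern, L, W):
    """Check the elementary-pattern constraints for a non-empty pattern."""
    literals = sum(ch != "." for ch in pattern)
    return (
        len(pattern) <= W
        and pattern[0] != "."   # must start with a literal
        and pattern[-1] != "."  # must end with a literal
        and literals == L       # assumption: exactly L literals
    )


def offset_list(pattern, seqs):
    """Return (x, y) pairs: sequence index x and start position y of each occurrence."""
    offsets = []
    for x, seq in enumerate(seqs):
        for y in range(len(seq) - len(pattern) + 1):
            window = seq[y:y + len(pattern)]
            if all(p == "." or p == c for p, c in zip(pattern, window)):
                offsets.append((x, y))
    return offsets


def support(offsets):
    """Number of distinct input sequences that contain the pattern."""
    return len({x for x, _ in offsets})


# Hypothetical toy input, not data from this thesis.
seqs = ["AGTCGT", "TTAGACGT", "CCCC"]
offsets = offset_list("AG.C", seqs)
print(offsets, support(offsets))
```

With a support threshold of K = 2, the pattern AG.C in this toy input would be kept, since it occurs in two of the three sequences.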

In the convolution phase, the elementary patterns from the scanning phase are pieced together to form the set of maximal patterns. The goals of the convolution phase are to

“generate all the patterns and manage to identify and discard quickly patterns that are not maximal [26].” As described in [26], these goals are achieved by the representation of the elementary patterns in a stack and the order in which elementary and intermediate patterns are convoluted. This is accomplished by taking each elementary pattern and attempting to extend it to the left and to the right using other elementary patterns [27].

In this phase, two sets are created for each elementary pattern and these two sets represent the convolution candidates that can be extended to the left and to the right. If the elementary pattern can be extended to the left, then the elementary pattern’s prefix is equal to the convolution candidates’ suffix [27]. To extend an elementary pattern to the right, the elementary pattern’s suffix must be equal to the convolution candidates’ prefix

[27]. Once a pattern has been extended as far as possible and is not contained within another maximal pattern, it is labeled as a maximal pattern and is reported [27]. Figure 3.3 below shows an example of the convolution phase.


Elementary Patterns: {AG..GT, CCGAT.T, GTCCG, TT.A..GC}

Convolution Phase:
AG..GT + GTCCG → AG..GTCCG
AG..GTCCG + CCGAT.T → AG..GTCCGAT.T (Maximal Pattern)

Figure 3.3: Demonstration of the convolution phase within TEIRESIAS.
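The merging step of Figure 3.3 can be mimicked with the short Python sketch below. It is a deliberate simplification: two patterns are joined on their longest matching suffix and prefix, whereas TEIRESIAS fixes the size of the overlap from the parameter L and also intersects the offset lists to verify support. The function name `convolve` is an assumption of this example.

```python
def convolve(left, right):
    """Join two patterns on the longest suffix of left that equals a prefix of right.

    Returns None when the patterns do not overlap. Offset lists and support
    checks are omitted, so this is not the full TEIRESIAS convolution.
    """
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# The elementary patterns of Figure 3.3.
step_one = convolve("AG..GT", "GTCCG")
step_two = convolve(step_one, "CCGAT.T")
print(step_one)
print(step_two)
print(convolve("GTCCG", "TT.A..GC"))
```

The first two calls reproduce the extensions to the right shown in the figure, and the third shows an elementary pattern that cannot be used as a convolution candidate for GTCCG.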

The TEIRESIAS algorithm allows the user to have a broader range of pattern structure by not enforcing a bounding constraint except for the exact number of don’t care states between nucleotides [26]. With respect to execution time, the running time of the TEIRESIAS algorithm has been shown to be strongly correlated with its output size [26]. This correlation can lead to long running times: as the output size grows exponentially based on the support parameter, so do the running times [26]. Wolinsky [27] addresses concerns with formal definitions and claims made in the original literature of the algorithm and offers further clarification.

It has been shown that TEIRESIAS [26] can discover patterns of arbitrary lengths only under certain conditions [27]. With current implementations of TEIRESIAS, patterns with flexible gaps of don’t care states, for example “A….G” and “A……….G”, are not convoluted together and could be discarded if the patterns do not meet the support threshold or are broken down into non-maximal patterns [27]. As discussed in [27], a parallelization approach that divides the sequences among a group of systems running the TEIRESIAS algorithm and then compiles the results may benefit performance. Wolinsky [27] also suggests that the TEIRESIAS algorithm, whose complexity stems from handling only maximal patterns throughout all phases, could possibly be outperformed by less complex approaches.

3.3.3 Suffix Trees

A suffix tree is a data structure that allows for the storage of all suffixes of a given string in a manner that is more efficient and easier to analyze than other comparable data structures [28]. The main advantage of using a suffix tree to perform operations on a given string is the speed of the operations. Suffix trees allow operations to run in linear time with respect to the length of the given string [28]. However, this speed comes at a cost: suffix trees require a significant amount of memory to store the tree for a given string [28]. Repeats that are found within suffix trees are classified as maximal, supermaximal, and near-supermaximal repeats.

A suffix tree is a rooted tree with the edges consisting of suffixes, or substrings, of a sequence [28]. A branching of a node occurs if the prefix of a suffix is already contained in the tree. The prefix of the suffix remains on the current edge and a new branching edge is created for each remaining suffix substring not contained by the prefix [28]. A path from the root node to a leaf node with the concatenation of all the edge labels represents a suffix of a sequence. If a sequence S has n suffixes, then a suffix tree T of S will contain n leaf nodes [28]. Suffix trees have the unique ability to store all suffixes in linear space by indexing the positions of the substrings rather than the actual suffixes.

Also, suffix trees can be constructed within linear time [28]. Before a suffix tree is constructed for a sequence, a character not in the alphabet ∑ of S is appended to the end of the sequence. This character is appended so that the last suffix of the sequence, which is its last character, does not end at an internal node that already exists in the suffix tree [28]. Figure 3.4 shows the basic suffix tree construction of the sequence TTCAGCAT.

Figure 3.4: Suffix tree representation of the sequence TTCAGCAT.
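The structure can be explored with the following Python sketch, which builds an uncompressed suffix trie for the sequence of Figure 3.4. It is a naive construction that takes quadratic time and space, not one of the linear-time suffix tree algorithms, and the nested-dictionary representation, the function names, and the `$` terminator are assumptions of this example.

```python
def build_suffix_trie(sequence, terminator="$"):
    """Insert every suffix of sequence + terminator into a nested-dict trie."""
    text = sequence + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node without children is a leaf; each leaf marks the end of one suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def occurrences(root, pattern):
    """Occurrences of pattern = number of leaves below the node that spells it."""
    node = root
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


trie = build_suffix_trie("TTCAGCAT")
print(count_leaves(trie))
print(occurrences(trie, "CA"), occurrences(trie, "T"))
```

The terminator guarantees that every suffix ends at its own leaf, so the number of leaves equals the number of suffixes of the terminated sequence. A true suffix tree compresses each chain of single-child nodes into one edge labeled by a pair of positions, which is how linear space is achieved.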

The first linear suffix tree construction algorithm was developed by Weiner [29] in 1973 with the main disadvantage of the algorithm being space inefficiency [28]. In 1976, McCreight [30] developed a more space efficient algorithm for suffix tree construction which was also linear time. The main disadvantage of McCreight’s algorithm was the required reverse processing of the input sequence [28]. Ukkonen [31] developed a variant of McCreight’s algorithm in 1995 which allowed for the suffix tree construction to begin from the start of the sequence rather than from the end [28]. To make linear time suffix tree construction possible, suffix links are needed to connect non-root internal nodes to other internal nodes [28]. The linkage between internal nodes is set up so that if the path from the root to a node has the label ab, where a represents a single character and b represents a string, then there exists a suffix link from that node to another internal node whose path label is b [28]. Since suffix trees are versatile data structures, they have a variety of applications within Bioinformatics and other research areas.

Suffix trees provide the means of solving difficult computational problems such as finding the longest common substring and the repeats shared between two strings [28]. Two strings can be represented within one suffix tree by appending to the end of each sequence a different character that is not contained within ∑ [23]. Maximal, near-supermaximal and supermaximal repeats of a sequence can be discovered in linear time by the use of suffix trees. Repeat finding is an important problem in Bioinformatics since regulatory elements are observed to be repeated elements in a genome [28]. Suffix trees also allow for the searching of maximal patterns of unknown lengths with and without mismatches, as described in the WEEDER algorithm [23]. Outside the area of Bioinformatics, data compression algorithms such as the Burrows-Wheeler Transform use suffix trees for the sorting step during compression [32].

A maximal repeat is a substring of a large string, S, where the maximal repeat will occur

in at least two different positions, p1 and p2 [28]. In the positions of the maximal repeat,

the immediate left character of p1 will differ from the immediate left character of p2 as

well as the immediate right character of p1 will differ from the immediate right character

of p2 [28]. As defined in [28], “a supermaximal repeat is a maximal repeat that never occurs as a substring of any other maximal repeat.” A near-supermaximal repeat is a maximal repeat which occurs at least once without being present as a substring in any other maximal repeat [28]. Therefore, the set of maximal repeats that are not supermaximal repeats is not equal to the set of near-supermaximal repeats. Figure 3.5 shows an example of maximal, supermaximal, and near-supermaximal repeats.

Let S = the input sequence
Let $R_0(S)$ = the set of all repeats in S
Let $R_1(S)$ = the set of all maximal repeats in S
Let $R_2(S)$ = the set of all near-supermaximal repeats in S
Let $R_3(S)$ = the set of all supermaximal repeats in S
Let $R_4(S)$ = the set of words $(R_2(S) \cup R_3(S))$

S = TTCAGCATTAGCTTC
$R_0(S)$ = [C, CA, A, AG, AGC, G, GC, T, TT, TTC]
$R_1(S)$ = [C, CA, A, AG, AGC, GC, T, TT, TTC]
$R_2(S)$ = [CA, AGC, TT, TTC]
$R_3(S)$ = [CA, AGC, TTC]
$R_4(S)$ = [CA, AGC, TT, TTC]

Figure 3.5: Example of maximal, supermaximal, and near-supermaximal repeats.

To search for maximal repeats in a suffix tree, specific properties can be used to efficiently identify which paths from the root node to a terminating node represent a maximal repeat. A terminating node, which is never a leaf node, is considered left-diverse only if there exist two leaves in the node’s subtree that have different left characters in the original sequence [28]. If a terminating node is found to be left-diverse, then the path label is considered a maximal repeat [28]. The property of a node being left-diverse propagates upwards to all ancestors of the terminating node [28]. Pseudocode for finding maximal repeats in a suffix tree is outlined in Algorithm 1 of Figure 3.6.

Algorithm: Find_Maximal_Repeats(T)
    all_maximals = ∅
    for every path in T do
        l = left-most leaf node of current path
        p = parent node of l
        s = path label from root to p
        a_0 ... a_n = ancestor nodes of p
        if left_diverse(p) then
            s is a maximal repeat
            all_maximals.push(s)
            for a_0 to a_n do
                s’ = path label from root to a_i
                all_maximals.push(s’)
            endfor
        endif
    endfor
    return all_maximals

Figure 3.6: Pseudocode for finding maximal repeats in a suffix tree.

There are well-known Bioinformatics tools such as WEEDER, TRELLIS, and REPuter which utilize suffix trees as the foundation of their algorithms. The WEEDER algorithm is based on the usage of the suffix trees which allows for many distinct properties of strings to be interpreted in an algorithmic manner. Although WEEDER does not seek to discover maximal, near-supermaximal, or supermaximal repeats, a set of input parameters similar to the TEIRESIAS algorithm specifies the type of patterns to identify.

WEEDER seeks to discover all patterns with a specific length and with a defined number of mutations, or wildcards, which have a given level of support within a set of input sequences [23]. The WEEDER algorithm builds upon an original algorithm developed by Sagot et al. [19], which uses an exhaustive approach of pattern extension with the main idea that any pattern that satisfies the input parameters has a path from the root node to a terminating node [23]. The number of occurrences of the pattern can be calculated by counting the number of leaves in the subtree of the terminating node [23].

WEEDER increases the performance of the algorithm by limiting the number of paths to search rather than the number of patterns, by stating an initial error ratio and dynamically adjusting the error ratio based on the length of the current path’s pattern [23]. In an assessment survey of top TFBS prediction algorithms, Tompa et al. [10] concluded that WEEDER outperformed several widely used TFBS prediction algorithms on certain test sets. However, Tompa et al. suggest that much work still needs to be accomplished to increase the accuracy of the putative TFBS reported by motif prediction algorithms, including WEEDER, and recommend using multiple motif prediction algorithms to maximize the likelihood of a successful novel TFBS discovery [10].

REPuter is another tool which uses a suffix tree as the basic data representation structure of the algorithm. REPuter is a two-phase algorithm, consisting of searching and visualization modules, which identifies all maximal repeats within a large input sequence [33]. REPuter was tested on a large input sequence containing 67 million nucleotide bases, approximately a file size of 65MB [33]. Kurtz and Schleiermacher [33] claim REPuter is optimal with respect to searching for maximal repeats, supporting a run time of O(n+m) where n represents the length of the input sequence and m represents the number of maximal repeats. The speed of REPuter can be attributed to the decreased space required for the generation of the suffix tree [33]. By applying the space reduction techniques of Kurtz [34], REPuter achieves a 7n bytes decrease in space over common suffix tree implementations, with n representing the length of the input sequence [33].

A new suffix tree based algorithm named TRELLIS has recently been published which aims to solve the scalability limitations of most suffix tree implementations at the genomic level and to provide a faster means of suffix tree construction and querying [35].

TRELLIS demonstrates its scalability by being able to construct a suffix tree for the entire human genome in 4.7 hours with only 2GB of memory [35]. TRELLIS provides an efficient disk based solution for suffix tree construction and querying by using partitioning and merging mechanisms which do not affect data integrity [35].

Phoophakdee and Zaki [35] experimentally demonstrated that TRELLIS outperforms other scalable suffix tree construction algorithms utilizing disk based methods, such as Top-Down Disk-Based (TDD), TOP-Q and DynaCluster. TDD is the only other known suffix tree based algorithm able to scale to the human genome, but it disposes of suffix links, and Phoophakdee and Zaki [35] experimentally attribute its longer running times to this lack of suffix links. Similar to TRELLIS, TOP-Q and DynaCluster use suffix links but have been experimentally shown not to scale to larger genomes such as the human genome [35]. Although TRELLIS seems to be a noteworthy tool for database and Bioinformatics applications, the tool has not been widely tested with genomic sequences nor critiqued with smaller genomic sequences using performance metrics such as those described in Tompa et al. [10].

3.4 Word Scoring

The word scoring phase calculates the expected number of occurrences and compares this measure with the observed number of occurrences to see if a significant difference is shown. If a large difference is observed, either over- or under-represented, the analyzed repeat will have a greater probability of being a putative biological word. By calculating the expected frequency of the repeat, it is statistically known how many occurrences the repeat should have based on the alphabet of the given input string. The word scoring phase uses Markov chain modeling as the main basis of calculating the expected number of occurrences by taking into consideration a background sequence which is the given input sequence, the vocabulary to be scored, and the order of Markov chain to use in the calculation.

Markov chain modeling is used, because it has proven to be an accurate calculation for predicting oligonucleotide frequencies within prokaryotic and eukaryotic genomes [36].

Higher order Markov chain models, such as 3rd and 4th order, have shown higher accuracy within Escherichia coli and Saccharomyces cerevisiae as well as in more complex genomes such as Drosophila melanogaster [36]. As described in [36], the zero order Markov chain estimates the frequency of the nucleotide sequence by considering the frequency of each separate nucleotide base. The total predicted probability of a nucleotide sequence is the product of the probability of each nucleotide base separately.

With higher order Markov chains (m>0), the total probability of a nucleotide sequence, or word, is calculated by computing the probability of each nucleotide given the m preceding nucleotides to the immediate left of the current nucleotide. If there are fewer than m preceding nucleotides, then all preceding nucleotides from the left to the beginning of the sequence are considered. The total probability of the word is the product of all the probabilities of each nucleotide with at most m preceding nucleotides. Therefore, the probability of a biological word $w = w_1 w_2 \ldots w_k$ is represented as:

$$p(w) = p(w_1 w_2 \ldots w_k \mid m) = \prod_{i=1}^{k} p(w_i \mid w_{i-m} \ldots w_{i-1})$$

Figure 3.7 shows an example of computing zero, first, and second order Markov chains in addition to calculating the expected frequency of the nucleotide sequence TGATC.

Nucleotide Sequence: TGATC

Let w represent a biological word, $w \in \{A, G, C, T\}$
Let p(w) represent the probability of w
Let s represent a substring of w
Let O(s) represent the observed frequency of s
Let E(w) represent the expected frequency of w
Let N represent the length of the input sequence
Let n represent the length of w

Probability of a Substring s of w
$p(s) = O(s) / N$

Zero Order Markov Chain (m=0):
$p(TGATC \mid 0) = p(T) \times p(G) \times p(A) \times p(T) \times p(C)$

First Order Markov Chain (m=1):
$p(TGATC \mid 1) = p(T) \times p(G \mid T) \times p(A \mid G) \times p(T \mid A) \times p(C \mid T)$

Second Order Markov Chain (m=2):
$p(TGATC \mid 2) = p(T) \times p(G \mid T) \times p(A \mid TG) \times p(T \mid GA) \times p(C \mid AT)$

Calculating Expected Frequency of w
$E(w) = p(w) \times (N - n + 1)$

Figure 3.7: Example calculations of zero, first and second order Markov chains.
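The calculations in Figure 3.7 can be sketched in Python as follows. This is a minimal illustration, not the WordSeeker implementation: the function names are hypothetical, and the conditional probabilities are assumed to be estimated as p(x | c) = O(cx) / O(c) from overlapping counts in the background sequence, a detail the text does not spell out.

```python
def observed(background: str, s: str) -> int:
    """O(s): number of (possibly overlapping) occurrences of s in the background."""
    return sum(
        1
        for i in range(len(background) - len(s) + 1)
        if background.startswith(s, i)
    )


def word_probability(word: str, background: str, m: int) -> float:
    """p(w | m): product over positions of p(w_i | up to m preceding bases)."""
    total = len(background)
    p = 1.0
    for i, base in enumerate(word):
        context = word[max(0, i - m):i]  # at most m bases to the left
        if not context:
            # No context available (or m = 0): p(s) = O(s) / N
            p *= observed(background, base) / total
        else:
            # Assumed estimator: p(base | context) = O(context + base) / O(context)
            context_count = observed(background, context)
            if context_count == 0:
                return 0.0
            p *= observed(background, context + base) / context_count
    return p


def expected_frequency(word: str, background: str, m: int) -> float:
    """E(w) = p(w) * (N - n + 1)."""
    return word_probability(word, background, m) * (len(background) - len(word) + 1)


if __name__ == "__main__":
    background = "TGATCTGATC"  # toy background sequence, for illustration only
    for order in (0, 1, 2):
        print(order, expected_frequency("TGATC", background, order))
```

On such a short toy background the estimates are coarse; the point is only to show how the order m changes which counts enter the product.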

The word scoring phase of WordSeeker also generates two scores, an R-metric and a Coverage score. The R-metric compares the observed frequency of a word to the calculated expected frequency. It serves to find words that are statistically over- or under-represented: such words receive a score much higher than words occurring with a frequency close to expected. Therefore, the closer the R-metric score is to zero, the less statistically important the putative biological word becomes. The Coverage score relates the positions covered by all occurrences of the word to the length of the entire input sequence. The calculations for the scores are described in more detail below.

Once the word list has been scored by the word scoring program, there may still be words present which are trivial in their nucleotide composition, so a word selection program is used. The word selection program enforces a defined set of criteria to determine which words are trivial and labels them as such.

R-metric
Let w represent a biological word, $w \in \{A, G, C, T\}$
Let O(w) represent the observed frequency of w
Let E(w) represent the calculated expected frequency of w

$\text{R-metric} = |\log O(w) - \log E(w)|$

Coverage score
Let w represent a biological word, $w \in \{A, G, C, T\}$
Let O(w) represent the observed frequency of w
Let n represent the length of w

$\text{Coverage score} = O(w) \times n$
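The two scores above translate directly into code. The sketch below uses hypothetical function names and assumes the natural logarithm, since the base of the logarithm is not stated in the text; a different base only rescales the R-metric.

```python
import math


def r_metric(observed: float, expected: float) -> float:
    """R-metric = |log O(w) - log E(w)|; both frequencies must be positive."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected frequencies must be positive")
    return abs(math.log(observed) - math.log(expected))


def coverage_score(observed: int, word_length: int) -> int:
    """Coverage score = O(w) * n."""
    return observed * word_length
```

Because of the absolute value, a word seen ten times more often than expected and a word seen ten times less often than expected receive the same R-metric.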

3.5 Word Selection

The word selection program identifies words as trivial or non-trivial, unique or redundant, and known or unknown by analyzing the nucleotide composition of each word. For a word to be classified as non-trivial, it must contain at least three different nucleotides and must not contain simple di- and tri-nucleotide patterns that cover more than one-third of the word. The word selection program marks a word as redundant if it is contained as a substring within a longer word; otherwise the word is marked as unique.

Finally, the WordSeeker suite offers precompiled word lists of known repeats, such as Alu, short interspersed nuclear element (SINE), and long interspersed nuclear element (LINE) repeats, for certain organisms such as Homo sapiens, and also a precompiled word list of known repeats for Arabidopsis thaliana. In addition to the lists of known repeats, a list of known microRNAs (miRNA) is also incorporated. These precompiled word lists are helpful if the input sequences are derived from either of these two organisms. If a match is found, the word is labeled as known; otherwise it is labeled as unknown. The miRNA list was obtained from release 9.2 of miRBase [42], the known repeats for Homo sapiens were obtained from the July 13th, 2007 release of RepBase [43], and the known repeats for Arabidopsis thaliana were obtained from AtRepBase [44].
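The three labels assigned by the word selection program could be sketched as follows. This is one possible reading of the criteria, not the WordSeeker code: all names are hypothetical, a word is assumed to be non-trivial when it has at least three distinct nucleotides and no simple repeat, a "simple pattern" is assumed to mean a 2- or 3-mer repeated back to back at least twice, and a "match" against a known list is assumed to be an exact match.

```python
def longest_tandem_run(word: str, unit_len: int) -> int:
    """Length of the longest stretch formed by one unit_len-mer repeated back to back."""
    best = 0
    for start in range(len(word) - 2 * unit_len + 1):
        unit = word[start:start + unit_len]
        end = start + unit_len
        while word[end:end + unit_len] == unit:
            end += unit_len
        if end - start >= 2 * unit_len:  # the unit occurs at least twice in a row
            best = max(best, end - start)
    return best


def is_trivial(word: str) -> bool:
    """Trivial: fewer than three distinct bases, or a simple repeat covering > 1/3 of the word."""
    if len(set(word)) < 3:
        return True
    run = max(longest_tandem_run(word, 2), longest_tandem_run(word, 3))
    return run > len(word) / 3


def is_redundant(word: str, vocabulary: list[str]) -> bool:
    """Redundant: contained as a substring within a longer word of the vocabulary."""
    return any(len(other) > len(word) and word in other for other in vocabulary)


def is_known(word: str, known_words: set[str]) -> bool:
    """Known: found in a precompiled list (exact match assumed)."""
    return word in known_words
```

A word such as ACACACGT has four distinct bases but is still flagged as trivial here, because the AC repeat covers six of its eight positions.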

MiRNAs are single stranded RNA molecules which do not code for proteins but are involved in affecting gene expression levels [41]. Alu sequences, on the other hand, are repetitive DNA sequences, roughly 300 bp long, found spread throughout the human genome in approximately 300,000 instances [37]. Alu sequences are an example of a SINE (80-300 bp); SINEs make up about 20% of the human genome [9, 38]. LINEs consist of longer repetitive DNA sequences (6000 – 8000 bp) and are less common in the human genome [9, 39]. Currently, research suggests a biological function associated with SINEs, such as controlling gene expression by serving as an origin of DNA replication, but no studies have yet suggested a biological function associated with LINEs [38, 40].

CHAPTER 4: CHARACTERIZATION OF ARABIDOPSIS THALIANA GENOME

4.1 Description of Arabidopsis thaliana Genome

The sequencing of the Arabidopsis thaliana genome, about 120 million base pairs, began in 1996 as an international effort between labs in the United States, Europe, and Japan [45]. The successful completion of the project has had an enormous impact on plant biology with Arabidopsis thaliana serving as a model system [45]. Arabidopsis thaliana serves as a model system for plants because of its rapid growth cycle of six weeks, the variety of mutants available for plant biologists, and the genetic similarity to other flowering plants [45]. The sequencing project, concluded in 2000, discovered approximately 27,000 genes and 35,000 proteins on the five chromosomes of Arabidopsis thaliana [46]. The Arabidopsis thaliana genome being analyzed in this thesis was obtained from the National Center for Biotechnology Information (NCBI) genome database dated November 16th, 2005 [47]. Figure 4.1 displays an image of an Arabidopsis thaliana plant and Table 4.1 shows the number of chromosomes, genome size in base pairs (bp), number of genes and proteins, and the year in which the Arabidopsis thaliana genome was completely sequenced.


Figure 4.1: Picture of Arabidopsis thaliana [48].

Table 4.1

Details of the Arabidopsis thaliana Genome

Genome                       Arabidopsis thaliana
Number of Chromosomes        5
Genome Size (bp)             120 Million
Number of Genes              27,000
Number of Proteins           35,000
Year Genome was Sequenced    2000

The genome-wide characterization of Arabidopsis focuses on the entire genome at first and then narrows to different genomic regions. The regions consist of the 5’ and 3’ untranslated regions (UTR), the coding and intergenic regions, introns, and promoters. All of the different genome regions, except the promoters, were retrieved from The Arabidopsis Information Resource (TAIR) Genome Release version 7 dated August 15th, 2007 [49]. The promoters were extracted from the Arabidopsis Gene Regulatory Information Server (AGRIS) database dated May 26th, 2006 [50].

Introns are non-protein coding sequences located within protein coding sequences that represent genes in the original DNA strand [11]. Removal of introns occurs during the latter stages of transcription and during the formation of mRNA [11]. Typically, genes are not found adjacent to one another and usually have areas of separation known as the intergenic region. Promoters are usually located upstream of protein coding sequences in the original DNA strand and provide a binding site for RNA polymerase to initiate transcription [11]. These protein coding sequences, which are transcribed into mRNA, consist of a cap, 5’ UTR, coding sequence, 3’ UTR and a poly-A tail.

The 5’ UTR is the portion of mRNA from the 5’ end to the start codon which signals the starting site for translation. A codon is a group of three nucleotides which code for a specific amino acid [11]. Located at the other end, the 3’ UTR is the portion from the 3’ end to the stop codon which indicates the ending position for translation. The region between the start and stop codons is called the coding sequence which synthesizes a protein by undergoing translation [11]. Figure 4.2 shows the general structure of a DNA sequence before transcription and an mRNA strand after transcription. Table 4.2 shows the file sizes of each chromosome as well as each region being analyzed.


[Figure 4.2 panels. Before transcription (original DNA strand): Promoter, followed by the coding sequence (Exon, Intron, Exon, Intron, Exon). After transcription (mRNA strand, 5’ end to 3’ end): Cap, 5’ UTR, Coding Region, 3’ UTR, Poly-A Tail.]

Figure 4.2: DNA sequence representation before and after transcription.

Table 4.2

File Sizes of Arabidopsis thaliana Chromosomes and Genome Regions

Region               File Size (Mb)    Number of Sequences
Chromosome 1         30.9              1
Chromosome 2         20.0              1
Chromosome 3         23.8              1
Chromosome 4         18.8              1
Chromosome 5         27.4              1
5’ UTR               4.51              22,928
3’ UTR               6.78              24,016
Coding Region        42.4              31,921
Intergenic Region    50.6              30,413
Introns              31.9              148,558


4.2 Vocabulary Generation of Genome Regions

4.2.1 Experiment Methodology

This section describes the formal procedure used for the vocabulary generation of genomic regions within the Arabidopsis genome. When generating a vocabulary for a region, all 6- through 9-mers were extracted using the SignatureSeeker component of the WordSeeker Functional Genomics Toolkit. The words which were extracted correspond to the International Union of Pure and Applied Chemistry (IUPAC) Nucleotide Base Alphabet [51]. The IUPAC alphabet is shown in Table 4.3 below. Since motifs are made up of short DNA sequences of about 10 bp, lengths of 6 to 9 were used during this analysis [10]. All 6- through 9-mers were also extracted from the entire Arabidopsis thaliana genome using the same WordSeeker component. All 6- through 9-mers for each region and the genome were scored using the word scoring component of WordSeeker with a Markov order of zero.

Table 4.3

IUPAC Nucleotide Base Alphabet

Symbol    Meaning         Symbol    Meaning
A         A (Adenine)     Y         C or T
C         C (Cytosine)    K         G or T
G         G (Guanine)     V         A, C, or G but not T
T         T (Thymine)     H         A, C, or T but not G
M         A or C          D         A, G, or T but not C
R         A or G          B         C, G, or T but not A
W         A or T          N         A, G, C, or T
S         C or G
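The alphabet in Table 4.3 can be encoded as a lookup table that maps each symbol to the set of bases it stands for. The sketch below is illustrative only; the names IUPAC and iupac_matches are hypothetical and are not part of WordSeeker.

```python
# Each IUPAC symbol maps to the concrete bases it may represent (Table 4.3).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG",
    "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def iupac_matches(pattern: str, dna: str) -> bool:
    """True if a concrete DNA string of the same length is covered by an IUPAC pattern."""
    if len(pattern) != len(dna):
        return False
    return all(base in IUPAC[symbol] for symbol, base in zip(pattern, dna))
```

For example, the pattern CANNTG covers the concrete word CACGTG, while CARGTG does not, because R stands only for A or G.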


The word scoring component applies measures which estimate how likely a word is to be correlated with a biological function. Statistical methods are outlined in the word scoring component to determine which words are statistically over- or under-represented using the R-metric score. For each word in a region, its R-metric score is compared against the scores across the entire genome. A scatter plot is created for all putative words and for the top 5% scoring words to display the difference between the R-metric scores of the entire genome versus each region. As an arbitrary cutoff, the top 5% R-metric scoring words are considered to be putative biological words which represent the vocabulary of each region. R-metric outliers are considered to be the most likely to be correlated with a biological function. To test for vocabulary significance, each vocabulary is compared against a list of known TFBS for Arabidopsis. Figure 4.3 presents an outline of the methodology of the approach. All resulting putative vocabularies are retrievable from http://steam.cs.ohiou.edu/~epetri/Thesis/.

Figure 4.3: Experimental procedure for the vocabulary generation of Arabidopsis thaliana genome regions.
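The top 5% selection step can be sketched as a sort followed by a cutoff. The function name and the dictionary input are hypothetical. Truncating the cutoff to a whole number of words is an assumption, chosen because it agrees with the vocabulary sizes reported in this chapter (for example, 16,160 of 323,201 words).

```python
def top_scoring_words(r_metric_by_word: dict[str, float], percent: int = 5) -> list[str]:
    """Return the words whose R-metric ranks in the top `percent` of all scored words."""
    keep = len(r_metric_by_word) * percent // 100  # integer arithmetic, truncating
    ranked = sorted(r_metric_by_word, key=r_metric_by_word.get, reverse=True)
    return ranked[:keep]
```

Sorting the whole dictionary in memory is fine for a few hundred thousand words; a much larger vocabulary kept on disk would need a different strategy.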

4.2.2 Vocabulary of 5' UTR

A total of 323,201 putative words were extracted from the 5’ UTR of Arabidopsis. There were 16,160 words present in the top 5% of the scoring words in the 5’ UTR. In the upper graph of Figure 4.4, the top scoring words can easily be seen well above the lower scoring words. To generate the vocabulary for the 5’ UTR, the top 5% of words (16,160) are considered the most likely to be correlated with a biological function. Since these words are located in the 5’ UTR, a subset of the words could be responsible for post-transcriptional modifications, the initiation of translation, or effects on gene expression levels. The lower graph of Figure 4.4 shows the lower 95% of the scoring words filtered out, leaving the putative vocabulary of the 5’ UTR of Arabidopsis.

Figure 4.4: Vocabulary of all words and the top 5% in the 5’ UTR of Arabidopsis.


4.2.3 Vocabulary of Coding Region

The coding region (CDS), the second largest genomic region under consideration, yielded a total of 348,303 words with 17,415 words scoring in the top 5%. Several more outliers could be identified, separating themselves from the noise of the lower scoring words as shown in the upper graph of Figure 4.5. Since these words are located within the coding region of genes, the words may be correlated with specific genetic traits or the synthesis of proteins. In the lower graph of Figure 4.5, the lower scoring words have been removed from the putative vocabulary and the graph contains many more outliers than shown in the 5’ UTR diagrams.


Figure 4.5: Vocabulary of all words and the top 5% in the CDS of Arabidopsis.

4.2.4 Vocabulary of 3' UTR

From the 3’ UTR of Arabidopsis, a total of 330,022 putative words were extracted, with 16,501 ranking in the top 5%. Unlike the other regions under consideration, the graphs of all of the words and the top 5% found in Figure 4.6 do not show nearly as many outliers. The R-metric scores of the top 5% of words were relatively close to one another and do not distinguish a set of words with noticeable differences. However, the words discovered in the 3’ UTR vocabulary may be involved with affecting levels of gene expression, signaling of a poly-A tail, the termination site of translation, or post-transcriptional modifications.


Figure 4.6: Vocabulary of all words and the top 5% in the 3’ UTR of Arabidopsis.

4.2.5 Vocabulary of Intergenic Region

The intergenic region, which is the largest region being analyzed, yielded 357,384 putative words with a top 5% of 17,869 words. The graph of R-metric scores in Figure 4.7 displays all the words and the top 5% in the intergenic region. Figure 4.7 shows several more outliers separating from the lower scoring words than all previously analyzed regions. Another interesting feature of the graph is the separation of outliers into almost level segments. The words of the intergenic region, which are primarily non-coding sequences, may be associated with a biological function and could further be linked to additional words, or motifs, present in protein coding sequences.


Figure 4.7: Vocabulary of all words and the top 5% in the intergenic region of Arabidopsis.

4.2.6 Vocabulary of Introns

The third largest genomic region analyzed is the introns, yielding 343,719 words with 17,185 words in the upper 5%. The graphs of the introns in Figure 4.8 depict an outlier trend similar to those of the other genomic regions. The removal of the lower 95% of words, shown in the bottom graph of Figure 4.8, eliminates much of the noise generated. By acknowledging a possible relationship between introns, which consist of non-coding sequences, and coding sequences such as the coding region, novel motifs may be discovered which further explain the encoded message of genomic sequences.


Figure 4.8: Vocabulary of all words and the top 5% in the introns of Arabidopsis.

4.3 Vocabulary Significance Comparisons

4.3.1 Comparison with Known Transcription Factor Binding Sites

A list of 56 known transcription factor binding sites (TFBS), although few contain variations, was obtained from the AGRIS database dated May 26th, 2006 [50]. For each region analyzed, its putative vocabulary was compared against this list of known TFBS within Arabidopsis. Figure 4.9 below displays each genome region’s vocabulary size (top 5%) and the number of known TFBS instances within the set of putative words. Of all the genome regions, the coding region vocabulary contained the most known TFBS, followed by the 5’ UTR, with known TFBS instances accounting for 12.23% and 6.19% of their respective vocabularies. These regions, the coding region and the 5’ UTR, are the prime regions that contain TFBS which regulate transcription and gene expression levels. The remaining putative words from each vocabulary could represent segments of other regulatory and non-regulatory elements, such as known repetitive sequences, conserved non-coding regions, or possibly even novel motifs, which could be identified through further investigation.

[Figure 4.9 chart. Title: Genome Region Vocabularies of Arabidopsis containing Known TFBS. X-axis: Genome Region (5' UTR, Coding Region, 3' UTR, Intergenic Region, Introns). Y-axis: Word Instances (0 to 20,000). Series: Top 5% of Words, Known TFBS Words.]

Figure 4.9: Arabidopsis genome region vocabularies with instances of known TFBS.

CHAPTER 5: SCALABLE WORD SEARCHING APPROACH

5.1 Scalable Approach

The purpose of the proposed scalable word searching approach is to provide a flexible approach which scales with respect to increasing input sizes. The scalable word searching approach utilizes a word searching algorithm to create a vocabulary, also referred to as a word dictionary, for an organism on a genome-wide scale in a timely and efficient manner. Before the execution of the scalable word searching approach, a set of input parameters must be specified by the user.

The parameters consist of the input and output filenames, minimum and maximum word lengths, minimum number of sequences the word must occur in, minimum number of occurrences (sites) of the word, and the fragmentation size. The minimum and maximum word lengths directly affect the size of the vocabulary that is produced for a given input sequence. These two parameters must be strategically set by the user so that the vocabulary does not exceed memory capacities. The fragmentation size is also an important parameter that must be carefully considered. This parameter fragments the input genomic sequence into manageable parts for the word searching algorithm. Depending upon the system being used, a large fragmentation size can lead to longer running times and may exceed memory capacity.

Overall, the complete program execution of the scalable word searching approach involves many steps. Once the input sequence and parameters have been supplied to the program, the input sequence is divided into individual sequences. Each separate FASTA header is treated as a separate sequence. After each sequence has been identified, these sequences are written to disk. Each sequence is then incrementally fragmented to the file size specified by the input fragmentation parameter. During this step, overlap segments are written to disk for later removal and then each fragment is sequentially submitted to the word searching algorithm. The exact fragmentation process will be discussed further in the following subsection.
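The first step, dividing the input into one sequence per FASTA header, might look like the sketch below. The function name is hypothetical, and for simplicity the whole input is held in memory, whereas the approach described above writes each sequence to disk.

```python
def split_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA-formatted text into (header, sequence) pairs, one per '>' header."""
    records: list[tuple[str, str]] = []
    header = None
    chunks: list[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # normalize case; sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each returned sequence would then be passed on to the fragmentation step.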

After the word searching algorithm completes examination of each fragment, the output is written to disk for later compilation. Once all fragments have been completed, every output file is compiled to create an initial vocabulary of putative biological words. During the output compilation, the corresponding overlap segments are checked to remove repeats that have been counted more than once due to sequence fragmentation. In order not to exceed the memory constraints of the system, the created vocabulary always remains on disk and is accessed during the intersection between the words from each fragment output in memory and the current vocabulary on disk. These measures ensure that an accurate count has been established and that the successful identification of all biological words in the set of fragments has taken place.

After the initial vocabulary has been generated, each word must be scored to determine if the word is either statistically over- or under-represented within the scope of the input genomic sequence. Once the word scoring has been completed, the word list is then submitted to a word selection program that analyzes the nucleotide composition of each word to determine if the word is considered trivial or non-trivial. In conclusion, non-trivial words are among the words considered most likely to be putative biological words. The full execution of the scalable word searching method is outlined in Figure 5.1.

Figure 5.1: Detailed program execution of the scalable word searching approach.

5.2 Input Fragmentation

The major limitation of word searching algorithms is the enormous amount of memory needed to search for and store a large list of putative biological words. A fragmentation technique is used to handle large files such as the chromosomes of genomes. Overlap sections are generated between fragments which are adjacent to one another. Overlap sections are written to disk so that occurrences of repeats which were counted more than once can be removed. The overlap length between the fragments is given as a parameter and represents the maximum length of a word which can be found in the input sequence. Fragments and overlap sections are each defined by a pair of starting and ending positions, $\langle i, j \rangle$. All fragments are equal in length to the fragmentation size, except possibly the last fragment. All fragments of a sequence can be represented as a set of pairs, $F_0\langle i, j \rangle$ to $F_n\langle i, j \rangle$. Likewise, all overlap sections can be represented as a set of pairs, $O_0\langle i, j \rangle$ to $O_n\langle i, j \rangle$. The overlap sections are created from adjacent fragments using the maximum word length parameter as the fragment offset size and calculating the starting position using the fragment offset size and fragmentation size.

Let k equal the fragment offset size and let l equal the fragmentation size. For example, the first fragment will have the following pair of starting and ending positions: $F_0\langle 0, l \rangle$. The starting position of the overlap section between the first and second fragment is calculated by $i \in O_0 = i \in F_0 + (l - k + 1)$ and the ending position is equal to $j \in O_0 = j \in F_0$. For the second fragment, the starting position is equal to $i \in F_1 = i \in O_0$ and the ending position is equal to $j \in F_1 = i \in F_1 + l$. This process continues until the entire input sequence has been fragmented.

This approach will create an overlap between the adjacent fragments, where the length of the overlap equals k-1, so all words of length equal to k will not be repeated twice within the results of the adjacent fragments. However, the words with lengths less than k will be accounted for more than once. Therefore, there must exist a phase in the scalable word searching approach that performs the removal of repetitive occurrences of repeats from the overlap sections to ensure accurate frequencies. Figure 5.2 demonstrates the input fragmentation that takes place in this step.

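[Figure 5.2 content. Sequence fragmentation with Fragmentation Size = 7 and Maximum Word Length = 4 over an input sequence axis running from 0 to 20. Fragment 1 spans positions 0 to 6. The overlap starts at 0 + (7 - 4 + 1) = 4. Fragment 2 starts at 4 and ends at 4 + 7 = 11.]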

Figure 5.2: Execution of the input fragmentation technique.
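The fragment and overlap positions of Figure 5.2 can be reproduced with a short sketch. The function name is hypothetical, and half-open spans (start, end) are an assumption of this sketch, so that Fragment 1 of the figure, covering positions 0 to 6, is written as (0, 7).

```python
def fragment_spans(seq_len: int, frag_size: int, max_word_len: int):
    """Return (fragments, overlaps) as lists of half-open (start, end) spans.

    Consecutive fragments start frag_size - max_word_len + 1 positions apart,
    so adjacent fragments share max_word_len - 1 positions.
    """
    step = frag_size - max_word_len + 1
    if step <= 0:
        raise ValueError("fragmentation size must be at least the maximum word length")
    fragments, overlaps = [], []
    start = 0
    while True:
        end = min(start + frag_size, seq_len)  # the last fragment may be shorter
        fragments.append((start, end))
        if end == seq_len:
            break
        next_start = start + step
        overlaps.append((next_start, end))  # region shared with the next fragment
        start = next_start
    return fragments, overlaps


if __name__ == "__main__":
    frags, overs = fragment_spans(seq_len=20, frag_size=7, max_word_len=4)
    print(frags)
    print(overs)
```

With these settings every word of length 4 falls entirely inside exactly one fragment, while shorter words lying inside an overlap are seen by both neighbouring fragments.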

The word searching algorithm runs on each fragment generated and the corresponding significant words are outputted to disk. The results of each fragment are compiled into a single results file at the end. During this step, the overlap analysis takes place and thus generates more accurate counts for the significant words discovered. The results file consists of the entire collection of words found in the input sequence and represents the generated vocabulary for the given input sequence. The vocabulary now must be scored to determine the likelihood that each individual word is statistically important, either over- or under-represented, and could potentially be correlated with a biological function.
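One way the overlap analysis could restore exact counts is to add up the counts from every fragment and then subtract the counts found inside each overlap section. The sketch below illustrates this with plain substring counting instead of the suffix tree search used by the toolkit. The function names are hypothetical, the spans are half-open (start, end) pairs, and the fragmentation size is assumed to be more than twice the overlap length so that no position belongs to three fragments.

```python
from collections import Counter


def count_words(text: str, min_len: int, max_len: int) -> Counter:
    """Count every substring of text with length between min_len and max_len."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def merged_counts(sequence, fragments, overlaps, min_len, max_len) -> Counter:
    """Combine per-fragment counts and remove occurrences counted twice in overlaps."""
    total: Counter = Counter()
    for start, end in fragments:
        total.update(count_words(sequence[start:end], min_len, max_len))
    for start, end in overlaps:
        # An occurrence lying entirely inside an overlap was seen by both neighbours.
        total.subtract(count_words(sequence[start:end], min_len, max_len))
    return +total  # unary plus drops entries whose count fell to zero
```

For a 20-base sequence split as in Figure 5.2, the merged counts agree with counting directly on the unfragmented sequence for all word lengths up to the maximum word length of 4.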


5.3 Validation of Scalable Approach

To validate the scalable word searching approach, several genomic test sets were created with increasing file sizes up to around 2 Gigabytes (GB) and run on an Intel Xeon 3.00 Gigahertz (GHz) system with 2 GB of memory. The scalable word searching approach, in coordination with the suffix tree method, searched for supermaximal and near-supermaximal repeats of length 16 to 300 with a fragmentation size of 9 MB. The analysis began with an entire genome analysis of Caenorhabditis elegans (103.1 MB), which was obtained from an NCBI release dated February 16th, 2006 [52], and Arabidopsis thaliana (120.9 MB) [47]. Genomic tests with increasing file sizes were created by incrementally adding chromosomes from Canis familiaris, which was obtained from an NCBI release dated September 8th, 2005 [53], into an analysis test set. Table 5.1 displays all test sets used and their respective running times in addition to the larger test sets created from Canis familiaris. From Table 5.1, it can be seen that the scalable word searching approach is effective with larger genomic sequences, such as with a 2.35 GB file, which is approximately two-thirds the size of the human genome.


Table 5.1

Larger Genomic Test Sets

Test Set Number    Genome                                   File Size    Running Time
1                  Caenorhabditis elegans                   103.1 MB     679.979 seconds
2                  Arabidopsis thaliana                     120.9 MB     678.352 seconds
3                  Canis familiaris (Chromosomes 1-2)       211.0 MB     1413.458 seconds
4                  Canis familiaris (Chromosomes 1-3)       305.0 MB     2153.604 seconds
5                  Canis familiaris (Chromosomes 1-5)       484.0 MB     3654.549 seconds
6                  Canis familiaris (Chromosomes 1-14)      1.13 GB      10218.960 seconds
7                  Canis familiaris (Chromosomes 1-38)      2.23 GB      25950.990 seconds

5.4 Methodology Qualifications

Although the scalable word searching approach decreases running times substantially, qualifications of the results arise from the techniques and methodology used. For instance, the vocabularies generated contain the majority, but not all, of the supermaximal and near-supermaximal repeats within a given input sequence. This qualification is due to the input fragmentation method described within the scalable word searching approach. The fragmentation method creates subsections of the input sequence, altering its true starting state. Since each subsection of input is analyzed separately, repeats which normally occur throughout the input sequence, but are only found once per subsection, run the risk of being omitted from the final vocabulary generated.


Another qualification lies within the use of suffix trees when searching for supermaximal and near-supermaximal repeats. With the labeling of supermaximal and near-supermaximal repeats as putative biological words, an indirect uniqueness filter is present which can remove core regulatory motifs if the motifs found as maximal repeats are present as a substring of larger maximal repeats. The removal of these core regulatory motifs leads to the discovery of longer maximal repeats that are supermaximal and contain the core motif in addition to padding nucleotides.

CHAPTER 6: SURVEY OF HUMAN PYKNONS WITHIN ARABIDOPSIS

6.1 Motivation

In 2007, Rigoutsos et al. [3] published their novel discovery of millions of non-random patterns of variable length occurring more often than statistically expected within the intergenic region and introns of the human genome. These non-random patterns were discovered by using the TEIRESIAS algorithm, described in section 3.2.2, with a minimum length of 16 nucleotides and support within 40 sequences. Rigoutsos et al. also discovered that a subset of these millions of patterns have non-overlapping occurrences within the protein coding and untranslated regions of the human genome, which could suggest a new form of gene regulation, thus creating a linkage between non-coding and coding regions of the human genome [3]. This subset of 127,998 patterns was termed pyknons [3]. Most of the pyknons discovered were present in the 3’ UTR with an average spacing of 18 to 22 nucleotides, which could suggest post-transcriptional modifications [25]. A few of the main findings about human pyknons outlined in Rigoutsos et al. [3] are:

• Human pyknons are present at least once within all currently known genes.

• Pyknons are shown to be correlated with biological functions and are related to known microRNAs.

• Pyknons discovered in the human genome have also been discovered in other animal genomes such as mouse, rat, dog, chicken, fruit fly, and worm and have been correlated with biological functions comparable to those in humans.

Literature has shown similarities between human and plant regulatory elements, and the Human Genome Project has concluded in its findings that the human genome shares a majority of its protein families with other organisms such as flies, worms, and plants [54]. In 2007, Correia et al. [55] released their discovery of a direct relationship between a common regulatory pathway within mammals and plants. The common regulatory pathway is found within the immune system of mammals and plants where foreign bacteria are discovered by the Nod-like Protein (NLP) family [55]. Mutations to proteins within the NLP family have been associated with the cause of certain human diseases [55]. According to Correia et al., their identification of the NLP activation due to the SGT1 protein, which is present in both plants and mammals, serves as a common pathway present within each type of organism [55]. Research involving the discovery of common regulatory elements between plants and animals still remains in its infancy, but with promising findings thus far [56].

In an effort to provide a better understanding of the possibility of common regulatory elements between plants and animals, the following analysis will focus on the identification of human pyknons within the Arabidopsis thaliana genome. The analysis will focus on pyknons that share 75%, 85%, 90%, and 100% nucleotide similarity to repeats within Arabidopsis thaliana. The analysis then identifies the number of occurrences within different regions of the Arabidopsis thaliana genome such as the 5’ and 3’ UTR, intergenic and coding regions, introns, and promoters.


6.2 Experiment Methodology

By using the scalable approach invoking the suffix tree word searching method, supermaximal and near-supermaximal repeats of lengths 16 to 300 are discovered within the Arabidopsis thaliana genome. The set of supermaximal and near-supermaximal repeats is tested for a 100% match against the set of 127,998 human pyknon patterns. The list of human pyknon patterns from Rigoutsos et al. [3] was retrieved from the Pyknon Set Download Page [57]. A 100% match occurs as an exact match with a supermaximal or near-supermaximal repeat or as a substring within a supermaximal or near-supermaximal repeat. All exact and substring matches are considered the set of pyknons found within the Arabidopsis thaliana genome. The remaining human pyknons that are not found within the set of repeats are further analyzed for similarity to the repeats by calculating their respective Hamming distances. A Hamming distance is a measure that counts the number of character positions at which two strings of equal length differ.

To account for genetic mutations and phylogenetic distance, the remaining human pyknons have been evaluated with the supermaximal and near-supermaximal repeats using 75%, 85%, and 90% similarity measures. The Hamming distance is calculated for each pyknon and repeat pair. A mismatch percentage represents the percentage of allowed mismatch for each similarity measure: 75%, 85%, and 90%, so that the mismatch percentages are 25%, 15%, and 10% respectively. To determine the number of mismatches allowed for a specific pyknon, the length of the pyknon is multiplied by the allowed mismatch percentage. If the Hamming distance is less than or equal to this allowed number of mismatches, the pyknon is labeled as discovered for that specific mismatch percentage.
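The similarity test can be sketched as a Hamming distance check against every window of a repeat. Several details here are assumptions rather than the procedure actually used: the function names are hypothetical, a pyknon is compared with each equal-length window of a longer repeat, and the allowed number of mismatches is the pyknon length times the mismatch percentage, rounded down.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def found_with_similarity(pyknon: str, repeat: str, similarity_percent: int) -> bool:
    """True if some window of `repeat` matches `pyknon` within the allowed mismatches."""
    n = len(pyknon)
    # Integer arithmetic avoids floating point surprises, e.g. 20 * 10 // 100 == 2.
    allowed = n * (100 - similarity_percent) // 100
    return any(
        hamming(pyknon, repeat[i:i + n]) <= allowed
        for i in range(len(repeat) - n + 1)
    )
```

With a similarity of 100% the allowed number of mismatches is zero, so the test reduces to the exact or substring match described for the first pass.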

The number of human pyknons discovered with each similarity criterion is reported in addition to their particular frequencies within the 5’ and 3’ UTRs, coding and intergenic regions, introns and promoters of the Arabidopsis thaliana genome. Figure 6.1 outlines a flowchart of the experimental procedure as discussed above.

Figure 6.1: Experimental procedure for the discovery of human pyknons in Arabidopsis thaliana.

6.3 Results

The scalable word searching approach, in coordination with the suffix tree algorithm, was executed on each of the five chromosomes of the Arabidopsis genome. The results for each chromosome were compiled to create a list of supermaximal and near-supermaximal repeats with lengths of 16 to 300 for the entire genome. The running time for this search was 11 minutes and 18.352 seconds for the Arabidopsis genome, which has a total file size of 120.9 MB. The initial search, at 75% similarity, found a high percentage (95.99%) of the human pyknons within the Arabidopsis thaliana genome. The two subsequent similarity tests, at 85% and 90%, identified 60.90% and 19.06% of the human pyknons respectively. Only 1.52% of the human pyknons were discovered as an exact or substring match. The exact number of human pyknons discovered for each similarity test is shown in Table 6.1.

Table 6.1

Pyknon Matches with High Similarity Discovered in Arabidopsis Genome

Similarity Setting    Number of Pyknon Matches    Percentage of Human Pyknons Found
75%                   122,863                     95.99%
85%                   77,957                      60.90%
90%                   24,395                      19.06%
100%                  1,946                       1.52%
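The percentages in Table 6.1 follow directly from the match counts and the size of the human pyknon set (127,998 patterns). The short check below reproduces that arithmetic; the variable and function names are illustrative only.

```python
TOTAL_HUMAN_PYKNONS = 127_998  # size of the human pyknon set used in the text

# Match counts from Table 6.1, keyed by similarity setting.
matches = {"75%": 122_863, "85%": 77_957, "90%": 24_395, "100%": 1_946}


def percent_found(count, total=TOTAL_HUMAN_PYKNONS):
    """Share of the human pyknon set, expressed as a percentage."""
    return 100.0 * count / total


for setting, count in matches.items():
    print(f"{setting:>4}: {count:>7,} matches = {percent_found(count):.2f}%")
```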

For six regions of the Arabidopsis thaliana genome, the results of each similarity setting were used to count the pyknons located within each region. The results of this analysis were then compared with the pyknon findings for the six animal organisms discussed in Rigoutsos et al. [3]. Pyknon instances were discovered within every analyzed region of the Arabidopsis genome; Table 6.2 displays the number of pyknon matches located within each of the six regions: 5’ UTR, 3’ UTR, coding region, intergenic region, introns, and promoters. In each similarity setting, the intergenic region contained the most pyknon instances, followed by promoters, introns, coding region, 3’ UTR, and 5’ UTR respectively. Figure 6.2 shows a graph of pyknon instances for each similarity setting within each genome region. The pyknon trends annotated in the human genome are found to also be present within the evolutionarily distant genome of Arabidopsis. Moreover, the findings of pyknon instances within non-coding regions and coding regions of the Arabidopsis genome suggest that the fundamental relationship discovered in Rigoutsos et al. [3] between the two regions in the human genome is also present within Arabidopsis.

Table 6.2

Pyknon Matches with High Similarity Discovered within Specific Regions of the Arabidopsis thaliana Genome

                     Number of Pyknon Hits
Genome Region        75% Similarity       85% Similarity      90% Similarity      100% Similarity
                     (122,863 Pyknons)    (77,957 Pyknons)    (24,395 Pyknons)    (1,946 Pyknons)
5’ UTR               1,930                635                 447                 226
Coding Region        9,145                2,610               1,136               358
3’ UTR               3,731                1,202               794                 287
Intergenic Region    24,438               11,090              7,616               1,691
Introns              11,582               4,125               2,680               697
Promoters            20,891               8,824               6,037               1,296

[Figure: chart titled "Pyknon instances within Arabidopsis". Horizontal axis: Genome Region (5’ UTR, Coding Region, 3’ UTR, Intergenic Region, Introns, Promoters). Vertical axis: Number of Pyknon Instances (0 to 30,000). Series: 75% Similarity, 85% Similarity, 90% Similarity, 100% Similarity.]

Figure 6.2: Pyknon instances within six regions of the Arabidopsis genome.

In Figure 6.3, all pyknon instances in the 5’ UTR, 3’ UTR, and coding regions of six genomes are graphed alongside the instances contained within the same regions in Arabidopsis thaliana for each similarity setting. The results for the six genomes were obtained from Rigoutsos et al. [3] and represent model genomes ordered by decreasing genetic similarity to, and increasing evolutionary distance from, the human genome. With the 100% similarity setting of the Arabidopsis analysis, the number of pyknon instances within the 5’ UTR is shown to be greater than the number within the 5’ UTR of Caenorhabditis elegans. As the similarity setting is decreased, the number of pyknon instances within the 5’ UTR, 3’ UTR, and coding regions increases, a trend that is also apparent in the figure above. These results suggest that a small number of Arabidopsis genes contain exact pyknon instances, whether in their 5’ UTR, 3’ UTR, or protein coding region. A decreasing similarity setting suggests an increasing number of Arabidopsis genes containing pyknon instances.


[Figure: chart titled "Conserved Pyknon Instances between Eukaryotic Genomes". Horizontal axis: Genome (Mouse, Rat, Dog, Chicken, Fruit Fly, Worm, and ATH at 75%, 85%, 90%, and 100%). Vertical axis: Number of Pyknon Instances (0 to 10,000). Series: 5' UTR, Coding Regions, 3' UTR.]

Figure 6.3: Number of human pyknon instances within the 5’ UTR, 3’ UTR and coding regions of six animal genomes in addition to Arabidopsis (Mouse, Rat, Dog, Chicken, Fruit Fly, Worm, ATH % - Arabidopsis Similarity Percentage).

The TAIR 7 genome release of Arabidopsis [49] contains 22,928 5’ UTR sequences, 31,921 coding region sequences, and 24,016 3’ UTR sequences. Figure 6.4 shows the percentage of 5’ UTR, 3’ UTR, and coding region sequences that contain pyknon instances for each similarity setting. An interesting percentage change between the coding region and the 3’ UTR is seen at the 75% similarity setting versus the other similarity settings. As shown in Figure 6.4, there is not an overabundance of sequences with pyknon instances, but the percentage is large enough that it cannot be easily dismissed.


[Figure: chart titled "Percentage of Sequences with Pyknon Instances in Regions of Arabidopsis". Horizontal axis: Similarity Percentage (75%, 85%, 90%, 100%). Vertical axis: Percentage of Sequences (0.00% to 18.00%). Series: 5' UTR, Coding Region, 3' UTR.]

Figure 6.4: Percentage of 5’ UTR, 3’ UTR and coding region sequences with pyknon instances for each similarity setting.


CHAPTER 7: CONCLUSIONS

7.1 Challenges

In the development of the scalable word searching approach, memory consumption became an issue during the final step, the compilation of fragment outputs. In this step, memory consumption exceeded the memory available on the system, and swapping took place. To alleviate this issue, the generated vocabulary was kept on disk and updated continually as each fragment output was processed. This simple adaptation increases running time, especially in user space, because of the costly overhead of continuously moving between memory and disk. However, the modification allows larger genomic sequences to be processed successfully in a reasonable amount of time. The efficiency of the scalable word searching approach could be further improved by adding compression algorithms and by exploiting multi-core systems.
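The idea of keeping the vocabulary on disk and updating it as each fragment output is processed can be illustrated with a small sketch. This is not the WordSeeker implementation: it assumes, for simplicity, that each fragment output is a list of (word, count) pairs and that compilation means summing counts, and it uses Python's standard sqlite3 module as a stand-in for whatever on-disk structure the toolkit uses. The table, column, and function names are all hypothetical.

```python
import sqlite3


def open_vocabulary(path):
    """Open (or create) an on-disk vocabulary table of word occurrences."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vocabulary ("
        "word TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    return conn


def merge_fragment(conn, fragment_words):
    """Fold one fragment's (word, count) pairs into the stored vocabulary."""
    pairs = list(fragment_words)
    with conn:  # one transaction per fragment output
        # Make sure every word has a row, then add this fragment's counts.
        conn.executemany(
            "INSERT OR IGNORE INTO vocabulary (word, occurrences) VALUES (?, 0)",
            [(word,) for word, _ in pairs],
        )
        conn.executemany(
            "UPDATE vocabulary SET occurrences = occurrences + ? WHERE word = ?",
            [(count, word) for word, count in pairs],
        )


def lookup(conn, word):
    """Return the stored number of occurrences of `word` (0 if absent)."""
    row = conn.execute(
        "SELECT occurrences FROM vocabulary WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else 0
```

Only one fragment output needs to be held in memory at a time, at the cost of a disk access for every update, which mirrors the trade-off between memory consumption and running time described above.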

7.2 Summary of Results and Findings

Within this thesis, a new vocabulary-based software suite, known as WordSeeker, has been presented. WordSeeker includes a scalable word searching approach for analyzing larger genomic sequences and is currently being made available for easy access. To validate the approach, the scalable word searching approach has been shown to handle entire genomes of less complex organisms, in addition to larger input sequences on the order of 2 GB, which is roughly two-thirds of the size of the human genome. In these tests, the search for near-supermaximal and supermaximal repeats of lengths 16 to 300 in a 2 GB input sequence completed in 25950.99 seconds, or approximately 7 hours and 13 minutes. Furthermore, the word scoring and word selection components were discussed, which provide further measures for labeling words from the generated vocabulary as putative biological words.

Within the biological experiments conducted with the WordSeeker Functional Genomics Toolkit, interesting results have been produced, especially in the survey of human pyknons within regions of the Arabidopsis genome. In the first experiment, a vocabulary methodology was discussed and carried out on five genomic regions of Arabidopsis, and the resulting words were tested against known TFBS. A number of these words overlapped with known TFBS, but the majority did not, which suggests these putative words may be portions of known repetitive sequences, conserved non-coding segments, or possibly novel regulatory elements. In the second experiment, which utilized the scalable word searching approach, the human pyknons discovered in Rigoutsos et al. [3] were examined, and a survey of instances of these pyknons within six genomic regions of the Arabidopsis genome was provided. The findings showed that most of the pyknon instances occurred within the intergenic region, introns, and promoters, and that a smaller number of pyknon instances were discovered in the 5’ UTR, coding region, and 3’ UTR of the Arabidopsis genes. Having instances within both non-coding and coding regions could suggest a possible linkage between the two regions.


7.3 Suggestions for Future Work

A few suggestions for future research on word searching problems in biology, and for future modifications to improve the running time and memory consumption of the scalable word searching approach, are:

• To uncover core motifs by performing a clustering of similar words in a vocabulary, creating a set of candidate motif models. Once a general vocabulary has been formed for an analyzed genome or area of interest, a separate research endeavor arises: understanding the biological phrases or grammar. By identifying how the words are formed and interlinked, a better understanding of the relationship between putative words and, further, of the formation of regulatory networks can be formulated.

• To apply the methodology for vocabulary construction to other genomes, such as non-model organisms, and specific areas of interest in an effort to discover novel regulatory elements.

• To avoid the costly overhead of user time spent moving between memory and disk, a data compression algorithm, such as one based on the Burrows-Wheeler Transform, or a more cache-friendly output compilation approach could be added. This would allow the current generated vocabulary to reside in memory rather than on disk when compiling all fragment outputs.

• With the accessibility of multi-core processing systems, providing a parallel version of the scalable word searching approach could benefit the overall running time by taking advantage of idle cores. This can be accomplished by having separate threads, or slaves, working on different fragments of the input genomic sequence and then having a master thread perform the initial fragmentation and the compilation of fragment outputs.
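The master and worker arrangement suggested in the last item can be sketched as follows. This is only an illustration under stated assumptions: counting fixed-length words stands in for the suffix tree repeat search, fragments overlap by k - 1 characters so that no word is cut at a fragment boundary, and all function names are hypothetical. Python's standard `concurrent.futures` thread pool is used to mirror the thread-based description; in CPython, CPU-bound work would typically need a process pool to see a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fragment(sequence, size, k):
    """Master step: split the input into fragments overlapping by k - 1."""
    return [sequence[s:s + size + k - 1] for s in range(0, len(sequence), size)]


def count_words(frag, k):
    """Worker step (stand-in for the repeat search): count length-k words."""
    return Counter(frag[i:i + k] for i in range(len(frag) - k + 1))


def parallel_word_count(sequence, size, k, workers=4):
    """Fragment the input, process fragments in workers, compile the outputs."""
    fragments = fragment(sequence, size, k)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker handles one fragment; the master merges partial counts.
        for partial in pool.map(lambda f: count_words(f, k), fragments):
            total.update(partial)
    return total
```

Because each word occurrence starts in exactly one fragment, the compiled counts equal those of a single pass over the whole sequence, for example `parallel_word_count("ACGTACGTAC", 4, 3) == count_words("ACGTACGTAC", 3)`.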


REFERENCES

[1] Human Genome Management Information System (HGMIS), “About the Human Genome Project,” Oak Ridge National Laboratory, December 07, 2005. [Online]. Available: http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml. [Accessed: Dec. 16, 2007].

[2] “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project,” Nature, vol. 447, no. 7146, pp. 799-816, June 2007.

[3] I. Rigoutsos, T. Huynh, K. Miranda, A. Tsirigos, A. McHardy, and D. Platt, “Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes,” PNAS, vol. 103, no. 17, pp. 6605-6610, April 2006.

[4] X. Xie, J. Lu, E. J. Kulbokas, T. R. Golub, V. Mootha, K. Lindblad-Toh, E. S. Lander, and M. Kellis, “Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals,” Nature, vol. 434, pp. 338-345, February 2005.

[5] G. Spencer, “Researchers Expand Efforts to Explore Functional Landscape of the Human Genome,” National Institutes of Health, Oct. 9, 2007. [Online]. Available: http://www.genome.gov/26023194. [Accessed: Dec. 16, 2007].

[6] C. Sabatti, L. Rohlin, K. Lange, and J. C. Liao, “Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites,” Bioinformatics, vol. 21, no. 7, pp. 922-931, April 2005.


[7] S. Sinha and M. Tompa, “Discovery of novel transcription factor binding sites by statistical overrepresentation,” Nucleic Acids Res, vol. 30, no. 24, pp. 5549-5560, December 2002.

[8] G. Wang, T. Yu, and W. Zhang, “Wordspy: identifying transcription factor binding motifs by building a dictionary and learning a grammar,” Nucleic Acids Res, vol. 33, no. Web Server issue, pp. 412-416, July 2005.

[9] D. W. Mount, Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, September 2004.

[10] M. Tompa, N. Li, T. L. Bailey, G. M. Church, B. D. De Moor, E. Eskin, A. V. Favorov, M. C. Frith, Y. Fu, J. J. Kent, V. J. Makeev, A. A. Mironov, W. S. Noble, G. Pavesi, G. Pesole, M. Régnier, N. Simonis, S. Sinha, G. Thijs, J. v. van Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, and Z. Zhu, “Assessing computational tools for the discovery of transcription factor binding sites,” Nature Biotechnology, vol. 23, no. 1, pp. 137-144, January 2005.

[11] B. Glick and J. Pasternak, Molecular Biotechnology: Principles and Applications of Recombinant DNA, 3rd ed. Washington D.C.: American Society for Microbiology, 2003.

[12] S. Rogic, A. K. Mackworth, and F. B. Ouellette, “Evaluation of gene-finding programs on mammalian sequences,” Genome Res., vol. 11, no. 5, pp. 817-832, May 2001.

[13] D. W. Mount, Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, September 2004.

[14] C. Mathé, M. F. Sagot, T. Schiex, and P. Rouzé, “Current methods of gene prediction, their strengths and weaknesses,” Nucleic Acids Res, vol. 30, no. 19, pp. 4103-4117, October 2002.

[15] I. Meyer and R. Durbin, “Gene structure conservation aids similarity based gene prediction,” Nucleic Acids Res, vol. 32, no. 2, pp. 776-783, 2004.

[16] S. Burden, Y. Lin, and R. Zhang, “Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences,” Bioinformatics, vol. 21, no. 5, pp. 601-607, 2005.

[17] S. Sonnenburg, A. Zien, and G. Rätsch, “ARTS: accurate recognition of transcription starts in human,” Bioinformatics, vol. 22, no. 14, pp. 472-480, July 2006.

[18] M. Tipping, “The relevance vector machine,” in Advances in Neural Information Processing Systems, Eds.: S. Solla, T. Leen, and K. Muller, vol. 12, pp. 652-658, 2000.

[19] V. B. Bajic, S. L. Tan, Y. Suzuki, and S. Sugano, “Promoter prediction analysis on the whole human genome,” Nature Biotechnology, vol. 22, no. 11, pp. 1467-1473, November 2004.

[20] Nature, “Reviews glossary,” Nature. [Online]. Available: http://www.nature.com/nri/journal/v4/n7/glossary/nri1392_glossary.html. [Accessed: April 16, 2008].

[21] Q. Zhou and W. H. Wong, “Cismodule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling,” PNAS, vol. 101, no. 33, pp. 12114-12119, August 2004.


[22] P. D'Haeseleer, “How does DNA sequence motif discovery work?” Nature Biotechnology, vol. 24, no. 8, pp. 959-961, August 2006.

[23] G. Pavesi, G. Mauri, and G. Pesole, “An algorithm for finding signals of unknown length in DNA sequences,” Bioinformatics, vol. 17, Suppl. 1, pp. S207-S214, 2001.

[24] H. Bussemaker, H. Li, and E. Siggia, “From the cover: Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis,” PNAS, vol. 97, no. 18, pp. 10096-10100, August 2000.

[25] A. Meynert and E. Birney, “Picking pyknons out of the human genome,” Cell, vol. 125, no. 5, pp. 836-838, June 2006.

[26] I. Rigoutsos and A. Floratos, “Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm,” Bioinformatics, vol. 14, no. 1, pp. 55-67, 1998.

[27] M. Wolinsky, “An experimental implementation of the TEIRESIAS algorithm,” Stanford University Biochemistry 218, 1999. [Online]. Available: http://biochem218.stanford.edu/Projects%201999/Wolinsky.pdf. [Accessed: Jan. 24, 2008].

[28] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.

[29] P. Weiner, “Linear pattern matching algorithm,” in 14th Annual IEEE Symposium on Switching and Automata Theory, 1973, pp. 1-11.

[30] E. McCreight, “A space-economical suffix tree construction algorithm,” Journal of the ACM, vol. 23, no. 2, pp. 262-272, April 1976.


[31] E. Ukkonen, “On-line construction of suffix trees,” Algorithmica, vol. 14, no. 3, pp. 249-260, September 1995.

[32] S. Kurtz and B. Balkenhol, “Space efficient linear time computation of the Burrows and Wheeler-transformation,” in Numbers, Information and Complexity, Eds.: I. Althöfer, N. Cai, G. Dueck, L. Khachatrian, M. Pinsker, A. Sarközy, I. Wegener, and Z. Zhang. Netherlands: Kluwer Academic Publishers, 2000, pp. 375-383.

[33] S. Kurtz and C. Schleiermacher, “REPuter: fast computation of maximal repeats in complete genomes,” Bioinformatics, vol. 15, no. 5, pp. 426-428, 1999.

[34] S. Kurtz, “Reducing the space requirement of suffix trees,” Softw. Pract. Exper., vol. 29, no. 13, pp. 1149-1171, November 1999.

[35] B. Phoophakdee and M. J. Zaki, “Genome-scale disk-based suffix tree indexing,” in SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data. New York, NY, USA: ACM Press, 2007, pp. 833-844.

[36] A. J. Cuticchia, R. Ivarie, and J. Arnold, “The application of Markov chain analysis to oligonucleotide frequency prediction and physical mapping of Drosophila melanogaster,” Nucleic Acids Research, vol. 20, no. 14, pp. 3651-3657, 1992.

[37] Nature, “Reviews glossary,” Nature. [Online]. Available: http://www.nature.com/nrg/journal/v1/n2/glossary/nrg1100_126a_glossary.html. [Accessed: April 16, 2008].

[38] Journal of Clinical Oncology, “Glossary,” American Society of Clinical Oncology, 2008. [Online]. Available: http://jco.ascopubs.org/cgi/glossarylookup?lookup=by_letter&letter=S. [Accessed: April 16, 2008].


[39] M. Singer, “SINEs and LINEs: highly repeated short and long interspersed sequences in mammalian genomes,” Cell, vol. 28, no. 3, pp. 433-434, 1982.

[40] G. Georgiev, Y. Ilyin, V. Chmeliauskaite, A. Ryskov, D. Kramerov, K. Skryabin, A. Krayev, E. Lukanidin, and M. Grigoryan, “Mobile dispersed genetic elements and other middle repetitive DNA sequences in the genomes of Drosophila and mouse: transcription and biological significance,” Cold Spring Harbor Symposia on Quantitative Biology, vol. 45, pt. 2, pp. 641-654, 1981.

[41] R. Shivdasani, “MicroRNAs: regulators of gene expression and cell differentiation,” Blood, August 2006.

[42] Sanger Institute, “miRBase Sequence Download,” Sanger Institute. [Online]. Available: http://microrna.sanger.ac.uk/sequences/ftp.shtml. [Accessed: July 17, 2007].

[43] Genetic Information Research Institute, “Repbase,” Genetic Information Research Institute. [Online]. Available: http://www.girinst.org/repbase/index.html. [Accessed: July 17, 2007].

[44] L. Parnell, “AtRepBase,” Cold Spring Harbor Laboratory, April 18, 1999. [Online]. Available: http://nucleus.cshl.org/protarab/AtRepBase.htm. [Accessed: Nov. 6, 2007].

[45] J. Kaiser, “First global sequencing effort begins,” Science, vol. 274, no. 5284, p. 30, October 1996.

[46] European Bioinformatics Institute, “Integr8 : A.thaliana Genome Statistics,” European Bioinformatics Institute. [Online]. Available: http://www.ebi.ac.uk/integr8/OrganismStatsAction.do?orgProteomeId=3. [Accessed: Dec. 8, 2007].


[47] National Center for Biotechnology Information (NCBI), “The NCBI ftp site,” NCBI, March 25, 2007. [Online]. Available: ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana. [Accessed: Dec. 28, 2007].

[48] Wikipedia: The Free Encyclopedia, “Image: Arabidopsis thaliana.jpg,” Wikipedia, October 2004. [Online]. Available: http://en.wikipedia.org/wiki/Image:Arabidopsis_thaliana.jpg. [Accessed: Dec. 8, 2007].

[49] The Arabidopsis Information Resource (TAIR), “TAIR Downloads,” TAIR, Oct. 2, 2006. [Online]. Available: ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR7_genome_release/TAIR7_blastsets. [Accessed: Dec. 28, 2008].

[50] R. Davuluri, H. Sun, S. Palaniswamy, N. Matthews, C. Molina, M. Kurtz, and E. Grotewold, “Agris: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors,” BMC Bioinformatics, vol. 4, no. 25, June 2003.

[51] A. Cornish-Bowden, “Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984,” Nucleic Acids Res., vol. 13, no. 9, pp. 3021-3030, 1985.

[52] National Center for Biotechnology Information (NCBI), “The NCBI ftp site,” NCBI, March 25, 2007. [Online]. Available: ftp://ftp.ncbi.nih.gov/genomes/Caenorhabditis_elegans. [Accessed: April 21, 2007].


[53] National Center for Biotechnology Information (NCBI), “The NCBI ftp site,” NCBI, March 25, 2007. [Online]. Available: ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/. [Accessed: April 21, 2007].

[54] Human Genome Management Information System (HGMIS), “The Science Behind the Human Genome Project,” Oak Ridge National Laboratory, March 26, 2008. [Online]. Available: http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml. [Accessed: March 21, 2008].

[55] J. Correia, Y. Miranda, N. Leonard, and R. Ulevitch, “SGT1 is essential for Nod1 activation,” PNAS, vol. 104, no. 16, pp. 6764-6769, April 2007.

[56] Scripps Research Institute, “Study shows humans and plants share common regulatory pathway,” PHYSORG, April 9, 2007. [Online]. Available: http://www.physorg.com/news95358652.html. [Accessed: March 15, 2008].

[57] IBM Research, “Pyknon Set Download Page,” IBM. [Online]. Available: http://cbcsrv.watson.ibm.com/pyknons.html. [Accessed: Feb. 5, 2008].