Bioinformatics Tools for Finding the Vocabularies of Genomes: A Thesis

Bioinformatics Tools for Finding the Vocabularies of Genomes

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Eric D.C. Petri
August 2008

This thesis titled Bioinformatics Tools for Finding the Vocabularies of Genomes by ERIC D.C. PETRI has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie R. Welch
Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

ABSTRACT

PETRI, ERIC D.C., M.S., August 2008, Computer Science
Bioinformatics Tools for Finding the Vocabularies of Genomes (96 pp.)
Director of Thesis: Lonnie R. Welch

More organisms are having their genomes sequenced now than in the past, creating a greater demand from the biological community to better understand the exact biological mechanisms encoded within the genomic blueprint of each organism. While biologists continue to analyze genomes and to identify new functional elements within organisms, several regions of the genome are often overlooked, such as non-protein-encoding regions, introns, and intergenic regions. Several bioinformatics algorithms exist to discover functional elements (referred to in this thesis as words) in these regions.

In this thesis, a functional genomics toolkit for finding the functional words of genomes (vocabularies) is presented and described. Currently available vocabulary-based tools run into limitations when analyzing large input sequences. To overcome this limitation, a scalable word searching approach is presented and tested on genomic sequences with file sizes up to 2 gigabytes (GB).

In addition, the toolkit is used to provide a genome-wide characterization of the Arabidopsis thaliana genome in terms of over- and under-represented repeats within specific genome regions, and to search for similarities between putative functional elements in the human genome and Arabidopsis thaliana, thereby producing a putative vocabulary. The difficulties encountered during the research process and suggestions for future work are also discussed.

Approved: Lonnie R. Welch, Professor of Electrical Engineering and Computer Science

To my wife, Andrea, and son, Noah, for their unending love and support.

ACKNOWLEDGMENTS

I would like to thank Dr. Lonnie Welch for introducing me to a new application of Computer Science in the area of Bioinformatics and for his continued guidance and encouragement through my graduate career and the completion of this thesis. I would also like to extend a thank you to Dr. Klaus Ecker, Dr. Frank Drews, and Dr. Sarah Wyatt for participating in the ongoing Bioinformatics research projects and for serving as members of my thesis committee.

Thank you also to past and present members of the Bioinformatics research group, including Dr. Dazhang Gu, Jens Lichtenberg, Joshua Welch, Mohit Alam, Chase Nelson, Kyle Kurz, Josiah Seaman, Kaiyu Shen, and Xiaoyu Liang, for all of our collaboration in developing and extending the WordSeeker Functional Genomics Toolkit. In the initial deployment of WordSeeker, Dr. Gu implemented the word searching algorithm and word selection component, Joshua and Mohit implemented the word scoring component in addition to the Markov modeling, and Chase Nelson provided the early testing of WordSeeker on actual biological data. With the contribution of this thesis, a scalable approach has been incorporated into WordSeeker to allow the same word discovery analysis on much larger input sequences.

I would also like to thank the Science and Technology Enrichment of Appalachia Middle-schools (STEAM) project and the principal investigators, Dr. Chang Liu, Dr. David Chelberg, and Dr. Teresa Franklin, for the opportunity to develop educational games to support Science comprehension and to work with area middle-schoolers to promote furthering one’s education. I would also like to thank the other STEAM Fellows and my partner teacher, Mrs. Rebecca Hartline, for their partnership and collaboration during my two years of involvement with the STEAM project.

I would also like to thank Ohio University and the School of Electrical Engineering and Computer Science (EECS) for fostering my undergraduate and graduate career with challenging applications of Computer Science and in-depth problem solving concepts, and for increasing my technology skill set, specifically in the areas of programming and software design. Over the past six years, I have enjoyed all of the experiences I have shared with the EECS faculty and administrative staff.

Thank you also to my parents, Carol and Dennis H. Petri, for raising me and instilling within me hard-working values and determination. Thank you also to my brother, Dennis B. Petri, for his support. And most of all, thank you to my wonderful wife, Andrea Petri, my son, Noah Petri, and furry son, Carter Petri, for their love and understanding during the research and writing of this thesis. Without their involvement, I would not be able to achieve all that I am capable of within my life.
TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Overview of Thesis
Chapter 2: Discovery of Functional Elements in Bioinformatics
  2.1 Gene Discovery
  2.2 Promoter Discovery
  2.3 Cis-regulatory Element Discovery
  2.4 Vocabulary Discovery of Functional Elements
Chapter 3: WordSeeker Functional Genomics Toolkit
  3.1 Motivation
  3.2 Overview of WordSeeker Functional Genomics Toolkit
  3.3 Word Searching
    3.3.1 SignatureSeeker
    3.3.2 TEIRESIAS
    3.3.3 Suffix Trees
  3.4 Word Scoring
  3.5 Word Selection
Chapter 4: Characterization of Arabidopsis thaliana Genome
  4.1 Description of Arabidopsis thaliana Genome
  4.2 Vocabulary Generation of Genome Regions
    4.2.1 Experiment Methodology
    4.2.2 Vocabulary of 5' UTR
    4.2.3 Vocabulary of Coding Region
    4.2.4 Vocabulary of 3' UTR
    4.2.5 Vocabulary of Intergenic Region
    4.2.6 Vocabulary of Introns
  4.3 Vocabulary Significance Analysis
    4.3.1 Comparison with Known Transcription Factor Binding Sites
Chapter 5: Scalable Word Searching Approach
  5.1 Scalable Approach
  5.2 Input Fragmentation
  5.3 Validation of Scalable Approach
  5.4 Methodology Qualifications
Chapter 6: Survey of Human Pyknons within Arabidopsis
  6.1 Motivation
  6.2 Experiment Methodology
  6.3 Results
Chapter 7: Conclusions
  7.1 Challenges
  7.2 Summary of Results and Findings
  7.3 Suggestions for Future Work
References

LIST OF TABLES

Table 4.1: Details of the Arabidopsis thaliana Genome
Table 4.2: File Sizes of Arabidopsis thaliana Chromosomes and Genome Regions
Table 4.3: IUPAC Nucleotide Base Alphabet
Table 5.1: Larger Genomic Test Sets
Table 6.1: Pyknon Matches with High Similarity Discovered in Arabidopsis Genome
Table 6.2: Pyknon Matches with High Similarity Discovered within Specific Regions of the Arabidopsis thaliana Genome

LIST OF FIGURES

Figure 2.1: Removal of introns by splicing of exons from a DNA sequence
Figure 3.1: Overview of WordSeeker execution
Figure 3.2: Execution of SignatureSeeker using a sliding window and oligonucleotide frequency table.
Figure 3.3: Demonstration of the convolution phase within TEIRESIAS
Figure 3.4: Suffix tree representation of the sequence TTCAGCAT
Figure 3.5: Example of maximal, supermaximal, and near-supermaximal repeats.
Figure 3.6: Pseudocode for finding maximal repeats in a suffix tree
Figure 3.7: Example calculations of zero, first and second order Markov chains
Figure 4.1: Picture of Arabidopsis thaliana
Figure 4.2: DNA sequence representation before and after transcription
Figure 4.3: Experimental procedure for the vocabulary generation of Arabidopsis thaliana genome regions.
Figure 4.4: Vocabulary of all words and the top 5% in the 5' UTR of Arabidopsis
Figure 4.5: Vocabulary of all words and the top 5% in the CDS of Arabidopsis
Figure 4.6: Vocabulary of all words
