Modeling and Analysis of Regulatory Elements in Arabidopsis thaliana from Annotated Genomes and Gene Expression Data

Amrita Pati

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE
in
Computer Science and Applications

Lenwood S. Heath, Ph.D., Chairman
Ruth Grene, Ph.D.
T. M. Murali, Ph.D.

July, 2005
Blacksburg, Virginia

Copyright 2005, Amrita Pati

ABSTRACT

Modeling of cis-elements in the upstream regions of genes is a challenging computational problem. A set of regulatory motifs present in the promoters of a set of genes can be modeled by a biclique. Combinations of cis-elements play a vital role in ensuring that the correct co-action of transcription factors binding to the gene promoter results in appropriate gene expression in response to various stimuli. Geometrical and spatial constraints in transcription factor binding also impose restrictions on the order and separation of cis-elements. Not all regulatory elements that coexist are biologically significant. If the set of genes in which a set of regulatory elements co-occurs is tightly correlated with respect to gene expression data over a set of treatments, the regulatory element combination can be biologically directed.

The system developed in this work, XcisClique, consists of a comprehensive infrastructure for annotated genome and gene expression data for Arabidopsis thaliana. XcisClique models cis-regulatory elements as regular expressions and detects maximal bicliques of genes and motifs, called itemsets. An itemset consists of a set of genes (called a geneset) and a set of motifs (called a motifset) such that every motif in the motifset occurs in the promoter of every gene in the geneset.
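The itemset definition can be illustrated with a minimal Python sketch. It is not the XcisClique implementation (the thesis pipeline is written in Perl and MATLAB), and it does not enumerate all maximal bicliques; it only checks the defining property and finds the largest geneset for a given motifset. The gene names, promoter sequences, motif names, and regular expressions are all invented for illustration.

```python
import re

# Hypothetical toy data: gene IDs mapped to made-up promoter sequences.
promoters = {
    "geneA": "TTGACGTGGCAAGAATTTTCTAGAA",
    "geneB": "ACGTGTCCGAAGCTTCTAGAAGG",
    "geneC": "GGGTTTAAACCCGGG",
}

# Hypothetical motifs, each modeled as a regular expression.
motifs = {
    "ACGT-core": r"ACGT[GC]",
    "HSE-like": r"GAA..TTC",
}


def occurs(motif_regex, promoter):
    """Return True if the motif matches anywhere in the promoter."""
    return re.search(motif_regex, promoter) is not None


def is_itemset(geneset, motifset):
    """Check the biclique property: every motif occurs in every gene's promoter."""
    return all(
        occurs(motifs[m], promoters[g]) for g in geneset for m in motifset
    )


def geneset_for(motifset):
    """Return the largest geneset whose promoters contain every motif in motifset."""
    return {
        g
        for g, seq in promoters.items()
        if all(occurs(motifs[m], seq) for m in motifset)
    }


if __name__ == "__main__":
    motifset = {"ACGT-core", "HSE-like"}
    geneset = geneset_for(motifset)
    print(sorted(geneset), is_itemset(geneset, motifset))
```

In this toy data, geneA and geneB contain both motifs and geneC contains neither, so the two genes and the two motifs form a biclique that cannot be extended by another gene.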
XcisClique differs from existing tools of the same kind in that it offers a common platform for the integration of sequence and gene expression data. Itemsets identified by XcisClique are not only evaluated for statistical over-representation in sequence data, but are also examined with respect to the expression patterns of the corresponding geneset. Thus, the results produced are biologically directed. XcisClique is also the only tool of its kind for Arabidopsis thaliana, and it can be used for other organisms when appropriate sequence, expression, and regulatory element data are available. A web interface to a subset of the functionality, the source code, and supplemental material are available online at http://bioinformatics.cs.vt.edu/~xcisclique.

ACKNOWLEDGMENTS

I express my sincere gratitude to my advisor, Dr. Lenwood S. Heath, for being a constant source of inspiration, support, and guidance. I thank him for patiently listening to all my ideas, meticulously going over cumbersome results, giving valuable direction to my work, and always being there to talk. I am also thankful to my committee members, Dr. Ruth Grene and Dr. T. M. Murali. Dr. Grene provided much-needed biological insight at all stages of this work. Dr. Murali understood my work with the briefest of meetings, had a solution ready for almost any dilemma, and pointed me to the appropriate places to look for more information. I thank the Department of Computer Science, Virginia Tech, for supporting me through my first year of graduate study. I would like to acknowledge National Science Foundation Grant ITR-0219322. Thanks to the National Center for Biotechnology Information, Nottingham database, and Plant Acting cis-Element database for the databases used. Thanks are due to Allan for listening patiently to all my technical problems and helping me work them out. I also thank Cecilia for her help in interpreting the results.
Thanks to Animesh, Kiran, my roommates, everybody in Avataar, and all my friends for being there and making graduate school fun. Finally, I thank my parents Sanjukta and Aloke Pati for their love and understanding.

TABLE OF CONTENTS

1 Introduction
2 Preliminaries
    2.1 Definitions
        2.1.1 Biological Terms
        2.1.2 Computational Terms and Concepts
        2.1.3 Statistical Terms and Concepts
    2.2 Transcriptional Regulation
        2.2.1 Basal Promoter
        2.2.2 Proximal Promoter
        2.2.3 Distal Promoter
3 Promoter Analysis
    3.1 Promoter Discovery
    3.2 Motif Discovery
        3.2.1 Enumerative Approaches
        3.2.2 Probabilistic Methods
4 The Regulatory Biclique problem
    4.1 Computational Problem Formulation and Solution
    4.2 Motivation
    4.3 Identifying Putative Co-Regulated Genes
    4.4 Combinatorial Analysis Programs
    4.5 Solution to The Regulatory Biclique problem
5 Data Sources and Databases
    5.1 Sequence Data
    5.2 Gene Expression Data
    5.3 Cis-Element Data Sources
6 Probabilistic Methods
    6.1 Correlation Methods
        6.1.1 Pearson's Correlation Coefficient
        6.1.2 Spearman's Rank Correlation
    6.2 The Hypergeometric Distribution
    6.3 The Chi-Square Test of Independence
7 XcisClique Process Flow
    7.1 Modeling Gene Expression Data
    7.2 XcisClique Inputs
    7.3 Analysis of Over-representation of Individual Motifs
    7.4 Identification of Itemsets
        7.4.1 Identification of Significant Itemsets
    7.5 Integrating Itemset and Expression Data
        7.5.1 Using p-values of Correlation
        7.5.2 Sum of Absolute Values of Correlation (SAV): A New Statistic
        7.5.3 Calculating Final p-values
    7.6 System Requirements
8 Biological Case Studies
    8.1 Heat Shock Response
        8.1.1 Heat Shock Protein Families
        8.1.2 Heat Shock Regulation
        8.1.3 Regular Expressions for HSEs
        8.1.4 Results of Combinatorial Analysis using XcisClique
    8.2 Cold Long Term Up-regulated Genes in Metabolism
    8.3 Cold Long Term Down-regulated genes in Metabolism
    8.4 Analysis of Senescence genes involved in Stress, Pathogenicity, and Secondary Metabolites
9 Conclusions and Future Directions
A cis-Elements used in XcisClique
B Slides used for retrieving Expression Data
C Perl Code to Retrieve Entrez Data Using E-Utilities
    C.1 Perl Code: createGeneDB.pl
    C.2 Perl Code: createProteinDB.pl
    C.3 Perl Code: ExtractATUpstream.pl
D Shell Scripts and Perl Code from the XcisClique Pipeline
    D.1 Perl Code: GetPromoter.pl
    D.2 Perl Code: SelectSequences.pl
    D.3 Perl Code: GetXUpstream.pl
    D.4 Perl Code: FindMotifs.pl
    D.5 Perl Code: Chi-Square.pl
    D.6 Shell Script: CountMotifs.sh
    D.7 Perl Code: MakeAprioriMatrix.pl
    D.8 Perl Code: Hypergeometric.pl
    D.9 Perl Code: Eval Itemset.pl
E Expression Data Analysis: Perl and MATLAB Code
    E.1 Perl Code: MakeExpressionVectors.pl
    E.2 Perl Code: CorrelateRepVectors.pl
    E.3 Perl Code: MakeAverageVectors.pl
    E.4 Perl Code: GeneCorrelate.pl
    E.5 Perl Code: SelectTreatmentVectors.pl
    E.6 MATLAB Code: GeneStat.m
    E.7 MATLAB Code: GeneCorrelateInter.m
    E.8 MATLAB Code: GeneCorrelateWithGenome.m
    E.9 Perl Code: copyCorrstoDB.pl
    E.10 MATLAB Code: CallSimulate.m
    E.11 MATLAB Code: simulate.m
    E.12 Perl Code: getSDistribution.pl

LIST OF FIGURES

2.1 Organization of the eukaryotic gene
2.2 Transcription of DNA into mRNA
2.3 Example of a biclique
2.4 Organization of the eukaryotic promoter
2.5 Transcription in eukaryotes
4.1 Curation of yeast regulatory motifs in Pilpel et al. [2001]
5.1 Entity-Relationship diagram of the POPS database
5.2 POPS database tables-I
5.3 POPS database tables-II
5.4 POPS database tables-III
5.5 POPS database tables-IV
5.6 POPS database tables-V
6.1 The Hypergeometric model
7.1 XcisClique schematic
7.2 Distribution of ρ ...
