EVOLUTIONARILY CONSERVED REGULATORY PROGRAMS By
Total Page:16
File Type:pdf, Size:1020Kb
EVOLUTIONARILY CONSERVED REGULATORY PROGRAMS by Tae-Jun Andrew Kwon M.Sc., The University of British Columbia, 2003 B.Sc., The University of British Columbia, 2000 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Genetics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August 2011 © Tae-Jun Andrew Kwon, 2011 Abstract Despite the diversity of metazoans, common biochemical systems and structures can be found in distinct taxonomic groups. The development and formation of metazoan tissues and structures has been well researched, but their regulatory mechanisms are not understood well. To this end, we implemented bioinformatics tools regulatory mechanism analysis and applied them to study regulatory program conservation with an emphasis on muscle development. We first performed a genome-wide scan for muscle-specific cis-regulatory modules (CRMs) using three computational prediction programs. Based on the predictions, 339 candidate CRMs were tested in cell culture with NIH3T3 fibroblasts and C2C12 myoblasts for capacity to direct selective reporter gene expression to differentiated myotubes. A subset of 19 CRMs validated as functional in the assay. The rate of predictive success reveals striking limitations of computational CRM discovery. Motif- based methods performed no better than predictions based only on sequence conservation. Analysis of the properties of the functional sequences relative to inactive sequences identifies nucleotide sequence composition and ChIP-Seq evidence as important characteristics to incorporate in future methods for improved predictive specificity. In studying the transcriptional regulation, motif enrichment analysis of co-expressed genes is often employed to determine mediating transcription factors. We built oPOSSUM-3, a web-based software system for identification of enriched transcription factor binding sites (TFBS) and TFBS families in DNA sequences of co-expressed genes and sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of published data demonstrates the capacity for oPOSSUM-3 to identify mediating TFs for co-regulated genes. ii Studies have shown that TF binding profiles tend to be highly conserved over long evolutionary distances. In large-scale public genome annotation projects, such as modENCODE, transcriptional regulation data is compiled for comparative genomics research. Using the oPOSSUM-3 system and published data, we performed comparative analyses of the regulatory programs across evolutionarily divergent species, including human, fruit fly, and nematode, and examined the extent of conservation in major regulatory programs. The thesis research provides new approaches to computational analysis of DNA sequences and insights into the analysis of transcription regulation across the phylogenetic spectrum. iii Preface Chapter 2 is based on work conducted in collaboration with Alice Chou. I collected the pre-annotated regulatory regions from literature, and developed the software and the protocols for the genome-wide computational predictions. Alice Chou designed the experimental protocols for the validation of the computational predictions, and we jointly performed the validation experiments. David J. Arenillas designed the oligonucleotides used for PCR in the experiments. I performed the computational analysis of the results and wrote the whole manuscript. Dr. Wasserman supervised the work and revised the manuscript, as well as contributing to the development of the ideas. A version of chapter 2 has been submitted for publication and is currently undergoing requested revisions. Chapter 3 is based on work conducted in collaboration with David J. Arenillas and Rebecca Worsley Hunt. I was responsible for designing the TFBS clustering algorithm and preparing the TFBS clusters used by the oPOSSUM system, as well as designing routines for nematode operon analysis. David J. Arenillas and I jointly participated in the software development. Rebecca Worsley Hunt was responsible for studying the effects of sequence composition on the results produced by the oPOSSUM system. I wrote the manuscript, with contributions from Rebecca Worsley Hunt for the sections pertaining to her work. David J. Arenillas produced two figures that were used in the manuscript. Dr. Wasserman supervised the work and revised the manuscript, as well as contributing to the development of the ideas. A version of chapter 3 has been submitted for publication and is currently under review. The research described in Chapter 4 was the product of my efforts. Dr. Wasserman supervised the work and revised the manuscript, as well as contributing to the development of the ideas. iv Table of Contents Abstract ............................................................................................................................................. ii Preface .............................................................................................................................................. iv List of Tables .................................................................................................................................. ix List of Figures ................................................................................................................................. xi List of Abbreviations ................................................................................................................. xiv Acknowledgements .................................................................................................................... xvi 1 Introduction ............................................................................................................................. 1 1.1 Metazoan Tissue Development ................................................................................................. 3 1.1.1 Tissue and Structure Conservation among Metazoans ........................................................... 4 1.2 Gene Regulation ............................................................................................................................. 5 1.2.1 Promoter and Regulatory Regions ................................................................................................... 6 1.2.2 Transcription Factors ............................................................................................................................. 7 1.2.3 MicroRNA .................................................................................................................................................... 8 1.2.4 Chromatin and Epigenetics .................................................................................................................. 9 1.3 Analysis of Gene Expression ................................................................................................... 10 1.3.1 DNA Microarrays ................................................................................................................................... 11 1.3.2 Serial Analysis of Gene Expression (SAGE) ................................................................................ 12 1.3.3 RNA-Seq .................................................................................................................................................... 12 1.3.4 Differential Gene Expression Analysis ......................................................................................... 13 1.4 Experimental Methods for Identification of Regulatory Regions .............................. 14 1.4.1 Earlier Methods ..................................................................................................................................... 14 1.4.2 High Throughput Methods ................................................................................................................ 15 1.5 Comparative Genomics ............................................................................................................. 18 1.5.1 Homology: Orthologs and Paralogs ............................................................................................... 19 1.5.2 Sequence Alignment ............................................................................................................................ 20 1.5.3 Evolutionary Constraint Measures ................................................................................................ 20 1.6 Public Repositories of Genomic Data ................................................................................... 21 1.6.1 Public Genome Databases ................................................................................................................. 21 v 1.6.2 Ortholog Databases .............................................................................................................................. 22 1.6.3 Transcription Factor and Regulatory Region Databases ..................................................... 24 1.7 Bioinformatics Approaches for Regulatory Region Analysis ...................................... 25 1.7.1 TFBS Representation ........................................................................................................................... 26 1.7.2 Motif Discovery ...................................................................................................................................... 29 1.7.3 Phylogenetic Footprinting ................................................................................................................