Downloaded from Ensembl V41 (6), All

IN SILICO APPROACHES TO INVESTIGATING MECHANISMS OF GENE REGULATION by SHANNAN JANELLE HO SUI B.Sc., The University of British Columbia, 2000 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Genetics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) March 2008 © Shannan Janelle Ho Sui, 2008 Abstract Identification and characterization of regions influencing the precise spatial and temporal expression of genes is critical to our understanding of gene regulatory networks. Connecting transcription factors to the cis-regulatory elements that they bind and regulate remains a challenging problem in computational biology. The rapid accumulation of whole genome sequences and genome-wide expression data, and advances in alignment algorithms and motif-finding methods, provide opportunities to tackle the important task of dissecting how genes are regulated. Genes exhibiting similar expression profiles are often regulated by common transcription factors. We developed a method for identifying statistically over- represented regulatory motifs in the promoters of co-expressed genes using weight matrix models representing the specificity of known factors. Application of our methods to yeast fermenting in grape must revealed elements that play important roles in utilizing carbon sources. Extension of the method to metazoan genomes via incorporation of comparative sequence analysis facilitated identification of functionally relevant binding sites for sets of tissue-specific genes, and for genes showing similar expression in large-scale expression profiling studies. Further extensions address alternative promoters for human genes and coordinated binding of multiple transcription factors to cis-regulatory modules. Sequence conservation reveals segments of genes of potential interest, but the degree of sequence divergence among human genes and their orthologous sequences varies widely. Genes with a small number of well-distinguished, highly conserved noncoding elements proximal to the transcription start site may be well-suited for targeted laboratory promoter characterization studies. We developed a “regulatory resolution” ii score to prioritize lists of genes for laboratory gene regulation studies based on the conservation profile of their promoters. Additionally, genome-wide comparisons of vertebrate genomes have revealed surprisingly large numbers of highly conserved noncoding elements (HCNEs) that cluster nearby to genes associated with transcription and development. To further our understanding of the genomic organization of regulatory regions, we developed methods to identify HCNEs in insects. We find that HCNEs in insects have similar function and organization as their vertebrate counterparts. Our data suggests that microsynteny in insects has been retained to keep large arrays of HCNEs intact, forming genomic regulatory blocks that surround the key developmental genes they regulate. iii Table of Contents Abstract............................................................................................................................... ii Table of Contents............................................................................................................... iv List of Tables ...................................................................................................................viii List of Figures.................................................................................................................... ix List of Abbreviations and Acronyms.................................................................................. x Acknowledgements........................................................................................................... xii Co-authorship Statement.................................................................................................. xiv Chapter 1: Introduction....................................................................................................... 1 1.1 Background and significance.................................................................................... 1 1.2 Structure of eukaryotic regulatory regions ............................................................... 3 1.2.1 The core promoter.............................................................................................. 3 1.2.2 Complexities in promoter architecture............................................................... 6 1.2.3 Enhancers........................................................................................................... 7 1.2.4 Chromatin structure and its effects on gene regulation ..................................... 8 1.2.5 Current strategies for improving our understanding of gene regulation.......... 10 1.3. Laboratory-based methods for regulatory region identification ............................ 10 1.3.1 Reporter constructs .......................................................................................... 11 1.3.2 DNA binding assays ........................................................................................ 12 1.3.3 In vitro selection .............................................................................................. 13 1.3.4 Transcript profiling to locate proximal promoters........................................... 13 1.4 Computational methods for regulatory element prediction .................................... 14 1.4.1 Repositories of gene regulatory information ................................................... 14 1.4.2 Computational identification of TFBSs........................................................... 16 1.5 Conservation analysis for the identification of regulatory regions......................... 20 1.5.1 Algorithms for multi-species sequence alignments......................................... 21 1.5.2 Identifying and measuring evolutionary constraint ......................................... 21 1.5.3 Highly conserved noncoding elements in metazoan genomes ........................ 24 1.6 Thesis overview and chapter objectives ................................................................. 26 1.7 References............................................................................................................... 30 Chapter 2: oPOSSUM: Identification of Over-represented Transcription Factor Binding Sites in Sets of Co-expressed Genes ................................................................................ 42 2.1 Introduction............................................................................................................. 42 2.2 Methods................................................................................................................... 45 2.2.1 Automated retrieval of human-mouse orthologs ............................................. 45 2.2.2 Phylogenetic footprinting ................................................................................ 46 2.2.3 Detection of TF binding sites........................................................................... 47 2.2.4 Discovery of over-represented binding sites ................................................... 47 2.2.5 NF-κB microarray experiment......................................................................... 49 2.2.6 Simulations using random sampling................................................................ 50 2.2.7 Parameter selection for validation studies ....................................................... 51 2.3 Results..................................................................................................................... 51 2.3.1 Validation using reference gene sets ............................................................... 52 2.3.2 Application to transcript profiling data............................................................ 56 iv 2.3.3 Specificity assessment ..................................................................................... 59 2.3.4 Noise tolerance ................................................................................................ 60 2.3.5 Web implementation........................................................................................ 62 2.3.6 The oPOSSUM application programming interface (API).............................. 63 2.4 Discussion............................................................................................................... 65 2.4.1 Performance ..................................................................................................... 65 2.4.2 Challenges........................................................................................................ 67 2.4.3 Utility............................................................................................................... 69 2.5 Conclusions............................................................................................................. 70 2.6 References............................................................................................................... 71 Chapter 3: Integrated Tools for Analysis of Regulatory Motif Over-representation ....... 75 3.1 Introduction............................................................................................................. 75 3.2 Methods................................................................................................................... 76 3.2.1 Over-representation analysis...........................................................................

Downloaded from Ensembl V41 (6), All

Tools for Reproducible Research Organizing Projects; Exploratory Data Analysis

A Guide to Teaching Data Science

Advanced Environmental Data Analysis Bren School of Environmental Science & Management

Dynamic Data in the Statistics Classroom

Using Github Classroom to Teach Statistics Arxiv:1811.02021V1 [Stat.OT] 5 Nov 2018

Jenny Bryan Wednesday, Nov. 28, 7 P.M

Hadley Wickham Elegant Graphics for Data Analysis Second Edition

Cluster Analysis for Microarray Data

Creating Optimal Conditions for Reproducible Data Analysis in R With

Gov 1005: Data David Kane Fall 2018: M/W 1:30 – 2:45

The Fivethirtyeight R Package: "Tame Data" Principles for Introductory Statistics and Data Science Courses

Years of Data Science