Identifying Pattern-Defined Regulatory Islands in Mammalian Genomes

Tom H. Cheung*, Kristen K. B. Barthel, Yin Lam Kwan†, and Xuedong Liu‡

Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO 80309

Communicated by Marvin H. Caruthers, University of Colorado, Boulder, CO, May 1, 2007 (received for review March 15, 2007)

Identifying cis-regulatory regions in mammalian genomes is a key challenge toward understanding transcriptional regulation. However, identification and functional characterization of those regulatory elements governing differential gene expression has been hampered by the limited understanding of their organization and locations in genomes. We hypothesized that genes that are conserved across species will also display conservation at the level of their transcriptional regulation and that this will be reflected in the organization of cis-elements mediating this regulation. Using a computational approach, clusters of transcription factor binding sites that are absolutely conserved in order and in spacing across human, rat, and mouse genomes were identified. We term these regions pattern-defined regulatory islands (PRIs). We discovered that these sequences are frequently active sites of transcriptional regulation. These PRIs occur in ≈1.1% of the half-billion base pairs covered in the search and are located mainly in noncoding regions of the genome. We show that the premise of PRIs can be used to identify previously known and novel cis-regulatory regions controlling genes regulated by myogenic differentiation. Thus, PRIs may represent a fundamental property of the architecture of cis-regulatory elements in mammalian genomes, and this feature can be exploited to pinpoint critical transcriptional regulatory elements governing cell type-specific gene expression.

cis-regulatory elements | combinatorial regulation | computational prediction | myogenesis | transcription

Author contributions: T.H.C. and K.K.B.B. contributed equally to this work; T.H.C., K.K.B.B., and X.L. designed research; K.K.B.B. performed research; T.H.C. and Y.L.K. contributed new reagents/analytic tools; T.H.C., K.K.B.B., and X.L. analyzed data; and K.K.B.B. and X.L. wrote the paper.

The authors declare no conflict of interest.

Abbreviations: PRI, pattern-defined regulatory island; TFD, transcription factor database; GM, growing myoblasts; DM, differentiating myotubes.

*Present address: Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA 94305.

†Present address: Dharmacon Research, Lafayette, CO 80026.

‡To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0704028104/DC1.

© 2007 by The National Academy of Sciences of the USA

PNAS | June 12, 2007 | vol. 104 | no. 24 | 10116–10121 | www.pnas.org/cgi/doi/10.1073/pnas.0704028104

Analysis of the human genome unexpectedly revealed that close to 99% of the DNA base pairs do not code for proteins (1). Although the functions of this noncoding sequence remain obscure, an important function associated with noncoding regions is cis-acting regulation. Cis-regulatory sequences are sequences of DNA to which regulatory molecules, such as transcription factors and microRNAs, can bind to modify gene expression. There is speculation that one-third of the human genome may represent cis-regulatory sequences, and it is hypothesized that the degree of complexity of cis-regulatory sequences partially accounts for the complexity of the organism (2). Therefore, identifying and characterizing functional cis-regulatory elements is critical to understanding regulation of gene expression.

The sequencing of many genomes has enabled comparative genomics approaches to annotate functions of genomic sequences. Indeed, exploiting phylogenetic conservation has proved to be a productive approach for finding cis-regulatory elements in less organismically complex species where the regulatory regions tend to be short and compact (3–10). Unfortunately, identifying cis-regulatory elements in vertebrate genomes is difficult because of the increased size and complexity of their genomes. For example, functional cis-regulatory elements typically consist of short stretches of DNA sequences (<500 bases) and are often scattered in the noncoding regions hundreds or thousands of base pairs away from the transcriptional start site (11). In addition, the increased size of the genome implies that an increased number of known binding-site sequence motifs will occur by random chance, even though only a fraction of them are likely to figure prominently in transcription regulation. Finally, although genomewide location analysis using ChIP-on-chip may ultimately elucidate many transcription factor binding sites in mammalian genomes, identification of cis-elements by this approach is impeded by the daunting size of the genomes and the one-by-one nature of probing transcription factor binding. Therefore, unraveling functional cis-regulatory elements buried in the genome remains a challenge because of the insufficient knowledge of the organization and architecture of regulatory DNA.

A defining feature for mammalian transcriptional regulation is combinatorial control, in which multiple transcription factors are required simultaneously to regulate gene expression (2, 12, 13). However, it is unclear what characteristics of functional cis-regulatory regions will facilitate their identification in the context of combinatorial control. A number of studies suggest that combinatorial regulation often relies on direct interactions between participating transcription factors and that these interactions depend on proper spatial orientation of transcription factors with respect to each other (2, 12, 13). In addition, there is significant conservation at the level of transcriptional regulation among human, mouse, and rat species and functional cis-regulatory elements frequently cluster in small regions of the genome (2, 12, 13). An offshoot of this observation is that the architecture of cis-regulatory sequence motifs will likewise be conserved. These fundamental features prompted us to develop a motif-centric computational algorithm to search for regions in mammalian genomes that harbor clusters of multiple transcription factor binding sites that are absolutely conserved with respect to order and spacing across three mammalian genomes (human, mouse, and rat).

Results

Development of a Method to Identify Conserved Cis-Regulatory Regions. We expect to find clusters of conserved binding sites to permit short-range interactions between transcription factors cooperating to regulate transcription. Moreover, we assume that there is selective pressure on the orientation of transcription factors with respect to each other. Consequently, we expect that binding-site sequences and the spacing between them are under purifying selection but that the identity of the intervening sequence is not. By similar reasoning, we also expect that the order in which the binding sites occur within a cluster is conserved. We term the genomic regions that satisfy these criteria pattern-defined regulatory islands (PRIs).

The algorithm for finding PRIs includes the following steps [supporting information (SI) Fig. 5]. First, we developed a custom curated version of the transcription factor database (TFD) (14, 15). The TFD contains experimentally characterized transcription factor binding sites. Our custom curation minimized redundancy and eliminated many of the poorly defined binding sites that could potentially affect [...]

Fig. 1. In silico analysis to determine the minimum number of transcription factor binding sites required to define a PRI with 99% confidence. The results are displayed in a histogram of the number of PRIs versus the number of binding sites per PRI.

[...] number of binding sites should be thought of as a computational index and one that could change as the customized TFD expands. With the current database, we computationally define a PRI as a region in the genome in which at least seven distinct binding sites are conserved in order and spacing across the three mammalian genomes. We then preprocessed 13,520 orthologous genes and found 7,453 genes with at least one PRI. The search corresponded to a coverage of ≈540,800,000 bases, or about one-sixth, of the human genome. Of this, we estimate that ≈6,154,852 bases, equal to 1.1% of the searched regions, are covered by PRIs. The mean PRI size is 335 bp, even though we placed no constraints on PRI size but did constrain number of binding sites. It is important to note that any apparent clustering of binding sites is a natural feature of the genomic organization and is not imposed artificially.

To ease exploration of the PRI database, we created a web-based interface (http://barcode.colorado.edu/pri/). (For more information relating to the web site, please see SI Text, and see SI Table 1 for a catalog of all binding sites conserved within PRIs to be examined in this study). Using the interface, we recovered many cis-regulatory regions in our database that have been reported in the literature, including well characterized regions from IL-2, myogenin, and CDC2.

To make a broader assessment of the biological relevance of PRIs, we surveyed 100 regulatory regions reported in the literature to ask how frequently PRIs in our database overlap with regions defined by functional studies. As shown in SI Table [...]
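The PRI criterion stated in the Results (at least seven distinct binding sites conserved in order and spacing across human, mouse, and rat) can be illustrated with a small Python sketch. This is not the authors' implementation, which is outlined in SI Fig. 5 and not reproduced here. It assumes that each binding site is a fixed string matched exactly on the forward strand and that the three orthologous regions are supplied as plain strings. The names `find_sites` and `find_islands`, the toy motifs, and the toy sequences are all hypothetical. The sketch groups sites by their positional offset between species: sites that share the same human-to-mouse and human-to-rat offsets necessarily have identical order and spacing in all three sequences, while the intervening sequence is free to differ.

```python
from collections import defaultdict

MIN_SITES = 7  # threshold given in the text for the current database


def find_sites(seq, motifs):
    """Return {(position, motif_name)} for every exact motif occurrence."""
    hits = set()
    for name, pattern in motifs.items():
        start = seq.find(pattern)
        while start != -1:
            hits.add((start, name))
            start = seq.find(pattern, start + 1)
    return hits


def positions_by_name(hits):
    """Index a set of (position, name) hits as name -> list of positions."""
    index = defaultdict(list)
    for pos, name in hits:
        index[name].append(pos)
    return index


def find_islands(human, mouse, rat, motifs, min_sites=MIN_SITES):
    """Return candidate islands as sorted lists of (human_position, name).

    Assumption of this sketch: sites sharing one (mouse, rat) offset pair
    form one candidate island; no limit is placed on island length.
    """
    human_hits = find_sites(human, motifs)
    mouse_idx = positions_by_name(find_sites(mouse, motifs))
    rat_idx = positions_by_name(find_sites(rat, motifs))

    groups = defaultdict(set)
    for pos, name in human_hits:
        for m_pos in mouse_idx[name]:
            for r_pos in rat_idx[name]:
                # Equal offsets imply equal spacing and equal order.
                groups[(m_pos - pos, r_pos - pos)].add((pos, name))

    islands = []
    for sites in groups.values():
        distinct = {name for _, name in sites}
        if len(distinct) >= min_sites:
            islands.append(sorted(sites))
    return sorted(islands)


if __name__ == "__main__":
    # Toy data: uppercase motifs, lowercase spacers that differ by species.
    toy_motifs = {"A": "TGACGT", "B": "CACGTG", "C": "GGGCGG"}
    toy_human = "aa" + "TGACGT" + "ccc" + "CACGTG" + "gggg" + "GGGCGG"
    toy_mouse = "ttttt" + "TGACGT" + "aga" + "CACGTG" + "tctc" + "GGGCGG"
    toy_rat = "TGACGT" + "ttt" + "CACGTG" + "aaaa" + "GGGCGG" + "cc"
    # min_sites is lowered to 3 only because the toy example has 3 motifs.
    print(find_islands(toy_human, toy_mouse, toy_rat, toy_motifs, min_sites=3))
```

In the toy example the three motifs keep the same spacing in every species even though the spacer letters differ, so they form one candidate island. Lengthening a single spacer in one species changes that species' offset for the downstream sites and breaks the island, which mirrors the requirement that spacing, but not intervening sequence, be conserved.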
