Identifying pattern-defined regulatory islands in mammalian genomes

Tom H. Cheung*, Kristen K. B. Barthel, Yin Lam Kwan†, and Xuedong Liu‡

Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO 80309

Communicated by Marvin H. Caruthers, University of Colorado, Boulder, CO, May 1, 2007 (received for review March 15, 2007)

10116–10121 | PNAS | June 12, 2007 | vol. 104 | no. 24 | GENETICS | www.pnas.org/cgi/doi/10.1073/pnas.0704028104

Identifying cis-regulatory regions in mammalian genomes is a key challenge toward understanding transcriptional regulation. However, identification and functional characterization of those regulatory elements governing differential expression has been hampered by the limited understanding of their organization and locations in genomes. We hypothesized that genes that are conserved across species will also display conservation at the level of their transcriptional regulation and that this will be reflected in the organization of cis-elements mediating this regulation. Using a computational approach, clusters of transcription factor binding sites that are absolutely conserved in order and in spacing across human, rat, and mouse genomes were identified. We term these regions pattern-defined regulatory islands (PRIs). We discovered that these sequences are frequently active sites of transcriptional regulation. These PRIs occur in ≈1.1% of the half-billion base pairs covered in the search and are located mainly in noncoding regions of the genome. We show that the premise of PRIs can be used to identify previously known and novel cis-regulatory regions controlling genes regulated by myogenic differentiation. Thus, PRIs may represent a fundamental property of the architecture of cis-regulatory elements in mammalian genomes, and this feature can be exploited to pinpoint critical transcriptional regulatory elements governing cell type-specific gene expression.

cis-regulatory elements | combinatorial regulation | computational prediction | myogenesis | transcription

Author contributions: T.H.C. and K.K.B.B. contributed equally to this work; T.H.C., K.K.B.B., and X.L. designed research; K.K.B.B. performed research; T.H.C. and Y.L.K. contributed new reagents/analytic tools; T.H.C., K.K.B.B., and X.L. analyzed data; and K.K.B.B. and X.L. wrote the paper.
The authors declare no conflict of interest.
Abbreviations: PRI, pattern-defined regulatory island; TFD, transcription factor database; GM, growing myoblasts; DM, differentiating myotubes.
*Present address: Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA 94305.
†Present address: Dharmacon Research, Lafayette, CO 80026.
‡To whom correspondence should be addressed.
This article contains supporting information online at www.pnas.org/cgi/content/full/0704028104/DC1.
© 2007 by The National Academy of Sciences of the USA

Analysis of the human genome unexpectedly revealed that close to 99% of the DNA base pairs do not code for proteins (1). Although the functions of this noncoding sequence remain obscure, an important function associated with noncoding regions is cis-acting regulation. Cis-regulatory sequences are sequences of DNA to which regulatory molecules, such as transcription factors and microRNAs, can bind to modify gene expression. There is speculation that one-third of the human genome may represent cis-regulatory sequences, and it is hypothesized that the degree of complexity of cis-regulatory sequences partially accounts for the complexity of the organism (2). Therefore, identifying and characterizing functional cis-regulatory elements is critical to understanding regulation of gene expression.

The sequencing of many genomes has enabled comparative genomics approaches to annotate functions of genomic sequences. Indeed, exploiting phylogenetic conservation has proved to be a productive approach for finding cis-regulatory elements in less organismically complex species where the regulatory regions tend to be short and compact (3–10). Unfortunately, identifying cis-regulatory elements in vertebrate genomes is difficult because of the increased size and complexity of their genomes. For example, functional cis-regulatory elements typically consist of short stretches of DNA sequences (<500 bases) and are often scattered in the noncoding regions hundreds or thousands of base pairs away from the transcriptional start site (11). In addition, the increased size of the genome implies that an increased number of known binding-site sequence motifs will occur by random chance, even though only a fraction of them are likely to figure prominently in transcription regulation. Finally, although genomewide location analysis using ChIP-on-chip may ultimately elucidate many transcription factor binding sites in mammalian genomes, identification of cis-elements by this approach is impeded by the daunting size of the genomes and the one-by-one nature of probing transcription factor binding. Therefore, unraveling functional cis-regulatory elements buried in the genome remains a challenge because of the insufficient knowledge of the organization and architecture of regulatory DNA.

A defining feature for mammalian transcriptional regulation is combinatorial control, in which multiple transcription factors are required simultaneously to regulate gene expression (2, 12, 13). However, it is unclear what characteristics of functional cis-regulatory regions will facilitate their identification in the context of combinatorial control. A number of studies suggest that combinatorial regulation often relies on direct interactions between participating transcription factors and that these interactions depend on proper spatial orientation of transcription factors with respect to each other (2, 12, 13). In addition, there is significant conservation at the level of transcriptional regulation among human, mouse, and rat species, and functional cis-regulatory elements frequently cluster in small regions of the genome (2, 12, 13). An offshoot of this observation is that the architecture of cis-regulatory sequence motifs will likewise be conserved. These fundamental features prompted us to develop a motif-centric computational algorithm to search for regions in mammalian genomes that harbor clusters of multiple transcription factor binding sites that are absolutely conserved with respect to order and spacing across three mammalian genomes (human, mouse, and rat).

Results

Development of a Method to Identify Conserved Cis-Regulatory Regions. We expect to find clusters of conserved binding sites to permit short-range interactions between transcription factors cooperating to regulate transcription. Moreover, we assume that there is selective pressure on the orientation of transcription factors with respect to each other. Consequently, we expect that binding-site sequences and the spacing between them are under purifying selection but that the identity of the intervening sequence is not. By similar reasoning, we also expect that the order in which the binding sites occur within a cluster is conserved. We term the genomic regions that satisfy these criteria pattern-defined regulatory islands (PRIs).

The algorithm for finding PRIs includes the following steps [supporting information (SI) Fig. 5]. First, we developed a custom curated version of the transcription factor database (TFD) (14, 15). The TFD contains experimentally characterized transcription factor binding sites. Our custom curation minimized redundancy and eliminated many of the poorly defined binding sites that could potentially affect the robustness of the algorithm. Second, we located all putative transcription factor binding sites from the TFD in the same orientation as the gene feature within 20 kb on either side of the start codon of all annotated gene orthologs in all three species [determined by the participation of a given gene in a homologene group as defined by the National Center for Biotechnology Information (16)]. This amounted to 13,520 sets of sequences. Third, we identified PRIs for each gene with a pattern-matching approach that locates clusters of binding sites whose order and spacing are conserved across the three genomes. Spacing conservation was determined by calculating the difference in genomic position of a given binding site in the human genome with respect to its position in the mouse and rat genomes. We called this set of numbers the genomic position offset (GPO). Binding sites with the same GPO in one 40-kb region were grouped into a PRI.

To define a minimum number of binding sites within a PRI in an effort to ensure that the PRIs found are statistically significant, we evaluated the likelihood of different numbers of binding sites occurring in concert in random sequences. Specifically, we selected the 1,000 sequence sets in our database with PRIs containing the greatest number of binding sites and extracted the 40-kb region centered about the start codon of each of these genes in each genome. Each of these sequences was then randomized while maintaining GC distribution and subjected to the PRI search algorithm. This procedure was repeated 100 times per gene. In total, we obtained 100,000 sets of randomized sequences that were searched for PRIs (see SI Text for further details).

A histogram depicting the distributions of PRIs in the scrambled and natural sequences is shown in Fig. 1. We found that PRIs from the natural sequences generally had more binding sites than those from the scrambled sequences. Only 1% of PRIs from scrambled sequences contained seven or more binding sites, implying that we could be 99% confident that PRIs with seven or more binding sites from natural sequences are unlikely to occur by chance and likely represent a cis-regulatory region. Although it is quite likely that functional PRIs exist with fewer binding sites, we wanted to set a conservative threshold to account for any residual redundancy in the TFD. Therefore, the number of binding sites should be thought of as a computational index and one that could change as the customized TFD expands. With the current database, we computationally define a PRI as a region in the genome in which at least seven distinct binding sites are conserved in order and spacing across the three mammalian genomes. We then preprocessed 13,520 orthologous genes and found 7,453 genes with at least one PRI. The search corresponded to a coverage of ≈540,800,000 bases, or about one-sixth, of the human genome. Of this, we estimate that ≈6,154,852 bases, equal to 1.1% of the searched regions, are covered by PRIs. The mean PRI size is 335 bp, even though we placed no constraints on PRI size but did constrain number of binding sites. It is important to note that any apparent clustering of binding sites is a natural feature of the genomic organization and is not imposed artificially.

Fig. 1. In silico analysis to determine the minimum number of transcription factor binding sites required to define a PRI with 99% confidence. The results are displayed in a histogram of the number of PRIs versus the number of binding sites per PRI.

To ease exploration of the PRI database, we created a web-based interface (http://barcode.colorado.edu/pri/). (For more information relating to the web site, please see SI Text, and see SI Table 1 for a catalog of all binding sites conserved within PRIs to be examined in this study). Using the interface, we recovered many cis-regulatory regions in our database that have been reported in the literature, including well characterized regions from IL-2, myogenin, and CDC2.

To make a broader assessment of the biological relevance of PRIs, we surveyed 100 regulatory regions reported in the literature to ask how frequently PRIs in our database overlap with regions defined by functional studies. As shown in SI Table 2, documented functional transcription regulatory elements can be matched with the PRI database by using the default search on the web site in 54 of the 100 cases examined. There are several reasons that could explain why the PRI algorithm did not find all regions. First, nine of the genes controlled by these regions are not found in a National Center for Biotechnology Information Homologene group. In addition, it is probable that not all genes are subject to combinatorial regulation. It is also important to note that any very distal elements would not have been covered by the search. Finally, binding sites that have yet to be discovered or that do not conform to a canonical consensus sequence would not be in the TFD. However, new binding sites can be added to the TFD as they are discovered. Overall, our analysis suggests that the PRI database is highly enriched with experimentally defined functional regulatory regions.

In an effort to assess the uniqueness of the PRI approach, we evaluated the performance of other programs dedicated to genomewide identification of cis-regulatory regions across species, PReMod (17) and EEL (18). We used the default search parameters for each database. Interestingly, EEL did not predict any of the same 100 literature-defined regions tested with PRI. Conversely, PReMod finds 63 of 100; 43 of these overlap with those found by PRI (SI Table 2). Curiously, of 12 regions heretofore unexplored with respect to transcriptional regulation but predicted by PRI, PReMod predicts only 4, yielding overall totals of 66 of 112 regions for PRI and 67 of 112 for PReMod. As detailed below, we tested these regions for their potential as genomic sites of transcriptional regulation. (For a more detailed comparison of these two predictive programs, please see SI Text.)

Identifying Skeletal Muscle-Specific PRIs. A corollary to the above survey is that the PRI approach can be used to discover novel, biologically relevant cis-regulatory regions on a genomewide scale. To test this proposition, we experimentally investigated as yet uncharacterized regulatory regions of genes that may be involved in myogenic differentiation. Myogenesis is principally orchestrated by a series of transcriptionally controlled events governed by myogenic regulatory factors (MRFs) (19, 20). A key MRF is MYOD, which binds to a short, degenerate consensus
binding site (CANNTG) called an E-box (20). The probability of finding an E-box in the genome is 1/256, only a few of which are likely to be functional. However, MYOD often acts in concert with other transcription factors (20); therefore, we expect that functional E-boxes occur within PRIs. Examples of transcription factors known to cooperate with MYOD are members of the MEF2 family (19, 20) and the SRF family (21). Thus, as a result of the characteristics of the E-box and the combinatorial nature of MYOD-mediated transcriptional regulation, finding functional PRIs in myogenesis poses an excellent challenge to the PRI hypothesis.

To test this hypothesis, we developed a systematic approach with several criteria to sift through the PRI database. We first extracted all PRIs with E-boxes and/or MEF2 binding sites. Next, we crossed this list with a list of genes up-regulated during myogenesis as revealed by several microarray studies (22–24). We deemed this step necessary because many other transcription factors in addition to MYOD are known to bind E-boxes and MEF2 family members can regulate other processes, such as neurogenesis (25). We chose 12 genes, of which the regulatory elements that mediate responses to myogenic signals in mice are unidentified (TNNI2, AQP1, EDA2R, PGCP, PKIA, CPA1, FHL3, MAX, RDH5, SOX6, TPM2, and USP2), for further study. Schematic representations of the organization of associated PRIs are displayed in SI Fig. 6.

Some of the PRIs display unexpected features. For example, the PRI for FHL3 is found in the intron of an upstream gene (UTP11L) in the opposite orientation, and the PRI for EDA2R encompasses an exon. Although it is generally believed that transcriptional regulatory regions do not reside within exons, we still were interested in pursuing this region.

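The pattern-matching step (grouping binding sites that share a genomic position offset) and the E-box/MEF2 screen described above can be illustrated with a small Python sketch. This is not the authors' implementation: the function names (`find_sites`, `group_by_gpo`, `candidate_pris`), the two-motif table standing in for the curated TFD, and the toy sequences are all hypothetical. The sketch also assumes that a site counts as conserved when the same motif matches in all three sequences, and it omits orientation handling and the 40-kb search window.

```python
import re
from collections import defaultdict

# Hypothetical two-entry stand-in for the curated TFD (motif name -> regex).
# IUPAC codes are expanded by hand: N = any base, W = A/T, Y = C/T, R = A/G.
MOTIFS = {
    "E-box": "CA[ACGT][ACGT]TG",       # CANNTG
    "MEF2": "[CT]T[AT][AT]AAATA[AG]",  # YTWWAAATAR
}

# Four of the six E-box positions are fixed, so a random hexamer matches
# with probability (1/4)**4 = 1/256, the figure quoted in the text.
EBOX_CHANCE = 0.25 ** 4


def find_sites(seq):
    """Map each motif name to the start positions of its matches in seq."""
    sites = {}
    for name, pattern in MOTIFS.items():
        # A zero-width lookahead also reports overlapping matches.
        sites[name] = [m.start() for m in re.finditer(f"(?=(?:{pattern}))", seq)]
    return sites


def group_by_gpo(human, mouse, rat):
    """Group human sites by GPO = (human - mouse, human - rat) positions."""
    h, m, r = find_sites(human), find_sites(mouse), find_sites(rat)
    groups = defaultdict(set)
    for name in MOTIFS:
        for hp in h[name]:
            for mp in m[name]:
                for rp in r[name]:
                    groups[(hp - mp, hp - rp)].add((hp, name))
    # Sites sharing one GPO keep the same order and spacing in all three.
    return {gpo: sorted(sites) for gpo, sites in groups.items()}


def candidate_pris(human, mouse, rat, min_sites=7, required=("E-box", "MEF2")):
    """Keep GPO groups with enough sites and at least one required motif."""
    pris = []
    for gpo, sites in group_by_gpo(human, mouse, rat).items():
        names = {name for _, name in sites}
        if len(sites) >= min_sites and names.intersection(required):
            pris.append((gpo, sites))
    return pris


# Toy sequences: same motifs, same 4-base spacing, different spacer bases.
human = "GG" + "CAGCTG" + "ACGT" + "CTAAAAATAG" + "CC"
mouse = "T" + "CATATG" + "GTTA" + "TTTTAAATAA" + "G"
rat = "AAA" + "CACGTG" + "TTGC" + "CTTAAAATAG" + "T"
# One extra spacer base in rat breaks the shared offset.
rat_shifted = "AAA" + "CACGTG" + "TTGCA" + "CTTAAAATAG" + "T"

# min_sites is lowered to 2 only so that the toy example yields a hit.
print(candidate_pris(human, mouse, rat, min_sites=2))
print(candidate_pris(human, mouse, rat_shifted, min_sites=2))
```

Because every site in a group shares the same offset to both mouse and rat, conserved spacing implies conserved order, so the sketch needs no separate order check. Crossing the surviving candidates with a list of genes up-regulated during myogenesis would then be an ordinary set intersection on gene names.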
We then proceeded to verify that these genes are up-regulated during myogenesis with real-time PCR. We chose to study these candidate PRIs in the C2C12 mouse myoblast cell line as it expresses MYOD and can recapitulate skeletal myotube formation. Fig. 2A tabulates the fold change in transcript level after either 1 or 4 days of serum deprivation. We included the genes SERPINE1 and CUEDC2 as negative controls as the expression of SERPINE1 is known to be regulated in response to TGF-β through an E-box that does not lie within a PRI (26) and CUEDC2 harbors a MEF2 binding site in its upstream region that also is not covered by a PRI. MYOG serves as a positive control for this study as it is a well studied muscle-specific gene whose documented regulatory region (27, 28) overlaps with a PRI. Of the 12 candidate genes, 10 were up-regulated in at least one time point tested. The remaining two, AQP1 and MAX, were not amenable to real-time PCR analysis. However, AQP1 is known to be strongly expressed in adult cardiac and skeletal muscle (29), and this is corroborated by the expression pattern of AQP1 on a microarray following muscle regeneration after injury (23). In addition, it can be demonstrated by microarray (22) that transcript levels of MAX are induced as an early myogenic event.

Next, using ChIP assays, we qualitatively assessed whether binding sites within PRIs are actually bound during C2C12 myogenic differentiation. Briefly, we immunoprecipitated lysate from growing myoblasts (GM) and differentiating myotubes (DM) after 1 day of serum deprivation with antibodies specific to MYOD and MEF2 (30). We then analyzed the PRIs in question through the use of traditional PCR. Fig. 2B reveals that all 12 PRIs are bound at least by day 1, with a subset also bound in the GM, according to which binding sites are conserved in the region. Importantly, the non-PRI MEF2 binding site in the upstream region of CUEDC2 is not bound by MEF2, and the TGF-β-responsive region (3APP) of SERPINE1 is not bound by MYOD at the E-box known to be critical for TGF-β signaling. Additionally, the positive control region from MYOG is bound by both MYOD and MEF2 as expected.

Fig. 2. PRIs with MEF2 binding sites and/or E-boxes from genes regulated during myogenesis are bound by these transcription factors. (A) Real-time PCR analysis of genes with PRIs selected for study. Fold change in transcript level of the indicated genes relative to the control gene EPB7.2 was determined by real-time PCR. Actual fold change values are reported over a heat map. (B) MEF2 and MYOD bind PRIs containing their corresponding binding sites during myogenesis but they do not bind consensus sites not found within PRIs. Formaldehyde-fixed, sonicated C2C12 chromatin extracts (GM and DM D1) were immunoprecipitated with anti-MEF2 or anti-MYOD antibodies. PRIs associated with the indicated genes or regions containing binding sites not within PRIs were analyzed by PCR. PRI from MYOG (indicated by *) served as a positive control. The input lane refers to PCR amplification of 0.1% of the input lysate (preimmunoprecipitation) with the same primer pairs. No primary antibody was added to the no antibody lanes (No Ab).

To test the capacity of candidate PRIs to drive transcription during myogenic differentiation, six of these PRIs were cloned into a luciferase reporter gene vector upstream of a minimal TATA box-containing promoter and subsequently transfected into C2C12 cells. We induced cell differentiation and measured
luciferase activity in GM and differentiated myotubes 1 day after induction. As shown in Fig. 3A, minTATA, which is the parental construct for all of our cloned PRIs, has minimal basal activity and just 1.8-fold induction. In contrast, mMYOG −184 → +11, a construct that harbors the same region probed in the ChIP experiments, showed robust transcriptional activity and induction upon myogenic differentiation. The 3APP region of SERPINE1 displays minimal activity during myogenesis and is induced only 1.3-fold, suggesting that an E-box alone without the proper combination of other transcription factor binding sites is insufficient for mediating gene activation during myogenic differentiation. The six PRIs showed robust basal transcriptional activity as well as induction greater than the minTATA construct upon myogenic differentiation.

To demonstrate that the binding sites used to identify the PRIs contribute to transcriptional regulation, we made single point mutations in E-boxes (CANNTG to CANNTA), MEF2 binding sites (YTWWAAATAR to YAAWAAATAR), or serum response elements (CCW₆GG to CCW₆GA) and tested the resulting reporter construct. Fig. 3B shows the effects of mutated PRIs from the genes AQP1 and FHL3. In general, when compared with their wild-type counterparts, the transcriptional output mediated by the mutant promoters was significantly decreased for both PRIs tested. Moreover, in the case of the FHL3 PRI, both conserved E-boxes appear to contribute to driving transcription. In addition, mutation of the SRE alone or in combination with the E-boxes also impairs the luciferase response governed by the FHL3 PRI, which suggests combinatorial regulation of FHL3 by MYOD and SRF.

We were therefore able to successfully mine our PRI database by asking candidate PRIs to fulfill the following criteria: (i) the PRI contains binding sites for transcription factors known to regulate myogenesis; (ii) the associated gene is differentially expressed during myogenesis as demonstrated by microarray and real-time PCR; (iii) the transcription factors for which there are conserved binding sites in the PRI bind this region as assessed by ChIP; and (iv) the PRIs can drive transcription during myogenesis and depend on the integrity of conserved binding sites as determined by reporter gene assays.

Fig. 3. Predicted PRIs can be sites of cis-regulation in myogenesis. (A) C2C12 cells were transfected with luciferase reporter plasmids containing PRIs upstream of a TATA box and normalized with a renilla reporter plasmid driven by the thymidine kinase promoter. Cells were harvested as GM and 1 day after serum withdrawal. Labels directly below the x axis are the genes associated with the PRIs studied, except for minTATA (background vector) and 3APP. The labels below the gene names are binding sites within the PRIs. Each bar represents the average of three independent experiments, and error bars denote ± SD. Note the difference in scales between the two segments of the graph. (B) Two PRIs were subjected to mutational studies directed against MYOD, MEF2, and SRF binding sites. The assay was performed as in A.

Assessing the Importance of Spacing Between Conserved Binding Sites in PRIs. We next explored the idea that the identity of the intervening bases between binding sites is not critical. An objective variation in spacer base identity can be found in the form of orthologous human sequences. As an illustration of this idea, SI Fig. 7 depicts the conservation of binding sites and the differences between spacing base pairs of human and mouse versions of PRI 8 of AQP1. While there is 100% conservation at the level of the conserved binding sites, only 61% of the intervening sequence is conserved, which is satisfying because it is a good illustration of what we expect to find given the PRI hypothesis is true. Fig. 4A demonstrates that the human ortholog of mouse AQP1 PRI 8 is quite active in our experimental C2C12 system, supporting the notion that the identity of spacer bases between conserved binding sites in PRIs is not crucial.

To test the assumption that the size of the spacer sequence is important, we deleted the bases between conserved binding sites and reintroduced the resulting reporter construct into our experimental C2C12 system. Fig. 4B reveals that the deletion severely impairs reporter gene transcription in PRIs from mAQP1 and mCPA1, which further supports the notion that the spatial organization of binding sites is critical to mediating gene activation. Although we cannot definitively rule out the possibility that these deletions would inadvertently delete uncharacterized binding sites, the fact that we observe decreased activity in the cases of two PRIs with different intervening sequences does in part validate our assumption that the spacing constraint is informative in computational identification of PRIs. Taken together, these results suggest that by following the logic that the organization of cis-elements will be conserved with respect to transcription factor binding site identity, order, and spacing across human, mouse, and rat genomes, the PRI algorithm can reliably predict the regulatory regions targeted in the actions of transcription factors of interest and lends support to the combinatorial hypothesis of mammalian transcriptional regulation.

Discussion

The richness of genomic sequence data now available has prompted exponential growth in comparative genomics in the last several years. A number of predictive programs for identifying cis-regulatory regions, including PRI as well as EEL and PReMod, have taken advantage of the power of evolutionary conservation in an effort to reduce noise in the predictions. We have presented here the successful application of the PRI algorithm to pinpointing regulatory regions involved in regulating the transcriptional program of myogenesis. The algorithm is especially powerful for predicting regions engaged in combinatorial transcriptional regulation. As stated above, we expect orthologous genes to be similarly regulated, and we expect conservation of the organization of the regulatory region at the
level of the DNA to reflect this. The fact that we can find thousands of PRIs that display conservation of binding-site order and spacing across three mammalian species lends credence to this assumption. It also suggests that approaches that look for simple clustering of binding sites with no cross-species comparison will suffer understandably from high false positive rates. This does appear to be the case when examining earlier generation predictive programs. For example, an early program developed by Wasserman and Fickett (31) and dedicated to identifying muscle-specific regulatory regions looks for clusters of binding sites for muscle-specific transcription factors as informed by the organization of already reported myogenic regulatory regions. As there was not yet enough sequence data available from multiple species, the authors could not incorporate a multigenome comparison. As a result, while they were able to retrieve 60% of validated regulatory regions from a set of genes known to be expressed in muscle, they report a 79% false positive rate when searching a database of eukaryotic promoters not restricted to muscle-specific genes. In addition, a program generated by Berman et al. (9) that looks for clusters of binding sites for early developmental transcription factors in Drosophila predicted 37 regions. Of these regions, 18 displayed no enhancer-like properties, yielding a false positive rate of 48.6% (32). Although it is generally acknowledged that Drosophila display less complex transcriptional regulation than mammals, it is apparent that even in this case a single-species, simple clustering approach does not fully suffice.

Fig. 4. Spacing between binding sites is important for PRI-mediated transcriptional regulation, whereas spacer identity is not. (A) Orthologous human regions of PRI 8 of AQP1 can drive luciferase expression in mouse C2C12 cells. Luciferase assay was performed as in Fig. 3. (B) Spacing bases between clusters of conserved binding sites in PRIs from mAQP1 and mCPA1 were deleted, and luciferase assays were performed as described. Note the difference in scales between the two segments of the graph.

The experiments presented here have verified the idea that binding-site order and spacing are defining features of regulatory regions, particularly those that serve as sites of combinatorial control. What we do not know is how stringent these criteria should be. As enhancer properties are not well enough defined to be certain how relaxed the criteria can be, we allowed no tolerance for any deviations in binding-site order and spacing. Programs such as PReMod are more flexible from this point of view. Although PReMod also performs well, it is possible that this relaxation with respect to conservation could result in a higher false positive rate. At the same time, there is no existing program that can guarantee identification of all regulatory regions, as clearly evidenced by the fact that PRI and PReMod each find literature-defined regulatory regions that the other does not. Thus, it is distinctly possible that there are key features of regulatory regions that are not yet known but that could improve the algorithm and allow it to be more inclusive.

An interesting observation is that there are frequently multiple PRIs associated with a given gene. It is possible that this is correlated with the complex expression patterns observed for many of the genes of higher eukaryotes. For example, different PRIs could be used in different scenarios. This also establishes the potential for long-range interactions between regulatory regions, which could allow fine-tuning of the expression level of a given gene in different biological settings. As a result, we anticipate that there could be cases where a PRI cloned out of its genomic context and thereby isolated from any potential interactions could not recapitulate the expression pattern of the associated gene. Conversely, we encountered a few cases of PRIs from genes not normally differentially expressed in muscle that could unexpectedly drive reporter gene expression in a C2C12 differentiation time course. This aberrant behavior could be a good example of chromatin structure playing an important role in determining DNA accessibility to transcription factors as the plasmid-borne PRI would escape much of the influence of chromatin modifications normally present. In the future, perhaps overlaying our database with databases that contain information about chromatin structure, such as DNA methylation or histone posttranslational modifications, could allow even more confident forecasting of which PRIs are directing transcription in different scenarios.

There are many potential applications of the PRI algorithm to answering diverse questions. For instance, by examining a list of genes containing PRI-associated binding sites for a given transcription factor, it could be possible to predict the biological process in which this transcription factor is involved. This, in effect, would constitute a functional annotation of cis-regulatory elements. We performed such an analysis on all of the genes in the PRI database with conserved MEF2 binding sites. We compared these genes with all genes in the genome and looked for enrichment of specific functional categories by using Gene Ontology (33). As a control, we also performed the same analysis on genes that contain MEF2 binding sites in their upstream regions but that are not conserved in PRIs. The analysis done on genes with MEF2 binding sites in PRIs returned significantly low (<0.05) binomial P values for functional categories compatible with myogenesis and neurogenesis (data not shown). No such enrichment was observed with genes whose MEF2 consensus binding sites were not within PRIs. We anticipate that one could apply this approach to uncharacterized binding sites to predict what system might use them.

In addition to discovering new cis-regulatory regions, the computational framework developed to support our analysis could be exploited to perform global analyses of transcriptional networks. One obvious question one can ask is what is the network of genes controlled by two or more transcription factors suspected of working together to achieve combinatorial transcriptional regulation? The first search tool on the PRI web site
10120 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0704028104 Cheung et al. Downloaded by guest on October 1, 2021 can serve as a launch pad for such an analysis. Further analysis at 5,000 cells per cm2 in triplicate and 24 h after were transfected could be done to functionally categorize the PRIs or determine with FuGENE6 transfection reagent (Roche, Indianapolis, IN) whether key combinations of transcription factors might coop- with 1 ␮g of the indicated plasmids and 10 ng of pRL-TK for erate to regulate gene expression under different conditions, for normalization according to the manufacturer’s conditions. example, in response to cytokines, in governing tissue-specific Twenty-four hours after transfection, GM were harvested and expression patterns, or in coordinating developmental timing differentiation was induced to yield DM. DM cells were har- events. vested at the time points indicated. Reporter gene assays were Finally, PRIs can be viewed from an evolutionary perspective. performed by using the Dual Luciferase kit (Promega, Madison, It has already been proposed that the complexity of gene WI) according to the manufacturer’s instructions and read with regulatory mechanisms is proportional to the complexity of the a Dynex luminometer. organism (2), more so than gene number or gene conservation. Could it be that natural mutations within the PRI region that Real-Time PCR. Total RNA was isolated from C2C12 cells with an change spacing, binding-site sequence, or orientation alter the RNeasy kit (Qiagen, Valencia, CA) following the manufacturer’s transcriptional output and drive the phenotypic variations that instructions. The total RNA was then reverse-transcribed with a distinguish species? Transcriptor First Strand cDNA synthesis kit using Anchored In summary, identification of PRIs coupled with the experi- oligo-(dT)18 (Roche). 
The real-time PCR experiment was per- mental validation presented here has revealed heretofore unex- formed by using LightCycler FastStart DNA Master(PLUS) plored characteristics of the architecture of cis-regulatory re- SYBR Green I following the manufacturer’s instructions on a gions. This analysis underscores the importance of the order and Light Cycler 2.0 (Roche). Expressions were normalized to the spacing of transcription factor binding sites that serve as the control gene EPB7.2, which was selected from a microarray for platform for transcriptional regulation of gene expression in its high expression level and minimal fold change during myo- mammalian genomes. genic differentiation. For oligo sequences used in this analysis, Methods please refer to SI Text. Cloning of Luciferase Reporters and Mutagenesis. The reporter ChIP. ChIP assays were performed essentially according to Lam- constructs p3TP-Lux and p3APP-Lux have been described (26). bert and Nordeen (34) except that antibodies against MYOD We cloned some of the representative PRIs into the KpnI–PstI (sc-760; Santa Cruz Biotechnology, Santa Cruz, CA) and MEF2 site of p3APP-lux by substituting the existing TGF-␤-responsive (sc-313; Santa Cruz Biotechnology) were used. The results were elements cloned previously between the two restriction sites. analyzed by PCR on a Mastercycler Gradient (Eppendorf, Point mutations and deletions were generated with the QuikChange II kit (Stratagene, La Jolla, CA) per the manufac- Hamburg, Germany). For more information, please refer to SI turer’s instructions. Text. We thank Natalie Ahn, Jim Goodrich, Rob Knight, Leslie Leinwand, and Cell Lines and Luciferase Assay. 
C2C12 mouse myoblasts were a gift David Clarke for critical reading of the manuscript and members of the from Leslie Leinwand (University of Colorado) and maintained Leinwand and Knight laboratories at the University of Colorado for GENETICS in DMEM supplemented with L-glutamine, penicillin, strepto- discussion and reagents. K.K.B.B. was supported by National Institute of mycin, and 20% FBS. Myogenic differentiation was induced by General Medical Sciences Predoctoral Training Grant T32GM08759. serum deprivation. Briefly, cells were washed with PBS, and the T.H.C. was supported by National Institutes of Health Predoctoral culture media were replaced with DMEM supplemented with Training Grant 5T32HL07851. This work was supported by National 5% horse serum. For luciferase assays, C2C12 cells were plated Institutes of Health Grants CA107098-01 and R01CA95527 (to X.L.).
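
As a rough illustration of the Gene Ontology enrichment comparison described in the Discussion (genes with MEF2 sites in PRIs versus genes with MEF2 sites outside PRIs), the sketch below computes an upper-tail binomial P value for one functional category. It is not the authors' code: the function names and all gene counts are hypothetical, and the text specifies only that binomial values below 0.05 were treated as significant, so the one-sided form used here is an assumption.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def go_enrichment_p(set_hits: int, set_size: int,
                    genome_hits: int, genome_size: int) -> float:
    """P value for seeing at least `set_hits` category members in a gene set.

    The background probability is the fraction of all genes in the genome
    that carry the category annotation (an assumed, simple null model).
    """
    background = genome_hits / genome_size
    return binomial_upper_tail(set_hits, set_size, background)


# Hypothetical counts for one category; none of these numbers are from the paper.
GENOME_SIZE = 20000   # annotated genes in the genome
GENOME_HITS = 400     # genes annotated to the category genome-wide

# Genes with MEF2 sites inside PRIs: 12 of 100 fall in the category.
pri_p = go_enrichment_p(12, 100, GENOME_HITS, GENOME_SIZE)

# Control genes with MEF2 sites outside PRIs: 2 of 100 fall in the category.
control_p = go_enrichment_p(2, 100, GENOME_HITS, GENOME_SIZE)

print(f"PRI set enriched at P < 0.05: {pri_p < 0.05}")
print(f"Control set enriched at P < 0.05: {control_p < 0.05}")
```

In practice one such test would be run per category, so some correction for multiple testing would usually be applied; the text does not say whether one was used.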

1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. (2001) Nature 409:860–921.
2. Levine M, Tjian R (2003) Nature 424:147–151.
3. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Nature 423:241–254.
4. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M (2003) Science 301:71–76.
5. Pritsker M, Liu YC, Beer MA, Tavazoie S (2004) Genome Res 14:99–108.
6. Wang T, Stormo GD (2005) Proc Natl Acad Sci USA 102:17400–17405.
7. Papatsenko D, Levine M (2005) Nat Methods 2:529–534.
8. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al. (2004) Nature 431:99–104.
9. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB (2002) Proc Natl Acad Sci USA 99:757–762.
10. Bejerano G, Siepel AC, Kent WJ, Haussler D (2005) Nat Methods 2:535–545.
11. Kleinjan DA, van Heyningen V (2005) Am J Hum Genet 76:8–32.
12. Remenyi A, Scholer HR, Wilmanns M (2004) Nat Struct Mol Biol 11:812–815.
13. Arnone MI, Davidson EH (1997) Development (Cambridge, UK) 124:1851–1864.
14. Ghosh D (1992) Nucleic Acids Res 20(Suppl):2091–2093.
15. Cheung TH, Kwan YL, Hamady M, Liu X (2006) Genome Biol 7:R97.
16. Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, et al. (2001) Nucleic Acids Res 29:11–16.
17. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, et al. (2006) Genome Res 16:656–668.
18. Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E, Taipale J (2006) Cell 124:47–59.
19. Molkentin JD, Olson EN (1996) Proc Natl Acad Sci USA 93:9366–9373.
20. Tapscott SJ (2005) Development (Cambridge, UK) 132:2685–2695.
21. Groisman R, Masutani H, Leibovitch MP, Robin P, Soudant I, Trouche D, Harel-Bellan A (1996) J Biol Chem 271:5258–5264.
22. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH (2004) FASEB J 18:403–405.
23. Zhao P, Iezzi S, Carver E, Dressman D, Gridley T, Sartorelli V, Hoffman EP (2002) J Biol Chem 277:30091–30101.
24. Chen IH, Huber M, Guan T, Bubeck A, Gerace L (2006) BMC Cell Biol 7:3.
25. McDermott JC, Cardoso MC, Yu YT, Andres V, Leifer D, Krainc D, Lipton SA, Nadal-Ginard B (1993) Mol Cell Biol 13:2564–2577.
26. Hua X, Liu X, Ansari DO, Lodish HF (1998) Genes Dev 12:3084–3095.
27. Buchberger A, Ragge K, Arnold HH (1994) J Biol Chem 269:17289–17296.
28. Edmondson DG, Cheng TC, Cserjesi P, Chakraborty T, Olson EN (1992) Mol Cell Biol 12:3665–3677.
29. Au CG, Cooper ST, Lo HP, Compton AG, Yang N, Wintour EM, North KN, Winlaw DS (2004) J Mol Cell Cardiol 36:655–662.
30. Blais A, Tsikitis M, Acosta-Alvear D, Sharan R, Kluger Y, Dynlacht BD (2005) Genes Dev 19:553–569.
31. Wasserman WW, Fickett JW (1998) J Mol Biol 278:167–181.
32. Berman BP, Pfeiffer BD, Laverty TR, Salzberg SL, Rubin GM, Eisen MB, Celniker SE (2004) Genome Biol 5:R61.
33. Consortium (2006) Nucleic Acids Res 34:D322–D326.
34. Lambert JR, Nordeen SK (2001) Methods Mol Biol 176:273–281.

Cheung et al. PNAS | June 12, 2007 | vol. 104 | no. 24 | 10121