Annotation of Cis-Regulatory Elements by Identification, Subclassification, and Functional Assessment of Multispecies Conserved Sequences
Total Page:16
File Type:pdf, Size:1020Kb
Annotation of cis-regulatory elements by identification, subclassification, and functional assessment of multispecies conserved sequences Jim R. Hughes*, Jan-Fang Cheng†, Nicki Ventress*, Shyam Prabhakar†, Kevin Clark*, Eduardo Anguita*, Marco De Gobbi*, Pieter de Jong‡, Eddy Rubin†, and Douglas R. Higgs*§ *Medical Research Council Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DS, United Kingdom; †Department of Energy Joint Genome Institute, Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720; and ‡Children’s Hospital and Research Center at Oakland, Oakland, CA 94609 Communicated by David Weatherall, University of Oxford, Oxford, United Kingdom, April 28, 2005 (received for review November 5, 2004) An important step toward improving the annotation of the human and͞or unknown because there is usually very limited experimental genome is to identify cis-acting regulatory elements from primary data to link genome structure to function at the locus in question. DNA sequence. One approach is to compare sequences from Therefore, to establish accurate and sensitive routines for evaluat- multiple, divergent species. This approach distinguishes multispe- ing cis-regulatory sequences, it will be necessary to analyze well cies conserved sequences (MCS) in noncoding regions from more characterized regions of the human genome so that such sequences rapidly evolving neutral DNA. Here, we have analyzed a region of can be recognized and their potential importance can be assessed Ϸ238kb containing the human ␣ globin cluster that was sequenced and prioritized for functional studies. ͞ and͞or annotated across the syntenic region in 22 species spanning Here, we have sequenced and or annotated and compared extensive regions (47–238 kb) of conserved synteny containing the 500 million years of evolution. Using a variety of bioinformatic ␣ approaches and correlating the results with many aspects of -globin cluster, its known regulatory elements and flanking genes chromosome structure and function in this region, we were able to in 22 species representing widely divergent groups, spanning 500 my of evolution. Multiple alignments were analyzed with a variety of identify and evaluate the importance of 24 individual MCSs. This bioinformatic approaches, and the results were compared with approach sensitively and accurately identified previously charac- algorithms for the detection of MCSs described in ref. 9 and a newly terized regulatory elements but also discovered unidentified pro- developed analysis tool, GUMBY (S. Prabhakar, unpublished data). moters, exons, splicing, and transcriptional regulatory elements. The approach taken here allowed us to identify 24 MCSs within a Together, these studies demonstrate an integrated approach by 238-kb region containing and surrounding the human ␣-globin which to identify, subclassify, and predict the potential importance cluster. Previous studies of this locus have localized all DNase1 of MCSs. hypersensitive sites (DHSs; refs. 10 and 11), characterized its nuclear localization (12), nuclear matrix attachment (13), pattern of conserved noncoding elements ͉ gene regulation ͉ globin gene ͉ replication (14), chromatin structure (10), chromatin modification comparative genomics (15, 16) DNA methylation (17), and pattern of gene expression in normal loci and those with natural deletions (18), allowing us to ith the development of multiple computational algorithms evaluate in full the functional significance of the MCS elements Wand appropriate biological data, it is now possible to detect identified here. coding sequences relatively efficiently and accurately from primary Materials and Methods DNA sequence (1). By contrast, it is still difficult to identify and Clone Isolation and Sequencing. Bacterial artificial chromosomes assign function to noncoding sequences that include critical cis- (BACs) used in this study were isolated and separated by using acting regulatory elements such as promoters, enhancers, silencers, standard techniques (19); all details concerning multiple align- locus control elements, nuclear matrix attachment sites, and origins ments of the discussed elements are available as additional material of replication. A promising approach is to search for conserved from the authors on request. All BACs used in this study are orthologous noncoding sequences from multiple, evolutionarily available from the BACPAC Resources Centre (http:͞͞bacpac. diverse species (2–8), so-called multispecies conserved sequences chori.org). The sequences of the chimpanzee, cow, and cat were (MCSs). However, at present, it is not established how best to design obtained from the ENCODE project (http:͞͞genome.gov͞ and carry out such searches. Furthermore, even when complete, it ENCODE), region ENm008. is not clear how to evaluate and prioritize MCSs for functional analysis. Sequence Assembly and Finishing. Sequence reads were base called An important issue is to know which species to include in by using PHRED 0.020425 and assembled by using PHRAP (www.phrap. searches for MCSs. When comparing the sequences of distantly org) into a GAP4 database. Standard finishing methods were used. related species [e.g., birds and mammals, which diverged 310 million years (my) ago], cis elements may have diverged too much to be easily identified. By contrast, when comparing more closely related Abbreviations: ␣MRE, human ␣-globin major regulatory element; DHS, DNase1 hypersen- sitive site; MCS, multispecies conserved sequences; MCS-E, MCS-previously unidentified species (e.g., rodents and primates, which diverged 64–74 my ago), exon; MCS-P, MCS promoters; MCS-R, MCS-other transcriptional regulatory element; it may be difficult to distinguish between sequence homology MCS-S, MCS splicing element; my, million years; TFB, transcription factor binding site. resulting from a slow rate of evolution and the conservation of Data deposition: The sequences reported in this paper have been deposited in the GenBank functionally important cis elements. Nevertheless, it is becoming database [(species, clone names, and accession nos.): baboon, RP41-98L24, AC145461; clear that analyzing multiple species in different combinations adds colobus, CH272–179O16, AC148220; dusky titi, LB5-13F19, AC145465; marmoset, CH259- significant power to the identification of functionally important 50K11, AC146591, and CH259-177O2, AC145483; owl monkey, CH258-305M22, AC146782; squirrel monkey, CH254-200G16, AC146463, and CH254-167L1, AC151883; lemur, LB2- cis-acting sequences (2, 3). 99K6, AC145463; pig, RP44-210L18, AC120214, and RP44-293L23, AC120215; dog, RP81- Even when MCS can be unequivocally identified, there is often 84F12, AC120212, and RP81-397M1, AC120213; hedgehog, LB4-358K2, AC150435; opos- no way to judge how sensitive the searches have been. Do these sum, LB3-33C20, AC120504, and LB3-206D4, AC139599, and LB3-171C2, AC148752]. routines overcall or undercall regulatory elements? Furthermore, §To whom correspondence should be addressed. E-mail: [email protected]. the true functional significance of MCSs often remains untested © 2005 by The National Academy of Sciences of the USA 9830–9835 ͉ PNAS ͉ July 12, 2005 ͉ vol. 102 ͉ no. 28 www.pnas.org͞cgi͞doi͞10.1073͞pnas.0503401102 Downloaded by guest on September 29, 2021 Sequence Annotation and Multiple Alignment. Sequences were an- notated against EST data when available for that species (http:͞͞ genome.ucsc.edu and www.ensembl.org) or by using cross-species comparison of available ESTs (MACVECTOR 7.2, Accelrys, Inc., San Diego). The presence of nonsyntenic features were evaluated by using a modification of HPREP and stored in AceDB (www.acedb. org). Repeat annotation was generated by using REPEATMASKER (www.repeatmasker.org). Multiple alignments were generated and͞or visualized by using VISTA (www-gsd.lbl.gov͞vista͞index. shtml); MULTIPIPMAKER (http:͞͞pipmaker.bx.psu.edu͞pipmaker), and MULTILAGAN (http:͞͞lagan.stanford.edu͞laganweb͞index. shtml). Multiple alignments were analyzed as described in Results and were also compared with data obtained by using WEBMCS (http:͞͞research.nhgri.nih.gov͞MCS͞submit.shtml). In addition, MULTILAGAN alignments were analyzed for noncoding conserva- tion by using the algorithm GUMBY, a tool for detecting statistically significant conserved regions in pairwise or multiple alignments of DNA sequences at any alignable evolutionary distance. Individual MCSs were aligned by using CLUSTALW (MACVECTOR 7.2, Accelrys); BESTFIT (WISCONSIN PACKAGE 10.3, GCG) and DIALIGN (www. genomatix.de͞cgi-bin͞dialign͞dialign.pl) and manually optimized. Sequence Analysis. Potential transcription factor binding sites (TFBs) were identified by using TRANSFAC 6.0 (www.gene- regulation.com); TRES (http:͞͞bioportal.bic.nus.edu.sg͞tres), and REGULATORY VISTA (http:͞͞gsd.lbl.gov͞vista͞rvista͞submit. shtml). Plots of Observed͞Expected for the CpG dinucleotide were generated by using WINDOW and STATPLOT 10.3 (GCG) comparing the local frequency to the global expected value over a 1,000-bp window. Results A Region of Conserved Synteny Maintained Throughout 500 my of Evolution. ␣ The structure of the -globin cluster was established by Fig. 1. The arrangement (to scale) of globin genes (red boxes) and nonglobin EVOLUTION DNA sequence analysis (6- to 8-fold coverage) in 22 divergent genes (colored boxes) flanking the ␣-globin cluster. The extent of each black species representing well defined time points throughout