Ultraconserved Elements in the Human Genome
Total Page:16
File Type:pdf, Size:1020Kb
R EPORTS 16. L. Shen, K. L. Rock, Proc. Natl. Acad. Sci. U.S.A. 101, J. Immunol. 171, 5668 (2003). Supporting Online Material 3035 (2004). 22. N. P. Restifo et al., J. Immunol. 154, 4414 (1995). www.sciencemag.org/cgi/content/full/304/5675/1318/ 17. S. P. Schoenberger et al., J. Immunol. 161, 3808 (1998). 23. We thank B. Buschling, D. Tokarchick, and A. Schell DC1 18. M. L. Albert, B. Sauter, N. Bhardwaj, Nature 392,86 for technical assistance. We are grateful to M. Epler Materials and Methods b (1998). and S. Tevethia for their generous gift of D - Figs. S1 and S2 19. M. Bellone et al., J. Immunol. 159, 5391 (1997). NP366-374 tetramers. This work was supported in part References and Notes 20. J. W. Yewdell, C. C. Norbury, J. R. Bennink, Adv. by a Wellcome Prize Traveling Fellowship and U.S. Immunol. 73, 1 (1999). Public Health Service grants, and NIH grant AI- 21. A. Serna, M. C. Ramirez, A. Soukhanova, L. J. Sigal, 056094-01 to C.C.N. 3 February 2004; accepted 14 April 2004 with the dog genome, which was estimated Ultraconserved Elements in the using reads from the National Center for Biotechnology Information (NCBI) trace ar- Human Genome chive (477/481 ϭ 99.2% of the elements aligning at an average of 99.2% identi- Gill Bejerano,1* Michael Pheasant,3 Igor Makunin,3 ty).Thus, it appears that nearly all of these Stuart Stephen,3 W. James Kent,1 John S. Mattick,3 ultraconserved elements may have been un- David Haussler2* der extreme negative selection in many spe- cies for more than 300 million years, and There are 481 segments longer than 200 base pairs (bp) that are absolutely some of them for at least 400 million years. conserved (100% identity with no insertions or deletions) between orthologous As expected, the ultraconserved elements regions of the human, rat, and mouse genomes.Nearly all of these segments exhibit almost no natural variation in the are also conserved in the chicken and dog genomes, with an average of 95 and human population. Only 6 out of 106,767 99% identity, respectively.Many are also significantly conserved in fish.These bases examined in the ultraconserved ele- ultraconserved elements of the human genome are most often located either ments (excluding the first and last 20 bases in overlapping exons in genes involved in RNA processing or in introns or nearby each element) are at validated single- genes involved in the regulation of transcription and development.Along with nucleotide polymorphisms (SNPs) in NCBI’s more than 5000 sequences of over 100 bp that are absolutely conserved among SNP database (dbSNP) (table S2a). For this the three sequenced mammals, these represent a class of genetic elements much DNA, we would have expected 119 whose functions and evolutionary origins are yet to be determined, but which validated sites, so validated SNPs are under- Ϫ42 are more highly conserved between these species than are proteins and appear represented by 20-fold (P Ͻ 10 ). The 48 to be essential for the ontogeny of mammals and other vertebrates. unvalidated SNPs we found revealed many likely errors in the unvalidated portion of Although only about 1.2% of the human served with orthologous segments in rodents: dbSNP (table S2b). These same 106,767 genome appears to code for proteins (1–3), those showing 100% identity and with no bases exhibit very few differences with the it has been estimated that as much as 5% is insertions or deletions in their alignment with chimp genome as well, showing only 38 more conserved than would be expected the mouse and rat. Exclusive of ribosomal single base changes where the chimp base has from neutral evolution since the split with RNA (rRNA) regions, there are 481 such a Phred quality score at least 45, whereas the rodents, and hence may be under negative segments longer than 200 bp that we call expected number would be 716 (roughly a or “purifying” selection (4–6). Several ultraconserved elements (table S1). They are 19-fold reduction, P Ͻ 10Ϫ200; supporting studies have found specific noncoding seg- widely distributed in the genome (on all chro- text, section S2). This low level of variation ments in the human genome that appear to mosomes except chromosomes 21 and Y) and within the human population and in comparison be under selection, using a threshold for are often found in clusters (Fig. 1). The prob- with the chimp suggests that these elements are conservation of 70 or 80% identity with the ability is less than 10Ϫ22 of finding even one currently changing at a rate that is roughly 20 mouse over more than 100 base pairs (bp) such element in 2.9 billion bases under a times slower than the average for the genome. (7–13). A study of these elements on hu- simple model of neutral evolution with inde- Only 4.3% of the bases are different in the man chromosome 21 found that those that pendent substitutions at each site, using the chicken, which is also consistent with a roughly were very highly conserved in multiple spe- slowest neutral substitution rate that is ob- 20-fold reduction from neutral substitution rates cies contained significant numbers of non- served for any 1-Mb region of the genome (supporting text, section S2). coding elements (13). Similar results were (supporting text, section S1). Nearly all of Of the 481 ultraconserved elements, 111 found when comparing the human, mouse, these elements also exhibit extremely high overlap the mRNA of a known human and rat genomes (14, 15) in a study of the levels of conservation with orthologous re- protein-coding gene [including the untrans- 1.8–megabase (Mb) CFTR region (16, 17) gions in the chicken genome [467 out of 481 lated regions (UTRs)], 256 show no evidence and in a functional study of the SIM2 locus (467/481) ϭ 97% of the elements aligning at of transcription from any matching expressed in a number of mammalian species (18). an average of 95.7% identity, 29 at 100% sequence tag (EST) or mRNA from any spe- We determined the longest segments of identity] and about two-thirds of them with cies, and for the remaining 114 the evidence the human genome that are maximally con- the fugu genome as well (324/481 ϭ 67.3% for transcription is inconclusive. We call of the elements aligning at an average of these partly exonic (or exonic for short), non- 1Department of Biomolecular Engineering, 2Howard 76.8% identity), despite the fact that only exonic, and possibly exonic ultraconserved Hughes Medical Institute, University of California about 4% of the human genome can be reli- elements, respectively. A hundred non- Santa Cruz, Santa Cruz, CA 95064, USA. 3ARC Special ably aligned to the chicken genome (at an exonic elements are located in introns of Research Centre for Functional and Applied Genom- average of 62.9% identity where an align- known genes and the rest are intergenic. ics, Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD 4072, Australia. ment is found), and less than 1.8% of the The non-exonic elements, both intronic and human genome aligns to fugu (at an average intergenic, tend to congregate in clusters *To whom correspondence should be addressed. E- mail: [email protected] (G.B.); [email protected] of 60% identity). In addition, nearly all ex- near transcription factors and developmen- (D.H.). hibit extremely high levels of conservation tal genes, whereas the exonic and possibly www.sciencemag.org SCIENCE VOL 304 28 MAY 2004 1321 R EPORTS exonic elements are more randomly distrib- enrichment for RNA binding and regulation attributes are enriched in type I genes as well uted along the chromosomes (Fig. 1). of splicing (P Ͻ 10Ϫ18 and 10Ϫ9, respective- but 16, 8, and 9 orders of magnitude less There are 93 known genes that overlap ly, against all GO annotated human genes) significantly, respectively. This suggests that with exonic ultraconserved elements; we call and are uniquely abundant in the RNA rec- exonic ultraconserved elements may be spe- these type I genes. The 225 genes that are ognition motif RRM (P Ͻ 10Ϫ17, against all cifically associated with RNA processing and near the non-exonic elements we call type II InterPro annotated human genes). In contrast, non-exonic elements with regulation of tran- genes (methods in supporting text, section the type II genes are devoid of enrichment for scription at the DNA level. S3). We looked for categories of biological RNA binding or splicing or the RRM (P ϭ Non-exonic ultraconserved elements are process and molecular function defined in the 0.39, 0.44, and 0.77, respectively). However, often found in “gene deserts” that extend Gene Ontology (GO) database (19) that are type II genes are strongly enriched for regu- more than a megabase. In particular, of the significantly enriched in type I and II genes lation of transcription and DNA binding (P Ͻ non-exonic elements, there are 140 that are and also searched InterPro (20) for enrich- 10Ϫ19 and 10Ϫ14, respectively), as well as more than 10 kilobases (kb) away from any ment in particular structural domains (Fig. 2). DNA binding motifs, in particular the Ho- known gene, and 88 that are more than 100 The type I genes show significant functional meobox domain (P Ͻ 10Ϫ14). These three kb away. The set of 156 annotated genes that chr1 chr7 chr14 POU3F1 LMO4 PBX1 AKT3/ZNF238 SP4/SP8 AUTS2 DLX6 NOVA1 VRK1 FLJ10597 PTBP2 HNRPU TRA2A FOXP2 FOXG1B FAF1 HOXA HIC/TFEC COCH castor-ortholog AUTL1 NPAS3 GARNL1/GARNL2 MIPOL1 PRPF39 chr2 chr8 chr15 BCL11A ZFHX1B ST18 ZFPM2 MEIS2 MAP2K5 NR4A2 BHLHB5 TBR1 FIGN chr9 chr16 HAT1/DLX1/DLX2 SP3/PTD004 HOXD chr3 DMRT1/DMRT3 TLE4 MNAB OAZ C9orf39 HNRPK PBX3 IRX3/IRX5/IRX6 ELAVL2 NFAT5 FLJ22611 ATBF1 chr17 SATB1 FOXP1 ZNF288 SOX14 ZIC4 SHOX2 EVI1 chr10 chr18 LZK1/LHX1/AATF HOXB SFRS1 chr4 DRG11 PAX2 EHZF TCF4 C10orf11 FLJ13188 TAF4B ZNF407 BUB3 BRUNOL4 LRBA EBF3 chr11 chr19 chr20 chr5 LMO1 PAX6 KIAA1474 SOX6 KPNB2 RANBP17 orthopedia-ortholog chr12 chr21 chr22 chrY MEF2C NR2F1 chr6 chr13 HOXC chrX DDX3X TFAP2B/PKHD1 POU3F2 DACH POLA/ARX NLGN3 GRIA3/STAG2 Base Position 24000000 24500000 Chromosome Band Xp22.11 Xp21.3 PDK3 POLA PCYT1B ARX Ultra-conserved Elements Fig.