REPORTS

Evolutionary Discrimination of Mammalian Conserved Non-Genic Sequences (CNGs)

Emmanouil T. Dermitzakis,1*† Alexandre Reymond,1† Nathalie Scamuffa,1 Catherine Ucla,1 Ewen Kirkness,2 Colette Rossier,1 Stylianos E. Antonarakis1*

1Division of Medical Genetics and National Center of Competence in Research (NCCR) Frontiers in Genetics, University of Geneva Medical School and University Hospitals, 1211 Geneva, Switzerland. 2Institute for Genomic Research (TIGR), Rockville, MD 20850, USA.
*To whom correspondence should be addressed. E-mail: [email protected] (S.E.A.); [email protected] (E.T.D.)
†These authors contributed equally to this work.

Analysis of the human and mouse genomes identified an abundance of conserved non-genic sequences (CNGs). The significance and evolutionary depth of their conservation remain unanswered. We have quantified levels and patterns of conservation of 191 CNGs of human chromosome 21 in 14 mammalian species. We found that CNGs are significantly more conserved than protein-coding genes and noncoding RNAs (ncRNAs) within the mammalian class from primates to monotremes to marsupials. The pattern of substitutions in CNGs differed from that seen in protein-coding and ncRNA genes and resembled that of protein-binding regions. About 0.3% to 1% of the human genome corresponds to a previously unknown class of extremely constrained CNGs shared among mammals.

Until recently, the extent of nucleotide conservation between human and other mammalian species has been unclear. Small-scale analyses between human and mouse genomes suggested conservation outside of gene regions (1–6). Comparison with the draft of the mouse genome indicated that at least 5% of the human genome was under selective constraint; surprisingly, the majority of these highly conserved sequences did not correspond to known genic sequences, and experimental attempts to test the hypothesis that they are previously unidentified genes showed that this is unlikely (7–10). In addition, a method was recently described for the identification of primate-specific functional elements (11). Computational and mathematical efforts have attempted to distinguish the conserved regulatory portion of the genome from neutrally evolving sites (12). However, no highly accurate methodology that can discriminate between different functional classes of highly conserved sequences has been developed.

In this report, we analyze 220 sequences of the 2262 CNGs initially identified as highly conserved between human chromosome 21 and mouse syntenic regions and presented no evidence for transcription potential (7). We subsequently compared their evolutionary properties with protein coding sequences (CODs) from past studies (13–15) and noncoding RNA gene sequences (ncRNAs) obtained here. To perform polymerase chain reaction (PCR) from genomic DNA of green monkey, ring-tailed lemur, brush-tailed porcupine, rabbit, pig, cat, greater mouse-eared bat, white-toothed shrew, nine-banded armadillo, African elephant, tammar wallaby, and platypus, we designed oligonucleotides on CNG and ncRNA human sequences in highly conserved regions between human and mouse. The selection of ncRNAs has its basis in criteria of orthology and sufficient conservation to design primers. Only a small subset of known ncRNAs could be used because of characteristics such as antisense to genes, small size, and unknown function.

After PCR, we obtained at least one sequence from the other 12 species alignable to human and mouse for 191 out of 220 CNGs (87%) and 14 out of 16 ncRNAs (88%). The 19 nuclear protein-coding genes had been analyzed previously (15); we aligned 12 of the 44 original species (human, strepsirrhine, mouse, hystricid, rabbit, pig, cat, free-tailed bat, shrew, armadillo, elephant, and opossum). In that study, CODs were chosen to have 80 to 95% nucleotide identity between human and mouse (14), and they were selected from a larger set because a PCR product could be obtained from all 44 species (15). Therefore, these sequences are biased for high success of amplification in other species and high conservation, issues that become relevant below. Our analyses were performed in multiple alignments of 55,519 base pairs (bp) of CNGs, 17,028 bp of CODs, and 5599 bp of ncRNAs.

To minimize biases from missing data resulting from PCR failure, we considered two CNG data sets, one of all 191 CNGs (CNG-all, fig. S1A) and another of 63 CNGs for which the sequences of at least 10 species were available, including at least one of armadillo, elephant, wallaby, or platypus (16). This second data set (CNG-high, for high species coverage; fig. S1B) is directly comparable to the CODs, which contain all 12 species' sequences. With the use of the same criteria, we considered the complete data set for all 14 ncRNAs (ncRNA-all, fig. S1C) and a subset of 5 ncRNAs with high alignment coverage (ncRNA-high, fig. S1D). Both data sets of CNGs and ncRNAs (all and high) are used below to illustrate that the missing data do not influence the observed patterns.

A large fraction of the 191 successfully amplified CNGs were highly conserved in multiple mammalian species (fig. S1; A, B, and E). Specifically, we could retrieve more than 43% of the orthologous sequences from wallaby and/or platypus. High sequence conservation was evident even in the presence of species-specific substitution biases [e.g., A-T to G-C bias in mouse, porcupine, rabbit, and elephant (17)] that increase the substitution rate, providing additional support for the significant role of CNGs. The divergence values of CNGs were much lower than those of CODs and ncRNAs for each species pair (Fig. 1 and table S1), illustrating strong selective constraint.

To quantify the levels of conservation, we estimated the amount of sequence divergence per unit of evolutionary time. For each of the 191 CNGs, 14 ncRNAs, and 57 CODs (18), we calculated sequence change per million years (D/my) assuming the phylogenetic tree described in (15, 19). Ancestral states were derived with maximum likelihood with the use of PAML3 (20), and inferred substitutions were placed on the branches of the phylogenetic tree to account for all detectable substitution events. Divergence times were derived from (19). We calculated the sequence change for each tree branch and divided by the number of millions of years each branch covered. Figure 2A shows that CNG-all and CNG-high are significantly more constrained than CODs, ncRNA-all, and ncRNA-high. These observations are not a result of amplification bias, because multiple species sequences for CODs were obtained with stricter criteria (see above) than CNGs and ncRNAs. The low D/my values of CNGs show that they are under a stronger selective pressure than other functional genomic elements. To confirm that the higher substitution rate of CODs is not an artifact of the selection of the CNGs, we performed a

similar analysis by searching for all the Hsa21 2262 CNGs and 1229 CODs identified in (7) in the 1.5X genome of the dog (Canis familiaris) available from TIGR (18). For the set of 2262 CNGs, 1674 (74%) had a reciprocal best dog hit (E value < 0.001), and 1406 (62%) satisfied additional criteria of at least 90% coverage and at least 70% nucleotide identity. For the set of 1229 CODs, 994 (81%) had a reciprocal best dog hit (E < 0.001), and 749 (61%) satisfied additional criteria of at least 90% coverage and at least 70% nucleotide identity. This result, together with the fact that we expect to find by chance about 70% of any sequence in a 1.5X genome, suggests that the vast majority of CNGs are conserved in dog and likely in many placental mammals. We subsequently aligned 1638 CNGs and 976 CODs in human, mouse, and dog and compared their dog-specific divergence (Fig. 2B). CNGs showed a significantly lower rate of substitution than CODs, confirming the result with multiple species.

Fig. 1. Plot of average pairwise sequence divergence (Kimura two-parameter estimate) between human and other mammalian species in CNGs (blue), CODs (burgundy), and ncRNA (yellow). There are no COD values for green monkey and platypus because they were not sequenced in the original study.

Fig. 2. Divergence of genomic elements. (A) Sequence change per million years (D/my) in CNG-all, CNG-high, CODs, ncRNA-all, and ncRNA-high [Mann-Whitney tests; CNG versus COD, P < 0.001 (high versus all) and P < 0.001 (all versus all); CNG versus ncRNA, P = 0.012 (high versus high) and P = 0.004 (all versus all)]. (B) Sequence divergence (per million years) of CNGs and CODs in the dog lineage (Mann-Whitney test; CNG versus COD, P < 0.001).

We conclude that a large fraction of these CNGs, originally found conserved between human and mouse, are highly conserved in multiple mammals, strongly supporting functional importance. The CNGs studied here represent 10% of the total number of CNGs on Hsa21. Even if only the CNG-high set (29% of the whole) can be considered functionally important, there are at least 656 such highly conserved elements (CNGs) on Hsa21 and at least 65,600 in the human genome, twice as many as the genes (Hsa21 is ~1% of the human genome). Moreover, the 2262 CNGs of Hsa21 cover 1% of the Hsa21 sequence, and the CNG-high constitutes 29% of this 1%. Therefore, we estimate that at least 0.3% of the Hsa21 sequence (~90 kbp) or of the whole human genome (~9 Mbp) is under very strong selective pressure so that there is minimal sequence change across hundreds of millions of years of evolutionary time. In addition, the fact that there is extensive conservation of CNGs in the dog genome strongly supports the idea that the majority of CNGs are conserved in most placental mammals.

In order to identify characteristics that could distinguish CNGs from other functional genomic elements, we devised three metrics that are described below. CODs tend to have a uniform distribution of substitutions, because most third positions of the codons are free to change (silent changes). In protein-binding DNA regions, substitutions usually occur between highly conserved binding sites (21–23). For ncRNAs, we have no prior evidence for one or the other pattern. To derive a measure for the distribution and clustering of variable sites within the sequence, we used a modified method from (24) to infer significance (18). For each of the 191 CNGs, 14 ncRNAs, and 57 CODs, we calculated the P value of clustering of variable sites along the sequence. Figure 3A illustrates the highly significant separation of P values of CNG-all and CNG-high from the CODs, whereas the ncRNAs covered have a wide distribution. As expected, the majority of CODs have significantly uniform distribution of substitutions along the sequence. In contrast, many of the CNGs have statistically significant clustering, strongly suggesting the presence of motifs for protein-binding or other interactions.

In constrained sequences, there are recurrent substitutions in the nucleotide positions that are free to evolve. When the entirety of the sequence is constrained, we observe sparse events of substitution, because almost none of the nucleotides are neutral. The former case is true for CODs because about one-third of the substitutions lead to a silent change, whereas regions with a high density of binding sites will resemble the latter pattern. The ncRNAs may have either pattern depending on the fraction of nucleotides that are functional. The average number of substitutions per variable site was calculated for each sequence and corrected for the substitution rate in the sequence (18). This correction was done to exclude the effect of global selective constraint within the sequence and to consider a normalized estimate of the number of substitutions per variable site. CNG-all and CNG-high have significantly smaller residual values than CODs (Fig. 3B). This suggests that even the fraction of variable nucleotides in CNGs is more constrained than that in CODs.

One of the properties that distinguishes a transcribed sequence (COD and ncRNA) from a nontranscribed one (CNG) is that the function of the former is expressed in one of the two strands whereas for the latter both strands may be important. Therefore,


Fig. 3. Evolutionary discrimination of functional genomic elements. (A to D) Plots of confidence intervals and pairwise P values (based on Mann-Whitney tests) for CNGs, CODs, and ncRNAs. P values of (A) the clustering of substitutions [CNG versus COD, P < 0.001 (high versus all) and P < 0.001 (all versus all); CNG versus ncRNA, P = 0.707 (high versus high), P = 0.762 (all versus all)], (B) residuals of substitutions per variable site [CNG versus COD, P < 0.001 (high versus all) and P < 0.001 (all versus all); CNG versus ncRNA, P = 0.534 (high versus high), P = 0.065 (all versus all)], and binomial probabilities for (C) AT [CNG versus COD, P < 0.001 (high versus all) and P < 0.001 (all versus all); CNG versus ncRNA, P = 0.202 (high versus high) and P = 0.041 (all versus all)] and (D) CG [CNG versus COD, P < 0.001 (high versus all) and P < 0.001 (all versus all); CNG versus ncRNA, P = 0.906 (high versus high) and P = 0.362 (all versus all)] substitution symmetry.
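
Panels C and D of Fig. 3 report binomial probabilities for substitution symmetry. The following is a minimal sketch of one way such a probability could be computed; it is not the authors' implementation, whose exact procedure is given in the supplementary methods (18). It assumes that, under strand symmetry, the two substitutions of a complementary pair (for example A→T versus T→A) are equally likely, and it returns a two-sided exact binomial P value. The function name, the two-sided convention, and the example counts are illustrative assumptions.

```python
from math import comb


def binomial_symmetry_p(n_forward: int, n_reverse: int) -> float:
    """Two-sided exact binomial P value for symmetry between two counts.

    Assumption (not taken from the paper): under symmetry each observed
    substitution is 'forward' (e.g. A->T) or 'reverse' (e.g. T->A) with
    probability 0.5, independently of the others.
    """
    if n_forward < 0 or n_reverse < 0:
        raise ValueError("counts must be non-negative")
    n = n_forward + n_reverse
    if n == 0:
        return 1.0  # no substitutions observed, so no evidence of asymmetry
    k = min(n_forward, n_reverse)
    # Probability of a count at least as small as k when p = 0.5.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The distribution is symmetric at p = 0.5, so double the tail and cap at 1.
    return min(1.0, 2.0 * lower_tail)


# Hypothetical counts for one sequence: 10 A->T and 0 T->A substitutions.
p_at = binomial_symmetry_p(10, 0)
print(f"AT asymmetry P value: {p_at:.6f}")
```

A small P value here would indicate that one direction of substitution is over-represented, which is the kind of asymmetry the text attributes to transcription acting on one strand.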

selection may be acting in only one strand for the CODs and ncRNAs, which could be detected by asymmetries of substitutions (e.g., A→T compared with T→A). Such asymmetry has been shown to exist, and selection due to transcription is one explanation (25, 26). We quantified this asymmetry for CNGs, CODs, and ncRNAs with substitutions that maintain the G + C content (A→T compared with T→A and C→G compared with G→C), because G + C content may be under different selective forces. We first counted the number of A→T substitutions compared with T→A and C→G compared with G→C in the phylogenetic tree for each of the sequences. We then calculated the probability of the data assuming a binomial distribution and obtained probabilities for the AT and CG asymmetries. The AT asymmetry (Fig. 3C) was more pronounced in ncRNA-high, ncRNA-all, and CODs than in CNG-all and CNG-high, illustrating that transcription generates a preferential accumulation of substitutions in one strand. The CG asymmetry (Fig. 3D) was stronger in CODs than in both CNGs and ncRNAs, indicating a protein-coding specific bias for this type of substitutions.

The results presented here demonstrate that a large fraction of CNGs on Hsa21 belong to a distinct class of highly constrained functional sequences. At least 29% of them were highly conserved in multiple mammalian species as distant as human, mouse, pig, elephant, wallaby, and platypus and generally more conserved than protein-coding and ncRNA sequences. High levels of conservation of almost all CNGs in dog suggest that the majority of CNGs are conserved in placental mammals. CNGs also have characteristics typical of protein-binding regions with alternating clusters of high- and low-constraint nucleotides, suggesting that some of them are indeed protein-binding and likely regulatory regions. Functional analysis of CNGs will require extensive protein-binding assays, reporter construct experiments, mouse knockouts, and other intensive experimental efforts. Nevertheless, understanding the role of CNGs in genome function and regulation and their involvement in phenotypic variation and human diseases should be a high priority in future genomic studies.

References and Notes
1. F. Chiaromonte et al., Proc. Natl. Acad. Sci. U.S.A. 98, 14503 (2001).
2. E. T. Dermitzakis, A. G. Clark, Mol. Biol. Evol. 19, 1114 (2002).
3. I. Dubchak et al., Genome Res. 10, 1304 (2000).
4. G. G. Loots et al., Science 288, 136 (2000).
5. S. A. Shabalina, A. Y. Ogurtsov, V. A. Kondrashov, A. S. Kondrashov, Trends Genet. 17, 373 (2001).
6. W. W. Wasserman, M. Palumbo, W. Thompson, J. W. Fickett, C. E. Lawrence, Nature Genet. 26, 225 (2000).
7. E. T. Dermitzakis et al., Nature 420, 578 (2002).
8. R. J. Mural et al., Science 296, 1661 (2002).
9. K. A. Frazer et al., Genome Res. 11, 1651 (2001).
10. R. H. Waterston et al., Nature 420, 520 (2002).
11. D. Boffelli et al., Science 299, 1391 (2003).
12. L. Elnitski et al., Genome Res. 13, 64 (2003).
13. O. Madsen et al., Nature 409, 610 (2001).
14. W. J. Murphy et al., Nature 409, 614 (2001).
15. W. J. Murphy et al., Science 294, 2348 (2001).
16. Characteristics are theoretically influenced by the alignment coverage. Therefore, we consider both CNG and ncRNA data sets (CNG-all, CNG-high, ncRNA-all, and ncRNA-high). Comparisons involving CNG-high and ncRNA-high are statistically sound, whereas comparisons with CNG-all and ncRNA-all are indicative.
17. E. T. Dermitzakis et al., unpublished data.
18. Materials and methods are available as supplementary material on Science Online.
19. M. S. Springer, W. J. Murphy, E. Eizirik, S. J. O'Brien, Proc. Natl. Acad. Sci. U.S.A. 100, 1056 (2003).
20. Z. Yang, Comput. Appl. Biosci. 13, 555 (1997).
21. E. T. Dermitzakis, C. Bergman, A. G. Clark, Mol. Biol. Evol. 20, 703 (2003).
22. N. Stojanovic et al., Nucleic Acids Res. 27, 3899 (1999).
23. J. Y. Leung et al., Proc. Natl. Acad. Sci. U.S.A. 97, 6614 (2000).
24. H. Tang, R. C. Lewontin, Genetics 153, 485 (1999).
25. P. Green, B. Ewing, W. Miller, P. J. Thomas, E. D. Green, Nature Genet. 33, 514 (2003).
26. A. C. Frank, J. R. Lobry, Gene 238, 65 (1999).
27. We are grateful to M. Casellini, F. M. Catzeflis, D. L. Dittmann, J. A. Marshall Graves, M. Ruedi, F. Sheldon, P. Vogel, C. Wenker, and M. Westerman for DNA samples; to W. Murphy and S. O'Brien for sharing sequence data; and to M. Chapuisat and D. Sanchez Ruiz for help and advice. This project was supported by the Fonds National de la Recherche Scientifique (Switzerland), NCCR Frontiers in Genetics, the European Union, and the ChildCare and Lejeune Foundations. Tissues were provided by Institut des Sciences de l'Evolution, Montpellier University 2; the Louisiana Museum of Natural History, Louisiana State University; the Comparative Research Group, Australian National University; the Department of Genetics, La Trobe University; the Basel Zoo; the Museum of Natural History of Geneva; and the Institute of Ecology, University of Lausanne.

Supporting Online Material
www.sciencemag.org/cgi/content/full/1087047/DC1
Materials and Methods
SOM Text
Figs. S1 to S4
Tables S1 to S3
References

21 May 2003; accepted 22 September 2003
Published online 2 October 2003; 10.1126/science.1087047
Include this information when citing this paper.
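
The rate measure used in the text, sequence change per million years (D/my), divides the change inferred on each branch of the phylogenetic tree by the time that branch spans. A rough Python sketch of that arithmetic follows. It is an illustration under stated assumptions rather than the authors' pipeline: the branch names, substitution counts, and branch durations are invented, change is expressed per aligned site, and the ancestral-state reconstruction (done with PAML in the paper) is assumed to have been carried out already.

```python
from typing import Dict


def divergence_per_my(subs_per_branch: Dict[str, float],
                      branch_my: Dict[str, float],
                      aligned_sites: int) -> Dict[str, float]:
    """Per-branch sequence change per site per million years.

    subs_per_branch: inferred substitutions placed on each branch.
    branch_my: duration of each branch in millions of years.
    aligned_sites: number of aligned positions (assumed normalization).
    """
    if aligned_sites <= 0:
        raise ValueError("aligned_sites must be positive")
    rates = {}
    for branch, n_subs in subs_per_branch.items():
        duration = branch_my[branch]
        if duration <= 0:
            raise ValueError(f"branch {branch!r} has non-positive duration")
        # Change per site on this branch, divided by the time it covers.
        rates[branch] = (n_subs / aligned_sites) / duration
    return rates


# Hypothetical input: two branches of a 200-site alignment.
example_subs = {"branch_a": 4, "branch_b": 10}
example_my = {"branch_a": 20.0, "branch_b": 50.0}
rates = divergence_per_my(example_subs, example_my, aligned_sites=200)
for branch, rate in rates.items():
    print(f"{branch}: {rate:.4f} substitutions per site per my")
```

How the per-branch values are combined into one number per sequence is not specified in the text, so the sketch stops at the per-branch rates.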

www.sciencemag.org SCIENCE VOL 302 7 NOVEMBER 2003 1035