Genomics 86 (2005) 414 – 422 http://www.paper.edu.cn

Nested in the

Peng Yu a,b,c,*, Dalong Ma a,b, Mingxu Xu a,b,†

a Laboratory of Medical Immunology, School of Basic Medical Sciences, Peking University, Beijing 100083, People's Republic of China
b Center for Human Disease Genomics, Peking University, Beijing 100083, People's Republic of China
c Center for Bioinformatics, Peking University, Beijing 100083, People's Republic of China

Received 29 March 2005; accepted 15 June 2005 Available online 3 August 2005

Abstract

Here we studied one special type of overlapping gene, i.e., the nested gene, in the human genome. We collected 373 reliably annotated nested genes. Two-thirds of them were on the strand opposite that of their host gene. About 58% of coding nested gene pairs were conserved in mouse and some were even maintained in chicken and fish, while nested pseudogenes were poorly conserved. Ka/Ks analysis revealed that nested genes were under strong selection, although they did not demonstrate greater conservation than other genes. With microarray data we observed that the two partners of a nested pair seemed to be expressed reciprocally. A significant proportion of nested genes were expressed tissue-specifically. Gene Ontology analysis demonstrated that quite a number of nested genes participated in cellular signal transduction. Based on these observations, we think that nested genes are a group of genes with important physiological functions. © 2005 Elsevier Inc. All rights reserved.

Keywords: Nested gene; Gene-within-a-gene; Overlapping gene; Evolution; Comparative analysis; Inverse expression

* Corresponding author. Laboratory of Medical Immunology, School of Basic Medical Sciences, Peking University, Beijing 100083, People's Republic of China. Fax: +86 10 82801149. E-mail address: [email protected] (P. Yu).
† Deceased.

0888-7543/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2005.06.008

Reposted by Sciencepaper Online (中国科技论文在线), http://www.paper.edu.cn

Nested gene, or gene-within-a-gene, refers to a gene that is contained in another gene. In eukaryotes, nested genes are usually located within one intron of a host gene. It was first reported in Drosophila that the gene Pcp encoding pupal cuticle protein was found within an intron of adenosine 3 (ade3), lying on the opposite DNA strand [1]. In human it was first reported for the gene F8A1 (coagulation factor VIII-associated intronic transcript 1), which was entirely contained in intron 22 of coagulation factor VIII (F8), also on the opposite strand [2]. Parallel to the progress of genome sequencing projects, more and more nested genes have been discovered. In Drosophila, this category of gene comprises about 7.5% of the total genes, among which about 85% encode proteins, while the remaining 15% are noncoding RNAs [3]. Sequencing of human chromosomes 21 and 22 revealed about a dozen nested genes [4,5]. In addition to coding genes, pseudogenes and snoRNA genes were also found within introns [6,7]. In human chromosome 7, 100 processed pseudogenes were reported to be located in introns of unrelated genes [6]. SnoRNAs in introns are processed from the pre-mRNA of the host genes [7], and they are not considered as independently transcribed nested genes.

Although nested genes have been known for a long time, no systematic study has been conducted on them. Some features of this type of gene have been observed in Drosophila. For example, most of the reported nested genes are on the strand opposite that of the host gene and many are intronless. In other species, however, including human, these features are reported only in sporadic cases and need to be verified on a larger scale. In addition, no systematic comparative analysis has been conducted to study the evolution of nested genes and their conservation status between species.

In essence, nested genes represent an extreme type of overlapping gene. As suggested by Miyata and Yasunaga [8], the rate of evolution can be expected to be slower in overlapping genes. Veeramachaneni et al. [9] studied the conservation of overlapping genes between human and mouse and did not observe supporting evidence for the view of Miyata et al. However, the study was carried out with multiple types of overlapping genes mixed and was biased toward genes overlapping in their boundary regions (UTRs or regulatory regions). As nested genes are totally embedded in introns, we think that they may be different from other overlapping genes. Being a segment of two transcripts (the intronic region of the host gene and itself), the nested gene might be under a double transcription check, which might reduce the probability of mutation. With the availability of the genome sequence of multiple organisms, it is possible to carry out a refined comparative analysis to test this hypothesis.

The relationship between the host and the nested gene is also intriguing. Up to now, in only one case, the gene neurofibromin 1 (NF1) and its nested gene oligodendrocyte myelin glycoprotein (OMG), have the two genes been reported to have similar functions of growth suppression [10]. On the other hand, it has been observed that in the loci of eukaryotic translation initiation factor (eIF)2A [11], insulin-like growth factor 2 receptor (Igf2r) [12], and α1-collagen (I) [13], intronic genes on the opposite strand interfere with the expression of the host genes. As the number is very limited, a larger scale study is needed to fully elucidate the relationship.

Here we extracted the reliably annotated nested genes in the human genome and carried out systematic studies on their gene size, strand orientation, and functional category. We also used the human–mouse genome alignment data and a comparative genomics database to study the conservation of nested genes in multiple species. The Ka/Ks ratio was used to study the selection on nested genes. In addition, with public microarray data we studied the expression correlation of the host and nested genes.

Results

Identification of nested genes

According to the chromosomal localization of annotated human genes (NCBI MapViewer Build 34.3), we initially identified 804 nested gene pairs. However, by comparing with the genes' chromosomal alignment at the UCSC Genome Browser, we found that the localizations of 285 genes were greatly inconsistent between the two databases. These suspicious pairs were discarded from our dataset. We also checked the EST support for coding nested genes; 146 genes with poor EST support were excluded (see Materials and methods). Finally we obtained 373 nested gene pairs in the human genome, comprising 340 host genes and 373 nested genes (Supplementary Table 1). Of the host genes, 27 contained multiple nested genes, so that the number of host genes was less than that of the total pairs. All but 3 host genes encoded proteins. Of the nested genes, 158 were coding genes with good EST support. Of the remaining, 212 were pseudogenes, among which 189 appeared to be processed and 23 showed signs of introns. In addition there were 3 snoRNA genes. About 63% of nested genes were on the strand opposite that of the host, forming antiparallel pairs. The percentage was similar for coding genes and pseudogenes. For the remaining 37% of pairs, the two partners were on the same strand in a parallel manner (Table 1). No chromosomal distribution bias was detected for the nested genes.

Table 1
Types of nested genes

Nested gene   Parallel   Antiparallel   Total
Coding        53         105            158
Pseudogene    81         131            212
SnoRNA        3          0              3
Total         137        236            373

Gene size and overlapping pattern

The host genes were relatively larger, with a mean exon number of 17 (±13.8), compared to the average level of human genes, which is ~10.4 exons per gene [14]. The nested genes were much smaller, with a mean exon number of 2.1 (±1.9). About 41% (64/158) of the coding nested genes had only one exon.

We studied the size distribution of the introns containing nested genes and compared it with that of the other introns of the host genes (Fig. 1). It revealed that the introns with nested genes were significantly larger than the others. The median length of these introns was 21.5 kb (±23.6 kb), and about 68.2% of them were >10 kb, while the median length of other introns of the host genes was just 2.5 kb (±2.8 kb) and only 16.7% of them were >10 kb. There were 10 introns containing nested genes that were >200 kb. Based on these observations, it seemed that nested genes tended to occur in large introns.

Karlin et al. had reported that nested genes in human chromosomes 21 and 22 were often located within the boundary intron (first or last) of host genes [5]. In our dataset, we observed that about 62% of coding genes and 59% of pseudogenes were in the internal introns, enclosed by the coding exons of the hosts. These internal introns were usually larger than the boundary ones and might give further support for the association between intron size and the probability of forming a nested structure (data not shown).

As mentioned above, most host genes contained only one nested gene. For hosts with multiple nested genes, the nested genes were often located in one intron and were similar to each other, indicating their possible formation by duplication.

Pair-wise BLAST [15] alignment of the host and nested genes showed no association between the partners, with only 8 pairs having identity greater than 20%. We also


Fig. 1. Distribution of the length of the introns of host genes. The white bars represent the length distribution of the introns containing nested genes. The length of nested gene is excluded. The black bars represent the length distribution of other introns of the host genes. The y axis is the probability of the introns of certain size. The area of each bar represents the percentage of introns in the size group.
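The caption states that the area of each bar, not its height, represents the percentage of introns in a size group. A minimal sketch of that convention is given here, assuming unequal-width size bins; the bin edges and intron lengths in the example are hypothetical and are not taken from the study.

```python
def density_histogram(lengths_kb, bin_edges_kb):
    """Return one (low, high, fraction, height) tuple per bin.

    height = fraction / bin width, so bar area (height * width) equals the
    fraction of introns in that bin. Lengths outside the outermost edges
    are not assigned to any bin.
    """
    n = len(lengths_kb)
    bars = []
    for low, high in zip(bin_edges_kb, bin_edges_kb[1:]):
        count = sum(low <= x < high for x in lengths_kb)
        fraction = count / n
        bars.append((low, high, fraction, fraction / (high - low)))
    return bars


# Hypothetical intron lengths (kb) and hypothetical bin edges (kb).
example_lengths = [1.2, 2.5, 3.1, 15.0, 25.0, 150.0]
example_edges = [0, 10, 100, 1000]

for low, high, fraction, height in density_histogram(example_lengths, example_edges):
    print(f"{low}-{high} kb: fraction={fraction:.3f}, bar height={height:.5f}")
```

With this scaling, a wide bin holding few introns gets a short bar even if its raw count matches that of a narrow bin, which is why the two distributions in the figure can be compared by area.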

BLASTed [15] the proteins encoded by the nested genes against all known human proteins annotated in RefSeq (Release 8, October 2004) [16]. For each gene, only the best hit located in a different locus and satisfying the lowest threshold (coverage > 0.3, identity > 0.3) was kept. As a result we obtained possible paralogs for 99 coding nested genes. In 29 cases a single-exon nested gene was aligned to a multiple-exon gene, indicating that the former might have been formed by retroposition [17]. In five cases both genes had a single coding exon. The remaining 65 pairs were composed of two multiple-exon genes. There were still 59 nested genes (158 − 99) that seemed not to have any paralog even under such a low threshold.

GO annotation of nested gene pairs

There were 106 host genes and 96 nested genes with GO (Gene Ontology) [18] annotation information. By comparing the GO entries and checking manually, we observed that only 5 parallel nested gene pairs seemed to have similar functions. One antiparallel pair, the zinc-finger protein 540 (ZNF540) and its nested gene zinc-finger protein 571 (ZNF571), were similar but seemed to have been formed by local duplication.

We compared with the GO annotation for the whole human proteome (GOA [19] human 27.0, including 23,148 UniProt proteins corresponding to a similar number of genes) to study whether nested genes were overrepresented in specific functional classes. It revealed that 28 of 96 nested genes acted as signal transducers; the proportion (29%) was significantly higher than that of the GOA set (3710 of 23,148 genes were annotated as signal transducers, 18.9%) (p < 0.001). Consistent with this, nested genes had a significantly higher percentage in cellular components such as membrane (37.5% vs. 27.7%) and extracellular region (11.5% vs. 6.0%) (both p < 0.05). There were 12 nested genes encoding G-protein-coupled receptors. For host genes, a higher percentage was seen in the "intracellular" cellular component (50.9% vs. 40.1%, p < 0.05) compared to the GOA set. Of molecular functions, host genes had a significantly higher percentage of "protein binding" (18.9% vs. 12.0%, p < 0.05) and "ligase activity" (7.5% vs. 3.3%, p < 0.05).

Conservation analysis

First, we used the human–mouse genome alignment to study the conservation of nested gene pairs in mammals. A total of 92 coding–coding pairs (both host and nested genes encode protein) were conserved in mouse (92/158, 58%). Among them 27 were parallel pairs and 65 were antiparallel. Only 4 coding–pseudogene pairs were conserved (4/212, 1.9%). In addition, 118 pseudogenes were aligned to regions far away (>400 kb) from the orthologous region of the host genes; 20 pseudogenes were aligned to different chromosomes. Based on this observation, it seemed that most of the nested pseudogenes were probably formed after the human–mouse split. Pseudogene analysis on human chromosome 7 also supports our observation [6].

Of the 92 conserved nested genes, 21 participate in signal transduction. For the 66 nonconserved nested genes, the genes' functions were divergent and there was no outstanding category. Thirty-five of the 66 genes were intronless.

Furthermore, we used the Ensembl comparative genomics database [20] to study the conservation status in a wider range. As a result, we obtained homologs for 139 host genes and 111 nested genes in species other than primates. It strikingly emerged that quite a few nested pairs were conserved among multiple organisms. In addition to 53 conserved pairs in mouse, we found 26 pairs maintained in chicken and 14 pairs in Takifugu rubripes. The relationships of gene pairs conserved in human, mouse, chicken, and Takifugu are demonstrated in Fig. 2. However, the conservation seemed to be limited to vertebrates, as we found only one conserved pair in Drosophila.

The number of conserved pairs in mouse found here (53) was less than that obtained by human–mouse genome alignment (92). There were 49 pairs detected by both methods. Four pairs were obtained only by searching the Ensembl database. Of the 43 pairs obtained only by genome alignment, 20 pairs were annotated with different gene structures in the NCBI MapViewer and the Ensembl system. For another 15 pairs, at least one member of each pair had a close paralog in human and the paralogs shared a common mouse ortholog. These pairs were eliminated when searching the Ensembl database. The remaining 8 pairs were not included in the Ensembl database.

Selection on nested genes

The ratio of the rate of nonsynonymous substitutions (Ka) to the rate of synonymous substitutions (Ks) was used to measure selection on nested genes. It is assumed that synonymous substitutions are usually neutral, whereas nonsynonymous substitutions are subject to selective pressure. Ka/Ks < 1 is indicative of purifying selection, which is the most common mode of selection [21,22].

We used the precalculated Ka/Ks ratios for human and mouse orthologous genes in the Ensembl comparative genomics database. As mentioned above, 53 nested pairs were found to be conserved in mouse by searching the database. Three pairs had no corresponding Ka/Ks ratio in the database. For the remaining 50 pairs, we compared the Ka/Ks ratios and Ks values of the ortholog groups of the host and nested genes using the Wilcoxon test. In addition we tested whether a correlation existed between the intensity of selection on the host and nested genes. The hypothesis was that if the two partners had a functional association, they might evolve coordinately.

We also found that 14 of the above 50 nested genes had paralogs in the human genome that were not embedded in other genes. These paralogs had their respective unique best hits in the mouse genome, indicating that they were formed before the human–mouse split. We extracted the Ka/Ks ratios for these genes and compared them with those of their nested paralogs.

The comparisons demonstrated that Ka/Ks ratios for the nested genes were significantly larger than those of the host genes (p < 0.02), which indicated a weaker selection on the nested genes. However, the intensity of selection was still comparable to the average level of human and mouse orthologous genes (Table 2). The Ks values of nested genes were also significantly larger than those of the hosts (p < 0.01); combined with the Ka/Ks ratio comparison, it seemed that the nested genes accumulated more mutations (both synonymous and nonsynonymous) than the host genes. We also observed that the Ks values of the two groups were significantly correlated (Pearson test, coefficient 0.579, p = 1.545 × 10^−5). No correlation between the Ka/Ks ratios of the host and of the nested genes was detected (Spearman test). For nested genes and their nonembedded paralogs, both the Ka/Ks ratios and the Ks values were similar (p > 0.2).

Table 2
Comparison of Ka/Ks ratios

Type                      Ka/Ks            Ks
Host genes                0.099 ± 0.106    0.539 ± 0.189
Total nested genes (a)    0.168 ± 0.154    0.645 ± 0.219
Nested gene subset        0.120 ± 0.085    0.611 ± 0.232
Nested genes' paralogs    0.080 ± 0.068    0.540 ± 0.153
Human-Mus ortholog (b)    0.131 ± 0.131    −

The values are means ± standard deviation.
(a) Total nested genes were the 50 nested genes that were conserved in mouse. Among them, 14 genes had nonembedded paralogs in human and were specially taken out as a subset. The Ka/Ks ratios and the Ks values for the ortholog group of nested genes were significantly higher than those of the host genes (Wilcoxon test, p < 0.02). No correlation between the selection on the host and on the nested genes was detected (Spearman test). The Ka/Ks ratios and the Ks values were similar between nested genes and their nonembedded paralogs (Wilcoxon test, p > 0.2).
(b) The Ka/Ks ratio for human–mouse orthologs was taken from the paper by Kondrashov et al. [23].
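The Spearman test named above asks whether host and nested Ka/Ks ratios rise and fall together in rank order. As an illustration only, the sketch below computes a Spearman coefficient in plain Python by ranking both series (ties receive average ranks) and applying the Pearson formula to the ranks. The function names and the Ka/Ks values are hypothetical, no p value is computed, and this is not the software used in the study.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical Ka/Ks ratios for five host/nested pairs (not study data).
host_kaks = [0.05, 0.10, 0.08, 0.20, 0.12]
nested_kaks = [0.30, 0.11, 0.09, 0.15, 0.25]
print(round(spearman(host_kaks, nested_kaks), 3))
```

A coefficient near zero on real data would be in line with the reported absence of correlation; judging significance would still need a p value from a statistics package.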

Fig. 2. The relationships of gene pairs conserved in human, mouse, chicken, and Takifugu. The outer circle represents all human nested genes. The left, right, and bottom inner circles represent the conserved nested gene pairs in Takifugu, chicken, and mouse, respectively.

Expression correlation of the partners

Since the Ka/Ks ratio correlation may be an indirect measure of functional association, we studied the expression of the host and nested genes using GNF (Genomics Institute of the Novartis Research Foundation) microarray data [24]. The dataset covers the expression of human transcripts across a diverse panel of 79 tissues. As the data were produced on a uniform technology platform, they are very reliable for comparison between tissues. Of the 158 coding–coding nested pairs, 45 pairs had respective probes for both members. Four pairs were parallel and the others were antiparallel. Additionally there were 57 host genes and 29 nested genes with corresponding probes whose partners had none.

For each of the 45 pairs, we took 20 tissues (10 for the host and 10 for the nested gene, see Materials and methods) and studied the expression correlation of the partners. It was assumed that compatible expression of genes might imply that they function cooperatively; otherwise they would not be coexpressed or might even be mutually interfering.

There were 33 pairs showing significant negative correlation (Pearson correlation coefficient < −0.45, p < 0.05), including all 4 parallel pairs and 29 antiparallel pairs. The correlation coefficients and p values are plotted in Fig. 3.

Of the remaining 12 pairs, 3 showed a positive correlation. One of them, DREV1 (DORA reverse strand protein 1) and its nested gene IGSF6 (immunoglobulin superfamily, member 6), had a significant correlation (p = 0.02). The other 9 pairs showed insignificant negative correlation. By checking the expression data, we found that in 11 of the 12 pairs, at least one member was expressed tissue-specifically. For example, in the case of DREV1/IGSF6, IGSF6 was detected to be highly expressed only in PB CD14+ monocytes, whole blood, and BM CD33+ myeloid, whereas DREV1 was widely expressed in multiple tissues. When conducting our correlation analysis, we can take 20 tissues and carry out the test. The calculated coefficient is 0.594 with a p value of 0.02. However, if we take just the first 3 tissues, the coefficient is −0.618 with a p value of 0.576. Thus the tissue-specific expression of the gene greatly interfered with the calculation of the correlation. The other 2 pairs with apparent positive correlation also proved to present a similar condition. For the other 8 pairs with negative correlation, it was interesting to find that the 8 nested genes were highly expressed in only one tissue, while in other tissues the gene's expression was average and much lower than in the top tissue.

For the last pair, RB1 (retinoblastoma 1) and its nested gene P2RY5 (purinergic receptor P2Y, G-protein-coupled, 5), it was detected that the two genes were both highly expressed in PB CD14+ monocytes. However, the signal intensities of the two replicate probes of RB1 in the tissue were sharply different (628 and 1455, respectively). If the data point is excluded, the two genes showed significant negative correlation (p = 0.02).

Tissue-specifically expressed genes

As mentioned above, we observed that some nested genes appeared to be highly expressed in only one tissue. We defined a simple but efficient criterion (see Materials and methods) and used it to detect that 16 of the 74 nested genes with expression data were highly expressed in one tissue (Table 3). For example, the gene corneodesmosin (CDSN) was highly expressed only in skin. Consistent with the expression status, CDSN has been shown to be closely associated with skin diseases such as psoriasis and hypotrichosis [25,26]. It is interesting to note that its host gene, psoriasis susceptibility 1 candidate 1 (PSORS1C1), was also specifically expressed, but in testis seminiferous tubule. Another nested gene, H2B histone family, member S (H2BFS), was uniquely highly expressed in the sample of chronic myelogenous leukemia, indicating its potential role in the disease.

Of the host genes, 9 of 102 matched our criterion. And in the 14,162 human genes of the GNF U133A dataset, 1634

Fig. 3. Correlation status of the partners of nested gene pairs. The y axis is the Pearson coefficient. The x axis is the p value. To make the graph compact, points with high p values (>0.4) are not included.
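The Pearson coefficients plotted in Fig. 3 come from a per-pair comparison over 20 tissues, 10 chosen for the host and 10 for the nested gene. The sketch below shows one way such a comparison could be set up. It assumes that the tissues chosen for each gene are those with its highest expression and that overlapping choices are merged, which may differ from the authors' exact procedure; the function names and expression values are hypothetical.

```python
from math import sqrt


def top_tissues(expression, n_top=10):
    """Names of the n_top tissues with the highest signal for one gene."""
    return sorted(expression, key=expression.get, reverse=True)[:n_top]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pair_correlation(host_expr, nested_expr, n_top=10):
    """Correlate host and nested expression over the union of top tissues."""
    tissues = sorted(set(top_tissues(host_expr, n_top))
                     | set(top_tissues(nested_expr, n_top)))
    host_values = [host_expr[t] for t in tissues]
    nested_values = [nested_expr[t] for t in tissues]
    return pearson(host_values, nested_values), tissues


# Hypothetical signal intensities over four tissues (not GNF data).
host_expr = {"liver": 100.0, "lung": 90.0, "skin": 5.0, "testis": 4.0}
nested_expr = {"liver": 3.0, "lung": 6.0, "skin": 80.0, "testis": 95.0}

r, used = pair_correlation(host_expr, nested_expr, n_top=2)
print(used, round(r, 3))
```

In this toy case the host is high where the nested gene is low, so the coefficient is strongly negative, the pattern reported for most pairs. As the DREV1/IGSF6 example shows, the result can be sensitive to which tissues enter the calculation.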


matched the criterion. The percentage of specifically expressed genes among the nested genes (21.6%) was significantly higher than among the hosts (8.8%), as well as the U133A set (11.5%) (both p < 0.05, χ² test).

Table 3
Tissue-specifically expressed nested genes

Gene        Tissue
TIMP3       Placenta
GPR105      PB-BDCA dendritic cells
GPR86       Whole blood
CHAD        Trachea
GPR87       Bronchial epithelial cells
NR1I3       Liver
GPR18       PB-CD19 B cells
H2BFS       Leukemia chronic myelogenous (k562)
PLAC4       Placenta
LOC55908    Liver
PTX3        Smooth muscle
CDSN        Skin
FLJ10647    Placenta
IAPP        Pancreatic islets
PMCHL1      Hypothalamus
LRRC17      Smooth muscle

Discussion

In this study, we collected the largest currently available set of nested genes that were reliably annotated in the human genome. A total of 158 coding genes and 212 pseudogenes were included. As we kept just genes with confirmed chromosomal localization and good EST support, the actual number of coding nested genes might be larger. About two-thirds of the nested genes were on the strand opposite that of their host gene, and 41% of coding nested genes were intronless.

During the process of data collection, we found that many genes were problematically annotated for their structures, especially the UTR region. This again reminds us that the human genome annotation is still far from perfect. Using different datasets for cross-validation seems to be very necessary.

Recently mounting evidence suggests that gene overlap exists widely in eukaryotic genomes [3,27–30] (for review see [31]). To date the two largest datasets of human overlapping genes are those observed by Yelin et al. and Chen et al. [28,29]. Both of these groups focused on the natural antiparallel overlapping genes. The common feature they found was that the majority of the antiparallel overlap (62–72%) occurred between the UTR region of one coding gene and a noncoding RNA on the opposite strand. The authors suggested that this might indicate that antisense overlapping genes play roles in translation regulation. No evolutionary analysis was conducted in the two studies. In our study we focused on nested genes. About 60% of them lie within an internal intron of their host gene. In addition, no overrepresented functional classes were detected by Yelin et al., while Chen et al. suggested that antisense genes had a significantly higher percentage participating in "translation regulator activity," but a significantly lower percentage in "signal transducer activity" compared to other human genes. However, in our data a significant proportion of genes were involved in signal transducer activity. This also indicated that our dataset was different from the two sets.

The conservation of nested genes was only sporadically demonstrated before, e.g., in the case of tissue inhibitor of metalloproteinase (TIMP) and the synapsin gene family. The TIMP1/SYN1, TIMP4/SYN2, and TIMP3/SYN3 gene pairs are coupled in both human and mouse genomes, with TIMP nested in SYN [5]. Using human–mouse genome alignment, we observed that 58% of human coding–coding nested gene pairs were conserved in mouse. Antiparallel pairs (65/105, 61.9%) seemed to be more conserved than parallel pairs (27/53, 50.9%). Some pairs could even be traced back to chicken and fish. However, few pairs were maintained in lineages outside of vertebrates, implying the phenomenon might be lineage specific. Meanwhile we found that pseudogenes in introns were not conserved and probably formed after the human–mouse split.

Since the nested structure is fairly conserved, we set out to test two hypotheses: whether nested genes are more conserved than other genes and whether there is a functional association between the host and the nested gene. As to the first, it was hypothesized that as a segment of two transcripts, the nested gene might be under a double transcription check, which might reduce the probability of mutation. A similar hypothesis has been tested by Veeramachaneni et al. by studying the conservation of overlapping genes between human and mouse [9]. Their results showed that overlapping genes were not significantly more conserved than other genes. However, they studied nested genes mixed with other types of overlapping genes. Here we specifically studied the conservation of nested genes. It was demonstrated that nested genes were under weaker selection than their host genes; however, their evolutionary rate was similar to that of their nonembedded paralogs. Based on this observation, the effect of a double transcription check might not really exist or might be very weak. The intensity of selection on nested genes was still very strong (Ka/Ks << 1) and comparable to the average level of human–mouse orthologous genes.

As to the second hypothesis, we did not detect a correlation between the intensity of selection of the two partners. Gene Ontology analysis demonstrated that only five parallel pairs seemed to have similar functions. As the data were very limited, we could not reach the conclusion that a positive functional association between the two partners widely existed.

With microarray data, we detected a significant negative correlation for the expression of two partners on the same strand or on opposite strands. No true positive correlation was observed. The phenomenon of inverse expression of genes on opposite strands has been observed at the eIF2A [11], Igf2r/Air [12], and α1-collagen (I) [13] loci before. However, the three antisense genes are noncoding RNAs. The genes we studied here all encode proteins. The negative correlation might indicate that there is transcriptional interference between the partners. The interference might take place by direct competition for the transcription apparatus or by formation of double-stranded RNAs. As to the latter, as nested genes are located in introns, dsRNA is likely to form during the splicing process. Nested genes could also participate in processes such as RNA editing and imprinting by forming double-stranded RNA with the hosts [32,33]. Refined experimental work is needed to verify the potential interactions of the two partners. Interestingly, in a quite recent paper, a strategy to study the real-time transcription of two partners of one nested pair using atomic force microscopy was proposed [34], which might facilitate unveiling the potential interaction of two partners at the molecular level. As a by-product of correlation analysis, we observed that a significant proportion of nested genes were expressed in an extremely tissue-specific manner, which might indicate they could play important physiological roles.

The formation of the nested structure seems to be associated with multiple mechanisms. In our dataset, about 50% of the genes were processed pseudogenes. It is accepted that retroposition is the main mechanism by which processed pseudogenes are inserted into the genome [17]. Long interspersed element 1 plays an important role in the process [35,36]. In addition to pseudogenes, there were 29 nested genes that might have been formed by retroposition. The direction of retroposition is easy to determine because the parental gene has introns while the retro-gene does not. However, for the other 70 nested genes that have nonembedded paralogs, the case becomes complex. There are two possibilities: the nested gene originally existed in the intron of the host and the outside gene is the newer copy, or vice versa. To solve the puzzle, an in-depth evolutionary comparison is needed. As a potential hint, we observed that nested genes tended to be located in relatively large introns. A previous study on retroposing virus has demonstrated that the viruses more commonly insert in the proximity of highly expressed genes [37], because these regions are more likely to be in open chromatin than elsewhere. Here we might draw a similar inference that large introns could supply such an open environment for external genes to enter. Additionally there were 59 genes without recognizable paralogs. Their paralogs might have been lost or diverged too much to be detected, or they may have been formed by other mechanisms.

Conclusion

Overlapping genes have been shown to exist widely in eukaryotes. Different types of overlapping genes may correspond to different mechanisms of formation and play different roles. Our study here focuses on an extreme type of overlapping gene, i.e., the nested gene. We expect that our work can contribute to a better understanding of the regulation of the human genome. We also believe more and more hidden treasures like nested genes will be found in the future.

Materials and methods

Identification of nested genes in the human genome

We downloaded the human gene annotation file (Build 34.3) from the NCBI (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/). Only the reference annotation for 26,732 genes was used. According to chromosomal localization, we identified the genes that were embedded within other genes. Gene pairs that had overlapping exons larger than 20 bp were eliminated.

Initially we obtained 804 nested genes, among which 551 were coding genes, 250 were pseudogenes, and the remaining 3 were snoRNA genes. By reviewing the data, we found that many nested gene pairs were the product of annotation error. In these pairs, usually the host genes were wrongly assigned a UTR exon that was far away from the other exons. Thus the adjacent small genes were wrongly included in the pseudo-long tail of the host genes. To exclude such cases, we used the BLAT alignment of RefSeq mRNA sequences to the human genome (Build 34) at UCSC [38,39] (October 2004 freeze) to confirm each gene's location. Only genes with position shifts of less than 2000 bp compared to the UCSC alignment were considered to be correct. As a result, 285 nested gene pairs with suspicious chromosomal localization were eliminated. The number of coding nested genes and pseudogenes became 304 and 215, respectively.

For the 304 coding nested genes, we further confirmed their transcription using EST data. The ESTs that were uniquely aligned to human chromosomes at UCSC (October 2004 freeze) were used. For intronless genes, at least 50% coverage by more than two ESTs was required for the gene to be taken as supported. For multiple-exon genes, each exon was required to have a 20-bp minimum overlap with an EST. Only if more than 1/3 of the total exons were supported was the gene kept. Finally there were 158 nested genes satisfying the criteria. The other 146 coding nested genes were not included in the final dataset.

The proteins of coding nested genes were aligned against all human proteins annotated in RefSeq (Release 8, October 2004) using BLASTp. A low threshold (coverage > 0.3, identity > 0.3) was used to detect possible paralogs. For annotated pseudogenes, the DNA sequences were aligned to human proteins using BLASTx. Genes that contained intervals longer than 30 bp between aligning blocks (HSPs) were taken as not fully processed. Others were taken as processed.
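The coordinate-based identification step described above (genes embedded within other genes, with pairs discarded when exons overlap by more than 20 bp) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the `Gene` record, the half-open coordinate convention, the function names, and the example genes are all assumptions, and the later UCSC position check and EST filtering are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    strand: str                          # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]   # half-open [start, end) intervals

    @property
    def start(self) -> int:
        return min(s for s, _ in self.exons)

    @property
    def end(self) -> int:
        return max(e for _, e in self.exons)


def exon_overlap_bp(a: Gene, b: Gene) -> int:
    """Total number of bases shared between exons of a and exons of b."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a.exons
        for s2, e2 in b.exons
    )


def find_nested_pairs(genes: List[Gene], max_exon_overlap: int = 20):
    """Return (host, nested, orientation) for genes embedded in another gene."""
    pairs = []
    for host in genes:
        for inner in genes:
            if inner is host or inner.chrom != host.chrom:
                continue
            embedded = host.start <= inner.start and inner.end <= host.end
            same_span = (inner.start, inner.end) == (host.start, host.end)
            if not embedded or same_span:
                continue
            if exon_overlap_bp(host, inner) > max_exon_overlap:
                continue  # exons overlap too much: not an intronic nested gene
            orientation = "parallel" if inner.strand == host.strand else "antiparallel"
            pairs.append((host.name, inner.name, orientation))
    return pairs


# Hypothetical genes: INNER lies in an intron of HOST on the opposite strand;
# EXONIC lies inside HOST but shares 50 bp with a HOST exon, so it is dropped.
genes = [
    Gene("HOST", "chr1", "+", ((100, 200), (5000, 5100))),
    Gene("INNER", "chr1", "-", ((1000, 1500),)),
    Gene("EXONIC", "chr1", "+", ((150, 400),)),
]
print(find_nested_pairs(genes))
```

The all-against-all loop is quadratic in the number of genes; for a genome-scale annotation, sorting genes by chromosome and start position would be the natural refinement.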


Human–mouse genome alignment

We used the BLASTz Tight subset of alignments of human (July 2003 Build 34) vs. mouse (October 2003 mm4) at UCSC to study the conservation at the genome level. The dataset was obtained by stringently filtering the result of the human–mouse BLASTz alignments using the axtBest and subsetAxt programs [40]. Only the best alignment for any given region of the human genome was kept. The same criterion as for EST checking was used to judge if a gene was conserved (50% of the gene or more than 1/3 of the exons were included in conserved blocks).

Evolutionary analysis

The work was based mainly on the Ensembl comparative genomics database (version 26). The orthologs of host and nested genes were obtained using BioEnsEMBL modules. The Ka/Ks ratio was precalculated in Ensembl using the codeml program included in the PAML package [41]. We also calculated Ka/Ks locally and the results were similar to those in the Ensembl database, so we used the precalculated data. The mean of Ks was about 0.6, which was consistent with the substitution rate between human and mouse reported by Cooper et al. [42].

Expression correlation analysis

We downloaded the U133A+GNF1H microarray data for human from http://symatlas.gnf.org. The data were prenormalized using the MAS5 algorithm. As all our genes were included in RefSeq, they were represented mainly on the Affymetrix U133A array, while only 28 genes were included in the GNF custom array (GNF1H). So we used only the U133A array data. For genes with multiple probes, the longest probe was used. The mean signal intensity of the two replicate probes for one gene was used to represent the expression amount of the gene.

When we studied the expression correlation of two partners, we took, for each of the host and nested genes, the 10 tissues in which it was most highly expressed. A tissue common to both lists was used once. The Pearson test was conducted to test the expression correlation of the two genes in the resulting tissues.

To detect tissue-specifically expressed genes, we first sorted tissues by their expression amount. If the expression in the first tissue was twice as high as that in the second, the gene was suspected to be a candidate. Using such a criterion, we originally got 17 nested genes and 16 were proved to be true. For the 10 host genes detected, 9 were correct.

Gene ontology analysis

We used GoTermfinder [43] to map genes with ontology annotation to the Goslim trees defined by the goslim_goa file (N. Mulder, M. Pruess, February 2005) [44]. We then used the tail of the hypergeometric distribution to calculate the p value against the background of all annotated human genes.

Acknowledgments

We dedicate this article to Mingxu Xu (Center for Human Disease Genomics, Peking University), who proposed the original idea and contributed a lot to the work. Unfortunately he passed away during the preparation of the manuscript. We thank Ge Gao, Xiyin Wang, and Xiaoli Shi (Center for Bioinformatics, Peking University) for constructive discussion. We thank Xiaocheng Guo and Jingchu Luo (Center for Bioinformatics, Peking University) for critical comments. We are also very grateful to the Genomics Institute of the Novartis Research Foundation for making their microarray data publicly available. This work was supported by the Chinese High Tech Program (2002BA711A01).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ygeno.2005.06.008.

References

[1] S. Henikoff, M.A. Keene, K. Fechtel, J.W. Fristrom, Gene within a gene: nested Drosophila genes encode unrelated proteins on opposite DNA strands, Cell 44 (1986) 33–42.
[2] B. Levinson, S. Kenwrick, D. Lakich, G. Hammonds Jr., J. Gitschier, A transcribed gene in an intron of the human factor VIII gene, Genomics 7 (1990) 1–11.
[3] S. Misra, et al., Annotation of the Drosophila melanogaster euchromatic genome: a systematic review, Genome Biol. 3 (2002) 0083.1–22.
[4] I. Dunham, et al., The DNA sequence of human chromosome 22, Nature 402 (1999) 489–495.
[5] S. Karlin, C. Chen, A.J. Gentles, M. Cleary, Associations between human disease genes and overlapping gene groups and multiple amino acid runs, Proc. Natl. Acad. Sci. USA 99 (2002) 17008–17013.
[6] L.W. Hillier, et al., The DNA sequence of human chromosome 7, Nature 424 (2003) 157–164.
[7] J.P. Bachellerie, J. Cavaille, A. Huttenhofer, The expanding snoRNA world, Biochimie 84 (2002) 775–790.
[8] T. Miyata, T. Yasunaga, Evolution of overlapping genes, Nature 272 (1978) 532–535.
[9] V. Veeramachaneni, W. Makalowski, M. Galdzicki, R. Sood, I. Makalowska, Mammalian overlapping genes: the comparative perspective, Genome Res. 14 (2004) 280–286.
[10] A.A. Habib, J.R. Gulcher, T. Hognason, L. Zheng, K. Stefansson, The OMgp gene, a second growth suppressor within the NF1 gene, Oncogene 16 (1998) 1525–1531.
[11] T.A. Silverman, M. Noguchi, B. Safer, Role of sequences within the first intron in the regulation of expression of eukaryotic initiation factor 2 alpha, J. Biol. Chem. 267 (1992) 9738–9742.


[12] A. Wutz, et al., Imprinted expression of the Igf2r gene depends on an intronic CpG island, Nature 389 (1997) 745–749.
[13] C.M. Farrell, L.N. Lukens, Naturally occurring antisense transcripts are present in chick embryo chondrocytes simultaneously with the down-regulation of the alpha 1 (I) collagen gene, J. Biol. Chem. 270 (1995) 3400–3408.
[14] International Human Genome Sequencing Consortium, Finishing the euchromatic sequence of the human genome, Nature 431 (2004) 931–945.
[15] S.F. Altschul, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.
[16] K.D. Pruitt, T. Tatusova, D.R. Maglott, NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res. 33 (2005) D501–D504.
[17] B. Lewin, Genes VII, Oxford Univ. Press, New York, 2000.
[18] M. Ashburner, et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet. 25 (2000) 25–29.
[19] E. Camon, et al., The Gene Ontology Annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology, Nucleic Acids Res. 32 (2004) D262–D266.
[20] M. Clamp, et al., Ensembl 2002: accommodating comparative genomics, Nucleic Acids Res. 31 (2003) 38–42.
[21] W.H. Li, Molecular Evolution, Sinauer, Sunderland, MA, 1997.
[22] A.L. Hughes, Adaptive Evolution of Genes and Genomes, Oxford Univ. Press, New York, 1999.
[23] F.A. Kondrashov, I.B. Rogozin, Y.I. Wolf, E.V. Koonin, Selection in the evolution of gene duplications, Genome Biol. 3 (2002) 008.1–9.
[24] A.I. Su, et al., A gene atlas of the mouse and human protein-encoding transcriptomes, Proc. Natl. Acad. Sci. USA 101 (2004) 6062–6067.
[25] S. Jenisch, et al., Corneodesmosin gene polymorphism demonstrates strong linkage disequilibrium with HLA and association with psoriasis vulgaris, Tissue Antigens 54 (1999) 439–449.
[26] E. Levy-Nissenbaum, et al., Hypotrichosis simplex of the scalp is associated with nonsense mutations in CDSN encoding corneodesmosin, Nat. Genet. 34 (2003) 151–153.
[27] H. Kiyosawa, I. Yamanaka, N. Osato, S. Kondo, Y. Hayashizaki, Antisense transcripts with FANTOM2 clone set and their implications for gene regulation, Genome Res. 13 (2003) 1324–1334.
[28] R. Yelin, et al., Widespread occurrence of antisense transcription in the human genome, Nat. Biotechnol. 21 (2003) 379–386.
[29] J. Chen, et al., Over 20% of human transcripts might form sense–antisense pairs, Nucleic Acids Res. 32 (2004) 4812–4820.
[30] N. Terryn, P. Rouze, The sense of naturally transcribed antisense RNAs in plants, Trends Plant Sci. 5 (2000) 394–396.
[31] S. Boi, G. Soldà, M.L. Tenchini, Shedding light on the dark side of the genome: overlapping genes in higher eukaryotes, Curr. Genom. 5 (2004) 509–524.
[32] C. Vanhee-Brossollet, C. Vaquero, Do natural antisense transcripts make sense in eukaryotes? Gene 211 (1998) 1–9.
[33] Q. Wang, G.G. Carmichael, Effects of length and location on the cellular response to double-stranded RNA, Microbiol. Mol. Biol. Rev. 68 (2004) 432–452.
[34] C.W. Gibson, N.H. Thomson, W.R. Abrams, J. Kirkham, Nested genes: biological implications and use of AFM for analysis, Gene (2005) (E-publication ahead of print).
[35] C. Esnault, J. Maestre, T. Heidmann, Human LINE retrotransposons generate processed pseudogenes, Nat. Genet. 24 (2000) 363–367.
[36] H.H. Kazazian Jr., L1 retrotransposons shape the mammalian genome, Science 289 (2000) 1152–1153.
[37] A.V. Rynditch, S. Zoubak, L. Tsyba, N. Tryapitsina-Guley, G. Bernardi, The regional integration of retroviral sequences into the mosaic genomes of mammals, Gene 222 (1998) 1–16.
[38] W.J. Kent, BLAT - The BLAST-like alignment tool, Genome Res. 12 (2002) 656–664.
[39] D. Karolchik, et al., The UCSC Genome Browser Database, Nucleic Acids Res. 31 (2003) 51–54.
[40] S. Schwartz, et al., Human–mouse alignments with BLASTZ, Genome Res. 13 (2003) 103–107.
[41] Z. Yang, PAML: a program package for phylogenetic analysis by maximum likelihood, Comput. Appl. Biosci. 13 (1997) 555–556.
[42] G.M. Cooper, et al., Characterization of evolutionary rates and constraints in three mammalian genomes, Genome Res. 14 (2004) 539–548.
[43] E.I. Boyle, et al., GO::TermFinder - Open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes, Bioinformatics 20 (2004) 3710–3715.
[44] M. Biswas, et al., Applications of InterPro in protein annotation and genome analysis, Brief Bioinform. 3 (2002) 285–295.