View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by Elsevier - Publisher Connector

Genomics 100 (2012) 231–239

Contents lists available at SciVerse ScienceDirect

Genomics

journal homepage: www.elsevier.com/locate/ygeno

A unified framework of overlapping : Towards the origination and endogenic regulation

Meng-Ru Ho a, Kuo-Wang Tsai b, Wen-chang Lin c,⁎

a Biodiversity Research Center, Academia Sinica, Taipei 115, Taiwan b Department of Medical Education and Research, Kaohsiung Veterans General Hospital, Kaohsiung 813, Taiwan c Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan

article info abstract

Article history: Overlapping genes are pairs of adjacent genes whose genomic regions partially overlap. They are notable by Received 27 January 2012 their potential intricate regulation, such as cis-regulation of nested -promoter configurations, and Accepted 25 June 2012 post-transcriptional regulation of natural antisense transcripts. The originations and consequent detailed reg- Available online 2 July 2012 ulation remain obscure. Herein, we propose a unified framework comprising biological classification rules followed by extensive analyses, namely, exon-sharing analysis, a human–mouse conservation study, and Keywords: transcriptome analysis of hundreds of microarrays and transcriptome sequencing data (mRNA-Seq). We Overlapping gene Natural antisense transcript (NAT) demonstrate that the tail-to-tail architecture would result from sharing functional elements in Gene regulation 3′-untranslated regions (3′-UTRs) of pre-existing genes. Dissimilarly, we illustrate that the other gene over- Cis-regulatory element laps would originate from a new gene arising in a pre-existing gene locus. Interestingly, these types of Promoter coupled overlapping genes may influence each other synergistically or competitively during transcription, Local chromatin status depending on the promoter configurations. This framework discloses distinctive characteristics of over- lapping genes to be a foundation for a further comprehensive understanding of them. © 2012 Elsevier Inc. Open access under CC BY-NC-ND license.

1. Introduction gene overlapping in the same genomic locus. Keese and Gibbs demon- strated overprinting using the overlap of THRA and NR1D1 [15].The Overlapping genes are pairs of adjacent genes with a certain ex- overprinting hypothesis seems to have been developed from published tent of overlap in genomic locations. They are known to be common results where one of the coupled overlapping genes appears to be rela- in viruses, mitochondria, and bacteria, and may compose a compact tively new, or lineage-specific to its partner gene. However, the occur- genome organization to facilitate gene regulation efficiency [1], rence of overlapping events between ancient genes simply contradicts like operons. Overlapping genes have recently been found in the the overprinting hypothesis. For example, the overlap of two ancient as well [2–6]. Human overlapping genes have been genes, FGF2 and NUDT6, is observed in human [16,17],rat[18],chicken identified and systemically analyzed according to the genome-wide [19], and Xenopus [20], but not dog. On the other hand, Shintani et al. annotations of genes and mRNA features [7,8]. Over 10% of human suggested that the scenario of gene-overlap origination is that a 3′- genes are involved in gene-overlap events. Most overlapping genes injured gene borrows its neighbor's polyadenylation signals on are new or lineage-specific and their overlap structures are not con- the opposite strand after genome rearrangement [21]. This hypothe- served through the vertebrates [9–11]. In theory, overlapping genes sis can be supported by significant 3′-untranslated region (3′-UTR) in a pair will be intricately involved in mutual regulatory events, sharing phenomena in some gene-overlap events. Sharing genomic se- such as transcriptional interference [12], and natural antisense tran- quences in 3′-UTRs may be a crucial transition mechanism from non- scripts (NATs) [13,14]. Thus far, the biological consequences of this overlapping genes to overlapping genes [9]. Furthermore, utilizing notable genome organization, namely gene-overlaps, have not been distinct polyadenylation signals can alter gene-overlap structures as elucidated. well [22].Forinstance,humanKMO appears with two functional The genesis of overlapping genes continues to be debated, and none polyadenylation (polyA) signals, and results in overlapping or non- of the currently proposed hypotheses fully explain these existing geno- overlapping transcripts [23]. These discrepant explanations manifest mic architectures. One well-known hypothesis is “overprinting”―a that the origination and biological meanings of existing overlapping novel gene originates through accumulated mutations inside a genes, though previously discussed, remains an open problem to be pre-existing gene [15]. This process results in a new gene and an ancient resolved. Genome organization is believed to have an impact on gene regula- ⁎ Corresponding author. Fax: +886 2 27827654. tion, such as co-regulated genomic neighborhoods [24–26]. The gene- E-mail address: [email protected] (W. Lin). overlap structure in a genome is expected to introduce regulatory

0888-7543 © 2012 Elsevier Inc. Open access under CC BY-NC-ND license. doi:10.1016/j.ygeno.2012.06.011 232 M.-R. Ho et al. / Genomics 100 (2012) 231–239 impacts as well. Some bioinformatics analyses have reported that paired overlapping genes are associated with locus-specific synergistic expres- sion profiles [27,28], as well as other meticulous analyses of a single gene pair. For instance, the expression of human MYCN (N-myc) transcript and its antisense partner MYCNOS is co-regulated [29].The human overlapping genes, VLCAD (ACADVL) and DLG4 are co-expressed inthesametissues[3]. Theoretically, however, it may be unlikely that two overlapping transcriptional units are active at the same time, where- as transcription by RNA polymerase II involves large complexes. This is proposed as transcriptional interference resulting in exclusive reg- ulation [12]. For example, Noguchi et al. demonstrated that human eIF-2 alpha (EIF2S1) is inversely correlated to its antisense RNA [30].Human ABHD1 is shown to express 1.4% of its overlapping partner Sec12 (PREB), on average [2]. Additionally, the expression patterns of some nested gene pairs, in which one gene contains the other, show negative correla- tion [31]. These results are contrary to previous general co-expression ob- servations. It can be seen from these debates that regulatory mechanisms remain mysterious. Thus, it is essential to address this controversial issue with a comprehensive analysis framework. In this study herein, we utilize the configuration of paired over- lapping genes, which directly affects what and how nearby promoters are used by one or multiple overlapping genes, to be the deciding fea- ture for function-oriented classification of overlapping genes. We propose a strategy to subgroup overlapping genes, and reveal their evolutionary features and specialized gene regulation. For example, the majority of overlapping genes, the tail-to-tail (TT) group, is very distinct from other types in three ways. First, significantly 3′-UTR overlaps of paired TT genes expose their particular functional de- mand. In addition, the human–mouse conservation levels of TT genes and non-overlapping genes are similar, although that of other overlapping genes are much lower. Furthermore, the promoter archi- tecture of paired TT genes just resembles adjacent genes' while other gene-overlap architectures may induce an inverse regulation from promoter competition or co-regulation from common promoters. In summary, we demonstrate that each subgroup of overlapping genes is distinct so that they have potential to behave completely different- ly; and hence, we propose that this is the reason why independent observations on overlapping genes in earlier work appear conflicting. We demonstrate that our analysis framework can provide a more uni- fied approach, and facilitate a comprehensive systematic understand- ing of overlapping genes. Fig. 1. Types of gene-overlap events. There are five types of gene-overlap events and 2. Results paired genes with overlapping promoters described in Fig. 1. Arrows indicate the ori- entations of genes, and blue regions present core promoter regions of genes. 2.1. Five subgroups of overlapping genes

Based on the published reference genome and annotated gene transcript information, we identify 2541 human overlapping gene pairs in total (Supplementary Table S1). We observe that the config- uration of overlapping genes directly affects what and how nearby promoters are used by one or multiple overlapping genes. From the function-oriented point of view, we consider the configuration of reflect the published co-regulation of paired overlapping genes overlapping genes and their associated promoters as our classification [27,28]. Meanwhile, the promoter competition of different-strand rules. Accordingly, we classified gene-overlaps into five subgroups nested genes (DN in Fig. 1) may result in a negative correlation of ex- (Fig. 1): head-to-head (HH: 473 pairs); tail-to-tail (TT: 705 pairs); pression patterns, which has been observed in some paired overlapping different-strand nested (DN: 493 pairs); same-strand overlap (SO: genes [31]. Moreover, the genes in the SO group are conjectured to be 116 pairs); and, same-strand nested (SN: 754 pairs). regulated by common promoters or associated with read-through These five subgroups reflect discriminating properties, respective- events [34]. Given that each subgroup possesses its unique biological ly. The coupled genes in the TT group mainly overlap in their 3′-ends trait(s), we propose a unified framework to explain originations and di- containing polyadenylation (polyA) signals, such as “AAUAAA”.The vergent regulation of all overlapping genes. Our framework is com- polyA signal can be utilized by two genes on both strands which intro- prised of two parts: the first part dedicates the classification of duces transcriptional regulation between genes [32]. In addition, the overlapping genes, according to the configuration of overlapping HH group can be recognized by the obvious nested structure of two genes and their associated promoters; while the second part introduces genes' transcriptional start sites (TSSs). Transcriptions of overlapping statistical analyses and biological inference of the overlapping genes in genes may be initiated by bidirectional promoters just like common bi- each subgroup. The following results provide dissection of overlapping directional non-overlapping gene pairs [33]. This possibility can also genes from various aspects. M.-R. Ho et al. / Genomics 100 (2012) 231–239 233

2.2. The categories of co-shared exons Generally, exon-sharing is found in less than 10% of overlapping genes. The exon-sharing preference of different-strand gene-overlap It has been shown that genes sharing a genomic locus may overlap events is spatially-dependent while those of same-strand are not. their exons as well [11]. That is, the genomic fragment, referred to as That is, mutually nearest regions are the most significant sharing a co-shared exon, is an exonic sequence from both paired overlapping part, such as the 3′-end sharing of the genes in the TT group, and genes. The functional characterization of shared exons can provide bi- the 5′-end sharing of the genes in the HH group (Fig. 2b; Tables 1 ological implications to illustrate why two genes overlap and remain and 2). In addition, the majority of nested genes (about 80%) do not overlapped. We first investigate overlapped bases by calculating the share any bases with their overlapping partners, while other types percentages of four categories, 5′-untranslated region (5′-UTR), of overlapping genes present slight sharing phenomena at the tran- coding region (CDS), 3′-untranslated region (3′-UTR), and without script level (Fig. 3a; Table 1). We can infer that overlapping genes ei- definition (Null) regarding the five gene-overlap subgroups. We mea- ther avoid exon-sharing or merely share UTRs because they adapt sured the mean percentage of shared-exonic sequences in five gene themselves to overlapping architectures without being bound by overlap types to represent the general sharing patterns, and to further each other during the evolution process. That explains why the infer possible biological meanings. The exon-sharing percentages of sequence conservation of overlapping genes between human and same-strand gene-overlaps (SO and SN) are strikingly higher than mouse is not significantly higher than the average conservation, as others (Fig. 2a). The main reason is that a genomic locus produces a previously observed [11,38,39]. Finally, we observe higher sharing whole gene family whose members utilize some common exons dur- percentages in same-strand overlaps (SO and SN in Fig. 2b; Table 1) ing transcription (Supplementary Table S2), such as the PCDHA family which result mainly from some “heavy” exon-sharing overlapping [35,36], the PCDHG family [35,36], and the UGT1A family [37]. They genes (shared>70% of the transcript length) in Fig. 3a. This implies are gene clusters with alternative first or last exons. Their sharing sce- the existence of redundant gene annotations, such as FAM74A2/ nario is highly distinct from other overlapping genes; therefore, we FAM74A4 and SPANXD/SPANXE (Supplementary Fig. 1). removed them from the subsequent analysis. The adjusted results are shown in Fig. 2b and Table 1. 2.3. The conservation between human and mouse

We demonstrate the human–mouse conservation on overlapping genes from three aspects. The first is the proportion of species- (a) specific genes, the second is the proportion of partial gene loss in Categories of Co−shared Exons paired genes, and the last is the resistant degree of overlap structures to genome rearrangement. We use the human–mouse orthologous 3' UTR CDS gene information provided by Mouse Genome Informatics (http:// 5' UTR Null www.informatics.jax.org/). This human–mouse ortholog information covers 84% of human genes (gray dotted line in Fig. 3b). This implies that the proportion of species-specific genes in the human genome is about 16%. Here, we report that the proportion of species-specific genes is not increased in both the PO gene and the TT gene types, un- like other types of overlapping genes (Fig. 3b). It suggests that the TT overlap structure is not a result of overprinting. Secondly, the propor- tion of partial loss in the TT group is as low as the PO group's (Fig. 3b; Table 3). A gene pair defined as partial loss means one human gene in this pair is not annotated with any mouse orthologous genes. Thus, the proportion of partial loss can reflect how many human-specific genes arise in pre-existing gene loci to form gene overlap structures. 0% 10% 20% 30% 40% HH TT DN SO SN Unlike other types of overlapping genes, the TT genes possess the lowest proportion (Tables 2 and 3). Again, it suggests that the TT (b) overlap structure is not a result of overprinting. Both explain why Categories of Co−shared Exons (adjusted) some ancient genes are found to form overlapping structures in spe- cific species, even though published results have reported that most 3' UTR CDS fi – 5' UTR overlapping genes are new or lineage-speci c [9 11]. Null Furthermore, we interrogated the conservation degree of gene- overlap structure between human and mouse. Generally, the conserva- tion of HH, DN, SO, and SN is significantly lower, and the low conserva- tion is frequently accompanied by large proportions of partial loss in gene-overlaps (Fig. 3b; Tables 2 and 3). The partial-loss percentages of these four overlap types are about twice that of PO (Table 3). That im- plies the majority of these gene-overlaps result from a new gene arising

Table 1 Statistics of the human overlapping genes on co-shared exons.

Overlap Category of co-shared exon 0%HH 10% 20%TT 30% DN 40% SO SN Frequency 5′‐UTR CDS 3′‐UTR Null Total Fig. 2. Categories of co-shared exons. There are four categories of commonly utilized HH (473) 78% 2.5% 1.2% 0.1% 1.5% 5.2% exons between coupled genes. The x-axis shows the five subgroups of overlapping TT (705) 90% 0.1% 2.1% 6.8% 1.8% 10.8% genes: head-to-head (HH); tail-to-tail (TT); different-strand nested (DN); same-strand DN (493) 24% 0.4% 1% 0.7% 1.1% 3.2% overlap (SO); and same-strand nested (SN). The y-axis represents the accumulated per- SO (116) 91% 2.9% 3.2% 4.7% 2.2% 13% centages of shared length in the transcription level. After removing some families with SN (754) 21% 1.7% 2.6% 3.1% 2.9% 10.2% the alternative first/last exon, we obtain the (b) plot as the adjusted result of (a). 234 M.-R. Ho et al. / Genomics 100 (2012) 231–239

Table 2 Table 3 Summary of overlapping genes in human. Statistics of mouse orthologs of the human overlapping gene pairs.

Type Orientation Overlap region Shared Human-specific Co-express Overlap H–M conservation of gene-overlap exon Overlapping gene paira H–M orthologb HH Opposite 5′/5′ 5′‐UTR + ++ Presentc Partiald Absente Consensusf Divg TT Opposite 3′/3′ 3′‐UTR − + DN Opposite 5′/3′ All ++ + PO (1,047) 72% 25% 3% 75% 25% SO Same 5′/5′ or 5′/3′ or 3′/3′ All + +++ HH (473) 53% 41% 6% 17% 83% SN Same 5′/5′ or 3′/3′ All ++ +++ TT (705) 72% 25% 3% 41% 59% PO Opposite promoter only −− ++ DN (493) 36% 56% 8% 72% 28% SO (116) 40% 45% 15% 9% 91% − none. SN (754) 29% 61% 10% 70% 30% + low grade. ++ median grade. a Overlapping gene pairs in human. +++ high grade. b Gene-overlaps with both genes present in human and mouse. c Both human genes in a pair are present in mouse. d Only one human gene in a pair is present in mouse. e The whole human gene pair is absent in mouse. (a) f The overlap structure is consensus between human and mouse. Exon−sharing g The overlap structure is divergent between human and mouse.

Heavy Median Light None

in the genomic location of the other ancient gene. This supports the overprinting hypothesis [15] to be the origin of these four gene- overlap types. Thirdly, the preservation of nested gene structures, including DN and SN, is similar to that of PO while both paired human genes are present in mouse (Table 3). The genome insertions and deletions (indels) can be a main driving force to alter these gene-overlap struc- tures. As the overlap structure of the PO genes, the overlap structure of overlapping genes can be destroyed by genomic indels, while the simple neighboring gene structure is relatively resistant to genomic indels. Unusually, the preservation of other three gene-overlap types, including HH, TT and SO, is much lower than that of PO 0% 20%HH 40% TT 60% DN 80% 100% SO SN (Table 3). It may indicate that maintaining these gene-overlaps is less desirable. If the occurrence of genomic indels is unbiased in the (b) genome, there may be some interference breaking the overlap struc- H−M Conservation tures beyond the genome rearrangement. Inadequate annotation of the mouse genes' 5′-UTRs and/or 3′-UTRs might also help to explain Absent Partial this. Present only Consensus 2.4. Cis-regulation

The paired overlapping genes, located in the identical genomic locus, share the same chromatin status but some subgroups of them do not share core promoters, such as TT. As a result, distinct gene- overlap subgroups are suitable sets for studying the regulatory effects specifically resulting from alternative gene-promoter conformations. We have interrogated the expression correlation of paired over- lapping genes at two levels, the regulation of chromatin status and core promoters. In total, we have extensively analyzed the expression profiles of hundreds of normal human tissue samples conducted in three high throughput platforms, Affymetrix U133 2.0, Affymetrix U133 plus2, and illumina single-end mRNA-Seq (Supplementary Ta- 0%PO 20% HH 40% TT 60% DN 84% 100% SO SN bles S3–S5) and obtained consistent results. It is believed that chromatin status is an important cause for co- Fig. 3. Exon-sharing degree and human–mouse conservation. The percentage bar chart (a) describes the gene-based exon-sharing degree of the five subgroups of overlapping expressed genomic neighborhoods. For example, on account of sharing genes: head-to-head (HH); tail-to-tail (TT); different-strand nested (DN); same-strand chromatin relaxation, immediate-early genes (IEGs) and their neighbor- overlap (SO); and same-strand nested (SN). There are four exon-sharing levels: “None” ing genes are co-upregulated [40]. To examine how the distance affects “ ” represents that paired overlapping genes do not share any base, and Light indicates the co-regulated genomic blocks, we first take randomly coupled gene they share b30% of their own transcript lengths. Similarly, the genes with shared tran- script length >70% is annotated as “Heavy”, and the remaining genes are designated as pairs as the control set to compare to other paired neighboring gene “Median”. The percentage bar chart (b) describes the human–mouse conservation of sets (Fig. 4a; Supplementary Figs. 2a and 3a). To be specific, the set of five subgroups and paired genes with overlapping promoters (PO). There are four indi- paired adjacent genes is the closest gene pair set in which paired cators: “Absent” means both paired human genes are absent in mouse; “Partial” shows genes have no genes in between and the most distant gene pairs have only one human gene in a pair is present in mouse; “Present only” annotates both nine genes in between. Our results show that the co-regulation of two paired human genes are present in mouse but their overlap structure is not conserved; and “Consensus” indicates that both paired human genes are present and their overlap neighboring genes is dependent on their separated physical distances. structure is conserved in mouse. That is, the highest co-expression degree belongs to adjacent gene M.-R. Ho et al. / Genomics 100 (2012) 231–239 235

(a) the “remote” group in the same-strand gene pairs. This observation NGS BodyMap 2.0 (GSE30611) holds in the opposite-orientation groups, near oo and remote oo. In addi-

1.0 tion, the co-expression level of the same-strand pairs is higher than that of the opposite-orientation pairs when coupled genes are far from each other. However, this observation is not true when paired genes are close to each other (near ss vs. near oo). We hypothesize that this may

0.5 result from bidirectional promoters existing in some gene pairs of the “near oo” group. We then extracted the promoter-overlap pairs (PO: the coupled genes' promoters overlap within 1 kb) from the “near oo” group. We observed that the PO group, which possesses the highest co-expression level, supports our conjecture (Fig. 4b; Supplementary 0.0 Figs. 2b and 3b). In short, the co-regulation of neighboring genes is per-

correlation coefficient vasive in the human genome, and the correlation degree is directly af- fected by the genomic distance of two paired genes. This indicates that the co-regulation of neighboring genes is a general result of sharing −0.5 local chromatin status. In addition, the PO's highest co-expression level implies that it is an effective factor impacting the co-expression level, whether the promoters of two coupled genes overlap or not. R0123456789 Because of the identical local chromatin status, two paired over- number of genes in between lapping genes are expected to be positively correlated with each other in gene regulation. Indeed, we observe that the correlation of (b) paired overlapping genes is higher than that of paired nearly adjacent NGS BodyMap 2.0 (GSE30611) genes (distanceb10 kb). Surprisingly, each subgroup of overlapping

1.0 genes presents distinct correlation levels (Fig. 5, and p-value lists of two sample t-tests in Supplementary Tables S5 and S7). We reveal that the same-strand overlap types (SO and SN), especially the nested ones, possess the highest correlation level (Fig. 5a; Table 2). That im- plies that genes in a same-strand overlap pair may be co-regulated by 0.5 the same upstream regulatory sequences. Similarly, the correlation of PO is high, and the correlation may get even higher, if the overlapped regions further extend to the first exon/intron, such as HH (Fig. 5a; Table 2). It reflects that the transcription of coupled overlapping 0.0 genes with opposite orientations can be activated simultaneously by – correlation coefficient bidirectional promoters [33,41 43]. Unlike the PO or HH genes, the TT or DN genes frequently do not share the core promoter regions. There- fore, the common bidirectional promoter hypothesis is not operating in

−0.5 this situation. That is, the promoter overlap degree of the TT group resem- bles that of adjacent genes, and the correlation level of the TT group is slightly lower than that of neighboring gene pairs as well (gray dotted R adj remote near remote near PO s.s. s.s. o.o. o.o. line in Fig. 5a). In addition, besides synergistic effects, we observed com- petitive effects from overlapping genes. The DN group, the nested Fig. 4. Co-expression of neighboring genes. The (a) plot displays the correlation levels gene pairs with opposite orientations, presents the lowest corre- of paired genes. In addition to randomly coupled gene pairs (R), there are 10 sets of lation among all subgroups, and even lower than the baseline neighboring gene pairs where each gene pair is separated by zero to nine genes, re- spectively. The co-expression levels of neighboring gene pairs are significantly differ- (Fig. 5a; Table 2). It suggests that two individual promoters of ent, especially among the adjacent gene pair, the gene pair with one gene in a DN gene pair are too close to initiate independent transcrip- between, and the gene pair with two genes in between (all two-sample t-tests with tions, respectively. This happens as an effect of promoter com- bb p-values 0.001). The (b) plot displays the correlation levels of adjacent gene pairs. petition or transcriptional interference. For example, it may be Besides random gene pairs (R) and paired genes with overlapping promoters (PO: ad- jacent gene pairs with opposite orientations and the shortest distanceb1 kb), there are unlikely that two different sets of overlapping transcriptional four subgroups of adjacent gene pairs: 1) near ss: same-strand adjacent gene pairs with units activate at the same time in the same locus because tran- the shortest distance in between b10 kb. 2) remote ss: same-strand adjacent gene scription by RNA polymerase II involves large protein complexes pairs with the shortest distance in between >10 kb. 3) near oo: adjacent gene pairs [12]. In short, we hypothesize that the distinct correlation levels with opposite orientations and the shortest distance b10 kb. 4) remote oo: adjacent of subgroups may result from the configuration of overlapping gene pairs with opposite orientations and the shortest distance >10 kb. In addition, all t-tests between any two groups in (b) are significant with their p-valuesbb0.001, genes, and their associated core promoters. except for the t-test of “near oo” vs. “PO” (p-value=0.01348). According to the analysis results above, the co-regulation of paired overlapping genes results from two major cis-regulatory mecha- nisms. The first is the impact of sharing a common local chromatin status. It fully depends on the physical distance between two paired genes on the genome, in terms of the overlap extent of overlapping genes. As a result, the co-regulation occurs in all paired overlapping pairs, and the co-expression degree decreases monotonically as the genes. The second mechanism is caused by the configuration of pro- physical distance of paired genes increases (Fig. 4a; Supplementary Figs. moters. When overlapping genes are activated by the same promoters, 2a and 3a). We further divided adjacent gene pairs into four subgroups, their correlation degree of gene expression is further enhanced. Further- near ss, remote ss, near oo, and remote oo, to compare their correlation more, due to promoter competition, the correlation of DN is stamped degrees among subgroups with distinct physical distances and gene ori- down significantly. The core promoter configuration of paired over- entations. The correlation of the “near” group (distanceb10 kb) is higher lapping genes plays a dramatic role in determining the regulatory trend (near and remote groups in Fig. 4b; Supplementary Figs. 2b and 3b) than between two coupled genes. 236 M.-R. Ho et al. / Genomics 100 (2012) 231–239

(a) Affymetrix HG−U133 Plus2 (GSE3526) remove probe sets (microarray) or reads (mRNA-Seq) that are mapped to multiple genes. Thus, the coverage of overlapping gene 1.0 pairs in microarrays is reduced, and the results are shown without undistinguishable paralogs. On the other hand, although the coverage of overlapping genes in mRNA-Seq is not altered, removing multi-located reads may cause interference because of the co-shared exons of over-

0.5 lapping genes. Further, adding unstable proportions of antisense reads to coupled genes, respectively, can falsely increase their correlation(s). In the mRNA-Seq result, as a consequence, we observe the significant in- creased correlation of TT, possessing the highest exonic sharing degree among all different-strand gene-overlaps (Figs. 2 and 5). Furthermore,

0.0 the same-strand overlapping genes are too close, and may produce chi-

correlation coefficient mera transcripts resulting from unsuccessful splicing events [34].Itis thus unable to distinguish short reads of chimera transcripts from that of each parent gene. Due to some read-through transcripts, the correla- tion levels of SO and SN are less distinguishable in NGS data. In addition −0.5 to these two systemic effects, we provide consistent co-regulation results

RPadj OHHTTDNSOSN of each gene-overlap type from two major high throughput systems after near all. As an extreme gene–gene organization in the genome, the gene- (b) NGS BodyMap 2.0 (GSE30611) overlap events influence 10% of human genes. The different-strand over- laps can create nature antisense transcripts and the same-strand overlaps 1.0 can share the same motif sequences. Given that, overlapping genes are in- volved in pervasively efficient and hazardous regulations. That may be a reason why gene-overlap events go on and off during the evolutionary course, such as the overlap of MINK1 and CHRNE [44]. Importantly, most 0.5 overlapping genes are new or lineage-specific [9–11],andnestedones are further reported as tissue-specificgenes[37].Thesealladdressthe need and significance of investigating overlapping genes. Under our proposed framework, we can explain previous conflicting

0.0 reports of overlapping genes. In the beginning, we demonstrated that the tail-to-tail overlaps (TT) is an extraordinary subgroup of over- correlation coefficient lapping genes because of three lines of substantial evidence: 1) Most TT genes are not human-speci fic genes; 2) TT genes mainly share ′

−0.5 their 3 -UTRs; and 3) the overlap structure of TT is relatively conserved, except for nested genes which are less sensitive to the completeness of genes' 5′-UTR and 3′-UTR annotations. It follows that the sharing of Radj POHHTTDNSOSN functional sequences in the 3′-UTRs is responsible for the existence of near TT overlaps in the genome, as Shintain et al. proposed [21]. Additionally, Fig. 5. Co-regulation of paired overlapping genes. The correlation levels of paired over- we show that the low conservation of overlap structures is associated lapping genes are displayed in Fig. 5. The x-axis indicates groups of randomly paired with large proportions of partial loss in HH, DN, SO, and SN. It implies fi genes, paired genes with overlapping promoters, and the ve types of gene-overlap that these gene-overlap events mainly result from a new gene arising events: head-to-head (HH); tail-to-tail (TT); different-strand nested (DN); same-strand overlap (SO); and same-strand nested (SN). in a pre-existing gene locus, and serves as quantitative evidence for the overprinting hypothesis [15]. The genomic features of overlapping genes make them a suitable 3. Discussion benchmark for demonstrating cis-regulatory mechanisms. That is, the paired overlapping genes share local chromatin status, like his- In this study, we analyze a huge amount of high throughput tone modifications and DNA methylation. However, paired over- transcriptome data from two major expression analysis platforms, lapping genes do not necessarily share their core promoters, such as microarrays and mRNA-Seq. Generally, the mRNA-Seq is more sensi- TT genes. That allows us to examine cis-regulatory mechanisms tive than microarrays. The correlation levels of overlap types in from the general chromatin impact to core promoter regulations. On the mRNA-Seq results are more distinguishable to reflect distinct account of the synchronized chromatin status, paired overlapping co-regulation levels of overlapping genes (Fig. 5). In addition, the genes are pervasively co-regulated to a certain extent. In addition, mRNA-Seq can identify lowly expressed transcripts and un-probed tran- the gene-promoter configuration of overlapping genes plays a critical scripts without being restricted by hybridization efficiency and probe role to further determine co-regulation levels of five subgroups. designs. As a result, the expression of all overlapping genes can be First, we reveal the synergistic effects which can be caused by the detected in the mRNA-Seq data. In contrast, the results of microarrays same promoters of paired same-strand overlapping genes (SO and are compressed (Fig. 5) and microarrays only detect the genes being SN) or bidirectional promoters of divergent different-strand over- well-probed. For instance, the HG-U133 platform merely covers partial lapping genes (PO and HH). For example, the mouse Timp2 hosts overlapping gene pairs (b50%). BC100451 (DDC8) as a SN gene pair. They are reported to be co- Broadly speaking, our analysis results of microarrays and the expressed, and even involved in alternative splicing events [45]. The mRNA-Seq data are similar. Practically, microarrays possess higher PO gene pair, LRRC49 and THAP10, is silenced by the bidirectional pro- specificity to reflect variant delicate regulations of promoter utiliza- moter hypermethylation in breast cancer [46]. Unlike PO genes, it tions on account of precise probe designs; meanwhile, ambiguous seems that few bidirectional promoter studies are proven with re- reads between paired overlapping genes can cause interference in spect to overlapping gene pairs. However, because of alternative the mRNA-Seq data. In the analysis, to avoid the ambiguity, we first exons, a PO gene pair may become a head-to-head (HH) gene M.-R. Ho et al. / Genomics 100 (2012) 231–239 237 pair, especially the nearest PO gene pairs. Further building on the fact same-strand adjacent gene pairs with the shortest distance in be- that the alternative first exon is a common phenomenon in human tween >10 kb. 3) near oo: adjacent gene pairs with opposite orienta- [47], we expect to observe general co-regulation in HH genes like tions, and the shortest distance b10 kb. 4) remote oo: adjacent gene PO genes. For instance, the HH gene pair, MRPS12 and SARS2, exhibits pairs with opposite orientations and the shortest distance >10 kb. bidirectional transcription in human and mouse [48]. Indeed, we The control of Set B in the co-expression analysis is the set, including have proven the high co-expression of HH genes in genome-wide all neighboring gene pairs. investigations. In the beginning of our proposed framework, we grouped the In addition to synergistic effects, we observe a low co-expression identified gene-overlap events (Supplementary Table S1) into five level of the TT group which could result from two promoters of coupled types as illustrated in Fig. 1. There are three groups describing TT genes whose locations are as far as adjacent genes'. Furthermore, the different-strand gene-overlaps. 1) head-to-head (HH: 473 gene interference from the sense-antisense overlap of polyadenylation signals pairs); 2) tail-to-tail (TT: 705 gene pairs); 3) different-strand nested: can cause slightly negative regulation between several coupled TT genes one harbors the other (DN: 493 gene pairs). We also considered two [32]. Similarly, the competition of two opposite promoters seems to same-strand gene-overlap types: 4) the same-strand overlap (SO: stamp down the correlation of different-strand nested genes (DN). How- 116 pairs), and 5) same-strand nested (SN: 754 pairs), where one is ever, the co-expression level of DN is still significantly higher than that embedded in the other coupled gene. Additionally, the genomic dis- observed from random gene pairs (Fig. 5). It reflects that the core pro- tance in between is not that much different among all types of over- moter regulation is influential but cannot fully erase the general chroma- lapping genes but their promoter configurations. To provide a positive tin impacts. This result conflicts with the negative correlation of DN control, we defined the promoter-overlaps (PO: 1,047 pairs) that con- genes presented by Yu et al. in 2005 [31]. Because of the restriction of tained paired genes sharing core promoter regions, upstream 1 kb of probe designs, they only examined 45 pairs possessing probes, and pres- transcription start sites (TSS), but not their gene regions. Regarding ented 33 DN pairs which exhibit significantly negative correlations. In the genomic distance between two genes, the PO group is an intermedi- contrast, by taking advantage of the modern mRNA-Seq, we have inves- ate between the neighboring gene sets and the overlapping gene sets. tigated almost all the 493 DN pairs in human. From that, we present that This makes the PO group suitable as a positive control in revealing the the DN genes are not negatively correlated, and even significantly pos- promoter regulation mechanisms due to distinct gene-promoter config- sess co-expression to some extent on account of their common chroma- urations of overlapping genes. In addition to the analysis of gene regu- tin status. Because of the high coverage of DN genes we have analyzed, lation, these defined gene-overlap subgroups are also applied to we believe that the detailed regulation between DN genes has been re- exon-sharing analysis and a human–mouse conservation study to clar- vealed and modified. ify published conflicting hypotheses of gene-overlap origins. In summary, instead of case studies, we substantiate those scenar- ios based on the observations from our systemic analyses of human 4.2. Exon-sharing between overlapping genes transcriptome data, including hundreds of microarrays and dozens of next-generation sequencing libraries. Through clarifying previous We analyzed exon-sharing degrees between overlapping genes by conflicting reports, we demonstrate that this framework can reflect first examining every transcribed base to see if it is commonly used by the whole scenario of human overlapping genes. It promotes a com- two paired overlapping genes or not. Total lengths of shared exons prehensive understanding of gene-overlap events to bring functional were calculated with respect to five gene-overlap subgroups. We fur- insight into genome organization. ther assigned three categories (5′-UTR, CDS, or 3′-UTR) to those com- monly used bases according to the coding region annotation of genes 4. Materials and methods from UCSC (hg19, http://genome.ucsc.edu/). A transcript is classified into the Null category if it is a non-coding gene. We plotted the per- 4.1. Human neighboring, overlapping and nested gene pairs centage of commonly used bases in five overlap types and colored the proportion of shared exons by their categories (Fig. 2). In addi- We used human reference transcript sequences and annotation tion, we discovered that some gene families located in the same geno- from UCSC (hg19, http://genome.ucsc.edu/). We clustered iso-forms mic locus frequently share their 5′-UTRs or 3′-UTRs (Supplementary into 23,189 genes and filtered out genes annotated with multiple ge- Table S2). This sharing phenomenon significantly alters the general nomic locations (472 multi-located genes). Small nucleolar RNAs co-shared pattern of overlapping genes. Therefore, we excluded (snoRNAs) and microRNAs are not included on Affymetrix Gene Ex- these gene families and calculated the remainder (Fig. 2b). pression Chips, and their transcripts are so short that they are fre- Furthermore, we examined the exon-sharing degree of over- quently filtered out in the sample preparing process of the Whole lapping genes in the gene base. Every overlapping gene is classified Transcriptome Shotgun Sequencing. Therefore, we excluded them in into one of four sharing degrees according to the percent transcript our expression analyses. In addition, some transcripts result from shared with the other in the same pair. “None” represents genes combining the exons of two or more distinct (parent) genes lying that do not share any base, and “Light” represents genes that share on the same strand of a , known as readthrough tran- b30% transcript length with their own coupled genes. A gene that scripts [34]. We also eliminated readthrough genes in the analyses shares >70% transcript length with its paired one is annotated as because their expression cannot be distinguished from the parental “Heavy”. The remaining genes were designated “Median”.Wedepicted genes in microarrays and next-generation sequencing assays. In a percentage bar chart to show the distribution of exon-sharing degree total, we analyzed 22,615 human genes in this study. in each overlap type (Fig. 3a). Based on the genomic locations of genes, we generated 10 sets of neighboring gene pairs in which each gene pair is separated by zero 4.3. Identification of human–mouse conserved overlapping pairs to nine genes, respectively. We took a set of randomly paired genes as the control set. These pairs (Set A) are utilized to illustrate the To identify human–mouse conserved overlapping gene pairs, we size of co-expressed genomic neighborhoods in the human genome. utilized the human–mouse orthologous gene information provided In order to further examine whether the genomic distance or orienta- by Mouse Genome Informatics (http://www.informatics.jax.org/). tion affects the co-expression level of coupled genes, we classified ad- The dataset contains 17,855 human–mouse orthologous gene pairs jacent gene pairs into four classes according to their orientation and and covers 84% human genes. We examined the orthologs of over- separation distance (Set B): 1) near ss: same-strand adjacent gene lapping genes and defined four conservation statuses as follows. pairs with the shortest distance in between b10 kb. 2) remote ss: 1) Absent: both paired human genes are absent in mouse; 2) Partial: 238 M.-R. Ho et al. / Genomics 100 (2012) 231–239 only one human gene in a pair is present in mouse; 3) Present only: 22,615×16 read count matrix. Only a small number of genes (550 both paired human genes are present in mouse, but their overlap genes, 2%) are unexpressed in all of the 16 tissues. Furthermore, we structure is not conserved; and 4) Consensus: both paired human performed the total read count normalization (a gene's read count di- genes are present and their overlap structure is conserved in mouse. vided by the total read count) and calculated the correlation coeffi- We illustrated the percentages of these four conservation statuses in cients of expression profiles of coupled genes. The obtained five gene-overlap groups, and a promoter-overlap group via percent- correlation coefficient is the same as the correlation coefficient de- age bar chart (Fig. 3b). rived from the RPKM normalized read count profiles [53]. The box plots of correlation coefficients and the t-test results are shown in 4.4. Analysis process of whole transcriptome data Figs. 4 and 5b, and the information on sequenced tissues and mapping statistics are in Supplementary Table S6. The t-test p-values of We intended to explore the general gene regulation when two co-expression levels of subgroups are listed in Supplementary Table genes reside in the same genomic locus. Therefore, we looked for S7. the transcriptome data set covering normal tissues of the human body, as many as possible in the public domain. In addition, to avoid Acknowledgments technological biases, we chose data sets from three commonly used high-throughput platforms to draw consistent conclusions. In this This work is supported in part by grants from Academia Sinica and study, we analyzed three transcriptome datasets of human normal National Sciences Council, Taiwan. tissues from NCBI Gene Expression Omnibus. The first data set [GEO: GSE2361] is the expression profile of 36 types of normal Appendix A. Supplementary data human tissues (Supplementary Table S3) conducted by using Affymetrix U133A arrays. We used the robust multi-array average Supplementary data to this article can be found online at http:// (RMA) expression measure [49–51] to obtain the expression value dx.doi.org/10.1016/j.ygeno.2012.06.011. of each probe set. The RMA consists of three steps: a background ad- justment, quantile normalization, and finally summarization. The References Affymetrix U133A platform contains 20,967 probe sets which cover 13,435 human genes. On the chip, one probe set may represent mul- [1] Z.I. Johnson, S.W. Chisholm, Properties of overlapping genes are conserved across microbial genomes, Genome Res. 14 (2004) 2268–2272. tiple genes. We eliminated this kind of probe sets to avoid ambiguity. [2] A.J. Edgar, The gene structure and expression of human ABHD1: overlapping In addition, the expression of a gene is determined by the maximum polyadenylation signal sequence with Sec12, BMC Genomics 4 (2003) 18. [3] C. Zhou, B. Blumberg, Overlapping gene structure of human VLCAD and DLG4, value of all of the probe sets representing it. We then calculated the – fi fi Gene 305 (2003) 161 166. correlation coef cient of expression pro les of paired genes of Set [4] Y. Ohinata, S. Sutou, M. Kondo, T. Takahashi, Y. Mitsui, Male-enhanced antigen-1 A, Set B, and Set C. To illustrate how genomic distance related to gene flanked by two overlapping genes is expressed in late spermatogenesis, Biol. co-expression levels, we depicted box plots of correlation coefficients Reprod. 67 (2002) 1824–1831. [5] H. Kiyosawa, I. Yamanaka, N. Osato, S. Kondo, Y. Hayashizaki, Antisense tran- of each subgroup (Supplementary Fig. 2). Because some genes are not scripts with FANTOM2 clone set and their implications for gene regulation, Ge- well-probed, this platform only covers 135 HH (29%), 253 TT (36%), nome Res. 13 (2003) 1324–1334. 76 DN (15%), 21 SO (18%), 90 SN (12%), and 357 PO (34%). The second [6] D.S. Kim, C.Y. Cho, J.W. Huh, H.S. Kim, H.G. Cho, EVOG: a database for evolutionary – fi analysis of overlapping genes, Nucleic Acids Res. 37 (2009) D698 D702. data set [GEO: GSE3526] presents the expression pro le of 65 types of [7] B. Lehner, G. Williams, R.D. Campbell, C.M. Sanderson, Antisense transcripts in the normal human tissues from 10 donors (Supplementary Table S4). It human genome, Trends Genet. 18 (2002) 63–65. contains expression data conducted by 353 Affymetrix U133 Plus2 ar- [8] M.E. Fahey, T.F. Moore, D.G. Higgins, Overlapping antisense transcription in the human genome, Comp. Funct. Genomics 3 (2002) 244–253. rays (41,789 probe sets representing 20,902 genes). Similarly, we [9] C.R. Sanna, W.H. Li, L. Zhang, Overlapping genes in the human and mouse ge- normalized expression values with RMA, excluded ambiguous probe nomes, BMC Genomics 9 (2008) 169. sets, and took the maximum value of all probe sets of a gene as its ex- [10] I. Makalowska, C.F. Lin, K. Hernandez, Birth and death of gene overlaps in verte- pression value. This platform includes 318 HH (67%), 532 TT (75%), brates, BMC Evol. Biol. 7 (2007) 193. [11] V. Veeramachaneni, W. Makalowski, M. Galdzicki, R. Sood, I. Makalowska, Mam- 242 DN (49%), 54 SO (47%), 207 SN (27%), and 815 PO (78%). We cal- malian overlapping genes: the comparative perspective, Genome Res. 14 (2004) culated the correlation coefficients of each pair in subgroups and dis- 280–286. [12] I. Makalowska, C.F. Lin, W. Makalowski, Overlapping genes in vertebrate ge- played them in box plots (Fig. 5a; Supplementary Fig. 3). To better – fi nomes, Comput. Biol. Chem. 29 (2005) 1 12. illustrate the signi cant differences of co-expression levels of sub- [13] N. Osato, Y. Suzuki, K. Ikeo, T. Gojobori, Transcriptional interferences in cis natural groups, we performed t-tests for any two compared groups as well antisense transcripts of humans and mice, Genetics 176 (2007) 1299–1306. (Supplementary Table S5). [14] Y. Zhang, X.S. Liu, Q.R. Liu, L. Wei, Genome-wide in silico identification and anal- ysis of cis natural antisense transcripts (cis-NATs) in ten species, Nucleic Acids In addition to probe-based hybridization assays, next-generation Res. 34 (2006) 3465–3475. sequencing technology is also commonly applied to detect expres- [15] P.K. Keese, A. Gibbs, Origins of genes: “big bang” or continuous creation? Proc. sions of transcripts genome-wide. Our third dataset is the 75 bases Natl. Acad. Sci. U. S. A. 89 (1992) 9489–9493. [16] R.S. Knee, S.E. Pitcher, P.R. Murphy, Basic fibroblast growth factor sense (FGF) and single-end mRNA-Seq data from the Illumina BodyMap2 trans- antisense (gfg) RNA transcripts are expressed in unfertilized human oocytes and criptome [GEO: GSE30611]. There are 16 different human tissues an- in differentiated adult tissues, Biochem. Biophys. Res. Commun. 205 (1994) alyzed. To begin with, we prepared the genome mapping indexes 577–583. ′ [17] P.R. Murphy, R.S. Knee, Identification and characterization of an antisense RNA after removing the 3 poly-A tail of reference transcript sequences transcript (gfg) from the human basic fibroblast growth factor gene, Mol. (downloaded from UCSC) and annotating them with gene symbols. Endocrinol. 8 (1994) 852–859. Because both the mapping criteria (such as mismatches, gaps, and [18] A.W. Li, G. Seyoum, R.P. Shiu, P.R. Murphy, Expression of the rat BFGF antisense fi numbers of mapped loci) and the read length can affect mapping re- RNA transcript is tissue-speci c and developmentally regulated, Mol. Cell. Endocrinol. 118 (1996) 113–123. sults, we kept high confidence bases and restricted all of the read [19] A. Zúñiga Mejía Borja, C. Meijers, R. Zeller, Expression of alternatively spliced lengths to 40, and truncated reads to 40-based segments (from the bFGF first coding exons and antisense mRNAs during chicken embryogenesis, – 10th-base to the 50th-base). We then mapped these processed Dev. Biol. 157 (1993) 110 118. [20] A.W. Li, C.K. Too, P.R. Murphy, The basic fibroblast growth factor (FGF-2) anti- reads to transcript indexes by the bowtie program [52] with up to 2 sense RNA (GFG) is translated into a MutT-related protein in vivo, Biochem. mismatches. Generally, each run has around 70% mappable reads. Biophys. Res. Commun. 223 (1996) 19–23. We kept the mapping results with same strand of mapped transcripts. [21] S. Shintani, C. O'HUigin, S. Toyosawa, V. Michalova, J. Klein, Origin of gene over- lap: the case of TCP1 and ACAT2, Genetics 152 (1999) 743–754. According to gene symbol annotation of transcripts, we only counted [22] A.J. Wood, et al., Regulation of alternative polyadenylation by genomic imprint- reads mapped to one gene to avoid ambiguity. We obtained a ing, Genes Dev. 22 (2008) 1141–1146. M.-R. Ho et al. / Genomics 100 (2012) 231–239 239

[23] G. Kasper, et al., Different structural organization of the encephalopsin gene in [40] M. Ebisuya, T. Yamamoto, M. Nakajima, E. Nishida, Ripples from neighbouring man and mouse, Gene 295 (2002) 27–32. transcription, Nat. Cell Biol. 10 (2008) 1106–1113. [24] H.J. Gierman, et al., Domain-wide regulation of gene expression in the human ge- [41] T. Shimada, H. Fujii, H. Lin, A 165- sequence between the dihydrofolate nome, Genome Res. 17 (2007) 1286–1295. reductase gene and the divergently transcribed upstream gene is sufficient for bi- [25] A. Purmann, et al., Genomic organization of transcriptomes in mammals: cor- directional transcriptional activity, J. Biol. Chem. 264 (1989) 20171–20174. egulation and cofunctionality, Genomics 89 (2007) 580–587. [42] M. Platzer, et al., Ataxia-telangiectasia locus: sequence analysis of 184 kb of human [26] S. De, M.M. Babu, Genomic neighbourhood and the regulation of gene expression, genomic DNA containing the entire ATM gene, Genome Res. 7 (1997) 592–605. Curr. Opin. Cell Biol. 22 (2010) 326–333. [43] N. Adachi, M.R. Lieber, Bidirectional gene organization: a common architectural [27] P.G. Engstrom, et al., Complex loci in human and mouse genomes, PLoS Genet. 2 feature of the human genome, Cell 109 (2002) 807–809. (2006) e47. [44] I. Dan, N.M. Watanabe, E. Kajikawa, T. Ishida, A. Pandey, A. Kusumi, Overlapping [28] G. Solda, et al., Non-random retention of protein-coding overlapping genes in of MINK and CHRNE gene loci in the course of mammalian evolution, Nucleic Metazoa, BMC Genomics 9 (2008) 174. Acids Res. 30 (2002) 2906–2910. [29] G.W. Krystal, B.C. Armstrong, J.F. Battey, N-myc mRNA forms an RNA–RNA duplex [45] D.M. Jaworski, M. Beem-Miller, G. Lluri, R. Barrantes-Reynolds, Potential regulato- with endogenous antisense transcripts, Mol. Cell. Biol. 10 (1990) 4180–4191. ry relationship between the nested gene DDC8 and its host gene tissue inhibitor [30] M. Noguchi, S. Miyamoto, T.A. Silverman, B. Safer, Characterization of an anti- of metalloproteinase-2, Physiol. Genomics 28 (2007) 168–178. sense Inr element in the eIF-2 alpha gene, J. Biol. Chem. 269 (1994) 29161–29167. [46] E. De Souza Santos, S.A. De Bessa, M.M. Netto, M.A. Nagai, Silencing of LRRC49 and [31] P. Yu, D. Ma, M. Xu, Nested genes in the human genome, Genomics 86 (2005) THAP10 genes by bidirectional promoter hypermethylation is a frequent event in 414–422. breast cancer, Int. J. Oncol. 33 (2008) 25–31. [32] R. Gu, Z. Zhang, J.N. DeCerbo, G.G. Carmichael, Gene regulation by sense-antisense [47] J.S. Tan, N. Mohandas, J.G. Conboy, High frequency of alternative first exons in overlap of polyadenylation signals, RNA 15 (2009) 1154–1163. erythroid genes suggests a critical role in regulating gene function, Blood 107 [33] N.D. Trinklein, S.F. Aldred, S.J. Hartman, D.I. Schroeder, R.P. Otillar, R.M. Myers, An (2006) 2557–2561. abundance of bidirectional promoters in the human genome, Genome Res. 14 [48] E. Zanotto, A. Hakkinen, G. Teku, B. Shen, A.S. Ribeiro, H.T. Jacobs, NF-Y influences di- (2004) 62–66. rectionality of transcription from the bidirectional Mrps12/Sarsm promoter in both [34] P. Akiva, et al., Transcription-mediated gene fusion in the human genome, Ge- mouse and human cells, Biochim. Biophys. Acta 1789 (2009) 432–442. nome Res. 16 (2006) 30–36. [49] R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, T.P. Speed, Summaries of [35] Q. Wu, T. Maniatis, A striking organization of a large family of human neural Affymetrix GeneChip probe level data, Nucleic Acids Res. 31 (2003) e15. cadherin-like cell adhesion genes, Cell 97 (1999) 779–790. [50] B.M. Bolstad, R.A. Irizarry, M. Astrand, T.P. Speed, A comparison of normalization [36] Q. Wu, et al., Comparative DNA sequence analysis of mouse and human prot- methods for high density oligonucleotide array data based on variance and bias, ocadherin gene clusters, Genome Res. 11 (2001) 389–404. Bioinformatics 19 (2003) 185–193. [37] P.I. Mackenzie, et al., The UDP glycosyltransferase gene superfamily: recommended [51] R.A. Irizarry, et al., Exploration, normalization, and summaries of high density ol- nomenclature update based on evolutionary divergence, Pharmacogenetics 7 igonucleotide array probe level data, Biostatistics 4 (2003) 249–264. (1997) 255–269. [52] B. Langmead, C. Trapnell, M. Pop, S.L. Salzberg, Ultrafast and memory-efficient [38] D.J. Lipman, Making (anti)sense of non-coding sequence conservation, Nucleic alignment of short DNA sequences to the human genome, Genome Biol. 10 Acids Res. 25 (1997) 3580–3583. (2009) R25. [39] R. Yelin, et al., Widespread occurrence of antisense transcription in the human [53] A. Mortazavi, B.A. Williams, K. McCue, L. Schaeffer, B. Wold, Mapping and quantifying genome, Nat. Biotechnol. 21 (2003) 379–386. mammalian transcriptomes by RNA-Seq, Nat. Methods 5 (2008) 621–628.