Deep RNA Sequencing Analysis of Readthrough Gene Fusions in Human
Total Page:16
File Type:pdf, Size:1020Kb
Nacu et al. BMC Medical Genomics 2011, 4:11 http://www.biomedcentral.com/1755-8794/4/11 RESEARCHARTICLE Open Access Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples Serban Nacu, Wenlin Yuan, Zhengyan Kan, Deepali Bhatt, Celina Sanchez Rivers, Jeremy Stinson, Brock A Peters, Zora Modrusan, Kenneth Jung, Somasekar Seshagiri*, Thomas D Wu* Abstract Background: Readthrough fusions across adjacent genes in the genome, or transcription-induced chimeras (TICs), have been estimated using expressed sequence tag (EST) libraries to involve 4-6% of all genes. Deep transcriptional sequencing (RNA-Seq) now makes it possible to study the occurrence and expression levels of TICs in individual samples across the genome. Methods: We performed single-end RNA-Seq on three human prostate adenocarcinoma samples and their corresponding normal tissues, as well as brain and universal reference samples. We developed two bioinformatics methods to specifically identify TIC events: a targeted alignment method using artificial exon-exon junctions within 200,000 bp from adjacent genes, and genomic alignment allowing splicing within individual reads. We performed further experimental verification and characterization of selected TIC and fusion events using quantitative RT-PCR and comparative genomic hybridization microarrays. Results: Targeted alignment against artificial exon-exon junctions yielded 339 distinct TIC events, including 32 gene pairs with multiple isoforms. The false discovery rate was estimated to be 1.5%. Spliced alignment to the genome was less sensitive, finding only 18% of those found by targeted alignment in 33-nt reads and 59% of those in 50-nt reads. However, spliced alignment revealed 30 cases of TICs with intervening exons, in addition to distant inversions, scrambled genes, and translocations. Our findings increase the catalog of observed TIC gene pairs by 66%. We verified 6 of 6 predicted TICs in all prostate samples, and 2 of 5 predicted novel distant gene fusions, both private events among 54 prostate tumor samples tested. Expression of TICs correlates with that of the upstream gene, which can explain the prostate-specific pattern of some TIC events and the restriction of the SLC45A3-ELK4 e4-e2 TIC to ERG-negative prostate samples, as confirmed in 20 matched prostate tumor and normal samples and 9 lung cancer cell lines. Conclusions: Deep transcriptional sequencing and analysis with targeted and spliced alignment methods can effectively identify TIC events across the genome in individual tissues. Prostate and reference samples exhibit a wide range of TIC events, involving more genes than estimated previously using ESTs. Tissue specificity of TIC events is correlated with expression patterns of the upstream gene. Some TIC events, such as MSMB-NCOA4, may play functional roles in cancer. * Correspondence: [email protected]; [email protected] Departments of Bioinformatics and Molecular Biology, Genentech, Inc., South San Francisco, California 94080, USA © 2011 Nacu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Nacu et al. BMC Medical Genomics 2011, 4:11 Page 2 of 22 http://www.biomedcentral.com/1755-8794/4/11 Background by the Encyclopedia of DNA Elements (ENCODE) pro- Readthrough gene fusions, or transcription-induced chi- ject across 12 human tissues and 3 cell lines, an meras (TICs), occur when consecutive genes on a gen- upstream extension indicative of a TIC event was found ome strand are spliced together. Their existence was in 136 of the 410 loci studied [14]. first reported experimentally in isolated cases [1-4], and Because RACE can assay only specific transcripts later surveyed computationally using analyses of selected in advance, it can miss TIC events that may expressed sequence tags (ESTs). Two different EST- predominate or have functional relevance in a particular based studies have been carried out to date. In one sample. In contrast, a genome-wide study of transcrip- study [5], researchers clustered ESTs and then aligned tional phenomena in individual tissues is now possible these clusters to the genome, looking for alignments with the recent advent of deep, or next-generation, that crossed gene boundaries. The other study [6] sequencing technology. Such technology provides a sam- involved identification of potential tandem gene pairs pling of the entire range of transcriptional phenomena andsoughtESTsthatspannedbothgenesinapair. in single tissues, by generating large volumes of short These studies indicate that at least 4-6% of genes in the reads of 30-100 nt [15]. However, analyzing such RNA- genome may be involved in TIC formation, although Seq data to study TICs poses its own set of unique chal- their prevalence was found to be generally low. lenges. Although many studies to date have analyzed Nevertheless, in some cases, TICs appear to be RNA-Seq data for the tasks of expression, sequence expressed highly and generate functional protein pro- polymorphisms, and even gene fusions in general, none ducts, with possible implications in cancer. For example, so far have tried to specifically detect TIC events, and the HHLA1-OC90 TIC is expressed highly in teratocar- previous studies have reported relatively few such cinoma cell lines [7], while the CD205-DCL1 TIC is events. Four TICs were reported in targeted sequencing expressed in Hodgkin lymphoma cell lines [8]. A TIC analysis of K562 [16]. Another study of the VCaP and between the oncogene RBM14 and RBM4 generates a K562 cell lines and the HBR and UHR samples using fusion protein called transcriptional coactivator CoAZ paired-end reads reported 76 fusion events [17], of [9]. Another TIC between RBM6 and RBM5 is found in which 23 appear to be TICs. A recent study of 25 pros- several cancer tissues and cell lines, but not in non- tate cancer samples [18] using an algorithm called tumor tissues, and is associated with larger breast tumor FusionSeq [19] identified 11 readthrough fusion candi- sizes [10]. Likewise, a TIC between exon 4 of SLC45A3 dates and experimentally verified 9 of them. and exon 2 of ELK4 found in prostate adenocarcinomas In this study, we explore two different methods for [11] has an erythroblast transformation-specific (ETS) detecting TICs in RNA-Seq data with high sensitivity. oncogene family member as its downstream gene. In One method involves a targeted alignment approach addition to the e4-e2 TIC isoform, which was observed where reads are aligned to a set of artificial exon-exon specifically in ERG-negative prostate cancer samples, target sequences constructed in advance. Such a tar- another isoform e1-e2 has been observed in both pros- geted alignment strategy has been highly effective in tate cancer and benign prostate tissue, and found to be studying the extent of intragenic alternative splicing in regulated by androgen levels [12]. the human genome, even in short reads of only 32 nt Although EST-based studies have identified over 300 [20-23]. But for intergenic splicing events in general, tar- distinct TIC events so far, these events are spread over geted alignment is not applicable as a computational the multiple RNA libraries from which the ESTs were strategy, because it is infeasible to generate all possible derived. Accordingly, an EST-based study cannot reveal exon-exon pairs over the human genome. In this paper, the extent or diversity of TIC occurrences in an indivi- we demonstrate that a targeted alignment approach is dual sample. The ability to study TICs in a single sam- nevertheless well suited for sensitive detection of TIC ple would facilitate the discovery of associations events across the universe of possible exon-exon pairs of between TIC events and phenotypic traits, such as pro- this type. pensity for particular cancer types or other diseases, or Our other method is to align the reads to a reference sensitivity to specific treatments. genome using a program that can split an individual One clue to the occurrence of TIC events within sam- alignment to different locations in a genome. Various ples comes from RACE (rapid amplification of cDNA such alignment tools or pipelines have been developed ends) studies, in which specifically targeted transcript for detecting spliced reads in short read data within regions are extended upstream (5’ RACE) or down- local regions of a genome, including QPALMA [24] and stream (3’ RACE) and then aligned to genomic tiling TopHat [25]. Other recent programs, including Split- arrays to reveal their gene structure [13]. In one large- Seek [26] and our own program GSNAP [27], provide scale 5’ RACE study covering 1% of the genome targeted the additional capability of finding splicing events Nacu et al. BMC Medical Genomics 2011, 4:11 Page 3 of 22 http://www.biomedcentral.com/1755-8794/4/11 involving genes from distant or interchromosomal loca- tissues [6], the mechanism for these varying expression tions in a genome. The spliced alignment approach is patterns has not been well studied. more general, because it can identify novel or distant Expression information can also provide clues about gene fusions not enumerable by a targeted alignment the mechanism of TIC formation. The prevailing approach. However, it is less sensitive, especially when hypothesis is that TIC events represent a type of tran- reads are very short, because it requires enough material scriptional “leakage,” in which termination of transcrip- ’ on both sides of the exon-exon junction for accurate tion fails for the upstream, or 5 gene, resulting in the ’ ’ alignment. For example, GSNAP requires at least 14 nt two adjacent 5 and 3 genes existing on a single tran- on both sides of the exon-exon junction to to find a script [6]. The splicing machinery then acts on this tran- novel spliced alignment, without any further assistance, script to give rise to the TIC.