Oncogene (2014) 33, 4786–4794 © 2014 Macmillan Publishers Limited All rights reserved 0950-9232/14 www..com/onc

ORIGINAL ARTICLE Identification of latent biomarkers in hepatocellular carcinoma by ultra-deep whole-transcriptome sequencing

K-T Lin1,8, Y-J Shann2,8, G-Y Chau3, C-N Hsu4,5,6 and C-YF Huang1,2,7

There is an urgent need to identify biomarkers for hepatocellular carcinoma due to limited treatment options and the poor prognosis of this common lethal disease. Whole-transcriptome shotgun sequencing (RNA-Seq) provides new possibilities for biomarker identification. We sequenced ~250 million paired-end reads from a pair of adjacent normal and tumor liver samples. With the aid of bioinformatics tools, we determined the transcriptome landscape and sought novel biomarkers by further empirical validations in 55 pairs of adjacent normal and tumor liver samples with various viral statuses such as HBV(+), HCV(+) and HBV(−)HCV(−). We identified a novel gene with coding regions, termed DUNQU1, which has a tissue-specific expression pattern in tumor liver samples of HCV(+) and HBV(−)HCV(−) hepatocellular carcinomas. Overexpression of DUNQU1 in Huh7 cell lines enhances the ability to form colonies in soft agar. Also, we identified three novel differentially-expressed protein-coding genes (ALG1L, SERPINA11 and TMEM82) that lack documented expression profiles in liver cancer and showed that the level of SERPINA11 is correlated with pathology stages. Moreover, we showed that the alternative splicing event of FGFR2 is associated with virus infection, tumor size, cirrhosis and tumor recurrence. The findings indicate that these new markers of hepatocellular carcinoma may be of value in improving prognosis and could have potential as new targets for developing new treatment options.

Oncogene (2014) 33, 4786–4794; doi:10.1038/onc.2013.424; published online 21 October 2013 Keywords: hepatocellular carcinoma; DUNQU1; FGFR2; alternative splicing; RNA-Seq

INTRODUCTION

Hepatocellular carcinoma (HCC) is one of the three fastest-growing cancers in the US and is the most lethal cancer in Asia. Compared to other cancers, HCC has a relatively poor prognosis and limited treatment options. Traditionally, gene expression microarrays have been the most frequently used tools to dissect how the functional units are deployed in different cells. Gene signatures derived from these microarrays are considered to be the blueprints of events taking place in the cells under particular conditions at specific time points. In previous genome-wide studies, many hypotheses were generated from the gene signatures to explain biological outcomes.

Recently, RNA-Seq, one of the applications of the second-generation sequencing techniques, was developed. Since RNA-Seq offers single-base resolution on the whole-genome scale, it provides the opportunity to greatly improve our knowledge of both the quantitative and qualitative aspects of the human transcriptome.1 It has been reported that RNA-Seq can detect at least 25% more known genes than traditional gene expression arrays, as well as many novel transcripts in intergenic regions.2,3 Also, studies using RNA-Seq have shown that the number of functional units in the genome is much larger than previously anticipated.4 The transcription of mammalian genomes is now known to take place across almost all sections of the genome, and many alternative splicing (AS) events in the human transcriptome are very noisy, even in normal cells.5,6 It quickly became clear that the complexity of the human transcriptome cannot be explained solely by the ~2% of the genome analyzed by microarrays.

With the aid of bioinformatics tools, many potential biomarkers that were latent variables in previous genome-wide studies can now be seen by RNA-Seq.7 In the present study, we sought out potential biomarkers and validated their expression patterns in 55 pairs of adjacent normal and tumor liver samples with diverse viral status and gender.

RESULTS

The catalog of the transcriptome landscape of HCC

In total, we sequenced ~120 million read pairs per sample from a pair of adjacent normal and tumor liver samples (Supplementary Table 1). Aligned normal reads covered ~6.52% of the human genome and aligned tumor reads covered ~7.59% (Supplementary Figure 1a and Supplementary Table 2). We also identified many novel exon junctions, shown in Supplementary Figure 2. All of the above information can be either downloaded from our website (http://bioagent.iis.sinica.edu.tw/HCCT2012) or browsed on the UCSC Genome Browser for visualization and track comparison.

Novel differentially-expressed protein-coding genes whose expression profiles were missing in liver cancer

RNA-Seq can detect more differentially-expressed protein-coding genes (DE genes) than previous genome-wide arrays. We filtered

1Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan; 2Institute of Biopharmaceutical Sciences, National Yang-Ming University, Taipei, Taiwan; 3Division of General Surgery, Department of Surgery, Taipei Veterans General Hospital, Taipei, Taiwan; 4Institute of Information Science, Academia Sinica, Taipei, Taiwan; 5USC/Information Sciences Institute, Marina del Rey, CA, USA; 6Division of Biomedical Informatics, Department of Medicine, University of California, San Diego, La Jolla, CA, USA and 7Cancer Research Center and Genome Research Center, National Yang-Ming University, Taipei, Taiwan. Correspondence: Professor CYF Huang, Institute of Clinical Medicine, National Yang-Ming University, No. 155, Li-Non St, Sec. 2, Taipei 112, Taiwan. E-mail: [email protected] 8These authors contributed equally to this work. Received 3 January 2013; revised 15 August 2013; accepted 19 August 2013; published online 21 October 2013

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

out 2,576 up-regulated and 855 down-regulated genes (Supplementary Table 3) and found 17 DE gene candidates without annotations from the genome-wide arrays (U95, U133 and U133 plus 2.0) (Supplementary Table 4). Also, the 17 DE genes lacked documented expression profiles for liver cancer in Gene Expression Omnibus, Gene Expression Atlas, ArrayExpress and Oncomine (Supplementary Tables 4 and 5).8–11

By real-time PCR in 55 pairs of HCC patient samples, we showed that ALG1L, SERPINA11 and TMEM82 indeed had the expected expression patterns (Figure 1b and Supplementary Table 6). In particular, the ΔΔCt values of SERPINA11 were significantly different between stage I/II and stage III/IV (P = 0.0202) and negatively correlated with pathology stages (stage I/II = 1 and stage III/IV = 2, Pearson's correlation = −0.328, and P = 0.0145) (Figure 1c). That is to say, SERPINA11 was significantly lower in the later stages, such as stages III and IV.

ALG1L is a putative glycosyltransferase. An altered mRNA expression level of glycosyltransferases might be helpful for early detection of carcinomas and tumor progression.12 SERPINA11 is a serine proteinase inhibitor that might be secreted. The down-regulation of SERPINA11 has been correlated with breast cancer initiation and progression.13 TMEM82 has a transmembrane domain. These new DE genes may improve our understanding of the carcinogenesis of HCC.

A novel gene, termed DUNQU1, has a tissue-specific expression pattern and may play a role in liver tumorigenesis

RNA-Seq can address a certain proportion of the transcribed genome beyond current gene annotations (Supplementary Figure 1b). To investigate the uncharacterized areas of the human genome, we extracted the putative mRNA sequences from the Cufflinks de novo assembly results. In total, 38155 normal and 38146 tumor transcript sequences were found. Among them, 28569 normal and 28278 tumor sequences contained coding regions longer than or equal to 50 amino acids (Supplementary Table 7). Of the predicted coding regions, 231 normal and 286 tumor peptides were located in intergenic regions based on Ensembl 65 gene annotations. These coding peptides in intergenic regions represent potentially novel, unidentified genes. In particular, 224 of the 286 tumor peptides were specific to the sequenced liver tumor. This implies the existence of tumor-specific protein-coding genes in intergenic regions. To prioritize the candidates, we ranked the tumor-specific peptides by FPKM values and manually determined whether there were multi-exon peptides with predicted functional domains. Among tumor-specific peptides whose FPKM value was greater than 0.5, we found only one multi-exon candidate that had predicted functional domains. It was a 101-amino-acid peptide encoded by a gene with 3 exons. We named the gene DUNQU1, from the Chinese for ‘the latent one’.

Our analysis of the junction reads suggested that DUNQU1 comprises three exons (E1, E2 and E3) and expresses two isoforms: SP1 (E1 + E2 + E3) and SP2 (E1 + E3) (Supplementary Figure 3a). The mRNA transcript of SP1 was predicted to be 5438 bp, whereas SP2 was 5345 bp. The predicted peptide sequences for SP1 and SP2 were 101 and 94 amino acids, respectively. According to InterProScan, both isoforms have a phosphodiesterase domain (PF01663) and alkaline phosphatase-like domains (G3DSA:3.40.720.10 and SSF53649), and both have a potential signal peptide (SignalP-NN(euk)) (Supplementary Figure 4).14 Analysis of the predicted DUNQU1 peptide sequences revealed high identity between DUNQU1 and the N-terminus of ENPP7 (Supplementary Figure 5). ENPP7, also known as alkaline sphingomyelinase (alk-SMase) or NPP7, is expressed in intestine and liver

Figure 1. Flow chart of this study and novel differentially-expressed genes in liver cancer. (a) Using the total RNA extracted from a pair of adjacent normal and tumor liver samples, we performed RNA-Seq to obtain short reads and constructed the transcriptome landscape by using several bioinformatics tools such as TopHat, MapSplice, SpliceTrap and Cufflinks. Several major findings, such as novel differentially expressed protein-coding genes (DE genes), novel genes and AS events, were selected and verified in a set of 55 pairs of adjacent normal and tumor HCC patient samples. (b) ΔΔCt values of ALG1L, SERPINA11 and TMEM82 were derived from real-time PCR of 55 pairs of adjacent normal and tumor liver samples. Each gene has a violin plot showing the shape of the distribution of ΔΔCt values (gray area), their median (white dot) and their interquartile range (black box). (c) ΔΔCt values of SERPINA11 were grouped into two pathological stages. The P-value was obtained from a t-test comparing stage I/II and stage III/IV.
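The DE-gene selection summarized in panel (a) relies on the fold-change rule given in Materials and Methods, which compares normal FPKM (n) and tumor FPKM (t) while treating values below 1 as 1. A minimal Python sketch of that rule follows. The function name is hypothetical, and returning `None` when both values are below 1 is an assumption, since the Methods list only three cases.

```python
from typing import Optional


def fold_change(n: float, t: float) -> Optional[float]:
    """Fold change of tumor FPKM (t) relative to normal FPKM (n).

    Mirrors the three cases stated in the Methods: FPKM values
    below 1 are replaced by 1 before the ratio is taken.
    """
    if n >= 1 and t >= 1:
        return t / n
    if n >= 1 and t < 1:
        return 1 / n
    if n < 1 and t >= 1:
        return t / 1
    # n < 1 and t < 1 is not covered by the Methods (assumption: undefined).
    return None


# Hypothetical FPKM values, for illustration only.
print(fold_change(2.0, 8.0))   # both expressed: plain ratio t/n
print(fold_change(4.0, 0.2))   # tumor below 1: treated as 1/n
print(fold_change(0.3, 5.0))   # normal below 1: treated as t/1
```

Flooring at 1 keeps very low FPKM values, which are noisy, from producing inflated ratios.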

and may function as a tumor suppressor. It hydrolyzes sphingomyelin to ceramide, which inhibits cell proliferation and induces apoptosis. Loss-of-function mutations and alternative splicing (AS) of ENPP7 have been associated with colon cancer and liver tumorigenesis.15,16

DUNQU1 is located at the chromosome band 16p11.2 (Supplementary Figure 3b), whose deletion has been linked to autism and obesity.17,18 Upstream of DUNQU1, we found a transcription factor binding site and a DNase hypersensitivity cluster at the same location (Supplementary Figure 3b).19 The first exon of DUNQU1 is conserved in Xenopus tropicalis, Tetraodon, Fugu, stickleback, medaka, zebrafish and lamprey (Supplementary Figure 3b). This sequence conservation corroborates the existence of DUNQU1 as an expressed gene. The transcription start site identified by 5′ RACE was close to the transcription start site suggested by RNA-Seq data (Supplementary Figure 6). The tumor-specific expression pattern of DUNQU1 in the sequenced liver tumors was confirmed by end-point PCR (Figure 2a). Direct sequencing of the PCR products also revealed exactly the same sequence as produced by RNA-Seq (within the PCR-amplified region). To investigate the possible role of DUNQU1 in liver tumorigenesis, we first investigated whether DUNQU1 is expressed only in tumor tissues. Two sets of primers were designed and used to detect DUNQU1 cDNA in a nested PCR experiment of 55 pairs of HCC samples (Supplementary Figure 3a). No signal was detected in most normal livers (except 4 HBV(+) male samples), even after extensive PCR amplification, whereas there was clear DNA amplification in most liver tumors (Figure 2c). Some of the samples showed three amplified DNA fragments, corresponding to the SP1 (425 bp), SP2 (332 bp) and SP1/SP2 hybrid forms (highest in the gel because of their loop structure). Direct sequencing of a gel-eluted DNA fragment of the hybrid form showed a mixture of SP1 and SP2 (Supplementary Figure 7).

To explore the expression patterns of DUNQU1 in HCC cell lines, we performed real-time PCRs and found that DUNQU1 expressed two isoforms in Hep3B, HepG2, Huh7 and PLC5 cell lines, but not in Mahlavu cell lines (Figure 3a). Quantification of the expression levels of DUNQU1 showed that, compared with the sequenced tumor liver, Hep3B had a higher expression level of DUNQU1 while the other three cell lines had a relatively lower expression level of DUNQU1 (Figure 3b). To explore the effect of DUNQU1 in HCC cells, we overexpressed DUNQU1 in the Huh7 cell line, which was confirmed by PCR (Figure 3c). A soft agar assay showed that overexpression of DUNQU1 increased the colony formation of Huh7 cell lines (Figure 3d). The results raise the possibility that DUNQU1 might play one or more roles in liver tumorigenesis.

Alternative splicing events show the changes in cell behaviors and may serve as new biomarkers of HCC

To detect the alterations of AS events between normal and tumor liver tissues, we used SpliceTrap to identify significant AS events by comparing the exon inclusion ratios between normal and tumor reads. SpliceTrap reported 1003 AS events from 825 exons in 648 genes.20 Pathway analysis showed that the potentially AS genes were enriched in metabolism and cell-cell communication pathways (Supplementary Table 8).21 Most of the AS events were from exons with low inclusion ratios (Supplementary Figure 8). We report 38 AS events with high exon inclusion ratios (≥ 0.4) (Supplementary Table 9). After manual curation, we concluded that 14 of the AS events were relatively obvious (highlighted in

Figure 2. DUNQU1 has a tissue-specific expression pattern in HCV(+) and HBV(−)HCV(−) liver tumors. (a) The RT-PCR results confirm that DUNQU1 is only expressed in the tumor part of the sequenced liver tumors (HCC1428). (b) The table shows the viral status and gender information for the 55 pairs of patient samples used for further RT-PCR validation. (c) In most cases, DUNQU1 was only detectable in liver tumors. The only exceptions were 4 pairs of male HBV(+) samples, showing signals in both adjacent normal and tumor liver samples. Some samples showed an extra DNA fragment at ~500 bp. This signal represents a hybrid form of SP1 and SP2 arising during PCR (Supplementary Figure 7).


Figure 3. DUNQU1 enhances soft agar colony formation. (a) Products of 40 cycles of real-time PCR (primers DUN-1 and DUN-7) were separated by electrophoresis to confirm DUNQU1 expression in Hep3B, HepG2, Huh7, Mahlavu and PLC5 cell lines. (b) Quantification of DUNQU1 in all samples was normalized against the endogenous control β-actin. ΔΔCt values were used to represent the relative abundance of DUNQU1 compared to the sequenced tumor liver (1428T). A positive number indicates that the expression level of DUNQU1 is higher than in the sequenced tumor liver (1428T). (c) Primers DUN-8 and DUN-9 were used to detect both endogenous and exogenous DUNQU1 levels in lentivirus-transduced Huh7 cell lines. Ratios are relative to the WT Huh7 cell lines. (d) Soft agar colony formation assay. Mixed stable Huh7 cell lines were seeded in soft agar in triplicate. EGFP is a negative control for overexpression experiments.

red in Supplementary Table 9). Three of the AS events (FGFR2, EXOC7 and ADAM15) are cancer-related, and one AS event with a novel exon (TELO2) may have a role in the cell cycle.22

FGFR2 has a mutually exclusive AS event, which corresponds to the switch from its epithelial isoform to its mesenchymal isoform (Supplementary Figure 9a). The AS event of FGFR2 is necessary for epithelial-mesenchymal transition, which is a critical event during tumorigenesis.23–25 Reduced protein level of FGFR2-IIIb is reportedly correlated with the tumor stages of HCC.26 EXOC7 changed its exon 7 inclusion ratio in the sequenced tumor liver (Supplementary Figure 9b). Similar to FGFR2, the AS event of EXOC7 has been reported to be epithelial-mesenchymal transition-driven in human breast cancer.27 The protein encoded by EXOC7, Exo70, is a component that regulates cell migration and maintains the epithelial polarity at the plasma membrane. ADAM15 has different exon inclusion ratios for exon 20 and exon 21. The different use of exons 19–21 has been reported previously and was proposed as a diagnostic marker for cancer diagnostics.28,29 ADAM15 is an adhesion receptor on endothelial cells, and the protein expression of ADAM15 is reportedly associated with cancer cell proliferation and progression.30 Finally, TELO2 has a novel exon involved in an alternative 5′ splice site event, and this exon seemed to be lost in the sequenced tumor liver (Supplementary Figure 10). The predicted transcript (CUFF.11822.1) with the novel exon had a shorter coding region (718 amino acids) than the coding region of the canonical transcript (ENST00000262319) (837 amino acids).

Taken together, the AS events showed the changes in cell behaviors such as cell-cell adhesion, polarity and migration in HCC.

The switch from FGFR2-IIIb to FGFR2-IIIc in the liver tumors was significantly associated with virus infection, and increased FGFR2-IIIc inclusion ratio was associated with cirrhosis and tumor recurrence

Real-time PCR of 43 pairs of adjacent normal and HCC samples showed that FGFR2-IIIb was down-regulated in 67.44% of the liver tumors, whereas FGFR2-IIIc was up-regulated in 46.51% of the liver tumors (Figure 4a). 65.11% of the HCC samples had stronger down-regulation of FGFR2-IIIb than of FGFR2-IIIc. As a result, the FGFR2-IIIc inclusion ratio increased in 65.12% of the HCC samples (Figure 4b).

Contingency table analysis showed that the ΔΔCt values of FGFR2-IIIc and overall FGFR2 were associated with viral status (Table 1). Also, combining HBV(+) and HCV(+) HCCs together, we found that the presence of virus infection in a patient was associated with the downregulation of overall FGFR2 (P = 0.001145) and FGFR2-IIIb (P = 0.03511). Moreover, the switch from FGFR2-IIIb to FGFR2-IIIc in the liver tumors was significantly associated with virus infection (Table 2 and Figure 4b). Interestingly, we also found that FGFR2-IIIc inclusion ratios in the adjacent normal tissues were significantly correlated with tumor sizes (Pearson's correlation = 0.4661 and P-value = 0.001631) (Supplementary Figure 11b). Furthermore, FGFR2-IIIc inclusion ratio change (tumor minus normal) was associated with both cirrhosis and tumor recurrence (Table 2 and Supplementary Figure 11c).

In adult normal and fetal liver tissues, FGFR2-IIIc inclusion ratios were near 50% or below 50%, whereas the ratios were much higher in the tumor cell lines (Supplementary Figure 11a). This evidence raised the possibility that FGFR2-IIIc might also have roles in liver tumorigenesis.

DISCUSSION

Our knowledge converges to a point based on what we have observed, and the point is often not fixed. By analyzing the first ultra-deep transcriptome landscape of human liver cancer, taking into account empirical validations and published evidence, the present study identified potential biomarkers for HCC, including ALG1L, SERPINA11, TMEM82 and DUNQU1 and the AS event of FGFR2. They were latent variables in the previous genome-wide studies of HCC. Thanks to the power of RNA-Seq, their importance can now be revealed.

RNA-Seq is revolutionizing both the size and complexity of human transcriptome analysis. The idea of ‘genome-wide’ has no


Figure 4. Distributions of ΔΔCt values of FGFR2 isoforms and FGFR2-IIIc inclusion ratios in 43 pairs of HCC samples. (a) Blue bars are ΔΔCt values of FGFR2-IIIb and green bars are for FGFR2-IIIc. Patient IDs are ordered first by virus states and then sorted by the ΔΔCt values of FGFR2-IIIb in ascending order. (b) Patient IDs are arranged in the same order as in (a). The green bars are the FGFR2-IIIc inclusion ratio changes, which are the differences of tumor inclusion ratios minus normal inclusion ratios.
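The inclusion ratio change plotted in panel (b) can be sketched in Python as follows. The sketch assumes that the FGFR2-IIIc inclusion ratio is the IIIc share of the combined IIIb and IIIc abundance, with each abundance taken as 2^(−ΔCt). This is one reading of the formula in Materials and Methods, not a confirmed transcription of it. All function names and Ct-derived values are hypothetical.

```python
def iiic_inclusion_ratio(iiic_dct: float, iiib_dct: float) -> float:
    """IIIc share of IIIb + IIIc abundance, computed from delta-Ct values.

    With x = 2^-(IIIc dCt - IIIb dCt), the ratio x / (x + 1) equals
    2^-IIIc_dCt / (2^-IIIc_dCt + 2^-IIIb_dCt) and always lies in (0, 1).
    """
    x = 2.0 ** (-(iiic_dct - iiib_dct))
    return x / (x + 1.0)


def inclusion_ratio_change(normal_ratio: float, tumor_ratio: float) -> float:
    """Tumor inclusion ratio minus normal inclusion ratio, as in panel (b)."""
    return tumor_ratio - normal_ratio


# Hypothetical delta-Ct values for one patient, for illustration only.
normal = iiic_inclusion_ratio(iiic_dct=5.0, iiib_dct=5.0)  # equal abundance
tumor = iiic_inclusion_ratio(iiic_dct=4.0, iiib_dct=5.0)   # IIIc twice as abundant
print(normal, tumor, inclusion_ratio_change(normal, tumor))
```

A positive change means that IIIc gained share relative to IIIb in the tumor, which is how the ‘+’ column of Table 2 would be read under this assumption.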

Table 1. Contingency table analysis for FGFR2 and its isoforms based on the ΔΔCt values of real-time PCR

Variable                 Category  n    %        IIIb ΔΔCt           IIIc ΔΔCt           FGFR2 ΔΔCt
                                                 +    −    P-value   +    −    P-value   +    −    P-value
Age at diagnosis, years  ≥60       23   53.49%   9    14   0.3528    10   13   0.2233    8    15   1
                         <60       20   46.51%   5    15             13   7              7    13
Gender                   Male      21   48.84%   4    17   0.104     11   10   1         6    15   0.5256
                         Female    22   51.16%   10   12             12   10             9    13
Pathology stage          I         16   37.21%   4    12   0.4315    6    10   0.1537    4    12   0.7116
                         II        12   27.91%   5    7              7    5              5    7
                         III       14   32.56%   4    10             10   4              6    8
                         IV        1    2.33%    1    0              0    1              0    1
Cirrhosis                Yes       16   37.21%   2    14   0.04471   8    8    0.7611    3    13   0.1095
                         No        27   62.79%   12   15             15   12             12   15
Viral status             HBV       18   41.86%   4    14   0.1112    11   7    0.02924   5    13   0.001976
                         HCV       13   30.23%   3    10             3    10             1    12
                         None      12   27.91%   7    5              9    3              9    3
Vascular invasion        0         18   45.00%   4    14   0.427     7    11   0.2612    5    13   0.6307
                         2         16   40.00%   8    10             12   6              8    10
                         4         6    15.00%   2    5              4    3              2    5
Tumor size, cm           ≤5        23   53.49%   5    18   0.1912    10   13   0.2233    6    17   0.2193
                         >5        20   46.51%   9    11             13   7              9    11
Recurrence               Yes       15   34.88%   5    10   1         6    9    0.4699    5    10   1
                         No        23   53.49%   8    15             14   9              8    15
                         Unknown   5    11.63%   1    4              3    2              2    3
n                                  43            14   29   32.56%    23   20   53.49%    15   28   34.88%

*‘+’ means an increase of expression level where ΔCt (normal) − ΔCt (tumor) is positive. ‘−’ means the expression level decreases.
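The sign convention in the footnote can be made concrete with a short Python sketch of the ΔΔCt calculation. The Methods state that β-actin was the endogenous control. The function names and Ct values below are hypothetical, and reporting an exact zero as ‘0’ is an assumption, since the footnote only defines ‘+’ and ‘−’.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Delta-Ct: target gene Ct minus endogenous control (e.g. beta-actin) Ct."""
    return ct_target - ct_reference


def delta_delta_ct(dct_normal: float, dct_tumor: float) -> float:
    """Delta-delta-Ct as defined in the footnote: dCt(normal) - dCt(tumor)."""
    return dct_normal - dct_tumor


def direction(ddct: float) -> str:
    """'+' for increased expression in tumor, '-' for decreased.

    Returning '0' for an exact tie is an assumption of this sketch.
    """
    if ddct > 0:
        return "+"
    if ddct < 0:
        return "-"
    return "0"


# Hypothetical Ct values for one gene in one patient, for illustration only.
dct_n = delta_ct(ct_target=28.0, ct_reference=18.0)  # adjacent normal tissue
dct_t = delta_ct(ct_target=26.0, ct_reference=18.0)  # tumor tissue
ddct = delta_delta_ct(dct_n, dct_t)
print(direction(ddct), 2.0 ** ddct)  # sign and fold change of tumor over normal
```

Because a lower Ct means more template, a positive ΔCt (normal) − ΔCt (tumor) corresponds to higher expression in the tumor, and 2^ΔΔCt gives the matching fold change.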

Table 2. Contingency table analysis of FGFR2-IIIc inclusion ratio change

Variable                 Category  n    %        Switch from IIIb to IIIc  FGFR2-IIIc inclusion ratio change
                                                 Yes   No    P-value       +    −    P-value
Age at diagnosis, years  ≥60       23   53.49%   10    13    1             15   8    1
                         <60       20   46.51%   8     12                  13   7
Gender                   Male      21   48.84%   9     12    1             15   6    0.5256
                         Female    22   51.16%   9     13                  13   9
Pathology stage          I         16   37.21%   6     10    0.9553        11   5    0.7116
                         II        12   27.91%   6     6                   8    4
                         III       14   32.56%   6     8                   9    5
                         IV        1    2.33%    0     1                   0    1
Cirrhosis                Yes       16   37.21%   8     8     0.526         14   2    0.02293
                         No        27   62.79%   10    17                  14   13
Viral status             HBV/HCV   31   72.09%   16    15    0.04637       23   8    0.07395
                         None      12   27.91%   2     10                  5    7
Vascular invasion        0         18   45.00%   7     11    0.7154        13   5    0.7567
                         2         16   40.00%   9     9                   11   7
                         4         6    15.00%   2     5                   4    3
Tumor size, cm           ≤5        20   46.51%   8     12    1             16   7    0.5401
                         >5        23   53.49%   10    13                  12   8
Recurrence               Yes       15   34.88%   5     10    0.7522        6    9    0.02216
                         No        23   53.49%   11    12                  17   6
                         Unknown   5    11.63%   2     3                   5    0
n                                  43            18    25    41.86%        15   8    65.22%

*‘+’ means an increase of FGFR2-IIIc inclusion ratio. ‘−’ means the inclusion ratio decreased.
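The Methods state that associations in the contingency tables were tested with Fisher's exact test. As a minimal sketch, the Python function below computes a two-sided P-value for a 2×2 table using only the standard library. The two-sided convention used here (summing all tables no more probable than the observed one) is an assumption about how the reported values were obtained, and the function name is hypothetical. It is applied to the cirrhosis rows of the inclusion ratio change columns (Yes: 14 ‘+’, 2 ‘−’; No: 14 ‘+’, 13 ‘−’).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # The small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(k) for k in range(k_min, k_max + 1))
               if p <= p_obs * (1 + 1e-9))


# Cirrhosis versus FGFR2-IIIc inclusion ratio change, counts from Table 2.
p_cirrhosis = fisher_exact_two_sided(14, 2, 14, 13)
print(f"P = {p_cirrhosis:.5f}")
```

Under this convention the result should come out close to the 0.02293 reported for cirrhosis in Table 2.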

longer been limited to the ~2% of the human genome. Known coding genes such as ALG1L, SERPINA11 and TMEM82, which were analyzed in the present study, are not detectable in traditional ‘genome-wide’ arrays and can serve as new DE genes in terms of expression patterns. Long non-coding RNAs (lncRNAs) such as MALAT1 and HULC have been shown to be associated with cancers and also had expected expression patterns in our sequenced samples (Supplementary Figure 12).31–33 Moreover, the Encyclopedia of DNA Elements (ENCODE) project recently reported that ~75% of the human genome is transcribable at some point in some cells and can produce highly overlapped transcripts from both DNA strands.34 Taken together, the evidence suggests that we should rethink our approach to understanding the human transcriptome and elements in the human genome. DUNQU1, for instance, which is not documented in any current gene annotations including GENCODE (version 14) and thereby exceeds the boundary of ~75% reported by ENCODE, has intriguing expression patterns and potentially has functions in liver tumorigenesis.

Our cDNA libraries were constructed from mRNAs with a poly(A) tail. Therefore, non-poly(A) RNAs such as ribosomal RNAs will not be captured. Also, cell types can have a considerable impact on whether or not a particular transcript is sequenced. In our sequenced liver samples, ~80% of the reads were derived from the top 100 highest-expressed genes such as ALB. Transcripts from the remainder of the genome are thus less likely to be sequenced. Moreover, different sequencing strategies and alignment settings can result in significantly different sizes of mapped regions. For example, mapping paired-end reads by Bowtie can cover ~6 to ~8% of the genome depending on the seed length (Supplementary Figure 13a). If we use only the first end of paired-end reads to simulate alignments of single reads, the sizes range from ~6 to ~18% (Supplementary Figure 13b). We postulate that the reasons described above explain why there was such a discrepancy between the coverage in our analysis and that reported by ENCODE, that is, 6.5 to 7.6% vs ~75%. Since 6% of the genome can be perfectly aligned by 75 bp paired-end reads, it is likely that 6% is the minimum size of the transcribable regions of the human genome (Supplementary Figure 13b). Still, this figure is much larger than the ~2% that is currently accepted.

In addition to the transcribed regions, an important dimension of the human transcriptome is AS. AS events can alter the expression of and serve as potential targets for new treatment options.35 In the present study, we showed that the AS events of FGFR2 might be related to virus infection and the FGFR2-IIIc inclusion ratios were related to tumor size, cirrhosis and tumor recurrence. Most importantly, it is intriguing to see that the more FGFR2-IIIc there is in the liver tumors, the lower the tumor recurrence. This suggests that the ‘bad’ cells expressing FGFR2-IIIc have been removed by surgery, and the paracrine networks are

disrupted.36,37 This might be the reason why the tumor recurrence rate became lower. If this is the case, detecting the abundance of FGFR2-IIIc might be a good target for improving prognosis.38 With the help of RNA-Seq and appropriate bioinformatics tools, it is now possible to investigate AS events more accurately.

A recent study also performed RNA-Seq to analyze 10 pairs of HCC samples, and the results were significantly different both quantitatively and qualitatively from those of the present study.39 In the study by Huang et al., ‘single-end’ RNA-Seq (36 bases) was performed, capturing ~21.6 million single-end reads with ~10.6 million aligned reads per sample. In contrast, in the present study, we performed ‘paired-end’ RNA-Seq (75 bases) and captured ~126 million read ‘pairs’ per sample, and each sample had ~100 million read pairs properly aligned (Supplementary Table 1). The purposes of identifying DE genes in the two studies were also very different. Huang et al. randomly validated some DE genes to demonstrate the accuracy of RNA-Seq. In the present study, we identified DE genes not detected in previous genome-wide studies and correlated their expression levels with clinicopathological characteristics of HCCs. Moreover, there were also differences between the two studies with regard to AS. The main subject of interest in the study by Huang et al. was the identification of novel junctions, whereas in the present study we were primarily concerned with exploring the switch of AS events associated with HCC. Huang et al. identified a novel junction of ATAD2 expressed in 5 adjacent non-cancerous and 20 tumor liver samples. Although we detected many novel junctions (Supplementary Figure 2), we did not detect the novel junction of ATAD2. In the present study, we investigated the switch of AS events such as FGFR2 and ADAM15. A recent study suggested that at least ~500 million single-end reads (50 bases) are required for detecting the changes of isoforms.40 This might explain why Huang et al. found it difficult to confirm their novel junctions, because the junction of interest might not be supported by a sufficient number of robust reads bridging the intron. Huang et al. did not release their RNA-Seq reads, so this cannot be confirmed. We have deposited our RNA-Seq reads on the GEO database and have made all alignments as well as assembled isoforms readily accessible on the UCSC genome browser.

In conclusion, we have characterized the first ultra-deep liver cancer transcriptome landscape by validating several novel findings. The new biomarkers might be used as new diagnostic or prognostic markers for HCC biopsies by RT-qPCR or immunohistochemistry when the antibodies are available. Not only do these findings provide new insights for the field of liver cancer research, but they also serve as a valuable resource for understanding of the human transcriptome.

MATERIALS AND METHODS

Clinical samples and RNA-Seq experiment

The sequenced samples of HCC (1428T) and adjacent normal liver (1428N) tissues were obtained from a patient who had undergone curative hepatic resection for HCC at the Department of Surgery, Taipei Veterans General Hospital (Taipei, Taiwan). Curative resection was defined as complete clearance of the tumour macroscopically, with a microscopically clear margin. The patient had not received any preoperative treatment, such as chemotherapy, ethanol injection or transarterial chemoembolization. The diagnosis of HCC was confirmed by histological examination of surgically-resected specimens. Tumour specimen and paired non-tumour liver tissue were obtained immediately after surgical resection; the non-tumour liver tissues were taken more than 10 mm away from the HCC. Samples of tumor tissue were free of necrotic regions and were collected after histological examination. For empirical validations, 55 pairs of HCC samples from the Taiwan Liver Cancer Network were selected based on gender and viral status (Figure 2b). These samples were used in accordance with the IRB procedures of National Yang-Ming University.

RNA samples including 1428N, 1428T and the 55 pairs of HCC samples were prepared directly from frozen resected tumor and non-tumor tissues. RNA quality was confirmed by gel electrophoresis. RNA preparation for 1428N and 1428T was carried out according to the manual of the RNeasy Mini Kit (Qiagen, Hilden, Germany), with an extra on-column DNase digestion for sample preparation. Freedom from DNA contamination and RNA integrity were first checked by gel electrophoresis. RNA samples for the RNA-Seq experiment were further checked with an Agilent Bioanalyzer to assess sample integrity. The RNA-Seq experiment was performed in the genomic center of National Yang-Ming University. In total, we sequenced 8 lanes (4 lanes for adjacent normal and 4 lanes for tumor) using the Illumina GA2 platform with a 75 bp paired-end sequencing protocol and base calling with the Illumina pipeline, version 1.6. The sequencing data are deposited under accession number SRA043490. DUNQU1’s accession numbers are JF934746 and JF934747.

cDNA synthesis, primers and PCR reaction conditions

cDNA was synthesized from 1 μg of total RNA using the ThermoScript RT-PCR system (Invitrogen, Life Technologies, Grand Island, NY, USA) with random primers. Real-time PCRs were performed using the StepOnePlus system (Applied Biosystems, Life Technologies, Grand Island, NY, USA) and QuantiFast SYBR Green Real-time PCR Master Mix (Qiagen) with cycling conditions of 5 min at 95 °C and 40 cycles of 20 s at 95 °C and 40 s at 66 °C. The changes in gene expression were analyzed by the ΔΔCt method, using β-actin as an endogenous control. All primers are listed in Supplementary Table 10.

Short reads process

To process ~100 GB of raw data (~253.6 million 75 bp paired-end reads), we used TopHat 1.4 and MapSplice 1.15.2 to identify candidate junctions.41,42 For transcriptome landscape construction, we did two runs. Run A aimed to assign FPKM values to known genes by Cufflinks.43 For each gene with normal FPKM (n) and tumor FPKM (t) values, we calculated the fold change based on the following equation:

\[
\text{Fold change}(n, t) =
\begin{cases}
t/n, & n \ge 1 \text{ and } t \ge 1 \\
1/n, & n \ge 1 \text{ and } t < 1 \\
t/1, & n < 1 \text{ and } t \ge 1
\end{cases}
\]

Run B aimed to predict putative transcripts by de novo assembly using Cufflinks.43 The reference genome for bias detection and the correction algorithm was based on hg19. We used Ensembl version 65 for gene annotation.44

Differentially-included exons and inclusion ratio formula

We combined the Gene Transfer Format (GTF) files generated from run B and the Ensembl 65 gene annotation to construct a new database for SpliceTrap to estimate the exon inclusion ratios in adjacent normal and tumor tissues for AS events such as CAssette exon (CA), Alternative Acceptor (AA), Alternative Donor (AD), and Intron Retention (IR).20

Statistical analysis and FGFR2-IIIc inclusion ratio

Association tests for the contingency table analysis were carried out by Fisher's exact test. Correlation was tested with Pearson’s product-moment correlation coefficient, whose test statistic follows a Student’s t-distribution. The FGFR2-IIIc inclusion ratio was estimated by the following equation:

\[
\text{FGFR2-IIIc inclusion ratio} =
\frac{2^{-(\mathrm{IIIc}\,\Delta Ct - \mathrm{IIIb}\,\Delta Ct)}}{2^{-(\mathrm{IIIc}\,\Delta Ct - \mathrm{IIIb}\,\Delta Ct)} + 1}
\]

Plasmid and lentivirus protocol

Two expression constructs (EGFP and 3Flag-DUNQU1) use the LJM1 lentiviral vector, in which expression is driven by the CMV promoter. Lentiviral constructs were co-transfected with LP1, LP2 and LP/VSVG plasmids into 293T cells. Medium containing virus particles was harvested at 72 h and passed through a 0.45 μm filter. Target cells were transduced with lentivirus, and mixed populations of stable cell lines were generated by selection with puromycin for 7 days.

Soft agar colony formation assay

Four percent agar in water was prepared and autoclaved, then kept in a water bath at 56 °C. The two-layer agar plate was prepared by mixing culture medium with a volume of 4% melted agar and then adding it to the culture dish immediately. For bottom layer, 5 ml culture medium

containing 0.75% agar was added into a 60 mm culture dish and kept at room temperature to allow the plate to solidify. For the top layer, 3 × 10⁴ cells were mixed in 3 ml of medium containing 0.4% agar and added to the bottom layer plate, which was then placed in an incubator at 37 °C for 14 days.

ABBREVIATIONS
HCC, hepatocellular carcinoma; RNA-Seq, whole-transcriptome shotgun sequencing; HCV, hepatitis C virus; HBV, hepatitis B virus; FPKM, fragments per kilobase of transcript per million mapped reads; DNA, deoxyribonucleic acid; GTF, Gene Transfer Format; AS, alternative splicing; CA, cassette exon; AA, alternative acceptor; AD, alternative donor; IR, intron retention; PCR, polymerase chain reaction; DE genes, differentially expressed protein-coding genes; TSS, transcription start site; PTC, premature termination codon; IHC, immunohistochemistry; RT-qPCR, reverse transcription quantitative real-time polymerase chain reaction.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

ACKNOWLEDGEMENTS
We thank the Taiwan Liver Cancer Network for providing the liver tumor tissue samples and related clinical data (all are anonymous) for this work. This network currently includes five major medical centers in Taiwan (National Taiwan University Hospital, Chang-Gung Memorial Hospital-Linko, Veteran General Hospital-Taichung, Chang-Gung Memorial Hospital-Kaohsiung and Veteran General Hospital-Kaohsiung). Taiwan Liver Cancer Network is supported by grants from the National Science Council (NSC94–3112-B-182–002, NSC97–3112-B-182–004) and National Health Research Institutes, Taiwan. We also want to thank National Core Facility Program for Biotechnology (Bioinformatics Consortium of Taiwan, NSC102–2319-B-010–002), National Research Program for Biopharmaceuticals (NRPB, NSC10102325-B-492–001) and National Center for High-performance Computing of National Applied Research Laboratories (NCHC, NARLabs) for providing computing and storage resources. Finally, we want to thank Professor Adrian R Krainer for his valuable comments on the manuscript and hosting at Cold Spring Harbor Laboratory.

This research was supported by grants from the National Science Council (NSC101–2627-B-010–001- and NSC102-2627-B-010-001-), Taipei Veterans General Hospital (V102E2–006), the National Health Research Institutes (NHRI-EX102–10029BI), Ministry of Economic Affairs (101-EC-17-A-17-S1–152) and the Ministry of Education, Aim for the Top University Plan (National Yang-Ming University) to C-YF. Huang. This research was also supported by the research abroad grant from the National Science Council (NSC 100–2917-I-010–001) to K-T. Lin.

AUTHOR CONTRIBUTIONS
Kuan-Ting Lin and Yih-Jyh Shann designed the study, analyzed and interpreted the data, and drafted the article. Kuan-Ting Lin performed RNA-Seq and statistical analysis. Yih-Jyh Shann performed the experimental validations. Gar-Yang Chau, Chun-Nan Hsu and Chi-Ying F Huang participated in the design of the study. All authors agreed to publication.

REFERENCES
1 Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011; 12: 87–98.
2 Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008; 321: 956–960.
3 Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively short-read sequencing. Biotechniques 2008; 45: 81–94.
4 Lander ES. Initial impact of the sequencing of the human genome. Nature 2011; 470: 187–197.
5 Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL et al. The reality of pervasive transcription. PLoS Biol 2011; 9: e1000625, discussion e1001102.
6 Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet 2010; 6: e1001236.
7 Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 2011; 8: 469–477.
8 Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF et al. NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 2011; 39: D1005–D1010.
9 Kapushesky M, Adamusiak T, Burdett T, Culhane A, Farne A, Filippov A et al. Gene Expression Atlas update--a value-added database of microarray and sequencing-based functional genomics experiments. Nucleic Acids Res 2012; 40: D1077–D1081.
10 Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M et al. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 2011; 39: D1002–D1004.
11 Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 2007; 9: 166–180.
12 Petretti T, Kemmner W, Schulze B, Schlag PM. Altered mRNA expression of glycosyltransferases in human colorectal carcinomas and liver metastases. Gut 2000; 46: 359–366.
13 Parris TZ, Danielsson A, Nemes S, Kovács A, Delle U, Fallenius G et al. Clinical implications of gene dosage and gene expression patterns in diploid breast carcinoma. Clin Cancer Res 2010; 16: 3860–3874.
14 Zdobnov EM, Apweiler R. InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001; 17: 847–848.
15 Hertervig E, Nilsson A, Nyberg L, Duan RD. Alkaline sphingomyelinase activity is decreased in human colorectal carcinoma. Cancer 1997; 79: 448–453.
16 Cheng Y, Wu J, Hertervig E, Lindgren S, Duan D, Nilsson A et al. Identification of aberrant forms of alkaline sphingomyelinase (NPP7) associated with human liver tumorigenesis. Br J Cancer 2007; 97: 1441–1448.
17 Eichler EE, Zimmerman AW. A hot spot of genetic instability in autism. N Engl J Med 2008; 358: 737–739.
18 Bochukova EG, Huang N, Keogh J, Henning E, Purmann C, Blaszczyk K et al. Large, rare chromosomal deletions associated with severe early-onset obesity. Nature 2010; 463: 666–670.
19 Myers RM, Stamatoyannopoulos J, Snyder M, Dunham I, Hardison RC, Bernstein BE et al. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 2011; 9: e1001046.
20 Wu J, Akerman M, Sun S, McCombie WR, Krainer AR, Zhang MQ. SpliceTrap: a method to quantify alternative splicing under single cellular conditions. Bioinformatics 2011; 27: 3010–3016.
21 Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Res 2011; 39: D712–D717.
22 Takai H, Wang RC, Takai KK, Yang H, de Lange T. Tel2 regulates the stability of PI3K-related protein kinases. Cell 2007; 131: 1248–1259.
23 Grosso AR, Martins S, Carmo-Fonseca M. The emerging role of splicing factors in cancer. EMBO Rep 2008; 9: 1087–1093.
24 David CJ, Manley JL. Alternative pre-mRNA splicing regulation in cancer: pathways and programs unhinged. Genes Dev 2010; 24: 2343–2364.
25 Warzecha CC, Sato TK, Nabet B, Hogenesch JB, Carstens RP. ESRP1 and ESRP2 are epithelial cell-type-specific regulators of FGFR2 splicing. Mol Cell 2009; 33: 591–601.
26 Amann T, Bataille F, Spruss T, Dettmer K, Wild P, Liedtke C et al. Reduced expression of fibroblast growth factor receptor 2IIIb in hepatocellular carcinoma induces a more aggressive growth. Am J Pathol 2010; 176: 1433–1442.
27 Shapiro IM, Cheng AW, Flytzanis NC, Balsamo M, Condeelis JS, Oktay MH et al. An EMT-driven alternative splicing program occurs in human breast cancer and modulates cellular phenotype. PLoS Genet 2011; 7: e1002218.
28 Kleino I, Ortiz RM, Huovila AP. ADAM15 gene structure and differential alternative exon use in human tissues. BMC Mol Biol 2007; 8: 90.
29 Ortiz RM, Karkkainen I, Huovila AP. Aberrant alternative exon use and increased copy number of human metalloprotease-disintegrin ADAM15 gene in breast cancer cells. Genes Cancer 2004; 41: 366–378.
30 Mochizuki S, Okada Y. ADAMs in cancer cell proliferation and progression. Cancer Sci 2007; 98: 621–628.
31 Lin R, Maeda S, Liu C, Karin M, Edgington TS. A large noncoding RNA is a marker for murine hepatocellular carcinomas and a spectrum of human carcinomas. Oncogene 2007; 26: 851–858.
32 Panzitt K, Tschernatsch MM, Guelly C, Moustafa T, Stradner M, Strohmaier HM et al. Characterization of HULC, a novel gene with striking up-regulation in hepatocellular carcinoma, as noncoding RNA. Gastroenterology 2007; 132: 330–342.

33 Gutschner T, Diederichs S. The hallmarks of cancer: a long non-coding RNA point of view. RNA Biol 2012; 9: 703–719.
34 Ecker JR, Bickmore WA, Barroso I, Pritchard JK, Gilad Y, Segal E. Genomics: ENCODE explained. Nature 2012; 489: 52–55.
35 Hua Y, Sahashi K, Rigo F, Hung G, Horev G, Bennett CF et al. Peripheral SMN restoration is essential for long-term rescue of a severe spinal muscular atrophy mouse model. Nature 2011; 478: 123–126.
36 Turner N, Grose R. Fibroblast growth factor signalling: from development to cancer. Nat Rev Cancer 2010; 10: 116–129.
37 Huijts PE, van Dongen M, de Goeij MC, van Moolenbroek AJ, Blanken F, Vreeswijk MP et al. Allele-specific regulation of FGFR2 expression is cell type-dependent and may increase breast cancer risk through a paracrine stimulus involving FGF10. Breast Cancer Res 2011; 13: R72.
38 Somarelli JA, Schaeffer D, Bosma R, Bonano VI, Sohn JW, Kemeny G et al. Fluorescence-based alternative splicing reporters for the study of epithelial plasticity in vivo. RNA 2013; 19: 116–127.
39 Huang Q, Lin B, Liu H, Ma X, Mo F, Yu W et al. RNA-Seq analyses generate comprehensive transcriptomic landscape and reveal complex transcript patterns in hepatocellular carcinoma. PLoS One 2011; 6: e26168.
40 Toung JM, Morley M, Li M, Cheung VG. RNA-sequence analysis of human B-cells. Genome Res 2011; 21: 991–998.
41 Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009; 25: 1105–1111.
42 Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010; 38: e178.
43 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28: 511–515.
44 Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y et al. Ensembl 2011. Nucleic Acids Res 2011; 39: D800–D806.

Supplementary Information accompanies this paper on the Oncogene website (http://www.nature.com/onc)
