Brunner et al. Genome Biology 2012, 13:R75 http://genomebiology.com/2012/13/8/R75

RESEARCH Open Access Transcriptional profiling of long non-coding RNAs and novel transcribed regions across a diverse panel of archived human cancers Alayne L Brunner1, Andrew H Beck1,2, Badreddin Edris1,3, Robert T Sweeney1, Shirley X Zhu1, Rui Li1, Kelli Montgomery1, Sushama Varma1, Thea Gilks1, Xiangqian Guo1, Joseph W Foley3, Daniela M Witten4, Craig P Giacomini1,5, Ryan A Flynn6, Jonathan R Pollack1, Robert Tibshirani7,8, Howard Y Chang6, Matt van de Rijn1 and Robert B West1*

Abstract Background: Molecular characterization of tumors has been critical for identifying important in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported. Results: We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker. Conclusions: This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research. Keywords: 3SEQ, FFPE, human cancer, intergenic transcripts, lncRNAs, novel transcripts, solid tumors, transcriptional profiling

Background cancers have been studied by microarrays, - The differentially expressed genes from hundreds of can- coding genes have been well studied in this context. cer profiling studies over the last several years have yielded Recent studies looking beyond protein-coding genes have numerous biomarkers that have improved the subtyping, shown that microRNAs (miRNAs) can show higher speci- classification and diagnosis of tumors for both research ficity as biomarkers for some applications than protein- and the clinic [1-3]. Because of the extent to which coding genes. As the landscape of long non-coding RNAs (lncRNAs) grows, studies have found that lncRNAs also * Correspondence: [email protected] tend to show more tissue-specific expression than protein- 1Department of Pathology, Stanford University School of Medicine, 269 coding genes [4]. This property of lncRNAs makes them Campus Drive, Stanford, CA 94305-5324, USA highly attractive as tissue-specific biomarkers. Full list of author information is available at the end of the article

© 2012 Brunner et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Brunner et al. Genome Biology 2012, 13:R75 Page 2 of 13 http://genomebiology.com/2012/13/8/R75

LncRNAs, generally defined as having a size greater than quantification (3SEQ) [21,22]. This variation on the tradi- 200 nucleotides, make up a diverse group of non-coding tional RNAseq method enriches for 3’ ends of transcripts RNAs that are distinct from miRNAs. Until recently, very by relying on polyA+ selection of fragmented RNA. Only few lncRNAs were annotated within the . the 3’-end-most polyadenylated fragment of each tran- Now, various groups have developed independent catalogs script is isolated and sequenced, making the technique of human lncRNAs [4-11]. These range from the lncRNA especially suitable for archived samples with fragmented database containing a few hundred high-confidence, RNA. This ‘sequence tag’ approach (not unlike expressed experimentally validated lncRNAs [12] to the largest of the sequence tag (EST) libraries) also significantly reduces the lncRNA catalogs, compiled by the GENCODE group, who sequencing depth necessary relative to RNAseq, because combined manual curation, computational analysis and RNAseq targets the entire length of each transcript. By targeted experimental validation of the GENCODE tran- reducing the sequencing to only a 3’ fragment from each script database to predict more than 14,000 lncRNA tran- transcript using a polyA+ selection, 3SEQ allowed us to scripts from roughly 9,000 gene loci [5,13]. Despite the substantially reduce the depth of sequencing required to thousands of human lncRNAs now predicted, few discover and quantify rare RNAs and allowed us to profile lncRNAs have been well characterized to date, so little is many more samples than would have been possible using known about the expression patterns of most lncRNAs in traditional RNAseq. Additionally, because of its strand- more than a handful of cell types. specific libraries, 3SEQ also provided directional infor- Recent analyses suggest that lncRNAs may play key roles mation for each sequence read, allowing separate in cancer [14,15]. LncRNAs have been shown to play an quantification of transcripts on both strands. important role in epigenetic gene regulation and cellular Given our interest in exploring lncRNA expression differentiation [16-18], both of which are processes fre- across a broad range of solid tumors and identifying new quently deregulated in cancer. Targeted evaluation of the transcripts associated with various cancer types, we chose lncRNA Hox antisense intergenic RNA (HOTAIR) in a set to profile a diverse set of solid human tumors representing of clinical breast cancer specimens showed that HOTAIR adenocarcinomas and squamous cell carcinomas from dif- expression was associated with breast cancer metastasis ferent organs, as well as a variety of soft-tissue sarcomas. and acts by disrupting polycomb repressive complex Although all of these cancer subtypes have been pre- 2-dependent gene silencing [19,20]. Despite the growing viously evaluated by gene microarray, this study represents expectation that numerous lncRNAs are important regula- the first unbiased sequencing-based screen for lncRNAs tors within both normal and cancer biology, no study to and other long novel transcripts across this diverse set of date has performed a survey of lncRNA expression across cancer subtypes. a broad range of human cancer types. We hypothesized that the usage of lncRNAs may be different between nor- Results mal tissues and cancer as well as between various cancers. 3SEQ profiling of lncRNAs and novel transcribed regions To address this, we chose to profile lncRNA expression To profile known lncRNAs as well as identify and quanti- across a range of primary human tumors and compare tate novel transcripts across a range of solid cancers, we expression patterns against a small sampling of normal tis- performed 3SEQ, a 3’-strand-specific sequencing technique sues. A broad collection of cancers offers a rich resource for the quantification of polyA+ RNA [21], to obtain global for studying the usage of known lncRNAs across a range gene expression patterns in 64 solid tumors, representing of cancer biologies and also provides the opportunity to 17 diagnostic classes of both sarcomas and carcinomas. identify new lncRNA candidates. Over 1.7 billion total short sequence reads were generated, High-throughput sequencing provides a number of resulting in more than 409 million reads that mapped advantages over other profiling methods for this type of uniquely to the human genome (Table S1 in Additional study. Sequencing of polyA+ RNA from the various cancer file 1). Because the 3SEQ method captures only the 3’ samples allows for the profiling of transcriptional patterns polyadenylated ends of RNA fragments and directionally ofprotein-codingmRNAsaswellasbothknownand sequences the start of the fragments, the resulting sequen- novel lncRNAs, and any other relatively long polyadeny- cing reads cluster into strand-specific peaks when mapped lated transcripts. Traditional RNA sequencing (RNAseq) to the reference human genome. Initial analysis of the has limitations when profiling numerous primary tumor uniquely mapping reads from the 66 libraries (two tumor samples, including the need for high quality RNA, which samples had duplicate libraries) demonstrated that most is difficult to obtain from formalin-fixed clinical speci- sample libraries possessed reads from at least 24,000 of the mens, and the expense associated with deep sequencing of 37,576 RefSeq transcripts, and this number did not many samples. increase significantly with deeper sequencing (Figure S1 in Instead, we chose to profile a diverse set of archived Additional file 2). Raw and processed gene expression data human cancers using 3’-end sequencing for expression are available in GEO [GEO:GSE28866]. Brunner et al. Genome Biology 2012, 13:R75 Page 3 of 13 http://genomebiology.com/2012/13/8/R75

Because we were particularly interested in the expres- peaks approximately 275 bp upstream of the 3’-end of sion patterns of unannotated transcribed regions and known genes (Figure 1a). known lncRNAs, including recently discovered lncRNAs, Although theoretically 3SEQ provides sequence reads we adopted a systematic unbiased analysis approach to against all polyA+ transcripts present in the samples and identify clusters of sequencing reads across the genome, we observed thousands of peaks within protein-coding independent of gene annotations. After pooling reads genes, we focused our analysis on those peaks that over- from all samples, we applied a sliding-window peak-calling lap known lncRNA genes or are situated in intergenic algorithm to identify genomic regions enriched for regions and may correspond to novel transcripts. Four of sequencing reads (see Methods). Our initial analysis the larger recent lncRNA catalogs were combined and revealed 54,511 clusters of sequence reads with at least used as the most current collection of known lncRNA 150 reads across the 66 samples. These peaks, situated transcripts [4,10,11,13]. After the 21,934 peaks from cod- across the genome, had a median length of 196 base pairs ing genes were excluded, 1,065 peaks overlapped known (bp), and represented same-strand clustered sequence lncRNA transcripts. These lncRNAs, most of them iden- reads. Further filtering and increasing the threshold for tified in normal tissues or cancer cell lines and previously peak inclusion resulted in a dataset containing 36,048 unstudied in tumor samples, were expressed in at least peaks with a median enrichment of 74-fold above back- one of the solid tumors in our survey (Tables S2 and S3 ground in at least one sample (see Methods). in Additional file 1). The vast majority of the remaining We first compared these peaks with a collection of peaks could be ascribed to either exons or introns within known protein-coding genes (compiled from RefSeq and other known transcripts including non-coding tran- University of California, Santa Cruz (UCSC) known scripts, or regions within 5 kb downsteam of known tran- genes). The genomic coordinates for 21,934 of the peaks scripts that likely represented alternative 3’-ends for at least partially overlapped exons from annotated pro- known genes (Table S2 in Additional file 1). Following tein-coding genes. When the peaks were compared with removal of peaks within or near known transcripts, we a broader definition of known transcripts (coding and defined a list of 1,071 intergenic peaks greater than 5 kb non-coding from RefSeq, UCSC known genes, GEN- from the nearest known transcript (Tables S2 and S4 in CODE and UCSC all_mRNA), 23,449 peaks were situated Additional file 1). These intergenic peaks represented within the 3’-mostexonofatleastoneannotatedtran- novel transcribed regions at what could be the 3’-ends of script. When non-3’ exons were included in the analysis, new independent transcripts. Based on the characteristics 25,258 peaks were determined to be exonic. Because the of the peaks from known genes, we infer that the inter- 3SEQ sequence reads were from RNA fragments ran- genic peaks represented sequenced regions located ging between 200 and 300 bp in length, it was not approximately 275 bp upstream of the 3’-end of novel unexpected to observe a tight distribution of exonic polyA+ RNA transcripts.

(a) Peaks in exons of known genes (b) 2.0 2000

1.5 1500

1000 1.0 Frequency

500 Expression 0.5 Average Normalized Average 0 0 -3 -2 -1 0 1 coding lncRNA novel Distance from 3’ end (kb) (n=21,934) (n=1,065) (n=1,071)

Figure 1 Distribution and mean expression of 3SEQ peaks. (a) The distribution plot shows a tight cluster of exonic peaks approximately 275 bp upstream of the 3’ end of known genes (n = 29,024 peaks in known exons; distances are based on genomic coordinates and not the spliced transcriptome). (b) Boxplots show the distribution of mean expression levels for each peak by peak category. Raw sequence count data was normalized by dividing each value by the sample mean, and then taking the square root. Boxes range from the first to the third quartiles. Median expression is marked with a line. Mean values are 0.799, 0.407 and 0.324 for coding, lncRNA and novel transcripts, respectively. Plots are truncated to show mean expression values less than 2. Outlier peaks show expression as high as 17.9. Brunner et al. Genome Biology 2012, 13:R75 Page 4 of 13 http://genomebiology.com/2012/13/8/R75

Although knowledge of the novel transcripts’ start, stop selected to represent each diagnosis, with roughly half of and exon structure is not necessary for analysis of the the samples representing carcinomas (adenocarcinomas, 3SEQ peak expression data, we performed a preliminary squamous cell carcinomas and others) and the other half analysis of the transcript structure of the novel intergenic representing more rare sarcomas. Because the 3SEQ peaks peaks by comparing them with predicted transcripts were identified based on pooled reads from all of the can- assembled from a large, publically available RNAseq data- cer samples, the peaks could represent RNAs expressed in set from 16 normal human tissues. Scripture was run on any number of the samples. To assess sample expression Illumina’s BodyMap RNAseq dataset to predict assembled levels, we tallied the total number of 3SEQ reads within transcripts from the RNAseq reads [6,23]. Because this each peak for each sample, and the raw read counts were reference set was assembled using only a handful of nor- normalized between samples to account for the sequen- mal samples, it was not surprising that only 402 (37.5%) of cing depth of each library (see Methods). the novel intergenic 3SEQ peaks overlapped with novel Globally, the average expression levels of the lncRNA transcripts from the normal dataset. Interestingly, how- peaks and the intergenic peaks were reduced relative to ever, only 72 intergenic peaks overlapped BodyMap multi- the expression of coding genes (1.96-fold difference for exon transcripts with predicted splice sites. Similar to lncRNAs and 2.46-fold difference for the novel transcripts; results described by van Bakel and colleagues [11], 330 P <2.2 × 10-16;Figure1b).AveragelncRNA expression (82.1%) of the overlapping 3SEQ intergenic peaks corre- was slightly higher than average expression of the novel sponded to predicted single-exon transcripts in the Body- transcripts (P <0.0005; Figure 1b). To examine those tran- Map dataset. Similarly, 558 (52%) of our novel peaks scripts with higher, more variable expression levels across overlappedwithESTs,butonly88(8%)overlapwith samples, we selected peaks with a standard deviation spliced ESTs. Single-exon transcripts have not traditionally greater than 0.25 (roughly the median expression level of been classified as lncRNAs, but their polyA+ status sepa- lncRNA and novel transcripts) across the cancer samples. rates them from traditional miRNAs. This focused our analysis to 368 lncRNA peaks and 297 To further evaluate our intergenic transcripts with novel transcribed regions. Strikingly, samples of the same respect to other annotations, we compared our novel diagnosis tended to show similar expression levels, with peaks with a variety of datasets (see Table S4 in Additional various combinations of the cancer types showing expres- file 1 and Methods). Although 269 intergenic peaks (25%) sion across the peaks surveyed (Figure 2). Significant overlapped conserved regions, 588 (55%) contained at expression differences were shown between the carcino- least some repetitive sequences. A handful of the inter- mas and sarcomas for 483 of the lncRNAs and 394 of the genic peaks corresponded with known genes from GEN- novel peaks (false discovery rate (FDR) <0.05). Other CODE that are absent from the RefSeq and UCSC gene peaks showed specificity for various combinations of the annotations: 23 (2.1%) and 13 (1.2%) intergenic peaks adenocarcinomas or the squamous cell carcinomas. Many overlapped GENCODE transcripts on the sense and anti- peaks, however, appeared to show even more specificity, sense strands, respectively. Another 23 (2.1%) are within with expression in only a small fraction of the cancer introns of GenBank mRNAs. Likewise, a small number types. (25, 2.3%) of the intergenic peaks overlapped ribosomal To better characterize this specificity and identify RNA pseudogenes, while another 31 (2.9%) overlapped which of the above peaks was differentially expressed in other pseudogenes. None of the intergenic peaks over- at least one diagnostic subtype, we performed a two-class lapped annotated small nucleolar RNAs or miRNAs, but significance analysis using significance analysis of micro- 66 of the peaks (6.2%) were antisense to other 3SEQ arrays (SAM) [24] for each of the 17 cancer classes. This peaks. Even though the classification and functional status allowed us to identify peaks differentially expressed in of most of the 3SEQ intergenic peaks remained unclear, each diagnostic class versus all other classes. These ana- we were interested in determining if any of these tran- lyses revealed 267 lncRNAs (72.6%) and 217 novel peaks scripts were associated with specific cancer classes and (73.1%) were differentially expressed in at least one can- could be used as cancer biomarkers. cer type with an FDR of 0.05 (Figure 2 and Tables S3, S4 and S5 in Additional file 1). Of these peaks showing sig- Expression analysis of lncRNAs and novel transcribed nificant differential expression, most peaks (202 lncRNAs regions in solid tumors and 189 novel peaks) were significantly expressed in only After identifying lncRNAs and new transcripts expressed one of the 17 diagnostic classes, making these peaks can- in our panel of tumors, we explored how these RNAs were cer-type specific with respect to the other cancers expressed across the different types of cancer surveyed. surveyed. The sequencing analysis comprised 66 libraries from 64 To confirm the class-specific differential expression independent cancer samples representing 17 different measured by 3SEQ for a subset of 3SEQ peaks, we per- diagnostic classes. Between two and seven samples were formed quantitative RT-PCR (qRT-PCR) on 23 peaks in a Brunner et al. Genome Biology 2012, 13:R75 Page 5 of 13 http://genomebiology.com/2012/13/8/R75

Figure 2 Variably expressed lncRNAs and novel intergenic transcripts. Heatmaps illustrating the 368 lncRNAs (left) and 297 novel transcripts (right) with variable expression as defined by standard deviation >0.25 across 66 cancer samples. Transcripts with differential expression in at least one of the 17 two-class SAM analyses (top) were clustered separately from those transcripts not significantly differentially expressed (bottom). Normalized read data were median centered, hierarchically clustered and plotted on a low (green) to high (red) heatmap. Samples are grouped by cancer type; the number in parentheses indicates the number of libraries for each cancer type. Red and pink is used for libraries made from adenocarcinomas of breast, lung, colon and prostate, as well as normal breast, lung and colon. Orange and yellow show squamous cell carcinomas of the head and neck, skin, lung and other carcinomas: papillary urothelial carcinoma and nasopharyngeal carcinoma. Green indicates sarcomas with known translocations: endometrial stromal sarcoma, Ewing’s sarcoma, extraskeletal myxoid chondrosarcoma, synovial sarcoma and myxoid liposarcoma. Blue shows other sarcomas: gastrointestinal stromal tumor, leiomyosarcoma and dedifferentiated liposarcoma. Normal samples and cancer samples were combined for hierarchical clustering, but are displayed separately for clarity. Samples are ordered according to Table S1 in Additional file 1. Breast, breast invasive ductal carcinoma; colon, colon adenocarcinoma; DDLPS, dedifferentiated liposarcoma; EMC, extraskeletal myxoid chondrosarcoma; ESS, endometrial stromal sarcoma; EWS, Ewing’s sarcoma; GIST, gastrointestinal stromal tumor; HN SCC, head and neck squamous cell carcinoma; LMS, leiomyosarcoma; Lung, lung adenocarcinoma; Lung SCC, lung squamous cell carcinoma; MLS, myxoid liposarcoma; NPC, nasopharyngeal carcinoma; prostate, prostate adenocarcinoma; PUC, papillary urothelial carcinoma; Skin SCC, skin squamous cell carcinoma; SS, synovial sarcoma. Brunner et al. Genome Biology 2012, 13:R75 Page 6 of 13 http://genomebiology.com/2012/13/8/R75

subset of the samples profiled by 3SEQ. In this qRT-PCR However, average expression levels for the transcripts analysis, 19 of the 23 peaks had elevated expression in across the cancer and normal samples tended to be com- samples from the diagnostic subtype that previously had parable (R = 0.9635 for lncRNAs and R = 0.9329 for novel shown elevated expression by3SEQ(FigureS2inAddi- peaks; see Figure S3 in Additional file 2). Because this ana- tional file 2). lysis did not include normal tissue-of-origin matches for several of the cancer types, we were interested in further lncRNA and novel transcript expression in normal tissues evaluating expression levels for transcripts in the subset of Although we observed many lncRNAs and novel tran- cancers with corresponding normal tissues (breast, colon scripts expressed specifically in some cancer types with and lung). Specifically, were the transcripts identified as respect to others, it was unclear if the transcripts were highly expressed and specific for cancer subtypes specific to cancer or instead were normally expressed expressed in their normal tissue counterparts? Indeed, for transcripts, likely possessing a function within normal tis- those peaks differentially expressed in cancer (13 in breast, sue biology, that were also expressed in cancers. To 10 in colon, 6 in lung), although the average expression of address how the cancer-expressed lncRNAs and novel the peaks tended to be higher in cancer tissues, most of transcribed peaks identified above were expressed in nor- the peaks showed evidence of expression in normal tissues mal tissues, we performed 3SEQ on 27 normal samples, (Figure 3). Interestingly, the breast cancer peaks showed including 10 fetal samples from bowel, lung and kidney more comparable expression levels between cancer and at 13 to 15 weeks and 17 adult tissues from normal normal tissue, whereas the colon cancer peaks were breast, colon, kidney, lung and uterus. The number of expressed at lower levels in normal adult tissue. Two of 3SEQ reads was determined for each of the previously the colon cancer peaks showed distinct normal fetal identified peaks, and the normalized expression data expression, whereas none of the lung-specific peaks were analyzed. Just as with expression patterns across the showed obvious fetal expression. Two of the lung cancer cancer subtypes, lncRNA and novel peak expression peaks were highly expressed in normal adult lung tissue. showed a number of patterns in the normal tissue sam- This case study of a small number of normal counterparts ples: some did not appear to be expressed or were showed that, although expression patterns between cancer expressed at very low levels, some were expressed across and normal tissue may vary, the cancer peaks identified in most normal samples, some were limited to adult sam- this study tended to show some level of normal expres- ples, some to fetal samples, and some showed more tis- sion. These findings suggest that many of these transcripts sue-specific expression patterns (Figure 2). may have functional roles in normal tissues and their Although most of the peaks showed some level of expression is maintained in the corresponding cancers. expression in the normal samples (929 lncRNAs (87%) and 867 novel peaks (81%) showed greater than 0.4 A case study of breast-specific RNAs expression in at least one of the normal samples), generally We wished to confirm and extend our observations about the maximally expressing cancer sample showed higher lncRNA expression patterns obtained using 3SEQ by expression levels than the maximally expressing normal exploring the expression of one of our novel transcripts sample (888 lncRNAs (83%) and 955 novel peaks (89%)). using an alternative expression method, RNA in situ

Figure 3 Significant lncRNAs and novel transcripts in breast, lung and colon cancer. LncRNAs and novel transcripts significantly differentially expressed in (a) breast (n = 13), (b) colon (n = 10) and (c) lung (n = 6) cancers. Normalized, uncentered read data for cancer and normal samples were hierarchically clustered and plotted on low (black) to high (red) heatmaps. Brunner et al. Genome Biology 2012, 13:R75 Page 7 of 13 http://genomebiology.com/2012/13/8/R75

hybridization. Our 3SEQ profiling efforts relied on RNA RNA in situ hybridization signal representing peak 13741 collected from the various cell types comprising heteroge- expression for both invasive breast cancer and normal neous tissue and tumor samples. We hypothesized that breast tissue was confined to the epithelial cells within the lncRNA expression is cell-type specific within normal tis- samples (Figure 4c, bottom left). Within the normal breast sues and that this spatial specificity will be recapitulated in tissues examined, expression varied and was present in 5% the cancer counterparts. RNA in situ hybridization per- to 30% of cells. Strong staining was observed in 127 out of formed on tissue microarrays allows a quick method for the 197 invasive breast carcinomas, as defined by greater assessing expression of an RNA of interest across numer- than 30% of cells staining, though 81.9% (104 out of 127) ous samples in parallel as well as providing additional showed staining in the vast majority of the tumor cells information about how the RNA is expressed across cell (60% to 95% of tumor cells staining). The probe signal for types and how it is localized within cells. To this end, we all of the positively staining samples appeared localized to identified an intriguing set of three breast-specific tran- the nucleus (Figure 4c, bottom left), which is distinct from scripts (two novel and one lncRNA) showing particularly the cytoplasmic probe signal typically observed when RNA high expression levels and located within a 200 kb inter- in situ hybridization is performed on protein-coding genic region on 10, approximately 50 kb mRNAs. Instead, this nuclear pattern follows the trend for downstream of the nearest protein-coding gene, the breast lncRNAs and other regulatory RNAs, which, with limited gene ANKRD30A (also known as NY-BR-1;Figure4a evidence, tend to show nuclear localization [26]. [25]). An analysis of 3SEQ data from these three peaks, Although the structure and function of the transcript 13741, 13742 and 13743, showed that they were variably corresponding with peak 13741 remains to be deter- expressed in both normal breast and breast cancer tissue, mined, we were struck by this transcript’s potential as a but were not expressed in cancers from other tissue types breast cancer biomarker because of the specificity of its (Figure S4a in Additional file 2). Because peak 13741 was expression across the more than 800 samples surveyed. the most highly expressed of the three 3SEQ peaks in the Specifically, we found that significant expression was region, we further studied this RNA. Peak 13741 expres- only present in breast tissue, with peak 13741 showing sion was examined in a panel of 11 breast cancer cell lines no significant expression in other neoplasms or normal using qRT-PCR (Figure S4b in Additional file 2), and tissues. Because a range of breast cancer subtypes were northern blot analysis was used to determine that peak present on the tissue microarrays, we were able to deter- 13741 corresponded with a long transcript, greater than mine that the expression of peak 13741 was positively 5 kb (Figure S4c in Additional file 2). associated with low tumor grade (P =6.1×10-8), estro- The genomic coordinates of the 13741 transcript were gen receptor (ER) expression (Figure 4c, top panels, P = unclear because its 3SEQ peak did not overlap any 6.4 × 10-15) and progesterone receptor (PR) expression annotated exons in known genes or pseudogenes. Peak (P =6.8×10-14) in breast cancers and showed no signifi- 13742 was annotated as a known lncRNA in our analy- cant association with metastasis or human epidermal sis because of its overlap with GENCODE transcript growth factor receptor 2 (HER2) expression (Table 1). ENSG00000235687 (Figure 4a). Peak 13742 also corre- That the expression was correlated with ER+ or PR+ sponded with a spliced transcript predicted by Scripture breast cancers and was largely absent from ER- tumors using the BodyMap normal breast dataset. To further suggests that the transcript corresponding to peak 13741 study the transcripts within this , we performed may function specifically within normal breast tissue and transcript prediction using RNAseq data we generated low grade ER+ or PR+ breast cancer. Thus, it may be from six breast cancer cell lines expressing peak 13741 useful not only as a biomarker to distinguish breast can- and identified a number of predicted transcripts in this cer subtypes but also as an interesting, possibly regula- region (Figure S5 in Additional file 2; [GEO:GSE28866]). tory RNA in normal breast and breast cancer biology. Interestingly, most of the spliced transcripts corresponded with the 13742 transcript, leaving us with only short sin- Discussion gle-exon transcript predictions (some of them within Recent studies have demonstrated the importance of introns of the 13742 transcript isoforms) to explain peaks lncRNAs in regulating embryogenesis and gene expres- 13741 and 13743. Additionally, the genomic region con- sion, and there is considerable evidence that this class of taining 13741 did not appear to be evolutionarily con- RNAs plays important roles in cancer [14,18,19,27,28]. served, nor did it span any repetitive elements. However, the definition of lncRNAs is still evolving and Though the full transcript structure for peak 13741 even the number of lncRNAs encoded by the human gen- remains to be determined, we used the genomic sequence ome is unclear, making it difficult to survey lncRNA spanning peak 13741 to develop an RNA in situ hybridiza- expression comprehensively. We surveyed lncRNA expres- tion probe (Figure 4b) and evaluated the expression of sion in cancer by using 3SEQ, a 3’-end targeted RNAseq peak 13741 on several cancer tissue microarrays. The method, to determine how lncRNAs are expressed within Brunner et al. Genome Biology 2012, 13:R75 Page 8 of 13 http://genomebiology.com/2012/13/8/R75

Figure 4 A case study of novel, breast-specific peak 13741. (a) Browser shot showing expression for a breast cancer sample in the region downstream of ANKRD30A on chromosome 10. The first two tracks show the known genes and RNAs in this locus. The third track shows the peaks identified in this study, including three highly expressed peaks: novel 13741, lncRNA 13742 and novel 13743. The fourth track shows the raw 3SEQ reads (transcript abundance levels) on the forward strand (blue) and reverse strand (red). The final tracks show the longest transcripts that overlappeak 13742, a Scripture-assembled transcript produced using normal breast RNAseq reads from the Illumina BodyMap data set and GENCODE lncRNA ENSG00000235687. (b) Zoom-in browser shot of peak 13741 on chromosome 10 shows the location of the RNA in situ hybridization probe (top track) as well as the raw sequence reads for one breast cancer sample (bottom track). This peak illustrates the shape of a typical 3SEQ peak from a high- expressing transcript. (c) ER staining on an ER+ breast cancer (top left) and an ER-breast cancer (top right). RNA in situ hybridization for peak 13741 performed on the same ER+ breast cancer specimen (bottom left) and same ER- breast cancer (bottom right). Specimens were matched but ER and 13741 stains used different tissue slices. All images are at 400× magnification. 3SEQ, 3’-end sequencing for expression quantification; chr10, chromosome 10; ER, estrogen receptor; lncRNA, long non-coding RNA. Brunner et al. Genome Biology 2012, 13:R75 Page 9 of 13 http://genomebiology.com/2012/13/8/R75

Table 1 Novel, breast-specific peak 13741 is associated the lncRNAs and the >21,000 protein-coding mRNAs. We with estrogen-receptor-positive and progesterone- therefore inferred that the novel intergenic peaks were receptor-positive cells and Grade 1 breast cancer derived from long (>200 bp) polyA+ transcripts; however, ER+ ER- PR+ PR- Grade 1 Grade 3 overlap analysis with Scripture-predicted transcripts sug- (n = 146) (n = 33) (n = 129) (n = 50) (n = 42) (n = 41) gested that some of these novel transcripts were single- 13741+ 122 5 112 15 37 12 exon transcripts. Although both the levels and patterns of (83.6%) (15.2%) (86.8%) (30.0%) (88.1%) (29.3%) gene expression across the novel intergenic peaks 13741- 24 28 17 35 5 29 appeared similar to the lncRNA peaks, suggesting that (16.4%) (84.8%) (13.2%) (70.0%) (11.9%) (70.7%) these novel peaks may be new lncRNAs, these peaks also RNA in situ hybridization results for analyses correlating the expression of resembled the short intergenic transcripts described by peak 13741. Numbers of samples are reported. Each test was significant by the Kruskal-Wallis rank sum test. +, staining in >30% of cells; -, staining in others [11]. Further characterization of these novel tran- <10% of cells; ER, estrogen receptor; PR, progesterone receptor. scripts will be required to determine their structure and functional importance within cancer. cancers and their normal counterparts, and to identify Profiling of the non-coding transcriptome in primary novel intergenic transcripts in cancers. human cancers has been limited, despite the evidence that Several groups have compiled catalogs of lncRNAs, lncRNAs are important players in gene regulation. We ranging from lists of a few hundred named lncRNAs to therefore performed unbiased expression profiling on a lists of up to 15,000 predicted lncRNAs [12,13]. We com- diverse panel of solid cancers, representing 17 diagnostic pared our peaks with four of the larger lncRNA catalogs. subclasses of adenocarcinomas, squamous cell carcinomas Three of these catalogs themselves include compilations and sarcomas. We identified 267 lncRNAs and 217 novel of previous catalogs and datasets. Ulitsky and colleagues transcribed regions that were differentially expressed in at assembled a set of 2,458 lncRNAs by computationally fil- least one of the cancer types. Differential expression has tering transcripts from the RefSeq, Ensembl and UCSC been used as an important filtering parameter for identify- gene lists [10]. Cabili and colleagues assembled a catalog ing genes that are not only correlated with but potentially using 3,376 lncRNAs from RefSeq, GENCODE and functionally important in various cancer types. Despite the UCSC and adding 4,819 lncRNAs identified from among limited number of samples representing each diagnosis, the predicted transcripts assembled from the Human we were struck by the extent of specificity of some of Body Map RNAseq dataset [4]. Versions 7 to 10 of the these transcripts. Though many of the transcripts GENCODE lncRNA annotation set, assembled using appeared to show some expression in the normal samples manual curation, computational analysis and targeted examined, these differentially expressed transcripts are experimental validation of GENCODE transcripts com- intriguing candidates for both future study and more piled by the ENCODE project, comprise more than immediately for biomarker analyses. 15,000 transcripts representing more than 9,000 non- Because previous cancer studies performed on lncRNAs overlapping lncRNA sequences (Derrien et al., sub- have been small-scale studies using a less comprehensive mitted) [5,13,29]. van Bakel and colleagues used a normal collection of cancer samples, it has been difficult to make human RNAseq dataset to predict approximately 1,250 any generalizations about whether the expression of novel intergenic lncRNA transcripts [11]. Because the lncRNAs differs between cancer and normal tissue. We various catalogs were based on different datasets and took the opportunity to address this question by analyzing employed different criteria for lncRNA prediction, per- 3SEQ expression for the cancer-expressed lncRNA peaks haps it is not surprising that they did not show high per- and novel transcripts within 27 normal tissues. Although centages of transcript overlap. In addition, these studies our analysis was limited to only a handful of types of nor- were based on transcripts primarily isolated from normal mal tissues, we observed that the lncRNAs profiled in this tissues and cell lines; no study to date has evaluated these study and expressed in cancer tended to have at least newest annotated lncRNAs for expression in human can- some expression in the normal samples. Therefore, the cers.Inthisstudy,weusedtheunionofallofthe cancer-expressed lncRNAs may not to be specific to can- lncRNAs from the four catalogs described above as the cer, but most likely have some function in normal tissues. reference set for comparisonandshowedthat1,065of It also does not appear that lncRNAs, as a class, have sig- the 3SEQ peaks expressed in our cancer panel overlap nificantly different expression levels between cancer and previously predicted lncRNAs. This allowed us to deter- normal tissue. More striking, however, was that many can- mine for the first time the expression patterns for many cer-expressed lncRNAs and novel transcripts showed tis- of these lncRNAs within primary tumor samples. sue-specific expression patterns and, for some cancers, the We also found 1,071 novel intergenic 3SEQ peaks normal tissue-of-origin expression patterns appeared to be expressed in our cancer panel. These RNAs were isolated maintained in the associated cancers. Further research is using the same polyA+ and size selection used to capture needed to determine the function and importance of the Brunner et al. Genome Biology 2012, 13:R75 Page 10 of 13 http://genomebiology.com/2012/13/8/R75

tissue-specific lncRNAs in both cancer and normal acid isolation. (See Table S1 in Additional file 1 for biology. sample numbers and diagnoses.) RNA in situ hybridization allowed us to further charac- terize the expression of a novel breast-specific transcript, RNA isolation and 3SEQ library preparation peak 13741, across hundreds of cancer and normal patient RNA was purified from FFPE slices following deparaffina- samples. We demonstrated that the transcript was not tion with a xylene incubation, ethanol wash and protease only tissue-specific (expressed only in breast and breast digestion, using the RecoverAll Total Nucleic Acid Isola- cancers), it was cell-type specific (expressed in epithelial tion Kit (Ambion/Life Technologies, Austin, TX, USA), breast cells and not the surrounding stromal cells) and as described previously [21]. Isolation of the 3’-ends of had specific cellular localization patterns (nuclear, as mRNA was achieved with an oligodT selection per- opposed to cytoplasmic). All of these observations suggest formed on at least 5 μg total RNA using the Oligotex that this transcript has regulated transcription and may mRNA mini Kit (QIAgen, Valencia, CA, USA). RNA that have a functional role within breast tissue biology. The was not sufficiently fragmented was heat-sheared to a correlated expression of this transcript with ER in breast size of approximately 100 to 200 bp. The polyA+ RNA cancer and the apparent loss of expression of this tran- was then subjected to first and second strand cDNA script in ER- breast cancer suggest that this transcript may synthesis and Illumina library synthesis, as described pre- be functioning within the ER pathway. Further research is viously [21]. (Duplicate 3SEQ libraries were created for necessary to determine the function of the peak 13741 two cancer samples: ESSSTT5520 and LMS STT516.) transcript, because of its strong correlation with ER+ and PR+ breast cancers, but this peak has potential as a useful Sequencing and mapping breast cancer biomarker. 3SEQ libraries were sequenced with Illumina Genome Analyzer IIx machines to obtain a minimum of 1.5 mil- Conclusions lion uniquely mapping 25-base single-end sequence This study represents the first large survey of lncRNA reads. Reads were mapped to hg18 with a two-mismatch expression across a diverse panel of primary human cancer allowance using ELAND (Illumina Inc., San Diego, CA, samples and provides the research community with a valu- USA). Sequence reads were further filtered to remove able resource for cancer-expressed lncRNAs as well as a mapping artifacts caused by ambiguous mapping by collection of novel transcripts expressed within cancers. requiring a posterior probability of at least 0.8 for the We illustrate that 3SEQ can be used for transcriptional best alignment, where posterior probability was calcu- profiling to identify both known lncRNAs and novel tran- lated as the ratio of the likelihood of the best alignment scripts within archived primary tissue samples, and our to the sum of the likelihoods of all alignments. Likelihood RNA in situ hybridization analysis of one such novel RNA, was taken from a binomial model of sequencing errors peak 13741, identified it as an intriguing breast-specific with constant substitution rate 0.01. FASTQ files and fil- RNA that may be important for the differences between tered bed files of hg18-mapped reads are available in ER+ and ER- breast cancers. [GEO:GSE28866].

Methods Peak analysis Tissue samples Detection of genomic regions enriched for sequencing Tumor and normal samples were collected using writ- reads (peak-calling) was performed once, by combining ten informed consent compliant with the Health Insur- filtered sequence reads from all 66 cancer libraries and ance Portability and Accountability Act and approved using a custom Perl script to call genomic regions with by the Stanford University Medical Center institutional greater than 150 reads across all samples. This provided review board. Some of the tissues already existed in tis- a single set of genomic peak coordinates, which allowed sue banks and fell under exemption 4. For this study, for direct comparison of all samples. Because of our we selected 64 formalin-fixed paraffin-embedded (FFPE) interest in identifying peaks expressed in only a fraction cancer samples, along with 27 normal samples, follow- of samples, we initially selected a more permissive overall ing examination of hematoxylin and eosin stains by threshold, corresponding with roughly a two-fold enrich- pathologists RBW and MvdR. Multiple 2 mm-diameter mentabovethatexpectedbychancewhenapplyinga cores were taken from each FFPE block, re-embedded uniform distribution model for reads from all samples in paraffin at a perpendicular orientation, and then across the mappable genome. During this process, clus- sliced into 4-μm-sections for further hematoxylin and ters of sequence reads, each within 20 bp of their nearest eosin examination or 20-μm-sections for nucleic acid same-strand neighbor, were identified and the coordi- extraction. Approximately one hundred 20-μm-slices nates for these genomic regions were defined using the were placed into one to two 1.5 mL tubes for nucleic start coordinates for the left-most and right-most Brunner et al. Genome Biology 2012, 13:R75 Page 11 of 13 http://genomebiology.com/2012/13/8/R75

sequence reads within each cluster. The list of peaks was comparison of the specific cancer type versus all other then filtered to exclude any region that contained fewer cancers was performed to determine differentially than 25 different starting positions for the reads within expressed genes at an FDR <0.05. Using a two-class FDR the region, and in this way required each peak to possess of 0.05 to define differential expression, we calculated an a minimum of 25 unique reads. The initial list of 54,511 overall FDR for identifying a peak as differentially peaks was further filtered to remove peaks antisense to expressed in at least one of the 17 two-class SAM analyses peaks in known genes (determined to be artifacts of (overall FDR = 0.04). LncRNA and novel transcript expres- library preparation). Applying a more stringent threshold sion were hierarchically clustered using the average linkage requiring peaks to possess at least 200 total reads method in Cluster3 and visualized using Java TreeView resulted in a filtered list of 36,048 peaks. Assuming a uni- (Figures 2 and 3). form distribution of reads to calculate expected back- ground levels, we determined that peaks on this list were Quantitative RT-PCR at least 2.5-fold enriched above background when all cDNA was made from the same total RNA used for the samples were considered or at least four-fold enriched 3SEQ libraries in this study using the DyNAmo™ cDNA above background in at least one sample. Most peaks Synthesis Kit (Finnzymes/ThermoScientific, Lafayette, were enriched well above this level (median = 74-fold CO, USA). Primers were designed against 23 peak enrichment; see Tables S3 and S4 in Additional file 1 for sequences using Primer3 and qRT-PCR was performed fold-enrichment values). Peaks were annotated as coding using the SYBR green method on a StepOnePlus instru- genes if they overlapped any exons within annotated cod- ment (Applied Biosystems/Life Technologies, Foster City, ing genes from the RefSeq and UCSC known gene data- CA, USA) according to the manufacturer’sinstructions. sets. Peaks were then overlapped with genomic Data were normalized using a series of five housekeeping coordinates for the lncRNA transcripts from the follow- genes (ARL8B, CTBP1, CUL1, PAPOLA and ACTB). For ing datasets: Ulitsky et al. (n = 2,642) [10]; Cabili et al. qRT-PCR on breast cancer cell lines, cDNA was ampli- (n = 14,309) [4]; GENCODE lncRNA annotation sets ver- fied using primers against 13741 as described above and sion 7 (n = 15,207), version 9 (n = 18,878) and version 10 normalized using ACTB. See Table S6 in Additional file 1 (n = 17,547) [5,13,29]; and van Bakel et al. (n = 1,252) for primer sequences. [11]. The remaining peaks were determined to be novel intergenic peaks if they did not overlap promoter regions Breast cancer cell lines (within 5 kb), exons, introns or downstream regions HCC1419, HCC1500, UACC-812, ZR-75-30, CAMA-1, (within 5 kb) of any known genes from the RefSeq, MCF7, MDA-MB-231, MDA-MB-361, MDA-MB-436 and UCSC Known Genes, GENCODE or all_mRNA (UCSC MDA-MB-468 were obtained from American Type Cul- genome browser) datasets. Overlap analysis was used to ture Collection (Manassas, VA, USA), whereas SUM 44PE further characterize the novel transcripts using several and SUM 52PE were a kind gift from Dr Stephen Ethier. datasets downloaded from the UCSC genome browser, Cells were grown to 70% to 80% confluence, and total including all_ests, spliced_ests, phastConsElements17way RNA was isolated using the QIAgen RNeasy Mini Kit (17way conservation), phastConsElements28wayPlac- (QIAgen). Mammal (28way conservation), rmsk (RepeaterMasker), wgRna (small nucleolar RNAs, miRNAs), GENCODE RNAseq on breast cancer cell lines genes and rnaGene. Pseudogenes were obtained from Six breast cancer cell lines were used for transcriptome Pseudogene.org [30,31]. Files with raw counts of the sequencing. RNAseq libraries were prepared using the sequence reads within each 3SEQ peak, as well as the Illumina mRNA-Seq Sample Prep Kit (Illumina, Inc.) as normalized peak data, are available in [GEO:GSE28866]. described previously [32]. Illumina Genome Analyzer IIx machines were then used to sequence paired-end 36mer Gene expression analysis reads, which were subsequently mapped to hg18 using The number of sequence reads located within each peak TopHat and assembled into predicted transcripts using was determined for each sample. This raw read count data Scripture [6]. RNAseq data for the breast cancer cell lines were then normalized for each 3SEQ peak using the are available in [GEO:GSE28866]. sequencing depth of each sample by scaling by the mean value of each sample. Data were further compressed to Northern blot reduce outliers by taking the square root of each value. Five micrograms of total RNA from UACC-812 and Correlations between technical replicate libraries were MDA-MB-436 were used as directed in the NorthernMax determined to be r2 = 0.91 and 0.95. Differential expres- Kit (Life Technologies, Grand Island, NY, USA). To gener- sion was determined using a series of two-class SAM ana- ate radioactive RNA probes, primers with SP6 polymerase lyses [24]. For each of the 17 cancer types, a two-class binding sites were designed to amplify genomic DNA at Brunner et al. Genome Biology 2012, 13:R75 Page 12 of 13 http://genomebiology.com/2012/13/8/R75

b ACTB the loci for 13741 and -actin ( ) in a strand-specific Additional file 2: Supplemental figures 1-5. Figure S1 plots manner (see Table S7 in Additional file 1 for primer sequencing depth versus RefSeq transcripts detected by 3SEQ. Figure S2 sequences). In vitro transcription reactions were carried shows differential expression for the 23 peaks examined by qRT-PCR. 32 Figure S3 plots the mean expression in cancer versus the mean out using [ - P]-UTP as directed in the SP6 MAXIscript expression in normal samples. Figure S4 shows expression of peak 13741 Kit (Life Technologies). Following 13741 probe hybridiza- by 3SEQ, qRT-PCR and northern blot. Figure S5 is a browser shot tion, the blot was stripped and re-probed for b-actin. The showing the predicted breast transcripts near peak 13741. northern blot was exposed to maximum sensitivity film at -80°C with an intensifier screen for 24 hours for the 13741 hybridization and maximum resolution film for 2 hours Abbreviations for the b-actin hybridization. 3SEQ: 3’-end sequencing for expression quantification; bp: base pairs; ER: estrogen receptor; EST: expressed sequence tag; FDR: false discovery rate; FFPE: formalin-fixed paraffin-embedded; lncRNA: long non-coding RNA; RNA in situ hybridization miRNA: microRNA; PR: progesterone receptor; qRT-PCR: quantitative reverse The RNA in situ hybridization probe for peak 13741 was transcription polymerase chain reaction; RNAseq: RNA sequencing; SAM: designed against chr10:37,634,173-37,634,625 (hg18) significance analysis of microarrays; UCSC: University of California: Santa Cruz. using primer 5’-TTGGAAAGCCAAATTGTTGA-3’ and Authors’ contributions the T7 promoter-tagged primer 5’-CTAATACGACTCA ALB, AHB and RBW designed the study, analyzed the data, and wrote the CTATAGGGTGTTTGTGTTCCCCCATTTT-3’.The paper. ALB, BE and TG conducted the qRT-PCR experiments. RTS scored the RNA in situ hybridization. SXZ, RL, KM and SV processed the tissue blocks, digoxigenin-labeled antisense RNA probe was generated purified RNA and built the 3SEQ libraries. RL also designed and performed by PCR amplification and in vitro transcription using the the in situ hybridization experiments. XG, JWF, DMW and RT assisted with data processing, normalization and analysis. JRP and CPG contributed RNA DIG RNA labeling kit and T7 polymerase according to as well as RNAseq data from the breast cancer cell lines. RAF performed the the manufacturer’s protocol (Roche Diagnostics, Indiana- northern blot. HYC and MvdR assisted with designing the study and writing polis, IN, USA). RNA in situ hybridization using the the paper. All authors read and approved the final manuscript. 13741 probe on a tissue microarray containing 197 breast Competing interests cancers with clinical outcome data was performed as The authors declare that they have no competing interests. described previously [33]. Controls included known negative tissue types and other probes that were non- Acknowledgements We thank Ziming Weng, Phil Lacroute, Ghia Euskirchen, Lucia Ramirez, Nick reactive. Additional tissue microarrays also tested for Addleman and Keith Bettinger for help with Illumina sequencing and 13741 staining included ductal carcinoma in situ (n = mapping; Tiffany Hung, Yue Wan and Grace Zheng for helpful discussion on n lncRNAs; and Arend Sidow, Gavin Sherlock and Pat Brown for helpful 253), normal breast ( = 30) and a collection of other suggestions on study design, results and interpretation. Work was supported normal specimens (n = 17) including skin, adrenal gland, by grants from the National Institutes of Health (R01-CA129927, R01- kidney, placenta, gallbladder, ovary, prostate, pancreas, CA112270 and R01-CA118750), National Leiomyosarcoma Foundation, salivary gland and thymus. Neoplasms tested for 13741 Leiomyosarcoma Direct Research, Life Raft and California Breast Cancer Research Program (15NB-0156 and 17IB-0038). staining (n = 402) were from esophagus, liver, stomach, pancreas, kidney, colon, ovary, bladder, testes, skin, pros- Author details 1Department of Pathology, Stanford University School of Medicine, 269 tate, lung and uterus. Positive signal was defined as stain- 2 Campus Drive, Stanford, CA 94305-5324, USA. Department of Pathology, ing in at least 30% of tumor cells; staining in less than Beth Israel Deaconess Medical Center, Harvard Medical School, 330 Brookline 10% of tumor cells within a case was considered negative. Avenue, Boston, MA 02215, USA. 3Department of Genetics, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305-5120, Several metrics were tested for association with expres- USA. 4Department of Biostatistics, University of Washington, 1705 NE Pacific sion of peak 13741, including ER expression, PR expres- Street, Seattle, WA 98195-7232, USA. 5Program in Cancer Biology, Stanford sion, tumor grade, human epidermal growth factor University School of Medicine, 269 Campus Drive, Stanford, CA 94305-5456, 6 receptor 2 expression and the presence of metastasis using USA. Howard Hughes Medical Institute and Program in Epithelial Biology, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA the Kruskal-Wallis rank sum test. Only Grade 1 and 3 94305, USA. 7Department of Statistics, Stanford University, 390 Serra Mall, tumors were used in the association with peak 13741. Stanford, CA 94305-4065, USA. 8Department of Health Research and Policy, Stanford University School of Medicine, 269 Campus Drive, Stanford, CA 94305-5405, USA. Additional material Received: 20 March 2012 Revised: 12 June 2012 Accepted: 28 August 2012 Published: 28 August 2012 Additional file 1: Supplemental tables 1-7. Table S1 shows the sequencing statistics for the 3SEQ libraries. Table S2 shows the numbers of 3SEQ peaks overlapping lncRNAs and other annotation classes. Table References ’ S3 lists the 1,065 known lncRNAs. Table S4 lists the 1,071 novel 1. Nielsen TO, West RB, Linn SC, Alter O, Knowling MA, O Connell JX, Zhu S, intergenic peaks. Table S5 shows the number of peaks differentially Fero M, Sherlock G, Pollack JR, Brown PO, Botstein D, van de Rijn M: expressed in each diagnostic class. Table S6 includes primer sequences Molecular characterisation of soft tissue tumours: a gene expression used in qRT-PCR experiments. Table S7 includes primer sequences used study. Lancet 2002, 359:1301-1307. for the northern blot probes. 2. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Brunner et al. Genome Biology 2012, 13:R75 Page 13 of 13 http://genomebiology.com/2012/13/8/R75

Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular Huntsman DG, Dal Cin P, Debiec-Rychter M, Nucci MR, Fletcher JA: 14-3-3 portraits of human breast tumours. Nature 2000, 406:747-752. fusion oncogenes in high-grade endometrial stromal sarcoma. Proc Natl 3. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Acad Sci USA 2012, 109:929-934. Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, 23. Illumina Human Body Map 2.0 Project. [http://www.ebi.ac.uk/arrayexpress/ Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, browse.html?keywords=E-MTAB-513]. Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, 24. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays Byrd JC, Botstein D, Brown PO, et al: Distinct types of diffuse large B-cell applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, lymphoma identified by gene expression profiling. Nature 2000, 98:5116-5121. 403:503-511. 25. Jager D, Stockert E, Gure AO, Scanlan MJ, Karbach J, Jager E, Knuth A, 4. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL: Old LJ, Chen YT: Identification of a tissue-specific putative transcription Integrative annotation of human large intergenic noncoding RNAs factor in breast tissue by serological screening of a breast cancer library. reveals global properties and specific subclasses. Genes Dev 2011, Cancer Res 2001, 61:2055-2061. 25:1915-1927. 26. Clark MB, Mattick JS: Long noncoding RNAs in cell biology. Semin Cell Dev 5. Derrien T, Guigo R, Johnson R: The long non-coding RNAs: a new (p)layer Biol 2011. in the ‘dark matter’. Front Genet 2012, 2:107. 27. Pauli A, Rinn JL, Schier AF: Non-coding RNAs as regulators of 6. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, embryogenesis. Nat Rev Genet 2011, 12:136-149. Koziol MJ, Gnirke A, Nusbaum C, Rinn JL, Lander ES, Regev A: Ab initio 28. Ponting CP, Oliver PL, Reik W: Evolution and functions of long noncoding reconstruction of cell type-specific transcriptomes in mouse reveals the RNAs. Cell 2009, 136:629-641. conserved multi-exonic structure of lincRNAs. Nat Biotechnol 2010, 29. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, 28:503-510. Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, 7. Jia H, Osak M, Bogu GK, Stanton LW, Johnson R, Lipovich L: Genome-wide Antonarakis SE, Guigo R: GENCODE: producing a reference annotation for computational identification and manual annotation of human long ENCODE. Genome Biol 2006, 7(Suppl 1):S4, 1-9. noncoding RNA genes. RNA 2010, 16:1478-1487. 30. Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, 8. Khalil AM, Guttman M, Huarte M, Garber M, Raj A, Rivea Morales D, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei CL, Gingeras TR, Thomas K, Presser A, Bernstein BE, van Oudenaarden A, Regev A, Lander ES, Guigo R, Harrow J, Gerstein MB: Pseudogenes in the ENCODE regions: Rinn JL: Many human large intergenic noncoding RNAs associate with consensus annotation, analysis of transcription, and evolution. Genome chromatin-modifying complexes and affect gene expression. Proc Natl Res 2007, 17:839-851. Acad Sci USA 2009, 106:11667-11672. 31. Pseudogene dataset. [http://www.pseudogene.org/]. 9. Orom UA, Derrien T, Beringer M, Gumireddy K, Gardini A, Bussotti G, Lai F, 32. Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, Zytnicki M, Notredame C, Huang Q, Guigo R, Shiekhattar R: Long Khrebtukova I, Barrette TR, Grasso C, Yu J, Lonigro RJ, Schroth G, Kumar- noncoding RNAs with enhancer-like function in human cells. Cell 2010, Sinha C, Chinnaiyan AM: Chimeric transcript discovery by paired-end 143:46-58. transcriptome sequencing. Proc Natl Acad Sci USA 2009, 106:12353-12358. 10. Ulitsky I, Shkumatava A, Jan CH, Sive H, Bartel DP: Conserved function of 33. West RB, Corless CL, Chen X, Rubin BP, Subramanian S, Montgomery K, lincRNAs in vertebrate embryonic development despite rapid sequence Zhu S, Ball CA, Nielsen TO, Patel R, Goldblum JR, Brown PO, Heinrich MC, evolution. Cell 2011, 147:1537-1550. van de Rijn M: The novel marker, DOG1, is expressed ubiquitously in 11. van Bakel H, Nislow C, Blencowe BJ, Hughes TR: Most ‘dark matter’ gastrointestinal stromal tumors irrespective of KIT or PDGFRA mutation transcripts are associated with known genes. PLoS Biol 2010, 8:e1000371. status. Am J Pathol 2004, 165:107-113. 12. Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS: lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res 2010, 39: doi: D146-151. Cite this article as: Brunner et al.: Transcriptional profiling of long non- 13. GENCODE lncRNA dataset.. [ftp://ftp.sanger.ac.uk/pub/gencode/]. coding RNAs and novel transcribed regions across a diverse panel of 14. Tsai MC, Spitale RC, Chang HY: Long intergenic noncoding RNAs: new archived human cancers. Genome Biology 2012 13:R75. links in cancer progression. Cancer Res 2011, 71:3-7. 15. Calin GA, Liu CG, Ferracin M, Hyslop T, Spizzo R, Sevignani C, Fabbri M, Cimmino A, Lee EJ, Wojcik SE, Shimizu M, Tili E, Rossi S, Taccioli C, Pichiorri F, Liu X, Zupo S, Herlea V, Gramantieri L, Lanza G, Alder H, Rassenti L, Volinia S, Schmittgen TD, Kipps TJ, Negrini M, Croce CM: Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell 2007, 12:215-229. 16. Caley DP, Pink RC, Trujillano D, Carter DR: Long noncoding RNAs, chromatin, and development. ScientificWorldJournal 2010, 10:90-102. 17. Qureshi IA, Mattick JS, Mehler MF: Long non-coding RNAs in nervous system function and disease. Brain Res 2010, 1338:20-35. 18. Huarte M, Rinn JL: Large non-coding RNAs: missing links in cancer?. Hum Mol Genet 2010, 19:R152-161. 19. Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, Tsai MC, Hung T, Argani P, Rinn JL, Wang Y, Brzoska P, Kong B, Li R, West RB, van de Vijver MJ, Sukumar S, Chang HY: Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 2010, Submit your next manuscript to BioMed Central 464:1071-1076. and take full advantage of: 20. Rinn JL, Kertesz M, Wang JK, Squazzo SL, Xu X, Brugmann SA, Goodnough LH, Helms JA, Farnham PJ, Segal E, Chang HY: Functional demarcation of active and silent chromatin domains in human HOX loci • Convenient online submission by noncoding RNAs. Cell 2007, 129:1311-1323. • Thorough peer review 21. Beck AH, Weng Z, Witten DM, Zhu S, Foley JW, Lacroute P, Smith CL, • No space constraints or color figure charges Tibshirani R, van de Rijn M, Sidow A, West RB: 3’-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One • Immediate publication on acceptance 2010, 5:e8768. • Inclusion in PubMed, CAS, Scopus and Google Scholar 22. Lee CH, Ou WB, Marino-Enriquez A, Zhu M, Mayeda M, Wang Y, Guo X, • Research which is freely available for redistribution Brunner AL, Amant F, French CA, West RB, McAlpine JN, Gilks CB, Yaffe MB, Prentice LM, McPherson A, Jones SJ, Marra MA, Shah SP, van de Rijn M, Submit your manuscript at www.biomedcentral.com/submit