bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 2 High-throughput annotation of full-length 3 long noncoding RNAs with Capture Long- 4 Read Sequencing (CLS) 5

6 Authors 7 Julien Lagarde*1,2, Barbara Uszczynska-Ratajczak*1,2,6, Silvia Carbonell3, Sílvia Pérez-Lluch1,2, 8 Amaya Abad1,2, Carrie Davis4, Thomas R. Gingeras4, Adam Frankish5, Jennifer Harrow5,7, 9 Roderic Guigo#1,2, Rory Johnson#1,2,8

10 * Equal contribution

11 # Corresponding authors: [email protected], [email protected]

12 Author affiliations 13 1 Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), 14 Dr. Aiguader 88, 08003 Barcelona, Spain.

15 2 Universitat Pompeu Fabra (UPF), Barcelona, Spain.

16 3 R&D Department, Quantitative Genomic Medicine Laboratories (qGenomics), Barcelona, 17 Spain. 18 4 Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring 19 Harbor, New York 11724, USA.

20 5 Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK CB10 1HH.

21 6 Present address: International Institute of Molecular and Cell Biology, Ks. Trojdena 4, 02-109 22 Warsaw, Poland

23 7 Present address: Illumina, Cambridge, UK.

24 8 Present address: Department of Clinical Research, University of Bern, Murtenstrasse 35, 3010 25 Bern, Switzerland.

1

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1

2

3 Keywords 4 Long non-coding RNA; lncRNA; lincRNA; RNA sequencing; transcriptomics; GENCODE; 5 annotation; CaptureSeq; third generation sequencing; long read sequencing; PacBio; KANTR.

6

7 Abbreviations 8 9 bp: base pair 10 FL: full length 11 nt: nucleotide 12 ROI: read of insert, i.e. PacBio reads 13 SJ: splice junction 14 SMRT: single-molecule real-time 15 TM: transcript model 16

2

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Abstract 2

3 Accurate annotations of and their transcripts is a foundation of genomics, but no 4 annotation technique presently combines throughput and accuracy. As a result, current reference 5 collections remain far from complete: many genes models are fragmentary, while thousands 6 more remain uncatalogued—particularly for long noncoding RNAs (lncRNAs). To accelerate 7 lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), 8 combining targeted RNA capture with third generation long-read sequencing. We present an 9 experimental re-annotation of the GENCODE intergenic lncRNA population in matched human 10 and mouse tissues, resulting in novel transcript models for 3574 / 561 gene loci, respectively. 11 CLS approximately doubles the annotated complexity of targeted loci, in terms of validated 12 splice junctions and transcript models, outperforming existing short-read techniques. The full- 13 length transcript models produced by CLS enable us to definitively characterize the genomic 14 features of lncRNAs, including promoter- and gene-structure, and protein-coding potential. Thus 15 CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality 16 full-length transcript models at high-throughput scales.

17

18

19

3

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Introduction 2

3 Long noncoding RNAs (lncRNAs) represent a vast and largely unexplored component of 4 the mammalian . Efforts to assign lncRNA functions rest on the availability of high- 5 quality transcriptome annotations. At present such annotations are still rudimentary: we have 6 little idea of the total lncRNA count, and for those that have been identified, transcript structures 7 remain largely incomplete.

8 The number and size of available lncRNA annotations have grown rapidly thanks to 9 projects using diverse approaches. Early gene sets, deriving from a mixture of FANTOM cDNA 10 sequencing efforts and public databases (1,2) were joined by the “lincRNA” (long intergenic 11 non-coding RNA) sets, discovered through analysis of chromatin signatures (3). More recently, 12 studies have applied de novo transcript-reconstruction software, such as Cufflinks (4) and 13 Scripture (5) to identify novel genes in short-read RNA sequencing (RNAseq) datasets (6-10). 14 However the reference for lncRNAs, as for protein-coding genes, has become the regularly- 15 updated, manual annotations from GENCODE, which are based on curation of cDNAs and ESTs 16 by human annotators (11,12). GENCODE has been adopted by most international genomics 17 consortia(13-17).

18 At present, annotation efforts are caught in a trade-off between throughput and quality. 19 De novo methods deliver large annotations with low hands-on time and financial investment. In 20 contrast, manual annotation is relatively slow and requires long-term funding. However the 21 quality of de novo annotations is often doubtful, due to the inherent difficulty of reconstructing 22 transcript structures from much shorter sequence reads. Such structures tend to be incomplete, 23 often lacking terminal exons or omitting splice junctions between adjacent exons (18). This 24 particularly affects lncRNAs, whose low expression results in low read coverage (12). The 25 outcome is a growing divergence between automated annotations of large size but uncertain 26 quality (e.g. 101,700 genes for NONCODE (9), and smaller but highly-curated “conservative” 27 annotations of GENCODE (15,767 genes for version 25) (12).

28 Annotation incompleteness takes two forms. First, genes may be entirely missing from 29 the annotation. Many genomic regions are suspected to transcribe RNA but presently contain no 30 annotation, including “orphan” small RNAs with presumed long precursors (19), enhancers (20)

4

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 and ultraconserved elements (21,22). Similarly, thousands of single-exon predicted transcripts 2 may be valid, but are generally excluded owing to doubts over their origin (12). The second form 3 of incompleteness refers to missing or partial gene structures in already-annotated lncRNAs. 4 Start and end sites frequently lack independent supporting evidence (12), and lncRNAs as 5 annotated have shorter spliced lengths and fewer exons than mRNAs (8,12,23). Recently, 6 RACE-Seq was developed to complete lncRNA annotations, but at relatively low throughput 7 (23).

8 One of the principal impediments to lncRNA annotation arises from their low steady- 9 state levels (3,12). To overcome this, targeted transcriptomics, or “RNA Capture Sequencing” 10 (CaptureSeq) (24) is used to boost the concentration of known or suspected low-abundance 11 transcripts in cDNA libraries. These studies have relied on Illumina short read sequencing and de 12 novo transcript reconstruction (24-26), with accompanying doubts over transcript structure 13 quality. Thus, while CaptureSeq achieves high throughput, its transcript structures lack the 14 confidence required for inclusion in GENCODE.

15 In order to harness the power of CaptureSeq while eliminating de novo transcript 16 assembly, we have developed RNA Capture Long Seq (CLS). CLS couples targeted RNA 17 capture with third generation long-read cDNA sequencing. We used CLS to interrogate the 18 GENCODE catalogue of intergenic lncRNAs, together with thousands of suspected novel loci, in 19 six tissues each of human and mouse. CLS dramatically extends known annotations with high- 20 quality novel structures. These data can be combined with other genomic data indicating 5’ and 21 3’ transcript termini to yield full-length transcript models in an automated way, allowing us to 22 describe fundamental lncRNA promoter and gene structure properties for the first time. Thus 23 CLS represents a significant advance in transcriptome annotation, and the dataset produced here 24 advances our understanding of lncRNA’s basic properties.

25

5

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Results 2

3 Capture Long Seq approach to extend the GENCODE lncRNA annotation 4

5 Our aim was to develop an experimental approach to improve and extend reference 6 transcript annotations, while minimizing human intervention and avoiding de novo transcript 7 assembly. We designed a method, Capture Long Seq (CLS), which couples targeted RNA 8 capture to Pacific Biosciences (“PacBio”) Third Generation long-read sequencing (Figure 1A). 9 The novelty of CLS is that it captures full-length, unfragmented cDNAs: this enables the targeted 10 sequencing of low-abundance transcripts, while avoiding the uncertainty of assembled transcript 11 structures from short-read sequencing.

12 CLS may be applied to two distinct objectives: to improve existing gene models, or to 13 identify novel loci (blue and orange in Figure 1A, respectively). Although the present study 14 focuses mainly on the first objective of improving existing lncRNA annotations, we demonstrate 15 also that novel loci can be captured and sequenced. With this in mind, we created a 16 comprehensive capture library targeting the set of intergenic GENCODE lncRNAs in human and 17 mouse. It should be noted that annotations for human are presently more complete than for 18 mouse, and this accounts for the differences in the annotation sizes throughout (9,090 vs 6,615 19 genes, respectively). It should also be noted that GENCODE annotations probed in this study are 20 principally multi-exonic transcripts based on polyA+ cDNA and EST libraries, and hence are not 21 likely to include many “enhancer RNAs” (11,27). To these we added tiled probes targeting loci 22 that may produce lncRNAs: small RNA genes (28), enhancers (29) and ultraconserved elements 23 (30). For mouse we also added orthologue predictions of human lncRNAs from PipeR (31). 24 Numerous control probes were added, including a series targeting half of the ERCC synthetic 25 spike-ins (32). Together, these sequences were used to design capture libraries of temperature- 26 matched and non-repetitive oligonucleotide probes (Figure 1B).

27 To access the maximal breadth of lncRNA diversity, we chose a set of transcriptionally- 28 complex and biomedically-relevant organs from mouse and human: whole brain, heart, liver and 29 testis (Figure 1C). To these we added two deeply-studied ENCODE human cell lines, HeLa and 30 K562 (33), and two mouse embryonic time-points (E7 and E15).

6

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 We designed a protocol to capture full-length, oligo-dT-primed cDNAs (full details can 2 be found in Materials and Methods). Barcoded, unfragmented cDNAs were pooled and captured. 3 Preliminary tests using quantitative PCR indicated strong and specific enrichment for targeted 4 regions (Supplementary Figure 1A). PacBio sequencing tends to favour shorter templates in a 5 mixture (34). Therefore pooled, captured cDNA was size-selected into three ranges (1-1.5kb, 6 1.5-2.5kb, >2.5kb) (Supplementary Figure 1B-C), and used to construct sequencing libraries for 7 PacBio SMRT (single-molecular real-time) technology (35).

8

9 CLS yields an enriched long-read transcriptome 10

11 Samples were sequenced on altogether 130 SMRT cells, yielding ~2 million reads in total 12 in each species (Figure 2A). PacBio sequence reads, or “reads of insert” (ROIs) were 13 demultiplexed to retrieve their tissue of origin and mapped to the genome (see Materials and 14 Methods for details). We observed high mapping rates (>99% in both cases), of which 86% and 15 88% were unique, in human and mouse, respectively (Supplementary Figure 2A). For brevity, all 16 data are henceforth quoted in order of human then mouse. The use of short barcodes meant that, 17 for ~30% of reads, the tissue of origin could not be retrieved (Supplementary Figure 2B). This 18 may be remedied in future by the use of longer barcode sequences. Representation was evenly 19 distributed across tissues, with the exception of testis (Supplementary Figure 2D). The ROIs had 20 a median length of 1 - 1.5 kb (Figure 2B) consistent with previous reports (34) and longer than 21 typical lncRNA annotation of ~0.5 kb (12).

22 Capture performance is assessed in two ways: by “on-target” rate – the proportion of 23 reads originating from probed regions – and by enrichment, or increase of on-target rate 24 following capture (36). To estimate this, we sequenced pre- and post-capture libraries using 25 MiSeq. CLS achieved on-target rates of 29.7% / 16.5%, representing 19- / 11-fold increase over 26 pre-capture cDNA (Figure 2C, D and Supplementary Figure 2E). These rates are competitive 27 with intergenic lncRNA capture in previous, short-read studies (Supplementary Figure 2F-G). 28 The majority of off-target signal arises from non-targeted, annotated protein-coding genes 29 (Figure 2C).

7

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 CLS on-target rates were lower than previous studies using fragmented DNA (37). Side- 2 by-side comparisons showed that this is likely due to the lower efficiency of capturing long 3 cDNA fragments (Supplementary Figure 2H-I), as observed by others (26), and thus representing 4 a future target for protocol optimization.

5 Synthetic spike-in sequences at known concentrations were used to assess CLS 6 sensitivity and quantitativeness. We compared the relationship of sequence reads to starting 7 concentration for the 42 probed (green) and 50 non-probed (violet) synthetic ERCC sequences in 8 pre- and post-capture samples (Figure 2E, top and bottom rows). Given the low sequencing 9 depth, CLS is surprisingly sensitive, extending detection sensitivity by two orders of magnitude, 10 and capable of detecting molecules at approximately 5 x 10-3 copies per cell (Materials and 11 Methods). As expected, it is less quantitative than conventional CaptureSeq (26), particularly at 12 higher concentrations where the slope falls below unity. This suggests saturation of probes by 13 cDNA molecules during hybridisation. A degree of noise, as inferred by the coefficient of 14 determination (R2) between read counts and template concentration, is introduced by the capture 15 process (R2 of 0.63 / 0.87 in human post-capture and pre-capture, respectively).

16

17 CLS expands the complexity of known and novel lncRNAs 18

19 CLS discovers a wealth of novel transcript structures within annotated lncRNA loci. A 20 good example is the SAMMSON oncogene (LINC01212) (13), where we discover a variety of 21 new exons, splice sites, and transcription termination sites that are not present in existing 22 annotations (Figure 3A, more examples in Supplementary Figures 3, 4, 5). The existence of 23 substantial additional downstream structure in SAMMSON could be validated by RT-PCR 24 (Figure 3A).

25 Gathering the non-redundant union of all ROIs, we measured the amount of new 26 complexity discovered in targeted lncRNA loci. CLS detected 58% and 45% of targeted lncRNA 27 nucleotides, and extended these annotations by 6.3 / 1.6 Mb nucleotides (86% / 64% increase 28 compared to existing annotations) (Supplementary Figure 6A). CLS discovered 45,673 and 29 11,038 distinct splice junctions (SJs), of which 36,839 and 26,715 are novel (Figure 3B and 30 Supplementary Figure 6B, left bars). The number of novel, high-confidence human SJs

8

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 amounted to 20,327 when using a deeper human SJ reference catalogue composed of both 2 GENCODE v20 and miTranscriptome (8) as a reference (Supplementary Figure 6C). For 3 independent validation, and given the relatively high sequence indel rate detected in PacBio 4 reads (Supplementary Figure 2M) (see Methods for analysis of sequencing error rates), we 5 deeply sequenced captured cDNA by Illumina HiSeq at an average depth of 35 million / 26 6 million pair-end reads per tissue sample. Split reads from this data exactly matched 78% / 75% 7 SJs from CLS. These “high-confidence” SJs alone represent a 160% / 111% increase over the 8 existing, probed GENCODE annotations (Figure 3B, Supplementary Figure 6B). Novel high- 9 confidence lncRNA SJs are rather tissue-specific, with greatest numbers observed in testis 10 (Supplementary Figure 6D), and were also discovered across other classes of targeted and non- 11 targeted loci (Supplementary Figure 6E). We observed a greater frequency of intron retention 12 events in lncRNAs, compared to protein-coding transcripts (Supplementary Figure 6F).

13 To evaluate the biological significance of novel lncRNA SJs, we computed their strength 14 using standard position weight matrix models from donor and acceptor sites (38) (Figure 3C, 15 Supplementary Figure 7A). High-confidence novel SJs from lncRNAs (orange, upper panel) far 16 exceed the predicted strength of background SJ-like dinucleotides (bottom panels), and are 17 essentially indistinguishable from annotated SJs in protein-coding and lncRNA loci (pink, upper 18 and middle panels). Even unsupported, novel SJs (black) tend to have high scores in excess of 19 background, although with a significant low-scoring tail. Although they display little evidence of 20 sequence conservation using standard measures (similar to lncRNA SJs in general) 21 (Supplementary Figure 7B), novel SJs also display weak but non-random evidence of selected 22 function between human and mouse (Supplementary Figure 7C).

23 We estimated how close these sequencing data are to saturation of true gene structures, 24 that is, to reaching a definitive lncRNA annotation. In each tissue sample, we tested the rate of 25 novel splice junction and transcript model discovery as a function of increasing depth of 26 randomly-sampled ROIs (Figure 3D, Supplementary Figures 8A-B). We observed an ongoing 27 gain of novelty with increasing depth, for both low- and high-confidence SJs, up to that 28 presented here. Similarly, no SJ discovery saturation plateau was reached at increasing simulated 29 HiSeq read depth (Supplementary Figure 8C). Thus, considerable additional sequencing is 30 required to fully define the complexity of annotated GENCODE lncRNAs.

9

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Beyond lncRNA characterization, CLS can be of utility to characterize many other types 2 of transcriptional units. As an illustration, we searched for precursor transcripts of small RNAs 3 (microRNAs, snoRNAs and snRNAs), whose annotation remains poor (19). We probed 1 kb 4 windows around all “orphan” small RNAs, i.e. those with no annotated overlapping transcript. 5 Note that, although mature snoRNAs are non-polyadenylated, they tend to be processed from 6 polyA+ precursor transcripts (39). We identified more than one hundred likely exonic primary 7 transcripts, and hundreds more potential precursors harbouring small RNAs within their introns 8 (Figure 3E). One intriguing example was the cardiac-enriched hsa-mir-143 (Supplementary 9 Figure 9). We previously identified a standalone lncRNA in the same locus, CARMEN1, which is 10 necessary for cardiac precursor cell differentiation (40). CLS identifies a new RT-PCR-supported 11 isoform that overlaps hsa-mir-143, suggesting it is a bifunctional lncRNA directing a complex 12 auto-regulatory feedback loop in cardiogenesis.

13

14 Assembling a full-length lncRNA annotation 15

16 A unique benefit of the CLS approach is the ability to identify full-length transcript 17 models with confident 5’ and 3’ termini. ROIs of oligo-dT-primed cDNAs carry a fragment of 18 the poly(A) tail, which can identify the polyadenylation site with basepair precision (34). Using 19 conservative filters, 73% / 64% of ROIs had identifiable polyA sites (Supplementary Table S1) 20 representing 16,961 / 12,894 novel polyA sites when compared to end positions of all 21 GENCODE annotations. Both known and novel polyA sites were accompanied by canonical 22 polyadenylation motifs (Supplementary Figure 10A-D). Similarly, the 5’ completeness of ROIs 23 was confirmed by proximity to methyl-guanosine caps identified by CAGE (Cap Analysis of 24 Gene Expression) (17) (Supplementary Figure 10E). Together, TSS and polyA sites were used to 25 define the 5’ / 3’ completeness of all ROIs (Figure 4A).

26 We developed a pipeline to merge ROIs into a non-redundant collection of transcript 27 models (TMs). In contrast to previous approaches (4), our “anchored merging” method respects 28 confirmed internal TSS or polyA sites (Figure 4B). Applying this to captured ROIs results in a 29 greater number of unique TMs than would be identified otherwise (Figure 4C, Supplementary 30 Figure 11A). Specifically, we identified 179,993 / 129,556 transcript models across all biotypes

10

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 (Supplementary Table S2), 86 / 87% of which displayed support of their entire intron chain by 2 captured HiSeq split reads (Supplementary Table S3). The CCAT1 locus is an example where 3 several novel transcripts are identified, each with CAGE and polyA support of 5’ and 3’ termini, 4 respectively (Figure 4D). CLS here suggests that adjacent CCAT1 and CASC19 gene models are 5 in fact fragments of the same underlying gene, a conclusion supported by RT-PCR (Figure 6 4D)(41).

7 Merged TMs can be defined by their end support: full length (5’ and 3’ supported), 5’ 8 only, 3’ only, or unsupported (Figure 4B, E). We identified a total of 65,736 / 44,673 full length 9 (FL) transcript models (Figure 4E and Supplementary Figure 11B, left panels): 47,672 (73%) / 10 37,244 (83%) arise from protein coding genes, and 13,071 (20%) / 5,329 (12%) from lncRNAs 11 (Supplementary Table S2). An additional 3,742 (6%) / 1,258 (3%) represent FL models that span 12 loci of different biotypes (listed in Figure 1B), usually including one protein-coding gene 13 (“Multi-Biotype”). Of the remaining non-coding FL transcript models, 295 / 434 are novel, 14 arising from unannotated gene loci. Altogether, 11,429 / 4,350 full-length structures arise from 15 probed lncRNA loci, of which 8,494 / 3,168 (74% / 73%) are novel (Supplementary Table S2). 16 We identified at least one FL TM for 19% / 12% of the originally-probed lncRNA annotation 17 (Figure 4F, Supplementary Figure 11C). Independent evidence for gene promoters from DNaseI 18 hypersensitivity sites, supported the accuracy of our 5’ identification strategy (Figure 4G). 19 Human lncRNAs with mouse orthologues had significantly more FL transcript models, although 20 the reciprocal was not observed (Supplementary Figure 11D-G). This imbalance may be due to 21 evolutionary factors (eg the appearance of novel lncRNA isoform complexity during primate 22 evolution), or else technical biases: it is noteworthy that we had access to deeper CAGE data for 23 the definition of TSS in human compared to mouse (217,516 vs 129,465), and that human 24 lncRNA annotations are more complete than for mouse.

25 In addition to probed lncRNA loci, CLS also discovered several thousand novel TMs 26 originating from unannotated regions, mapping to probed (blue in Figure 1B) or unprobed 27 regions (Supplementary Figures 11H-I). These TMs tended to have lower detection rates 28 (Supplementary Figure 11J) consistent with their low overall expression (Supplementary Figure 29 11K) and lower rates of 5’ and 3’ support than probed lncRNAs, although a small number are 30 full length (“other” in Figure 4E and Supplementary Figure 11B, right panels).

11

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 We next compared CLS performance to the conventional CaptureSeq methodology using 2 short-read data. We took advantage of our HiSeq analysis (212/156 million reads, in 3 human/mouse) of the same captured cDNA samples, to make a fair comparison between 4 methods. Short-read methods depend on de novo transcriptome assembly: we found, using 5 PacBio reads as a reference, that the recent StringTie tool consistently outperforms Cufflinks, 6 which has been used in previous CaptureSeq projects (Supplementary Figure 12A)(26,42). Using 7 intron chains to compare annotations, we found that CLS identifies 69% / 114% more novel TMs 8 than StringTie assembly (Figure 4H and Supplementary Figure 12B), despite sequencing 272- 9 fold fewer nucleotides in the PacBio library. CLS TMs are more complete at 5’ and 3’ ends than 10 StringTie assemblies, and more complete at the 3’ end compared to GENCODE annotations from 11 which they were probed (Figure 4I and Supplementary Figures 12D-H). As a result, although 12 StringTie TMs are slightly longer (Figure 4J and Supplementary Figure 12C), they are far less 13 likely to be full-length than CLS. The greater length of StringTie TMs may be due, in part, to 14 artefactual fusions of adjacent genes, as described recently (43). CLS also provided an advantage 15 over short reads in the detection of transcribed genome repeats, identifying in human 16 approximately 20% more nucleotides in repeats being transcribed (Supplementary Figure 12I).

17 Together, these findings show that CLS is effective in creating large numbers of full- 18 length transcript annotations for probed gene loci, in a highly scalable way.

19

20 Re-defining lncRNA promoter and gene characteristics with full-length annotations 21

22 With a full-length lncRNA catalogue, we could revisit the question of fundamental 23 differences of lncRNA and protein-coding genes. Existing lncRNA transcripts, as annotated, are 24 significantly shorter and have less exons than mRNAs (6,12). However it has remained 25 unresolved whether this is a genuine biological trend, or simply the result of annotation 26 incompleteness (23). Considering FL TMs, we find that the median lncRNA transcript to be 27 1108 / 1067 nt, similar to mRNAs mapped by the same criteria (1240 / 1320 nt) (Figure 5A, 28 Supplementary Figure 13A). This length difference of 11% / 19% is statistically significant 29 (P<2x10-16 for human and mouse, Wilcoxon test). These measured lengths are still shorter than 30 most annotated protein-coding transcripts (median 1,543 nt in GENCODE v20), but much larger

12

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 than annotated lncRNAs (median 668 nt). There are two factors that preclude our making firm 2 statements regarding relative lengths of lncRNAs and mRNAs: first, the upper length limitation 3 of PacBio reads (Figure 2B); and second, the fact that our size-selection protocol selects against 4 shorter transcripts. Nevertheless we do not find evidence that lncRNAs are substantially shorter 5 (12). Indeed, transcript annotation length estimates are likely to be strongly biased by lncRNAs’ 6 lower expression, which would be manifested in less complete annotations by both manual and 7 de novo approaches. We expect that this issue will be definitively answered with future nanopore 8 sequencing approaches.

9 We previously observed a striking enrichment for two-exon genes in lncRNAs, which 10 was not observed in mRNAs (12). However, we have found that this is clearly an artefact arising 11 from annotation incompleteness: the mean number of exons for lncRNAs in the FL models is 12 4.27, compared to 6.69 for mRNAs (Figure 5B, Supplementary Figure 13B). This difference is 13 explained by lncRNAs’ longer exons, although they peak at approximately 150 bp, or one 14 nucleosomal turn (Supplementary Figure 13C).

15 The usefulness of TSS annotation used here is demonstrated by the fact that FL 16 transcripts’ TSS are, on average, closer than existing annotations to expected promoter features, 17 including promoters and enhancers predicted by genome segmentations (44) and CpG islands, 18 although not evolutionarily-conserved elements or phenotypic GWAS sites (45) (Figure 5C). 19 More accurate mapping of lncRNA promoters in this way may provide new hypotheses for the 20 latter’s mechanism of action. For example, an improved 5’ annotation strengthens the link 21 between GWAS SNP rs246185, correlating with QT-interval and lying in the promoter of heart- 22 and muscle-expressed RP11-65J2 (ENSG00000262454), for which it is an expression 23 quantitative trait locus (eQTL) (Supplementary Figure 13D-E) (46).

24 The improved 5’ definition provided by CLS transcript models also enables us to 25 compare lncRNA and mRNA promoters. Recent studies, based on the start position of gene 26 annotations, have claimed to observe strong apparent differences across a range of features 27 (47,48). To make fair comparisons between gene sets, we created an expression-matched set of 28 mRNAs in HeLa and K562 cells, and removed bidirectional promoters. These were compared 29 across a variety of datasets from ENCODE (49) (Supplementary Figures 14, 15).

13

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 We observe a series of similar and divergent features of lncRNAs’ and mRNAs’ 2 promoters. For example, activating promoter histone modifications such as H3K4me3 (Figure 3 5D) and H3K9ac (Figure 5E), are essentially indistinguishable between full-length lncRNAs 4 (dark blue) and protein-coding genes (red), suggesting that, when accounting for expression 5 differences, active promoter architecture of lncRNAs is not unique. The contrast of these 6 findings with previous reports, suggest that the latter’s reliance on annotations alone led to 7 inaccurate promoter identification (47,48).

8 On the other hand, and as observed previously, lncRNA promoters are distinguished by 9 elevated levels of repressive chromatin marks, such as H3K9me3 (Figure 5F) and H3K27me3 10 (Supplementary Figures 14, 15) (47). This may be the consequence of elevated recruitment to 11 lncRNAs of Polycomb Repressive Complex, as evidenced by its subunit Ezh2 (Figure 5G). 12 Surprisingly, we also observed that the promoters of lncRNAs are distinguished from those of 13 protein-coding genes by a localised peak of insulator protein CTCF binding (Figure 5H). Finally, 14 there is a clear signal of evolutionary conservation at lncRNA promoters, although lower than for 15 protein-coding genes (Figure 5G).

16 Two conclusions are drawn from this analysis. First, that CLS-inferred TSS have greater 17 density of expected promoter features, compared to corresponding GENCODE annotations, 18 implying that CLS improves TSS annotation. And second, that when adjusting for expression, 19 lncRNA have comparable activating histone modifications, but distinct repressive modifications, 20 compared to protein-coding genes.

21

22 Discovery of new potential open reading frames 23

24 Recently a number of studies have suggested that many lncRNA loci peptide 25 sequences through unannotated open reading frames (ORFs) (50,51). We searched for signals of 26 protein-coding potential in FL models using two complementary methods, based on evolutionary 27 conservation and intrinsic sequence features (Figure 6A, Materials and Methods, Supplementary 28 Data File 1) (52,53). This analysis finds evidence for protein-coding potential in a small fraction 29 of lncRNA FL TMs (109/1271=8.6%), with a similar number of protein-coding FL TMs 30 displaying no evidence of encoding protein (2900/42,758=6.8%) (Figure 6B).

14

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 CLS FL models may lead to reclassification of protein-coding potential for seven cases in five 2 distinct gene loci (Figure 6C, Supplementary Figure 16A, Supplementary Data File 2). A good 3 example is the KANTR locus, where CLS (supported by independent RT-PCR) identifies an 4 unannotated exon harbouring a placental mammal-conserved 76aa ORF with no detectable 5 protein orthologue, composed of two sequential transmembrane domains (Figure 6D, 6 Supplementary Figure 16E) (54). This region derives from the antisense strand of a LINE1 7 transposable element. Another case is LINC01138, linked with prostate cancer, where a potential 8 42 aa ORF is found in the extended transcript (55). This ORF has no identifiable domains or 9 orthologues. We could not find peptide evidence for translation of either ORF (see Materials and 10 Methods). Whole-cell expression, as well as cytoplasmic-to-nuclear distributions, also showed 11 that potentially protein-coding lncRNAs’ behaviour is consistently more similar to annotated 12 lncRNAs than to mRNAs (Supplementary Figures 16B-D and F-G). Together, these findings 13 demonstrate the utility of CLS in improving the biotype annotation of the small minority of 14 lncRNAs that may encode proteins.

15

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Discussion 2

3 We have introduced an annotation methodology that resolves the competing needs of 4 quality and throughput. Capture Long Read Sequencing produces transcript models with quality 5 approaching that of human annotators, yet with throughput comparable to de novo transcriptome 6 assembly. CLS improves upon existing assembly-based methods not only in terms of yielding 7 confident exon connectivity, but also through (1) far higher rates of 5’ and 3’ completeness in 8 resulting transcript models, and (2) carrying encoded polyA tails enabling base-pair mapping of 9 3’ ends.

10 In the context of GENCODE, CLS will be used to accelerate annotation pipelines. 11 Transcript models, accompanied by meta-data describing 5’, 3’ and splice junction support, will 12 be stratified by confidence level. These will receive attention from human annotators as a 13 function of their incompleteness, with FL TMs passed directly to published annotations. Future 14 workflows will utilise de novo models from short read data from diverse cell types and 15 developmental time points to perform new rounds of CLS. This approach lays the path towards a 16 truly comprehensive human transcriptome annotation.

17 CLS is appropriate for virtually any class of RNA transcript. CLS’ versatility and 18 throughput makes it suited to rapid, low-cost transcriptome annotation in non-model organisms. 19 Preliminary bioinformatic homology screens for potential genes (including protein-coding, 20 lncRNAs, microRNAs etc.), in newly-sequenced , or first-pass short read RNA-Seq, 21 could be used to design capture libraries. Resulting annotations would be substantially more 22 accurate than those produced by current pipelines based on homology and short-read data.

23 In economic terms, CLS is also competitive. Using conservative estimates, with 2016 24 prices ($2460 for 1 lane of PE125bp HiSeq, $500 for 1 SMRT), and including the cost of 25 sequencing alone, we estimate that CLS yielded one novel, full-length lncRNA structure for 26 every $8 spent, compared to $27 for conventional CaptureSeq. This difference is accounted for 27 vastly greater rate of full-length transcript discovery by CLS.

28 CLS could also be applied to personal genomics studies. Targeted sequencing of gene 29 panels, perhaps those with medical relevance, could examine the little-studied question of

16

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 alternative transcript variability across individuals—i.e. whether there exist isoforms that are 2 private to given individuals or populations.

3 Despite its advantages, CLS remains to be optimised in several respects. First, the capture 4 efficiency for long cDNAs will need to be improved to levels presently observed for short 5 fragments. Second, a combination of technical factors limit the completeness of CLS transcript 6 models (TMs), including: sequencing reads that remain shorter than many transcripts; incomplete 7 reverse transcription of the RNA template; degradation of RNA molecules before reverse 8 transcription. Resolving these issues will be important objectives of future protocol 9 improvements, and only then can we make definitive judgements about lncRNA transcript 10 properties. In recent, unpublished work we have further optimised the capture protocol, pushing 11 on-target rates to around 35% (details in Supplementary Methods, Section “cDNA Capture”). 12 However the most dramatic gains in cost-effectiveness and completeness of CLS will most likely 13 be made by advances in sequencing technology. Indeed the latest nanopore cDNA sequencing 14 promises to be ~150-fold cheaper per read than the PacBio technology used here (0.01 vs 15 15 cents/read, respectively).

16 Full-length annotations have provided the most confident view to date of lncRNA gene 17 properties. These are more similar to mRNAs than previously thought, in terms of spliced length 18 and exon count (12,56). A similar trend is seen for promoters: when lncRNA promoters are 19 accurately mapped by CLS and compared to matched protein-coding genes, we find them to be 20 surprisingly similar for activating modifications. This suggests that previous studies, which 21 placed confidence in annotations of TSS, should be reassessed (47,48). On the other hand 22 lncRNA promoters do have unique properties, including elevated levels of repressive histone 23 modification, recruitment of Polycomb group proteins, and interaction with the insulator protein 24 CTCF. To our knowledge, this is the first report to suggest a relationship between lncRNAs and 25 insulator elements. Overall, these results suggest that that lncRNA gene features per se are 26 generally comparable to mRNAs, after normalising for their differences in overall expression. 27 Finally, extended TMs do not yield evidence for widespread protein-coding capacity encoded in 28 lncRNAs.

29 Despite success in mapping novel structure in annotated lncRNAs, we observed 30 surprisingly low numbers of transcript models originating in the relatively fewer numbers of

17

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 unannotated loci that we probed, including ultraconserved elements and developmental 2 enhancers. This would suggest that, at least in the tissue samples probed here, such elements are 3 not giving rise to substantial numbers of lncRNA-like, polyadenylated transcripts.

4 In summary, by resolving a longstanding roadblock in lncRNA transcript annotation, the 5 CLS approach promises to dramatically accelerate our progress towards an eventual “complete” 6 mammalian transcriptome annotation. These updated lncRNA catalogues represent a valuable 7 resource to the genomic and biomedical communities, and address fundamental issues of 8 lncRNA biology.

18

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Figure legends 2

3 Figure 1: Capture Long Seq approach to extend the GENCODE lncRNA annotation 4

5 (A) Strategy for automated, high-quality transcriptome annotation. CLS may be used to 6 complete existing annotations (blue), or to map novel transcript structures in suspected 7 loci (orange). Capture oligonucleotides (black bars) are designed to tile across targeted 8 regions. PacBio libraries are prepared for from the captured molecules. Illumina HiSeq 9 short-read sequencing can be performed for independent validation of predicted splice 10 junctions. Predicted transcription start sites can be confirmed by CAGE clusters (green), 11 and transcription termination sites by non-genomically encoded polyA sequences in 12 PacBio reads. Novel exons are denoted by lighter coloured rectangles. 13 (B) Summary of human and mouse capture library designs. Shown are the number of 14 individual gene loci that were probed. “PipeR pred.”: orthologue predictions in mouse 15 genome of human lncRNAs, made by PipeR (31); “UCE”: ultraconserved elements; 16 “Prot. coding”: expression-matched, randomly-selected protein-coding genes; “ERCC”: 17 spike-in sequences; “Ecoli”: randomly-selected E. coli genomic regions. Enhancers and 18 UCEs are probed on both strands, and these are counted separately. “Total nts”: sum of 19 targeted nucleotides. 20 (C) RNA samples used.

21

22 Figure 2: CLS yields an enriched, long-read transcriptome 23

24 (A) Summary statistics for long-read sequencing. ROI = “Read Of Insert”, or PacBio reads. 25 (B) Length distributions of ROIs. Sequencing libraries were prepared from three size-selected 26 cDNA fractions (see Supplementary Figure 1B-C). 27 (C) Breakdown of sequenced reads by gene biotype, pre- (left) and post-capture (right), for 28 human (equivalent mouse data are found in Supplementary Figure 2J). Colours denote 29 the on/off-target status of the genomic region from which the reads originate, namely: 30 Grey: reads originating from annotated but not targeted features; green: reads from

19

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 targeted features, including lncRNAs; yellow: reads from unannotated, non-targeted 2 regions. The ERCC class comprises only those ERCC spike-ins that were probed in this 3 experiment. Note that when a given read overlapped more than one targeted class of 4 regions, it was counted in each of these classes separately. 5 (D) Summary of capture performance. The y-axis shows the percent of all mapped ROIs 6 originating from a targeted region (“on-target”). Enrichment is defined as the ratio of this 7 value in Post- and Pre-capture samples. Note that Pre- and Post-capture on-target rates 8 were calculated using MiSeq reads. 9 (E) Response of read counts in captured cDNA to input RNA concentration. Upper panels: 10 Pre-capture; lower panels: Post-capture. Left: human; right: mouse. Note the log scales 11 for each axis. Each point represents one of 92 spiked-in synthetic ERCC RNA sequences. 12 42 were probed in the capture design (green), while the remaining 50 were not (violet). 13 Lines represent linear fits to each dataset, whose parameters are shown above. Given the 14 log-log representation, a linear response of read counts to template concentrate should 15 yield an equation of type y = c + mx, where m is 1.

16

17 Figure 3: Extending known lncRNA gene structures 18

19 (A) Novel transcript structures from the SAMMSON (LINC01212) locus. Annotation as 20 present in GENCODE v20 is shown in green, capture probes in grey, CLS reads in black 21 (confirming known structure) and red (novel structures). A sequence amplified by 22 independent RT-PCR is also shown. 23 (B) Novel splice junction (SJ) discovery. The y-axis denotes counts of unique SJs for human 24 (equivalent mouse data in Supplementary Figure 6B). Only “on-target” junctions 25 originating from probed lncRNA loci are considered. Grey represents GENCODE- 26 annotated SJs that are not detected. Dark green represents annotated SJs that are detected 27 by CLS. Light green represent novel SJs that are identified by CLS but not annotated. 28 The left column represents all SJs, and the right column represents only high-confidence 29 SJs (supported by at least one split-read from Illumina short read sequencing). See also 30 Supplementary Figure 6C for a comparison of CLS SJs to the miTranscriptome 31 catalogue. 20

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 (C) Splice junction (SJ) motif strength. Panels plot the distribution of predicted SJ strength, 2 for acceptors (left) and donors (right). Data shown are for human, equivalent analysis for 3 mouse may be found in Supplementary Figure 7A. The strength of the splice sites were 4 computed using standard position weight matrices used by GeneID (38). Data are shown 5 for non-redundant SJs from CLS transcript models from targeted lncRNAs (top), all 6 annotated protein-coding genes (middle), or a background distribution sampled from 7 randomly-selected AG (acceptor-like) and GT (donor-like) dinucleotides. 8 (D) Novel splice junction discovery as a function of sequencing depth in human. Each panel 9 represents the number of novel splice junctions (SJs) discovered (y-axis) in simulated 10 analysis where increasing numbers of ROIs (x-axis) were randomly sampled from the 11 experiment. The SJs retrieved at each read depth were further stratified by level of 12 sequencing support (Dark brown: all PacBio SJs; Orange: HiSeq-supported PacBio SJs; 13 Black: HiSeq-unsupported PacBio SJs). Each randomization was repeated fifty times, and 14 a boxplot summarizes the results at each simulated depth. The highest y value represents 15 the actual number of novel SJs discovered. Equivalent data for mouse can be found in 16 Supplementary Figure 8A, and for rates of novel transcript model discovery in 17 Supplementary Figure 8B. 18 (E) Identification of putative precursor transcripts of small RNA genes. For each gene 19 biotype, the figures show the count of unique genes in each group. “Orphans” are those 20 with no annotated same-strand overlapping transcript in GENCODE, and were used for 21 capture probe design in this project. “Pot. Precursors” (potential precursors) represent 22 orphan small RNAs that reside in the intron of and on the same strand as a novel 23 transcript identified by CLS; “Precursors” represent those that reside in the exon of a 24 novel transcript.

25

26 Figure 4: Full-length transcript annotation 27

28 (A) The 5’ (transcription start site, TSS) and 3’ (polyA site) termini of new transcript models 29 can be inferred using CAGE clusters and sequenced polyA tails, respectively. The latter 30 correspond to polyA fragments identified at ROI 3’ ends that are not genomically- 31 encoded. 21

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 (B) “Anchored” merging of ROIs to create transcript models, while respecting their TSS and 2 polyA sites. In conventional merging (left), transcripts’ TSS and polyA sites are lost 3 when they overlap exons of other transcripts. Anchored merging (right) respects and does 4 not collapse TSS and polyA sites that fall within exons of other transcripts. 5 (C) Anchored merging yields more distinct transcript models. The y-axis represents total 6 counts of ROIs (pink), anchor-merged transcript models (brown) and conventionally- 7 merged transcript models (turquoise). Transcript models were merged irrespective of 8 tissue-origin. 9 (D) Example of full-length, TSS- and polyA-mapped transcript models at the human CCAT1 10 / CASC19 locus. GENCODE v20 annotation is shown in green, novel full-length CLS 11 models in red. Note the presence of a CAGE-supported TSS (green star) and multiple 12 distinct polyA sites (red stars). Also shown is the sequence obtained by RT-PCR and 13 Sanger sequencing (black). 14 (E) The total numbers of anchor-merged transcript models identified by CLS for human. The 15 y-axis of each panel shows unique transcript model (TM) counts. Left panel: All merged 16 TMs, coloured by end-support. Middle panel: Full length (FL) TMs, broken down by 17 novelty with respect to existing GENCODE annotations. Green areas are novel and 18 multi-exonic: “overlap” intersect an annotation on the same strand, but do not respect all 19 its splice junctions; “intergenic” overlap no annotation on the same strand; “extension” 20 respect all of an annotation’s splice junctions, and add novel ones. Right panel: Novel FL 21 TMs, coloured by their biotype. “Other” refers to transcripts not mapping to any 22 GENCODE protein-coding or lncRNA annotation. Note that the majority of “multi- 23 biotype” models link a protein-coding gene to another locus. Equivalent data for mouse 24 are found in Supplementary Figure 11B. 25 (F) The total numbers of probed lncRNA loci giving rise to CLS transcript models (TMs), 26 novel TMs, full-length CLS TMs (FL TMs) and novel FL TMs in human at increasing 27 minimum cutoffs for each category. Equivalent mouse data can be found in 28 Supplementary Figure 11C. 29 (G) Coverage of CLS transcript TSSs with ENCODE DNaseI-hypersensitive sites (DHS) in 30 HeLa-S3. “CAGE+” / “CAGE-“ denote CLS transcript models with / without CAGE- 31 supported 5’ ends, respectively. “GENCODE lncRNA” represent the annotated 5’ ends of

22

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 CLS-detected lncRNA transcript annotations. “GENCODE protein-coding” corresponds 2 to the TSSs of a subset of annotated protein-coding genes, expression-matched to CLS 3 TMs in HeLa-S3. 4 (H) Comparison of non-redundant transcript catalogues from GENCODE annotation, CLS, 5 and de novo models produced by StringTie software within probed lncRNA regions. The 6 latter was run on short reads sequenced from the same captured cDNA as CLS. 7 “GENCODE”: subset of probed lncRNAs detected by CLS or StringTie. The identity of 8 transcripts was defined by their intron chain coordinates; as a result only spliced 9 transcripts are reported here. Equivalent mouse date can be found in Supplementary 10 Figure 12B-E. 11 (I) Comparing completeness of transcript annotations: 5’ and 3’ completeness as estimated 12 by CAGE overlap and upstream polyadenylation signal (PAS), with respect to 5’ (left) 13 and 3’ ends (right), respectively (human). Represented is the proportion of transcript ends 14 with such CAGE or PAS support in each set. Neighbouring transcript ends were merged 15 within each individual set (maximum distance: +/- 5nts on the same strand). CAGE(+) 5’ 16 ends are those TSSs having a CAGE cluster within a +/-50 bases window around them. 17 Similarly, PAS(+) 3’ ends correspond to 3’ends falling 10 to 50 bases downstream of a 18 PAS motif. Note that full length (FL) CLS models have, by definition, a CAGE signal at 19 5’ end, and thus have 100% 5’ completeness. “GENCODE lncRNA”: subset of probed 20 lncRNAs detected by CLS or StringTie. Control sites represent a random sample of 21 internal exons’ middle coordinate. Corresponding counts of CAGE- or PAS-supported 22 features are indicated on each bar. Equivalent mouse data is to be found in 23 Supplementary Figure 12F. 24 (J) Spliced length distributions of indicated non-redundant transcript catalogues. “FL” 25 indicates the subset of transcripts from each catalogue that has 5’ support from CAGE, 26 and 3’ support from PacBio-identified polyA sites. The median spliced length of each 27 population is indicated by a vertical dotted line. “GENCODE”: subset of probed 28 lncRNAs detected by CLS or StringTie. Equivalent mouse data can be found in 29 Supplementary Figure 12C.

30

23

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Figure 5: Discovery of novel lncRNA transcripts 2

3 (A) The mature, spliced transcript length of: CLS full-length transcript models from targeted 4 lncRNA loci (dark blue); transcript models from the targeted and detected GENCODE 5 lncRNA loci (light blue); CLS full-length transcript models from protein-coding loci 6 (red). 7 (B) The numbers of exons per full length transcript model, from the same groups as in (A). 8 Dotted lines represent medians. 9 (C) Distance of annotated transcription start sites (TSS) to genomic features. Each cell 10 displays the mean distance to nearest neighbouring feature for each TSS. TSS sets 11 correspond to the classes from (A). “Shuffled” represent FL lincRNA TSS randomly 12 placed throughout genome. 13 (D) – (I) Comparing promoter profiles across gene sets. The aggregate density of various 14 features is shown across the TSS of indicated gene classes. Note that overlapping TSS 15 were merged within classes, and TSSs belonging to bi-directional promoters were 16 discarded (see Methods). The y-axis denotes the mean signal per TSS, and grey fringes 17 represent the standard error of the mean. Gene sets are: Dark blue, full-length lncRNA 18 models from CLS; Light blue, the GENCODE annotation models from which the latter 19 were probed; Red, a subset of protein-coding genes with similar expression in HeLa as 20 the CLS lncRNAs.

21

22 Figure 6: Properties of full-length lncRNAs 23

24 (A) The predicted protein-coding potential of all full-length transcript models mapped to 25 lncRNA (left) or protein-coding loci (right). Each point represents a single full length 26 (FL) transcript model (TM). The y-axis displays the coding likelihood according to 27 PhyloCSF, based on multiple genome alignments, while the x-axis displays that 28 calculated by CPAT, an alignment-free method. Red lines indicate score thresholds, 29 above which transcript models are considered protein-coding. Models mapping to 30 multiple different biotypes were not considered. 31 (B) The numbers of classified transcript models (TMs) from (A).

24

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 (C) Discovery of new protein-coding transcripts as a result of full-length CLS reads, using 2 PhyloCSF. For each probed lncRNA locus, we calculated the transcript isoform with 3 highest scoring ORF (x-axis). From each locus, we identified the full-length transcript 4 model with high scoring ORF (y-axis). LncRNA loci from existing GENCODE v20 5 annotation predicted to encode proteins are highlighted in yellow. LncRNA loci where 6 new ORFs are discovered as a result of CLS transcript models are highlighted in red. 7 (D) KANTR, an example of an annotated lncRNA locus where CLS discovers novel 8 protein-coding sequence. The upper panel shows the structure of the lncRNA and the 9 associated ORF (highlighted region) falling within two novel full-length CLS transcripts 10 (red). Note how this ORF lies outside existing GENCODE annotation (green), and its 11 overlap with a highly-conserved region (see green PhyloCSF conservation track, below). 12 Also shown is the sequence obtained by RT-PCR and Sanger sequencing (black). The 13 lower panel, generated by CodAlignView (57), reveals conservative substitutions in the 14 predicted ORF of 76 aa consistent with a functional peptide product. High-confidence 15 predicted SMART (58) domains are shown as coloured bars below. The entire ORF lies 16 within and antisense to a L1 transposable element (grey bar).

25

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Author contributions 2 3 RJ, RG, JH, AF, BU-R and JL designed the experiment. SC generated cDNA libraries and 4 performed the Capture. CD and TRG performed the PacBio sequencing of Capture libraries. JL 5 and BU-R analysed the data under the supervision of RG and RJ. RJ wrote the manuscript, with 6 contributions from JL, BU-R and RG. SP-L and AA performed the RT-PCR experiments. 7

8 Competing financial interests 9 10 The author declare no competing financial interest. 11

12 Acknowledgements 13

14 We thank members of the Guigó laboratory for their valuable input and help when handling 15 samples, analysing data and writing the manuscript, including Emilio Palumbo, Ferran Reverter, 16 Alessandra Breschi, Dmitri Pervouchine, Carme Arnan and Francisco Camara. We wish to thank 17 Lluis Armengol (qGenomics) for advice on RNA capture, Diego Garrido (CRG) for help with 18 eQTL analysis, Sarah Bonnin (CRG) for help with data manipulation in R, Irwin Jungreis (MIT) 19 for advice on PhyloCSF. James Wright and Jyoti Choudhary (Sanger Institute) helped in 20 searching for peptide hits to putative coding regions. Sarah Djebali (INRA, France) kindly made 21 available the Compmerge utility. This work and publication were supported by the National 22 Human Genome Research Institute of the National Institutes of Health (grant numbers 23 U41HG007234, U41HG007000 and U54HG007004) and the Wellcome Trust (grant number 24 WT098051). RJ was supported by Ramón y Cajal RYC-2011-08851. Work in laboratory of RG 25 was supported by Awards Number U54HG0070, R01MH101814 and U41HG007234 from the 26 National Human Genome Research Institute. This research was partly supported by the NCCR 27 RNA & Disease funded by the Swiss National Science Foundation (to RJ). We thank Romina 28 Garrido (CRG) for administrative support. We acknowledge support of the Spanish Ministry of 29 Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa 2013-2017’, SEV-2012- 30 0208.

26

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Data availability 2

3 Raw and processed data is deposited in the Gene Expression Omnibus under accession 4 GSE93848. RT-PCR validation sequences are available in Supplementary Data File 3. Genome- 5 aligned data were assembled into a public Track Hub, which can be loaded into the UCSC 6 Genome Browser (pre-loaded URL: http://genome-euro.ucsc.edu/cgi- 7 bin/hgTracks?hubUrl=http://public_docs.crg.es/rguigo/CLS/data/trackHub//hub.txt). In addition, 8 a supplementary data portal is available on the web at https://public_docs.crg.es/rguigo/CLS/ .

9

27

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 References 2

3 1. Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., 4 Lenhard, B., Wells, C. et al. (2005) The transcriptional landscape of the mammalian genome. 5 Science, 309, 1559-1563. 6 2. Jia, H., Osak, M., Bogu, G.K., Stanton, L.W., Johnson, R. and Lipovich, L. (2010) Genome-wide 7 computational identification and manual annotation of human long noncoding RNA genes. RNA, 8 16, 1478-1487. 9 3. Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte, M., Zuk, O., Carey, 10 B.W., Cassady, J.P. et al. (2009) Chromatin signature reveals over a thousand highly conserved 11 large non-coding RNAs in mammals. Nature, 458, 223-227. 12 4. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., 13 Rinn, J.L. and Pachter, L. (2012) Differential gene and transcript expression analysis of RNA-seq 14 experiments with TopHat and Cufflinks. Nature protocols, 7, 562-578. 15 5. Guttman, M., Garber, M., Levin, J.Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M.J., 16 Gnirke, A., Nusbaum, C. et al. (2010) Ab initio reconstruction of cell type-specific transcriptomes 17 in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature biotechnology, 28, 18 503-510. 19 6. Cabili, M.N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A. and Rinn, J.L. (2011) 20 Integrative annotation of human large intergenic noncoding RNAs reveals global properties and 21 specific subclasses. Genes & development, 25, 1915-1927. 22 7. Hangauer, M.J., Vaughn, I.W. and McManus, M.T. (2013) Pervasive Transcription of the Human 23 Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs. PLoS 24 genetics, 9, e1003569. 25 8. Iyer, M.K., Niknafs, Y.S., Malik, R., Singhal, U., Sahu, A., Hosono, Y., Barrette, T.R., Prensner, J.R., 26 Evans, J.R., Zhao, S. et al. (2015) The landscape of long noncoding RNAs in the human 27 transcriptome. Nature genetics. 28 9. Zhao, Y., Li, H., Fang, S., Kang, Y., Wu, W., Hao, Y., Li, Z., Bu, D., Sun, N., Zhang, M.Q. et al. (2016) 29 NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic acids 30 research, 44, D203-208. 31 10. Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., 32 Wold, B.J. and Pachter, L. (2010) Transcript assembly and quantification by RNA-Seq reveals 33 unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 34 28, 511-515. 35 11. Harrow, J., Frankish, A., Gonzalez, J.M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B.L., 36 Barrell, D., Zadissa, A., Searle, S. et al. (2012) GENCODE: the reference human genome 37 annotation for The ENCODE Project. Genome research, 22, 1760-1774. 38 12. Derrien, T., Johnson, R., Bussotti, G., Tanzer, A., Djebali, S., Tilgner, H., Guernec, G., Martin, D., 39 Merkel, A., Knowles, D.G. et al. (2012) The GENCODE v7 catalog of human long noncoding RNAs: 40 analysis of their gene structure, evolution, and expression. Genome research, 22, 1775-1789. 41 13. Leucci, E., Vendramin, R., Spinazzi, M., Laurette, P., Fiers, M., Wouters, J., Radaelli, E., 42 Eyckerman, S., Leonelli, C., Vanderheyden, K. et al. (2016) Melanoma addiction to the long non- 43 coding RNA SAMMSON. Nature, 531, 518-522. 44 14. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57-74.

28

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 15. Chen, L., Kostadima, M., Martens, J.H., Canu, G., Garcia, S.P., Turro, E., Downes, K., Macaulay, 2 I.C., Bielczyk-Maczynska, E., Coe, S. et al. (2014) Transcriptional diversity during lineage 3 commitment of human blood progenitors. Science, 345, 1251033. 4 16. Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., 5 Zhang, Z., Wang, J., Ziller, M.J. et al. (2015) Integrative analysis of 111 reference human 6 epigenomes. Nature, 518, 317-330. 7 17. Forrest, A.R., Kawaji, H., Rehli, M., Baillie, J.K., de Hoon, M.J., Haberle, V., Lassmann, T., 8 Kulakovskiy, I.V., Lizio, M., Itoh, M. et al. (2014) A promoter-level mammalian expression atlas. 9 Nature, 507, 462-470. 10 18. Steijger, T., Abril, J.F., Engstrom, P.G., Kokocinski, F., Hubbard, T.J., Guigo, R., Harrow, J. and 11 Bertone, P. (2013) Assessment of transcript reconstruction methods for RNA-seq. Nature 12 methods, 10, 1177-1184. 13 19. Georgakilas, G., Vlachos, I.S., Paraskevopoulou, M.D., Yang, P., Zhang, Y., Economides, A.N. and 14 Hatzigeorgiou, A.G. (2014) microTSS: accurate microRNA transcription start site identification 15 reveals a significant number of divergent pri-miRNAs. Nature communications, 5, 5700. 16 20. Orom, U.A., Derrien, T., Beringer, M., Gumireddy, K., Gardini, A., Bussotti, G., Lai, F., Zytnicki, M., 17 Notredame, C., Huang, Q. et al. (2010) Long noncoding RNAs with enhancer-like function in 18 human cells. Cell, 143, 46-58. 19 21. Ferdin, J., Nishida, N., Wu, X., Nicoloso, M.S., Shah, M.Y., Devlin, C., Ling, H., Shimizu, M., Kumar, 20 K., Cortez, M.A. et al. (2013) HINCUTs in cancer: hypoxia-induced noncoding ultraconserved 21 transcripts. Cell death and differentiation, 20, 1675-1687. 22 22. Calin, G.A., Liu, C.G., Ferracin, M., Hyslop, T., Spizzo, R., Sevignani, C., Fabbri, M., Cimmino, A., 23 Lee, E.J., Wojcik, S.E. et al. (2007) Ultraconserved regions encoding ncRNAs are altered in human 24 leukemias and carcinomas. Cancer cell, 12, 215-229. 25 23. Lagarde, J., Uszczynska-Ratajczak, B., Santoyo-Lopez, J., Gonzalez, J.M., Tapanari, E., Mudge, 26 J.M., Steward, C.A., Wilming, L., Tanzer, A., Howald, C. et al. (2016) Extension of human lncRNA 27 transcripts by RACE coupled with long-read high-throughput sequencing (RACE-Seq). Nature 28 communications, 7, 12339. 29 24. Mercer, T.R., Gerhardt, D.J., Dinger, M.E., Crawford, J., Trapnell, C., Jeddeloh, J.A., Mattick, J.S. 30 and Rinn, J.L. (2012) Targeted RNA sequencing reveals the deep complexity of the human 31 transcriptome. Nature biotechnology, 30, 99-104. 32 25. Bussotti, G., Leonardi, T., Clark, M.B., Mercer, T.R., Crawford, J., Malquori, L., Notredame, C., 33 Dinger, M.E., Mattick, J.S. and Enright, A.J. (2016) Improved definition of the mouse 34 transcriptome via targeted RNA sequencing. Genome research, 26, 705-716. 35 26. Clark, M.B., Mercer, T.R., Bussotti, G., Leonardi, T., Haynes, K.R., Crawford, J., Brunck, M.E., Cao, 36 K.A., Thomas, G.P., Chen, W.Y. et al. (2015) Quantitative gene profiling of long noncoding RNAs 37 with targeted RNA sequencing. Nature methods, 12, 339-342. 38 27. Andersson, R., Gebhard, C., Miguel-Escalada, I., Hoof, I., Bornholdt, J., Boyd, M., Chen, Y., Zhao, 39 X., Schmidl, C., Suzuki, T. et al. (2014) An atlas of active enhancers across human cell types and 40 tissues. Nature, 507, 455-461. 41 28. Kozomara, A. and Griffiths-Jones, S. (2014) miRBase: annotating high confidence microRNAs 42 using deep sequencing data. Nucleic acids research, 42, D68-73. 43 29. Visel, A., Minovitsky, S., Dubchak, I. and Pennacchio, L.A. (2007) VISTA Enhancer Browser--a 44 database of tissue-specific human enhancers. Nucleic acids research, 35, D88-92. 45 30. Dimitrieva, S. and Bucher, P. (2013) UCNEbase--a database of ultraconserved non-coding 46 elements and genomic regulatory blocks. Nucleic acids research, 41, D101-109.

29

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 31. Bussotti, G., Raineri, E., Erb, I., Zytnicki, M., Wilm, A., Beaudoing, E., Bucher, P. and Notredame, 2 C. (2011) BlastR--fast and accurate database searches for non-coding RNAs. Nucleic acids 3 research, 39, 6886-6895. 4 32. Kralj, J.G. and Salit, M.L. (2013) Characterization of in vitro transcription amplification linearity 5 and variability in the low copy number regime using External RNA Control Consortium (ERCC) 6 spike-ins. Analytical and bioanalytical chemistry, 405, 315-320. 7 33. Djebali, S., Davis, C.A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., 8 Lin, W., Schlesinger, F. et al. (2012) Landscape of transcription in human cells. Nature, 489, 101- 9 108. 10 34. Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. (2013) A single-molecule long-read survey of 11 the human transcriptome. Nature biotechnology, 31, 1009-1014. 12 35. Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni, A., Swerdlow, 13 H.P. and Gu, Y. (2012) A tale of three next generation sequencing platforms: comparison of Ion 14 Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC genomics, 13, 341. 15 36. Mercer, T.R., Clark, M.B., Crawford, J., Brunck, M.E., Gerhardt, D.J., Taft, R.J., Nielsen, L.K., 16 Dinger, M.E. and Mattick, J.S. (2014) Targeted sequencing for gene discovery and quantification 17 using RNA CaptureSeq. Nature protocols, 9, 989-1009. 18 37. Garcia-Garcia, G., Baux, D., Faugere, V., Moclyn, M., Koenig, M., Claustres, M. and Roux, A.F. 19 (2016) Assessment of the latest NGS enrichment capture methods in clinical context. Scientific 20 reports, 6, 20948. 21 38. Blanco, E., Parra, G. and Guigo, R. (2007) Using geneid to identify genes. Current protocols in 22 bioinformatics / editoral board, Andreas D. Baxevanis ... [et al.], Chapter 4, Unit 4 3. 23 39. Smith, C.M. and Steitz, J.A. (1998) Classification of gas5 as a multi-small-nucleolar-RNA (snoRNA) 24 host gene and a member of the 5'-terminal oligopyrimidine gene family reveals common 25 features of snoRNA host genes. Molecular and cellular biology, 18, 6897-6909. 26 40. Ounzain, S., Micheletti, R., Arnan, C., Plaisance, I., Cecchi, D., Schroen, B., Reverter, F., Alexanian, 27 M., Gonzales, C., Ng, S.Y. et al. (2015) CARMEN, a human super enhancer-associated long 28 noncoding RNA controlling cardiac specification, differentiation and homeostasis. Journal of 29 molecular and cellular cardiology , 89, 98-112. 30 41. Nissan, A., Stojadinovic, A., Mitrani-Rosenbaum, S., Halle, D., Grinbaum, R., Roistacher, M., 31 Bochem, A., Dayanc, B.E., Ritter, G., Gomceli, I. et al. (2012) Colon cancer associated transcript- 32 1: a novel RNA expressed in malignant and pre-malignant human tissues. International journal of 33 cancer. Journal international du cancer, 130, 1598-1606. 34 42. Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T. and Salzberg, S.L. (2015) 35 StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature 36 biotechnology, 33, 290-295. 37 43. You, B.H., Yoon, S.H. and Nam, J.W. (2017) High-confidence coding and noncoding 38 transcriptome maps. Genome research, 27, 1050-1062. 39 44. Marques, A.C., Hughes, J., Graham, B., Kowalczyk, M.S., Higgs, D.R. and Ponting, C.P. (2013) 40 Chromatin signatures at transcriptional start sites separate two equally populated yet distinct 41 classes of intergenic long noncoding RNAs. Genome biology, 14, R131. 42 45. Welter, D., MacArthur, J., Morales, J., Burdett, T., Hall, P., Junkins, H., Klemm, A., Flicek, P., 43 Manolio, T., Hindorff, L. et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait 44 associations. Nucleic acids research, 42, D1001-1006. 45 46. Arking, D.E., Pulit, S.L., Crotti, L., van der Harst, P., Munroe, P.B., Koopmann, T.T., Sotoodehnia, 46 N., Rossin, E.J., Morley, M., Wang, X. et al. (2014) Genetic association study of QT interval 47 highlights role for calcium signaling pathways in myocardial repolarization. Nature genetics, 46, 48 826-836.

30

bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 47. Alam, T., Medvedeva, Y.A., Jia, H., Brown, J.B., Lipovich, L. and Bajic, V.B. (2014) Promoter 2 analysis reveals globally differential regulation of human long non-coding RNA and protein- 3 coding genes. PloS one, 9, e109443. 4 48. Mele, M., Mattioli, K., Mallard, W., Shechner, D.M., Gerhardinger, C. and Rinn, J.L. (2016) 5 Chromatin environment, transcriptional regulation, and splicing distinguish lincRNAs and 6 mRNAs. Genome research. 7 49. Consortium, E. (2012) An integrated encyclopedia of DNA elements in the human genome. 8 Nature, 489, 57-74. 9 50. Mackowiak, S.D., Zauber, H., Bielow, C., Thiel, D., Kutz, K., Calviello, L., Mastrobuoni, G., 10 Rajewsky, N., Kempa, S., Selbach, M. et al. (2015) Extensive identification and analysis of 11 conserved small ORFs in animals. Genome biology, 16, 179. 12 51. Bazzini, A.A., Johnstone, T.G., Christiano, R., Mackowiak, S.D., Obermayer, B., Fleming, E.S., 13 Vejnar, C.E., Lee, M.T., Rajewsky, N., Walther, T.C. et al. (2014) Identification of small ORFs in 14 vertebrates using ribosome footprinting and evolutionary conservation. The EMBO journal, 33, 15 981-993. 16 52. Wang, L., Park, H.J., Dasari, S., Wang, S., Kocher, J.P. and Li, W. (2013) CPAT: Coding-Potential 17 Assessment Tool using an alignment-free logistic regression model. Nucleic acids research, 41, 18 e74. 19 53. Lin, M.F., Jungreis, I. and Kellis, M. (2011) PhyloCSF: a comparative genomics method to 20 distinguish protein coding and non-coding regions. Bioinformatics, 27, i275-282. 21 54. Sauvageau, M., Goff, L.A., Lodato, S., Bonev, B., Groff, A.F., Gerhardinger, C., Sanchez-Gomez, 22 D.B., Hacisuleyman, E., Li, E., Spence, M. et al. (2013) Multiple knockout mouse models reveal 23 lincRNAs are required for life and brain development. eLife, 2, e01749. 24 55. Wan, X., Huang, W., Yang, S., Zhang, Y., Pu, H., Fu, F., Huang, Y., Wu, H., Li, T. and Li, Y. (2016) 25 Identification of androgen-responsive lncRNAs as diagnostic and prognostic markers for prostate 26 cancer. Oncotarget. 27 56. Ruiz-Orera, J., Messeguer, X., Subirana, J.A. and Alba, M.M. (2014) Long non-coding RNAs as a 28 source of new peptides. eLife, 3, e03523. 29 57. I Jungreis, M Lin and Kellis, M. CodAlignView: a tool for visualizing protein-coding constraint. In 30 Preparation. 31 58. Letunic, I., Doerks, T. and Bork, P. (2015) SMART: recent updates, new developments and status 32 in 2015. Nucleic acids research, 43, D257-260. 33

34

31

Figure 1 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

A B

Probe design Feature Human Mouse Target features lincRNAs 5,953 1,920 Annotated ODE PipeR pred. 0 2,271

NC MicroRNAs 786 494 snRNAs 838 721 annotation GE Unannotated snoRNAs 401 850 1kb Enhancers 954 203 UCE 158 156 TOTAL 9,090 6,615 Novel Novel region Prot. coding 124 112 ERCC 42 42 Controls Ecoli 100 100

cDNA library Capture TOTAL nts 15,541,153 8,307,739

C Brain

PacBio HiSeq Heart Liver Testis CAGE Validated novel sj Novel exon HeLa Embryo d7 PacBio K562 Embryo d15 HiSeq Figure 2 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Human (1kb−1.5kb) Human (1.5kb−2.5kb) Human (>2.5kb) 250000 250000 250000 A Human Mouse B N = 637,164 N = 706,086 N = 861,808 200000 200000 200000 Median = 1,207 Median = 1,504 Median = 1,120 64 SMRT Cells 66 150000 150000 150000 100000 100000 100000 #ROI Reads 2,397,192 ROI Reads 2,137,176 #ROI Reads #ROI Reads 50000 50000 50000 2,383,470 Mapped 2,132,978 0 0 0 0 0 0 1000 2000 3000 4000 2,053,424 Uniquely mapped 1,870,681 1000 2000 3000 4000 1000 2000 3000 4000 1,627,322 Demultiplexed 1,509,374 Mouse (1kb−1.5kb) Mouse (1.5kb−2.5kb) Mouse (>2.5kb) 250000 250000 250000 N = 601,223 N = 772,259 N = 570,362 200000 200000 200000 Median = 1,328 Median = 1,640 Median = 999 C Human 150000 150000 150000 100000 100000 100000 #ROI Reads #ROI Reads #ROI Reads 100 50000 50000 50000 0 0 0

miRNA 0 0 lncRNA 0 1000 2000 3000 4000 1000 2000 3000 4000 snoRNA 1000 2000 3000 4000 75 ERCC enhancer Length (nt) misc_RNA protein coding snRNA prot. coding D UCE P neg. region Pre−Capture ost−Capture 50 rRNA protein coding 50 % reads 18.6 fold 11.0 fold 25 lncRNA lncRNA 29.7% pseudogene snoRNA snoRNA misc_RNA pseudogene misc_RNA 25 snRNA snRNA rRNA rRNA 16.5% 0 Other Other Pre-Capture Post-Capture Intergenic On-Target Off-Target 1.6% 1.5% % Reads on target 0 Human Mouse E Figure 3 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

A Scale 100 kb hg38 chr3: 70,000,000 70,050,000 70,100,000 70,150,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) LINC01212 (SAMMSON) RT-PCR Validation

GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged all

hs splice site scores B Novel vs Annotated Splice Junctions (hs) C (CLS vs GENCODE) Acceptor Donor

0.25 lncRNA 0.20 36,839 0.15 0.10 0.05 40,000 26,715 lncRNAonTarget 0.00 Protein− coding 0.2 0.1

20,000 density 0.0 8,834 8,793 0.15 Random

# Splice Junctions 0.10 7,825 7,866 0.05 0 0.00 1 PacBio 1 PacBio + 1 HiSeq −30 −20 −10 0 10−15 −10 −5 0 5 10 Minimum required read support Splice site score Only in GENCODE Common Only in CLS Novel Known HiSeq−supported HiSeq−unsupported D Novel Known Simulated PacBio read depth vs HiSeq−unsupported HiSeq−supported novel splice junction discovery (hs) Brain Heart HeLa Total genes Orphans Potential precursors Precursors 12,500 10,000 7,500 7,500 E 7,500 5,000 5,000 Human Mouse 5,000 2,500 2,500 2,500 35 147 0 0 0 175 285 miRNA 907 647 2,794 2,035 100,000 200,000 100,000 200,000 100,000 200,000 K562 Liver Testes 19 57 53 144 8,000 6,000 snoRNA 40,000 112 872 6,000 30,000 # Novel splice junctions # Novel 4,000 978 1,530 4,000 20,000 55 56 2,000 2,000 10,000 224 171 0 0 0 snRNA 1,246 1,101 1,912 1,383

100,000 200,000 100,000 100,000 200,000 300,000 400,000 Simulated read depth 0 1000 2000 3000 0 1000 2000 3000 Number of genes Number of genes

SJ set HiSeq−supported HiSeq−unsupported All Figure 4 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available A under a CC-BY 4.0 International license.

C Merging efficiency (across tissues) (Human) (FANTOM) (CLS) 800,000 771,585 B Standard merging Anchored merging 600,000

400,000 Read # objects

alignments 200,000 179,993 117,258 1) Anchor supported ends Merge 0 2) Merge Input Merged transcripts Merged transcripts (raw reads) (ends anchored) (no anchoring) TMs Merged

D hg38 20 kb 127,225,000 127,220,000 127,215,000 127,210,000 127,205,000 127,200,000 127,195,000 127,190,000 127,185,000 127,180,000 127,175,000 127,170,000 127,165,000 :chr8 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) CCAT1 CASC19 RT-PCR Validation

GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported

FANTOM5 CAGE true TSSs

Full Length (FL) Novel 731 65,000 22,000 15,587 1,916 E Human 85,438 60,000 20,000 150,000 55,000 18,000 8,494 50,000 16,000 45,000 14,000 40,000 100,000 12,000 35,000 65,736 30,000 10,000 11,076 25,000 8,000 50,000 20,000 6,000 15,000 4,000 10,000 2,000 13,232 5,000 0 0 Merged transcript models 0 Liver K562 Heart HeLa Brain Testes All protein−coding Monoexonic Overlap lncRNA w/CAGE only w/polyA only Exact Intergenic multi−biotype w/CAGE+polyA no CAGE/no polyA Inclusion Extension other

0.6 Transcripts: 5000 n= 3248 F G 0.5 n= 2193 n = 5953 probed loci n= 9425 n= 3109 4177 0.4 Merged TSSs: 4000 n= 1968 n= 412 3574 n= 7081 0.3 n= 1912 TMs FL TMs 0.2 3000 Novel TMs Novel FL TMs 0.1 Fraction of sites Fraction

2000 on a DHS hotspot −10000 −5000 0 5000 10000

1110 Position around TSS 1000 947 # lncRNA loci (nucleotides) 0 CLS (CAGE−) GENCODE lncRNA protein−coding CLS (CAGE+) GENCODE >0 >1 >2 >3 >4 >5 >6 >7 >8 >9 lncRNA lncRNA # CLS transcript models (TMs) per locus

H I StringTie J GENCODE StringTie 526 2,550 0.0015 Sample lncRNA sizes: Medians: CLS n= 24075 1159 1,894 lncRNA 1,891 0.0010 n= 4763 1152 641 (FL) n= 8577 606 CLS n= 13930 1279 5444 9447 2,607 lncRNA 9,241 density n= 116 1400.5 (all) 0.0005

GENCODE 828 1,650 lncRNA 4,451 0.0000 GENCODE 1051 2452 protein− 0 1000 2000 3000 4000 32,059 coding 27,003 (confident) Spliced length (nt) 17708 27,394 Control 30,519 CLS StringTie CLS_FL StringTie_FL 0% 0% 100% 75% 50% 25% 25% 50% 75% 100% % CAGE(+) 5' ends % PAS(+) 3' ends CLS GENCODE Figure 5 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Figure 6 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

A

B C

KANTR

D 20 kb hg38 chrX:53,095,000 53,100,000 53,105,000 53,110,000 53,115,000 53,120,000 53,125,000 53,130,000 53,135,000 53,140,000 53,145,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76)

KANTR

RT-PCR Validation

GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported

FANTOM5 CAGE true TSSs

100 vertebrates conservation by PhastCons

GAC No Change GATGAT Synonymous 76aa GAAAAGAAG Conservative GGGGGG Radical TAATAA Ochre Stop Codon TTAGAG Amber Stop Codon TGATGA Opal Stop Codon ATG In-frame ATG GA-- Indel GAC Frame-shifted Transmembrane region Transmembrane region Low complexity LINE1 L1ME3F Supplementary Figure S01 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Supplementary Figure S02 C 26 nt 26 nt bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was10 106 not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is madeAAAAA available under aCC-BY 4.0 International license. TTTTT - SMART CDS III/3' PCR Primer - TruSeq Indexed adapter A B (3’ adapter) - Barcode - Transcript amplicon - TruSeq Universal adapter - SMART IV oligonucleotide - SMRTbell adapter (5’ adapter) D

E F Pre−Capture Post−Capture G Pre−Capture Post−Capture 50 60 15.9 fold15.9 fold 14.3 fold14.3 fold 19.8 fold19.8 fold 19.7 fold19.7 fold 14.8 fold14.8 fold 14.8 fold14.8 fold

49.4% 6.25 fold6.25 fold 7.17 fold7.17 fold 19.6 fold19.6 fold 3.3 fold3.3 fold

40 35%

25 24.4% 20.8% 19.6% 23.7% 20 20.1% 19.3% 12.2% % Reads on target % Reads on target 9.9%

3.9% 2.9% 3.7% 2.2% 2.5% 1.6% 1% 1.4% 0.5% 1.3% 0 0 Brain Heart Liver Testes Brain Heart Liver Testes HeLa K562

100 miRNA lncRNA pipeR snoRNA J ERCC protein coding H I enhancer misc_RNA 75 snRNA pseudogene UCE protein coding negative region

50 protein coding % reads

25 pseudogene pseudogene Other lncRNA snoRNA lncRNA snoRNA Other misc_RNA misc_RNA snRNA snRNA 0 rRNA rRNA K Pre-Capture Post-Capture M Brain Heart HeLa K562 Liver Testes Undeter 0.0015

0.0010 hs

0.0005

0.0000 # Errors per mapped base

hiSeq hiSeq hiSeq hiSeq hiSeq hiSeq hiSeq pacBio pacBio pacBio pacBio pacBio pacBio pacBio Platform L Brain E15 E7 Heart Liver Testes Undeter 0.003

0.002 mm

0.001

0.000 # Errors per mapped base

hiSeq hiSeq hiSeq hiSeq hiSeq hiSeq hiSeq pacBio pacBio pacBio pacBio pacBio pacBio pacBio Platform Error class deletions insertions mismatches Supplementary Figure S03 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

CASC15 / ENSG00000272168

Scale 200 kb hg38 chr6: 21,700,000 21,750,000 21,800,000 21,850,000 21,900,000 21,950,000 22,000,000 22,050,000 22,100,000 22,150,000 22,200,000 22,250,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) CASC15 RN7SKP240 CASC14 RP11-524C21.2 CASC15 CASC15 CASC15 CASC15 CASC15 FANTOM5 DPI peak, robust set TSS peaks GENCODE Capture Cap1, capture probes Cap1 probes GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported compmerge.994.pooled.chr6 compmerge.991.pooled.chr6 compmerge.990.pooled.chr6 compmerge.847.pooled.chr6 compmerge.833.pooled.chr6 compmerge.842.pooled.chr6 compmerge.830.pooled.chr6 compmerge.845.pooled.chr6 compmerge.831.pooled.chr6 compmerge.832.pooled.chr6 compmerge.843.pooled.chr6 compmerge.844.pooled.chr6 compmerge.838.pooled.chr6 compmerge.828.pooled.chr6 compmerge.841.pooled.chr6 compmerge.824.pooled.chr6 compmerge.825.pooled.chr6 compmerge.823.pooled.chr6 compmerge.821.pooled.chr6 compmerge.822.pooled.chr6 compmerge.818.pooled.chr6 compmerge.807.pooled.chr6 compmerge.804.pooled.chr6 compmerge.806.pooled.chr6 compmerge.803.pooled.chr6 compmerge.799.pooled.chr6

GAS5 / ENSG00000234741

Scale 2 kb hg38 chr1: 173,864,000 173,864,500 173,865,000 173,865,500 173,866,000 173,866,500 173,867,000 173,867,500 173,868,000 173,868,500 173,869,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) GAS5-AS1 GAS5 ZBTB37 GAS5 ZBTB37 GAS5 GAS5 ZBTB37 GAS5 ZBTB37 GAS5 ZBTB37 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 GAS5 FANTOM5 DPI peak, robust set TSS peaks GENCODE Capture Cap1, capture probes Cap1 probes GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported compmerge.10676.pooled.chr1 compmerge.10677.pooled.chr1 compmerge.10696.pooled.chr1 compmerge.10698.pooled.chr1 compmerge.10699.pooled.chr1 compmerge.10700.pooled.chr1 compmerge.10701.pooled.chr1 compmerge.10702.pooled.chr1 compmerge.10703.pooled.chr1 compmerge.10704.pooled.chr1 compmerge.10705.pooled.chr1 compmerge.10706.pooled.chr1 compmerge.10707.pooled.chr1 compmerge.10708.pooled.chr1 compmerge.10709.pooled.chr1 compmerge.10710.pooled.chr1 compmerge.10711.pooled.chr1 compmerge.10712.pooled.chr1 compmerge.10697.pooled.chr1 Supplementary Figure S04 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

LINC01133 / ENSG00000224259

Scale 5 kb hg38 chr1: 159,965,000 159,970,000 159,975,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) LINC01133 LINC01133 FANTOM5 DPI peak, robust set TSS peaks GENCODE Capture Cap1, capture probes Cap1 probes GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported compmerge.5371.pooled.chr1 compmerge.5368.pooled.chr1 compmerge.5369.pooled.chr1 compmerge.5370.pooled.chr1 compmerge.5367.pooled.chr1 compmerge.5366.pooled.chr1 compmerge.5365.pooled.chr1 compmerge.5363.pooled.chr1 compmerge.5364.pooled.chr1 compmerge.5362.pooled.chr1 compmerge.5360.pooled.chr1 compmerge.5358.pooled.chr1 compmerge.5359.pooled.chr1 Supplementary Figure S05 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was UCA1not / certified ENSG00000214049 by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Scale 5 kb hg38 chr19: 15,827,000 15,828,000 15,829,000 15,830,000 15,831,000 15,832,000 15,833,000 15,834,000 15,835,000 15,836,000 15,837,000 15,838,000 Basic Gene Annotation Set from GENCODE Version 20 (Ensembl 76) AC004510.3 UCA1 UCA1 GENCODE Capture Cap1, capture probes Cap1 probes FANTOM5 DPI peak, robust set TSS peaks GENCODE Capture Cap1, hsAll, tissue: all, merged cagepolyASupported compmerge.746.pooled.chr19 compmerge.692.pooled.chr19 compmerge.684.pooled.chr19 compmerge.716.pooled.chr19 compmerge.714.pooled.chr19 compmerge.739.pooled.chr19 compmerge.743.pooled.chr19 compmerge.745.pooled.chr19 compmerge.708.pooled.chr19 compmerge.685.pooled.chr19 compmerge.696.pooled.chr19 compmerge.717.pooled.chr19 compmerge.678.pooled.chr19 compmerge.679.pooled.chr19 compmerge.680.pooled.chr19 compmerge.681.pooled.chr19 compmerge.686.pooled.chr19 compmerge.687.pooled.chr19 compmerge.694.pooled.chr19 compmerge.695.pooled.chr19 compmerge.700.pooled.chr19 compmerge.702.pooled.chr19 compmerge.704.pooled.chr19 compmerge.705.pooled.chr19 compmerge.707.pooled.chr19 compmerge.712.pooled.chr19 compmerge.713.pooled.chr19 compmerge.718.pooled.chr19 compmerge.722.pooled.chr19 compmerge.723.pooled.chr19 compmerge.724.pooled.chr19 compmerge.725.pooled.chr19 compmerge.735.pooled.chr19 compmerge.737.pooled.chr19 compmerge.738.pooled.chr19 compmerge.747.pooled.chr19 compmerge.751.pooled.chr19 compmerge.753.pooled.chr19 compmerge.754.pooled.chr19 compmerge.757.pooled.chr19 compmerge.760.pooled.chr19 compmerge.690.pooled.chr19 compmerge.691.pooled.chr19 compmerge.732.pooled.chr19 compmerge.736.pooled.chr19 compmerge.744.pooled.chr19 compmerge.682.pooled.chr19 compmerge.698.pooled.chr19 compmerge.726.pooled.chr19 compmerge.733.pooled.chr19 compmerge.752.pooled.chr19 compmerge.756.pooled.chr19 compmerge.688.pooled.chr19 compmerge.689.pooled.chr19 compmerge.709.pooled.chr19 compmerge.721.pooled.chr19 compmerge.749.pooled.chr19 compmerge.759.pooled.chr19 compmerge.763.pooled.chr19 compmerge.699.pooled.chr19 compmerge.706.pooled.chr19 compmerge.710.pooled.chr19 compmerge.711.pooled.chr19 compmerge.720.pooled.chr19 compmerge.741.pooled.chr19 compmerge.748.pooled.chr19 compmerge.750.pooled.chr19 compmerge.762.pooled.chr19 compmerge.693.pooled.chr19 compmerge.715.pooled.chr19 compmerge.719.pooled.chr19 compmerge.727.pooled.chr19 compmerge.729.pooled.chr19 compmerge.740.pooled.chr19 compmerge.758.pooled.chr19 compmerge.761.pooled.chr19 compmerge.677.pooled.chr19 compmerge.697.pooled.chr19 compmerge.701.pooled.chr19 compmerge.703.pooled.chr19 compmerge.728.pooled.chr19 compmerge.730.pooled.chr19 compmerge.731.pooled.chr19 compmerge.742.pooled.chr19 Supplementary Figure S06 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxivB a licenseNo to displayvel vs the Annotated preprint in perpetuity. Splice It is Junctionsmade available (mm) A under aCC-BY 4.0 International license. Novel vs Annotated Novel vs Annotated (CLS vs GENCODE) 15,000 Exonic lncRNA Nucleotides (hs) Exonic lncRNA Nucleotides (mm) 8,847 6,272,863 1,625,341 4,000,000

6,042 lncRNAonTarget 10,000 10,000,000 3,000,000

1,150,477 Only in GENCODE 4,262,217 2,000,000 Common 5,000 5,000,000 Only in CLS 2,191 2,177 1,387,014

1,000,000 # Splice Junctions 3,037,587 3,263 3,277 # Projected genomic nucleotides # Projected genomic nucleotides 0 0 0 1 PacBio 1 PacBio + 1 HiSeq Minimum required read support CNovel vs Annotated Splice Junctions (hs) Only in GENCODE Common Only in CLS (CLS vs miTranscriptome) 32,478 D Novel vs Annotated Splice Junctions (Human) 60,000 22,829 50,000 31,837 lncRNAonTarget

40,000 40,000 13,195 12,679 30,000 ontarget 20,000 22,387 22,903

# Splice Junctions 20,000 10,093 0 10,000 1 PacBio 1 PacBio + 1 HiSeq # Splice Junctions 8,212 Minimum required read support 0 Only in miTranscriptome Common Only in CLS All Brain Heart HeLa K562 Liver Testes Undeter Novel vs Annotated Splice Junctions (hs) tissue (CLS vs miTranscriptome+GENCODE) Novel vs Annotated Splice Junctions (Mouse) 29,954 7,876 60,000 20,327 lncRNAonTarget

40,000 15,719 15,181 10,000 category ontarget Only in GENCODE 26,979 27,517 20,000 Common 2,587 # Splice Junctions 5,000 Only in CLS 0 3,338 # Splice Junctions 1 PacBio 1 PacBio + 1 HiSeq Minimum required read support 0 Only in miTranscriptome Common Only in CLS +GENCODE All Brain E15 E7 Heart Liver Testes Undeter tissue

Novel vs Annotated Splice Junctions (hs) Novel vs Annotated Splice Junctions (mm) 500

200 enhancer 400 enhancer E 150 F 300 100 200 50 100 0 0 CLS 20,000 lncRNA 60,000 15,000 lncRNA 40,000 10,000 5,000 20,000 0

0 microRNA 75 50 microRNA 50 40 30 25 20 0

10 multiBiotype 0 3,000 12,500 multiBiotype 2,000 10,000 1,000 7,500 5,000 0 6,000 2,500 nonExonic 0 4,000

4,000 nonExonic 2,000 category 3,000 0 Only in GENCODE 2,000 200 Common

1,000 piper 150 Only in CLS GENCODE 0 100 protein_coding # Splice Junctions # Splice Junctions 300,000 50 0

200,000 protein_coding 250,000 100,000 200,000 150,000 0 100,000 8 50,000

snoRNA 0 6 40 4 30 snoRNA 2 20 0 10 15 0 snRNA

10 7.5 snRNA 5 5.0 2.5 0 4 0.0 3 30 uce 2 20 uce 1 10 0 0 All Brain Heart HeLa K562 Liver Testes All Brain E15 E7 Heart Liver Testes tissue tissue Supplementary Figure S07 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Supplementary Figure S08 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. A Simulated PacBio read depth vs novel splice junction discovery (mm) Mouse Brain E15 E7

5,000 5,000 6,000 4,000 4,000 4,000 3,000 3,000 2,000 2,000 2,000 1,000 1,000 0 0 0

100,000 200,000 100,000 200,000 100,000 200,000 Heart Liver Testes

5,000 25,000 4,000 4,000 20,000 3,000 # Novel splice junctions # Novel 3,000 15,000 2,000 2,000 10,000 1,000 5,000 1,000 0 0 0

100,000 200,000 100,000 100,000 200,000 300,000 Simulated read depth

SJ set HiSeq−supported HiSeq−unsupported All

B C

Human Simulated read depth vs Simulated HiSeq read depth vs novel transcript discovery (hs) Human detected canonical SJs (hs) Brain Heart HeLa Brain Heart HeLa 10,000 8,000 8,000 200,000 150,000 7,500 6,000 120,000 6,000 150,000 100,000 5,000 4,000 80,000 4,000 100,000 50,000 2,500 2,000 2,000 50,000 40,000

5M 5M 5M 10M 15M 20M 25M 30M 35M 10M 15M 20M 25M 10M 15M 20M 100,000 200,000 100,000 200,000 100,000 200,000 K562 Liver Testes

# HiSeq SJs 125,000 K562 Liver Testes 400,000 40,000 5,000 120,000 100,000 300,000 6,000 4,000 30,000 75,000 80,000 200,000 # Novel transcript structures # Novel 50,000 3,000 4,000 20,000 100,000 40,000 25,000 2,000 2,000 10,000 5M 5M 5M 1,000 10M 15M 20M 25M 30M 10M 15M 20M 10M15M20M25M30M35M40M45M Simulated HiSeq read depth (# sampled reads)

Common with PacBio 100,000 200,000 100,000 100,000 200,000 300,000 400,000 All HiSeq SJ set (canonical and HiSeq supported PacBio SJs) Simulated read depth −

Simulated read depth vs Mouse novel transcript discovery (mm) Simulated HiSeq read depth vs Brain E15 E7 Mouse detected canonical SJs (mm)

6,000 4,000 Brain E15 E7 4,000 150,000 160,000 125,000 3,000 3,000 4,000 100,000 120,000 2,000 100,000 2,000 75,000 2,000 80,000 1,000 1,000 50,000 50,000 40,000 25,000

5M 5M 5M 10M 15M 20M 10M 15M 10M 15M 20M 25M 100,000 200,000 100,000 200,000 100,000 200,000 Heart Liver Testes Heart Liver Testes 120,000 # HiSeq SJs 4,000 20,000 200,000 90,000 90,000 3,000 3,000 15,000 150,000

# Novel transcript structures # Novel 60,000 60,000 2,000 2,000 10,000 100,000 30,000 30,000 50,000 1,000 1,000 5,000 5M 5M 5M 10M 15M 10M 15M 20M 10M 15M 20M 25M 30M Simulated HiSeq read depth (# sampled reads) 100,000 200,000 100,000 100,000 200,000 300,000 Common with PacBio Simulated read depth HiSeq SJ set All (canonical and HiSeq−supported PacBio SJs) Supplementary Figure S09 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 10 kbunder aCC-BY 4.0 International license. hg38 chr5: 149,410,000 149,415,000 149,420,000 149,425,000 149,430,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) MIR143HG MIR143 MIR145 MIR143HG MIR145 MIR143HG MIR143HG RP11-394O4.6 MIR143HG MIR143HG MIR143HG MIR143HG MIR143HG MIR143HG MIR143HG MIR143HG PCR Validation

GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported

FANTOM5 CAGE true TSSs Supplementary Figure S10 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available B under aCC-BY 4.0 International license. A

C

D

E Human Mouse Supplementary Figure S11 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has grantedMerging bioRxiv efficiency a license (across tissues) to display (mm) the preprint in perpetuity. It is made available A 600,000under aCC-BY604,199 4.0 International license.

400,000 (See Figure 4C) # objects 200,000 129,556 87,939

0 Input all (raw reads) tissue Merged transcripts (ends anchored) Merging efficiency (within tissues) (hs) Merging efficiency (within tissues) (mm) Merged transcripts (no anchoring) 216,263 175,430 200,000 150,000

150,000

123,533 100,000 118,256 115,477 92,611 93,338 100,000 89,218 101,661 96,395 81,322 72,280

# objects 83,740 # objects 62,162 50,000 54,847 50,000 40,268 43,553 34,507 32,326 35,810 35,823 26,790 25,219 24,826 25,783 27,088 24,179 22,99418,763 18,893 22,700 18,813 19,800 17,981 17,576 14,020 0 0 Brain Heart HeLa K562 Liver Testes Brain E15 E7 Heart Liver Testes tissue tissue B Full Length (FL) Novel C

654 n = 1920 probed loci 45,000 482 10,376 600 619 561 12,000 66,102 40,000 3,168 TMs FL TMs Novel TMs Novel FL TMs 100,000 35,000 10,000 400 30,000 8,000 8,973 234 200 192 25,000 # lncRNA loci 6,000 20,000 50,000 44,673 0 15,000 4,000

10,000 >0 >1 >2 >3 >4 >5 >6 >7 >8 >9 2,000 # CLS transcript models (TMs) per locus 8,405 5,000 0 0 Merged transcript models 0 Liver Heart E15 E7 Brain Testes All protein−coding Monoexonic Overlap lncRNA w/CAGE only w/polyA only Exact Intergenic multi−biotype w/CAGE+polyA no CAGE/no polyA Inclusion Extension other D E H I

Human, full-length reads: P=0.0014, Fisher two-tail Mouse, full-length reads: P=NS, Fisher two-tail 74 67 enhancer 600 enhancer 559 300 289 Conserved Not_conserved Conserved Not_conserved 400 200 Detected 34 1076 Detected 18 318 200 100 0 5 3 0 01 Not_detected 23 1753 Not_detected 45 859 15,000 1,398 7,851 lncRNA 4,959 40,000 26,091 lncRNA 10,000 30,000 5,000 5,329 20,000 13,071 10,000 0 1,002 0 2,881 250 21 microRNA Human, all reads: P=NS, Fisher two-tail Mouse, all reads: P=NS, Fisher two-tail 200 199

150 20 microRNA 127 150 Conserved Not_conserved Conserved Not_conserved 100 100 50 25 Detected 78 4101 Detected 39 981 50 0 1 20 18 Not_detected 23 1753 Not_detected 45 859 0 12 15 miRNA 20 4 10 18 miRNA 15 5 10 0 2 multiBiotype 3,000 295 5 1,244

● 0 ● ● ● 2,000 ● ● ● multiBiotype ● Human ● Mouse 897 1,258 ● ● ● F ● G 3,110 ● ● 7,500 1,000 ● ● ●●● full-length reads ● full-length reads 278 ● ● 5,000 0 ● ● 3,742 ● ●

● nonExonic ● ● ● 20,000 2,260 ● ● 2,500 17,459 ● ● ● ● ● 867 15,000 30 ● 30 ● ● ● ● 0 ●●● ● 10,000 ● P=0.0019 (Wilcox test) ● P=0.7 (Wilcox test) ● ● 2,739 nonExonic ● ● 15,367 5,000 ●●● ● 15,000 623 ● ● 0 127 ● ● ● ● ● 10,000 ● ● ● 51 20 ● 20 ● 400 ● ● ● 369 ● ● 5,000 piper ● ● 300 # merged CLS reads ● ● # merged CLS reads 548 ● ● ● 0 97 200 ● ● protein_coding ● ● ●●● ● 100,000 6,555 100 ● ● 38,690 ● ● ● 9 4 protein_coding 10 ●●● 10 ● 0 ● ● 75,000 ●●● ● ● ●●● ● ● 47,672 6,146 ● ●●● ●● 50,000 75,000 37,821 ● ● ● ● 2 ● ● ● ● 25,000 50,000 0● 57 0● 0 9,239 37,244 2829● 1177● 63 0 ● 0 ● 0 25,000 ●●● all reads ● all reads ● ● 5 ●●● ● ● 6,966 65 snoRNA 0 ● ● 60 ● ● ● ● ●● 26 snoRNA ● ● 40 30 ●●● 30 ● 100 96 ●●● ● P=0.9 (Wilcox test) ● P=0.055 (Wilcox test) ● ● ● 20 ● ●●● ● ● 50 Number of Reads ●●● Number of Reads ● ● 0 1 ● ● ● ● 9 ● 22 0 20 20 ● 99 snRNA 8 100 60 64 snRNA 50 40 20 10 10 0 16 0 1 14 3 100 6 2 90

101 1 uce 75 5854 0 10 uce 1840 84 50 0 0 5 0 1 0 1 25 Orthologue Status Orthologue Status 0 0 Liver K562 Heart HeLa Brain Testes All Liver Heart E15 E7 Brain Testes All

w/CAGE only w/polyA only w/CAGE only w/polyA only w/CAGE+polyA no CAGE/no polyA w/CAGE+polyA no CAGE/no polyA J K Detection by RNAseq, cutoff 1 FPKM LncRNA PipeR Protein_coding

0.32 0.3

0.21 0.21 0.2

0.15 0.14 0.11 0.1 0.1

0.06 0.04 0.01 0.01 0 0.0 Fraction of transcripts detected Brain Heart Liver Testis Organ Supplementary Figure S12 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Supplementary Figure S13 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available A CLS FL lncRNA CLS protein−coding GENCODE lncRNA underB aCC-BYCLS 4.0 FL lncRNAInternational CLS protein license−coding. GENCODE lncRNA n= 777 n= 777 mean= 3.54 n= 37909 6000 0.0020 n= 37909 mean= 6.98 n= 4893 n= 4893 mean= 4.24

0.0015

Median: 715 4000 Median: 1067 Median: 1320 0.0010 Density 0.0005 2000

0.0000 Number of models 0 0 1000 2000 3000 0 5 10 15 Spliced length, nt Number of exons C

D Scale 1 kb hg38 chr16: 14,301,500 14,302,000 14,302,500 14,303,000 Comprehensive Gene Annotation Set from GENCODE Version 20 (Ensembl 76) RP11-65J21.3 RP11-65J21.3 FANTOM5 DPI peak, robust set TSS peaks Simple Nucleotide Polymorphisms (dbSNP 147) Flagged by dbSNP as Clinically Assoc NHGRI-EBI Catalog of Published Genome-Wide Association Studies rs246185 rs246185 rs246185 GENCODE Capture Cap1, hsAll, tissue: all, anchor-merged cagepolyASupported compmerge.497.pooled.chr16 compmerge.496.pooled.chr16

E F G Supplementary Figure S14 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Supplementary Figure S15 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. LincRNA Supplementary Figure S16 C bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright Nholder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the4 preprint in perpetuity. It is made available A under aCC-BY 4.0 International licenseB . 3 2 1 mRNA

Predicted open reading frame C D K562: Cytoplasmic−Nuclear Ratio K562: Whole Cell 10 2 5 0.27 0.1 1.08 0 −0.68 0 19647 −1 −0.68 0 −1.07 1340 116 1998 17460 1237 −5 E −2 94 1611 −10 Log10 Whole Cell FPKM

Log2 Cytoplasm/Nucleus FPKM 1 2 3 4 1 2 3 4

F G Supplementary Figure S17 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Mapped Illumina RNAseq reads

All lncRNAs Filtered lncRNAs Human

Top 20 human lncRNAs: 71% of RNAseq reads

Top 20 mouse Mouse lncRNAs: 79% of RNAseq reads Supplementary Figure S18 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Supplementary Figure S19 bioRxiv preprint doi: https://doi.org/10.1101/105064; this version posted September 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.